Chapter 7: Other Request Phases
The Child Initialization and Exit Phases
The Post Read Request Phase
The URI Translation Phase
The Header Parser Phase
- Implementing an Unsupported HTTP Method
Customizing the Type Checker Phase
- A DBI-Based Type Checker
Customizing the Fixup Phase
The Logging Phase
- Send E-Mail When a Page is Hit
- A DBI Database Logger
Registered Cleanups
Handling Proxy Requests
Perl Server-Side Includes
Subclassing the Apache Class

Chapter 7: Other Request Phases

The previous chapters have taken you on a wide-ranging tour of the most popular and useful areas of the Apache API. But we're not done yet! The Apache API allows you to customize URI translation, logging, the handling of proxy transactions, and the manner in which HTTP headers are parsed. There's even a way to incorporate snippets of Perl code directly into HTML pages that use server-side includes.

We've already shown you how to customize the response, authentication, authorization, and access control phases of the Apache request cycle. Now we'll fill in the cracks. At the end of the chapter, we show you the Perl server-side include system, and demonstrate a technique for extending the Apache Perl API by subclassing the Apache request object itself.

The Child Initialization and Exit Phases

Apache provides hooks into child process initialization and exit handling. The child process initialization handler, installed with PerlChildInitHandler, is called just after the main server forks off a child but before the child has processed any incoming requests. The child exit handler, installed with PerlChildExitHandler, is called just before the child process is destroyed.

You might need to install handlers for these phases in order to perform some sort of module initialization that won't survive a fork. For example, the Apache::DBI module has a child init handler that initializes a cache of per-child database connections, and the Apache::Resource module steps in during this phase to set up resource limits on the child processes. The latter is configured in this way:

 PerlChildInitHandler Apache::Resource

As with other handlers, you can install a child init handler programmatically using Apache::push_handlers(). However, because the child init phase comes so early, the only practical place to do this is from within the parent process, in a Perl startup file configured with a PerlModule or PerlRequire directive. For example, here's how to install an anonymous subroutine that will execute during child initialization to choose a truly random seed value for Perl's random number generator:

 use Math::TrulyRandom ();
 Apache->push_handlers(PerlChildInitHandler => sub {
 srand Math::TrulyRandom::truly_random_value();
 });

Install this piece of code in the Perl startup file. By changing the value of the random number seed on a per-child basis, it ensures that each child process produces a different sequence of random numbers when the built-in rand() function is called.
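To see why per-child seeding matters, the effect can be reproduced outside Apache. The following is a minimal sketch in plain Perl, not mod_perl code: the first_draws() helper is hypothetical, and srand($$ ^ time) is only a stand-in for the Math::TrulyRandom seed used above. Each forked child inherits the parent's random number state, so without a reseed every child draws the same first value.

```perl
use strict;
use warnings;
use POSIX ();

# Fork $count children and collect the first rand() draw from each.
# When $reseed is true, each child reseeds itself first, which is the
# job the child init handler performs in the text.
sub first_draws {
    my ($count, $reseed) = @_;
    srand(42);    # the parent seeds once, before any fork
    my @draws;
    for my $n (1 .. $count) {
        pipe(my $reader, my $writer) or die "pipe failed: $!";
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {
            # Child process: optionally reseed, report one draw, exit.
            close $reader;
            srand($$ ^ time) if $reseed;    # hypothetical stand-in seed
            print {$writer} int(rand(1_000_000)), "\n";
            close $writer;
            POSIX::_exit(0);    # leave without running END blocks
        }
        # Parent process: read the child's draw and reap the child.
        close $writer;
        my $value = <$reader>;
        close $reader;
        waitpid($pid, 0);
        chomp $value;
        push @draws, $value;
    }
    return @draws;
}

my @inherited = first_draws(3, 0);
my @reseeded  = first_draws(3, 1);
print "inherited seed: @inherited\n";
print "reseeded:       @reseeded\n";
```

The first line of output repeats a single number three times. The second line will normally show three different numbers, because each child has a different process ID.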

The child exit phase complements the child initialization phase. Child processes may exit for various reasons: the MaxRequestsPerChild limit may have been reached, the parent server may have been shut down, or a fatal error may have occurred. This phase gives modules a chance to tidy up after themselves before the process exits.

The most straightforward way to install a child exit handler is with the explicit PerlChildExitHandler directive, as in:

 PerlChildExitHandler Apache::Guillotine

During the child exit phase, mod_perl invokes the Perl API function perl_destruct()* to run the contents of END blocks and to invoke the DESTROY method for any global objects that have not already gone out of scope. Refer to the Chapter 9 section Special Global Variables, Subroutines and Literals for details.

Note: neither child initialization nor exit hooks are available on Win32 platforms, because the Win32 port of Apache uses a single process.

footnote: *perl_destruct() is an internal Perl subroutine that is normally called just once by the Perl executable after a script is run.

The Post Read Request Phase

When a listening server receives an incoming request, it reads the HTTP request line and parses any HTTP headers sent along with it. Provided that what's been read is valid HTTP, Apache gives modules an early chance to step in during the post_read_request phase, known to the Perl API world as the PerlPostReadRequestHandler. This is the very first callback that Apache makes when serving an HTTP request, and it happens even before URI translation turns the requested URI into a physical pathname.

The post_read_request phase is a handy place to initialize per-request data that will be available to other handlers in the request chain. Because of its usefulness as an initialization routine, mod_perl provides the directive PerlInitHandler as a more readable alias to PerlPostReadRequestHandler.

Since the post_read_request phase happens before URI translation, PerlPostReadRequestHandler cannot appear in <Location>, <Directory> or <Files> sections. However the PerlInitHandler directive is actually a bit special. When it appears outside a directory section, it acts as an alias for PerlPostReadRequestHandler as just described. However, when it appears within a directory section, it acts as an alias for PerlHeaderParserHandler (discussed later in this chapter), allowing for per-directory initialization. In other words, wherever you put PerlInitHandler, it will act the way you expect.

Several optional Apache modules install handlers for the post_read_request phase. For example, the mod_unique_id module steps in here to create the UNIQUE_ID environment variable. When the module is activated, this variable is unique to each request over an extended period of time, and so is useful for logging and the generation of session IDs (see Chapter 5). Perl scripts can get at the value of this variable by reading $ENV{UNIQUE_ID}, or by calling $r->subprocess_env('UNIQUE_ID').

mod_setenvif also steps in during this phase to allow you to set environment variables based on the incoming client headers. For example, this directive will set the environment variable LOCAL_REFERRAL to true if the Referer header matches a certain regular expression:

 SetEnvIf Referer \.acme\.com LOCAL_REFERRAL

mod_perl itself uses the post_read_request phase to process the PerlPassEnv and PerlSetEnv directives, allowing environment variables to be passed to modules that execute early in the request cycle. The built-in Apache equivalents, PassEnv and SetEnv, don't get processed until the fixup phase, which may be too late. The Apache::StatINC module, which watches .pm files for changes and reloads them if necessary, is also usually installed into this phase:

 PerlPostReadRequestHandler Apache::StatINC
 # or, equivalently and easier to type:
 PerlInitHandler Apache::StatINC

The URI Translation Phase

One of the Web's virtues is its Uniform Resource Identifier (URI) and Uniform Resource Locator (URL) standards.* End users never know for sure what is sitting behind a URI. It could be a static file, a dynamic script, a proxied request, or something even more esoteric. The file or program behind a URI may change over time, but this too is transparent to the end user.

footnote: *Technically a URL is a fully qualified Web location, such as http://www.yahoo.com/pets/animals/ferrets, while a URI is a more general term that encompasses partial paths (/pets/animals/ferrets) and other addressing schemes as well.

Much of Apache's power and flexibility comes from its highly configurable URI translation phase, which comes relatively early in the request cycle, after the post_read_request and before the header_parser phases. During this phase, the URI requested by the remote browser is translated into a physical filename, which may in turn be returned directly to the browser as a static document, or passed on to a CGI script or Apache API module for processing. During URI translation, each module that has declared its interest in handling this phase is given a chance to modify the URI. The first module to handle the phase (i.e. return something other than a status of DECLINED) terminates the phase. This prevents several URI translators from interfering with one another by trying to map the same URI onto several different file paths.
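The "first handler to answer wins" rule can be sketched in plain Perl. This is an illustration of the dispatch logic only, not Apache's implementation: the OK and DECLINED constants are local stand-ins for the Apache::Constants values, the request is an ordinary hash, and both translators and their paths are hypothetical.

```perl
use strict;
use warnings;

# Local stand-ins for the Apache::Constants status codes.
use constant OK       => 0;
use constant DECLINED => -1;

# Call each translator in turn; the first one that returns anything
# other than DECLINED ends the phase.
sub run_translation_phase {
    my ($request, @translators) = @_;
    for my $translator (@translators) {
        my $status = $translator->($request);
        return $status unless $status == DECLINED;
    }
    return DECLINED;
}

# Hypothetical translator: handles only URIs under /icons/.
sub alias_icons {
    my $request = shift;
    return DECLINED unless $request->{uri} =~ m{^/icons/(.*)$};
    $request->{filename} = "/usr/local/apache/icons/$1";
    return OK;
}

# Hypothetical fallback: prepend a document root to every URI.
sub document_root {
    my $request = shift;
    $request->{filename} = "/home/httpd/htdocs" . $request->{uri};
    return OK;
}

for my $uri ('/icons/folder.gif', '/index.html') {
    my $request = { uri => $uri };
    run_translation_phase($request, \&alias_icons, \&document_root);
    print "$uri => $request->{filename}\n";
}
```

The /icons/ request never reaches the fallback translator, while /index.html is declined by the first translator and picked up by the second.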

By default, two URI translation handlers are installed in stock Apache distributions. The mod_alias module looks for the existence of several directives that may apply to the current URI. These include Alias, ScriptAlias, Redirect, AliasMatch, and other directives. If it finds one, it uses the directive's value to map the URI to a file or directory somewhere on the server's physical file system. Otherwise, the request falls through to the http_core module (where the default response handler is also found). http_core simply appends the URI to the value of the DocumentRoot configuration directive, forming a file path relative to the document root.

The optional mod_rewrite module implements a much more comprehensive URI translator that allows you to slice and dice URIs in various interesting ways. It is extremely powerful, but uses a series of pattern matching conditions and substitution rules that can be difficult to get right.

Once a translation handler has done its work, Apache walks along the returned filename path in the manner described in Chapter 4, finding where the path part of the URI ends and the additional path information begins. This phase of processing is performed internally and cannot be modified by the module API.

In addition to their intended role in transforming URIs, translation handlers are sometimes used to associate certain types of URIs with specific upstream handlers. We'll see examples of this later in this chapter when we discuss creating custom proxy services.

A Very Simple Translation Handler

Let's look at an example. Many of the documents browsed on a web site are files that are located under the configured DocumentRoot. That is, the requested URI is a filename relative to a directory on the hard disk. Just so you can see how simple a translation handler's job can be, we present a Perl version of Apache's default translation handler found in the http_core module.

 package Apache::DefaultTrans;

 use Apache::Constants qw(:common BAD_REQUEST);
 use Apache::Log ();

 sub handler {
 my $r = shift;
 my $uri = $r->uri;

 if ($uri !~ m:^/: or index($uri, '*') >= 0) {
 $r->log->error("Invalid URI in request ", $r->the_request);
 return BAD_REQUEST;
 }

 $r->filename($r->document_root . $r->uri);

 return OK;
 }

 1;
 __END__

The handler begins by subjecting the requested URI to a few sanity checks, making sure that it begins with a slash and doesn't contain any ``*'' characters. If the URI fails these tests, we log an error message and return BAD_REQUEST. Otherwise, all is well and we join together the value of the DocumentRoot directive (retrieved by calling the request object's document_root() method) with the URI to create the complete file path. The file path is now written into the request object by passing it to the filename() method.

We don't check at this point whether the file exists or can be opened. This is the job of handlers further down the request chain.
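The sanity checks and the path join can be exercised without a running server. The sketch below is plain Perl with a hypothetical default_translate() function standing in for the handler; it returns nothing where the handler would return BAD_REQUEST. Note that index() returns -1 when the substring is absent, so its result has to be compared against 0 rather than tested for truth.

```perl
use strict;
use warnings;

# Return the file path for $uri under $document_root, or nothing when
# the URI fails the sanity checks (no leading slash, or contains '*').
sub default_translate {
    my ($document_root, $uri) = @_;
    # index() returns -1 when '*' is absent and 0 when it is the first
    # character, so compare against 0 explicitly.
    return if $uri !~ m{^/} or index($uri, '*') >= 0;
    return $document_root . $uri;
}

for my $uri ('/index.html', 'index.html', '/a*b', '*') {
    my $path = default_translate('/home/httpd/htdocs', $uri);
    print "$uri => ", (defined $path ? $path : 'BAD_REQUEST'), "\n";
}
```

Only the first URI in the loop passes both checks; the other three are rejected.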

To install this handler, just add the following directive to the main part of your perl.conf configuration file (or any other Apache configuration file, if you prefer):

 PerlTransHandler Apache::DefaultTrans

Beware. You probably won't want to keep this handler installed for long. Because it overrides other translation handlers, you'll lose the use of Alias, ScriptAlias and other standard directives.

A Practical Translation Handler

Here's a slightly more complex example. Consider a Web-based system for archiving software binaries and source code. On a nightly basis an automated system will copy changed and new files from a master repository to multiple mirror sites. Because of the vagaries of the Internet, it's important to confirm that the entire file, and not just a fragment of it, is copied from one mirror site to the other.

One technique for solving this problem would be to create an MD5 checksum for each file and store the information on the repository. After the mirror site copies the file, it checksums the file and compares it against the master checksum retrieved from the repository. If the two values match, then the integrity of the copied file is confirmed.

In this section, we'll begin a simple system to retrieve precomputed MD5 checksums from an archive of files. To retrieve the checksum for a file, you simply append the extension .cksm to the end of its URL. For example, if the archived file you wish to retrieve is:

 /archive/software/cookie_cutter.tar.gz

then you can retrieve a text file containing its MD5 checksum by fetching this URL:

 /archive/software/cookie_cutter.tar.gz.cksm

The checksum files will be precomputed and stored in a directory tree that parallels the document hierarchy. For example, if the document itself is physically stored in:

 /home/httpd/htdocs/archive/software/cookie_cutter.tar.gz

then its checksum will be stored in a parallel tree in a file named:

 /home/httpd/checksums/archive/software/cookie_cutter.tar.gz

The URI translation handler's job is to map requests for /file/path/filename.cksm files into the physical file /home/httpd/checksums/file/path/filename. When used from a browser, the results look something like the screenshot in Figure 7.1.

Figure 7.1: A Checksum File Retrieved by Apache::Checksum1

As often happens with Perl programs, the problem takes longer to state than to solve. Listing 7.1 shows a translation handler, Apache::Checksum1, that accomplishes this task. The structure is similar to other Apache Perl modules. After the usual preamble, the handler() subroutine shifts the Apache request object off the call stack and uses it to recover the URI of the current request, which is stashed in the local variable $uri. The subroutine next looks for a configuration directive named ChecksumDir which defines the top of the tree where the checksums are to be found. If defined, handler() stores the value in a local named $cksumdir. Otherwise, it assumes a default value defined in DEFAULT_CHECKSUM_DIR.

Now the subroutine checks whether this URI needs special handling. It does this by attempting a string substitution which will replace the .cksm URI with a physical path to the corresponding file in the checksums directory tree. If the substitution returns a false value, then the requested URI does not end with the .cksm extension and we return DECLINED. This leaves the requested URI unchanged and allows Apache's other translation handlers to work on it. If, on the other hand, the substitution returns a true result, then $uri holds the correct physical pathname to the checksum file. We call the request object's filename() method to set the physical path returned to Apache, and return OK. This tells Apache that the URI was successfully translated and prevents any other translation handlers from being called.
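The substitution described above can be tried on its own, outside Apache. In this sketch the checksum_path() function is a hypothetical wrapper around the same pattern; it returns nothing in the case where the handler would return DECLINED.

```perl
use strict;
use warnings;

# Map a .cksm URI onto a path under $checksum_dir, or return nothing
# if the URI does not end in .cksm (the DECLINED case).
sub checksum_path {
    my ($uri, $checksum_dir) = @_;
    return unless $uri =~ s!^(.+)\.cksm$!$checksum_dir$1!;
    return $uri;
}

for my $uri ('/archive/software/cookie_cutter.tar.gz.cksm',
             '/archive/software/cookie_cutter.tar.gz') {
    my $path = checksum_path($uri, '/home/httpd/checksums');
    print "$uri => ", (defined $path ? $path : 'DECLINED'), "\n";
}
```

Because s/// returns the number of substitutions made, a URI without the extension yields a false value and is left untouched.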

Listing 7.1: A URI Translator for Checksum Files

 package Apache::Checksum1;
 # file: Apache/Checksum1.pm
 use strict;
 use Apache::Constants qw(:common);
 use constant DEFAULT_CHECKSUM_DIR => '/usr/tmp/checksums';

 sub handler {
 my $r = shift;
 my $uri = $r->uri;
 my $cksumdir = $r->dir_config('ChecksumDir') || DEFAULT_CHECKSUM_DIR;
 $cksumdir = $r->server_root_relative($cksumdir);
 return DECLINED unless $uri =~ s!^(.+)\.cksm$!$cksumdir$1!;
 $r->filename($uri);
 return OK;
 }

 1;
 __END__

The configuration for this translation handler should look something like this:

 # checksum translation handler directives
 PerlTransHandler Apache::Checksum1
 PerlSetVar ChecksumDir /home/httpd/checksums
 <Directory /home/httpd/checksums>
 ForceType text/plain
 </Directory>

This configuration declares a URI translation handler with the PerlTransHandler directive, and sets the Perl configuration variable ChecksumDir to /home/httpd/checksums, the top of the checksum tree. We also need a <Directory> section to force all files in the checksums directory to be of type text/plain. Otherwise, the default MIME type checker will try to use each checksum file's extension to determine its MIME type.

There are a couple of important points about this configuration section. First, the PerlTransHandler and PerlSetVar directives are located in the main section of the configuration file, not in a <Directory>, <Location> or <Files> section. This is because the URI translation phase runs very early in the request processing cycle, before Apache has a definite URI or file path to use in selecting an appropriate <Directory>, <Location> or <Files> section to take its configuration from. For the same reason, PerlTransHandler is not allowed in .htaccess files, although you can use it in virtual host sections.

The second point is that the ForceType directive is located in a <Directory> section rather than in a <Location> block. The reason for this is that the <Location> section refers to the requested URI, which is not changed by this particular translation handler. To apply access control rules and other options to the physical file path returned by the translation handler, we must use <Directory> or <Files>.

To set up the checksum tree, you'll have to write a script that will recurse through the Web document hierarchy (or a portion of it), and create a mirror directory of checksum files. In case you're interested in implementing a system like this one, Listing 7.2 gives a short script named checksum.pl that does this. It uses the File::Find module to walk the tree of source files, the MD5 module to generate MD5 checksums, and File::Path and File::Basename for filename manipulations. New checksum files are only created if the checksum file doesn't exist or the modification time of the source file is more recent than that of an existing checksum file.

You call the script like this:

 % checksum.pl -source ~www/htdocs -dest ~www/checksums

Replace ~www/htdocs and ~www/checksums with the paths to the Web document tree and the checksums directory on your system.
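The two decisions at the heart of such a script, whether a checksum file is stale and what the checksum is, can be sketched in a few lines. This sketch assumes the Digest::MD5 module, which ships with current Perl distributions, in place of the older MD5 module; needs_update() and file_md5() are hypothetical helper names, and the sample file is created in a temporary directory purely for demonstration.

```perl
use strict;
use warnings;
use Digest::MD5 ();
use File::Temp qw(tempdir);

# True when $dest is missing or older than $source.
sub needs_update {
    my ($source, $dest) = @_;
    return 1 unless -e $dest;
    my $source_mtime = (stat $source)[9];
    my $dest_mtime   = (stat $dest)[9];
    return $source_mtime > $dest_mtime;
}

# Return the hex MD5 digest of a file's contents.
sub file_md5 {
    my $path = shift;
    open(my $fh, '<', $path) or die "Couldn't open $path: $!\n";
    binmode $fh;
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest;
}

# Demonstration on a throwaway file.
my $dir    = tempdir(CLEANUP => 1);
my $source = "$dir/sample.txt";
open(my $out, '>', $source) or die "Couldn't write $source: $!\n";
print {$out} "abc";
close $out;

if (needs_update($source, "$dir/sample.txt.md5")) {
    print file_md5($source), "\t/sample.txt\n";
}
```

The output line has the same shape as the checksum files described in this section: the hex digest, a tab, and the URL of the source document.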

Listing 7.2: checksum.pl Creates a Parallel Tree of Checksum Files

 #!/usr/local/bin/perl

 use File::Find;
 use File::Path;
 use File::Basename;
 use IO::File;
 use MD5;
 use Getopt::Long;
 use strict;
 use vars qw($SOURCE $DESTINATION $MD5);

 GetOptions('source=s' => \$SOURCE,
 'destination=s' => \$DESTINATION) || die <<USAGE;
 Usage: $0
 Create a checksum tree.
 Options:
 -source <path> File tree to traverse [.]
 -destination <path> Destination for checksum tree [TMPDIR]
 Option names may be abbreviated.
 USAGE
 
 $SOURCE ||= '.';
 $DESTINATION ||= $ENV{TMPDIR} || '/tmp';
 die "Must specify absolute destination directory" unless $DESTINATION=~m!^/!;
 $MD5 = new MD5;
 
 find(\&wanted,$SOURCE);
 
 # This routine is called for each node (directory or file) in the
 # source tree. On entry, $_ contains the filename, 
 # and $File::Find::name contains its full path.
 sub wanted {
 return unless -f $_ && -r _;
 my $modtime = (stat _)[9];
 my ($source,$dest,$url);
 $source = $File::Find::name;
 ($dest = $source)=~s/^\Q$SOURCE\E/$DESTINATION/o; # \Q...\E: match the path literally
 return if -e $dest && $modtime <= (stat $dest)[9];
 ($url = $source) =~s/^\Q$SOURCE\E//o;
 make_checksum($_,$dest,$url);
 }
 
 # This routine is called with the source file, the destination in which
 # to write the checksum, and a URL to attach as a comment to the checksum.
 sub make_checksum {
 my ($source,$dest,$url) = @_;
 my $sfile = IO::File->new($source) || die "Couldn't open $source: $!\n";
 mkpath dirname($dest); # create the intermediate directories
 my $dfile = IO::File->new(">$dest") || die "Couldn't open $dest: $!\n";
 $MD5->reset;
 $MD5->addfile($sfile);
 print $dfile $MD5->hexdigest(),"\t$url\n"; # write the checksum
 }
 __END__

Using a Translation Handler to Change the URI

Instead of completely translating a URI into a filename, a translation handler can modify the URI itself and let other handlers do the work of completing the translation into a physical path. This is very useful because it allows the handler to interoperate with other URI translation directives such as Alias and UserDir.

To change the URI, your translation handler should set it with the Apache request object's uri() method instead of (or in addition to) the filename() method: $r->uri($new_uri);

After changing the URI, your handler should then return DECLINED, not OK. This may seem counter-intuitive. However by returning DECLINED, your translation handler is telling Apache that it has declined to do the actual work of matching the URI to a filename and is asking Apache to pass the modified request on to other registered translation handlers.

Listing 7.3 shows a reworked version of the checksum translation handler that alters the URI rather than sets the filename directly. The code is nearly identical to the first version of this module, but instead of retrieving a physical directory path from a PerlSetVar configuration variable named ChecksumDir, the handler looks for a variable named ChecksumPath which is expected to contain the virtual (URI space) directory in which the checksums can be found. If the variable isn't defined, then /checksums is assumed. We perform the string substitution on the requested URI as before. If the substitution succeeds, we write the modified URI back into the request record by calling the request object's uri() method. We then return DECLINED so that Apache will pass the altered request on to other translation handlers.

Listing 7.3: A Translation Handler that Changes the URI

 package Apache::Checksum2;
 # file: Apache/Checksum2.pm
 use strict;
 use Apache::Constants qw(:common);
 use constant DEFAULT_CHECKSUM_PATH => '/checksums';

 sub handler {
 my $r = shift;
 my $uri = $r->uri;
 my $cksumpath = $r->dir_config('ChecksumPath') || DEFAULT_CHECKSUM_PATH;
 return DECLINED unless $uri =~ s!^(.+)\.cksm$!$cksumpath$1!;
 $r->uri($uri);
 return DECLINED;
 }

 1;
 __END__

The configuration file entries needed to work with Apache::Checksum2 are shown below. Instead of passing the translation handler a physical path in the ChecksumDir variable, we use ChecksumPath to pass a virtual URI path. The actual translation from a URI to a physical path is done by the standard mod_alias module from information provided by an Alias directive. Another point to notice is that because the translation handler changed the URI, we can now use a <Location> section to force the type of the checksum files to text/plain.

 PerlTransHandler Apache::Checksum2
 PerlSetVar ChecksumPath /checksums
 Alias /checksums/ /home/httpd/checksums/
 <Location /checksums>
 ForceType text/plain
 </Location>

In addition to interoperating well with other translation directives, this version of the checksum translation handler deals correctly with the implicit retrieval of index.html files when the URI ends in a directory name. For example, retrieving the partial URI /archive/software/.cksm will be correctly transformed into a request for /home/httpd/checksums/archive/software/index.html.

On the downside, this version of the translation module may issue potentially confusing error messages if a checksum file is missing. For example, if the user requests the URI /archive/software/index.html.cksm and the checksum file is not present, Apache's default ``Not Found'' error message will read ``The requested URL /checksums/archive/software/index.html was not found on this server.'' The user may be confused to see an error message that refers to a different URI than the one requested.

Another example of altering the URI on the fly can be found in Chapter 5, where we used a translation handler to manage session IDs embedded in URIs. This handler copies the session ID from the URI into an environment variable for later use by the content handler, then strips the session ID from the URI and writes it back into the request record.

Installing a Custom Response Handler in the URI Translation Phase

In addition to its official use as the place to modify the URI and/or filename of the requested document, the translation phase is also a convenient place to set up custom content handlers for particular URIs. To continue with our checksum example, let's generate the checksum from the requested file on the fly rather than using a precomputed value. This eliminates the need to maintain a parallel directory of checksum files, but adds the cost of additional CPU cycles every time a checksum is requested.

Listing 7.4 shows Apache::Checksum3. It's a little longer than the previous examples, so we'll step through it a chunk at a time.

 package Apache::Checksum3;
 # file: Apache/Checksum3.pm
 use strict;
 use Apache::Constants qw(:common);
 use Apache::File ();
 use MD5 ();

 my $MD5 = MD5->new;

Because this module is going to produce the MD5 checksums itself, we bring in the Apache::File and MD5 modules. We then create a file-scoped lexical MD5 object that will be used within the package to generate the MD5 checksums of requested files.

 sub handler {
 my $r = shift;
 my $uri = $r->uri;
 return DECLINED unless $uri =~ s/\.cksm$//;
 $r->uri($uri);

We now define two subroutines. The first, named handler(), is responsible for the translation phase of the request. Like its predecessors, this subroutine recovers the URI from the request object and looks for the telltale .cksm extension. However, instead of constructing a new path that points into the checksums directory, we simply strip off the extension and write the modified path back into the request record.

 $r->handler("perl-script");
 $r->push_handlers(PerlHandler => \&checksum_handler);
 return DECLINED;
 }

Now the interesting part begins. We set the request's content handler to point to the second subroutine in the module, checksum_handler(). This is done in two phases. First we call $r->handler("perl-script") to tell Apache to invoke the Perl interpreter for the content phase of the request. Next we call push_handlers() to tell Perl to call our checksum_handler() method when the time comes. Our work done, we return a result code of DECLINED in order to let the other translation handlers do their job.

Apache will now proceed as usual through the access control, authentication, authorization, MIME type checking, and fixup phases until it gets to the content phase, at which point our Apache::Checksum3 will be reentered through the checksum_handler() routine:

 sub checksum_handler {
 my $r = shift;
 my $file = $r->filename;
 my $sfile = Apache::File->new($file) || return DECLINED;
 $r->content_type('text/plain');
 $r->send_http_header;
 return OK if $r->header_only;
 $MD5->reset;
 $MD5->addfile($sfile);
 $r->print($MD5->hexdigest(),"\t",$r->uri,"\n");
 return OK;
 }

Like the various content handlers we saw in Chapter 4, checksum_handler() calls the request object's filename() method to retrieve the physical filename and attempts to open it, returning DECLINED in case of an error. The subroutine sets the content type to text/plain and sends the HTTP header. If this is a HEAD request, we return. Otherwise, we invoke the MD5 module's reset() method to clear the checksum algorithm, call addfile() to process the contents of the file, and then hexdigest() to emit the checksum.

Because this module is entirely self-contained, it has the simplest configuration of them all:

 PerlTransHandler Apache::Checksum3

Like other PerlTransHandler directives, this one must be located in the main part of the configuration file or in a virtual host section.

Listing 7.4: Calculating Checksums on the Fly

 package Apache::Checksum3;
 # file: Apache/Checksum3.pm
 use strict;
 use Apache::Constants qw(:common);
 use Apache::File ();
 use MD5 ();
 
 my $MD5 = MD5->new;
 
 sub handler {
 my $r = shift;
 my $uri = $r->uri;
 return DECLINED unless $uri =~ s/\.cksm$//;
 $r->uri($uri);
 $r->handler("perl-script");
 $r->push_handlers(PerlHandler => \&checksum_handler);
 return DECLINED;
 }
 
 sub checksum_handler {
 my $r = shift;
 my $file = $r->filename;
 my $sfile = Apache::File->new($file) || return DECLINED;
 $r->content_type('text/plain');
 $r->send_http_header;
 return OK if $r->header_only;
 $MD5->reset;
 $MD5->addfile($sfile);
 $r->print($MD5->hexdigest(),"\t",$r->uri,"\n");
 return OK;
 }
 
 1;
 __END__

Don't think that you must always write a custom translation handler in order to gain control over the URI translation phase. The powerful mod_rewrite module gives you great power to customize this phase. For example, by adding a mod_rewrite RewriteRule directive, you can define a substitution rule that transforms requests for .cksm URIs into requests for files in the checksum directory, doing in a single line what our first example of a translation handler did in 17.
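As a rough illustration, a rule along the following lines would rewrite .cksm requests into the /checksums URI space. This is a sketch rather than a tested configuration: it assumes that mod_rewrite and mod_alias are both active and that the checksum tree lives in /home/httpd/checksums. It rewrites the URI and then relies on an Alias for the final mapping, so it corresponds most closely to the URI-changing variant of the handler.

```apache
# Assumes mod_rewrite and mod_alias are both loaded.
RewriteEngine on
# Turn /path/file.cksm into /checksums/path/file, then pass the result
# through ([PT]) so that the Alias below can map it to a physical path.
RewriteRule ^(.+)\.cksm$ /checksums$1 [PT]
Alias /checksums/ /home/httpd/checksums/
```

The [PT] flag matters here: without it, the rewritten value is treated as the final result and the Alias directive never sees it.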

The Header Parser Phase

After Apache has translated the URI into a filename, it enters the header parser phase. This phase gives handlers a chance to examine the incoming request header and to take special action, perhaps altering the headers on the fly (as we will do below to create an anonymous proxy server), or blocking unwanted transactions at an early stage. For example, the header parser phase is commonly used to block unwanted robots before they consume the server resources during the later phases. You could use the Apache::BlockAgent module, implemented as an access handler in the last chapter, to block robots during this earlier phase.

Header parser handlers are installed with the PerlHeaderParserHandler directive. Because the URI has been mapped to a filename at this point, the directive is allowed in .htaccess files and directory configuration sections, as well as in the main body of the configuration files. All registered header parser handlers will be run unless one returns an error code or DONE.

When PerlInitHandler is used within a directory section or a .htaccess file, it acts as an alias to PerlHeaderParserHandler.

Implementing an Unsupported HTTP Method

One non-trivial use for the header parser phase is to implement an unsupported HTTP request method. The Apache server handles the most common HTTP methods, such as GET, HEAD and POST. Apache also provides hooks for managing the less commonly used PUT and DELETE methods, but the work of processing the method is left to third-party modules to implement. In addition to these methods, there are certain methods that are part of the HTTP/1.1 draft that are not supported by Apache at this time. One such method is PATCH*, which is used to change the contents of a document on the server side by applying to it a ``diff'' file provided by the client.

footnote: *Just two weeks prior to the production stage of this book, Script support for the PATCH method was added in Apache 1.3.4-dev.

This section will show how to extend the Apache server in an entirely new direction to support the PATCH method. The same techniques can be used to experiment with other parts of HTTP drafts or customize the HTTP protocol for special applications.

If you've never worked with patch files, you'll be surprised at how insanely useful they are. Say you have two versions of a large file, an older version named file.1.html and a newer version named file.2.html. You can use the diff command to compute the difference between the two, like this:

 % diff file.1.html file.2.html > file.diff

When diff is finished, the output file, file.diff, will contain only the lines that have changed between the two files, along with information indicating the positions of the changed lines in the files. You can examine a ``diff'' file in a text editor to see how the two files differ. More interestingly, however, you can use Larry Wall's patch program to apply the diff to file.1.html, transforming it into a new file identical to file.2.html. Patch is simple to use:

 % patch file.1.html < file.diff

Because two versions of the same file tend to be more similar than they are different, diff files are usually short, making it much more efficient to send the diff file around than the entire new version. This is the rationale for the HTTP/1.1 PATCH method. It complements PUT, which is used to transmit a whole new document to the server, by sending what should be changed between an existing document and a new one. When a client requests a document with the PATCH method, the URL it provides corresponds to the file to be patched, and the request's content is the diff file to be applied.
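
To make the shape of such a request concrete, here is a minimal sketch in plain Perl that assembles the text of a PATCH request by hand. The host name, path, and diff body are made-up placeholders, and application/diff is simply the MIME type that the module in this section chooses; treat the result as an illustration rather than a definitive wire format.

```perl
use strict;
use warnings;

# Build the raw text of a PATCH request. Every value passed in is a
# placeholder chosen for illustration.
sub build_patch_request {
    my ($host, $path, $diff) = @_;
    return join "\r\n",
        "PATCH $path HTTP/1.1",
        "Host: $host",
        "Content-Type: application/diff",
        "Content-Length: " . length($diff),   # assumes a plain ASCII diff
        "",                                   # blank line ends the headers
        $diff;                                # the body is the diff itself
}

my $diff = "3c3\n< <H1>Old title</H1>\n---\n> <H1>New title</H1>\n";
print build_patch_request('www.example.com', '/file.1.html', $diff);
```

The URL names the document to be changed, and the body carries only the differences, which is what distinguishes PATCH from PUT.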

Listing 7.5 gives the code for the PATCH handler, appropriately named Apache::PATCH. It defines both the server-side routines for accepting PATCH documents, and a small client-side program to use for submitting patch files to the server.

 package Apache::PATCH;
 # file: Apache/PATCH.pm
 
 use strict;
 use vars qw($VERSION @EXPORT @ISA);
 use Apache::Constants qw(:common BAD_REQUEST);
 use Apache::File ();
 use File::Basename 'dirname';
 
 @ISA = qw(Exporter);
 @EXPORT = qw(PATCH);
 $VERSION = '1.00';
 
 use constant PATCH_TYPE => 'application/diff';
 my $PATCH_CMD = "/usr/local/bin/patch";

We begin by pulling in required modules, including Apache::File and File::Basename. We also inherit from the Exporter module. This is not used by the server-side routines, but is needed by the client-side library to export the PATCH() subroutine. We then declare a constant for the MIME type of the submitted patch files, and a variable holding the location of the patch program on our system.

The main entry point to server-side routines is through a header parsing phase handler named handler(). It detects whether the request uses the PATCH method, and if so, installs a custom response handler to deal with it. This means we can install the patch routines with this configuration directive:

 PerlHeaderParserHandler Apache::PATCH

The rationale for installing the patch handler with the PerlHeaderParserHandler directive rather than PerlTransHandler is that we can use the former directive within directory sections and .htaccess files, allowing us to make the PATCH method active only for certain parts of the document tree.

The definition of handler() is simple:

 sub handler {
     my $r = shift;
     return DECLINED unless $r->method eq 'PATCH';
     unless ($r->some_auth_required) {
         $r->log_reason("Apache::PATCH requires access control");
         return FORBIDDEN;
     }
     $r->handler("perl-script");
     $r->push_handlers(PerlHandler => \&patch_handler);
     return OK;
 }

We recover the request object and call method() to determine whether the request method equals ``PATCH''. If not, we decline the transaction. Next we perform a simple but important security check. We call some_auth_required() to determine whether the requested URI is under password protection. If the document is not protected, we log an error and return a result code of FORBIDDEN. This is a hard-wired insurance that the file to be patched is protected in some way using any of the many authentication modules available to Apache (see Chapter 6 for a few).

If the request passes the checks, we adjust the content handler to be the patch_handler() subroutine by calling the request object's handler() and push_handlers() methods. This done, we return OK, allowing other installed header parsers to process the request.
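
The decision logic can be exercised outside a running server with a small stand-in for the request object. The sketch below is only that: MockRequest is a hypothetical class invented for this example, check_patch_request() is an illustrative name, and the numeric status values are assumed to mirror those exported by Apache::Constants.

```perl
use strict;
use warnings;

# Numeric values assumed to mirror Apache 1.3's status constants.
use constant { OK => 0, DECLINED => -1, FORBIDDEN => 403 };

# Hypothetical stand-in for the request object: just enough to drive
# the decision logic without a server.
package MockRequest;
sub new                { my ($class, %args) = @_; return bless {%args}, $class }
sub method             { $_[0]{method} }
sub some_auth_required { $_[0]{auth} }

package main;

sub check_patch_request {
    my $r = shift;
    return DECLINED  unless $r->method eq 'PATCH';     # not our request
    return FORBIDDEN unless $r->some_auth_required;    # unprotected URI
    return OK;                                         # safe to continue
}

for my $case ([GET => 1], [PATCH => 0], [PATCH => 1]) {
    my ($method, $auth) = @$case;
    my $r = MockRequest->new(method => $method, auth => $auth);
    printf "%-5s auth=%d -> %d\n", $method, $auth, check_patch_request($r);
}
```

Only the last case, a PATCH request for a protected URI, gets as far as installing the content handler in the real module.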

The true work of the module is done in the patch_handler() subroutine, which is called during the response phase:

 sub patch_handler {
     my $r = shift;
 
     return BAD_REQUEST
         unless lc($r->header_in("Content-type")) eq PATCH_TYPE;

This subroutine recovers the request object and immediately checks the content type of the submitted data. Unless the submitted data has MIME type application/diff, indicating a diff file, we return a result code of BAD_REQUEST.

     # get file to patch
     my $filename = $r->filename;
     my $dirname = dirname($filename);
     my $reason;
     {   # a bare block, unlike do {}, can be exited with last
         -e $r->finfo or $reason = "$filename does not exist", last;
         -w _ or $reason = "$filename is not writable", last;
         -w $dirname or $reason = "$filename directory is not writable", last;
     }
     if ($reason) {
         $r->log_reason($reason);
         return FORBIDDEN;
     }

Next we check whether the patch operation is likely to succeed. In order for the patch program to work properly, both the file to be patched and the directory that contains it must be writable by the current process.* This is because patch creates a temporary file while processing the diff and renames it when it has successfully completed its task. We recover the filename corresponding to the request, and the name of the directory that contains it. We then subject the two to a series of file tests. If any of the tests fails, we log the error and return FORBIDDEN.

footnote: *In order for the PATCH method to work you will have to make the files and directories to be patched writable by the Web server process. You can do this either by making the directories world-writable, or by changing their user or group ownerships so that the Web server has write permission. This has security implications, as it allows buggy CGI scripts and other Web server security holes to alter the document tree. A more secure solution would be to implement PATCH using a conventional CGI script running under the standard Apache suexec extension, or the sbox CGI wrapper (http://stein.cshl.org/WWW/software/sbox).
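
The same series of file tests can be tried on its own. The following is a minimal sketch, assuming an ordinary filename in place of the request's finfo() result; why_not_patchable() is a name invented for this example. It relies on two Perl details: a bare block can be left early with last, and the special _ filehandle reuses the result of the previous file test.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);

# Return a reason string if $filename cannot be patched in place,
# or undef if all the tests pass.
sub why_not_patchable {
    my $filename = shift;
    my $reason;
    {   # a bare block runs once and can be exited with last
        -e $filename          or $reason = "$filename does not exist", last;
        -w _                  or $reason = "$filename is not writable", last;
        -w dirname($filename) or $reason = "$filename directory is not writable", last;
    }
    return $reason;
}

my $reason = why_not_patchable('/no/such/dir/file.html');
print defined $reason ? "refused: $reason\n" : "looks patchable\n";
```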

     # get patch data
     my $patch;
     $r->read($patch, $r->header_in("Content-length"));
 
     # new temporary file to hold output of patch command
     my($tmpname, $patch_out) = Apache::File->tmpfile;
     unless($patch_out) {
         $r->log_reason("can't create temporary output file: $!");
         return FORBIDDEN;
     }

The next job is to retrieve the patch data from the request. We do this using the request object's read() method to copy Content-length bytes of patch data from the request to a local variable named $patch. We are about to call the patch command, but before we do so we must arrange for its output (both standard output and standard error) to be saved to a temporary file so that we can relay the output to the user. We call the Apache::File method tmpfile(), which creates a uniquely named temporary file and returns both its name and an open filehandle. We store these in variables named $tmpname and $patch_out, respectively. If for some reason tmpfile() is unable to open a temporary file, it returns an empty list; in that case we log the error and return FORBIDDEN.

     # redirect child process's stdout and stderr to temporary file
     open STDOUT, ">&=" . fileno($patch_out);

We want the output from patch to go to the temporary file rather than to standard output (which was closed by the parent server long, long ago). So we reopen STDOUT, using the ``>&='' notation to open it on the same file descriptor as $patch_out.* See the description of open() in the perlfunc manual page for a more detailed description of this facility.

footnote: *Why not just redirect the output of patch to the temporary file by invoking patch with the ``>$tmpname'' notation? Because this leaves us exposed to a race condition in which some other process replaces the temporary file with a link to a more important file. When patch writes to this file, it inadvertently clobbers it. Arranging for patch to write directly to the filehandle returned by tmpfile() avoids this trap.
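
The ``>&='' technique is easy to try outside the server. The sketch below makes two substitutions so that it runs anywhere, and both are assumptions of this example rather than part of the module: the core File::Temp module takes the place of Apache::File->tmpfile (note that its tempfile() returns the handle first and the name second), and a Perl one-liner stands in for patch. capture_stdout() is an illustrative name.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Run @cmd with its standard output sent straight to an already-open
# temporary filehandle, then return whatever it wrote.
sub capture_stdout {
    my @cmd = @_;
    my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);

    open my $saved, '>&', \*STDOUT or die "can't save STDOUT: $!";
    open STDOUT, '>&=', fileno($tmp_fh) or die "can't redirect STDOUT: $!";
    system(@cmd);                 # the child inherits the redirected STDOUT
    open STDOUT, '>&', $saved or die "can't restore STDOUT: $!";

    open my $in, '<', $tmp_name or die "can't reread $tmp_name: $!";
    local $/;                     # slurp mode
    return scalar <$in>;
}

print capture_stdout($^X, '-e', 'print "hello from the child\n"');
```

Because the child writes to a descriptor that the parent opened itself, the temporary file's name is never handed to the child for writing, which is the point of the footnote above.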

     # open a pipe to the patch command
     local $ENV{PATH}; #keep -T happy
     my $patch_in = Apache::File->new("| $PATCH_CMD $filename 2>&1");
     unless ($patch_in) {
         $r->log_reason("can't open pipe to $PATCH_CMD: $!");
         return FORBIDDEN;
     }

At this point we open up a pipe to the patch command and store the pipe in a new filehandle named $patch_in. We call patch with a single command-line argument, the name of the file to change, which is stored in $filename. The piped open command also uses the ``2>&1'' notation, which is the Bourne shell's arcane way of indicating that standard error should be redirected to the same place that standard output is directed, which in this case is the temporary file. If we can't open the pipe for some reason, we log the error and return FORBIDDEN.
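
Because the command string above is handed to the shell, a filename containing shell metacharacters could be misinterpreted. One possible alternative, shown here only as a sketch, is the list form of the piped open, which bypasses the shell entirely. It assumes Perl 5.8 or later and uses Perl's built-in open() instead of Apache::File; since no shell is involved, shell redirections such as ``2>&1'' are not available in this form. pipe_to_command() is an illustrative name, and a Perl one-liner again stands in for patch.

```perl
use strict;
use warnings;

# Feed $input to @cmd on its standard input without involving a shell.
# Returns the command's exit status (0 on success).
sub pipe_to_command {
    my ($input, @cmd) = @_;
    open my $pipe, '|-', @cmd or die "can't open pipe to $cmd[0]: $!";
    print {$pipe} $input;
    close $pipe;                  # waits for the child and sets $?
    return $? >> 8;
}

# The stand-in child counts its input lines and reports the count
# as its exit status.
my $status = pipe_to_command("line one\nline two\n",
                             $^X, '-e', 'my @l = <STDIN>; exit scalar @l');
print "child saw $status lines\n";
```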

     # write data to the patch command
     print $patch_in $patch;
     close $patch_in;
     close $patch_out;

We now print the diff file to the patch pipe. patch will process the diff file, and write its output to the temporary file. After printing, we close the command pipe and the temporary filehandle.

     $patch_out = Apache::File->new($tmpname);
 
     # send the result to the user
     $r->send_http_header("text/plain");
     $r->send_fd($patch_out);
     close $patch_out;
 
     return OK;
 }

The last task is to send the patch output back to the client. We send the HTTP header, using the convenience form that allows us to set the MIME type in a single step. We now send the contents of the temporary file using the request object's send_fd() method. Our work done, we close the temporary filehandle and return OK.*

footnote: *Users interested in the HTTP PATCH method should also be aware of the IETF WebDAV (``Distributed Authoring and Versioning'') standard, http://www.ics.uci.edu/pub/ietf/webdav/, and of Greg Stein's Apache module implementation of these protocol extensions, http://www.lyra.org/greg/mod_dav/

Listing 7.5: Implementing the PATCH Method

 package Apache::PATCH;
 # file: Apache/PATCH.pm
 
 use strict;
 use vars qw($VERSION @EXPORT @ISA);
 use Apache::Constants qw(:common BAD_REQUEST);
 use Apache::File ();
 use File::Basename 'dirname';
 
 @ISA = qw(Exporter);
 @EXPORT = qw(PATCH);
 $VERSION = '1.00';
 
 use constant PATCH_TYPE => 'application/diff';
 my $PATCH_CMD = "/usr/local/bin/patch";
 
 sub handler {
     my $r = shift;
     return DECLINED unless $r->method eq 'PATCH';
     unless ($r->some_auth_required) {
         $r->log_reason("Apache::PATCH requires access control");
         return FORBIDDEN;
     }
     $r->handler("perl-script");
     $r->push_handlers(PerlHandler => \&patch_handler);
     return OK;
 }
 
 sub patch_handler {
     my $r = shift;
 
     return BAD_REQUEST
         unless lc($r->header_in("Content-type")) eq PATCH_TYPE;
 
     # get file to patch
     my $filename = $r->filename;
     my $dirname = dirname($filename);
     my $reason;
     {   # a bare block, unlike do {}, can be exited with last
         -e $r->finfo or $reason = "$filename does not exist", last;
         -w _ or $reason = "$filename is not writable", last;
         -w $dirname or $reason = "$filename directory is not writable", last;
     }
     if ($reason) {
         $r->log_reason($reason);
         return FORBIDDEN;
     }
 
     # get patch data
     my $patch;
     $r->read($patch, $r->header_in("Content-length"));
 
     # new temporary file to hold output of patch command
     my($tmpname, $patch_out) = Apache::File->tmpfile;
     unless($patch_out) {
         $r->log_reason("can't create temporary output file: $!");
         return FORBIDDEN;
     }
 
     # redirect child process's stdout and stderr to temporary file
     open STDOUT, ">&=" . fileno($patch_out);
 
     # open a pipe to the patch command
     local $ENV{PATH}; #keep -T happy
     my $patch_in = Apache::File->new("| $PATCH_CMD $filename 2>&1");
     unless ($patch_in) {
         $r->log_reason("can't open pipe to $PATCH_CMD: $!");
         return FORBIDDEN;
     }
     # write data to the patch command
     print $patch_in $patch;
     close $patch_in;
     close $patch_out;
 
     $patch_out = Apache::File->new($tmpname);
 
     # send the result to the user
     $r->send_http_header("text/plain");
     $r->send_fd($patch_out);
     close $patch_out;
 
     return OK;
 }
 
 # This part is for command-line invocation only.
 my $opt_C;
 
 sub PATCH {
     require LWP::UserAgent;
     @Apache::PATCH::ISA = qw(LWP::UserAgent);
 
     my $ua = __PACKAGE__->new;
     my $url;
     my $args = @_ ? \@_ : \@ARGV;
 
     while (my $arg = shift @$args) {
         $opt_C = shift @$args, next if $arg eq "-C";
         $url = $arg;
     }
 
     my $req = HTTP::Request->new('PATCH' => $url);
 
     my $patch = join '', <STDIN>;
     $req->content(\$patch);
     $req->header('Content-length' => length $patch);
     $req->header('Content-type' => PATCH_TYPE);
     my $res = $ua->request($req);
 
     if($res->is_success) {
         print $res->content;
     }
     else {
         print $res->as_string;
     }
 }
 
 sub get_basic_credentials {
     my($self, $realm, $uri) = @_;
     return split ':', $opt_C, 2;
 }
 
 1;
 __END__

At the time this chapter was written, no Web browser or publishing system had actually implemented the PATCH method. The remainder of the listing contains code for implementing a PATCH client. You can use this code from the command line to send patch files to servers that have the PATCH handler installed and watch the documents change in front of your eyes.

The PATCH client is simple thanks to the LWP library. Its main entry point is an exported subroutine named PATCH():

 sub PATCH {
     require LWP::UserAgent;
     @Apache::PATCH::ISA = qw(LWP::UserAgent);
 
     my $ua = __PACKAGE__->new;
     my $url;
     my $args = @_ ? \@_ : \@ARGV;
 
     while (my $arg = shift @$args) {
         $opt_C = shift @$args, next if $arg eq "-C";
         $url = $arg;
     }

PATCH() starts by creating a new LWP user agent using the subclassing technique discussed later in the Apache::AdBlocker module (see Handling Proxy Requests in this chapter). It then recovers the authentication username and password from the command line by looking for a -C (credentials) switch, whose value is stored in a package lexical named $opt_C. The subroutine then shifts the URL of the document to patch off the command line and stores it in $url.
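
The argument loop is small enough to try in isolation. This is a minimal sketch of the same idea; parse_patch_args() is a name invented for the example, and the credentials and URL are placeholders.

```perl
use strict;
use warnings;

# Pull an optional "-C user:pass" switch and a URL out of an argument list.
sub parse_patch_args {
    my @args = @_;
    my ($credentials, $url);
    while (defined(my $arg = shift @args)) {
        if ($arg eq '-C') { $credentials = shift @args; next }
        $url = $arg;              # the last non-switch argument wins
    }
    return ($credentials, $url);
}

my ($cred, $url) = parse_patch_args('-C', 'fred:secret',
                                    'http://www.example.com/index.html');
print "credentials=$cred url=$url\n";
```

As in the module, the switch may appear before or after the URL.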

     my $req = HTTP::Request->new('PATCH' => $url);
 
     my $patch = join '', <STDIN>;
     $req->content(\$patch);
     $req->header('Content-length' => length $patch);
     $req->header('Content-type' => PATCH_TYPE);
     my $res = $ua->request($req);

The subroutine now creates a new HTTP::Request object that specifies PATCH as its request method, and sets its content to the diff file read in from STDIN. It also sets the Content-length and Content-type HTTP headers to the length of the diff file and application/diff respectively. Having set up the request, the subroutine sends the request to the remote server by calling the user agent's request() method.

     if($res->is_success) {
         print $res->content;
     }
     else {
         print $res->as_string;
     }
 }

If the response indicates success (is_success() returns true) then we print out the text of the server's response. Otherwise the routine prints the entire response, including the error status, as returned by the response object's as_string() method.

 sub get_basic_credentials {
     my($self, $realm, $uri) = @_;
     return split ':', $opt_C, 2;
 }

The get_basic_credentials() method, defined at the bottom of the source listing, is actually an override of an LWP::UserAgent method. When LWP::UserAgent tries to access a document that is password protected, it invokes this method to return the username and password required to fetch the resource. By subclassing LWP::UserAgent into our own package and then defining a get_basic_credentials() method, we're able to provide our parent class with the contents of the $opt_C command-line switch.
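
The third argument to split() in get_basic_credentials() deserves a note: it limits the result to two fields, so a password that itself contains colons survives intact. A minimal sketch, with made-up credentials and an illustrative function name:

```perl
use strict;
use warnings;

# Split "user:password" on the first colon only, so that any colons
# inside the password are preserved.
sub split_credentials {
    my $credentials = shift;
    return split /:/, $credentials, 2;
}

my ($user, $pass) = split_credentials('fred:pa:ss:word');
print "user=$user password=$pass\n";
```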

To run the client from the command line, invoke it like this:

 % perl -MApache::PATCH -e PATCH -- -C username:password \
 http://www.modperl.com/index.html < index.html.diff

 Hmm... Looks like a new-style context diff to me...
 The text leading up to this was:
 --------------------------
 |*** index.html.new Mon Aug 24 21:52:29 1998
 |--- index.html Mon Aug 24 21:51:06 1998
 --------------------------
 Patching file /home/httpd/htdocs/index.html using Plan A...
 Hunk #1 succeeded at 8.
 done

A tiny script named PATCH that uses the module can save some typing:

 #!/usr/local/bin/perl

 use Apache::PATCH;
 PATCH;

 __END__

Now the command looks like this:

 % PATCH -C username:password \
 http://www.modperl.com/index.html < index.html.diff

Customizing the Type Checker Phase

Following the successful completion of the access control and authentication steps (if configured), Apache tries to determine the MIME type (e.g. image/gif) and encoding type (e.g. x-gzip) of the requested document. The types and encodings are usually determined by filename extensions (the term ``suffix'' is used interchangeably with ``extension'' in the Apache source code and documentation). Table 7.1 lists a few common examples.

Table 7.1: MIME Types and Encodings for Common File Extensions

 MIME types:
 extension | type
 --------------------------
 .txt | text/plain
 .html,.htm | text/html
 .gif | image/gif
 .jpg,.jpeg | image/jpeg
 .mpeg,.mpg | video/mpeg
 .pdf | application/pdf

 Encodings:
 extension | encoding
 --------------------------
 .gz | x-gzip
 .Z | x-compress

By default, Apache's type checker phase is handled by the standard mod_mime module, which combines the information stored in the server's conf/mime.types file with AddType and AddEncoding directives to map file extensions onto MIME types and encodings.
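
The extension-based mapping can be approximated in a few lines of Perl. The sketch below is a simplification written for this discussion, not mod_mime's actual algorithm: it copies a handful of entries from Table 7.1 into two hashes, examines each dot-separated extension of a bare filename, and treats type extensions as case-insensitive (an assumption of the example). guess_type() is an illustrative name.

```perl
use strict;
use warnings;

# A few entries from Table 7.1; a real server reads these from
# conf/mime.types and its AddType and AddEncoding directives.
my %TYPE     = (txt => 'text/plain', html => 'text/html', htm => 'text/html',
                gif => 'image/gif',  jpg  => 'image/jpeg', pdf => 'application/pdf');
my %ENCODING = (gz => 'x-gzip', Z => 'x-compress');

# Examine each extension and record the type and encoding it implies.
sub guess_type {
    my $filename = shift;
    my (undef, @ext) = split /\./, $filename;
    my ($type, $encoding);
    for my $ext (@ext) {
        $type     = $TYPE{lc $ext}  if exists $TYPE{lc $ext};
        $encoding = $ENCODING{$ext} if exists $ENCODING{$ext};
    }
    return ($type, $encoding);
}

my ($type, $encoding) = guess_type('report.html.gz');
print "type=$type encoding=$encoding\n";
```

A file with no recognized extension yields neither a type nor an encoding, which is the situation the DBI-based type checker later in this section is designed to handle.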

Once the document's MIME type is determined, the information is saved in the content_type field of the request record, where it is later used during the response phase to determine which module will be responsible for generating the document content. In general, file types that are determined by the AddType and mime.types mapping will be served from disk. If the optional mod_mmap_static or Apache::Mmap modules are installed, the file may be served straight from shared memory, since both content handlers accept */* in order to handle any document type not specifically requested by another handler.

The contents of the request record's content_type field are used to set the default outgoing Content-type header, which the client uses to decide how to render the document. However, as we've seen, content handlers can, and often do, change the content type during the later response phase.

In addition to its responsibility for choosing MIME and encoding types for the requested document, the type checking phase handler also performs the crucial task of selecting the content handler for the document. mod_mime looks first for a SetHandler directive in the current directory or location. If one is set, it uses that handler for the requested document. Otherwise it dispatches the request based on the MIME type of the document. This process was described in more detail at the beginning of Chapter 4. Also see Reimplementing mod_mime in Perl, below, where we reproduce all of mod_mime's functionality with a Perl module.

A DBI-Based Type Checker

In this section, we'll show you a simple type checker handler that determines the MIME type of the document on the basis of a DBI database lookup. Each record of the database table will contain the name of the file, its MIME type, and its encoding.* If no type is registered in the database, we fall through to the default mod_mime handler.

footnote: *An obvious limitation of this module is that it can't distinguish between similarly-named files in different directories. However, if you were to use something like this, it would probably be to manage a large archive of documents with esoteric formats.

This module, Apache::MimeDBI, makes use of the simple Tie::DBI class that was introduced in the previous chapter. Briefly, this class lets you tie a hash to a relational database table. The tied variable appears as a hash of hashes in which the outer hash is a list of table records indexed by the table's primary key, and the inner hash contains the columns of that record, indexed by column name. To give a concrete example, for the purposes of this module we'll set up a database table named doc_types having this structure:

 +----------+------------+------------+
 | filename | mime_type | encoding |
 +----------+------------+------------+
 | test1 | text/plain | NULL |
 | test2 | text/html | NULL |
 | test3 | text/html | x-compress |
 | test4 | text/html | x-gzip |
 | test5 | image/gif | NULL |
 +----------+------------+------------+

Assuming that a hash named %DB is tied to this table, we'll be able to access its columns in this way:

 $type = $DB{'test2'}{'mime_type'};
 $encoding = $DB{'test2'}{'encoding'};

Listing 7.6 gives the source for Apache::MimeDBI.

 package Apache::MimeDBI;
 # file Apache/MimeDBI.pm
 
 use strict;
 use Apache::Constants qw(:common);
 use Tie::DBI ();
 use File::Basename qw(basename);
 
 use constant DEFAULT_DSN => 'mysql:test_www';
 use constant DEFAULT_LOGIN => ':';
 use constant DEFAULT_TABLE => 'doc_types';
 use constant DEFAULT_FIELDS => 'filename:mime_type:encoding';

The module starts by pulling in necessary Perl libraries, including Tie::DBI and the File::Basename filename parser. It also defines a series of default configuration constants. DEFAULT_DSN is the default DBI data source to use, in the format driver:database:host:port. DEFAULT_LOGIN is the username and password for the Web server to use to log into the database, separated by a ``:''. Both fields are blank by default, indicating no password is to be provided. DEFAULT_TABLE is the name of the table in which to look for the MIME type and encoding information. DEFAULT_FIELDS are the names of the filename, MIME type and encoding columns, again separated by the ``:'' character. These default values can be overridden with the per-directory Perl configuration variables MIMEDatabase, MIMELogin, MIMETable and MIMEFields.

 sub handler {
     my $r = shift;
 
     # get filename
     my $file = basename $r->filename;
 
     # get configuration information
     my $dsn = $r->dir_config('MIMEDatabase') || DEFAULT_DSN;
     my $table = $r->dir_config('MIMETable') || DEFAULT_TABLE;
     my($filefield, $mimefield, $encodingfield) =
         split ':', $r->dir_config('MIMEFields') || DEFAULT_FIELDS;
     my($user, $pass) = split ':', $r->dir_config('MIMELogin') || DEFAULT_LOGIN;

The handler() subroutine begins by shifting the request object off the subroutine call stack and using it to recover the requested document's filename. The directory part of the filename is then stripped away using the basename() routine imported from File::Basename. Next, we fetch the values of our four configuration variables. If any are undefined, we default to the values defined by the previously-declared constants.
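
The ``value or default, then split'' idiom can be tried without a server. In this sketch the first argument to parse_setting() plays the role of the value returned by dir_config(), with undef meaning that the variable was not set; the function name is an assumption of the example. One detail worth seeing for yourself is that splitting the default login string ``:'' yields an empty list, so both the username and the password end up undefined.

```perl
use strict;
use warnings;

use constant DEFAULT_LOGIN  => ':';
use constant DEFAULT_FIELDS => 'filename:mime_type:encoding';

# Fall back to the default when the setting is missing, then split
# the result on colons. || binds more tightly than the comma, so the
# default applies to the setting, not to the whole split() call.
sub parse_setting {
    my ($setting, $default) = @_;
    return split /:/, $setting || $default;
}

my ($user, $pass) = parse_setting(undef, DEFAULT_LOGIN);
my @fields        = parse_setting(undef, DEFAULT_FIELDS);
my @custom        = parse_setting('name:type:enc', DEFAULT_FIELDS);

print "user is ", (defined $user ? "'$user'" : 'undefined'), "\n";
print "fields: @fields\n";
print "custom: @custom\n";
```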

     tie my %DB, 'Tie::DBI', {
         'db' => $dsn, 'table' => $table, 'key' => $filefield,
         'user' => $user, 'password' => $pass,
     };
     my $record;

We now tie a hash named %DB to the indicated database by calling the tie() operator. If the hash was successfully tied to the database, calling tied() on it returns a true value (actually, an object reference to the underlying Tie::DBI object itself). Otherwise we return a value of DECLINED and allow other modules their chance at the MIME checking phase.

     return DECLINED unless tied %DB and $record = $DB{$file};

The next step is to check the tied hash to see if there is a record corresponding to the current filename. If there is, we store the record in a variable named $record. Otherwise, we again return DECLINED. This allows files that are not specifically named in the database to fall through to the standard file-extension based MIME type determination.

     $r->content_type($record->{$mimefield});
     $r->content_encoding($record->{$encodingfield})
         if $record->{$encodingfield};

Since the file is listed in the database, we fetch the values of the MIME type and encoding columns and write them into the request record by calling the request object's content_type() and content_encoding() respectively. Since most documents do not have an encoding type, we only call content_encoding() if the column is defined.

     return OK;
 }

Our work is done, so we exit the handler subroutine with an OK status code.

At the end of this module is a short shell script which you can use to initialize a test database named test_www. It will create the table shown in the example above.

To install this module, add a PerlTypeHandler directive like this one to one of the configuration files or a .htaccess file:

 <Location /mimedbi>
 PerlTypeHandler Apache::MimeDBI
 </Location>

If you need to change the name of the database, the login information, or the table structure, be sure to include the appropriate PerlSetVar directives as well.

Figure 7.2 shows the automatic listing of a directory under the control of Apache::MimeDBI. The directory contains several files. ``test1'' through ``test5'' are listed in the database with the MIME types and encodings shown in the table above. Their icons reflect the MIME types returned by the handler subroutine. This MIME type will also be passed to the browser when it loads and renders the document. test6.html doesn't have an entry in the database, so it falls through to the standard MIME checking module, which figures out its type through its file extension. test7 has neither an entry in the database nor a recognized file extension, so it is displayed with the ``unknown document'' icon. Without help from Apache::MimeDBI, all the files without extensions would end up as unknown MIME types.

Figure 7.2: An automatic listing of a directory controlled by Apache::MimeDBI

If you use this module, you should be sure to install and load Apache::DBI during the server startup phase as described in Chapter 5. This will make the underlying database connections persistent, dramatically decreasing the time necessary for the handler to do its work.

Listing 7.6: A DBI-Based MIME Type Checker

 package Apache::MimeDBI;
 # file Apache/MimeDBI.pm
 
 use strict;
 use Apache::Constants qw(:common);
 use Tie::DBI ();
 use File::Basename qw(basename);
 
 use constant DEFAULT_DSN => 'mysql:test_www';
 use constant DEFAULT_LOGIN => ':';
 use constant DEFAULT_TABLE => 'doc_types';
 use constant DEFAULT_FIELDS => 'filename:mime_type:encoding';
 
 sub handler {
     my $r = shift;
 
     # get filename
     my $file = basename $r->filename;
 
     # get configuration information
     my $dsn = $r->dir_config('MIMEDatabase') || DEFAULT_DSN;
     my $table = $r->dir_config('MIMETable') || DEFAULT_TABLE;
     my($filefield, $mimefield, $encodingfield) =
         split ':', $r->dir_config('MIMEFields') || DEFAULT_FIELDS;
     my($user, $pass) = split ':', $r->dir_config('MIMELogin') || DEFAULT_LOGIN;
 
     # pull information out of the database
     tie my %DB, 'Tie::DBI', {
         'db' => $dsn, 'table' => $table, 'key' => $filefield,
         'user' => $user, 'password' => $pass,
     };
     my $record;
     return DECLINED unless tied %DB and $record = $DB{$file};
 
     # set the content type and encoding
     $r->content_type($record->{$mimefield});
     $r->content_encoding($record->{$encodingfield})
         if $record->{$encodingfield};
 
     return OK;
 }
 
 1;
 __END__
 
 # Here's a shell script to add the test data
 
 #!/bin/sh
 mysql test_www <<END
 DROP TABLE IF EXISTS doc_types;
 CREATE TABLE doc_types (
     filename char(127) primary key,
     mime_type char(30) not null,
     encoding char(30)
 );
 INSERT into doc_types values ('test1','text/plain',null);
 INSERT into doc_types values ('test2','text/html',null);
 INSERT into doc_types values ('test3','text/html','x-compress');
 INSERT into doc_types values ('test4','text/html','x-gzip');
 INSERT into doc_types values ('test5','image/gif',null);
 END

Customizing the Fixup Phase

The fixup phase is sandwiched between the type checking phase and the response phase. It gives modules a last minute chance to add information to the environment or to modify the request record before the content handler is invoked.

The standard mod_usertrack module implements the CookieTracking directive in this phase, adding a user-tracking cookie to the outgoing HTTP headers, and recording a copy of the incoming cookie to the notes table for logging purposes.

As an example of a useful Perl-based fixup handler, we'll look at Apache::HttpEquiv, a module written by Rob Hartill and used here with his permission. The idea of Apache::HttpEquiv is simple. The module scans the requested HTML file for any <META> tags containing the HTTP-EQUIV and CONTENT attributes. The information is then added to the outgoing HTTP headers.

For example, if the requested file contains this HTML:

 <HTML>
 <HEAD><TITLE>My Page</TITLE>
 <META HTTP-EQUIV="Expires" CONTENT="Fri, 31 Jul 1998 16:40:00 GMT">
 <META HTTP-EQUIV="Set-Cookie" CONTENT="open=sesame">

The handler will convert the <META> tags into these response headers:

 Expires: Fri, 31 Jul 1998 16:40:00 GMT
 Set-Cookie: open=sesame

Listing 7.7 gives the succinct code for Apache::HttpEquiv. The handler() routine begins by testing the current request for suitability. It returns with a status code of DECLINED if any of the following are true:

1. This is a subrequest.
2. The requested document's MIME type is something other than text/html.
3. The requested file cannot be opened.

Item #2 is the main reason that this module has to be run as a fixup handler. Prior to this phase, the MIME type of the document is not known because the MIME type checker hasn't yet run.

Next the handler scans through the requested file, line by line, looking for suitable <META> tags. If any are found, the request object's header_out() method is called to set the indicated header. To gain a little bit of efficiency, the subroutine aborts the search early when a <BODY> or </HEAD> tag is encountered.

Once the file is completely scanned, the subroutine closes it and returns an OK status code.
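
The scanning step is easy to experiment with on a string instead of a file. The sketch below reuses the two patterns from the module; scan_http_equiv() and the sample page are inventions of this example. It also makes the pattern's limits visible: the tag must list HTTP-EQUIV before CONTENT, must use double quotes, and the early exit only recognizes a <BODY> tag written without attributes.

```perl
use strict;
use warnings;

# Collect [name, value] pairs from <META HTTP-EQUIV=... CONTENT=...>
# tags, reading line by line and stopping once the document head ends.
sub scan_http_equiv {
    my $html = shift;
    my @headers;
    for my $line (split /\n/, $html) {
        last if $line =~ m!<BODY>|</HEAD>!i;
        if ($line =~ /META HTTP-EQUIV="([^"]+)"\s+CONTENT="([^"]+)"/i) {
            push @headers, [$1, $2];
        }
    }
    return @headers;
}

my $page = <<'END';
<HTML>
<HEAD><TITLE>My Page</TITLE>
<META HTTP-EQUIV="Content-Language" CONTENT="en">
<META HTTP-EQUIV="Set-Cookie" CONTENT="open=sesame">
</HEAD>
<BODY><META HTTP-EQUIV="Ignored" CONTENT="too late"></BODY>
END

print "$_->[0]: $_->[1]\n" for scan_http_equiv($page);
```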

To configure Apache::HttpEquiv, add the following lines to your configuration file:

 <Location /httpequiv>
 PerlFixupHandler Apache::HttpEquiv
 </Location>

Listing 7.7: Apache::HttpEquiv turns <META> tags into HTTP Headers

 package Apache::HttpEquiv;
 # file: Apache/HttpEquiv.pm
 use strict;
 use Apache::Constants qw(:common);

 sub handler {
     my $r = shift;
     local(*FILE);

     return DECLINED if # don't scan the file if..
         !$r->is_main # a subrequest
         || $r->content_type ne "text/html" # it isn't HTML
         || !open(FILE, $r->filename); # we can't open it

     while(<FILE>) {
         last if m!<BODY>|</HEAD>!i; # exit early if in BODY
         if (m/META HTTP-EQUIV="([^"]+)"\s+CONTENT="([^"]+)"/i) {
             $r->header_out($1 => $2);
         }
     }
     close(FILE);
     return OK;
 }
 
 1;
 __END__

The Logging Phase

The very last phase of the transaction before the cleanup at the end is the logging phase. At this point, the request record contains everything there is to know about the transaction, including the content handler's final status code and the number of bytes transferred from the server to the client.

Apache's built-in logging module mod_log_config ordinarily handles this phase by writing a line of summary information to the transfer log. As its name implies this module is highly configurable. You can give it printf()-like format strings to customize the appearance of the transfer log to your requirements, have it open multiple log files, or even have it pipe the log information to an external process for special processing.

By handling the logging phase yourself you can perform special processing at the end of each transaction. For example, you can update a database of cumulative hits, bump up a set of hit count files, or notify the owner of a document that his page has been viewed. There are a number of log handlers on CPAN, including Apache::DBILogger, which sends log information to a relational database, and Apache::Traffic, which keeps summaries of bytes transferred on a per-user basis.

Send E-Mail When a Page is Hit

The first example of a log handler that we'll show is Apache::LogMail. It sends e-mail to a designated address whenever a particular page is hit, and could be used in low-volume applications such as ISP customers' vanity home pages. A typical configuration directive would look like this:

 <Location /~kryan>
 PerlLogHandler Apache::LogMail
 PerlSetVar LogMailto [email protected]
 PerlSetVar LogPattern \.(html|txt)$
 </Location>

With this configuration in place, hits on pages in the /~kryan directory will generate e-mail messages. The LogMailto Perl configuration variable specifies [email protected] as the lucky recipient of these messages, and LogPattern limits the messages to files that end with .html or .txt (thus eliminating noise from hits on inline images).

Listing 7.8 shows the code. After the usual preliminaries, we define the logging phase's handler() routine:

 sub handler {
     my $r = shift;
 
     my $mailto = $r->dir_config('LogMailto');
     return DECLINED unless $mailto;
     my $filepattern = $r->dir_config('LogPattern');
     return DECLINED if $filepattern
         && $r->filename !~ /$filepattern/;

The subroutine begins by fetching the contents of the LogMailto configuration variable. If it is not defined, the subroutine declines the transaction. Next it fetches the contents of LogPattern. If it finds one, it compares the requested document's filename to the pattern and again declines the transaction if no match is found.

 my $request = $r->the_request;
 my $uri = $r->uri;
 my $agent = $r->header_in("User-agent");
 my $bytes = $r->bytes_sent;
 my $remote = $r->get_remote_host;
 my $status = $r->status_line;
 my $date = localtime;

Now the subroutine gathers up various fields of interest from the request object, including the requested URI, the User-Agent header, the name of the remote host, and the number of bytes sent (method bytes_sent()).

 my $from = $r->server->server_admin || "webmaster";
 local $ENV{PATH}; #keep -T happy
 unless (open MAIL, "|/usr/lib/sendmail -oi -t") {
 $r->log_error("Couldn't open mail: $!");
 return DECLINED;
 }

We take the sender's address from the server administrator setting, falling back to ``webmaster'' if none is configured. We then open a pipe to the sendmail program* and use it to send a message to the designated user with the information we've gathered. The flags used to open up the sendmail pipe instruct it to take the recipient's address from the header rather than the command line (-t), and prevent it from terminating prematurely if it sees a line consisting of a dot (-oi).

 print MAIL <<END;
 To: $mailto
 From: mod_perl httpd <$from>
 Subject: Somebody looked at $uri

 At $date, a user at $remote looked at
 $uri using the $agent browser. 

 The request was $request, 
 which returned a code of $status. 

 $bytes bytes were transferred.
 END
 close MAIL;
 return OK;
 }

All text that we print to the MAIL pipe is transferred to sendmail's standard input. The only trick here is to start the message with a properly formatted mail header with the To:, From: and Subject: fields followed by a blank line. When we close the pipe, the mail is bundled up and sent off for delivery.

The final e-mail message will look something like this:

 From: Mod Perl <[email protected]> 
 To: [email protected]
 Subject: Somebody looked at /~kryan/guestbook.txt 
 Date: Thu, 27 Aug 1998 08:14:23 -0400 
 
 At Thu Aug 27 08:14:23 1998, a user at 192.168.2.1 looked at 
 /~kryan/guestbook.txt using the Mozilla/4.04 [en] (X11; I; Linux 
 2.0.33 i686) browser. 
 
 The request was GET /~kryan/guestbook.txt HTTP/1.0, 
 which returned a code of 200 OK. 
 
 462 bytes were transferred.

Listing 7.8: A Logging Module to Notify of Hits via E-Mail

 package Apache::LogMail;
 # File: Apache/LogMail.pm
 
 use strict;
 use Apache::Constants qw(:common);
 
 sub handler {
 my $r = shift;
 
 my $mailto = $r->dir_config('LogMailto');
 return DECLINED unless $mailto;
 
 my $filepattern = $r->dir_config('LogPattern');
 return DECLINED if $filepattern 
 && $r->filename !~ /$filepattern/;
 
 my $request = $r->the_request;
 my $uri = $r->uri;
 my $agent = $r->header_in("User-agent");
 my $bytes = $r->bytes_sent;
 my $remote = $r->get_remote_host;
 my $status = $r->status_line;
 my $date = localtime;
 
 my $from = $r->server->server_admin || "webmaster";
 local $ENV{PATH}; #keep -T happy
 unless (open MAIL, "|/usr/lib/sendmail -oi -t") {
 $r->log_error("Couldn't open mail: $!");
 return DECLINED;
 }
 
 print MAIL <<END;
 To: $mailto
 From: mod_perl httpd <$from>
 Subject: Somebody looked at $uri
 
 At $date, a user at $remote looked at
 $uri using the $agent browser. 
 
 The request was $request, 
 which returned a code of $status. 
 
 $bytes bytes were transferred.
 END
 
 close MAIL;
 return OK;
 }
 
 1;
 __END__

footnote: *sendmail is only available on Unix systems. If you are using Windows or Windows NT, you would be best served by replacing the piped open with the appropriate calls to the Perl Net::SMTP module. You can find this module on CPAN.
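
As a rough sketch of what the footnote suggests, the piped open could be replaced with Net::SMTP calls along the following lines. The helper names build_message() and send_via_smtp() are hypothetical, the message body is shortened, and which mail host to pass in is an assumption you would adjust for your own site; none of this is part of Apache::LogMail itself.

```perl
use strict;
use warnings;
use Net::SMTP ();

# Hypothetical helper: assemble the mail header and a short body
# as a single string, with a blank line between the two.
sub build_message {
    my %arg = @_;
    return join "\n",
        "To: $arg{to}",
        "From: mod_perl httpd <$arg{from}>",
        "Subject: Somebody looked at $arg{uri}",
        "",
        "$arg{bytes} bytes were transferred.",
        "";
}

# Hypothetical helper: hand the message to an SMTP server.
# Returns 1 on success and 0 if any step failed.
sub send_via_smtp {
    my ($host, $from, $to, $message) = @_;
    my $smtp = Net::SMTP->new($host, Timeout => 30) or return 0;
    my $ok = $smtp->mail($from)
          && $smtp->to($to)
          && $smtp->data($message);
    $smtp->quit;
    return $ok ? 1 : 0;
}
```

Inside the log handler you would call send_via_smtp() where the listing opens the pipe, logging an error and returning DECLINED if it reports failure.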

A DBI Database Logger

The second example of a log phase handler is a DBI database logger. The information from the transaction is sent to a relational database using the DBI interface. The record of each transaction is appended to the end of a relational table, which can be queried and summarized in a myriad of ways using SQL.

This is a skeletal version of the much more complete Apache::DBILogger and Apache::DBILogConfig modules, which you should consult before rolling your own.

In preparation to use this module you'll need to set up a database with the appropriate table definition. A suitable MySQL table named access_log is shown here:

 +---------+--------------+------+-----+---------------------+-------+
 | Field   | Type         | Null | Key | Default             | Extra |
 +---------+--------------+------+-----+---------------------+-------+
 | when    | datetime     |      |     | 0000-00-00 00:00:00 |       |
 | host    | char(255)    |      |     |                     |       |
 | method  | char(4)      |      |     |                     |       |
 | url     | char(255)    |      |     |                     |       |
 | auth    | char(50)     | YES  |     | NULL                |       |
 | browser | char(50)     | YES  |     | NULL                |       |
 | referer | char(255)    | YES  |     | NULL                |       |
 | status  | int(3)       |      |     | 0                   |       |
 | bytes   | int(8)       | YES  |     | 0                   |       |
 +---------+--------------+------+-----+---------------------+-------+

This table can be created with the following script:

 #!/bin/sh
 
 mysql -B test_www <<END
 create table access_log (
 when datetime not null,
 host varchar(255) not null,
 method varchar(4) not null,
 url varchar(255) not null,
 auth varchar(50),
 browser varchar(50),
 referer varchar(255),
 status smallint(3) default 0,
 bytes int(8)
 );
 END

The database must be writable by the Web server, which should be provided with the appropriate username and password to log in.

The code (Listing 7.9) is short and very similar to the previous example, so we won't reproduce it inline.

We begin by bringing in modules that we need, including DBI and the ht_time() function from Apache::Util. Next we declare some constants defining the database, table and database login information. Since this is just a skeleton of a module, we have hard-coded these values rather than take them from PerlSetVar configuration directives. You can follow the model of Apache::MimeDBI if you wish to make this module more configurable.

The handler() subroutine recovers the request object and uses it to fetch all the information we're interested in recording, which we store in locals. We also call ht_time() to produce a nicely-formatted representation of the request_time() in a format that SQL accepts. We connect to the database and create a statement handle containing a SQL INSERT statement. We invoke the statement handle's execute() method to write the information into the database, and return with a status code of OK.
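
ht_time() is only available inside mod_perl, so if you want to experiment with the date format outside the server, a stand-in built on POSIX::strftime behaves similarly. This is a minimal sketch: the name sql_datetime() is hypothetical, and it assumes that the third argument to ht_time() selects GMT when true and local time when false, as the 0 in the listing suggests.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical stand-in for ht_time(): format an epoch time the way a
# SQL DATETIME column expects. A true $gmt selects UTC, false local time.
sub sql_datetime {
    my ($epoch, $gmt) = @_;
    my @t = $gmt ? gmtime($epoch) : localtime($epoch);
    return strftime('%Y-%m-%d %H:%M:%S', @t);
}

print sql_datetime(time, 0), "\n";   # the current local time
```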

One trick in this handler that we did not need in Apache::LogMail is the use of last() to recover the request object. last() returns the final request object in a chain of internal redirects and other subrequests. Usually there are no subrequests and last() just returns the main (first) request object, in which case the $orig and $r objects in Apache::LogDBI point to the same request record. In the event that a subrequest did occur, a / request being resolved to /index.html for example, we want to log the request_time, uri and status from the original request.

Listing 7.9: A DBI Database Log Handler

 package Apache::LogDBI;
 # file: Apache/LogDBI.pm
 use Apache::Constants qw(:common);
 
 use strict;
 use DBI ();
 use Apache::Util qw(ht_time);
 
 use constant DSN => 'dbi:mysql:test_www';
 use constant DB_TABLE => 'access_log';
 use constant DB_AUTH => ':';
 
 sub handler {
 my $orig = shift;
 my $r = $orig->last;
 my $date = ht_time($orig->request_time, '%Y-%m-%d %H:%M:%S', 0);
 my $host = $r->get_remote_host;
 my $method = $r->method;
 my $url = $orig->uri;
 my $user = $r->connection->user;
 my $referer = $r->header_in('Referer');
 my $browser = $r->header_in('User-agent');
 my $status = $orig->status;
 my $bytes = $r->bytes_sent;
 
 my $dbh = DBI->connect(DSN, split ':', DB_AUTH) || die $DBI::errstr;
 my $sth = $dbh->prepare("INSERT INTO ${\DB_TABLE} VALUES(?,?,?,?,?,?,?,?,?)") 
 || die $dbh->errstr;
 
 $sth->execute($date,$host,$method,$url,$user,
 $browser,$referer,$status,$bytes) || die $dbh->errstr;
 return OK;
 }
 
 1;
 __END__

Having Web transactions logged to a relational database gives you the ability to pose questions of great complexity. Just to give you a taste of what's possible, here are a few useful SQL queries to try:

How many hits have I had to date, and how many total bytes transferred?

 SELECT count(*),sum(bytes) FROM access_log;

How many hits did I have the day before yesterday?

 SELECT count(*) FROM access_log 
 WHERE to_days(when)=to_days(now())-2;

How many hits have I had, grouped by hour of access?

 SELECT date_format(when,'%H') as hour,count(*) FROM access_log 
 GROUP BY hour;

What URLs may be broken, and who is pointing at them?

 SELECT url,referer,count(url) FROM access_log 
 WHERE status=404 
 GROUP BY url,referer;

What are the top ten most popular URLs on my site?

 SELECT url,count(*) as count FROM access_log 
 GROUP BY url 
 ORDER BY count desc 
 LIMIT 10;

What is my site's bandwidth, sorted by the hour of day?

 SELECT date_format(when,'%H') as hour,
 sum(bytes)/(60*60) as bytes_per_sec
 FROM access_log 
 GROUP BY hour;

This handler can be installed with the following configuration file directive:

 PerlLogHandler Apache::LogDBI

You can place this directive in the main part of the configuration file in order to log all accesses, or place it in a directory section if you're interested in logging a particular section of the site only.

An alternative is to install Apache::LogDBI as a cleanup handler, as described in the next section.

Registered Cleanups

Although the logging phase is the last official phase of the request cycle, there is one last place where modules can do work. This is the cleanup phase, during which any code registered as a cleanup handler is called to perform any per-transaction tidying up that the module may need to do.

Cleanup handlers can be installed in either of two ways. They can be installed by calling the request object's register_cleanup() method with a reference to a subroutine or method to invoke, or by using the PerlCleanupHandler directive to register a subroutine from within the server configuration file. Examples:

 # within a module file
 $r->register_cleanup(sub { warn "server $$ done serving request\n" });

 # within a configuration file
 # make sure the module is loaded
 PerlModule Apache::Guillotine
 PerlCleanupHandler Apache::Guillotine::mopup()

There is not actually a cleanup phase per se. Instead the C API provides a callback mechanism for functions that are invoked just before their memory pool is destroyed. A handful of Apache API methods use this mechanism underneath for simple but important tasks such as ensuring that files, directory handles and sockets are closed. In Chapter 10, you will see that the C version expects a few more arguments, including the pool pointer.

There are actually two register_cleanup() methods, one associated with the Apache request object, and the other associated with the Apache::Server object. The difference between the two is that handlers installed with the request object's method will be run when the request is done, while handlers installed with the server object's method will be run only when the server shuts down or restarts:

 $r->register_cleanup(sub { warn "child $$ served another request\n" });
 Apache->server->register_cleanup(sub { warn "server $$ restarting\n" });

We've already been using register_cleanup() indirectly with the Apache::File tmpfile() method, where it is used to unlink a temporary file at the end of the transaction even if the handler aborts prematurely. Another example can be found in CGI.pm, where a cleanup handler resets that module's package globals to a known state after each transaction. Here's the relevant code fragment:

 Apache->request->register_cleanup(\&CGI::_reset_globals);

A more subtle use of registered cleanups is to perform delayed processing on requests. For example, certain contributed mod_perl logging modules like Apache::DBILogger and Apache::Traffic take a bit more time to do their work than the standard logging modules do when they append a line of text to a flat file. Although the overhead is small, it does lengthen the amount of time the user has to wait before the browser's progress monitor indicates that the page is fully loaded. In order to squeeze out the last ounce of performance, these modules defer the real work to the cleanup phase. Because cleanups occur after the response is finished, the user will not have to wait for the logging module to complete its work.*

To take advantage of delayed processing, we can run the previous section's Apache::LogDBI module during the cleanup phase rather than the log phase. The change is simple. Just replace the PerlLogHandler directive with PerlCleanupHandler:

 PerlCleanupHandler Apache::LogDBI

footnote: *Of course, moving the work out of the transaction and into the cleanup phase just means that the child server or thread cannot serve another request until this work is done. This only becomes a problem if the number of concurrent requests exceeds the level that your server can handle. In this case, the next incoming request may have to wait a little longer for the connection to be established. You can decide if the subjective tradeoff is worth it.

Because the cleanup handler can be used for post-transactional processing, the Perl API provides post_connection() as an alias for register_cleanup(). This can improve code readability somewhat:

 sub handler {
 shift->post_connection(\&logger);
 }

Cleanup handlers follow the same calling conventions as other handlers. On entry, they receive a reference to an Apache object containing all the accumulated request and response information. They can return a status code if they wish to, but Apache will ignore it.
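
The ordering described in this section can be pictured with a toy model written in plain Perl. This is only an illustration of the idea of deferred work: the ToyRequest class and its note(), events() and run_cleanups() methods are hypothetical and have nothing to do with how Apache implements its memory pool callbacks.

```perl
use strict;
use warnings;

package ToyRequest;   # hypothetical stand-in, not the Apache class

sub new { return bless { cleanups => [], events => [] }, shift }

# Remember a callback to run once the response is complete.
sub register_cleanup {
    my ($self, $code) = @_;
    push @{ $self->{cleanups} }, $code;
}

# Record what happened, in order, so we can inspect it afterward.
sub note   { my ($self, $what) = @_; push @{ $self->{events} }, $what }
sub events { return @{ $_[0]{events} } }

# Run each callback with the request object; return values are ignored.
sub run_cleanups {
    my $self = shift;
    $_->($self) for @{ $self->{cleanups} };
}

package main;

my $r = ToyRequest->new;
$r->register_cleanup(sub { $_[0]->note('logged to database') });
$r->note('response sent');   # the client already has the page
$r->run_cleanups;            # the deferred work happens afterward

print join(', ', $r->events), "\n";
```

The callback is registered before the response goes out, but its work is recorded after it, which is the property the delayed logging modules rely on.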

We've finally run out of transaction phases to talk about, so we turn our attention to a more esoteric aspect of Apache, the proxy server API.

Handling Proxy Requests

The HTTP proxy protocol was originally designed to allow users unfortunate enough to be stuck behind a firewall to access external Web sites. Instead of connecting to the remote server directly, an action forbidden by the firewall, users point their browsers at a proxy server located on the firewall machine itself. The proxy goes out and fetches the requested document from the remote site, and forwards the retrieved document to the user.

Nowadays most firewall systems have a simple Web proxy built right in, so there's no need for dedicated proxying servers. However proxy servers are still useful for a variety of purposes. For example, a caching proxy (of which Apache is one example) will store frequently-requested remote documents in a disk directory and return the cached documents directly to the browser instead of fetching them anew. Anonymizing proxies take the outgoing request and strip out all the headers that can be used to identify the user or his browser. By writing Apache API modules that participate in the proxy process, you can achieve your own special processing of proxy requests.

The proxy request/response protocol is nearly the same as vanilla HTTP. The major difference is that instead of requesting a server-relative URI in the request line, the client asks for a full URL, complete with scheme and host. In addition, a few optional HTTP headers beginning with Proxy- may be added to the request. For example, a normal (non-proxy) HTTP request sent by a browser might look like this:

 GET /foo/index.html HTTP/1.0
 Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
 Pragma: no-cache
 Connection: Keep-Alive
 User-Agent: Mozilla/2.01 (WinNT; I)
 Host: www.modperl.com:80

In contrast, the corresponding HTTP proxy request will look like this:

 GET http://www.modperl.com/foo/index.html HTTP/1.0
 Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
 Pragma: no-cache
 User-Agent: Mozilla/2.01 (WinNT; I)
 Host: www.modperl.com:80 
 Proxy-Connection: Keep-Alive

Notice the URL in the request line of an HTTP proxy request includes the scheme and hostname. This information enables the proxy server to initiate a connection to the distant server. To generate this type of request, the user must configure his browser so that HTTP and, optionally, FTP requests are proxied to the server. This usually involves setting values in the browser's preference screens. An Apache server will be able to respond to this type of request if it has been compiled with the mod_proxy module. This module is part of the core Apache distribution but is not compiled in by default.
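
The difference between the two request lines is easy to test for in plain Perl. The following is a minimal sketch with a hypothetical helper named is_proxy_request_line(); it is meant only to illustrate the protocol difference. Inside an Apache module you would simply ask the request object, as the examples in this section do with proxyreq().

```perl
use strict;
use warnings;

# Hypothetical helper: true if the request line asks for a full URL
# (scheme and host), as a proxy request does, rather than a
# server-relative URI.
sub is_proxy_request_line {
    my $line = shift;
    my ($method, $target, $protocol) = split ' ', $line;
    return 0 unless defined $target;
    return $target =~ m{^[A-Za-z][A-Za-z0-9+.-]*://[^/]+} ? 1 : 0;
}

print is_proxy_request_line('GET /foo/index.html HTTP/1.0'), "\n";
print is_proxy_request_line('GET http://www.modperl.com/foo/index.html HTTP/1.0'), "\n";
```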

You can interact with Apache's proxy mechanism at the translation handler phase. There are two types of interventions you can make. You can take an ordinary (non-proxy) request and change it into a proxy request so that it will be handled by Apache's standard proxy module. Or you can take an incoming proxy request and install your own content handler for it so that you can examine and possibly modify the response from the remote server.

Invoking mod_proxy for Non-Proxy Requests

We'll look first at Apache::PassThru, an example of how to turn an ordinary request into a proxy request.* Because this technique uses Apache's optional mod_proxy module, this module will have to be compiled and installed in order for this example to run on your system.

footnote: *There are several third party Perl API modules on CPAN that handle proxy requests, including one named Apache::ProxyPass and another named Apache::ProxyPassThru. If you are looking for the functionality of Apache::PassThru you should examine one of these more finished products before using this one as the basis for your own module.

The idea behind the example is simple. Requests for URIs beginning with a certain path will be dynamically transformed into a proxy request. For example, we might transform requests for URLs beginning with /CPAN/ into a request for http://www.perl.com/CPAN/. The request to www.perl.com will be done completely behind the scenes; nothing will reveal to the user that the directory hierarchy is being served from a third-party server rather than our own. This functionality is the same as the ProxyPass directive provided by mod_proxy itself. You can also achieve the same effect by providing an appropriate rewrite rule to mod_rewrite.

The configuration for this example uses a PerlSetVar to set a variable named PerlPassThru. A typical entry in the configuration directive will look like this:

 PerlTransHandler Apache::PassThru
 PerlSetVar PerlPassThru '/CPAN/ => http://www.perl.com/,\
 /search/ => http://www.altavista.digital.com/'

The PerlPassThru variable contains a string representing a series of URI=>proxy pairs, separated by commas. A backslash at the end of a line can be used to split the string over several lines, improving readability (the ability to use backslash as a continuation character is actually an Apache configuration file feature, but not a well-publicized one). In this example, we map the URI /CPAN/ to http://www.perl.com/ and /search/ to http://www.altavista.digital.com/. For the mapping to work correctly, local directory names should end with a slash in the manner shown in the example.

The short code for Apache::PassThru is given in Listing 7.10. The handler() subroutine begins by retrieving the request object, and calling its proxyreq() method to determine whether the current request is a proxy request:

 sub handler {
 my $r = shift;
 return DECLINED if $r->proxyreq;

If this is already a proxy request, we don't want to alter it in any way, so we decline the transaction. Otherwise we retrieve the value of PerlPassThru, split it into its key/value components with a pattern match, and store the result in a hash named %mappings:

 my $uri = $r->uri;
 my %mappings = split /\s*(?:,|=>)\s*/, $r->dir_config('PerlPassThru');

We now loop through each of the local paths, looking for a match with the current request's URI. If a match is found, we perform a string substitution to replace the local path with the corresponding proxy URI. Otherwise we continue to loop:

 for my $src (keys %mappings) {
 next unless $uri =~ s/^$src/$mappings{$src}/;
 $r->proxyreq(1);
 $r->uri($uri);
 $r->filename("proxy:$uri");
 $r->handler('proxy-server');
 return OK;
 }

If the URI substitution succeeds, there are four steps we need to take to transform this request into something that mod_proxy will handle. The first two are obvious, but the others are less so. First, we need to set the proxy request flag to a true value by calling $r->proxyreq(1). Next, we change the requested URI to the proxy URI by calling the request object's uri() method. In the third step, we set the request filename to the string ``proxy:'' followed by the URI, as in proxy:http://www.perl.com/CPAN/. This is a special filename format recognized by mod_proxy, and as such is somewhat arbitrary. The last step is to set the content handler to ``proxy-server'', so that the request is passed to mod_proxy to handle the response phase.

 return DECLINED;
 }

If we turned the local path into a proxy request, we return OK from the translation handler. Otherwise we return DECLINED.
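
The parsing and substitution steps can be tried out on their own, without a running server. In this sketch the pass_thru() helper is hypothetical, and the \Q...\E quoting is an addition of ours that the listing does not use; it keeps characters such as a dot in the local path from being treated as regular expression metacharacters.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the handler's logic outside the server.
# Returns the rewritten URL, or nothing if no local path matches.
sub pass_thru {
    my ($uri, $config) = @_;
    my %mappings = split /\s*(?:,|=>)\s*/, $config;
    for my $src (keys %mappings) {
        # \Q...\E is an addition of this sketch, not in the listing
        return $uri if $uri =~ s/^\Q$src\E/$mappings{$src}/;
    }
    return;
}

my $config = '/CPAN/ => http://www.perl.com/, '
           . '/search/ => http://www.altavista.digital.com/';
print pass_thru('/CPAN/modules/index.html', $config), "\n";
```

With the mappings shown, the request for /CPAN/modules/index.html comes back as http://www.perl.com/modules/index.html, while a URI that matches neither local path is left for Apache's default translation.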

Listing 7.10: Invoking Apache's Proxy Request Mechanism from Within a Translation Handler

 package Apache::PassThru;
 # file: Apache/PassThru.pm
 use strict;
 use Apache::Constants qw(:common);

 sub handler {
 my $r = shift;
 return DECLINED if $r->proxyreq;
 my $uri = $r->uri;
 my %mappings = split /\s*(?:,|=>)\s*/, $r->dir_config('PerlPassThru');
 for my $src (keys %mappings) {
 next unless $uri =~ s/^$src/$mappings{$src}/;
 $r->proxyreq(1);
 $r->uri($uri);
 $r->filename("proxy:$uri");
 $r->handler('proxy-server');
 return OK;
 }
 return DECLINED;
 }
 1;
 __END__

An Anonymizing Proxy

As public concern about the ability of Web servers to track people's surfing sessions grows, anonymizing proxies are becoming more popular. An anonymizing proxy is similar to an ordinary Web proxy, except that certain HTTP headers that provide identifying information such as the Referer, Cookie, User-Agent and From fields are quietly stripped from the request before forwarding it on to the remote server. Not only is this identifying information removed, but the identity of the requesting host is obscured. The remote server knows only the hostname and IP address of the proxy machine, not the identity of the machine the user is browsing from.

You can write a simple anonymizing proxy in the Apache Perl API in all of 18 lines (including comments). The source code listing is shown in Listing 7.11. Like the previous example, it uses Apache's mod_proxy, so that module must be installed before this example will run correctly.

The module defines a package global named @Remove containing the names of all the request headers to be stripped from the request. In this example we remove User-Agent, Cookie, Referer, and the now infrequently-used From field. The handler() subroutine begins by fetching the Apache request object and checking whether the current request uses the proxy protocol. However, unlike the previous example where we wanted the existence of the proxy to be secret, here we expect the user to explicitly configure his browser to use our anonymizing proxy. So here we return DECLINED if proxyreq() returns false.

If proxyreq() returns true we now know that we are in the midst of a proxy request. We loop through each of the fields to be stripped and delete them from the incoming headers table by using the request object's header_in() method to set the field to undef. We then return OK to signal Apache to continue processing the request. That's all there is to it.

To activate the anonymizing proxy, install it as a URI translation handler as before:

 PerlTransHandler Apache::AnonProxy

Another alternative that works just as well is to call the module during the later header parser phase (see the discussion of this phase earlier in this chapter). In some ways this makes more sense because we aren't doing any actual URI translation, but we are modifying the HTTP header. Here is the appropriate directive:

 PerlHeaderParserHandler Apache::AnonProxy

The drawback to using PerlHeaderParserHandler like this is that, unlike PerlTransHandler, the directive is allowed in directory configuration sections and .htaccess files. But directory configuration sections are irrelevant in proxy requests, so the directive will silently fail if placed in one of these sections. The directive should go in the main part of one of the configuration files, or in a virtual host section.

Listing 7.11: A Simple Anonymizing Proxy

 package Apache::AnonProxy;
 # file: Apache/AnonProxy.pm
 use strict;
 use Apache::Constants qw(:common);

 my @Remove = qw(user-agent cookie from referer);

 sub handler {
 my $r = shift;
 return DECLINED unless $r->proxyreq;
 foreach (@Remove) {
 $r->header_in($_ => undef);
 } 
 return OK;
 }

 1;
 __END__
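
To see the effect of the stripping step without a proxy in place, the same idea can be applied to an ordinary hash of header fields. This is a sketch only: strip_identifying() is a hypothetical helper, and a plain hash stands in for Apache's incoming headers table.

```perl
use strict;
use warnings;

my @Remove = qw(user-agent cookie from referer);

# Hypothetical helper: return a copy of the headers without the
# identifying fields. Header names are compared case-insensitively.
sub strip_identifying {
    my %headers = @_;
    my %drop = map { (lc($_) => 1) } @Remove;
    return map  { ($_ => $headers{$_}) }
           grep { !$drop{ lc $_ } } keys %headers;
}

my %clean = strip_identifying(
    'User-Agent' => 'Mozilla/4.04 [en] (X11; I; Linux 2.0.33 i686)',
    'Referer'    => 'http://prego/',
    'Accept'     => '*/*',
    'Host'       => 'www.modperl.com:80',
);
print join(', ', sort keys %clean), "\n";
```

Only the Accept and Host fields survive, which matches the dashes that appear in the nosy log once the proxy is installed.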

In order to test that this handler was actually working, we set up a test Apache server as the target of the proxy requests and added the following entry to its configuration file:

 CustomLog logs/nosy_log "%h %{Referer}i %{User-Agent}i %{Cookie}i %U"

This created a ``nosy'' log that contains entries for the referrer, user agent and cookie fields. Before installing the anonymizing proxy module, entries in this log looked like this (the lines have been wrapped to fit on the page):

 192.168.2.5 http://prego/ Mozilla/4.04 [en] (X11; I; Linux 2.0.33 i686) 
 - /tkdocs/tk_toc.ht
 192.168.2.5 http://prego/ Mozilla/4.04 [en] (X11; I; Linux 2.0.33 i686) 
 POMIS=10074 /perl/hangman1.pl

In contrast, after installing the anonymizing proxy module, all the identifying information was stripped out, leaving only the IP address of the proxy machine:

 192.168.2.5 - - - /perl/hangman1.pl
 192.168.2.5 - - - /icons/hangman/h0.gif
 192.168.2.5 - - - /cgi-bin/info2www

Handling the Proxy Process Ourselves

As long as you only need to monitor or modify the request half of a proxy transaction, you can use Apache's mod_proxy module directly as we did in the previous two examples. However, if you also want to intercept the response so as to modify the information returned from the remote server, then you'll need to handle the proxy request on your own.

In this section we present Apache::AdBlocker. This module replaces Apache's mod_proxy with a specialized proxy that filters the content of certain URLs. Specifically, it looks for URLs that are likely to be banner advertisements and replaces their content with a transparent GIF image that says ``Blocked Ad''. This can be used to ``lower the volume'' of commercial sites by removing distracting animated GIFs and brightly colored banners. Figure 7.3 shows what the AltaVista search site looks like when fetched through the Apache::AdBlocker proxy.

Figure 7.3: The AltaVista Search Engine after Filtering by Apache::AdBlocker

The code for Apache::AdBlocker is given in Listing 7.12. It is a bit more complicated than the other modules we've worked with in this chapter, but not much more so. The basic strategy is to install two handlers. The first handler is activated during the URI translation phase. It doesn't actually alter the URI or filename in any way, but does inspect the transaction to see if it is a proxy request. If this is the case, the handler installs a custom content handler to actually go out and do the request. In this respect the translation handler is similar to Apache::Checksum3, which also installs a custom content handler for certain URIs.

Later on, when its content handler is called, the module uses the Perl LWP library to fetch the remote document. If the document does not appear to be a banner ad, the content handler forwards it on to the waiting client. Otherwise the handler does a little switcheroo, replacing the advertisement with a custom GIF image of exactly the same size and shape as the ad. This bit of legerdemain is completely invisible to the browser, which goes ahead and renders the image as if it were the original banner ad.

In addition to the LWP library, this module requires the GD and Image::Size libraries for creating and manipulating images. They are available on CPAN if you do not already have them installed.

Turning to the code, after the familiar preamble we create a new LWP::UserAgent object that we will use to make all our requests for documents from remote servers.

 @ISA = qw(LWP::UserAgent);
 $VERSION = '1.00';
 
 my $UA = __PACKAGE__->new;
 $UA->agent(join "/", __PACKAGE__, $VERSION);

 sub redirect_ok {0}

We subclass LWP::UserAgent, using the @ISA global to create an inheritance relationship between LWP::UserAgent and our own package. The only method we override is LWP::UserAgent's redirect_ok(). Our version always returns false, so LWP never follows a redirect on its own; the redirect response is instead passed back to the browser, which handles it in the usual way.

We now create a new instance of the LWP::UserAgent subclass, using the special token __PACKAGE__ which evaluates at compile time to the name of the current package. In this case, __PACKAGE__->new is equivalent to Apache::AdBlocker->new (or new Apache::AdBlocker if you prefer Smalltalk syntax). Immediately afterward we call the object's agent() method with a string composed of the package name and version number. This is the calling card that LWP sends to the remote hosts' Web servers as the HTTP User-Agent field. The method we use for constructing the User-Agent field creates the string ``Apache::AdBlocker/1.00''.

 my $Ad = join "|", qw{ads? advertisements? banners? adv promotions?};

The last initialization step is to define a package global named $Ad that defines a pattern match that picks up many (but certainly not all) banner advertisement URIs. Most ads contain variants on the words ``ad'', ``advertisement'', ``banner'', ``promotion'' somewhere in the URI, although this may have changed by the time you read this!
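
To get a feel for what the pattern catches, you can try it against a few sample URIs. The looks_like_ad() helper below is hypothetical, and it assumes the pattern is applied case-insensitively and on word boundaries; the sample URIs are made up.

```perl
use strict;
use warnings;

my $Ad = join "|", qw{ads? advertisements? banners? adv promotions?};

# Hypothetical helper: assumes a case-insensitive match on word
# boundaries, so that "download" does not count as "ad".
sub looks_like_ad {
    my $uri = shift;
    return $uri =~ /\b($Ad)\b/i ? 1 : 0;
}

for my $uri ('/ads/banner_468x60.gif',
             '/Promotions/sale.gif',
             '/download/readme.html') {
    printf "%-26s %s\n", $uri, looks_like_ad($uri) ? 'ad' : 'not ad';
}
```

The first two URIs are flagged and the third is not. A pattern this simple will miss some ads and may flag some innocent images.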

 sub handler {
 my $r = shift;
 return DECLINED unless $r->proxyreq;
 $r->handler("perl-script"); #ok, let's do it
 $r->push_handlers(PerlHandler => \&proxy_handler);
 return OK;
 }

The next part of the module is the definition of the handler() subroutine, which in this case will be run during the URI translation phase. It simply checks whether the current transaction is a proxy request and declines the transaction if not. Otherwise, it calls the request object's handler() method to set the content handler to ``perl-script'', and calls push_handlers() to make the module's proxy_handler() subroutine the callback for the response phase of the transaction. handler() then returns OK to flag that it has handled the URI translation phase.

Most of the work is done in proxy_handler(). Its job is to use LWP's object-oriented methods to create an HTTP::Request object. The HTTP::Request is then forwarded to the remote host by the LWP::UserAgent, returning an HTTP::Response. The response must then be forwarded on to the waiting browser, possibly after replacing the content. The only subtlety here is the need to copy the request headers from the incoming Apache request's headers_in() table to the HTTP::Request, and, in turn, to copy the response headers from the HTTP::Response into the Apache request headers_out() table. If this copying back and forth isn't performed, then documents that rely on the exact values of certain HTTP fields, such as CGI scripts, will fail to work correctly across the proxy.

 sub proxy_handler {
 my $r = shift;
 
 my $request = HTTP::Request->new($r->method, $r->uri);

proxy_handler() starts by recovering the Apache request object. It then uses the request object's method() and uri() methods to fetch the request method and the URI. These are used to create and initialize a new HTTP::Request. We now feed the incoming header fields from the Apache request object into the corresponding fields in the outgoing HTTP::Request:

 $r->headers_in->do(sub {
     $request->header(@_);
 });

We use a little trick to accomplish the copy. The headers_in() method, as opposed to the header_in() method that we have seen before, returns an instance of the Apache::Table class. This class, described in more detail in the next chapter (see The Apache::Table Class), implements methods for manipulating Apache's various table-like structures, including the incoming and outgoing HTTP header fields. One of these methods is do(), which, when passed a CODE reference, invokes the code once for each header field, passing the routine the header's name and value each time. In this case, we call do() with an anonymous subroutine that passes the header keys and values on to the HTTP::Request object's header() method. It is important to use headers_in->do() here rather than copying the headers into a hash because certain headers, particularly Cookie, can be multivalued.
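
To see why a hash is the wrong container for multivalued fields, here is a minimal standalone sketch. It does not use Apache::Table at all: the list of pairs and the table_do() helper are hypothetical stand-ins that only mimic the callback style of do(), so the script can be run outside the server.

```perl
use strict;
use warnings;

# Mimics the calling convention of do(): invoke the callback once per
# entry, passing it the field name and value. (Hypothetical helper.)
sub table_do {
    my ($table, $callback) = @_;
    $callback->(@$_) for @$table;
}

# Copying through a hash keeps only the last value seen for each name.
sub copy_via_hash {
    my ($table) = @_;
    my %copy;
    table_do($table, sub { my ($name, $value) = @_; $copy{$name} = $value });
    return \%copy;
}

# Copying pair by pair preserves every value, in the original order.
sub copy_via_callback {
    my ($table) = @_;
    my @copy;
    table_do($table, sub { push @copy, [@_] });
    return \@copy;
}

# An ordered list of [name, value] pairs; Cookie appears twice.
my @headers = (
    [ 'Accept' => 'text/html'   ],
    [ 'Cookie' => 'session=abc' ],
    [ 'Cookie' => 'theme=dark'  ],
);

my $hash  = copy_via_hash(\@headers);
my $pairs = copy_via_callback(\@headers);
printf "hash copy kept %d entries, callback copy kept %d\n",
    scalar(keys %$hash), scalar(@$pairs);
```

The same care is needed on the receiving side: if the destination is an HTTP::Headers object, push_header() appends a value, whereas header() replaces any value already stored under that name.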

 # copy POST data, if any
 if($r->method eq 'POST') {
     my $len = $r->header_in('Content-length');
     my $buf;
     $r->read($buf, $len);
     $request->content($buf);
 }

The next block of code checks whether the request method is POST. If so, we must copy the POSTed data from the incoming request to the HTTP::Request object. We do this by calling the request object's read() method to read the POST data into a temporary buffer. The data is then copied into the HTTP::Request by calling its content() method. Request methods other than POST may include a request body, but this example does not cope with these rare cases.
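
The idea of reading exactly as many bytes as Content-length announces can be tried outside the server. In the sketch below, read_body() is a hypothetical helper and an in-memory filehandle stands in for the client connection; inside mod_perl the request object's read() method plays this role.

```perl
use strict;
use warnings;

# Read up to $len bytes of request body from a filehandle.
# Hypothetical helper, not part of the Apache API.
sub read_body {
    my ($fh, $len) = @_;
    # No usable Content-length means there is nothing to read.
    return '' unless defined $len && $len =~ /^\d+$/ && $len > 0;
    my $buf = '';
    while (length($buf) < $len) {
        # Append to $buf at its current end until $len bytes have arrived.
        my $got = read($fh, $buf, $len - length($buf), length($buf));
        last unless $got;    # EOF or error: return what we have so far
    }
    return $buf;
}

# Usage: pretend this string is the body of a POSTed fill-out form.
my $posted = 'name=fred&color=blue';
open my $in, '<', \$posted or die "cannot open in-memory handle: $!";
my $body = read_body($in, length $posted);
print "read ", length($body), " bytes\n";
```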

The HTTP::Request object is now complete, so we can actually issue the request:

 my $response = $UA->request($request);

We pass the HTTP::Request object to the user agent's request() method. After a brief delay for the network fetch, the call returns an HTTP::Response object, which we copy into a variable named $response.

 $r->content_type($response->header('Content-type'));
 $r->status($response->code);
 $r->status_line(join " ", $response->code, $response->message);

Now the process of copying the headers is reversed. Every header in the LWP HTTP::Response object must be copied to the Apache request object. First we handle a few special cases. We call the HTTP::Response object's header() method to fetch the content type of the returned document and immediately pass the result to the Apache request object's content_type() method. Next, we set the numeric HTTP status code and the human-readable HTTP status line. We call the HTTP::Response object's code() and message() methods to return the numeric code and human readable messages respectively, and copy them to the Apache request object, using the status() and status_line() methods to set the values.

When the special case headers are done, we copy all the other header fields, using the HTTP::Response object's scan() method:

 $response->scan(sub {
     $r->header_out(@_);
 });

scan() is similar to the Apache::Table do() method: it loops through each of the header fields, invoking an anonymous callback routine for each one. The callback sets the corresponding field in the Apache request object using the header_out() method.

 if ($r->header_only) {
     $r->send_http_header();
     return OK;
 }

The outgoing header is complete at this point, so we check whether the current transaction is a HEAD request. If so, we emit the HTTP header and exit with an OK status code.

 my $content = \$response->content;
 if($r->content_type =~ /^image/ and $r->uri =~ /\b($Ad)\b/i) {
     block_ad($content);
     $r->content_type("image/gif");
 }

Otherwise, the time has come to deal with potential banner ads. To identify likely ads, we require that the document be an image and that its URI satisfy the regular expression match defined at the top of the module. We retrieve the document contents by calling the HTTP::Response object's content() method,* and store a reference to the contents in a local variable named $content. We now check whether the document's MIME type is one of the image variants and whether the URI satisfies the advertisement pattern match. If both of these are true, we call block_ad() to replace the content with a customized image. We also set the document's content type to image/gif, since this is what block_ad() produces.
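
It can be instructive to try the pattern against a few URIs before trusting it with live traffic. The following standalone sketch reuses the alternation from the module; the is_likely_ad() helper and the example.com addresses are our own inventions for illustration.

```perl
use strict;
use warnings;

# Same alternation as the one defined at the top of the module.
my $Ad = join "|", qw{ads? advertisements? banners? adv promotions?};

# Hypothetical helper combining the two tests: the document must be
# an image, and the URI must contain one of the ad words as a whole word.
sub is_likely_ad {
    my ($content_type, $uri) = @_;
    return ($content_type =~ /^image/ and $uri =~ /\b($Ad)\b/i) ? 1 : 0;
}

for my $uri (
    'http://www.example.com/ads/banner1.gif',
    'http://ad.example.com/pix/x.gif',
    'http://www.example.com/images/logo.gif',
    'http://www.example.com/download/readme.gif',
    'http://www.example.com/banner_1.gif',
) {
    printf "%-45s %s\n", $uri,
        is_likely_ad('image/gif', $uri) ? 'blocked' : 'passed';
}
```

Because \b treats the underscore as a word character, a name such as banner_1.gif is not caught by this pattern, while the "ad" buried inside "download" and "readme" is correctly ignored.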

 $r->content_type('text/html') unless $$content;
 $r->send_http_header;
 $r->print($$content || $response->error_as_HTML);

We send the HTTP header, then print the document contents. Notice that the document content may be empty, which can happen when LWP connects to a server that is down or busy. In this case, instead of printing an empty document, we return the nicely-formatted error message returned by the HTTP::Response object's error_as_HTML() method.
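
One edge case is worth knowing about: the || operator tests for truth, not for emptiness, so a document whose entire content is the string "0" would also be replaced by the error page. Should that matter for your application, a length-based test distinguishes the two cases. The sketch below is standalone, and body_or_fallback() is a hypothetical helper rather than part of the module.

```perl
use strict;
use warnings;

# Return the document body unless it is undefined or zero-length,
# in which case return the fallback text. (Hypothetical helper.)
sub body_or_fallback {
    my ($content_ref, $fallback) = @_;
    return (defined $$content_ref && length $$content_ref)
        ? $$content_ref
        : $fallback;
}

my $zero  = "0";    # a legitimate one-character document
my $empty = "";     # what we get when the remote server is unreachable

print body_or_fallback(\$zero,  "error page"), "\n";
print body_or_fallback(\$empty, "error page"), "\n";
```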

 return OK;
 }

Our work is done, so we return an OK status code.

footnote: *In this example we call the response object's content() method to slurp the document content into a scalar. However, it can be more efficient to use the three-argument form of LWP::UserAgent's request() method to read the content in fixed-size chunks. See the LWP::UserAgent manual page for details.

The block_ad() subroutine is short and sweet. Its job is to take an image in any of several possible formats and replace it with a custom GIF of exactly the same dimensions. The GIF will be transparent, allowing the page background color to show through, and will have the words "Blocked Ad" printed in large friendly letters in the upper left-hand corner.

 sub block_ad {
     my $data = shift;
     my($x, $y) = imgsize($data);
 
     my $im = GD::Image->new($x,$y);

To get the width and height of the image we call imgsize(), a function imported from the Image::Size module. imgsize() recognizes most Web image formats, including GIF, JPEG, XBM and PNG. Using these values, we create a new blank GD::Image object and store it in a variable named $im.

 my $white = $im->colorAllocate(255,255,255);
 my $black = $im->colorAllocate(0,0,0); 
 my $red = $im->colorAllocate(255,0,0);

We call the image object's colorAllocate() method three times to allocate color table entries for white, black and red. Then we declare that the white color is transparent, using the transparent() method.

 $im->transparent($white);
 $im->string(GD::gdLargeFont(),5,5,"Blocked Ad",$red);
 $im->rectangle(0,0,$x-1,$y-1,$black);
 
 $$data = $im->gif;
 }

The routine calls the string() method to draw the message starting at coordinates (5,5), and finally frames the whole image with a black rectangle. The custom image is now converted into GIF format with the gif() method, and copied into $$data, overwriting whatever was there before.

Activating this module is just a matter of adding the following line to one of the configuration files:

 PerlTransHandler Apache::AdBlocker

Users who wish to make use of this filtering service should configure their browsers to proxy their requests through your server.

Listing 7.12: A Banner Ad Blocking Proxy

 package Apache::AdBlocker;
 # file: Apache/AdBlocker.pm
 
 use strict;
 use vars qw(@ISA $VERSION);
 use Apache::Constants qw(:common);
 use GD ();
 use Image::Size qw(imgsize);
 use LWP::UserAgent ();
 
 @ISA = qw(LWP::UserAgent);
 $VERSION = '1.00';
 
 my $UA = __PACKAGE__->new;
 $UA->agent(join "/", __PACKAGE__, $VERSION);
 
 sub redirect_ok {0}
 
 my $Ad = join "|", qw{ads? advertisements? banners? adv promotions?};
 
 sub handler {
     my $r = shift;
     return DECLINED unless $r->proxyreq;
     $r->handler("perl-script"); #ok, let's do it
     $r->push_handlers(PerlHandler => \&proxy_handler);
     return OK;
 }
 
 sub proxy_handler {
     my $r = shift;
 
     my $request = HTTP::Request->new($r->method, $r->uri);
 
     $r->headers_in->do(sub {
         $request->header(@_);
     });
 
     # copy POST data, if any
     if($r->method eq 'POST') {
         my $len = $r->header_in('Content-length');
         my $buf;
         $r->read($buf, $len);
         $request->content($buf);
     }
 
     my $response = $UA->request($request);
     $r->content_type($response->header('Content-type'));
 
     #feed response back into our request_rec
     $r->status($response->code);
     $r->status_line(join " ", $response->code, $response->message);
     $response->scan(sub {
         $r->header_out(@_);
     });
 
     if ($r->header_only) {
         $r->send_http_header();
         return OK;
     }
 
     my $content = \$response->content;
     if($r->content_type =~ /^image/ and $r->uri =~ /\b($Ad)\b/i) {
         block_ad($content);
         $r->content_type("image/gif");
     }
 
     $r->content_type('text/html') unless $$content;
     $r->send_http_header;
     $r->print($$content || $response->error_as_HTML);
 
     return OK;
 }
 
 sub block_ad {
     my $data = shift;
     my($x, $y) = imgsize($data);
 
     my $im = GD::Image->new($x,$y);
 
     my $white = $im->colorAllocate(255,255,255);
     my $black = $im->colorAllocate(0,0,0);
     my $red = $im->colorAllocate(255,0,0);
 
     $im->transparent($white);
     $im->string(GD::gdLargeFont(),5,5,"Blocked Ad",$red);
     $im->rectangle(0,0,$x-1,$y-1,$black);
 
     $$data = $im->gif;
 }
 
 1;
 __END__

Perl Server-Side Includes

Another feature of mod_perl is that it integrates with the Apache mod_include server-side include system. Provided that mod_perl was built with the PERL_SSI option (or with the recommended setting of EVERYTHING=1), the Perl API adds a new #perl element to the standard mod_include server-side include system, allowing server-side includes to call Perl subroutines directly.

The syntax for calling Perl from SSI documents looks like this:

 <!--#perl sub="subroutine" args="arguments"-->

The tag looks like other server-side include tags but contains the embedded element #perl. The #perl element recognizes two attributes, sub and args. The required sub attribute specifies the subroutine to be invoked. This attribute must occur once only in the tag. It can be the name of any subroutine already loaded into the server (with a PerlModule directive, for instance), or an anonymous subroutine created on the fly. When this subroutine is invoked, it is passed a blessed Apache request object just as if it were a handler for one of the request phases. Any text that the subroutine prints will appear on the HTML page.

The optional args attribute can occur once or several times in the tag. If present, args attributes specify additional arguments to be passed to the subroutine. They will be presented to the subroutine in the same order in which they occur in the tag.

Listing 7.18 shows a simple server-side include document that uses #perl elements. It has two Perl includes. The simpler of the two is just a call to a routine named MySSI::remote_host(). When executed, it calls the request object's get_remote_host() method to fetch the DNS name of the remote host machine:

 <!--#perl sub="MySSI::remote_host" -->

MySSI::remote_host() must be preloaded in order for this include to succeed. One way to do this is inside the Perl startup file. Alternatively, it could be defined in a module named MySSI.pm and loaded with the directive PerlModule MySSI. In either case, the definition of remote_host() looks like this:

 package MySSI;
 sub remote_host {
     my $r = shift;
     print $r->get_remote_host;
 }

We could also have defined the routine to call the request object's print() method, as in $r->print($r->get_remote_host). It's your call.

The more complex of the two includes defined in this example calls a Perl subroutine that it creates on the fly. It looks like this:

 <!--#perl arg="Hello" arg="SSI" arg="World"
 sub="sub {
 my($r, @args) = @_;
 print qq(@args);
 }"
 -->

In this case the sub attribute points to an anonymous subroutine defined using the sub { } notation. This subroutine retrieves the request object and a list of arguments, which it simply prints out. Because double quotes are already used to surround the attribute, we use Perl's qq operator to surround the arguments. An equally valid alternative would be to backslash the quotes, as in print \"@args\".

This tag also has three arg attributes, which are passed, in order of appearance, to the subroutine.
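
The calling convention can be tried outside the server with a small simulation. In this sketch, call_ssi_sub() is a hypothetical stand-in for what the #perl element does on our behalf: it passes the request object first, followed by each argument in tag order, and collects whatever the subroutine prints. An empty hash reference stands in for the Apache request object.

```perl
use strict;
use warnings;

# Call $sub the way the #perl element is described as doing, and
# return everything it printed. (Hypothetical helper for illustration.)
sub call_ssi_sub {
    my ($sub, $r, @args) = @_;
    my $output = '';
    open my $fh, '>', \$output or die "cannot open in-memory handle: $!";
    my $old = select($fh);    # make bare print() write into $output
    $sub->($r, @args);
    select($old);             # restore the previous default filehandle
    close $fh;
    return $output;
}

my $fake_r = {};    # placeholder; the server passes a real Apache object
my $text = call_ssi_sub(
    sub {
        my($r, @args) = @_;
        print qq(@args);
    },
    $fake_r, qw(Hello SSI World),
);
print "$text\n";
```

Interpolating @args inside qq() joins the elements with a single space, which is why the three separate arg attributes come out as one phrase.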

In order to try this example out, you'll have to have server-side includes activated. This can be done by uncommenting the following two lines in the standard srm.conf server configuration file:

 AddType text/html .shtml
 AddHandler server-parsed .shtml

You'll also have to activate the Includes option in the directory in which the document is located. The final result is shown in Figure 7.4.

Listing 7.18: This server-side include document uses #perl elements

 <html>
 <!-- file: perl_include.shtml -->
 <head>
 <title> mod_include #perl example </title>
 </head>
 <body>
 <h1>mod_include #perl example</h1>
 
 This document uses the <i>mod_include</i> <b>perl</b> command to 
 invoke Perl subroutines.
 
 <h3>Here is an Anonymous Subroutine</h3>
 
 Message =
 
 <!--#perl arg="Hello" arg="SSI" arg="World"
 sub="sub {
 my($r, @args) = @_;
 print qq(@args);
 }"
 -->
 
 <h3>Here is a Predefined Subroutine</h3>
 
 Remote host = <!--#perl sub="MySSI::remote_host" -->
 
 <hr>
 </body>
 </html>

Figure 7.4: The page displayed by the example server-side include document.

That's all there is to it. You can mix and match any of the standard mod_include commands in your document along with any Perl code that you see fit. There's also an Apache::Include module included with the mod_perl distribution that allows you to invoke Apache::Registry scripts directly from within server-side includes. See Appendix A for details.

While this approach is simple, it is not particularly powerful. If you wish to produce complex server-side include documents with conditional sections and content derived from databases, we recommend that you explore HTML::Embperl, Apache::ePerl, HTML::Mason and other template-based systems that can be found on CPAN. Also see Appendix E: HTML::Embperl, which contains an abbreviated version of the HTML::Embperl manual page, courtesy of Gerald Richter.

Subclassing the Apache Class

It's appropriate that the last topic we discuss in this chapter is how to extend the Apache class itself with Perl's subclassing mechanism. Because the Perl API is object-oriented, you are free to subclass the Apache class should you wish to override its behavior in any way.

To be successful, the new class must add Apache (or another Apache subclass) to its @ISA array. In addition, the subclass's new() method must return a blessed hash reference which contains either an r or _r key. This key must point to a bona fide Apache object.

The example below (Listing 7.19) subclasses Apache, overriding the print() and rflush() methods. The Apache::MyRequest::print() method does not send data directly to the client. Instead, it pushes all data onto an array reference stored inside the Apache::MyRequest object. When the rflush() method is called, the superclass methods print() and rflush() are called to actually send the data to the client.

Here is an example of an Apache::Registry script that uses Apache::MyRequest. The send_http_header() method is inherited from the Apache class, while the print() and rflush() methods invoke those in the Apache::MyRequest class:

 use Apache::MyRequest ();
 sub handler {
     my $r = Apache::MyRequest->new(shift);
     $r->send_http_header('text/plain');
     $r->print(qw(one two three));
     $r->rflush;
     ...
 }

Listing 7.19: Apache::MyRequest is a subclass of Apache

 package Apache::MyRequest;
 use strict;
 
 use Apache ();
 use vars qw(@ISA);
 @ISA = qw(Apache);
 
 sub new {
     my($class, $r) = @_;
     $r ||= Apache->request;
     return bless {
         '_r' => $r,
         'data' => [],
     }, $class;
 }
 
 sub print {
     my $self = shift;
     push @{$self->{data}}, @_;
 }
 
 sub rflush {
     my $self = shift;
     $self->SUPER::print("MyDATA:\n", join "\n", @{$self->{data}});
     $self->SUPER::rflush;
     @{$self->{data}} = ();
 }
 
 1;
 __END__
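
The buffering behavior of Listing 7.19 can be explored without a running server. The sketch below is only an approximation: FakeApache is a hypothetical stand-in for the real Apache class that records what would have been sent to the client, and BufferedRequest inherits from it directly rather than wrapping a request object under the _r key as the real subclass must.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the Apache class: it remembers each chunk
# that would have gone to the client instead of sending it anywhere.
package FakeApache;
sub new    { return bless { sent => [] }, shift }
sub print  { my $self = shift; push @{ $self->{sent} }, join '', @_ }
sub rflush { 1 }

# Same print() and rflush() overrides as Apache::MyRequest.
package BufferedRequest;
our @ISA = ('FakeApache');

sub new {
    my ($class) = @_;
    my $self = FakeApache::new($class);
    $self->{data} = [];
    return $self;
}

sub print {
    my $self = shift;
    push @{ $self->{data} }, @_;    # buffer, don't send
}

sub rflush {
    my $self = shift;
    $self->SUPER::print("MyDATA:\n", join "\n", @{ $self->{data} });
    $self->SUPER::rflush;
    @{ $self->{data} } = ();        # start a fresh buffer
}

package main;

my $r = BufferedRequest->new;
$r->print(qw(one two three));
print "chunks sent before flush: ", scalar @{ $r->{sent} }, "\n";
$r->rflush;
print $r->{sent}[0], "\n";
```

Nothing reaches the stand-in client until rflush() is called, at which point the buffered items arrive as a single chunk, one per line, under the MyDATA: banner.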

The next chapter covers another important topic in the Apache Perl API: how to control and customize the Apache configuration process so that modules can implement first-class configuration directives of their own.