CGI Tutorial

Sambar Server Documentation

Overview
The Common Gateway Interface (CGI) is an invaluable programming aid for learning how the WWW functions. CGIs provide a way for you to run a program in response to a WWW server request. Perl is the most common language used for CGI programs, but virtually any language can be used (e.g. sh, AppleScript, Visual Basic, Delphi, C, C++). The original CGI/1.1 specification is still available at the NCSA site.

HTML FORMs
Before diving into CGIs, you must understand HTML FORMs. If you've ever filled out a series of fields in a browser and clicked on the "submit" button, you've seen an HTML FORM. The data in the HTML FORMs typically provides the input to server-side programs (i.e. CGIs); the CGIs take the HTML FORM data and perform some action like placing an order with the vendor's purchasing system, or sending mail to a company employee.

The following is a simple HTML FORM which executes a Perl script that displays the FORM contents.

	<form method="post" action="/cgi-bin/dumpenv.pl">
	Email: <input type="text" name="email" size="25">
	Message: <textarea name="message" rows=3 cols=60> </textarea>
	<input type=submit value="Send message">
	</form>

The HTML FORM action identifies the CGI program that will do something with the data from the form. In the above example, the CGI dumpenv.pl is the script that will receive the form data. The "method" tells the browser how to package the content when sending it to the WWW server. There are two basic methods: GET and POST. There is very little functional difference between these two methods; the significant differences are:

- If you use GET (the default if no method is specified), all the data from the form will show up in one long line in the URL window of your browser when you submit it.
- Many WWW servers impose severe limits on the amount of data that may be passed via the GET method.
- CGI programs receive variable data from GET FORMs in the environment variable QUERY_STRING, whereas variable data passed via POST is available from the stdin stream. Important: The server will not send an EOF at the end of the stdin data. Instead, you must use the environment variable CONTENT_LENGTH to determine how much data you should read from stdin (see example below).

Inside a FORM, the INPUT, SELECT, and TEXTAREA tags are used to specify interface elements. Each INPUT field in a FORM must have parameters indicating the "type" (e.g. text for textual input fields) and "name" of the field. There are numerous INPUT attributes, including:

TYPE must be one of:
- "text" (text entry field; this is the default)
- "password" (text entry field; entered characters are represented as asterisks)
- "checkbox" (a single toggle button; on or off)
- "radio" (a single toggle button; on or off; other toggles with the same NAME are grouped into "one of many" behavior)
- "submit" (a pushbutton that causes the current form to be packaged up into a query URL and sent to a remote server)
- "reset" (a pushbutton that causes the various input elements in the FORM to be reset to their default values)
NAME is the symbolic name (not a displayed name -- normal HTML within the form is used for that) for this input field. This must be present for all types but "submit" and "reset", as it is used when putting together the query string that gets sent to the remote server when the filled-out form is submitted.
VALUE, for a text or password entry field, can be used to specify the default contents of the field. For a checkbox or a radio button, VALUE specifies the value of the button when it is checked (unchecked checkboxes are disregarded when submitting queries); the default value for a checkbox or radio button is "on". For types "submit" and "reset", VALUE can be used to specify the label for the pushbutton.
CHECKED (no value needed) specifies that this checkbox or radio button is checked by default; this is only appropriate for checkboxes and radio buttons.
SIZE is the physical size of the input field in characters; this is only appropriate for text entry fields and password entry fields. If this is not present, the default is 20.
MAXLENGTH is the maximum number of characters that are accepted as input; this is only appropriate for text entry fields and password entry fields (and only for single-line text entry fields at that). If this is not present, the default will be unlimited. The text entry field is assumed to scroll appropriately if MAXLENGTH is greater than SIZE.

When the user clicks on the "submit" button, the browser sends all the data from the input fields to the program designated in the "action" line. Important: Every FORM must end with </form> so that the browser knows where the form ends.

Passing FORM data
When the user clicks on the "submit" button on a form, the browser program links the name/value pairs of field data together into one long buffer:

http://localhost/cgi-bin/dumpenv.pl?email=foobar&message=This+is+a+test

Note: The above URL would be displayed in the browser if the GET "method" was used. (POST transports the data slightly differently, but the idea is the same.) The first portion of the URL indicates what server to send the request to: http://localhost. Localhost is a special term for the local machine. The next portion of the URL indicates the CGI script to execute: /cgi-bin/dumpenv.pl. Finally, the remainder of the URL following the question mark (?) is the concatenated name/value pairs for the data in an encoded format.
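The packaging step can be sketched in Perl. This is only an illustration of the encoding, not the code any particular browser uses; the helper names url_encode and build_query are hypothetical, and the sketch assumes the field data is plain single-byte text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Encode one string the way FORM data is encoded: spaces become "+",
# and anything other than letters, digits, "-", "_", "." and "*"
# becomes %XX (two hex digits). Assumes single-byte characters.
sub url_encode {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9\-_.* ])/sprintf("%%%02X", ord($1))/eg;
    $text =~ tr/ /+/;
    return $text;
}

# Join name/value pairs into one buffer: name=value&name=value...
# The pairs are passed as a flat list so the field order is preserved.
sub build_query {
    my @fields = @_;
    my @pairs;
    while (@fields) {
        my ($name, $value) = splice(@fields, 0, 2);
        push @pairs, url_encode($name) . '=' . url_encode($value);
    }
    return join('&', @pairs);
}

print build_query(email => 'foobar', message => 'This is a test'), "\n";
# prints: email=foobar&message=This+is+a+test
```

Note that a literal "+" or "&" inside a field value is escaped (as %2B and %26), which is what keeps the separators unambiguous.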

The server receives the request and first attempts to find the /cgi-bin directory configured for the server. Next, it determines if and how to execute the script dumpenv.pl. Important: By default, many web servers do not permit CGI execution. WWW servers can be configured to recognize CGI programs in different ways. For some, any URL that calls for a file in a certain directory (often, "cgi-bin") indicates that the WWW server should try to run whatever it finds there as a CGI program. Others can be configured to use the file extension (the ".pl" or ".cgi") to indicate that certain files are programs rather than HTML pages, graphics, or other file types. You must understand how the server has been configured to execute CGI programs before you can proceed. For the remainder of this example, we assume that the web server is set up to recognize anything ending in .pl as a Perl CGI program and that there is a "cgi-bin" directory for script execution.
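The two recognition rules described above can be sketched as follows. This is a simplified model rather than the logic of any real server; the function is_cgi_request and the settings $cgi_dir and @cgi_extensions are hypothetical names chosen for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical settings; a real server reads these from its configuration.
my $cgi_dir        = '/cgi-bin/';
my @cgi_extensions = ('.pl', '.cgi');

# Return 1 if the request path should be run as a CGI program, else 0.
sub is_cgi_request {
    my ($path) = @_;
    $path =~ s/\?.*//s;                        # ignore any query string
    return 1 if index($path, $cgi_dir) == 0;   # file lives under cgi-bin
    foreach my $ext (@cgi_extensions) {
        return 1 if $path =~ /\Q$ext\E\z/;     # file has a CGI extension
    }
    return 0;
}

foreach my $path ('/cgi-bin/dumpenv.pl?email=foobar',
                  '/scripts/form.cgi',
                  '/docs/index.html') {
    printf "%-35s %s\n", $path, is_cgi_request($path) ? 'CGI' : 'static';
}
```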

The browser appends a "?" onto the end of the URI in order to indicate that what follows is data for the program to use: http://localhost/cgi-bin/dumpenv.pl?. The WWW server then parses the URL and breaks the request into the URI, http://localhost/cgi-bin/dumpenv.pl, and the name/value pair arguments, email=foobar&message=This+is+a+test. The question mark (?) designates the separation. Whatever you specify in the "name=" attribute of a FORM field becomes the name, and whatever the user submits for that field becomes the value. Each name/value pair is separated in the URL line by the ampersand (&).
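The separation described above can be sketched in a few lines of Perl. The function name split_request is hypothetical, and the values are deliberately left in their encoded form, since decoding is a separate step.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a request URL at the first "?" into the URI and the raw arguments,
# then break the arguments into name/value pairs on "&" and "=".
sub split_request {
    my ($url) = @_;
    my ($uri, $args) = split(/\?/, $url, 2);
    $args = '' unless defined $args;

    my @pairs;
    foreach my $pair (split(/&/, $args)) {
        next if $pair eq '';                         # skip empty segments
        my ($name, $value) = split(/=/, $pair, 2);
        push @pairs, [$name, defined $value ? $value : ''];
    }
    return ($uri, @pairs);
}

my ($uri, @pairs) = split_request(
    'http://localhost/cgi-bin/dumpenv.pl?email=foobar&message=This+is+a+test');
print "URI: $uri\n";
print "$_->[0] => $_->[1]\n" foreach @pairs;
```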

Parsing FORM data
The CGI program receives the name/value pair arguments in one long line either via the QUERY_STRING environment variable or stdin. The program is then required to split the name/value pairs up and decode the strings for use.

For POST or PUT FORM data, the information will be sent to the CGI script via stdin. The server will send CONTENT_LENGTH bytes on this file descriptor. For example, the FORM sample above might send 35 bytes encoded as: email=foobar&message=This+is+a+test. In this case, the server will set the CONTENT_LENGTH environment variable to 35 and set the CONTENT_TYPE environment variable to application/x-www-form-urlencoded. The first byte on the CGI program's standard input will be "e", followed by the rest of the encoded string.
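Because no EOF marks the end of the data, the script has to count bytes. The following is a minimal sketch of that idea, assuming a hypothetical helper named read_body; an in-memory filehandle stands in for the server so that the example can be run from the command line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read exactly $length bytes from $fh, looping because a single read()
# may return fewer bytes than requested. Stops early if the stream ends.
sub read_body {
    my ($fh, $length) = @_;
    my $buffer = '';
    while (length($buffer) < $length) {
        my $got = read($fh, $buffer, $length - length($buffer), length($buffer));
        last unless $got;      # 0 = end of stream, undef = error
    }
    return $buffer;
}

# Stand-in for the server: the POST body followed by unrelated bytes
# that must not be consumed.
my $body   = 'email=foobar&message=This+is+a+test';
my $stream = $body . 'EXTRA';
open(my $fh, '<', \$stream) or die "cannot open in-memory handle: $!";

my $data = read_body($fh, length($body));
print length($data), " bytes: $data\n";
# prints: 35 bytes: email=foobar&message=This+is+a+test
```

In a real CGI script the call would look like read_body(\*STDIN, $ENV{'CONTENT_LENGTH'}).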

Fortunately, there are many packages available to decode CGI arguments into usable form. The CGI program sends its output to stdout. This output can either be a document generated by the program, or instructions to the server for retrieving the desired output. The following is a simple Perl script which takes HTML POST form input and displays the name/value pairs to the client:

	#!/usr/local/perl/perl
	print "CGI Variables\n";

	# Get the FORM content-type and length
	$content_type = $ENV{'CONTENT_TYPE'};
	$content_len = $ENV{'CONTENT_LENGTH'};

	# Buffer the POST content
	binmode STDIN;
	read(STDIN, $buffer, $content_len);

	# Parse and display the FORM data.
	if ((!$content_type) ||
	    ($content_type eq 'application/x-www-form-urlencoded'))
	{
		# Process the name=value argument pairs
		@args = split(/&/, $buffer);

		$data = '';
		foreach $pair (@args) 
		{
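			# Split on the first "=" only, so values may contain "="
			($name, $value) = split(/=/, $pair, 2);
			$value = '' unless defined $value;
	
			# Unescape both the argument name and the argument value
			foreach ($name, $value)
			{
				tr/+/ /;
				s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
			}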

			# Print the name=value pair
			print "$name: $value\n";
		}
	}
	else
	{
		print "Invalid content type (expecting POST data)!\n";
		exit(1);
	}

	# DONE
	exit(0);

Next, see if you can enhance the above script to accept and process FORM data passed via GET.
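One possible starting point for that exercise is sketched below. It only selects where the raw data comes from; the decoding loop from the script above can then be applied to the result unchanged. The function name raw_form_data is hypothetical, and it takes a hash reference and a filehandle instead of using %ENV and STDIN directly so that it can be tried outside a web server.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the raw, still-encoded FORM data for a request.
# $env is a hash reference standing in for %ENV, $in is the input handle.
sub raw_form_data {
    my ($env, $in) = @_;
    my $method = uc($env->{REQUEST_METHOD} || 'GET');

    if ($method eq 'POST') {
        # POST: read CONTENT_LENGTH bytes from the input stream.
        my $buffer = '';
        my $length = $env->{CONTENT_LENGTH} || 0;
        read($in, $buffer, $length) if $length > 0;
        return $buffer;
    }
    # GET (and anything else): the data travels in the URL.
    return defined $env->{QUERY_STRING} ? $env->{QUERY_STRING} : '';
}

# In a CGI script this would be called as: raw_form_data(\%ENV, \*STDIN)
my $get = raw_form_data(
    { REQUEST_METHOD => 'GET', QUERY_STRING => 'email=foobar' }, \*STDIN);
print "GET data: $get\n";

my $post_body = 'email=foobar&message=This+is+a+test';
open(my $in, '<', \$post_body) or die "cannot open in-memory handle: $!";
my $post = raw_form_data(
    { REQUEST_METHOD => 'POST', CONTENT_LENGTH => length($post_body) }, $in);
print "POST data: $post\n";
```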

Environment Variables
As you can see in the above script, environment variables are used to pass information about the FORM data to the CGI program. The following is a list of some of the standard environment variables available.

Environment Variable Description

SERVER_SOFTWARE is the name and version of the server answering the request.

SERVER_NAME is the server's hostname, DNS alias, or IP address as it would appear in self-referencing URLs.

GATEWAY_INTERFACE is the revision of the CGI specification to which the server complies.

SERVER_PROTOCOL is the name and revision of the protocol this request came in with.

SERVER_PORT specifies the port to which the request was sent.

REQUEST_METHOD is the method with which the request was made: "GET", "POST" etc.

QUERY_STRING is defined as anything following the first '?' in the URL. Typically this data is the encoded results from your GET form. The string is encoded in the standard URL format changing spaces to +, and encoding special characters with %xx hexadecimal encoding.

PATH_INFO is the extra path information, as given by the client.

PATH_TRANSLATED is the translated version of PATH_INFO, which takes the path and applies a virtual-to-physical mapping to it.

SCRIPT_NAME is a virtual path to the script being executed.

REMOTE_HOST is the name of the host making the request. If DNS lookup is turned off, REMOTE_ADDR is set and this variable is left unset.

REMOTE_ADDR is the IP address of the remote host making the request.

CONTENT_LENGTH is the length of any attached information from an HTTP POST.

CONTENT_TYPE is the media type of the posted data (usually application/x-www-form-urlencoded).
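A quick way to see what a given server actually provides is to print these variables from a script. The sketch below is one way to do that; the function name describe_env is hypothetical, and variables the server did not set are reported as "(not set)".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The standard variables described above, in the same order.
my @cgi_vars = qw(
    SERVER_SOFTWARE SERVER_NAME GATEWAY_INTERFACE SERVER_PROTOCOL
    SERVER_PORT REQUEST_METHOD QUERY_STRING PATH_INFO PATH_TRANSLATED
    SCRIPT_NAME REMOTE_HOST REMOTE_ADDR CONTENT_LENGTH CONTENT_TYPE
);

# Build one "NAME = value" line per variable. $env is a hash reference
# standing in for %ENV so the function can be tried with sample data.
sub describe_env {
    my ($env) = @_;
    return map {
        my $value = defined $env->{$_} ? $env->{$_} : '(not set)';
        "$_ = $value";
    } @cgi_vars;
}

print "$_\n" foreach describe_env(\%ENV);
```

Run from a command line, most of the variables will show as "(not set)"; run as a CGI, the values come from the server.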

Returning Data
CGI programs can return content in many different document types (e.g. text, images, audio). They can also return references to other documents. To tell the server what kind of document you are sending back, CGI requires you to place a short header on your output. This header is ASCII text, consisting of lines separated by either linefeeds or carriage returns (or both), followed by a single blank line. The output body then follows in its native format.

If you begin your script output with "HTTP/", the server will send all output to the client exactly as the script has written it. Otherwise, the server will send a default header back (text/html file type) with any data returned from the script. Important: If you do not choose to write the entire HTTP header, you should not provide any special headers, as they will appear as part of the body after server processing.

If you begin your script with any of the following:

Content-type:
Location:
Transfer-Encoding:
Last-Modified:
Set-Cookie:

the server will prepend the appropriate HTTP response status (200 or 302) and then send the headers and content of your script exactly as received.

For example, to send back HTML to the client, your output should read:

        Content-type: text/html

        <HTML><HEAD>
        <TITLE>output of HTML from CGI script</TITLE>
        </HEAD><BODY>
        <H1>Sample output</H1>
        Blah, blah, blah.
        </BODY></HTML>

In the above example, the response prepended is: HTTP/1.0 200 OK

To reference a file on another HTTP server, you would output something like this:

        Location: http://www.sambar.com/
        Content-type: text/html

        <HTML><HEAD>
        <TITLE>Whoops...it moved</TITLE>
        </HEAD><BODY>
        <H1>Content Moved!</H1>
        </BODY></HTML>

In the above example, the response prepended is: HTTP/1.0 302 MOVED

Note: The Location: directive should come prior to the Content-type: directive.
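Both kinds of output follow the same pattern: header lines, one blank line, then the body. The following sketch shows one way to assemble them in Perl. The function name cgi_response is hypothetical, and the two responses are printed one after the other purely for illustration; a real script would emit only one.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assemble CGI output: header lines, one blank line, then the body.
# Headers are passed as an array reference of name/value pairs so that
# their order is preserved.
sub cgi_response {
    my ($headers, $body) = @_;
    my @fields = @$headers;
    my $out = '';
    while (@fields) {
        my ($name, $value) = splice(@fields, 0, 2);
        $out .= "$name: $value\n";
    }
    return $out . "\n" . $body;
}

# An ordinary HTML page.
print cgi_response(
    [ 'Content-type' => 'text/html' ],
    "<HTML><BODY><H1>Sample output</H1></BODY></HTML>\n");

# A redirect, with Location: placed before Content-type:.
print cgi_response(
    [ 'Location' => 'http://www.sambar.com/', 'Content-type' => 'text/html' ],
    "<HTML><BODY><H1>Content Moved!</H1></BODY></HTML>\n");
```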