Chapter 19 -- Principles of Gateway Programming

CONTENTS

The Internetworking Protocol Family (TCP/IP)
The HyperText Transfer Protocol (HTTP)
What Is the Common Gateway Interface?
The CGI
The CGI: An Expanding Horizon
Hardware and Server Software Platforms
CGI Guidelines and Principles
Software and Hardware Platform of the Gateway Programming Section Examples
Principles of Gateway Programming Check

In this chapter, I start with principles, including a brief description of the Internet protocols that enable the World Wide Web in general and gateway programming in particular: Transmission Control Protocol/Internet Protocol (TCP/IP) and the HyperText Transfer Protocol (HTTP).

The Web can be thought of as a distributed information system. It is capable of supporting, seamlessly and globally, rapid and efficient multimedia information transfer between information content sites (servers) and information content requesters (clients). The servers are distributed in the truest sense of the word because there is no geographic constraint whatsoever on their location. Note the three critical properties of HTTP: its statelessness, its built-in mechanisms for an arbitrarily rich set of data representations (its extensibility), and its use of the connectionless TCP/IP backbone for data communication.

The chapter then moves on to the Common Gateway Interface (CGI). Important fundamental terminology is introduced, such as the methods that HTTP supports. The advantages that the CGI environment affords both information requesters and information providers are discussed and illustrated with short Perl programs. Finally, typical hardware and software choices for Web sites are reviewed, and the stage is set for the examples that I present in Chapters 20 through 25.

The Internetworking Protocol Family (TCP/IP)

It's not necessary to be a "propeller-head" (although it helps!) to grasp the essentials of the TCP/IP family. From the standpoint of the Web developer, here's what you really have to know:

The TCP/IP family is organized into four layers. The lowest layer is the link layer (corresponding to hardware such as the computer's network interface card); next comes the network layer, where the Internet Protocol (IP) operates. Above the network layer is the transport layer, where you find the Transmission Control Protocol (TCP). Finally, at the top sits the application layer, where familiar services such as File Transfer Protocol (FTP), Network News Transfer Protocol (NNTP), and others exist. The notion of a layered organization is very convenient: Applications call on the services of library routines offered by the transport layer (most frequently, TCP); in turn, TCP calls on routines offered by protocols in the network layer (usually, IP), and so on.

TCP guarantees end-to-end transmission of data from the Internet sender to the Internet recipient. Big data streams are broken up into smaller packets and reassembled when they arrive at the recipient's site. Mercifully, this breakdown and reassembly are transparent to Internet users. Keep in mind that the TCP protocol incurs overhead: It must set up the connection and keep track of packet sequence numbers on both ends. It also must implement a timing mechanism in order to ask for packet resends after a certain amount of time has passed.

IP gives you the familiar addressing scheme of four numbers, separated by periods. The NYU EDGAR development site, for example, has an IP address of 128.122.197.196.
If the user always had to type in these numbers to invoke an Internet service, the world would be a gloomy place, but of course the Internet provides name-to-address translation via the Domain Name Service (DNS), so the EDGAR machine has a friendlier name: edgar.stern.nyu.edu.

IP, unlike TCP, is a connectionless protocol. This means that the route of data from the sender to the recipient is not predetermined. Along the way, the packets of data might well encounter numerous routing machines that use algorithmic methods to determine the next packet hop; each packet makes its own way from router to router until the final destination is reached. The Internet therefore can adapt to network congestion by rerouting data packets around problem areas. Again, end users do not have to know the nitty-gritty details (but they do have to suffer the consequences of peak usage, slowing down everybody's packets!).

The TCP/IP family of protocols is open (that is, the protocols are not proprietary or for-profit). Openness means that Internet users are not beholden to a commercial vendor for supporting or enhancing the TCP/IP standard. Well-established standards-review procedures, participating engineering groups such as the Internet Engineering Task Force (IETF), and draft standards on-line (known as Requests for Comments, or RFCs) are freely available to all. (See note)

Note: The concept of openness lies at the very heart of the Internet and gives it an inimitable charm. Openness means accelerated standards development, cooperation among vendors, and a win-win situation for developers and users. The ideals of cooperation and interoperability are addressed again in "The HyperText Transfer Protocol (HTTP)," later in this chapter.

Tip: Aspiring, ambitious Web developers should immerse themselves in the nitty-gritty of TCP/IP standards, both the current state of affairs and possible future directions. (See note) The Internet Multicasting Service, for example, has a very interesting on-line section called "New and Trendy Protocols" that makes for fascinating reading and might well be a portent of things to come. (See note) If you're an employee at a large installation, my advice is to show healthy curiosity and ask the system administrators to fill you in on the infrastructure and Internet connectivity at your firm. Be careful, though: sometimes the sys admins bite!

You don't need a World Wide Web to perform some of the more basic tasks on the Internet. I can transfer ASCII files or binary images, for example, from one machine to another using FTP. I can log onto a remote machine using Telnet, rlogin, or rsh. Or, I can browse hierarchically organized (menued) data using Gopher. Most machines support standard e-mail via the Simple Mail Transfer Protocol (SMTP), and if a site subscribes to Usenet, the newsgroups are accessible using NNTP.

On a UNIX-based machine, the basic services are enumerated in the file /etc/services. Each service corresponds to a standard port: Telnet is mapped to port 23, for example, and FTP is mapped to port 21. All ports below 1024 are privileged, meaning that only system administrators who can become root on the machine are able to manipulate the service and port mappings.
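Because both the port map and the DNS are ordinary system services, a script can consult them directly. The following minimal Perl sketch (illustrative only; the host name reuses the EDGAR machine mentioned earlier) looks up a well-known port and resolves a host name, using Perl's standard getservbyname and gethostbyname built-ins:

    #!/usr/local/bin/perl
    use Socket;     # for inet_ntoa

    # Look up the standard port for the FTP service over TCP.
    ($name, $aliases, $port, $proto) = getservbyname('ftp', 'tcp');
    print "ftp is served on port $port\n";          # typically 21

    # Resolve a host name to its dotted-quad IP address via the DNS.
    ($host, $alias, $type, $len, $addr) = gethostbyname('edgar.stern.nyu.edu');
    print "edgar.stern.nyu.edu is ", inet_ntoa($addr), "\n";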
Figure 19.1 shows a typical FTP session.

Figure 19.1: A user at New York University asks for a documentation file from the Internet Multicasting Service using FTP.

The important thing to realize about basic services such as FTP or Telnet is that they establish what potentially might be a long-lasting connection. Users can stay connected for quite a while, typically facing a 900-second idle time-out. It is possible to FTP one file after another from an FTP site, or to log on all day on a remote machine via Telnet, issuing only sporadic commands. This is taxing to the host machines as well, because they have only limited sockets available to service the user community. The problem, of course, is that when users are in a terminal session and want to Telnet to a different machine or FTP to a different FTP site, it's necessary to close the current connection and start a new one.

Theoretically, a hardy soul might build an interesting hypermedia resource site by FTPing interesting images, video, and so on from archives around the world. He or she also might accumulate a great amount of textual information content in a similar fashion. Yet, in the "bad old days," there was no way to publish the resource base to the global Internet community. The only recourse was to write about the site on the Usenet newsgroups and then allow anonymous FTP so that other users could mirror some or all of the files. The hypermedia was viewable only by a privileged set of local users.

What is missing? None of these services, alone or in combination, affords the possibility of allowing machines around the world to collaborate in a rich hypermedia environment. When the '90s started, it was virtually unimaginable that the efficient sharing of text, video, and audio resources was just around the corner. One way to think of the problem is to consider that it was impossible, just a few short years ago, to request hypermedia data for local viewing from a remote machine using a TCP/IP pipe. There simply was no standard to support the request or the answer.

Filling the Collaborative Vacuum

The global Internet community was blessed, in 1991, by Tim Berners-Lee's implementation of the HTTP protocol at CERN, the European Center for High-Energy Physics in Geneva, Switzerland. Another way to look at "collaboration" in this context is the capability to publish the hypermedia resource base locally and have it viewable globally, plus the capability to swiftly and easily transfer the hypermedia resources, annotate them, and republish them on another site. HTTP is the powerful protocol engine that enables remote hypermedia collaboration and stands at the very essence of the World Wide Web.

The HyperText Transfer Protocol (HTTP)

The HTTP protocol is a member of the TCP/IP suite of protocols because it uses TCP/IP for its transport. Theoretically, HTTP is transport-independent; it could use User Datagram Protocol (UDP) instead of TCP, or X.25 instead of IP. The important things to keep in mind are that HTTP sits at the top application layer of the TCP/IP family and typically uses TCP for transport and IP for routing. In other words, the HTTP specification presupposes the existence of a backbone network connecting all the machines (in the case of the Internet, TCP/IP) and assumes that the packets flowing from client to server (and vice versa) take advantage of the standard TCP/IP protocols. More specifically, HTTP uses the TCP transport service (and its associated overhead; TCP in turn uses IP to route the packets) when connecting an information requester (a client) to an information provider (a server). It encompasses several broad areas:

A comprehensive addressing scheme. When an HTML hyperlink is composed, the URL is of the general form http://machine-name:port-number/path/file.html.
Note that the machine name conforms to the IP addressing scheme; it might be of the form aaa.bbb.ccc.ddd.edu or, using DNS lookup, the machine's "English" equivalent may be used. Note further that the path is not an absolute path on the server machine; instead, it is a path relative to the server's document root directory. More generally, a URL reference is of the type service://machine/file.file-extension, and in this way the more basic Internet services are subsumed by the HTTP protocol. (See note) To create a hyperlink to an NYU EDGAR research paper, for example, you can use this code:

    <A HREF="ftp://edgar.stern.nyu.edu/pub/papers/edgar.ps">

By subsume, I mean that a non-HTTP request is fulfilled in the Web environment; a request for an FTP file therefore results in that file being cached locally with the usual Web browser operations available (Save As, Print, and so on), without sacrificing the essential flexibility of being able to jump to the next URL. The scheme format changes slightly from service to service; for example, an FTP request permits this optional construction:

    <A HREF="ftp://jsmith:pass99@landru.lab.com">

This example has user jsmith logging on to the FTP server at landru.lab.com with password pass99. The HTTP service has no such Userid:Password construction.

An extensible and open representation for data types. When the client sends a transaction to the server, headers are attached that conform to standard Internet e-mail specifications (RFC 822). (See note) At this time, the client can limit the representation schemes that are deemed acceptable, or throw the doors wide open and allow any representation (possibly one of which the client is not aware). Normally, from the standpoint of gateway programming, most client requests expect an answer in plain text or HTML. It's not at all necessary that developers know the full specification of client request headers, but full details are available on-line. (See note)

When the HTTP server transmits information back to the client, it includes a Multipurpose Internet Mail Extension (MIME) header to tell the client what kind of data follows the header. The server does not need the capability to parse or interpret a data type; it can pass the data back to the client, and translation then depends on the client possessing the appropriate utility (image viewer, movie player, and so on) corresponding to that data type. Interestingly, there are ways for the client to request information about a file (metadata) rather than the file itself (data) using the HEAD method, which I discuss more fully in Chapter 20.

Note: The MIME specification, originally developed for e-mail attachments, has been adapted in a very important way for the Web. (See note) MIME is discussed further in Chapter 20. For now, it's enough to remember that the HTTP protocol requires that data flowing back to the client has a properly formatted set of header lines.
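To make the header requirement concrete, here is a sketch of what an HTTP 1.0 exchange might look like on the wire; the file name and header values are hypothetical, and real servers vary in exactly which headers they send. Both the client's request and the server's MIME-conformant response are plain text, with a blank line terminating each header block:

    GET /index.html HTTP/1.0           (request line from the client)
    Accept: text/html                  (representations the client will accept)
                                       (blank line ends the request headers)
    HTTP/1.0 200 Document follows      (status line from the server)
    Server: NCSA/1.5
    Content-type: text/html            (MIME type of the data that follows)
                                       (blank line ends the response headers)
    <HTML> ... the document itself ...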
The HTTP protocol also has several important properties:

It is stateless. Statelessness means that after the server responds to the client's request, the connection between client and server is dropped. This has important ramifications and is in direct contrast to basic Internet services such as FTP or Telnet. In an FTP session, if I request a file from a remote site, I am still logged on (the connection is still there) until I explicitly quit or am logged off by the remote machine (an inactivity time-out). Statelessness also means, from the standpoint of the web developer, that there is no "memory" between client connections. In a pure HTTP server implementation, there is no trace of recent activity from a given client address, and the server treats every request as if it were brand new; that is, without context.

This might strike the reader as a waste of network resources; after all, TCP went through quite a bit of effort to set up an end-to-end link, and HTTP drops the connection after only one transaction cycle. Indeed, the primary motivation behind establishing a persistent actual or de facto connection between server and client is to eliminate this waste. Throughout Part IV, I present workarounds or protocol extensions that maintain or alter state and, in effect, keep the client/server connection alive for more than one cycle. Arguably the most important state-preservation technique (though it is not accepted as an Internet standard) is Netscape's Cookie scheme, which I explore in detail with a sample application in Chapter 24, "Scripting for the Unknown: The Control of Chaos."
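Before moving to HTTP's other properties, here is a minimal sketch of one classic workaround that predates cookies: making the client carry the state in a hidden form field, so that each POST hands the state back to the server. The script name and field name are invented for illustration, and a production script would decode and validate its input, as discussed later in this chapter:

    #!/usr/local/bin/perl
    # counter.pl: carry a visit count across stateless HTTP requests.

    read(STDIN, $input, $ENV{'CONTENT_LENGTH'}) if $ENV{'CONTENT_LENGTH'};
    ($count) = $input =~ /count=(\d+)/;   # recover state from the hidden field
    $count++;                             # undefined on the first visit; ++ yields 1

    print "Content-type: text/html\n\n";
    print "<HTML><BODY>\n";
    print "You have submitted this form $count time(s).\n";
    print "<FORM METHOD=POST ACTION=\"http://www.some.box/counter.pl\">\n";
    print "<INPUT TYPE=hidden NAME=count VALUE=\"$count\">\n";
    print "<INPUT TYPE=submit VALUE=\"Again\">\n";
    print "</FORM></BODY></HTML>\n";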
It is rapid. In short: the client requests, the server responds, the end. Berners-Lee's stated goal of a hypermedia request-answer cycle on the order of 100 milliseconds definitely has been met "on a clear Net day." The perceived delay ("This site is so slow today!") usually can be blamed on general network congestion.

Caution: It's up to the web developer to avoid adding to the congestion woes of the client! Throughout Part IV, I stress ways to plan data structures, and accesses to those data structures, in efficient ways.

There are portable implementation solutions. Thanks to Tim Berners-Lee, Henrik Frystyk, Larry Masinter, Roy Fielding, and many others, the Internet research community has been involved from the outset in implementing solutions for HTTP servers and HTTP browsers. Perhaps the most portable solution introduced so far is Anselm Baird-Smith's Jigsaw product, an HTTP server written completely in Java. Announced at the May 5, 1996 World Wide Web conference in Paris, France, Jigsaw offers completely painless portability because it is written for the abstract Java Virtual Machine (it is architecture-neutral). More information on Jigsaw (including download information) is available at http://www.w3.org/pub/WWW/Jigsaw.

Note: On UNIX boxes, the standard HTTP port is port 80, and the server daemon is called httpd. The httpd program can run as a stand-alone program, waiting for a request at port 80, or it can run off of inetd (consulting the system files /etc/services and /etc/inetd.conf when a port 80 request is received). The httpd daemon also can be started on a nonprivileged port, such as port 8000, but then, of course, the client must specify a URL such as http://machine-name:8000/path/file.html, and it's up to the server to publicize the oddball port! Not a happy task. If the port is not specified, port 80 is assumed. If the server is expected to be popular, it's a better idea to run the httpd daemon stand-alone, because there is overhead in consulting the /etc/inetd.conf file every time a client request is received.

Its future direction will be open. The peripatetic Mr. Berners-Lee now heads the World Wide Web Consortium (W3C), which provides an open forum for development in many different arenas. (See note) The Netscape Communications Corporation, for example, has developed a security scheme called the Secure Sockets Layer (SSL) and has published the SSL specifications for all to see (I talk more about security in Chapter 25, "Transaction Security and Security Administration"). The W3C is evaluating this, as well as a commercial competitor's ideas for Secure HTTP (SHTTP), in a rigorous and impartial manner. An organization such as the W3C is a fantastic resource for the Web development community; top engineers and theorists can enter an open forum and freely discuss ideas and new directions for the protocol.

Tip: It's a great idea for the budding Web developer to closely follow the ideas that are being bandied about by the W3C. One important idea is the Uniform Resource Identifier (URI), described in Request for Comments (RFC) 1630. (See note) Currently, users often encounter the frustration of clicking a hypertext link only to find that the URL no longer is valid. The URI specs allow for the possibility of encoding a forwarding address, in a manner of speaking, when a link moves. Another critical advance, announced by the W3C on March 5, 1996, is Hakon Lie's Cascading Style Sheets (CSS), (See note) which allow the HTML author to suggest fonts, colors, and horizontal spacing in a document (which the client then can override if desired). The W3C has developed Amaya, an HTML browser and editor, to demonstrate the power of CSS. The momentum of CSS is growing with Netscape's announcement that its feature set will be embedded in the Netscape Navigator 4.0 browser. The list of ideas goes on and on; the more developers know today, the more they are ready tomorrow when a concept becomes a practical reality. And if the time and resources exist, a trip to the WWW conference is highly recommended to keep up with the latest initiatives. (See note)

Its weaknesses are known and are being addressed. In one intriguing and noteworthy example, the current HTTP 1.0 often causes performance problems on the server side and on the network because it sets up a new connection for every request. Simon Spero has published a progress report on what the W3C calls HTTP Next Generation (HTTP-NG). As Spero states, HTTP-NG "divides up the connection (between client and server) into lots of different channels...each object is returned over its own channel." Spero further points out that the HTTP-NG protocol permits complex data types such as video to redirect the URL to a video-transfer protocol; only then is the data fetched for the client. HTTP-NG also keeps a session ID, thus bestowing "state." The Netscape Cookie mechanism takes a different approach: A server script can write information to the client file system, and that information is passed back to the server when the same client accesses the same server domain and path (see Chapter 24). Again, the Web developer should make a point of keeping abreast of developments in HTTP-NG, Secure HTTP, Netscape cookies, Netscape SSL, and other hot industry issues. (See note)

Let's imagine now the state of the world just after the HTTP protocol was introduced (and yes, it was an instant and smashing success) but before the advent of our next topic, the CGI. In 1991, we had our accustomed TCP/IP Internet connectivity, and then there was the HTTP protocol in operation.
That means we had many HTML coders integrating text, video, and audio at their server sites, and many more clients anxious to get at the servers' delights. Remote collaboration was achieved; clients could request hypermedia data from a remote server and view it locally.

Consider, though, one such client session. Without the CGI, clients can navigate only from one hypertext link to the next, each one containing text, audio, video, or some other data type. This inefficient means of browsing a large information store consists of nothing more than the actions shown in Figure 19.2.

Figure 19.2: Without the CGI: an inefficient browsing session.

The drawbacks of navigating serially from link to link, with each link producing one discrete preexisting data item, are potentially severe at some server locations. For the user, it would be annoying to browse numerous links at a large server site to find a specific item of interest. For the Web developer, there would be no way to provide an ad-hoc mechanism for querying data (of any type), and it wouldn't be possible to build HTML documents dynamically at request time. Naturally, some sites can stand fully on their own, without gateway-supplied interactivity.

What Is the Common Gateway Interface?

The Common Gateway Interface (CGI) is a means for the HTTP server to talk to programs on your, or someone else's, machine. The name was very aptly chosen:

Common. The idea is that each server and client program, regardless of the operating system platform, adheres to the same standard mechanisms for the flow of data between client, server, and gateway program. This enables a high level of portability between a wide variety of machines and operating systems.

Gateway. Although a CGI program can be a stand-alone program, it also can act as a mediator between the HTTP server and any other program that can accept some form of input at runtime (for example, standard input, stdin, or environment variables). This means that a SQL database program that has no built-in means of talking to an HTTP server can be accessed by a gateway program, for example. The gateway program usually can be developed in any number of languages, regardless of the external program.

Interface. The standard mechanisms provide a complete environment for developers. There is no need for a developer to learn the nuts and bolts of the HTTP server source code. After you understand the interface, you can develop gateway programs; all you need to know in terms of the HTTP protocol is how the data flows in and out.

CGI programs go beyond the static model of a client issuing one HTTP request after another. Instead of the client passively reading server data content one prewritten screen at a time, the CGI specification allows the information provider to serve up different documents depending on the client's request. The CGI spec also allows the gateway program to create new documents on the fly; that is, at the time the client makes the request. A current Table of Contents HTML document, listing all HTML documents in a directory, easily can be composed by a CGI program; a sketch of the idea follows, and I demonstrate a full version of this useful program in Chapter 20.
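Here is a minimal sketch of such a Table of Contents generator. It assumes a document root of /usr/local/etc/httpd/htdocs (a common NCSA default; adjust the path and link prefix for your site) and simply lists every .html file in that directory as a hyperlink:

    #!/usr/local/bin/perl
    # toc.pl: build a Table of Contents page on the fly.

    $docdir = '/usr/local/etc/httpd/htdocs';    # assumption: your document root

    print "Content-type: text/html\n\n";
    print "<HTML><HEAD><TITLE>Table of Contents</TITLE></HEAD><BODY>\n";
    print "<H1>Current Documents</H1>\n<UL>\n";

    opendir(DIR, $docdir) || die "cannot open $docdir: $!";
    foreach $file (sort grep(/\.html$/, readdir(DIR))) {
        # Assumes the document root maps to the URL root on this server.
        print "<LI><A HREF=\"/$file\">$file</A>\n";
    }
    closedir(DIR);

    print "</UL></BODY></HTML>\n";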
Note particularly the synergism between organizations permitted by the capability of CGI programs to call each other across the Internet. By mutual agreement, companies can feed each other parameters to perform ad-hoc queries on proprietary data stores. I show an example of such interaction in Chapter 21, "Gateway Programming I: Programming Libraries and Databases," in the discussion of the Stock Ticker Symbol application.

The CGI

Recall Figure 19.2, which illustrated a schematic data flow without the advantages of the Common Gateway Interface. Adding in the CGI, the picture now looks like the one depicted in Figure 19.3.

Figure 19.3: A schematic overview of data flow using the CGI. The first step is data being transmitted from a client to a server (1). The server then hands the request to the CGI program for execution (2). Output (if any) is passed back to the server (3). The output, if it exists, is sent to the client (4). The initial connection from client to server is dropped after event (4).

The transaction follows these steps:

The client sends a request conforming to the URL standard to the server. This request must include the type of service desired (for example, HTTP, FTP, Telnet, and so on) and the location (for example, //machine-name-or-IP/filename) of the resource. Attached to this request is header data supplied by the client. (Headers are covered in the next chapter.)

The HTTP server parses the incoming request and decides what to do next. For a non-HTTP request, the appropriate service is subsumed. An FTP request retrieves the appropriate file and returns it to the client's browser, for example. Note that the retrieved file now is sitting locally in the client's browser, and all of the usual Web browser buttons (Save, Print, Open URL, and so on) are available.

For an HTTP request, the server locates the file being requested. Depending on the file's type, the server then decides what to do with the file. How the server reacts to different file types is a configuration issue determined by the maintainer of the server. (Configuring HTTP servers is beyond the scope of this book; I deal only with commonly used file types.) If the server doesn't understand the file type, it usually is configured to send the file back as plain text.

An HTML file is sent back to the client. In most cases, the server does not parse or interpret the file in any way; the client software parses the HTML tags to format the output properly for the user. A major exception to this rule is when server-side includes are used by the web developer (using SSIs is an important technique and is discussed fully in Chapter 20).

If the server recognizes the file as an executable file or a CGI program, it runs the program, attaching the following: the header data received from the client, if any, plus the server's own header data, passed to the gateway program as environment variables; and the program execution parameters, if any, attached to the gateway program by the client. This second set of data is passed to the CGI program as environment variables or as input to the program's stdin or command line; the method by which the data is passed is determined by the developer. The next section contains a brief introduction to methods, and Chapter 20 gives a fuller explanation.

The black box in Figure 19.3 is the gateway program, and this is where web developers stand or fall. What must it do? The gateway program must parse the input received from the server and then generate a response and/or output to send back to the server. There are conditions on how the program must behave: If there is no data to send back to the client, the program still must send a response indicating that. Remember that, at this point, the HTTP connection still is open.

Caution: The web developer must be attuned to the possibility of a CGI program that mistakenly generates no response. This misbehavior causes processes to pile up, which eventually can crash the server machine.
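A minimal sketch of the defensive pattern follows. Even when a script has nothing substantive to return, it can emit a complete (if nearly empty) response so that the server and client are never left hanging; alternatively, many servers accept a CGI Status header such as 204 No Content, which tells the client to stay on its current page:

    #!/usr/local/bin/perl
    # nothing.pl: always send *some* well-formed response.

    print "Content-type: text/plain\n\n";   # the blank line ends the header block
    print "No data available at this time.\n";
    exit 0;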
If there is data to send back to the client, the gateway program precedes that data with a header that the server understands, followed by the output data, which must conform to the MIME formatting conventions. The data must be of the type indicated by the header. The format and content of the response are critical. This is not, however, difficult to master, as I show later.

The server reads the CGI program output and again decides what to do, based on the header. In general, the server might take two types of actions: If the header is of the Location type, the server fetches the file indicated or tells the client to fetch that file. A Content-type header causes the server to send the data back to the client. The client then is responsible for handling the incoming data and properly formatting it for output to the user. After the client receives all the data, the HTTP connection closes (recall the important property of HTTP statelessness).

Data Passing and Methods

If all the CGI environment allowed you to do was run external programs, without the client being able to supply data in an ad-hoc manner, the Web would be a dull place. Fortunately, this is not the case. Using different techniques, the client can pass arguments or data to the gateway program through the HTTP server. The gateway program, instead of being a static program with the same output every time it's run, becomes a dynamic entity that responds to the end user's needs.

A client can pass data to the gateway program in one of two ways: via environment variables or as standard input (also known as stdin) to the program. Two environment variables are available for gathering user input data, QUERY_STRING and PATH_INFO, and there are a few ways to get data into those variables. The developer can put data into these variables through a normal HTML link:

    <A HREF=http://www.some.box/sign.pl?passed-argument> Click here to run the program </A>

Everything after the first question mark in a URL is put into the QUERY_STRING variable; in this instance, the characters passed-argument.

Note: Text search packages, such as WAIS and freeWAIS, existed before the Web and the HTTP protocol were invented. In the CGI context, keywords (separated by the plus (+) character) are passed to the gateway program as if the client had executed a METHOD=GET. Therefore, the environment variable QUERY_STRING is used for WAIS and WAIS-like packages; this was an implementation decision by the designers of the HTTP protocol. Gateway program interfacing to text search packages is fully discussed in Chapter 22, "Gateway Programming II: Text Search and Retrieval Tools."

Similarly, to put data into the PATH_INFO variable, the following HTML link can be coded:

    <A HREF=http://www.some.box/walk.pl/direction=north/speed=slow> Start the program </A>

In this case, the server would find the CGI executable, walk.pl, and put everything after that into the PATH_INFO variable:

    /direction=north/speed=slow
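Reading these variables inside a gateway program is straightforward; parsing them is the developer's job. The following minimal Perl sketch, reusing the hypothetical walk.pl example above, splits a PATH_INFO string of that form into keyword/value pairs:

    #!/usr/local/bin/perl
    # walk.pl: pull keyword=value pairs out of PATH_INFO.

    print "Content-type: text/plain\n\n";

    # PATH_INFO looks like /direction=north/speed=slow
    foreach $piece (split('/', $ENV{'PATH_INFO'})) {
        next unless $piece;                    # skip the empty leading field
        ($keyword, $value) = split('=', $piece);
        print "$keyword is $value\n";
    }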
Both these variables also can be filled using different methods within <FORM> tags. A form with METHOD=GET, for example, puts data into the QUERY_STRING variable:

    <FORM METHOD=GET ACTION="http://www.some.box/name.pl">
    First Name<INPUT NAME="First Name"><BR>
    Last Name<INPUT NAME="Last Name"><BR>
    <INPUT TYPE=submit VALUE="Submit">
    </FORM>

This puts whatever text the user types into the QUERY_STRING environment variable. The gateway program can read and echo to the screen the First Name and Last Name form data with the following code:

    #!/usr/local/bin/perl
    # name.pl

    print "Content-type: text/html\n\n";
    print "You input \"$ENV{'QUERY_STRING'}\" in the input boxes\n\n";
    exit;

If the user typed foo and bar as values, the QUERY_STRING environment variable would have the value First+Name=foo&Last+Name=bar. Figure 19.4 shows the output screen the end users see.

Figure 19.4: The output from the simple METHOD=GET form. A ? and the encoded data are appended to the form's new URL.

Note that the data the user input is appended to the new URL after a question mark. Also, the data is encoded and must be decoded. Encoding simply means that certain characters, such as spaces, are translated before they are passed to the gateway program. The developer must perform a simple "untranslate" step to properly use the data, and publicly available tools can help. Encoding and decoding are discussed in Chapter 20.

Caution: Passing data via environment variables is useful but has some limitations and actually can cause system problems. A gateway program handing off a very long string (URL plus query string) to a shell script might crash the script due to built-in shell limitations on the length of the command line. DOS programmers recognize this as the familiar running out of environment space problem.

To bypass the potential dangers of the METHOD=GET technique, the NCSA recommends that you pass data through standard input to the external program whenever possible. A form with METHOD=POST is used to pass data to a gateway program's stdin. Again, the data is encoded when it is passed to the gateway process, and it must be decoded. I change my Perl script to read like this:

    #!/usr/local/bin/perl
    # name.pl

    # Read CONTENT_LENGTH bytes of POST data from stdin into $user_input.
    read(STDIN, $user_input, $ENV{'CONTENT_LENGTH'});
    print "Content-type: text/html\n\n";
    print "You input \"$user_input\" in the input boxes\n\n";
    exit;

This program produces an output screen identical to the preceding one, except that the resulting URL does not show an encoded QUERY_STRING after the program name. A METHOD=POST form can use the environment variables as well as stdin. By changing the FORM tag to

    <FORM METHOD=POST ACTION=http://www.some.box/name.pl/screen=subscribe>

data goes into both stdin and the PATH_INFO variable. The POST method of form handling is considered favorable because there is no limitation on the amount of data that can be passed to the gateway program. Keep in mind the exception of text-search engines, which place keywords as if METHOD=GET were used. The important point to remember is that CGI gives you different ways to pass data to the gateway program, and these methods can be combined. This ties in nicely with the Web's intrinsic properties of openness and extensibility.
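Whichever method delivers the data, the decoding step is the same, and publicly available libraries such as cgi-lib.pl perform it for you. As a minimal sketch of the classic idiom: split the string on &, turn each + back into a space, and translate each %xx hexadecimal escape back into its character:

    # $input holds raw form data such as First+Name=foo&Last+Name=bar
    foreach $pair (split('&', $input)) {
        ($name, $value) = split('=', $pair);
        foreach $item ($name, $value) {
            $item =~ tr/+/ /;                                    # + becomes space
            $item =~ s/%([0-9A-Fa-f]{2})/pack("C", hex($1))/ge;  # %xx becomes its character
        }
        $form{$name} = $value;   # stash the decoded pair in an associative array
    }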
A third method, HEAD, also is useful. A client can use the HEAD method to request information about a given file on a server, rather than the file itself. The most direct way to see this is by opening a Telnet connection to the server (parenthetical remarks appear to the right; don't type them if you want to follow along):

    telnet edgar.stern.nyu.edu 80        (connect to the server at port 80)
    Trying 128.122.197.196 ...           (the network is trying to make the connection)
    Connected to edgar.stern.nyu.edu.    (good, I am connected)
    Escape character is '^]'.            (pressing ^] breaks out of the Telnet session)
    HEAD /tools.shtml HTTP/1.0           (I issue this command to retrieve information
                                          about tools.shtml)
                                         (I press Enter twice to send a blank line;
                                          this is essential)
    HTTP/1.0 200 Document follows        (the server's response status line)
    Date: Wed, 26 Jun 1996 21:36:51 GMT  (the date and time of the server's response)
    Server: NCSA/1.5                     (the server software and version number,
                                          delimited by a '/' character)
    Content-type: text/html              (the MIME type and subtype of the tools.shtml
                                          file)

In this session, I connect to the Web server edgar.stern.nyu.edu at the default HTTP port 80 and then issue a native HTTP command to retrieve information (metadata) about the file tools.shtml, which lives in the server's document root. Note that the actual file system path of the file tools.shtml is independent of the document root; in this case, tools.shtml actually is located in the path /usr/local/edgar/web, but the client does not need to know this.

With METHOD=GET and METHOD=POST, a script needs only to supply a MIME header and then the data; such a MIME header is, in fact, the last line in this session. Here, I supply a native request line (the method HEAD, followed by the file name and the protocol version HTTP/1.0), which usually is constructed automatically when Web clients and servers talk to one another. Note that the Telnet session establishes a clear-text interactive session with the HTTP server daemon, exactly as a Telnet to port 25 on a standard UNIX machine would start an interactive session with the SMTP daemon. Not all network protocols are like this; some expect (encoded) C-language structures as input.

The brute-force Telnet approach is impractical for taking good advantage of the HEAD method, however. CGI programs can be written to open a connection to the server port, get this metainformation programmatically, and then take a logical action accordingly. I show this in Chapter 20.
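As a preview, here is a minimal Perl 5 sketch of the programmatic approach; the host and file names reuse the session above, and the sketch assumes a Perl 5 Socket module that supplies inet_aton and sockaddr_in (true of recent distributions). It sends the same HEAD request the Telnet session typed by hand and prints the metadata lines it gets back:

    #!/usr/local/bin/perl
    # head.pl: fetch HEAD metadata programmatically.
    use Socket;

    $host = 'edgar.stern.nyu.edu';
    $port = 80;

    socket(S, PF_INET, SOCK_STREAM, getprotobyname('tcp')) || die "socket: $!";
    connect(S, sockaddr_in($port, inet_aton($host)))       || die "connect: $!";
    select(S); $| = 1; select(STDOUT);      # unbuffer the socket

    print S "HEAD /tools.shtml HTTP/1.0\r\n\r\n";   # request line plus blank line
    while (<S>) {
        last if /^\r?$/;    # an empty line ends the header block
        print;              # each header line is a piece of metadata
    }
    close(S);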
The CGI: An Expanding Horizon

CGI programming really does expand the horizon of the Web. The simple concept of passing data to a gateway program instantly opens up all sorts of options for web developers and changes the nature of the World Wide Web. Now web developers can enhance their content with applications that involve the end user in producing output. Developers subtly can alter the nature of their site from a more passive model, requiring little or no user input (and consequently more free-form surfing), to a more active model, accepting or requiring more user input. How much interactivity a developer adds to a site depends on the content and purpose of the site. A Web site advertising a particular product might be best off with few or no interactive applications. The advertiser wants to tell a story about the product, and limiting the end user's options causes the end user to focus more on the content of the site. At the other end of the spectrum are sites such as The EDGAR Project at town.hall.org, with its massive database of SEC filings. Here there is no story, only data, and, through CGI programming, the developers have been able to create a number of engaging applications that make it easy for users to meet their needs.

Hardware and Server Software Platforms

Web servers can be run from Macs, OS/2 machines, boxes running a wide array of UNIX flavors (Ultrix, SunOS, AIX, XENIX, HP-UX, and so on), MS-Windows machines, and other operating systems. The National Center for Supercomputing Applications (NCSA) development of its httpd server software and Mosaic browser software (which gave us inline decoding of GIF images) greatly helped to popularize the Web in 1992 and 1993; the CERN server is similar, and both have been ported to all the aforementioned platforms. More recently, the Netscape Communications Corporation has introduced server software that supports data encryption.

The examples I present in Part IV are based on code developed on SunOS 4.1.3_U1 or Solaris 2.4 and were tested on the NCSA httpd server, versions 1.3 and 1.5 beta. These are typical combinations; there are many other permutations of hardware and server software, and the principles remain the same no matter what a particular site chooses.

CGI scripts can be written in any language that is capable of understanding standard input, standard output, and environment variables. For the developer's sanity, it's better to choose a language that can be well documented and can access large data structures efficiently. In the UNIX environment, for example, the interpreted script languages Perl (Practical Extraction and Report Language) or Tcl (Tool Command Language) can be used, and compiled languages such as C or C++ are equally good choices. Perl has been ported to many platforms, including the Macintosh, Windows 3.1, Windows 95, and Windows NT. Recent 32-bit Windows Perl ports (which work well on Windows NT and less well on Windows 95) are discussed at http://www.perl.hip.com/. In Windows, you also can choose Borland's Turbo Pascal, C, or C++ for Windows, as well as dozens of other programming language options. Client-side scripting (including, for example, JavaScript and VBScript), where processing is totally removed from the Web server, is a separate topic and is discussed in Chapter 23, "Client-Side Scripting."

CGI Guidelines and Principles

The usual caveats should be observed when a web developer first considers what programming environment to choose:

Worry about your underlying data structures first. The best CGI programs in the world won't save you if the underlying data they are trying to access is a garbled mess. Use well-accepted principles such as database normalization to structure tabular data.

Plan carefully, in advance, the location of your programs before you write them. Make sure that production areas do not get filled with test scripts. Establish a mechanism to do version control on programs. Avoid headaches in this area before they start!

If possible, establish a test Web server with a configuration identical to the production Web server. The test machine can be used as a proving ground for new applications or new technologies, such as the Secure Sockets Layer (SSL) or Microsoft's CryptoAPI.
If both machines have the same operating system and server architecture, code can be rolled out to production by a simple file transfer.

When making the CGI software choice, the developer should remember to use a language that is readable, maintainable, and enhanceable. The language's inherent capabilities should be mapped to the type of access and retrieval needed, given the site's information content. This includes fast access to data structures (and yes, capacity planning should be done in this regard; could the data structure outgrow your resources?) and the capability to perform text manipulation with a minimum of agony. And, most important, can the proposed package hook easily into third-party applications? Are there examples of this package working with other packages on-line; that is, demonstrable at a Web site? If there aren't, the developer probably should have a few second thoughts.

And Once the Coding Starts

It is important to follow some hackneyed programming guidelines when you code CGI scripts. Most important, the code should be documented; not too little and not too much! The discipline of production and test directories always should be enforced, and maintenance logs should be kept up to date as important bug fixes are made or enhancements are written. If it transpires that a large-scale effort is bogging down in a certain programming environment, it is important to keep an open mind. There is no iron-clad rule that a single language be used. Inspiration often comes from cool sites on the Web; after further inspection, I usually find that the developers have used several powerful tools to attack the problem.

Software and Hardware Platform of the Gateway Programming Section Examples

In the remaining chapters of Part IV, I show a wide variety of CGI programs written in Perl 4.036 and Perl 5.00x. Perl is quite readable and also quite powerful; it combines the flexibility and utility of a shell language with the capability to make low-level system calls, as you might expect in a C program. Speaking of expect, I also demonstrate Perl interfacing with the script language Expect. I show Perl working well with the text-search tools WAIS and Glimpse, relational databases such as Oracle and Sybase, and object-relational packages such as Illustra. To illustrate the popular topic of Web crawlers, spiders, and robots, a simple yet effective Perl spider is presented and discussed. An example also is given of Perl 5 using Lincoln Stein's CGI.pm module to take advantage of the Netscape Cookie scheme. For variety, Tcl/Tk, Expect, and Python also are presented in Chapter 26. The capability to mix and match can't be stressed highly enough to the budding web developer; I therefore carefully go over examples of all the aforementioned applications.

Principles of Gateway Programming Check

Developers should have a firm grasp of the history and underlying principles of the HyperText Transfer Protocol. It is in the developer's best interest to stay attuned to the evolution of the HTTP standard and to be aware of matters currently under consideration by the W3C standards body, which is accomplished most easily via the Usenet newsgroups. The client/server transaction cycle, with or without CGI programs, should be familiar to web developers. In addition, the basic methods (GET, POST, and HEAD) and the standard ways of passing data (environment variables and standard input) should be part of our basic vocabulary.
Footnotes

Internet Requests for Comments (RFCs) might sound like dry stuff, but the first two mentioned in this chapter are a must-read for the web developer. It's also very handy to know about the complete RFC Index (about 500KB) at http://www.cis.ohio-state.edu/htbi

A variety of excellent texts describe the TCP/IP protocol; some are more detailed than others. One set that I've enjoyed is W. Richard Stevens's TCP/IP Illustrated, Volumes I, II, and III, published by Addison-Wesley.

The Internet Multicasting Service home page is at http://www.town.hall.org/, and you can find its discussion of "New and Trendy Protocols" at http://www.town.hall.org/trendy/trendy.html.

T. Berners-Lee, L. Masinter, and M. McCahill, "Uniform Resource Locators (URL)," Internet Request for Comments 1738, 12/20/1994, at http://www.cis.ohio-state.edu/htbin/rfc/rfc1738.html.

The basic HTTP specification is on-line at http://www.w3.org/pub/WWW/Protocols/HTTP/HTTP2.html, courtesy of Tim Berners-Lee. The Internet Engineering Task Force HTTP Working Group's current activities are viewable at http://www.ics.uci.edu/pub/ietf/http/.

The MIME specification is addressed in several RFCs; here are the two basic ones: RFC 1521, N. Borenstein and N. Freed, "MIME (Multipurpose Internet Mail Extensions) Part One: Mechanisms for Specifying and Describing the Format of Internet Message Bodies," 09/23/1993, available in ASCII text and PostScript; and RFC 1522, K. Moore, "MIME (Multipurpose Internet Mail Extensions) Part Two: Message Header Extensions for Non-ASCII Text," 09/23/1993, available in ASCII text.

The W3C Consortium is hosted in Europe by the French information research agency INRIA (budgetary considerations caused CERN to bow out at the end of 1994) and in the United States by the Massachusetts Institute of Technology. Its stated objective (and one well worth noting) is to "ensure the evolution of the World Wide Web (W3) protocols into a true information infrastructure in such a fashion that smooth transitions will be assured both now and in the future. Toward this goal, the MIT Consortium team will develop, support, test, disseminate W3 protocols and reference implementations of such protocols and be a vendor-neutral convenor of the community developing W3 products. In this latter role, the team will act as a coordinator for W3 development to ensure maximum possible standardization and interoperability." More information is available at http://www.w3.org/hypertext/WWW/Consortium/Prospectus/FAQ.html.

T. Berners-Lee, "Universal Resource Identifiers in WWW: A Unifying Syntax for the Expression of Names and Addresses of Objects on the Network as Used in the World-Wide Web," 06/09/1994, at http://www.cis.ohio-state.edu/htbin/rfc/rfc1630.html.

Cascading Style Sheets are discussed at http://www.w3.org/pub/WWW/Style/css/, and more information about the author, Hakon Lie, is at http://www.w3.org/pub/WWW/People/howcome/.

The next World Wide Web conference will be hosted by Stanford University in Santa Clara, California, in April 1997. Look at http://www6conf.slac.stanford.edu/ for more details.

Simon Spero, at UNC SunSITE/EIT, discusses his proof-of-concept implementation of HTTP-NG and the basic HTTP-NG architecture at http://www.w3.org/pub/WWW/Protocols/HTTP-NG/Overview.html.
