
SOFTWARE—PRACTICE AND EXPERIENCE
Softw. Pract. Exper. 2001; 31:637–665 (DOI: 10.1002/spe.381)

The design and implementation of the Gecko NFS Web proxy

Scott Baker∗,† and John H. Hartman

Department of Computer Science, The University of Arizona, Tucson, AZ 85721, U.S.A.

SUMMARY

The World Wide Web uses a naming scheme (URLs), transfer protocol (HTTP), and cache algorithms that are different from traditional network file systems. This makes it impossible for unmodified Unix applications to access the Web; as a result, the Web is only accessible to special Web-enabled applications. Gecko bridges the gap between Web-enabled applications and conventional applications, allowing any unmodified Unix application to access the Web. We developed two prototypes of the Gecko system at the University of Arizona and incorporated the many lessons learned from the first into the second. Minimizing the amount of state was a key goal of the Gecko redesign and the second prototype uses a unique compression technique to reduce server state. Experiments performed on the prototype show that Gecko not only makes the Web accessible to unmodified applications, but does so at a performance that meets or exceeds that of HTTP. Copyright 2001 John Wiley & Sons, Ltd.

KEY WORDS: NFS; HTTP; proxy; file system; hyperlink

INTRODUCTION

Despite its enormous popularity and the vast amount of information it holds, the World Wide Web is only accessible to applications designed specifically to access it. The Web transfers data via the Hypertext Transfer Protocol (HTTP) [1], which is not directly supported by contemporary operating systems. As a result, each application that wants to access the Web must implement HTTP itself, or make use of a library or execution environment that does. This is unfortunate, and means that only Web-enabled applications can use the Web; conventional applications designed to work on files cannot operate on a Web page, unless a Web-enabled application first copies the page into the local file system. For example, a user who wishes to edit a Web document in a conventional word processor must first use a Web browser, or similar Web-enabled application, to download the page into a file, then launch the word processor on the local copy.

∗Correspondence to: Scott Baker, Department of Computer Science, Gould-Simpson Building 721, The University of Arizona, Tucson, AZ 85721, U.S.A.
†E-mail: [email protected]

Contract/grant sponsor: NSF; contract/grant number: CCR-9624845, CDA-9J00991

Copyright 2001 John Wiley & Sons, Ltd.
Received 23 March 2000; Revised 30 November 2000; Accepted 30 November 2000


Figure 1. Gecko architecture. Gecko acts as an intermediary between the HTTP protocol used by the Web, and the NFS protocol used by its clients. A single Gecko server may have many clients. Both the Gecko server and its clients cache Web pages; the server ensures that client caches remain consistent with the server's copy.

Gecko provides a file system interface to the Web, allowing a conventional application to access Web pages as easily as it does files. No modifications to the application are necessary, nor must the application be recompiled. Gecko opens the Web to conventional applications, such as the following.

• Using find to find all pages reachable by following links from an initial page.
• Searching Web pages using grep. A simple spider and search engine are easily constructed from grep and find.
• Executing programs directly from the Web. Similarly, libraries stored on the Web are directly accessible to ld, eliminating the need for local copies.
• Traditional Unix utilities such as ls, cp, more, and cat may be used on Web pages to parse directory structures, copy content to local storage, and view content directly.
• Using a file system indexing tool, such as glimpse, to index Web pages.

Gecko is simultaneously an HTTP proxy and an NFS server; the Gecko server‡ accepts requests from clients using the NFS protocol and satisfies them by accessing the Web via the HTTP protocol (Figure 1). The NFS protocol is an industry-standard specification created by Sun Microsystems [2] that is available on many platforms, including Unix and Windows-based operating systems. By accessing the Web through NFS, all of the NFS functionality built into the client, such as file caching, file name caching, cache consistency, etc., operate on Web pages just as they do on NFS files. Clients share pages cached on the Gecko proxy server, and access them using the NFS protocol, instead of the slower HTTP protocol. In addition, NFS maintains the consistency of pages cached on clients, so that when one client reads a new version of a page, all clients see the change.

‡Throughout this paper we use the term client to refer to a computer that accesses Gecko, Gecko server to refer to the Gecko proxy/server, and Web server to refer to an HTTP-based server of Web pages.


Gecko is not unique in providing a file system interface to the Web. Other file systems, such as WebFS [3], provide access to the Web, but do so by installing a new type of file system into the client operating system that fetches pages from the Web via HTTP. While this technique works, it is not portable between operating systems, and does not provide for a centralized proxy cache. What is unique about Gecko is that it uses the NFS protocol to access a shared Gecko server, which in turn uses HTTP to access the Web. This has several advantages. First, no changes are required to the client machines. NFS is available for most UNIX and Windows-based operating systems and is fully integrated into the operating system. This means, among other things, that files accessed through NFS are cached by the client operating system, providing a Web page cache with no additional effort. The client operating system takes care of managing the cache, including deciding how much data to cache, and when data should be replaced. All applications on the same client share the cache, so that a page brought into the cache by one application can be accessed by another application without fetching another copy from the Web. No longer must browsers and other applications implement their own cache of Web pages to improve performance.

Second, NFS has been tuned for high-performance control operations and data transfers on a local-area network, making it a better choice for communicating with a proxy than HTTP. HTTP is based on TCP, requiring connections to be established even for small transfers. Although more recent versions of HTTP allow persistent connections, NFS avoids the issue altogether by using RPC to communicate between the clients and the NFS server. A Gecko client can access a cached Web page from the Gecko server via NFS faster than from a Squid proxy via HTTP. For example, a 32 kbyte Web page may be retrieved in only 48 ms via Gecko, whereas it takes 60 ms to retrieve the same document via HTTP.

Third, by default NFS provides stronger consistency between the client and proxy than HTTP. NFS uses a dynamic time-out mechanism to ensure that the pages cached on the clients remain consistent with the NFS server. This means that when one client reads a new version of a Web page, all Gecko clients see the updated page.

We designed and implemented two Gecko prototypes that demonstrate the viability of the Gecko architecture. Although we found the first prototype, Gecko-I [4], useful and its performance satisfactory, its implementation turned out to be much more complex than we anticipated, primarily because of the difficulty in maintaining the state the Gecko server needs to convert NFS requests into HTTP requests. In a nutshell, what can be done in a single HTTP request must be expressed by a Gecko client as a series of separate NFS requests, and the Gecko server must retain enough state to determine the relationship between these individual requests. This state represents hard state, state that cannot be lost in a Gecko server crash, as opposed to soft state, which can be lost without affecting the system's correctness. This, in addition to the complexities of caching Web pages, greatly complicated the Gecko server design, and made the entire system particularly vulnerable to server failures. Hard state also limits the system's scalability because the Gecko-I server must reliably store the state associated with all its clients.

As a result of our experiences with Gecko-I, we designed a second prototype, Gecko-II, so that the server contained primarily soft state. We found it impossible to eliminate hard state on the Gecko server entirely, but we were able to move much of it to the clients, by having them provide the necessary state to the Gecko server on each access. This was difficult to do within the confines of the NFS protocol, but the end result is a system that is substantially simpler than the first prototype, can scale to more clients, has comparable performance to Gecko-I, and better performance than accessing Squid directly via HTTP.


Figure 2. A sample Web site. The home page, index.html, contains links to many other pages, both on the same server (mysite.com) and remote servers (friend.com). There are also links to these pages from other servers (yahoo.com).

GECKO INTERFACE

Documents, also commonly known as pages, in the Web are generally stored using the HTML [5] file format. HTML is an ASCII-based page description that has embedded text formatting keywords. There are many keywords that affect page appearance such as font style, colors, paragraph alignment, etc. The most powerful feature of HTML is the hyperlink keyword that links pages together. A hyperlink is a reference from the source page to a target page, in the form of a URL embedded in the source page. When a user clicks on a hyperlink, the Web browser uses the embedded URL to load a new page. Hyperlinks can refer to pages on different Web servers as well as the same server.

Consider a Web site (Figure 2) whose main page http://www.mysite.com/index.html contains several different hyperlinks:

• an on-site link to the document products.html in the same directory as the source, index.html;
• an on-site link to a document, intro.html, in the subdirectory services;


• an off-site link to a different Web site, http://www.friend.com/, on an entirely different computer attached to the Internet.

Figure 3. Directory structure. The services directory is located in the root and contains the page intro.html.

Naming Web pages

Providing access to the Web via NFS requires that Web pages have names in the NFS name space. NFS uses the standard Unix directory hierarchy to name files. A file name consists of component names separated by ‘/’ characters, each component specifying a name mapping within the particular directory. For example, the name /a/b specifies the file or directory b within the directory a, within root directory ‘/’. A file or directory may have many names, but each name refers to a single file or directory.

Naming in the Web is more complicated. A Web page is accessed via its URL, which is also a sequence of names separated by ‘/’ characters. Although its format is similar to a Unix file name, it is not a hierarchical directory structure: even though page /a/b exists, /a may not. However, most Web servers have adopted the convention that Web naming parallels the Unix directory hierarchy. For example, the URL http://www.mysite.com/services/intro.html refers to the file services/intro.html on the server mysite.com.

This organization leads to the first Web naming scheme, the URL hierarchy. Since URLs look like Unix file names, an implicit hierarchy can be formed from the URL namespace. For example, the URL http://www.mysite.com/services/intro.html can be thought of as a child of http://www.mysite.com/services/ (Figure 3). Although it is tempting to map the URL hierarchy directly to the Unix directory hierarchy in this fashion, there are several complications. First, unlike the Unix directory hierarchy, the URL hierarchy may have disconnected components. Although http://www.mysite.com/services/intro.html is a valid URL, http://www.mysite.com/services/ may not be; an attempt to access the page at the latter URL will fail because the page does not exist.

Second, even if http://www.mysite.com/services/ exists, the Web provides no direct way of learning its children in the URL hierarchy. A Unix directory is a special type of file that contains the names of other files and directories that are its children, forming the directory hierarchy. Unlike the Web, Unix provides facilities for obtaining the children of a directory. Furthermore, the HTML contents of a page may bear little or no relationship to its children in the URL hierarchy. It is possible and common for Web sites to have hidden pages that are not linked to by any other page. The URL for a hidden page cannot be found by browsing the Web.


Figure 4. HTTP request/reply structure. It takes both a URL and the request headers to identify a Web document. Even this is not quite sufficient, as the Web server may return a different page for identical requests (e.g. to change the advertising).

Many Web pages are stored in HTML format, allowing pages to be named by the HTML link graph. HTML pages may contain URLs that refer to related pages, embedded images, frames, etc. Relative page names can therefore be formed from the source page containing the hyperlink, plus the identification of the hyperlink within the page. The link graph represents the naming scheme used by the browsing paradigm, in which a user moves from page to page by following hyperlinks. There need be no relationship between a page's URL and the URLs of the links it contains; they may be in different directories on the same server or different servers altogether. Thus any page can have a link to any other page, creating a directed graph that may have cycles.

Despite the different ways of naming a Web page, neither a page's URL nor its location in the link graph uniquely identifies its contents. When a client requests a page it specifies a set of request headers that may be used by the server to tailor the page contents (Figure 4). The headers are ASCII strings that include such information as the client's preferred language, browser type and version, authentication information, etc. (Figure 5). The Web server can use this information to modify the page contents; for example, a URL requested with a Spanish language preference may yield different results than the same URL requested with an English preference. The request headers are thus an integral part of naming, because changing the headers may change the resulting page. An arbitrary number of headers may be specified, and each header may be of an arbitrary length, so an arbitrary length string is required to identify a Web page.


Language: English
User-Agent: Mozilla/3.02Gold (Win95; U)
Accept: image/gif, image/jpeg, */*
Referrer: http://www.mysite.com/someurl.html
Host: mysite.com
Cookie: user=bill
Authorization: Basic MFJoKTb9MTU6QmFvPYI=

Figure 5. Sample request headers. Clients can provide headers when requesting a page that allow the server to tailor the page contents.

There are generally two types of headers: those that have a relatively small, finite set of values that can be enumerated, and those that take on any value. For example, there are relatively few languages in use on the Web, making the ‘language’ header of the enumeration variety. The User-Agent and Accept headers are also of this same variety. Other headers, such as Referrer, Cookie, and Authorization, cannot be enumerated; these headers may potentially change often and be different for each Web page that is retrieved. The distinction between the seldom changing enumeration headers and often changing non-enumeration headers will play an important role in compressing the header sets.

There are several complications in mapping the Web's dual naming schemes to Unix file names. First, a user may specify a page through either the link graph, by specifying a link within another page, or the URL directly. There may be many links to the same URL, and many URLs for the same page. Second, although every page must have a URL, not every page is accessible from the link graph, i.e. hidden pages have no links that point to them. This means that both the link graph and the URL hierarchy must be encoded in the Unix naming scheme. Last, request headers enable a single URL to refer to multiple pages. This must also be handled by the naming scheme.

Mapping Web names to Unix names

Gecko maps a URL to a Unix file name by treating each component of the URL as a directory, and every Web page as a Unix directory whose contents reflect the structure of the page. The URL http://www.mysite.com/services/intro.html is converted into the Unix file name /web/www.mysite.com/services/intro.html. The page is represented by a Unix directory containing a .contents file that contains the HTML contents of the page, and a .headers file that contains the response headers returned with the page. Each hyperlink in the page is represented by a symbolic link that starts with the prefix .link and points to the target of the hyperlink. If the page contains a reference to http://www.friend.com/index.html then the directory /www.mysite.com/services/index.html will contain a symbolic link to ../../www.friend.com/index.html. This arrangement allows both the URL hierarchy and the link graph to be traversed in the Unix name space.
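The following C sketch illustrates one way such a mapping might be computed; it is not taken from the Gecko sources, and the function name, /web mount point, and buffer handling are assumptions.

#include <stdio.h>
#include <string.h>

/* Sketch: map "http://host/path" to "/web/host/path". Header sets,
   escaping of special characters, and error handling are omitted. */
static int url_to_unix_path(const char *url, char *out, size_t outlen)
{
    const char *prefix = "http://";
    if (strncmp(url, prefix, strlen(prefix)) != 0)
        return -1;                                /* only plain HTTP URLs */
    const char *rest = url + strlen(prefix);      /* "host/path" */
    if (snprintf(out, outlen, "/web/%s", rest) >= (int)outlen)
        return -1;                                /* truncated */
    return 0;
}

The page body would then be read from the .contents file beneath the resulting directory, for example /web/www.mysite.com/services/intro.html/.contents.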


Although this naming scheme allows Web pages to be named using the Unix naming scheme, the directories that represent Web pages behave slightly differently from standard Unix directories. There is no way to query a Web server to find out the children of a URL short of trying all possible names, which means that the best Gecko can do is to add children to a directory as they are accessed. The directory contents are learned as the directory's children are accessed, so that the directory contents change over time even though the underlying Web page does not.

There are several ways of integrating request headers into this naming scheme. A simple solution is to embed them in the Unix name for the page, by concatenating them with the page's URL. For example:

HTTP Request:                                   Unix Request:
http://www.mysite.com/services/intro.html      /web/languageenglish/user-agentmozilla 3.02/
Language: English                                  www.mysite.com/services/intro.html
User-Agent: Mozilla/3.02

The downside of this approach is that the Unix pathname grows very large and unwieldy. Most Web requests have at least a half-dozen headers, and with advanced HTTP features such as cookies, the total length of the headers may exceed 1 kbyte, the maximum length of a file name on many Unix systems. Even if the headers were to fit in the file name, standard Unix applications and users would have to contend with long and complex pathnames.

Another possibility is to associate the request headers with the desired URL implicitly. In this solution, the Gecko server maintains a set of headers that it sends to Web servers along with the requested URLs. Clients use an out-of-band communication channel to modify the headers on the server. As long as the headers are set up appropriately, this method allows applications and users to access pages by specifying only the URL. An example is:

HTTP Request:                                   Unix Request:
http://www.mysite.com/services/intro.html      set "Language: English" via alternate channel
Language: English                               set "User-Agent: Mozilla/3.02" via alternate channel
User-Agent: Mozilla/3.02                        use /web/www.mysite.com/services/intro.html

This approach encodes only the URL in the pathname, which is manageable both by users and by existing applications. However, it does not support concurrent accesses. If, for example, two different users wish to use different headers to access the Web, their modifications to the set of headers may conflict and lead to unpredictable results. The only way to avoid this conflict is to serialize requests from different users, which would prohibit multitasking and impose a significant performance penalty.

Header sets

In Gecko we use a hybrid approach based on both an extension to the file system name space and implicit headers. The Gecko server maintains a collection of header sets, each containing one or more request headers. Clients can create and delete header sets, as well as modify their contents. Each header set has a unique identifier, and the Unix name space is extended by adding this identifier to the beginning of the Unix file name. This allows a user or application to specify which header set should be used when fetching a particular URL, and avoids the synchronization problems of a single header set. For example:


HTTP Request:                                   Unix Request:
http://www.mysite.com/services/intro.html      put "Language: English" into header set #1234
Language: English                               put "User-Agent: Mozilla/3.02" into header set #1234
User-Agent: Mozilla/3.02                        use pathname /web/#1234/www.mysite.com/services/intro.html

This is a compromise between long file names and simplicity. Each client can maintain its own header set(s) on the server, avoiding race conditions between clients. Clients can share header sets as long as they synchronize between themselves, and the server can use the same underlying storage for header sets with the same contents. Efficiency is improved because each client process manipulates its own header set independently, and only needs to modify the headers as necessary. Many headers, such as the browser type, may never change, while others only change slowly. Each client updates its own headers as appropriate, without interacting with other clients. A default header set containing reasonable defaults is available for applications that do not wish to use customized header sets.
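Purely for illustration (the identifier value, the /web mount point, and the helper name are assumptions, not part of Gecko's interface), a client-side helper could compose such a path as follows:

#include <stdio.h>

/* Sketch: build "/web/#<header-set-id>/<host-and-path>" for a page.
   The hexadecimal identifier format is assumed. */
int header_set_path(unsigned long id, const char *host_and_path,
                    char *out, size_t outlen)
{
    int n = snprintf(out, outlen, "/web/#%lx/%s", id, host_and_path);
    return (n > 0 && (size_t)n < outlen) ? 0 : -1;
}

/* header_set_path(0x1234, "www.mysite.com/services/intro.html", buf, sizeof(buf))
   yields "/web/#1234/www.mysite.com/services/intro.html". */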

Privacy and security

Privacy and security are issues in Gecko, as they are in any file system. Gecko must ensure that a user's privacy is not compromised as he or she accesses the Web, and it must also ensure that one user cannot modify another's pages, nor masquerade as another user. Our security and privacy goals for Gecko are as follows.

• A page that is retrieved by one user must not be viewable by another user, unless the user wishes it to be public.

• Any metadata associated with a user's request must be private, unless the user wishes it to be public. A user should not be able to see or modify another user's header sets without his or her permission.

• The URLs accessed by a user must not be visible to others, unless the user wishes them to be public. A user should not be able to see what Web pages another has accessed.

Gecko must address these privacy and security concerns within the semantics of the NFS protocol. NFS inherits its security features from the Unix file system, which provides the notion of file ownership, and three sets of permissions: individual, group, and other. If the group and other permission bits are disabled by the owner, then only the user that owns the file will be able to view the contents. Similarly, only the owner of a directory will be able to view the children of a directory.

Gecko uses these permission bits in conjunction with header sets to enforce security. The user that creates a header set automatically owns all files that are retrieved with that header set. The group and other permission bits may be disabled, effectively hiding the pages and directories from all other users.

Privacy and sharing are at odds with one another. If all users protect all pages, then no sharing is possible. On the other hand, if all pages are sharable then no privacy is possible. Privacy settings may be configured for a header set by its owner. A user can make a header set private by setting the UNIX permission bits appropriately.

For example, Gecko includes a companion client-side proxy that facilitates connecting HTTP-based applications to the NFS file system. The client-side proxy may be configured via a command-line option to provide the following security defaults.

• Everything Public: All data retrieved by a user are public and therefore visible to everyone.


• Cookies and Authentication Private, all else Public: If the user uses cookies or authentication, then the page and directories retrieved with those headers will be private. Otherwise the page will be public. This is a good compromise in that it hides the items that contain private information while letting everything else be shared.

• Everything Private: All data retrieved by a given user are private and viewable to nobody other than the owner.

Gecko example

The following example illustrates how the Web is accessed through Gecko. In the following the user has mounted the Gecko file system as the directory /web. The user begins by issuing a cd command into the directory structure and into the University of Arizona home page:

>cd /web/www.cs.arizona.edu

>ls -ao
total 31
-rw-rw-rw- 2 nobody 3550 Oct 15 17:15 .contents
-rw-rw-rw- 2 nobody  257 Oct 15 17:15 .headers
lrwxrwxrwx 2 nobody 4096 Dec 31  1969 .link0 -> .../w3.arizona.edu:180/enroll/
lrwxrwxrwx 2 nobody 4096 Dec 31  1969 .link1 -> .../www.arizona.edu/estudents.html/
lrwxrwxrwx 2 nobody 4096 Dec 31  1969 ... -> ../
drwxrwxrwx 2 nobody 4096 Dec 31  1969 estudents.html/
(subsequent directory entries eliminated for brevity)

The directory for the www.arizona.edu page contains several files: the .contents file containing the page's HTML; a .headers file that contains the headers returned when the page was originally fetched from the Web server; and several .link files, each of which represents a hyperlink in the page and is a symbolic link to the Gecko path name of the hyperlink's target. The symbolic links are relative to allow different clients to mount Gecko at different places in their file systems; the ‘...’ symbolic link reduces the complexity of the links in the .link files by factoring out the common prefix.

The user views the contents and headers of the Web page as follows:

>cat .headers

HTTP/1.0 200 Document Follows
Date: Fri, 10 Dec 1999 19:28:32 GMT
Server: Apache/1.2.4
Last-Modified: Thu, 09 Dec 1999 21:01:27 GMT
ETag: "2906-ded-361d2827"
Content-Length: 3565
Accept-Ranges: bytes
Connection: close
Content-Type: text/html


>cat .contents
<html>
<head>
<TITLE>The University of Arizona</TITLE>
</head>
(remainder eliminated for brevity)

Following a hyperlink is as simple as using the cd command to change to the directory that represents the target of the link. When this is executed, the Linux pwd/getcwd algorithm automatically resolves the symbolic link and determines the correct directory.

>cd .link1; pwd
/web/www.arizona.edu/estudents.html

Translating between HTTP and NFS

Gecko uses the NFS protocol, instead of HTTP, to transfer page contents from the Gecko server to the client. This not only allows unmodified clients to access Gecko, but also improves performance on a local-area network, and ensures consistency between pages cached on the clients and on the server.

The Gecko server provides NFS access to its clients, and services those requests by accessing the Web via HTTP. Thus the Gecko server must translate between the NFS and HTTP protocols. The process of transferring a file between a server and a client can be broken down into four distinct phases. Name resolution maps the textual representation of the file name into an internal representation that uniquely identifies the file and is used by subsequent phases. Attribute retrieval retrieves essential file information necessary to complete the transfer, such as permissions, content type and length, and modification time. Content transfer transfers the file data from the server to the client, and the cleanup phase closes the file and terminates the transfer.

Although the HTTP and NFS transfer protocols are quite different, Gecko provides the entire HTTP functionality via the NFS protocol. There are several idiosyncrasies of the protocols that make translating between the two difficult. One is that HTTP supports a refresh function that a client can use to fetch a fresh copy of the page directly from the Web server, bypassing any intervening caches. NFS provides no direct support for refresh. Second, the client-side NFS implementations use a file inode number provided by the NFS server to identify files uniquely. HTTP does not provide unique identifiers for pages, making it difficult for the Gecko server to generate inode numbers for pages. For both of these cases we have developed work-arounds that allow Gecko to translate between HTTP and NFS properly.

HTTP transfers

HTTP performs the entire transfer via a single TCP connection. TCP [6] connections are bi-directional character streams through which the client may write an arbitrary number of bytes to the server and the server may respond by writing an arbitrary number of bytes to the client. There is no explicit request/reply order inherent in the TCP protocol, nor is there any explicit bundling of bytes into fixed-size records. To access a Web page the client opens the TCP connection to the Web server, sends the URL for the requested page and associated request headers, and receives the requested page over the same connection.


The client can then close the connection or request another page on the same connection. Since all four phases of file transfer are accomplished in one connection, once the URL and headers have been sent there is no ambiguity about which page the client is accessing. There is no need for the server to retain a client-specific state between connections.

Name resolution in HTTP is a two-step process. First, the client must extract the server's name from the URL and use this name to create a connection to the server. The client then sends the URL and request headers to the server, and the server converts them into an internal representation; for most Web servers the URL is the file name of the page in the server's local file system, although this need not be the case.

Attribute retrieval is performed once the client has sent the URL and headers. Typically a Web server acquires attributes such as the file size and modification time from the attributes of the file that contains the page. Other attributes, such as the content type of the file or cookies to be returned along with the page, are generated by the Web server itself. The block of headers is sent to the client as a series of plain text strings terminated by a blank line.

Content transfer is performed by sending the entire page over the TCP connection. HTTP does not perform any error checking or flow control, instead relying on the TCP protocol for these. Although newer specifications of HTTP do permit the client to request only a portion of the file contents, most HTTP implementations require the entire contents to be transmitted at once. One complication with requesting only part of a page is that there is no guarantee that subsequent requests for additional parts will come from the same page; the server may change the page on every access, to change advertising, for example.

Transfer termination is done by closing the TCP connection. The TCP protocol is responsible for communicating the end of transmission and other cleanup details. Once the client detects a closed connection, it knows the entire page has been received.

NFS transfers

Rather than handling the entire transfer in a single connection, NFS breaks the transfer down into several RPCs. NFS uses UDP as the communication protocol between machines, instead of HTTP's TCP. UDP is a non-reliable, datagram protocol that has less overhead than TCP, but also provides less functionality.

Name resolution involves both the client and the server. The client breaks the file name into its individual components, and contacts the server to resolve each independently. For each component, the client sends the server a fixed-size handle for the parent directory and the component name. An NFS handle is either 32 or 64 bytes of data, depending on the version of NFS, that uniquely identify a file. The server replies to the request with a handle for the child. The handle for the root directory is given to the client when the file system is mounted. This style of name resolution allows the client to cache the results for use during subsequent name resolutions.
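The component-by-component lookup can be sketched in C as follows; nfs_lookup() stands in for the NFS Lookup RPC and the handle size shown is the NFS version 2 value, so the details are illustrative rather than taken from an actual client.

#include <string.h>

#define NFS2_HANDLE_SIZE 32

struct nfs_handle { unsigned char data[NFS2_HANDLE_SIZE]; };

/* Placeholder for the NFS Lookup RPC: resolve 'name' within directory
   'dir', filling 'child' on success. Returns 0 on success, -1 on failure. */
extern int nfs_lookup(const struct nfs_handle *dir, const char *name,
                      struct nfs_handle *child);

/* Resolve "a/b/c" one component at a time, starting from the root handle
   obtained at mount time. A real client would consult its name cache
   before issuing each Lookup. */
int resolve_path(const struct nfs_handle *root, const char *path,
                 struct nfs_handle *result)
{
    char buf[1024], *component, *save;
    struct nfs_handle cur = *root, next;

    strncpy(buf, path, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';
    for (component = strtok_r(buf, "/", &save); component != NULL;
         component = strtok_r(NULL, "/", &save)) {
        if (nfs_lookup(&cur, component, &next) != 0)
            return -1;                 /* component does not exist */
        cur = next;
    }
    *result = cur;
    return 0;
}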

An NFS client receives file attributes when it invokes various NFS functions on the server (such as Lookup), and it can request them explicitly via the GetAttr NFS function. This function is used primarily to verify that the cached copy of a file is up-to-date, by comparing its modification time with that of the server's copy. NFS attributes are fixed-size and only include the standard Unix attributes such as permissions, file size, modification time, etc. There is no provision for including arbitrary header information as with HTTP.


The NFS Read function is used to transfer the contents of a file from the server to the client. The client specifies the handle of the file from which to read, the starting offset within the file, and the number of bytes desired. The client can issue reads of any size, although most operating systems break reads into manageable units of 8 kB or smaller. Reads may be issued out of order as in a random seek and may skip unwanted sections of the file. Most client operating systems support some form of read ahead and/or parallel read capability to improve performance.

NFS transfers are not explicitly terminated as are HTTP transfers. A client can cache a file handle indefinitely and use it to read from the file. Most clients poll the server periodically using the GetAttr function to verify the consistency of cached files; the polling period is typically 3–60 s, depending on how long it has been since the file was last modified. This implies that although a client can hold a handle indefinitely, accesses to the file will be preceded by a Lookup or GetAttr request at most 60 s beforehand.

Page refresh

HTTP allows the client to request a fresh version of a document directly from the Web server, bypassing any cached copies. This is useful for pages containing information such as stock quotes where the user does not want stale information. Refresh in HTTP is accomplished by sending a request with the header ‘Pragma: no-cache’. This header causes intermediate proxies to bypass their caches.

The page refresh function does not mesh well with a standard Unix file system interface. Unix typically does not provide a mechanism for forcing a file access to bypass the cache, or even to flush a file from the cache. Furthermore, an NFS client keeps its cache up-to-date with the server by polling periodically, and there is no way for an application program to force the client to poll. Gecko overcomes this problem by allowing a version number to be appended to a file name that simply informs the Gecko server that a new copy of the file should be fetched from the Web server. For example, an application requesting /www.myhost.com/someurl.html/.contents may request a fresh version by using /www.myhost.com/someurl.html/.contents#1. The client-side operating system and the NFS protocol will treat these as different files, bypassing any cached copies.
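On the server side, splitting the version suffix off the name might look like the C sketch below; the function name and the exact refresh policy (treating any ‘#’ suffix as a request for a fresh copy) are assumptions.

#include <stdlib.h>
#include <string.h>

/* Sketch: split a component such as ".contents#1" into its base name and
   version number. The presence of a version suffix is taken as a signal
   to fetch a fresh copy (e.g. by adding "Pragma: no-cache" upstream). */
int parse_refresh_suffix(const char *name, char *base, size_t baselen,
                         unsigned long *version, int *refresh)
{
    const char *hash = strrchr(name, '#');
    size_t n = hash ? (size_t)(hash - name) : strlen(name);

    if (n >= baselen)
        return -1;                       /* base buffer too small */
    memcpy(base, name, n);
    base[n] = '\0';
    *refresh = (hash != NULL);
    *version = hash ? strtoul(hash + 1, NULL, 10) : 0;
    return 0;
}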

Unix inode numbers

In addition to the NFS handles that identify files, the NFS protocol also assigns each file a unique 32-bit inode number. Although both the handle and the inode number uniquely identify a file, making the inode number redundant, most Unix operating systems require each file to have an inode number. NFS thus provides for the file's inode number to be returned along with the handle; this is easy for a standard NFS file server as it simply returns the inode number for the underlying file. Assigning inode numbers is much more difficult for Gecko, as the pages it serves are obtained from the Web, not a local file system.

The simplest solution to this problem is to assign inode numbers sequentially. The root is assigned inode number 0, the first page accessed is number 1, and so on. However, there are only 2^32 possible inode numbers and eventually the sequence will wrap. When this happens several pages may be given the same inode number, causing clients to confuse their contents.

Another possible solution is to always issue a new page the least-recently-used inode number. The 2^32 inode numbers almost ensure that any file previously issued the same number is no longer cached by any client.


On the other hand, the 2^32 inode numbers make it impractical to maintain a data structure that keeps track of the order in which they are issued.

Neither of these solutions is satisfactory, so we decided to use a hybrid approach based on inode groups. The 32-bit inode number is partitioned into a 16-bit inode group number and a 16-bit inode counter. Inode groups are assigned in least-recently-used order, but inode numbers within the groups are assigned sequentially. Each Web site is allocated its own inode group and a site is allocated multiple inode groups if it contains more than 2^16 pages. The idea is that a given client generally accesses only a few Web sites at a time; we assume that a client rarely accesses pages from many different sites at the same time. As the clients navigate to new Web sites, new inode groups are needed at about the same rate that old inode groups are discarded. Keeping the inode groups in least-recently-used order is feasible because there are only 2^16 of them, and keeping track of the next inode number to issue in each group requires only one counter per group.
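A minimal C sketch of such an allocator is shown below. The structure layout, the linear LRU scan, and the treatment of counter wrap-around (a site that exhausts its 2^16 numbers would really be given an additional group) are simplifying assumptions, not Gecko's actual code.

#include <stdint.h>
#include <string.h>

#define NGROUPS 65536                 /* 16-bit inode group number */

struct inode_group {
    char     site[64];                /* Web site this group is assigned to */
    uint16_t next;                    /* next 16-bit counter value to issue */
    uint32_t last_used;               /* coarse timestamp for LRU replacement */
};

static struct inode_group groups[NGROUPS];

/* Return a 32-bit inode number for a page on 'site':
   high 16 bits = group index, low 16 bits = per-group counter. */
uint32_t alloc_inode(const char *site, uint32_t now)
{
    int g, lru = 0;
    for (g = 0; g < NGROUPS; g++) {               /* find the site's group */
        if (groups[g].site[0] != '\0' && strcmp(groups[g].site, site) == 0)
            break;
        if (groups[g].last_used < groups[lru].last_used)
            lru = g;                              /* remember LRU candidate */
    }
    if (g == NGROUPS) {                           /* no group yet: recycle LRU */
        g = lru;
        strncpy(groups[g].site, site, sizeof(groups[g].site) - 1);
        groups[g].site[sizeof(groups[g].site) - 1] = '\0';
        groups[g].next = 0;
    }
    groups[g].last_used = now;
    return ((uint32_t)g << 16) | groups[g].next++;
}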

GECKO SERVER STATE

One of the challenges in designing Gecko, or any system with a client/server architecture, is managing server state. Ideally, at least in terms of the server implementation effort and the system scalability, the server would contain no client-related state. If the server does not remember anything about its clients, then the server can scale to any number of clients without running out of resources in which to store client state, and a server crash does not lose irreplaceable data.

This ideal is often difficult to achieve, however, as the server usually must retain some information about a client; if nothing else, the communication protocol used between the client and server probably maintains state on both ends. The important distinction about server state is whether it is hard state or soft state. Hard state is state that cannot be lost without affecting the correctness of the system, whereas soft state either does not affect correctness or can be reconstructed in some way. Hard server state should be avoided, whereas soft server state can be used to improve system performance without affecting its correctness. Soft state has the additional advantage that it can be discarded by the server because of space constraints.

Both NFS and the Web employ only soft server state as a way of increasing system scalability and tolerating server failures. NFS servers do not keep track of information such as which files each client has open or cached. The server also commits all data written to disk, so that the server memory never contains any data that are not also stored on the disk. For this reason an NFS server is often referred to as stateless, although this is a misnomer because it does contain soft state such as an in-memory file cache. NFS uses handles to avoid hard state on the server. A handle uniquely identifies a file, and is returned to the client by the server as part of name resolution. The client passes the handle to the server on subsequent accesses to the file, and the handle contains enough information for the server to determine which file to access, even if the server has crashed in the interim. NFS thereby avoids hard server state by requiring that the client store the handle and present it to the server when needed. Although this means that the handle may be lost in a client crash, presumably the applications that are using the file associated with the handle are also lost, so that system correctness is not an issue.

The Web also avoids hard server state by pushing state onto the client and requiring the client to present the state to the server when appropriate. The HTTP protocol provides a general mechanism for doing this called cookies. A cookie is an ASCII string returned by a Web server to a client in response to a page access.


A cookie is variable in length, and its contents are not interpreted by the client. The client simply sends the cookie on subsequent accesses to the same server, allowing the server to retain information about the client. For example, a server may return a cookie to a client when the client provides the correct password for a site. The client provides this cookie on subsequent accesses, allowing the server to verify that the client has already logged-in to the site. A client crash may cause the cookie to be lost, but it does not affect the server operation. Most Web clients store their cookies in persistent storage such as a disk to avoid the inconvenience of losing them on a client or browser crash.

Ideally, the Gecko server should also contain only soft state, but this is difficult to do in its role as a proxy between NFS and HTTP. The Gecko server must map between NFS file handles and HTTP URLs and headers. HTTP also provides cookies to the client, in this case the Gecko server, that are to be supplied on subsequent accesses to the same Web server. The URLs, headers, and cookies represent hard state for Gecko, as they cannot be lost without affecting the correctness of the system. If, for example, the Gecko server loses a cookie that represents a successful log-in to a site, then authorization will fail the next time the site is visited and the user will have to retype the password. Unfortunately, Gecko cannot fully solve this state problem by passing the state to its clients. The NFS protocol does not provide a general cookie mechanism, so the amount of information the Gecko server can encode in the existing NFS protocol and return to the client is limited. This makes it impossible to avoid hard state on the Gecko server entirely.

Furthermore, the Gecko server can only learn directory contents over time, as the children of the directory are accessed. This information too represents hard state, since its loss means that the directory contents are no longer known. It would come as a shock to an application that a directory it has been accessing is now empty.

GECKO-I

To validate the Gecko design we implemented a prototype Gecko server, Gecko-I [4]. The server is implemented in the C programming language on the Linux operating system, version 2.0.32. The Linux thread library, Pthreads, is used for synchronization, and the SunRPC library is used to implement the NFS version 2 protocol. The Gecko clients can run any operating system that supports NFS.

The major implementation issues of Gecko-I are how to convert between NFS handles and HTTP URLs and headers, and how to cache data.

Name state

NFS clients use fixed-size handles to refer to files, whereas HTTP uses a combination of URL and request headers. When a client accesses a Web page represented by an NFS handle the Gecko-I server must somehow convert the handle into its associated URL and headers. Although the handles are opaque to the clients and the Gecko-I server may therefore encode any information it wants in them, a URL and headers may be several thousand bytes long, making it impossible to store them in the handle directly. For this reason the Gecko-I server maintains a database that maps NFS handles into URLs and headers.

Header sets in Gecko-I are identified by a 64-bit header set identifier and stored in an in-memory hash table. The header set identifier is subdivided into two halves, a 32-bit owner field that identifies the user who created the header set, and a 32-bit session field that allows each user to have 2^32 header sets.


The name of a Gecko-I header set is simply the ASCII hexadecimal representation of the 32-bit identifier. For example, user #1 may create his third header set by naming it /web/#000000010000003. A user updates his header set by writing to a special .headerset file in his header set root directory.

Gecko-I computes the MD5 checksum [7] of a page's URL to generate the NFS handle for that page. The MD5 algorithm is cryptographically secure, making it practically impossible for two URLs to map to the same handle, and ensuring that the handles are uniformly distributed within the space of valid handles. The MD5 checksum is 16 bytes long and is used as the first half of the 32-byte NFS handle. The remaining 16 bytes contain the header set identifier and miscellaneous file-specific flags. The Gecko-I server stores the URL information in its database, indexed by the NFS handle. The information in the in-memory header set hash table is not stored to disk by the Gecko-I prototype and is lost if Gecko-I is restarted.
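A handle built this way might be laid out as in the following C sketch; the exact arrangement of the second 16 bytes is an assumption, and OpenSSL's MD5() is used only for illustration.

#include <stdint.h>
#include <string.h>
#include <openssl/md5.h>

/* Sketch of a 32-byte NFS v2 handle: 16 bytes of MD5(URL) plus the
   header set identifier and flags. The field layout is assumed. */
struct gecko_handle {
    unsigned char url_md5[MD5_DIGEST_LENGTH];   /* 16 bytes: MD5 of the URL */
    uint64_t      header_set;                   /* 64-bit header set identifier */
    uint32_t      flags;                        /* miscellaneous file flags */
    uint32_t      pad;                          /* padding out to 32 bytes */
};

void make_handle(const char *url, uint64_t header_set, uint32_t flags,
                 struct gecko_handle *h)
{
    MD5((const unsigned char *)url, strlen(url), h->url_md5);
    h->header_set = header_set;
    h->flags = flags;
    h->pad = 0;
}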

An administrator-configurable limit, 65 535 by default, is imposed on the number of handles stored in the database. If this limit is exceeded, then handles are discarded in LRU order. To avoid discarding a handle that is still in use, a handle is only discarded if it has not been used within the last 60 s. We assume that clients adhere to the default NFS behavior of checking the consistency of cached files and attributes at least every 60 s. Before a client reuses a handle that is more than 60 s old, it will usually perform a Lookup followed by a GetAttr to ensure that the handle and any cached data are still valid. These operations provide the Gecko-I server with enough information to regenerate the URL and request headers for the page. If Gecko were to discard a handle within 60 s of its last use, then it is possible that a client is still using the handle and will issue another request, such as a Read, on it. This will cause the request to fail. The only time that the client does not perform a Lookup and GetAttr on an old handle is if an application has had the file open the entire time. In this case, the client will re-use the handle without validating its consistency. Thus, Gecko-I does not handle files that are open longer than 60 s correctly if the handle is discarded in the interim. Unfortunately, NFS clients do not notify the server when a file is opened or closed, making it impossible to eliminate this window of vulnerability.

If a request is made that requires creating a new handle, but all existing handles have been used within the last 60 s, the NFS request is dropped. This causes the client to retry the request indefinitely, as per the NFS protocol. Eventually a handle should age to the point that it can be discarded, allowing the pending request to complete. The default server has 65 535 handles and it is unlikely that they will all be used in the last 60 s, so requests should be rarely dropped.

The server stores handle information in a gdbm [8] database. Gdbm automatically moves data between memory and disk, allowing the server's handle information to exceed the size of physical memory. Gdbm is operated in fast mode with consistency relaxed in order to improve performance. In addition to the URL, each handle record also contains information on its children to assist in directory read operations.
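The handle database can be pictured with the gdbm sketch below; the wrapper names and the record layout (a bare URL string) are assumptions, and a real record would also carry the directory children described above.

#include <gdbm.h>
#include <string.h>

static GDBM_FILE handle_db;

/* GDBM_FAST corresponds to the relaxed-consistency "fast mode" above. */
int handle_db_open(const char *path)
{
    handle_db = gdbm_open(path, 0, GDBM_WRCREAT | GDBM_FAST, 0644, NULL);
    return handle_db ? 0 : -1;
}

/* Store the URL string keyed by the 32-byte NFS handle. */
int handle_db_store(const void *handle, size_t hlen, const char *url)
{
    datum key = { (char *)handle, (int)hlen };
    datum val = { (char *)url, (int)strlen(url) + 1 };   /* keep the NUL */
    return gdbm_store(handle_db, key, val, GDBM_REPLACE);
}

/* Fetch the URL for a handle; the caller free()s the returned pointer. */
char *handle_db_fetch(const void *handle, size_t hlen)
{
    datum key = { (char *)handle, (int)hlen };
    datum val = gdbm_fetch(handle_db, key);
    return val.dptr;                   /* NULL if the handle is unknown */
}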

Directory state

Directory information is stored in the same gdbm database that holds the other handle-related information. The database contains the handle, name, and URL for each child of a directory. As new children are added to a handle, the gdbm record associated with that handle grows dynamically to contain the additional data.


Page cache

The Gecko-I server contains a large cache of Web pages to improve access performance and allow clients to share pages. The Gecko-I server has two caches: an in-memory cache that holds pages that are presently open and memory-mapped by the server, and a larger disk cache. Gecko-I keeps recently accessed pages open and mapped into the server's memory so that accesses to those pages are fast. Read operations, which may occur in rapid succession on a page, are served directly from the memory cache whenever possible and bypass the disk cache, page locking layer, and gdbm handle database.

The disk cache stores Web pages in standard Unix files, named by the MD5 checksum of the pages' URLs. The cache is configured as a tree structure with each node representing a directory that contains at most 256 entries to avoid the performance degradation caused by large Unix directories. The MD5 checksum contained in a handle is treated as a pathname consisting of 32 component names that are used to traverse the disk cache directory tree. The size of and load on the directories in this tree should be fairly balanced because MD5 distributes checksums uniformly throughout the checksum space.
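One possible reading of that layout, with one hex digit per path component (which keeps each directory well under the 256-entry limit), is sketched below in C; the cache root and output format are assumptions.

#include <stdio.h>

/* Sketch: turn a 16-byte MD5 checksum into a nested cache path with one
   hex digit per directory level (32 components in all). */
int md5_to_cache_path(const unsigned char md5[16], const char *cache_root,
                      char *out, size_t outlen)
{
    size_t n = snprintf(out, outlen, "%s", cache_root);
    for (int i = 0; i < 16 && n < outlen; i++)
        n += snprintf(out + n, outlen - n, "/%x/%x",
                      md5[i] >> 4, md5[i] & 0x0f);
    return (n < outlen) ? 0 : -1;      /* -1 if the buffer was too small */
}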

Experiences with Gecko-I

We have successfully used Gecko-I to access the Web via a browser and through standard Unix commands. We have used the Unix cd and ls commands to traverse the Web, and the cp command to copy Web pages to the local file system. Executables can be invoked directly, and libraries can be used to link programs. The Unix command grep can be used to search Web pages, and we have built a simple spider and search engine using find and grep together. In general the experience has been positive, and the advantages of providing a file-system interface to the Web readily apparent.

However, we also found many shortcomings, both during the implementation of Gecko-I and during the testing phase. The centralized database in particular presented several problems.

• The database contents represent hard state, so if the database is lost or corrupted due to an unexpected shutdown, clients are unable to resolve their NFS handles, and client applications fail.

• Since all state is concentrated in the database, lock contention proved to be a serious problem, in terms of performance, difficulty of implementation, and correctness.

• Rather than compartmentalizing functionality into independent modules, all code was organized about the database, leading to significant interdependencies, since several logically-separate modules share the same database record.

All of these shortcomings were caused by maintaining hard state on the Gecko-I server. For this reason we developed a new prototype, Gecko-II, that relies more on soft server state than on hard server state.

GECKO-II

The overall design goal for Gecko-II is to push as much hard server state as feasible onto the clients, leaving only soft state on the server. The limited functionality of the NFS protocol makes this difficult, and precludes eliminating all hard state on the Gecko-II server, but reducing the amount of hard state simplifies the server design and increases the scalability of the system.


Figure 6. Gecko and Squid. Client computers speak NFS to Gecko, which in turn forwards the request to Squid via HTTP. Squid is responsible for retrieving the page from the Web.

Gecko-II employs a more modular design than Gecko-I. The Page Retrieval module is responsible for retrieving page contents. Gecko-II does not implement a page cache, relying instead on an external caching proxy. The Attribute Cache and Directory Cache associate attributes and children with each page. The conversion between handles and URLs and headers is done by the Handle Resolution module.

Page retrieval

The page retrieval module retrieves pages from the Web as necessary. The arguments to the page retrieval module are the Handle, Offset, and Length of the read operation to be performed, and it uses HTTP to fetch the data for the page.

Although the page retrieval module could talk to the Web directly, we chose to interface it with an external proxy cache, such as Squid [9], for performance reasons (Figure 6). Since the external proxy cache handles all aspects of HTTP caching, no memory or disk caches are implemented by Gecko-II. Although the Gecko-II server and Squid proxy can be located on different computers, we anticipate that they will often be co-located. For this reason we made a minor modification to Squid to support Unix-domain sockets, as they have much higher performance for intra-computer communication than TCP/IP sockets. We made no other modifications to Squid.
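Connecting to a co-located Squid over a Unix-domain socket might look like the following C sketch; the socket path is an assumption (stock Squid listens on TCP, which is why the modification mentioned above was needed).

#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Sketch: open a Unix-domain stream socket to a local Squid and return
   the descriptor; HTTP requests are then written to it as usual. */
int connect_to_squid(const char *sock_path)
{
    struct sockaddr_un addr;
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, sock_path, sizeof(addr.sun_path) - 1);
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}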

The conventional file system semantics of NFS allow a client application to retrieve file data in multiple small chunks. For example, the more command may retrieve each screenful of data, display it on the terminal, and wait for a user to press a key before retrieving the next screenful of data. Retrieving data in this manner is convenient for application programs since they need only request the data that they need. An implicit benefit is automatic flow control, in the sense that the application program does not request the next chunk of data until it is ready for it.

HTTP, on the other hand, is intended to retrieve entire documents as a single transfer. The application program must be prepared to deal with the whole document at once, regardless of what portion or quantity is necessary. The designers of HTTP recognized the pitfalls of this approach and added the notion of a partial read, or range request. The partial read allows the client to specify an offset and the number of bytes to be retrieved. At first, it seemed obvious to use this functionality in the Page Retrieval module: each NFS request could simply be paired with an equivalent HTTP partial read.


Partial reads in the HTTP protocol have a significant flaw: pages may be dynamic and non-cacheable. For example, several popular Web sites such as Yahoo [10] display advertisements on their main pages. The advertisers want as many hits as possible, so the page is designated non-cacheable. Furthermore, advertisements are rotated so that each request retrieves different advertisements. Consider what happens when two partial read requests are issued, one for the first 4 kbytes of the document and the other for the second 4 kbytes. Since the page is non-cacheable, each request must be sent to the Web server. Since the page is dynamic, each request retrieves part of a different version of the document. The client is left with two pieces from different documents.

To solve this problem, we use a heuristic solution that is correct in most cases, but still may lead to an inconsistent page. We assume that in most cases if an application wishes to retrieve a whole document, it will issue its read requests over a relatively short time period. Therefore Gecko-II employs the following strategy for handling read requests for less than the whole page. If Squid has the page cached, then the partial read request is forwarded to Squid. Otherwise, the Gecko-II server reads the entire page from Squid. If the page is cacheable, then Squid will cache it so that subsequent requests hit in the Squid cache. The Gecko-II server also buffers the entire page in its memory for a period of 2 s from the most recent read request to the page. New read requests are satisfied by the buffer, and reset the buffer's time to live to 2 s. This mechanism, although a bit ungainly, is sufficient for handling most partial read requests to uncacheable pages. It only works incorrectly if the client reads the page very slowly, so that read requests are separated by more than two seconds. It is impossible to solve this problem correctly because the Web does not provide unique identifiers for different versions of the same page§.
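A minimal sketch of this buffering heuristic is shown below, assuming a hypothetical per-page buffer structure; the field and function names are illustrative rather than taken from the Gecko-II source.

#include <string.h>
#include <time.h>

#define BUFFER_TTL 2   /* seconds since the most recent read request */

struct page_buffer {
    char   *data;       /* entire page fetched from Squid */
    size_t  length;
    time_t  last_read;  /* time of the most recent read request */
};

/* Satisfy a partial read from the buffered copy of the page, resetting
   the buffer's time to live on every access.  Returns the number of
   bytes copied, or 0 if the buffer has expired or the offset is past
   the end of the page (the caller must then refetch the page). */
size_t buffered_read(struct page_buffer *buf, size_t offset,
                     size_t count, char *out)
{
    time_t now = time(NULL);
    if (now - buf->last_read > BUFFER_TTL)
        return 0;
    buf->last_read = now;
    if (offset >= buf->length)
        return 0;
    if (offset + count > buf->length)
        count = buf->length - offset;
    memcpy(out, buf->data + offset, count);
    return count;
}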

Directory cache

The directory cache maintains the hierarchical relationships between parent and child pages. These relationships are represented by NFS directories. The following information is stored for each page.

• Hard links: Each child in the URL hierarchy is represented by a hard link. The children are learned over time from page accesses, as HTTP does not provide a function for determining the children of a page directly. A hard link is stored as a mapping between a child's name and its handle.

• Symbolic links: Each child in the link graph is represented by a symbolic link. The children are determined by parsing the HTML content of the parent and extracting the hyperlinks. A symbolic link is simply a mapping from one file name to another that, when encountered, causes the client to re-resolve using the new name.

• Parent: Every Unix directory has an entry '..' that refers to its parent. This is used to traverse the directory hierarchy back towards the root.

Partitioning the children between hard and symbolic links allows Gecko to distinguish between children that may be discarded without any loss of information and those that should be retained. A symbolic link represents a hyperlink, and hyperlinks may always be regenerated by reparsing the source document. Discarding children is only done if memory resources become exhausted.

§HTTP 1.1 does include the notion of an E-TAG that may be used for cache-consistency purposes only. Although the E-TAG may be used to invalidate a cached page (i.e. if it is old), the E-TAG may not be used as an identifier to retrieve a specific version of a page.
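As an illustration, a directory-cache entry might be organized along the lines of the following C sketch; the structure and field names are hypothetical and serve only to show how hard links, symbolic links, and the parent pointer could be kept per page.

#define NFS_HANDLE_SIZE 32          /* NFS version 2 handle size used by Gecko-II */

struct nfs_handle { unsigned char bytes[NFS_HANDLE_SIZE]; };

/* One child in the URL hierarchy: name -> handle (an NFS hard link). */
struct hard_link {
    char              *name;
    struct nfs_handle  handle;
    struct hard_link  *next;
};

/* One hyperlink extracted from the page: name -> relative path (an NFS
   symbolic link).  These entries may be discarded and later regenerated
   by reparsing the page. */
struct sym_link {
    char            *name;
    char            *target;        /* relative path, e.g. ".../host/page.html" */
    struct sym_link *next;
};

struct dir_entry {
    struct nfs_handle  parent;      /* the ".." entry */
    struct hard_link  *children;    /* URL-hierarchy children (retained) */
    struct sym_link   *links;       /* hyperlink children (discardable) */
};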

Using symbolic links does present a problem since the string returned by a symbolic link is evaluated by the client. This makes it impossible to use absolute pathnames in symbolic links, as different clients may mount the file system on different mount points. Consider what happens if one client mounts the Web on /web, whereas another client mounts it on /www. If symbolic links contained absolute pathnames then they would be wrong for one of the clients.

We solved this problem by making all symbolic links contain relative paths rather than absolute paths. A relative path is one that is interpreted in the context of the current directory. The '..' entry can be used to move upwards in the directory hierarchy relative to the current directory, allowing any absolute path to be expressed as a relative path. As a convenience to the user, Gecko-II also creates in every directory the symbolic link '...' that points to the root of the Gecko file system, and uses this entry in the symbolic links. This factors out the common prefix to the root, and makes the symbolic link contents more concise.
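For example, a hyperlink whose path relative to the Gecko root is known could be turned into a symbolic link target with a helper like the one sketched here; make_link_target and the sample URL are hypothetical, and only illustrate the use of the '...' entry.

#include <stdio.h>

/* Build the contents of a symbolic link for a hyperlink whose path,
   relative to the Gecko root, is root_relative.  Prefixing the path
   with the "..." entry avoids hard-coding any absolute mount point. */
int make_link_target(char *buf, size_t buflen, const char *root_relative)
{
    int n = snprintf(buf, buflen, ".../%s", root_relative);
    return (n > 0 && (size_t)n < buflen) ? 0 : -1;
}

/* Example: make_link_target(buf, sizeof(buf), "www.cs.arizona.edu/index.html")
   yields ".../www.cs.arizona.edu/index.html", which resolves correctly
   whether the client mounted Gecko on /web or /www. */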

Handle resolution

Unlike Gecko-I, Gecko-II does not store the URL and request header information in a database indexed by the NFS handle. Instead, Gecko-II compresses the URL and headers so that they fit into the handle itself. Regenerating the URL and headers from a handle is as simple as decompressing the handle. The handles are opaque to the clients, so the server is free to encode information inside the handle however it wants, without affecting the clients. Unfortunately, a page's URL and headers can be many thousands of bytes long, so it may not be possible to compress them to fit in the fixed-size NFS handle. The handle resolution module supports a variety of compression algorithms, and can be easily extended to support more. Each compression algorithm is tried until one succeeds in compressing the information into the handle. If none succeeds, the information is stored in a hash table that is indexed by the handle. The hash table represents hard state, so it is only used when compression fails.
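The selection logic can be sketched as follows in C. The function signature, the algorithm table, and the hash-table fallback interface are assumptions made for illustration; they are not the actual Gecko-II interfaces.

#include <stddef.h>

#define HANDLE_BYTES 32      /* NFS version 2 handle size used by the prototype */

/* A compression algorithm: returns 0 if the input fit into the handle. */
typedef int (*compress_fn)(const char *input, size_t len,
                           unsigned char *handle, size_t handle_len);

/* Try each compression algorithm in turn (8to5, 8to6, Huffman, LZW, the
   hybrids, ...).  The caller supplies the table of algorithms and a
   fallback that stores the string in the hash table and returns a key
   that fits in the handle.  The hash table is hard state and is used
   only when every algorithm fails. */
int encode_handle(const char *input, size_t len, unsigned char *handle,
                  compress_fn *algorithms, int num_algorithms,
                  compress_fn hash_table_fallback)
{
    int i;
    for (i = 0; i < num_algorithms; i++)
        if (algorithms[i](input, len, handle, HANDLE_BYTES) == 0)
            return 0;                 /* the handle is pure soft state */
    return hash_table_fallback(input, len, handle, HANDLE_BYTES);
}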

There are several factors that affect how Gecko-II uses compression to encode URL and header information in handles. First, compression and decompression of handles is only done on the Gecko-II server. The server compresses the URL and headers into a handle when a new handle is generated, and decompresses the handle to obtain this information when the handle is used in subsequent accesses. As a result, decompression performance affects the overall system performance more than compression performance. Second, NFS handles are of fixed size, so there is no advantage to compressing the information so that it is smaller than a handle. Most compression algorithms are designed to compress data to the smallest possible size. They are used to conserve resources such as disk space or network bandwidth where variable-size data are used. However, the constraint upon Gecko is that items must fit within the NFS handle. Therefore, an algorithm should not be chosen based upon its absolute compression ratio, but rather upon how often it can achieve the compression requirement. Assuming that 64-byte NFS version 3 handles are used, an algorithm that compresses 90% of the input space to 60 bytes and 10% to 500 bytes is better than an algorithm that compresses 100% of the input space to 65 bytes.

Many compression techniques make use of metadata, such as dictionaries or tables, during compression and decompression. The same metadata used to compress data must be used when decompressing it. When the compressed data are transferred over a network or stored in a file, the metadata used to compress them must be packaged along with the data, so that the decompression works properly. Gecko-II, however, has no such requirement. Clients do not compress or decompress handles, so there is no need to communicate the compression metadata between the clients and the server. The metadata does represent hard state on the server, but it changes very slowly, if at all, and is easily stored on disk to prevent its loss during a crash. Although some compression algorithms compute the metadata from the input data, it takes a significant amount of input data before the metadata is complete enough for the algorithm to compress effectively. Therefore, the compression algorithms used in Gecko-II do not create the metadata dynamically, as the typical URL and headers are too short. Instead, the metadata is pre-computed by training the compression algorithms with a set of sample inputs that represent typical URLs and header sets. For our experiments we used training inputs that were disjoint from the inputs used in the experiment.

The Gecko-II server makes use of two classes of compression algorithms that seem well-suited for compressing URLs and headers. The first class is character-based algorithms that function by assigning different encodings to various symbols in the input alphabet. For example, the common letter 'e' may receive a shorter encoding than the less common 'z'. Character-based algorithms are well suited for URLs since the alphabet is highly biased in favor of the lower case letters and numbers, with a very low probability of upper case letters or non-printable binary characters.

The second class of algorithms is dictionary-based algorithms. These work by assigning encodings to groups of characters. For example, in the English language the commonly occurring word 'the' may be assigned a short encoding whereas the rarely occurring word 'xylophone' may be assigned a long encoding. These algorithms work well on HTTP names because certain substrings such as 'index.html' or 'cgi-bin' occur very often.

Character compression

A sampling of Web URLs was performed by taking the Web site www.web100.com, and spidering each page listed on it to a level of three links deep. This yielded approximately 3 500 URLs from 100 websites. Our sampling of URLs provided the following results about the alphabet used by HTTP URLs:

• 69% lower case;
• 6% digits;
• 11% punctuation (including slash);
• 9% upper case;
• 9% slash;
• The top 30 characters occur 91% of the time;
• The top 62 characters occur 99% of the time.

This analysis shows that substantial compression is possible by assigning concise encodings to popular characters; fully 69% of all characters in URLs are from the set of 26 lower-case letters. This leads to three algorithms.

8to5

In this scheme each character is represented by one or more 5-bit encodings. One encoding represents a stop character, and another an escape character, leaving 30 encodings that represent the 30 most popular characters. Exactly which characters are assigned to the 30 values is determined when the algorithm is trained, but the information above shows that the top 30 characters occur 91% of the time. The remaining characters that are not assigned a 5-bit encoding are represented by a 13-bit code consisting of the 5-bit escape character followed by the 8-bit ASCII value.
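An 8to5 encoder could look like the following C sketch. The table of the 30 most popular characters is supplied at training time; the specific code assignments (indices 0-29 for popular characters, 30 for the stop code, 31 for the escape code) and the function names are our assumptions, not the Gecko-II implementation.

#include <string.h>

#define STOP_CODE   30    /* 5-bit code reserved for end of string */
#define ESCAPE_CODE 31    /* 5-bit code that prefixes a raw 8-bit byte */

/* popular[] holds the 30 most frequent characters chosen at training
   time; a character's 5-bit code is simply its index 0..29. */
static int popular_index(const char *popular, unsigned char c)
{
    const char *p = memchr(popular, c, 30);
    return p ? (int)(p - popular) : -1;
}

/* Append nbits of value (most significant bit first) to the output. */
static void put_bits(unsigned char *out, size_t *bitpos,
                     unsigned value, int nbits)
{
    int i;
    for (i = nbits - 1; i >= 0; i--, (*bitpos)++)
        if (value & (1u << i))
            out[*bitpos / 8] |= 0x80 >> (*bitpos % 8);
}

/* Encode url with the 8to5 scheme.  Returns 0 if the result, including
   the stop code, fits in outlen bytes; -1 if it does not fit. */
int encode_8to5(const char *url, const char *popular,
                unsigned char *out, size_t outlen)
{
    size_t bitpos = 0, limit = outlen * 8;
    memset(out, 0, outlen);
    for (; *url; url++) {
        int idx = popular_index(popular, (unsigned char)*url);
        int need = (idx >= 0) ? 5 : 13;
        if (bitpos + need + 5 > limit)
            return -1;                        /* does not fit in the handle */
        if (idx >= 0) {
            put_bits(out, &bitpos, (unsigned)idx, 5);
        } else {                              /* escape + raw 8-bit byte */
            put_bits(out, &bitpos, ESCAPE_CODE, 5);
            put_bits(out, &bitpos, (unsigned)(unsigned char)*url, 8);
        }
    }
    if (bitpos + 5 > limit)
        return -1;
    put_bits(out, &bitpos, STOP_CODE, 5);
    return 0;
}

The corresponding decoder simply reads 5 bits at a time, expanding escape codes into raw bytes, until it encounters the stop code.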

8to6

Similar to the 8to5 method, the 8to6 compression method assigns the top 60 characters a single 6-bit code while assigning the remaining characters a 14-bit code. This gives more characters a short encoding, at the cost of one more bit per character.

Huffman

The Huffman [11] algorithm assigns a variable-length encoding to each alphabet symbol according to the frequency with which the symbol occurs. Frequently occurring symbols receive short codes whereas infrequent characters receive long codes. The algorithm is initialized by creating a Huffman tree (Figure 7) from a training set. Each leaf of the Huffman tree corresponds to a symbol in the alphabet, and the Huffman code assigned to the symbol is obtained by following the path from the root to a leaf. Huffman compression encodes the alphabet better than the fixed 8to5 and 8to6 encodings because extremely frequent alphabet symbols will be assigned very compact codes.

Figure 7. Huffman encoding example. A sample encoding of three symbols. The most frequent symbol, 'A', receives a 1-bit code whereas the other two symbols receive 2-bit codes.

Dictionary compression

Web URLs contain many repeated substrings. For example, the substrings 'index', 'html', '.com', and 'cgi-bin' frequently occur. This implies that a dictionary-based compression algorithm may be able to achieve high compression ratios because each one of these strings will be assigned a short encoding.


Dictionary <-- all single-character alphabet symbols
old_string <-- get_char() + '\0'
while (!EOF)
    character <-- get_char()
    new_string = old_string + character
    if (dictionary_exists(new_string))
        old_string = new_string
    else
        code = dictionary_lookup(old_string)
        output_code(code)
        dictionary_add(new_string)
        old_string <-- character + '\0'
output_code(dictionary_lookup(old_string))    /* flush the code for the final string */

Figure 8. LZW pseudocode.

The LZW [12] algorithm is one of the most popular dictionary compression algorithms. It works by maintaining a current working string of input characters. As long as the string exists in the dictionary, characters are appended to it. If the string does not exist in the dictionary, then it is added to the dictionary and a new working string is started with the current character (Figure 8).

A drawback of the LZW algorithm is that it assigns each substring a fixed-length encoding. For example, both the popular string 'index.html' and the unpopular string 'xylophone' are assigned codes of the same length.

Hybrid compression

URLs are well suited to both character compression algorithms and dictionary-based algorithms, but better results are possible if both types of algorithms are used together. Dictionary compression is first used to assign fixed-length encodings to the common substrings, and then character-frequency compression is used to assign smaller encodings to more frequent substrings. The following hybrid compression algorithms are used in the Gecko-II server:

8to6 dictionary

The 8to6 dictionary algorithm was constructed by taking the 8to6 algorithm and assigning 10 of the encodings to represent 10 substrings that were hand-picked from the URL list. Thus, 8to6 dictionary encodes the top 50 characters as 6-bit encodings, 10 substrings as 6-bit encodings, and everything else as a 14-bit encoding.

LzwHuffman

The LzwHuffman algorithm works by first compressing the input URL using the LZW algorithm to convert the substrings into fixed 13-bit symbols, and then applying the Huffman algorithm to assign smaller encodings to the most popular LZW symbols.


Dynamic metadata

All of these compression techniques have a common requirement of first being primed with a training set. The training set configures the algorithm to recognize compressible properties of URLs, such as the frequency of alphabet symbols in Huffman or the presence of common substrings in LZW. These schemes work quite well as long as the training set is representative of the URLs that will eventually be compressed.

However, a problem occurs if the training set is not representative of the data to be compressed. For example, consider what happens when a training set of URLs for English Web sites is used to compress URLs for Japanese Web sites. The alphabet symbol frequencies would be mismatched, and an algorithm such as Huffman would be ineffective. The Japanese URLs would not contain the same common substrings as the English URLs, making an algorithm such as LZW ineffective.

Although the current Gecko-II proxy relies on training sets, it is possible to modify the system to learn from its input and update its metadata accordingly. It is infeasible to update the metadata constantly, so instead the system would use a version number stored in the handle to indicate which version of the metadata to use. The system maintains several versions of the metadata, including the version currently being used to compress information, and an 'in-training' version that is being trained using the information and will eventually become the current version. Not only does the system train the in-training version of the metadata, but it also compresses the information using the in-training version and compares the result with the compression using the current version. When the in-training version performs sufficiently better to warrant the change, the in-training version is declared the current version and a new in-training version is started.

The version number in the handle allows the server to determine which version of the metadata to use when decompressing the handle. A metadata version is stored until it is unlikely that a client still has a handle that uses the metadata, given that NFS clients only cache handles for 60 s. Although creating new metadata versions frequently improves compression performance, it incurs additional load on the server because the metadata represents hard state. If the metadata is lost, no handles can be decompressed. For this reason the metadata must be stored on disk, an expensive operation that should only be done infrequently. Dynamically creating new metadata versions, and tuning the versioning interval to achieve optimal performance, is an area of future work.
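One possible way to carry the version number in the handle is sketched below. The handle layout, the field sizes, and the number of retained metadata versions are assumptions for illustration only; this is not the actual Gecko-II handle format.

#define HANDLE_BYTES   32
#define METADATA_SLOTS  4     /* metadata versions retained on the server */

/* Hypothetical handle layout: one byte of flags and file-type
   information, one byte holding the metadata version, and the
   compressed URL and headers in the remaining bytes. */
struct gecko_handle {
    unsigned char flags;
    unsigned char metadata_version;
    unsigned char payload[HANDLE_BYTES - 2];
};

struct metadata;              /* frequency tables, dictionaries, trees */

/* Select the metadata that was current when this handle was created.
   A slot can be reused once no client can still hold such a handle,
   given that NFS clients cache handles for only about 60 s. */
struct metadata *metadata_for(const struct gecko_handle *h,
                              struct metadata *versions[METADATA_SLOTS])
{
    return versions[h->metadata_version % METADATA_SLOTS];
}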

HEADER COMPRESSION

Compressing URLs into NFS handles solves only part of the problem: fully naming a Web page also requires a header set to be specified. Ideally, these header sets would also be encoded in the NFS handle; otherwise they represent hard state on the server. In general this is a difficult proposition, since a header set can be thousands of characters long and is therefore unlikely to fit in an NFS handle even when compressed. Fortunately, there are several properties of header sets that can be exploited to improve compressibility. First, many headers are enumerable because they take on relatively few different values. Examples are 'User-Agent', 'Language', and 'Accept'. The dictionary-based LZW algorithm should be able to encode each of these headers as a single 13-bit code.

Second, several headers can be discarded without affecting page correctness. For example, the 'Referrer' header specifies the URL of the page the user was previously viewing. The values of this header clearly cannot be enumerated and would not achieve good dictionary compression. However, the Referrer header rarely affects page correctness and is primarily a statistical tool for the Web server; we have found that simply discarding this header is a viable solution.

This leaves the non-enumerable headers that cannot be discarded without affecting page correctness. There are two prime examples of such headers: cookies and authentication. Both of these headers are used fairly infrequently, but to discard them would certainly lead to incorrect behavior. Compressing cookies is usually not an option, as the cookie string can be several hundred bytes long. We have found that the only solution is to use a fallback hash table. Similar to the fallback URL table, the fallback header set table contains those header sets that cannot be compressed¶. The many-to-one relationship between pages and header sets implies that there should be relatively few items in the hash table. For example, even a user who is regularly sending authentication headers usually sends the same authentication header to all pages on a given site. In the worst case, when all header sets are incompressible, the amount of space required by the hash table is proportional to the number of users, rather than the much larger number of page requests. Furthermore, we anticipate that header sets will be created and modified relatively infrequently, so that the hard state represented by the hash table could be stored on disk. This would prevent its loss in a server crash. The present Gecko-II implementation does not store the hash table on disk, but this could easily be implemented with a reliable database package such as gdbm.
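A simple way to keep the table proportional to the number of distinct header sets is to intern them, as in the following sketch; the linked-list representation, the function name, and the integer slot numbers are illustrative assumptions rather than the actual Gecko-II data structure.

#include <stdlib.h>
#include <string.h>

/* Intern a header set: identical header strings are stored only once,
   so the table grows with the number of distinct header sets (roughly
   the number of users) rather than with the number of page requests. */
struct header_entry {
    char                *headers;
    struct header_entry *next;
};

int intern_header_set(struct header_entry **table, const char *headers)
{
    struct header_entry **p = table;
    int index = 0;

    for (; *p != NULL; p = &(*p)->next, index++)
        if (strcmp((*p)->headers, headers) == 0)
            return index;           /* already stored: reuse this slot */

    *p = malloc(sizeof(**p));       /* append so slot numbers stay stable */
    if (*p == NULL)
        return -1;
    (*p)->headers = strdup(headers);
    (*p)->next = NULL;
    return index;
}

The slot number is one plausible thing to associate with the handle; the table itself is the hard state that could be persisted with a package such as gdbm.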

EXPERIENCE WITH GECKO-II

Gecko-II transfer performance

Quantitative measurements of Gecko-II demonstrate its performance advantages over HTTP as a transport protocol. Tests were performed using two workstations connected via a 100 Mb s−1 Ethernet switch. Each workstation is a 200 MHz Pentium machine containing 128 Mbytes of memory and 2–4 Gbytes of disk storage. Workstations were configured with Linux kernel version 2.0.30 and pthreads version 0.5.

We tested the performance of Gecko by requesting documents through both the Gecko-II server and Squid. File sizes were tested in the range of 256 bytes to 1 Mbyte. For each test, we measured the amount of time to retrieve the file from Squid directly, as well as from the Gecko-II server.

Figure 9 shows results of the experiment as performed between three machines. One machine was configured as a client and ran the benchmark program, the second machine contained the Gecko-II server and Squid, and the third machine contained an Apache [13] Web server that provided the pages. Gecko-II was connected to Squid via a Unix domain socket.

For each file, a dummy request was first initiated to pre-load the Squid cache with the page before the actual measurement. This ensures that the measured time represents only the connection between the client and proxy, and not the time required by the proxy to fetch the page from the Web.

¶The present Gecko-II prototype does not implement compression of header sets, and so all header sets are stored in the fallback hash table. Gecko usage demonstrates that clients require very few header sets.


Figure 9. Page access time. This graph shows the time to access Web pages of varying size through Gecko and through Squid.

The http measurements represent the time in milliseconds required to retrieve a file from Squid. The nfs measurements represent the amount of time required to fetch that same page with Gecko-II. The onfs measurement is the amount of time required by NFS to perform the bulk data transfer, and ignores NFS lookup and getattr functions. The cnfs measurement is the amount of time required to fetch the file if it is already in the operating system cache.

We measured the amount of overhead associated with looking up filenames in the NFS protocol by taking the difference of the onfs and nfs numbers, which is an average of 7.5 ms. Since this overhead is a fixed amount per file transfer, it has a greater effect on small files than on large files. For a 32 kbyte file, this overhead amounts to 15% of the total file retrieval time.

Gecko-II automatically makes use of the operating system cache, a feature unavailable to HTTP. When a file is already in the operating system cache, no network communication is necessary, and the file access time as indicated by the cnfs numbers is very small, requiring only one millisecond for a 32 kbyte file.

By using the NFS protocol, Gecko-II provides greater performance than connecting directly to Squid, while at the same time providing increased functionality. This is due largely to the overhead of the TCP protocol, which includes extensive connection setup and tear-down time. The UDP protocol used by NFS, however, does not include such overhead and performs much better than TCP in a local area network. At file sizes of 256 kbytes or greater, the TCP protocol's increased parallelism effectively overcomes the overhead, providing better performance than NFS. However, most Web pages and graphics are much smaller than 256 kbytes, making Gecko-II well suited for accelerating Web performance.

Compression results

To study the effectiveness of various algorithms, we took a sampling of HTTP URLs by spidering the Web site www.web100.com to three levels deep. We then compressed this collection of URLs with various algorithms and noted which algorithms worked best. Results for the six algorithms we tested are shown in Table I.

Table I. Compression results. These are the results of compressing the URLs from www.web100.com to three levels deep using the different compression schemes. The column 'Fit' shows the percentage of the URLs that the algorithm was able to compress into 60 bytes and would therefore fit in an NFS 3 handle.

Algorithm            Byte size    Compression ratio (%)    Fit (%)
Uncompressed           237 296                      100         72
8to5                   170 021                       71         86
8to6                   181 477                       76         84
8to6 dictionary        136 462                       57         89
LZW                    143 254                       60         91
Huffman                146 331                       61         90
LzwHuffman             128 876                       54         92
Best                   113 919                       48         92

The hybrid algorithms, 8to6 dictionary and LzwHuffman, yield the best compression results. The 'Best' row is obtained by trying all of the compression methods on a URL and using the result of the one that yielded the smallest encoding.

The NFS version 3 handle is 64 bytes‖ in length. If 4 bytes of the handle are reserved for other metadata such as flags and file type information, then 60 bytes remain to store the encoded URL. Using the best compression scheme of 48%, URLs of up to 120 bytes may be accommodated in the 64-byte handles. The best compression scheme was able to compress 92% of the URLs in our sample set into the required handle size.

RELATED WORK

The WebNFS [14] project, a proposal by Sun Microsystems, seeks to replace the HTTP protocol with a more Internet-compatible version of the NFS protocol. WebNFS offers many of the same performance improvements as Gecko, such as the ability to read a page without requiring a dedicated TCP connection. However, the specification departs significantly from traditional NFS semantics, suffers compatibility problems with existing applications, and is designed mainly for integration with specific Web-enabled applications. It is also unclear how WebNFS servers will cope with HTTP headers that modify content, such as multiple languages or browser types.

‖The current implementation of Gecko-II uses the NFS version 2 specification and its more restrictive 32-byte handles. It would be a straightforward extension for Gecko-II to support the 64-byte handles provided by NFS version 3, and this will likely be done in the future.


WebFS [3], a project of UC Berkeley, is designed with goals similar to those of Gecko. WebFS is implemented as a loadable kernel module for the Solaris operating system and provides a mapping of the URL namespace into a file system namespace. While WebFS seeks to maintain compatibility with HTTP, the true goal of WebFS is to replace HTTP with a more flexible protocol that provides support for cache coherency and authentication. WebFS requires end-users to modify the operating system by installing a kernel module, whereas the Gecko prototype is accessible to unmodified clients through the standard NFS protocol. WebFS does not provide server-side parsing of document links as Gecko does, and requires a special client-side utility program to be run to parse links and expand the namespace. WebFS does not provide a mechanism for communicating HTTP request headers.

Operating system vendors, such as Microsoft, continue to integrate the Internet into their products. However, these integrations are usually performed at the user interface level and not the file system level. The result is that, although users perceive an integration of the Web with their computers, Web content is not available to most common applications and utilities. Such solutions continue to use HTTP and suffer its performance penalties.

The Alex [15] system operates as an NFS front end to the FTP protocol in much the same manner that Gecko operates as a front end to the Web. Alex provides transparent access to FTP sites for application programs and provides cache consistency.

HTTPFS [16] is a user-level file system that uses HTTP as a communication medium. It treats HTTP as a pipe, running a specialized C++ class library on the client side and a CGI script on the server side. Although the class library provided by HTTPFS does not require modification of source code, it does require linking of client applications to the HTTPFS library.

CONCLUSION

Gecko provides access to the Web via NFS, allowing Web pages to be named, accessed, and cached like standard Unix files. Unmodified Unix applications such as cat and grep can be used to manipulate pages, eliminating the need for specialized Web-enabled applications. NFS provides cache consistency between clients and the Gecko server, ensuring that all applications using the Gecko system see the same version of a page. NFS also improves the performance of accessing pages on the Gecko server. Pages are automatically cached on the client, and name lookup results are cached for subsequent lookups, significantly improving their performance.

Our two Gecko prototypes demonstrate the viability of the Gecko design. Although Gecko-I was useful, its dependence on hard server state limited its fault tolerance and scalability, while greatly increasing its implementation complexity. Gecko-II compresses information into NFS handles to reduce the amount of hard server state. The resulting system is more tolerant of server failures and scales better than Gecko-I, while also being much simpler to implement. Experiments show that Gecko-II's performance is comparable to, or better than, Apache or Squid, making the Web accessible to the vast collection of Unix applications without imposing a performance penalty.

ACKNOWLEDGEMENTS

Patrick Bridges did some of the original work on using NFS to access the Web as part of a class project. This work was supported in part by NSF grants CCR-9624845 and CDA-9500991.


REFERENCES

1. Fielding R, Gettys J, Mogul JC, Frystyk H, Masinter L, Leach P, Berners-Lee T. Hypertext transfer protocol—HTTP/1.1. http://www.ietf.cnri.reston.va.us/internet-drafts/draft-ietf-http-v11-spec-rev-03.txt.
2. Sandberg R, Goldberg D, Kleiman S, Walsh D, Lyon B. Design and implementation of the Sun network filesystem. Proceedings of the Summer 1985 USENIX Conference, 1985; 119–130.
3. Vahdat AM, Eastham PC, Anderson TE. WebFS: A global cache coherent file system. http://www.cs.berkeley.edu/~vahdat/webfs/webfs.html.
4. Baker S, Hartman JH. The Gecko NFS Web Proxy. Computer Networks 1999; 31:1725–1736. Also appeared in the Proceedings of the 8th International World Wide Web Conference, 1999.
5. Berners-Lee T, Connolly D. RFC 1866: HTML 2.0 Specification. http://www.w3.org/MarkUp/html-spec/.
6. Information Sciences Institute. RFC 793: Transmission Control Protocol. http://www.alternic.net/rfcs/700/rfc793.txt.html.
7. Rivest R. RFC 1321: The MD5 Message-Digest Algorithm. http://www.alternic.net/rfcs/1300/rfc1321.txt.html.
8. Free Software Foundation. Gdbm: The GNU database library. http://www.cl.cam.ac.uk/texinfodoc/gdbmtoc.html.
9. Squid Web Proxy Cache. http://www.squid-cache.org/.
10. Yahoo! Inc. http://www.yahoo.com/.
11. Huffman DA. A method for the construction of minimum-redundancy codes. Proceedings of the IRE 1952; 40(9):1098–1101.
12. Welch T. A technique for high-performance data compression. IEEE Computer 1984; 17(6):8–19.
13. Laurie B, Laurie P. Apache: The Definitive Guide. O'Reilly & Associates: Sebastopol, CA, 1997.
14. Callaghan B. WebNFS Server Specification. http://194.52.182.97/rfc/rfc2055.html.
15. Cate V. Alex—a global file system. Proceedings of the 1992 USENIX File Systems Workshop, 1992; 1–12.
16. Kiselyov O. A network file system over HTTP: remote access and modification of files and files. USENIX '99 Freenix Track Proceedings. http://www.usenix.org/events/usenix99/fullpapers/kiselyov/kiselyov.pdf.
