1
Web Servers
Herng-Yow Chen
2
Outline
Survey many different types of software and hardware web servers.
Describe how to write a simple diagnostic web server in Perl.
Explain how web servers process HTTP transactions, step by step.
3
Different types of web servers
General-purpose software web servers
Web server appliances
Embedded web servers
4
Jobs of web servers
Implement HTTP and the related TCP connection handling.
Manage the server-side resources and provide administrative features to configure, control, and enhance the web service.
5
Jobs of Operating System
Manage the hardware details of the underlying computer system.
Provide TCP/IP network support.
Provide filesystems to hold web resources.
Provide process management to control computing activities.
6
General-purpose software web server
General-purpose software web servers run on standard, network-enabled computer systems.
Open source software (such as Apache or W3C’s Jigsaw).
Commercial software (such as Microsoft’s and iPlanet’s web servers).
Web server software is available for just about every computer and operating system.
7
General-Purpose Software Web Servers
In September 2004, the Netcraft survey (http://news.netcraft.com/archives/web_server_survey.html)
8
Web server appliances
Web server appliances are prepackaged software/hardware solutions. The vendor preinstalls a software server onto a vendor-chosen computer platform and preconfigures the software.
Sun/Cobalt RaQ web appliance (http://www.cobalt.com)
Toshiba Magnia SG10 (http://www.toshiba.com)
IBM Whistle web server appliance (http://www.whistle.com)
Appliance solutions remove the need to install and configure software and often greatly simplify administration. However, the web server is often less flexible and less feature-rich, and the server hardware is not easily upgradable.
9
Embedded web servers
Embedded servers are tiny web servers intended to be embedded into consumer products (e.g., printers or home appliances).
They allow users to administer their consumer devices using a convenient web browser interface.
iPic match-head sized web server (http://www-ccs.cs.umass.edu/~shri/iPic.html)
NetMedia SitePlayer SP1 Ethernet web server (http://www.siteplayer.com)
10
A Minimal Perl Web server
Type-o-serve: a minimal Perl web server used for HTTP debugging
http://www.http-guide.com/tools/type-o-serve.pl
11
A Minimal Perl Web Server

HTTP request message:
GET /blah.txt HTTP/1.1
Accept: */*
Accept-language: en-us
Accept-encoding: gzip, deflate
User-agent: Mozilla/4.0
Host: www.csie.ncnu.edu.tw:8080
Connection: Keep-alive

HTTP response message:
HTTP/1.0 200 OK
Connection: close
Content-type: text/plain

Hi there!

Type-o-serve dialog:
% ./type-o-serve.pl 8080
<<Request From 'www.csie.ncnu.edu.tw'>>
GET /blah.txt HTTP/1.1
Accept: */*
Accept-language: en-us
Accept-encoding: gzip, deflate
User-agent: Mozilla/4.0
Host: www.csie.ncnu.edu.tw:8080
Connection: Keep-alive

<<Type Response followed by '.'>>
HTTP/1.0 200 OK
Connection: close
Content-type: text/plain

Hi there!
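The same idea can be sketched in Python. This is not the type-o-serve script itself: it prints each request like type-o-serve does, but it sends back a fixed canned response instead of one typed by the operator. The names serve_once, read_request_head, and main, and the default port 8080, are assumptions made for this illustration.

```python
import socket

# Canned reply, mirroring the response shown on the slide.
CANNED_RESPONSE = (
    b"HTTP/1.0 200 OK\r\n"
    b"Connection: close\r\n"
    b"Content-type: text/plain\r\n"
    b"\r\n"
    b"Hi there!\n"
)


def read_request_head(conn):
    """Read from conn until the blank line that ends the headers (or EOF)."""
    data = b""
    while b"\r\n\r\n" not in data:
        chunk = conn.recv(4096)
        if not chunk:
            break
        data += chunk
    return data


def serve_once(conn, response=CANNED_RESPONSE):
    """Print one request, send the canned response, and close the connection."""
    request = read_request_head(conn).decode("iso-8859-1")
    print("<<Request>>")
    print(request)
    conn.sendall(response)
    conn.close()
    return request


def main(port=8080):
    """Accept connections forever on the given port (hypothetical default)."""
    with socket.create_server(("", port)) as listener:
        while True:
            conn, _addr = listener.accept()
            serve_once(conn)
```

Calling main(8080) and pointing a browser at port 8080 of the machine shows the browser's request on the terminal.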
12
What do web servers do?
1. Set up connection
2. Receive request
3. Process request
4. Access resource
5. Construct response
6. Send response
7. Log transaction
13
What Real Web Servers Do
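[Figure: the client connects through the network interface and the TCP/IP network stack of the operating system to the HTTP server software process in user space, which reads resources from object storage. The steps are labeled: (1) Set up connection, (2) Receive request, (3) Process request, (4) Access resource, (5) Create response, (6) Send response, (7) Log transaction.]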
14
Step 1: accepting client connections
Handling new connections
  Extracting the client IP from a new TCP connection
Client hostname identification
  Using “reverse DNS”
Determining the client user through ident
  Some web servers support the IETF ident protocol
15
Handling new connection
When a client requests a TCP connection to the web server, the web server establishes the connection and determines which client is on the other side of the connection, extracting the IP address from the TCP connection (e.g., using the getpeername call on a UNIX socket).
The server is free to reject and immediately close any connection, for example because the client IP is unauthorized or belongs to a known malicious client.
Once a new connection is established and accepted, the server adds the new connection to its list of existing connections and prepares to watch for data on the connection.
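A minimal Python sketch of this step, assuming a simple set of denied IP addresses as the rejection policy. The function names and the deny-list idea are illustrative assumptions; only getpeername comes from the slide.

```python
def is_rejected(client_ip, denied_ips):
    """Hypothetical policy: reject clients whose IP is on a deny list."""
    return client_ip in denied_ips


def accept_client(listener, denied_ips, connections):
    """Accept one connection; return the client IP, or None if rejected.

    listener is a listening socket; connections is the server's list of
    open connections that it watches for incoming data.
    """
    conn, _addr = listener.accept()
    client_ip = conn.getpeername()[0]  # IP address of the other side
    if is_rejected(client_ip, denied_ips):
        conn.close()  # reject and immediately close
        return None
    connections.append(conn)  # now watched for request data
    return client_ip
```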
16
Client host identification
Most web servers can be configured to convert client IP addresses into client hostnames, using “reverse DNS.”
The hostname information is used for detailed access control and logging.
Note that hostname lookups can take a long time, slowing down web transactions. Many high-performance web servers either disable hostname resolution or enable it only for particular content.
Ex: Configuring Apache to look up hostnames for HTML and CGI resources
HostnameLookups off
<Files ~ "\.(html|htm|cgi)$">
  HostnameLookups on
</Files>
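For illustration, a reverse DNS lookup can be sketched in Python with socket.gethostbyaddr. The function name client_hostname and the choice to fall back to the bare IP address when the lookup fails are assumptions of this sketch, not behavior described in the slides.

```python
import socket


def client_hostname(client_ip, resolver=socket.gethostbyaddr):
    """Return the hostname for client_ip, or the IP itself if lookup fails.

    resolver is injectable so the (possibly slow) DNS call can be replaced.
    """
    try:
        hostname, _aliases, _addresses = resolver(client_ip)
        return hostname
    except OSError:  # socket.herror and socket.gaierror are subclasses
        return client_ip
```

Because each call may block on the network, a server that enables lookups pays this cost on every new connection, which is the slowdown noted above.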
17
Determining the client user through ident
The ident protocol lets servers find out what username initiated an HTTP connection.
The username information is particularly useful for logging: the 2nd field of the popular Common Log Format contains the ident username of each HTTP request. (The protocol was introduced in RFC 931; the updated ident specification is documented by RFC 1413.)
If a client supports the ident protocol, the client listens on TCP port 113 for ident requests.
18
Determining the Client User Through ident
[Figure: ident exchange between the client (Mary) and the web server]
(a) Mary establishes a new HTTP connection from client port 4236 to server port 80.
(b) The server establishes an ident connection to port 113 on the client.
(c) The server sends the request: 4236, 80
(d) The client returns the ident response: 4236, 80:USERID:UNIX:MARY
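The request and response in this exchange are single text lines, so they can be sketched in Python as below. The function names are hypothetical, and the sketch strips whitespace around every field for simplicity, which is looser than a strict RFC 1413 parser would be.

```python
def build_ident_query(client_port, server_port):
    """Build the query the web server sends to port 113 on the client."""
    return f"{client_port}, {server_port}\r\n".encode("ascii")


def parse_ident_response(line):
    """Return the username from a USERID reply, or None otherwise.

    A reply looks like: 4236, 80 : USERID : UNIX : MARY
    ERROR replies and malformed lines yield None.
    """
    parts = [part.strip() for part in line.strip().split(":", 3)]
    if len(parts) == 4 and parts[1].upper() == "USERID":
        return parts[3]
    return None
```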
19
Ident protocol (cont.)
Ident can work inside organizations, but it does not work well across the public Internet for the following reasons:
Many client PCs don’t run the identd identification protocol daemon software.
The ident protocol significantly delays HTTP transactions.
Many firewalls won’t permit incoming ident traffic.
The ident protocol is insecure and easy to fabricate.
The ident protocol doesn’t support virtual IP addresses well.
There are privacy concerns about exposing client usernames.
Enable ident lookups in Apache with:
IdentityCheck on
Common Log Format log files typically contain hyphens (-) in the 2nd field if no ident information is available.
20
Step 2: Receiving request messages
As data arrives on connections, the server reads the data and starts parsing the request message:
Parse the request line, looking for the request method, the specified URI, and the version number.
Read the message headers, each ending in CRLF.
Detect the end-of-headers blank line, ending in CRLF.
Read the request body, if any (length specified by the Content-Length header).
Internal Representations of Messages
Some web servers also store the request message in internal data structures that make the message easy to manipulate.
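The parsing steps above can be sketched in Python as follows. This is a simplified illustration that assumes the whole message is already in memory and well formed; the dictionary layout is a hypothetical internal representation, not one prescribed by the slides.

```python
def parse_request(raw):
    """Parse a complete HTTP request (bytes) into a simple dictionary."""
    # The blank line (CRLF CRLF) separates the headers from the body.
    head, _sep, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")

    # Request line: method, URI, version.
    method, uri, version = lines[0].split(" ", 2)

    # Header lines, each "Name: value"; names are case-insensitive.
    headers = {}
    for line in lines[1:]:
        name, _colon, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()

    # Body length comes from Content-Length (absent means no body).
    length = int(headers.get("content-length", "0"))
    return {
        "method": method,
        "uri": uri,
        "version": version,
        "headers": headers,
        "body": rest[:length],
    }
```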
21
Receiving Request Messages
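[Figure: a request message being read from the network as it travels from the client to the server across the Internet. The server has read the text below so far; the rest of the Host header and the final CRLFs are still in transit.]
GET /specials/hychen.gif HTTP/1.0CRLF
Accept: image/gifCRLF
Host: www.j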
22
Internal Representations of Messages
Request message as received:
GET /specials/saw-blade.gif HTTP/1.0CRLF
Accept: image/gifCRLF
Host: www.joes-hardware.comCRLF
CRLF
Parsed into an internal structure:
method: 1
version: 1.0
uri: ● (points to specials/saw-blade.gif)
header count: 2
headers: ●
  Name: Accept, Value: ● (points to image/gif)
  Name: Host, Value: ● (points to www.joes-hardware.com)
body: -
23
Different web server architectures
Single-threaded web servers
Multi-process and multi-threaded web servers
Multiplexed I/O web servers
Non-blocking network access
Multiplexed multi-threaded web servers
24
Connection Input/Output Processing Architectures
25
Step 3: Processing requests
Once the web server has received a request, it can process the request using the method, resource, headers, and optional body.
Some methods (e.g., POST) require entity body data in the request message. A few methods (e.g., GET) are not expected to carry entity body data in the request message.
26
Step 4: Mapping and Accessing resources
Docroot
Virtually hosted docroots
User home directory docroots
Directory Listings
Dynamic content resource mapping
Server-Side Include (SSI)
Access Control
27
Docroots
Web servers support different kinds of resource mapping, but the simplest form of mapping uses the request URI to name a file in the web server’s filesystem.
Typically, a special folder in the web server filesystem is reserved for web content. The folder is called the document root, or docroot.
The web server takes the URI from the request message and appends it to the document root.
The docroot setting in Apache servers:
DocumentRoot /usr/local/httpd/files
Servers must be careful not to let relative URLs back up out of a document root and expose other parts of the filesystem, e.g., http://www.csie.ncnu.edu.tw/../
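A minimal Python sketch of docroot mapping with a check against backing up out of the document root. The function name is hypothetical, the docroot value is the one from the slide, and the sketch assumes POSIX-style paths; it does not deal with symbolic links inside the docroot.

```python
import posixpath
from urllib.parse import unquote, urlsplit

DOCROOT = "/usr/local/httpd/files"  # DocumentRoot value from the slide


def map_uri_to_path(uri, docroot=DOCROOT):
    """Return the file path for uri, or None if it would escape docroot."""
    # Drop any query string and decode %xx escapes (e.g. %2e%2e is "..").
    path = unquote(urlsplit(uri).path)
    # Append the URI path to the docroot, then collapse "." and "..".
    candidate = posixpath.normpath(posixpath.join(docroot, path.lstrip("/")))
    root = posixpath.normpath(docroot)
    if candidate != root and not candidate.startswith(root + "/"):
        return None  # the URI backed up out of the document root
    return candidate
```

For example, /specials/hychen.gif maps to /usr/local/httpd/files/specials/hychen.gif, while /../etc/passwd is refused.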
28
Docroots
[Figure: the client sends a request message across the Internet; the web server maps the request URI to a file under the docroot in object storage]
Request message:
GET /specials/hychen.gif HTTP/1.0
Host: www.csie.ncnu.edu.tw
Docroot: /usr/local/httpd/files
Request URI: /specials/hychen.gif
Server resource: /usr/local/httpd/files/specials/hychen.gif
29
Virtually hosted docroots
Virtually hosted web servers host multiple web sites on the same web server, giving each site its own distinct document root on the server.
A virtually hosted web server identifies the correct document root to use from the IP address or from the hostname in the Host header.
30
Apache’s virtual host configuration
<VirtualHost www.joes-hardware.com>
  ServerName www.joes-hardware.com
  DocumentRoot /docs/joe
  TransferLog /logs/joe.access_log
  ErrorLog /logs/joe.error_log
</VirtualHost>
<VirtualHost www.marys-antiques.com>
  ServerName www.marys-antiques.com
  DocumentRoot /docs/mary
  TransferLog /logs/mary.access_log
  ErrorLog /logs/mary.error_log
</VirtualHost>
31
Virtually hosted docroots
[Figure: two requests reach the same server across the Internet and are mapped to different docroots according to the Host header]
Request message A:
GET /index.html HTTP/1.0
Host: www.joes-hardware.com
Docroot used: /docs/joe
Request message B:
GET /index.html HTTP/1.0
Host: www.marys-antiques.com
Docroot used: /docs/mary
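Choosing a docroot from the Host header can be sketched in Python as a table lookup. The table below uses the two hostnames and docroots shown in the figure; the function name and the handling of an optional port are assumptions of this sketch, and IPv6 literals in the Host header are not handled.

```python
# Hostname to docroot table, taken from the figure.
VIRTUAL_HOSTS = {
    "www.joes-hardware.com": "/docs/joe",
    "www.marys-antiques.com": "/docs/mary",
}


def docroot_for_host(host_header, vhosts=VIRTUAL_HOSTS):
    """Return the docroot for a Host header value, or None if unknown."""
    # Hostnames are case-insensitive; drop an optional ":port" suffix.
    hostname = host_header.strip().lower().partition(":")[0]
    return vhosts.get(hostname)
```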
32
User home directory docroots
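[Figure: two requests for user home directories are mapped to each user's private docroot; the server is labeled www.joes-hardware.com and www.marys-antiques.com]
Request message A:
GET /~bob/index.html HTTP/1.0
Docroot used: /home/bob/public_html
Request message B:
GET /~betty/index.html HTTP/1.0
Docroot used: /home/betty/public_html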
33
User home directory docroots
Another common use of docroots gives people private web sites on a web server.
A typical convention maps URIs whose paths begin with a slash and tilde (/~) followed by a username to a private document root for that user.
The private docroot is often the folder called public_html inside that user’s home directory, but it can be configured differently (e.g., the NCNU web server uses WWW as the user’s private document root).
In Apache’s configuration:
UserDir public_html
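The /~username convention can be sketched in Python as below. The function name and the /home base directory are assumptions for illustration; the user_dir parameter plays the role of the UserDir setting.

```python
import posixpath


def map_user_uri(uri_path, home_base="/home", user_dir="public_html"):
    """Map /~user/rest to <home_base>/user/<user_dir>/rest.

    Returns None if the path is not a /~user URI or would escape the
    user's private docroot.
    """
    if not uri_path.startswith("/~"):
        return None
    user, _slash, rest = uri_path[2:].partition("/")
    if user in ("", ".", ".."):
        return None
    root = posixpath.join(home_base, user, user_dir)
    candidate = posixpath.normpath(posixpath.join(root, rest))
    if candidate != root and not candidate.startswith(root + "/"):
        return None  # ".." segments backed up out of the private docroot
    return candidate
```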
34
Directory listings
A web server can receive requests for directory URLs, where the path resolves to a directory, not a file.
Most web servers can be configured to take a few different actions when a client requests a directory URL:
Return an error.
Return a special, default “index file” instead of the directory.
Scan the directory, and return an HTML page containing the contents.
35
Directory Listings (continued)
Most web servers look for a file named index.html or index.htm inside a directory to represent that directory.
In Apache configuration:
DirectoryIndex index.html index.htm home.html home.htm index.cgi
Disable the automatic generation of directory index files with the Apache directive:
Options -Indexes
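The three possible actions for a directory URL can be sketched in Python as follows. The function name, the return convention, and the use of status 403 for the error case are assumptions of this sketch.

```python
import html
import os
from urllib.parse import quote

INDEX_NAMES = ["index.html", "index.htm"]


def resolve_directory(dir_path, index_names=INDEX_NAMES, allow_listing=True):
    """Decide how to answer a request for a directory.

    Returns ("file", path) for an index file, ("listing", html_text) for a
    generated listing, or ("error", 403) when listings are disabled.
    """
    # 1. Prefer a default index file, tried in configured order.
    for name in index_names:
        candidate = os.path.join(dir_path, name)
        if os.path.isfile(candidate):
            return ("file", candidate)
    # 2. No index file and listings disabled: return an error.
    if not allow_listing:
        return ("error", 403)
    # 3. Otherwise scan the directory and build an HTML page.
    items = "".join(
        f'<li><a href="{quote(name)}">{html.escape(name)}</a></li>'
        for name in sorted(os.listdir(dir_path))
    )
    return ("listing", f"<html><body><ul>{items}</ul></body></html>")
```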
36
Dynamic content resource mapping
Web servers can also map URIs to dynamic resources, that is, to programs that generate content on demand.
In fact, a whole class of web servers called application servers connects web servers to sophisticated backend applications.
The web server needs to be able to tell when a resource is a dynamic resource, where the dynamic content generator program is located, and how to run the program.
37
Dynamic content …
In Apache’s configuration:
ScriptAlias /cgi-bin/ /usr/local/etc/httpd/cgi-programs/
AddHandler cgi-script .cgi
CGI is an early, simple, and popular interface for executing server-side applications. Modern application servers have more powerful server-side dynamic content support, including Active Server Pages, Java servlets, and PHP.
38
Dynamic Content Resource Mapping
[Figure: a client and a server connected through the Internet]
39
Server-Side Includes (SSI)
Many web servers also provide support for server-side includes.
If a resource is flagged as containing server-side includes, the server processes the resource contents before sending them to the client.
The contents are scanned for certain special patterns, which can be variable names or embedded scripts. The special patterns are replaced with the values of variables or the output of executable scripts.
This is an easy way to create dynamic content.
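A minimal Python sketch of the variable-substitution part of SSI. It handles only patterns of the form <!--#echo var="NAME" -->, which is the syntax Apache uses for echoing a variable; embedded scripts are not covered. The function name and the "(none)" placeholder for unknown variables are assumptions of this sketch.

```python
import re

# Matches <!--#echo var="NAME" --> and captures NAME.
SSI_ECHO = re.compile(r'<!--#echo\s+var="([^"]+)"\s*-->')


def expand_ssi(content, variables):
    """Replace each echo pattern with the value of the named variable."""
    def substitute(match):
        return str(variables.get(match.group(1), "(none)"))
    return SSI_ECHO.sub(substitute, content)
```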
40
Access controls
Web servers can also assign access controls to particular resources.
When a request arrives for an access-controlled resource, the web server can control access based on the IP address of the client, or it can issue a password challenge before granting access to the resource.
We will see more details in a later lecture, Chapter 12 (HTTP authentication).
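A minimal Python sketch of these two checks, using the standard ipaddress module. The way the two checks are combined here (allowed networks pass directly, everyone else gets a 401 challenge until authenticated) is a hypothetical policy chosen for illustration; real servers let the administrator configure this.

```python
import ipaddress


def ip_allowed(client_ip, allowed_networks):
    """True if client_ip falls inside any of the allowed networks."""
    address = ipaddress.ip_address(client_ip)
    return any(
        address in ipaddress.ip_network(network)
        for network in allowed_networks
    )


def access_status(client_ip, allowed_networks, authenticated=False):
    """Return 200 to grant access, or 401 to issue a password challenge."""
    if ip_allowed(client_ip, allowed_networks) or authenticated:
        return 200
    return 401
```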
41
Step 5: Building Responses
Once the web server has identified the resource, it performs the action described in the request method and returns the response message, which contains a status code, response headers, and a response body.
Response Entities
MIME Typing
Redirection
42
Response entities
If the transaction generated a response body, the content is sent back with the response message, which usually contains:
a Content-Type header, i.e., MIME typing
a Content-Length header, describing the body size
the actual message body content
43
MIME typing
The web server is responsible for determining the MIME type of the response body.
There are many ways to configure servers to associate MIME types with resources:
mime.types: extension-based type association.
Magic typing: content-based association, scanning the content for known patterns.
Explicit typing: force particular files or directory contents to have a MIME type, regardless of the file extension or contents.
Type negotiation: the server is configured to store a resource in multiple document formats. In a client-server negotiation process, the server can determine the “best” format to use (Chapter 17).
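The first three approaches can be sketched in Python as below; type negotiation is left out. The function names, the short table of magic prefixes, the default type, and the order in which the approaches are tried are all assumptions of this sketch. Extension-based typing uses the standard mimetypes module.

```python
import mimetypes

# A few well-known leading byte patterns ("magic numbers").
MAGIC_PREFIXES = [
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"%PDF-", "application/pdf"),
]


def type_from_extension(path):
    """Extension-based typing, like a mime.types table."""
    content_type, _encoding = mimetypes.guess_type(path)
    return content_type


def type_from_magic(first_bytes):
    """Content-based typing: look for a known pattern at the start."""
    for prefix, content_type in MAGIC_PREFIXES:
        if first_bytes.startswith(prefix):
            return content_type
    return None


def choose_mime_type(path, first_bytes=b"", explicit=None,
                     default="application/octet-stream"):
    """Explicit typing wins, then the extension, then magic, then a default."""
    return (explicit
            or type_from_extension(path)
            or type_from_magic(first_bytes)
            or default)
```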
44
MIME Typing
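[Figure: the client requests the hychen.gif file from www.csie.ncnu.edu.tw, and the server labels the response body with its MIME type]
HTTP request message (contains the command and the URI):
GET /specials/hychen.gif HTTP/1.1
Host: www.csie.ncnu.edu.tw
HTTP response message:
HTTP/1.1 200 OK
Content-type: image/gif
Content-length: 8572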
45
Redirection
Web servers sometimes return redirection responses (indicated by a 3XX return code) instead of success messages. The Location response header contains a URI for the new or preferred location of the content.
Redirections are useful for:
Permanently moved resources
Temporarily moved resources
URL augmentation
Load balancing
Server affinity
Canonicalizing directory names
46
300-399: Redirection Status Codes
Status code   Reason phrase
300           Multiple Choices
301           Moved Permanently
302           Found
303           See Other
304           Not Modified
305           Use Proxy
306           (Unused)
307           Temporary Redirect
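As one illustration of redirection, canonicalizing a directory name (redirecting /tools to /tools/) can be sketched in Python as below. The function name, the use of status 301, and the example host and path are assumptions of this sketch.

```python
def directory_redirect(host, uri_path):
    """Build a 301 response that redirects /dir to /dir/."""
    location = f"http://{host}{uri_path}/"
    response = (
        "HTTP/1.1 301 Moved Permanently\r\n"
        f"Location: {location}\r\n"
        "Content-Length: 0\r\n"
        "\r\n"
    )
    return response.encode("ascii")
```

A client that receives this response is expected to repeat the request at the URI given in the Location header.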
47
Step 6: Sending Responses
The server may have many connections to many clients, some idle, some sending data to the server, and some carrying response data back to the clients.
The server needs to keep track of connection state and handle persistent connections with special care.
For non-persistent connections, the server is expected to close its side of the connection when the entire message is sent.
For persistent connections, the connection may stay open, in which case the server needs to be extra cautious to compute the Content-Length header correctly, or the client will have no way of knowing when a response ends (cf. Chapter 4).
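A minimal Python sketch of building a response whose Content-Length is computed from the actual body bytes, so that a client on a persistent connection can tell where the response ends. The function name and the fixed 200 OK status line are assumptions of this sketch.

```python
def build_response(body, content_type, keep_alive):
    """Build a 200 response with Content-Length taken from the body bytes."""
    headers = [
        "HTTP/1.1 200 OK",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(body)}",  # byte count, not character count
        "Connection: " + ("keep-alive" if keep_alive else "close"),
    ]
    return ("\r\n".join(headers) + "\r\n\r\n").encode("ascii") + body
```

When keep_alive is false, the server would also close its side of the connection after sending these bytes.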
48
Step 7: Logging
Finally, when a transaction is complete, the web server writes an entry into a log file, describing the transaction performed.
Most web servers provide several configurable forms of logging (see later lectures, Chapter 21, for details).
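As an example of one log entry, a Common Log Format line can be sketched in Python as below. The function name and parameter names are hypothetical; the second field is the ident username, written as a hyphen when none is available.

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def common_log_line(host, ident, user, when, request_line, status, size):
    """Format one Common Log Format entry.

    when must be a timezone-aware datetime; ident and user may be None.
    """
    timestamp = "%02d/%s/%04d:%s" % (
        when.day, MONTHS[when.month - 1], when.year,
        when.strftime("%H:%M:%S %z"),
    )
    return '%s %s %s [%s] "%s" %d %d' % (
        host, ident or "-", user or "-", timestamp,
        request_line, status, size,
    )
```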
49
Reference: Web server
http://www.apache.org
  The Apache web site
http://www.w3c.org/Jigsaw
  Jigsaw, W3C’s server
http://www.ietf.org/rfc/rfc1413.txt
  RFC 1413, “Identification Protocol,” by M. St. Johns.