caching and content distribution networks. some interesting observations r top 1 % of all documents...

33
Caching and Content Distribution Networks

Upload: christina-warner

Post on 24-Dec-2015

217 views

Category:

Documents


0 download

TRANSCRIPT

Caching and Content Distribution Networks

Some Interesting Observations

Top 1 % of all documents account for 20% - 35% of proxy requests

Top 10% account for 45% - 55% of requests

It takes 25% to 40% of all documents to account for 70% of requests

It takes 70% to 80% of all documents to account for 90% of requests

Web Caching

As an example, we use the web to illustrate caching and other related issues

browser Web Proxycache

request

response

request

response

Web server

browserWeb server

request

response

Web Browser Caching

Web browsers have their own caches. When a page is downloaded from a site the web page is put into the browser cache.

This is especially useful in those cases when the back button is pressed.

If a new copy is needed then a “refresh” can be done.

No page stays permanently in the cache. There is limited room. A replacement algorithm is needed to determine

which cached page should be purged.

Web Browser Caching

Client pull The server provides the content with instructions on

when the client should ask for a refreshed copy of the content or if the content should be cached.

Server push The server transmits page information to the screen. The browser application displays the information

and leaves the connection to the server open. With an open connection, the server can continue to

push updated pages for your screen to display on an ongoing basis. You can close the connection by closing the page.

The server is in control Browser caches are different from proxy caches

(discussed next).

Web Caching

Proxy caches (also called proxy server) Intercepts HTTP requests from client

• Serves object if in its cache• If not goes to object’s home server

– On behalf of user, gets the object and possibly deposits in its cache before returning to user

• Usually deployed at edges of a network– Wide area bandwidth savings, improved response time

and increased availability of static web-based objects A browser may have to be configured to point

to the proxy server. Usually a proxy cache is purchased and

installed by an ISP

Push-Based Approach Server tracks all proxies that have requested

objects If a web page is modified, notify each proxy Notification types

Indicate object has changed [invalidate] Send new version of object [update]

How to decide between invalidate and updates? Pros and cons? One approach: send updates for more

frequently accessed objects, invalidate for rest

proxyWeb server

push

Push-Based Approaches

Advantages Provide tight consistency [minimal stale data] Proxies can be passive

Disadvantages Need to maintain state at the server

• Recall that HTTP is stateless• Need mechanisms beyond HTTP

State may need to be maintained indefinitely• Not resilient to server crashes

The disadvantage is the reason why push-based approaches are not used

Pull-Based Approaches

The proxy is entirely responsible for maintaining consistency

The proxy periodically polls the server to see if object has changed Use if-modified-since HTTP messages: This type of

message can be used by a proxy to tell a remote server to return a copy only if it has been modified.

Key question: When should a proxy poll? Server-assigned Time-to-Live (TTL) values

• No guarantee if the object will change in the interim

proxyWeb server

poll

response

Pull-Based Approach: Intelligent Polling

Proxy can dynamically determine the refresh interval Compute based on past observations

• Start with a conservative refresh interval• Increase interval if object has not changed

between two successive polls• Decrease interval if object is updated between

two polls• Adaptive: No prior knowledge of object

characteristics needed

Pull-Based Approach

Advantages Server remains stateless Resilient to both server and proxy failures

Disadvantages Weaker consistency guarantees (objects

can change between two polls and proxy will contain stale data until next poll)

High message overhead

A Hybrid Approach: Leases Lease: Duration of time for which server agrees to notify

proxy of modification Issue lease on first request, send notification until expiry

Need to renew lease upon expiry Smooth tradeoff between state and messages exchanged

Zero duration => polling, Infinite leases => server-push Efficiency depends on the lease duration Limited use

Client Proxy

Server

Get + lease req

Reply + leaseread

Invalidate/update

Cooperative Caching

Caching infrastructure can have multiple web proxies Proxies can be arranged in a hierarchy or

other structures Proxies can cooperate with one another

• Answer client requests• Propagate server notifications

Uses a combination of HTTP and ICP (Internet Caching Protocol).

• ICP can be used by one cache to quickly ask another cache if it has an object.

• HTTP is used to actually retrieve the object.

Problems

Caching proxies do not serve all Internet users.

Content providers (say, Web servers) cannot rely on existence and correct implementation of caching proxies.

Accounting issues with caching proxies: Example: www.cnn.com needs to know the

number of hits to the advertisements displayed on the web page.

Content Distribution Networks (CDN)

Business Model: A content provider such as www.cnn.com or Yahoo pays a CDN company (such as Akamai) to get its content to the requesting users with short delays.

A CDN provides a mechanism for Replicating content on multiple servers

in the InternetProviding clients with a means to

determine the servers that can deliver the content fastest.

Terminology

Content: Any publicly accessible combination of text, images, applets, frames, MP3, video, flash, virtual reality objects, etc.

Content Provider: Any individual, organization, or company that has content that it wishes to make available to users.

Origin Server: Content provider’s server , where the content is first uploaded.

Surrogate Server (sometimes called edge server): Content distributor’s server, where the replicated content is kept.

Players

Content Provider

H/W and S/W Vendor

Content Distributor

Hosting Provider

Yahoo, MSNBC, CNN

Cisco, Lucent, Inktomi, CacheFlow

Akamai, Digital Island, AT&T

Exodus

Sells se

rvers

Send content

Install

servers

CDN: Distribution

The CDN company places hundreds of CDN servers in Internet hosting centers.

The CDN replicates its customers’ content in the CDN servers. Whenever, a customer updates its content (e.g., web page), the CDN redistributes the fresh content to the CDN servers.

The CDN provides a mechanism so that when a user requests content, the content is provided by the CDN server that can most rapidly deliver the content to the user. This can be the closest CDN server to the user

(perhaps in the same ISP as the user) or may be a CDN server with a congestion-free path to the user.

CDN: Distribution

CDN server in Asia

CDN server in Europe

CDN server in SouthAmerica

CDN distribution node

Origin server inNorth America

push content

push content

push contentpush content

Akamai CDN

CDN: Functional Components

Distribution Service Redirection Service Accounting and Billing system

CDN:Distribution Service

The content provider determines which of its objects it wants the CDN to distribute.

The content provider tags and then pushes this content to a CDN node, which in turn replicates and pushes the content to all its CDN servers.

CDN: Distribution Service

When a browser in a user’s host is instructed to retrieve a specific object (specified using a URL), how does the browser determine whether it should retrieve the object from the origin server or from one of the CDN servers?

As an example, suppose the hostname of the content provider is www.cnn.com

Suppose the hostname of the CDN company is www.akamai.com

CDN: Redirection

Users get an html document from www.cnn.com; this could be index.html

The file index.html uses a modified URL for content that has been replicated. Example: If the gif files are what has been

replicated then <img src=“http://cnn.com/af/x.gif> may be modified as follows:

<img src=http://a73.g.akamaitech.net/7/23/cnn.com/af/x.gif>

The browser needs to resolve aXYZ.g.akamaitech.net hostname for replicated content.

CDN: Redirection

DNS is configured so that all queries about g.akamaitech.net that arrive at a DNS server are sent to an authoritative DNS server for g.akamaitech.net. This is referred to as a Akamai DNS server (authoritative DNS server)

When the Akamai DNS server receives the query, it extracts the IP address of the requesting browser.

Based on the IP address and information that it has about the Internet (called a map), the IP address of an Akamai server(surrogate server) is returned to the requesting browser based on policy e.g., select the server that is the fewest hops away.

CDN Redirection

The Akamai DNS server IP address is now in the cache of the local DNS server.This implies that it is not always

necessary to go to the root DNS server. The TTL associated with the IP address of

an Akamai server(surrogate) is relatively small. This is done for performance reasons.

Akamai content distribution servers are caches

CDN Redirection

What if content is not there? If the request content is not found then the

surrogate will ask other surrogates within a specified region for information.

If requested information is still not found or is stale, then a request is made to the original web site.

CDN Redirection

...

<img src="http://www.cdn.com/cnn/images/1.gif”>

...

Index.html

GET w

ww

.cnn.co

m/in

dex.h

tml

Ind

ex.h

tml

DNS query: cdn.com ?

GET /cnn/im

ages/1.gif

1.gif

64.236.24.28

Authoritative DNS server for cdn.com

Local DNS serverClient

CNN.com

64.236.24.28

DN

S qu

ery:

cdn

.com

?64

.236

.24.

28

PUT /images/*.gif

CDN Selection The tricky issue is selecting which local

content server to use for a particular request Want to spread load evenly Want minimal impact if server is added or

removed. In Akamai, each surrogate server sends

measurement results to the Network Operations Communications Center (NOCC). Measurement results include number of active

TCP connections, HTTP request arrival rate, bandwidth availability, etc

This information is used by the Akamai DNS server.

Accounting Mechanism Accounting mechanisms collect and

track information related to request routing, distribution and delivery.

Information is gathered in real time and put into log files for each CDN component.

This gets sent to the Network Operations Communications Center (NOCC).

Full Site Delivery vs. Partial Site Delivery

Full Site Delivery : All the contents are delivered by the CDN (including HTML, images, and other objects).

Partial Site delivery: Only images, streaming media and other bandwidth intensive objects delivered by the CDN.

CDNs and Content

Content Suitable for CDNS Images Streaming media Java applets Static information

Content not suitable Dynamic information Personalized information

Current Akamai Customers

Summary

We have examined replication and issues related to the design and implementation of a replicated system.

Many choices and tradeoffs to consider