
Decision Support Systems 46 (2008) 52–60


A new approach for a proxy-level web caching mechanism

Chetan Kumar a,⁎, John B. Norris b,1

a Department of Information Systems and Operations Management, College of Business Administration, California State University San Marcos, 333 South Twin Oaks Valley Road, San Marcos, CA 92096, United States
b Krannert School of Management, Purdue University, 403 West State Street, West Lafayette, IN 47907, United States

Article info

Article history: Received 12 October 2007; Received in revised form 7 April 2008; Accepted 21 May 2008; Available online 27 May 2008

⁎ Corresponding author. Tel.: +1 760 477 3976. E-mail addresses: [email protected] (C. Kumar), [email protected] (J.B. Norris).

0167-9236/$ – see front matter © 2008 Elsevier B.V. All rights reserved. doi:10.1016/j.dss.2008.05.001

Abstract

In this study we propose a new proxy-level web caching mechanism that takes into account aggregate patterns observed in user object requests. Our integrated caching mechanism consists of a quasi-static portion that exploits historical request patterns, as well as a dynamic portion that handles deviations from normal usage patterns. This approach is more comprehensive than existing mechanisms because it captures both the static and the dynamic dimensions of user web requests. The performance of our mechanism is empirically tested against the popular least recently used (LRU) caching policy using an actual proxy trace dataset. The results demonstrate that our mechanism performs favorably versus LRU. Our caching approach should be beneficial for computer network administrators to significantly reduce web user delays due to increasing traffic on the Internet.

© 2008 Elsevier B.V. All rights reserved.

Keywords: Web caching; Proxy-level mechanism; Web request patterns; Performance evaluation

1. Introduction and problem motivation

There has been tremendous growth in the amount of information available on the Internet. The trend of increasing traffic on the Internet is likely to continue [5]. Despite technological advances this huge traffic can lead to considerable delays in accessing objects on the web [13,17]. Web caching is one of the approaches to reduce such delays. Caching involves storing copies of objects in locations that are relatively close to the user. This allows user requests to be served faster than if they were served directly from the origin web server [2,5,9,10].

Caching may be performed at different levels, namely the browser, proxy, and web-server levels [6,11]. Browser caching typically occurs closest to the end user, such as the user computer's hard disk [13]. Proxy caches are situated at network access points for web users [7]. Consequently proxy caches can store documents and directly serve requests for them in the network, thereby avoiding repeated traffic to web servers. This results in reducing network traffic, load on web servers, and the average delays experienced by network users while accessing the web [1,5]. Proxy caching is widely used by computer network administrators, technology providers, and businesses to reduce user delays on the Internet [7]. Examples include proxy caching solution providers such as IBM (www.ibm.com/websphere), Internet service providers (ISP) such as AOL (www.aol.com), and content delivery network (CDN) firms such as Akamai (www.akamai.com). Effective proxy caching has benefits for both the specific network where it is used as well as for all Internet users in general. Web-server caching, which is performed at the source of web content, focuses on reducing demand for HTTP connections to a single server [13]. A caching mechanism, performed at any network level, implies the characterization of the following two key decisions: which objects are to be stored in the cache (cache entry decision), and which of the currently cached objects are to be evicted to make room for new ones (cache replacement decision). In this paper we propose a new proxy-level caching mechanism that takes into account aggregate patterns observed in user object requests.

Based on how frequently the contents of the cache are modified, the existing caching mechanisms can be classified into two types: static, where the contents of the cache are fixed, and dynamic, where the contents are changed dynamically according to incoming user requests [2,13,14,17]. Our proposed caching mechanism falls into a third category: quasi-static, where the same objects are retained in the cache in between pre-determined time intervals, while they could be changed across intervals. Thus this mechanism is static within time intervals but may be dynamic across intervals.

We derive the motivation for our approach from studies that have demonstrated the existence of patterns in proxy-level user requests. Cao and Irani [1], Rizzo and Vicisano [15], and Lorenzetti et al. [12] have shown that users typically re-access documents on a daily basis, with surges in demand for a document occurring in multiples of 24 h. Cao and Irani [1] have further demonstrated that these proxy-level patterns exist due to the combined effect of individual user re-access behavior, even with the presence of browser caches. However, the earlier studies concentrate on repeating 24 h access patterns for static documents that remain unchanged in terms of content and size. Instead, in our caching approach we aim to identify and exploit repeating access patterns for documents whose contents may be changed over time, but whose uniform resource locator (URL) address remains the same. Examples of this type of web content include the front pages of many sites (e.g., www.yahoo.com, www.aol.com, www.cnn.com, www.google.com, etc.), which may vary the specific content of their sites but retain the same name for the home page. Thus even if the contents of the website front page change, as long as we have identified aggregate user patterns for accessing the site at a particular time of the day, then the latest contents of the front page can be downloaded prior to the spike in user requests.

We exploit the repeated-access pattern in our mechanism by making the caching decision for a specific time interval based on the history of observed requests for the same interval. However, the caching decision also depends on the cost associated with caching. Determining the optimal quasi-static caching decision, while considering these factors, is the first part of our study. Following that we extend the mechanism to include a dynamic policy as well, i.e., current user access patterns are also taken into consideration to determine the objects to be cached. Note that users may exhibit repeating access patterns beyond the front page level of websites. For example, after viewing the CNN front page users may often view reports in the CNN/Weather section. Another example is when users access a section of a website that has a URL with a unique session ID assigned to it. However, since the names of documents in a specific section of a website are typically changed across days, it will be difficult to identify historical access patterns at a sub-front page level. Hence all objects beyond the front page level are treated as current requests and can be cached by the dynamic part of our policy. A caching mechanism that contains both quasi-static and dynamic dimensions can handle, besides normal usage patterns, unanticipated events such as natural disasters or major accidents which can generate huge unexpected loads on websites. Therefore the objective of this study is to develop an integrated caching mechanism that utilizes both historically and currently observed request occurrences. The performance of the proposed mechanism is evaluated using real world data. The results indicate that our caching approach is beneficial for computer network administrators to significantly reduce delays experienced by web users at proxy server levels.

The plan of the rest of this paper is as follows. We first discuss literature related to our topic. We then illustrate the model for our caching mechanism. Next we present performance results for the mechanism using a proxy trace dataset. The results section consists of two sub-parts: first the quasi-static portion is evaluated, followed by the integrated mechanism. Finally we discuss conclusions and areas for future research.

2. Literature review

While caching has been extensively studied in computer science, there has recently been a growing interest in the topic in the Information Systems (IS) area. Datta et al. [5] have identified caching to be a key research area due to its application in reducing user delays while accessing the increasingly congested Internet. Zeng et al. [18] have further highlighted the benefits of caching strategies that can handle dynamic content. Podlipnig and Boszormenyi [14], Zeng et al. [18], and Datta et al. [5] provide an extensive survey of the numerous caching techniques that have been proposed. These include popular cache replacement strategies such as least recently used (LRU), where the least recently requested object is evicted from the cache to make space for a new one; least frequently used (LFU), where the least frequently requested object is removed; lowest latency first, where the document with the lowest download latency or time delay is removed; size, where the largest document is evicted; and their numerous extensions. The LRU policy, with its variations, is one of the most commonly employed proxy caching mechanisms [1,12,14,18]. The advantage of LRU is its simplicity. On the other hand, LRU does not consider historical patterns in user requests. Most caching studies focus on improving performance on metrics such as user latency and bandwidth reduction. There have been relatively few studies that consider a data or model driven approach for managing caches effectively. The Mookherjee and Tan [13] study provides an analytical framework for the LRU cache replacement policy. The framework is utilized to evaluate LRU policy performance under different demand and cache characteristics. The study specifically models caching at the browser level for individual caches. Hosanagar and Tan [9] develop a model for optimal replication of objects using a version of the LRU policy. The study considers a framework of two caches whose capacities are partitioned into regions where the level of duplication is controlled. Their model does not utilize historical patterns of web requests for caching decisions. Hosanagar et al. [10] develop an incentive compatible pricing scheme for caching, involving content providers and cache operators, with multiple levels of Quality of Service. The study focuses on improving adoption of caching services among content providers rather than developing a specific caching mechanism. Cockburn and McKenzie [4], and Tauscher and Greenberg [16], specifically study client-side behavior in the context of the Internet. They suggest that the probability of users revisiting websites is very high. Cao and Irani [1], Rizzo and Vicisano [15], and Lorenzetti et al. [12] have further demonstrated 24 h re-access patterns for documents. But they do not consider dynamic documents such as website front pages that often change contents. Zeng et al. [18] describe some caching methods that use past access patterns to anticipate requests, such as the Top-10 algorithm that compiles a list of the most popular websites. Our caching mechanism partially follows along those lines in the quasi-static portion, but also exploits the 24 h re-access patterns mentioned earlier. Further, we include a dynamic portion in our integrated mechanism that can handle deviations from past requests. Therefore our study is distinct from prior work as we consider both historical patterns and requests for dynamic content in order to develop a comprehensive proxy-level caching mechanism. This study expands on the preliminary research of Kumar and Norris [11] through a comprehensive performance evaluation of the proposed mechanism using an actual proxy trace dataset.

3. Model

We first consider a quasi-static caching mechanism with no dynamic policy. A 0–1 mathematical program model is developed for this mechanism. It minimizes the total cost of caching and the delay due to requests for objects that are not in the cache, given the constraint on cache size. The parameters for the mathematical program model are: the number, n, of time intervals in a 24-h period for which caching decisions are to be provided; the capacity, k, of the cache; the cost, c, of caching an object; and the delay, t, to download an object from the web server. The number of past requests, $R_{ij}^{past}$, for object i (i = 1, …, m) in time interval j (j = 1, …, n) is an objective coefficient.

The variables are:

$$x_{ij}=\begin{cases}1 & \text{if object } i \text{ is cached at time interval } j\\ 0 & \text{otherwise,}\end{cases}\qquad y_{ij}=\begin{cases}1 & \text{if a caching cost is incurred for } x_{ij}=1\\ 0 & \text{otherwise.}\end{cases}$$

Using the above, we formulate the mathematical program for the quasi-static model as follows:

$$(W)\qquad \min\;\; t\sum_{i=1}^{m}\sum_{j=1}^{n}\left(1-x_{ij}\right)R_{ij}^{past} \;+\; c\sum_{i=1}^{m}\sum_{j=2}^{n}y_{ij} \;+\; c\sum_{i=1}^{m}x_{i1} \qquad (1)$$

$$\text{s.t.}\qquad x_{ij}-x_{i,j-1} \le y_{ij} \quad \text{for } i=1,\dots,m \text{ and } j=2,\dots,n \qquad (2)$$

$$\sum_{i=1}^{m}x_{ij} \le k \quad \text{for } j=1,\dots,n \qquad (3)$$

$$x_{ij},\, y_{ij} \in \{0,1\} \quad \text{for } i=1,\dots,m \text{ and } j=1,\dots,n. \qquad (4)$$

The decision variable $x_{ij}$ provides the optimal solution for problem (W), i.e., the objects that should be cached at any time interval. Variable $y_{ij}$ is used for calculating the cost of caching. The first term of the objective function (1) represents the communication delays incurred when requests are served directly from the origin server. When an object is not present in the cache, $x_{ij}=0$ and the cost for serving requests is t times the number of requests. The second and third terms together capture the cost of updating caches that is incurred whenever a new object is brought into the cache. This cost takes into account the computer resources required to refresh cache contents. The second term considers the cache updating cost at any given time interval j where j ≠ 1. Note that this cost is incurred only if an object is not already present in the cache in the previous interval j−1. The third term is the updating cost in the initial time interval j = 1, which is always incurred when objects are first cached in the quasi-static mechanism. The objective is to minimize these costs, given the following constraints. Constraints (2) ensure that the cache updating cost in period j is only incurred when a new object is brought into the cache relative to the preceding period j−1. This is because if for an object i both $x_{i,j-1}=1$ and $x_{ij}=1$ then the cache need not be updated. Only if $x_{i,j-1}=0$ and $x_{ij}=1$ does $y_{ij}=1$ and an updating cost is incurred. Constraints (3) capture the cache capacity restrictions. Constraints (4) ensure that the variables can only take binary values.
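For concreteness, a minimal sketch of problem (W) as a 0–1 program is given below in Python using the open-source PuLP modeling library. This is an illustration only: the paper solves the simplified form (DW) with CPLEX 8.1, and the function name and the layout of the request-count input here are assumptions.

```python
# Sketch of the quasi-static caching program (W), Eqs. (1)-(4).
# Assumes the PuLP library and its default CBC solver; the paper uses CPLEX 8.1.
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary

def solve_quasi_static(R_past, k, t, c):
    """R_past[i][j]: requests observed for object i in interval j over the 30 day window."""
    m, n = len(R_past), len(R_past[0])
    prob = LpProblem("quasi_static_caching", LpMinimize)
    x = LpVariable.dicts("x", (range(m), range(n)), cat=LpBinary)  # object i cached in interval j
    y = LpVariable.dicts("y", (range(m), range(n)), cat=LpBinary)  # cache-update cost indicator

    # Objective (1): origin-server delay for uncached requests plus cache-updating costs.
    prob += (t * lpSum((1 - x[i][j]) * R_past[i][j] for i in range(m) for j in range(n))
             + c * lpSum(y[i][j] for i in range(m) for j in range(1, n))
             + c * lpSum(x[i][0] for i in range(m)))

    # Constraints (2): an update cost arises only when object i newly enters the cache in interval j.
    for i in range(m):
        for j in range(1, n):
            prob += x[i][j] - x[i][j - 1] <= y[i][j]

    # Constraints (3): at most k unit-sized objects cached in every interval.
    for j in range(n):
        prob += lpSum(x[i][j] for i in range(m)) <= k

    prob.solve()  # constraints (4) are enforced by the binary variable category
    return [[int(x[i][j].value()) for j in range(n)] for i in range(m)]
```

A call such as `solve_quasi_static(R_past, k=10, t=2, c=1)` would return a 0–1 caching schedule analogous to the $x_{ij}$ decisions evaluated in Section 4 (with CPLEX used there in place of CBC).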

Collecting terms in Eq. (1) and dropping the constant term $t\sum_{i=1}^{m}\sum_{j=1}^{n}R_{ij}^{past}$, which does not depend on the decision variables, (W) can be rewritten as follows:

$$(DW)\qquad \min\;\; -t\sum_{i=1}^{m}\sum_{j=2}^{n}x_{ij}R_{ij}^{past} \;+\; c\sum_{i=1}^{m}\sum_{j=2}^{n}y_{ij} \;+\; \sum_{i=1}^{m}\left(-tR_{i1}^{past}+c\right)x_{i1} \qquad \text{s.t. } (2),\,(3),\,(4). \qquad (5)$$

The solution to (DW) provides the optimal quasi-static caching decisions for the n time periods under consideration. Problem (DW) is a 0–1 integer program, a class of problems that is NP-hard to solve [8]. However, our problem structure is such that mathematical program solvers such as CPLEX can solve reasonably large problem sizes quickly using branch and bound techniques. This is facilitated as we assume uniform object sizes as a first cut of the model, analogous to the basic LRU policy. In addition, in our problem n is typically small, in the order of 24 or 48 time intervals, which reduces the number of variables under consideration. Solution approaches and problem sizes are discussed further in the performance analysis of Section 4. To study the performance of our approach, (DW) is solved with observed $R_{ij}^{past}$ values from historical data and using different values of parameters n, k, c, and t. In order to conduct a statistically meaningful analysis we use a moving window of 30 days of proxy requests data for recording $R_{ij}^{past}$. The commercially available mathematical program solver CPLEX version 8.1 is used for solving (DW).

Next we incorporate a dynamic dimension into our caching mechanism. A portion of the cache is now set aside to handle cases when there are a significant number of requests that deviate from historically observed patterns. The partitioning of the cache between the quasi-static and the dynamic portions is determined by the volume of unanticipated requests. The more deviation there is between currently observed request occurrences and $R_{ij}^{past}$, the greater will be the dynamic portion. This kind of adaptive allocation of the proportion of dynamic and quasi-static caching allows the mechanism to handle both historical patterns as well as deviations from such patterns.

4. Performance analysis

The performance of our caching mechanism (in terms of caching costs and request delays) is tested against the LRU caching mechanism, using an actual proxy trace dataset. The LRU cache replacement strategy, along with its many extensions, is widely employed for proxy caching [1,12,14,18]. Therefore we use LRU as a benchmark to compare the performance of our proposed mechanism.

Table 1
URL request frequencies for days 1 through 62

Frequency of requests    Number of URLs (cumulative)
>10,000                  32 (32)
9999–5000                17 (49)
4999–2000                68 (117)
1999–1000                128 (245)
999–500                  227 (472)
499–200                  537 (1009)
199–100                  905 (1914)
99–50                    2703 (3608)
49–30                    2371 (5979)
<30                      139,166 (145,145)


We employ the following cache entry and replacement policies for implementing an LRU caching mechanism. Any newly requested object that is not already present in the cache is always brought into the cache. The least recently requested object is evicted from the cache to make way for new objects. A detailed illustration of LRU implementation using parameters is provided in the next sub-section.
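As a point of reference, the following is a minimal sketch of this benchmark LRU cache in Python, assuming unit-sized objects and a capacity of k objects; the bookkeeping in Fig. 1 may differ in detail.

```python
from collections import OrderedDict

class LRUCache:
    """Benchmark LRU proxy cache: every missed object is admitted, and the least
    recently requested object is evicted once the capacity of k objects is exceeded."""

    def __init__(self, k):
        self.k = k
        self.store = OrderedDict()  # URL ids ordered from least to most recently requested

    def request(self, url):
        """Serve one request; returns True on a cache hit and False on a miss."""
        if url in self.store:
            self.store.move_to_end(url)      # mark as most recently requested
            return True
        self.store[url] = True               # cache entry: always bring the new object in
        if len(self.store) > self.k:
            self.store.popitem(last=False)   # cache replacement: evict the LRU object
        return False
```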

We have obtained data of the proxy traces of web objects requested in the nine server locations of the IRCache network across multiple days (www.ircache.net). For performance testing we utilize 62 days of trace data of the New York IRCache proxy server. The data was collected between 29 April and 30 June, 2004. The comprehensive trace data includes the URLs of requests, the times when they are requested, the type of object requested, an assigned identifier for the IP address of the user requesting a URL, and the elapsed time for serving the request. In our model we include all front page requests that are defined as "page views" (i.e., objects with suffixes such as .htm, .html, .php, and .jsp) in Christ et al. [3]. As mentioned earlier, the documents within specific sections of the website front page are to be cached in the dynamic portion of our mechanism. If a specific section has a URL name that is unchanged across days then it may be treated as an object to be considered for caching in the quasi-static portion. As stated earlier, we assume all objects to be of unit size as a first cut of the model. Subsequent versions of the mechanism can be modified to include different object sizes.

Table 2
Top 20 requested sites for days 1 through 62

Site                  Number of requests
yahoo.com             138,775
friendster.com        89,814
microsoft.com         57,951
gator.com             52,064
msn.com               46,096
doubleclick.net       43,550
cisinternet.net       38,131
google.com            37,140
icq.com               36,914
animespy.com          34,327
water.com             32,395
ebay.com              31,095
hotbar.com            24,503
formulababe.com       23,288
phpwebhosting.com     21,107
atwola.com            20,855
wv-cis.net            17,475
17tahun.com           15,950
aol.com               15,950
atdmt.com             15,756

We include the following subset of fields from the comprehensive dataset for our analysis: date of access, time of access (ranging from 0 to 86,400 s in a 24 h period), front page level URL requests, and frequency of URL requests over the time period under consideration. The request frequencies are determined by assigning an ID to every unique URL front page and counting the number of occurrences of the ID over a time period. The comprehensive dataset of logs with all 62 days of requests is 2.1 GB in size. After converting the dataset to a Microsoft Access database, and querying for the required fields, the final dataset is 800 MB and has 2,567,818 records. The frequency distribution for URL requests for all 62 days is presented in Table 1. The distribution indicates that there are a large number of URLs that are requested a few times and a small set of very popular sites. Of the 145,145 total unique URLs in the dataset, 139,166 URLs are requested less than 30 times each. There are only 32 URLs that are requested more than 10,000 times. This conforms to intuition that users tend to revisit particular websites that are popular across the broad population. This pattern can be exploited by the quasi-static portion of our caching mechanism. The 20 most requested sites for the comprehensive 62 day dataset are presented in Table 2. The most popular site is yahoo.com, and other well known sites such as microsoft.com and google.com are also part of the list. Note that since yahoo.com is a portal that offers a large number of services such as email, search, news, music, etc., it is not surprising that it is the most requested site in our dataset.
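A sketch of this counting step is shown below, assuming one log record per line with the requested URL in a known field; the field position and helper names are hypothetical, not the actual IRCache log format.

```python
from collections import Counter
from urllib.parse import urlparse

# "Page view" suffixes used to identify front-page-level objects, per Christ et al. [3].
PAGE_SUFFIXES = (".htm", ".html", ".php", ".jsp")

def front_page_request_counts(log_lines, url_field=2):
    """Assign an id to every unique front-page URL and count its occurrences."""
    url_ids, counts = {}, Counter()
    for line in log_lines:
        url = line.split()[url_field]        # hypothetical position of the requested URL
        path = urlparse(url).path
        if path in ("", "/") or path.endswith(PAGE_SUFFIXES):
            uid = url_ids.setdefault(url, len(url_ids))
            counts[uid] += 1
    return url_ids, counts
```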

As mentioned earlier, we are using a 30 day moving window for recording $R_{ij}^{past}$. Therefore it is useful to characterize the dataset patterns and evaluate URL frequency for batches of 30 days of requests instead of the comprehensive 62 days. Using the first 30 days of the dataset we confirm that the pattern of a few very popular sites is repeated. Of the 145,145 total unique URLs, 145,022 URLs are requested less than 500 times each. There are only 12 URLs that are requested more than 8000 times. While considering the most requested sites for the first 30 days, the usual suspects of popular sites such as yahoo.com, microsoft.com, and google.com are present as in the comprehensive 62 days case. For further illustration refer to Tables 3 and 4, which present the URL request frequencies and the 20 most requested sites, respectively, for the first 30 days of data. These results indicate that a 30 day moving window is a good indicator of the overall URL frequency patterns that exist in proxy-level requests.

Table 3
URL request frequencies for days 1 through 30

Frequency of requests    Number of URLs (cumulative)
>15,000                  4 (4)
14,999–12,000            5 (9)
11,999–8000              3 (12)
7999–5000                5 (17)
4999–3000                12 (29)
2999–1500                32 (61)
1499–1200                11 (72)
1199–1000                11 (83)
999–500                  40 (123)
<500                     145,022 (145,145)

Table 4
Top 20 requested sites for days 1 through 30

Site                             Number of requests
yahoo.com                        43,624
friendster.com                   30,264
microsoft.com                    15,126
water.com                        15,111
icq.com                          14,766
animespy.com                     14,542
atwola.com                       13,314
msn.com                          13,123
google.com                       12,726
phpwebhosting.com                11,573
ebay.com                         8009
adbureau.net                     6705
17tahun.com                      6609
gator.com                        6440
216.66.24.58 (NY IRCache node)   5649
doubleclick.net                  4900
everyone.net                     4626
ircache.net                      4241
go.com                           3776
plasa.com                        3684

Table 5
Top 20 requested sites for day 31

Site                             Number of requests
friendster.com                   2513
yahoo.com                        2288
icq.com                          885
microsoft.com                    712
msn.com                          563
water.com                        550
animespy.com                     507
17tahun.com                      501
adbureau.net                     473
google.com                       372
phpwebhosting.com                340
doubleclick.net                  334
ebay.com                         323
detik.com                        232
gator.com                        207
216.66.24.58 (NY IRCache node)   182
go.com                           153
geocities.com                    142
everyone.net                     142
atwola.com                       123



The values for $R_{ij}^{past}$ and the parameters n, k, c, and t are the input to our mathematical program model. Proxy trace data is used for recording $R_{ij}^{past}$. The solution of the model provides us with the optimal caching decisions based on the historical pattern over the moving window. Using these caching decisions we evaluate the performance of our model on the object request patterns, $R_{ij}^{current}$, for the day following the 30 day moving window. In this manner we can evaluate the performance for any given proxy log period. We then repeat this performance analysis with the dynamic dimension included in the caching mechanism. The mechanism that includes both quasi-static and dynamic portions is referred to as integrated. Since we consider $R_{ij}^{past}$ to be an indicator of $R_{ij}^{current}$ request patterns, it would be interesting to note if any similarities can be identified by visually inspecting the request data. Table 5 presents the 20 most requested sites for day 31 alone (i.e., $R_{ij}^{current}$), the day following the moving window of days 1 through 30 (i.e., $R_{ij}^{past}$). As can be observed, popular sites such as yahoo.com and microsoft.com are also present here (refer to Table 5), though the ranking order may be different from that of days 1 through 30 (refer to Table 4). Of course this is just one instance of aggregate URL popularities in $R_{ij}^{past}$ corresponding to $R_{ij}^{current}$. The actual benefit of the above approach is to be evaluated by the overall model performance.

The values of the parameters n, k, c, and t that are used in the mathematical program model are likely to have an impact on the overall performance of our approach. Therefore we utilize different values to analyze the sensitivity of mechanism performance to these parameters. In addition, for the integrated mechanism we vary the percentage allocation of cache space between the quasi-static and dynamic portions and test the effect on overall performance. The mechanism performance is compared to the LRU policy for each case. In the following sub-sections we first present the performance of the quasi-static portion of the mechanism, followed by the performance of the integrated mechanism.

4.1. Quasi-static caching mechanism

We first compare the performance of our quasi-static caching mechanism versus the popularly used LRU caching mechanism. In our mechanism the caching decisions for any given 24 h time period are determined by recording $R_{ij}^{past}$ for the 30 days of requests prior to the given day. Problem (DW) is then solved for a given set of parameter values. These tasks, as well as all other performance testing, are accomplished using the C programming language and the CPLEX mathematical program solver version 8.1. Problem (DW) can be solved quite quickly for even relatively large problem sizes. This is demonstrated as follows. We begin by dividing a day into 24 1 h time intervals for caching decisions. This means that the quasi-static portion of the cache is refreshed every hour. Setting n=24 is a starting point given that repeating 24 h object re-access patterns have been previously identified [1,15]. Let m=6000 objects and cache capacity k=1000 objects. Here the proxy cache capacity is relatively large in proportion to the total number of objects being requested. This can occur if network administrators invest resources in acquiring a large disk storage space for the cache. The benefit of caching is due to reducing requests to origin web servers, with unit cost t, as such requests tie up network bandwidth. In contrast the cost c of internally updating caches is relatively inexpensive as it is not affected by network congestion. Accordingly we first set the t/c ratio to be two, with t=2 and c=1. This ratio can be changed depending on a judgment of relative costs by network administrators. A more detailed discussion of the current t/c ratio value, as well as an alternative scenario, is provided later. Since $R_{ij}^{past}$ is a coefficient that does not affect the solution time of (DW) we generate it randomly. Using the above values of m=6000, n=24, k=1000, t=2, and c=1, the optimal solution for (DW) requires a presolve time of 9.42 s and a root relaxation solution time of 125.46 s. This shows that (DW) can be quickly solved for reasonably large problem sizes. Since in our problem instances n is typically small, in the order of 24 or 48, the number of variables is reduced. Given our problem structure, the optimal quasi-static caching decisions can be determined using available program solvers.

Fig. 1. Implementation of LRU mechanism.


Significant deviations from past requests can be handled by the dynamic portion of the mechanism. As we later demonstrate, this often favors our mechanism, in addition to less frequent cache updating costs, when comparing performance to heuristic-based procedures such as the LRU caching strategy. However, in different situations where the problem size is greatly expanded, it is difficult to find optimal solutions in a reasonable time frame. For example, if m=100,000, n=250, and k=10,000, (DW) cannot be solved to optimality within an upper bound of 30 min. In cases where the problem size is very large, alternative solution approaches, which aim to find good results quickly, can be developed. Examples include dynamic programming and heuristic procedures [8].

Table 6 describes the characteristics of the comprehensive proxy cache dataset for all 62 days, when the m most requested URLs are included for analysis. As mentioned earlier, when we include all URLs that have 1 or more requests then there are 145,145 total sites. There are a large number of URLs that are requested only a few times. By increasing the inclusion limit to sites that are requested more than 5 times we consider the top 26,938 requested URLs. For testing purposes we use the top 53 requested URLs, which include all sites that have 4500 or more requests. Recording $R_{ij}^{past}$ for any dataset is a one-time event that is not impacted by parameter changes. An index is created for website requests and counting $R_{ij}^{past}$ is proportional to the number of objects m. The trace of website requests is already maintained in the proxy server. Using the trace, $R_{ij}^{past}$ may be recorded prior to the time at which it is used. Since we are exploiting 24 h re-access patterns, at the very least we have one full day before the values are required. Therefore $R_{ij}^{past}$ does not have to be freshly determined for a given time interval. This process is further aided by using a 30 day moving window for $R_{ij}^{past}$. For any new day the previous 29 days of request counts already exist and only the additional day's count needs to be quickly added. For these reasons $R_{ij}^{past}$ is an input coefficient for (DW) that does not affect solution time. The trace data is also used for implementing the LRU caching mechanism. The steps for LRU implementation are detailed in Fig. 1.
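A minimal sketch of this rolling update is given below, assuming the per-day, per-interval request counts have already been tallied from the trace; the dictionary layout keyed by (object id, interval) is an illustrative assumption.

```python
def roll_window(R_past, counts_new_day, counts_oldest_day):
    """Advance the 30 day moving window by one day: add the newest day's counts
    and subtract the counts of the day that drops out of the window."""
    for (i, j), r in counts_new_day.items():
        R_past[(i, j)] = R_past.get((i, j), 0) + r
    for (i, j), r in counts_oldest_day.items():
        R_past[(i, j)] = R_past.get((i, j), 0) - r
    return R_past
```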

We now compare the performance of the quasi-static model and the LRU caching mechanism for different sets of parameter values. Table 7 measures the performance of the two mechanisms while varying n for quasi-static decision making and keeping other parameters constant. For both mechanisms the total cost is t times the number of requests for URL objects not in the cache, plus c times the number of objects brought into the cache.

Table 6
Proxy requests comprehensive dataset characteristics

Top m requested objects    Number of requests    Total number of records
145,145                    ≥1                    2,567,818
26,938                     >5                    2,360,214
5978                       >30                   2,109,905
1006                       >200                  1,772,191
100                        >2,300                1,240,175
53                         >4,500                1,087,916
3                          >55,000               286,540

Varying n determines the time intervals at which cached objects are updated by the quasi-static mechanism. The appropriate choice of n depends on specific proxy request patterns. Based on the performance results cache administrators decide if objects in the quasi-static model are to be refreshed very frequently or infrequently. For our tests we vary n from 48 to 3. Using this range we can typically identify interior solutions for the quasi-static model where the optimal n is in between the upper and lower values. Note that the choice of n does not affect LRU mechanism costs. This is because in LRU the cache updating decision is evaluated on the arrival of every new request. The optimal solution for the quasi-static model based on $R_{ij}^{past}$ from days 1 to 30 is evaluated on $R_{ij}^{current}$ for day 31. The LRU mechanism costs on day 31 are determined starting from the cache contents obtained after simulating the model on requests from days 1 to 30. As before we begin by setting t=2 and c=1. For this t/c ratio it is twice as beneficial to serve a request internally from the cache as compared to the origin server. The t/c ratio of 2 captures the normal situation where origin servers are not choked with extraordinarily high traffic. Alternatively, the network connection is fast, such as a T1 line. An example of this scenario is illustrated by the IRCache network (www.ircache.net).
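The day-31 evaluation just described can be sketched as follows, under a simplified reading of the procedure: the cost of a quasi-static schedule is t per request that misses the cache plus c per object brought into the cache (including the initial fill); interval boundaries and other details of the actual tests may differ.

```python
def quasi_static_cost(schedule, requests, t, c):
    """schedule[j]: set of object ids cached in interval j (from solving (DW));
    requests[j]: dict mapping object id -> number of day-31 requests in interval j."""
    n = len(schedule)
    # Delay cost: t per request for an object not cached in that interval.
    miss_cost = sum(r for j in range(n)
                    for obj, r in requests[j].items() if obj not in schedule[j])
    # Updating cost: c per object first cached, plus c per object newly entering a later interval.
    update_cost = len(schedule[0]) + sum(len(schedule[j] - schedule[j - 1])
                                         for j in range(1, n))
    return t * miss_cost + c * update_cost
```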

Table 7
Quasi-static performance for $R_{ij}^{past}$ = days 1 to 30, $R_{ij}^{current}$ = day 31, t=2, c=1, and varying n; corresponding LRU mechanism cost is 12,009

Parameter values (varying n): m=53, k=10, t=2, c=1

        Quasi-static mechanism cost    % improvement over LRU
n=48    9486                           21.01
n=36    9463                           21.20
n=24    9428                           21.49
n=18    9093                           24.28
n=12    9025                           24.85
n=6     9121                           24.05
n=3     9858                           17.91

Table 8
Quasi-static performance for $R_{ij}^{past}$ = days 1 to 30, $R_{ij}^{current}$ = day 31, t=200, c=1, and varying n; corresponding LRU mechanism cost is 804,603

Parameter values (varying n): m=53, k=10, t=200, c=1

        Quasi-static mechanism cost    % improvement over LRU
n=48    1,003,680                      −24.74
n=36    974,463                        −21.11
n=24    937,445                        −16.51
n=18    921,639                        −14.55
n=12    899,827                        −11.83
n=6     909,823                        −13.08
n=3     984,414                        −22.35

Table 9
Integrated performance for $R_{ij}^{past}$ = days 1 to 30, $R_{ij}^{current}$ = day 31, n=48, and varying quasi-static portion; corresponding LRU mechanism cost is 12,009

Parameter values: m=53, n=48, k=10, t=2, c=1

Quasi-static portion %    Integrated mechanism cost    % improvement over LRU
10                        6981                         41.87
20                        7455                         37.92
50                        4974                         58.58
70                        4935                         58.91
90                        5214                         56.58


If a requested object is cached at a location close to the user, then the waiting time, in fractions of seconds, is about 5 times less than in the alternative case. Of course, while this is true for a single request, the total network delays depend on the aggregation of individual delays. In effect, a relatively small t/c ratio means the penalty for having to access a distant origin web server is not disproportionately large. This ratio is later changed. Using the above parameters, we observe that varying n indeed affects quasi-static mechanism performance. The costs decrease while n is decreased from 48 to 12, and then costs increase when n is further reduced to 3. The best quasi-static performance occurs when n=12, and this minimum cost value is 9025. For the same t=2 and c=1 values the LRU mechanism cost, using the procedure outlined in Fig. 1, is 12,009. At n=12, the quasi-static mechanism improves over LRU performance by almost 25%. The quasi-static mechanism always outperforms the LRU policy, as shown by the positive percentage improvement figures in Table 7. This is because the latter mechanism incurs frequent costs of bringing new objects into the cache. In order to compare the above results to other days in the dataset we run the models again for $R_{ij}^{past}$ = days 2 to 31 and $R_{ij}^{current}$ = day 32. We confirm that similar patterns exist for these durations, where the minimum cost of 9841 is achieved by the quasi-static mechanism for n=48. The corresponding LRU mechanism cost is again higher, at 14,265. However, how would the two mechanisms perform relative to one another when the cost of not serving requests from the cache is very much higher than the cost of frequently updating cache contents? This is tested by setting t=200 and c=1, and running the quasi-static model for different values of n. The t/c ratio of 200 can be representative of the case where the origin servers may be experiencing very high traffic. Alternatively, the network connection may be of very slow speed or severely congested. An example of this scenario is when, due to a major disaster, news websites experience extraordinarily high traffic from users seeking updates. Since the origin servers would be very slow, the penalty for accessing them is very high. The results, reported in Table 8, show that using the above t/c value the LRU mechanism, with a cost of 804,603, always outperforms the quasi-static model. This is indicated by the negative figures of quasi-static performance relative to LRU in percentage terms. The quasi-static mechanism has a minimum cost of 899,827 at n=12, where it underperforms LRU by almost 12%. This occurs because now the cost of frequently updating cache contents in the LRU mechanism is quite small compared to the relatively large cost of even a few request misses in the quasi-static mechanism. Therefore we conclude that, as long as the relative difference between the per unit cost of updating caches and that of URL request misses is not greatly exaggerated, the quasi-static mechanism outperforms the LRU policy for our given dataset.

4.2. Integrated caching mechanism

We now include a dynamic portion in our mechanism, in addition to the quasi-static model, to create an integrated caching mechanism that can also handle requests that significantly deviate from historically observed patterns. The dynamic portion of our mechanism is implemented using a variation of the traditional LRU policy as follows. For the cache entry policy of the dynamic portion we ensure that only newly requested URLs that are not already present in the quasi-static part of the cache can be brought in. The cache replacement policy is the same as the usual LRU mechanism, where the least recently requested site is evicted to make way for new requests. The proportion of cache space allocated to the quasi-static and dynamic portions can have a significant impact on the overall mechanism performance. Of course, the proportion allocation that produces the best performance is specific to the patterns of the requests under consideration. We parametrically test the effect of the allocation proportion on performance for the proxy dataset and evaluate the costs of our integrated mechanism versus the traditional LRU policy used for comparison earlier (refer to Fig. 1). We retain the earlier values of t=2 and c=1, and begin by setting n=48. Table 9 presents the results for days 1 to 30, day 31, and a varying quasi-static proportion in the integrated mechanism (with the remaining part of cache capacity allocated to the dynamic policy). We observe that integrated mechanism performance improves as the quasi-static portion is increased from 10% to 70% of capacity, after which performance worsens. The best integrated mechanism performance of 4935 is achieved at a 70% quasi-static portion and 30% dynamic portion. It greatly improves on the LRU mechanism cost of 12,009, by almost 59%. As noted before, the changes in n do not affect LRU performance. Note that the integrated mechanism also outperforms the purely quasi-static mechanism cost of 9486 (refer to Table 7) for the above parameter values.
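A minimal sketch of this integrated request handling is given below, reusing an LRU structure for the dynamic portion and assuming the quasi-static contents for the current interval are supplied as a set from the (DW) solution; the class and method names are illustrative.

```python
from collections import OrderedDict

class IntegratedCache:
    """Quasi-static portion: objects chosen by (DW) for the current interval.
    Dynamic portion: an LRU cache restricted to objects outside the quasi-static part."""

    def __init__(self, dynamic_capacity):
        self.quasi_static = set()
        self.dynamic = OrderedDict()
        self.dynamic_capacity = dynamic_capacity

    def set_interval(self, cached_objects):
        """Refresh the quasi-static contents at the start of each time interval."""
        self.quasi_static = set(cached_objects)

    def request(self, url):
        """Serve one request; returns True on a hit in either portion."""
        if url in self.quasi_static:
            return True
        if url in self.dynamic:
            self.dynamic.move_to_end(url)
            return True
        # Entry policy: only URLs not already in the quasi-static part are admitted.
        self.dynamic[url] = True
        if len(self.dynamic) > self.dynamic_capacity:
            self.dynamic.popitem(last=False)  # evict the least recently requested URL
        return False
```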

Next we compare the performance of the mechanisms, with results presented in Table 10, by setting n=24 and retaining the earlier parameter values. In this case the integrated mechanism costs decrease as we increase the quasi-static portion allocation from 10% to 30%, after which the costs increase for higher quasi-static allocations. The integrated mechanism, which has the least cost of 5349 at 30% quasi-static allocation, improves on the LRU caching policy cost of 12,009 by about 55%.

Table 10
Integrated performance for $R_{ij}^{past}$ = days 1 to 30, $R_{ij}^{current}$ = day 31, n=24, and varying quasi-static portion; corresponding LRU mechanism cost is 12,009

Parameter values: m=53, n=24, k=10, t=2, c=1

Quasi-static portion %    Integrated mechanism cost    % improvement over LRU
10                        8784                         26.85
20                        8058                         32.90
30                        5349                         55.46
50                        5385                         55.16
70                        6132                         48.94


It also performs better than the purely quasi-static mechanism cost of 9428 (refer to Table 7). In order to confirm this result we test the model performances for the above parameter values on $R_{ij}^{past}$ = days 2 to 31 and $R_{ij}^{current}$ = day 32. Once again we observe that the integrated mechanism, which has the least cost of 5979 at 30% quasi-static allocation, outperforms both the LRU policy cost of 14,265 and the pure quasi-static mechanism cost of 9923. Therefore, it can be concluded that for the above parameter values the decreasing order of mechanism performance is the integrated, followed by the pure quasi-static, and finally the LRU caching policy.

The above results demonstrate the advantage of our caching mechanisms over the LRU policy. An effective caching mechanism has many benefits for all Internet users, including reduced network traffic, load on web servers, and web user delays [1,5]. The difference can be immediately apparent to an end user. A website that is cached may seem to load instantaneously, compared to a delay of several seconds in the alternative case. Users appreciate fast loading websites and tend to revisit them. In addition, Internet companies can save on investing resources in server farms around the world for replicating web content to improve load speeds [7]. We have shown that by exploiting historical re-access patterns, caching at the proxy level can be improved. For example, in our tests our integrated mechanism performed better than the LRU policy by more than 50% in terms of total costs (refer to Tables 9 and 10). We have used a portion of a proxy trace dataset for performance testing. Given our test results, we believe that our proxy caching mechanism can significantly reduce delays for web users if it were to be deployed in large scale networks.

5. Discussion and conclusions

In this study we propose a new proxy-level caching mechanism that takes into account aggregate patterns observed in user object requests. Previous studies have shown that users typically re-access documents on a daily basis. We exploit this repeated-access pattern of users in our mechanism by making caching decisions for a specific time interval based on the history of observed requests for the same interval. This forms the quasi-static portion of our mechanism. Following that we extend the mechanism to include a dynamic policy as well, i.e., the current user access patterns are also taken into consideration to determine the objects to be cached. Our integrated caching mechanism, which contains both quasi-static and dynamic policies, can handle, besides normal usage patterns, unanticipated events that can generate huge unexpected loads on websites. Hence our approach is more comprehensive than the existing mechanisms because it captures both the static and the dynamic dimensions of user web object request patterns. We compare the performance of both our quasi-static and integrated mechanisms against the popularly used LRU caching policy. The parametric test results, using a comprehensive IRCache network proxy trace dataset, indicate that our mechanisms outperform the LRU mechanism. Our caching approach should be useful for computer network administrators and online content providers to significantly reduce delays experienced by web users at proxy server levels.

There are a number of interesting avenues for future research, detailed as follows. In this study we have used the most popular requested sites for testing purposes, and the quasi-static mechanism has performed better than the LRU policy. It would be interesting to determine how the mechanisms perform if we were to consider URLs that are not in the top range in terms of request popularity, so that the cost of updating the cache has a lesser effect on the overall cost experienced by the two mechanisms. Currently any sub-front page level requests, or website sections with unique session IDs, are handled by the dynamic portion of our mechanism. An area for improvement would be to develop methods for identifying patterns for these dynamic contents as well. Thus far we have considered a continuous 30 day moving window as an indicator for historical patterns. We may be able to improve performance by considering historical patterns specific to the day of the week. Alternatively, we could collate past data depending on whether the day under consideration is a weekday or a weekend. In the future we also plan to extend our model to include different object sizes, as well as test it against other popular caching mechanisms such as k-LRU, LFU, and Top-10 variants [14,18]. Another interesting area would be to develop an analytical model for our integrated mechanism in order to characterize and compare it to existing analytical studies on the LRU policy [13]. Finally, an area of extension is to test our mechanism performance using alternative solution approaches, such as dynamic programming and heuristic based procedures, which aim to find good solutions quickly.

Our approach is the first attempt to adopt a quasi-static mechanism for caching decisions. We have also proposed a novel combination of quasi-static and dynamic schemes. By design, this approach is more comprehensive than the existing mechanisms because it captures both the static and the dynamic dimensions of web requests. As our testing results indicate, it is likely to perform better than the existing approaches and should prove to be an effective caching strategy.

Acknowledgements

We thank Prabuddha De, Amar Narisetty, Karthik Kannan, seminar participants of Purdue University, and the 2004 Americas Conference on Information Systems (AMCIS) Doctoral Consortium for valuable comments and contributions on this study.

References

[1] P. Cao, S. Irani, Cost-aware WWW proxy caching algorithms, Proceedings of the Usenix Symposium on Internet Technologies and Systems, 1997.

[2] R.I. Chiang, P.B. Goes, Z. Zhang, Periodic cache replacement policy for dynamic content at application server, Decision Support Systems 43 (2007) 336–348.

[3] M. Christ, R. Krishnan, D. Nagil, O. Gunther, R. Kraut, On saturation of web usage by lay Internet users, Carnegie Mellon University Working Paper, 2000.

[4] A. Cockburn, B. McKenzie, Pushing back: evaluating a new behaviour for the back and forward buttons in web browsers, International Journal of Human-Computer Studies (2002).

[5] A. Datta, K. Dutta, H. Thomas, D. VanderMeer, World wide wait: a study of Internet scalability and cache-based approaches to alleviate it, Management Science 49 (10) (2003) 1425–1444.

[6] B.D. Davison, A web caching primer, IEEE Internet Computing 5 (4) (2001) 38–45.

[7] B.D. Davison, Web Caching and Content Delivery Resources, 2007, http://www.web-caching.com.

[8] M.R. Garey, D.S. Johnson, Computers and Intractability, W.H. Freeman, New York, 1979.

[9] K. Hosanagar, Y. Tan, Optimal duplication in cooperative web caching, Proceedings of the 13th Workshop on Information Technology and Systems, 2004.

[10] K. Hosanagar, R. Krishnan, J. Chuang, V. Choudhary, Pricing and resource allocation in caching services with multiple levels of QoS, Management Science 51 (12) (2005) 1844–1859.

[11] C. Kumar, J.B. Norris, A proxy-level web caching mechanism using historical user request patterns, Working Paper, 2007.

[12] P. Lorenzetti, L. Rizzo, L. Vicisano, Replacement Policies for a Proxy Cache, 1996, http://info.iet.unipi.it/~luigi/research.html.

[13] V.S. Mookherjee, Y. Tan, Analysis of a least recently used cache management policy for web browsers, Operations Research 50 (2) (2002) 345–357.

[14] S. Podlipnig, L. Boszormenyi, A survey of web cache replacement strategies, ACM Computing Surveys 35 (4) (2003) 374–398.

[15] L. Rizzo, L. Vicisano, Replacement policies for a proxy cache, IEEE/ACM Transactions on Networking 8 (2) (2000) 158–170.

[16] L. Tauscher, S. Greenberg, How people revisit web pages: empirical findings and implications for the design of history systems, International Journal of Human Computer Studies 47 (1997) 97–138, Special issue on World Wide Web Usability.

[17] E.F. Watson, Y. Shi, Y. Chen, A user-access model-driven approach to proxy cache performance analysis, Decision Support Systems 25 (1999) 309–338.

[18] D. Zeng, F. Wang, M. Liu, Efficient web content delivery using proxy caching techniques, IEEE Transactions on Systems, Man, and Cybernetics—Part C: Applications and Reviews 34 (3) (2004) 270–280.

Glossary

n: number of time intervals in a 24-hour period
m: number of objects
i: object index (i = 1, …, m)
j: time interval index (j = 1, …, n)
k: capacity of the cache
c: cost of caching an object
t: delay to download an object from the web server
$x_{ij}$: decision variable where $x_{ij}$ = 1 if object i is cached at time interval j, and $x_{ij}$ = 0 otherwise
$y_{ij}$: variable where $y_{ij}$ = 1 if a caching cost is incurred for $x_{ij}$ = 1, and $y_{ij}$ = 0 otherwise
$R_{ij}^{past}$: number of past requests for object i in time interval j for a 30 day moving window
$R_{ij}^{current}$: number of current requests for object i in time interval j for the day following the 30 day moving window

Chetan Kumar is an Assistant Professor in the Department of Information Systems and Operations Management at the College of Business Administration, California State University San Marcos. He received his PhD from the Krannert School of Management, Purdue University. His research interests include pricing and optimization mechanisms for managing computer networks, caching mechanisms, peer-to-peer networks, ecommerce mechanisms, web analytics, and IS strategy for firms. He has presented his research at conferences such as WEB, WISE, ICIS Doctoral Consortium, and AMCIS Doctoral Consortium. He has served as a reviewer for journals such as EJOR, JMIS, DSS, and JECR.

John B. Norris received his PhD from the Quantitative Methods area at the Krannert School of Management, Purdue University. His research interests include web analytics, healthcare management, and decision support tools for student team assignment. He has presented his research at AOM, DSI, INFORMS, and POMS conferences.