Modeling and Caching of Peer-to-Peer Traffic
Osama Saleh and Mohamed Hefeeda
School of Computing Science
Simon Fraser University
Surrey, Canada
{osaleh, mhefeeda}@cs.sfu.ca
Technical Report: TR 2006-11
Abstract
Peer-to-peer (P2P) file sharing systems generate a major portion of the Internet traffic, and this portion
is expected to increase in the future. We explore the potential of deploying caches in different Autonomous
Systems (ASes) to reduce the cost imposed by P2P traffic on Internet service providers and to alleviate
the load on the Internet backbone. We conduct a measurement study to analyze P2P characteristics
that are relevant to caching, such as object popularity and size distributions. Our study shows that the
popularity of P2P objects in different ASes can be modeled by a Mandelbrot-Zipf distribution, and that
several workloads exist in P2P systems, each of which can be modeled by a Beta distribution. Guided by our
findings, we develop a novel caching algorithm for P2P traffic that is based on object segmentation and
partial admission and eviction of objects. Our trace-based simulations show that with a relatively small
cache size, a byte hit rate of up to 35% can be achieved by our algorithm, which is a few percentage points
smaller than that of the off-line optimal algorithm. Our results also show that our algorithm achieves a
byte hit rate that is at least 40% more than, and at most triple, the byte hit rate of common web caching
algorithms. Furthermore, our algorithm is robust in the face of aborted downloads—a common case in P2P
systems—and achieves a byte hit rate that is even higher than in the case of full downloads.
I. INTRODUCTION
Peer-to-peer (P2P) file-sharing systems have gained tremendous popularity in the past few years. More
users are continually joining such systems and more objects are being made available, enticing even more
users to join. Currently, the traffic generated by P2P systems accounts for a major fraction of the Internet
traffic today [1], and is bound to increase [2]. The increasingly large volume of P2P traffic highlights
the importance of caching such traffic to reduce the cost incurred by Internet Services Providers (ISPs)
and alleviate the load on the Internet backbone.
Caching in the Internet has mainly been considered for the web and video streaming traffic, with
little attention to the P2P traffic which has become the dominant bandwidth consumer. Many caching
algorithms for web traffic [3] and for video streaming systems [4] have been proposed and analyzed.
Directly applying such algorithms to cache P2P traffic may not yield the best cache performance, because
of the different traffic characteristics as well as the caching objectives. For instance, the main objectives
in web and video streaming systems are to reduce the user-perceived latency, the servers' load, and WAN
bandwidth consumption, whereas the objective in P2P caching is mainly to reduce WAN bandwidth
consumption, since there are no servers in P2P systems and latency is not a major concern for clients.
In addition, unlike web objects, objects in P2P systems are mostly immutable, have large sizes, and
their popularity does not follow a Zipf-like distribution [1]. Furthermore, although objects in P2P and
video streaming systems share some characteristics (e.g., immutability and large size), streaming
systems impose stringent timing requirements which limit the flexibility of the caching algorithms in
choosing which segments to store in the cache. Therefore, new caching algorithms that consider the new
characteristics and objectives need to be designed and evaluated.
In this paper, we first study the P2P traffic characteristics via a three-month measurement study of
a popular file-sharing system. We use the results obtained from our measurement to develop a deeper
understanding of the P2P traffic characteristics that are relevant to caching. Such characteristics include
object popularity, object size, and popularity dynamics. We then design and evaluate a novel P2P caching
algorithm for object admission, segmentation and replacement.
Specifically, our contributions can be summarized as follows. First, we develop new models for P2P
traffic based on our measurement study. We show that the popularity of P2P objects can be modeled
by a Mandelbrot-Zipf distribution, which is a generalized form of Zipf-like distributions with an extra
parameter. This extra parameter captures the flattened head of the popularity distribution near
the lowest ranked objects (when plotted on a log-log scale). The flattened head has also been
observed by a previous study [1], but no specific distribution was given. We also show that P2P traffic
is composed of different workloads, each of which could be modeled by a Beta distribution. Second, we
analyze the impact of the Mandelbrot-Zipf popularity model on caching and show that relying on object
popularity alone may not yield a high hit rate. Third, we design a new caching algorithm for P2P traffic
that is based on segmentation, partial admission and eviction of objects.
We perform trace-based simulations to evaluate the performance of our algorithm and compare it
against common web caching algorithms (LRU and LFU) and a recent caching algorithm proposed for
P2P systems. Our results show that with a relatively small cache size, less than 10% of the total traffic,
a byte hit rate of up to 35% can be achieved by our algorithm, which is a few percentage points smaller than
the byte hit rate achieved by an off-line optimal algorithm with complete knowledge of future requests.
Our results also show that our algorithm achieves a byte hit rate that is at least 40% more than,
and at most triple, the byte hit rate of the common web caching algorithms.
The rest of this paper is organized as follows. In Section II, we summarize the related work. Section
III describes our measurement study, provides models for popularity and object size and analyzes the
effect of Mandelbrot-Zipf distribution on the cache hit rate. Our P2P caching algorithm is described in
Section IV. We evaluate the performance of our algorithm using trace-based simulation in Section V.
Section VI concludes the paper and outlines future directions for this work.
II. RELATED WORK
Several measurement studies have analyzed various aspects of P2P systems, and a few of them have
proposed caching algorithms for P2P traffic. Gummadi et al. [1] study the object characteristics of P2P
traffic in Kazaa and show that P2P objects are mainly immutable, multimedia, large objects that are
downloaded at most once. The study shows that the popularity of P2P objects exhibits a flattened head and
does not follow a Zipf distribution, which is typically used to model the popularity of web objects [5]. The
study provides a simulation model for generating P2P traffic that mimics the observed popularity curve,
but it does not provide any closed-form models for it. Sen and Wang [6] study the aggregate properties
of P2P traffic in a large-scale ISP, and confirm that P2P traffic does not obey Zipf's distribution. Their
observations also show that a few clients are responsible for most of the traffic.
Klemm et al. [7] study the query behavior in the Gnutella file-sharing network and propose using two Zipf-
like distributions to model query popularity. However, measuring the popularity of P2P objects by queries
may not be accurate. Queries are issued using keywords, not object IDs, and keywords are generated by
independent users. Thus, unique objects may not correspond to unique keywords. Further, queries do not
imply downloads, since a query could be abandoned or go unanswered. We use unique reply messages
in our measurement. These reply messages correspond to replicas of objects that do exist on hosts, and
were downloaded sometime in the past.
While the aforementioned studies provide useful insights into P2P systems, they were not explicitly
designed to study caching P2P traffic. Therefore, they did not focus on analyzing the impact of P2P
traffic characteristics on caching. The study in [1] highlighted the potential of caching and briefly studied
the impact of traffic characteristics on caching. But the study was performed in only one network domain
(a university campus network).
Leibowitz et al. [8] study the feasibility of caching P2P traffic. They conclude that P2P traffic is highly
repetitive and responds well to caching. However, the study does not provide any algorithm for caching.
Wierzbicki et al. [9] propose two cache replacement policies for P2P traffic: MINRS (Minimum Relative
Size), which evicts the object with the least cached fraction, and LSB (Least Sent Byte), which evicts
the object which has served the least number of bytes from the cache. Our simulation results show that
our algorithm outperforms LSB, which is better than MINRS according to their results.
Another problem related to caching is P2P traffic identification. Because P2P traffic is not standardized,
different protocols use different control messages, and some use encryption and dynamic ports. Sen et al.
[10] propose a technique for identifying P2P traffic using application-level signatures. Karagiannis et al.
[11] propose a non-payload P2P traffic identification at the transport layer based on network connection
patterns.
III. MODELING P2P TRAFFIC
We are interested in deploying caches in different autonomous systems (ASes) to reduce the WAN
traffic imposed by P2P systems. Thus, our measurement study focuses on measuring the characteristics
of P2P traffic that would be observed by these individual caches, and would impact their performance.
Such characteristics include object popularity, popularity dynamics, number of P2P clients per AS, and
object size distribution. We measure these characteristics in several ASes of various sizes.
A. Measurement Methodology
We conduct a passive measurement study of the Gnutella file-sharing network [12]. According to
the Gnutella protocol specifications, peers exchange several types of messages, including PING, PONG,
QUERY and QUERYHIT. A QUERY message contains a set of keywords a user is searching for and is
propagated to all neighbors in the overlay for a distance specified by the TTL field. A typical value for
the TTL is seven hops. The QUERY message contains the search keywords, a TTL field, and the address
of the immediate neighbor which forwarded the message to the current peer. If a peer has one or more
of the requested files, it replies with a QUERYHIT message. A QUERYHIT message is routed on the
reverse path of the QUERY message it is responding to, and it contains the name and the URN (uniform
resource name) of the file, the IP address of the responding node, and the file size. Upon receiving replies
Fig. 1. Object popularity in P2P traffic. [Figure: four panels (a)-(d) plotting object popularity (frequency) versus rank on a log-log scale; panel (d) shows AS11803 fitted with MZipf(0.47, 5).]
from several peers, the querying peer chooses a set of peers and establishes direct connections with them
to retrieve the requested file.
We modify an open-source Gnutella client called Limewire [13] by implementing a listener class
which logs all messages that pass through it. The Limewire implementation has two kinds of peers: ultra-
peers, characterized by high bandwidth and long connection periods, and leaf-peers, which are ordinary
peers that only connect to ultra-peers. We run our measurement node in ultra-peer mode and allow it to
maintain up to 500 simultaneous connections to other Gnutella peers. On average, our measurement node
was connected to 279 nodes, 63% of which were ultra-peers. Our measurement node passively recorded
Fig. 2. Object popularity in P2P traffic. [Figure: four panels (a)-(d) plotting object popularity (frequency) versus rank on a log-log scale; visible panels show AS1403 fitted with MZipf(0.7, 10), AS95 with MZipf(0.62, 50), and AS9406 with MZipf(1.0, 287).]
the contents of all QUERY and QUERYHIT messages that pass through it without injecting any traffic
into the network.
The measurement study was conducted between 16 January 2006 and 16 April 2006. Our measurement
peer is located at Simon Fraser University, Canada. But since the Gnutella protocol does not favor nodes
based on their geographic locations [7], we are able to observe peers from many ASes. Table I summarizes
some statistics about our measurement. All measurement data is stored in several trace files with a total
size of approximately 40 gigabytes, and will be made available to the research community at [14].
TABLE I
MEASUREMENT STUDY STATISTICS.
Measurement period 16 Jan. 2006 - 16 Apr. 2006
Number of QUERY messages 214,264,108
Number of QUERYHIT messages 107,063,419
Number of unique objects 11,449,168
Number of IP addresses 23,148,1245
Number of unique ASes 16,295
Number of ASes with more than 10,000 replies 392
B. Measuring and Modeling Object Popularity
The relative popularity of an object is defined as the probability of requesting that object relative to
other objects. Object popularity is critical for the performance of the cache. Intuitively, storing the more
popular objects in the cache is expected to yield higher hit rates.
Since we are interested in the performance of individual caches, we measure the popularity of objects
in each AS, not the global popularity of objects. To measure the popularity of an object in a specific
AS, we count the number of replicas of that object in the AS considered. The number of replicas indicates
the number of downloads that were completed in the past. This means that if a cache were deployed, it
would have seen a similar number of requests. This assumes that most of the downloads were supplied by
peers from outside the AS, which is actually the case because peers in most current P2P networks have
no sense of network proximity and thus do not favor local peers over non-local peers. In fact, previous
studies [1] have shown that up to 86% of the requested P2P objects were downloaded from peers outside
of the local network even though they were available locally.
Notice that measuring popularity using replica counts does not account for aborted downloads, i.e.,
requests that initiated a download but never finished it. This means that the popularity of individual
objects is somewhat underestimated. This may make up for the overestimation resulting from assuming
that all replicas are downloaded from peers outside of the AS. Note also that, from the cache perspective,
it is the relative popularity of objects, not the absolute number of requests for each object, that influences
the cache performance. Thus we opt to measure popularity as the number of replicas that exist for each
object instead of the number of requests it receives.
To count the number of replicas of a given object, we extract from our trace all QUERYHIT messages
TABLE II
OBJECT POPULARITY IN SEVERALAS DOMAINS.
AS No.   unique IPs   unique objects   MZipf(α, q)   % of cacheable traffic
9548     464,551      570,122          (0.65, 5)     42.30
2609     13,958       181,184          (0.55, 25)    54.75
9406     35,048       400,381          (1.0, 287)    84.71
2461     485,958      492,416          (0.67, 5)     33.12
223      393,471      477,801          (0.62, 5)     34.55
105      427,244      462,396          (0.62, 0)     38.45
14832    515          46,704           (0.60, 55)    95.34
18538    337          33,198           (0.62, 60)    93.74
95       6,121        87,730           (0.6, 50)     54.47
1859     437,901      767,890          (0.53, 10)    46.16
11715    543,708      677,259          (0.56, 5)     38.69
1403     661,242      618,708          (0.7, 10)     42.67
2120     533,210      584,072          (0.58, 4)     43.25
1161     1000         135,921          (0.60, 40)    16.67
379      368,604      418,263          (0.62, 8)     48.72
11803    340,211      435,128          (0.47, 5)     31.38
which contain the unique ID (URN) of that object. Since these QUERYHIT messages contain the IP
addresses of the responding nodes that have copies of the requested object, we can determine the number
of replicas by counting the number of unique IP addresses associated with a specific object ID. We
then map these unique IP addresses to their corresponding ASes by using the GeoIP database [15]. The
number of unique IP addresses associated with an object in an AS gives us the number of replicas of
the considered object in that AS.
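This counting step can be sketched as follows; the record format and the count_replicas helper are illustrative stand-ins for our trace-processing scripts, and the IP-to-AS dictionary stands in for the GeoIP lookup:

```python
from collections import defaultdict

def count_replicas(queryhits, as_of_ip):
    """Count replicas per (AS, object URN): each unique IP address holding
    a copy of an object counts as one replica in that IP's AS.

    queryhits -- iterable of (urn, ip) pairs extracted from QUERYHIT messages
    as_of_ip  -- mapping from IP address to AS number (e.g., via a GeoIP database)
    """
    seen = defaultdict(set)  # (asn, urn) -> set of unique IPs
    for urn, ip in queryhits:
        asn = as_of_ip.get(ip)
        if asn is not None:
            seen[(asn, urn)].add(ip)
    return {key: len(ips) for key, ips in seen.items()}

# Toy trace: duplicate replies from the same IP are counted only once.
hits = [("urn:sha1:AAA", "10.0.0.1"), ("urn:sha1:AAA", "10.0.0.1"),
        ("urn:sha1:AAA", "10.0.0.2"), ("urn:sha1:BBB", "10.0.0.1")]
asmap = {"10.0.0.1": 9548, "10.0.0.2": 9548}
replicas = count_replicas(hits, asmap)
print(replicas)  # two replicas of object AAA and one of BBB in AS 9548
```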
We compute the popularity of each object in each of the top 16 ASes (in terms of the number of
QUERYHIT messages sent). These top 16 ASes contribute around 38% of the total traffic seen by our
study. We rank objects based on their popularity, and we plot popularity versus rank. Figures 1 and 2
show a sample of our results for eight ASes. Results for other ASes are very similar. As shown in the
figures, there is a flattened head in the popularity curve of P2P objects. This flattened head indicates
that objects at the lowest ranks are not actually very popular. This flattened head phenomenon could
be attributed to two main characteristics of objects in P2P systems: immutability and large sizes. The
immutability of P2P objects eliminates the need for a user to download an object more than once. This
download-at-most-once behavior has also been observed in previous studies [1]. The large size of objects,
and therefore the long time required to download them, may make users download only objects that they
are really interested in. This is in contrast to web objects, which take much shorter times to download,
and therefore, users may download web objects even if they are of marginal interest to them. These two
characteristics reduce the total number of requests for popular objects.
Figures 1 and 2 also show that, unlike the case for web objects [5], using a Zipf-like distribution
to model the popularity of P2P objects would result in a significant error. In log-log scale, the Zipf-
like distribution appears as a straight line, which can reasonably fit most of the popularity distribution
except the left-most part, i.e., the flattened head. In particular, the Zipf-like distribution would greatly
overestimate the popularity of objects at the lowest ranks, which are the most important to caching
mechanisms because those are the good candidates to be stored in the cache.
We propose a new model that captures the flattened head of the popularity distribution of objects in
P2P systems. Our model uses the Mandelbrot-Zipf distribution [16], which is a general form of Zipf-like
distribution. The Mandelbrot-Zipf distribution defines the probability of accessing an object at rank i out
of N available objects as

p(i) = H / (i + q)^α,    (1)

where H = 1 / ∑_{i=1}^{N} 1/(i + q)^α is a normalization constant, α is the skewness factor, and q ≥ 0
is a parameter which we call the plateau factor because it is the reason behind the plateau shape near
the left-most part of the distribution (see Figure 3). Notice that the higher the value of q, the more
flattened the head of the distribution is, and when q = 0, the Mandelbrot-Zipf distribution degenerates
to a Zipf-like distribution with skewness factor α.
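As a concrete sketch, the probabilities defined by Eq. (1) can be computed directly; the mzipf_pmf helper below and its parameter values are illustrative, not part of our measurement code:

```python
def mzipf_pmf(N, alpha, q):
    """Mandelbrot-Zipf probabilities p(i) = H / (i + q)^alpha for ranks
    i = 1..N, with H chosen so that the probabilities sum to one (Eq. 1)."""
    weights = [1.0 / (i + q) ** alpha for i in range(1, N + 1)]
    H = 1.0 / sum(weights)  # normalization constant
    return [H * w for w in weights]

# q = 0 degenerates to a Zipf-like distribution; a large q flattens the head,
# i.e., the top-ranked objects become relatively less dominant.
zipf = mzipf_pmf(10000, 1.0, 0)
mzipf = mzipf_pmf(10000, 1.0, 50)
print(zipf[0] / zipf[99])    # rank 1 is 100x as likely as rank 100 under Zipf(1)
print(mzipf[0] / mzipf[99])  # under MZipf(1, 50) the ratio is only about 3
```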
To validate this popularity model, we fit the popularity distributions of the top 16 ASes to the Mandelbrot-
Zipf distribution using the Matlab distribution fitting tool. Our results, some of which are shown in
Figures 1 and 2, indicate that the Mandelbrot-Zipf distribution models reasonably well the popularity of
P2P objects. Table II summarizes the α and q values for the top sixteen ASes. We observe that a typical
value for α is between 0.4 and 0.70, and a typical value of q is between 25 and 60. We also observe
that ASes with a large ratio of the number of objects to the number of hosts have a large q value, which
corresponds to a more flattened popularity curve. This could be explained by the request-at-most-once
behavior of peers, where the number of replicas of a certain object in an AS is bounded by the number
of hosts that exist in that AS. We also notice that such ASes are responsible for most of the traffic. For
instance, AS number 14832, Figure 1(a), contributed the highest reply rate among all ASes we looked at
even though it has
Fig. 3. Mandelbrot-Zipf, denoted as MZipf(α, q), versus Zipf, denoted as Zipf(α), distributions. [Figure: frequency versus rank on a log-log scale for Zipf(1), MZipf(1, 20), MZipf(1, 50), and MZipf(1, 100).]
only 515 hosts. Such heavy-hitter ASes have also been observed in previous studies [6].
We are aware that the number of unique IP addresses in an AS might not reflect the number of unique
hosts due to dynamic IP addressing. To approximate the number of unique hosts, we look at the number
of dynamic IP addresses found in a random subset of our trace containing more than 100,000 unique
replies. We identify an IP address as dynamic if two IP addresses reply with the same object URN
and local object index. A local index is included in the QUERYHIT replies and it identifies the location
of the object on the disk of the responding peer. We find that 8.3% of the IP addresses in this sample
trace were dynamic IP addresses. Although such an approximation might differ depending on the AS, it
nonetheless points out the existence of dynamic IP addresses that could lead to over-estimating the
number of unique peers in an AS.
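The heuristic can be sketched as follows; the tuple format and the dynamic_ip_fraction helper are illustrative, and this simple rule over-counts slightly if one IP matches several (URN, index) pairs:

```python
from collections import defaultdict

def dynamic_ip_fraction(replies):
    """Estimate the fraction of dynamic IPs in a reply trace.

    replies -- iterable of (ip, urn, local_index) tuples. If several distinct
    IPs report the same (URN, local index) pair, the extras are treated as
    dynamic re-assignments of the same underlying host.
    """
    ips_per_key = defaultdict(set)
    all_ips = set()
    for ip, urn, idx in replies:
        ips_per_key[(urn, idx)].add(ip)
        all_ips.add(ip)
    # Every IP beyond the first for a (URN, index) pair is flagged as dynamic.
    dynamic = sum(len(ips) - 1 for ips in ips_per_key.values() if len(ips) > 1)
    return dynamic / len(all_ips) if all_ips else 0.0

# Two IPs reporting the same object at the same disk location: one is flagged.
replies = [("1.1.1.1", "urnA", 7), ("2.2.2.2", "urnA", 7), ("3.3.3.3", "urnB", 1)]
print(dynamic_ip_fraction(replies))  # 1 of 3 IPs flagged as dynamic
```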
We notice that a large number of hosts in an AS results in a small value of the plateau parameter q,
and a large amount of uncacheable traffic (i.e., objects with only one replica). The larger the number
of hosts, the higher the chance of popular objects being downloaded more often, which results in a less
flattened head observed in the log-log plot of the popularity distribution. This is equivalent to a small
value of q. A small q value has a positive effect on caching, since storing the popular objects yields a
higher hit rate. This can be seen in ASes 9548, 223 and 2461, where the number of hosts is in the order
of hundreds of thousands, and the value of q is five or less; see Table II.
On the downside, a large number of hosts means a higher ratio of uncacheable traffic. This could be
Fig. 4. Hit rate loss under Mandelbrot-Zipf popularity distribution. [Figure: hit rate (%) versus relative cache size (%) on a log scale, for Zipf(1), MZipf(1, 25), and MZipf(1, 60).]
explained by the fact that a larger number of hosts means a more diverse user community, which means
a higher one-timer population. A high ratio of uncacheable traffic affects caching negatively, because
the maximum byte hit rate that could be achieved is bounded above by the amount of cacheable traffic.
However, the amount of cacheable traffic is not an indication of a higher byte hit rate under finite-size
caches. This is because the cache performance depends only on the most popular objects, assuming
that the cache stores them. Cacheable traffic, however, refers to anything that was downloaded more
than once; an extreme example is when all objects are downloaded just twice. Thus, the skewness of the
popularity distribution is the deciding factor in cache performance. ASes with small values of q and large
values of α would benefit the most from deploying caches. This means that the theoretical maximum
limits on the cacheability of P2P traffic from previous studies [1], [8] do not necessarily imply a good
cache performance.
C. The Effect of Mandelbrot-Zipf Popularity on Caching
In the previous section we showed that P2P popularity could be modeled using the Mandelbrot-Zipf
distribution. In this section, we analyze the impact of this popularity model on the cache hit rate.
Let us assume that only the C most popular objects are stored in the cache, that is, an LFU policy is
used. We also assume that all objects have the same size. Then, the hit rate HR is given by:
Fig. 5. Hit rate loss under Mandelbrot-Zipf popularity distribution with LRU. [Figure: hit rate versus relative cache size for Zipf(1.0), MZipf(0.5, 50), and MZipf(0.5, 40).]
HR = ∑_{i=1}^{C} p(i) = ∑_{i=1}^{C} H / (i + q)^α ≈ ∫_{1}^{C} H / (i + q)^α di = H ln[(q + C)/(q + 1)],    (2)

where the closed form of the integral is shown for α = 1.
For a Zipf-like popularity distribution (q = 0), HR = H ln C. Figure 4 plots the hit rate under both
popularity distributions as a function of the cache size. A similar hit rate reduction has been observed for
LRU, as shown in Figure 5. Both figures show that the hit rate under Mandelbrot-Zipf suffers severely,
especially under large values of q and small cache sizes. This is because large values of q mean that
popular objects receive fewer requests, and thus storing them does not yield a high hit rate. Under small
cache sizes, the hit rate suffers more because the fraction of objects that can be stored in the cache
receives only a small portion of the total traffic. This indicates that caching schemes that capitalize on
object popularity alone may not yield high hit rates and may not be very effective in reducing the ever-
growing P2P traffic.
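The hit-rate degradation predicted by Eq. (2) can also be checked numerically by evaluating the sum exactly; the lfu_hit_rate helper and the parameter values below are illustrative, not measured:

```python
def lfu_hit_rate(N, alpha, q, cache_fraction):
    """Hit rate when the C most popular of N equal-sized objects are cached
    (LFU), under MZipf(alpha, q) popularity: the exact sum in Eq. (2)."""
    C = max(1, int(cache_fraction * N))
    weights = [1.0 / (i + q) ** alpha for i in range(1, N + 1)]
    H = 1.0 / sum(weights)  # normalization constant of Eq. (1)
    return H * sum(weights[:C])

# A cache holding the top 1% of 100,000 objects: the hit rate drops steadily
# as the plateau factor q grows, even though the cached fraction is the same.
N = 100000
for q in (0, 25, 60):  # q = 0 is plain Zipf
    print(q, round(lfu_hit_rate(N, 1.0, q, 0.01), 3))
```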
D. Popularity Dynamics
Since P2P objects are mostly immutable, users need not download them more than once. Once
downloaded, popular objects will receive fewer and fewer requests while other new popular objects are
Fig. 6. Popularity degradation in P2P systems. [Figure: number of requests per week over 12 weeks for (a) the top five, (b) the top ten, and (c) the (11-20)th most popular objects.]
introduced in the system. To quantify this change in popularity, we study popularity dynamics
at two levels: across all ASes (globally) and per AS.
For global popularity dynamics we pick the top 20 most requested objects during the first month of our
measurement and see how their popularity changes over the entire measurement period. We look at the
average number of requests received by objects ranked from one to five, from one to ten and from ten to
twenty. We observe that the popularity grows as the file is introduced to reach its peak, and then degrades
steadily in a matter of weeks after its peak. We cannot make any claims as to when those popular files
were introduced, but the pattern of increase then decrease indicates that those top 20 objects are new
popular objects in the system. Figure 6 shows how the popularity degrades for the top five, top ten and
next top ten most popular objects. As evident from the figure, popularity in P2P systems is dynamic, and
changes over time. Similar observations about object lifetime in other P2P systems have been made in
[1], [17]. Figure 9 summarizes the percentages of requests received by the top most popular objects.
To study the popularity dynamics per AS, we pick the top ten most requested objects seen in the
first month of our measurement and see how their popularity changes over time in each AS. We do the
experiment for the top eight ASes: the top four ASes, in terms of the number of replies they sent, and
for another four random ASes. As Figures 7 and 8 show, the number of requests received by the top ten
most popular objects decreases over time for the top four ASes and the random ASes, respectively. Thus
popularity of P2P objects is dynamic and changes over time.
Such popularity dynamics may have negative effects on caching, since cached objects tend to grow
stale after they have built a good popularity profile. Thus, caches that capitalize on access frequency
might cache old popular objects unnecessarily. However, we believe that the introduction of new popular
Fig. 7. Popularity dynamics of the top five most popular objects in the top four ASes. [Figure: number of requests per week over 12 weeks for (a) AS14832, (b) AS18538, (c) AS2161, and (d) AS95.]
objects negates the effect of caching old popular objects. That is, the introduction of new popular objects
competing for cache space will result in the eviction of older popular objects eventually.
E. Measuring and Modeling Object Size
Many traditional cache replacement algorithms use object size in their decision to evict objects [3].
Other algorithms, such as Greedy-Dual Size [18], incorporate size with other characteristics to maximize
cache performance. Thus, understanding how object size is distributed is critical to developing effective
caching schemes.
Fig. 8. Popularity dynamics of the top five most popular objects in four randomly-chosen ASes. [Figure: number of requests per week over 12 weeks for (a) AS9406, (b) AS1782, (c) AS17318, and (d) AS1310.]
To develop a model for P2P object sizes, we collect information about objects in two popular P2P file-
sharing systems: Gnutella and BitTorrent. In BitTorrent, there are a number of web servers that maintain
metadata (.torrent) files about files in the network. These servers are known as torrent sites. We develop
a script which contacts four popular torrent sites and downloads random .torrent files. We downloaded
100 thousand .torrent files, 50% of which are unique. Each .torrent file contains information about the
shared object, such as the object size, the number of segments in the object and the size of each segment.
For the Gnutella system, we randomly pick 500 thousand unique objects from the traces we collected
during our measurement study and analyze their sizes.
Fig. 9. Popularity degradation of the top five most popular objects in the top twenty ASes. [Figure: percentage of requests (%) per week over 12 weeks.]
The histograms of the size distribution in both BitTorrent and Gnutella, Figures 10(a) and 11(a), show
that there exist several peaks, where each peak corresponds to a certain workload. It is clear that it
is hard to model the entire size range using a single distribution. Therefore, we divide the whole size
range (approximately 0 MB - 25,000 MB) into multiple intervals. The choice of such intervals is not
random, but governed by the existence of different types of content. For example, a peak around 700
MB corresponds to most shared CD movies; another peak around 250 MB corresponds to high-quality
TV video series and some software objects; while a peak around a few MB corresponds to video clips
and audio clips.
Using the Chi-Square goodness-of-fit test, we show that the size distribution of each content type (i.e.,
workload) can be modeled as a Beta distribution. We also show that the size distribution of objects
observed per AS for the Gnutella traffic we collected is also multimodal and contains multiple workloads,
which could be modeled using the Beta distribution. A Beta distribution is defined as follows: a random
variable X is said to be beta distributed with parameters α and β (both positive) in the interval [A, B]
if the probability density function (pdf) of X is

P(x) = [1/(B − A)] · [Γ(α + β)/(Γ(α)Γ(β))] · ((x − A)/(B − A))^{α−1} · ((B − x)/(B − A))^{β−1}  for A ≤ x ≤ B,
and P(x) = 0 otherwise.    (3)
When A = 0 and B = 1, the above distribution is called the standard Beta distribution, with pdf

P(x) = [Γ(α + β)/(Γ(α)Γ(β))] · x^{α−1} (1 − x)^{β−1},
Fig. 10. Object size distribution in BitTorrent. [Figure: (a) histogram of object sizes (frequency versus size in MB); (b)-(f) CDFs of normalized size per workload, fitted as (0-349) MB: Beta(0.62, 1.52); (350-700) MB: Beta(0.48, 0.32); (701-1399) MB: Beta(0.59, 0.55); (1400-4500) MB: Beta(0.38, 0.53); (4500-25000) MB: Beta(0.61, 2.11).]
18
where Γ(α) is the gamma function, defined for α > 0 as:

Γ(α) = ∫_0^∞ x^(α−1) e^(−x) dx.
Figure 10 shows the CDF (Cumulative Distribution Function) of object sizes in BitTorrent fitted against
the CDF of the standard Beta distribution. Similarly, Figure 11 shows that the Beta distribution reasonably
models the object size distribution in Gnutella. We notice that objects in BitTorrent are larger than
in Gnutella: while objects smaller than 100 MB constitute approximately 90% of objects in Gnutella,
they constitute only 36% in BitTorrent. Figure 12 shows the distribution of objects into workloads in
Gnutella and BitTorrent. The fact that objects in BitTorrent are usually large video objects could be
explained as follows. BitTorrent offers higher availability due to its tit-for-tat mechanism, under which
each peer contributes by uploading to other peers, while a peer in Gnutella can choose to free-ride
without running the risk of getting slower download rates. Thus peers in Gnutella may not be as willing
to share content, which means objects are less available. BitTorrent, on the other hand, gives each peer
a download rate proportional to its upload rate, which punishes free-riders and thus achieves higher
availability. Higher availability means fewer aborted downloads, and thus faster object retrieval, which
in turn promotes sharing of large objects.
We should keep in mind that, even though most of the downloaded objects observed by our measurement
were small, the majority of the traffic was contributed by large objects. To study this, we pick the top
ten ASes and look at the fractions of objects with size less than 100 MB and larger than 100 MB. For
each of these two categories, we count the number of objects downloaded in the respective AS and the
sum of their sizes, which corresponds to the traffic they contributed. As Figure 13 shows, objects larger
than 100 MB account for only about 10% of the total number of downloaded objects, but they are
responsible for about 90% of the traffic. This shows that large object size is a very important factor
to consider for any caching policy that strives to reduce P2P traffic on the Internet backbone.
Knowing the size distribution of P2P objects and the fraction of requests for each workload, we can estimate
the required cache size (Size) to achieve a certain byte hit rate (BHR):
Size = BHR · Σ_{i=1}^{n} λ_i · μ_i,

where λ_i is the fraction of requests received by workload i and μ_i is the mean of the respective size
distribution in workload i (which is Beta distributed), defined as

μ = A + (B − A) · α / (α + β).
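The cache-size estimate above can be sketched in a few lines. This is our own illustration, not code from the paper; the workload tuples passed in (request fraction λ_i plus Beta parameters) are the caller's assumptions, e.g. taken from fits such as those in Figure 10:

```python
def beta_mean(alpha, beta, A=0.0, B=1.0):
    # Mean of a Beta(alpha, beta) distribution on [A, B]: A + (B - A) * alpha / (alpha + beta)
    return A + (B - A) * alpha / (alpha + beta)

def required_cache_size(bhr, workloads):
    # workloads: iterable of (lambda_i, alpha, beta, A, B) tuples, one per workload.
    # Returns Size = BHR * sum(lambda_i * mu_i), in the same units as A and B (e.g. MB).
    return bhr * sum(lam * beta_mean(a, b, A, B) for lam, a, b, A, B in workloads)
```

For example, a single uniform-like workload Beta(1, 1) on 0-100 MB with a target byte hit rate of 30% would require roughly 0.3 × 50 = 15 MB per request on average.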
Fig. 11. Object size distribution in Gnutella. (a) Histogram of object sizes; (b)-(e) CDFs of normalized size fitted by Beta distributions: (b) 0-349 MB, Beta(0.5, 1.42); (c) 350-700 MB, Beta(0.63, 0.44); (d) 700-1400 MB, Beta(0.29, 0.57); (e) >1400 MB, Beta(0.34, 1.26).
Fig. 12. Object size comparison under BitTorrent and Gnutella: percentage of objects in the size ranges [0,10], [10,100], [100,700], [700,1400], and >1400 MB.
We discuss the relationship between byte hit rate and cache size further in the evaluation section.
F. Summary of the measurement study
We designed our measurement study to analyze the P2P traffic that would be observed by individual
P2P caches deployed in different ASes. We found that popular P2P objects are not as highly popular as
their web counterparts. The popularity distribution of P2P objects has a flattened head at the lowest ranks,
and therefore modeling this popularity as a Zipf-like distribution yields a significant error. We also
found that a generalized form of the Zipf distribution, called Mandelbrot-Zipf, captures this flattened
head and is therefore a better model for the popularity of P2P objects. Furthermore, we found that the
Mandelbrot-Zipf popularity has a negative effect on hit rates of caches that use the common LRU and
LFU policies. In addition, our study revealed the dynamic nature of object popularity in P2P systems,
which should be considered by P2P caching algorithms. Finally, we found that objects in P2P systems
have much larger sizes than web objects, and they can be classified into several categories based on
content types. Each category can be modeled by a Beta distribution.
IV. P2P CACHING ALGORITHM
With the understanding of P2P traffic we developed, we design and evaluate a novel P2P caching
scheme based on segmentation and partial caching. Partial caching is necessary because of large object
sizes. Unlike traditional memory systems, where a page must be loaded into memory before it is used,
objects in web or P2P systems do not have to be cached on every cache miss. In particular, objects in
P2P systems are large, and even the most popular objects may not receive many requests. Thus storing
an entire object upon request may waste cache space. P2P caching should take a conservative approach
towards admitting new objects into the cache to reduce the cost of storing unpopular objects.

Fig. 13. Number of downloads for objects with size larger than 100 MB, and the number of bytes they contributed to the total
traffic, measured for the top 10 ASes. The figure shows that even though large objects constitute only 10% of the total
number of downloads, they are responsible for about 90% of the traffic.
The basic idea of our algorithm is to cache, for each object, a portion proportional to its popularity.
That way, popular objects are incrementally given more cache space than unpopular objects. To achieve
this goal, our algorithm caches a constant portion of an object when it first sees that object. As the object
receives more hits, its cached size increases, and the rate at which the cached size increases grows with
the number of hits the object receives. To facilitate admission and eviction of objects, our policy divides
objects into segments; the smallest granularity that our algorithm caches or evicts is one segment.
We use different segment sizes for different workloads, because using the same segment size for all
workloads may favor some objects over others and introduce additional cache management overhead.
For example, if we use a small segment size for all workloads, large objects will have a large number
of segments. This introduces a high overhead for managing the many segments within each object;
such overhead includes locating which segments of an object are cached and deciding which segments
to evict. On the other hand, using a large segment size for all workloads favors smaller objects, since
they have fewer segments and get cached more quickly. Making segment size relative to object size has
the same two problems, because the range of P2P object sizes extends to a few gigabytes. Thus relative
segmentation, as a percentage of object size, may result in large objects having segments that are orders
of magnitude larger than those of unsegmented smaller objects.
Consistent with our observations of multiple workloads, we use four different segment sizes. For
objects smaller than 10 MB, we use a segment size of 512 KB; for objects between 10 MB and
100 MB, a segment size of 1 MB; and for objects between 100 MB and 800 MB, a segment size of
2 MB. Objects whose size exceeds 800 MB are divided into segments of 10 MB. We call this the
variable segmentation scheme. Note that our segmentation procedure is in line with most segmentation
schemes used by real P2P systems. Our analysis of more than 100,000 torrent files shows that the vast
majority of BitTorrent objects are segmented into 256 KB, 512 KB, 1 MB, or 2 MB segments. E-Donkey
uses segments of size 9.28 MB [19]. Depending on the client implementation, segment size in Gnutella
can either be a percentage of object size, as in BearShare [20], or fixed at a few hundred KB [12].
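The variable segmentation rule above is a simple size-to-bucket mapping; a direct sketch of it (helper name and MB units are our own choices):

```python
def segment_size_mb(object_size_mb):
    """Variable segmentation scheme from Section IV: segment size, in MB, by object size."""
    if object_size_mb < 10:
        return 0.5   # 512 KB segments for objects under 10 MB
    if object_size_mb < 100:
        return 1     # 1 MB segments for 10-100 MB objects
    if object_size_mb <= 800:
        return 2     # 2 MB segments for 100-800 MB objects
    return 10        # 10 MB segments for objects over 800 MB
```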
For each workload, the cache keeps the average object size in that workload. Denote the average
object size in workload w as μ_w. Further, let γ_i be the number of bytes served from object i normalized
by its cached size; that is, γ_i can be considered the number of times each byte of the object has been
served from the cache. The cache ranks objects by their γ value, such that for objects ranked
from 1 to n, γ_1 ≥ γ_2 ≥ γ_3 ≥ · · · ≥ γ_n. From this point on, we refer to the object at rank i simply
as object i. When an object is seen for the first time, only one segment of it is stored in the cache. If
a request arrives for an object of which at least one segment is cached, the cache computes the
number of segments to add to this object's cached segments as (γ_i/γ_1) · μ_w, where μ_w is the mean
object size of the workload w to which object i belongs. Notice that this is only the number of segments the
cache could store of object i. Since downloads can be aborted at any point during the session, the
number of segments actually cached upon a request, denoted by k, is given by

k = min[missed, max(1, (γ_i/γ_1) · μ_w)],  (4)

where missed is the number of requested segments not in the cache. This means that the cache stops
storing uncached segments if the client fails or aborts the download, and that the cache stores at least
one segment of an object.
The pseudo-code of our algorithm appears in Figure 14. At a high level, the algorithm works as follows.
The cache intercepts client requests and extracts the object ID and the requested range. If no segment
of the requested range is in the cache, the cache stores at least one segment of the requested range. If the
entire requested range is cached, the cache serves it to the client. If the requested range is partially
available, the cache serves the cached segments to the client and decides how many of the missing
segments to cache using Eq. (4). In all cases, the cache updates the average object size of the workload
to which the object belongs and the γ value of the requested object.
request(object i, requested range)
1.  if object i is not in the cache
2.      add one segment of i to the cache, evicting if necessary
3.  else
4.      hit = cached range ∩ requested range
5.      γ_i += hit / cached size of i
6.      missed = (requested range − hit) / segment size
7.      k = min[missed, max(1, (γ_i/γ_1) · μ_w)]
8.      if the cache does not have space for k segments
9.          evict segments from the least valued object
10.     add k segments of object i to the cache
11. return
Fig. 14. P2P caching algorithm.
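The admission logic of Figure 14 can be prototyped compactly. The sketch below is ours, not the paper's implementation, and it makes two simplifying assumptions: γ_1 is taken as the current maximum γ value, and eviction removes the entire least-valued object rather than individual segments; all quantities are in segments:

```python
class P2PCache:
    """Minimal sketch of the partial-admission policy of Fig. 14 (simplified eviction)."""

    def __init__(self, capacity_segments):
        self.capacity = capacity_segments
        self.used = 0
        self.cached = {}   # object id -> number of cached segments
        self.gamma = {}    # object id -> bytes served normalized by cached size

    def request(self, obj, requested_segments, mean_segments_w):
        # mean_segments_w: mean object size of obj's workload, expressed in segments
        if obj not in self.cached:
            self._admit(obj, 1)                      # first sight: cache one segment
            return
        hit = min(self.cached[obj], requested_segments)
        self.gamma[obj] += hit / self.cached[obj]    # update normalized bytes served
        missed = requested_segments - hit
        gamma_top = max(self.gamma.values())         # stands in for gamma_1
        k = min(missed, max(1, int(self.gamma[obj] / gamma_top * mean_segments_w)))
        self._admit(obj, k)                          # Eq. (4)

    def _admit(self, obj, k):
        # Evict least-valued objects (whole objects, a simplification) until k fits.
        while self.used + k > self.capacity and self.cached:
            victim = min(self.cached, key=lambda o: self.gamma[o])
            if victim == obj:
                break
            self.used -= self.cached.pop(victim)
            del self.gamma[victim]
        if self.used + k <= self.capacity:
            self.cached[obj] = self.cached.get(obj, 0) + k
            self.gamma.setdefault(obj, 0.0)
            self.used += k
```

A popular object thus accumulates cache space only as it keeps receiving hits, while a one-timer never holds more than one segment.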
The algorithm uses a priority queue data structure to store objects according to their γ_i values. When
performing eviction, the required segments are removed from the least valued objects. Currently, our
algorithm evicts contiguous segments from the least valued objects without favoring any segments over
others, because P2P downloads are likely to start from anywhere in the file [9]. It remains a future
goal of ours to study in-object request patterns and improve our algorithm accordingly. Our algorithm
performs O(log N) comparisons on every hit or miss, where N is the number of objects in the
cache. Since objects are large, this is a reasonable cost, considering the small number of objects the
cache will contain.
V. EVALUATION
In this section, we use trace-driven simulation to study the performance of our P2P caching algorithm,
and compare it against common web caching algorithms (LRU, LFU, and GDS), a recent caching
algorithm proposed for P2P systems, and the offline optimal policy.
A. Experimental Setup
We use traces obtained from our measurement study to conduct our experiments. Our objective is to
study the effectiveness of deploying caches in several ASes. Thus, we collect information about objects
found in a certain AS and measure the byte hit rate that could be achieved if a cache were to be deployed
in that AS. We use the byte hit rate as the performance metric because we are mainly interested in reducing
the WAN traffic.
We run several AS traces through the cache and compare the byte hit rate achieved using several
caching algorithms. In addition to our algorithm, we implement the Least Recently Used (LRU), Least
Frequently Used (LFU), Greedy-Dual Size (GDS) [18], and Least Sent Bytes (LSB) [9] algorithms, as
well as the off-line optimal (OPT) algorithm. LRU capitalizes on the temporal correlation in requests,
and thus replaces the oldest object in the cache. LFU sorts objects by access frequency and evicts the
object with the lowest frequency. GDS maintains a characteristic value H_i for each object, where
H_i = cost_i/size_i + L. The object with the smallest H_i value is evicted next. The cost parameter
could be set to maximize hit rate, minimize latency, or minimize miss rate; we set it to 1 + size/536,
which is the estimated number of packets sent and received upon a miss. LSB is designed for P2P traffic
caching and uses the number of transmitted bytes of an object as a sorting key; objects that have
transmitted the least bytes are evicted next. OPT looks at the entire stream of requests off-line and
caches the objects that will serve the most bytes from the cache.
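As a rough illustration of the GDS value computation and its L-inflation rule (this is our own sketch under the cost setting above, not the evaluated implementation; the function name is hypothetical):

```python
def gds_simulate(requests, capacity):
    """Sketch of Greedy-Dual Size: H_i = L + cost_i / size_i with cost = 1 + size/536.
    requests: iterable of (object_id, size) pairs; returns the set of cached object ids."""
    L = 0.0
    H, cache, used = {}, {}, 0
    for oid, size in requests:
        cost = 1 + size / 536          # estimated packets sent/received on a miss
        if oid in cache:               # hit: refresh H relative to the current L
            H[oid] = L + cost / size
            continue
        while used + size > capacity and cache:
            victim = min(cache, key=lambda o: H[o])
            L = H[victim]              # inflate L to the evicted object's H value
            used -= cache.pop(victim)
            del H[victim]
        if used + size <= capacity:
            cache[oid] = size
            H[oid] = L + cost / size
            used += size
    return set(cache)
```

Because L only grows, recently admitted objects start with higher H values than long-idle ones, which is how GDS mixes recency with cost and size.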
We measure the byte hit rate under several scenarios. First, we consider the case when objects requested
by peers are downloaded entirely, that is, there are no aborted transactions. Then, we consider the
case when downloading peers prematurely terminate downloads during the session, which is not
uncommon in P2P systems. We also run our experiments under the independent reference model, where
entries of the trace file are randomly shuffled to eliminate the effects of temporal correlation and popularity
degradation; we do this to isolate and study the effect of object popularity on caching. Finally, we run
our experiments on the original trace with preserved temporal correlation between requests to study the
combined effect of popularity and popularity degradation.
In all experiments, we use ASes with different traffic characteristics, which are representative of P2P
traffic observed in other ASes. We concentrate on the top ASes that sent the most replies, to
ensure that we have enough traffic from an AS to evaluate the effectiveness of deploying a cache for it.
Caches usually strive to store objects that have a higher likelihood of being requested in the near
future. This likelihood is called temporal locality of requests [21], [22]. Temporal
locality has two components: popularity and temporal correlations. The popularity of an object is the number
of requests it receives relative to other objects. Temporal correlation of requests means that requests to
the same object tend to be clustered, or correlated, in time. To illustrate the difference, consider
the following two sequences:

. . . ABCAEGFAXYNAACEABA . . .

in which A receives more requests than any other object; thus A is more popular than the other objects.
Consider a second sequence:

. . . ABBBACCCEAFBBAB . . .

in which requests to objects B and C are correlated and tend to appear close to each other. Temporal correlation
is usually measured by the number of intervening requests between two consecutive requests to the same
object. A shorter sequence of intervening requests implies higher temporal correlation.
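The intervening-request measure just described is easy to compute over a request stream; the following helper is our own illustration (not from the paper):

```python
def mean_intervening_requests(stream, obj):
    """Mean number of intervening requests between consecutive requests to obj.
    Returns None if obj is requested fewer than twice."""
    positions = [i for i, o in enumerate(stream) if o == obj]
    if len(positions) < 2:
        return None
    gaps = [b - a - 1 for a, b in zip(positions, positions[1:])]
    return sum(gaps) / len(gaps)
```

On the second example sequence above, requests to B are tightly clustered, so B's mean intervening-request count is low, indicating high temporal correlation.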
The temporal locality of web request streams has been widely studied in the literature (see [21]–[23] and
references therein). However, we believe that temporal correlation is not as critical to P2P caching as it
is to web caching. Web objects are small, which means that the cache can accommodate
a long sequence of requests, during which temporal locality will be evident. But in P2P caching, objects
are orders of magnitude larger than their web counterparts, which means that the sequence of P2P objects
that can be stored in the cache is orders of magnitude shorter than a sequence of web objects.
B. Caching under Full Downloads
One of the main advantages of our caching policy is that it effectively minimizes the impact of
unpopular traffic by admitting objects incrementally. Traditional policies usually cache an entire object
upon a miss, evicting other, perhaps more popular, objects if necessary. This may waste cache space
by storing unpopular objects. To investigate the impact of partial versus full caching, we first assume
a no-failure scenario where peers download objects entirely. We look at all objects found in an AS,
line them up in a randomly shuffled stream of requests, and run them through the cache. We vary the
cache size between 0 and 1000 GB.

Figure 15 shows the byte hit rate for four representative ASes with different characteristics. These ASes
have different maximum achievable byte hit rates, defined as the fraction of traffic downloaded
more than once (i.e., cacheable traffic) over the total amount of traffic. As shown in the figure, our policy
outperforms other policies by a wide margin. For instance, in AS397 (Figure 15(a)) with a cache of 600
GB, our policy achieves a byte hit rate of 24%, which is almost double the rate achieved by the LRU,
LFU, GDS, and LSB policies. Moreover, the byte hit rate achieved by our algorithm is about 3% less
than that of the optimal algorithm. Our trace shows that the amount of traffic seen in AS397 is around
24.9 terabytes. This means that a reasonable cache of size 600 GB would have served about 6 terabytes
locally using our algorithm.

Fig. 15. Byte hit rate for different caching algorithms, with no aborted downloads: (a) AS397, 48% of traffic is cacheable; (b) AS95, 54% cacheable; (c) AS1161, 16% cacheable; (d) AS14832, 95% cacheable.
The reason traditional policies perform poorly for P2P traffic is the effect of unpopular objects. For
example, one-timer objects are stored entirely under traditional policies upon a miss. Under our policy,
however, only one segment of each one-timer finds its way into the cache, thus minimizing their effect.

Fig. 16. Relative byte hit rate improvement using our algorithm over the LRU, LFU, and LSB algorithms in the top 10 ASes. No aborted downloads.
As can be seen from the figures, our policy consistently performs better than traditional policies. Figure
16 summarizes the relative improvement in byte hit rate that our policy achieves over LRU, LFU, and
LSB for the top ten ASes. The relative improvement is computed as the difference between the byte
hit rate achieved by our policy and that achieved by another policy, normalized by the byte hit rate of
the other policy; for instance, the improvement over LRU is (P2P − LRU)/LRU. The figure shows that
a relative improvement of at least 40% and up to 180% can be achieved by our algorithm. That is a
significant gain given the large volume of P2P traffic. We notice that the relative gain our policy achieves
is larger in ASes with a substantial fraction of one-timers.
As a final comment on Figure 15, consider the absolute byte hit rate achieved under our algorithm
as well as the optimal byte hit rate in AS14832 and AS397. Notice that although the percentage of
cacheable traffic in AS397 is less than that of AS14832, the absolute byte hit rate is higher in AS397.
This is because popular objects in AS14832 do not get as many requests as their counterparts in AS397;
that is, the popularity distribution of AS14832 has a more flattened head than that of AS397. We
computed the skewness factor α and the plateau factor q of the Mandelbrot-Zipf distribution that best fits
the popularity distributions of these two ASes. We found that AS397 has α = 0.55 and q = 25, while
AS14832 has α = 0.6 and q = 55. Thus smaller q values mean less flattened heads and yield higher
byte hit rates (Section V-D elaborates on the impact of α and q on the byte hit rate). This also
means that the amount of cacheable traffic in an AS is not, by itself, a good indication of the achievable
byte hit rate.
Next, we study the effect of temporal correlations on P2P caching. As mentioned previously, temporal
correlations imply that an object just requested has a higher likelihood of being requested in the near
future, and also that the requests an object receives tend to be correlated and decrease over time. Our
traces exhibit a dynamic request behavior where requests to popular objects decrease over time. To
quantify this effect, we use unshuffled traces from our measurement. Figure 18 shows the byte hit rate
under AS397 and AS95 with unshuffled traces, which preserve temporal correlations and feature a
popularity degradation of the most popular objects, as we showed in Section III-D. As the figure shows,
LRU and GDS perform slightly better than LSB and LFU because they make use of the temporal
correlation between requests. However, the byte hit rates achieved by LRU, LFU, GDS, and LSB are
still low compared to that achieved by our algorithm. Note that the absolute byte hit rate under our
algorithm is slightly smaller than in the previous experiment, where we used the independent reference
model, because our algorithm does not consider object aging. However, this reduction is small, less than
3%. The fact that the performance of our algorithm does not suffer much under popularity degradation
and still outperforms other algorithms (e.g., LRU) could be explained as follows. We believe that object
size is the dominant factor in caching for P2P systems, because the maximum cache size we used (1000
GB) is still small compared to the total size of objects, less than 10% in most cases. Also, new popular
objects are usually introduced into the system, evicting old popular objects and minimizing the negative
effect of object staleness. We conclude that the object admission strategy is a key element in determining
the byte hit rate, which our algorithm capitalizes on. Similar results were obtained for the remaining
ASes.
Since we are not able to control the degree of correlation (i.e., the inter-request distance) in our traces,
we modify a synthetic trace generation tool to obtain such traces. We modify the ProWGen traffic
generation tool [24] to use the Mandelbrot-Zipf distribution and to draw object sizes from our traces.
ProWGen uses the LRU Stack Model (LRUSM) to generate temporal correlations between requests.
LRUSM uses a stack of depth n to keep an ordered list of the last n requested objects, such that the
objects in the stack have a higher probability of being accessed again than they would if they were not
in the stack. That is, objects in the stack have the access probability given by Mandelbrot-Zipf plus a
small fraction proportional to their position in the stack. Details of how ProWGen implements the
LRUSM are provided in [24]. To study the sensitivity of our algorithm to temporal correlations, we vary
the depth of the LRU stack between 0 and 500, where higher values of stack depth correspond to higher
temporal correlation.
Fig. 17. Sensitivity of the P2P caching algorithm to temporal correlations: byte hit rate versus cache size under LRUSM(500), LRUSM(100), and IRM.
As Figure 17 shows, the byte hit rate improves slightly under high degrees of temporal correlation.
However, the improvement is still very small compared to the byte hit rate achieved under the
Independent Reference Model (IRM). The fact that our policy suffered slightly under real traces but
improved under synthetic traces can be explained as follows. In our traces, requests to popular objects
were clustered together such that their popularity dies over time. In the synthetic traces, however, the
temporal correlation of objects has no popularity bias whereby popular objects are requested in a certain
region of the sequence and then grow stale. But in both cases, our policy is insensitive to the temporal
correlations of P2P requests. As we explained in Section III-D, the introduction of new popular objects
and the relatively small cache size compared to object sizes mitigate the effect of popularity dynamics.
Thus, object size is the dominant factor in P2P caches, and our policy capitalizes on it.
C. Caching with Aborted Downloads
Due to the nature of P2P systems, peers could fail during a download or abort a download. We run
experiments to study the effect of aborted downloads on caching, and how robust our caching policy is
to them. Following observations from [1], we allow 66% of downloads to abort anywhere in the download
session. While our policy is designed to deal with aborted downloads, web replacement policies usually
download the entire object upon a miss, and at times perform pre-fetching of popular objects. This is
reasonable in the web, since web objects are usually small and thus take less cache space.
But in P2P systems, objects are larger, and partial downloads constitute a significant fraction of the
total number of downloads.

Fig. 18. Byte hit rate for different caching algorithms using unshuffled traces (popularity degradation): (a) AS95, 54% of traffic is cacheable; (b) AS397, 48% cacheable.

Figure 19 compares the byte hit rate with aborted downloads. Compared to the
scenario of caching under full downloads (Section V-B), the performance of our algorithm improves
slightly, while the performance of the other algorithms declines. The improvement in our policy is
explained by the fact that fewer bytes are missed in case of a failure. The performance of LRU, LFU,
GDS, and LSB declines because they store an object entirely upon a miss, regardless of how much of it
the client actually downloads. Hence, under the aborted-download scenario, the byte hit rates of
traditional policies suffer even more than under the full-download scenario. Similar results were obtained
for the other top ten ASes. Our policy consistently outperforms other policies with a significant
improvement margin in all top ten ASes. Figure 21 shows that the relative improvement in byte hit rate
over LRU, LFU, and LSB is at least 50% and up to 200%.
D. Effect of α and q on P2P Caching

As we mention in Section III, P2P traffic can be modeled by a Mandelbrot-Zipf distribution with a
skewness factor α between 0.4 and 0.65 and a plateau factor q between 10 and 70. In this section, we
study the effect of α and q on the byte hit rate of our algorithm via simulation, since the trace-driven
simulation presented above may not capture the performance of our algorithm for all possible values
of α and q. We randomly pick 100,000 objects from our traces and generate their frequencies using
Mandelbrot-Zipf with various values of α and q. We fix the cache size at 1,000 GB and assume a
no-failure model where peers download objects entirely.
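The frequency-generation step can be sketched as follows, using the Mandelbrot-Zipf form p(i) ∝ 1/(i + q)^α for rank i. This is our own illustration of the setup, not the simulator's code, and the helper name is hypothetical:

```python
def mandelbrot_zipf_frequencies(n, alpha, q, total_requests):
    """Assign request counts to n objects ranked 1..n following Mandelbrot-Zipf:
    p(i) proportional to 1 / (i + q)**alpha. Larger q flattens the head."""
    weights = [1.0 / (i + q) ** alpha for i in range(1, n + 1)]
    norm = sum(weights)
    return [round(total_requests * w / norm) for w in weights]
```

Comparing the head of the distribution for two q values makes the plateau effect concrete: with the same α, a larger q gives the top-ranked object a visibly smaller share of the requests.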
Fig. 19. Byte hit rate for different caching algorithms using traces with aborted downloads: (a) AS95; (b) AS397; (c) AS1161; (d) AS14832.
To study the effect of different α values, we fix q and vary α between 0.4 and 1. As shown in
Figure 20(a), the byte hit rate increases as the skewness factor α increases. This is intuitive, since higher
values of α mean that objects at the lowest ranks are more popular, and caching them yields a higher
byte hit rate. Thus ASes whose popularity distributions are characterized by high α values would benefit
more from caching than those with low α values.
Another parameter that determines the achievable byte hit rate is the plateau factor q of the popularity
distribution; q reflects the flattened head we observed in Section III. Figure 20(b) shows that the byte
hit rate decreases as q increases. The reason is that a higher value of q means popular objects are
requested less often. This has a negative impact on a cache that stores those popular objects, because
they receive fewer downloads, resulting in a decreased byte hit rate.

Fig. 20. The impact of the skewness factor α (a) and the plateau factor q (b) on the byte hit rate of our algorithm.
E. Effect of Segmentation on P2P Caching
Our algorithm uses a variable segmentation scheme whereby objects belonging to the same workload
have the same segment size. To investigate the effect of segmentation, we repeat the experiment with
aborted downloads under AS397. We first evaluate the byte hit rate when all objects have the same
segment size (1 MB, 50 MB, or 100 MB) regardless of their size. Then we evaluate the variable
segmentation scheme described in Section IV. As can be seen from Figure 22, the byte hit rate degrades
as the segment size increases. This is because the cache evicts and admits whole segments only, and
when a download aborts in the middle of a segment, the rest of that segment is unnecessarily stored.
Similarly, under large segment sizes, we may admit more unpopular objects, since admitting one segment
of a one-timer amounts to admitting a substantial part of the object. The figure also shows that our
variable segmentation scheme achieves a higher byte hit rate than uniform 1 MB segmentation. We
believe the minor computational overhead added by our segmentation scheme is justified by this
performance gain.
Fig. 21. Relative byte hit rate improvement using our algorithm over the LRU, LFU, and LSB algorithms in the top 10 ASes, with aborted downloads.
VI. CONCLUSION AND RELATED WORK
In this paper, we performed a three-month measurement studyon the Gnutella P2P system. Using
real-world traces, we studied and modeled characteristicsof P2P traffic that are relevant to caching.
We found that the popularity distribution of P2P objects cannot be captured accurately using Zipf-like
distributions. We proposed a new Mandelbrot-Zipf model forP2P popularity and showed that it closely
models object popularity in several AS domains. We also showed the existence of multiple workloads in
P2P systems, and used Beta distribution to model the size distribution of each workload. Our measurement
study indicated that (i) the Mandelbrot-Zipf popularity distribution degrades the hit rates of caches that
use LRU, LFU, or GDS policies, and (ii) object admission strategies in P2P caching are as critical to
cache performance as object replacement strategies.
We designed and evaluated a new P2P caching algorithm that is based on segmentation and incremental
admission of objects according to their popularity. Using trace-driven simulations, we showed that our
algorithm outperforms traditional algorithms by a significant margin and achieves a byte hit rate that is
up to triple the byte hit rate achieved by other algorithms. We also showed that temporal correlation of
requests does not significantly impact P2P caching because object size is the dominant factor. Finally,
because of aborted downloads, a caching policy whose objective is to maximize byte hit rate should not
pre-fetch objects; this is the strategy our algorithm adopts.
We are currently implementing our algorithm in the Squid [25] proxy cache and testing it using our
[Figure 22 plot: byte hit rate (0–35%) versus cache size (0–1000 GB), with curves for variable segmentation and uniform 1 MB, 50 MB, and 100 MB segment sizes.]
Fig. 22. The effect of segmentation on byte hit rate under the P2P algorithm.
traces. In the future, we intend to investigate preferential segment eviction, whereby segments of the
same object carry different values.
REFERENCES
[1] K. Gummadi, R. Dunn, S. Saroiu, S. Gribble, H. Levy, and J. Zahorjan, “Measurement, modeling, and analysis of a peer-
to-peer file-sharing workload,” in Proc. of the 19th ACM Symposium on Operating Systems Principles (SOSP’03), Bolton
Landing, NY, USA, Oct. 2003.
[2] T. Karagiannis, A. Broido, N. Brownlee, K. C. Claffy, and M. Faloutsos, “Is P2P dying or just hiding?” in Proc. of IEEE
Global Telecommunications Conference (GLOBECOM 2004), Dallas, TX, USA, Nov. 2004.
[3] S. Podlipnig and L. Böszörményi, “A survey of web cache replacement strategies,” ACM Computing Surveys, vol. 35, no. 4,
pp. 347–398, Dec. 2003.
[4] J. Liu and J. Xu, “Proxy caching for media streaming over the Internet,” IEEE Communications Magazine, vol. 42, no. 8,
pp. 88–94, Aug. 2004.
[5] L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker, “Web caching and Zipf-like distributions: Evidence and implications,”
in Proc. of INFOCOM’99, New York, NY, USA, Mar. 1999.
[6] S. Sen and J. Wang, “Analyzing peer-to-peer traffic across large networks,” IEEE/ACM Transactions on Networking, vol. 12,
no. 2, pp. 219–232, Apr. 2004.
[7] A. Klemm, C. Lindemann, M. K. Vernon, and O. P. Waldhorst, “Characterizing the query behavior in peer-to-peer file
sharing systems,” in Proc. of ACM/SIGCOMM Internet Measurement Conference (IMC’04), Taormina, Sicily, Italy, Oct.
2004.
[8] N. Leibowitz, A. Bergman, R. Ben-Shaul, and A. Shavit, “Are file swapping networks cacheable? Characterizing P2P
traffic,” in Proc. of WWW Caching Workshop, Boulder, CO, USA, Aug. 2002.
[9] A. Wierzbicki, N. Leibowitz, M. Ripeanu, and R. Wozniak, “Cache replacement policies revisited: The case of P2P traffic,”
in Proc. of the 4th International Workshop on Global and Peer-to-Peer Computing (GP2P’04), Chicago, IL, USA, Apr.
2004.
[10] S. Sen, O. Spatscheck, and D. Wang, “Accurate, scalable in-network identification of P2P traffic using application
signatures,” in Proc. of the 13th International World Wide Web Conference (WWW’04), New York City, NY, USA, May
2004.
[11] T. Karagiannis, M. Faloutsos, and K. Claffy, “Transport layer identification of P2P traffic,” in Proc. of ACM/SIGCOMM
Internet Measurement Conference (IMC’04), Taormina, Sicily, Italy, Oct. 2004.
[12] “Gnutella home page,” http://www.gnutella.com.
[13] “Limewire Home Page,” http://www.limewire.com/.
[14] “Network systems lab homepage,” http://nsl.cs.surrey.sfu.ca.
[15] “GeoIP Database Home Page,” http://www.maxmind.com.
[16] Z. Silagadze, “Citations and the Zipf-Mandelbrot’s law,” Complex Systems, vol. 11, pp. 487–499, 1997.
[17] J. Pouwelse, P. Garbacki, D. Epema, and H. Sips, “The Bittorrent P2P file-sharing system: Measurements and analysis,”
in Proc. of the 4th International Workshop on Peer-To-Peer Systems, Ithaca, NY, USA, Feb. 2005.
[18] P. Cao and S. Irani, “Cost-aware WWW proxy caching algorithms,” in Proc. of USENIX Symposium on Internet Technologies
and Systems, Monterey, CA, USA, Dec 1997.
[19] “Emule Project Home Page,” http://www.emule-project.net/.
[20] “BearShare Home Page,” http://www.bearshare.com/.
[21] S. Jin and A. Bestavros, “Sources and characteristics of web temporal locality,” in Proc. of the 8th International Symposium
on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), Aug. 2000, pp. 28–35.
[22] K. Psounis, A. Zhu, B. Prabhakar, and R. Motwani, “Modeling correlations in web traces and implications for designing
replacement policies,”Journal of Computer Networks, vol. 45, no. 4, pp. 379–398, Jul 2004.
[23] R. Fonseca, V. Almeida, M. Crovella, and B. Abrahao, “On the intrinsic locality properties of web reference streams,”
in Proc. of IEEE INFOCOM’03, San Francisco, CA, USA, Apr. 2003.
[24] M. Busari and C. Williamson, “ProWGen: a synthetic workload generation tool for simulation evaluation of web proxy
caches,” Journal of Computer Networks, vol. 38, no. 6, pp. 779–794, Apr. 2002.
[25] “Squid Web Proxy Cache,” www.squid-cache.org/.