Modeling and Caching of Peer-to-Peer Traffic
Osama Saleh and Mohamed Hefeeda
School of Computing Science
Simon Fraser University
Surrey, Canada
{osaleh, mhefeeda}@cs.sfu.ca
Technical Report: TR 2006-11
Abstract
Peer-to-peer (P2P) file sharing systems generate a major portion of the Internet traffic, and this portion
is expected to increase in the future. We explore the potential of deploying caches in different Autonomous
Systems (ASes) to reduce the cost imposed by P2P traffic on Internet service providers and to alleviate
the load on the Internet backbone. We conduct a measurement study to analyze P2P characteristics
that are relevant to caching, such as object popularity and size distributions. Our study shows that the
popularity of P2P objects in different ASes can be modeled by a Mandelbrot-Zipf distribution, and that
several workloads exist in P2P systems, each of which can be modeled by a Beta distribution. Guided by our
findings, we develop a novel caching algorithm for P2P traffic that is based on object segmentation and
partial admission and eviction of objects. Our trace-based simulations show that with a relatively small
cache size, a byte hit rate of up to 35% can be achieved by our algorithm, which is a few percentage points
smaller than that of the off-line optimal algorithm. Our results also show that our algorithm achieves a
byte hit rate that is at least 40% more than, and at most triple, the byte hit rate of common web caching
algorithms. Furthermore, our algorithm is robust in the face of aborted downloads—a common case in P2P
systems—and achieves a byte hit rate that is even higher than in the case of full downloads.
I. INTRODUCTION
Peer-to-peer (P2P) file-sharing systems have gained tremendous popularity in the past few years. More
users are continually joining such systems and more objects are being made available, enticing even more
users to join. Currently, the traffic generated by P2P systems accounts for a major fraction of the Internet
traffic today [1], and is bound to increase [2]. The increasingly large volume of P2P traffic highlights
the importance of caching such traffic to reduce the cost incurred by Internet Services Providers (ISPs)
and alleviate the load on the Internet backbone.
Caching in the Internet has mainly been considered for the web and video streaming traffic, with
little attention to the P2P traffic which has become the dominant bandwidth consumer. Many caching
algorithms for web traffic [3] and for video streaming systems [4] have been proposed and analyzed.
Directly applying such algorithms to cache P2P traffic may not yield the best cache performance, because
of the different traffic characteristics as well as the caching objectives. For instance, the main objectives
in web and video streaming systems are to reduce the user-perceived latency, the servers' load, and WAN
bandwidth consumption, whereas the objective in P2P caching is mainly to reduce WAN bandwidth
consumption, since there are no servers in P2P systems and latency is not a major concern for clients.
In addition, unlike web objects, objects in P2P systems are mostly immutable, have large sizes, and
their popularity does not follow a Zipf-like distribution [1]. Furthermore, although objects in P2P and
video streaming systems share some characteristics (e.g., immutability and large size), streaming
systems impose stringent timing requirements which limit the flexibility of the caching algorithms in
choosing which segments to store in the cache. Therefore, new caching algorithms that consider the new
characteristics and objectives need to be designed and evaluated.
In this paper, we first study the P2P traffic characteristics via a three-month measurement study of
a popular file-sharing system. We use the results obtained from our measurement to develop a deeper
understanding of the P2P traffic characteristics that are relevant to caching. Such characteristics include
object popularity, object size, and popularity dynamics. We then design and evaluate a novel P2P caching
algorithm for object admission, segmentation and replacement.
Specifically, our contributions can be summarized as follows. First, we develop new models for P2P
traffic based on our measurement study. We show that the popularity of P2P objects can be modeled
by a Mandelbrot-Zipf distribution, which is a generalized form of Zipf-like distributions with an extra
parameter. This extra parameter captures the flattened head of the popularity distribution near
the lowest ranked objects (when plotted on a log-log scale). The flattened head has also been
observed by a previous study [1], but no specific distribution was given. We also show that P2P traffic
is composed of different workloads, each of which could be modeled by a Beta distribution. Second, we
analyze the impact of the Mandelbrot-Zipf popularity model on caching and show that relying on object
popularity alone may not yield a high hit rate. Third, we design a new caching algorithm for P2P traffic
that is based on segmentation, partial admission and eviction of objects.
We perform trace-based simulations to evaluate the performance of our algorithm and compare it
against common web caching algorithms (LRU and LFU) and a recent caching algorithm proposed for
P2P systems. Our results show that with a relatively small cache size, less than 10% of the total traffic,
a byte hit rate of up to 35% can be achieved by our algorithm, which is a few percentage points smaller than
the byte hit rate achieved by an off-line optimal algorithm with complete knowledge of future requests.
Our results also show that our algorithm achieves a byte hit rate that is at least 40% more than,
and at most triple, the byte hit rate of the common web caching algorithms.
The rest of this paper is organized as follows. In Section II, we summarize the related work. Section
III describes our measurement study, provides models for popularity and object size and analyzes the
effect of Mandelbrot-Zipf distribution on the cache hit rate. Our P2P caching algorithm is described in
Section IV. We evaluate the performance of our algorithm using trace-based simulation in Section V.
Section VI concludes the paper and outlines future directions for this work.
II. RELATED WORK
Several measurement studies have analyzed various aspects of P2P systems, and a few of them have
proposed caching algorithms for P2P traffic. Gummadi et al. [1] study the object characteristics of P2P
traffic in Kazaa and show that P2P objects are mainly immutable, multimedia, large objects that are
downloaded at most once. The study shows that the popularity of P2P objects exhibits a flattened head and
does not follow a Zipf distribution, which is typically used to model the popularity of web objects [5]. The
study provides a simulation model for generating P2P traffic that mimics the observed popularity curve,
but it does not provide any closed-form models for it. Sen and Wang [6] study the aggregate properties
of P2P traffic in a large-scale ISP, and confirm that P2P traffic does not obey Zipf's distribution. Their
observations also show that a few clients are responsible for most of the traffic.
Klemm et al. [7] study the query behavior in the Gnutella file-sharing network and propose using two Zipf-
like distributions to model query popularity. However, measuring the popularity of P2P objects by queries
may not be accurate. Queries are issued using keywords, not object IDs, and keywords are generated by
independent users. Thus, unique objects may not correspond to unique keywords. Further, queries do not
imply downloads, since a query could be abandoned or go unanswered. We use unique reply messages
in our measurement. These reply messages correspond to replicas of objects that do exist on hosts, and
were downloaded sometime in the past.
While the aforementioned studies provide useful insights into P2P systems, they were not explicitly
designed to study caching P2P traffic. Therefore, they did not focus on analyzing the impact of P2P
traffic characteristics on caching. The study in [1] highlighted the potential of caching and briefly studied
the impact of traffic characteristics on caching. But the study was performed in only one network domain
(a university campus network).
Leibowitz et al. [8] study the feasibility of caching P2P traffic. They conclude that P2P traffic is highly
repetitive and responds well to caching. However, the study does not provide any algorithm for caching.
Wierzbicki et al. [9] propose two cache replacement policies for P2P traffic: MINRS (Minimum Relative
Size), which evicts the object with the least cached fraction, and LSB (Least Sent Byte), which evicts
the object which has served the least number of bytes from the cache. Our simulation results show that
our algorithm outperforms LSB, which is better than MINRS according to their results.
Another problem related to caching is P2P traffic identification. Because P2P traffic is not standardized,
different protocols use different control messages, and some use encryption and dynamic ports. Sen et al.
[10] propose a technique for identifying P2P traffic using application-level signatures. Karagiannis et al.
[11] propose a non-payload P2P traffic identification at the transport layer based on network connection
patterns.
III. MODELING P2P TRAFFIC
We are interested in deploying caches in different autonomous systems (ASes) to reduce the WAN
traffic imposed by P2P systems. Thus, our measurement study focuses on measuring the characteristics
of P2P traffic that would be observed by these individual caches, and would impact their performance.
Such characteristics include object popularity, popularity dynamics, number of P2P clients per AS, and
object size distribution. We measure these characteristics in several ASes of various sizes.
A. Measurement Methodology
We conduct a passive measurement study of the Gnutella file-sharing network [12]. According to
the Gnutella protocol specifications, peers exchange several types of messages, including PING, PONG,
QUERY and QUERYHIT. A QUERY message contains a set of keywords a user is searching for and is
propagated to all neighbors in the overlay for a distance specified by the TTL field. A typical value for
the TTL is seven hops. The QUERY message contains the search keywords, a TTL field, and the address
of the immediate neighbor which forwarded the message to the current peer. If a peer has one or more
of the requested files, it replies with a QUERYHIT message. A QUERYHIT message is routed on the
reverse path of the QUERY message it is responding to, and it contains the name and the URN (uniform
resource name) of the file, the IP address of the responding node, and the file size. Upon receiving replies
Fig. 1. Object popularity in P2P traffic. [Figure: four panels (a)-(d) plotting object popularity (frequency) versus rank on a log-log scale; panel (d) shows AS11803 fitted with MZipf(0.47, 5).]
from several peers, the querying peer chooses a set of peers and establishes direct connections with them
to retrieve the requested file.
We modify an open-source Gnutella client called Limewire [13] by implementing a listener class
which logs all messages that pass through it. The Limewire implementation has two kinds of peers: ultra-
peers, characterized by high bandwidth and long connection periods, and leaf-peers, which are ordinary
peers that only connect to ultra-peers. We run our measurement node in ultra-peer mode and allow it to
maintain up to 500 simultaneous connections to other Gnutella peers. On average, our measurement node
was connected to 279 nodes, 63% of which were ultra-peers. Our measurement node passively recorded
Fig. 2. Object popularity in P2P traffic. [Figure: four panels (a)-(d) plotting object popularity (frequency) versus rank on a log-log scale; visible panels show AS1403 fitted with MZipf(0.7, 10), AS95 with MZipf(0.62, 50), and AS9406 with MZipf(1.0, 287).]
the contents of all QUERY and QUERYHIT messages that pass through it without injecting any traffic
into the network.
The measurement study was conducted between 16 January 2006 and 16 April 2006. Our measurement
peer is located at Simon Fraser University, Canada. But since the Gnutella protocol does not favor nodes
based on their geographic locations [7], we are able to observe peers from many ASes. Table I summarizes
some statistics about our measurement. All measurement data is stored in several trace files with a total
size of approximately 40 gigabytes, and will be made available to the research community at [14].
TABLE I
MEASUREMENT STUDY STATISTICS.
Measurement period 16 Jan. 2006 - 16 Apr. 2006
Number of QUERY messages 214,264,108
Number of QUERYHIT messages 107,063,419
Number of unique objects 11,449,168
Number of IP addresses 23,148,1245
Number of unique ASes 16,295
Number of ASes with more than 10,000 replies 392
B. Measuring and Modeling Object Popularity
The relative popularity of an object is defined as the probability of requesting that object relative to
other objects. Object popularity is critical for the performance of the cache. Intuitively, storing the more
popular objects in the cache is expected to yield higher hit rates.
Since we are interested in the performance of individual caches, we measure the popularity of objects
in each AS, not the global popularity of objects. To measure the popularity of an object in a specific
AS, we count the number of replicas of that object in the AS considered. The number of replicas indicates
the number of downloads that were completed in the past. This means that if a cache were deployed, it
would have seen a similar number of requests. This assumes that most of the downloads were supplied by
peers from outside the AS, which is actually the case because peers in most current P2P networks have
no sense of network proximity and thus do not favor local peers over non-local peers. In fact, previous
studies [1] have shown that up to 86% of the requested P2P objects were downloaded from peers outside
of the local network even though they were available locally.
Notice that measuring popularity using replica counts does not account for aborted downloads, i.e.,
requests that initiated a download but never finished it. This means that the popularity of individual
objects is somewhat underestimated. This may make up for the overestimation resulting from assuming
that all replicas are downloaded from peers outside of the AS. Note also that, from the cache perspective,
it is the relative popularity of objects, not the absolute number of requests for each object, that influences
the cache performance. Thus we opt to measure popularity as the number of replicas that exist for each
object instead of the number of requests it receives.
To count the number of replicas of a given object, we extract from our trace all QUERYHIT messages
TABLE II
OBJECT POPULARITY IN SEVERALAS DOMAINS.
AS No.   unique IPs   unique objects   MZipf(α, q)   % of cacheable traffic
9548     464,551      570,122          (0.65, 5)     42.30
2609     13,958       181,184          (0.55, 25)    54.75
9406     35,048       400,381          (1.0, 287)    84.71
2461     485,958      492,416          (0.67, 5)     33.12
223      393,471      477,801          (0.62, 5)     34.55
105      427,244      462,396          (0.62, 0)     38.45
14832    515          46,704           (0.60, 55)    95.34
18538    337          33,198           (0.62, 60)    93.74
95       6,121        87,730           (0.6, 50)     54.47
1859     437,901      767,890          (0.53, 10)    46.16
11715    543,708      677,259          (0.56, 5)     38.69
1403     661,242      618,708          (0.7, 10)     42.67
2120     533,210      584,072          (0.58, 4)     43.25
1161     1000         135,921          (0.60, 40)    16.67
379      368,604      418,263          (0.62, 8)     48.72
11803    340,211      435,128          (0.47, 5)     31.38
which contain the unique ID (URN) of that object. Since these QUERYHIT messages contain the IP
addresses of the responding nodes that have copies of the requested object, we can determine the number
of replicas by counting the number of unique IP addresses associated with a specific object ID. We
then map these unique IP addresses to their corresponding ASes by using the GeoIP database [15]. The
number of unique IP addresses associated with an object in an AS gives us the number of replicas of
the considered object in that AS.
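This counting step can be sketched as follows; the record format and the count_replicas helper are illustrative stand-ins for our trace-processing scripts, and the IP-to-AS dictionary stands in for the GeoIP lookup:

```python
from collections import defaultdict

def count_replicas(queryhits, as_of_ip):
    """Count replicas per (AS, object URN): each unique IP address holding
    a copy of an object counts as one replica in that IP's AS.

    queryhits -- iterable of (urn, ip) pairs extracted from QUERYHIT messages
    as_of_ip  -- mapping from IP address to AS number (e.g., via a GeoIP database)
    """
    seen = defaultdict(set)  # (asn, urn) -> set of unique IPs
    for urn, ip in queryhits:
        asn = as_of_ip.get(ip)
        if asn is not None:
            seen[(asn, urn)].add(ip)
    return {key: len(ips) for key, ips in seen.items()}

# Toy trace: duplicate replies from the same IP are counted only once.
hits = [("urn:sha1:AAA", "10.0.0.1"), ("urn:sha1:AAA", "10.0.0.1"),
        ("urn:sha1:AAA", "10.0.0.2"), ("urn:sha1:BBB", "10.0.0.1")]
asmap = {"10.0.0.1": 9548, "10.0.0.2": 9548}
replicas = count_replicas(hits, asmap)
print(replicas)  # two replicas of object AAA and one of BBB in AS 9548
```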
We compute the popularity of each object in each of the top 16 ASes (in terms of the number of
QUERYHIT messages sent). These top 16 ASes contribute around 38% of the total traffic seen by our
study. We rank objects based on their popularity, and we plot popularity versus rank. Figures 1 and 2
show a sample of our results for eight ASes. Results for other ASes are very similar. As shown in the
figures, there is a flattened head in the popularity curve of P2P objects. This flattened head indicates
that objects at the lowest ranks are not actually very popular. This flattened head phenomenon could
be attributed to two main characteristics of objects in P2P systems: immutability and large sizes. The
immutability of P2P objects eliminates the need for a user to download an object more than once. This
download-at-most-once behavior has also been observed in previous studies [1]. The large size of objects,
and therefore the long time required to download them, may make users download only objects that they
are really interested in. This is in contrast to web objects, which take much shorter times to download,
and therefore, users may download web objects even if they are of marginal interest to them. These two
characteristics reduce the total number of requests for popular objects.
Figures 1 and 2 also show that, unlike the case for web objects [5], using a Zipf-like distribution
to model the popularity of P2P objects would result in a significant error. In log-log scale, the Zipf-
like distribution appears as a straight line, which can reasonably fit most of the popularity distribution
except the left-most part, i.e., the flattened head. In particular, the Zipf-like distribution would greatly
overestimate the popularity of objects at the lowest ranks, which are the most important to caching
mechanisms because those are the good candidates to be stored in the cache.
We propose a new model that captures the flattened head of the popularity distribution of objects in
P2P systems. Our model uses the Mandelbrot-Zipf distribution [16], which is a general form of Zipf-like
distribution. The Mandelbrot-Zipf distribution defines the probability of accessing an object at rank i out
of N available objects as

p(i) = H / (i + q)^α,    (1)

where H = 1 / ∑_{i=1}^{N} 1/(i + q)^α is a normalization constant, α is the skewness factor, and q ≥ 0
is a parameter which we call the plateau factor because it is the reason behind the plateau shape near
the left-most part of the distribution (see Figure 3). Notice that the higher the value of q, the more
flattened the head of the distribution is, and when q = 0, the Mandelbrot-Zipf distribution degenerates
to a Zipf-like distribution with skewness factor α.
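As a concrete sketch, the probabilities defined by Eq. (1) can be computed directly; the mzipf_pmf helper below and its parameter values are illustrative, not part of our measurement code:

```python
def mzipf_pmf(N, alpha, q):
    """Mandelbrot-Zipf probabilities p(i) = H / (i + q)^alpha for ranks
    i = 1..N, with H chosen so that the probabilities sum to one (Eq. 1)."""
    weights = [1.0 / (i + q) ** alpha for i in range(1, N + 1)]
    H = 1.0 / sum(weights)  # normalization constant
    return [H * w for w in weights]

# q = 0 degenerates to a Zipf-like distribution; a large q flattens the head,
# i.e., the top-ranked objects become relatively less dominant.
zipf = mzipf_pmf(10000, 1.0, 0)
mzipf = mzipf_pmf(10000, 1.0, 50)
print(zipf[0] / zipf[99])    # rank 1 is 100x as likely as rank 100 under Zipf(1)
print(mzipf[0] / mzipf[99])  # under MZipf(1, 50) the ratio is only about 3
```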
To validate this popularity model, we fit the popularity distributions of the top 16 ASes to the Mandelbrot-
Zipf distribution using the Matlab distribution fitting tool. Our results, some of which are shown in
Figures 1 and 2, indicate that the Mandelbrot-Zipf distribution models reasonably well the popularity of
P2P objects. Table II summarizes the α and q values for the top sixteen ASes. We observe that a typical
value for α is between 0.4 and 0.70, and a typical value of q is between 25 and 60. We also observe
that ASes with a large ratio of the number of objects to the number of hosts have a large q value, which
corresponds to a more flattened popularity curve. This could be explained by the request-at-most-once
behavior of peers, where the number of replicas of a certain object in an AS is bounded by the number
of hosts that exist in that AS. We also notice that such ASes are responsible for most of the traffic. For
instance, AS number 14832, Figure 1(a), contributed the highest reply rate among all ASes we looked at
even though it has
Fig. 3. Mandelbrot-Zipf, denoted as MZipf(α, q), versus Zipf, denoted as Zipf(α), distributions. [Figure: frequency versus rank on a log-log scale for Zipf(1), MZipf(1, 20), MZipf(1, 50), and MZipf(1, 100).]
only 515 hosts. Such heavy-hitter ASes have also been observed in previous studies [6].
We are aware that the number of unique IP addresses in an AS might not reflect the number of unique
hosts due to dynamic IP addressing. To approximate the number of unique hosts, we look at the number
of dynamic IP addresses found in a random subset of our trace containing more than 100,000 unique
replies. We identify an IP address as dynamic if two IP addresses reply with the same object URN
and local object index. A local index is included in the QUERYHIT replies and it identifies the location
of the object on the disk of the responding peer. We find that 8.3% of the IP addresses in this sample
trace were dynamic IP addresses. Although such an approximation might differ depending on the AS, it
nonetheless points out the existence of dynamic IP addresses that could lead to over-estimating the
number of unique peers in an AS.
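The heuristic can be sketched as follows; the tuple format and the dynamic_ip_fraction helper are illustrative, and this simple rule over-counts slightly if one IP matches several (URN, index) pairs:

```python
from collections import defaultdict

def dynamic_ip_fraction(replies):
    """Estimate the fraction of dynamic IPs in a reply trace.

    replies -- iterable of (ip, urn, local_index) tuples. If several distinct
    IPs report the same (URN, local index) pair, the extras are treated as
    dynamic re-assignments of the same underlying host.
    """
    ips_per_key = defaultdict(set)
    all_ips = set()
    for ip, urn, idx in replies:
        ips_per_key[(urn, idx)].add(ip)
        all_ips.add(ip)
    # Every IP beyond the first for a (URN, index) pair is flagged as dynamic.
    dynamic = sum(len(ips) - 1 for ips in ips_per_key.values() if len(ips) > 1)
    return dynamic / len(all_ips) if all_ips else 0.0

# Two IPs reporting the same object at the same disk location: one is flagged.
replies = [("1.1.1.1", "urnA", 7), ("2.2.2.2", "urnA", 7), ("3.3.3.3", "urnB", 1)]
print(dynamic_ip_fraction(replies))  # 1 of 3 IPs flagged as dynamic
```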
We notice that a large number of hosts in an AS results in a small value of the plateau parameter q,
and a large amount of uncacheable traffic (i.e., objects with only one replica). The larger the number
of hosts, the higher the chance of popular objects being downloaded more often, which results in a less
flattened head observed in the log-log plot of the popularity distribution. This is equivalent to a small
value of q. A small q value has a positive effect on caching, since storing the popular objects yields a
higher hit rate. This can be seen in ASes 9548, 223 and 2461, where the number of hosts is in the order
of hundreds of thousands, and the value of q is five or less; see Table II.
On the downside, a large number of hosts means a higher ratio of uncacheable traffic. This could be
Fig. 4. Hit rate loss under Mandelbrot-Zipf popularity distribution. [Figure: hit rate (%) versus relative cache size (%) on a log scale, for Zipf(1), MZipf(1, 25), and MZipf(1, 60).]
explained by the fact that a larger number of hosts means a more diverse user community, which means
a higher one-timer population. A high ratio of uncacheable traffic affects caching negatively, because
the maximum byte hit rate that could be achieved is bounded above by the amount of cacheable traffic.
However, the amount of cacheable traffic is not an indication of a higher byte hit rate under finite-size
caches. This is because the cache performance depends only on the most popular objects, assuming
that the cache stores them. Cacheable traffic, however, refers to anything that was downloaded more
than once; an extreme example is when all objects are downloaded just twice. Thus, the skewness of the
popularity distribution is the deciding factor in cache performance. ASes with small values of q and large
values of α would benefit the most from deploying caches. This means that the theoretical maximum
limits on the cacheability of P2P traffic from previous studies [1], [8] do not necessarily imply a good
cache performance.
C. The Effect of Mandelbrot-Zipf Popularity on Caching
In the previous section we showed that P2P popularity could be modeled using the Mandelbrot-Zipf
distribution. In this section, we analyze the impact of this popularity model on the cache hit rate.
Let us assume that only the C most popular objects are stored in the cache, that is, an LFU policy is
used. We also assume that all objects have the same size. Then, the hit rate HR is given by:
Fig. 5. Hit rate loss under Mandelbrot-Zipf popularity distribution with LRU. [Figure: hit rate versus relative cache size for Zipf(1.0), MZipf(0.5, 50), and MZipf(0.5, 40).]
HR = ∑_{i=1}^{C} p(i) = ∑_{i=1}^{C} H / (i + q)^α ≈ ∫_{1}^{C} H / (i + q)^α di = H ln[(q + C)/(q + 1)],    (2)

where the closed form of the integral is shown for α = 1.
For a Zipf-like popularity distribution (q = 0), HR = H ln C. Figure 4 plots the hit rate under both
popularity distributions as a function of the cache size. A similar hit rate reduction has been observed for
LRU, as shown in Figure 5. Both figures show that the hit rate under Mandelbrot-Zipf suffers severely,
especially under large values of q and small cache sizes. This is because large values of q mean that
popular objects receive fewer requests, and thus storing them does not yield a high hit rate. Under small
cache sizes, the hit rate suffers more because the fraction of objects that can be stored in the cache
receives only a small portion of the total traffic. This indicates that caching schemes that capitalize on
object popularity alone may not yield high hit rates and may not be very effective in reducing the ever-
growing P2P traffic.
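The hit-rate degradation predicted by Eq. (2) can also be checked numerically by evaluating the sum exactly; the lfu_hit_rate helper and the parameter values below are illustrative, not measured:

```python
def lfu_hit_rate(N, alpha, q, cache_fraction):
    """Hit rate when the C most popular of N equal-sized objects are cached
    (LFU), under MZipf(alpha, q) popularity: the exact sum in Eq. (2)."""
    C = max(1, int(cache_fraction * N))
    weights = [1.0 / (i + q) ** alpha for i in range(1, N + 1)]
    H = 1.0 / sum(weights)  # normalization constant of Eq. (1)
    return H * sum(weights[:C])

# A cache holding the top 1% of 100,000 objects: the hit rate drops steadily
# as the plateau factor q grows, even though the cached fraction is the same.
N = 100000
for q in (0, 25, 60):  # q = 0 is plain Zipf
    print(q, round(lfu_hit_rate(N, 1.0, q, 0.01), 3))
```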
D. Popularity Dynamics
Since P2P objects are mostly immutable, users need not download them more than once. Once
downloaded, popular objects will receive fewer and fewer requests while other new popular objects are
Fig. 6. Popularity degradation in P2P systems. [Figure: number of requests per week over 12 weeks for (a) the top five, (b) the top ten, and (c) the (11-20)th most popular objects.]
introduced in the system. To quantify this change in popularity, we study popularity dynamics
at two levels: across all ASes (globally) and per AS.
For global popularity dynamics we pick the top 20 most requested objects during the first month of our
measurement and see how their popularity changes over the entire measurement period. We look at the
average number of requests received by objects ranked from one to five, from one to ten and from ten to
twenty. We observe that the popularity grows as the file is introduced to reach its peak, and then degrades
steadily in a matter of weeks after its peak. We cannot make any claims as to when those popular files
were introduced, but the pattern of increase then decrease indicates that those top 20 objects are new
popular objects in the system. Figure 6 shows how the popularity degrades for the top five, top ten and
next top ten most popular objects. As evident from the figure, popularity in P2P systems is dynamic, and
changes over time. Similar observations about object lifetime in other P2P systems have been made in
[1], [17]. Figure 9 summarizes the percentages of requests received by the top most popular objects.
To study the popularity dynamics per AS, we pick the top ten most requested objects seen in the
first month of our measurement and see how their popularity changes over time in each AS. We do the
experiment for the top eight ASes: the top four ASes, in terms of the number of replies they sent, and
for another four random ASes. As Figures 7 and 8 show, the number of requests received by the top ten
most popular objects decreases over time for the top four ASes and the random ASes, respectively. Thus
popularity of P2P objects is dynamic and changes over time.
Such popularity dynamics may have negative effects on caching, since cached objects tend to grow
stale after they have built a good popularity profile. Thus, caches that capitalize on access frequency
might cache old popular objects unnecessarily. However, we believe that the introduction of new popular
Fig. 7. Popularity dynamics of the top five most popular objects in the top four ASes. [Figure: number of requests per week over 12 weeks for (a) AS14832, (b) AS18538, (c) AS2161, and (d) AS95.]
objects negates the effect of caching old popular objects. That is, the introduction of new popular objects
competing for cache space will result in the eviction of older popular objects eventually.
E. Measuring and Modeling Object Size
Many traditional cache replacement algorithms use object size in their decision to evict objects [3].
Other algorithms, such as Greedy-Dual Size [18], incorporate size with other characteristics to maximize
cache performance. Thus, understanding how object size is distributed is critical to developing effective
caching schemes.
Fig. 8. Popularity dynamics of the top five most popular objects in four randomly-chosen ASes. [Figure: number of requests per week over 12 weeks for (a) AS9406, (b) AS1782, (c) AS17318, and (d) AS1310.]
To develop a model for P2P object sizes, we collect information about objects in two popular P2P file-
sharing systems: Gnutella and BitTorrent. In BitTorrent, there are a number of web servers that maintain
metadata (.torrent) files about files in the network. These servers are known as torrent sites. We develop
a script which contacts four popular torrent sites and downloads random .torrent files. We downloaded
100 thousand .torrent files, 50% of which are unique. Each .torrent file contains information about the
shared object, such as the object size, the number of segments in the object and the size of each segment.
For the Gnutella system, we randomly pick 500 thousand unique objects from the traces we collected
during our measurement study and analyze their sizes.
Fig. 9. Popularity degradation of the top five most popular objects in the top twenty ASes. [Figure: percentage of requests (%) per week over 12 weeks.]
The histograms of the size distribution in both BitTorrent and Gnutella, Figures 10(a) and 11(a), show
that there exist several peaks, where each peak corresponds to a certain workload. It is clear that it
is hard to model the entire size range using a single distribution. Therefore, we divide the whole size
range (approximately 0 MB - 25,000 MB) into multiple intervals. The choice of such intervals is not
random, but governed by the existence of different types of content. For example, a peak around 700
MB corresponds to most shared CD movies; another peak around 250 MB corresponds to high-quality
TV video series and some software objects; while a peak around a few MB corresponds to video clips
and audio clips.
Using the Chi-Square goodness-of-fit test, we show that the size distribution of each content type (i.e.,
workload) can be modeled as a Beta distribution. We also show that the size distribution of objects
observed per AS for the Gnutella traffic we collected is also multimodal and contains multiple workloads,
which could be modeled using the Beta distribution. A Beta distribution is defined as follows: a random
variable X is said to be beta distributed with parameters α and β (both positive) in the interval [A, B]
if the probability density function (pdf) of X is

P(x) = [1/(B − A)] · [Γ(α + β)/(Γ(α)Γ(β))] · ((x − A)/(B − A))^{α−1} · ((B − x)/(B − A))^{β−1}  for A ≤ x ≤ B,
and P(x) = 0 otherwise.    (3)
When A = 0 and B = 1, the above distribution is called the standard Beta distribution, with pdf

P(x) = [Γ(α + β)/(Γ(α)Γ(β))] · x^{α−1} (1 − x)^{β−1},
Fig. 10. Object size distribution in BitTorrent. [Figure: (a) histogram of object sizes (frequency versus size in MB); (b)-(f) CDFs of normalized size per workload, fitted as (0-349) MB: Beta(0.62, 1.52); (350-700) MB: Beta(0.48, 0.32); (701-1399) MB: Beta(0.59, 0.55); (1400-4500) MB: Beta(0.38, 0.53); (4500-25000) MB: Beta(0.61, 2.11).]
18
where Γ(α) is the gamma function, defined for α > 0 as:

Γ(α) = ∫_0^∞ x^(α−1) e^(−x) dx.
Figure 10 shows the CDF (Cumulative Distribution Function) of object sizes in BitTorrent fitted against
the CDF of the standard Beta distribution. Similarly, Figure 11 shows that the Beta distribution reasonably
models the object size distribution in Gnutella. We notice that objects in BitTorrent are larger than
in Gnutella: while objects smaller than 100 MB constitute approximately 90% of objects in Gnutella,
they constitute only 36% in BitTorrent. Figure 12 shows the distribution of objects into workloads in
Gnutella and BitTorrent. The fact that objects in BitTorrent are usually large video objects could be
explained as follows. BitTorrent offers higher availability due to its tit-for-tat mechanism, under which
each peer contributes by uploading to other peers, while a peer in Gnutella can choose to free-ride
without running the risk of getting slower download rates. Thus peers in Gnutella may not be as willing
to share content, which means objects are less available. BitTorrent, on the other hand, gives each peer
a download rate proportional to its upload rate, which punishes free-riders and thus achieves higher
availability. Higher availability means fewer aborted downloads, and thus faster object retrieval, which
in turn promotes sharing of large objects.
We should keep in mind that, even though most of the downloaded objects observed by our measurement
were small, the majority of the traffic was contributed by large objects. To study this, we pick the top
ten ASes and look at the fractions of objects with size less than 100 MB and larger than 100 MB. For
each of these two categories, we count the number of objects downloaded in the respective AS and the
sum of their sizes, which corresponds to the traffic they contributed. As Figure 13 shows, objects larger
than 100 MB account for only about 10% of the total number of downloaded objects, but they are
responsible for about 90% of the traffic. This shows that large object size is a very important factor
to consider for any caching policy that strives to reduce P2P traffic on the Internet backbone.
Knowing the size distribution of P2P objects and the fraction of requests for each workload, we can estimate
the required cache size (Size) to achieve a certain byte hit rate (BHR):
Size = BHR · Σ_{i=1}^{n} λ_i · μ_i,

where λ_i is the fraction of requests received by workload i and μ_i is the mean of the respective size
distribution in workload i (which is Beta distributed), defined as

μ = A + (B − A) · α / (α + β).
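The cache-size estimate above can be sketched in a few lines. This is our own illustration, not code from the paper; the workload tuples passed in (request fraction λ_i plus Beta parameters) are the caller's assumptions, e.g. taken from fits such as those in Figure 10:

```python
def beta_mean(alpha, beta, A=0.0, B=1.0):
    # Mean of a Beta(alpha, beta) distribution on [A, B]: A + (B - A) * alpha / (alpha + beta)
    return A + (B - A) * alpha / (alpha + beta)

def required_cache_size(bhr, workloads):
    # workloads: iterable of (lambda_i, alpha, beta, A, B) tuples, one per workload.
    # Returns Size = BHR * sum(lambda_i * mu_i), in the same units as A and B (e.g. MB).
    return bhr * sum(lam * beta_mean(a, b, A, B) for lam, a, b, A, B in workloads)
```

For example, a single uniform-like workload Beta(1, 1) on 0-100 MB with a target byte hit rate of 30% would require roughly 0.3 × 50 = 15 MB per request on average.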
Fig. 11. Object size distribution in Gnutella. (a) Histogram of object sizes; (b)-(e) CDFs of normalized size fitted by Beta distributions: (b) 0-349 MB, Beta(0.5, 1.42); (c) 350-700 MB, Beta(0.63, 0.44); (d) 700-1400 MB, Beta(0.29, 0.57); (e) >1400 MB, Beta(0.34, 1.26).
Fig. 12. Object size comparison under BitTorrent and Gnutella: percentage of objects in the size ranges [0,10], [10,100], [100,700], [700,1400], and >1400 MB.
We discuss the relationship between byte hit rate and cache size further in the evaluation section.
F. Summary of the measurement study
We designed our measurement study to analyze the P2P traffic that would be observed by individual
P2P caches deployed in different ASes. We found that popular P2P objects are not as highly popular as
their web counterparts. The popularity distribution of P2P objects has a flattened head at the lowest ranks,
and therefore modeling this popularity as a Zipf-like distribution yields a significant error. We also
found that a generalized form of the Zipf distribution, called Mandelbrot-Zipf, captures this flattened
head and is therefore a better model for the popularity of P2P objects. Furthermore, we found that the
Mandelbrot-Zipf popularity has a negative effect on hit rates of caches that use the common LRU and
LFU policies. In addition, our study revealed the dynamic nature of object popularity in P2P systems,
which should be considered by P2P caching algorithms. Finally, we found that objects in P2P systems
have much larger sizes than web objects, and they can be classified into several categories based on
content types. Each category can be modeled by a Beta distribution.
IV. P2P CACHING ALGORITHM
With the understanding of P2P traffic we developed, we design and evaluate a novel P2P caching
scheme based on segmentation and partial caching. Partial caching is necessary because of large object
sizes. Unlike traditional memory systems, where a page must be loaded into memory before it is used,
objects in web or P2P systems do not have to be cached on every cache miss. In particular, objects in
P2P systems are large, and even the most popular objects may not receive many requests. Thus storing
an entire object upon request may waste cache space. P2P caching should take a conservative approach
towards admitting new objects into the cache to reduce the cost of storing unpopular objects.

Fig. 13. Number of downloads for objects with size larger than 100 MB, and the number of bytes they contributed to the total
traffic, measured for the top 10 ASes. The figure shows that even though large objects constitute only 10% of the total
number of downloads, they are responsible for about 90% of the traffic.
The basic idea of our algorithm is to cache, for each object, a portion proportional to its popularity.
That way, popular objects are incrementally given more cache space than unpopular objects. To achieve
this goal, our algorithm caches a constant portion of an object when it first sees that object. As the object
receives more hits, its cached size increases, and the rate at which the cached size increases grows with
the number of hits the object receives. To facilitate admission and eviction of objects, our policy divides
objects into segments; the smallest granularity that our algorithm caches or evicts is one segment.
We use different segment sizes for different workloads, because using the same segment size for all
workloads may favor some objects over others and introduce additional cache management overhead.
For example, if we use a small segment size for all workloads, large objects will have a large number
of segments. This introduces a high overhead for managing the many segments within each object;
such overhead includes locating which segments of an object are cached and deciding which segments
to evict. On the other hand, using a large segment size for all workloads favors smaller objects, since
they have fewer segments and get cached more quickly. Making segment size relative to object size has
the same two problems, because the range of P2P object sizes extends to a few gigabytes. Thus relative
segmentation, as a percentage of object size, may result in large objects having segments that are orders
of magnitude larger than those of unsegmented smaller objects.
Consistent with our observations of multiple workloads, we use four different segment sizes. For
objects smaller than 10 MB, we use a segment size of 512 KB; for objects between 10 MB and
100 MB, a segment size of 1 MB; and for objects between 100 MB and 800 MB, a segment size of
2 MB. Objects whose size exceeds 800 MB are divided into segments of 10 MB. We call this the
variable segmentation scheme. Note that our segmentation procedure is in line with most segmentation
schemes used by real P2P systems. Our analysis of more than 100,000 torrent files shows that the vast
majority of BitTorrent objects are segmented into 256 KB, 512 KB, 1 MB, or 2 MB segments. E-Donkey
uses segments of size 9.28 MB [19]. Depending on the client implementation, segment size in Gnutella
can either be a percentage of object size, as in BearShare [20], or fixed at a few hundred KB [12].
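The variable segmentation rule above is a simple size-to-bucket mapping; a direct sketch of it (helper name and MB units are our own choices):

```python
def segment_size_mb(object_size_mb):
    """Variable segmentation scheme from Section IV: segment size, in MB, by object size."""
    if object_size_mb < 10:
        return 0.5   # 512 KB segments for objects under 10 MB
    if object_size_mb < 100:
        return 1     # 1 MB segments for 10-100 MB objects
    if object_size_mb <= 800:
        return 2     # 2 MB segments for 100-800 MB objects
    return 10        # 10 MB segments for objects over 800 MB
```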
For each workload, the cache keeps the average object size in that workload. Denote the average
object size in workload w as μ_w. Further, let γ_i be the number of bytes served from object i normalized
by its cached size; that is, γ_i can be considered the number of times each byte of the object has been
served from the cache. The cache ranks objects by their γ value, such that for objects ranked
from 1 to n, γ_1 ≥ γ_2 ≥ γ_3 ≥ · · · ≥ γ_n. From this point on, we refer to the object at rank i simply
as object i. When an object is seen for the first time, only one segment of it is stored in the cache. If
a request arrives for an object of which at least one segment is cached, the cache computes the
number of segments to add to this object's cached segments as (γ_i/γ_1) · μ_w, where μ_w is the mean
object size of the workload w to which object i belongs. Notice that this is only the number of segments the
cache could store of object i. Since downloads can be aborted at any point during the session, the
number of segments actually cached upon a request, denoted by k, is given by

k = min[missed, max(1, (γ_i/γ_1) · μ_w)],  (4)

where missed is the number of requested segments not in the cache. This means that the cache stops
storing uncached segments if the client fails or aborts the download, and that the cache stores at least
one segment of an object.
The pseudo-code of our algorithm appears in Figure 14. At a high level, the algorithm works as follows.
The cache intercepts client requests and extracts the object ID and the requested range. If no segment
of the requested range is in the cache, the cache stores at least one segment of the requested range. If the
entire requested range is cached, the cache serves it to the client. If the requested range is partially
available, the cache serves the cached segments to the client and decides how many of the missing
segments to cache using Eq. (4). In all cases, the cache updates the average object size of the workload
to which the object belongs and the γ value of the requested object.
request(object i, requested range)
1.  if object i is not in the cache
2.      add one segment of i to the cache, evicting if necessary
3.  else
4.      hit = cached range ∩ requested range
5.      γ_i += hit / cached size of i
6.      missed = (requested range − hit) / segment size
7.      k = min[missed, max(1, (γ_i/γ_1) · μ_w)]
8.      if the cache does not have space for k segments
9.          evict segments from the least valued object
10.     add k segments of object i to the cache
11. return
Fig. 14. P2P caching algorithm.
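The admission logic of Figure 14 can be prototyped compactly. The sketch below is ours, not the paper's implementation, and it makes two simplifying assumptions: γ_1 is taken as the current maximum γ value, and eviction removes the entire least-valued object rather than individual segments; all quantities are in segments:

```python
class P2PCache:
    """Minimal sketch of the partial-admission policy of Fig. 14 (simplified eviction)."""

    def __init__(self, capacity_segments):
        self.capacity = capacity_segments
        self.used = 0
        self.cached = {}   # object id -> number of cached segments
        self.gamma = {}    # object id -> bytes served normalized by cached size

    def request(self, obj, requested_segments, mean_segments_w):
        # mean_segments_w: mean object size of obj's workload, expressed in segments
        if obj not in self.cached:
            self._admit(obj, 1)                      # first sight: cache one segment
            return
        hit = min(self.cached[obj], requested_segments)
        self.gamma[obj] += hit / self.cached[obj]    # update normalized bytes served
        missed = requested_segments - hit
        gamma_top = max(self.gamma.values())         # stands in for gamma_1
        k = min(missed, max(1, int(self.gamma[obj] / gamma_top * mean_segments_w)))
        self._admit(obj, k)                          # Eq. (4)

    def _admit(self, obj, k):
        # Evict least-valued objects (whole objects, a simplification) until k fits.
        while self.used + k > self.capacity and self.cached:
            victim = min(self.cached, key=lambda o: self.gamma[o])
            if victim == obj:
                break
            self.used -= self.cached.pop(victim)
            del self.gamma[victim]
        if self.used + k <= self.capacity:
            self.cached[obj] = self.cached.get(obj, 0) + k
            self.gamma.setdefault(obj, 0.0)
            self.used += k
```

A popular object thus accumulates cache space only as it keeps receiving hits, while a one-timer never holds more than one segment.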
The algorithm uses a priority queue data structure to store objects according to their γ_i values. When
performing eviction, the required segments are removed from the least valued objects. Currently, our
algorithm evicts contiguous segments from the least valued objects without favoring any segments over
others, because P2P downloads are likely to start from anywhere in the file [9]. It remains a future
goal of ours to study in-object request patterns and improve our algorithm accordingly. Our algorithm
performs O(log N) comparisons on every hit or miss, where N is the number of objects in the
cache. Since objects are large, this is a reasonable cost, considering the small number of objects the
cache will contain.
V. EVALUATION
In this section, we use trace-driven simulation to study the performance of our P2P caching algorithm,
and compare it against common web caching algorithms (LRU, LFU, and GDS), a recent caching
algorithm proposed for P2P systems, and the offline optimal policy.
A. Experimental Setup
We use traces obtained from our measurement study to conduct our experiments. Our objective is to
study the effectiveness of deploying caches in several ASes. Thus, we collect information about objects
found in a certain AS and measure the byte hit rate that could be achieved if a cache were to be deployed
in that AS. We use the byte hit rate as the performance metric because we are mainly interested in reducing
the WAN traffic.
We run several AS traces through the cache and compare the byte hit rate achieved using several
caching algorithms. In addition to our algorithm, we implement the Least Recently Used (LRU), Least
Frequently Used (LFU), Greedy-Dual Size (GDS) [18], and Least Sent Bytes (LSB) [9] algorithms, as
well as the off-line optimal (OPT) algorithm. LRU capitalizes on the temporal correlation in requests,
and thus replaces the oldest object in the cache. LFU sorts objects by access frequency and evicts the
object with the lowest frequency. GDS maintains a characteristic value H_i for each object, where
H_i = cost_i/size_i + L. The object with the smallest H_i value is evicted next. The cost parameter
could be set to maximize hit rate, minimize latency, or minimize miss rate; we set it to 1 + size/536,
which is the estimated number of packets sent and received upon a miss. LSB is designed for P2P traffic
caching and uses the number of transmitted bytes of an object as a sorting key; objects that have
transmitted the least bytes are evicted next. OPT looks at the entire stream of requests off-line and
caches the objects that will serve the most bytes from the cache.
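As a rough illustration of the GDS value computation and its L-inflation rule (this is our own sketch under the cost setting above, not the evaluated implementation; the function name is hypothetical):

```python
def gds_simulate(requests, capacity):
    """Sketch of Greedy-Dual Size: H_i = L + cost_i / size_i with cost = 1 + size/536.
    requests: iterable of (object_id, size) pairs; returns the set of cached object ids."""
    L = 0.0
    H, cache, used = {}, {}, 0
    for oid, size in requests:
        cost = 1 + size / 536          # estimated packets sent/received on a miss
        if oid in cache:               # hit: refresh H relative to the current L
            H[oid] = L + cost / size
            continue
        while used + size > capacity and cache:
            victim = min(cache, key=lambda o: H[o])
            L = H[victim]              # inflate L to the evicted object's H value
            used -= cache.pop(victim)
            del H[victim]
        if used + size <= capacity:
            cache[oid] = size
            H[oid] = L + cost / size
            used += size
    return set(cache)
```

Because L only grows, recently admitted objects start with higher H values than long-idle ones, which is how GDS mixes recency with cost and size.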
We measure the byte hit rate under several scenarios. First, we consider the case when objects requested
by peers are downloaded entirely, that is, there are no aborted transactions. Then, we consider the
case when downloading peers prematurely terminate downloads during the session, which is not
uncommon in P2P systems. We also run our experiments under the independent reference model, where
entries of the trace file are randomly shuffled to eliminate the effects of temporal correlation and popularity
degradation; we do this to isolate and study the effect of object popularity on caching. Finally, we run
our experiments on the original trace with preserved temporal correlation between requests to study the
combined effect of popularity and popularity degradation.
In all experiments, we use ASes with different traffic characteristics, which are representative of P2P
traffic observed in other ASes. We concentrate on the top ASes that sent the most replies, to
ensure that we have enough traffic from an AS to evaluate the effectiveness of deploying a cache for it.
Caches usually strive to store objects that have a higher likelihood of being requested in the near
future. This likelihood is called temporal locality of requests [21], [22]. Temporal
locality has two components: popularity and temporal correlations. The popularity of an object is the number
of requests it receives relative to other objects. Temporal correlation of requests means that requests to
the same object tend to be clustered, or correlated, in time. To illustrate the difference, consider
the following two sequences:

. . . ABCAEGFAXYNAACEABA . . .

in which A receives more requests than any other object; thus A is more popular than the other objects.
Consider a second sequence:

. . . ABBBACCCEAFBBAB . . .

in which requests to objects B and C are correlated and tend to appear close to each other. Temporal correlation
is usually measured by the number of intervening requests between two consecutive requests to the same
object. A shorter sequence of intervening requests implies higher temporal correlation.
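The intervening-request measure just described is easy to compute over a request stream; the following helper is our own illustration (not from the paper):

```python
def mean_intervening_requests(stream, obj):
    """Mean number of intervening requests between consecutive requests to obj.
    Returns None if obj is requested fewer than twice."""
    positions = [i for i, o in enumerate(stream) if o == obj]
    if len(positions) < 2:
        return None
    gaps = [b - a - 1 for a, b in zip(positions, positions[1:])]
    return sum(gaps) / len(gaps)
```

On the second example sequence above, requests to B are tightly clustered, so B's mean intervening-request count is low, indicating high temporal correlation.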
The temporal locality of web request streams has been widely studied in the literature (see [21]–[23] and
references therein). However, we believe that temporal correlation is not as critical to P2P caching as it
is to web caching. Web objects are small, which means that the cache can accommodate
a long sequence of requests, during which temporal locality will be evident. But in P2P caching, objects
are orders of magnitude larger than their web counterparts, which means that the sequence of P2P objects
that can be stored in the cache is orders of magnitude shorter than a sequence of web objects.
B. Caching under Full Downloads
One of the main advantages of our caching policy is that it effectively minimizes the impact of
unpopular traffic by admitting objects incrementally. Traditional policies usually cache an entire object
upon a miss, evicting other, perhaps more popular, objects if necessary. This may waste cache space
by storing unpopular objects. To investigate the impact of partial versus full caching, we first assume
a no-failure scenario where peers download objects entirely. We look at all objects found in an AS,
line them up in a randomly shuffled stream of requests, and run them through the cache. We vary the
cache size between 0 and 1000 GB.

Figure 15 shows the byte hit rate for four representative ASes with different characteristics. These ASes
have different maximum achievable byte hit rates, defined as the fraction of traffic downloaded
more than once (i.e., cacheable traffic) over the total amount of traffic. As shown in the figure, our policy
outperforms other policies by a wide margin. For instance, in AS397 (Figure 15(a)) with a cache of 600
GB, our policy achieves a byte hit rate of 24%, which is almost double the rate achieved by the LRU,
LFU, GDS, and LSB policies. Moreover, the byte hit rate achieved by our algorithm is about 3% less
than that of the optimal algorithm. Our trace shows that the amount of traffic seen in AS397 is around
24.9 terabytes. This means that a reasonable cache of size 600 GB would have served about 6 terabytes
locally using our algorithm.

Fig. 15. Byte hit rate for different caching algorithms, with no aborted downloads: (a) AS397, 48% of traffic is cacheable; (b) AS95, 54% cacheable; (c) AS1161, 16% cacheable; (d) AS14832, 95% cacheable.
The reason traditional policies perform poorly for P2P traffic is the effect of unpopular objects. For
example, one-timer objects are stored entirely under traditional policies upon a miss. Under our policy,
however, only one segment of each one-timer finds its way into the cache, thus minimizing their effect.

Fig. 16. Relative byte hit rate improvement using our algorithm over the LRU, LFU, and LSB algorithms in the top 10 ASes. No aborted downloads.
As can be seen from the figures, our policy consistently performs better than traditional policies. Figure
16 summarizes the relative improvement in byte hit rate that our policy achieves over LRU, LFU, and
LSB for the top ten ASes. The relative improvement is computed as the difference between the byte
hit rate achieved by our policy and that achieved by another policy, normalized by the byte hit rate of
the other policy; for instance, the improvement over LRU is (P2P − LRU)/LRU. The figure shows that
a relative improvement of at least 40% and up to 180% can be achieved by our algorithm. That is a
significant gain given the large volume of P2P traffic. We notice that the relative gain our policy achieves
is larger in ASes with a substantial fraction of one-timers.
As a final comment on Figure 15, consider the absolute byte hit rate achieved under our algorithm
as well as the optimal byte hit rate in AS14832 and AS397. Notice that although the percentage of
cacheable traffic in AS397 is less than that of AS14832, the absolute byte hit rate is higher in AS397.
This is because popular objects in AS14832 do not get as many requests as their counterparts in AS397;
that is, the popularity distribution of AS14832 has a more flattened head than that of AS397. We
computed the skewness factor α and the plateau factor q of the Mandelbrot-Zipf distribution that best fits
the popularity distributions of these two ASes. We found that AS397 has α = 0.55 and q = 25, while
AS14832 has α = 0.6 and q = 55. Thus smaller q values mean less flattened heads and yield higher
byte hit rates (Section V-D elaborates on the impact of α and q on the byte hit rate). This also
means that the amount of cacheable traffic in an AS is not, by itself, a good indication of the achievable
byte hit rate.
Next, we study the effect of temporal correlations on P2P caching. As mentioned previously, temporal
correlations imply that an object just requested has a higher likelihood of being requested in the near
future, and also that the requests an object receives tend to be correlated and decrease over time. Our
traces exhibit a dynamic request behavior where requests to popular objects decrease over time. To
quantify this effect, we use unshuffled traces from our measurement. Figure 18 shows the byte hit rate
under AS397 and AS95 with unshuffled traces, which preserve temporal correlations and feature a
popularity degradation of the most popular objects, as we showed in Section III-D. As the figure shows,
LRU and GDS perform slightly better than LSB and LFU because they make use of the temporal
correlation between requests. However, the byte hit rates achieved by LRU, LFU, GDS, and LSB are
still low compared to that achieved by our algorithm. Note that the absolute byte hit rate under our
algorithm is slightly smaller than in the previous experiment, where we used the independent reference
model, because our algorithm does not consider object aging. However, this reduction is small, less than
3%. The fact that the performance of our algorithm does not suffer much under popularity degradation
and still outperforms other algorithms (e.g., LRU) could be explained as follows. We believe that object
size is the dominant factor in caching for P2P systems, because the maximum cache size we used (1000
GB) is still small compared to the total size of objects, less than 10% in most cases. Also, new popular
objects are usually introduced into the system, evicting old popular objects and minimizing the negative
effect of object staleness. We conclude that the object admission strategy is a key element in determining
the byte hit rate, which our algorithm capitalizes on. Similar results were obtained for the remaining
ASes.
Since we are not able to control the degree of correlation (i.e., the inter-request distance) in our traces,
we modify a synthetic trace generation tool to obtain such traces. We modify the ProWGen traffic
generation tool [24] to use the Mandelbrot-Zipf distribution and to draw object sizes from our traces.
ProWGen uses the LRU Stack Model (LRUSM) to generate temporal correlations between requests.
LRUSM uses a stack of depth n to keep an ordered list of the last n requested objects, such that the
objects in the stack have a higher probability of being accessed again than they would if they were not
in the stack. That is, objects in the stack have the access probability given by Mandelbrot-Zipf plus a
small fraction proportional to their position in the stack. Details of how ProWGen implements the
LRUSM are provided in [24]. To study the sensitivity of our algorithm to temporal correlations, we vary
the depth of the LRU stack between 0 and 500, where higher values of stack depth correspond to higher
temporal correlation.
Fig. 17. Sensitivity of the P2P caching algorithm to temporal correlations: byte hit rate versus cache size under LRUSM(500), LRUSM(100), and IRM.
As Figure 17 shows, the byte hit rate improves slightly under high degrees of temporal correlation.
However, the improvement is still very small compared to the byte hit rate achieved under the
Independent Reference Model (IRM). The fact that our policy suffered slightly under real traces but
improved under synthetic traces can be explained as follows. In our traces, requests to popular objects
were clustered together such that their popularity dies over time. In the synthetic traces, however, the
temporal correlation of objects has no popularity bias whereby popular objects are requested in a certain
region of the sequence and then grow stale. But in both cases, our policy is insensitive to the temporal
correlations of P2P requests. As we explained in Section III-D, the introduction of new popular objects
and the relatively small cache size compared to object sizes mitigate the effect of popularity dynamics.
Thus, object size is the dominant factor in P2P caches, and our policy capitalizes on it.
C. Caching with Aborted Downloads
Due to the nature of P2P systems, peers could fail during a download or abort a download. We run
experiments to study the effect of aborted downloads on caching, and how robust our caching policy is
to them. Following observations from [1], we allow 66% of downloads to abort anywhere in the download
session. While our policy is designed to deal with aborted downloads, web replacement policies usually
download the entire object upon a miss, and at times perform pre-fetching of popular objects. This is
reasonable in the web, since web objects are usually small and thus take less cache space.
But in P2P systems, objects are larger, and partial downloads constitute a significant fraction of the
total number of downloads.

Fig. 18. Byte hit rate for different caching algorithms using unshuffled traces (popularity degradation): (a) AS95, 54% of traffic is cacheable; (b) AS397, 48% cacheable.

Figure 19 compares the byte hit rate with aborted downloads. Compared to the
scenario of caching under full downloads (Section V-B), the performance of our algorithm improves
slightly, while the performance of the other algorithms declines. The improvement in our policy is
explained by the fact that fewer bytes are missed in case of a failure. The performance of LRU, LFU,
GDS, and LSB declines because they store an object entirely upon a miss, regardless of how much of it
the client actually downloads. Hence, under the aborted-download scenario, the byte hit rates of
traditional policies suffer even more than under the full-download scenario. Similar results were obtained
for the other top ten ASes. Our policy consistently outperforms other policies with a significant
improvement margin in all top ten ASes. Figure 21 shows that the relative improvement in byte hit rate
over LRU, LFU, and LSB is at least 50% and up to 200%.
D. Effect of α and q on P2P Caching

As we mention in Section III, P2P traffic can be modeled by a Mandelbrot-Zipf distribution with a
skewness factor α between 0.4 and 0.65 and a plateau factor q between 10 and 70. In this section, we
study the effect of α and q on the byte hit rate of our algorithm via simulation, since the trace-driven
simulation presented above may not capture the performance of our algorithm for all possible values
of α and q. We randomly pick 100,000 objects from our traces and generate their frequencies using
Mandelbrot-Zipf with various values of α and q. We fix the cache size at 1,000 GB and assume a
no-failure model where peers download objects entirely.
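The frequency-generation step can be sketched as follows, using the Mandelbrot-Zipf form p(i) ∝ 1/(i + q)^α for rank i. This is our own illustration of the setup, not the simulator's code, and the helper name is hypothetical:

```python
def mandelbrot_zipf_frequencies(n, alpha, q, total_requests):
    """Assign request counts to n objects ranked 1..n following Mandelbrot-Zipf:
    p(i) proportional to 1 / (i + q)**alpha. Larger q flattens the head."""
    weights = [1.0 / (i + q) ** alpha for i in range(1, n + 1)]
    norm = sum(weights)
    return [round(total_requests * w / norm) for w in weights]
```

Comparing the head of the distribution for two q values makes the plateau effect concrete: with the same α, a larger q gives the top-ranked object a visibly smaller share of the requests.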
Fig. 19. Byte hit rate for different caching algorithms using traces with aborted downloads: (a) AS95; (b) AS397; (c) AS1161; (d) AS14832.
To study the effect of different α values, we fix q and vary α between 0.4 and 1. As shown in
Figure 20(a), the byte hit rate increases as the skewness factor α increases. This is intuitive, since higher
values of α mean that objects at the lowest ranks are more popular, and caching them yields a higher
byte hit rate. Thus ASes whose popularity distributions are characterized by high α values would benefit
more from caching than those with low α values.
Another parameter that determines the achievable byte hit rate is the plateau factor q of the popularity
distribution; q reflects the flattened head we observed in Section III. Figure 20(b) shows that the byte
hit rate decreases as q increases. The reason is that a higher value of q means popular objects are
requested less often. This has a negative impact on a cache that stores those popular objects, because
they receive fewer downloads, resulting in a decreased byte hit rate.

Fig. 20. The impact of the skewness factor α (a) and the plateau factor q (b) on the byte hit rate of our algorithm.
E. Effect of Segmentation on P2P Caching
Our algorithm uses a variable segmentation scheme whereby objects belonging to the same workload
have the same segment size. To investigate the effect of segmentation, we repeat the experiment with
aborted downloads under AS397. We first evaluate the byte hit rate when all objects have the same
segment size (1 MB, 50 MB, or 100 MB) regardless of their size. Then we evaluate the variable
segmentation scheme described in Section IV. As can be seen from Figure 22, the byte hit rate degrades
as the segment size increases. This is because the cache evicts and admits whole segments only, and
when a download aborts in the middle of a segment, the rest of that segment is unnecessarily stored.
Similarly, under large segment sizes, we may admit more unpopular objects, since admitting one segment
of a one-timer amounts to admitting a substantial part of the object. The figure also shows that our
variable segmentation scheme achieves a higher byte hit rate than uniform 1 MB segmentation. We
believe the minor computational overhead added by our segmentation scheme is justified by this
performance gain.
Fig. 21. Relative byte hit rate improvement using our algorithm over the LRU, LFU, and LSB algorithms in the top 10 ASes, with aborted downloads.
VI. CONCLUSION AND RELATED WORK
In this paper, we performed a three-month measurement studyon the Gnutella P2P system. Using
real-world traces, we studied and modeled characteristicsof P2P traffic that are relevant to caching.
We found that the popularity distribution of P2P objects cannot be captured accurately using Zipf-like
distributions. We proposed a new Mandelbrot-Zipf model forP2P popularity and showed that it closely
models object popularity in several AS domains. We also showed the existence of multiple workloads in
P2P systems, and used Beta distribution to model the size distribution of each workload. Our measurement
study indicated that (i) the Mandelbrot-Zipf popularity distribution degrades the hit rates of caches that
use LRU, LFU, or GDS policies, and (ii) object admission strategies in P2P caching are as critical to
cache performance as object replacement strategies.
We designed and evaluated a new P2P caching algorithm that is based on segmentation and incremental
admission of objects according to their popularity. Using trace-driven simulations, we showed that our
algorithm outperforms traditional algorithms by a significant margin and achieves a byte hit rate that is
up to triple the byte hit rate achieved by other algorithms. We also showed that temporal correlation of
requests does not significantly impact P2P caching because object size is the dominant factor. Finally,
because of aborted downloads, a caching policy whose objective is to maximize byte hit rate should not
pre-fetch objects; this is the strategy our algorithm adopts.
We are currently implementing our algorithm in the Squid [25] proxy cache and testing it using our
[Figure 22 plot: byte hit rate (0–35%) versus cache size (0–1000 GB), with curves for variable segmentation and uniform 1 MB, 50 MB, and 100 MB segment sizes.]
Fig. 22. The effect of segmentation on byte hit rate under the P2P algorithm.
traces. In the future, we intend to investigate preferential segment eviction, whereby segments of the
same object carry different values.
REFERENCES
[1] K. Gummadi, R. Dunn, S. Saroiu, S. Gribble, H. Levy, and J. Zahorjan, “Measurement, modeling, and analysis of a peer-
to-peer file-sharing workload,” in Proc. of the 19th ACM Symposium on Operating Systems Principles (SOSP’03), Bolton
Landing, NY, USA, Oct. 2003.
[2] T. Karagiannis, A. Broido, N. Brownlee, K. C. Claffy, and M. Faloutsos, “Is P2P dying or just hiding?” in Proc. of IEEE
Global Telecommunications Conference (GLOBECOM 2004), Dallas, TX, USA, Nov. 2004.
[3] S. Podlipnig and L. Böszörményi, “A survey of web cache replacement strategies,” ACM Computing Surveys, vol. 35, no. 4,
pp. 347–398, Dec. 2003.
[4] J. Liu and J. Xu, “Proxy caching for media streaming over the Internet,” IEEE Communications Magazine, vol. 42, no. 8,
pp. 88–94, Aug. 2004.
[5] L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker, “Web caching and Zipf-like distributions: Evidence and implications,”
in Proc. of INFOCOM’99, New York, NY, USA, Mar. 1999.
[6] S. Sen and J. Wang, “Analyzing peer-to-peer traffic across large networks,” IEEE/ACM Transactions on Networking, vol. 12,
no. 2, pp. 219–232, Apr. 2004.
[7] A. Klemm, C. Lindemann, M. K. Vernon, and O. P. Waldhorst, “Characterizing the query behavior in peer-to-peer file
sharing systems,” in Proc. of ACM/SIGCOMM Internet Measurement Conference (IMC’04), Taormina, Sicily, Italy, Oct.
2004.
[8] N. Leibowitz, A. Bergman, R. Ben-Shaul, and A. Shavit, “Are file swapping networks cacheable? Characterizing P2P
traffic,” in Proc. of WWW Caching Workshop, Boulder, CO, USA, Aug. 2002.
[9] A. Wierzbicki, N. Leibowitz, M. Ripeanu, and R. Wozniak, “Cache replacement policies revisited: The case of P2P traffic,”
in Proc. of the 4th International Workshop on Global and Peer-to-Peer Computing (GP2P’04), Chicago, IL, USA, Apr.
2004.
[10] S. Sen, O. Spatscheck, and D. Wang, “Accurate, scalable in-network identification of P2P traffic using application
signatures,” in Proc. of the 13th International World Wide Web Conference (WWW’04), New York City, NY, USA, May
2004.
[11] T. Karagiannis, M. Faloutsos, and K. Claffy, “Transport layer identification of P2P traffic,” in Proc. of ACM/SIGCOMM
Internet Measurement Conference (IMC’04), Taormina, Sicily, Italy, Oct. 2004.
[12] “Gnutella home page,” http://www.gnutella.com.
[13] “Limewire Home Page,” http://www.limewire.com/.
[14] “Network systems lab homepage,” http://nsl.cs.surrey.sfu.ca.
[15] “GeoIP Database Home Page,” http://www.maxmind.com.
[16] Z. Silagadze, “Citations and the Zipf-Mandelbrot’s law,” Complex Systems, vol. 11, pp. 487–499, 1997.
[17] J. Pouwelse, P. Garbacki, D. Epema, and H. Sips, “The Bittorrent P2P file-sharing system: Measurements and analysis,”
in Proc. of the 4th International Workshop on Peer-To-Peer Systems, Ithaca, NY, USA, Feb. 2005.
[18] P. Cao and S. Irani, “Cost-aware WWW proxy caching algorithms,” in Proc. of USENIX Symposium on Internet Technologies
and Systems, Monterey, CA, USA, Dec 1997.
[19] “Emule Project Home Page,” http://www.emule-project.net/.
[20] “BearShare Home Page,” http://www.bearshare.com/.
[21] S. Jin and A. Bestavros, “Sources and characteristics of web temporal locality,” in Proc. of the 8th International Symposium
on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), Aug. 2000, pp. 28–35.
[22] K. Psounis, A. Zhu, B. Prabhakar, and R. Motwani, “Modeling correlations in web traces and implications for designing
replacement policies,”Journal of Computer Networks, vol. 45, no. 4, pp. 379–398, Jul 2004.
[23] R. Fonseca, V. Almeida, M. Crovella, and B. Abrahao, “On the intrinsic locality properties of web reference streams,”
in Proc. of IEEE INFOCOM’03, San Francisco, CA, USA, Apr. 2003.
[24] M. Busari and C. Williamson, “ProWGen: a synthetic workload generation tool for simulation evaluation of web proxy
caches,” Journal of Computer Networks, vol. 38, no. 6, pp. 779–794, Apr. 2002.
[25] “Squid Web Proxy Cache,” www.squid-cache.org/.