traffic patterns in peer-to-peer-networking christian...
TRANSCRIPT
Traffic Patterns in Peer-to-Peer-Networking
Christian Schindelhauer
joint work with Amir Alsbih
Thomas Janson
Albert-Ludwig University FreiburgDepartment of Computer Science
Computer Networks and Telematics
to be presented at ITA
Global Internet Traffic Shares1993-2004
Source: CacheLogic 2005
FTP
Peer-to-Peer
Web
1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004
CacheLogic Research Trends of Internet Protocols 1993-2004
Share
of In
tern
et tr
affi
c
0
10
20
30
40
50
60
70
2
P2P Share Germany 2007
Quelle: Ipoque 2007
3
What Germans Download 2007by Volume
4Quelle: Ipoque 2007
P2P Systems Germany 2007by Volume
5Quelle: Ipoque 2007
Motivation
Peer-to-Peer-Networks- Quality depends on user behavior- High churn rates - Egoistic users
Only a small number of independent studies of Internet traffic
We analyze the complete traffic of 20,000 users in August 2009 of a German digital cable TV based Internet provider.- Traffic was centrally monitored - Type classification by deep packet inspection- Looked at BitTorrent traffic
6
Network Monitoring
Network monitoring systems- installed in recent
years- allow to monitor the
behavior of each user.
Motivation- new governmental
regulations - detection and
prevention of • Internet fraud
• denial-of-service attacks
• spam mailers• phishing attacks, • criminal conspiracies, • forbidden contents • copyright violations.
ISPs are (usually) not the juridical target - are required to uphold
an infrastructure, which allows law enforcement to take action in such cases.
7
Background of the Study
Our study- limited access to anonymized
user data - gathered by a network
monitoring systems using deep packet inspection (DPI)
Main product of „our“ ISP: digital cable TV- thousands of German
households- byproduct they also offer
telephone and Internet service
German households are connected via DSL- rural area the bandwidth is
rather low
- urban areas high data rates
Mobile phone carriers providing GPRS, EDGE and HSDPA gain in traffic.
Digital TV cable is a stable market- Installation of the necessary
infrastructure is expensive- Television is still important
media of Germans- No open market for digital
cable TV. - Cable TV users extend their
contracts to include Internet service because of low prices and high bandwidth rates.
8
Internet over Digital TV Cable
Each user needs a digital cable modem- encodes and decodes the data traffic
Throughput rates range from 32-100 MBit/s download - DSL traffic: 2 to 16 kBit/s- HSDPA ending at 7.2 MBit/s
No network bottleneck - measured traffic behavior directly reflects the users wishes.
Ideal opportunity- What do Internet users want? - How long are users online? - How much data do users download or upload? - What are the network services they use?
9
BitTorrent and „Friends“
BitTorrent- most successful peer-to-peer network protocol- BitTorrent encourages to upload data using incentives
Several BitTorrent clients deviate from the original protocol - BitTyrant
• achieves a download gain up to 70 percent • strategic selection of peers in the swarm
- BitThief • free riding client• allows downloading without any upload • achieve higher download rates than the official client.
10
Bittorrent
Bram Cohen Bittorrent is a real (very successful) peer-to-peer network
- concentrates on download- uses (implicitly) multicast trees for the distribution of the parts of a file
Protocol is peer oriented and not data oriented Goals
- efficient download of a file using the uploads of all participating peers- efficient usage of upload
• usually upload is the bottleneck• e.g. asymmetric protocols like ISDN or DSL
- fairness among peers• seeders against leeches
- usage of several sources
11
Bittorrent: Coordination
Central coordination - by tracker host- for each file the tracker outputs a set of random peers from the set of
participating peers• in addition hash-code of the file contents and other control information
- tracker hosts to not store files• yet, providing a tracker file on a tracker host can have legal consequences
- Is often replaced with a decentralized peer-to-peer network
File- is partitions in smaller pieces
• as describec in tracker file- every participating peer can redistribute downloaded parts as soon as he
received it- Bittorrent aims at the Split-Stream idea
Interaction between the peers- two peers exchange their information about existing parts- according to the policy of Bittorrent outstanding parts are transmitted to the other
peer
12
BittorrentPart Selection
Problem- The Coupon-Collector-Problem is the reason for a uneven distribution of parts
• if a completely random choice is used
Measures- Rarest First
• Every peer tries to download the parts which are rarest- density is deduced from the comunication with other peers (or tracker host)
• in case the source is not available this increases the chances the peers can complete the download
- Random First (exception for new peers)• When peer starts it asks for a random part• Then the demand for seldom peers is reduced
- especially when peers only shortly join
- Endgame Mode• if nearly all parts have been loaded the downloading peers asks more connected
peers for the missing parts• then a slow peer can not stall the last download
13
BittorrentPolicy
Goal- self organizing system- good (uploading, seeding) peers are rewarded- bad (downloading, leeching) peers are penalized
Reward- good download speed- un-choking
Penalty- Choking of the bandwidth
Evaluation- Every peers Peers evaluates his environment from his
past experiences
14
BittorrentChoking
Every peer has a choke list- requests of choked peers are not served for some time- peers can be unchoked after some time
Adding to the choke list- Each peer has a fixed minimum amount of choked peers (e.g. 4)- Peers with the worst upload are added to the choke list
• and replace better peers Optimistic Unchoking
- Arbitrarily a candidate is removed from the list of choking candidates• the prevents maltreating a peer with a bad bandwidth
15
Deep Packet Inspection
Internet Service Provider- deep packet inspection system for analyzing the type of traffic
Using heuristics- Analyze the first few packets to identify a protocol
• Assumption further data exchange over the connection (IP socket) belongs to the same protocol.
- Only protocol headers of the first few packet are inspected- Applications without encryption can be identified this way
Encrypted protocols- can only identified by version numbers and other unencrypted
information- Up to 20 packets have to be inspected- User data cannot be processed
16
Traffic Types
17
Mea
n Ho
st T
raffi
c [k
b/s]
0.0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
HTTP in
HTTP out
BitTorre
nt in
BitTorre
nt ou
t
NNTP in
NNTP out
eDon
key in
eDon
key o
utSSL
in
SSL ou
t
SHOUTcast
in
SHOUTcast
out
RTMP in
RTMP out
FTP trans
fer in
FTP trans
fer o
ut
Micros
oft BITS in
Micros
oft BITS o
ut
Gnutel
la in
Gnutel
la ou
t
Rest i
n
Rest o
ut
3.25
0.12
0.680.52
0.34
0.010.210.24 0.21
0.03 0.140.02 0.11 0 0.060.01 0.05 0 0.04 0
0.390.21
HTTP 44.4 %
BitTorrent 24.1 %
NNTP 14.2 %SHOUTcast 6.4 %
RTMP 5 %
eDonkey 4 %RTSP 1.2 %Skype 0.8 %
HTTP 14.6 %BitTorrent 64.3 %
NNTP 0.7 %SHOUTcast 0.7 %RTMP 0.4 %
eDonkey 16.3 %
RTSP 0.1 %Skype 3 %
Internet Traffic of a Geman ISPAugust 2009
18
Download
Upload
Our Data Source
After 15 minutes the DPI systems- reports the number of incoming and outgoing bytes for each
protocol for each user.- rollected in log files. - We have received the data without IP addresses
• replaced by anonymized IDs integer
For each interval of 15 minutes over a month - we know for each anonymized user the number of open
connections- the incoming and outgoing overall traffic- the incoming and outgoing unencrypted BitTorrent traffic- the sum of HTTP traffic of all users.
We have received the sum of overall traffic in this month for each host for each service type.
19
0
5000
10000
15000
20000
25000
30000
08 01 08 08 08 15 08 22 08 29 0
20
40
60
80
100
120
140
#hos
ts
perc
enta
ge
date
#hosts complete#hosts having any traffic
#hosts using BitTorrent complete#hosts not using BitTorrent
#hosts using BitTorrent
Overvie of all Traffic in August 2009
20
0 2 4 6 8
10 12 14 16
08 01 08 08 08 15 08 22 08 29
traffi
c [k
b/s]
date
complete incomplete outBitTorrent in
BitTorrent out
Overall Traffic Devolopent in August averaged over days 2009
21
0 2 4 6 8
10 12 14 16
08 01 08 08 08 15 08 22 08 29
traffi
c [k
b/s]
date
complete incomplete outBitTorrent in
BitTorrent out
Measured Traffic
22
Shortcomings of Data Set
Identification of each user by the IPv4 address is not completely reliable No reconnection every 24 hours
- unlike other ISPs- IPv4 address of a network user remains the same until the modem is rebooted
Possible reasons for a modem reboot are- hardware reset- disconnecting of the modem - power outage.
Error types- user occurs under several IP addresses
• leads to an overestimation of users. - different user might reuse a free IP address
• ISP assured us that IPv4 addresses are rarely reused
Look at the intervals when an IP address is used and count the number of such simultaneous time intervals.
- This number gives us a lower bound of the number of distinct users
23
0 1000 2000 3000 4000 5000 6000 7000 8000 9000
08 01 08 08 08 15 08 22 08 29 09 05
#Hos
ts m
inim
um o
verla
p
Date
maxmeasured
random
Overlapping TechniqueBitTorrent Traffic only
24
0
5000
10000
15000
20000
25000
08 01 08 08 08 15 08 22 08 29 09 05
#Hos
ts m
inim
um o
verla
p
Date
maxmeasured
random
Overlap with Internet Traffic
25
0
5000
10000
15000
20000
25000
08 01 08 08 08 15 08 22 08 29
#hos
ts
date
#hosts with trafficmin. overlapping complete traffic
#hosts using BitTorrentmin overlapping BitTorrent traffic
Comparison
26
BitTorrent Traffic
Scatterplots for Up/Download Traffic BitTorrent and other traffic not related Remember: correlation coefficient
27
0
5
10
15
20
0 5 10 15 20
mea
n up
load
[kb/
s]
mean download [kb/s]
Scatterplott - BitTorrent Traffic
28
correlation coefficient: 0.58
0
0.1
0.2
0.3
0.4
0.5
0 0.1 0.2 0.3 0.4 0.5
mea
n up
load
[kb/
s]
mean download [kb/s]
Scatterplott - BitTorrent Traffic (Zoom in)
29
0
10
20
30
40
50
0 10 20 30 40 50
dow
nloa
d Bi
tTor
rent
[kb/
s]
download rest [kb/s]
mean download host traffic50% BitTorrent traffic
BitTorrent Traffic versus Other Traffic Downstream
30
correlation coefficient: −0.38.
0
10
20
30
40
50
0 10 20 30 40 50
uplo
ad B
itTor
rent
[kb/
s]
upload rest [kb/s]
mean upload host traffic50% BitTorrent traffic
Upload BitTorrent versus Other Traffic Upload
31
correlation coefficient −0.53
2 82 62 42 22022242628
210
0 2000 4000 6000 8000
>= s
hare
ratio
(up/
dow
n) [l
og]
#user
share ratioshare ratio 1:1
Upload / DownloadDistribution
32
2 82 62 42 22022242628
210
1 4 16 64 256 1024 4096
>= s
hare
ratio
(up/
dow
n) [l
og]
#user [log]
share ratioshare ratio 1:1
Upload / DownloadDistribution (Log-Log-Plot)
33
−300 −200 −100 0 100
020
040
0
Share−Ratio (Upload−Download) to Traffic (Upload+Download)
Upload − Download [Kb/Sec]
Up.
+ D
own.
[Kb/
Sec]
Host TrafficAvg. Host + Imbalance Traffic
Traffic Difference versus Traffic Sum
34
35
0
10
20
30
40
50
0 10 20 30 40 50
dow
nloa
d Bi
tTor
rent
[kb/
s]
download rest [kb/s]
mean download host traffic50% BitTorrent traffic
Figure 8: Scatterplot of(rest, BitTorrent) download pointsfor all hosts.
0
10
20
30
40
50
0 10 20 30 40 50
uplo
ad B
itTor
rent
[kb/
s]
upload rest [kb/s]
mean upload host traffic50% BitTorrent traffic
Figure 9: Scatterplot of(rest, BitTorrent) upload pointsfor all hosts.
1 2 5 10 20 50
1e−0
45e−0
31e−0
1
share−difference [kb/s] [log]
prob
abilit
y [lo
g]
share−dif. up−downshare−dif. Up−Down approx.share−dif. down−upshare−dif. down−up approx.
Figure 10: Probability distribution of share difference (upload− download).
to the same extent.From the scatterplot we have already seen that there is no sharp distribution for
BitTorrent traffic. In fact, if one considers the difference of download and uploadtraffic one observers a piecewise power law (Pareto) distribution 10.
Pd−u [share-difference x] ≈�
0.68 · (x + 1.1)−2.04 for x ≥ 0, (σ = 0.0008)4.33 · (3.07− x)−2.33 for x < 0 (σ = 0.0006)
To our knowledge we are the first to observe a heavy-tail distribution in the Bit-Torrent traffic difference. An explanation may be the power law distribution of theoverall BitTorrent upload and download, see Fig. 11 and the power law distributionof the continuous online period, see Fig. 12.
The online period can be very well described by a piecewise defined power law.The interval bounds are at 16h and 24h, which reflect the users’ decisions whether
9
From scatterplot: no sharp distribution for BitTorrent traffic.
Difference of download and upload traffic is a piecewise power law (Pareto) distribution
Explanation: maybe the power law distribution of the overall BitTorrent upload and download?
Share Difference of Torrent Traffic
1 2 5 10 20 50
1e−0
45e−0
31e−0
1
share−difference [kb/s] [log]
prob
abilit
y [lo
g]
share−dif. up−downshare−dif. Up−Down approx.share−dif. down−upshare−dif. down−up approx.
Share Difference of Torrent Traffic
36
2 42 22022242628
210
20 22 24 26 28 210 212
>= tr
affic
[kb/
s]
#user
download all timedownload when online
upload all timeupload when online
Overall Traffic Distribution
37
Periodicity
Obviously there is a perodicity in the data New idea:
- Look at Fourier Transformation- And normalized by frequency to receive the „energy“
level- and verify with averaged plots
38
0
500
1000
1500
2000
2500
3000
3500
4000
08 01 08 08 08 15 08 22 08 29 09 05
Traf
fic [B
ytes
per
Sec
ond]
Date
Incoming TrafficOutgoing Traffic
Average Traffic per Host
39
0 50
100 150 200 250 300
0 24 48 72 96 120 144 168 192 216 240
ener
gy [M
B/ho
ur]
period [hours]
Incoming TrafficOutgoing Traffic
Fourier Transformation of Traffic Pattern
40
0 0.5
1 1.5
2 2.5
3 3.5
4
0 2 4 6 8 10 12 14 16 18 20 22
traffi
c [k
b/s]
daytime [hours]
incoming trafficoutgoing traffic
Traffic Average per Hour of the Day
41
0 0.5
1 1.5
2 2.5
3 3.5
4
0 1 2 3 4 5 6 7 8 9 10 11
traffi
c [k
b/s]
half daytime [hours]
incoming trafficoutgoing traffic
Is there a 12h Periodicity?
42
0
2
4
6
8
10
Sat Sun Mon Tue Wed Thu Fri Sat
Traf
fic [K
iloby
tes/
Seco
nd]
Time [Hours]
Incoming TrafficOutgoing Traffic
Traffic Averaged per Weekday
43
How Long are Users Online
Online period- continuous amount of time when users are using
HTTP
Online times- sum of periods over a day/week/month
44
2 82 62 42 22022242628
210
0 2000 4000 6000 8000
>= s
hare
ratio
(up/
dow
n) [l
og]
#user
share ratioshare ratio 1:1
Figure 13: Cumulative share ratio distri-
bution (upload/download) in linear-log
scale.
2 8
2 6
2 4
2 2
20
22
24
26
0 2000 4000 6000 8000 10000
>= o
nlin
e tim
e [h
ours
] [lo
g]
#user
24 hours limitaverage daily online time [hours]
Figure 14: Cumulative daily online time
distribution in linear-log scale.
to let the hosts run over night.
P [online period t] ≈
0.18 · t−0.82for t ≥ 16, (σ = 0.013)
2782 · t−4.40for 16 < t ≤ 24, (σ = 0.00006)
11 · t−2.58for t > 24 (σ = 0.000015)
If one considers the ratio of upload and download (instead of the difference)
one observes an super-exponential decline and decrease on the left and right bound-
aries, which shows to a high concentration around the optimum ratio of 1. We could
not find a reasonable approximation of this functions, while we observe a similar
function of the cumulative daily online time, i.e. the number of hours a users is
producing any traffic.
We are also interested in describing the periodicity of the user behavior. While
the 24h frequency is obvious when one takes a look at Figures 2 and 3 one might
assume that also longer frequencies like a one week frequency (168h) or possibly
smaller frequency might influence user behavior. To approach this problem with-
out any personal bias we analyze the incoming and outgoing traffic of all users
using a Fourier analysis, see Fig. 15. For this we compute the absolute value of
the complex Fourier coefficient for each frequency f and divide this value by the
squared frequency. This method is well-known in the analysis of the radio waves
to measure the energy of each frequency. We observe a strong peak at 24h and
a smaller one at 12h. While there is a local maximum at 168h its size is rather
slow. To verify whether the 24h and 12h frequency actually describe some peri-
odic behavior we overlay all traffic in the12h period in Fig. 16 and 24h period in
Fig. 17. Obviously the 12h term does not show much periodicity while the 24h
period roughly resembles a sinus curve. So, the 12 hour period is a harmonic of
11
!!
!!
!!
!!
! !! ! ! ! !!
!!!!!!
!!!!!
!!!!!!!!!!!
!!
!!
!!!
!!
!!!!!
!!
!
!!!!!
!
!!!!
!
!
!!!!!!!!
!!
!
!!
!
!!
!!
!
!
!
!
!
!!!
!
!
!
!
!!!
!
!
!
!
!!!!!
!
!!
!
!
!
!
!
!!
!!
!!
!
!!
!!
!!
!!
!
!
!
!!!!!!!!
!
!!
!
!
!
!
!!!
1 2 5 10 20 50 100
5e−0
51e−0
35e−0
2Probability Distribution Online Period
online period [hours][log]
prob
abilit
y [lo
g]
16 24
probability for online period length [in hours]approximated function
cases of piecewise definition
Distribution of Online Periods
45
2 8
2 6
2 4
2 2
20
22
24
26
0 2000 4000 6000 8000 10000
>= o
nlin
e tim
e [h
ours
] [lo
g]
#user
24 hours limitaverage daily online time [hours]
Online Times per Day (Lin-Log-Plot)
46
2 8
2 6
2 4
2 2
20
22
24
26
1 4 16 64 256 1024 4096
>= o
nlin
e tim
e [h
ours
] [lo
g]
#user [log]
24 hours limitaverage daily online time [hours]
Online Times (log log Plot)
47
48
Conclusions
Analysis of web traffic of 21,766 hosts of an Internet service provider (ISP) in Germany
Emphasis BitTorrent traffic August in 2009 50% used BitTorrent At most 40% of BitTorrent users online at the
same time Many users participate in this peer-to-peer
network only for some short time periods Most Internet traffic is HTTP