Statistical Identification of Encrypted Web-Browsing Traffic
Qixiang Sun, Stanford University
Daniel R. Simon, Yi-Min Wang, Wilf Russell, Venkata N. Padmanabhan, Lili Qiu
Microsoft Research
Outline
• Motivation & Problem
• Intuition
• Hypothetical Attacker
• Attacker’s Success Rate
• Countermeasures
• Conclusion
Anonymous Web Browsing
• Protect personal information from an attacker’s inference
  – Medical (online support group)
  – Questionable activities
• Question: Is this REALLY anonymous?
[Figure: browser traffic relayed through a chain of routers R1, R2, R3, R4]
What’s Different?
In anonymous Web browsing:
– The chain of routers is used for both sending and receiving data
    Can link HTTP requests and responses!
– The target Web pages are publicly accessible
    Responses are known!
Implication: The first link/router is an exploitable weakness.
What Information is Available?
[Figure: HTTP GET requests and responses exchanged between the browser and the first router, followed by the router chain R1, R2, R3, R4]
• Number of objects
• Object sizes
• Ordering of the objects
• Delay between packets
Intuition
• Number of objects and object sizes are sufficient to identify a Web page!
  – On average, a Web page has 11 objects, with each object yielding 8.4 bits of information:
    $8.4 \times 11 - \log_2(11!) \approx 67$ bits, i.e. about $10^{20}$ possibilities!
  – Currently, there are about $10^{9}$ Web pages
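The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the constants come from the slide, while the function name `page_entropy_bits` is made up for illustration. Subtracting log2(n!) discounts the ordering of the objects, since the pattern is treated as an unordered collection of sizes.

```python
import math

AVG_OBJECTS = 11        # average number of objects per page (from the slide)
BITS_PER_OBJECT = 8.4   # information per object size in bits (from the slide)

def page_entropy_bits(n_objects: int, bits_per_object: float) -> float:
    """Bits carried by an unordered collection of n object sizes.

    Hypothetical helper: n * bits counts ordered sequences, so log2(n!)
    is subtracted to ignore the order in which objects arrive.
    """
    ordered_bits = n_objects * bits_per_object
    ordering_bits = math.log2(math.factorial(n_objects))
    return ordered_bits - ordering_bits

bits = page_entropy_bits(AVG_OBJECTS, BITS_PER_OBJECT)
print(f"{bits:.1f} bits, roughly {2 ** bits:.1e} distinct patterns")
```

The result lands near 67 bits, far more than the roughly 30 bits needed to tell apart 10^9 pages.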
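A Hypothetical Attacker
[Figure: attacker architecture, with two paths]
• Database-building path: List of target sensitive-site URLs, Programmatic access to URL & traffic recording, Traffic pattern construction & database update, Traffic pattern database (with history)
• Identification path: Browser, R1, Traffic recording & pattern construction, Traffic pattern, Similarity scores calculation (against the traffic pattern database), Decision module, Positive / Negative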
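Guts of the Pattern Matching
• Given two multisets of object sizes S1 and S2:
    $\mathrm{Sim}(S_1, S_2) = \dfrac{|S_1 \cap S_2|}{|S_1 \cup S_2|}$
• Decision module uses an absolute threshold.
[Figure: the observed traffic pattern and the traffic pattern database feed the similarity scores calculation, which feeds the decision module]
For example: S1 = {3KB, 3KB, 5KB}, S2 = {3KB, 5KB, 5KB}
    $\mathrm{Sim}(S_1, S_2) = \dfrac{|\{3\text{KB}, 5\text{KB}\}|}{|\{3\text{KB}, 3\text{KB}, 5\text{KB}, 5\text{KB}\}|} = \dfrac{2}{4} = 0.5$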
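The similarity score can be sketched in Python with `collections.Counter`, whose `&` and `|` operators take per-element minimum and maximum counts, i.e. multiset intersection and union. The function name `similarity` and the use of plain integers for sizes in KB are choices made for this illustration, not something the slides specify.

```python
from collections import Counter

def similarity(s1, s2):
    """Multiset Jaccard similarity: |S1 ∩ S2| / |S1 ∪ S2|."""
    c1, c2 = Counter(s1), Counter(s2)
    intersection = sum((c1 & c2).values())  # per-size minimum of the counts
    union = sum((c1 | c2).values())         # per-size maximum of the counts
    # Two empty patterns share nothing observable; return 0.0 by convention.
    return intersection / union if union else 0.0

# Slide example, sizes in KB.
s1 = [3, 3, 5]
s2 = [3, 5, 5]
print(similarity(s1, s2))  # 0.5
```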
Experiment Setup
• Approximately 100,000 Web pages in total (URLs obtained from the Open Directory Project).
• The hypothetical attacker chooses about 2200 pages as target pages.
• Goal: Can these 2200 pages be identified without causing many false positives?
What is a Success and Failure?
• Successful Identification:
  – A target page passes the similarity threshold and is not confused with other pages in the target set.
• False Positive:
  – A non-target page is incorrectly identified as one of the target pages.
• Potential False Positive:
  – A page passes the similarity threshold when compared with a single selected target page.
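These definitions can be made concrete with a small sketch. It assumes the multiset similarity score defined on the pattern-matching slide and treats "passes the threshold" as score >= threshold, which is an assumption; the names `matches` and `potential_false_positives` and the toy patterns are invented for illustration.

```python
from collections import Counter

def similarity(s1, s2):
    """Multiset Jaccard similarity between two collections of object sizes."""
    c1, c2 = Counter(s1), Counter(s2)
    union = sum((c1 | c2).values())
    return sum((c1 & c2).values()) / union if union else 0.0

def matches(observed, targets, threshold=0.5):
    """Names of target patterns that the observed pattern passes (score >= threshold)."""
    return [name for name, pattern in targets.items()
            if similarity(observed, pattern) >= threshold]

def potential_false_positives(target_pattern, other_pages, threshold=0.5):
    """Number of other pages that pass the threshold against one target page."""
    return sum(1 for page in other_pages
               if similarity(target_pattern, page) >= threshold)

# Toy data: object sizes in KB (hypothetical).
targets = {"page_a": [3, 3, 5], "page_b": [10, 20, 40, 40]}
non_targets = [[3, 5, 5], [7, 8, 9], [10, 20, 40]]

hits = matches([3, 3, 5], targets)
# Identification succeeds only if exactly the right target passes.
print(hits == ["page_a"])
print(potential_false_positives(targets["page_a"], non_targets))
```

A target page with K potential false positives is what the later slides call K-identifiable; 0-identifiable pages are the ones no other page is confused with.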
Attacker’s Success Rate
• A threshold of 0.5 is sufficient.
[Figure: x-axis: Absolute Threshold (0.2 to 0.9); y-axis: % of Pages (0 to 90). Series: Identification rate (2191 target pages); Actual false-positive rate (98496 non-target pages). Annotated values: 80.4% and 2.1%.]
Is this small enough?
A Detailed Look Inside
• False-positives are NOT generated uniformly!
[Figure: x-axis: # of Potential False Positives (0 to 1200); y-axis: % of Target Pages (70 to 100). Annotations: 0-identifiable pages; HTTP 404s; Common-looking pages.]
Dynamism in Web Pages
• Most pages are relatively static
  – One-day-old pattern database is sufficient
[Figure: x-axis: Self Similarity Score (0 to 1); y-axis: % of Target Pages (0 to 100).]
Countermeasures
• Padding
  – Individual objects
  – Add random-sized objects
• Morphing
  – Pipelining the HTTP GET requests
  – Pre-fetching
• Mimicking
  – Common templates or Web-hosting services
Padding Object Size
• Linear – Nearest multiple of padding size
• Exponential – Nearest power of 2
[Figure: x-axis: Minimum Object Size (128 to 16384); y-axis: % of 0-identifiable pages (0 to 60). Series: Linear Padding; Exponential Padding.]
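The two padding schemes can be sketched as follows. The slides say "nearest"; this sketch assumes sizes are rounded up, since padding can only add bytes, and it assumes the exponential scheme starts from a minimum size that is itself a power of 2. The function names and sample sizes are illustrative.

```python
def pad_linear(size: int, bucket: int) -> int:
    """Round size up to the next multiple of bucket (assumed rounding direction)."""
    return -(-size // bucket) * bucket  # ceiling division, then scale back

def pad_exponential(size: int, minimum: int = 128) -> int:
    """Round size up to minimum * 2**k for the smallest k that fits.

    Yields a power of 2 when minimum is a power of 2 (an assumption here).
    """
    padded = minimum
    while padded < size:
        padded *= 2
    return padded

for size in (100, 1300, 5000):  # object sizes in bytes (hypothetical)
    print(size, pad_linear(size, 512), pad_exponential(size, 128))
```

After padding, many distinct object sizes collapse onto the same value, which is what reduces the information available to the similarity score.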
Padding Random Objects
[Figure: x-axis: Absolute Threshold (0.3 to 0.7); y-axis: % of 0-Identifiable Pages (0 to 45). Series: Multiple of 10.]
Two-chunk Pipelining
• Approximately 36% of the target pages are 0-identifiable.
– Very close to the theoretical limit of 1/e (assuming traffic patterns are random)
• Implication: Can harness the total entropy in the Web page traffic patterns.
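The slides do not derive the 1/e limit. One plausible reading, offered here only as an assumption, is that each of N pages independently draws its traffic pattern uniformly from N equally likely patterns; the probability that no other page shares a given page's pattern is then:

```latex
\[
  \Pr[\text{page is 0-identifiable}]
  = \left(1 - \frac{1}{N}\right)^{N-1}
  \;\xrightarrow{\;N \to \infty\;}\; e^{-1} \approx 0.368
\]
```

Under that model, the observed figure of about 36% sits just below the limit.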
One-chunk Pipelining
[Figure: x-axis: K (Number of Potential False Positives) (0 to 12); y-axis: % of K-identifiable Pages (0 to 12).]
Conclusion
• Encrypted Web browsing can be identified by the target page’s “unique” traffic pattern.
[Figure: Linear Padding. x-axis: Padding Bucket Size; y-axis: % of Identifiable Sites (0 to 70). Series: 0-identifiable; 1-identifiable; 2-identifiable.]
[Figure: Exponential Padding. x-axis: Minimum Padding Size (128 to 16384); y-axis: % of Identifiable Sites (0 to 40). Series: 0-identifiable; 1-identifiable; 2-identifiable.]
Pad Random Objects
[Figure: x-axis: Absolute Threshold (0.3 to 0.7); y-axis: % of Identifiable Sites (0 to 45). Series: Multiple of 10; Multiple of 15; Multiple of 20.]