david choffnes, northeastern university jingjing ren, northeastern university ashwin rao, university...
DESCRIPTION
How Frequently Is PII Leaked? 3 Basic tracking is common Significant fraction of very personal information leaked across all platforms PII leakage is pervasive! Fraction of top 100 apps leaking PII (Tested in September, 2015)TRANSCRIPT
David Choffnes, Northeastern University Jingjing Ren, Northeastern UniversityAshwin Rao, University of HelsinkiMartina Lindorfer, Vienna Univ. of TechnologyArnaud Legout, INRIA Sophia-Antipolis
ReCon: Revealing and Controling PII Leaks in Mobile Network Systems
DTL Workshop, Nov. 2015
Sponsored by:
2
Motivation Mobile devices
Rich sensors Ubiquitous connectivity
Key questions What personal information is transmitted? To whom does it go? What can average users do about it?
usernameemail address
home addressgender
real name
phone numberpassword...MAC address
Advertiser ID
3
How Frequently Is PII Leaked?
Basic tracking is common
Significant fraction of very personal information leaked across all platforms
PII leakage is pervasive!
Frac
tion
of to
p 10
0 ap
ps le
akin
g PI
I
(Tested in September, 2015)
4
How to Detect PII Leaks in Mobile? At the OS
Information flow analysis (static/dynamic/hybrid) Ok solutions, but not perfect or easily deployable
In the network Independent of OS, app store Easy to detect if you know what PII to search for
What if you don’t know the PII a priori?
5
ReCon: Automatically Identifying PII Leaks
Hypothesis: PII leaks have distinguishing characteristics Is it just simple key/value pairs (e.g., “user=R3C0N”)?
Nope, this leads to high FP/FN rates Need to learn the structure of PII leaks
Approach: Build ML classifiers to reliably detect leaks Does not require knowing PII in advance Resilient to changes in PII leak formats over time
We built ReCon Machine learning to reveal PII leaks from mobile devices Software middleboxes to intercept and control leaks Works on all major platforms (iOS, Android, Windows Phone)
6
ReCon: Viewing detected leaks
PII Category Device Identifiers Contact Information User Identifiers Credentials
User Feedback Correct Incorrect Not sure Not about me
7
Where They Know You’ve Been
Location information is hard todigest using text alone
WTKYB shows just how pervasivelocation tracking is
Creepiness factor to help userscare more about privacy(?)
8
Mitigating PII Leaks ReCon gives users control over leaks
Example simple strategies Block PII Modify PII Randomize identifiers Coarsen locations
Advanced mitigation (under dev) Mock user profiles Provide k-anonymity
9
How does ReCon work? Key challenges for ML-based PII detection
Which classifier do we use? C4.5 Decision Tree is best trade-off between speed and
accuracy
How do we train the classifier? Use traces from real users and controlled experiments Break flows into separate words that may indicate a leak Feature selection for scalability
How well are we doing? Controlled experiments In the wild: Only the users themselves know for sure!
Crowdsourced reinforcement
10
Key Results: ReCon accuracy How accurate is ReCon?
99% overall accuracy from controlled experiments
FPR: 2.2%, FNR: 3.5% Why?
Per-domain classifiers Decision tree captures non-trivial cases
11
Key Results: ReCon Has Good Coverage
How does it compare to other solutions?
Device Identifier User Identifier Contact Info Location0.00%
10.00%
20.00%
30.00%
40.00%
50.00%
60.00%
70.00%
80.00%
90.00%
100.00%FlowDroid(Static IFA) Andrubis (Dynamic IFA) AppAudit(Hybrid IFA) ReCon
ReCon finds significantly more PII than IFA solutions
ReCon successfully idenifies missing leaks after
retrainingFrac
tion
of to
tal l
eaks
foun
d
12
Key Results: User study IRB-approved user study
24 iOS, 13 Android devices 20/26 responses: system useful & behavior
change 165 cases of credential leaks, 94 verified Average leaks: iOS > Android Unexpected, suspicious leaks
Recipe/cooking app tracks location Video/Game/News app leaks gender
And more… Check out http://recon.meddle.mobi
13
Summary ReCon: Provides transparency/control over
PII leaks Relies only on access to network traffic (OS
independent) Machine learning to automatically identify PII
leaks Crowdsourced reinforcement with user
feedback Works today! Check out
http://recon.meddle.mobiSponsor:Question
s?David Choffnes
14
Backups
15
Encryption and ReCon What is your answer for increasing use of
encryption? Recon needs access only to plaintext flows mcTLS, BlindBox Route to trusted middlebox that can do MITM
Works for most apps, but usually not logins Haystack (on Android device)
16
Encryption: What is leaked? Leaks over SSL (not much)
Send PII to trackers over SSL (100 apps/device) 6 iOS 2 Android 1 Windows
Problem with SSL Certification pining Not working with VPN enabled
Obfuscation Little evidence in controlled experiment using
IFA
17
Other applications of ReCon K-anonymity Explicit sharing
Allow users to control how much shared to third-parties
Obfuscation Retrain classifiers to identify obfuscated leaks Use static/dynamic to analysis tools that are
resilient to evasion techniques
18
Deployment models ReCon only needs access to network flows
VPN proxy (current deployment): tunnel to proxy server Currently supported by all mobile OSes Can run VMs anywhere in the world
Raspberry Pi In home network Enables HTTPS decryption with minimal additional
risks On device
Haystack on Android In network
Awazza and other APN/middlebox deployment models
19
Methodology Details Controlled experiments as ground truth Text classification approaches
Problem: Given a network flow, whether it contains PII information or not?
Feature Extraction: Bag-of-word model Example.com /someevent?x=1&y=2 {“z”:”xx@y”} Words: someevent, x, 1, y, 2, z, xx@y,
Per-Domain classifiers (e.g. Google-Analytics) Faster (compared to one-for-all) More accurate
Library: weka
20
Why Run ReCon? User incentives
Control over data leaks! Blocking unwanted content k-anonymity for increased privacy