Privacy: Lessons from the Past Decade
Vitaly Shmatikov, The University of Texas at Austin
TRANSCRIPT
Privacy: Lessons from the Past Decade
Vitaly Shmatikov, The University of Texas at Austin
Browsing history
Medical and genetic data
Web searches
Tastes
Purchases
slide 2
Web tracking
slide 3
Social aggregation
Database marketing
Universal data accessibility
Aggregation
slide 4
Medical data
• Electronic medical records (EMR)
  – Cerner, Practice Fusion …
• Health-care datasets
  – Clinical studies, hospital discharge databases …
• Increasingly accompanied by DNA information
  – PatientsLikeMe.com
slide 5
High-dimensional datasets
• Row = user record
• Column = dimension
  – Example: purchased items
• Thousands or millions of dimensions
  – Netflix movie ratings: 35,000
  – Amazon purchases: ~10^7
slide 6
Sparsity and the “Long Tail”
Netflix Prize dataset: considering just movie names, for 90% of records there isn’t a single other record that is more than 30% similar.
Average record has no “similar” records.
slide 7
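A small sketch of how one might measure this kind of sparsity on toy data (the records and the Jaccard similarity measure are my own illustration, not the Netflix Prize methodology):

```python
# Sketch: measuring sparsity in a ratings dataset (toy data only). A record is
# the set of movie names a user rated; similarity here is the Jaccard index.
from itertools import combinations

records = {
    "u1": {"Alien", "Brazil", "Casablanca"},
    "u2": {"Alien", "Dune"},
    "u3": {"Casablanca", "Eraserhead", "Fargo"},
    "u4": {"Gattaca"},
}

def jaccard(a, b):
    return len(a & b) / len(a | b)

# For each record, find the most similar other record.
best = {u: 0.0 for u in records}
for u, v in combinations(records, 2):
    s = jaccard(records[u], records[v])
    best[u] = max(best[u], s)
    best[v] = max(best[v], s)

lonely = sum(1 for s in best.values() if s <= 0.30)
print(f"{lonely}/{len(records)} records have no other record >30% similar")
```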
Graph-structured social data
• Node attributes
  – Interests
  – Group membership
  – Sexual orientation
• Edge attributes
  – Date of creation
  – Strength
  – Type of relationship
slide 8
“Jefferson High”: romantic and sexual network
Real data!
slide 9
Whose data is it, anyway?
• Social networks
  – Information about relationships is shared
• Genome
  – Shared with all blood relatives
• Recommender systems
  – Complex algorithms make it impossible to trace the origin of data
Traditional notion: everyone owns and should control their personal data
slide 10
Famous privacy breaches
Search, Mini-feed, Beacon, Applications
Why did they happen?
slide 11
Data release today
• Datasets are “scrubbed” and published
• Why not interactive computation?
  – Infrastructure cost
  – Overhead of online privacy enforcement
  – Resource allocation and competition
  – Client privacy
• What about privacy of data subjects?
  – Answer: data have been ANONYMIZED
slide 12
The crutch of anonymity
(U.S.) (U.K.)
Deals with ISPs to collect anonymized browsing data for highly targeted advertising. Users not notified.
Court ruling over YouTube user log data causes major privacy uproar. Deal to anonymize viewing logs satisfies all objections.
slide 13
Targeted advertising
“… breakthrough technology that uses social graph data to dramatically improve online marketing … ‘Social Engagement Data’ consists of anonymous information regarding the relationships between people”
“The critical distinction … between the use of personal information for advertisements in personally-identifiable form, and the use, dissemination, or sharing of information with advertisers in non-personally-identifiable form.”
slide 14
The myth of the PII
• Data are “scrubbed” by removing personally identifying information (PII)
  – Name, Social Security number, phone number, email, address … what else?
• Problem: PII has no technical meaning
  – Defined in disclosure notification laws
    • If certain information is lost, consumer must be notified
  – In privacy breaches, any information can be personally identifying
slide 15
More reading
• Narayanan and Shmatikov. “Myths and Fallacies of ‘Personally Identifiable Information’ ” (CACM 2010)
slide 16
De-identification
Tries to achieve “privacy” by syntactic transformation of the data
  – Scrubbing of PII, k-anonymity, l-diversity …
Fatally flawed!
  – Insecure against attackers with external information
  – Does not compose (anonymizing twice can reveal data)
  – No meaningful notion of privacy
  – No meaningful notion of utility
slide 17
Latanya Sweeney’s attack (1997)
Massachusetts hospital discharge dataset
Public voter dataset
slide 18
Closer look at two records

Voter registration record: Name (Vitaly), Age (70), ZIP code (78705), Sex (Male)
  – Identifiable, no sensitive data

Patient record: Age (70), ZIP code (78705), Sex (Male), Disease (Jetlag)
  – Anonymized, contains sensitive data
slide 19
Database join
Joined record: Name (Vitaly), Age (70), ZIP code (78705), Sex (Male), Disease (Jetlag)
Vitaly suffers from jetlag!
slide 20
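For concreteness, a minimal sketch of this linkage join on toy records (pandas and the column names are my own illustration; the slide only shows the joined record):

```python
# Sketch: re-identification by joining an "anonymized" patient table with a
# public voter roll on shared demographic attributes. Toy data only.
import pandas as pd

voters = pd.DataFrame([
    {"name": "Vitaly", "age": 70, "zip": "78705", "sex": "M"},
    {"name": "Alice",  "age": 34, "zip": "78701", "sex": "F"},
])

patients = pd.DataFrame([  # "scrubbed": names removed, disease kept
    {"age": 70, "zip": "78705", "sex": "M", "disease": "Jetlag"},
    {"age": 34, "zip": "78701", "sex": "F", "disease": "Flu"},
])

# The join on (age, zip, sex) restores the name-to-disease link.
linked = voters.merge(patients, on=["age", "zip", "sex"])
print(linked[["name", "disease"]])
```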
Observation #1: data joins
• Attacker learns sensitive data by joining two datasets on common attributes
  – Anonymized dataset with sensitive attributes
    • Example: age, race, symptoms
  – “Harmless” dataset with individual identifiers
    • Example: name, address, age, race
• Demographic attributes (age, ZIP code, race, etc.) are very common
slide 21
Observation #2: quasi-identifiers
• Sweeney’s observation: (birthdate, ZIP code, gender) uniquely identifies 87% of the US population
  – Side note: actually, only 63% [Golle WPES ’06]
• Publishing a record with a quasi-identifier is as bad as publishing it with an explicit identity
• Eliminating quasi-identifiers is not desirable
  – For example, users of the dataset may want to study the distribution of diseases by age and ZIP code
slide 22
k-anonymity
• Proposed by Samarati and Sweeney
  – First appears in an SRI tech report (1998)
• Hundreds of papers since then
  – Extremely popular in the database and data-mining communities (SIGMOD, ICDE, KDD, VLDB)
• Many k-anonymization algorithms, most based on generalization and suppression of quasi-identifiers
slide 23
Anonymization in a nutshell
• Dataset is a relational table
• Attributes (columns) are divided into quasi-identifiers and sensitive attributes
• Generalize/suppress quasi-identifiers, but don’t touch sensitive attributes (keep them “truthful”)

  Race | Age | Symptoms | Blood type | Medical history
  …    | …   | …        | …          | …
  …    | …   | …        | …          | …
slide 24
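As an illustration of generalization and suppression (my own sketch; the bucketing rules and field names are hypothetical, not from the talk):

```python
# Sketch: generalizing quasi-identifiers before release. Ages are coarsened to
# decades and ZIP codes truncated; sensitive attributes are left untouched.
def generalize(record):
    out = dict(record)
    decade = (record["age"] // 10) * 10
    out["age"] = f"{decade}-{decade + 9}"
    out["zip"] = record["zip"][:3] + "XX"   # 78705 -> 787XX
    out.pop("name", None)                   # suppress the explicit identifier
    return out

record = {"name": "Vitaly", "age": 70, "zip": "78705", "symptoms": "jetlag"}
print(generalize(record))
# {'age': '70-79', 'zip': '787XX', 'symptoms': 'jetlag'}
```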
k-anonymity: definition
• Any (transformed) quasi-identifier must appear in at least k records in the anonymized dataset
  – k is chosen by the data owner (how?)
  – Example: any age-race combination from the original DB must appear at least 10 times in the anonymized DB
• Guarantees that any join on quasi-identifiers with the anonymized dataset will contain at least k records for each quasi-identifier
slide 25
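A minimal check of this definition might look like the following sketch (my own illustration, assuming the released table is a list of records and the quasi-identifier columns are given):

```python
# Sketch: verify k-anonymity by counting how often each (transformed)
# quasi-identifier combination appears in the released table.
from collections import Counter

def is_k_anonymous(rows, quasi_ids, k):
    counts = Counter(tuple(row[q] for q in quasi_ids) for row in rows)
    return min(counts.values()) >= k

released = [
    {"race": "Caucas",      "zip": "787XX", "disease": "Flu"},
    {"race": "Caucas",      "zip": "787XX", "disease": "Flu"},
    {"race": "Caucas",      "zip": "787XX", "disease": "Flu"},
    {"race": "Asian/AfrAm", "zip": "78705", "disease": "Shingles"},
    {"race": "Asian/AfrAm", "zip": "78705", "disease": "Acne"},
    {"race": "Asian/AfrAm", "zip": "78705", "disease": "Acne"},
]
print(is_k_anonymous(released, ["race", "zip"], k=3))   # True
```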
Two (and a half) interpretations
• Membership disclosure: cannot tell that a given person is in the dataset
• Sensitive attribute disclosure: cannot tell that a given person has a certain sensitive attribute
• Identity disclosure: cannot tell which record corresponds to a given person
This interpretation is correct (assuming the attacker only knows quasi-identifiers)
Does not imply any privacy! Example: k clinical records, all HIV+
slide 26
Curse of dimensionality
• Generalization fundamentally relies on spatial locality
  – Each record must have k close neighbors
• Real-world datasets are very sparse
  – Netflix Prize dataset: 17,000 dimensions
  – Amazon: several million dimensions
  – “Nearest neighbor” is very far
• Projection to low dimensions loses all info
k-anonymized datasets are useless
Aggarwal VLDB ’05
slide 27
k-anonymity: definition ... or how not to define privacy
• Any (transformed) quasi-identifier must appear in at least k records in the anonymized dataset
  – Does not mention sensitive attributes at all!
  – Does not say anything about the computations to be done on the data
  – Assumes that the attacker will be able to join only on quasi-identifiers
slide 28
Sensitive attribute disclosure
Intuitive reasoning:
• k-anonymity prevents the attacker from telling which record corresponds to which person
• Therefore, the attacker cannot tell that a certain person has a particular value of a sensitive attribute
This reasoning is fallacious!
slide 29
3-anonymization

Original:
  Caucas | 78712 | Flu
  Asian  | 78705 | Shingles
  Caucas | 78754 | Flu
  Asian  | 78705 | Acne
  AfrAm  | 78705 | Acne
  Caucas | 78705 | Flu

3-anonymized:
  Caucas      | 787XX | Flu
  Asian/AfrAm | 78705 | Shingles
  Caucas      | 787XX | Flu
  Asian/AfrAm | 78705 | Acne
  Asian/AfrAm | 78705 | Acne
  Caucas      | 787XX | Flu

This is 3-anonymous, right?
slide 30
Joining with external database

External database:
  …      | …      | …
  Vitaly | Caucas | 78705
  …      | …      | …

Anonymized records:
  Caucas      | 787XX | Flu
  Asian/AfrAm | 78705 | Shingles
  Caucas      | 787XX | Flu
  Asian/AfrAm | 78705 | Acne
  Asian/AfrAm | 78705 | Acne
  Caucas      | 787XX | Flu

Problem: sensitive attributes are not “diverse” within each quasi-identifier group
slide 31
Another attempt: l-diversity

  Caucas      | 787XX | Flu
  Caucas      | 787XX | Shingles
  Caucas      | 787XX | Acne
  Caucas      | 787XX | Flu
  Caucas      | 787XX | Acne
  Caucas      | 787XX | Flu
  Asian/AfrAm | 78XXX | Flu
  Asian/AfrAm | 78XXX | Flu
  Asian/AfrAm | 78XXX | Acne
  Asian/AfrAm | 78XXX | Shingles
  Asian/AfrAm | 78XXX | Acne
  Asian/AfrAm | 78XXX | Flu

Entropy of sensitive attributes within each quasi-identifier group must be at least l
slide 32
Machanavajjhala et al. ICDE ‘06
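A quick sketch of the entropy computation behind this definition (my own illustration; in the Machanavajjhala et al. formulation the entropy of each group must be at least log l):

```python
# Sketch: entropy of the sensitive attribute within one quasi-identifier group.
import math
from collections import Counter

def entropy(values):
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

caucas_group = ["Flu", "Shingles", "Acne", "Flu", "Acne", "Flu"]
print(entropy(caucas_group))      # ~1.46 bits: three diseases, fairly "diverse"
print(entropy(["Cancer"] * 6))    # -0.0, i.e. no diversity at all
```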
Failure of l-diversity

Original database (99% have cancer):
  … | Cancer
  … | Cancer
  … | Cancer
  … | Flu
  … | Cancer
  … | Cancer
  … | Cancer
  … | Cancer
  … | Cancer
  … | Cancer
  … | Flu
  … | Flu

Anonymization B:
  Q1 | Flu
  Q1 | Cancer
  Q1 | Cancer
  Q1 | Cancer
  Q1 | Cancer
  Q1 | Cancer
  Q2 | Cancer
  Q2 | Cancer
  Q2 | Cancer
  Q2 | Cancer
  Q2 | Flu
  Q2 | Flu

Anonymization A:
  Q1 | Flu
  Q1 | Flu
  Q1 | Cancer
  Q1 | Flu
  Q1 | Cancer
  Q1 | Cancer
  Q2 | Cancer
  Q2 | Cancer
  Q2 | Cancer
  Q2 | Cancer
  Q2 | Cancer
  Q2 | Cancer

50% cancer quasi-identifier group is “diverse” … yet this leaks a ton of information!
99% cancer quasi-identifier group is not “diverse” … yet the anonymized database does not leak anything
slide 33
Membership disclosure
• With high probability, a quasi-identifier uniquely identifies an individual in the population
• Modifying quasi-identifiers in the dataset does not affect their frequency in the population!
  – Suppose the anonymized dataset contains 10 records with a certain quasi-identifier … and there are 10 people in the population who match it
• k-anonymity may not hide whether a given person is in the dataset
Nergiz et al. SIGMOD ‘07
slide 34
What does the attacker know?

  Caucas      | 787XX | HIV+ | Flu
  Asian/AfrAm | 787XX | HIV- | Flu
  Asian/AfrAm | 787XX | HIV+ | Shingles
  Caucas      | 787XX | HIV- | Acne
  Caucas      | 787XX | HIV- | Shingles
  Caucas      | 787XX | HIV- | Acne

“Bob is Caucasian and I heard he was admitted to hospital with flu…”
“This is against the rules! ‘Flu’ is not a quasi-identifier.”
Yes… and this is yet another problem with k-anonymity!
slide 35
Other problems with k-anonymity
• Multiple releases of the same dataset break anonymity [Ganta et al. KDD ’08]
• Mere knowledge of the k-anonymization algorithm is enough to reverse anonymization [Zhang et al. CCS ’07]
slide 36
k-Anonymity considered harmful
• Syntactic
  – Focuses on data transformation, not on what can be learned from the anonymized dataset
  – “k-anonymous” dataset can leak sensitive info
• “Quasi-identifier” fallacy
  – Assumes a priori that the attacker will not know certain information about his target
• Relies on locality
  – Destroys utility of many real-world datasets
slide 37
HIPAA Privacy Rule
“The identifiers that must be removed include direct identifiers, such as name, street address, social security number, as well as other identifiers, such as birth date, admission and discharge dates, and five-digit zip code. The safe harbor requires removal of geographic subdivisions smaller than a State, except for the initial three digits of a zip code if the geographic unit formed by combining all zip codes with the same initial three digits contains more than 20,000 people. In addition, age, if less than 90, gender, ethnicity, and other demographic information not listed may remain in the information. The safe harbor is intended to provide covered entities with a simple, definitive method that does not require much judgment by the covered entity to determine if the information is adequately de-identified."
"Under the safe harbor method, covered entities must remove all of a list of 18 enumerated identifiers and have no actual knowledge that the information remaining could be used, alone or in combination, to identify a subject of the information."
slide 38
Lessons
• Anonymization does not work
• “Personally identifiable” is meaningless
  – Originally a legal term; unfortunately crept into technical language in terms such as “quasi-identifier”
  – Any piece of information is potentially identifying if it reduces the space of possibilities
  – Background info about people is easy to obtain
• Linkage of information across virtual identities allows large-scale de-anonymization
slide 39
How to do it right
• Privacy is not a property of the data
  – Syntactic definitions such as k-anonymity are doomed to fail
• Privacy is a property of the computation carried out on the data
• Definition of privacy must be robust in the presence of auxiliary information: differential privacy
Dwork et al. ’06-10
slide 40
Differential privacy (intuition)
A mechanism is differentially private if every output is produced with similar probability whether any given input is included or not
[Diagram: inputs A, B, C, D vs. A, B, D: similar output distributions]
Risk for C does not increase much if her data are included in the computation
slide 41
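For reference, the standard formal statement of this intuition (Dwork et al.; ε is the privacy parameter, which the slide leaves implicit): a randomized mechanism M is ε-differentially private if, for all datasets D and D′ differing in a single record and all sets S of outputs,

\[ \Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S]. \]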
Computing in the year 201X
• Illusion of infinite resources
• Pay only for resources used
• Quickly scale up or scale down …
[Diagram: data moves to the cloud]
slide 42
Programming model in year 201X
• Frameworks available to ease cloud programming
• MapReduce: parallel processing on clusters of machines
[Diagram: Data → Map → Reduce → Output]
• Data mining
• Genomic computation
• Social networks
slide 43
Programming model in year 201X
• Thousands of users upload their data
  – Healthcare, shopping transactions, clickstream …
• Multiple third parties mine the data
• Example: health-care data
  – Incentive to contribute: cheaper insurance, new drug research, inventory control in drugstores …
  – Fear: what if someone targets my personal data?
    • Insurance company learns something about my health and increases my premium or denies coverage
slide 44
Privacy in the year 201X?
[Diagram: Health Data → untrusted MapReduce program → Output. Information leak?]
• Data mining
• Genomic computation
• Social networks
slide 45
Audit untrusted code?
• Audit MapReduce programs for correctness?
  – Hard to do! Enlightenment?
  – Also, where is the source code?
• Aim: confine the code instead of auditing
slide 46
Airavat
Framework for privacy-preserving MapReduce computations with untrusted code
[Diagram: untrusted program + protected data → Airavat]
slide 47
Airavat guarantee
Bounded information leak* about any individual data after performing a MapReduce computation.
*Differential privacy
[Diagram: untrusted program + protected data → Airavat]
slide 48
Background: MapReduce
map(k1, v1) → list(k2, v2)
reduce(k2, list(v2)) → list(v2)
[Diagram: Data 1–4 → Map phase → Reduce phase → Output]
slide 49
MapReduce example: counts no. of iPads sold
Input records: iPad, Tablet PC, iPad, Laptop

  Map(input) { if (input has iPad) print (iPad, 1) }
  Reduce(key, list(v)) { print (key + “,” + SUM(v)) }

Map phase emits (iPad, 1), (iPad, 1); Reduce phase applies SUM and outputs (iPad, 2)
slide 50
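A runnable sketch of the same job, with plain Python standing in for the MapReduce framework (my own simulation, not Hadoop or Airavat code):

```python
# Sketch: the iPad-counting job from the slide, simulated without a cluster.
from collections import defaultdict

def map_fn(record):
    if "iPad" in record:
        yield ("iPad", 1)

def reduce_fn(key, values):
    return (key, sum(values))

records = ["iPad", "Tablet PC", "iPad", "Laptop"]

# Map phase: emit (key, value) pairs, grouped by key for the reduce phase.
groups = defaultdict(list)
for r in records:
    for k, v in map_fn(r):
        groups[k].append(v)

# Reduce phase: one call per key.
print([reduce_fn(k, vs) for k, vs in groups.items()])   # [('iPad', 2)]
```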
Airavat model
• Airavat runs on the cloud infrastructure
  – Cloud infrastructure: hardware + VM
  – Airavat: modified MapReduce + DFS + JVM + SELinux
[Diagram: (1) trusted Airavat framework on the cloud infrastructure]
slide 51
Airavat model
• Data provider uploads her data on Airavat
  – Sets up certain privacy parameters
[Diagram: (1) trusted Airavat framework on the cloud infrastructure; (2) data provider]
slide 52
Airavat model
• Computation provider implements the data mining algorithm
  – Untrusted, possibly malicious
[Diagram: (1) trusted Airavat framework on the cloud infrastructure; (2) data provider; (3) computation provider submits the program and receives the output]
slide 53
Threat model
• Airavat runs the computation and protects the privacy of the input data
[Diagram: same setup as before; the computation provider’s program is the threat]
slide 54
Airavat programming model
• MapReduce program for data mining
• Split MapReduce into untrusted mapper + trusted reducer
  – Limited set of stock reducers
[Diagram: data → untrusted mapper (“no need to audit”) → trusted reducer → output, inside Airavat]
slide 55
Airavat programming model
• MapReduce program for data mining
• Need to confine the mappers!
• Guarantee: protect the privacy of input data
[Diagram: data → untrusted mapper (“no need to audit”) → trusted reducer → output, inside Airavat]
slide 56
Leaking via storage channels
Untrusted mapper code copies data, sends it over the network
[Diagram: records for Peter, Meg, Chris flow through Map and Reduce; the mapper leaks using system resources]
slide 57
Leaking via output
Output of the computation is also an information channel
Example: output 1 million if Peter bought Vi*gra
[Diagram: records for Peter, Meg, Chris flow through Map and Reduce to the output]
slide 58
Airavat mechanisms
• Mandatory access control: prevent leaks through storage channels like network connections, files …
• Differential privacy: prevent leaks through the output of the computation
[Diagram: Data → Map → Reduce → Output]
slide 59
Confining untrusted code
• Untrusted program: given by the computation provider
• MapReduce + DFS: add mandatory access control (MAC)
• SELinux: add MAC policy
[Diagram: untrusted program on top of MapReduce + DFS on top of SELinux, all inside Airavat]
slide 60
Confining untrusted code
• We add mandatory access control to the MapReduce framework
• Label input, intermediate values, output
• Malicious code cannot leak labeled data
[Diagram: Data 1–3 and the Output carry access control labels as they pass through MapReduce]
slide 61
Confining untrusted code
• SELinux policy to enforce MAC
• Creates trusted and untrusted domains
• Processes and files are labeled to restrict interaction
• Mappers reside in the untrusted domain
  – Denied network access, limited file system interaction
slide 62
Access control is not enough
• Labels can prevent the output from being read
• When can we remove the labels?
Example: a malicious mapper
  if (input belongs-to Peter) print (iPad, 1000000)
Input records: iPad, Tablet PC, iPad, Laptop (Peter’s purchase among them)
The Reduce phase SUMs the mapper outputs into (iPad, 1000002) instead of (iPad, 2)
Output leaks the presence of Peter!
slide 63
Differential privacy (intuition)
A mechanism is differentially private if every output is produced with similar probability whether any given input is included or not
Cynthia Dwork et al. Differential Privacy.
slide 64
Differential privacy (intuition)
A mechanism is differentially private if every output is produced with similar probability whether any given input is included or not
[Diagram: inputs A, B, C → F(x) → output distribution]
Cynthia Dwork et al. Differential Privacy.
slide 65
Differential privacy (intuition)
A mechanism is differentially private if every output is produced with similar probability whether any given input is included or not
[Diagram: F(x) on inputs A, B, C vs. F(x) on inputs A, B, C, D: similar output distributions]
Bounded risk for D if she includes her data!
Cynthia Dwork et al. Differential Privacy.
slide 66
Achieving differential privacy
• A simple differentially private mechanism
  – [Diagram: analyst asks “Tell me f(x)”; the database x1 … xn answers f(x) + noise]
• How much random noise should be added?
slide 67
Achieving differential privacy
• Function sensitivity (intuition): maximum effect of any single input on the output
  – Aim: “mask” this effect to ensure privacy
• Example: average height of the people in this room has low sensitivity
  – Any single person’s height does not affect the final average by too much
  – Calculating the maximum height has high sensitivity
slide 68
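The usual formal counterpart of this intuition (my phrasing of the standard Dwork et al. definition): for datasets x and x′ that differ in a single element,

\[ \Delta f \;=\; \max_{x,\,x'} \lVert f(x) - f(x') \rVert_{1}. \]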
Achieving differential privacy
• Function sensitivity (intuition): maximum effect of any single input on the output
  – Aim: “mask” this effect to ensure privacy
• Example: SUM over input elements drawn from [0, M]
  – [Diagram: X1 … X4 → SUM] Sensitivity = M: the max effect of any input element is M
slide 69
Achieving differential privacy
• A simple differentially private mechanism
  – [Diagram: analyst asks “Tell me f(x)”; the database x1 … xn answers f(x) + Lap(∆(f))]
• Intuition: noise needed to mask the effect of a single input
  – Lap = Laplace distribution
  – ∆(f) = sensitivity
slide 70
Dwork et al.
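A small sketch of this mechanism for a SUM query (my own illustration; the standard formulation scales the noise by sensitivity/ε, where ε is the privacy parameter the slide leaves implicit):

```python
# Sketch: the Laplace mechanism for a SUM query over values drawn from [0, M].
import random

def laplace_noise(scale):
    # Laplace(0, scale) sampled as a sign-flipped exponential draw.
    return random.choice([-1, 1]) * random.expovariate(1.0 / scale)

def private_sum(values, M, epsilon):
    sensitivity = M            # any single element changes SUM by at most M
    return sum(values) + laplace_noise(sensitivity / epsilon)

print(private_sum([3, 7, 2, 9], M=10, epsilon=0.5))
```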
Enforcing differential privacy
• Mapper can be any piece of Java code (“black box”) but … range of mapper outputs must be declared in advance
  – Used to estimate “sensitivity” (how much does a single input influence the output?)
  – Determines how much noise is added to outputs to ensure differential privacy
• Example: consider mapper range [0, M]
  – SUM has the estimated sensitivity of M
slide 71
Enforcing differential privacy
• Malicious mappers may output values outside the range
• If a mapper produces a value outside the range, it is replaced by a value inside the range
  – User not notified … otherwise possible information leak
[Diagram: Data 1–4 → mappers → range enforcers → reducer + noise]
Ensures that code is not more sensitive than declared
slide 72
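A toy sketch of how a range enforcer and a noisy reducer fit together (my own illustration of the idea, not Airavat’s actual implementation, which is Java inside the Hadoop stack):

```python
# Sketch: clamp mapper outputs to the declared range, then add Laplace noise
# scaled to that range in the trusted reducer.
import random

def laplace_noise(scale):
    # Laplace(0, scale) sampled as a sign-flipped exponential draw.
    return random.choice([-1, 1]) * random.expovariate(1.0 / scale)

def enforce_range(value, lo, hi):
    # A value outside [lo, hi] is silently replaced by one inside the range.
    return min(max(value, lo), hi)

def noisy_sum_reducer(values, lo, hi, epsilon):
    clamped = [enforce_range(v, lo, hi) for v in values]
    sensitivity = hi - lo      # max effect of one record once outputs are clamped
    return sum(clamped) + laplace_noise(sensitivity / epsilon)

# A malicious mapper tries to signal Peter's record by emitting 1000000;
# the range enforcer caps it at the declared range [0, 1].
print(noisy_sum_reducer([1, 1000000, 1], lo=0, hi=1, epsilon=1.0))
```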
Enforcing sensitivity
• All mapper invocations must be independent
• Mapper may not store an input and use it later when processing another input
  – Otherwise, range-based sensitivity estimates may be incorrect
• We modify the JVM to enforce mapper independence
  – Each object is assigned an invocation number
  – JVM instrumentation prevents reuse of objects from previous invocations
slide 73
What can we compute?
• Reducers are responsible for enforcing privacy
  – Add appropriate amount of random noise to the outputs
• Reducers must be trusted
  – Sample reducers: SUM, COUNT, THRESHOLD
  – Sufficient to perform data-mining algorithms, search log processing, simple statistical computations, etc.
• With trusted mappers, more general computations are possible
  – Use exact sensitivity instead of range-based estimates
slide 74
More reading
• Roy et al. “Airavat: Security and Privacy for MapReduce” (NSDI 2010)
slide 75