Privacy: Lessons from the Past Decade
Vitaly Shmatikov, The University of Texas at Austin
TRANSCRIPT
Privacy: Lessons from the Past Decade
Vitaly Shmatikov, The University of Texas at Austin
Browsing history
Medical and genetic data
Web searches
Tastes
Purchases
slide 2
Web tracking
slide 3
Social aggregation
Database marketing
Universal data accessibility
Aggregation
slide 4
Medical data
• Electronic medical records (EMR)
  – Cerner, Practice Fusion …
• Health-care datasets
  – Clinical studies, hospital discharge databases …
• Increasingly accompanied by DNA information
  – PatientsLikeMe.com
slide 5
High-dimensional datasets
• Row = user record
• Column = dimension
  – Example: purchased items
• Thousands or millions of dimensions
  – Netflix movie ratings: 35,000
  – Amazon purchases: ~10^7
slide 6
Sparsity and the “Long Tail”
Netflix Prize dataset: considering just movie names, for 90% of records there isn’t a single other record that is more than 30% similar.
Average record has no “similar” records.
slide 7
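A small sketch of how one might measure this kind of sparsity on toy data (the records and the Jaccard similarity measure are my own illustration, not the Netflix Prize methodology):

```python
# Sketch: measuring sparsity in a ratings dataset (toy data only). A record is
# the set of movie names a user rated; similarity here is the Jaccard index.
from itertools import combinations

records = {
    "u1": {"Alien", "Brazil", "Casablanca"},
    "u2": {"Alien", "Dune"},
    "u3": {"Casablanca", "Eraserhead", "Fargo"},
    "u4": {"Gattaca"},
}

def jaccard(a, b):
    return len(a & b) / len(a | b)

# For each record, find the most similar other record.
best = {u: 0.0 for u in records}
for u, v in combinations(records, 2):
    s = jaccard(records[u], records[v])
    best[u] = max(best[u], s)
    best[v] = max(best[v], s)

lonely = sum(1 for s in best.values() if s <= 0.30)
print(f"{lonely}/{len(records)} records have no other record >30% similar")
```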
Graph-structured social data
• Node attributes
  – Interests
  – Group membership
  – Sexual orientation
• Edge attributes
  – Date of creation
  – Strength
  – Type of relationship
slide 8
“Jefferson High”: romantic and sexual network
Real data!
slide 9
Whose data is it, anyway?
• Social networks
  – Information about relationships is shared
• Genome
  – Shared with all blood relatives
• Recommender systems
  – Complex algorithms make it impossible to trace the origin of data
Traditional notion: everyone owns and should control their personal data
slide 10
Famous privacy breaches
Search, Mini-feed, Beacon, Applications
Why did they happen?
slide 11
Data release today
• Datasets are “scrubbed” and published
• Why not interactive computation?
  – Infrastructure cost
  – Overhead of online privacy enforcement
  – Resource allocation and competition
  – Client privacy
• What about privacy of data subjects?
  – Answer: data have been ANONYMIZED
slide 12
The crutch of anonymity
(U.S.) (U.K.)
Deals with ISPs to collect anonymized browsing data for highly targeted advertising. Users not notified.
Court ruling over YouTube user log data causes major privacy uproar. Deal to anonymize viewing logs satisfies all objections.
slide 13
Targeted advertising
“… breakthrough technology that uses social graph data to dramatically improve online marketing … ‘Social Engagement Data’ consists of anonymous information regarding the relationships between people”
“The critical distinction … between the use of personal information for advertisements in personally-identifiable form, and the use, dissemination, or sharing of information with advertisers in non-personally-identifiable form.”
slide 14
The myth of the PII
• Data are “scrubbed” by removing personally identifying information (PII)
  – Name, Social Security number, phone number, email, address … what else?
• Problem: PII has no technical meaning
  – Defined in disclosure notification laws
    • If certain information is lost, consumer must be notified
  – In privacy breaches, any information can be personally identifying
slide 15
More reading
• Narayanan and Shmatikov. “Myths and Fallacies of ‘Personally Identifiable Information’ ” (CACM 2010)
slide 16
De-identification
Tries to achieve “privacy” by syntactic transformation of the data
  – Scrubbing of PII, k-anonymity, l-diversity …
Fatally flawed!
  – Insecure against attackers with external information
  – Does not compose (anonymizing twice can reveal data)
  – No meaningful notion of privacy
  – No meaningful notion of utility
slide 17
Latanya Sweeney’s attack (1997)
Massachusetts hospital discharge dataset
Public voter dataset
slide 18
Closer look at two records

Voter registration record: Name (Vitaly), Age (70), ZIP code (78705), Sex (Male)
  – Identifiable, no sensitive data

Patient record: Age (70), ZIP code (78705), Sex (Male), Disease (Jetlag)
  – Anonymized, contains sensitive data
slide 19
Database join
Joined record: Name (Vitaly), Age (70), ZIP code (78705), Sex (Male), Disease (Jetlag)
Vitaly suffers from jetlag!
slide 20
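For concreteness, a minimal sketch of this linkage join on toy records (pandas and the column names are my own illustration; the slide only shows the joined record):

```python
# Sketch: re-identification by joining an "anonymized" patient table with a
# public voter roll on shared demographic attributes. Toy data only.
import pandas as pd

voters = pd.DataFrame([
    {"name": "Vitaly", "age": 70, "zip": "78705", "sex": "M"},
    {"name": "Alice",  "age": 34, "zip": "78701", "sex": "F"},
])

patients = pd.DataFrame([  # "scrubbed": names removed, disease kept
    {"age": 70, "zip": "78705", "sex": "M", "disease": "Jetlag"},
    {"age": 34, "zip": "78701", "sex": "F", "disease": "Flu"},
])

# The join on (age, zip, sex) restores the name-to-disease link.
linked = voters.merge(patients, on=["age", "zip", "sex"])
print(linked[["name", "disease"]])
```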
Observation #1: data joins
• Attacker learns sensitive data by joining two datasets on common attributes
  – Anonymized dataset with sensitive attributes
    • Example: age, race, symptoms
  – “Harmless” dataset with individual identifiers
    • Example: name, address, age, race
• Demographic attributes (age, ZIP code, race, etc.) are very common
slide 21
Observation #2: quasi-identifiers
• Sweeney’s observation: (birthdate, ZIP code, gender) uniquely identifies 87% of the US population
  – Side note: actually, only 63% [Golle WPES ’06]
• Publishing a record with a quasi-identifier is as bad as publishing it with an explicit identity
• Eliminating quasi-identifiers is not desirable
  – For example, users of the dataset may want to study the distribution of diseases by age and ZIP code
slide 22
k-anonymity
• Proposed by Samarati and Sweeney
  – First appears in an SRI tech report (1998)
• Hundreds of papers since then
  – Extremely popular in the database and data-mining communities (SIGMOD, ICDE, KDD, VLDB)
• Many k-anonymization algorithms, most based on generalization and suppression of quasi-identifiers
slide 23
Anonymization in a nutshell
• Dataset is a relational table
• Attributes (columns) are divided into quasi-identifiers and sensitive attributes
• Generalize/suppress quasi-identifiers, but don’t touch sensitive attributes (keep them “truthful”)

  Race | Age | Symptoms | Blood type | Medical history
  …    | …   | …        | …          | …
  …    | …   | …        | …          | …
slide 24
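As an illustration of generalization and suppression (my own sketch; the bucketing rules and field names are hypothetical, not from the talk):

```python
# Sketch: generalizing quasi-identifiers before release. Ages are coarsened to
# decades and ZIP codes truncated; sensitive attributes are left untouched.
def generalize(record):
    out = dict(record)
    decade = (record["age"] // 10) * 10
    out["age"] = f"{decade}-{decade + 9}"
    out["zip"] = record["zip"][:3] + "XX"   # 78705 -> 787XX
    out.pop("name", None)                   # suppress the explicit identifier
    return out

record = {"name": "Vitaly", "age": 70, "zip": "78705", "symptoms": "jetlag"}
print(generalize(record))
# {'age': '70-79', 'zip': '787XX', 'symptoms': 'jetlag'}
```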
k-anonymity: definition
• Any (transformed) quasi-identifier must appear in at least k records in the anonymized dataset
  – k is chosen by the data owner (how?)
  – Example: any age-race combination from the original DB must appear at least 10 times in the anonymized DB
• Guarantees that any join on quasi-identifiers with the anonymized dataset will contain at least k records for each quasi-identifier
slide 25
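A minimal check of this definition might look like the following sketch (my own illustration, assuming the released table is a list of records and the quasi-identifier columns are given):

```python
# Sketch: verify k-anonymity by counting how often each (transformed)
# quasi-identifier combination appears in the released table.
from collections import Counter

def is_k_anonymous(rows, quasi_ids, k):
    counts = Counter(tuple(row[q] for q in quasi_ids) for row in rows)
    return min(counts.values()) >= k

released = [
    {"race": "Caucas",      "zip": "787XX", "disease": "Flu"},
    {"race": "Caucas",      "zip": "787XX", "disease": "Flu"},
    {"race": "Caucas",      "zip": "787XX", "disease": "Flu"},
    {"race": "Asian/AfrAm", "zip": "78705", "disease": "Shingles"},
    {"race": "Asian/AfrAm", "zip": "78705", "disease": "Acne"},
    {"race": "Asian/AfrAm", "zip": "78705", "disease": "Acne"},
]
print(is_k_anonymous(released, ["race", "zip"], k=3))   # True
```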
Two (and a half) interpretations
• Membership disclosure: cannot tell that a given person is in the dataset
• Sensitive attribute disclosure: cannot tell that a given person has a certain sensitive attribute
• Identity disclosure: cannot tell which record corresponds to a given person
This interpretation is correct (assuming the attacker only knows quasi-identifiers)
Does not imply any privacy! Example: k clinical records, all HIV+
slide 26
Curse of dimensionality
• Generalization fundamentally relies on spatial locality
  – Each record must have k close neighbors
• Real-world datasets are very sparse
  – Netflix Prize dataset: 17,000 dimensions
  – Amazon: several million dimensions
  – “Nearest neighbor” is very far
• Projection to low dimensions loses all info
k-anonymized datasets are useless
Aggarwal VLDB ’05
slide 27
k-anonymity: definition ... or how not to define privacy
• Any (transformed) quasi-identifier must appear in at least k records in the anonymized dataset
  – Does not mention sensitive attributes at all!
  – Does not say anything about the computations to be done on the data
  – Assumes that the attacker will be able to join only on quasi-identifiers
slide 28
Sensitive attribute disclosure
Intuitive reasoning:
• k-anonymity prevents the attacker from telling which record corresponds to which person
• Therefore, the attacker cannot tell that a certain person has a particular value of a sensitive attribute
This reasoning is fallacious!
slide 29
3-anonymization

Original:
  Caucas | 78712 | Flu
  Asian  | 78705 | Shingles
  Caucas | 78754 | Flu
  Asian  | 78705 | Acne
  AfrAm  | 78705 | Acne
  Caucas | 78705 | Flu

3-anonymized:
  Caucas      | 787XX | Flu
  Asian/AfrAm | 78705 | Shingles
  Caucas      | 787XX | Flu
  Asian/AfrAm | 78705 | Acne
  Asian/AfrAm | 78705 | Acne
  Caucas      | 787XX | Flu

This is 3-anonymous, right?
slide 30
Joining with external database

External database:
  …      | …      | …
  Vitaly | Caucas | 78705
  …      | …      | …

Anonymized records:
  Caucas      | 787XX | Flu
  Asian/AfrAm | 78705 | Shingles
  Caucas      | 787XX | Flu
  Asian/AfrAm | 78705 | Acne
  Asian/AfrAm | 78705 | Acne
  Caucas      | 787XX | Flu

Problem: sensitive attributes are not “diverse” within each quasi-identifier group
slide 31
Another attempt: l-diversity

  Caucas      | 787XX | Flu
  Caucas      | 787XX | Shingles
  Caucas      | 787XX | Acne
  Caucas      | 787XX | Flu
  Caucas      | 787XX | Acne
  Caucas      | 787XX | Flu
  Asian/AfrAm | 78XXX | Flu
  Asian/AfrAm | 78XXX | Flu
  Asian/AfrAm | 78XXX | Acne
  Asian/AfrAm | 78XXX | Shingles
  Asian/AfrAm | 78XXX | Acne
  Asian/AfrAm | 78XXX | Flu

Entropy of sensitive attributes within each quasi-identifier group must be at least l
slide 32
Machanavajjhala et al. ICDE ‘06
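A quick sketch of the entropy computation behind this definition (my own illustration; in the Machanavajjhala et al. formulation the entropy of each group must be at least log l):

```python
# Sketch: entropy of the sensitive attribute within one quasi-identifier group.
import math
from collections import Counter

def entropy(values):
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

caucas_group = ["Flu", "Shingles", "Acne", "Flu", "Acne", "Flu"]
print(entropy(caucas_group))      # ~1.46 bits: three diseases, fairly "diverse"
print(entropy(["Cancer"] * 6))    # -0.0, i.e. no diversity at all
```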
Failure of l-diversity

Original database (99% have cancer):
  … | Cancer
  … | Cancer
  … | Cancer
  … | Flu
  … | Cancer
  … | Cancer
  … | Cancer
  … | Cancer
  … | Cancer
  … | Cancer
  … | Flu
  … | Flu

Anonymization B:
  Q1 | Flu
  Q1 | Cancer
  Q1 | Cancer
  Q1 | Cancer
  Q1 | Cancer
  Q1 | Cancer
  Q2 | Cancer
  Q2 | Cancer
  Q2 | Cancer
  Q2 | Cancer
  Q2 | Flu
  Q2 | Flu

Anonymization A:
  Q1 | Flu
  Q1 | Flu
  Q1 | Cancer
  Q1 | Flu
  Q1 | Cancer
  Q1 | Cancer
  Q2 | Cancer
  Q2 | Cancer
  Q2 | Cancer
  Q2 | Cancer
  Q2 | Cancer
  Q2 | Cancer

50% cancer quasi-identifier group is “diverse” … yet this leaks a ton of information!
99% cancer quasi-identifier group is not “diverse” … yet the anonymized database does not leak anything
slide 33
Membership disclosure
• With high probability, a quasi-identifier uniquely identifies an individual in the population
• Modifying quasi-identifiers in the dataset does not affect their frequency in the population!
  – Suppose the anonymized dataset contains 10 records with a certain quasi-identifier … and there are 10 people in the population who match it
• k-anonymity may not hide whether a given person is in the dataset
Nergiz et al. SIGMOD ‘07
slide 34
What does the attacker know?

  Caucas      | 787XX | HIV+ | Flu
  Asian/AfrAm | 787XX | HIV- | Flu
  Asian/AfrAm | 787XX | HIV+ | Shingles
  Caucas      | 787XX | HIV- | Acne
  Caucas      | 787XX | HIV- | Shingles
  Caucas      | 787XX | HIV- | Acne

“Bob is Caucasian and I heard he was admitted to hospital with flu…”
“This is against the rules! ‘Flu’ is not a quasi-identifier.”
Yes… and this is yet another problem with k-anonymity!
slide 35
Other problems with k-anonymity
• Multiple releases of the same dataset break anonymity [Ganta et al. KDD ’08]
• Mere knowledge of the k-anonymization algorithm is enough to reverse anonymization [Zhang et al. CCS ’07]
slide 36
k-Anonymity considered harmful
• Syntactic
  – Focuses on data transformation, not on what can be learned from the anonymized dataset
  – “k-anonymous” dataset can leak sensitive info
• “Quasi-identifier” fallacy
  – Assumes a priori that the attacker will not know certain information about his target
• Relies on locality
  – Destroys utility of many real-world datasets
slide 37
HIPAA Privacy Rule
“The identifiers that must be removed include direct identifiers, such as name, street address, social security number, as well as other identifiers, such as birth date, admission and discharge dates, and five-digit zip code. The safe harbor requires removal of geographic subdivisions smaller than a State, except for the initial three digits of a zip code if the geographic unit formed by combining all zip codes with the same initial three digits contains more than 20,000 people. In addition, age, if less than 90, gender, ethnicity, and other demographic information not listed may remain in the information. The safe harbor is intended to provide covered entities with a simple, definitive method that does not require much judgment by the covered entity to determine if the information is adequately de-identified."
"Under the safe harbor method, covered entities must remove all of a list of 18 enumerated identifiers and have no actual knowledge that the information remaining could be used, alone or in combination, to identify a subject of the information."
slide 38
Lessons
• Anonymization does not work
• “Personally identifiable” is meaningless
  – Originally a legal term; unfortunately crept into technical language in terms such as “quasi-identifier”
  – Any piece of information is potentially identifying if it reduces the space of possibilities
  – Background info about people is easy to obtain
• Linkage of information across virtual identities allows large-scale de-anonymization
slide 39
How to do it right
• Privacy is not a property of the data
  – Syntactic definitions such as k-anonymity are doomed to fail
• Privacy is a property of the computation carried out on the data
• Definition of privacy must be robust in the presence of auxiliary information: differential privacy
Dwork et al. ’06-10
slide 40
Differential privacy (intuition)
A mechanism is differentially private if every output is produced with similar probability whether any given input is included or not
[Diagram: inputs A, B, C, D vs. A, B, D: similar output distributions]
Risk for C does not increase much if her data are included in the computation
slide 41
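For reference, the standard formal statement of this intuition (Dwork et al.; ε is the privacy parameter, which the slide leaves implicit): a randomized mechanism M is ε-differentially private if, for all datasets D and D′ differing in a single record and all sets S of outputs,

\[ \Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S]. \]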
Computing in the year 201X
• Illusion of infinite resources
• Pay only for resources used
• Quickly scale up or scale down …
[Diagram: data moves to the cloud]
slide 42
Programming model in year 201X
• Frameworks available to ease cloud programming
• MapReduce: parallel processing on clusters of machines
[Diagram: Data → Map → Reduce → Output]
• Data mining
• Genomic computation
• Social networks
slide 43
Programming model in year 201X
• Thousands of users upload their data
  – Healthcare, shopping transactions, clickstream …
• Multiple third parties mine the data
• Example: health-care data
  – Incentive to contribute: cheaper insurance, new drug research, inventory control in drugstores …
  – Fear: what if someone targets my personal data?
    • Insurance company learns something about my health and increases my premium or denies coverage
slide 44
Privacy in the year 201X?
[Diagram: Health Data → untrusted MapReduce program → Output. Information leak?]
• Data mining
• Genomic computation
• Social networks
slide 45
Audit untrusted code?
• Audit MapReduce programs for correctness?
  – Hard to do! Enlightenment?
  – Also, where is the source code?
• Aim: confine the code instead of auditing
slide 46
Airavat
Framework for privacy-preserving MapReduce computations with untrusted code
[Diagram: untrusted program + protected data → Airavat]
slide 47
Airavat guarantee
Bounded information leak* about any individual data after performing a MapReduce computation.
*Differential privacy
[Diagram: untrusted program + protected data → Airavat]
slide 48
Background: MapReduce
map(k1, v1) → list(k2, v2)
reduce(k2, list(v2)) → list(v2)
[Diagram: Data 1–4 → Map phase → Reduce phase → Output]
slide 49
MapReduce example: counts no. of iPads sold
Input records: iPad, Tablet PC, iPad, Laptop

  Map(input) { if (input has iPad) print (iPad, 1) }
  Reduce(key, list(v)) { print (key + “,” + SUM(v)) }

Map phase emits (iPad, 1), (iPad, 1); Reduce phase applies SUM and outputs (iPad, 2)
slide 50
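A runnable sketch of the same job, with plain Python standing in for the MapReduce framework (my own simulation, not Hadoop or Airavat code):

```python
# Sketch: the iPad-counting job from the slide, simulated without a cluster.
from collections import defaultdict

def map_fn(record):
    if "iPad" in record:
        yield ("iPad", 1)

def reduce_fn(key, values):
    return (key, sum(values))

records = ["iPad", "Tablet PC", "iPad", "Laptop"]

# Map phase: emit (key, value) pairs, grouped by key for the reduce phase.
groups = defaultdict(list)
for r in records:
    for k, v in map_fn(r):
        groups[k].append(v)

# Reduce phase: one call per key.
print([reduce_fn(k, vs) for k, vs in groups.items()])   # [('iPad', 2)]
```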
Airavat model
• Airavat runs on the cloud infrastructure
  – Cloud infrastructure: hardware + VM
  – Airavat: modified MapReduce + DFS + JVM + SELinux
[Diagram: (1) trusted Airavat framework on the cloud infrastructure]
slide 51
Airavat model
• Data provider uploads her data on Airavat
  – Sets up certain privacy parameters
[Diagram: (1) trusted Airavat framework on the cloud infrastructure; (2) data provider]
slide 52
Airavat model
• Computation provider implements the data mining algorithm
  – Untrusted, possibly malicious
[Diagram: (1) trusted Airavat framework on the cloud infrastructure; (2) data provider; (3) computation provider submits the program and receives the output]
slide 53
Threat model
• Airavat runs the computation and protects the privacy of the input data
[Diagram: same setup as before; the computation provider’s program is the threat]
slide 54
Airavat programming model
• MapReduce program for data mining
• Split MapReduce into untrusted mapper + trusted reducer
  – Limited set of stock reducers
[Diagram: data → untrusted mapper (“no need to audit”) → trusted reducer → output, inside Airavat]
slide 55
Airavat programming model
• MapReduce program for data mining
• Need to confine the mappers!
• Guarantee: protect the privacy of input data
[Diagram: data → untrusted mapper (“no need to audit”) → trusted reducer → output, inside Airavat]
slide 56
Leaking via storage channels
Untrusted mapper code copies data, sends it over the network
[Diagram: records for Peter, Meg, Chris flow through Map and Reduce; the mapper leaks using system resources]
slide 57
Leaking via output
Output of the computation is also an information channel
Example: output 1 million if Peter bought Vi*gra
[Diagram: records for Peter, Meg, Chris flow through Map and Reduce to the output]
slide 58
Airavat mechanisms
• Mandatory access control: prevent leaks through storage channels like network connections, files …
• Differential privacy: prevent leaks through the output of the computation
[Diagram: Data → Map → Reduce → Output]
slide 59
Confining untrusted code
• Untrusted program: given by the computation provider
• MapReduce + DFS: add mandatory access control (MAC)
• SELinux: add MAC policy
[Diagram: untrusted program on top of MapReduce + DFS on top of SELinux, all inside Airavat]
slide 60
Confining untrusted code
• We add mandatory access control to the MapReduce framework
• Label input, intermediate values, output
• Malicious code cannot leak labeled data
[Diagram: Data 1–3 and the Output carry access control labels as they pass through MapReduce]
slide 61
Confining untrusted code
• SELinux policy to enforce MAC
• Creates trusted and untrusted domains
• Processes and files are labeled to restrict interaction
• Mappers reside in the untrusted domain
  – Denied network access, limited file system interaction
slide 62
Access control is not enough
• Labels can prevent the output from being read
• When can we remove the labels?
Example: a malicious mapper
  if (input belongs-to Peter) print (iPad, 1000000)
Input records: iPad, Tablet PC, iPad, Laptop (Peter’s purchase among them)
The Reduce phase SUMs the mapper outputs into (iPad, 1000002) instead of (iPad, 2)
Output leaks the presence of Peter!
slide 63
Differential privacy (intuition)
A mechanism is differentially private if every output is produced with similar probability whether any given input is included or not
Cynthia Dwork et al. Differential Privacy.
slide 64
Differential privacy (intuition)
A mechanism is differentially private if every output is produced with similar probability whether any given input is included or not
[Diagram: inputs A, B, C → F(x) → output distribution]
Cynthia Dwork et al. Differential Privacy.
slide 65
Differential privacy (intuition)
A mechanism is differentially private if every output is produced with similar probability whether any given input is included or not
[Diagram: F(x) on inputs A, B, C vs. F(x) on inputs A, B, C, D: similar output distributions]
Bounded risk for D if she includes her data!
Cynthia Dwork et al. Differential Privacy.
slide 66
Achieving differential privacy
• A simple differentially private mechanism
  – [Diagram: analyst asks “Tell me f(x)”; the database x1 … xn answers f(x) + noise]
• How much random noise should be added?
slide 67
Achieving differential privacy
• Function sensitivity (intuition): maximum effect of any single input on the output
  – Aim: “mask” this effect to ensure privacy
• Example: average height of the people in this room has low sensitivity
  – Any single person’s height does not affect the final average by too much
  – Calculating the maximum height has high sensitivity
slide 68
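The usual formal counterpart of this intuition (my phrasing of the standard Dwork et al. definition): for datasets x and x′ that differ in a single element,

\[ \Delta f \;=\; \max_{x,\,x'} \lVert f(x) - f(x') \rVert_{1}. \]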
Achieving differential privacy
• Function sensitivity (intuition): maximum effect of any single input on the output
  – Aim: “mask” this effect to ensure privacy
• Example: SUM over input elements drawn from [0, M]
  – [Diagram: X1 … X4 → SUM] Sensitivity = M: the max effect of any input element is M
slide 69
Achieving differential privacy
• A simple differentially private mechanism
  – [Diagram: analyst asks “Tell me f(x)”; the database x1 … xn answers f(x) + Lap(∆(f))]
• Intuition: noise needed to mask the effect of a single input
  – Lap = Laplace distribution
  – ∆(f) = sensitivity
slide 70
Dwork et al.
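A small sketch of this mechanism for a SUM query (my own illustration; the standard formulation scales the noise by sensitivity/ε, where ε is the privacy parameter the slide leaves implicit):

```python
# Sketch: the Laplace mechanism for a SUM query over values drawn from [0, M].
import random

def laplace_noise(scale):
    # Laplace(0, scale) sampled as a sign-flipped exponential draw.
    return random.choice([-1, 1]) * random.expovariate(1.0 / scale)

def private_sum(values, M, epsilon):
    sensitivity = M            # any single element changes SUM by at most M
    return sum(values) + laplace_noise(sensitivity / epsilon)

print(private_sum([3, 7, 2, 9], M=10, epsilon=0.5))
```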
Enforcing differential privacy
• Mapper can be any piece of Java code (“black box”) but … range of mapper outputs must be declared in advance
  – Used to estimate “sensitivity” (how much does a single input influence the output?)
  – Determines how much noise is added to outputs to ensure differential privacy
• Example: consider mapper range [0, M]
  – SUM has the estimated sensitivity of M
slide 71
Enforcing differential privacy
• Malicious mappers may output values outside the range
• If a mapper produces a value outside the range, it is replaced by a value inside the range
  – User not notified … otherwise possible information leak
[Diagram: Data 1–4 → mappers → range enforcers → reducer + noise]
Ensures that code is not more sensitive than declared
slide 72
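A toy sketch of how a range enforcer and a noisy reducer fit together (my own illustration of the idea, not Airavat’s actual implementation, which is Java inside the Hadoop stack):

```python
# Sketch: clamp mapper outputs to the declared range, then add Laplace noise
# scaled to that range in the trusted reducer.
import random

def laplace_noise(scale):
    # Laplace(0, scale) sampled as a sign-flipped exponential draw.
    return random.choice([-1, 1]) * random.expovariate(1.0 / scale)

def enforce_range(value, lo, hi):
    # A value outside [lo, hi] is silently replaced by one inside the range.
    return min(max(value, lo), hi)

def noisy_sum_reducer(values, lo, hi, epsilon):
    clamped = [enforce_range(v, lo, hi) for v in values]
    sensitivity = hi - lo      # max effect of one record once outputs are clamped
    return sum(clamped) + laplace_noise(sensitivity / epsilon)

# A malicious mapper tries to signal Peter's record by emitting 1000000;
# the range enforcer caps it at the declared range [0, 1].
print(noisy_sum_reducer([1, 1000000, 1], lo=0, hi=1, epsilon=1.0))
```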
Enforcing sensitivity
• All mapper invocations must be independent
• Mapper may not store an input and use it later when processing another input
  – Otherwise, range-based sensitivity estimates may be incorrect
• We modify the JVM to enforce mapper independence
  – Each object is assigned an invocation number
  – JVM instrumentation prevents reuse of objects from previous invocations
slide 73
What can we compute?
• Reducers are responsible for enforcing privacy
  – Add appropriate amount of random noise to the outputs
• Reducers must be trusted
  – Sample reducers: SUM, COUNT, THRESHOLD
  – Sufficient to perform data-mining algorithms, search log processing, simple statistical computations, etc.
• With trusted mappers, more general computations are possible
  – Use exact sensitivity instead of range-based estimates
slide 74
More reading
• Roy et al. “Airavat: Security and Privacy for MapReduce” (NSDI 2010)
slide 75