
Page 1: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Privacy: Lessons from the Past Decade

Vitaly Shmatikov
The University of Texas at Austin

Page 2: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Browsing history

Medical and genetic data

Web searches

Tastes, purchases

slide 2

Page 3: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Web tracking

slide 3

Page 4: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Social aggregation

Database marketing

Universal data accessibility

Aggregation

slide 4

Page 5: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Medical data

• Electronic medical records (EMR)
  – Cerner, Practice Fusion …
• Health-care datasets
  – Clinical studies, hospital discharge databases …
• Increasingly accompanied by DNA information
  – PatientsLikeMe.com

slide 5

Page 6: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

High-dimensional datasets

• Row = user record
• Column = dimension
  – Example: purchased items
• Thousands or millions of dimensions
  – Netflix movie ratings: 35,000
  – Amazon purchases: 10^7

slide 6

Page 7: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Sparsity and "Long Tail"

Netflix Prize dataset: considering just movie names, for 90% of records there isn't a single other record which is more than 30% similar.

Average record has no "similar" records.

(Figure: distribution of record-to-record similarity.)

slide 7

Page 8: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Graph-structured social data

• Node attributes
  – Interests
  – Group membership
  – Sexual orientation
• Edge attributes
  – Date of creation
  – Strength
  – Type of relationship
slide 8

Page 9: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

"Jefferson High": romantic and sexual network

Real data!

slide 9

Page 10: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Whose data is it, anyway?

• Social networks
  – Information about relationships is shared
• Genome
  – Shared with all blood relatives
• Recommender systems
  – Complex algorithms make it impossible to trace the origin of data

Traditional notion: everyone owns and should control their personal data.

slide 10

Page 11: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Famous privacy breaches

Search, Mini-feed, Beacon, Applications

Why did they happen?
slide 11

Page 12: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Data release today

• Datasets are "scrubbed" and published
• Why not interactive computation?
  – Infrastructure cost
  – Overhead of online privacy enforcement
  – Resource allocation and competition
  – Client privacy
• What about privacy of data subjects?
  – Answer: data have been ANONYMIZED

slide 12

Page 13: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

The crutch of anonymity

(U.S.) (U.K.)

Deals with ISPs to collect anonymized browsing data for highly targeted advertising. Users not notified.

Court ruling over YouTube user log data causes major privacy uproar. Deal to anonymize viewing logs satisfies all objections.
slide 13

Page 14: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Targeted advertising

"… breakthrough technology that uses social graph data to dramatically improve online marketing … 'Social Engagement Data' consists of anonymous information regarding the relationships between people"

"The critical distinction … between the use of personal information for advertisements in personally-identifiable form, and the use, dissemination, or sharing of information with advertisers in non-personally-identifiable form."

slide 14

Page 15: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

The myth of the PII

• Data are "scrubbed" by removing personally identifying information (PII)
  – Name, Social Security number, phone number, email, address… what else?
• Problem: PII has no technical meaning
  – Defined in disclosure notification laws
    • If certain information is lost, consumers must be notified
  – In privacy breaches, any information can be personally identifying

slide 15

Page 16: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

More reading

• Narayanan and Shmatikov. “Myths and Fallacies of ‘Personally Identifiable Information’ ” (CACM 2010)

slide 16

Page 17: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

De-identification

Tries to achieve "privacy" by syntactic transformation of the data
  – Scrubbing of PII, k-anonymity, l-diversity…

Fatally flawed!
  – Insecure against attackers with external information
  – Does not compose (anonymizing twice can reveal the data)
  – No meaningful notion of privacy
  – No meaningful notion of utility
slide 17

Page 18: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Latanya Sweeney’s attack (1997)

Massachusetts hospital discharge dataset

Public voter dataset
slide 18

Page 19: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Closer look at two records

Voter registration record: Name (Vitaly), Age (70), ZIP code (78705), Sex (Male)
  – Identifiable, no sensitive data

Patient record: Age (70), ZIP code (78705), Sex (Male), Disease (Jetlag)
  – Anonymized, contains sensitive data

slide 19

Page 20: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Database join

Joining the two records on (Age, ZIP code, Sex) yields: Name (Vitaly), Age (70), ZIP code (78705), Sex (Male), Disease (Jetlag).

Vitaly suffers from jetlag!
slide 20

Page 21: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Observation #1: data joins

• Attacker learns sensitive data by joining two datasets on common attributes
  – Anonymized dataset with sensitive attributes
    • Example: age, race, symptoms
  – "Harmless" dataset with individual identifiers
    • Example: name, address, age, race
• Demographic attributes (age, ZIP code, race, etc.) are very common (see the sketch below)
slide 21
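To make the join concrete, here is a minimal Python sketch of such a linkage attack. The records and field names are hypothetical stand-ins for the slide's voter-roll/hospital-discharge example:

# Linkage attack sketch: join a "de-identified" table with a public,
# identified table on shared demographic attributes (hypothetical data).
anonymized = [
    {"age": 70, "zip": "78705", "sex": "M", "disease": "Jetlag"},
    {"age": 34, "zip": "78712", "sex": "F", "disease": "Flu"},
]
voter_roll = [
    {"name": "Vitaly", "age": 70, "zip": "78705", "sex": "M"},
    {"name": "Alice",  "age": 34, "zip": "78701", "sex": "F"},
]

def link(anon, public, keys=("age", "zip", "sex")):
    """Return (name, disease) pairs for records that agree on every join key."""
    return [(p["name"], a["disease"])
            for a in anon for p in public
            if all(a[k] == p[k] for k in keys)]

print(link(anonymized, voter_roll))   # [('Vitaly', 'Jetlag')]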

Page 22: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Observation #2: quasi-identifiers

• Sweeney's observation: (birthdate, ZIP code, gender) uniquely identifies 87% of the US population
  – Side note: actually, only 63%
• Publishing a record with a quasi-identifier is as bad as publishing it with an explicit identity
• Eliminating quasi-identifiers is not desirable
  – For example, users of the dataset may want to study the distribution of diseases by age and ZIP code

slide 22

[Golle WPES ‘06]
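One way to check this kind of uniqueness estimate against a dataset you actually hold is to count how many records share each quasi-identifier value. A minimal sketch, assuming a hypothetical `population` list of dicts with birthdate, zip, and sex fields:

from collections import Counter

def fraction_unique(population, quasi_id=("birthdate", "zip", "sex")):
    """Fraction of records whose quasi-identifier value occurs exactly once."""
    counts = Counter(tuple(rec[k] for k in quasi_id) for rec in population)
    return sum(1 for rec in population
               if counts[tuple(rec[k] for k in quasi_id)] == 1) / len(population)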

Page 23: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

k-anonymity

• Proposed by Samarati and Sweeney
  – First appears in an SRI tech report (1998)
• Hundreds of papers since then
  – Extremely popular in the database and data-mining communities (SIGMOD, ICDE, KDD, VLDB)
• Many k-anonymization algorithms, most based on generalization and suppression of quasi-identifiers

slide 23

Page 24: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Anonymization in a nutshell

• Dataset is a relational table
• Attributes (columns) are divided into quasi-identifiers and sensitive attributes
• Generalize/suppress quasi-identifiers, but don't touch sensitive attributes (keep them "truthful"); a sketch of these operations follows below

Race | Age | Symptoms | Blood type | Medical history
 …   |  …  |    …     |     …      |       …
 …   |  …  |    …     |     …      |       …

slide 24
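A minimal sketch of the two standard operations, assuming hypothetical records with age and ZIP quasi-identifiers (sensitive attributes are passed through untouched):

def generalize(record):
    """Coarsen quasi-identifiers: 5-digit ZIP -> 3-digit prefix, age -> decade."""
    out = dict(record)
    out["zip"] = record["zip"][:3] + "XX"        # 78705 -> 787XX
    decade = (record["age"] // 10) * 10
    out["age"] = f"{decade}-{decade + 9}"        # 70 -> 70-79
    return out

def suppress(record, attrs=("zip",)):
    """Replace selected quasi-identifiers with a wildcard."""
    return {k: ("*" if k in attrs else v) for k, v in record.items()}

print(generalize({"race": "Caucas", "age": 70, "zip": "78705", "symptoms": "Flu"}))
# {'race': 'Caucas', 'age': '70-79', 'zip': '787XX', 'symptoms': 'Flu'}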

Page 25: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

k-anonymity: definition

• Any (transformed) quasi-identifier must appear in at least k records in the anonymized dataset
  – k is chosen by the data owner (how?)
  – Example: any age-race combination from the original DB must appear at least 10 times in the anonymized DB
• Guarantees that any join on quasi-identifiers with the anonymized dataset will contain at least k records for each quasi-identifier

slide 25
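The definition translates directly into a check. A minimal sketch (hypothetical list-of-dicts table):

from collections import Counter

def is_k_anonymous(table, quasi_ids, k):
    """True iff every combination of quasi-identifier values occurs >= k times."""
    counts = Counter(tuple(row[a] for a in quasi_ids) for row in table)
    return all(c >= k for c in counts.values())

# e.g. is_k_anonymous(rows, quasi_ids=("race", "zip"), k=3) for the table on slide 30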

Page 26: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Two (and a half) interpretations

• Membership disclosure: cannot tell that a given person is in the dataset
  – This interpretation is correct (assuming the attacker only knows quasi-identifiers)
• Sensitive attribute disclosure: cannot tell that a given person has a certain sensitive attribute
  – Does not imply any privacy! Example: k clinical records, all HIV+
• Identity disclosure: cannot tell which record corresponds to a given person
  – This interpretation is correct (assuming the attacker only knows quasi-identifiers)

slide 26

Page 27: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Curse of dimensionality

• Generalization fundamentally relies on spatial locality
  – Each record must have k close neighbors
• Real-world datasets are very sparse
  – Netflix Prize dataset: 17,000 dimensions
  – Amazon: several million dimensions
  – "Nearest neighbor" is very far
• Projection to low dimensions loses all the information

k-anonymized datasets are useless

Aggarwal VLDB ‘05

slide 27

Page 28: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

k-anonymity: definition… or how not to define privacy

• Any (transformed) quasi-identifier must appear in at least k records in the anonymized dataset
  – Does not mention sensitive attributes at all!
  – Does not say anything about the computations to be done on the data
  – Assumes that the attacker will be able to join only on quasi-identifiers

slide 28

Page 29: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Sensitive attribute disclosure

Intuitive reasoning:
• k-anonymity prevents the attacker from telling which record corresponds to which person
• Therefore, the attacker cannot tell that a certain person has a particular value of a sensitive attribute

This reasoning is fallacious!

slide 29

Page 30: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

3-anonymization

Original table:
Caucas | 78712 | Flu
Asian  | 78705 | Shingles
Caucas | 78754 | Flu
Asian  | 78705 | Acne
AfrAm  | 78705 | Acne
Caucas | 78705 | Flu

3-anonymized table:
Caucas      | 787XX | Flu
Asian/AfrAm | 78705 | Shingles
Caucas      | 787XX | Flu
Asian/AfrAm | 78705 | Acne
Asian/AfrAm | 78705 | Acne
Caucas      | 787XX | Flu

This is 3-anonymous, right?
slide 30

Page 31: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Joining with external database

External (identified) database:
   …    |   …    |   …
 Vitaly | Caucas | 78705
   …    |   …    |   …

3-anonymized table:
Caucas      | 787XX | Flu
Asian/AfrAm | 78705 | Shingles
Caucas      | 787XX | Flu
Asian/AfrAm | 78705 | Acne
Asian/AfrAm | 78705 | Acne
Caucas      | 787XX | Flu

Problem: sensitive attributes are not "diverse" within each quasi-identifier group

slide 31

Page 32: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Another attempt: l-diversity

Caucas      | 787XX | Flu
Caucas      | 787XX | Shingles
Caucas      | 787XX | Acne
Caucas      | 787XX | Flu
Caucas      | 787XX | Acne
Caucas      | 787XX | Flu
Asian/AfrAm | 78XXX | Flu
Asian/AfrAm | 78XXX | Flu
Asian/AfrAm | 78XXX | Acne
Asian/AfrAm | 78XXX | Shingles
Asian/AfrAm | 78XXX | Acne
Asian/AfrAm | 78XXX | Flu

Entropy of the sensitive attribute within each quasi-identifier group must be at least log(l)

slide 32

Machanavajjhala et al. ICDE ‘06
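In Machanavajjhala et al.'s entropy formulation, each quasi-identifier group must satisfy H(sensitive values) >= log(l). A minimal sketch of the check (hypothetical list-of-dicts table):

import math
from collections import Counter, defaultdict

def is_entropy_l_diverse(table, quasi_ids, sensitive, l):
    """True iff every quasi-identifier group has sensitive-value entropy >= log(l)."""
    groups = defaultdict(list)
    for row in table:
        groups[tuple(row[a] for a in quasi_ids)].append(row[sensitive])
    for values in groups.values():
        n = len(values)
        entropy = -sum((c / n) * math.log(c / n) for c in Counter(values).values())
        if entropy < math.log(l):
            return False
    return True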

Page 33: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Failure of l-diversity

Original database (sensitive attribute only):
Cancer, Cancer, Cancer, Flu, Cancer, Cancer, Cancer, Cancer, Cancer, Cancer, Flu, Flu
99% have cancer.

Anonymization A:
  Q1 group: Flu, Flu, Cancer, Flu, Cancer, Cancer
  Q2 group: Cancer, Cancer, Cancer, Cancer, Cancer, Cancer

Anonymization B:
  Q1 group: Flu, Cancer, Cancer, Cancer, Cancer, Cancer
  Q2 group: Cancer, Cancer, Cancer, Cancer, Flu, Flu

A quasi-identifier group with 50% cancer is "diverse", yet this leaks a ton of information!

A quasi-identifier group with 99% cancer is not "diverse", yet the anonymized database does not leak anything.

slide 33

Page 34: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Membership disclosure

• With high probability, a quasi-identifier uniquely identifies an individual in the population
• Modifying quasi-identifiers in the dataset does not affect their frequency in the population!
  – Suppose the anonymized dataset contains 10 records with a certain quasi-identifier … and there are 10 people in the population who match it
• k-anonymity may not hide whether a given person is in the dataset

Nergiz et al. SIGMOD ‘07

slide 34

Page 35: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

What does attacker know?

Race        | ZIP   | HIV status | Disease
Caucas      | 787XX | HIV+       | Flu
Asian/AfrAm | 787XX | HIV-       | Flu
Asian/AfrAm | 787XX | HIV+       | Shingles
Caucas      | 787XX | HIV-       | Acne
Caucas      | 787XX | HIV-       | Shingles
Caucas      | 787XX | HIV-       | Acne

Attacker: "Bob is Caucasian and I heard he was admitted to hospital with flu…"
(The only Caucasian record with flu is HIV+.)

"This is against the rules! 'Flu' is not a quasi-identifier."
Yes… and this is yet another problem with k-anonymity!

slide 35

Page 36: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Other problems with k-anonymity

• Multiple releases of the same dataset break anonymity

• Mere knowledge of the k-anonymization algorithm is enough to reverse anonymization

slide 36

Ganta et al. KDD ‘08

Zhang et al. CCS ‘07

Page 37: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

k-Anonymity considered harmful

• Syntactic
  – Focuses on data transformation, not on what can be learned from the anonymized dataset
  – A "k-anonymous" dataset can leak sensitive information
• "Quasi-identifier" fallacy
  – Assumes a priori that the attacker will not know certain information about his target
• Relies on locality
  – Destroys the utility of many real-world datasets

slide 37

Page 38: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

HIPAA Privacy Rule

“The identifiers that must be removed include direct identifiers, such as name, street address, social security number, as well as other identifiers, such as birth date, admission and discharge dates, and five-digit zip code. The safe harbor requires removal of geographic subdivisions smaller than a State, except for the initial three digits of a zip code if the geographic unit formed by combining all zip codes with the same initial three digits contains more than 20,000 people. In addition, age, if less than 90, gender, ethnicity, and other demographic information not listed may remain in the information. The safe harbor is intended to provide covered entities with a simple, definitive method that does not require much judgment by the covered entity to determine if the information is adequately de-identified."

"Under the safe harbor method, covered entities must remove all of a list of 18 enumerated identifiers and have no actual knowledge that the information remaining could be used, alone or in combination, to identify a subject of the information."

slide 38
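As an illustration only (not the full Safe Harbor rule: the 18-identifier list and the 20,000-person ZIP exception, which needs census data, are omitted), two of the generalizations quoted above look roughly like this:

def safe_harbor_zip(zip5):
    """Keep only the initial three digits of a five-digit ZIP code."""
    return zip5[:3] + "**"

def safe_harbor_age(age):
    """Ages 90 and above are aggregated into a single category."""
    return "90+" if age >= 90 else age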

Page 39: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Lessons

• Anonymization does not work
• "Personally identifiable" is meaningless
  – Originally a legal term; unfortunately, it crept into technical language in terms such as "quasi-identifier"
  – Any piece of information is potentially identifying if it reduces the space of possibilities
  – Background information about people is easy to obtain
• Linkage of information across virtual identities allows large-scale de-anonymization

slide 39

Page 40: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

How to do it right

• Privacy is not a property of the data
  – Syntactic definitions such as k-anonymity are doomed to fail
• Privacy is a property of the computation carried out on the data
• The definition of privacy must be robust in the presence of auxiliary information: differential privacy (Dwork et al. '06-10)

slide 40

Page 41: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Differential privacy (intuition)

A mechanism is differentially private if every output is produced with similar probability whether any given input is included or not.

(Figure: the mechanism run on inputs {A, B, C, D} and on inputs {A, B, D} has similar output distributions.)

Risk for C does not increase much if her data are included in the computation.
slide 41
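In symbols (Dwork et al.'s standard definition): a randomized mechanism M is ε-differentially private if, for every pair of datasets D and D′ differing in the data of a single individual and for every set S of outputs,

  Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S]

Smaller ε means the two output distributions are closer, i.e., the output reveals less about whether any one person's data was used.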

Page 42: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Computing in the year 201X

Illusion of infinite resources
Pay only for resources used
Quickly scale up or scale down …

(Figure: data stored in the cloud.)

slide 42

Page 43: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Programming model in year 201X

• Frameworks available to ease cloud programming

• MapReduce: parallel processing on clusters of machines

(Figure: Data → Map → Reduce → Output)

• Data mining
• Genomic computation
• Social networks

slide 43

Page 44: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Programming model in year 201X

• Thousands of users upload their data
  – Healthcare, shopping transactions, clickstream…
• Multiple third parties mine the data
• Example: health-care data
  – Incentive to contribute: cheaper insurance, new drug research, inventory control in drugstores…
  – Fear: what if someone targets my personal data?
    • Insurance company learns something about my health and increases my premium or denies coverage

slide 44

Page 45: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Privacy in the year 201X ?

(Figure: health data flows into an untrusted MapReduce program, which produces an output. Information leak?)

• Data mining
• Genomic computation
• Social networks

slide 45

Page 46: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Audit untrusted code?

• Audit MapReduce programs for correctness?
  – Hard to do! Enlightenment?
  – Also, where is the source code?
• Aim: confine the code instead of auditing it

slide 46

Page 47: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Airavat

Framework for privacy-preserving MapReduce computations with untrusted code.

(Figure: untrusted program + protected data → Airavat)

slide 47

Page 48: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Airavat guarantee

Bounded information leak* about any individual's data after performing a MapReduce computation.

*Differential privacy

(Figure: untrusted program + protected data → Airavat)

slide 48

Page 49: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Background: MapReduce

map(k1, v1) → list(k2, v2)
reduce(k2, list(v2)) → list(v2)

(Figure: Data 1–4 flow through the Map phase, then the Reduce phase, to the Output.)
slide 49

Page 50: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

MapReduce example

Counts the number of iPads sold. Input records: iPad, Tablet, PC, iPad, Laptop.

Map(input)           { if (input has iPad) print (iPad, 1) }
Reduce(key, list(v)) { print (key + "," + SUM(v)) }

Map phase: emits (iPad, 1), (iPad, 1)
Reduce phase: SUM yields (iPad, 2)
slide 50
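The same example as a small self-contained Python sketch (single-process; a real MapReduce framework shards the map and reduce phases across a cluster):

from collections import defaultdict

def map_fn(item):
    return [("iPad", 1)] if item == "iPad" else []

def reduce_fn(key, values):
    return (key, sum(values))          # the SUM reducer

def mapreduce(data):
    groups = defaultdict(list)
    for item in data:                  # map phase
        for k, v in map_fn(item):
            groups[k].append(v)
    return [reduce_fn(k, vs) for k, vs in groups.items()]   # reduce phase

print(mapreduce(["iPad", "Tablet", "PC", "iPad", "Laptop"]))   # [('iPad', 2)]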

Page 51: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Airavat model

• Airavat runs on the cloud infrastructure
  – Cloud infrastructure: hardware + VM
  – Airavat: modified MapReduce + DFS + JVM + SELinux

(Figure: (1) Airavat framework running on the trusted cloud infrastructure.)

slide 51

Page 52: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Airavat model

• Data provider uploads her data to Airavat
  – Sets up certain privacy parameters

(Figure: (1) Airavat framework on the trusted cloud infrastructure; (2) data provider uploads data.)

slide 52

Page 53: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Airavat model

• Computation provider implements the data-mining algorithm
  – Untrusted, possibly malicious

(Figure: (1) Airavat framework on the trusted cloud infrastructure; (2) data provider; (3) computation provider submits a program and receives the output.)

slide 53

Page 54: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Threat model

• Airavat runs the computation and protects the privacy of the input data

(Figure: as before; the computation provider's program is the threat, while the Airavat framework and the cloud infrastructure are trusted.)

slide 54

Page 55: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Airavat programming model

• MapReduce program for data mining
• Split MapReduce into an untrusted mapper + a trusted reducer
  – No need to audit the mapper
  – Limited set of stock (trusted) reducers

(Figure: data flows through the untrusted mapper, then the trusted reducer, inside Airavat.)

slide 55

Page 56: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Airavat programming model

• MapReduce program for data mining: no need to audit it
• Need to confine the mappers!
• Guarantee: protect the privacy of the input data

(Figure: data flows through the untrusted mapper, then the trusted reducer, inside Airavat.)

slide 56

Page 57: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Leaking via storage channels

Untrusted mapper code copies data and sends it over the network: it leaks using system resources.

(Figure: records for Peter, Meg, and Chris flow through Map and Reduce; the mapper exfiltrates Peter's record.)
slide 57

Page 58: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Leaking via output

The output of the computation is also an information channel.

Example: output 1 million if Peter bought Vi*gra.

(Figure: records for Peter, Meg, and Chris flow through Map and Reduce to the output.)

slide 58

Page 59: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Airavat mechanisms

• Mandatory access control: prevents leaks through storage channels such as network connections, files…
• Differential privacy: prevents leaks through the output of the computation

(Figure: Data → Map → Reduce → Output; mandatory access control guards the computation, differential privacy guards the output.)

slide 59

Page 60: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Confining untrusted code

• Untrusted program: given by the computation provider
• MapReduce + DFS: add mandatory access control (MAC)
• SELinux: add the MAC policy

Together these layers form Airavat.

slide 60

Page 61: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Confining untrusted code

(Airavat layers: untrusted program, MapReduce + DFS, SELinux)

• We add mandatory access control to the MapReduce framework
• Label the input, intermediate values, and output
• Malicious code cannot leak labeled data

(Figure: Data 1–3 and the Output carry access-control labels through MapReduce.)

slide 61

Page 62: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Confining untrusted code

(Airavat layers: untrusted program, MapReduce + DFS, SELinux)

• SELinux policy to enforce MAC
• Creates trusted and untrusted domains
• Processes and files are labeled to restrict interaction
• Mappers reside in the untrusted domain
  – Denied network access, limited file-system interaction
slide 62

Page 63: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Access control is not enough

• Labels can prevent the output from being read
• When can we remove the labels?

Example: a malicious mapper runs
  if (input belongs-to Peter) print (iPad, 1000000)
over the input records iPad, Tablet, PC, iPad, Laptop (one of the iPads is Peter's). The map phase emits (iPad, 1000001) and (iPad, 1); the SUM reducer outputs (iPad, 1000002).

The output leaks the presence of Peter!

slide 63

Page 64: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Differential privacy (intuition)

A mechanism is differentially private if every output is produced with similar probability whether any given input is included or not

Cynthia Dwork et al. Differential Privacy.

slide 64

Page 65: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Differential privacy (intuition)

A mechanism is differentially private if every output is produced with similar probability whether any given input is included or not

(Figure: output distribution of F(x) computed over inputs A, B, C.)

Cynthia Dwork et al. Differential Privacy.

slide 65

Page 66: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Differential privacy (intuition)

A mechanism is differentially private if every output is produced with similar probability whether any given input is included or not

Similar output distributions: bounded risk for D if she includes her data!

(Figure: the output distribution of F(x) over inputs A, B, C versus the output distribution of F(x) over inputs A, B, C, D.)

Cynthia Dwork et al. Differential Privacy.

slide 66

Page 67: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Achieving differential privacy

• A simple differentially private mechanism

• How much random noise should be added?

(Figure: the analyst asks "Tell me f(x)" over a database x1 … xn; the mechanism answers f(x) + noise.)

slide 67

Page 68: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Achieving differential privacy

• Function sensitivity (intuition): the maximum effect of any single input on the output
  – Aim: "mask" this effect to ensure privacy
• Example: the average height of the people in this room has low sensitivity
  – Any single person's height does not affect the final average by too much
  – Calculating the maximum height has high sensitivity

slide 68
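Formally, the (global) sensitivity of f is the largest change in f caused by adding or removing one person's data:

  ∆(f) = max over neighboring datasets D, D′ of ‖f(D) − f(D′)‖

For the average height of n people bounded by H, this is on the order of H/n (low); for the maximum height it can be as large as H (high).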

Page 69: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Achieving differential privacy

• Function sensitivity (intuition): the maximum effect of any single input on the output
  – Aim: "mask" this effect to ensure privacy
• Example: SUM over input elements drawn from [0, M]

(Figure: X1, X2, X3, X4 → SUM.) Sensitivity = M: the maximum effect of any single input element is M.

slide 69

Page 70: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Achieving differential privacy

• A simple differentially private mechanism: answer with f(x) + Lap(∆(f))

(Figure: the analyst asks "Tell me f(x)" over a database x1 … xn.)

• Intuition: the noise needed to mask the effect of a single input
• Lap = Laplace distribution
• ∆(f) = sensitivity

slide 70

Dwork et al.
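A minimal Python sketch of this mechanism, using the standard formulation in which the noise scale is sensitivity divided by the privacy parameter ε (the slide omits ε for simplicity); the SUM example and the value of M below are hypothetical:

import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon):
    """Return f(x) plus Laplace noise scaled to sensitivity/epsilon."""
    return true_answer + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

# SUM over values drawn from [0, M] has sensitivity M.
values, M, epsilon = [3.0, 7.0, 2.0], 10.0, 0.5
noisy_sum = laplace_mechanism(sum(values), sensitivity=M, epsilon=epsilon)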

Page 71: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Enforcing differential privacy

• Mapper can be any piece of Java code (a "black box"), but the range of mapper outputs must be declared in advance
  – Used to estimate "sensitivity" (how much does a single input influence the output?)
  – Determines how much noise is added to the outputs to ensure differential privacy
• Example: consider mapper range [0, M]
  – SUM has an estimated sensitivity of M
slide 71

Page 72: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Enforcing differential privacy

• Malicious mappers may output values outside the declared range
• If a mapper produces a value outside the range, it is replaced by a value inside the range
  – The user is not notified… otherwise, possible information leak

(Figure: Data 1–4 → mappers → range enforcers → reducer + noise.) The range enforcer ensures that the code is not more sensitive than declared.

slide 72
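This is not Airavat's actual code, just a sketch of the idea: clamping every mapper output into its declared range guarantees that the declared range, and hence the noise added by the reducer, really bounds any single input's influence.

def enforce_range(value, lo, hi):
    """Silently replace out-of-range mapper outputs with the nearest in-range value.
    (Notifying the user would itself leak information about the data.)"""
    return min(max(value, lo), hi)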

Page 73: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

Enforcing sensitivity

• All mapper invocations must be independent
• A mapper may not store an input and use it later when processing another input
  – Otherwise, range-based sensitivity estimates may be incorrect
• We modify the JVM to enforce mapper independence
  – Each object is assigned an invocation number
  – JVM instrumentation prevents reuse of objects from previous invocations
slide 73

Page 74: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

What can we compute?

• Reducers are responsible for enforcing privacy
  – Add an appropriate amount of random noise to the outputs
• Reducers must be trusted
  – Sample reducers: SUM, COUNT, THRESHOLD
  – Sufficient to perform data-mining algorithms, search-log processing, simple statistical computations, etc.
• With trusted mappers, more general computations are possible
  – Use exact sensitivity instead of range-based estimates
slide 74

Page 75: Privacy: Lessons from the Past Decade Vitaly Shmatikov The University of Texas at Austin

More reading

• Roy et al. “Airavat: Security and Privacy for MapReduce” (NSDI 2010)

slide 75