The Role of History and Prediction in Data Privacy

Kristen LeFevre

University of Michigan

May 13, 2009

Data Privacy

• Personal information collected every day
– Healthcare, insurance information
– Supermarket transaction data
– RFID, GPS data
– E-mail
– Employment history
– Web search / clickstream

Data Privacy

• Legal, ethical, and technical issues surrounding
– Data ownership
– Data collection
– Data dissemination and use

• Considerable recent interest from the technical community
– High-profile mishaps and lawsuits
– Compliance with data-sharing mandates

Privacy Protection Technologies for Public Datasets

• Goal: Protect sensitive personal information while preserving data utility

• Privacy Policies and Mechanisms

• Example Policies:
– Protect individual identities
– Protect the values of sensitive attributes
– Differential privacy [Dwork 06]

• Example Mechanisms:
– Generalize (“coarsen”) the data
– Aggregate the data
– Add random noise to the data
– Add random noise to query results

Observations

• Much work has focused on static data
– One-time snapshot publishing
– Disclosure by composing multiple different snapshots of a static database [Xiao 07, Ganta 08]
– Auditing queries on a static database [Chin 81, Kenthapadi 06, …]

• What are the unique challenges when the data evolves over time?

Outline

• Sample Problem: Continuously publishing privacy-sensitive GPS traces
– Motivation & problem setup
– Framework for reasoning about privacy
– Algorithms for continuous publishing
– Experimental results

• Applications to other dynamic data (speculation)

GPS Traces (ongoing work w/ Wen Jin, Jignesh Patel)

• GPS devices attached to phones, cars

• Interest in collecting and distributing location traces in real time
– Real-time traffic reporting
– Adaptive pricing / placement of outdoor ads

• Simultaneous concern for personal privacy

• Challenge: Can we continuously collect and publish location traces without compromising individual privacy?

Problem Setting

[Figure: GPS users report their locations (e.g., at 7 AM and again at 7:05 AM) to a central trace repository. A privacy policy is applied there, and a “sanitized” location snapshot for each epoch is published to the data recipient.]

Problem Setting

• Finite population of n users with unique identifiers {u1,…,un}

• Assume users’ locations are reported and published in discrete epochs t1,t2,…

• Location snapshot D(tj)
– Associates each user with a location during epoch tj

• Publish sanitized version D*(tj)

Threat Model

• Attacker wants to determine the location of a target user ui during epoch tj

• Auxiliary Information: Attacker knows location information during some other epochs (e.g., Yellow Pages)

Some Naïve Solutions

• Strawman 1: Replace users’ identifiers ($\{u_1, \ldots, u_n\}$) with pseudonyms ($\{p_1, \ldots, p_n\}$)
– Problem: Once attacker “unmasks” user $p_i$, he can track her location forever

• Strawman 2: New pseudonyms ($\{p_1^j, \ldots, p_n^j\}$) at each epoch $t_j$
– Problem: Users can still be tracked using multi-target tracking tools [Gruteser 05, Krumm 07]
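To make the problem with Strawman 2 concrete, here is a minimal sketch of how fresh pseudonyms can be re-linked across epochs. It is a greedy nearest-neighbor matcher, far simpler than the multi-target tracking tools cited above; the function name `link_pseudonyms`, the coordinates, and the pseudonym labels are all hypothetical.

```python
import math

def link_pseudonyms(prev_snapshot, next_snapshot):
    """Greedily link each pseudonym at one epoch to the nearest
    still-unclaimed pseudonym at the next epoch.
    Both arguments map pseudonym -> (x, y)."""
    links = {}
    unclaimed = dict(next_snapshot)
    for p_prev, loc_prev in prev_snapshot.items():
        nearest = min(unclaimed, key=lambda p: math.dist(loc_prev, unclaimed[p]))
        links[p_prev] = nearest
        del unclaimed[nearest]
    return links

# Hypothetical data: three users who receive fresh pseudonyms at each epoch.
epoch_0 = {"a1": (0.0, 0.0), "a2": (10.0, 0.0), "a3": (0.0, 10.0)}
epoch_1 = {"b7": (0.2, 10.1), "b8": (0.3, 0.1), "b9": (10.2, 0.1)}
links = link_pseudonyms(epoch_0, epoch_1)
```

When users move only a short distance between epochs, as in this toy data, the matcher recovers every link even though no pseudonym is reused.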

Key Problem: Motion Prediction

[Figure: {Alice, Bob, Charlie} occupy locations 1, 2, 3 in one epoch and locations 4, 5, 6 in the next; one location in each epoch is labeled “Alice”.]

What if the speed limit is 60 mph?

Threat Model

• Attacker wants to determine the location of a target user ui during epoch tj

• Auxiliary Information: Attacker knows location information during some other epochs (e.g., Yellow Pages)

• Motion prediction: Given one or more locations for ui, attacker can predict (probabilistically) ui’s location during following and preceding epochs
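The speed-limit question from the previous slide can be sketched as a reachability filter: given one known location, the attacker discards every published location the target could not have reached in one epoch. This is only an illustration under assumed numbers (positions in miles, 5-minute epochs, a hard 60 mph cap); the function name `reachable` is hypothetical.

```python
import math

def reachable(prev_loc, candidates, max_speed, epoch_length):
    """Return the candidate locations that lie within
    max_speed * epoch_length of prev_loc."""
    limit = max_speed * epoch_length
    return [c for c in candidates if math.dist(prev_loc, c) <= limit]

# Assumed values: the attacker knows Alice's location at the earlier epoch
# (auxiliary information) and sees three unlabeled locations at the next one.
alice_at_t0 = (0.0, 0.0)
published_at_t1 = [(3.0, 2.0), (9.0, 1.0), (12.0, 7.0)]
feasible = reachable(alice_at_t0, published_at_t1,
                     max_speed=60.0, epoch_length=5 / 60)
```

Only one of the three published locations is within 5 miles of Alice's known position, so hiding her among three users provides no protection in this toy case.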

Privacy Principle: Temporal Unlinkability

• Consider an attacker who is able to identify (locate) target user uj during m sequential epochs

• Under reasonable assumptions, he should not be able to locate uj with high confidence during any other epochs*

*Similar in spirit to “mix zones” [Beresford 03], which addressed a related problem in a less-formal way.

Sanitization Mechanism

• Needed to select a sanitization mechanism; chose one for maximum flexibility

• Assign each user $u_i$ a consistent pseudonym $p_i$

• Divide users into clusters
– Within each cluster, break the association between pseudonym and location

• Release candidate for $D(t_j)$:
$D^*(t_j) = \{(C_1(t_j), L_1(t_j)), \ldots, (C_B(t_j), L_B(t_j))\}$
– $\bigcup_{i=1}^{B} C_i(t_j) = \{p_1, \ldots, p_n\}$
– $C_i(t_j) \cap C_h(t_j) = \emptyset$ for $i \neq h$
– Each $L_i(t_j)$ contains the locations of the users in $C_i(t_j)$
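The cluster-based mechanism can be sketched in a few lines. This is an illustrative reading of the definition above, not the authors' implementation: the function name `sanitize`, the dictionary representation of a snapshot, and the sample coordinates are assumptions, and the clustering itself is taken as given.

```python
import random

def sanitize(snapshot, clusters, rng=random):
    """Build a release candidate D*(t_j).

    snapshot: dict pseudonym -> location (the pseudonymized D(t_j)).
    clusters: list of disjoint pseudonym sets that together cover
              every pseudonym in the snapshot.
    Returns a list of (C_i, L_i) pairs; each L_i holds the cluster's
    locations in shuffled order, so no location is tied to a pseudonym.
    """
    covered = [p for c in clusters for p in c]
    if len(covered) != len(set(covered)) or set(covered) != set(snapshot):
        raise ValueError("clusters must partition the pseudonyms")
    release = []
    for cluster in clusters:
        locations = [snapshot[p] for p in sorted(cluster)]
        rng.shuffle(locations)  # break the pseudonym/location association
        release.append((frozenset(cluster), locations))
    return release

# Hypothetical snapshot with four pseudonyms and two clusters.
snapshot_t0 = {"p1": (1, 1), "p2": (2, 5), "p3": (7, 3), "p4": (8, 8)}
release_t0 = sanitize(snapshot_t0, [{"p1", "p2"}, {"p3", "p4"}],
                      rng=random.Random(0))
```

The partition check mirrors the two set conditions on the clusters; the shuffle is what separates each pseudonym from its location within a cluster.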

Sanitization Mechanism: Example

• Pseudonyms {p1, p2, p3, p4}

• Epoch t0: clusters {p1,p2} and {p3,p4}; published locations 1 to 4

• Epoch t1: clusters {p1,p2} and {p3,p4}; published locations 5 to 8

• Epoch t2: clusters {p1,p3} and {p2,p4}; published locations 9 to 12

Reasoning about Privacy

• How can we guarantee temporal unlinkability under the threats of auxiliary information and motion prediction?
– (Using the cluster-based sanitization mechanism)

• Novel framework with two key components
– Motion model describes location correlations between epochs
– Breach probability function describes an attacker’s ability to compromise temporal unlinkability

Motion Models

• Model motion using an h-step Markov chain
– Conditional probability for a user’s location, given his location during the h prior (or future) epochs
– Same motion model used by attacker and publisher

• Forward motion model template
– $\Pr[\mathrm{Loc}(P,T_j) = L_j \mid \mathrm{Loc}(P,T_{j-1}) = L_{j-1}, \ldots, \mathrm{Loc}(P,T_{j-h}) = L_{j-h}]$

• Backward motion model template
– $\Pr[\mathrm{Loc}(P,T_j) = L_j \mid \mathrm{Loc}(P,T_{j+1}) = L_{j+1}, \ldots, \mathrm{Loc}(P,T_{j+h}) = L_{j+h}]$

• Independent and replaceable component
– For this work, used a 1-step motion model based on velocity distribution (speed and direction)
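As a rough illustration of a 1-step forward model, the sketch below scores candidate locations by the speed needed to reach them. The talk does not give the actual velocity distribution, so everything here is assumed: a Gaussian over speed, no direction term, normalization over a finite candidate set, and the name `forward_model`.

```python
import math

def forward_model(prev_loc, candidates, mean_speed, speed_std, epoch_length):
    """Toy 1-step forward motion model over a finite candidate set.

    Scores each candidate by a Gaussian density on the speed needed to
    reach it from prev_loc in one epoch, then normalizes the scores so
    they sum to 1. Direction is ignored (a simplifying assumption).
    """
    weights = []
    for loc in candidates:
        speed = math.dist(prev_loc, loc) / epoch_length
        z = (speed - mean_speed) / speed_std
        weights.append(math.exp(-0.5 * z * z))
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all candidates are numerically unreachable")
    return {loc: w / total for loc, w in zip(candidates, weights)}

# Assumed units: distance per epoch, with a typical speed of 1.0.
probs = forward_model((0.0, 0.0), [(1.0, 0.0), (3.0, 0.0)],
                      mean_speed=1.0, speed_std=0.5, epoch_length=1.0)
```

Because the motion model is described as an independent, replaceable component, any function with this shape (previous location in, distribution over next locations out) could stand in its place.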

Motion Models: Example

• Pseudonyms {p1, p2, p3, p4}

• Epochs t0, t1, t2

[Figure: locations at t0 and at t2 are labeled with the pseudonyms p1 to p4; the four locations at t1 are labeled a, b, c, d. Clusters shown: {p1,p2} and {p3,p4}.]

• Forward: Pr[Loc(p1,t1) = a | Loc(p1,t0) = x] and Pr[Loc(p1,t1) = b | Loc(p1,t0) = x]

• Backward: Pr[Loc(p1,t1) = a | Loc(p1,t2) = y]

Privacy Breaches

• Forward breach probability
– $\Pr[\mathrm{Loc}(P,T_j) = L_j \mid D(T_{j-1}), \ldots, D(T_{j-h}), D^*(T_j)]$

• Backward breach probability
– $\Pr[\mathrm{Loc}(P,T_j) = L_j \mid D(T_{j+1}), \ldots, D(T_{j+h}), D^*(T_j)]$

• Privacy breach: release candidate $D^*(T_j)$ causes a breach iff either of the following holds for threshold $C$:
– $\max_{P, L_j} \Pr[\mathrm{Loc}(P,T_j) = L_j \mid D(T_{j-1}), \ldots, D(T_{j-h}), D^*(T_j)] > C$
– $\max_{P, L_j} \Pr[\mathrm{Loc}(P,T_j) = L_j \mid D(T_{j+1}), \ldots, D(T_{j+h}), D^*(T_j)] > C$

Privacy Breaches: Example

[Figure: p1, p2, p3, p4 at their locations in epoch t0 (p1 at x, p2 at y); unlabeled locations a, b, c, d in epoch t1. Clusters shown: {p1,p2} and {p3,p4}.]

$e_1 = \Pr[\mathrm{Loc}(p_1,t_1) = a \mid \mathrm{Loc}(p_1,t_0) = x]$

$e_2 = \Pr[\mathrm{Loc}(p_1,t_1) = b \mid \mathrm{Loc}(p_1,t_0) = x]$

$e_3 = \Pr[\mathrm{Loc}(p_2,t_1) = a \mid \mathrm{Loc}(p_2,t_0) = y]$

$e_4 = \Pr[\mathrm{Loc}(p_2,t_1) = b \mid \mathrm{Loc}(p_2,t_0) = y]$

$\Pr[\mathrm{Loc}(p_1,t_1) = a \mid D(T_0), D^*(T_1)] = \dfrac{e_1 e_4}{e_1 e_4 + e_2 e_3}$

…

Goal: Verify that all (forward and backward) breach probabilities are below the threshold C

Checking for Breaches

• Does release candidate D*(Tj) cause a breach?

• Brute force algorithm
– Exponential in release candidate cluster size

• Heuristic pruning tools
– Reduce the search space considerably in practice
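A brute-force check for one cluster can be sketched by enumerating every assignment of the cluster's published locations to its pseudonyms, which generalizes the two-user fraction from the example slide. This sketch covers only the forward direction with a 1-step model, assumes users move independently, and uses hypothetical names (`breach_probabilities`, `causes_breach`) and made-up values for e1 to e4; the pruning heuristics mentioned above are not shown.

```python
from itertools import permutations

def breach_probabilities(pseudonyms, locations, transition):
    """Posterior Pr[loc(p) = l] for every pseudonym p and published location l.

    transition[p][l] is the motion-model probability that p moved to l.
    Enumerates all k! assignments of the k locations to the k pseudonyms,
    so the cost grows exponentially with cluster size.
    """
    posterior = {p: {l: 0.0 for l in locations} for p in pseudonyms}
    total = 0.0
    for assignment in permutations(locations):
        weight = 1.0
        for p, l in zip(pseudonyms, assignment):
            weight *= transition[p][l]
        total += weight
        for p, l in zip(pseudonyms, assignment):
            posterior[p][l] += weight
    if total == 0.0:
        raise ValueError("no assignment is consistent with the motion model")
    return {p: {l: w / total for l, w in row.items()}
            for p, row in posterior.items()}

def causes_breach(pseudonyms, locations, transition, threshold):
    """True if any pseudonym can be placed with probability above threshold."""
    post = breach_probabilities(pseudonyms, locations, transition)
    return max(max(row.values()) for row in post.values()) > threshold

# Assumed values: e1 = 0.9, e2 = 0.1 (for p1); e3 = 0.2, e4 = 0.8 (for p2).
transition = {"p1": {"a": 0.9, "b": 0.1}, "p2": {"a": 0.2, "b": 0.8}}
post = breach_probabilities(["p1", "p2"], ["a", "b"], transition)
```

With these numbers, `post["p1"]["a"]` equals e1·e4 / (e1·e4 + e2·e3), matching the two-user formula. A backward check would have the same structure with a backward transition table.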

Publishing Algorithms

• How to publish useful data, without causing a privacy breach?

• Cluster-based sanitization mechanism offers two main options
– Increase cluster size (or change composition)
– Reduce publication frequency

Publishing Algorithms

• General Case
– At each epoch $T_j$, publish the most compact release candidate $D^*(T_j)$ that does not cause a breach
– Need to delay publishing until epoch $T_{j+h}$ to check for backward breaches
– NP-hard optimization problem; proposed alternative heuristics

• Special Case
– Durable clusters (same individuals at each epoch)
– Motion model satisfies symmetry property
– No need to delay publishing
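The talk does not spell out its heuristics, so the following is only a sketch of the general shape of a publishing step: scan candidates from most to least compact and publish the first one that passes the breach check, or publish nothing for this epoch. The names `choose_release` and `toy_check` are hypothetical, and `toy_check` is a stand-in rule for illustration, not a real privacy test.

```python
def choose_release(candidates, causes_breach):
    """Return the first candidate that passes the breach check.

    candidates: release candidates ordered from most to least compact.
    causes_breach: callable taking a candidate and returning True if
                   publishing it would cause a (forward or backward) breach.
    Returns None when every candidate fails, i.e. skip this epoch.
    """
    for candidate in candidates:
        if not causes_breach(candidate):
            return candidate
    return None

def toy_check(clustering):
    # Stand-in only: flags any clustering that has a cluster of fewer
    # than 3 pseudonyms. A real check would use breach probabilities.
    return any(len(cluster) < 3 for cluster in clustering)

candidates = [
    [{"p1", "p2"}, {"p3", "p4"}, {"p5", "p6"}],
    [{"p1", "p2", "p3"}, {"p4", "p5", "p6"}],
    [{"p1", "p2", "p3", "p4", "p5", "p6"}],
]
chosen = choose_release(candidates, toy_check)
skipped = choose_release(candidates[:1], toy_check)
```

The two outcomes correspond to the two options named earlier: moving down the list increases cluster size, and returning `None` reduces publication frequency.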

Experimental Study

• Used real highway traffic data from UM Transportation Research Institute
– GPS data sampled from cars of 72 volunteers
– Sampling rate (epoch) = 0.01 seconds
– Speed range 0-170 km/hour

• Also synthetic data
– Able to control the generative motion distribution

Experimental Study

• All static “snapshot” anonymization mechanisms vulnerable to motion prediction attacks
– Applied two representative algorithms (r-Gather [Aggarwal 06] and k-Condense [Aggarwal 04])
– Each produces a set of clusters with k users each

[Figure panels: r-Gather, k-Condense]

Speculation / Future Work

• The GPS example illustrates why privacy analysis must account for data dynamics, history, and predictable patterns of change

• Dynamic private data in other applications
– E.g., longitudinal social science data
    • Study subjects age predictably
    • Most people don’t move very far
    • Income changes predictably

• Hypothesis: History and prediction are important in these settings, too!