Probabilistic Inference Protection on Anonymized Data

Raymond Chi-Wing Wong (The Hong Kong University of Science and Technology)
Ada Wai-Chee Fu (The Chinese University of Hong Kong)
Ke Wang (Simon Fraser University)
Yabo Xu (Sun Yat-sen University)
Jian Pei (Simon Fraser University)
Philip S. Yu (University of Illinois at Chicago)

Prepared and presented by Raymond Chi-Wing Wong

Post on 19-Dec-2015



2

Outline

1. Introduction
   l-diversity
2. Background Knowledge
3. Proposed Model
4. Conclusion

3

1. l-diversity

Original data:

Patient    Gender  Age  Disease
Alan       Male    41   Lung Cancer
Betty      Female  42   Hypertension
Catherine  Female  63   Flu
Diana      Female  64   HIV

Bucketization, then release the data set to the public (Knowledge 1):

QI Table
Gender  Age  GID
Male    41   L1
Female  42   L1
Female  63   L2
Female  64   L2

Sensitive Table
GID  Disease
L1   Lung Cancer
L1   Hypertension
L2   Flu
L2   HIV

Knowledge 2: I also know Alan with (Male, 41).

Combining Knowledge 1 and Knowledge 2, we can deduce that Alan is linked to Lung Cancer with probability = 1/2.

In other words, P(Alan is linked to Lung Cancer) is at most 1/2.

Simplified 2-diversity: to generate a data set such that each individual is linked to a sensitive value (e.g., Lung Cancer) with probability at most 1/2.

This dataset satisfies 2-diversity.
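The simplified 2-diversity check above can be sketched in a few lines. This is a minimal illustration, assuming the adversary has only Knowledge 1 and 2, so the linkage probability is the relative frequency of a sensitive value inside an A-group. The function names are hypothetical and not taken from the talk.

```python
from collections import Counter, defaultdict

# Sensitive table from the slide: (GID, Disease) pairs.
SENSITIVE_TABLE = [
    ("L1", "Lung Cancer"),
    ("L1", "Hypertension"),
    ("L2", "Flu"),
    ("L2", "HIV"),
]

def max_link_probability(sensitive_table):
    """Per A-group, the highest relative frequency of any sensitive value."""
    by_gid = defaultdict(list)
    for gid, disease in sensitive_table:
        by_gid[gid].append(disease)
    return {
        gid: max(Counter(diseases).values()) / len(diseases)
        for gid, diseases in by_gid.items()
    }

def satisfies_simplified_l_diversity(sensitive_table, l):
    """True if every individual is linked to any value with probability <= 1/l."""
    return all(p <= 1 / l for p in max_link_probability(sensitive_table).values())

print(max_link_probability(SENSITIVE_TABLE))
print(satisfies_simplified_l_diversity(SENSITIVE_TABLE, 2))
```

Each A-group here holds two distinct diseases, so the highest linkage probability in each group is 1/2 and the check passes for l = 2.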

4


Knowledge 3: QI-based distribution

p()      Lung Cancer  Not Lung Cancer
Male     0.1          0.9
Female   0.003        0.997

This distribution can be obtained from statistical reports from the US Department of Health and Human Services and from other statistical data sources discussed in previous studies.

5


Combining Knowledge 1, 2 and 3, we can deduce that Alan is linked to Lung Cancer with very high probability (much greater than 1/2).

Why? A male patient is more likely to be linked to Lung Cancer than a female patient.
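A rough calculation shows how large this probability can become. The sketch below is an assumption rather than the formula from the talk: it treats each patient's disease as an independent draw from the QI-based distribution and then conditions on exactly one of the two patients in A-group L1 having Lung Cancer. The function name is hypothetical.

```python
def link_probability_two_tuples(f_target, f_other):
    """P(target has the sensitive value | exactly one of the two tuples has it).

    f_target and f_other are the global (QI-based) probabilities of the two
    tuples in the A-group. Independence between tuples is an assumption.
    """
    target_only = f_target * (1 - f_other)   # target has it, the other does not
    other_only = (1 - f_target) * f_other    # the other has it, target does not
    return target_only / (target_only + other_only)

# Alan is Male (0.1); Betty, the other tuple in L1, is Female (0.003).
p_alan = link_probability_two_tuples(0.1, 0.003)
print(round(p_alan, 4))
```

Under these assumptions the result is 0.0997 / 0.1024, roughly 0.97, which is far above the 1/2 suggested by the bucketized tables alone.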

6


We need to formulate how to calculate the probability (e.g., P(Alan is linked to Lung Cancer)) according to Knowledge 1, 2 and 3.

Objective: to make sure that the probability is bounded by a threshold (e.g., 1/2).


8

1. l-diversity

Challenge 1: Calculating the probability (e.g., P(Alan is linked to Lung Cancer)) is computationally expensive.


9

1. l-diversity

Challenge 2: The formula for this probability is not monotonic with respect to the A-group size.

Most existing privacy studies involve formulae that are monotonic. Thus, most existing algorithms (e.g., Incognito and Mondrian) rely on this monotonic property.

10

1. l-diversity


Objective: to make sure that P(Alan is linked to Lung Cancer) ≤ 1/2

11

1. l-diversity


Related Work: There is a closely related work [LLZ09] for this problem. [LLZ09] approximates the formula for this probability. Thus, there is no solid guarantee on the privacy protection.

[LLZ09] T. Li, N. Li and J. Zhang, "Modeling and Integrating Background Knowledge in Data Anonymization", ICDE 2009

12

1. l-diversity


Contributions: We propose a condition. If this condition is satisfied, we can guarantee the privacy requirement (i.e., P(Alan is linked to Lung Cancer) ≤ 1/2). Besides, this condition can overcome Challenge 1 and Challenge 2. Specifically,
(1) Computing the condition is computationally cheap, and
(2) The condition involves a monotonic function on the A-group size.

13

1. l-diversity

The major idea of the condition is a set of simple calculations based on the statistics of an A-group:

1. The size of the A-group (N)
2. The privacy requirement (r)
3. The global probability of each tuple in the A-group being linked to a sensitive value

14


Condition Check

Inputs: N, r, global probabilities
Output: Satisfied / Not Satisfied

If the condition is satisfied, we deduce that the privacy requirement is satisfied (e.g., P(Alan is linked to Lung Cancer) ≤ 1/2).

15

4. Conclusion

1. Background Knowledge: QI-based Probability Distribution
2. Two Challenges
   Challenge 1: The formula for the probability is computationally expensive
   Challenge 2: The formula is not monotonic
3. The Proposed Condition overcomes Challenge 1 and Challenge 2

16

Q&A

17

1. l-diversity

Release the data set to the public:

Gender  Age  Disease       A-group
Male    41   Lung Cancer   GID = L1
Female  42   Hypertension  GID = L1
Female  63   Flu           GID = L2
Female  64   HIV           GID = L2

The first two tuples form an anonymized group (A-group) with GID = L1. The last two tuples form another A-group with GID = L2.

Bucketization is a way to prevent this linkage.

There is another way to prevent this linkage, called Generalization. The principle to be discussed can also be applied to Generalization.


20

1. l-diversity

Release the data set to the public without the Patient column (Knowledge 1):

Gender  Age  Disease
Male    41   Lung Cancer
Female  42   Hypertension
Female  63   Flu
Female  64   HIV

Knowledge 2: I also know Alan with (Male, 41).

Combining Knowledge 1 and Knowledge 2, we can deduce that Alan is linked to Lung Cancer.

21

1. l-diversity: Monotonicity

Consider two A-groups, one with GID = L1 and one with GID = L2.

A-group with GID = L1: P(an individual is linked to a sensitive value) = 0.5
A-group with GID = L2: P(an individual is linked to a sensitive value) = 0.4

Merging: for the A-group "merged" from these two A-groups, P(an individual is linked to a sensitive value) ≤ 0.5.

The probability is monotonically decreasing when the size of the A-group increases.

Objective: to make sure that the probability is bounded by a threshold (e.g., 1/2).

22

1. l-diversity: Non-Monotonicity

Consider two A-groups, one with GID = L1 and one with GID = L2.

A-group with GID = L1: P(an individual is linked to a sensitive value) = 0.5
A-group with GID = L2: P(an individual is linked to a sensitive value) = 0.4

Merging: for the A-group "merged" from these two A-groups, it is possible that P(an individual is linked to a sensitive value) > 0.5.

The probability is not monotonically decreasing when the size of the A-group increases.

Objective: to make sure that the probability is bounded by a threshold (e.g., 1/2).
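A small brute-force experiment can make both challenges concrete. The sketch below is an illustration under an assumed model, not the formula from the talk: each tuple's sensitive value is an independent draw from its global probability, conditioned on the A-group containing exactly k occurrences of the value. The example groups (two males, two females) are hypothetical. The enumeration over all subsets of size k is what makes the exact computation expensive for large groups.

```python
from itertools import combinations
from math import prod

def link_probability(global_probs, k, j):
    """P(tuple j has the value | exactly k tuples in the A-group have it).

    Enumerates every way of assigning the value to k tuples, so the cost
    grows combinatorially with the A-group size.
    """
    n = len(global_probs)

    def weight(subset):
        return prod(
            global_probs[i] if i in subset else 1 - global_probs[i]
            for i in range(n)
        )

    subsets = [set(c) for c in combinations(range(n), k)]
    total = sum(weight(s) for s in subsets)
    hit = sum(weight(s) for s in subsets if j in s)
    return hit / total

# Two males, one Lung Cancer tuple: each male is linked with probability 0.5.
before = link_probability([0.1, 0.1], 1, 0)
# Merge with two females who also share one Lung Cancer tuple.
after = link_probability([0.1, 0.1, 0.003, 0.003], 2, 0)
print(before, after)
```

In this hypothetical case both original groups sit exactly at the 1/2 threshold, yet after merging a male tuple is linked to Lung Cancer with probability well above 1/2, so a larger A-group is not automatically safer.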

23

1. l-diversity

Suppose we are interested in knowing whether P(Alan is linked to Lung Cancer) ≤ 1/2. For the sake of illustration, we focus on attribute Gender only.

Inputs to the condition check for the A-group with GID = L1:

N = 2 (the size of the A-group)
r = 2 (the privacy requirement)
Global probabilities = 0.1 (Male), 0.003 (Female)

Output: Satisfied / Not Satisfied

If the condition is satisfied, we deduce that the privacy requirement is satisfied (e.g., P(Alan is linked to Lung Cancer) ≤ 1/2).

24


What is the condition check?

In the condition check, there is an expression, ceil, in terms of N, r and the global probabilities, to compute.

25


Theorem 1: If the condition is satisfied, then the privacy requirement is satisfied.


26

Theorem 2: Computing ceil can be done in O(1) time.

This means that we overcome Challenge 1 (calculating the probability is computationally expensive).

Theorem 3: ceil is a monotonically increasing function of N, where N is the A-group size.

This means that we overcome Challenge 2 (the formula for the original probability is not monotonic with respect to the A-group size).

27


What is the condition check?

f1 = 0.1, f2 = 0.003 (the global probabilities of the two tuples)

fmax = max{f1, f2} = max{0.1, 0.003} = 0.1 (the greatest global probability)

Δ1 = fmax - f1 = 0.1 - 0.1 = 0
Δ2 = fmax - f2 = 0.1 - 0.003 = 0.097
(Δi is the difference between the greatest global probability and the "current" global probability)

The condition is whether this difference Δ1 (and Δ2) is at most an expression ceil in terms of N, r and fmax:

ceil = ((N - r) × fmax) / (fmax × (r - 1)/(1 - fmax) + (N - 1))

28


Theorem 1: If Δi ≤ ceil is satisfied for each tuple i in the A-group, then the privacy requirement is satisfied.
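The condition check can be sketched as follows. This is a hedged illustration: it assumes ceil is the fraction with numerator (N - r) × fmax and denominator fmax × (r - 1)/(1 - fmax) + (N - 1), which is one reading of the flattened slide formula, and it assumes every global probability lies strictly between 0 and 1. The function names are hypothetical.

```python
def ceil_value(n, r, fmax):
    """The bound 'ceil' as read from the slide (assumed reconstruction)."""
    return (n - r) * fmax / (fmax * (r - 1) / (1 - fmax) + (n - 1))

def condition_satisfied(n, r, global_probs):
    """True if every difference fmax - f_i is at most ceil."""
    fmax = max(global_probs)
    bound = ceil_value(n, r, fmax)
    return all(fmax - f <= bound for f in global_probs)

# Slide example: A-group L1 with N = 2, r = 2, global probabilities 0.1 and 0.003.
print(ceil_value(2, 2, 0.1))
print(condition_satisfied(2, 2, [0.1, 0.003]))
```

With N = r = 2 the numerator vanishes, so ceil is 0 and the difference 0.097 exceeds it; the condition is not satisfied for the slide example. The bound uses only N, r and fmax, so evaluating it takes constant time once fmax is known, and under this reading it grows with N.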

29

Anonymization

The condition check gives hints for anonymization:

1. Initially, each tuple forms an A-group.
2. Repeat the following until each A-group satisfies the condition: if there is an A-group violating the condition, merge this A-group with some other A-group such that the "merged" A-group satisfies the condition.
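The merging loop can be sketched as a simple greedy procedure. This is only an illustration under several assumptions: each tuple is represented by its global probability alone, the ceil expression is read as (N - r) × fmax over fmax × (r - 1)/(1 - fmax) + (N - 1), probabilities lie strictly between 0 and 1, and when no partner yields a satisfying merge the sketch falls back to merging with the first other group (the talk does not say what to do in that case). All names are hypothetical.

```python
def ceil_value(n, r, fmax):
    """Assumed reading of the slide's 'ceil' expression."""
    return (n - r) * fmax / (fmax * (r - 1) / (1 - fmax) + (n - 1))

def satisfies(group, r):
    """Condition check for one A-group given as a list of global probabilities."""
    fmax = max(group)
    bound = ceil_value(len(group), r, fmax)
    return all(fmax - f <= bound for f in group)

def anonymize(global_probs, r):
    """Greedy merging: start with singleton A-groups and merge violators."""
    groups = [[f] for f in global_probs]
    while True:
        bad = next((g for g in groups if not satisfies(g, r)), None)
        if bad is None or len(groups) == 1:
            return groups  # all groups pass, or nothing is left to merge with
        others = [g for g in groups if g is not bad]
        # Prefer a partner whose merge satisfies the condition (fallback: first).
        partner = next((g for g in others if satisfies(bad + g, r)), others[0])
        groups = [g for g in others if g is not partner] + [bad + partner]

result = anonymize([0.1, 0.1, 0.003, 0.003], 2)
print(result)
```

Every iteration reduces the number of groups by one, so the loop always terminates. In this hypothetical input the two tuples with probability 0.1 end up together and the two tuples with probability 0.003 end up together.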

30

B.1.2 K-Anonymity

Original data:

Customer  Gender  District  Birthday  Cancer
Raymond   Male    Shatin    29 Jan    None
Peter     Male    Fanling   16 July   Yes
Kitty     Female  Shatin    21 Oct    None
Mary      Female  Shatin    8 Feb     None

Release the data set to the public:

Gender  District  Birthday  Cancer
Male    NT        *         None
Male    NT        *         Yes
Female  Shatin    *         None
Female  Shatin    *         None

Problem: to generate a data set such that each possible value appears at least TWO times.

This data set is 2-anonymous.

Two kinds of generalisations:
1. Shatin → NT
2. 16 July → *

"Shatin → NT" causes LESS distortion than "16 July → *".

Question: how can we measure the distortion?
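The 2-anonymity claim can be checked with a short sketch. It assumes that Gender, District and Birthday are the quasi-identifier attributes and that Cancer is the sensitive attribute, which the slide does not state explicitly. The function name is hypothetical.

```python
from collections import Counter

# Quasi-identifier part (Gender, District, Birthday) of the released table.
RELEASED_QI = [
    ("Male", "NT", "*"),
    ("Male", "NT", "*"),
    ("Female", "Shatin", "*"),
    ("Female", "Shatin", "*"),
]

def is_k_anonymous(qi_rows, k):
    """True if every quasi-identifier combination appears at least k times."""
    return all(count >= k for count in Counter(qi_rows).values())

print(is_k_anonymous(RELEASED_QI, 2))
```

Each combination appears exactly twice, so the released table passes for k = 2 and would fail for k = 3.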

31

B.1.2 K-Anonymity

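Generalisation hierarchies (most specific level first):

District: Shatin, Fanling, Mongkok, Jordon → NT, KLN → HKG
Measurement (Shatin → NT) = 1/2 = 0.5

Birthday: 29 Jan, 16 July, 21 Oct, 8 Feb → Jan, July, Oct, Feb → *
Measurement (16 July → *) = 2/2 = 1.0

Gender: Male, Female → *
Measurement = 1/1 = 1.0

Conclusion: We propose a measurement of distortion of the modified/anonymized data.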
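One reading of the measurements above is "number of levels climbed divided by the height of the hierarchy". The sketch below follows that interpretation, which is inferred from the three numbers on the slide rather than stated there. The data structure and function names are hypothetical.

```python
# Each hierarchy is a list of levels, from most specific to most general.
HIERARCHIES = {
    "District": [{"Shatin", "Fanling", "Mongkok", "Jordon"}, {"NT", "KLN"}, {"HKG"}],
    "Birthday": [{"29 Jan", "16 July", "21 Oct", "8 Feb"},
                 {"Jan", "July", "Oct", "Feb"}, {"*"}],
    "Gender": [{"Male", "Female"}, {"*"}],
}

def level_of(attribute, value):
    """Index of the hierarchy level that contains the value."""
    for depth, values in enumerate(HIERARCHIES[attribute]):
        if value in values:
            return depth
    raise ValueError(f"{value!r} is not in the {attribute} hierarchy")

def distortion(attribute, original, generalised):
    """Levels climbed divided by the height of the hierarchy (assumed measure)."""
    height = len(HIERARCHIES[attribute]) - 1
    climbed = level_of(attribute, generalised) - level_of(attribute, original)
    return climbed / height

print(distortion("District", "Shatin", "NT"))
print(distortion("Birthday", "16 July", "*"))
print(distortion("Gender", "Male", "*"))
```

Under this interpretation the three calls reproduce the slide's values of 0.5, 1.0 and 1.0.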

32

B.1.2 K-Anonymity


Can we modify the measurement? E.g., assign different weightings to each level.