Anatomy: Simple and Effective Privacy Preservation

Israel Chernyak

DB Seminar (winter 2009)

Example

A hospital wants to release patients’ medical records.

Attribute Disease is sensitive. Age, Sex, and Zipcode are the quasi-identifier (QI) attributes.

Generalization

A widely-used technique for preserving privacy.

Tuples are divided into QI-groups.

QI-values are transformed into less specific forms.

Tuples in the same QI-group cannot be distinguished by their QI-values.
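As a rough illustration of the idea above, the following sketch generalizes one QI-group by replacing each QI-value with the range covering the whole group. The function name `generalize_group` and the min/max range choice are assumptions made for this example; a real generalization scheme would typically use predefined, wider ranges. The two sample tuples are taken from the hashing slide later in the talk.

```python
def generalize_group(rows):
    """Replace the QI-values (age, sex, zipcode) of every row in one
    QI-group with a form shared by the whole group.

    Each row is (age, sex, zipcode, disease). The min/max ranges are an
    assumption for this sketch, not the scheme used in the talk.
    """
    ages = [row[0] for row in rows]
    sexes = {row[1] for row in rows}
    zipcodes = [row[2] for row in rows]

    age_range = (min(ages), max(ages))
    zip_range = (min(zipcodes), max(zipcodes))
    # Keep the sex if it is the same for all rows, otherwise suppress it.
    sex = sexes.pop() if len(sexes) == 1 else "*"

    # The sensitive value (disease) is published unchanged.
    return [(age_range, sex, zip_range, row[3]) for row in rows]


group = [
    (23, "M", 11000, "pneumonia"),
    (27, "M", 13000, "dyspepsia"),
]
generalized = generalize_group(group)
```

After generalization both rows carry the same QI-values, so they can no longer be told apart by Age, Sex, or Zipcode.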

Introduction

Anatomy – a technique for publishing sensitive data.

Protects privacy. Allows effective data analysis.

More effective than the conventional generalization.

Permits aggregate reasoning with average error below 10%.

Lower than the errors produced by generalization by orders of magnitude.

Measuring the degree of privacy preservation

Two notions, k-anonymity and l-diversity, have been proposed to measure the degree of privacy preservation.

K-anonymity

A table is k-anonymous if each QI-group involves at least k tuples.

The next table is 4-anonymous.

Even with a large k, k-anonymity may still allow an adversary to infer the sensitive value of an individual with extremely high confidence.
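The definition can be checked mechanically. The sketch below is a minimal check, assuming rows are tuples whose first three fields are the (already generalized) QI-values; the function name `is_k_anonymous` and the sample rows are hypothetical.

```python
from collections import Counter


def is_k_anonymous(rows, k, qi_indexes=(0, 1, 2)):
    """Return True if every QI-group (rows sharing the same QI-values)
    contains at least k rows."""
    group_sizes = Counter(tuple(row[i] for i in qi_indexes) for row in rows)
    return all(size >= k for size in group_sizes.values())


# Hypothetical generalized table with two QI-groups of four rows each.
table = [
    ("[21,60]", "M", "[10001,60000]", "pneumonia"),
    ("[21,60]", "M", "[10001,60000]", "dyspepsia"),
    ("[21,60]", "M", "[10001,60000]", "dyspepsia"),
    ("[21,60]", "M", "[10001,60000]", "pneumonia"),
    ("[61,70]", "F", "[10001,60000]", "flu"),
    ("[61,70]", "F", "[10001,60000]", "flu"),
    ("[61,70]", "F", "[10001,60000]", "flu"),
    ("[61,70]", "F", "[10001,60000]", "flu"),
]
four_anonymous = is_k_anonymous(table, 4)
five_anonymous = is_k_anonymous(table, 5)
```

Note that the second group passes the check even though all of its rows share one disease, which is the weakness described above.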

Where k-anonymity fails

4-anonymous table

Last group has no privacy

Background knowledge attacks are still possible.

L-diversity

A table is l-diverse if, in each QI-group, at most 1/l of the tuples possess the most frequent sensitive value.
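A minimal sketch of this check follows, assuming the same row layout as in the rest of the talk (three QI-values followed by the sensitive value). The name `is_l_diverse`, the parameter name `ell` (used instead of `l` for readability), and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def is_l_diverse(rows, ell, qi_indexes=(0, 1, 2), sensitive_index=3):
    """Return True if, in each QI-group, at most 1/ell of the rows carry
    the group's most frequent sensitive value."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[i] for i in qi_indexes)].append(row[sensitive_index])

    for values in groups.values():
        most_frequent = Counter(values).most_common(1)[0][1]
        # most_frequent / len(values) <= 1 / ell, written without division.
        if most_frequent * ell > len(values):
            return False
    return True


# Hypothetical QI-groups: the first is balanced, the second is uniform.
balanced = [
    ("[21,60]", "M", "[10001,60000]", "pneumonia"),
    ("[21,60]", "M", "[10001,60000]", "dyspepsia"),
    ("[21,60]", "M", "[10001,60000]", "dyspepsia"),
    ("[21,60]", "M", "[10001,60000]", "pneumonia"),
]
uniform = [("[61,70]", "F", "[10001,60000]", "flu")] * 4

balanced_is_2_diverse = is_l_diverse(balanced, 2)
whole_table_is_2_diverse = is_l_diverse(balanced + uniform, 2)
```

The combined table is 4-anonymous but not 2-diverse, because every row of the second group has the same disease.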

Defects of Generalization in Analysis

Generalization helps preserve privacy, in terms of l-diversity.

A trade-off exists between: Keeping the sensitive data private. Publishing records for research and analysis.

Defects of Generalization in Analysis – an example

A researcher wants to estimate the result of the following query:

Defects of Generalization in Analysis – an example (cont.)

[Figure: regions $R_1$ and $R_2$ and the query region $Q$]

$$p = \frac{\mathrm{Area}(R_1 \cap Q)}{\mathrm{Area}(R_1)} = 0.05$$

$$2 \cdot p = 0.1$$

Defects of Generalization in Analysis – an example (cont.)

Recall that the answer we got was 0.1, which, however, is 10 times smaller than the actual query result.

This is caused by the fact that the data distribution in R1 significantly deviates from uniformity.

Nevertheless, given only the generalized table, we cannot justify any other distribution assumption.

This is an inherent problem of generalization, preventing an analyst from correctly understanding the data distribution inside each QI-group.
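The uniform-distribution estimate can be reproduced with a small sketch. The rectangle coordinates below are hypothetical, chosen only so that the area ratio comes out at the 0.05 quoted in the example; the factor 2 is read here as the number of tuples in the QI-group that carry the queried sensitive value, which is an assumption about the slide.

```python
def area(rect):
    """Area of an axis-aligned rectangle ((age_lo, age_hi), (zip_lo, zip_hi)).
    An empty intersection (lo > hi on some axis) has area 0."""
    (age_lo, age_hi), (zip_lo, zip_hi) = rect
    return max(0, age_hi - age_lo) * max(0, zip_hi - zip_lo)


def intersect(a, b):
    """Intersection of two rectangles, computed axis by axis."""
    return tuple(
        (max(lo_a, lo_b), min(hi_a, hi_b))
        for (lo_a, hi_a), (lo_b, hi_b) in zip(a, b)
    )


# Hypothetical coordinates: R1 is the generalized QI-group, Q the query region.
R1 = ((20, 60), (10000, 60000))
Q = ((20, 30), (10000, 20000))

# Probability that a tuple of the group falls in Q, assuming uniformity in R1.
p = area(intersect(R1, Q)) / area(R1)

matching_sensitive_count = 2  # assumed meaning of the factor 2 on the slide
estimate = matching_sensitive_count * p
```

With these numbers `p` is 0.05 and the estimate is 0.1, as in the example; the estimate is only as good as the uniformity assumption behind it.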

Anatomy

Anatomy vs. Generalization

Anatomy announces the QI values directly.

Permits more effective analysis than generalization.

Anatomy vs. Generalization – example (cont.)

Given the previous query:

We proceed to calculate the probability p that a tuple in the QI-group falls in Q.

Anatomy vs. Generalization – example (cont.)

No assumption about the data distribution is necessary, because the distribution is precisely released.

$p = 0.5$, so $2 \cdot p = 1$ (the actual answer).
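The same estimate under anatomy can be sketched as follows. The first two QI rows come from the hashing slide later in the talk; the other two rows, the query predicate, and the count of 2 are assumptions made so that the numbers match the p = 0.5 shown above.

```python
# Exact QI-values of one QI-group, as anatomy would publish them.
qit_group = [
    (23, "M", 11000),  # from the hashing slide
    (27, "M", 13000),  # from the hashing slide
    (35, "M", 59000),  # hypothetical row
    (59, "M", 12000),  # hypothetical row
]

# Assumed count of the queried sensitive value in this group.
matching_sensitive_count = 2


def in_query(row):
    """Hypothetical QI predicate of the query: a young patient in a given
    zipcode range."""
    age, _sex, zipcode = row
    return age <= 30 and 10001 <= zipcode <= 20000


# Fraction of the group's tuples whose QI-values satisfy the query.
# No distribution assumption is needed: the QI-values are known exactly.
p = sum(in_query(row) for row in qit_group) / len(qit_group)
estimate = matching_sensitive_count * p
```

Here `p` is computed by counting published tuples rather than by comparing areas, which is why the estimate reaches 1.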

Privacy Preservation

Anatomy provides a convenient way for the data publisher to find out for each tuple t:

The sensitive values that an adversary can associate with t.

The probability of association.

[Figure: candidate sensitive values Pneumonia, Dyspepsia, flu, each marked "?", with p1=0.5 and p2=0.5]

Conclusion

Given a pair of QIT and ST, an adversary can correctly reconstruct any tuple with a probability at most 1/l.

Therefore, the adversary can correctly infer the sensitive value of any individual with probability at most 1/l.
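The association probabilities can be read off the counts of one QI-group. The sketch below assumes that the sensitive table stores, per QI-group, a count for each sensitive value; the function name and the sample counts are hypothetical, picked to give the two probabilities of 0.5 shown earlier.

```python
from fractions import Fraction


def association_probabilities(group_counts):
    """Given {sensitive value: count} for one QI-group, return the
    probability with which an adversary can associate each value with
    any single tuple of that group."""
    group_size = sum(group_counts.values())
    return {
        value: Fraction(count, group_size)
        for value, count in group_counts.items()
    }


# Hypothetical group of four tuples: two pneumonia, two dyspepsia.
probabilities = association_probabilities({"pneumonia": 2, "dyspepsia": 2})
worst_case = max(probabilities.values())
```

For an l-diverse group the largest of these probabilities cannot exceed 1/l, which is the bound stated above.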

Anatomy vs. Generalization

Anatomy isn’t an all-around winner:

Anatomy releases the QI-values directly.

Intuitively, it may allow a higher breach probability than generalization.

Nevertheless, such probability is always bounded by 1/l, as long as the background knowledge of an adversary isn’t stronger than the level allowed by the l-diversity model.

Assumptions we’ve made so far

Assumption 1: The adversary has the QI-values of the target individual.

Assumption 2: The adversary knows that the individual is definitely involved in the data.

In fact, usually both assumptions are satisfied in practical privacy-attacking processes.

Assumptions - conclusion

In general, if both assumptions are true, anatomy provides as much privacy control as generalization.

The privacy of a person is breached with a probability at most 1/l.

Anatomy – where its privacy fails

Consider an adversary that can make the first assumption (knowing the QI-values) but not the second (existence of the target in the database).

The overall breach probability:

$$P_{breach}(Alice) = P_{A2}(Alice) \cdot P_{breach}(Alice \mid A2)$$

where $P_{A2}(Alice)$ is the probability that Alice is in the table (that is, that Assumption 2 holds).

Anatomy – where its privacy fails (cont.)

Each member can be involved with equal likelihood

Generalization: P(Alice is in the table) = 4/5

Anatomy: P(Alice is in the table) = 1

Nevertheless, the upper bound for this is still 1/l
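The comparison can be worked through numerically. The sketch below reads the overall breach probability as the product of P(Alice is in the table) and the breach probability given that she is in it; that reading, the choice l = 2, and the use of the worst-case value 1/l for the conditional probability are assumptions made for illustration.

```python
from fractions import Fraction


def overall_breach(p_in_table, p_breach_given_in_table):
    """Overall breach probability, read as the product of the probability
    that the target is in the table and the breach probability given
    that she is."""
    return p_in_table * p_breach_given_in_table


ell = 2                  # assumed diversity parameter
bound = Fraction(1, ell)  # worst-case breach probability once Alice is known to be in the table

# Generalization: 4/5 from the slide above.
generalization = overall_breach(Fraction(4, 5), bound)
# Anatomy: exact QI-values are published, so the probability is 1.
anatomy = overall_breach(Fraction(1), bound)
```

Under these assumptions generalization yields 2/5 and anatomy 1/2; both stay within the 1/l bound, and anatomy simply reaches it.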

Privacy in generalization vs privacy in anatomy - conclusion

Although generalization has the above advantage over anatomy, the advantage cannot be leveraged in computing the published data.

This is because the publisher cannot predict or control the external database to be utilized by an adversary, and therefore, must guard against an “accurate” external source that does not involve any person absent in the published data.

Anatomizing Algorithm – group creation phase

Anatomizing Algorithm – hashing tuples

23, M, 11000, pneumonia

27, M, 13000, dyspepsia



61, F, 54000, flu

65, F, 25000, gastritis

65, F, 25000, flu

70, F, 30000, bronchitis

pneumonia dyspepsia flu gastritis bronchitis

Anatomizing Algorithm – group creation phase

Property 1: At the end of the group creation phase each non-empty bucket has only one tuple.

This holds only if at most n/l tuples are associated with the same A^s value.

Anatomizing Algorithm – group creation (l=2)

pneumonia dyspepsia flu gastritis bronchitis

61, F, 54000, flu

65, F, 25000, flu

65, F, 25000, gastritis

70, F, 30000, bronchitis

QI1 QI2 QI3 QI4

Anatomizing Algorithm – residue-assignment phase

Property 2: the set S’ (computed at line 11) always includes at least one QI-group.

Property 3: after the residue-assignment phase, each QI-group has at least l tuples, and all tuples in each QI-group have distinct A^s values.
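The two phases can be sketched in a few lines. This is an illustrative reading of the phases named on the slides, not the pseudocode of the talk (which is not reproduced in the text): it assumes that each new group takes one tuple from each of the l currently largest buckets, and it picks tuples deterministically where a random choice could be used. The name `anatomize_groups`, the parameter `ell`, and the one marked sample row are hypothetical.

```python
from collections import defaultdict


def anatomize_groups(rows, ell, sensitive_index=3):
    """Partition rows into QI-groups with at least ell tuples each and
    distinct sensitive values inside every group."""
    # Hash tuples into buckets by sensitive value.
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[sensitive_index]].append(row)

    # Group creation: while at least ell buckets are non-empty, take one
    # tuple from each of the ell largest buckets (assumed selection rule).
    groups = []
    while sum(1 for bucket in buckets.values() if bucket) >= ell:
        non_empty = [value for value in buckets if buckets[value]]
        largest = sorted(non_empty, key=lambda v: len(buckets[v]), reverse=True)[:ell]
        groups.append([buckets[value].pop() for value in largest])

    # Residue assignment: put each leftover tuple into a group that does
    # not yet contain its sensitive value.
    for value, bucket in buckets.items():
        while bucket:
            target = next(
                (g for g in groups if all(r[sensitive_index] != value for r in g)),
                None,
            )
            if target is None:
                raise ValueError("no eligible QI-group; too many tuples share a value")
            target.append(bucket.pop())
    return groups


rows = [
    (23, "M", 11000, "pneumonia"),
    (27, "M", 13000, "dyspepsia"),
    (35, "M", 59000, "dyspepsia"),  # hypothetical row, not on the slides
    (61, "F", 54000, "flu"),
    (65, "F", 25000, "gastritis"),
    (65, "F", 25000, "flu"),
    (70, "F", 30000, "bronchitis"),
]
groups = anatomize_groups(rows, ell=2)
```

With seven tuples and l = 2, one tuple is left over after group creation and is absorbed by the residue-assignment step, so one group ends up with three tuples.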

Anatomizing Algorithm – populating the tables
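A minimal sketch of this step follows, assuming that QIT stores each tuple's exact QI-values together with a group id, and that ST stores, per group, each sensitive value with its count. That layout, the function name `populate_tables`, and the requirement that the sensitive attribute is the last field are assumptions of the sketch.

```python
from collections import Counter


def populate_tables(groups, sensitive_index=3):
    """Build the quasi-identifier table (QIT) and the sensitive table (ST)
    from a list of QI-groups. Assumes the sensitive attribute is the last
    field, so the QI-values are everything before it."""
    qit, st = [], []
    for group_id, group in enumerate(groups, start=1):
        # QIT: exact QI-values plus the id of the group the tuple belongs to.
        for row in group:
            qit.append((*row[:sensitive_index], group_id))
        # ST: one record per (group, sensitive value) with its count.
        counts = Counter(row[sensitive_index] for row in group)
        for value, count in counts.items():
            st.append((group_id, value, count))
    return qit, st


groups = [
    [(23, "M", 11000, "pneumonia"), (27, "M", 13000, "dyspepsia")],
    [(65, "F", 25000, "gastritis"), (65, "F", 25000, "flu")],
]
qit, st = populate_tables(groups)
```

The two tables share only the group id, so a QIT row can be linked to a set of candidate sensitive values but not to one specific value.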

Reconstruction error

We model each tuple t as a probability density function $G_t : DS \to [0,1]$:

$$G_t(x) = \begin{cases} 1 & \text{if } x = t \\ 0 & \text{otherwise} \end{cases}$$

Note that both x and t are tuples, so x=t means x[i]=t[i] for all i.

Reconstruction error (cont.)

Given an approximation probability density function $\tilde{G}_t$, the error from the actual probability density function $G_t$ is:

$$Err_t = \int_{x \in DS} \left( \tilde{G}_t(x) - G_t(x) \right)^2 dx$$

A good publication method should minimize the following reconstruction error (RCE):

$$RCE = \sum_{t \in T} Err_t$$

Algorithm error bounds

If the cardinality n of T (the original table) is a multiple of l, the QIT and ST computed by Anatomize achieve the lower bound of RCE: n(1-1/l).

Otherwise, the RCE of the anatomized tables is higher than the lower bound by a factor at most 1 + 1/n.
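The bound can be checked on a small case. The sketch below treats the domain as discrete, so the integral becomes a sum, and assumes that the approximation built from an anatomized group spreads a tuple's probability over the group's sensitive values in proportion to their counts (the QI-values being known exactly). Both are assumptions of this sketch, as are the function names and the sample groups.

```python
from collections import Counter
from fractions import Fraction


def tuple_error(own_value, group_values):
    """Err_t for one tuple: sum over the group's sensitive values of
    (approximate probability - actual probability) squared."""
    size = len(group_values)
    error = Fraction(0)
    for value, count in Counter(group_values).items():
        approximate = Fraction(count, size)
        actual = 1 if value == own_value else 0
        error += (approximate - actual) ** 2
    return error


def rce(groups):
    """Reconstruction error: the sum of Err_t over all tuples."""
    return sum(
        (tuple_error(value, group) for group in groups for value in group),
        Fraction(0),
    )


ell = 2
# Sensitive values of four hypothetical QI-groups, each with ell distinct values.
groups = [
    ["pneumonia", "dyspepsia"],
    ["flu", "pneumonia"],
    ["dyspepsia", "flu"],
    ["gastritis", "bronchitis"],
]
n = sum(len(group) for group in groups)
lower_bound = n * (1 - Fraction(1, ell))
total_error = rce(groups)
```

Each tuple contributes (1 - 1/l)^2 + (l - 1)/l^2 = 1 - 1/l, so with n = 8 and l = 2 the total is 4, equal to n(1-1/l).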

Summary

Privacy vs publication for research – a serious concern.

Existing method (generalization) allows privacy, but doesn’t allow very accurate research.

Anatomy – a method that both provides good privacy (in terms of l-diversity) and allows for accurate research.

Summary (cont.)

A nearly-optimal algorithm for anatomizing tables.

Achieves the minimal possible error (or close to it).

Complexity is linear.

Simple and can be implemented easily.

Experiments have shown that Anatomy has an average error of below 10%, as opposed to over 100% error of generalization.

Only the case with a single sensitive attribute is investigated (the rest is left to future work).

Background knowledge of attacker is neglected.Background knowledge of attacker is neglected.
