anonymizing sequential releases acm sigkdd 2006 benjamin c. m. fung simon fraser university...

Anonymizing Sequential Releases

ACM SIGKDD 2006

Benjamin C. M. Fung

Simon Fraser University

[email protected]

Ke Wang

Simon Fraser University

[email protected]

2

Motivation: Sequential Releases

• Previous works address single release only.

• Data are released in multiple shots.

• An organization makes a new release:– New information become available.– A tailored view for each data sharing purpose.– Separate release for sensitive information and

identifying information.

• Related releases sharpens the identification of individuals by a global quasi-identifier.

3

T2: Previous Release

Pid Job Disease

1 Banker Cancer

2 Banker Cancer

3 Clerk HIV

4 Driver Cancer

5 Engineer HIV

T1: Current Release

Pid Name Job Class

1 Alice Banker c1

2 Alice Banker c1

3 Bob Clerk c2

4 Bob Driver c3

5 Cathy Engineer c4

The join on T1.Job = T2.Job

Pid Name Job Disease Class

1 Alice Banker Cancer c1


3 Bob Clerk HIV c2

4 Bob Driver Cancer c3

5 Cathy Engineer HIV c4

- Alice Banker Cancer c1


Do not want Name to be linked to Disease in the join of the two releases.

4


Pid Job Disease

1 Banker Cancer

2 Banker Cancer

3 Clerk HIV

4 Driver Cancer

5 Engineer HIV

T1: Current Release

Pid Name Job Class

1 Alice Banker c1

2 Alice Banker c1

3 Bob Clerk c2

4 Bob Driver c3

5 Cathy Engineer c4





3 Bob Clerk HIV c2





join sharpens identification:{Bob, HIV} has groups size 1.

5


Pid Job Disease

1 Banker Cancer

2 Banker Cancer

3 Clerk HIV

4 Driver Cancer

5 Engineer HIV

T1: Current Release

Pid Name Job Class

1 Alice Banker c1

2 Alice Banker c1

3 Bob Clerk c2

4 Bob Driver c3

5 Cathy Engineer c4





3 Bob Clerk HIV c2





join weakens identification:{Alice, Cancer} has groups size 4.

lossy join: combat join attack.

6


Pid Job Disease

1 Banker Cancer

2 Banker Cancer

3 Clerk HIV

4 Driver Cancer

5 Engineer HIV

T1: Current Release

Pid Name Job Class

1 Alice Banker c1

2 Alice Banker c1

3 Bob Clerk c2

4 Bob Driver c3

5 Cathy Engineer c4





3 Bob Clerk HIV c2





join enables inferences across tables:AliceCancer has 100% confidence.

7

Related Work

• k-anonymity [SS98, FWY05, BA05, LDR05, WYC04, WLFW06]– Quasi-identifier (QID): a set of identifying

attributes in the table. If some record is linked to an external source by a QID value, so are at least k-1 other records.

– The database is made anonymous to itself.– In sequential releases, the database must be

made anonymous to the combination of all releases thus far.

8

Related Work

• l-diversity [MGK06]

– Ensures that sensitive values are “well-represented” in each QID group, measured by entropy.

• Confidence limiting [WFY05, WFY06]:

qid s, confidence < h

where qid is a value on QID, s is a sensitive value.

9

Related Work

• View releases– e.g., T1 and T2 are two views, both can be

modified before the release: more room for satisfying privacy and information requirements.

– [MW04, DP05] measure information disclosure of a view set wrt a secret view.

– [YWJ05, KG06] detect privacy violation by a view set over a base table.

– They measure or detect violations, but do not remove them.

10

Sequential Release• Sequential release:

– Current release T1. Previous release T2.– T1 was unknown when T2 was released.– T2, once released, cannot be modified when T1 is

released.

• Solution #1: k-anonymize all attributes in T1.– Excessive distortion.

• Solution #2: generalize T1 based on T2.– Monotonically distort the later release.

• Solution #3: release a “complete” cohort of all potential releases anonymized at one time.– Require predicting all future releases

11

Intuition of Our Approach

• A lossy join hides the true join relationship to cripple a global QID.

• Generalizing the current release T1 so that the join with the previous release T2 becomes lossy enough to disorient the attacker.

• Two general notions of privacy: (X,Y)-anonymity and (X,Y)-linkability, where X and Y are sets of attributes.

15

(X,Y)-Privacy• k-anonymity: # of distinct records for each

QID value ≥ k.

• (X,Y)-anonymity: # of distinct Y values for each X value ≥ k.

• (X,Y)-linkability: the maximum confidence that a record contains y given that it contains x ≤ k, where (x,y) are values on X and Y.

• Generalize k-anonymity [SS98] and confidence limiting [WFY05, WFY06].

16

Example: (X,Y)-AnonymityPid Job Zip PoB Test

1 Banker 123 Canada HIV

1 Banker 123 Canada Diabetes

1 Banker 123 Canada Eye

2 Clerk 456 Japan HIV

2 Clerk 456 Japan Diabetes

2 Clerk 456 Japan Eye

2 Clerk 456 Japan Heart• QID = {Job, Zip, PoB} is not a key.• k-anonymity fails to ensure that each value

on QID is linked to at least k distinct patients.

17

Example: (X,Y)-Anonymity• With (X,Y)-anonymity,

– specify the anonymity wrt patients by letting

X = {Job, Zip, PoB} and

Y = Pid– Each X group must be linked to at least k

distinct values on Pid.

• If X = {Job, Zip, PoB} and Y = Test, each X group is required to be linked to at least k distinct tests.

18

Example: (X,Y)-LinkabilityPid Job Zip PoB Test




4 Banker 123 Canada Diabetes



• {Banker,123,Canada} HIV (75% confidence).• With Y = Test, the (X,Y)-linkability states that no

test can be inferred from a value on X with a confidence higher than a given threshold.

19

Problem Statement

• The data holder has previously released T2 and wants to release T1, where T2 and T1 are projections of the same underlying table.

• Want to ensure (X,Y)-privacy on the join of T1 and T2.

• Sequential anonymization is to generalize T1 on X ∩ att(T1) so that the join of T1 and T2 preserves the (X,Y)-privacy and T1 remains as useful as possible.

20

Generalization / Specialization• Each generalization replaces all child

values with the parent value. – A cut contains exactly one

value on every root-to-leafpath.

• Each specialization v {v1,…,vc}, replaces the value v in every record containing v with the child value vi that is consistent with the original domain value in the record.

JobANY

Professional Admin

Engineer Lawyer Banker Clerk

21

Generalization / Specialization

• An interval of a continuous attribute is split on-the-fly to maximize information utility.– e.g., age [30-40) [30-37), [37-40)– The split at 37 maximizes the information

gain.

• A taxonomy tree is dynamically grown for each continuous (non-join) attribute.

22

Match Function

• Given T1 and T2, the attacker may apply prior knowledge to match the records in T1 and T2.

• So, the data holder applies such prior knowledge for matching:

– schema information of T1 and T2.

– taxonomies for attributes.

– following inclusion-exclusion principle.

23

Match Function• Let t1 T1 and t2 T2.• Consistency Predicate: t1.A matches t2.A

if they are on the same generalization path for attribute A. – e.g., Male matches Single Male.

• Inconsistency Predicate: t1.A matches t2.B only if t1.A and t2.B are not semantically inconsistent.– Excludes impossible matches.– e.g., Male and Pregnant are semantically

inconsistent, so are Married Male and 6 Month Pregnant.

24

Algorithm Overview

Top-Down Specialization for Sequential AnonymizationInput: T1, T2, a (X,Y)-privacy requirement, a taxonomy tree

for each attribute in X1 where X1=X ∩ att(T1). Output: a generalized T1 satisfying the privacy requirement.

1. generalize every value of Aj to ANYj where Aj X1;

2. while there is a valid candidate in ỤCutj do

3. find the winner w of highest Score(w) from ỤCutj;

4. specialize w on T1 and remove w from ỤCutj;

5. update Score(v) and the valid status for all v in ỤCutj;6. end while

7. output the generalized T1 and ỤCutj;

25

Monotonic Privacy • Theorem 1: On a single table, the (X,Y)-privacy

is anti-monotone wrt specialization on X.– If violated, remains violated after a specialization.

• AY(X) is non-increasing wrt specialization on X.– X always reduces the set of records that contain a X

value, therefore, reduces the set of Y values that co-occur with a X value.

• LY(X) is non-decreasing wrt specialization on X.

– A specialization v {v1,…,vc} transforms a value x on X to the specialized values x1,…,xc on X.

– If ly(xi) < ly(x) for some xi, there must exist some xj such that ly(xj) > ly(x) (otherwise, ly(x) < ly(xi)).

26

Monotonic Privacy

• On the join of T1 and T2, in general, (X,Y)-anonymity is not anti-monotone wrt a specialization on X ∩ att(T1).– Specializing T1 may create dangling records.

• Two tables are population-related if every record in each table has at least one matching record in the other table no dangling record.

• Lemma 1: If T1 and T2 are population-related, AY(X) is non-increasing wrt specialization on X ∩ att(T1).

27

Monotonic Privacy

• Lemma 2: If Y contains attributes from T1 or T2, but not from both, LY(X) does not decrease after specialization of T1 on the attributes X ∩ att(T1).

• Theorem 2: Assume that T1 and T2 are projections of the same underlying tables, (X,Y)-anonymity and (X,Y)-linkability on the join of T1 and T2 are anti-monotone wrt specialization of T1 on X ∩ att(T1).

28

Score Metric

• Score(v) evaluates the “goodness” of a specialization v for preserving privacy and information.

• Each specialization v gains some information and loses some privacy. We maximize

• InfoGain(v) is measured on T1.• PrivLoss(v) is measured on the join of T1 and T2.

29

Information Gain

• If T1 is released for classification on a specified class column, InfoGain(v) could be the reduction of the class entropy:

• T1[v] denotes the set of generalized records in T1 that contain v before the specialization.

• T1[vi] denotes the set of records in T1 that contain vi after the specialization.

• InfoGain(v) could be the notion of distortion.

30

Privacy Loss

• PrivLoss(v) is measured by the decrease of AY(X) or the increase of LY(X) due to the specialization of v: AY(X) - AY(Xv) for (X,Y)-anonymity

LY(Xv) - LY(X) for (X,Y)-linkability

where X and Xv represent the attributes before and after specializing v respectively.

31

Challenges1. Each specialization on w affects the

matching of join, thus, privacy checking.• too expensive to rejoin the two tables for

each specialization.

2. Materializing the join is impractical.• A lossy join can be very large.

Our solution: Incrementally maintains some count statistics to update Score(v) without executing the join.

32

Data Structure

• Expensive operations on specializing w– accessing the records in T1 containing w– matching the records in T1 with the records in

T2.

• X1 = X ∩ att(T1) and X2 = X ∩ att(T2),

• J1 and J2 denote the join attributes in T1 and T2.

33

Data Structure• Tree1: partition T1 records by the

attributes X1 and J1-X1 in that order, one level per attribute.– Link[v] links up all nodes for v at the attribute

level of v.

• Tree2: partition T2 records by the attributes J2 and X2-J2 in that order.– Tree2 is static.

• Probe the matching partitions in Tree2. – Match the last |J1| attributes in a partition in

Tree1 with the first |J2| attributes in Tree2.

34

Analysis• On specializing w, Link[w] provides a direct access

to the records involved in T1• Tree2 provides a direct access to the matching

partitions in T2.• Matching is performed at the partition level, not at

the record level. • The cost of each iteration has two parts.

1. Specialize the affected partitions on Link[w]. 2. Update the score and status of candidates using count

statistics.

• Each record in T1 is accessed at most | X ∩ att(T1) | h times where h is the maximum height of the taxonomies.

35

Empirical Study

• The Adult data set. 45222 records.• Two versions of (T1,T2)• Set A (categorical attributes only)

– T1 contains the Class attribute, the 3 categorical attributes and the 3 join attributes.

– T2 contains the 2 categorical attributes and the 3 join attributes.

• Set B (both categorical and continuous)– T1 contains the additional 6 continuous

attributes from Taxation Department.

Department Attribute # of

Leaves

# of

Levels

Taxation

(T1)

Education (E) 16 5

Occupation (O) 14 3

Work-class (W) 8 5

Common

(T1 & T2)

Marital-status (M) 7 4

Relationship (Ra) 6 3

Sex (S) 2 2

Immigration

(T2)

Native-country (Nc) 40 5

Race (Ra) 5 3

Schema for Set A

• T1 contains the Class attribute

37

Empirical Study• Classification metric

– Classification error on the generalized testing set of T1.

• Distortion metric [SS98]– Categorical: 1 unit of distortion for each

generalization.– Continuous: Suppose v is generalized to interval

[a-b). Unit of distortion = (b-a)/(f2-f1), where [f1,f2) is the full range of the attribute.

– Normalize total distortion by the number of records.

38

(X,Y)-Anonymity• TopN attributes: most important for classification.

– Chosen by successively removing the top attribute in a decision tree.

• Join attributes are the Top3 attributes.– If not important, simply remove them.

• X contains – TopN attributes in T1 for a specified N (to ensure

that the generalization is performed on important attributes),

– all join attributes,– all attributes in T2 (to ensure X is global).

39

Distortion of (X,Y)-anonymity• Ki is a key in Ti.• XYD: produced by our method with Y = K1.• KAD: produced by k-anonymity on T1 with

QID=att(T1).

Set A Set B

40

Classification error of (X,Y)-anonymity• XYE: produced by our method with Y = K1.• XYE(row): produced by our method with Y={K1,K2}.• BLE: produced by the unmodified data.• KAE: produced by k-anonymity on T1 with QID=att(T1).• RJE: produced by removing all join attributes from T1.

Set A Set B

41

(X,Y)-Linkability• Y contains the TopN attributes.

– If not important, simply remove them.

• X contains the rest of the attributes in T1 and T2, except T2.Ra and T2.Nc because otherwise no privacy requirement can be satisfied.

• Focus on the classification error because the distortion due to (X,Y)-linkability is not comparable with the distortion due to k-anonymity.

42

Classification error of (X,Y)-linkability• XYE: produced by our method with Y = TopN.• BLE: produced by the unmodified data.• RJE: produced by removing all join attributes from T1.• RSE: produced by removing all attributes in Y from T1.

Set A Set B

43

(X,Y)-anonymity (k=40) (X,Y)-linkability (k=90%)

Scalability

44

Conclusion• Previous works on k-anonymization

focused on a single release of data.• Studied the sequential anonymization

problem. • Extended the privacy notion to this model. • Introduced lossy join as a way to hide the

join relationship among releases.• Addressed computational challenges due

to large size of lossy join.• Extendable to more than one previously

released tables T2,…,Tp.

45

References[BA05] R. Bayardo and R. Agrawal. Data privacy

through optimal k-anonymization. In IEEE ICDE, pages 217.228, 2005.

[DP05] A. Deutsch and Y. Papakonstantinou. Privacy in database publishing. In ICDT, 2005.

[FWY05] B. C. M. Fung, K. Wang, and P. S. Yu. Top-down specialization for information and privacy preservation. In IEEE ICDE, pages 205.216, April 2005.

[KG06] D. Kifer and J. Gehrke. Injecting utility into anonymized datasets. In ACM SIGMOD, Chicago, IL, June 2006.

46

References[LDR05] K. LeFevre, D. J. DeWitt, and R.

Ramakrishnan. Incognito: Efcient full-domain k-anonymity. In ACM SIGMOD, 2005.

[MGK06] A. Machanavajjhala, J. Gehrke, and D. Kifer. l-diversity: Privacy beyond k-anonymity. In IEEE ICDE, 2006.

[MW04] A. Meyerson and R. Williams. On the complexity of optimal k-anonymity. In PODS, 2004.

[SS98] P. Samarati and L. Sweeney. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. In IEEE Symposium on Research in Security and Privacy, May 1998.

47

References[WFY05] K. Wang, B. C. M. Fung, and P. S. Yu.

Template-based privacy preservation in classification problems. In IEEE ICDM, pages 466.473, November 2005.

[WFY06] K. Wang, B. C. M. Fung, and P. S. Yu. Handicapping attacker's condence: An alternative to k-anonymization. Knowledge and Information Systems: An International Journal, 2006.

[WYC04] K. Wang, P. S. Yu, and S. Chakraborty. Bottom-up generalization: A data mining solution to privacy protection. In IEEE ICDM, November 2004.

48

References[WLFW06] R. C. W. Wong, J. Li., A. W. C. Fu, and

K. Wang. (,k)-anonymity: An enhanced k-anonymity model for privacy preserving data publishing. In ACM SIGKDD, 2006.

[YWJ05] C. Yao, X. S. Wang, and S. Jajodia. Checking for k-anonymity violation by views. In VLDB, 2005.

anonymizing sequential releases acm sigkdd 2006 benjamin c. m. fung simon fraser university...

Documents

previous release pidjobdisease

engineerhiv t1

new release

single release

separate release

lossy join

related releases

qid value