template-based privacy preservation in classification problems

30
Template-Based Privacy Preservation in Classification Problems IEEE ICDM 2005 Benjamin C. M. Fung Simon Fraser University BC, Canada [email protected] Ke Wang Simon Fraser University BC, Canada [email protected] Philip S. Yu IBM T.J. Watson Research Center [email protected]

Upload: loe

Post on 30-Jan-2016

24 views

Category:

Documents


0 download

DESCRIPTION

Template-Based Privacy Preservation in Classification Problems. Ke Wang Simon Fraser University BC, Canada [email protected]. Benjamin C. M. Fung Simon Fraser University BC, Canada [email protected]. Philip S. Yu IBM T.J. Watson Research Center [email protected]. IEEE ICDM 2005. Outline. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Template-Based Privacy Preservation in Classification Problems

Template-Based Privacy Preservation in

Classification Problems

IEEE ICDM 2005

Benjamin C. M. Fung

Simon Fraser University BC, Canada

[email protected]

Ke Wang

Simon Fraser University BC, Canada

[email protected]

Philip S. Yu

IBM T.J. Watson Research Center

[email protected]

Page 2: Template-Based Privacy Preservation in Classification Problems

2

Outline

• Privacy threats caused by data mining abilities

• Our method: Progressive Disclosure Algorithm

• Experimental evaluation

• Related works

• Conclusions

Page 3: Template-Based Privacy Preservation in Classification Problems

3

Privacy Concern

• Most previous works concern the input of data mining tools where private information is revealed directly by inspection of the data without sophisticated analysis.

• Our privacy concern is on the output of data mining methods.– The aggregate patterns can be used to infer sensitive

information about individuals.

Page 4: Template-Based Privacy Preservation in Classification Problems

• Motivating Example: A data owner wants to release a table to a data mining firm for classification analysis on Rating, but does not want the firm to infer the bankruptcy state Discharged using the attributes Job and Country.

• This work aims at releasing data with dual goals:– Preserve information for wanted classification analysis.– Limit usefulness of unwanted sensitive inferences.

Page 5: Template-Based Privacy Preservation in Classification Problems

5

• Motivating Example: A data owner wants to release a table to a data mining firm for classification analysis on Rating, but does not want the firm to infer the bankruptcy state Discharged using the attributes Job and Country.

• Inference: {Trader,UK} Discharged• Confidence = 4/5 = 80%• An inference is sensitive if its confidence > threshold.

Page 6: Template-Based Privacy Preservation in Classification Problems

6

Eliminate Low Support Inferences?

• In data mining, association or classification rules are used to capture general patterns of large populations.– A low support means the lack of statistical significance.

• In privacy protection, inference rules are used to infer sensitive information about individuals.– Eliminate sensitive inferences of any support.– In fact, a sensitive inference in a small group could

present even more threats because individuals in a small group are more identifiable.

Page 7: Template-Based Privacy Preservation in Classification Problems

7

The Problem

• Consider a table T(M1,…,Mm, 1, …, n, )

• Classification goal: Modeling class attribute • Privacy goal: Limit sensitive inferences on i

– Specified by one or more templates <IC , h>– IC is a set of attributes containing some masking

attributes Mj, e.g., IC = {Job, Country}

is a value from some i, e.g., = Discharged

– h is a threshold on confidence– ic is an inference, where ic contains values from IC– T satisfies <IC , h> if every matching inference

ic has a confidence conf(ic ) ≤ h.

Page 8: Template-Based Privacy Preservation in Classification Problems

8

Flexibility of Templates

• Selectively protecting certain values while not protecting other values.

• Specifying a different threshold h for a different template IC .

• Specifying multiple inference channels ICs (even for the same ).

• Specifying templates for multiple sensitive attributes.

• These flexibilities minimize unnecessary masking, i.e., minimize unnecessary information loss.

Page 9: Template-Based Privacy Preservation in Classification Problems

9

• Achieve goals by suppressing some values on masking attributes M1,…,Mm

• To eliminate {Trader,UK} Discharged

– Suppress Trader and Clerk to Job

– Suppress UK and Canada to Country

– Reduced confidence = 5/10 = 50%

Page 10: Template-Based Privacy Preservation in Classification Problems

10

Challenges• Incorrect suppression may eliminate some

desired classification structures for modeling .

• Finding an optimal suppression is hard. – For a table with a total of q distinct values on masking

attributes, there are 2q possible suppressed tables.– We present an approximate solution based on a

search that iteratively improves the solution and prunes the search whenever no better solution is possible.

Page 11: Template-Based Privacy Preservation in Classification Problems

11

The Algorithm• Progressive Disclosure Algorithm (PDA) iteratively

discloses domain values starting from the most suppressed T in which each masking attribute Mj in

IC contains only ∪ j.

• Supj contains all currently suppressed values in Mj.

• In each iteration, disclose one suppressed value w.

• To disclose a value w from Supj, we replace j with w in all suppressed records that currently contain j and originally contain w before suppression.

• This process repeats until no disclosure is possible without violating the set of templates.

Page 12: Template-Based Privacy Preservation in Classification Problems

12

Progressive Disclosure Algorithm (PDA)

1: suppress every value of Mj to j where Mj IC;∪

2: every Supj contains all domain values of Mj IC;∪

3: while there is a candidate in Sup∪ j do

4: find winner w of highest Score(w) from Sup∪ j;

5: disclose w on T and remove w from Sup∪ j;

6: update Score(x) and status for x in Sup∪ j;

7: end while

8: output the suppressed T and Sup∪ j;

Page 13: Template-Based Privacy Preservation in Classification Problems

Conf = 5 / 24 = 21% Conf = 1 / 4 = 25% Conf = 1 / 4 = 25%

Page 14: Template-Based Privacy Preservation in Classification Problems

14

Search Criteria: Score

• Disclosing a value v gains information and loses privacy

• Score(v) measures the information gain per unit of privacy loss.

• InfoGain measures the information gain of disclosing v.

Page 15: Template-Based Privacy Preservation in Classification Problems

15

Search Criteria: Score

• PrivLoss measures the privacy loss of disclosing v, defined as the average increase of Conf(IC ) over all affected IC .

where Conf and Confv represent the confidence before and after disclosing v.

• The key to the scalability of our algorithm is incrementally updating Score(v) in each iteration for candidates v in Sup∪ j. (see paper for details)

Page 16: Template-Based Privacy Preservation in Classification Problems

16

Cost Analysis

• At each iteration, the cost can be summarized as two operations.

1. Scan the partitions on Link[w] for disclosing the winner w and maintaining some count statistics.

2. Make use of the count statistics to update the score and status of every affected candidate without accessing data records. Thus, each iteration accesses only the records suppressed to w.

• The number of iterations is bounded by the number of distinct values in the masking attributes.

Page 17: Template-Based Privacy Preservation in Classification Problems

17

Experimental Evaluation

• Data quality– Experiment with a broad range of templates.– Use C4.5 classifier.– Measure classification error before and after

suppression.

• Efficiency and Scalability

Page 18: Template-Based Privacy Preservation in Classification Problems

18

Data sets

• Japanese Credit Screening (CRX)– Credit card applications– 8 categorical attributes and 2 classes– 465 recs. for training and 188 recs. for testing

• Adult– Census data– 8 categorical attributes and 2 classes– 30162 recs. for training and 15060 recs. for testing – Previously used in Bayardo et al. (2005), Fung et al.

(2005), Iyengar (2002), and Wang et al. (2004).

Page 19: Template-Based Privacy Preservation in Classification Problems

Results on CRX• TopN sensitive attributes 1,…,N; an IC includes the

remaining masking attributes M1,…Mm.

• Base Error (BE) for original data = 15.4%

• Suppression Error (SE) for suppressed data

• Removal Error (RE) for

removed 1,…,N

• RE-SE measures benefitsof suppression

• Took at most 2 seconds for each experiment

Page 20: Template-Based Privacy Preservation in Classification Problems

20

Results on Adult• Base Error (BE) for original data = 17.6%

• Took at most 14 seconds for each experiment.

Page 21: Template-Based Privacy Preservation in Classification Problems

Scalability• Replicate the Adult data set and substitute some random data.

• A time consuming setting:– 1 sensitive attribute– Remaining 7 as masking

attributes– h=90%

Page 22: Template-Based Privacy Preservation in Classification Problems

22

Related Works

• Iyengar (2002) proposed a genetic algorithm to address the problem of k-anonymity for classification.

• Bayardo et al. (2005) employed generalization and suppression to address a similar problem.

• Our work concerns over the output of data mining methods, where the threats are caused by what data mining tools can discover.

Page 23: Template-Based Privacy Preservation in Classification Problems

23

Related Works

• Clifton (2000) suggested to eliminate sensitive inferences by limiting the data size.

• Verykios et al. (2004) proposed several algorithms for hiding association rules in a transaction database with minimal modification to the data. – Hide one rule at a time by either decreasing its support

or its confidence– Achieved by removing items from transactions. – Our work considers the use of the data for classification

analysis and eliminates all sensitive inferences including those with a low support.

Page 24: Template-Based Privacy Preservation in Classification Problems

24

Related Works

• Cox (1980) proposed the k%- dominance rule which suppresses a sensitive cell if the attribute values of two or three entities in the cell contribute more than k% of the corresponding SUM statistic.

– Such “cell suppression” suppresses the count or other statistics stored in a cell of a statistical table.

– Very different from the “value suppression” considered in our work.

Page 25: Template-Based Privacy Preservation in Classification Problems

25

Conclusions• Formulate a template-based privacy preservation

problem.

• Show that suppression is an effective way to eliminate sensitive inferences.

• Present an effective algorithm based on a search that iteratively improves the solution.

• Evaluate this method on real life data sets.

Page 26: Template-Based Privacy Preservation in Classification Problems

26

References

1. R. Agrawal, T. Imielinski, and A. N. Swami. Mining association rules between sets of items in large datasets. In Proc. of the 1993 ACM SIGMOD, pages 207-216, 1993.

2. R. J. Bayardo and R. Agrawal. Data privacy through optimal k-anonymization. In Proc. of the 21st IEEE ICDE, pages 217-228, 2005.

3. C. Clifton. Using sample size to limit exposure to data mining. Journal of Computer Security, 8(4):281-307, 2000.

4. C. Clifton, M. Kantarcioglu, J. Vaidya, X. Lin, and M. Y. Zhu. Tools for privacy preserving data mining. SIGKDD Explorations, 4(2), 2002.

5. L. H. Cox. Suppression methodology and statistical disclosure control. Journal of the American Statistics Association, Theory and Method Section, 75:377-385, 1980.

Page 27: Template-Based Privacy Preservation in Classification Problems

27

References

6. A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke. Privacy preserving mining of association rules. In Proc. of the 8th ACM SIGKDD, pages 217-228, 2002.

7. C. Farkas and S. Jajodia. The inference problem: A survey. SIGKDD Explorations, 4(2):6-11, 2003.

8. B. C. M. Fung, K. Wang, and P. S. Yu. Top-down specialization for information and privacy preservation. In Proc. of the 21st IEEE ICDE, pages 205-216, Tokyo, Japan, 2005.

9. S. Hettich and S. D. Bay. The UCI KDD Archive, 1999. http://kdd.ics.uci.edu.

10. V. S. Iyengar. Transforming data to satisfy privacy constraints. In Proc. of the 8th ACM SIGKDD, 2002.

11. M. Kantarcioglu, J. Jin, and C. Clifton. When do data mining results violate privacy? In Proc. of the 2004 ACM SIGKDD, pages 599-604, 2004.

Page 28: Template-Based Privacy Preservation in Classification Problems

28

References

12. J. Kim and W. Winkler. Masking microdata files. In ASA Proc. of the Section on Survey Research Methods, 1995.

13. W. Kloesgen. Knowledge discovery in databases and data privacy. In IEEE Expert Symposium: Knowledge Discovery in Databases, 1995.

14. J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

15. L. Sweeney. Datafly: A system for providing anonymity in medical data. In Proc. of the 11th International Conference on Database Security, pages 356-381, 1998.

16. L. Sweeney. Achieving k-anonymity privacy protection using generalization and suppression. International Journal on Uncertainty, Fuzziness, and Knowledge-based Systems, 10(5):571-588, 2002.

Page 29: Template-Based Privacy Preservation in Classification Problems

29

References

17. V. S. Verykios, A. K. Elmagarmid, E. Bertino, Y. Saygin, and E. Dasseni. Association rule hiding. IEEE TKDE, 16(4):434-447, 2004.

18. K. Wang, P. S. Yu, and S. Chakraborty. Bottom-up generalization: a data mining solution to privacy protection. In Proc. of the 4th IEEE ICDM, 2004.

19. R. W. Yip and K. N. Levitt. The design and implementation of a data level database inference detection system. In Proc. of the 12th International Working Conference on Database Security XII, pages 253-266, 1999.

Page 30: Template-Based Privacy Preservation in Classification Problems

30

FAQQ: Inference rules with low supports are insignificant

anyway, why do we bother eliminating them?A: Keeping those low-support inferences is a relaxation of

our current privacy requirement. In other words, the suppression error will be even lower (better) than our current model. If the user prefers, she may introduce another threshold (minimum support). This will further improve the classification quality.