
Privacy preserving data mining: randomized response and association rule hiding

Li Xiong

CS573 Data Privacy and Anonymity

Partial slides credit: W. Du, Syracuse University, Y. Gao, Peking University

Privacy Preserving Data Mining Techniques

Protecting sensitive raw data
  Randomization (additive noise)
  Geometric perturbation and projection (multiplicative noise)
  Randomized response technique
    Categorical data perturbation in data collection model

Protecting sensitive knowledge (knowledge hiding)

Data Collection Model

Data cannot be shared directly because of privacy concerns

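Background: Randomized Response

Do you smoke? (The true answer is “Yes”.)

Biased coin: \( P(\text{Head}) = \theta \), with \( \theta \neq 0.5 \)

Head: report the true answer (Yes)
Tail: report the opposite answer (No)

\[ P'(\text{Yes}) = P(\text{Yes}) \cdot \theta + P(\text{No}) \cdot (1 - \theta) \]
\[ P'(\text{No}) = P(\text{Yes}) \cdot (1 - \theta) + P(\text{No}) \cdot \theta \]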
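The estimation step can be sketched in a few lines. This is a minimal simulation, assuming the coin lands heads with probability theta and that each respondent reports the truth on heads and the opposite on tails; the values theta = 0.7 and a 30% true "Yes" rate, and the function names, are made up for illustration.

```python
import random

def randomize(truth: bool, theta: float, rng: random.Random) -> bool:
    """Report the true answer with probability theta, otherwise its opposite."""
    return truth if rng.random() < theta else not truth

def estimate_p_yes(reports: list[bool], theta: float) -> float:
    """Invert P'(Yes) = P(Yes)*theta + (1 - P(Yes))*(1 - theta) for P(Yes)."""
    if theta == 0.5:
        raise ValueError("theta = 0.5 makes the reports independent of the truth")
    p_reported = sum(reports) / len(reports)
    return (p_reported - (1 - theta)) / (2 * theta - 1)

rng = random.Random(0)
theta = 0.7        # assumed coin bias
true_p_yes = 0.3   # assumed fraction of true "Yes" answers
truths = [rng.random() < true_p_yes for _ in range(100_000)]
reports = [randomize(t, theta, rng) for t in truths]
estimate = estimate_p_yes(reports, theta)
print(round(estimate, 3))  # close to 0.3, up to sampling noise
```

No single report reveals the respondent's true answer with certainty, yet the aggregate proportion is recoverable. The estimate becomes noisier as theta approaches 0.5, because the denominator 2*theta - 1 shrinks.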

Decision Tree Mining using Randomized Response

Multiple attributes encoded in bits

Biased coin: \( P(\text{Head}) = \theta \), with \( \theta \neq 0.5 \)

Head: true answer E: 110
Tail: false answer !E: 001

Column distribution can be estimated for learning a decision tree!

Using Randomized Response Techniques for Privacy-Preserving Data Mining, Du, 2003
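As a rough illustration of how a column distribution might be estimated, the sketch below applies the single-question equations to a bit pattern E and its complement !E and solves the two equations for P(E). The helper names and the 100 disguised records are hypothetical; the records were constructed by hand so that the true P(110) is 0.5 and the true P(001) is 0.1 under an assumed theta of 0.8.

```python
def flip(bits: str) -> str:
    """Complement every bit, e.g. 110 -> 001."""
    return "".join("1" if b == "0" else "0" for b in bits)

def estimate_pattern(reported: list[str], pattern: str, theta: float) -> float:
    """Estimate the true P(E) from disguised records.

    Uses  P*(E)  = P(E)*theta + P(!E)*(1 - theta)
          P*(!E) = P(!E)*theta + P(E)*(1 - theta)
    which gives P(E) = (theta*P*(E) - (1 - theta)*P*(!E)) / (2*theta - 1).
    """
    n = len(reported)
    p_star_e = sum(r == pattern for r in reported) / n
    p_star_not_e = sum(r == flip(pattern) for r in reported) / n
    return (theta * p_star_e - (1 - theta) * p_star_not_e) / (2 * theta - 1)

theta = 0.8  # assumed coin bias
# Hypothetical disguised data set: proportions 0.42, 0.18, 0.32, 0.08.
reported = ["110"] * 42 + ["001"] * 18 + ["010"] * 32 + ["101"] * 8

p_110 = estimate_pattern(reported, "110", theta)
p_001 = estimate_pattern(reported, "001", theta)
print(round(p_110, 3), round(p_001, 3))
```

A decision tree learner needs only such proportions (for example, to compute information gain at each split), which is why estimated distributions can stand in for the raw answers.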

Accuracy of Decision tree built on randomized response

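Generalization for Multi-Valued Categorical Data

True value: \( S_i \)

Reported value: \( S_i \) with probability \( q_1 \), \( S_{i+1} \) with \( q_2 \), \( S_{i+2} \) with \( q_3 \), \( S_{i+3} \) with \( q_4 \)

\[
\begin{pmatrix} P'(s_1) \\ P'(s_2) \\ P'(s_3) \\ P'(s_4) \end{pmatrix}
= M \begin{pmatrix} P(s_1) \\ P(s_2) \\ P(s_3) \\ P(s_4) \end{pmatrix},
\qquad
M = \begin{pmatrix}
q_1 & q_4 & q_3 & q_2 \\
q_2 & q_1 & q_4 & q_3 \\
q_3 & q_2 & q_1 & q_4 \\
q_4 & q_3 & q_2 & q_1
\end{pmatrix}
\]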
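The relation between reported and true distributions can be checked numerically. The sketch below builds the 4x4 matrix from assumed probabilities q1..q4, disguises a made-up true distribution, and recovers it by solving the linear system. The numbers and function names are illustrative assumptions, and the small Gauss-Jordan solver is there only to keep the example free of third-party libraries.

```python
def circulant(q):
    """Entry (row j, column i) is the probability of reporting s_j when the truth is s_i."""
    n = len(q)
    return [[q[(j - i) % n] for i in range(n)] for j in range(n)]

def mat_vec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def solve(m, b):
    """Solve m x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[k]] for k, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular; the distribution cannot be recovered")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[k][n] / aug[k][k] for k in range(n)]

q = [0.6, 0.2, 0.15, 0.05]        # assumed q1..q4, summing to 1
M = circulant(q)
p_true = [0.4, 0.3, 0.2, 0.1]     # made-up true distribution P(s1)..P(s4)
p_reported = mat_vec(M, p_true)   # P'(s1)..P'(s4)
p_recovered = solve(M, p_reported)
print([round(x, 3) for x in p_recovered])
```

In practice the reported distribution comes from a finite sample, so the recovered distribution is an estimate rather than an exact inverse.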

A Generalization

RR Matrices [Warner 65], [R. Agrawal 05], [S. Agrawal 05]

RR matrix can be arbitrary:

\[
M = \begin{pmatrix}
a_{11} & a_{12} & a_{13} & a_{14} \\
a_{21} & a_{22} & a_{23} & a_{24} \\
a_{31} & a_{32} & a_{33} & a_{34} \\
a_{41} & a_{42} & a_{43} & a_{44}
\end{pmatrix}
\]

Can we find optimal RR matrices?

OptRR: Optimizing Randomized Response Schemes for Privacy-Preserving Data Mining, Huang, 2008

What is an optimal matrix?

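Which of the following is better?

\[
M_1 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix},
\qquad
M_2 = \begin{pmatrix}
\frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\
\frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\
\frac{1}{3} & \frac{1}{3} & \frac{1}{3}
\end{pmatrix}
\]

Privacy: M2 is better
Utility: M1 is better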

So, what is an optimal matrix?

Optimal RR Matrix

An RR matrix M is optimal if no other RR matrix is better than M in both privacy and utility (i.e., no other matrix dominates M).

Privacy Quantification
Utility Quantification

A number of privacy and utility metrics have been proposed.
  Privacy: how accurately one can estimate individual info.
  Utility: how accurately we can estimate aggregate info.

Metrics

Privacy: accuracy of estimate of individual values

Utility: difference between the original probability and the estimated probability
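One possible way to make the privacy metric concrete is the success rate of an adversary who, after seeing a report, guesses the most probable true value. This is an assumed formulation for illustration and not necessarily the metric used in OptRR. The sketch applies it to the identity matrix M1 and the all-1/3 matrix M2 with a made-up prior.

```python
def adversary_accuracy(m, prior):
    """Probability that a maximum-a-posteriori guess of the true value is correct.

    m[j][i] = P(report j | true value i); prior[i] = P(true value i).
    """
    n = len(prior)
    return sum(max(m[j][i] * prior[i] for i in range(n)) for j in range(len(m)))

def reported_distribution(m, prior):
    """Distribution of the reported values, P' = M P."""
    n = len(prior)
    return [sum(m[j][i] * prior[i] for i in range(n)) for j in range(len(m))]

M1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
M2 = [[1 / 3] * 3 for _ in range(3)]
prior = [0.5, 0.3, 0.2]  # made-up true distribution

acc_m1 = adversary_accuracy(M1, prior)  # every individual value is exposed
acc_m2 = adversary_accuracy(M2, prior)  # no better than guessing the most common value
print(acc_m1, round(acc_m2, 3))

# Utility side: under M2 the reports look the same whatever the true distribution is.
print(reported_distribution(M2, prior))
print(reported_distribution(M2, [0.1, 0.1, 0.8]))
```

With M1 the adversary is always right, so privacy is worst and utility is best. With M2 the reports carry no information about individuals, but the aggregate distribution cannot be estimated either.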

Optimization Methods

Approach 1: Weighted sum: w1 Privacy + w2 Utility

Approach 2:
  Fix Privacy, find M with the optimal Utility.
  Fix Utility, find M with the optimal Privacy.
  Challenge: difficult to generate M with a fixed privacy or utility.

Proposed Approach: Multi-Objective Optimization

Optimization algorithm

Evolutionary Multi-Objective Optimization (EMOO)

The algorithm:
  Start with a set of initial RR matrices
  Repeat the following steps in each iteration:
    Mating: select two RR matrices from the pool
    Crossover: exchange several columns between the two RR matrices
    Mutation: change some values in an RR matrix
    Meet the privacy bound: filter the resultant matrices
    Evaluate the fitness value for the new RR matrices

Note: the fitness value is defined in terms of privacy and utility metrics
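The crossover and mutation steps might look roughly like the sketch below. It assumes each column of an RR matrix is a probability distribution over reported values (so columns sum to 1), which is why whole columns can be exchanged safely and why a mutated column is renormalized. The function names, the 50% swap chance, and the mutation scale are hypothetical choices; mating selection, the privacy-bound filter, and fitness evaluation are omitted.

```python
import random

def crossover(m_a, m_b, rng):
    """Exchange a random subset of columns between two RR matrices."""
    n = len(m_a)
    swap = [rng.random() < 0.5 for _ in range(n)]
    child_a = [[(m_b if swap[i] else m_a)[j][i] for i in range(n)] for j in range(n)]
    child_b = [[(m_a if swap[i] else m_b)[j][i] for i in range(n)] for j in range(n)]
    return child_a, child_b

def mutate(m, rng, scale=0.1):
    """Perturb one column, then renormalize it so it stays a distribution."""
    n = len(m)
    col = rng.randrange(n)
    new_col = [max(1e-9, m[j][col] + rng.uniform(-scale, scale)) for j in range(n)]
    total = sum(new_col)
    child = [row[:] for row in m]
    for j in range(n):
        child[j][col] = new_col[j] / total
    return child

def column_sums(m):
    return [sum(m[j][i] for j in range(len(m))) for i in range(len(m))]

rng = random.Random(42)
M1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
M2 = [[1 / 3] * 3 for _ in range(3)]
child_a, child_b = crossover(M1, M2, rng)
mutant = mutate(child_a, rng)
children = [child_a, child_b, mutant]
print([[round(s, 6) for s in column_sums(m)] for m in children])
```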

Illustration

Output of Optimization

[Figure: RR matrices M1 through M8 plotted in the objective space, with Privacy and Utility axes each running from Worse to Better]

The optimal set is often plotted in the objective space as a Pareto front.
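Extracting the non-dominated set from a pool of candidates is a short computation. In the sketch below, each matrix is summarized by a (privacy, utility) pair where higher is better on both axes; the five scores are invented for illustration and are not read off the slide's figure.

```python
def dominates(a, b):
    """a dominates b if it is at least as good on both objectives and strictly better on one."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])

def pareto_front(points):
    """Keep every candidate that no other candidate dominates."""
    return {
        name: p
        for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    }

# Hypothetical (privacy, utility) scores, higher is better on both.
scores = {
    "M1": (0.9, 0.2),
    "M2": (0.7, 0.5),
    "M3": (0.4, 0.8),
    "M4": (0.6, 0.4),  # dominated by M2
    "M5": (0.3, 0.3),  # dominated by M2
}
front = pareto_front(scores)
print(sorted(front))
```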

For the first attribute of the Adult data

Privacy Preserving Data Mining Techniques

Protecting sensitive raw data
  Randomization (additive noise)
  Geometric perturbation and projection (multiplicative noise)
  Randomized response technique

Protecting sensitive knowledge (knowledge hiding)
  Frequent itemset and association rule hiding
  Downgrading classifier effectiveness

Frequent Itemset Mining and Association Rule Mining

Frequent itemset mining: frequent set of items in a transaction data set

Association rules: associations between items

Frequent Itemset Mining and Association Rule Mining

First proposed by Agrawal, Imielinski, and Swami in SIGMOD 1993
  SIGMOD Test of Time Award 2003
  “This paper started a field of research. In addition to containing an innovative algorithm, its subject matter brought data mining to the attention of the database community … even led several years ago to an IBM commercial, featuring supermodels, that touted the importance of work such as that contained in this paper.”

Apriori algorithm in VLDB 1994
  #4 in the top 10 data mining algorithms in ICDM 2006

R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In SIGMOD ’93.

Apriori: Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In VLDB '94.

February 19, 2009


Basic Concepts: Frequent Patterns and Association Rules

Itemset: X = {x1, …, xk} (k-itemset)

Frequent itemset: X with minimum support count
  Support count (absolute support): count of transactions containing X

Association rule: A → B with minimum support and confidence
  Support: probability that a transaction contains A ∪ B (every item of A and every item of B)
    \( s = P(A \cup B) \)
  Confidence: conditional probability that a transaction having A also contains B
    \( c = P(B \mid A) \)

Association rule mining process
  Find all frequent patterns (more costly)
  Generate strong association rules

[Figure: Venn diagram of customers who buy diaper, customers who buy beer, and customers who buy both]

Transaction-id Items bought

10 A, B, D

20 A, C, D

30 A, D, E

40 B, E, F

50 B, C, D, E, F


Illustration of Frequent Itemsets and Association Rules

Transaction-id Items bought

10 A, B, D

20 A, C, D

30 A, D, E

40 B, E, F

50 B, C, D, E, F

Frequent itemsets (minimum support count = 3)?

{A:3, B:3, D:4, E:3, AD:3}

Association rules (minimum support = 50%, minimum confidence = 50%)?

A → D (60%, 100%)
D → A (60%, 75%)
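The answers for the five-transaction table can be reproduced with a brute-force count. The sketch below enumerates every candidate itemset rather than using Apriori, which is reasonable only for an example this small; the function names are illustrative.

```python
from itertools import combinations

transactions = {
    10: {"A", "B", "D"},
    20: {"A", "C", "D"},
    30: {"A", "D", "E"},
    40: {"B", "E", "F"},
    50: {"B", "C", "D", "E", "F"},
}

def support_count(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(itemset <= items for items in transactions.values())

def frequent_itemsets(min_count):
    """Brute-force enumeration of all itemsets meeting the minimum support count."""
    all_items = sorted(set().union(*transactions.values()))
    result = {}
    for k in range(1, len(all_items) + 1):
        for combo in combinations(all_items, k):
            count = support_count(set(combo))
            if count >= min_count:
                result["".join(combo)] = count
    return result

def rule_stats(antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    both = support_count(antecedent | consequent)
    return both / len(transactions), both / support_count(antecedent)

frequent = frequent_itemsets(3)
a_to_d = rule_stats({"A"}, {"D"})
d_to_a = rule_stats({"D"}, {"A"})
print(frequent)  # {'A': 3, 'B': 3, 'D': 4, 'E': 3, 'AD': 3}
print(a_to_d)    # (0.6, 1.0)
print(d_to_a)    # (0.6, 0.75)
```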

SIGMOD Ph.D. Workshop IDAR’07


Association Rule Hiding: what? why?

Problem: hide sensitive association rules in data without losing non-sensitive rules

Motivations: confidential rules may have serious adverse effects


Problem statement

Given
  a database D to be released
  minimum thresholds “MST” (minimum support threshold) and “MCT” (minimum confidence threshold)
  a set of association rules R mined from D
  a set of sensitive rules Rh ⊆ R to be hidden

Find a new database D’ such that
  the rules in Rh cannot be mined from D’
  as many of the rules in R-Rh as possible can still be mined


Solutions

Data modification approaches
  Basic idea: data sanitization D->D’
  Approaches: distortion, blocking
  Drawbacks: cannot control hiding effects intuitively, lots of I/O

Data reconstruction approaches
  Basic idea: knowledge sanitization D->K->D’
  Potential advantages: can easily control the availability of rules and control the hiding effects directly and intuitively

Distortion-based Techniques

Sample Database

A B C D
1 1 1 0
1 0 1 1
0 0 0 1
1 1 1 0
1 0 1 1

Rule A→C has: Support(A→C) = 80%, Confidence(A→C) = 100%

Distortion Algorithm

Distorted Database

A B C D
1 1 1 0
1 0 0 1
0 0 0 1
1 1 1 0
1 0 0 1

Rule A→C has now: Support(A→C) = 40%, Confidence(A→C) = 50%
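The support and confidence figures for A→C before and after distortion can be verified directly. The sketch below stores each row as a bit string over the columns A, B, C, D; the helper names are illustrative.

```python
COLUMNS = "ABCD"

sample = ["1110", "1011", "0001", "1110", "1011"]
distorted = ["1110", "1001", "0001", "1110", "1001"]  # C flipped to 0 in rows 2 and 5

def contains(row, items):
    """True if the row has a 1 in every listed column."""
    return all(row[COLUMNS.index(i)] == "1" for i in items)

def rule_stats(db, antecedent, consequent):
    """Return (support, confidence) of antecedent -> consequent."""
    n_both = sum(contains(r, antecedent + consequent) for r in db)
    n_ante = sum(contains(r, antecedent) for r in db)
    return n_both / len(db), n_both / n_ante

before = rule_stats(sample, "A", "C")
after = rule_stats(distorted, "A", "C")
print(before)  # (0.8, 1.0)
print(after)   # (0.4, 0.5)
```

Flipping two 1's to 0's is enough to push the rule below a support or confidence threshold set anywhere between the old and new values.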

Side Effects

Before hiding process: rule Ri had conf(Ri) > MCT
After hiding process: rule Ri now has conf(Ri) < MCT
Side effect: rule eliminated (undesirable side effect)

Before hiding process: rule Ri had conf(Ri) < MCT
After hiding process: rule Ri now has conf(Ri) > MCT
Side effect: ghost rule (undesirable side effect)

Before hiding process: large itemset I had sup(I) > MST
After hiding process: itemset I now has sup(I) < MST
Side effect: itemset eliminated (undesirable side effect)

Distortion-based Techniques

Challenges/Goals:

To minimize the undesirable Side Effects that the hiding process causes to non-sensitive rules.

To minimize the number of 1’s that must be deleted in the database.

Algorithms must be linear in time as the database increases in size.

Sensitive itemsets: ABC

Data distortion [Atallah 99]

Hardness result: the distortion problem is NP-hard

Heuristic search
  Find items to remove and transactions to remove the items from

Disclosure Limitation of Sensitive Rules, M. Atallah, A. Elmagarmid, M. Ibrahim, E. Bertino, V. Verykios, 1999

Heuristic Approach

A greedy bottom-up search through the ancestors (subsets) of the sensitive itemset for the parent with maximum support (why?)
  At the end of the search, a 1-itemset is selected

Search through the common transactions containing the item and the sensitive itemset for the transaction that affects the minimum number of 2-itemsets

Delete the selected item from the identified transaction

Results comparison

Blocking-based Techniques

Initial Database

A B C D
1 1 1 0
1 0 1 1
0 0 0 1
1 1 1 0
1 0 1 1

Blocking Algorithm

New Database

A B C D
1 1 1 0
1 0 ? 1
? 0 0 1
1 1 1 0
1 0 1 1

Support and confidence become marginal.
In the new database: 60% ≤ conf(A → C) ≤ 100%
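The confidence interval for the blocked database can be checked by trying every way of filling in the unknown cells. The sketch below enumerates all 0/1 assignments to the "?" entries, which is feasible only when few cells are blocked; the function names are illustrative.

```python
from itertools import product

COLUMNS = "ABCD"
blocked = ["1110", "10?1", "?001", "1110", "1011"]

def confidence(db, antecedent, consequent):
    """Confidence of antecedent -> consequent for single-item sides."""
    a, c = COLUMNS.index(antecedent), COLUMNS.index(consequent)
    n_ante = sum(r[a] == "1" for r in db)
    n_both = sum(r[a] == "1" and r[c] == "1" for r in db)
    return n_both / n_ante

def confidence_range(db, antecedent, consequent):
    """Fill the '?' cells with 0 or 1 in every possible way and track the extremes."""
    flat = "|".join(db)
    n_unknown = flat.count("?")
    values = []
    for fill in product("01", repeat=n_unknown):
        it = iter(fill)
        filled = "".join(next(it) if ch == "?" else ch for ch in flat).split("|")
        values.append(confidence(filled, antecedent, consequent))
    return min(values), max(values)

low, high = confidence_range(blocked, "A", "C")
print(low, high)  # 0.6 1.0
```

The lower bound arises when the unknown C is 0 and the unknown A is 1 (3 of 5 rows with A also have C); the upper bound arises in the opposite case (4 of 4).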


Data reconstruction approach

[Figure: pipeline from the original database D to the released database D’]

1. Frequent set mining: D → FS, R
2. Perform sanitization algorithm: FS, R → FS’, R-Rh
3. FP-tree-based inverse frequent set mining: FS’ → FP-tree → D’

2007-7-10 SIGMOD Ph.D. Workshop IDAR’07


The first two phases

1. Frequent set mining
  Generate all frequent itemsets with their supports and support counts FS from original database D

2. Perform sanitization algorithm
  Input: FS output in phase 1, R, Rh
  Output: sanitized frequent itemsets FS’
  Process
    Select hiding strategy
    Identify sensitive frequent sets
    Perform sanitization

In the best case, the sanitization algorithm ensures that exactly the non-sensitive rule set R-Rh can be derived from FS’.

[Figure: FS and R before sanitization; FS’ and R-Rh after sanitization]


Example: the first two phases

Original Database: D

TID  Items
T1   ABCE
T2   ABC
T3   ABCD
T4   ABD
T5   AD
T6   ACD

σ = 4, MST = 66%, MCT = 75%

1. Frequent set mining

Frequent Itemsets: FS

A: 6 (100%)
B: 4 (66%)
C: 4 (66%)
D: 4 (66%)
AB: 4 (66%)
AC: 4 (66%)
AD: 4 (66%)

Association Rules: R

rules    confidence  support
B → A    100%        66%
C → A    100%        66%
D → A    100%        66%

2. Perform sanitization algorithm

Frequent Itemsets: FS'

A: 6 (100%)
C: 4 (66%)
D: 4 (66%)
AC: 4 (66%)
AD: 4 (66%)

Association Rules: R-Rh

rules    confidence  support
C → A    100%        66%
D → A    100%        66%
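The two phases of this example can be replayed on the support counts alone. The sketch below derives the rules from FS, then mirrors the example's outcome by dropping every frequent set that contains B, which hides B → A. That filter is an assumed stand-in for the sanitization algorithm, chosen only because it reproduces FS' for this example; the function name is illustrative.

```python
FS = {"A": 6, "B": 4, "C": 4, "D": 4, "AB": 4, "AC": 4, "AD": 4}
N_TRANSACTIONS = 6
MCT = 0.75  # minimum confidence threshold

def rules(frequent_sets):
    """Rules x -> y (single items) derivable from the frequent 2-itemsets."""
    found = {}
    for itemset, count in frequent_sets.items():
        if len(itemset) != 2:
            continue
        for x, y in (itemset, itemset[::-1]):
            conf = count / frequent_sets[x]
            if conf >= MCT:
                found[f"{x}->{y}"] = (conf, count / N_TRANSACTIONS)
    return found

R = rules(FS)

# Assumed stand-in for sanitization: remove every frequent set involving B.
FS_sanitized = {k: v for k, v in FS.items() if "B" not in k}
R_remaining = rules(FS_sanitized)

print(sorted(R))            # ['B->A', 'C->A', 'D->A']
print(sorted(R_remaining))  # ['C->A', 'D->A']
```

A → B, A → C, and A → D never appear because their confidence is 4/6, below the 75% threshold.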

Open research questions

Optimal solution

Itemsets sanitization
  The support and confidence of the rules in R-Rh should remain unchanged as much as possible

Integrating data protection and knowledge (rule) protection

Coming up

Cryptographic protocols for privacy preserving distributed data mining
