
Page 1: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity

Privacy preserving data mining

Li Xiong

CS573 Data Privacy and Anonymity

Page 2: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity


What Is Data Mining?

Data mining (knowledge discovery from data): extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data

Also known as: knowledge discovery in databases (KDD), knowledge extraction, data/pattern analysis, information harvesting, business intelligence

Page 3: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity

Privacy preserving data mining

Support data mining while preserving privacy; both the sensitive raw data and the sensitive mining results may need protection

Page 4: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity


Seminal work

Privacy preserving data mining, Agrawal and Srikant, 2000: centralized data, data randomization (additive noise), decision tree classifier

Privacy preserving data mining, Lindell and Pinkas, 2000: distributed data mining, secure multi-party computation, decision tree classifier

Page 5: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity


Input Perturbation

The database holds entries x1, …, xn

Reveal the entire database, but randomize the entries: the user sees x1+δ1, …, xn+δn

Add random noise δi to each database entry xi

For example, if the distribution of the noise has mean 0, the user can still compute the average of the xi
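To make this concrete, here is a minimal sketch (illustrative only; the attribute, noise range, and sample size are made-up assumptions, not taken from any of the cited papers): adding independent zero-mean noise to each entry distorts individual values while aggregate statistics such as the mean remain approximately computable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical original attribute values (e.g., ages) held in the database.
x = rng.integers(20, 80, size=10_000).astype(float)

# Input perturbation: add iid zero-mean noise to every entry before release.
noise = rng.uniform(-25, 25, size=x.shape)   # mean-0 noise; the scale is arbitrary here
w = x + noise                                # this is what the user/miner sees

print(f"true mean      : {x.mean():.2f}")
print(f"perturbed mean : {w.mean():.2f}")    # close to the true mean
print(f"first record   : true={x[0]:.0f}, released={w[0]:.0f}")  # individual value is distorted
```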

Page 6: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity


Taxonomy of PPDM algorithms

Data distribution
Centralized
Distributed – privacy preserving distributed data mining

Approaches
Input perturbation – additive noise (randomization), multiplicative noise, generalization, swapping, sampling
Output perturbation – rule hiding
Crypto techniques – secure multiparty computation

Data mining algorithms
Classification
Association rule mining
Clustering

Page 7: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity

Randomization techniques

Privacy preserving data mining, Agrawal and Srikant, 2000 Seminal work on decision tree classifier

Limiting Privacy Breaches in Privacy-Preserving Data Mining, Evfimievski and Gehrke, 2003 Refined privacy definition Association rule mining

Page 8: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity

Randomization Based Decision Tree Learning (Agrawal and Srikant ’00)

Basic idea: perturb data with value distortion
The user provides xi + r instead of xi, where r is a random value
Uniform: uniform distribution over [-α, +α]
Gaussian: normal distribution with mean μ = 0 and standard deviation σ

Hypothesis
The miner doesn't see the real data and can't reconstruct the real values
The miner can still reconstruct enough information to build a decision tree for classification

Page 9: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity

Randomization Approach

[Figure: Randomization approach. Original records (e.g., 50 | 40K | ..., 30 | 70K | ...) pass through randomizers; a random number is added to Age, so Alice's age 30 becomes 65 (30 + 35). The randomized records (65 | 20K | ..., 25 | 60K | ...) are fed to a classification algorithm, which outputs a model.]

Page 10: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity


Classification

Classification predicts categorical class labels (discrete or nominal)

Prediction (regression) models continuous-valued functions, i.e., predicts unknown or missing values

Typical applications: credit approval, target marketing, medical diagnosis, fraud detection

Page 11: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity


Motivating Example for Classification – Fruit Identification

Skin   | Color | Size  | Flesh | Conclusion
Hairy  | Brown | Large | Hard  | Safe
Hairy  | Green | Large | Hard  | Safe
Hairy  | Green | Large | Soft  | Safe
Smooth | Red   | Large | Soft  | Dangerous
Smooth | Red   | Small | Hard  | Dangerous

Page 12: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity


Another Example – Credit Approval

Classification rule: If age = “31...40” and income = high then credit_rating = excellent

Future customers: Paul (age = 35, income = high) is predicted an excellent credit rating; John (age = 20, income = medium) is predicted a fair credit rating

Name Age Income … Credit

Clark 35 High … Excellent

Milton 38 High … Excellent

Neo 25 Medium … Fair

… … … … …

Page 13: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity


Classification—A Two-Step Process

Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
The set of tuples used for model construction is the training set
The model is represented as classification rules, decision trees, or mathematical formulae

Model usage: for classifying future or unknown objects

Page 14: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity


Training Dataset

age   | income | student | credit_rating | buys_computer
<=30  | high   | no      | fair          | no
<=30  | high   | no      | excellent     | no
31…40 | high   | no      | fair          | yes
>40   | medium | no      | fair          | yes
>40   | low    | yes     | fair          | yes
>40   | low    | yes     | excellent     | no
31…40 | low    | yes     | excellent     | yes
<=30  | medium | no      | fair          | no
<=30  | low    | yes     | fair          | yes
>40   | medium | yes     | fair          | yes
<=30  | medium | yes     | excellent     | yes
31…40 | medium | no      | excellent     | yes
31…40 | high   | yes     | fair          | yes
>40   | medium | no      | excellent     | no

Page 15: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity


Output: A Decision Tree for “buys_computer”

age?
  <=30: student?
    no: no
    yes: yes
  31…40: yes
  >40: credit_rating?
    excellent: no
    fair: yes

Page 16: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity


Algorithm for Decision Tree Induction

Common algorithms: ID3 (Iterative Dichotomiser), C4.5, CART (Classification and Regression Trees)

Basic algorithm (a greedy algorithm): the tree is constructed in a top-down, recursive, divide-and-conquer manner
At the start, all the training examples are at the root
A test attribute is selected that "best" separates the data into partitions (heuristic or statistical measure)
Samples are partitioned recursively based on the selected attributes

Conditions for stopping partitioning
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
There are no samples left
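The greedy loop above can be summarized in a short sketch (illustrative code, not ID3/C4.5/CART themselves; `build_tree` and the attribute-selection parameter `select_best` are hypothetical helpers introduced only for this example):

```python
from collections import Counter

def build_tree(rows, attributes, target, select_best):
    """rows: list of dicts; select_best(rows, attributes, target) -> attribute name."""
    labels = [r[target] for r in rows]
    # Stop: all samples in this node belong to the same class.
    if len(set(labels)) == 1:
        return labels[0]
    # Stop: no remaining attributes -> classify the leaf by majority vote.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: choose the attribute that "best" separates the data.
    best = select_best(rows, attributes, target)
    node = {best: {}}
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        remaining = [a for a in attributes if a != best]
        node[best][value] = build_tree(subset, remaining, target, select_best)
    return node
```

Any of the measures on the next slides (information gain, gain ratio, Gini index) can play the role of `select_best`.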

Page 17: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity


Attribute Selection Measures

Idea: select the attribute that partitions the samples into the most homogeneous groups

Measures
Information gain (ID3)
Gain ratio (C4.5)
Gini index (CART)

Page 18: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity


Attribute Selection Measure: Information Gain (ID3)

Select the attribute with the highest information gain

Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D| / |D|

Expected information (entropy) needed to classify a tuple in D:

Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)

Information needed (after using A to split D into v partitions) to classify D:

Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)

Information gain – the difference between the original information requirement and the new information requirement obtained by branching on attribute A:

Gain(A) = Info(D) - Info_A(D)
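A small sketch of these formulas in code (only the definitions above are used; representing records as dicts is an assumption made for illustration):

```python
import math
from collections import Counter

def info(labels):
    """Expected information (entropy): Info(D) = -sum p_i * log2(p_i)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_after_split(rows, attr, target):
    """Info_A(D) = sum_j |D_j|/|D| * Info(D_j)."""
    n = len(rows)
    total = 0.0
    for value in {r[attr] for r in rows}:
        part = [r[target] for r in rows if r[attr] == value]
        total += len(part) / n * info(part)
    return total

def gain(rows, attr, target):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info([r[target] for r in rows]) - info_after_split(rows, attr, target)
```

Applied to the buys_computer training data (encoded as a list of dicts), gain(rows, 'age', 'buys_computer') comes out to about 0.246, matching the worked example on the final slide.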

Page 19: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity


Attribute Selection Measure: Gini index (CART)

If a data set D contains examples from n classes, the gini index gini(D) is defined as

gini(D) = 1 - \sum_{j=1}^{n} p_j^2

where pj is the relative frequency of class j in D

If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as

gini_A(D) = \frac{|D_1|}{|D|} gini(D_1) + \frac{|D_2|}{|D|} gini(D_2)

Reduction in impurity:

\Delta gini(A) = gini(D) - gini_A(D)

The attribute that provides the smallest gini_A(D) (or the largest reduction in impurity) is chosen to split the node
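A corresponding sketch for the Gini measure (again a minimal illustration of the definitions above; specifying the binary split by a hypothetical set `left_values` of attribute values routed to the left subset is an assumption for this example):

```python
from collections import Counter

def gini(labels):
    """gini(D) = 1 - sum_j p_j^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(rows, attr, left_values, target):
    """gini_A(D) for the binary split {attr in left_values} vs. the rest."""
    d1 = [r[target] for r in rows if r[attr] in left_values]
    d2 = [r[target] for r in rows if r[attr] not in left_values]
    n = len(rows)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

def gini_reduction(rows, attr, left_values, target):
    """Delta gini(A) = gini(D) - gini_A(D)."""
    return gini([r[target] for r in rows]) - gini_split(rows, attr, left_values, target)
```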

Page 20: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity


Information Gain for Continuous-Valued Attributes

Let attribute A be a continuous-valued attribute

Must determine the best split point for A

Sort the values of A in increasing order

Typically, the midpoint between each pair of adjacent values is considered as a possible split point

(ai+ai+1)/2 is the midpoint between the values of ai and ai+1

The point with the minimum expected information requirement for A is selected as the split-point for A

Split: D1 is the set of tuples in D satisfying A ≤ split-point, and

D2 is the set of tuples in D satisfying A > split-point
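A sketch of this split-point search (illustrative only; `best_split_point` is a hypothetical helper, and the entropy measure is reused, though any impurity measure could be plugged in):

```python
import math
from collections import Counter

def info(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values()) if n else 0.0

def best_split_point(values, labels):
    """Try the midpoint of each pair of adjacent sorted values; return the one
    minimizing the expected information requirement."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (float("inf"), None)
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue
        split = (pairs[i][0] + pairs[i + 1][0]) / 2           # (a_i + a_{i+1}) / 2
        d1 = [lab for v, lab in pairs if v <= split]          # A <= split-point
        d2 = [lab for v, lab in pairs if v > split]           # A >  split-point
        expected = len(d1) / n * info(d1) + len(d2) / n * info(d2)
        best = min(best, (expected, split))
    return best[1]
```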

Page 21: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity

Randomization Approach

[Figure: Randomization approach, repeated from Page 9. A random number is added to Age (Alice's age 30 becomes 65), and the randomized records are fed to the classification algorithm to build a model.]

Page 22: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity

(This page repeats the Gini index slide from Page 19.)

Page 23: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity

Randomization Approach Overview

[Figure: Randomization approach overview. Original records (50 | 40K | ..., 30 | 70K | ...) pass through randomizers (a random number is added to Age, so Alice's age 30 becomes 65). From the randomized records (65 | 20K | ..., 25 | 60K | ...), the distribution of Age and the distribution of Salary are reconstructed, and the classification algorithm builds the model from the reconstructed data.]

Page 24: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity

Original Distribution Reconstruction

x1, x2, …, xn are the n original data values, drawn from n iid random variables X1, X2, …, Xn, each distributed like X

Using value distortion, the given values are w1 = x1 + y1, w2 = x2 + y2, …, wn = xn + yn, where the yi come from n iid random variables Y1, Y2, …, Yn, each distributed like Y

Reconstruction problem: given FY and the wi, estimate FX

Page 25: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity

Original Distribution Reconstruction: Method

Bayes' theorem for continuous distributions gives the estimated (posterior) density function:

f_X'(a) = \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y(w_i - a)\, f_X(a)}{\int_{-\infty}^{\infty} f_Y(w_i - z)\, f_X(z)\, dz}

Iterative estimation
The initial estimate for fX at j = 0 is the uniform distribution
Iterative update:

f_X^{j+1}(a) = \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y(w_i - a)\, f_X^{j}(a)}{\int_{-\infty}^{\infty} f_Y(w_i - z)\, f_X^{j}(z)\, dz}

Stopping criterion: χ² test between successive iterations
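A discretized sketch of this iteration (a histogram support stands in for continuous densities; the support grid, noise model, and sample sizes below are illustrative assumptions, and a fixed iteration count replaces the χ² stopping test):

```python
import numpy as np

def reconstruct_distribution(w, support, noise_pdf, iters=50):
    """Iteratively estimate f_X on a discrete support from perturbed values w = x + y.
    noise_pdf(t) is the known density of the additive noise Y."""
    fx = np.full(len(support), 1.0 / len(support))       # start from the uniform distribution
    # Precompute f_Y(w_i - a) for every observation i and support point a.
    fy = noise_pdf(w[:, None] - support[None, :])         # shape (n, |support|)
    for _ in range(iters):
        denom = fy @ fx + 1e-12                            # discretized integral of f_Y(w_i - z) f_X(z)
        posterior = fy * fx[None, :] / denom[:, None]      # Bayes: P(X = a | W = w_i)
        fx = posterior.mean(axis=0)                        # average the posteriors over all observations
        fx /= fx.sum()
    return fx

# Illustrative use: ages in [0, 100], uniform noise on [-25, 25].
rng = np.random.default_rng(1)
x = rng.normal(45, 10, size=5000)
w = x + rng.uniform(-25, 25, size=x.shape)
support = np.arange(0, 101, 1.0)
uniform_noise = lambda t: ((t >= -25) & (t <= 25)) / 50.0
fx_hat = reconstruct_distribution(w, support, uniform_noise, iters=30)
```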

Page 26: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity

Reconstruction of Distribution

[Figure: Reconstruction of distribution. Number of people vs. Age (y-axis roughly 0–1200, x-axis around 20–60), with curves for the Original, Randomized, and Reconstructed distributions.]

Page 27: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity

Original Distribution Reconstruction

Page 28: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity

Original Distribution Reconstruction for Decision Tree

When to reconstruct distributions?

Global
Reconstruct the distribution for each attribute once at the beginning
Build the decision tree using the reconstructed data

ByClass
First split the training data by class
Reconstruct for each class separately
Build the decision tree using the reconstructed data

Local
First split the training data by class and reconstruct for each class separately, as in ByClass
Reconstruct again at each node while building the tree

Page 29: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity

Accuracy vs. Randomization Level

[Figure: Accuracy vs. randomization level (Function 3). Accuracy (roughly 40–100%) plotted against randomization level (10–200), with curves for Original, Randomized, and ByClass.]

Page 30: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity

More Results

Global performs worse than ByClass and Local
ByClass and Local have accuracy within 5% to 15% (absolute error) of the Original accuracy
Overall, all are much better than the Randomized accuracy

Page 31: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity

Privacy level

Is the privacy level sufficiently measured?

Page 32: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity


How to Measure Privacy Breach

Weak: no single database entry has been revealed

Stronger: no single piece of information is revealed (what’s the difference from the “weak” version?)

Strongest: the adversary’s beliefs about the data have not changed

Page 33: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity


Kullback-Leibler Distance

Measures the "difference" between two probability distributions: KL(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}

Page 34: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity


Privacy of Input Perturbation

X is a random variable, R is the randomization operator, Y=R(X) is the perturbed database

Measure the mutual information between the original and randomized databases: the average KL distance between (1) the distribution of X and (2) the distribution of X conditioned on Y = y

E_y[ KL(P_{X|Y=y} \| P_X) ]

Intuition: if this distance is small, then Y leaks little information about the actual values of X

Why is this definition problematic?
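A toy sketch of this measure for a discrete domain (the prior and the randomization operator below are made up for illustration): it computes E_y[KL(P_{X|Y=y} ‖ P_X)], which equals the mutual information I(X; Y). The next slides show why a small value of this average is still not a sufficient privacy guarantee.

```python
import numpy as np

# Toy domain: X takes values 0..4 with a non-uniform prior (illustrative only).
px = np.array([0.4, 0.3, 0.15, 0.1, 0.05])

# Randomization operator as a transition matrix: R[x, y] = p(Y = y | X = x).
# Here: keep the value with prob 0.5, otherwise output a uniformly random value.
n = len(px)
R = 0.5 * np.eye(n) + 0.5 * np.full((n, n), 1.0 / n)

py = px @ R                                   # marginal distribution of Y
px_given_y = (px[:, None] * R) / py[None, :]  # column y holds P(X | Y = y)

# Average KL distance between P_{X|Y=y} and P_X, weighted by P(Y = y).
kl_per_y = np.sum(px_given_y * np.log2(px_given_y / px[:, None]), axis=0)
avg_kl = float(py @ kl_per_y)                 # equals the mutual information I(X; Y)
print(f"E_y[ KL(P_X|Y=y || P_X) ] = {avg_kl:.4f} bits")
```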

Page 35: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity


Is the randomization sufficient?

Name–Age database (original): Gladys: 85, Doris: 90, Beryl: 82
Randomized database: Gladys: 72, Doris: 110, Beryl: 85

Age is an integer between 0 and 90
Randomize database entries by adding random integers between -20 and 20
The randomization operator has to be public (why?)

Since the released value 110 can only arise from an age of at least 90, Doris's age is 90!!

Page 36: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity


Privacy Definitions

Mutual information can be small on average, but an individual randomized value can still leak a lot of information about the original value

Better: consider some property Q(x)
The adversary has an a priori probability Pi that Q(xi) is true
Privacy breach if revealing yi = R(xi) significantly changes the adversary's probability that Q(xi) is true
Intuition: the adversary learned something about entry xi (namely, the likelihood of property Q holding for this entry)

Page 37: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity


Example

Data: x ∈ {0, 1, ..., 1000}, p(x=0) = 0.01, p(x=k) = 0.00099 for k ≠ 0
Reveal y = R(x)
Three possible randomization operators R:

R1(x) = x with prob. 20%; a uniformly random number with prob. 80%

R2(x) = x + Δ mod 1001, Δ uniform in [-100, 100]

R3(x) = R2(x) with prob. 50%; a uniformly random number with prob. 50%

Which randomization operator is better?

Page 38: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity


Some Properties

Q1(x): x = 0;  Q2(x): x ∉ {200, ..., 800}

What are the a priori probabilities for a given x that these properties hold?
Q1(x): 1%, Q2(x): 40.5%

Now suppose the adversary learned that y = R(x) = 0. What are the probabilities of Q1(x) and Q2(x)?
If R = R1 then Q1(x): 71.6%, Q2(x): 83%
If R = R2 then Q1(x): 4.8%, Q2(x): 100%
If R = R3 then Q1(x): 2.9%, Q2(x): 70.8%
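These posteriors can be checked directly with Bayes' rule; a short sketch over the domain {0, ..., 1000} (illustrative code, not from the paper; exact values may differ from the slide by a few tenths of a percent due to rounding):

```python
# Domain {0, ..., 1000}: p(x=0) = 0.01, p(x=k) = 0.00099 for k != 0.
N = 1001
prior = [0.00099] * N
prior[0] = 0.01

def p_y_given_x(op, x, y):
    """Transition probability p(x -> y) for the three randomization operators."""
    if op == "R1":                                   # x with prob 20%, uniform with prob 80%
        return 0.2 * (x == y) + 0.8 / N
    if op == "R2":                                   # x + delta mod 1001, delta uniform in [-100, 100]
        return (1 / 201) if ((y - x) % N <= 100 or (y - x) % N >= N - 100) else 0.0
    if op == "R3":                                   # R2 with prob 50%, uniform with prob 50%
        return 0.5 * p_y_given_x("R2", x, y) + 0.5 / N
    raise ValueError(op)

def posterior(op, prop, y=0):
    """P(prop(x) | R(x) = y) by Bayes' rule."""
    num = sum(prior[x] * p_y_given_x(op, x, y) for x in range(N) if prop(x))
    den = sum(prior[x] * p_y_given_x(op, x, y) for x in range(N))
    return num / den

Q1 = lambda x: x == 0
Q2 = lambda x: not (200 <= x <= 800)

for op in ("R1", "R2", "R3"):
    print(op, f"Q1: {posterior(op, Q1):.1%}", f"Q2: {posterior(op, Q2):.1%}")
# Roughly matching the slide: R1 -> ~72% / 83%, R2 -> ~4.8% / 100%, R3 -> ~2.9% / 71%
```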

Page 39: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity


Privacy Breaches

R1(x) leaks information about property Q1(x)
Before seeing R1(x), the adversary thinks that the probability of x = 0 is only 1%, but after noticing that R1(x) = 0, the probability that x = 0 is 72%

R2(x) leaks information about property Q2(x)
Before seeing R2(x), the adversary thinks that the probability of x ∉ {200, ..., 800} is 41%, but after noticing that R2(x) = 0, the probability that x ∉ {200, ..., 800} is 100%

Randomization operator should be such that posterior distribution is close to the prior distribution for any property

Page 40: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity


Privacy Breach: Definitions

Q(x) is some property; ρ1, ρ2 are probabilities, ρ1 = "very unlikely", ρ2 = "very likely"

Straight privacy breach:
P(Q(x)) ≤ ρ1, but P(Q(x) | R(x)=y) ≥ ρ2
Q(x) is unlikely a priori, but likely after seeing the randomized value of x

Inverse privacy breach:
P(Q(x)) ≥ ρ2, but P(Q(x) | R(x)=y) ≤ ρ1
Q(x) is likely a priori, but unlikely after seeing the randomized value of x

[Evfimievski et al.]

Page 41: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity


How to check privacy breach

How to ensure that the randomization operator hides every property? There are 2^|X| properties
Often the randomization operator has to be selected even before the distribution PX is known (why?)

Idea: look at operator’s transition probabilities How likely is xi to be mapped to a given y?

Intuition: if all possible values of xi are equally likely to be randomized to a given y, then revealing y=R(xi) will not reveal much about actual value of xi

Page 42: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity


Amplification

The randomization operator is γ-amplifying for y if

\forall x_1, x_2 \in V_X:\ \frac{p(x_1 \to y)}{p(x_2 \to y)} \le \gamma

For given ρ1, ρ2, no straight or inverse privacy breaches occur if

\frac{\rho_2 (1 - \rho_1)}{\rho_1 (1 - \rho_2)} \ge \gamma

[Evfimievski et al.]

Page 43: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity


Amplification: Example

R1(x) = x with prob. 20%; a uniformly random number with prob. 80%

R2(x) = x + Δ mod 1001, Δ uniform in [-100, 100]

R3(x) = R2(x) with prob. 50%; a uniformly random number with prob. 50%

For R3:
p(x → y) = ½ (1/201 + 1/1001) if y ∈ [x-100, x+100] (mod 1001)
p(x → y) = ½ (0 + 1/1001) otherwise

Fractional difference = 1 + 1001/201 < 6 (= γ)
Therefore, no straight or inverse privacy breaches will occur with ρ1 = 14%, ρ2 = 50%
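A quick numeric check of these figures (a sketch that simply re-evaluates the fractions above and the breach condition as reconstructed on the previous slide):

```python
# R3's transition probabilities: maximal for y within [x-100, x+100], minimal elsewhere.
p_max = 0.5 * (1 / 201 + 1 / 1001)
p_min = 0.5 * (0 + 1 / 1001)
gamma = p_max / p_min
print(f"amplification factor = {gamma:.3f}")        # 1 + 1001/201 ~= 5.98 < 6

# Breach bound for rho1 = 14%, rho2 = 50%:
rho1, rho2 = 0.14, 0.50
bound = rho2 * (1 - rho1) / (rho1 * (1 - rho2))
print(f"rho2(1-rho1)/(rho1(1-rho2)) = {bound:.3f}")  # ~6.14 >= gamma, so no breaches
```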

Page 44: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity

Coming up

Multiplicative noise Output perturbation

Page 45: Privacy preserving data mining Li Xiong CS573 Data Privacy and Anonymity


Example: Information Gain

Class P: buys_computer = "yes"; Class N: buys_computer = "no"

(Training data as in the table on Page 14.)

Info(D) = I(9, 5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940

age   | pi | ni | I(pi, ni)
<=30  | 2  | 3  | 0.971
31…40 | 4  | 0  | 0
>40   | 3  | 2  | 0.971

Info_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694

Gain(age) = Info(D) - Info_{age}(D) = 0.246

Similarly:
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048