
+ Data Mining with Differential Privacy

Chang Wei-Yuan, 2014/10/3 (Fri.) @ MakeLab Group Meeting

Arik Friedman and Assaf Schuster / KDD’10

+Outline

- Introduction
- Background
- Method
- Experiment
- Conclusion
- Thoughts

+ Introduction

- There is great value in data mining solutions:
  - reliable privacy guarantees
  - available accuracy
- Differential privacy: computations are insensitive to changes in any particular individual's record.

+ Introduction (cont.)

- Once an individual is certain that his or her data will remain private, being opted in or out of the database should make little difference.

+ Introduction (cont.)

- Example 1

  Name   Result
  Tom    0
  Jack   1
  Henry  1
  Diego  0
  Alice  ?

- f(i) = count(i), the number of records with Result = 1 among the first i records
- Alice => i = 5
- count(5) - count(4) reveals Alice's result
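The attack this example describes fits in a few lines. A minimal Python sketch (hypothetical names; Alice's hidden result is taken to be 1, as the later noisy-count slide with count(5) = 3 implies):

```python
# Differencing attack with exact (noise-free) counts: a hypothetical sketch.
results = {"Tom": 0, "Jack": 1, "Henry": 1, "Diego": 0, "Alice": 1}

def count(db, i):
    """Number of 1-results among the first i records (an exact count query)."""
    return sum(list(db.values())[:i])

# Two perfectly legitimate aggregate queries...
print(count(results, 4))                       # 2
print(count(results, 5))                       # 3
# ...whose difference exposes Alice's private result.
print(count(results, 5) - count(results, 4))   # 1
```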

+ Introduction (cont.)

- Example 2
- We can infer the target's attributes from the released information.

  Id  Sex  Job      Hometown  Hobby
  1   M    student  Hsinchu   sport
  2   M    teacher  Taipei    writing
  3   F    student  Hsinchu   singing
  4   F    student  Taipei    singing
  5   ?    ?        ?         ?

+ Introduction (cont.)

- Goal: count(5) - count(4) ≈ 0, i.e., adding or removing one record should barely change the answer
- Goal: "computations are insensitive to changes in any particular individual's record"

+Outline

- Introduction
- Background
- Method
- Experiment
- Conclusion
- Thoughts

+Differential Privacy

- Differential privacy

  (figure: probability vs. output for M run on neighboring datasets D and D')

  - M: a randomized computation
  - f: a query function
  - D, D': datasets whose symmetric difference is a single record

+Differential Privacy (cont.)

- Differential privacy

Definition (ε-differential privacy). A randomized computation M provides ε-differential privacy if for any datasets A and B with symmetric difference |AΔB| = 1, and any set of possible outcomes S ⊆ Range(M),

  Pr[M(A) ∈ S] ≤ e^ε · Pr[M(B) ∈ S]

+Laplace Mechanism

- Example of Laplace Mechanism

  Name   Result
  Tom    0
  Jack   1
  Henry  1
  Diego  0
  Alice  ?

- count(4) = 2 + noise(4)
- count(5) = 3 + noise(5)
- count(5) - count(4) is now noisy; for any output, the probabilities under the two datasets differ by at most a factor of e^ε

+Laplace Mechanism (cont.)

- Laplace Mechanism

Theorem (Laplace mechanism). Given a function f over an arbitrary domain D, the computation

  M(d) = f(d) + Laplace(Δf / ε)

provides ε-differential privacy, where Δf is the sensitivity of f.
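A minimal sketch of the mechanism applied to the count example above, assuming NumPy's Laplace sampler; the sensitivity of a count query is 1, since adding or removing one record changes the count by at most 1:

```python
import numpy as np

def laplace_mechanism(true_answer, epsilon, sensitivity=1.0):
    """Return f(d) + Laplace(sensitivity / epsilon) noise."""
    return true_answer + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

epsilon = 0.5
print(laplace_mechanism(2, epsilon))   # noisy count(4)
print(laplace_mechanism(3, epsilon))   # noisy count(5)
# The noisy answers overlap heavily, so their difference no longer pins down
# Alice's value: any output is at most e^epsilon more likely under one of the
# two neighboring datasets than under the other.
```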

+Exponential Mechanism

- Example of Exponential Mechanism (selection probability of each item)

  Item        q    ε=0    ε=0.1   ε=1
  Football    30   0.46   0.42    0.92
  Volleyball  25   0.38   0.33    0.07
  Basketball   8   0.12   0.14    1.5E-05
  Tennis       2   0.03   0.10    7.7E-07

+Exponential Mechanism (cont.)

- Exponential Mechanism

Theorem (Exponential Mechanism). Let q be a quality function that, given a database d, assigns a score q(d, r) to each possible outcome r. Then the mechanism M that outputs r with probability proportional to

  exp(ε · q(d, r) / (2Δq))

maintains ε-differential privacy, where Δq is the sensitivity of q.
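A short sketch that reproduces the ε = 0.1 and ε = 1 columns of the example table, assuming quality sensitivity Δq = 1 and selection weights exp(ε·q/2):

```python
import numpy as np

def exponential_mechanism_probs(qualities, epsilon, sensitivity=1.0):
    """Selection probability of each outcome, proportional to exp(eps * q / (2 * sensitivity))."""
    q = np.array(list(qualities.values()), dtype=float)
    weights = np.exp(epsilon * q / (2.0 * sensitivity))
    return dict(zip(qualities, weights / weights.sum()))

qualities = {"Football": 30, "Volleyball": 25, "Basketball": 8, "Tennis": 2}
for eps in (0.1, 1.0):
    print(eps, exponential_mechanism_probs(qualities, eps))
# eps = 1.0 gives Football a selection probability of about 0.92, as in the table.
# To draw an actual outcome, sample with np.random.choice over these probabilities.
```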

+PINQ Framework

- PINQ Framework
  - PINQ is a proposed architecture for data analysis with differential privacy
  - Another operator presented in PINQ is Partition, which relies on what was dubbed parallel composition
  - the privacy costs do not add up when queries are executed on disjoint datasets

+PINQ Framework (cont.)

  (figure: PINQ architecture)
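PINQ itself is a C#/LINQ library, so the following is only an illustrative Python sketch of the parallel-composition idea behind Partition: every record falls into exactly one partition, so answering one noisy count per partition costs ε once rather than ε times the number of partitions.

```python
import numpy as np

def noisy_count(n, epsilon):
    """Laplace mechanism on a count (sensitivity 1)."""
    return n + np.random.laplace(scale=1.0 / epsilon)

records = ["M", "M", "F", "F", "F"]
epsilon = 1.0

# Partition the data into disjoint groups (here: by value).
partitions = {}
for r in records:
    partitions.setdefault(r, []).append(r)

# One noisy count per disjoint partition; by parallel composition the whole
# step consumes epsilon, because no single record affects more than one answer.
noisy_sizes = {key: noisy_count(len(group), epsilon) for key, group in partitions.items()}
print(noisy_sizes)
```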

+Outline

- Introduction
- Background
- Method
- Experiment
- Conclusion
- Thoughts

+Method

- SuLQ-based ID3
- DiffP-ID3
- DiffP-C4.5

+SuLQ-based ID3

- Based on the SuLQ framework, using the Laplace Mechanism.
- It makes direct use of the NoisyCount primitive to evaluate the information gain criterion.
- The information gain evaluation has to be carried out for each attribute separately.
  - so the budget per query is small

+SuLQ-based ID3 (cont.)

- ID3 Classification
- Split point
  - max( Gain(Job), Gain(Home), Gain(Hobby) )

  Id  Sex  Job      Hometown  Hobby
  1   M    student  Hsinchu   sport
  2   M    teacher  Taipei    writing
  3   F    student  Hsinchu   singing
  4   F    student  Taipei    singing

+SuLQ-based ID3 (cont.)

- SuLQ-based ID3 Classification
- Split point
  - max( Gain(Job)+noise, Gain(Home)+noise, Gain(Hobby)+noise )

  Id  Sex  Job      Hometown  Hobby
  1   M    student  Hsinchu   sport
  2   M    teacher  Taipei    writing
  3   F    student  Hsinchu   singing
  4   F    student  Taipei    singing
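A hedged sketch of this idea (not the paper's exact algorithm): estimate each attribute's information gain from NoisyCount answers and split on the attribute with the largest noisy gain. Because each attribute needs its own query, the node's budget is divided across the attributes. Treating Sex as the class label and the remaining columns as split candidates is an assumption read off the example table.

```python
import math
import numpy as np

def noisy_count(n, epsilon):
    """NoisyCount primitive: Laplace mechanism on a count (sensitivity 1), clamped at 0."""
    return max(n + np.random.laplace(scale=1.0 / epsilon), 0.0)

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0) if total > 0 else 0.0

def noisy_info_gain(records, attribute, class_attr, epsilon):
    """Information gain of `attribute`, estimated from noisy class counts."""
    classes = sorted({r[class_attr] for r in records})
    base = [noisy_count(sum(r[class_attr] == c for r in records), epsilon) for c in classes]
    gain, total = entropy(base), sum(base)
    for value in {r[attribute] for r in records}:
        sub = [noisy_count(sum(r[attribute] == value and r[class_attr] == c for r in records),
                           epsilon) for c in classes]
        if total > 0:
            gain -= (sum(sub) / total) * entropy(sub)
    return gain

records = [
    {"Job": "student", "Hometown": "Hsinchu", "Hobby": "sport",   "Sex": "M"},
    {"Job": "teacher", "Hometown": "Taipei",  "Hobby": "writing", "Sex": "M"},
    {"Job": "student", "Hometown": "Hsinchu", "Hobby": "singing", "Sex": "F"},
    {"Job": "student", "Hometown": "Taipei",  "Hobby": "singing", "Sex": "F"},
]
attributes = ["Job", "Hometown", "Hobby"]
per_query_budget = 1.0 / len(attributes)   # each attribute costs a separate query
gains = {a: noisy_info_gain(records, a, "Sex", per_query_budget) for a in attributes}
print(max(gains, key=gains.get), gains)
```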

+DiffP-ID3

- Based on the PINQ framework, using the exponential mechanism.
- It evaluates all attributes simultaneously in one query, the outcome of which is the attribute to use for splitting.
  - the quality function q provided to the exponential mechanism scores each attribute

+DiffP-ID3 (cont.)

- DiffP-ID3 Classification
- Split point
  - max( Gain(M(Job)), Gain(M(Home)), Gain(M(Hobby)) )
  - PINQ Partition

  Id  Sex  Job      Hometown  Hobby
  1   M    student  Hsinchu   sport
  2   M    teacher  Taipei    writing
  3   F    student  Hsinchu   singing
  4   F    student  Taipei    singing

+DiffP-ID3 (cont.)

- Which quality function should be fed into the exponential mechanism? It depends on:
  - the depth constraint
  - the sensitivity of the splitting criterion
- Information gain is the most sensitive to noise, and the Max operator is the least sensitive (see the sketch below).
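A hedged sketch of DiffP-ID3's single split query using the Max criterion as the quality function (for each attribute value, count the records in the majority class and sum these counts; changing one record shifts the score by at most 1, so its sensitivity is 1). The records and the choice of Sex as the class label are the same illustrative assumptions as above.

```python
import numpy as np

def max_quality(records, attribute, class_attr):
    """Max criterion: number of records classified correctly if each branch
    predicted its majority class. Changing one record shifts this by at most 1."""
    score = 0
    for value in {r[attribute] for r in records}:
        subset = [r[class_attr] for r in records if r[attribute] == value]
        score += max(subset.count(c) for c in set(subset))
    return score

def choose_split(records, attributes, class_attr, epsilon, sensitivity=1.0):
    """One exponential-mechanism query scoring all candidate attributes at once."""
    q = np.array([max_quality(records, a, class_attr) for a in attributes], dtype=float)
    weights = np.exp(epsilon * q / (2.0 * sensitivity))
    return np.random.choice(attributes, p=weights / weights.sum())

records = [
    {"Job": "student", "Hometown": "Hsinchu", "Hobby": "sport",   "Sex": "M"},
    {"Job": "teacher", "Hometown": "Taipei",  "Hobby": "writing", "Sex": "M"},
    {"Job": "student", "Hometown": "Hsinchu", "Hobby": "singing", "Sex": "F"},
    {"Job": "student", "Hometown": "Taipei",  "Hobby": "singing", "Sex": "F"},
]
print(choose_split(records, ["Job", "Hometown", "Hobby"], "Sex", epsilon=1.0))
```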

+DiffP-C4.5

- One important extension is the ability to handle continuous attributes.
  - First, the domain is divided into ranges within which the splitting score is constant. Each range is considered a discrete option.
  - Then, a point from the chosen range is sampled with uniform distribution and returned as the output of the exponential mechanism.
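A hedged sketch of that procedure for a hypothetical numeric attribute (an age over the [0, 100] domain used in the experiments): the gaps between consecutive observed values are the constant-score ranges, the exponential mechanism picks one range, and the returned threshold is drawn uniformly from it. Weighting each range by its length, so the mechanism behaves like a density over the whole domain, is an extra assumption of this sketch.

```python
import numpy as np

def choose_threshold(values, quality, epsilon, domain=(0.0, 100.0), sensitivity=1.0):
    """Differentially private split threshold for a continuous attribute."""
    # Between two consecutive observed values the splitting score is constant,
    # so each gap is one discrete option for the exponential mechanism.
    points = sorted({domain[0], *values, domain[1]})
    ranges = list(zip(points[:-1], points[1:]))
    scores = np.array([quality((lo + hi) / 2.0) for lo, hi in ranges], dtype=float)
    lengths = np.array([hi - lo for lo, hi in ranges], dtype=float)
    weights = lengths * np.exp(epsilon * scores / (2.0 * sensitivity))
    lo, hi = ranges[np.random.choice(len(ranges), p=weights / weights.sum())]
    return np.random.uniform(lo, hi)   # uniform point from the chosen range

# Hypothetical numeric attribute and binary class labels, for illustration only.
ages = [23, 31, 46, 52]
labels = ["M", "M", "F", "F"]

def quality(t):
    """Max criterion of the split induced by threshold t."""
    left = [l for a, l in zip(ages, labels) if a <= t]
    right = [l for a, l in zip(ages, labels) if a > t]
    return sum(max(side.count(c) for c in set(side)) for side in (left, right) if side)

print(choose_threshold(ages, quality, epsilon=1.0))
```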

+Outline

- Introduction
- Background
- Method
- Experiment
- Conclusion
- Thoughts

+Experiment

- The experiment defines a domain with ten nominal attributes and a class attribute, following another paper.
- Noise is introduced into the training samples by reassigning attribute and class values: each value is replaced with a given noise probability.
- For testing, a noiseless test set with 10,000 records is generated in the same way.

+

- the average accuracy is higher as more training samples are available
- the influence of the noise weakens as the number of samples grows when the Gini and Max criteria are used

+

- three of the ten attributes were replaced with numeric attributes over the domain [0, 100]
- Figure 4 presents the results of a similar experiment

+

- for smaller training sets, ID3 allows for better accuracy
- for larger training sets, C4.5 is better than ID3

+

- the accuracy results presented in Figure 6 were around 5% lower, and in some cases even lower, than the results presented in Figure 7
- when the size of the dataset is small, algorithms that make efficient use of the privacy budget are superior

+Outline

- Introduction
- Background
- Method
- Experiment
- Conclusion
- Thoughts

+Conclusion

- When the number of training samples is relatively small, or the privacy constraints set by the data provider are very limiting, the sensitivity of the calculations becomes crucial.

+Future work

- One solution might be to consider other stopping rules when selecting nodes, trading possible improvements in accuracy for increased stability.
- In addition, it may be fruitful to consider different tactics for budget distribution.

+Outline

- Introduction
- Background
- Method
- Experiment
- Conclusion
- Thoughts

+Thoughts

+Thanks for listening.

2014/10/3 (Fri.) @ MakeLab Group Meeting
v123582@gmail.com
