TRANSCRIPT
Privacy-preserving Data Mining: current research and trends
Stan Matwin
School of Information Technology and Engineering
University of Ottawa, [email protected]
2
A few words about our research
• The University of Ottawa (Université d'Ottawa) is the largest bilingual university in Canada
• Applied research and tech transfer
– Profiles of digital game players / active learning, co-training, instance selection
– Classification of medical abstracts
• Privacy-aware Data Mining
• Text Analysis and Machine Learning group
• Learning with prior knowledge
3
Plan
• What is [data] privacy?
• Privacy and Data Mining
• Privacy-preserving Data Mining: main approaches
– Anonymization
– Obfuscation
– Cryptographic hiding
• Challenges
– Definition of privacy
– Mining mobile data
– Mining medical data
4
Privacy
• Freedom to be left alone
• …the ability to control who knows what about us [Westin; Moor] [= database views]
• Jeopardized with the Internet – "greased" data
• A moral obligation for the community to find solutions
5
Privacy and data mining
• Individual privacy
– Nobody should know more about any entity after the data mining than they did before
– Approaches: data obfuscation, value swapping
• Organization privacy
– Protect knowledge about a collection of entities
– Individual entity values are not disclosed to all parties
• The results alone need not violate privacy
7
Why naïve anonymization does not work
• The [Sweeney 01] experiment: purchased the voter registration list for Cambridge, MA
– 54,805 people
• 69% were unique on postal code and birth date; 87% US-wide on all three of postal code, birth date, and gender
• Also, we do not know what other data sources we may join with in the future
• Solution: k-anonymization
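A table is k-anonymous when every combination of quasi-identifier values occurs at least k times. A minimal check can be sketched as follows; the column names and toy records are invented for illustration:

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Size of the smallest equivalence class over the quasi-identifier columns."""
    groups = Counter(tuple(r[c] for c in quasi_identifiers) for r in records)
    return min(groups.values())

# Hypothetical toy table: postal code and birth date are the quasi-identifiers.
table = [
    {"zip": "02138", "dob": "1950-07-21", "diagnosis": "flu"},
    {"zip": "02138", "dob": "1950-07-21", "diagnosis": "cold"},
    {"zip": "02139", "dob": "1962-01-02", "diagnosis": "asthma"},
]
print(k_anonymity(table, ["zip", "dob"]))  # 1 -> the third record is unique
```

A result of 1 means at least one person is uniquely identifiable from the quasi-identifiers alone, which is exactly the failure mode of naive anonymization above.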
8
Randomization Approach Overview
[Figure: two clients each pass a record (30 | 70K | …, 50 | 40K | …) through a Randomizer, producing randomized records (65 | 20K | …, 25 | 60K | …); the server reconstructs the distribution of Age and the distribution of Salary and feeds them to a Classification Algorithm to build a Model. Example: Alice's age 30 becomes 65 (30 + 35); her salary 70K becomes 20K.]
9
Reconstruction Problem
• Original values x1, x2, ..., xn drawn from an unknown probability distribution X
• To hide these values, we add y1, y2, ..., yn drawn from a known distribution Y
• Given
– x1+y1, x2+y2, ..., xn+yn
– the probability distribution of Y
estimate the probability distribution of X.
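The reconstruction step can be sketched by iterating Bayes' rule over histogram bins, in the spirit of the randomization approach described here; the noise range, bin width, and sample sizes below are illustrative assumptions:

```python
import random

def reconstruct(perturbed, noise_pdf, bins, iters=30):
    """Estimate the distribution of X over bin centers, given w_i = x_i + y_i
    and the known noise density, by repeatedly applying Bayes' rule."""
    n = len(bins)
    f = [1.0 / n] * n                       # start from a uniform prior
    for _ in range(iters):
        new = [0.0] * n
        for w in perturbed:
            post = [noise_pdf(w - b) * f[k] for k, b in enumerate(bins)]
            s = sum(post)
            if s > 0:
                for k in range(n):
                    new[k] += post[k] / s   # posterior bin mass for this w
        total = sum(new)
        f = [v / total for v in new]
    return f

random.seed(0)
originals = [random.gauss(30, 3) for _ in range(1000)]        # true "ages"
perturbed = [x + random.uniform(-20, 20) for x in originals]  # add noise from Y

uniform_pdf = lambda u: 1 / 40 if -20 <= u <= 20 else 0.0
bins = list(range(0, 85, 5))                                  # bin centers 0, 5, ..., 80
f = reconstruct(perturbed, uniform_pdf, bins)
print(f[bins.index(30)])   # mass concentrates near age 30
```

Even though each individual value is hidden inside a wide uniform band, the aggregate distribution is recovered well enough to train a classifier on.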
10
Intuition (Reconstruct single point)
• Use Bayes' rule for density functions
[Figure: the original distribution for Age over the range 10 to 90, and the probabilistic estimate of the original value of a randomized point V.]
11
Works well
[Figure: histogram of Number of People (0 to 1200) vs. Age (20 to 60); the Reconstructed curve closely tracks the Original, while the Randomized curve does not.]
12
Recap: Why is privacy preserved?
• Cannot reconstruct individual values accurately.
• Can only reconstruct distributions.
13
Classification
• Naïve Bayes
– Assumes independence between attributes.
• Decision Tree
– Correlations are weakened by randomization, but not destroyed.
14
Decision Tree Example
Age | Salary | Repeat Visitor?
 23 |  50K   | Repeat
 17 |  30K   | Repeat
 43 |  40K   | Repeat
 68 |  50K   | Single
 32 |  70K   | Single
 20 |  20K   | Repeat

Tree: Age < 25? Yes → Repeat. No → Salary < 50K? Yes → Repeat; No → Single.
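The tree from this slide can be written directly as a predicate and checked against the six training rows:

```python
# The slide's decision tree as a function.
def predict(age, salary):
    if age < 25:
        return "Repeat"
    return "Repeat" if salary < 50_000 else "Single"

# The slide's training table: (age, salary, label).
table = [
    (23, 50_000, "Repeat"), (17, 30_000, "Repeat"), (43, 40_000, "Repeat"),
    (68, 50_000, "Single"), (32, 70_000, "Single"), (20, 20_000, "Repeat"),
]
assert all(predict(a, s) == label for a, s, label in table)
```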
15
Decision Tree Experiments
[Figure: classification accuracy (50-100%) on five target functions Fn 1 to Fn 5, comparing trees built from Original, Randomized, and Reconstructed data at the 100% randomization level.]

100% privacy: with 95% confidence, the attribute cannot be estimated any closer than its entire range.
16
Issues
• For very high privacy, discretization will lead to a poor model
• Gaussian noise provides more privacy at higher confidence levels
• In fact, the data can be de-randomized using an advanced control-theory approach [Kargupta 2003]
17
Association Rule Mining Algorithm [Agrawal et al. 1993]
1. L1 = {large 1-itemsets}
2. for (k = 2; Lk−1 ≠ ∅; k++) do begin
3.   Ck = apriori-gen(Lk−1)
4.   for all candidates c ∈ Ck do begin
5.     compute c.count
6.   end
7.   Lk = {c ∈ Ck | c.count ≥ min-sup}
8. end
9. Return L = ∪k Lk
c.count is the frequency count for a given itemset.
Key issue: to compute the frequency count, we need to access attributes that belong to different parties.
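In the single-party setting the pseudocode is straightforward to implement. A compact sketch (the transaction format and min-sup value are illustrative):

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Plain single-party Apriori, following the slide's pseudocode."""
    items = sorted({i for t in transactions for i in t})
    L = [{frozenset([i]) for i in items
          if sum(i in t for t in transactions) >= min_sup}]
    k = 2
    while L[-1]:
        # apriori-gen: join step (unions of size k) plus prune step
        candidates = {a | b for a in L[-1] for b in L[-1] if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in L[-1] for s in combinations(c, k - 1))}
        # compute c.count and keep the candidates meeting min-sup
        Lk = {c for c in candidates
              if sum(c <= t for t in transactions) >= min_sup}
        L.append(Lk)
        k += 1
    return set().union(*L)

transactions = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"}]
frequent = apriori(transactions, min_sup=2)
print(sorted(tuple(sorted(s)) for s in frequent))
```

The privacy problem below arises precisely in the `c.count` step: when the columns of a transaction live at different parties, no single party can evaluate the membership test locally.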
18
An Example
• c.count is the vector (dot) product.
• Let's use A to denote Alice's attribute vector and B to denote Bob's attribute vector:

      A (Alice)   B (Bob)
          1          0
          1          1
          1          1
          0          1
          1          1

• If AB is a candidate frequent itemset, then c.count = A • B = 3.
• How can we carry out this computation across parties without compromising either party's data privacy?
19
Homomorphic Encryption [Paillier 1999]
• Privacy-preserving protocols are based on Homomorphic Encryption.
• Specifically, we use the following additive homomorphism property:

  e(m1) × e(m2) × … × e(mn) = e(m1 + m2 + … + mn)

  where e is an encryption function, the mi are the data to be encrypted, and e(mi) ≠ 0.
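A toy Paillier implementation illustrates this property; the tiny primes keep the numbers readable but provide no security whatsoever:

```python
import random
from math import gcd

# Insecure toy Paillier keypair (real deployments use primes of 1024+ bits).
p, q = 293, 433
n, n2 = p * q, (p * q) ** 2
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)   # lcm(p - 1, q - 1)
g = n + 1

def L(u):
    return (u - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n)            # decryption helper

def enc(m):
    r = random.randrange(1, n)
    while gcd(r, n) != 1:                      # r must be invertible mod n
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def dec(c):
    return (L(pow(c, lam, n2)) * mu) % n

# Additive homomorphism: the product of ciphertexts decrypts to the sum.
c = enc(7) * enc(12) * enc(100) % n2
print(dec(c))  # 119
```

Note that each call to `enc` uses fresh randomness, so the same plaintext encrypts to different ciphertexts, yet the homomorphic sum still decrypts correctly.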
20
Digital Envelope[Chaum85]
• A digital envelope is a random number (a set of random numbers) only known by the owner of private data.
[Figure: a private value V is sealed in a digital envelope as V + R, where R is a random number known only to the owner.]
22
Frequency Count Protocol
• Assume Alice's attribute vector is A and Bob's attribute vector is B.
• Each vector contains N elements.
• Ai: the i-th element of A.
• Bi: the i-th element of B.
• One of the parties is randomly chosen as the key generator, e.g. Alice, who generates a key pair (e, d) and an integer X > N. e and X will be shared with Bob.
• Let's use e(.) to denote encryption and d(.) to denote decryption.
23
Alice

  A1 + R1 × X,  A2 + R2 × X,  …,  AN + RN × X    (digital envelopes)

R1, R2, …, RN: a set of random integers generated by Alice
26
Bob

  W1 = e(A1 + R1 × X) × B1,  W2 = e(A2 + R2 × X) × B2,  …,  WN = e(AN + RN × X) × BN

  Bi = 0 ⇒ Wi = 0
  Bi = 1 ⇒ Wi = e(Ai + Ri × X)
27
• Bob multiplies all the Wi that are not equal to 0. In other words, Bob computes the product of all the non-zero Wi:

  W = ∏ Wi = W1 × W2 × … × Wj,  where each Wi ≠ 0
30
W = W1 × W2 × … × Wj
  = e(A1 + R1 × X) × e(A2 + R2 × X) × … × e(Aj + Rj × X)
  = e((A1 + A2 + … + Aj) + (R1 + R2 + … + Rj) × X)
According to the property of homomorphic encryption
31
• Bob generates a random integer R′.
• Bob then computes

  W′ = W × e(R′ × X)
     = e((A1 + A2 + … + Aj) + (R1 + R2 + … + Rj + R′) × X)

  and sends W′ to Alice.
According to the property of homomorphic encryption
32
The Final Step
• Alice decrypts W′ and computes the result modulo X:

  c.count = d(e((A1 + A2 + … + Aj) + (R1 + R2 + … + Rj + R′) × X)) mod X
          = A1 + A2 + … + Aj

  because

  A1 + A2 + … + Aj ≤ N < X, and
  ((R1 + R2 + … + Rj + R′) × X) mod X = 0.

• The sum A1 + A2 + … + Aj runs over exactly those Ai whose corresponding Bi is 1, so it equals c.count = A • B.
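Putting the steps together, here is a minimal two-party simulation of the protocol. It uses an insecure toy Paillier keypair (tiny primes, for readability only); the vectors, the random ranges, and the choice of X are illustrative:

```python
import random
from math import gcd

# Insecure toy Paillier keypair (illustration only).
p, q = 293, 433
n, n2 = p * q, (p * q) ** 2
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)   # lcm(p - 1, q - 1)
g = n + 1
L = lambda u: (u - 1) // n
mu = pow(L(pow(g, lam, n2)), -1, n)

def enc(m):
    r = random.randrange(1, n)
    while gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m % n, n2) * pow(r, n, n2)) % n2

def dec(c):
    return (L(pow(c, lam, n2)) * mu) % n

A = [1, 1, 1, 0, 1]      # Alice's attribute vector
B = [0, 1, 1, 1, 1]      # Bob's attribute vector
N = len(A)
X = N + 1                # Alice picks X > N; (e, X) are shared with Bob

# Alice: send digital envelopes e(Ai + Ri * X)
R = [random.randrange(1, 10) for _ in range(N)]
envelopes = [enc(a + r * X) for a, r in zip(A, R)]

# Bob: multiply the envelopes at positions where Bi = 1, then blind with e(R' * X)
W = 1
for w, b in zip(envelopes, B):
    if b == 1:
        W = (W * w) % n2
W = (W * enc(random.randrange(1, 10) * X)) % n2

# Alice: decrypt and reduce mod X to obtain the frequency count c.count = A . B
print(dec(W) % X)  # 3
```

Bob only ever sees ciphertexts, and Alice only sees a blinded sum, yet the decrypted result modulo X is exactly the dot product from the earlier example.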
33
Privacy Analysis
• All the information that Bob obtains from Alice is

  e(A1 + R1 × X), e(A2 + R2 × X), …, e(AN + RN × X).

• Since Bob doesn't know the decryption key d, he cannot recover Alice's original data values.

Goal: Bob never sees Alice's data values.
34
Privacy Analysis
• The information that Alice obtains from Bob is

  W′ = e((A1 + A2 + … + Aj) + (R1 + R2 + … + Rj + R′) × X),

  computed over those i with Bi = 1.
• Alice computes d(W′) mod X. She only obtains the frequency count and cannot learn Bob's original data values.

Goal: Alice never sees Bob's data values.
35
Complexity Analysis
The communication cost is α(N + 1) bits, where N is the total number of transactions (the number of elements in each attribute vector) and α is the number of bits for each encrypted element.

Linear in the number of transactions.
36
Complexity Analysis
The computational cost is (10N + 20 + g), where N is the total number of transactions and g is the computational cost of generating a key pair.

Linear in the number of transactions.
37
Other Privacy-Oriented Protocols
• Multi-Party Frequency Count Protocol [Zhan et al. 2005 (a)]
• Multi-Party Summation Protocol [Zhan et al. 2005 (f)]
• Multi-Party Comparison Protocol [Zhan et al. 2006 (a)]
• Multi-Party Sorting Protocol [Zhan et al. 2006 (a)]
38
What about the results of DM?
• Can DM results reveal personal information?
• In some cases, yes [Atzori et al. 05]. Suppose an association rule is found:

  a1 ∧ a2 ∧ a3 ⇒ a4   [sup = 80, conf = 98.7%]

This means sup({a1, a2, a3, a4}) = 80, so

  sup({a1, a2, a3}) = sup({a1, a2, a3, a4}) / 0.987 = 80 / 0.987 = 81.05 ≈ 81

and therefore a1 ∧ a2 ∧ a3 ∧ ¬a4 has support 81 − 80 = 1: it identifies one person!!
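The slide's arithmetic, spelled out:

```python
sup_abcd = 80             # support of {a1, a2, a3, a4}
conf = 0.987              # confidence of a1 ^ a2 ^ a3 => a4

sup_abc = sup_abcd / conf             # support of the antecedent {a1, a2, a3}
print(round(sup_abc, 2))              # 81.05 -> 81 supporters of the antecedent
print(round(sup_abc) - sup_abcd)      # 1 -> one person matches a1 ^ a2 ^ a3 but not a4
```

So a rule whose support and confidence are both individually unremarkable can still pin down a single individual once they are combined.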
39
• They propose an approach called k-anonymous patterns and an algorithm (inference channels) which detects violations of k-anonymity
• The algorithm is computationally expensive
• We have a new approach which embeds k-anonymity into the concept-lattice association rule algorithm [Zaki, Ogihara 98]
40
Conclusion
• Important problem, a challenge for the field
• Lots of creative work, but a lack of systematic approach
• Medical data is particularly sensitive, but its structure also makes re-identification easier: genotype-phenotype inferences, location-visit patterns, family structures, etc. [Malin 2005]
• Lack of an operational, agreed-upon definition of privacy: inspiration in economics?