-anonymous patternsgroups.di.unipi.it/~atzori/slides/pkdd05-slides.pdf · 9th european conference...
TRANSCRIPT
Motivationk -Anonymous Patterns
Condensed RepresentationSummary
k -Anonymous Patterns
M. Atzori1,2 F. Bonchi2 F. Giannotti2 D. Pedreschi1
1Department of Computer ScienceUniversity of Pisa
2Information Science and Technology InstituteCNR, Pisa
Speaker: Maurizio Atzori
9th European Conference on Principles and Practice ofKnowledge Discovery in Databases (PKDD), 2005
M. Atzori, F. Bonchi, F. Giannotti, D. Pedreschi k -Anonymous Patterns
Motivationk -Anonymous Patterns
Condensed RepresentationSummary
Outline
1 MotivationData Mining and Privacy of IndividualsAn Example of the Problem Addressed
M. Atzori, F. Bonchi, F. Giannotti, D. Pedreschi k -Anonymous Patterns
Motivationk -Anonymous Patterns
Condensed RepresentationSummary
Outline
1 MotivationData Mining and Privacy of IndividualsAn Example of the Problem Addressed
2 k-Anonymous PatternsDefinitions and PropertiesFirst Results on Inference Channels
M. Atzori, F. Bonchi, F. Giannotti, D. Pedreschi k -Anonymous Patterns
Motivationk -Anonymous Patterns
Condensed RepresentationSummary
Outline
1 MotivationData Mining and Privacy of IndividualsAn Example of the Problem Addressed
2 k-Anonymous PatternsDefinitions and PropertiesFirst Results on Inference Channels
3 Condensed RepresentationDefinitions and PropertiesBenefits of the Condensed Representation
M. Atzori, F. Bonchi, F. Giannotti, D. Pedreschi k -Anonymous Patterns
Motivationk -Anonymous Patterns
Condensed RepresentationSummary
Data Mining and Privacy of IndividualsAn Example of the Problem Addressed
A Taxonomy of Privacy Preserving Data Mining
1 Intesional Knowledge HidingBertino’s approach, based on DB Sanitization
2 Extensional Knowledge HidingAgrawal’s approach, based on DB RandomizationSweeney’s approach, based on DB Anonymization
3 Distributed Extensional Knowledge HidingClifton’s approach based on Secure Multiparty Computation
4 Secure Intesional Knowledge SharingClifton’s Public/Private/Unknown Attribute FrameworkZaïane’s Association Rule Sanitization (but also IKH)k -Anonymous Patterns – Our approach
M. Atzori, F. Bonchi, F. Giannotti, D. Pedreschi k -Anonymous Patterns
Motivationk -Anonymous Patterns
Condensed RepresentationSummary
Data Mining and Privacy of IndividualsAn Example of the Problem Addressed
The Purpose
We want to publish datamining results (like SecureIntesional Knowledge Sharing)
We DON’T want to release information related to fewpeople, that can help to trace single individuals
We don’t want to specify any other information
M. Atzori, F. Bonchi, F. Giannotti, D. Pedreschi k -Anonymous Patterns
Motivationk -Anonymous Patterns
Condensed RepresentationSummary
Data Mining and Privacy of IndividualsAn Example of the Problem Addressed
A Motivating Example in the Medical Domain
Example
Suppose Dr. Gregory House conduces both usual hospitalactivities and research
M. Atzori, F. Bonchi, F. Giannotti, D. Pedreschi k -Anonymous Patterns
Motivationk -Anonymous Patterns
Condensed RepresentationSummary
Data Mining and Privacy of IndividualsAn Example of the Problem Addressed
A Motivating Example in the Medical Domain
Example
Suppose Dr. Gregory House conduces both usual hospitalactivities and research
He has a big database with all sensitive information abouthis patients
M. Atzori, F. Bonchi, F. Giannotti, D. Pedreschi k -Anonymous Patterns
Motivationk -Anonymous Patterns
Condensed RepresentationSummary
Data Mining and Privacy of IndividualsAn Example of the Problem Addressed
A Motivating Example in the Medical Domain
Example
Suppose Dr. Gregory House conduces both usual hospitalactivities and research
He has a big database with all sensitive information abouthis patients
Playing with Data Mining, he discovered interesting trendsabout patologies in his patient data
M. Atzori, F. Bonchi, F. Giannotti, D. Pedreschi k -Anonymous Patterns
Motivationk -Anonymous Patterns
Condensed RepresentationSummary
Data Mining and Privacy of IndividualsAn Example of the Problem Addressed
A Motivating Example in the Medical Domain
Example
Suppose Dr. Gregory House conduces both usual hospitalactivities and research
He has a big database with all sensitive information abouthis patients
Playing with Data Mining, he discovered interesting trendsabout patologies in his patient data
Question
Can Dr. House publish his discoveries to third persons withoutoffending the privacy of his patients?
M. Atzori, F. Bonchi, F. Giannotti, D. Pedreschi k -Anonymous Patterns
Motivationk -Anonymous Patterns
Condensed RepresentationSummary
Data Mining and Privacy of IndividualsAn Example of the Problem Addressed
A Motivating Example in the Medical Domain
M. Atzori, F. Bonchi, F. Giannotti, D. Pedreschi k -Anonymous Patterns
Motivationk -Anonymous Patterns
Condensed RepresentationSummary
Data Mining and Privacy of IndividualsAn Example of the Problem Addressed
A Motivating Example in the Medical Domain
M. Atzori, F. Bonchi, F. Giannotti, D. Pedreschi k -Anonymous Patterns
Motivationk -Anonymous Patterns
Condensed RepresentationSummary
Data Mining and Privacy of IndividualsAn Example of the Problem Addressed
An Association Rule Can Be Used to Break Anonymity
Example
a1 ∧ a2 ∧ a3 ⇒ a4 [sup = 80, conf = 98.7%]
M. Atzori, F. Bonchi, F. Giannotti, D. Pedreschi k -Anonymous Patterns
Motivationk -Anonymous Patterns
Condensed RepresentationSummary
Data Mining and Privacy of IndividualsAn Example of the Problem Addressed
An Association Rule Can Be Used to Break Anonymity
Example
a1 ∧ a2 ∧ a3 ⇒ a4 [sup = 80, conf = 98.7%]
sup({a1, a2, a3}) =sup({a1, a2, a3, a4})
conf≈
800.987
= 81.05
M. Atzori, F. Bonchi, F. Giannotti, D. Pedreschi k -Anonymous Patterns
Motivationk -Anonymous Patterns
Condensed RepresentationSummary
Data Mining and Privacy of IndividualsAn Example of the Problem Addressed
An Association Rule Can Be Used to Break Anonymity
Example
a1 ∧ a2 ∧ a3 ⇒ a4 [sup = 80, conf = 98.7%]
sup({a1, a2, a3}) =sup({a1, a2, a3, a4})
conf≈
800.987
= 81.05
In other words, we know that there is just one individual forwhich the pattern a1 ∧ a2 ∧ a3 ∧ ¬a4 holds.
M. Atzori, F. Bonchi, F. Giannotti, D. Pedreschi k -Anonymous Patterns
Motivationk -Anonymous Patterns
Condensed RepresentationSummary
Data Mining and Privacy of IndividualsAn Example of the Problem Addressed
Now we know that...
Fact
Even if we mine with a high support value, we can inferpatterns holding in the original database which are notintentionally released
M. Atzori, F. Bonchi, F. Giannotti, D. Pedreschi k -Anonymous Patterns
Motivationk -Anonymous Patterns
Condensed RepresentationSummary
Data Mining and Privacy of IndividualsAn Example of the Problem Addressed
Now we know that...
Fact
Even if we mine with a high support value, we can inferpatterns holding in the original database which are notintentionally released
They can regards very few individuals
M. Atzori, F. Bonchi, F. Giannotti, D. Pedreschi k -Anonymous Patterns
Motivationk -Anonymous Patterns
Condensed RepresentationSummary
Data Mining and Privacy of IndividualsAn Example of the Problem Addressed
Now we know that...
Fact
Even if we mine with a high support value, we can inferpatterns holding in the original database which are notintentionally released
They can regards very few individuals
The support value of such patterns can be inferred withoutaccessing the database
M. Atzori, F. Bonchi, F. Giannotti, D. Pedreschi k -Anonymous Patterns
Motivationk -Anonymous Patterns
Condensed RepresentationSummary
Definitions and PropertiesFirst Results on Inference Channels
What do you mean with “k -Anonymous Pattern”?
Definition (Anonymous Pattern)
Given a database D and an anonymity threshold k , a pattern pis said to be k-anonymous if sup
D(p) ≥ k or sup
D(p) = 0.
Definition (Inference Channel)
An Inference Channel is any set of itemsets from which it ispossible to infer that a pattern p is not k-anonymous.
We are interested in inference channels that are made offrequent itemsets.
M. Atzori, F. Bonchi, F. Giannotti, D. Pedreschi k -Anonymous Patterns
Motivationk -Anonymous Patterns
Condensed RepresentationSummary
Definitions and PropertiesFirst Results on Inference Channels
Example
M. Atzori, F. Bonchi, F. Giannotti, D. Pedreschi k -Anonymous Patterns
Motivationk -Anonymous Patterns
Condensed RepresentationSummary
Definitions and PropertiesFirst Results on Inference Channels
Example
M. Atzori, F. Bonchi, F. Giannotti, D. Pedreschi k -Anonymous Patterns
Motivationk -Anonymous Patterns
Condensed RepresentationSummary
Definitions and PropertiesFirst Results on Inference Channels
Reducing the Number of Patterns to Check
Theorem
∀p ∈ Pat(I) : 0 < supD
(p) < k . ∃ I ⊆ J ∈ 2I : CJI .
Translation: we can prune the search space by looking forInference Channels regarding only conjunctive patterns.
This property makes possible to havea (Naïve) Inference Channel Detector Algorithm
M. Atzori, F. Bonchi, F. Giannotti, D. Pedreschi k -Anonymous Patterns
Motivationk -Anonymous Patterns
Condensed RepresentationSummary
Definitions and PropertiesBenefits of the Condensed Representation
What do you mean with “Condensed Representation”?
Definition (Partial Order on Inference Channels)
CJI � CL
H when I ⊆ H and (J \ I) ⊆ (L \ H)
M. Atzori, F. Bonchi, F. Giannotti, D. Pedreschi k -Anonymous Patterns
Motivationk -Anonymous Patterns
Condensed RepresentationSummary
Definitions and PropertiesBenefits of the Condensed Representation
What do you mean with “Condensed Representation”?
Definition (Partial Order on Inference Channels)
CJI � CL
H when I ⊆ H and (J \ I) ⊆ (L \ H)
Example
Caca � Cabcd
ab
Intuitively, a ∧ ¬c is less specific than a ∧ b ∧ ¬c ∧ ¬d , since thetransactions s.t. a ∧ ¬c are a superset of the transactions s.t.a ∧ b ∧ ¬c ∧ ¬d
M. Atzori, F. Bonchi, F. Giannotti, D. Pedreschi k -Anonymous Patterns
Motivationk -Anonymous Patterns
Condensed RepresentationSummary
Definitions and PropertiesBenefits of the Condensed Representation
What do you mean with “Condensed Representation”?
Definition (Partial Order on Inference Channels)
CJI � CL
H when I ⊆ H and (J \ I) ⊆ (L \ H)
Example
Caca � Cabcd
ab
Intuitively, a ∧ ¬c is less specific than a ∧ b ∧ ¬c ∧ ¬d , since thetransactions s.t. a ∧ ¬c are a superset of the transactions s.t.a ∧ b ∧ ¬c ∧ ¬d
Definition (Maximal Inference Channel)
CJI is maximal w.r.t. D and σ, if ∀CL
H � CJI then sup(CL
H) = f LH = 0
M. Atzori, F. Bonchi, F. Giannotti, D. Pedreschi k -Anonymous Patterns
Motivationk -Anonymous Patterns
Condensed RepresentationSummary
Definitions and PropertiesBenefits of the Condensed Representation
Theoretical Results on Condensed Representation
Theorem (Form of Maximal Inference Channel)
An Inference Channel CJI is maximal iff
1 I is closed and2 J is maximal
Theorem (Lossless Representation)
Every Inference Channel can be computed from the set ofMaximal Inference Channels
M. Atzori, F. Bonchi, F. Giannotti, D. Pedreschi k -Anonymous Patterns
Motivationk -Anonymous Patterns
Condensed RepresentationSummary
Definitions and PropertiesBenefits of the Condensed Representation
Condensed Representation of Inference Channels
Smaller search space1 Memory saving (the number of non-maximal channels can
be huge)2 Faster running times3 Less distortion if we try to sanitize the set of frequent
itemset (but this point is not discussed in the paper)
M. Atzori, F. Bonchi, F. Giannotti, D. Pedreschi k -Anonymous Patterns
Motivationk -Anonymous Patterns
Condensed RepresentationSummary
Definitions and PropertiesBenefits of the Condensed Representation
Condensed Representation of Inference Channels
Number of Inference Channels
0 1000 2000 3000
Num
ber
of M
axim
al In
fere
nce
Cha
nnel
s
0
200
400
600MUSHROOM, k = 10ACCIDENTS, k = 10MUSHROOM, k = 50ACCIDENTS, k = 50
M. Atzori, F. Bonchi, F. Giannotti, D. Pedreschi k -Anonymous Patterns
Motivationk -Anonymous Patterns
Condensed RepresentationSummary
Definitions and PropertiesBenefits of the Condensed Representation
Improvements Over the Time Performance
Support (%)
20 25 30 35 40 45 50
Tim
e (m
s)
0
10000
20000
30000
40000
50000
60000
Naive
Optimized
MUSHROOM, k = 10
M. Atzori, F. Bonchi, F. Giannotti, D. Pedreschi k -Anonymous Patterns
Motivationk -Anonymous Patterns
Condensed RepresentationSummary
Definitions and PropertiesBenefits of the Condensed Representation
Number of Inference Channels
Support (%)
20 30 40 50 60 70 80
Num
ber
of M
axim
al In
fere
nce
Cha
nnel
s
0
200
400
600
800
k = 10k = 25k = 50k = 100
MUSHROOM
M. Atzori, F. Bonchi, F. Giannotti, D. Pedreschi k -Anonymous Patterns
Motivationk -Anonymous Patterns
Condensed RepresentationSummary
Summary
We defined k-anonymous patterns and provide a generalcharacterization of inference channels holding amongpatterns that may threat anonymity of source data
We developed an effective and efficient algorithm to detectsuch potential threats, which yields a methodology tocheck whether the mining results may be disclosed withoutany risk of violating anonymity
M. Atzori, F. Bonchi, F. Giannotti, D. Pedreschi k -Anonymous Patterns
Appendix For Further Reading
For Further Reading
M. Atzori, F. Bonchi, F. Giannotti, D. Pedreschi.Blocking Anonymity Threats Raised by Frequent ItemsetMining.Fifth IEEE International Conference on Data Mining(ICDM), Houston, Texas, USA, 2005.
L. Sweeney.k-Anonymity: A Model for Protecting Privacy.International Journal on Uncertainty Fuzziness andKnowledge-based Systems, 10(5), 2002.
M. Atzori, F. Bonchi, F. Giannotti, D. Pedreschi k -Anonymous Patterns