![Page 1: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/1.jpg)
Privacy Preserving Data Publication
Yufei Tao
Department of Computer Science and Engineering
Chinese University of Hong Kong
![Page 2: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/2.jpg)
Centralized publication
Assume that a hospital wants to publish the following table, called the microdata.
The publication must preserve the privacy of patients: prevent an adversary from knowing who contracted what.
Microdata
![Page 3: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/3.jpg)
Centralized publication (cont.)
A simple solution: Remove column ‘Name’. It does not work. See next.
publish
![Page 4: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/4.jpg)
Linking attacks
[Figure: an adversary matches the published table against a voter registration list on their shared quasi-identifier (QI) attributes.]
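A minimal sketch of a linking attack, assuming the published table keeps exact QI values (Age, Sex, Zipcode) and only drops the Name column. The rows are illustrative and follow tables used elsewhere in these slides; the function name `link` is hypothetical.

```python
# Illustrative data: a voter list with names, and a published table without names.
QI = ("age", "sex", "zipcode")

voter_list = [
    {"name": "Andy", "age": 4, "sex": "M", "zipcode": 12000},
    {"name": "Bill", "age": 5, "sex": "M", "zipcode": 14000},
    {"name": "Ken", "age": 6, "sex": "M", "zipcode": 18000},
]
published = [
    {"age": 4, "sex": "M", "zipcode": 12000, "disease": "gastric ulcer"},
    {"age": 5, "sex": "M", "zipcode": 14000, "disease": "dyspepsia"},
    {"age": 6, "sex": "M", "zipcode": 18000, "disease": "pneumonia"},
]

def link(voters, table, qi=QI):
    """Join the two tables on the QI attributes and return {name: [diseases]}."""
    index = {}
    for row in table:
        key = tuple(row[a] for a in qi)
        index.setdefault(key, []).append(row["disease"])
    matches = {}
    for voter in voters:
        key = tuple(voter[a] for a in qi)
        if key in index:
            matches[voter["name"]] = index[key]
    return matches

matches = link(voter_list, published)
for name, diseases in matches.items():
    print(name, diseases)
```

Because every QI combination in this toy input is unique, each name links to exactly one disease, which is the failure that removing 'Name' alone does not prevent.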
![Page 5: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/5.jpg)
These are real threats
Fact: 87% of Americans can be uniquely identified by {Zipcode, gender, date-of-birth}.
A famous experiment by Sweeney [International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002]
finds the medical record of an ex-governor of Massachusetts.
![Page 6: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/6.jpg)
Objectives
Publish a distorted version of the dataset so that
- [Privacy] the privacy of all individuals is “adequately” protected;
- [Utility] the dataset is useful for analyzing the characteristics of the microdata.

Paradox: privacy protection ↑, utility ↓.
![Page 7: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/7.jpg)
Issues
- Privacy principle: What is adequate privacy protection?
- Distortion approach: How to achieve the privacy principle?

The literature has discussed other issues as well: complexities, improving the utility of the published data, etc.
![Page 8: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/8.jpg)
Principle 1: k-anonymity
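A 2-anonymous generalization with 4 QI groups:
[Figure: the generalized table, with its QI attributes and its sensitive attribute marked and its 4 QI groups outlined, shown next to a voter registration list.]
[Sweeney, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002]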
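A small sketch of how k-anonymity could be checked, assuming a table is k-anonymous when every combination of QI values is shared by at least k rows. The sample rows follow the generalized table shown elsewhere in these slides; `is_k_anonymous` is a hypothetical helper name.

```python
from collections import Counter

# (Age, Sex, Zipcode, Disease); the first three columns are the QI attributes.
table = [
    ("[1, 5]", "M", "[10001, 15000]", "gastric ulcer"),
    ("[1, 5]", "M", "[10001, 15000]", "dyspepsia"),
    ("[6, 10]", "M", "[15001, 20000]", "pneumonia"),
    ("[6, 10]", "M", "[15001, 20000]", "bronchitis"),
    ("[11, 20]", "F", "[20001, 25000]", "flu"),
    ("[11, 20]", "F", "[20001, 25000]", "pneumonia"),
    ("[21, 60]", "F", "[30001, 60000]", "gastritis"),
    ("[21, 60]", "F", "[30001, 60000]", "gastritis"),
    ("[21, 60]", "F", "[30001, 60000]", "flu"),
    ("[21, 60]", "F", "[30001, 60000]", "flu"),
]

def is_k_anonymous(rows, qi_indices, k):
    """True if every QI group (rows with identical QI values) has at least k rows."""
    group_sizes = Counter(tuple(row[i] for i in qi_indices) for row in rows)
    return all(size >= k for size in group_sizes.values())

num_groups = len({row[:3] for row in table})
print(num_groups, is_k_anonymous(table, (0, 1, 2), 2))
```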
![Page 9: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/9.jpg)
Defects of k-anonymity
What is the disease of Joe?
No “diversity” in this QI group.
[Figure: the generalized table next to a voter registration list.]
![Page 10: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/10.jpg)
Principle 2: l-diversity
Each QI group should have at least l “well-represented” sensitive values.
Different ways to interpret “well-represented”.
[Machanavajjhala et al., ICDE, 2006]
![Page 11: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/11.jpg)
Naive interpretation
Each QI-group has l different sensitive values.
A 2-diverse table
| Age | Sex | Zipcode | Disease |
|---|---|---|---|
| [1, 5] | M | [10001, 15000] | gastric ulcer |
| [1, 5] | M | [10001, 15000] | dyspepsia |
| [6, 10] | M | [15001, 20000] | pneumonia |
| [6, 10] | M | [15001, 20000] | bronchitis |
| [11, 20] | F | [20001, 25000] | flu |
| [11, 20] | F | [20001, 25000] | pneumonia |
| [21, 60] | F | [30001, 60000] | gastritis |
| [21, 60] | F | [30001, 60000] | gastritis |
| [21, 60] | F | [30001, 60000] | flu |
| [21, 60] | F | [30001, 60000] | flu |
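The naive interpretation can be checked mechanically. The sketch below assumes each row is reduced to a (QI group, disease) pair and counts distinct sensitive values per group; the group labels g1 to g4 and the function name are illustrative.

```python
from collections import defaultdict

# (QI group label, Disease) pairs for the 2-diverse table above.
rows = [
    ("g1", "gastric ulcer"), ("g1", "dyspepsia"),
    ("g2", "pneumonia"), ("g2", "bronchitis"),
    ("g3", "flu"), ("g3", "pneumonia"),
    ("g4", "gastritis"), ("g4", "gastritis"), ("g4", "flu"), ("g4", "flu"),
]

def has_l_distinct_values(pairs, l):
    """Naive l-diversity: every QI group holds at least l different sensitive values."""
    values = defaultdict(set)
    for group, disease in pairs:
        values[group].add(disease)
    return all(len(v) >= l for v in values.values())

print(has_l_distinct_values(rows, 2), has_l_distinct_values(rows, 3))
```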
![Page 12: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/12.jpg)
Defects of the naive interpretation
Assume that Joe is identified in the QI group. What is the probability that he contracted HIV?
Implication: The most frequent sensitive value in a QI group cannot be too frequent.
But accomplishing only this is still vulnerable to attacks with background knowledge.
[Figure: a QI group with 100 tuples, 98 of which have the Disease value HIV; the other values shown are pneumonia and bronchitis.]
![Page 13: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/13.jpg)
Background knowledge attack
Let Joe be an individual in the QI group having HIV. A friend of Joe has the background knowledge: “Joe does not have pneumonia”. How likely would this friend assume that Joe had HIV?
[Figure: the Disease column of a QI group with 100 tuples, showing HIV, pneumonia and bronchitis; two runs of tuples are labeled “50 tuples” and “49 tuples”.]
![Page 14: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/14.jpg)
Controlling also the 2nd most frequent value
Even if an adversary can eliminate pneumonia, s/he can only assume that Joe has HIV with 40 / 70 probability.
[Figure: a QI group with 100 tuples: 40 tuples with HIV, 30 tuples with pneumonia and 30 tuples with bronchitis.]
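The 40 / 70 figure can be reproduced with a short calculation. This sketch assumes the group holds 40 HIV, 30 pneumonia and 30 bronchitis tuples and that the adversary treats the remaining tuples as equally likely; `posterior` is a hypothetical helper.

```python
from fractions import Fraction

group = {"HIV": 40, "pneumonia": 30, "bronchitis": 30}

def posterior(counts, target, eliminated=()):
    """Probability of `target` after the adversary rules out the `eliminated` values."""
    remaining = {value: n for value, n in counts.items() if value not in eliminated}
    return Fraction(remaining.get(target, 0), sum(remaining.values()))

before = posterior(group, "HIV")                 # no background knowledge
after = posterior(group, "HIV", {"pneumonia"})   # "Joe does not have pneumonia"
print(before, after)
```

Here `after` equals 40 / 70 in lowest terms, so capping the second most frequent value also caps what one eliminated value can reveal.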
![Page 15: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/15.jpg)
An example of 4-diversity
[Figure: the Disease column of a QI group, divided from top to bottom into the most frequent value, the 2nd, 3rd and 4th most frequent values, and the other values.]
![Page 16: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/16.jpg)
An example of 4-diversity (cont.)
[Figure: the Disease column of a QI group; the tuples with the most frequent value and the tuples with the other values have the same cardinality.]
![Page 17: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/17.jpg)
An example of 4-diversity (cont.)
Assume that Joe is a person in the QI group.
Property: If an adversary can eliminate only 3 diseases, s/he can correctly guess the disease of Joe with at most 50% probability.
[Figure: the Disease column of a QI group, with the values HIV, pneumonia, bronchitis and cancer marked, followed by the other values.]
![Page 18: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/18.jpg)
l-diversity
Consider a QI group.
- m is the number of sensitive values in the group.
- $r_1$ is the number of tuples having the most frequent sensitive value.
- $r_2$ is the number of tuples having the 2nd most frequent sensitive value.
- …
- $r_m$ is the number of tuples having the m-th most frequent sensitive value.

Then, $r_1 \le c \cdot (r_l + \dots + r_m)$, where c is a constant.
If an adversary can eliminate only l – 1 sensitive values, s/he can infer the disease of a person with probability at most 1 / (c + 1).
To be precise, this is called (c, l)-diversity.
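A hedged sketch of the condition on one QI group. The relation between r1 and c (rl + … + rm) is not legible in the transcript; the code assumes it is an upper bound on r1, taken here as ≤. Rejecting groups with fewer than l distinct values is also an assumption of this sketch.

```python
def satisfies_c_l_diversity(counts, c, l):
    """Check r_1 <= c * (r_l + ... + r_m) for one QI group.

    counts: number of tuples per sensitive value, in any order.
    """
    r = sorted(counts, reverse=True)   # r[0] is r_1, r[l - 1] is r_l
    if len(r) < l:                     # assumption: fewer than l values cannot qualify
        return False
    return r[0] <= c * sum(r[l - 1:])

balanced = satisfies_c_l_diversity([40, 30, 30], c=1, l=2)   # 40 <= 30 + 30
skewed = satisfies_c_l_diversity([98, 1, 1], c=1, l=2)       # 98 <= 1 + 1 fails
print(balanced, skewed)
```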
![Page 19: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/19.jpg)
Defects of l-diversity
Andy does not want anyone to know that he had a stomach problem. Sarah does not mind at all if others find out that she had flu.

A voter registration list

| Name | Age | Sex | Zipcode |
|---|---|---|---|
| Andy | 4 | M | 12000 |
| Bill | 5 | M | 14000 |
| Ken | 6 | M | 18000 |
| Nash | 9 | M | 19000 |
| Mike | 7 | M | 17000 |
| Alice | 12 | F | 22000 |
| Betty | 19 | F | 24000 |
| Linda | 21 | F | 33000 |
| Jane | 25 | F | 34000 |
| Sarah | 28 | F | 37000 |
| Mary | 56 | F | 58000 |

A 2-diverse table

| Age | Sex | Zipcode | Disease |
|---|---|---|---|
| [1, 5] | M | [10001, 15000] | gastric ulcer |
| [1, 5] | M | [10001, 15000] | dyspepsia |
| [6, 10] | M | [15001, 20000] | pneumonia |
| [6, 10] | M | [15001, 20000] | bronchitis |
| [11, 20] | F | [20001, 25000] | flu |
| [11, 20] | F | [20001, 25000] | pneumonia |
| [21, 60] | F | [30001, 60000] | gastritis |
| [21, 60] | F | [30001, 60000] | gastritis |
| [21, 60] | F | [30001, 60000] | flu |
| [21, 60] | F | [30001, 60000] | flu |
![Page 20: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/20.jpg)
Defects of l-diversity (cont.)
l-diversity does not work if an individual can have multiple tuples in the microdata.

Microdata

| Name | Age | Sex | Zipcode | Disease |
|---|---|---|---|---|
| Andy | 4 | M | 12000 | gastric ulcer |
| Andy | 4 | M | 12000 | dyspepsia |
| Ken | 6 | M | 18000 | pneumonia |
| Nash | 9 | M | 19000 | bronchitis |
| Alice | 12 | F | 22000 | flu |
| Betty | 19 | F | 24000 | pneumonia |
| Linda | 21 | F | 33000 | gastritis |
| Jane | 25 | F | 34000 | gastritis |
| Sarah | 28 | F | 37000 | flu |
| Mary | 56 | F | 58000 | flu |
![Page 21: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/21.jpg)
Defects of l-diversity (cont.)
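A voter registration list

| Name | Age | Sex | Zipcode |
|---|---|---|---|
| Andy | 4 | M | 12000 |
| Ken | 6 | M | 18000 |
| Nash | 9 | M | 19000 |
| Mike | 7 | M | 17000 |
| Alice | 12 | F | 22000 |
| Betty | 19 | F | 24000 |
| Linda | 21 | F | 33000 |
| Jane | 25 | F | 34000 |
| Sarah | 28 | F | 37000 |
| Mary | 56 | F | 58000 |

A 2-diverse table

| Age | Sex | Zipcode | Disease |
|---|---|---|---|
| 4 | M | 12000 | gastric ulcer |
| 4 | M | 12000 | dyspepsia |
| [6, 10] | M | [15001, 20000] | pneumonia |
| [6, 10] | M | [15001, 20000] | bronchitis |
| [11, 20] | F | [20001, 25000] | flu |
| [11, 20] | F | [20001, 25000] | pneumonia |
| [21, 60] | F | [30001, 60000] | gastritis |
| [21, 60] | F | [30001, 60000] | gastritis |
| [21, 60] | F | [30001, 60000] | flu |
| [21, 60] | F | [30001, 60000] | flu |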
![Page 22: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/22.jpg)
Principle 3: Personalized anonymity
Key ideas: guarding node + sensitive attribute (SA) generalization.
Assume a publicly-known hierarchy on the sensitive attribute:

- any illness
  - respiratory system problem
    - respiratory infection
      - flu
      - pneumonia
      - bronchitis
  - digestive system problem
    - stomach disease
      - gastric ulcer
      - dyspepsia
      - gastritis

[Xiao and Tao, SIGMOD, 2006]
![Page 23: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/23.jpg)
Guarding node
Andy does not want anyone to know that he had a stomach problem. He can specify “stomach disease” as the guarding node for his tuple.
This protects Andy from being conjectured to have any disease in the subtree of the guarding node.

| Name | Age | Sex | Zipcode | Disease | Guarding node |
|---|---|---|---|---|---|
| Andy | 4 | M | 12000 | gastric ulcer | stomach disease |
![Page 24: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/24.jpg)
Guarding node (cont.)
Sarah is willing to disclose her exact symptom. She can specify Ø as the guarding node for her tuple.

| Name | Age | Sex | Zipcode | Disease | Guarding node |
|---|---|---|---|---|---|
| Sarah | 28 | F | 37000 | flu | Ø |
![Page 25: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/25.jpg)
Guarding node (cont.)
Bill does not have any special preference. He sets the guarding node of his tuple to be the same as his sensitive value.

| Name | Age | Sex | Zipcode | Disease | Guarding node |
|---|---|---|---|---|---|
| Bill | 5 | M | 14000 | dyspepsia | dyspepsia |
![Page 26: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/26.jpg)
A personalized approach

| Name | Age | Sex | Zipcode | Disease | Guarding node |
|---|---|---|---|---|---|
| Andy | 4 | M | 12000 | gastric ulcer | stomach disease |
| Bill | 5 | M | 14000 | dyspepsia | dyspepsia |
| Ken | 6 | M | 18000 | pneumonia | respiratory infection |
| Nash | 9 | M | 19000 | bronchitis | bronchitis |
| Alice | 12 | F | 22000 | flu | flu |
| Betty | 19 | F | 24000 | pneumonia | pneumonia |
| Linda | 21 | F | 33000 | gastritis | gastritis |
| Jane | 25 | F | 34000 | gastritis | Ø |
| Sarah | 28 | F | 37000 | flu | Ø |
| Mary | 56 | F | 58000 | flu | flu |
![Page 27: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/27.jpg)
Personalized anonymity
No adversary should be able to breach the privacy requirement of any guarding node with a probability above p_breach.
If p_breach = 0.3, then no adversary can have more than 30% probability to find out that:
- Andy had a stomach disease
- Bill had dyspepsia
- …
![Page 28: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/28.jpg)
Why SA generalization?
How many female patients are there with age above 30?
- Estimate from the pure QI generalization: 4 ∙ (60 – 30 + 1) / (60 – 21 + 1) = 3.1 ≈ 3
- Real answer: 1

Pure QI generalization

| Age | Sex | Zipcode | Disease |
|---|---|---|---|
| [1, 5] | M | [10001, 15000] | gastric ulcer |
| [1, 5] | M | [10001, 15000] | dyspepsia |
| [6, 10] | M | [15001, 20000] | pneumonia |
| [6, 10] | M | [15001, 20000] | bronchitis |
| [11, 20] | F | [20001, 25000] | flu |
| [11, 20] | F | [20001, 25000] | pneumonia |
| [21, 60] | F | [30001, 60000] | gastritis |
| [21, 60] | F | [30001, 60000] | gastritis |
| [21, 60] | F | [30001, 60000] | flu |
| [21, 60] | F | [30001, 60000] | flu |

Microdata

| Name | Age | Sex | Zipcode | Disease |
|---|---|---|---|---|
| Andy | 4 | M | 12000 | gastric ulcer |
| Bill | 5 | M | 14000 | dyspepsia |
| Ken | 6 | M | 18000 | pneumonia |
| Nash | 9 | M | 19000 | bronchitis |
| Alice | 12 | F | 22000 | flu |
| Betty | 19 | F | 24000 | pneumonia |
| Linda | 21 | F | 33000 | gastritis |
| Jane | 25 | F | 34000 | gastritis |
| Sarah | 28 | F | 37000 | flu |
| Mary | 56 | F | 58000 | flu |
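The estimate on this slide can be reproduced as follows. The sketch assumes ages are uniformly distributed inside each generalized interval and, like the slide's formula, uses the inclusive query range [30, 60]; `estimate_count` is a hypothetical helper, and only the Age and Sex columns are kept.

```python
from fractions import Fraction

# (Age interval, Sex) from the pure QI generalization.
generalized = (
    [((1, 5), "M")] * 2 + [((6, 10), "M")] * 2
    + [((11, 20), "F")] * 2 + [((21, 60), "F")] * 4
)
# (Age, Sex) from the microdata.
microdata = [(4, "M"), (5, "M"), (6, "M"), (9, "M"), (12, "F"),
             (19, "F"), (21, "F"), (25, "F"), (28, "F"), (56, "F")]

def estimate_count(rows, sex, lo, hi):
    """Sum, over matching rows, the fraction of the age interval inside [lo, hi]."""
    total = Fraction(0)
    for (a_lo, a_hi), s in rows:
        overlap = min(a_hi, hi) - max(a_lo, lo) + 1
        if s == sex and overlap > 0:
            total += Fraction(overlap, a_hi - a_lo + 1)
    return total

estimated = estimate_count(generalized, "F", 30, 60)
exact = sum(1 for age, s in microdata if s == "F" and 30 <= age <= 60)
print(float(estimated), exact)
```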
![Page 29: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/29.jpg)
SA generalization (cont.)
With SA generalization

| Age | Sex | Zipcode | Disease |
|---|---|---|---|
| [1, 5] | M | [10001, 15000] | gastric ulcer |
| [1, 5] | M | [10001, 15000] | dyspepsia |
| [6, 10] | M | [15001, 20000] | pneumonia |
| [6, 10] | M | [15001, 20000] | bronchitis |
| [11, 20] | F | [20001, 25000] | flu |
| [11, 20] | F | [20001, 25000] | pneumonia |
| [21, 30] | F | [30001, 40000] | gastritis |
| [21, 30] | F | [30001, 40000] | gastritis |
| [21, 30] | F | [30001, 40000] | flu |
| 56 | F | 58000 | respiratory infection |
![Page 30: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/30.jpg)
Evaluation of disclosure risk
What is the probability that the adversary can find out that “Andy had a stomach disease”?
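A voter registration list

| Name | Age | Sex | Zipcode |
|---|---|---|---|
| Andy | 4 | M | 12000 |
| Bill | 5 | M | 14000 |
| Ken | 6 | M | 18000 |
| Nash | 9 | M | 19000 |
| Mike | 7 | M | 17000 |
| Alice | 12 | F | 22000 |
| Betty | 19 | F | 24000 |
| Linda | 21 | F | 33000 |
| Jane | 25 | F | 34000 |
| Sarah | 28 | F | 37000 |
| Mary | 56 | F | 58000 |

The published data

| Age | Sex | Zipcode | Disease |
|---|---|---|---|
| [1, 10] | M | [10001, 20000] | gastric ulcer |
| [1, 10] | M | [10001, 20000] | dyspepsia |
| [1, 10] | M | [10001, 20000] | pneumonia |
| [1, 10] | M | [10001, 20000] | bronchitis |
| [11, 20] | F | [20001, 25000] | flu |
| [11, 20] | F | [20001, 25000] | pneumonia |
| 21 | F | 33000 | stomach disease |
| 25 | F | 34000 | gastritis |
| 28 | F | 37000 | flu |
| 56 | F | 58000 | respiratory infection |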
![Page 31: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/31.jpg)
Combinatorial reconstruction (cont.)
Can each individual appear more than once?
- No: the primary case.
- Yes: the non-primary case.

Some possible reconstructions:
[Figure: for each case (primary and non-primary), a diagram linking the voters Andy, Bill, Ken, Nash and Mike to the four published diseases gastric ulcer, dyspepsia, pneumonia and bronchitis.]
![Page 32: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/32.jpg)
![Page 33: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/33.jpg)
Breach probability (primary)
- There are 120 possible reconstructions in total.
- If Andy is associated with a stomach disease in n_b reconstructions, the probability that the adversary associates Andy with some stomach problem is n_b / 120.
- Andy is associated with gastric ulcer in 24 reconstructions, with dyspepsia in 24 reconstructions, and with gastritis in 0 reconstructions, so n_b = 48.
- The breach probability for Andy’s tuple is 48 / 120 = 2 / 5.
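The counts for the primary case can be verified by brute force. The sketch assumes a reconstruction assigns each of the four published tuples to a distinct voter among the five candidates; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import permutations

people = ["Andy", "Bill", "Ken", "Nash", "Mike"]
diseases = ["gastric ulcer", "dyspepsia", "pneumonia", "bronchitis"]
stomach = {"gastric ulcer", "dyspepsia", "gastritis"}   # subtree of "stomach disease"

# Primary case: owners[i] is the voter assigned to diseases[i]; no voter repeats.
reconstructions = list(permutations(people, len(diseases)))
n_b = sum(
    1 for owners in reconstructions
    if any(o == "Andy" and d in stomach for o, d in zip(owners, diseases))
)
breach_primary = Fraction(n_b, len(reconstructions))
print(len(reconstructions), n_b, breach_primary)
```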
![Page 34: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/34.jpg)
Breach probability (non-primary)
- There are 625 possible reconstructions in total.
- Andy is associated with gastric ulcer, dyspepsia or gastritis in 225 reconstructions, so n_b = 225.
- The breach probability for Andy’s tuple is 225 / 625 = 9 / 25.
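The non-primary counts can be checked the same way. The sketch assumes each of the four published tuples may belong to any of the five voters independently, so a voter can own several tuples; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

people = ["Andy", "Bill", "Ken", "Nash", "Mike"]
diseases = ["gastric ulcer", "dyspepsia", "pneumonia", "bronchitis"]
stomach = {"gastric ulcer", "dyspepsia", "gastritis"}   # subtree of "stomach disease"

# Non-primary case: owners[i] is the voter assigned to diseases[i]; repeats allowed.
reconstructions = list(product(people, repeat=len(diseases)))
n_b = sum(
    1 for owners in reconstructions
    if any(o == "Andy" and d in stomach for o, d in zip(owners, diseases))
)
breach_non_primary = Fraction(n_b, len(reconstructions))
print(len(reconstructions), n_b, breach_non_primary)
```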
![Page 35: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/35.jpg)
A defect of personalized anonymity
- Does not guard against background knowledge. Recall that l-diversity can achieve this purpose.
- But it seems possible to adapt the personalized approach to tackle background knowledge. Future work?
![Page 36: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/36.jpg)
Other privacy principles
- k-gather. Due to [Aggarwal et al., PODS, 2006]. Suffers from the problems of k-anonymity.
- (α, k)-anonymity. Due to [Wong et al., KDD, 2006].
- t-closeness. Recently proposed by [Li and Li, ICDE, 2007].
![Page 37: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/37.jpg)
Issues
- Privacy principle: What is adequate privacy protection?
- Distortion approach: How to achieve the privacy principle?
![Page 38: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/38.jpg)
Three approaches
- Suppression. We do not discuss it because the utility of the resulting table is low; it can be regarded as a special case of generalization.
- Generalization. Due to [Sweeney, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002].
- Anatomy (also called “bucketization”). Due to [Xiao and Tao, VLDB, 2006].

Each of the above approaches can be integrated with all the privacy principles discussed earlier.
![Page 39: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/39.jpg)
A multidimensional view of generalization
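[Figure: tuples 1 to 8 plotted as points, with x = Age (ticks from 20 to 70) and y = Zipcode (ticks from 10k to 60k); tuples 6 and 7 coincide. Two generalized rectangles are labeled R1 and R2.]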
![Page 40: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/40.jpg)
Taxonomy of generalization
Local recoding
- (Generalized) rectangles may overlap.
- Suppression is a special case of local recoding.

Global recoding
- All rectangles are disjoint.

[LeFevre et al., SIGMOD, 2005]
![Page 41: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/41.jpg)
Taxonomy of generalization (cont.)
Global recoding can be further divided.

Single-dimension recoding
- Rectangles form a grid.

Multi-dimension recoding
- The opposite of single-dimension recoding.
![Page 42: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/42.jpg)
Taxonomy of generalization (cont.)
Single-dimension recoding can be further divided:
- Full-domain recoding
- Full-subtree recoding

Both assume a hierarchy on each QI attribute. Example: a hierarchy on Age, from the root down to the leaves:
- [1, 90]
- [1, 30] [31, 60] [61, 90]
- [1, 10] [11, 20] [21, 30] [31, 40] [41, 50] [51, 60] [61, 70] [71, 80] [81, 90]
- 1, 2, 3, …, 10 ...
![Page 43: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/43.jpg)
Taxonomy of generalization (cont.)
Full-domain recoding
- All age values must be generalized to the same level of the hierarchy.
![Page 44: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/44.jpg)
Taxonomy of generalization (cont.)
Full-subtree recoding
- The subtrees of all generalized values must be disjoint.
- Permissible generalization: [1, 30], [31, 40], [41, 50], [51, 60], [61, 90].
- Illegal generalization: [1, 10], [1, 30], [31, 60], [61, 90].
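A sketch of the full-subtree rule for the Age hierarchy, assuming generalized values are drawn from the interval nodes of the hierarchy (leaf ages are left out for brevity) and that disjoint subtrees correspond to non-overlapping intervals. `is_full_subtree_recoding` is a hypothetical name.

```python
# Interval nodes of the Age hierarchy: widths 10, 30 and 90.
NODES = (
    {(lo, lo + 9) for lo in range(1, 91, 10)}
    | {(lo, lo + 29) for lo in range(1, 91, 30)}
    | {(1, 90)}
)

def is_full_subtree_recoding(values):
    """True if all values are hierarchy nodes and no two of their subtrees overlap."""
    ordered = sorted(values)
    if any(v not in NODES for v in ordered):
        return False
    # After sorting, overlap can only occur between neighbours.
    return all(prev[1] < cur[0] for prev, cur in zip(ordered, ordered[1:]))

permissible = [(1, 30), (31, 40), (41, 50), (51, 60), (61, 90)]
illegal = [(1, 10), (1, 30), (31, 60), (61, 90)]
print(is_full_subtree_recoding(permissible), is_full_subtree_recoding(illegal))
```

The illegal example fails because [1, 10] lies inside the subtree of [1, 30].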
![Page 45: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/45.jpg)
Why all these generalization types?
Reason 1: If a dataset is generalized in a more restricted manner, less preprocessing is required before it can be analyzed by a standard statistical tool (such as SAS).
![Page 46: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/46.jpg)
Why all these generalization types?
Reason 2: More restrictive generalization is usually faster to compute and easier to analyze.
- level 3: [1, 90]
- level 2: [1, 30] [31, 60] [61, 90]
- level 1: [1, 10] [11, 20] [21, 30] [31, 40] [41, 50] [51, 60] [61, 70] [71, 80] [81, 90]
- level 0: 1, 2, 3, …, 10 ...
![Page 47: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/47.jpg)
Why all these generalization types?
Reason 3: Less restrictive generalization promises more accurate data analysis, provided that a sophisticated analytical method is used.
![Page 48: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/48.jpg)
Generalization algorithms
- Operate on a quality metric. Examples:
  - The generalization level (for full-domain recoding)
  - Total rectangle size (for local recoding)
  - …
- Mostly heuristics-based. Finding the optimal generalization is often NP-hard.
![Page 49: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/49.jpg)
Defect of generalization
Query A:
SELECT COUNT(*) FROM Unknown-Microdata
WHERE Disease = 'pneumonia' AND Age IN [0, 30] AND Zipcode IN [10001, 20000]
Age Sex Zipcode Disease
[21, 60] M [10001, 60000] pneumonia
[21, 60] M [10001, 60000] dyspepsia
[21, 60] M [10001, 60000] dyspepsia
[21, 60] M [10001, 60000] pneumonia
[61, 70] F [10001, 60000] flu
[61, 70] F [10001, 60000] gastritis
[61, 70] F [10001, 60000] flu
[61, 70] F [10001, 60000] bronchitis
Estimated answer: 2p, where p is the probability that each of the two tuples satisfies the query conditions on the Age and Zipcode.
![Page 50: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/50.jpg)
Defect of generalization (cont.)
Query A:
SELECT COUNT(*) FROM Unknown-Microdata
WHERE Disease = 'pneumonia' AND Age IN [0, 30] AND Zipcode IN [10001, 20000]

p = Area(R1 ∩ Q) / Area(R1) = 0.05, where R1 is the generalized rectangle of the two pneumonia tuples below and Q is the query region.
Estimated answer for Query A: 2p = 0.1
Age Sex Zipcode Disease
[21, 60] M [10001, 60000] pneumonia
[21, 60] M [10001, 60000] pneumonia
![Page 51: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/51.jpg)
Defect of generalization (cont.)
Query A:
SELECT COUNT(*) FROM Unknown-Microdata
WHERE Disease = 'pneumonia' AND Age IN [0, 30] AND Zipcode IN [10001, 20000]

Estimated answer = 0.1

| Name | Age | Sex | Zipcode | Disease |
|---|---|---|---|---|
| Bob | 23 | M | 11000 | pneumonia |
| Ken | 27 | M | 13000 | dyspepsia |
| Peter | 35 | M | 59000 | dyspepsia |
| Sam | 59 | M | 12000 | pneumonia |
| Jane | 61 | F | 54000 | flu |
| Linda | 65 | F | 25000 | gastritis |
| Alice | 65 | F | 25000 | flu |
| Mandy | 70 | F | 30000 | bronchitis |

The exact answer = 1
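The gap between 0.1 and 1 can be reproduced numerically. The sketch assumes values are uniform inside the generalized rectangle and counts integer values per dimension, which yields the slide's p = 0.05; `overlap_fraction` is a hypothetical helper.

```python
from fractions import Fraction

def overlap_fraction(lo, hi, q_lo, q_hi):
    """Fraction of the integer range [lo, hi] that falls inside [q_lo, q_hi]."""
    n = min(hi, q_hi) - max(lo, q_lo) + 1
    return Fraction(max(n, 0), hi - lo + 1)

# R1: Age [21, 60], Zipcode [10001, 60000]; Q: Age [0, 30], Zipcode [10001, 20000].
p = overlap_fraction(21, 60, 0, 30) * overlap_fraction(10001, 60000, 10001, 20000)
estimated = 2 * p   # two generalized pneumonia tuples

# (Age, Zipcode, Disease) from the microdata.
microdata = [
    (23, 11000, "pneumonia"), (27, 13000, "dyspepsia"),
    (35, 59000, "dyspepsia"), (59, 12000, "pneumonia"),
    (61, 54000, "flu"), (65, 25000, "gastritis"),
    (65, 25000, "flu"), (70, 30000, "bronchitis"),
]
exact = sum(
    1 for age, zipcode, disease in microdata
    if disease == "pneumonia" and 0 <= age <= 30 and 10001 <= zipcode <= 20000
)
print(float(p), float(estimated), exact)
```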
![Page 52: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/52.jpg)
Defect of generalization (cont.)
Cause of inaccuracy: the QI distribution inside each QI group is lost!
Age Sex Zipcode Disease
[21, 60] M [10001, 60000] pneumonia
[21, 60] M [10001, 60000] pneumonia
![Page 53: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/53.jpg)
Anatomy
Releases a quasi-identifier table (QIT) and a sensitive table (ST).
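Microdata

| Age | Sex | Zipcode | Disease |
|---|---|---|---|
| 23 | M | 11000 | pneumonia |
| 27 | M | 13000 | dyspepsia |
| 35 | M | 59000 | dyspepsia |
| 59 | M | 12000 | pneumonia |
| 61 | F | 54000 | flu |
| 65 | F | 25000 | gastritis |
| 65 | F | 25000 | flu |
| 70 | F | 30000 | bronchitis |

Quasi-identifier table (QIT)

| Age | Sex | Zipcode | Group-ID |
|---|---|---|---|
| 23 | M | 11000 | 1 |
| 27 | M | 13000 | 1 |
| 35 | M | 59000 | 1 |
| 59 | M | 12000 | 1 |
| 61 | F | 54000 | 2 |
| 65 | F | 25000 | 2 |
| 65 | F | 25000 | 2 |
| 70 | F | 30000 | 2 |

Sensitive table (ST)

| Group-ID | Disease | Count |
|---|---|---|
| 1 | dyspepsia | 2 |
| 1 | pneumonia | 2 |
| 2 | bronchitis | 1 |
| 2 | flu | 2 |
| 2 | gastritis | 1 |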
![Page 54: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/54.jpg)
Anatomy (cont.)
1. Decide an l-diverse partition of the tuples.
Age Sex Zipcode Disease
23 M 11000 pneumonia
27 M 13000 dyspepsia
35 M 59000 dyspepsia
59 M 12000 pneumonia
61 F 54000 flu
65 F 25000 gastritis
65 F 25000 flu
70 F 30000 bronchitis
QI group 1
QI group 2
A 2-diverse partition
![Page 55: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/55.jpg)
Anatomy (cont.)
2. Generate a quasi-identifier table (QIT) and a sensitive table (ST) based on the selected partition.
Quasi-identifier table (QIT)
Age Sex Zipcode
group 1
23 M 11000
27 M 13000
35 M 59000
59 M 12000
group 2
61 F 54000
65 F 25000
65 F 25000
70 F 30000
Sensitive table (ST)
Disease
group 1
pneumonia
dyspepsia
dyspepsia
pneumonia
group 2
flu
gastritis
flu
bronchitis
![Page 56: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/56.jpg)
Anatomy (cont.)
2. Generate a quasi-identifier table (QIT) and a sensitive table (ST) based on the selected partition.
Quasi-identifier table (QIT)
Age Sex Zipcode Group-ID
23 M 11000 1
27 M 13000 1
35 M 59000 1
59 M 12000 1
61 F 54000 2
65 F 25000 2
65 F 25000 2
70 F 30000 2
Sensitive table (ST)
Group-ID Disease
1 pneumonia
1 dyspepsia
1 dyspepsia
1 pneumonia
2 flu
2 gastritis
2 flu
2 bronchitis
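Step 2 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `anatomize` and the representation of a partition as lists of row indices are assumptions, and the ST is produced in its aggregated form with one (Group-ID, Disease, Count) row per distinct disease in a group.

```python
from collections import Counter

# Microdata rows: (Age, Sex, Zipcode, Disease)
microdata = [
    (23, "M", 11000, "pneumonia"),
    (27, "M", 13000, "dyspepsia"),
    (35, "M", 59000, "dyspepsia"),
    (59, "M", 12000, "pneumonia"),
    (61, "F", 54000, "flu"),
    (65, "F", 25000, "gastritis"),
    (65, "F", 25000, "flu"),
    (70, "F", 30000, "bronchitis"),
]

# The 2-diverse partition from step 1, given as row indices per QI group.
partition = [[0, 1, 2, 3], [4, 5, 6, 7]]

def anatomize(rows, groups):
    """Split rows into a QIT (exact QI values + Group-ID) and an aggregated ST."""
    qit, st = [], []
    for group_id, members in enumerate(groups, start=1):
        for i in members:
            age, sex, zipcode, _disease = rows[i]
            qit.append((age, sex, zipcode, group_id))
        # Count each sensitive value inside the group; QI values are dropped here.
        counts = Counter(rows[i][3] for i in members)
        for disease in sorted(counts):
            st.append((group_id, disease, counts[disease]))
    return qit, st

qit, st = anatomize(microdata, partition)
for row in st:
    print(row)
```

The QIT keeps every QI value exactly as in the microdata; only the link between a tuple and its disease is blurred to the level of the group.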
![Page 57: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/57.jpg)
Privacy preservation
Given a pair of QIT and ST generated from an l-diverse partition, an adversary can infer the sensitive value of each individual with confidence at most 1 / l.
Quasi-identifier table (QIT)
Age Sex Zipcode Group-ID
23 M 11000 1
27 M 13000 1
35 M 59000 1
59 M 12000 1
61 F 54000 2
65 F 25000 2
65 F 25000 2
70 F 30000 2
Sensitive table (ST)
Group-ID Disease Count
1 dyspepsia 2
1 pneumonia 2
2 bronchitis 1
2 flu 2
2 gastritis 1
Name Age Sex Zipcode
Bob 23 M 11000
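To illustrate the 1 / l bound, the sketch below plays the adversary who knows Bob's QI values and links them to the published tables. It assumes the adversary considers every tuple of the matching group equally likely to be Bob's; the function name `disease_confidence` is hypothetical.

```python
from fractions import Fraction

# Published QIT rows: (Age, Sex, Zipcode, Group-ID)
qit = [
    (23, "M", 11000, 1), (27, "M", 13000, 1),
    (35, "M", 59000, 1), (59, "M", 12000, 1),
    (61, "F", 54000, 2), (65, "F", 25000, 2),
    (65, "F", 25000, 2), (70, "F", 30000, 2),
]
# Published ST rows: (Group-ID, Disease, Count)
st = [
    (1, "dyspepsia", 2), (1, "pneumonia", 2),
    (2, "bronchitis", 1), (2, "flu", 2), (2, "gastritis", 1),
]

def disease_confidence(target_qi, qit_rows, st_rows):
    """Adversary's belief about the target's disease, per group matching the QI values."""
    group_ids = {g for (age, sex, zipcode, g) in qit_rows
                 if (age, sex, zipcode) == target_qi}
    beliefs = {}
    for g in group_ids:
        size = sum(count for (gid, _d, count) in st_rows if gid == g)
        beliefs[g] = {d: Fraction(count, size)
                      for (gid, d, count) in st_rows if gid == g}
    return beliefs

bob = (23, "M", 11000)
beliefs = disease_confidence(bob, qit, st)
print(beliefs)
```

Bob falls in group 1, where dyspepsia and pneumonia each account for 2 of the 4 tuples, so the adversary's confidence in either disease is 1/2, which equals 1 / l for l = 2.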
![Page 58: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/58.jpg)
Accuracy of data analysis
Query A: SELECT COUNT(*) FROM Unknown-Microdata
WHERE Disease = 'pneumonia' AND Age in [0, 30]
AND Zipcode in [10001, 20000]
Quasi-identifier table (QIT)
Age Sex Zipcode Group-ID
23 M 11000 1
27 M 13000 1
35 M 59000 1
59 M 12000 1
61 F 54000 2
65 F 25000 2
65 F 25000 2
70 F 30000 2
Sensitive table (ST)
Group-ID Disease Count
1 dyspepsia 2
1 pneumonia 2
2 bronchitis 1
2 flu 2
2 gastritis 1
![Page 59: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/59.jpg)
Accuracy of data analysis
Query A: SELECT COUNT(*) FROM Unknown-Microdata
WHERE Disease = 'pneumonia' AND Age in [0, 30]
AND Zipcode in [10001, 20000]
2 patients contracted pneumonia.
2 out of 4 patients satisfy the query conditions on Age and Zipcode.
Estimated answer = 2 * 2 / 4 = 1.
Tuple Age Sex Zipcode Group-ID
t1 23 M 11000 1
t2 27 M 13000 1
t3 35 M 59000 1
t4 59 M 12000 1
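The estimate for Query A can be computed directly from the two published tables. The sketch below is one way to write it, assuming that within a group each tuple is equally likely to carry any of the group's sensitive values; the function name `estimate_count` and its parameters are illustrative.

```python
from fractions import Fraction

# Published QIT rows: (Age, Sex, Zipcode, Group-ID)
qit = [
    (23, "M", 11000, 1), (27, "M", 13000, 1),
    (35, "M", 59000, 1), (59, "M", 12000, 1),
    (61, "F", 54000, 2), (65, "F", 25000, 2),
    (65, "F", 25000, 2), (70, "F", 30000, 2),
]
# Published ST rows: (Group-ID, Disease, Count)
st = [
    (1, "dyspepsia", 2), (1, "pneumonia", 2),
    (2, "bronchitis", 1), (2, "flu", 2), (2, "gastritis", 1),
]

def estimate_count(qit_rows, st_rows, disease, age_range, zip_range):
    """Estimate COUNT(*) for a disease plus range conditions on Age and Zipcode."""
    total = Fraction(0)
    for gid, d, count in st_rows:
        if d != disease:
            continue
        members = [r for r in qit_rows if r[3] == gid]
        # Exact QI values are published, so the QI conditions are checked exactly.
        matching = [r for r in members
                    if age_range[0] <= r[0] <= age_range[1]
                    and zip_range[0] <= r[2] <= zip_range[1]]
        # count tuples have the disease; len(matching) / len(members) of the
        # group satisfies the QI conditions.
        total += Fraction(count * len(matching), len(members))
    return total

answer = estimate_count(qit, st, "pneumonia", (0, 30), (10001, 20000))
print(answer)  # 1
```

Only group 1 contains pneumonia, with count 2, and tuples t1 and t2 satisfy the Age and Zipcode conditions, giving 2 * 2 / 4 = 1, which matches the exact answer.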
![Page 60: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/60.jpg)
A defect of anatomy
Existence breach: Does an individual exist in the microdata?
![Page 61: Privacy Preserving Data Publication Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong](https://reader036.vdocument.in/reader036/viewer/2022062421/56649d095503460f949db183/html5/thumbnails/61.jpg)
Future work
Re-publication
Tackle stronger background knowledge
Recent work [Martin et al., ICDE, 2007]
Improving utility
Pioneering work [Kifer and Gehrke, SIGMOD, 2006]
Application to specific (non-trivial) applications
Location privacy
Pioneering work [Mokbel et al., VLDB, 2006]