Personalized Privacy Preservation

Xiaokui Xiao, Yufei Tao

City University of Hong Kong

Privacy preserving data publishing

• Purposes:
  – Allow researchers to effectively study the correlation between various attributes
  – Protect the privacy of every patient

Microdata

Name   Age  Sex  Zipcode  Disease
Andy   4    M    12000    gastric ulcer
Bill   5    M    14000    dyspepsia
Ken    6    M    18000    pneumonia
Nash   9    M    19000    bronchitis
Alice  12   F    22000    flu
Betty  19   F    24000    pneumonia
Linda  21   F    33000    gastritis
Jane   25   F    34000    gastritis
Sarah  28   F    37000    flu
Mary   56   F    58000    flu

A naïve solution

• Publish the microdata with the Name column removed
• It does not work; see the next slide

Microdata

Name   Age  Sex  Zipcode  Disease
Andy   4    M    12000    gastric ulcer
Bill   5    M    14000    dyspepsia
Ken    6    M    18000    pneumonia
Nash   9    M    19000    bronchitis
Alice  12   F    22000    flu
Betty  19   F    24000    pneumonia
Linda  21   F    33000    gastritis
Jane   25   F    34000    gastritis
Sarah  28   F    37000    flu
Mary   56   F    58000    flu

Published table

Age  Sex  Zipcode  Disease
4    M    12000    gastric ulcer
5    M    14000    dyspepsia
6    M    18000    pneumonia
9    M    19000    bronchitis
12   F    22000    flu
19   F    24000    pneumonia
21   F    33000    gastritis
25   F    34000    gastritis
28   F    37000    flu
56   F    58000    flu

Inference attack

Published table

Age  Sex  Zipcode  Disease
4    M    12000    gastric ulcer
5    M    14000    dyspepsia
6    M    18000    pneumonia
9    M    19000    bronchitis
12   F    22000    flu
19   F    24000    pneumonia
21   F    33000    gastritis
25   F    34000    gastritis
28   F    37000    flu
56   F    58000    flu

An external database (a voter registration list)

Name   Age  Sex  Zipcode
Andy   4    M    12000
Bill   5    M    14000
Ken    6    M    18000
Nash   9    M    19000
Mike   7    M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000

• An adversary can join the two tables on the quasi-identifier (QI) attributes: Age, Sex, Zipcode
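
The linking step on this slide can be sketched in a few lines of Python. This is only an illustration using the slide's two tables; the function name `link` and the tuple layout are assumptions made for the example, not part of the original deck.

```python
# Published table from the slide: Name removed, QI attributes kept.
published = [
    (4, "M", 12000, "gastric ulcer"),
    (5, "M", 14000, "dyspepsia"),
    (6, "M", 18000, "pneumonia"),
    (9, "M", 19000, "bronchitis"),
    (12, "F", 22000, "flu"),
    (19, "F", 24000, "pneumonia"),
    (21, "F", 33000, "gastritis"),
    (25, "F", 34000, "gastritis"),
    (28, "F", 37000, "flu"),
    (56, "F", 58000, "flu"),
]

# External database (voter registration list) from the slide.
voters = [
    ("Andy", 4, "M", 12000), ("Bill", 5, "M", 14000),
    ("Ken", 6, "M", 18000), ("Nash", 9, "M", 19000),
    ("Mike", 7, "M", 17000), ("Alice", 12, "F", 22000),
    ("Betty", 19, "F", 24000), ("Linda", 21, "F", 33000),
    ("Jane", 25, "F", 34000), ("Sarah", 28, "F", 37000),
    ("Mary", 56, "F", 58000),
]


def link(published_rows, voter_rows):
    """Join the two tables on the QI attributes (Age, Sex, Zipcode)."""
    diseases_by_qi = {}
    for age, sex, zipcode, disease in published_rows:
        diseases_by_qi.setdefault((age, sex, zipcode), []).append(disease)
    return {
        name: diseases_by_qi.get((age, sex, zipcode), [])
        for name, age, sex, zipcode in voter_rows
    }


inferred = link(published, voters)
print(inferred["Andy"])  # ['gastric ulcer']
print(inferred["Mike"])  # [] (Mike is not in the microdata)
```

Every individual whose QI values are unique in the published table is re-identified exactly, which is why removing names alone is not enough.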

Generalization

• Transform each QI value into a less specific form

An external database

Name   Age  Sex  Zipcode
Andy   4    M    12000
Bill   5    M    14000
Ken    6    M    18000
Nash   9    M    19000
Mike   7    M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000

A generalized table

Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 25]  F    [30001, 35000]  gastritis
[21, 25]  F    [30001, 35000]  gastritis
[26, 60]  F    [35001, 60000]  flu
[26, 60]  F    [35001, 60000]  flu

Information loss
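
A minimal sketch of the generalization step, assuming the interval boundaries shown in the generalized table above. The names `AGE_BINS`, `ZIP_BINS`, `generalize`, and `generalize_row` are hypothetical helpers for this example; the deck does not say how the intervals were chosen.

```python
# Interval boundaries copied from the generalized table on the slide.
AGE_BINS = [(1, 5), (6, 10), (11, 20), (21, 25), (26, 60)]
ZIP_BINS = [(10001, 15000), (15001, 20000), (20001, 25000),
            (30001, 35000), (35001, 60000)]


def generalize(value, bins):
    """Return the (lo, hi) interval that contains value."""
    for lo, hi in bins:
        if lo <= value <= hi:
            return (lo, hi)
    raise ValueError(f"{value} is not covered by any interval")


def generalize_row(age, sex, zipcode, disease):
    """Generalize the numeric QI values; Sex and Disease stay as they are."""
    return (generalize(age, AGE_BINS), sex, generalize(zipcode, ZIP_BINS), disease)


print(generalize_row(4, "M", 12000, "gastric ulcer"))
# ((1, 5), 'M', (10001, 15000), 'gastric ulcer')
```

The wider the intervals, the less an adversary learns from the QI values, but also the less precise any analysis on the published table becomes.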

k-anonymity

• The following table is 2-anonymous

Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 25]  F    [30001, 35000]  gastritis
[21, 25]  F    [30001, 35000]  gastritis
[26, 60]  F    [35001, 60000]  flu
[26, 60]  F    [35001, 60000]  flu

• 5 QI groups
• Quasi-identifier (QI) attributes: Age, Sex, Zipcode
• Sensitive attribute: Disease
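
The 2-anonymity claim can be checked mechanically: a table is k-anonymous when every QI group holds at least k tuples. The sketch below uses the table from this slide; `qi_group_sizes` and `is_k_anonymous` are names chosen for the example.

```python
from collections import Counter

# The generalized table from the slide: (Age, Sex, Zipcode, Disease).
table = [
    ("[1, 5]", "M", "[10001, 15000]", "gastric ulcer"),
    ("[1, 5]", "M", "[10001, 15000]", "dyspepsia"),
    ("[6, 10]", "M", "[15001, 20000]", "pneumonia"),
    ("[6, 10]", "M", "[15001, 20000]", "bronchitis"),
    ("[11, 20]", "F", "[20001, 25000]", "flu"),
    ("[11, 20]", "F", "[20001, 25000]", "pneumonia"),
    ("[21, 25]", "F", "[30001, 35000]", "gastritis"),
    ("[21, 25]", "F", "[30001, 35000]", "gastritis"),
    ("[26, 60]", "F", "[35001, 60000]", "flu"),
    ("[26, 60]", "F", "[35001, 60000]", "flu"),
]


def qi_group_sizes(rows):
    """Count tuples per QI group, i.e. per distinct (Age, Sex, Zipcode)."""
    return Counter(row[:3] for row in rows)


def is_k_anonymous(rows, k):
    """True if every QI group contains at least k tuples."""
    return all(size >= k for size in qi_group_sizes(rows).values())


print(len(qi_group_sizes(table)))  # 5
print(is_k_anonymous(table, 2))    # True
print(is_k_anonymous(table, 3))    # False
```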

Drawback of k-anonymity

• What is the disease of Linda?

An external database

Name   Age  Sex  Zipcode
Andy   4    M    12000
Bill   5    M    14000
Ken    6    M    18000
Nash   9    M    19000
Mike   7    M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000

A 2-anonymous table

Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 25]  F    [30001, 35000]  gastritis
[21, 25]  F    [30001, 35000]  gastritis
[26, 60]  F    [35001, 60000]  flu
[26, 60]  F    [35001, 60000]  flu

A better criterion: l-diversity

• Each QI-group
  – has at least l different sensitive values
  – even the most frequent sensitive value does not have a lot of tuples

An external database

Name   Age  Sex  Zipcode
Andy   4    M    12000
Bill   5    M    14000
Ken    6    M    18000
Nash   9    M    19000
Alice  12   F    22000
Mike   7    M    17000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000

A 2-diverse table

Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  flu
[21, 60]  F    [30001, 60000]  flu
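
As a rough check of the first l-diversity condition, the sketch below counts distinct sensitive values per QI group for the 2-anonymous table from the previous slide and for the 2-diverse table above. It deliberately ignores the second condition (the frequency of the most common value), and `is_l_diverse` is a name assumed for this example.

```python
# Sensitive values per QI group, keyed by the generalized Age interval.
two_anonymous = {
    "[1, 5]": ["gastric ulcer", "dyspepsia"],
    "[6, 10]": ["pneumonia", "bronchitis"],
    "[11, 20]": ["flu", "pneumonia"],
    "[21, 25]": ["gastritis", "gastritis"],
    "[26, 60]": ["flu", "flu"],
}
two_diverse = {
    "[1, 5]": ["gastric ulcer", "dyspepsia"],
    "[6, 10]": ["pneumonia", "bronchitis"],
    "[11, 20]": ["flu", "pneumonia"],
    "[21, 60]": ["gastritis", "gastritis", "flu", "flu"],
}


def is_l_diverse(groups, l):
    """True if every QI group has at least l distinct sensitive values."""
    return all(len(set(values)) >= l for values in groups.values())


print(is_l_diverse(two_anonymous, 2))  # False
print(is_l_diverse(two_diverse, 2))    # True
```

The 2-anonymous table fails because the group [21, 25] contains only gastritis, which is exactly what reveals Linda's disease.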

Motivation 1: Personalization

• Andy does not want anyone to know that he had a stomach problem
• Sarah does not mind at all if others find out that she had flu

An external database

Name   Age  Sex  Zipcode
Andy   4    M    12000
Bill   5    M    14000
Ken    6    M    18000
Nash   9    M    19000
Mike   7    M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000

A 2-diverse table

Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  flu
[21, 60]  F    [30001, 60000]  flu

Motivation 2: Non-primary case

Microdata

Name   Age  Sex  Zipcode  Disease
Andy   4    M    12000    gastric ulcer
Andy   4    M    12000    dyspepsia
Ken    6    M    18000    pneumonia
Nash   9    M    19000    bronchitis
Alice  12   F    22000    flu
Betty  19   F    24000    pneumonia
Linda  21   F    33000    gastritis
Jane   25   F    34000    gastritis
Sarah  28   F    37000    flu
Mary   56   F    58000    flu

Motivation 2: Non-primary case (cont.)

An external database

Name   Age  Sex  Zipcode
Andy   4    M    12000
Ken    6    M    18000
Nash   9    M    19000
Mike   7    M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000

2-diverse table

Age       Sex  Zipcode         Disease
4         M    12000           gastric ulcer
4         M    12000           dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  flu
[21, 60]  F    [30001, 60000]  flu

Motivation 3: SA generalization

• How many female patients are there with age above 30?
• Estimate from the generalized table: 4 ∙ (60 – 30 + 1) / (60 – 21 + 1) = 3.1 ≈ 3
• Real answer: 1

A generalized table

Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  flu
[21, 60]  F    [30001, 60000]  flu

An external database

Name   Age  Sex  Zipcode
Andy   4    M    12000
Bill   5    M    14000
Ken    6    M    18000
Nash   9    M    19000
Mike   7    M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000
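
The estimate on this slide follows from assuming that ages are spread uniformly inside a generalized interval. The sketch below reproduces the slide's arithmetic, including its choice of treating the query range as [30, 60]; `estimate_count` is a hypothetical helper, and the true ages are taken from the microdata shown earlier in the deck.

```python
def estimate_count(group_size, group_lo, group_hi, query_lo, query_hi):
    """Estimate how many tuples of a QI group fall in [query_lo, query_hi],
    assuming ages are uniformly spread over [group_lo, group_hi]."""
    lo = max(group_lo, query_lo)
    hi = min(group_hi, query_hi)
    if lo > hi:
        return 0.0
    return group_size * (hi - lo + 1) / (group_hi - group_lo + 1)


# The only female group overlapping the query is [21, 60], with 4 tuples.
estimate = estimate_count(4, 21, 60, 30, 60)

# True ages of the four patients in that group (Linda, Jane, Sarah, Mary).
true_ages = [21, 25, 28, 56]
actual = sum(1 for age in true_ages if age > 30)

print(estimate, actual)  # 3.1 1
```

The gap between 3.1 and 1 comes entirely from the wide Age interval [21, 60] that l-diversity forced on the last QI group.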

Motivation 3: SA generalization (cont.)

• Generalization of the sensitive attribute is beneficial in this case

A better generalized table

Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 30]  F    [30001, 40000]  gastritis
[21, 30]  F    [30001, 40000]  gastritis
[21, 30]  F    [30001, 40000]  flu
56        F    58000           respiratory infection

An external database

Name   Age  Sex  Zipcode
Andy   4    M    12000
Bill   5    M    14000
Ken    6    M    18000
Nash   9    M    19000
Mike   7    M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000

Personalized anonymity

• We propose
  – a mechanism to capture personalized privacy requirements
  – criteria for measuring the degree of security provided by a generalized table
  – an algorithm for generating publishable tables

Guarding node

any illness
    respiratory system problem
        respiratory infection
            flu
            pneumonia
            bronchitis
    digestive system problem
        stomach disease
            gastric ulcer
            dyspepsia
            gastritis

• Andy does not want anyone to know that he had a stomach problem
• He can specify “stomach disease” as the guarding node for his tuple
• The data publisher should prevent an adversary from associating Andy with “stomach disease”

Name  Age  Sex  Zipcode  Disease        guarding node
Andy  4    M    12000    gastric ulcer  stomach disease

Guarding node

any illness
    respiratory system problem
        respiratory infection
            flu
            pneumonia
            bronchitis
    digestive system problem
        stomach disease
            gastric ulcer
            dyspepsia
            gastritis

• Sarah is willing to disclose her exact symptom
• She can specify Ø as the guarding node for her tuple

Name   Age  Sex  Zipcode  Disease  guarding node
Sarah  28   F    37000    flu      Ø

Guarding node

any illness
    respiratory system problem
        respiratory infection
            flu
            pneumonia
            bronchitis
    digestive system problem
        stomach disease
            gastric ulcer
            dyspepsia
            gastritis

• Bill does not have any special preference
• He can specify his sensitive value itself as the guarding node for his tuple

Name  Age  Sex  Zipcode  Disease    guarding node
Bill  5    M    14000    dyspepsia  dyspepsia
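
One way to read the three guarding-node examples is as a subtree test on the disease taxonomy: a breach occurs when the adversary infers any value inside the subtree rooted at the guarding node. The sketch below assumes the taxonomy shape suggested by the slide's diagram; `PARENT`, `in_subtree`, and `breaches` are names invented for this illustration, and `None` stands in for Ø.

```python
# Child -> parent links of the disease taxonomy (shape assumed from the slide).
PARENT = {
    "respiratory system problem": "any illness",
    "digestive system problem": "any illness",
    "respiratory infection": "respiratory system problem",
    "stomach disease": "digestive system problem",
    "flu": "respiratory infection",
    "pneumonia": "respiratory infection",
    "bronchitis": "respiratory infection",
    "gastric ulcer": "stomach disease",
    "dyspepsia": "stomach disease",
    "gastritis": "stomach disease",
}


def in_subtree(value, node):
    """True if value equals node or is a descendant of node."""
    while value is not None:
        if value == node:
            return True
        value = PARENT.get(value)
    return False


def breaches(guarding_node, inferred_value):
    """True if inferring inferred_value violates the guarding node.
    A guarding node of None (the slide's Ø) can never be violated."""
    if guarding_node is None:
        return False
    return in_subtree(inferred_value, guarding_node)


print(breaches("stomach disease", "dyspepsia"))  # True  (Andy)
print(breaches(None, "flu"))                     # False (Sarah)
print(breaches("dyspepsia", "gastric ulcer"))    # False (Bill)
```

Note that for Andy the adversary does not need to guess the exact disease: any value under "stomach disease" counts as a breach.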

A personalized approach

any illness
    respiratory system problem
        respiratory infection
            flu
            pneumonia
            bronchitis
    digestive system problem
        stomach disease
            gastric ulcer
            dyspepsia
            gastritis

Name   Age  Sex  Zipcode  Disease        guarding node
Andy   4    M    12000    gastric ulcer  stomach disease
Bill   5    M    14000    dyspepsia      dyspepsia
Ken    6    M    18000    pneumonia      respiratory infection
Nash   9    M    19000    bronchitis     bronchitis
Alice  12   F    22000    flu            flu
Betty  19   F    24000    pneumonia      pneumonia
Linda  21   F    33000    gastritis      gastritis
Jane   25   F    34000    gastritis      Ø
Sarah  28   F    37000    flu            Ø
Mary   56   F    58000    flu            flu

Personalized anonymity

• A table satisfies personalized anonymity with a parameter pbreach
  – iff no adversary can breach the privacy requirement of any tuple with a probability above pbreach
• If pbreach = 0.3, then any adversary should have no more than 30% probability to find out that:
  – Andy had a stomach disease
  – Bill had dyspepsia
  – etc.

Name   Age  Sex  Zipcode  Disease        guarding node
Andy   4    M    12000    gastric ulcer  stomach disease
Bill   5    M    14000    dyspepsia      dyspepsia
Ken    6    M    18000    pneumonia      respiratory infection
Nash   9    M    19000    bronchitis     bronchitis
Alice  12   F    22000    flu            flu
Betty  19   F    24000    pneumonia      pneumonia
Linda  21   F    33000    gastritis      gastritis
Jane   25   F    34000    gastritis      Ø
Sarah  28   F    37000    flu            Ø
Mary   56   F    58000    flu            flu

Personalized anonymity

• Personalized anonymity with respect to a predefined parameter pbreach
  – an adversary can breach the privacy requirement of any tuple with a probability at most pbreach

Age       Sex  Zipcode         Disease
[1, 10]   M    [10001, 20000]  gastric ulcer
[1, 10]   M    [10001, 20000]  dyspepsia
[1, 10]   M    [10001, 20000]  pneumonia
[1, 10]   M    [10001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
21        F    33000           stomach disease
25        F    34000           gastritis
28        F    37000           flu
56        F    58000           respiratory infection

• We need a method for calculating the breach probabilities
  – What is the probability that Andy had some stomach problem?

Combinatorial reconstruction

• Assumptions
  – the adversary has no prior knowledge about each individual
  – every individual involved in the microdata also appears in the external database

Combinatorial reconstruction

• Andy does not want anyone to know that he had some stomach problem

• What is the probability that the adversary can find out that “Andy had a stomach disease”?

Name   Age  Sex  Zipcode
Andy   4    M    12000
Bill   5    M    14000
Ken    6    M    18000
Nash   9    M    19000
Mike   7    M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000

Age       Sex  Zipcode         Disease
[1, 10]   M    [10001, 20000]  gastric ulcer
[1, 10]   M    [10001, 20000]  dyspepsia
[1, 10]   M    [10001, 20000]  pneumonia
[1, 10]   M    [10001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
21        F    33000           stomach disease
25        F    34000           gastritis
28        F    37000           flu
56        F    58000           respiratory infection

Combinatorial reconstruction (cont.)

• Can each individual appear more than once?
  – No = the primary case
  – Yes = the non-primary case
• Some possible reconstructions:

[Figure: two diagrams, one for the primary case and one for the non-primary case, mapping the individuals Andy, Bill, Ken, Nash, Mike to the tuples gastric ulcer, dyspepsia, pneumonia, bronchitis]

Breach probability (primary)

• 120 possible reconstructions in total
• Suppose Andy is associated with a stomach disease in nb reconstructions
  – the probability that the adversary associates Andy with some stomach problem is nb / 120
• Andy is associated with
  – gastric ulcer in 24 reconstructions
  – dyspepsia in 24 reconstructions
  – gastritis in 0 reconstructions
• nb = 48
• The breach probability for Andy’s tuple is 48 / 120 = 2 / 5

[Figure: the individuals Andy, Bill, Ken, Nash, Mike and the tuples gastric ulcer, dyspepsia, pneumonia, bronchitis]

any illness
    respiratory system problem
        respiratory infection
            flu
            pneumonia
            bronchitis
    digestive system problem
        stomach disease
            gastric ulcer
            dyspepsia
            gastritis
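
The counts on this slide can be reproduced by brute force. The sketch below enumerates every way of assigning the four tuples of the first QI group to distinct individuals among the five candidates (the primary case) and counts the assignments in which Andy owns a tuple under "stomach disease". Variable names are chosen for the example; this is an exhaustive check, not the closed-form result from the paper.

```python
from fractions import Fraction
from itertools import permutations

# Individuals whose QI values fall in the group [1, 10], M, [10001, 20000].
people = ["Andy", "Bill", "Ken", "Nash", "Mike"]
# Sensitive values of the four tuples in that group.
diseases = ["gastric ulcer", "dyspepsia", "pneumonia", "bronchitis"]
# Leaves under Andy's guarding node "stomach disease".
stomach = {"gastric ulcer", "dyspepsia", "gastritis"}

# Primary case: each tuple gets a different owner, so a reconstruction is
# an ordered choice of 4 distinct people out of 5.
reconstructions = list(permutations(people, len(diseases)))

n_b = sum(
    1
    for owners in reconstructions
    if any(o == "Andy" and d in stomach for o, d in zip(owners, diseases))
)
breach = Fraction(n_b, len(reconstructions))

print(len(reconstructions), n_b, breach)  # 120 48 2/5
```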

Breach probability (non-primary)

• 625 possible reconstructions in total
• Andy is associated with gastric ulcer or dyspepsia or gastritis in 225 reconstructions
• nb = 225
• The breach probability for Andy’s tuple is 225 / 625 = 9 / 25

any illness
    respiratory system problem
        respiratory infection
            flu
            pneumonia
            bronchitis
    digestive system problem
        stomach disease
            gastric ulcer
            dyspepsia
            gastritis

[Figure: the individuals Andy, Bill, Ken, Nash, Mike and the tuples gastric ulcer, dyspepsia, pneumonia, bronchitis]
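
The same brute-force check works for the non-primary case, where one individual may own several tuples, so each of the four tuples independently picks one of the five candidates. Again, the names below are assumptions for this sketch.

```python
from fractions import Fraction
from itertools import product

people = ["Andy", "Bill", "Ken", "Nash", "Mike"]
diseases = ["gastric ulcer", "dyspepsia", "pneumonia", "bronchitis"]
stomach = {"gastric ulcer", "dyspepsia", "gastritis"}

# Non-primary case: owners may repeat, giving 5 ** 4 reconstructions.
reconstructions = list(product(people, repeat=len(diseases)))

n_b = sum(
    1
    for owners in reconstructions
    if any(o == "Andy" and d in stomach for o, d in zip(owners, diseases))
)
breach = Fraction(n_b, len(reconstructions))

print(len(reconstructions), n_b, breach)  # 625 225 9/25
```

Equivalently, Andy avoids both stomach tuples in 4 ∙ 4 ∙ 5 ∙ 5 = 400 reconstructions, which leaves 625 – 400 = 225.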

Breach probability: Formal results

Name   Age  Sex  Zipcode
Andy   4    M    12000
Bill   5    M    14000
Ken    6    M    18000
Nash   9    M    19000
Mike   7    M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000

Age       Sex  Zipcode         Disease
[1, 10]   M    [10001, 20000]  gastric ulcer
[1, 10]   M    [10001, 20000]  dyspepsia
[1, 10]   M    [10001, 20000]  pneumonia
[1, 10]   M    [10001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
21        F    33000           stomach disease
25        F    34000           gastritis
28        F    37000           flu
56        F    58000           respiratory infection

More in our paper

• An algorithm for computing generalized tables that
  – satisfy personalized anonymity with a predefined pbreach
  – reduce information loss by employing generalization on both the QI attributes and the sensitive attribute

Experiment settings 1

• Goal: To show that k-anonymity and l-diversity do not always provide sufficient privacy protection

• Real dataset
  – Attributes: Age, Education, Gender, Marital-status, Occupation, Income
• Pri-leaf
• Nonpri-leaf
• Pri-mixed
• Nonpri-mixed
• Cardinality = 100k

Degree of privacy protection (Pri-leaf)

pbreach = 0.25 (k = 4, l = 4)

Degree of privacy protection (Nonpri-leaf)

pbreach = 0.25 (k = 4, l = 4)

Degree of privacy protection (Pri-mixed)

pbreach = 0.25 (k = 4, l = 4)

Degree of privacy protection (Nonpri-mixed)

pbreach = 0.25 (k = 4, l = 4)

Experiment settings 2

• Goal: To show that applying generalization on both the QI attributes and the sensitive attribute will lead to more effective data analysis

Accuracy of analysis (no personalization)

Accuracy of analysis (with personalization)

Conclusions

• k-anonymity and l-diversity are not sufficient for the Non-primary case

• Guarding nodes allow individuals to describe their privacy requirements better

• Generalization on the sensitive attribute is beneficial

Thank you!

Datasets and implementation are available for download at

http://www.cs.cityu.edu.hk/~taoyf