CS639: Data Management for Data Science
Lecture 26: Privacy
[slides from Vitaly Shmatikov]
Theodoros Rekatsinas

Post on 23-May-2020


slide 2

Reading

• Dwork. "Differential Privacy" (invited talk at ICALP 2006).

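Basic Setting

[Figure: a database DB = (x_1, x_2, x_3, …, x_n-1, x_n) sits behind a sanitizer San, which uses random coins. Users (government, researchers, marketers, …) send query 1, …, query T to San and receive answer 1, …, answer T.]

slide 3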

Examples of Sanitization Methods

• Input perturbation
  • Add random noise to database, release
• Summary statistics
  • Means, variances
  • Marginal totals
  • Regression coefficients
• Output perturbation
  • Summary statistics with noise
• Interactive versions of the above methods
  • Auditor decides which queries are OK, type of noise

slide 4
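The difference between input and output perturbation can be sketched in a few lines. This is only an illustration: the Gaussian noise, the `scale` parameter, and the height values are assumptions, since the slide does not name a noise distribution.

```python
import random
import statistics

def input_perturbation(db, scale, rng):
    """Add independent noise to every record, then release the noisy records."""
    return [x + rng.gauss(0.0, scale) for x in db]

def output_perturbation(db, scale, rng):
    """Compute the exact mean first, then release it with noise added."""
    return statistics.mean(db) + rng.gauss(0.0, scale)

rng = random.Random(0)
db = [170.0, 165.0, 180.0, 175.0]   # hypothetical heights in cm
noisy_db = input_perturbation(db, scale=5.0, rng=rng)      # release all rows, noised
noisy_mean = output_perturbation(db, scale=1.0, rng=rng)   # release one noised statistic
```

With input perturbation the analyst can compute any statistic on `noisy_db`; with output perturbation only the chosen statistic leaves the sanitizer.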

Strawman Definition

• Assume x_1, …, x_n are drawn i.i.d. from unknown distribution
• Candidate definition: sanitization is safe if it only reveals the distribution
• Implied approach:
  • Learn the distribution
  • Release description of distribution or re-sample points
• This definition is tautological!
  • Estimate of distribution depends on data… why is it safe?

slide 5

Frequency in DB or frequency in underlying population?

Blending into a Crowd

• Intuition: "I am safe in a group of k or more"
  • k varies (3… 6… 100… 10,000?)
• Many variations on theme
  • Adversary wants predicate g such that 0 < #{i | g(x_i) = true} < k
• Why?
  • Privacy is "protection from being brought to the attention of others" [Gavison]
  • Rare property helps re-identify someone
  • Implicit: information about a large group is public
    • E.g., liver problems more prevalent among diabetics

slide 6
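The adversary's goal, a predicate g with 0 < #{i | g(x_i) = true} < k, can be checked directly. The record layout and the example predicates below are hypothetical and only serve to illustrate the count.

```python
def isolates_small_group(db, g, k):
    """True if predicate g matches at least one but fewer than k records."""
    matches = sum(1 for x in db if g(x))
    return 0 < matches < k

# Hypothetical records: (age, zip code)
db = [(34, "53703"), (35, "53703"), (34, "53706"), (61, "53703"), (34, "53703")]

def rare(x):
    return x[0] > 60           # matches a single record

def common(x):
    return x[1] == "53703"     # matches four records

rare_is_risky = isolates_small_group(db, rare, k=3)
common_is_risky = isolates_small_group(db, common, k=3)
```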

Clustering-Based Definitions

• Given sanitization S, look at all databases consistent with S
• Safe if no predicate is true for all consistent databases
• k-anonymity
  • Partition D into bins
  • Safe if each bin is either empty, or contains at least k elements
• Cell bound methods
  • Release marginal sums

Original counts:

        brown   blue    Σ
blond   2       10      12
brown   12      6       18
Σ       14      16

Released marginal sums, with the implied bounds on each cell:

        brown   blue    Σ
blond   [0,12]  [0,12]  12
brown   [0,14]  [0,16]  18
Σ       14      16

slide 7
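Both checks on this slide are easy to sketch. The helper names are made up for illustration, and `cell_bounds` uses the simple bound [0, min(row sum, column sum)], which reproduces the intervals in the slide's table; tighter lower bounds can be derived from the grand total.

```python
from collections import Counter

def is_k_anonymous(records, bin_of, k):
    """Every non-empty bin must contain at least k records."""
    counts = Counter(bin_of(r) for r in records)
    return all(c >= k for c in counts.values())

def cell_bounds(row_sums, col_sums):
    """Interval [0, min(row, col)] for each cell, given only the marginal sums."""
    return [[(0, min(r, c)) for c in col_sums] for r in row_sums]

# Marginals from the slide's table: rows blond/brown, columns brown/blue.
bounds = cell_bounds([12, 18], [14, 16])

# Hypothetical records binned by their first field.
records = [("A", 1), ("A", 2), ("B", 3), ("B", 4), ("B", 5)]
two_anonymous = is_k_anonymous(records, bin_of=lambda r: r[0], k=2)
three_anonymous = is_k_anonymous(records, bin_of=lambda r: r[0], k=3)
```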

Issues with Clustering

• Purely syntactic definition of privacy
• What adversary does this apply to?
  • Does not consider adversaries with side information
  • Does not consider probability
  • Does not consider adversarial algorithm for making decisions (inference)

slide 8

"Bayesian" Adversaries

• Adversary outputs point z ∈ D
• Score = 1/f_z if f_z > 0, 0 otherwise
  • f_z is the number of matching points in D
• Sanitization is safe if E(score) ≤ ε
• Procedure:
  • Assume you know adversary's prior distribution over databases
  • Given a candidate output, update prior conditioned on output (via Bayes' rule)
  • If max_z E(score | output) < ε, then safe to release

slide 9
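The scoring rule and the release test can be sketched as follows. Two assumptions are made here: "matching" is taken to mean equality with z, and the posterior is handed in as an explicit list of (probability, database) pairs that has already been conditioned on the candidate output.

```python
def score(z, db):
    """1 / f_z if z matches f_z > 0 points of db, else 0."""
    f_z = sum(1 for x in db if x == z)
    return 1.0 / f_z if f_z > 0 else 0.0

def safe_to_release(posterior, candidates, eps):
    """posterior: (probability, database) pairs, conditioned on the output.
    Safe if the adversary's best expected score stays below eps."""
    worst = max(
        sum(p * score(z, db) for p, db in posterior)
        for z in candidates
    )
    return worst < eps

# Hypothetical posterior over two candidate databases.
posterior = [(0.5, ["a", "a", "b"]), (0.5, ["a", "b", "b"])]
ok_loose = safe_to_release(posterior, candidates=["a", "b"], eps=0.8)
ok_strict = safe_to_release(posterior, candidates=["a", "b"], eps=0.5)
```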

Issues with "Bayesian" Privacy

• Restricts the type of predicates adversary can choose
• Must know prior distribution
  • Can one scheme work for many distributions?
  • Sanitizer works harder than adversary
• Conditional probabilities don't consider previous iterations
  • Remember simulatable auditing?

slide 10

Classical Intuition for Privacy

• "If the release of statistics S makes it possible to determine the value [of private information] more accurately than is possible without access to S, a disclosure has taken place." [Dalenius 1977]
  • Privacy means that anything that can be learned about a respondent from the statistical database can be learned without access to the database
• Similar to semantic security of encryption
  • Anything about the plaintext that can be learned from a ciphertext can be learned without the ciphertext

slide 11

Problems with Classic Intuition

• Popular interpretation: prior and posterior views about an individual shouldn't change "too much"
  • What if my (incorrect) prior is that every UTCS graduate student has three arms?
• How much is "too much?"
  • Can't achieve cryptographically small levels of disclosure and keep the data useful
  • Adversarial user is supposed to learn unpredictable things about the database

slide 12

Impossibility Result

• Privacy: for some definition of "privacy breach," ∀ distribution on databases, ∀ adversaries A, ∃ A' such that Pr(A(San) = breach) − Pr(A'() = breach) ≤ ε
  • For reasonable "breach", if San(DB) contains information about DB, then some adversary breaks this definition
• Example
  • Paris knows that Theo is 2 inches taller than the average Greek
  • DB allows computing average height of a Greek
  • This DB breaks Theo's privacy according to this definition… even if his record is not in the database!

slide 13

[Dwork]
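The Theo example comes down to one addition. In the sketch below the heights are made up and Theo's record is deliberately absent from the database; the only thing the adversary adds is the side information "average + 2 inches".

```python
def infer_theo_height(released_average, known_offset=2.0):
    """Combine the released average with side information (Theo = average + offset)."""
    return released_average + known_offset

# Hypothetical heights in inches; Theo's own record is not included.
db_heights = [66.0, 68.0, 70.0, 68.0]
released_average = sum(db_heights) / len(db_heights)   # what the DB lets anyone compute
theo_height = infer_theo_height(released_average)      # what Paris now knows
```

An adversary A' without access to the database cannot compute this number, which is exactly the gap the definition forbids.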

(Very Informal) Proof Sketch

• Suppose DB is uniformly random
  • Mutual information I(DB; San(DB)) > 0
• "Breach" is predicting a predicate g(DB)
• Adversary knows r, H(r; San(DB)) ⊕ g(DB)
  • H is a suitable hash function, r = H(DB)
• By itself, does not leak anything about DB (why?)
• Together with San(DB), reveals g(DB) (why?)

slide 14

Differential Privacy (1)

[Figure: DB = (x_1, x_2, x_3, …, x_n-1, x_n) sits behind the sanitizer San, which uses random coins. Adversary A sends query 1, …, query T and receives answer 1, …, answer T.]

◆ Example with Greeks and Theo
  • Adversary learns Theo's height even if he is not in the database
◆ Intuition: "Whatever is learned would be learned regardless of whether or not Theo participates"
  • Dual: Whatever is already known, situation won't get worse

slide 15

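Differential Privacy (2)

[Figure: same setting, with one row replaced by 0: DB = (x_1, x_2, 0, …, x_n-1, x_n) sits behind San, which uses random coins. Adversary A sends query 1, …, query T and receives answer 1, …, answer T.]

◆ Define n+1 games
  • Game 0: Adv. interacts with San(DB)
  • Game i: Adv. interacts with San(DB_{-i}); DB_{-i} = (x_1, …, x_{i-1}, 0, x_{i+1}, …, x_n)
◆ Given S and prior p() on DB, define n+1 posterior distrib's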

slide 16
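Constructing the n+1 game inputs is mechanical. The sketch below assumes the database is a plain list and that row indices are 1-based as on the slide; the function names are illustrative.

```python
def zero_out_row(db, i):
    """Return DB_{-i}: a copy of db with row i (1-indexed) replaced by 0."""
    return db[: i - 1] + [0] + db[i:]

def build_games(db):
    """Game 0 uses db itself; game i uses DB_{-i} for i = 1..n."""
    return [list(db)] + [zero_out_row(db, i) for i in range(1, len(db) + 1)]

db = [7, 3, 9]            # hypothetical rows
games = build_games(db)   # n + 1 = 4 databases
```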

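Differential Privacy (3)

[Figure: DB = (x_1, x_2, 0, …, x_n-1, x_n) sits behind San, which uses random coins. Adversary A sends query 1, …, query T and receives answer 1, …, answer T.]

Definition: San is safe if ∀ prior distributions p(·) on DB, ∀ transcripts S, ∀ i = 1, …, n

$$\mathrm{StatDiff}\bigl(p_0(\cdot \mid S),\, p_i(\cdot \mid S)\bigr) \le \varepsilon$$

slide 17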
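For discrete distributions, StatDiff can be computed directly. The sketch assumes the usual definition of statistical difference as half the L1 distance, with each posterior represented as a dict from outcome to probability.

```python
def stat_diff(p, q):
    """Statistical difference (total variation distance) between two discrete
    distributions given as dicts mapping outcome -> probability."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

# Hypothetical posteriors over two candidate databases after seeing transcript S.
p0 = {"db_a": 0.5, "db_b": 0.5}
pi = {"db_a": 0.25, "db_b": 0.75}
distance = stat_diff(p0, pi)
```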

Indistinguishability

[Figure: two runs of San, each with its own random coins and queries 1, …, T. The first run, on DB = (x_1, x_2, x_3, …, x_n-1, x_n), produces transcript S. The second run, on DB' = (x_1, x_2, y_3, …, x_n-1, x_n), produces transcript S'. DB and DB' differ in 1 row.]

Distance between distributions is at most ε

slide 18

Which Distance to Use?

• Problem: ε must be large
  • Any two databases induce transcripts at distance ≤ nε
  • To get utility, need ε > 1/n
• Statistical difference 1/n is not meaningful!
• Example: release random point in database
  • San(x_1, …, x_n) = (j, x_j) for random j
• For every i, changing x_i induces statistical difference 1/n
• But some x_i is revealed with probability 1

slide 19
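The counterexample can be verified exactly. The sketch below assumes j is uniform and uses 0-based indices; the database values are hypothetical. It computes the output distribution of the "release a random point" sanitizer on two databases that differ in one row.

```python
from fractions import Fraction

def release_distribution(db):
    """Exact output distribution of San(x_1..x_n) = (j, x_j) for uniform j."""
    n = len(db)
    return {(j, x): Fraction(1, n) for j, x in enumerate(db)}

def stat_diff(p, q):
    """Half the L1 distance between two discrete distributions."""
    support = set(p) | set(q)
    return sum(abs(p.get(o, 0) - q.get(o, 0)) for o in support) / 2

db = [10, 20, 30, 40]
db_changed = [10, 20, 99, 40]   # differs from db in one row
d = stat_diff(release_distribution(db), release_distribution(db_changed))

# Every possible output still discloses one row exactly.
always_reveals = all(db[j] == x for (j, x) in release_distribution(db))
```

The distance is only 1/n, yet each run exposes some row in the clear, which is why small statistical difference alone is not a meaningful guarantee.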

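Formalizing Indistinguishability

[Figure: Adversary A is shown a transcript (query 1, answer 1, …) and must tell whether it is S, produced from DB, or S', produced from DB'.]

Definition: San is ε-indistinguishable if ∀ A, ∀ DB, DB' which differ in 1 row, ∀ sets of transcripts S

$$p(\mathrm{San}(DB) \in S) \in (1 \pm \varepsilon)\, p(\mathrm{San}(DB') \in S)$$

Equivalently, ∀ S:

$$\frac{p(\mathrm{San}(DB) = S)}{p(\mathrm{San}(DB') = S)} \in 1 \pm \varepsilon$$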

slide 20
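For a sanitizer with finitely many transcripts, the pointwise form of the definition can be checked directly. In this sketch the two distributions stand for San(DB) and San(DB') on one pair of neighboring databases; the probabilities are invented, and a full check would repeat this for every neighboring pair.

```python
def within_factor(p, q, eps):
    """True if p(t) lies in [(1 - eps) q(t), (1 + eps) q(t)] for every transcript t."""
    support = set(p) | set(q)
    return all(
        (1 - eps) * q.get(t, 0.0) <= p.get(t, 0.0) <= (1 + eps) * q.get(t, 0.0)
        for t in support
    )

def is_eps_indistinguishable(p, q, eps):
    """The definition ranges over all ordered pairs (DB, DB'), so check both directions."""
    return within_factor(p, q, eps) and within_factor(q, p, eps)

# Hypothetical transcript distributions for two databases differing in one row.
p = {"yes": 0.55, "no": 0.45}
q = {"yes": 0.45, "no": 0.55}
loose = is_eps_indistinguishable(p, q, eps=0.25)
tight = is_eps_indistinguishable(p, q, eps=0.1)
```

Any transcript that is possible under one database but impossible under the other fails the check for every ε < 1, since the ratio cannot be bounded.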

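Indistinguishability ⇒ Diff. Privacy

Definition: San is safe if ∀ prior distributions p(·) on DB, ∀ transcripts S, ∀ i = 1, …, n

$$\mathrm{StatDiff}\bigl(p_0(\cdot \mid S),\, p_i(\cdot \mid S)\bigr) \le \varepsilon$$

For every S and DB, indistinguishability implies

[equation image missing from transcript]

This implies

$$\mathrm{StatDiff}\bigl(p_0(\cdot \mid S),\, p_i(\cdot \mid S)\bigr) \le \varepsilon$$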

slide 21
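One way to connect indistinguishability to the StatDiff bound is sketched below. It is not recovered from the slide: it assumes the posteriors come from Bayes' rule applied to game 0 and game i, that StatDiff is half the L1 distance, and that ε < 1. The symbols q_0 and q_i for the transcript probabilities are introduced here only for illustration.

```latex
% Assumed notation: q_0(S \mid D) = \Pr[\mathrm{San}(D) = S], \quad q_i(S \mid D) = \Pr[\mathrm{San}(D_{-i}) = S].
\begin{align*}
p_j(DB \mid S) &= \frac{p(DB)\, q_j(S \mid DB)}{\sum_{D} p(D)\, q_j(S \mid D)},
  \qquad j \in \{0, i\} \\
q_0(S \mid D) &\in (1 \pm \varepsilon)\, q_i(S \mid D)
  \quad \text{for every } D \text{, since } D \text{ and } D_{-i} \text{ differ in at most one row} \\
\frac{p_0(DB \mid S)}{p_i(DB \mid S)} &\in
  \left[ \frac{1 - \varepsilon}{1 + \varepsilon},\; \frac{1 + \varepsilon}{1 - \varepsilon} \right] \\
\mathrm{StatDiff}\bigl(p_0(\cdot \mid S),\, p_i(\cdot \mid S)\bigr)
  &= \tfrac{1}{2} \sum_{DB} p_i(DB \mid S)
     \left| \frac{p_0(DB \mid S)}{p_i(DB \mid S)} - 1 \right|
  \le \frac{\varepsilon}{1 - \varepsilon}
\end{align*}
```

Under these assumptions the bound is ε up to a factor 1/(1 − ε), which agrees with the slide's "≤ ε" for small ε.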

Sensitivity with Laplace Noise

slide 22
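Only the title of this slide survives in the transcript. As a hedged illustration of what the title refers to, here is the standard Laplace mechanism: add Laplace noise with scale (sensitivity / ε) to the true answer. The function names, the counting query, and the data are assumptions, not content from the slide.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(true_answer, sensitivity, eps, rng):
    """Release true_answer plus Laplace noise of scale sensitivity / eps."""
    return true_answer + laplace_noise(sensitivity / eps, rng)

rng = random.Random(0)
db = [1, 0, 1, 1, 0, 1]   # hypothetical 0/1 attribute, one row per person
# A counting query changes by at most 1 when one row changes, so sensitivity = 1.
noisy_count = laplace_mechanism(sum(db), sensitivity=1.0, eps=0.5, rng=rng)
```

Smaller ε means a larger noise scale, so privacy is bought with accuracy.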

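Differential Privacy: Summary

• San gives ε-differential privacy if for all values of DB and Me and all transcripts t:

$$\frac{\Pr[\mathrm{San}(DB - Me) = t]}{\Pr[\mathrm{San}(DB + Me) = t]} \le e^{\varepsilon} \approx 1 \pm \varepsilon$$

[Figure: plot with vertical axis Pr[t]]

slide 23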
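The ratio bound can be checked numerically for a concrete mechanism. The sketch assumes a counting query answered with Laplace noise of scale 1/ε, with true counts 4 (without Me) and 5 (with Me); these numbers are invented. It evaluates the density ratio on a grid of possible outputs t.

```python
import math

def laplace_pdf(t, mean, scale):
    """Density of the Laplace distribution centered at mean."""
    return math.exp(-abs(t - mean) / scale) / (2.0 * scale)

eps = 0.5
scale = 1.0 / eps   # sensitivity 1 assumed (counting query)

# Ratio Pr[San(DB - Me) = t] / Pr[San(DB + Me) = t] on a grid of outputs t.
ratios = [
    laplace_pdf(t / 10, 4, scale) / laplace_pdf(t / 10, 5, scale)
    for t in range(-100, 200)
]
worst = max(ratios)
best = min(ratios)
```

On this grid the ratio stays between e^(−ε) and e^ε, and for small ε the factor e^ε is close to 1 + ε, which is the approximation written on the slide.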