![Page 1: Data Management for Data Science - GitHub Pages...such that Pr(A(San)=breach) –Pr(A’()=breach) ≤ e •For reasonable “breach”, if San(DB) contains information about DB, then](https://reader033.vdocument.in/reader033/viewer/2022042301/5ecbbaf828bb144c0c321dd3/html5/thumbnails/1.jpg)
CS639: Data Management for Data Science
Lecture 26: Privacy
[slides from Vitaly Shmatikov]
Theodoros Rekatsinas
1
slide 2
Reading
• Dwork. “Differential Privacy” (invited talk at ICALP 2006).
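Basic Setting
[Diagram: the database DB = (x1, x2, x3, …, xn-1, xn) sits behind a sanitizer San, which uses random coins. Users (government, researchers, marketers, …) send query 1, …, query T to San and receive answer 1, …, answer T.]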
slide 3
Examples of Sanitization Methods
• Input perturbation
  • Add random noise to database, release
• Summary statistics
  • Means, variances
  • Marginal totals
  • Regression coefficients
• Output perturbation
  • Summary statistics with noise
• Interactive versions of the above methods
  • Auditor decides which queries are OK, type of noise
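The difference between input and output perturbation can be sketched as follows. This is only an illustration: the Gaussian noise, the `noise_std` parameter, and both function names are assumptions, not something the slides prescribe.

```python
import random
import statistics

def input_perturbation(db, noise_std, rng):
    # Noise is added to every record; the noisy database itself is released.
    return [x + rng.gauss(0, noise_std) for x in db]

def output_perturbation(db, noise_std, rng):
    # The summary statistic is computed on the true data; only the
    # noisy statistic is released.
    return statistics.mean(db) + rng.gauss(0, noise_std)

rng = random.Random(0)
salaries = [52, 61, 48, 75, 66]  # hypothetical records
noisy_db = input_perturbation(salaries, noise_std=5.0, rng=rng)
noisy_mean = output_perturbation(salaries, noise_std=5.0, rng=rng)
```

In the first case an analyst can run any computation on `noisy_db`; in the second, each released statistic needs its own noise.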
slide 4
Strawman Definition
• Assume x1, …, xn are drawn i.i.d. from unknown distribution
• Candidate definition: sanitization is safe if it only reveals the distribution
• Implied approach:
  • Learn the distribution
  • Release description of distribution or re-sample points
• This definition is tautological!
  • Estimate of distribution depends on data… why is it safe?
slide 5
Blending into a Crowd
• Intuition: “I am safe in a group of k or more”
  • k varies (3… 6… 100… 10,000?)
• Many variations on theme
  • Adversary wants predicate g such that 0 < #{i | g(xi) = true} < k
• Why?
  • Privacy is “protection from being brought to the attention of others” [Gavison]
  • Rare property helps re-identify someone
  • Implicit: information about a large group is public
    • E.g., liver problems more prevalent among diabetics
Frequency in DB or frequency in underlying population?
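The adversary's goal, a predicate g with 0 < #{i | g(xi) = true} < k, can be checked mechanically. A minimal sketch, where the function name and the toy age data are hypothetical:

```python
def isolates_small_group(db, g, k):
    # True when g matches at least one record but fewer than k of them,
    # i.e. 0 < #{i | g(x_i) = true} < k.
    matches = sum(1 for x in db if g(x))
    return 0 < matches < k

ages = [23, 31, 31, 45, 45, 45, 97]  # hypothetical database
rare = isolates_small_group(ages, lambda a: a > 90, k=3)
common = isolates_small_group(ages, lambda a: a == 45, k=3)
```

Here `rare` is true because a single record matches, while `common` is false because the matching group already has k members.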
slide 6
Clustering-Based Definitions
• Given sanitization S, look at all databases consistent with S
• Safe if no predicate is true for all consistent databases
• k-anonymity
  • Partition D into bins
  • Safe if each bin is either empty, or contains at least k elements
• Cell bound methods
  • Release marginal sums
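The k-anonymity condition (every bin is empty or holds at least k elements) can be sketched as a simple check. The `bin_of` function that assigns a record to its bin, and the sample records, are assumptions made for illustration.

```python
from collections import Counter

def is_k_anonymous(records, bin_of, k):
    # Count records per bin; empty bins never appear in the Counter,
    # so only non-empty bins are tested against k.
    bins = Counter(bin_of(r) for r in records)
    return all(size >= k for size in bins.values())

# Hypothetical records: (hair color, eye color)
records = [("blond", "blue")] * 3 + [("brown", "brown")] * 2
ok_for_2 = is_k_anonymous(records, bin_of=lambda r: r, k=2)
ok_for_3 = is_k_anonymous(records, bin_of=lambda r: r, k=3)
```

With these records the partition is 2-anonymous but not 3-anonymous, because the ("brown", "brown") bin holds only two elements.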
slide 7
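Cell bounds implied by the released marginal sums:

|       | brown  | blue   | Σ  |
|-------|--------|--------|----|
| blond | [0,12] | [0,12] | 12 |
| brown | [0,14] | [0,16] | 18 |
| Σ     | 14     | 16     |    |

Underlying table:

|       | brown | blue | Σ  |
|-------|-------|------|----|
| blond | 2     | 10   | 12 |
| brown | 12    | 6    | 18 |
| Σ     | 14    | 16   |    |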
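The cell bounds in the first table follow from the marginal sums alone: a cell can be no larger than its row total or its column total. The sketch below reproduces those bounds. It also computes a tighter lower bound, max(0, row + column - total), which is a standard refinement added here for comparison and does not appear on the slide.

```python
def cell_bounds(row_sums, col_sums):
    total = sum(row_sums)
    if total != sum(col_sums):
        raise ValueError("row and column marginals must have the same total")
    # Bounds as shown on the slide: [0, min(row sum, column sum)].
    simple = [[(0, min(r, c)) for c in col_sums] for r in row_sums]
    # Tighter lower bound (an addition, not on the slide).
    tight = [[(max(0, r + c - total), min(r, c)) for c in col_sums]
             for r in row_sums]
    return simple, tight

simple, tight = cell_bounds(row_sums=[12, 18], col_sums=[14, 16])
```

Every entry of the underlying table (2, 10, 12, 6) falls inside both sets of bounds, so an adversary who sees only the marginals can narrow each cell to a range but not pin it down.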
Issues with Clustering
• Purely syntactic definition of privacy
• What adversary does this apply to?
  • Does not consider adversaries with side information
  • Does not consider probability
  • Does not consider adversarial algorithm for making decisions (inference)
slide 8
“Bayesian” Adversaries
• Adversary outputs point z ∈ D
• Score = 1/f_z if f_z > 0, 0 otherwise
  • f_z is the number of matching points in D
• Sanitization is safe if E(score) ≤ ε
• Procedure:
  • Assume you know adversary’s prior distribution over databases
  • Given a candidate output, update prior conditioned on output (via Bayes’ rule)
  • If max_z E(score | output) < ε, then safe to release
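The procedure can be sketched on a toy example. Everything concrete here is an assumption: databases are pairs of bits, the prior is uniform, “matching” is read as equality, and the sanitizer releases the exact sum of the bits.

```python
from fractions import Fraction
from itertools import product

def score(z, db):
    # Score = 1/f_z if f_z > 0, else 0, where f_z counts points of db matching z.
    f_z = sum(1 for x in db if x == z)
    return Fraction(1, f_z) if f_z > 0 else Fraction(0)

def posterior(prior, likelihood, output):
    # Bayes' rule: posterior(db) is proportional to prior(db) * Pr[output | db].
    weights = {db: p * likelihood(output, db) for db, p in prior.items()}
    total = sum(weights.values())
    return {db: w / total for db, w in weights.items() if w > 0}

def safe_to_release(prior, likelihood, output, candidates, eps):
    post = posterior(prior, likelihood, output)
    worst = max(sum(p * score(z, db) for db, p in post.items())
                for z in candidates)
    return worst < eps

prior = {db: Fraction(1, 4) for db in product((0, 1), repeat=2)}
exact_sum = lambda output, db: 1 if sum(db) == output else 0
eps = Fraction(3, 4)
release_sum_1 = safe_to_release(prior, exact_sum, 1, candidates=(0, 1), eps=eps)
release_sum_2 = safe_to_release(prior, exact_sum, 2, candidates=(0, 1), eps=eps)
```

Releasing a sum of 1 is rejected: every consistent database contains exactly one 0, so the guess z = 0 has expected score 1. Releasing a sum of 2 is accepted, since the best guess z = 1 matches two points and scores only 1/2.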
slide 9
Issues with “Bayesian” Privacy
• Restricts the type of predicates adversary can choose
• Must know prior distribution
  • Can one scheme work for many distributions?
  • Sanitizer works harder than adversary
• Conditional probabilities don’t consider previous iterations
  • Remember simulatable auditing?
slide 10
Classical Intuition for Privacy
• “If the release of statistics S makes it possible to determine the value [of private information] more accurately than is possible without access to S, a disclosure has taken place.” [Dalenius 1977]
  • Privacy means that anything that can be learned about a respondent from the statistical database can be learned without access to the database
• Similar to semantic security of encryption
  • Anything about the plaintext that can be learned from a ciphertext can be learned without the ciphertext
slide 11
Problems with Classic Intuition
• Popular interpretation: prior and posterior views about an individual shouldn’t change “too much”
  • What if my (incorrect) prior is that every UTCS graduate student has three arms?
• How much is “too much?”
  • Can’t achieve cryptographically small levels of disclosure and keep the data useful
  • Adversarial user is supposed to learn unpredictable things about the database
slide 12
Impossibility Result
• Privacy: for some definition of “privacy breach,” ∀ distribution on databases, ∀ adversaries A, ∃ A’ such that Pr(A(San) = breach) - Pr(A’() = breach) ≤ ε
  • For reasonable “breach”, if San(DB) contains information about DB, then some adversary breaks this definition
• Example
  • Paris knows that Theo is 2 inches taller than the average Greek
  • DB allows computing average height of a Greek
  • This DB breaks Theo’s privacy according to this definition… even if his record is not in the database!
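The example can be made concrete in a few lines. The heights below are hypothetical, and the point is that Theo's record is absent from the database.

```python
def average(heights):
    return sum(heights) / len(heights)

def infer_theo_height(released_average, offset_inches=2):
    # Side information: "Theo is 2 inches taller than the average Greek".
    return released_average + offset_inches

db_without_theo = [66.0, 68.0, 70.0]  # hypothetical heights in inches
guess = infer_theo_height(average(db_without_theo))
```

The adversary combines a released statistic with side information, so the inference succeeds whether or not Theo participates.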
slide 13
[Dwork]
(Very Informal) Proof Sketch
• Suppose DB is uniformly random
  • Mutual information I(DB; San(DB)) > 0
• “Breach” is predicting a predicate g(DB)
• Adversary knows r, H(r; San(DB)) ⊕ g(DB)
  • H is a suitable hash function, r = H(DB)
  • By itself, does not leak anything about DB (why?)
  • Together with San(DB), reveals g(DB) (why?)
slide 14
Differential Privacy (1)
[Diagram: Adversary A exchanges query 1, …, query T and answer 1, …, answer T with San, which holds DB = (x1, x2, x3, …, xn-1, xn) and uses random coins.]
◆ Example with Greeks and Theo
  • Adversary learns Theo’s height even if he is not in the database
◆ Intuition: “Whatever is learned would be learned regardless of whether or not Theo participates”
  • Dual: Whatever is already known, situation won’t get worse
slide 15
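Differential Privacy (2)
[Diagram: same setting, but one row of the database is replaced by 0: DB = (x1, x2, 0, …, xn-1, xn). Adversary A exchanges query 1, …, query T and answer 1, …, answer T with San, which uses random coins.]
◆ Define n+1 games
  • Game 0: Adv. interacts with San(DB)
  • Game i: Adv. interacts with San(DB-i); DB-i = (x1, …, xi-1, 0, xi+1, …, xn)
◆ Given S and prior p() on DB, define n+1 posterior distrib’s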
slide 16
Differential Privacy (3)
[Diagram: Adversary A exchanges query 1, …, query T and answer 1, …, answer T with San, which holds DB = (x1, x2, 0, …, xn-1, xn) and uses random coins.]
Definition: San is safe if ∀ prior distributions p(·) on DB, ∀ transcripts S, ∀ i = 1, …, n:
StatDiff(p0(·|S), pi(·|S)) ≤ ε
slide 17
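The n+1 games and the resulting posteriors can be worked through on a toy case. Assumptions made only for this sketch: databases are bit tuples with a uniform prior, San releases the sum of the bits plus noise drawn from the hypothetical `NOISE` distribution, and StatDiff is half the L1 distance between the two posteriors.

```python
from fractions import Fraction
from itertools import product

NOISE = {-1: Fraction(1, 4), 0: Fraction(1, 2), 1: Fraction(1, 4)}  # hypothetical

def likelihood(transcript, db):
    # Pr[San(db) = transcript] for the toy sanitizer San(db) = sum(db) + noise.
    return NOISE.get(transcript - sum(db), Fraction(0))

def db_minus(db, i):
    # DB_{-i}: row i (1-indexed, as on the slide) replaced by 0.
    return db[:i - 1] + (0,) + db[i:]

def posterior(prior, transcript, i=0):
    # Game 0 (i = 0) uses San(DB); game i uses San(DB_{-i}).
    weights = {db: p * likelihood(transcript, db if i == 0 else db_minus(db, i))
               for db, p in prior.items()}
    total = sum(weights.values())
    return {db: w / total for db, w in weights.items()}

def stat_diff(p, q):
    return sum(abs(p[x] - q[x]) for x in p) / 2

n = 2
prior = {db: Fraction(1, 2 ** n) for db in product((0, 1), repeat=n)}
p0 = posterior(prior, transcript=1)
p1 = posterior(prior, transcript=1, i=1)
gap = stat_diff(p0, p1)
```

For this prior and transcript the two posteriors differ by a statistical difference of 1/6, so this toy San would count as safe only for ε of at least 1/6 (and the definition quantifies over all priors, transcripts, and i, not just this one).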
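Indistinguishability
[Diagram: two runs of San with random coins. In the first, DB = (x1, x2, x3, …, xn-1, xn) and the exchange query 1, answer 1, …, query T, answer T produces transcript S. In the second, DB’ = (x1, x2, y3, …, xn-1, xn) and the same exchange produces transcript S’.]
DB and DB’ differ in 1 row
Distance between distributions is at most ε
slide 18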
Which Distance to Use?
• Problem: ε must be large
  • Any two databases induce transcripts at distance ≤ nε
  • To get utility, need ε > 1/n
• Statistical difference 1/n is not meaningful!
• Example: release random point in database
  • San(x1, …, xn) = (j, xj) for random j
• For every i, changing xi induces statistical difference 1/n
• But some xi is revealed with probability 1
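The random-point example can be verified exactly. In this sketch the database values are hypothetical and statistical difference is taken to be half the L1 distance between output distributions.

```python
from fractions import Fraction

def output_distribution(db):
    # San(x_1, ..., x_n) = (j, x_j) for a uniformly random index j.
    n = len(db)
    return {(j, x): Fraction(1, n) for j, x in enumerate(db)}

def statistical_difference(p, q):
    support = set(p) | set(q)
    return sum(abs(p.get(o, 0) - q.get(o, 0)) for o in support) / 2

db = [170, 165, 180, 175]          # hypothetical rows
db_changed = [170, 165, 999, 175]  # differs from db in one row
dist = statistical_difference(output_distribution(db),
                              output_distribution(db_changed))
```

With n = 4 the statistical difference is exactly 1/4 = 1/n, yet every possible output (j, xj) discloses one row of the database verbatim.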
slide 19
Formalizing Indistinguishability
[Diagram: Adversary A sees a transcript (query 1, answer 1, …) and cannot tell whether it is S or S’.]
Definition: San is ε-indistinguishable if ∀ A, ∀ DB, DB’ which differ in 1 row, ∀ sets of transcripts S:
$$p(San(DB) \in S) \in (1 \pm \varepsilon)\, p(San(DB') \in S)$$
Equivalently, ∀ S:
$$\frac{p(San(DB) = S)}{p(San(DB') = S)} \in 1 \pm \varepsilon$$
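For sanitizers with finitely many possible transcripts, the ratio condition can be checked directly. The sketch below is an illustration under assumptions: the function name is hypothetical, and the "coin flip" sanitizer (report one bit truthfully with probability 3/5) is an invented example, not one from the slides.

```python
from fractions import Fraction

def is_eps_indistinguishable(p, q, eps):
    # p and q map each transcript to its probability under DB and DB'.
    # Checks p(s) / q(s) in [1 - eps, 1 + eps] for every transcript s.
    for s in set(p) | set(q):
        ps, qs = p.get(s, 0), q.get(s, 0)
        if ps == 0 and qs == 0:
            continue
        if ps == 0 or qs == 0:
            return False  # one database can produce s, the other cannot
        if not (1 - eps <= ps / qs <= 1 + eps):
            return False
    return True

half = Fraction(1, 2)
# Release a random point: the changed row shows up in only one distribution.
point_db = {(0, "a"): half, (1, "b"): half}
point_db2 = {(0, "a"): half, (1, "c"): half}
# Hypothetical coin-flip sanitizer on a single bit (true value 0 vs. 1).
flip_db = {0: Fraction(3, 5), 1: Fraction(2, 5)}
flip_db2 = {0: Fraction(2, 5), 1: Fraction(3, 5)}

point_ok = is_eps_indistinguishable(point_db, point_db2, eps=half)
flip_ok = is_eps_indistinguishable(flip_db, flip_db2, eps=half)
```

The random-point release fails for every ε, even though its statistical difference is small, while the coin-flip sanitizer passes with ε = 1/2. Because the definition quantifies over all pairs DB, DB’, a full check would call the function in both orders.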
slide 20
Indistinguishability ⇒ Diff. Privacy
Definition: San is safe if ∀ prior distributions p(·) on DB, ∀ transcripts S, ∀ i = 1, …, n:
StatDiff(p0(·|S), pi(·|S)) ≤ ε
For every S and DB, indistinguishability implies
This implies StatDiff(p0(·|S), pi(·|S)) ≤ ε
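One way to spell out the step from indistinguishability to the posterior bound is sketched below. It is a reconstruction under stated assumptions, not necessarily the slide's own formula: q_0 and q_i are shorthand introduced here for the transcript probabilities in game 0 and game i, StatDiff is taken as half the L1 distance, ε < 1, and all probabilities involved are nonzero.

```latex
% Assumed notation: q_0(S | DB) = Pr[San(DB) = S], q_i(S | DB) = Pr[San(DB_{-i}) = S],
% StatDiff(p, q) = (1/2) \sum_x |p(x) - q(x)|.
\begin{align*}
p_0(DB \mid S) &= \frac{p(DB)\, q_0(S \mid DB)}{\sum_{DB'} p(DB')\, q_0(S \mid DB')}, \qquad
p_i(DB \mid S) = \frac{p(DB)\, q_i(S \mid DB)}{\sum_{DB'} p(DB')\, q_i(S \mid DB')} \\
\frac{q_0(S \mid DB)}{q_i(S \mid DB)} &\in 1 \pm \varepsilon
  \quad \text{(since $DB$ and $DB_{-i}$ differ in at most one row)} \\
\frac{p_0(DB \mid S)}{p_i(DB \mid S)}
  &= \frac{q_0(S \mid DB)}{q_i(S \mid DB)} \cdot
     \frac{\sum_{DB'} p(DB')\, q_i(S \mid DB')}{\sum_{DB'} p(DB')\, q_0(S \mid DB')}
  \in \left[ \frac{1-\varepsilon}{1+\varepsilon},\ \frac{1+\varepsilon}{1-\varepsilon} \right] \\
\mathrm{StatDiff}\bigl(p_0(\cdot \mid S),\, p_i(\cdot \mid S)\bigr)
  &= \frac{1}{2} \sum_{DB} p_i(DB \mid S) \left| \frac{p_0(DB \mid S)}{p_i(DB \mid S)} - 1 \right|
  \le \frac{\varepsilon}{1-\varepsilon}
\end{align*}
```

Under these assumptions the bound is ε/(1-ε), which agrees with the slide's ≤ ε up to a constant factor when ε is small.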
slide 21
Sensitivity with Laplace Noise
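A minimal sketch of the standard Laplace mechanism, assuming a counting query: adding or removing one row changes a count by at most 1 (sensitivity 1), and the answer is released with Laplace noise of scale sensitivity/ε. The function names and the age data are hypothetical. Note that this mechanism bounds the probability ratio by e^ε rather than by 1 ± ε.

```python
import math
import random

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noisy_count(db, predicate, epsilon, rng):
    sensitivity = 1  # one row changes a count by at most 1
    true_count = sum(1 for x in db if predicate(x))
    return true_count + laplace_noise(sensitivity / epsilon, rng)

def laplace_density(y, center, scale):
    return math.exp(-abs(y - center) / scale) / (2 * scale)

rng = random.Random(0)
ages = [34, 51, 29, 62, 45]  # hypothetical database
epsilon = 0.5
answer = noisy_count(ages, lambda a: a > 40, epsilon, rng)

# Output densities for neighboring databases with true counts 3 and 2.
ratios = [laplace_density(y, 3, 1 / epsilon) / laplace_density(y, 2, 1 / epsilon)
          for y in (-3.0, 0.0, 2.5, 3.0, 10.0)]
```

Each entry of `ratios` lies between e^(-ε) and e^ε, which is the indistinguishability guarantee for databases differing in one row.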
slide 22
Differential Privacy: Summary
• San gives ε-differential privacy if for all values of DB and Me and all transcripts t:
$$\frac{\Pr[San(DB - Me) = t]}{\Pr[San(DB + Me) = t]} \le e^{\varepsilon} \approx 1 \pm \varepsilon$$
[Figure: plot of Pr[t]]
slide 23