page 1. March 3, 2005. 10th Estonian Winter School in Computer Science
Privacy Preserving Data Mining
Lecture 3
Non-Cryptographic Approaches for Preserving Privacy
(Based on Slides of Kobbi Nissim)
Benny Pinkas, HP Labs, Israel
page 2
Why not use cryptographic methods?
• Many users contribute data. Cannot require them to participate in a cryptographic protocol.
– In particular, cannot require p2p communication between users.
• Cryptographic protocols incur considerable overhead.
page 3
Data Privacy
[Diagram: users reach the Data through an access mechanism; the danger is that users breach privacy]
page 4
Easy Tempting Solution: A Bad Solution
Idea: a. Remove identifying information (name, SSN, …). b. Publish the data.
• But 'harmless' attributes uniquely identify many patients (gender, age, approximate weight, ethnicity, marital status, …)
• Recall: DOB + gender + zip code identify people whp.
• Worse: 'rare' attributes (e.g., a disease with probability 1/3000)
[Table: published rows for Mr. Brown, Ms. John, Mr. Doe]
page 5
What is Privacy?
• Something should not be computable from query answers
– E.g. Joe's private data
– The definition should take into account the adversary's power (computational, # of queries, prior knowledge, …)
• Quite often it is much easier to say what is surely non-private
– E.g. strong breaking of privacy: the adversary is able to retrieve (almost) everybody's private data
Intuition: privacy is breached if it is possible to compute someone's private information from his identity
page 6
The Data Privacy Game: an Information-Privacy Tradeoff
• Private functions:
– want to hide πx(DB) = dx
• Information functions:
– want to reveal f(q, DB) for queries q
• Here: explicit definition of private functions.
– The question: which information functions may be allowed?
• Different from crypto (secure function evaluation):
– There, want to reveal f() (explicit definition of the information function)
– want to hide all functions π() not computable from f()
– Implicit definition of private functions
– The question whether f() should be revealed is not asked
page 7
A simplistic model: Statistical Database (SDB)
d ∈ {0,1}^n (one bit per person)    query: q ⊆ [n]
answer: aq = Σi∈q di
[Table: Mr. Fox 0/1, Ms. John 0/1, Mr. Doe 0/1, …]
page 8
Approaches to SDB Privacy
• Studied extensively since the 70s
• Perturbation
– Add randomness. Give 'noisy' or 'approximate' answers
– Techniques:
• Data perturbation (perturb data and then answer queries as usual) [Reiss 84, Liew Choi Liew 85, Traub Yemini Wozniakowski 84] …
• Output perturbation (perturb answers to queries) [Denning 80, Beck 80, Achugbue Chin 79, Fellegi Phillips 74] …
– Recent interest: [Agrawal, Srikant 00], [Agrawal, Aggarwal 01], …
• Query Restriction
– Answer queries accurately but sometimes disallow queries
– Require queries to obey some structure [Dobkin Jones Lipton 79]
• Restricts the number of queries
– Auditing [Chin Ozsoyoglu 82, Kleinberg Papadimitriou Raghavan 01]
page 9
Some Recent Privacy Definitions
X – data, Y – (noisy) observation of X
[Agrawal, Srikant '00]: Interval of confidence
– Let Y = X + noise (e.g., uniform noise in [-100,100]).
– Perturb the input data. One can still estimate the underlying distribution.
– Tradeoff: more noise ⇒ less accuracy but more privacy.
– Intuition: a large possible interval ⇒ privacy preserved.
• Given Y, we know that with c% confidence X is in [a1,a2]. For example, for Y=200, with 50% confidence X is in [150,250].
• a2 - a1 defines the amount of privacy at c% confidence.
– Problem: there might be some a priori information about X
• X = someone's age & Y = -97
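The interval-of-confidence idea can be sketched in a few lines of Python. The noise width and the helper names here are illustrative, not from the original [AS] scheme:

```python
import random

def as_perturb(x, w=100, rng=None):
    """[AS]-style randomization: release Y = X + uniform noise in [-w, w]."""
    rng = rng or random.Random(0)
    return x + rng.uniform(-w, w)

def confidence_interval(y, w=100):
    """With 100% confidence, X lies in [Y - w, Y + w]; the interval width
    2w is the claimed amount of privacy."""
    return (y - w, y + w)

y = as_perturb(42)
lo, hi = confidence_interval(y)
# the true X = 42 is guaranteed to lie in [lo, hi], an interval of width 200
```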
page 10
The [AS] scheme can be turned against itself
• Assume that N is large
– Even if the data miner doesn't have a priori information about X, it can estimate it given the randomized data Y.
• The perturbation is uniform in [-1,1]
• [AS]: privacy interval of size 2 with confidence 100%
• Let X be distributed 50% in [0,1] and 50% in [4,5]; the data miner learns this distribution fX from the randomized data.
• But, after learning fX, the value of X can easily be localized within an interval of size at most 1.
– Problem: aggregate information provides information that can be used to attack individual data
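A sketch of this attack in Python, using the bimodal support from the slide (the function names are mine):

```python
import random

def localize(y, w=1.0):
    """Given Y = X + U[-w, w] and the learned distribution f_X
    (X uniform on [0,1] with prob 1/2, on [4,5] with prob 1/2),
    intersect the naive interval [y-w, y+w] with the support of X."""
    pieces = []
    for a, b in [(0.0, 1.0), (4.0, 5.0)]:
        lo, hi = max(y - w, a), min(y + w, b)
        if hi > lo:
            pieces.append((lo, hi))
    return pieces

rng = random.Random(0)
x = rng.random() + rng.choice([0.0, 4.0])   # sample X from f_X
pieces = localize(x + rng.uniform(-1, 1))
# the surviving pieces have total width at most 1, far below the
# claimed privacy interval of size 2
```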
page 11
Some Recent Privacy Definitions
X – data, Y – (noisy) observation of X
• [Agrawal, Aggarwal '01]: Mutual information
– Intuition:
• High conditional entropy is good. I(X;Y) = H(X) - H(X|Y) (mutual information)
• small I(X;Y) ⇒ privacy preserved (Y provides little information about X).
• Problem [EGS]:
– An average notion. Privacy loss can happen with low but significant probability, without noticeably affecting I(X;Y).
– Sometimes I(X;Y) seems good but privacy is breached
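A toy illustration of the averaging problem (the reveal-with-probability-1/1000 channel is my own example, not from [EGS]): a mechanism that leaks everything with small probability has tiny mutual information, yet occasionally breaches privacy completely.

```python
def mutual_information_bits(p_reveal, h_x_bits):
    """I(X;Y) for the channel: Y = X with prob p_reveal, else a fixed dummy
    symbol. Then H(X|Y) = (1 - p_reveal) * H(X), so I(X;Y) = p_reveal * H(X)."""
    return p_reveal * h_x_bits

# a 32-bit secret revealed outright once per 1000 observations:
i_xy = mutual_information_bits(0.001, 32)
# i_xy = 0.032 bits: "private" on average, yet a total breach 0.1% of the time
```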
page 12
Output Perturbation (Randomization Approach)
• Exact answer to query q:
– aq = Σi∈q di
• Actual SDB answer: âq
• Perturbation E:
– For all q: |âq - aq| ≤ E
• Questions:
– Does perturbation give any privacy?
– How much perturbation is needed for privacy?
– Usability?
page 13
Privacy Preserved by Perturbation ≈ √n
Database: d ∈R {0,1}^n (uniform input distribution!)
Algorithm: on query q,
1. Let aq = Σi∈q di
2. If |aq - |q|/2| < E, return âq = |q|/2
3. Otherwise return âq = aq
E ≈ √n·(lg n)² ⇒ privacy is preserved
– Assume poly(n) queries
– If E ≈ √n·(lg n)², whp rule 2 is always used
• No information about d is given!
• (but the database is completely useless…)
• Shows that perturbation ≈ √n is sometimes enough for privacy. Can we do better?
[Diagram: âq is clamped to |q|/2 whenever aq lies within E of |q|/2]
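The slide's algorithm as runnable Python. The choice E = √n·(lg n)² follows the slide; the random database generation is illustrative:

```python
import math
import random

def private_but_useless_db(n, seed=0):
    """On query q (a collection of indices), return |q|/2 whenever the true
    sum is within E of it. With E ~ sqrt(n)*(lg n)^2 this happens whp for
    every one of poly(n) queries, so no information about d leaks."""
    rng = random.Random(seed)
    d = [rng.randint(0, 1) for _ in range(n)]
    E = math.sqrt(n) * math.log2(n) ** 2
    def answer(q):
        a = sum(d[i] for i in q)
        if abs(a - len(q) / 2) < E:
            return len(q) / 2      # rule 2: the clamped, private answer
        return a                    # rule 3: exact answer (whp never reached)
    return d, answer

d, answer = private_but_useless_db(10_000)
# on a 10,000-bit random database, every query is answered with |q|/2
```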
page 14
Perturbation << √n Implies Strong Breaking of Privacy
• The previous useless database achieves the best possible perturbation.
• Theorem [Dinur-Nissim]: Given any DB and any DB response algorithm with perturbation E = o(√n), there is a poly-time reconstruction algorithm that outputs a database d', s.t. dist(d,d') = o(n).
page 15
The Adversary as a Decoding Algorithm
[Diagram: d → encode → partial sums aq1, aq2, aq3, …, aqt → perturb → perturbed sums âq1, âq2, âq3, …, âqt → decode → d']
page 16
Proof of Theorem [DN03]: The Adversary's Reconstruction Algorithm
• Query phase: Get âqj for t random subsets q1, …, qt
• Weeding phase: Solve the linear program (over ℝ):
0 ≤ xi ≤ 1
|Σi∈qj xi - âqj| ≤ E
• Rounding: Let ci = round(xi); output c
Observation: A solution always exists, e.g. x = d.
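At toy scale the weeding phase can be made concrete by brute force over {0,1}^n: keep exactly the candidates consistent with every perturbed answer. (The LP relaxation is what makes the real attack polynomial; the brute-force stand-in below only illustrates which vectors get weeded out.)

```python
import itertools
import random

def weed(n, answers, E):
    """Return every x in {0,1}^n with |sum_{i in q} x_i - a_hat| <= E for
    each perturbed answer (q, a_hat): the feasible set of the slide's LP,
    restricted to integer points."""
    keep = []
    for x in itertools.product([0, 1], repeat=n):
        if all(abs(sum(x[i] for i in q) - a) <= E for q, a in answers):
            keep.append(x)
    return keep

rng = random.Random(1)
n, E = 8, 1
d = tuple(rng.randint(0, 1) for _ in range(n))
queries = [[i for i in range(n) if rng.random() < 0.5] for _ in range(40)]
answers = [(q, sum(d[i] for i in q) + rng.choice([-1, 0, 1])) for q in queries]
survivors = weed(n, answers, E)
# d itself always survives (the noise never exceeds E); almost every other
# candidate violates some query and is disqualified
```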
page 17
Why does the Reconstruction Algorithm Work?
• Consider x ∈ {0,1}^n s.t. dist(x,d) = c·n = Ω(n)
• Observation:
– A random q contains c'·n coordinates in which x ≠ d
– The difference in the sums over these coordinates is, with constant probability, at least Ω(√n) (> E = o(√n)).
– Such a q disqualifies x as a solution for the LP
• Since the total number of queries q is polynomial, all such vectors x are disqualified with overwhelming probability.
page 18
Summary of Results (statistical database)
• [Dinur, Nissim 03]:
– Unlimited adversary:
• Perturbation of magnitude Ω(n) required
– Polynomial-time adversary:
• Perturbation of magnitude Ω(√n) required (shown above)
– In both cases, the adversary may reconstruct a good approximation of the database
• Disallows even very weak notions of privacy
– Bounded adversary, restricted to T << n queries (SuLQ):
• There is a privacy-preserving access mechanism with perturbation << √T
• Chance for usability
• Reasonable model as databases grow larger and larger
[Diagram: small DB → medium DB → large DB]
page 19
SuLQ for Multi-Attribute Statistical Database (SDB)
Query (q, f): q ⊆ [n], f : {0,1}^k → {0,1}
Answer aq,f = Σi∈q f(di)
[Diagram: a database {di,j} of n persons × k attributes (a 0/1 matrix); row distribution D = (D1, D2, …, Dn)]
page 20
Privacy and Usability Concerns for the Multi-Attribute Model [DN]
• Rich set of queries: subset sums over any property of the k attributes
– Obviously increases usability, but how is privacy affected?
• More to protect: functions of the k attributes
• Relevant factors:
– What is the adversary's goal?
– Row dependency
• Vertically split data (between k or fewer databases):
– Can privacy still be maintained with independently operating databases?
page 21
Privacy Definition - Intuition
• 3-phase adversary
– Phase 0: defines a target set G of poly(n) functions g : {0,1}^k → {0,1}
• Will try to learn some of this information about someone
– Phase 1: adaptively queries the database T = o(n) times
– Phase 2: using all gained information, chooses an index i of a row it intends to attack and a function g ∈ G
• Attack:
– given d-i (the whole database except row i)
– try to guess g(di,1 … di,k)
page 22
The Privacy Definition
• p0i,g – a priori probability that g(di,1 … di,k) = 1
• pTi,g – a posteriori probability that g(di,1 … di,k) = 1
– Given the answers to the T queries, and d-i
• Define conf(p) = log(p/(1-p))
– 1-1 relationship between p and conf(p)
– conf(1/2) = 0; conf(2/3) = 1; conf(1) = ∞
• Δconf i,g = conf(pTi,g) - conf(p0i,g)
• (ε,T)-privacy ("relative privacy"):
– For all distributions D1…Dn, every row i, every function g, and any adversary making at most T queries:
Pr[Δconf i,g > ε] = neg(n)
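The confidence function from the slide, taking the log base 2 so that conf(2/3) = 1 as stated:

```python
import math

def conf(p):
    """conf(p) = log2(p / (1 - p)): a monotone, 1-1 reparametrization of
    probability (log-odds). conf(1/2) = 0 and conf(p) -> infinity as p -> 1."""
    return math.log2(p / (1 - p))
```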
page 23
The SuLQ* Database
• Adversary restricted to T << n queries
• On query (q, f):
– q ⊆ [n]
– f : {0,1}^k → {0,1} (binary function)
– Let aq,f = Σi∈q f(di,1 … di,k)
– Let N be binomial noise of mean 0 and magnitude ≈ √T
– Return aq,f + N
*SuLQ – Sub-Linear Queries
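A sketch of the SuLQ response in Python. The slide only fixes the noise's shape (binomial, mean 0, magnitude ~√T); realizing it as T fair ±1 coin flips halved, giving standard deviation √T/2, is my reading:

```python
import random

def sulq_answer(rows, q, f, T, rng=None):
    """SuLQ-style noisy answer (a sketch): the exact count of f over the
    queried rows, plus mean-zero binomial noise of standard deviation
    sqrt(T)/2 (an illustrative choice of the noise parameters)."""
    rng = rng or random.Random(0)
    exact = sum(f(rows[i]) for i in q)
    noise = sum(rng.choice((-1, 1)) for _ in range(T)) / 2
    return exact + noise

rows = [(0, 1, 1), (1, 0, 1), (1, 1, 0), (0, 0, 1)]
f = lambda r: r[0] & r[2]          # some property of the k attributes
ans = sulq_answer(rows, [0, 1, 2, 3], f, T=100)
```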
page 24
Privacy Analysis of the SuLQ Database
• pmi,g – a posteriori probability that g(di,1 … di,k) = 1
– Given d-i and the answers to the first m queries
• conf(pmi,g) describes a random walk on the line with:
– Starting point: conf(p0i,g)
– Compromise: conf(pmi,g) - conf(p0i,g) > ε
• W.h.p. more than T steps are needed to reach compromise
[Diagram: a random walk on the conf line from conf(p0i,g) to conf(p0i,g) + ε]
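The random-walk intuition, simulated. Each answer moving the attacker's confidence by about ±1/√T is my reading of the analysis (the √T-magnitude noise drowns a single row's ±1 contribution), so covering a constant gap ε takes on the order of ε²·T steps, and deterministically at least ε·√T:

```python
import random

def steps_until(eps, step, rng, cap=10**7):
    """Unbiased ±step random walk starting at 0; return the number of steps
    until the position first exceeds eps (cap guards against long runs)."""
    pos, n = 0.0, 0
    while pos <= eps and n < cap:
        pos += rng.choice((-1.0, 1.0)) * step
        n += 1
    return n

T, eps = 10_000, 1.0
n_steps = steps_until(eps, T ** -0.5, random.Random(0))
# deterministically n_steps >= eps / step = 100; typically it is on the
# order of eps**2 * T = 10_000 steps
```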
page 25
Usability: One multi-attribute SuLQ DB
• Statistics of any property f of the k attributes
– I.e., for what fraction of the (sub)population does f(d1 … dk) hold?
– Easy: just put f in the query
– Other applications:
• k independent multi-attribute SuLQ DBs
• Vertically partitioned SuLQ DBs
• Testing whether Pr[β|α] ≥ Pr[β] + Δ
– Caveat: we hide g() about a specific row (not about multiple rows)
page 26
Overview of Methods
• Input Perturbation
[Diagram: Data → Perturbation → SDB'; User → Query → SDB' → Response]
• Output Perturbation
[Diagram: User → (Restricted) Query → SDB → Perturbed Response]
• Query Restriction
[Diagram: User → (Restricted) Query → SDB → Exact Response or Denial]
page 27
Query restriction
• The decision whether to answer or deny the query
– Can be based on the content of the query and on answers to previous queries
– Or, can be based on the above and on the content of the database
[Diagram: User → (Restricted) Query → SDB → Exact Response or Denial]
page 28
Auditing
• [AW89] classify auditing as a query restriction method:
– "Auditing of an SDB involves keeping up-to-date logs of all queries made by each user (not the data involved) and constantly checking for possible compromise whenever a new query is issued"
• Partial motivation: May allow for more queries to be posed, if no privacy threat occurs.
• Early work: Hofmann 1977, Schlorer 1976, Chin, Ozsoyoglu 1981, 1986
• Recent interest: Kleinberg, Papadimitriou, Raghavan 2000, Li, Wang, Wang, Jajodia 2002, Jonsson, Krokhin 2003
page 29
How Auditors May Inadvertently Compromise Privacy
page 30
The Setting
• Dataset: d = {d1, …, dn}
– Entries di: real, integer, or Boolean
• Query: q = (f, i1, …, ik)
– f: min, max, median, sum, average, count, …
– Answer: f(di1, …, dik)
• Bad users will try to breach the privacy of individuals
• Compromise ≡ uniquely determine some di (a very weak definition)
page 31
Auditing
[Diagram: the user submits a new query qi+1; the auditor, holding the query log q1, …, qi and the statistical database, either returns the answer or replies "Query denied" (as the answer would cause privacy loss)]
page 32
Example 1: Sum/Max auditing
di real; sum/max queries; privacy is breached if some di is learned.
• q1 = sum(d1,d2,d3) → answer: sum(d1,d2,d3) = 15
• q2 = max(d1,d2,d3) → denied (the answer would cause privacy loss)
• Attacker: "Oh well… there must be a reason for the denial. q2 is denied iff d1 = d2 = d3 = 5. I win!"
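The example as code. The auditor's rule below is the naive, non-simulatable one implied by the slide (deny iff the answer would uniquely determine an entry); the function names are mine:

```python
def max_after_sum(d, released_sum):
    """Naive auditor for q2 = max(d1,d2,d3) after sum(d1,d2,d3) was released:
    deny iff answering would uniquely determine some entry. With the sum
    known, max = sum/3 forces all three entries to equal sum/3."""
    m = max(d)
    return "denied" if 3 * m == released_sum else m

def infer_from_denial(released_sum):
    """The attacker turns the denial itself into a breach."""
    return [released_sum / 3] * 3

# denial occurs exactly when d1 = d2 = d3 = 5, and the attacker knows it
```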
page 33
Sounds Familiar?
David Duncan, former auditor for Enron and partner in Andersen:
"Mr. Chairman, I would like to answer the committee's questions, but on the advice of my counsel I respectfully decline to answer the question based on the protection afforded me under the Constitution of the United States."
page 34
Max Auditing
di real; data d1, d2, …, dn
• q1 = max(d1,d2,d3,d4) → M1234
• q2 = max(d1,d2,d3) → M123 / denied. If denied: d4 = M1234
• q3 = max(d1,d2) → M12 / denied. If denied: d3 = M123
⇒ The attacker learns an item with probability ½
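One step of this attack in code, against a toy auditor that denies a max query iff its answer, combined with the already-released maximum, would pin down an entry:

```python
def nested_max_step(d):
    """Release max(d), then query the max over all but the last element.
    The auditor must deny iff that inner max is smaller, since answering
    would reveal d[-1] = released max. But the denial reveals it anyway."""
    released = max(d)
    inner = max(d[:-1])
    if inner < released:            # denied -> attacker learns d[-1]
        return ("learned", len(d) - 1, released)
    return ("answered", inner)      # attacker recurses on the prefix
```

Since a random position holds the maximum, repeating this on ever-shorter prefixes leaks some item with probability about ½, as the slide states.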
page 35
Boolean Auditing?
di Boolean; data d1, d2, …, dn
• q1 = sum(d1,d2) → 1 / denied
• q2 = sum(d2,d3) → 1 / denied
• …
• qi is denied iff di = di+1 ⇒ the attacker learns the database or its complement
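The Boolean chain attack in code (auditor rule as on the slide: sum(di, di+1) is answered only when it equals 1, since answering 0 or 2 would reveal both bits):

```python
def boolean_chain_attack(d):
    """Query sum(d_i, d_{i+1}) for every i. 'denied' <=> d_i == d_{i+1},
    so the denial pattern determines d up to bitwise complement."""
    resp = ["denied" if d[i] == d[i + 1] else 1 for i in range(len(d) - 1)]
    guess = [0]                    # assume d[0] = 0; flip on each answered "1"
    for r in resp:
        guess.append(guess[-1] if r == "denied" else 1 - guess[-1])
    return resp, guess

resp, guess = boolean_chain_attack([0, 0, 1, 0, 1, 1])
# guess reconstructs the database exactly, or as its bitwise complement
```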
page 36
The Problem
• The problem:
– Query denials leak (potentially sensitive) information
• Users cannot decide denials by themselves
[Diagram: within the set of possible assignments to {d1,…,dn}, the denial of qi+1 shrinks the set of assignments consistent with (q1,…,qi, a1,…,ai)]
page 37
Solution to the problem: simulatable Auditing
An auditor is simulatable if a simulator exists s.t.:
[Diagram: on query qi+1, the auditor decides deny/answer using the statistical database and the query log q1,…,qi; the simulator reaches the same deny/answer decision from q1,…,qi and a1,…,ai alone, without access to the database]
⇒ denials do not leak information
page 38
Why Simulatable Auditors do not Leak Information?
[Diagram: the set of assignments to {d1,…,dn} consistent with (q1,…,qi, a1,…,ai) is unchanged by the decision to deny or allow qi+1, since that decision is computable from the previous queries and answers alone]
page 40
Query Restriction for Sum Queries
• Given:
– Dataset D = {x1, …, xn}, xi ∈ ℝ
– S a subset of D. Query: Σxi∈S xi
• Is it possible to compromise D?
– Here compromise means: uniquely determine some xi from the queries
• Can compromise if subsets may be arbitrarily small:
– sum(x9) = x9
page 41
Query Set Size Control
• Do not permit queries that involve a small subset of the database.
• Compromise is still possible
– Want to discover x: sum(x, y1, …, yk) - sum(y1, …, yk) = x
• Issue: overlap
• In general, restricting overlap alone is not enough.
– Need to also restrict the number of queries
– Note that overlap itself sometimes restricts the number of queries (e.g., if query size = cn and overlap = const, only about 1/c queries are possible)
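The overlap attack as code: both queries below can be made large enough to pass a minimum-size check, yet their difference isolates a single element (function names are mine):

```python
def sum_query(data, idxs):
    """The SDB's sum query over a subset of indices."""
    return sum(data[i] for i in idxs)

def subtraction_attack(data, target, helpers):
    """Recover x_target via sum({target} + helpers) - sum(helpers); with
    enough helper indices, both queries satisfy a query-set-size rule."""
    return sum_query(data, [target] + helpers) - sum_query(data, helpers)

data = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0]
x0 = subtraction_attack(data, 0, [1, 2, 3, 4])   # recovers data[0]
```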
page 42
Restricting Set-Sum Queries
• Restrict sum queries based on:
– The number of database elements in the sum
– Overlap with previous sum queries
– The total number of queries
• Note that these criteria are known to the user
– They do not depend on the contents of the database
• Therefore, the user can simulate the deny/no-deny answer given by the DB
– Simulatable auditing
page 43
Restricting Overlap and Number of Queries
• Assume:
– each query satisfies |Qi| ≥ k
– pairwise overlap |Qi ∩ Qj| ≤ r
– the adversary knows at most L values a priori, L+1 < k
• Claim: The data cannot be compromised with fewer than 1 + (2k-L)/r sum queries.
[Diagram: a t×n 0/1 query-incidence matrix applied to (x1, …, xn): each row Qi contains ≥ k ones; any two rows share ≤ r ones]
page 44
Overlap + Number of Queries
Claim: The data cannot be compromised with fewer than 1 + (2k-L)/r sum queries [Dobkin, Jones, Lipton] [Reiss]
– k ≤ query size, r ≥ overlap, L = number of a priori known items
• Suppose xc is compromised after t queries, each represented by:
– Qi = xi1 + xi2 + … + xik for i = 1, …, t
• This implies:
– xc = Σi=1..t αi Qi = Σi=1..t αi Σj=1..k xij
– Let ηiℓ = 1 if xℓ appears in query i, 0 otherwise
– xc = Σi=1..t αi Σℓ=1..n ηiℓ xℓ = Σℓ=1..n (Σi=1..t αi ηiℓ) xℓ
page 45
Overlap + Number of Queries
We have: xc = Σℓ=1..n (Σi=1..t αi ηiℓ) xℓ
• In the above sum, the coefficient (Σi=1..t αi ηiℓ) must be 0 for every xℓ except xc (in order for xc to be compromised)
• For ℓ ≠ c this happens iff ηiℓ = 0 for all i, or if ηiℓ = ηjℓ = 1 and αi, αj have opposite signs
– or αi = 0, in which case the i-th query didn't matter
page 46
Overlap + Number of Queries
• Wlog, the first query contains xc and the second query has the opposite sign.
• The first query probes k elements
• The second query adds at least k - r new elements
• Elements from the first and second queries cannot be canceled within the same (additional) query (that would require opposite signs)
• Therefore each new query cancels items from the first query or from the second, but never from both.
• Need to cancel 2k - r - L elements, at most r per additional query.
– Need 2 + (2k-r-L)/r queries, i.e. 1 + (2k-L)/r.
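The resulting lower bound as a one-liner, with example values (the function name is mine):

```python
def min_compromising_queries(k, r, L):
    """[Dobkin-Jones-Lipton]/[Reiss] bound: with query size >= k, pairwise
    overlap <= r, and L a-priori known values (L + 1 < k), compromising any
    single value requires at least 1 + (2k - L)/r sum queries."""
    return 1 + (2 * k - L) / r

# e.g. queries of size >= 10 overlapping in <= 2 elements, nothing known
# a priori: at least 11 queries are needed
```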
page 47
Notes
• The number of queries satisfying |Qi| ≥ k and |Qi ∩ Qj| ≤ r is small
– If k = n/c for some constant c and r = const, there are only ~c queries in which no two overlap by more than a constant.
– Hence, the permitted query sequence may be uncomfortably short.
– Or, if r = k/c (overlap is a constant fraction of the query size), then the number of queries, 1 + (2k-L)/r, is O(c).
page 48
Conclusions
• Privacy should be defined and analyzed rigorously
– In particular, assuming that randomization ⇒ privacy is dangerous
• High perturbation is needed for privacy against polynomial adversaries
– Threshold phenomenon: above √n, total privacy; below √n, no privacy (for a poly-time adversary)
– Main tool: a reconstruction algorithm
• Careless auditing might leak private information
• Self auditing (simulatable auditors) is safe
– The decision whether to allow a query is based on previous 'good' queries and their answers
• Without access to the DB contents
• Users may apply the decision procedure by themselves
page 49
ToDo
• Come up with a good model and requirements for database privacy
– Learn from crypto
– Protect against more general loss of privacy
• Simulatable auditors are a starting point for designing more reasonable audit mechanisms
page 50
References
• Course web page:
– Cynthia Dwork, Nina Mishra, and Kobbi Nissim, A Study of Perturbation Techniques for Data Privacy, http://theory.stanford.edu/~nmishra/cs369-2004.html
– Privacy and Databases, http://theory.stanford.edu/~rajeev/privacy.html
page 51
Foundations of CS at the Weizmann Institute
• Uri Feige
• Oded Goldreich
• Shafi Goldwasser
• David Harel
• Moni Naor
• David Peleg
• Amir Pnueli
• Ran Raz
• Omer Reingold
• Adi Shamir
• All students receive a fellowship
• Language of instruction: English