secure multiparty computation – applications for privacy...
TRANSCRIPT
Secure Multiparty Computation –
Applications for Privacy Preserving
Distributed Data Mining
Li Xiong
CS573 Data Privacy and Security
Slides credit: Chris Clifton, Purdue University; Murat Kantarcioglu, UT Dallas
Distributed Data Mining
• The setting:
– Data is distributed at different sites
– These sites may be third parties (e.g., hospitals, government (e.g., hospitals, government bodies) or may be the individual him or herself
• Privacy and security constraints:
– Privacy of individual data
– Security of data holders
xn
x1
x3
x2
f(x1,x2,…, xn)
Distributed Data Mining
• Government / public agencies. Example:
– The Centers for Disease Control want to identify disease outbreaks
– Insurance companies have data on disease incidents, seriousness, patient background, etc.
– But can/should they release this information?
• Industry Collaborations / Trade Groups. Example:• Industry Collaborations / Trade Groups. Example:
– An industry trade group may want to identify best practices to help members
– But some practices are trade secrets
– How do we provide “commodity” results to all (Manufacturing using chemical supplies from supplier X have high failure rates), while still preserving secrets (manufacturing process Y gives low failure rates)?
Taxonomy
• Data partition– Horizontally partitioned– Vertically partitioned
• Techniques– Data perturbation– Data perturbation
– Secure Multi-party Computation protocols
• Mining applications– Decision tree– Bayes classifier
– …
Horizontally Partitioned Data
• Data can be unioned to create the complete
set key X1…Xd
K1
k2
kn
key X1…Xd
Site 1
key X1…Xd
Site 2
key X1…Xd
Site r…
kn
K1
k2
ki
Ki+1
ki+2
kj
Km+1
km+2
kn
Vertically Partitioned Data
• Data can be joined to create the complete set
key X1…Xi Xi+1…Xj … Xm+1…Xd
key X1…Xi
Site 1
key Xi+1…Xj
Site 2
key Xm+1…Xd
Site r…
Secure Multiparty Computation
• Goal: Compute function when each party has some of the inputs
• Secure– Can be simulated by ideal model - nobody knows
anything but their own input and the resultsanything but their own input and the results
– Formally: ∃ polynomial time S such that {S(x,f(x,y))} ≡ {View(x,y)}
• Semi-Honest model: follow protocol, but remember intermediate exchanges
• Malicious: “cheat” to find something out
Application of SMC to Private Data
Mining
• The aim:
– Compute the data mining algorithm on the data so that nothing but the output is learned
• Is it sufficient?• Is it sufficient?
xn
x1
x3
x2
f(x1,x2,…, xn)
Privacy and Secure Computation
• Privacy ≠≠≠≠ Security
– Secure computation only deals with the process of
computing the function
– It does not ask whether or not the function should be
computedcomputed
• A two-stage process:
– Decide that the function/algorithm should be
computed – an issue of privacy (e.g. differential
privacy)
– Apply secure computation techniques to compute it
securely – security (focus of this class)
Secure Multiparty Computation
• Basic cryptographic tools– Oblivious transfer
– Random shares
– Oblivious circuit evaluation– Oblivious circuit evaluation
• Yao’s Millionaire’s problem (Yao ’86)– Secure computation possible if function can be
represented as a circuit
• Works for multiple parties as well (Goldreich, Micali, and Wigderson ’87)
But we aren’t done yet
• Circuit evaluation: Build a circuit that represents the computation
– For all possible inputs
– Impossibly large for typical data mining tasks– Impossibly large for typical data mining tasks
• Next step:
– Efficient techniques for specialized tasks and computations
– Tradeoff between security, efficiency, and accuracy
Outline
• Privacy preserving two-party decision tree
mining (Lindell & Pinkas ’00)
• Primitive protocols
– Secure sum– Secure sum
– Secure union (encryption based)
– Secure max (probabilistic random response
based)
• Secure data mining using sub protocolsPrivacy preserving data mining, Lindell, 2000
Tools for Privacy Preserving Data Mining, Clifton, 2002
Privacy-preserving Distributed Mining of Association Rules on Horizontally Partitioned Data, Kantarcioglu, 2004
Decision Tree Construction (Lindell &
Pinkas ’00)
• Two-party horizontal partitioning
– Each site has same schema
– Attribute set known
– Individual entities private– Individual entities private
• Learn a decision tree classifier
– ID3
• Semi-honest model
A Decision Tree for “buys_computer”
age?
overcast<=30>4031..40
14
student? credit rating?
no yes yes
yes
fairexcellentyesno
ID3 Algorithm for Decision Tree Induction
� Greedy algorithm - tree is constructed in a top-down recursive divide-and-conquer
manner
� At start, all the training examples are at the root
� A test attribute is selected that “best” separate the data into partitions -
information gain
� Samples are partitioned recursively based on selected attributes
15
� Conditions for stopping partitioning
� All samples for a given node belong to the same class
� There are no remaining attributes for further partitioning – majority voting is
employed for classifying the leaf
� There are no samples left
Privacy Preserving ID3
Step 1: If R is empty, return a leaf-node with the class value with
Input:
� R – the set of attributes
� C – the class attribute
� T – the set of transactions
Step 1: If R is empty, return a leaf-node with the class value with
the most transactions in T
• Set of attributes is public
– Both know if R is empty
• Use secure protocol for majority voting
• Yao’s protocol
– Inputs (|T1(c1)|,…,|T1(cL)|), (|T2(c1)|,…,|T2(cL)|)
– Output i where |T1(ci)|+|T2(ci)| is largest
Privacy Preserving ID3
Step 2: If T consists of transactions which have all the same value
c for the class attribute, return a leaf node with the value c
• Represent having more than one class (in the transaction set),
by a fixed symbol different from ci,
• Force the parties to input either this fixed symbol or ci• Force the parties to input either this fixed symbol or ci
• Check equality to decide if at leaf node for class ci
• Various approaches for equality checking
– Yao’86
– Fagin, Naor ’96
– Naor, Pinkas ‘01
Privacy Preserving ID3
• Step 3:(a) Determine the attribute that best classifies the transactions in T, let it be A
– Each party compute P1 and P2 , i.e., x1 and x2, respectively
– Essentially done by securely computing x*(ln x) where x is the sum of
∑∑∈∈
−=−=)()(
log)(log)()(Slabelv
vv
Slabelv n
n
n
nvPvPSEntropy
– Essentially done by securely computing x*(ln x) where x is the sum of values from the two parties
• Step 3: (b,c) Recursively call ID3δ for the remaining attributes on the transaction sets T(a1),…,T(am) where a1,…, am are the values of the attribute A– Since the results of 3(a) and the attribute values are public, both
parties can individually partition the database and prepare their inputs for the recursive calls
Security proof tools
– Real/ideal model: the real model can be simulated in
the ideal model
• Key idea – Show that whatever can be computed by a party participating in the protocol can be computed based on its input and output onlybased on its input and output only
• ∃ polynomial Tme S such that {S(x,f(x,y))} ≡ {View(x,y)}
Security proof tools
• Composition theorem
– if a protocol is secure in the hybrid model where
the protocol uses a trusted party that computes
the (sub) functionalities, and we replace the calls the (sub) functionalities, and we replace the calls
to the trusted party by calls to secure protocols,
then the resulting protocol is secure
– Prove that component protocols are secure, then prove that the combined protocol is secure
Secure Sub-protocols for PPDDM
• In general, PPDDM protocols depend on few
common sub-protocols.
• Those common sub-protocols could be re-• Those common sub-protocols could be re-
used to implement PPDDM protocols
Privacy preserving data mining toolkit (Clifton ‘02)
• Many different data mining techniques often perform similar computations at various stages (e.g., computing sum, counting the number of items)
• Toolkit • Toolkit
– simple computations – sum, union, intersection …
– assemble them to solve specific mining tasks –association rule mining, bayes classifier, …
• The protocols may not be truly secure but more efficient than traditional SMC methods
Tools for Privacy Preserving Data Mining, Clifton, 2002
Primitive protocols
• Secure functions
– Secure sum
– Secure union
– …– …
Secure Sum
• Leading site: (v1 + R) mod n
• Site l receives:
• Leading site: (V-R) mod n
Secure Sum
Secure sum - security
• Does not reveal the real number
• Is it secure?� Site can collude!
� Each site can divide the number into shares, and � Each site can divide the number into shares, and
run the algorithm multiple times with permutated
nodes
Secure Union
• Commutative encryption
– For any set of permutated keys and a message M
– For any set of permutated keys and message M1 and M2For any set of permutated keys and message M1 and M2
• Secure union
– Each site encrypt its items and items from other site, remove duplicates, and decrypt
E1(E2(E3(ABC)))E1(ABC)E1(E2(ABD))
Secure Union
1
ABC
E1(E2(E3(ABC)))
E1(E2(E3(ABD)))
E2(E3(ABC))
E2(E3(ABD))
E3(E1(ABC))E3(E1(E2(ABD)))E2(E3(ABC))E2(E3(E1(ABC))) 2
ABD
3
ABCE3(ABC)E2(ABD)
E3(ABC)
E3(ABD)
ABC
ABD
Secure Union Security
• Does not reveal which item belongs to which site
• Is it secure under the definition of secure multi-party computation?multi-party computation?
� It reveals the number of items that are common
in the sites!
� Revealing innocuous information leakage allows a
more efficient algorithm than a fully secure
algorithm
Random Shares based Secure Union• Phase 1: random item addition
– Multiple rounds with permutated ring
– Each node sends a random share of its item set and a random share of a random
item set
• Phase 2: random item removal
– Each node subtracts its random items set
30
Random Shares based Secure Union -
Analysis
• Item exposure attack
– An adversary makes a claim C on a particular item a node i contributes to the final result (C: vi in xi)
• Set exposure attack
– An adversary makes a claim C on the whole set of – An adversary makes a claim C on the whole set of items a node i contributes to the final union result X (C: xi = ai).
• Change of beliefs (posterior probability and prior probability)
– P(C|IR,X) - P(C|X)
– P(C|IR,X)/P(C|X)
31
Exposure Risk – Set Exposure
• Disclosure decreases with increasing number of generated
random items and increasing number of participating nodes
• Set exposure risk is or close to 0 for probabilistic and crypto
approach
32
Exposure Risk – Risk Exposure
• Item exposure risk decreases with increasing number of
generated random items and participating nodes
• Item exposure risk for probabilistic approach is quite high
33
Cost Comparison
• Commutative protocol and anonymous communication protocol efficient but sensitive to union size
• Probabilistic protocol efficient but sensitive to domain size
• Estimated runtime for the general circuit-based protocol implemented by FairplayMP framework is 15 days, 127 days and 1.4 years for the domain sizes tested
34
Multi-round protocols
• Multi-round probabilistic protocols for
max/min and top k (Xiong ‘07)
Protocol Structure
• Random response
(Warner 1965)
• Multi-round
randomized protocol
Social Survey
D1 D2
Output Input
36
randomized protocol
– Randomized local
computation
– Multi-node
anonymity
• Assumption: semi-
honest model
Private
Data
D1
Private
Data
D3
Private
Data
D2
Private
Data
Dn
…
OutputInput
Input
InputOutput
Output
Local
Computation
Local
Computation
Local
Computation
Local
Computation
Preserving Privacy for outsourcing aggregation services, Xiong, 2007
A Naïve Max/Min Protocol
igi-1 gi
vi 1 2
30 10
30
3040
start
37
gi-1>=vi gi-1<vi
gi gi-1 vi34
20 40
30
40
40
� Add in randomization – how, when, and how
much?
• Random response at node i:
Max Protocol – Random response
gi-1>=vi gi-1(r)<vi
gi(r) gi-1(r) w/ prob Pr:
random numberigi-1 gi
v
38
w/ prob 1-Pr:
vi
vi
• Multiple rounds
• Randomization Probability at round r :
– Pr(r) =
• Local algorithm at round r and node i:
Max Protocol – multi-round random
response
1
0*−rdP
• Local algorithm at round r and node i:
39
gi-1(r)>=vi gi-1(r)<vi
gi(r) gi-1(r) w/ prob Pr:
rand [gi-1(r), vi)
w/ prob 1-Pr:
vi
igi-1(r) gi(r)
vi
Max Protocol - Illustration
Start18 3532
D2D2
30 10
0
40
32 4035
D3D4
30
20 40
10
18 3532
32 4035
PrivateTopK Protocol
Gi’(r)=topk(Gi-1(r) U Vi);
Vi’ = Gi’(r) – Gi-1’(r);
m = |Vi’|;
if m=0 then
Gi(r)= Gi-1(r);
else60
80
100
125
145
110
130
150
2/28/2012 41
else
with probability 1-Pr(r):
Gi(r)= Gi-1(r)
with probability Pr:
Gi(r)[1:k-m] = Gi-1(r)[1:k-m]
Gi(r)[k-m+1:k] = a sorted list of m
random values generated from
[min(Gi’(r)[k]-delta,Gi-1(r-1)[k-m+1]),
Gi’(r)[k])
end
Gi(r)
Gi-1(r)
10
Vi
40
50
80
110
D1
…
D2
Dn
Min/Max Protocol - Correctness
• Precision bound:
– Converges with r
– Smaller p0 and d provides faster convergence
2
)1(
01*1)Pr(1
−
=−≥−∏
rrrr
jdPj
42
Min/Max Protocol - Cost
• Communication cost
– single round: O(n)
– Minimum # of rounds given
precision guarantee (1-e):
43
Min/Max Protocol - Security
• Probability/confidence based metric: P(C|IR,R)
– Different types of exposures based on claim
• Data value: v =a
0.50 1
Absolute Privacy Provable Exposure
44
• Data value: vi=a
• Data ownership: Vi contains a
– Loss of Privacy (LoP) = | P(C|IR,R) – P(C|R) |
• Information entropy based metric:
– Loss of privacy as a measure of randomness of information: H(D|R) -
H(D|IR,R)
Min/Max Protocol – Security (Analysis)
• Upper bound for average expected LoP:
max r 1/2r-1 * (1-P0*dr-1)
• Larger p0 and d provides better privacy
45
Min/Max Protocol – Security (Experiments)
46
• Loss of privacy decreases with increasing number of nodes
• Probabilistic protocol achieves better privacy (close to 0)
• When n is large, anonymous protocol is actually okay!
Union
• Commutative encryption based approach
– Number of rounds: 2 rounds
– Each round: encryption and decryption
• Multi-round random-response approach?• Multi-round random-response approach?
Vector
1
0b1
b2
…
p1
0
1
…
p2
1
0
…
pc
OR OR OR… =
1
1
VG
…
• Each database has a boolean vector of the data items
• Union vector is a logical OR of all vectors
0bL
…
0
…
0
…
0
…
Privacy Preserving Indexing of Documents on the Network, Bawa, 2003
Group Vector Protocol
…
Pex=1/2r, Pin=1-Pex
for(i=1; i<L; i++)
if (Vs[i]=1 and VG’[i]=0)
Processing of VG’ at ps of round r…
0
1
0
v1
0
0
1
…
v2
0
1
0
…
vc
0
0
0
…
vG’
0
0
1
…
vG’
r=1, Pex=1/2, Pin=1/2
if (Vs[i]=1 and VG’[i]=0)
Set VG’[i]=1 with prob. Pin
if (Vs[i]=0 and VG’[i]=1)
Set VG’[i]=0 with prob. Pex
v1v2
vc
r=2, Pex=1/4, Pin=3/40
0
1
…
vG’
0
1
1
…
vG’
0
1
1
…
vG’
0
0
1
…
vG’
0
1
1
…
vG’
p1p2 pc
Open issues
� Tradeoff between accuracy, efficiency, and security
� How to quantify security
� How to design adjustable protocols
� Can we generalize the algorithms for a set of � Can we generalize the algorithms for a set of
operators based on their properties
� Operators: sum, union, max, min …
� Properties: commutative, associative, invertible,
randomizable
Secure Functionalities
• Secure Comparison: Comparing two integers without revealing the integer values.
• Secure Polynomial Evaluation: Party A has polynomial P(x) and Part B has a value b, the polynomial P(x) and Part B has a value b, the goal is to calculate P(b) without revealing P(x) or b
• Secure Set Intersection: Party A has set SA A
and Party B has set and Party B has set SB , the goal is to calculate
without revealing anything else. A BS S∩
Secure Functionalities Used
• Secure Set Union: Party A has set SA A and Party and Party
B has set B has set SB , the goal is to calculate
without revealing anything else.
A BS S∪
• Secure Dot Product: Party A has a vector X
and Party B has a vector Y. The goal is to
calculate X.Y without revealing anything else.
•Secure Sum
•Secure Comparison
•Association Rule Mining
•Decision Trees
Data Mining on Horizontally
Partitioned DataSpecific Secure Tools
•Secure Union
•Secure Logarithm
•Secure Poly. Evaluation
•EM Clustering
•Naïve Bayes Classifier
•Secure Comparison
•Secure Set Intersection
•Association Rule Mining
•Decision Trees
Data Mining on Vertically
Partitioned DataSpecific Secure Tools
•Secure Dot Product
•Secure Logarithm
•Secure Poly. Evaluation
•K-means Clustering
•Naïve Bayes Classifier
•Outlier Detection
Summary of SMC Based PPDDM
• Mainly used for distributed data mining.
• Provably secure under some assumptions.
• Efficient/specific cryptographic solutions for many distributed data mining problems are developed.developed.
• Mainly semi-honest assumption(i.e. parties follow the protocols)
Ongoing research
• New models that can trade-off better
between efficiency and security
• Game theoretic / incentive issues in PPDM
• Combining anonymization and cryptographic
techniques for PPDM