secure multiparty computation – applications for privacy...

Secure Multiparty Computation –

Applications for Privacy Preserving

Distributed Data Mining

Li Xiong

CS573 Data Privacy and Security

Slides credit: Chris Clifton, Purdue University; Murat Kantarcioglu, UT Dallas


• The setting:

– Data is distributed at different sites

– These sites may be third parties (e.g., hospitals, government (e.g., hospitals, government bodies) or may be the individual him or herself

• Privacy and security constraints:

– Privacy of individual data

– Security of data holders

xn

x1

x3

x2

f(x1,x2,…, xn)


• Government / public agencies. Example:

– The Centers for Disease Control want to identify disease outbreaks

– Insurance companies have data on disease incidents, seriousness, patient background, etc.

– But can/should they release this information?

• Industry Collaborations / Trade Groups. Example:• Industry Collaborations / Trade Groups. Example:

– An industry trade group may want to identify best practices to help members

– But some practices are trade secrets

– How do we provide “commodity” results to all (Manufacturing using chemical supplies from supplier X have high failure rates), while still preserving secrets (manufacturing process Y gives low failure rates)?

Taxonomy

• Data partition– Horizontally partitioned– Vertically partitioned

• Techniques– Data perturbation– Data perturbation

– Secure Multi-party Computation protocols

• Mining applications– Decision tree– Bayes classifier

– …

Horizontally Partitioned Data

• Data can be unioned to create the complete

set key X1…Xd

K1

k2

kn

key X1…Xd

Site 1

key X1…Xd

Site 2

key X1…Xd

Site r…

kn

K1

k2

ki

Ki+1

ki+2

kj

Km+1

km+2

kn

Vertically Partitioned Data

• Data can be joined to create the complete set

key X1…Xi Xi+1…Xj … Xm+1…Xd

key X1…Xi

Site 1

key Xi+1…Xj

Site 2

key Xm+1…Xd

Site r…

Secure Multiparty Computation

• Goal: Compute function when each party has some of the inputs

• Secure– Can be simulated by ideal model - nobody knows

anything but their own input and the resultsanything but their own input and the results

– Formally: ∃ polynomial time S such that {S(x,f(x,y))} ≡ {View(x,y)}

• Semi-Honest model: follow protocol, but remember intermediate exchanges

• Malicious: “cheat” to find something out

Application of SMC to Private Data

Mining

• The aim:

– Compute the data mining algorithm on the data so that nothing but the output is learned

• Is it sufficient?• Is it sufficient?

xn

x1

x3

x2

f(x1,x2,…, xn)

Privacy and Secure Computation

• Privacy ≠≠≠≠ Security

– Secure computation only deals with the process of

computing the function

– It does not ask whether or not the function should be

computedcomputed

• A two-stage process:

– Decide that the function/algorithm should be

computed – an issue of privacy (e.g. differential

privacy)

– Apply secure computation techniques to compute it

securely – security (focus of this class)

Secure Multiparty Computation

• Basic cryptographic tools– Oblivious transfer

– Random shares

– Oblivious circuit evaluation– Oblivious circuit evaluation

• Yao’s Millionaire’s problem (Yao ’86)– Secure computation possible if function can be

represented as a circuit

• Works for multiple parties as well (Goldreich, Micali, and Wigderson ’87)

But we aren’t done yet

• Circuit evaluation: Build a circuit that represents the computation

– For all possible inputs

– Impossibly large for typical data mining tasks– Impossibly large for typical data mining tasks

• Next step:

– Efficient techniques for specialized tasks and computations

– Tradeoff between security, efficiency, and accuracy

Outline

• Privacy preserving two-party decision tree

mining (Lindell & Pinkas ’00)

• Primitive protocols

– Secure sum– Secure sum

– Secure union (encryption based)

– Secure max (probabilistic random response

based)

• Secure data mining using sub protocolsPrivacy preserving data mining, Lindell, 2000

Tools for Privacy Preserving Data Mining, Clifton, 2002

Privacy-preserving Distributed Mining of Association Rules on Horizontally Partitioned Data, Kantarcioglu, 2004

Decision Tree Construction (Lindell &

Pinkas ’00)

• Two-party horizontal partitioning

– Each site has same schema

– Attribute set known

– Individual entities private– Individual entities private

• Learn a decision tree classifier

– ID3

• Semi-honest model

A Decision Tree for “buys_computer”

age?

overcast<=30>4031..40

14

student? credit rating?

no yes yes

yes

fairexcellentyesno

ID3 Algorithm for Decision Tree Induction

� Greedy algorithm - tree is constructed in a top-down recursive divide-and-conquer

manner

� At start, all the training examples are at the root

� A test attribute is selected that “best” separate the data into partitions -

information gain

� Samples are partitioned recursively based on selected attributes

15

� Conditions for stopping partitioning

� All samples for a given node belong to the same class

� There are no remaining attributes for further partitioning – majority voting is

employed for classifying the leaf

� There are no samples left

Privacy Preserving ID3

Step 1: If R is empty, return a leaf-node with the class value with

Input:

� R – the set of attributes

� C – the class attribute

� T – the set of transactions

Step 1: If R is empty, return a leaf-node with the class value with

the most transactions in T

• Set of attributes is public

– Both know if R is empty

• Use secure protocol for majority voting

• Yao’s protocol

– Inputs (|T1(c1)|,…,|T1(cL)|), (|T2(c1)|,…,|T2(cL)|)

– Output i where |T1(ci)|+|T2(ci)| is largest


Step 2: If T consists of transactions which have all the same value

c for the class attribute, return a leaf node with the value c

• Represent having more than one class (in the transaction set),

by a fixed symbol different from ci,

• Force the parties to input either this fixed symbol or ci• Force the parties to input either this fixed symbol or ci

• Check equality to decide if at leaf node for class ci

• Various approaches for equality checking

– Yao’86

– Fagin, Naor ’96

– Naor, Pinkas ‘01


• Step 3:(a) Determine the attribute that best classifies the transactions in T, let it be A

– Each party compute P1 and P2 , i.e., x1 and x2, respectively

– Essentially done by securely computing x*(ln x) where x is the sum of

∑∑∈∈

−=−=)()(

log)(log)()(Slabelv

vv

Slabelv n

n

n

nvPvPSEntropy

– Essentially done by securely computing x*(ln x) where x is the sum of values from the two parties

• Step 3: (b,c) Recursively call ID3δ for the remaining attributes on the transaction sets T(a1),…,T(am) where a1,…, am are the values of the attribute A– Since the results of 3(a) and the attribute values are public, both

parties can individually partition the database and prepare their inputs for the recursive calls

Security proof tools

– Real/ideal model: the real model can be simulated in

the ideal model

• Key idea – Show that whatever can be computed by a party participating in the protocol can be computed based on its input and output onlybased on its input and output only

• ∃ polynomial Tme S such that {S(x,f(x,y))} ≡ {View(x,y)}

Security proof tools

• Composition theorem

– if a protocol is secure in the hybrid model where

the protocol uses a trusted party that computes

the (sub) functionalities, and we replace the calls the (sub) functionalities, and we replace the calls

to the trusted party by calls to secure protocols,

then the resulting protocol is secure

– Prove that component protocols are secure, then prove that the combined protocol is secure

Secure Sub-protocols for PPDDM

• In general, PPDDM protocols depend on few

common sub-protocols.

• Those common sub-protocols could be re-• Those common sub-protocols could be re-

used to implement PPDDM protocols

Privacy preserving data mining toolkit (Clifton ‘02)

• Many different data mining techniques often perform similar computations at various stages (e.g., computing sum, counting the number of items)

• Toolkit • Toolkit

– simple computations – sum, union, intersection …

– assemble them to solve specific mining tasks –association rule mining, bayes classifier, …

• The protocols may not be truly secure but more efficient than traditional SMC methods

Tools for Privacy Preserving Data Mining, Clifton, 2002

Primitive protocols

• Secure functions

– Secure sum

– Secure union

– …– …

Secure Sum

• Leading site: (v1 + R) mod n

• Site l receives:

• Leading site: (V-R) mod n

Secure Sum

Secure sum - security

• Does not reveal the real number

• Is it secure?� Site can collude!

� Each site can divide the number into shares, and � Each site can divide the number into shares, and

run the algorithm multiple times with permutated

nodes

Secure Union

• Commutative encryption

– For any set of permutated keys and a message M

– For any set of permutated keys and message M1 and M2For any set of permutated keys and message M1 and M2

• Secure union

– Each site encrypt its items and items from other site, remove duplicates, and decrypt

E1(E2(E3(ABC)))E1(ABC)E1(E2(ABD))

Secure Union

1

ABC

E1(E2(E3(ABC)))

E1(E2(E3(ABD)))

E2(E3(ABC))

E2(E3(ABD))

E3(E1(ABC))E3(E1(E2(ABD)))E2(E3(ABC))E2(E3(E1(ABC))) 2

ABD

3

ABCE3(ABC)E2(ABD)

E3(ABC)

E3(ABD)

ABC

ABD

Secure Union Security

• Does not reveal which item belongs to which site

• Is it secure under the definition of secure multi-party computation?multi-party computation?

� It reveals the number of items that are common

in the sites!

� Revealing innocuous information leakage allows a

more efficient algorithm than a fully secure

algorithm

Random Shares based Secure Union• Phase 1: random item addition

– Multiple rounds with permutated ring

– Each node sends a random share of its item set and a random share of a random

item set

• Phase 2: random item removal

– Each node subtracts its random items set

30

Random Shares based Secure Union -

Analysis

• Item exposure attack

– An adversary makes a claim C on a particular item a node i contributes to the final result (C: vi in xi)

• Set exposure attack

– An adversary makes a claim C on the whole set of – An adversary makes a claim C on the whole set of items a node i contributes to the final union result X (C: xi = ai).

• Change of beliefs (posterior probability and prior probability)

– P(C|IR,X) - P(C|X)

– P(C|IR,X)/P(C|X)

31

Exposure Risk – Set Exposure

• Disclosure decreases with increasing number of generated

random items and increasing number of participating nodes

• Set exposure risk is or close to 0 for probabilistic and crypto

approach

32

Exposure Risk – Risk Exposure

• Item exposure risk decreases with increasing number of

generated random items and participating nodes

• Item exposure risk for probabilistic approach is quite high

33

Cost Comparison

• Commutative protocol and anonymous communication protocol efficient but sensitive to union size

• Probabilistic protocol efficient but sensitive to domain size

• Estimated runtime for the general circuit-based protocol implemented by FairplayMP framework is 15 days, 127 days and 1.4 years for the domain sizes tested

34

Multi-round protocols

• Multi-round probabilistic protocols for

max/min and top k (Xiong ‘07)

Protocol Structure

• Random response

(Warner 1965)

• Multi-round

randomized protocol

Social Survey

D1 D2

Output Input

36

randomized protocol

– Randomized local

computation

– Multi-node

anonymity

• Assumption: semi-

honest model

Private

Data

D1

Private

Data

D3

Private

Data

D2

Private

Data

Dn

…

OutputInput

Input

InputOutput

Output

Local

Computation

Local

Computation

Local

Computation

Local

Computation

Preserving Privacy for outsourcing aggregation services, Xiong, 2007

A Naïve Max/Min Protocol

igi-1 gi

vi 1 2

30 10

30

3040

start

37

gi-1>=vi gi-1<vi

gi gi-1 vi34

20 40

30

40

40

� Add in randomization – how, when, and how

much?

• Random response at node i:

Max Protocol – Random response

gi-1>=vi gi-1(r)<vi

gi(r) gi-1(r) w/ prob Pr:

random numberigi-1 gi

v

38

w/ prob 1-Pr:

vi

vi

• Multiple rounds

• Randomization Probability at round r :

– Pr(r) =

• Local algorithm at round r and node i:

Max Protocol – multi-round random

response

1

0*−rdP

• Local algorithm at round r and node i:

39

gi-1(r)>=vi gi-1(r)<vi

gi(r) gi-1(r) w/ prob Pr:

rand [gi-1(r), vi)

w/ prob 1-Pr:

vi

igi-1(r) gi(r)

vi

Max Protocol - Illustration

Start18 3532

D2D2

30 10

0

40

32 4035

D3D4

30

20 40

10

18 3532

32 4035

PrivateTopK Protocol

Gi’(r)=topk(Gi-1(r) U Vi);

Vi’ = Gi’(r) – Gi-1’(r);

m = |Vi’|;

if m=0 then

Gi(r)= Gi-1(r);

else60

80

100

125

145

110

130

150

2/28/2012 41

else

with probability 1-Pr(r):

Gi(r)= Gi-1(r)

with probability Pr:

Gi(r)[1:k-m] = Gi-1(r)[1:k-m]

Gi(r)[k-m+1:k] = a sorted list of m

random values generated from

[min(Gi’(r)[k]-delta,Gi-1(r-1)[k-m+1]),

Gi’(r)[k])

end

Gi(r)

Gi-1(r)

10

Vi

40

50

80

110

D1

…

D2

Dn

Min/Max Protocol - Correctness

• Precision bound:

– Converges with r

– Smaller p0 and d provides faster convergence

2

)1(

01*1)Pr(1

−

=−≥−∏

rrrr

jdPj

42

Min/Max Protocol - Cost

• Communication cost

– single round: O(n)

– Minimum # of rounds given

precision guarantee (1-e):

43

Min/Max Protocol - Security

• Probability/confidence based metric: P(C|IR,R)

– Different types of exposures based on claim

• Data value: v =a

0.50 1

Absolute Privacy Provable Exposure

44

• Data value: vi=a

• Data ownership: Vi contains a

– Loss of Privacy (LoP) = | P(C|IR,R) – P(C|R) |

• Information entropy based metric:

– Loss of privacy as a measure of randomness of information: H(D|R) -

H(D|IR,R)

Min/Max Protocol – Security (Analysis)

• Upper bound for average expected LoP:

max r 1/2r-1 * (1-P0*dr-1)

• Larger p0 and d provides better privacy

45

Min/Max Protocol – Security (Experiments)

46

• Loss of privacy decreases with increasing number of nodes

• Probabilistic protocol achieves better privacy (close to 0)

• When n is large, anonymous protocol is actually okay!

Union

• Commutative encryption based approach

– Number of rounds: 2 rounds

– Each round: encryption and decryption

• Multi-round random-response approach?• Multi-round random-response approach?

Vector

1

0b1

b2

…

p1

0

1

…

p2

1

0

…

pc

OR OR OR… =

1

1

VG

…

• Each database has a boolean vector of the data items

• Union vector is a logical OR of all vectors

0bL

…

0

…

0

…

0

…

Privacy Preserving Indexing of Documents on the Network, Bawa, 2003

Group Vector Protocol

…

Pex=1/2r, Pin=1-Pex

for(i=1; i<L; i++)

if (Vs[i]=1 and VG’[i]=0)

Processing of VG’ at ps of round r…

0

1

0

v1

0

0

1

…

v2

0

1

0

…

vc

0

0

0

…

vG’

0

0

1

…

vG’

r=1, Pex=1/2, Pin=1/2


Set VG’[i]=1 with prob. Pin


Set VG’[i]=0 with prob. Pex

v1v2

vc

r=2, Pex=1/4, Pin=3/40

0

1

…

vG’

0

1

1

…

vG’

0

1

1

…

vG’

0

0

1

…

vG’

0

1

1

…

vG’

p1p2 pc

Open issues

� Tradeoff between accuracy, efficiency, and security

� How to quantify security

� How to design adjustable protocols

� Can we generalize the algorithms for a set of � Can we generalize the algorithms for a set of

operators based on their properties

� Operators: sum, union, max, min …

� Properties: commutative, associative, invertible,

randomizable

Secure Functionalities

• Secure Comparison: Comparing two integers without revealing the integer values.

• Secure Polynomial Evaluation: Party A has polynomial P(x) and Part B has a value b, the polynomial P(x) and Part B has a value b, the goal is to calculate P(b) without revealing P(x) or b

• Secure Set Intersection: Party A has set SA A

and Party B has set and Party B has set SB , the goal is to calculate

without revealing anything else. A BS S∩

Secure Functionalities Used

• Secure Set Union: Party A has set SA A and Party and Party

B has set B has set SB , the goal is to calculate

without revealing anything else.

A BS S∪

• Secure Dot Product: Party A has a vector X

and Party B has a vector Y. The goal is to

calculate X.Y without revealing anything else.

•Secure Sum

•Secure Comparison

•Association Rule Mining

•Decision Trees

Data Mining on Horizontally

Partitioned DataSpecific Secure Tools

•Secure Union

•Secure Logarithm

•Secure Poly. Evaluation

•EM Clustering

•Naïve Bayes Classifier

•Secure Comparison

•Secure Set Intersection

•Association Rule Mining

•Decision Trees

Data Mining on Vertically

Partitioned DataSpecific Secure Tools

•Secure Dot Product

•Secure Logarithm

•Secure Poly. Evaluation

•K-means Clustering

•Naïve Bayes Classifier

•Outlier Detection

Summary of SMC Based PPDDM

• Mainly used for distributed data mining.

• Provably secure under some assumptions.

• Efficient/specific cryptographic solutions for many distributed data mining problems are developed.developed.

• Mainly semi-honest assumption(i.e. parties follow the protocols)

Ongoing research

• New models that can trade-off better

between efficiency and security

• Game theoretic / incentive issues in PPDM

• Combining anonymization and cryptographic

techniques for PPDM