
Information Sciences 177 (2007) 5468–5483

www.elsevier.com/locate/ins

Two methods for privacy preserving data mining with malicious participants

Divyesh Shah, Sheng Zhong *

Department of Computer Science and Engineering State University of New York at Buffalo, Amherst, NY 14260, USA

Received 6 December 2006; received in revised form 3 July 2007; accepted 13 July 2007

Abstract

Privacy preserving data mining addresses the need of multiple parties with private inputs to run a data mining algorithm and learn the results over the combined data without revealing any unnecessary information. Most of the existing cryptographic solutions to privacy-preserving data mining assume semi-honest participants. In theory, these solutions can be extended to the malicious model using standard techniques like commitment schemes and zero-knowledge proofs. However, these techniques are often expensive, especially when the data sizes are large. In this paper, we investigate alternative ways to convert solutions in the semi-honest model to the malicious model. We take two classical solutions as examples, one of which can be extended to the malicious model with only slight modifications while another requires a careful redesign of the protocol. In both cases, our solutions for the malicious model are much more efficient than the zero-knowledge proofs based solutions.
© 2007 Elsevier Inc. All rights reserved.

Keywords: Data mining; Privacy; Malicious model; Attacks; Protocols

1. Introduction

Data mining techniques [27–29] have helped in identifying trends from large databases, allowing the miner to learn useful trends not obvious from the large transactional data. However, concerns over data privacy have limited its applicability in the case of multiple data providers with confidential data. They wish to learn global data mining results from the union of their databases, but without revealing their data.

Extensive work has been done in this area over the past few years, resulting in many specialized solutions extending existing data mining algorithms to meet the privacy-preserving requirement [3,6,10,12,13,16,17,20,22,24,25,30]. Most of this work assumes the participating parties to be semi-honest, i.e., they will not deviate from the protocol, but might be curious to extract any extra information from the messages that they see during the protocol execution. Another model that can be considered is the malicious model, where the parties could arbitrarily deviate from the protocol to derive private information that is not leaked by the protocol.

0020-0255/$ - see front matter © 2007 Elsevier Inc. All rights reserved.

doi:10.1016/j.ins.2007.07.013

* Corresponding author. Tel.: +1 716 645 3180x107; fax: +1 716 645 3464. E-mail address: [email protected] (S. Zhong).


In [9], a systematic method is described for converting protocols that are secure in the semi-honest model to ones that are equally secure and privacy-preserving in the malicious model, with the use of commitment schemes and zero-knowledge proofs, for both the two-party and multi-party cases. While this result is theoretically important, the techniques used are often expensive and increase the computational and communication overhead of the protocol. The problem we have identified here is that these techniques enforce strict protocol emulation when extending to the malicious model, preventing any kind of malicious behavior at all. However, we take a more privacy-oriented view of this problem. We notice that a large fraction of this malicious behavior does not compromise the privacy of any of the honest parties. There is only a fraction of malicious attacks against which the parties need to be protected to ensure that no data provider's privacy is compromised.

In this paper, we present alternative solutions in the malicious model with more emphasis on privacy. We discuss two well-known previously proposed protocols, which assumed semi-honest behavior, and how they can be made equally privacy-preserving in the malicious model. First, we show that the Vaidya–Clifton algorithm [22] for privacy-preserving secure scalar product, when slightly modified, is equally privacy preserving in the presence of malicious participants. Next, we describe an improved version of the Kantarcıoglu–Clifton private distributed k-nn classifier [17] for the malicious model. We also consider a few more example protocols to demonstrate the general applicability of our proposed approach.

The remainder of the paper is organized as follows: In Section 2, we discuss related work and relevant background. We present our definitions and assumptions in Section 3. In Section 4, we show that the protocol in [22], when slightly modified, is equally privacy preserving in the malicious model as well. Section 5 describes an extension to the protocol proposed in [17], along with detailed communication and computation overhead analysis for the improved protocol, as well as a comparison between the previous version and our improved version. With a few more brief examples in Section 6, we show how our contributions in Sections 4 and 5 can be generalized and applied to other privacy preserving data mining protocols. Finally, in Section 7, we conclude with pointers to possible future work.

2. Background and related work

Privacy preserving data mining has aroused a lot of interest among researchers, and as a result there has been a lot of active work in this field in recent years.

Randomization approaches [3,6,10] were among the early attempts at preserving privacy while mining data from different data providers. These methods perturb the private input data with random values and, from the distribution of this perturbed data, try to recover the original distribution of the private data. However, the original distribution cannot be exactly reproduced, and this is what introduces a privacy vs. accuracy trade-off in these and other similar methods. Of course, there are a few perturbation techniques that lead to good privacy plus good accuracy in specific data mining problems. LeFevre et al. [19] describe various algorithms for data anonymization based on a target class of workloads.
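As a toy illustration of the randomization idea (not any specific published scheme; the distributions and sample size below are arbitrary choices), additive zero-mean noise hides individual values while aggregate statistics remain approximately recoverable:

```python
import random

random.seed(0)  # deterministic run for illustration

# Hypothetical private values held by the data providers
original = [random.gauss(50, 10) for _ in range(100_000)]

# Each provider perturbs its value with zero-mean uniform noise before sharing
perturbed = [x + random.uniform(-25, 25) for x in original]

# The miner can still estimate aggregate statistics from the perturbed data,
# because the noise averages out over many records...
est_mean = sum(perturbed) / len(perturbed)
true_mean = sum(original) / len(original)
assert abs(est_mean - true_mean) < 0.5

# ...but any individual perturbed value may be far from the original value,
# which is exactly the privacy vs. accuracy trade-off discussed above.
```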

The secure two/multi-party computation approach has been very widely used. Yao [26] presents a generic secure circuit evaluation protocol for computing any general functionality. However, as this is quite expensive, other protocols [12,13,16,17,20] use it to carry out only a small part of the total computation.

Goldreich [9] describes in detail a formal approach to enforce semi-honest behavior even in the presence of malicious parties. The protocol required to force semi-honest behavior consists of three phases: input-commitment, coin-generation and protocol-emulation. These phases use techniques like commitment schemes and zero-knowledge proofs and thus are often expensive. Zhong et al. [24] use zero-knowledge proofs to extend their proposed protocol to work in the case where any of the miner and/or respondents could be malicious. Brickell and Shmatikov [4] present an efficient anonymous data collection protocol which does not rely on zero-knowledge proofs to be secure in the malicious model. The problem addressed is different from the secure multi-party computation problem: here the miner should learn the inputs but should not be able to link them to respondents, whereas in secure multi-party computation parties do not wish others to learn their input values.

Jiang and Clifton [15] describe a framework for transforming semi-honest protocols such that any malicious behavior by participants in the malicious model can be accounted for. It enables liability for malicious behavior to be assigned to the responsible party. However, this works only as a detection and accountability framework and does not disallow malicious behavior during the protocol execution. It is possible to hold a party responsible for malicious behavior only after the protocol has been executed. In many cases this could be too late, because the privacy of the honest parties would have been compromised already. The framework presented in this paper preserves the privacy of the parties by disallowing any disclosure due to malicious attacks. In Section 3, we discuss some attacks which are unavoidable in the malicious model, like input substitution, where the malicious party substitutes its original input and participates in the protocol using this substituted input. While these attacks cannot be prevented by our proposed protocol extensions, the framework proposed in [15] cannot deal with them either.

We also use an order-preserving encryption scheme (OPES) in one of our extended protocols. Agrawal et al. [1] describe an OPES for numerical data such that the order of the plaintexts is preserved when they are encrypted using such a scheme.
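To convey the order-preserving property only, here is a deliberately naive sketch: a random strictly increasing lookup table. This is a toy for intuition, not the Agrawal et al. [1] construction, and it is not secure.

```python
import random

random.seed(1)

def make_toy_opes(domain_size):
    """Build a random strictly increasing mapping: each ciphertext is a
    cumulative sum of positive random gaps, so plaintext order is preserved."""
    table, total = [], 0
    for _ in range(domain_size):
        total += random.randint(1, 100)  # positive gap => strictly increasing
        table.append(total)
    return lambda v: table[v]

enc = make_toy_opes(256)
values = [5, 17, 42, 200, 3]
# Comparisons on ciphertexts agree with comparisons on plaintexts,
# which is what lets an untrusted party sort or compare encrypted values.
assert all((a < b) == (enc(a) < enc(b)) for a in values for b in values)
```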

3. Malicious model vs. semi-honest model

As we mentioned in Section 1, there are two models: semi-honest and malicious. A semi-honest participant will not deviate from the protocol, but will only try to extract any possible extra information from the messages that it sees as the privacy-preserving data mining algorithm executes. On the other hand, a malicious adversary can arbitrarily deviate from the protocol. Goldreich describes a standard technique in [9] which enforces strict protocol emulation for each of the participants, thus disallowing any (theoretically avoidable) deviation from the protocol, but often making it expensive.

As mentioned in [9], standard cryptographic techniques provide security in the malicious model, and here security is a stronger notion than privacy. Any secure protocol, in our context, is definitely privacy-preserving, but the reverse is not necessarily true. We argue that, in the context of privacy-preserving data mining, we are more concerned about privacy. Therefore, we can use more light-weight solutions to replace standard cryptographic solutions.

Before we discuss possible solutions in the malicious model, we briefly review this model.

3.1. Attacks in malicious model

In the malicious model, some attacks can never be avoided, even when the standard cryptographic solutions based on zero-knowledge proofs are used. These attacks include:

1. Parties refusing to participate in the protocol.
2. Parties substituting their local input.
3. Parties aborting the protocol prematurely.

(To see why this is the case, interested readers can refer to [9].) Since these attacks will always exist, attempting to prevent them would be futile. Instead, we define a protocol to be secure in the malicious model if only unavoidable attacks can be launched by the adversary.

Of course, there are also many avoidable attacks in the malicious model. Among the avoidable attacks, some can be trivially dealt with, while others cannot. The trivial attacks include sending messages when the receiver is not expecting any, sending extra messages, sending duplicate messages, skipping messages, not sending messages at all, sending messages in a different format than expected, and sending messages with values outside the accepted range. Most of these trivial attacks can be dealt with by one of the following actions:

• Ignoring any extra/duplicate messages.
• Timing out and aborting the protocol on not receiving a message within a timeout period.
• Aborting the protocol on receiving an invalid or incorrectly formatted message.
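These countermeasures amount to a small amount of defensive code wrapped around each receive step. A sketch follows; the message format, the `id` field, and the timeout value are assumptions for illustration, not part of any specific protocol in this paper.

```python
import queue

class ProtocolAbort(Exception):
    """Raised to abort the protocol on a timeout or an invalid message."""

def receive_expected(inbox, validate, seen_ids, timeout=30.0):
    """Receive the next designated message, applying the three actions above:
    ignore extra/duplicate messages, time out on silence, abort on invalid ones."""
    while True:
        try:
            msg = inbox.get(timeout=timeout)
        except queue.Empty:
            raise ProtocolAbort("timed out waiting for a message")
        if msg.get("id") in seen_ids:
            continue  # ignore extra/duplicate messages
        if not validate(msg):
            raise ProtocolAbort("invalid or out-of-range message")
        seen_ids.add(msg["id"])
        return msg

# Usage: duplicates are silently skipped; out-of-range values would abort.
inbox, seen = queue.Queue(), set()
for m in [{"id": 1, "v": 7}, {"id": 1, "v": 7}, {"id": 2, "v": 9}]:
    inbox.put(m)
first = receive_expected(inbox, lambda m: 0 <= m["v"] < 100, seen, timeout=1)
second = receive_expected(inbox, lambda m: 0 <= m["v"] < 100, seen, timeout=1)
assert (first["id"], second["id"]) == (1, 2)
```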

Unlike the above trivial attacks, the nontrivial attacks replace designated messages in the protocol with something else. These are clearly the strongest avoidable attacks in the malicious model. Consequently, hereafter we ignore the trivial attacks and only consider how to avoid the message replacing attacks.


In summary, we have the following simple taxonomy of attacks in the malicious model:

• Unavoidable Attacks: If these are the only attacks possible, then we consider the protocol to be secure.
• Avoidable Attacks:
  – Trivial Attacks: easy to deal with; we ignore them.
  – Message Replacing Attacks: These are essentially the only attacks we need to deal with when we extend a solution to the malicious model.

Given the above taxonomy of attacks, now we can discuss the types of solutions in the malicious model.

3.2. Solutions with slight modifications

Some solutions, when extended to the malicious model, can naturally deal with message replacing attacks, and thus only simple modifications are required. For example, the protocol presented in [22] requires only slight modifications, as shown in Section 4. Section 6 discusses the protocols presented in [7,23], which also require minor modifications. Of course, it is not always possible to develop a solution with only slight modifications.

3.3. Solutions with significant revisions

Many other solutions require a careful and significant revision to extend them to the malicious model. We present an example in Section 5, where our improved version of the protocol in [17] is a significant rework of the original protocol. Another example protocol, presented in [5], is discussed in Section 6.

3.4. Solutions requiring stricter notions of security and zero-knowledge proofs

The above two techniques might prove insufficient for preserving privacy in some protocols; some protocols require the use of more expensive cryptographic techniques for extension to the malicious model. This category of modification techniques has been studied extensively in the field of cryptography, and we do not discuss it further until Section 6.

With this view in mind, we say that our work presented in Sections 4 and 5 is equally privacy-preserving in the malicious model as in the semi-honest model (for the existing protocols).

4. Privacy-preserving scalar product protocol

Here, we study the Vaidya–Clifton protocol for computing the scalar product privately, considering one or more of the participants to be malicious instead of semi-honest. This scalar product protocol is the privacy-preserving component of their algorithm for mining association rules from vertically partitioned data between two parties in [22]. However, as we prove below, their protocol (when slightly modified) turns out to be equally privacy preserving in the presence of malicious participants as in the case where the participants are semi-honest. This result makes the protocol applicable anywhere a scalar product needs to be computed privately (for example, [12,13]), not restricted to association rules mining.

In Section 4.1, we describe their scalar product protocol, which assumes semi-honest parties, and in Section 4.2, we prove that, with certain modifications, this protocol is equally privacy-preserving in the malicious model as well.

4.1. Existing privacy-preserving scalar product protocol

Vaidya and Clifton [22] describe an efficient scalar product protocol which the authors use as the major privacy-preserving component in their algorithm for mining association rules from vertically partitioned data between two parties. Their algorithm finds the frequent itemsets from the partitioned data. The basic idea is as follows:

Generate candidate itemsets using the apriori-gen function (described in [2]); for each itemset, if the attributes in the itemset are distributed between the two parties, use the scalar product protocol to compute the global support; otherwise, compute the support locally at the owning site.
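To make the reduction concrete: for a 2-itemset whose attributes are split across the parties, the global support is exactly the scalar product of the two boolean transaction columns. The data below is a toy example of our own, not from [22]:

```python
# One bit per transaction: does the transaction contain the item?
X_a = [1, 0, 1, 1, 0, 1]  # party A's column for its attribute
Y_b = [1, 1, 0, 1, 0, 1]  # party B's column for its attribute

# Support of the 2-itemset = number of transactions containing both items,
# i.e., the scalar product of the two boolean vectors.
support = sum(x * y for x, y in zip(X_a, Y_b))
assert support == 3
```

This is why a privacy-preserving scalar product suffices to compute global supports without either party revealing its column.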

Below we describe this scalar product protocol in brief, in terms of the messages exchanged between the two parties to compute the scalar product securely.

Goal: To compute $\vec{X} \cdot \vec{Y}$ ensuring privacy of the inputs of both A and B.

Require: Two semi-honest parties A and B; a coefficient matrix $C$ known to both; A's private input $I_A$ is $\{\vec{X}, R\}$, where $\vec{X}$ contains $n$ boolean values and $R$ is an $n$-dimensional random vector, i.e., $n$ values chosen randomly by A, which simulates reading $n$ values from a random tape; B's private input $I_B$ is $\{\vec{Y}, R'\}$, where $\vec{Y}$ contains $n$ boolean values and $R'$ is an $r$-dimensional ($r < n$) random vector.¹

Also, $M$ is the set of messages exchanged, with $M = f(I_A, I_B)$, where $M$ depends on the inputs of the two parties and $f$ is a function determined by the protocol, and $M = M_a \cup M_b$, where $M_a$ and $M_b$ are the message ensembles of A and B, respectively. Message $M_{a1}$ consists of a vector $X'$; $M_{b1}$ consists of a sum $S$ and a vector $S'$. Similarly, message $M_{a2}$ consists of a sum $\mathit{Temp}$ and a vector $R_r$, which is equal to $R$ reduced to $r$ values by summing up consecutive $n/r$ values in $R$, and $M_{b2}$ is the final scalar product result.

1. Message $M_{a1}\{X'\}$: A $\to$ B
   $$X' = X + CR$$
2. Message $M_{b1}\{S, S'\}$: B $\to$ A
   $$S = X'^T Y = X^T Y + R^T C^T Y$$
   $$S' = C^T Y + R''$$
   where $R''$ is $R'$ with each value repeated $n/r$ times.
3. Message $M_{a2}\{\mathit{Temp}, R_r\}$: A $\to$ B
   $$\mathit{Temp} = S - R^T S' = (X^T Y + R^T C^T Y) - R^T S' = (X^T Y + R^T C^T Y + R^T R'' - R^T R'') - R^T S'$$
   $$= (X^T Y + R^T \underbrace{(C^T Y + R'')}_{S'} - R^T R'') - R^T S' = (X^T Y + R^T S' - R^T R'') - R^T S' = X^T Y - R^T R''$$
   $$= X^T Y - R_r^T R' \quad \text{since } R_r^T R' = R^T R''$$
4. Message $M_{b2}\{SP\}$: B $\to$ A
   $$X^T Y = \mathit{Temp} + R_r^T R'$$
   $$SP = \vec{X} \cdot \vec{Y} = \sum_i x_i y_i = X^T Y$$

In [7], Goethals et al. discuss the privacy guarantee of the Vaidya–Clifton scalar product protocol. They show that this protocol does not achieve its privacy goal in all scenarios and that it is possible for one of the parties to retrieve the other party's inputs with high probability. However, we still consider it useful to study the Vaidya–Clifton protocol and extend it to the malicious case for the following reasons:

¹ $R$ was specified as a random vector chosen by A, but as in the standard literature of theoretical computer science, we treat such random values as input from a "random tape".
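As a sanity check on the four-message flow of the Vaidya–Clifton protocol just described, the following sketch simulates both parties honestly over a prime field; the modulus and dimensions are arbitrary choices for illustration, and the messages are just local variables rather than real network traffic.

```python
import random

random.seed(2)
p = 2**31 - 1          # prime modulus; all arithmetic is over Z_p
n, r = 12, 4           # r must divide n

def dot(a, b):
    return sum(x * y for x, y in zip(a, b)) % p

# Shared coefficient matrix C, known to both parties
C = [[random.randrange(p) for _ in range(n)] for _ in range(n)]

# Private inputs: boolean data vectors plus random vectors
X = [random.randrange(2) for _ in range(n)]    # A's boolean vector
Y = [random.randrange(2) for _ in range(n)]    # B's boolean vector
R = [random.randrange(p) for _ in range(n)]    # A's random tape
Rp = [random.randrange(p) for _ in range(r)]   # B's random vector R'

# Message Ma1: X' = X + C R
CR = [dot(row, R) for row in C]
Xp = [(x + cr) % p for x, cr in zip(X, CR)]

# Message Mb1: S = X'^T Y and S' = C^T Y + R''
S = dot(Xp, Y)
Rpp = [Rp[i // (n // r)] for i in range(n)]    # R'' = R' repeated n/r times
CtY = [sum(C[i][j] * Y[i] for i in range(n)) % p for j in range(n)]
Sp = [(cty + rpp) % p for cty, rpp in zip(CtY, Rpp)]

# Message Ma2: Temp = S - R^T S' and Rr = blockwise sums of R
Temp = (S - dot(R, Sp)) % p
Rr = [sum(R[k * (n // r):(k + 1) * (n // r)]) % p for k in range(r)]

# Message Mb2: SP = Temp + Rr^T R', which equals X^T Y
SP = (Temp + dot(Rr, Rp)) % p
assert SP == sum(x * y for x, y in zip(X, Y))
```

The final assertion checks the telescoping identity of step 3: the $R^T R''$ blinding term cancels against $R_r^T R'$, leaving exactly $X^T Y$.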


• Goethals et al. show that the Vaidya–Clifton protocol is insecure when Bob's input vector has few 1s. But they have not proved the protocol to be insecure in the other cases (in fact, most cases). If the application of the protocol is restricted to these other cases, the Vaidya–Clifton protocol or its extension to the malicious model may remain useful.

• The practical implication of our result (on the Vaidya–Clifton protocol) is that, when we use the Vaidya–Clifton protocol with a malicious adversary, it has the same privacy guarantees as with a semi-honest adversary. For the cases in which it is not private with a semi-honest adversary, our results indicate that there is no extra disclosure when we extend the protocol to the malicious model.

Therefore, despite the fact that the Vaidya–Clifton protocol is flawed, we still study how to extend it to the malicious model.

4.2. Proof of equal privacy-preserving property for malicious parties

In this section, we show that a slightly modified version of the Vaidya–Clifton protocol for scalar product does not compromise any more privacy even in the presence of a malicious adversary A (or B). This implies that we do not need to redesign the entire protocol for the malicious model.

The slight modifications we introduce here are mainly those we described in Section 3: ignoring extra messages, timing out on not receiving a message, and aborting the protocol on invalid messages. There is an additional modification we would like to make: assuming that we encode the involved values using elements of a large finite field, we require that the entries of matrix $C$ be chosen uniformly and independently from the field.

We show our privacy guarantee by proving that any message replacing attack launched by any of the parties is equivalent to that party substituting its local input, which no protocol can avoid, as pointed out in Section 3. (Why does this make sense? Recall that a protocol is defined to be secure in the malicious model if only unavoidable attacks can be launched. When all message replacing attacks are equivalent to certain unavoidable attacks, there is no additional threat introduced by the malicious model.²) So, with the exception of the three events specified in Section 3, no other malicious behavior leads to any more privacy loss than in the semi-honest model.³

Denote by $M = \{M_a, M_b\}$ the ensemble of messages exchanged by two parties A and B in the privacy-preserving scalar product protocol when both parties follow the protocol, A has input $I_A$ and B has input $I_B$, where $M_a$ is the ensemble of messages sent by A and $M_b$ is the ensemble of messages sent by B. Clearly, A decides her message ensemble $M_a$ based on her input $I_A$ and the message ensemble $M_b$ she receives. Similarly, B decides her message ensemble $M_b$ based on her input $I_B$ and the message ensemble $M_a$ she receives.

Thus, when both parties follow the protocol, the message ensemble is completely determined by the inputs. Consequently, we have $M_a = f_A(I_A, I_B)$ and $M_b = f_B(I_A, I_B)$, where $f_A$ and $f_B$ are functions determined by the protocol.

Now we consider the case that A or B is malicious. Denote by $\widetilde{M} = \{\widetilde{M}_a, \widetilde{M}_b\}$ the message ensemble exchanged by A and B when one of them is malicious. If A is malicious, then $\widetilde{M}_a$ is an arbitrary message ensemble generated by malicious A, and $\widetilde{M}_b$ is B's message ensemble depending on $I_B$ and $\widetilde{M}_a$. Similarly, if B is malicious, then $\widetilde{M}_b$ is an arbitrary message ensemble generated by malicious B, and $\widetilde{M}_a$ is A's message ensemble depending on $I_A$ and $\widetilde{M}_b$.

Using the above notations, we can now prove that the protocol does not compromise any more privacy if A is malicious.

² A good analogy is the following scenario: a secret is known by two people. It would be an unavoidable attack if one of them wants to reveal the secret to the adversary. If we can show that all other attacks are equivalent to one of them revealing the secret, then the system is essentially secure.

³ Such a proof of security is somewhat less intuitive. Some readers may prefer seeing a quantification of what information can be reconstructed from the messages and results. What we prove in this paper actually implies that the slightly modified protocol protects against malicious parties all the information that cannot be derived in the semi-honest model when the Vaidya–Clifton protocol is used. Specifically, in most cases the exact values of data points are protected; see the discussion after Theorem 4.1.


Theorem 4.1 (Malicious A). In the modified version of the Vaidya–Clifton protocol for privacy-preserving scalar product, for every message ensemble $\widetilde{M}_a$ generated by malicious A and all legal inputs $I_B$ of B, there exists $\widetilde{I}_A$ such that
$$(\widetilde{M}_a, \widetilde{M}_b) \cong (f_A(\widetilde{I}_A, I_B), f_B(\widetilde{I}_A, I_B)),$$
where $\cong$ denotes that two message ensembles are identically distributed.

Proof. Suppose that malicious A generates a message ensemble $\widetilde{M}_a = \{\widetilde{M}_{a1}, \widetilde{M}_{a2}\}$, where $\widetilde{M}_{a2} = \{\widetilde{\mathit{Temp}}, \widetilde{R}_r\}$. We need to show that, for all $\widetilde{M}_{a1}$ and $\widetilde{M}_{a2}$, there exists $\widetilde{I}_A = \{\widetilde{X}, \widetilde{R}\}$ that satisfies the following three conditions:

1. $\widetilde{M}_{a1} = \widetilde{X} + C\widetilde{R}$;
2. $\widetilde{\mathit{Temp}} = S - \widetilde{R}^T S'$;
3. $\widetilde{R}_r$ is equal to $\widetilde{R}$ reduced to $r$ values by summing up consecutive $n/r$ values in $\widetilde{R}$.

Note that, if there is an $\widetilde{R}$ that satisfies conditions (2) and (3), we can always find an $\widetilde{X}$ to satisfy condition (1). Therefore, all we need is the existence of an $\widetilde{R}$ that satisfies conditions (2) and (3).

Now we rewrite conditions (2) and (3) as the following linear system:
$$
\begin{bmatrix}
1 & \cdots & 1 & 0 & \cdots & \cdots & \cdots & 0 \\
\vdots & & & \ddots & & & & \vdots \\
0 & \cdots & \cdots & \cdots & 0 & 1 & \cdots & 1 \\
S'_1 & S'_2 & S'_3 & \cdots & \cdots & \cdots & S'_{n-1} & S'_n
\end{bmatrix}
\begin{bmatrix}
\widetilde{R}_1 \\ \widetilde{R}_2 \\ \vdots \\ \widetilde{R}_n
\end{bmatrix}
=
\begin{bmatrix}
\widetilde{R_r}_1 \\ \vdots \\ \widetilde{R_r}_r \\ S - \widetilde{\mathit{Temp}}
\end{bmatrix}
$$
where each $S'_i$ ($\widetilde{R}_i$, $\widetilde{R_r}_i$, resp.) is the $i$th element of vector $S'$ ($\widetilde{R}$, $\widetilde{R}_r$, resp.). This linear system always has a solution if we can show that the coefficient matrix on the left side is of rank $r + 1$.

Clearly, the first $r$ row vectors of this matrix are linearly independent. The last row vector is essentially $S'^T = Y^T C + R''^T$. Since the entries of $C$ are chosen uniformly and independently, the probability of $S'^T$ lying in the span of the first $r$ row vectors is negligible. Therefore, with high probability,⁴ the coefficient matrix is of rank $r + 1$ and the linear system has a solution.

Now, let $\widetilde{X} = \widetilde{M}_{a1} - C\widetilde{R}$, where $\widetilde{R}$ is a solution to the above linear system. Thus, we have $\widetilde{I}_A = \{\widetilde{X}, \widetilde{R}\}$ that satisfies the three conditions we listed.

Thus, $\widetilde{M}_{a1}$ and $\widetilde{M}_{a2}$ would be the "normal" messages received by B if malicious A had replaced her input with $\{\widetilde{X}, \widetilde{R}\}$. Further, B being honest (or semi-honest), her response $\widetilde{M}_b$ depends only on the messages she receives from A ($\widetilde{M}_a$) and on $I_B$, and hence only on $\widetilde{I}_A$ and $I_B$.

Therefore, for malicious A, we can say that
$$(\widetilde{M}_a, \widetilde{M}_b) \cong (f_A(\widetilde{I}_A, I_B), f_B(\widetilde{I}_A, I_B)). \qquad \square$$

Next, we prove that the protocol does not compromise any more privacy if B is malicious.
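The rank argument in the proof of Theorem 4.1 can be checked numerically: build the $(r+1) \times n$ coefficient matrix from the $r$ block-indicator rows plus the $S'$ row, and compute its rank over $\mathbb{Z}_p$. The Gaussian elimination below is a generic illustration of our own, with a uniformly random row standing in for $S' = C^T Y + R''$ (justified because the entries of $C$ are uniform).

```python
import random

random.seed(3)
p = 2**31 - 1
n, r = 12, 4

def rank_mod_p(M, p):
    """Rank of matrix M over the field Z_p via Gaussian elimination."""
    M = [[x % p for x in row] for row in M]
    rank, rows, cols = 0, len(M), len(M[0])
    for col in range(cols):
        pivot = next((i for i in range(rank, rows) if M[i][col]), None)
        if pivot is None:
            continue
        M[rank], M[pivot] = M[pivot], M[rank]
        inv = pow(M[rank][col], p - 2, p)  # multiplicative inverse mod p
        M[rank] = [x * inv % p for x in M[rank]]
        for i in range(rows):
            if i != rank and M[i][col]:
                f = M[i][col]
                M[i] = [(a - f * b) % p for a, b in zip(M[i], M[rank])]
        rank += 1
        if rank == rows:
            break
    return rank

# First r rows: indicators of the consecutive n/r blocks of R
A = [[1 if k * (n // r) <= i < (k + 1) * (n // r) else 0 for i in range(n)]
     for k in range(r)]
# Last row: S', modeled as uniformly random over Z_p
A.append([random.randrange(p) for _ in range(n)])

assert rank_mod_p(A, p) == r + 1  # full rank, so the linear system is solvable
```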

Theorem 4.2 (Malicious B). In the modified version of the Vaidya–Clifton protocol for privacy-preserving scalar product, for every message ensemble $\widetilde{M}_b$ generated by malicious B and all legal inputs $I_A$ of A, there exists $\widetilde{I}_B$ such that
$$(\widetilde{M}_a, \widetilde{M}_b) \cong (f_A(I_A, \widetilde{I}_B), f_B(I_A, \widetilde{I}_B)),$$
where $\cong$ denotes that two message ensembles are identically distributed.

⁴ For a formal definition of high probability, see Ref. [8].


Proof. Suppose that malicious B generates a message ensemble $\widetilde{M}_b = \{\widetilde{M}_{b1}, \widetilde{M}_{b2}\}$, where $\widetilde{M}_{b1} = \{\widetilde{S}, \widetilde{S'}\}$. We need to show that, for all $\widetilde{M}_{b1}$ and $\widetilde{M}_{b2}$, there exists $\widetilde{I}_B = \{\widetilde{Y}, \widetilde{R'}\}$ that satisfies the following three conditions:

1. $\widetilde{S} = X'^T \widetilde{Y}$;
2. $\widetilde{SP} = X^T \widetilde{Y}$;
3. $\widetilde{S'} = C^T \widetilde{Y} + \widetilde{R''}$, where $\widetilde{R''}$ is $\widetilde{R'}$ with each value repeated $n/r$ times.

We need to show the existence of $\widetilde{Y}$, $\widetilde{R'}$ that satisfy the three conditions above. We rewrite conditions (1), (2) and (3) as the following linear system:
$$
\begin{bmatrix}
X'_1 & \cdots & \cdots & X'_n & 0 & \cdots & \cdots & 0 \\
x_1 & \cdots & \cdots & x_n & 0 & \cdots & \cdots & 0 \\
c_{1,1} & c_{2,1} & \cdots & c_{n,1} & 1 & 0 & \cdots & 0 \\
\vdots & & & \vdots & \vdots & & & \vdots \\
c_{1,n/r} & c_{2,n/r} & \cdots & c_{n,n/r} & 1 & 0 & \cdots & 0 \\
c_{1,n/r+1} & c_{2,n/r+1} & \cdots & c_{n,n/r+1} & 0 & 1 & \cdots & 0 \\
\vdots & & & \vdots & \vdots & & & \vdots \\
c_{1,n} & c_{2,n} & \cdots & c_{n,n} & 0 & \cdots & 0 & 1
\end{bmatrix}
\begin{bmatrix}
\widetilde{y}_1 \\ \vdots \\ \widetilde{y}_n \\ \widetilde{R'}_1 \\ \vdots \\ \widetilde{R'}_n
\end{bmatrix}
=
\begin{bmatrix}
\widetilde{S} \\ \widetilde{SP} \\ \widetilde{S'}_1 \\ \vdots \\ \widetilde{S'}_n
\end{bmatrix}
$$
where each $X'_i$ ($x_i$, $\widetilde{y}_i$, $\widetilde{R'}_i$, $\widetilde{S'}_i$, resp.) is the $i$th element of vector $X'$ ($X$, $\widetilde{Y}$, $\widetilde{R'}$, $\widetilde{S'}$, resp.) and each $c_{i,j}$ is the entry in the $i$th row and $j$th column of the matrix $C$. The coefficient matrix on the left side is of order $(n+2) \times 2n$. This linear system has $2n$ unknowns and $n+2$ equations and thus always has a solution if we can show that the coefficient matrix on the left side is of rank $n+2$.

Now, since the entries of $C$ are chosen uniformly and independently, the probability of the last $n$ row vectors of the coefficient matrix being linearly independent is very high. Also, the probability of the first two row vectors lying in the span of these $n$ row vectors is negligible. Therefore, with high probability, the coefficient matrix is of rank $n+2$ and the linear system has a solution.

Now, let $\widetilde{R''} = \widetilde{S'} - C^T\widetilde{Y}$, where $\{\widetilde{Y}, \widetilde{R'}\}$ is a solution to the above linear system. Thus, we have $\widetilde{I}_B = \{\widetilde{Y}, \widetilde{R'}\}$ that satisfies the three conditions we listed.

Thus, $\widetilde{M}_{b1}$ and $\widetilde{M}_{b2}$ would be the "normal" messages received by A if malicious B had replaced her input with $\{\widetilde{Y}, \widetilde{R'}\}$. Further, A being honest (or semi-honest), her response $\widetilde{M}_a$ depends only on the messages she receives from B ($\widetilde{M}_b$) and on $I_A$, and hence only on $I_A$ and $\widetilde{I}_B$.

Therefore, for malicious B, we can say that
$$(\widetilde{M}_a, \widetilde{M}_b) \cong (f_A(I_A, \widetilde{I}_B), f_B(I_A, \widetilde{I}_B)).$$

Also, if $\widetilde{M}_{b2} = \phi$ (the null/empty message), this is equivalent to early termination by B and deprives A of the final result. However, as mentioned earlier, we cannot hope to avoid this situation. $\square$

Remark. Clearly, the above theorems do not imply that all sensitive information is protected. In general, the slightly modified protocol protects in the malicious model all private information preserved by the Vaidya–Clifton protocol in the semi-honest model. Specifically, in most cases (i.e., when $X$ and $Y$ have neither too many nor too few 1s), the exact values of data points are protected. However, there are possibilities of information leakage:

• Any information implied by the result and the dishonest party's input is leaked to the dishonest party. For example, when $Y = (1, \ldots, 1)$, Bob learns the sum of the $X_i$'s.
• As pointed out in [7], if Bob's vector $Y$ has few 1s, then Alice can obtain at least half of $Y$'s coefficients.


5. k-nn classifier

Kantarcıoglu and Clifton [17] present an effective algorithm to privately compute a distributed k-nearest-neighbor classifier. In Section 5.1, we describe this algorithm, which assumes semi-honest parties; in Section 5.2, we discuss a couple of possible attacks that could violate privacy if the parties are malicious. Section 5.3 describes our modified protocol for the malicious case. In Section 5.4, we present extensive privacy, computation and communication analysis for our modified protocol, and finally, in Section 5.5, we present experimental results comparing the Kantarcıoglu–Clifton protocol with our improved protocol.

5.1. Existing privacy-preserving k-nn classification algorithm

The protocol in [17] requires n data providing sites, S1, ..., Sn; a permuting site Ss (where s may be in 1, ..., n); an untrusted non-colluding third party C with public key EC; a query originator site A with public key EA; a probabilistic public encryption scheme; and a query (x, d, k), where x is the point to be classified, d is the distance function and k is the number of nearest neighbors required. It returns to A the majority class of the k-nearest-neighbors.

1. For all sites Si, in parallel do
   (a) Select its k closest items to x and generate identifiers, (extended) distances and class values (encrypted with EA). Thus, the k local closest items from each site are encoded in Ri, a set of k values, each of the form {id, d, EA(c)}.
   (b) For each distance value, compute random shares of the result of the comparison with every other nk − 1 distances and encrypt with EC. Thus, the result of comparing the k local closest items with those of every other site is encoded in ERi, a set of nk values, each of the form {id, EC(v), EA(c)}.
   (c) Send ERi to the permuting site Ss.
2. Ss permutes the ERi and sends them to C.
3. C determines a total ordering of all the distances and finds the global k-nearest-neighbors.
4. C xors a random value ri into the class values of the k-nearest-neighbors and sends to A: EA(ci) ⊕ ri.
5. A decrypts the class values to get ci ⊕ ri.
6. A and C run a secure circuit evaluation protocol to find the majority class from these k values.
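The random-share idea behind step 1(b) can be illustrated with a minimal sketch (our own toy example; the protocol itself relies on a secure comparison protocol such as [11] to produce the shares without either party seeing both distances): the comparison bit is split into two XOR shares, so that neither share alone reveals the outcome, yet the shares jointly reconstruct it.

```python
import secrets

def share_comparison(d_i: int, d_j: int) -> tuple[int, int]:
    """Split the bit (d_i < d_j) into two random XOR shares."""
    b = int(d_i < d_j)
    r = secrets.randbits(1)      # one share is a uniformly random bit
    return r, b ^ r              # the other masks the comparison result

s1, s2 = share_comparison(17, 42)
# Each share alone is a uniform random bit; XORed together they
# reconstruct the comparison result.
assert s1 ^ s2 == int(17 < 42)
```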

5.2. Possible malicious attacks

Below we discuss a couple of possible attacks when one or more of the parties are malicious. We also point out how these attacks compromise the privacy-preserving property of the protocol.

Attack 1: The permuting site Ss, being one of S1, ..., Sn, can execute a malicious attack on any particular site Sj that violates Sj's privacy. It can in polynomial time find all the comparison results that involve site Sj. By replacing these values to indicate that Sj's values are closer, it can make sure that the k-nearest-neighbors that C selects all belong to site Sj. Thus, the majority class revealed to A on termination would be the majority class of the k-nn values belonging to Sj.

Attack 2: n − 1 of the data providing sites {S1, ..., Sj−1, Sj+1, ..., Sn} can collude and make sure that the final majority class result is that of a particular site Sj. This can be done by the n − 1 colluding sites replacing the distances to their k-nearest-neighbors with very large values, but only for comparisons with site j, so that the comparison results with j show j's items to be closer.

5.3. Extended privacy-preserving k-nn classifier algorithm for malicious parties

Here, we present an improved version of the protocol in [17] that is equally privacy-preserving in the malicious model as well, i.e., the modified version achieves the same privacy goals as specified in Section 5.1.


We use an order-preserving encryption scheme (OPES) to enable site C to find a total ordering of the distance values sent by the n sites. Also, the need to compute (n − 1)k secure comparisons for each of the k points of every site is done away with, making our protocol much more efficient. The original protocol requires one untrusted non-colluding party C. Our protocol requires two such parties, C and another permuting site P, ensuring that the permuting site is not among S1, ..., Sn.

We assume that the general circuit evaluation protocol is secure in the malicious model as well. Even if this assumption is false, we can extend it by adding early commitment and zero-knowledge proofs. As this is a very small portion of the computation of the entire protocol, the overhead due to zero-knowledge proofs would not make the solution very expensive.

Require: n sites Si, 1 ≤ i ≤ n; two untrusted non-colluding parties C and P; query (x, d, k) generated by originating site A; public keys of C and A (EC, EA); key K and order-preserving encryption function EK known only to sites S1, ..., Sn.

1. For all sites Si, in parallel do
   (a) Locally compute the k-nn points to x using distance function d.
   (b) Encrypt the distance values of these k points from x using the OPES key EK, and the corresponding class values using EA.
   (c) Now, encrypt {EK(di), EA(ci)} with EC and send the k values to the permuting site P.
2. P permutes all the nk items and sends them to C.
3. C decrypts the items and uses the EK(di) to get an ordering of the nk items. Here, the ordering of the encrypted distances is the same as that of the original distance values.
4. C selects the k smallest items in the ordering as the global k-nearest-neighbors.
5. For these k values, C adds ri to the EA(ci) values and sends them to site A.
6. Site A and C execute a general circuit evaluation protocol to compute securely the majority class value.
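The order-preservation property that step 3 relies on, EK(di) < EK(dj) iff di < dj, can be demonstrated with a toy scheme (our own illustrative construction; the protocol itself would use a real OPES such as [1]): each integer plaintext maps to a cumulative sum of keyed pseudo-random positive increments, which is strictly increasing by construction.

```python
import hmac, hashlib

KEY = b"shared site key"  # K, known only to the data sites (assumed value)

def opes_encrypt(d: int, key: bytes = KEY) -> int:
    """Toy OPES: ciphertext = sum of keyed positive increments up to d."""
    total = 0
    for i in range(1, d + 1):
        h = hmac.new(key, i.to_bytes(8, "big"), hashlib.sha256).digest()
        total += 1 + (int.from_bytes(h[:4], "big") % 1000)  # increment >= 1
    return total

# Each site OPES-encrypts its local k-nn distances; C sorts ciphertexts.
site_distances = {"S1": [12, 30, 45], "S2": [7, 31, 60], "S3": [15, 16, 90]}
encrypted = [(opes_encrypt(d), s) for s, ds in site_distances.items() for d in ds]

# C picks the k globally smallest items without seeing any true distance.
k = 3
global_knn = sorted(encrypted)[:k]

# Order preservation: sorting ciphertexts equals sorting plaintexts.
plain_sorted = sorted(d for ds in site_distances.values() for d in ds)
assert [opes_encrypt(d) for d in plain_sorted] == sorted(c for c, _ in encrypted)
```

Note that this toy encryption costs O(d) per value and is only meant to show why C's sort on ciphertexts yields the correct global k-nearest-neighbors.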

5.4. Analysis

5.4.1. Privacy analysis

The protocol discussed in Section 5.3 has similar privacy guarantees in the malicious model as the Kantarcıoglu–Clifton protocol described in Section 5.1 has in the semi-honest model. Specifically, the following privacy promises are made (under the non-collusion assumption):

• The values of the k global nearest neighbor points and their majority class value are protected from the n participating sites. Also, the data points of all n sites are protected from each other by using random shares for distance comparisons. No site can learn another site's data values or class mapping. (However, note that every site knows the identities of all other sites, which are public information.)

• The values of the nk data points, their sources and class values, and the final result are protected from the untrusted non-colluding site C.

• Everything except the final majority class value is protected from the query originator site A.

Thus the only information that any of the participating sites learn is what they can infer from their own data and the query, and only the originating site learns the final result.

The only malicious behavior that appears to be beneficial to the malicious adversary is one (or more) of the data sites sending very large di values so that their inputs do not contribute to the final result. However, this attack can be shown to be equivalent to the particular site replacing its input database with another database. Also, this does not really compromise privacy, as the results do not indicate which of the n − 1 (or fewer) parties' data values contributed to the result. Of course, if n − 1 parties collude and send large distance values, they can ensure that the final result will depend only on the honest party's input. However, we make security and privacy


claims assuming that the sites are non-colluding.5 Hence, under the same non-collusion assumption, the protocol discussed in Section 5.3 is equally privacy-preserving in the malicious model as the one described in Section 5.1 is in the semi-honest model.

5.4.2. Communication cost analysis

The communication cost for the protocol presented in [17] includes the following components, assuming m bits are required to represent a distance, t is the key-size for encryption, and q bits represent the result: a total of O(n²k²) comparisons costing O(n²k²mt) bits; O(nk + t) bits of encrypted comparison shares plus O(q + t) for the result for each item for each site, costing O(n²k² + nkq + nkt) bits; and finally C sends O(k(q + t)) bits to A. The most dominating factor in this protocol is the secure comparisons, O(n²k²mt).

The communication cost for our improved version of the protocol, assuming the order-preserving encryption results in ciphertexts of m bits, t is the key-size for public encryption, and q bits represent the result, consists of: a total of O(nk) distance values costing O(nkm) bits plus O(q + t) for the result for each item for each site, costing O(nkm + nkq + nkt) bits; and finally C sends O(k(q + t)) bits to A. Thus, we can clearly see that the most dominating factor in the communication cost of the original protocol is completely absent in our improved version, making it very efficient communication-wise.
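To get a feel for the gap, the dominating terms can be evaluated with illustrative parameter values (n = 10, k = 10, m = 32, t = 512, q = 8 are our own assumptions, and the constants hidden by the O-notation are ignored):

```python
# Illustrative parameters (assumed, not taken from the paper's experiments)
n, k, m, t, q = 10, 10, 32, 512, 8

# Dominating communication terms of the two protocols (constants ignored)
original = n**2 * k**2 * m * t        # O(n^2 k^2 m t): secure comparisons
modified = n*k*m + n*k*q + n*k*t      # O(nkm + nkq + nkt)

print(original, modified, original // modified)
```

Under these assumptions the original protocol moves over three orders of magnitude more bits than the modified one.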

5.4.3. Computation cost analysis

Once again, the major contributor to the computation cost of the original protocol is the O(n²k²) secure comparisons. The total cost includes the following: O(nk) encryptions for the query results of size q; O(nk) encryptions of the comparison sets; and O(n²k²mt³) as the cost of the secure comparisons, assuming t is the key-size and each comparison requires O(m) 1-out-of-2 oblivious transfers.

The computation cost of our version for malicious participants consists of: O(nk) encryptions for the query results of size q; O(nk) encryptions for the distance values using an OPES; and one secure circuit evaluation between A and C costing O(qk) 1-out-of-2 oblivious transfers.

Thus, even computationally, our protocol is more efficient due to the absence of the O(n²k²) secure comparisons.

5.5. Experimental results

We conducted simulations of the Kantarcıoglu–Clifton private k-nn classifier protocol and compared them to simulation results for our proposed protocol. The results showed a tremendous improvement in the communication as well as the computation cost.

We coded our simulation in Java using Java's cryptographic library for DES and RSA encryption. We did not implement Blum–Goldwasser encryption but performed an equal number of similar operations in our simulation experiments. Also, we use previous experimental results for the order-preserving encryption scheme [1], secure comparisons [11] and secure circuit evaluation [21] as estimates in our results. We conducted the experiments in a distributed setting, with every site that exchanged any data executing on a different machine. All the experiments were performed with 10 data sites (n = 10) executing in parallel. The computer systems used for the simulation were a Sun Ultra Enterprise 80 batch compute server with 4 GB of memory and a Sun Ultra-80 interactive timeshare server with 4 GB of memory, connected via a fast Ethernet connection. In all the following diagrams, Protocol 1 represents the earlier protocol and Protocol 2 is our improved version.

5 Note that collusion/non-collusion is a dimension orthogonal to semi-honest/malicious adversaries. That is, we can have all of the four possible combinations: semi-honest non-colluding adversaries, semi-honest colluding adversaries, malicious non-colluding adversaries, and malicious colluding adversaries. In particular, semi-honest adversaries can collude, which means that each adversary exchanges his own knowledge with other adversaries, so that all adversaries can jointly derive the maximum amount of protected information. One example is [16], which considers only semi-honest adversaries but devotes its Section 4 to a discussion of collusion. Also, it could be the case that malicious adversaries do not collude. In this case, each adversary can have arbitrary malicious behavior, but his malicious behavior is independent of other adversaries'. An example is [14], in which Section 4.1 considers non-colluding malicious adversaries.


First, we present a site-wise distribution of the time taken to execute the protocol in Fig. 1. This time includes both communication and computation time. These results are for k = 10, a 512-bit modulus for Blum–Goldwasser and RSA, and 56-bit keys for DES. The secure comparisons and circuit evaluation also assume 512-bit keys.

As can be clearly seen, the difference is huge. The maximum time at any site for the improved protocol in this setting is 4270 ms, compared to 477,570 ms for the earlier protocol.

Next, we present the operation-wise protocol execution distribution in Table 1.

Fig. 1. Protocol execution time distribution by site.

Table 1
Protocol execution distribution by operation

                     Original k-nn classifier protocol   Modified k-nn classifier protocol
                     Time (ms)      %                    Time (ms)      %
Comparisons          2,250,000      80.3                 0              0
Circuit evaluation   4100           0.2                  4100           74.3
Encrypt/decrypt      508,347        18                   1130           20.5
Communication        26,035         0.9                  277            5
Identifiers          10,680         0.4                  0              0
Total                2,799,170      100                  5514           100

Fig. 2. Effect of varying modulus size.

Fig. 3. Effect of varying k.


These execution times represent the combined times of all 10 data sites, the permuting site, site C and the originator site, as opposed to the individual site times in Fig. 1.

Fig. 2 shows the effect of varying the modulus size for the Blum–Goldwasser encryption on the two protocols, as well as a comparison between the two.

Increasing the value of k has a very strong impact on the previous protocol, with execution taking as long as 1,213,211 ms for k = 20, as shown in Fig. 3. Even at this highest value of k, our version of the protocol takes a modest 10,012 ms to execute.

The major factor contributing to the huge difference between the two protocols is that each site no longer needs to compare its k values with those of every other site, encrypt the results and send them to the permuting site. Also, our protocol does not require any identifiers with the distance values.

6. Generalization of proposed methods

In Sections 4 and 5, we discussed extensions from the semi-honest model to the malicious model for two existing protocols, one requiring very little revision and the other requiring a significant redesign of the original protocol. Below we discuss a few more example protocols to emphasize that the above two protocols are not an exception; in fact, many protocols can be extended in a similar fashion to yield efficient protocols that are equally privacy preserving in the malicious model. Due to limited space, we discuss these examples in brief compared to the protocols of Sections 4 and 5.

6.1. Solutions with slight modifications

Wright and Yang [23] describe a privacy-preserving scalar product protocol, which they use as an important subroutine in their Bayesian network protocol for distributed heterogeneous data. Goethals et al. [7] also describe a very similar protocol, extended to include non-binary data and a batched version of the scalar product protocol. These protocols use a homomorphic encryption scheme. The basic idea of these protocols is simple: Alice encrypts all of her data items with her public key and sends the ciphertexts to Bob. Bob computes an encryption of the scalar product minus a random number using his own data items, where the random number is chosen by himself. Finally, Alice decrypts this ciphertext to get her random share of the scalar product.
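This idea can be sketched with an additively homomorphic scheme. We use textbook Paillier with tiny hard-coded primes purely for illustration; [7,23] may instantiate a different homomorphic scheme, and real keys must of course be large and randomly generated.

```python
import math, random

# Toy Paillier keypair with small fixed primes (illustration only)
p, q = 293, 433
n = p * q
n2 = n * n
g = n + 1
lam = math.lcm(p - 1, q - 1)
mu = pow((pow(g, lam, n2) - 1) // n, -1, n)   # L(g^lam mod n^2)^(-1) mod n

def enc(m):
    while True:
        r = random.randrange(2, n)
        if math.gcd(r, n) == 1:               # r must be invertible mod n
            break
    return (pow(g, m % n, n2) * pow(r, n, n2)) % n2

def dec(c):
    return (((pow(c, lam, n2) - 1) // n) * mu) % n

# Alice's vector X and Bob's vector Y
X = [3, 1, 4, 1, 5]
Y = [2, 7, 1, 8, 2]

# Alice encrypts her items and sends the ciphertexts to Bob.
cts = [enc(x) for x in X]

# Homomorphism: multiplying ciphertexts adds plaintexts; raising a
# ciphertext to y multiplies its plaintext by y. Bob thus builds
# Enc(X.Y - rB) for a random rB of his choice.
rB = random.randrange(n)
acc = enc(-rB)
for c, y in zip(cts, Y):
    acc = (acc * pow(c, y, n2)) % n2

# Alice decrypts her additive share; Bob keeps rB as his share.
sA = dec(acc)
assert (sA + rB) % n == sum(x * y for x, y in zip(X, Y)) % n
```

Note how Bob never decrypts anything and Alice only sees the masked share, matching the division of knowledge described above.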

These protocols are privacy preserving in the semi-honest model and can be shown to be equally private in the malicious case with a few minor modifications. Goethals et al. [7] use zero-knowledge proofs and conditional disclosure of secrets as techniques to make the protocol secure in the malicious model. However, even without these techniques, the protocol can be made private in the malicious model, as most of the message-replacing attacks are equivalent to input substitutions or probing attacks, which are unavoidable. These expensive


cryptographic techniques only provide a stronger sense of security in this case. For practical scenarios that require high efficiency, we recommend using an extension to the malicious model with minor modifications, rather than with zero-knowledge proofs. The minor modifications required here are the ones that deal with the trivial attacks mentioned in Section 3.

6.2. Solutions with significant revisions

Du and Zhan present a decision tree classification algorithm over private data in [5]. They describe a secure shared scalar product protocol for two parties, which is the basic privacy preserving component of their classification algorithm. We briefly discuss their scalar product protocol and how it can be extended to the malicious model to yield an efficient protocol.

They introduce a semi-honest, untrusted, non-colluding third party called the commodity server to hand out two random vectors RA and RB and two random numbers ra and rb such that RA · RB = ra + rb. The commodity server only provides this initial information to Alice (RA, ra) and Bob (RB, rb), which is independent of Alice's and Bob's private inputs. It does not further participate in the protocol. Suppose that Alice has input A and Bob has input B. Alice sends A′ = A + RA to Bob, and Bob sends B′ = B + RB to Alice. Bob selects another random number V2 and sends A′ · B + (rb − V2) to Alice. Alice uses this to obtain V1 such that V1 + V2 = A · B. Thus, both have random shares of the scalar product result.
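A minimal end-to-end sketch follows. The modular domain Z_M and Alice's reconstruction step V1 = t − RA · B′ + ra are our reading of [5], filled in from the algebra above: expanding t − RA · B′ + ra gives A · B − V2, so the two shares sum to the scalar product.

```python
import random

M = 2**31 - 1  # work in Z_M, M prime (illustrative domain choice)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v)) % M

def rand_vec(k):
    return [random.randrange(M) for _ in range(k)]

# Private inputs
A = [5, 0, 9, 2]   # Alice
B = [1, 7, 3, 4]   # Bob
k = len(A)

# Commodity server: random RA, RB, ra, rb with RA.RB = ra + rb (mod M)
RA, RB = rand_vec(k), rand_vec(k)
ra = random.randrange(M)
rb = (dot(RA, RB) - ra) % M

# Round 1: Alice -> Bob: A' = A + RA ; Bob -> Alice: B' = B + RB
A1 = [(a + r) % M for a, r in zip(A, RA)]
B1 = [(b + r) % M for b, r in zip(B, RB)]

# Bob picks V2 and sends t = A'.B + rb - V2
V2 = random.randrange(1, M)          # non-zero, per the fix in Section 6.2
t = (dot(A1, B) + rb - V2) % M

# Alice computes her share V1 = t - RA.B' + ra = A.B - V2
V1 = (t - dot(RA, B1) + ra) % M

assert (V1 + V2) % M == dot(A, B)
```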

This protocol can be made equally privacy-preserving in the malicious model with some significant changes. The protocol might leak some partial data if the random numbers are not generated from the real domain, but this can be taken care of trivially. Because the commodity server cannot collude, it cannot participate in an attack to disclose private data by sending non-random commodities. Also, input substitution attacks, like Bob sending a different vector in place of B + RB, cannot be avoided; but Du and Zhan [5] show that they do not compromise any privacy in the semi-honest model, and this holds true for the malicious model as well. As pointed out by Du and Zhan, the protocol potentially discloses some private data when Bob is malicious and sets V2 to zero and VB to all zeros except one. This allows Bob to learn the value of a specific entry in Alice's data vector. To thwart this attack, we need to ensure that Bob cannot set V2 to zero. To achieve this, we use a secure protocol in the malicious model for equality (from [18]) to ensure a non-zero V2, and a bit commitment scheme to force Bob to commit to his randomly chosen non-zero V2. As this is not a very large portion of the entire protocol, the overhead due to bit commitment and equality testing would not make the solution very expensive. Also, the protocol would require some more minor changes to deal with the trivial attacks mentioned in Section 3.
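The commitment to V2 can be sketched with a simple hash-based commitment (our own illustrative construction; [5,18] may prescribe a different scheme, and a hash commitment is only hiding and binding under the usual assumptions on the hash function):

```python
import hashlib, secrets

def commit(v2: int) -> tuple[bytes, bytes]:
    """Commit to V2; the random nonce keeps the commitment hiding."""
    nonce = secrets.token_bytes(16)
    digest = hashlib.sha256(nonce + v2.to_bytes(16, "big")).digest()
    return digest, nonce

def verify(digest: bytes, nonce: bytes, v2: int) -> bool:
    """Opened value must match the commitment and be non-zero."""
    ok = hashlib.sha256(nonce + v2.to_bytes(16, "big")).digest() == digest
    return ok and v2 != 0

# Bob commits to V2 before the protocol run and opens it afterwards.
V2 = secrets.randbelow(2**64 - 1) + 1    # non-zero by construction
c, nonce = commit(V2)
assert verify(c, nonce, V2)              # honest opening is accepted
assert not verify(c, nonce, 0)           # Bob cannot later claim V2 = 0
```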

With the above-mentioned modifications, the protocol is equally privacy-preserving in the malicious case as in the semi-honest case.

6.3. Solutions requiring complete transformation using zero-knowledge proofs

Goldreich [9] describes a systematic method for converting protocols that are secure in the semi-honest model into ones that are equally secure and privacy-preserving in the malicious model, using commitment schemes and zero-knowledge proofs, for both the two-party and multi-party cases. These techniques have general applicability and strong security, but are computationally expensive. While we have discussed alternative methods to achieve security in the malicious model, there are protocols that seem to require such techniques. A couple of example protocols follow:

1. Yang et al. [24] describe a data collection scheme where a miner collects data from respondents and the application requires the respondents to remain anonymous. They then extend their protocol to the case where the miner, as well as any of the respondents, could be malicious. Their solution uses zero-knowledge proofs to preserve anonymity in the malicious model.

2. Kantarcıoglu and Kardes [18] describe two-party secure protocols in the malicious model for equality, dot product and full-domain set operations. They use zero-knowledge proofs to extend existing protocols to the malicious case, as simpler methods do not seem to yield the same privacy promises for these particular protocols.


Of course, we note that there is no way to formally prove that a protocol needs the general cryptographic techniques for its extension to the malicious model. Thus the above discussion and examples only apply to the current state of the art of privacy preserving data mining.

7. Conclusion and future work

This paper presents a more liberal outlook on privacy-preserving data mining in the presence of malicious participants by focusing only on malicious behavior that results in privacy being compromised. This opens up the possibility of developing efficient solutions for the malicious model with minimal use (or no use at all) of techniques like commitment schemes and zero-knowledge proofs.

In this paper, we proved that the privacy preserving scalar product protocol in [22], with slight modifications, is equally privacy preserving in the presence of malicious participants as well. This paper also described an extension of the privacy-preserving k-nn classifier algorithm in [17] to malicious participants, together with a thorough comparative analysis showing that our improved version is more efficient by a factor of around 10³. With the help of a few more examples, we showed that similarly efficient protocols can be developed for a variety of privacy preserving data mining protocols.

Possible future work in this direction could include considering more existing protocols and extending them to the malicious model using the ideas expressed in this paper, resulting in efficient solutions. However, we are aware of the possibility that for some protocols there is no better way to deal with malicious participants than using commitment schemes and zero-knowledge proofs.

Acknowledgements

We thank all of the three anonymous reviewers for their insightful comments, which helped us improve this paper. Special thanks go to one of the reviewers for pointing out an erroneous claim in the initial submission of this paper.

References

[1] R. Agrawal, J. Kiernan, R. Srikant, Y. Xu, Order preserving encryption for numeric data, in: Proceedings of SIGMOD 2004, Paris, France, 2004, pp. 563–574.
[2] R. Agrawal, R. Srikant, Fast algorithms for mining association rules, in: Proceedings of the 20th International Conference on Very Large Databases, Chile, 1994, pp. 487–499.
[3] R. Agrawal, R. Srikant, Privacy-preserving data mining, in: Proceedings of the ACM SIGMOD Conference on Management of Data, ACM Press, May 2000, pp. 439–450.
[4] J. Brickell, V. Shmatikov, Efficient anonymity preserving data collection, in: Proceedings of KDD 2006, Philadelphia, PA, USA, August 2006, pp. 76–85.
[5] W. Du, Z. Zhan, Building decision tree classifier on private data, in: Proceedings of the IEEE International Conference on Privacy, Security and Data Mining, vol. 14, 2002, pp. 1–8.
[6] W. Du, Z. Zhan, Using randomized response techniques for privacy-preserving data mining, in: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003, pp. 505–510.
[7] B. Goethals, S. Laur, H. Lipmaa, T. Mielikainen, On private scalar product computation for privacy-preserving data mining, in: The 7th Annual International Conference on Information Security and Cryptology (ICISC 2004), vol. 3506 of Lecture Notes in Computer Science, 2004, pp. 104–120.
[8] O. Goldreich, Foundations of Cryptography, vol. 1, Cambridge University Press, 2001.
[9] O. Goldreich, Foundations of Cryptography, vol. 2, Cambridge University Press, 2004.
[10] Z. Huang, W. Du, B. Chen, Deriving private information from randomized data, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, 2005, pp. 37–48.
[11] I. Ioannidis, A. Grama, An efficient protocol for Yao's millionaires' problem, in: Proceedings of the 36th Hawaii International Conference on System Sciences (HICSS'03), 2003, pp. 205–210.
[12] G. Jagannathan, K. Pillaipakkamnatt, R. Wright, A new privacy-preserving distributed k-clustering algorithm, in: Proceedings of the 2006 SIAM International Conference on Data Mining (SDM), 2006, pp. 492–496.
[13] G. Jagannathan, R. Wright, Privacy-preserving distributed k-means clustering over arbitrarily partitioned data, in: Proceedings of ACM KDD 2005, 2005, pp. 593–599.
[14] Qinglin Jiang, Douglas S. Reeves, Peng Ning, Improving robustness of PGP keyrings by conflict detection, in: Topics in Cryptology – CT-RSA 2004, 2004, pp. 194–207.
[15] W. Jiang, C. Clifton, Transforming semi-honest protocols to ensure accountability, in: The ICDM Workshop on Privacy Aspects of Data Mining (PADM06), 2006, pp. 542–529.
[16] M. Kantarcıoglu, C. Clifton, Privacy-preserving distributed mining of association rules on horizontally partitioned data, IEEE Transactions on Knowledge and Data Engineering 16 (9) (2004) 1026–1037.
[17] M. Kantarcıoglu, C. Clifton, Privately computing a distributed k-nn classifier, in: Proceedings of PKDD 2004, 2004, pp. 279–290.
[18] M. Kantarcıoglu, O. Kardes, Privacy preserving data mining in malicious model, Technical Report CS-2006-06, Stevens Institute of Technology, Department of Computer Science, 2006.
[19] K. LeFevre, D. DeWitt, R. Ramakrishnan, Workload aware anonymization, in: Proceedings of KDD 2006, Philadelphia, PA, USA, August 2006, pp. 277–286.
[20] Y. Lindell, B. Pinkas, Privacy preserving data mining, in: Proceedings of CRYPTO 2000, 2000, pp. 36–54.
[21] D. Malkhi, N. Nisan, B. Pinkas, Y. Sella, Fairplay – a secure two-party computation system, in: Proceedings of the USENIX Security Symposium 2004, 2004, pp. 20–35.
[22] J. Vaidya, C. Clifton, Privacy preserving association rule mining in vertically partitioned data, in: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002, pp. 639–644.
[23] R. Wright, Z. Yang, Privacy preserving Bayesian network structure computation on distributed heterogeneous data, in: Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004, pp. 713–718.
[24] Z. Yang, S. Zhong, R. Wright, Anonymity preserving data collection, in: Proceedings of ACM KDD 2005, Chicago, August 2005, pp. 334–343.
[25] Z. Yang, S. Zhong, R. Wright, Privacy preserving classification of customer data without loss of accuracy, in: Proceedings of the 2005 SIAM International Conference on Data Mining (SDM), California, April 2005, pp. 92–102.
[26] A. Yao, How to generate and exchange secrets, in: Proceedings of the 27th IEEE Symposium on Foundations of Computer Science, 1986, pp. 162–167.
[27] Michael M. Yin, Jason T.L. Wang, GeneScout: a data mining system for predicting vertebrate genes in genomic DNA sequences, Information Sciences 163 (1–3) (2004) 201–218.
[28] Jeffrey Xu Yu, Zhihong Chong, Hongjun Lu, Zhenjie Zhang, Aoying Zhou, A false negative approach to mining frequent itemsets from high speed transactional data streams, Information Sciences 176 (16) (2006) 1986–2015.
[29] Ling Zhang, Bo Zhang, Fuzzy reasoning model under quotient space structure, Information Sciences 176 (4) (2005) 353–364.
[30] S. Zhong, Privacy preserving algorithms for distributed mining of frequent itemsets, Information Sciences 177 (2) (2007) 490–503.