
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, TO APPEAR

    Privacy-preserving Distributed Mining of

Association Rules on Horizontally Partitioned Data

Murat Kantarcioglu and Chris Clifton, Senior Member, IEEE

Abstract: Data mining can extract important knowledge from large data collections, but sometimes these collections are split among various parties. Privacy concerns may prevent the parties from directly sharing the data, and some types of information about the data. This paper addresses secure mining of association rules over horizontally partitioned data. The methods incorporate cryptographic techniques to minimize the information shared, while adding little overhead to the mining task.

Index Terms: Data Mining, Security, Privacy

    I. INTRODUCTION

Data mining technology has emerged as a means of identifying patterns and trends from large quantities of data. Data mining and data warehousing go hand-in-hand: most tools operate by gathering all data into a central site, then running an algorithm against that data. However, privacy concerns can prevent building a centralized warehouse: data may be distributed among several custodians, none of which are allowed to transfer their data to another site.

This paper addresses the problem of computing association rules within such a scenario. We assume homogeneous databases: All sites have the same schema, but each site has information on different entities. The goal is to produce association rules that hold globally, while limiting the information shared about each site.

Computing association rules without disclosing individual transactions is straightforward. We can compute the global support and confidence of an association rule AB ⇒ C knowing only the local supports of AB and ABC, and the size of each database:

support_ABC = ( Σ_{i=1}^{sites} support_count_ABC(i) ) / ( Σ_{i=1}^{sites} database_size(i) )

support_AB = ( Σ_{i=1}^{sites} support_count_AB(i) ) / ( Σ_{i=1}^{sites} database_size(i) )

confidence_{AB⇒C} = support_ABC / support_AB
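These formulas are simple aggregate arithmetic. As a minimal sketch (the site count and all numbers are illustrative, not from the paper), the global support and confidence combine from per-site counts as follows:

```python
# Sketch: combining local support counts into global support and
# confidence, per the formulas above. All values are illustrative.

def global_support(local_counts, db_sizes):
    """Global support = sum of local counts / sum of database sizes."""
    return sum(local_counts) / sum(db_sizes)

count_abc = [5, 6, 20]       # support_count_ABC(i) at each site
count_ab  = [20, 40, 60]     # support_count_AB(i) at each site
sizes     = [100, 200, 300]  # database_size(i) at each site

support_abc = global_support(count_abc, sizes)   # 31/600
support_ab  = global_support(count_ab, sizes)    # 120/600 = 0.2
confidence_ab_c = support_abc / support_ab       # support_ABC / support_AB
```

Only the per-site aggregates enter the computation; no individual transaction is needed.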

Note that this doesn't require sharing any individual transactions. We can easily extend an algorithm such as a-priori [1] to the distributed case using the following lemma: If a rule has support > k% globally, it must have support > k% on at least one of the individual sites. A distributed algorithm for this would work as follows: Request that each site send all rules with support at least k. For each rule returned, request that all sites send the count of their transactions that support the rule, and the total count of all transactions at the site. From this, we can compute the global support of each rule, and (from the lemma) be certain that all rules with support at least k have been found. More thorough studies of distributed association rule mining can be found in [2], [3].

The authors are with the Department of Computer Sciences, Purdue University, 250 N. University St, W. Lafayette, IN 47907. E-mail: [email protected], [email protected].

Manuscript received 29 Jan. 2003; revised 11 Jun. 2003; accepted 1 Jul. 2003.

(c) 2003 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

The above approach protects individual data privacy, but

it does require that each site disclose what rules it supports, and how much it supports each potential global rule. What if this information is sensitive? For example, suppose the Centers for Disease Control (CDC), a public agency, would like to mine health records to try to find ways to reduce the proliferation of antibiotic resistant bacteria. Insurance companies have data on patient diseases and prescriptions. Mining this data would allow the discovery of rules such as Augmentin & Summer ⇒ Infection & Fall, i.e., people taking Augmentin in the summer seem to have recurring infections.

The problem is that insurance companies will be concerned about sharing this data. Not only must the privacy of patient records be maintained, but insurers will be unwilling to release rules pertaining only to them. Imagine a rule indicating a high rate of complications with a particular medical procedure. If this rule doesn't hold globally, the insurer would like to know this: they can then try to pinpoint the problem with their policies and improve patient care. If the fact that the insurer's data supports this rule is revealed (say, under a Freedom of Information Act request to the CDC), the insurer could be exposed to significant public relations or liability problems. This potential risk could exceed their own perception of the benefit of participating in the CDC study.

This paper presents a solution that preserves such secrets: the parties learn (almost) nothing beyond the global results. The solution is efficient: the additional cost relative to previous non-secure techniques is O(number of candidate itemsets × sites) encryptions, and a constant increase in the number of messages.

The method presented in this paper assumes three or more parties. In the two-party case, knowing a rule is supported globally and not supported at one's own site reveals that the other site supports the rule. Thus, much of the knowledge we try to protect is revealed even with a completely secure method for computing the global results. We discuss the two-party case further in Section V. By the same argument, we assume no collusion, as colluding parties can reduce this to the two-party case.

Fig. 1. Determining global candidate itemsets. (The figure shows three sites with local itemsets C and D. Each itemset is successively encrypted by every site, e.g. E2(E3(E1(C))); the encrypted sets are merged to eliminate duplicates, then decrypted by each site in turn, yielding the common itemsets C and D.)

A. Private Association Rule Mining Overview

Our method follows the basic approach outlined on Page 2, except that values are passed between the local data mining sites rather than to a centralized combiner. The two phases are discovering candidate itemsets (those that are frequent on one or more sites), and determining which of the candidate itemsets meet the global support/confidence thresholds.

The first phase (Figure 1) uses commutative encryption. Each party encrypts its own frequent itemsets (e.g., Site 1 encrypts itemset C). The encrypted itemsets are then passed to other parties, until all parties have encrypted all itemsets. These are passed to a common party to eliminate duplicates, and to begin decryption. (In the figure, the full set of itemsets is shown to the left of Site 1, after Site 1 decrypts.) This set is then passed to each party, and each party decrypts each itemset. The final result is the common itemsets (C and D in the figure).
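A minimal sketch of this phase, under simplifying assumptions: commutative encryption is modeled as Pohlig-Hellman-style modular exponentiation with a shared prime (the prime, keys, and integer itemset encodings are illustrative), and the round-robin passing of partially encrypted sets is collapsed into loops:

```python
# Sketch of phase 1 (candidate itemset discovery). Assumes a
# Pohlig-Hellman-style commutative cipher; all parameters illustrative.
import math
import random

p = 2**89 - 1                          # a Mersenne prime shared by all sites

def keygen(rng):
    """Pick an exponent e invertible mod p-1, and its inverse d."""
    while True:
        e = rng.randrange(3, p - 1)
        if math.gcd(e, p - 1) == 1:
            return e, pow(e, -1, p - 1)

def E(key, m):
    """Modular exponentiation: commutative across different keys."""
    return pow(m, key, p)

rng = random.Random(0)
keys = [keygen(rng) for _ in range(3)]   # one (e, d) pair per site

# Locally frequent itemsets per site, encoded as integers (illustrative).
local = [{10, 20}, {20, 30}, {10, 30}]

def encrypt_by_all(m):
    for e, _ in keys:
        m = E(e, m)
    return m

# Once every site has encrypted every itemset, identical itemsets collapse
# to identical ciphertexts, hiding which site contributed which itemset.
merged = {encrypt_by_all(m) for s in local for m in s}

def decrypt_by_all(c):
    for _, d in keys:
        c = E(d, c)
    return c

candidates = {decrypt_by_all(c) for c in merged}   # union of local itemsets
```

Because encryption is commutative, the order in which sites apply their keys does not matter, which is what lets duplicates be eliminated before any decryption happens.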

In the second phase (Figure 2), each of the locally supported itemsets is tested to see if it is supported globally. In the figure, the itemset ABC is known to be supported at one or more sites, and each computes its local support. The first site chooses a random value R, and adds to R the amount by which its support for ABC exceeds the minimum support threshold. This value is passed to site 2, which adds the amount by which its support exceeds the threshold (note that this may be negative, as shown in the figure). This is passed to site 3, which again adds its excess support. The resulting value (18) is tested using a secure comparison to see if it exceeds the random value (17). If so, itemset ABC is supported globally.

This gives a brief, oversimplified idea of how the method works. Section III gives full details. Before going into the details, we give background and definitions of relevant data mining and security techniques.

II. BACKGROUND AND RELATED WORK

There are several fields where related work is occurring. We first describe other work in privacy-preserving data mining, then go into detail on specific background work on which this paper builds.

Fig. 2. Determining if itemset support exceeds 5% threshold. (The figure shows Site 1 with itemset ABC in 5 of 100 transactions, Site 2 with 6 of 200, and Site 3 with 20 of 300. Starting from the random value R = 17 chosen by Site 1, each site adds count - 5% * DBSize, giving running values 17, 13, and 18; the final secure comparison 18 >= R succeeds, so ABC is globally supported.)

Previous work in privacy-preserving data mining has addressed two issues. In one, the aim is preserving customer privacy by distorting the data values [4]. The idea is that the distorted data does not reveal private information, and thus is safe to use for mining. The key result is that the distorted data, and information on the distribution of the random data used to distort the data, can be used to generate an approximation to the original data distribution, without revealing the original data values. The distribution is used to improve mining results over mining the distorted data directly, primarily through selection of split points to bin continuous data. Later refinement of this approach tightened the bounds on what private information is disclosed, by showing that the ability to reconstruct the distribution can be used to tighten estimates of original values based on the distorted data [5].

More recently, the data distortion approach has been applied to boolean association rules [6], [7]. Again, the idea is to modify data values such that reconstruction of the values for any individual transaction is difficult, but the rules learned on the distorted data are still valid. One interesting feature of this work is a flexible definition of privacy; e.g., the ability to correctly guess a value of '1' from the distorted data can be considered a greater threat to privacy than correctly learning a '0'.

The data distortion approach addresses a different problem from our work. The assumption with distortion is that the values must be kept private from whoever is doing the mining. We instead assume that some parties are allowed to see some of the data, just that no one is allowed to see all the data. In return, we are able to get exact, rather than approximate, results.

The other approach uses cryptographic tools to build decision trees [8]. In this work, the goal is to securely build an ID3 decision tree where the training set is distributed between two parties. The basic idea is that finding the attribute that maximizes information gain is equivalent to finding the attribute that minimizes the conditional entropy. The conditional entropy for an attribute for two parties can be written as a sum of expressions of the form (v1 + v2) * log(v1 + v2). The authors give a way to securely calculate the expression (v1 + v2) * log(v1 + v2) and show how to use this function for building ID3 securely. This approach treats privacy-preserving data mining as a special case of secure multi-party computation [9] and not only aims at preserving individual privacy but also tries to prevent leakage of any information other than the final result. We follow this approach, but address a different problem (association rules), and emphasize the efficiency of the resulting algorithms. A particular difference is that we recognize that some kinds of information can be exchanged without violating security policies; secure multi-party computation forbids leakage of any information other than the final result. The ability to share non-sensitive data enables highly efficient solutions.

The problem of privately computing association rules in vertically partitioned distributed data has also been addressed [10]. The vertically partitioned problem occurs when each transaction is split across multiple sites, with each site having a different set of attributes for the entire set of transactions. With horizontal partitioning, each site has a set of complete transactions. In relational terms, with horizontal partitioning the relation to be mined is the union of the relations at the sites. In vertical partitioning, the relations at the individual sites must be joined to get the relation to be mined. The change in the way the data is distributed makes this a much different problem from the one we address here, resulting in a very different solution.

A. Mining of Association Rules

The association rule mining problem can be defined as follows [1]: Let I = {i1, i2, . . . , in} be a set of items. Let DB be a set of transactions, where each transaction T is an itemset such that T ⊆ I. Given an itemset X ⊆ I, a transaction T contains X if and only if X ⊆ T. An association rule is an implication of the form X ⇒ Y where X ⊆ I, Y ⊆ I and X ∩ Y = ∅. The rule X ⇒ Y has support s in the transaction database DB if s% of the transactions in DB contain X ∪ Y. The association rule holds in the transaction database DB with confidence c if c% of the transactions in DB that contain X also contain Y. An itemset X with k items is called a k-itemset. The problem of mining association rules is to find all rules whose support and confidence are higher than certain user-specified minimum support and confidence.

In this simplified definition of association rules, missing items, negatives and quantities are not considered. In this respect, the transaction database DB can be seen as a 0/1 matrix where each column is an item and each row is a transaction. In this paper, we use this view of association rules.
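As a small illustration of the 0/1-matrix view (the items, transactions, and rule below are invented for the example):

```python
# Sketch: support and confidence computed directly from the definition,
# over a tiny 0/1 matrix. Rows are transactions; columns are items A, B, C.
rows = [
    [1, 1, 1],
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 1],
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(all(r[i] for i in itemset) for r in rows) / len(rows)

def confidence(x, y):
    """Confidence of the rule X => Y: support(X u Y) / support(X)."""
    return support(x | y) / support(x)

support({0, 1})       # A and B appear together in 2 of 4 rows -> 0.5
confidence({0}, {1})  # A => B holds in 2 of the 3 A-transactions -> 2/3
```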

1) Distributed Mining of Association Rules: The above problem of mining association rules can be extended to distributed environments. Let us assume that a transaction database DB is horizontally partitioned among n sites (namely S1, S2, . . . , Sn) where DB = DB1 ∪ DB2 ∪ . . . ∪ DBn and DBi resides at site Si (1 ≤ i ≤ n). The itemset X has local support count X.sup_i at site Si if X.sup_i of the transactions at Si contain X. The global support count of X is given as X.sup = Σ_{i=1}^{n} X.sup_i. An itemset X is globally supported if X.sup ≥ s × (Σ_{i=1}^{n} |DBi|). The global confidence of a rule X ⇒ Y can be given as {X ∪ Y}.sup / X.sup.

The set of large itemsets L(k) consists of all k-itemsets that are globally supported. The set of locally large itemsets LLi(k) consists of all k-itemsets supported locally at site Si. GLi(k) = L(k) ∩ LLi(k) is the set of globally large k-itemsets locally supported at site Si. The aim of distributed association rule mining is to find the sets L(k) for all k > 1 and the support counts for these itemsets, and from this compute association rules with the specified minimum support and confidence.

A fast algorithm for distributed association rule mining is given in Cheung et al. [2]. Their procedure for fast distributed mining of association rules (FDM) is summarized below.

1) Candidate Sets Generation: Generate candidate sets CGi(k) based on GLi(k-1), itemsets supported by Si at the (k-1)-th iteration, using the classic a-priori candidate generation algorithm. Each site generates candidates based on the intersection of globally large (k-1)-itemsets and locally large (k-1)-itemsets.

2) Local Pruning: For each X ∈ CGi(k), scan the database DBi at Si to compute X.sup_i. If X is locally large at Si, it is included in the LLi(k) set. It is clear that if X is supported globally, it will be supported at one site.

3) Support Count Exchange: The LLi(k) are broadcast, and each site computes the local support for the items in ∪i LLi(k).

4) Broadcast Mining Results: Each site broadcasts the local support for itemsets in ∪i LLi(k). From this, each site is able to compute L(k).

The details of the above algorithm can be found in [2].
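A minimal, non-private sketch of steps 2-4 for a single round k (the two tiny databases and candidate sets are invented; step 1's candidate generation is taken as given):

```python
# Sketch of one FDM round without the privacy protections of this
# paper's protocols. Itemsets are frozensets of items; data illustrative.
s = 0.5                                       # minimum support threshold
dbs = [
    [frozenset("ab"), frozenset("ac")],       # transactions at site 1
    [frozenset("ab"), frozenset("bc")],       # transactions at site 2
]
candidates = [frozenset("ab"), frozenset("ac"), frozenset("bc")]

def local_count(db, x):
    """Local support count: transactions of db containing itemset x."""
    return sum(x <= t for t in db)

# 2) Local pruning: keep candidates that are locally large at each site.
LL = [{x for x in candidates if local_count(db, x) >= s * len(db)}
      for db in dbs]

# 3) Support count exchange over the union of locally large itemsets.
union = set().union(*LL)
global_count = {x: sum(local_count(db, x) for db in dbs) for x in union}

# 4) Globally large itemsets.
total = sum(len(db) for db in dbs)
L = {x for x, c in global_count.items() if c >= s * total}
```

It is exactly the broadcasts in steps 3 and 4 that the protocols of this paper replace with privacy-preserving equivalents.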

B. Secure Multi-party Computation

Substantial work has been done on secure multi-party computation. The key result is that a wide class of computations can be computed securely under reasonable assumptions. We give a brief overview of this work, concentrating on material that is used later in the paper. The definitions given here are from Goldreich [9]. For simplicity, we concentrate on the two-party case. Extending the definitions to the multi-party case is straightforward.

1) Security in the semi-honest model: A semi-honest party follows the rules of the protocol using its correct input, but is free to later use what it sees during execution of the protocol to compromise security. This is somewhat realistic in the real world because parties who want to mine data for their mutual benefit will follow the protocol to get correct results. Also, a protocol that is buried in large, complex software cannot be easily altered.

A formal definition of private two-party computation in the semi-honest model is given below. Computing a function privately is equivalent to computing it securely. The formal proof of this can be found in Goldreich [9].

Definition 2.1 (privacy w.r.t. semi-honest behavior) [9]: Let f : {0,1}* × {0,1}* → {0,1}* × {0,1}* be a probabilistic, polynomial-time functionality, where f1(x, y) (resp., f2(x, y)) denotes the first (resp., second) element of f(x, y), and let Π be a two-party protocol for computing f.

Let the view of the first (resp., second) party during an execution of Π on (x, y), denoted view_1^Π(x, y) (resp., view_2^Π(x, y)), be (x, r1, m1, . . . , mt) (resp., (y, r2, m1, . . . , mt)), where r1 (resp., r2) represents the outcome of the first (resp., second) party's internal coin tosses, and mi represents the i-th message it has received.

The output of the first (resp., second) party during an execution of Π on (x, y) is denoted output_1^Π(x, y) (resp., output_2^Π(x, y)) and is implicit in the party's view of the execution.

Π privately computes f if there exist probabilistic polynomial-time algorithms, denoted S1 and S2, such that

{(S1(x, f1(x, y)), f2(x, y))}_{x,y ∈ {0,1}*} ≡_C {(view_1^Π(x, y), output_2^Π(x, y))}_{x,y ∈ {0,1}*}   (1)

{(f1(x, y), S2(y, f2(x, y)))}_{x,y ∈ {0,1}*} ≡_C {(output_1^Π(x, y), view_2^Π(x, y))}_{x,y ∈ {0,1}*}   (2)

where ≡_C denotes computational indistinguishability.

The above definition says that a computation is secure if the view of each party during the execution of the protocol can be effectively simulated by the input and the output of the party. This is not quite the same as saying that private information is protected. For example, if two parties use a secure protocol to mine distributed association rules, the protocol still reveals that if a particular rule is not supported by a particular site, and that rule appears in the globally supported rule set, then it must be supported by the other site. A site can deduce this information solely by looking at its locally supported rules and the globally supported rules. On the other hand, there is no way to deduce the exact support count of some itemset by looking at the globally supported rules. With three or more parties, knowing a rule holds globally reveals that at least one site supports it, but no site knows which site (other than, obviously, itself). In summary, a secure multi-party protocol will not reveal more information to a particular party than the information that can be induced by looking at that party's input and the output.

2) Yao's general two-party secure function evaluation: Yao's general secure two-party evaluation is based on expressing the function f(x, y) as a circuit and encrypting the gates for secure evaluation [11]. With this protocol, any two-party function can be evaluated securely in the semi-honest model. To be efficiently evaluated, however, the functions must have a small circuit representation. We will not give details of this generic method; however, we do use this generic result for securely finding whether a ≥ b (Yao's millionaire problem). For comparing any two integers securely, Yao's generic method is one of the most efficient methods known, although other asymptotically equivalent but practically more efficient algorithms could be used as well [12].

C. Commutative Encryption

Commutative encryption is an important tool that can be used in many privacy-preserving protocols. An encryption algorithm is commutative if the following two equations hold for any given feasible encryption keys K1, . . . , Kn ∈ K, any M in the domain of items M, and any permutations i, j:

E_{Ki1}(. . . E_{Kin}(M) . . .) = E_{Kj1}(. . . E_{Kjn}(M) . . .)   (3)

∀ M1, M2 ∈ M such that M1 ≠ M2, and for given k and ε < 1/2^k:

Pr(E_{Ki1}(. . . E_{Kin}(M1) . . .) = E_{Kj1}(. . . E_{Kjn}(M2) . . .)) < ε   (4)

For given k > 1, |CG(k)| is likely to be of reasonable size. However, |CG(1)| could be very large, as it is dependent only on the size of the

domain of items, and is not limited by already discovered frequent itemsets. If the participants can agree on an upper bound on the number of frequent items supported at any one site that is tighter than "every item may be frequent", without inspecting the data, we can achieve a corresponding decrease in the costs with no loss of security. This is likely to be feasible in practice; the very success of the a-priori algorithm is based on the assumption that relatively few items are frequent. Alternatively, if we are willing to leak an upper bound on the number of itemsets supported at each site, each site can set its own upper bound and pad only to that bound. This can be done for every round, not just k = 1. As a practical matter, such an approach would achieve acceptable security and would change the |CG(k)| factor in the communication and encryption costs of Protocol 1 to O(|∪i LLi(k)|), equivalent to FDM.

Another way to limit the encryption cost of padding is to pad randomly from the domain of the encryption output rather than encrypting items from F. Assuming |domain of Ei| >> |domain of itemsets|, the probability of padding with a value that decrypts to a real itemset is small, and even if this occurs it will only result in an additional itemset being tested for support in Protocol 2. When the support count is tested, such false hits will be filtered out, and the final result will be correct.

The comparison phase at the end of Protocol 2 can also be removed, eliminating the O(m × |∪i LLi(k)| × t) bits of communication and O(|∪i LLi(k)| × m × t^3) encryption cost. This reveals the excess support for each itemset. Practical applications may demand this count as part of the result for globally supported itemsets, so the only information leaked is the support counts for itemsets in ∪i LLi(k) − L(k). As these cannot be traced to an individual site, this will generally be acceptable in practice.

The cost estimates are based on the assumption that all frequent itemsets (even 1-itemsets) are part of the result. If exposing the globally frequent 1-itemsets is a problem, the algorithm could easily begin with 2-itemsets (or larger). While the worst-case cost would be unchanged, there would be an impact in practical terms. Eliminating the pruning of globally infrequent 1-itemsets would increase the size of CGi(2) and thus LLi(2); however, local pruning of infrequent 1-itemsets should make the sizes manageable. More critical is the impact on |CG(2)|, and thus the cost of padding to hide the number of locally large itemsets. In practice, the size of CG(2) will rarely be the theoretical limit of (item domain size choose 2), but this worst-case bound would need to be used if the algorithm began with finding 2-itemsets (the problem is worse for k > 2). A practical solution would again be to have sites agree on a reasonable upper bound for the number of locally supported k-itemsets for the initial k, revealing some information to substantially decrease the amount of padding needed.

B. Practical Cost of Encryption

While achieving privacy comes at a reasonable increase in communication cost, what about the cost of encryption? As a test of this, we implemented Pohlig-Hellman, described in the appendix. The encryption time per itemset, represented with t = 512 bits, was 0.00428 seconds on a 700MHz Pentium 3 under Linux. Using this, and the results reported in [3], we can estimate the cost of privacy-preserving association rule mining on the tests in [3].

The first set of experiments described in [3] contains sufficient detail for us to estimate the cost of encryption. These experiments used three sites, an item domain size of 1000, and a total database size of 500k transactions.

The encryption cost for the initial round (k = 1) would be 4.28 seconds at each site, as the padding need only be to the domain size of 1000. While finding two-itemsets could potentially be much worse ((1000 choose 2) = 499500), in practice |CG(2)| is much smaller. The experiment in [3] reports a total number of candidate sets (Σ_{k>1} |CG(k)|) of just over 100,000 at 1% support. This gives a total encryption cost of around 430 seconds per site, with all sites encrypting simultaneously. This assumes none of the optimizations of Section VI-A; if the encryption cost at each site could be cut to |LLi(k)| by eliminating the cost of encrypting the padding items, the encryption cost would be cut to 5% to 35% of the above on the datasets used in [3].
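The estimates above are straightforward arithmetic on the reported numbers:

```python
# Back-of-the-envelope check of the cost estimates above, using the
# measured 0.00428 s per itemset encryption at t = 512 bits.
per_itemset = 0.00428            # seconds per itemset encryption
initial = 1000 * per_itemset     # k = 1 round: pad to item domain size 1000
later = 100_000 * per_itemset    # sum over k > 1 of |CG(k)| at 1% support
# initial is 4.28 s; later is about 428 s per site
```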

There is also the encryption cost of the secure comparison at the end of Protocol 2. Although the data reported in [3] does not give us the exact size of ∪i LLi(k), it appears to be on the order of 2000. Based on this, the cost of the secure comparison, O(|∪i LLi(k)| × m × t^3), would be about 170 seconds.

The total execution time for the experiment reported in [3] was approximately 800 seconds. Similar numbers hold at different support levels; the added cost of encryption would at worst increase the total run time by roughly 75%.

VII. CONCLUSIONS AND FURTHER WORK

Cryptographic tools can enable data mining that would otherwise be prevented due to security concerns. We have given procedures to mine distributed association rules on horizontally partitioned data. We have shown that distributed association rule mining can be done efficiently under reasonable security assumptions.

We believe the need for mining of data where access is restricted by privacy concerns will increase. Examples include knowledge discovery among intelligence services of different countries and collaboration among corporations without revealing trade secrets. Even within a single multi-national company, privacy laws in different jurisdictions may prevent sharing individual data. Many more examples can be imagined.

We would like to see secure algorithms for classification, clustering, etc. Another possibility is secure approximate data mining algorithms. Allowing error in the results may enable more efficient algorithms that maintain the desired level of security.

The secure multi-party computation definitions from the cryptography domain may be too restrictive for our purposes. A specific example of the need for more flexible definitions can be seen in Protocol 1. The padding set F is defined to be infinite, so that the probability of collision among these items is 0. This is impractical, and intuitively allowing collisions among the padded itemsets would seem more secure, as the information leaked (itemsets supported in common by subsets of the sites) would become an upper bound rather than an exact value. However, unless we know in advance the probability of collision among real itemsets, or more specifically can set the size of F so the ratio of the collision probabilities in F and real itemsets is constant, the protocol is less secure under secure multi-party computation definitions. The problem is that knowing the probability of collision among items chosen from F enables us to predict (although generally with low accuracy) which fully encrypted itemsets are real and which are fake. This allows a probabilistic upper bound estimate on the number of itemsets supported at each site. It also allows a probabilistic estimate of the number of itemsets supported in common by subsets of the sites that is tighter than the number of collisions found in the RuleSet. Definitions that allow us to trade off such estimates, and techniques to prove protocols relative to those definitions, will allow us to prove the privacy of protocols that are practically superior to protocols meeting strict secure multi-party computation definitions.

More suitable security definitions that allow parties to choose their desired level of security are needed, allowing efficient solutions that maintain the desired security. Some suggested directions for research in this area are given in [19]. One line of research is to predict the value of information for a particular organization, allowing tradeoffs between disclosure cost, computation cost, and benefit from the result. We believe some ideas from game theory and economics may be relevant.

In summary, it is possible to mine globally valid results from distributed data without revealing information that compromises the privacy of the individual sources. Such privacy-preserving data mining can be done with a reasonable increase in cost over methods that do not maintain privacy. Continued research will expand the scope of privacy-preserving data mining, enabling most or all data mining methods to be applied in situations where privacy concerns would appear to restrict such mining.

APPENDIX

CRYPTOGRAPHIC NOTES ON COMMUTATIVE ENCRYPTION

The Pohlig-Hellman encryption scheme [15] can be used as a commutative encryption scheme meeting the requirements of Section II-C. Pohlig-Hellman works as follows. Given a large prime p with no small factors of p − 1, each party chooses a random (e, d) pair such that e · d ≡ 1 (mod p − 1). The encryption of a given message M is M^e (mod p). Decryption of a given ciphertext C is done by evaluating C^d (mod p). C^d = M^{e·d} (mod p), and due to Fermat's little theorem, M^{e·d} = M^{1+k(p−1)} ≡ M (mod p).
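A minimal sketch of the scheme (the prime and exponents are illustrative stand-ins, not parameters recommended by the paper):

```python
# Sketch of Pohlig-Hellman encryption/decryption and its commutativity.
# p and the exponents below are illustrative only.
import math

p = 2**89 - 1                      # a Mersenne prime, shared by all parties

def make_key(e):
    """Return (e, d) with e*d = 1 (mod p-1); e must be invertible."""
    assert math.gcd(e, p - 1) == 1
    return e, pow(e, -1, p - 1)

e1, d1 = make_key(65537)
e2, d2 = make_key(101)

M = 123456789
C = pow(M, e1, p)                  # encryption: M^e mod p
assert pow(C, d1, p) == M          # decryption: C^d = M^(e*d) = M (mod p)

# Commutativity (Equation 3): the two encryption orders agree, since
# (M^e1)^e2 = M^(e1*e2) = (M^e2)^e1 (mod p).
assert pow(pow(M, e1, p), e2, p) == pow(pow(M, e2, p), e1, p)
```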

    It is easy to see that Pohlig-Hellman with a shared p satisfies Equation 3. Assume there are n different encryption and decryption pairs ((e1, d1), ..., (en, dn)). For any two permutations i, j of 1..n, let E = e1·e2·...·en ≡ e_i1·e_i2·...·e_in ≡ e_j1·e_j2·...·e_jn (mod p − 1). Then:

    E_e_i1(... E_e_in(M) ...)
      = (...((M^(e_in) mod p)^(e_in−1) mod p)...)^(e_i1) mod p
      = M^(e_in·e_in−1·...·e_i1) mod p
      = M^E mod p
      = M^(e_jn·e_jn−1·...·e_j1) mod p
      = E_e_j1(... E_e_jn(M) ...)
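Concretely, the commutativity can be checked with two toy key pairs sharing the same p (all values illustrative):

```python
# Two parties share the prime p but hold independent exponents e1, e2.
p = 1019                       # toy prime; p - 1 = 2 * 509
e1, e2 = 5, 7                  # both coprime to p - 1
m = 123

def enc(m, e):
    return pow(m, e, p)

# Encrypting in either order yields the same ciphertext, because
# (M^e1)^e2 = M^(e1*e2) = (M^e2)^e1 (mod p).
assert enc(enc(m, e1), e2) == enc(enc(m, e2), e1) == pow(m, e1 * e2, p)
```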

    Equation 4 is also satisfied by the Pohlig-Hellman encryption scheme. Let M1, M2 ∈ GF(p) with M1 ≠ M2. Any order of encryption by all parties amounts to raising the plaintext to the Eth power mod p. Suppose that after encryption M1 and M2 were mapped to the same value. This would imply M1^E = M2^E (mod p). Exponentiating both sides with D = d1·d2·...·dn (mod p − 1), we get M1 = M2 (mod p), a contradiction. (Note that E·D = e1·e2·...·en·d1·d2·...·dn = (e1·d1)·...·(en·dn) ≡ 1 (mod p − 1).) Therefore, the probability that two different elements map to the same value is zero.
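The zero-collision property can even be checked exhaustively on a toy prime: exponentiation by any e with gcd(e, p − 1) = 1 permutes the nonzero residues mod p (parameters illustrative):

```python
# Exhaustive check that E(M) = M^e mod p is injective on 1..p-1.
p = 1019
e = 5                                  # gcd(5, p - 1) = 1
images = {pow(m, e, p) for m in range(1, p)}
assert len(images) == p - 1            # no two plaintexts collide
```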

    Direct implementation of Pohlig-Hellman is not secure. Consider the following example, encrypting two values a and b, where b = a^2:

    E_e(b) = E_e(a^2) = (a^2)^e mod p = (a^e)^2 mod p = (E_e(a))^2 mod p

    This shows that, given two encrypted values, it is possible to determine whether one is the square of the other (even though the base values are not revealed). This violates the security requirement of Section II-C.
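A small sketch of this leak, with toy parameters: an observer who sees only the ciphertexts can still test the squaring relationship.

```python
p = 1019                      # toy shared prime
e = 5                         # a party's secret exponent
a = 57
b = (a * a) % p               # b = a^2 (mod p)

Ea = pow(a, e, p)
Eb = pow(b, e, p)

# The multiplicative relationship survives encryption:
# E_e(a^2) = (a^2)^e = (a^e)^2 = E_e(a)^2 (mod p),
# so the observer learns that b = a^2 without ever seeing a or b.
assert Eb == (Ea * Ea) % p
```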

    Huberman et al. provide a solution [20]. Rather than encrypting items directly, a hash of each item is encrypted. The hash occurs only at the originating site; the second and later encryptions of an item can use Pohlig-Hellman directly. The hash breaks the algebraic relationship revealed by the encryption (e.g., b = a^2). After decryption, the hashed values must be mapped back to the original values. This can be done by hashing the candidate itemsets in CG(k) to build a lookup table; anything not in this table is a fake padding itemset and can be discarded.3

    3A hash collision, resulting in a padding itemset mapping to an itemset in the table, could result in an extra itemset appearing in the union. This would be filtered out by Protocol 2; the final results would be correct.
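A minimal sketch of this hash-then-encrypt idea. The hash construction, prime, and item encoding below are illustrative stand-ins, and the multi-site decryption step is elided (the decrypted hashes are simulated directly):

```python
import hashlib

p = 2**127 - 1                # shared prime (toy choice; Mersenne prime)
e = 5                         # originating site's exponent

def h(itemset):
    """Hash an itemset into Z_p before the first encryption
    (illustrative construction)."""
    digest = hashlib.sha256(",".join(sorted(itemset)).encode()).digest()
    return int.from_bytes(digest, "big") % p

# The originating site encrypts hashes, not raw item codes, so algebraic
# relations between items (e.g., b = a^2) are destroyed.
local_large = [{"A", "B"}, {"B", "C"}]
encrypted = [pow(h(x), e, p) for x in local_large]

# After full decryption, hashed values are mapped back via a lookup
# table built from the candidate sets CG(k); unmatched values are
# padding and are discarded.
CG_k = [{"A", "B"}, {"B", "C"}, {"A", "C"}]
lookup = {h(x): x for x in CG_k}
decrypted_hashes = [h(x) for x in local_large]   # stand-in for real decryption
recovered = [lookup[v] for v in decrypted_hashes if v in lookup]
assert recovered == local_large
```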

    We present the approach from [20] as an example; any secure encryption scheme that satisfies Equations 3 and 4 can be used in our protocols. The above approach is used to generate the cost estimates in Section VI-B. Other approaches, and further definitions and discussion of their security, can be found in [21]-[24].

    ACKNOWLEDGMENT

    We wish to acknowledge the contributions of Mike Atallah

    and Jaideep Vaidya. Discussions with them have helped to

    tighten the proofs, giving clear bounds on the information

    released.

    REFERENCES

    [1] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in Proceedings of the 20th International Conference on Very Large Data Bases. Santiago, Chile: VLDB, Sept. 12-15 1994, pp. 487-499. [Online]. Available: http://www.vldb.org/dblp/db/conf/vldb/vldb94-487.html

    [2] D. W.-L. Cheung, J. Han, V. Ng, A. W.-C. Fu, and Y. Fu, "A fast distributed algorithm for mining association rules," in Proceedings of the 1996 International Conference on Parallel and Distributed Information Systems (PDIS'96). Miami Beach, Florida, USA: IEEE, Dec. 1996, pp. 31-42.

    [3] D. W.-L. Cheung, V. Ng, A. W.-C. Fu, and Y. Fu, "Efficient mining of association rules in distributed databases," IEEE Trans. Knowledge Data Eng., vol. 8, no. 6, pp. 911-922, Dec. 1996.

    [4] R. Agrawal and R. Srikant, "Privacy-preserving data mining," in Proceedings of the 2000 ACM SIGMOD Conference on Management of Data. Dallas, TX: ACM, May 14-19 2000, pp. 439-450. [Online]. Available: http://doi.acm.org/10.1145/342009.335438

    [5] D. Agrawal and C. C. Aggarwal, "On the design and quantification of privacy preserving data mining algorithms," in Proceedings of the Twentieth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. Santa Barbara, California, USA: ACM, May 21-23 2001, pp. 247-255. [Online]. Available: http://doi.acm.org/10.1145/375551.375602

    [6] A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke, "Privacy preserving mining of association rules," in The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada, July 23-26 2002, pp. 217-228. [Online]. Available: http://doi.acm.org/10.1145/775047.775080

    [7] S. J. Rizvi and J. R. Haritsa, "Maintaining data privacy in association rule mining," in Proceedings of 28th International Conference on Very Large Data Bases. Hong Kong: VLDB, Aug. 20-23 2002, pp. 682-693. [Online]. Available: http://www.vldb.org/conf/2002/S19P03.pdf

    [8] Y. Lindell and B. Pinkas, "Privacy preserving data mining," in Advances in Cryptology - CRYPTO 2000. Springer-Verlag, Aug. 20-24 2000, pp. 36-54. [Online]. Available: http://link.springer.de/link/service/series/0558/bibs/1880/18800036.htm

    [9] O. Goldreich, "Secure multi-party computation," Sept. 1998, (working draft). [Online]. Available: http://www.wisdom.weizmann.ac.il/~oded/pp.html

    [10] J. Vaidya and C. Clifton, "Privacy preserving association rule mining in vertically partitioned data," in The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada, July 23-26 2002, pp. 639-644. [Online]. Available: http://doi.acm.org/10.1145/775047.775142

    [11] A. C. Yao, "How to generate and exchange secrets," in Proceedings of the 27th IEEE Symposium on Foundations of Computer Science. IEEE, 1986, pp. 162-167.

    [12] I. Ioannidis and A. Grama, "An efficient protocol for Yao's millionaires' problem," in Hawaii International Conference on System Sciences (HICSS-36), Waikoloa Village, Hawaii, Jan. 6-9 2003.

    [13] O. Goldreich, "Encryption schemes," Mar. 2003, (working draft). [Online]. Available: http://www.wisdom.weizmann.ac.il/~oded/PSBookFrag/enc.ps

    [14] R. L. Rivest, A. Shamir, and L. Adleman, "A method for obtaining digital signatures and public-key cryptosystems," Communications of the ACM, vol. 21, no. 2, pp. 120-126, 1978. [Online]. Available: http://doi.acm.org/10.1145/359340.359342


    [15] S. C. Pohlig and M. E. Hellman, "An improved algorithm for computing logarithms over GF(p) and its cryptographic significance," IEEE Transactions on Information Theory, vol. IT-24, pp. 106-110, 1978.

    [16] M. K. Reiter and A. D. Rubin, "Crowds: Anonymity for Web transactions," ACM Transactions on Information and System Security, vol. 1, no. 1, pp. 66-92, Nov. 1998. [Online]. Available: http://doi.acm.org/10.1145/290163.290168

    [17] J. C. Benaloh, "Secret sharing homomorphisms: Keeping shares of a secret secret," in Advances in Cryptography - CRYPTO '86: Proceedings, A. Odlyzko, Ed., vol. 263. Springer-Verlag, Lecture Notes in Computer Science, 1986, pp. 251-260. [Online]. Available: http://springerlink.metapress.com/openurl.asp?genre=article&issn=0302-9743&volume=263&spage=251

    [18] B. Chor and E. Kushilevitz, "A communication-privacy tradeoff for modular addition," Information Processing Letters, vol. 45, no. 4, pp. 205-210, 1993.

    [19] C. Clifton, M. Kantarcioglu, and J. Vaidya, "Defining privacy for data mining," in National Science Foundation Workshop on Next Generation Data Mining, H. Kargupta, A. Joshi, and K. Sivakumar, Eds., Baltimore, MD, Nov. 1-3 2002, pp. 126-133.

    [20] B. A. Huberman, M. Franklin, and T. Hogg, "Enhancing privacy and trust in electronic communities," in Proceedings of the First ACM Conference on Electronic Commerce (EC'99). Denver, Colorado, USA: ACM Press, Nov. 3-5 1999, pp. 78-86.

    [21] J. C. Benaloh and M. de Mare, "One-way accumulators: A decentralized alternative to digital signatures," in Advances in Cryptology - EUROCRYPT '93, Workshop on the Theory and Application of Cryptographic Techniques, ser. Lecture Notes in Computer Science, vol. 765. Lofthus, Norway: Springer-Verlag, May 1993, pp. 274-285. [Online]. Available: http://springerlink.metapress.com/openurl.asp?genre=article&issn=0302-9743&volume=765&spage=274

    [22] W. Diffie and M. Hellman, "New directions in cryptography," IEEE Trans. Inform. Theory, vol. IT-22, no. 6, pp. 644-654, Nov. 1976.

    [23] T. ElGamal, "A public key cryptosystem and a signature scheme based on discrete logarithms," IEEE Trans. Inform. Theory, vol. IT-31, no. 4, pp. 469-472, July 1985.

    [24] A. Shamir, R. L. Rivest, and L. M. Adleman, "Mental poker," Laboratory for Computer Science, MIT, Technical Memo MIT-LCS-TM-125, Feb. 1979.

    Murat Kantarcioglu is a Ph.D. candidate at Purdue University. He has a Master's degree in Computer Science from Purdue University and a Bachelor's degree in Computer Engineering from Middle East Technical University, Ankara, Turkey. His research interests include data mining, database security, and information security. He is a student member of the ACM.

    Chris Clifton is an Associate Professor of Computer Science at Purdue University. He has a Ph.D. from Princeton University, and Bachelor's and Master's degrees from the Massachusetts Institute of Technology. Prior to joining Purdue he held positions at The MITRE Corporation and Northwestern University. His research interests include data mining, database support for text, and database security. He is a senior member of the IEEE and a member of the IEEE Computer Society and the ACM.

    Protocol 1 Finding secure union of large itemsets of size k

    Require: N ≥ 3 sites numbered 0..N − 1, set F of non-itemsets.

    Phase 0: Encryption of all the rules by all sites
    for each site i do
      generate LL_i(k) as in steps 1 and 2 of the FDM algorithm
      LLe_i(k) = ∅
      for each X ∈ LL_i(k) do
        LLe_i(k) = LLe_i(k) ∪ {E_i(X)}
      end for
      for j = |LLe_i(k)| + 1 to |CG(k)| do
        LLe_i(k) = LLe_i(k) ∪ {E_i(random selection from F)}
      end for
    end for

    Phase 1: Encryption by all sites
    for Round j = 0 to N − 1 do
      if Round j = 0 then
        Each site i sends permuted LLe_i(k) to site (i + 1) mod N
      else
        Each site i encrypts all items in LLe_((i−j) mod N)(k) with E_i, permutes, and sends it to site (i + 1) mod N
      end if
    end for
    {At the end of Phase 1, site i has the itemsets of site (i + 1) mod N encrypted by every site}

    Phase 2: Merge odd/even itemsets
    Each site i sends LLe_((i+1) mod N)(k) to site i mod 2
    Site 0 sets RuleSet1 = ∪_{j=1..⌈(N−1)/2⌉} LLe_(2j−1)(k)
    Site 1 sets RuleSet0 = ∪_{j=0..⌊(N−1)/2⌋} LLe_(2j)(k)

    Phase 3: Merge all itemsets
    Site 1 sends permuted RuleSet0 to site 0
    Site 0 sets RuleSet = RuleSet0 ∪ RuleSet1

    Phase 4: Decryption
    for i = 0 to N − 2 do
      Site i decrypts items in RuleSet using D_i
      Site i sends permuted RuleSet to site i + 1
    end for
    Site N − 1 decrypts items in RuleSet using D_(N−1)
    RuleSet(k) = RuleSet − F
    Site N − 1 broadcasts RuleSet(k) to sites 0..N − 2
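A single-process sketch of Protocol 1, using the toy Pohlig-Hellman scheme of the Appendix as the commutative cipher. Integer codes stand in for (hashed) itemsets, and Phases 2 and 3 are collapsed into one union, since they only split the merging work between sites 0 and 1. All values are illustrative:

```python
import random
from math import gcd

p = 2**127 - 1                              # shared toy prime (Mersenne)

def keygen():
    """Commutative Pohlig-Hellman key pair: e * d = 1 (mod p - 1)."""
    while True:
        e = random.randrange(3, p - 1)
        if gcd(e, p - 1) == 1:
            return e, pow(e, -1, p - 1)

N = 3                                       # number of sites
keys = [keygen() for _ in range(N)]
enc = lambda x, i: pow(x, keys[i][0], p)
dec = lambda x, i: pow(x, keys[i][1], p)

# Integer codes standing in for hashed candidate itemsets.
LL = [[101, 202], [202, 303], [101, 404]]   # LL_i(k) at each site
F = [900001, 900002, 900003]                # fake padding "non-itemsets"
CG_size = 4                                 # |CG(k)|

# Phase 0: each site encrypts its own items and pads up to |CG(k)|.
held = []
for i in range(N):
    s = [enc(x, i) for x in LL[i]]
    s += [enc(random.choice(F), i) for _ in range(CG_size - len(s))]
    random.shuffle(s)                       # permute before sending
    held.append(s)

# Phase 1: ring-pass N-1 times; each site encrypts whatever it receives,
# so every set ends up encrypted by all N sites.
for _ in range(N - 1):
    held = [held[(i - 1) % N] for i in range(N)]     # send to site i+1
    held = [[enc(x, i) for x in s] for i, s in enumerate(held)]

# Phases 2-3 (collapsed): union the fully encrypted sets; equal items
# now have equal ciphertexts, so duplicates merge.
rule_set = set().union(*held)

# Phase 4: every site strips its encryption layer; drop the padding.
for i in range(N):
    rule_set = {dec(x, i) for x in rule_set}
assert rule_set - set(F) == {101, 202, 303, 404}
```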


    Protocol 2 Finding the global support counts securely

    Require: N ≥ 3 sites numbered 0..N − 1, m ≥ 2 · |DB|

    rule_set = ∅
    at site 0:
    for each r ∈ candidate set do
      choose a random integer xr from a uniform distribution over 0..m − 1;
      t = r.sup_0 − s · |DB_0| + xr (mod m);
      rule_set = rule_set ∪ {(r, t)};
    end for
    send rule_set to site 1;

    for i = 1 to N − 2 do {at site i}
      for each (r, t) ∈ rule_set do
        t′ = r.sup_i − s · |DB_i| + t (mod m);
        rule_set = (rule_set − {(r, t)}) ∪ {(r, t′)};
      end for
      send rule_set to site i + 1;
    end for

    at site N − 1:
    for each (r, t) ∈ rule_set do
      t′ = r.sup_(N−1) − s · |DB_(N−1)| + t (mod m);
      securely compute whether (t′ − xr) (mod m) < m/2 with site 0; {site 0 knows xr}
      if (t′ − xr) (mod m) < m/2 then
        multi-cast r as a globally large itemset.
      end if
    end for
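The additive chain of Protocol 2 can be sketched in a single process. The data are illustrative, and the final masked comparison, which the protocol performs as a secure two-party computation between sites N − 1 and 0, is done here in the clear:

```python
import random

# Illustrative per-site data: (support count of rule r, database size).
sites = [(20, 300), (15, 300), (10, 400)]    # N = 3 sites
s = 0.04                                     # global minimum support
m = 2 * sum(db for _, db in sites)           # m >= 2 * |DB|

# Site 0 masks its "excess support" with a random x_r.
x_r = random.randrange(m)
sup0, db0 = sites[0]
t = (sup0 - int(s * db0) + x_r) % m

# Sites 1 .. N-2 each add their own excess support to the running total.
for sup_i, db_i in sites[1:-1]:
    t = (sup_i - int(s * db_i) + t) % m

# Site N-1 adds its share; then, jointly with site 0 (who knows x_r),
# the masked total is compared against m/2.
sup_n, db_n = sites[-1]
t = (sup_n - int(s * db_n) + t) % m
excess = (t - x_r) % m                       # = sum_i (sup_i - s*|DB_i|) mod m
globally_large = excess < m / 2              # i.e., actual excess >= 0
```

With these numbers the total support is 45 against a threshold of s · |DB| = 40, so the excess is 5 and the rule is reported as globally large regardless of the random mask, which cancels out.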