2 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, TO APPEAR
Privacy-preserving Distributed Mining of Association Rules on Horizontally Partitioned Data

Murat Kantarcioglu and Chris Clifton, Senior Member, IEEE
Abstract: Data mining can extract important knowledge from large data collections, but sometimes these collections are split among various parties. Privacy concerns may prevent the parties from directly sharing the data, and some types of information about the data. This paper addresses secure mining of association rules over horizontally partitioned data. The methods incorporate cryptographic techniques to minimize the information shared, while adding little overhead to the mining task.
Index Terms: Data Mining, Security, Privacy
I. INTRODUCTION
Data mining technology has emerged as a means of identi-
fying patterns and trends from large quantities of data. Data
mining and data warehousing go hand-in-hand: most tools
operate by gathering all data into a central site, then running
an algorithm against that data. However, privacy concerns can prevent building a centralized warehouse: data may be distributed among several custodians, none of which are allowed to transfer their data to another site.
This paper addresses the problem of computing associa-
tion rules within such a scenario. We assume homogeneous
databases: All sites have the same schema, but each site has
information on different entities. The goal is to produce association rules that hold globally, while limiting the information shared about each site.
Computing association rules without disclosing individual transactions is straightforward. We can compute the global support and confidence of an association rule AB ⇒ C knowing only the local supports of AB and ABC, and the size of each database:

support_{AB⇒C} = (Σ_{i=1}^{sites} support_count_{ABC}(i)) / (Σ_{i=1}^{sites} database_size(i))

support_{AB} = (Σ_{i=1}^{sites} support_count_{AB}(i)) / (Σ_{i=1}^{sites} database_size(i))

confidence_{AB⇒C} = support_{AB⇒C} / support_{AB}
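As a concrete check of these formulas, the global support and confidence can be computed from purely local statistics; the per-site numbers below are illustrative, not from the paper:

```python
# Hypothetical per-site statistics for rule AB => C: local support counts
# for itemsets ABC and AB, and each site's database size.
sites = [
    {"abc": 5,  "ab": 20, "size": 100},
    {"abc": 6,  "ab": 15, "size": 200},
    {"abc": 20, "ab": 40, "size": 300},
]

total_size = sum(s["size"] for s in sites)               # 600 transactions
support_abc = sum(s["abc"] for s in sites) / total_size  # global support of ABC
support_ab = sum(s["ab"] for s in sites) / total_size    # global support of AB
confidence = support_abc / support_ab                    # confidence of AB => C

print(round(support_abc, 4), round(confidence, 4))       # -> 0.0517 0.4133
```

Note that each site contributes only two counts and its database size; no individual transactions appear in the computation.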
Note that this doesn't require sharing any individual transactions. We can easily extend an algorithm such as a-priori [1]
The authors are with the Department of Computer Sciences, Purdue University, 250 N. University St, W. Lafayette, IN 47907. E-mail: [email protected], [email protected].
Manuscript received 29 Jan. 2003; revised 11 Jun. 2003; accepted 1 Jul. 2003.
© 2003 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.
to the distributed case using the following lemma: If a rule has support > k% globally, it must have support > k% on at least one of the individual sites. A distributed algorithm for this would work as follows: Request that each site send all rules with support at least k%. For each rule returned, request that all sites send the count of their transactions that support the rule, and the total count of all transactions at the site. From this, we can compute the global support of each rule, and (from the lemma) be certain that all rules with support at least k% have been found. More thorough studies of distributed association rule mining can be found in [2], [3].

The above approach protects individual data privacy, but
it does require that each site disclose what rules it supports,
and how much it supports each potential global rule. What
if this information is sensitive? For example, suppose the
Centers for Disease Control (CDC), a public agency, would
like to mine health records to try to find ways to reduce
the proliferation of antibiotic resistant bacteria. Insurance
companies have data on patient diseases and prescriptions.
Mining this data would allow the discovery of rules such as Augmentin & Summer ⇒ Infection & Fall, i.e., people taking Augmentin in the summer seem to have recurring infections.
The problem is that insurance companies will be concerned about sharing this data. Not only must the privacy of patient
records be maintained, but insurers will be unwilling to release
rules pertaining only to them. Imagine a rule indicating a high
rate of complications with a particular medical procedure. If
this rule doesn't hold globally, the insurer would like to know this: they can then try to pinpoint the problem with their policies and improve patient care. If the fact that the insurer's data supports this rule is revealed (say, under a Freedom of Information Act request to the CDC), the insurer could be
exposed to significant public relations or liability problems.
This potential risk could exceed their own perception of the
benefit of participating in the CDC study.
This paper presents a solution that preserves such secrets: the parties learn (almost) nothing beyond the global results. The solution is efficient: the additional cost relative to previous non-secure techniques is O(number of candidate itemsets × sites) encryptions, and a constant increase in the number of messages.
The method presented in this paper assumes three or more
parties. In the two-party case, knowing a rule is supported
globally and not supported at ones own site reveals that the
other site supports the rule. Thus, much of the knowledge
we try to protect is revealed even with a completely secure
method for computing the global results. We discuss the two-
[Figure: Site 1 holds itemset C, Site 3 holds C and D; the itemsets are passed around the ring, each site adding a layer of encryption: E1(C), E3(E1(C)), E2(E3(E1(C))); E3(C), E3(D), E2(E3(C)), E2(E3(D)); decryption then yields C and D.]
Fig. 1. Determining global candidate itemsets
party case further in Section V. By the same argument, we
assume no collusion, as colluding parties can reduce this to
the two-party case.
A. Private Association Rule Mining Overview

Our method follows the basic approach outlined on Page 2
except that values are passed between the local data mining
sites rather than to a centralized combiner. The two phases
are discovering candidate itemsets (those that are frequent on
one or more sites), and determining which of the candidate
itemsets meet the global support/confidence thresholds.
The first phase (Figure 1) uses commutative encryption.
Each party encrypts its own frequent itemsets (e.g., Site 1
encrypts itemset C). The encrypted itemsets are then passed
to other parties, until all parties have encrypted all itemsets.
These are passed to a common party to eliminate duplicates,
and to begin decryption. (In the figure, the full set of itemsets is shown to the left of Site 1, after Site 1 decrypts.) This set is then passed to each party, and each party decrypts each
itemset. The final result is the common itemsets (C and D in
the figure).
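The first phase can be sketched with a toy commutative cipher: modular exponentiation with a shared prime, in the spirit of the Pohlig-Hellman scheme discussed in the appendix. The prime, keys, and integer encodings of itemsets C and D below are illustrative values, far too small for real security:

```python
p = 1019                      # shared prime; p - 1 = 2 * 509

def enc(key, m):
    return pow(m, key, p)     # E_k(m) = m^k mod p, commutative in the keys

# Each site's secret exponent, coprime to p - 1.
keys = {1: 3, 2: 5, 3: 7}

# Locally frequent itemsets encoded as integers: C = 10, D = 20.
local = {1: {10}, 2: set(), 3: {10, 20}}

# Every itemset ends up encrypted by every site (in a real run the values
# are passed site to site; the order does not matter by commutativity).
fully_encrypted = set()
for site, items in local.items():
    for m in items:
        c = m
        for k in keys.values():
            c = enc(k, c)
        fully_encrypted.add(c)   # duplicates (C from sites 1 and 3) merge

# Decryption: each site applies its d with e*d = 1 (mod p - 1);
# here we invert the product of all keys at once.
d_total = pow(3 * 5 * 7, -1, p - 1)
union = {pow(c, d_total, p) for c in fully_encrypted}
print(sorted(union))             # -> [10, 20]: the union {C, D}
```

Because the duplicate copies of C encrypt to the same fully-encrypted value, the common party sees only the union, not which site contributed which itemset.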
In the second phase (Figure 2), each of the locally supported
itemsets is tested to see if it is supported globally. In the figure,
the itemset ABC is known to be supported at one or more sites,
and each computes their local support. The first site chooses a
random value R, and adds to R the amount by which its support
for ABC exceeds the minimum support threshold. This value is
passed to site 2, which adds the amount by which its support
exceeds the threshold (note that this may be negative, as shown
in the figure.) This is passed to site three, which again adds
its excess support. The resulting value (18) is tested using a secure comparison to see if it exceeds the random value R (17). If so, itemset ABC is supported globally.
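Using the numbers from Figure 2 (supports 5, 6, and 20 at sites with 100, 200, and 300 transactions, a 5% threshold, and R = 17), the second phase can be sketched as:

```python
# Secure-sum sketch: the first site masks its excess support with a random
# R; each site adds its own (possibly negative) excess and passes it on.
min_sup = 0.05
sites = [(5, 100), (6, 200), (20, 300)]   # (local count, database size)

R = 17                      # random value chosen by the first site
running = R
for count, db_size in sites:
    running += count - min_sup * db_size  # excess over the local threshold

# The endpoints then run a secure comparison of `running` against R;
# here we simply compare in the clear.
print(running, running >= R)   # -> 18.0 True: ABC is globally supported
```

The intermediate values (17, 13, 18) reveal nothing about any single site's count, since each is offset by the unknown R.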
This gives a brief, oversimplified idea of how the method
works. Section III gives full details. Before going into the
details, we give background and definitions of relevant data
mining and security techniques.
II. BACKGROUND AND RELATED WORK
There are several fields where related work is occurring. We
first describe other work in privacy-preserving data mining,
[Figure: Site 1 (ABC: 5, DBSize = 100) chooses R = 17 and sends R + count − 5% · DBSize = 17 + 5 − 5 = 17; Site 2 (ABC: 6, DBSize = 200) sends 17 + 6 − 10 = 13; Site 3 (ABC: 20, DBSize = 300) sends 13 + 20 − 15 = 18; 18 ≥ R? ABC: Yes!]
Fig. 2. Determining if itemset support exceeds 5% threshold
then go into detail on specific background work on which this paper builds.

Previous work in privacy-preserving data mining has addressed two issues. In one, the aim is preserving customer
privacy by distorting the data values [4]. The idea is that
the distorted data does not reveal private information, and
thus is safe to use for mining. The key result is that the
distorted data, and information on the distribution of the
random data used to distort the data, can be used to generate
an approximation to the original data distribution, without
revealing the original data values. The distribution is used to
improve mining results over mining the distorted data directly,
primarily through selection of split points to bin continuous
data. Later refinement of this approach tightened the bounds on what private information is disclosed, by showing that the ability to reconstruct the distribution can be used to tighten estimates of original values based on the distorted data [5].
More recently, the data distortion approach has been applied
to boolean association rules [6], [7]. Again, the idea is to
modify data values such that reconstruction of the values for
any individual transaction is difficult, but the rules learned on
the distorted data are still valid. One interesting feature of
this work is a flexible definition of privacy; e.g., the ability to
correctly guess a value of 1 from the distorted data can be
considered a greater threat to privacy than correctly learning
a 0.
The data distortion approach addresses a different problem
from our work. The assumption with distortion is that the values must be kept private from whoever is doing the mining.
We instead assume that some parties are allowed to see some
of the data, just that no one is allowed to see all the data.
In return, we are able to get exact, rather than approximate,
results.
The other approach uses cryptographic tools to build decision trees [8]. In this work, the goal is to securely build an
ID3 decision tree where the training set is distributed between
two parties. The basic idea is that finding the attribute that
maximizes information gain is equivalent to finding the at-
tribute that minimizes the conditional entropy. The conditional
entropy for an attribute for two parties can be written as a sum of expressions of the form (v1 + v2) log(v1 + v2). The authors give a way to securely calculate the expression (v1 + v2) log(v1 + v2) and show how to use this function for building ID3 securely. This approach treats privacy-
preserving data mining as a special case of secure multi-party
computation [9] and not only aims to preserve individual privacy but also tries to prevent leakage of any information other than the final result. We follow this approach, but address
a different problem (association rules), and emphasize the
efficiency of the resulting algorithms. A particular difference
is that we recognize that some kinds of information can be
exchanged without violating security policies; secure multi-
party computation forbids leakage of any information other
than the final result. The ability to share non-sensitive data
enables highly efficient solutions.
The problem of privately computing association rules in
vertically partitioned distributed data has also been addressed
[10]. The vertically partitioned problem occurs when each
transaction is split across multiple sites, with each site having
a different set of attributes for the entire set of transactions. With horizontal partitioning, each site has a set of complete transactions. In relational terms, with horizontal partitioning the relation to be mined is the union of the relations at the sites. In
vertical partitioning, the relations at the individual sites must
be joined to get the relation to be mined. The change in the
way the data is distributed makes this a much different problem
from the one we address here, resulting in a very different
solution.
A. Mining of Association Rules
The association rule mining problem can be defined as follows [1]: Let I = {i1, i2, . . . , in} be a set of items. Let DB be a set of transactions, where each transaction T is an itemset such that T ⊆ I. Given an itemset X ⊆ I, a transaction T contains X if and only if X ⊆ T. An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅. The rule X ⇒ Y has support s in the transaction database DB if s% of the transactions in DB contain X ∪ Y. The rule holds in DB with confidence c if c% of the transactions in DB that contain X also contain Y. An itemset X with k items is called a k-itemset. The problem of mining association rules is to find all rules whose support and confidence are higher than certain user-specified minimum support and confidence.
In this simplified definition of association rules, missing items, negatives, and quantities are not considered. In this respect, the transaction database DB can be seen as a 0/1 matrix where each column is an item and each row is a transaction. In this paper, we use this view of association rules.
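Under this 0/1-matrix view, local support and confidence can be computed directly; the toy database below is illustrative, not from the paper:

```python
# Rows are transactions, columns are items (0/1-matrix view).
DB = [
    {"A": 1, "B": 1, "C": 1},
    {"A": 1, "B": 1, "C": 0},
    {"A": 0, "B": 1, "C": 1},
    {"A": 1, "B": 0, "C": 0},
]

def support(itemset):
    # Fraction of transactions containing every item in `itemset`.
    return sum(all(t[i] for i in itemset) for t in DB) / len(DB)

# Rule {A} => {B}: support counts transactions containing A union B.
sup_ab = support({"A", "B"})       # 2 of 4 transactions = 0.5
conf = sup_ab / support({"A"})     # 0.5 / 0.75
print(sup_ab, round(conf, 4))      # -> 0.5 0.6667
```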
1) Distributed Mining of Association Rules: The above
problem of mining association rules can be extended to
distributed environments. Let us assume that a transaction database DB is horizontally partitioned among n sites (namely S1, S2, . . . , Sn), where DB = DB1 ∪ DB2 ∪ . . . ∪ DBn and DBi resides at site Si (1 ≤ i ≤ n). The itemset X has local support count X.supi at site Si if X.supi of the transactions at Si contain X. The global support count of X is given as X.sup = Σ_{i=1}^{n} X.supi. An itemset X is globally supported if X.sup ≥ s × Σ_{i=1}^{n} |DBi|. The global confidence of a rule X ⇒ Y is given as {X ∪ Y}.sup / X.sup.
The set of large itemsets L(k) consists of all k-itemsets that are globally supported. The set of locally large itemsets LLi(k) consists of all k-itemsets supported locally at site Si. GLi(k) = L(k) ∩ LLi(k) is the set of globally large k-itemsets locally supported at site Si. The aim of distributed association rule mining is to find the sets L(k) for all k > 1 and the support counts for these itemsets, and from this compute association rules with the specified minimum support and confidence.
A fast algorithm for distributed association rule mining is given by Cheung et al. [2]. Their procedure for fast distributed mining of association rules (FDM) is summarized below.
1) Candidate Set Generation: Generate candidate sets CGi(k) based on GLi(k-1), the itemsets supported by Si at the (k-1)-th iteration, using the classic a-priori candidate generation algorithm. Each site generates candidates based on the intersection of globally large (k-1)-itemsets and locally large (k-1)-itemsets.
2) Local Pruning: For each X ∈ CGi(k), scan the database DBi at Si to compute X.supi. If X is locally large at Si, it is included in the set LLi(k). It is clear that if X is supported globally, it will be supported at at least one site.
3) Support Count Exchange: The LLi(k) are broadcast, and each site computes the local support for the items in ∪i LLi(k).
4) Broadcast Mining Results: Each site broadcasts the local support for itemsets in ∪i LLi(k). From this, each site is able to compute L(k).
The details of the above algorithm can be found in [2].
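Step 1 (candidate generation with subset pruning) can be sketched as follows. This is a simplified variant that enumerates all k-subsets of the seen items and prunes any candidate with an infrequent (k-1)-subset, which yields the same result as the classic a-priori join-and-prune; the example 2-itemsets are hypothetical:

```python
from itertools import combinations

def candidate_gen(gl_prev, k):
    """Generate k-itemset candidates from globally large (k-1)-itemsets."""
    items = set().union(*gl_prev)
    candidates = set()
    for c in (frozenset(s) for s in combinations(sorted(items), k)):
        # Keep c only if every (k-1)-subset is globally large (a-priori prune).
        if all(frozenset(s) in gl_prev for s in combinations(sorted(c), k - 1)):
            candidates.add(c)
    return candidates

# Hypothetical globally large 2-itemsets known to site i.
gl2 = {frozenset(p) for p in [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]}
cg3 = candidate_gen(gl2, 3)
print(sorted("".join(sorted(c)) for c in cg3))   # -> ['ABC']
```

ABD, ACD, and BCD are pruned because AD and CD are not large; only ABC survives.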
B. Secure Multi-party Computation
Substantial work has been done on secure multi-party com-
putation. The key result is that a wide class of computations
can be computed securely under reasonable assumptions. We
give a brief overview of this work, concentrating on material
that is used later in the paper. The definitions given here are
from Goldreich [9]. For simplicity, we concentrate on the two-
party case. Extending the definitions to the multi-party case is
straightforward.
1) Security in semi-honest model: A semi-honest party
follows the rules of the protocol using its correct input, but is free to later use what it sees during execution of the protocol
to compromise security. This is somewhat realistic in the real
world because parties who want to mine data for their mutual
benefit will follow the protocol to get correct results. Also, a
protocol that is buried in large, complex software cannot be
easily altered.
A formal definition of private two-party computation in
the semi-honest model is given below. Computing a function
privately is equivalent to computing it securely. The formal
proof of this can be found in Goldreich [9].
Definition 2.1: (privacy w.r.t. semi-honest behavior): [9]
Let f : {0, 1}* × {0, 1}* → {0, 1}* × {0, 1}* be a probabilistic, polynomial-time functionality, where f1(x, y) (resp., f2(x, y)) denotes the first (resp., second) element of f(x, y), and let Π be a two-party protocol for computing f.

Let the view of the first (resp., second) party during an execution of Π on (x, y), denoted view1^Π(x, y) (resp., view2^Π(x, y)), be (x, r1, m1, . . . , mt) (resp., (y, r2, m1, . . . , mt)), where r1 (resp., r2) represents the outcome of the first (resp., second) party's internal coin tosses, and mi represents the i-th message it has received.

The output of the first (resp., second) party during an execution of Π on (x, y) is denoted output1^Π(x, y) (resp., output2^Π(x, y)) and is implicit in the party's view of the execution.

Π privately computes f if there exist probabilistic polynomial-time algorithms, denoted S1 and S2, such that

{(S1(x, f1(x, y)), f2(x, y))}_{x,y ∈ {0,1}*} ≡_C {(view1^Π(x, y), output2^Π(x, y))}_{x,y ∈ {0,1}*}   (1)

{(f1(x, y), S2(y, f2(x, y)))}_{x,y ∈ {0,1}*} ≡_C {(output1^Π(x, y), view2^Π(x, y))}_{x,y ∈ {0,1}*}   (2)

where ≡_C denotes computational indistinguishability.
The above definition says that a computation is secure if
the view of each party during the execution of the protocol
can be effectively simulated by the input and the output of
the party. This is not quite the same as saying that private
information is protected. For example, if two parties use a
secure protocol to mine distributed association rules, a secure
protocol still reveals that if a particular rule is not supported by
a particular site and that rule appears in the globally supported rule set, then it must be supported by the other site. A site can deduce this information solely by looking at its locally
supported rules and the globally supported rules. On the other
hand, there is no way to deduce the exact support count of
some itemset by looking at the globally supported rules. With
three or more parties, knowing a rule holds globally reveals
that at least one site supports it, but no site knows which site
(other than, obviously, itself). In summary, a secure multi-party
protocol will not reveal more information to a particular party
than the information that can be induced by looking at that
party's input and the output.
2) Yao's general two-party secure function evaluation: Yao's general secure two-party evaluation is based on expressing the function f(x, y) as a circuit and encrypting the gates for secure evaluation [11]. With this protocol, any two-
party function can be evaluated securely in the semi-honest
model. To be efficiently evaluated, however, the functions
must have a small circuit representation. We will not give details of this generic method; however, we do use this generic result for securely determining whether a ≥ b (Yao's millionaire problem). For comparing any two integers securely, Yao's
generic method is one of the most efficient methods known,
although other asymptotically equivalent but practically more
efficient algorithms could be used as well [12].
C. Commutative Encryption
Commutative encryption is an important tool that can be
used in many privacy-preserving protocols. An encryption
algorithm is commutative if the following two equations hold for any given feasible encryption keys K1, . . . , Kn ∈ K, any message M in the domain of items M, and any permutations i, j:

E_{Ki1}(. . . E_{Kin}(M) . . .) = E_{Kj1}(. . . E_{Kjn}(M) . . .)   (3)
∀M1, M2 ∈ M such that M1 ≠ M2, and for given k and ε < 1/2^k:

Pr[E_{Ki1}(. . . E_{Kin}(M1) . . .) = E_{Kj1}(. . . E_{Kjn}(M2) . . .)] < ε   (4)

[. . .]

For k > 1, |CG(k)| is likely to be of reasonable size. However, |CG(1)| could be very large, as it is dependent only on the size of the
domain of items, and is not limited by already discovered
frequent itemsets. If the participants can agree on an upper bound on the number of frequent items supported at any one site that is tighter than "every item may be frequent," without inspecting the data, we can achieve a corresponding decrease
in the costs with no loss of security. This is likely to be feasible
in practice; the very success of the a-priori algorithm is based
on the assumption that relatively few items are frequent.
Alternatively, if we are willing to leak an upper bound on the number of itemsets supported at each site, each site can set its own upper bound and pad only to that bound. This can be done for every round, not just k = 1. As a practical matter, such an approach would achieve acceptable security and would change the |CG(k)| factor in the communication and encryption costs of Protocol 1 to O(|∪i LLi(k)|), equivalent to FDM.
Another way to limit the encryption cost of padding is to
pad randomly from the domain of the encryption output rather
than encrypting items from F. Assuming |domain of Ei| >> |domain of itemsets|, the probability of padding with a value that decrypts to a real itemset is small, and even if this occurs it will only result in an additional itemset being tested for support in Protocol 2. When the support count is tested, such false
hits will be filtered out, and the final result will be correct.

The comparison phase at the end of Protocol 2 can also be removed, eliminating the O(m × |∪i LLi(k)| × t) bits of communication and O(|∪i LLi(k)| × m × t^3) encryption cost. This reveals the excess support for each itemset. Practical
applications may demand this count as part of the result for
globally supported itemsets, so the only information leaked is
the support counts for itemsets in ∪i LLi(k) − L(k). As these
acceptable in practice.
The cost estimates are based on the assumption that all
frequent itemsets (even 1-itemsets) are part of the result. If
exposing the globally frequent 1-itemsets is a problem, the
algorithm could easily begin with 2-itemsets (or larger). While
the worst-case cost would be unchanged, there would be an
impact in practical terms. Eliminating the pruning of globally
infrequent 1-itemsets would increase the size of CGi(2) and thus LLi(2); however, local pruning of infrequent 1-itemsets should make the sizes manageable. More critical is the impact on |CG(2)|, and thus the cost of padding to hide the number of locally large itemsets. In practice, the size of CG(2) will rarely reach the theoretical limit of C(item domain size, 2), but this worst-case bound would need to be used if the algorithm began by finding 2-itemsets (the problem is worse for k > 2). A practical solution would again be to have sites agree on a
reasonable upper bound for the number of locally supported k-itemsets for the initial k, revealing some information to substantially decrease the amount of padding needed.
B. Practical Cost of Encryption
While achieving privacy comes at a reasonable increase in
communication cost, what about the cost of encryption? As a
test of this, we implemented Pohlig-Hellman, described in the
appendix. The encryption time per itemset, represented with t = 512 bits, was 0.00428 seconds on a 700 MHz Pentium 3 under Linux. Using this and the results reported in [3], we can estimate the cost of privacy-preserving association rule mining on the tests in [3].
The first set of experiments described in [3] contains sufficient detail for us to estimate the cost of encryption. These
experiments used three sites, an item domain size of 1000,
and a total database size of 500k transactions.
The encryption cost for the initial round (k = 1) would be 4.28 seconds at each site, as the padding need only be to the domain size of 1000. While finding two-itemsets could potentially be much worse (C(1000, 2) = 499500), in practice |CG(2)| is much smaller. The experiment in [3] reports a total number of candidate sets (Σ_{k>1} |CG(k)|) of just over 100,000 at 1% support. This gives a total encryption cost of around 430 seconds per site, with all sites encrypting simultaneously.
This assumes none of the optimizations of Section VI-A;
if the encryption cost at each site could be cut to |LLi(k)| by eliminating the cost of encrypting the padding items, the encryption cost would be cut to 5% to 35% of the above on the datasets used in [3].
There is also the encryption cost of the secure comparison
at the end of Protocol 2. Although the data reported in [3]
does not give us the exact size of ∪i LLi(k), it appears to be on the order of 2000. Based on this, the cost of the secure comparison, O(|∪i LLi(k)| × m × t^3), would be about 170 seconds.
The total execution time for the experiment reported in
[3] was approximately 800 seconds. Similar numbers hold at
different support levels; the added cost of encryption would at
worst increase the total run time by roughly 75%.
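The arithmetic behind these estimates is easy to reproduce, using the numbers stated above (0.00428 s per encryption, roughly 100,000 candidate sets, about 170 s of comparison cost, and an approximately 800 s baseline run from [3]):

```python
t_enc = 0.00428                 # seconds per 512-bit encryption (700 MHz P3)
candidates = 100_000            # sum over k > 1 of |CG(k)| at 1% support

enc_cost = candidates * t_enc   # per-site encryption cost, sites in parallel
total_added = enc_cost + 170    # encryption plus secure-comparison cost
overhead = total_added / 800    # relative to the non-private baseline

print(round(enc_cost), round(overhead, 2))   # -> 428 0.75
```

This matches the "roughly 75%" worst-case increase in total run time quoted above.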
VII. CONCLUSIONS AND FURTHER WORK
Cryptographic tools can enable data mining that would
otherwise be prevented due to security concerns. We have
given procedures to mine distributed association rules on
horizontally partitioned data. We have shown that distributed
association rule mining can be done efficiently under reason-
able security assumptions.
We believe the need for mining of data where access is
restricted by privacy concerns will increase. Examples include
knowledge discovery among intelligence services of differ-
ent countries and collaboration among corporations without
revealing trade secrets. Even within a single multi-national
company, privacy laws in different jurisdictions may prevent
sharing individual data. Many more examples can be imagined.
We would like to see secure algorithms for classification,
clustering, etc. Another possibility is secure approximate data
mining algorithms. Allowing error in the results may enable
more efficient algorithms that maintain the desired level of
security.
The secure multi-party computation definitions from the
cryptography domain may be too restrictive for our purposes.
A specific example of the need for more flexible definitions
can be seen in Protocol 1. The padding set F is defined to be infinite, so that the probability of collision among these items is 0. This is impractical, and intuitively allowing collisions
among the padded itemsets would seem more secure, as the
information leaked (itemsets supported in common by subsets
of the sites) would become an upper bound rather than an exact
value. However, unless we know in advance the probability of collision among real itemsets, or more specifically can set the size of F so that the ratio of the collision probabilities in F and among real itemsets is constant, the protocol is less secure under secure multi-party computation definitions. The problem is
that knowing the probability of collision among items chosen
from F enables us to predict (although generally with low accuracy) which fully encrypted itemsets are real and which are fake. This allows a probabilistic upper-bound estimate on the number of itemsets supported at each site. It also allows a probabilistic estimate of the number of itemsets supported in common by subsets of the sites that is tighter than the number of collisions found in the RuleSet. Definitions that allow us to trade off such estimates, and techniques to prove protocols
relative to those definitions, will allow us to prove the privacy
of protocols that are practically superior to protocols meeting
strict secure multi-party computation definitions.
More suitable security definitions that allow parties to
choose their desired level of security are needed, allowing
efficient solutions that maintain the desired security. Some
suggested directions for research in this area are given in [19].
One line of research is to predict the value of information for a particular organization, allowing tradeoffs between disclosure
cost, computation cost, and benefit from the result. We believe
some ideas from game theory and economics may be relevant.
In summary, it is possible to mine globally valid results
from distributed data without revealing information that com-
promises the privacy of the individual sources. Such privacy-
preserving data mining can be done with a reasonable increase
in cost over methods that do not maintain privacy. Continued
research will expand the scope of privacy-preserving data
mining, enabling most or all data mining methods to be applied
in situations where privacy concerns would appear to restrict
such mining.
APPENDIX
CRYPTOGRAPHIC NOTES ON COMMUTATIVE ENCRYPTION
The Pohlig-Hellman encryption scheme [15] can be used as a commutative encryption scheme meeting the requirements of Section II-C. Pohlig-Hellman works as follows. Given a large prime p with no small factors of p - 1, each party chooses a random (e, d) pair such that e · d = 1 (mod p - 1). The encryption of a given message M is M^e (mod p). Decryption of a given ciphertext C is done by evaluating C^d (mod p): C^d = M^{e·d} (mod p), and by Fermat's little theorem, M^{e·d} = M^{1+k(p-1)} = M (mod p).
It is easy to see that Pohlig-Hellman with shared p satisfies Equation 3. Let us assume that there are n different encryption/decryption pairs ((e1, d1), . . . , (en, dn)). For any permutations i, j, let E = e1 · e2 · . . . · en = ei1 · ei2 · . . . · ein = ej1 · ej2 · . . . · ejn (mod p - 1). Then:

E_{ei1}(. . . E_{ein}(M) . . .)
  = (. . . (M^{ein} (mod p))^{ein-1} (mod p) . . .)^{ei1} (mod p)
  = M^{ein · ein-1 · ... · ei1} (mod p)
  = M^E (mod p)
  = M^{ejn · ejn-1 · ... · ej1} (mod p)
  = E_{ej1}(. . . E_{ejn}(M) . . .)
Equation 4 is also satisfied by the Pohlig-Hellman encryption scheme. Let M1, M2 ∈ GF(p) such that M1 ≠ M2. Any order of encryption by all parties is equal to evaluating the E-th power mod p of the plaintext. Let us assume that after encryption M1 and M2 are mapped to the same value. This implies that M1^E = M2^E (mod p). By exponentiating both sides with D = d1 · d2 · . . . · dn (mod p - 1), we get M1 = M2 (mod p), a contradiction. (Note that E · D = e1 · e2 · . . . · en · d1 · d2 · . . . · dn = (e1 · d1) · . . . · (en · dn) = 1 (mod p - 1).) Therefore, the probability that two different elements map to the same value is zero.
Direct implementation of Pohlig-Hellman is not secure. Consider the following example, encrypting two values a and b, where b = a^2:

E_e(b) = E_e(a^2) = (a^2)^e (mod p) = (a^e)^2 (mod p) = (E_e(a))^2 (mod p)

This shows that given two encrypted values, it is possible to determine if one is the square of the other (even though the base values are not revealed). This violates the security requirement of Section II-C.
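This relation, and the effect of hashing before encryption (as in the fix described next), can be demonstrated with toy parameters; SHA-256 reduced mod p stands in for the hash, and all values are illustrative:

```python
import hashlib

p, e = 1019, 3
a = 7
b = a * a % p                  # b = a^2

# Raw Pohlig-Hellman: the squaring relation survives encryption.
E_a, E_b = pow(a, e, p), pow(b, e, p)
print(E_b == E_a * E_a % p)    # -> True: (a^2)^e = (a^e)^2 (mod p)

# Hash-then-encrypt: the hash destroys the algebraic relation, so
# E(h(b)) is (with overwhelming probability) not E(h(a))^2.
def h(m):
    return int.from_bytes(hashlib.sha256(str(m).encode()).digest(), "big") % p

print(pow(h(b), e, p) == pow(h(a), e, p) ** 2 % p)   # almost surely False
```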
Huberman et al. provide a solution [20]. Rather than encrypting items directly, a hash of the items is encrypted. The hash occurs only at the originating site; the second and later encryptions of items can use Pohlig-Hellman directly. The hash breaks the relationship revealed by the encryption (e.g., b = a^2). After decryption, the hashed values must be mapped back to the original values. This can be done by hashing the candidate itemsets in CG(k) to build a lookup table; anything not in this table is a fake padding itemset and can be discarded.3
3. A hash collision, resulting in a padding itemset mapping to an itemset in the table, could result in an extra itemset appearing in the union. This would be filtered out by Protocol 2; the final results would be correct.
We present the approach from [20] as an example; any
secure encryption scheme that satisfies Equations 3 and 4
can be used in our protocols. The above approach is used to
generate the cost estimates in Section VI-B. Other approaches,
and further definitions and discussion of their security, can be
found in [21][24].
ACKNOWLEDGMENT
We wish to acknowledge the contributions of Mike Atallah
and Jaideep Vaidya. Discussions with them have helped to
tighten the proofs, giving clear bounds on the information
released.
REFERENCES
[1] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in Proceedings of the 20th International Conference on Very Large Data Bases. Santiago, Chile: VLDB, Sept. 12-15 1994, pp. 487–499. [Online]. Available: http://www.vldb.org/dblp/db/conf/vldb/vldb94-487.html
[2] D. W.-L. Cheung, J. Han, V. Ng, A. W.-C. Fu, and Y. Fu, "A fast distributed algorithm for mining association rules," in Proceedings of the 1996 International Conference on Parallel and Distributed Information Systems (PDIS'96). Miami Beach, Florida, USA: IEEE, Dec. 1996, pp. 31–42.
[3] D. W.-L. Cheung, V. Ng, A. W.-C. Fu, and Y. Fu, "Efficient mining of association rules in distributed databases," IEEE Trans. Knowledge Data Eng., vol. 8, no. 6, pp. 911–922, Dec. 1996.
[4] R. Agrawal and R. Srikant, "Privacy-preserving data mining," in Proceedings of the 2000 ACM SIGMOD Conference on Management of Data. Dallas, TX: ACM, May 14-19 2000, pp. 439–450. [Online]. Available: http://doi.acm.org/10.1145/342009.335438
[5] D. Agrawal and C. C. Aggarwal, "On the design and quantification of privacy preserving data mining algorithms," in Proceedings of the Twentieth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. Santa Barbara, California, USA: ACM, May 21-23 2001, pp. 247–255. [Online]. Available: http://doi.acm.org/10.1145/375551.375602
[6] A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke, "Privacy preserving mining of association rules," in The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada, July 23-26 2002, pp. 217–228. [Online]. Available: http://doi.acm.org/10.1145/775047.775080
[7] S. J. Rizvi and J. R. Haritsa, "Maintaining data privacy in association rule mining," in Proceedings of 28th International Conference on Very Large Data Bases. Hong Kong: VLDB, Aug. 20-23 2002, pp. 682–693. [Online]. Available: http://www.vldb.org/conf/2002/S19P03.pdf
[8] Y. Lindell and B. Pinkas, "Privacy preserving data mining," in Advances in Cryptology – CRYPTO 2000. Springer-Verlag, Aug. 20-24 2000, pp. 36–54. [Online]. Available: http://link.springer.de/link/service/series/0558/bibs/1880/18800036.htm
[9] O. Goldreich, "Secure multi-party computation," Sept. 1998, (working draft). [Online]. Available: http://www.wisdom.weizmann.ac.il/oded/pp.html
[10] J. Vaidya and C. Clifton, "Privacy preserving association rule mining in vertically partitioned data," in The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada, July 23-26 2002, pp. 639–644. [Online]. Available: http://doi.acm.org/10.1145/775047.775142
[11] A. C. Yao, "How to generate and exchange secrets," in Proceedings of the 27th IEEE Symposium on Foundations of Computer Science. IEEE, 1986, pp. 162–167.
[12] I. Ioannidis and A. Grama, "An efficient protocol for Yao's millionaires problem," in Hawaii International Conference on System Sciences (HICSS-36), Waikoloa Village, Hawaii, Jan. 6-9 2003.
[13] O. Goldreich, "Encryption schemes," Mar. 2003, (working draft). [Online]. Available: http://www.wisdom.weizmann.ac.il/oded/PSBookFrag/enc.ps
[14] R. L. Rivest, A. Shamir, and L. Adleman, "A method for obtaining digital signatures and public-key cryptosystems," Communications of the ACM, vol. 21, no. 2, pp. 120–126, 1978. [Online]. Available: http://doi.acm.org/10.1145/359340.359342
[15] S. C. Pohlig and M. E. Hellman, "An improved algorithm for computing logarithms over GF(p) and its cryptographic significance," IEEE Transactions on Information Theory, vol. IT-24, pp. 106–110, 1978.
[16] M. K. Reiter and A. D. Rubin, "Crowds: Anonymity for Web transactions," ACM Transactions on Information and System Security, vol. 1, no. 1, pp. 66–92, Nov. 1998. [Online]. Available: http://doi.acm.org/10.1145/290163.290168
[17] J. C. Benaloh, "Secret sharing homomorphisms: Keeping shares of a secret secret," in Advances in Cryptography – CRYPTO '86: Proceedings, A. Odlyzko, Ed., vol. 263. Springer-Verlag, Lecture Notes in Computer Science, 1986, pp. 251–260. [Online]. Available: http://springerlink.metapress.com/openurl.asp?genre=article&issn=0302-9743&volume=263&spage=251
[18] B. Chor and E. Kushilevitz, "A communication-privacy tradeoff for modular addition," Information Processing Letters, vol. 45, no. 4, pp. 205–210, 1993.
[19] C. Clifton, M. Kantarcioglu, and J. Vaidya, "Defining privacy for data mining," in National Science Foundation Workshop on Next Generation Data Mining, H. Kargupta, A. Joshi, and K. Sivakumar, Eds., Baltimore, MD, Nov. 1-3 2002, pp. 126–133.
[20] B. A. Huberman, M. Franklin, and T. Hogg, "Enhancing privacy and trust in electronic communities," in Proceedings of the First ACM Conference on Electronic Commerce (EC'99). Denver, Colorado, USA: ACM Press, Nov. 3-5 1999, pp. 78–86.
[21] J. C. Benaloh and M. de Mare, "One-way accumulators: A decentralized alternative to digital signatures," in Advances in Cryptology – EUROCRYPT '93, Workshop on the Theory and Application of Cryptographic Techniques, ser. Lecture Notes in Computer Science, vol. 765. Lofthus, Norway: Springer-Verlag, May 1993, pp. 274–285. [Online]. Available: http://springerlink.metapress.com/openurl.asp?genre=article&issn=0302-9743&volume=765&spage=274
[22] W. Diffie and M. Hellman, "New directions in cryptography," IEEE Trans. Inform. Theory, vol. IT-22, no. 6, pp. 644–654, Nov. 1976.
[23] T. ElGamal, "A public key cryptosystem and a signature scheme based on discrete logarithms," IEEE Trans. Inform. Theory, vol. IT-31, no. 4, pp. 469–472, July 1985.
[24] A. Shamir, R. L. Rivest, and L. M. Adleman, "Mental poker," Laboratory for Computer Science, MIT, Technical Memo MIT-LCS-TM-125, Feb. 1979.
Murat Kantarcioglu is a Ph.D. candidate at Purdue University. He has a Master's degree in Computer Science from Purdue University and a Bachelor's degree in Computer Engineering from Middle East Technical University, Ankara, Turkey. His research interests include data mining, database security, and information security. He is a student member of ACM.
Chris Clifton is an Associate Professor of Computer Science at Purdue University. He has a Ph.D. from Princeton University, and Bachelor's and Master's degrees from the Massachusetts Institute of Technology. Prior to joining Purdue he held positions at The MITRE Corporation and Northwestern University. His research interests include data mining, database support for text, and database security. He is a senior member of the IEEE and a member of the IEEE Computer Society and the ACM.
Protocol 1 Finding secure union of large itemsets of size k
Require: N ≥ 3 sites numbered 0..N−1, set F of non-itemsets.
Phase 0: Encryption of all the rules by all sites
for each site i do
    generate LLi(k) as in steps 1 and 2 of the FDM algorithm
    LLei(k) = ∅
    for each X ∈ LLi(k) do
        LLei(k) = LLei(k) ∪ {Ei(X)}
    end for
    for j = |LLei(k)| + 1 to |CG(k)| do
        LLei(k) = LLei(k) ∪ {Ei(random selection from F)}
    end for
end for
Phase 1: Encryption by all sites
for Round j = 0 to N − 1 do
    if Round j = 0 then
        Each site i sends permuted LLei(k) to site (i + 1) mod N
    else
        Each site i encrypts all items in LLe((i−j) mod N)(k) with Ei, permutes, and sends it to site (i + 1) mod N
    end if
end for
{At the end of Phase 1, site i has the itemsets of site (i + 1) mod N encrypted by every site}
Phase 2: Merge odd/even itemsets
Each site i sends LLe((i+1) mod N)(k) to site i mod 2
Site 0 sets RuleSet1 = ∪_{j=1..(N−1)/2} LLe(2j−1)(k)
Site 1 sets RuleSet0 = ∪_{j=0..(N−1)/2} LLe(2j)(k)
Phase 3: Merge all itemsets
Site 1 sends permuted RuleSet0 to site 0
Site 0 sets RuleSet = RuleSet0 ∪ RuleSet1
Phase 4: Decryption
for i = 0 to N − 2 do
    Site i decrypts items in RuleSet using Di
    Site i sends permuted RuleSet to site i + 1
end for
Site N − 1 decrypts items in RuleSet using D(N−1)
RuleSet(k) = RuleSet − F
Site N − 1 broadcasts RuleSet(k) to sites 0..N − 2
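A condensed simulation of Protocol 1's core mechanism (every site encrypts every set, union over ciphertexts, joint decryption) can be written in a few lines. It deliberately omits the padding with F, the permutations, and the odd/even merge; the prime, keys, and item encodings are toy choices, not values from the paper.

```python
p = 2**13 - 1                             # toy prime; p - 1 = 8190
keys = [11, 17, 23]                       # e_i for 3 sites, coprime to p - 1
invs = [pow(e, -1, p - 1) for e in keys]  # d_i = e_i^{-1} mod (p - 1)

# Locally large itemsets, already encoded as elements of GF(p)*.
local_sets = [{2, 3}, {3, 5}, {5, 7}]

def encrypt_by_all(s):
    # commutative encryption by every site, in any order
    for e in keys:
        s = {pow(x, e, p) for x in s}
    return s

# Duplicates collapse under the deterministic encryption, so the union
# of fully encrypted sets holds one ciphertext per distinct itemset.
union_enc = set().union(*map(encrypt_by_all, local_sets))

# Joint decryption by all sites reveals the plain union, without
# revealing which site contributed which itemset.
plain = union_enc
for d in invs:
    plain = {pow(x, d, p) for x in plain}
assert plain == {2, 3, 5, 7}
```

Note that the shared itemsets 3 and 5 appear once in union_enc; this is exactly the duplicate-collapse property the protocol relies on (and the reason the hash of Section "Huberman et al." is applied before the first encryption).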
Protocol 2 Finding the global support counts securely
Require: N ≥ 3 sites numbered 0..N−1, m ≥ 2·|DB|
rule set = ∅
at site 0:
for each r ∈ candidate set do
    choose random integer xr from a uniform distribution over 0..m−1;
    t = r.sup0 − s·|DB0| + xr (mod m);
    rule set = rule set ∪ {(r, t)};
end for
send rule set to site 1;
for i = 1 to N − 2 do {at site i}
    for each (r, t) ∈ rule set do
        t′ = r.supi − s·|DBi| + t (mod m);
        rule set = rule set − {(r, t)} ∪ {(r, t′)};
    end for
    send rule set to site i + 1;
end for
at site N − 1:
for each (r, t) ∈ rule set do
    t′ = r.sup(N−1) − s·|DB(N−1)| + t (mod m);
    securely compute whether (t′ − xr) (mod m) < m/2 with site 0; {site 0 knows xr}
    if (t′ − xr) (mod m) < m/2 then
        multi-cast r as a globally large itemset.
    end if
end for
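The arithmetic of Protocol 2 can be illustrated with made-up toy values for the threshold, database sizes, and support counts. Here the final comparison is done in the clear, whereas the protocol computes it securely between site N−1 and site 0.

```python
import random

s = 0.2                       # global minimum support (fraction)
db_sizes = [100, 200, 50]     # |DB_i| for three sites
sup = [25, 55, 4]             # local support counts of one candidate r
m = 2 * sum(db_sizes)         # m >= 2|DB| keeps the true sum in (-m/2, m/2)

x_r = random.randrange(m)     # site 0's random mask, known only to site 0

# Each site in turn adds its "excess support" sup_i - s*|DB_i| modulo m.
t = x_r
for size, count in zip(db_sizes, sup):
    t = (t + count - round(s * size)) % m

# r is globally large iff the unmasked total excess is non-negative,
# i.e. iff (t - x_r) mod m lands in [0, m/2).
globally_large = (t - x_r) % m < m / 2
assert globally_large  # excesses 5 + 15 - 6 = 14 >= 0
```

Because every intermediate t is uniformly masked by x_r, no site except site 0 learns anything about the partial sums; the choice m ≥ 2|DB| guarantees the sign test on (t − x_r) mod m is unambiguous.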