
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, TO APPEAR

    Privacy-preserving Distributed Mining of

Association Rules on Horizontally Partitioned Data

Murat Kantarcioglu and Chris Clifton, Senior Member, IEEE

Abstract: Data mining can extract important knowledge from large data collections, but sometimes these collections are split among various parties. Privacy concerns may prevent the parties from directly sharing the data, and some types of information about the data. This paper addresses secure mining of association rules over horizontally partitioned data. The methods incorporate cryptographic techniques to minimize the information shared, while adding little overhead to the mining task.

Index Terms: Data Mining, Security, Privacy

    I. INTRODUCTION

Data mining technology has emerged as a means of identifying patterns and trends from large quantities of data. Data mining and data warehousing go hand-in-hand: most tools operate by gathering all data into a central site, then running an algorithm against that data. However, privacy concerns can prevent building a centralized warehouse: data may be distributed among several custodians, none of which are allowed to transfer their data to another site.

This paper addresses the problem of computing association rules within such a scenario. We assume homogeneous databases: All sites have the same schema, but each site has information on different entities. The goal is to produce association rules that hold globally, while limiting the information shared about each site.

Computing association rules without disclosing individual transactions is straightforward. We can compute the global support and confidence of an association rule AB ⇒ C knowing only the local supports of AB and ABC, and the size of each database:

support_ABC = ( Σ_{i=1}^{sites} support_count_ABC(i) ) / ( Σ_{i=1}^{sites} database_size(i) )

support_AB = ( Σ_{i=1}^{sites} support_count_AB(i) ) / ( Σ_{i=1}^{sites} database_size(i) )

confidence_{AB⇒C} = support_ABC / support_AB
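These formulas are simple aggregate arithmetic. As a minimal sketch (the site count and all numbers are illustrative, not from the paper), the global support and confidence combine from per-site counts as follows:

```python
# Sketch: combining local support counts into global support and
# confidence, per the formulas above. All values are illustrative.

def global_support(local_counts, db_sizes):
    """Global support = sum of local counts / sum of database sizes."""
    return sum(local_counts) / sum(db_sizes)

count_abc = [5, 6, 20]       # support_count_ABC(i) at each site
count_ab  = [20, 40, 60]     # support_count_AB(i) at each site
sizes     = [100, 200, 300]  # database_size(i) at each site

support_abc = global_support(count_abc, sizes)   # 31/600
support_ab  = global_support(count_ab, sizes)    # 120/600 = 0.2
confidence_ab_c = support_abc / support_ab       # support_ABC / support_AB
```

Only the per-site aggregates enter the computation; no individual transaction is needed.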

Note that this doesn't require sharing any individual transactions. We can easily extend an algorithm such as a-priori [1] to the distributed case using the following lemma: If a rule has support > k% globally, it must have support > k% on at least one of the individual sites. A distributed algorithm for this would work as follows: Request that each site send all rules with support at least k. For each rule returned, request that all sites send the count of their transactions that support the rule, and the total count of all transactions at the site. From this, we can compute the global support of each rule, and (from the lemma) be certain that all rules with support at least k have been found. More thorough studies of distributed association rule mining can be found in [2], [3].

The authors are with the Department of Computer Sciences, Purdue University, 250 N. University St, W. Lafayette, IN 47907. E-mail: [email protected], [email protected].

Manuscript received 29 Jan. 2003; revised 11 Jun. 2003; accepted 1 Jul. 2003.

(c) 2003 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

The above approach protects individual data privacy, but

it does require that each site disclose what rules it supports, and how much it supports each potential global rule. What if this information is sensitive? For example, suppose the Centers for Disease Control (CDC), a public agency, would like to mine health records to try to find ways to reduce the proliferation of antibiotic resistant bacteria. Insurance companies have data on patient diseases and prescriptions. Mining this data would allow the discovery of rules such as Augmentin & Summer ⇒ Infection & Fall, i.e., people taking Augmentin in the summer seem to have recurring infections.

The problem is that insurance companies will be concerned about sharing this data. Not only must the privacy of patient records be maintained, but insurers will be unwilling to release rules pertaining only to them. Imagine a rule indicating a high rate of complications with a particular medical procedure. If this rule doesn't hold globally, the insurer would like to know this: they can then try to pinpoint the problem with their policies and improve patient care. If the fact that the insurer's data supports this rule is revealed (say, under a Freedom of Information Act request to the CDC), the insurer could be exposed to significant public relations or liability problems. This potential risk could exceed their own perception of the benefit of participating in the CDC study.

This paper presents a solution that preserves such secrets: the parties learn (almost) nothing beyond the global results. The solution is efficient: the additional cost relative to previous non-secure techniques is O(number of candidate itemsets × sites) encryptions, and a constant increase in the number of messages.

The method presented in this paper assumes three or more parties. In the two-party case, knowing a rule is supported globally and not supported at one's own site reveals that the other site supports the rule. Thus, much of the knowledge we try to protect is revealed even with a completely secure method for computing the global results. We discuss the two-party case further in Section V. By the same argument, we assume no collusion, as colluding parties can reduce this to the two-party case.

Fig. 1. Determining global candidate itemsets. (The figure shows three sites with local itemsets C and D. Each itemset is successively encrypted by every site, e.g. E2(E3(E1(C))); the encrypted sets are merged to eliminate duplicates, then decrypted by each site in turn, yielding the common itemsets C and D.)

A. Private Association Rule Mining Overview

Our method follows the basic approach outlined on Page 2, except that values are passed between the local data mining sites rather than to a centralized combiner. The two phases are discovering candidate itemsets (those that are frequent on one or more sites), and determining which of the candidate itemsets meet the global support/confidence thresholds.

The first phase (Figure 1) uses commutative encryption. Each party encrypts its own frequent itemsets (e.g., Site 1 encrypts itemset C). The encrypted itemsets are then passed to other parties, until all parties have encrypted all itemsets. These are passed to a common party to eliminate duplicates, and to begin decryption. (In the figure, the full set of itemsets is shown to the left of Site 1, after Site 1 decrypts.) This set is then passed to each party, and each party decrypts each itemset. The final result is the common itemsets (C and D in the figure).
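A minimal sketch of this phase, under simplifying assumptions: commutative encryption is modeled as Pohlig-Hellman-style modular exponentiation with a shared prime (the prime, keys, and integer itemset encodings are illustrative), and the round-robin passing of partially encrypted sets is collapsed into loops:

```python
# Sketch of phase 1 (candidate itemset discovery). Assumes a
# Pohlig-Hellman-style commutative cipher; all parameters illustrative.
import math
import random

p = 2**89 - 1                          # a Mersenne prime shared by all sites

def keygen(rng):
    """Pick an exponent e invertible mod p-1, and its inverse d."""
    while True:
        e = rng.randrange(3, p - 1)
        if math.gcd(e, p - 1) == 1:
            return e, pow(e, -1, p - 1)

def E(key, m):
    """Modular exponentiation: commutative across different keys."""
    return pow(m, key, p)

rng = random.Random(0)
keys = [keygen(rng) for _ in range(3)]   # one (e, d) pair per site

# Locally frequent itemsets per site, encoded as integers (illustrative).
local = [{10, 20}, {20, 30}, {10, 30}]

def encrypt_by_all(m):
    for e, _ in keys:
        m = E(e, m)
    return m

# Once every site has encrypted every itemset, identical itemsets collapse
# to identical ciphertexts, hiding which site contributed which itemset.
merged = {encrypt_by_all(m) for s in local for m in s}

def decrypt_by_all(c):
    for _, d in keys:
        c = E(d, c)
    return c

candidates = {decrypt_by_all(c) for c in merged}   # union of local itemsets
```

Because encryption is commutative, the order in which sites apply their keys does not matter, which is what lets duplicates be eliminated before any decryption happens.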

In the second phase (Figure 2), each of the locally supported itemsets is tested to see if it is supported globally. In the figure, the itemset ABC is known to be supported at one or more sites, and each computes its local support. The first site chooses a random value R, and adds to R the amount by which its support for ABC exceeds the minimum support threshold. This value is passed to site 2, which adds the amount by which its support exceeds the threshold (note that this may be negative, as shown in the figure). This is passed to site 3, which again adds its excess support. The resulting value (18) is tested using a secure comparison to see if it exceeds the random value (17). If so, itemset ABC is supported globally.

This gives a brief, oversimplified idea of how the method works. Section III gives full details. Before going into the details, we give background and definitions of relevant data mining and security techniques.

II. BACKGROUND AND RELATED WORK

There are several fields where related work is occurring. We first describe other work in privacy-preserving data mining, then go into detail on specific background work on which this paper builds.

Fig. 2. Determining if itemset support exceeds 5% threshold. (The figure shows Site 1 with itemset ABC in 5 of 100 transactions, Site 2 with 6 of 200, and Site 3 with 20 of 300. Starting from the random value R = 17 chosen by Site 1, each site adds count - 5% * DBSize, giving running values 17, 13, and 18; the final secure comparison 18 >= R succeeds, so ABC is globally supported.)

Previous work in privacy-preserving data mining has addressed two issues. In one, the aim is preserving customer privacy by distorting the data values [4]. The idea is that the distorted data does not reveal private information, and thus is safe to use for mining. The key result is that the distorted data, and information on the distribution of the random data used to distort the data, can be used to generate an approximation to the original data distribution, without revealing the original data values. The distribution is used to improve mining results over mining the distorted data directly, primarily through selection of split points to bin continuous data. Later refinement of this approach tightened the bounds on what private information is disclosed, by showing that the ability to reconstruct the distribution can be used to tighten estimates of original values based on the distorted data [5].

More recently, the data distortion approach has been applied to boolean association rules [6], [7]. Again, the idea is to modify data values such that reconstruction of the values for any individual transaction is difficult, but the rules learned on the distorted data are still valid. One interesting feature of this work is a flexible definition of privacy; e.g., the ability to correctly guess a value of '1' from the distorted data can be considered a greater threat to privacy than correctly learning a '0'.

The data distortion approach addresses a different problem from our work. The assumption with distortion is that the values must be kept private from whoever is doing the mining. We instead assume that some parties are allowed to see some of the data, just that no one is allowed to see all the data. In return, we are able to get exact, rather than approximate, results.

The other approach uses cryptographic tools to build decision trees [8]. In this work, the goal is to securely build an ID3 decision tree where the training set is distributed between two parties. The basic idea is that finding the attribute that maximizes information gain is equivalent to finding the attribute that minimizes the conditional entropy. The conditional entropy for an attribute for two parties can be written as a sum of expressions of the form (v1 + v2) * log(v1 + v2). The authors give a way to securely calculate the expression (v1 + v2) * log(v1 + v2) and show how to use this function for building ID3 securely. This approach treats privacy-preserving data mining as a special case of secure multi-party computation [9] and not only aims at preserving individual privacy but also tries to prevent leakage of any information other than the final result. We follow this approach, but address a different problem (association rules), and emphasize the efficiency of the resulting algorithms. A particular difference is that we recognize that some kinds of information can be exchanged without violating security policies; secure multi-party computation forbids leakage of any information other than the final result. The ability to share non-sensitive data enables highly efficient solutions.

The problem of privately computing association rules in vertically partitioned distributed data has also been addressed [10]. The vertically partitioned problem occurs when each transaction is split across multiple sites, with each site having a different set of attributes for the entire set of transactions. With horizontal partitioning, each site has a set of complete transactions. In relational terms, with horizontal partitioning the relation to be mined is the union of the relations at the sites. In vertical partitioning, the relations at the individual sites must be joined to get the relation to be mined. The change in the way the data is distributed makes this a much different problem from the one we address here, resulting in a very different solution.

A. Mining of Association Rules

The association rule mining problem can be defined as follows [1]: Let I = {i1, i2, . . . , in} be a set of items. Let DB be a set of transactions, where each transaction T is an itemset such that T ⊆ I. Given an itemset X ⊆ I, a transaction T contains X if and only if X ⊆ T. An association rule is an implication of the form X ⇒ Y where X ⊆ I, Y ⊆ I and X ∩ Y = ∅. The rule X ⇒ Y has support s in the transaction database DB if s% of the transactions in DB contain X ∪ Y. The association rule holds in the transaction database DB with confidence c if c% of the transactions in DB that contain X also contain Y. An itemset X with k items is called a k-itemset. The problem of mining association rules is to find all rules whose support and confidence are higher than certain user-specified minimum support and confidence.

In this simplified definition of association rules, missing items, negatives and quantities are not considered. In this respect, the transaction database DB can be seen as a 0/1 matrix where each column is an item and each row is a transaction. In this paper, we use this view of association rules.
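As a small illustration of the 0/1-matrix view (the items, transactions, and rule below are invented for the example):

```python
# Sketch: support and confidence computed directly from the definition,
# over a tiny 0/1 matrix. Rows are transactions; columns are items A, B, C.
rows = [
    [1, 1, 1],
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 1],
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(all(r[i] for i in itemset) for r in rows) / len(rows)

def confidence(x, y):
    """Confidence of the rule X => Y: support(X u Y) / support(X)."""
    return support(x | y) / support(x)

support({0, 1})       # A and B appear together in 2 of 4 rows -> 0.5
confidence({0}, {1})  # A => B holds in 2 of the 3 A-transactions -> 2/3
```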

1) Distributed Mining of Association Rules: The above problem of mining association rules can be extended to distributed environments. Let us assume that a transaction database DB is horizontally partitioned among n sites (namely S1, S2, . . . , Sn) where DB = DB1 ∪ DB2 ∪ . . . ∪ DBn and DBi resides at site Si (1 ≤ i ≤ n). The itemset X has local support count X.sup_i at site Si if X.sup_i of the transactions at Si contain X. The global support count of X is given as X.sup = Σ_{i=1}^{n} X.sup_i. An itemset X is globally supported if X.sup ≥ s × (Σ_{i=1}^{n} |DBi|). The global confidence of a rule X ⇒ Y can be given as {X ∪ Y}.sup / X.sup.

The set of large itemsets L(k) consists of all k-itemsets that are globally supported. The set of locally large itemsets LLi(k) consists of all k-itemsets supported locally at site Si. GLi(k) = L(k) ∩ LLi(k) is the set of globally large k-itemsets locally supported at site Si. The aim of distributed association rule mining is to find the sets L(k) for all k > 1 and the support counts for these itemsets, and from this compute association rules with the specified minimum support and confidence.

A fast algorithm for distributed association rule mining is given in Cheung et al. [2]. Their procedure for fast distributed mining of association rules (FDM) is summarized below.

1) Candidate Sets Generation: Generate candidate sets CGi(k) based on GLi(k-1), itemsets supported by Si at the (k-1)-th iteration, using the classic a-priori candidate generation algorithm. Each site generates candidates based on the intersection of globally large (k-1)-itemsets and locally large (k-1)-itemsets.

2) Local Pruning: For each X ∈ CGi(k), scan the database DBi at Si to compute X.sup_i. If X is locally large at Si, it is included in the LLi(k) set. It is clear that if X is supported globally, it will be supported at one site.

3) Support Count Exchange: The LLi(k) are broadcast, and each site computes the local support for the items in ∪i LLi(k).

4) Broadcast Mining Results: Each site broadcasts the local support for itemsets in ∪i LLi(k). From this, each site is able to compute L(k).

The details of the above algorithm can be found in [2].
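A minimal, non-private sketch of steps 2-4 for a single round k (the two tiny databases and candidate sets are invented; step 1's candidate generation is taken as given):

```python
# Sketch of one FDM round without the privacy protections of this
# paper's protocols. Itemsets are frozensets of items; data illustrative.
s = 0.5                                       # minimum support threshold
dbs = [
    [frozenset("ab"), frozenset("ac")],       # transactions at site 1
    [frozenset("ab"), frozenset("bc")],       # transactions at site 2
]
candidates = [frozenset("ab"), frozenset("ac"), frozenset("bc")]

def local_count(db, x):
    """Local support count: transactions of db containing itemset x."""
    return sum(x <= t for t in db)

# 2) Local pruning: keep candidates that are locally large at each site.
LL = [{x for x in candidates if local_count(db, x) >= s * len(db)}
      for db in dbs]

# 3) Support count exchange over the union of locally large itemsets.
union = set().union(*LL)
global_count = {x: sum(local_count(db, x) for db in dbs) for x in union}

# 4) Globally large itemsets.
total = sum(len(db) for db in dbs)
L = {x for x, c in global_count.items() if c >= s * total}
```

It is exactly the broadcasts in steps 3 and 4 that the protocols of this paper replace with privacy-preserving equivalents.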

B. Secure Multi-party Computation

Substantial work has been done on secure multi-party computation. The key result is that a wide class of computations can be computed securely under reasonable assumptions. We give a brief overview of this work, concentrating on material that is used later in the paper. The definitions given here are from Goldreich [9]. For simplicity, we concentrate on the two-party case. Extending the definitions to the multi-party case is straightforward.

1) Security in the semi-honest model: A semi-honest party follows the rules of the protocol using its correct input, but is free to later use what it sees during execution of the protocol to compromise security. This is somewhat realistic in the real world because parties who want to mine data for their mutual benefit will follow the protocol to get correct results. Also, a protocol that is buried in large, complex software cannot be easily altered.

A formal definition of private two-party computation in the semi-honest model is given below. Computing a function privately is equivalent to computing it securely. The formal proof of this can be found in Goldreich [9].

Definition 2.1 (privacy w.r.t. semi-honest behavior) [9]: Let f : {0,1}* × {0,1}* → {0,1}* × {0,1}* be a probabilistic, polynomial-time functionality, where f1(x, y) (resp., f2(x, y)) denotes the first (resp., second) element of f(x, y), and let Π be a two-party protocol for computing f.

Let the view of the first (resp., second) party during an execution of Π on (x, y), denoted view_1^Π(x, y) (resp., view_2^Π(x, y)), be (x, r1, m1, . . . , mt) (resp., (y, r2, m1, . . . , mt)), where r1 (resp., r2) represents the outcome of the first (resp., second) party's internal coin tosses, and mi represents the i-th message it has received.

The output of the first (resp., second) party during an execution of Π on (x, y) is denoted output_1^Π(x, y) (resp., output_2^Π(x, y)) and is implicit in the party's view of the execution.

Π privately computes f if there exist probabilistic polynomial-time algorithms, denoted S1 and S2, such that

{(S1(x, f1(x, y)), f2(x, y))}_{x,y ∈ {0,1}*} ≡_C {(view_1^Π(x, y), output_2^Π(x, y))}_{x,y ∈ {0,1}*}   (1)

{(f1(x, y), S2(y, f2(x, y)))}_{x,y ∈ {0,1}*} ≡_C {(output_1^Π(x, y), view_2^Π(x, y))}_{x,y ∈ {0,1}*}   (2)

where ≡_C denotes computational indistinguishability.

The above definition says that a computation is secure if the view of each party during the execution of the protocol can be effectively simulated by the input and the output of the party. This is not quite the same as saying that private information is protected. For example, if two parties use a secure protocol to mine distributed association rules, the protocol still reveals that if a particular rule is not supported by a particular site, and that rule appears in the globally supported rule set, then it must be supported by the other site. A site can deduce this information solely by looking at its locally supported rules and the globally supported rules. On the other hand, there is no way to deduce the exact support count of some itemset by looking at the globally supported rules. With three or more parties, knowing a rule holds globally reveals that at least one site supports it, but no site knows which site (other than, obviously, itself). In summary, a secure multi-party protocol will not reveal more information to a particular party than the information that can be induced by looking at that party's input and the output.

2) Yao's general two-party secure function evaluation: Yao's general secure two-party evaluation is based on expressing the function f(x, y) as a circuit and encrypting the gates for secure evaluation [11]. With this protocol, any two-party function can be evaluated securely in the semi-honest model. To be efficiently evaluated, however, the functions must have a small circuit representation. We will not give details of this generic method; however, we do use this generic result for securely finding whether a ≥ b (Yao's millionaire problem). For comparing any two integers securely, Yao's generic method is one of the most efficient methods known, although other asymptotically equivalent but practically more efficient algorithms could be used as well [12].

C. Commutative Encryption

Commutative encryption is an important tool that can be used in many privacy-preserving protocols. An encryption algorithm is commutative if the following two equations hold for any given feasible encryption keys K1, . . . , Kn ∈ K, any M in the domain of items M, and any permutations i, j:

E_{Ki1}(. . . E_{Kin}(M) . . .) = E_{Kj1}(. . . E_{Kjn}(M) . . .)   (3)

∀ M1, M2 ∈ M such that M1 ≠ M2, and for given k and ε < 1/2^k:

Pr(E_{Ki1}(. . . E_{Kin}(M1) . . .) = E_{Kj1}(. . . E_{Kjn}(M2) . . .)) < ε   (4)

For given k > 1, |CG(k)| is likely to be of reasonable size. However, |CG(1)| could be very large, as it is dependent only on the size of the

domain of items, and is not limited by already discovered frequent itemsets. If the participants can agree on an upper bound on the number of frequent items supported at any one site that is tighter than "every item may be frequent", without inspecting the data, we can achieve a corresponding decrease in the costs with no loss of security. This is likely to be feasible in practice; the very success of the a-priori algorithm is based on the assumption that relatively few items are frequent. Alternatively, if we are willing to leak an upper bound on the number of itemsets supported at each site, each site can set its own upper bound and pad only to that bound. This can be done for every round, not just k = 1. As a practical matter, such an approach would achieve acceptable security and would change the |CG(k)| factor in the communication and encryption costs of Protocol 1 to O(|∪i LLi(k)|), equivalent to FDM.

Another way to limit the encryption cost of padding is to pad randomly from the domain of the encryption output rather than encrypting items from F. Assuming |domain of Ei| >> |domain of itemsets|, the probability of padding with a value that decrypts to a real itemset is small, and even if this occurs it will only result in an additional itemset being tested for support in Protocol 2. When the support count is tested, such false hits will be filtered out, and the final result will be correct.

The comparison phase at the end of Protocol 2 can also be removed, eliminating the O(m × |∪i LLi(k)| × t) bits of communication and O(|∪i LLi(k)| × m × t^3) encryption cost. This reveals the excess support for each itemset. Practical applications may demand this count as part of the result for globally supported itemsets, so the only information leaked is the support counts for itemsets in ∪i LLi(k) − L(k). As these cannot be traced to an individual site, this will generally be acceptable in practice.

The cost estimates are based on the assumption that all frequent itemsets (even 1-itemsets) are part of the result. If exposing the globally frequent 1-itemsets is a problem, the algorithm could easily begin with 2-itemsets (or larger). While the worst-case cost would be unchanged, there would be an impact in practical terms. Eliminating the pruning of globally infrequent 1-itemsets would increase the size of CGi(2) and thus LLi(2); however, local pruning of infrequent 1-itemsets should make the sizes manageable. More critical is the impact on |CG(2)|, and thus the cost of padding to hide the number of locally large itemsets. In practice, the size of CG(2) will rarely be the theoretical limit of (item domain size choose 2), but this worst-case bound would need to be used if the algorithm began with finding 2-itemsets (the problem is worse for k > 2). A practical solution would again be to have sites agree on a reasonable upper bound for the number of locally supported k-itemsets for the initial k, revealing some information to substantially decrease the amount of padding needed.

B. Practical Cost of Encryption

While achieving privacy comes at a reasonable increase in communication cost, what about the cost of encryption? As a test of this, we implemented Pohlig-Hellman, described in the appendix. The encryption time per itemset, represented with t = 512 bits, was 0.00428 seconds on a 700MHz Pentium 3 under Linux. Using this, and the results reported in [3], we can estimate the cost of privacy-preserving association rule mining on the tests in [3].

The first set of experiments described in [3] contains sufficient detail for us to estimate the cost of encryption. These experiments used three sites, an item domain size of 1000, and a total database size of 500k transactions.

The encryption cost for the initial round (k = 1) would be 4.28 seconds at each site, as the padding need only be to the domain size of 1000. While finding two-itemsets could potentially be much worse ((1000 choose 2) = 499500), in practice |CG(2)| is much smaller. The experiment in [3] reports a total number of candidate sets (Σ_{k>1} |CG(k)|) of just over 100,000 at 1% support. This gives a total encryption cost of around 430 seconds per site, with all sites encrypting simultaneously. This assumes none of the optimizations of Section VI-A; if the encryption cost at each site could be cut to |LLi(k)| by eliminating the cost of encrypting the padding items, the encryption cost would be cut to 5% to 35% of the above on the datasets used in [3].
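The estimates above are straightforward arithmetic on the reported numbers:

```python
# Back-of-the-envelope check of the cost estimates above, using the
# measured 0.00428 s per itemset encryption at t = 512 bits.
per_itemset = 0.00428            # seconds per itemset encryption
initial = 1000 * per_itemset     # k = 1 round: pad to item domain size 1000
later = 100_000 * per_itemset    # sum over k > 1 of |CG(k)| at 1% support
# initial is 4.28 s; later is about 428 s per site
```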

There is also the encryption cost of the secure comparison at the end of Protocol 2. Although the data reported in [3] does not give us the exact size of ∪i LLi(k), it appears to be on the order of 2000. Based on this, the cost of the secure comparison, O(|∪i LLi(k)| × m × t^3), would be about 170 seconds.

The total execution time for the experiment reported in [3] was approximately 800 seconds. Similar numbers hold at different support levels; the added cost of encryption would at worst increase the total run time by roughly 75%.

VII. CONCLUSIONS AND FURTHER WORK

Cryptographic tools can enable data mining that would otherwise be prevented due to security concerns. We have given procedures to mine distributed association rules on horizontally partitioned data. We have shown that distributed association rule mining can be done efficiently under reasonable security assumptions.

We believe the need for mining of data where access is restricted by privacy concerns will increase. Examples include knowledge discovery among intelligence services of different countries and collaboration among corporations without revealing trade secrets. Even within a single multi-national company, privacy laws in different jurisdictions may prevent sharing individual data. Many more examples can be imagined.

We would like to see secure algorithms for classification, clustering, etc. Another possibility is secure approximate data mining algorithms. Allowing error in the results may enable more efficient algorithms that maintain the desired level of security.

The secure multi-party computation definitions from the cryptography domain may be too restrictive for our purposes. A specific example of the need for more flexible definitions can be seen in Protocol 1. The padding set F is defined to be infinite, so that the probability of collision among these items is 0. This is impractical, and intuitively allowing collisions among the padded itemsets would seem more secure, as the information leaked (itemsets supported in common by subsets of the sites) would become an upper bound rather than an exact value. However, unless we know in advance the probability of collision among real itemsets, or more specifically can set the size of F so the ratio of the collision probabilities in F and real itemsets is constant, the protocol is less secure under secure multi-party computation definitions. The problem is that knowing the probability of collision among items chosen from F enables us to predict (although generally with low accuracy) which fully encrypted itemsets are real and which are fake. This allows a probabilistic upper bound estimate on the number of itemsets supported at each site. It also allows a probabilistic estimate of the number of itemsets supported in common by subsets of the sites that is tighter than the number of collisions found in the RuleSet. Definitions that allow us to trade off such estimates, and techniques to prove protocols relative to those definitions, will allow us to prove the privacy of protocols that are practically superior to protocols meeting strict secure multi-party computation definitions.

More suitable security definitions that allow parties to choose their desired level of security are needed, allowing efficient solutions that maintain the desired security. Some suggested directions for research in this area are given in [19]. One line of research is to predict the value of information for a particular organization, allowing tradeoffs between disclosure cost, computation cost, and benefit from the result. We believe some ideas from game theory and economics may be relevant.

In summary, it is possible to mine globally valid results from distributed data without revealing information that compromises the privacy of the individual sources. Such privacy-preserving data mining can be done with a reasonable increase in cost over methods that do not maintain privacy. Continued research will expand the scope of privacy-preserving data mining, enabling most or all data mining methods to be applied in situations where privacy concerns would appear to restrict such mining.

APPENDIX

CRYPTOGRAPHIC NOTES ON COMMUTATIVE ENCRYPTION

The Pohlig-Hellman encryption scheme [15] can be used as a commutative encryption scheme meeting the requirements of Section II-C. Pohlig-Hellman works as follows. Given a large prime p with no small factors of p − 1, each party chooses a random (e, d) pair such that e · d ≡ 1 (mod p − 1). The encryption of a given message M is M^e (mod p). Decryption of a given ciphertext C is done by evaluating C^d (mod p). C^d = M^{e·d} (mod p), and due to Fermat's little theorem, M^{e·d} = M^{1+k(p−1)} ≡ M (mod p).
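A minimal sketch of the scheme (the prime and exponents are illustrative stand-ins, not parameters recommended by the paper):

```python
# Sketch of Pohlig-Hellman encryption/decryption and its commutativity.
# p and the exponents below are illustrative only.
import math

p = 2**89 - 1                      # a Mersenne prime, shared by all parties

def make_key(e):
    """Return (e, d) with e*d = 1 (mod p-1); e must be invertible."""
    assert math.gcd(e, p - 1) == 1
    return e, pow(e, -1, p - 1)

e1, d1 = make_key(65537)
e2, d2 = make_key(101)

M = 123456789
C = pow(M, e1, p)                  # encryption: M^e mod p
assert pow(C, d1, p) == M          # decryption: C^d = M^(e*d) = M (mod p)

# Commutativity (Equation 3): the two encryption orders agree, since
# (M^e1)^e2 = M^(e1*e2) = (M^e2)^e1 (mod p).
assert pow(pow(M, e1, p), e2, p) == pow(pow(M, e2, p), e1, p)
```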

    It is easy to see that Pohlig-Hellman with a shared p satisfies Equation 3. Assume there are n different encryption and decryption pairs ((e1, d1), ..., (en, dn)). For any two permutations i, j of 1..n, let E = e1·e2·...·en ≡ e_i1·e_i2·...·e_in ≡ e_j1·e_j2·...·e_jn (mod p − 1). Then:

    E_e_i1(... E_e_in(M) ...)
      = (...((M^(e_in) mod p)^(e_in−1) mod p)...)^(e_i1) mod p
      = M^(e_in·e_in−1·...·e_i1) mod p
      = M^E mod p
      = M^(e_jn·e_jn−1·...·e_j1) mod p
      = E_e_j1(... E_e_jn(M) ...)
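Concretely, the commutativity can be checked with two toy key pairs sharing the same p (all values illustrative):

```python
# Two parties share the prime p but hold independent exponents e1, e2.
p = 1019                       # toy prime; p - 1 = 2 * 509
e1, e2 = 5, 7                  # both coprime to p - 1
m = 123

def enc(m, e):
    return pow(m, e, p)

# Encrypting in either order yields the same ciphertext, because
# (M^e1)^e2 = M^(e1*e2) = (M^e2)^e1 (mod p).
assert enc(enc(m, e1), e2) == enc(enc(m, e2), e1) == pow(m, e1 * e2, p)
```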

    Equation 4 is also satisfied by the Pohlig-Hellman encryption scheme. Let M1, M2 ∈ GF(p) with M1 ≠ M2. Any order of encryption by all parties amounts to raising the plaintext to the Eth power mod p. Suppose that after encryption M1 and M2 were mapped to the same value. This would imply M1^E = M2^E (mod p). Exponentiating both sides with D = d1·d2·...·dn (mod p − 1), we get M1 = M2 (mod p), a contradiction. (Note that E·D = e1·e2·...·en·d1·d2·...·dn = (e1·d1)·...·(en·dn) ≡ 1 (mod p − 1).) Therefore, the probability that two different elements map to the same value is zero.
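The zero-collision property can even be checked exhaustively on a toy prime: exponentiation by any e with gcd(e, p − 1) = 1 permutes the nonzero residues mod p (parameters illustrative):

```python
# Exhaustive check that E(M) = M^e mod p is injective on 1..p-1.
p = 1019
e = 5                                  # gcd(5, p - 1) = 1
images = {pow(m, e, p) for m in range(1, p)}
assert len(images) == p - 1            # no two plaintexts collide
```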

    Direct implementation of Pohlig-Hellman is not secure. Consider the following example, encrypting two values a and b, where b = a^2:

    E_e(b) = E_e(a^2) = (a^2)^e mod p = (a^e)^2 mod p = (E_e(a))^2 mod p

    This shows that, given two encrypted values, it is possible to determine whether one is the square of the other (even though the base values are not revealed). This violates the security requirement of Section II-C.
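A small sketch of this leak, with toy parameters: an observer who sees only the ciphertexts can still test the squaring relationship.

```python
p = 1019                      # toy shared prime
e = 5                         # a party's secret exponent
a = 57
b = (a * a) % p               # b = a^2 (mod p)

Ea = pow(a, e, p)
Eb = pow(b, e, p)

# The multiplicative relationship survives encryption:
# E_e(a^2) = (a^2)^e = (a^e)^2 = E_e(a)^2 (mod p),
# so the observer learns that b = a^2 without ever seeing a or b.
assert Eb == (Ea * Ea) % p
```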

    Huberman et al. provide a solution [20]. Rather than encrypting items directly, a hash of each item is encrypted. The hash occurs only at the originating site; the second and later encryptions of an item can use Pohlig-Hellman directly. The hash breaks the algebraic relationship revealed by the encryption (e.g., b = a^2). After decryption, the hashed values must be mapped back to the original values. This can be done by hashing the candidate itemsets in CG(k) to build a lookup table; anything not in this table is a fake padding itemset and can be discarded.3

    3A hash collision, resulting in a padding itemset mapping to an itemset in the table, could result in an extra itemset appearing in the union. This would be filtered out by Protocol 2; the final results would be correct.
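A minimal sketch of this hash-then-encrypt idea. The hash construction, prime, and item encoding below are illustrative stand-ins, and the multi-site decryption step is elided (the decrypted hashes are simulated directly):

```python
import hashlib

p = 2**127 - 1                # shared prime (toy choice; Mersenne prime)
e = 5                         # originating site's exponent

def h(itemset):
    """Hash an itemset into Z_p before the first encryption
    (illustrative construction)."""
    digest = hashlib.sha256(",".join(sorted(itemset)).encode()).digest()
    return int.from_bytes(digest, "big") % p

# The originating site encrypts hashes, not raw item codes, so algebraic
# relations between items (e.g., b = a^2) are destroyed.
local_large = [{"A", "B"}, {"B", "C"}]
encrypted = [pow(h(x), e, p) for x in local_large]

# After full decryption, hashed values are mapped back via a lookup
# table built from the candidate sets CG(k); unmatched values are
# padding and are discarded.
CG_k = [{"A", "B"}, {"B", "C"}, {"A", "C"}]
lookup = {h(x): x for x in CG_k}
decrypted_hashes = [h(x) for x in local_large]   # stand-in for real decryption
recovered = [lookup[v] for v in decrypted_hashes if v in lookup]
assert recovered == local_large
```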

    We present the approach from [20] as an example; any secure encryption scheme that satisfies Equations 3 and 4 can be used in our protocols. The above approach is used to generate the cost estimates in Section VI-B. Other approaches, and further definitions and discussion of their security, can be found in [21]-[24].

    ACKNOWLEDGMENT

    We wish to acknowledge the contributions of Mike Atallah

    and Jaideep Vaidya. Discussions with them have helped to

    tighten the proofs, giving clear bounds on the information

    released.

    REFERENCES

    [1] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in Proceedings of the 20th International Conference on Very Large Data Bases. Santiago, Chile: VLDB, Sept. 12-15 1994, pp. 487-499. [Online]. Available: http://www.vldb.org/dblp/db/conf/vldb/vldb94-487.html

    [2] D. W.-L. Cheung, J. Han, V. Ng, A. W.-C. Fu, and Y. Fu, "A fast distributed algorithm for mining association rules," in Proceedings of the 1996 International Conference on Parallel and Distributed Information Systems (PDIS'96). Miami Beach, Florida, USA: IEEE, Dec. 1996, pp. 31-42.

    [3] D. W.-L. Cheung, V. Ng, A. W.-C. Fu, and Y. Fu, "Efficient mining of association rules in distributed databases," IEEE Trans. Knowledge Data Eng., vol. 8, no. 6, pp. 911-922, Dec. 1996.

    [4] R. Agrawal and R. Srikant, "Privacy-preserving data mining," in Proceedings of the 2000 ACM SIGMOD Conference on Management of Data. Dallas, TX: ACM, May 14-19 2000, pp. 439-450. [Online]. Available: http://doi.acm.org/10.1145/342009.335438

    [5] D. Agrawal and C. C. Aggarwal, "On the design and quantification of privacy preserving data mining algorithms," in Proceedings of the Twentieth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. Santa Barbara, California, USA: ACM, May 21-23 2001, pp. 247-255. [Online]. Available: http://doi.acm.org/10.1145/375551.375602

    [6] A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke, "Privacy preserving mining of association rules," in The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada, July 23-26 2002, pp. 217-228. [Online]. Available: http://doi.acm.org/10.1145/775047.775080

    [7] S. J. Rizvi and J. R. Haritsa, "Maintaining data privacy in association rule mining," in Proceedings of 28th International Conference on Very Large Data Bases. Hong Kong: VLDB, Aug. 20-23 2002, pp. 682-693. [Online]. Available: http://www.vldb.org/conf/2002/S19P03.pdf

    [8] Y. Lindell and B. Pinkas, "Privacy preserving data mining," in Advances in Cryptology - CRYPTO 2000. Springer-Verlag, Aug. 20-24 2000, pp. 36-54. [Online]. Available: http://link.springer.de/link/service/series/0558/bibs/1880/18800036.htm

    [9] O. Goldreich, "Secure multi-party computation," Sept. 1998, (working draft). [Online]. Available: http://www.wisdom.weizmann.ac.il/~oded/pp.html

    [10] J. Vaidya and C. Clifton, "Privacy preserving association rule mining in vertically partitioned data," in The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada, July 23-26 2002, pp. 639-644. [Online]. Available: http://doi.acm.org/10.1145/775047.775142

    [11] A. C. Yao, "How to generate and exchange secrets," in Proceedings of the 27th IEEE Symposium on Foundations of Computer Science. IEEE, 1986, pp. 162-167.

    [12] I. Ioannidis and A. Grama, "An efficient protocol for Yao's millionaires' problem," in Hawaii International Conference on System Sciences (HICSS-36), Waikoloa Village, Hawaii, Jan. 6-9 2003.

    [13] O. Goldreich, "Encryption schemes," Mar. 2003, (working draft). [Online]. Available: http://www.wisdom.weizmann.ac.il/~oded/PSBookFrag/enc.ps

    [14] R. L. Rivest, A. Shamir, and L. Adleman, "A method for obtaining digital signatures and public-key cryptosystems," Communications of the ACM, vol. 21, no. 2, pp. 120-126, 1978. [Online]. Available: http://doi.acm.org/10.1145/359340.359342


    [15] S. C. Pohlig and M. E. Hellman, "An improved algorithm for computing logarithms over GF(p) and its cryptographic significance," IEEE Transactions on Information Theory, vol. IT-24, pp. 106-110, 1978.

    [16] M. K. Reiter and A. D. Rubin, "Crowds: Anonymity for Web transactions," ACM Transactions on Information and System Security, vol. 1, no. 1, pp. 66-92, Nov. 1998. [Online]. Available: http://doi.acm.org/10.1145/290163.290168

    [17] J. C. Benaloh, "Secret sharing homomorphisms: Keeping shares of a secret secret," in Advances in Cryptography - CRYPTO '86: Proceedings, A. Odlyzko, Ed., vol. 263. Springer-Verlag, Lecture Notes in Computer Science, 1986, pp. 251-260. [Online]. Available: http://springerlink.metapress.com/openurl.asp?genre=article&issn=0302-9743&volume=263&spage=251

    [18] B. Chor and E. Kushilevitz, "A communication-privacy tradeoff for modular addition," Information Processing Letters, vol. 45, no. 4, pp. 205-210, 1993.

    [19] C. Clifton, M. Kantarcioglu, and J. Vaidya, "Defining privacy for data mining," in National Science Foundation Workshop on Next Generation Data Mining, H. Kargupta, A. Joshi, and K. Sivakumar, Eds., Baltimore, MD, Nov. 1-3 2002, pp. 126-133.

    [20] B. A. Huberman, M. Franklin, and T. Hogg, "Enhancing privacy and trust in electronic communities," in Proceedings of the First ACM Conference on Electronic Commerce (EC'99). Denver, Colorado, USA: ACM Press, Nov. 3-5 1999, pp. 78-86.

    [21] J. C. Benaloh and M. de Mare, "One-way accumulators: A decentralized alternative to digital signatures," in Advances in Cryptology - EUROCRYPT '93, Workshop on the Theory and Application of Cryptographic Techniques, ser. Lecture Notes in Computer Science, vol. 765. Lofthus, Norway: Springer-Verlag, May 1993, pp. 274-285. [Online]. Available: http://springerlink.metapress.com/openurl.asp?genre=article&issn=0302-9743&volume=765&spage=274

    [22] W. Diffie and M. Hellman, "New directions in cryptography," IEEE Trans. Inform. Theory, vol. IT-22, no. 6, pp. 644-654, Nov. 1976.

    [23] T. ElGamal, "A public key cryptosystem and a signature scheme based on discrete logarithms," IEEE Trans. Inform. Theory, vol. IT-31, no. 4, pp. 469-472, July 1985.

    [24] A. Shamir, R. L. Rivest, and L. M. Adleman, "Mental poker," Laboratory for Computer Science, MIT, Technical Memo MIT-LCS-TM-125, Feb. 1979.

    Murat Kantarcioglu is a Ph.D. candidate at Purdue University. He has a Master's degree in Computer Science from Purdue University and a Bachelor's degree in Computer Engineering from Middle East Technical University, Ankara, Turkey. His research interests include data mining, database security, and information security. He is a student member of the ACM.

    Chris Clifton is an Associate Professor of Computer Science at Purdue University. He has a Ph.D. from Princeton University, and Bachelor's and Master's degrees from the Massachusetts Institute of Technology. Prior to joining Purdue he held positions at The MITRE Corporation and Northwestern University. His research interests include data mining, database support for text, and database security. He is a senior member of the IEEE and a member of the IEEE Computer Society and the ACM.

    Protocol 1 Finding secure union of large itemsets of size k

    Require: N ≥ 3 sites numbered 0..N − 1, set F of non-itemsets.

    Phase 0: Encryption of all the rules by all sites
    for each site i do
      generate LL_i(k) as in steps 1 and 2 of the FDM algorithm
      LLe_i(k) = ∅
      for each X ∈ LL_i(k) do
        LLe_i(k) = LLe_i(k) ∪ {E_i(X)}
      end for
      for j = |LLe_i(k)| + 1 to |CG(k)| do
        LLe_i(k) = LLe_i(k) ∪ {E_i(random selection from F)}
      end for
    end for

    Phase 1: Encryption by all sites
    for Round j = 0 to N − 1 do
      if Round j = 0 then
        Each site i sends permuted LLe_i(k) to site (i + 1) mod N
      else
        Each site i encrypts all items in LLe_((i−j) mod N)(k) with E_i, permutes, and sends it to site (i + 1) mod N
      end if
    end for
    {At the end of Phase 1, site i has the itemsets of site (i + 1) mod N encrypted by every site}

    Phase 2: Merge odd/even itemsets
    Each site i sends LLe_((i+1) mod N)(k) to site i mod 2
    Site 0 sets RuleSet1 = ∪_{j=1..⌈(N−1)/2⌉} LLe_(2j−1)(k)
    Site 1 sets RuleSet0 = ∪_{j=0..⌊(N−1)/2⌋} LLe_(2j)(k)

    Phase 3: Merge all itemsets
    Site 1 sends permuted RuleSet0 to site 0
    Site 0 sets RuleSet = RuleSet0 ∪ RuleSet1

    Phase 4: Decryption
    for i = 0 to N − 2 do
      Site i decrypts items in RuleSet using D_i
      Site i sends permuted RuleSet to site i + 1
    end for
    Site N − 1 decrypts items in RuleSet using D_(N−1)
    RuleSet(k) = RuleSet − F
    Site N − 1 broadcasts RuleSet(k) to sites 0..N − 2
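A single-process sketch of Protocol 1, using the toy Pohlig-Hellman scheme of the Appendix as the commutative cipher. Integer codes stand in for (hashed) itemsets, and Phases 2 and 3 are collapsed into one union, since they only split the merging work between sites 0 and 1. All values are illustrative:

```python
import random
from math import gcd

p = 2**127 - 1                              # shared toy prime (Mersenne)

def keygen():
    """Commutative Pohlig-Hellman key pair: e * d = 1 (mod p - 1)."""
    while True:
        e = random.randrange(3, p - 1)
        if gcd(e, p - 1) == 1:
            return e, pow(e, -1, p - 1)

N = 3                                       # number of sites
keys = [keygen() for _ in range(N)]
enc = lambda x, i: pow(x, keys[i][0], p)
dec = lambda x, i: pow(x, keys[i][1], p)

# Integer codes standing in for hashed candidate itemsets.
LL = [[101, 202], [202, 303], [101, 404]]   # LL_i(k) at each site
F = [900001, 900002, 900003]                # fake padding "non-itemsets"
CG_size = 4                                 # |CG(k)|

# Phase 0: each site encrypts its own items and pads up to |CG(k)|.
held = []
for i in range(N):
    s = [enc(x, i) for x in LL[i]]
    s += [enc(random.choice(F), i) for _ in range(CG_size - len(s))]
    random.shuffle(s)                       # permute before sending
    held.append(s)

# Phase 1: ring-pass N-1 times; each site encrypts whatever it receives,
# so every set ends up encrypted by all N sites.
for _ in range(N - 1):
    held = [held[(i - 1) % N] for i in range(N)]     # send to site i+1
    held = [[enc(x, i) for x in s] for i, s in enumerate(held)]

# Phases 2-3 (collapsed): union the fully encrypted sets; equal items
# now have equal ciphertexts, so duplicates merge.
rule_set = set().union(*held)

# Phase 4: every site strips its encryption layer; drop the padding.
for i in range(N):
    rule_set = {dec(x, i) for x in rule_set}
assert rule_set - set(F) == {101, 202, 303, 404}
```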


    Protocol 2 Finding the global support counts securely

    Require: N ≥ 3 sites numbered 0..N − 1, m ≥ 2 · |DB|

    rule_set = ∅
    at site 0:
    for each r ∈ candidate set do
      choose a random integer xr from a uniform distribution over 0..m − 1;
      t = r.sup_0 − s · |DB_0| + xr (mod m);
      rule_set = rule_set ∪ {(r, t)};
    end for
    send rule_set to site 1;

    for i = 1 to N − 2 do {at site i}
      for each (r, t) ∈ rule_set do
        t′ = r.sup_i − s · |DB_i| + t (mod m);
        rule_set = (rule_set − {(r, t)}) ∪ {(r, t′)};
      end for
      send rule_set to site i + 1;
    end for

    at site N − 1:
    for each (r, t) ∈ rule_set do
      t′ = r.sup_(N−1) − s · |DB_(N−1)| + t (mod m);
      securely compute whether (t′ − xr) (mod m) < m/2 with site 0; {site 0 knows xr}
      if (t′ − xr) (mod m) < m/2 then
        multi-cast r as a globally large itemset.
      end if
    end for
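The additive chain of Protocol 2 can be sketched in a single process. The data are illustrative, and the final masked comparison, which the protocol performs as a secure two-party computation between sites N − 1 and 0, is done here in the clear:

```python
import random

# Illustrative per-site data: (support count of rule r, database size).
sites = [(20, 300), (15, 300), (10, 400)]    # N = 3 sites
s = 0.04                                     # global minimum support
m = 2 * sum(db for _, db in sites)           # m >= 2 * |DB|

# Site 0 masks its "excess support" with a random x_r.
x_r = random.randrange(m)
sup0, db0 = sites[0]
t = (sup0 - int(s * db0) + x_r) % m

# Sites 1 .. N-2 each add their own excess support to the running total.
for sup_i, db_i in sites[1:-1]:
    t = (sup_i - int(s * db_i) + t) % m

# Site N-1 adds its share; then, jointly with site 0 (who knows x_r),
# the masked total is compared against m/2.
sup_n, db_n = sites[-1]
t = (sup_n - int(s * db_n) + t) % m
excess = (t - x_r) % m                       # = sum_i (sup_i - s*|DB_i|) mod m
globally_large = excess < m / 2              # i.e., actual excess >= 0
```

With these numbers the total support is 45 against a threshold of s · |DB| = 40, so the excess is 5 and the rule is reported as globally large regardless of the random mask, which cancels out.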