[ieee 2012 12th international conference on hybrid intelligent systems (his) - pune, india...

5
Privacy Preserving Mining of Association Rules on Horizontally and Vertically Partitioned Data: A Review Paper Madhuri N.Kumbhar Dept. of Computer Engineering Pimpri Chinchwad C.O.E. Pune, India e-mail: [email protected] Ms. Reena Kharat Dept. of Computer Engineering Pimpri Chinchwad C.O.E. Pune, India e-mail: [email protected] n Abstract— Data mining can extract important knowledge from large database - sometimes this database is split among various parties. Here, the main aim of privacy preserving data mining is to find the global mining results by preserving the individual sites private data/information. Many Privacy Preserving Association Rule Mining (PPARM) algorithms are proposed for different partitioning methods by satisfying privacy constraints. The various methods such as randomization, perturbation, heuristic and cryptography techniques are proposed by different authors to find privacy preserving association rule mining in horizontally and vertically partitioned databases. In this paper, the analysis of different methods for PPARM is performed and their results are compared. For satisfying the privacy constraints in vertically partitioned databases, algorithm based on cryptography techniques, Homomorphic encryption, Secure Scalar product and Shamir’s secret sharing technique are used. For horizontal Partitioned databases, algorithm that combines advantage of both RSA public key cryptosystem and Homomorphic encryption scheme and algorithm that uses Paillier cryptosystem to compute global supports are used. This paper reviews the wide methods used for mining association rules over distributed dataset while preserving privacy. Keywords- Privacy Preserving Association Rule Mining, Cryptography, SMC, Scalar Product, Secret Sharing, Homomorphic Paillier Cryptosystem I. INTRODUCTION Data may be personal or corporate; data mining offers the potential to reveal what others regard as sensitive (private). In some cases, it may be of mutual benefit for two parties or multiple parties (even may be competitors) to share their data for an analysis task. However, they would like to ensure their own data remains private. Means, there is a need to protect sensitive knowledge during a data mining process. This problem is called Privacy-Preserving Data Mining (PPDM). So maintaining privacy is challenging issue in data mining. Many secure protocols have been proposed so far for data mining and machine learning techniques for decision tree classification, clustering, association rule mining, Neural Networks, Bayesian Networks. The main concern of these algorithms is to preserve the privacy of parties’ sensitive data, while they gain useful knowledge from the whole dataset. One of the most studied problems in data mining is the process of discovering frequent item sets and, consequently association rules. Association rules are widely used in various areas such as telecommunication networks, market and risk management, inventory control etc. Most of the privacy-preserving data mining methods apply a transformation which reduces the effectiveness of the underlying data when it is applied to data mining methods or algorithms. Privacy concerns can prevent building of centralized warehouse – in distributed among several sites, none of which are allowed to transfer their data to another site. In preserving privacy of data, the problem is how securely results are obtained but not with data mining result but. As a simple example, suppose some hospitals want to get useful aggregated knowledge about a specific diagnosis from their patients’ records while each hospital is not allowed, due to the privacy acts, to disclose individuals’ private data. Therefore, they need to run a joint and secure protocol on their distributed database to reach to the desired information. In many cases data is distributed, and bringing the data together in one place for analysis is not possible due these privacy laws or policies. Now it is not the data miner’s intention to compromise any private information. In fact, the results of the data mining rarely compromise privacy. This is due to the fact that data mining itself is concerned with retrieving high level data and trends about the data. If the results could be obtained without sharing information between different sites and the results are summary of entire data then there would be no loss of privacy through data mining if those results are not used to get private information. In distributed environment, the database is available across multiple sites and privacy preserved data mining is performed to find the global mining results by preserving the individual sites private data/information. Every site can access the global results which are useful for analysis. In recent years, many researchers are focusing on privacy preserving data mining in distributed environment as it is having lot of applications in diverse fields. In this paper, different approaches are discussed for mining the association rules while fulfilling privacy requirements over horizontally and vertically partitioned distributed databases and their results are compared. 231 978-1-4673-5116-4/12/$31.00 c 2012 IEEE

Upload: reena

Post on 08-Dec-2016

216 views

Category:

Documents


4 download

TRANSCRIPT

Privacy Preserving Mining of Association Rules on Horizontally and Vertically Partitioned Data: A Review Paper

Madhuri N.Kumbhar Dept. of Computer Engineering

Pimpri Chinchwad C.O.E. Pune, India

e-mail: [email protected]

Ms. Reena Kharat Dept. of Computer Engineering

Pimpri Chinchwad C.O.E. Pune, India

e-mail: [email protected]

Abstract— Data mining can extract important knowledge from large database - sometimes this database is split among various parties. Here, the main aim of privacy preserving data mining is to find the global mining results by preserving the individual sites private data/information. Many Privacy Preserving Association Rule Mining (PPARM) algorithms are proposed for different partitioning methods by satisfying privacy constraints. The various methods such as randomization, perturbation, heuristic and cryptography techniques are proposed by different authors to find privacy preserving association rule mining in horizontally and vertically partitioned databases. In this paper, the analysis of different methods for PPARM is performed and their results are compared. For satisfying the privacy constraints in vertically partitioned databases, algorithm based on cryptography techniques, Homomorphic encryption, Secure Scalar product and Shamir’s secret sharing technique are used. For horizontal Partitioned databases, algorithm that combines advantage of both RSA public key cryptosystem and Homomorphic encryption scheme and algorithm that uses Paillier cryptosystem to compute global supports are used. This paper reviews the wide methods used for mining association rules over distributed dataset while preserving privacy.

Keywords- Privacy Preserving Association Rule Mining, Cryptography, SMC, Scalar Product, Secret Sharing, Homomorphic Paillier Cryptosystem

I. INTRODUCTION Data may be personal or corporate; data mining offers the

potential to reveal what others regard as sensitive (private). In some cases, it may be of mutual benefit for two parties or multiple parties (even may be competitors) to share their data for an analysis task. However, they would like to ensure their own data remains private. Means, there is a need to protect sensitive knowledge during a data mining process. This problem is called Privacy-Preserving Data Mining (PPDM). So maintaining privacy is challenging issue in data mining.

Many secure protocols have been proposed so far for data mining and machine learning techniques for decision tree classification, clustering, association rule mining, Neural Networks, Bayesian Networks. The main concern of these algorithms is to preserve the privacy of parties’ sensitive data, while they gain useful knowledge from the whole dataset. One of the most studied problems in data mining is the process of discovering frequent item sets and,

consequently association rules. Association rules are widely used in various areas such as telecommunication networks, market and risk management, inventory control etc.

Most of the privacy-preserving data mining methods apply a transformation which reduces the effectiveness of the underlying data when it is applied to data mining methods or algorithms. Privacy concerns can prevent building of centralized warehouse – in distributed among several sites, none of which are allowed to transfer their data to another site. In preserving privacy of data, the problem is how securely results are obtained but not with data mining result but. As a simple example, suppose some hospitals want to get useful aggregated knowledge about a specific diagnosis from their patients’ records while each hospital is not allowed, due to the privacy acts, to disclose individuals’ private data. Therefore, they need to run a joint and secure protocol on their distributed database to reach to the desired information.

In many cases data is distributed, and bringing the data together in one place for analysis is not possible due these privacy laws or policies. Now it is not the data miner’s intention to compromise any private information. In fact, the results of the data mining rarely compromise privacy. This is due to the fact that data mining itself is concerned with retrieving high level data and trends about the data. If the results could be obtained without sharing information between different sites and the results are summary of entire data then there would be no loss of privacy through data mining if those results are not used to get private information.

In distributed environment, the database is available across multiple sites and privacy preserved data mining is performed to find the global mining results by preserving the individual sites private data/information. Every site can access the global results which are useful for analysis. In recent years, many researchers are focusing on privacy preserving data mining in distributed environment as it is having lot of applications in diverse fields.

In this paper, different approaches are discussed for mining the association rules while fulfilling privacy requirements over horizontally and vertically partitioned distributed databases and their results are compared.

231978-1-4673-5116-4/12/$31.00 c©2012 IEEE

Distr

ibut

ion

of c

andida

tes

set

Ste

p 1

Ste

p 1

Ste

p 1

II. HORIZANTALLY PARTITIONED DATABASES

A. Privacy Mining of Generic Basic Association Rules

Moez et al. [8] described the generic base of association rules from the frequent closed set in a distributed environment while preserving constraints of privacy by using a commutative cryptographic protocol. It proposes the Global architecture that associated with algorithm works on extraction of generic base from a condensed representation of data, i.e., closure in this case. It offers the advantage of being able to carry out the various tasks of the extraction of itemset while guaranteeing security and anonymity.

Fig.1 represents the global architecture of our solution.

Figure 1. Global Architecture

Commutative Encryption Protocol algorithm uses the assumption that a large prime p has been made publicly known to all (even adversaries). Let, the secret owner is denoted by P0. The trustees are denoted by P1, P2, _ _, Pn-1. Finally, Pn denotes the recipient of the secret. The phases are described as follows:

Phase 1: Initializing In this phase, large prime number p is made public and has secret mi and private key pair (ai, bi).

Phase 2: Locking In this phase, the secret is locked by all parties in a cascaded manner. Party Pi locks the secret mi by applying ci = miai mod (p) and sends ci to party P i+1

Phase 3: Unlocking Here, the secret is unlocked by all trustees and then in last

step by Pn. Master Site P0 receives the set of all messages and unlocks the secret i.e., the set of messages

The security of this scheme heavily draws on computational difficulty of Discrete Logarithmic Problem (DLP).

B. M. Hussein’s Scheme M. Hussein et al. [4] propose a modification to privacy

preserving association rule mining on distributed homogenous database algorithm. Our algorithm is faster than old one which modified with preserving privacy and accurate results. Modified algorithm is based on a semi-honest model with negligible collision probability. The flexibility to extend to any number of sites without any change in implementation can be achieved.

Figure 2. General Structure of Scheme

Step1: All local data mining (LDM) compute the mining results using fast distributed mining of association rules (FDM) as locally large k-item sets (LLi (k)) and local support for each item set in LLi (k) then Encrypt frequent item sets and support (LLei (k)) then send it to the data mining combiner. Step 2: The combiner merge all received frequent items and supports with the data mining combiner frequent items and support in encrypted form then send LLe (k) to algorithm initiator to compute the global association rules. Step3: The algorithm initiator receives the frequent items with support encrypted. The initiator first decrypts it, and then merges it with his local data mining result to obtain global mining results L(k), then compute global association rules and distribute it to all protocol parties.

This algorithm is more flexible to extend it to any number of sites without any change in implementation.

C. Equations EMHS Scheme

This algorithm is more flexible to extend it to any number of sites without any change in implementation.

Xuan et al. [9] proposed an enhanced M. Hussein Scheme [4] for privacy preserving association rules mining on horizontally distributed databases. It improves privacy and performance when number of sites gets increased. This algorithm uses two servers i.e. Initiator and Combiner and homomorphic Paillier cryptosystem to compute global supports.

…..

Communication Protocol: Encryption + anonymity

Site 1

DB1

Closure Gener- ation [Size- k]

Site 2

DB2

Closure Gener- ation

[Size- k]

Site n

e n

DBn

Closure Gener- ation

[Size- k]

Master Site: Generic base AR

Master Site: Candidate [k+1] Generation {can} =Ø

NO

YES

LDM Initiator + LDM

LDM LDM LDM

232 2012 12th International Conference on Hybrid Intelligent Systems (HIS)

Figure 3. Model of EMHS

There are three phases of this scheme. Init Phase:

Initiator sends RSA’s and Paillier’s public key to all other sites. First phase: Candidate set generation phase Step 1: Each site independently and parallel finds its local MFI, and (except Initiator) encrypts its local MFI by using its RSA’s public key. Then Clients parallely send their encrypted data to Combiner. Step 2: Combiner merges the data received from Clients with its encrypted data and then sends the union data to Initiator. Step 3: Initiator decrypts the data received from Combiner and combines the decrypted data with its local MFI to find global MFI, in which each maximal frequent item set is not the subset of any others.

Then Initiator sends the global MFI to all other sites. Each site generates candidate set, where each candidate is subset generated from each maximal frequent itemsets in global MFI. The candidates are different with each other and are sorted in the same order at all sites. Second Phase: Global support computation phase Step 1: Each site in parallel computes its local support count of each candidate and (except Initiator) encrypts its support counts by using its Paillier’s public key. Then Clients parallel send their encrypted data to Combiner. The encrypted of local support count of candidate X at site si is denoted as E (X. supi). Step 2: With each candidate X, Combiner computes:

E (X.supCombiner) = E (X.supCombiner) * E (X.supk) After that, decrypted data are sent to Initiator. Step 3: Initiator decrypts the data received from Combiner and computes global support count of each candidate X as follows:

X.sup = D (E (X.supCombiner)) + X.sup Initiator

Final Phase: Each Site together computes Then Initiator finds strong global association rules and sends the result to all other sites.

In EMHS, applying MFI approach in the first phase will reduce the size union data in this phase; and in the second phase, the union data is fixed when increasing the number of sites.

III. VERTICALLY PARTITIONED DATABASES

In vertical partitioning a transaction database DB is vertical partitioned among n parties (namely P1, P2, Pn), where DB = DB1 DB2 … DBn, DBi DBj , for i, j n, DBi resides at party Pi (1 i n). This means 1. All the data sets contain the same number of transactions, let N denote the total number of transactions for each data set.

2. The identities of the jth (for j [1, N]) transaction in all the data sets are the same.

Vaidya and Clifton [1] presented protocols for privacy-preserving association rule mining over vertically partitioned data. Encryption is a well known technique for preserving the confidentiality of sensitive information. In comparison with the other techniques described, a strong encryption scheme can be more effective in protecting the data privacy. An encryption system normally requires that the encrypted data should be decrypted before making any operations on it. For example, if the value is hidden by a randomization-based technique, the original value will be disclosed with certain probability. If the value is encrypted using a semantic secure encryption scheme, the encrypted value provides no help for an attacker to find the original value.

A. Algorithm with Data Miner

N.V. Muthu Laxmi et al. [10] proposed a methodology to find privacy preserving association rule mining for vertically partitioned distributed database with data miner (DM). Cryptography techniques and scalar product protocol are used to find global frequent item sets along with support values while protecting one’s private data/information from others accessing. In this method, a special site is designated called as data miner (DM) who initiates the process of finding association rules without knowing any one’s individual data/information .

The communication among three sites and DM is shown in fig. 4.

Figure 4. Communication among three sites and DM The role of DM: • To initiate the process by sending Min_sup

threshold and public key to all sites. • DM also participates in the encryption and decryption

process for frequent item sets in order to protect individual sites attributes information that is names of attributes & number of attributes exists in a site and their support values.

Initiator (Public Key, Private Key)

Combiner (Public Key)

Client 1 Client 2 Client 3

DM

Site 3 Site 2

Site 1

2012 12th International Conference on Hybrid Intelligent Systems (HIS) 233

• DM is having privileges to find the global frequent item sets and their support values. He also generates the association rules which are then broadcasted to all sites.

Every site communicates with its successor site only except the last site, Siten communicates with DM. DM is having communications with all sites, Site1 to Siten. Each site performs the computations by using scalar product concept with its own computed results and the computed results obtained from its predecessor site.

This model utilizes the concept of scalar product is proposed to find global association rules when the database is partitioned vertically among n number of sites. In the proposed model, DM has privileges to initiate the mining process, finding global association rules. A secured computation for association rules is achieved with this model by preserving the privacy of the individual sites information. The functioning of the proposed model is illustrated with sample databases.

B. T-VDC Method

Zhu. et al. [12] addressed the problem with the existing protocol of secure two-party vector dot product computation like the low efficiency and may disclose the privacy data they proposed a method which is effective to find frequent item sets on vertically distributed data. The method uses semi-honest third party to participate in the calculation, put the converted data of the parties to a third party to calculate.

The results is compared to the original Vector dot product algorithm, the method can obviously improve the algorithm efficiency and accuracy of the results at the precondition that assured the data privacy of all parties.

Vector-Based Semi-Honest Third-Party Product

Protocol (T-VDC) [12] uses a pair of random numbers to encrypt the original data and with the help of the third party to calculate. The process does not need a large number of randomly generated vectors and matrices and the communication process only transfer a small amount of vectors and numerical.

The support count of the item denoted by count (item), x is an attribute of party A, the attribute value is denoted by the vector X=(x1, x2… xn), y is an attribute of party B, the value of the attribute is denoted by vector Y= (y1, y2… yn).

The count (item) = X.Y = i*yi

Algorithm 2: The dot product protocol T-VDC based on the semi-honest third party. Input: Transaction database D; minimum support threshold min_sup. Output: The frequent itemsets F in D.

Step 1: Generate candidate set Ck. If all the attributes in Ck are entirely at A or B then that party independently calculates c.count. Otherwise third party computes, c.count = X•Y= [X’Y- v (y1q+y2q+ +ynq)]/k Step 2: Let A have p attributes and B have the remaining. Construct vector X and vector Y on A’s side and B’s side respectively. Step 3: Third party creates pair of random (k,v) Vector X is multiplies by k and offset is added to get resultant vector X’, send X’ to Y. Step 4: Y computes X’Y and sum Ysum = y1q + y2q+..+ ynq and X’Y to third party. Third party computes c.count = X•Y= [X’Y- v (y1q+y2q+ +ynq)]/k T-VDC is simplified based on VDC algorithm; there is no need to generate a large number of random matrices and random vectors, so the execution time of T-VDC is less than the VDC. With the minimum support increases, the time cost decreases.

C. Algorithm Based on Secret Sharing Technique Xinjing et al. [11] explores the issue of privacy

preserving distributed association rule mining in vertically partitioned data among multiple parties, and proposes a collusion-resistant algorithm of distributed association rule mining based on the Shamir’s secret sharing technique. This prevents effectively the collusive behaviors and conducts the computations across the parties without compromising their data privacy. Shamir’s secret sharing method [6] allows a dealer D to distribute a secret value among n peers, and knowledge of any pair can be used to reconstruct the secret.

Algorithm 3: Privacy-Preserving Algorithm to Collaboratively Compute c. count

In this algorithm, every party solves polynomial equations to find sum of secret values. If this sum is equals to ‘n’ then that transaction supports the association rules. For this each party selects a random polynomial qi(x) for each transaction. Each party computes share of each party and send to other party.

Now, each party adds all shares received from other parties and sends results to all other party.

As each party have values of polynomial where constant term is sum of all secret values.

This algorithm is semantically secure for the network attackers and can prevent effectively the collusive behaviors from the collaborative parties under the condition of the number of the collusive parties t < n 1.

234 2012 12th International Conference on Hybrid Intelligent Systems (HIS)

IV. COMPARATIVE STUDY

The methods proposed in three papers [8] [9] [4] based on PPARM in horizontal partitioning of databases.

The security of the scheme [8] heavily draws on the computational difficulty of discrete logarithm problem (DLP). EMHS follows MFI approach and does not modify the original data in both two phases. Thus, Initiator will find global frequent item sets accurately means the final results are accurate.

Commutative encryption is used for algorithm proposed in [8] which didn’t violate privacy constraints. Both MHS [4] and EMHS [9] scheme satisfies semi-honest model. EMHS uses Paillier cryptosystem in the second phase and MHS uses RSA cryptosystem. For this reason, Combiner is much more difficult to attack in EMHS. Means EMHS has higher privacy than MHS. Both EMHS and MHS are two-phase schemes, the communication cost (or cost) of each scheme is the sum of the one in each phase. EMHS has better performance than MHS in sparse datasets when increasing the number of sites.

The methods discussed in [1], [10], [11], [12] are based on privacy preserving ARM over vertical partitioning. In [12], put forward a vector product protocol based on semi-honest third party (T-VDC). In this, the process does not need a large number of randomly generated vectors and matrices as in [1] and the communication process only transfer a small amount of vectors and numerical. So compared to the VDC algorithm [1], computing and communication overheads are very small. For N transactions and n parties, Communication cost of algorithm [11] based on secret sharing method is 2Nn (n-1).

As in [12] the random number is randomly selected, it can better guarantee the security of all data. Algorithm [11] is semantically secure for the network attackers and can prevent effectively the collusive behaviors from the collaborative parties under the condition of the number of the collusive parties t < n -1.

V. CONCLUSION

The issues of privacy preserving association rule mining are addressed here. In particular, privacy preserving algorithms over horizontal and vertical partitioned databases are discussed and results are compared.

The EMHS scheme [10], improves the privacy and performance when number of nodes increased. As it applies MFI and Paillier cryptosystem it provides high security; but security of initiator and combiner are not addressed. The security of Global architecture [8] relies on DLP and it is secure enough to keep client’s privacy. This scheme is not secure to maintain privacy of master site. The method proposed for vertical partitioned databases generates association rules efficiently and with minimum number of computations and communication by satisfying privacy constraints. All methods works on Boolean association rule mining and are designed for semi-honest model.

In practice while calculating c.count collaboratively [11], participant may deviate from algorithm and lead to

malicious behavior. But algorithm is semantically secure and prevents collusive behavior with accurate results. In the proposed model [10], data miner has privileges to initiate mining process, finding global association rules. The performance of all models is analyzed in terms of privacy, security and communications.

REFERENCES [1] J. Vaidya, Clifton. “Privacy preserving association rule mining in

vertically partitioned data “ In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Edmonton, Alberta, Canada: ACM.2002:639-644.

[2] C. Yao, “Protocols for secure computations,” Proceedings of the 23rd Annual IEEE Symposium on Foundations of Computer Science, IEEE Press, New York, 1982.

[3] Gui Qiong, Cheng Xiao-Hui. Association Rule Mining Algorithm Based on Similarity Matrix of Transactions [J]. Journal of Guilin University of Technology, Vol. 28, No.4, Nov.2008, p p. 568-571.

[4] Mahmoud Hussein, Ashraf El-Sisi, Nabil Ismail, “Fast Cryptographic Privacy Preserving Association Rules Mining on Distributed Homogenous DataBase”, Knowledge-Based Intelligent Information and Engineering Systems, Lecture Notes in Computer Science, Volume 5178/2008, pp. 607 -- 616 (2008).

[5] Chris Clifton, Murat Kantarcioglu, Jaideep Vaidya, Xiaodong Lin, and Michael Y. Zhu (2003), “Tools for privacy preserving distributed data mining,” SIGKDD Explorations, Vol. 4, No.2pp1-7.

[6] A. Shamir;“How to share a secret,” Communications of the ACM, vol.22 (11), pp.612–613, 1979.

[7] Jaiwei Han and Micheline Kamber, “Data Mining-Concept and Techniques”, Morgan Kaufman Publishers, 2nd Edition , 2006.

[8] Moez Waddey , Pascal Poncelet, Sadok Ben Yahia, “Novel Approach For Privacy Mining Of Generic Basic Association Rules,” In PAVLAD’09, November 6, 2009, Hong Kong, China, 2009 ACM.

[9] Xuan Canh Nguyen, Hoai Bac Le, Tung Anh Cao, “An Enhanced Scheme For Privacy-Preserving Association Rules Mining On Horizontally Distributed Databases,” In 2012 IEEE.

[10] N. V. Muthu Lakshmi1 & K. Sandhya Rani, “Privacy Preserving Association Rule Mining in Vertically Partitioned Databases,” In International Journal of Computer Applications (0975 – 8887) Volume 39– No.13, February 2012.

[11] Xinjing Ge, Li Yan, Jianming Zhu, Wenjie Shi, “Privacy- Preserving Distributed Association Rule Mining Based on the Secret Sharing Technique,” 2009 IEEE.

[12] Zhu Yu- quan, Tang Yang, Chen Geng, “A Privacy Preserving Algorithm for Mining Distributed Association Rules,” 19-21 May 2011.

2012 12th International Conference on Hybrid Intelligent Systems (HIS) 235