Research Article
Personalized Privacy-Preserving Frequent Itemset Mining Using Randomized Response

Chongjing Sun, Yan Fu, Junlin Zhou, and Hui Gao

Web Science Center, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China

Correspondence should be addressed to Yan Fu; fuyan@uestc.edu.cn

Received 8 December 2013; Accepted 24 February 2014; Published 30 March 2014

Academic Editors: B. Johansson and L. Xiao

Copyright © 2014 Chongjing Sun et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Frequent itemset mining is the important first step of association rule mining, which discovers interesting patterns from massive data. There are increasing concerns about the privacy problem in frequent itemset mining. Some works have been proposed to handle this kind of problem. In this paper, we introduce a personalized privacy problem, in which different attributes may need different levels of privacy protection. To solve this problem, we give a personalized privacy-preserving method based on the randomized response technique. By providing different privacy levels for different attributes, this method can achieve higher accuracy on frequent itemset mining than the traditional method that provides the same privacy level for all attributes. Finally, our experimental results show that our method obtains better results on frequent itemset mining while preserving personalized privacy.

1. Introduction

The rapid development of Internet technology, cloud computing, mobile computing, and the Internet of Things has produced exploding large datasets, the so-called big data. These data can be used to capture useful information and interesting patterns and indeed help people to make effective decisions. Data mining is a powerful technique which can discover the hidden interesting patterns and models. By analyzing the data, data mining offers considerable benefits to consumers, companies, and organizations.

Association rule mining is a popular and important data mining technique. It is intended to discover interesting patterns from large transaction data, that is, frequent patterns and association rules. These rules can be used to improve market decision making, such as promotional pricing or product placements. For example, by analyzing the transaction data from a supermarket, a discovered rule may state that customers who buy diapers also tend to buy beer. Nowadays, association rule mining is employed in many application areas such as Web usage mining, intrusion detection, and bioinformatics.

The process of mining association rules contains two main steps [1]. The first step is to discover all the frequent itemsets whose support is higher than the minimum support. The second step generates the strong rules based on the frequent itemsets found in the first step, where a strong rule has confidence greater than the minimum confidence. In this paper, we focus on frequent itemset mining, which is the most important part of association rule mining.

With the exponentially increasing amount of information being collected for analysis, there is growing concern about the privacy problem. In fear of privacy violations, some people might be reluctant to provide their true information. The survey [2] conducted in 1993 demonstrated people's worries about privacy: 17% of the respondents were extremely concerned about their personal information; they were reluctant to provide any data even when privacy protection measures were in place. However, 56% of them were willing to provide their information under the condition that privacy protection methods were adopted. The remaining 27% were marginally concerned about their privacy and willing to provide data under any condition. Therefore, in order to collect true data, privacy protection measures must be taken to protect personal information. Besides, before outsourcing or sharing the data with other companies, the data owner also must take some privacy-preserving measures.



Two privacy environments are proposed in [3] based on the privacy mechanisms. The first one is B2B (business to business), which assures that the data obtained from the users are preprocessed by privacy-preserving techniques before being supplied to the data miner. The second one is B2C (business to customer), which provides privacy protection at the point of data collection, that is, at the user site. In order to protect the sensitive information, we propose methods which can be used under both environments. We focus on the privacy of input data in the frequent itemset mining process. Under the B2C environment, the personal data are randomized after each user provides them and are then sent to the data collector. Under the B2B environment, the collected dataset is randomized before outsourcing or sharing it with others, even publicly.

Although many works have been proposed to handle the privacy problem in the frequent itemset mining process, few of them focus on the personalized privacy problem. Xiao and Tao [4] proposed the personalized privacy preservation problem, considering that different people need different levels of privacy protection, and they used generalization techniques. Besides, they also raised the problem of multipublishing [5], wherein different recipients can have different privacy protection levels on the original dataset.

In this paper, we define personalized privacy as different privacy levels on different attributes or items in the process of frequent itemset mining. For example, a customer does not care too much about the disclosure of the information that he bought daily necessities, but he will be heavily concerned about the privacy problem if he bought some sensitive items. If we provide the same level of privacy protection for all items, too much data utility will be lost in order to satisfy the privacy requirements of every item, because the nonsensitive items are also disturbed too heavily. Hence, it is necessary to provide different degrees of privacy protection for different items or attributes.

To solve the above personalized privacy problem, we use the randomized response technique [6]. Based on MASK [7], we give the solution on how to reconstruct the associations under different privacy protections. Our method can be used to preserve privacy under both the B2B and B2C environments. Besides, we design a method to accelerate the itemset support computation. Finally, our experimental results show that the performance of our method is better than that of the traditional method with respect to accuracy while satisfying the different privacy requirements.

The rest of this paper is organized as follows. In Section 2, we review the previous related works. Section 3 gives some related preliminaries and defines the personalized privacy problem. Then, in Section 4, we describe the proposed method for personalized privacy protection and how to discover the frequent itemsets from the randomized data. We evaluate our technique in Section 5 by experiments and conclude our work in Section 6.

2. Related Works

There are many works on the privacy problem in association rule mining, and they can be divided into two research streams [3]: input data privacy and output rule privacy. Most of them focus on the second one. To preserve the input data privacy, some methods are adopted to disturb the original data so that the attacker cannot get the true data of users. For the output rule privacy, the collected data are heuristically altered in order to protect some sensitive rules from being mined from the dataset by data miners.

The association rule hiding algorithms, which protect the output rule privacy, can be divided into three categories [8], namely, heuristic approaches, border-based approaches, and exact approaches. The first class of approaches selectively sanitizes a set of transactions from the database to hide the sensitive rules efficiently, but it suffers from high side effects. Two classic techniques in this class of approaches are distortion [9, 10] and blocking [11, 12]. The second set of approaches [13, 14] hides the sensitive rules by modifying the original borders in the lattice of the frequent and infrequent itemsets in the database. The third set of approaches [15, 16] hides the sensitive rules by solving a constraint satisfaction problem using integer or linear programming. They can find the optimal way to hide the rules, but with high computation cost.

In this paper, we solve the problem of input data privacy, and the corresponding approaches can be divided into cryptography-based and reconstruction-based methods. The cryptography-based approaches handle the problem in which several partners want to discover shared association rules from the global data without disclosing their sensitive data to others. The global data may be vertically partitioned [17] or horizontally partitioned [18] and distributed among many partners. The reconstruction-based methods first randomize the original data to hide the sensitive data and then reconstruct the interesting patterns based on the statistical features without knowing the true values. For centralized data, there are two distortion strategies: statistical distortion and algebraic distortion. Rizvi and Haritsa [7] proposed the MASK approach to disturb the original sparse Boolean databases. This method retains each 0 or 1 bit in the database with probability p and flips this value with probability 1 − p. Zhang et al. [19] proposed a solution with algebraic distortion. By using the eigenvalues and eigenvectors, the data of users can be distorted by matrix transformation and noise addition.

Our work focuses on the input data privacy problem of centralized data. The frequent itemset mining is conducted on the distorted data that are randomized by the randomized response technique. None of the above algorithms addresses the personalized privacy problem that we solve in this paper.

3. Preliminaries and Problem Definition

In this section, we introduce the related preliminaries about privacy-preserving frequent itemset mining and give the problem definition of this paper.

3.1. Association Rule Mining. Let I be a set of items and D the database containing a set of transactions, wherein each transaction T is a set of items such that T ⊆ I. A transaction T is said to contain X if and only if X ⊆ T. A rule is an implication of the form X ⇒ Y, where X, Y ⊂ I and X ∩ Y = ∅. The support of the rule X ⇒ Y is defined in (1) and its confidence is defined in (2):

supp(X ⇒ Y) = P(X ∪ Y), (1)

conf(X ⇒ Y) = P(Y | X) = supp(X ∪ Y) / supp(X), (2)

where P(X ∪ Y) is the percentage of transactions in D that contain X ∪ Y, and supp(X) is defined as the percentage of transactions containing X.

A set of items X is said to be frequent if and only if supp(X) is greater than the user-defined minimum support; then X is a frequent itemset. If the rule X ⇒ Y has support greater than the minimum support and its confidence is greater than the minimum confidence, then X ⇒ Y is an interesting rule, that is, an association rule. From (1) and (2), it can be seen that finding the association rules is effectively equivalent to generating all the frequent itemsets with support greater than the minimum support. Therefore, we focus on frequent itemset mining.
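To make these definitions concrete, the following Python sketch computes support and confidence exactly as in (1) and (2) on a small, purely illustrative transaction list (the items and counts are not from the paper).

```python
# Toy illustration of (1) and (2); transactions and items are hypothetical.
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"diapers", "bread", "beer"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    """conf(X => Y) = supp(X u Y) / supp(X), as in (2)."""
    return support(antecedent | consequent, db) / support(antecedent, db)

print(support({"diapers", "beer"}, transactions))       # 0.75
print(confidence({"diapers"}, {"beer"}, transactions))   # 1.0
```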

3.2. Randomized Response. The randomized response technique was first introduced by Warner [20] to solve the statistical survey problem of sensitive questions. For example, social scientists may want to know how many people in some area use drugs. Usually, respondents are reluctant to answer this kind of question directly. Hence, Warner proposed the randomized response technique.

To ask a binary-choice question about whether people have a sensitive attribute A, the randomized response technique gives two related questions like the following:

(1) I have the sensitive attribute A.
(2) I do not have the sensitive attribute A.

Respondents decide which question to answer with a probability p. Then the interviewer only gets a "yes" or "no" without knowing which question the respondent answered. If a respondent has the sensitive attribute A, then he will give the answer "yes" to the first question with probability p, or "no" with probability 1 − p. If a person does not have A, then he will give "no" with probability p to answer the first question and "yes" with probability 1 − p to answer the second question. Hence, the probability that an interviewer gets the answer "yes" can be computed by (3), while that of getting "no" can be computed by (4):

P(ans = yes) = P(A = yes) · p + P(A = no) · (1 − p), (3)

P(ans = no) = P(A = no) · p + P(A = yes) · (1 − p), (4)

where P(A = yes) is the proportion of respondents that have the attribute A, while P(A = no) is the proportion of respondents that do not have the attribute A. The interviewer can get the proportion of respondents having the attribute A, P(A = yes), by solving (3) and (4).
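As an illustration of how the interviewer recovers P(A = yes), the following sketch simulates Warner's scheme and inverts (3): with λ denoting the observed fraction of "yes" answers and p ≠ 0.5, P(A = yes) = (λ − (1 − p)) / (2p − 1). The population size and parameter values are simulated assumptions, not data from the paper.

```python
import random

def randomized_answer(has_attribute, p, rng):
    """Warner's scheme: answer the first question with probability p,
    the second (negated) question with probability 1 - p."""
    return has_attribute if rng.random() <= p else not has_attribute

def estimate_proportion(answers, p):
    """Invert (3): lambda = pi*p + (1 - pi)*(1 - p)  =>  pi = (lambda - (1 - p)) / (2p - 1)."""
    lam = sum(answers) / len(answers)
    return (lam - (1 - p)) / (2 * p - 1)

rng = random.Random(0)
true_pi, p = 0.2, 0.8                                   # assumed values for this simulation
population = [rng.random() < true_pi for _ in range(100_000)]
answers = [randomized_answer(a, p, rng) for a in population]
print(round(estimate_proportion(answers, p), 3))        # close to 0.2
```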

3.3. Problem Definition. In this paper, we solve the data privacy problem with respect to frequent itemset mining under the B2B environment as well as the B2C environment.

Under the B2C model, the interviewer can conduct a survey containing sensitive questions. For example, the questions "whether you are divorced", "whether you have criminal records", or "whether you use drugs" are sensitive. This kind of survey can analyze the associations between factors and indeed reflect sociological problems. In this model, respondents disturb their survey vectors with the given parameters and then send them to the interviewers.

Under the B2B model, the data owner, such as a supermarket manager, disturbs the original data before sharing them with other business partners. Taking market transactions as an example, customers are reluctant to let others know what they bought, especially some sensitive products. By taking privacy protection measures, transaction details can be hidden without leaking personal privacy.

Considering the data utility and the different degrees of privacy requirements, the protection on different attributes or items should be different. For example, a customer does not care too much about the disclosure that he bought paper, while being heavily concerned about some sensitive items.

The problem of personalized privacy-preserving frequent itemset mining is: given the original transaction dataset D, how to disturb D into D′ satisfying the different privacy requirements, and how to mine frequent itemsets from D′, so that the frequent itemsets mined from D′ are as close as possible to the frequent itemsets mined from D.

4. Personalized Privacy-Preserving Frequent Itemset Mining

In this section, we present the procedure for distorting the original data by the randomized response technique while satisfying the personalized privacy requirements. Then we give the method to reconstruct the itemset support from the distorted data. Finally, we devise a personalized privacy-preserving frequent itemset mining algorithm based on the Apriori algorithm.

4.1. Personalized Data Distortion. Suppose there are n items in the database D. Each transaction in D is represented by a binary vector T: T_i = 1 if this transaction contains item i; otherwise T_i = 0. For each item i, 1 ≤ i ≤ n, there is a probability parameter p_i (0 ≤ p_i ≤ 1) to disturb the data on item i, and these parameters form a vector shown in formula (5). Then the distorted value of this transaction on item i can be expressed by (6), where r_i is a value randomly drawn from a uniform distribution over the interval [0, 1]:

P = (p_1, p_2, ..., p_n), (5)

T′_i = T_i if r_i ≤ p_i, and T′_i = 1 − T_i otherwise. (6)

From (6) we can see that the i-th value of transaction T is kept with probability p_i and flipped with probability 1 − p_i.
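A minimal sketch of the personalized distortion step in (5) and (6), assuming transactions are already encoded as 0/1 vectors; the parameter vector and transaction below are illustrative only.

```python
import random

def distort_transaction(T, P, rng):
    """Apply (6): keep bit T_i with probability P[i], flip it otherwise."""
    return [t if rng.random() <= p else 1 - t for t, p in zip(T, P)]

rng = random.Random(42)
P = [0.95, 0.90, 0.55]     # e.g. two daily necessities and one sensitive item (illustrative)
T = [1, 0, 1]              # original binary transaction vector
print(distort_transaction(T, P, rng))
```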


[Figure 1: Reconstruction probability R(p_i, s_i) as a function of p_i, plotted for item supports s_i = 0.01, 0.1, 0.3, 0.5, 0.8, and 0.95.]

The proportion of T′_i = 1 can be calculated by (7), and the proportion of T′_i = 0 can be calculated by (8):

P(T′_i = 1) = P(T_i = 1) · p_i + P(T_i = 0) · (1 − p_i), (7)

P(T′_i = 0) = P(T_i = 0) · p_i + P(T_i = 1) · (1 − p_i). (8)

For the personalized privacy requirements, different items are distorted with different probability parameters. For a given parameter p_i, the probability of correctly reconstructing the value T_i = 1 is calculated by

R(p_i, s_i) = P(T′_i = 1 | T_i = 1) · P(T_i = 1 | T′_i = 1) + P(T′_i = 0 | T_i = 1) · P(T_i = 1 | T′_i = 0). (9)

The first part in (9) means that the original value T_i = 1 is distorted into the value T′_i = 1 and then reconstructed from T′_i = 1, and the second part is for the distorted value T′_i = 0. Finally, the probability can be computed by (10), where s_i is the support of item i in the original database. A similar derivation of (10) can be found in [7]:

R(p_i, s_i) = s_i·p_i² / (s_i·p_i + (1 − s_i)(1 − p_i)) + s_i·(1 − p_i)² / (s_i·(1 − p_i) + (1 − s_i)·p_i). (10)
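The reconstruction probability in (10) is easy to evaluate directly; the short sketch below does so and illustrates the two properties discussed next (higher-support items are easier to reconstruct, and the curve is symmetric around p_i = 0.5). The support values are arbitrary examples.

```python
def reconstruction_prob(p, s):
    """R(p_i, s_i) as in (10)."""
    return (s * p ** 2) / (s * p + (1 - s) * (1 - p)) \
         + (s * (1 - p) ** 2) / (s * (1 - p) + (1 - s) * p)

for s in (0.01, 0.3, 0.8):                        # illustrative item supports
    print(s,
          round(reconstruction_prob(0.9, s), 4),
          round(reconstruction_prob(0.1, s), 4))  # equal values: symmetric around p = 0.5
```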

The reconstruction probability curves of (10) are shown in Figure 1. We can see that the higher the item support is, the easier it is to reconstruct the value 1 of this item. Besides, the curves are symmetric around p_i = 0.5: the further p_i is from 0.5, the easier it is to reconstruct the value 1 of the item. Therefore, for a given item, different probability parameters will lead to different degrees of privacy protection. The parameters for the items will be set by the data owner according to the properties of these items. For example, the parameter p_i for the item milk can be very high, even 1.0, while for drugs it should be closer to 0.5.

Under the B2C environment, the interviewer and the respondents cooperate with each other to set the parameter vector P. Then the respondents disturb their personal data and send them to the interviewer. Under the B2B environment, the data owner disturbs the original data with the parameter vector P and sends the distorted data, together with P, to the third parties.

4.2. Itemset Support Reconstruction. After getting the distorted data D′, the data miner will reconstruct the support of itemsets to find the frequent ones. In order to reconstruct the support of an itemset, we need to get the count of every combination of the items in this itemset. For example, in order to compute the support of itemset ABC in the original dataset D, that is, p(A = 1, B = 1, C = 1), we compute the counts of the 2³ combinations of ABC in D′, that is, c′(A = 0, B = 0, C = 0), c′(A = 0, B = 0, C = 1), ..., and c′(A = 1, B = 1, C = 1). This is because the original value (A = 1, B = 1, C = 1) can be distorted into any of these 2³ combinations of ABC.

Let S be the set of combinations of the k items, defined in (11), where S_j is the binary form of the value j as in (12). For example, when k = 3 with items ABC, S_0 is (A = 0, B = 0, C = 0) and S_3 is (A = 0, B = 1, C = 1), that is, 011 is the binary form of 3:

S = {S_0, S_1, ..., S_{2^k − 1}}, (11)

S_i = binary(i, k). (12)

The combination S_i in the original database can be distorted into any combination S_j in S. According to probability theory, the probability of S_i being distorted into S_j can be computed by

r_ij = ∏((S_i ⊙ S_j) ∘ P^(k) + (S_i ⊕ S_j) ∘ (1 − P^(k))), (13)

where ∘ is the Hadamard product of two vectors, P^(k) is the vector of distorting probability parameters corresponding to the k items, and ∏(·) returns the product of the vector elements. That means that if the value on an item i is kept, then p^(i) is multiplied into r_ij; otherwise, 1 − p^(i) is. An example with k = 3 in Table 1 illustrates the formulation of the transition matrix.

Corresponding to the combinations of the given k items defined in (11), we define their counts in the original database D and the distorted database D′, respectively, in (14), where C is the count vector for D while C′ is the count vector for D′, and c_i and c′_i are the counts of the combination S_i of the k items in D and D′, respectively:

C = (c_0, c_1, ..., c_{2^k − 1})^T,  C′ = (c′_0, c′_1, ..., c′_{2^k − 1})^T. (14)

Then the relationship between C and C′ can be expressed by (15) according to the randomized response distortion mechanism:

C′ = RC. (15)


Table 1: An example of the transition matrix (k = 3).

       000                       ⋯   110                       111
000    p_1·p_2·p_3               ⋯   (1−p_1)(1−p_2)·p_3        (1−p_1)(1−p_2)(1−p_3)
001    p_1·p_2·(1−p_3)           ⋯   (1−p_1)(1−p_2)(1−p_3)     (1−p_1)(1−p_2)·p_3
⋮
111    (1−p_1)(1−p_2)(1−p_3)     ⋯   p_1·p_2·(1−p_3)           p_1·p_2·p_3

After getting the transition matrix R, with elements computed by (13), we can reconstruct the counts of the combinations by (16). Hence, we can evaluate the performance of the reconstruction by computing the difference between C and the reconstructed Ĉ:

Ĉ = R⁻¹ C′. (16)

The count of the last combination divided by the number of transactions is the estimated support of the corresponding itemset, as shown in (17), where m is the number of transactions in D:

ŝ = ĉ_{2^k − 1} / m. (17)
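The following sketch (assuming numpy) illustrates the reconstruction pipeline of (13)-(17): it builds the 2^k × 2^k transition matrix R from the per-item retention probabilities, as in Table 1, inverts it to recover the combination counts per (16), and returns the estimated support per (17). The example counts and parameters are made up; the bit ordering only has to be consistent between the parameter vector and the count vector.

```python
import numpy as np

def transition_matrix(P):
    """R[j, i] = probability that the original combination S_i is distorted
    into S_j, built item by item as in (13) / Table 1."""
    k = len(P)
    R = np.ones((2 ** k, 2 ** k))
    for j in range(2 ** k):              # distorted combination S_j
        for i in range(2 ** k):          # original combination S_i
            for bit in range(k):
                kept = ((i >> bit) & 1) == ((j >> bit) & 1)
                R[j, i] *= P[bit] if kept else 1 - P[bit]
    return R

def estimate_support(C_prime, P, m):
    """Reconstruct C = R^{-1} C' as in (16) and return the support estimate (17)."""
    C_hat = np.linalg.inv(transition_matrix(P)) @ C_prime
    return C_hat[-1] / m                 # count of the all-ones combination / #transactions

# Illustrative 2-item example: counts of the combinations 00, 01, 10, 11 observed in D'.
C_prime = np.array([500.0, 120.0, 150.0, 230.0])
print(estimate_support(C_prime, P=[0.9, 0.85], m=C_prime.sum()))
```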

4.3. Mining Frequent Itemsets from the Distorted Dataset. We mine the frequent itemsets from the database by using the classic Apriori algorithm, which employs an iterative approach to search for the frequent itemsets. The frequent k-itemsets are used to generate the candidate (k + 1)-itemsets by the AprioriGen procedure. Then Apriori computes the support of the candidate (k + 1)-itemsets and filters out the itemsets with support less than the minimum support. These operations run iteratively until no more frequent itemsets can be found.

In personalized privacy-preserving frequent itemset mining, we need to reconstruct the support of candidate itemsets from the distorted database. For a candidate itemset, we compute the counts of all its item combinations. In order to accelerate the counting, we first remove the items with support less than the minimum support. Then, for the candidate itemset X = {I_1, ..., I_k}, we extract the subdataset D(X) = D(I_1, ..., I_k) from the database D′. In the subdataset D(X), we map each transaction t ∈ D(X) into a value by

v = t · b = (t_1, t_2, ..., t_k) · (2^{k−1}, 2^{k−2}, ..., 2^0). (18)

By (18), a vector V is computed, wherein v_i = D(X)_i · b. Then c′_j in (14) is the count of elements in V equal to j. By this method, it is very fast to compute the vector C′. Figure 2 gives an example of transaction mapping that maps each transaction t into a value. By computing the count of each value from 0 to 7, we can get the counts of all the combinations for the itemset ACD. By using (16) and (17), we can get the support of the itemset X.
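A sketch of the counting acceleration in (18), assuming numpy and a 0/1 matrix representation of the distorted database D′: each transaction projected onto the candidate itemset is mapped to an integer value, and the combination counts C′ are obtained with a single bincount. The database and column indices are illustrative.

```python
import numpy as np

def combination_counts(D_prime, itemset_cols):
    """Count the 2^k item combinations of a candidate itemset in the distorted
    database by mapping each projected transaction to an integer, as in (18)."""
    sub = D_prime[:, itemset_cols]                  # the subdataset D(X)
    k = len(itemset_cols)
    b = 2 ** np.arange(k - 1, -1, -1)               # (2^{k-1}, ..., 2^0)
    values = sub @ b                                # v = t . b for every transaction
    return np.bincount(values, minlength=2 ** k)    # C', indexed by combination value

# Illustrative distorted 0/1 database with 5 items; candidate itemset = columns 0, 2, 3.
D_prime = np.array([[1, 0, 1, 1, 0],
                    [0, 1, 0, 1, 1],
                    [1, 1, 1, 0, 0]])
print(combination_counts(D_prime, [0, 2, 3]))       # counts of combinations 000 ... 111
```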

In our personalized privacy-preserving frequent itemset mining, we iteratively generate the candidate (k + 1)-itemsets from the frequent k-itemsets and check the support of each candidate itemset using the itemset support reconstruction.
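A simplified sketch of this mining loop: candidate (k + 1)-itemsets are generated from the frequent k-itemsets and kept only if their reconstructed support reaches the minimum support. The function reconstructed_support stands in for the combination counting and matrix inversion sketched above and is passed in as an abstract oracle; the join step is a simplified stand-in for AprioriGen.

```python
from itertools import combinations

def mine_frequent_from_distorted(items, reconstructed_support, min_sup):
    """Apriori-style loop over the distorted data: candidates are judged by
    their reconstructed support, not by counting the distorted bits directly."""
    frequent = [frozenset([i]) for i in items
                if reconstructed_support(frozenset([i])) >= min_sup]
    all_frequent = list(frequent)
    k = 1
    while frequent:
        # Simplified AprioriGen: unions of two frequent k-itemsets of size k + 1.
        candidates = {a | b for a, b in combinations(frequent, 2) if len(a | b) == k + 1}
        frequent = [c for c in candidates if reconstructed_support(c) >= min_sup]
        all_frequent.extend(frequent)
        k += 1
    return all_frequent

# Tiny usage example with a stub support oracle (numbers are made up).
sup = {frozenset("A"): 0.4, frozenset("B"): 0.5, frozenset("C"): 0.1, frozenset("AB"): 0.3}
print(mine_frequent_from_distorted("ABC", lambda x: sup.get(x, 0.0), min_sup=0.2))
```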

[Figure 2: An example of transaction mapping. Each transaction, restricted to the items of the candidate itemset ACD, is read as a binary string (000-111) and mapped by t · b to a value between 0 and 7.]

5. Experimental Results

In this section, we evaluate the performance of personalized privacy-preserving frequent itemset mining on a real dataset and on synthetic datasets generated by the classic IBM Quest Market-Basket Synthetic Data Generator (the C++ source code can be downloaded at http://www.cs.loyola.edu/~cgiannel/assoc_gen.html).

5.1. Datasets. The synthetic datasets are generated by the IBM Almaden generator with the parameters T, I, D, and N [1], where T is the average size of the transactions, I is the average size of the maximal potentially large itemsets, D is the number of transactions, and N is the number of items. We generated two synthetic datasets, T3I4D500KN10 and T40I10D100KN942.

The real dataset is BMS-WebView-1 [21], which contains the click-stream data from the website of a legwear and legcare retailer. The dataset contains 59,602 transactions and 497 items. We scaled the dataset by a factor of 10 and obtained the dataset BMS-WebView-1x10, which contains 596,020 transactions and 497 items.

5.2. Evaluation Metrics. We evaluate our method on the distorted database by two kinds of error metrics [7]: the support error and the identity errors. Let F be the frequent itemsets discovered from the original database D by the Apriori algorithm, and let F̂ represent the reconstructed frequent itemsets mined from the distorted database by our method.

The support error metric is evaluated by (19), which reflects the relative error in the support values. We measure the average error over the frequent itemsets which are correctly identified to be frequent, that is, F ∩ F̂, for the given minimum support s_min:

ρ = (1 / |F ∩ F̂|) · Σ_{f ∈ F ∩ F̂} |s_f − ŝ_f| / s_f. (19)

The identity error metrics are given in (20): σ+ indicates the percentage of false positives, that is, the percentage of reconstructed itemsets which do not exist in the original frequent itemsets; σ− indicates the percentage of false negatives, that is, the percentage of original frequent itemsets which are not correctly reconstructed as frequent:

σ+ = (|F̂| − |F ∩ F̂|) / |F|,    σ− = (|F| − |F ∩ F̂|) / |F|. (20)

The corresponding metrics are illustrated in Figure 3. For the metric ρ, the intersection of the two frequent itemset collections F and F̂ may be empty; in this case, we do not compute ρ.
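A sketch of the three metrics in (19) and (20), representing F and F̂ as dictionaries that map each frequent itemset to its support; the normalization by |F| follows the reconstruction of (20) above, and the example values are made up.

```python
def error_metrics(F, F_hat):
    """F and F_hat map each frequent itemset to its support
    (original vs. reconstructed); returns (rho, sigma_plus, sigma_minus)."""
    common = set(F) & set(F_hat)
    rho = (sum(abs(F[x] - F_hat[x]) / F[x] for x in common) / len(common)
           if common else None)                        # undefined if the intersection is empty
    sigma_plus = (len(F_hat) - len(common)) / len(F)   # false-positive ratio
    sigma_minus = (len(F) - len(common)) / len(F)      # false-negative ratio
    return rho, sigma_plus, sigma_minus

# Made-up example.
F = {frozenset("A"): 0.30, frozenset("B"): 0.25, frozenset("AB"): 0.20}
F_hat = {frozenset("A"): 0.28, frozenset("AB"): 0.22, frozenset("C"): 0.21}
print(error_metrics(F, F_hat))
```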

5.3. Results Analysis. We evaluate our method by measuring the support error and the two identity errors given in formulas (19) and (20), and we conduct our experiments on the three datasets described in Section 5.1. We compare our method with MASK, which distorts the original dataset with only one parameter value p. If the personalized distortion vector of our method is P, then p = min(P). In our experiments, the elements of the personalized distortion vector P in formula (5) are generated following the uniform distribution over the range [0.8, 0.95]. The average value of the vector P is 0.8741. In order to protect all the items or attributes in a dataset, the parameter p in MASK must be set to 0.8. Therefore, we compare our method with MASK using the parameter p = 0.8.

5.3.1. Comparisons with Different Minimum Supports. The first experiment was conducted on T3I4D500KN10. In this experiment, we set the minimum support from 0.05% to 0.95% with a step length of 0.05%. For each minimum support, we ran MASK and our personalized method 100 times and computed the average values and the standard deviations of the evaluation metrics. The results are shown in Figures 4, 5, and 6.

From Figures 4, 5, and 6, we can see that our method has smaller errors than MASK. This is because MASK has to distort all the items or attributes at the maximum distortion level in order to protect every item or attribute, while our method distorts different items with different distortion levels. The results show that our method leads to only half of the support error ρ of MASK on the dataset T3I4D500KN10. Besides, from Figures 5 and 6, we can see that our method has much smaller deviations of the identity errors.

[Figure 3: Evaluation metrics ρ, σ+, and σ− defined on the original frequent itemsets F and the reconstructed frequent itemsets F̂.]

[Figure 4: Support error ρ versus minimum support on T3I4D500KN10 (MASK and Personalized).]

[Figure 5: Frequent itemset added ratio σ+ versus minimum support on T3I4D500KN10 (MASK and Personalized).]


[Figure 6: Frequent itemset lost ratio σ− versus minimum support on T3I4D500KN10 (MASK and Personalized).]

[Figure 7: Support error ρ versus minimum support on BMS-WebView-1x10 (MASK-0.8, MASK-0.8741, MASK-0.95, and Personalized).]

We conducted the same experiments on the real dataset BMS-WebView-1x10. For this dataset, we set the minimum support from 0.2% to 0.55% with a step length of 0.05%, and for each minimum support we ran the algorithms 20 times. As the average value of the elements in the vector P is 0.8741, we also ran MASK with the parameter p = 0.8741. Besides, we ran MASK with p equal to the maximum value of P, that is, 0.95.

We show the results in Figures 7, 8, and 9. The results show that our method is much better than MASK with p = 0.8, while the results of our method are very similar to those of MASK with p = 0.8741, the average value of the vector P of our method. Similar results can be found in Figures 13, 14, and 15 on the dataset T40I10D100KN942. However, MASK with p = 0.8741 cannot protect all the items or attributes, because some attributes need protection with a privacy level smaller than 0.8741, such as 0.85. For the three metrics, the results of MASK with p = 0.95 are better than those of our method; this is because the privacy protection level of each attribute in our method is between 0.8 and 0.95.

[Figure 8: Frequent itemset added ratio σ+ versus minimum support on BMS-WebView-1x10 (MASK-0.8, MASK-0.8741, MASK-0.95, and Personalized).]

[Figure 9: Frequent itemset lost ratio σ− versus minimum support on BMS-WebView-1x10 (MASK-0.8, MASK-0.8741, MASK-0.95, and Personalized).]

5.3.2. The Impact of the Size of the Dataset. We conducted the second experiment to evaluate the impact of the size of the dataset on the reconstruction errors given in formulas (19) and (20). The dataset BMS-WebView-1x10 is formed from the dataset BMS-WebView-1 by copying it 10 times. Similarly, we set the minimum support of the frequent itemset mining from 0.2% to 0.55% with a step length of 0.05% and ran the algorithms 20 times for each minimum support.

The corresponding results in Figures 10, 11, and 12 show that the reconstruction errors on BMS-WebView-1x10 are much smaller than the errors on BMS-WebView-1 for both MASK and our method. Note that the frequent itemsets discovered from BMS-WebView-1 are the same as the frequent itemsets discovered from BMS-WebView-1x10. This suggests that, if we want to improve the reconstruction results, we can copy the original dataset many times, distort the new dataset, and send it to the cooperating party. Then the cooperating party can accurately discover the interesting patterns without knowing the original private information.

[Figure 10: Support error ρ versus minimum support on different sizes of datasets (MASK and Personalized on BMS-WebView-1 and BMS-WebView-1x10).]

[Figure 11: Frequent itemset added ratio σ+ versus minimum support on different sizes of datasets (MASK and Personalized on BMS-WebView-1 and BMS-WebView-1x10).]

[Figure 12: Frequent itemset lost ratio σ− versus minimum support on different sizes of datasets (MASK and Personalized on BMS-WebView-1 and BMS-WebView-1x10).]

5.3.3. Comparisons with Different Lengths of Frequent Itemsets. The third experiment was conducted on the dataset T40I10D100KN942. In this experiment, we set the minimum support to 1.45%, ran the corresponding algorithms 5 times, and took the average values. We also set the parameter p of MASK to 0.8, 0.8741, and 0.95 separately.

[Figure 13: Support error ρ versus itemset length k on dataset T40I10D100KN942 (MASK-0.8, MASK-0.8741, MASK-0.95, and Personalized).]

[Figure 14: Frequent itemset added ratio σ+ versus itemset length k on dataset T40I10D100KN942 (MASK-0.8, MASK-0.8741, MASK-0.95, and Personalized).]

For the minimum support of 1.45%, the maximum length of the original frequent itemsets is 8, and the number of frequent k-itemsets is shown in Table 2. We then compare the results for each itemset length from 1 to 8. The results shown in Figures 13, 14, and 15 demonstrate that our method performs better than MASK with p = 0.8 for each itemset length and performs very similarly to MASK with p = 0.8741.

In Figure 13, since the intersection of the frequent 8-itemsets reconstructed by MASK with p = 0.8 and the original frequent 8-itemsets is an empty set, the support error cannot be computed and thus the value is not shown. We cannot conclude a clear trend of the reconstruction error as the itemset length increases. In Figure 14, when k is 8, the frequent itemset added ratio increases sharply. By carefully analyzing the results, we found that many new frequent itemsets have support very close to the minimum support, which leads to the heavy identity error.


[Figure 15: Frequent itemset lost ratio σ− versus itemset length k on dataset T40I10D100KN942 (MASK-0.8, MASK-0.8741, MASK-0.95, and Personalized).]

Table 2: The number of frequent k-itemsets.

k      1    2     3    4    5    6    7    8
|F|    690  4869  293  368  483  469  331  25

6. Conclusions

In this paper, we solve the problem of how to provide personalized privacy protection on different attributes or items while discovering the frequent itemsets. Based on the classic randomized response technique, we proposed a personalized privacy-preserving method. Besides, we proposed a method to improve the efficiency of counting the frequent itemsets by mapping the frequent itemset vector into a value. Experimental results show that the personalized privacy protection method can achieve much better performance than the traditional privacy protection method, which provides the same privacy protection for all items. From the experimental results, we can also see that we can copy the original dataset many times to create a new dataset and then distort this new dataset; the recipients can then discover the frequent itemsets with smaller error, at the expense of more computation and communication cost.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research work was supported in part by the National Natural Science Foundation of China under Grants no. 61003231, no. 61103109, and no. 11105024 and by the Fundamental Research Funds for the Central Universities under ZYGX2011J057. The authors are grateful to the anonymous reviewers who made constructive comments.

References

[1] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in Proceedings of the 20th International Conference on Very Large Data Bases (VLDB '94), vol. 1215, pp. 487-499, Santiago de Chile, Chile, September 1994.

[2] L. F. Cranor, J. Reagle, and M. S. Ackerman, Beyond Concern: Understanding Net Users' Attitudes about Online Privacy, MIT Press, Cambridge, Mass, USA, 2000.

[3] J. R. Haritsa, "Mining association rules under privacy constraints," in Privacy-Preserving Data Mining, pp. 239-266, Springer, New York, NY, USA, 2008.

[4] X. Xiao and Y. Tao, "Personalized privacy preservation," in Proceedings of the ACM International Conference on Management of Data (SIGMOD '06), pp. 229-240, ACM, Chicago, Ill, USA, June 2006.

[5] X. Xiao, Y. Tao, and M. Chen, "Optimal random perturbation at multiple privacy levels," Proceedings of the VLDB Endowment, vol. 2, no. 1, pp. 814-825, 2009.

[6] A. C. Tamhane, "Randomized response techniques for multiple sensitive attributes," Journal of the American Statistical Association, vol. 76, no. 376, pp. 916-923, 1981.

[7] S. J. Rizvi and J. R. Haritsa, "Maintaining data privacy in association rule mining," in Proceedings of the 28th International Conference on Very Large Data Bases (VLDB '02), pp. 682-693, VLDB Endowment, Hong Kong, August 2002.

[8] V. S. Verykios and A. Gkoulalas-Divanis, "A survey of association rule hiding methods for privacy," in Privacy-Preserving Data Mining, pp. 267-289, Springer, New York, NY, USA, 2008.

[9] Y.-H. Wu, C.-M. Chiang, and A. L. P. Chen, "Hiding sensitive association rules with limited side effects," IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 1, pp. 29-42, 2007.

[10] E. D. Pontikakis, A. A. Tsitsonis, and V. S. Verykios, "An experimental study of distortion-based techniques for association rule hiding," in Research Directions in Data and Applications Security XVIII, vol. 144 of IFIP International Federation for Information Processing, pp. 325-339, Springer, 2004.

[11] E. D. Pontikakis, Y. Theodoridis, A. A. Tsitsonis, L. Chang, and V. S. Verykios, "A quantitative and qualitative analysis of blocking in association rule hiding," in Proceedings of the ACM Workshop on Privacy in the Electronic Society (WPES '04), pp. 29-30, ACM Press, Washington, DC, USA, October 2004.

[12] S.-L. Wang and A. Jafari, "Using unknowns for hiding sensitive predictive association rules," in Proceedings of the IEEE International Conference on Information Reuse and Integration (IRI '05), pp. 223-228, IEEE, Las Vegas, Nev, USA, August 2005.

[13] S. Xingzhi and P. S. Yu, "A border-based approach for hiding sensitive frequent itemsets," in Proceedings of the 5th IEEE International Conference on Data Mining (ICDM '05), pp. 426-433, IEEE, Houston, Tex, USA, November 2005.

[14] G. V. Moustakides and V. S. Verykios, "A max-min approach for hiding frequent itemsets," in Proceedings of the 6th IEEE International Conference on Data Mining Workshops (ICDM '06), pp. 502-506, Hong Kong, December 2006.

[15] S. Menon, S. Sarkar, and S. Mukherjee, "Maximizing accuracy of shared databases when concealing sensitive patterns," Information Systems Research, vol. 16, no. 3, pp. 256-270, 2005.

[16] A. Gkoulalas-Divanis and V. S. Verykios, "An integer programming approach for frequent itemset hiding," in Proceedings of the 15th ACM Conference on Information and Knowledge Management (CIKM '06), pp. 748-757, ACM, Arlington, Va, USA, November 2006.

[17] J. Vaidya and C. Clifton, "Privacy preserving association rule mining in vertically partitioned data," in Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '02), pp. 639-644, ACM, Alberta, Canada, July 2002.

[18] M. Kantarcioglu and C. Clifton, "Privacy-preserving distributed mining of association rules on horizontally partitioned data," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 9, pp. 1026-1037, 2004.

[19] N. Zhang, S. Wang, and W. Zhao, "A new scheme on privacy preserving association rule mining," in Knowledge Discovery in Databases: PKDD 2004, vol. 3202 of Lecture Notes in Computer Science, pp. 484-495, 2004.

[20] S. L. Warner, "Randomized response: a survey technique for eliminating evasive answer bias," Journal of the American Statistical Association, vol. 60, no. 309, pp. 63-66, 1965.

[21] Z. Zheng, R. Kohavi, and L. Mason, "Real world performance of association rule algorithms," in Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '01), pp. 401-406, ACM, San Francisco, Calif, USA, August 2001.


2 The Scientific World Journal

Two privacy environments are proposed in [3] based onthe privacy mechanisms The first one is B2B (business tobusiness) which assures that the data obtained from the userswould be preprocessed by the privacy-preserving techniquesbefore being supplied to the data miner The second one isB2C (business to customer) which provides the privacy pro-tection at the point of data collection that is at the user siteIn order to protect the sensitive information we proposedmethods which can be used under both environments Wefocus on the privacy of input data in the frequent itemsetmining process Under the B2C environment the personaldata are randomized after each user provides them and thensent to the data collector Under the B2B environment thecollected dataset is randomized before outsourcing or sharingwith others even at public

Although many works have been proposed to handle theprivacy problem in the frequent itemset mining process fewof them focused on the personalized privacy problem Xiaoand Tao [4] proposed the personalized privacy preservationproblem considering that different people need differentlevels of privacy protection and they used the generalizationtechniques Besides they also raise the problem of themultipublishing [5] wherein different recipients can havedifferent privacy protection levels on the original dataset

In this paper we define the personalized privacy asdifferent privacy levels on different attributes or items inthe process of frequent itemset mining For example thecustomer does not care too much about the disclosure ofthe information that he bought daily necessities But he willconcern heavily about the privacy problem if he boughtsome sensitive items If we provide the same level of privacyprotection for different items the data utility will be losttoo much in order to satisfy all items privacy requirementsThis is because the nonsensitive items also are disturbed tooheavily Hence it is necessary to provide the different degreeof privacy protection for different items or attributes

To solve the above personalized privacy problem we usethe randomized response technique [6] Based on the MASK[7] we give the solution on how to reconstruct the associ-ations under the different privacy protections Our methodcan be used to preserve the privacy under both the B2B andB2C environments Besides we design amethod to acceleratethe itemset support computation Finally our experimentalresults show that the performance of our method is betterthan the traditional method with respect to the accuracywhile satisfying the different privacy requirements

The rest of this paper is organized as follows In Section 2we review the previous related works Section 3 gives somerelated preliminaries and defines the personalized privacyproblem Then in Section 4 we describe the proposedmethod on the personalized privacy protection and how todiscover the frequent itemset from the randomized dataWe evaluate our technique in Section 5 by experiments andconclude our work in Section 6

2 Related Works

There are many works on the privacy problem in the associ-ation rule mining and they can be divided into two research

streams [3] input data privacy and output rule privacy Mostof them focus on the second one To preserve the input dataprivacy some methods are adopted to disturb the originaldata so that the attacker cannot get the true data of usersFor the output rule privacy the collected data are heuristicallyaltered in order to protect some sensitive rules being minedfrom the dataset by data miners

The association rule hiding algorithms which protect theoutput rule privacy can be divided into three categories [8]namely heuristic approaches border-based approaches andexact approaches The first class of approaches selectivelysanitizes a set of transactions from the database to hidethe sensitive rules efficiently which suffer from high sideeffects Two classic techniques in this class of approaches aredistortion [9 10] and blocking [11 12] The second set ofapproaches [13 14] hide the sensitive rules by modifying theoriginal borders in the lattice of the frequent and infrequentitemsets in the database The third set of approaches [15 16]hide the sensitive rules by solving the Constrain Satisfactionproblem using the integer or linear programming They canfind the optimality to hide the rules with high computationcost

In this paper we solve the problem of the data inputprivacy and the corresponding approaches can be dividedinto cryptograph-based and reconstruction-based methodsThe cryptograph-based approaches handle the problem thatsome partners want to discover shared association rule fromthe global data without disclosing their sensitive data toothers The global data may be vertically partitioned [17]or horizontally partitioned [18] and distributed in manypartners The reconstruction-based methods firstly random-ize the original data to hide the sensitive data and thenreconstruct the interesting patterns based on the statisticalfeatures without knowing true values For the centralizeddata there are two distortion strategies statistical distortionand algebraic distortion Rizvi and Haritsa [7] proposedthe MASK approach to disturb the original sparse Booleandatabases This method retains each 0 or 1 bit in the databasewith the probability as 119901 and flips this value with theprobability as 1 minus 119901 Zhang et al [19] proposed the solutionwith the algebraic distortion By using the eigenvalues andeigenvectors the data of users can be distorted by matrixtransformation and noise addition

Our work focuses on the input data privacy problemof the centralized data The frequent itemset mining isconducted on the distorted data that is randomized by therandomized response technique All the above algorithms didnot mention the personalized privacy problem that we solvein this paper

3 Preliminaries and Problem Definition

In this section we introduce the related preliminaries aboutthe privacy-preserving frequent itemset mining and give theproblem definition of this paper

31 Association Rule Mining Let 119868 be a set of items and 119863

the database containing a set of transactions wherein each

The Scientific World Journal 3

transaction 119879 is a set of items such that 119879 sube 119868 A transaction119879 is said to contain 119883 if and only if 119883 sube 119879 A rule is animplication of the form 119883 rArr 119884 where 119883 119884 sub 119868 and119883 cap 119884 = 120601 The support of the rule 119883 rArr 119884 is defined in(1) and its confidence is defined in (2)

supp (119883 997904rArr 119884) = 119875 (119883 cup 119884) (1)

conf (119883 997904rArr 119884) = 119875 (119884 | 119883) =supp (119883 cup 119884)

supp (119883) (2)

where 119875(119883 cup 119884) is the percentage of transactions in 119863 thatcontain 119883 cup 119884 and supp(119883) is defined as the percentage oftransactions containing119883

A set of items 119883 is said to be frequent if and only ifsupp (119883) is greater than the user-defined minimum supportThen 119883 is a frequent itemset If the rule 119883 rArr 119884 hasthe support greater than the minimum support and itsconfidence is greater than the minimum confidence then119883 rArr 119884 is an interesting rule that is association rule From(1) and (2) it is shown that finding the association rules iseffectively equivalent to generating all the frequent itemsetswith support greater than the minimum support Thereforewe focus on the frequent itemset mining

32 Randomized Response Randomized response techniquewas first introduced by Warner [20] to solve the statisticalsurvey problem of the sensitive questions For examplesocial scientists want to know how many people in somearea use drugs Usually respondents are reluctant to directlyanswer this kind of questions Hence Warner proposed therandomized response

To ask a binary-choice question about whether people have a sensitive attribute $A$, the randomized response gives two related statements like the following:

(1) I have the sensitive attribute $A$.
(2) I do not have the sensitive attribute $A$.

Respondents decide which statement to answer with a probability $p$. The interviewer only gets a "yes" or "no" without knowing which statement the respondent answered. If a respondent has the sensitive attribute $A$, then he gives the answer "yes" to the first statement with probability $p$ or "no" with probability $1 - p$. If a person does not have $A$, then he gives "no" with probability $p$ to answer the first statement and "yes" with probability $1 - p$ to answer the second statement. Hence the probability that an interviewer gets the answer "yes" is given by (3), while that of getting "no" is given by (4):

$$P(\text{ans} = \text{yes}) = P(A = \text{yes}) \cdot p + P(A = \text{no}) \cdot (1 - p), \quad (3)$$

$$P(\text{ans} = \text{no}) = P(A = \text{no}) \cdot p + P(A = \text{yes}) \cdot (1 - p), \quad (4)$$

where $P(A = \text{yes})$ is the proportion of respondents that have the attribute $A$ and $P(A = \text{no})$ is the proportion of respondents that do not. The interviewer can recover the proportion of respondents having the attribute $A$, $P(A = \text{yes})$, by solving (3) and (4).
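Solving (3) for the unknown proportion gives $P(A = \text{yes}) = (P(\text{ans} = \text{yes}) - (1 - p)) / (2p - 1)$ for $p \neq 0.5$. The following is a minimal simulation sketch of this estimator; the true rate, the value of $p$, and the sample size are invented for illustration:

```python
import random

def randomized_answer(has_attribute: bool, p: float) -> bool:
    """Warner's scheme: with probability p answer statement 1 ('I have A'),
    otherwise statement 2 ('I do not have A'); return the spoken yes/no."""
    if random.random() <= p:
        return has_attribute       # truthful answer to statement 1
    return not has_attribute       # truthful answer to statement 2

def estimate_proportion(answers, p):
    """Invert (3): P(A=yes) = (P(ans=yes) - (1 - p)) / (2p - 1), p != 0.5."""
    frac_yes = sum(answers) / len(answers)
    return (frac_yes - (1 - p)) / (2 * p - 1)

random.seed(0)
true_rate, p, n = 0.10, 0.8, 100_000          # made-up illustration values
population = [random.random() < true_rate for _ in range(n)]
answers = [randomized_answer(x, p) for x in population]
print(round(estimate_proportion(answers, p), 3))   # close to 0.10
```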

3.3. Problem Definition. In this paper we solve the data privacy problem with respect to frequent itemset mining under the B2B environment as well as the B2C environment.

Under the B2C model, the interviewer can conduct a survey containing sensitive questions. For example, the questions "whether you are divorced," "whether you have criminal records," or "whether you use drugs" are sensitive. This kind of survey can analyze the associations between factors and indeed reflect sociological problems. In this model, respondents disturb their survey vectors with the given parameters and then send them to the interviewer.

Under the B2B model, the data owner, such as a supermarket manager, disturbs the original data before sharing them with other business partners. Taking market transactions as an example, customers are reluctant to let others know what they bought, especially some sensitive products. By taking such privacy protection measures, the transaction details can be hidden without leaking personal privacy.

Considering the data utility and the different degrees of privacy requirements, the protection on different attributes or items should be different. For example, a customer may not care much about the disclosure that he bought paper, while caring heavily about some sensitive items.

The problem of personalized privacy-preserving frequent itemset mining is: given the original transaction dataset $D$, how to disturb $D$ into $D'$ satisfying the different privacy requirements and mine frequent itemsets from $D'$ so that the frequent itemsets mined from $D'$ are as close as possible to the frequent itemsets mined from $D$.

4. Personalized Privacy-Preserving Frequent Itemset Mining

In this section we present the procedure for distorting the original data with the randomized response technique so as to satisfy the personalized privacy requirements. Then we give the method to reconstruct itemset supports from the distorted data. Finally, we devise a personalized privacy-preserving frequent itemset mining algorithm based on the Apriori algorithm.

4.1. Personalized Data Distortion. Suppose there are $n$ items in the database $D$. Each transaction in $D$ is represented by a binary vector $T$: $T_i = 1$ if this transaction contains item $i$, and $T_i = 0$ otherwise. For each item $i$, $1 \le i \le n$, there is a probability parameter $p_i$ ($0 \le p_i \le 1$) used to disturb the data on item $i$; these parameters form the vector shown in formula (5). The distorted value of this transaction on item $i$ is then expressed in (6), where $r_i$ is a value drawn uniformly at random from the interval $[0, 1]$:

$$P = (p_1, p_2, \ldots, p_n), \quad (5)$$

$$T'_i = \begin{cases} T_i, & r_i \le p_i, \\ 1 - T_i, & \text{otherwise}. \end{cases} \quad (6)$$

From (6) we can see that the $i$th value of transaction $T$ is kept with probability $p_i$ and flipped with probability $1 - p_i$.
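A minimal sketch of the per-item distortion defined by (5) and (6); the item ordering and parameter values below are illustrative assumptions, not taken from the paper:

```python
import random

def distort_transaction(T, P):
    """Apply (6) to a binary transaction vector T, using the per-item
    retention probabilities P = (p_1, ..., p_n) of (5)."""
    distorted = []
    for t_i, p_i in zip(T, P):
        r_i = random.random()                   # uniform draw from [0, 1]
        distorted.append(t_i if r_i <= p_i else 1 - t_i)
    return distorted

# Illustrative parameters: non-sensitive items are kept with high
# probability, a sensitive item is distorted with p close to 0.5.
P = [0.95, 0.90, 0.55]
T = [1, 0, 1]
print(distort_transaction(T, P))
```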

[Figure 1: Reconstruction probability $R(p_i, s_i)$ versus $p_i$ for item supports $s_i = 0.01, 0.1, 0.3, 0.5, 0.8, 0.95$.]

The proportion of $T'_i = 1$ can be calculated by (7) and the proportion of $T'_i = 0$ by (8):

$$P(T'_i = 1) = P(T_i = 1) \cdot p_i + P(T_i = 0) \cdot (1 - p_i), \quad (7)$$

$$P(T'_i = 0) = P(T_i = 0) \cdot p_i + P(T_i = 1) \cdot (1 - p_i). \quad (8)$$

For the personalized privacy requirements, different items are distorted with different probability parameters. For a given parameter $p_i$, the probability of correctly reconstructing the value $T_i = 1$ is calculated by

$$R(p_i, s_i) = P(T'_i = 1 \mid T_i = 1) \cdot P(T_i = 1 \mid T'_i = 1) + P(T'_i = 0 \mid T_i = 1) \cdot P(T_i = 1 \mid T'_i = 0). \quad (9)$$

The first term of (9) corresponds to the original value $T_i = 1$ being distorted into $T'_i = 1$ and then reconstructed from $T'_i = 1$; the second term covers the distorted value $T'_i = 0$. The probability can finally be computed by (10), where $s_i$ is the support of item $i$ in the original database; a similar derivation of (10) can be found in [7]:

$$R(p_i, s_i) = \frac{s_i p_i^2}{s_i p_i + (1 - s_i)(1 - p_i)} + \frac{s_i (1 - p_i)^2}{s_i (1 - p_i) + (1 - s_i) p_i}. \quad (10)$$

The reconstruction probability curves of (10) are shown in Figure 1. We can see that the higher the item support is, the more easily an item having value 1 will be reconstructed. Besides, the curves are symmetric around $p_i = 0.5$: the further $p_i$ is from 0.5, the more easily an item having value 1 will be reconstructed. Therefore, for a given item, different probability parameters lead to different degrees of privacy protection. The parameters for the items are set by the data owner according to the properties of these items. For example, the parameter $p_i$ for the item milk can be very high, even 1.0, while for drugs it should be closer to 0.5.

Under the B2C environment, the interviewer and the respondents cooperate with each other to set the parameter vector $P$; then the respondents disturb their personal data and send them to the interviewer. Under the B2B environment, the data owner disturbs the original data with the parameter vector $P$ and sends the distorted data, together with $P$, to the third parties.
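A direct transcription of (10), which can be used to reproduce curves like those in Figure 1; the support and parameter values below are illustrative:

```python
def reconstruction_probability(p_i: float, s_i: float) -> float:
    """R(p_i, s_i) from (10): probability that a true value T_i = 1 is
    correctly recovered, given retention probability p_i and support s_i."""
    kept = s_i * p_i ** 2 / (s_i * p_i + (1 - s_i) * (1 - p_i))
    flipped = s_i * (1 - p_i) ** 2 / (s_i * (1 - p_i) + (1 - s_i) * p_i)
    return kept + flipped

# A rare item (support 1%) is much harder to reconstruct when p_i is
# close to 0.5 than when p_i is close to 1.
for p in (0.55, 0.8, 0.95):
    print(p, round(reconstruction_probability(p, s_i=0.01), 4))
```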

4.2. Itemset Support Reconstruction. After getting the distorted data $D'$, the data miner reconstructs the support of itemsets to find the frequent ones. In order to reconstruct the support of an itemset, we need the count of every combination of the items in this itemset. For example, in order to compute the support of itemset $ABC$ in the original dataset $D$, that is, $P(A = 1, B = 1, C = 1)$, we compute the counts of the $2^3$ combinations of $ABC$ in $D'$, that is, $c'(A = 0, B = 0, C = 0)$, $c'(A = 0, B = 0, C = 1)$, ..., and $c'(A = 1, B = 1, C = 1)$. This is because the original value $(A = 1, B = 1, C = 1)$ can be distorted into any of these $2^3$ combinations of $ABC$.

Let $S$ be the set of combinations of the $k$ items defined in (11), where $S_j$ is the binary form of the value $j$ as in (12). For example, when $k = 3$ with items $ABC$, $S_0$ is $(A = 0, B = 0, C = 0)$ and $S_3$ is $(A = 0, B = 1, C = 1)$; that is, 011 is the binary form of 3:

$$S = \{S_0, S_1, \ldots, S_{2^k - 1}\}, \quad (11)$$

$$S_i = \mathrm{binary}(i, k). \quad (12)$$

The combination $S_i$ in the original database can be distorted into any combination $S_j$ in $S$. According to probability theory, the probability of $S_i$ being distorted into $S_j$ can be computed by

$$r_{ij} = \prod \left( (S_i \odot S_j) \circ P^{(k)} + (S_i \oplus S_j) \circ (1 - P^{(k)}) \right), \quad (13)$$

where $\circ$ is the Hadamard (element-wise) product of two vectors and $P^{(k)}$ is the vector of distorting probability parameters corresponding to the $k$ items; $\prod(\cdot)$ returns the product of the vector elements. That is, if the value on item $i$ is kept, then $p^{(i)}$ is multiplied into $r_{ij}$; otherwise $1 - p^{(i)}$ is. The example with $k = 3$ in Table 1 illustrates the form of the transition matrix.

Corresponding to the combinations of the given $k$ items defined in (11), we define their counts in the original database $D$ and the distorted database $D'$, respectively, in (14), where $C$ is the count vector for $D$, $C'$ is the count vector for $D'$, and $c_i$ and $c'_i$ are the counts of the combination $S_i$ for the $k$ items in $D$ and $D'$, respectively.

$$C = \begin{bmatrix} c_0 \\ c_1 \\ \vdots \\ c_{2^k - 1} \end{bmatrix}, \qquad C' = \begin{bmatrix} c'_0 \\ c'_1 \\ \vdots \\ c'_{2^k - 1} \end{bmatrix}. \quad (14)$$

The relationship between $C$ and $C'$ can then be expressed by (15) according to the randomized response distortion mechanism:

$$C' = R C. \quad (15)$$


Table 1: An example of the transition matrix ($k = 3$).

        000                   ...   110                   111
000     p1·p2·p3              ...   (1−p1)(1−p2)·p3       (1−p1)(1−p2)(1−p3)
001     p1·p2·(1−p3)          ...   (1−p1)(1−p2)(1−p3)    (1−p1)(1−p2)·p3
...     ...                         ...                   ...
111     (1−p1)(1−p2)(1−p3)    ...   p1·p2·(1−p3)          p1·p2·p3

After obtaining the transition matrix $R$, with elements computed by (13), we can reconstruct the counts of the combinations by (16); the quality of the reconstruction can be evaluated by the difference between $C$ and the estimate $\hat{C}$:

$$\hat{C} = R^{-1} C'. \quad (16)$$

The count of the last combination divided by the number of transactions gives the estimated support of the corresponding itemset, as shown in (17), where $m$ is the number of transactions in $D$:

$$\hat{s} = \frac{\hat{c}_{2^k - 1}}{m}. \quad (17)$$
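A small numpy sketch of the support reconstruction in (13)–(17), assuming $k$ items with retention probabilities $P^{(k)}$; the parameter values and distorted counts are made up for illustration. Entry $R[j, i]$ is the probability that the original combination $S_i$ is observed as $S_j$, consistent with $C' = RC$ in (15):

```python
import numpy as np

def transition_matrix(P_k):
    """Build R as in (13): R[j, i] is the probability that the original
    combination S_i (binary form of i) is distorted into S_j."""
    k = len(P_k)
    n = 2 ** k
    bits = np.array([[(i >> (k - 1 - b)) & 1 for b in range(k)]
                     for i in range(n)])
    R = np.empty((n, n))
    for i in range(n):            # original combination S_i
        for j in range(n):        # distorted combination S_j
            same = bits[i] == bits[j]
            R[j, i] = np.prod(np.where(same, P_k, 1.0 - np.asarray(P_k)))
    return R

def reconstruct_support(C_prime, P_k, m):
    """Estimate the original counts by (16) and the support by (17)."""
    C_hat = np.linalg.inv(transition_matrix(P_k)) @ C_prime
    return C_hat[-1] / m          # count of the all-ones combination over m

# Illustrative example: k = 3 items and invented distorted counts C'.
P_k = [0.9, 0.85, 0.8]
C_prime = np.array([520, 60, 55, 40, 50, 35, 30, 210], dtype=float)
print(round(reconstruct_support(C_prime, P_k, m=C_prime.sum()), 4))
```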

4.3. Mining Frequent Itemsets from the Distorted Dataset. We mine the frequent itemsets from the database using the classic Apriori algorithm, which employs an iterative approach to search for frequent itemsets. The frequent $k$-itemsets are used to generate the candidate $(k + 1)$-itemsets by AprioriGen. Apriori then computes the support of the candidate $(k + 1)$-itemsets and filters out the itemsets whose support is less than the minimum support. These operations run iteratively until no more frequent itemsets can be found.

In the personalized privacy-preserving frequent itemset mining, we need to reconstruct the support of candidate itemsets from the distorted database. For a candidate itemset, we compute the counts of all combinations of its items. In order to accelerate the counting, we first remove the items whose support is less than the minimum support. Then, for a candidate itemset $X = \{I_1, \ldots, I_k\}$, we extract the subdataset $D(X) = D(I_1, \ldots, I_k)$ from the database $D'$. In the subdataset $D(X)$, we map each transaction $t \in D(X)$ into a value by

$$v = t \cdot b = (t_1, t_2, \ldots, t_k) \cdot (2^{k-1}, 2^{k-2}, \ldots, 2^0). \quad (18)$$

By (18), a vector $V$ is computed, wherein $V_i = D(X)_i \cdot b$. Then $c'_j$ in (14) is the count of elements in $V$ equal to $j$. With this method it is very fast to compute the vector $C'$. Figure 2 gives an example of this transaction mapping. By counting the occurrences of each value from 0 to 7, we obtain the counts of all the combinations for the itemset $ACD$; using (16) and (17), we then obtain the support of the itemset $X$.
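A sketch of the mapping and counting step of (18), assuming the distorted sub-dataset $D(X)$ is given as a 0/1 matrix with one column per item of the candidate itemset; the data below are invented for illustration:

```python
import numpy as np

def combination_counts(D_X):
    """Map each row of the k-column sub-dataset D(X) to an integer by (18)
    and return the count vector C' of (14), indexed from 0 to 2^k - 1."""
    D_X = np.asarray(D_X)
    k = D_X.shape[1]
    b = 2 ** np.arange(k - 1, -1, -1)     # b = (2^{k-1}, ..., 2^0)
    values = D_X @ b                      # v = t · b for every transaction
    return np.bincount(values, minlength=2 ** k)

# Illustrative distorted sub-dataset over three items, e.g. A, C, D.
D_X = [[0, 0, 0], [0, 1, 0], [0, 1, 1], [1, 0, 0], [1, 1, 1], [1, 1, 1]]
print(combination_counts(D_X))            # counts of the values 0..7
```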

In our personalized privacy-preserving frequent itemset mining, we iteratively generate the candidate $(k + 1)$-itemsets from the frequent $k$-itemsets and check the support of each candidate using the itemset support reconstruction described above.

[Figure 2: An example of transaction mapping. Each transaction over the items $ACD$, viewed as a binary vector (000, 010, 011, ..., 111), is mapped by $t \cdot b$ to a value in $\{0, 1, \ldots, 7\}$.]

5. Experimental Results

In this section we evaluate the performance of the personalized privacy-preserving frequent itemset mining on a real dataset and on synthetic datasets generated by the classic IBM Quest Market-Basket Synthetic Data Generator (the C++ source code can be downloaded at http://www.cs.loyola.edu/~cgiannel/assoc_gen.html).

5.1. Datasets. The synthetic datasets are generated by the IBM Almaden generator with the parameters T, I, D, and N [1], where T is the average size of the transactions, I is the average size of the maximal potentially large itemsets, D is the number of transactions, and N is the number of items. We generated two synthetic datasets, T3I4D500KN10 and T40I10D100KN942.

The real dataset is BMS-WebView-1 [21], which contains the click-stream data from the website of a legwear and legcare retailer. The dataset contains 59,602 transactions and 497 items. We scaled the dataset by a factor of 10 and obtained the dataset BMS-WebView-1x10, which contains 596,020 transactions and 497 items.

5.2. Evaluation Metrics. We evaluate our method on the distorted database by two kinds of error metrics [7]: the support error and the identity errors. Let $F$ be the set of frequent itemsets discovered from the original database $D$ by the Apriori algorithm and $\hat{F}$ the set of frequent itemsets reconstructed from the distorted database by our method.

The support error metric is defined in (19); it reflects the relative error in the support values. We measure the average error over the frequent itemsets that are correctly identified as frequent, that is, $F \cap \hat{F}$, for the given minimum support $s_{\min}$:

$$\rho = \frac{1}{|F \cap \hat{F}|} \sum_{f \in F \cap \hat{F}} \frac{|s_f - \hat{s}_f|}{s_f}. \quad (19)$$

The identity error metrics are given in (20). $\sigma^+$ indicates the false positive rate, that is, the percentage of reconstructed frequent itemsets that do not exist among the original frequent itemsets; $\sigma^-$ indicates the false negative rate, that is, the percentage of original frequent itemsets that are not correctly reconstructed as frequent:

$$\sigma^+ = \frac{|\hat{F}| - |F \cap \hat{F}|}{|F|}, \qquad \sigma^- = \frac{|F| - |F \cap \hat{F}|}{|F|}. \quad (20)$$

The metrics are illustrated in Figure 3. For the metric $\rho$, the intersection $F \cap \hat{F}$ of the two sets of frequent itemsets may be empty; in this case we do not compute $\rho$.
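A sketch of the three metrics in (19) and (20), assuming each frequent itemset is represented as a frozenset and the supports are kept in dictionaries; the itemsets and support values below are illustrative:

```python
def evaluation_metrics(true_supp, rec_supp):
    """Compute rho, sigma+ and sigma- as in (19) and (20).

    true_supp: dict mapping each original frequent itemset f to s_f,
    rec_supp:  dict mapping each reconstructed frequent itemset f to its
               estimated support.
    """
    F, F_hat = set(true_supp), set(rec_supp)
    common = F & F_hat
    rho = (sum(abs(true_supp[f] - rec_supp[f]) / true_supp[f] for f in common)
           / len(common)) if common else None    # rho undefined if empty
    sigma_plus = (len(F_hat) - len(common)) / len(F)
    sigma_minus = (len(F) - len(common)) / len(F)
    return rho, sigma_plus, sigma_minus

# Illustrative itemsets and supports.
true_supp = {frozenset("A"): 0.30, frozenset("B"): 0.25, frozenset("AB"): 0.12}
rec_supp = {frozenset("A"): 0.28, frozenset("B"): 0.26, frozenset("C"): 0.11}
print(evaluation_metrics(true_supp, rec_supp))
```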

5.3. Results Analysis. We evaluate our method by measuring the support error and the two identity errors defined in formulas (19) and (20), and we conduct our experiments on the three datasets described in Section 5.1. We compare our method with MASK, which distorts the original dataset with a single parameter value $p$; if the personalized distortion vector of our method is $P$, then $p = \min(P)$. In our experiments, the elements of the personalized distortion vector $P$ in formula (5) are generated following a uniform distribution over the range $[0.8, 0.95]$; the average value of vector $P$ is 0.8741. In order to protect all the items or attributes in a dataset, the parameter $p$ in MASK must be set to 0.8. Therefore, we compare our method with MASK using the parameter $p$ as 0.8.

5.3.1. Comparisons with Different Minimum Supports. The first experiment was conducted on T3I4D500KN10. In this experiment we set the minimum support from 0.05% to 0.95% with a step length of 0.05%. For each minimum support we ran MASK and our personalized method 100 times and computed the average values and the standard deviations of the evaluation metrics. The results are shown in Figures 4, 5, and 6.

From Figures 4, 5, and 6 we can see that our method has smaller errors than MASK. This is because MASK has to distort all the items or attributes at the maximum distortion level in order to protect each item or attribute, while our method distorts different items at different distortion levels. The results show that our method leads to only about half of the support error $\rho$ of MASK on the dataset T3I4D500KN10. Besides, from Figures 5 and 6 we can see that our method has much smaller deviations of the identity errors.

[Figure 3: Evaluation metrics: the sets $F$ and $\hat{F}$ and the metrics $\rho$, $\sigma^+$, and $\sigma^-$.]

[Figure 4: Support error $\rho$ versus minimum support on T3I4D500KN10 (MASK versus Personalized).]

[Figure 5: Frequent itemset added ratio $\sigma^+$ versus minimum support on T3I4D500KN10 (MASK versus Personalized).]


[Figure 6: Frequent itemset lost ratio $\sigma^-$ versus minimum support on T3I4D500KN10 (MASK versus Personalized).]

[Figure 7: Support error $\rho$ versus minimum support on BMS-WebView-1x10 (MASK-0.8, MASK-0.8741, MASK-0.95, Personalized).]

We conducted the same experiments on the real dataset BMS-WebView-1x10. For this dataset we set the minimum support from 0.2% to 0.55% with a step length of 0.05%, and for each minimum support we ran the algorithms 20 times. As the average value of the elements in vector $P$ is 0.8741, we also ran MASK with parameter $p$ as 0.8741. Besides, we ran MASK with $p$ as the maximum value of $P$, that is, 0.95.

We show the results in Figures 7, 8, and 9; they show that our method is much better than MASK with $p$ as 0.8. However, the results of our method are quite similar to those of MASK with $p$ as 0.8741, the average value of vector $P$ used by our method. Similar results can be found in Figures 13, 14, and 15 on the dataset T40I10D100KN942. But MASK with $p$ as 0.8741 cannot protect all the items or attributes, because some attributes need protection with a privacy level smaller than 0.8741, such as 0.85. For all three metrics, the results of MASK with $p$ as 0.95 are better than those of our method; this is because the privacy protection level of each attribute in our method lies between 0.8 and 0.95.

5.3.2. The Impact of the Size of the Dataset. We conduct the second experiment to evaluate the impact of the size of the dataset on the reconstruction errors defined in formulas (19) and (20). The dataset BMS-WebView-1x10 is formed from the dataset BMS-WebView-1 by copying it 10 times. Similarly, we set the minimum support of frequent itemset mining from 0.2% to 0.55% with a step length of 0.05% and run the algorithms 20 times for each minimum support.

[Figure 8: Frequent itemset added ratio $\sigma^+$ versus minimum support on BMS-WebView-1x10 (MASK-0.8, MASK-0.8741, MASK-0.95, Personalized).]

[Figure 9: Frequent itemset lost ratio $\sigma^-$ versus minimum support on BMS-WebView-1x10 (MASK-0.8, MASK-0.8741, MASK-0.95, Personalized).]

The corresponding results in Figures 10, 11, and 12 show that the reconstruction errors on BMS-WebView-1x10 are much smaller than the errors on BMS-WebView-1, for both MASK and our method. Note that the frequent itemsets discovered from BMS-WebView-1 are the same as the frequent itemsets discovered from BMS-WebView-1x10. This suggests that, to improve the reconstruction results, we can copy the original dataset many times, distort the enlarged dataset, and send it to the cooperating party; the cooperating party can then discover the interesting patterns accurately without knowing the original private information.

5.3.3. Comparisons with Different Lengths of Frequent Itemsets. The third experiment is conducted on the dataset T40I10D100KN942.

[Figure 10: Support error $\rho$ versus minimum support on different sizes of datasets (MASK and Personalized on BMS-WebView-1 and BMS-WebView-1x10).]

[Figure 11: Frequent itemset added ratio $\sigma^+$ versus minimum support on different sizes of datasets (MASK and Personalized on BMS-WebView-1 and BMS-WebView-1x10).]

[Figure 12: Frequent itemset lost ratio $\sigma^-$ versus minimum support on different sizes of datasets (MASK and Personalized on BMS-WebView-1 and BMS-WebView-1x10).]

[Figure 13: Support error $\rho$ versus itemset length $k$ on dataset T40I10D100KN942 (MASK-0.8, MASK-0.8741, MASK-0.95, Personalized).]

[Figure 14: Frequent itemset added ratio $\sigma^+$ versus itemset length $k$ on dataset T40I10D100KN942 (MASK-0.8, MASK-0.8741, MASK-0.95, Personalized).]

In this experiment we set the minimum support to 1.45%, run the corresponding algorithms 5 times, and report the average values. We again set the parameter $p$ of MASK to 0.8, 0.8741, and 0.95 separately.

For the minimum support of 1.45%, the maximum length of the original frequent itemsets is 8, and the number of frequent $k$-itemsets is shown in Table 2. We then compare the results for each itemset length from 1 to 8. The results shown in Figures 13, 14, and 15 demonstrate that our method performs better than MASK with $p$ as 0.8 for every itemset length and performs very similarly to MASK with $p$ as 0.8741.

In Figure 13, as the intersection of the frequent 8-itemsets reconstructed by MASK with $p$ as 0.8 and the original frequent 8-itemsets is empty, the support error cannot be computed and the corresponding value is not shown. We cannot identify a clear trend of the reconstruction error as the itemset length increases. In Figure 14, when $k$ is 8, the frequent itemset added ratio increases sharply. A careful analysis of the results shows that many newly reported frequent itemsets have support very close to the minimum support, which leads to the large identity error.


[Figure 15: Frequent itemset lost ratio $\sigma^-$ versus itemset length $k$ on dataset T40I10D100KN942 (MASK-0.8, MASK-0.8741, MASK-0.95, Personalized).]

Table 2: The number of frequent $k$-itemsets.

k      1    2     3    4    5    6    7    8
|F|    690  4869  293  368  483  469  331  25

6. Conclusions

In this paper we solve the problem of how to provide personalized privacy protection on different attributes or items while discovering frequent itemsets. Based on the classic randomized response technique, we proposed a personalized privacy-preserving method. Besides, we proposed a method to improve the efficiency of counting the itemset combinations by mapping the transaction vector of a candidate itemset into a single value. Experimental results show that the personalized privacy protection method performs much better than the traditional privacy protection method, which provides the same privacy level for all items. The experimental results also show that we can copy the original dataset many times to create a new dataset and then distort this enlarged dataset; the other party can then discover the frequent itemsets with smaller error, at the expense of more computation and communication cost.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research work was supported in part by the National Natural Science Foundation of China under Grants no. 61003231, no. 61103109, and no. 11105024 and by the Fundamental Research Funds for the Central Universities under ZYGX2011J057. The authors are grateful to the anonymous reviewers who made constructive comments.

References

[1] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in Proceedings of the 20th International Conference on Very Large Data Bases (VLDB '94), vol. 1215, pp. 487–499, Santiago de Chile, Chile, September 1994.

[2] L. F. Cranor, J. Reagle, and M. S. Ackerman, Beyond Concern: Understanding Net Users' Attitudes about Online Privacy, MIT Press, Cambridge, Mass, USA, 2000.

[3] J. R. Haritsa, "Mining association rules under privacy constraints," in Privacy-Preserving Data Mining, pp. 239–266, Springer, New York, NY, USA, 2008.

[4] X. Xiao and Y. Tao, "Personalized privacy preservation," in Proceedings of the ACM International Conference on Management of Data (SIGMOD '06), pp. 229–240, ACM, Chicago, Ill, USA, June 2006.

[5] X. Xiao, Y. Tao, and M. Chen, "Optimal random perturbation at multiple privacy levels," Proceedings of the VLDB Endowment, vol. 2, no. 1, pp. 814–825, 2009.

[6] A. C. Tamhane, "Randomized response techniques for multiple sensitive attributes," Journal of the American Statistical Association, vol. 76, no. 376, pp. 916–923, 1981.

[7] S. J. Rizvi and J. R. Haritsa, "Maintaining data privacy in association rule mining," in Proceedings of the 28th International Conference on Very Large Data Bases (VLDB '02), pp. 682–693, VLDB Endowment, Hong Kong, August 2002.

[8] V. S. Verykios and A. Gkoulalas-Divanis, "A survey of association rule hiding methods for privacy," in Privacy-Preserving Data Mining, pp. 267–289, Springer, New York, NY, USA, 2008.

[9] Y.-H. Wu, C.-M. Chiang, and A. L. P. Chen, "Hiding sensitive association rules with limited side effects," IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 1, pp. 29–42, 2007.

[10] E. D. Pontikakis, A. A. Tsitsonis, and V. S. Verykios, "An experimental study of distortion-based techniques for association rule hiding," in Research Directions in Data and Applications Security XVIII, vol. 144 of IFIP International Federation for Information Processing, pp. 325–339, Springer, 2004.

[11] E. D. Pontikakis, Y. Theodoridis, A. A. Tsitsonis, L. Chang, and V. S. Verykios, "A quantitative and qualitative analysis of blocking in association rule hiding," in Proceedings of the ACM Workshop on Privacy in the Electronic Society (WPES '04), pp. 29–30, ACM Press, Washington, DC, USA, October 2004.

[12] S.-L. Wang and A. Jafari, "Using unknowns for hiding sensitive predictive association rules," in Proceedings of the IEEE International Conference on Information Reuse and Integration (IRI '05), pp. 223–228, IEEE, Las Vegas, Nev, USA, August 2005.

[13] S. Xingzhi and P. S. Yu, "A border-based approach for hiding sensitive frequent itemsets," in Proceedings of the 5th IEEE International Conference on Data Mining (ICDM '05), pp. 426–433, IEEE, Houston, Tex, USA, November 2005.

[14] G. V. Moustakides and V. S. Verykios, "A max-min approach for hiding frequent itemsets," in Proceedings of the 6th IEEE International Conference on Data Mining Workshops (ICDM '06), pp. 502–506, Hong Kong, December 2006.

[15] S. Menon, S. Sarkar, and S. Mukherjee, "Maximizing accuracy of shared databases when concealing sensitive patterns," Information Systems Research, vol. 16, no. 3, pp. 256–270, 2005.

[16] A. Gkoulalas-Divanis and V. S. Verykios, "An integer programming approach for frequent itemset hiding," in Proceedings of the 15th ACM Conference on Information and Knowledge Management (CIKM '06), pp. 748–757, ACM, Arlington, Va, USA, November 2006.

[17] J. Vaidya and C. Clifton, "Privacy preserving association rule mining in vertically partitioned data," in Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '02), pp. 639–644, ACM, Alberta, Canada, July 2002.

[18] M. Kantarcioglu and C. Clifton, "Privacy-preserving distributed mining of association rules on horizontally partitioned data," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 9, pp. 1026–1037, 2004.

[19] N. Zhang, S. Wang, and W. Zhao, "A new scheme on privacy preserving association rule mining," in Knowledge Discovery in Databases: PKDD 2004, vol. 3202 of Lecture Notes in Computer Science, pp. 484–495, 2004.

[20] S. L. Warner, "Randomized response: a survey technique for eliminating evasive answer bias," Journal of the American Statistical Association, vol. 60, no. 309, pp. 63–66, 1965.

[21] Z. Zheng, R. Kohavi, and L. Mason, "Real world performance of association rule algorithms," in Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '01), pp. 401–406, ACM, San Francisco, Calif, USA, August 2001.


The Scientific World Journal 3

transaction 119879 is a set of items such that 119879 sube 119868 A transaction119879 is said to contain 119883 if and only if 119883 sube 119879 A rule is animplication of the form 119883 rArr 119884 where 119883 119884 sub 119868 and119883 cap 119884 = 120601 The support of the rule 119883 rArr 119884 is defined in(1) and its confidence is defined in (2)

supp (119883 997904rArr 119884) = 119875 (119883 cup 119884) (1)

conf (119883 997904rArr 119884) = 119875 (119884 | 119883) =supp (119883 cup 119884)

supp (119883) (2)

where 119875(119883 cup 119884) is the percentage of transactions in 119863 thatcontain 119883 cup 119884 and supp(119883) is defined as the percentage oftransactions containing119883

A set of items 119883 is said to be frequent if and only ifsupp (119883) is greater than the user-defined minimum supportThen 119883 is a frequent itemset If the rule 119883 rArr 119884 hasthe support greater than the minimum support and itsconfidence is greater than the minimum confidence then119883 rArr 119884 is an interesting rule that is association rule From(1) and (2) it is shown that finding the association rules iseffectively equivalent to generating all the frequent itemsetswith support greater than the minimum support Thereforewe focus on the frequent itemset mining

32 Randomized Response Randomized response techniquewas first introduced by Warner [20] to solve the statisticalsurvey problem of the sensitive questions For examplesocial scientists want to know how many people in somearea use drugs Usually respondents are reluctant to directlyanswer this kind of questions Hence Warner proposed therandomized response

To ask a binary choice question about whether peoplehave a sensitive attribute 119860 the randomized response givestwo related questions like the following

(1) I have the sensitive attribute 119860(2) I do not have the sensitive attribute 119860

Respondents decide which question to answer with aprobability 119901 Then the interviewer only gets a ldquoyesrdquo or ldquonordquowithout knowing which question the respondent answeredIf a respondent has the sensitive attribute119860 then he will givethe answer ldquoyesrdquo to answer the first question with probabilityas119901 or ldquonordquo with probability as 1minus119901 If a person does not have119860 then he will give ldquonordquo with probability as 119901 to answer thefirst question and ldquoyesrdquo with probability as 1minus119901 to answer thesecond question Hence the probability that an interviewergets the answer ldquoyesrdquo can be computed by (3) while gettingldquonordquo can be computed by (4)

119875 (ans = yes) = 119875 (119860 = yes) sdot 119901 + 119875 (119860 = no) sdot (1 minus 119901) (3)

119875 (ans = no) = 119875 (119860 = no) sdot 119901 + 119875 (119860 = yes) sdot (1 minus 119901) (4)

where 119875 (119860 = yes) is the proportion of the respondents thathave the attribute 119860 while 119875 (119860 = no) is the proportion ofrespondents that do not have the attribute119860 The interviewercan get the proportion of respondents having the attribute119860119875 (119860 = yes) by solving (3) and (4)

33 Problem Definition In this paper we solve the dataprivacy problem with respect to the frequent itemset miningunder the B2B environments as well as the B2C environment

Under the B2C model the interviewer can conducta survey containing sensitive questions For example thequestions on ldquowhether you are divorcedrdquo ldquowhether you havecriminal recordsrdquo or ldquowhether you use drugsrdquo are sensitiveThis kind of surveys can analyze the association betweenfactors and indeed reflect sociology problems In this modelrespondents disturb survey vectors with the given parametersand then send them to the reviewers

Under the B2B model the data owner such as supermar-ket managers disturbs the original data before sharing withother business partners Taking the market transactions as anexample customers are reluctant to let others know what heor she bought especially some sensitive products By takingthe privacy protection measures transaction details can behidden without leaking the personal privacy

Considering the data utility and different degree of pri-vacy requirements the protection on the different attributesor items should be different For example the customer doesnot care toomuch about the disclosure that he bought paperswhile concerning heavily on some sensitive items

The problem of personalized privacy-preserving frequentitemset mining is given the original transaction dataset 119863how to disturb 119863 into 119863

1015840 satisfying the different privacyrequirements and mine frequent itemsets from 119863

1015840 so thatthe frequent itemsets mined from119863 are close to the frequentitemsets mined from119863

1015840 as much as possible

4 Personalized Privacy-Preserving FrequentItemset Mining

In this section we present the procedure on how to distortthe original data by the randomized response techniquesatisfying the personalized privacy requirement Then wegive the method to reconstruct itemset support from thedistorted data Finally we devise a personalized privacy-preserving frequent itemset mining algorithm based on theApriori algorithm

41 Personalized Data Distortion Suppose there are 119899 itemsin the database 119863 Each transaction in 119863 is represented by abinary vector 119879 119879

119894= 1 if this transaction contains the item

119894 otherwise 119879119894= 0 For each item 119894 1 le 119894 le 119899 there is a

probability parameter 119901119894(0 le 119901

119894le 1) to disturb the data

on the item 119894 and these parameters form a vector shown informula (5) Then the distorted value of this transaction onitem 119894 can be expressed in (6) where 119903

119894is a value randomly

drawn from a uniform distribution over the interval [0 1]

119875 = (1199011 1199012 119901

119899) (5)

1198791015840

119894=

119879119894 119903

119894le 119901119894

1 minus 119879119894 otherwise

(6)

From (6) we can see that the 119894th value of transaction 119879 iskept with probability 119901

119894and flipped with probability 1 minus 119901

119894

4 The Scientific World Journal

s = 001si = 01

si = 03

si = 05

si = 08

si = 095

R(p

is i)

1

1

08

08

06

06

04

04

02

0200

pi

Figure 1 Reconstruction probability

The proportion of 1198791015840119894= 1 can be calculated by (7) and the

proportion of 1198791015840119894= 0 can be calculated by (8)

119875 (1198791015840

119894= 1) = 119875 (119879

119894= 1) sdot 119901

119894+ 119875 (119879

119894= 0) sdot (1 minus 119901

119894) (7)

119875 (1198791015840

119894= 0) = 119875 (119879

119894= 0) sdot 119901

119894+ 119875 (119879

119894= 1) sdot (1 minus 119901

119894) (8)

For the personalized privacy requirements differentitems are distorted with different probability parameters Fora given parameter119901

119894 the probability of correct reconstruction

on the value 119879119894= 1 is calculated by

119877 (119901119894 119904119894) = 119875 (119879

1015840

119894= 1 | 119879

119894= 1) sdot 119875 (119879

119894= 1 | 119879

1015840

119894= 1)

+ 119875 (1198791015840

119894= 0 | 119879

119894= 1) sdot 119875 (119879

119894= 1 | 119879

1015840

119894= 0)

(9)

The first part in (9) means that the original value 119879119894= 1

is distorted into value 1198791015840

119894= 1 and then reconstructed from

1198791015840

119894= 1 and the second part is for the distorted value 119879

1015840

119894=

0 Finally the probability can be computed by (10) where119904119894is the support of item 119894 in original database The similar

derivation of (10) can be found in [7]

119877 (119901119894 119904119894) =

1199041198941199012

119894

119904119894119901119894+ (1 minus 119904

119894) (1 minus 119901

119894)+

119904119894(1 minus 119901

119894)2

119904119894(1 minus 119901

119894) + (1 minus 119904

119894) 119901119894

(10)

The reconstruction probability curves of (10) are shownin Figure 1We can see that the higher the item support is theeasier this item having value 1 will be reconstructed Besidesthe curves are symmetric around 119901

119894= 05 The further the

distance between 119901119894and 05 the easier the item having value 1

will be reconstructedTherefore for a given item the differentprobability parameters will lead to the different degree ofprivacy protection The parameters for items will be set bythe data owner according to the properties of these items For

example the parameter 119901119894for the item milk can be very high

even 10 while for the drugs it can be more close to 05Under the B2C environment the interviewer and the

respondents cooperate with each other to set the parametervector 119875 Then respondents disturb their personal dataand send them to the interviewer While under the B2Benvironment the data owner disturbs the original data withparameter vector 119875 and sends the distorted data with 119901 to thethird parties

42 Itemset Support Reconstruction After getting the dis-torted data 1198631015840 the data mining will reconstruct the supportfor itemsets to find the frequent ones In order to reconstructthe support of an itemset we need to get the count of everycombination of the items in this itemset For example inorder to compute the support of itemset 119860119861119862 in originaldataset 119863 that is 119901 (119860 = 1 119861 = 1 119862 = 1) we compute thecount of 23 combinations of 119860119861119862 in119863

1015840 that is 1198881015840(119860 = 0 119861 =

0 119862 = 0) 1198881015840(119860 = 0 119861 = 0 119862 = 1) and 1198881015840(119860 = 1 119861 =

1 119862 = 1)This is because the original value (119860 = 1 119861 = 1 119862 =

1) can be distorted to these 23 combinations of 119860119861119862Let 119878 be the combinations of the 119896 items in (11) where

119878119895is the binary form of value 119895 in (12) For example when

119896 = 3 with items 119860119861119862 1198780is (119860 = 0 119861 = 0 119862 = 0) 119878

3is

(119860 = 0 119861 = 1 119862 = 1) that is 011 is the binary form of 3

119878 = 1198780 1198781 119878

2119896minus1 (11)

119878119894= binary (119894 119896) (12)

The combination 119878119894in the original database can be

distorted into any combination 119878119895in 119878 According to the

probability theory the probability of 119878119894being distorted to 119878

119895

can be computed by

119903119894119895= prod ((119878

119894⊙ 119878119895) ∘ 119875(119896)

+ (119878119894oplus 119878119895) ∘ (1 minus 119875

(119896))) (13)

where ∘ is the Hadamard product of two vectors and 119875(119896) is

the distorting probability parameters corresponding to the 119896items prod() returns the product of the vector elementsThatmeans that if the value on an item 119894 is kept then the 119901

(119894)is

multiplied to 119903119894119895 otherwise 1 minus119901

(119894) An example with 119896 as 3 in

Table 1 illustrates the formulation on the transition matrixCorresponding to the combinations defined in (11) of the

given 119896 items we define their counts in the original database119863 and the distorted database1198631015840 respectively in (14) where119862is the count vector for 119863 while 1198621015840 is the count vector for 1198631015840119888119894and 1198881015840

119894are the counts of the combination 119878

119894for the 119896 items

in119863 and1198631015840 respectively

119862 =

[[[[

[

1198880

1198881

1198882119896minus1

]]]]

]

1198621015840=

[[[[

[

1198881015840

0

1198881015840

1

1198881015840

2119896minus1

]]]]

]

(14)

Then the relationship between 119862 and 1198621015840 can be expressed

by (15) according to the randomized response distortionmechanism

1198621015840= 119877119862 (15)

The Scientific World Journal 5

Table 1 An example of transition matrix

000 sdot sdot sdot 110 111000 119901

111990121199013

sdot sdot sdot (1 minus 1199011) (1 minus 119901

2) 1199013

(1 minus 1199011)(1 minus 119901

2)(1 minus 119901

3)

001 11990111199012(1 minus 119901

3) sdot sdot sdot (1 minus 119901

1)(1 minus 119901

2)(1 minus 119901

3) (1 minus 119901

1) (1 minus 119901

2) 1199013

111 (1 minus 1199011) (1 minus 119901

2) 1199013

sdot sdot sdot 11990111199012(1 minus 119901

3) 119901

111990121199013

After getting the transition matrix 119877 with elementscomputed by (12) we can reconstruct the counts of combi-nations by (16) Hence we evaluate the performance of thereconstruction by computing the difference between 119862 and119862

119862 = 119877minus11198621015840 (16)

The count of the last combination divided by the numberof transactions is the estimated support of the correspondingitemset as shown in (17) where 119898 is the number of transac-tions in119863

119904 =1198882119896minus1

119898 (17)

43 Mining Frequent Itemset from Distorted Dataset Wemine the frequent itemset from the database by using the clas-sic Apriori algorithm which employs an iterative approach tosearch the frequent itemset The frequent 119896-itemsets are usedto generate the candidate (119896 + 1)-itemsets by the AprioriGenThen Apriori computes the support of candidate (119896 + 1)-itemsets and filters out the itemsets with support less than theminimum support These operations iteratively run until nomore frequent itemsets can be found

In the privacy-preserving personalized frequent itemsetmining we need to reconstruct the support of candidateitemsets from the distorted database For a candidate itemsetwe compute the counts of all its items combinations In orderto accelerate the counting speed we firstly remove the itemswith support less than the minimum support Then for thecandidate itemset 119883 = 119868

1 119868

119896 we extract the subdataset

119863(119883) = 119863(1198681 119868

119896) from the database1198631015840 In the subdataset

119863(119883) we map each transaction 119905 isin 119863(119883) into a value by

V = 119905 sdot 119887 = (1199051 1199052 119905

119896) sdot (2119896minus1

2119896minus2

20) (18)

By (18) a vector119881will be computed wherein V119894= 119863(119883)119894sdot

119887 Then 1198881015840

119895in (14) is the count of elements in 119881 equal to

119895 By this method it is very fast to compute the vector 1198621015840Figure 2 gives an example of transaction mapping that mapsthe transaction 119905 into a value By computing the count of eachvalue from0 to 7 we can get the counts of all the combinationsfor the itemset 119860119862119863 By using (16) and (17) we can get thesupport of the itemset119883

In our privacy-preserving personalized frequent itemsetmining we iteratively generate the candidate (119896 + 1)-itemsetfrom the frequent 119896-itemset and check the support of thecandidate itemset using the itemset support reconstruction

ACD ACDt middot b

000

010

000

011

011

middot middot middot

111

100

101

100

111

0

1

2

3

4

5

6

7

000

001

010

011

100

101

110

111

Figure 2 An example of transaction mapping

5 Experimental Results

In this section we evaluate the performance about thepersonalized privacy-preserving frequent itemset miningon the real dataset and synthetic datasets generated bythe classic IBM Quest Market-Basket Synthetic Data Gen-erator (The C++ source code can be downloaded athttpwwwcsloyolaedusimcgiannelassoc genhtml)

51 Datasets The synthetic datasets are generated by theIBM Almaden generator with the parameters TIDN [1]where 119879 is the average size of the transactions 119868 is theaverage size of the maximal potentially large itemsets 119863 isthe number of transactions and 119873 is the number of itemsWe generated two synthetic datasets T3I4D500KN10 andT40I10D100KN942

The real dataset is BMS-WebView-1 [21] which containsthe click-stream data from the website of a legwear and legcare retailer The dataset contains 59602 transactions and497 items We scaled the dataset with a factor of 10 andgot the dataset BMS-WebView-1x10 which contains 596020transactions and 497 items

52 Evaluation Metrics We evaluate our method on thedistorted database by two kinds of error metrics [7] thesupport errors and the identity error Let 119865 be the frequent

6 The Scientific World Journal

itemsets discovered from the original database 119863 by theApriori algorithm and119865 represent the reconstructed frequentitemset minded from the distorted database by our method

The support error metric is evaluated by (19) whichreflects the relative error in the support values We measurethe average error based on the frequent itemsets which arecorrectly identified to be frequent that is119865cap119865 with the givenminimum support 119904min

120588 =1

10038161003816100381610038161003816119865 cap 119865

10038161003816100381610038161003816

sum

119891isin119865cap119865

10038161003816100381610038161003816119904119891minus 119904119891

10038161003816100381610038161003816

119904119891

(19)

The identity error metrics are given in (20) 120590+ indicatesthe percentage of false positive that is the percentage ofreconstructed itemsets which do not exist in the original fre-quent itemsets 120590minus indicates the percentage of false negativethat is the percentage of original frequent itemsets which arenot correctly reconstructed as frequent

120590+=

1003816100381610038161003816100381611986510038161003816100381610038161003816minus10038161003816100381610038161003816119865 cap 119865

10038161003816100381610038161003816

|119865|

120590minus=

|119865| minus10038161003816100381610038161003816119865 cap 119865

10038161003816100381610038161003816

|119865|

(20)

The corresponding metrics are illustrated in Figure 3 Forthe part of metric 120588 the intersection of two frequent itemsets119865 and 119865 is maybe empty Under this condition we do notcompute the result of 120588

53 Results Analysis We evaluate our method by measuringthe support error and the two identity errors shown informulas (19) (20) and we conduct our experiments on thethree given datasets described in Section 51 We compareour method with MASK which distorts the original datasetwith only one parameter value119901 If the personalized distortedvector of our method is 119875 then 119901 = min(119875) In ourexperiments the elements in the personalized distortedvector 119875 in formula (5) are generated following the uniformdistribution with the range of [08 095]The average value ofvector119875 is 08741 In order to protect all the items or attributesin a dataset the parameter 119901 in MASK must be set as 08Therefore we compare our method with the MASK havingthe parameter 119901 as 08

531 ComparisonswithDifferentMinimumSupport Thefirstexperiment was conducted on the T3I4D500KN10 In thisexperiment we set the minimum support from 005 to095 with the step length as 005 For each minimumsupport we ran the MASK and our personalized method100 times and compute the average values and the standarddeviations of evaluation metrics The results are shown inFigures 4 5 and 6

From Figures 4 5 and 6 we can see that our methodcan have smaller error than MASK This is because theMASK has to distort all the items or attributes in themaximum distorting level in order to protect each item

120590minus 120590+

120588

FF

Figure 3 Evaluation metrics

MASKPersonalized

Minimum support0 0001 0002 0003 0004 0005 0006 0007 0008 0009 001

025

02

015

01

005

0

120588

Figure 4 Support error 120588 versus minimum support onT3I4D500KN10

MASKPersonalized

Minimum support0 0001 0002 0003 0004 0005 0006 0007 0008 0009 001

01

008

006

004

002

0

120590+

Figure 5 Frequent itemset added ration 120590+ versus minimum

support on T3I4D500KN10

or attribute while our method distort different items withdifferent distorting level The results show that our methodonly leads to half of support error 120588 of MASK in the datasetT3I4D500KN10 Besides from Figures 5 and 6 we can seethat ourmethod can havemuch smaller deviations of identityerrors

The Scientific World Journal 7

MASKPersonalized

Minimum support0 0001 0002 0003 0004 0005 0006 0007 0008 0009 001

120590minus

016

014

012

01

008

006

004

002

0

Figure 6 Frequent itemset lost ration 120590minus versus minimum support

on T3I4D500KN10

MASK-08MASK-08741

MASK-095Personalized

Minimum support15 2 25 3 35 4 45 5 55 6

times10minus3

120588

014

012

01

008

006

004

002

0

Figure 7 Support error 120588 versus minimum support on BMS-WebView-1x10

We conducted the same experiments on the real datasetBMS-WebView-1x10 For this dataset we set the minimumsupport from02 to 055with the step length as 005 andfor each minimum support we ran the algorithms 20 timesAs the average value of elements in vector 119875 is 08741 we alsoran theMASKwith parameter119901 as 08741 Besides we ran theMASK with 119901 as the maximum value of 119875 that is 095

We show the results in Figures 7 8 and 9 and the resultshere show that ourmethod is much better thanMASKwith 119901

as 08 However the results of our method are much similarto the results of MASK with 119901 as 08741 the average valueof vector 119875 for our method Similar results can be found inFigures 13 14 and 15 on the dataset T40I10D100KN942But the MASK with 119901 as 08741 cannot protect all the itemsor attributes because some attribute needs protection withprivacy level smaller than 08741 such as 085 For the threemetrics the results of MASK with 119901 as 095 are better thanour method This is because the privacy protection level ofeach attribute in our method is between 08 and 095

532 The Impact of the Size of Dataset We conduct thesecond experiment to evaluate the impact of the size of

120590+

MASK-08MASK-08741

MASK-095Personalized

Minimum support15 2 25 3 35 4 45 5 55 6

times10minus3

05

04

03

02

01

0

Figure 8 Frequent itemset added ration 120590+ versus minimum

support on BMS-WebView-1x10

120590minus

016

014

012

01

008

006

004

002

0

MASK-08MASK-08741

MASK-095Personalized

Minimum support15 2 25 3 35 4 45 5 55 6

times10minus3

Figure 9 Frequent itemset lost ration 120590minus versus minimum support

on BMS-WebView-1x10

dataset on the reconstruction errors shown in formulas (19)(20) The dataset BMS-WebView-1x10 is formed from thedataset BMS-WebView-1 by copying it 10 times Similarly weset the minimum support of frequent itemset mining from02 to 055 with the step length as 005 and run thealgorithms 20 times for each minimum support

The corresponding results in Figures 10 11 and 12 showthat the reconstruction errors on BMS-WebView-1x10 aremuch smaller than the errors on BMS-WebView-1 for bothMASK and our method Note that the frequent itemsets dis-covered from BMS-WebView-1 are the same as the frequentitemsets discovered fromBMS-WebView-1x10This gives us athought that if we want to improve the reconstruction resultswe can copy the original dataset many times and distortthe new dataset and send it to the cooperated party Thenthe cooperated party can accurately discover the interestingpatterns without knowing the original private information

533 Comparisons with Different Lengths of Frequent Item-set The third experiment is conducted on the dataset

8 The Scientific World Journal

04

035

03

025

02

015

01

005

15 20

25 3 35 4 45 5 55 6

120588

times10minus3Minimum support

MASKBMS 1

PersonalizedBMS 1 PersonalizedBMS 1 times 10

MASKBMS 1 times 10

Figure 10 Support error 120588 versus minimum support on differentsize of datasets

15 2 25 3 35 4 45 5 55 6

times10minus3Minimum support

MASKBMS 1

PersonalizedBMS 1

120590+

14

12

10

8

6

4

2

0

PersonalizedBMS 1 times 10

MASKBMS 1 times 10

Figure 11 Frequent itemset added ration 120590+ versus minimum

support on different size of datasets

15 2 25 3 35 4 45 5 55 6

times10minus3Minimum support

MASKBMS 1

PersonalizedBMS 1

035

03

025

02

015

005

01

0

120590minus

PersonalizedBMS 1 times 10

MASKBMS 1 times 10

Figure 12 Frequent itemset lost ration 120590minus versus minimum support

on different size of datasets

120588

02

015

01

005

00 1 2 3 4 5 6 7 8 9

k

MASK-08MASK-08741

MASK-095Personalized

Figure 13 Support error 120588 versus itemset length 119896 on datasetT40I10D100KN942

120590+

0 1 2 3 4 5 6 7

1 2 3 4 5 6 7

8 9

MASK-08MASK-08741

MASK-095Personalized

Minimum support

4

3

2

1

0

0

04

03

02

01

Figure 14 Frequent itemset added ration 120590+ versus itemset length

119896 on dataset T40I10D100KN942

T40I10D100KN942 In this experiment we set the min-imum support as 145 and we run the correspondingalgorithms 5 times and get the average values We also set theparameter 119901 of MASK as 08 08741 and 095 separately

For the minimum support of 1.45%, the maximum length of the original frequent itemsets is 8, and the number of frequent k-itemsets is shown in Table 2. We then compare the results for each itemset length from 1 to 8. The results shown in Figures 13, 14, and 15 demonstrate that our method performs better than MASK with p = 0.8 for every itemset length and performs very similarly to MASK with p = 0.8741.

In Figure 13, because the intersection of the frequent 8-itemsets reconstructed by MASK with p = 0.8 and the original frequent 8-itemsets is empty, the support error cannot be computed and the corresponding value is not shown. We cannot identify a clear trend in the reconstruction error as the itemset length increases. In Figure 14, when k is 8, the frequent itemset added ratio increases sharply. By carefully analyzing the results, we found that many of the newly reported frequent itemsets have support very close to the minimum support, which leads to the large identity error.
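The per-length comparison in Figures 13–15 can be reproduced from the mined itemsets alone. A minimal sketch of the error metrics ρ, σ+, and σ− from (19) and (20), grouped by itemset length, might look as follows; the dictionaries of supports are hypothetical inputs, not the paper's data.

```python
def errors_by_length(orig, recon):
    """orig, recon: dict mapping frozenset(itemset) -> support.
    Returns {k: (rho, sigma_plus, sigma_minus)} per itemset length k,
    following (19) and (20): rho averages |s_hat - s| / s over the
    itemsets that appear in both collections."""
    lengths = {len(i) for i in orig} | {len(i) for i in recon}
    out = {}
    for k in sorted(lengths):
        F = {i for i in orig if len(i) == k}
        F_hat = {i for i in recon if len(i) == k}
        both = F & F_hat
        rho = (sum(abs(recon[i] - orig[i]) / orig[i] for i in both) / len(both)
               if both else None)          # undefined when the intersection is empty
        sigma_plus = (len(F_hat) - len(both)) / len(F) if F else None
        sigma_minus = (len(F) - len(both)) / len(F) if F else None
        out[k] = (rho, sigma_plus, sigma_minus)
    return out

# hypothetical example with 1- and 2-itemsets
orig = {frozenset({"A"}): 0.05, frozenset({"B"}): 0.03, frozenset({"A", "B"}): 0.02}
recon = {frozenset({"A"}): 0.048, frozenset({"C"}): 0.021, frozenset({"A", "B"}): 0.019}
print(errors_by_length(orig, recon))
```

The `None` returned for an empty intersection corresponds to the missing point for the 8-itemsets of MASK with p = 0.8 in Figure 13.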


Figure 15: Frequent itemset lost ratio σ− versus itemset length k on dataset T40I10D100KN942 (MASK-0.8, MASK-0.8741, MASK-0.95, Personalized).

Table 2: The number of frequent k-itemsets.

k:    1    2     3    4    5    6    7    8
|F|:  690  4869  293  368  483  469  331  25

6. Conclusions

In this paper, we address the problem of providing personalized privacy protection for different attributes or items while discovering frequent itemsets. Based on the classic randomized response technique, we proposed a personalized privacy-preserving method. We also proposed a technique that speeds up the counting of frequent itemsets by mapping each transaction's item-combination vector to a single value. Experimental results show that the personalized privacy protection method performs much better than the traditional privacy protection method, which applies the same level of protection to every item. The experiments also show that we can copy the original dataset several times to create a new dataset and then distort this new dataset; the receiving party can then discover the frequent itemsets with smaller error, at the expense of more computation and communication cost.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research work was supported in part by the National Natural Science Foundation of China under Grants no. 61003231, no. 61103109, and no. 11105024 and by the Fundamental Research Funds for the Central Universities under ZYGX2011J057. The authors are grateful to the anonymous reviewers who made constructive comments.

References

[1] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in Proceedings of the 20th International Conference on Very Large Data Bases (VLDB '94), vol. 1215, pp. 487–499, Santiago de Chile, Chile, September 1994.

[2] L. F. Cranor, J. Reagle, and M. S. Ackerman, Beyond Concern: Understanding Net Users' Attitudes about Online Privacy, MIT Press, Cambridge, Mass, USA, 2000.

[3] J. R. Haritsa, "Mining association rules under privacy constraints," in Privacy-Preserving Data Mining, pp. 239–266, Springer, New York, NY, USA, 2008.

[4] X. Xiao and Y. Tao, "Personalized privacy preservation," in Proceedings of the ACM International Conference on Management of Data (SIGMOD '06), pp. 229–240, ACM, Chicago, Ill, USA, June 2006.

[5] X. Xiao, Y. Tao, and M. Chen, "Optimal random perturbation at multiple privacy levels," Proceedings of the VLDB Endowment, vol. 2, no. 1, pp. 814–825, 2009.

[6] A. C. Tamhane, "Randomized response techniques for multiple sensitive attributes," Journal of the American Statistical Association, vol. 76, no. 376, pp. 916–923, 1981.

[7] S. J. Rizvi and J. R. Haritsa, "Maintaining data privacy in association rule mining," in Proceedings of the 28th International Conference on Very Large Data Bases (VLDB '02), pp. 682–693, VLDB Endowment, Hong Kong, August 2002.

[8] V. S. Verykios and A. Gkoulalas-Divanis, "A survey of association rule hiding methods for privacy," in Privacy-Preserving Data Mining, pp. 267–289, Springer, New York, NY, USA, 2008.

[9] Y.-H. Wu, C.-M. Chiang, and A. L. P. Chen, "Hiding sensitive association rules with limited side effects," IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 1, pp. 29–42, 2007.

[10] E. D. Pontikakis, A. A. Tsitsonis, and V. S. Verykios, "An experimental study of distortion-based techniques for association rule hiding," in Research Directions in Data and Applications Security XVIII, vol. 144 of IFIP International Federation for Information Processing, pp. 325–339, Springer, 2004.

[11] E. D. Pontikakis, Y. Theodoridis, A. A. Tsitsonis, L. Chang, and V. S. Verykios, "A quantitative and qualitative analysis of blocking in association rule hiding," in Proceedings of the ACM Workshop on Privacy in the Electronic Society (WPES '04), pp. 29–30, ACM Press, Washington, DC, USA, October 2004.

[12] S.-L. Wang and A. Jafari, "Using unknowns for hiding sensitive predictive association rules," in Proceedings of the IEEE International Conference on Information Reuse and Integration (IRI '05), pp. 223–228, IEEE, Las Vegas, Nev, USA, August 2005.

[13] S. Xingzhi and P. S. Yu, "A border-based approach for hiding sensitive frequent itemsets," in Proceedings of the 5th IEEE International Conference on Data Mining (ICDM '05), pp. 426–433, IEEE, Houston, Tex, USA, November 2005.

[14] G. V. Moustakides and V. S. Verykios, "A max-min approach for hiding frequent itemsets," in Proceedings of the 6th IEEE International Conference on Data Mining Workshops (ICDM '06), pp. 502–506, Hong Kong, December 2006.

[15] S. Menon, S. Sarkar, and S. Mukherjee, "Maximizing accuracy of shared databases when concealing sensitive patterns," Information Systems Research, vol. 16, no. 3, pp. 256–270, 2005.

[16] A. Gkoulalas-Divanis and V. S. Verykios, "An integer programming approach for frequent itemset hiding," in Proceedings of the 15th ACM Conference on Information and Knowledge Management (CIKM '06), pp. 748–757, ACM, Arlington, Va, USA, November 2006.

[17] J. Vaidya and C. Clifton, "Privacy preserving association rule mining in vertically partitioned data," in Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '02), pp. 639–644, ACM, Alberta, Canada, July 2002.

[18] M. Kantarcioglu and C. Clifton, "Privacy-preserving distributed mining of association rules on horizontally partitioned data," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 9, pp. 1026–1037, 2004.

[19] N. Zhang, S. Wang, and W. Zhao, "A new scheme on privacy preserving association rule mining," in Knowledge Discovery in Databases: PKDD 2004, vol. 3202 of Lecture Notes in Computer Science, pp. 484–495, 2004.

[20] S. L. Warner, "Randomized response: a survey technique for eliminating evasive answer bias," Journal of the American Statistical Association, vol. 60, no. 309, pp. 63–66, 1965.

[21] Z. Zheng, R. Kohavi, and L. Mason, "Real world performance of association rule algorithms," in Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '01), pp. 401–406, ACM, San Francisco, Calif, USA, August 2001.



as 08 However the results of our method are much similarto the results of MASK with 119901 as 08741 the average valueof vector 119875 for our method Similar results can be found inFigures 13 14 and 15 on the dataset T40I10D100KN942But the MASK with 119901 as 08741 cannot protect all the itemsor attributes because some attribute needs protection withprivacy level smaller than 08741 such as 085 For the threemetrics the results of MASK with 119901 as 095 are better thanour method This is because the privacy protection level ofeach attribute in our method is between 08 and 095

532 The Impact of the Size of Dataset We conduct thesecond experiment to evaluate the impact of the size of

120590+

MASK-08MASK-08741

MASK-095Personalized

Minimum support15 2 25 3 35 4 45 5 55 6

times10minus3

05

04

03

02

01

0

Figure 8 Frequent itemset added ration 120590+ versus minimum

support on BMS-WebView-1x10

120590minus

016

014

012

01

008

006

004

002

0

MASK-08MASK-08741

MASK-095Personalized

Minimum support15 2 25 3 35 4 45 5 55 6

times10minus3

Figure 9 Frequent itemset lost ration 120590minus versus minimum support

on BMS-WebView-1x10

dataset on the reconstruction errors shown in formulas (19)(20) The dataset BMS-WebView-1x10 is formed from thedataset BMS-WebView-1 by copying it 10 times Similarly weset the minimum support of frequent itemset mining from02 to 055 with the step length as 005 and run thealgorithms 20 times for each minimum support

The corresponding results in Figures 10 11 and 12 showthat the reconstruction errors on BMS-WebView-1x10 aremuch smaller than the errors on BMS-WebView-1 for bothMASK and our method Note that the frequent itemsets dis-covered from BMS-WebView-1 are the same as the frequentitemsets discovered fromBMS-WebView-1x10This gives us athought that if we want to improve the reconstruction resultswe can copy the original dataset many times and distortthe new dataset and send it to the cooperated party Thenthe cooperated party can accurately discover the interestingpatterns without knowing the original private information

533 Comparisons with Different Lengths of Frequent Item-set The third experiment is conducted on the dataset

8 The Scientific World Journal

04

035

03

025

02

015

01

005

15 20

25 3 35 4 45 5 55 6

120588

times10minus3Minimum support

MASKBMS 1

PersonalizedBMS 1 PersonalizedBMS 1 times 10

MASKBMS 1 times 10

Figure 10 Support error 120588 versus minimum support on differentsize of datasets

15 2 25 3 35 4 45 5 55 6

times10minus3Minimum support

MASKBMS 1

PersonalizedBMS 1

120590+

14

12

10

8

6

4

2

0

PersonalizedBMS 1 times 10

MASKBMS 1 times 10

Figure 11 Frequent itemset added ration 120590+ versus minimum

support on different size of datasets

15 2 25 3 35 4 45 5 55 6

times10minus3Minimum support

MASKBMS 1

PersonalizedBMS 1

035

03

025

02

015

005

01

0

120590minus

PersonalizedBMS 1 times 10

MASKBMS 1 times 10

Figure 12 Frequent itemset lost ration 120590minus versus minimum support

on different size of datasets

120588

02

015

01

005

00 1 2 3 4 5 6 7 8 9

k

MASK-08MASK-08741

MASK-095Personalized

Figure 13 Support error 120588 versus itemset length 119896 on datasetT40I10D100KN942

120590+

0 1 2 3 4 5 6 7

1 2 3 4 5 6 7

8 9

MASK-08MASK-08741

MASK-095Personalized

Minimum support

4

3

2

1

0

0

04

03

02

01

Figure 14 Frequent itemset added ration 120590+ versus itemset length

119896 on dataset T40I10D100KN942

T40I10D100KN942 In this experiment we set the min-imum support as 145 and we run the correspondingalgorithms 5 times and get the average values We also set theparameter 119901 of MASK as 08 08741 and 095 separately

For the minimum support 145 the maximum length oforiginal frequent itemsets is 8 and the number of frequent119896-itemset is shown in Table 2 Then we compare the resulton each itemset length from 1 to 8 The results shown inFigures 13 14 and 15 demonstrate that our method performsbetter than MASK with 119901 as 08 for each itemset length andperforms very similar to the MASK with 119901 as 08741

In Figure 13 as the intersection set of frequent 8-itemsetsreconstructed by MASK with 119901 as 08 and original frequent8-itemsets is an empty set the support error cannot becomputed then the value cannot be shown We cannotconclude the clear trend of the reconstruction error withthe itemset length increasing In Figure 14 when 119896 is 8 thefrequent itemset added ratio increased sharply By carefullyanalyzing the results we found that many new frequentitemsets have the support very close to theminimum supportwhich leads to the heavy identity error

The Scientific World Journal 9

120590minus

k

0 1 2 3 4 5 6 7 8 9

MASK-08MASK-08741

MASK-095Personalized

1

08

06

04

02

0

Figure 15 Frequent itemset lost ration 120590minus versus itemset length 119896

on dataset T40I10D100KN942

Table 2 The number of frequent 119896-itemsets

119896 1 2 3 4 5 6 7 8|119865| 690 4869 293 368 483 469 331 25

6 Conclusions

In this paper we solve the problem of how to provide thepersonalized privacy protection on different attributes oritems while discovering the frequent itemsets Based on theclassic randomized response technique we proposed a per-sonalized privacy-preserving method Besides we proposeda method to improve the efficiency of counting the frequentitemsets by mapping the frequent itemset vector into a valueExperimental results show that the personalized privacyprotection method can have much better performance thanthe traditional privacy protection method which providesthe same privacy protection for the different items Fromthe experimental results we can see that we can copy theoriginal dataset many times to create a new dataset and thendistort this new dataset Then the others can discover thefrequent itemsets with smaller error at the expense of morecomputation and communication cost

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This research work was supported in part by the NationalNatural Science Foundation of China under Grant no61003231 no 61103109 and no 11105024 and by the Funda-mental Research Funds for the Central Universities underZYGX2011J057 The authors are grateful for the anonymousreviewers who made constructive comments

References

[1] R Agrawal and R Srikant ldquoFast algorithms for mining associ-ation rulesrdquo in Proceedings of the 20th International Conferenceon Very Large Data Bases (VLDB rsquo94) vol 1215 pp 487ndash499Santiago de Chile Chile September 1994

[2] L F Cranor J Reagle and M S Ackerman Beyond ConcernUnderstanding Net Usersrsquo Attitudes about Online Privacy MITPress Cambridge Mass USA 2000

[3] J R Haritsa ldquoMining association rules under privacy con-straintsrdquo in Privacy-Preserving Data Mining pp 239ndash266Springer New York NY USA 2008

[4] X Xiao and Y Tao ldquoPersonalized privacy preservationrdquo in Pro-ceedings of the ACM International Conference on Managementof Data (SIGMOD rsquo06) pp 229ndash240 ACM Chicago Ill USAJune 2006

[5] X Xiao Y Tao andM Chen ldquoOptimal random perturbation atmultiple privacy levelsrdquo Proceedings of the VLDB Endowmentvol 2 no 1 pp 814ndash825 2009

[6] A C Tamhane ldquoRandomized response techniques for multiplesensitive attributesrdquo Journal of the American Statistical Associa-tion vol 76 no 376 pp 916ndash923 1981

[7] S J Rizvi and J R Haritsa ldquoMaintaining data privacy inassociation ruleminingrdquo inProceedings of the 28th InternationalConference on Very Large Data Bases (VLDB rsquo02) pp 682ndash693VLDB Endowment Hong Kong August 2002

[8] V S Verykios and A Gkoulalas-Divanis ldquoA survey of associ-ation rule hiding methods for privacyrdquo in Privacy-PreservingData Mining pp 267ndash289 Springer New York NY USA 2008

[9] Y-H Wu C-M Chiang and A L P Chen ldquoHiding sensitiveassociation rules with limited side effectsrdquo IEEE Transactions onKnowledge and Data Engineering vol 19 no 1 pp 29ndash42 2007

[10] E D Pontikakis A A Tsitsonis and V S Verykios ldquoAn experi-mental study of distortion-based techniques for association rulehidingrdquo inResearchDirections in Data andApplications SecurityXVIII vol 144 of IFIP International Federation for InformationProcessing pp 325ndash339 Springer 2004

[11] E D Pontikakis Y Theodoridis A A Tsitsonis L Changand V S Verykios ldquoA quantitative and qualitative analysis ofblocking in association rule hidingrdquo in Proceedings of the ACMWorkshop on Privacy in the Electronic Society (WPES rsquo04) pp29ndash30 ACM Press Washington DC USA October 2004

[12] S-L Wang and A Jafari ldquoUsing unknowns for hiding sensitivepredictive association rulesrdquo in Proceedings of the IEEE Inter-national Conference on Information Reuse and Integration (IRIrsquo05) pp 223ndash228 IEEE Las Vegas Nev USA August 2005

[13] S Xingzhi and P S Yu ldquoA border-based approach for hidingsensitive frequent itemsetsrdquo in Proceedings of the 5th IEEEInternational Conference on Data Mining (ICDM rsquo05) pp 426ndash433 IEEE Houston Tex USA November 2005

[14] G V Moustakides and V S Verykios ldquoA max-min approachfor hiding frequent itemsetsrdquo in Proceedings of the 6th IEEEInternational Conference on Data Mining Workshops (ICDMrsquo06) pp 502ndash506 Hong Kong December 2006

[15] S Menon S Sarkar and S Mukherjee ldquoMaximizing accuracyof shared databases when concealing sensitive patternsrdquo Infor-mation Systems Research vol 16 no 3 pp 256ndash270 2005

[16] A Gkoulalas-Divanis and V S Verykios ldquoAn integer program-ming approach for frequent itemset hidingrdquo in Proceedingsof the 15th ACM Conference on Information and KnowledgeManagement (CIKM rsquo06) pp 748ndash757 ACM Arlington VaUSA November 2006

10 The Scientific World Journal

[17] J Vaidya and C Clifton ldquoPrivacy preserving association rulemining in vertically partitioned datardquo in Proceedings of the8th ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining (KDD rsquo02) pp 639ndash644 ACMAlberta Canada July 2002

[18] MKantarcioglu andCClifton ldquoPrivacy-preserving distributedmining of association rules on horizontally partitioned datardquoIEEE Transactions on Knowledge and Data Engineering vol 16no 9 pp 1026ndash1037 2004

[19] N Zhang S Wang and W Zhao ldquoA new scheme on privacypreserving association rule miningrdquo in Knowledge Discovery inDatabases PKDD 2004 vol 3202 of Lecture Notes in ComputerScience pp 484ndash495 2004

[20] S L Warner ldquoRandomized response a survey technique foreliminating evasive answer biasrdquo Journal of the AmericanStatistical Association vol 60 no 309 pp 63ndash66 1965

[21] Z Zheng R Kohavi and L Mason ldquoReal world performanceof association rule algorithmsrdquo in Proceedings of the 7th ACMSIGKDD International Conference on Knowledge Discovery andData Mining (KDD rsquo01) pp 401ndash406 ACM San FranciscoCalif USA August 2001

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World Journal 5

Table 1 An example of transition matrix

000 sdot sdot sdot 110 111000 119901

111990121199013

sdot sdot sdot (1 minus 1199011) (1 minus 119901

2) 1199013

(1 minus 1199011)(1 minus 119901

2)(1 minus 119901

3)

001 11990111199012(1 minus 119901

3) sdot sdot sdot (1 minus 119901

1)(1 minus 119901

2)(1 minus 119901

3) (1 minus 119901

1) (1 minus 119901

2) 1199013

111 (1 minus 1199011) (1 minus 119901

2) 1199013

sdot sdot sdot 11990111199012(1 minus 119901

3) 119901

111990121199013

After getting the transition matrix 119877 with elementscomputed by (12) we can reconstruct the counts of combi-nations by (16) Hence we evaluate the performance of thereconstruction by computing the difference between 119862 and119862

119862 = 119877minus11198621015840 (16)

The count of the last combination divided by the numberof transactions is the estimated support of the correspondingitemset as shown in (17) where 119898 is the number of transac-tions in119863

119904 =1198882119896minus1

119898 (17)

43 Mining Frequent Itemset from Distorted Dataset Wemine the frequent itemset from the database by using the clas-sic Apriori algorithm which employs an iterative approach tosearch the frequent itemset The frequent 119896-itemsets are usedto generate the candidate (119896 + 1)-itemsets by the AprioriGenThen Apriori computes the support of candidate (119896 + 1)-itemsets and filters out the itemsets with support less than theminimum support These operations iteratively run until nomore frequent itemsets can be found

In the privacy-preserving personalized frequent itemsetmining we need to reconstruct the support of candidateitemsets from the distorted database For a candidate itemsetwe compute the counts of all its items combinations In orderto accelerate the counting speed we firstly remove the itemswith support less than the minimum support Then for thecandidate itemset 119883 = 119868

1 119868

119896 we extract the subdataset

119863(119883) = 119863(1198681 119868

119896) from the database1198631015840 In the subdataset

119863(119883) we map each transaction 119905 isin 119863(119883) into a value by

V = 119905 sdot 119887 = (1199051 1199052 119905

119896) sdot (2119896minus1

2119896minus2

20) (18)

By (18) a vector119881will be computed wherein V119894= 119863(119883)119894sdot

119887 Then 1198881015840

119895in (14) is the count of elements in 119881 equal to

119895 By this method it is very fast to compute the vector 1198621015840Figure 2 gives an example of transaction mapping that mapsthe transaction 119905 into a value By computing the count of eachvalue from0 to 7 we can get the counts of all the combinationsfor the itemset 119860119862119863 By using (16) and (17) we can get thesupport of the itemset119883

In our privacy-preserving personalized frequent itemsetmining we iteratively generate the candidate (119896 + 1)-itemsetfrom the frequent 119896-itemset and check the support of thecandidate itemset using the itemset support reconstruction

ACD ACDt middot b

000

010

000

011

011

middot middot middot

111

100

101

100

111

0

1

2

3

4

5

6

7

000

001

010

011

100

101

110

111

Figure 2 An example of transaction mapping

5 Experimental Results

In this section we evaluate the performance about thepersonalized privacy-preserving frequent itemset miningon the real dataset and synthetic datasets generated bythe classic IBM Quest Market-Basket Synthetic Data Gen-erator (The C++ source code can be downloaded athttpwwwcsloyolaedusimcgiannelassoc genhtml)

51 Datasets The synthetic datasets are generated by theIBM Almaden generator with the parameters TIDN [1]where 119879 is the average size of the transactions 119868 is theaverage size of the maximal potentially large itemsets 119863 isthe number of transactions and 119873 is the number of itemsWe generated two synthetic datasets T3I4D500KN10 andT40I10D100KN942

The real dataset is BMS-WebView-1 [21] which containsthe click-stream data from the website of a legwear and legcare retailer The dataset contains 59602 transactions and497 items We scaled the dataset with a factor of 10 andgot the dataset BMS-WebView-1x10 which contains 596020transactions and 497 items

52 Evaluation Metrics We evaluate our method on thedistorted database by two kinds of error metrics [7] thesupport errors and the identity error Let 119865 be the frequent

6 The Scientific World Journal

itemsets discovered from the original database 119863 by theApriori algorithm and119865 represent the reconstructed frequentitemset minded from the distorted database by our method

The support error metric is evaluated by (19) whichreflects the relative error in the support values We measurethe average error based on the frequent itemsets which arecorrectly identified to be frequent that is119865cap119865 with the givenminimum support 119904min

120588 =1

10038161003816100381610038161003816119865 cap 119865

10038161003816100381610038161003816

sum

119891isin119865cap119865

10038161003816100381610038161003816119904119891minus 119904119891

10038161003816100381610038161003816

119904119891

(19)

The identity error metrics are given in (20) 120590+ indicatesthe percentage of false positive that is the percentage ofreconstructed itemsets which do not exist in the original fre-quent itemsets 120590minus indicates the percentage of false negativethat is the percentage of original frequent itemsets which arenot correctly reconstructed as frequent

120590+=

1003816100381610038161003816100381611986510038161003816100381610038161003816minus10038161003816100381610038161003816119865 cap 119865

10038161003816100381610038161003816

|119865|

120590minus=

|119865| minus10038161003816100381610038161003816119865 cap 119865

10038161003816100381610038161003816

|119865|

(20)

The corresponding metrics are illustrated in Figure 3 Forthe part of metric 120588 the intersection of two frequent itemsets119865 and 119865 is maybe empty Under this condition we do notcompute the result of 120588

53 Results Analysis We evaluate our method by measuringthe support error and the two identity errors shown informulas (19) (20) and we conduct our experiments on thethree given datasets described in Section 51 We compareour method with MASK which distorts the original datasetwith only one parameter value119901 If the personalized distortedvector of our method is 119875 then 119901 = min(119875) In ourexperiments the elements in the personalized distortedvector 119875 in formula (5) are generated following the uniformdistribution with the range of [08 095]The average value ofvector119875 is 08741 In order to protect all the items or attributesin a dataset the parameter 119901 in MASK must be set as 08Therefore we compare our method with the MASK havingthe parameter 119901 as 08

531 ComparisonswithDifferentMinimumSupport Thefirstexperiment was conducted on the T3I4D500KN10 In thisexperiment we set the minimum support from 005 to095 with the step length as 005 For each minimumsupport we ran the MASK and our personalized method100 times and compute the average values and the standarddeviations of evaluation metrics The results are shown inFigures 4 5 and 6

From Figures 4 5 and 6 we can see that our methodcan have smaller error than MASK This is because theMASK has to distort all the items or attributes in themaximum distorting level in order to protect each item

120590minus 120590+

120588

FF

Figure 3 Evaluation metrics

MASKPersonalized

Minimum support0 0001 0002 0003 0004 0005 0006 0007 0008 0009 001

025

02

015

01

005

0

120588

Figure 4 Support error 120588 versus minimum support onT3I4D500KN10

MASKPersonalized

Minimum support0 0001 0002 0003 0004 0005 0006 0007 0008 0009 001

01

008

006

004

002

0

120590+

Figure 5 Frequent itemset added ration 120590+ versus minimum

support on T3I4D500KN10

or attribute while our method distort different items withdifferent distorting level The results show that our methodonly leads to half of support error 120588 of MASK in the datasetT3I4D500KN10 Besides from Figures 5 and 6 we can seethat ourmethod can havemuch smaller deviations of identityerrors

The Scientific World Journal 7

MASKPersonalized

Minimum support0 0001 0002 0003 0004 0005 0006 0007 0008 0009 001

120590minus

016

014

012

01

008

006

004

002

0

Figure 6 Frequent itemset lost ration 120590minus versus minimum support

on T3I4D500KN10

MASK-08MASK-08741

MASK-095Personalized

Minimum support15 2 25 3 35 4 45 5 55 6

times10minus3

120588

014

012

01

008

006

004

002

0

Figure 7 Support error 120588 versus minimum support on BMS-WebView-1x10

We conducted the same experiments on the real datasetBMS-WebView-1x10 For this dataset we set the minimumsupport from02 to 055with the step length as 005 andfor each minimum support we ran the algorithms 20 timesAs the average value of elements in vector 119875 is 08741 we alsoran theMASKwith parameter119901 as 08741 Besides we ran theMASK with 119901 as the maximum value of 119875 that is 095

We show the results in Figures 7 8 and 9 and the resultshere show that ourmethod is much better thanMASKwith 119901

as 08 However the results of our method are much similarto the results of MASK with 119901 as 08741 the average valueof vector 119875 for our method Similar results can be found inFigures 13 14 and 15 on the dataset T40I10D100KN942But the MASK with 119901 as 08741 cannot protect all the itemsor attributes because some attribute needs protection withprivacy level smaller than 08741 such as 085 For the threemetrics the results of MASK with 119901 as 095 are better thanour method This is because the privacy protection level ofeach attribute in our method is between 08 and 095

532 The Impact of the Size of Dataset We conduct thesecond experiment to evaluate the impact of the size of

120590+

MASK-08MASK-08741

MASK-095Personalized

Minimum support15 2 25 3 35 4 45 5 55 6

times10minus3

05

04

03

02

01

0

Figure 8 Frequent itemset added ration 120590+ versus minimum

support on BMS-WebView-1x10

120590minus

016

014

012

01

008

006

004

002

0

MASK-08MASK-08741

MASK-095Personalized

Minimum support15 2 25 3 35 4 45 5 55 6

times10minus3

Figure 9 Frequent itemset lost ration 120590minus versus minimum support

on BMS-WebView-1x10

dataset on the reconstruction errors shown in formulas (19)(20) The dataset BMS-WebView-1x10 is formed from thedataset BMS-WebView-1 by copying it 10 times Similarly weset the minimum support of frequent itemset mining from02 to 055 with the step length as 005 and run thealgorithms 20 times for each minimum support

The corresponding results in Figures 10 11 and 12 showthat the reconstruction errors on BMS-WebView-1x10 aremuch smaller than the errors on BMS-WebView-1 for bothMASK and our method Note that the frequent itemsets dis-covered from BMS-WebView-1 are the same as the frequentitemsets discovered fromBMS-WebView-1x10This gives us athought that if we want to improve the reconstruction resultswe can copy the original dataset many times and distortthe new dataset and send it to the cooperated party Thenthe cooperated party can accurately discover the interestingpatterns without knowing the original private information

533 Comparisons with Different Lengths of Frequent Item-set The third experiment is conducted on the dataset

8 The Scientific World Journal

04

035

03

025

02

015

01

005

15 20

25 3 35 4 45 5 55 6

120588

times10minus3Minimum support

MASKBMS 1

PersonalizedBMS 1 PersonalizedBMS 1 times 10

MASKBMS 1 times 10

Figure 10 Support error 120588 versus minimum support on differentsize of datasets

15 2 25 3 35 4 45 5 55 6

times10minus3Minimum support

MASKBMS 1

PersonalizedBMS 1

120590+

14

12

10

8

6

4

2

0

PersonalizedBMS 1 times 10

MASKBMS 1 times 10

Figure 11 Frequent itemset added ration 120590+ versus minimum

support on different size of datasets

15 2 25 3 35 4 45 5 55 6

times10minus3Minimum support

MASKBMS 1

PersonalizedBMS 1

035

03

025

02

015

005

01

0

120590minus

PersonalizedBMS 1 times 10

MASKBMS 1 times 10

Figure 12 Frequent itemset lost ration 120590minus versus minimum support

on different size of datasets

120588

02

015

01

005

00 1 2 3 4 5 6 7 8 9

k

MASK-08MASK-08741

MASK-095Personalized

Figure 13 Support error 120588 versus itemset length 119896 on datasetT40I10D100KN942

120590+

0 1 2 3 4 5 6 7

1 2 3 4 5 6 7

8 9

MASK-08MASK-08741

MASK-095Personalized

Minimum support

4

3

2

1

0

0

04

03

02

01

Figure 14 Frequent itemset added ration 120590+ versus itemset length

119896 on dataset T40I10D100KN942

T40I10D100KN942 In this experiment we set the min-imum support as 145 and we run the correspondingalgorithms 5 times and get the average values We also set theparameter 119901 of MASK as 08 08741 and 095 separately

For the minimum support 145 the maximum length oforiginal frequent itemsets is 8 and the number of frequent119896-itemset is shown in Table 2 Then we compare the resulton each itemset length from 1 to 8 The results shown inFigures 13 14 and 15 demonstrate that our method performsbetter than MASK with 119901 as 08 for each itemset length andperforms very similar to the MASK with 119901 as 08741

In Figure 13 as the intersection set of frequent 8-itemsetsreconstructed by MASK with 119901 as 08 and original frequent8-itemsets is an empty set the support error cannot becomputed then the value cannot be shown We cannotconclude the clear trend of the reconstruction error withthe itemset length increasing In Figure 14 when 119896 is 8 thefrequent itemset added ratio increased sharply By carefullyanalyzing the results we found that many new frequentitemsets have the support very close to theminimum supportwhich leads to the heavy identity error

The Scientific World Journal 9

120590minus

k

0 1 2 3 4 5 6 7 8 9

MASK-08MASK-08741

MASK-095Personalized

1

08

06

04

02

0

Figure 15 Frequent itemset lost ration 120590minus versus itemset length 119896

on dataset T40I10D100KN942

Table 2 The number of frequent 119896-itemsets

119896 1 2 3 4 5 6 7 8|119865| 690 4869 293 368 483 469 331 25

6 Conclusions

In this paper we solve the problem of how to provide thepersonalized privacy protection on different attributes oritems while discovering the frequent itemsets Based on theclassic randomized response technique we proposed a per-sonalized privacy-preserving method Besides we proposeda method to improve the efficiency of counting the frequentitemsets by mapping the frequent itemset vector into a valueExperimental results show that the personalized privacyprotection method can have much better performance thanthe traditional privacy protection method which providesthe same privacy protection for the different items Fromthe experimental results we can see that we can copy theoriginal dataset many times to create a new dataset and thendistort this new dataset Then the others can discover thefrequent itemsets with smaller error at the expense of morecomputation and communication cost

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This research work was supported in part by the NationalNatural Science Foundation of China under Grant no61003231 no 61103109 and no 11105024 and by the Funda-mental Research Funds for the Central Universities underZYGX2011J057 The authors are grateful for the anonymousreviewers who made constructive comments

References

[1] R Agrawal and R Srikant ldquoFast algorithms for mining associ-ation rulesrdquo in Proceedings of the 20th International Conferenceon Very Large Data Bases (VLDB rsquo94) vol 1215 pp 487ndash499Santiago de Chile Chile September 1994

[2] L F Cranor J Reagle and M S Ackerman Beyond ConcernUnderstanding Net Usersrsquo Attitudes about Online Privacy MITPress Cambridge Mass USA 2000

[3] J R Haritsa ldquoMining association rules under privacy con-straintsrdquo in Privacy-Preserving Data Mining pp 239ndash266Springer New York NY USA 2008

[4] X Xiao and Y Tao ldquoPersonalized privacy preservationrdquo in Pro-ceedings of the ACM International Conference on Managementof Data (SIGMOD rsquo06) pp 229ndash240 ACM Chicago Ill USAJune 2006



Let F denote the set of frequent itemsets discovered from the original database D by the Apriori algorithm, and let F̃ denote the set of reconstructed frequent itemsets mined from the distorted database by our method.

The support error metric is evaluated by (19), which reflects the relative error in the support values. We measure the average error over the frequent itemsets that are correctly identified as frequent, that is, F ∩ F̃, under the given minimum support s_min:

$$\rho = \frac{1}{\lvert F \cap \tilde{F} \rvert} \sum_{f \in F \cap \tilde{F}} \frac{\lvert \tilde{s}_f - s_f \rvert}{s_f}. \tag{19}$$

The identity error metrics are given in (20). σ+ indicates the false-positive percentage, that is, the percentage of reconstructed itemsets that do not exist among the original frequent itemsets; σ− indicates the false-negative percentage, that is, the percentage of original frequent itemsets that are not correctly reconstructed as frequent:

$$\sigma^{+} = \frac{\lvert \tilde{F} \rvert - \lvert F \cap \tilde{F} \rvert}{\lvert F \rvert}, \qquad \sigma^{-} = \frac{\lvert F \rvert - \lvert F \cap \tilde{F} \rvert}{\lvert F \rvert}. \tag{20}$$

The corresponding metrics are illustrated in Figure 3. For the metric ρ, the intersection of the two sets F and F̃ may be empty; in that case we do not compute ρ.
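As a concrete illustration of these metrics, the following Python sketch computes ρ, σ+, and σ− from the original and reconstructed frequent itemsets. The dictionary-based interface and the function name are our own illustrative assumptions, not the authors' code.

```python
def evaluation_metrics(original, reconstructed):
    """Compute the support error rho and the identity errors sigma+ / sigma-.

    original      -- dict mapping frozenset(itemset) -> true support s_f
    reconstructed -- dict mapping frozenset(itemset) -> estimated support s~_f
    """
    F = set(original)
    F_tilde = set(reconstructed)
    common = F & F_tilde

    # Support error (19): average relative error over correctly identified itemsets.
    rho = None  # left undefined when the intersection is empty, as noted in the text
    if common:
        rho = sum(abs(reconstructed[f] - original[f]) / original[f]
                  for f in common) / len(common)

    # Identity errors (20): false-positive and false-negative ratios.
    sigma_plus = (len(F_tilde) - len(common)) / len(F)
    sigma_minus = (len(F) - len(common)) / len(F)
    return rho, sigma_plus, sigma_minus


# Toy usage with hypothetical itemsets and supports:
F_orig = {frozenset({"a"}): 0.10, frozenset({"a", "b"}): 0.05}
F_rec = {frozenset({"a"}): 0.09, frozenset({"b", "c"}): 0.04}
print(evaluation_metrics(F_orig, F_rec))
```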

5.3. Results Analysis. We evaluate our method by measuring the support error and the two identity errors defined in (19) and (20), and we conduct our experiments on the three datasets described in Section 5.1. We compare our method with MASK, which distorts the original dataset with a single parameter value p. If the personalized distortion vector of our method is P, then p = min(P). In our experiments, the elements of the personalized distortion vector P in formula (5) are generated from a uniform distribution over the range [0.8, 0.95]; the average value of the vector P is 0.8741. In order to protect all the items or attributes in a dataset, the parameter p in MASK must be set to 0.8. Therefore, we compare our method with MASK using p = 0.8.
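This setup can be sketched in a few lines. Assuming a MASK-style randomized response in which each Boolean attribute value is retained with probability P[i] and flipped with probability 1 − P[i] (our reading of the distortion referenced in the text, not a transcription of formula (5)), the snippet below draws the personalized distortion vector P uniformly from [0.8, 0.95] and distorts a single transaction; the names and structure are illustrative only.

```python
import random

def personalized_distortion_vector(num_items, low=0.8, high=0.95, seed=None):
    """Draw a per-attribute retention probability P[i] uniformly from [low, high]."""
    rng = random.Random(seed)
    return [rng.uniform(low, high) for _ in range(num_items)]

def distort_transaction(bits, P, rng=random):
    """Randomized response per attribute: keep bit i with probability P[i],
    otherwise flip it (the assumed MASK-style distortion)."""
    return [b if rng.random() < p else 1 - b for b, p in zip(bits, P)]

P = personalized_distortion_vector(num_items=5, seed=1)
print("P =", [round(p, 4) for p in P])
print("baseline p for MASK =", round(min(P), 4))   # p = min(P), as in the comparison
print(distort_transaction([1, 0, 1, 1, 0], P))
```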

5.3.1. Comparisons with Different Minimum Supports. The first experiment was conducted on T3I4D500KN10. In this experiment, we set the minimum support from 0.05% to 0.95% with a step length of 0.05%. For each minimum support, we ran MASK and our personalized method 100 times and computed the average values and standard deviations of the evaluation metrics. The results are shown in Figures 4, 5, and 6.

[Figure 3: Evaluation metrics, illustrating F, F̃, ρ, σ+, and σ−.]

[Figure 4: Support error ρ versus minimum support on T3I4D500KN10 (MASK vs. Personalized).]

[Figure 5: Frequent itemset added ratio σ+ versus minimum support on T3I4D500KN10 (MASK vs. Personalized).]

From Figures 4, 5, and 6 we can see that our method yields smaller errors than MASK. This is because MASK has to distort all items or attributes at the maximum distortion level in order to protect every item or attribute, whereas our method distorts different items at different distortion levels. The results show that our method incurs only about half of the support error ρ of MASK on the dataset T3I4D500KN10. Moreover, Figures 5 and 6 show that our method has much smaller deviations of the identity errors.

[Figure 6: Frequent itemset lost ratio σ− versus minimum support on T3I4D500KN10 (MASK vs. Personalized).]

[Figure 7: Support error ρ versus minimum support on BMS-WebView-1x10 (MASK-0.8, MASK-0.8741, MASK-0.95, Personalized).]

We conducted the same experiments on the real dataset BMS-WebView-1x10. For this dataset, we set the minimum support from 0.2% to 0.55% with a step length of 0.05%, and for each minimum support we ran the algorithms 20 times. As the average value of the elements in vector P is 0.8741, we also ran MASK with parameter p set to 0.8741. Besides, we ran MASK with p set to the maximum value of P, that is, 0.95.

We show the results in Figures 7, 8, and 9. The results show that our method is much better than MASK with p = 0.8, while the results of our method are very similar to those of MASK with p = 0.8741, the average value of vector P in our method. Similar results can be found in Figures 13, 14, and 15 on the dataset T40I10D100KN942. However, MASK with p = 0.8741 cannot protect all the items or attributes, because some attributes need protection with a privacy level smaller than 0.8741, such as 0.85. For all three metrics, the results of MASK with p = 0.95 are better than those of our method; this is because the privacy protection level of each attribute in our method lies between 0.8 and 0.95.

[Figure 8: Frequent itemset added ratio σ+ versus minimum support on BMS-WebView-1x10 (MASK-0.8, MASK-0.8741, MASK-0.95, Personalized).]

[Figure 9: Frequent itemset lost ratio σ− versus minimum support on BMS-WebView-1x10 (MASK-0.8, MASK-0.8741, MASK-0.95, Personalized).]

5.3.2. The Impact of the Size of the Dataset. We conduct the second experiment to evaluate the impact of the size of the dataset on the reconstruction errors defined in (19) and (20). The dataset BMS-WebView-1x10 is formed from the dataset BMS-WebView-1 by copying it 10 times. As before, we set the minimum support of frequent itemset mining from 0.2% to 0.55% with a step length of 0.05% and ran the algorithms 20 times for each minimum support.

The corresponding results in Figures 10, 11, and 12 show that the reconstruction errors on BMS-WebView-1x10 are much smaller than the errors on BMS-WebView-1, for both MASK and our method. Note that the frequent itemsets discovered from BMS-WebView-1 are exactly the same as those discovered from BMS-WebView-1x10. This suggests that, to improve the reconstruction results, we can copy the original dataset many times, distort the enlarged dataset, and send it to the cooperating party. The cooperating party can then discover the interesting patterns more accurately without learning the original private information.
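The dataset-copying idea amounts to a simple preprocessing step before distortion. The sketch below is a schematic illustration under our own assumptions (a toy Boolean transaction list and a placeholder single-parameter distortion), not the authors' pipeline.

```python
import random

def replicate_dataset(transactions, copies=10):
    """Copy the original transaction list `copies` times before distortion;
    the true supports are unchanged, but estimation variance is reduced."""
    return [list(t) for _ in range(copies) for t in transactions]

def distort(bits, p=0.9, rng=random):
    # Placeholder randomized-response step (keep each bit with probability p).
    return [b if rng.random() < p else 1 - b for b in bits]

original = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]        # toy Boolean transactions
enlarged = replicate_dataset(original, copies=10)    # 30 transactions, same supports
released = [distort(t) for t in enlarged]            # only the distorted copy is shared
print(len(original), len(enlarged), len(released))
```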

[Figure 10: Support error ρ versus minimum support on datasets of different sizes (MASK and Personalized on BMS-WebView-1 and BMS-WebView-1x10).]

[Figure 11: Frequent itemset added ratio σ+ versus minimum support on datasets of different sizes.]

[Figure 12: Frequent itemset lost ratio σ− versus minimum support on datasets of different sizes.]

[Figure 13: Support error ρ versus itemset length k on dataset T40I10D100KN942 (MASK-0.8, MASK-0.8741, MASK-0.95, Personalized).]

[Figure 14: Frequent itemset added ratio σ+ versus itemset length k on dataset T40I10D100KN942.]

5.3.3. Comparisons with Different Lengths of Frequent Itemsets. The third experiment is conducted on the dataset T40I10D100KN942. In this experiment, we set the minimum support to 1.45%, ran the corresponding algorithms 5 times, and took the average values. We also set the parameter p of MASK to 0.8, 0.8741, and 0.95 separately.

For the minimum support of 1.45%, the maximum length of the original frequent itemsets is 8, and the number of frequent k-itemsets is shown in Table 2. We then compare the results for each itemset length from 1 to 8. The results shown in Figures 13, 14, and 15 demonstrate that our method performs better than MASK with p = 0.8 for every itemset length and performs very similarly to MASK with p = 0.8741.

In Figure 13, because the intersection of the frequent 8-itemsets reconstructed by MASK with p = 0.8 and the original frequent 8-itemsets is empty, the support error cannot be computed and the corresponding value is not shown. We cannot identify a clear trend of the reconstruction error as the itemset length increases. In Figure 14, when k is 8, the frequent itemset added ratio increases sharply. By carefully analyzing the results, we found that many newly reported frequent itemsets have supports very close to the minimum support, which leads to the heavy identity error.

[Figure 15: Frequent itemset lost ratio σ− versus itemset length k on dataset T40I10D100KN942 (MASK-0.8, MASK-0.8741, MASK-0.95, Personalized).]

Table 2: The number of frequent k-itemsets.

k    |   1 |    2 |   3 |   4 |   5 |   6 |   7 |  8
|F|  | 690 | 4869 | 293 | 368 | 483 | 469 | 331 | 25

6. Conclusions

In this paper, we address the problem of providing personalized privacy protection for different attributes or items while discovering frequent itemsets. Based on the classic randomized response technique, we propose a personalized privacy-preserving method. In addition, we propose a method to improve the efficiency of counting the frequent itemsets by mapping each frequent itemset vector to a single value. Experimental results show that the personalized privacy protection method performs much better than the traditional privacy protection method, which provides the same privacy level for all items. The experimental results also show that we can copy the original dataset many times to create a new dataset and then distort this new dataset; the other party can then discover the frequent itemsets with smaller error, at the expense of higher computation and communication costs.
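The counting optimization mentioned above (mapping a frequent itemset vector to a single value) can be read as an integer bit-mask encoding of each itemset. The sketch below illustrates that general idea under our own assumptions; it is not necessarily the exact mapping used in the paper.

```python
from collections import Counter

def itemset_to_key(itemset, item_index):
    """Encode an itemset as an integer bit mask: bit i is set if item i is present."""
    key = 0
    for item in itemset:
        key |= 1 << item_index[item]
    return key

item_index = {"a": 0, "b": 1, "c": 2}                 # hypothetical item ordering
candidates = [{"a", "b"}, {"b", "c"}, {"a", "b"}]
counts = Counter(itemset_to_key(s, item_index) for s in candidates)
print(counts)   # integer keys hash and compare faster than set objects
```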

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research work was supported in part by the National Natural Science Foundation of China under Grants no. 61003231, no. 61103109, and no. 11105024, and by the Fundamental Research Funds for the Central Universities under ZYGX2011J057. The authors are grateful to the anonymous reviewers for their constructive comments.

References

[1] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in Proceedings of the 20th International Conference on Very Large Data Bases (VLDB '94), vol. 1215, pp. 487-499, Santiago de Chile, Chile, September 1994.

[2] L. F. Cranor, J. Reagle, and M. S. Ackerman, Beyond Concern: Understanding Net Users' Attitudes about Online Privacy, MIT Press, Cambridge, Mass, USA, 2000.

[3] J. R. Haritsa, "Mining association rules under privacy constraints," in Privacy-Preserving Data Mining, pp. 239-266, Springer, New York, NY, USA, 2008.

[4] X. Xiao and Y. Tao, "Personalized privacy preservation," in Proceedings of the ACM International Conference on Management of Data (SIGMOD '06), pp. 229-240, ACM, Chicago, Ill, USA, June 2006.

[5] X. Xiao, Y. Tao, and M. Chen, "Optimal random perturbation at multiple privacy levels," Proceedings of the VLDB Endowment, vol. 2, no. 1, pp. 814-825, 2009.

[6] A. C. Tamhane, "Randomized response techniques for multiple sensitive attributes," Journal of the American Statistical Association, vol. 76, no. 376, pp. 916-923, 1981.

[7] S. J. Rizvi and J. R. Haritsa, "Maintaining data privacy in association rule mining," in Proceedings of the 28th International Conference on Very Large Data Bases (VLDB '02), pp. 682-693, VLDB Endowment, Hong Kong, August 2002.

[8] V. S. Verykios and A. Gkoulalas-Divanis, "A survey of association rule hiding methods for privacy," in Privacy-Preserving Data Mining, pp. 267-289, Springer, New York, NY, USA, 2008.

[9] Y.-H. Wu, C.-M. Chiang, and A. L. P. Chen, "Hiding sensitive association rules with limited side effects," IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 1, pp. 29-42, 2007.

[10] E. D. Pontikakis, A. A. Tsitsonis, and V. S. Verykios, "An experimental study of distortion-based techniques for association rule hiding," in Research Directions in Data and Applications Security XVIII, vol. 144 of IFIP International Federation for Information Processing, pp. 325-339, Springer, 2004.

[11] E. D. Pontikakis, Y. Theodoridis, A. A. Tsitsonis, L. Chang, and V. S. Verykios, "A quantitative and qualitative analysis of blocking in association rule hiding," in Proceedings of the ACM Workshop on Privacy in the Electronic Society (WPES '04), pp. 29-30, ACM Press, Washington, DC, USA, October 2004.

[12] S.-L. Wang and A. Jafari, "Using unknowns for hiding sensitive predictive association rules," in Proceedings of the IEEE International Conference on Information Reuse and Integration (IRI '05), pp. 223-228, IEEE, Las Vegas, Nev, USA, August 2005.

[13] S. Xingzhi and P. S. Yu, "A border-based approach for hiding sensitive frequent itemsets," in Proceedings of the 5th IEEE International Conference on Data Mining (ICDM '05), pp. 426-433, IEEE, Houston, Tex, USA, November 2005.

[14] G. V. Moustakides and V. S. Verykios, "A max-min approach for hiding frequent itemsets," in Proceedings of the 6th IEEE International Conference on Data Mining Workshops (ICDM '06), pp. 502-506, Hong Kong, December 2006.

[15] S. Menon, S. Sarkar, and S. Mukherjee, "Maximizing accuracy of shared databases when concealing sensitive patterns," Information Systems Research, vol. 16, no. 3, pp. 256-270, 2005.

[16] A. Gkoulalas-Divanis and V. S. Verykios, "An integer programming approach for frequent itemset hiding," in Proceedings of the 15th ACM Conference on Information and Knowledge Management (CIKM '06), pp. 748-757, ACM, Arlington, Va, USA, November 2006.

[17] J. Vaidya and C. Clifton, "Privacy preserving association rule mining in vertically partitioned data," in Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '02), pp. 639-644, ACM, Alberta, Canada, July 2002.

[18] M. Kantarcioglu and C. Clifton, "Privacy-preserving distributed mining of association rules on horizontally partitioned data," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 9, pp. 1026-1037, 2004.

[19] N. Zhang, S. Wang, and W. Zhao, "A new scheme on privacy preserving association rule mining," in Knowledge Discovery in Databases: PKDD 2004, vol. 3202 of Lecture Notes in Computer Science, pp. 484-495, 2004.

[20] S. L. Warner, "Randomized response: a survey technique for eliminating evasive answer bias," Journal of the American Statistical Association, vol. 60, no. 309, pp. 63-66, 1965.

[21] Z. Zheng, R. Kohavi, and L. Mason, "Real world performance of association rule algorithms," in Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '01), pp. 401-406, ACM, San Francisco, Calif, USA, August 2001.
