Research Article

Personalized Privacy-Preserving Frequent Itemset Mining Using Randomized Response
Chongjing Sun, Yan Fu, Junlin Zhou, and Hui Gao

Web Science Center, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
Correspondence should be addressed to Yan Fu; fuyan@uestc.edu.cn
Received 8 December 2013; Accepted 24 February 2014; Published 30 March 2014

Academic Editors: B. Johansson and L. Xiao

Copyright © 2014 Chongjing Sun et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Frequent itemset mining is the important first step of association rule mining, which discovers interesting patterns from massive data. There are increasing concerns about privacy in frequent itemset mining, and several works have been proposed to handle this kind of problem. In this paper we introduce a personalized privacy problem, in which different attributes may need different levels of privacy protection. To solve this problem, we give a personalized privacy-preserving method based on the randomized response technique. By providing different privacy levels for different attributes, this method achieves higher accuracy on frequent itemset mining than the traditional method that provides the same privacy level for all attributes. Finally, our experimental results show that our method obtains better frequent itemset mining results while preserving personalized privacy.
1. Introduction
The rapid development of Internet technology, cloud computing, mobile computing, and the Internet of Things has produced exploding large datasets, the so-called big data. These data can be used to capture useful information and interesting patterns and indeed help people to make effective decisions. Data mining is a powerful technique which can discover such hidden interesting patterns and models. By analyzing the data, data mining offers considerable benefits to consumers, companies, and organizations.
Association rule mining is a popular and important data mining technique. It is intended to discover interesting patterns from large transaction data, that is, frequent patterns and association rules. These rules can be used to improve market decision making, such as promotional pricing or product placement. For example, by analyzing the transaction data from a supermarket, a classic rule found is that customers who buy diapers also tend to buy beer. Nowadays association rule mining is employed in many application areas, such as Web usage mining, intrusion detection, and bioinformatics.
The process of mining association rules contains two main steps [1]. The first step is to discover all the frequent itemsets with support higher than the minimum support. The second step generates the strong rules based on the frequent itemsets found in the first step; the strong rules have confidence greater than the minimum confidence. In this paper we focus on frequent itemset mining, which is the most important part of association rule mining.
With the exponentially increasing amount of information being collected for analysis, there is growing concern about privacy. In fear of privacy breaches, some people might be reluctant to provide true information. A survey [2] conducted in 1993 demonstrated people's worries about privacy: 17% of the respondents were extremely concerned about their personal information and were reluctant to provide any data even when privacy protection measures were in place. However, 56% of them were willing to provide their information under the condition that privacy protection methods were adopted. The remaining 27% were marginally concerned about their privacy and willing to provide data under any condition. Therefore, in order to collect true data, privacy protection measures must be taken to protect personal information. Besides, before outsourcing or sharing the data with other companies, the data owner also must take privacy-preserving measures.
Hindawi Publishing Corporation, The Scientific World Journal, Volume 2014, Article ID 686151, 10 pages, http://dx.doi.org/10.1155/2014/686151
Two privacy environments are proposed in [3] based on the privacy mechanisms. The first one is B2B (business to business), which assures that the data obtained from the users are preprocessed by privacy-preserving techniques before being supplied to the data miner. The second one is B2C (business to customer), which provides privacy protection at the point of data collection, that is, at the user site. In order to protect sensitive information, we propose methods which can be used under both environments. We focus on the privacy of input data in the frequent itemset mining process. Under the B2C environment, the personal data are randomized after each user provides them and are then sent to the data collector. Under the B2B environment, the collected dataset is randomized before being outsourced, shared with others, or even made public.
Although many works have been proposed to handle the privacy problem in the frequent itemset mining process, few of them focused on the personalized privacy problem. Xiao and Tao [4] proposed the personalized privacy preservation problem, considering that different people need different levels of privacy protection, and they used generalization techniques. Besides, they also raised the problem of multipublishing [5], wherein different recipients can have different privacy protection levels on the original dataset.
In this paper we define personalized privacy as different privacy levels on different attributes or items in the process of frequent itemset mining. For example, a customer does not care too much about the disclosure of the information that he bought daily necessities, but he will be heavily concerned about privacy if he bought some sensitive items. If we provide the same level of privacy protection for all items, too much data utility will be lost in order to satisfy every item's privacy requirement, because the nonsensitive items are also disturbed too heavily. Hence it is necessary to provide different degrees of privacy protection for different items or attributes.
To solve the above personalized privacy problem, we use the randomized response technique [6]. Based on MASK [7], we give a solution on how to reconstruct the associations under the different privacy protections. Our method can be used to preserve privacy under both the B2B and B2C environments. Besides, we design a method to accelerate the itemset support computation. Finally, our experimental results show that the performance of our method is better than that of the traditional method with respect to accuracy, while satisfying the different privacy requirements.
The rest of this paper is organized as follows. In Section 2 we review the previous related works. Section 3 gives some related preliminaries and defines the personalized privacy problem. Then, in Section 4, we describe the proposed method for personalized privacy protection and how to discover the frequent itemsets from the randomized data. We evaluate our technique in Section 5 by experiments and conclude our work in Section 6.
2. Related Works
There are many works on the privacy problem in association rule mining, and they can be divided into two research streams [3]: input data privacy and output rule privacy. Most of them focus on the second one. To preserve input data privacy, some methods are adopted to disturb the original data so that the attacker cannot get the true data of users. For output rule privacy, the collected data are heuristically altered in order to protect some sensitive rules from being mined from the dataset by data miners.
The association rule hiding algorithms, which protect output rule privacy, can be divided into three categories [8], namely, heuristic approaches, border-based approaches, and exact approaches. The first class of approaches selectively sanitizes a set of transactions from the database to hide the sensitive rules efficiently, but suffers from high side effects. Two classic techniques in this class are distortion [9, 10] and blocking [11, 12]. The second set of approaches [13, 14] hides the sensitive rules by modifying the original borders in the lattice of the frequent and infrequent itemsets in the database. The third set of approaches [15, 16] hides the sensitive rules by solving a Constraint Satisfaction Problem using integer or linear programming; they can find the optimal way to hide the rules, but with high computation cost.
In this paper we solve the problem of input data privacy, and the corresponding approaches can be divided into cryptography-based and reconstruction-based methods. The cryptography-based approaches handle the problem in which several partners want to discover shared association rules from the global data without disclosing their sensitive data to others. The global data may be vertically partitioned [17] or horizontally partitioned [18] and distributed among many partners. The reconstruction-based methods first randomize the original data to hide the sensitive data and then reconstruct the interesting patterns based on the statistical features, without knowing the true values. For centralized data, there are two distortion strategies: statistical distortion and algebraic distortion. Rizvi and Haritsa [7] proposed the MASK approach to disturb original sparse Boolean databases; this method retains each 0 or 1 bit in the database with probability p and flips the value with probability 1 − p. Zhang et al. [19] proposed a solution with algebraic distortion: by using the eigenvalues and eigenvectors, the data of users can be distorted by matrix transformation and noise addition.
Our work focuses on the input data privacy problem for centralized data. The frequent itemset mining is conducted on the distorted data that is randomized by the randomized response technique. None of the above algorithms addresses the personalized privacy problem that we solve in this paper.
3. Preliminaries and Problem Definition
In this section we introduce the related preliminaries aboutthe privacy-preserving frequent itemset mining and give theproblem definition of this paper
3.1. Association Rule Mining. Let I be a set of items and D the database containing a set of transactions, wherein each transaction T is a set of items such that T ⊆ I. A transaction T is said to contain X if and only if X ⊆ T. A rule is an implication of the form X ⇒ Y, where X, Y ⊂ I and X ∩ Y = ∅. The support of the rule X ⇒ Y is defined in (1) and its confidence in (2):

    supp(X ⇒ Y) = P(X ∪ Y)    (1)

    conf(X ⇒ Y) = P(Y | X) = supp(X ∪ Y) / supp(X)    (2)

where P(X ∪ Y) is the percentage of transactions in D that contain X ∪ Y, and supp(X) is the percentage of transactions containing X.

A set of items X is said to be frequent if and only if supp(X) is greater than the user-defined minimum support; then X is a frequent itemset. If the rule X ⇒ Y has support greater than the minimum support and confidence greater than the minimum confidence, then X ⇒ Y is an interesting rule, that is, an association rule. From (1) and (2), finding the association rules is effectively equivalent to generating all the frequent itemsets with support greater than the minimum support. Therefore we focus on frequent itemset mining.
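As a small illustration of (1) and (2), the support and confidence can be computed as follows; the transactions below are hypothetical, not data from the paper:

```python
# Hedged sketch: computing support and confidence over a toy transaction
# database. The items and transactions are illustrative.

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    """conf(X => Y) = supp(X ∪ Y) / supp(X), as in (2)."""
    return support(set(X) | set(Y), transactions) / support(X, transactions)

transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"diapers", "milk"},
]

print(support({"diapers", "beer"}, transactions))            # 0.5
print(confidence({"diapers"}, {"beer"}, transactions))       # 2/3
```

An itemset here is frequent when `support` exceeds the chosen minimum support.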
3.2. Randomized Response. The randomized response technique was first introduced by Warner [20] to solve the statistical survey problem for sensitive questions. For example, social scientists may want to know how many people in some area use drugs. Usually respondents are reluctant to answer this kind of question directly. Hence Warner proposed randomized response.
To ask a binary-choice question about whether people have a sensitive attribute A, randomized response poses two related statements like the following:

(1) I have the sensitive attribute A.
(2) I do not have the sensitive attribute A.
Respondents decide which question to answer with a probability p; the interviewer then only gets a "yes" or "no" without knowing which question the respondent answered. If a respondent has the sensitive attribute A, then he will give the answer "yes" (answering the first question) with probability p, or "no" (answering the second question) with probability 1 − p. If a person does not have A, then he will give "no" with probability p, answering the first question, and "yes" with probability 1 − p, answering the second question. Hence the probability that an interviewer gets the answer "yes" is given by (3), and the probability of getting "no" by (4):
    P(ans = yes) = P(A = yes) · p + P(A = no) · (1 − p)    (3)

    P(ans = no) = P(A = no) · p + P(A = yes) · (1 − p)    (4)

where P(A = yes) is the proportion of the respondents that have the attribute A and P(A = no) is the proportion of respondents that do not have the attribute A. The interviewer can get the proportion of respondents having attribute A, P(A = yes), by solving (3) and (4).
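The inversion of (3) can be sketched as follows; the population size, true rate, and parameter p are illustrative:

```python
import random

# Hedged sketch of Warner's randomized response: each respondent answers the
# first statement with probability p and the second with probability 1 - p.
# The observed "yes" rate satisfies (3):
#   pi_yes = pi_true * p + (1 - pi_true) * (1 - p),
# so pi_true = (pi_yes - (1 - p)) / (2p - 1), undefined at p = 0.5.

def randomize(truth, p, rng):
    """Report the truth with probability p, the opposite otherwise."""
    return truth if rng.random() < p else (not truth)

def estimate_true_proportion(answers, p):
    pi_yes = sum(answers) / len(answers)
    return (pi_yes - (1 - p)) / (2 * p - 1)

rng = random.Random(0)
p = 0.8
true_rate = 0.3
population = [rng.random() < true_rate for _ in range(100_000)]
answers = [randomize(t, p, rng) for t in population]
print(estimate_true_proportion(answers, p))  # close to 0.3
```

Note that at p = 0.5 the answers carry no information about the attribute, which is why privacy increases as p approaches 0.5.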
3.3. Problem Definition. In this paper we solve the data privacy problem with respect to frequent itemset mining under both the B2B and the B2C environments.
Under the B2C model, an interviewer can conduct a survey containing sensitive questions, for example, "whether you are divorced", "whether you have criminal records", or "whether you use drugs". This kind of survey can analyze the associations between factors and indeed reflect sociological problems. In this model, respondents disturb their survey vectors with the given parameters and then send them to the interviewer.
Under the B2B model, the data owner, such as a supermarket manager, disturbs the original data before sharing them with other business partners. Taking market transactions as an example, customers are reluctant to let others know what they bought, especially some sensitive products. By taking privacy protection measures, the transaction data can be shared without leaking personal privacy.
Considering the data utility and the different degrees of privacy requirements, the protection for different attributes or items should be different. For example, a customer does not care too much about the disclosure that he bought papers, while being heavily concerned about some sensitive items.
The problem of personalized privacy-preserving frequent itemset mining is the following: given the original transaction dataset D, how to distort D into D' satisfying the different privacy requirements, and how to mine frequent itemsets from D' so that the frequent itemsets mined from D' are as close as possible to the frequent itemsets mined from D.
4. Personalized Privacy-Preserving Frequent Itemset Mining
In this section we present the procedure on how to distortthe original data by the randomized response techniquesatisfying the personalized privacy requirement Then wegive the method to reconstruct itemset support from thedistorted data Finally we devise a personalized privacy-preserving frequent itemset mining algorithm based on theApriori algorithm
4.1. Personalized Data Distortion. Suppose there are n items in the database D. Each transaction in D is represented by a binary vector T: T_i = 1 if the transaction contains item i; otherwise T_i = 0. For each item i, 1 ≤ i ≤ n, there is a probability parameter p_i (0 ≤ p_i ≤ 1) used to disturb the data on item i, and these parameters form the vector shown in formula (5). The distorted value of a transaction on item i can then be expressed as in (6), where r_i is a value randomly drawn from a uniform distribution over the interval [0, 1]:

    P = (p_1, p_2, ..., p_n)    (5)

    T'_i = T_i        if r_i ≤ p_i
           1 − T_i    otherwise    (6)

From (6) we can see that the i-th value of transaction T is kept with probability p_i and flipped with probability 1 − p_i.
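The per-item distortion rule (6) can be sketched as follows; the keep-probabilities in P are illustrative, not the paper's experimental values:

```python
import random

# Hedged sketch of the personalized distortion rule (6): bit T_i is kept
# when r_i <= p_i and flipped otherwise, with a separate p_i per item.

def distort(T, P, rng):
    """Distort binary transaction vector T using per-item parameters P."""
    return [t if rng.random() <= p else 1 - t for t, p in zip(T, P)]

# e.g. a nonsensitive item may get p close to 1, a sensitive one close to 0.5
P = [0.95, 0.80, 0.55]
T = [1, 0, 1]
rng = random.Random(42)
print(distort(T, P, rng))
```

With p_i = 1.0 the bit is always kept; with p_i near 0.5 the distorted bit reveals almost nothing, which is the personalized privacy knob.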
[Figure 1: Reconstruction probability. Curves of R(p_i, s_i) versus p_i (both axes from 0 to 1) for item supports s_i = 0.01, 0.1, 0.3, 0.5, 0.8, and 0.95.]
The proportion of T'_i = 1 can be calculated by (7), and the proportion of T'_i = 0 by (8):

    P(T'_i = 1) = P(T_i = 1) · p_i + P(T_i = 0) · (1 − p_i)    (7)

    P(T'_i = 0) = P(T_i = 0) · p_i + P(T_i = 1) · (1 − p_i)    (8)
For the personalized privacy requirements, different items are distorted with different probability parameters. For a given parameter p_i, the probability of correctly reconstructing the value T_i = 1 is calculated by

    R(p_i, s_i) = P(T'_i = 1 | T_i = 1) · P(T_i = 1 | T'_i = 1)
                + P(T'_i = 0 | T_i = 1) · P(T_i = 1 | T'_i = 0)    (9)

The first part of (9) means that the original value T_i = 1 is distorted into the value T'_i = 1 and then reconstructed from T'_i = 1; the second part is for the distorted value T'_i = 0. Finally, the probability can be computed by (10), where s_i is the support of item i in the original database. A similar derivation of (10) can be found in [7]:

    R(p_i, s_i) = s_i p_i² / (s_i p_i + (1 − s_i)(1 − p_i))
                + s_i (1 − p_i)² / (s_i (1 − p_i) + (1 − s_i) p_i)    (10)
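A quick numerical check of (10), with illustrative supports and parameters:

```python
# Hedged sketch: evaluating the reconstruction probability R(p_i, s_i)
# from (10). The supports and parameters below are illustrative.

def reconstruction_probability(p, s):
    kept = s * p ** 2 / (s * p + (1 - s) * (1 - p))
    flipped = s * (1 - p) ** 2 / (s * (1 - p) + (1 - s) * p)
    return kept + flipped

# higher support -> easier reconstruction, for a fixed parameter p
for s in (0.01, 0.1, 0.5, 0.95):
    print(s, reconstruction_probability(0.9, s))

# at p = 0.5 the reconstruction probability collapses to the support itself
print(reconstruction_probability(0.5, 0.4))
```

The function is symmetric around p = 0.5, matching the curves in Figure 1: swapping p for 1 − p exchanges the two terms of (10).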
The reconstruction probability curves of (10) are shown in Figure 1. We can see that the higher the item support is, the more easily a value of 1 on this item is reconstructed. Besides, the curves are symmetric around p_i = 0.5: the further p_i is from 0.5, the more easily a value of 1 is reconstructed. Therefore, for a given item, different probability parameters lead to different degrees of privacy protection. The parameters for the items are set by the data owner according to the properties of these items. For example, the parameter p_i for the item milk can be very high, even 1.0, while for drugs it should be closer to 0.5.

Under the B2C environment, the interviewer and the respondents cooperate with each other to set the parameter vector P; respondents then disturb their personal data and send them to the interviewer. Under the B2B environment, the data owner disturbs the original data with parameter vector P and sends the distorted data, together with P, to the third parties.
4.2. Itemset Support Reconstruction. After getting the distorted data D', the data miner reconstructs the supports of itemsets to find the frequent ones. In order to reconstruct the support of an itemset, we need to get the count of every combination of the items in this itemset. For example, in order to compute the support of itemset ABC in the original dataset D, that is, P(A = 1, B = 1, C = 1), we compute the counts of the 2³ combinations of ABC in D', that is, c'(A = 0, B = 0, C = 0), c'(A = 0, B = 0, C = 1), ..., and c'(A = 1, B = 1, C = 1). This is because the original value (A = 1, B = 1, C = 1) can be distorted into any of these 2³ combinations of ABC.

Let S be the set of combinations of the k items, as in (11), where S_j is the binary form of the value j, as in (12). For example, when k = 3 with items ABC, S_0 is (A = 0, B = 0, C = 0) and S_3 is (A = 0, B = 1, C = 1); that is, 011 is the binary form of 3:

    S = {S_0, S_1, ..., S_{2^k − 1}}    (11)

    S_i = binary(i, k)    (12)

A combination S_i in the original database can be distorted into any combination S_j in S. According to probability theory, the probability of S_i being distorted into S_j can be computed by

    r_ij = ∏((S_i ⊙ S_j) ∘ P^(k) + (S_i ⊕ S_j) ∘ (1 − P^(k)))    (13)
where ∘ is the Hadamard (element-wise) product of two vectors, ⊙ and ⊕ denote element-wise equality and exclusive-or, P^(k) is the vector of distorting probability parameters corresponding to the k items, and ∏(·) returns the product of the vector elements. That means that if the value on an item i is kept, then p^(i) is multiplied into r_ij; otherwise 1 − p^(i) is. An example with k = 3 in Table 1 illustrates the formulation of the transition matrix.

Corresponding to the combinations of the given k items defined in (11), we define their counts in the original database D and the distorted database D' in (14), where C is the count vector for D, C' is the count vector for D', and c_i and c'_i are the counts of the combination S_i in D and D', respectively:
    C = [c_0, c_1, ..., c_{2^k − 1}]^T,    C' = [c'_0, c'_1, ..., c'_{2^k − 1}]^T    (14)

Then the relationship between C and C' can be expressed by (15), according to the randomized response distortion mechanism:

    C' = R C    (15)
Table 1: An example of the transition matrix (k = 3; rows are original combinations, columns are distorted combinations).

          000                   ...   110                   111
    000   p1 p2 p3              ...   (1−p1)(1−p2) p3       (1−p1)(1−p2)(1−p3)
    001   p1 p2 (1−p3)          ...   (1−p1)(1−p2)(1−p3)    (1−p1)(1−p2) p3
    ...   ...                   ...   ...                   ...
    111   (1−p1)(1−p2)(1−p3)    ...   p1 p2 (1−p3)          p1 p2 p3
After getting the transition matrix R, with elements computed by (13), we can reconstruct the counts of the combinations by (16); hence we can evaluate the performance of the reconstruction by computing the difference between C and the estimate Ĉ:

    Ĉ = R⁻¹ C'    (16)

The count of the last combination divided by the number of transactions is the estimated support of the corresponding itemset, as shown in (17), where m is the number of transactions in D:

    ŝ = ĉ_{2^k − 1} / m    (17)
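A minimal sketch of the reconstruction pipeline of (13), (15), and (16), in pure Python with a small Gaussian-elimination solver; the parameter vector and distorted counts are illustrative:

```python
# Hedged sketch: build the 2^k x 2^k transition matrix whose entries follow
# (13), then recover the original combination counts via C_hat = R^{-1} C'
# as in (16). Convention assumed here: R[i][j] = P(original S_j -> distorted
# S_i), so that C' = R C; R is symmetric under this mechanism anyway.

def transition_matrix(P):
    k = len(P)
    n = 2 ** k
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):                 # original combination S_j
        for i in range(n):             # distorted combination S_i
            prob = 1.0
            for bit in range(k):
                same = ((i >> bit) & 1) == ((j >> bit) & 1)
                prob *= P[bit] if same else 1.0 - P[bit]
            R[i][j] = prob
    return R

def solve(A, b):
    """Gauss-Jordan elimination with partial pivoting: solve A x = b."""
    n = len(A)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col and M[r][col] != 0.0:
                factor = M[r][col] / M[col][col]
                for c in range(col, n + 1):
                    M[r][c] -= factor * M[col][c]
    return [M[r][n] / M[r][r] for r in range(n)]

P = [0.9, 0.85, 0.8]
R = transition_matrix(P)
# toy distorted counts C' over combinations 000..111 (illustrative numbers)
C_prime = [500.0, 60.0, 70.0, 20.0, 80.0, 30.0, 40.0, 200.0]
C_hat = solve(R, C_prime)          # C_hat = R^{-1} C', as in (16)
print(C_hat[-1])                   # estimated count of combination 111
```

Dividing `C_hat[-1]` by the number of transactions gives the estimated support of (17).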
4.3. Mining Frequent Itemsets from the Distorted Dataset. We mine the frequent itemsets by using the classic Apriori algorithm, which employs an iterative approach to search for frequent itemsets. The frequent k-itemsets are used to generate the candidate (k + 1)-itemsets by AprioriGen. Then Apriori computes the support of the candidate (k + 1)-itemsets and filters out the itemsets with support less than the minimum support. These operations run iteratively until no more frequent itemsets can be found.
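The AprioriGen join-and-prune step described above can be sketched as follows; itemsets are represented as sorted tuples, and the frequent 2-itemsets are illustrative:

```python
from itertools import combinations

# Hedged sketch of the AprioriGen step: join frequent k-itemsets sharing a
# (k-1)-prefix, then prune candidates that have any infrequent k-subset.

def apriori_gen(frequent_k):
    frequent = set(frequent_k)
    k = len(frequent_k[0])
    candidates = []
    for a, b in combinations(sorted(frequent_k), 2):
        if a[:-1] == b[:-1]:                       # join step
            cand = a + (b[-1],)
            if all(sub in frequent                 # prune step
                   for sub in combinations(cand, k)):
                candidates.append(cand)
    return candidates

frequent_2 = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]
print(apriori_gen(frequent_2))  # [('A', 'B', 'C')]
```

Here ("B", "C", "D") is generated by the join but pruned because its subset ("C", "D") is not frequent.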
In the personalized privacy-preserving frequent itemset mining, we need to reconstruct the support of candidate itemsets from the distorted database. For a candidate itemset, we compute the counts of all its item combinations. In order to accelerate the counting, we first remove the items with support less than the minimum support. Then, for a candidate itemset X = {I_1, ..., I_k}, we extract the subdataset D(X) = D(I_1, ..., I_k) from the database D'. In the subdataset D(X), we map each transaction t ∈ D(X) into a value by

    v = t · b = (t_1, t_2, ..., t_k) · (2^{k−1}, 2^{k−2}, ..., 2^0)    (18)

By (18), a vector V is computed, wherein v_i = D(X)_i · b. Then c'_j in (14) is the count of elements in V equal to j. By this method it is very fast to compute the vector C'. Figure 2 gives an example of transaction mapping that maps each transaction t into a value. By computing the count of each value from 0 to 7, we get the counts of all the combinations for the itemset ACD. By using (16) and (17), we can then get the support of the itemset X.
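The mapping of (18) can be sketched as follows; the subdataset mirrors the style of Figure 2 rather than reproducing its exact rows:

```python
# Hedged sketch of the transaction-mapping trick (18): each transaction,
# restricted to a candidate itemset's k columns, is mapped to the integer
# v = t . (2^{k-1}, ..., 2^0); counting the integers yields C' directly.

def map_and_count(sub_transactions, k):
    counts = [0] * (2 ** k)
    for t in sub_transactions:
        v = 0
        for bit in t:                # most significant item first
            v = (v << 1) | bit
        counts[v] += 1
    return counts

# toy distorted subdataset restricted to the itemset {A, C, D}
D_X = [(0, 0, 0), (0, 1, 0), (0, 0, 0), (0, 1, 1), (0, 1, 1),
       (1, 1, 1), (1, 0, 0), (1, 0, 1), (1, 0, 0), (1, 1, 1)]
print(map_and_count(D_X, 3))  # [2, 0, 1, 2, 2, 1, 0, 2]
```

The resulting list is exactly the count vector C' of (14) for this itemset, ready for the inversion in (16).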
In our personalized privacy-preserving frequent itemset mining, we iteratively generate the candidate (k + 1)-itemsets from the frequent k-itemsets and check the support of each candidate itemset using the itemset support reconstruction.
[Figure 2: An example of transaction mapping. Each transaction's values on the itemset ACD (000, 010, 000, 011, 011, ..., 111, 100, 101, 100, 111) are mapped by t · b to an integer from 0 to 7, whose occurrences are then counted.]
5. Experimental Results
In this section we evaluate the performance of the personalized privacy-preserving frequent itemset mining on a real dataset and on synthetic datasets generated by the classic IBM Quest Market-Basket Synthetic Data Generator (the C++ source code can be downloaded at http://www.cs.loyola.edu/~cgiannel/assoc_gen.html).
5.1. Datasets. The synthetic datasets are generated by the IBM Almaden generator with the parameters T, I, D, and N [1], where T is the average size of the transactions, I is the average size of the maximal potentially large itemsets, D is the number of transactions, and N is the number of items. We generated two synthetic datasets: T3I4D500KN10 and T40I10D100KN942.
The real dataset is BMS-WebView-1 [21], which contains the click-stream data from the website of a legwear and legcare retailer. The dataset contains 59,602 transactions and 497 items. We scaled the dataset by a factor of 10 and obtained the dataset BMS-WebView-1x10, which contains 596,020 transactions and 497 items.
5.2. Evaluation Metrics. We evaluate our method on the distorted database by two kinds of error metrics [7]: the support error and the identity errors. Let F be the frequent itemsets discovered from the original database D by the Apriori algorithm, and let F̂ represent the reconstructed frequent itemsets mined from the distorted database by our method.

The support error metric is given by (19), which reflects the relative error in the support values. We measure the average error over the frequent itemsets that are correctly identified as frequent, that is, F ∩ F̂, under the given minimum support s_min:

    ρ = (1 / |F ∩ F̂|) Σ_{f ∈ F ∩ F̂} |ŝ_f − s_f| / s_f    (19)
The identity error metrics are given in (20): σ⁺ indicates the percentage of false positives, that is, the percentage of reconstructed frequent itemsets that do not exist among the original frequent itemsets; σ⁻ indicates the percentage of false negatives, that is, the percentage of original frequent itemsets that are not correctly reconstructed as frequent:

    σ⁺ = (|F̂| − |F ∩ F̂|) / |F|,    σ⁻ = (|F| − |F ∩ F̂|) / |F|    (20)
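A sketch of the three metrics on toy itemset collections; the supports and itemsets below are illustrative, not experimental results:

```python
# Hedged sketch of the evaluation metrics (19)-(20): average relative
# support error rho over F ∩ F_hat, and the identity errors sigma+/sigma-.

def support_error(true_supports, est_supports):
    """rho: average relative support error over correctly found itemsets."""
    common = set(true_supports) & set(est_supports)
    return sum(abs(est_supports[f] - true_supports[f]) / true_supports[f]
               for f in common) / len(common)

def identity_errors(F, F_hat):
    """(sigma_plus, sigma_minus): false-positive and false-negative ratios."""
    common = len(F & F_hat)
    return (len(F_hat) - common) / len(F), (len(F) - common) / len(F)

true_supports = {("A",): 0.40, ("B",): 0.30, ("A", "B"): 0.20}
est_supports = {("A",): 0.38, ("B",): 0.33, ("C",): 0.25}
F, F_hat = set(true_supports), set(est_supports)
print(support_error(true_supports, est_supports))   # (0.05 + 0.1) / 2
print(identity_errors(F, F_hat))                    # one false +, one false -
```

Here ("C",) is a false positive and ("A", "B") a false negative, so both identity errors equal 1/3.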
The corresponding metrics are illustrated in Figure 3. For the metric ρ, the intersection of the two frequent itemset collections F and F̂ may be empty; in that case we do not compute ρ.
5.3. Results Analysis. We evaluate our method by measuring the support error and the two identity errors given in formulas (19) and (20), and we conduct our experiments on the three datasets described in Section 5.1. We compare our method with MASK, which distorts the original dataset with only one parameter value p. If the personalized distortion vector of our method is P, then p = min(P). In our experiments, the elements of the personalized distortion vector P in formula (5) are generated following the uniform distribution over the range [0.8, 0.95]; the average value of vector P is 0.8741. In order to protect all the items or attributes in a dataset, the parameter p in MASK must be set to 0.8. Therefore we compare our method with MASK with parameter p = 0.8.
5.3.1. Comparisons with Different Minimum Support. The first experiment was conducted on T3I4D500KN10. In this experiment we set the minimum support from 0.05% to 0.95% with a step length of 0.05%. For each minimum support, we ran MASK and our personalized method 100 times and computed the average values and standard deviations of the evaluation metrics. The results are shown in Figures 4, 5, and 6.
From Figures 4, 5, and 6, we can see that our method achieves smaller errors than MASK. This is because MASK has to distort all the items or attributes at the maximum distortion level in order to protect each item
[Figure 3: Evaluation metrics, illustrating ρ on F ∩ F̂ and the identity errors σ⁺ and σ⁻ in terms of the sets F and F̂.]
[Figure 4: Support error ρ versus minimum support on T3I4D500KN10 (MASK vs. Personalized).]
[Figure 5: Frequent itemset added ratio σ⁺ versus minimum support on T3I4D500KN10 (MASK vs. Personalized).]
or attribute, while our method distorts different items at different distortion levels. The results show that our method incurs only half of MASK's support error ρ on the dataset T3I4D500KN10. Besides, from Figures 5 and 6, we can see that our method has much smaller deviations of the identity errors.
[Figure 6: Frequent itemset lost ratio σ⁻ versus minimum support on T3I4D500KN10 (MASK vs. Personalized).]
[Figure 7: Support error ρ versus minimum support on BMS-WebView-1x10 (MASK-0.8, MASK-0.8741, MASK-0.95, and Personalized).]
We conducted the same experiments on the real dataset BMS-WebView-1x10. For this dataset we set the minimum support from 0.2% to 0.55% with a step length of 0.05%, and for each minimum support we ran the algorithms 20 times. As the average value of the elements in vector P is 0.8741, we also ran MASK with parameter p = 0.8741. Besides, we ran MASK with p set to the maximum value of P, that is, 0.95.
We show the results in Figures 7, 8, and 9; they show that our method is much better than MASK with p = 0.8. However, the results of our method are very similar to the results of MASK with p = 0.8741, the average value of vector P in our method. Similar results can be found in Figures 13, 14, and 15 on the dataset T40I10D100KN942. But MASK with p = 0.8741 cannot protect all the items or attributes, because some attributes need protection with a privacy level smaller than 0.8741, such as 0.85. For the three metrics, the results of MASK with p = 0.95 are better than those of our method; this is because the privacy protection level of each attribute in our method is between 0.8 and 0.95.
5.3.2. The Impact of the Size of the Dataset. We conduct a second experiment to evaluate the impact of the size of the
[Figure 8: Frequent itemset added ratio σ⁺ versus minimum support on BMS-WebView-1x10 (MASK-0.8, MASK-0.8741, MASK-0.95, and Personalized).]
Figure 9: Frequent itemset lost ratio σ− versus minimum support on BMS-WebView-1x10 (MASK-0.8, MASK-0.8741, MASK-0.95, and Personalized).
the dataset on the reconstruction errors given in formulas (19) and (20). The dataset BMS-WebView-1x10 is formed from the dataset BMS-WebView-1 by copying it 10 times. As before, we set the minimum support of the frequent itemset mining from 0.2% to 0.55% with a step length of 0.05% and run the algorithms 20 times for each minimum support.

The corresponding results in Figures 10, 11, and 12 show that the reconstruction errors on BMS-WebView-1x10 are much smaller than the errors on BMS-WebView-1, for both MASK and our method. Note that the frequent itemsets discovered from BMS-WebView-1 are the same as the frequent itemsets discovered from BMS-WebView-1x10. This suggests that, to improve the reconstruction results, we can copy the original dataset many times, distort the new dataset, and send it to the cooperating party. The cooperating party can then accurately discover the interesting patterns without learning the original private information.
5.3.3. Comparisons with Different Lengths of Frequent Itemsets. The third experiment is conducted on the dataset
Figure 10: Support error ρ versus minimum support on datasets of different sizes (MASK and Personalized, on BMS-WebView-1 and BMS-WebView-1x10).
Figure 11: Frequent itemset added ratio σ+ versus minimum support on datasets of different sizes (MASK and Personalized, on BMS-WebView-1 and BMS-WebView-1x10).
Figure 12: Frequent itemset lost ratio σ− versus minimum support on datasets of different sizes (MASK and Personalized, on BMS-WebView-1 and BMS-WebView-1x10).
Figure 13: Support error ρ versus itemset length k on dataset T40I10D100KN942 (MASK-0.8, MASK-0.8741, MASK-0.95, and Personalized).
Figure 14: Frequent itemset added ratio σ+ versus itemset length k on dataset T40I10D100KN942 (MASK-0.8, MASK-0.8741, MASK-0.95, and Personalized).
T40I10D100KN942. In this experiment, we set the minimum support as 1.45%, run the corresponding algorithms 5 times, and take the average values. We again set the parameter p of MASK as 0.8, 0.8741, and 0.95 separately.

For the minimum support 1.45%, the maximum length of the original frequent itemsets is 8, and the number of frequent k-itemsets is shown in Table 2. We then compare the results for each itemset length from 1 to 8. The results shown in Figures 13, 14, and 15 demonstrate that our method performs better than MASK with p as 0.8 for every itemset length and performs very similarly to MASK with p as 0.8741.

In Figure 13, as the intersection of the frequent 8-itemsets reconstructed by MASK with p as 0.8 and the original frequent 8-itemsets is empty, the support error cannot be computed for that point and the value is not shown. No clear trend of the reconstruction error emerges as the itemset length increases. In Figure 14, when k is 8, the frequent itemset added ratio increases sharply. By carefully analyzing the results, we found that many newly reported frequent itemsets have support very close to the minimum support, which leads to the heavy identity error.
Figure 15: Frequent itemset lost ratio σ− versus itemset length k on dataset T40I10D100KN942 (MASK-0.8, MASK-0.8741, MASK-0.95, and Personalized).
Table 2: The number of frequent k-itemsets.

k     1    2     3    4    5    6    7    8
|F|   690  4869  293  368  483  469  331  25
6. Conclusions

In this paper, we solve the problem of how to provide personalized privacy protection on different attributes or items while discovering the frequent itemsets. Based on the classic randomized response technique, we proposed a personalized privacy-preserving method. Besides, we proposed a method to improve the efficiency of counting the frequent itemsets by mapping each itemset combination vector into a value. Experimental results show that the personalized privacy protection method performs much better than the traditional privacy protection method, which provides the same privacy protection for all items. The experimental results also show that we can copy the original dataset many times to create a new dataset and then distort this new dataset; the recipients can then discover the frequent itemsets with smaller error, at the expense of more computation and communication cost.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This research work was supported in part by the National Natural Science Foundation of China under Grants no. 61003231, no. 61103109, and no. 11105024 and by the Fundamental Research Funds for the Central Universities under ZYGX2011J057. The authors are grateful to the anonymous reviewers who made constructive comments.
References
[1] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in Proceedings of the 20th International Conference on Very Large Data Bases (VLDB '94), vol. 1215, pp. 487–499, Santiago de Chile, Chile, September 1994.
[2] L. F. Cranor, J. Reagle, and M. S. Ackerman, Beyond Concern: Understanding Net Users' Attitudes about Online Privacy, MIT Press, Cambridge, Mass, USA, 2000.
[3] J. R. Haritsa, "Mining association rules under privacy constraints," in Privacy-Preserving Data Mining, pp. 239–266, Springer, New York, NY, USA, 2008.
[4] X. Xiao and Y. Tao, "Personalized privacy preservation," in Proceedings of the ACM International Conference on Management of Data (SIGMOD '06), pp. 229–240, ACM, Chicago, Ill, USA, June 2006.
[5] X. Xiao, Y. Tao, and M. Chen, "Optimal random perturbation at multiple privacy levels," Proceedings of the VLDB Endowment, vol. 2, no. 1, pp. 814–825, 2009.
[6] A. C. Tamhane, "Randomized response techniques for multiple sensitive attributes," Journal of the American Statistical Association, vol. 76, no. 376, pp. 916–923, 1981.
[7] S. J. Rizvi and J. R. Haritsa, "Maintaining data privacy in association rule mining," in Proceedings of the 28th International Conference on Very Large Data Bases (VLDB '02), pp. 682–693, VLDB Endowment, Hong Kong, August 2002.
[8] V. S. Verykios and A. Gkoulalas-Divanis, "A survey of association rule hiding methods for privacy," in Privacy-Preserving Data Mining, pp. 267–289, Springer, New York, NY, USA, 2008.
[9] Y.-H. Wu, C.-M. Chiang, and A. L. P. Chen, "Hiding sensitive association rules with limited side effects," IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 1, pp. 29–42, 2007.
[10] E. D. Pontikakis, A. A. Tsitsonis, and V. S. Verykios, "An experimental study of distortion-based techniques for association rule hiding," in Research Directions in Data and Applications Security XVIII, vol. 144 of IFIP International Federation for Information Processing, pp. 325–339, Springer, 2004.
[11] E. D. Pontikakis, Y. Theodoridis, A. A. Tsitsonis, L. Chang, and V. S. Verykios, "A quantitative and qualitative analysis of blocking in association rule hiding," in Proceedings of the ACM Workshop on Privacy in the Electronic Society (WPES '04), pp. 29–30, ACM Press, Washington, DC, USA, October 2004.
[12] S.-L. Wang and A. Jafari, "Using unknowns for hiding sensitive predictive association rules," in Proceedings of the IEEE International Conference on Information Reuse and Integration (IRI '05), pp. 223–228, IEEE, Las Vegas, Nev, USA, August 2005.
[13] S. Xingzhi and P. S. Yu, "A border-based approach for hiding sensitive frequent itemsets," in Proceedings of the 5th IEEE International Conference on Data Mining (ICDM '05), pp. 426–433, IEEE, Houston, Tex, USA, November 2005.
[14] G. V. Moustakides and V. S. Verykios, "A max-min approach for hiding frequent itemsets," in Proceedings of the 6th IEEE International Conference on Data Mining Workshops (ICDM '06), pp. 502–506, Hong Kong, December 2006.
[15] S. Menon, S. Sarkar, and S. Mukherjee, "Maximizing accuracy of shared databases when concealing sensitive patterns," Information Systems Research, vol. 16, no. 3, pp. 256–270, 2005.
[16] A. Gkoulalas-Divanis and V. S. Verykios, "An integer programming approach for frequent itemset hiding," in Proceedings of the 15th ACM Conference on Information and Knowledge Management (CIKM '06), pp. 748–757, ACM, Arlington, Va, USA, November 2006.
[17] J. Vaidya and C. Clifton, "Privacy preserving association rule mining in vertically partitioned data," in Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '02), pp. 639–644, ACM, Alberta, Canada, July 2002.
[18] M. Kantarcioglu and C. Clifton, "Privacy-preserving distributed mining of association rules on horizontally partitioned data," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 9, pp. 1026–1037, 2004.
[19] N. Zhang, S. Wang, and W. Zhao, "A new scheme on privacy preserving association rule mining," in Knowledge Discovery in Databases: PKDD 2004, vol. 3202 of Lecture Notes in Computer Science, pp. 484–495, 2004.
[20] S. L. Warner, "Randomized response: a survey technique for eliminating evasive answer bias," Journal of the American Statistical Association, vol. 60, no. 309, pp. 63–66, 1965.
[21] Z. Zheng, R. Kohavi, and L. Mason, "Real world performance of association rule algorithms," in Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '01), pp. 401–406, ACM, San Francisco, Calif, USA, August 2001.
Two privacy environments are proposed in [3] based on the privacy mechanisms. The first one is B2B (business to business), which assures that the data obtained from the users will be preprocessed by privacy-preserving techniques before being supplied to the data miner. The second one is B2C (business to customer), which provides privacy protection at the point of data collection, that is, at the user site. In order to protect the sensitive information, we propose methods which can be used under both environments. We focus on the privacy of the input data in the frequent itemset mining process. Under the B2C environment, the personal data are randomized after each user provides them and are then sent to the data collector. Under the B2B environment, the collected dataset is randomized before being outsourced or shared with others, even publicly.
Although many works have been proposed to handle the privacy problem in the frequent itemset mining process, few of them focus on the personalized privacy problem. Xiao and Tao [4] proposed the personalized privacy preservation problem, considering that different people need different levels of privacy protection, and they used generalization techniques. Besides, they also raised the problem of multipublishing [5], wherein different recipients can have different privacy protection levels on the original dataset.
In this paper, we define personalized privacy as different privacy levels on different attributes or items in the process of frequent itemset mining. For example, a customer does not care too much about the disclosure of the information that he bought daily necessities, but he will be heavily concerned about the privacy problem if he bought some sensitive items. If we provide the same level of privacy protection for all items, too much data utility is lost in order to satisfy the privacy requirements of every item, because the nonsensitive items are also disturbed too heavily. Hence, it is necessary to provide different degrees of privacy protection for different items or attributes.
To solve the above personalized privacy problem, we use the randomized response technique [6]. Based on MASK [7], we give the solution on how to reconstruct the associations under the different privacy protections. Our method can be used to preserve privacy under both the B2B and the B2C environments. Besides, we design a method to accelerate the itemset support computation. Finally, our experimental results show that the performance of our method is better than that of the traditional method with respect to accuracy, while satisfying the different privacy requirements.
The rest of this paper is organized as follows. In Section 2, we review the previous related works. Section 3 gives some related preliminaries and defines the personalized privacy problem. Then, in Section 4, we describe the proposed method for personalized privacy protection and show how to discover the frequent itemsets from the randomized data. We evaluate our technique in Section 5 by experiments and conclude our work in Section 6.
2. Related Works
There are many works on the privacy problem in association rule mining, and they can be divided into two research streams [3]: input data privacy and output rule privacy. Most of them focus on the second one. To preserve the input data privacy, some methods are adopted to disturb the original data so that an attacker cannot obtain the true data of the users. For the output rule privacy, the collected data are heuristically altered in order to prevent some sensitive rules from being mined from the dataset by data miners.

The association rule hiding algorithms, which protect the output rule privacy, can be divided into three categories [8], namely, heuristic approaches, border-based approaches, and exact approaches. The first class of approaches selectively sanitizes a set of transactions from the database to hide the sensitive rules efficiently, but suffers from high side effects. Two classic techniques in this class are distortion [9, 10] and blocking [11, 12]. The second set of approaches [13, 14] hides the sensitive rules by modifying the original borders in the lattice of the frequent and infrequent itemsets in the database. The third set of approaches [15, 16] hides the sensitive rules by solving a constraint satisfaction problem using integer or linear programming; these can hide the rules optimally, but with high computation cost.

In this paper, we solve the problem of input data privacy, and the corresponding approaches can be divided into cryptography-based and reconstruction-based methods. The cryptography-based approaches handle the setting in which several partners want to discover shared association rules from the global data without disclosing their sensitive data to the others. The global data may be vertically partitioned [17] or horizontally partitioned [18] and distributed among many partners. The reconstruction-based methods first randomize the original data to hide the sensitive data and then reconstruct the interesting patterns based on statistical features, without knowing the true values. For centralized data, there are two distortion strategies: statistical distortion and algebraic distortion. Rizvi and Haritsa [7] proposed the MASK approach to distort the original sparse Boolean databases; this method retains each 0 or 1 bit in the database with probability p and flips the value with probability 1 − p. Zhang et al. [19] proposed a solution with algebraic distortion: using the eigenvalues and eigenvectors, the data of the users can be distorted by matrix transformation and noise addition.

Our work focuses on the input data privacy problem of centralized data. The frequent itemset mining is conducted on distorted data randomized by the randomized response technique. None of the above algorithms address the personalized privacy problem that we solve in this paper.
3. Preliminaries and Problem Definition
In this section, we introduce the related preliminaries about privacy-preserving frequent itemset mining and give the problem definition of this paper.
3.1. Association Rule Mining. Let I be a set of items and D the database containing a set of transactions, wherein each transaction T is a set of items such that T ⊆ I. A transaction T is said to contain X if and only if X ⊆ T. A rule is an implication of the form X ⇒ Y, where X, Y ⊂ I and X ∩ Y = ∅. The support of the rule X ⇒ Y is defined in (1), and its confidence is defined in (2):

supp(X ⇒ Y) = P(X ∪ Y),  (1)

conf(X ⇒ Y) = P(Y | X) = supp(X ∪ Y) / supp(X),  (2)

where P(X ∪ Y) is the percentage of transactions in D that contain X ∪ Y, and supp(X) is defined as the percentage of transactions containing X.

A set of items X is said to be frequent if and only if supp(X) is greater than the user-defined minimum support; then X is a frequent itemset. If the rule X ⇒ Y has support greater than the minimum support and confidence greater than the minimum confidence, then X ⇒ Y is an interesting rule, that is, an association rule. From (1) and (2), it follows that finding the association rules is effectively equivalent to generating all the frequent itemsets with support greater than the minimum support. Therefore, we focus on the frequent itemset mining.
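Definitions (1) and (2) can be made concrete with a small sketch; the toy transaction database below is hypothetical and serves only to illustrate the support/confidence computation:

```python
# Toy transaction database D over items {A, B, C, D}; hypothetical data
# used only to illustrate formulas (1) and (2).
D = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C", "D"},
]

def supp(itemset, db):
    """Fraction of transactions containing every item of `itemset` (formula (1))."""
    return sum(itemset <= t for t in db) / len(db)

def conf(X, Y, db):
    """Confidence of the rule X => Y (formula (2))."""
    return supp(X | Y, db) / supp(X, db)

print(supp({"A", "B"}, D))            # → 0.6  (3 of 5 transactions contain {A, B})
print(round(conf({"A"}, {"B"}, D), 2))  # → 0.75  (supp(AB) / supp(A) = 0.6 / 0.8)
```

With a minimum support of 0.6, {A, B} would be a frequent itemset here, and A ⇒ B an association rule if the minimum confidence is at most 0.75.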
3.2. Randomized Response. The randomized response technique was first introduced by Warner [20] to solve the statistical survey problem for sensitive questions. For example, social scientists may want to know how many people in some area use drugs. Usually, respondents are reluctant to answer this kind of question directly. Hence, Warner proposed the randomized response technique.
To ask a binary-choice question about whether people have a sensitive attribute A, the randomized response technique poses two related statements:

(1) I have the sensitive attribute A.
(2) I do not have the sensitive attribute A.

Each respondent decides which statement to answer with probability p, so the interviewer only receives a "yes" or "no" without knowing which statement the respondent answered. If a respondent has the sensitive attribute A, then he answers "yes" (to the first statement) with probability p or "no" with probability 1 − p. If a person does not have A, then he answers "no" with probability p (to the first statement) and "yes" with probability 1 − p (to the second statement). Hence, the probability that the interviewer receives the answer "yes" is given by (3), and the probability of receiving "no" by (4):

P(ans = yes) = P(A = yes) · p + P(A = no) · (1 − p),  (3)

P(ans = no) = P(A = no) · p + P(A = yes) · (1 − p),  (4)

where P(A = yes) is the proportion of respondents that have the attribute A and P(A = no) is the proportion of respondents that do not. The interviewer can obtain the proportion of respondents having the attribute A, P(A = yes), by solving (3) and (4).
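For p ≠ 0.5, formula (3) can be solved directly for P(A = yes). A minimal simulation sketch (the population size, seed, and true proportion are hypothetical, chosen only to illustrate the estimator):

```python
import random

random.seed(7)  # fixed seed so the sketch is reproducible

def randomized_answer(has_attribute, p):
    """Warner's scheme: with probability p the respondent answers statement (1)
    ("I have the sensitive attribute A"), otherwise statement (2)
    ("I do not have A"); only the final "yes"/"no" is revealed."""
    if random.random() <= p:                  # answers statement (1) truthfully
        return "yes" if has_attribute else "no"
    return "no" if has_attribute else "yes"   # answers statement (2) truthfully

def estimate_proportion(answers, p):
    """Solve formula (3) for P(A = yes), assuming p != 0.5."""
    p_yes = sum(a == "yes" for a in answers) / len(answers)
    return (p_yes - (1 - p)) / (2 * p - 1)

# Hypothetical population in which about 30% truly have the attribute.
population = [random.random() < 0.3 for _ in range(200_000)]
answers = [randomized_answer(x, p=0.8) for x in population]
print(round(estimate_proportion(answers, p=0.8), 2))  # close to 0.30
```

No individual answer reveals the respondent's true status, yet the aggregate proportion is recovered accurately.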
3.3. Problem Definition. In this paper, we solve the data privacy problem with respect to frequent itemset mining under the B2B environment as well as the B2C environment.

Under the B2C model, the interviewer can conduct a survey containing sensitive questions. For example, questions such as "whether you are divorced," "whether you have criminal records," or "whether you use drugs" are sensitive. This kind of survey can analyze the associations between factors and indeed reflect sociological problems. In this model, the respondents distort their survey vectors with the given parameters and then send them to the interviewers.

Under the B2B model, the data owner, such as a supermarket manager, distorts the original data before sharing it with other business partners. Taking market transactions as an example, customers are reluctant to let others know what they bought, especially some sensitive products. By taking privacy protection measures, the transaction details can be hidden so that the personal privacy is not leaked.

Considering the data utility and the different degrees of privacy requirements, the protection on different attributes or items should be different. For example, a customer does not care too much about the disclosure that he bought papers, while being heavily concerned about some sensitive items.
The problem of personalized privacy-preserving frequent itemset mining is the following: given the original transaction dataset D, how to distort D into D′ satisfying the different privacy requirements, and how to mine frequent itemsets from D′ so that they are as close as possible to the frequent itemsets mined from D.
4. Personalized Privacy-Preserving Frequent Itemset Mining

In this section, we present the procedure for distorting the original data with the randomized response technique while satisfying the personalized privacy requirement. Then we give the method to reconstruct itemset supports from the distorted data. Finally, we devise a personalized privacy-preserving frequent itemset mining algorithm based on the Apriori algorithm.
4.1. Personalized Data Distortion. Suppose there are n items in the database D. Each transaction in D is represented by a binary vector T: T_i = 1 if the transaction contains item i; otherwise T_i = 0. For each item i, 1 ≤ i ≤ n, there is a probability parameter p_i (0 ≤ p_i ≤ 1) used to distort the data on item i, and these parameters form the vector shown in (5). The distorted value of a transaction on item i is then given by (6), where r_i is a value drawn at random from the uniform distribution over the interval [0, 1]:

P = (p_1, p_2, ..., p_n),  (5)

T′_i = T_i if r_i ≤ p_i; otherwise T′_i = 1 − T_i.  (6)

From (6), we can see that the i-th value of transaction T is kept with probability p_i and flipped with probability 1 − p_i.
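A minimal sketch of this per-item distortion (the parameter values below are hypothetical; nonsensitive items get p_i close to 1, sensitive ones closer to 0.5):

```python
import random

def distort(transaction, P):
    """Personalized randomized response (formula (6)): bit i of the binary
    transaction vector is kept with probability P[i] and flipped with
    probability 1 - P[i]."""
    return [t if random.random() <= p else 1 - t
            for t, p in zip(transaction, P)]

# Per-item retention parameters (formula (5)); hypothetical values.
P = [0.95, 0.95, 0.85, 0.60]
T = [1, 0, 1, 0]
print(distort(T, P))  # one randomized copy of T; lower p_i means more flips
```

The fourth item (p = 0.60) is flipped in 40% of the randomized copies, so an observer learns little about its true value, while the first two items are almost always reported truthfully.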
Figure 1: Reconstruction probability R(p_i, s_i) as a function of p_i, plotted for item supports s_i = 0.01, 0.1, 0.3, 0.5, 0.8, and 0.95.
The proportion of T′_i = 1 can be calculated by (7), and the proportion of T′_i = 0 by (8):

P(T′_i = 1) = P(T_i = 1) · p_i + P(T_i = 0) · (1 − p_i),  (7)

P(T′_i = 0) = P(T_i = 0) · p_i + P(T_i = 1) · (1 − p_i).  (8)
For the personalized privacy requirements, different items are distorted with different probability parameters. For a given parameter p_i, the probability of correctly reconstructing the value T_i = 1 is calculated by

R(p_i, s_i) = P(T′_i = 1 | T_i = 1) · P(T_i = 1 | T′_i = 1) + P(T′_i = 0 | T_i = 1) · P(T_i = 1 | T′_i = 0).  (9)

The first part of (9) covers the case in which the original value T_i = 1 is distorted into T′_i = 1 and then reconstructed from T′_i = 1; the second part covers the distorted value T′_i = 0. The probability can then be computed by (10), where s_i is the support of item i in the original database; a similar derivation of (10) can be found in [7]:

R(p_i, s_i) = s_i p_i² / (s_i p_i + (1 − s_i)(1 − p_i)) + s_i (1 − p_i)² / (s_i (1 − p_i) + (1 − s_i) p_i).  (10)
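Formula (10) can be checked numerically. The sketch below (illustrative parameter values) confirms the symmetry around p_i = 0.5 visible in Figure 1, and also that at p_i = 0.5 the expression reduces to s_i, the prior probability of the value 1:

```python
def recon_prob(p, s):
    """Formula (10): probability that an original value 1 on an item with
    support s and retention parameter p is correctly reconstructed."""
    return (s * p**2) / (s * p + (1 - s) * (1 - p)) \
         + (s * (1 - p)**2) / (s * (1 - p) + (1 - s) * p)

# Symmetric around p = 0.5: R(p, s) and R(1 - p, s) agree (up to rounding).
print(round(recon_prob(0.9, 0.3), 6), round(recon_prob(0.1, 0.3), 6))

# At p = 0.5 the distorted bit carries no information, so R(0.5, s) = s.
print(recon_prob(0.5, 0.3))  # → 0.3
```

Moving p_i away from 0.5 in either direction increases R(p_i, s_i), which is why parameters near 0.5 give the strongest protection.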
The reconstruction probability curves of (10) are shown in Figure 1. We can see that the higher the support of an item is, the more easily its value 1 is reconstructed. Besides, the curves are symmetric around p_i = 0.5: the further p_i is from 0.5, the more easily the value 1 is reconstructed. Therefore, for a given item, different probability parameters lead to different degrees of privacy protection. The parameters are set by the data owner according to the properties of the items. For example, the parameter p_i for the item milk can be very high, even 1.0, while for drugs it should be closer to 0.5.

Under the B2C environment, the interviewer and the respondents cooperate with each other to set the parameter vector P; then the respondents distort their personal data and send them to the interviewer. Under the B2B environment, the data owner distorts the original data with the parameter vector P and sends the distorted data, together with P, to the third parties.
4.2. Itemset Support Reconstruction. After receiving the distorted data D′, the data miner reconstructs the supports of itemsets to find the frequent ones. In order to reconstruct the support of an itemset, we need the count of every combination of the items in this itemset. For example, in order to compute the support of the itemset ABC in the original dataset D, that is, P(A = 1, B = 1, C = 1), we compute the counts of the 2³ combinations of ABC in D′, that is, c′(A = 0, B = 0, C = 0), c′(A = 0, B = 0, C = 1), ..., and c′(A = 1, B = 1, C = 1). This is because the original value (A = 1, B = 1, C = 1) can be distorted into any of these 2³ combinations of ABC.

Let S be the set of combinations of the k items, as in (11), where S_j is the binary form of the value j, as in (12). For example, when k = 3 with items ABC, S_0 is (A = 0, B = 0, C = 0) and S_3 is (A = 0, B = 1, C = 1); that is, 011 is the binary form of 3:

S = {S_0, S_1, ..., S_{2^k − 1}},  (11)

S_i = binary(i, k).  (12)
A combination S_i in the original database can be distorted into any combination S_j in S. The probability of S_i being distorted into S_j can be computed by

r_ij = ∏((S_i ⊙ S_j) ∘ P^(k) + (S_i ⊕ S_j) ∘ (1 − P^(k))),  (13)

where ⊕ is the element-wise XOR of two binary vectors (1 where the bits differ), ⊙ is its complement (1 where the bits agree), ∘ is the Hadamard product of two vectors, P^(k) is the vector of the distortion parameters corresponding to the k items, and ∏(·) returns the product of the vector elements. That is, if the value on an item i is kept, then p_(i) is multiplied into r_ij; otherwise, 1 − p_(i) is. An example with k = 3 in Table 1 illustrates the construction of the transition matrix.

Corresponding to the combinations of the given k items defined in (11), we define their counts in the original database D and the distorted database D′, respectively, in (14), where C is the count vector for D, C′ is the count vector for D′, and c_i and c′_i are the counts of the combination S_i of the k items in D and D′, respectively:

C = (c_0, c_1, ..., c_{2^k − 1})^T,  C′ = (c′_0, c′_1, ..., c′_{2^k − 1})^T.  (14)

Then the relationship between C and C′ can be expressed by (15), according to the randomized response distortion mechanism:

C′ = R C.  (15)
Table 1: An example of the transition matrix (k = 3).

        000                        ⋯   110                        111
000     p_1 p_2 p_3                ⋯   (1 − p_1)(1 − p_2) p_3     (1 − p_1)(1 − p_2)(1 − p_3)
001     p_1 p_2 (1 − p_3)          ⋯   (1 − p_1)(1 − p_2)(1 − p_3)   (1 − p_1)(1 − p_2) p_3
⋮
111     (1 − p_1)(1 − p_2)(1 − p_3)   ⋯   p_1 p_2 (1 − p_3)       p_1 p_2 p_3
After obtaining the transition matrix R, with elements computed by (13), we can reconstruct the counts of the combinations by (16); we then evaluate the performance of the reconstruction by computing the difference between C and the estimate Ĉ:

Ĉ = R⁻¹ C′.  (16)

The count of the last combination, divided by the number of transactions, is the estimated support of the corresponding itemset, as shown in (17), where m is the number of transactions in D:

ŝ = ĉ_{2^k − 1} / m.  (17)
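The pipeline (13)-(17) can be sketched for k = 3 as follows (hypothetical counts and parameters; NumPy is used for the linear algebra). Here C′ is taken as the expected distorted counts RC, so the round trip is exact; on real distorted data, C′ is a sample and the reconstruction is only approximate:

```python
import numpy as np

def transition_matrix(P):
    """Build the 2^k x 2^k matrix R of formula (13): entry (i, j) is the
    probability of distorting combination S_j into S_i, so that C' = R C
    (formula (15)). The matrix is symmetric, since the probability depends
    only on which bits agree. P holds the k per-item parameters."""
    k = len(P)
    n = 2 ** k
    R = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            prob = 1.0
            for b in range(k):  # bit b = 0 is the last item of the itemset
                same = ((i >> b) & 1) == ((j >> b) & 1)
                prob *= P[k - 1 - b] if same else 1 - P[k - 1 - b]
            R[i, j] = prob
    return R

# Hypothetical counts C of the 2^3 combinations of an itemset ABC in D
# (1000 transactions in total, 30 of them containing all of A, B, C).
C = np.array([500.0, 120, 80, 60, 90, 70, 50, 30])
P = [0.9, 0.85, 0.8]
R = transition_matrix(P)
C_dist = R @ C                      # expected counts in the distorted D'
C_hat = np.linalg.solve(R, C_dist)  # formula (16): reconstruct the counts
s_hat = C_hat[-1] / C.sum()         # formula (17): estimated support of ABC
print(round(s_hat, 4))  # recovers 30/1000 = 0.03
```

Using `np.linalg.solve` rather than explicitly forming R⁻¹ is the standard, numerically safer way to apply formula (16).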
4.3. Mining Frequent Itemsets from the Distorted Dataset. We mine the frequent itemsets by using the classic Apriori algorithm, which employs an iterative approach to search for the frequent itemsets. The frequent k-itemsets are used to generate the candidate (k + 1)-itemsets by the AprioriGen procedure. Apriori then computes the supports of the candidate (k + 1)-itemsets and filters out the itemsets with support less than the minimum support. These operations run iteratively until no more frequent itemsets can be found.
In the personalized privacy-preserving frequent itemset mining, we need to reconstruct the supports of the candidate itemsets from the distorted database. For a candidate itemset, we compute the counts of all of its item combinations. In order to accelerate the counting, we first remove the items with support less than the minimum support. Then, for the candidate itemset X = {I_1, ..., I_k}, we extract the subdataset D(X) = D(I_1, ..., I_k) from the database D′. In the subdataset D(X), we map each transaction t ∈ D(X) into a value by

v = t · b = (t_1, t_2, ..., t_k) · (2^{k−1}, 2^{k−2}, ..., 2⁰).  (18)

By (18), a vector V is computed wherein v_i = D(X)_i · b. Then c′_j in (14) is the count of the elements in V equal to j. With this method, the vector C′ can be computed very quickly. Figure 2 gives an example of the transaction mapping: each transaction t is mapped into a value, and by counting the occurrences of each value from 0 to 7 we obtain the counts of all the combinations of the itemset ACD. Using (16) and (17), we then obtain the support of the itemset X.
In our privacy-preserving personalized frequent itemset mining, we iteratively generate the candidate (k + 1)-itemsets from the frequent k-itemsets and check the support of each candidate itemset using the itemset support reconstruction described in Section 4.2.
Figure 2: An example of transaction mapping: each transaction t over the itemset ACD is mapped to the value t · b ∈ {0, 1, ..., 7}.
5. Experimental Results

In this section, we evaluate the performance of the personalized privacy-preserving frequent itemset mining on a real dataset and on synthetic datasets generated by the classic IBM Quest Market-Basket Synthetic Data Generator (the C++ source code can be downloaded at http://www.cs.loyola.edu/~cgiannel/assoc_gen.html).
5.1. Datasets. The synthetic datasets are generated by the IBM Almaden generator with the parameters T, I, D, and N [1], where T is the average size of the transactions, I is the average size of the maximal potentially large itemsets, D is the number of transactions, and N is the number of items. We generated two synthetic datasets, T3I4D500KN10 and T40I10D100KN942.
The real dataset is BMS-WebView-1 [21], which contains the click-stream data from the website of a legwear and legcare retailer. The dataset contains 59602 transactions and 497 items. We scaled the dataset by a factor of 10 and obtained the dataset BMS-WebView-1x10, which contains 596020 transactions and 497 items.
5.2. Evaluation Metrics. We evaluate our method on the distorted database by two kinds of error metrics [7]: the support error and the identity errors. Let F be the frequent itemsets discovered from the original database D by the Apriori algorithm, and let F̂ represent the reconstructed frequent itemsets mined from the distorted database by our method.
The support error metric is evaluated by (19), which reflects the relative error in the support values. We measure the average error over the frequent itemsets that are correctly identified to be frequent, that is, F ∩ F̂, with the given minimum support s_min:

ρ = (1 / |F ∩ F̂|) ∑_{f ∈ F ∩ F̂} |ŝ_f − s_f| / s_f. (19)
The identity error metrics are given in (20): σ+ indicates the percentage of false positives, that is, the percentage of reconstructed frequent itemsets which do not exist in the original frequent itemsets; σ− indicates the percentage of false negatives, that is, the percentage of original frequent itemsets which are not correctly reconstructed as frequent:

σ+ = (|F̂| − |F ∩ F̂|) / |F|,  σ− = (|F| − |F ∩ F̂|) / |F|. (20)
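As a concrete reading of (19) and (20), the metrics can be sketched as follows, assuming (our convention, not the paper's) that the original and reconstructed frequent itemsets are given as Python dicts mapping each itemset to its support:

```python
# Sketch of the error metrics (19)-(20); F and F_hat map each frequent
# itemset (a frozenset of items) to its support.
def error_metrics(F, F_hat):
    common = set(F) & set(F_hat)          # itemsets correctly identified as frequent
    # rho: average relative support error over F ∩ F_hat (undefined if empty).
    rho = (sum(abs(F_hat[f] - F[f]) / F[f] for f in common) / len(common)
           if common else None)
    sigma_plus = (len(F_hat) - len(common)) / len(F)    # false-positive ratio, (20)
    sigma_minus = (len(F) - len(common)) / len(F)       # false-negative ratio, (20)
    return rho, sigma_plus, sigma_minus

# Made-up example values:
F = {frozenset({'A'}): 0.50, frozenset({'A', 'B'}): 0.40}
F_hat = {frozenset({'A'}): 0.45, frozenset({'A', 'C'}): 0.20}
rho, s_plus, s_minus = error_metrics(F, F_hat)
print(round(rho, 6), s_plus, s_minus)  # 0.1 0.5 0.5
```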
The corresponding metrics are illustrated in Figure 3. For the metric ρ, the intersection of the two frequent itemsets F and F̂ may be empty; in that case we do not compute ρ.
5.3. Results Analysis. We evaluate our method by measuring the support error and the two identity errors given in formulas (19) and (20), and we conduct our experiments on the three datasets described in Section 5.1. We compare our method with MASK, which distorts the original dataset with only one parameter value p. If the personalized distortion vector of our method is P, then p = min(P). In our experiments, the elements of the personalized distortion vector P in formula (5) are generated following the uniform distribution over the range [0.8, 0.95]; the average value of vector P is 0.8741. In order to protect all the items or attributes in a dataset, the parameter p in MASK must be set as 0.8. Therefore, we compare our method with MASK having the parameter p as 0.8.
5.3.1. Comparisons with Different Minimum Supports. The first experiment was conducted on T3I4D500KN10. In this experiment, we set the minimum support from 0.05% to 0.95% with a step length of 0.05%. For each minimum support, we ran MASK and our personalized method 100 times and computed the average values and the standard deviations of the evaluation metrics. The results are shown in Figures 4, 5, and 6.
From Figures 4, 5, and 6, we can see that our method has smaller errors than MASK. This is because MASK has to distort all the items or attributes at the maximum distorting level in order to protect each item
Figure 3: Evaluation metrics (ρ, σ+, σ−) on the original frequent itemsets F and the reconstructed frequent itemsets F̂.
Figure 4: Support error ρ versus minimum support on T3I4D500KN10 (MASK versus Personalized).
Figure 5: Frequent itemset added ratio σ+ versus minimum support on T3I4D500KN10 (MASK versus Personalized).
or attribute, while our method distorts different items with different distorting levels. The results show that our method incurs only half of the support error ρ of MASK on the dataset T3I4D500KN10. Besides, from Figures 5 and 6, we can see that our method has much smaller deviations of the identity errors.
Figure 6: Frequent itemset lost ratio σ− versus minimum support on T3I4D500KN10 (MASK versus Personalized).
Figure 7: Support error ρ versus minimum support on BMS-WebView-1x10 (MASK-0.8, MASK-0.8741, MASK-0.95, and Personalized).
We conducted the same experiments on the real dataset BMS-WebView-1x10. For this dataset, we set the minimum support from 0.2% to 0.55% with a step length of 0.05%, and for each minimum support we ran the algorithms 20 times. As the average value of the elements in vector P is 0.8741, we also ran MASK with the parameter p as 0.8741. Besides, we ran MASK with p as the maximum value of P, that is, 0.95.
We show the results in Figures 7, 8, and 9; they show that our method is much better than MASK with p as 0.8. However, the results of our method are very similar to those of MASK with p as 0.8741, the average value of vector P in our method. Similar results can be found in Figures 13, 14, and 15 on the dataset T40I10D100KN942. But MASK with p as 0.8741 cannot protect all the items or attributes, because some attributes need protection with a privacy level smaller than 0.8741, such as 0.85. For the three metrics, the results of MASK with p as 0.95 are better than those of our method; this is because the privacy protection level of each attribute in our method is between 0.8 and 0.95.
5.3.2. The Impact of the Size of the Dataset. We conduct the second experiment to evaluate the impact of the size of the
Figure 8: Frequent itemset added ratio σ+ versus minimum support on BMS-WebView-1x10 (MASK-0.8, MASK-0.8741, MASK-0.95, and Personalized).
Figure 9: Frequent itemset lost ratio σ− versus minimum support on BMS-WebView-1x10 (MASK-0.8, MASK-0.8741, MASK-0.95, and Personalized).
dataset on the reconstruction errors defined in formulas (19) and (20). The dataset BMS-WebView-1x10 is formed from the dataset BMS-WebView-1 by copying it 10 times. Similarly, we set the minimum support of frequent itemset mining from 0.2% to 0.55% with a step length of 0.05% and ran the algorithms 20 times for each minimum support.

The corresponding results in Figures 10, 11, and 12 show that the reconstruction errors on BMS-WebView-1x10 are much smaller than the errors on BMS-WebView-1, for both MASK and our method. Note that the frequent itemsets discovered from BMS-WebView-1 are the same as the frequent itemsets discovered from BMS-WebView-1x10. This suggests that, if we want to improve the reconstruction results, we can copy the original dataset many times, distort the new dataset, and send it to the cooperating party. The cooperating party can then accurately discover the interesting patterns without knowing the original private information.
5.3.3. Comparisons with Different Lengths of Frequent Itemsets. The third experiment is conducted on the dataset
Figure 10: Support error ρ versus minimum support on different sizes of datasets (MASK and Personalized on BMS-WebView-1 and BMS-WebView-1x10).
Figure 11: Frequent itemset added ratio σ+ versus minimum support on different sizes of datasets (MASK and Personalized on BMS-WebView-1 and BMS-WebView-1x10).
Figure 12: Frequent itemset lost ratio σ− versus minimum support on different sizes of datasets (MASK and Personalized on BMS-WebView-1 and BMS-WebView-1x10).
Figure 13: Support error ρ versus itemset length k on dataset T40I10D100KN942 (MASK-0.8, MASK-0.8741, MASK-0.95, and Personalized).
Figure 14: Frequent itemset added ratio σ+ versus itemset length k on dataset T40I10D100KN942 (MASK-0.8, MASK-0.8741, MASK-0.95, and Personalized).
T40I10D100KN942. In this experiment, we set the minimum support as 1.45%, ran the corresponding algorithms 5 times, and took the average values. We also set the parameter p of MASK as 0.8, 0.8741, and 0.95 separately.

For the minimum support 1.45%, the maximum length of the original frequent itemsets is 8, and the number of frequent k-itemsets is shown in Table 2. We then compare the results for each itemset length from 1 to 8. The results shown in Figures 13, 14, and 15 demonstrate that our method performs better than MASK with p as 0.8 for each itemset length and performs very similarly to MASK with p as 0.8741.

In Figure 13, as the intersection of the frequent 8-itemsets reconstructed by MASK with p as 0.8 and the original frequent 8-itemsets is empty, the support error cannot be computed, so no value is shown. We cannot identify a clear trend in the reconstruction error as the itemset length increases. In Figure 14, when k is 8, the frequent itemset added ratio increases sharply. By carefully analyzing the results, we found that many new frequent itemsets have supports very close to the minimum support, which leads to the heavy identity error.
Figure 15: Frequent itemset lost ratio σ− versus itemset length k on dataset T40I10D100KN942 (MASK-0.8, MASK-0.8741, MASK-0.95, and Personalized).
Table 2: The number of frequent k-itemsets.

k     1    2     3    4    5    6    7    8
|F|   690  4869  293  368  483  469  331  25
6. Conclusions

In this paper, we solve the problem of how to provide personalized privacy protection on different attributes or items while discovering the frequent itemsets. Based on the classic randomized response technique, we proposed a personalized privacy-preserving method. Besides, we proposed a method to improve the efficiency of counting the frequent itemsets by mapping each itemset combination vector into a value. Experimental results show that the personalized privacy protection method performs much better than the traditional privacy protection method, which provides the same privacy protection for all the different items. From the experimental results, we can also see that we can copy the original dataset many times to create a new dataset and then distort this new dataset; the other party can then discover the frequent itemsets with smaller error, at the expense of more computation and communication cost.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This research work was supported in part by the National Natural Science Foundation of China under Grants no. 61003231, no. 61103109, and no. 11105024 and by the Fundamental Research Funds for the Central Universities under ZYGX2011J057. The authors are grateful to the anonymous reviewers who made constructive comments.
References
[1] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in Proceedings of the 20th International Conference on Very Large Data Bases (VLDB '94), vol. 1215, pp. 487-499, Santiago de Chile, Chile, September 1994.
[2] L. F. Cranor, J. Reagle, and M. S. Ackerman, Beyond Concern: Understanding Net Users' Attitudes about Online Privacy, MIT Press, Cambridge, Mass, USA, 2000.
[3] J. R. Haritsa, "Mining association rules under privacy constraints," in Privacy-Preserving Data Mining, pp. 239-266, Springer, New York, NY, USA, 2008.
[4] X. Xiao and Y. Tao, "Personalized privacy preservation," in Proceedings of the ACM International Conference on Management of Data (SIGMOD '06), pp. 229-240, ACM, Chicago, Ill, USA, June 2006.
[5] X. Xiao, Y. Tao, and M. Chen, "Optimal random perturbation at multiple privacy levels," Proceedings of the VLDB Endowment, vol. 2, no. 1, pp. 814-825, 2009.
[6] A. C. Tamhane, "Randomized response techniques for multiple sensitive attributes," Journal of the American Statistical Association, vol. 76, no. 376, pp. 916-923, 1981.
[7] S. J. Rizvi and J. R. Haritsa, "Maintaining data privacy in association rule mining," in Proceedings of the 28th International Conference on Very Large Data Bases (VLDB '02), pp. 682-693, VLDB Endowment, Hong Kong, August 2002.
[8] V. S. Verykios and A. Gkoulalas-Divanis, "A survey of association rule hiding methods for privacy," in Privacy-Preserving Data Mining, pp. 267-289, Springer, New York, NY, USA, 2008.
[9] Y.-H. Wu, C.-M. Chiang, and A. L. P. Chen, "Hiding sensitive association rules with limited side effects," IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 1, pp. 29-42, 2007.
[10] E. D. Pontikakis, A. A. Tsitsonis, and V. S. Verykios, "An experimental study of distortion-based techniques for association rule hiding," in Research Directions in Data and Applications Security XVIII, vol. 144 of IFIP International Federation for Information Processing, pp. 325-339, Springer, 2004.
[11] E. D. Pontikakis, Y. Theodoridis, A. A. Tsitsonis, L. Chang, and V. S. Verykios, "A quantitative and qualitative analysis of blocking in association rule hiding," in Proceedings of the ACM Workshop on Privacy in the Electronic Society (WPES '04), pp. 29-30, ACM Press, Washington, DC, USA, October 2004.
[12] S.-L. Wang and A. Jafari, "Using unknowns for hiding sensitive predictive association rules," in Proceedings of the IEEE International Conference on Information Reuse and Integration (IRI '05), pp. 223-228, IEEE, Las Vegas, Nev, USA, August 2005.
[13] S. Xingzhi and P. S. Yu, "A border-based approach for hiding sensitive frequent itemsets," in Proceedings of the 5th IEEE International Conference on Data Mining (ICDM '05), pp. 426-433, IEEE, Houston, Tex, USA, November 2005.
[14] G. V. Moustakides and V. S. Verykios, "A max-min approach for hiding frequent itemsets," in Proceedings of the 6th IEEE International Conference on Data Mining Workshops (ICDM '06), pp. 502-506, Hong Kong, December 2006.
[15] S. Menon, S. Sarkar, and S. Mukherjee, "Maximizing accuracy of shared databases when concealing sensitive patterns," Information Systems Research, vol. 16, no. 3, pp. 256-270, 2005.
[16] A. Gkoulalas-Divanis and V. S. Verykios, "An integer programming approach for frequent itemset hiding," in Proceedings of the 15th ACM Conference on Information and Knowledge Management (CIKM '06), pp. 748-757, ACM, Arlington, Va, USA, November 2006.
[17] J. Vaidya and C. Clifton, "Privacy preserving association rule mining in vertically partitioned data," in Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '02), pp. 639-644, ACM, Alberta, Canada, July 2002.
[18] M. Kantarcioglu and C. Clifton, "Privacy-preserving distributed mining of association rules on horizontally partitioned data," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 9, pp. 1026-1037, 2004.
[19] N. Zhang, S. Wang, and W. Zhao, "A new scheme on privacy preserving association rule mining," in Knowledge Discovery in Databases: PKDD 2004, vol. 3202 of Lecture Notes in Computer Science, pp. 484-495, 2004.
[20] S. L. Warner, "Randomized response: a survey technique for eliminating evasive answer bias," Journal of the American Statistical Association, vol. 60, no. 309, pp. 63-66, 1965.
[21] Z. Zheng, R. Kohavi, and L. Mason, "Real world performance of association rule algorithms," in Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '01), pp. 401-406, ACM, San Francisco, Calif, USA, August 2001.
transaction T is a set of items such that T ⊆ I. A transaction T is said to contain X if and only if X ⊆ T. A rule is an implication of the form X ⇒ Y, where X, Y ⊂ I and X ∩ Y = ∅. The support of the rule X ⇒ Y is defined in (1), and its confidence is defined in (2):

supp(X ⇒ Y) = P(X ∪ Y), (1)

conf(X ⇒ Y) = P(Y | X) = supp(X ∪ Y) / supp(X), (2)

where P(X ∪ Y) is the percentage of transactions in D that contain X ∪ Y, and supp(X) is defined as the percentage of transactions containing X.

A set of items X is said to be frequent if and only if supp(X) is greater than the user-defined minimum support; X is then a frequent itemset. If the rule X ⇒ Y has support greater than the minimum support and its confidence is greater than the minimum confidence, then X ⇒ Y is an interesting rule, that is, an association rule. From (1) and (2), finding the association rules is effectively equivalent to generating all the frequent itemsets with support greater than the minimum support. Therefore, we focus on the frequent itemset mining.
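As a toy illustration of definitions (1) and (2), with made-up transactions and item names:

```python
# Toy illustration of support (1) and confidence (2) over a transaction list.
def support(D, X):
    return sum(1 for t in D if X <= t) / len(D)   # fraction of transactions containing X

def confidence(D, X, Y):
    return support(D, X | Y) / support(D, X)      # conf(X => Y) = supp(X ∪ Y) / supp(X)

D = [{'A', 'B'}, {'A', 'C'}, {'A', 'B', 'C'}, {'B', 'C'}]
print(support(D, {'A', 'B'}))       # 0.5
print(confidence(D, {'A'}, {'B'}))  # 0.6666666666666666
```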
3.2. Randomized Response. The randomized response technique was first introduced by Warner [20] to solve the statistical survey problem of sensitive questions. For example, social scientists may want to know how many people in some area use drugs. Usually, respondents are reluctant to answer this kind of question directly. Hence, Warner proposed the randomized response.

To ask a binary-choice question about whether people have a sensitive attribute A, the randomized response gives two related questions like the following:

(1) I have the sensitive attribute A.
(2) I do not have the sensitive attribute A.

Respondents decide which question to answer with a probability p. The interviewer then only gets a "yes" or "no," without knowing which question the respondent answered. If a respondent has the sensitive attribute A, then he will answer "yes" (to the first question) with probability p, or "no" with probability 1 − p. If a person does not have A, then he will answer "no" with probability p (answering the first question) and "yes" with probability 1 − p (answering the second question). Hence, the probability that an interviewer gets the answer "yes" can be computed by (3), while that of getting "no" can be computed by (4):
P(ans = yes) = P(A = yes) · p + P(A = no) · (1 − p), (3)

P(ans = no) = P(A = no) · p + P(A = yes) · (1 − p), (4)

where P(A = yes) is the proportion of the respondents that have the attribute A, while P(A = no) is the proportion of respondents that do not have the attribute A. The interviewer can get the proportion of respondents having the attribute A, P(A = yes), by solving (3) and (4).
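Solving (3) for P(A = yes) gives a one-line estimator: writing lam for the observed proportion of "yes" answers and pi for P(A = yes), lam = pi·p + (1 − pi)(1 − p), so pi = (lam + p − 1)/(2p − 1) for p ≠ 0.5. A sketch (the function name is ours, not from the paper):

```python
# Estimate P(A = yes) from the observed "yes" rate lam and the design
# parameter p of the randomized response, by inverting (3).
def estimate_proportion(lam, p):
    # Undefined at p = 0.5, where the answers carry no information.
    return (lam + p - 1.0) / (2.0 * p - 1.0)

# e.g. true pi = 0.1 and p = 0.8 give lam = 0.1*0.8 + 0.9*0.2 = 0.26
print(round(estimate_proportion(0.26, 0.8), 6))  # 0.1
```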
3.3. Problem Definition. In this paper, we solve the data privacy problem with respect to frequent itemset mining under the B2B environment as well as the B2C environment.

Under the B2C model, the interviewer can conduct a survey containing sensitive questions. For example, questions such as "whether you are divorced," "whether you have criminal records," or "whether you use drugs" are sensitive. This kind of survey can analyze the associations between factors and indeed reflect sociological problems. In this model, respondents disturb the survey vectors with the given parameters and then send them to the interviewers.

Under the B2B model, the data owner, such as a supermarket manager, disturbs the original data before sharing it with other business partners. Taking market transactions as an example, customers are reluctant to let others know what they bought, especially some sensitive products. By taking privacy protection measures, transaction details can be hidden without leaking personal privacy.

Considering the data utility and the different degrees of privacy requirements, the protection on different attributes or items should be different. For example, a customer does not care too much about the disclosure that he bought paper, while caring heavily about some sensitive items.

The problem of personalized privacy-preserving frequent itemset mining is, given the original transaction dataset D, how to disturb D into D′ satisfying the different privacy requirements and mine the frequent itemsets from D′, so that the frequent itemsets mined from D′ are as close as possible to the frequent itemsets mined from D.
4. Personalized Privacy-Preserving Frequent Itemset Mining

In this section, we present the procedure for distorting the original data by the randomized response technique while satisfying the personalized privacy requirements. Then we give the method to reconstruct itemset supports from the distorted data. Finally, we devise a personalized privacy-preserving frequent itemset mining algorithm based on the Apriori algorithm.
4.1. Personalized Data Distortion. Suppose there are n items in the database D. Each transaction in D is represented by a binary vector T: T_i = 1 if this transaction contains item i; otherwise T_i = 0. For each item i, 1 ≤ i ≤ n, there is a probability parameter p_i (0 ≤ p_i ≤ 1) to disturb the data on item i, and these parameters form the vector shown in formula (5). The distorted value of this transaction on item i can then be expressed by (6), where r_i is a value randomly drawn from a uniform distribution over the interval [0, 1]:

P = (p_1, p_2, ..., p_n), (5)

T′_i = T_i if r_i ≤ p_i; 1 − T_i otherwise. (6)

From (6), we can see that the i-th value of transaction T is kept with probability p_i and flipped with probability 1 − p_i.
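A minimal sketch of the distortion step (5)-(6), assuming transactions are plain 0/1 lists and P is the per-item parameter vector (the example values are made up):

```python
import random

# Personalized distortion (5)-(6): the value on item i is kept with
# probability P[i] and flipped with probability 1 - P[i].
def distort(T, P, rng=random):
    return [t if rng.random() <= p else 1 - t for t, p in zip(T, P)]

# Hypothetical parameters: a non-sensitive item may use p close to 1.0,
# a sensitive one a value closer to 0.5.
P = [1.0, 0.9, 0.6]
print(distort([1, 0, 1], P))  # e.g. [1, 0, 0]; the first bit is always kept (p = 1.0)
```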
Figure 1: Reconstruction probability R(p_i, s_i) versus p_i, for item supports s_i = 0.01, 0.1, 0.3, 0.5, 0.8, and 0.95.
The proportion of T′_i = 1 can be calculated by (7), and the proportion of T′_i = 0 can be calculated by (8):

P(T′_i = 1) = P(T_i = 1) · p_i + P(T_i = 0) · (1 − p_i), (7)

P(T′_i = 0) = P(T_i = 0) · p_i + P(T_i = 1) · (1 − p_i). (8)
For the personalized privacy requirements, different items are distorted with different probability parameters. For a given parameter p_i, the probability of correctly reconstructing the value T_i = 1 is calculated by

R(p_i, s_i) = P(T′_i = 1 | T_i = 1) · P(T_i = 1 | T′_i = 1) + P(T′_i = 0 | T_i = 1) · P(T_i = 1 | T′_i = 0). (9)

The first part in (9) means that the original value T_i = 1 is distorted into the value T′_i = 1 and then reconstructed from T′_i = 1; the second part is for the distorted value T′_i = 0. Finally, the probability can be computed by (10), where s_i is the support of item i in the original database; a similar derivation of (10) can be found in [7]:

R(p_i, s_i) = s_i p_i² / (s_i p_i + (1 − s_i)(1 − p_i)) + s_i (1 − p_i)² / (s_i (1 − p_i) + (1 − s_i) p_i). (10)
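Formula (10) can be checked numerically with a short sketch (the function name is ours); note that R(0.5, s_i) = s_i and that R is symmetric around p_i = 0.5:

```python
# Reconstruction probability (10) for an item with parameter p and support s.
def reconstruction_prob(p, s):
    keep = s * p ** 2 / (s * p + (1 - s) * (1 - p))             # via T'_i = 1
    flip = s * (1 - p) ** 2 / (s * (1 - p) + (1 - s) * p)       # via T'_i = 0
    return keep + flip

print(round(reconstruction_prob(0.5, 0.3), 4))  # 0.3, the hardest case
print(round(reconstruction_prob(0.9, 0.3), 4) ==
      round(reconstruction_prob(0.1, 0.3), 4))  # True, symmetry around 0.5
```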
The reconstruction probability curves of (10) are shown in Figure 1. We can see that the higher the item support is, the more easily an item having value 1 will be reconstructed. Besides, the curves are symmetric around p_i = 0.5: the further the distance between p_i and 0.5, the more easily an item having value 1 will be reconstructed. Therefore, for a given item, different probability parameters lead to different degrees of privacy protection. The parameters for the items are set by the data owner according to the properties of these items. For example, the parameter p_i for the item milk can be very high, even 1.0, while for drugs it should be closer to 0.5.

Under the B2C environment, the interviewer and the respondents cooperate with each other to set the parameter vector P; the respondents then disturb their personal data and send them to the interviewer. Under the B2B environment, the data owner disturbs the original data with the parameter vector P and sends the distorted data, together with P, to the third parties.
4.2. Itemset Support Reconstruction. After getting the distorted data D′, the data miner will reconstruct the supports of itemsets to find the frequent ones. In order to reconstruct the support of an itemset, we need to get the count of every combination of the items in this itemset. For example, in order to compute the support of the itemset ABC in the original dataset D, that is, p(A = 1, B = 1, C = 1), we compute the counts of the 2³ combinations of ABC in D′, that is, c′(A = 0, B = 0, C = 0), c′(A = 0, B = 0, C = 1), ..., and c′(A = 1, B = 1, C = 1). This is because the original value (A = 1, B = 1, C = 1) can be distorted into any of these 2³ combinations of ABC.

Let S be the set of combinations of the k items in (11), where S_j is the binary form of the value j in (12). For example, when k = 3 with items ABC, S_0 is (A = 0, B = 0, C = 0) and S_3 is (A = 0, B = 1, C = 1); that is, 011 is the binary form of 3:

S = {S_0, S_1, ..., S_{2^k − 1}}, (11)

S_i = binary(i, k). (12)

The combination S_i in the original database can be distorted into any combination S_j in S. According to probability theory, the probability of S_i being distorted to S_j can be computed by

r_ij = ∏((S_i ⊙ S_j) ∘ P^(k) + (S_i ⊕ S_j) ∘ (1 − P^(k))), (13)

where ∘ is the Hadamard product of two vectors, S_i ⊙ S_j (respectively, S_i ⊕ S_j) marks the positions where S_i and S_j agree (respectively, differ), P^(k) is the vector of distorting probability parameters corresponding to the k items, and ∏(·) returns the product of the vector elements. That means that if the value on an item i is kept, then p_(i) is multiplied into r_ij; otherwise, 1 − p_(i) is. An example with k as 3 in Table 1 illustrates the formulation of the transition matrix.

Corresponding to the combinations of the given k items defined in (11), we define their counts in the original database D and the distorted database D′, respectively, in (14), where C is the count vector for D, C′ is the count vector for D′, and c_i and c′_i are the counts of the combination S_i for the k items in D and D′, respectively:

C = [c_0, c_1, ..., c_{2^k − 1}]^T,  C′ = [c′_0, c′_1, ..., c′_{2^k − 1}]^T. (14)

Then the relationship between C and C′ can be expressed by (15), according to the randomized response distortion mechanism:

C′ = RC. (15)
Table 1: An example of the transition matrix (k = 3).

      000                      ...  110                      111
000   p_1 p_2 p_3              ...  (1−p_1)(1−p_2) p_3       (1−p_1)(1−p_2)(1−p_3)
001   p_1 p_2 (1−p_3)          ...  (1−p_1)(1−p_2)(1−p_3)    (1−p_1)(1−p_2) p_3
...   ...                      ...  ...                      ...
111   (1−p_1)(1−p_2)(1−p_3)    ...  p_1 p_2 (1−p_3)          p_1 p_2 p_3
After getting the transition matrix R, with elements computed by (13), we can reconstruct the counts of the combinations by (16). We then evaluate the performance of the reconstruction by computing the difference between C and \hat{C}:

\hat{C} = R^{-1} C'    (16)

The count of the last combination divided by the number of transactions is the estimated support of the corresponding itemset, as shown in (17), where m is the number of transactions in D:

\hat{s} = \hat{c}_{2^k-1} / m    (17)
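To make the transition-matrix mechanics concrete, here is a minimal sketch (an illustration assuming NumPy, not the authors' implementation) that builds R from per-item retention probabilities via (13) and recovers the estimated itemset support via (16) and (17):

```python
import numpy as np
from itertools import product

def transition_matrix(p):
    """Transition matrix R of (13): entry R[i, j] is the probability that
    the original combination S_i = binary(i, k) is distorted into S_j."""
    pv = np.asarray(p, dtype=float)
    combos = np.array(list(product([0, 1], repeat=len(p))))  # S_0 .. S_{2^k - 1}
    n = len(combos)
    R = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            kept = combos[i] == combos[j]  # items whose value was kept
            # a kept item contributes p_t to the product, a flipped one 1 - p_t
            R[i, j] = np.prod(np.where(kept, pv, 1.0 - pv))
    return R

def reconstruct_support(p, c_dist, m):
    """(16)-(17): recover C_hat = R^{-1} C' and divide the count of the
    all-ones combination by the number of transactions m."""
    c_hat = np.linalg.solve(transition_matrix(p), np.asarray(c_dist, float))
    return c_hat[-1] / m
```

Note that R is symmetric here, since keeping or flipping a bit has the same probability in either direction, which is why the forms C' = RC in (15) and \hat{C} = R^{-1}C' in (16) are consistent.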
4.3. Mining Frequent Itemsets from the Distorted Dataset. We mine the frequent itemsets from the database by using the classic Apriori algorithm, which employs an iterative approach to search for frequent itemsets. The frequent k-itemsets are used to generate the candidate (k + 1)-itemsets by AprioriGen. Then Apriori computes the support of the candidate (k + 1)-itemsets and filters out the itemsets with support less than the minimum support. These operations run iteratively until no more frequent itemsets can be found.
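The candidate-generation step described above can be sketched as follows (an illustrative join-and-prune sketch of the classic AprioriGen step under our own naming, not the authors' code):

```python
from itertools import combinations

def apriori_gen(freq_k):
    """Generate candidate (k+1)-itemsets from frequent k-itemsets given as
    sorted tuples: join pairs sharing their first k-1 items, then prune
    every candidate that has an infrequent k-subset."""
    freq = set(freq_k)
    cands = set()
    for a in freq_k:
        for b in freq_k:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                c = a + (b[-1],)
                # Apriori property: all k-subsets of a frequent itemset are frequent
                if all(s in freq for s in combinations(c, len(c) - 1)):
                    cands.add(c)
    return sorted(cands)
```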
In the privacy-preserving personalized frequent itemset mining, we need to reconstruct the support of candidate itemsets from the distorted database. For a candidate itemset, we compute the counts of all of its item combinations. In order to accelerate the counting, we first remove the items with support less than the minimum support. Then, for the candidate itemset X = {I_1, ..., I_k}, we extract the subdataset D(X) = D(I_1, ..., I_k) from the database D'. In the subdataset D(X), we map each transaction t \in D(X) into a value by

v = t \cdot b = (t_1, t_2, \dots, t_k) \cdot (2^{k-1}, 2^{k-2}, \dots, 2^0)    (18)

By (18) a vector V is computed, wherein V_i = D(X)_i \cdot b. Then c'_j in (14) is the count of elements in V equal to j. By this method it is very fast to compute the vector C'. Figure 2 gives an example of transaction mapping that maps each transaction t into a value. By computing the count of each value from 0 to 7, we get the counts of all the combinations for the itemset ACD. By using (16) and (17), we then get the support of the itemset X.
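The counting shortcut of (18) can be sketched in a few lines (a NumPy illustration under our own naming):

```python
import numpy as np

def combination_counts(sub, k):
    """(18): map each row t of the 0/1 sub-dataset D(X) (shape m x k) to
    v = t . (2^{k-1}, ..., 2^0) and count how many transactions land on
    each value 0 .. 2^k - 1; the result is the count vector C' of (14)."""
    b = 2 ** np.arange(k - 1, -1, -1)  # (2^{k-1}, 2^{k-2}, ..., 2^0)
    v = np.asarray(sub) @ b            # one integer per transaction
    return np.bincount(v, minlength=2 ** k)
```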
In our privacy-preserving personalized frequent itemset mining, we iteratively generate the candidate (k + 1)-itemsets from the frequent k-itemsets and check the support of each candidate itemset using the itemset support reconstruction.
Figure 2: An example of transaction mapping. Each transaction t over the itemset ACD is mapped to the value t \cdot b \in {0, 1, ..., 7}.
5. Experimental Results

In this section we evaluate the performance of the personalized privacy-preserving frequent itemset mining on a real dataset and on synthetic datasets generated by the classic IBM Quest Market-Basket Synthetic Data Generator (the C++ source code can be downloaded at http://www.cs.loyola.edu/~cgiannel/assoc_gen.html).
5.1. Datasets. The synthetic datasets are generated by the IBM Almaden generator with the parameters T, I, D, and N [1], where T is the average size of the transactions, I is the average size of the maximal potentially large itemsets, D is the number of transactions, and N is the number of items. We generated two synthetic datasets, T3I4D500KN10 and T40I10D100KN942.
The real dataset is BMS-WebView-1 [21], which contains the click-stream data from the website of a legwear and legcare retailer. The dataset contains 59,602 transactions and 497 items. We scaled the dataset by a factor of 10 and got the dataset BMS-WebView-1x10, which contains 596,020 transactions and 497 items.
5.2. Evaluation Metrics. We evaluate our method on the distorted database by two kinds of error metrics [7]: the support error and the identity errors. Let F be the frequent itemsets discovered from the original database D by the Apriori algorithm, and let \hat{F} represent the reconstructed frequent itemsets mined from the distorted database by our method.
The support error metric is evaluated by (19), which reflects the relative error in the support values. We measure the average error over the frequent itemsets which are correctly identified to be frequent, that is, F \cap \hat{F}, with the given minimum support s_min:

\rho = \frac{1}{|F \cap \hat{F}|} \sum_{f \in F \cap \hat{F}} \frac{|s_f - \hat{s}_f|}{s_f}    (19)
The identity error metrics are given in (20): \sigma^+ indicates the percentage of false positives, that is, the percentage of reconstructed itemsets which do not exist in the original frequent itemsets; \sigma^- indicates the percentage of false negatives, that is, the percentage of original frequent itemsets which are not correctly reconstructed as frequent:

\sigma^+ = \frac{|\hat{F}| - |F \cap \hat{F}|}{|F|}, \qquad \sigma^- = \frac{|F| - |F \cap \hat{F}|}{|F|}    (20)
The corresponding metrics are illustrated in Figure 3. For the metric \rho, the intersection of the two frequent itemset collections F and \hat{F} may be empty; in this case we do not compute \rho.
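The three metrics follow directly from (19) and (20); this illustrative helper (our own naming, not the authors' code) takes the true and reconstructed frequent itemsets as mappings from itemset to support, and returns NaN for \rho when the intersection is empty:

```python
import math

def error_metrics(F, F_hat):
    """rho of (19) plus sigma+ / sigma- of (20). F and F_hat map each
    frequent itemset to its true / reconstructed support."""
    common = set(F) & set(F_hat)
    if common:
        rho = sum(abs(F[f] - F_hat[f]) / F[f] for f in common) / len(common)
    else:
        rho = math.nan  # rho is undefined when F and F_hat do not intersect
    sigma_plus = (len(F_hat) - len(common)) / len(F)   # false positives
    sigma_minus = (len(F) - len(common)) / len(F)      # false negatives
    return rho, sigma_plus, sigma_minus
```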
5.3. Results Analysis. We evaluate our method by measuring the support error and the two identity errors shown in formulas (19) and (20), and we conduct our experiments on the three datasets described in Section 5.1. We compare our method with MASK, which distorts the original dataset with only one parameter value p. If the personalized distortion vector of our method is P, then p = min(P). In our experiments, the elements of the personalized distortion vector P in formula (5) are generated following the uniform distribution over the range [0.8, 0.95]; the average value of the vector P is 0.8741. In order to protect all the items or attributes in a dataset, the parameter p in MASK must be set as 0.8. Therefore we compare our method with MASK having the parameter p as 0.8.
5.3.1. Comparisons with Different Minimum Supports. The first experiment was conducted on T3I4D500KN10. In this experiment we set the minimum support from 0.05% to 0.95% with a step length of 0.05%. For each minimum support we ran MASK and our personalized method 100 times and computed the average values and the standard deviations of the evaluation metrics. The results are shown in Figures 4, 5, and 6.
From Figures 4, 5, and 6 we can see that our method achieves smaller errors than MASK. This is because MASK has to distort all the items or attributes at the maximum distorting level in order to protect every item or attribute, while our method distorts different items with different distorting levels. The results show that our method incurs only half of the support error \rho of MASK on the dataset T3I4D500KN10. Besides, from Figures 5 and 6 we can see that our method has much smaller deviations of the identity errors.

Figure 3: Evaluation metrics.

Figure 4: Support error \rho versus minimum support on T3I4D500KN10.

Figure 5: Frequent itemset added ratio \sigma^+ versus minimum support on T3I4D500KN10.
Figure 6: Frequent itemset lost ratio \sigma^- versus minimum support on T3I4D500KN10.

Figure 7: Support error \rho versus minimum support on BMS-WebView-1x10.
We conducted the same experiments on the real dataset BMS-WebView-1x10. For this dataset we set the minimum support from 0.2% to 0.55% with a step length of 0.05%, and for each minimum support we ran the algorithms 20 times. As the average value of the elements in the vector P is 0.8741, we also ran MASK with the parameter p as 0.8741. Besides, we ran MASK with p as the maximum value of P, that is, 0.95.
We show the results in Figures 7, 8, and 9; they show that our method is much better than MASK with p as 0.8. However, the results of our method are very similar to the results of MASK with p as 0.8741, the average value of the vector P in our method. Similar results can be found in Figures 13, 14, and 15 on the dataset T40I10D100KN942. But MASK with p as 0.8741 cannot protect all the items or attributes, because some attributes need protection with a privacy level smaller than 0.8741, such as 0.85. For the three metrics, the results of MASK with p as 0.95 are better than those of our method; this is because the privacy protection level of each attribute in our method is between 0.8 and 0.95.
Figure 8: Frequent itemset added ratio \sigma^+ versus minimum support on BMS-WebView-1x10.

Figure 9: Frequent itemset lost ratio \sigma^- versus minimum support on BMS-WebView-1x10.

5.3.2. The Impact of the Size of the Dataset. We conduct the second experiment to evaluate the impact of the size of the dataset on the reconstruction errors shown in formulas (19) and (20). The dataset BMS-WebView-1x10 is formed from the dataset BMS-WebView-1 by copying it 10 times. Similarly, we set the minimum support of the frequent itemset mining from 0.2% to 0.55% with a step length of 0.05% and run the algorithms 20 times for each minimum support.
The corresponding results in Figures 10, 11, and 12 show that the reconstruction errors on BMS-WebView-1x10 are much smaller than the errors on BMS-WebView-1, for both MASK and our method. Note that the frequent itemsets discovered from BMS-WebView-1 are the same as the frequent itemsets discovered from BMS-WebView-1x10. This suggests that, to improve the reconstruction results, we can copy the original dataset many times, distort the new dataset, and send it to the cooperating party. The cooperating party can then accurately discover the interesting patterns without knowing the original private information.
Figure 10: Support error \rho versus minimum support on different sizes of datasets.

Figure 11: Frequent itemset added ratio \sigma^+ versus minimum support on different sizes of datasets.

Figure 12: Frequent itemset lost ratio \sigma^- versus minimum support on different sizes of datasets.

5.3.3. Comparisons with Different Lengths of Frequent Itemsets. The third experiment is conducted on the dataset T40I10D100KN942. In this experiment we set the minimum support as 1.45%, run the corresponding algorithms 5 times, and take the average values. We also set the parameter p of MASK as 0.8, 0.8741, and 0.95 separately.

Figure 13: Support error \rho versus itemset length k on dataset T40I10D100KN942.

Figure 14: Frequent itemset added ratio \sigma^+ versus itemset length k on dataset T40I10D100KN942.
For the minimum support 1.45%, the maximum length of the original frequent itemsets is 8, and the number of frequent k-itemsets is shown in Table 2. We then compare the results for each itemset length from 1 to 8. The results shown in Figures 13, 14, and 15 demonstrate that our method performs better than MASK with p as 0.8 for each itemset length, and performs very similarly to MASK with p as 0.8741.
In Figure 13, as the intersection of the frequent 8-itemsets reconstructed by MASK with p as 0.8 and the original frequent 8-itemsets is empty, the support error cannot be computed, so the value is not shown. We cannot identify a clear trend of the reconstruction error as the itemset length increases. In Figure 14, when k is 8, the frequent itemset added ratio increases sharply. By carefully analyzing the results, we found that many of the new frequent itemsets have support very close to the minimum support, which leads to the heavy identity error.
Figure 15: Frequent itemset lost ratio \sigma^- versus itemset length k on dataset T40I10D100KN942.
Table 2: The number of frequent k-itemsets.

k     1    2     3    4    5    6    7    8
|F|   690  4869  293  368  483  469  331  25
6. Conclusions

In this paper we address the problem of providing personalized privacy protection for different attributes or items while discovering frequent itemsets. Based on the classic randomized response technique, we proposed a personalized privacy-preserving method. Besides, we proposed a method to improve the efficiency of counting the frequent itemsets by mapping each item-combination vector into a value. Experimental results show that the personalized privacy protection method performs much better than the traditional privacy protection method, which provides the same privacy protection for all items. From the experimental results we can also see that we can copy the original dataset many times to create a new dataset and then distort this new dataset; others can then discover the frequent itemsets with smaller error, at the expense of more computation and communication cost.
Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This research work was supported in part by the National Natural Science Foundation of China under Grants no. 61003231, no. 61103109, and no. 11105024, and by the Fundamental Research Funds for the Central Universities under ZYGX2011J057. The authors are grateful to the anonymous reviewers who made constructive comments.
References
[1] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in Proceedings of the 20th International Conference on Very Large Data Bases (VLDB '94), vol. 1215, pp. 487–499, Santiago de Chile, Chile, September 1994.
[2] L. F. Cranor, J. Reagle, and M. S. Ackerman, Beyond Concern: Understanding Net Users' Attitudes about Online Privacy, MIT Press, Cambridge, Mass, USA, 2000.
[3] J. R. Haritsa, "Mining association rules under privacy constraints," in Privacy-Preserving Data Mining, pp. 239–266, Springer, New York, NY, USA, 2008.
[4] X. Xiao and Y. Tao, "Personalized privacy preservation," in Proceedings of the ACM International Conference on Management of Data (SIGMOD '06), pp. 229–240, ACM, Chicago, Ill, USA, June 2006.
[5] X. Xiao, Y. Tao, and M. Chen, "Optimal random perturbation at multiple privacy levels," Proceedings of the VLDB Endowment, vol. 2, no. 1, pp. 814–825, 2009.
[6] A. C. Tamhane, "Randomized response techniques for multiple sensitive attributes," Journal of the American Statistical Association, vol. 76, no. 376, pp. 916–923, 1981.
[7] S. J. Rizvi and J. R. Haritsa, "Maintaining data privacy in association rule mining," in Proceedings of the 28th International Conference on Very Large Data Bases (VLDB '02), pp. 682–693, VLDB Endowment, Hong Kong, August 2002.
[8] V. S. Verykios and A. Gkoulalas-Divanis, "A survey of association rule hiding methods for privacy," in Privacy-Preserving Data Mining, pp. 267–289, Springer, New York, NY, USA, 2008.
[9] Y.-H. Wu, C.-M. Chiang, and A. L. P. Chen, "Hiding sensitive association rules with limited side effects," IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 1, pp. 29–42, 2007.
[10] E. D. Pontikakis, A. A. Tsitsonis, and V. S. Verykios, "An experimental study of distortion-based techniques for association rule hiding," in Research Directions in Data and Applications Security XVIII, vol. 144 of IFIP International Federation for Information Processing, pp. 325–339, Springer, 2004.
[11] E. D. Pontikakis, Y. Theodoridis, A. A. Tsitsonis, L. Chang, and V. S. Verykios, "A quantitative and qualitative analysis of blocking in association rule hiding," in Proceedings of the ACM Workshop on Privacy in the Electronic Society (WPES '04), pp. 29–30, ACM Press, Washington, DC, USA, October 2004.
[12] S.-L. Wang and A. Jafari, "Using unknowns for hiding sensitive predictive association rules," in Proceedings of the IEEE International Conference on Information Reuse and Integration (IRI '05), pp. 223–228, IEEE, Las Vegas, Nev, USA, August 2005.
[13] S. Xingzhi and P. S. Yu, "A border-based approach for hiding sensitive frequent itemsets," in Proceedings of the 5th IEEE International Conference on Data Mining (ICDM '05), pp. 426–433, IEEE, Houston, Tex, USA, November 2005.
[14] G. V. Moustakides and V. S. Verykios, "A max-min approach for hiding frequent itemsets," in Proceedings of the 6th IEEE International Conference on Data Mining Workshops (ICDM '06), pp. 502–506, Hong Kong, December 2006.
[15] S. Menon, S. Sarkar, and S. Mukherjee, "Maximizing accuracy of shared databases when concealing sensitive patterns," Information Systems Research, vol. 16, no. 3, pp. 256–270, 2005.
[16] A. Gkoulalas-Divanis and V. S. Verykios, "An integer programming approach for frequent itemset hiding," in Proceedings of the 15th ACM Conference on Information and Knowledge Management (CIKM '06), pp. 748–757, ACM, Arlington, Va, USA, November 2006.
[17] J. Vaidya and C. Clifton, "Privacy preserving association rule mining in vertically partitioned data," in Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '02), pp. 639–644, ACM, Alberta, Canada, July 2002.
[18] M. Kantarcioglu and C. Clifton, "Privacy-preserving distributed mining of association rules on horizontally partitioned data," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 9, pp. 1026–1037, 2004.
[19] N. Zhang, S. Wang, and W. Zhao, "A new scheme on privacy preserving association rule mining," in Knowledge Discovery in Databases: PKDD 2004, vol. 3202 of Lecture Notes in Computer Science, pp. 484–495, 2004.
[20] S. L. Warner, "Randomized response: a survey technique for eliminating evasive answer bias," Journal of the American Statistical Association, vol. 60, no. 309, pp. 63–66, 1965.
[21] Z. Zheng, R. Kohavi, and L. Mason, "Real world performance of association rule algorithms," in Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '01), pp. 401–406, ACM, San Francisco, Calif, USA, August 2001.
Figure 1: Reconstruction probability. Curves of R(p_i, s_i) versus p_i for s_i = 0.01, 0.1, 0.3, 0.5, 0.8, and 0.95.
The proportion of T'_i = 1 can be calculated by (7), and the proportion of T'_i = 0 can be calculated by (8):

P(T'_i = 1) = P(T_i = 1) \cdot p_i + P(T_i = 0) \cdot (1 - p_i)    (7)

P(T'_i = 0) = P(T_i = 0) \cdot p_i + P(T_i = 1) \cdot (1 - p_i)    (8)
For the personalized privacy requirements, different items are distorted with different probability parameters. For a given parameter p_i, the probability of correctly reconstructing the value T_i = 1 is calculated by

R(p_i, s_i) = P(T'_i = 1 | T_i = 1) \cdot P(T_i = 1 | T'_i = 1) + P(T'_i = 0 | T_i = 1) \cdot P(T_i = 1 | T'_i = 0)    (9)
The first part in (9) means that the original value T_i = 1 is distorted into the value T'_i = 1 and then reconstructed from T'_i = 1; the second part is for the distorted value T'_i = 0. Finally, the probability can be computed by (10), where s_i is the support of item i in the original database. A similar derivation of (10) can be found in [7]:

R(p_i, s_i) = \frac{s_i p_i^2}{s_i p_i + (1 - s_i)(1 - p_i)} + \frac{s_i (1 - p_i)^2}{s_i (1 - p_i) + (1 - s_i) p_i}    (10)
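Formula (10) is straightforward to evaluate numerically; this small sketch (our own naming, not from the paper) reproduces the shape of the curves in Figure 1:

```python
def reconstruction_prob(p_i, s_i):
    """R(p_i, s_i) of (10): probability that an original value T_i = 1 is
    correctly reconstructed, given retention probability p_i and item
    support s_i."""
    keep = s_i * p_i ** 2 / (s_i * p_i + (1 - s_i) * (1 - p_i))
    flip = s_i * (1 - p_i) ** 2 / (s_i * (1 - p_i) + (1 - s_i) * p_i)
    return keep + flip
```

At p_i = 0.5 the distortion carries no information about the original bit and R(0.5, s_i) collapses to s_i; the function is symmetric in p_i around 0.5 and grows as p_i moves toward 0 or 1.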
The reconstruction probability curves of (10) are shown in Figure 1. We can see that the higher the item support, the more easily an item with value 1 is reconstructed. Besides, the curves are symmetric around p_i = 0.5: the further p_i is from 0.5, the more easily an item with value 1 is reconstructed. Therefore, for a given item, different probability parameters lead to different degrees of privacy protection. The parameters for the items will be set by the data owner according to the properties of these items. For example, the parameter p_i for the item milk can be very high, even 1.0, while for drugs it can be closer to 0.5.

Under the B2C environment, the interviewer and the respondents cooperate with each other to set the parameter vector P; then the respondents disturb their personal data and send them to the interviewer. Under the B2B environment, the data owner disturbs the original data with the parameter vector P and sends the distorted data together with P to the third parties.
4.2. Itemset Support Reconstruction. After getting the distorted data D', the data miner reconstructs the support of itemsets to find the frequent ones. In order to reconstruct the support of an itemset, we need the count of every combination of the items in this itemset. For example, in order to compute the support of the itemset ABC in the original dataset D, that is, P(A = 1, B = 1, C = 1), we compute the counts of the 2^3 combinations of ABC in D', that is, c'(A = 0, B = 0, C = 0), c'(A = 0, B = 0, C = 1), ..., and c'(A = 1, B = 1, C = 1). This is because the original value (A = 1, B = 1, C = 1) can be distorted into any of these 2^3 combinations of ABC.

Let S be the set of combinations of the k items in (11), where S_j is the binary form of the value j in (12). For example, when k = 3 with items ABC, S_0 is (A = 0, B = 0, C = 0) and S_3 is (A = 0, B = 1, C = 1); that is, 011 is the binary form of 3:

S = {S_0, S_1, ..., S_{2^k - 1}}    (11)

S_i = binary(i, k)    (12)
The combination S_i in the original database can be distorted into any combination S_j in S. According to probability theory, the probability of S_i being distorted to S_j can be computed by

r_{ij} = \prod\big( (S_i \odot S_j) \circ P^{(k)} + (S_i \oplus S_j) \circ (1 - P^{(k)}) \big)    (13)
where ∘ is the Hadamard product of two vectors and 119875(119896) is
the distorting probability parameters corresponding to the 119896items prod() returns the product of the vector elementsThatmeans that if the value on an item 119894 is kept then the 119901
(119894)is
multiplied to 119903119894119895 otherwise 1 minus119901
(119894) An example with 119896 as 3 in
Table 1 illustrates the formulation on the transition matrixCorresponding to the combinations defined in (11) of the
given 119896 items we define their counts in the original database119863 and the distorted database1198631015840 respectively in (14) where119862is the count vector for 119863 while 1198621015840 is the count vector for 1198631015840119888119894and 1198881015840
119894are the counts of the combination 119878
119894for the 119896 items
in119863 and1198631015840 respectively
119862 =
[[[[
[
1198880
1198881
1198882119896minus1
]]]]
]
1198621015840=
[[[[
[
1198881015840
0
1198881015840
1
1198881015840
2119896minus1
]]]]
]
(14)
Then the relationship between 119862 and 1198621015840 can be expressed
by (15) according to the randomized response distortionmechanism
1198621015840= 119877119862 (15)
The Scientific World Journal 5
Table 1 An example of transition matrix
000 sdot sdot sdot 110 111000 119901
111990121199013
sdot sdot sdot (1 minus 1199011) (1 minus 119901
2) 1199013
(1 minus 1199011)(1 minus 119901
2)(1 minus 119901
3)
001 11990111199012(1 minus 119901
3) sdot sdot sdot (1 minus 119901
1)(1 minus 119901
2)(1 minus 119901
3) (1 minus 119901
1) (1 minus 119901
2) 1199013
111 (1 minus 1199011) (1 minus 119901
2) 1199013
sdot sdot sdot 11990111199012(1 minus 119901
3) 119901
111990121199013
After getting the transition matrix 119877 with elementscomputed by (12) we can reconstruct the counts of combi-nations by (16) Hence we evaluate the performance of thereconstruction by computing the difference between 119862 and119862
119862 = 119877minus11198621015840 (16)
The count of the last combination divided by the numberof transactions is the estimated support of the correspondingitemset as shown in (17) where 119898 is the number of transac-tions in119863
119904 =1198882119896minus1
119898 (17)
43 Mining Frequent Itemset from Distorted Dataset Wemine the frequent itemset from the database by using the clas-sic Apriori algorithm which employs an iterative approach tosearch the frequent itemset The frequent 119896-itemsets are usedto generate the candidate (119896 + 1)-itemsets by the AprioriGenThen Apriori computes the support of candidate (119896 + 1)-itemsets and filters out the itemsets with support less than theminimum support These operations iteratively run until nomore frequent itemsets can be found
In the privacy-preserving personalized frequent itemsetmining we need to reconstruct the support of candidateitemsets from the distorted database For a candidate itemsetwe compute the counts of all its items combinations In orderto accelerate the counting speed we firstly remove the itemswith support less than the minimum support Then for thecandidate itemset 119883 = 119868
1 119868
119896 we extract the subdataset
119863(119883) = 119863(1198681 119868
119896) from the database1198631015840 In the subdataset
119863(119883) we map each transaction 119905 isin 119863(119883) into a value by
V = 119905 sdot 119887 = (1199051 1199052 119905
119896) sdot (2119896minus1
2119896minus2
20) (18)
By (18) a vector119881will be computed wherein V119894= 119863(119883)119894sdot
119887 Then 1198881015840
119895in (14) is the count of elements in 119881 equal to
119895 By this method it is very fast to compute the vector 1198621015840Figure 2 gives an example of transaction mapping that mapsthe transaction 119905 into a value By computing the count of eachvalue from0 to 7 we can get the counts of all the combinationsfor the itemset 119860119862119863 By using (16) and (17) we can get thesupport of the itemset119883
In our privacy-preserving personalized frequent itemsetmining we iteratively generate the candidate (119896 + 1)-itemsetfrom the frequent 119896-itemset and check the support of thecandidate itemset using the itemset support reconstruction
ACD ACDt middot b
000
010
000
011
011
middot middot middot
111
100
101
100
111
0
1
2
3
4
5
6
7
000
001
010
011
100
101
110
111
Figure 2 An example of transaction mapping
5 Experimental Results
In this section we evaluate the performance about thepersonalized privacy-preserving frequent itemset miningon the real dataset and synthetic datasets generated bythe classic IBM Quest Market-Basket Synthetic Data Gen-erator (The C++ source code can be downloaded athttpwwwcsloyolaedusimcgiannelassoc genhtml)
51 Datasets The synthetic datasets are generated by theIBM Almaden generator with the parameters TIDN [1]where 119879 is the average size of the transactions 119868 is theaverage size of the maximal potentially large itemsets 119863 isthe number of transactions and 119873 is the number of itemsWe generated two synthetic datasets T3I4D500KN10 andT40I10D100KN942
The real dataset is BMS-WebView-1 [21] which containsthe click-stream data from the website of a legwear and legcare retailer The dataset contains 59602 transactions and497 items We scaled the dataset with a factor of 10 andgot the dataset BMS-WebView-1x10 which contains 596020transactions and 497 items
52 Evaluation Metrics We evaluate our method on thedistorted database by two kinds of error metrics [7] thesupport errors and the identity error Let 119865 be the frequent
6 The Scientific World Journal
itemsets discovered from the original database 119863 by theApriori algorithm and119865 represent the reconstructed frequentitemset minded from the distorted database by our method
The support error metric is evaluated by (19) whichreflects the relative error in the support values We measurethe average error based on the frequent itemsets which arecorrectly identified to be frequent that is119865cap119865 with the givenminimum support 119904min
120588 =1
10038161003816100381610038161003816119865 cap 119865
10038161003816100381610038161003816
sum
119891isin119865cap119865
10038161003816100381610038161003816119904119891minus 119904119891
10038161003816100381610038161003816
119904119891
(19)
The identity error metrics are given in (20): $\sigma^+$ indicates the percentage of false positives, that is, the percentage of reconstructed frequent itemsets which do not exist in the original frequent itemsets; $\sigma^-$ indicates the percentage of false negatives, that is, the percentage of original frequent itemsets which are not correctly reconstructed as frequent:

$$\sigma^+ = \frac{|\tilde{F}| - |F \cap \tilde{F}|}{|F|}, \qquad \sigma^- = \frac{|F| - |F \cap \tilde{F}|}{|F|}. \tag{20}$$
The corresponding metrics are illustrated in Figure 3. For the metric $\rho$, the intersection of the two frequent itemsets $F$ and $\tilde{F}$ may be empty; in this case we do not compute $\rho$.
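As a minimal sketch (not the authors' code), the three metrics in (19) and (20) can be computed directly from the two sets of mined itemsets and their support values; the variable names are illustrative:

```python
def itemset_errors(orig_support, recon_support):
    """Compute support error rho and identity errors sigma+ / sigma-.

    orig_support:  dict mapping each original frequent itemset
                   (a frozenset of items) to its true support s_f
    recon_support: dict mapping each reconstructed frequent itemset
                   to its estimated support
    """
    F, F_tilde = set(orig_support), set(recon_support)
    common = F & F_tilde
    # rho: average relative support error over correctly identified itemsets,
    # undefined (None) when the intersection is empty
    rho = (sum(abs(orig_support[f] - recon_support[f]) / orig_support[f]
               for f in common) / len(common)) if common else None
    sigma_plus = (len(F_tilde) - len(common)) / len(F)   # false positives
    sigma_minus = (len(F) - len(common)) / len(F)        # false negatives
    return rho, sigma_plus, sigma_minus
```

Returning `None` for an empty intersection mirrors the paper's convention of not computing $\rho$ in that case.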
5.3. Results Analysis. We evaluate our method by measuring the support error and the two identity errors given in formulas (19) and (20), and we conduct our experiments on the three datasets described in Section 5.1. We compare our method with MASK, which distorts the original dataset with only one parameter value $p$; if the personalized distortion vector of our method is $P$, then $p = \min(P)$. In our experiments, the elements of the personalized distortion vector $P$ in formula (5) are generated following the uniform distribution over the range $[0.8, 0.95]$; the average value of the vector $P$ is 0.8741. In order to protect all the items or attributes in a dataset, the parameter $p$ in MASK must be set to 0.8. Therefore, we compare our method with MASK with the parameter $p$ set to 0.8.
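For illustration only (assuming NumPy; not the paper's implementation), the experimental setup can be sketched as follows: a personalized vector $P$ is drawn uniformly from $[0.8, 0.95]$, and each item $i$ keeps its true 0/1 value with probability $P[i]$ and flips it otherwise, which is the per-item randomized response the method builds on, while MASK would use the single value $p = \min(P)$ for every column:

```python
import numpy as np

rng = np.random.default_rng(0)

def personalized_distort(data, P, rng):
    """Randomized response with a per-item retention probability.

    data: (m, n) 0/1 matrix, one row per transaction
    P:    length-n vector; item i keeps its value with probability P[i]
    """
    keep = rng.random(data.shape) < P          # P broadcasts over the rows
    return np.where(keep, data, 1 - data)      # flip the bits that are not kept

m, n = 1000, 10
data = (rng.random((m, n)) < 0.3).astype(int)  # toy transaction matrix
P = rng.uniform(0.8, 0.95, size=n)             # personalized privacy levels
distorted = personalized_distort(data, P, rng)
p_mask = P.min()                               # MASK would use this single level
```

The dataset sizes and the range of $P$ here are taken from the experiments described above; the matrix itself is synthetic toy data.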
5.3.1. Comparisons with Different Minimum Support. The first experiment was conducted on T3I4D500KN10. In this experiment, we set the minimum support from 0.05% to 0.95% with a step length of 0.05%. For each minimum support, we ran MASK and our personalized method 100 times and computed the average values and the standard deviations of the evaluation metrics. The results are shown in Figures 4, 5, and 6.
From Figures 4, 5, and 6, we can see that our method has smaller errors than MASK. This is because MASK has to distort all the items or attributes at the maximum distortion level in order to protect each item or attribute, while our method distorts different items with different distortion levels. The results show that our method leads to only about half of the support error $\rho$ of MASK on the dataset T3I4D500KN10. Besides, from Figures 5 and 6, we can see that our method has much smaller deviations of the identity errors.

Figure 3: Evaluation metrics.

Figure 4: Support error $\rho$ versus minimum support on T3I4D500KN10.

Figure 5: Frequent itemset added ratio $\sigma^+$ versus minimum support on T3I4D500KN10.
Figure 6: Frequent itemset lost ratio $\sigma^-$ versus minimum support on T3I4D500KN10.

Figure 7: Support error $\rho$ versus minimum support on BMS-WebView-1x10.
We conducted the same experiments on the real dataset BMS-WebView-1x10. For this dataset, we set the minimum support from 0.2% to 0.55% with a step length of 0.05%, and for each minimum support we ran the algorithms 20 times. As the average value of the elements in the vector $P$ is 0.8741, we also ran MASK with the parameter $p$ set to 0.8741. Besides, we ran MASK with $p$ set to the maximum value of $P$, that is, 0.95.
We show the results in Figures 7, 8, and 9. The results show that our method is much better than MASK with $p$ set to 0.8. Moreover, the results of our method are very similar to those of MASK with $p$ set to 0.8741, the average value of the vector $P$ for our method. Similar results can be found in Figures 13, 14, and 15 on the dataset T40I10D100KN942. However, MASK with $p$ set to 0.8741 cannot protect all the items or attributes, because some attributes need protection with a privacy level smaller than 0.8741, such as 0.85. For the three metrics, the results of MASK with $p$ set to 0.95 are better than those of our method. This is because the privacy protection level of each attribute in our method is between 0.8 and 0.95.
Figure 8: Frequent itemset added ratio $\sigma^+$ versus minimum support on BMS-WebView-1x10.

Figure 9: Frequent itemset lost ratio $\sigma^-$ versus minimum support on BMS-WebView-1x10.

5.3.2. The Impact of the Size of the Dataset. We conduct the second experiment to evaluate the impact of the size of the dataset on the reconstruction errors given in formulas (19) and (20). The dataset BMS-WebView-1x10 is formed from the dataset BMS-WebView-1 by copying it 10 times. As before, we set the minimum support of the frequent itemset mining from 0.2% to 0.55% with a step length of 0.05% and run the algorithms 20 times for each minimum support.
The corresponding results in Figures 10, 11, and 12 show that the reconstruction errors on BMS-WebView-1x10 are much smaller than the errors on BMS-WebView-1, for both MASK and our method. Note that the frequent itemsets discovered from BMS-WebView-1 are the same as the frequent itemsets discovered from BMS-WebView-1x10. This suggests that, if we want to improve the reconstruction results, we can copy the original dataset many times, distort the new dataset, and send it to the cooperating party. The cooperating party can then accurately discover the interesting patterns without knowing the original private information.
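A quick simulation illustrates why replication helps (a sketch under simplifying assumptions, with a single item, fresh Bernoulli draws standing in for the copies, and the standard randomized-response estimator, not the paper's full pipeline): the variance of the reconstructed support shrinks roughly as $1/m$ in the number of distorted rows, so a 10x larger distorted dataset yields noticeably more stable estimates:

```python
import numpy as np

rng = np.random.default_rng(1)

def rr_estimate(m, s_true, p, rng):
    """Distort m Bernoulli(s_true) bits with retention probability p,
    then invert the randomized response to estimate the support."""
    x = rng.random(m) < s_true
    keep = rng.random(m) < p
    y = np.where(keep, x, ~x)
    # E[mean(y)] = s*p + (1-s)*(1-p), so invert for s
    return (y.mean() - (1 - p)) / (2 * p - 1)

s_true, p, trials = 0.05, 0.8, 200
err_small = np.std([rr_estimate(500, s_true, p, rng) for _ in range(trials)])
err_large = np.std([rr_estimate(5000, s_true, p, rng) for _ in range(trials)])
# with 10x the rows, the spread of the estimate is roughly sqrt(10) smaller
```

The values 0.05, 0.8, and the row counts are arbitrary toy parameters; only the qualitative scaling is the point.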
Figure 10: Support error $\rho$ versus minimum support on different sizes of datasets.

Figure 11: Frequent itemset added ratio $\sigma^+$ versus minimum support on different sizes of datasets.

Figure 12: Frequent itemset lost ratio $\sigma^-$ versus minimum support on different sizes of datasets.

Figure 13: Support error $\rho$ versus itemset length $k$ on dataset T40I10D100KN942.

Figure 14: Frequent itemset added ratio $\sigma^+$ versus itemset length $k$ on dataset T40I10D100KN942.

5.3.3. Comparisons with Different Lengths of Frequent Itemset. The third experiment is conducted on the dataset T40I10D100KN942. In this experiment, we set the minimum support to 1.45%, run the corresponding algorithms 5 times, and take the average values. We also set the parameter $p$ of MASK to 0.8, 0.8741, and 0.95, separately.
For the minimum support 1.45%, the maximum length of the original frequent itemsets is 8, and the number of frequent $k$-itemsets is shown in Table 2. We then compare the results for each itemset length from 1 to 8. The results shown in Figures 13, 14, and 15 demonstrate that our method performs better than MASK with $p$ set to 0.8 for each itemset length, and performs very similarly to MASK with $p$ set to 0.8741.

In Figure 13, as the intersection of the frequent 8-itemsets reconstructed by MASK with $p$ set to 0.8 and the original frequent 8-itemsets is empty, the support error cannot be computed, so that value is not shown. We cannot identify a clear trend of the reconstruction error as the itemset length increases. In Figure 14, when $k$ is 8, the frequent itemset added ratio increases sharply. By carefully analyzing the results, we found that many new frequent itemsets have support very close to the minimum support, which leads to the large identity error.
Figure 15: Frequent itemset lost ratio $\sigma^-$ versus itemset length $k$ on dataset T40I10D100KN942.
Table 2: The number of frequent $k$-itemsets.

$k$     1     2     3     4     5     6     7     8
$|F|$   690   4869  293   368   483   469   331   25
6. Conclusions

In this paper, we solve the problem of how to provide personalized privacy protection on different attributes or items while discovering the frequent itemsets. Based on the classic randomized response technique, we proposed a personalized privacy-preserving method. Besides, we proposed a method to improve the efficiency of counting the frequent itemsets by mapping each itemset vector into a single value. Experimental results show that the personalized privacy protection method performs much better than the traditional privacy protection method, which provides the same privacy protection for all the different items. From the experimental results, we can also see that we can copy the original dataset many times to create a new dataset and then distort this new dataset; other parties can then discover the frequent itemsets with smaller error, at the expense of more computation and communication cost.
Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments

This research work was supported in part by the National Natural Science Foundation of China under Grants no. 61003231, no. 61103109, and no. 11105024, and by the Fundamental Research Funds for the Central Universities under Grant ZYGX2011J057. The authors are grateful to the anonymous reviewers who made constructive comments.
References

[1] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in Proceedings of the 20th International Conference on Very Large Data Bases (VLDB '94), vol. 1215, pp. 487-499, Santiago de Chile, Chile, September 1994.
[2] L. F. Cranor, J. Reagle, and M. S. Ackerman, Beyond Concern: Understanding Net Users' Attitudes about Online Privacy, MIT Press, Cambridge, Mass, USA, 2000.
[3] J. R. Haritsa, "Mining association rules under privacy constraints," in Privacy-Preserving Data Mining, pp. 239-266, Springer, New York, NY, USA, 2008.
[4] X. Xiao and Y. Tao, "Personalized privacy preservation," in Proceedings of the ACM International Conference on Management of Data (SIGMOD '06), pp. 229-240, ACM, Chicago, Ill, USA, June 2006.
[5] X. Xiao, Y. Tao, and M. Chen, "Optimal random perturbation at multiple privacy levels," Proceedings of the VLDB Endowment, vol. 2, no. 1, pp. 814-825, 2009.
[6] A. C. Tamhane, "Randomized response techniques for multiple sensitive attributes," Journal of the American Statistical Association, vol. 76, no. 376, pp. 916-923, 1981.
[7] S. J. Rizvi and J. R. Haritsa, "Maintaining data privacy in association rule mining," in Proceedings of the 28th International Conference on Very Large Data Bases (VLDB '02), pp. 682-693, VLDB Endowment, Hong Kong, August 2002.
[8] V. S. Verykios and A. Gkoulalas-Divanis, "A survey of association rule hiding methods for privacy," in Privacy-Preserving Data Mining, pp. 267-289, Springer, New York, NY, USA, 2008.
[9] Y.-H. Wu, C.-M. Chiang, and A. L. P. Chen, "Hiding sensitive association rules with limited side effects," IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 1, pp. 29-42, 2007.
[10] E. D. Pontikakis, A. A. Tsitsonis, and V. S. Verykios, "An experimental study of distortion-based techniques for association rule hiding," in Research Directions in Data and Applications Security XVIII, vol. 144 of IFIP International Federation for Information Processing, pp. 325-339, Springer, 2004.
[11] E. D. Pontikakis, Y. Theodoridis, A. A. Tsitsonis, L. Chang, and V. S. Verykios, "A quantitative and qualitative analysis of blocking in association rule hiding," in Proceedings of the ACM Workshop on Privacy in the Electronic Society (WPES '04), pp. 29-30, ACM Press, Washington, DC, USA, October 2004.
[12] S.-L. Wang and A. Jafari, "Using unknowns for hiding sensitive predictive association rules," in Proceedings of the IEEE International Conference on Information Reuse and Integration (IRI '05), pp. 223-228, IEEE, Las Vegas, Nev, USA, August 2005.
[13] S. Xingzhi and P. S. Yu, "A border-based approach for hiding sensitive frequent itemsets," in Proceedings of the 5th IEEE International Conference on Data Mining (ICDM '05), pp. 426-433, IEEE, Houston, Tex, USA, November 2005.
[14] G. V. Moustakides and V. S. Verykios, "A max-min approach for hiding frequent itemsets," in Proceedings of the 6th IEEE International Conference on Data Mining Workshops (ICDM '06), pp. 502-506, Hong Kong, December 2006.
[15] S. Menon, S. Sarkar, and S. Mukherjee, "Maximizing accuracy of shared databases when concealing sensitive patterns," Information Systems Research, vol. 16, no. 3, pp. 256-270, 2005.
[16] A. Gkoulalas-Divanis and V. S. Verykios, "An integer programming approach for frequent itemset hiding," in Proceedings of the 15th ACM Conference on Information and Knowledge Management (CIKM '06), pp. 748-757, ACM, Arlington, Va, USA, November 2006.
[17] J. Vaidya and C. Clifton, "Privacy preserving association rule mining in vertically partitioned data," in Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '02), pp. 639-644, ACM, Alberta, Canada, July 2002.
[18] M. Kantarcioglu and C. Clifton, "Privacy-preserving distributed mining of association rules on horizontally partitioned data," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 9, pp. 1026-1037, 2004.
[19] N. Zhang, S. Wang, and W. Zhao, "A new scheme on privacy preserving association rule mining," in Knowledge Discovery in Databases: PKDD 2004, vol. 3202 of Lecture Notes in Computer Science, pp. 484-495, 2004.
[20] S. L. Warner, "Randomized response: a survey technique for eliminating evasive answer bias," Journal of the American Statistical Association, vol. 60, no. 309, pp. 63-66, 1965.
[21] Z. Zheng, R. Kohavi, and L. Mason, "Real world performance of association rule algorithms," in Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '01), pp. 401-406, ACM, San Francisco, Calif, USA, August 2001.
Table 1: An example of transition matrix.

       000                        ...   110                        111
000    $p_1 p_2 p_3$              ...   $(1-p_1)(1-p_2)p_3$        $(1-p_1)(1-p_2)(1-p_3)$
001    $p_1 p_2 (1-p_3)$          ...   $(1-p_1)(1-p_2)(1-p_3)$    $(1-p_1)(1-p_2)p_3$
...
111    $(1-p_1)(1-p_2)(1-p_3)$    ...   $p_1 p_2 (1-p_3)$          $p_1 p_2 p_3$
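A sketch of how such a transition matrix can be built for $k$ items (illustrative code, not the authors' implementation): the entry for an observed/true pair of combinations multiplies, over the $k$ bit positions, $p_i$ when the observed bit equals the true bit and $1 - p_i$ when it was flipped:

```python
import numpy as np
from itertools import product

def transition_matrix(P):
    """R[obs, true] = prod_i (P[i] if bits agree else 1 - P[i]),
    with combinations indexed as binary numbers (MSB = item 1)."""
    k = len(P)
    R = np.empty((2**k, 2**k))
    for obs, true in product(range(2**k), repeat=2):
        prob = 1.0
        for i in range(k):
            bit = k - 1 - i                                  # item i sits at this bit
            same = ((obs >> bit) & 1) == ((true >> bit) & 1)
            prob *= P[i] if same else 1 - P[i]
        R[obs, true] = prob
    return R

R = transition_matrix([0.9, 0.85, 0.8])   # hypothetical retention probabilities
# each column is a probability distribution over the observed combinations
assert np.allclose(R.sum(axis=0), 1.0)
```

For $k = 3$ and retention probabilities $p_1, p_2, p_3$, the entries reproduce Table 1 (e.g. the 000/000 entry is $p_1 p_2 p_3$).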
After getting the transition matrix $R$ with elements computed by (12), we can reconstruct the counts of the combinations by (16). Hence, we evaluate the performance of the reconstruction by computing the difference between $C$ and $\hat{C}$:

$$\hat{C} = R^{-1} C'. \tag{16}$$

The count of the last combination divided by the number of transactions is the estimated support of the corresponding itemset, as shown in (17), where $m$ is the number of transactions in $D$:

$$\hat{s} = \frac{\hat{c}_{2^k - 1}}{m}. \tag{17}$$
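A self-contained sketch of (16) and (17) for a two-item candidate itemset (the retention probabilities and counts are hypothetical toy values): solving $R\hat{C} = C'$ recovers the estimated combination counts, and the last entry divided by $m$ gives the estimated support; `np.linalg.solve` is used instead of explicitly inverting $R$:

```python
import numpy as np

# toy retention probabilities for the two items of the candidate itemset
p1, p2 = 0.9, 0.8
agree = lambda a, b, p: p if a == b else 1 - p
# R[obs, true] = P(observed combination | true combination), combos 00,01,10,11
R = np.array([[agree((o >> 1) & 1, (t >> 1) & 1, p1) * agree(o & 1, t & 1, p2)
               for t in range(4)] for o in range(4)])

C_true = np.array([700.0, 100.0, 150.0, 50.0])   # true counts of 00, 01, 10, 11
C_obs = R @ C_true                               # expected distorted counts C'
C_hat = np.linalg.solve(R, C_obs)                # reconstruction: C_hat = R^-1 C'
m = C_true.sum()
s_hat = C_hat[-1] / m                            # estimated support of the itemset
```

With the expected (noise-free) counts the reconstruction is exact; on real distorted data $C'$ carries sampling noise, so $\hat{C}$ is only an estimate.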
4.3. Mining Frequent Itemsets from the Distorted Dataset. We mine the frequent itemsets from the database by using the classic Apriori algorithm, which employs an iterative approach to search for the frequent itemsets. The frequent $k$-itemsets are used to generate the candidate $(k+1)$-itemsets by the AprioriGen procedure. Then Apriori computes the support of the candidate $(k+1)$-itemsets and filters out the itemsets with support less than the minimum support. These operations run iteratively until no more frequent itemsets can be found.
In the privacy-preserving personalized frequent itemset mining, we need to reconstruct the support of the candidate itemsets from the distorted database. For a candidate itemset, we compute the counts of all of its item combinations. In order to accelerate the counting, we first remove the items with support less than the minimum support. Then, for the candidate itemset $X = \{I_1, \ldots, I_k\}$, we extract the subdataset $D(X) = D(I_1, \ldots, I_k)$ from the database $D'$. In the subdataset $D(X)$, we map each transaction $t \in D(X)$ into a value by

$$v = t \cdot b = (t_1, t_2, \ldots, t_k) \cdot (2^{k-1}, 2^{k-2}, \ldots, 2^0). \tag{18}$$

By (18), a vector $V$ is computed, wherein $V_i = D(X)_i \cdot b$. Then $c'_j$ in (14) is the count of the elements in $V$ equal to $j$. With this method, it is very fast to compute the vector $C'$. Figure 2 gives an example of transaction mapping, which maps each transaction $t$ into a value. By computing the count of each value from 0 to 7, we can get the counts of all the combinations for the itemset $\{A, C, D\}$. By using (16) and (17), we can get the support of the itemset $X$.
In our privacy-preserving personalized frequent itemset mining, we iteratively generate the candidate $(k+1)$-itemsets from the frequent $k$-itemsets and check the support of each candidate itemset using the itemset support reconstruction described above.
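The candidate-generation step (AprioriGen) referred to above can be sketched as a standard join-and-prune implementation (illustrative, not the paper's code):

```python
from itertools import combinations

def apriori_gen(frequent_k):
    """Join frequent k-itemsets that share k-1 items, then prune any
    candidate (k+1)-itemset that has an infrequent k-subset."""
    freq = set(frequent_k)
    k = len(next(iter(freq)))
    candidates = set()
    for a, b in combinations(freq, 2):
        union = a | b
        if len(union) == k + 1:
            # Apriori property: every k-subset of a frequent itemset is frequent
            if all(frozenset(s) in freq for s in combinations(union, k)):
                candidates.add(union)
    return candidates
```

For example, from the frequent 2-itemsets {A,B}, {A,C}, {B,C} this yields the single candidate {A,B,C}.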
Figure 2: An example of transaction mapping.
[13] S Xingzhi and P S Yu ldquoA border-based approach for hidingsensitive frequent itemsetsrdquo in Proceedings of the 5th IEEEInternational Conference on Data Mining (ICDM rsquo05) pp 426ndash433 IEEE Houston Tex USA November 2005
[14] G V Moustakides and V S Verykios ldquoA max-min approachfor hiding frequent itemsetsrdquo in Proceedings of the 6th IEEEInternational Conference on Data Mining Workshops (ICDMrsquo06) pp 502ndash506 Hong Kong December 2006
[15] S Menon S Sarkar and S Mukherjee ldquoMaximizing accuracyof shared databases when concealing sensitive patternsrdquo Infor-mation Systems Research vol 16 no 3 pp 256ndash270 2005
[16] A Gkoulalas-Divanis and V S Verykios ldquoAn integer program-ming approach for frequent itemset hidingrdquo in Proceedingsof the 15th ACM Conference on Information and KnowledgeManagement (CIKM rsquo06) pp 748ndash757 ACM Arlington VaUSA November 2006
10 The Scientific World Journal
[17] J Vaidya and C Clifton ldquoPrivacy preserving association rulemining in vertically partitioned datardquo in Proceedings of the8th ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining (KDD rsquo02) pp 639ndash644 ACMAlberta Canada July 2002
[18] MKantarcioglu andCClifton ldquoPrivacy-preserving distributedmining of association rules on horizontally partitioned datardquoIEEE Transactions on Knowledge and Data Engineering vol 16no 9 pp 1026ndash1037 2004
[19] N Zhang S Wang and W Zhao ldquoA new scheme on privacypreserving association rule miningrdquo in Knowledge Discovery inDatabases PKDD 2004 vol 3202 of Lecture Notes in ComputerScience pp 484ndash495 2004
[20] S L Warner ldquoRandomized response a survey technique foreliminating evasive answer biasrdquo Journal of the AmericanStatistical Association vol 60 no 309 pp 63ndash66 1965
[21] Z Zheng R Kohavi and L Mason ldquoReal world performanceof association rule algorithmsrdquo in Proceedings of the 7th ACMSIGKDD International Conference on Knowledge Discovery andData Mining (KDD rsquo01) pp 401ndash406 ACM San FranciscoCalif USA August 2001
6 The Scientific World Journal
itemsets discovered from the original database D by the Apriori algorithm, and F̃ represents the reconstructed frequent itemsets mined from the distorted database by our method.
The support error metric is evaluated by (19), which reflects the relative error in the support values. We measure the average error over the frequent itemsets that are correctly identified as frequent, that is, F ∩ F̃, at the given minimum support s_min:

\[
\rho = \frac{1}{|F \cap \tilde{F}|} \sum_{f \in F \cap \tilde{F}} \frac{|s_f - \tilde{s}_f|}{s_f}. \tag{19}
\]
The identity error metrics are given in (20): σ+ indicates the percentage of false positives, that is, the percentage of reconstructed itemsets that do not exist in the original frequent itemsets; σ− indicates the percentage of false negatives, that is, the percentage of original frequent itemsets that are not correctly reconstructed as frequent:

\[
\sigma^{+} = \frac{|\tilde{F}| - |F \cap \tilde{F}|}{|F|}, \qquad
\sigma^{-} = \frac{|F| - |F \cap \tilde{F}|}{|F|}. \tag{20}
\]
The corresponding metrics are illustrated in Figure 3. For the metric ρ, the intersection of the two frequent itemset collections F and F̃ may be empty; in that case we do not compute ρ.
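The three metrics above can be sketched in a few lines. This is an illustrative implementation, not the authors' code; the function and variable names are our own.

```python
def evaluation_metrics(true_supports, mined_supports):
    """true_supports: {itemset: support} mined from the original database D;
    mined_supports: {itemset: support} reconstructed from the distorted one.
    Returns (rho, sigma_plus, sigma_minus) per equations (19)-(20)."""
    F, F_tilde = set(true_supports), set(mined_supports)
    common = F & F_tilde
    # Support error rho (19): average relative support error over F ∩ F~;
    # undefined (None) when the intersection is empty, as noted in the text.
    rho = (sum(abs(true_supports[f] - mined_supports[f]) / true_supports[f]
               for f in common) / len(common)) if common else None
    # Identity errors (20): false-positive and false-negative ratios over |F|.
    sigma_plus = (len(F_tilde) - len(common)) / len(F)
    sigma_minus = (len(F) - len(common)) / len(F)
    return rho, sigma_plus, sigma_minus
```

For example, with F = {{a}, {b}} and F̃ = {{a}, {c}}, the itemset {a} is the only correct reconstruction, so both identity errors equal 1/2 and ρ averages over {a} alone.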
5.3. Results Analysis. We evaluate our method by measuring the support error and the two identity errors given in formulas (19) and (20), and we conduct our experiments on the three datasets described in Section 5.1. We compare our method with MASK, which distorts the original dataset with a single parameter value p. If the personalized distortion vector of our method is P, then p = min(P). In our experiments, the elements of the personalized distortion vector P in formula (5) are generated following the uniform distribution over the range [0.8, 0.95]. The average value of vector P is 0.8741. In order to protect all the items or attributes in a dataset, the parameter p in MASK must be set to 0.8. Therefore, we compare our method with MASK using p = 0.8.
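The setup above can be sketched as follows: a per-item retention vector P drawn uniformly from [0.8, 0.95], and a MASK-style randomized response that keeps bit j with probability P[j] and flips it otherwise. This is a hedged sketch under our own naming and data layout (rows as 0/1 lists), not the paper's implementation.

```python
import random

def make_distortion_vector(n_items, lo=0.8, hi=0.95, rng=None):
    """Personalized distortion vector P: one retention probability per item,
    drawn uniformly from [lo, hi] as in the experiments ([0.8, 0.95])."""
    rng = rng or random.Random(0)
    return [rng.uniform(lo, hi) for _ in range(n_items)]

def personalized_distort(transactions, P, rng=None):
    """MASK-style randomized response with per-item retention probabilities:
    bit j of each row is kept with probability P[j], flipped with 1 - P[j]."""
    rng = rng or random.Random(0)
    return [[bit if rng.random() < P[j] else 1 - bit
             for j, bit in enumerate(row)]
            for row in transactions]
```

Setting every entry of P to the same value recovers plain MASK with a single parameter p, which is exactly the baseline compared against below.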
5.3.1. Comparisons with Different Minimum Support. The first experiment was conducted on T3I4D500KN10. In this experiment, we set the minimum support from 0.05% to 0.95% with a step length of 0.05%. For each minimum support, we ran MASK and our personalized method 100 times and computed the average values and the standard deviations of the evaluation metrics. The results are shown in Figures 4, 5, and 6.
From Figures 4, 5, and 6 we can see that our method has smaller error than MASK. This is because MASK has to distort all the items or attributes at the maximum distortion level in order to protect each item
Figure 3: Evaluation metrics.
Figure 4: Support error ρ versus minimum support on T3I4D500KN10.
Figure 5: Frequent itemset added ratio σ+ versus minimum support on T3I4D500KN10.
or attribute, while our method distorts different items at different distortion levels. The results show that our method incurs only half the support error ρ of MASK on the dataset T3I4D500KN10. Besides, from Figures 5 and 6 we can see that our method has much smaller deviations of the identity errors.
Figure 6: Frequent itemset lost ratio σ− versus minimum support on T3I4D500KN10.
Figure 7: Support error ρ versus minimum support on BMS-WebView-1x10.
We conducted the same experiments on the real dataset BMS-WebView-1x10. For this dataset, we set the minimum support from 0.2% to 0.55% with a step length of 0.05%, and for each minimum support we ran the algorithms 20 times. As the average value of the elements in vector P is 0.8741, we also ran MASK with parameter p = 0.8741. Besides, we ran MASK with p equal to the maximum value of P, that is, 0.95.
We show the results in Figures 7, 8, and 9; they show that our method is much better than MASK with p = 0.8. However, the results of our method are very similar to those of MASK with p = 0.8741, the average value of vector P in our method. Similar results can be found in Figures 13, 14, and 15 on the dataset T40I10D100KN942. But MASK with p = 0.8741 cannot protect all the items or attributes, because some attributes need protection at a privacy level smaller than 0.8741, such as 0.85. For all three metrics, the results of MASK with p = 0.95 are better than those of our method; this is because the privacy protection level of each attribute in our method lies between 0.8 and 0.95.
Figure 8: Frequent itemset added ratio σ+ versus minimum support on BMS-WebView-1x10.

Figure 9: Frequent itemset lost ratio σ− versus minimum support on BMS-WebView-1x10.

5.3.2. The Impact of the Size of the Dataset. We conduct the second experiment to evaluate the impact of the size of the dataset on the reconstruction errors given in formulas (19) and (20). The dataset BMS-WebView-1x10 is formed from the dataset BMS-WebView-1 by copying it 10 times. As before, we set the minimum support of frequent itemset mining from 0.2% to 0.55% with a step length of 0.05% and run the algorithms 20 times for each minimum support.
The corresponding results in Figures 10, 11, and 12 show that the reconstruction errors on BMS-WebView-1x10 are much smaller than those on BMS-WebView-1, for both MASK and our method. Note that the frequent itemsets discovered from BMS-WebView-1 are the same as those discovered from BMS-WebView-1x10. This suggests that, to improve the reconstruction results, we can copy the original dataset many times, distort the new dataset, and send it to the cooperating party. The cooperating party can then accurately discover the interesting patterns without learning the original private information.
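The intuition behind the replication effect can be sketched on a single item: each copy of the dataset is distorted independently, so pooling k copies averages k independent noisy views of the same rows, shrinking the standard error of the support estimator by roughly 1/√k. The functions below are our own illustrative names, assuming the standard randomized-response inversion for a retention probability p; they are not the paper's code.

```python
import random

def distort_column(column, p, rng):
    """MASK-style randomized response on one boolean column:
    keep each bit with probability p, flip it with probability 1 - p."""
    return [bit if rng.random() < p else 1 - bit for bit in column]

def estimate_support(distorted_support, p):
    """Unbiased inversion: E[s'] = s*p + (1 - s)*(1 - p), so
    s = (s' - (1 - p)) / (2p - 1)."""
    return (distorted_support - (1 - p)) / (2 * p - 1)

def estimate_from_copies(column, copies, p, rng=None):
    """Distort `copies` independent copies of the same column and estimate
    the original support from the pooled distorted bits; pooling k copies
    shrinks the estimator's standard error by about 1/sqrt(k)."""
    rng = rng or random.Random(0)
    pooled = [b for _ in range(copies) for b in distort_column(column, p, rng)]
    return estimate_support(sum(pooled) / len(pooled), p)
```

For instance, with true support 0.2 and p = 0.9 the expected distorted support is 0.2·0.9 + 0.8·0.1 = 0.26, and the inversion recovers 0.2 exactly in expectation; replication only reduces the variance around that value.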
Figure 10: Support error ρ versus minimum support on different sizes of datasets.

Figure 11: Frequent itemset added ratio σ+ versus minimum support on different sizes of datasets.

Figure 12: Frequent itemset lost ratio σ− versus minimum support on different sizes of datasets.

Figure 13: Support error ρ versus itemset length k on dataset T40I10D100KN942.

Figure 14: Frequent itemset added ratio σ+ versus itemset length k on dataset T40I10D100KN942.

5.3.3. Comparisons with Different Lengths of Frequent Itemsets. The third experiment is conducted on the dataset T40I10D100KN942. In this experiment, we set the minimum support to 1.45%, run the corresponding algorithms 5 times, and report the average values. We also set the parameter p of MASK to 0.8, 0.8741, and 0.95 separately.
For the minimum support 1.45%, the maximum length of the original frequent itemsets is 8, and the number of frequent k-itemsets is shown in Table 2. We then compare the results for each itemset length from 1 to 8. The results shown in Figures 13, 14, and 15 demonstrate that our method performs better than MASK with p = 0.8 for each itemset length and performs very similarly to MASK with p = 0.8741.
In Figure 13, as the intersection of the frequent 8-itemsets reconstructed by MASK with p = 0.8 and the original frequent 8-itemsets is empty, the support error cannot be computed, so the value is not shown. We cannot identify a clear trend in the reconstruction error as the itemset length increases. In Figure 14, when k is 8, the frequent itemset added ratio increases sharply. By carefully analyzing the results, we found that many new frequent itemsets have support very close to the minimum support, which leads to the heavy identity error.
Figure 15: Frequent itemset lost ratio σ− versus itemset length k on dataset T40I10D100KN942.
Table 2: The number of frequent k-itemsets.

k      1    2     3    4    5    6    7    8
|F|    690  4869  293  368  483  469  331  25
6. Conclusions

In this paper, we address the problem of providing personalized privacy protection for different attributes or items while discovering frequent itemsets. Based on the classic randomized response technique, we propose a personalized privacy-preserving method. Besides, we propose a method to improve the efficiency of counting the frequent itemsets by mapping each frequent itemset vector to a single value. Experimental results show that the personalized privacy protection method performs much better than the traditional privacy protection method, which provides the same privacy protection for all items. The experiments also show that we can copy the original dataset many times to create a new dataset and then distort this new dataset; other parties can then discover the frequent itemsets with smaller error, at the expense of more computation and communication cost.
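One reading of the vector-to-value counting idea mentioned above is an integer bitmask encoding: map an itemset's 0/1 vector to a single integer so that containment tests, and hence support counting, reduce to one bitwise AND per transaction. This is our own hedged illustration of that idea, not the paper's exact scheme; the names are ours.

```python
def to_mask(items, item_index):
    """Map a set of items to an integer whose bits mark the present items;
    item_index assigns each item a fixed bit position."""
    mask = 0
    for it in items:
        mask |= 1 << item_index[it]
    return mask

def support_count(itemset_mask, transaction_masks):
    """A transaction contains the itemset iff all the itemset's bits are set,
    i.e. t & m == m; counting supports is then one AND per transaction."""
    return sum(1 for t in transaction_masks
               if t & itemset_mask == itemset_mask)
```

With transactions {a,b}, {a,c}, {a,b,c} encoded once as integers, the support of {a,b} is obtained without re-scanning item lists.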
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This research work was supported in part by the National Natural Science Foundation of China under Grants no. 61003231, no. 61103109, and no. 11105024 and by the Fundamental Research Funds for the Central Universities under ZYGX2011J057. The authors are grateful to the anonymous reviewers who made constructive comments.
References

[1] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in Proceedings of the 20th International Conference on Very Large Data Bases (VLDB '94), vol. 1215, pp. 487–499, Santiago de Chile, Chile, September 1994.
[2] L. F. Cranor, J. Reagle, and M. S. Ackerman, Beyond Concern: Understanding Net Users' Attitudes about Online Privacy, MIT Press, Cambridge, Mass, USA, 2000.
[3] J. R. Haritsa, "Mining association rules under privacy constraints," in Privacy-Preserving Data Mining, pp. 239–266, Springer, New York, NY, USA, 2008.
[4] X. Xiao and Y. Tao, "Personalized privacy preservation," in Proceedings of the ACM International Conference on Management of Data (SIGMOD '06), pp. 229–240, ACM, Chicago, Ill, USA, June 2006.
[5] X. Xiao, Y. Tao, and M. Chen, "Optimal random perturbation at multiple privacy levels," Proceedings of the VLDB Endowment, vol. 2, no. 1, pp. 814–825, 2009.
[6] A. C. Tamhane, "Randomized response techniques for multiple sensitive attributes," Journal of the American Statistical Association, vol. 76, no. 376, pp. 916–923, 1981.
[7] S. J. Rizvi and J. R. Haritsa, "Maintaining data privacy in association rule mining," in Proceedings of the 28th International Conference on Very Large Data Bases (VLDB '02), pp. 682–693, VLDB Endowment, Hong Kong, August 2002.
[8] V. S. Verykios and A. Gkoulalas-Divanis, "A survey of association rule hiding methods for privacy," in Privacy-Preserving Data Mining, pp. 267–289, Springer, New York, NY, USA, 2008.
[9] Y.-H. Wu, C.-M. Chiang, and A. L. P. Chen, "Hiding sensitive association rules with limited side effects," IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 1, pp. 29–42, 2007.
[10] E. D. Pontikakis, A. A. Tsitsonis, and V. S. Verykios, "An experimental study of distortion-based techniques for association rule hiding," in Research Directions in Data and Applications Security XVIII, vol. 144 of IFIP International Federation for Information Processing, pp. 325–339, Springer, 2004.
[11] E. D. Pontikakis, Y. Theodoridis, A. A. Tsitsonis, L. Chang, and V. S. Verykios, "A quantitative and qualitative analysis of blocking in association rule hiding," in Proceedings of the ACM Workshop on Privacy in the Electronic Society (WPES '04), pp. 29–30, ACM Press, Washington, DC, USA, October 2004.
[12] S.-L. Wang and A. Jafari, "Using unknowns for hiding sensitive predictive association rules," in Proceedings of the IEEE International Conference on Information Reuse and Integration (IRI '05), pp. 223–228, IEEE, Las Vegas, Nev, USA, August 2005.
[13] S. Xingzhi and P. S. Yu, "A border-based approach for hiding sensitive frequent itemsets," in Proceedings of the 5th IEEE International Conference on Data Mining (ICDM '05), pp. 426–433, IEEE, Houston, Tex, USA, November 2005.
[14] G. V. Moustakides and V. S. Verykios, "A max-min approach for hiding frequent itemsets," in Proceedings of the 6th IEEE International Conference on Data Mining Workshops (ICDM '06), pp. 502–506, Hong Kong, December 2006.
[15] S. Menon, S. Sarkar, and S. Mukherjee, "Maximizing accuracy of shared databases when concealing sensitive patterns," Information Systems Research, vol. 16, no. 3, pp. 256–270, 2005.
[16] A. Gkoulalas-Divanis and V. S. Verykios, "An integer programming approach for frequent itemset hiding," in Proceedings of the 15th ACM Conference on Information and Knowledge Management (CIKM '06), pp. 748–757, ACM, Arlington, Va, USA, November 2006.
[17] J. Vaidya and C. Clifton, "Privacy preserving association rule mining in vertically partitioned data," in Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '02), pp. 639–644, ACM, Alberta, Canada, July 2002.
[18] M. Kantarcioglu and C. Clifton, "Privacy-preserving distributed mining of association rules on horizontally partitioned data," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 9, pp. 1026–1037, 2004.
[19] N. Zhang, S. Wang, and W. Zhao, "A new scheme on privacy preserving association rule mining," in Knowledge Discovery in Databases: PKDD 2004, vol. 3202 of Lecture Notes in Computer Science, pp. 484–495, 2004.
[20] S. L. Warner, "Randomized response: a survey technique for eliminating evasive answer bias," Journal of the American Statistical Association, vol. 60, no. 309, pp. 63–66, 1965.
[21] Z. Zheng, R. Kohavi, and L. Mason, "Real world performance of association rule algorithms," in Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '01), pp. 401–406, ACM, San Francisco, Calif, USA, August 2001.
10 The Scientific World Journal
[17] J Vaidya and C Clifton ldquoPrivacy preserving association rulemining in vertically partitioned datardquo in Proceedings of the8th ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining (KDD rsquo02) pp 639ndash644 ACMAlberta Canada July 2002
[18] MKantarcioglu andCClifton ldquoPrivacy-preserving distributedmining of association rules on horizontally partitioned datardquoIEEE Transactions on Knowledge and Data Engineering vol 16no 9 pp 1026ndash1037 2004
[19] N Zhang S Wang and W Zhao ldquoA new scheme on privacypreserving association rule miningrdquo in Knowledge Discovery inDatabases PKDD 2004 vol 3202 of Lecture Notes in ComputerScience pp 484ndash495 2004
[20] S L Warner ldquoRandomized response a survey technique foreliminating evasive answer biasrdquo Journal of the AmericanStatistical Association vol 60 no 309 pp 63ndash66 1965
[21] Z Zheng R Kohavi and L Mason ldquoReal world performanceof association rule algorithmsrdquo in Proceedings of the 7th ACMSIGKDD International Conference on Knowledge Discovery andData Mining (KDD rsquo01) pp 401ndash406 ACM San FranciscoCalif USA August 2001
8 The Scientific World Journal
[Figure 10: Support error ρ versus minimum support on different sizes of datasets. Curves: MASK-BMS 1, Personalized-BMS 1, Personalized-BMS 1 × 10, MASK-BMS 1 × 10.]
[Figure 11: Frequent itemset added ratio σ+ versus minimum support on different sizes of datasets. Curves: MASK-BMS 1, Personalized-BMS 1, Personalized-BMS 1 × 10, MASK-BMS 1 × 10.]
[Figure 12: Frequent itemset lost ratio σ− versus minimum support on different sizes of datasets. Curves: MASK-BMS 1, Personalized-BMS 1, Personalized-BMS 1 × 10, MASK-BMS 1 × 10.]
[Figure 13: Support error ρ versus itemset length k on dataset T40I10D100KN942. Curves: MASK-0.8, MASK-0.8741, MASK-0.95, Personalized.]
[Figure 14: Frequent itemset added ratio σ+ versus itemset length k on dataset T40I10D100KN942. Curves: MASK-0.8, MASK-0.8741, MASK-0.95, Personalized.]
T40I10D100KN942. In this experiment, we set the minimum support to 1.45 × 10−3, ran each algorithm 5 times, and report the average values. We set the parameter p of MASK to 0.8, 0.8741, and 0.95 separately.
For this minimum support, the maximum length of the original frequent itemsets is 8, and the number of frequent k-itemsets is shown in Table 2. We then compare the results for each itemset length from 1 to 8. The results shown in Figures 13, 14, and 15 demonstrate that our method performs better than MASK with p = 0.8 for every itemset length, and performs very similarly to MASK with p = 0.8741.
In Figure 13, the intersection of the frequent 8-itemsets reconstructed by MASK with p = 0.8 and the original frequent 8-itemsets is empty, so the support error cannot be computed and no value is shown. We cannot identify a clear trend in the reconstruction error as the itemset length increases. In Figure 14, when k is 8, the frequent itemset added ratio increases sharply. By carefully analyzing the results, we found that many newly added frequent itemsets have support very close to the minimum support, which leads to the large identity error.
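The error measures reported in these figures can be sketched as follows. This is a minimal illustration with our own function names, assuming the standard MASK-style definitions: ρ is the mean relative support error over itemsets frequent in both the original and reconstructed results, and σ+/σ− are the fractions of spurious and lost itemsets relative to the original frequent set. It also reproduces the undefined case noted for Figure 13, where the two result sets do not overlap.

```python
def support_error(orig, rec):
    """rho: average relative support error over itemsets frequent in
    both the original (orig) and reconstructed (rec) results.
    Both arguments map itemsets (frozensets) to support values."""
    common = set(orig) & set(rec)
    if not common:
        return float("nan")  # undefined, as for the 8-itemsets in Figure 13
    return sum(abs(rec[f] - orig[f]) / orig[f] for f in common) / len(common)

def identity_errors(orig, rec):
    """sigma+ = fraction of spurious itemsets in rec,
    sigma- = fraction of original itemsets lost from rec."""
    F, R = set(orig), set(rec)
    return len(R - F) / len(F), len(F - R) / len(F)
```

For example, if the reconstruction keeps itemset {1} with support 0.45 instead of 0.5, loses {2}, and adds {3}, then ρ = 0.1 and σ+ = σ− = 0.5.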
[Figure 15: Frequent itemset lost ratio σ− versus itemset length k on dataset T40I10D100KN942. Curves: MASK-0.8, MASK-0.8741, MASK-0.95, Personalized.]
Table 2: The number of frequent k-itemsets.

k    1    2     3    4    5    6    7    8
|F|  690  4869  293  368  483  469  331  25
6. Conclusions

In this paper, we addressed the problem of providing personalized privacy protection for different attributes or items while discovering frequent itemsets. Based on the classic randomized response technique, we proposed a personalized privacy-preserving method. We also proposed a way to improve the efficiency of counting frequent itemsets, by mapping each frequent itemset vector into a single value. Experimental results show that the personalized privacy protection method performs much better than the traditional method, which provides the same privacy protection for all items. The experiments also show that one can copy the original dataset several times to create a new dataset and then distort this enlarged dataset; others can then discover the frequent itemsets with smaller error, at the expense of more computation and communication cost.
Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments

This research work was supported in part by the National Natural Science Foundation of China under Grants no. 61003231, no. 61103109, and no. 11105024, and by the Fundamental Research Funds for the Central Universities under ZYGX2011J057. The authors are grateful to the anonymous reviewers for their constructive comments.
References

[1] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in Proceedings of the 20th International Conference on Very Large Data Bases (VLDB '94), vol. 1215, pp. 487–499, Santiago de Chile, Chile, September 1994.
[2] L. F. Cranor, J. Reagle, and M. S. Ackerman, Beyond Concern: Understanding Net Users' Attitudes about Online Privacy, MIT Press, Cambridge, Mass, USA, 2000.
[3] J. R. Haritsa, "Mining association rules under privacy constraints," in Privacy-Preserving Data Mining, pp. 239–266, Springer, New York, NY, USA, 2008.
[4] X. Xiao and Y. Tao, "Personalized privacy preservation," in Proceedings of the ACM International Conference on Management of Data (SIGMOD '06), pp. 229–240, ACM, Chicago, Ill, USA, June 2006.
[5] X. Xiao, Y. Tao, and M. Chen, "Optimal random perturbation at multiple privacy levels," Proceedings of the VLDB Endowment, vol. 2, no. 1, pp. 814–825, 2009.
[6] A. C. Tamhane, "Randomized response techniques for multiple sensitive attributes," Journal of the American Statistical Association, vol. 76, no. 376, pp. 916–923, 1981.
[7] S. J. Rizvi and J. R. Haritsa, "Maintaining data privacy in association rule mining," in Proceedings of the 28th International Conference on Very Large Data Bases (VLDB '02), pp. 682–693, VLDB Endowment, Hong Kong, August 2002.
[8] V. S. Verykios and A. Gkoulalas-Divanis, "A survey of association rule hiding methods for privacy," in Privacy-Preserving Data Mining, pp. 267–289, Springer, New York, NY, USA, 2008.
[9] Y.-H. Wu, C.-M. Chiang, and A. L. P. Chen, "Hiding sensitive association rules with limited side effects," IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 1, pp. 29–42, 2007.
[10] E. D. Pontikakis, A. A. Tsitsonis, and V. S. Verykios, "An experimental study of distortion-based techniques for association rule hiding," in Research Directions in Data and Applications Security XVIII, vol. 144 of IFIP International Federation for Information Processing, pp. 325–339, Springer, 2004.
[11] E. D. Pontikakis, Y. Theodoridis, A. A. Tsitsonis, L. Chang, and V. S. Verykios, "A quantitative and qualitative analysis of blocking in association rule hiding," in Proceedings of the ACM Workshop on Privacy in the Electronic Society (WPES '04), pp. 29–30, ACM Press, Washington, DC, USA, October 2004.
[12] S.-L. Wang and A. Jafari, "Using unknowns for hiding sensitive predictive association rules," in Proceedings of the IEEE International Conference on Information Reuse and Integration (IRI '05), pp. 223–228, IEEE, Las Vegas, Nev, USA, August 2005.
[13] S. Xingzhi and P. S. Yu, "A border-based approach for hiding sensitive frequent itemsets," in Proceedings of the 5th IEEE International Conference on Data Mining (ICDM '05), pp. 426–433, IEEE, Houston, Tex, USA, November 2005.
[14] G. V. Moustakides and V. S. Verykios, "A max-min approach for hiding frequent itemsets," in Proceedings of the 6th IEEE International Conference on Data Mining Workshops (ICDM '06), pp. 502–506, Hong Kong, December 2006.
[15] S. Menon, S. Sarkar, and S. Mukherjee, "Maximizing accuracy of shared databases when concealing sensitive patterns," Information Systems Research, vol. 16, no. 3, pp. 256–270, 2005.
[16] A. Gkoulalas-Divanis and V. S. Verykios, "An integer programming approach for frequent itemset hiding," in Proceedings of the 15th ACM Conference on Information and Knowledge Management (CIKM '06), pp. 748–757, ACM, Arlington, Va, USA, November 2006.
[17] J. Vaidya and C. Clifton, "Privacy preserving association rule mining in vertically partitioned data," in Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '02), pp. 639–644, ACM, Alberta, Canada, July 2002.
[18] M. Kantarcioglu and C. Clifton, "Privacy-preserving distributed mining of association rules on horizontally partitioned data," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 9, pp. 1026–1037, 2004.
[19] N. Zhang, S. Wang, and W. Zhao, "A new scheme on privacy preserving association rule mining," in Knowledge Discovery in Databases: PKDD 2004, vol. 3202 of Lecture Notes in Computer Science, pp. 484–495, 2004.
[20] S. L. Warner, "Randomized response: a survey technique for eliminating evasive answer bias," Journal of the American Statistical Association, vol. 60, no. 309, pp. 63–66, 1965.
[21] Z. Zheng, R. Kohavi, and L. Mason, "Real world performance of association rule algorithms," in Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '01), pp. 401–406, ACM, San Francisco, Calif, USA, August 2001.