hiding slides

HIDING SENSITIVE ITEMS IN ASSOCIATION RULE MINING.

…….Exploration of knowledge and privacy preserving.

Presented By:

M. Swarna Rekha, K. Reshma, Ch. Savanth, Vishnu Babu, Nag Santhosh, Moses.

Talk overview

Growing privacy concerns.
Why privacy preserving data mining?
Approaches.
Problem statement.
Apriori algorithm.
Problem description.
Proposed algorithms.
Illustrating examples.
Analysis.
Conclusions.
Software and hardware requirement specification.

Growing privacy concerns

Threat to individual privacy.

Inference of sensitive information, including personal information or even patterns, from non-sensitive information.

Individual items in the database must not be disclosed. An individual item is not necessarily a person; it may be information about a corporation or a transaction record.

Why privacy preserving data mining?

Government / public agencies. Example: The Centers for Disease Control want to identify disease outbreaks. Insurance companies have data on disease incidents, seriousness, patient background, etc. But can/should they release this information?

Industry collaborations / trade groups. Example: An industry trade group may want to identify best practices to help members, but some practices are trade secrets. How do we provide “commodity” results to all (manufacturing using chemical supplies from supplier X has high failure rates) while still preserving secrets (manufacturing process Y gives low failure rates)?

Why privacy preserving data mining?

Multinational corporations. A company would like to mine its data for globally valid results, but national laws may prevent transborder data sharing.

Public use of private data. Data mining enables research studies of large populations, but these populations are reluctant to release personal information.

Example: Patient Records…

Patient health records are split among providers: insurance company, pharmacy, doctor, hospital.

Each agrees not to release the data without my consent.

A medical study wants correlations across providers: rules relating complaints/procedures to “unrelated” drugs.

Does this need patient consent? And that of every other patient?

It shouldn’t! Rules shouldn’t disclose an individual patient’s data.

Approaches:

The first approach is to alter the data before delivery to the data miner so that real values are obscured.

The second approach assumes the data is distributed between two or more sites, and these sites cooperate to learn the global data mining results without revealing the data at their individual sites.

Introduction

Our technique for altering the data selectively modifies individual values in a database to prevent the discovery of a set of rules.

Here we apply a group of heuristic solutions that reduce the number of occurrences of some frequent itemsets below a user-specified minimum threshold.

The second approach is to allow users access to only a subset of data while global data mining results can still be discovered.

Problem statement: Mining of association rules.

Let I = {i1, i2, …, im} be a set of literals, called items. Given a set of transactions D, where each transaction T is a set of items such that T ⊆ I, an association rule is an expression X=>Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅. An example of such a rule is that 90% of customers who buy hamburgers also buy coke. The 90% here is called the confidence of the rule, which means that 90% of the transactions that contain X also contain Y. The support of the rule is the percentage of transactions that contain both X and Y. The problem of mining association rules is to find all rules whose support and confidence are at least the user-specified minimum support and minimum confidence.
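The two measures defined above can be sketched in a few lines of Python. This is only an illustration: the four-transaction database and the helper names `support` and `confidence` are assumptions made for this sketch and do not come from the slides.

```python
from fractions import Fraction

# Hypothetical toy database (not from the slides): each transaction is a set of items.
D = [
    {"hamburger", "coke"},
    {"hamburger", "coke", "fries"},
    {"hamburger"},
    {"coke"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))

def confidence(lhs, rhs, transactions):
    """support(lhs and rhs together) / support(lhs) for the rule lhs => rhs.

    Raises ZeroDivisionError if no transaction contains lhs.
    """
    both = set(lhs) | set(rhs)
    return support(both, transactions) / support(lhs, transactions)

rule_support = support({"hamburger", "coke"}, D)          # 2 of 4 transactions
rule_confidence = confidence({"hamburger"}, {"coke"}, D)  # 2 of the 3 hamburger transactions
```

Exact fractions are used instead of floats so that comparisons against a threshold such as 70% are not affected by rounding.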

[Figure: Mining of association rules. Local data at each of three sites is mined locally, and a data mining combiner merges the local results into combined results. Rule labels shown in the figure: A & B => C; A & B => C, 4%.]

Apriori algorithm:

Apriori is an influential algorithm for mining frequent itemsets from a given database. It employs an iterative approach known as a level-wise search, where k-itemsets are used to explore (k+1)-itemsets.

Apriori property: All non-empty subsets of a frequent itemset must also be frequent.

A two-step process:

1. The join step: To find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself. This set of candidates is denoted Ck. Members of Lk-1 are joinable if their first (k-2) items are in common.

2. The prune step: A scan of the database to determine the count of each candidate in Ck yields Lk. To shrink Ck before that scan, the Apriori property is used: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset. Hence, if any (k-1)-subset of a candidate k-itemset is not in Lk-1, the candidate cannot be frequent either and can be removed from Ck.

Apriori Algorithm:

Input: Database D of transactions; minimum support threshold min_sup.

Output: L, frequent itemsets in D.

Method:

L1 = find_frequent_1-itemsets(D);
for (k = 2; Lk-1 ≠ ∅; k++) {
    Ck = apriori_gen(Lk-1, min_sup);
    for each transaction t ∈ D { // scan D for counts
        Ct = subset(Ck, t); // get the subsets of t that are candidates
        for each candidate c ∈ Ct
            c.count++;
    }
    Lk = {c ∈ Ck | c.count ≥ min_sup}
}
return L = ∪k Lk;

Procedure apriori_gen(Lk-1: frequent (k-1)-itemsets; min_sup: minimum support)

for each itemset l1 ∈ Lk-1
    for each itemset l2 ∈ Lk-1
        if (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ … ∧ (l1[k-2] = l2[k-2]) ∧ (l1[k-1] < l2[k-1]) then {
            c = l1 join l2; // join step: generate candidates
            if has_infrequent_subset(c, Lk-1) then
                delete c; // prune step: remove unfruitful candidate
            else add c to Ck;
        }
return Ck;

Procedure has_infrequent_subset(c: candidate k-itemset; Lk-1: frequent (k-1)-itemsets) // use prior knowledge

for each (k-1)-subset s of c
    if s ∉ Lk-1 then
        return TRUE;
return FALSE;
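A minimal Python sketch of the level-wise search follows, run on the deck's six-transaction example database. It is not a line-by-line port of the pseudocode. As a simplifying assumption, candidates are formed by taking unions of two frequent (k-1)-itemsets that have exactly k items, rather than by the prefix-based join; after pruning, both approaches leave exactly the k-itemsets whose (k-1)-subsets are all frequent. The function names mirror the pseudocode but the code itself is illustrative.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {frozenset: support count} for every frequent itemset.

    `min_sup` is an absolute support count, as in the pseudocode.
    """
    transactions = [frozenset(t) for t in transactions]
    # L1: count single items and keep those that reach min_sup.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(level)
    k = 2
    while level:
        candidates = apriori_gen(set(level), k)
        counts = {c: 0 for c in candidates}
        for t in transactions:  # one scan of D per level
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {s: c for s, c in counts.items() if c >= min_sup}
        frequent.update(level)
        k += 1
    return frequent

def apriori_gen(prev_level, k):
    """Join L(k-1) with itself, then prune candidates that have an infrequent subset."""
    candidates = set()
    for a in prev_level:
        for b in prev_level:
            union = a | b
            if len(union) == k and all(
                frozenset(s) in prev_level for s in combinations(union, k - 1)
            ):
                candidates.add(union)
    return candidates

# The example database: T1..T3 = ABC, T4 = AB, T5 = A, T6 = AC.
D = ["ABC", "ABC", "ABC", "AB", "A", "AC"]
result = apriori(D, min_sup=2)
```

With a minimum support count of 2, `result` holds the seven itemsets A, B, C, AB, AC, BC and ABC together with their support counts.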

Example:

Transaction database D

TID   Items
T1    ABC
T2    ABC
T3    ABC
T4    AB
T5    A
T6    AC

C1 (scan D for the count of each candidate):

Itemset   Sup. count
A         6
B         4
C         4

L1 (compare candidate support count with the minimum support count of 2):

Itemset   Sup. count
A         6
B         4
C         4

C2 (generate C2 candidates from L1; scan D for the count of each candidate):

Itemset   Sup. count
AB        4
AC        4
BC        3

L2:

Itemset   Sup. count
AB        4
AC        4
BC        3

C3 (generate C3 candidates from L2; scan D for the count of each candidate):

Itemset   Sup. count
ABC       3

L3:

Itemset   Sup. count
ABC       3

Generation of candidate itemsets and frequent itemsets, where the minimum support count is 2.

Generating Association Rules:

Consider the frequent 2-itemsets L2 = {AB, AC, BC}. The non-empty proper subsets of these itemsets are {A}, {B} for AB; {A}, {C} for AC; and {B}, {C} for BC. The resulting association rules are:

A=>B Confidence=4/6=66.7%
B=>A Confidence=4/4=100%
A=>C Confidence=4/6=66.7%
C=>A Confidence=4/4=100%
B=>C Confidence=3/4=75%
C=>B Confidence=3/4=75%

If the minimum confidence threshold is 70%, all rules except A=>B and A=>C are strong.

Consider the frequent 3-itemset L3 = {ABC}. The non-empty proper subsets of ABC are {A}, {B}, {C}, {AB}, {AC}, {BC}. The resulting association rules are:

A=>B^C Confidence=3/6=50%
B=>A^C Confidence=3/4=75%
C=>A^B Confidence=3/4=75%
A^B=>C Confidence=3/4=75%
A^C=>B Confidence=3/4=75%
B^C=>A Confidence=3/3=100%

If the minimum confidence threshold is 70%, all rules except the first are strong.
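The rule generation step can be checked mechanically. The sketch below enumerates every non-empty proper subset of one frequent itemset as a left-hand side and keeps the rules whose confidence reaches the threshold. The helper names `count` and `rules_from` are assumptions of this sketch, and the database is the six-transaction example.

```python
from fractions import Fraction
from itertools import combinations

# Example database: T1..T3 = ABC, T4 = AB, T5 = A, T6 = AC.
D = [frozenset(t) for t in ["ABC", "ABC", "ABC", "AB", "A", "AC"]]

def count(itemset):
    """Support count: number of transactions that contain `itemset`."""
    return sum(1 for t in D if itemset <= t)

def rules_from(itemset, min_conf):
    """Yield (lhs, rhs, confidence) for each strong rule built from one frequent itemset."""
    itemset = frozenset(itemset)
    for r in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), r):
            lhs = frozenset(lhs)
            conf = Fraction(count(itemset), count(lhs))
            if conf >= min_conf:
                yield "".join(sorted(lhs)), "".join(sorted(itemset - lhs)), conf

# Strong rules from ABC at a 70% minimum confidence.
strong = list(rules_from("ABC", Fraction(70, 100)))
```

Here `strong` contains five rules; the rule with left-hand side A is dropped because its confidence is 3/6.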

Problem description

We propose algorithms that modify the data in a database so that sensitive items cannot be inferred through association rule mining.

More specifically, given a set H of items to be hidden, the objective is to modify the database D such that no association rule containing items of H on the right-hand side will be discovered.

Proposed algorithms……

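To hide an association rule, we can decrease either its support or its confidence below the pre-specified minimum support or minimum confidence. To decrease the confidence of a rule, we propose two algorithms:

I. Increase Support of LHS First (ISLF).
II. Decrease Support of RHS First (DSRF).

The first algorithm tries to increase the support of the left-hand side of the rule. If that is not successful, it tries to decrease the support of the right-hand side of the rule. The second algorithm, as its name indicates, tries the two steps in the reverse order.

Both steps work because confidence(X=>Y) = support(X ∪ Y) / support(X): raising support(X) alone enlarges the denominator, and removing Y from transactions that contain X shrinks the numerator.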

Algorithm ISLF:

Input:

(1) A source database D,
(2) A min-support,
(3) A min-confidence,
(4) A set of hidden items H

Output:

A transformed database D’, where rules containing H on RHS will be hidden

Algorithm:

Find large 1-itemsets from D;
For each hidden item h ∈ H
    If h is not a large 1-itemset, then H := H - {h};
If H is empty, then EXIT; // no AR contains H in RHS
Find large 2-itemsets from D;
For each h ∈ H {
    For each large 2-itemset containing h {
        Compute confidence of rule U, where U is a rule of the form x => h;
        If confidence(U) > min_conf, then { // Increase Support of LHS
            Find T1 = {t in D | t partially supports LHS(U)};
            Sort T1 in descending order by the number of supported items;
            Repeat {
                Choose the first transaction t from T1;
                Modify t to support LHS(U);
                Compute support and confidence of U;
            } Until (confidence(U) < min_conf or T1 is empty);
        }; // end if confidence > min_conf
        If confidence(U) > min_conf, then { // Decrease Support of RHS
            Find T2 = {t in D | t supports RHS(U)};
            Sort T2 in descending order by the number of supported items;
            Repeat {
                Choose the first transaction t from T2;
                Modify t to partially support RHS(U);
                Compute support and confidence of U;
            } Until (confidence(U) < min_conf or T2 is empty);
        }; // end if confidence > min_conf
        If confidence(U) > min_conf, then
            CAN NOT HIDE h;
        Else
            Update D with new transaction t;
    } // end of for each large 2-itemset
    Remove h from H;
} // end of for each h ∈ H

Output updated D as the transformed D’;
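The inner part of ISLF, for a single rule x => h, might look like the following Python sketch. It is not the full algorithm: it skips the large-itemset discovery and the loop over H. Two choices are assumptions of this sketch rather than statements from the pseudocode. First, the ISL step only adds x to transactions that contain neither x nor h, so that support(x) rises while support(x, h) does not, which is what Example 1 in this deck does with T5. Second, the DSR step removes h from transactions that contain both x and h. The function names are hypothetical.

```python
def confidence(db, x, h):
    """confidence(x => h) = count(x and h) / count(x); 0.0 if x never occurs."""
    n_x = sum(1 for t in db if x in t)
    n_xh = sum(1 for t in db if x in t and h in t)
    return n_xh / n_x if n_x else 0.0

def hide_rule_islf(db, x, h, min_conf):
    """Try to push confidence(x => h) below min_conf; return (new_db, hidden)."""
    db = [set(t) for t in db]  # work on a copy, leave the input untouched
    # ISL step (assumption: only transactions with neither x nor h are modified).
    t1 = sorted((t for t in db if x not in t and h not in t), key=len, reverse=True)
    for t in t1:
        if confidence(db, x, h) < min_conf:
            break
        t.add(x)
    # DSR step: remove h from transactions that contain both x and h.
    t2 = sorted((t for t in db if x in t and h in t), key=len, reverse=True)
    for t in t2:
        if confidence(db, x, h) < min_conf:
            break
        t.discard(h)
    return db, confidence(db, x, h) < min_conf

# Example database: T1..T3 = ABC, T4 = AB, T5 = A, T6 = AC; threshold 70%.
D = [set(t) for t in ["ABC", "ABC", "ABC", "AB", "A", "AC"]]
D1, hidden_c = hide_rule_islf(D, "B", "C", 0.7)   # hide C via rule B => C
D2, hidden_b = hide_rule_islf(D1, "A", "B", 0.7)  # then hide B via rule A => B
```

On this data the first call adds B to T5 (100 becomes 110) and the second call removes B from T1 (111 becomes 101), which matches the D1 and D2 columns of Example 1.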

Examples running ISLF Algorithm

Example 1: Each transaction is written as a bit string over the items A, B, C (for example, 110 is AB). To hide item C, the rule B => C (50%, 75%) will be hidden if transaction T5 is modified from 100 to 110 using ISL. To hide item B, the rule A => B (67%, 83%) will be hidden if transaction T1 is modified from 111 to 101 using DSR.

TID   D     D1    D2
T1    111   111   101
T2    111   111   111
T3    111   111   111
T4    110   110   110
T5    100   110   110
T6    101   101   101

Database before and after hiding items C, B using ISLF

Example 2: Here we reverse the order of hiding items. To hide item B, the rule C => B (50%, 75%) will be hidden if transaction T5 is modified from 100 to 101 using ISL. To hide item C, the rule A => C (83%, 83%) will be hidden if transaction T1 is modified from 111 to 110 using DSR.

TID   D     D3    D4
T1    111   111   110
T2    111   111   111
T3    111   111   111
T4    110   110   110
T5    100   101   101
T6    101   101   101

Database before and after hiding items B, C using ISLF

Examples running DSRF Algorithm

Example 3: To hide item C, the rule B => C (50%, 75%) will be hidden if transaction T1 is modified from 111 to 110 using DSR. To hide item B, the rule C => B (50%, 67%) is already hidden because transaction T1 was modified.

TID   D     D5
T1    111   110
T2    111   111
T3    111   111
T4    110   110
T5    100   100
T6    101   101

Database before and after hiding items C, B using DSRF

Example 4: Here we reverse the order of hiding items. To hide item B, the rule C => B (50%, 75%) will be hidden if transaction T1 is modified from 111 to 101 using DSR. To hide item C, the rule B => C is already hidden because transaction T1 was modified.

TID   D     D6
T1    111   101
T2    111   111
T3    111   111
T4    110   110
T5    100   100
T6    101   101

Database before and after hiding items B, C using DSRF
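Examples 3 and 4 can be reproduced with a short sketch of the DSR step alone, working directly on the bit-string encoding used in the tables. This covers only the first step of DSRF for one rule; the helper names and the tie-breaking rule (transactions with equal item counts are visited in database order) are assumptions of this sketch. The 70% minimum confidence is the threshold used in the rule generation example.

```python
ITEMS = "ABC"  # position i of a transaction string refers to ITEMS[i]

def has(t, item):
    """True if the bit for `item` is set in transaction string `t`."""
    return t[ITEMS.index(item)] == "1"

def clear(t, item):
    """Return a copy of `t` with the bit for `item` set to 0."""
    i = ITEMS.index(item)
    return t[:i] + "0" + t[i + 1:]

def confidence(db, x, h):
    """confidence(x => h) = count(x and h) / count(x); 0.0 if x never occurs."""
    n_x = sum(has(t, x) for t in db)
    n_xh = sum(has(t, x) and has(t, h) for t in db)
    return n_xh / n_x if n_x else 0.0

def dsr(db, x, h, min_conf):
    """Clear h in transactions containing both x and h until x => h is hidden."""
    db = list(db)
    # Visit transactions with the most items first; ties keep database order.
    order = sorted(range(len(db)), key=lambda i: -db[i].count("1"))
    for i in order:
        if confidence(db, x, h) < min_conf:
            break
        if has(db[i], x) and has(db[i], h):
            db[i] = clear(db[i], h)
    return db

D = ["111", "111", "111", "110", "100", "101"]
D5 = dsr(D, "B", "C", 0.7)  # Example 3: hide C via rule B => C
D6 = dsr(D, "C", "B", 0.7)  # Example 4: hide B via rule C => B
```

In both runs only T1 changes, and in each resulting database the reverse rule also falls below 70% confidence, so the second hidden item needs no further modification.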

Analysis:

The first characteristic is that the transformed databases differ under different orderings of the hidden items. In the examples above, databases D2 and D4 are generated using ISLF, and D5 and D6 are generated using the DSRF algorithm.

The second characteristic we analyze is the efficiency of the proposed algorithms compared with Dasseni’s algorithm. The ISLF and DSRF algorithms require fewer database scans and prune more association rules than Dasseni’s algorithm.

Algorithm    DB Scans   Rules Pruned
ISLF         3          2
Dasseni’s    4          0

DB scans and rules pruned in hiding item C using ISLF

One of the reasons that Dasseni’s approach does not prune rules is that the hidden rules are given in advance.

Our approach needs to hide all rules containing hidden items on the right-hand side, but Dasseni’s approach can hide some of the rules containing hidden items on the right-hand side.

The third characteristic we analyze is the relative efficiency of the ISLF and DSRF algorithms. The DSRF algorithm seems to be more effective when the support count of the hidden item is large. This is because, when the support of the right-hand side of the rule is large, increasing the support of the left-hand side usually does not reduce the confidence of the rule, whereas decreasing the support of the right-hand side usually does.

Conclusions:

We have examined the database privacy problems caused by data mining technology and proposed two algorithms for hiding sensitive data in association rule mining.

The proposed algorithms are based on modifying the database transactions so that the confidence of the association rules can be reduced. Examples demonstrating the proposed algorithms are shown.

The efficiency of the proposed approach is further compared with Dasseni’s approach. It was shown that our approach requires fewer database scans and prunes more hidden rules. However, our approach must hide all rules containing the hidden items on the right-hand side, whereas Dasseni’s approach can hide some of the specified rules.

Software requirement specification:

The proposed algorithms can be implemented using Java as front end and Oracle 9i as back end under a Windows environment.

Hardware requirement specification:

Intel Core 2 Duo processor
RAM size
RAM speed
