
Association Mining

Data Mining

Spring 2012

• Transactional Database

• Transaction – A row in the database

• e.g.: {Eggs, Cheese, Milk}

Transactional Database

Transactional dataset:

T1: Eggs, Cheese, Milk
T2: Milk, Jam
T3: Cheese, Bacon, Eggs, Cat food
T4: Butter, Bread
T5: Bread, Butter, Eggs, Milk, Cheese

• Item – a single entry, e.g. {Milk}, {Cheese}, {Bread}

• Itemset – a set of items, e.g. {Milk}, {Milk, Cheese}, {Bacon, Bread, Milk}

• An itemset doesn't have to appear in the dataset

• Can be of size 1 to n

Items and Itemsets

Transactional dataset:

T1: Eggs, Cheese, Milk
T2: Milk, Jam
T3: Cheese, Bacon, Eggs, Cat food
T4: Butter, Bread
T5: Bread, Butter, Eggs, Milk, Cheese

The Support Measure

Support(X) = #transactions containing X / #rows in the database

Support Examples

Support({Eggs}) = 3/5 = 60%

Support({Eggs, Milk}) = 2/5 = 40%

Transactional dataset:

T1: Eggs, Cheese, Milk
T2: Milk, Jam
T3: Cheese, Bacon, Eggs, Cat food
T4: Butter, Bread
T5: Bread, Butter, Eggs, Milk, Cheese
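As a minimal sketch of the computation (plain Python; the dataset encoding and function name are just illustrative), support is the fraction of transactions containing the itemset:

```python
# Toy dataset from the slides: each transaction is a set of items.
dataset = [
    {"Eggs", "Cheese", "Milk"},
    {"Milk", "Jam"},
    {"Cheese", "Bacon", "Eggs", "Cat food"},
    {"Butter", "Bread"},
    {"Bread", "Butter", "Eggs", "Milk", "Cheese"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

print(support({"Eggs"}, dataset))          # 0.6 -> 60%
print(support({"Eggs", "Milk"}, dataset))  # 0.4 -> 40%
```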

Minimum Support

Minsup – the minimum support threshold for an itemset to be considered frequent (user defined).

Frequent itemset – an itemset in a database whose support is greater than or equal to minsup.

Support(X) ≥ minsup → frequent

Support(X) < minsup → infrequent

Minimum Support Examples

Minimum support = 50%

Support({Eggs}) = 3/5 = 60% (Pass)

Support({Eggs, Milk}) = 2/5 = 40% (Fail)

Transactional dataset:

T1: Eggs, Cheese, Milk
T2: Milk, Jam
T3: Cheese, Bacon, Eggs, Cat food
T4: Butter, Bread
T5: Bread, Butter, Eggs, Milk, Cheese

Association Rules

An association rule X => Y states that transactions containing X tend to also contain Y. Its confidence is:

Conf(X => Y) = Support(X ∪ Y) / Support(X)

Confidence Example 1

{Eggs} => {Bread}

Confidence = Support({Eggs, Bread}) / Support({Eggs})

Confidence = (1/5) / (3/5) ≈ 33%

Transactional dataset:

T1: Eggs, Cheese, Milk
T2: Milk, Jam
T3: Cheese, Bacon, Eggs, Cat food
T4: Butter, Bread
T5: Bread, Butter, Eggs, Milk, Cheese

Confidence Example 2

{Milk} => {Eggs, Cheese}

Confidence = Support({Milk, Eggs, Cheese}) / Support({Milk})

Confidence = (2/5) / (3/5) ≈ 67%

Transactional dataset:

T1: Eggs, Cheese, Milk
T2: Milk, Jam
T3: Cheese, Bacon, Eggs, Cat food
T4: Butter, Bread
T5: Bread, Butter, Eggs, Milk, Cheese

Strong Association Rules

Minimum confidence (minconf) – a user-defined lower bound on confidence.

Strong association rule – a rule X => Y whose confidence is at least minconf.

- this is a potentially interesting rule for the user.

Conf(X => Y) ≥ minconf → strong

Conf(X => Y) < minconf → uninteresting

Minimum Confidence Example

Minconf = 50%

{Eggs} => {Bread}

Confidence = (1/5) / (3/5) ≈ 33% (Fail)

{Milk} => {Eggs, Cheese}

Confidence = (2/5) / (3/5) ≈ 67% (Pass)
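A small sketch of both measures together (again illustrative Python, reusing the toy dataset), reproducing the two examples above:

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(X, Y, transactions):
    """conf(X => Y) = support(X u Y) / support(X)."""
    return support(X | Y, transactions) / support(X, transactions)

dataset = [
    {"Eggs", "Cheese", "Milk"},
    {"Milk", "Jam"},
    {"Cheese", "Bacon", "Eggs", "Cat food"},
    {"Butter", "Bread"},
    {"Bread", "Butter", "Eggs", "Milk", "Cheese"},
]

minconf = 0.5
for X, Y in [({"Eggs"}, {"Bread"}), ({"Milk"}, {"Eggs", "Cheese"})]:
    c = confidence(X, Y, dataset)
    print(X, "=>", Y, f"{c:.0%}", "Pass" if c >= minconf else "Fail")
# 33% Fail, 67% Pass -- matching the two examples above
```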

Association Mining

Finds the strong rules contained in a dataset, built from its frequent itemsets.

It can be divided into two major subtasks:

1. Finding frequent itemsets
2. Rule generation

• Some algorithms change items into letters or numbers

• Numbers are more compact

• Easier to make comparisons

Transactional Database Revisited

Transactional dataset (items mapped to integers: Eggs = 1, Cheese = 2, Milk = 3, Cat food = 4, Jam = 5, Butter = 6, Bacon = 7, Bread = 8):

T1: 1, 2, 3
T2: 3, 5
T3: 2, 7, 1, 4
T4: 6, 8
T5: 8, 6, 1, 3, 2

Basic Set Logic

Subset – an itemset X is a subset of an itemset Y if every item of X is contained in Y (X ⊆ Y).

Superset – an itemset Y is a superset of an itemset X if Y contains every item of X (Y ⊇ X).

Example: X = {1,2}, Y = {1,2,3,5} → X ⊆ Y, and Y is a superset of X.

Apriori

Organizes the database's itemsets into a temporary lattice structure to find associations.

Apriori principle:

1. If an itemset in the lattice has support < minsup, all of its supersets also have support < minsup.

2. The subsets of a frequent itemset are always frequent.

Apriori prunes infrequent itemsets from the lattice using minsup, which:
- reduces the number of comparisons
- reduces the number of candidate itemsets

Monotonicity

Monotone (upward closed) – a measure f is monotone if X ⊆ Y implies f(X) ≤ f(Y).

Anti-monotone (downward closed) – a measure f is anti-monotone if X ⊆ Y implies f(Y) ≤ f(X).

Support is anti-monotone: a superset's support can never exceed that of its subsets. Apriori uses this property to prune the lattice structure.

Itemset Lattice

Lattice Pruning

Lattice Example

Dataset:

T1: 1, 2, 3, 4, 5
T2: 2, 4
T3: 1, 2, 4
T4: 1, 4

Count the occurrences of each 1-itemset in the database and compute their support (support = #occurrences / #rows in the database). Prune anything with support less than minsup = 30%.

Lattice Example

Dataset:

T1: 1, 2, 3, 4, 5
T2: 2, 4
T3: 1, 2, 4
T4: 1, 4

Count the occurrences of each 2-itemset in the database and compute their support. Prune anything with support less than minsup = 30%.

Lattice Example

Dataset (the same transactions, with items written as letters A–E):

T1: A, B, C, D, E
T2: B, D
T3: A, B, D
T4: A, D

Count the occurrences of the last 3-itemset in the database and compute its support. Prune anything with support less than minsup = 30%.

Example - Results

Dataset:

T1: 1, 2, 3, 4, 5
T2: 2, 4
T3: 1, 2, 4
T4: 1, 4

Frequent itemsets: {1}, {2}, {4}, {1,2}, {1,4}, {2,4}, {1,2,4}

Apriori Algorithm
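The slides walk through the algorithm on the 4-row database below. As a rough sketch of the level-wise idea (join frequent (k-1)-itemsets, prune candidates with the Apriori principle, count the survivors), assuming that database and minsup = 70%:

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise frequent-itemset generation (a sketch, not optimized)."""
    n = len(transactions)
    sup = lambda s: sum(1 for t in transactions if s <= t) / n

    # Frequent 1-itemsets.
    items = sorted({i for t in transactions for i in t})
    frequent = [s for s in (frozenset([i]) for i in items) if sup(s) >= minsup]
    result = list(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: union pairs of frequent (k-1)-itemsets into k-candidates.
        candidates = {a | b for a, b in combinations(frequent, 2) if len(a | b) == k}
        # Prune step (Apriori principle): every (k-1)-subset must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Count step: keep candidates meeting minsup.
        frequent = [c for c in candidates if sup(c) >= minsup]
        result += frequent
        k += 1
    return result

db = [{1, 2, 3, 4, 5}, {2, 3, 5}, {1, 3, 5}, {1, 5}]
print(apriori(db, minsup=0.7))  # {1}, {3}, {5}, {1,5}, {3,5} (order may vary)
```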

Frequent Itemset Generation

Transactional Database:

T1: 1, 2, 3, 4, 5
T2: 2, 3, 5
T3: 1, 3, 5
T4: 1, 5

1. Minsup = 70%
2. Generate all 1-itemsets
3. Calculate the support of each itemset
4. Determine whether or not each itemset is frequent

Itemset   Support   Frequent
{1}       75%       Yes
{2}       50%       No
{3}       75%       Yes
{4}       25%       No
{5}       100%      Yes

Frequent Itemset Generation

Generate all 2-itemsets from the frequent 1-itemsets, minsup = 70%:

{1} ∪ {3} = {1,3}, {1} ∪ {5} = {1,5}, {3} ∪ {5} = {3,5}

Transactional Database:

T1: 1, 2, 3, 4, 5
T2: 2, 3, 5
T3: 1, 3, 5
T4: 1, 5

Itemset   Support   Frequent
{1,3}     50%       No
{1,5}     75%       Yes
{3,5}     75%       Yes

Frequent Itemset Generation

Generate all 3-itemsets, minsup = 70%:

{1,5} ∪ {3,5} = {1,3,5}

The candidate {1,3,5} is pruned by the Apriori principle: its subset {1,3} is not frequent (and its own support, 50%, is below minsup anyway).

Transactional Database:

T1: 1, 2, 3, 4, 5
T2: 2, 3, 5
T3: 1, 3, 5
T4: 1, 5

Itemset   Support   Frequent
{1,3,5}   50%       No

Frequent Itemset Results

All frequent itemsets generated are output:

{1}, {3}, {5}

{1,5}, {3,5}

Apriori Rule Mining

Rule Combinations:

1. {1,2} (a 2-itemset):

{1} => {2}, {2} => {1}

2. {1,2,3} (a 3-itemset):

{1} => {2,3}, {2,3} => {1}, {1,2} => {3}, {3} => {1,2}, {1,3} => {2}, {2} => {1,3}
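A short sketch of this enumeration: each non-empty proper subset of an itemset can serve as a rule's antecedent, with the remaining items as the consequent (function name illustrative):

```python
from itertools import combinations

def rule_splits(itemset):
    """Yield every rule X => Y with X u Y = itemset and X, Y non-empty."""
    items = frozenset(itemset)
    for r in range(1, len(items)):
        for left in combinations(sorted(items), r):
            X = frozenset(left)
            yield X, items - X

for X, Y in rule_splits({1, 2, 3}):
    print(sorted(X), "=>", sorted(Y))
# [1]=>[2,3]  [2]=>[1,3]  [3]=>[1,2]  [1,2]=>[3]  [1,3]=>[2]  [2,3]=>[1]
```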

Strong Rule Generation

Transactional Database:

T1: 1, 2, 3, 4, 5
T2: 2, 3, 5
T3: 1, 3, 5
T4: 1, 5

1. Generate rules X => Y from each frequent itemset of size ≥ 2: {1,5}, {3,5}
2. Minconf = 80%


Strong Rules Results

All strong rules generated are output:

{1} => {5} (confidence 100%)

{3} => {5} (confidence 100%)

Other Frequent Itemsets

Closed frequent itemset – a frequent itemset X that has no immediate superset with the same support count as X.

Maximal frequent itemset – a frequent itemset none of whose immediate supersets is frequent.
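A compact sketch of both checks, assuming the supports of all frequent itemsets are already known (the dictionary layout and function name are illustrative):

```python
def classify(frequent):
    """frequent: dict mapping frozenset -> support.
    Returns {itemset: (is_closed, is_maximal)} per the definitions above."""
    labels = {}
    for X, sup_x in frequent.items():
        # Immediate supersets of X that are themselves frequent. (A superset
        # with the same support as a frequent X is necessarily frequent too,
        # so checking only frequent supersets suffices for closedness.)
        supers = [Y for Y in frequent if len(Y) == len(X) + 1 and X < Y]
        is_closed = all(frequent[Y] != sup_x for Y in supers)
        is_maximal = not supers
        labels[X] = (is_closed, is_maximal)
    return labels

# Frequent itemsets and supports from the Apriori walk-through above.
freq = {frozenset({1}): 0.75, frozenset({3}): 0.75, frozenset({5}): 1.00,
        frozenset({1, 5}): 0.75, frozenset({3, 5}): 0.75}
for X, (c, m) in classify(freq).items():
    print(sorted(X), "closed:", c, "maximal:", m)
# {5} is closed but not maximal; {1,5} and {3,5} are both.
```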

Itemset Relationships

Maximal Frequent Itemsets ⊆ Closed Frequent Itemsets ⊆ Frequent Itemsets

Targeted Association Mining

* Users may only be interested in specific results

* Potential for smaller, faster, and more focused results

* Examples:

1. A user wants to know how often bread and garlic cloves occur together.

2. A user wants to know which items occur with toilet paper.

Itemset Trees

* Itemset tree – a data structure that helps users query for a specific itemset and its support.

* Items within a transaction are mapped to integer values and ordered so that each transaction is in lexical order.

{Bread, Onion, Garlic} = {1, 2, 3}

* Why use numbers?
- they make the tree more compact
- numbers follow the ordering easily

Itemset Trees

An itemset tree T contains:

* A root pair (I, f(I)), where I is an itemset and f(I) is its count.

* A (possibly empty) set {T1, T2, ..., Tk}, each element of which is an itemset tree.

* If an itemset Ij is contained in the root's itemset I, it is also contained in each of the root's children.

* If Ij is not contained in the root, it might still be contained in the root's children if:

first_item(I) ≤ first_item(Ij) and last_item(I) < last_item(Ij)

Building an Itemset Tree

Let ci be a node in the itemset tree, and let I be a transaction from the dataset.

Loop:

Case 1: ci = I.

Case 2: ci is a child of I
- make I the parent node of ci.

Case 3: ci and I have a common lexical overlap, e.g. {1,2,4} vs. {1,2,6}
- make a node for the overlap
- make I and ci its children.

Case 4: ci is a parent of I
- loop to check ci's children
- make I a child of ci.

Note: {2,6} and {1,2,6} do not have a lexical overlap.
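A simplified sketch of these cases (illustrative Python, not the exact published algorithm): itemsets are kept as sorted tuples, and each node's count covers every transaction inserted at or below it. The dataset is the one used on the next slide.

```python
class Node:
    def __init__(self, itemset, count=0, children=None):
        self.itemset = itemset          # sorted tuple of items
        self.count = count              # transactions at or below this node
        self.children = children or []

def common_prefix(a, b):
    """Lexical overlap of two sorted tuples ((2,6) vs (1,2,6) -> ())."""
    p = []
    for x, y in zip(a, b):
        if x != y:
            break
        p.append(x)
    return tuple(p)

def insert(node, t):
    """Insert transaction t; node.itemset is a prefix of t (or equals it)."""
    node.count += 1
    if node.itemset == t:                    # Case 1: exact match
        return
    for i, child in enumerate(node.children):
        ov = common_prefix(child.itemset, t)
        if len(ov) <= len(node.itemset):
            continue                         # no overlap beyond the parent
        if ov == child.itemset:              # Case 4: child is a parent of t
            insert(child, t)
        elif ov == t:                        # Case 2: t becomes child's parent
            node.children[i] = Node(t, child.count + 1, [child])
        else:                                # Case 3: common lexical overlap
            node.children[i] = Node(ov, child.count + 1, [child, Node(t, 1)])
        return
    node.children.append(Node(t, 1))         # no overlap: plain child node

root = Node(())                              # root holds the empty itemset
for t in [(2, 4), (1, 2, 3, 5), (3, 9), (1, 2, 6), (2,), (2, 9)]:
    insert(root, t)
```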

Itemset Trees - Creation

Dataset:

T1: 2, 4
T2: 1, 2, 3, 5
T3: 3, 9
T4: 1, 2, 6
T5: 2
T6: 2, 9

Inserting the transactions one at a time (tree diagrams omitted):

1. {2,4} – child node.
2. {1,2,3,5} – child node.
3. {3,9} – child node.
4. {1,2,6} – lexical overlap with {1,2,3,5}: a node for the overlap {1,2} is created, with both as its children.
5. {2} – parent node: {2} becomes the parent of {2,4}.
6. {2,9} – child node of {2}.

Itemset Trees – Querying

Let I be an itemset, let ci be a node in the tree, and let totalSup be the total count for I in the tree.

For all ci such that first_item(ci) ≤ first_item(I):

Case 1: If I is contained in ci
- add ci's count to totalSup.

Case 2: If I is not contained in ci and last_item(ci) < last_item(I)
- proceed down the tree.
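Continuing the construction sketch above (same hypothetical Node class and the tree just built), a rough version of this query procedure:

```python
def query(node, q):
    """Total support count of itemset q (a sorted tuple) in the tree."""
    total = 0
    for child in node.children:
        if child.itemset[0] > q[0]:
            continue                      # need first_item(ci) <= first_item(q)
        if set(q) <= set(child.itemset):  # Case 1: whole subtree contains q
            total += child.count
        elif child.itemset[-1] < q[-1]:   # Case 2: q may still occur deeper
            total += query(child, q)
    return total

print(query(root, (2,)))     # 5  (Example 1 below)
print(query(root, (2, 9)))   # 1  (Example 2 below)
```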

Example 1

Itemset Trees - Querying

Querying Example 1: Query: {2}

1. totalSup = 0
2. Node {2}: 2 = 2 → add its count: totalSup = 3
3. Node {1,2}: {1,2} contains 2 → add its count: totalSup = 3 + 2 = 5
4. Node {3,9}: 3 > 2 and end of subtree → return totalSup = 5

Example 2

Itemset Trees - Querying

Querying Example 2: Query: {2,9}

1. totalSup = 0
2. Node {2}: 2 ≤ 2 and 2 < 9 → continue down the tree
3. Node {2,4}: 2 ≤ 2 and 4 < 9, but {2,4} doesn't contain {2,9} → go to the next sibling
4. Node {2,9}: {2,9} = {2,9} → add to support: totalSup = 1
5. Node {1,2}: 1 ≤ 2 and 2 < 9 → continue down the tree
6. Node {1,2,3,5}: 1 ≤ 2 and 5 < 9, but it doesn't contain {2,9} → go to the next sibling
7. Node {1,2,6}: 1 ≤ 2 and 6 < 9, but it doesn't contain {2,9} → go to the next node
8. Node {3,9}: 3 > 2 → fail

End of tree: totalSup = 1. Nodes examined: 8.
