Maximally Informative k-Itemsets

Upload: godwin-logan · Posted 08-Jan-2018


TRANSCRIPT

Page 1

Maximally Informative k-Itemsets

Page 2

Motivation
Subgroup Discovery typically produces very many patterns with high levels of redundancy: grammatically different patterns represent the same subgroup, as do complements and combinations of patterns.

marital-status = ‘Married-civ-spouse’ ∧ age ≥ 29
marital-status = ‘Married-civ-spouse’ ∧ education-num ≥ 8
marital-status = ‘Married-civ-spouse’ ∧ age ≤ 76
age ≤ 67 ∧ marital-status = ‘Married-civ-spouse’
marital-status = ‘Married-civ-spouse’
age ≥ 33 ∧ marital-status = ‘Married-civ-spouse’
…

Page 3

Dissimilar Patterns
Optimize the dissimilarity of the patterns reported
Report the additional value of individual patterns
Consider the extent of patterns
Treat patterns as binary features/items
The joint entropy of an itemset captures the informativeness of a pattern set

Page 4

Joint Entropy of an Itemset
Binary features (items): x1, x2, ..., xn
Itemset of size k: X = {x1, …, xk}
Joint entropy:

H(X) = − Σ_{(b1,…,bk) ∈ {0,1}^k}  p(x1 = b1, …, xk = bk) lg p(x1 = b1, …, xk = bk)
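As a quick illustration, this joint entropy can be computed directly from the empirical distribution of value combinations in the database. A minimal Python sketch (the 8-row database is the one from Example 1 later in these slides):

```python
from collections import Counter
from math import log2

def joint_entropy(rows, items):
    """Joint entropy H(X) of the binary columns `items` over `rows`."""
    n = len(rows)
    counts = Counter(tuple(row[i] for i in items) for row in rows)
    # H(X) = -sum over observed value combinations b of p(b) lg p(b)
    return -sum(c / n * log2(c / n) for c in counts.values())

# The 8-row database from Example 1 (columns A, B, C, D)
db = [
    (1, 1, 1, 0), (1, 1, 0, 0), (1, 1, 1, 0), (1, 0, 0, 0),
    (0, 1, 1, 0), (0, 0, 0, 1), (0, 0, 1, 1), (0, 0, 0, 1),
]
print(joint_entropy(db, [0]))        # H(A) = 1.0
print(joint_entropy(db, [0, 1, 2]))  # H({A, B, C}) = 2.5
```

Only the value combinations actually present in the data contribute, so the sum never ranges over more than |database| terms even for large k.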

Page 5

Joint Entropy
Each item cuts the database into 2 parts (not necessarily of equal size)
Each additional item cuts each part in 2
Joint entropy is maximal when all parts have equal size

Page 6

Definition: miki
A Maximally Informative k-Itemset (miki) is an itemset of size k that maximizes the joint entropy: an itemset X ⊆ I of cardinality k is a maximally informative k-itemset iff for all itemsets Y ⊆ I of cardinality k,

H(Y) ≤ H(X)

Page 7

Properties of Joint Entropy and miki's
Symmetric treatment of 0's and 1's
Both infrequent and frequent items are discouraged; the optimum is at p(xi) = 0.5
Items in a miki are (relatively) independent

These goals are orthogonal to mining associations, which focuses on the value 1, encourages frequent items, and finds items that are dependent.

Page 8

More Properties
At most 1 bit of information per item: 0 ≤ H(x) ≤ 1

Monotonicity of joint entropy: suppose X and Y are two itemsets such that X ⊆ Y. Then

H(X) ≤ H(Y)

Unit growth of joint entropy: suppose X and Y are two itemsets such that X ⊆ Y. Then

H(Y) ≤ H(X) + |Y \ X|
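Both properties are easy to check empirically. A small sketch, assuming the Example 1 database from later in the slides, that verifies H(X) ≤ H(Y) ≤ H(X) + |Y \ X| for every nested pair of itemsets:

```python
from collections import Counter
from itertools import combinations
from math import log2

def joint_entropy(rows, items):
    n = len(rows)
    counts = Counter(tuple(row[i] for i in items) for row in rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

# The 8-row database from Example 1 (columns A, B, C, D)
db = [
    (1, 1, 1, 0), (1, 1, 0, 0), (1, 1, 1, 0), (1, 0, 0, 0),
    (0, 1, 1, 0), (0, 0, 0, 1), (0, 0, 1, 1), (0, 0, 0, 1),
]

# Monotonicity and unit growth: for X subset of Y,
#   H(X) <= H(Y) <= H(X) + |Y \ X|
for size_y in range(1, 5):
    for y in combinations(range(4), size_y):
        for size_x in range(1, size_y):
            for x in combinations(y, size_x):
                h_x = joint_entropy(db, list(x))
                h_y = joint_entropy(db, list(y))
                assert h_x <= h_y + 1e-9
                assert h_y <= h_x + (size_y - size_x) + 1e-9
print("monotonicity and unit growth hold")
```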

Page 9

Properties: Independence Bound
Independence bound on joint entropy: suppose that X = {x1, …, xk} is an itemset. Then

H(X) ≤ Σi H(xi)

Every item adds at most H(xi); items potentially share information, hence ≤, with equality iff the items are independent. A candidate itemset can be discarded if its bound is not above the current maximum (no need to check the data).
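A sketch of how the bound is used for pruning (the helper name `prune_candidates` is illustrative, not from the slides): single-item entropies are precomputed once, and a candidate is scanned only when its bound beats the current best.

```python
from collections import Counter
from math import log2

def joint_entropy(rows, items):
    n = len(rows)
    counts = Counter(tuple(row[i] for i in items) for row in rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def prune_candidates(rows, candidates, current_best):
    """Keep only candidates whose independence bound exceeds current_best."""
    h_single = [joint_entropy(rows, [i]) for i in range(len(rows[0]))]
    survivors = []
    for cand in candidates:
        bound = sum(h_single[i] for i in cand)  # H(X) <= sum_i H(x_i)
        if bound > current_best:
            survivors.append(cand)  # bound inconclusive: scan the data
    return survivors

# The 8-row database from Example 1 (columns A, B, C, D)
db = [
    (1, 1, 1, 0), (1, 1, 0, 0), (1, 1, 1, 0), (1, 0, 0, 0),
    (0, 1, 1, 0), (0, 0, 0, 1), (0, 0, 1, 1), (0, 0, 0, 1),
]
# With a current best of 2.5 bits, {A, D} (bound ~1.95) is discarded
# without a scan, while {A, B, D} (bound ~2.95) must still be scanned.
print(prune_candidates(db, [[0, 3], [0, 1, 3]], 2.5))  # [[0, 1, 3]]
```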

Page 10

Example 1
H(A) = 1, H(B) = 1, H(C) = 1
H(D) = −⅜ lg ⅜ − ⅝ lg ⅝ ≈ 0.96
{A, B, C} is a miki: H({A, B, C}) = 2.5 ≤ 3

A B C D
1 1 1 0
1 1 0 0
1 1 1 0
1 0 0 0
0 1 1 0
0 0 0 1
0 0 1 1
0 0 0 1

Page 11

Partitions of Itemsets
Group items that share information
Obtain a tighter bound
Precompute the joint entropy of small itemsets (e.g. 2- or 3-itemsets)

Joint entropy of a partition: suppose that P = {B1, …, Bm} is a partition of an itemset. The joint entropy of P is defined as

H(P) = Σi H(Bi)

Page 12

Partition Properties
Partitioned bound on joint entropy: suppose that P = {B1, …, Bm} is a partition of an itemset X. Then

H(X) ≤ H(P)

Independence bound on partitioned joint entropy: suppose that P = {B1, …, Bm} is a partition of an itemset X = {x1, …, xk}. Then

H(P) ≤ Σi H(xi)

Page 13

Example 2
B and D are similar
{{B, D}, {C}} is a partition of {B, C, D}
H({B, C, D}) ≈ 2.16
H({{B, D}, {C}}) ≈ 2.41
H(B) + H(C) + H(D) ≈ 2.96

A B C D
1 1 1 0
1 1 0 0
1 1 1 0
1 0 0 0
0 1 1 0
0 0 0 1
0 0 1 1
0 0 0 1
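The numbers in Example 2 can be reproduced with a short sketch; the partitioned bound H(P) sits between the true joint entropy and the per-item independence bound (the function names here are illustrative):

```python
from collections import Counter
from math import log2

def joint_entropy(rows, items):
    n = len(rows)
    counts = Counter(tuple(row[i] for i in items) for row in rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def partition_entropy(rows, blocks):
    """H(P) = sum over blocks B_i of H(B_i)."""
    return sum(joint_entropy(rows, block) for block in blocks)

# The 8-row database from Examples 1 and 2 (columns A, B, C, D)
db = [
    (1, 1, 1, 0), (1, 1, 0, 0), (1, 1, 1, 0), (1, 0, 0, 0),
    (0, 1, 1, 0), (0, 0, 0, 1), (0, 0, 1, 1), (0, 0, 0, 1),
]
# X = {B, C, D} (columns 1, 2, 3), partition P = {{B, D}, {C}}
h_x = joint_entropy(db, [1, 2, 3])                      # ~2.16
h_p = partition_entropy(db, [[1, 3], [2]])              # ~2.41
h_ind = sum(joint_entropy(db, [i]) for i in (1, 2, 3))  # ~2.95
assert h_x <= h_p <= h_ind
```

Grouping the similar items B and D into one block is what tightens the bound from 2.95 to 2.41.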

Page 14

Algorithms
Algorithm 1: exhaustively consider all itemsets of size k, and return the optimal one
Algorithm 2: use the independence bound to skip the table scan when the bound is not above the current optimum
Algorithm 3: use the partitioned bound on joint entropy, with a random partition into k/2 blocks
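A sketch of Algorithm 2 (exhaustive search with independence-bound pruning), assuming the small Example 1 database; the function name `miki_bounded` is illustrative:

```python
from collections import Counter
from itertools import combinations
from math import log2

def joint_entropy(rows, items):
    n = len(rows)
    counts = Counter(tuple(row[i] for i in items) for row in rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def miki_bounded(rows, k):
    """Best k-itemset by joint entropy, skipping candidates whose
    independence bound cannot beat the current optimum."""
    n_items = len(rows[0])
    h_single = [joint_entropy(rows, [i]) for i in range(n_items)]
    best, best_h = None, -1.0
    for cand in combinations(range(n_items), k):
        if sum(h_single[i] for i in cand) <= best_h:
            continue  # bound too low: no table scan needed
        h = joint_entropy(rows, list(cand))
        if h > best_h:
            best, best_h = cand, h
    return best, best_h

# The 8-row database from Example 1 (columns A, B, C, D)
db = [
    (1, 1, 1, 0), (1, 1, 0, 0), (1, 1, 1, 0), (1, 0, 0, 0),
    (0, 1, 1, 0), (0, 0, 0, 1), (0, 0, 1, 1), (0, 0, 0, 1),
]
print(miki_bounded(db, 3))  # ((0, 1, 2), 2.5) — {A, B, C} is the miki
```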

Page 15

Algorithms (continued)
Algorithm 4: consider a prefix X of size k − l of the current itemset; if the upper bound on every extension of X is below the current optimum, skip all extensions of X (l = 3 gives the best results in practice)
Algorithm 5: repeatedly add the item that improves the joint entropy the most (forward selection)
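A sketch of Algorithm 5 (greedy forward selection). On the small Example 1 database it happens to recover the exact miki, though in general it is only an approximation; the function name `miki_greedy` is illustrative:

```python
from collections import Counter
from math import log2

def joint_entropy(rows, items):
    n = len(rows)
    counts = Counter(tuple(row[i] for i in items) for row in rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def miki_greedy(rows, k):
    """Forward selection: repeatedly add the item that raises the
    joint entropy of the selected set the most."""
    n_items = len(rows[0])
    selected = []
    for _ in range(k):
        candidates = [i for i in range(n_items) if i not in selected]
        best_item = max(candidates,
                        key=lambda i: joint_entropy(rows, selected + [i]))
        selected.append(best_item)
    return selected, joint_entropy(rows, selected)

# The 8-row database from Example 1 (columns A, B, C, D)
db = [
    (1, 1, 1, 0), (1, 1, 0, 0), (1, 1, 1, 0), (1, 0, 0, 0),
    (0, 1, 1, 0), (0, 0, 0, 1), (0, 0, 1, 1), (0, 0, 0, 1),
]
selected, h = miki_greedy(db, 3)
print(sorted(selected), h)  # [0, 1, 2] 2.5
```

Greedy selection needs only on the order of n·k entropy evaluations, which is why Algorithm 5 stays fast in the Mushroom experiments while the exact algorithms blow up with k.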

Page 16

Example: 182 subgroups discovered in 2-dimensional space; miki of 4 patterns

[Figure: two scatter plots of the discovered subgroups in 2-dimensional space (x from −4 to 0, y from −1 to 9)]

Page 17

Mushroom (119 × 8124): number of table scans

k       2       3        4         5         6         7
Alg 1   7,021   273,819  7.94·10⁶  1.82·10⁸  3.47·10⁹  5.6·10¹⁰
Alg 2   12      265      4,917     69,134    1.23·10⁶  1.95·10⁷
Alg 3   4       83       602       9,747     211,934   4.58·10⁶
Alg 4   —       —        602       9,747     209,329   4.4·10⁶
Alg 5   237     354      470       585       699       812

Page 18: MAXIMALLY INFORMATIVE K-ITEMSETS. Motivation  Subgroup Discovery typically produces very many patterns with high levels of redundancy  Grammatically

2 3 4 5 6 7

Alg 1 0:18 16:29 735:23 >1000 >1000 >1000

Alg 2 0 0:03 1:34 34:21 692:42 >1000

Alg 3 0:36 0:38 1:36 23:25 445:58 >1000

Alg 4 1:37 16:17 244:11 >1000

Alg 5 0 0 0:01 0:03 0:04 0:04

running time

Mushroom (119 x 8124)

Page 19

Joint Entropy of miki's (Mushroom)

[Figure: joint entropy (0–8) vs. number of items k = 1–7; series: number of items (y = x), entropy of miki, entropy of greedy approximation]

Page 20

Joint Entropy of miki's (LumoLogp)

[Figure: joint entropy (0–8) vs. number of items k = 1–7; series: number of items (y = x), entropy of miki, entropy of greedy approximation]