Maximally Informative k-Itemsets

Upload: godwin-logan · Posted 08-Jan-2018


TRANSCRIPT

Page 1

Maximally Informative k-Itemsets

Page 2

Motivation
Subgroup Discovery typically produces very many patterns with high levels of redundancy: grammatically different patterns represent the same subgroup, as do complements and combinations of patterns.

marital-status = ‘Married-civ-spouse’ ∧ age ≥ 29
marital-status = ‘Married-civ-spouse’ ∧ education-num ≥ 8
marital-status = ‘Married-civ-spouse’ ∧ age ≤ 76
age ≤ 67 ∧ marital-status = ‘Married-civ-spouse’
marital-status = ‘Married-civ-spouse’
age ≥ 33 ∧ marital-status = ‘Married-civ-spouse’
…

Page 3

Dissimilar Patterns
Optimize the dissimilarity of the patterns reported
Report the additional value of individual patterns
Consider the extent of patterns
Treat patterns as binary features/items
The joint entropy of an itemset captures the informativeness of a pattern set

Page 4

Joint Entropy of an Itemset
Binary features (items): x1, x2, ..., xn
Itemset of size k: X = {x1, …, xk}
Joint entropy:

H(X) = − Σ_{(b1,…,bk) ∈ {0,1}^k}  p(x1 = b1, …, xk = bk) lg p(x1 = b1, …, xk = bk)
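As a quick illustration, this joint entropy can be computed directly from the empirical distribution of value combinations in the database. A minimal Python sketch (the 8-row database is the one from Example 1 later in these slides):

```python
from collections import Counter
from math import log2

def joint_entropy(rows, items):
    """Joint entropy H(X) of the binary columns `items` over `rows`."""
    n = len(rows)
    counts = Counter(tuple(row[i] for i in items) for row in rows)
    # H(X) = -sum over observed value combinations b of p(b) lg p(b)
    return -sum(c / n * log2(c / n) for c in counts.values())

# The 8-row database from Example 1 (columns A, B, C, D)
db = [
    (1, 1, 1, 0), (1, 1, 0, 0), (1, 1, 1, 0), (1, 0, 0, 0),
    (0, 1, 1, 0), (0, 0, 0, 1), (0, 0, 1, 1), (0, 0, 0, 1),
]
print(joint_entropy(db, [0]))        # H(A) = 1.0
print(joint_entropy(db, [0, 1, 2]))  # H({A, B, C}) = 2.5
```

Only the value combinations actually present in the data contribute, so the sum never ranges over more than |database| terms even for large k.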

Page 5

Joint Entropy
Each item cuts the database into 2 parts (not necessarily of equal size)
Each additional item cuts each part in 2
Joint entropy is maximal when all parts have equal size

Page 6

Definition: miki
A Maximally Informative k-Itemset (miki) is an itemset of size k that maximizes the joint entropy: an itemset X ⊆ I of cardinality k is a maximally informative k-itemset iff for all itemsets Y ⊆ I of cardinality k,

H(Y) ≤ H(X)

Page 7

Properties of Joint Entropy and miki's
Symmetric treatment of 0's and 1's
Both infrequent and frequent items are discouraged; the optimum is at p(xi) = 0.5
Items in a miki are (relatively) independent

These goals are orthogonal to mining associations, which focuses on the value 1, encourages frequent items, and finds items that are dependent.

Page 8

More Properties
At most 1 bit of information per item: 0 ≤ H(x) ≤ 1

Monotonicity of joint entropy: suppose X and Y are two itemsets such that X ⊆ Y. Then

H(X) ≤ H(Y)

Unit growth of joint entropy: suppose X and Y are two itemsets such that X ⊆ Y. Then

H(Y) ≤ H(X) + |Y \ X|
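Both properties are easy to check empirically. A small sketch, assuming the Example 1 database from later in the slides, that verifies H(X) ≤ H(Y) ≤ H(X) + |Y \ X| for every nested pair of itemsets:

```python
from collections import Counter
from itertools import combinations
from math import log2

def joint_entropy(rows, items):
    n = len(rows)
    counts = Counter(tuple(row[i] for i in items) for row in rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

# The 8-row database from Example 1 (columns A, B, C, D)
db = [
    (1, 1, 1, 0), (1, 1, 0, 0), (1, 1, 1, 0), (1, 0, 0, 0),
    (0, 1, 1, 0), (0, 0, 0, 1), (0, 0, 1, 1), (0, 0, 0, 1),
]

# Monotonicity and unit growth: for X subset of Y,
#   H(X) <= H(Y) <= H(X) + |Y \ X|
for size_y in range(1, 5):
    for y in combinations(range(4), size_y):
        for size_x in range(1, size_y):
            for x in combinations(y, size_x):
                h_x = joint_entropy(db, list(x))
                h_y = joint_entropy(db, list(y))
                assert h_x <= h_y + 1e-9
                assert h_y <= h_x + (size_y - size_x) + 1e-9
print("monotonicity and unit growth hold")
```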

Page 9

Properties: Independence Bound
Independence bound on joint entropy: suppose that X = {x1, …, xk} is an itemset. Then

H(X) ≤ Σi H(xi)

Every item adds at most H(xi); items potentially share information, hence ≤, with equality iff the items are independent. A candidate itemset can be discarded if its bound is not above the current maximum (no need to check the data).
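A sketch of how the bound is used for pruning (the helper name `prune_candidates` is illustrative, not from the slides): single-item entropies are precomputed once, and a candidate is scanned only when its bound beats the current best.

```python
from collections import Counter
from math import log2

def joint_entropy(rows, items):
    n = len(rows)
    counts = Counter(tuple(row[i] for i in items) for row in rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def prune_candidates(rows, candidates, current_best):
    """Keep only candidates whose independence bound exceeds current_best."""
    h_single = [joint_entropy(rows, [i]) for i in range(len(rows[0]))]
    survivors = []
    for cand in candidates:
        bound = sum(h_single[i] for i in cand)  # H(X) <= sum_i H(x_i)
        if bound > current_best:
            survivors.append(cand)  # bound inconclusive: scan the data
    return survivors

# The 8-row database from Example 1 (columns A, B, C, D)
db = [
    (1, 1, 1, 0), (1, 1, 0, 0), (1, 1, 1, 0), (1, 0, 0, 0),
    (0, 1, 1, 0), (0, 0, 0, 1), (0, 0, 1, 1), (0, 0, 0, 1),
]
# With a current best of 2.5 bits, {A, D} (bound ~1.95) is discarded
# without a scan, while {A, B, D} (bound ~2.95) must still be scanned.
print(prune_candidates(db, [[0, 3], [0, 1, 3]], 2.5))  # [[0, 1, 3]]
```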

Page 10

Example 1
H(A) = 1, H(B) = 1, H(C) = 1
H(D) = −⅜ lg ⅜ − ⅝ lg ⅝ ≈ 0.96
{A, B, C} is a miki: H({A, B, C}) = 2.5 ≤ 3

A B C D
1 1 1 0
1 1 0 0
1 1 1 0
1 0 0 0
0 1 1 0
0 0 0 1
0 0 1 1
0 0 0 1

Page 11

Partitions of Itemsets
Group items that share information
Obtain a tighter bound
Precompute the joint entropy of small itemsets (e.g. 2- or 3-itemsets)

Joint entropy of a partition: suppose that P = {B1, …, Bm} is a partition of an itemset. The joint entropy of P is defined as

H(P) = Σi H(Bi)

Page 12

Partition Properties
Partitioned bound on joint entropy: suppose that P = {B1, …, Bm} is a partition of an itemset X. Then

H(X) ≤ H(P)

Independence bound on partitioned joint entropy: suppose that P = {B1, …, Bm} is a partition of an itemset X = {x1, …, xk}. Then

H(P) ≤ Σi H(xi)

Page 13

Example 2
B and D are similar
{{B, D}, {C}} is a partition of {B, C, D}
H({B, C, D}) ≈ 2.16
H({{B, D}, {C}}) ≈ 2.41
H(B) + H(C) + H(D) ≈ 2.96

A B C D
1 1 1 0
1 1 0 0
1 1 1 0
1 0 0 0
0 1 1 0
0 0 0 1
0 0 1 1
0 0 0 1
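The numbers in Example 2 can be reproduced with a short sketch; the partitioned bound H(P) sits between the true joint entropy and the per-item independence bound (the function names here are illustrative):

```python
from collections import Counter
from math import log2

def joint_entropy(rows, items):
    n = len(rows)
    counts = Counter(tuple(row[i] for i in items) for row in rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def partition_entropy(rows, blocks):
    """H(P) = sum over blocks B_i of H(B_i)."""
    return sum(joint_entropy(rows, block) for block in blocks)

# The 8-row database from Examples 1 and 2 (columns A, B, C, D)
db = [
    (1, 1, 1, 0), (1, 1, 0, 0), (1, 1, 1, 0), (1, 0, 0, 0),
    (0, 1, 1, 0), (0, 0, 0, 1), (0, 0, 1, 1), (0, 0, 0, 1),
]
# X = {B, C, D} (columns 1, 2, 3), partition P = {{B, D}, {C}}
h_x = joint_entropy(db, [1, 2, 3])                      # ~2.16
h_p = partition_entropy(db, [[1, 3], [2]])              # ~2.41
h_ind = sum(joint_entropy(db, [i]) for i in (1, 2, 3))  # ~2.95
assert h_x <= h_p <= h_ind
```

Grouping the similar items B and D into one block is what tightens the bound from 2.95 to 2.41.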

Page 14

Algorithms
Algorithm 1: exhaustively consider all itemsets of size k, and return the optimal one
Algorithm 2: use the independence bound to skip the table scan when the bound is not above the current optimum
Algorithm 3: use the partitioned bound on joint entropy, with a random partition into k/2 blocks
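A sketch of Algorithm 2 (exhaustive search with independence-bound pruning), assuming the small Example 1 database; the function name `miki_bounded` is illustrative:

```python
from collections import Counter
from itertools import combinations
from math import log2

def joint_entropy(rows, items):
    n = len(rows)
    counts = Counter(tuple(row[i] for i in items) for row in rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def miki_bounded(rows, k):
    """Best k-itemset by joint entropy, skipping candidates whose
    independence bound cannot beat the current optimum."""
    n_items = len(rows[0])
    h_single = [joint_entropy(rows, [i]) for i in range(n_items)]
    best, best_h = None, -1.0
    for cand in combinations(range(n_items), k):
        if sum(h_single[i] for i in cand) <= best_h:
            continue  # bound too low: no table scan needed
        h = joint_entropy(rows, list(cand))
        if h > best_h:
            best, best_h = cand, h
    return best, best_h

# The 8-row database from Example 1 (columns A, B, C, D)
db = [
    (1, 1, 1, 0), (1, 1, 0, 0), (1, 1, 1, 0), (1, 0, 0, 0),
    (0, 1, 1, 0), (0, 0, 0, 1), (0, 0, 1, 1), (0, 0, 0, 1),
]
print(miki_bounded(db, 3))  # ((0, 1, 2), 2.5) — {A, B, C} is the miki
```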

Page 15

Algorithms (continued)
Algorithm 4: consider a prefix X of size k − l of the current itemset; if the upper bound on every extension of X is below the current optimum, skip all extensions of X (l = 3 gives the best results in practice)
Algorithm 5: repeatedly add the item that improves the joint entropy the most (forward selection)
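A sketch of Algorithm 5 (greedy forward selection). On the small Example 1 database it happens to recover the exact miki, though in general it is only an approximation; the function name `miki_greedy` is illustrative:

```python
from collections import Counter
from math import log2

def joint_entropy(rows, items):
    n = len(rows)
    counts = Counter(tuple(row[i] for i in items) for row in rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def miki_greedy(rows, k):
    """Forward selection: repeatedly add the item that raises the
    joint entropy of the selected set the most."""
    n_items = len(rows[0])
    selected = []
    for _ in range(k):
        candidates = [i for i in range(n_items) if i not in selected]
        best_item = max(candidates,
                        key=lambda i: joint_entropy(rows, selected + [i]))
        selected.append(best_item)
    return selected, joint_entropy(rows, selected)

# The 8-row database from Example 1 (columns A, B, C, D)
db = [
    (1, 1, 1, 0), (1, 1, 0, 0), (1, 1, 1, 0), (1, 0, 0, 0),
    (0, 1, 1, 0), (0, 0, 0, 1), (0, 0, 1, 1), (0, 0, 0, 1),
]
selected, h = miki_greedy(db, 3)
print(sorted(selected), h)  # [0, 1, 2] 2.5
```

Greedy selection needs only on the order of n·k entropy evaluations, which is why Algorithm 5 stays fast in the Mushroom experiments while the exact algorithms blow up with k.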

Page 16

Example: 182 subgroups discovered in 2-dimensional space; miki of 4 patterns

[Figure: two scatter plots of the discovered subgroups in 2-dimensional space (x from −4 to 0, y from −1 to 9)]

Page 17

Mushroom (119 × 8124): number of table scans

k       2       3        4         5         6         7
Alg 1   7,021   273,819  7.94·10⁶  1.82·10⁸  3.47·10⁹  5.6·10¹⁰
Alg 2   12      265      4,917     69,134    1.23·10⁶  1.95·10⁷
Alg 3   4       83       602       9,747     211,934   4.58·10⁶
Alg 4   —       —        602       9,747     209,329   4.4·10⁶
Alg 5   237     354      470       585       699       812

Page 18: MAXIMALLY INFORMATIVE K-ITEMSETS. Motivation  Subgroup Discovery typically produces very many patterns with high levels of redundancy  Grammatically

2 3 4 5 6 7

Alg 1 0:18 16:29 735:23 >1000 >1000 >1000

Alg 2 0 0:03 1:34 34:21 692:42 >1000

Alg 3 0:36 0:38 1:36 23:25 445:58 >1000

Alg 4 1:37 16:17 244:11 >1000

Alg 5 0 0 0:01 0:03 0:04 0:04

running time

Mushroom (119 x 8124)

Page 19

Joint Entropy of miki's (Mushroom)

[Figure: joint entropy (0–8) vs. number of items k = 1–7; series: number of items (y = x), entropy of miki, entropy of greedy approximation]

Page 20

Joint Entropy of miki's (LumoLogp)

[Figure: joint entropy (0–8) vs. number of items k = 1–7; series: number of items (y = x), entropy of miki, entropy of greedy approximation]