Maximally Informative k-Itemsets
Motivation
Subgroup Discovery typically produces very many patterns with high levels of redundancy:
Syntactically different patterns represent the same subgroup
Complements of patterns
Combinations of patterns

marital-status = ‘Married-civ-spouse’ ∧ age ≥ 29
marital-status = ‘Married-civ-spouse’ ∧ education-num ≥ 8
marital-status = ‘Married-civ-spouse’ ∧ age ≤ 76
age ≤ 67 ∧ marital-status = ‘Married-civ-spouse’
marital-status = ‘Married-civ-spouse’
age ≥ 33 ∧ marital-status = ‘Married-civ-spouse’
…

Dissimilar patterns
Optimize the dissimilarity of the patterns reported
Each individual pattern reported should add value
Consider the extent (covered records) of the patterns
Treat patterns as binary features/items
Joint entropy of itemset captures informativeness of pattern set
Joint Entropy of an Itemset
Binary features (items) x1, x2, ..., xn
Itemset of size k: X = {x1, …, xk}
Joint entropy:

H(X) = -\sum_{(b_1, \ldots, b_k) \in \{0,1\}^k} p(x_1 = b_1, \ldots, x_k = b_k) \cdot \lg\, p(x_1 = b_1, \ldots, x_k = b_k)
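To make the definition concrete, here is a minimal sketch (not from the slides) that computes the joint entropy of a candidate itemset by counting how often each 0/1 combination occurs; the row-of-dicts data layout and all names are assumptions of this example.

```python
import math
from collections import Counter

def joint_entropy(rows, itemset):
    """Joint entropy (in bits) of a set of binary items.

    rows    : list of dicts mapping item name -> 0 or 1
    itemset : iterable of item names (the candidate itemset X)
    """
    n = len(rows)
    # Count how often each combination of 0/1 values occurs in the data.
    counts = Counter(tuple(row[x] for x in itemset) for row in rows)
    # H(X) = -sum over observed combinations of p * lg p (unseen combinations contribute 0).
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

The later sketches in this section reuse this function.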
Joint Entropy
Each item cuts the database into 2 parts (not necessarily of equal size)
Each additional item cuts each part in 2
Joint entropy is maximal when all parts have equal size
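A quick check of the last claim (this worked line is ours, not from the slides): with k items and all 2^k parts equally likely, each combination has probability 2^{-k}, so

H(X) = -\sum_{b \in \{0,1\}^k} 2^{-k} \lg 2^{-k} = 2^k \cdot 2^{-k} \cdot k = k,

the maximum possible for k binary items.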
Definition: miki
A Maximally Informative k-Itemset (miki) is an itemset of size k that maximizes the joint entropy:
An itemset X ⊆ I of cardinality k is a maximally informative k-itemset iff for all itemsets Y ⊆ I of cardinality k,

H(Y) \le H(X)
Properties of joint entropy and miki’s
Symmetric treatment of 0’s and 1’s: both infrequent and frequent items are discouraged
Optimal at p(xi) = 0.5
Items in a miki are (relatively) independent
These goals are orthogonal to mining associations, which focuses on the value 1, encourages frequent items, and finds items that are dependent
More Properties
At most 1 bit of information per item:

0 \le H(x) \le 1

Monotonicity of joint entropy: suppose X and Y are two itemsets such that X ⊆ Y. Then

H(X) \le H(Y)

Unit growth of joint entropy: suppose X and Y are two itemsets such that X ⊆ Y. Then

H(Y) \le H(X) + |Y \setminus X|
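One way to see the unit-growth property (our derivation, using the chain rule for entropy together with the 1-bit bound per item):

H(Y) = H(X) + H(Y \setminus X \mid X) \le H(X) + \sum_{x \in Y \setminus X} H(x) \le H(X) + |Y \setminus X|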
Properties: Independence Bound
Independence bound on joint entropy: suppose that X = {x1, …, xk} is an itemset. Then

H(X) \le \sum_i H(x_i)

Every item adds at most H(xi)
Items potentially share information, hence ≤; equality holds iff the items are independent
A candidate itemset can be discarded if the bound is not above the current maximum (no need to check the data)
Example 1

A B C D
1 1 1 0
1 1 0 0
1 1 1 0
1 0 0 0
0 1 1 0
0 0 0 1
0 0 1 1
0 0 0 1

H(A) = 1, H(B) = 1, H(C) = 1
H(D) = − ⅜ lg ⅜ − ⅝ lg ⅝ ≈ 0.95
{A, B, C} is a miki: H({A, B, C}) = 2.5 ≤ H(A) + H(B) + H(C) = 3 (independence bound)
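These numbers can be reproduced with the joint_entropy sketch from above (again, the data layout is an assumption of the example, not part of the slides):

```python
rows = [dict(zip("ABCD", r)) for r in [
    (1, 1, 1, 0), (1, 1, 0, 0), (1, 1, 1, 0), (1, 0, 0, 0),
    (0, 1, 1, 0), (0, 0, 0, 1), (0, 0, 1, 1), (0, 0, 0, 1),
]]

print(joint_entropy(rows, "A"))    # 1.0
print(joint_entropy(rows, "D"))    # ~0.95
print(joint_entropy(rows, "ABC"))  # 2.5, the miki of size 3
```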
Partitions of Itemsets
Group items that share information
Obtain a tighter bound
Precompute the joint entropy of small itemsets (e.g. 2- or 3-itemsets)
Joint entropy of a partition: suppose that P = {B1, …, Bm} is a partition of an itemset. The joint entropy of P is defined as

H(P) = \sum_i H(B_i)
Partition Properties
Partitioned bound on joint entropy: suppose that P = {B1, …, Bm} is a partition of an itemset X. Then

H(X) \le H(P)

Independence bound on partitioned joint entropy: suppose that P = {B1, …, Bm} is a partition of an itemset X = {x1, …, xk}. Then

H(P) \le \sum_i H(x_i)
Example 2

A B C D
1 1 1 0
1 1 0 0
1 1 1 0
1 0 0 0
0 1 1 0
0 0 0 1
0 0 1 1
0 0 0 1

B and D are similar
{{B, D}, {C}} is a partition of {B, C, D}
H({B, C, D}) ≈ 2.16
H({{B, D}, {C}}) ≈ 2.41
H(B) + H(C) + H(D) ≈ 2.95
Algorithms
Algorithm 1: exhaustively consider all itemsets of size k, and return the optimal one
Algorithm 2: use the independence bound to skip the table scan if the bound is not above the current optimum
Algorithm 3: use the partitioned bound on joint entropy, with a random partition of k/2 blocks
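A minimal sketch of how Algorithm 2 improves on Algorithm 1, reusing the joint_entropy function above; the function name and structure are assumptions of this example, and only the pruning rule itself comes from the slides (dropping the bound check gives Algorithm 1):

```python
from itertools import combinations

def miki_with_bound(rows, items, k):
    """Exhaustive miki search with independence-bound pruning (Algorithm 2 style)."""
    # H(x_i) for every single item; cheap, and computed only once.
    single = {x: joint_entropy(rows, [x]) for x in items}
    best_set, best_h = None, -1.0
    for cand in combinations(items, k):
        # Independence bound: H(X) <= sum_i H(x_i).
        # If the bound cannot beat the current optimum, skip the expensive table scan.
        if sum(single[x] for x in cand) <= best_h:
            continue
        h = joint_entropy(rows, cand)
        if h > best_h:
            best_set, best_h = cand, h
    return best_set, best_h
```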
Algorithms (continued)
Algorithm 4: consider the prefix X of size k − l of the current itemset. If the upper bound on every extension of X is below the current optimum, skip all extensions of X. l = 3 gives the best results in practice
Algorithm 5: repeatedly add the item that improves the joint entropy the most (forward selection)
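A sketch of the forward-selection heuristic of Algorithm 5 (names again ours); note that it only approximates the miki and is not guaranteed to find the true optimum:

```python
def greedy_miki(rows, items, k):
    """Greedy approximation: add the item that improves joint entropy most (Algorithm 5 style)."""
    selected = []
    for _ in range(k):
        remaining = [x for x in items if x not in selected]
        # Choose the item whose addition yields the highest joint entropy.
        best = max(remaining, key=lambda x: joint_entropy(rows, selected + [x]))
        selected.append(best)
    return selected, joint_entropy(rows, selected)
```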
Example 1: 82 subgroups discovered in a 2-dimensional space; miki of 4 patterns
[Figure: two 2-dimensional scatter plots (x ≈ −4…0, y ≈ −1…9) showing the discovered subgroups and the selected miki of 4 patterns]
Mushroom (119 × 8124): number of table scans

        k = 2    k = 3     k = 4       k = 5       k = 6       k = 7
Alg 1   7,021    273,819   7.94·10^6   1.82·10^8   3.47·10^9   5.6·10^10
Alg 2   12       265       4,917       69,134      1.23·10^6   1.95·10^7
Alg 3   4        83        602         9,747       211,934     4.58·10^6
Alg 4   –        –         602         9,747       209,329     4.4·10^6
Alg 5   237      354       470         585         699         812

Mushroom (119 × 8124): running time

        k = 2    k = 3     k = 4     k = 5     k = 6     k = 7
Alg 1   0:18     16:29     735:23    >1000     >1000     >1000
Alg 2   0        0:03      1:34      34:21     692:42    >1000
Alg 3   0:36     0:38      1:36      23:25     445:58    >1000
Alg 4   –        –         1:37      16:17     244:11    >1000
Alg 5   0        0         0:01      0:03      0:04      0:04
Joint Entropy of miki’s (Mushroom)
[Figure: joint entropy versus number of items k = 1…7; series: number of items (y = x), entropy of the miki, entropy of the greedy approximation]
Joint Entropy of miki’s (LumoLogp)
[Figure: joint entropy versus number of items k = 1…7; series: number of items (y = x), entropy of the miki, entropy of the greedy approximation]