A Generalized Maximum Entropy Approach to Bregman Co-clustering

Authors: Arindam Banerjee, Inderjit Dhillon, Joydeep Ghosh, Srujana Merugu, and Dharmendra S. Modha
Source: KDD '04, August 22-25, 2004, ACM, pp. 509-514
Presenter: Allen Wu

112/04/09


Introduction
Bregman divergences
Bregman co-clustering
Algorithm
Experiments
Conclusion


Information-theoretic co-clustering (ITCC) models the co-clustering problem using the joint probability distribution p(X, Y) of the row and column variables.

We seek a co-clustering of both dimensions such that the loss in mutual information is minimized, given a fixed number of row and column clusters:

$$\min_{\hat{X},\hat{Y}} \; I(X;Y) - I(\hat{X};\hat{Y})$$


The loss in mutual information equals

$$I(X;Y) - I(\hat{X};\hat{Y}) = D_{KL}\big(p(x,y) \,\|\, q(x,y)\big)$$

where

$$q(x,y) = p(\hat{x},\hat{y})\,p(x|\hat{x})\,p(y|\hat{y}), \quad \text{where } x \in \hat{x},\ y \in \hat{y}.$$

It can be shown that q(x,y) is a "maximum entropy" approximation to p(x,y).


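Example marginals:

p(x) = (0.15, 0.15, 0.15, 0.15, 0.2, 0.2), p(x̂) = (0.3, 0.3, 0.4)
p(y) = (0.18, 0.18, 0.14, 0.14, 0.18, 0.18), p(ŷ) = (0.5, 0.5)

$$q(x,y) = p(\hat{x},\hat{y})\,p(x|\hat{x})\,p(y|\hat{y}) = p(\hat{x},\hat{y})\,\frac{p(x)}{p(\hat{x})}\,\frac{p(y)}{p(\hat{y})}$$

For example:

$$0.054 = 0.3 \times \frac{0.15}{0.3} \times \frac{0.18}{0.5}$$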
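The identity I(X;Y) - I(X̂;Ŷ) = D_KL(p || q) can be checked numerically. The sketch below is only an illustration: the 6×6 joint distribution `p` and the cluster assignments `row_cluster` and `col_cluster` are assumptions, chosen so that the marginals agree with the example values, and the helper names are hypothetical.

```python
import numpy as np

# Assumed joint distribution p(x, y); picked so its marginals match the example.
p = np.array([
    [.05, .05, .05, 0, 0, 0],
    [.05, .05, .05, 0, 0, 0],
    [0, 0, 0, .05, .05, .05],
    [0, 0, 0, .05, .05, .05],
    [.04, .04, 0, .04, .04, .04],
    [.04, .04, .04, 0, .04, .04],
])
row_cluster = np.array([0, 0, 1, 1, 2, 2])  # assumed mapping x -> x_hat
col_cluster = np.array([0, 0, 0, 1, 1, 1])  # assumed mapping y -> y_hat


def mutual_information(joint):
    """I(X;Y) in nats, skipping zero cells (0 log 0 = 0)."""
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask])))


def kl(p, q):
    """D_KL(p || q) in nats, skipping cells where p is zero."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


R = np.eye(3)[row_cluster]          # 6 x 3 row-cluster indicator matrix
C = np.eye(2)[col_cluster]          # 6 x 2 column-cluster indicator matrix
p_hat = R.T @ p @ C                 # p(x_hat, y_hat)
px, py = p.sum(axis=1), p.sum(axis=0)
px_hat, py_hat = R.T @ px, C.T @ py

# q(x, y) = p(x_hat, y_hat) * p(x)/p(x_hat) * p(y)/p(y_hat)
q = (p_hat[row_cluster][:, col_cluster]
     * (px / px_hat[row_cluster])[:, None]
     * (py / py_hat[col_cluster])[None, :])

loss = mutual_information(p) - mutual_information(p_hat)
print(q[0, 0], loss, kl(p, q))
```

Under these assumptions, `q[0, 0]` reproduces the worked value 0.054, and `loss` agrees with `kl(p, q)` up to floating-point error.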


[Figure: tables of D(p||q) values for the co-clustering example]

112/04/09

8

112/04/09

However, the data matrix may contain negative entries, or a distortion measure other than KL-divergence may be required.

The squared Euclidean distance might be more appropriate.

This paper addresses the general situation by extending ITCC along three directions:
"Nearness" is now measured by any Bregman divergence.
A larger class of constraints can be specified.
The maximum entropy approach is generalized.


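The objective function is

$$\min_{\{\mu_1,\ldots,\mu_k\}} \sum_{h=1}^{k} \sum_{x \in \mathcal{X}_h} \|x - \mu_h\|^2$$

where $\mu_h$ is the representative (mean) of cluster $\mathcal{X}_h$.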
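To make the objective concrete, here is a minimal sketch that evaluates it for a given assignment. It reads the objective as the k-means sum of squared distances to cluster means, which is an assumption about the slide's formula; `kmeans_objective` and the toy points are hypothetical.

```python
import numpy as np


def kmeans_objective(points, labels, k):
    """Sum over clusters of squared distances from each point to its cluster mean."""
    total = 0.0
    for h in range(k):
        cluster = points[labels == h]
        if len(cluster) == 0:
            continue  # an empty cluster contributes nothing
        mu = cluster.mean(axis=0)
        total += float(np.sum((cluster - mu) ** 2))
    return total


# Toy data (assumed): two tight groups far apart along the first axis.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
good = kmeans_objective(points, np.array([0, 0, 1, 1]), k=2)
bad = kmeans_objective(points, np.array([0, 1, 0, 1]), k=2)
print(good, bad)
```

The assignment that groups nearby points gives the smaller objective value, which is what the minimization rewards.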


Let φ be a real-valued strictly convex function defined on the convex set S = dom(φ) ⊆ R, such that φ is differentiable on int(S), the interior of S.

The Bregman divergence d_φ: S × int(S) → [0, ∞) is defined as

$$d_\phi(z_1, z_2) = \phi(z_1) - \phi(z_2) - \langle z_1 - z_2, \nabla\phi(z_2) \rangle$$


I-Divergence: Given z ∈ R+, let φ(z) = z log z. For z1, z2 ∈ R+,

$$d_\phi(z_1, z_2) = z_1 \log(z_1/z_2) - (z_1 - z_2)$$

Squared Euclidean Distance: Given z ∈ R, let φ(z) = z². For z1, z2 ∈ R,

$$d_\phi(z_1, z_2) = (z_1 - z_2)^2$$
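Both special cases follow from the general definition by plugging in φ and its derivative. The scalar sketch below illustrates this; the function names are hypothetical, and the derivatives (log z + 1 and 2z) are supplied by hand rather than computed automatically.

```python
import math


def bregman(phi, grad_phi, z1, z2):
    """Scalar Bregman divergence: phi(z1) - phi(z2) - (z1 - z2) * phi'(z2)."""
    return phi(z1) - phi(z2) - (z1 - z2) * grad_phi(z2)


def i_divergence(z1, z2):
    # phi(z) = z log z, phi'(z) = log z + 1; needs z1, z2 > 0
    return bregman(lambda z: z * math.log(z), lambda z: math.log(z) + 1.0, z1, z2)


def squared_euclidean(z1, z2):
    # phi(z) = z^2, phi'(z) = 2z
    return bregman(lambda z: z * z, lambda z: 2.0 * z, z1, z2)


# Compare against the closed forms z1 log(z1/z2) - (z1 - z2) and (z1 - z2)^2.
print(i_divergence(2.0, 1.0), 2.0 * math.log(2.0) - 1.0)
print(squared_euclidean(3.0, 1.0), (3.0 - 1.0) ** 2)
```

Note that the I-divergence is not symmetric in its arguments, while the squared Euclidean distance is.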


Bregman information is defined as the expected Bregman divergence to the expectation: I_φ(Z) = E[d_φ(Z, E[Z])]

I-Divergence: Given a real non-negative random variable Z, the Bregman information is I_φ(Z) = E[Z log(Z/E[Z])]

Squared Euclidean Distance: Given any real random variable Z, the Bregman information is I_φ(Z) = E[(Z − E[Z])²]
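A small numerical sketch of the definition, for a discrete random variable given by values and probabilities. The sample values and the uniform weights are assumptions, and `bregman_information` is a hypothetical helper.

```python
import numpy as np


def bregman_information(values, weights, divergence):
    """E[d(Z, E[Z])] for a discrete Z with the given values and probabilities."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()        # normalize to a probability measure
    mean = float(np.sum(weights * values))   # E[Z]
    return float(np.sum(weights * divergence(values, mean)))


def sq_euclidean(z1, z2):
    return (z1 - z2) ** 2


def i_div(z1, z2):
    # requires strictly positive arguments
    return z1 * np.log(z1 / z2) - (z1 - z2)


z = [1.0, 2.0, 3.0, 6.0]          # assumed sample values
w = [0.25, 0.25, 0.25, 0.25]      # assumed uniform measure
variance = bregman_information(z, w, sq_euclidean)
i_info = bregman_information(z, w, i_div)
print(variance, i_info)
```

With the squared Euclidean distance the result is the variance of Z. With the I-divergence the (Z − E[Z]) term averages to zero, leaving E[Z log(Z/E[Z])].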


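Let (X, Y) ~ p(X, Y) be jointly distributed random variables, with X and Y taking values in $\{x_u\}, [u]_1^m$ and $\{y_v\}, [v]_1^n$ respectively.

p(X, Y) can be written in the form of the matrix Z:

$$Z = [z_{uv}],\ [u]_1^m,\ [v]_1^n, \quad z_{uv} = p(x_u, y_v)$$

The quality of the co-clustering can be defined as

$$E[d_\phi(Z,\hat{Z})] = \sum_{u=1}^{m}\sum_{v=1}^{n} z_{uv}\, d_\phi(z_{uv}, \hat{z}_{uv})$$

where $\hat{Z}$ is uniquely determined by the co-clustering $(\rho, \gamma)$.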


The co-clustering $(\rho, \gamma)$ involves four random variables, $\{U, V, \hat{U}, \hat{V}\}$, corresponding to the various partitionings of the matrix Z.

We can obtain different matrix approximations based on the statistics of Z corresponding to the non-trivial combinations of $\{U, V, \hat{U}, \hat{V}\}$:

$$\Gamma = \{\{U,\hat{V}\},\{\hat{U},V\},\{\hat{U},\hat{V}\},\{U\},\{V\},\{\hat{U}\},\{\hat{V}\}\}$$


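(Γ) denotes the class of matrix approximation schemes based on $(\rho, \gamma)$.

Constraint sets C:

$$C_1 = \{\{\hat{U}\},\{\hat{V}\}\}, \quad C_2 = \{\{\hat{U},\hat{V}\}\}$$

$$C_3 = \{\{\hat{U},\hat{V}\},\{U\},\{V\}\}, \quad C_4 = \{\{U,\hat{V}\},\{\hat{U},V\}\}$$

The set of approximations $M_A(\rho,\gamma,C)$ consists of all $Z' \in S^{m \times n}$.

The "best" approximation $\hat{Z}$ is

$$\hat{Z} = \arg\min_{Z' \in M_A(\rho,\gamma,C)} E[d_\phi(Z, Z')]$$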
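As an illustration of one such approximation, the sketch below handles only the simplest case: the constraint set containing just the co-cluster pair {Û, V̂}, a uniform measure, and squared Euclidean distance, where the approximation replaces each entry by the mean of its co-cluster block. The matrix, the cluster assignments `rho` and `gamma`, and the function name are assumptions, and the assignments are fixed by hand rather than learned.

```python
import numpy as np


def block_mean_approximation(Z, row_cluster, col_cluster, k, l):
    """Replace each entry of Z by the mean of its (row cluster, column cluster) block.

    Assumes a uniform measure and that every block is non-empty.
    """
    R = np.eye(k)[row_cluster]                  # m x k row-cluster indicators
    C = np.eye(l)[col_cluster]                  # n x l column-cluster indicators
    block_sum = R.T @ Z @ C                     # k x l sums per block
    block_count = R.T @ np.ones_like(Z) @ C     # k x l entry counts per block
    block_mean = block_sum / block_count
    return block_mean[row_cluster][:, col_cluster]


# Assumed toy matrix and hand-picked co-clustering.
Z = np.array([[1.0, 1.2, 5.0],
              [0.8, 1.0, 5.2],
              [3.0, 3.1, 9.0]])
rho = np.array([0, 0, 1])      # row -> row cluster
gamma = np.array([0, 0, 1])    # column -> column cluster

Z_hat = block_mean_approximation(Z, rho, gamma, k=2, l=2)
distortion = float(np.mean((Z - Z_hat) ** 2))             # E[d(Z, Z_hat)], uniform measure
baseline = float(np.mean((Z - Z.mean()) ** 2))            # single global mean, for comparison
print(Z_hat)
print(distortion, baseline)
```

A full co-clustering algorithm would also update `rho` and `gamma` to reduce the distortion; this sketch only evaluates the approximation for a given co-clustering.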


We present brief case studies to demonstrate two salient features:
Dimensionality reduction
Missing value prediction


Clustering interleaved with implicit dimensionality reduction

Superior performance as compared to one-sided clustering


Assign zero measure to missing elements, co-cluster, and use the reconstructed matrix for prediction

Implicit discovery of correlated sub-matrices
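The missing-value recipe can be sketched as follows, under simplifying assumptions: squared Euclidean distance, block-mean reconstruction, and a co-clustering fixed by hand instead of learned. The ratings matrix, the cluster assignments, and `predict_missing` are all hypothetical.

```python
import numpy as np


def predict_missing(Z, observed, row_cluster, col_cluster, k, l):
    """Fill missing cells with the weighted mean of their co-cluster block.

    Missing cells get zero measure, so they do not influence the block means.
    Assumes every block contains at least one observed entry.
    """
    W = observed.astype(float)                      # 1 for observed, 0 for missing
    R = np.eye(k)[row_cluster]
    C = np.eye(l)[col_cluster]
    block_sum = R.T @ (W * np.nan_to_num(Z)) @ C    # NaNs become 0 and carry weight 0
    block_weight = R.T @ W @ C
    block_mean = block_sum / block_weight
    Z_hat = block_mean[row_cluster][:, col_cluster]  # reconstructed matrix
    return np.where(observed, Z, Z_hat)


# Assumed toy ratings with two missing cells (NaN).
ratings = np.array([[5.0, 4.0, 1.0],
                    [np.nan, 5.0, 1.0],
                    [1.0, 2.0, 5.0],
                    [1.0, np.nan, 4.0]])
observed = ~np.isnan(ratings)
filled = predict_missing(ratings, observed,
                         row_cluster=np.array([0, 0, 1, 1]),
                         col_cluster=np.array([0, 0, 1]),
                         k=2, l=2)
print(filled)
```

Each missing cell is predicted from the observed entries that share both its row cluster and its column cluster, which is how correlated sub-matrices drive the prediction.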


Bregman divergences serve as the co-clustering loss function, e.g., I-divergence and squared Euclidean distance.

Approximation models of various complexities are possible depending on the statistics.

The minimum Bregman information principle as a generalization of the maximum entropy principle.
