
Fully Automatic Cross-Associations

Deepayan Chakrabarti (CMU), Spiros Papadimitriou (CMU), Dharmendra Modha (IBM), Christos Faloutsos (CMU and IBM)

Problem Definition

[Figure: matrix with the labels “Products” and “Product Groups”]

Simultaneously group customers and products, or, documents and words, or, users and preferences …

Problem Definition

Desiderata:

1. Simultaneously discover row and column groups

2. Fully Automatic: No “magic numbers”

3. Scalable to large graphs

Cross-Associations ≠ Co-clustering!

Information-theoretic co-clustering:

1. Lossy compression.

2. Approximates the original matrix, while trying to minimize KL-divergence.

3. The number of row and column groups must be given by the user.

Cross-Associations:

1. Lossless compression.

2. Always provides complete information about the matrix, for any number of row and column groups.

3. The number of row and column groups is chosen automatically using the MDL principle.

Related Work

K-means and variants: dimensionality curse; choosing the number of clusters

“Frequent itemsets”: user must specify “support”

Information Retrieval: choosing the number of “concepts”

Graph Partitioning: number of partitions; measure of imbalance between clusters

What makes a cross-association “good”?

[Figure: two candidate groupings of the same matrix into row groups and column groups, shown side by side (“versus”)]

Why is this better?

Better Clustering:

1. Similar nodes are grouped together

2. As few groups as necessary

A few, homogeneous blocks, which implies Good Compression

Main Idea

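Good Compression implies Better Clustering

[Figure: Binary Matrix divided into row groups and column groups]

For block $i$, with $n_i^1$ ones and $n_i^0$ zeros:

$p_i^1 = n_i^1 / (n_i^1 + n_i^0)$

Code Cost: $(n_i^1 + n_i^0) \cdot H(p_i^1)$

Description Cost: cost of describing $n_i^1$ and $n_i^0$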
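The per-block code cost, (ones + zeros) times the binary entropy of the block's density of ones, can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `binary_entropy` and `code_cost`, the list-of-lists matrix, and the representation of groups as one integer label per row and per column are all hypothetical, not taken from the slides.

```python
import math

def binary_entropy(p):
    """Shannon entropy in bits of a 0/1 source with P(1) = p; H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def code_cost(matrix, row_groups, col_groups):
    """Sum over blocks of (n1 + n0) * H(n1 / (n1 + n0)), in bits.

    row_groups[r] and col_groups[c] are integer group labels (an assumed
    representation); a block is one (row group, column group) pair.
    """
    ones = {}  # block -> number of ones (n1)
    size = {}  # block -> number of cells (n1 + n0)
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            block = (row_groups[r], col_groups[c])
            ones[block] = ones.get(block, 0) + value
            size[block] = size.get(block, 0) + 1
    return sum(n * binary_entropy(ones[b] / n) for b, n in size.items())

matrix = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# One row group and one column group: a single mixed block.
one_group = code_cost(matrix, [0, 0, 0, 0], [0, 0, 0, 0])
# Two row groups and two column groups: four homogeneous blocks.
two_by_two = code_cost(matrix, [0, 0, 1, 1], [0, 0, 1, 1])
```

With the single block, half of the 16 cells are ones, so every cell costs a full bit. With the 2 x 2 grouping every block is all ones or all zeros, so the code cost drops to zero. The description cost is deliberately left out of this sketch.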

Examples

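One row group, one column group: code cost high, description cost low

m row groups, n column groups: code cost low, description cost high

$\text{Total Encoding Cost} = \sum_i (n_i^1 + n_i^0) \cdot H(p_i^1) + \sum_i (\text{cost of describing } n_i^1 \text{ and } n_i^0)$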
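The trade-off between the two extremes can be checked numerically. The sketch below is an assumption-laden illustration, not the slides' exact cost model: it charges `ceil(log2(size + 1))` bits to describe the number of ones in a block of `size` cells, and it ignores the cost of describing which rows and columns belong to which group. The names `entropy_bits` and `total_costs` are hypothetical.

```python
import math

def entropy_bits(p):
    """Binary Shannon entropy in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def total_costs(blocks):
    """blocks: list of (ones, size) pairs. Returns (code cost, description cost)."""
    code = sum(size * entropy_bits(ones / size) for ones, size in blocks)
    # Assumption: sending a block's count of ones takes ceil(log2(size + 1)) bits.
    description = sum(math.ceil(math.log2(size + 1)) for _, size in blocks)
    return code, description

# An 8 x 8 binary matrix that contains 4 ones.
single_block = [(4, 64)]                      # one row group, one column group
every_cell = [(1, 1)] * 4 + [(0, 1)] * 60     # 8 row groups, 8 column groups

code_one, desc_one = total_costs(single_block)
code_cells, desc_cells = total_costs(every_cell)
```

Under these assumptions the single block has the larger code cost and the smaller description cost, while one block per cell has zero code cost and a description cost of one bit per cell. Neither extreme is cheap overall, which is the point of minimizing the sum.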

What makes a cross-association “good”?

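[Figure: two candidate groupings of the same matrix into row groups and column groups, shown side by side (“versus”)]

Why is this better? Both terms of the total encoding cost are low:

$\text{Total Encoding Cost} = \sum_i (n_i^1 + n_i^0) \cdot H(p_i^1) \;[\text{low}] + \sum_i (\text{cost of describing } n_i^1 \text{ and } n_i^0) \;[\text{low}]$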

Algorithms

[Figure: sequence of cross-associations as the number of groups grows, ending at k = 5 row groups and l = 5 col groups]

k=1, l=2

k=2, l=2

k=2, l=3

k=3, l=3

k=3, l=4

k=4, l=4

k=4, l=5

Algorithms

Start with initial matrix

Find good groups for fixed k and l

Choose better values for k and l

Final cross-associations

Lower the encoding cost

Fixed k and l

[Figure: matrix divided into row groups and column groups]

Swaps:

for each row:

swap it to the row group which minimizes the code cost

Ditto for column swaps

… and repeat …
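One pass of the row swaps can be sketched as follows. Several details are assumptions made for the illustration and are not stated on the slide: block densities are computed once from the current grouping and held fixed during the pass, a row's cost in a candidate group is the number of bits needed to encode its cells with that group's densities, and an empty block gets density 0.5. The function names are hypothetical.

```python
import math

def block_densities(matrix, row_groups, col_groups, k, l):
    """Fraction of ones in each (row group, column group) block."""
    ones = [[0] * l for _ in range(k)]
    size = [[0] * l for _ in range(k)]
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            ones[row_groups[r]][col_groups[c]] += value
            size[row_groups[r]][col_groups[c]] += 1
    # Assumption: an empty block is treated as having density 0.5.
    return [[ones[g][j] / size[g][j] if size[g][j] else 0.5
             for j in range(l)] for g in range(k)]

def bits(count, p):
    """Bits to encode `count` symbols that each have probability p."""
    if count == 0:
        return 0.0
    if p <= 0.0:
        return math.inf  # the group's blocks cannot represent this symbol
    return -count * math.log2(p)

def row_swap_pass(matrix, row_groups, col_groups, k, l):
    """Reassign every row to the row group that encodes it most cheaply."""
    dens = block_densities(matrix, row_groups, col_groups, k, l)
    new_groups = []
    for row in matrix:
        ones = [0] * l   # ones of this row per column group
        total = [0] * l  # cells of this row per column group
        for c, value in enumerate(row):
            ones[col_groups[c]] += value
            total[col_groups[c]] += 1

        def cost(g):
            return sum(bits(ones[j], dens[g][j]) +
                       bits(total[j] - ones[j], 1.0 - dens[g][j])
                       for j in range(l))

        new_groups.append(min(range(k), key=cost))
    return new_groups

matrix = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# The third row starts in the wrong row group.
swapped = row_swap_pass(matrix, [0, 0, 0, 1], [0, 0, 1, 1], k=2, l=2)
```

In this small example a single pass moves the misplaced third row next to the fourth. The same function applied to the transposed matrix, with the roles of the two groupings exchanged, would give the column swaps, and alternating the two passes until no label changes corresponds to the "and repeat" step.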

Choosing k and l

Split:

1. Find the row group R with the maximum entropy per row

2. Choose the rows in R whose removal reduces the entropy per row in R

3. Send these rows to the new row group, and set k=k+1

Similar for column groups too.
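The three split steps above can be sketched as follows. The slide does not define "entropy per row" precisely, so this sketch assumes it means the code cost of the group's blocks divided by the number of rows in the group. It also tests each row's removal independently against the original group, and only increments k when at least one row moves. These choices and the function names are assumptions for illustration.

```python
import math

def entropy_bits(p):
    """Binary Shannon entropy in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def cost_per_row(rows, col_groups, l):
    """Code cost of the blocks formed by `rows`, divided by the number of rows."""
    if not rows:
        return 0.0
    ones = [0] * l
    size = [0] * l
    for row in rows:
        for c, value in enumerate(row):
            ones[col_groups[c]] += value
            size[col_groups[c]] += 1
    cost = sum(size[j] * entropy_bits(ones[j] / size[j])
               for j in range(l) if size[j])
    return cost / len(rows)

def split_row_group(matrix, row_groups, col_groups, k, l):
    """Return (new row labels, new k) after one split attempt."""
    members = {g: [r for r in range(len(matrix)) if row_groups[r] == g]
               for g in range(k)}
    per_row = {g: cost_per_row([matrix[r] for r in members[g]], col_groups, l)
               for g in range(k)}
    # Step 1: the row group R with the maximum entropy per row.
    worst = max(range(k), key=lambda g: per_row[g])
    new_groups = list(row_groups)
    moved = False
    for r in members[worst]:
        # Step 2: would R have a lower entropy per row without row r?
        rest = [matrix[q] for q in members[worst] if q != r]
        if cost_per_row(rest, col_groups, l) < per_row[worst]:
            new_groups[r] = k  # Step 3: send the row to the new group k.
            moved = True
    return new_groups, (k + 1 if moved else k)

matrix = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]
groups, new_k = split_row_group(matrix, [0, 0, 0, 0], [0, 0, 1, 1], k=1, l=2)
```

Here only the last row lowers the entropy per row of the group when removed, so it alone is sent to the new group. A column split would be the same procedure on the transposed matrix.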

Algorithms

[Figure: algorithm overview, annotated “Splits”]

Experiments

“Customer-Product” graph with Zipfian sizes, no noise: k = 5 row groups, l = 5 col groups

Experiments

“Caveman” graph with Zipfian cave sizes, noise=10%: k = 6 row groups, l = 8 col groups

Experiments

“White Noise” graph: k = 2 row groups, l = 3 col groups

Experiments

“CLASSIC” graph of documents & words: k=15, l=19

Experiments

“GRANTS” graph of documents & words: k=41, l=28

[Figure: clustered matrix; one axis is labeled “Words in abstract”]

Experiments

“Who-trusts-whom” graph from epinions.com: k=18, l=16

[Figure: clustered matrix; axis labeled “Epinions.com user”]

Experiments

“Clickstream” graph of users and websites: k=15, l=13

[Figure: clustered matrix; axis labeled “Webpages”]

Experiments

[Figure: plot with the labels “Number of non-zeros” and “Splits”]

Linear in the number of “ones”: scalable

Conclusions

Desiderata:

Simultaneously discover row and column groups

Fully Automatic: No “magic numbers”

Scalable to large graphs

Fixed k and l

[Figure: matrix with row and column groups, annotated “swaps”]

Experiments

“Caveman” graph with Zipfian cave sizes, no noise: k = 5 row groups, l = 5 col groups

Given any binary matrix, a “good” cross-association will have low cost.

But how can we find such a cross-association?

[Figure: cross-association with k = 5 row groups and l = 5 col groups]

Main Idea

Good Compression:

$\text{Total Encoding Cost} = \sum_i \text{size}_i \cdot H(p_i) + \text{Cost of describing cross-associations}$

Minimize the total cost

Main Idea

How well does a cross-association compress the matrix?

Encode the matrix in a lossless fashion

Compute the encoding cost

Low encoding cost implies good compression, which implies good clustering
