1
AutoPart: Parameter-Free Graph Partitioning and Outlier Detection
Deepayan Chakrabarti ([email protected])
2
Problem Definition
[Figure: people in a social network, grouped into people-groups]

Group people in a social network, species in a food web, or proteins in protein-interaction graphs, …
3
Reminder
[Figure: people × people adjacency matrix]

Graph: N nodes and E directed edges
4
Problem Definition
[Figure: people in a social network, grouped into people-groups]

Goals:
• [#1] Find groups (of people, species, proteins, etc.)
• [#2] Find outlier edges (“bridges”)
• [#3] Compute inter-group “distances” (how similar are two groups of proteins?)
5
Problem Definition
[Figure: people in a social network, grouped into people-groups]

Properties:
• Fully automatic (estimate the number of groups)
• Scalable
• Allow incremental updates
6
Related Work
• Graph partitioning: METIS (Karypis+/1998), spectral partitioning (Ng+/2001). Drawback: need a measure of imbalance between clusters, or the number of partitions.
• Clustering techniques: k-means and variants (Pelleg+/2000, Hamerly+/2003), information-theoretic co-clustering (Dhillon+/2003). Drawback: rows and columns are considered separately, or the method is not fully automatic.
• LSI (Deerwester+/1990). Drawback: choosing the number of “concepts”.
7
Outline
• Problem Definition
• Related Work
• Finding clusters in graphs
• Outliers and inter-group distances
• Experiments
• Conclusions
8
Outline
• Problem Definition
• Related Work
• Finding clusters in graphs
  – What is a good clustering?
  – How can we find such a clustering?
• Outliers and inter-group distances
• Experiments
• Conclusions
9
What is a “good” clustering

[Figure: two candidate groupings of the node-group matrix, side by side. Why is one better?]

Good clustering:
1. Similar nodes are grouped together
2. As few groups as necessary

A good clustering produces a few homogeneous blocks, and a few homogeneous blocks imply good compression.
10
Binary Matrix

[Figure: binary adjacency matrix divided into node-group blocks]

Main idea: good compression implies good clustering.

For each block i, let ni1 and ni0 be its numbers of ones and zeros, and let pi1 = ni1 / (ni1 + ni0). Then

  Total encoding cost = Σi (ni1 + ni0) · H(pi1)   +   Σi [cost of describing ni1, ni0, and the groups]
                          (code cost)                    (description cost)

where H is the binary entropy function.
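As a concrete sketch, the code-cost term for one block can be computed as below (a minimal illustration; the function name and the zero-cost convention for empty or homogeneous blocks are my own, not from the slides):

```python
import math

def block_code_cost(n1: int, n0: int) -> float:
    """Bits needed to encode a binary block with n1 ones and n0 zeros:
    (n1 + n0) * H(p1), where p1 = n1 / (n1 + n0) and H is the binary
    Shannon entropy."""
    n = n1 + n0
    if n == 0 or n1 == 0 or n0 == 0:
        return 0.0  # an empty or perfectly homogeneous block costs 0 code bits
    p1 = n1 / n
    return -n * (p1 * math.log2(p1) + (1 - p1) * math.log2(1 - p1))
```

For example, a half-full block of 10 cells costs 10 bits (since H(0.5) = 1), while a perfectly homogeneous block costs 0 code bits, which is exactly why a few homogeneous blocks compress well.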
11
Examples

• One node group: high code cost, low description cost.
• n node groups (one group per node): low code cost, high description cost.

The total encoding cost (code cost plus description cost) trades off these two extremes.
12
What is a “good” clustering

[Figure: the two candidate groupings again, side by side; the better one has both low code cost and low description cost]

The better grouping achieves a lower total encoding cost: both its code cost Σi (ni1 + ni0) · H(pi1) and its description cost are low.
13
Outline
• Problem Definition
• Related Work
• Finding clusters in graphs
  – What is a good clustering?
  – How can we find such a clustering?
• Outliers and inter-group distances
• Experiments
• Conclusions
14
Algorithms

[Figure: example matrix partitioned into k = 5 node groups]
15
Algorithms

[Flowchart] Start with the initial matrix → find good groups for fixed k → choose better values for k → repeat, lowering the encoding cost each iteration → final grouping.
16
Algorithms

[Flowchart repeated, now highlighting: find good groups for fixed k]
17
[Figure: node-group matrix]

Fixed number of groups k.
Reassign: for each node, reassign it to the group that minimizes the code cost.
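The Reassign step could be sketched as follows. This is my own simplified dense-matrix version (block densities are computed once per pass), not the paper's O(E) sparse implementation:

```python
import math

def cell_bits(x: int, p: float) -> float:
    """Bits to encode one cell of value x under block density p."""
    p = min(max(p, 1e-9), 1 - 1e-9)  # clamp so log2 is defined
    return -math.log2(p if x == 1 else 1.0 - p)

def reassign_once(adj, groups, k):
    """One Reassign pass: move every node to the group whose current
    block densities encode that node's row and column most cheaply."""
    n = len(adj)
    ones = [[0] * k for _ in range(k)]
    size = [[0] * k for _ in range(k)]
    for i in range(n):
        for j in range(n):
            ones[groups[i]][groups[j]] += adj[i][j]
            size[groups[i]][groups[j]] += 1
    dens = [[ones[g][h] / size[g][h] if size[g][h] else 0.0
             for h in range(k)] for g in range(k)]
    new_groups = groups[:]
    for i in range(n):
        def cost(g):
            # bits for node i's outgoing row and incoming column if i joins g
            return sum(cell_bits(adj[i][j], dens[g][groups[j]]) +
                       cell_bits(adj[j][i], dens[groups[j]][g])
                       for j in range(n))
        new_groups[i] = min(range(k), key=cost)
    return new_groups
```

On a graph of two 3-cliques with one node initially misassigned, a single pass moves the misplaced node back to its clique's group.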
18
Algorithms

[Flowchart repeated, now highlighting: choose better values for k]
19
Choosing k
Split:
1. Find the group R with the maximum entropy per node.
2. Choose the nodes in R whose removal reduces the entropy per node in R.
3. Send these nodes to the new group, and set k = k + 1.
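A sketch of the Split step, under my own simplification that each node's row entropy under the current blocks is precomputed. Step 2 uses the fact that removing a node lowers R's per-node entropy exactly when that node's entropy is above R's average:

```python
def split(row_entropy, groups, k):
    """Split step: pick the group R with maximum entropy per node and
    move R's above-average-entropy nodes into a new group numbered k."""
    members = {g: [i for i, gi in enumerate(groups) if gi == g]
               for g in range(k)}
    per_node = {g: sum(row_entropy[i] for i in members[g]) / len(members[g])
                for g in range(k)}
    R = max(per_node, key=per_node.get)       # step 1
    new_groups = groups[:]
    for i in members[R]:
        if row_entropy[i] > per_node[R]:      # step 2: removal lowers the average
            new_groups[i] = k                 # step 3: send to the new group
    return new_groups, k + 1
```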
20
Algorithms

[Flowchart repeated, with the steps labeled: “find good groups for fixed k” is Reassign, “choose better values for k” is Split]
21
Algorithms
Properties:
• Fully automatic: the number of groups is found automatically
• Scalable: O(E) time
• Allows incremental updates: reassign a new node or edge to the group with least cost, and continue…
22
Outline
• Problem Definition
• Related Work
• Finding clusters in graphs
  – What is a good clustering?
  – How can we find such a clustering?
• Outliers and inter-group distances
• Experiments
• Conclusions
23
Outlier Edges
[Figure: adjacency matrix with outlier edges marked in the node-group blocks]

Outliers are deviations from “normality”; they lower the quality of the compression. So: find the edges whose removal maximally reduces the cost.
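One way to realize this is to score each edge by the bits it costs under the current block densities: an edge in a near-empty block is expensive, so removing it saves the most. A rough sketch (my own simplification; it ignores the small change in densities when an edge is removed):

```python
import math

def outlier_edges(adj, groups, dens, top=1):
    """Rank edges by the bits they cost under the block densities
    dens[g][h]; the most expensive edges are the outlier 'bridges'."""
    scored = []
    for i, row in enumerate(adj):
        for j, x in enumerate(row):
            if x:
                p = max(dens[groups[i]][groups[j]], 1e-9)
                scored.append((-math.log2(p), (i, j)))
    scored.sort(reverse=True)
    return [edge for _, edge in scored[:top]]
```

On two cliques joined by a single cross-group edge, that bridge is the top-ranked outlier, since it sits in a block of density 0.25 while every other edge sits in a block of density 1.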
24
Inter-cluster distances
[Figure: node-group matrix with three groups, Grp1, Grp2, Grp3]

Two groups are “close” if merging them does not increase the cost by much:

distance(i, j) = relative increase in cost on merging i and j
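This distance might be sketched as follows, using only the code-cost part of the total cost (a simplification; the full cost in the slides also includes the description cost):

```python
import math

def code_cost(adj, groups):
    """Total code cost: sum over blocks of (n1 + n0) * H(p1)."""
    k = max(groups) + 1
    ones = [[0] * k for _ in range(k)]
    size = [[0] * k for _ in range(k)]
    for i, row in enumerate(adj):
        for j, x in enumerate(row):
            ones[groups[i]][groups[j]] += x
            size[groups[i]][groups[j]] += 1
    total = 0.0
    for g in range(k):
        for h in range(k):
            n, n1 = size[g][h], ones[g][h]
            if n and 0 < n1 < n:
                p = n1 / n
                total -= n * (p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return total

def group_distance(adj, groups, a, b):
    """distance(a, b): relative increase in cost when a and b merge."""
    before = code_cost(adj, groups)
    merged = [min(a, b) if g in (a, b) else g for g in groups]
    labels = sorted(set(merged))
    merged = [labels.index(g) for g in merged]  # keep labels contiguous
    after = code_cost(adj, merged)
    return (after - before) / before if before else float("inf")
```

Merging two dissimilar groups mixes dense and sparse blocks into one high-entropy block, so the relative increase is large; similar groups merge almost for free.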
25
Inter-cluster distances
[Figure: the same three groups drawn as a distance graph over Grp1, Grp2, Grp3, with pairwise distances 5.5, 5.1, and 4.5]

distance(i, j) = relative increase in cost on merging i and j
26
Outline
• Problem Definition
• Related Work
• Finding clusters in graphs
  – What is a good clustering?
  – How can we find such a clustering?
• Outliers and inter-group distances
• Experiments
• Conclusions
27
Experiments
[Figure: “quasi block-diagonal” graph with noise = 10%]
28
Experiments
[Figure: author × author matrix]

DBLP dataset:
• 6,090 authors from SIGMOD, ICDE, VLDB, PODS, and ICDT
• 175,494 “dots”, one “dot” per co-citation
29
Experiments
[Figure: the reordered author × author matrix, with author groups on both axes]

k = 8 author groups found; one group contains Stonebraker, DeWitt, and Carey.
30
Experiments
[Figure: inter-group distances among the author groups, Grp1 through Grp8]
31
Experiments
[Figure: user × user matrix]

Epinions dataset:
• 75,888 users
• 508,960 “dots”, one “dot” per “trust” relationship

k = 19 groups found, including a small dense “core”.
32
Experiments
[Figure: running time in seconds vs. number of “dots”]

Running time is linear in the number of “dots”, i.e., the method is scalable.
33
Conclusions
Goals:
• Find groups
• Find outliers
• Compute inter-group “distances”

Properties:
• Fully automatic
• Scalable
• Allow incremental updates