
1

AutoPart: Parameter-Free Graph Partitioning and Outlier Detection

Deepayan Chakrabarti (deepay@cs.cmu.edu)

2

Problem Definition

[Figure: a People × People adjacency matrix, rearranged into People × People-Groups blocks]

Group people in a social network, or species in a food web, or proteins in protein-interaction graphs …

3

Reminder

[Figure: the People × People adjacency matrix]

Graph: N nodes and E directed edges
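For concreteness, a minimal sketch of this representation in Python (NumPy assumed; the edge list is a made-up toy example): the graph is stored as an N × N binary adjacency matrix whose 1s are the "dots" in the matrix pictures on the later slides.

```python
import numpy as np

# Toy example (assumed): N = 5 nodes, E directed edges as (source, target) pairs.
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 3)]
N = 5

# Binary adjacency matrix: A[u, v] = 1 iff there is a directed edge u -> v.
A = np.zeros((N, N), dtype=np.uint8)
for u, v in edges:
    A[u, v] = 1

print(A)
```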

4

Problem Definition

[Figure: the People × People adjacency matrix, rearranged into People × People-Groups blocks]

Goals:

• [#1] Find groups (of people, species, proteins, etc.)

• [#2] Find outlier edges (“bridges”)

• [#3] Compute inter-group “distances” (how similar are two groups of proteins?)

5

Problem Definition

[Figure: the People × People adjacency matrix, rearranged into People × People-Groups blocks]

Properties:

• Fully Automatic (estimate the number of groups)

• Scalable

• Allow incremental updates

6

Related Work

• Graph partitioning: METIS (Karypis+/1998), spectral partitioning (Ng+/2001). Drawback: need a measure of imbalance between clusters, or the number of partitions.

• Clustering techniques: k-means and variants (Pelleg+/2000, Hamerly+/2003), information-theoretic co-clustering (Dhillon+/2003). Drawback: rows and columns are considered separately, or not fully automatic.

• LSI (Deerwester+/1990). Drawback: the number of "concepts" must be chosen.

7

Outline

• Problem Definition
• Related Work
• Finding clusters in graphs
• Outliers and inter-group distances
• Experiments
• Conclusions

8

Outline

• Problem Definition
• Related Work
• Finding clusters in graphs
  • What is a good clustering?
  • How can we find such a clustering?
• Outliers and inter-group distances
• Experiments
• Conclusions

9

What is a “good” clustering

[Figure: two different groupings of the same Node Groups × Node Groups matrix, shown side by side ("versus"); why is one better than the other?]

A good clustering:
1. Similar nodes are grouped together
2. As few groups as necessary

This yields a few, homogeneous blocks, which implies Good Compression.

10

Binary Matrix

[Figure: the binary adjacency matrix rearranged into Node Groups × Node Groups blocks]

Main Idea: Good Compression implies Good Clustering.

Total Encoding Cost:

$$\underbrace{\sum_i \big(n_i^1 + n_i^0\big)\, H\!\big(p_i^1\big)}_{\text{Code Cost}} \;+\; \underbrace{\text{cost of describing } n_i^1,\ n_i^0 \text{ and the groups}}_{\text{Description Cost}}, \qquad p_i^1 = \frac{n_i^1}{n_i^1 + n_i^0}$$

where, for each block i, n_i^1 and n_i^0 are its numbers of 1s and 0s, and H is the binary entropy function.
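A minimal sketch of this two-part cost in Python (NumPy assumed). The Code Cost follows the formula above; the Description Cost here is a deliberately simplified stand-in (group labels plus per-block counts), not the exact MDL terms of the paper.

```python
import numpy as np

def binary_entropy(p):
    """H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * np.log2(p) - (1.0 - p) * np.log2(1.0 - p)

def encoding_cost(A, groups, k):
    """Total encoding cost of grouping the binary adjacency matrix A.

    groups : length-N array with values in {0, ..., k-1}.
    Code cost        = sum over blocks of (n1 + n0) * H(p1).
    Description cost = simplified: N*log2(k) bits for the group labels,
                       plus log2(n1 + n0 + 1) bits per block for its count.
    """
    code_cost = 0.0
    description_cost = A.shape[0] * np.log2(k) if k > 1 else 0.0
    for gi in range(k):
        rows = np.flatnonzero(groups == gi)
        for gj in range(k):
            cols = np.flatnonzero(groups == gj)
            block = A[np.ix_(rows, cols)]
            if block.size == 0:
                continue
            p1 = block.sum() / block.size
            code_cost += block.size * binary_entropy(p1)
            description_cost += np.log2(block.size + 1)
    return code_cost + description_cost
```

As k grows, the Code Cost falls while the Description Cost grows, which is exactly the trade-off illustrated next.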

11

Examples

• One node group: Code Cost is high, Description Cost is low.
• n node groups (one node per group): Code Cost is low, Description Cost is high.

The Total Encoding Cost (Code Cost + Description Cost, as above) trades the two off.
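A small worked example of the trade-off, using an assumed toy matrix: a 6 × 6 adjacency matrix made of two pure 3 × 3 diagonal blocks of 1s and two pure 3 × 3 off-diagonal blocks of 0s.

```latex
\begin{align*}
\text{One node group:} \quad & p^1 = \tfrac{18}{36} = 0.5,
  & \text{Code Cost} &= 36\,H(0.5) = 36 \text{ bits (high)},\\
\text{Six node groups (one per node):} \quad & p_i^1 \in \{0,1\} \text{ for all 36 blocks},
  & \text{Code Cost} &= \textstyle\sum_i (n_i^1 + n_i^0)\,H(p_i^1) = 0 \text{ bits (low)},\\
\text{Two node groups (matching the blocks):} \quad & p_i^1 \in \{0,1\} \text{ for all 4 blocks},
  & \text{Code Cost} &= 0 \text{ bits (low)}.
\end{align*}
```

The Description Cost moves the other way: one group needs almost none, six groups need many block counts, two groups need only a little more than one. The two-group clustering minimizes the total.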

12

What is a “good” clustering

[Figure: the same two groupings side by side, now annotated with their costs]

Why is this one better? For the good clustering, both the Code Cost and the Description Cost are low, so its Total Encoding Cost is the smaller of the two.

13

Outline

• Problem Definition
• Related Work
• Finding clusters in graphs
  • What is a good clustering?
  • How can we find such a clustering?
• Outliers and inter-group distances
• Experiments
• Conclusions

14

Algorithms

[Figure: example matrix partitioned into k = 5 node groups]

15

Algorithms

1. Start with the initial matrix
2. Find good groups for a fixed k
3. Choose better values for k, and repeat step 2
4. Final grouping

Each step lowers the total encoding cost.

16

Algorithms

(Flowchart repeated: start with the initial matrix; find good groups for a fixed k; choose better values for k; final grouping. Each step lowers the encoding cost.)

17

[Figure: the Node Groups × Node Groups matrix]

Fixed number of groups k.

Reassign: for each node, reassign it to the group which minimizes the code cost (a sketch follows below).
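A brute-force sketch of the Reassign step in Python (NumPy assumed). For simplicity it recomputes the full encoding_cost() from the earlier sketch for every candidate move, rather than the cheaper per-node code-cost computation the slide refers to.

```python
import numpy as np

def reassign(A, groups, k, encoding_cost):
    """Reassign step (sketch): with k fixed, move each node to the group that
    yields the lowest total encoding cost; repeat passes until no node moves.
    `encoding_cost` is the cost function from the earlier sketch."""
    changed = True
    while changed:
        changed = False
        for node in range(A.shape[0]):
            original = groups[node]
            best_group, best_cost = original, None
            for g in range(k):
                groups[node] = g
                cost = encoding_cost(A, groups, k)
                if best_cost is None or cost < best_cost:
                    best_group, best_cost = g, cost
            groups[node] = best_group
            if best_group != original:
                changed = True
    return groups
```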

18

Algorithms

(Flowchart repeated: start with the initial matrix; find good groups for a fixed k; choose better values for k; final grouping. Each step lowers the encoding cost.)

19

Choosing k

Split:
1. Find the group R with the maximum entropy per node.
2. Choose the nodes in R whose removal reduces the entropy per node in R.
3. Send these nodes to the new group, and set k = k + 1.

A sketch follows below.
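A sketch of the Split step in Python (NumPy assumed). The "entropy per node" here is approximated by the average code cost of a group's rows under the current block densities, and "nodes whose removal reduces it" is approximated by the rows costing more than the group average; the paper's exact criterion differs.

```python
import numpy as np

def split(A, groups, k):
    """Split step (sketch): create group k from the costliest rows of the group
    R with the highest code cost per node, and return (groups, new k)."""
    cols_by_group = [np.flatnonzero(groups == gj) for gj in range(k)]

    def row_costs(gi):
        """Per-row code cost (bits) of group gi's rows, block by block."""
        rows = np.flatnonzero(groups == gi)
        costs = np.zeros(len(rows))
        for cols in cols_by_group:
            if rows.size == 0 or cols.size == 0:
                continue
            block = A[np.ix_(rows, cols)]
            p = block.mean()
            if p <= 0.0 or p >= 1.0:
                continue                      # pure block: costs nothing
            ones = block.sum(axis=1)
            zeros = cols.size - ones
            costs += -ones * np.log2(p) - zeros * np.log2(1.0 - p)
        return costs, rows

    # 1. group R with the maximum entropy (code cost) per node
    per_node = [row_costs(gi)[0].mean() if np.any(groups == gi) else 0.0
                for gi in range(k)]
    R = int(np.argmax(per_node))

    # 2./3. send R's above-average rows to a new group and bump k
    costs, rows = row_costs(R)
    movers = rows[costs > costs.mean()]
    if 0 < movers.size < rows.size:
        groups[movers] = k
        return groups, k + 1
    return groups, k                          # no useful split found
```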

20

Algorithms

1. Start with the initial matrix
2. Find good groups for a fixed k (Reassign)
3. Choose better values for k (Splits), and repeat step 2
4. Final grouping

Each Reassign and each Split lowers the total encoding cost; the overall search loop is sketched below.
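A minimal driver for the loop in the flowchart, tying together the encoding_cost(), reassign() and split() sketches from the previous slides (these names are assumptions from those sketches, not the paper's exact algorithm): alternate Splits and Reassigns while the total encoding cost keeps falling.

```python
import numpy as np

def autopart_sketch(A, max_k=None):
    """Alternate Split (outer loop) and Reassign (inner loop), keeping a new k
    only if it lowers the total encoding cost."""
    groups = np.zeros(A.shape[0], dtype=int)      # start with a single group
    k = 1
    best_cost = encoding_cost(A, groups, k)
    while max_k is None or k < max_k:
        cand, cand_k = split(A, groups.copy(), k)
        if cand_k == k:
            break                                 # Split found nothing to move
        cand = reassign(A, cand, cand_k, encoding_cost)
        cand_cost = encoding_cost(A, cand, cand_k)
        if cand_cost >= best_cost:
            break                                 # cost stopped decreasing
        groups, k, best_cost = cand, cand_k, cand_cost
    return groups, k
```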

21

Algorithms

Properties:

• Fully automatic: the number of groups is found automatically

• Scalable: O(E) time

• Allows incremental updates: reassign the new node or edge to the group with the least cost, and continue (a sketch follows below)
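A sketch of the incremental-update idea in Python (NumPy and the encoding_cost() sketch assumed): when a new node arrives with its row and column of edges, try each existing group and keep the cheapest assignment.

```python
import numpy as np

def add_node(A, groups, k, new_out_row, new_in_col):
    """Grow A by one node and assign it to the group with the least cost."""
    N = A.shape[0]
    A2 = np.zeros((N + 1, N + 1), dtype=A.dtype)
    A2[:N, :N] = A
    A2[N, :N] = new_out_row                   # edges from the new node
    A2[:N, N] = new_in_col                    # edges to the new node
    groups2 = np.append(groups, 0)
    best_g, best_cost = 0, None
    for g in range(k):
        groups2[N] = g
        cost = encoding_cost(A2, groups2, k)
        if best_cost is None or cost < best_cost:
            best_g, best_cost = g, cost
    groups2[N] = best_g
    return A2, groups2
```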

22

Outline

• Problem Definition
• Related Work
• Finding clusters in graphs
  • What is a good clustering?
  • How can we find such a clustering?
• Outliers and inter-group distances
• Experiments
• Conclusions

23

Outlier Edges

[Figure: the Nodes × Nodes matrix and its Node Groups × Node Groups version, with the outlier edges marked]

Outliers are deviations from "normality", and they lower the quality of the compression. So: find the edges whose removal maximally reduces the cost (a sketch follows below).
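A brute-force sketch of this in Python (NumPy and the encoding_cost() sketch assumed): score every edge by the drop in total encoding cost when it is removed, and report the largest drops.

```python
import numpy as np

def outlier_edges(A, groups, k, top=10):
    """Rank edges by how much removing them reduces the encoding cost."""
    base = encoding_cost(A, groups, k)
    scored = []
    for u, v in zip(*np.nonzero(A)):
        A[u, v] = 0
        scored.append((base - encoding_cost(A, groups, k), int(u), int(v)))
        A[u, v] = 1                           # restore the edge
    scored.sort(reverse=True)                 # biggest cost reduction first
    return scored[:top]
```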

24

Inter-cluster distances

[Figure: the Nodes × Nodes matrix and its Node Groups × Node Groups version, with groups Grp1, Grp2 and Grp3 marked]

Two groups are "close" if merging them does not increase the cost by much:

distance(i, j) = relative increase in cost on merging i and j (a sketch follows below).
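A sketch of this distance in Python (NumPy and the encoding_cost() sketch assumed): merge groups i and j, recompute the cost, and report the relative increase. It is symmetric by construction, and small when the two groups have similar connectivity patterns.

```python
import numpy as np

def group_distance(A, groups, k, i, j):
    """distance(i, j) = relative increase in encoding cost when i and j merge."""
    base = encoding_cost(A, groups, k)
    merged = groups.copy()
    merged[merged == j] = i                   # pour group j into group i
    merged[merged > j] -= 1                   # keep group labels contiguous
    return (encoding_cost(A, merged, k - 1) - base) / base
```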

25

Inter-cluster distances

[Figure: the Node Groups × Node Groups matrix for Grp1, Grp2 and Grp3, and the resulting distance graph with pairwise distances 5.5, 5.1 and 4.5]

Two groups are "close" if merging them does not increase the cost by much:

distance(i, j) = relative increase in cost on merging i and j.

26

Outline

• Problem Definition
• Related Work
• Finding clusters in graphs
  • What is a good clustering?
  • How can we find such a clustering?
• Outliers and inter-group distances
• Experiments
• Conclusions

27

Experiments

“Quasi block-diagonal” graph with noise=10%

28

Experiments

[Figure: the Authors × Authors co-citation matrix]

DBLP dataset:

• 6,090 authors in SIGMOD, ICDE, VLDB, PODS and ICDT

• 175,494 "dots", one "dot" per co-citation

29

Experiments

[Figure: the Authors × Authors matrix rearranged into Author Groups × Author Groups]

k = 8 author groups found; one of them contains Stonebraker, DeWitt and Carey.

30

Experiments

[Figure: inter-group distances among the author groups Grp1 through Grp8]

31

Experiments

[Figure: the Users × User Groups matrix]

Epinions dataset:

• 75,888 users

• 508,960 "dots", one "dot" per "trust" relationship

k = 19 groups found, including a small, dense "core".

32

Experiments

[Plot: running time (in seconds) versus the number of "dots"]

The running time is linear in the number of "dots", so the method is scalable.

33

Conclusions

Goals:

Find groups

Find outliers

Compute inter-group “distances”

Properties:

Fully Automatic

Scalable

Allow incremental updates
