
Page 1: AutoPart: Parameter-Free Graph Partitioning and Outlier Detection

Deepayan Chakrabarti ([email protected])

Page 2: Problem Definition

[Figure: people-by-people adjacency matrix, before and after grouping the rows and columns into people groups]

Group people in a social network, or species in a food web, or proteins in protein interaction graphs …

Page 3: Reminder

[Figure: people-by-people adjacency matrix]

Graph: N nodes and E directed edges

Page 4: Problem Definition

[Figure: people-by-people adjacency matrix, before and after grouping into people groups]

Goals:

• [#1] Find groups (of people, species, proteins, etc.)

• [#2] Find outlier edges (“bridges”)

• [#3] Compute inter-group “distances” (how similar are two groups of proteins?)

Page 5: Problem Definition

[Figure: people-by-people adjacency matrix, before and after grouping into people groups]

Properties:

• Fully Automatic (estimate the number of groups)

• Scalable

• Allow incremental updates

Page 6: Related Work

Graph Partitioning:
• METIS (Karypis+/1998)
• Spectral partitioning (Ng+/2001)

Clustering Techniques:
• K-means and variants (Pelleg+/2000, Hamerly+/2003)
• Information-theoretic co-clustering (Dhillon+/2003)
• LSI (Deerwester+/1990): choosing the number of "concepts"

These methods need either a measure of imbalance between clusters OR the number of partitions, consider rows and columns separately, OR are not fully automatic.

Page 7: Outline

• Problem Definition
• Related Work
• Finding clusters in graphs
• Outliers and inter-group distances
• Experiments
• Conclusions

Page 8: Outline

• Problem Definition
• Related Work
• Finding clusters in graphs
  – What is a good clustering?
  – How can we find such a clustering?
• Outliers and inter-group distances
• Experiments
• Conclusions

Page 9: What is a "good" clustering?

[Figure: two alternative groupings of the node-by-node adjacency matrix, shown side by side. Why is one better than the other?]

Good Clustering:
1. Similar nodes are grouped together
2. As few groups as necessary

A few, homogeneous blocks

Good Clustering implies Good Compression

Page 10: Main Idea

[Figure: binary adjacency matrix with rows and columns arranged into node groups]

Good Compression implies Good Clustering

Total Encoding Cost  =  Code Cost  +  Description Cost
                     =  Σ_i (n_i^1 + n_i^0) · H(p_i^1)  +  cost of describing the n_i^1, n_i^0, and the group assignments

where block i contains n_i^1 ones and n_i^0 zeros, p_i^1 = n_i^1 / (n_i^1 + n_i^0), and H(·) is the binary entropy function.
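To make the code-cost term concrete, here is a minimal sketch in Python, assuming a dense NumPy 0/1 adjacency matrix A and an integer array labels mapping each node to one of k groups; the helper names binary_entropy and code_cost are mine, and the description-cost term is left out.

```python
import numpy as np

def binary_entropy(p):
    """Entropy of a Bernoulli(p) source, in bits; H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))

def code_cost(A, labels, k):
    """Sum over all k*k blocks of (n_i^1 + n_i^0) * H(p_i^1),
    where p_i^1 is the fraction of ones in block i."""
    cost = 0.0
    for gi in range(k):
        rows = labels == gi
        for gj in range(k):
            cols = labels == gj
            block = A[np.ix_(rows, cols)]
            if block.size:
                cost += block.size * binary_entropy(block.sum() / block.size)
    return cost
```

A full implementation would add the description cost (bits for k, the group sizes, and each block's n_i^1); without it, giving every node its own group would always look best.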

Page 11: Examples

• One node group: low description cost, but high code cost
• n node groups (one per node): low code cost, but high description cost

Total Encoding Cost  =  Code Cost  +  Description Cost
                     =  Σ_i (n_i^1 + n_i^0) · H(p_i^1)  +  cost of describing the n_i^1, n_i^0, and the group assignments

Page 12: What is a "good" clustering?

[Figure: the two alternative groupings from Page 9, side by side]

Why is this better? Its code cost and its description cost are both low, so its total encoding cost is lower.

Total Encoding Cost  =  Code Cost  +  Description Cost
                     =  Σ_i (n_i^1 + n_i^0) · H(p_i^1)  +  cost of describing the n_i^1, n_i^0, and the group assignments

Page 13: Outline

• Problem Definition
• Related Work
• Finding clusters in graphs
  – What is a good clustering?
  – How can we find such a clustering?
• Outliers and inter-group distances
• Experiments
• Conclusions

Page 14: Algorithms

[Figure: example adjacency matrix partitioned into k = 5 node groups]

Page 15: Algorithms

[Flowchart] Start with initial matrix → Find good groups for fixed k ⇄ Choose better values for k → Final grouping
(each step lowers the encoding cost)

Page 16: Algorithms

[Flowchart repeated from Page 15: Start with initial matrix → Find good groups for fixed k ⇄ Choose better values for k → Final grouping, lowering the encoding cost]

Page 17: Fixed number of groups k

[Figure: adjacency matrix arranged into node groups × node groups blocks]

Reassign: for each node, reassign it to the group which minimizes the code cost (sketched below).
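A brute-force sketch of this reassignment pass, reusing the code_cost helper sketched on Page 10; the function name reassign is mine, and the paper's O(E) implementation updates block counts incrementally rather than recomputing the full cost for every candidate group.

```python
def reassign(A, labels, k):
    """One pass: move each node to the group that yields the lowest
    code cost, keeping the number of groups k fixed.
    Brute force for clarity, not the paper's incremental bookkeeping."""
    for node in range(A.shape[0]):
        best_group = labels[node]
        best_cost = code_cost(A, labels, k)
        for g in range(k):
            if g == best_group:
                continue
            labels[node] = g                 # try the node in group g
            c = code_cost(A, labels, k)
            if c < best_cost:
                best_group, best_cost = g, c
        labels[node] = best_group            # keep the cheapest assignment
    return labels
```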

Page 18: Algorithms

[Flowchart repeated from Page 15: Start with initial matrix → Find good groups for fixed k ⇄ Choose better values for k → Final grouping, lowering the encoding cost]

Page 19: Choosing k

Split:
1. Find the group R with the maximum entropy per node.
2. Choose the nodes in R whose removal reduces the entropy per node in R.
3. Send these nodes to the new group, and set k = k+1.
(See the sketch of this split step below.)
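A rough sketch of the split heuristic under the same assumptions, reusing binary_entropy from the Page 10 sketch. Here "entropy per node" is approximated as the code cost of a group's row of blocks divided by the group's size, and the names per_node_cost and split are mine, not the paper's.

```python
import numpy as np

def per_node_cost(A, labels, g):
    """Code cost of group g's row of blocks, divided by the number of
    nodes in g: a proxy for 'entropy per node'."""
    rows = np.where(labels == g)[0]
    if len(rows) == 0:
        return 0.0
    total = 0.0
    for h in np.unique(labels):
        cols = np.where(labels == h)[0]
        block = A[np.ix_(rows, cols)]
        if block.size:
            total += block.size * binary_entropy(block.sum() / block.size)
    return total / len(rows)

def split(A, labels, k):
    """One split step: pick the group R with the highest per-node cost,
    then move every node of R whose departure lowers that per-node cost
    into a brand-new group k.  Returns (new_labels, k + 1)."""
    R = max(range(k), key=lambda g: per_node_cost(A, labels, g))
    baseline = per_node_cost(A, labels, R)
    new_labels = labels.copy()
    for node in np.where(labels == R)[0]:
        trial = labels.copy()
        trial[node] = k                      # tentatively park the node in the new group
        if per_node_cost(A, trial, R) < baseline:
            new_labels[node] = k             # R is more homogeneous without this node
    return new_labels, k + 1
```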

Page 20: Algorithms

[Flowchart from Page 15 with the two inner steps named: Find good groups for fixed k = Reassign, Choose better values for k = Splits]

A sketch of the resulting outer loop follows below.
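Putting reassignment and splitting together, a sketch of the outer loop. Here total_cost is a hypothetical helper (not shown) that adds a description-cost term to code_cost, since the code cost alone always favors more groups; autopart_sketch is my name for the loop, not the paper's.

```python
import numpy as np

def autopart_sketch(A):
    """Alternate split and reassign, accepting a larger k only while
    the total encoding cost keeps dropping."""
    labels = np.zeros(A.shape[0], dtype=int)   # start with one group
    k = 1
    best_labels, best_cost = labels, total_cost(A, labels, k)
    while True:
        labels, k = split(A, best_labels.copy(), k)
        labels = reassign(A, labels, k)
        cost = total_cost(A, labels, k)
        if cost >= best_cost:                  # no improvement: keep the previous grouping
            return best_labels, k - 1
        best_labels, best_cost = labels, cost
```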

Page 21: Algorithms

Properties:

• Fully automatic: the number of groups is found automatically
• Scalable: O(E) time
• Allows incremental updates: reassign a new node or edge to the group with the least cost, and continue…

Page 22: Outline

• Problem Definition
• Related Work
• Finding clusters in graphs
  – What is a good clustering?
  – How can we find such a clustering?
• Outliers and inter-group distances
• Experiments
• Conclusions

Page 23: Outlier Edges

[Figure: node-by-node adjacency matrix and its node-groups view, with the outlier edges marked]

Outliers are deviations from "normality", so they lower the quality of the compression.
Find the edges whose removal maximally reduces the cost (a sketch follows below).
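One way to rank edges by this criterion, sketched with the same dense-matrix setup and the binary_entropy helper from the Page 10 sketch; edge_removal_savings is my name, not the paper's.

```python
import numpy as np

def edge_removal_savings(A, labels):
    """For every edge (u, v), the drop in code cost if that edge were
    removed from its block; large savings flag outlier / 'bridge' edges."""
    savings = []
    for u, v in zip(*np.nonzero(A)):
        rows = labels == labels[u]
        cols = labels == labels[v]
        block = A[np.ix_(rows, cols)]
        n, n1 = block.size, int(block.sum())
        before = n * binary_entropy(n1 / n)
        after = n * binary_entropy((n1 - 1) / n)
        savings.append(((int(u), int(v)), before - after))
    # Most surprising edges (e.g., lone edges in near-empty blocks) come first.
    return sorted(savings, key=lambda t: t[1], reverse=True)
```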

Page 24: Inter-cluster distances

[Figure: node-by-node adjacency matrix and its node-groups view, with three groups Grp1, Grp2, Grp3 marked]

Two groups are "close" if merging them does not increase the cost by much:
distance(i, j) = relative increase in cost on merging groups i and j
(A sketch follows below.)
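A sketch of this distance using only the code-cost part from the Page 10 sketch; the paper's definition uses the full encoding cost, and group_distance is my name.

```python
def group_distance(A, labels, i, j, k):
    """Relative increase in code cost when groups i and j are merged;
    small values mean the two groups are 'close'."""
    base = code_cost(A, labels, k)
    merged = labels.copy()
    merged[merged == j] = i          # fold group j into group i
    return (code_cost(A, merged, k) - base) / base
```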

Page 25: Inter-cluster distances

[Figure: the node-groups matrix from Page 24, plus the resulting distance graph over Grp1, Grp2, Grp3 with pairwise distances of about 5.5, 5.1, and 4.5]

Two groups are "close" if merging them does not increase the cost by much:
distance(i, j) = relative increase in cost on merging groups i and j

Page 26: Outline

• Problem Definition
• Related Work
• Finding clusters in graphs
  – What is a good clustering?
  – How can we find such a clustering?
• Outliers and inter-group distances
• Experiments
• Conclusions

Page 27: Experiments

“Quasi block-diagonal” graph with noise=10%

Page 28: Experiments

[Figure: author-by-author co-citation matrix]

DBLP dataset:
• 6,090 authors in:
  – SIGMOD
  – ICDE
  – VLDB
  – PODS
  – ICDT
• 175,494 "dots", one "dot" per co-citation

Page 29: Experiments

[Figure: author-by-author matrix rearranged into author groups × author groups]

k = 8 author groups found (one of them contains Stonebraker, DeWitt, and Carey).

Page 30: Experiments

[Figure: inter-group distances among the author groups Grp1 through Grp8]

Page 31: Experiments

[Figure: user-by-user trust matrix arranged into user groups]

Epinions dataset:
• 75,888 users
• 508,960 "dots", one "dot" per "trust" relationship

k = 19 groups found, including a small, dense "core".

Page 32: Experiments

[Plot: running time (in seconds) versus the number of "dots"]

The running time is linear in the number of "dots", so the method is scalable.

Page 33: Conclusions

Goals:
• Find groups
• Find outliers
• Compute inter-group "distances"

Properties:
• Fully automatic
• Scalable
• Allows incremental updates