
1

AutoPart: Parameter-Free Graph Partitioning and Outlier Detection

Deepayan Chakrabarti (deepay@cs.cmu.edu)

2

Problem Definition

[Figure: a People × People adjacency matrix, rearranged into People × People-Groups blocks]

Group people in a social network, or species in a food web, or proteins in protein-interaction graphs …

3

Reminder

[Figure: the People × People adjacency matrix]

Graph: N nodes and E directed edges
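For concreteness, a minimal sketch of this representation in Python (NumPy assumed; the edge list is a made-up toy example): the graph is stored as an N × N binary adjacency matrix whose 1s are the "dots" in the matrix pictures on the later slides.

```python
import numpy as np

# Toy example (assumed): N = 5 nodes, E directed edges as (source, target) pairs.
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 3)]
N = 5

# Binary adjacency matrix: A[u, v] = 1 iff there is a directed edge u -> v.
A = np.zeros((N, N), dtype=np.uint8)
for u, v in edges:
    A[u, v] = 1

print(A)
```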

4

Problem Definition

[Figure: the People × People adjacency matrix, rearranged into People × People-Groups blocks]

Goals:

• [#1] Find groups (of people, species, proteins, etc.)

• [#2] Find outlier edges (“bridges”)

• [#3] Compute inter-group “distances” (how similar are two groups of proteins?)

5

Problem Definition

[Figure: the People × People adjacency matrix, rearranged into People × People-Groups blocks]

Properties:

• Fully Automatic (estimate the number of groups)

• Scalable

• Allow incremental updates

6

Related Work

• Graph partitioning: METIS (Karypis+/1998), spectral partitioning (Ng+/2001). Drawback: need a measure of imbalance between clusters, or the number of partitions.

• Clustering techniques: k-means and variants (Pelleg+/2000, Hamerly+/2003), information-theoretic co-clustering (Dhillon+/2003). Drawback: rows and columns are considered separately, or not fully automatic.

• LSI (Deerwester+/1990). Drawback: the number of "concepts" must be chosen.

7

Outline

• Problem Definition
• Related Work
• Finding clusters in graphs
• Outliers and inter-group distances
• Experiments
• Conclusions

8

Outline

• Problem Definition
• Related Work
• Finding clusters in graphs
  • What is a good clustering?
  • How can we find such a clustering?
• Outliers and inter-group distances
• Experiments
• Conclusions

9

What is a “good” clustering

[Figure: two different groupings of the same Node Groups × Node Groups matrix, shown side by side ("versus"); why is one better than the other?]

A good clustering:
1. Similar nodes are grouped together
2. As few groups as necessary

This yields a few, homogeneous blocks, which implies Good Compression.

10

Binary Matrix

[Figure: the binary adjacency matrix rearranged into Node Groups × Node Groups blocks]

Main Idea: Good Compression implies Good Clustering.

Total Encoding Cost:

$$\underbrace{\sum_i \big(n_i^1 + n_i^0\big)\, H\!\big(p_i^1\big)}_{\text{Code Cost}} \;+\; \underbrace{\text{cost of describing } n_i^1,\ n_i^0 \text{ and the groups}}_{\text{Description Cost}}, \qquad p_i^1 = \frac{n_i^1}{n_i^1 + n_i^0}$$

where, for each block i, n_i^1 and n_i^0 are its numbers of 1s and 0s, and H is the binary entropy function.
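A minimal sketch of this two-part cost in Python (NumPy assumed). The Code Cost follows the formula above; the Description Cost here is a deliberately simplified stand-in (group labels plus per-block counts), not the exact MDL terms of the paper.

```python
import numpy as np

def binary_entropy(p):
    """H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * np.log2(p) - (1.0 - p) * np.log2(1.0 - p)

def encoding_cost(A, groups, k):
    """Total encoding cost of grouping the binary adjacency matrix A.

    groups : length-N array with values in {0, ..., k-1}.
    Code cost        = sum over blocks of (n1 + n0) * H(p1).
    Description cost = simplified: N*log2(k) bits for the group labels,
                       plus log2(n1 + n0 + 1) bits per block for its count.
    """
    code_cost = 0.0
    description_cost = A.shape[0] * np.log2(k) if k > 1 else 0.0
    for gi in range(k):
        rows = np.flatnonzero(groups == gi)
        for gj in range(k):
            cols = np.flatnonzero(groups == gj)
            block = A[np.ix_(rows, cols)]
            if block.size == 0:
                continue
            p1 = block.sum() / block.size
            code_cost += block.size * binary_entropy(p1)
            description_cost += np.log2(block.size + 1)
    return code_cost + description_cost
```

As k grows, the Code Cost falls while the Description Cost grows, which is exactly the trade-off illustrated next.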

11

Examples

• One node group: Code Cost is high, Description Cost is low.
• n node groups (one node per group): Code Cost is low, Description Cost is high.

The Total Encoding Cost (Code Cost + Description Cost, as above) trades the two off.
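A small worked example of the trade-off, using an assumed toy matrix: a 6 × 6 adjacency matrix made of two pure 3 × 3 diagonal blocks of 1s and two pure 3 × 3 off-diagonal blocks of 0s.

```latex
\begin{align*}
\text{One node group:} \quad & p^1 = \tfrac{18}{36} = 0.5,
  & \text{Code Cost} &= 36\,H(0.5) = 36 \text{ bits (high)},\\
\text{Six node groups (one per node):} \quad & p_i^1 \in \{0,1\} \text{ for all 36 blocks},
  & \text{Code Cost} &= \textstyle\sum_i (n_i^1 + n_i^0)\,H(p_i^1) = 0 \text{ bits (low)},\\
\text{Two node groups (matching the blocks):} \quad & p_i^1 \in \{0,1\} \text{ for all 4 blocks},
  & \text{Code Cost} &= 0 \text{ bits (low)}.
\end{align*}
```

The Description Cost moves the other way: one group needs almost none, six groups need many block counts, two groups need only a little more than one. The two-group clustering minimizes the total.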

12

What is a “good” clustering

[Figure: the same two groupings side by side, now annotated with their costs]

Why is this one better? For the good clustering, both the Code Cost and the Description Cost are low, so its Total Encoding Cost is the smaller of the two.

13

Outline

• Problem Definition
• Related Work
• Finding clusters in graphs
  • What is a good clustering?
  • How can we find such a clustering?
• Outliers and inter-group distances
• Experiments
• Conclusions

14

Algorithms

[Figure: example matrix partitioned into k = 5 node groups]

15

Algorithms

1. Start with the initial matrix
2. Find good groups for a fixed k
3. Choose better values for k, and repeat step 2
4. Final grouping

Each step lowers the total encoding cost.

16

Algorithms

(Flowchart repeated: start with the initial matrix; find good groups for a fixed k; choose better values for k; final grouping. Each step lowers the encoding cost.)

17

[Figure: the Node Groups × Node Groups matrix]

Fixed number of groups k.

Reassign: for each node, reassign it to the group which minimizes the code cost (a sketch follows below).
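A brute-force sketch of the Reassign step in Python (NumPy assumed). For simplicity it recomputes the full encoding_cost() from the earlier sketch for every candidate move, rather than the cheaper per-node code-cost computation the slide refers to.

```python
import numpy as np

def reassign(A, groups, k, encoding_cost):
    """Reassign step (sketch): with k fixed, move each node to the group that
    yields the lowest total encoding cost; repeat passes until no node moves.
    `encoding_cost` is the cost function from the earlier sketch."""
    changed = True
    while changed:
        changed = False
        for node in range(A.shape[0]):
            original = groups[node]
            best_group, best_cost = original, None
            for g in range(k):
                groups[node] = g
                cost = encoding_cost(A, groups, k)
                if best_cost is None or cost < best_cost:
                    best_group, best_cost = g, cost
            groups[node] = best_group
            if best_group != original:
                changed = True
    return groups
```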

18

Algorithms

(Flowchart repeated: start with the initial matrix; find good groups for a fixed k; choose better values for k; final grouping. Each step lowers the encoding cost.)

19

Choosing k

Split:
1. Find the group R with the maximum entropy per node.
2. Choose the nodes in R whose removal reduces the entropy per node in R.
3. Send these nodes to the new group, and set k = k + 1.

A sketch follows below.
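A sketch of the Split step in Python (NumPy assumed). The "entropy per node" here is approximated by the average code cost of a group's rows under the current block densities, and "nodes whose removal reduces it" is approximated by the rows costing more than the group average; the paper's exact criterion differs.

```python
import numpy as np

def split(A, groups, k):
    """Split step (sketch): create group k from the costliest rows of the group
    R with the highest code cost per node, and return (groups, new k)."""
    cols_by_group = [np.flatnonzero(groups == gj) for gj in range(k)]

    def row_costs(gi):
        """Per-row code cost (bits) of group gi's rows, block by block."""
        rows = np.flatnonzero(groups == gi)
        costs = np.zeros(len(rows))
        for cols in cols_by_group:
            if rows.size == 0 or cols.size == 0:
                continue
            block = A[np.ix_(rows, cols)]
            p = block.mean()
            if p <= 0.0 or p >= 1.0:
                continue                      # pure block: costs nothing
            ones = block.sum(axis=1)
            zeros = cols.size - ones
            costs += -ones * np.log2(p) - zeros * np.log2(1.0 - p)
        return costs, rows

    # 1. group R with the maximum entropy (code cost) per node
    per_node = [row_costs(gi)[0].mean() if np.any(groups == gi) else 0.0
                for gi in range(k)]
    R = int(np.argmax(per_node))

    # 2./3. send R's above-average rows to a new group and bump k
    costs, rows = row_costs(R)
    movers = rows[costs > costs.mean()]
    if 0 < movers.size < rows.size:
        groups[movers] = k
        return groups, k + 1
    return groups, k                          # no useful split found
```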

20

Algorithms

1. Start with the initial matrix
2. Find good groups for a fixed k (Reassign)
3. Choose better values for k (Splits), and repeat step 2
4. Final grouping

Each Reassign and each Split lowers the total encoding cost; the overall search loop is sketched below.
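A minimal driver for the loop in the flowchart, tying together the encoding_cost(), reassign() and split() sketches from the previous slides (these names are assumptions from those sketches, not the paper's exact algorithm): alternate Splits and Reassigns while the total encoding cost keeps falling.

```python
import numpy as np

def autopart_sketch(A, max_k=None):
    """Alternate Split (outer loop) and Reassign (inner loop), keeping a new k
    only if it lowers the total encoding cost."""
    groups = np.zeros(A.shape[0], dtype=int)      # start with a single group
    k = 1
    best_cost = encoding_cost(A, groups, k)
    while max_k is None or k < max_k:
        cand, cand_k = split(A, groups.copy(), k)
        if cand_k == k:
            break                                 # Split found nothing to move
        cand = reassign(A, cand, cand_k, encoding_cost)
        cand_cost = encoding_cost(A, cand, cand_k)
        if cand_cost >= best_cost:
            break                                 # cost stopped decreasing
        groups, k, best_cost = cand, cand_k, cand_cost
    return groups, k
```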

21

Algorithms

Properties:

• Fully automatic: the number of groups is found automatically

• Scalable: O(E) time

• Allows incremental updates: reassign the new node or edge to the group with the least cost, and continue (a sketch follows below)
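A sketch of the incremental-update idea in Python (NumPy and the encoding_cost() sketch assumed): when a new node arrives with its row and column of edges, try each existing group and keep the cheapest assignment.

```python
import numpy as np

def add_node(A, groups, k, new_out_row, new_in_col):
    """Grow A by one node and assign it to the group with the least cost."""
    N = A.shape[0]
    A2 = np.zeros((N + 1, N + 1), dtype=A.dtype)
    A2[:N, :N] = A
    A2[N, :N] = new_out_row                   # edges from the new node
    A2[:N, N] = new_in_col                    # edges to the new node
    groups2 = np.append(groups, 0)
    best_g, best_cost = 0, None
    for g in range(k):
        groups2[N] = g
        cost = encoding_cost(A2, groups2, k)
        if best_cost is None or cost < best_cost:
            best_g, best_cost = g, cost
    groups2[N] = best_g
    return A2, groups2
```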

22

Outline

• Problem Definition
• Related Work
• Finding clusters in graphs
  • What is a good clustering?
  • How can we find such a clustering?
• Outliers and inter-group distances
• Experiments
• Conclusions

23

Outlier Edges

[Figure: the Nodes × Nodes matrix and its Node Groups × Node Groups version, with the outlier edges marked]

Outliers are deviations from "normality", and they lower the quality of the compression. So: find the edges whose removal maximally reduces the cost (a sketch follows below).
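A brute-force sketch of this in Python (NumPy and the encoding_cost() sketch assumed): score every edge by the drop in total encoding cost when it is removed, and report the largest drops.

```python
import numpy as np

def outlier_edges(A, groups, k, top=10):
    """Rank edges by how much removing them reduces the encoding cost."""
    base = encoding_cost(A, groups, k)
    scored = []
    for u, v in zip(*np.nonzero(A)):
        A[u, v] = 0
        scored.append((base - encoding_cost(A, groups, k), int(u), int(v)))
        A[u, v] = 1                           # restore the edge
    scored.sort(reverse=True)                 # biggest cost reduction first
    return scored[:top]
```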

24

Inter-cluster distances

[Figure: the Nodes × Nodes matrix and its Node Groups × Node Groups version, with groups Grp1, Grp2 and Grp3 marked]

Two groups are "close" if merging them does not increase the cost by much:

distance(i, j) = relative increase in cost on merging i and j (a sketch follows below).
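A sketch of this distance in Python (NumPy and the encoding_cost() sketch assumed): merge groups i and j, recompute the cost, and report the relative increase. It is symmetric by construction, and small when the two groups have similar connectivity patterns.

```python
import numpy as np

def group_distance(A, groups, k, i, j):
    """distance(i, j) = relative increase in encoding cost when i and j merge."""
    base = encoding_cost(A, groups, k)
    merged = groups.copy()
    merged[merged == j] = i                   # pour group j into group i
    merged[merged > j] -= 1                   # keep group labels contiguous
    return (encoding_cost(A, merged, k - 1) - base) / base
```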

25

Inter-cluster distances

[Figure: the Node Groups × Node Groups matrix for Grp1, Grp2 and Grp3, and the resulting distance graph with pairwise distances 5.5, 5.1 and 4.5]

Two groups are "close" if merging them does not increase the cost by much:

distance(i, j) = relative increase in cost on merging i and j.

26

Outline

• Problem Definition
• Related Work
• Finding clusters in graphs
  • What is a good clustering?
  • How can we find such a clustering?
• Outliers and inter-group distances
• Experiments
• Conclusions

27

Experiments

“Quasi block-diagonal” graph with noise=10%

28

Experiments

[Figure: the Authors × Authors co-citation matrix]

DBLP dataset:

• 6,090 authors in SIGMOD, ICDE, VLDB, PODS and ICDT

• 175,494 "dots", one "dot" per co-citation

29

Experiments

[Figure: the Authors × Authors matrix rearranged into Author Groups × Author Groups]

k = 8 author groups found; one of them contains Stonebraker, DeWitt and Carey.

30

Experiments

[Figure: inter-group distances among the author groups Grp1 through Grp8]

31

Experiments

[Figure: the Users × User Groups matrix]

Epinions dataset:

• 75,888 users

• 508,960 "dots", one "dot" per "trust" relationship

k = 19 groups found, including a small, dense "core".

32

Experiments

[Plot: running time (in seconds) versus the number of "dots"]

The running time is linear in the number of "dots", so the method is scalable.

33

Conclusions

Goals:

Find groups

Find outliers

Compute inter-group “distances”

Properties:

Fully Automatic

Scalable

Allow incremental updates
