1
AutoPart: Parameter-Free Graph Partitioning and Outlier Detection
Deepayan Chakrabarti ([email protected])
2
Problem Definition
[Figure: people in a social network, grouped into people-groups]

Group people in a social network, species in a food web, or proteins in protein-interaction graphs, …
3
Reminder
[Figure: people × people adjacency matrix]

Graph: N nodes and E directed edges
4
Problem Definition
[Figure: people in a social network, grouped into people-groups]

Goals:
• [#1] Find groups (of people, species, proteins, etc.)
• [#2] Find outlier edges (“bridges”)
• [#3] Compute inter-group “distances” (how similar are two groups of proteins?)
5
Problem Definition
[Figure: people in a social network, grouped into people-groups]

Properties:
• Fully automatic (estimate the number of groups)
• Scalable
• Allow incremental updates
6
Related Work
• Graph partitioning: METIS (Karypis+/1998), spectral partitioning (Ng+/2001). Drawback: need a measure of imbalance between clusters, or the number of partitions.
• Clustering techniques: k-means and variants (Pelleg+/2000, Hamerly+/2003), information-theoretic co-clustering (Dhillon+/2003). Drawback: rows and columns are considered separately, or the method is not fully automatic.
• LSI (Deerwester+/1990). Drawback: choosing the number of “concepts”.
7
Outline
• Problem Definition
• Related Work
• Finding clusters in graphs
• Outliers and inter-group distances
• Experiments
• Conclusions
8
Outline
• Problem Definition
• Related Work
• Finding clusters in graphs
  – What is a good clustering?
  – How can we find such a clustering?
• Outliers and inter-group distances
• Experiments
• Conclusions
9
What is a “good” clustering

[Figure: two candidate groupings of the node-group matrix, side by side. Why is one better?]

Good clustering:
1. Similar nodes are grouped together
2. As few groups as necessary

A good clustering produces a few homogeneous blocks, and a few homogeneous blocks imply good compression.
10
Binary Matrix

[Figure: binary adjacency matrix divided into node-group blocks]

Main idea: good compression implies good clustering.

For each block i, let ni1 and ni0 be its numbers of ones and zeros, and let pi1 = ni1 / (ni1 + ni0). Then

  Total encoding cost = Σi (ni1 + ni0) · H(pi1)   +   Σi [cost of describing ni1, ni0, and the groups]
                          (code cost)                    (description cost)

where H is the binary entropy function.
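As a concrete sketch, the code-cost term for one block can be computed as below (a minimal illustration; the function name and the zero-cost convention for empty or homogeneous blocks are my own, not from the slides):

```python
import math

def block_code_cost(n1: int, n0: int) -> float:
    """Bits needed to encode a binary block with n1 ones and n0 zeros:
    (n1 + n0) * H(p1), where p1 = n1 / (n1 + n0) and H is the binary
    Shannon entropy."""
    n = n1 + n0
    if n == 0 or n1 == 0 or n0 == 0:
        return 0.0  # an empty or perfectly homogeneous block costs 0 code bits
    p1 = n1 / n
    return -n * (p1 * math.log2(p1) + (1 - p1) * math.log2(1 - p1))
```

For example, a half-full block of 10 cells costs 10 bits (since H(0.5) = 1), while a perfectly homogeneous block costs 0 code bits, which is exactly why a few homogeneous blocks compress well.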
11
Examples

• One node group: high code cost, low description cost.
• n node groups (one group per node): low code cost, high description cost.

The total encoding cost (code cost plus description cost) trades off these two extremes.
12
What is a “good” clustering

[Figure: the two candidate groupings again, side by side; the better one has both low code cost and low description cost]

The better grouping achieves a lower total encoding cost: both its code cost Σi (ni1 + ni0) · H(pi1) and its description cost are low.
13
Outline
• Problem Definition
• Related Work
• Finding clusters in graphs
  – What is a good clustering?
  – How can we find such a clustering?
• Outliers and inter-group distances
• Experiments
• Conclusions
14
Algorithms

[Figure: example matrix partitioned into k = 5 node groups]
15
Algorithms

[Flowchart] Start with the initial matrix → find good groups for fixed k → choose better values for k → repeat, lowering the encoding cost each iteration → final grouping.
16
Algorithms

[Flowchart repeated, now highlighting: find good groups for fixed k]
17
[Figure: node-group matrix]

Fixed number of groups k.
Reassign: for each node, reassign it to the group that minimizes the code cost.
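The Reassign step could be sketched as follows. This is my own simplified dense-matrix version (block densities are computed once per pass), not the paper's O(E) sparse implementation:

```python
import math

def cell_bits(x: int, p: float) -> float:
    """Bits to encode one cell of value x under block density p."""
    p = min(max(p, 1e-9), 1 - 1e-9)  # clamp so log2 is defined
    return -math.log2(p if x == 1 else 1.0 - p)

def reassign_once(adj, groups, k):
    """One Reassign pass: move every node to the group whose current
    block densities encode that node's row and column most cheaply."""
    n = len(adj)
    ones = [[0] * k for _ in range(k)]
    size = [[0] * k for _ in range(k)]
    for i in range(n):
        for j in range(n):
            ones[groups[i]][groups[j]] += adj[i][j]
            size[groups[i]][groups[j]] += 1
    dens = [[ones[g][h] / size[g][h] if size[g][h] else 0.0
             for h in range(k)] for g in range(k)]
    new_groups = groups[:]
    for i in range(n):
        def cost(g):
            # bits for node i's outgoing row and incoming column if i joins g
            return sum(cell_bits(adj[i][j], dens[g][groups[j]]) +
                       cell_bits(adj[j][i], dens[groups[j]][g])
                       for j in range(n))
        new_groups[i] = min(range(k), key=cost)
    return new_groups
```

On a graph of two 3-cliques with one node initially misassigned, a single pass moves the misplaced node back to its clique's group.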
18
Algorithms

[Flowchart repeated, now highlighting: choose better values for k]
19
Choosing k
Split:
1. Find the group R with the maximum entropy per node.
2. Choose the nodes in R whose removal reduces the entropy per node in R.
3. Send these nodes to the new group, and set k = k + 1.
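A sketch of the Split step, under my own simplification that each node's row entropy under the current blocks is precomputed. Step 2 uses the fact that removing a node lowers R's per-node entropy exactly when that node's entropy is above R's average:

```python
def split(row_entropy, groups, k):
    """Split step: pick the group R with maximum entropy per node and
    move R's above-average-entropy nodes into a new group numbered k."""
    members = {g: [i for i, gi in enumerate(groups) if gi == g]
               for g in range(k)}
    per_node = {g: sum(row_entropy[i] for i in members[g]) / len(members[g])
                for g in range(k)}
    R = max(per_node, key=per_node.get)       # step 1
    new_groups = groups[:]
    for i in members[R]:
        if row_entropy[i] > per_node[R]:      # step 2: removal lowers the average
            new_groups[i] = k                 # step 3: send to the new group
    return new_groups, k + 1
```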
20
Algorithms

[Flowchart repeated, with the steps labeled: “find good groups for fixed k” is Reassign, “choose better values for k” is Split]
21
Algorithms
Properties:
• Fully automatic: the number of groups is found automatically
• Scalable: O(E) time
• Allows incremental updates: reassign a new node or edge to the group with least cost, and continue…
22
Outline
• Problem Definition
• Related Work
• Finding clusters in graphs
  – What is a good clustering?
  – How can we find such a clustering?
• Outliers and inter-group distances
• Experiments
• Conclusions
23
Outlier Edges
[Figure: adjacency matrix with outlier edges marked in the node-group blocks]

Outliers are deviations from “normality”; they lower the quality of the compression. So: find the edges whose removal maximally reduces the cost.
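One way to realize this is to score each edge by the bits it costs under the current block densities: an edge in a near-empty block is expensive, so removing it saves the most. A rough sketch (my own simplification; it ignores the small change in densities when an edge is removed):

```python
import math

def outlier_edges(adj, groups, dens, top=1):
    """Rank edges by the bits they cost under the block densities
    dens[g][h]; the most expensive edges are the outlier 'bridges'."""
    scored = []
    for i, row in enumerate(adj):
        for j, x in enumerate(row):
            if x:
                p = max(dens[groups[i]][groups[j]], 1e-9)
                scored.append((-math.log2(p), (i, j)))
    scored.sort(reverse=True)
    return [edge for _, edge in scored[:top]]
```

On two cliques joined by a single cross-group edge, that bridge is the top-ranked outlier, since it sits in a block of density 0.25 while every other edge sits in a block of density 1.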
24
Inter-cluster distances
[Figure: node-group matrix with three groups, Grp1, Grp2, Grp3]

Two groups are “close” if merging them does not increase the cost by much:

distance(i, j) = relative increase in cost on merging i and j
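This distance might be sketched as follows, using only the code-cost part of the total cost (a simplification; the full cost in the slides also includes the description cost):

```python
import math

def code_cost(adj, groups):
    """Total code cost: sum over blocks of (n1 + n0) * H(p1)."""
    k = max(groups) + 1
    ones = [[0] * k for _ in range(k)]
    size = [[0] * k for _ in range(k)]
    for i, row in enumerate(adj):
        for j, x in enumerate(row):
            ones[groups[i]][groups[j]] += x
            size[groups[i]][groups[j]] += 1
    total = 0.0
    for g in range(k):
        for h in range(k):
            n, n1 = size[g][h], ones[g][h]
            if n and 0 < n1 < n:
                p = n1 / n
                total -= n * (p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return total

def group_distance(adj, groups, a, b):
    """distance(a, b): relative increase in cost when a and b merge."""
    before = code_cost(adj, groups)
    merged = [min(a, b) if g in (a, b) else g for g in groups]
    labels = sorted(set(merged))
    merged = [labels.index(g) for g in merged]  # keep labels contiguous
    after = code_cost(adj, merged)
    return (after - before) / before if before else float("inf")
```

Merging two dissimilar groups mixes dense and sparse blocks into one high-entropy block, so the relative increase is large; similar groups merge almost for free.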
25
Inter-cluster distances
[Figure: the same three groups drawn as a distance graph over Grp1, Grp2, Grp3, with pairwise distances 5.5, 5.1, and 4.5]

distance(i, j) = relative increase in cost on merging i and j
26
Outline
• Problem Definition
• Related Work
• Finding clusters in graphs
  – What is a good clustering?
  – How can we find such a clustering?
• Outliers and inter-group distances
• Experiments
• Conclusions
27
Experiments
[Figure: “quasi block-diagonal” graph with noise = 10%]
28
Experiments
[Figure: author × author matrix]

DBLP dataset:
• 6,090 authors from SIGMOD, ICDE, VLDB, PODS, and ICDT
• 175,494 “dots”, one “dot” per co-citation
29
Experiments
[Figure: the reordered author × author matrix, with author groups on both axes]

k = 8 author groups found; one group contains Stonebraker, DeWitt, and Carey.
30
Experiments
[Figure: inter-group distances among the author groups, Grp1 through Grp8]
31
Experiments
[Figure: user × user matrix]

Epinions dataset:
• 75,888 users
• 508,960 “dots”, one “dot” per “trust” relationship

k = 19 groups found, including a small dense “core”.
32
Experiments
[Figure: running time in seconds vs. number of “dots”]

Running time is linear in the number of “dots”, i.e., the method is scalable.
33
Conclusions
Goals:
• Find groups
• Find outliers
• Compute inter-group “distances”

Properties:
• Fully automatic
• Scalable
• Allow incremental updates