1
Chinese Whispers - an Efficient Graph Clustering Algorithm and its Application to Natural Language Processing Problems
Chris Biemann
University of Leipzig, NLP-Dept.
Leipzig, Germany
June 9, 2006
TextGraphs 06, NYC, USA
2
Outline
• Introduction to Graph Clustering
• Chinese Whispers Algorithm
• Experiments with Synthetic Data
• Application of CW to
  – Language Separation
  – POS clustering
  – Word Sense Induction
• Extensions
3
Graph Clustering
• Find groups of nodes in undirected, weighted graphs
• Hierarchical Clustering vs. Flat Partitioning
[Figure: example graph, node labels 3 and 4]
4
Desired outcomes?
• Colors symbolise partitions
[Figure: example graph (node labels 3 and 4), coloured by partition]
5
Chinese Whispers Algorithm
• Nodes have a class and communicate it to their adjacent nodes
• A node adopts one of the majority classes in its neighbourhood
• Nodes are processed in random order for some iterations
Algorithm:
  initialize:
    forall v_i in V: class(v_i) = i;
  while changes:
    forall v in V, randomized order:
      class(v) = highest ranked class in neighborhood of v;
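The pseudocode above can be sketched in Python roughly as follows. This is a minimal illustration, not the author's implementation. It assumes that "highest ranked class" means the class with the largest sum of edge weights in the neighbourhood and that ties are broken at random. The function name `chinese_whispers`, the `(u, v, weight)` edge format and the `max_iterations` cap are choices made for this sketch.

```python
import random
from collections import defaultdict


def chinese_whispers(edges, max_iterations=20, seed=None):
    """Cluster an undirected, weighted graph given as (u, v, weight) triples.

    Returns a dict mapping each node to its class label.
    """
    rng = random.Random(seed)

    # Build a symmetric adjacency map: node -> {neighbour: weight}.
    adjacency = defaultdict(dict)
    for u, v, weight in edges:
        adjacency[u][v] = weight
        adjacency[v][u] = weight

    # initialize: every node starts in its own class.
    nodes = list(adjacency)
    label = {node: index for index, node in enumerate(nodes)}

    # "while changes", capped because convergence is not guaranteed.
    for _ in range(max_iterations):
        changed = False
        rng.shuffle(nodes)  # randomized processing order
        for node in nodes:
            # Sum the edge weights per class over the neighbourhood.
            scores = defaultdict(float)
            for neighbour, weight in adjacency[node].items():
                scores[label[neighbour]] += weight
            best = max(scores.values())
            # Assumption of this sketch: ties are broken at random.
            candidates = sorted(c for c, s in scores.items() if s == best)
            winner = rng.choice(candidates)
            if winner != label[node]:
                label[node] = winner
                changed = True
        if not changed:
            break
    return label


if __name__ == "__main__":
    # Two tightly connected triangles joined by one weak edge.
    example = [
        ("a", "b", 5), ("b", "c", 4), ("a", "c", 3),
        ("d", "e", 5), ("e", "f", 4), ("d", "f", 3),
        ("c", "d", 1),
    ]
    print(chinese_whispers(example, seed=0))
```

Each pass touches every edge twice, once from each endpoint, which matches the slide deck's claim that the running time is linear in the number of edges.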
[Figure: node A (class L1) and its neighbours B (L4), C (L3), D (L2) and E (L3); edge weights 5, 8, 6 and 3; node degrees deg=1 to deg=5]
6
Example: CW-Partitioning in two steps
7
Properties of CW
PRO:
• Efficiency: CW is time-linear in the number of edges. This is bounded by n², with n = number of nodes, but in real-world data graphs are much sparser
• Parameter-free: this includes the number of clusters
CON:
• Non-deterministic: due to random-order processing and possible ties w.r.t. the majority
• Does not converge: see tie example
[Figure: tie example]
However, the CONs are not severe for real-world data...
Formally hard to analyse: perform experiments
8
Experiment: Bi-partite cliques, unweighted
• Intuition: Bi-partite cliques should be split into two cliques
• CW can split bi-partite cliques into two parts or leave them as a whole
• Measure how often CW succeeds: the larger the graph, the safer the split
-> CW is meant for large graphs
9
Co-occurrences: A source for Graphs
• The entirety of all significant co-occurrences is a co-occurrence graph G(V,E) with
  V: Vertices = words
  E: Edges (v1, v2, s) with v1, v2 words, s significance value
• The co-occurrence graph is
  – weighted by significance (here: log-likelihood)
  – undirected
• Small-world property
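As an illustration of how such a graph could be built, the sketch below counts sentence-level co-occurrences and weights each edge (v1, v2, s) with a log-likelihood ratio over a 2x2 contingency table. The slides do not give the exact formula or threshold, so the Dunning-style statistic, the sentence window and the cut-off of 3.84 (the 5% critical value of chi-square with one degree of freedom) are assumptions, and the names `log_likelihood` and `cooccurrence_graph` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def log_likelihood(n_ab, n_a, n_b, n):
    """G^2 statistic for two words, computed from sentence counts.

    n_ab: sentences containing both words, n_a / n_b: sentences
    containing each word, n: total number of sentences.
    """
    observed = [
        n_ab,                   # both words
        n_a - n_ab,             # only the first word
        n_b - n_ab,             # only the second word
        n - n_a - n_b + n_ab,   # neither word
    ]
    expected = [
        n_a * n_b / n,
        n_a * (n - n_b) / n,
        (n - n_a) * n_b / n,
        (n - n_a) * (n - n_b) / n,
    ]
    # Cells with an observed count of zero contribute nothing.
    return 2.0 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
    )


def cooccurrence_graph(sentences, threshold=3.84):
    """Return edges (v1, v2, s) for significantly co-occurring word pairs."""
    n = len(sentences)
    word_freq = Counter()
    pair_freq = Counter()
    for sentence in sentences:
        words = sorted(set(sentence))  # count each word once per sentence
        word_freq.update(words)
        pair_freq.update(combinations(words, 2))

    edges = []
    for (v1, v2), n_ab in pair_freq.items():
        n_a, n_b = word_freq[v1], word_freq[v2]
        # Keep only pairs that co-occur more often than chance predicts.
        if n_ab * n <= n_a * n_b:
            continue
        s = log_likelihood(n_ab, n_a, n_b, n)
        if s >= threshold:
            edges.append((v1, v2, s))
    return edges
```

The resulting triples are undirected and weighted by significance, so they can be passed directly to a clustering routine that expects `(u, v, weight)` edges.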
10
Application: Language Separation
• Cluster the co-occurrence graph of a multilingual corpus
• Use the words of each class as the lexicon of a language identifier
• Almost perfect performance
[Figure: Precision, Recall and F-value for 7-lingual corpora. x-axis: # of sentences per language (100 to 100000); y-axis: P/R/F (0.96 to 1); series: Precision, Recall, F-value]
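The second bullet above, using the words of each class as a lexicon, can be sketched as a simple word-list language identifier. The slides do not describe the identifier that was actually used, so the hit-counting rule, the whitespace tokenisation and the name `identify_language` are assumptions of this sketch.

```python
from collections import Counter


def identify_language(sentence, lexicons):
    """Assign a sentence to the cluster whose lexicon covers most of its words.

    lexicons: dict mapping a cluster name to the set of words in that cluster.
    Returns the best cluster name, or None if no word is known.
    """
    hits = Counter()
    for word in sentence.lower().split():
        for name, words in lexicons.items():
            if word in words:
                hits[name] += 1
    if not hits:
        return None
    return hits.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy lexicons standing in for two clusters of the co-occurrence graph.
    toy_lexicons = {
        "cluster_1": {"the", "and", "house"},
        "cluster_2": {"der", "und", "haus"},
    }
    print(identify_language("the house and the garden", toy_lexicons))
```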
11
Application: Acquisition of POS-classes
• Distributional similarity: Words that co-occur significantly with the same neighbours should be of the same POS
• Clustering the second-order NB-co-occurrence graph of the BNC (excluding the 2000 most frequent words)
12
Results: POS-clusters
• In total: 282 clusters, 26 of which have more than 100 members. The clusters are syntactically and semantically motivated. Purity: 88%
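The slide does not say how purity was computed. A common definition, assumed here, credits each cluster with its most frequent gold-standard tag and divides the total by the number of evaluated words. The function name `purity` and both input dictionaries are hypothetical.

```python
from collections import Counter, defaultdict


def purity(cluster_of, gold_of):
    """Fraction of words that carry the majority gold tag of their cluster.

    cluster_of: dict word -> cluster id
    gold_of:    dict word -> gold-standard tag (e.g. a POS tag)
    Words without a gold tag are ignored.
    """
    tags_per_cluster = defaultdict(list)
    for word, cluster in cluster_of.items():
        if word in gold_of:
            tags_per_cluster[cluster].append(gold_of[word])

    total = sum(len(tags) for tags in tags_per_cluster.values())
    if total == 0:
        return 0.0
    # Count, per cluster, how many members carry its most frequent tag.
    majority = sum(
        Counter(tags).most_common(1)[0][1]
        for tags in tags_per_cluster.values()
    )
    return majority / total
```

For example, a cluster {dog, cat, run} tagged N, N, V and a cluster {walk, eat} tagged V, V give (2 + 2) / 5 = 0.8.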
13
Application: Word Sense Induction
• Co-occurrence graphs of ambiguous words can be partitioned [Dorow & Widdows 03]: leave out the focus word
• Clusters contain context words for disambiguation
14
Unsupervised WSI Evaluation Framework
Evaluation: for unambiguous words, merge their co-occurrence graphs and try to split the merged graph back into the original parts
• retrieval precision (rP): similarity of the found sense with the gold standard sense
• retrieval recall (rR): amount of words that have been correctly assigned to the gold standard sense
• precision (P): fraction of correctly found disambiguations
• recall (R): fraction of correctly found senses
45 test words of different POS and frequency bands.
15
Results: WSI
• No parameter for the expected number of clusters
• CW scores are comparable to those of an algorithm specially designed for WSI
16-19
[Figures: example for the word "hip"; images not preserved]
20
Conclusion
• Very effective graph partitioning algorithm for weighted, undirected graphs
• Possible to process really large graphs
• Fuzzy partitioning and hierarchical clustering possible
• Especially suited for small-world graphs (sparse adjacency matrix)
• Useful in NLP applications such as Language Separation, POS clustering, Word Sense Induction
Download a GUI implementation in Java of Chinese Whispers (Open Source) at
http://wortschatz.informatik.uni-leipzig.de/~cbiemann/software/CW.html
21
Questions?
THANK YOU
22
Experiment: Convergence
• Weighted graphs converge much faster (fewer ties)
• For weighted graphs, 15 iterations were enough to partition the 1.7M nodes / 56M edges co-occurrence graph of our main German corpus
• Larger graphs result in less uncertainty
23
Experiment: Small World Mixtures
• CW can separate well if the merge rate is not too high
• Different sizes of the original SWs do not pose a problem
25
Usages of hip
• FIGHT: The punching hip , be it the leading hip of a front punch or the trailing hip of a reverse punch , must swivel forwards , so that your centre-line directly faces the opponent .
• MUSIC: This hybrid mix of reggae and hip hop follows acid jazz , Belgian New Beat and acid swing the wholly forgettable contribution of Jive Bunny as the sound to set disco feet tapping .
• DANCER: Sitting back and taking it all in is another former hip hop dancer , Moet Lo , who lost his Wall Street messenger job when his firm discovered his penchant for the five-finger discount at Polo stores
• HOORAY: Ho , hey , ho hi , ho , hey , ho , hip hop hooray , funky , get down , a-boogie , get down .
• MEDICINE: We treated orthopaedic screening as a distinct category because some neonatal deformations (such as congenital dislocation of the hip ) represent only a predisposition to congenital abnormality , and surgery is avoided by conservative treatment .
• BODYPART-INJURY: I had a hip replacement operation on my left side , after which I immediately broke my right leg .
• BODYPART-CLOTHING: At his hip he wore a pistol in an ancient leather holster .