distributional clustering of english words
DESCRIPTION
Distributional Clustering of English Words. Fernando Pereira- AT&T Bell Laboratories, 600 Naftali Tishby- Dept. of Computer Science, Hebrew University Lillian Lee- Dept. of Computer Science, Cornell University - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: Distributional Clustering of English Words](https://reader030.vdocument.in/reader030/viewer/2022020322/56813fe8550346895daad8ef/html5/thumbnails/1.jpg)
Distributional Clustering of English Words
Fernando Pereira- AT&T Bell Laboratories, 600Naftali Tishby- Dept. of Computer Science, Hebrew University
Lillian Lee- Dept. of Computer Science, Cornell University
Presenter- Juan Ramos, Dept. of Computer Science, Rutgers Universtiy, [email protected]
![Page 2: Distributional Clustering of English Words](https://reader030.vdocument.in/reader030/viewer/2022020322/56813fe8550346895daad8ef/html5/thumbnails/2.jpg)
Overview
• Purpose: evaluate a method for clustering words according to their distribution in particular syntactic contexts.
• Methodology: find lowest distortion sets of clusters of words to determine models of word coocurrence.
![Page 3: Distributional Clustering of English Words](https://reader030.vdocument.in/reader030/viewer/2022020322/56813fe8550346895daad8ef/html5/thumbnails/3.jpg)
Applications
• Scientific POV: lexical acquisition of words• Practical POV: classification concerns
data sparseness in garmmar models.• Address clusters in large corpus of
documents
![Page 4: Distributional Clustering of English Words](https://reader030.vdocument.in/reader030/viewer/2022020322/56813fe8550346895daad8ef/html5/thumbnails/4.jpg)
Definitions
• Context: function of given word in its sentence.– Eg: a noun as a direct object
• Sense class: hidden model describing word association tendencies– Mix of cluster and cluster probability given a
word• Cluster: probabilistic concept of a sense
class
![Page 5: Distributional Clustering of English Words](https://reader030.vdocument.in/reader030/viewer/2022020322/56813fe8550346895daad8ef/html5/thumbnails/5.jpg)
Problem Setting
• Restrict problem to verbs (V) and nouns (N) in main verb-direct object relationship
• f (v, n) = frequencies of occurrence of verb, noun pairs– Text must be pre-formatted to fit specifications
• For given noun n, conditional distribution p(n, v) = f(v,n)/(sum (v, f(v,n))
![Page 6: Distributional Clustering of English Words](https://reader030.vdocument.in/reader030/viewer/2022020322/56813fe8550346895daad8ef/html5/thumbnails/6.jpg)
Problem Setting cont.
• Goal: create set C of clusters and probabilityies p(c|n).
• Each c in C associated to cluster centroid p(c)– p(c) = average of p(n) over all v in V.
![Page 7: Distributional Clustering of English Words](https://reader030.vdocument.in/reader030/viewer/2022020322/56813fe8550346895daad8ef/html5/thumbnails/7.jpg)
Distributional Similarity
• Given two distributions p, q, KL distance is D(p || q) = sum (x, p(x) log (p(x)/q(x)))– D(p || q) = 0 implies p = q– Small D(p || q) implies two distributions are
likely instances of a centroid p(c).• D(p || q) measures loss of data by using
p(c).
![Page 8: Distributional Clustering of English Words](https://reader030.vdocument.in/reader030/viewer/2022020322/56813fe8550346895daad8ef/html5/thumbnails/8.jpg)
Theoretical Foundation
• Given unstructured V, N, training data of X independent pairs of verbs and nouns.
• Problem: learn joint distribution of pairs given X
• Not quite unsupervised, not quite supervised– No internal structure in pairs– Learn underlying distribution
![Page 9: Distributional Clustering of English Words](https://reader030.vdocument.in/reader030/viewer/2022020322/56813fe8550346895daad8ef/html5/thumbnails/9.jpg)
Distributional Clustering
• Approximately decompose p(n,v) to p’(n,v) = sum (c in C, p(c|n)*p(c, v)).– p(c|n) = membership probability of n in c– p(c,v) = p(v|c) = probability of v given centroid
for c• Assuming p(n), p’(v) coincide, p’(n,v) =
sum(c in C, p(c)*p(n|c)*p(v|c))
![Page 10: Distributional Clustering of English Words](https://reader030.vdocument.in/reader030/viewer/2022020322/56813fe8550346895daad8ef/html5/thumbnails/10.jpg)
Maximum Likelihood Cluster Centroids
• Used to maximize goodness of fit between data and p’(n,v)
• For sequence of pairs S, S’s model log prob. is: l(S) = sum(N, log (sum (c in C, p’(n,v)))).– Maximize according to p(n|c) and p(v|c).– Variation of l(S):
![Page 11: Distributional Clustering of English Words](https://reader030.vdocument.in/reader030/viewer/2022020322/56813fe8550346895daad8ef/html5/thumbnails/11.jpg)
Maximum Entropy Cluster Membership
• Assume independence between variations of p(n|c) and p(v|c).– Can find Bayes inverses of p(n|c) given p(v|c)
and p(v|n)– p(v|c) that maximize l(S) also minimize
average distortion between cluster model and data
![Page 12: Distributional Clustering of English Words](https://reader030.vdocument.in/reader030/viewer/2022020322/56813fe8550346895daad8ef/html5/thumbnails/12.jpg)
Entropy Cluster Membership cont.
• Average cluster distortion:
• Entropy:
![Page 13: Distributional Clustering of English Words](https://reader030.vdocument.in/reader030/viewer/2022020322/56813fe8550346895daad8ef/html5/thumbnails/13.jpg)
Entropy Cluster Membership cont.
• Class and membership distributions:
– Z(c) and Z(n) are normalization sums• Previous equations reduce log-likelihood to:
• At maximum, variation vanishes
![Page 14: Distributional Clustering of English Words](https://reader030.vdocument.in/reader030/viewer/2022020322/56813fe8550346895daad8ef/html5/thumbnails/14.jpg)
KL Distortion
• Attempt to minimize KL distortion through variation of KL distances:
– Results in weighted average of noun distributions.
![Page 15: Distributional Clustering of English Words](https://reader030.vdocument.in/reader030/viewer/2022020322/56813fe8550346895daad8ef/html5/thumbnails/15.jpg)
Free Energy Function• Combined minimum distortion and max entropy
equivalent to minimum of free energy: F = <D> - H/beta
• F determines <D> and H through partial derivatives:
• Min of F determines balance between disordering max entropy and ordering distortion min.
![Page 16: Distributional Clustering of English Words](https://reader030.vdocument.in/reader030/viewer/2022020322/56813fe8550346895daad8ef/html5/thumbnails/16.jpg)
Hierarchical Clustering
• Number of clusters is determined through sequence of increases of beta.– Higher beta implies more local influence of
noun on definition of centroids.• Start with low beta and a single c in C
– Search for lowest beta that splits c into two or more leaf c’s.
– Repeat until |C| reaches desired size.
![Page 17: Distributional Clustering of English Words](https://reader030.vdocument.in/reader030/viewer/2022020322/56813fe8550346895daad8ef/html5/thumbnails/17.jpg)
Experimental Results
• Classify 64 nouns appearing as direct objects of verb ‘fire’ in Associated Press documents, 1988, where |V| = 2147.
• Four words most similar to cluster centroid and KL distances for first splits.– Split 1: cluster of ‘fire’ as discharging
weapons vs. cluster of ‘fire’ as releasing employees
– Split 2: weapons as projectiles vs. weapons as guns.
![Page 18: Distributional Clustering of English Words](https://reader030.vdocument.in/reader030/viewer/2022020322/56813fe8550346895daad8ef/html5/thumbnails/18.jpg)
Clustering on Verb ‘fire’
![Page 19: Distributional Clustering of English Words](https://reader030.vdocument.in/reader030/viewer/2022020322/56813fe8550346895daad8ef/html5/thumbnails/19.jpg)
Evaluation
![Page 20: Distributional Clustering of English Words](https://reader030.vdocument.in/reader030/viewer/2022020322/56813fe8550346895daad8ef/html5/thumbnails/20.jpg)
Evaluation cont.
![Page 21: Distributional Clustering of English Words](https://reader030.vdocument.in/reader030/viewer/2022020322/56813fe8550346895daad8ef/html5/thumbnails/21.jpg)
Conclusions
• Clustering is efficient, informative, and returns good predictions
• Future work– Make clustering method more rigorous– Introduce human judgment, i.e. a more
supervised approach– Extend model to other word relationships
![Page 22: Distributional Clustering of English Words](https://reader030.vdocument.in/reader030/viewer/2022020322/56813fe8550346895daad8ef/html5/thumbnails/22.jpg)
References
![Page 23: Distributional Clustering of English Words](https://reader030.vdocument.in/reader030/viewer/2022020322/56813fe8550346895daad8ef/html5/thumbnails/23.jpg)
References cont.
![Page 24: Distributional Clustering of English Words](https://reader030.vdocument.in/reader030/viewer/2022020322/56813fe8550346895daad8ef/html5/thumbnails/24.jpg)
More References