distributional clustering of english words

Distributional Clustering of English Words

Fernando Pereira- AT&T Bell Laboratories, 600Naftali Tishby- Dept. of Computer Science, Hebrew University

Lillian Lee- Dept. of Computer Science, Cornell University

Presenter- Juan Ramos, Dept. of Computer Science, Rutgers Universtiy, [email protected]

Overview

• Purpose: evaluate a method for clustering words according to their distribution in particular syntactic contexts.

• Methodology: find lowest distortion sets of clusters of words to determine models of word coocurrence.

Applications

• Scientific POV: lexical acquisition of words• Practical POV: classification concerns

data sparseness in garmmar models.• Address clusters in large corpus of

documents

Definitions

• Context: function of given word in its sentence.– Eg: a noun as a direct object

• Sense class: hidden model describing word association tendencies– Mix of cluster and cluster probability given a

word• Cluster: probabilistic concept of a sense

class

Problem Setting

• Restrict problem to verbs (V) and nouns (N) in main verb-direct object relationship

• f (v, n) = frequencies of occurrence of verb, noun pairs– Text must be pre-formatted to fit specifications

• For given noun n, conditional distribution p(n, v) = f(v,n)/(sum (v, f(v,n))

Problem Setting cont.

• Goal: create set C of clusters and probabilityies p(c|n).

• Each c in C associated to cluster centroid p(c)– p(c) = average of p(n) over all v in V.

Distributional Similarity

• Given two distributions p, q, KL distance is D(p || q) = sum (x, p(x) log (p(x)/q(x)))– D(p || q) = 0 implies p = q– Small D(p || q) implies two distributions are

likely instances of a centroid p(c).• D(p || q) measures loss of data by using

p(c).

Theoretical Foundation

• Given unstructured V, N, training data of X independent pairs of verbs and nouns.

• Problem: learn joint distribution of pairs given X

• Not quite unsupervised, not quite supervised– No internal structure in pairs– Learn underlying distribution

Distributional Clustering

• Approximately decompose p(n,v) to p’(n,v) = sum (c in C, p(c|n)*p(c, v)).– p(c|n) = membership probability of n in c– p(c,v) = p(v|c) = probability of v given centroid

for c• Assuming p(n), p’(v) coincide, p’(n,v) =

sum(c in C, p(c)*p(n|c)*p(v|c))

Maximum Likelihood Cluster Centroids

• Used to maximize goodness of fit between data and p’(n,v)

• For sequence of pairs S, S’s model log prob. is: l(S) = sum(N, log (sum (c in C, p’(n,v)))).– Maximize according to p(n|c) and p(v|c).– Variation of l(S):

Entropy Cluster Membership cont.

• Average cluster distortion:

• Entropy:

Entropy Cluster Membership cont.

• Class and membership distributions:

– Z(c) and Z(n) are normalization sums• Previous equations reduce log-likelihood to:

• At maximum, variation vanishes

KL Distortion

• Attempt to minimize KL distortion through variation of KL distances:

– Results in weighted average of noun distributions.

Free Energy Function• Combined minimum distortion and max entropy

equivalent to minimum of free energy: F = <D> - H/beta

• F determines <D> and H through partial derivatives:

• Min of F determines balance between disordering max entropy and ordering distortion min.

Hierarchical Clustering

• Number of clusters is determined through sequence of increases of beta.– Higher beta implies more local influence of

noun on definition of centroids.• Start with low beta and a single c in C

– Search for lowest beta that splits c into two or more leaf c’s.

– Repeat until |C| reaches desired size.

Experimental Results

• Classify 64 nouns appearing as direct objects of verb ‘fire’ in Associated Press documents, 1988, where |V| = 2147.

• Four words most similar to cluster centroid and KL distances for first splits.– Split 1: cluster of ‘fire’ as discharging

weapons vs. cluster of ‘fire’ as releasing employees

– Split 2: weapons as projectiles vs. weapons as guns.

Clustering on Verb ‘fire’

Evaluation

Evaluation cont.

Conclusions

• Clustering is efficient, informative, and returns good predictions

• Future work– Make clustering method more rigorous– Introduce human judgment, i.e. a more

supervised approach– Extend model to other word relationships

References

References cont.

More References

distributional clustering of english words

Documents

probability of v

verbs v

nsum v

average cluster distortion

cluster model

cluster probability

average of pn

average distortion