COLLOCATIONS
He Zhongjun 2007-04-13
Outline
- Introduction
- Approaches to find collocations
  - Frequency
  - Mean and Variance
  - Hypothesis test
  - Mutual information
- Applications
What are collocations?
A collocation is an expression consisting of two or more words that correspond to some conventional way of saying things. (-- the book)
A collocation is a sequence of two or more consecutive words that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components. (-- Choueka, 1988)
Examples
- Noun phrases: strong tea vs. powerful tea
- Verbs: make a decision vs. take a decision; knock … door vs. hit … door; make up
- Idioms: kick the bucket (to die)
- Subtle, unexplainable native-speaker usage: broad daylight vs. bright daylight; 昨天 (yesterday), 去年 (last year), 上个月 (last month)
- …
Introduction – Characteristics / Criteria
- Non-compositionality: e.g. kick the bucket; white wine, white hair, white woman
- Non-substitutability: e.g. white wine -> yellow wine?
- Non-modifiability: e.g. as poor as a church mouse -> as poor as church mice?
- Cannot be translated word for word
Frequency (2-1)
Counting: e.g. count the bigrams in a corpus and select the most frequent ones.
Not effective on its own: most of the frequent pairs are function words!
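The counting step can be sketched in a few lines (a minimal illustration; the toy token list stands in for a real tokenized corpus):

```python
from collections import Counter

def count_bigrams(tokens):
    """Count adjacent word pairs in a token sequence."""
    return Counter(zip(tokens, tokens[1:]))

# Toy corpus standing in for a real one
tokens = "she knocked on his door and he knocked on the door".split()
bigrams = count_bigrams(tokens)
print(bigrams.most_common(1))  # [(('knocked', 'on'), 2)]
```

On real text the top of this list is dominated by pairs like "of the" and "in the", which is exactly the problem the filtering step below addresses.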
Frequency (2-2)
Filter the frequent bigrams by part-of-speech patterns (Justeson and Katz 1995), or use a stop list of function words.
A simple quantitative technique plus simple linguistic knowledge.
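A minimal sketch of the pattern filter, in the spirit of Justeson and Katz (1995); the tag set and the tagged input here are simplified placeholders, not their actual tagger:

```python
# Keep only bigrams whose tag pattern is plausible for a phrase.
# Simplified tags: A = adjective, N = noun, D = determiner, P = preposition.
PATTERNS = {("A", "N"), ("N", "N")}

def filter_bigrams(tagged):
    """tagged: list of (word, tag) pairs. Return bigrams matching a pattern."""
    return [
        (w1, w2)
        for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
        if (t1, t2) in PATTERNS
    ]

tagged = [("the", "D"), ("New", "A"), ("York", "N"),
          ("Stock", "N"), ("Exchange", "N"), ("of", "P")]
print(filter_bigrams(tagged))
# -> [('New', 'York'), ('York', 'Stock'), ('Stock', 'Exchange')]
```

Function-word pairs such as ("the", "New") never match, so they drop out of the candidate list regardless of frequency.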
Mean and Variance (4-1)
Fixed bigrams -> bigrams at a distance:
- she knocked on his door
- they knocked at the door
- 100 women knocked on Donaldson's door
- a man knocked on the metal front door
Mean offset: (3+3+5+5)/4 = 4.0
Deviation: how much the individual offsets vary around this mean.
Mean and Variance (4-2)
Mean: \bar{d} = \frac{1}{n} \sum_{i=1}^{n} d_i
Variance: s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (d_i - \bar{d})^2
Low variance means two words usually occur at about the same distance
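The knocked/door offsets above give a quick check of these formulas (Python's statistics module uses the same n-1 denominator for the sample variance):

```python
import statistics

# Offsets of "door" relative to "knocked" in the four example sentences
offsets = [3, 3, 5, 5]

mean = statistics.mean(offsets)   # sample mean
sd = statistics.stdev(offsets)    # sample standard deviation, n-1 denominator

print(mean, round(sd, 2))  # -> 4 1.15
```

A tight cluster of offsets (low sd) suggests a genuine positional association; a flat spread suggests the words merely co-occur in the same sentences.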
Mean and Variance (4-3)
The mean of -1.15 indicates that "strong" usually occurs on the left side, e.g. strong business support.
"strong" and "for" do not form a collocation: their offsets are spread out, so the variance is high.
Mean and Variance (4-4)
If the mean is close to 1.0 and the deviation is low, the method finds the same collocations as the frequency-based method. It can also find looser phrases whose words occur at variable distances.
Hypothesis Testing
What if high frequency and low variance are accidental? E.g. "new companies": "new" and "companies" are both frequently occurring words, yet the pair is not a collocation.
Hypothesis testing: assessing whether or not something is a chance event.
- Null hypothesis H0: there is no association between the words beyond chance occurrences.
- Compute the probability p that the event would occur if H0 were true.
- If p is below the significance level, reject H0; otherwise, retain H0.
t-test (5-1)
t statistic:
t = \frac{\bar{x} - \mu}{\sqrt{s^2 / N}}
where \bar{x} is the sample mean, \mu is the distribution mean, s^2 is the sample variance, and N is the sample size.
Think of the corpus as a long sequence of N bigrams: if the bigram of interest occurs, the value is 1; otherwise, the value is 0 (a binomial distribution).
t-test (5-2)
N(new) = 15828, N(companies) = 4675, N(tokens) = 14307668, N(new companies) = 8
P(new) = 15828/14307668, P(companies) = 4675/14307668
P(new companies) = 8/14307668 = 5.591 * 10^-7
H0: P(new companies) = P(new) P(companies) = 3.615 * 10^-7
Mean: \mu = 3.615 * 10^-7
Variance (assuming a Bernoulli trial): s^2 = p(1-p) \approx p
t \approx 0.999932 < 2.576 (the critical value at \alpha = 0.005), so we retain H0.
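Plugging the slide's counts into the t statistic reproduces the result:

```python
import math

N = 14_307_668                                # tokens in the corpus
c_new, c_companies, c_bigram = 15_828, 4_675, 8

x_bar = c_bigram / N                          # sample mean of the 0/1 indicator
mu = (c_new / N) * (c_companies / N)          # mean under H0 (independence)
s2 = x_bar * (1 - x_bar)                      # Bernoulli variance, approximately x_bar

t = (x_bar - mu) / math.sqrt(s2 / N)
print(round(t, 4))  # ~0.9999, well below the 2.576 critical value: retain H0
```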
t-test (5-3)
The t-test can rank bigrams that have the same frequency, which a frequency-based method cannot do.
t-test (5-4)
Using the t-test to find words whose co-occurrence patterns best distinguish between two words, e.g. in lexicography (Church et al. 1989).
t-test (5-5)
Pearson’s chi-square test (4-1)
The t-test assumes that probabilities are approximately normally distributed.
The \chi^2 test does not assume normality. It compares the observed frequencies with the frequencies expected under independence; if the difference is large, reject H0.
Pearson’s chi-square test (4-2)
\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} = 1.55 < 3.841 (the critical value at \alpha = 0.05)
Accept H0: "new" and "companies" occur independently!
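For a 2x2 table the statistic has a closed form, which makes the computation for (new, companies) a one-liner; the cell counts are derived from the totals on the t-test slide:

```python
# 2x2 contingency table for (new, companies)
O11 = 8                                  # new companies
O12 = 15_828 - 8                         # new, followed by a word other than companies
O21 = 4_675 - 8                          # companies preceded by a word other than new
O22 = 14_307_668 - O11 - O12 - O21       # neither word
N = O11 + O12 + O21 + O22

# Closed form of Pearson's chi-square for a 2x2 table
chi2 = N * (O11 * O22 - O12 * O21) ** 2 / (
    (O11 + O12) * (O11 + O21) * (O12 + O22) * (O21 + O22)
)
print(round(chi2, 2))  # ~1.55 < 3.841: accept H0
```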
Pearson’s chi-square test (4-3)
Identification of translation pairs in aligned corpora (Church et al. 1991).
59 is the number of sentence pairs which have "cow" in English and "vache" in French.
\chi^2 \approx 456400: reject H0, (cow, vache) is a translation pair.
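The same closed form applied to the aligned-corpus table reproduces this number; the three cell counts beyond the 59 mentioned above are taken from the textbook's version of this example and are assumed here:

```python
# Sentence-pair contingency table for (cow, vache); counts other than 59
# are assumed from the textbook's rendition of the Church et al. example
O11, O12, O21, O22 = 59, 6, 8, 570_934
N = O11 + O12 + O21 + O22

chi2 = N * (O11 * O22 - O12 * O21) ** 2 / (
    (O11 + O12) * (O11 + O21) * (O12 + O22) * (O21 + O22)
)
print(round(chi2))  # roughly 456400: far above any critical value, reject H0
```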
Pearson’s chi-square test (4-4)
Metric for corpus similarity (Kilgarriff et al. 1998).
H0: the two corpora are drawn from the same source.
Likelihood ratios (3-1)
More appropriate for sparse data. Two alternative explanations for the occurrence frequency of a bigram w1 w2 (Dunning 1993):
- H1 (independence): P(w2|w1) = p = P(w2|¬w1)
- H2 (dependence): P(w2|w1) = p1 ≠ p2 = P(w2|¬w1)
log \lambda = log ( L(H1) / L(H2) ), where L(H) is the likelihood of observing the counts under H.
Likelihood ratios (3-2)
c1, c2, c12 are the numbers of occurrences of w1, w2, and w1 w2. Assuming a binomial distribution b(k; n, x), with p = c2/N, p1 = c12/c1, p2 = (c2 - c12)/(N - c1):
L(H1) = b(c12; c1, p) \cdot b(c2 - c12; N - c1, p)
L(H2) = b(c12; c1, p1) \cdot b(c2 - c12; N - c1, p2)
Likelihood ratios (3-3)
-2 log \lambda
If \lambda is a likelihood ratio of a particular form, then -2 log \lambda is asymptotically \chi^2-distributed (Mood et al. 1974).
The likelihood ratio test is more appropriate for sparse data.
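The -2 log λ statistic can be computed directly from the binomial log-likelihoods (the binomial coefficients cancel in the ratio, so they are omitted); here it is evaluated for (new, companies) with the counts from the t-test slide:

```python
import math

def log_l(k, n, x):
    """Log-likelihood of k successes in n Bernoulli trials with parameter x."""
    return k * math.log(x) + (n - k) * math.log(1 - x)

N = 14_307_668
c1, c2, c12 = 15_828, 4_675, 8      # counts of new, companies, new companies

p = c2 / N                           # H1: P(w2 | anything)
p1 = c12 / c1                        # H2: P(companies | new)
p2 = (c2 - c12) / (N - c1)           # H2: P(companies | not new)

log_lambda = (log_l(c12, c1, p) + log_l(c2 - c12, N - c1, p)
              - log_l(c12, c1, p1) - log_l(c2 - c12, N - c1, p2))
llr = -2 * log_lambda
print(round(llr, 2))  # small value, below the 3.841 critical value: not significant
```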
Mutual Information (7-1)
Pointwise mutual information (Church et al. 1991; Church and Hanks 1989): the information you gain about x' when knowing y':
I(x', y') = \log_2 \frac{P(x' y')}{P(x') P(y')}
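Applied to (new, companies) with the counts from the t-test slide, the definition gives a score barely above zero:

```python
import math

N = 14_307_668
c_new, c_companies, c_bigram = 15_828, 4_675, 8

# PMI = log2( P(bigram) / (P(new) * P(companies)) )
pmi = math.log2((c_bigram / N) / ((c_new / N) * (c_companies / N)))
print(round(pmi, 2))  # ~0.63 bits: close to 0, so little evidence of association
```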
Mutual Information (7-2)
The amount of information about the occurrence of "Ayatollah" at position i in the corpus increases by 18.38 bits if we are told that "Ruhollah" occurs at position i+1.
Mutual Information (7-3)
English: house of commons / French: chambre de communes
Problem 1: mutual information measures information gain, which is not the same as direct dependence.
Mutual Information (7-4)
\chi^2 considers all the cells of the contingency table, not just the joint count of (house, communes).
MI considers only the joint count of (house, communes).
Mutual Information (7-5)
Problem 2: data sparseness.
Mutual Information (7-6)
For perfect dependence: I(x, y) = \log_2 \frac{P(x, y)}{P(x) P(y)} = \log_2 \frac{1}{P(y)}, which grows as the words become rarer.
For perfect independence: I(x, y) = \log_2 1 = 0.
MI is not a good measure of dependence, since the score depends on the frequency of the individual words.
Mutual Information (7-7)
- Pointwise MI: MI(new, companies). The uncertainty reduced in predicting "companies" when knowing that the previous word is "new". With a small sample it is not a good measure if counts are low; MI ≈ 0 is a good indication of independence.
- Mutual information: MI(w_{i-1}, w_i). How much information (entropy) is gained by moving from a unigram model P(w) to a bigram model P(w_i | w_{i-1}); estimated using a large sample.
Applications
- Computational lexicography
- Information retrieval: accuracy of retrieval can be improved if the similarity between a user query and a document is determined based on common collocations instead of words (Fagan 1989).
- Natural language generation (Smadja 1993)
- Cross-language information retrieval (Hull and Grefenstette 1998)
Collocations and Word Sense Disambiguation
Association or co-occurrence: doctor and nurse; plane and airport.
Both are important for word sense disambiguation:
- Collocation, i.e. local context (one sense per collocation): "Drop me a line" (letter) vs. ".. on the line .." (phone line)
- Occurrence, i.e. topical or global context: subject-based disambiguation
References
- Choueka, Yaacov. 1988. Looking for needles in a haystack or locating interesting collocational expressions in large textual databases. In Proceedings of the RIAO, pp. 43–38.
- Justeson, John S., and Slava M. Katz. 1995. Technical terminology: some linguistic properties and an algorithm for identification in text. Natural Language Engineering 1:9–27.
- Church, Kenneth Ward, and Patrick Hanks. 1989. Word association norms, mutual information and lexicography. In ACL 27, pp. 76–83.
- Church, Kenneth, William Gale, Patrick Hanks, and Donald Hindle. 1991. Using statistics in lexical analysis. In Uri Zernik (ed.), Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon, pp. 115–164. Hillsdale, NJ: Lawrence Erlbaum.
- Kilgarriff, Adam, and Tony Rose. 1998. Metrics for corpus similarity and homogeneity. Manuscript, ITRI, University of Brighton.
- Dunning, Ted. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19:61–74.
- Mood, Alexander M., Franklin A. Graybill, and Duane C. Boes. 1974. Introduction to the Theory of Statistics. 3rd edition. New York: McGraw-Hill.
- Fagan, Joel L. 1989. The effectiveness of a nonsyntactic approach to automatic phrase indexing for document retrieval. Journal of the American Society for Information Science 40:115–132.
- Smadja, Frank. 1993. Retrieving collocations from text: Xtract. Computational Linguistics 19:143–177.
- Hull, David A., and Gregory Grefenstette. 1998. Querying across languages: a dictionary-based approach to multilingual information retrieval. In Karen Sparck Jones and Peter Willett (eds.), Readings in Information Retrieval. San Francisco: Morgan Kaufmann.
Thanks!