COLLOCATIONS

He Zhongjun 2007-04-13

Outline

Introduction
Approaches to find collocations
  Frequency
  Mean and Variance
  Hypothesis test
  Mutual information
Applications

What are collocations?

A collocation is an expression consisting of two or more words that correspond to some conventional way of saying things. (-- the book)

A collocation is a sequence of two or more consecutive words, that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components. (-- Choueka, 1988)

Examples

noun phrases
  strong tea vs. powerful tea
verbs
  make a decision vs. take a decision
  knock … door vs. hit … door
  make up
Idioms
  kick the bucket (to die)
Subtle, unexplainable, native speaker usage
  broad daylight vs. bright daylight
  昨天 (yesterday), 去年 (last year), 上个月 (last month)

Introduction – Characteristics / Criteria

Non-compositionality
  e.g. kick the bucket; white wine, white hair, white woman
Non-substitutability
  e.g. white wine -> yellow wine?
Non-modifiability
  e.g. as poor as a church mouse / mice?
Cannot be translated word by word

Frequency (2-1)

Counting
  e.g. the count of bigrams in a corpus
Not effective: most of the pairs are function words!
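The counting step can be sketched in a few lines of Python. This is only an illustration: the toy sentence and the helper name `count_bigrams` are invented here and are not part of the slides.

```python
from collections import Counter

def count_bigrams(tokens):
    """Count adjacent word pairs (bigrams) in a list of tokens."""
    return Counter(zip(tokens, tokens[1:]))

# Toy corpus, invented for illustration only.
tokens = "the cat sat on the mat and the cat slept on the mat".split()
bigram_counts = count_bigrams(tokens)

# Print the most frequent pairs first.
for pair, count in bigram_counts.most_common(3):
    print(pair, count)
```

Even on this tiny input the top pairs contain function words such as "the" and "on", which is the weakness the slide points out.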

Frequency (2-2)

Filter by part of speech (Justeson and Katz 1995)
Or use a stop list of function words
simple quantitative technique + simple linguistic knowledge
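A minimal sketch of the stop-list variant follows (not the part-of-speech filter of Justeson and Katz, which needs a tagger). The stop list, the example sentence and the function name `content_bigrams` are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop list; a real one would be much longer.
STOP_WORDS = {"the", "a", "of", "in", "on", "at", "and", "to", "is"}

def content_bigrams(tokens, stop_words=STOP_WORDS):
    """Count bigrams in which neither word is on the stop list."""
    counts = Counter()
    for w1, w2 in zip(tokens, tokens[1:]):
        if w1 not in stop_words and w2 not in stop_words:
            counts[(w1, w2)] += 1
    return counts

# Invented example sentence.
tokens = "she drank strong tea in the morning and strong tea at night".split()
filtered = content_bigrams(tokens)
print(filtered.most_common())
```

Dropping every pair that touches a function word is cruder than a tag-pattern filter, but it needs no linguistic resources beyond the word list.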

Mean and Variance (4-1)

Fixed bigrams -> bigrams at a distance
  she knocked on his door
  They knocked at the door
  100 women knocked on Donaldson's door
  A man knocked on the metal front door
Mean offset
  (3+3+5+5)/4 = 4.0
Deviation
  $s = \sqrt{\frac{(3-4)^2 + (3-4)^2 + (5-4)^2 + (5-4)^2}{3}} \approx 1.15$

Mean and Variance(4-2)

Mean

$\bar{d} = \frac{\sum_{i=1}^{n} d_i}{n}$

Variance

$s^2 = \frac{\sum_{i=1}^{n} (d_i - \bar{d})^2}{n - 1}$

Low variance means two words usually occur at about the same distance
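As a quick check of the mean and variance definitions, here is a small sketch applied to the four knocked ... door offsets (3, 3, 5, 5). The function name `offset_stats` is an assumption for this example.

```python
import math

def offset_stats(offsets):
    """Return (mean, sample variance, sample deviation) of word offsets."""
    n = len(offsets)
    mean = sum(offsets) / n
    # Sample variance with the n - 1 denominator.
    variance = sum((d - mean) ** 2 for d in offsets) / (n - 1)
    return mean, variance, math.sqrt(variance)

# Offsets of "door" relative to "knocked" in the four example sentences.
offsets = [3, 3, 5, 5]
mean, variance, deviation = offset_stats(offsets)
print(f"mean = {mean:.2f}, variance = {variance:.2f}, deviation = {deviation:.2f}")
```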

Mean and Variance (4-3)

The mean of -1.15 indicates that strong usually occurs to the left of the other word.
  e.g. strong business support
strong and for do not form a collocation

Mean and Variance(4-4)

If the mean is close to 1.0 and the deviation is low, the method finds the same collocations as the frequency-based method. It can also find looser phrases whose words occur at a distance.

Hypothesis Testing

What if high frequency and low variance are accidental?
  e.g. new companies: new and companies are both frequent words, but the pair is not a collocation.
Hypothesis testing: assessing whether or not something is a chance event
  Null hypothesis H0: there is no association between the words beyond chance occurrences
  Compute the probability p that the event would occur if H0 were true
  If p is below the significance level (e.g. 0.05 or 0.005), reject H0; otherwise, accept H0

t-test (5-1)

t statistic:

$t = \frac{\bar{x} - \mu}{\sqrt{s^2 / N}}$

where $\bar{x}$ is the sample mean, $\mu$ the distribution mean, $s^2$ the sample variance and $N$ the sample size.

Think of the corpus as a long sequence of N bigrams: a sample takes the value 1 if the bigram of interest occurs and 0 otherwise (a Bernoulli trial).

t-test (5-2)

N(new) = 15828, N(companies) = 4675, N(tokens) = 14307668
N(new companies) = 8

P(new) = 15828/14307668, P(companies) = 4675/14307668
P(new companies) = 8/14307668 = $5.591 \times 10^{-7}$

H0: P(new companies) = P(new) P(companies) = $3.615 \times 10^{-7}$

mean (assuming Bernoulli trials): $\mu = 3.615 \times 10^{-7}$
variance: $s^2 = p(1 - p) \approx p$

t = 0.999932 < 2.576 ($\alpha = 0.005$): accept H0
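The arithmetic on this slide can be reproduced with a short sketch. The function name `t_score` and its argument names are assumptions; the counts are the ones given for new companies.

```python
import math

def t_score(c1, c2, c12, n):
    """t statistic for a bigram, treating each position as a Bernoulli trial.

    c1, c2: counts of the two words; c12: count of the bigram; n: corpus size.
    """
    x_bar = c12 / n                # observed bigram probability (sample mean)
    mu = (c1 / n) * (c2 / n)       # expected probability under independence (H0)
    s2 = x_bar * (1 - x_bar)       # Bernoulli variance, approximately x_bar
    return (x_bar - mu) / math.sqrt(s2 / n)

t = t_score(c1=15828, c2=4675, c12=8, n=14307668)
print(f"t = {t:.4f}")  # compare against the critical value 2.576 (alpha = 0.005)
```

Because t stays below 2.576, the sketch reaches the same decision as the slide: H0 is not rejected for new companies.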

t-test (5-3)

Rank the bigrams with the same frequency, which a frequency-based method cannot do.

t-test (5-4)

Using the t-test to find words whose co-occurrence patterns best distinguish between two words
  e.g. lexicography (Church et al., 1989)

t-test (5-5)

Pearson’s chi-square test (4-1)

The t-test assumes probabilities are approximately normally distributed
The $\chi^2$ test does not assume normality
Compare the observed frequencies with the frequencies expected for independence. If the difference is large, reject H0

Pearson’s chi-square test (4-2)

$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} = 1.55 < 3.841 \quad (\alpha = 0.05)$

Accept H0: new and companies occur independently!
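A minimal sketch of the same computation builds the 2x2 table from the word and bigram counts quoted earlier for new companies. The function name `chi_square_2x2` and the derivation of the four cells from the counts are assumptions of this example, not taken from the slides.

```python
def chi_square_2x2(c1, c2, c12, n):
    """Pearson chi-square for a bigram, from counts c1, c2, c12 and corpus size n."""
    # Rows: first word is w1 / is not w1. Columns: second word is w2 / is not w2.
    observed = [
        [c12, c1 - c12],
        [c2 - c12, n - c1 - c2 + c12],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n  # E_ij under independence
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2

chi2 = chi_square_2x2(c1=15828, c2=4675, c12=8, n=14307668)
print(f"chi-square = {chi2:.2f}")  # compare against 3.841 (alpha = 0.05, 1 d.o.f.)
```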

Pearson’s chi-square test (4-3)

Identification of translation pairs in aligned corpora (Church et al., 1991)

59 is the number of sentence pairs which have cow in English and vache in French.

$\chi^2 = 456400$

Reject H0: (cow, vache) is a translation pair.

Pearson’s chi-square test (4-4)

Metric for corpus similarity (Kilgarriff et al., 1998)

H0: the two corpora are drawn from the same source

Likelihood ratios (3-1)

More appropriate for sparse data
Two alternative explanations for the occurrence frequency of a bigram (Dunning 1993)

H1: P(w2|w1) = p = P(w2|¬w1) (independence)

H2: P(w2|w1) = p1 ≠ p2 = P(w2|¬w1) (dependence)

log λ = log ( L(H1) / L(H2) )

L(H) = likelihood of observing O under H

Likelihood ratios (3-2)

c1, c2 and c12 are the numbers of occurrences of w1, w2 and w1w2; the likelihoods are computed assuming a binomial distribution.
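The slide's own formulas did not survive in this transcript, so the sketch below uses the usual maximum-likelihood estimates from Dunning's formulation: p = c2/N, p1 = c12/c1, p2 = (c2 - c12)/(N - c1). Treat these estimates, and the names `log_l` and `neg2_log_lambda`, as assumptions of this example.

```python
import math

def log_l(k, n, x):
    """Binomial log-likelihood of k successes in n trials with probability x.

    The binomial coefficient is omitted because it cancels in the ratio.
    """
    result = 0.0
    if k > 0:
        result += k * math.log(x)
    if n - k > 0:
        result += (n - k) * math.log(1 - x)
    return result

def neg2_log_lambda(c1, c2, c12, n):
    """Return -2 log(lambda) for the bigram w1 w2."""
    p = c2 / n                   # H1: the same probability of w2 in both contexts
    p1 = c12 / c1                # H2: probability of w2 after w1
    p2 = (c2 - c12) / (n - c1)   # H2: probability of w2 after any other word
    log_lambda = (log_l(c12, c1, p) + log_l(c2 - c12, n - c1, p)
                  - log_l(c12, c1, p1) - log_l(c2 - c12, n - c1, p2))
    return -2 * log_lambda

score = neg2_log_lambda(c1=15828, c2=4675, c12=8, n=14307668)
print(f"-2 log lambda = {score:.2f}")
```

For new companies the score stays well below 3.841, so this test also gives no reason to reject independence.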

Likelihood ratios (3-3)

If $\lambda$ is a likelihood ratio of a particular form, then $-2 \log \lambda$ is asymptotically $\chi^2$ distributed (Mood et al., 1974)

Likelihood ratio test is more appropriate for sparse data.

Mutual Information (7-1)

Information you gain about x' when knowing y'

Pointwise mutual information (Church et al. 1991; Church and Hanks 1989)

$I(x', y') = \log_2 \frac{P(x'y')}{P(x')P(y')}$

Mutual Information (7-2)

The amount of information about the occurrence of Ayatollah at position i in the corpus increases by 18.38 bits if we are told that Ruhollah occurs at position i+1.
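Pointwise mutual information is easy to compute from counts. In the sketch below, the corpus size is the one used in the t-test example. The word counts for Ayatollah (42), Ruhollah (20) and the bigram (20) are not given in this transcript: they are assumed values chosen because they reproduce the 18.38-bit figure. The function name `pmi` is also an assumption.

```python
import math

def pmi(c1, c2, c12, n):
    """Pointwise mutual information (in bits) of a bigram, from raw counts."""
    p_xy = c12 / n
    p_x = c1 / n
    p_y = c2 / n
    return math.log2(p_xy / (p_x * p_y))

N_TOKENS = 14307668

# Assumed counts (not in the transcript) that reproduce the 18.38 bits above.
ayatollah_ruhollah = pmi(c1=42, c2=20, c12=20, n=N_TOKENS)

# Counts quoted earlier for "new companies".
new_companies = pmi(c1=15828, c2=4675, c12=8, n=N_TOKENS)

print(f"Ayatollah Ruhollah: {ayatollah_ruhollah:.2f} bits")
print(f"new companies:      {new_companies:.2f} bits")
```

The low score for new companies agrees with the hypothesis tests, which do not reject independence for that pair.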

Mutual Information (7-3)

English: house of commons
French: chambre de communes

Problem 1: information gain ≠ direct dependence

Mutual Information (7-4)

$\chi^2$ considers more than (house, communes)

MI considers only (house, communes)

Mutual Information (7-5)

Problem 2: data sparseness

Mutual Information (7-6)

For perfect dependence, the score grows as the words get rarer.

For perfect independence, the score is 0.

MI is not a good measure of dependence, since the score depends on the frequency of the individual words.
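The two limiting cases can be worked out from the standard pointwise MI definition, writing x and y for the two words. This is a reconstruction under that definition, since the slide's own formulas are missing from the transcript.

```latex
% Perfect dependence: x and y only ever occur together, so P(xy) = P(x).
\[
I(x, y) = \log_2 \frac{P(xy)}{P(x)\,P(y)}
        = \log_2 \frac{P(x)}{P(x)\,P(y)}
        = \log_2 \frac{1}{P(y)}
\]

% Perfect independence: P(xy) = P(x) P(y).
\[
I(x, y) = \log_2 \frac{P(x)\,P(y)}{P(x)\,P(y)}
        = \log_2 1
        = 0
\]
```

In the dependent case the score is larger for rarer words, which is why two perfectly dependent pairs can receive very different scores.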

Mutual Information (7-7)

Pointwise MI: MI(new, companies)
  Uncertainty reduced in predicting “companies” when knowing the previous word is “new”
  Small sample; not a good measure if the count is low
  MI close to 0 is a good indication of independence

Mutual information: MI(wi-1, wi)
  How much information (entropy) is gained
  • Unigram model P(w) - bigram model P(wi | wi-1)
  Estimated using a large sample

Applications

Computational lexicography
Information Retrieval
  Accuracy of retrieval can be improved if the similarity between a user query and a document is determined based on common collocations instead of words. (Fagan 1989)
Natural Language Generation (Smadja 1993)
Cross Language information retrieval (Hull and Grefenstette 1998)

Collocations and Word Sense Disambiguation

Association or co-occurrence
  doctor and nurse
  plane and airport

Both are important for word sense disambiguation

Collocation - local context (One sense per collocation)
  • Drop me a line (letter)
  • .. on the line .. (phone line)

Co-occurrence - topical context or global context
  • Subject based disambiguation

References

Choueka, Yaacov. 1988. Looking for needles in a haystack or locating interesting collocational expressions in large textual databases. In Proceedings of the RIAO, pp. 43–38.
Justeson, John S., and Slava M. Katz. 1995. Technical terminology: some linguistic properties and an algorithm for identification in text. Natural Language Engineering 1:9–27.
Church, Kenneth Ward, and Patrick Hanks. 1989. Word association norms, mutual information and lexicography. In ACL 27, pp. 76–83.
Church, Kenneth, William Gale, Patrick Hanks, and Donald Hindle. 1991. Using statistics in lexical analysis. In Uri Zernik (ed.), Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon, pp. 115–164. Hillsdale, NJ: Lawrence Erlbaum.
Kilgarriff, Adam, and Tony Rose. 1998. Metrics for corpus similarity and homogeneity. Manuscript, ITRI, University of Brighton.
Dunning, Ted. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19:61–74.
Mood, Alexander M., Franklin A. Graybill, and Duane C. Boes. 1974. Introduction to the theory of statistics. New York: McGraw-Hill. 3rd edition.
Fagan, Joel L. 1989. The effectiveness of a nonsyntactic approach to automatic phrase indexing for document retrieval. Journal of the American Society for Information Science 40:115–132.
Smadja, Frank. 1993. Retrieving collocations from text: Xtract. Computational Linguistics 19:143–177.
Hull, David A., and Gregory Grefenstette. 1998. Querying across languages: A dictionary-based approach to multilingual information retrieval. In Karen Sparck Jones and Peter Willett (eds.), Readings in Information Retrieval. San Francisco: Morgan Kaufmann.

Thanks!