COLLOCATIONS
He Zhongjun 2007-04-13
Outline
- Introduction
- Approaches to find collocations
  - Frequency
  - Mean and Variance
  - Hypothesis test
  - Mutual information
- Applications
What are collocations?
A collocation is an expression consisting of two or more words that correspond to some conventional way of saying things. (-- the book)
A collocation is a sequence of two or more consecutive words that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components. (-- Choueka, 1988)
Examples
- Noun phrases: strong tea vs. powerful tea
- Verbs: make a decision vs. take a decision; knock … door vs. hit … door; make up
- Idioms: kick the bucket (to die)
- Subtle, unexplainable native-speaker usage: broad daylight vs. bright daylight; 昨天 (yesterday), 去年 (last year), 上个月 (last month)
- …
Introduction – Characteristics / Criteria
- Non-compositionality: e.g. kick the bucket; white wine, white hair, white woman
- Non-substitutability: e.g. white wine -> yellow wine?
- Non-modifiability: e.g. as poor as a church mouse -> as poor as church mice?
- Cannot be translated word for word
Frequency (2-1)
Counting: e.g. count the bigrams in a corpus and select the most frequent ones.
Not effective on its own: most of the frequent pairs are function words!
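The counting step can be sketched in a few lines (a minimal illustration; the toy token list stands in for a real tokenized corpus):

```python
from collections import Counter

def count_bigrams(tokens):
    """Count adjacent word pairs in a token sequence."""
    return Counter(zip(tokens, tokens[1:]))

# Toy corpus standing in for a real one
tokens = "she knocked on his door and he knocked on the door".split()
bigrams = count_bigrams(tokens)
print(bigrams.most_common(1))  # [(('knocked', 'on'), 2)]
```

On real text the top of this list is dominated by pairs like "of the" and "in the", which is exactly the problem the filtering step below addresses.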
Frequency (2-2)
Filter the frequent bigrams by part-of-speech patterns (Justeson and Katz 1995), or use a stop list of function words.
A simple quantitative technique plus simple linguistic knowledge.
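A minimal sketch of the pattern filter, in the spirit of Justeson and Katz (1995); the tag set and the tagged input here are simplified placeholders, not their actual tagger:

```python
# Keep only bigrams whose tag pattern is plausible for a phrase.
# Simplified tags: A = adjective, N = noun, D = determiner, P = preposition.
PATTERNS = {("A", "N"), ("N", "N")}

def filter_bigrams(tagged):
    """tagged: list of (word, tag) pairs. Return bigrams matching a pattern."""
    return [
        (w1, w2)
        for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
        if (t1, t2) in PATTERNS
    ]

tagged = [("the", "D"), ("New", "A"), ("York", "N"),
          ("Stock", "N"), ("Exchange", "N"), ("of", "P")]
print(filter_bigrams(tagged))
# -> [('New', 'York'), ('York', 'Stock'), ('Stock', 'Exchange')]
```

Function-word pairs such as ("the", "New") never match, so they drop out of the candidate list regardless of frequency.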
Mean and Variance (4-1)
Fixed bigrams -> bigrams at a distance:
- she knocked on his door
- they knocked at the door
- 100 women knocked on Donaldson's door
- a man knocked on the metal front door
Mean offset: (3+3+5+5)/4 = 4.0
Deviation: how much the individual offsets vary around this mean.
Mean and Variance (4-2)
Mean: \bar{d} = \frac{1}{n} \sum_{i=1}^{n} d_i
Variance: s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (d_i - \bar{d})^2
Low variance means two words usually occur at about the same distance
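The knocked/door offsets above give a quick check of these formulas (Python's statistics module uses the same n-1 denominator for the sample variance):

```python
import statistics

# Offsets of "door" relative to "knocked" in the four example sentences
offsets = [3, 3, 5, 5]

mean = statistics.mean(offsets)   # sample mean
sd = statistics.stdev(offsets)    # sample standard deviation, n-1 denominator

print(mean, round(sd, 2))  # -> 4 1.15
```

A tight cluster of offsets (low sd) suggests a genuine positional association; a flat spread suggests the words merely co-occur in the same sentences.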
Mean and Variance (4-3)
The mean of -1.15 indicates that "strong" usually occurs on the left side, e.g. strong business support.
"strong" and "for" do not form a collocation: their offsets are spread out, so the variance is high.
Mean and Variance (4-4)
If the mean is close to 1.0 and the deviation is low, the method finds the same collocations as the frequency-based method. It can also find looser phrases whose words occur at variable distances.
Hypothesis Testing
What if high frequency and low variance are accidental? E.g. "new companies": "new" and "companies" are both frequently occurring words, yet the pair is not a collocation.
Hypothesis testing: assessing whether or not something is a chance event.
- Null hypothesis H0: there is no association between the words beyond chance occurrences.
- Compute the probability p that the event would occur if H0 were true.
- If p is below the significance level, reject H0; otherwise, retain H0.
t-test (5-1)
t statistic:
t = \frac{\bar{x} - \mu}{\sqrt{s^2 / N}}
where \bar{x} is the sample mean, \mu is the distribution mean, s^2 is the sample variance, and N is the sample size.
Think of the corpus as a long sequence of N bigrams: if the bigram of interest occurs, the value is 1; otherwise, the value is 0 (a binomial distribution).
t-test (5-2)
N(new) = 15828, N(companies) = 4675, N(tokens) = 14307668, N(new companies) = 8
P(new) = 15828/14307668, P(companies) = 4675/14307668
P(new companies) = 8/14307668 = 5.591 * 10^-7
H0: P(new companies) = P(new) P(companies) = 3.615 * 10^-7
Mean: \mu = 3.615 * 10^-7
Variance (assuming a Bernoulli trial): s^2 = p(1-p) \approx p
t \approx 0.999932 < 2.576 (the critical value at \alpha = 0.005), so we retain H0.
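Plugging the slide's counts into the t statistic reproduces the result:

```python
import math

N = 14_307_668                                # tokens in the corpus
c_new, c_companies, c_bigram = 15_828, 4_675, 8

x_bar = c_bigram / N                          # sample mean of the 0/1 indicator
mu = (c_new / N) * (c_companies / N)          # mean under H0 (independence)
s2 = x_bar * (1 - x_bar)                      # Bernoulli variance, approximately x_bar

t = (x_bar - mu) / math.sqrt(s2 / N)
print(round(t, 4))  # ~0.9999, well below the 2.576 critical value: retain H0
```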
t-test (5-3)
The t-test can rank bigrams that have the same frequency, which a frequency-based method cannot do.
t-test (5-4)
Using the t-test to find words whose co-occurrence patterns best distinguish between two words, e.g. in lexicography (Church et al. 1989).
t-test (5-5)
Pearson’s chi-square test (4-1)
The t-test assumes that probabilities are approximately normally distributed.
The \chi^2 test does not assume normality. It compares the observed frequencies with the frequencies expected under independence; if the difference is large, reject H0.
Pearson’s chi-square test (4-2)
\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} = 1.55 < 3.841 (the critical value at \alpha = 0.05)
Accept H0: "new" and "companies" occur independently!
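For a 2x2 table the statistic has a closed form, which makes the computation for (new, companies) a one-liner; the cell counts are derived from the totals on the t-test slide:

```python
# 2x2 contingency table for (new, companies)
O11 = 8                                  # new companies
O12 = 15_828 - 8                         # new, followed by a word other than companies
O21 = 4_675 - 8                          # companies preceded by a word other than new
O22 = 14_307_668 - O11 - O12 - O21       # neither word
N = O11 + O12 + O21 + O22

# Closed form of Pearson's chi-square for a 2x2 table
chi2 = N * (O11 * O22 - O12 * O21) ** 2 / (
    (O11 + O12) * (O11 + O21) * (O12 + O22) * (O21 + O22)
)
print(round(chi2, 2))  # ~1.55 < 3.841: accept H0
```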
Pearson’s chi-square test (4-3)
Identification of translation pairs in aligned corpora (Church et al. 1991).
59 is the number of sentence pairs which have "cow" in English and "vache" in French.
\chi^2 \approx 456400: reject H0, (cow, vache) is a translation pair.
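The same closed form applied to the aligned-corpus table reproduces this number; the three cell counts beyond the 59 mentioned above are taken from the textbook's version of this example and are assumed here:

```python
# Sentence-pair contingency table for (cow, vache); counts other than 59
# are assumed from the textbook's rendition of the Church et al. example
O11, O12, O21, O22 = 59, 6, 8, 570_934
N = O11 + O12 + O21 + O22

chi2 = N * (O11 * O22 - O12 * O21) ** 2 / (
    (O11 + O12) * (O11 + O21) * (O12 + O22) * (O21 + O22)
)
print(round(chi2))  # roughly 456400: far above any critical value, reject H0
```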
Pearson’s chi-square test (4-4)
Metric for corpus similarity (Kilgarriff et al. 1998).
H0: the two corpora are drawn from the same source.
Likelihood ratios (3-1)
More appropriate for sparse data. Two alternative explanations for the occurrence frequency of a bigram w1 w2 (Dunning 1993):
- H1 (independence): P(w2|w1) = p = P(w2|¬w1)
- H2 (dependence): P(w2|w1) = p1 ≠ p2 = P(w2|¬w1)
log \lambda = log ( L(H1) / L(H2) ), where L(H) is the likelihood of observing the counts under H.
Likelihood ratios (3-2)
c1, c2, c12 are the numbers of occurrences of w1, w2, and w1 w2. Assuming a binomial distribution b(k; n, x), with p = c2/N, p1 = c12/c1, p2 = (c2 - c12)/(N - c1):
L(H1) = b(c12; c1, p) \cdot b(c2 - c12; N - c1, p)
L(H2) = b(c12; c1, p1) \cdot b(c2 - c12; N - c1, p2)
Likelihood ratios (3-3)
-2 log \lambda
If \lambda is a likelihood ratio of a particular form, then -2 log \lambda is asymptotically \chi^2-distributed (Mood et al. 1974).
The likelihood ratio test is more appropriate for sparse data.
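The -2 log λ statistic can be computed directly from the binomial log-likelihoods (the binomial coefficients cancel in the ratio, so they are omitted); here it is evaluated for (new, companies) with the counts from the t-test slide:

```python
import math

def log_l(k, n, x):
    """Log-likelihood of k successes in n Bernoulli trials with parameter x."""
    return k * math.log(x) + (n - k) * math.log(1 - x)

N = 14_307_668
c1, c2, c12 = 15_828, 4_675, 8      # counts of new, companies, new companies

p = c2 / N                           # H1: P(w2 | anything)
p1 = c12 / c1                        # H2: P(companies | new)
p2 = (c2 - c12) / (N - c1)           # H2: P(companies | not new)

log_lambda = (log_l(c12, c1, p) + log_l(c2 - c12, N - c1, p)
              - log_l(c12, c1, p1) - log_l(c2 - c12, N - c1, p2))
llr = -2 * log_lambda
print(round(llr, 2))  # small value, below the 3.841 critical value: not significant
```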
Mutual Information (7-1)
Pointwise mutual information (Church et al. 1991; Church and Hanks 1989): the information you gain about x' when knowing y':
I(x', y') = \log_2 \frac{P(x' y')}{P(x') P(y')}
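Applied to (new, companies) with the counts from the t-test slide, the definition gives a score barely above zero:

```python
import math

N = 14_307_668
c_new, c_companies, c_bigram = 15_828, 4_675, 8

# PMI = log2( P(bigram) / (P(new) * P(companies)) )
pmi = math.log2((c_bigram / N) / ((c_new / N) * (c_companies / N)))
print(round(pmi, 2))  # ~0.63 bits: close to 0, so little evidence of association
```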
Mutual Information (7-2)
The amount of information about the occurrence of "Ayatollah" at position i in the corpus increases by 18.38 bits if we are told that "Ruhollah" occurs at position i+1.
Mutual Information (7-3)
English: house of commons / French: chambre de communes
Problem 1: mutual information measures information gain, which is not the same as direct dependence.
Mutual Information (7-4)
\chi^2 considers all the cells of the contingency table, not just the joint count of (house, communes).
MI considers only the joint count of (house, communes).
Mutual Information (7-5)
Problem 2: data sparseness.
Mutual Information (7-6)
For perfect dependence: I(x, y) = \log_2 \frac{P(x, y)}{P(x) P(y)} = \log_2 \frac{1}{P(y)}, which grows as the words become rarer.
For perfect independence: I(x, y) = \log_2 1 = 0.
MI is not a good measure of dependence, since the score depends on the frequency of the individual words.
Mutual Information (7-7)
- Pointwise MI: MI(new, companies). The uncertainty reduced in predicting "companies" when knowing that the previous word is "new". With a small sample it is not a good measure if counts are low; MI ≈ 0 is a good indication of independence.
- Mutual information: MI(w_{i-1}, w_i). How much information (entropy) is gained by moving from a unigram model P(w) to a bigram model P(w_i | w_{i-1}); estimated using a large sample.
Applications
- Computational lexicography
- Information retrieval: accuracy of retrieval can be improved if the similarity between a user query and a document is determined based on common collocations instead of words (Fagan 1989).
- Natural language generation (Smadja 1993)
- Cross-language information retrieval (Hull and Grefenstette 1998)
Collocations and Word Sense Disambiguation
Association or co-occurrence: doctor and nurse; plane and airport.
Both are important for word sense disambiguation:
- Collocation, i.e. local context (one sense per collocation): "Drop me a line" (letter) vs. ".. on the line .." (phone line)
- Occurrence, i.e. topical or global context: subject-based disambiguation
References
- Choueka, Yaacov. 1988. Looking for needles in a haystack or locating interesting collocational expressions in large textual databases. In Proceedings of the RIAO, pp. 43–38.
- Justeson, John S., and Slava M. Katz. 1995. Technical terminology: some linguistic properties and an algorithm for identification in text. Natural Language Engineering 1:9–27.
- Church, Kenneth Ward, and Patrick Hanks. 1989. Word association norms, mutual information and lexicography. In ACL 27, pp. 76–83.
- Church, Kenneth, William Gale, Patrick Hanks, and Donald Hindle. 1991. Using statistics in lexical analysis. In Uri Zernik (ed.), Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon, pp. 115–164. Hillsdale, NJ: Lawrence Erlbaum.
- Kilgarriff, Adam, and Tony Rose. 1998. Metrics for corpus similarity and homogeneity. Manuscript, ITRI, University of Brighton.
- Dunning, Ted. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19:61–74.
- Mood, Alexander M., Franklin A. Graybill, and Duane C. Boes. 1974. Introduction to the Theory of Statistics. 3rd edition. New York: McGraw-Hill.
- Fagan, Joel L. 1989. The effectiveness of a nonsyntactic approach to automatic phrase indexing for document retrieval. Journal of the American Society for Information Science 40:115–132.
- Smadja, Frank. 1993. Retrieving collocations from text: Xtract. Computational Linguistics 19:143–177.
- Hull, David A., and Gregory Grefenstette. 1998. Querying across languages: a dictionary-based approach to multilingual information retrieval. In Karen Sparck Jones and Peter Willett (eds.), Readings in Information Retrieval. San Francisco: Morgan Kaufmann.
Thanks!