
Page 1:

Natural Language Processing (3a)

Zhao Hai 赵海

Department of Computer Science and Engineering

Shanghai Jiao Tong University

2010-2011 

zhaohai@cs.sjtu.edu.cn

Page 2:

Outline

Lexicons and Lexical Analysis

Finite State Models and Morphological Analysis

Collocation

Page 3:

Lexicons and Lexical Analysis (202)

Finite State Models and Morphological Analysis (1)

Morphemes

Morphemes are the smallest meaningful units of language and are typically word stems or affixes. For example, the word “books” can be divided into two morphemes: ‘book’ and ‘s’, where ‘s’ functions as a plural suffix.

Page 4:

Lexicons and Lexical Analysis (203)

Finite State Models and Morphological Analysis (2)

Morphology (1)

Morphology is generally divided into two types:

1. Inflectional morphology covers the variant forms of nouns, adjectives and verbs owing to changes in:

Person (first, second, third); Number (singular, plural); Tense (present, future, past); Gender (male, female, neuter).

Page 5:

Lexicons and Lexical Analysis (204)

Finite State Models and Morphological Analysis (3)

Morphology (2)

2. Derivational morphology is the formation of a new word by addition of an affix, but it also includes cases of derivation without an affix:

disenchant (V) + -ment → disenchantment (N);
reduce (V) + -tion → reduction (N);
record (V) → record (N);
progress (N) → progress (V).

Page 6:

Lexicons and Lexical Analysis (205)

Finite State Models and Morphological Analysis (4)

Morphology (3)

Most morphological analysis programs deal only with inflectional morphology and assume that derivational variants are listed separately in the lexicon. One exception is the Alvey Natural Language Toolkit morphological analyzer (Russell G., Pulman S., Ritchie G., and Black A. 1986. A dictionary and morphological analyser for English. Proceedings of the 11th COLING Conference, pp. 277-279).

Page 7:

Lexicons and Lexical Analysis (206)

Finite State Models and Morphological Analysis (5)

Morphological Analyzer

A morphological analyzer must be able to undo the spelling rules for adding affixes. For example, the analyzer must be able to interpret “moved” as ‘move’ plus ‘ed’. For English, a few rules cover the generation of plurals and other inflections such as verb endings. The main problem is where a rule has exceptions, which have to be listed explicitly, or where it is not clear which rule applies, if any.
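As a concrete illustration, here is a minimal Python sketch of this kind of rule undoing. The rule list and the tiny lexicon are invented for the example; a real analyzer would need a full lexicon and a complete rule set with explicit exceptions.

    # Minimal sketch: undo English spelling rules for inflectional suffixes.
    # The rules and lexicon below are toy examples for illustration only.
    LEXICON = {"move", "ban", "fume", "breach", "take", "fly", "church"}

    # (suffix to strip, stem ending to restore, morpheme tag)
    RULES = [
        ("ies",  "y", "+s"),   # flies    -> fly + s
        ("es",   "",  "+s"),   # churches -> church + es
        ("nned", "n", "+ed"),  # banned   -> ban + ed  (undo consonant doubling)
        ("ed",   "e", "+ed"),  # moved    -> move + ed (restore final e)
        ("ed",   "",  "+ed"),  # breached -> breach + ed
        ("n",    "",  "+n"),   # taken    -> take + n
    ]

    def analyze(word):
        """Return (stem, suffix-tag) analyses licensed by rules and lexicon."""
        analyses = []
        if word in LEXICON:                       # the word may itself be a stem
            analyses.append((word, ""))
        for suffix, ending, tag in RULES:
            if word.endswith(suffix):
                stem = word[: len(word) - len(suffix)] + ending
                if stem in LEXICON:               # accept only known stems
                    analyses.append((stem, tag))
        return analyses

    print(analyze("moved"))   # [('move', '+ed')]
    print(analyze("banned"))  # [('ban', '+ed')]
    print(analyze("flies"))   # [('fly', '+s')]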

Page 8:

Lexicons and Lexical Analysis (207)

Finite State Models and Morphological Analysis (6)

Analysis of Plurals

The following word-stems obey regular rules for the generation of plurals:

CHURCHES → CHURCH + ES; SPOUSES → SPOUSE + S;
FLIES → FLY + IES; PIES → PIE + S.

The remaining word-stems are irregular:

MICE → MOUSE; FISH → FISH; ROOVES → ROOF + VES;
BOOK ENDS → BOOK END + S;
LIEUTENANTS GENERAL → LIEUTENANT (+S) GENERAL.

Page 9:

Lexicons and Lexical Analysis (208)

Finite State Models and Morphological Analysis (7)

Analysis of Inflectional Variants

The following word-stems obey regular rules:

LODGING → LODGE + ING; BANNED → BAN + NED;
FUMED → FUME + D; BREACHED → BREACH + ED;
TAKEN → TAKE + N.

The following word-stems are irregular:

TAUGHT → TEACH; FOUGHT → FIGHT; TOOK → TAKE.

Page 10:

Lexicons and Lexical Analysis (209)

Finite State Models and Morphological Analysis (8)

Finite State Transducers (FSTs) (1)

Finite-state transducers (FSTs) are automata in which each transition has an output label in addition to the more familiar input label. Transducers transform (transduce) input strings into output strings. The output symbols come from a finite set, usually called the output alphabet. Since the input and output alphabets are frequently the same, no distinction is usually made between them; that is, only the input label is given.

Page 11:

Lexicons and Lexical Analysis (210)

Finite State Models and Morphological Analysis (9)

Finite State Transducers (FSTs) (2)

Definition: A finite-state transducer (FST) is a 5-tuple M = (Q, Σ, E, i, F), where Q is a finite set of states, i ∈ Q is the initial state, F ⊆ Q is a set of final states, Σ is a finite alphabet, and E ⊆ Q × (Σ ∪ {ε}) × Σ* × Q is the set of transitions (arcs). Σ* is the set of all possible words over Σ:

Σ* = {v | v = v1v2…vn for n ≥ 1 and vi ∈ Σ for all 1 ≤ i ≤ n} ∪ {ε}

Page 12:

Lexicons and Lexical Analysis (211)

Finite State Models and Morphological Analysis (10)

Finite State Transducers (FSTs) (3)

Definition: Further, we define the state transition function δ : Q × (Σ ∪ {ε}) → 2^Q (the power set of Q) as follows:

δ(p, a) = { q ∈ Q | ∃ v ∈ Σ* : (p, a, v, q) ∈ E },

and the emission function λ : Q × (Σ ∪ {ε}) × Q → 2^Σ* is defined as:

λ(p, a, q) = { v ∈ Σ* | (p, a, v, q) ∈ E }.
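The definitions of δ and λ translate almost literally into code. The following Python sketch (names and structure are our own, not from the lecture) stores E as a set of (p, a, v, q) arcs; ε-input arcs are omitted from transduce for brevity.

    # Sketch of the FST definition: E is a set of (p, a, v, q) arcs;
    # delta and emit mirror the formal definitions above.
    class FST:
        def __init__(self, states, alphabet, arcs, initial, finals):
            self.Q = set(states)    # Q: finite set of states
            self.S = set(alphabet)  # Sigma: the alphabet
            self.E = set(arcs)      # E subset of Q x (Sigma+{eps}) x Sigma* x Q
            self.i = initial        # i: initial state
            self.F = set(finals)    # F: final states

        def delta(self, p, a):
            """delta(p, a) = { q in Q | exists v in Sigma*: (p, a, v, q) in E }"""
            return {q for (p2, a2, v, q) in self.E if (p2, a2) == (p, a)}

        def emit(self, p, a, q):
            """lambda(p, a, q) = { v in Sigma* | (p, a, v, q) in E }"""
            return {v for (p2, a2, v, q2) in self.E if (p2, a2, q2) == (p, a, q)}

        def transduce(self, word):
            """All outputs for an input word (epsilon-input arcs ignored)."""
            configs = {(self.i, "")}          # (current state, output so far)
            for a in word:
                configs = {(q, out + v)
                           for (p, out) in configs
                           for (p2, a2, v, q) in self.E
                           if (p2, a2) == (p, a)}
            return {out for (q, out) in configs if q in self.F}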

Page 13:

Lexicons and Lexical Analysis (212)

Finite State Models and Morphological Analysis (11)

Finite State Transducers (FSTs) (4)

Ex.: Let M = (QM, ΣM, EM, iM, FM) be an FST, where QM = {0, 1, 2}, ΣM = {a, b, c}, EM = {(0, a, b, 1), (0, a, c, 2)}, iM = 0 and FM = {1, 2}.

M transduces a to b or a to c. Note that for visualizing transducers we use the colon to separate the input and output labels of a transduction.
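Using the sketch above, the example machine can be written down and queried directly (set printing order may vary):

    # The example FST M: transduces a -> b or a -> c.
    M = FST(states={0, 1, 2},
            alphabet={"a", "b", "c"},
            arcs={(0, "a", "b", 1), (0, "a", "c", 2)},
            initial=0,
            finals={1, 2})

    print(M.delta(0, "a"))    # {1, 2}
    print(M.emit(0, "a", 1))  # {'b'}
    print(M.transduce("a"))   # {'b', 'c'}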

Page 14:

Lexicons and Lexical Analysis (213)

Finite State Models and Morphological Analysis (12)

A Simple FST

Ex.: Morphological analysis for the word “happy” and its derived forms: happy → happy; happier → happy+er; happiest → happy+est.

[Figure omitted: FST diagram for happy, happier, happiest]

Page 15:

Lexicons and Lexical Analysis (214)

Finite State Models and Morphological Analysis (13)

Specification for the Simple FST

Arcs labeled by a single letter have that letter as both the input and the output. Nodes drawn as double circles indicate success states, that is, acceptable words. The dashed link, indicating a jump, is not formally necessary but is useful for showing the break between the processing of the root form and the processing of the suffix. No input is represented by the empty symbol ε.

Page 16:

Lexicons and Lexical Analysis (215)

Finite State Models and Morphological Analysis (14)

A Fragment of an FST

This FST accepts the following words, which all start with t: tie (state 4), ties (10), trap (7), traps (10), try (11), tries (15), to (16), torch (19), torches (15), toss (21), and tosses (15). In addition, it outputs tie, tie+s, trap, trap+s, try, try+s, to, torch, torch+s, toss, toss+s.

Page 17:

Lexicons and Lexical Analysis (216)

Finite State Models and Morphological Analysis (15)

Specification for the Fragment of an FST (1)

The entire lexicon can be encoded as an FST that encodes all the legal input words and transforms them into morphemic sequences.

The FSTs for the different suffixes need only be defined once, and all root forms that allow that suffix can point to the same node.

Page 18:

Lexicons and Lexical Analysis (217)

Finite State Models and Morphological Analysis (16)

Specification for the Fragment of an FST (2)

Words that share a common prefix (such as torch, toss, and so on) can also share the same nodes, greatly reducing the size of the network.

Note that you may pass through acceptable states along the way when processing a word.

Page 19:

Lexicons and Lexical Analysis (218)

Finite State Models and Morphological Analysis (17)

References

J. Hopcroft and J. Ullman. 1979. Introduction to Automata Theory, Languages and Computation. Addison-Wesley, Reading, Massachusetts.

M. Mohri. 1997. Finite-state transducers in language and speech processing. Computational Linguistics 23.

Page 20:

Lexicons and Lexical Analysis (219)

Collocation (1)

Definition

A collocation is an expression consisting of two or more words that correspond to some conventional way of saying things. For example,

noun phrases: strong tea; weapons of mass destruction;
phrasal verbs: to make up;
other phrases: the rich and powerful.

Page 21:

Lexicons and Lexical Analysis (220)

Collocation (2)

Compositionality

We call a natural language expression compositional if the meaning of the expression can be predicted from the meaning of its parts. Collocations are characterized by limited compositionality: there is usually an element of meaning added to the combination. For example, in strong tea, strong has acquired the meaning ‘rich in some active agent’, which is closely related to, but slightly different from, the basic sense ‘having great physical strength’.

Page 22:

Lexicons and Lexical Analysis (221)

Collocation (3)

Non-Compositionality

Idioms are the most extreme examples of non-compositionality. For instance, the idioms to kick the bucket and to hear it through the grapevine have only an indirect historical relationship to the meanings of their parts. Most collocations exhibit milder forms of non-compositionality, like the expression international best practice: it is very nearly a systematic composition of its parts, but still has an element of added meaning.

Page 23:

Lexicons and Lexical Analysis (222)

Collocation (4)

Other Terms

There is considerable overlap between the concept of collocation and notions like term, technical term, and terminological phrase. These three terms are commonly used when collocations are extracted from technical domains (in a process called terminology extraction).

Page 24:

Lexicons and Lexical Analysis (223)

Collocation (5)

Applications (1)

Collocations are important for a number of applications:

natural language generation (to make sure that the output sounds natural and mistakes like powerful tea or to take a decision are avoided);

computational lexicography (to automatically identify the important collocations to be listed in a dictionary entry);

Page 25:

Lexicons and Lexical Analysis (224)

Collocation (6)

Applications (2)

parsing (so that preference can be given to parses with natural collocations);

corpus linguistic research (for instance, the study of social phenomena like the reinforcement of cultural stereotypes through language).

Page 26:

Lexicons and Lexical Analysis (225)

Collocation (7)

Frequency (1)

The simplest method for finding collocations in a text corpus is counting. If two words occur together a lot, that is evidence that they have a special function that is not simply explained as the function that results from their combination.
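Counting is only a few lines of code. A minimal sketch (the sample sentence is a placeholder for a real tokenized corpus):

    # Count adjacent-word bigrams in a tokenized text.
    from collections import Counter

    tokens = "the cost of the trip and the cost of the hotel".split()
    bigrams = Counter(zip(tokens, tokens[1:]))
    for (w1, w2), freq in bigrams.most_common(5):
        print(freq, w1, w2)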

Page 27:

Lexicons and Lexical Analysis (226)

Collocation (8)

Frequency (2)

The table shows the bigrams (sequences of two adjacent words) that are most frequent in the corpus, together with their frequencies. Except for New York, all the bigrams are pairs of function words.

A function word is a word which has no lexical meaning and whose sole function is to express grammatical relationships, such as prepositions, articles, and conjunctions.

[Table omitted: most frequent bigrams in the New York Times corpus]

Page 28:

Lexicons and Lexical Analysis (227)

Collocation (9)

Frequency (3)

But just selecting the most frequently occurring bigrams is not very interesting. Justeson and Katz (1995) propose passing the candidate phrases through a part-of-speech filter which only lets through those patterns that are likely to be “phrases”.
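A sketch of such a filter for bigrams, assuming the text has already been part-of-speech tagged with the coarse tags used on the next slides (A for adjective, N for noun, P for preposition); the tagger itself and the toy input are assumptions of the sketch:

    # Justeson-and-Katz-style part-of-speech filter for candidate bigrams.
    from collections import Counter

    BIGRAM_PATTERNS = {("A", "N"), ("N", "N")}   # tag patterns likely to be phrases

    tagged = [("strong", "A"), ("tea", "N"), ("of", "P"),
              ("mass", "N"), ("destruction", "N")]   # toy tagged text

    candidates = Counter(
        (w1, w2)
        for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
        if (t1, t2) in BIGRAM_PATTERNS
    )
    print(candidates.most_common())
    # [(('strong', 'tea'), 1), (('mass', 'destruction'), 1)]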

Page 29:

Lexicons and Lexical Analysis (228)

Collocation (10)

Frequency (4)

[Table omitted: part-of-speech tag patterns for collocation filtering, each with an example phrase]

Page 30:

Lexicons and Lexical Analysis (229)

Collocation (11)

Frequency (5)

Each pattern is followed by an example from the text, which is used as a test set. In these patterns A refers to an adjective, P to a preposition, and N to a noun.

The next table shows the most highly ranked phrases after applying the filter. The results are surprisingly good. There are only 3 bigrams that we would not regard as non-compositional phrases: last year, last week, and first time.

Page 31:

Lexicons and Lexical Analysis (230)

Collocation (12)

Frequency (6)

[Table omitted: the most highly ranked bigrams after applying the part-of-speech filter]

Page 32:

Lexicons and Lexical Analysis (231)

Collocation (13)

Frequency (7)

The bigram York City is an artefact of the way we have implemented the Justeson and Katz filter. The full implementation would search for the longest sequence that fits one of the part-of-speech patterns and would thus find the longer phrase New York City.

The twenty highest ranking phrases containing strong and powerful all have the form A N (where A is either strong or powerful). They are listed in the following table.

Page 33:

Lexicons and Lexical Analysis (232)

Collocation (14)

Frequency (8)

[Table omitted: the highest ranking phrases of the form A N containing strong or powerful]

Page 34:

Lexicons and Lexical Analysis (233)

Collocation (15)

Frequency (9)

Given the simplicity of the method, these results are surprisingly accurate. For example, they give evidence that strong challenge and powerful computers are correct whereas powerful challenge and strong computers are not.

However, we can also see the limits of a frequency-based method. The nouns man and force are used with both adjectives (strong force occurs further down the list with a frequency of 4).

Page 35:

Lexicons and Lexical Analysis (234)

Collocation (16)

Frequency (10)

Neither strong tea nor powerful tea occurs in the New York Times corpus. However, searching the larger corpus of the World Wide Web we find 799 examples of strong tea and 17 examples of powerful tea (the latter mostly in the computational linguistics literature on collocations), which indicates that the correct phrase is strong tea.

Page 36:

Lexicons and Lexical Analysis (235)

Collocation (17)

Frequency (11)

Justeson and Katz’s method of collocation discovery is instructive in that it demonstrates an important point: a simple quantitative technique (the frequency filter) combined with a small amount of linguistic knowledge (the importance of parts of speech) goes a long way.

Later we will use a stop list that excludes words whose most frequent tag is not a verb, noun or adjective.

Page 37:

Lexicons and Lexical Analysis (236)

Collocation (18)

Mean and Variance (1)

Frequency-based search works well for fixed phrases. But many collocations consist of two words that stand in a more flexible relationship to one another.

Consider the verb knock and one of its most frequent arguments, door. Here are some examples of knocking on or at a door from our corpus:

Page 38:

Lexicons and Lexical Analysis (237)

Collocation (19)

Mean and Variance (2)

She knocked on his door.

They knocked at the door.

100 women knocked on Donaldson’s door.

A man knocked on the metal front door.

Page 39:

Lexicons and Lexical Analysis (238)

Collocation (20)

Mean and Variance (3)

The words that appear between knocked and door vary, and the distance between the two words is not constant, so a fixed-phrase approach would not work here.

But there is enough regularity in the patterns to allow us to determine that knock is the right verb to use in English for this situation, not hit, beat or rap.

Page 40:

Lexicons and Lexical Analysis (239)

Collocation (21)

Mean and Variance (4)

To simplify matters we only look at fixed-phrase collocations in most cases, and usually at just bigrams.

We define a collocational window (usually a window of 3 to 4 words on each side of a word), and we enter every word pair within it as a collocational bigram. Then we proceed to do our calculations as usual on this larger pool of bigrams.
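A minimal sketch of window-based pair extraction (the window size and the sample sentence are illustrative):

    # Collect word pairs within a 3-word collocational window, so that
    # non-adjacent pairs such as (knocked, door) are also counted.
    from collections import Counter

    def window_bigrams(tokens, window=3):
        pairs = Counter()
        for i, w in enumerate(tokens):
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                pairs[(w, tokens[j])] += 1
        return pairs

    print(window_bigrams("she knocked on his door".split()))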

Page 41:

Lexicons and Lexical Analysis (240)

Collocation (22)

Mean and Variance (5)

[Figure: using a three-word collocational window to capture bigrams at a distance]

Page 42:

Lexicons and Lexical Analysis (241)

Collocation (23)

Mean and Variance (6)

Mean- and variance-based methods, by definition, look at the pattern of varying distances between two words.

One way of discovering the relationship between knocked and door is to compute the mean and variance of the offsets (signed distances) between the two words in the corpus.

Page 43:

Lexicons and Lexical Analysis (242)

Collocation (24)

Mean and Variance (7)

The mean is simply the average offset. For the previous examples, we compute the mean offset between knocked and door as follows:

μ = (3 + 3 + 5 + 5) / 4 = 4.0

This assumes a tokenization of Donaldson’s as three words: Donaldson, apostrophe, and s.

Page 44:

Lexicons and Lexical Analysis (243)

Collocation (25)

Mean and Variance (8)

The variance measures how much the individual offsets deviate from the mean. We estimate it as follows:

σ² = ( Σᵢ₌₁..ₙ (dᵢ − μ)² ) / (n − 1)

where n is the number of times the two words co-occur, dᵢ is the offset for co-occurrence i, and μ is the mean.

Page 45:

Lexicons and Lexical Analysis (244)

Collocation (26)

Mean and Variance (9)

As is customary, we use the standard deviation σ, the square root of the variance, to assess how variable the offset between two words is. The standard deviation for the four examples of knocked / door above is:

σ = √( ((3 − 4.0)² + (3 − 4.0)² + (5 − 4.0)² + (5 − 4.0)²) / 3 ) ≈ 1.15
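The whole computation for the four examples fits in a few lines of Python (a sketch; the offsets are read off the example sentences above):

    # Mean and sample standard deviation of the knocked/door offsets.
    from math import sqrt

    offsets = [3, 3, 5, 5]        # door occurs 3, 3, 5, 5 words after knocked
    n = len(offsets)
    mu = sum(offsets) / n                                # 4.0
    var = sum((d - mu) ** 2 for d in offsets) / (n - 1)  # 4/3
    print(mu, sqrt(var))                                 # 4.0 1.154...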

Page 46:

Lexicons and Lexical Analysis (245)

Collocation (27)

Mean and Variance (10)

The mean and standard deviation characterize the distribution of distances between two words in a corpus. We can use this information to discover collocations by looking for pairs with low standard deviation.

We can also explain the information that variance gets at in terms of peaks in the distribution of one word with respect to another.

Page 47:

Lexicons and Lexical Analysis (246)

Collocation (28)

Mean and Variance (11)

[Figure: the variance of strong with respect to opposition is small]

Page 48:

Lexicons and Lexical Analysis (247)

Collocation (29)

Mean and Variance (12)

Because of this greater variability we get a higher standard deviation and a mean that lies between positions −1 and −2 (−1.45).

Page 49:

Lexicons and Lexical Analysis (248)

Collocation (30)

Mean and Variance (13)

The high standard deviation indicates this randomness, and shows that for and strong do not form an interesting collocation.

Page 50:

Lexicons and Lexical Analysis (249)

Collocation (31)

Mean and Variance (14)

[Table omitted: word pairs with their mean offset, standard deviation, and count]

Page 51:

Lexicons and Lexical Analysis (250)

Collocation (32)

Mean and Variance (15)

If the mean is close to 1.0 and the standard deviation low, as is the case for New York, then we have the type of phrase that Justeson and Katz’s frequency-based approach will also discover.

If the mean is much greater than 1.0, then a low standard deviation indicates an interesting phrase.
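This interpretation rule can be phrased as a tiny function; the numeric thresholds below are illustrative assumptions, not values from the lecture:

    # Sketch of the mean/deviation interpretation rule.
    def interpret(mean, stddev, low_dev=1.5, adjacent=1.5):
        if stddev >= low_dev:
            return "high variance: no interesting relationship"
        if mean <= adjacent:
            return "fixed phrase (words essentially adjacent)"
        return "interesting phrase at a characteristic distance"

    print(interpret(1.0, 0.1))   # fixed phrase (e.g. New York)
    print(interpret(4.0, 1.15))  # interesting phrase (e.g. knocked ... door)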

Page 52:

Lexicons and Lexical Analysis (251)

Collocation (33)

Mean and Variance (16)

A high standard deviation indicates that the two words of the pair stand in no interesting relationship, as demonstrated by the four high-variance pairs in the table above.

More interesting are the cases in between: word pairs that have large counts for several distances in their collocational distribution.

Page 53:

Lexicons and Lexical Analysis (252)

Collocation (34)

References

J. S. Justeson and S. M. Katz. 1995. Technical terminology: some linguistic properties and an algorithm for identification in text. Natural Language Engineering 1.

M. A. K. Halliday. 1966. Lexis as a linguistic level. In C. E. Bazell, J. C. Catford, M. A. K. Halliday, and R. H. Robins (eds.), In Memory of J. R. Firth. London: Longmans.

F. Smadja. 1993. Retrieving collocations from text: Xtract. Computational Linguistics 19.

Page 54:

Lexicons and Lexical Analysis (253)

Assignments (7)

1. Pick a document in which your name occurs (an email, a university transcript or a letter). Does Justeson and Katz’s filter identify your name as a collocation?

2. We used the World Wide Web as an auxiliary corpus above because neither strong tea nor powerful tea occurred in the New York Times corpus. Modify Justeson and Katz’s method so that it uses the World Wide Web as a resource of last resort. Take a mainstream search engine as your tool.