Page 1: 1 Natural Language Processing (2a) Zhao Hai 赵海 Department of Computer Science and Engineering Shanghai Jiao Tong University 2010-2011 zhaohai@cs.sjtu.edu.cn

1

Natural Language Processing (2a)

Zhao Hai 赵海

Department of Computer Science and Engineering

Shanghai Jiao Tong University

2010-2011 

zhaohai@cs.sjtu.edu.cn

http://bcmi.sjtu.edu.cn/~zhaohai/lessons/nlp2011/index.html


2

Outline

Lexicons and Lexical Analysis

Lexicon: A Language Resource

A Lexicon for English Words: WordNet


3

Lexicon: A Language Resource (1)

Features for Lexicons (1)

A lexicon here means a machine-readable dictionary, which has the following features:

It provides, in detail, all the information that a dictionary contains;

Based on semantic descriptions, it describes the syntagmatic and paradigmatic relationships of each word, e.g.:

red + flower, green + leaf, big + eye (syntagmatic relations: words that combine with each other);

{red, green, big} and {flower, leaf, eye} (paradigmatic relations: words that can substitute for each other in the same position);


4

Lexicon: A Language Resource (2)

Features for Lexicons (2)

word building: fixed collocations between words;

systematization: consistent description across the morphological, syntactic and semantic levels;

formalization: expression in a meta-language, e.g. [±noun].
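The syntagmatic and paradigmatic relations from the feature list can be made concrete with a small sketch. It assumes the lexicon stores observed modifier + head collocations (the three pairs from the example above); the names `SYNTAGMATIC_PAIRS` and `paradigmatic_classes` are hypothetical. The stored pairs are the syntagmatic relations, and the paradigmatic classes fall out as the sets of words that fill the same slot.

```python
# Syntagmatic relations: observed modifier + head combinations (from the slide example).
SYNTAGMATIC_PAIRS = [("red", "flower"), ("green", "leaf"), ("big", "eye")]


def paradigmatic_classes(pairs):
    """Group words by the slot they fill; words sharing a slot are in a paradigmatic relation."""
    modifiers = sorted({modifier for modifier, _ in pairs})
    heads = sorted({head for _, head in pairs})
    return {"modifier": modifiers, "head": heads}


classes = paradigmatic_classes(SYNTAGMATIC_PAIRS)
print(classes["modifier"])  # ['big', 'green', 'red']
print(classes["head"])      # ['eye', 'flower', 'leaf']
```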


5

Lexicon: A Language Resource (3)

Construction of Lexicons

The construction of a lexicon involves the following critical points:

a knowledge base, rather than a database, is built, and this work should be carried out by domain experts;

it can be built manually or semi-automatically;

it can be applied to any machine platform and domain;

it should have a general framework, so that it can interact with other lexicons.


6

Lexicon: A Language Resource (4)

Types of Lexicons

The lexicon can be divided into four categories:

general lexicon (or basic lexicon);

collocation lexicon;

bilingual lexicon;

domain lexicon.


7

Lexicon: A Language Resource (5)

Information within Lexicons

The information of a basic lexicon may contain:

lexical information (lexical entry, etc.);

morphological information (POS, tense, etc.);

syntactic information (sentence pattern of verb, etc.);

semantic information (semantic attribute, predicate frame, etc.);

conceptual information (conceptual mark, word meaning explanation, etc.).


8

Lexicon: A Language Resource (6)

Sample (Morp., Syn. and Sem.)

“给” (give) :

Morp = [hq2, hq7, vjg, vjl, …];

Syn = [bso, bss, ksd, …];

Sem = [kyd, 240202].

e.g.: hq2 – can be followed by a numeral (verb as a quantifier);

bso – cannot act as an object on its own;

kyd – donate or bestow;

240202 – taxonomic code.
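One way to hold such an entry in a program is a small record with three feature lists. The sketch below is only an illustration: the class `LexiconEntry`, the helper `has_feature` and the field names are hypothetical, and only the codes printed in the sample are included, since the source elides the rest.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexiconEntry:
    word: str
    gloss: str
    morp: tuple  # morphological feature codes
    syn: tuple   # syntactic feature codes
    sem: tuple   # semantic feature codes


# Only the codes printed in the sample are listed; the source elides the rest.
GEI = LexiconEntry(
    word="给",
    gloss="give",
    morp=("hq2", "hq7", "vjg", "vjl"),
    syn=("bso", "bss", "ksd"),
    sem=("kyd", "240202"),
)

# Glosses the sample gives for some codes; the other codes stay unexplained, as in the source.
CODE_GLOSSES = {
    "hq2": "can be followed by a numeral (verb as a quantifier)",
    "bso": "cannot act as an object on its own",
    "kyd": "donate or bestow",
    "240202": "taxonomic code",
}


def has_feature(entry, code):
    """Return True if the code appears in any of the entry's three feature lists."""
    return code in entry.morp or code in entry.syn or code in entry.sem


print(has_feature(GEI, "bso"))  # True
print(has_feature(GEI, "xyz"))  # False
```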


9

Lexicon: A Language Resource (7)

Sample (Frame)

“给” (give)

Syntactic Frame:

S = NP + VP + NP1 + NP2

NP = [AP] + [QP] + N

VP = [ADP] + V

NP1 = [QP] + N

NP2 = [QP] + N

Semantic Frame:

NP = AGT (Agent)

NP1 = DAT (Dative)

NP2 = OBJ (Patient)

Semantic Constraint:

NP = human | country | society | saying

NP1 = human | animal | collectivity | region

NP2 = thing | a slap in the face | way out | elicitation
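The semantic frame and its constraints can be read as a simple check on the fillers of each slot. The sketch below assumes that each filler has already been assigned one semantic class label by hand; the function `check_frame` and the dictionary names are hypothetical, while the roles and classes are copied from the sample.

```python
# Semantic frame and constraints for "给" (give), copied from the sample.
SEMANTIC_ROLES = {"NP": "AGT", "NP1": "DAT", "NP2": "OBJ"}
SEMANTIC_CONSTRAINTS = {
    "NP": {"human", "country", "society", "saying"},
    "NP1": {"human", "animal", "collectivity", "region"},
    "NP2": {"thing", "a slap in the face", "way out", "elicitation"},
}


def check_frame(fillers):
    """Map each slot to its role if the filler's semantic class is allowed, else to None.

    `fillers` maps slot names (NP, NP1, NP2) to a semantic class label.
    """
    result = {}
    for slot, role in SEMANTIC_ROLES.items():
        allowed = SEMANTIC_CONSTRAINTS[slot]
        result[slot] = role if fillers.get(slot) in allowed else None
    return result


# "The teacher gave the student a book": human + human + thing (labels assigned by hand).
print(check_frame({"NP": "human", "NP1": "human", "NP2": "thing"}))
# {'NP': 'AGT', 'NP1': 'DAT', 'NP2': 'OBJ'}
print(check_frame({"NP": "thing", "NP1": "human", "NP2": "thing"}))
# {'NP': None, 'NP1': 'DAT', 'NP2': 'OBJ'}
```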


10

Lexicon: A Language Resource (8)

Collocation Lexicon

Col(w) = <cat, mor, syn, msy, sen>

where: cat – multi-POS;

mor – morphology;

syn – syntax and semantics;

msy – nesting collocation;

sen – sentence modifying rule set.


11

Lexicon: A Language Resource (9)

Sample (Collocation Lexicon)

w: ‘大概’ (probably)

cat: ^ ‘大概’ + (‘的’; n) @setmark(a);

cat: ^ ‘大概’ + (m; p; v; a; b; z) @setmark(d);

cat: q + ^ ‘大概’ @setmark(n);
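The three `cat` rules can be sketched as a lookup on the neighbors of the target word. This is one plausible reading of the notation, not a documented semantics: it assumes that `^` marks the target word, `+` means adjacency, the parenthesized items are alternatives (a literal word or a POS tag), `@setmark(x)` assigns the mark `x`, and the first matching rule wins. The names `RULES` and `set_mark` are hypothetical, and the tag letters are treated as opaque labels.

```python
# Each rule: (side, literal words, POS tags, resulting mark).
# side == "next" tests the token after the target word, "prev" the token before it.
RULES = [
    ("next", {"的"}, {"n"}, "a"),
    ("next", set(), {"m", "p", "v", "a", "b", "z"}, "d"),
    ("prev", set(), {"q"}, "n"),
]


def set_mark(tokens, index):
    """Return the mark of the first matching rule for tokens[index], or None.

    `tokens` is a list of (word, pos_tag) pairs.
    """
    for side, words, tags, mark in RULES:
        neighbor = index + 1 if side == "next" else index - 1
        if 0 <= neighbor < len(tokens):
            word, tag = tokens[neighbor]
            if word in words or tag in tags:
                return mark
    return None


# The POS tags of the neighbors below are supplied by hand for illustration.
print(set_mark([("大概", "?"), ("的", "u"), ("情况", "n")], 0))  # a
print(set_mark([("他", "r"), ("大概", "?"), ("知道", "v")], 1))  # d
print(set_mark([("一", "m"), ("个", "q"), ("大概", "?")], 2))  # n
```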


12

A Lexicon for English Words: WordNet (1)

What is WordNet?

WordNet is an on-line lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory.

English nouns, verbs, adjectives and adverbs are organized into synonym sets, each representing one underlying lexical concept. Different relations link the synonym sets.


13

A Lexicon for English Words: WordNet (2)

Information within WordNet

WordNet divides the lexicon into five categories:

Nouns

Verbs

Adjectives

Adverbs

Function words (particles)

WordNet organizes lexical information in terms of word meanings, rather than word forms. Therefore, semantic relations are used to organize it.


14

A Lexicon for English Words: WordNet (3)

Psycholinguistics

The 20th century has seen the emergence of psycholinguistics, an interdisciplinary field of research concerned with the cognitive bases of linguistic competence.

Both linguists and psycholinguists have explored in considerable depth the factors determining the contemporary (belonging to the same time) structure of linguistic knowledge in general, and lexical knowledge in particular.


15

A Lexicon for English Words: WordNet (4)

Psycholexicology

Miller and Johnson-Laird (1976) have proposed that research concerned with the lexical component of language should be called psycholexicology. As linguistic theories evolved in recent decades, linguists became increasingly explicit about the information a lexicon must contain in order for the phonological, syntactic, and lexical components to work together in the everyday production and comprehension of linguistic messages, and those proposals have been incorporated into the work of psycholinguists.


16

A Lexicon for English Words: WordNet (5)

Lexicography

Beginning with word association studies at the turn of the century and continuing down to the sophisticated experimental tasks of the past twenty years, psycholinguists have discovered many synchronic properties of the mental lexicon that can be exploited in lexicography.


17

A Lexicon for English Words: WordNet (6)

Naissance of WordNet

In 1985 a group of psychologists and linguists at Princeton University undertook to develop a lexical database along lines suggested by these investigations (Miller, 1985).

The initial idea was to provide an aid to use in searching dictionaries conceptually, rather than merely alphabetically.

As the work proceeded, however, it demanded a more ambitious formulation of its own principles and goals.


18

A Lexicon for English Words: WordNet (7)

Size of WordNet

POS         Unique Strings   Synsets   Total Word-Sense Pairs
Noun        117798           82115     146312
Verb        11529            13767     25047
Adjective   21479            18156     30002
Adverb      4481             3621      5580
Totals      155287           117659    206941

http://wordnet.princeton.edu/
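The figures in the table can be checked and summarized with a few lines of arithmetic. The sketch below (the variable names are hypothetical) confirms that the column sums match the Totals row and computes the average number of senses per word form for each part of speech, a rough indicator of polysemy.

```python
# (unique strings, synsets, word-sense pairs) per part of speech, from the table.
WORDNET_SIZE = {
    "Noun": (117798, 82115, 146312),
    "Verb": (11529, 13767, 25047),
    "Adjective": (21479, 18156, 30002),
    "Adverb": (4481, 3621, 5580),
}

# Column sums reproduce the "Totals" row.
totals = tuple(sum(column) for column in zip(*WORDNET_SIZE.values()))
print(totals)  # (155287, 117659, 206941)

# Average number of senses per word form, a rough indicator of polysemy.
for pos, (strings, synsets, pairs) in WORDNET_SIZE.items():
    print(f"{pos}: {pairs / strings:.2f} senses per string")
```

On these numbers verbs come out as the most polysemous category, at about 2.17 senses per string.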


19

A Lexicon for English Words: WordNet (8)

Some Problems

What kinds of utterances enter into these lexical associations?

What is the nature and organization of the lexicalized concepts that words can express?

What syntactic roles do different words play?


20

A Lexicon for English Words: WordNet (9)

Lexical Matrix (1)

In order to reduce ambiguity, “word form” is used here to refer to the physical utterance, and “word meaning” to refer to the lexicalized concept that a form can be used to express.

The starting point for lexical semantics can then be said to be the mapping between forms and meanings.


21

A Lexicon for English Words: WordNet (10)

Lexical Matrix (2)

                Word Forms
Word Meanings   F1     F2     F3     ...   Fn
M1              E1,1   E1,2
M2                     E2,2
M3                            E3,3
...                                  ...
Mm                                         Em,n

If there are two entries in the same column, the word form is polysemous; if there are two entries in the same row, the two word forms are synonyms (relative to a context).

Therefore, F1 and F2 are synonyms; F2 is polysemous.
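The column and row tests on the lexical matrix translate directly into code. The sketch below stores only the concrete entries shown in the matrix (the generic entry Em,n is left out) as (meaning, form) pairs; the function names are hypothetical.

```python
# Entries E(i,j) of the lexical matrix shown above, as (meaning, form) pairs.
ENTRIES = {("M1", "F1"), ("M1", "F2"), ("M2", "F2"), ("M3", "F3")}


def polysemous_forms(entries):
    """Forms whose column holds more than one entry."""
    columns = {}
    for meaning, form in entries:
        columns.setdefault(form, set()).add(meaning)
    return sorted(form for form, meanings in columns.items() if len(meanings) > 1)


def synonym_sets(entries):
    """Rows holding more than one entry: forms that share a meaning."""
    rows = {}
    for meaning, form in entries:
        rows.setdefault(meaning, set()).add(form)
    return {meaning: sorted(forms) for meaning, forms in rows.items() if len(forms) > 1}


print(polysemous_forms(ENTRIES))  # ['F2']
print(synonym_sets(ENTRIES))      # {'M1': ['F1', 'F2']}
```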


22

A Lexicon for English Words: WordNet (11)

Polysemy and Synonymy

Mappings between forms and meanings are many-to-many: some forms have several different meanings, and some meanings can be expressed by several different forms.

That is to say, a listener or reader who recognizes a form must cope with its polysemy; a speaker or writer who hopes to express a meaning must decide between synonyms.


23

A Lexicon for English Words: WordNet (12)

Some of the Relations

Synonymy

Antonymy

Hyponymy / Hypernymy (Subordination / Superordination)

Meronymy / Holonymy (Part-Whole)


24

A Lexicon for English Words: WordNet (13)

Synonym (1)

There are several definitions of synonymy:

Two expressions are synonymous if the substitution of one for the other never changes the truth value of a sentence in which the substitution is made.

Two expressions are synonymous in a linguistic context C if the substitution of one for the other in C does not alter the truth value.


25

A Lexicon for English Words: WordNet (14)

Synonym (2)

Note that the definition of synonymy in terms of substitutability makes it necessary to partition WordNet into nouns, verbs, adjectives, and adverbs.

That is to say, if concepts are represented by synsets, and if synonyms must be interchangeable, then words in different syntactic categories cannot be synonyms (cannot form synsets) because they are not interchangeable.


26

A Lexicon for English Words: WordNet (15)

Antonym (1)

The antonym of a word x is sometimes not-x, but not always. For example, rich and poor are antonyms, but to say that someone is not rich does not imply that they must be poor; many people consider themselves neither rich nor poor.

Antonymy is a lexical relation between word forms, not a semantic relation between word meanings.


27

A Lexicon for English Words: WordNet (16)

Antonym (2)

For example, the meanings {rise, ascend} and {fall, descend} may be conceptual opposites, but they are not antonyms; [rise / fall] are antonyms and so are [ascend / descend], but most people hesitate and look thoughtful when asked if rise and descend, or ascend and fall, are antonyms.

Note that synonymous words are enclosed in curly brackets, ‘{’ and ‘}’, and other lexical relations are enclosed in square brackets, ‘[’ and ‘]’.
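The difference between a relation on word forms and a relation on word meanings shows up in how the data is stored. In the sketch below, synsets group forms, while antonymy is recorded between individual forms only; the synset labels `move_up` and `move_down` and the function names are hypothetical.

```python
# Synsets group word forms that share a meaning (curly brackets in the text).
SYNSETS = {
    "move_up": {"rise", "ascend"},
    "move_down": {"fall", "descend"},
}

# Antonymy is stored between individual word forms (square brackets in the text).
ANTONYM_PAIRS = {frozenset({"rise", "fall"}), frozenset({"ascend", "descend"})}


def are_antonyms(form_a, form_b):
    """True only if this exact pair of forms is listed as antonyms."""
    return frozenset({form_a, form_b}) in ANTONYM_PAIRS


def synset_of(form):
    """Return the label of the synset containing the form, or None."""
    for label, forms in SYNSETS.items():
        if form in forms:
            return label
    return None


print(are_antonyms("rise", "fall"))     # True
print(are_antonyms("rise", "descend"))  # False, although the two synsets are conceptual opposites
print(synset_of("rise"), synset_of("descend"))  # move_up move_down
```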


28

A Lexicon for English Words: WordNet (17)

Hyponymy / Hypernymy

Hyponymy is a semantic relation between word meanings. It is also called subordination / superordination, subset / superset, or the ISA relation.

Hyponymy is transitive and asymmetrical.

x is said to be a hyponym of y if native speakers of English accept the sentence constructed as “An x is a (kind of) y.”

Ex.: tree is a hyponym of plant;

plant is a hypernym of tree.
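Transitivity and asymmetry of the ISA relation can be illustrated by following hypernym links upward. The sketch below is a simplification that gives each word at most one direct hypernym; only the link from tree to plant comes from the example above, the other links and the function names are added for illustration.

```python
# Direct hypernym links (hyponym -> hypernym). "tree -> plant" is the example from the text;
# the other links are added for illustration.
HYPERNYM_OF = {"oak": "tree", "tree": "plant", "plant": "organism"}


def hypernyms(word):
    """All hypernyms of `word`, nearest first, following the links transitively."""
    chain = []
    while word in HYPERNYM_OF:
        word = HYPERNYM_OF[word]
        chain.append(word)
    return chain


def is_a(x, y):
    """True if 'an x is a (kind of) y' holds via one or more hypernym links."""
    return y in hypernyms(x)


print(hypernyms("oak"))      # ['tree', 'plant', 'organism']
print(is_a("oak", "plant"))  # True  (transitivity)
print(is_a("plant", "oak"))  # False (asymmetry)
```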


29

A Lexicon for English Words: WordNet (18)

Meronymy / Holonymy

Meronymy is a semantic relation, also called the part-whole or HASA relation.

x is said to be a meronym of y if native speakers of English accept the sentence constructed as “An x is a part of y”; y is then a holonym of x.

Ex.: a frame is a part of a car, or a car has a frame.
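Meronymy and holonymy are the same part-whole links read in opposite directions, which a short sketch makes explicit. Only the link between frame and car comes from the example above; the second link and the function names are added for illustration.

```python
# Direct part-whole links (part, whole). "frame is a part of car" is the example from the text;
# the second link is added for illustration.
PART_OF = [("frame", "car"), ("wheel", "car")]


def holonyms(part):
    """Wholes that `part` belongs to ('x is a part of y')."""
    return sorted(whole for p, whole in PART_OF if p == part)


def meronyms(whole):
    """Parts that `whole` has ('y has an x'): the same links read in reverse."""
    return sorted(p for p, w in PART_OF if w == whole)


print(holonyms("frame"))  # ['car']
print(meronyms("car"))    # ['frame', 'wheel']
```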


30

A Lexicon for English Words: WordNet (19)

User Interface


31



32

Assignments (2)

1. The text described several different example tests for distinguishing word classes. For example, nouns can occur in sentences of the form “I saw the X”, whereas adjectives can occur in sentences of the form “It’s so X”. Give some additional tests to distinguish these forms and to distinguish between count nouns and mass nouns. State whether each of the following words can be used as an adjective, count noun, or mass noun. If the word is ambiguous, give all its possible uses.

milk, house, liquid, green, group, concept, airborne