
Upload: rafi

Post on 07-Feb-2016


Page 1: CS598 DNR FALL 2005 Machine Learning  in  Natural Language

1

CS598 DNR FALL 2005

Machine Learning in

Natural Language

Introduction: Part 3

Linguistics Essentials

(The role of Linguistics in NLP)


Introduction

This is not a class in NLP, but we want to discuss how to make progress in natural language understanding.

Introduce basic linguistics concepts.

Basic terminology.

Discuss the levels of analysis used in NLP.

Problems associated with each level.


Comprehension

(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.

1. Who is Christopher Robin?
2. When was Winnie the Pooh written?
3. What did Mr. Robin do when Chris was three years old?
4. Where did young Chris live?
5. Why did Chris write two books of his own?

Other motivating problems: Entailment; Translation; Generation…


Introduction

Discuss the levels of analysis used in NLP.

Problems associated with each level.

For each level of linguistic analysis we will ask:

What are the problems here? What would we consider as a solution?


Levels of Analysis

In traditional linguistics people talk about several levels of analysis, or types of linguistic knowledge.

Morphology: How words are constructed.

Syntax: Structural relations between words.

Semantics: The meaning of words and of combinations of words.

Pragmatics: How a sentence is used; what its purpose is.

Discourse (sometimes distinguished as a subfield of Pragmatics): Relationships between sentences; global context.


Morphology

Morphology: How words are constructed; prefixes and suffixes.

The simple cases are: kick, kicks, kicked, kicking

But other cases may be: sit, sits, sat, sitting

It is not just as simple as adding and deleting certain endings, as in:

gorge, gorgeous
good, goods
arm, army

This might be very different in other languages...

(Problems; solutions)


Syntax

Syntax: Structural relationship between words.

The main issues here are structural ambiguities, as in:

I saw the Grand Canyon flying to New York.

or

Time flies like an arrow.

The sentence can be interpreted as a
Metaphor: time passes quickly, but also as a
Declarative: insects ("time flies") have an affinity for arrows, or as an
Imperative: measure the time of the insects.

Key issue: Often syntax doesn't tell us much about meaning.

Plastic cat food can cover


Semantics

Semantics: The meaning of words and of combinations of words.

Some key issues here:

Lexical ambiguities:
I walked to the bank {of the river / to get money}.
The bug in the room {was probably planted by spies / flew out the window}.

Compositionality: The meaning of phrases/sentences as a function of the meaning of the words in them.

(Problems; Solutions)


Pragmatics/Discourse

Pragmatics: How a sentence is used; its purpose. E.g., rules of conversation:

Can you tell me what time it is?
Could I have the salt?

Discourse: Relations between sentences; global context. An important example here is the problem of co-reference:

When Chris was three years old, his father wrote a poem about him.

“Chicago?” (Running towards an agent in an airport; Ticket Agency)


Morphology and Part-of-Speech

Words are related by morphological processes such as
forming plural forms from singular forms: dog...dogs
adding prefixes and suffixes: conceive...inconceivable

Importance? It makes language more predictable. It allows us to handle new words which are outside our vocabulary. Understanding morphology may support generalization to unknown words.

However, morphology may be tricky. It is not always as simple as stripping common prefixes and suffixes.

preempt....... empt?
gorgeous...... like a gorge?
apply......... like an apple?
old........... oldly?
Mrs. ......... plural of Mr.
atomic........ not Tom-like
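The point about affix stripping can be made concrete with a small sketch. The affix lists below are hypothetical, picked only so that the examples on these slides can be tried; the function is a deliberately naive stripper, not a real stemming algorithm.

```python
# Hypothetical affix lists, chosen only to reproduce the slide's examples.
PREFIXES = ("pre", "in", "un")
SUFFIXES = ("ing", "ous", "ly", "ed", "ic", "s", "y")


def naive_stem(word):
    """Strip the first matching prefix and suffix, with no dictionary check."""
    stem = word.lower()
    for prefix in PREFIXES:
        # Keep at least three characters so short words are left alone.
        if stem.startswith(prefix) and len(stem) > len(prefix) + 2:
            stem = stem[len(prefix):]
            break
    for suffix in SUFFIXES:
        if stem.endswith(suffix) and len(stem) > len(suffix) + 2:
            stem = stem[:-len(suffix)]
            break
    return stem


for w in ["kicks", "kicked", "kicking", "sitting", "preempt", "gorgeous", "army"]:
    print(w, "->", naive_stem(w))
```

The regular forms of kick all reduce to kick, but sitting becomes sitt, preempt becomes empt, gorgeous becomes gorge and army becomes arm, which is exactly the kind of error the slide warns about.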


Morphological Processes

Inflectional forms: Words generated share the same basic meaning and part of speech. Words are generated by systematic modifications of the root forms.

kick, kicks, kicked, kicking

Derivational forms: Words generated may have a different meaning and part of speech.

friend...friendly; wide...widely; hard...hardly

Is there a problem to solve here? What would you consider a solution?


Part of Speech

The part of speech of words in a sentence has an important role in all recent work in natural language; it is necessary in order to read the literature and the corpora.

Part of speech (POS) is a way to categorize words based on the particular syntactic (and often semantic) function they take in the sentence. Parts of speech are sometimes called syntactic or grammatical categories.

Important POS:
Nouns: typically refer to people, animals and “things”.
Verbs: express the action in the sentence.
Adjectives: describe properties of nouns.

Children eat sweet candy
Children: Noun - group of people.
eat: Verb - describes what people do with candy.
sweet: Adj. - a property of candy.
candy: Noun - a particular type of food.

Other basic parts of speech: adverb, article, pronoun, conjunction …
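As a minimal sketch of what assigning parts of speech means, the example sentence can be tagged by plain dictionary lookup. The lexicon and the tag names here are hypothetical and cover only this one sentence.

```python
# Hypothetical toy lexicon covering only the slide's example sentence.
LEXICON = {"children": "Noun", "eat": "Verb", "sweet": "Adj", "candy": "Noun"}


def tag(sentence, default="Unknown"):
    """Tag each whitespace-separated word by looking it up in LEXICON."""
    return [(word, LEXICON.get(word.lower(), default)) for word in sentence.split()]


print(tag("Children eat sweet candy"))
```

Lookup alone assigns one tag per word regardless of context, so it cannot handle words that take different parts of speech in different sentences; that limitation is what makes tagging a learning problem.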

Data / Demo


Part of Speech (cont.)

Useful sub-categorization of POS into two types:

Open class words: A constantly changing set; new words are often introduced into the language.
nouns, verbs, adjectives and adverbs

Closed class words: A relatively stable set; new words are rarely introduced into the language.
articles, pronouns, prepositions, conjunctions

It is therefore easier to deal with closed class words.

Articles: a, an, the
Pronouns: I, you, me, we, he, she, him, her, it, them, they
Prepositions: to, for, with, between, at, of
Demonstratives: this, that, these, those
Quantifiers: some, every, most, any, both
Conjunctions: and, or, but
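Because a closed class is a small, stable set, it can simply be enumerated. The sketch below stores the word lists from this slide in a dictionary; the function name and the choice to return None for unlisted words are assumptions made for illustration.

```python
# Word lists copied from the slide; they are examples, not complete inventories.
CLOSED_CLASS = {
    "article": {"a", "an", "the"},
    "pronoun": {"i", "you", "me", "we", "he", "she", "him", "her", "it", "them", "they"},
    "preposition": {"to", "for", "with", "between", "at", "of"},
    "demonstrative": {"this", "that", "these", "those"},
    "quantifier": {"some", "every", "most", "any", "both"},
    "conjunction": {"and", "or", "but"},
}


def closed_class_of(word):
    """Return the closed-class category of a word, or None if it is not listed."""
    key = word.lower()
    for category, members in CLOSED_CLASS.items():
        if key in members:
            return category
    return None


print(closed_class_of("The"))       # article
print(closed_class_of("marathon"))  # None
```

An enumeration like this tells us which category a word belongs to, but not how the word is being used in a given sentence.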


Closed class words (not so easy)

Articles pose a lot of difficulty for language generation. Most noun phrases start with an article:

a newspaper, an apple, the movie

But there are many exceptions:

The bowl was full of rice. *The bowl was full of apple.
I go to college. *I go to university.
She went on vacation. *She went on trip.
He fell asleep in class. *He fell asleep in room.


Closed class words (not so easy-II)

Other closed class words that are hard to deal with: prepositions and particles.

Prepositions represent relations: time, location, modification, complements.

He put the book on the table
He gave the book to Mary
He walked up the stairs

Particles are prepositions that follow verbs to create new verb forms:

He passed out

But also:

He threw the cookies up the chimney vs. He threw up the cookies

And sometimes it can be ambiguous:

He looked over the paper.

Other problems with prepositions include attachments, which will be discussed later when we discuss semantics.

Problems? Solutions? POS? Disambiguation? Text Correction?


Nouns

Nouns refer to entities in the world, which represent objects, places, concepts, people, events:
dog, city, idea, marathon

Count nouns: describe specific objects or sets of objects (above).

Mass nouns: describe composites or substances:
dirt, water, garbage, deer

Pronouns are a special class of nouns that refer to a person or a “thing” that is salient in the context of use.
After Mary had arrived in the village, she looked for a hotel.

Relative pronouns are pronouns like who, which, that:
The man who saw Elvis...
The UFO that landed in Toledo...
The Rolling Stones concert, which I attended, ...


Nouns (cont.)

Nouns can be objects of verbs or subjects of verbs:
Children (subject) eat sweet candy (object)

Proper nouns are names like Mary, Smith, United States, IBM, Little Rock.

Nouns have modifiers. They can be modified by
adjectives: words that attribute qualities to objects: wet, loud, happy, funny
or by noun modifiers: dog food, tin can, song book.
In this case we can talk about the head noun, which represents the main concept, e.g., food in dog food.

A noun is usually embedded in a noun phrase: a syntactic unit of the sentence in which information about the noun is gathered. The noun is the head of the noun phrase. In addition to the noun we may find in a noun phrase an article, as in “the tree”, and an adjective, as in “the tall tree”.

Problems? Solutions? Identification? Why do we need to solve it? How to evaluate it?


Verbs

Verbs: Words that represent actions, commands or assertions.

Main verbs: walk, eat, believe, claim, ask
Auxiliary verbs: be, do, have
Modal verbs: will, can, could

Verbs can be
transitive: they take a complement, as in: eat an apple; read a book; sing a song
intransitive: they do not take complements, as in: she laughed; he slept; I lied


Verbs (cont.)

Verbs have morphological forms:

Base:                walk      be      go
Present:             walks     is      goes
Past:                walked    was     went
Present Participle:  walking   being   going
Past Participle:     walked    been    gone


Verbs (cont.)

Verbs can be active or passive. The passive voice form consists of a form of “to be” followed by the past participle.

Active                Passive
I saw Elvis.          Elvis was seen by me.
I will find him.      He will be found by me.
I have found him.     He has been found by me.

The roles are reversed in actives and passives.
John killed Sam: subject is killer, direct object is victim.
Sam was killed by John: subject is victim, object of “by” is killer.

Some verbs take indirect objects, e.g.
I gave Mary the book vs. I gave the book to Mary.
Mary: indirect object; book: direct object


Verbs (cont.)

Prepositions and particles are important in the context of verbs. When they appear as particles they create new verb forms.

Sometimes, we need to know the meaning of the sentence to decide if a word is a preposition or a particle.

She ran up the hill
She ran up the bill


Verb Phrases

The verb phrase is the syntactic unit that organizes all elements of the sentence that depend syntactically on the verb.

The verb is the head of the verb phrase.

An adverb is an element of the verb phrase that specifies place, time, manner or degree.

She often travels to Las Vegas.
She allegedly committed perjury.
She started her career off impressively.


Verb Sub-categorization

This is a categorization of verbs according to the types of complements they take.

Complements of a verb are different syntactic means that verbs can exploit to express related entities.

The set of complements that a verb can appear with is called its subcategorization frame.

Examples: VerbNet


Sub-categorization Frames

Intransitive: NP (subject)
The woman walked

Transitive: NP (subject) NP (object)
John loves Mary

Double object construction: NP (subject) NP (indirect object) NP (direct object)
Mary gave John flowers

Reflexive verbs: NP (subject) reflexive pronoun (object)
She introduced herself

NP (subject) NP (object) PP (location)
She put the book on the table

Clause complement: NP (subject) NP (object) that-clause
She told me that Gary is coming.

Complements of verbs can be either
obligatory arguments (subject, direct object, indirect object):
She put the book on the table
or optional (like a PP or a subordinate clause, e.g., a "that" clause):
She gave her presentation on the stage.
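One simple way to represent subcategorization frames in a program is a table from each verb to the complement sequences it allows after the subject. The sketch below is a hypothetical encoding of only the examples on this slide; the label names and the `licenses` helper are assumptions, and this is not the VerbNet format.

```python
# Hypothetical frame table: verb -> set of allowed complement sequences
# (the subject NP is implicit). Built from the slide's examples only.
SUBCAT_FRAMES = {
    "walk": {()},                           # The woman walked
    "love": {("NP",)},                      # John loves Mary
    "give": {("NP", "NP"), ("NP", "PP")},   # gave John flowers / gave the book to Mary
    "put": {("NP", "PP")},                  # put the book on the table
    "tell": {("NP", "that-clause")},        # told me that Gary is coming
}


def licenses(verb, complements):
    """Return True if the verb's frame table allows this complement sequence."""
    return tuple(complements) in SUBCAT_FRAMES.get(verb, set())


print(licenses("put", ["NP", "PP"]))  # True
print(licenses("put", ["NP"]))        # False: the location PP is obligatory
```

Optional complements, such as the PP in "She gave her presentation on the stage", are not modeled here; a fuller table would have to mark which elements of a frame may be omitted.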



Syntactic and Semantic Regularities

Subcategorization frames capture syntactic regularities. There are also semantic regularities, usually called selectional restrictions or preferences.

E.g., "bark" prefers dogs as subjects; "eat" prefers edible things as objects.

Sentences that violate selectional preferences sound odd.

The cat barked all night.
I eat philosophy every day.
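Selectional preferences can be sketched as a table from a (verb, role) pair to the semantic classes it prefers. Everything below is hypothetical: the class labels, the tiny noun inventory and the function name are made up to cover the two odd sentences on this slide.

```python
# Hypothetical semantic classes for a handful of nouns.
SEMANTIC_CLASS = {
    "dog": "dog",
    "cat": "cat",
    "apple": "edible",
    "philosophy": "abstract",
}

# Hypothetical preference table: (verb, role) -> preferred semantic classes.
PREFERENCES = {
    ("bark", "subject"): {"dog"},
    ("eat", "object"): {"edible"},
}


def sounds_odd(verb, role, noun):
    """Return True if the noun violates a known preference of the verb."""
    preferred = PREFERENCES.get((verb, role))
    if preferred is None:
        return False  # no preference recorded, so nothing to violate
    return SEMANTIC_CLASS.get(noun) not in preferred


print(sounds_odd("bark", "subject", "cat"))       # True
print(sounds_odd("eat", "object", "philosophy"))  # True
print(sounds_odd("eat", "object", "apple"))       # False
```

A hard yes/no table like this treats preferences as restrictions; since the slide calls them preferences, a graded score would probably be a closer fit.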

Last word about verbs: Gerunds are present participles that function as nouns.

sleeping bags; drinking fountain; moving sale


Syntax

Words in a sentence are not randomly strung together in a sequence. Words are organized in phrases and arranged in a particular word order. Syntax is the study of the regularities and laws of word order and phrase structure.

In English, we cannot determine the meaning of the sentence from the meaning of the words alone:

Mary gave Peter a book.
Peter gave Mary a book.

The basic word order in English is Subject-Verb-Object. This holds for declarative sentences,

The children should eat spinach

but the order changes to express a particular "mood":

Interrogative (question): Should the children eat spinach? [Try on demos]

Imperative (command, request): Eat spinach!


Rewrite Rules

The regularities of word order are captured using rewrite rules. The symbol on the left of the rule can be rewritten as the sequence of symbols on the right.

S --> NP VP
NP --> John, garbage
VP --> laughed, smells

This set of rewrite rules can produce the following sentences:

John laughed
Garbage laughed
John smells
Garbage smells
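As a minimal sketch, the three rewrite rules can be stored in a dictionary and expanded exhaustively to list every sentence they produce. The dictionary layout and the `expand` function are assumptions made for illustration; the rules themselves are the ones given here, with the VP rule taken as written (laughed, smells).

```python
from itertools import product

# Each nonterminal maps to its alternative right-hand sides.
RULES = {
    "S": [["NP", "VP"]],
    "NP": [["John"], ["garbage"]],
    "VP": [["laughed"], ["smells"]],
}


def expand(symbol):
    """Return every word sequence that the symbol can be rewritten to."""
    if symbol not in RULES:  # a terminal symbol cannot be decomposed
        return [[symbol]]
    results = []
    for rhs in RULES[symbol]:
        # Combine every expansion of each right-hand-side symbol, in order.
        for parts in product(*(expand(s) for s in rhs)):
            results.append([word for part in parts for word in part])
    return results


sentences = [" ".join(words) for words in expand("S")]
print(sentences)
```

Exhaustive expansion only terminates because this grammar has no recursive rules; a grammar with recursion generates infinitely many sentences.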

Symbols that cannot be decomposed are called terminal symbols. Symbols that can be decomposed are called nonterminals.

An intuitive way to represent a sentence structure is as a tree, in which each nonterminal node represents the application of a rewrite rule. The following example presents a tree representation of the sentence

John walked the dog with the fleas.



Rewrite Rules

This is produced using a set of rewrite rules that we call the Grammar: a formal specification of the structures allowable in a language. A grammar that can produce this tree is:

S --> NP VP
NP --> Det NP
NP --> Det noun PP
NP --> ADJ NP
NP --> noun NP
NP --> noun PP
NP --> noun
VP --> V NP PP
VP --> V NP
VP --> V PP
VP --> V
PP --> Prep NP PP
PP --> Prep NP

[Figure: parse tree for "John walked the dog with the fleas", with node labels S, NP, VP, PP, Det, N and V]

But the same grammar can also produce other trees, e.g., the one that means that the fleas helped John walk the dog. That is, the grammar is not enough.
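The ambiguity can be checked mechanically by counting how many distinct trees the grammar assigns to the sentence. The sketch below encodes the rules listed on this slide and counts derivations by recursion over word spans. The word-to-category lexicon is an assumption (the slide gives none), and the counting procedure is a generic illustration, not a parser from the course.

```python
from functools import lru_cache

# The rewrite rules from the slide, as nonterminal -> right-hand sides.
GRAMMAR = {
    "S": [("NP", "VP")],
    "NP": [("Det", "NP"), ("Det", "noun", "PP"), ("ADJ", "NP"),
           ("noun", "NP"), ("noun", "PP"), ("noun",)],
    "VP": [("V", "NP", "PP"), ("V", "NP"), ("V", "PP"), ("V",)],
    "PP": [("Prep", "NP", "PP"), ("Prep", "NP")],
}

# Assumed lexicon: one category per word (not given in the source).
LEXICON = {"John": "noun", "walked": "V", "the": "Det",
           "dog": "noun", "with": "Prep", "fleas": "noun"}


def count_parses(words):
    """Count the distinct parse trees rooted in S that cover all the words."""
    tags = tuple(LEXICON[w] for w in words)

    @lru_cache(maxsize=None)
    def count(symbol, i, j):
        # Number of ways `symbol` can derive the words in positions i..j-1.
        if symbol not in GRAMMAR:  # word category: must match exactly one word
            return 1 if j == i + 1 and tags[i] == symbol else 0
        return sum(count_seq(rhs, i, j) for rhs in GRAMMAR[symbol])

    @lru_cache(maxsize=None)
    def count_seq(rhs, i, j):
        # Number of ways the symbol sequence `rhs` can derive positions i..j-1.
        if not rhs:
            return 1 if i == j else 0
        first, rest = rhs[0], rhs[1:]
        total = 0
        # Every symbol must cover at least one word.
        for k in range(i + 1, j - len(rest) + 1):
            left = count(first, i, k)
            if left:
                total += left * count_seq(rest, k, j)
        return total

    return count("S", 0, len(tags))


print(count_parses("John walked the dog with the fleas".split()))
print(count_parses("John walked the dog".split()))
```

Under these assumptions the full sentence gets three trees: one attaches the PP to the verb phrase (VP --> V NP PP), and two attach it inside the noun phrase (via NP --> Det NP with NP --> noun PP, or via NP --> Det noun PP). The shorter sentence gets one. The grammar alone does not say which tree is intended.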


Parsing

A parsing technique is a method for determining the structure of a sentence with respect to a given grammar.

A parser is a computer program that determines the structure of the sentence. It is not to be confused with a program that induces the grammar.

Lexical vs. non-lexical grammar: many grammars today are lexicalized in that the rewrite rules include specific words.

Notice that rewrite rules can be applied recursively. This is important, since it allows simple nonterminals to expand to a large number of words. This allows for the generation of many long-distance dependencies, e.g., between subjects and verbs, and is a source of difficulties in NLP.

A shallow parse is a parse of the sentence at a shallow level: only one or two levels above the terminals (the words). This is considered an easier task that can quite often be done more robustly.
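To illustrate what a shallow parse looks like, the sketch below groups part-of-speech tagged words into flat noun-phrase chunks without building a full tree. The tag names and the grouping rule (a maximal run of Det/Adj/Noun tags that contains a noun) are assumptions chosen for this example, not a method from the course.

```python
# Tags that may appear inside a flat noun-phrase chunk (assumed tag set).
NP_TAGS = {"Det", "Adj", "Noun"}


def np_chunks(tagged):
    """Return flat NP chunks: maximal runs of NP_TAGS that contain a Noun."""
    chunks, current = [], []
    # A sentinel token forces the last run to be flushed.
    for word, tag in list(tagged) + [("", "END")]:
        if tag in NP_TAGS:
            current.append((word, tag))
            continue
        if any(t == "Noun" for _, t in current):
            chunks.append(" ".join(w for w, _ in current))
        current = []
    return chunks


tagged = [("John", "Noun"), ("walked", "Verb"), ("the", "Det"), ("dog", "Noun"),
          ("with", "Prep"), ("the", "Det"), ("fleas", "Noun")]
print(np_chunks(tagged))
```

The chunker finds "John", "the dog" and "the fleas" but never decides where "with the fleas" attaches, which is the sense in which the task is shallower than full parsing. The rule is crude: two adjacent noun phrases with no word between them would be merged into one chunk.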

There are multiple grammar formalisms. What we showed here is a constituent-based formalism, but there exist others.


Semantics

Semantics: the study of the meaning of language. Can be decomposed into:
Lexical semantics: the study of the meaning of individual words.
Global semantics: how the meanings of individual words are combined into the meaning of sentences (or more).

One approach to lexical semantics is to study how word meanings are related to each other. To study this, words can be organized into lexical hierarchies (as done in WordNet).


Lexical Semantics

Hypernym: a word with a more general sense.
hypernym(cat) = animal

Hyponym: a word with a more specific sense.

Antonym: a word having the opposite meaning.
antonym(hot) = cold

Meronym: part-of.
meronym(tree) = leaf

Synonym: same meaning.

Homonyms: words that are written the same way but represent different words.
bank (river, finance); suit (law, set of garments)

Polysemy: a word with two senses that are related.
branch: natural subdivision of a plant; separate but dependent part of an organization.
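These relations can be pictured as lookups in a small hand-built hierarchy. The tables below are a hypothetical fragment: the slide's own examples (cat/animal, hot/cold, tree/leaf) plus a few made-up entries (dog, organism, branch as a tree part) added so that the functions have something to traverse. This is not WordNet data or its API.

```python
# Hypothetical hand-built fragment of a lexical hierarchy.
HYPERNYM = {"cat": "animal", "dog": "animal", "animal": "organism"}
ANTONYM = {"hot": "cold", "cold": "hot"}
MERONYMS = {"tree": {"leaf", "branch"}}


def hypernym_chain(word):
    """Follow hypernym links upward and return the increasingly general words."""
    chain = []
    while word in HYPERNYM:
        word = HYPERNYM[word]
        chain.append(word)
    return chain


def hyponyms(word):
    """Return the words whose direct hypernym is `word` (the inverse relation)."""
    return sorted(w for w, h in HYPERNYM.items() if h == word)


print(hypernym_chain("cat"))  # ['animal', 'organism']
print(hyponyms("animal"))     # ['cat', 'dog']
```

Storing one hypernym per word assumes each word has a single sense, which is precisely what homonymy and polysemy break.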


Lexical Semantics

When we move to global semantics, the natural problem is: how to use the meaning of single words to produce the meaning of a sentence?

This is a hard problem, since natural language does not always obey the principle of compositionality.

E.g., the word white refers to different colors in the following expressions:

white paper; white hair; white skin; white wine

There are problems of idioms and of the scope of words in the sentence that make this even harder.

Multi-word expressions


Pragmatics

One of the important issues studied here is that of discourse analysis. A central problem there is the resolution of anaphoric relations. An example:

Mary helped the other passenger out of the cab. The man had asked her to help him because of his foot injury.

Anaphoric relations hold between noun phrases that refer to the same thing in the world.

In the above example, there are quite a few ways to resolve the identity of "the man", "him" and "his foot".

This issue is important in many applications, in particular in information extraction, where there is a need to keep track of participants.

The Reference problem vs. the Co-reference problem.


Summary

Linguistics is subdivided traditionally into
Phonetics (physical sounds of the language; consonants, vowels, intonation),
Phonology (how sounds are mentally represented),
Morphology, Syntax, Semantics and Pragmatics.

Most of the work within the statistics and learning-based approaches to natural language is done in the areas of Syntax, Semantics, and some Pragmatics, and this will be our main concern in this course as well.

Phonetics is also studied using related methods, within the Speech community, and the techniques we will present in this course could be used there, as well as in Morphology and Discourse analysis.