cs60057 speech &natural language processing

67
Lecture 1, 7/21/2005 Natural Language Processing 1 CS60057 Speech &Natural Language Processing Autumn 2007 Lecture 3 27 July 2007

Upload: audra

Post on 13-Feb-2016

43 views

Category:

Documents


0 download

DESCRIPTION

CS60057 Speech &Natural Language Processing. Autumn 2007. Lecture 3 27 July 2007. Levels of (Formal) Description. 6 basic levels (more or less explicitly present in most theories) : and beyond (pragmatics/logic/...) meaning (semantics) (surface) syntax morphology phonology - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 1

CS60057Speech &Natural Language

Processing

Autumn 2007

Lecture 3

27 July 2007

Page 2: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 2

Levels of (Formal) Description 6 basic levels (more or less explicitly present in most theories):

and beyond (pragmatics/logic/...)

meaning (semantics)

(surface) syntax

morphology

phonology

phonetics/orthography

Each level has an input and output representation output from one level is the input to the next (upper) level sometimes levels might be skipped (merged) or split

Page 3: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 3

Phonetics/Orthography Input:

acoustic signal (phonetics) / text (orthography)

Output: phonetic alphabet (phonetics) / text (orthography)

Deals with: Phonetics:

consonant & vowel (& others) formation in the vocal tract classification of consonants, vowels, ... in relation to frequencies,

shape & position of the tongue and various muscles intonation

Orthography: normalization, punctuation, etc.

Page 4: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 4

Phonology Input:

sequence of phones/sounds (in a phonetic alphabet); or “normalized” text (sequence of (surface) letters in one language’s alphabet) [NB: phones vs. phonemes]

Output: sequence of phonemes (~ (lexical) letters; in an abstract alphabet)

Deals with: relation between sounds and phonemes (units which might have

some function on the upper level) e.g.: [u] ~ oo (as in book), [æ] ~ a (cat); i ~ y (flies)

Page 5: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 5

Morphology

Input: sequence of phonemes (~ (lexical) letters)

Output: sequence of pairs (lemma, (morphological) tag)

Deals with: composition of phonemes into word forms and their

underlying lemmas (lexical units) + morphological categories (inflection, derivation, compounding)

e.g. quotations ~ quote/V + -ation(der.V->N) + NNS.

Page 6: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 6

(Surface) Syntax Input:

sequence of pairs (lemma, (morphological) tag) Output:

sentence structure (tree) with annotated nodes (all lemmas, (morphosyntactic) tags, functions), of various forms

Deals with: the relation between lemmas & morphological categories and

the sentence structure uses syntactic categories such as Subject, Verb, Object,... e.g.: I/PP1 see/VB a/DT dog/NN ~ ((I/sg)SB ((see/pres)V (a/ind dog/sg)OBJ)VP)S

Page 7: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 7

Meaning (semantics) Input:

sentence structure (tree) with annotated nodes (lemmas, (morphosyntactic) tags, surface functions)

Output: sentence structure (tree) with annotated nodes (semantic

lemmas, (morpho-syntactic) tags, deep functions) Deals with:

relation between categories such as “Subject”, “Object” and (deep) categories such as “Agent”, “Effect”; adds other cat’s

e.g. ((I)SB ((was seen)V (by Tom)OBJ)VP)S ~ (I/Sg/Pat/t (see/Perf/Pred/t) Tom/Sg/Ag/f)

Page 8: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 8

...and Beyond

Input: sentence structure (tree): annotated nodes (autosemantic

lemmas, (morphosyntactic) tags, deep functions) Output:

logical form, which can be evaluated (true/false) Deals with:

assignment of objects from the real world to the nodes of the sentence structure

e.g.: (I/Sg/Pat/t (see/Perf/Pred/t) Tom/Sg/Ag/f) ~ see(Mark-Twain[SSN:...],Tom-Sawyer[SSN:...])[Time:bef 99/9/27/14:15][Place:39ş19’40”N76ş37’10”W]

Page 9: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 9

Three Views

Three equivalent formal ways to look at what we’re up to (not including tables)

Regular Expressions

Regular LanguagesFinite State Automata

Page 10: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 10

Transition

Finite-state methods are particularly useful in dealing with a lexicon.

Lots of devices, some with limited memory, need access to big lists of words.

So we’ll switch to talking about some facts about words and then come back to computational methods

Page 11: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 11

MORPHOLOGY

Page 12: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 12

Morphology Morphology is the study of the ways that words are built up from

smaller meaningful units called morphemes (morph = shape, logos = word)

We can usefully divide morphemes into two classes Stems: The core meaning bearing units Affixes: Bits and pieces that adhere to stems to change their

meanings and grammatical functions Prefix: un-, anti-, etc Suffix: -ity, -ation, etc Infix: are inserted inside the stem

Tagalog: um + hingi humingi Circumfixes – precede and follow the stem

English doesn’t stack more affixes. But Turkish can have words with a lot of suffixes. Languages, such as Turkish, tend to string affixes together are

called agglutinative languages.

Page 13: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 13

Surface and Lexical Forms

The surface level of a word represents the actual spelling of that word. geliyorum eats cats kitabım

The lexical level of a word represents a simple concatenation of morphemes making up that word. gel +PROG +1SG eat +AOR cat +PLU kitap +P1SG

Morphological processors try to find correspondences between lexical and surface forms of words. Morphological recognition/ analysis – surface to lexical Morphological generation/ synthesis – lexical to surface

Page 14: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 14

Morphology: Morphemes & Order

Handles what is an isolated form in written text

Grouping of phonemes into morphemes sequence deliverables deliver, able and s (3

units)

Morpheme Combination certain combinations/sequencing possible, other not:

deliver+able+s, but not able+derive+s; noun+s, but not noun+ing typically fixed (in any given language)

Page 15: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 15

Inflectional & Derivational Morphology We can also divide morphology up into two broad classes

Inflectional Derivational

Inflectional morphology concerns the combination of stems and affixes where the resulting word Has the same word class as the original Serves a grammatical/semantic purpose different from the original

After a combination with an inflectional morpheme, the meaning and class of the actual stem usually do not change. eat / eats pencil / pencils

After a combination with an derivational morpheme, the meaning and the class of the actual stem usually change. compute / computer do / undo friend / friendly Uygar / uygarlaş kapı / kapıcı

The irregular changes may happen with derivational affixes.

Page 16: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 16

Morphological Parsing Morphological parsing is to find the lexical form of a word

from its surface form. cats -- cat +N +PLU cat -- cat +N +SG goose -- goose +N +SG or goose +V geese -- goose +N +PLU gooses -- goose +V +3SG catch -- catch +V caught -- catch +V +PAST or catch +V +PP

There can be more than one lexical level representation for a given word. (ambiguity)

Page 17: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 17

Morphological Analysis

Analyzing words into their linguistic components (morphemes). Morphemes are the smallest meaningful units of language.

cars car+PLUgiving give+PROGAsachhilAma AsA+PROG+PAST+1st I/We was/were coming

Ambiguity: More than one alternativesflies flyVERB+PROG

flyNOUN+PLU

mAtAlakare

Page 18: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 18

Fly + s flys flies (y i rule) Duckling

Go-getter get + erDoer do + erBeer ?

What knowledge do we need?How do we represent it?How do we compute with it?

Page 19: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 19

Knowledge needed

Knowledge of stems or roots Duck is a possible root, not ducklWe need a dictionary (lexicon)

Only some endings go on some words Do + er ok Be + er – not ok

In addition, spelling change rules that adjust the surface form Get + er – double the t getter Fox + s – insert e – foxes Fly + s – insert e – flys – y to i – flies Chase + ed – drop e - chased

Page 20: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 20

Put all this in a big dictionary (lexicon)

Turkish – approx 600 106 forms Finnish – 107

Hindi, Bengali, Telugu, Tamil? Besides, always novel forms can be constructed

Anti-missile Anti-anti-missile

Anti-anti-anti-missile ……..

Compounding of words – Sanskrit, German

Page 21: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 21

Morphology: From Morphemes to Lemmas & Categories Lemma: lexical unit, “pointer” to lexicon

typically is represented as the “base form”, or “dictionary headword”

possibly indexed when ambiguous/polysemous: state1 (verb), state2 (state-of-the-art), state3 (government)

from one or more morphemes (“root”, “stem”, “root+derivation”, ...)

Categories: non-lexical small number of possible values (< 100, often < 5-10)

Page 22: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 22

Morphology Level: The Mapping

Formally: A+ 2(L,C1,C2,...,Cn)

A is the alphabet of phonemes (A+ denotes any non-empty sequence of phonemes)

L is the set of possible lemmas, uniquely identified Ci are morphological categories, such as:

grammatical number, gender, case person, tense, negation, degree of comparison, voice, aspect, ... tone, politeness, ... part of speech (not quite morphological category, but...)

A, L and Ci are obviously language-dependent

Page 23: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 23

Morphological Analysis (cont.)

Relatively simple for English. But for many Indian languages, it may be more difficult.

Examples

Inflectional and Derivational Morphology. Common tools: Finite-state transducers

Page 24: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 24

Bengali Verb Paradigms

Page 25: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 25

Bengali Verb morphology for one of the paradigms

Page 26: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 26

Page 27: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 27

Page 28: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 28

Finite State Machines

FSAs are equivalent to regular languages FSTs are equivalent to regular relations (over pairs of

regular languages) FSTs are like FSAs but with complex labels. We can use FSTs to transduce between surface and

lexical levels.

Page 29: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 29

Simple Rules

Page 30: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 30

Adding in the Words

Page 31: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 31

Derivational Rules

Page 32: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 32

Parsing/Generation vs. Recognition

Recognition is usually not quite what we need. Usually if we find some string in the language we need

to find the structure in it (parsing) Or we have some structure and we want to produce a

surface form (production/generation) Example

From “cats” to “cat +N +PL” and back

Page 33: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 33

Morphological Parsing

Given the input cats, we’d like to outputcat +N +Pl, telling us that cat is a plural noun.

Given the Spanish input bebo, we’d like to outputbeber +V +PInd +1P +Sg telling us that bebo is the present indicative first person singular form of the Spanish verb beber, ‘to drink’.

Page 34: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 34

Morphological Anlayser

To build a morphological analyser we need: lexicon: the list of stems and affixes, together with basic information

about them morphotactics: the model of morpheme ordering (eg English plural

morpheme follows the noun rather than a verb) orthographic rules: these spelling rules are used to model the

changes that occur in a word, usually when two morphemes combine (e.g., fly+s = flies)

Page 35: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 35

Lexicon & Morphotactics

Typically list of word parts (lexicon) and the models of ordering can be combined together into an FSA which will recognise the all the valid word forms.

For this to be possible the word parts must first be classified into sublexicons.

The FSA defines the morphotactics (ordering constraints).

Page 36: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 36

Sublexicons to classify the list of word parts

reg-noun irreg-pl-noun irreg-sg-noun plural

cat mice mouse -s

fox sheep sheep

geese goose

Page 37: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 37

FSA Expresses Morphotactics (ordering model)

Page 38: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 38

Towards the Analyser

We can use lexc or xfst to build such an FSA (see lex1.lexc)

To augment this to produce an analysis we must create a transducer Tnum which maps between the lexical level and an "intermediate" level that is needed to handle the spelling rules of English.

Page 39: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 39

Three Levels of Analysis

Page 40: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 40

1. Tnum: Noun Number Inflection

• multi-character symbols• morpheme boundary ^• word boundary #

Page 41: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 41

Intermediate Form to Surface

The reason we need to have an intermediate form is that funny things happen at morpheme boundaries, e.g.cat^s catsfox^s foxesfly^s flies

The rules which describe these changes are called orthographic rules or "spelling rules".

Page 42: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 42

More English Spelling Rules

consonant doubling: beg / begging y replacement: try/tries k insertion: panic/panicked e deletion: make/making e insertion: watch/watches Each rule can be stated in more detail ...

Page 43: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 43

Spelling Rules

Chomsky & Halle (1968) invented a special notation for spelling rules.

A very similar notation is embodied in the "conditional replacement" rules of xfst.

E -> F || L _ Rwhich means replace E with F when it appears between left context L and right context R

Page 44: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 44

A Particular Spelling Rule

This rule does e-insertion

^ -> e || x _ s#

Page 45: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 45

e insertion over 3 levels

The rule corresponds to the mapping betweensurface and intermediate levels

Page 46: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 46

e insertion as an FST

Page 47: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 47

Incorporating Spelling Rules

Spelling rules, each corresponding to an FST, can be run in parallel provided that they are "aligned".

The set of spelling rules is positioned between the surface level and the intermediate level.

Parallel execution of FSTs can be carried out: by simulation: in this case FSTs must first be aligned. by first constructing a a single FST corresponding to their

intersection.

Page 48: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 48

Putting it all together

execution of FSTi

takes place in parallel

Page 49: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 49

Kaplan and KayThe Xerox View

FSTi are alignedbut separate

FSTi intersectedtogether

Page 50: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 50

Finite State Transducers

The simple story Add another tape Add extra symbols to the transitions

On one tape we read “cats”, on the other we write “cat +N +PL”, or the other way around.

Page 51: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 51

FSTs

Page 52: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 52

English Plural

surface lexical

cat cat+N+Sg

cats cat+N+Pl

foxes fox+N+Pl

mice mouse+N+Pl

sheep sheep+N+Plsheep+N+Sg

Page 53: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 53

Transitions

c:c means read a c on one tape and write a c on the other +N:ε means read a +N symbol on one tape and write nothing on the

other +PL:s means read +PL and write an s

c:c a:a t:t +N:ε +PL:s

Page 54: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 54

Typical Uses

Typically, we’ll read from one tape using the first symbol on the machine transitions (just as in a simple FSA).

And we’ll write to the second tape using the other symbols on the transitions.

Page 55: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 55

Ambiguity

Recall that in non-deterministic recognition multiple paths through a machine may lead to an accept state. Didn’t matter which path was actually traversed

In FSTs the path to an accept state does matter since differ paths represent different parses and different outputs will result

Page 56: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 56

Ambiguity

What’s the right parse for Unionizable Union-ize-able Un-ion-ize-able

Each represents a valid path through the derivational morphology machine.

Page 57: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 57

Ambiguity

There are a number of ways to deal with this problem Simply take the first output found Find all the possible outputs (all paths) and return

them all (without choosing) Bias the search so that only one or a few likely paths

are explored

Page 58: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 58

The Gory Details

Of course, its not as easy as “cat +N +PL” <-> “cats”

As we saw earlier there are geese, mice and oxen But there are also a whole host of spelling/pronunciation

changes that go along with inflectional changes Cats vs Dogs Fox and Foxes

Page 59: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 59

Multi-Tape Machines

To deal with this we can simply add more tapes and use the output of one tape machine as the input to the next

So to handle irregular spelling changes we’ll add intermediate tapes with intermediate symbols

Page 60: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 60

Generativity

Nothing really privileged about the directions. We can write from one and read from the other or vice-

versa. One way is generation, the other way is analysis

Page 61: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 61

Multi-Level Tape Machines

We use one machine to transduce between the lexical and the intermediate level, and another to handle the spelling changes to the surface tape

Page 62: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 62

Lexical to Intermediate Level

Page 63: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 63

Intermediate to Surface

The add an “e” rule as in fox^s# <-> foxes#

Page 64: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 64

Foxes

Page 65: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 65

Note

A key feature of this machine is that it doesn’t do anything to inputs to which it doesn’t apply.

Meaning that they are written out unchanged to the output tape.

Turns out the multiple tapes aren’t really needed; they can be compiled away.

Page 66: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 66

Overall Scheme

We now have one FST that has explicit information about the lexicon (actual words, their spelling, facts about word classes and regularity). Lexical level to intermediate forms

We have a larger set of machines that capture orthographic/spelling rules. Intermediate forms to surface forms

Page 67: CS60057 Speech &Natural Language Processing

Lecture 1, 7/21/2005 Natural Language Processing 67

Overall Scheme