finite-state methods

600.465 - Intro to NLP - J. Eisner 1 Finite-State Methods

Upload: janine

Post on 22-Jan-2016



TRANSCRIPT

Page 1: Finite-State Methods

600.465 - Intro to NLP - J. Eisner 1

Finite-State Methods

Page 2: Finite-State Methods

Finite state acceptors (FSAs)

Things you may know about FSAs:
  Equivalence to regexps
  Union, Kleene *, concat, intersect, complement, reversal
  Determinization, minimization
  Pumping, Myhill-Nerode

[Figure: an FSA with arcs labeled a, ε, and c]

Defines the language a? c* = {a, ac, acc, accc, …, ε, c, cc, ccc, …}
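As a rough illustration of what "defines the language a? c*" means, here is a minimal Python sketch of an acceptor for that language. The two-state layout and the names TRANSITIONS, FINAL_STATES, and accepts are assumptions made for this example; the slide's own machine uses an ε arc, while this sketch is a deterministic equivalent.

```python
# Toy deterministic acceptor for the language a? c* (an assumed encoding,
# not the exact machine drawn on the slide).
# State 0: nothing read yet. State 1: an optional 'a' and/or some 'c's were read.
TRANSITIONS = {
    (0, "a"): 1,  # the optional leading a
    (0, "c"): 1,  # skip the a and start reading c's
    (1, "c"): 1,  # loop on c
}
FINAL_STATES = {0, 1}  # the empty string is in the language too


def accepts(string):
    """Return True if the acceptor ends in a final state after reading string."""
    state = 0
    for symbol in string:
        if (state, symbol) not in TRANSITIONS:
            return False  # no arc for this symbol: reject
        state = TRANSITIONS[(state, symbol)]
    return state in FINAL_STATES
```

For example, accepts("accc") and accepts("") hold, while accepts("ca") does not, because no arc reads an a after a c.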

Page 3: Finite-State Methods

n-gram models not good enough

Want to model grammaticality
A “training” sentence known to be grammatical:
  BOS mouse traps catch mouse traps EOS
  (trigram model must allow these trigrams)

Resulting trigram model has to overgeneralize:
  allows sentences with 0 verbs
    BOS mouse traps EOS
  allows sentences with 2 or more verbs
    BOS mouse traps catch mouse traps catch mouse traps catch mouse traps EOS

Can’t remember whether it’s in subject or object (i.e., whether it’s gotten to the verb yet)
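The overgeneralization can be checked mechanically. The sketch below treats the trigram model as a set of allowed trigrams collected from the one training sentence, which is a simplification (no probabilities or smoothing); the function names are made up for this example.

```python
def trigrams(tokens):
    """Return the set of all consecutive token triples."""
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}


TRAINING = "BOS mouse traps catch mouse traps EOS".split()
ALLOWED = trigrams(TRAINING)


def trigram_model_allows(sentence):
    """A sentence is allowed if every one of its trigrams was seen in training."""
    return trigrams(sentence.split()) <= ALLOWED
```

Under this simplification, both "BOS mouse traps EOS" (no verb) and the sentence with three verbs are allowed, because each of their trigrams already occurs in the training sentence.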

Page 4: Finite-State Methods

Finite-state models can “get it”

Want to model grammaticality
  BOS mouse traps catch mouse traps EOS

Finite-state can capture the generalization here: Noun+ Verb Noun+

[Figure: FSA for Noun+ Verb Noun+. Noun arcs lead through the preverbal states (still need a verb to reach final state); a Verb arc leads to the postverbal states (verbs no longer allowed), which again have Noun arcs.]

Allows arbitrarily long NPs (just keep looping around for another Noun modifier).

Still, never forgets whether it’s preverbal or postverbal! (Unlike 50-gram model)
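A small sketch of the Noun+ Verb Noun+ generalization, using Python's re module over a string of one-letter tags. The toy LEXICON (which fixes "traps" as a noun and "catch" as a verb) and the tag letters are assumptions for illustration only.

```python
import re

# Hypothetical toy lexicon: N = Noun, V = Verb.
LEXICON = {"mouse": "N", "traps": "N", "catch": "V"}

# Noun+ Verb Noun+ over the one-letter tag string.
PATTERN = re.compile(r"N+VN+")


def grammatical(sentence):
    """Tag each word with the toy lexicon and match the whole tag string."""
    words = sentence.split()
    if any(word not in LEXICON for word in words):
        return False
    tags = "".join(LEXICON[word] for word in words)
    return PATTERN.fullmatch(tags) is not None
```

Unlike the trigram check, this rejects "mouse traps" (no verb) and any sentence with two verbs, however long the noun sequences are.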

Page 5: Finite-State Methods

How powerful are regexps / FSAs?

More powerful than n-gram models
  The hidden state may “remember” arbitrary past context
  With k states, can remember which of k “types” of context it’s in

Equivalent to HMMs
  In both cases, you observe a sequence and it is “explained” by a hidden path of states
  The FSA states are like HMM tags

Appropriate for phonology and morphology
  Word = Syllable+
       = (Onset Nucleus Coda?)+
       = (C+ V+ C*)+
       = ( (b|d|f|…)+ (a|e|i|o|u)+ (b|d|f|…)* )+
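The syllable expression above can be written almost verbatim as a Python regexp. This is a sketch under the assumption that the consonants are all lowercase ASCII letters other than a, e, i, o, u (so y counts as a consonant); the slide elides the full consonant list.

```python
import re

# Assumed consonant set: every lowercase ASCII letter except a, e, i, o, u.
CONSONANT = "[b-df-hj-np-tv-z]"
VOWEL = "[aeiou]"

# Syllable = Onset Nucleus Coda? = C+ V+ C*
SYLLABLE = f"{CONSONANT}+{VOWEL}+{CONSONANT}*"

# Word = Syllable+
WORD = re.compile(f"(?:{SYLLABLE})+")


def is_word(string):
    """Return True if the whole string is one or more syllables."""
    return WORD.fullmatch(string) is not None
```

Note that this pattern requires a non-empty onset, so a vowel-initial string such as "apple" is rejected, exactly as the slide's expression would.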

Page 6: Finite-State Methods

How powerful are regexps / FSAs?

But less powerful than CFGs / pushdown automata
  Can’t do recursive center-embedding
  Hmm, humans have trouble processing those constructions too …

Finite-state can handle this pattern (can you write the regexp?):
  This is the rat that ate the malt.
  This is the cat that bit the rat that ate the malt.
  This is the dog that chased the cat that bit the rat that ate the malt.

but not this pattern, which requires a CFG:
  This is the malt that the rat ate.
  This is the malt that the rat that the cat bit ate.
  This is the malt that [the rat that [the cat that [the dog chased] bit] ate].

Page 7: Finite-State Methods

How powerful are regexps / FSAs?

But less powerful than CFGs / pushdown automata

More important: Less explanatory than CFGs
  A CFG without recursive center-embedding can be converted into an equivalent FSA, but the FSA will usually be far larger
  Because FSAs can’t reuse the same phrase type in different places

[Figure: S = an FSA for Noun+ Verb Noun+, in which the Noun structure is duplicated before and after the Verb arc. Compare NP = an FSA for Noun+, with S = NP Verb NP: more elegant, and using nonterminals like this is equivalent to a CFG. Converting to FSA copies the NP twice.]
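The duplication can be seen even at the regexp level. In this small sketch (names NP, S, and S_RE are chosen for the example, with N and V standing for Noun and Verb tags), the name NP is defined once, but expanding it into a plain regexp copies its body into both positions.

```python
import re

# NP is defined once, as a named sub-expression ...
NP = "(?:N+)"

# ... but expanding S = NP Verb NP into a plain regexp copies NP's body twice.
S = f"{NP}V{NP}"
S_RE = re.compile(S)

print(S)  # the expanded pattern contains the NP body in two places
```

A real phrase type is far larger than N+, so each extra use of the name multiplies the size of the compiled machine.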

Page 8: Finite-State Methods

Strings vs. String Pairs

FSA = “finite-state acceptor”
  Describes a language (which strings are grammatical?)

FST = “finite-state transducer”
  Describes a relation (which pairs of strings are related?)
    (underlying form, surface form)
    (sentence, translation)
    (original, edited)
    …

Page 9: Finite-State Methods

Example: Edit Distance

[Figure: edit-distance lattice. Horizontal axis: position in upper string (0 to 5), for the upper string c l a r a. Vertical axis: position in lower string (0 to 4), for the lower string c a c a. Horizontal arcs are labeled c:ε, l:ε, a:ε, r:ε, a:ε; vertical arcs are labeled ε:c or ε:a; diagonal arcs pair an upper character with a lower one, e.g. c:c, l:c, a:c, r:c, c:a, l:a, a:a, r:a.]

Cost of best path relating these two strings?
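The "cost of best path" question is the classic edit-distance dynamic program over the lattice. The sketch below assumes unit costs for insertion, deletion, and substitution (the slide does not state the arc costs) and assumes the two strings in the lattice are clara (upper) and caca (lower), as the arc labels suggest.

```python
def edit_distance(upper, lower, ins_cost=1, del_cost=1, sub_cost=1):
    """Cost of the cheapest lattice path from (0, 0) to (len(upper), len(lower)).

    Costs are assumptions for illustration: horizontal arcs (x:eps) delete an
    upper character, vertical arcs (eps:y) insert a lower character, and
    diagonal arcs (x:y) are free when x == y and cost sub_cost otherwise.
    """
    rows, cols = len(lower) + 1, len(upper) + 1
    best = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        best[0][j] = best[0][j - 1] + del_cost
    for i in range(1, rows):
        best[i][0] = best[i - 1][0] + ins_cost
        for j in range(1, cols):
            diagonal = 0 if upper[j - 1] == lower[i - 1] else sub_cost
            best[i][j] = min(
                best[i][j - 1] + del_cost,      # horizontal arc
                best[i - 1][j] + ins_cost,      # vertical arc
                best[i - 1][j - 1] + diagonal,  # diagonal arc
            )
    return best[-1][-1]
```

With these unit costs, edit_distance("clara", "caca") is 2: delete the l and substitute c for r.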

Page 10: Finite-State Methods

Example: Morphology

[Figure: parse-tree fragment. A VP node [head=vouloir,...] dominates a V node [head=vouloir, tense=Present, num=SG, person=P3], which dominates the word veut.]

Page 11: Finite-State Methods

Example: Unweighted transducer

[Figure: the parse-tree fragment from the previous slide: VP [head=vouloir,...] over V [head=vouloir, tense=Present, num=SG, person=P3] over veut.]

Finite-state transducer relating:
  canonical form + inflection codes:  vouloir +Pres +Sing +P3
  inflected form:                     veut

the relevant path:
  v o u l o i r +Pres +Sing +P3
  v e u t

slide courtesy of L. Karttunen (modified)

Page 12: Finite-State Methods

Example: Unweighted transducer

Finite-state transducer relating:
  canonical form + inflection codes:  vouloir +Pres +Sing +P3
  inflected form:                     veut

the relevant path:
  v o u l o i r +Pres +Sing +P3
  v e u t

Bidirectional: generation or analysis
Compact and fast
Xerox sells for about 20 languages including English, German, Dutch, French, Italian, Spanish, Portuguese, Finnish, Russian, Turkish, Japanese, ...
Research systems for many other languages, including Arabic, Malay

slide courtesy of L. Karttunen

Page 13: Finite-State Methods

Regular Relation (of strings)

Relation: like a function, but multiple outputs ok
Regular: finite-state
Transducer: automaton w/ outputs

[Figure: FST with arcs labeled b:b, a:a, a:ε, a:c, b:ε, b:b, ?:c, ?:a, ?:b]

  b → ?       {b}
  a → ?       {}
  aaaaa → ?   {ac, aca, acab, acabc}

Invertible? Closed under composition?

Page 14: Finite-State Methods

Regular Relation (of strings)

Can weight the arcs:
  b → {b}
  a → {}
  aaaaa → {ac, aca, acab, acabc}

How to find best outputs?
  For aaaaa?
  For all inputs at once?

[Figure: FST with arcs labeled b:b, a:a, a:ε, a:c, b:ε, b:b, ?:c, ?:a, ?:b]

Page 15: Finite-State Methods

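Function from strings to ...

              Acceptors (FSAs)    Transducers (FSTs)
Unweighted    {false, true}       strings
Weighted      numbers             (string, num) pairs

[Figure: an example machine in each cell. Unweighted acceptor: arcs a, ε, c. Unweighted transducer: arcs a:x, ε:y, c:z. Weighted acceptor: arcs a/.5, ε/.5, c/.7, plus the weight .3. Weighted transducer: arcs a:x/.5, ε:y/.5, c:z/.7, plus the weight .3.]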

Page 16: Finite-State Methods

Sample functions

              Acceptors (FSAs)           Transducers (FSTs)
Unweighted    {false, true}              strings
                Grammatical?               Markup
                                           Correction
                                           Translation
Weighted      numbers                    (string, num) pairs
                How grammatical?           Good markups
                Better, how likely?        Good corrections
                                           Good translations

Page 17: Finite-State Methods

Terminology (acceptors)

[Figure: diagram relating String, Regexp, FSA, and Regular language. Arc labels: “compiles into” and “implements” (between Regexp and FSA), “defines” and “recognizes” (into Regular language), “matches” (twice) and “accepts (or generates)” (into String).]

Page 18: Finite-State Methods

Terminology (transducers)

[Figure: the same diagram for transducers, relating String pair, Regexp, FST, and Regular relation. Arc labels: “compiles into” and “implements” (between Regexp and FST), “defines” and “recognizes” (into Regular relation), “matches” (twice), “accepts (or generates)”, and “(or, transduces one string of the pair into the other)” (into String pair). One arc is marked “??”.]

Page 19: Finite-State Methods

Perspectives on a Transducer

Remember these CFG perspectives: 3 views of a context-free rule
  generation (production):   S → NP VP   (randsent)
  parsing (comprehension):   S ← NP VP   (parse)
  verification (checking):   S = NP VP

Similarly, 3 views of a transducer:
  Given 0 strings, generate a new string pair (by picking a path)
  Given one string (upper or lower), transduce it to the other kind
  Given two strings (upper & lower), decide whether to accept the pair

FST just defines the regular relation (mathematical object: set of pairs). What’s “input” and “output” depends on what one asks about the relation. The 0, 1, or 2 given string(s) constrain which paths you can use.

  v o u l o i r +Pres +Sing +P3
  v e u t
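The three views can be made concrete with a toy relation. The sketch below stores the relation as an explicit finite set of (upper, lower) pairs rather than as an actual transducer, which is an assumption made to keep the example short; the pairs are French forms that appear in these slides, and the function names are invented for the example.

```python
import random

# Toy regular relation as an explicit set of (upper, lower) string pairs.
# A real FST would represent this set implicitly by its paths.
RELATION = {
    ("vouloir+Pres+Sing+P3", "veut"),
    ("payer+IndP+SG+P1", "paie"),
    ("payer+IndP+SG+P1", "paye"),
}


def generate(rng=random):
    """Given 0 strings: pick some pair in the relation."""
    return rng.choice(sorted(RELATION))


def transduce(upper=None, lower=None):
    """Given 1 string (upper or lower): return every string related to it."""
    if upper is not None:
        return {low for up, low in RELATION if up == upper}
    return {up for up, low in RELATION if low == lower}


def accept(upper, lower):
    """Given 2 strings: decide whether the pair is in the relation."""
    return (upper, lower) in RELATION
```

The same set answers all three questions; "input" and "output" are only a matter of which strings are supplied. For instance, transduce(upper="payer+IndP+SG+P1") returns both paie and paye, while transduce(lower="veut") performs analysis.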

Page 20: Finite-State Methods

Functions

[Figure: function f maps the string ab?d to abcd; function g is then applied to abcd.]

Page 21: Finite-State Methods

Functions

[Figure: the string ab?d is mapped straight to the final result by the composed function.]

Function composition: f ∘ g

[first f, then g: intuitive notation, but opposite of the traditional math notation]

Page 22: Finite-State Methods

From Functions to Relations

[Figure: relation f maps ab?d to abcd, abed, and abjd, with costs 3, 2, and 6; relation g maps each of these onward, with costs 4, 2, and 8; further outputs are elided (...).]

Page 23: Finite-State Methods

From Functions to Relations

Relation composition: f ∘ g

[Figure: ab?d is mapped directly to the final outputs; the paths carry the f costs 3, 2, 6 and the g costs 4, 2, 8; further outputs are elided (...).]

Page 24: Finite-State Methods

From Functions to Relations

Relation composition: f ∘ g

[Figure: ab?d is mapped directly to the final outputs, with summed costs 3+4, 2+2, and 6+8; further outputs are elided (...).]

Page 25: Finite-State Methods

From Functions to Relations

Often in NLP, all of the functions or relations involved can be described as finite-state machines, and manipulated using standard algorithms.

Pick min-cost or max-prob output

[Figure: of the outputs for ab?d, the one with cost 2+2 is selected.]
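The composition and min-cost selection on these slides can be sketched with weighted relations stored as nested dictionaries. The intermediate strings and the costs 3, 2, 6 and 4, 2, 8 come from the figure; the final output names out1, out2, out3 are placeholders invented here, because the transcript does not preserve them.

```python
def compose(f, g):
    """Compose weighted relations: first f, then g, adding costs along each path.

    If several paths lead to the same output, the cheapest one is kept.
    """
    h = {}
    for x, middles in f.items():
        for y, cost_f in middles.items():
            for z, cost_g in g.get(y, {}).items():
                total = cost_f + cost_g
                outputs = h.setdefault(x, {})
                outputs[z] = min(total, outputs.get(z, total))
    return h


def best_output(relation, x):
    """Return the (output, cost) pair with minimum cost for input x."""
    return min(relation[x].items(), key=lambda item: item[1])


f = {"ab?d": {"abcd": 3, "abed": 2, "abjd": 6}}
# "out1".."out3" are hypothetical names for the outputs of g.
g = {"abcd": {"out1": 4}, "abed": {"out2": 2}, "abjd": {"out3": 8}}

h = compose(f, g)
```

Here h maps ab?d to outputs with costs 3+4, 2+2, and 6+8, and best_output(h, "ab?d") selects the path through abed with cost 4. Replacing addition with multiplication and min with max would give the max-probability variant.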

Page 26: Finite-State Methods

Building a lexical transducer

[Figure: pipeline. The Regular Expression Lexicon (big | clear | clever | ear | fat | ...) goes through the Compiler to give the Lexicon FSA (a letter-by-letter network over the words). Regular Expressions for Rules go through the Compiler to give the Composed Rule FSTs. Composition of the two yields the Lexical Transducer (a single FST).]

one path:
  b i g +Adj +Comp
  b i g g e r

slide courtesy of L. Karttunen (modified)

Page 27: Finite-State Methods

Building a lexical transducer

Actually, the lexicon must contain elements like
  big +Adj +Comp

So write it as a more complicated expression:
    (big | clear | clever | fat | ...) +Adj (ε | +Comp | +Sup)    adjectives
  | (ear | father | ...) +Noun (+Sing | +Pl)                      nouns
  | ...                                                           ...

Q: Why do we need a lexicon at all?

[Figure: the Regular Expression Lexicon (big | clear | clever | ear | fat | ...) compiled into the Lexicon FSA.]

slide courtesy of L. Karttunen (modified)
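To see which strings the more complicated lexicon expression denotes, one can simply enumerate them. This sketch covers only the stems that the slide lists explicitly (the "..." parts are left out), and the function name lexicon is chosen for the example.

```python
from itertools import product

# Only the stems spelled out on the slide; the elided "..." entries are omitted.
ADJECTIVES = ["big", "clear", "clever", "fat"]
NOUNS = ["ear", "father"]


def lexicon():
    """Enumerate the upper-side strings denoted by the lexicon expression."""
    entries = set()
    # (big | clear | ...) +Adj (empty | +Comp | +Sup)
    for stem, degree in product(ADJECTIVES, ["", " +Comp", " +Sup"]):
        entries.add(f"{stem} +Adj{degree}")
    # (ear | father | ...) +Noun (+Sing | +Pl)
    for stem, number in product(NOUNS, [" +Sing", " +Pl"]):
        entries.add(f"{stem} +Noun{number}")
    return entries
```

With these stems the set has 4 × 3 + 2 × 2 = 16 entries, including "big +Adj +Comp", and excludes ill-formed combinations such as "ear +Adj".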

Page 28: Finite-State Methods

Inverting Relations

[Figure: relation f maps ab?d to abcd, abed, and abjd, with costs 3, 2, and 6; relation g maps each of these onward, with costs 4, 2, and 8; further outputs are elided (...).]

Page 29: Finite-State Methods

Inverting Relations

[Figure: the same diagram with the arrows reversed. g^-1 maps the final outputs back to abcd, abed, and abjd (costs 4, 2, 8), and f^-1 maps those back to ab?d (costs 3, 2, 6).]

Page 30: Finite-State Methods

Inverting Relations

(f ∘ g)^-1 = g^-1 ∘ f^-1

[Figure: the final outputs are mapped directly back to ab?d, with summed costs 3+4, 2+2, and 6+8; further outputs are elided (...).]
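The inversion identity is easy to check on a small unweighted example. In this sketch, relations are plain sets of pairs, composition follows the slides' "first f, then g" convention, and out1, out2, out3 are placeholder output names invented for the example.

```python
def compose(f, g):
    """First f, then g: pair x with z whenever f relates x to y and g relates y to z."""
    return {(x, z) for x, y in f for y2, z in g if y == y2}


def invert(relation):
    """Swap the two sides of every pair."""
    return {(y, x) for x, y in relation}


f = {("ab?d", "abcd"), ("ab?d", "abed"), ("ab?d", "abjd")}
# "out1".."out3" are hypothetical final outputs.
g = {("abcd", "out1"), ("abed", "out2"), ("abjd", "out3")}

lhs = invert(compose(f, g))
rhs = compose(invert(g), invert(f))
```

Both lhs and rhs relate each final output back to ab?d, so the two sets are equal, as the identity states.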

Page 31: Finite-State Methods

Weighted French Transducer

Weighted version of transducer: Assigns a weight to each string pair

“upper language”         “lower language”
payer+IndP+SG+P1         paie
payer+IndP+SG+P1         paye
suivre+Imp+SG+P2         suis
suivre+IndP+SG+P2        suis
suivre+IndP+SG+P1        suis
être+IndP+SG+P1          suis

[Figure: each pair is labeled with a weight; the weights shown are 4, 19, 20, 50, 3, 12.]

slide courtesy of L. Karttunen (modified)

Page 32: Finite-State Methods

Composition Cascades

You can build fancy noisy-channel models by composing transducers …

Examples:
  Phonological/morphological rewrite rules?
  English orthography → English phonology → Japanese phonology → Japanese orthography
    e.g. ??? goruhubaggu
  Information extraction

Page 33: Finite-State Methods


FASTUS – Information Extraction Appelt et al, 1992-?

Input: Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with …

Output:
  Relationship: TIE-UP
  Entities: “Bridgestone Sports Co.”
            “A local concern”
            “A Japanese trading house”
  Joint Venture Company: “Bridgestone Sports Taiwan Co.”
  Amount: NT$20000000

Page 34: Finite-State Methods

FASTUS: Successive Markups (details on subsequent slides)

Tokenization
  .o.
Multiwords
  .o.
Basic phrases (noun groups, verb groups …)
  .o.
Complex phrases
  .o.
Semantic Patterns
  .o.
Merging different references

Page 35: Finite-State Methods

FASTUS: Tokenization

Spaces, hyphens, etc.
  wouldn’t → would not
  their → them ’s
  company. → company .
  but Co. → Co.

Page 36: Finite-State Methods

FASTUS: Multiwords

“set up”
“joint venture”
“San Francisco Symphony Orchestra,” “Canadian Opera Company”
  … use a specialized regexp to match musical groups.
  ... what kind of regexp would match company names?
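The first two stages of the cascade can be imitated with ordinary function composition. This is only a sketch: FASTUS composes finite-state transducers, whereas the stages here are plain Python functions, and ABBREVIATIONS, MULTIWORDS, and the underscore-joining convention are assumptions based on the examples in these slides.

```python
from functools import reduce

ABBREVIATIONS = {"Co."}                           # keep the period attached
MULTIWORDS = {("set", "up"), ("joint", "venture")}


def tokenize(text):
    """Split on spaces, expand "wouldn't", and detach sentence-final periods."""
    tokens = []
    for raw in text.split():
        if raw == "wouldn't":
            tokens += ["would", "not"]
        elif raw.endswith(".") and len(raw) > 1 and raw not in ABBREVIATIONS:
            tokens += [raw[:-1], "."]
        else:
            tokens.append(raw)
    return tokens


def join_multiwords(tokens):
    """Merge known two-word expressions into single tokens, left to right."""
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in MULTIWORDS:
            out.append("_".join(pair))
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def cascade(*stages):
    """Compose stages left to right: the output of one feeds the next."""
    return lambda value: reduce(lambda acc, stage: stage(acc), stages, value)


pipeline = cascade(tokenize, join_multiwords)
```

Applied to "Bridgestone Sports Co. has set up a joint venture.", the pipeline keeps Co. intact, splits off the final period, and produces the tokens set_up and joint_venture.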

Page 37: Finite-State Methods

FASTUS: Basic phrases

Output looks like this (no nested brackets!):
  … [NG it] [VG had set_up] [NP a joint_venture] [Prep in]

  Company Name: Bridgestone Sports Co.
  Verb Group: said
  Noun Group: Friday
  Noun Group: it
  Verb Group: had set up
  Noun Group: a joint venture
  Preposition: in
  Location: Taiwan
  Preposition: with
  Noun Group: a local concern

Page 38: Finite-State Methods

FASTUS: Noun Groups

Build FSA to recognize phrases like
  approximately 5 kg
  more than 30 people
  the newly elected president
  the largest leftist political force
  a government and commercial project

Use the FSA for left-to-right longest-match markup

What does FSA look like? See next slide …

Page 39: Finite-State Methods

FASTUS: Noun Groups

Described with a kind of non-recursive CFG … (a regexp can include names that stand for other regexps)

  NG → Pronoun | Time-NP | Date-NP
  NG → (Det) (Adjs) HeadNouns
  …
  Adjs → sequence of adjectives maybe with commas, conjunctions, adverbs
  …
  Det → DetNP | DetNonNP
  DetNP → detailed expression to match “the only five, another three, this, many, hers, all, the most …”
  …
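A heavily simplified sketch of the rule NG → (Det) (Adjs) HeadNouns, used for left-to-right longest-match markup. The one-letter tags (D, A, N, V) and the tiny sub-expressions are assumptions for illustration; the real FASTUS definitions are far more detailed. Python's re is leftmost-first rather than leftmost-longest in general, but for this particular pattern the greedy match is also the longest one.

```python
import re

# Named sub-expressions over hypothetical one-letter tags:
# D = determiner, A = adjective or adverb, N = noun, V = verb.
DET = "D"
ADJS = "A+"
HEAD_NOUNS = "N+"

# NG -> (Det) (Adjs) HeadNouns, built by substituting the names above.
NG = re.compile(f"(?:{DET})?(?:{ADJS})?{HEAD_NOUNS}")


def mark_noun_groups(tagged):
    """Bracket noun groups in a list of (word, tag) pairs, left to right."""
    tags = "".join(tag for _, tag in tagged)
    out, i = [], 0
    while i < len(tagged):
        match = NG.match(tags, i)  # try to start a noun group at position i
        if match:
            words = " ".join(word for word, _ in tagged[i:match.end()])
            out.append(f"[NG {words}]")
            i = match.end()
        else:
            out.append(tagged[i][0])
            i += 1
    return " ".join(out)
```

Because the names are expanded by string substitution and never refer to themselves, the result is still an ordinary regexp, which is what makes the grammar "non-recursive". The output has no nested brackets.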

Page 40: Finite-State Methods

FASTUS: Semantic patterns

BusinessRelationship = NounGroup(Company/ies) VerbGroup(Set-up) NounGroup(JointVenture) with NounGroup(Company/ies) | …

ProductionActivity = VerbGroup(Produce) NounGroup(Product)

NounGroup(Company/ies) NounGroup & … is made easy by the processing done at a previous level

Use this for spotting references to put in the database.

Page 41: Finite-State Methods


Composition Cascades

You can build fancy noisy-channel models by composing transducers …

… now let’s turn to how you might build the individual transducers in the cascade. We’ll use a variety of operators that combine simpler transducers and acceptors into more complex ones. Composition is just one example.