Finite-State Methods in Natural Language Processing

Lauri Karttunen, LSA 2005 Summer Institute, July 25, 2005


Page 1: Finite-State Methods in Natural Language Processing

Finite-State Methods in Natural Language Processing

Lauri Karttunen

LSA 2005 Summer Institute

July 25, 2005

Page 2: Finite-State Methods in Natural Language Processing

Course Outline

July 18:
Intro to computational morphology
XFST

Readings
Lauri Karttunen, “Finite-State Constraints”, The Last Phonological Rule. J. Goldsmith (ed.), pages 173-194, University of Chicago Press, 1993.

Karttunen and Beesley, “25 Years of Finite-State Morphology”

Chapter 1: “Gentle Introduction” (B&K)

July 20:
Regular expressions
More on XFST

Readings
Chapter 2: “Systematic Introduction”
Chapter 3: “The XFST interface”

Page 3: Finite-State Methods in Natural Language Processing

July 25:
More on XFST: Date Parser
Concatenative morphotactics: The LEXC language

Readings
Chapter 4. “The LEXC Language”

July 27:
Constraining non-local dependencies: Flag Diacritics
Non-concatenative morphotactics: Reduplication, interdigitation

Readings
Chapter 5. “Flag Diacritics”
Chapter 8. “Non-Concatenative Morphotactics”

Page 4: Finite-State Methods in Natural Language Processing

August 1:
Realizational morphology

Readings
Gregory T. Stump. Inflectional Morphology. A Theory of Paradigm Structure. Cambridge U. Press. 2001. (An excerpt)

Lauri Karttunen, “Computing with Realizational Morphology”, Lecture Notes in Computer Science, Volume 2588, Alexander Gelbukh (ed.), 205-216, Springer Verlag. 2003.

August 3:
Optimality theory

Readings
Paul Kiparsky “Finnish Noun Inflection” Generative Approaches to Finnic and Saami Linguistics, Diane Nelson and Satu Manninen (eds.), pp.109-161, CSLI Publications, 2003.

Nine Elenbaas and René Kager. "Ternary rhythm and the lapse constraint". Phonology 16. 273-329.

Page 5: Finite-State Methods in Natural Language Processing

Solution to Assignment 1, Part 1

define Hundreds [OneToNine { hundred} ({ } OneToNinetyNine)];

define OneTo999 [OneToNine | Teens | Tens | Hundreds ];

define Thousands [OneTo999 { thousand} ({ } OneTo999)];

define UpToMillion [OneToNine | Teens | Tens | Hundreds | Thousands ];

Page 6: Finite-State Methods in Natural Language Processing

What is this?

xfst[0]: source Dutch.script
xfst[1]: print random-lower 3
tweeennegentig
vierenveertig
eenennegentig
xfst[1]: define Dutch

xfst[0]: source English.script
xfst[1]: print random-lower 3
twenty-seven
ninety-one
forty-five
xfst[1]: define English

xfst[0]: regex Dutch.i .o. English ;
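One way to read the expression, sketched in Python rather than XFST: if each network pairs a number on its upper side with the spelled-out numeral on its lower side, then inverting the Dutch network and composing it with the English one maps Dutch numerals to English numerals. The digit strings on the upper side and the four-entry dictionaries are assumptions made for illustration; the slides only show the lower side.

```python
# Toy stand-ins for the Dutch and English numeral networks.
# Assumption: upper side = digits, lower side = spelled-out numeral.
DUTCH = {
    "27": "zevenentwintig",
    "44": "vierenveertig",
    "45": "vijfenveertig",
    "91": "eenennegentig",
}
ENGLISH = {
    "27": "twenty-seven",
    "44": "forty-four",
    "45": "forty-five",
    "91": "ninety-one",
}

def invert(relation):
    """Swap upper and lower sides, like the .i operator."""
    return {lower: upper for upper, lower in relation.items()}

def compose(first, second):
    """Feed the lower side of `first` into the upper side of `second`, like .o."""
    return {a: second[b] for a, b in first.items() if b in second}

dutch_to_english = compose(invert(DUTCH), ENGLISH)
print(dutch_to_english["vierenveertig"])  # forty-four
```

A real transducer covers an unbounded relation and need not be one-to-one, so the dictionaries only hint at the idea.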

Page 7: Finite-State Methods in Natural Language Processing

Syllabification

define C [ b | c | d | f ...
define V [ a | e | i | o | u ];

s t r u k t u r a l i s m i
s t r u k . t u . r a . l i s . m i

[C* V+ C*] @-> ... "." || _ [C V]

“Insert a period after the longest instance of the C* V+ C* pattern in front of a C V pattern.”
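The rule can be approximated with an ordinary regular expression, as a rough Python sketch. It is not the XFST implementation: the right context becomes a lookahead, and greedy quantifiers stand in for the left-to-right, longest-match behavior of @->. The full consonant list is an assumption, since the slide elides it.

```python
import re

C = "[bcdfghjklmnpqrstvwxz]"  # assumed consonant inventory
V = "[aeiou]"

# C* V+ C* followed (but not consumed) by C V. Greedy matching scans left
# to right and gives back consonants only as far as the context requires.
SYLLABLE = re.compile(f"{C}*{V}+{C}*(?={C}{V})")

def syllabify(word):
    """Insert "." after every match of SYLLABLE."""
    return SYLLABLE.sub(lambda m: m.group(0) + ".", word)

print(syllabify("strukturalismi"))  # struk.tu.ra.lis.mi
```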

Page 8: Finite-State Methods in Natural Language Processing

Finnish Syllabification

# -*- coding: utf8 -*-

define FinnWords [{kala}|{riippuu}|{tietoinen}|{sataa}|
                  {satoi}|{saata}|{saatoin}|{auta}|{laiva}|
                  {leipä}|{häijy}|{koulu}|{köyhä}|{lea}|
                  {viestien}|{tuote}|{virtu.ositeetti}|
                  {laukaus}|{lakkautan}|{voimistelijoiden}|
                  {heittäen}|{heittäisin}|{laulaen}];

define HighV [u | y | i];            # High vowel
define MidV [e | o | ö];             # Mid vowel
define LowV [a | ä] ;                # Low vowel
define V [HighV | MidV | LowV];      # Vowel

define LongV [{aa}|{ee}|{ii}|{oo}|{uu}|{yy}|{ää}|{öö}];
define Diph [[[MidV | LowV] HighV]|{ie}|{uo}|{yö}];

Page 9: Finite-State Methods in Natural Language Processing

Syllabification (Continued)

define Nuc [V | LongV | Diph];

define C [b | c | d | f | g | h | j | k | l | m |
          n | p | q | r | s | t | v | w | x | z];

define Syllabify [ C* Nuc C* @-> ... "." || _ C V ] ;

regex FinnWords .o. Syllabify ;

print lower-words

Page 10: Finite-State Methods in Natural Language Processing

Syllabification (continued)

Problem cases

Incorrect    Correct
lea          le.a
lau.laen     lau.la.en
lau.kaus     lau.ka.us

define Syllabify [ C* Nuc C* @-> ... "." || _ C V
                   .o.
                   [. .] -> "." || [a | ä | i] _ [e | u | y] (C) .#. ,
                                   e _ a ] ;
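The two-stage cascade can be imitated in Python as a rough check on the problem cases. This is a sketch under assumptions, not the XFST network: composition becomes two successive substitutions, the right context becomes a lookahead, and the regex alternation tries longer nuclei first to mimic longest match.

```python
import re

C = "[bcdfghjklmnpqrstvwxz]"
V = "[aeiouyäö]"
LONG_V = "(?:aa|ee|ii|oo|uu|yy|ää|öö)"
DIPH = "(?:[eoöaä][uyi]|ie|uo|yö)"   # mid or low vowel + high vowel, or ie/uo/yö
NUC = f"(?:{LONG_V}|{DIPH}|{V})"     # longer nuclei are tried first

# Stage 1: C* Nuc C* @-> ... "." || _ C V
STAGE1 = re.compile(f"{C}*{NUC}{C}*(?={C}{V})")
# Stage 2: insert "." between a/ä/i and a word-final e/u/y (C),
# or between e and a.
STAGE2 = re.compile(f"(?<=[aäi])(?=[euy]{C}?$)|(?<=e)(?=a)")

def syllabify(word):
    first = STAGE1.sub(lambda m: m.group(0) + ".", word)
    return STAGE2.sub(".", first)

for word in ["lea", "laulaen", "laukaus", "tietoinen"]:
    print(syllabify(word))
```

For the three problem cases this yields le.a, lau.la.en, and lau.ka.us, matching the "Correct" column.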

Page 11: Finite-State Methods in Natural Language Processing

Parsing Dates

Best result:
Today is [Monday, July 25, 2005].

Bad results:
Today is Monday, [July 25, 2005].
Today is [Monday, July 25], 2005.
Today is Monday, [July 25], 2005.
Today is [Monday], July 25, 2005.

Need left-to-right, longest-match constraints.

Page 12: Finite-State Methods in Natural Language Processing

Defining the Language of Dates

define OneToNine [1|2|3|4|5|6|7|8|9];

define ZeroToNine ["0"|OneToNine];

define Day [{Monday} | {Tuesday} | {Wednesday} | {Thursday} | {Friday} | {Saturday} | {Sunday}] ;

define Month29 {February};

define Month30 [{April} | {June} | {September} | {November}];

define Month31 [{January} | {March} | {May} | {July} | {August} | {October} | {December}] ;

define Month [Month29 | Month30 | Month31];

Page 13: Finite-State Methods in Natural Language Processing

Language of Dates (Continued)

# Date is a number from 1 to 31
define Date [OneToNine | [1 | 2] ZeroToNine | 3 [%0 | 1]];

# Year is a number from 1 to 9999 (watch out for the Y10K bug!)
define Year [OneToNine ZeroToNine^<4];

# A date expression consists of a Day (Monday) or a Month and a Date (July 25) with an optional Day (Monday, July 25) and Year (July 25, 2005) or both (Monday, July 25, 2005).
define AllDates [Day | (Day {, }) Month { } Date ({, } Year)];

Page 14: Finite-State Methods in Natural Language Processing

All Dates from 1.1.1 to 31.12.9999

[Figure: state diagram of the AllDates network, with arcs labeled by day names, month names, digits, and commas]

13 states, 96 arcs
29 760 007 date expressions
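The count of date expressions can be checked by hand from the AllDates definition. The following Python arithmetic is only a sanity check, assuming twelve distinct month names, Date ranging over 1 to 31, and Year over 1 to 9999.

```python
days = 7                                     # Monday ... Sunday
months = 12                                  # January ... December
dates = 9 + 2 * 10 + 2                       # 1-9, 10-29, 30-31
years = sum(9 * 10 ** k for k in range(4))   # 1-9999, no leading zero

# AllDates = Day | (Day ", ") Month " " Date (", " Year)
month_date = months * dates
total = days + (1 + days) * month_date * (1 + years)
print(total)  # 29760007
```

The factors (1 + days) and (1 + years) account for the optional weekday and the optional year.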

Page 15: Finite-State Methods in Natural Language Processing

Parser for Dates

AllDates @-> "<DT>" ... "</DT>"

Compiles into an unambiguous transducer (136 states, 2798 arcs).

Today is <DT>Monday, July 25, 2005</DT> because yesterday was <DT>Sunday</DT> and it was <DT>July 24</DT> so tomorrow must be <DT>Tuesday, July 26</DT> and not <DT>July 27</DT> as it says on the program.
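The effect of the markup rule can be approximated in Python with a regular expression. This is a sketch, not the compiled transducer: Python's `re` is leftmost-first rather than longest-match, so the alternatives are ordered to try the longer date pattern before a bare weekday. The constant and function names are illustrative.

```python
import re

DAY = "(?:Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)"
MONTH = ("(?:January|February|March|April|May|June|July|August|"
         "September|October|November|December)")
DATE = "(?:3[01]|[12][0-9]|[1-9])"   # 1 to 31, two-digit forms first
YEAR = "(?:[1-9][0-9]{0,3})"         # 1 to 9999

# (Day ", ") Month " " Date (", " Year) is tried before a bare Day.
ALL_DATES = re.compile(f"(?:{DAY}, )?{MONTH} {DATE}(?:, {YEAR})?|{DAY}")

def tag_dates(text):
    """Wrap each leftmost, longest date expression in <DT> ... </DT>."""
    return ALL_DATES.sub(lambda m: f"<DT>{m.group(0)}</DT>", text)

print(tag_dates("Today is Monday, July 25, 2005 because yesterday was Sunday."))
# Today is <DT>Monday, July 25, 2005</DT> because yesterday was <DT>Sunday</DT>.
```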

Page 16: Finite-State Methods in Natural Language Processing

Problem of Reference

Valid dates
Monday, July 25, 2005
Tuesday, February 29, 2000
Monday, September 16, 1996

Invalid dates
Wednesday, April 31, 1996
Thursday, February 29, 1900
Tuesday, July 25, 2005
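The three ways a well-formed date expression can fail to refer (too many days for the month, February 29 in a non-leap year, wrong weekday) can be checked with Python's standard `datetime` module. This is calendar arithmetic rather than a finite-state method, shown only to confirm the classification above; the function name `refers` is made up for the example.

```python
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

def refers(weekday, month, day, year):
    """True if the expression names a real day with the stated weekday."""
    try:
        d = date(year, MONTHS.index(month) + 1, day)
    except ValueError:              # e.g. April 31 or February 29, 1900
        return False
    return WEEKDAYS[d.weekday()] == weekday   # date.weekday(): Monday is 0

print(refers("Monday", "July", 25, 2005))       # True
print(refers("Thursday", "February", 29, 1900)) # False
```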

Page 17: Finite-State Methods in Natural Language Processing

Refinement by Intersection

[Figure: ValidDates shown as the part of AllDates that also satisfies the three constraints below]

AllDates
MaxDays In Month: ~$[Month29 { 30}];
LeapYears: Feb 29 => _ ...
WeekdayDate
ValidDates

Page 18: Finite-State Methods in Natural Language Processing

MaxDays

define MaxDays30 ~$[Month29 { 30}];

define MaxDays31 ~$[[Month29 | Month30] { 31}];

define MaxDays [MaxDays30 & MaxDays31];

Page 19: Finite-State Methods in Natural Language Processing

LeapYear constraint

define Even [{0} | 2 | 4 | 6 | 8] ;
define Odd [1 | 3 | 5 | 7 | 9] ;
define N [Even | Odd];

define Div4 [4 | 8 | N* [Even [%0 | 4 | 8] | Odd [2 | 6]]];

define LeapYear [Div4 - [[N+ - Div4] {00}]] ;
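That the purely string-based definitions really pick out the Gregorian leap years can be checked with a small Python transcription. The regex mirrors Div4 and the function mirrors the subtraction in LeapYear; `calendar.isleap` from the standard library serves as the reference. The function names are illustrative.

```python
import calendar
import re

# Div4: 4, 8, or any digit string ending in Even[0 4 8] or Odd[2 6].
DIV4 = re.compile(r"[48]|[0-9]*(?:[02468][048]|[13579][26])")

def in_div4(digits):
    return DIV4.fullmatch(digits) is not None

def in_leap_year(digits):
    """Div4 - [[N+ - Div4] {00}]"""
    if not in_div4(digits):
        return False
    if digits.endswith("00"):
        prefix = digits[:-2]
        if prefix and not in_div4(prefix):   # prefix is in [N+ - Div4]
            return False
    return True

agrees = all(in_leap_year(str(y)) == calendar.isleap(y) for y in range(1, 10000))
print(agrees)  # True
```

The century rule falls out of the subtraction: 1900 is removed because "19" is not in Div4, while 2000 survives because "20" is.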

Page 20: Finite-State Methods in Natural Language Processing

LeapYear Constraint (Continued)

Bad Solution 1
define LeapDates {February 29, } => _ LeapYear ;

Bad Solution 2
define NotLeapYear [Year - LeapYear];
define LeapDates ~$[{February 29, } NotLeapYear];

Almost Correct
define LeapDates [{February 29, } => _ [?* - [NotLeapYear [\N]*]]];

Good Solution
define LeapDates [{February 29, } => _ [?* - [NotLeapYear [\N]*]] .#.];

Page 21: Finite-State Methods in Natural Language Processing

Vacuous Context Conditions

A context condition L _ R is compiled as ?* L _ R ?*.

Any expression that contains the empty string is “swallowed up” when concatenated with ?*.

(a) ?* == ?* (a) == ?*
[?* - a] ?* == ?* [?* - a] == ?*
~a ?* == ?* ~a == ?*

Not vacuous:
a -> b || _ c* [.#.| \c] ;
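The "swallowing" effect can be seen by brute force over short strings. In this Python sketch a language is modeled as a predicate on strings and concatenation with ?* as a search over split points; this modeling choice and the three-letter alphabet are assumptions made for illustration.

```python
from itertools import product

def followed_by_anything(accepts):
    """Acceptor for the concatenation X ?*, given an acceptor for X."""
    return lambda s: any(accepts(s[:i]) for i in range(len(s) + 1))

def optional_a(s):      # (a)
    return s in ("", "a")

def anything_but_a(s):  # [?* - a], the same language as ~a
    return s != "a"

def just_a(s):          # a, which does not contain the empty string
    return s == "a"

strings = ["".join(p) for n in range(4) for p in product("abc", repeat=n)]

swallowed_1 = all(followed_by_anything(optional_a)(s) for s in strings)
swallowed_2 = all(followed_by_anything(anything_but_a)(s) for s in strings)
not_swallowed = all(followed_by_anything(just_a)(s) for s in strings)
print(swallowed_1, swallowed_2, not_swallowed)  # True True False
```

Because the empty prefix always satisfies the first two predicates, every string is accepted, which is why such a context places no restriction at all.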

Page 22: Finite-State Methods in Natural Language Processing

DateParsers

define ValidDates [AllDates & MaxDays & LeapDates];

define ValidDateParser [ValidDates @-> "<DATE>" ... "</DATE>" || _ [.#. | \N]];

define InvalidDates [AllDates - ValidDates];

define InvDateParser [InvalidDates @-> "<INV-DATE>" ... "</INV-DATE>" || _ [.#. | \N]];

define DateParser [InvDateParser .o. ValidDateParser];

<INV-DATE><DATE>February 29</DATE>, 1900</INV-DATE>

Page 23: Finite-State Methods in Natural Language Processing

Date/NonDate parser 1

define DateParser [ValidDateParser .o. InvDateParser];

<DATE>February 29</DATE>, 1900

No nested tags for the input "February 29, 1900" because InvDateParser does not apply to strings that have been tagged already.

Page 24: Finite-State Methods in Natural Language Processing

Date/NonDate parser 2

define DateParser [ValidDates @-> "<DATE>" ... "</DATE>",
                   InvalidDates @-> "<NON-DATE>" ... "</NON-DATE>"
                   || _ [\N | .#.]];

Parallel replacement of two patterns with the same constraint on the right context.

<NON-DATE>February 29, 1900</NON-DATE>

<DATE>February 29, 2000</DATE>
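A one-pass Python imitation of the parallel rule: find the longest date expression that is not followed by a digit, then choose the tag by validity. As in ValidDates, only the month-length and leap-year constraints are checked, not weekday agreement. The validity test uses `calendar.monthrange` instead of intersection, so this is a sketch of the behavior, not of the method.

```python
import calendar
import re

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
DAY = "(?:Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)"
MONTH = "(?P<month>" + "|".join(MONTHS) + ")"
DATE = "(?P<date>3[01]|[12][0-9]|[1-9])"
YEAR = "(?P<year>[1-9][0-9]{0,3})"

# AllDates with the right context "not followed by a digit".
ALL_DATES = re.compile(
    f"(?:(?:{DAY}, )?{MONTH} {DATE}(?:, {YEAR})?|{DAY})(?![0-9])")

def is_valid(month, day, year):
    """Month-length and leap-year checks only."""
    # Without a year, February 29 is allowed: measure against leap year 2000.
    days_in_month = calendar.monthrange(year or 2000, MONTHS.index(month) + 1)[1]
    return day <= days_in_month

def tag_dates(text):
    def wrap(match):
        month, year = match.group("month"), match.group("year")
        valid = month is None or is_valid(
            month, int(match.group("date")), int(year) if year else None)
        tag = "DATE" if valid else "NON-DATE"
        return f"<{tag}>{match.group(0)}</{tag}>"
    return ALL_DATES.sub(wrap, text)

print(tag_dates("February 29, 1900"))  # <NON-DATE>February 29, 1900</NON-DATE>
print(tag_dates("February 29, 2000"))  # <DATE>February 29, 2000</DATE>
```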

Page 25: Finite-State Methods in Natural Language Processing

Observations

For some subsets of natural language, such as dates, a finite-state description is more appropriate than a phrase structure grammar.

Regular languages and relations can be modified directly with the finite-state calculus without rewriting the grammars that describe them.

This is a fundamental advantage over higher-level formalisms.

Page 26: Finite-State Methods in Natural Language Processing

The LEXC Formalism

Page 27: Finite-State Methods in Natural Language Processing

What is LEXC?

A special application for making lexical transducers (on the B&K book CD).

A language for describing morphotactic constraints by way of sublexicons and continuation classes.

Why another regular expression formalism?

The general regular expression compiler in XFST is oriented towards compiling networks from symbols and symbol pairs, not from words. LEXC is word-based.

Compiling large lexicons (tens of thousands of words) by the standard union operator is inefficient. LEXC has a different, more efficient algorithm for building networks from lists of words, stems, and affixes.

Page 28: Finite-State Methods in Natural Language Processing

LEXC Syntax

Multichar_Symbols +Noun +Sg +Pl

Lexicon Root
cat SgPl ;
dog SgPl ;
goose Sg ;
goose:geese Pl ;

Lexicon SgPl
Sg ;
0:s Pl ;

Lexicon Sg
+Noun+Sg:0 # ;

Lexicon Pl
+Noun+Pl:0 # ;

Multicharacter symbols need to be declared.

There must be a sublexicon called ‘Root’.

Entries consist of an optional string or string pair followed by an obligatory continuation class.

Every continuation class must refer to a sublexicon, except for #, the termination class.
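How continuation classes chain entries together can be traced with a short Python walk over the example lexicon. This only enumerates paths; it is not how LEXC builds its network. Each entry is written as a hypothetical (upper, lower, continuation) triple, with the epsilon symbol 0 represented by the empty string.

```python
# The example lexicon as (upper, lower, continuation) triples.
LEXICONS = {
    "Root": [("cat", "cat", "SgPl"), ("dog", "dog", "SgPl"),
             ("goose", "goose", "Sg"), ("goose", "geese", "Pl")],
    "SgPl": [("", "", "Sg"), ("", "s", "Pl")],
    "Sg":   [("+Noun+Sg", "", "#")],
    "Pl":   [("+Noun+Pl", "", "#")],
}

def paths(lexicon="Root", upper="", lower=""):
    """Yield (analysis, surface) pairs by following continuation classes."""
    if lexicon == "#":               # the termination class ends a path
        yield upper, lower
        return
    for up, low, continuation in LEXICONS[lexicon]:
        yield from paths(continuation, upper + up, lower + low)

for analysis, surface in paths():
    print(f"{analysis}:{surface}")
```

The walk prints six pairs, including cat+Noun+Pl:cats and goose+Noun+Pl:geese. A lexicon with circular continuation classes would need a depth bound, since the recursion would not terminate.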

Page 29: Finite-State Methods in Natural Language Processing

Esperanto chart

[Figure: chart of Esperanto morphemes: ge, hund, bon, ne, mal, eg, et, in, o, a, ec, j, n]

Page 30: Finite-State Methods in Natural Language Processing

Esperanto chart 2

[Figure: chart of Esperanto morphemes: ge, hund, bon, ne, mal, eg, et, in, o, a, ec, j, n]

Page 31: Finite-State Methods in Natural Language Processing

Esperanto chart 3

[Figure: chart of Esperanto morphemes: ge, hund, bon, ne, mal, eg, et, in, o, a, ec, j, n]

Page 32: Finite-State Methods in Natural Language Processing

Esperanto chart 4

[Figure: chart of Esperanto morphemes: ge, hund, bon, ne, mal, eg, et, in, o, a, ec, j, n]

Page 33: Finite-State Methods in Natural Language Processing

Esperanto chart 5

[Figure: chart of Esperanto morphemes (ge, hund, bon, ne, mal, eg, et, in, o, a, ec, j, n) annotated with sublexicon names: Root, Adjective, APrefix, AStem, ADeriv, AMod, ATag, ASuff, AtoN, Infl, Plur, Acc, Noun, NPrefix, NStem, NDeriv, NMod, NTag, NSuff]

Page 34: Finite-State Methods in Natural Language Processing

Esperanto.lexc

LEXICON Root
Adjective;
Noun;

LEXICON Adjective
APrefix;
AStem;

LEXICON APrefix
Neg+:ne AStem;
Op+:mal AStem;

LEXICON AStem
bon ADeriv;

LEXICON ADeriv
ATag;
AMod;

LEXICON AMod
+Aug:eg ATag;
+Dim:et ATag;

LEXICON ATag
+Adj:0 ASuff;
+Adj:0 AtoN;

LEXICON ASuff
+ASuff:a Infl;

... etc.

Page 35: Finite-State Methods in Natural Language Processing

Constraints

[Figure: the Esperanto morpheme chart with ge, in, and j highlighted, labeled with the tags MF+, +Fem, and +Pl]

Page 36: Finite-State Methods in Natural Language Processing

Constraints 2

[Figure: the Esperanto morpheme chart with the tags MF+, +Fem, and +Pl]

MF%+ => _ ~$[%+Fem] %+Pl ;

Page 37: Finite-State Methods in Natural Language Processing

Constraints 3

xfst[0]: read lexc < esperanto.lexc

Reading from 'adj-noun.lexc'

Root...2, Nouns...2, NounRoots...4, Nmf...5, ....

Building lexicon...Minimizing...Done!

2.7 Kb. 45 states, 70 arcs, Circular.

Closing 'adj-noun-tags.lexc'

xfst[1]: regex MF%+ => _ ~$[%+Fem] %+Pl ;

1.2 Kb, 2 states, 7 arcs, Circular

xfst[2]: compose

3.2 Kb, 61 states, 89 arcs, Circular

Fewer words, bigger network!