Finite-State Methods in Natural Language Processing

Lauri Karttunen

LSA 2005 Summer Institute

July 25, 2005

Course Outline

July 18:
Intro to computational morphology
XFST
Readings:
Lauri Karttunen, “Finite-State Constraints”, The Last Phonological Rule. J. Goldsmith (ed.), pages 173-194, University of Chicago Press, 1993.
Karttunen and Beesley, “25 Years of Finite-State Morphology”
Chapter 1: “Gentle Introduction” (B&K)

July 20:
Regular expressions
More on XFST
Readings:
Chapter 2: “Systematic Introduction”
Chapter 3: “The XFST interface”

July 25:
More on XFST: Date Parser
Concatenative morphotactics: The LEXC language
Readings:
Chapter 4: “The LEXC Language”

July 27:
Constraining non-local dependencies: Flag Diacritics
Non-concatenative morphotactics: reduplication, interdigitation
Readings:
Chapter 5: “Flag Diacritics”
Chapter 8: “Non-Concatenative Morphotactics”

August 1:
Realizational morphology
Readings:
Gregory T. Stump. Inflectional Morphology. A Theory of Paradigm Structure. Cambridge U. Press. 2001. (An excerpt)
Lauri Karttunen, “Computing with Realizational Morphology”, Lecture Notes in Computer Science, Volume 2588, Alexander Gelbukh (ed.), 205-216, Springer Verlag. 2003.

August 3:
Optimality theory
Readings:
Paul Kiparsky, “Finnish Noun Inflection”, Generative Approaches to Finnic and Saami Linguistics, Diane Nelson and Satu Manninen (eds.), pp. 109-161, CSLI Publications, 2003.
Nine Elenbaas and René Kager, “Ternary rhythm and the lapse constraint”, Phonology 16, 273-329.

Solution to Assignment 1, Part 1

define Hundreds [OneToNine { hundred} ({ } OneToNinetyNine)];

define OneTo999 [OneToNine | Teens | Tens | Hundreds ];

define Thousands [OneTo999 { thousand} ({ } OneTo999)];

define UpToMillion [OneToNine | Teens | Tens | Hundreds | Thousands ];

What is this?

xfst[0]: source Dutch.script
xfst[1]: print random-lower 3
tweeennegentig
vierenveertig
eenennegentig
xfst[1]: define Dutch

xfst[0]: source English.script
xfst[1]: print random-lower 3
twenty-seven
ninety-one
forty-five
xfst[1]: define English

xfst[0]: regex Dutch.i .o. English ;
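One way to read the last command, sketched in plain Python rather than XFST. The sketch assumes (the slide does not say so) that each script defines a relation with digit strings on the upper side and number words on the lower side. Under that assumption, inverting the Dutch relation and composing it with the English one pairs Dutch number words with English ones. The sample pairs and the helper names `invert` and `compose` are illustrative only.

```python
# Finite relations as sets of (upper, lower) pairs; tiny illustrative samples,
# assuming digits on the upper side and number words on the lower side.
DUTCH = {("44", "vierenveertig"), ("91", "eenennegentig"), ("92", "tweeennegentig")}
ENGLISH = {("27", "twenty-seven"), ("45", "forty-five"), ("91", "ninety-one")}


def invert(relation):
    """Swap the upper and lower sides, in the spirit of the .i operator."""
    return {(lower, upper) for upper, lower in relation}


def compose(first, second):
    """Pair x with z whenever first maps x to y and second maps y to z (.o.)."""
    return {(x, z) for x, y in first for y2, z in second if y == y2}


# Dutch words on the upper side, English words on the lower side.
dutch_to_english = compose(invert(DUTCH), ENGLISH)
print(dutch_to_english)
```

Only numbers present in both samples survive the composition, which is why the shared intermediate language (here, digit strings) matters.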

Syllabification

define C [ b | c | d | f ...
define V [ a | e | i | o | u ];

s t r u k t u r a l i s m i
s t r u k . t u . r a . l i s . m i

[C* V+ C*] @-> ... "." || _ [C V]

“Insert a boundary mark (".") after the longest instance of the C* V+ C* pattern in front of a C V pattern.”
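The effect of the rule can be imitated with an ordinary regular-expression substitution. This is a minimal Python sketch, not XFST: it assumes every lowercase letter other than a, e, i, o, u is a consonant, and it relies on the fact that for this particular pattern Python's greedy, left-to-right matching finds the same spans as the left-to-right, longest-match operator. That is not true of Python's `re` in general.

```python
import re

# C* V+ C* followed (but not consumed) by C V.
# Assumption: any character that is not one of a, e, i, o, u is a consonant.
PATTERN = re.compile(r"([^aeiou]*[aeiou]+[^aeiou]*)(?=[^aeiou][aeiou])")


def syllabify(word):
    """Insert "." after each match, scanning from left to right."""
    return PATTERN.sub(r"\1.", word)


print(syllabify("strukturalismi"))
```

The lookahead plays the role of the right context `_ [C V]`: it must be present, but it remains available as the start of the next match.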

Finnish Syllabification

# -*- coding: utf8 -*-

define FinnWords [{kala}|{riippuu}|{tietoinen}|{sataa}|
                  {satoi}|{saata}|{saatoin}|{auta}|{laiva}|
                  {leipä}|{häijy}|{koulu}|{köyhä}|{lea}|
                  {viestien}|{tuote}|{virtu.ositeetti}|
                  {laukaus}|{lakkautan}|{voimistelijoiden}|
                  {heittäen}|{heittäisin}|{laulaen}];

define HighV [u | y | i];            # High vowel
define MidV [e | o | ö];             # Mid vowel
define LowV [a | ä] ;                # Low vowel
define V [HighV | MidV | LowV];      # Vowel

define LongV [{aa}|{ee}|{ii}|{oo}|{uu}|{yy}|{ää}|{öö}];
define Diph [[[MidV | LowV] HighV]|{ie}|{uo}|{yö}];

Syllabification (Continued)

define Nuc [V | LongV | Diph];

define C [b | c | d | f | g | h | j | k | l | m |
          n | p | q | r | s | t | v | w | x | z];

define Syllabify [ C* Nuc C* @-> ... "." || _ C V ] ;

regex FinnWords .o. Syllabify ;

print lower-words

Syllabification (Continued)

Problem cases

Incorrect    Correct
lea          le.a
lau.laen     lau.la.en
lau.kaus     lau.ka.us

define Syllabify [ C* Nuc C* @-> ... "." || _ C V
                   .o.
                   [. .] -> "." || [a | ä | i] _ [e | u | y] (C) .#. ,
                                   e _ a ] ;
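The two-stage cascade can be checked on the problem cases with a short Python sketch. It is an approximation under stated assumptions, not the XFST semantics: longer nucleus alternatives are listed first so that Python's leftmost-first alternation behaves like longest match for these words, and the names `STAGE1`, `STAGE2`, `first_pass`, and `syllabify` are illustrative.

```python
import re

V = "aeiouyäö"
C = "bcdfghjklmnpqrstvwxz"
LONG_V = "aa|ee|ii|oo|uu|yy|ää|öö"
DIPH = "[eoöaä][uyi]|ie|uo|yö"
# Long vowels and diphthongs are tried before single vowels.
NUC = f"(?:{LONG_V}|{DIPH}|[{V}])"

# Stage 1: C* Nuc C* @-> ... "." || _ C V
STAGE1 = re.compile(f"([{C}]*{NUC}[{C}]*)(?=[{C}][{V}])")
# Stage 2: insert "." between [a|ä|i] and [e|u|y] (C) at the end of the word,
# or between e and a anywhere.
STAGE2 = re.compile(f"(?<=[aäi])(?=[euy][{C}]?$)|(?<=e)(?=a)")


def first_pass(word):
    """Apply only the basic syllabification rule."""
    return STAGE1.sub(r"\1.", word)


def syllabify(word):
    """Compose the basic rule with the repair rule for vowel sequences."""
    return STAGE2.sub(".", first_pass(word))


for w in ["lea", "laulaen", "laukaus"]:
    print(first_pass(w), "->", syllabify(w))
```

Applying the second substitution to the output of the first mirrors the `.o.` composition: the repair rule sees the boundaries that the first rule has already inserted.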

Parsing Dates

Best result:
Today is [Monday, July 25, 2005].

Bad results:
Today is Monday, [July 25, 2005].
Today is [Monday, July 25], 2005.
Today is Monday, [July 25], 2005.
Today is [Monday], July 25, 2005.

Need left-to-right, longest-match constraints.

Defining the Language of Dates

define OneToNine [1|2|3|4|5|6|7|8|9];

define ZeroToNine ["0"|OneToNine];

define Day [{Monday} | {Tuesday} | {Wednesday} | {Thursday} | {Friday} | {Saturday} | {Sunday}] ;

define Month29 {February};

define Month30 [{April} | {June} | {September} | {November}];

define Month31 [{January} | {March} | {May} | {July} | {August} | {October} | {December}] ;

define Month [Month29 | Month30 | Month31];

Language of Dates (Continued)

# Date is a number from 1 to 31
define Date [OneToNine | [1 | 2] ZeroToNine | 3 [%0 | 1]];

# Year is a number from 1 to 9999 (watch out for the Y10K bug!)

define Year [OneToNine ZeroToNine^<4];

# A date expression consists of a Day (Monday) or a Month and a Date (July 25) with an optional Day (Monday, July 25) and Year (July 25, 2005) or both (Monday, July 25, 2005).

define AllDates [Day | (Day {, }) Month { } Date ({, } Year)];

All Dates from 1.1.1 to 31.12.9999

[Figure: the AllDates network, with arcs labeled by day names (Mon to Sun), month names (Jan to Dec), digits, and commas.]

13 states, 96 arcs
29 760 007 date expressions
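The number of date expressions follows from the definition of AllDates by simple counting. The sketch below redoes the arithmetic in Python; the variable names are illustrative.

```python
# Sizes of the component languages, taken from the definitions above.
days = 7        # Monday .. Sunday
months = 12     # Month29 | Month30 | Month31
dates = 31      # 1 .. 31
years = 9999    # 1 .. 9999

# (Day {, }) Month { } Date ({, } Year): the Day and the Year are each optional.
month_date_forms = (1 + days) * months * dates * (1 + years)

# AllDates is a bare Day or one of the Month-Date forms.
total = days + month_date_forms
print(total)
```

The count includes impossible dates such as April 31, because AllDates does not yet constrain the number of days in a month.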

Parser for Dates

AllDates @-> "<DT>" ... "</DT>"

Compiles into an unambiguous transducer (136 states, 2798 arcs).

Today is <DT>Monday, July 25, 2005</DT> because yesterday was <DT>Sunday</DT> and it was <DT>July 24</DT> so tomorrow must be <DT>Tuesday, July 26</DT> and not <DT>July 27</DT> as it says on the program.
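The left-to-right, longest-match behavior can be imitated directly. The following Python sketch is not how the transducer works internally; it scans the text from the left and, at each position, looks for the longest substring that is a complete date expression. The regular expression is a hand translation of AllDates, and `tag_dates` is an illustrative name.

```python
import re

DAY = "Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday"
MONTH = ("January|February|March|April|May|June|July|August|"
         "September|October|November|December")
DATE = "[12][0-9]|3[01]|[1-9]"
YEAR = "[1-9][0-9]{0,3}"
# [Day | (Day {, }) Month { } Date ({, } Year)]
ALL_DATES = re.compile(
    f"(?:{DAY})|(?:(?:{DAY}), )?(?:{MONTH}) (?:{DATE})(?:, (?:{YEAR}))?"
)


def tag_dates(text, open_tag="<DT>", close_tag="</DT>"):
    """Wrap the longest date expression found at each position, left to right."""
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so that the longest match wins.
        for j in range(len(text), i, -1):
            if ALL_DATES.fullmatch(text, i, j):
                out.append(open_tag + text[i:j] + close_tag)
                i = j
                break
        else:
            # No date starts here: copy one character and move on.
            out.append(text[i])
            i += 1
    return "".join(out)


print(tag_dates("Today is Monday, July 25, 2005."))
```

Trying every end position is quadratic in the length of the text, which is acceptable for a demonstration; the compiled transducer does the same job in a single pass.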

Problem of Reference

Valid dates
Monday, July 25, 2005
Tuesday, February 29, 2000
Monday, September 16, 1996

Invalid dates
Wednesday, April 31, 1996
Thursday, February 29, 1900
Tuesday, July 25, 2005

Refinement by Intersection

[Figure: ValidDates shown as the part of AllDates that satisfies three constraints: LeapYears (Feb 29 => _ ...), MaxDaysInMonth (~$[Month29 { 30}];), and WeekdayDate.]

MaxDays

define MaxDays30 ~$[Month29 { 30}];

define MaxDays31 ~$[[Month29 | Month30] { 31}];

define MaxDays [MaxDays30 & MaxDays31];

LeapYear Constraint

define Even [{0} | 2 | 4 | 6 | 8] ;
define Odd [1 | 3 | 5 | 7 | 9] ;

define N [Even | Odd];

define Div4 [4 | 8 |
             N* [Even [%0 | 4 | 8] |
                 Odd [2 | 6]]];

define LeapYear [Div4 - [[N+ - Div4] {00}]] ;
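The digit-pattern definitions can be cross-checked against the usual arithmetic rule. This Python sketch translates Div4 into a regular expression by hand and implements the subtraction in LeapYear as an explicit test; `in_div4` and `in_leap_year` are illustrative names.

```python
import re

EVEN = "[02468]"
ODD = "[13579]"
N = "[0-9]"
# Div4: 4, 8, or any digit string ending in (even digit + 0/4/8) or (odd digit + 2/6).
DIV4 = re.compile(f"4|8|{N}*(?:{EVEN}[048]|{ODD}[26])")


def in_div4(digits):
    return DIV4.fullmatch(digits) is not None


def in_leap_year(digits):
    """Div4 - [[N+ - Div4] {00}]"""
    if not in_div4(digits):
        return False
    # Subtract strings of the form X00 where X is non-empty and not in Div4.
    if digits.endswith("00") and len(digits) > 2 and not in_div4(digits[:-2]):
        return False
    return True


def is_leap_arithmetic(year):
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


mismatches = [y for y in range(1, 10000)
              if in_leap_year(str(y)) != is_leap_arithmetic(y)]
print(mismatches)
```

The subtraction removes century years whose leading digits are not themselves divisible by four, which is the string-level counterpart of the "divisible by 400" exception.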

LeapYear Constraint (Continued)

Bad Solution 1
define LeapDates {February 29, } => _ LeapYear ;

Bad Solution 2
define NotLeapYear [Year - LeapYear];
define LeapDates ~$[{February 29, } NotLeapYear];

Almost Correct
define LeapDates [{February 29, } => _ [?* - [NotLeapYear [\N]*]]];

Good Solution
define LeapDates [{February 29, } => _ [?* - [NotLeapYear [\N]*]] .#.];
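Why Bad Solution 2 fails can be seen with a small Python sketch. It models only the relevant part of each constraint, under the assumption that the input is a well-formed date string with a year of at most four digits: the bad version rejects a string as soon as any digit prefix after "February 29, " is a non-leap year, while the good version looks at the complete year. The function names are illustrative.

```python
import re

PREFIX = "February 29, "


def is_leap(year):
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def bad_solution_2_accepts(text):
    """Reject if PREFIX is followed by ANY digit prefix that is a non-leap year."""
    for m in re.finditer(re.escape(PREFIX), text):
        for n in range(1, 5):
            digits = text[m.end():m.end() + n]
            if (len(digits) == n and digits.isdigit()
                    and digits[0] != "0" and not is_leap(int(digits))):
                return False
    return True


def good_solution_accepts(text):
    """Reject only if the COMPLETE digit run after PREFIX is a non-leap year."""
    for m in re.finditer(re.escape(PREFIX) + r"([0-9]+)", text):
        run = m.group(1)
        if len(run) <= 4 and not is_leap(int(run)):
            return False
    return True


print(bad_solution_2_accepts("February 29, 2000"))   # the prefix "2" is a non-leap year
print(good_solution_accepts("February 29, 2000"))
```

The year 2000 is rejected by the bad version because its prefixes 2 and 200 are non-leap years, so the constraint has to look all the way to the end of the year.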

Vacuous Context Conditions

A context condition L _ R is compiled as ?* L _ R ?*.

Any expression whose language contains the empty string is “swallowed up” when concatenated with ?*.

(a) ?* == ?* (a) == ?*
[?* - a] ?* == ?* [?* - a] == ?*
~a ?* == ?* ~a == ?*

Not vacuous:
a -> b || _ c* [.#. | \c] ;
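The second identity can be confirmed by brute force on short strings. The Python sketch below checks, for every string over a two-letter alphabet up to length three, that it splits into a prefix other than "a" followed by an arbitrary suffix; the alphabet and length bound are arbitrary choices for the demonstration.

```python
from itertools import product

ALPHABET = "ab"


def in_concatenation(s):
    """True if s = p + q with p in [?* - a] (anything but "a") and q in ?*."""
    return any(s[:i] != "a" for i in range(len(s) + 1))


strings = ["".join(t) for n in range(4) for t in product(ALPHABET, repeat=n)]
all_covered = all(in_concatenation(s) for s in strings)
print(len(strings), all_covered)
```

Every string qualifies because the empty prefix is always available, which is exactly why a context such as [?* - a] constrains nothing once ?* is appended to it.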

DateParsers

define ValidDates [AllDates & MaxDays & LeapDates];

define ValidDateParser [ValidDates @-> "<DATE>" ... "</DATE>"
                        || _ [.#. | \N]];

define InvalidDates [AllDates - ValidDates];

define InvDateParser [InvalidDates @-> "<INV-DATE>" ... "</INV-DATE>"
                      || _ [.#. | \N]];

define DateParser [InvDateParser .o. ValidDateParser];

<INV-DATE><DATE>February 29</DATE>, 1900</INV-DATE>

Date/NonDate parser 1

define DateParser [ValidDateParser .o. InvDateParser];

<DATE>February 29</DATE>, 1900

No nested tags for the input "February 29, 1900" because InvDateParser does not apply to strings that have been tagged already.

Date/NonDate parser 2

define DateParser [ValidDates @-> "<DATE>" ... "</DATE>",
                   InvalidDates @-> "<NON-DATE>" ... "</NON-DATE>"
                   || _ [\N | .#.]];

Parallel replacement of two patterns with the same constraint on the right context.

<NON-DATE>February 29, 1900</NON-DATE>

<DATE>February 29, 2000</DATE>
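A rough Python counterpart of the parallel parser is sketched below. It is a different technique, not a finite-state one: a single regular expression finds date expressions that are not followed by a digit, and the standard `datetime` module decides which tag each match receives. The sketch leaves out bare day names, and the helper names `is_valid`, `tag`, and `parse_dates` are illustrative.

```python
import datetime
import re

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
# (Day {, }) Month { } Date ({, } Year), not followed by a digit.
DATE_RE = re.compile(
    r"(?:(%s), )?(%s) ([12][0-9]|3[01]|[1-9])(?:, ([1-9][0-9]{0,3}))?(?![0-9])"
    % ("|".join(DAYS), "|".join(MONTHS))
)


def is_valid(day_name, month, day, year):
    """Check days-per-month, leap years, and the weekday when they can be known."""
    month_number = MONTHS.index(month) + 1
    try:
        # Without a year, use a leap year so that February 29 is allowed.
        d = datetime.date(year if year is not None else 2000, month_number, day)
    except ValueError:
        return False
    if year is None or day_name is None:
        return True
    return DAYS[d.weekday()] == day_name


def tag(match):
    day_name, month, day, year = match.groups()
    ok = is_valid(day_name, month, int(day), int(year) if year else None)
    label = "DATE" if ok else "NON-DATE"
    return "<%s>%s</%s>" % (label, match.group(0), label)


def parse_dates(text):
    """Tag each date expression once, as valid or invalid."""
    return DATE_RE.sub(tag, text)


print(parse_dates("February 29, 1900"))
print(parse_dates("February 29, 2000"))
```

Because each match receives exactly one tag, the nesting seen with the composed parsers cannot arise.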

Observations

For some subsets of natural language, such as dates, a finite-state description is more appropriate than a phrase structure grammar.

Regular languages and relations can be modified directly with the finite-state calculus without rewriting the grammars that describe them.

This is a fundamental advantage over higher-level formalisms.

The LEXC Formalism

What is LEXC?

A special application for making lexical transducers (on the B&K book CD).

A language for describing morphotactic constraints by way of sublexicons and continuation classes.

Why another regular expression formalism?

The general regular expression compiler in XFST is oriented towards compiling networks from symbols and symbol pairs, not from words. LEXC is word-based.

Compiling large lexicons (tens of thousands of words) with the standard union operator is inefficient. LEXC uses a different, more efficient algorithm for building networks from lists of words, stems, and affixes.

LEXC Syntax

Multichar_Symbols +Noun +Sg +Pl

LEXICON Root
cat SgPl ;
dog SgPl ;
goose Sg ;
goose:geese Pl ;

LEXICON SgPl
Sg ;
0:s Pl ;

LEXICON Sg
+Noun+Sg:0 # ;

LEXICON Pl
+Noun+Pl:0 # ;

Multicharacter symbols need to be declared.

There must be a sublexicon called ‘Root’.

Entries consist of an optional string or string pair followed by an obligatory continuation class.

Every continuation class must refer to a sublexicon, except for #, the termination class.
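How continuation classes chain entries together can be shown by enumerating the pairs that the small lexicon above describes. The Python sketch below is only a path enumerator, not the LEXC compilation algorithm; each entry is written as an (upper, lower, continuation) triple, with the empty string standing for 0, and the names `LEXICONS` and `expand` are illustrative.

```python
# Each sublexicon lists (upper, lower, continuation) entries; "#" ends a word.
LEXICONS = {
    "Root": [("cat", "cat", "SgPl"), ("dog", "dog", "SgPl"),
             ("goose", "goose", "Sg"), ("goose", "geese", "Pl")],
    "SgPl": [("", "", "Sg"), ("", "s", "Pl")],
    "Sg": [("+Noun+Sg", "", "#")],
    "Pl": [("+Noun+Pl", "", "#")],
}


def expand(lexicon="Root", upper="", lower=""):
    """Follow continuation classes from Root to #, collecting both sides."""
    if lexicon == "#":
        yield upper, lower
        return
    for up, lo, continuation in LEXICONS[lexicon]:
        yield from expand(continuation, upper + up, lower + lo)


pairs = sorted(expand())
for upper, lower in pairs:
    print(upper, "<->", lower)
```

The irregular plural is handled by giving goose its own entries, so it never passes through the SgPl sublexicon that adds the regular s.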

Esperanto chart

[Figure: morphotactic chart for Esperanto nouns and adjectives, with arcs labeled ge, ne, mal, hund, bon, eg, et, in, ec, o, a, j, n.]

Esperanto chart 2

[Figure: the same chart with its parts labeled by sublexicon names: Root, Adjective, APrefix, AStem, ADeriv, AMod, ATag, ASuff, AtoN, Infl, Plur, Acc, Noun, NPrefix, NStem, NMod, NDeriv, NTag, NSuff.]

Esperanto.lexc

LEXICON Root

Adjective;

Noun;

LEXICON Adjective

APrefix;

AStem;

LEXICON APrefix

Neg+:ne AStem;

Op+:mal AStem;

LEXICON AStem

bon ADeriv;

LEXICON ADeriv

ATag;

AMod;

LEXICON AMod

+Aug:eg ATag;

+Dim:et ATag;

LEXICON ATag

+Adj:0 ASuff;

+Adj:0 AtoN;

LEXICON ASuff

+ASuff:a Infl;

... etc.

Constraints

[Figure: the Esperanto chart with ge, in, and j highlighted alongside the tags MF+, +Fem, and +Pl.]

Constraints 2

[Figure: the same chart and tags, with the constraint below.]

MF%+ => _ ~$[%+Fem] %+Pl ;

Constraints 3

xfst[0]: read lexc < esperanto.lexc

Reading from 'adj-noun.lexc'

Root...2, Nouns...2, NounRoots...4, Nmf...5, ....

Building lexicon...Minimizing...Done!

2.7 Kb. 45 states, 70 arcs, Circular.

Closing 'adj-noun-tags.lexc'

xfst[1]: regex MF%+ => _ ~$[%+Fem] %+Pl ;

1.2 Kb, 2 states, 7 arcs, Circular

xfst[2]: compose

3.2 Kb, 61 states, 89 arcs, Circular

Fewer words, bigger network!
