Finite-State Methods in Natural Language Processing

Lauri Karttunen

LSA 2005 Summer Institute

July 25, 2005

Course Outline

July 18: Intro to computational morphology; XFST

Readings:

Lauri Karttunen, “Finite-State Constraints”, The Last Phonological Rule, J. Goldsmith (ed.), pages 173-194, University of Chicago Press, 1993.

Karttunen and Beesley, “25 Years of Finite-State Morphology”

Chapter 1: “Gentle Introduction” (B&K)

July 20: Regular expressions; more on XFST

Readings:

Chapter 2: “Systematic Introduction”

Chapter 3: “The XFST interface”

July 25: More on XFST: Date Parser; concatenative morphotactics: the LEXC language

Readings:

Chapter 4: “The LEXC Language”

July 27: Constraining non-local dependencies: Flag Diacritics; non-concatenative morphotactics (reduplication, interdigitation)

Readings:

Chapter 5: “Flag Diacritics”

Chapter 8: “Non-Concatenative Morphotactics”

August 1: Realizational morphology

Readings:

Gregory T. Stump, Inflectional Morphology: A Theory of Paradigm Structure, Cambridge U. Press, 2001. (An excerpt)

Lauri Karttunen, “Computing with Realizational Morphology”, Lecture Notes in Computer Science, Volume 2588, Alexander Gelbukh (ed.), 205-216, Springer Verlag, 2003.

August 3: Optimality theory

Readings:

Paul Kiparsky, “Finnish Noun Inflection”, Generative Approaches to Finnic and Saami Linguistics, Diane Nelson and Satu Manninen (eds.), pp. 109-161, CSLI Publications, 2003.

Nine Elenbaas and René Kager, “Ternary rhythm and the lapse constraint”, Phonology 16, 273-329.

Solution to Assignment 1, Part 1

define Hundreds [OneToNine { hundred} ({ } OneToNinetyNine)];

define OneTo999 [OneToNine | Teens | Tens | Hundreds];

define Thousands [OneTo999 { thousand} ({ } OneTo999)];

define UpToMillion [OneToNine | Teens | Tens | Hundreds | Thousands];
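The layering of these definitions can be mirrored in ordinary code. The following Python sketch is only an illustration, not part of the assignment solution: it assumes that Teens covers 10 to 19 and that Tens covers 20 to 99 with hyphenated compounds such as "twenty-seven" (those definitions are not shown on this slide), and the function names are hypothetical.

```python
ONES = ["one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
         "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def one_to_99(n):
    """Assumed counterpart of OneToNinetyNine = OneToNine | Teens | Tens."""
    if n < 10:
        return ONES[n - 1]
    if n < 20:
        return TEENS[n - 10]
    tens, ones = divmod(n, 10)
    word = TENS[tens - 2]
    return word if ones == 0 else word + "-" + ONES[ones - 1]


def one_to_999(n):
    """Hundreds: OneToNine " hundred" optionally followed by " " OneToNinetyNine."""
    if n < 100:
        return one_to_99(n)
    hundreds, rest = divmod(n, 100)
    word = ONES[hundreds - 1] + " hundred"
    return word if rest == 0 else word + " " + one_to_99(rest)


def up_to_million(n):
    """Thousands: OneTo999 " thousand" optionally followed by " " OneTo999."""
    if not 1 <= n <= 999_999:
        raise ValueError("only 1..999999 is covered")
    if n < 1000:
        return one_to_999(n)
    thousands, rest = divmod(n, 1000)
    word = one_to_999(thousands) + " thousand"
    return word if rest == 0 else word + " " + one_to_999(rest)


print(up_to_million(342))
```

Each Python function corresponds to one define: the optional parenthesized part of the regular expression becomes the "rest == 0" test.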

What is this?

xfst[0]: source Dutch.script
xfst[1]: print random-lower 3
tweeennegentig
vierenveertig
eenennegentig
xfst[1]: define Dutch

xfst[0]: source English.script
xfst[1]: print random-lower 3
twenty-seven
ninety-one
forty-five
xfst[1]: define English

xfst[0]: regex Dutch.i .o. English ;

Syllabification

define C [ b | c | d | f ...
define V [ a | e | i | o | u ];

s t r u k t u r a l i s m i
s t r u k . t u . r a . l i s . m i

[C* V+ C*] @-> ... "." || _ [C V]

“Insert a hyphen after the longest instance of the C* V+ C* pattern in front of a C V pattern.”
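The effect of the rule can be imitated with a backtracking regular expression in Python. This is a rough sketch, not the XFST compilation: the lookahead plays the role of the right context `_ [C V]`, and greedy matching happens to give the longest match for this particular pattern. The consonant class is an assumption (every lowercase ASCII letter that is not in V), since the slide elides the full list.

```python
import re

V = "aeiou"
# Assumption: C is every other lowercase ASCII letter; the slide shows only "b | c | d | f ...".
C = "".join(ch for ch in "abcdefghijklmnopqrstuvwxyz" if ch not in V)

# C* V+ C* followed (but not consumed) by C V.
SYLLABLE = re.compile(rf"[{C}]*[{V}]+[{C}]*(?=[{C}][{V}])")


def syllabify(word):
    """Insert "." after each leftmost, longest C* V+ C* match that precedes C V."""
    return SYLLABLE.sub(lambda m: m.group(0) + ".", word)


print(syllabify("strukturalismi"))
```

The last syllable receives no mark because nothing of the form C V follows it.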

Finnish Syllabification

# -*- coding: utf8 -*-

define FinnWords [{kala}|{riippuu}|{tietoinen}|{sataa}|
                  {satoi}|{saata}|{saatoin}|{auta}|{laiva}|
                  {leipä}|{häijy}|{koulu}|{köyhä}|{lea}|
                  {viestien}|{tuote}|{virtu.ositeetti}|
                  {laukaus}|{lakkautan}|{voimistelijoiden}|
                  {heittäen}|{heittäisin}|{laulaen}];

define LongV [{aa}|{ee}|{ii}|{oo}|{uu}|{yy}|{ää}|{öö}];

define Diph [[[MidV | LowV] HighV]|{ie}|{uo}|{yö}];

Syllabification (Continued)

define Nuc [V | LongV | Diph];

define C [b | c | d | f | g | h | j | k | l | m |
          n | p | q | r | s | t | v | w | x | z];

define Syllabify [ C* Nuc C* @-> ... "." || _ C V ] ;

regex FinnWords .o. Syllabify ;

print lower-words

Syllabification (continued)

Problem cases

Incorrect    Correct

lea          le.a

lau.laen     lau.la.en

lau.kaus     lau.ka.us

define Syllabify [ C* Nuc C* @-> ... "." || _ C V
                   .o.
                   [. .] -> "." || [a | ä | i] _ [e | u | y] (C) .#. ,
                                   e _ a ] ;
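The two-step treatment of the problem cases can be approximated in Python as follows. This is a sketch under assumptions: the vowel classes HighV, MidV and LowV are not defined on these slides, so the values below are guesses, and the two rules are applied one after the other with Python regular expressions rather than compiled into a transducer.

```python
import re

# Assumed vowel classes (not given on the slides).
HIGH_V, MID_V, LOW_V = "iuy", "eoö", "aä"
V = HIGH_V + MID_V + LOW_V
C = "bcdfghjklmnpqrstvwxz"

LONG_V = "|".join(v + v for v in "aeiouyäö")
DIPH = rf"[{MID_V}{LOW_V}][{HIGH_V}]|ie|uo|yö"
# Longer nuclei are listed first so that they are preferred.
NUC = rf"(?:{LONG_V}|{DIPH}|[{V}])"

# Rule 1: C* Nuc C* @-> ... "." || _ C V
MAIN_RULE = re.compile(rf"[{C}]*{NUC}[{C}]*(?=[{C}][{V}])")
# Rule 2: insert "." between a/ä/i and e/u/y before an optional final
# consonant at the end of the word, and between e and a.
HIATUS_RULE = re.compile(rf"(?<=[aäi])(?=[euy][{C}]?$)|(?<=e)(?=a)")


def syllabify(word):
    marked = MAIN_RULE.sub(lambda m: m.group(0) + ".", word)
    return HIATUS_RULE.sub(".", marked)


for w in ("lea", "laulaen", "laukaus"):
    print(w, "->", syllabify(w))
```

The first substitution alone yields the forms in the "Incorrect" column; the second one repairs them.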

Parsing Dates

Best result:

Today is [Monday, July 25, 2005].

Bad results:

Today is Monday, [July 25, 2005].

Today is [Monday, July 25], 2005.

Today is Monday, [July 25], 2005.

Today is [Monday], July 25, 2005.

Need left-to-right, longest-match constraints.

Defining the Language of Dates

define OneToNine [1|2|3|4|5|6|7|8|9];

define ZeroToNine ["0"|OneToNine];

define Day [{Monday} | {Tuesday} | {Wednesday} | {Thursday} | {Friday} | {Saturday} | {Sunday}] ;

define Month29 {February};

define Month30 [{April} | {June} | {September} | {November}];

define Month31 [{January} | {March} | {May} | {July} | {August} | {October} | {December}] ;

define Month [Month29 | Month30 | Month31];

Language of Dates (Continued)

# Date is a number from 1 to 31

define Date [OneToNine | [1 | 2] ZeroToNine | 3 [%0 | 1]];

# Year is a number from 1 to 9999 (watch out for the Y10K bug!)

define Year [OneToNine ZeroToNine^<4];

# A date expression consists of a Day (Monday) or a Month and a Date (July 25) with an optional Day (Monday, July 25) and Year (July 25, 2005) or both (Monday, July 25, 2005).

define AllDates [Day | (Day {, }) Month { } Date ({, } Year)];
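As a cross-check, the same language can be written as a Python regular expression. This is a hedged transliteration for experimentation, not output of XFST; the constant names and `is_date_expression` are hypothetical.

```python
import re

DAY = "Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday"
MONTH = ("January|February|March|April|May|June|July|August|"
         "September|October|November|December")
DATE = "[12][0-9]|3[01]|[1-9]"   # 1 to 31
YEAR = "[1-9][0-9]{0,3}"         # 1 to 9999, no leading zero

# Day | (Day ", ") Month " " Date (", " Year)
ALL_DATES = re.compile(
    rf"(?:{DAY})|(?:(?:{DAY}), )?(?:{MONTH}) (?:{DATE})(?:, (?:{YEAR}))?"
)


def is_date_expression(text):
    """True if the whole string belongs to the AllDates language."""
    return ALL_DATES.fullmatch(text) is not None


print(is_date_expression("Monday, July 25, 2005"))
print(is_date_expression("February 31"))
```

Like AllDates itself, this recognizer still accepts impossible dates such as "February 31"; it only checks the shape of the expression.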

All Dates from 1.1.1 to 31.12.9999

[Figure: the AllDates network, with arcs labeled by day names, month names, and digits.]

13 states, 96 arcs

29 760 007 date expressions
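The number of date expressions follows directly from the definition of AllDates. A small Python calculation, with hypothetical variable names, reproduces it:

```python
days = 7        # Monday .. Sunday
months = 12
dates = 31      # 1 .. 31
years = 9999    # 1 .. 9999

# A bare Day, or (Day ", ") Month " " Date (", " Year):
# the optional Day gives 1 + 7 choices, the optional Year 1 + 9999 choices.
bare_days = days
month_date_forms = (1 + days) * months * dates * (1 + years)
total = bare_days + month_date_forms

print(total)
```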

Parser for Dates

AllDates @-> "<DT>" ... "</DT>"

Compiles into an unambiguous transducer (136 states, 2798 arcs).

Today is <DT>Monday, July 25, 2005</DT> because yesterday was <DT>Sunday</DT> and it was <DT>July 24</DT> so tomorrow must be <DT>Tuesday, July 26</DT> and not <DT>July 27</DT> as it says on the program.
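The left-to-right, longest-match markup can be imitated in Python. This is a sketch, not the XFST transducer: Python's `re` module does not guarantee longest matches in general, so the alternatives below are ordered so that the longer option is always tried first. The names `DATE_EXPR` and `tag_dates` are hypothetical.

```python
import re

DAY = "Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday"
MONTH = ("January|February|March|April|May|June|July|August|"
         "September|October|November|December")
DATE = "[12][0-9]|3[01]|[1-9]"
YEAR = "[1-9][0-9]{0,3}"

# The long alternative comes first; the optional parts are greedy.
DATE_EXPR = re.compile(
    rf"(?:(?:{DAY}), )?(?:{MONTH}) (?:{DATE})(?:, (?:{YEAR}))?|(?:{DAY})"
)


def tag_dates(text):
    """Wrap each leftmost, longest date expression in <DT> ... </DT>."""
    return DATE_EXPR.sub(lambda m: f"<DT>{m.group(0)}</DT>", text)


print(tag_dates("Today is Monday, July 25, 2005 because yesterday was Sunday."))
```

Because the scan resumes after each match, "Monday, July 25, 2005" is bracketed once as a whole and none of the shorter bracketings is produced.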

Problem of Reference

Valid dates

Monday, July 25, 2005

Tuesday, February 29, 2000

Monday, September 16, 1996

Invalid dates

Wednesday, April 31, 1996

Thursday, February 29, 1900

Tuesday, July 25, 2005

Refinement by Intersection

[Figure: AllDates narrowed down to ValidDates by three constraints.]

LeapYears: Feb 29 => _ ...

MaxDays In Month: ~$[Month29 { 30}];

WeekdayDate

MaxDays

define MaxDays30 ~$[Month29 { 30}];

define MaxDays31 ~$[[Month29 | Month30] { 31}];

define MaxDays [MaxDays30 & MaxDays31];

LeapYear constraint

define Even [{0} | 2 | 4 | 6 | 8] ;

define Odd [1 | 3 | 5 | 7 | 9] ;

define N [Even | Odd];

define Div4 [4 | 8 |
             N* [Even [%0 | 4 | 8] |
                 Odd [2 | 6]]];

define LeapYear [Div4 - [[N+ - Div4] {00}]] ;
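That these purely string-based definitions agree with the arithmetic leap-year rule can be checked by brute force. The Python sketch below transliterates Div4 and LeapYear (the function names are hypothetical) and compares them with the standard library's `calendar.isleap` for every year from 1 to 9999.

```python
import calendar
import re

# Div4: 4, 8, or any digit string ending in Even[0|4|8] or Odd[2|6].
DIV4 = re.compile(r"4|8|[0-9]*(?:[02468][048]|[13579][26])")


def is_div4(digits):
    return DIV4.fullmatch(digits) is not None


def is_leap_year(digits):
    """LeapYear = Div4 - [[N+ - Div4] {00}]."""
    if not is_div4(digits):
        return False
    # Subtract century years whose (non-empty) prefix is not divisible by 4.
    if digits.endswith("00") and len(digits) > 2 and not is_div4(digits[:-2]):
        return False
    return True


mismatches = [y for y in range(1, 10000)
              if is_leap_year(str(y)) != calendar.isleap(y)]
print(mismatches)
```

The subtraction removes exactly the century years that are not multiples of 400, because a number ending in 00 is divisible by 400 when the digits before the 00 are divisible by 4.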

LeapYear Constraint (Continued)

Bad Solution 1

define LeapDates {February 29, } => _ LeapYear ;

Bad Solution 2

define NotLeapYear [Year - LeapYear];

define LeapDates ~$[{February 29, } NotLeapYear];

Almost Correct

define LeapDates [
    {February 29, } => _ [?* - [NotLeapYear [\N]*]]];

Good Solution

define LeapDates [
    {February 29, } => _ [?* - [NotLeapYear [\N]*]] .#. ];

Vacuous Context Conditions

A context condition L _ R is compiled as ?* L _ R ?*.

Any expression that contains the empty string is “swallowed up” when concatenated with ?*.

(a) ?* == ?* (a) == ?*

[?* - a] ?* == ?* [?* - a] == ?*

~a ?* == ?* ~a == ?*

Not vacuous:

a -> b || _ c* [.#. | \c] ;
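The "swallowing" effect can be verified by enumeration on a small alphabet. The Python sketch below is only an illustration with hypothetical names: it treats a language as a membership test and checks whether the language followed by ?* accepts every string up to a given length.

```python
from itertools import product

ALPHABET = "ab"


def all_strings(max_len):
    """Every string over ALPHABET of length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(ALPHABET, repeat=n):
            yield "".join(chars)


def followed_by_anything(in_language, s):
    """True if s is in L ?*, i.e. some prefix of s is in L."""
    return any(in_language(s[:i]) for i in range(len(s) + 1))


def optional_a(s):        # (a)
    return s in ("", "a")


def anything_but_a(s):    # [?* - a], which is the same language as ~a
    return s != "a"


def exactly_a(s):         # a, which does not contain the empty string
    return s == "a"


for lang in (optional_a, anything_but_a, exactly_a):
    vacuous = all(followed_by_anything(lang, s) for s in all_strings(4))
    print(lang.__name__, "vacuous" if vacuous else "not vacuous")
```

The first two languages contain the empty string, so the empty prefix always qualifies and the context accepts everything; the third does not.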

DateParsers

define ValidDates [AllDates & MaxDays & LeapDates];

define ValidDateParser [ValidDates @-> "<DATE>" ... "</DATE>"
                        || _ [.#. | \N]];

define InvalidDates [AllDates - ValidDates];

define InvDateParser [InvalidDates @-> "<INV-DATE>" ... "</INV-DATE>"
                      || _ [.#. | \N]];

define DateParser [InvDateParser .o. ValidDateParser];

<INV-DATE><DATE>February 29</DATE>, 1900</INV-DATE>

Date/NonDate parser 1

define DateParser [ValidDateParser .o. InvDateParser];

<DATE>February 29</DATE>, 1900

No nested tags for the input "February 29, 1900" because InvDateParser does not apply to strings that have been tagged already.

Date/NonDate parser 2

define DateParser [ValidDates @-> "<DATE>" ... "</DATE>",
                   InvalidDates @-> "<NON-DATE>" ... "</NON-DATE>"
                   || _ [\N | .#.]];

Parallel replacement of two patterns with the same constraint on the right context.

<NON-DATE>February 29, 1900</NON-DATE>

<DATE>February 29, 2000</DATE>
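The behavior of the parallel parser can be approximated in Python by finding the longest date expression that is not followed by a digit and then classifying it. This is a sketch under assumptions: validity is checked with a lookup table and `calendar.isleap` instead of the MaxDays and LeapDates networks, the weekday is not checked (no WeekdayDate constraint is defined on these slides), and all names are hypothetical.

```python
import calendar
import re

DAY = "Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday"
MONTH = ("January|February|March|April|May|June|July|August|"
         "September|October|November|December")
DATE = "[12][0-9]|3[01]|[1-9]"
YEAR = "[1-9][0-9]{0,3}"

# AllDates with the right context [\N | .#.] expressed as a negative lookahead.
ALL_DATES = re.compile(
    rf"(?:(?:(?:{DAY}), )?(?P<month>{MONTH}) (?P<date>{DATE})(?:, (?P<year>{YEAR}))?"
    rf"|(?:{DAY}))(?![0-9])"
)

# Months with fewer than 31 days; February is allowed 29 here.
MAX_DAYS = {"February": 29, "April": 30, "June": 30,
            "September": 30, "November": 30}


def is_valid(m):
    """Counterpart of AllDates & MaxDays & LeapDates for one match."""
    month = m.group("month")
    if month is None:            # a bare day name
        return True
    date = int(m.group("date"))
    if date > MAX_DAYS.get(month, 31):
        return False
    year = m.group("year")
    if month == "February" and date == 29 and year is not None:
        return calendar.isleap(int(year))
    return True


def tag(m):
    name = "DATE" if is_valid(m) else "NON-DATE"
    return f"<{name}>{m.group(0)}</{name}>"


def parse_dates(text):
    return ALL_DATES.sub(tag, text)


print(parse_dates("February 29, 1900"))
print(parse_dates("February 29, 2000"))
```

Since the whole expression is matched before it is classified, an invalid date is tagged once as a unit and the valid prefix "February 29" is never tagged separately.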

Observations

For some subsets of natural language, such as dates, a finite-state description is more appropriate than a phrase structure grammar.

Regular languages and relations can be modified directly with the finite-state calculus without rewriting the grammars that describe them.

This is a fundamental advantage over higher-level formalisms.

The LEXC Formalism

What is LEXC?

A special application for making lexical transducers (On the B&K book CD).

A language for describing morphotactic constraints by way of sublexicons and continuation classes.

Why another regular expression formalism?

The general regular expression compiler in XFST is oriented towards compiling networks from symbols and symbol pairs, not from words. LEXC is word-based.

Compiling large lexicons (tens of thousands of words) by the standard union operator is inefficient. LEXC has another, more efficient algorithm for building networks from lists of words, stems, and affixes.

LEXC Syntax

Multichar_Symbols +Noun +Sg +Pl

Lexicon Root
cat SgPl ;
dog SgPl ;
goose Sg ;
goose:geese Pl ;

Lexicon SgPl
Sg ;
0:s Pl ;

Lexicon Sg
+Noun+Sg:0 # ;

Lexicon Pl
+Noun+Pl:0 # ;

Multicharacter symbols need to be declared.

There must be a sublexicon called ‘Root’.

Entries consist of an optional string or string pair followed by an obligatory continuation class.

Every continuation class must refer to a sublexicon, except for #, the termination class.
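How continuation classes chain entries together can be illustrated with a short Python sketch. It is not how LEXC builds its network; it simply walks the sublexicons of the example above, represented as a hypothetical dictionary of (upper, lower, continuation) triples with 0 written as the empty string, and lists the resulting string pairs. It only works for lexicons without cycles.

```python
# Each entry: (upper string, lower string, continuation class).
LEXICONS = {
    "Root": [("cat", "cat", "SgPl"),
             ("dog", "dog", "SgPl"),
             ("goose", "goose", "Sg"),
             ("goose", "geese", "Pl")],
    "SgPl": [("", "", "Sg"),
             ("", "s", "Pl")],
    "Sg": [("+Noun+Sg", "", "#")],
    "Pl": [("+Noun+Pl", "", "#")],
}


def expand(lexicon="Root", upper="", lower=""):
    """Yield every (upper, lower) pair reachable from the given sublexicon."""
    for entry_upper, entry_lower, continuation in LEXICONS[lexicon]:
        new_upper, new_lower = upper + entry_upper, lower + entry_lower
        if continuation == "#":          # termination class
            yield new_upper, new_lower
        else:
            yield from expand(continuation, new_upper, new_lower)


for analysis, surface in expand():
    print(f"{analysis}\t{surface}")
```

Every path from Root to # contributes one pair, so the irregular plural is obtained by sending goose:geese directly to Pl and bypassing SgPl.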

Esperanto chart

[Figure: chart of Esperanto affixes; surviving labels: ne, mal, eg, in, et.]

Esperanto chart 2

[Figure: the affix chart (ne, mal, eg, in, et) annotated with sublexicon names: Adjective, APrefix, AStem, ADeriv, AtoN, Infl, Plur, Acc, NPrefix, NDeriv.]

Esperanto.lexc

LEXICON Root

Adjective;

LEXICON Adjective

APrefix;

AStem;

LEXICON APrefix

Neg+:ne AStem;

Op+:mal AStem;

LEXICON AStem

bon ADeriv;

LEXICON ADeriv

LEXICON AMod

+Aug:eg ATag;

+Dim:et ATag;

LEXICON ATag

+Adj:0 ASuff;

+Adj:0 AtoN;

LEXICON ASuff

+ASuff:a Infl;

... etc.

Constraints

[Figure: the affix chart (ne, mal, eg, in, et) with the symbols MF+ and +Fem marked.]

Constraints 2

[Figure: the affix chart (ne, mal, eg, in, et) with the symbols MF+ and +Fem marked.]

MF%+ => _ ~$[%+Fem] %+Pl ;

Constraints 3

xfst[0]: read lexc < esperanto.lexc

Reading from 'adj-noun.lexc'

Root...2, Nouns...2, NounRoots...4, Nmf...5, ....

Building lexicon...Minimizing...Done!

2.7 Kb. 45 states, 70 arcs, Circular.

Closing 'adj-noun-tags.lexc'

xfst[1]: regex MF%+ => _ ~$[%+Fem] %+Pl ;

1.2 Kb, 2 states, 7 arcs, Circular

xfst[2]: compose

3.2 Kb, 61 states, 89 arcs, Circular

Fewer words, bigger network!
