Finite-State Methods in Natural Language Processing
Lauri Karttunen
LSA 2005 Summer Institute
July 25, 2005
Course Outline
July 18: Intro to computational morphology; XFST
  Readings:
  Lauri Karttunen, “Finite-State Constraints”, The Last Phonological Rule. J. Goldsmith (ed.), pages 173-194, University of Chicago Press, 1993.
  Karttunen and Beesley, “25 Years of Finite-State Morphology”
  Chapter 1: “Gentle Introduction” (B&K)
July 20: Regular expressions; More on XFST
  Readings:
  Chapter 2: “Systematic Introduction”
  Chapter 3: “The XFST interface”
July 25: More on XFST: Date Parser; Concatenative morphotactics: The LEXC language
  Readings:
  Chapter 4. “The LEXC Language”
July 27: Constraining non-local dependencies: Flag Diacritics; Non-concatenative morphotactics (reduplication, interdigitation)
  Readings:
  Chapter 5. “Flag Diacritics”
  Chapter 8. “Non-Concatenative Morphotactics”
August 1: Realizational morphology
  Readings:
  Gregory T. Stump. Inflectional Morphology. A Theory of Paradigm Structure. Cambridge U. Press. 2001. (An excerpt)
  Lauri Karttunen, “Computing with Realizational Morphology”, Lecture Notes in Computer Science, Volume 2588, Alexander Gelbukh (ed.), 205-216, Springer Verlag. 2003.
August 3: Optimality theory
  Readings:
  Paul Kiparsky, “Finnish Noun Inflection”, Generative Approaches to Finnic and Saami Linguistics, Diane Nelson and Satu Manninen (eds.), pp. 109-161, CSLI Publications, 2003.
  Nine Elenbaas and René Kager, “Ternary rhythm and the lapse constraint”, Phonology 16. 273-329.
Solution to Assignment 1, Part 1
define Hundreds [OneToNine { hundred} ({ } OneToNinetyNine)];
define OneTo999 [OneToNine | Teens | Tens | Hundreds];
define Thousands [OneTo999 { thousand} ({ } OneTo999)];
define UpToMillion [OneToNine | Teens | Tens | Hundreds | Thousands];
What is this?
xfst[0]: source Dutch.script
xfst[1]: print random-lower 3
tweeennegentig
vierenveertig
eenennegentig
xfst[1]: define Dutch
xfst[0]: source English.script
xfst[1]: print random-lower 3
twenty-seven
ninety-one
forty-five
xfst[1]: define English
xfst[0]: regex Dutch.i .o. English ;
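The expression Dutch.i .o. English inverts one relation and composes the result with another. Assuming each script builds a transducer with numerals on the upper side and number names on the lower side (the slide does not say so explicitly), the composition maps Dutch number names to English ones. A minimal Python sketch of the two operations on small hand-made relations; the three pairs are illustrative and are not taken from the scripts:

```python
# Finite relations modeled as sets of (upper, lower) string pairs.
# The pairs below are hypothetical stand-ins for the script contents.
DUTCH = {("92", "tweeennegentig"), ("44", "vierenveertig"), ("91", "eenennegentig")}
ENGLISH = {("92", "ninety-two"), ("44", "forty-four"), ("91", "ninety-one")}

def invert(rel):
    """Swap the upper and lower sides (the .i operator)."""
    return {(lower, upper) for upper, lower in rel}

def compose(r1, r2):
    """Pair a with d whenever r1 maps a to some b and r2 maps b to d (the .o. operator)."""
    return {(a, d) for a, b in r1 for c, d in r2 if b == c}

DUTCH_TO_ENGLISH = compose(invert(DUTCH), ENGLISH)
```

Here DUTCH_TO_ENGLISH contains the pair ("vierenveertig", "forty-four"); the numerals have disappeared because composition keeps only the outer sides.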
Syllabification
define C [ b | c | d | f ...
define V [ a | e | i | o | u ];
s t r u k t u r a l i s m i
s t r u k . t u . r a . l i s . m i
[C* V+ C*] @-> ... "." || _ [C V]
“Insert a syllable boundary (a period) after the longest instance of the C* V+ C* pattern in front of a C V pattern.”
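The effect of this rule can be imitated with an ordinary backtracking regex. The sketch below is an approximation, not the XFST compilation: the right context becomes a lookahead, and the greedy quantifiers happen to yield the longest match for this particular pattern. The consonant list is an assumption, since the slide elides it.

```python
import re

# Assumed consonant inventory; the slide shows only "b | c | d | f ...".
C = "bcdfghjklmnpqrstvwxz"
V = "aeiou"

# C* V+ C* in front of C V; the lookahead checks the context without consuming it.
PATTERN = re.compile(rf"[{C}]*[{V}]+[{C}]*(?=[{C}][{V}])")

def syllabify(word):
    """Append a period to each left-to-right match of the pattern."""
    return PATTERN.sub(lambda m: m.group(0) + ".", word)
```

For example, syllabify("strukturalismi") returns "struk.tu.ra.lis.mi", matching the slide.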
Finnish Syllabification
# -*- coding: utf8 -*-
define FinnWords [{kala}|{riippuu}|{tietoinen}|{sataa}|
{satoi}|{saata}|{saatoin}|{auta}|{laiva}|
{leipä}|{häijy}|{koulu}|{köyhä}|{lea}|
{viestien}|{tuote}|{virtu.ositeetti}|
{laukaus}|{lakkautan}|{voimistelijoiden}|
{heittäen}|{heittäisin}|{laulaen}];
define HighV [u | y | i]; # High vowel
define MidV [e | o | ö]; # Mid vowel
define LowV [a | ä] ; # Low vowel
define V [HighV | MidV | LowV]; # Vowel
define LongV [{aa}|{ee}|{ii}|{oo}|{uu}|{yy}|{ää}|{öö}];
define Diph [[[MidV | LowV] HighV]|{ie}|{uo}|{yö}];
Syllabification (Continued)
define Nuc [V | LongV | Diph];
define C [b | c | d | f | g | h | j | k | l | m |
n | p | q | r | s | t | v | w | x | z];
define Syllabify [ C* Nuc C* @-> ... "." || _ C V ] ;
regex FinnWords .o. Syllabify ;
print lower-words
Syllabification (continued)
Problem cases
Incorrect     Correct
lea           le.a
lau.laen      lau.la.en
lau.kaus      lau.ka.us
define Syllabify [ C* Nuc C* @-> ... "." || _ C V
.o.
[. .] -> "." || [a | ä | i] _ [e | u | y] (C) .#. ,
e _ a ] ;
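The two-stage rule can be approximated in Python as two substitutions applied in sequence, which plays the role of the composition. This is a sketch under stated assumptions: a backtracking regex stands in for the longest-match replace operator, and the nucleus alternatives are ordered long-first so that the longer nucleus is tried first.

```python
import re

C = "[bcdfghjklmnpqrstvwxz]"
V = "[uyieoöaä]"
LONG_V = "(?:aa|ee|ii|oo|uu|yy|ää|öö)"
DIPH = "(?:[eoöaä][uyi]|ie|uo|yö)"      # [MidV | LowV] HighV, or ie, uo, yö
NUC = f"(?:{LONG_V}|{DIPH}|{V})"

# Stage 1: C* Nuc C* in front of C V.
STAGE1 = re.compile(f"{C}*{NUC}{C}*(?={C}{V})")
# Stage 2: empty-string positions a|ä|i _ e|u|y (C) end-of-word, or e _ a.
STAGE2 = re.compile(f"(?<=[aäi])(?=[euy]{C}?$)|(?<=e)(?=a)")

def syllabify(word):
    """Mark boundaries with stage 1, then repair the vowel-sequence cases."""
    marked = STAGE1.sub(lambda m: m.group(0) + ".", word)
    return STAGE2.sub(".", marked)
```

With only the first stage, "laulaen" comes out as "lau.laen"; the second stage inserts the missing boundary and gives "lau.la.en".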
Parsing Dates
Best result
Today is [Monday, July 25, 2005].
Bad results
Today is Monday, [July 25, 2005].
Today is [Monday, July 25], 2005.
Today is Monday, [July 25], 2005.
Today is [Monday], July 25, 2005.
Need left-to-right, longest-match constraints.
Defining the Language of Dates
define OneToNine [1|2|3|4|5|6|7|8|9];
define ZeroToNine ["0"|OneToNine];
define Day [{Monday} | {Tuesday} | {Wednesday} | {Thursday} | {Friday} | {Saturday} | {Sunday}] ;
define Month29 {February};
define Month30 [{April} | {June} | {September} | {November}];
define Month31 [{January} | {March} | {May} | {July} | {August} | {October} | {December}] ;
define Month [Month29 | Month30 | Month31];
Language of Dates (Continued)
# Date is a number from 1 to 31
define Date [OneToNine | [1 | 2] ZeroToNine | 3 [%0 | 1]];
# Year is a number from 1 to 9999 (watch out for the Y10K bug!)
define Year [OneToNine ZeroToNine^<4];
# A date expression consists of a Day (Monday) or a Month and a Date (July 25) with an optional Day (Monday, July 25) and Year (July 25, 2005) or both (Monday, July 25, 2005).
define AllDates [Day | (Day {, }) Month { } Date ({, } Year)];
All Dates from 1.1.1 to 31.12.9999
[Figure: network for AllDates, with arcs for the day names (Mon ... Sun), the month names (Jan ... Dec), the digits, and the comma separators.]
13 states, 96 arcs
29 760 007 date expressions
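The number of date expressions follows directly from the shape of the AllDates definition. The short calculation below assumes twelve distinct month names, dates 1 to 31, and years 1 to 9999; the variable names are illustrative.

```python
DAYS = 7        # Monday ... Sunday
MONTHS = 12     # Month29 | Month30 | Month31
DATES = 31      # Date: 1 ... 31
YEARS = 9999    # Year: 1 ... 9999

# (Day ", ") Month " " Date (", " Year): the Day and the Year are each optional.
month_date_forms = (1 + DAYS) * MONTHS * DATES * (1 + YEARS)

# A Day on its own is also a date expression.
TOTAL = DAYS + month_date_forms
```

TOTAL is 29760007, the figure given on the slide.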
Parser for Dates
AllDates @-> "<DT>" ... "</DT>"
Compiles into an unambiguous transducer (136 states, 2798 arcs).
Today is <DT>Monday, July 25, 2005</DT> because yesterday was <DT>Sunday</DT> and it was <DT>July 24</DT> so tomorrow must be <DT>Tuesday, July 26</DT> and not <DT>July 27</DT> as it says on the program.
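A rough Python counterpart of this parser is sketched below. It is not a transducer: a backtracking regex stands in for the left-to-right, longest-match operator, which works here because the optional parts are greedy and the alternatives are ordered so that the longer reading is tried first. The names ALL_DATES and tag are illustrative.

```python
import re

DAY = "(?:Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)"
MONTH = ("(?:January|February|March|April|May|June|July|August|"
         "September|October|November|December)")
DATE = "(?:3[01]|[12][0-9]|[1-9])"      # two-digit readings tried first
YEAR = "(?:[1-9][0-9]{0,3})"            # 1 ... 9999

# (Day ", ") Month " " Date (", " Year), or a Day on its own.
ALL_DATES = re.compile(f"(?:{DAY}, )?{MONTH} {DATE}(?:, {YEAR})?|{DAY}")

def tag(text):
    """Wrap every date expression found in text in <DT> ... </DT>."""
    return ALL_DATES.sub(lambda m: f"<DT>{m.group(0)}</DT>", text)
```

Applied to the untagged sentence from the slide, tag marks the same five spans.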
Problem of Reference
Valid dates
Monday, July 25, 2005
Tuesday, February 29, 2000
Monday, September 16, 1996
Invalid dates
Wednesday, April 31, 1996
Thursday, February 29, 1900
Tuesday, July 25, 2005
Refinement by Intersection
[Figure: ValidDates as the intersection of AllDates with three constraints.]
MaxDays in Month: ~$[Month29 { 30}];
LeapYears: Feb 29 => _ ...
WeekdayDate
MaxDays
define MaxDays30 ~$[Month29 { 30}];
define MaxDays31 ~$[[Month29 | Month30] { 31}];
define MaxDays [MaxDays30 & MaxDays31];
LeapYear constraint
define Even [{0} | 2 | 4 | 6 | 8] ;
define Odd [1 | 3 | 5 | 7 | 9] ;
define N [Even | Odd];
define Div4 [4 | 8 |
N* [Even [%0 | 4 | 8] |
Odd [2 | 6]]];
define LeapYear [Div4 - [[N+ - Div4] {00}]] ;
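The Div4 and LeapYear definitions can be checked against ordinary calendar arithmetic. The sketch below transcribes them into a Python regex and a small function, then compares the result with calendar.isleap for every year from 1 to 9999; the function names are illustrative.

```python
import calendar
import re

# Div4: 4, 8, or any digit string ending in (even digit)(0|4|8) or (odd digit)(2|6).
DIV4 = re.compile(r"[48]|[0-9]*(?:[02468][048]|[13579][26])")

def is_div4(digits):
    return DIV4.fullmatch(digits) is not None

def is_leap_year(digits):
    """LeapYear = Div4 - [[N+ - Div4] {00}]."""
    if not is_div4(digits):
        return False
    prefix = digits[:-2]
    # Century years are excluded unless the part before "00" is itself in Div4.
    excluded = digits.endswith("00") and prefix != "" and not is_div4(prefix)
    return not excluded

MISMATCHES = [y for y in range(1, 10000)
              if is_leap_year(str(y)) != calendar.isleap(y)]
```

MISMATCHES is empty: the string-based definition agrees with the Gregorian rule over the whole range of Year.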
LeapYear Constraint (Continued)
Bad Solution 1
define LeapDates {February 29, } => _ LeapYear ;
Bad Solution 2
define NotLeapYear [Year - LeapYear];
define LeapDates ~$[{February 29, } NotLeapYear];
Almost Correct
define LeapDates [
{February 29, } => _ [?* - [NotLeapYear [\N]*]]];
Good Solution
define LeapDates [
{February 29, } => _ [?* - [NotLeapYear [\N]*]] .#.];
Vacuous Context Conditions
A context condition L _ R is compiled as ?* L _ R ?*.
Any expression that contains the empty string is “swallowed up” when concatenated with ?*.
(a) ?* == ?* (a) == ?*
[?* - a] ?* == ?* [?* - a] == ?*
~a ?* == ?* ~a == ?*
Not vacuous:
a -> b || _ c* [.#.| \c] ;
DateParsers
define ValidDates [AllDates & MaxDays & LeapDates];
define ValidDateParser [ValidDates @->
"<DATE>" ... "</DATE>"
|| _ [.#. | \N]];
define InvalidDates [AllDates - ValidDates];
define InvDateParser [InvalidDates @->
"<INV-DATE>" ... "</INV-DATE>"
|| _ [.#. | \N]];
define DateParser [InvDateParser .o. ValidDateParser];
<INV-DATE><DATE>February 29</DATE>, 1900</INV-DATE>
Date/NonDate parser 1
define DateParser [ValidDateParser .o. InvDateParser];
<DATE>February 29</DATE>, 1900
No nested tags for the input "February 29, 1900" because InvDateParser does not apply to strings that have been tagged already.
Date/NonDate parser 2
define DateParser [ValidDates @-> "<DATE>" ... "</DATE>",
InvalidDates @-> "<NON-DATE>" ... "</NON-DATE>"
|| _ [\N | .#.]];
Parallel replacement of two patterns with the same constraint on the right context.
<NON-DATE>February 29, 1900</NON-DATE>
<DATE>February 29, 2000</DATE>
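The same behavior can be imitated in Python with one pass that finds each candidate and then labels it. This sketch swaps in a different technique: validity is decided by calendar arithmetic rather than by intersecting regular languages, and the optional weekday is left out for brevity. The negative lookahead plays the role of the right context [\N | .#.].

```python
import calendar
import re

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Month Date (", " Year), not followed by another digit.
DATE_RE = re.compile(
    "(" + "|".join(MONTHS) + r") (3[01]|[12][0-9]|[1-9])(?:, ([1-9][0-9]{0,3}))?(?![0-9])"
)

def is_valid(month, day, year):
    month_no = MONTHS.index(month) + 1
    # Without a year, February 29 is acceptable, so check against a leap year.
    check_year = int(year) if year else 2000
    return int(day) <= calendar.monthrange(check_year, month_no)[1]

def tag(text):
    """Label each candidate as DATE or NON-DATE in a single pass."""
    def wrap(m):
        label = "DATE" if is_valid(*m.groups()) else "NON-DATE"
        return f"<{label}>{m.group(0)}</{label}>"
    return DATE_RE.sub(wrap, text)
```

Because each candidate is matched once and receives exactly one label, the output cannot contain nested tags.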
Observations
For some subsets of natural language, such as dates, a finite-state description is more appropriate than a phrase structure grammar.
Regular languages and relations can be modified directly with the finite-state calculus without rewriting the grammars that describe them.
This is a fundamental advantage over higher-level formalisms.
The LEXC Formalism
What is LEXC?
A special application for making lexical transducers (on the B&K book CD).
A language for describing morphotactic constraints by way of sublexicons and continuation classes.
Why another regular expression formalism?
The general regular expression compiler in XFST is oriented towards compiling networks from symbols and symbol pairs, not from words. LEXC is word-based.
Compiling large lexicons (tens of thousands of words) with the standard union operator is inefficient. LEXC uses a different, more efficient algorithm for building networks from lists of words, stems, and affixes.
LEXC Syntax
Multichar_Symbols +Noun +Sg +Pl
Lexicon Root
  cat SgPl ;
  dog SgPl ;
  goose Sg ;
  goose:geese Pl ;
Lexicon SgPl
  Sg ;
  0:s Pl ;
Lexicon Sg
  +Noun+Sg:0 # ;
Lexicon Pl
  +Noun+Pl:0 # ;
Multicharacter symbols need to be declared.
There must be a sublexicon called ‘Root’.
Entries consist of an optional string or string pair followed by an obligatory continuation class.
Every continuation class must refer to a sublexicon, except for #, the termination class.
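To see what continuation classes do, the small lexicon above can be expanded by hand into its upper:lower pairs. The Python sketch below follows continuation classes recursively from Root to #. It only illustrates the semantics, works only for lexicons without cycles, and does not reflect how LEXC builds a network; the data layout is an assumption made for this example.

```python
# Each entry is (upper, lower, continuation class); 0 in the source is the empty string.
LEXICON = {
    "Root": [("cat", "cat", "SgPl"), ("dog", "dog", "SgPl"),
             ("goose", "goose", "Sg"), ("goose", "geese", "Pl")],
    "SgPl": [("", "", "Sg"), ("", "s", "Pl")],
    "Sg": [("+Noun+Sg", "", "#")],
    "Pl": [("+Noun+Pl", "", "#")],
}

def expand(name="Root"):
    """Yield every (upper, lower) pair reachable from the sublexicon name."""
    if name == "#":                      # termination class: nothing more to add
        yield ("", "")
        return
    for upper, lower, continuation in LEXICON[name]:
        for rest_upper, rest_lower in expand(continuation):
            yield (upper + rest_upper, lower + rest_lower)

PAIRS = list(expand())
```

PAIRS contains six entries, among them ("cat+Noun+Pl", "cats") and ("goose+Noun+Pl", "geese").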
Esperanto chart
[Figure, repeated in five build steps: chart of Esperanto morphemes (ge, ne, mal; hund, bon; eg, et, in, ec; o, a; j, n) and the sublexicons that cover them: Root, Adjective, APrefix, AStem, ADeriv, AMod, ATag, ASuff, AtoN, Infl, Plur, Acc, Noun, NPrefix, NStem, NDeriv, NMod, NTag, NSuff.]
Esperanto.lexc
LEXICON Root
Adjective;
Noun;
LEXICON Adjective
APrefix;
AStem;
LEXICON APrefix
Neg+:ne AStem;
Op+:mal AStem;
LEXICON AStem
bon ADeriv;
LEXICON ADeriv
ATag;
AMod;
LEXICON AMod
+Aug:eg ATag;
+Dim:et ATag;
LEXICON ATag
+Adj:0 ASuff;
+Adj:0 AtoN;
LEXICON ASuff
+ASuff:a Infl;
... etc.
Constraints
[Figure: the Esperanto chart with ge, in, and j marked, alongside the tags MF+, +Fem, and +Pl.]
Constraints 2
[Figure: the Esperanto chart with the tags MF+, +Fem, and +Pl.]
MF%+ => _ ~$[%+Fem] %+Pl ;
Constraints 3
xfst[0]: read lexc < esperanto.lexc
Reading from 'adj-noun.lexc'
Root...2, Nouns...2, NounRoots...4, Nmf...5, ....
Building lexicon...Minimizing...Done!
2.7 Kb. 45 states, 70 arcs, Circular.
Closing 'adj-noun-tags.lexc'
xfst[1]: regex MF%+ => _ ~$[%+Fem] %+Pl ;
1.2 Kb, 2 states, 7 arcs, Circular
xfst[2]: compose
3.2 Kb, 61 states, 89 arcs, Circular
Fewer words, bigger network!