finite-state programming
DESCRIPTION
Finite-State Programming. Some Examples. Finite-state “programming”. Finite-state “programming”. Finite-state “programming”. slide courtesy of L. Karttunen (modified). Some Xerox Extensions. $ containment => restriction -> @-> replacement - PowerPoint PPT PresentationTRANSCRIPT
600.465 - Intro to NLP - J. Eisner 1
Finite-State Programming
Some Examples
600.465 - Intro to NLP - J. Eisner 2
Finite-state “programming”
Function Function on (set of) strings Source code Regular expression Object code Finite state machine Compiler Regexp compiler Optimization of object code
Determinization, minimization, pruning
600.465 - Intro to NLP - J. Eisner 3
Finite-state “programming”
Function composition (Weighted) compositionHigher-order function OperatorFunction inversion(available in Prolog)
Function inversion
Structuredprogramming
Ops + small regexps
600.465 - Intro to NLP - J. Eisner 4
Finite-state “programming”
Parallelism Apply to set of strings Nondeterminism Nondeterminism Stochasticity Prob.-weighted arcs
600.465 - Intro to NLP - J. Eisner 5
Some Xerox Extensions
$ containment=> restriction-> @-> replacement
Make it easier to describe complex languages and relations without extending the formal power of finite-state systems.
slide courtesy of L. Karttunen (modified)
600.465 - Intro to NLP - J. Eisner 6
Containment
$[ab*c]$[ab*c]
““Must contain a substringMust contain a substringthat matches that matches ab*cab*c.”.”
Accepts Accepts xxxacyyxxxacyyRejects Rejects bcbabcba
?* [ab*c] ?* ?* [ab*c] ?*
Equivalent expression Equivalent expression
bb
cc
aa
a,b,c,¿a,b,c,¿
a,b,c,¿a,b,c,¿
Warning: ?? in regexps means “any character at all.”
But ¿¿ in machines means “any character not explicitly
mentioned anywhere in the machine.”
slide courtesy of L. Karttunen (modified)
600.465 - Intro to NLP - J. Eisner 7
Restriction
¿¿cc
bb
bb
cc¿¿ aa
cc
a => b _ ca => b _ c
““AnyAny aa must be preceded bymust be preceded by bband followed byand followed by cc.”.”
Accepts Accepts bacbbacdebacbbacdeRejects Rejects bacbacaa
~[~[~[?* b] a ?*~[?* b] a ?*]] & & ~[~[?* a ~[c ?*]?* a ~[c ?*]]] Equivalent expression Equivalent expression
slide courtesy of L. Karttunen (modified)
contains aa not preceded by bb
contains aa not followed by cc
600.465 - Intro to NLP - J. Eisner 8
Replacement
a:ba:b
bb
aa
¿¿
¿¿
b:ab:a
aa
a:ba:b
a b -> b a
““Replace ‘ab’ by ‘ba’.”Replace ‘ab’ by ‘ba’.”
Transduces Transduces abcdbaba to to bacdbbaa
[[~$[a b] ~$[a b] [[a b] .x. [b a]][[a b] .x. [b a]]]*]* ~$[a b]~$[a b] Equivalent expression Equivalent expression
slide courtesy of L. Karttunen (modified)
600.465 - Intro to NLP - J. Eisner 9
Replacement is Nondeterministic
a b -> b a | x
““Replace ‘ab’ by ‘ba’ or ‘x’, nondeterministically.”Replace ‘ab’ by ‘ba’ or ‘x’, nondeterministically.”
Transduces Transduces abcdbaba to { to {bacdbbaa,, bacdbxa,, xcdbbaa,, xcdbxa}}
600.465 - Intro to NLP - J. Eisner 10
Replacement is Nondeterministic
[ a b -> b a | x ] .o. [ x => _ c ]
““Replace ‘ab’ by ‘ba’ or ‘x’, nondeterministically.”Replace ‘ab’ by ‘ba’ or ‘x’, nondeterministically.”
Transduces Transduces abcdbaba to { to {bacdbbaa,, bacdbxa,, xcdbbaa,, xcdbxa}}
600.465 - Intro to NLP - J. Eisner 11
Replacement is Nondeterministic
a b | b | b a | a b a -> x
applied to “aba”Four overlapping substrings match; we haven’t told it which
one to replace so it chooses nondeterministically
a b a a b a a b a a b aa x a a x x a x
slide courtesy of L. Karttunen (modified)
600.465 - Intro to NLP - J. Eisner 12
More Replace Operators
Optional replacement: a b (->) b a
Directed replacement guarantees a unique result by
constraining the factorization of the input string by Direction of the match (rightward or
leftward) Length (longest or shortest)
slide courtesy of L. Karttunen
600.465 - Intro to NLP - J. Eisner 13
@-> Left-to-right, Longest-match Replacement
a b | b | b a | a b a @-> x
applied to “aba”
a b a a b a a b a a b aa x a a x x a x
slide courtesy of L. Karttunen
@-> left-to-right, longest match @> left-to-right, shortest match ->@ right-to-left, longest match >@ right-to-left, shortest match
600.465 - Intro to NLP - J. Eisner 14
Using “…” for marking
0:[0:[
[[
0:]0:]
¿¿
aa
eeii
oo
uu]]
a|e|i|o|u -> [ ... ]
p o t a t op o t a t op[o]t[a]t[o]p[o]t[a]t[o]
Note: actually have to write as Note: actually have to write as -> %[ ... %] or or -> “[” ... “]”
since [] are parens in the regexp languagesince [] are parens in the regexp language
slide courtesy of L. Karttunen (modified)
600.465 - Intro to NLP - J. Eisner 15
Using “…” for marking
0:[0:[
[[
0:]0:]
¿¿
aa
eeii
oo
uu]]
a|e|i|o|u -> [ ... ]
p o t a t op o t a t op[o]t[a]t[o]p[o]t[a]t[o]
Which way does the FST transduce potatoe?
slide courtesy of L. Karttunen (modified)
How would you change it to get the other answer?
p o t a t o ep o t a t o ep[o]t[a]t[o][e]p[o]t[a]t[o][e]
p o t a t o ep o t a t o ep[o]t[a]t[o e]p[o]t[a]t[o e]
vs.
600.465 - Intro to NLP - J. Eisner 16
Example: Finnish Syllabificationdefine C [ b | c | d | f ...define C [ b | c | d | f ...define V [ a | e | i | o | u ];define V [ a | e | i | o | u ];
s t r u k t u r a l i s m s t r u k t u r a l i s m iis t r u k - t u - r a - l i s - m s t r u k - t u - r a - l i s - m ii
[C* V+ C*] @-> ... "-" || _ [C V][C* V+ C*] @-> ... "-" || _ [C V]
““Insert a hyphen after the longest instance of theInsert a hyphen after the longest instance of the C* V+ C*C* V+ C* pattern in front of a pattern in front of a C VC V pattern.” pattern.”
slide courtesy of L. Karttunen
why?why?
600.465 - Intro to NLP - J. Eisner 17
Conditional Replacement
The relation that replaces A by B between L and R leaving everything else unchanged.
A -> BA -> BReplacement
L _ RL _ R
Context
Sources of complexity:
Replacements and contexts may overlap
Alternative ways of interpreting “between left and right.”
slide courtesy of L. Karttunen
600.465 - Intro to NLP - J. Eisner 18
Hand-Coded Example: Parsing DatesToday is [Tuesday, July 25, 2000].
Today is Tuesday, [July 25, 2000].Today is [Tuesday, July 25], 2000.Today is Tuesday, [July 25], 2000.Today is [Tuesday], July 25, 2000.
Best result
Bad results
Need left-to-right, longest-match constraints.
slide courtesy of L. Karttunen
600.465 - Intro to NLP - J. Eisner 19
Source code: Language of Dates
Day = Monday | Tuesday | ... | SundayMonth = January | February | ... | DecemberDate = 1 | 2 | 3 | ... | 3 1Year = %0To9 (%0To9 (%0To9 (%0To9))) - %0?* from 1 to 9999
AllDates = Day | (Day “, ”) Month “ ” Date (“, ” Year))
slide courtesy of L. Karttunen
600.465 - Intro to NLP - J. Eisner 20
actually represents 7 arcs, each labeled by a string
Object code: All Dates from 1/1/1 to 12/31/9999
, ,
FebJan
Mar
MayJunJul
Apr
Aug
OctNovDec
Sep
3
,
,
123456789
0123456789
0123456789
0123456789
123456789
0
10
21
TueMon
Wed
FriSatSun
Thu 456789
MayJan Feb Mar Apr JunJul Aug Oct Nov DecSep
13 states, 96 arcs13 states, 96 arcs29 760 007 date expressions29 760 007 date expressions
slide courtesy of L. Karttunen
600.465 - Intro to NLP - J. Eisner 21
Parser for DatesAllDates @-> “[DT ” ... “]”
Compiles into an unambiguous
transducer (23 states, 332 arcs).
Today is Today is [DT Tuesday, July 25, 2000][DT Tuesday, July 25, 2000] because yesterday because yesterday
was was [DT Monday][DT Monday] and it was and it was [DT July 24][DT July 24] so tomorrow must so tomorrow must
be be [DT Wednesday, July 26][DT Wednesday, July 26] and not and not [DT July 27][DT July 27] as it says as it says
on the program.on the program.
Xerox left-to-right replacement operator
slide courtesy of L. Karttunen (modified)
600.465 - Intro to NLP - J. Eisner 22
Problem of Reference
Valid datesTuesday, July 25, 2000Tuesday, February 29, 2000Monday, September 16, 1996
Invalid datesWednesday, April 31, 1996Thursday, February 29, 1900Tuesday, July 26, 2000
slide courtesy of L. Karttunen
600.465 - Intro to NLP - J. Eisner 23
Refinement by IntersectionAllDatesAllDates
ValidValidDatesDates
WeekdayDateWeekdayDate
MaxDaysMaxDaysIn MonthIn Month
“ 31” => Jan|Mar|May|… _“ 30” => Jan|Mar|Apr|… _
LeapYearsLeapYears
Feb 29, => _ …
slide courtesy of L. Karttunen (modified)
Xerox contextual restriction operator
Q: Why do these rulesstart with spaces?(And is it enough?)
Q: Why does this ruleend with a comma?Q: Can we write the whole rule?
Q: LeapYears made use of a “divisible by 4” FSA; can we build a “divisible by 7” FSA (base-ten input)?
600.465 - Intro to NLP - J. Eisner 24
Defining Valid Dates
AllDates&
MaxDaysInMonth&
LeapYears&
WeekdayDates
= ValidDates
AllDates: 13 states, 96 arcs29 760 007 date expressions
ValidDates: 805 states, 6472 arcs7 307 053 date expressions
slide courtesy of L. Karttunen
600.465 - Intro to NLP - J. Eisner 25
Parser for Valid and Invalid Dates
[AllDates - ValidDates] @-> “[ID ” ... “]”,
ValidDates @-> “[VD ” ... “]”
Today is [VD Tuesday, July 25, 2000],not [ID Tuesday, July 26, 2000].
valid date
invalid date
2688 states,20439 arcs
slide courtesy of L. Karttunen
Comma creates a single FST that does left-to-right
longest match against either pattern
600.465 - Intro to NLP - J. Eisner 26
More Engineering Applications Markup
Dates, names, places, noun phrases; spelling/grammar errors?
Hyphenation Informative templates for information extraction (FASTUS) Word segmentation (use probabilities!) Part-of-speech tagging (use probabilities – maybe!)
Translation Spelling correction / edit distance Phonology, morphology: series of little fixups? constraints? Speech Transliteration / back-transliteration Machine translation?
Learning …
600.465 - Intro to NLP - J. Eisner 27
FASTUS – Information Extraction Appelt et al, 1992-?
Input: Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with …
Output:Relationship: TIE-UPEntities: “Bridgestone Sports Co.”
“A local concern”“A Japanese trading house”
Joint Venture Company: “Bridgestone Sports Taiwan Co.”Amount: NT$20000000
600.465 - Intro to NLP - J. Eisner 28
FASTUS: Successive Markups(details on subsequent slides)
Tokenization.o.
Multiwords.o.
Basic phrases (noun groups, verb groups …).o.
Complex phrases.o.
Semantic Patterns.o.
Merging different references
600.465 - Intro to NLP - J. Eisner 29
FASTUS: Tokenization
Spaces, hyphens, etc. wouldn’t would not their them ’s company. company .
butCo. Co.
600.465 - Intro to NLP - J. Eisner 30
FASTUS: Multiwords “set up” “joint venture” “San Francisco Symphony Orchestra,”
“Canadian Opera Company”
… use a specialized regexp to match musical groups.
... what kind of regexp would match company names?
600.465 - Intro to NLP - J. Eisner 31
FASTUS : Basic phrases
Output looks like this (no nested brackets!):… [NG it] [VG had set_up] [NG a joint_venture] [Prep in]
…
Company Name: Bridgestone Sports Co.Verb Group: saidNoun Group: FridayNoun Group: itVerb Group: had set upNoun Group: a joint venturePreposition: inLocation: TaiwanPreposition: withNoun Group: a local concern
600.465 - Intro to NLP - J. Eisner 32
FASTUS: Noun Groups
Build FSA to recognize phrases likeapproximately 5 kgmore than 30 peoplethe newly elected presidentthe largest leftist political forcea government and commercial project
Use the FSA for left-to-right longest-match markup
What does FSA look like? See next slide …
600.465 - Intro to NLP - J. Eisner 33
FASTUS: Noun GroupsDescribed with a kind of non-recursive CFG …(a regexp can include names that stand for other regexps)
NG Pronoun | Time-NP | Date-NPNG (Det) (Adjs) HeadNouns…Adjs sequence of adjectives maybe with commas,
conjunctions, adverbs…Det DetNP | DetNonNPDetNP detailed expression to match “the only five,
another three, this, many, hers, all, the most …”…
600.465 - Intro to NLP - J. Eisner 34
FASTUS: Semantic patterns
BusinessRelationship =NounGroup(Company/ies) VerbGroup(Set-up) NounGroup(JointVenture) with NounGroup(Company/ies) | …
ProductionActivity = VerbGroup(Produce) NounGroup(Product)
NounGroup(Company/ies) NounGroup & … is made easy by the processing done at a previous level
Use this for spotting references to put in the database.