
Finite-State Methods in Natural Language Processing

Lauri Karttunen

LSA 2005 Summer Institute

July 27, 2005

Course Outline

July 18: Intro to computational morphology; XFST

Readings:
Lauri Karttunen, “Finite-State Constraints”, The Last Phonological Rule, J. Goldsmith (ed.), pages 173-194, University of Chicago Press, 1993.
Karttunen and Beesley, “25 Years of Finite-State Morphology”
Chapter 1: “Gentle Introduction” (B&K)

July 20: Regular expressions; more on XFST

Readings:
Chapter 2: “Systematic Introduction”
Chapter 3: “The XFST interface”

July 25: More on XFST: Date Parser; concatenative morphotactics: the LEXC language

Readings:
Chapter 4: “The LEXC Language”

July 27: Constraining non-local dependencies: Flag Diacritics; complex morphotactics and alternations: Finnish Numerals

Readings:
Chapter 5: “Flag Diacritics”

August 1: Non-concatenative morphotactics (reduplication, interdigitation); realizational morphology

Readings:
Chapter 8: “Non-Concatenative Morphotactics”
Gregory T. Stump, Inflectional Morphology: A Theory of Paradigm Structure, Cambridge U. Press, 2001. (An excerpt)
Lauri Karttunen, “Computing with Realizational Morphology”, Lecture Notes in Computer Science, Volume 2588, Alexander Gelbukh (ed.), 205-216, Springer Verlag, 2003.

August 3: Optimality theory

Readings:
Paul Kiparsky, “Finnish Noun Inflection”, Generative Approaches to Finnic and Saami Linguistics, Diane Nelson and Satu Manninen (eds.), pp. 109-161, CSLI Publications, 2003.
Nine Elenbaas and René Kager, “Ternary rhythm and the lapse constraint”, Phonology 16, 273-329.

Syllabification revisited

define MarkNonDiphthongs [
  [. .] -> "." || [HighV | MidV] _ LowV,   # i.a, e.a
                  LowV _ MidV,             # a.e
                  i _ [MidV - e],          # i.o, i.ä
                  u _ [MidV - o],          # u.e
                  y _ [MidV - ö],          # y.e
                  $V i _ e,                # poiki.en
                  V u _ o,                 #
                  $V y _ ö,                #
                  $V [MidV | LowV] _ [u|y] C C|.#.]];   # oike.us

define Syllabify [ C* V+ C* @-> ... "." || _ C V ];

regex FinnWords .o. MarkNonDiphthongs .o. Syllabify;
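The Syllabify step can be imitated outside xfst to see what the directed longest-match replacement does. The following is only a rough Python sketch, not the course's implementation: the letter classes `VOWELS` and `CONSONANTS` are assumptions (the slide does not spell out `C` and `V`), and the input is expected to carry any "." marks already inserted by MarkNonDiphthongs.

```python
import re

# Assumed letter classes; the slide does not define C and V explicitly.
VOWELS = "aeiouyäö"
CONSONANTS = "bcdfghjklmnpqrstvwxz"

# C* V+ C* in the right context _ C V, mirroring
#   C* V+ C* @-> ... "." || _ C V
# Greedy quantifiers plus leftmost matching stand in for xfst's
# left-to-right, longest-match replacement.
SYLLABLE = re.compile(
    rf"[{CONSONANTS}]*[{VOWELS}]+[{CONSONANTS}]*(?=[{CONSONANTS}][{VOWELS}])"
)


def syllabify(word: str) -> str:
    """Insert "." after every matched C* V+ C* span."""
    return SYLLABLE.sub(lambda m: m.group(0) + ".", word)


print(syllabify("kalastaja"))
print(syllabify("oike.us"))
```

Because "." belongs to neither class, a boundary that MarkNonDiphthongs has already placed blocks the match and is left untouched.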

Constraints

[Figure: network diagram of the Esperanto lexicon. Recoverable arc labels include ge, hund, bon, eg, et, o, a, ec, j, n and the tags MF+, +Fem, +Pl.]

MF%+ => _ ~$[%+Fem] %+Pl ;

Constraining by composition

xfst[0]: read lexc < adj-noun-tags.lexc
Root...2, Nouns...2, NounRoots...4, Nmf...5, ....
Building lexicon...Minimizing...Done!
2.7 Kb. 45 states, 70 arcs, Circular.

xfst[1]: up gehundino
MF+hund+Noun+Fem+Sg

xfst[1]: regex "MF+" => _ ~$["+Fem"] "+Pl" ;
1.2 Kb, 2 states, 7 arcs, Circular

xfst[2]: compose
3.2 Kb, 61 states, 89 arcs, Circular
xfst[1]: up gehundino
*** Not accepted ***

Fewer words, bigger network.
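What the restriction accepts can be spelled out as a plain predicate over symbol sequences. This is an illustrative Python sketch, not how xfst compiles the rule: the function name and the list-of-symbols representation are assumptions made for the example.

```python
def satisfies_restriction(symbols, target="MF+", banned="+Fem", closer="+Pl"):
    """Check "MF+" => _ ~$["+Fem"] "+Pl" on a list of symbols.

    Every occurrence of `target` must be followed, somewhere to its right,
    by `closer`, with no `banned` symbol between the two.
    """
    for i, sym in enumerate(symbols):
        if sym != target:
            continue
        licensed = False
        for later in symbols[i + 1:]:
            if later == closer:
                licensed = True
                break
            if later == banned:
                break
        if not licensed:
            return False
    return True


# Analysis of "gehundino" from the session above: rejected by the constraint.
print(satisfies_restriction(["MF+", "h", "u", "n", "d", "+Noun", "+Fem", "+Sg"]))
# A plural without +Fem is fine.
print(satisfies_restriction(["MF+", "h", "u", "n", "d", "+Noun", "+Pl"]))
```

Composition builds this check into the network itself, which is why the result has more states and arcs even though it accepts fewer words.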

Esperanto with Flags

Multichar_Symbols +Noun +Adj +Nsuff +ASuff +Nize
+Pl +Sg +Acc MF+
+Aug +Dim +Fem Op+ Neg+
@U.MF.Yes@ @U.MF.No@

LEXICON Root
Nouns ;
Adjectives ;

LEXICON Nouns
NounRoots ;
@U.MF.Yes@ Ge ;

LEXICON Ge
MF+:ge NounRoots ;

LEXICON NounRoots
bird Nmf ;
hund Nmf ;
kat Nmf ;

LEXICON Nmf
+Noun:0 AugDimFem ;

LEXICON AugDimFem
@U.MF.No@ Fem ;
+Dim:et AugDimFem ;
+Aug:eg AugDimFem ;
Nend ;
Adjend ;

LEXICON Fem
+Fem:in AugDimFem ;

Constraining by flags

xfst[0]: read lexc < esperanto-flags.lexc

xfst[1]: up gehundino
xfst[1]:
xfst[1]: down MF+hund+Noun+Fem+NSuff+Sg
xfst[1]:

xfst[1]: set obey-flags off
variable obey-flags = off

xfst[1]: up gehundino
MF+hund+Noun+Fem+NSuff+Sg

xfst[1]: set show-flags on
variable show-flags = on

xfst[1]: down MF+hund+Noun+Fem+NSuff+Sg
@U.MF.Yes@gehund@U.MF.No@ino@U.MF.No@

Flags in the sigma

xfst[1]: print sigma

MF+ Neg+ Op+ a b c d e f g h i j k l m n o r

t u v +ASuff +Acc +Adj +Aug +Dim +Fem +Nsuff

+Nize +Noun +Pl +Sg @U.MF.No@ @U.MF.Yes@

Size: 35

@U.MF.Yes@: UNIFY feature 'MF' with value 'Yes'

@U.MF.No@: UNIFY feature 'MF' with value 'No'

2 flag diacritics

Eliminating flags

xfst[1]: eliminate flag MF
3.2 Kb. 61 states 89 arcs, Circular

xfst[1]: print sigma
MF+ Neg+ Op+ a b c d e f g h i j k l m n o r t u v
+ASuff +Acc +Adj +Aug +Dim +Fem +NSuff +Nize +Noun +Pl +Sg
Size: 33

The eliminate flag command composes the network with constraint networks that have the same effect as the flag diacritics that are removed.

Flag Diacritics

Special symbols for encoding features, that is, attribute-value pairs.

Checked at runtime to avoid the cost of compiling them into the structure of the network

If a check fails, the path is abandoned.

Attributes and Values

Epsilon arcs with feature constraints.

@U.Feature.Value@   Unify ‘Feature’ with ‘Value’ if possible.

@C.Feature@   Set ‘Feature’ to the unspecified value.

Rules

There can be any number of attributes.

An attribute can have any number of values.

If the value of an attribute is unspecified, it unifies successfully with any given value and is set to that value.

If the value of an attribute is specified, it unifies only with the given value.

Actions: Unify, Positive Set

@U.Feature.Value@   Unify Value with the current setting of Feature, if possible. Otherwise fail.

@P.Feature.Value@   Set Feature to Value regardless of the current setting. Always succeeds.

More Actions: Negative Set, Clear

@N.Feature.Value@   Set Feature to the complement of Value regardless of the current setting. Always succeeds.

@C.Feature@   Make Feature be unspecified. Always succeeds.

More Actions: Require

@R.Feature.Value@   Succeed if Feature is set to Value. Otherwise fail.

@R.Feature@   Succeed if Feature has been set to some value. Otherwise fail.

More Actions: Equality

@E.Feature1.Feature2@   Succeed if Feature1 has the same value as Feature2. Otherwise fail.
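The actions listed on these slides can be read as a small interpreter that walks one path and keeps a table of feature settings. The Python sketch below is an illustration under stated assumptions, not xfst's implementation: `apply_flag` and `walk` are hypothetical names, a negatively set feature is stored as `("not", value)`, a feature set by @N counts as "set" for @R.Feature@, and two unspecified features are treated as equal by @E.

```python
import re

FLAG = re.compile(r"@([UPNCRE])\.([^.@]+)(?:\.([^.@]+))?@")
TOKEN = re.compile(r"@[UPNCRE]\.[^@]+@|.", re.DOTALL)


def apply_flag(flag, state):
    """Apply one flag diacritic to `state` in place; return False if it fails."""
    op, feature, arg = FLAG.fullmatch(flag).groups()
    current = state.get(feature)  # None means "unspecified"
    if op == "U":
        # Unspecified, or set to the complement of some other value: unify.
        if current is None or (current[0] == "not" and current[1] != arg):
            state[feature] = ("is", arg)
            return True
        return current == ("is", arg)
    if op == "P":
        state[feature] = ("is", arg)
        return True
    if op == "N":
        state[feature] = ("not", arg)
        return True
    if op == "C":
        state.pop(feature, None)
        return True
    if op == "R":
        return current is not None if arg is None else current == ("is", arg)
    # op == "E": here `arg` names the second feature (assumed semantics).
    return current == state.get(arg)


def walk(path):
    """Return the path with flags removed, or None if any flag check fails."""
    state, output = {}, []
    for token in TOKEN.findall(path):
        if FLAG.fullmatch(token):
            if not apply_flag(token, state):
                return None  # the path is abandoned
        else:
            output.append(token)
    return "".join(output)


print(walk("@U.MF.Yes@gehund@U.MF.No@ino@U.MF.No@"))  # the blocked path
print(walk("hund@U.MF.No@ino"))
```

The first call reproduces the Esperanto case: MF is unified with Yes at the prefix, so the later unification with No fails and the path is dropped.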

Eliminating flags

The constraints on "@U.FEATURE.VALUE@" have the form

~[?* PROHIBIT_FLAGS ~$[ALLOW_FLAGS] SELF ?*]

Constraint for eliminating @U.MF.No@:

~[?* ["@U.MF.Yes@"] # prohibit

~$["@P.MF.No@" | "@C.MF@"] # allow

"@U.MF.No@"

?*]
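The shape of this constraint is easier to see as a scan over one path. The following Python sketch is only an illustration of the pattern `~[?* PROHIBIT_FLAGS ~$[ALLOW_FLAGS] SELF ?*]`; the function name `violates` and the token-list representation are assumptions, and the flag sets are the ones shown on the slide.

```python
def violates(tokens, prohibit, allow, self_flag):
    """True if a prohibiting flag is followed by `self_flag`
    with no allowing flag in between."""
    armed = False  # a prohibiting flag was seen since the last allowing flag
    for token in tokens:
        if token in prohibit:
            armed = True
        elif token in allow:
            armed = False
        elif token == self_flag and armed:
            return True
    return False


PROHIBIT = {"@U.MF.Yes@"}
ALLOW = {"@P.MF.No@", "@C.MF@"}
SELF = "@U.MF.No@"

blocked = ["@U.MF.Yes@", "g", "e", "h", "u", "n", "d", "@U.MF.No@", "i", "n", "o"]
print(violates(blocked, PROHIBIT, ALLOW, SELF))
```

A path is kept only when `violates` is false, which matches what the runtime check would have decided for the same flags.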

Finnish Numerals

Numbers and Numerals

The mapping from integers 0, 1, 2, 3 … to the corresponding numerals one, two, three… is a regular relation.

Some languages have a very simple numeral system, some are more complicated: seventy-three, soixante-treize, drei-und-siebzig

We can compile transducers that map between the numbers and the corresponding numerals.

Number-to-Numeral transducer

Generation

105

hundred five

hundred and five

one hundred and five

Analysis

hundred five

105

The Goal Ahead: Finnish

Analysis

sadanviiden

105+Sg+Gen

hundred and five (Sg Gen)

Generation

28+Ord+Pl+Gen

kahdensienkymmenensienkahdeksansien

twenty-eighth (Pl Gen)

Finnish Numerals

Compound numerals written as one word
2 • 1000 + 5 • 100 + 3 • 10 + 1 = 2531
kaksituhattaviisisataakolmekymmentäyksi

Express ordinality, number, and case
sata+Sg+Nom (100): sata          sata+Ord+Sg+Nom (100th): sadas
sata+Sg+Gen (100): sadan         sata+Ord+Sg+Gen (100th): sadannen
sata+Pl+Gen (100): satojen       sata+Ord+Pl+Gen (100th): sadansien

Singular vs. Plural

Numerals generally occur with singular nouns
kaksi+Sg+Gen kenkä+Sg+Gen
kahden kengän omistaja
(owner of two shoes)

Sets and public events may be in plural
kaksi+Pl+Gen kenkä+Pl+Gen
kaksien kenkien omistaja
(owner of two pairs of shoes)

kolme+Ord+Pl+Nom olympialainen+Pl+Nom
kolmannet olympialaiset
(third olympic games)

yksi+Pl+Nom hää+Pl+Nom
yhdet häät
(one wedding)

Morphotactics

All parts of compound numerals agree in all respects

two thousand five hundred (2500)
kaksi+Sg+Gen tuhat+Sg+Gen viisi+Sg+Gen sata+Sg+Gen
kahden tuhannen viiden sadan

two ten eighth (28th)
kaksi+Ord+Pl+Gen kymmenen+Ord+Pl+Gen kahdeksan+Ord+Pl+Gen
kahde ns i en kymmene ns i en kahdeksa ns i en

Singular nominative is exceptional

Numeral with a noun
kaksi+Gen kenkä+Gen
kahden kengän (two shoes)

kaksi+Nom kenkä+Part
kaksi kenkää (two shoes)

Compound numeral
kaksi+Gen tuhat+Gen viisi+Gen sata+Gen kolme+Gen (2503)
kahden tuhannen viiden sadan kolmen

kaksi+Nom tuhat+Part viisi+Nom sata+Part kolme+Nom (2503)
(kaksi • tuhatta) + (viisi • sataa) + kolme
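The nominative pattern above (multiplier in the nominative, multiplicand in the partitive) is enough to spell out singular nominative cardinals directly. The Python sketch below is an illustration rather than the course's lexc solution: `finnish_cardinal` is a hypothetical helper, it covers 1 to 9999 only, and the forms for 11 to 19 (unit + "toista") are an addition not shown on the slides.

```python
UNITS = ["", "yksi", "kaksi", "kolme", "neljä", "viisi",
         "kuusi", "seitsemän", "kahdeksan", "yhdeksän"]

# (value, nominative, partitive) of each multiplicand
POWERS = [(1000, "tuhat", "tuhatta"),
          (100, "sata", "sataa"),
          (10, "kymmenen", "kymmentä")]


def finnish_cardinal(n: int) -> str:
    """Spell a cardinal in the singular nominative, written as one word."""
    if not 1 <= n <= 9999:
        raise ValueError("this sketch only covers 1..9999")
    parts = []
    for value, nominative, partitive in POWERS:
        count, n = divmod(n, value)
        if value == 10 and count == 1 and n > 0:
            parts.append(UNITS[n] + "toista")  # 11..19, not covered on the slides
            n = 0
        elif count == 1:
            parts.append(nominative)           # bare "sata", "tuhat", "kymmenen"
        elif count > 1:
            parts.append(UNITS[count] + partitive)  # nominative unit + partitive
    if n:
        parts.append(UNITS[n])
    return "".join(parts)


print(finnish_cardinal(2531))
print(finnish_cardinal(2503))
```

In every other case and number all parts agree, so this shortcut works only for the exceptional singular nominative.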

Morphological Alternations

Semiregular stem alternations
yksi+Sg+Nom : yksi (one)
yksi+Sg+Ess : yhtenä
yksi+Sg+Gen : yhden
yksi+Sg+Part : yhtä
yksi+Pl+Gen : yksien

Irregular stem alternations
yksi+Ord+Sg+Nom : ensimmäinen (first)

Regular suffix alternations

Vowel harmony
kolme+Sg+Part : kolmea vs. neljä+Sg+Part : neljää

Illative vowel
kolme+Sg+Ill : kolmeen vs. neljä+Sg+Ill : neljään

Partitive t
yksi+Sg+Part : yhtä vs. neljä+Sg+Part : neljää

Solution for Finnish

Maps a number with morphological tags into an inflected Finnish numeral. Encodes morphotactic constraints.

Numbers/Finnish Transducer

lexc source lexicon

.o.

Looping lexicon with all the forms of all Finnish single numerals concatenated in all possible ways. Composed with morphophonological rules.

Example

Numbers/Finnish Transducer

2 5 +Ord +Pl +Gen
kaksi +Ord +Pl +Gen kymmenen +Ord +Pl +Gen viisi +Ord +Pl +Gen

lexc source lexicon

.o.

kaksi +Pl +Nom kymmenen +Part viisi +Ord +Gen
kahdet kymmentä viidennen (ungrammatical)

kaksi +Ord +Pl +Gen kymmenen +Ord +Pl +Gen viisi +Ord +Pl +Gen
kahdensien kymmenensien viidensien

Sublexicon for One

LEXICON Yksi
YKSI+Sg:yksi Nom;                     ! singular nominative
YKSI+Sg:yhde WeakGrade;               ! weak stem (most cases)
YKSI+Sg:yhte StrongGrade;             ! strong stem (essive, ill.)
YKSI+Sg:yht Par;                      ! partitive stem
YKSI:yks PlStem1;                     ! plural stem
YKSI+Ord1+Sg:ensimmäinen Nom;         ! singular nominative
YKSI+Ord1+Sg:ensimmäise AnyGrade;     ! weak/strong stem
YKSI+Ord1+Sg:ensimmäis Par;           ! partitive stem
YKSI+Ord+Sg:yhdes Nom;                ! singular nominative
YKSI+Ord+Sg:yhdenne WeakGrade;        ! weak stem
YKSI+Ord+Sg:yhdente StrongGrade;      ! strong stem
YKSI+Ord+Sg:yhdet Par;                ! partitive stem
YKSI+Ord:yhdens PlStem1;              ! plural stem

Some sublexicons

LEXICON WeakGrade

SgGen; ! Singular Genitive

PlNom; ! Plural Nominative

InvarWeak; ! Invariant (plural and singular) cases

LEXICON InvarWeak

+Tra:ksi Next; ! Translative “into”

+Ine:ssA Next; ! Inessive “in”

+Ela:ltA Next; ! Elative “from” (inside)

+Ade:llA Next; ! Adessive “on”

+Abl:ltA Next; ! Ablative “from” (outside)

+All:lle Next; ! Allative “onto”

+Abe:ttA Next; ! Abessive “without”

Sample paths for Two

kaksi+Sg+Nom : kaksi
kaksi+Sg+Gen : kahde n
kaksi+Sg+Ess : kahte na

kaksi+Sg+Par : kah TA
kaksi+Pl+Gen : kaks i en
kaksi+Pl+Ill : kaks i Vn

kaksi+Ord+Sg+Nom : kahde s
kaksi+Ord1+Sg+Nom : toinen

kaksi+Ord+Sg+Ill : kahde nte Vn
kaksi+Ord1+Sg+Ill : toise Vn

Morphophonological rules

define BackV [a | o | u];
define FrontV [ä | ö | y];
define Vow [BackV | FrontV | i | e];

define VHarmony [A -> a || BackV ~$[FrontV] _
                 .o.
                 A -> ä];

define IllativeV [V -> a || a (h) _ ,
                  V -> e || e (h) _ , … ]

define PartitiveT [T -> 0 || \Vow Vow _ ];
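To see how the three rules act on intermediate forms such as "kah TA" or "toise Vn", they can be imitated as string functions. This is a hedged Python sketch, not the xfst rules themselves: the order of application, the realization of a surviving T as t, and the extension of IllativeV to all vowels (the slide elides those cases) are assumptions, and the inputs "kolmeTA" and "neljäTA" are hypothetical lexical forms built by analogy with "kah TA".

```python
BACK, FRONT = set("aou"), set("äöy")
VOWELS = BACK | FRONT | set("ie")


def partitive_t(s):
    """T -> 0 || \\Vow Vow _ ; a surviving T is realized as t (assumption)."""
    out = []
    for i, ch in enumerate(s):
        if ch != "T":
            out.append(ch)
        elif not (i >= 2 and s[i - 1] in VOWELS and s[i - 2] not in VOWELS):
            out.append("t")
    return "".join(out)


def illative_v(s):
    """V copies the preceding vowel, skipping an optional h (assumed for all vowels)."""
    out = []
    for i, ch in enumerate(s):
        if ch == "V":
            j = i - 2 if i >= 1 and s[i - 1] == "h" else i - 1
            ch = s[j] if j >= 0 and s[j] in VOWELS else "V"
        out.append(ch)
    return "".join(out)


def vowel_harmony(s):
    """A -> a if the nearest harmonic vowel to the left is back, else A -> ä."""
    out = []
    for i, ch in enumerate(s):
        if ch == "A":
            nearest = next((c for c in reversed(s[:i]) if c in BACK or c in FRONT), None)
            ch = "a" if nearest in BACK else "ä"
        out.append(ch)
    return "".join(out)


def realize(s):
    return vowel_harmony(illative_v(partitive_t(s)))


for form in ["kahTA", "kolmeTA", "neljäTA", "kolmeVn", "neljäVn"]:
    print(form, realize(form))
```

Neutral i and e are skipped when looking for the nearest harmonic vowel, which is why "kolme" still takes the back variant.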

Example again

Numbers/Finnish Transducer

2 5 +Ord +Pl +Gen
KAKSI +Ord +Pl +Gen KYMMENEN +Ord +Pl +Gen VIISI +Ord +Pl +Gen

lexc source lexicon

.o.

morpho-phonological rules

.o.

KAKSI +Pl +Nom KYMMENEN +Part VIISI +Ord +Gen (ungrammatical)
kahdet kymmentä viidennen

KAKSI +Ord +Pl +Gen KYMMENEN +Ord +Pl +Gen VIISI +Ord +Pl +Gen
kahdensien kymmenensien viidensien

Remaining problems

Special ordinals for yksi (one), kaksi (two)
ensimmäinen (1st) vs. kahdeskymmenesyhdes (21st)
Compose the lexicon with an appropriate filter to eliminate unwanted variants.

No internal tags
2+Sg+Gen00+Sg+Gen
Delete them: 0 <- Tag || _ $[\Tag Tag+] .#. ;

Singular nominative as partitive in compounds
%+Nom -> %+Par // %+Sg %+Nom ~$Tag %+Sg _ ;

Ordinal/Plural/Case agreement
Flag diacritics!
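The tag-deletion rule for internal tags can be checked by hand with a few lines of code. This Python sketch only shows which tags the context `_ $[\Tag Tag+] .#.` picks out; it assumes, for illustration, that a tag is any symbol starting with "+", and it ignores the direction of the `<-` operator.

```python
def is_tag(symbol):
    # Assumption for this sketch: tags are exactly the symbols starting with "+".
    return symbol.startswith("+")


def delete_internal_tags(symbols):
    """Drop every tag that has, somewhere to its right, a non-tag
    immediately followed by a tag; the final run of tags survives."""
    kept = []
    for i, symbol in enumerate(symbols):
        rest = symbols[i + 1:]
        more_tags_later = any(
            not is_tag(a) and is_tag(b) for a, b in zip(rest, rest[1:])
        )
        if is_tag(symbol) and more_tags_later:
            continue
        kept.append(symbol)
    return kept


print(delete_internal_tags(["2", "+Sg", "+Gen", "0", "0", "+Sg", "+Gen"]))
```

Applied to the example on the slide, only the last +Sg +Gen remains, giving 200+Sg+Gen.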

Flags for Finnish numerals

@U.Type.Card@ @U.Type.Ord@

@U.Number.Sg@ @U.Number.Pl@

@U.Case.Nom@ @U.Case.Gen@ @U.Case.Par@ @U.Case.Tra@

@U.Case.Ess@ @U.Case.Abe@ @U.Case.Ine@ @U.Case.Ela@

@U.Case.Ill@ @U.Case.Ade@ @U.Case.Abl@ @U.Case.All@

@U.Case.Com@ @U.Case.Ins@

3 00 +Sg +Gen
@U.Type.Card@ @U.Num.Sg@ @U.Case.Gen@ @U.Type.Card@ @U.Num.Sg@ @U.Case.Gen@
kolmen sadan

300+Sg+Gen
kolmensadan
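Agreement through unification flags amounts to merging the features of every part and giving up at the first clash. The Python sketch below illustrates that idea with plain dictionaries; `agree` is a hypothetical helper, and the feature names Type, Number and Case simply follow the flag list above.

```python
def agree(parts):
    """Unify the features of all parts; return the merged features,
    or None as soon as two parts disagree on a feature."""
    merged = {}
    for features in parts:
        for name, value in features.items():
            # setdefault keeps the first value seen, like a successful @U.
            if merged.setdefault(name, value) != value:
                return None
    return merged


ordinal_pl_gen = {"Type": "Ord", "Number": "Pl", "Case": "Gen"}
print(agree([ordinal_pl_gen, ordinal_pl_gen, ordinal_pl_gen]))

# The ungrammatical mix kaksi +Pl +Nom, kymmenen +Part, viisi +Ord +Gen
print(agree([{"Number": "Pl", "Case": "Nom"},
             {"Case": "Par"},
             {"Type": "Ord", "Case": "Gen"}]))
```

Because each part carries the same three flags, one mismatch anywhere in the compound is enough to block the whole path.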

Conclusion

Mapping from numbers to numerals can be done in a simple and elegant way even for languages with complex morphology.

Necessary for text-to-speech applications.

Tervetuloa kahdensienkymmenensienkahdeksansien olympialaisten avajaisiin!

Welcome to the opening ceremonies of the 28th Olympic Games!

Demo!
