
Page 1: Omnivorous MT

Building NLP Systems for Two Resource Scarce Indigenous Languages: Mapudungun and Quechua, and some other languages

Christian Monson, Ariadna Font Llitjós, Roberto Aranovich, Lori Levin, Ralf Brown, Erik Peterson, Jaime Carbonell, and Alon Lavie

Page 2: Omnivorous MT

Omnivorous MT

• Eat whatever resources are available

• Eat large or small amounts of data

Mapusaurus Roseae
Mapu = land
Mapuche = land people
Mapudungun = land speech

Page 3: Omnivorous MT

AVENUE’s Inventory

• Resources
– Parallel corpus
– Monolingual corpus
– Lexicon
– Morphological Analyzer (lemmatizer)
– Human Linguist
– Human non-linguist

• Techniques
– Rule based transfer system
– Example Based MT
– Morphology Learning
– Rule Learning
– Interactive Rule Refinement
– Multi-Engine MT

This research was funded in part by NSF grant number IIS-0121-631.

Page 4: Omnivorous MT

Startup without corpus or linguist

Requires someone who is bilingual and literate

Page 5: Omnivorous MT

The Elicitation Tool has been used with these languages

• Mapudungun
• Hindi
• Hebrew
• Quechua
• Aymara
• Thai
• Japanese
• Chinese
• Dutch
• Arabic

Page 6: Omnivorous MT

Purpose of Elicitation

• Provide a small but highly targeted corpus of hand aligned data
– To support machine learning from a small data set
– To discover basic word order
– To discover how syntactic dependencies are expressed
– To discover which grammatical meanings are reflected in the morphology or syntax of the language

srcsent: Tú caíste
tgtsent: eymi ütrünagimi
aligned: ((1,1),(2,2))
context: tú = Juan [masculino, 2a persona del singular]
comment: You (John) fell

srcsent: Tú estás cayendo
tgtsent: eymi petu ütünagimi
aligned: ((1,1),(2 3,2 3))
context: tú = Juan [masculino, 2a persona del singular]
comment: You (John) are falling

srcsent: Tú caíste
tgtsent: eymi ütrunagimi
aligned: ((1,1),(2,2))
context: tú = María [femenino, 2a persona del singular]
comment: You (Mary) fell
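
The aligned field in these records pairs source word positions with target word positions. A minimal Python sketch of reading it, assuming the format seen in the three records (1-based positions, space-separated groups for multi-word alignments). The function names parse_alignment and aligned_words are hypothetical and are not part of the Elicitation Tool:

```python
import re

def parse_alignment(aligned: str) -> list[tuple[list[int], list[int]]]:
    """Parse e.g. '((1,1),(2 3,2 3))' into [([1], [1]), ([2, 3], [2, 3])]."""
    pairs = []
    # Each inner pair looks like "(source positions,target positions)".
    for src, tgt in re.findall(r"\(([\d ]+),([\d ]+)\)", aligned):
        pairs.append(([int(n) for n in src.split()],
                      [int(n) for n in tgt.split()]))
    return pairs

def aligned_words(srcsent: str, tgtsent: str, aligned: str) -> list[tuple[str, str]]:
    """Return the (source words, target words) named by each alignment pair."""
    src_words, tgt_words = srcsent.split(), tgtsent.split()
    return [(" ".join(src_words[i - 1] for i in s),
             " ".join(tgt_words[j - 1] for j in t))
            for s, t in parse_alignment(aligned)]

# Second record from the slide: a two-word phrase aligned to a two-word phrase.
print(aligned_words("Tú estás cayendo", "eymi petu ütünagimi", "((1,1),(2 3,2 3))"))
```

Grouping positions this way keeps many-to-many alignments such as (2 3,2 3) as single units, which is what a learner needs when a progressive is expressed by two words on each side.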

Page 7: Omnivorous MT

Feature Structures

srcsent: Mary was not a leader.
context: Translate this as though it were spoken to a peer co-worker;

((actor ((np-function fn-actor)
         (np-animacy anim-human)
         (np-biological-gender bio-gender-female)
         (np-general-type proper-noun-type)
         (np-identifiability identifiable)
         (np-specificity specific)…))
 (pred ((np-function fn-predicate-nominal)
        (np-animacy anim-human)
        (np-biological-gender bio-gender-female)
        (np-general-type common-noun-type)
        (np-specificity specificity-neutral)…))
 (c-v-lexical-aspect state)
 (c-copula-type copula-role)
 (c-secondary-type secondary-copula)
 (c-solidarity solidarity-neutral)
 (c-v-grammatical-aspect gram-aspect-neutral)
 (c-v-absolute-tense past)
 (c-v-phase-aspect phase-aspect-neutral)
 (c-general-type declarative-clause)
 (c-polarity polarity-negative)
 (c-my-causer-intentionality intentionality-n/a)
 (c-comparison-type comparison-n/a)
 (c-relative-tense relative-n/a)
 (c-our-boundary boundary-n/a)…)

Page 8: Omnivorous MT

Current Work

• Search space:
– Elements of meanings that might be expressed by syntax or morphology: tense, aspect, person, number, gender, causation, evidentiality, etc.
– Syntactic dependencies: subject, object
– Interactions of features:
  • Tense and person
  • Tense and interrogative mood
  • Etc.

Page 9: Omnivorous MT

Current Work

• For a new language
– For each item of the search space
  • Eliminate it as irrelevant or
  • Explore it
– Using as few sentences as possible

Page 10: Omnivorous MT

Mar 1, 2006

Tools for Creating Elicitation Corpora

[Diagram; components shown:]
• List of semantic features and values
• Feature Specification (XML Schema, XSLT Script)
• Feature Maps: which combinations of features and values are of interest (Clause-Level, Noun-Phrase, Tense & Aspect, Modality, …)
• Feature Structure Sets
• Reverse Annotated Feature Structure Sets: add English sentences
• The Corpus
• Sampling
• Smaller Corpus

Page 11: Omnivorous MT

Tools for Creating Elicitation Corpora

[Diagram repeated from the previous slide, here labeled: Combination Formalism]

Page 12: Omnivorous MT

Tools for Creating Elicitation Corpora

[Diagram repeated from the previous slides, here labeled: Feature Structure Viewer]


Page 14: Omnivorous MT

Outline

• Two ideas
– Omnivorous MT
– Startup for low resource situation

• Four Languages
– Mapudungun
– Quechua
– Hindi
– Hebrew

Page 15: Omnivorous MT

The Avenue Low Resource Scenario

[Architecture diagram; components grouped by stage:]
• Elicitation: Elicitation Tool, Elicitation Corpus, Word-Aligned Parallel Corpus
• Morphology: Morphology Analyzer, Learning Module
• Rule Learning: Learning Module, Learned Transfer Rules, Handcrafted rules
• Run-Time System: INPUT TEXT, Lexical Resources, Run Time Transfer System, Decoder, OUTPUT TEXT
• Rule Refinement: Translation Correction Tool, Rule Refinement Module


Page 19: Omnivorous MT

Mapudungun Language

• 900,000 Mapuche people
• At least 300,000 speakers of Mapudungun
• Polysynthetic

sl: pe-rke-fi-ñ Maria
    ver-REPORT-3pO-1pSgS/IND
tl: DICEN QUE LA VI A MARÍA
    (They say that) I saw Maria.

Page 20: Omnivorous MT

AVENUE Mapudungun

• Joint project between Carnegie Mellon University, the Chilean Ministry of Education, and Universidad de la Frontera.

Page 21: Omnivorous MT

Mapudungun to Spanish Resources

• Initially:
– Large team of native speakers at Universidad de la Frontera, Temuco, Chile
  • Some knowledge of linguistics
  • No knowledge of computational linguistics
– No corpus
– A few short word lists
– No morphological analyzer

• Later: Computational Linguists with non-native knowledge of Mapudungun

• Other considerations:
– Produce something that is useful to the community, especially for bilingual education
– Experimental MT systems are not useful

Page 22: Omnivorous MT

Mapudungun

[The Avenue Low Resource Scenario architecture diagram, with these Mapudungun resources added:]
• Corpus: 170 hours of spoken Mapudungun
• Example Based MT
• Spelling checker
• Spanish Morphology from UPC, Barcelona

Page 23: Omnivorous MT

Mapudungun Products

• http://www.lenguasamerindias.org/
– Click: traductor mapudungún
– Dictionary lookup (Mapudungun to Spanish)
– Morphological analysis
– Example Based MT (Mapudungun to Spanish)

Page 24: Omnivorous MT

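I Didn’t see Maria

[Parse-tree diagram. Mapudungun side: S → VP; V “pe” followed by the suffixes “la”, “fi”, “ñ” (each a VSuff, nested in VSuffG groups); NP → N “Maria”. Spanish side: S → VP → “no” V “vi” “a” NP → N “María”.]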

Page 25: Omnivorous MT

Transfer to Spanish: Top-Down

[Parse-tree diagram. Mapudungun side: V “pe” followed by the suffixes “la”, “fi”, “ñ” (VSuff, nested in VSuffG groups); NP → N “Maria”. Spanish side: S → VP → V “a” NP.]

VP::VP [VBar NP] -> [VBar "a" NP]
( (X1::Y1)
  (X2::Y3)
  ((X2 type) = (*NOT* personal))
  ((X2 human) =c +)
  (X0 = X1)
  ((X0 object) = X2)
  (Y0 = X0)
  ((Y0 object) = (X0 object))
  (Y1 = Y0)
  (Y3 = (Y0 object))
  ((Y1 objmarker person) = (Y3 person))
  ((Y1 objmarker number) = (Y3 number))
  ((Y1 objmarker gender) = (Y3 gender)))

Page 26: Omnivorous MT

AVENUE Hebrew

• Joint project of Carnegie Mellon University and University of Haifa

Page 27: Omnivorous MT

Hebrew Language

• Native language of about 3-4 million in Israel
• Semitic language, closely related to Arabic and with similar linguistic properties
– Root+Pattern word formation system
– Rich verb and noun morphology
– Particles attach as prefixes to the following word: definite article (H), prepositions (B,K,L,M), coordinating conjunction (W), relativizers ($,K$)…

• Unique alphabet and Writing System
– 22 letters represent (mostly) consonants
– Vowels represented (mostly) by diacritics
– Modern texts omit the diacritic vowels, thus an additional level of ambiguity: “bare” word → word
– Example: MHGR → mehager, m+hagar, m+h+ger
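
The MHGR example shows why unvocalized text is ambiguous: each leading letter may be a prefix particle or part of the stem. A minimal Python sketch that enumerates the candidate splits, assuming the particle inventory listed on this slide and an arbitrary minimum stem length of two letters. The function segmentations is hypothetical. A real analyzer, such as the Technion one mentioned later, would check each candidate against a lexicon, which this sketch does not do:

```python
# Prefix particles from the slide, in its Latin transliteration.
PARTICLES = ["K$", "H", "B", "K", "L", "M", "W", "$"]

def segmentations(word: str, min_stem: int = 2) -> list[str]:
    """List every way to peel prefix particles off a transliterated word."""
    results = [word]  # the reading with no prefix at all
    for particle in PARTICLES:
        rest = word[len(particle):]
        # Only split if something long enough to be a stem remains (assumption).
        if word.startswith(particle) and len(rest) >= min_stem:
            results += [particle + "+" + s for s in segmentations(rest, min_stem)]
    return results

print(segmentations("MHGR"))  # ['MHGR', 'M+HGR', 'M+H+GR']
```

The three candidates line up with the three readings on the slide (mehager, m+hagar, m+h+ger). On other words the same procedure overgenerates, which is the extra ambiguity the slide refers to.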

Page 28: Omnivorous MT

Hebrew Resources

• Morphological analyzer developed at Technion

• Constructed our own Hebrew-to-English lexicon, based primarily on existing “Dahan” H-to-E and E-to-H dictionary

• Human Computational Linguists

• Native Speakers

Page 29: Omnivorous MT

Hebrew

[The Avenue Low Resource Scenario architecture diagram, applied to Hebrew]

Page 30: Omnivorous MT

Flat Seed Rule Generation

Learning Example: NP

Eng: the big apple

Heb: ha-tapuax ha-gadol

Generated Seed Rule:

NP::NP [ART ADJ N] [ART N ART ADJ]

((X1::Y1)

(X1::Y3)

(X2::Y4)

(X3::Y2))
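
The seed rule above can be read off directly from the aligned example. A minimal Python sketch, assuming the part-of-speech sequences for both sides and the word alignment are already available. The function seed_rule and its argument layout are hypothetical and only mirror the notation on this slide:

```python
def seed_rule(cat, src_pos, tgt_pos, alignment):
    """Format a flat seed rule from POS sequences and a word alignment.

    alignment holds 1-based (source index, target index) pairs.
    """
    header = f"{cat}::{cat} [{' '.join(src_pos)}] [{' '.join(tgt_pos)}]"
    pairs = [f"(X{i}::Y{j})" for i, j in sorted(alignment)]
    return header + "\n(" + "\n ".join(pairs) + ")"

# "the big apple" / "ha-tapuax ha-gadol":
# the -> both occurrences of ha, big -> gadol, apple -> tapuax.
rule = seed_rule(
    "NP",
    ["ART", "ADJ", "N"],
    ["ART", "N", "ART", "ADJ"],
    [(1, 1), (1, 3), (2, 4), (3, 2)],
)
print(rule)
```

The rule is called flat because every word of the example maps to a part-of-speech slot directly, with no nested constituents yet.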

Page 31: Omnivorous MT

Compositionality Learning

Initial Flat Rules:

S::S [ART ADJ N V ART N] [ART N ART ADJ V P ART N]

((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2) (X4::Y5) (X5::Y7) (X6::Y8))

NP::NP [ART ADJ N] [ART N ART ADJ]

((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2))

NP::NP [ART N] [ART N]

((X1::Y1) (X2::Y2))

Generated Compositional Rule:

S::S [NP V NP] [NP V P NP]

((X1::Y1) (X2::Y2) (X3::Y4))
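
One way to picture the step from the flat S rule to the compositional one is to collapse each span that an already learned NP rule covers into a single NP slot and renumber the alignments. A minimal Python sketch under that assumption. The function collapse is hypothetical, the spans are supplied by hand, and it does not check that alignments stay inside the collapsed spans, which a real learner would have to verify:

```python
def collapse(src, tgt, align, s_span, t_span, label):
    """Replace 1-based inclusive spans on both sides with one constituent label."""
    (s_lo, s_hi), (t_lo, t_hi) = s_span, t_span

    def remap(i, lo, hi):
        if i < lo:
            return i              # before the span: unchanged
        if i <= hi:
            return lo             # inside the span: the new single slot
        return i - (hi - lo)      # after the span: shifted left

    new_src = src[:s_lo - 1] + [label] + src[s_hi:]
    new_tgt = tgt[:t_lo - 1] + [label] + tgt[t_hi:]
    new_align = sorted({(remap(x, s_lo, s_hi), remap(y, t_lo, t_hi)) for x, y in align})
    return new_src, new_tgt, new_align

src = ["ART", "ADJ", "N", "V", "ART", "N"]
tgt = ["ART", "N", "ART", "ADJ", "V", "P", "ART", "N"]
align = [(1, 1), (1, 3), (2, 4), (3, 2), (4, 5), (5, 7), (6, 8)]

# Collapse the right-hand NP first so the left-hand indices stay valid.
src, tgt, align = collapse(src, tgt, align, (5, 6), (7, 8), "NP")
src, tgt, align = collapse(src, tgt, align, (1, 3), (1, 4), "NP")
print(src, tgt, align)
```

Run on the flat S rule from this slide, the result is [NP V NP], [NP V P NP] with alignments (1,1), (2,2), (3,4), matching the generated compositional rule.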

Page 32: Omnivorous MT

Constraint Learning

Input: Rules and their Example Sets

S::S [NP V NP] [NP V P NP] {ex1,ex12,ex17,ex26}

((X1::Y1) (X2::Y2) (X3::Y4))

NP::NP [ART ADJ N] [ART N ART ADJ] {ex2,ex3,ex13}

((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2))

NP::NP [ART N] [ART N] {ex4,ex5,ex6,ex8,ex10,ex11}

((X1::Y1) (X2::Y2))

Output: Rules with Feature Constraints:

S::S [NP V NP] [NP V P NP]

((X1::Y1) (X2::Y2) (X3::Y4)

(X1 NUM = X2 NUM)

(Y1 NUM = Y2 NUM)

(X1 NUM = Y1 NUM))
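
The slide does not say how the constraints are chosen, so the following is only a simplified illustration of the idea: keep an agreement constraint if it holds in every example attached to the rule. The function learn_agreement, the candidate list, and the feature values are all hypothetical toy data, not taken from the AVENUE example sets:

```python
def learn_agreement(examples, candidates, feature="NUM"):
    """Keep each candidate pair whose feature values match in every example."""
    kept = []
    for a, b in candidates:
        if all(ex[a][feature] == ex[b][feature] for ex in examples):
            kept.append(f"({a} {feature} = {b} {feature})")
    return kept

# Hypothetical feature values for two examples of S::S [NP V NP] [NP V P NP].
examples = [
    {"X1": {"NUM": "sg"}, "X2": {"NUM": "sg"}, "X3": {"NUM": "pl"},
     "Y1": {"NUM": "sg"}, "Y2": {"NUM": "sg"}},
    {"X1": {"NUM": "pl"}, "X2": {"NUM": "pl"}, "X3": {"NUM": "pl"},
     "Y1": {"NUM": "pl"}, "Y2": {"NUM": "pl"}},
]
candidates = [("X1", "X2"), ("Y1", "Y2"), ("X1", "Y1"), ("X1", "X3")]
constraints = learn_agreement(examples, candidates)
print(constraints)
```

With this toy data the subject-object pair (X1, X3) is rejected because the first example has a singular subject and a plural object, while the three constraints shown on the slide survive.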

Page 33: Omnivorous MT

Quechua facts

• Agglutinative language

• A stem can often have 10 to 12 suffixes, but it can have up to 28 suffixes

• Supposedly clear cut boundaries, but in reality several suffixes change when followed by certain other suffixes

• No irregular verbs, nouns or adjectives

• Does not mark for gender

• No adjective agreement

• No definite or indefinite articles (‘topic’ and ‘focus’ markers perform a task similar to that of articles and intonation in English or Spanish)

Page 34: Omnivorous MT

Quechua examples

– taki+ni (also written takiniy)
  sing 1sg
  (I sing) canto

– taki+sha+ni (takishaniy)
  sing progr 1sg
  (I am singing) estoy cantando

– taki+pa+ku+q+chu?
  taki: sing
  -pa+ku: to join a group to do something
  -q: agentive
  -chu: interrogative
  (para) cantar con la gente (del pueblo)? (to sing with the people (of the village)?)
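
The examples above are already segmented with "+", so a gloss line can be produced by table lookup. A minimal Python sketch using only the morphemes and glosses given on this slide. The function gloss and the two tables are hypothetical, and a real analyzer would also have to find the morpheme boundaries itself:

```python
# Glosses as given on the slide.
GLOSSES = {"taki": "sing", "ni": "1sg", "sha": "progr",
           "q": "agentive", "chu": "interrogative"}
# Suffix sequences that the slide glosses as one unit.
COMPOUNDS = {("pa", "ku"): "to join a group to do something"}

def gloss(word: str) -> list[tuple[str, str]]:
    """Pair each morpheme of a '+'-segmented word with its gloss."""
    morphemes = word.rstrip("?").split("+")
    out, i = [], 0
    while i < len(morphemes):
        pair = tuple(morphemes[i:i + 2])
        if pair in COMPOUNDS:
            out.append(("+".join(pair), COMPOUNDS[pair]))
            i += 2
        else:
            # Unknown morphemes are marked with "?" rather than guessed.
            out.append((morphemes[i], GLOSSES.get(morphemes[i], "?")))
            i += 1
    return out

print(gloss("taki+sha+ni"))
print(gloss("taki+pa+ku+q+chu?"))
```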

Page 35: Omnivorous MT

Quechua Resources

• A few native speakers, not linguists

• A computational linguist learning Quechua

• Two fluent, but non-native linguists

Page 36: Omnivorous MT

Quechua

[The Avenue Low Resource Scenario architecture diagram, with this Quechua resource added:]
• Parallel Corpus: OCR with correction

Page 37: Omnivorous MT

Grammar rules

; taki+sha+ni -> estoy cantando (I am singing)
{VBar,3}
VBar::VBar : [V VSuff VSuff] -> [V V]
( (X1::Y2)
  ((x0 person) = (x3 person))
  ((x0 number) = (x3 number))
  ((x2 mood) =c ger)
  ((y2 mood) = (x2 mood))
  ((y1 form) =c estar)
  ((y1 person) = (x3 person))
  ((y1 number) = (x3 number))
  ((y1 tense) = (x3 tense))
  ((x0 tense) = (x3 tense))
  ((y1 mood) = (x3 mood))
  ((x3 inflected) =c +)
  ((x0 inflected) = +))

[Diagram: feature structures passed to Spanish Morphology Generation]
lex = estar, person = 1, number = sg, tense = pres, mood = ind → estoy
lex = cantar, mood = ger → cantando

Page 38: Omnivorous MT

Hindi Resources

• Large statistical lexicon from the Linguistic Data Consortium (LDC)

• Parallel Corpus from LDC
• Morphological Analyzer-Generator from LDC
• Lots of native speakers
• Computational linguists with little or no knowledge of Hindi
• Experimented with the size of the parallel corpus
– Miserly and large scenarios

Page 39: Omnivorous MT

Hindi

[The Avenue Low Resource Scenario architecture diagram, with these Hindi resources and engines added:]
• 15,000 Noun Phrases from Penn TreeBank
• Parallel Corpus
• EBMT
• SMT

Supported by DARPA TIDES

Page 40: Omnivorous MT

Manual Transfer Rules: Example

; NP1 ke NP2 -> NP2 of NP1
; Ex: jIvana ke eka aXyAya
;     life of (one) chapter
; ==> a chapter of life
;
{NP,12}
NP::NP : [PP NP1] -> [NP1 PP]
( (X1::Y2)
  (X2::Y1)
; ((x2 lexwx) = 'kA')
)

{NP,13}
NP::NP : [NP1] -> [NP1]
( (X1::Y1))

{PP,12}
PP::PP : [NP Postp] -> [Prep NP]
( (X1::Y2)
  (X2::Y1))

[Tree diagrams. Hindi side: NP → PP NP1; PP → NP P, with NP → N1 → N “jIvana” and P “ke”; NP1 → Adj N “eka aXyAya”. English side: NP → NP1 PP; NP1 → Adj N “one chapter”; PP → P NP “of” N1 → N “life”.]
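
The effect of rules {NP,12} and {PP,12} on the example can be traced with a small tree walk. A minimal Python sketch, assuming trees are nested tuples, the four-word lexicon is taken from the glosses on this slide, and reordering is keyed on the child labels used in the tree diagram. The names LEXICON, REORDER and transfer are hypothetical. This is not the AVENUE transfer engine, which also handles the feature constraints:

```python
# Word translations taken from the glosses on the slide.
LEXICON = {"jIvana": "life", "ke": "of", "eka": "one", "aXyAya": "chapter"}

# (parent label, child labels) -> new order of the children (0-based).
REORDER = {
    ("NP", ("PP", "NP1")): [1, 0],   # {NP,12}: [PP NP1] -> [NP1 PP]
    ("PP", ("NP", "P")): [1, 0],     # {PP,12}: [NP Postp] -> [Prep NP]
}

def transfer(node):
    """Translate a source tree top-down into a list of target words."""
    if isinstance(node, str):
        return [LEXICON[node]]
    label, *children = node
    kinds = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    order = REORDER.get((label, kinds), range(len(children)))
    words = []
    for i in order:
        words += transfer(children[i])
    return words

hindi = ("NP",
         ("PP", ("NP", ("N", "jIvana")), ("P", "ke")),
         ("NP1", ("Adj", "eka"), ("N", "aXyAya")))
print(" ".join(transfer(hindi)))  # one chapter of life
```

Constituents with no matching entry, such as the single-child NP covered by {NP,13}, keep their source order.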

Page 41: Omnivorous MT

Hindi-English

Very miserly training data.

System                          BLEU    M-BLEU   NIST
EBMT                            0.058   0.165    4.22
SMT                             0.093   0.191    4.64
XFER (naïve) man grammar        0.055   0.177    4.46
XFER (strong) no grammar        0.109   0.224    5.29
XFER (strong) learned grammar   0.116   0.231    5.37
XFER (strong) man grammar       0.135   0.243    5.59
XFER+SMT                        0.136   0.243    5.65

• Seven combinations of components
• Strong decoder allows re-ordering
• Three automatic scoring metrics

Page 42: Omnivorous MT

Extra Slides


Page 44: Omnivorous MT

Feature Specification

• Defines Features and their values

• Sets default values for features

• Specifies feature requirements and restrictions

• Written in XML

Page 45: Omnivorous MT

Feature Specification

Feature: c-copula-type
(a copula is a verb like “be”; some languages do not have copulas)

Values

copula-n/a
Restrictions: 1. ~(c-secondary-type secondary-copula)
Notes:

copula-role
Restrictions: 1. (c-secondary-type secondary-copula)
Notes: 1. A role is something like a job or a function. "He is a teacher" "This is a vegetable peeler"

copula-identity
Restrictions: 1. (c-secondary-type secondary-copula)
Notes: 1. "Clark Kent is Superman" "Sam is the teacher"

copula-location
Restrictions: 1. (c-secondary-type secondary-copula)
Notes: 1. "The book is on the table" There is a long list of locative relations later in the feature specification.

copula-description
Restrictions: 1. (c-secondary-type secondary-copula)
Notes: 1. A description is an attribute. "The children are happy." "The books are long."
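
The restrictions in this entry can be read as a consistency check on a feature structure: every value except copula-n/a requires (c-secondary-type secondary-copula), and copula-n/a forbids it. A minimal Python sketch of that check, assuming a feature structure is a flat dictionary. The table and the function satisfies_restrictions are hypothetical; the actual specification is written in XML and is not reproduced here:

```python
# For each c-copula-type value: must (c-secondary-type secondary-copula) hold?
REQUIRES_SECONDARY_COPULA = {
    "copula-n/a": False,          # restriction is negated with "~"
    "copula-role": True,
    "copula-identity": True,
    "copula-location": True,
    "copula-description": True,
}

def satisfies_restrictions(fs: dict) -> bool:
    """Check a feature structure against the c-copula-type restrictions."""
    required = REQUIRES_SECONDARY_COPULA[fs["c-copula-type"]]
    present = fs.get("c-secondary-type") == "secondary-copula"
    return required == present

# Clause-level features from the "Mary was not a leader." example.
ok = satisfies_restrictions({"c-copula-type": "copula-role",
                             "c-secondary-type": "secondary-copula"})
bad = satisfies_restrictions({"c-copula-type": "copula-n/a",
                              "c-secondary-type": "secondary-copula"})
print(ok, bad)  # True False
```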

Page 46: Omnivorous MT

Feature Maps

• Some features interact in the grammar
– English –s reflects person and number of the subject and tense of the verb.
– In expressing the English present progressive tense, the auxiliary verb is in a different place in a question and a statement:
  • He is running.
  • Is he running?

• We need to check many, but not all combinations of features and values.

• Using unlimited feature combinations leads to an unmanageable number of sentences
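
The size argument can be made concrete by comparing the full cross product of feature values with the combinations covered by one feature map. A minimal Python sketch. The four features and their values are illustrative picks from names used in these slides, not the real feature specification, and full_product_size and feature_map are hypothetical helpers:

```python
from itertools import product
from math import prod

# Illustrative features and values only.
FEATURES = {
    "tense": ["past", "present", "future"],
    "polarity": ["positive", "negative"],
    "gram-aspect": ["perfective", "progressive", "habitual", "neutral"],
    "clause-type": ["declarative", "interrogative"],
}

def full_product_size(features: dict) -> int:
    """Number of sentences needed if every combination were elicited."""
    return prod(len(values) for values in features.values())

def feature_map(features: dict, names: list) -> list:
    """Enumerate combinations only for the features named in one map."""
    return [dict(zip(names, combo))
            for combo in product(*(features[n] for n in names))]

print(full_product_size(FEATURES))                          # 48
print(len(feature_map(FEATURES, ["tense", "clause-type"])))  # 6
```

Even with four small features the full product is eight times larger than the single tense and clause-type map, and the gap grows multiplicatively with every feature added.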

Page 47: Omnivorous MT
Page 48: Omnivorous MT

Evidentiality Map

[Feature map diagram; features and the values shown, with "|" separating branches:]
Lexical Aspect: activity-accomplishment
Assertiveness: Assertiveness-asserted, Assertiveness-neutral
Polarity: Polarity-positive, Polarity-negative
Source: Hearsay, quotative, inferred, assumption | Visual, Auditory, non-visual-or-auditory
Tense: Past | Present, Future | Past | Present
Gram. Aspect: Perfective, progressive, habitual, neutral | habitual, neutral, progressive | Perfective, progressive, habitual, neutral | habitual, neutral, progressive

Page 49: Omnivorous MT

Current Work

• Navigation
– Start: large search space of all possible feature combinations
– Finish: each feature has been eliminated as irrelevant or has been explored
– Goal: dynamically find the most efficient path through the search space for each language.

Page 50: Omnivorous MT

Current Work

• Feature Detection
– Which features have an effect on morphosyntax?
– What is the effect?
– Drives the Navigation process

Page 51: Omnivorous MT

Feature Detection: Spanish

The girl saw a red book.
((1,1)(2,2)(3,3)(4,4)(5,6)(6,5))
La niña vió un libro rojo

A girl saw a red book
((1,1)(2,2)(3,3)(4,4)(5,6)(6,5))
Una niña vió un libro rojo

I saw the red book
((1,1)(2,2)(3,3)(4,5)(5,4))
Yo vi el libro rojo

I saw a red book.
((1,1)(2,2)(3,3)(4,5)(5,4))
Yo vi un libro rojo

Feature: definiteness
Values: definite, indefinite
Function-of-*: subj, obj
Marked-on-head-of-*: no
Marked-on-dependent: yes
Marked-on-governor: no
Marked-on-other: no
Add/delete-word: no
Change-in-alignment: no
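
Two of the fields in this summary, Add/delete-word and Change-in-alignment, can be computed by comparing the two members of a minimal pair. A minimal Python sketch, assuming whitespace tokenization and alignments written as lists of 1-based pairs. The function compare_minimal_pair is hypothetical, and deciding whether a changed word is a head or a dependent would need a parse, which is left out:

```python
def compare_minimal_pair(tgt_a, align_a, tgt_b, align_b):
    """Compare the target sides of two sentences that differ in one feature."""
    words_a, words_b = tgt_a.split(), tgt_b.split()
    same_length = len(words_a) == len(words_b)
    # Word-by-word comparison only makes sense when no word was added or deleted.
    changed = ([i + 1 for i, (a, b) in enumerate(zip(words_a, words_b)) if a != b]
               if same_length else [])
    return {
        "add/delete-word": not same_length,
        "change-in-alignment": align_a != align_b,
        "changed-positions": changed,
    }

alignment = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 6), (6, 5)]
report = compare_minimal_pair("La niña vió un libro rojo", alignment,
                              "Una niña vió un libro rojo", alignment)
print(report)
```

For the Spanish subject pair the only difference is the word in position 1, the determiner, with no added word and no alignment change, which agrees with the summary on this slide.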

Page 52: Omnivorous MT

Feature Detection: Chinese

A girl saw a red book.

((1,2)(2,2)(3,3)(3,4)(4,5)(5,6)(5,7)(6,8))

有 一个 女人 看见 了 一本 红色 的 书 。

The girl saw a red book.

((1,1)(2,1)(3,3)(3,4)(4,5)(5,6)(6,7))

女人 看见 了 一本 红色的 书

Feature: definiteness

Values: definite, indefinite

Function-of-*: subject

Marked-on-head-of-*: no

Marked-on-dependent: no

Marked-on-governor: no

Add/delete-word: yes

Change-in-alignment: no

Page 53: Omnivorous MT

Feature Detection: Chinese

I saw the red book
((1, 3)(2, 4)(2, 5)(4, 1)(5, 2))
红色的 书, 我 看见 了

I saw a red book.
((1,1)(2,2)(2,3)(2, 4)(4,5)(5,6))
我 看见 了 一本 红色的 书 。

Feature: definiteness
Values: definite, indefinite
Function-of-*: object
Marked-on-head-of-*: no
Marked-on-dependent: no
Marked-on-governor: no
Add/delete-word: yes
Change-in-alignment: yes

Page 54: Omnivorous MT

Feature Detection: Hebrew

A girl saw a red book.
((2,1) (3,2)(5,4)(6,3))
ילדה ראתה ספר אדום

The girl saw a red book
((1,1)(2,1)(3,2)(5,4)(6,3))
הילדה ראתה ספר אדום

I saw a red book.
((2,1)(4,3)(5,2))
ראיתי ספר אדום

I saw the red book.
((2,1)(3,3)(3,4)(4,4)(5,3))
ראיתי את הספר האדום

Feature: definiteness
Values: definite, indefinite
Function-of-*: subj, obj
Marked-on-head-of-*: yes
Marked-on-dependent: yes
Marked-on-governor: no
Add-word: no
Change-in-alignment: no

Page 55: Omnivorous MT

Feature Detection Feeds into…

• Corpus Navigation: which minimal pairs to pursue next.
– Don’t pursue gender in Mapudungun
– Do pursue definiteness in Hebrew

• Morphology Learning:
– Morphological learner identifies the forms of the morphemes
– Feature detection identifies the functions

• Rule learning:
– Rule learner will have to learn a constraint for each morpho-syntactic marker that is discovered
  • E.g., Adjectives and nouns agree in gender, number, and definiteness in Hebrew.