Linguistically-motivated, statistically-driven induction of morphology. Erwin Chan, Dept. of Computer and Information Science, University of Pennsylvania

Post on 29-Dec-2015

Page 1: Linguistically-motivated, statistically-driven induction of morphology Erwin Chan Dept. of Computer and Information Science University of Pennsylvania

Linguistically-motivated, statistically-driven

induction of morphology

Erwin Chan

Dept. of Computer and Information Science

University of Pennsylvania

Page 2

Overview

• Problem: induction of morphology from unannotated text

• Main idea: knowledge of linguistic and statistical properties of morphology allows for a simple induction algorithm

• Develops ideas from previous work:
  – Goldsmith (2001)
  – Schone & Jurafsky (2000)
  – Yarowsky & Wicentowski (2000, 2004)

Page 3

Outline

1. Goals of morphology induction

2. Linguistic model of morphology

3. Statistical model of morphology

4. Induction algorithm

5. Conclusion, relevance to cognitive science

Page 4

Computational modeling of language acquisition

[Diagram: Raw corpus → Induction algorithm (“fully” unsupervised) → Linguistic knowledge]

Page 5

Desired properties of output

1. Analysis of input data
  – morphology, POS, parse

2. Generalize analysis
  – produce tool to apply to new data
  – morphological analyzer, POS tagger, parser

Page 6

Generalize morphological structure

• Word-specific morphological analysis
  dogs = dog + s
  cats = cat + s
  churches = church + es
  finches = finch + es

• Out-of-vocabulary words?

• Summarize phonological properties
  If ends in ch, add es, otherwise add s

Page 7

Morphophonological rules

• generative phonology, finite-state morphology

• Analysis: inflected → base form

• Generation: base form → inflected

• Rule specifies:
  – rewrite pattern
  – context of application

• N.PL rule ( $ is null suffix ):
  $ → es / ch _ #
  $ → s / _ #
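The N.PL rule above can be sketched in Python as an ordered list of (rewrite pattern, context) pairs. This is only an illustration under stated assumptions: the `Rule` class, the `generate` function, and the use of the empty string for the null suffix $ are hypothetical names and choices, not part of the talk.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A rewrite pattern plus its context of application (hypothetical layout)."""
    old: str      # material replaced at the end of the base ("" plays the role of $)
    new: str      # material written in its place
    context: str  # regex that the base form must match for the rule to apply


# The two N.PL cases from the slide, tried in order: specific context first.
N_PL_RULES = [
    Rule(old="", new="es", context=r"ch$"),
    Rule(old="", new="s", context=r"$"),
]


def generate(base, rules=N_PL_RULES):
    """Generation direction: base form -> inflected form."""
    for rule in rules:
        if base.endswith(rule.old) and re.search(rule.context, base):
            return base[: len(base) - len(rule.old)] + rule.new
    raise ValueError(f"no rule applies to {base!r}")


# "glitch" stands in for an out-of-vocabulary word.
print([generate(w) for w in ["dog", "church", "finch", "glitch"]])
```

Because the context is stored with the rewrite pattern, the same rule list covers words that were never seen during analysis, which is the generalization the previous slide asks for.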

Page 8

Towards induction of rules

• This presentation: from a corpus,
  – Select words to be base forms
  – Formulate rewrite patterns (transforms)

• Future: learn other rule components
  – context of application
  – POS categories (e.g. “Noun”)
  – fine-grained inflectional categories (Noun.PL)
  – allomorphs

Page 9

Outline

1. Goals of morphology induction

2. Linguistic model of morphology

3. Statistical model of morphology

4. Induction algorithm

5. Conclusion, relevance to cognitive science

Page 10

Linguistic model of morphology

• Model that generates inflectional morphological paradigms
  – Base forms
  – Transforms
  – Transform signatures

• Simplifying assumptions:
  – One inflectional property per word (not adequate for agglutinative languages, e.g. Finnish)
  – Omit derivational morphology

Page 11

Base-and-transforms model of morphological paradigms

• Apply transforms to base form to generate each inflection

[Diagram: Lexeme 1, Lexeme 2, Lexeme 3, each with its own base form]

Page 12

Base forms

• Same inflectional type across lexemes for a particular POS category
  – e.g. Nom.Sg for all nouns

• Representation in lexicon

• Surface form
  – not abstract, underlying

Page 13

Transforms

• Specifies conversion process between base and inflected forms

• Similar to a rule, but omits context of application

• Tuple of 2 regular expressions (X,Y)
  – X: replaced portion of base form
  – Y: replaced portion of inflected form

Page 14

Transform examples (for English)

Base form    Inflected form    Transform
eat          eating            ( $, ing )
time         times             ( $, s )
time         timing            ( e, ing )
hang         hung              ( *a*, *u* ) non-concat.
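A minimal sketch of applying the suffix transforms in these examples, assuming a transform is stored as a pair of plain strings and the null suffix $ is written as the empty string. The function name `apply_transform` is hypothetical, and the non-concatenative ( *a*, *u* ) case is deliberately left out because it needs pattern matching inside the word rather than at its end.

```python
NULL = ""  # stands in for the null suffix written as $ on the slides


def apply_transform(base, transform):
    """Rewrite the end of `base` with a suffix transform (X, Y).

    Returns None when the base form does not end in X.
    """
    x, y = transform
    if not base.endswith(x):
        return None
    return base[: len(base) - len(x)] + y


examples = [
    ("eat", (NULL, "ing")),
    ("time", (NULL, "s")),
    ("time", ("e", "ing")),
]
for base, transform in examples:
    print(base, "->", apply_transform(base, transform))
```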

Page 15

Transform signatures

• Summarizes the inflections of a set of words
  – set of base forms × set of transforms
  – each base form belongs to exactly one trans. signature

            Base forms        Transforms
t-sig #1    { time, save }    { ( $, s ) ( e, ing ) }
t-sig #2    { walk }          { ( $, s ) }

generates: time, times, timing, save, saves, saving, walk, walks
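The two signatures on this slide can be expanded mechanically. The sketch below assumes each signature is a pair (set of base forms, list of suffix transforms) with the null suffix $ written as the empty string; `T_SIGS` and `generate_words` are hypothetical names.

```python
# t-sig #1 and t-sig #2 from the slide; "" plays the role of the null suffix $.
T_SIGS = [
    ({"time", "save"}, [("", "s"), ("e", "ing")]),
    ({"walk"}, [("", "s")]),
]


def generate_words(t_sigs):
    """Return every base form plus every inflection its signature licenses."""
    words = set()
    for bases, transforms in t_sigs:
        for base in bases:
            words.add(base)
            for x, y in transforms:
                if base.endswith(x):
                    words.add(base[: len(base) - len(x)] + y)
    return words


print(sorted(generate_words(T_SIGS)))
```

The result is the same eight words listed on the slide, with each lexeme stored once as a single base form.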

Page 16

Comparison to stem-suffix signatures

• Stem-suffix signature (Goldsmith 2001, 2007)

          Stems                   Suffixes
sig #1    { time, save, walk }    { $, s }
sig #2    { tim, sav }            { ing }

• Compare lexical representations
  – stem-suffix sig: multiple stems for a lexeme
  – transform sig: one base form per lexeme

Page 17

Outline

1. Goals of morphology induction

2. Linguistic model of morphology

3. Statistical model of morphology

4. Induction algorithm

5. Conclusion, relevance to cognitive science

Page 18

Statistical model of morphology

• Need to show learnability of linguistic model

• Understand distribution of data: look for patterns that hold across languages

• Propose simple model of distribution of inflections

• Implications for linguistic model

Page 19

Examine annotated corpora

• Word representation: (lemma, infl. category)
  e.g. went = ( go, verb-past-tense )

• Collapse phonological sub-classes
  e.g. N.Masc.Sg → N.Sg
       N.Fem.Sg → N.Sg

Page 20

Spanish newswire verbs

[Figure: Log(freq) plotted over Lemma and Inflection; annotation: “Sparse data”]

Page 21

CHILDES adult Spanish verbs

[Figure: Log(freq) plotted over Inflection and Lemma]

Page 22

Dist. of inflectional categories

• (roughly) Zipfian

• Slovene nouns

• 3 inflections don’t occur at all

Inflection    # types    Inflection     # types
N.Nom.Sg      7950       N.Inst.Pl      1630
N.Gen.Sg      5967       N.Dat.Sg       1515
N.Acc.Sg      5157       N.Gen.Dual     876
N.Nom.Pl      4154       N.Nom.Dual     682
N.Gen.Pl      3900       N.Dat.Pl       626
N.Inst.Sg     3334       N.Acc.Dual     586
N.Loc.Sg      3252       N.Loc.Dual     160
N.Acc.Pl      2967       N.Inst.Dual    120
N.Loc.Pl      1848       N.Dat.Dual     14

Page 23

High type frequency of base form

• Most type-frequent inflection accords with intuitive notions of what inflection a base form should be
  – Slovene: A.Pos.Nom.Sg.Indef, N.Nom.Sg, V.Main.Ind.Pres.3.Sg
  – Swedish: A.Pos.Sg.Indef.Nom, N.Sg.Indef.Nom, V.Inf.Act
  – Spanish: A.Sg, N.Sg, V.Inf

Page 24

Multinomial distribution

• Urn-and-balls problem
  – Assume inflectional categories have constant prob.
  – Choose lexeme and number of words, then generate inflections according to their prob. dist.

• Let an inflection set be the inflectional types of the words generated for a particular lexeme

• What is the prob. dist. over inflection sets? Can calculate from multinomial

Page 25

Inflection sets and base forms

• If base form is usually most frequent, multinomial predicts:
  – Inflection set with base relatively high prob
  – Inflection set without base relatively low prob
  – If a rare inflection occurs, its base form is likely to occur
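These predictions can be checked on a toy case by enumerating every sequence of draws and summing the probability of each resulting inflection set. The category names and probabilities below are invented for illustration (a roughly Zipfian shape with the base most frequent); they are not taken from any corpus in the talk, and `inflection_set_distribution` is a hypothetical helper.

```python
from collections import defaultdict
from itertools import product
from math import prod


def inflection_set_distribution(probs, n):
    """Exact distribution over inflection sets for n independent draws."""
    dist = defaultdict(float)
    for seq in product(probs, repeat=n):
        # A sequence of draws contributes its probability to the set of
        # distinct categories it contains.
        dist[frozenset(seq)] += prod(probs[c] for c in seq)
    return dict(dist)


# Hypothetical category probabilities (assumption, not measured data).
PROBS = {"base": 0.5, "infl_a": 0.25, "infl_b": 0.15, "infl_c": 0.10}

dist = inflection_set_distribution(PROBS, n=4)
p_with_base = sum(p for s, p in dist.items() if "base" in s)
p_rare = sum(p for s, p in dist.items() if "infl_c" in s)
p_rare_and_base = sum(p for s, p in dist.items() if {"infl_c", "base"} <= s)

print("P(set contains base)        =", round(p_with_base, 4))
print("P(base | rare infl. occurs) =", round(p_rare_and_base / p_rare, 4))
```

With four tokens per lexeme, the probability that the set contains the base is 1 - 0.5^4 = 0.9375, and the base is still present in most sets that contain the rarest category.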

Page 26

Occurrence of base in infl sets

• Percentage of inflection sets of size >= 2 that contain most type-freq inflection

Adj Noun Verb

Slovene 64% 68% 80%

Greek 89% 83% 62%

Swedish 80% 84% 57%

Spanish 82%

Sp CHILDES 70%

Page 27

Implications for linguistic model

• Zipfian + multinomial distributions predict that data will exist in corpus to support rule learning
  – Prominence of base form
  – (base, inflected) exist even for rare inflections

Page 28

Outline

1. Goals of morphology induction

2. Linguistic model of morphology

3. Statistical model of morphology

4. Induction algorithm

5. Conclusion, relevance to cognitive science

Page 29

Overview of induction algorithm

• Learn transform signatures for portion of vocab
  – Select words to be base forms

• Construct increasingly complex data structures
  1. suffixes
  2. transforms
  3. transform signatures

• Ranking and filtering based on ling, stat models

Page 30

Additional simplifying assumptions

• Assume language is suffixing

• Not learning POS categories

Page 31

Step 1. Suffixes

• Find 50 most type-frequent suffixes

• Keep track of words that end in each suffix

ing: { beating, eating, cheating, etc. }

• Rank by number of types

Page 32

Most type-frequent suffixes (Brown)

Rank    Suffix    # types    Rank    Suffix    # types
1.      $         42596      41.     les       237
2.      s         10730      42.     ses       230
3.      e         4967       43.     et        224
4.      d         4800       44.     ck        223
5.      ed        3868       45.     ding      220
6.      y         3648       46.     ning      219
7.      n         3226       47.     ded       219
8.      g         3107       48.     ment      217
9.      ng        2951       49.     ngs       216
10.     ing       2869       50.     rd        211

Page 33

Step 2. Transforms

• For each pair of suffixes s1 and s2, construct 2 transforms: (s1,s2) and (s2,s1)
  – Don’t allow deletion: ( _ , $)

• Hypothesize base forms (next slide)

• Rank transforms by # of base forms

• Keep top 50

Page 34

Transform construction

[Diagram: s1 words and s2 words connected by relation (s1,s2); base forms for (s1,s2) marked among the s1 words]
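A sketch of Step 2 under one reading of the construction: a word ending in s1 counts as a hypothesized base form for (s1, s2) when swapping s1 for s2 yields a word that ends in s2. That reading, the function name `build_transforms`, and the toy input are assumptions; the null suffix $ is again the empty string.

```python
from itertools import permutations


def build_transforms(words_by_suffix, k=50):
    """Return the top-k transforms, each mapped to its hypothesized base forms."""
    base_forms = {}
    for s1, s2 in permutations(words_by_suffix, 2):
        if s2 == "":
            continue  # the slides disallow deletion transforms ( _ , $ )
        bases = set()
        for word in words_by_suffix[s1]:
            stem = word[: len(word) - len(s1)]
            if stem + s2 in words_by_suffix[s2]:
                bases.add(word)
        if bases:
            base_forms[(s1, s2)] = bases
    # Rank transforms by number of base forms; ties broken alphabetically.
    ranked = sorted(base_forms, key=lambda t: (-len(base_forms[t]), t))
    return {t: base_forms[t] for t in ranked[:k]}


# Toy suffix table of the kind produced in Step 1 (hypothetical data).
words_by_suffix = {
    "": {"walk", "walks", "walked", "talk", "talks", "talked"},
    "s": {"walks", "talks"},
    "ed": {"walked", "talked"},
}
transforms = build_transforms(words_by_suffix)
for transform, bases in transforms.items():
    print(transform, sorted(bases))
```

On this toy input, (s, ed) and (ed, s) come out with the same number of base forms, mirroring the symmetric pairs such as ( ing, ed ) and ( ed, ing ) in the Brown results.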

Page 35

Top transforms (Brown corpus)

Rank    Transform      # base forms    Rank    Transform     # base forms
1.      ( $, s )       5257            41.     ( on, ng )    229
2.      ( ing, ed )    1922            42.     ( ng, on )    229
3.      ( ed, ing )    1922            43.     ( $, r )      221
4.      ( $, 's )      1609            44.     ( ion, e )    216
5.      ( $, ed )      1481            45.     ( e, ion )    216
6.      ( $, ing )     1335            46.     ( y, e )      214
7.      ( $, ly )      1069            47.     ( e, y )      214
8.      ( $, d )       1041            48.     ( $, al )     213
9.      ( s, ed )      925             49.     ( y, ed )     212
10.     ( ed, s )      925             50.     ( ed, y )     212

Page 36

Step 3. Transform signatures

• Intersect base form sets of different transforms

[Diagram: two overlapping sets, the base forms for transform 1 ( $, s ) and the base forms for transform 2 ( $, ing ); the overlap holds the base forms in transforms 1 and 2. Result: 3 transform signatures]
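One way to realize the intersection step is to group base forms by the exact set of transforms they take part in, which places every base form in exactly one signature. Treating the diagram as this kind of partition is an assumption, and `build_signatures` and the toy data are hypothetical.

```python
from collections import defaultdict


def build_signatures(base_forms_by_transform):
    """Group base forms by the full set of transforms that apply to them."""
    transforms_by_base = defaultdict(set)
    for transform, bases in base_forms_by_transform.items():
        for base in bases:
            transforms_by_base[base].add(transform)
    signatures = defaultdict(set)
    for base, transforms in transforms_by_base.items():
        signatures[frozenset(transforms)].add(base)
    return dict(signatures)


# Toy base-form sets for the two transforms in the diagram (hypothetical data).
base_forms_by_transform = {
    ("", "s"): {"walk", "eat", "dog"},
    ("", "ing"): {"walk", "eat", "be"},
}
signatures = build_signatures(base_forms_by_transform)
for transforms, bases in signatures.items():
    print(sorted(transforms), sorted(bases))
```

Two overlapping transforms yield three signatures here: one for each transform alone and one for the base forms that take both.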

Page 37

Rank, filter transform signatures

• Rank by number of words

• Go down list and filter:

#4. ($,s) ($,ed) ($,ing)

#5. (s,ed) (s,ing)    Missing base form: transforms consisting of “derived” suffixes

Page 38

Filter transform signatures

• Remove redundant signatures (want a grammar of minimal size)

#1 ($,s)

#2 ($,'s)

#14 ($,s) ($,'s)    redundant: combination of #1 and #2
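The redundancy check might look like the sketch below. It assumes a signature is redundant when its transform set is exactly the union of higher-ranked, already-kept signatures that are proper subsets of it; the talk later mentions a greedy set-cover approximation, so this simple union test is a stand-in and not necessarily the method used. `remove_redundant` is a hypothetical name.

```python
def remove_redundant(ranked_signatures):
    """Walk the ranked list and drop signatures covered by earlier kept ones."""
    kept = []
    for sig in ranked_signatures:
        covered = set()
        for earlier in kept:
            if earlier < sig:  # proper subset of the candidate's transforms
                covered |= earlier
        if covered != sig:
            kept.append(sig)
    return kept


S = ("", "s")
POSS = ("", "'s")
ED = ("", "ed")
ING = ("", "ing")

ranked = [
    frozenset({S}),           # like #1 on the slide
    frozenset({POSS}),        # like #2
    frozenset({S, ED, ING}),  # kept: ED and ING are not covered by earlier entries
    frozenset({S, POSS}),     # like #14: combination of #1 and #2
]
kept = remove_redundant(ranked)
print(len(kept), "signatures kept")
```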

Page 39

Final transform signatures

1. ( $, s )

2. ( $, 's )

3. ( $, s ) ( $, ed ) ( $, ing )

4. ( $, ly )

5. ( $, s ) ( $, d ) ( e, ing )

6. ( y, ies )

7. ( $, ly ) ( $, ness )

8. ( $, s ) ( $, ed ) ( $, ing ) ( $, er ) ( $, ers )

9. ( $, ed ) ( $, ing ) ( $, es )

10. ( $, ' )

11. ( $, s ) ( $, al )

12. ( $, e )

13. ( $, y )

spurious

Deletion from base

Page 40

Evaluation: precision of relation

• Precision:
  – Whether (base, derived-from-base) relationship is inflectional
  – Gold standard: Brown corpus lemmas
  – 96.7% correct

Page 41

Error Analysis

1. Agglutinative morphology

   Inflected     base         gold base
   survivors’    survivors    survivor

2. Gold standard doesn’t have deriv base

   hunters       hunt         hunter

3. Spurious morphological relationship

   hone          hon          hone
   louise        louis        louise

Page 42

Evaluation: vocab coverage

• Brown open-class POS categories
  – 31709 base forms
  – 539494 tokens (all inflections)

• 13 transform signatures
  – 5846 base forms = 18.4% coverage
  – 113165 tokens = 21.0% coverage

• (include redundant: 27%, 41.9% coverage)
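The two coverage percentages for the 13 signatures follow directly from the counts on this slide; the short check below only redoes that arithmetic (the variable names are ad hoc).

```python
# Counts copied from the slide.
total_bases, total_tokens = 31709, 539494
covered_bases, covered_tokens = 5846, 113165

base_coverage = 100 * covered_bases / total_bases
token_coverage = 100 * covered_tokens / total_tokens

print(f"{base_coverage:.1f}% of base forms, {token_coverage:.1f}% of tokens")
# prints "18.4% of base forms, 21.0% of tokens"
```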

Page 43

How to expand coverage

• Have initial, high-precision set of base forms

• Bootstrap
  – Find other inflections of base forms
  – Use new inflections to acquire more base forms
  – Repeat

Page 44

Why induction algorithm works

• Exploits combinatorics of multinomial

• Find legitimate morphological relationships
  – Intersection filters non-linguistic features
  – only linguistic features likely to co-occur across large portion of vocabulary

• Find base forms
  – t-sigs with base more probable than t-sigs without, so t-sigs with base are ranked high

Page 45

Comparison to other algorithms

• Components:
  – spelling and frequencies
  – set intersection, set cover (greedy approx. algorithm)
  – knowledge of base-and-transforms model

• Doesn’t use:
  – entropy
  – parameter optimization
  – minimum description length
  – transitional probability between characters
  – distributional semantics

Page 46

Outline

1. Goals of morphology induction

2. Linguistic model of morphology

3. Statistical model of morphology

4. Induction algorithm

5. Conclusion, relevance to cognitive science

Page 47

Summary

• Task: induction of morphology from raw data
  – Importance of generalization
  – Generalization through morphophonological rules

• Linguistic model:
  – Base forms, transforms, transform signatures
  – Improved lexical representation

Page 48

Summary

• Statistical model:
  – Zipf + Multinomial → prominence of base forms
  – Data distribution sufficient to learn ling. model

• Induction algorithm:
  – build increasingly complex representations: suffix → transform → transform signature
  – uses knowledge of linguistic and statistical models

Page 49

Main ideas

• Knowledge of linguistic and statistical properties of morphology allows for a simple induction algorithm

• Look for “universal” properties of data

• Incorporate “universals” into algorithm as a learning bias

Page 50

Relevance to cognitive science

• Linguistics:
  – Statistical / algorithmic evidence for rules
  – Statistical origin of rules?

• Psycholinguistics:
  – “Past tense” learning models (R&M, Pinker)
  – presupposes list of (base, inflected) forms

• Computational linguistics:
  – towards induction of phonological rules and finite-state models of morphology