
The Oslo-Bergen Tagger OBT+stat - a short presentation

André Lynum, Kristin Hagen, Janne Bondi Johannessen and Anders Nøklestad

Morphosyntactic tagger and lemmatizer
• Bokmål and Nynorsk
• Based on lexicon and linguistic rules
• Statistical disambiguation for completely unambiguous output (currently Bokmål only)

Purpose

• Annotation for linguistic research (e.g. The Oslo Corpus)
• Large-scale corpus annotation (e.g. NoWaC, in progress)

Applications

• Grammar checker in Microsoft Word and others
• Open source and commercial translation systems (Apertium, NyNo, Kaldera)
• Commercial Content Management Systems (TextUrgy)

Resources

Lexicon based on Norsk ordbank
Bokmål: 151 229 entries
Nynorsk: 126 323 entries

Resources

Hand-made Constraint Grammar rules

Bokmål: 2214 morphological rules
Nynorsk: 3849 morphological rules

Resources

Development and test corpora
Training/development corpus: approx. 120,000 words each for Bokmål and Nynorsk
Test/evaluation corpus: approx. 30,000 words each for Bokmål and Nynorsk

Resources

Dependency syntax for both Bokmål and Nynorsk

Technology

Multitagger: Common Lisp
CG Disambiguator: VislCG3 (C++)
Statistical Disambiguator: Ruby, HunPos

Pipeline
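
The pipeline chains the three components from the Technology slide: multitagger, then CG disambiguator, then statistical disambiguator. Below is a minimal conceptual sketch of that data flow in Python; the function bodies are hypothetical stand-ins for illustration, not the interfaces of the real Common Lisp, VislCG3 or Ruby components.

from typing import List, Tuple

Reading = Tuple[str, str]            # (lemma, tag string)
Cohort = Tuple[str, List[Reading]]   # (word form, candidate readings)

def multitag(text: str) -> List[Cohort]:
    # Stand-in for the Common Lisp multitagger: tokenize and enumerate
    # every lemma+tag reading found in the lexicon. One faked cohort here.
    return [("en", [("en", "det mask ent kvant"),
                    ("en", "pron ent pers hum"),
                    ("en", "adv"),
                    ("ene", "verb imp tr1")])]

def cg_disambiguate(cohorts: List[Cohort]) -> List[Cohort]:
    # Stand-in for VislCG3: the real step applies thousands of Constraint
    # Grammar rules; here the cohorts are passed through unchanged.
    return cohorts

def stat_disambiguate(cohorts: List[Cohort]) -> List[Tuple[str, Reading]]:
    # Stand-in for the Ruby/HunPos statistical disambiguator: the real step
    # aligns an HMM tagger's output with the remaining readings; here we
    # simply keep the first reading of each cohort.
    return [(form, readings[0]) for form, readings in cohorts]

print(stat_disambiguate(cg_disambiguate(multitag("en"))))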

Results

Competitive results on varied domains

Multitagger

• Sophisticated tokenizer, morphological analyzer and compound word analyzer (guesser)

• Enumerates all possible tags and lemmas
• Tags composed of detailed morphosyntactic information

Multitagger output

<word>Dette</word>
"<dette>"
    "dette" verb inf i2 pa4
    "dette" pron nøyt ent pers 3
    "dette" det dem nøyt ent
<word>er</word>
"<er>"
    "være" verb pres a5 pr1 pr2 <aux1/perf_part>
<word>en</word>
"<en>"
    "en" det mask ent kvant
    "en" pron ent pers hum
    "en" adv
    "ene" verb imp tr1
<word>testsetning</word>
"<testsetning>"
    "testsetning" subst appell fem ub ent samset
    "testsetning" subst appell mask ub ent samset
<word>.</word>
"<.>"
    "$." clb <<< <punkt>

Multitagger output

<word>en</word>
"<en>"
    "en" det mask ent kvant
    "en" pron ent pers hum
    "en" adv
    "ene" verb imp tr1
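
As a small side illustration (not part of the OBT toolchain itself), the cohort format above can be read into a plain data structure; this sketch assumes the multi-line layout used in these examples:

import re

def parse_cohorts(lines):
    # Reads multitagger output like the example above: a <word>...</word>
    # line, the word form as "<...>", then one reading per line
    # ("lemma" followed by its tags).
    cohorts, current = [], None
    for raw in lines:
        line = raw.strip()
        word = re.match(r'<word>(.*)</word>', line)
        if word:
            current = {"word": word.group(1), "readings": []}
            cohorts.append(current)
        elif line.startswith('"<'):
            continue  # the word form repeated as "<...>"
        elif line.startswith('"') and current is not None:
            lemma, _, tags = line.partition('" ')
            current["readings"].append({"lemma": lemma.strip('"'),
                                        "tags": tags.split()})
    return cohorts

example = '''<word>en</word>
"<en>"
    "en" det mask ent kvant
    "en" pron ent pers hum
    "en" adv
    "ene" verb imp tr1'''.splitlines()

print(parse_cohorts(example))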

CG Disambiguator

• Based on detailed Constraint Grammar rulesets for Bokmål and Nynorsk

• Rules compatible with the state-of-the-art VislCG3 disambiguator

• Efficiently disambiguates multitagger cohorts with high precision

• Leaves some ambiguity by design

CG Rules

#:2553
SELECT:2553 (subst mask ent) IF
    (NOT 0 farlige-mask-subst)
    (NOT 0 fv)
    (NOT 0 adj)
    (NOT -1 komma/konj)
    (**-1C mask-det LINK NOT 0 nr2-det LINK NOT *1 ikke-adv-adj);
# "en vidunderlig vakker sommerfugl"
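
The rule selects the masculine singular noun reading when, roughly, an unambiguously masculine determiner is found to the left and the other context checks pass (the comment gives the motivating phrase "en vidunderlig vakker sommerfugl"). As a simplified illustration of what a SELECT operation does to one cohort, with the context tests omitted, here is a small Python sketch (not how VislCG3 is actually implemented):

def cg_select(readings, target_tags):
    # Keep only the readings that carry all target tags, provided the rule
    # actually discards something; a CG rule never empties a cohort.
    matching = [r for r in readings if target_tags <= set(r["tags"])]
    return matching if matching and len(matching) < len(readings) else readings

readings = [
    {"lemma": "testsetning", "tags": ["subst", "appell", "fem", "ub", "ent", "samset"]},
    {"lemma": "testsetning", "tags": ["subst", "appell", "mask", "ub", "ent", "samset"]},
]
# SELECT (subst mask ent) keeps only the masculine reading.
print(cg_select(readings, {"subst", "mask", "ent"}))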

Example output

<word>Dette</word>
"<dette>"
    "dette" pron nøyt ent pers 3 SELECT:2607
    ; "dette" verb inf i2 pa4 SELECT:2607
    ; "dette" det dem nøyt ent SELECT:2607
<word>er</word>
"<er>"
    "være" verb pres a5 pr1 pr2 <aux1/perf_part>
<word>en</word>
"<en>"
    "en" det mask ent kvant SELECT:2762
    ; "en" adv REMOVE:3689
    ; "en" pron ent pers hum SELECT:2762
    ; "ene" verb imp tr1 SELECT:2762
<word>testsetning</word>
"<testsetning>"
    "testsetning" subst appell mask ub ent samset SELECT:2553
    ; "testsetning" subst appell fem ub ent samset SELECT:2553
<word>.</word>
"<.>"
    "$." clb <<< <punkt>

Example of ambiguity left unresolved

<word>Setninger</word>
"<setninger>"
    "setning" subst appell fem ub fl
    "setning" subst appell mask ub fl
<word>kan</word>
"<kan>"
    "kunne" verb pres tr1 tr3 <aux1/infinitiv>
<word>være</word>
"<være>"
    "være" verb inf tr5
    "være" verb inf a5 pr1 pr2 <aux1/perf_part>
    ; "være" subst appell nøyt ubøy REMOVE:3123
<word>vanskelige</word>
"<vanskelige>"
    "vanskelig" adj fl pos
    ; "vanskelig" adj be ent pos REMOVE:2318
<word>.</word>
"<.>"
    "$." clb <<< <punkt>

Example of ambiguity left unresolved

<word>Setninger</word>
"<setninger>"
    "setning" subst appell fem ub fl
    "setning" subst appell mask ub fl

Example of unresolved ambiguity

<word>Det</word>
"<det>"
    "det" pron nøyt ent pers 3 SELECT:2607
    ; "det" det dem nøyt ent SELECT:2607
<word>dreier</word>
"<dreier>"
    "dreie" verb pres tr1 i2 tr11 SELECT:2467
    ; "drei" subst appell mask ub fl SELECT:2467
    ; "dreier" subst appell mask ub ent SELECT:2467
<word>seg</word>
"<seg>"
    "seg" pron akk refl SELECT:3333
    ; "sige" verb pret i2 a3 pa4 SELECT:3333
<word>om</word>
"<om>"
    "om" prep SELECT:2653
    ; "om" sbu SELECT:2653
<word>åndsverk</word>
"<åndsverk>"
    "åndsverk" subst appell nøyt ub fl <*verk>
    "åndsverk" subst appell nøyt ub ent <*verk>
<word>.</word>
"<.>"
    "$." clb <<< <punkt>

Example of unresolved ambiguity

<word>åndsverk</word>
"<åndsverk>"
    "åndsverk" subst appell nøyt ub fl <*verk>
    "åndsverk" subst appell nøyt ub ent <*verk>

Example of lemma ambiguity

<word>Det</word>
"<det>"
    "Det" subst prop <*>
<word>gamle</word>
"<gamle>"
    "gammel" adj be ent pos SELECT:3064
    "gammal" adj be ent pos SELECT:3064
    ; "gammel" adj fl pos SELECT:3064
    ; "gammal" adj fl pos SELECT:3064
<word>testamentet</word>
"<testamentet>"
    "testament" subst appell nøyt be ent
    "testamente" subst appell nøyt be ent
<word>.</word>
"<.>"

Example of lemma ambiguity

<word>gamle</word>
"<gamle>"
    "gammel" adj be ent pos SELECT:3064
    "gammal" adj be ent pos SELECT:3064

Example of lemma ambiguity

<word>Oslo</word>
"<oslo>"
    "Oslo" subst prop
<word>er</word>
"<er>"
    "være" verb pres a5 pr1 pr2 <aux1/perf_part>
<word>byen</word>
"<byen>"
    "bye" subst appell mask be ent
    "by" subst appell mask be ent
<word>vår</word>
"<vår>"
    "vår" det mask ent poss SELECT:2689
    ; "vår" det fem ent poss SELECT:2689
    ; "vår" subst appell mask ub ent SELECT:2689
<word>.</word>
"<.>"
    "$." clb <<< <punkt>

Example of lemma ambiguity

<word>byen</word>
"<byen>"
    "bye" subst appell mask be ent
    "by" subst appell mask be ent

Example of unwanted ambiguity

Livet på jorden har tilpasset seg og tildels utnyttet de skiftende forhold.
("Life on Earth has adapted to, and partly exploited, the changing conditions.")

Example of unwanted ambiguity

<word>og</word>
"<og>"
    "og" konj
    "og" konj clb
    ; "og" adv REMOVE:2227
<word>til dels</word>
"<til dels>"
    "til dels" adv prep+subst @adv
<word>utnyttet</word>
"<utnyttet>"
    "utnytte" verb pret tr1
    "utnytte" verb perf-part tr1
    ; "utnytte" adj nøyt ub ent <perf-part> tr1 REMOVE:2274
    ; "utnytte" adj ub m/f ent <perf-part> tr1 REMOVE:2274
<word>de</word>
"<de>"
    "de" det dem fl SELECT:2780
    ; "de" pron fl pers 3 nom SELECT:2780
<word>skiftende</word>
"<skiftende>"
    "skifte" adj <pres-part> tr1 i1 i2 tr11 pa1 pa2 pa5 tr13
<word>forhold</word>

Example of unwanted ambiguity

<word>utnyttet</word>
"<utnyttet>"
    "utnytte" verb pret tr1
    "utnytte" verb perf-part tr1

Statistical disambiguator

• Uses a statistical model to fully disambiguate
• Simple model based on existing resources
• Must discriminate between the ambiguities left by the CG disambiguator

Earlier ambiguities - now resolved

<word>Setninger</word>
"<setninger>"
    "setning" subst appell fem ub fl    <Correct!>
    "setning" subst appell mask ub fl

Earlier ambiguities - now resolved

<word>om</word>
"<om>"
    "om" prep    <Correct!>
    "om" sbu
<word>åndsverk</word>
"<åndsverk>"
    "åndsverk" subst appell nøyt ub fl <*verk>    <Correct!>
    "åndsverk" subst appell nøyt ub ent <*verk>

Earlier ambiguities - now resolved

<word>gamle</word>
"<gamle>"
    "gammel" adj be ent pos    <Correct!>
    "gammal" adj be ent pos
    "gammel" adj fl pos
    "gammal" adj fl pos

Earlier ambiguities - now resolved

<word>byen</word>
"<byen>"
    "bye" subst appell mask be ent
    "by" subst appell mask be ent    <Correct!>

Statistical disambiguation process

• Statistical tagger is run independently of the CG disambiguator

• The output is aligned
• The statistical tagger's result is used to select among the remaining ambiguous readings (see the sketch below)
• Simple lemma disambiguation
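
A minimal sketch of the selection step, assuming the HMM tagger's output has already been aligned token by token with the CG cohorts and each remaining reading carries a single tag string; the real OBT-Stat alignment and fallback handling are more involved:

def select_reading(cg_readings, hmm_tag):
    # Prefer the CG reading whose tag matches the HMM tagger's choice;
    # otherwise fall back to the first remaining reading.
    for reading in cg_readings:
        if reading["tag"] == hmm_tag:
            return reading
    return cg_readings[0]

# The "Setninger" cohort left ambiguous by the CG disambiguator:
cohort = [{"lemma": "setning", "tag": "subst appell fem ub fl"},
          {"lemma": "setning", "tag": "subst appell mask ub fl"}]
print(select_reading(cohort, "subst appell fem ub fl"))  # picks the fem reading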

HMM modelling

• Robust performance on smaller amounts of training data
• Good unknown word handling
• Cheap and mature

Our HMM model

• Trained on 122 523 words in 8178 sentences
• Variety of domains
• More than 350 distinct tags
• Not very good accuracy, really

HMM model integration

Ambiguities in ca. 4.5% of tokens
Coverage ca. 80%

Lemma disambiguation

Mainly resolved by tag disambiguation
But some lemmas are still ambiguous

Using word form frequencies

Idea: lemmas occur as word forms in large corpora

Use word frequencies from NoWaC to disambiguate among lemmas
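
A minimal sketch of this heuristic, assuming a plain word-form frequency table derived from NoWaC; the counts below are invented for illustration:

# When the tag is already decided but several lemmas remain, prefer the
# lemma whose own word form is most frequent in a large corpus such as NoWaC.
nowac_freq = {"by": 120000, "bye": 150}

def pick_lemma(candidate_lemmas, freq=nowac_freq):
    return max(candidate_lemmas, key=lambda lemma: freq.get(lemma, 0))

# "byen" is ambiguous between the lemmas "by" and "bye"; the corpus counts
# favour "by", matching the resolved example earlier.
print(pick_lemma(["by", "bye"]))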

Remaining ambiguities

Randomly selected

Expectations

• Cheap and cheerful modeling
• Facing a variety of hard disambiguation decisions
• On a large morphosyntactic tagset
• Evaluated on a slightly eclectic corpus

Results: CG Disambiguation

Precision: 96.03%
Recall: 99.02%
F-score: 97.2%

Results: Full disambiguation

Accuracy 96.56%

Results: Full disambiguation

Overall accuracy: 96.56%
Tagging accuracy: 96.74%
Lemma accuracy: 98.33%

Details

Tagger coverage: 79.39%
Tagger accuracy: 81.70%
Lemma coverage: 54.23%
Lemma accuracy: 86.71%

Forthcoming (technical)

• Optimizing for very large corpora (> 1 billion words)
• More sophisticated modeling
• Discriminative modeling or MBT modeling
• Constrained decoding
• Better lemma disambiguation

Forthcoming (theoretical)

• Finding the best division of labor between data-driven and rule-driven approaches
• Pivoting on specific errors and ambiguities
• Working more with syntax (CG3 dependency trees)

Links

• http://tekstlab.uio.no/obt-ny/index.html
• http://github.com/andrely/OBT-Stat
