
Pages 1-2: Combining Multiple Dictionaries to Improve Tokenization of

Combining Multiple Dictionaries to Improve Tokenization of Ainu Language

Michal Ptaszynski 1, Yuka Ito 2, Karol Nowakowski 3, Hirotoshi Honma 2, Yoko Nakajima 2, Fumito Masui 1

1 Kitami Institute of Technology
2 Kushiro National College of Technology
3 Independent Researcher


Page 3: Combining Multiple Dictionaries to Improve Tokenization of

INTRODUCTION

• The Ainu language is the language of the Ainu people, who live mostly in northern Japan.

• Ainu population: about 23 thousand people.

• Number of native speakers: fewer than a hundred (Hohmann, 2008).

• The Ainu language is critically endangered (Moseley, 2010).

Purpose of this research:

• Create a language analysis toolkit including a POS tagger, a translation support tool, and a shallow parser.

• Help linguistic and language anthropology research and support translators of Ainu texts.

• Contribute to the process of reviving the Ainu language.

Page 4: Combining Multiple Dictionaries to Improve Tokenization of

PREVIOUS RESEARCH ON AINU LANGUAGE

Linguistic Studies:

• collections of Ainu epic stories and myths (Chiri, 1978; Kayano, 1998; Piłsudski and Majewicz, 2004)

• dictionaries and lexicons (Hattori, 1964; Chiri, 1975-1976; Nakagawa, 1995; Kayano, 1996; Tamura, 1998; Kirikae, 2003)

• grammar descriptions (Chiri, 1974; Murasaki, 1979; Refsing, 1986; Kindaichi, 1993; Sato, 2008)

NLP-related Studies:

• automatically gathering word translations from texts (Echizen-ya et al., 2004)

• analysis and retrieval of hierarchical Ainu-Japanese translations (Azumi & Momouchi, 2009ab)

• annotating Ainu "yukar" stories for a machine translation system (Momouchi et al., 2008)

• a system for translation of Ainu toponyms (place names) (Momouchi and Kobayashi, 2010)

Our previous work:

• created POST-AL, a simple POS tagger for the Ainu language (Ptaszynski & Momouchi, 2012)

• expanded it into a toolkit (Ptaszynski et al., 2013)

• improved POS tagging (Ptaszynski et al., 2016)

Page 5: Combining Multiple Dictionaries to Improve Tokenization of

POST-AL

DICTIONARY

• Base dictionary for POST-AL: Ainu shin-yoshu jiten (Lexicon to Yukie Chiri's Ainu Shin-yosyu (Ainu Songs of Gods)) by Kirikae (2003)

• Dictionary information transformed into an XML database with the following fields (a reading sketch follows below):

1. token (word, morpheme, etc.)
2. part of speech
3. meaning (in Japanese)
4. usage examples (partially)
5. reference to the story it appears in (partially)
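As an illustration of the XML database idea, a minimal sketch of reading one entry with Python's standard library; the tag names (entry, token, pos, meaning, example, ref) and the sample content are assumptions for illustration, not the actual POST-AL schema:

    # Hypothetical entry layout mirroring the five fields above; the
    # actual POST-AL schema is not shown in the slides.
    import xml.etree.ElementTree as ET

    SAMPLE = """
    <dictionary>
      <entry>
        <token>kamuy</token>
        <pos>noun</pos>
        <meaning>神 (god)</meaning>
        <example>kamuy yukar</example>
        <ref>Song of the Owl God</ref>
      </entry>
    </dictionary>
    """

    root = ET.fromstring(SAMPLE)
    for entry in root.iter("entry"):
        # each entry carries the token, its POS, and its Japanese gloss
        print(entry.findtext("token"), entry.findtext("pos"),
              entry.findtext("meaning"))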

Pages 6-8: Combining Multiple Dictionaries to Improve Tokenization of

POST-AL

• SYSTEM DESCRIPTION

• Tokenization
• DL-LSM (Dictionary Lookup based on the Longest Match Principle)

• POS Tagging
• CON-POST (Contextual Part-of-Speech Tagging), based on a higher-order HMM trained on dictionary examples (a toy sketch follows below)

• Token Translation
• CON-ToT (Contextual Token Translation): the translation selected specifically for the word chosen by CON-POST

[Figure: processing pipeline - input -> tokenization -> POS tagging -> token translation -> output]

Planned improvements:

1. Improve the tokenization algorithm

2. Improve the dictionary base
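The slides name CON-POST as a higher-order HMM but do not show its implementation. As a rough illustration of HMM-based contextual tagging, here is a minimal sketch, assuming a toy bigram model with invented probability tables (trans, emit, start are placeholders, not POST-AL's actual parameters):

    # Toy bigram-HMM Viterbi tagger. CON-POST reportedly uses a
    # higher-order HMM trained on dictionary usage examples; the
    # probabilities below are invented for illustration only.
    from math import log

    def viterbi(tokens, tags, trans, emit, start):
        # trans[(t1, t2)] = P(t2 | t1), emit[(tag, word)] = P(word | tag),
        # start[tag] = P(tag at sentence start); unseen events get a floor.
        def lp(p):
            return log(p) if p > 0 else float("-inf")
        V = [{t: lp(start.get(t, 1e-9)) + lp(emit.get((t, tokens[0]), 1e-9))
              for t in tags}]
        back = []
        for w in tokens[1:]:
            col, ptr = {}, {}
            for t in tags:
                prev = max(tags, key=lambda p: V[-1][p] + lp(trans.get((p, t), 1e-9)))
                col[t] = (V[-1][prev] + lp(trans.get((prev, t), 1e-9))
                          + lp(emit.get((t, w), 1e-9)))
                ptr[t] = prev
            V.append(col)
            back.append(ptr)
        best = max(tags, key=lambda t: V[-1][t])
        path = [best]
        for ptr in reversed(back):
            path.append(ptr[path[-1]])
        return list(reversed(path))

    tags = ["N", "V"]
    start = {"N": 0.7, "V": 0.3}
    trans = {("N", "N"): 0.4, ("N", "V"): 0.6, ("V", "N"): 0.8, ("V", "V"): 0.2}
    emit = {("N", "kamuy"): 0.5, ("N", "an"): 0.1, ("V", "an"): 0.4}
    print(viterbi(["kamuy", "an"], tags, trans, emit, start))  # -> ['N', 'V']

Viterbi decoding picks the tag sequence that best fits both each word and its neighbors, which is how context disambiguates the POS of ambiguous tokens.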

Page 9: Combining Multiple Dictionaries to Improve Tokenization of

IMPROVING TOKENIZER

APPLIED TOKENIZERS

• POST-AL Tokenizer
• dictionary lookup
• longest match principle (a sketch follows below)
• keeps track of the already matched word patterns to avoid over-tokenization (splitting each word recursively)

• NLTK Word Tokenizer
• NLTK (Natural Language Tool-Kit) Word Tokenizer: http://www.nltk.org/
• re-trained on the same Ainu dictionary base
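A minimal sketch of the longest-match idea, assuming a plain set of dictionary tokens (the toy DICTIONARY below stands in for the real dictionary base, and the real POST-AL tokenizer additionally tracks already matched patterns, which this simplified version omits):

    # Greedy dictionary lookup with the longest match principle.
    # DICTIONARY is a toy stand-in for the combined dictionary base.
    DICTIONARY = {"kamuy", "yukar", "pirka", "rera", "kar", "kusu"}

    def longest_match_tokenize(text, dictionary=DICTIONARY):
        text = text.replace(" ", "")  # assume space-less input for the sketch
        tokens, i = [], 0
        while i < len(text):
            # try the longest dictionary word starting at position i
            for j in range(len(text), i, -1):
                if text[i:j] in dictionary:
                    tokens.append(text[i:j])
                    i = j
                    break
            else:
                # unknown character: emit it as a single-character token
                tokens.append(text[i])
                i += 1
        return tokens

    print(longest_match_tokenize("pirkarera"))  # -> ['pirka', 'rera']

Note that with a richer dictionary that also contained "pirkare", the same greedy strategy would produce the wrong split "pirkare ra" seen in the error analysis later, which is why purely greedy matching needs disambiguation.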

Page 10: Combining Multiple Dictionaries to Improve Tokenization of

IMPROVING TOKENIZER

Test data

• 5 yukar stories (tokenized by experts):

9. Tororo hanrok, hanrok! (unknown meaning)
10. Kutnisa kutunkutun (unknown meaning)
11. Tan ota hure hure (This sand, red, red!)
12. Kappa rew rew kappa (Otter, flexible otter)
13. Tonupeka ranran (unknown meaning / ... raining [?])

Pages 11-12: Combining Multiple Dictionaries to Improve Tokenization of

IMPROVING TOKENIZER

           POST-AL tokenizer          NLTK word tokenizer
           Pr      Re      F1         Pr      Re      F1
Yuk09      81.3%   85.9%   83.2%      38.4%   92.9%   53.8%
Yuk10      90.2%   93.4%   91.5%      28.8%   82.1%   42.1%
Yuk11      86.0%   90.6%   87.9%      31.9%   89.1%   46.5%
Yuk12      84.5%   87.7%   85.7%      33.1%   86.0%   46.9%
Yuk13      87.4%   92.7%   89.6%      34.8%   86.1%   49.0%

(Pr = precision, Re = recall; a sketch of how these can be computed for tokenization follows below.)
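The slides do not specify the exact matching scheme behind Pr/Re/F1, so here is a minimal sketch, assuming predicted and gold tokens are scored as character spans over the space-less text:

    # Token-level precision/recall/F1 for a predicted segmentation
    # against the expert gold standard. Tokens are compared as character
    # spans; whether the authors scored spans or boundaries is not
    # stated in the slides.
    def spans(tokens):
        # map a token sequence to the character spans it covers
        out, i = [], 0
        for t in tokens:
            out.append((i, i + len(t)))
            i += len(t)
        return set(out)

    def prf(predicted, gold):
        p, g = spans(predicted), spans(gold)
        tp = len(p & g)
        pr = tp / len(p) if p else 0.0
        re = tp / len(g) if g else 0.0
        f1 = 2 * pr * re / (pr + re) if pr + re else 0.0
        return pr, re, f1

    # a wrong split from the later error analysis scores zero here:
    print(prf(["pirkare", "ra"], ["pirka", "rera"]))  # -> (0.0, 0.0, 0.0)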


Page 13: Combining Multiple Dictionaries to Improve Tokenization of

IMPROVING TOKENIZER

• The retrained NLTK tokenizer performs recursive tokenization (it keeps splitting words into ever smaller dictionary units), which explains its high recall but low precision in the table above.

Page 14: Combining Multiple Dictionaries to Improve Tokenization of

IMPROVING DICTIONARY

APPLIED DICTIONARIES

1. Lexicon to Ainu Songs of Gods (based on written scripts), Kirikae (2003) - hereafter KK

2. Ainugo kaiwa jiten (Ainu conversational dictionary), Jinbo & Kanazawa (1898-1986) - hereafter JK

3. Combined 1 + 2 (see the merging sketch below)
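The slides do not say how KK and JK were merged; as one plausible reading, a sketch that simply unions entries keyed by token, keeping both sources when they conflict (the combine function and the entry layout are hypothetical):

    # Hypothetical sketch: how POST-AL actually merges KK and JK
    # (e.g. how conflicting POS tags or translations are resolved) is
    # not stated in the slides; here entries are unioned by token and
    # both sources are kept on conflict.
    def combine(kk, jk):
        combined = {}
        for source, entries in (("KK", kk), ("JK", jk)):
            for token, info in entries.items():
                combined.setdefault(token, []).append((source, info))
        return combined

    kk = {"kamuy": {"pos": "noun", "meaning": "god"}}
    jk = {"kamuy": {"pos": "noun", "meaning": "god; bear"},
          "pirka": {"pos": "adjective", "meaning": "good, beautiful"}}
    merged = combine(kk, jk)
    print(sorted(merged))        # -> ['kamuy', 'pirka']
    print(len(merged["kamuy"]))  # -> 2 (entry kept from both sources)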

Page 15: Combining Multiple Dictionaries to Improve Tokenization of

IMPROVING DICTIONARY

APPLIED DATASETS

• Yukar - Ainu Songs of Gods

• JK sentence samples

• Samples from Shibatani's The Languages of Japan

• Mukawa Dialect Samples (by Chiba Univ.): http://cas-chiba.net/Ainu-archives/mukawa/mukawa.cgi

Pages 16-21: Combining Multiple Dictionaries to Improve Tokenization of

IMPROVING DICTIONARY

RESULTS

Tokenization (values are F-scores):

           Yuk9    Yuk10   Yuk11   Yuk12   Yuk13   JK      Mukawa  Shibatani   Average
JK         53.8%   57.7%   56.2%   50.4%   58.3%   86.2%   69.0%   64.9%       62.1%
KK         81.6%   83.1%   91.1%   77.4%   87.0%   66.8%   65.5%   57.8%       76.3%
JK + KK    73.4%   80.4%   81.9%   73.9%   85.3%   82.9%   69.0%   70.8%       77.2%

(JK = JK sample sentences; Mukawa = Mukawa dialect dictionary sample sentences; Shibatani = Shibatani colloquial samples)

• KK-based was better on yukar data

• JK-based was better on JK samples

• JK + KK combined was in the middle…

• …but on new data it did not hinder performance…

• …and in general improved it!

Pages 22-25: Combining Multiple Dictionaries to Improve Tokenization of

IMPROVING DICTIONARY

ERROR ANALYSIS

• 8% of errors = word not in dictionary (Dictionary category)

• 92% of errors = word in dictionary but wrongly split (Tokenizer category)

Tokenizer output   Gold standard   Category
kutun kutun        kutunkutun      Dictionary
a sawa             as a wa         Tokenizer
tasi ne            ta sine         Tokenizer
karku su           kar kusu        Tokenizer
neap               ne a p          Tokenizer
aw a               a wa            Tokenizer
kuru n             kur un          Tokenizer
nep                ne p            Tokenizer
cir uska           ci ruska        Tokenizer
ciki k             ci kik          Tokenizer
cioarkaye          ci oarkaye      Tokenizer
ayke               a ike           Tokenizer
pokna sir          poknasir       Dictionary
ciousi             ci ousi         Tokenizer
montum             mon tum         Tokenizer
cioarkaye          ci oarkaye      Tokenizer
petetok o          pet etoko       Tokenizer
pirkare ra         pirka rera      Tokenizer
isoytak            isoitak         Dictionary
…                  …               …

Main problem: words that are in the dictionary but get wrongly split. Possible solutions (a sketch of solution 1 follows below):

1. Statistical (probability of a space between letters or letter n-grams)

2. Apply contextual information (usage examples from dictionaries)

3. Automatically obtain word n-grams for disambiguation

4. Hybrid
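For solution 1, a minimal sketch, assuming boundary probabilities are estimated from character bigram contexts in expert-tokenized text (the bigram context and the 0.5 threshold are illustrative choices, not the authors' design):

    # Estimate the probability of a word boundary between two adjacent
    # characters from expert-tokenized training text, then insert
    # boundaries where that probability crosses a threshold.
    from collections import defaultdict

    def train_boundary_model(tokenized_sentences):
        # counts[(left_char, right_char)] -> [boundary_count, total_count]
        counts = defaultdict(lambda: [0, 0])
        for tokens in tokenized_sentences:
            text = "".join(tokens)
            bounds, i = set(), 0
            for t in tokens[:-1]:
                i += len(t)
                bounds.add(i)  # positions right after each token end
            for pos in range(1, len(text)):
                pair = (text[pos - 1], text[pos])
                counts[pair][1] += 1
                counts[pair][0] += pos in bounds
        return counts

    def segment(text, counts, threshold=0.5):
        out = [text[0]]
        for pos in range(1, len(text)):
            b, n = counts.get((text[pos - 1], text[pos]), (0, 0))
            if n and b / n > threshold:
                out.append(" ")
            out.append(text[pos])
        return "".join(out).split()

    model = train_boundary_model([["kar", "kusu"], ["ne", "a", "p"]])
    print(segment("karkusu", model))  # -> ['kar', 'kusu']

In practice such a model would be trained on the expert-tokenized yukar data and could be combined with the dictionary lookup (the hybrid of solution 4).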

Pages 26-27: Combining Multiple Dictionaries to Improve Tokenization of

CONCLUSIONS

• Applied NLP techniques to help revitalize the Ainu language

• Created POST-AL, which needed improvements

• Improved tokenization

• Tokenizers compared:
• POST-AL
• NLTK

• Dictionary bases compared:
• Kirikae lexicon (KK): based on yukar stories (written)
• Jinbo & Kanazawa dictionary (JK): conversational (spoken)
• KK + JK: combined

• The custom tokenizer (POST-AL) was better than NLTK

• The combined dictionaries were in general better on new data

• The tokenization process still needs improvement

Page 28: Combining Multiple Dictionaries to Improve Tokenization of

Iyayraykere (thank you) for your attention!

Michal Ptaszynski

Kitami Institute of Technology

[email protected]