![Page 1: Introduction to Japanese Input Method](https://reader033.vdocument.in/reader033/viewer/2022052223/55794df7d8b42a31678b528b/html5/thumbnails/1.jpg)
Introduction to Japanese Input Method
Yoh Okuno
![Page 2: Introduction to Japanese Input Method](https://reader033.vdocument.in/reader033/viewer/2022052223/55794df7d8b42a31678b528b/html5/thumbnails/2.jpg)
Who are you?
• Name: Yoh Okuno
• Software Engineer at Yahoo! Japan
• Interest: NLP, Machine Learning, Data Mining
• Skill: C/C++, Python, Hadoop, etc.
• Website: http://www.yoh.okuno.name/
![Page 3: Introduction to Japanese Input Method](https://reader033.vdocument.in/reader033/viewer/2022052223/55794df7d8b42a31678b528b/html5/thumbnails/3.jpg)
Activities
• Winner of Microsoft Speller Challenge
• TokyoNLP: Founded NLP community in Japan
• Social IME: Developed cloud-based IME
• Academic papers about:
– Phrase Extraction for Predictive Input Method
– Large-Scale Language Models via Hadoop
![Page 4: Introduction to Japanese Input Method](https://reader033.vdocument.in/reader033/viewer/2022052223/55794df7d8b42a31678b528b/html5/thumbnails/4.jpg)
What is Japanese Input Method?
• Japanese language has too many characters!
– More than 6,000 kanji and 50 kana characters
• We cannot input them directly with a keyboard :(
![Page 5: Introduction to Japanese Input Method](https://reader033.vdocument.in/reader033/viewer/2022052223/55794df7d8b42a31678b528b/html5/thumbnails/5.jpg)
Using Kana Kanji Conversion
• We can input kana and convert to kanji.
• Conversion is ambiguous!
• Accuracy is the key issue in kana-kanji conversion
Ex: input good morning
![Page 6: Introduction to Japanese Input Method](https://reader033.vdocument.in/reader033/viewer/2022052223/55794df7d8b42a31678b528b/html5/thumbnails/6.jpg)
Statistical Approach
• Statistical approach resolves ambiguity well
• Use corpora and show frequent words
[Diagram: Corpora → Train (batch) → Model; User → Input: kana → Converter (looks up Model) → Output: kanji]
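The train-then-lookup flow on this slide can be sketched roughly as follows. This is a minimal illustration only: the toy corpus of (reading, surface) pairs and the names `train` and `convert` are assumptions made for the example, not part of the talk, and a real converter would use far larger corpora and a richer model.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus of (kana reading, surface form) pairs.
CORPUS = [
    ("はし", "橋"), ("はし", "箸"), ("はし", "橋"), ("はし", "端"),
    ("あめ", "雨"), ("あめ", "飴"), ("あめ", "雨"),
]

def train(corpus):
    """Batch step: count how often each surface form occurs for each reading."""
    model = defaultdict(Counter)
    for reading, surface in corpus:
        model[reading][surface] += 1
    return model

def convert(model, kana):
    """Lookup step: return candidates for the kana input, most frequent first."""
    return [surface for surface, _ in model.get(kana, Counter()).most_common()]

model = train(CORPUS)
print(convert(model, "はし"))
```

Ranking by raw frequency ignores context, which is what the following slides address.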
![Page 7: Introduction to Japanese Input Method](https://reader033.vdocument.in/reader033/viewer/2022052223/55794df7d8b42a31678b528b/html5/thumbnails/7.jpg)
Noisy Channel Model
• We want to know the most probable output
• Bayes' rule divides it into two components
• P(y): Language model
• P(x|y): Pronunciation model (easier task)
$$\hat{y} = \operatorname*{argmax}_{y} P(y \mid x)$$
$$P(y \mid x) \propto P(y)\,P(x \mid y)$$
x: input kana, y: output kanji
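As a rough illustration of the decomposition, the sketch below scores each candidate y by P(y)·P(x|y) and takes the argmax. All probabilities are made-up toy values, and the names `LANGUAGE_MODEL`, `PRONUNCIATION_MODEL`, and `best_output` are hypothetical.

```python
# Hypothetical toy values for P(y), the language model.
LANGUAGE_MODEL = {"橋": 0.5, "箸": 0.3, "端": 0.2}

# Hypothetical toy values for P(x|y), the pronunciation model,
# keyed as (kana input x, kanji output y).
PRONUNCIATION_MODEL = {
    ("はし", "橋"): 0.9,
    ("はし", "箸"): 1.0,
    ("はし", "端"): 0.4,
}

def best_output(x, candidates):
    """Return the candidate y that maximizes P(y) * P(x|y)."""
    def score(y):
        return LANGUAGE_MODEL.get(y, 0.0) * PRONUNCIATION_MODEL.get((x, y), 0.0)
    return max(candidates, key=score)

print(best_output("はし", ["橋", "箸", "端"]))
```

The normalizer P(x) is the same for every candidate, so it can be dropped from the comparison.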
![Page 8: Introduction to Japanese Input Method](https://reader033.vdocument.in/reader033/viewer/2022052223/55794df7d8b42a31678b528b/html5/thumbnails/8.jpg)
Language Model
• Sentence is sequence of words
• Assume 1st order Markov chain
• Maximum likelihood estimation
$$P(y) = \prod_{i} P(y_i \mid y_{i-1})$$
$$P(y_i \mid y_{i-1}) = \frac{C(y_i, y_{i-1})}{C(y_{i-1})}$$
C(y): count of y in corpus
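The maximum likelihood estimate can be computed by simple counting. The sketch below assumes a tiny pre-segmented toy corpus and a sentence-start symbol `<s>`; both are assumptions for the example, as are the function names. It applies no smoothing, so unseen word pairs get probability zero.

```python
from collections import Counter

# Hypothetical toy corpus, already split into words.
CORPUS = [
    ["私", "は", "学生", "です"],
    ["私", "は", "元気", "です"],
    ["彼", "は", "学生", "です"],
]
BOS = "<s>"  # assumed sentence-start symbol

def train_bigram(corpus):
    """Count C(y_{i-1}) and C(y_i, y_{i-1}) over the corpus."""
    context = Counter()  # C(y_{i-1}): times a word appears as the previous word
    bigram = Counter()   # C(y_i, y_{i-1}), keyed as (previous, current)
    for sentence in corpus:
        words = [BOS] + sentence
        for prev, cur in zip(words, words[1:]):
            context[prev] += 1
            bigram[(prev, cur)] += 1
    return context, bigram

def bigram_prob(context, bigram, prev, cur):
    """P(cur | prev) = C(cur, prev) / C(prev); 0.0 if prev was never seen."""
    if context[prev] == 0:
        return 0.0
    return bigram[(prev, cur)] / context[prev]

def sentence_prob(context, bigram, sentence):
    """P(y) as the product of bigram probabilities along the sentence."""
    prob = 1.0
    words = [BOS] + sentence
    for prev, cur in zip(words, words[1:]):
        prob *= bigram_prob(context, bigram, prev, cur)
    return prob

context, bigram = train_bigram(CORPUS)
print(bigram_prob(context, bigram, "は", "学生"))
print(sentence_prob(context, bigram, ["私", "は", "学生", "です"]))
```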
![Page 9: Introduction to Japanese Input Method](https://reader033.vdocument.in/reader033/viewer/2022052223/55794df7d8b42a31678b528b/html5/thumbnails/9.jpg)
Viterbi Algorithm
• Viterbi algorithm searches best path in lattice
• Time complexity is linear in the input length (dynamic programming)
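A minimal sketch of the lattice search, assuming a toy dictionary and toy bigram probabilities (all values, the `<s>` start symbol, the `FLOOR` fallback, and the function names are hypothetical). Because word length is bounded by `MAX_LEN` and each position has a bounded number of candidates, the work grows linearly with the input length.

```python
import math

# Hypothetical toy dictionary: kana reading -> possible surface forms.
DICTIONARY = {
    "きょう": ["今日", "京"],
    "は": ["は", "葉"],
    "はれ": ["晴れ"],
    "れ": ["れ"],
}
MAX_LEN = max(len(reading) for reading in DICTIONARY)

# Hypothetical bigram probabilities P(cur | prev); unseen pairs use FLOOR.
BIGRAM = {
    ("<s>", "今日"): 0.6, ("<s>", "京"): 0.1,
    ("今日", "は"): 0.7, ("今日", "葉"): 0.01,
    ("は", "晴れ"): 0.3,
}
FLOOR = 1e-6

def log_prob(prev, cur):
    return math.log(BIGRAM.get((prev, cur), FLOOR))

def viterbi(kana):
    """Return the best word sequence covering `kana`, or None if none exists."""
    n = len(kana)
    # best[pos][word] = (score, prev_pos, prev_word) for the best path
    # that ends with `word` at character position `pos`.
    best = [dict() for _ in range(n + 1)]
    best[0]["<s>"] = (0.0, None, None)
    for start in range(n):
        if not best[start]:
            continue  # no path reaches this position
        for end in range(start + 1, min(n, start + MAX_LEN) + 1):
            for word in DICTIONARY.get(kana[start:end], []):
                for prev, (score, _, _) in best[start].items():
                    cand = score + log_prob(prev, word)
                    if word not in best[end] or cand > best[end][word][0]:
                        best[end][word] = (cand, start, prev)
    if not best[n]:
        return None
    # Backtrack from the best final word to the start symbol.
    pos, word = n, max(best[n], key=lambda w: best[n][w][0])
    path = []
    while word != "<s>":
        path.append(word)
        _, pos, word = best[pos][word]
    return path[::-1]

print(viterbi("きょうははれ"))
```

Scores are summed in log space to avoid underflow on long inputs.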
![Page 10: Introduction to Japanese Input Method](https://reader033.vdocument.in/reader033/viewer/2022052223/55794df7d8b42a31678b528b/html5/thumbnails/10.jpg)
Trie: lookup dictionary
• Tree with node=character
• Efficient search for dictionary entries that are prefixes of the query
• Query: KENKYUSURU
• Result: KE, KEN, KENK, KENKYU
[Figure: trie whose nodes are kana characters: け, ん, き, ゅ, う, こ, う, た, っ, き, ー, し]
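The lookup can be sketched with a small dictionary-of-children trie. The class below and its word list are illustrative assumptions, not the implementation used in the talk; it returns every stored entry that is a prefix of the query in a single pass over the query.

```python
class Trie:
    """Minimal trie: each node maps one character to a child node."""

    def __init__(self):
        self.children = {}
        self.is_word = False  # True if a dictionary entry ends at this node

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def common_prefix_search(self, query):
        """Return all stored entries that are prefixes of `query`."""
        results = []
        node = self
        for i, ch in enumerate(query):
            node = node.children.get(ch)
            if node is None:
                break  # no stored entry continues with this character
            if node.is_word:
                results.append(query[:i + 1])
        return results

# Hypothetical dictionary entries (kana readings).
trie = Trie()
for entry in ["け", "けん", "けんきゅう", "けんこう", "こう"]:
    trie.insert(entry)

print(trie.common_prefix_search("けんきゅうする"))
```

Running this search from each position of the input yields the candidate words that make up the conversion lattice.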
![Page 11: Introduction to Japanese Input Method](https://reader033.vdocument.in/reader033/viewer/2022052223/55794df7d8b42a31678b528b/html5/thumbnails/11.jpg)
Conclusion
• Japanese input needs special software
• Kana-kanji conversion is a fully statistical task
• Search and lookup are interesting algorithms
• Any questions?