Introduction to Japanese Input Method
Yoh Okuno
Who are you?
• Name: Yoh Okuno
• Software Engineer at Yahoo! Japan
• Interest: NLP, Machine Learning, Data Mining
• Skill: C/C++, Python, Hadoop, etc.
• Website: http://www.yoh.okuno.name/
Activities
• Winner of the Microsoft Speller Challenge
• TokyoNLP: Founded an NLP community in Japan
• Social IME: Developed a cloud-based IME
• Academic papers on:
– Phrase Extraction for Predictive Input Method
– Large-Scale Language Models via Hadoop
What is Japanese Input Method?
• The Japanese language has too many characters!
– More than 6,000 kanji and 50 kana characters
• We cannot type them directly on a keyboard :(
Using Kana Kanji Conversion
• We can input kana and convert it to kanji.
• Conversion is ambiguous!
• Accuracy is the key issue in kana-kanji conversion
Ex: inputting "good morning"
Statistical Approach
• Statistical approach resolves ambiguity well
• Use corpora and show frequent words
[Figure: Corpora are used to train a model (batch). The converter looks up the model; the user sends kana as input and receives kanji as output.]
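The train and lookup steps above can be sketched in a few lines of Python. This is only an illustration of "use corpora and show frequent words": the toy corpus, the function names `train` and `lookup`, and the pair-based corpus format are all assumptions made for this sketch, not part of the original system.

```python
from collections import Counter, defaultdict

# Toy corpus of (kana reading, surface form) pairs, invented for illustration.
TOY_CORPUS = [
    ("こうえん", "公園"),
    ("こうえん", "公園"),
    ("こうえん", "講演"),
    ("はし", "橋"),
    ("はし", "箸"),
    ("はし", "橋"),
]

def train(corpus):
    """Batch step: count how often each surface form occurs for each reading."""
    model = defaultdict(Counter)
    for reading, surface in corpus:
        model[reading][surface] += 1
    return model

def lookup(model, reading):
    """Online step: return candidates for a reading, most frequent first."""
    counts = model.get(reading, Counter())
    return [surface for surface, _ in counts.most_common()]

model = train(TOY_CORPUS)
candidates = lookup(model, "こうえん")  # the more frequent form comes first
```

Ranking by raw frequency ignores the surrounding words, which is what the noisy channel model and the language model address.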
Noisy Channel Model
• We want to know the most probable output
• Bayes' rule divides it into two components
• P(y): Language model
• P(x|y): Pronunciation model (easier task)
$$\hat{y} = \operatorname*{arg\,max}_{y} P(y \mid x)$$
$$P(y \mid x) \propto P(y)\, P(x \mid y)$$
x: input kana, y: output kanji
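As a rough illustration of the decomposition, the following sketch scores each kanji candidate y for a kana input x by P(y)P(x|y) and keeps the best one. The probability tables are invented toy numbers, and the names `LANGUAGE_MODEL`, `PRONUNCIATION_MODEL`, `log_score`, and `decode` are hypothetical.

```python
import math

# Hypothetical toy probabilities, invented for illustration only.
LANGUAGE_MODEL = {"公園": 0.006, "講演": 0.003, "後援": 0.001}  # P(y)
PRONUNCIATION_MODEL = {                                        # P(x|y)
    ("こうえん", "公園"): 1.0,
    ("こうえん", "講演"): 1.0,
    ("こうえん", "後援"): 1.0,
}

def log_score(x, y):
    """Return log P(y) + log P(x|y), or -inf if either factor is zero."""
    p = LANGUAGE_MODEL.get(y, 0.0) * PRONUNCIATION_MODEL.get((x, y), 0.0)
    return math.log(p) if p > 0 else float("-inf")

def decode(x, candidates):
    """Pick the candidate y that maximizes P(y) * P(x|y)."""
    return max(candidates, key=lambda y: log_score(x, y))

best = decode("こうえん", ["公園", "講演", "後援"])
```

Because each of these words has one reading in the toy table, P(x|y) is 1 and the language model alone decides the ranking, which is consistent with the pronunciation model being the easier task.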
Language Model
• A sentence is a sequence of words
• Assume a first-order Markov chain
• Maximum likelihood estimation
$$P(y) = \prod_{i} P(y_i \mid y_{i-1})$$
$$P(y_i \mid y_{i-1}) = \frac{C(y_i, y_{i-1})}{C(y_{i-1})}$$
C(y): count of y in corpus
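A minimal sketch of this maximum likelihood estimate, assuming pre-segmented sentences: bigram counts are divided by the count of the preceding word. The sentence boundary markers `<s>` and `</s>`, the toy corpus, and the function names are assumptions added for the example.

```python
from collections import Counter

BOS, EOS = "<s>", "</s>"  # assumed sentence boundary markers

def train_bigram(sentences):
    """Count contexts C(y_{i-1}) and adjacent pairs over segmented sentences."""
    context_counts = Counter()
    pair_counts = Counter()
    for words in sentences:
        padded = [BOS] + list(words) + [EOS]
        for prev, cur in zip(padded, padded[1:]):
            context_counts[prev] += 1
            pair_counts[(prev, cur)] += 1
    return context_counts, pair_counts

def bigram_prob(context_counts, pair_counts, prev, cur):
    """Maximum likelihood estimate of P(cur | prev); 0.0 for unseen contexts."""
    if context_counts[prev] == 0:
        return 0.0
    return pair_counts[(prev, cur)] / context_counts[prev]

def sentence_prob(context_counts, pair_counts, words):
    """P(y) as the product of bigram probabilities along the sentence."""
    padded = [BOS] + list(words) + [EOS]
    prob = 1.0
    for prev, cur in zip(padded, padded[1:]):
        prob *= bigram_prob(context_counts, pair_counts, prev, cur)
    return prob

# Toy segmented corpus, invented for illustration.
corpus = [["今日", "は", "晴れ"], ["今日", "は", "雨"], ["明日", "は", "晴れ"]]
ctx, pairs = train_bigram(corpus)
p = sentence_prob(ctx, pairs, ["今日", "は", "晴れ"])
```

Unsmoothed estimates assign zero probability to any unseen word pair, so a practical system would typically add smoothing; that is outside the scope of this sketch.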
Viterbi Algorithm
• Viterbi algorithm searches for the best path in the lattice
• Linear time complexity (dynamic programming)
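The search can be sketched as follows, assuming a dictionary that maps kana readings to candidate words and a bigram scoring function. All names, the toy dictionary, and the scores are hypothetical. Each lattice state is a (position, word) pair, and the `max_word_len` bound (an assumption of this sketch) keeps the work per input position constant, which is what makes the running time linear in the input length.

```python
import math

def viterbi(kana, dictionary, bigram_logprob, max_word_len=8):
    """Return the best word sequence for `kana`, or None if no path exists."""
    n = len(kana)
    # best[pos][word] = (best log score, backpointer (prev_pos, prev_word))
    best = [dict() for _ in range(n + 1)]
    best[0]["<s>"] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            if not best[start]:
                continue  # no path reaches this position
            for word in dictionary.get(kana[start:end], []):
                for prev_word, (prev_score, _) in best[start].items():
                    score = prev_score + bigram_logprob(prev_word, word)
                    if word not in best[end] or score > best[end][word][0]:
                        best[end][word] = (score, (start, prev_word))
    if not best[n]:
        return None
    # Follow backpointers from the best final state to the start.
    word = max(best[n], key=lambda w: best[n][w][0])
    pos, path = n, []
    while pos > 0:
        path.append(word)
        pos, word = best[pos][word][1]
    return path[::-1]

# Toy dictionary and bigram scores, invented for illustration.
DICTIONARY = {"あつい": ["暑い", "熱い", "厚い"], "おちゃ": ["お茶"]}
SCORES = {
    ("<s>", "暑い"): 0.5, ("<s>", "熱い"): 0.3, ("<s>", "厚い"): 0.2,
    ("熱い", "お茶"): 0.2, ("暑い", "お茶"): 0.01, ("厚い", "お茶"): 0.01,
}

def toy_bigram_logprob(prev, cur):
    return math.log(SCORES.get((prev, cur), 1e-6))

result = viterbi("あついおちゃ", DICTIONARY, toy_bigram_logprob)
```

In this toy setup 暑い is the most likely first word on its own, but the path through 熱い wins once the following お茶 is taken into account, which is the reason to search the whole lattice rather than pick words greedily.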
Trie: lookup dictionary
• Tree with node = character
• Efficient search for all dictionary entries that are prefixes of the query
• Query: KENKYUSURU
• Result: KE, KEN, KENK, KENKYU
[Figure: trie with one kana character per node: け, ん, き, ゅ, う, こ, う, た, っ, き, ー, し]
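A minimal trie sketch for this kind of lookup is shown below. The class and method names are hypothetical, and the dictionary entries are assumed from the result listed on the slide (plus one extra entry, KENKO, that should not match). One walk down the tree along the query returns every dictionary entry that is a prefix of it.

```python
class TrieNode:
    """One node per character; `is_entry` marks the end of a dictionary key."""

    def __init__(self):
        self.children = {}
        self.is_entry = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key):
        node = self.root
        for ch in key:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.is_entry = True

    def common_prefix_search(self, query):
        """Return all inserted keys that are prefixes of `query`."""
        results = []
        node = self.root
        for i, ch in enumerate(query):
            node = node.children.get(ch)
            if node is None:
                break  # no dictionary key continues along this path
            if node.is_entry:
                results.append(query[: i + 1])
        return results

trie = Trie()
for entry in ["KE", "KEN", "KENK", "KENKYU", "KENKO"]:
    trie.insert(entry)
matches = trie.common_prefix_search("KENKYUSURU")
```

The lookup cost depends only on the query length, not on the dictionary size. A converter can run it from each input position to find the words that start there when building the lattice.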
Conclusion
• Japanese input needs special software
• Kana-kanji conversion is a fully statistical task
• Search and lookup are interesting algorithms
• Any questions?