Deep Technologies about Kana Kanji Conversion
![Page 1: Deep Technologies about Kana Kanji Conversion](https://reader034.vdocument.in/reader034/viewer/2022042518/55794df5d8b42a31678b5289/html5/thumbnails/1.jpg)
Deep Technologies in Kana Kanji Conversion
Yoh Okuno
![Page 2: Deep Technologies about Kana Kanji Conversion](https://reader034.vdocument.in/reader034/viewer/2022042518/55794df5d8b42a31678b5289/html5/thumbnails/2.jpg)
Components of Converter
• Model, Training, Storage, Interface etc.
[Diagram: Corpora → Train (batch) → Model → Lookup → Converter ↔ User; input: kana, output: kanji]
![Page 3: Deep Technologies about Kana Kanji Conversion](https://reader034.vdocument.in/reader034/viewer/2022042518/55794df5d8b42a31678b5289/html5/thumbnails/3.jpg)
Deep Technologies
• Various Language Models
• Training LMs using the Web and Hadoop
• Automatic Pronunciation Inference
• Data Compression
• Predictive Conversion and Spelling Correction
![Page 4: Deep Technologies about Kana Kanji Conversion](https://reader034.vdocument.in/reader034/viewer/2022042518/55794df5d8b42a31678b5289/html5/thumbnails/4.jpg)
Various Language Models
![Page 5: Deep Technologies about Kana Kanji Conversion](https://reader034.vdocument.in/reader034/viewer/2022042518/55794df5d8b42a31678b5289/html5/thumbnails/5.jpg)
Various Language Models
• Word N-gram
– Accurate but too large!
• Class N-gram
– Small but inaccurate
• Combination
– Good trade-off is needed
![Page 6: Deep Technologies about Kana Kanji Conversion](https://reader034.vdocument.in/reader034/viewer/2022042518/55794df5d8b42a31678b5289/html5/thumbnails/6.jpg)
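Language Models
• Word N-gram
$$P(y) = \prod_i P(y_i \mid y_{i-N+1}^{i-1})$$
• Class Bigram
$$P(y) = \prod_i P(y_i \mid c_i)\, P(c_i \mid c_{i-1})$$
• Phrase-based Model
$$P(y) = \underbrace{\prod_{i \in I_C} P(y_i \mid c_i)\, P(c_i \mid c_{i-1})}_{\text{Class-based sub model}} \; \underbrace{\prod_{i \in I_W} P(y_i^{i+N-1}, c_{i+N-1} \mid c_i)\, P(c_i \mid c_{i-1})}_{\text{Word-based sub model}}$$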
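As a worked example of the class bigram formula, the sketch below multiplies P(y_i | c_i) and P(c_i | c_{i-1}) along a short sentence, working in log space. The class names, the start symbol `<s>`, and every probability value are hypothetical toy numbers chosen for illustration; none of them come from the slides.

```python
import math

# Hypothetical toy parameters for illustration only.
EMISSION = {  # P(y_i | c_i): probability of a word given its class
    ("東京", "NOUN"): 0.1,
    ("に", "PARTICLE"): 0.5,
    ("行く", "VERB"): 0.2,
}
TRANSITION = {  # P(c_i | c_{i-1}): class bigram probability
    ("<s>", "NOUN"): 0.4,
    ("NOUN", "PARTICLE"): 0.5,
    ("PARTICLE", "VERB"): 0.3,
}

def class_bigram_logprob(words, classes):
    """Return log P(y) = sum_i [log P(y_i | c_i) + log P(c_i | c_{i-1})]."""
    logp = 0.0
    prev = "<s>"  # assumed sentence-start class
    for word, cls in zip(words, classes):
        logp += math.log(EMISSION[(word, cls)])
        logp += math.log(TRANSITION[(prev, cls)])
        prev = cls
    return logp

logp = class_bigram_logprob(["東京", "に", "行く"], ["NOUN", "PARTICLE", "VERB"])
print(math.exp(logp))
```

Summing logs rather than multiplying raw probabilities avoids underflow on long sentences, which is a common design choice for this kind of scoring.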
![Page 7: Deep Technologies about Kana Kanji Conversion](https://reader034.vdocument.in/reader034/viewer/2022042518/55794df5d8b42a31678b5289/html5/thumbnails/7.jpg)
Phrase-based Model
• Replace part of the class bigram with a word N-gram
• Intermediate classes are marginalized
Phrase probability: P(w1, w2, w3, c3 | c1)
Only the left-side class is conditioned on!
[Figure legend: classes and words]
![Page 8: Deep Technologies about Kana Kanji Conversion](https://reader034.vdocument.in/reader034/viewer/2022042518/55794df5d8b42a31678b5289/html5/thumbnails/8.jpg)
Training Large-Scale Language Models
![Page 9: Deep Technologies about Kana Kanji Conversion](https://reader034.vdocument.in/reader034/viewer/2022042518/55794df5d8b42a31678b5289/html5/thumbnails/9.jpg)
Issues about Training
• How to collect large corpora?
– Crawl, crawl, crawl!
– A morphological analyzer is needed
• How to store and process them?
– Hadoop MapReduce helps us
– Speeding up N-gram counting?
![Page 10: Deep Technologies about Kana Kanji Conversion](https://reader034.vdocument.in/reader034/viewer/2022042518/55794df5d8b42a31678b5289/html5/thumbnails/10.jpg)
Crawling the Web
• Raw HTML can be collected from the Web
• Statistics have no copyright
• Required components:
– Web crawler
– Body text extraction
– (Spam filter)
– Morphological analyzer
Make use of the cloud
![Page 11: Deep Technologies about Kana Kanji Conversion](https://reader034.vdocument.in/reader034/viewer/2022042518/55794df5d8b42a31678b5289/html5/thumbnails/11.jpg)
Japanese Morphological Analyzer
• Input: raw text
• Output: segmented words, part-of-speech...
![Page 12: Deep Technologies about Kana Kanji Conversion](https://reader034.vdocument.in/reader034/viewer/2022042518/55794df5d8b42a31678b5289/html5/thumbnails/12.jpg)
MapReduce for Language Model
• Distributed computing of N-gram statistics
[Diagram: Corpora → Mappers → Reducers → N-grams]
Mapper: extract N-grams from corpora
Reducer: aggregate N-gram counts
![Page 13: Deep Technologies about Kana Kanji Conversion](https://reader034.vdocument.in/reader034/viewer/2022042518/55794df5d8b42a31678b5289/html5/thumbnails/13.jpg)
MapReduce: Pseudo Code
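The pseudo code on this slide is not present in the transcript. As a stand-in, here is a minimal single-machine sketch of the mapper and reducer roles described on the previous slide. It assumes the corpus has already been segmented into whitespace-separated words by a morphological analyzer; the function names and the sort-based shuffle are illustrative assumptions, not Hadoop APIs.

```python
from itertools import groupby

def mapper(sentence, n=2):
    """Emit (N-gram, 1) for every N-gram in one segmented sentence."""
    words = sentence.split()
    for i in range(len(words) - n + 1):
        yield tuple(words[i:i + n]), 1

def reducer(ngram, counts):
    """Aggregate all counts emitted for one N-gram."""
    return ngram, sum(counts)

def run_job(corpus, n=2):
    """Simulate map, shuffle (sort by key) and reduce in one process."""
    emitted = sorted(pair for sentence in corpus for pair in mapper(sentence, n))
    return dict(
        reducer(ngram, (count for _, count in group))
        for ngram, group in groupby(emitted, key=lambda pair: pair[0])
    )

counts = run_job(["今日 は 晴れ", "今日 は 雨"])
print(counts)
```

In a real cluster the framework performs the shuffle, so only the bodies of `mapper` and `reducer` would carry over.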
![Page 14: Deep Technologies about Kana Kanji Conversion](https://reader034.vdocument.in/reader034/viewer/2022042518/55794df5d8b42a31678b5289/html5/thumbnails/14.jpg)
Speeding up N-gram Counting
• Use a binary representation for N-grams
– Variable-length word IDs are efficient
• Use the in-mapper combining pattern by Jimmy Lin
– Combining in memory is more efficient
• Use the stripes pattern by Jimmy Lin
– Group N-grams by first word
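The two patterns named above can be sketched roughly as follows for bigrams. This is a single-process illustration under the assumption of whitespace-segmented input; the function names are hypothetical and no Hadoop API is used.

```python
from collections import Counter, defaultdict

def in_mapper_combining(sentences, n=2):
    """Accumulate counts in memory and emit each N-gram once per mapper,
    instead of emitting (N-gram, 1) for every occurrence."""
    local = Counter()
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - n + 1):
            local[tuple(words[i:i + n])] += 1
    yield from local.items()

def stripes_mapper(sentences):
    """Emit one 'stripe' per first word: a map from second word to count."""
    stripes = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for first, second in zip(words, words[1:]):
            stripes[first][second] += 1
    yield from stripes.items()

def stripes_reducer(first, stripe_list):
    """Merge all stripes that share the same first word."""
    total = Counter()
    for stripe in stripe_list:
        total.update(stripe)
    return first, total

sentences = ["今日 は 晴れ", "今日 は 雨"]
combined = dict(in_mapper_combining(sentences))
stripes = dict(stripes_mapper(sentences))
```

Both variants trade mapper memory for fewer intermediate records, so the in-memory tables would need a size bound on large inputs.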
![Page 15: Deep Technologies about Kana Kanji Conversion](https://reader034.vdocument.in/reader034/viewer/2022042518/55794df5d8b42a31678b5289/html5/thumbnails/15.jpg)
Performance-Size Trade-off
[Figure: cross entropy (bits) and size (bytes) versus threshold, with regions labeled Mobile, PC and Cloud]
[Okuno+ 2011]
![Page 16: Deep Technologies about Kana Kanji Conversion](https://reader034.vdocument.in/reader034/viewer/2022042518/55794df5d8b42a31678b5289/html5/thumbnails/16.jpg)
Automatic Pronunciation Inference
![Page 17: Deep Technologies about Kana Kanji Conversion](https://reader034.vdocument.in/reader034/viewer/2022042518/55794df5d8b42a31678b5289/html5/thumbnails/17.jpg)
Pronunciation Inference
• A Japanese word has 1-3 pronunciations
• How to pronounce sentences or phrases?
• Basic approach:
– Word-based: combination of word pronunciations
– Character-based: combination of character pronunciations
![Page 18: Deep Technologies about Kana Kanji Conversion](https://reader034.vdocument.in/reader034/viewer/2022042518/55794df5d8b42a31678b5289/html5/thumbnails/18.jpg)
Mining Pronunciation via Hadoop
• Corpora contain (phrase, pronunciation) pairs
• Expressions like: 四季多彩(しきたさい)
• In English: Phrase (Pronunciation)
• Distributed grep by the regular expression:
“\p{InCJKUnifiedIdeographs}+(\p{InHiragana}+)”
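A single-machine sketch of the grep step might look like the following. Python's built-in `re` module does not support `\p{...}` block names, so explicit code point ranges for the CJK Unified Ideographs block (U+4E00 to U+9FFF) and the Hiragana block (U+3040 to U+309F) are substituted. Treating the parentheses as literal characters and accepting both half-width and full-width forms are assumptions made here to match the 四季多彩(しきたさい) example.

```python
import re

# Assumption: the parentheses in the corpus are literal, half- or full-width.
PAIR_PATTERN = re.compile(
    r"([\u4e00-\u9fff]+)"   # phrase: one or more kanji
    r"[(（]"                # opening parenthesis
    r"([\u3040-\u309f]+)"   # pronunciation: one or more hiragana
    r"[)）]"                # closing parenthesis
)

def extract_pairs(text):
    """Return every (phrase, pronunciation) pair found in the text."""
    return PAIR_PATTERN.findall(text)

pairs = extract_pairs("秋は四季多彩(しきたさい)で、西都原（さいとばる）も有名です。")
print(pairs)
```

In a distributed setting, `extract_pairs` would be the body of a mapper applied to each line of the crawled corpus.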
![Page 19: Deep Technologies about Kana Kanji Conversion](https://reader034.vdocument.in/reader034/viewer/2022042518/55794df5d8b42a31678b5289/html5/thumbnails/19.jpg)
Character Alignment Task
• Character Alignment for Noise Reduction
• Input: Pairs of Word and Pronunciation
四季多彩 しきたさい
西都原 さいとばる
iPhone あいふぉん
• Output: Aligned Pairs
四|季|多|彩| し|き|た|さい|
西|都|原| さい|と|ばる|
i|Ph|o|n|e| あい|ふ|ぉ|ん|_|
We can use an HMM and the EM algorithm
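To make the EM idea concrete, here is a deliberately simplified sketch. It is not a full HMM: it assumes each character aligns to exactly one non-empty kana chunk, scores an alignment as a product of independent P(chunk | character) terms, and enumerates all segmentations exhaustively. It therefore cannot produce the multi-character or empty alignments shown for iPhone. The training pairs other than 四季多彩 are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import prod

def segmentations(pron, n):
    """Yield every split of `pron` into n non-empty contiguous chunks."""
    for cuts in combinations(range(1, len(pron)), n - 1):
        bounds = (0, *cuts, len(pron))
        yield tuple(pron[a:b] for a, b in zip(bounds, bounds[1:]))

def score(word, seg, prob):
    """Probability of one alignment under the independence assumption."""
    return prod(prob[(ch, chunk)] for ch, chunk in zip(word, seg))

def train(pairs, iterations=10):
    """Estimate P(chunk | character) with EM over all segmentations."""
    prob = defaultdict(lambda: 1.0)  # uniform start
    for _ in range(iterations):
        counts, totals = defaultdict(float), defaultdict(float)
        for word, pron in pairs:
            cands = list(segmentations(pron, len(word)))
            scores = [score(word, seg, prob) for seg in cands]
            z = sum(scores)
            for seg, s in zip(cands, scores):  # E-step: expected counts
                for ch, chunk in zip(word, seg):
                    counts[(ch, chunk)] += s / z
                    totals[ch] += s / z
        # M-step: renormalize expected counts per character
        prob = defaultdict(float, {k: v / totals[k[0]] for k, v in counts.items()})
    return prob

def align(word, pron, prob):
    """Return the highest-scoring segmentation of the pronunciation."""
    return max(segmentations(pron, len(word)), key=lambda seg: score(word, seg, prob))

training_pairs = [("四季", "しき"), ("多彩", "たさい"), ("彩", "さい"), ("四季多彩", "しきたさい")]
model = train(training_pairs)
print("|".join(align("四季多彩", "しきたさい", model)))
```

Pairs whose best alignment still scores poorly could then be discarded as noise, which is the noise-reduction use the slide describes.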
![Page 20: Deep Technologies about Kana Kanji Conversion](https://reader034.vdocument.in/reader034/viewer/2022042518/55794df5d8b42a31678b5289/html5/thumbnails/20.jpg)
Data Compression
![Page 21: Deep Technologies about Kana Kanji Conversion](https://reader034.vdocument.in/reader034/viewer/2022042518/55794df5d8b42a31678b5289/html5/thumbnails/21.jpg)
Why Compression?
• Input methods (IMs) should leave memory for other apps
• Typically 50 MB for PC and 1-2 MB for mobile
• Compress data as much as possible!
• Solution: Succinct data structures
![Page 22: Deep Technologies about Kana Kanji Conversion](https://reader034.vdocument.in/reader034/viewer/2022042518/55794df5d8b42a31678b5289/html5/thumbnails/22.jpg)
LOUDS: Succinct Trie
• Use unary code to represent tree compactly
Nodes (breadth-first order): a b c d e f g h i
LOUDS bits: 10 11110 0 110 0 10 0 0 10 0
size = #nodes * 2 + 1 = 19 bits; an auxiliary index is also required
[Figure: example tree with nodes a to i]
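The bit string on this slide can be reproduced with a short breadth-first traversal. The tree shape used below (a has children b, c, d, e; c has f, g; e has h; h has i) is an assumption decoded from the bit string itself, since the drawing did not survive in the transcript. The leading `10` is the conventional super-root, which is why there are ten bit groups for nine nodes.

```python
from collections import deque

def louds_bits(tree, root):
    """Encode a tree as LOUDS: a super-root '10', then for each node in
    breadth-first order its child count in unary (k ones and a zero)."""
    groups = ["10"]
    queue = deque([root])
    while queue:
        node = queue.popleft()
        children = tree.get(node, [])
        groups.append("1" * len(children) + "0")
        queue.extend(children)
    return groups

# Assumed tree shape, decoded from the slide's bit string.
tree = {"a": ["b", "c", "d", "e"], "c": ["f", "g"], "e": ["h"], "h": ["i"]}
groups = louds_bits(tree, "a")
bits = "".join(groups)
print(" ".join(groups), len(bits))
```

Navigating parent and child links on this bit string relies on rank and select queries, which is the role of the auxiliary index mentioned on the slide.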
![Page 23: Deep Technologies about Kana Kanji Conversion](https://reader034.vdocument.in/reader034/viewer/2022042518/55794df5d8b42a31678b5289/html5/thumbnails/23.jpg)
MARISA: Nested Patricia Trie
• Merge non-branching nodes in the tree
[Yata+ 11]
[Figure: Normal Trie → Patricia Trie (apply recursively)]
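The merge step can be illustrated with nested dictionaries. This sketch covers only the Patricia compression of a plain trie; it does not implement MARISA's recursive nesting, in which the merged labels are themselves stored in another trie. The `$` end-of-word marker is a hypothetical convention that assumes no key contains that character.

```python
END = "$"  # hypothetical end-of-word marker

def build_trie(words):
    """Build a character-level trie as nested dicts."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = {}
    return root

def to_patricia(node):
    """Merge each chain of single-child nodes into one multi-character edge."""
    merged = {}
    for label, child in node.items():
        # Stop at a branch or at a node where a word ends.
        while len(child) == 1 and END not in child:
            (next_label, next_child), = child.items()
            label += next_label
            child = next_child
        merged[label] = to_patricia(child)
    return merged

patricia = to_patricia(build_trie(["おはよう", "おはようございます", "おやすみ"]))
print(patricia)
```

The chain おはよう collapses into the edge はよう under お, and ございます becomes a single edge below it, so far fewer nodes remain than in the character-level trie.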
![Page 24: Deep Technologies about Kana Kanji Conversion](https://reader034.vdocument.in/reader034/viewer/2022042518/55794df5d8b42a31678b5289/html5/thumbnails/24.jpg)
Other Functions
![Page 25: Deep Technologies about Kana Kanji Conversion](https://reader034.vdocument.in/reader034/viewer/2022042518/55794df5d8b42a31678b5289/html5/thumbnails/25.jpg)
Predictive Conversion
• Motivation: we want to save keystrokes
• Approach: show the most probable completion when users have typed the first few characters
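A minimal sketch of this approach ranks dictionary phrases that start with the typed prefix by probability. The phrase probabilities are hypothetical toy values, and the linear scan is a simplification; a production converter would more likely walk a trie such as the ones in the data compression section.

```python
# Hypothetical phrase probabilities for illustration only.
PHRASE_PROBS = {
    "おはよう": 0.5,
    "おはようございます": 0.3,
    "おやすみ": 0.2,
}

def predict(prefix, phrase_probs, k=3):
    """Return up to k phrases starting with `prefix`, most probable first."""
    matches = [(p, phrase) for phrase, p in phrase_probs.items() if phrase.startswith(prefix)]
    matches.sort(key=lambda item: (-item[0], item[1]))  # ties broken by phrase
    return [phrase for _, phrase in matches[:k]]

suggestions = predict("おは", PHRASE_PROBS)
print(suggestions)
```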
![Page 26: Deep Technologies about Kana Kanji Conversion](https://reader034.vdocument.in/reader034/viewer/2022042518/55794df5d8b42a31678b5289/html5/thumbnails/26.jpg)
Predictive Conversion
• Accuracy and length are a trade-off
• Phrase extraction is needed
– Eliminate candidates like とうございます ("you very much"), which is only a sub-sequence of a phrase
おはよう Good
おはようございます Good morning
![Page 27: Deep Technologies about Kana Kanji Conversion](https://reader034.vdocument.in/reader034/viewer/2022042518/55794df5d8b42a31678b5289/html5/thumbnails/27.jpg)
Phrase Extraction for Prediction
• A paper about phrase extraction is to appear
• Digest: fast and accurate phrase extraction
[Okuno+ 2011]
![Page 28: Deep Technologies about Kana Kanji Conversion](https://reader034.vdocument.in/reader034/viewer/2022042518/55794df5d8b42a31678b5289/html5/thumbnails/28.jpg)
Spelling Correction
• Correct users' typing mistakes
• Search: Trie for fuzzy match
• Model: Edit distance for error model
• Edit operations: insert, delete and replace
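The error model can be sketched with the standard dynamic-programming edit distance over the three operations listed above. The vocabulary, the `max_distance` threshold, and the linear scan in `correct` are illustrative assumptions; the slide's trie-based fuzzy search is not implemented here.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit-cost insert, delete and replace."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # replace, or match at no cost
            ))
        prev = cur
    return prev[-1]

def correct(typed, vocabulary, max_distance=1):
    """Return vocabulary entries within max_distance edits, closest first."""
    scored = sorted((edit_distance(typed, word), word) for word in vocabulary)
    return [word for dist, word in scored if dist <= max_distance]

candidates = correct("おはよお", ["おはよう", "おやすみ", "ありがとう"])
print(candidates)
```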
![Page 29: Deep Technologies about Kana Kanji Conversion](https://reader034.vdocument.in/reader034/viewer/2022042518/55794df5d8b42a31678b5289/html5/thumbnails/29.jpg)
Conclusion
![Page 30: Deep Technologies about Kana Kanji Conversion](https://reader034.vdocument.in/reader034/viewer/2022042518/55794df5d8b42a31678b5289/html5/thumbnails/30.jpg)
Conclusion
• Various technologies are needed
– Statistical language models and training
– Morphological analyzer, pronunciation inference
– Data compression and retrieval
– Predictive conversion and spelling correction