Deep Technologies in Kana Kanji Conversion

Yoh Okuno

Components of Converter

•  Model, Training, Storage, Interface etc.

[Figure: corpora are used to train the model (batch); the converter looks up the model, taking kana input from the user and returning kanji output.]

Deep Technologies

•  Various Language Models

•  Training LMs using the Web and Hadoop

•  Automatic Pronunciation Inference

•  Data Compression

•  Predictive Conversion and Spelling Correction

Various Language Models

•  Word N-gram

– Accurate but too large!

•  Class N-gram

– Small but inaccurate

•  Combination

– A good trade-off is needed
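A minimal sketch of the class N-gram idea for the bigram case, assuming the standard factorization P(w_i | w_{i-1}) ≈ P(c_i | c_{i-1}) · P(w_i | c_i). The toy vocabulary, class names, and probabilities are hypothetical and only illustrate why the class model is smaller: it needs roughly C² + V parameters instead of up to V².

```python
import math

# Hypothetical toy parameters, for illustration only (not from the slides).
WORD_CLASS = {"今日": "NOUN", "は": "PARTICLE", "晴れ": "NOUN"}
CLASS_BIGRAM = {            # P(c_i | c_{i-1})
    ("<s>", "NOUN"): 0.5,
    ("NOUN", "PARTICLE"): 0.6,
    ("PARTICLE", "NOUN"): 0.7,
}
EMISSION = {                # P(w_i | c_i)
    ("今日", "NOUN"): 0.01,
    ("は", "PARTICLE"): 0.2,
    ("晴れ", "NOUN"): 0.005,
}


def class_bigram_logprob(words):
    """Log-probability of a word sequence under a class bigram model."""
    prev_class = "<s>"
    total = 0.0
    for word in words:
        cls = WORD_CLASS[word]
        total += math.log(CLASS_BIGRAM[(prev_class, cls)])
        total += math.log(EMISSION[(word, cls)])
        prev_class = cls
    return total


def parameter_counts(vocab_size, num_classes):
    """Upper bounds on parameters: (word bigram, class bigram)."""
    return vocab_size ** 2, num_classes ** 2 + vocab_size
```

With a vocabulary of 100,000 words and 500 classes, `parameter_counts` gives 10 billion possible word-bigram parameters against 350,000 for the class bigram, which is the size gap the slide refers to.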

Phrase-based Model

•  Replace part of the class bigram with word N-grams

•  Intermediate classes are marginalized out

Phrase probability: P(w1, w2, w3, c3 | c1)

Only the left-side class is conditioned on!

[Figure legend: classes and words]

Training Large-Scale Language Models

Issues about Training

•  How to collect large corpora?

– Crawl, crawl, crawl!

– A morphological analyzer is needed

•  How to store and process them?

– Hadoop MapReduce helps us

– How to speed up N-gram counting?

Crawling the Web

•  Raw HTML can be collected from the Web

•  Statistics have no copyright

•  Required components:

– Web crawler

– Body text extraction

–  (Spam filter)

– Morphological analyzer

Make use of the cloud

Japanese Morphological Analyzer

•  Input: raw text

•  Output: segmented words, part-of-speech tags, ...

MapReduce for Language Model

•  Distributed computing of N-gram statistics

[Figure: corpora are split across several mappers, whose output is aggregated by reducers into N-grams.]

Mapper: extract N-grams from corpora

Reducer: aggregate N-gram counts

MapReduce: Pseudo Code
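A minimal single-process sketch of the mapper and reducer roles described above, assuming the corpus is already segmented into space-separated words. It simulates the shuffle step with a sort and does not use the Hadoop API; the function names are illustrative.

```python
from itertools import groupby


def mapper(sentence, n=2):
    """Emit (ngram, 1) for every N-gram in one segmented sentence."""
    words = sentence.split()
    for i in range(len(words) - n + 1):
        yield tuple(words[i:i + n]), 1


def reducer(ngram, counts):
    """Aggregate all partial counts for one N-gram."""
    return ngram, sum(counts)


def run_job(corpus, n=2):
    """Simulate map -> shuffle (sort by key) -> reduce in one process."""
    pairs = sorted(kv for sentence in corpus for kv in mapper(sentence, n))
    return dict(
        reducer(ngram, (count for _, count in group))
        for ngram, group in groupby(pairs, key=lambda kv: kv[0])
    )
```

For example, `run_job(["a b a b", "a b c"])` counts the bigram `("a", "b")` three times and the others once each.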

Speeding up N-gram Counting

•  Use a binary representation for N-grams

– Variable-length word IDs are efficient

•  Use in-mapper combining (Jimmy Lin)

– Combining in memory is more efficient

•  Use the stripes pattern (Jimmy Lin)

– Group N-grams by first word
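A rough sketch of how in-mapper combining and the stripes pattern might fit together, again as plain Python rather than Hadoop code. The class and function names are hypothetical, and the binary N-gram encoding is not shown. The mapper buffers counts in memory, grouped by the first word of each N-gram, and emits one stripe per first word only when it is closed.

```python
from collections import Counter, defaultdict


class CombiningMapper:
    """Buffers counts in memory and emits them once, at close time."""

    def __init__(self, n=2):
        self.n = n
        # Stripe layout: first word -> Counter over the remaining words.
        self.stripes = defaultdict(Counter)

    def map(self, sentence):
        words = sentence.split()
        for i in range(len(words) - self.n + 1):
            first = words[i]
            rest = tuple(words[i + 1:i + self.n])
            self.stripes[first][rest] += 1   # combine in memory, emit nothing

    def close(self):
        """Emit one (first word, stripe) pair per distinct first word."""
        for first, stripe in self.stripes.items():
            yield first, dict(stripe)


def reduce_stripes(first, stripes):
    """Merge the stripes that different mappers emitted for one first word."""
    total = Counter()
    for stripe in stripes:
        total.update(stripe)
    return first, dict(total)
```

The trade-off is memory: the buffer must fit in the mapper's heap, so a real job would flush it when it grows too large.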

Performance-Size Trade-off

[Figure: cross entropy (bits) and size (bytes) plotted against the threshold; labels: Mobile, PC, Cloud.]

[Okuno+ 2011]

Automatic Pronunciation Inference

Pronunciation Inference

•  A Japanese word has 1-3 pronunciations

•  How to pronounce sentences or phrases?

•  Basic approach:

– Word-based: combination of word pronunciations

– Character-based: combination of character pronunciations

Mining Pronunciation via Hadoop

•  Corpora contain (phrase, pronunciation) pairs

•  Expressions like: 四季多彩（しきたさい）

•  In English: Phrase (Pronunciation)

•  Distributed grep with the regular expression:

“\p{InCJKUnifiedIdeographs}+（\p{InHiragana}+）”
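A minimal sketch of the same pattern in Python, for a single document rather than a distributed grep. Python's `re` module does not support `\p{In...}` block names, so this assumes explicit code point ranges instead: U+4E00 to U+9FFF for CJK Unified Ideographs and U+3040 to U+309F for Hiragana. The function name is illustrative.

```python
import re

# Kanji run, full-width "（", hiragana run, full-width "）".
PAIR_PATTERN = re.compile(r"([\u4E00-\u9FFF]+)（([\u3040-\u309F]+)）")


def extract_pairs(text):
    """Return (phrase, pronunciation) pairs found in the text."""
    return PAIR_PATTERN.findall(text)
```

Because the kanji run must directly precede the opening parenthesis, text such as `彼は四季多彩（しきたさい）` yields only the pair for 四季多彩. Noisy pairs are still possible, which is what the alignment step is meant to filter.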

Character Alignment Task

•  Character alignment for noise reduction

•  Input: pairs of word and pronunciation

•  Output: aligned pairs

Input:

四季多彩  しきたさい

西都原  さいとばる

iPhone  あいふぉん

Output:

四|季|多|彩|  し|き|た|さい|

西|都|原|  さい|と|ばる|

i|Ph|o|n|e|  あい|ふ|ぉ|ん|_|

We can use an HMM and the EM algorithm

Data Compression

Why Compression?

•  Input methods (IMs) should leave memory for other apps

•  Typically 50 MB for PC and 1-2 MB for mobile

•  Compress data as much as possible!

•  Solution: Succinct data structures

LOUDS: Succinct Trie

•  Use unary code to represent the tree compactly

[Figure: a tree with nodes a to i and its LOUDS bit sequence.]

Nodes (level order): a b c d e f g h i

Bits: 10 11110 0 110 0 10 0 0 10 0

Size = #nodes * 2 + 1 = 19 bits; an auxiliary index is required in addition
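A minimal sketch of how the LOUDS bit sequence can be built and navigated. The tree shape is an assumption read off the bit sequence above (a has children b, c, d, e; c has f, g; e has h; h has i). The `rank1` and `select0` helpers here are naive linear scans; a real implementation would use the auxiliary index to answer them in constant time.

```python
from collections import deque

# Assumed tree shape, inferred from the bit sequence on the slide.
TREE = {"a": ["b", "c", "d", "e"], "c": ["f", "g"], "e": ["h"], "h": ["i"]}


def build_louds(tree, root):
    """Return (bits, labels): unary-coded child counts in level order."""
    bits = [1, 0]                 # super-root with a single child (the root)
    labels = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        labels.append(node)
        kids = tree.get(node, [])
        bits.extend([1] * len(kids))   # one 1 per child ...
        bits.append(0)                 # ... terminated by a 0
        queue.extend(kids)
    return bits, labels


def select0(bits, k):
    """Position of the k-th 0 bit (k is 1-based)."""
    seen = 0
    for pos, bit in enumerate(bits):
        if bit == 0:
            seen += 1
            if seen == k:
                return pos
    raise ValueError("fewer than k zero bits")


def rank1(bits, pos):
    """Number of 1 bits in bits[0..pos], inclusive."""
    return sum(bits[:pos + 1])


def children(bits, k):
    """Level-order numbers (1-based) of the children of node k."""
    pos = select0(bits, k) + 1
    result = []
    while pos < len(bits) and bits[pos] == 1:
        result.append(rank1(bits, pos))
        pos += 1
    return result
```

Node numbers are 1-based positions in level order, so `children(bits, 1)` lists the children of a, and `labels[k - 1]` recovers the label of node k.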

MARISA: Nested Patricia Trie

•  Merge non-branching nodes in the tree

[Figure: a normal trie and the corresponding Patricia trie (apply recursively).]

[Yata+ 11]
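A small sketch of the merge step only: it collapses chains of single-child nodes in a nested-dict trie into one edge with a multi-character label. This is not the MARISA implementation or its API. The `"$"` end-of-word marker is an assumption of this sketch, so it will not work for keys that contain `"$"`.

```python
END = "$"  # assumed end-of-word marker


def build_trie(words):
    """Build a plain character trie as nested dicts."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = {}
    return root


def compress(node):
    """Merge non-branching chains into single edges (Patricia-style)."""
    out = {}
    for label, child in node.items():
        # Follow the chain while the child has exactly one outgoing edge
        # and no word ends at it.
        while len(child) == 1 and END not in child:
            (next_label, next_child), = child.items()
            label += next_label
            child = next_child
        out[label] = compress(child)
    return out
```

For the keys `tea`, `team`, and `ten`, the edges `t` and `e` are merged into a single `te` edge, while the branching node below it is kept.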

Other Functions

Predictive Conversion

•  Motivation: we want to save keystrokes

•  Approach: show the most probable completion when users have typed the first few characters
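A minimal sketch of this approach, assuming a table of phrase probabilities is already available. The probabilities below are hypothetical, and a real converter would look prefixes up in a trie rather than a precomputed dict.

```python
def build_index(phrase_probs):
    """Map every prefix to its most probable completion."""
    best = {}
    for phrase, prob in phrase_probs.items():
        for i in range(1, len(phrase) + 1):
            prefix = phrase[:i]
            if prefix not in best or prob > best[prefix][1]:
                best[prefix] = (phrase, prob)
    return best


def predict(best, typed):
    """Return the most probable completion of what the user typed, if any."""
    hit = best.get(typed)
    return hit[0] if hit else None


# Hypothetical probabilities, for illustration only.
PHRASES = {"おはよう": 0.3, "おはようございます": 0.5, "おやすみ": 0.2}
INDEX = build_index(PHRASES)
```

Here `predict(INDEX, "おは")` returns the longer phrase because it has the higher probability in this toy table.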

Predictive Conversion

•  Accuracy and length trade off against each other

•  Phrase extraction is needed

– Eliminate candidates like とうございます ("you very much"), which is only a sub-sequence of a phrase

おはよう: Good

おはようございます: Good morning

Phrase Extraction for Prediction

•  A paper on phrase extraction is to appear

•  Digest: fast and accurate phrase extraction

[Okuno+ 2011]

Spelling Correction

•  Correct users' typing mistakes

•  Search: trie for fuzzy matching

•  Model: edit distance as the error model

•  Edit operations: insert, delete, and replace
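A minimal sketch of fuzzy matching over a trie with edit distance as the error model. It assumes unit cost for insert, delete, and replace; a real error model would likely weight the operations. One row of the edit-distance table is computed per trie node, and a branch is pruned as soon as every entry in its row exceeds the allowed distance.

```python
def build_trie(words):
    """Nested-dict trie; the key None stores the word that ends at a node."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = word
    return root


def fuzzy_search(trie, query, max_dist):
    """Return (word, distance) pairs within max_dist edits of the query."""
    results = []

    def walk(node, prev_row):
        if None in node and prev_row[-1] <= max_dist:
            results.append((node[None], prev_row[-1]))
        for ch, child in node.items():
            if ch is None:
                continue
            row = [prev_row[0] + 1]
            for j in range(1, len(query) + 1):
                row.append(min(
                    row[j - 1] + 1,                          # insert
                    prev_row[j] + 1,                         # delete
                    prev_row[j - 1] + (query[j - 1] != ch),  # replace or match
                ))
            if min(row) <= max_dist:   # prune hopeless branches
                walk(child, row)

    walk(trie, list(range(len(query) + 1)))
    return sorted(results, key=lambda r: (r[1], r[0]))
```

For example, searching a dictionary containing ありがとう for the mistyped ありがとお with `max_dist=1` returns ありがとう at distance 1.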

Conclusion

•  Various technologies are needed

– Statistical language models and training

– Morphological analyzer, pronunciation inference

– Data compression and retrieval

– Predictive conversion and spelling correction
