1
A preliminary study on unknown word problem in Chinese word segmentation
Authors: Ming-Yu Lin, Tung-Hui Chiang, Keh-Yih Su
Speaker: Jbc
2
Abstract Unknown words are the main factor affecting the performance of word segmentation (WS). To handle unknown words, this paper proposes two approaches:
Morphological rules: handle regular unknown words.
Statistical model: handles irregular unknown words.
3
Outline
Introduction
System architecture
Overview of the baseline model
The morphological analysis
Tagging part of speech
Unknown word modeling
4
5
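Introduction-(1) Word:
The basic unit of many Chinese processing tasks. Chinese text has no explicit word boundaries, which makes words hard to delimit.
Unknown words strongly affect WS. Classification of unknown words:
Regular: e.g., times and dates (11:50, 11/12), reduplication.
Irregular: e.g., proper names, compound nouns.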
6
7
Introduction-(2) Strategies for the different types of unknown words:
Regular: identified with morphological rules.
Irregular: identified with a statistical model.
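As a rough illustration of what rule-based recognition of regular unknown words could look like, the sketch below matches times, dates, and simple AA reduplication with regular expressions. The three patterns and the function name `find_regular_unknown_words` are hypothetical; they are not the paper's morphological rules, which this transcript does not reproduce.

```python
import re

# Hypothetical patterns for illustration only; NOT the paper's actual rules.
RULES = [
    ("time", re.compile(r"\d{1,2}:\d{2}")),                 # e.g. 11:50
    ("date", re.compile(r"\d{1,2}/\d{1,2}")),               # e.g. 11/12
    ("reduplication", re.compile(r"([\u4e00-\u9fff])\1")),  # AA pattern, e.g. 看看
]


def find_regular_unknown_words(text):
    """Return (start, end, label) spans matched by any rule, in text order."""
    spans = []
    for label, pattern in RULES:
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)
```

Each matched span could then be treated as a single word candidate by the segmenter. Note that a pattern this naive would also fire on lexicon words that happen to be reduplicated, so a real rule set would need ordering and filtering.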
8
System Architecture-(1)
10
System Architecture-(2) Lexicon:
89,590 entries. 49 tags.
Characters per word    Number of entries
1                      1,734
2                      35,492
3                      19,650
4                      24,054
5                      6,140
6                      2,020
>=7                    500
Total                  89,590
11
System Architecture-(3) Morphological Rules:
17 rules (listed in Appendix A at the end).
Corpus:
12
Morphological Rules
13
Statistics of Corpora
14
Overview of the Baseline Model-(1) The baseline model:
15
Overview of the Baseline Model-(2) Baseline vs. Max match:
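Maximum matching, the comparison point on this slide, is commonly implemented as a greedy longest-match scan over a lexicon. The sketch below shows the forward variant; the function name `max_match`, the `max_len` cutoff, and the toy lexicons in the usage note are assumptions for illustration. The baseline model itself is not spelled out in this transcript, so it is not sketched here.

```python
def max_match(text, lexicon, max_len=4):
    """Forward maximum matching: at each position take the longest lexicon word.

    Falls back to a single character when no lexicon entry matches, which is
    how unknown words end up split into pieces.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words
```

For example, with the toy lexicon `{"一", "個人"}` the string `一個人` comes out as `["一", "個人"]`, and with a lexicon containing `轉換` but not `轉換器` the string `轉換器` comes out as `["轉換", "器"]`.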
16
17
Overview of the Baseline Model-(3) Two error patterns:
s_ns (mis-combined error): Ex. | 一 | 個 | 人 | → | 一 | 個人 |
ns_s (over-segmentation error): Ex. | 轉換器 | → | 轉換 | 器 |
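One way to count the two error patterns is to compare word-boundary positions between a reference segmentation and a system output. The sketch below assumes that the first segmentation in each example is the reference and the second is the system output, and that errors are counted per boundary; neither assumption is confirmed by the transcript, and the names `boundaries` and `count_errors` are hypothetical.

```python
def boundaries(words):
    """Internal boundary offsets (in characters) of a segmentation."""
    cuts, pos = set(), 0
    for word in words[:-1]:
        pos += len(word)
        cuts.add(pos)
    return cuts


def count_errors(reference, output):
    """Count s_ns and ns_s errors, one per mismatched boundary (an assumption).

    s_ns: a boundary in the reference that the output dropped (mis-combined).
    ns_s: a boundary in the output that the reference lacks (over-segmented).
    """
    if "".join(reference) != "".join(output):
        raise ValueError("segmentations must cover the same text")
    ref, out = boundaries(reference), boundaries(output)
    return {"s_ns": len(ref - out), "ns_s": len(out - ref)}
```

Under these assumptions, `count_errors(["一", "個", "人"], ["一", "個人"])` reports one s_ns error, and `count_errors(["轉換器"], ["轉換", "器"])` reports one ns_s error.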
18
Statistics of Error Patterns
19
The Morphological Analysis-(1) This paper proposes using morphological rules to find regular unknown words.
Rule ordering:
Using the SFS (sequential forward selection) procedure.
Cost = wr * (1-Pr) + wp * (1-Pp)
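The rule-ordering step can be sketched as a greedy loop that repeatedly appends whichever remaining rule gives the lowest cost. This is a minimal sketch under several assumptions: Pr and Pp are read as recall and precision with weights wr and wp, `evaluate` is a hypothetical caller-supplied function that returns `(recall, precision)` for an ordered rule list, and the loop runs until every rule is placed, since the transcript does not say whether the paper stops early.

```python
def cost(recall, precision, w_r=0.5, w_p=0.5):
    """Cost = wr * (1 - Pr) + wp * (1 - Pp), reading Pr/Pp as recall/precision."""
    return w_r * (1 - recall) + w_p * (1 - precision)


def sfs_order(rules, evaluate, w_r=0.5, w_p=0.5):
    """Order rules by sequential forward selection.

    `evaluate(ordered_rules)` is a hypothetical callback that returns
    (recall, precision) of the segmenter when the rules are applied in order.
    """
    ordered, remaining = [], list(rules)
    while remaining:
        def cost_with(rule):
            recall, precision = evaluate(ordered + [rule])
            return cost(recall, precision, w_r, w_p)

        # Greedily append the rule that yields the lowest cost at this step.
        best = min(remaining, key=cost_with)
        ordered.append(best)
        remaining.remove(best)
    return ordered
```

Because the selection is greedy, the resulting order is not guaranteed to be globally optimal; it only ensures that each step picks the locally best next rule.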
20
The Morphological Analysis-(2)
21
The Morphological Analysis-(3) Baseline model + morphological rules:
22
The Morphological Analysis-(4) Improvement in s_ns and ns_s errors after applying the morphological rules:
23
Tagging part of speech-(1)
24
Tagging part of speech-(2)
25
Tagging part of speech-(3)
26
Tagging part of speech-(4)
27
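Unknown word modeling-(1) 5 unknown word categories:
Words that should be added to the dictionary. Ex: 爭議
Words that should be covered by morphological rules. Ex: 牛肝, 牛心
Abbreviations. Ex: 國大
Proper nouns. Ex: 胡適
Others (e.g., misprinted words such as 吩付, and words not found in the dictionary).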
28
Unknown word modeling-(2) An unknown word model is used to find irregular unknown words.
Determine whether an unknown word exists in the predicted region.
If so, identify which span is the unknown word.
29
Unknown word modeling-(3) Determining whether an unknown word exists:
30
Unknown word modeling-(4) Identifying which span it is:
31
Result-(1)
32
Result-(2)