A preliminary study on unknown word problem in Chinese word segmentation
Authors: Ming-Yu Lin, Tung-Hui Chiang, Keh-Yih Su
Speaker: Jbc

Upload: pearl-carpenter

Post on 03-Jan-2016



Abstract
Unknown words are the main factor affecting the performance of word segmentation (WS).
To handle unknown words, this paper proposes two approaches:
  Morphological rules: handle the regular unknown words.
  Statistical model: handle the irregular unknown words.


Outline
  Introduction
  System architecture
  Overview of the baseline model
  The morphological analysis
  Tagging part of speech
  Unknown word modeling


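Introduction-(1)
Word:
  The basic unit of many Chinese language processing tasks.
  Chinese text has no explicit word boundaries, which makes words hard to delimit.
Unknown words affect WS considerably.
Categories of unknown words:
  Regular: e.g., times and dates (11:50, 11/12), reduplication.
  Irregular: e.g., proper names, compound nouns.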


Introduction-(2)
Strategies for the different types of unknown words:
  Regular: identified with morphological rules.
  Irregular: identified with a statistical model.
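To make the "regular" case concrete, the sketch below encodes three of the example types from the previous slide (times, dates, reduplication) as regular expressions. These patterns are illustrative assumptions, not the 17 rules of the paper's Appendix A; the names `RULES` and `match_regular` are hypothetical.

```python
import re

# Hypothetical patterns for regular unknown-word types named in the slides.
# They are NOT the paper's morphological rules, only an illustration.
CJK = r"[\u4e00-\u9fff]"
RULES = [
    ("time", re.compile(r"\d{1,2}:\d{2}")),                      # e.g. 11:50
    ("date", re.compile(r"\d{1,2}/\d{1,2}")),                    # e.g. 11/12
    ("reduplication", re.compile(rf"({CJK})\1({CJK})\2")),       # AABB, e.g. 高高興興
]


def match_regular(text, start=0):
    """Return (label, end) for the longest rule match beginning at `start`, or None."""
    best = None
    for label, pattern in RULES:
        m = pattern.match(text, start)
        if m and (best is None or m.end() > best[1]):
            best = (label, m.end())
    return best


print(match_regular("11:50"))      # ('time', 5)
print(match_regular("高高興興"))    # ('reduplication', 4)
print(match_regular("中文"))        # None
```

A segmenter could call `match_regular` at each position before consulting the lexicon, so that rule-matched spans are treated as single words.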


System Architecture-(1)


System Architecture-(2)
Lexicon: 89,590 entries, 49 tags.

# of characters / word    # of entries
1                          1,734
2                         35,492
3                         19,650
4                         24,054
5                          6,140
6                          2,020
>=7                          500
Total                     89,590


System Architecture-(3)
Morphological Rules: 17 rules (listed in Appendix A at the end).
Corpus:


Morphological Rules


Statistics of Corpora


Overview of the Baseline Model-(1) The baseline model:


Overview of the Baseline Model-(2) Baseline vs. Max match:
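As a reminder of what the comparison point does, here is a minimal sketch of forward maximum matching, assuming that is the variant meant by "Max match". The toy lexicon and the `max_len` cap are illustrative assumptions, not values taken from the paper.

```python
def max_match(text, lexicon, max_len=7):
    """Greedy forward maximum matching against a word list.

    At each position, take the longest lexicon entry that starts there;
    if none matches, emit a single character.
    """
    words = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


# Toy lexicon (assumption): a handful of entries for demonstration only.
toy_lexicon = {"一", "個", "人", "個人", "轉換", "轉換器"}
print(max_match("轉換器", toy_lexicon))  # ['轉換器']
print(max_match("一個人", toy_lexicon))  # ['一', '個人']
```

Because the procedure is greedy and ignores context, it always prefers the longer lexicon entry, which is one way mis-combined segmentations can arise.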


Overview of the Baseline Model-(3)
Two error patterns:
  s_ns (mis-combined error): Ex. | 一 | 個 | 人 | → | 一 | 個人 |
  ns_s (over-segmentation error): Ex. | 轉換器 | → | 轉換 | 器 |
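The two error patterns can be detected mechanically by comparing word boundaries in a reference segmentation with those in the system output. The sketch below is one way to do that, assuming both segmentations cover the same character string; the function names are hypothetical and the paper's own counting procedure may differ.

```python
def spans(words):
    """Convert a word list into (start, end) character offsets."""
    out, pos = [], 0
    for w in words:
        out.append((pos, pos + len(w)))
        pos += len(w)
    return out


def inner_cuts(words):
    """Offsets at which one word ends and the next begins."""
    return {end for _, end in spans(words)[:-1]}


def classify_errors(reference, output):
    """Return system words that merge reference words (s_ns) and
    reference words that the system split apart (ns_s)."""
    text = "".join(reference)
    if text != "".join(output):
        raise ValueError("segmentations must cover the same text")
    ref_cuts, out_cuts = inner_cuts(reference), inner_cuts(output)
    mis_combined = [text[s:e] for s, e in spans(output)
                    if any(s < c < e for c in ref_cuts)]
    over_segmented = [text[s:e] for s, e in spans(reference)
                      if any(s < c < e for c in out_cuts)]
    return {"s_ns": mis_combined, "ns_s": over_segmented}


print(classify_errors(["一", "個", "人"], ["一", "個人"]))
# {'s_ns': ['個人'], 'ns_s': []}
print(classify_errors(["轉換器"], ["轉換", "器"]))
# {'s_ns': [], 'ns_s': ['轉換器']}
```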


Statistics of Error Patterns


The Morphological Analysis-(1)
This paper proposes using morphological rules to identify the regular unknown words.
Rule ordering:
  Uses the SFS (sequential forward selection) procedure.
  Cost = wr * (1 - Pr) + wp * (1 - Pp)
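A sketch of how sequential forward selection could order rules under this cost follows. It assumes Pr and Pp are recall and precision and wr, wp their weights, which the slide does not spell out. The `evaluate` callback, the stopping rule (stop when the cost no longer improves), and the toy rule statistics are all hypothetical.

```python
def cost(recall, precision, w_r=0.5, w_p=0.5):
    """Cost = wr * (1 - Pr) + wp * (1 - Pp), reading Pr/Pp as recall/precision."""
    return w_r * (1 - recall) + w_p * (1 - precision)


def sfs_order(rules, evaluate, w_r=0.5, w_p=0.5):
    """Greedily append the rule that lowers the cost most.

    `evaluate(rule_sequence)` must return (recall, precision) for that sequence.
    Stops when no remaining rule improves the cost (an assumed stopping rule).
    """
    selected, remaining = [], list(rules)
    best_cost = float("inf")
    while remaining:
        scored = [(cost(*evaluate(selected + [r]), w_r, w_p), r) for r in remaining]
        step_cost, step_rule = min(scored, key=lambda pair: pair[0])
        if step_cost >= best_cost:
            break
        selected.append(step_rule)
        remaining.remove(step_rule)
        best_cost = step_cost
    return selected, best_cost


# Toy statistics (made up): true/false positives per rule, 10 gold unknown words.
TOY = {"time": (4, 0), "date": (3, 1), "noisy": (1, 8)}
GOLD = 10


def toy_evaluate(sequence):
    tp = sum(TOY[r][0] for r in sequence)
    fp = sum(TOY[r][1] for r in sequence)
    recall = tp / GOLD
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


order, final_cost = sfs_order(TOY, toy_evaluate)
print(order)  # ['time', 'date']
```

In this toy run the "noisy" rule is never selected because adding it would raise the cost through its low precision.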


The Morphological Analysis-(2)


The Morphological Analysis-(3)
Baseline model + morphological rules:


The Morphological Analysis-(4)
Improvement in s_ns and ns_s errors after applying the morphological rules:


Tagging part of speech-(1)


Tagging part of speech-(2)


Tagging part of speech-(3)


Tagging part of speech-(4)


Unknown word modeling-(1)
5 unknown word categories:
  Words that should be added to the dictionary. Ex: 爭議
  Words that should be covered by morphological rules. Ex: 牛肝, 牛心
  Abbreviations. Ex: 國大
  Proper names. Ex: 胡適
  Others (e.g., misprinted words that are not in the dictionary. Ex: 吩付)


Unknown word modeling-(2)
The unknown word model is used to find the irregular unknown words:
  Determine whether an unknown word exists in the predicted region.
  If so, find which chunk is the unknown word.


Unknown word modeling-(3)
Determining whether an unknown word exists:


Unknown word modeling-(4)
Determining which chunk is the unknown word:


Result-(1)


Result-(2)