Statistical Properties of Overlapping Ambiguities in Chinese Word Segmentation and a Strategy for Their Disambiguation

Wei Qiao, Maosong Sun and Wolfgang Menzel
State Key Lab of Intelligent Tech. & Sys., Tsinghua University
Department of Informatics, Hamburg University




Part Ⅰ

Background


Introduction

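Chinese word segmentation

Combination ambiguity
  火把 (torch) vs. 火 (fire) 把 (make)

Overlapping ambiguity
  a. 先解决其主要问题,再解决其次要问题 (First solve its main problems, then solve its secondary problems): 其 次要 (the subordinate)
  b. 首先要关注整体,其次要注意细节 (First attend to the whole, secondly attend to the details): 其次 要 (secondly we should)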


Related Terms

Overlapping ambiguity string (OAS)
  Length; Order; Intersection length; Structure

Maximal overlapping ambiguity string (MOAS)

True / pseudo ambiguity MOAS
  e.g. 其次要 (TM): 其次 要 & 其 次要
  e.g. 部长篇小说 (PM): 部 (measure word) 长篇小说

[Figure: character positions 0, 1, 2, 3 with overlapping word spans 0-2 and 1-3, labeled order 1 and order 2]
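The terms above can be illustrated with a small sketch. It is not the authors' extraction procedure: it assumes a simplified reading in which two dictionary words "overlap" when their character spans cross (neither contains the other), and it reports each maximal chain of crossing words as one MOAS. The function name `find_moas`, the `max_word_len` parameter and the toy lexicon are hypothetical.

```python
def find_moas(text, lexicon, max_word_len=4):
    """Return maximal overlapping ambiguity strings found in `text`.

    Simplifying assumption: two words overlap when their [start, end)
    spans cross, i.e. s1 < s2 < e1 < e2. Chains of crossing words are
    merged into one string.
    """
    # Every occurrence of a multi-character lexicon word, as a span.
    spans = [(i, j)
             for i in range(len(text))
             for j in range(i + 2, min(len(text), i + max_word_len) + 1)
             if text[i:j] in lexicon]

    # Union-find over spans: crossing spans end up in the same group.
    parent = list(range(len(spans)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, (s1, e1) in enumerate(spans):
        for b, (s2, e2) in enumerate(spans):
            if s1 < s2 < e1 < e2:
                parent[find(a)] = find(b)

    # For each group keep the leftmost start, rightmost end and word count.
    groups = {}
    for idx, (s, e) in enumerate(spans):
        root = find(idx)
        lo, hi, n = groups.get(root, (s, e, 0))
        groups[root] = (min(lo, s), max(hi, e), n + 1)

    # A group needs at least two crossing words to be ambiguous.
    return [text[lo:hi] for lo, hi, n in sorted(groups.values()) if n >= 2]


toy_lexicon = {"其次", "次要", "部长", "长篇", "长篇小说", "小说", "解决", "问题"}
print(find_moas("再解决其次要问题", toy_lexicon))
print(find_moas("这部长篇小说", toy_lexicon))
```

With the toy lexicon, the two calls pick out 其次要 and 部长篇小说, the TM and PM examples on this slide. Whether a string is true or pseudo cannot be decided by this sketch; that judgment depends on how the string is actually segmented in context.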


Previous Work

[Sun et al., 1999]
  100 million characters
  A core set of MOAS is found

[Li et al., 2003]
  650 million characters
  A similar method is used to improve the performance of a segmenter


Motivation

Two basic issues remain unsolved in their work:
  Only news data was included, so the results need further validation
  The core of pseudo OA strings remains to be determined, both for general-purpose and for domain-specific use


Part Ⅱ

Statistical Properties of MOAS

From General Corpus
From Domain-specific Corpus


From General Corpus

Data set
  CBC: 929,963,468 characters
  Rich in content (from the 1920s), covering many categories such as novels, essays, news, ...

Chinese word list
  Peking University, with 74,191 entries

A total of 733,066 distinct MOAS types are found automatically in CBC


Detailed Distribution
Perspective 1: Length

From General Corpus


Perspective 2: Order

From General Corpus


Perspective 3: Intersection Length

From General Corpus


Perspective 4: Structure distribution

From General Corpus


From General Corpus

Top N frequent MOAS (core candidates)
  3,500 ~ 50.78%
  7,000 ~ 60.43%
  40,000 ~ 80.39%
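A minimal sketch of how a top-N figure of this kind could be computed, assuming the percentages are the share of all MOAS occurrences (tokens) accounted for by the N most frequent MOAS types. The function name `top_n_coverage` and the toy counts are hypothetical and are not taken from CBC.

```python
from collections import Counter


def top_n_coverage(moas_counts, n):
    """Share of all MOAS tokens covered by the n most frequent MOAS types."""
    total = sum(moas_counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in moas_counts.most_common(n))
    return covered / total


# Toy frequency table (invented numbers, for illustration only).
toy_counts = Counter({"其次要": 6, "部长篇小说": 3, "出国门": 1})
for n in (1, 2, 3):
    print(n, f"{top_n_coverage(toy_counts, n):.2%}")
```

Because frequencies of this kind are typically very skewed, a small N can cover a large share of tokens, which is what makes a frequent core set attractive.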


From General Corpus

Stability vs. corpus size

[Figures: number of MOAS vs. corpus size; number of top N MOAS vs. corpus size (top 7000)]


From General Corpus

Pseudo MOAS detection
  Relax the definition of "pseudo"
  E.g. "出国门": 出 国门 (go abroad) in almost all cases; 出国 门 (the way to go abroad) with small probability

5,507 PM and 1,439 TM judged by hand

Token coverage of PM and TM over CBC


From Domain-specific Corpora

Domain-specific corpora
  Ency55: 90.02 million characters
  Web55: 54.97 million characters

Common Parts


Frequent MOAS Coverage in Domain Specific Corpora (N=3,500)

From Domain-specific Corpora


From Domain-specific Corpora

Frequent MOAS Coverage in Domain Specific Corpora (N=7,000)


From Domain-specific Corpora

Frequent MOAS Coverage in Domain Specific Corpora (N=40,000)


From Domain-specific Corpora

PM and TM distribution over Domain Corpora

42% of the overlapping ambiguities in any Chinese text can be resolved with 100% correctness.


Part Ⅲ

Disambiguation


Disambiguation Method

Current performance on OA

Performance of ICTCLAS1.0 (http://www.nlp.org.cn) on OAs
  e.g. 公安局 长 是 主管 这一 事故 的
  The police chief (公安 局长) is the person who is in charge of this accident.

Performance of MSR-Seg1.0 (http://research.microsoft.com/-S-MSRSeg) on OAs
  e.g. 核电站的特殊性 质
  The special properties (特殊 性质) of the nuclear power station


Disambiguation Method

Performance of CRF-based [Lafferty 2001] CWS on OAs
  e.g. 这一 现状 先 天地 决定 了 他们 的 使命
  This situation congenitally (先天 地) determines their mission

About 2% of OAS are mistakenly segmented, so fixing them is a net gain


Disambiguation Method

Individual-based method
  Simple table lookup: record the PMs and their correct segmentations in a table

Advantages
  Satisfactory token coverage of MOASs
  Full correctness for segmentation of pseudo MOASs
  Low cost in time and space complexity
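The table lookup can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `pm_table`, `segment_with_pm_table` and the `fallback` segmenter are hypothetical names, the table holds a single toy entry, and the sketch matches PM strings greedily (longest first) without checking that the matched string is really a maximal OAS in its context.

```python
def segment_with_pm_table(text, pm_table, fallback):
    """Segment `text`, using fixed segmentations for known pseudo MOASs.

    pm_table: dict mapping a pseudo MOAS string to its list of words.
    fallback: any callable that segments the remaining stretches of text.
    """
    max_len = max(map(len, pm_table), default=0)
    words = []
    i = 0
    pending_start = 0  # start of the stretch not yet handed to fallback
    while i < len(text):
        match = None
        # Try the longest candidate first, down to two characters.
        for n in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + n] in pm_table:
                match = text[i:i + n]
                break
        if match is None:
            i += 1
            continue
        if pending_start < i:
            words.extend(fallback(text[pending_start:i]))
        words.extend(pm_table[match])  # stored, known-correct segmentation
        i += len(match)
        pending_start = i
    if pending_start < len(text):
        words.extend(fallback(text[pending_start:]))
    return words


# Toy table with the PM example from the Related Terms slide.
toy_pm_table = {"部长篇小说": ["部", "长篇小说"]}
# Character-by-character splitting stands in for a real segmenter here.
print(segment_with_pm_table("这部长篇小说很好", toy_pm_table, list))
```

Each lookup is a dictionary access, which is consistent with the low time and space cost listed among the advantages. In practice the fallback would be a full segmenter, and the table would only override its output on the PM spans.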


Conclusion

An extension of [Sun et al., 1999]
  Adjusts the existing results using large corpora
  Further verifies the properties on domain-specific corpora

A disambiguation strategy is proposed
  Over 42% of overlapping ambiguities can be resolved without any mistake
  Will be more effective when facing running text


References

Lafferty J., A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning (ICML), pages 282-289.

Li R., S.H. Liu, S.W. Ye, and Z.Z. Shi. 2001. A method for resolving overlapping ambiguities in Chinese word segmentation based on SVM and k-NN. Journal of Chinese Information Processing, 15(6): 13-18. (In Chinese)

Li M., J.F. Gao, C.N. Huang, and J.F. Li. 2003. Unsupervised training for overlapping ambiguity resolution in Chinese word segmentation. In Proceedings of SIGHAN’2003, pages 1-7.

Sun M.S. and Z.P. Zuo. 1998. Overlapping ambiguities in Chinese text. Quantitative and Computational Studies on the Chinese Language, pages 323-338.

Sun M.S., C.N. Huang, and B.K.Y. T’sou. 1997. Using character bigram for ambiguity resolution in Chinese word segmentation. Computer Research and Development, 34(5): 332-339. (In Chinese)

Sun M.S., Z.P. Zuo, and B.K.Y. T’sou. 1999. The role of high frequent maximal crossing ambiguities in Chinese word segmentation. Journal of Chinese Information Processing, 13(1): 27-37. (In Chinese)


Thank you

Any comments? ^.^