Jing-Shin Chang 1
Word Segmentation Models: Overview
Chinese Words, Morphemes and Compounds
Word Segmentation Problems
Heuristic Approaches
Unknown Word Problems
Probabilistic Models
  supervised mode
  unsupervised mode
Dynamic Programming
Unsupervised Model for Identifying New Words
English Text With Well Delimited Word Boundary
(Computer Manual) For information about installation, see Microsoft Word Getting Started.
To choose a command from a menu, point to a menu name and click the left mouse button ( 滑鼠左鍵 ). For example, point to the File menu and click to display the File commands. If a command name is followed by an ellipsis, a dialog box ( 對話框 ) appears so you can set the options you want. You can also change the shortcut keys ( 快捷鍵 ) assigned to commands. (Microsoft Word User Guide)
(1996/10/29 CNN) Microsoft Corp. announced a major restructuring Tuesday that creates
two worldwide product groups and shuffles the top ranks of senior management. Under the fourth realignment ..., the company will separate its consumer products from its business applications, creating a Platforms and Applications group and an Interactive Media group. ... Nathan Myhrvold, who also co-managed the Applications and Content group, was named to the newly created position of chief technology officer.
Chinese Text Without Well Delimited Word Boundaries
China Times 1997/7/26: 台經院指出,隨著股市活絡與景氣回溫,第一季車輛及零件營業額成長十六 .
八一%,顯示民間需求回升。再加上為加入WTO,開放進口已是時勢所趨,也將帶動消費成長。台經院預測今年民間消費全年成長率可提昇至六 . 七四%。
在投資方面,第一季國內投資出現回升走勢,固定資本形成實質增加六 . 五六%,其中民間投資實質增加八 . 九五%。在持續有民間大型投資計畫進行、國內房市 回溫、與政府開放投資、加速執行公共工程等多項因素下,預測今年全年民間投資將成長十一 . 八%。
台經院表示,口蹄疫 連鎖效應在第二季顯現,使第二季出口貿易成長率比預期低,出口年增率二 . 一%,比去年低。而進口年增率為七 . 三八%,因此第二季貿易出超僅十七 . 一四億美元,比去年第二季減少四十三 . 六五%。不過,由於第三、四季為出口旺季,加上國際組織均預測今年世界貿易量擴大,台經院認為我國商品出口應可轉趨順暢。
Example: Word Segmentation [Chang 97]
Input: 移送台中少年法庭審理
Seg1*: 移送 / 台中少年 / 法庭審理
Seg2*: 移送 / 台中 / 少年 / 法庭審理
Seg3 : 移送 / 台中 / 少年法庭 / 審理

Successively better segmentations with an unsupervised approach [Chang 97]
Input: 土地公有政策
Seg1 : 土地 / 公有 / 政策
Seg2*: 土地公 / 有 / 政策
Longest match problem + Unknown word problem
Example: Word Segmentation
Input: 修憲亂成一團結果什麼也沒得到
Output: 修 憲 亂 成 一 團結 果 什麼 也 沒 得 到
mis-merge problem
Why Word Segmentation
Words are the natural unit for natural language analysis and NLP applications.
Tricky output may result if tokenization is not carefully conducted; tokenization is the first step in most NLP applications.
  e.g., using character bi-grams as the indexing keys (i.e., representatives of documents) in search engine design and other similarity-based information retrieval tasks
Word Segmentation Problems in Basic IR System
Information Sources & Acquisition
  Web Pages
    Web robots: fetch all web pages of interested or registered sites to local storage
  News Groups
    News server: accepts postings to the news groups
  BBS Articles
    BBS server: administers the posting of BBS articles
  IntraNet Documents
    shared through local LANs
Document Conversion & Normalization
  html to txt, etc.
Indexing System
  identify features of documents & keep a representative signature for each document
Searching System
  convert the query into a representative signature
  compare the signature of the input query to the signatures of archived documents
  rank the relevant documents by similarity
Basic Indexing Techniques & WS Problems
Vector Space Approach
  a document (or query) is a vector of term frequencies (or variants of frequencies)
  compare the query vector against document vectors for similarity & relevance
Problems (quick but dirty)
  depends on word frequencies only (not even compound words)
  independent of word order (no structural or syntactic information)
  simple-minded query functions (user requirements not satisfied):
    keyword matching (exact or fuzzy)
    logical operators (AND, OR, NOT)
    near/quasi natural language queries
Chinese-specific problems: weird output due to un-segmented input indexed with character 2-grams (not by words)
  資訊月 => 資訊月刊
  島內頻寬升級為 1GB
  黨內頻喊換閣揆
  錄音帶內容…尹清峰頻頻說 : “…”
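The 2-gram indexing mismatch can be demonstrated directly: a query and a document with a different word segmentation can share all of the query's bigram keys. A minimal sketch of the plain overlapping-bigram indexing scheme described above:

```python
def char_bigrams(text):
    """Overlapping character 2-grams used as indexing keys."""
    return {text[i:i + 2] for i in range(len(text) - 1)}

# The query 資訊月 (IT Month) shares all of its bigram keys
# with 資訊月刊 (IT monthly magazine), so a document about the
# magazine is retrieved as a (false) full match for the query.
query, doc = "資訊月", "資訊月刊"
print(char_bigrams(query) <= char_bigrams(doc))  # True
```

The same effect underlies the 島內頻寬 / 黨內頻喊 confusion: the shared bigram 內頻 never corresponds to a word in either string.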
Heuristic Approaches
Matching Against a Lexicon
  scan left-to-right or right-to-left
Heuristic Matching Criteria
  (1) Longest (Maximal) Match
    select the longest sub-string on multiple matches
  (2) Minimum Number of Matches
    select the segmentation pattern with the smallest number of words
Greedy Method, Hard Rejection
  skip over the matched lexicon entry and repeat matching, regardless of whether there are embedded or overlapped word candidates in the current matched word
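The greedy longest-match heuristic can be sketched as follows (a minimal illustration; the toy lexicon and the maximum word length are made up for the demo). Note that it reproduces the 搶詞 (word-grabbing) error from the earlier slide: 土地公有政策 comes out as 土地公 / 有 / 政策 rather than 土地 / 公有 / 政策.

```python
def max_match(text, lexicon, max_len=4):
    """Greedy left-to-right longest-match segmentation.

    At each position, take the longest lexicon word starting there;
    fall back to a single character if nothing matches (hard decision:
    once a word is taken, overlapping candidates are never revisited).
    """
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

lexicon = {"土地", "土地公", "公有", "政策", "有"}
print(max_match("土地公有政策", lexicon))  # ['土地公', '有', '政策']
```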
Heuristic Approaches
Problems
  hard decision: skips over a possible match if it was covered by a previous match (impossible to recover based on more evidence)
    i.e., p(w) = 1 or 0 for any word 'w', depending unconditionally on whether it was covered by a previously matched word
  weak contextual constraints: depends on local matches only
    does not depend on the full context; not jointly optimized
  cannot handle the unknown word problem: words not registered in the dictionary will not be handled gracefully
    e.g., new compound words, proper names, numbers
Advantages
  simple and easy to implement
  only needs a large dictionary; needs no training corpora for estimating probabilities
Problems with Segmentation Using Known Words
Incomplete Error Recovery Capability
Two types of segmentation errors due to unknown word problems:
  Over-segmentation: split unknown words into short segments (e.g., single-character regions: '修憲' => '修 憲')
    分析家 對 馬來西亞 的 預測 <=> 分析 家 對 馬 來 西亞 的 預測
  Under-segmentation: prefer a long segment when combining segments (搶詞問題, the word-grabbing problem)
    e.g., '土地 公有 政策' =WS error ('公有' unknown)=> '土地公 有 政策' =Merge=> '土地公有', '有政策' (NOT: '土地', '公有', '政策')
    團結: mis-merge => 修 憲 亂 成 一 團結 果 什麼 也 沒 得 到
A MERGE operation can ONLY recover over-split candidates, NOT over-merged (under-segmented) candidates.
Problems with Segmentation Using Known Words
Using known words for segmentation without considering potential unknown words (i.e., assigning zero probability to unknown words)
  cannot take advantage of contextual constraints over unknown words to get the desired segmentation
  millions of randomly merged unknown word candidates to filter
    (without 省都委會:) 獲省都委會同意 => 獲 省 都 委 會同 意
      => 省都 | 省都委 | 省都委會 | 都委 | 都委會同 | 委會同 | 委會同意
    (with 省都委會:) 獲省都委會同意 => 獲 省都委會 同意
  an extra disambiguation step for resolving overlapping candidates
    e.g., 省都 vs. 省都委會 (etc.); e.g., 彰化 縣 警 刑警隊 少年組
Probabilistic Models
Find all possible segmentation patterns, and select the best one according to a scoring function.
Advantages of Probabilistic Models:
  Soft decision: retain all possible segmentations without pre-excluding any, and select the best by a scoring function that maximizes the joint likelihood of the segmentation
  Takes contextual constraints into account to maximize the likelihood of the whole segmentation pattern
    all words in a segmentation pattern impose constraints on neighboring words
    the segmentation pattern that best fits such constraints (or criteria) is selected
  Unsupervised training is possible even when there is no dictionary, or only a small seed dictionary or seed segmentation corpus
    because many probabilistic optimization criteria can be maximized by iteratively trying possible segmentations and re-estimating in known ways
    e.g., EM & Viterbi training
Probabilistic Model
Basic Model: Word Uni-gram Model [Chang 1991]: jointly optimize the likelihood of a segmentation by the product of the probabilities of the constituent words in the segmentation pattern
Dynamic Programming: fast search for the best segmentation, even though there is a vast number of possible segmentation patterns
Other Models [Chiang et al., 1992]:
  take parts-of-speech (詞類) and morphological (詞素) features into account
  take simple yet probably useful features, such as the length distribution, into account
  take unknown words into consideration
Word Uni-gram Model for Identifying Words
Segmentation Stage: Find the best segmentation pattern S*
which maximizes the following likelihood function of the input corpus
c_1^n : input characters c_1, c_2, ..., c_n
S_j : the j-th segmentation pattern, consisting of { w_{j,1}, w_{j,2}, ..., w_{j,m_j} }
V(t) : vocabulary (n-grams in the augmented dictionary used for segmentation)
S*(V) : the best segmentation (a function of V)

S^*(V) = \arg\max_{S_j} P(S_j = w_{j,1}, \ldots, w_{j,m_j} \mid c_1^n, V)

P(S_j \mid c_1^n, V) \approx \prod_{i=1}^{m_j} P(w_{j,i} \mid V)
Dynamic Programming (DP)
Dynamic Programming
  A methodology for finding the best solution without explicitly enumerating all candidates in the solution space
  Resolve the optimization problem for the whole problem by resolving the optimization problem of a much simpler sub-problem, whose solution does not depend on the huge number of combinations of the remaining parts of the whole problem
    therefore, virtually reducing the large solution space to a very small one
  Resolve successively larger sub-problems after the simpler ones are resolved, finally resolving the optimization problem of the whole task
Requirement
  The optimum solution of a sub-problem must not depend on the remaining parts of the whole problem
Dynamic Programming (DP) Steps
Initialization
  initialize known path scores
Recursion
  find the best previous local path, assuming the current node is one of the nodes in the best path, by comparing sums of local and accumulated scores
  keep the trace of the best previous path, and the accumulated score of this best path
Termination
Path Backtracking
  trace back the best path
Dynamic Programming (DP)
Examples:
  shortest path problem
  speech recognition: DTW (Dynamic Time Warping)
    minimum alignment cost between an input speech feature vector sequence and the feature vector sequence of the typical utterance of a word
    speech-to-speech distance measure
  speech-text alignment
    align words in a speech waveform with the written transcription
    an extension of isolated word recognition using DTW
    speech-to-phonetic transcription
  spelling correction
    minimum editing cost between an input string and a reference pattern (e.g., a dictionary word)
    editing operations: insertion, deletion, substitution (including matching)
    advanced operations: swapping
  post-editing cost: the cost required to modify machine-translated text into a fluent translation
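The spelling-correction example is the classic edit (Levenshtein) distance; a minimal DP sketch follows (unit costs for insertion, deletion, and substitution are an assumption here; practical spell checkers usually weight the operations):

```python
def edit_distance(src, tgt):
    """Minimum number of insertions, deletions, and substitutions
    turning src into tgt, computed by dynamic programming."""
    m, n = len(src), len(tgt)
    # d[i][j] = cost of converting src[:i] into tgt[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i              # i deletions
    for j in range(n + 1):
        d[0][j] = j              # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if src[i - 1] == tgt[j - 1] else 1  # Nop or Sub
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + sub)
    return d[m][n]

print(edit_distance("kitten", "sitting"))  # 3
```

The same table, with backpointers instead of just costs, yields the alignment of operations used on the next slides.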
Dynamic Programming (DP)
Examples:
  Bilingual Text Alignment
    find corresponding sentences in a parallel bilingual corpus
    sentence-length-to-sentence-length distribution
      in words
      in characters
  Word Correspondence, Translation Equivalents, Bilingual Collocations (連語)
    find corresponding words in aligned sentences of bilingual corpora
    word association metrics as the distance for matching
      word association metrics: anything that indicates the degree of (in-)dependency between word pairs can be used for this purpose
      to be addressed in later chapters …
  Machine Translation
Application of DP: Feedback Control via a Parameterized System
S(i): Source Sentence
T(i)*: Preferred Target Sentence
T(i,j): Output Target Sentence
Θ(t): Parameter Set (at time t)
e(t-1,i,j): Difference between T(i)* and T(i,j)

[Block diagram: S(i) enters the MT system, parameterized by Θ(t) = {P(.|.)}; its output T(i,j) is compared (-) against the preferred T(i)* (+), and the error e(t-1,i,j) is fed back to update the parameters.]
Application of DP: Feedback Controlled Parameterized MT Architecture
Metrics for error distance:
(i) Levenshtein Distance
(ii) e(t-1,i,j) = log P( T(i)* | S(i) ) - log P( T(i,j) | S(i) )

[Edit-distance alignment grid (figure): the source "This is my own book" aligned against "This book is mine", with each cell in the DP grid labeled by the editing operation taken (Ins/Del/Sub/Nop).]

Ins: Insertion; Del: Deletion; Sub: Substitution; Nop: No Operation
Dynamic Programming (DP) for Finding the Best Word Segmentation
Ex. 國民大會代表人民行使職權 (c1, c2, …, cN)
Scan all character boundaries left-to-right.
For each word boundary with index 'idx', assuming that idx is one of the best segmentation boundaries, the best previous segmentation boundary idx_best can be found by:
  idx_best = argmax { accumulative_score(0, j) x d(j, idx) }, over all j = idx-1 down to max(idx - k, 0)
    (k: maximum word length, in characters)
  d(j, idx) = Prob(c[j+1…idx]) (the probability that c[j+1…idx] forms a word)
  initialization: accumulative_score(0, 0) = 1.0
  update: accumulative_score(0, idx) = accumulative_score(0, idx_best) x d(idx_best, idx)
After scanning all word boundaries and finding all (assumed) best previous word boundaries, trace back from the end (which is surely one of the best word boundaries) and recover the actual best segments.
Right-to-left scanning is virtually identical to the above steps.
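The steps above can be sketched in code (the word probabilities are made-up illustrative values; log-probabilities replace the products to avoid numerical underflow, and a small floor probability for unknown single characters is an added assumption so every input remains segmentable):

```python
import math

def dp_segment(text, word_prob, max_len=4):
    """Best word segmentation under a word uni-gram model,
    found by dynamic programming over character boundaries."""
    n = len(text)
    floor = 1e-8                   # assumed unknown-character floor
    best = [-math.inf] * (n + 1)   # best log-score at boundary idx
    back = [0] * (n + 1)           # best previous boundary (for backtracking)
    best[0] = 0.0                  # accumulative_score(0, 0) = 1.0
    for idx in range(1, n + 1):
        for j in range(max(idx - max_len, 0), idx):
            w = text[j:idx]
            p = word_prob.get(w, floor if idx - j == 1 else 0.0)
            if p > 0.0 and best[j] + math.log(p) > best[idx]:
                best[idx] = best[j] + math.log(p)
                back[idx] = j
    # trace back from the end to recover the best segments
    words, idx = [], n
    while idx > 0:
        words.append(text[back[idx]:idx])
        idx = back[idx]
    return words[::-1]

probs = {"移送": 0.2, "台中": 0.1, "少年法庭": 0.05, "審理": 0.1,
         "台中少年": 0.001, "少年": 0.1, "法庭": 0.1}
print(dp_segment("移送台中少年法庭審理", probs))
```

With these toy probabilities the search recovers Seg3 from the earlier slide, 移送 / 台中 / 少年法庭 / 審理, even though 台中少年 is a competing candidate.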
Unsupervised Word Segmentation: Viterbi Training for Identifying New Words
Criteria:
1. produce words that maximize the likelihood of the input corpus
2. avoid producing over-segmented entries due to unknown words

Viterbi Training Approach:
  Re-estimate the parameters of the segmentation model iteratively to improve system performance, where the word candidates in the augmented dictionary contain both known words and potential words in the input corpus.
  Potential unknown words are assigned non-zero probabilities automatically in this process.
Viterbi Training for Identifying Words (cont.)
Segmentation Stage: Find the best segmentation pattern S*
which maximizes the following likelihood function of the input corpus
c_1^n : input characters c_1, c_2, ..., c_n
S_j : the j-th segmentation pattern, consisting of { w_{j,1}, w_{j,2}, ..., w_{j,m_j} }
V(t) : vocabulary (n-grams in the augmented dictionary used for segmentation)
S*(V) : the best segmentation (a function of V)

S^*(V) = \arg\max_{S_j} P(S_j = w_{j,1}, \ldots, w_{j,m_j} \mid c_1^n, V)

P(S_j \mid c_1^n, V) \approx \prod_{i=1}^{m_j} P(w_{j,i} \mid V)
Viterbi Training for Identifying Words (cont.)
Reestimation Stage: Estimate the word probabilities that maximize the likelihood of the input text:

Initial Estimation:

P(w_{j,i} \mid V) = \frac{\text{Number of } w_{j,i} \text{ in corpus}}{\text{Number of all words in corpus}}

Reestimation:

P(w_{j,i} \mid V) = \frac{\text{Number of } w_{j,i} \text{ in best segmentation}}{\text{Number of all words in best segmentation}}
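The segment-then-reestimate loop across the last two slides can be sketched as follows (a minimal illustration: the `segment` argument stands for a DP segmenter such as the one on the earlier slide, and using raw substring counts of candidates for the initial estimation is an assumption about the exact counting scheme):

```python
from collections import Counter

def viterbi_train(corpus, segment, candidates, iterations=5):
    """Viterbi training sketch for word probabilities P(w|V).

    corpus:     list of unsegmented sentences (strings)
    segment:    function (sentence, word_prob) -> list of words
    candidates: augmented dictionary of known + potential words
    """
    # Initial estimation: relative frequency of each candidate
    # as a raw substring of the corpus
    counts = Counter()
    for sent in corpus:
        for w in candidates:
            counts[w] += sent.count(w)
    total = sum(counts.values()) or 1
    word_prob = {w: c / total for w, c in counts.items() if c > 0}

    for _ in range(iterations):
        # Reestimation: count words only in the current best
        # segmentation of each sentence
        counts = Counter()
        for sent in corpus:
            counts.update(segment(sent, word_prob))
        total = sum(counts.values()) or 1
        word_prob = {w: c / total for w, c in counts.items()}
    return word_prob
```

Because candidates that survive in the best segmentations keep non-zero counts, potential unknown words acquire non-zero probabilities automatically, as the slide states.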