Jing-Shin Chang 1
Word Segmentation Models: Overview
Chinese Words, Morphemes and Compounds
Word Segmentation Problems
Heuristic Approaches
Unknown Word Problems
Probabilistic Models
  supervised mode
  unsupervised mode
Dynamic Programming
Unsupervised Model for Identifying New Words
English Text With Well Delimited Word Boundary
(Computer Manual) For information about installation, see Microsoft Word Getting Started.
To choose a command from a menu, point to a menu name and click the left mouse button ( 滑鼠左鍵 ). For example, point to the File menu and click to display the File commands. If a command name is followed by an ellipsis, a dialog box ( 對話框 ) appears so you can set the options you want. You can also change the shortcut keys ( 快捷鍵 ) assigned to commands. (Microsoft Word User Guide)
(1996/10/29 CNN) Microsoft Corp. announced a major restructuring Tuesday that creates
two worldwide product groups and shuffles the top ranks of senior management. Under the fourth realignment ..., the company will separate its consumer products from its business applications, creating a Platforms and Applications group and an Interactive Media group. ... Nathan Myhrvold, who also co-managed the Applications and Content group, was named to the newly created position of chief technology officer.
Chinese Text Without Well Delimited Word Boundaries
China Times 1997/7/26: 台經院指出,隨著股市活絡與景氣回溫,第一季車輛及零件營業額成長十六 .
八一%,顯示民間需求回升。再加上為加入WTO,開放進口已是時勢所趨,也將帶動消費成長。台經院預測今年民間消費全年成長率可提昇至六 . 七四%。
在投資方面,第一季國內投資出現回升走勢,固定資本形成實質增加六 . 五六%,其中民間投資實質增加八 . 九五%。在持續有民間大型投資計畫進行、國內房市 回溫、與政府開放投資、加速執行公共工程等多項因素下,預測今年全年民間投資將成長十一 . 八%。
台經院表示,口蹄疫 連鎖效應在第二季顯現,使第二季出口貿易成長率比預期低,出口年增率二 . 一%,比去年低。而進口年增率為七 . 三八%,因此第二季貿易出超僅十七 . 一四億美元,比去年第二季減少四十三 . 六五%。不過,由於第三、四季為出口旺季,加上國際組織均預測今年世界貿易量擴大,台經院認為我國商品出口應可轉趨順暢。
Example: Word Segmentation [Chang 97]
Input: 移送台中少年法庭審理
Seg1*: 移送 / 台中少年 / 法庭審理
Seg2*: 移送 / 台中 / 少年 / 法庭審理
Seg3 : 移送 / 台中 / 少年法庭 / 審理

Successively better segmentations with an unsupervised approach [Chang 97]
Input: 土地公有政策
Seg1 : 土地 / 公有 / 政策
Seg2*: 土地公 / 有 / 政策
Longest match problem + Unknown word problem
Example: Word Segmentation
Input: 修憲亂成一團結果什麼也沒得到
Output: 修 憲 亂 成 一 團結 果 什麼 也 沒 得 到
mis-merge problem
Why Word Segmentation
Words are the natural unit for natural language analysis and NLP applications.
Tricky output may result if tokenization is not carefully conducted; tokenization is the first step in most NLP applications.
  e.g., using character bi-grams as the indexing keys (i.e., representatives of documents) in search engine design and other similarity-based information retrieval tasks
Word Segmentation Problems in Basic IR System
Information Sources & Acquisition
  Web Pages
    Web robots: fetch all web pages of interested or registered sites to local storage
  News Groups
    News server: accepts postings to the news groups
  BBS Articles
    BBS server: administers the posting of BBS articles
  IntraNet Documents
    shared through local LANs
Document Conversion & Normalization
  html to txt, etc.
Indexing System
  identify features of documents & keep a representative signature for each document
Searching System
  convert the query into a representative signature
  compare the signature of the input query to the signatures of archived documents
  rank the relevant documents by similarity
Basic Indexing Techniques & WS Problems
Vector Space Approach
  a document (or query) is a vector of term frequencies (or variants of frequencies)
  compare the query vector against document vectors for similarity & relevance
Problems (quick but dirty)
  depends on word frequencies only (not even compound words)
  independent of word order (no structural or syntactic information)
  simple-minded query functions (user requirements not satisfied):
    keyword matching (exact or fuzzy)
    logical operators (AND, OR, NOT)
    near/quasi natural language queries
Chinese-specific problems: weird output due to un-segmented input indexed with character 2-grams (not by words)
  資訊月 => 資訊月刊
  島內頻寬升級為 1GB
  黨內頻喊換閣揆
  錄音帶內容…尹清峰頻頻說 : “…”
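The 2-gram indexing mismatch can be demonstrated directly: a query and a document with a different word segmentation can share all of the query's bigram keys. A minimal sketch of the plain overlapping-bigram indexing scheme described above:

```python
def char_bigrams(text):
    """Overlapping character 2-grams used as indexing keys."""
    return {text[i:i + 2] for i in range(len(text) - 1)}

# The query 資訊月 (IT Month) shares all of its bigram keys
# with 資訊月刊 (IT monthly magazine), so a document about the
# magazine is retrieved as a (false) full match for the query.
query, doc = "資訊月", "資訊月刊"
print(char_bigrams(query) <= char_bigrams(doc))  # True
```

The same effect underlies the 島內頻寬 / 黨內頻喊 confusion: the shared bigram 內頻 never corresponds to a word in either string.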
Heuristic Approaches
Matching Against a Lexicon
  scan left-to-right or right-to-left
Heuristic Matching Criteria
  (1) Longest (Maximal) Match
    select the longest sub-string on multiple matches
  (2) Minimum Number of Matches
    select the segmentation pattern with the smallest number of words
Greedy Method, Hard Rejection
  skip over the matched lexicon entry and repeat matching, regardless of whether there are embedded or overlapped word candidates in the current matched word
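The greedy longest-match heuristic can be sketched as follows (a minimal illustration; the toy lexicon and the maximum word length are made up for the demo). Note that it reproduces the 搶詞 (word-grabbing) error from the earlier slide: 土地公有政策 comes out as 土地公 / 有 / 政策 rather than 土地 / 公有 / 政策.

```python
def max_match(text, lexicon, max_len=4):
    """Greedy left-to-right longest-match segmentation.

    At each position, take the longest lexicon word starting there;
    fall back to a single character if nothing matches (hard decision:
    once a word is taken, overlapping candidates are never revisited).
    """
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

lexicon = {"土地", "土地公", "公有", "政策", "有"}
print(max_match("土地公有政策", lexicon))  # ['土地公', '有', '政策']
```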
Heuristic Approaches
Problems
  hard decision: skips over a possible match if it was covered by a previous match (impossible to recover based on more evidence)
    i.e., p(w) = 1 or 0 for any word 'w', depending unconditionally on whether it was covered by a previously matched word
  weak contextual constraints: depends on local matches only
    does not depend on the full context; not jointly optimized
  cannot handle the unknown word problem: words not registered in the dictionary will not be handled gracefully
    e.g., new compound words, proper names, numbers
Advantages
  simple and easy to implement
  only needs a large dictionary; needs no training corpora for estimating probabilities
Problems with Segmentation Using Known Words
Incomplete Error Recovery Capability
Two types of segmentation errors due to unknown word problems:
  Over-segmentation: split unknown words into short segments (e.g., single-character regions: '修憲' => '修 憲')
    分析家 對 馬來西亞 的 預測 <=> 分析 家 對 馬 來 西亞 的 預測
  Under-segmentation: prefer a long segment when combining segments (搶詞問題, the word-grabbing problem)
    e.g., '土地 公有 政策' =WS error ('公有' unknown)=> '土地公 有 政策' =Merge=> '土地公有', '有政策' (NOT: '土地', '公有', '政策')
    團結: mis-merge => 修 憲 亂 成 一 團結 果 什麼 也 沒 得 到
A MERGE operation can ONLY recover over-split candidates, NOT over-merged (under-segmented) candidates.
Problems with Segmentation Using Known Words
Using known words for segmentation without considering potential unknown words (i.e., assigning zero probability to unknown words)
  cannot take advantage of contextual constraints over unknown words to get the desired segmentation
  millions of randomly merged unknown word candidates to filter
    (without 省都委會:) 獲省都委會同意 => 獲 省 都 委 會同 意
      => 省都 | 省都委 | 省都委會 | 都委 | 都委會同 | 委會同 | 委會同意
    (with 省都委會:) 獲省都委會同意 => 獲 省都委會 同意
  an extra disambiguation step for resolving overlapping candidates
    e.g., 省都 vs. 省都委會 (etc.); e.g., 彰化 縣 警 刑警隊 少年組
Probabilistic Models
Find all possible segmentation patterns, and select the best one according to a scoring function.
Advantages of Probabilistic Models:
  Soft decision: retain all possible segmentations without pre-excluding any, and select the best by a scoring function that maximizes the joint likelihood of the segmentation
  Takes contextual constraints into account to maximize the likelihood of the whole segmentation pattern
    all words in a segmentation pattern impose constraints on neighboring words
    the segmentation pattern that best fits such constraints (or criteria) is selected
  Unsupervised training is possible even when there is no dictionary, or only a small seed dictionary or seed segmentation corpus
    because many probabilistic optimization criteria can be maximized by iteratively trying possible segmentations and re-estimating in known ways
    e.g., EM & Viterbi training
Probabilistic Model
Basic Model: Word Uni-gram Model [Chang 1991]: jointly optimize the likelihood of a segmentation by the product of the probabilities of the constituent words in the segmentation pattern
Dynamic Programming: fast search for the best segmentation, even though there is a vast number of possible segmentation patterns
Other Models [Chiang et al., 1992]:
  take parts-of-speech (詞類) and morphological (詞素) features into account
  take simple yet probably useful features, such as the length distribution, into account
  take unknown words into consideration
Word Uni-gram Model for Identifying Words
Segmentation Stage: Find the best segmentation pattern S*
which maximizes the following likelihood function of the input corpus
c_1^n : input characters c_1, c_2, ..., c_n
S_j : the j-th segmentation pattern, consisting of { w_{j,1}, w_{j,2}, ..., w_{j,m_j} }
V(t) : vocabulary (n-grams in the augmented dictionary used for segmentation)
S*(V) : the best segmentation (a function of V)

S^*(V) = \arg\max_{S_j} P(S_j = w_{j,1}, \ldots, w_{j,m_j} \mid c_1^n, V)

P(S_j \mid c_1^n, V) \approx \prod_{i=1}^{m_j} P(w_{j,i} \mid V)
Dynamic Programming (DP)
Dynamic Programming
  A methodology for finding the best solution without explicitly enumerating all candidates in the solution space
  Resolve the optimization problem for the whole problem by resolving the optimization problem of a much simpler sub-problem, whose solution does not depend on the huge number of combinations of the remaining parts of the whole problem
    therefore, virtually reducing the large solution space to a very small one
  Resolve successively larger sub-problems after the simpler ones are resolved, finally resolving the optimization problem of the whole task
Requirement
  The optimum solution of a sub-problem must not depend on the remaining parts of the whole problem
Dynamic Programming (DP) Steps
Initialization
  initialize known path scores
Recursion
  find the best previous local path, assuming the current node is one of the nodes in the best path, by comparing sums of local and accumulated scores
  keep the trace of the best previous path, and the accumulated score of this best path
Termination
Path Backtracking
  trace back the best path
Dynamic Programming (DP)
Examples:
  shortest path problem
  speech recognition: DTW (Dynamic Time Warping)
    minimum alignment cost between an input speech feature vector sequence and the feature vector sequence of the typical utterance of a word
    speech-to-speech distance measure
  speech-text alignment
    align words in a speech waveform with the written transcription
    an extension of isolated word recognition using DTW
    speech-to-phonetic transcription
  spelling correction
    minimum editing cost between an input string and a reference pattern (e.g., a dictionary word)
    editing operations: insertion, deletion, substitution (including matching)
    advanced operations: swapping
  post-editing cost: the cost required to modify machine-translated text into a fluent translation
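The spelling-correction example is the classic edit (Levenshtein) distance; a minimal DP sketch follows (unit costs for insertion, deletion, and substitution are an assumption here; practical spell checkers usually weight the operations):

```python
def edit_distance(src, tgt):
    """Minimum number of insertions, deletions, and substitutions
    turning src into tgt, computed by dynamic programming."""
    m, n = len(src), len(tgt)
    # d[i][j] = cost of converting src[:i] into tgt[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i              # i deletions
    for j in range(n + 1):
        d[0][j] = j              # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if src[i - 1] == tgt[j - 1] else 1  # Nop or Sub
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + sub)
    return d[m][n]

print(edit_distance("kitten", "sitting"))  # 3
```

The same table, with backpointers instead of just costs, yields the alignment of operations used on the next slides.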
Dynamic Programming (DP)
Examples:
  Bilingual Text Alignment
    find corresponding sentences in a parallel bilingual corpus
    sentence-length-to-sentence-length distribution
      in words
      in characters
  Word Correspondence, Translation Equivalents, Bilingual Collocations (連語)
    find corresponding words in aligned sentences of bilingual corpora
    word association metrics as the distance for matching
      word association metrics: anything that indicates the degree of (in-)dependency between word pairs can be used for this purpose
      to be addressed in later chapters …
  Machine Translation
Application of DP: Feedback Control via a Parameterized System
S(i): Source Sentence
T(i)*: Preferred Target Sentence
T(i,j): Output Target Sentence
Θ(t): Parameter Set (at time t)
e(t-1,i,j): Difference between T(i)* and T(i,j)

[Block diagram: S(i) enters the MT system, parameterized by Θ(t) = {P(.|.)}; its output T(i,j) is compared (-) against the preferred T(i)* (+), and the error e(t-1,i,j) is fed back to update the parameters.]
Application of DP: Feedback Controlled Parameterized MT Architecture
Metrics for error distance:
(i) Levenshtein Distance
(ii) e(t-1,i,j) = log P( T(i)* | S(i) ) - log P( T(i,j) | S(i) )

[Edit-distance alignment grid (figure): the source "This is my own book" aligned against "This book is mine", with each cell in the DP grid labeled by the editing operation taken (Ins/Del/Sub/Nop).]

Ins: Insertion; Del: Deletion; Sub: Substitution; Nop: No Operation
Dynamic Programming (DP) for Finding the Best Word Segmentation
Ex. 國民大會代表人民行使職權 (c1, c2, …, cN)
Scan all character boundaries left-to-right.
For each word boundary with index 'idx', assuming that idx is one of the best segmentation boundaries, the best previous segmentation boundary idx_best can be found by:
  idx_best = argmax { accumulative_score(0, j) x d(j, idx) }, over all j = idx-1 down to max(idx - k, 0)
    (k: maximum word length, in characters)
  d(j, idx) = Prob(c[j+1…idx]) (the probability that c[j+1…idx] forms a word)
  initialization: accumulative_score(0, 0) = 1.0
  update: accumulative_score(0, idx) = accumulative_score(0, idx_best) x d(idx_best, idx)
After scanning all word boundaries and finding all (assumed) best previous word boundaries, trace back from the end (which is surely one of the best word boundaries) and recover the actual best segments.
Right-to-left scanning is virtually identical to the above steps.
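The steps above can be sketched in code (the word probabilities are made-up illustrative values; log-probabilities replace the products to avoid numerical underflow, and a small floor probability for unknown single characters is an added assumption so every input remains segmentable):

```python
import math

def dp_segment(text, word_prob, max_len=4):
    """Best word segmentation under a word uni-gram model,
    found by dynamic programming over character boundaries."""
    n = len(text)
    floor = 1e-8                   # assumed unknown-character floor
    best = [-math.inf] * (n + 1)   # best log-score at boundary idx
    back = [0] * (n + 1)           # best previous boundary (for backtracking)
    best[0] = 0.0                  # accumulative_score(0, 0) = 1.0
    for idx in range(1, n + 1):
        for j in range(max(idx - max_len, 0), idx):
            w = text[j:idx]
            p = word_prob.get(w, floor if idx - j == 1 else 0.0)
            if p > 0.0 and best[j] + math.log(p) > best[idx]:
                best[idx] = best[j] + math.log(p)
                back[idx] = j
    # trace back from the end to recover the best segments
    words, idx = [], n
    while idx > 0:
        words.append(text[back[idx]:idx])
        idx = back[idx]
    return words[::-1]

probs = {"移送": 0.2, "台中": 0.1, "少年法庭": 0.05, "審理": 0.1,
         "台中少年": 0.001, "少年": 0.1, "法庭": 0.1}
print(dp_segment("移送台中少年法庭審理", probs))
```

With these toy probabilities the search recovers Seg3 from the earlier slide, 移送 / 台中 / 少年法庭 / 審理, even though 台中少年 is a competing candidate.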
Unsupervised Word Segmentation: Viterbi Training for Identifying New Words
Criteria:
1. produce words that maximize the likelihood of the input corpus
2. avoid producing over-segmented entries due to unknown words

Viterbi Training Approach:
  Re-estimate the parameters of the segmentation model iteratively to improve system performance, where the word candidates in the augmented dictionary contain both known words and potential words in the input corpus.
  Potential unknown words are assigned non-zero probabilities automatically in this process.
Viterbi Training for Identifying Words (cont.)
Segmentation Stage: Find the best segmentation pattern S*
which maximizes the following likelihood function of the input corpus
c_1^n : input characters c_1, c_2, ..., c_n
S_j : the j-th segmentation pattern, consisting of { w_{j,1}, w_{j,2}, ..., w_{j,m_j} }
V(t) : vocabulary (n-grams in the augmented dictionary used for segmentation)
S*(V) : the best segmentation (a function of V)

S^*(V) = \arg\max_{S_j} P(S_j = w_{j,1}, \ldots, w_{j,m_j} \mid c_1^n, V)

P(S_j \mid c_1^n, V) \approx \prod_{i=1}^{m_j} P(w_{j,i} \mid V)
Viterbi Training for Identifying Words (cont.)
Reestimation Stage: Estimate the word probabilities that maximize the likelihood of the input text:

Initial Estimation:

P(w_{j,i} \mid V) = \frac{\text{Number of } w_{j,i} \text{ in corpus}}{\text{Number of all words in corpus}}

Reestimation:

P(w_{j,i} \mid V) = \frac{\text{Number of } w_{j,i} \text{ in best segmentation}}{\text{Number of all words in best segmentation}}
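The segment-then-reestimate loop across the last two slides can be sketched as follows (a minimal illustration: the `segment` argument stands for a DP segmenter such as the one on the earlier slide, and using raw substring counts of candidates for the initial estimation is an assumption about the exact counting scheme):

```python
from collections import Counter

def viterbi_train(corpus, segment, candidates, iterations=5):
    """Viterbi training sketch for word probabilities P(w|V).

    corpus:     list of unsegmented sentences (strings)
    segment:    function (sentence, word_prob) -> list of words
    candidates: augmented dictionary of known + potential words
    """
    # Initial estimation: relative frequency of each candidate
    # as a raw substring of the corpus
    counts = Counter()
    for sent in corpus:
        for w in candidates:
            counts[w] += sent.count(w)
    total = sum(counts.values()) or 1
    word_prob = {w: c / total for w, c in counts.items() if c > 0}

    for _ in range(iterations):
        # Reestimation: count words only in the current best
        # segmentation of each sentence
        counts = Counter()
        for sent in corpus:
            counts.update(segment(sent, word_prob))
        total = sum(counts.values()) or 1
        word_prob = {w: c / total for w, c in counts.items()}
    return word_prob
```

Because candidates that survive in the best segmentations keep non-zero counts, potential unknown words acquire non-zero probabilities automatically, as the slide states.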