The current status of Chinese-English EBMT
-where are we now
Joy (Ying Zhang)
Ralf Brown, Robert Frederking, Erik Peterson
Aug 2001
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
Topics
• Overview of this project
  – Rapid-deployment machine translation system between Chinese and English
• For HLT 2001 (Jun 00-Jan 01)
  – Augment the segmenter with new words found in the corpus
• For MT-Summit VIII Paper (Jan 01-May 01)
  – Two-threshold method used in tokenization code to find new words in corpus
• For PI meeting (Jun 01-Jul 01)
  – Accurate ablation experiments
  – Named-entities added to the training
  – Multi-corpora experiment
• After PI meeting (Aug 01)
  – Study of results reported for PI meeting
  – Review of evaluation methods
  – Type-token relations
• Plan for future research
Overview of Ch-En EBMT
• Adapting EBMT to Chinese
• Corpus used
  – Hong Kong legal code (from LDC)
  – Hong Kong news articles (from LDC)
• In this project:
  – Robert Frederking, Ralf Brown, Joy, Erik Peterson, Stephan Vogel, Alon Lavie, Lori Levin
Corpus Cleaning
• Convert from Big5 to GB
• Divided into training set (90%), dev-test set (5%) and test set (5%)
• Sentence-level alignment, using the Gale & Church method (by ISI)
• Cleaned
• Convert two-byte Chinese characters to their cognates
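The last two cleaning steps can be illustrated with a minimal sketch. It assumes that "cognates" refers to the ASCII equivalents of full-width letters, digits and punctuation, which is one plausible reading and is not stated on the slide. The helper names `to_halfwidth` and `split_corpus` are hypothetical, and the random 90/5/5 split is only one possible way to divide the corpus.

```python
import random


def to_halfwidth(text: str) -> str:
    """Map full-width ASCII variants to plain ASCII (assumed meaning of 'cognates')."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:                # ideographic space
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:    # full-width '!' .. '~'
            out.append(chr(code - 0xFEE0))
        else:                             # ordinary Chinese characters stay unchanged
            out.append(ch)
    return "".join(out)


def split_corpus(pairs, seed=0):
    """Shuffle sentence pairs and split them 90% / 5% / 5% (assumed procedure)."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n_train = len(pairs) * 90 // 100
    n_dev = len(pairs) * 5 // 100
    train = pairs[:n_train]
    dev = pairs[n_train:n_train + n_dev]
    test = pairs[n_train + n_dev:]
    return train, dev, test


print(to_halfwidth("ＡＢＣ１２３，香港"))
```

The Big5-to-GB step is not sketched here because it involves mapping traditional to simplified characters, which a plain codec round trip does not do.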
Corpus Statistics
• Hong Kong Legal Code:
  – Chinese: 23 MB
  – English: 37.8 MB
• Hong Kong News (after cleaning)
  – 7,622 documents
  – Dev-test: 1,331,915 bytes, 4,992 sentence pairs
  – Final-test: 1,329,764 bytes, 4,866 sentence pairs
  – Training: 25,720,755 bytes, 95,752 sentence pairs
  – Vocabulary size under LDC segmenter
  – Dev-test: 8,529 types, 134,749 tokens
  – Final-test: 8,511 types, 135,372 tokens
  – Training: 20,451 types, 2,600,095 tokens
Chinese Segmentation
• There are no spaces between Chinese words in written Chinese.
• The segmentation problem: Given a sentence with no spaces, break it into words
Vague Definition of Words
• In English, a word might be “a group of letters having meaning separated by spaces in the sentence” --- Doesn’t work for Chinese
• Is the word a single Chinese character? --- Not necessarily
• Is the word the smallest set of characters that can have meaning by themselves? --- Maybe
• Is the word the longest set of characters that can have meaning by themselves? --- Perhaps
Our Definition of Words/Phrases/Terms
• Chinese Characters
  – The smallest unit in written Chinese is a character, which is represented by 2 bytes in GB-2312 code.
• Chinese Words
  – A word in natural language is the smallest reusable unit which can be used in isolation.
• Chinese Phrases
  – We define a Chinese phrase as a sequence of Chinese words. For each word in the phrase, the meaning of this word is the same as the meaning when the word appears by itself.
• Terms
  – A term is a meaningful constituent. It can be either a word or a phrase.
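The two-bytes-per-character property of GB-2312 can be checked with a short sketch. It uses Python's built-in `gb2312` codec; the helper name `gb2312_byte_length` and the sample strings are illustrative assumptions, not part of the project code.

```python
def gb2312_byte_length(text: str) -> int:
    """Return the number of bytes `text` occupies when encoded as GB-2312."""
    return len(text.encode("gb2312"))


print(gb2312_byte_length("香港"))  # two Chinese characters
print(gb2312_byte_length("HK"))    # two ASCII letters, one byte each
```

This is also why the corpus sizes reported in bytes are roughly twice the character counts for purely Chinese text.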
Complicated Constructions
• There are some constructions that can cause problems for segmentation:
  – Transliterated foreign words and names: Chinese characters are used for the sound of English names. The meaning of each character is irrelevant and cannot be relied on. Each Chinese-speaking region will often transliterate the same name differently.
Complicated Constructions (2)
– Abbreviations: In Chinese, abbreviations are formed by taking a character from each word in the phrase being abbreviated.
– Virtually any phrase can be abbreviated by taking one character from each component, and these characters usually have no independent relation to each other.
Complicated Constructions (3)
– Chinese Names
  • Name = Surname (generally one character) + Given name (one or two characters)
  • About 100 common surnames, but the number of given names is huge
  • The complication for NLP: the same characters in names can be used in “regular” words. The same happens in English: “Bill Brown” is a name, but “bill” and “brown” are also ordinary words.
Complicated Constructions (4)
– Chinese Numbers
  • Similar to English, there are several ways to write numbers in Chinese:
Segmenter
• Approaches
  – Statistical approaches:
    • Idea: Build collocation models for Chinese characters, such as a first-order HMM. Place a space where two adjacent characters rarely co-occur.
    • Cons:
      – Data sparseness
      – Cross boundary
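The idea of placing a boundary where two characters rarely co-occur can be sketched as follows. This is a simplified stand-in that scores adjacent characters with pointwise mutual information rather than a first-order HMM. The function names, the threshold value and the toy training data are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_counts(lines):
    """Count single characters and adjacent character pairs in unsegmented text."""
    unigrams, bigrams = Counter(), Counter()
    for line in lines:
        unigrams.update(line)
        bigrams.update(zip(line, line[1:]))
    return unigrams, bigrams


def pmi(a, b, unigrams, bigrams):
    """Pointwise mutual information of the adjacent pair (a, b); -inf if unseen."""
    pair = bigrams[(a, b)]
    if pair == 0:
        return float("-inf")
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    p_pair = pair / n_bi
    p_a, p_b = unigrams[a] / n_uni, unigrams[b] / n_uni
    return math.log(p_pair / (p_a * p_b))


def segment(sentence, unigrams, bigrams, threshold=1.0):
    """Start a new word wherever the association of two neighbours falls below threshold."""
    if not sentence:
        return []
    words = [sentence[0]]
    for prev, cur in zip(sentence, sentence[1:]):
        if pmi(prev, cur, unigrams, bigrams) < threshold:
            words.append(cur)      # weak association: insert a boundary
        else:
            words[-1] += cur       # strong association: extend the current word
    return words


# Toy "corpus" using Latin letters in place of Chinese characters.
toy_lines = ["abcd", "abxy", "cdxy", "xyab", "cdab", "xycd"]
uni, bi = train_counts(toy_lines)
print(segment("abcd", uni, bi, threshold=1.0))
```

Character pairs never seen in training get a score of negative infinity here, which is one crude way the data sparseness problem listed above shows up.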
Segmenter (2)
  – Dictionary-based approaches
    • Idea: Use a dictionary to find the words in the sentence
    • Forward maximum match, backward maximum match, or both directions
    • Cons:
      – The size and quality of the dictionary used are of great importance (new words, named entities)
      – Maximum (greedy) match may cause mis-segmentations
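A minimal sketch of forward and backward maximum matching follows. The function names, the `max_len` limit and the toy dictionary are assumptions for illustration, and the example sentence is a commonly used one that does not come from the slides. Unknown single characters are emitted as one-character words.

```python
def forward_max_match(sentence, dictionary, max_len=4):
    """Scan left to right, always taking the longest dictionary word that fits."""
    words, i = [], 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


def backward_max_match(sentence, dictionary, max_len=4):
    """Scan right to left, always taking the longest dictionary word that fits."""
    words, j = [], len(sentence)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            candidate = sentence[j - size:j]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                j -= size
                break
    words.reverse()
    return words


toy_dict = {"研究", "研究生", "生命", "起源"}
sentence = "研究生命起源"
print(forward_max_match(sentence, toy_dict))   # ['研究生', '命', '起源']
print(backward_max_match(sentence, toy_dict))  # ['研究', '生命', '起源']
```

The two directions disagree on this sentence, which illustrates the greedy mis-segmentation listed under the cons. How the two results are reconciled when both directions are used is not specified on the slide.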
Segmenter (3)
  – A combination of dictionary and linguistic knowledge
    • Ideas: Using morphology, POS, grammar and heuristics to aid disambiguation
    • Pros: high accuracy (possible)
    • Cons:
      – Requires a dictionary with POS and word frequency
      – Computationally expensive
Segmenter (4)
• We first used LDC’s segmenter
• Currently we are using a forward/backward maximum match segmenter as the baseline.
• The word frequency dictionary is from LDC: 43,959 entries
For HLT 2001
• Ying Zhang, Ralf D. Brown, and Robert E. Frederking. "Adapting an Example-Based Translation System to Chinese". To appear in Proceedings of Human Language Technology Conference 2001 (HLT-2001).
For MT-Summit VIII
• Ying Zhang, Ralf D. Brown, Robert E. Frederking and Alon Lavie. "Pre-processing of Bilingual Corpora for Mandarin-English EBMT". Accepted at MT Summit VIII (Santiago de Compostela, Spain, Sep. 2001)
• Two-threshold method for tokenization
For MT-Summit VIII (2)
For PI Meeting (1)
• Baseline System
• Full System
• Baseline + Named-Entity
• Multi-corpora System
For PI Meeting (2)
• Baseline System
For PI Meeting (3)
• Full System
For PI Meeting (4)
• Named-Entity
For PI Meeting (5)
• Multi-Corpora Experiment
  – Motivation
  – Corpus Clustering
  – Experiment
Evaluation Issues
• Automatic Measures
  – EBMT Source Match
  – EBMT Source Coverage
  – EBMT Target Coverage
  – MEMT (EBMT+DICT) Unigram Coverage
  – MEMT (EBMT+DICT) PER
• Human Evaluations
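The slides do not define the automatic measures, so the following is only a hedged sketch of two of them. It assumes that PER stands for position-independent word error rate and that unigram coverage is the fraction of reference words also found in the hypothesis. Both definitions, and the function names, are assumptions and may differ from what the project actually computed.

```python
from collections import Counter


def unigram_matches(hypothesis, reference):
    """Number of words shared by hypothesis and reference, ignoring word order."""
    overlap = Counter(hypothesis) & Counter(reference)
    return sum(overlap.values())


def unigram_coverage(hypothesis, reference):
    """Assumed definition: share of reference words that appear in the hypothesis."""
    if not reference:
        return 0.0
    return unigram_matches(hypothesis, reference) / len(reference)


def per(hypothesis, reference):
    """Assumed definition: 1 - (matches - max(0, len(hyp) - len(ref))) / len(ref)."""
    if not reference:
        raise ValueError("reference must not be empty")
    matches = unigram_matches(hypothesis, reference)
    surplus = max(0, len(hypothesis) - len(reference))
    return 1.0 - (matches - surplus) / len(reference)


ref = "the cat sat on the mat".split()
hyp = "on the mat the cat sat".split()
print(unigram_coverage(hyp, ref), per(hyp, ref))
```

Because both measures ignore word order, a scrambled but lexically complete translation scores perfectly, which is one reason the deck pairs them with human evaluation.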
Evaluation Issues (2)
• Human Evaluations
  – 4-5 graders each time
  – 6 categories
Evaluation Issues (3)
After PI Meeting (0)
• Study of results reported in PI meeting (http://pizza.is.cs.cmu.edu/research/internal/ebmt/tokenLen/index.htm)
  – The quality of Named-Entity (cleaned by Erik)
  – Performance difference of EBMT while changing the average length of Chinese word token (by changing segmentation)
  – How to evaluate the performance of the system
• Experiment of G-EBMT
  – Word clustering
After PI Meeting (1)
• Changing the average length of Chinese token
  – No bracket on English
  – Use a subset of LDC’s frequency dictionary for segmentation
  – Study the performance of the EBMT system on different average Chinese token lengths
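One way to obtain a subset of a frequency dictionary is to keep only entries at or above a frequency threshold. The sketch below assumes that reading; whether the project used "at least" or "greater than" is not stated. The helper name `filter_by_frequency` and the toy counts are hypothetical and are not LDC data.

```python
import math


def filter_by_frequency(freq_dict, threshold):
    """Keep the words whose frequency is at least `threshold` (assumed rule)."""
    return {word for word, freq in freq_dict.items() if freq >= threshold}


# Toy frequency dictionary with made-up counts.
toy_freq = {"香港": 1200, "政府": 850, "法例": 95, "条例": 40}

for threshold in (math.inf, 1000, 100, 30):
    kept = filter_by_frequency(toy_freq, threshold)
    print(threshold, len(kept))
```

With an infinite threshold no entries survive, so a dictionary-based segmenter falls back to single characters. Lowering the threshold admits more multi-character words and raises the average token length.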
After PI Meeting (2)

| Thresh for freqFile | Entries in Freq File | #Type | #Token | type:token | Avg. Len/type | Avg. Len/token |
|---|---|---|---|---|---|---|
| INF | 0 | 2325 | 238115 | 1:102.41 | 1.000 | 1.000 |
| 1000 | 770 | 2686 | 208796 | 1:77.73 | 1.144 | 1.140 |
| 800 | 933 | 2780 | 205043 | 1:73.75 | 1.175 | 1.160 |
| 600 | 1188 | 2923 | 199837 | 1:68.36 | 1.220 | 1.190 |
| 500 | 1347 | 3009 | 195630 | 1:65.01 | 1.246 | 1.217 |
| 400 | 1619 | 3157 | 192331 | 1:60.92 | 1.288 | 1.238 |
| 300 | 1978 | 3335 | 188984 | 1:56.66 | 1.333 | 1.260 |
| 200 | 2640 | 3635 | 184878 | 1:50.86 | 1.403 | 1.288 |
| 100 | 4062 | 4228 | 176129 | 1:41.66 | 1.515 | 1.352 |
| 90 | 4298 | 4321 | 175302 | 1:40.57 | 1.530 | 1.358 |
| 80 | 4606 | 4435 | 174039 | 1:39.24 | 1.547 | 1.368 |
| 70 | 4964 | 4569 | 172729 | 1:37.80 | 1.569 | 1.379 |
| 60 | 5381 | 4713 | 171357 | 1:36.36 | 1.589 | 1.390 |
| 50 | 5940 | 4876 | 170290 | 1:34.92 | 1.612 | 1.398 |
| 40 | 6726 | 5105 | 168526 | 1:33.01 | 1.642 | 1.413 |
| 30 | 7928 | 5418 | 166273 | 1:30.69 | 1.681 | 1.432 |
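The columns of the table can be computed from a segmented corpus roughly as follows. The function name `token_stats` and the toy tokens are assumptions. The arithmetic at the end uses figures from the table and suggests that the average token length is the total character count (the INF row, where every token is a single character) divided by the number of tokens.

```python
def token_stats(tokens):
    """Type/token counts and average lengths for a list of segmented tokens."""
    types = set(tokens)
    n_tokens, n_types = len(tokens), len(types)
    return {
        "types": n_types,
        "tokens": n_tokens,
        "tokens_per_type": n_tokens / n_types,
        "avg_len_per_type": sum(len(t) for t in types) / n_types,
        "avg_len_per_token": sum(len(t) for t in tokens) / n_tokens,
    }


# Toy segmented text using Latin letters in place of Chinese characters.
print(token_stats(["ab", "cd", "ab", "e"]))

# Checks against the table: 238,115 characters in total (INF row).
total_chars = 238115
print(total_chars / 2325)    # tokens per type for the INF row, about 102.4
print(total_chars / 208796)  # average token length at threshold 1000, about 1.140
```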
After PI Meeting (3)
• Avg. Token Len. vs. StatDict Recall
After PI Meeting (4)
• Avg. Token Len. vs. Source word match
After PI Meeting (5)
• Avg. Token Len. vs. Source Coverage
After PI Meeting (6)
• Avg. Token Len. vs.
After PI Meeting (7)
• Avg. Token Len. vs. Src/Tgt Coverage of EBMT in MEMT
After PI Meeting (8)
• Avg. Token Len. Vs. Translation Unigram Coverage
After PI Meeting (9)
• Avg. Token Len. vs. Hypothesis Len (length of translation)
The reference translation’s length is 1163 words
After PI Meeting (10)
• Avg. Token Len. Vs. PER
After PI Meeting (11)
• Type-Token curve for Chinese
![Page 40: The current status of Chinese-English EBMT -where are we now](https://reader035.vdocument.in/reader035/viewer/2022062222/568152dd550346895dc0f9b4/html5/thumbnails/40.jpg)
Language Technologies InstituteSchool of Computer Science, Carnegie Mellon University
40
After PI Meeting (12)
• Type-Token curve of Chinese and English
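A type-token curve records how many distinct types have been seen after each additional token. The sketch below shows one simple way to compute such a curve. The function name `type_token_curve` and the `step` parameter are assumptions, and the slides do not say how the plotted curves were produced.

```python
def type_token_curve(tokens, step=1):
    """Return (tokens_seen, types_seen) points, sampled every `step` tokens."""
    seen = set()
    curve = []
    for count, token in enumerate(tokens, start=1):
        seen.add(token)
        if count % step == 0:
            curve.append((count, len(seen)))
    return curve


print(type_token_curve(["a", "b", "a", "c", "b", "d"]))
# [(1, 1), (2, 2), (3, 2), (4, 3), (5, 3), (6, 4)]
```

Running the same function on Chinese tokens from different segmentations and on English tokens would give curves that can be compared on one plot.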
Future Research Plan
• Generalized EBMT
  – Word clustering
  – Grammar induction
• Using machine learning to optimize the parameters used in MEMT
• Better alignment model: integrating segmentation, bracketing and alignment
New Alignment Model (1)
• Using both monolingual and bilingual collocation information to segment and align corpus
New Alignment Model (2)
New Alignment Model (3)
New Alignment Model (4)
References
• Tom Emerson, “Segmentation of Chinese Text”. MultiLingual Computing & Technology, #38, Volume 12, Issue 2. MultiLingual Computing, Inc.