Improving Named Entity Translation Combining Phonetic and Semantic Similarities
Fei Huang, Stephan Vogel, Alex Waibel
Language Technologies Institute, School of Computer Science, CMU
NAACL 2004
Introduction
In the 2001 Chinese-English (C-E) translation evaluation test data, 20% of NEs are not included in the 50K LDC C-E translation lexicon.
Most previous studies focused only on phonetic information.
Some NEs are not translated by their phonetic values (e.g. “南懷仁, Ferdinand Verbiest”).
This work combines phonetic similarities (transliteration) and semantic similarities (context) to cover these non-transliterated NEs.
Source language: Chinese
Target language: English
Surface String Transliteration
Training data: LDC C-E dictionary
Bootstrapping unsupervised learning
Learning transliteration probabilities between pinyin and English letters
Pre-processing: romanize each Chinese word into pinyin.
0th iteration: use edit distance to generate mappings between Chinese and English word pairs.
Use the 3,000 word translations with the minimum edit distance from the 0th iteration to estimate new transliteration probabilities.
Generate new translation mappings using the new transliteration probabilities.
In each iteration, an additional 500 pairs with minimum transliteration cost are added to the existing NE pair list to update the transliteration probabilities.
Repeat until adding more NE pairs no longer improves the extraction accuracy.
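The bootstrapping loop on this slide can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`plain_cost`, `align`, `estimate_costs`, `cheapest_pairs`), the gap cost, the cap on substitution cost, and the toy seed pairs are all invented for the example. It aligns pinyin and English letters with an edit-distance dynamic program, re-estimates letter substitution costs from the aligned pairs, and ranks candidate pairs by the new cost.

```python
import math
from collections import Counter

GAP = 1.0      # insertion/deletion cost (assumed value)
MAX_SUB = 2.0  # cap on a substitution cost (assumed: one delete plus one insert)

def plain_cost(a, b):
    """0th-iteration substitution cost: ordinary edit distance."""
    return 0.0 if a == b else 1.0

def align(src, tgt, sub_cost):
    """Return (total cost, aligned letter pairs) for the cheapest alignment."""
    n, m = len(src), len(tgt)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + GAP
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j] + GAP,                                # delete src letter
                dp[i][j - 1] + GAP,                                # insert tgt letter
                dp[i - 1][j - 1] + sub_cost(src[i - 1], tgt[j - 1]),  # substitute/match
            )
    # Backtrace to recover which letters were aligned to each other.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if dp[i][j] == dp[i - 1][j - 1] + sub_cost(src[i - 1], tgt[j - 1]):
            pairs.append((src[i - 1], tgt[j - 1]))
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + GAP:
            i -= 1
        else:
            j -= 1
    return dp[n][m], pairs[::-1]

def estimate_costs(word_pairs, sub_cost):
    """Re-estimate substitution costs as -log p(english letter | pinyin letter)."""
    counts, totals = Counter(), Counter()
    for pinyin, english in word_pairs:
        _, pairs = align(pinyin, english, sub_cost)
        for s, t in pairs:
            counts[(s, t)] += 1
            totals[s] += 1

    def learned(a, b):
        if totals[a] == 0:          # pinyin letter never observed: fall back
            return plain_cost(a, b)
        p = counts[(a, b)] / totals[a]
        return min(MAX_SUB, math.log(1.0 / p)) if p > 0 else MAX_SUB

    return learned

def cheapest_pairs(candidates, sub_cost, k):
    """Return the k candidate (pinyin, english) pairs with the lowest cost."""
    return sorted(candidates, key=lambda p: align(p[0], p[1], sub_cost)[0])[:k]

# Toy seed pairs, invented for illustration only.
seed = [("bei", "pei"), ("bo", "po"), ("ba", "pa")]
learned_cost = estimate_costs(seed, plain_cost)
```

In a full loop, `cheapest_pairs` would be called once per iteration (with k = 500 on the slide) and its output appended to the pair list before calling `estimate_costs` again; the stopping test on extraction accuracy is left out here.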
Contextual Semantic Similarity
Training data: a subset of the English Xinhua News corpus
Context Vector Selection:
POS Phi-Square:
Weight of POS:
Distance Weight of Location:
Weight Vector:
Semantic similarity between context vectors:
Semantic similarity:
P(vf|ve) is computed with a modified IBM translation Model 2 [Brown et al. 1993]:
I: the length of the source vector
J: the length of the target vector
p(e|f): the word translation probability estimated from a C-E aligned corpus with IBM Model 1
P(ve|vf) is estimated in a similar way.
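The formula for the vector translation probability did not survive in this transcript, so the following is only a rough sketch of the general idea, not the modified Model 2 used in the talk. It assumes the simpler uniform-alignment (IBM Model 1 style) form, where each word of one context vector is explained by the average lexical probability over all words of the other vector. The function name `log_vector_prob`, the probability floor, and the toy lexicon are assumptions made for the example.

```python
import math

def log_vector_prob(target_vec, source_vec, lex, floor=1e-6):
    """log P(target_vec | source_vec) with uniform alignment (Model 1 style).

    lex maps (target_word, source_word) to a word translation probability.
    floor is an assumed smoothing value for unseen word pairs.
    """
    if not target_vec or not source_vec:
        raise ValueError("context vectors must be non-empty")
    length = len(source_vec)
    total = 0.0
    for t in target_vec:
        # Average lexical probability of t over every word in the source vector.
        avg = sum(lex.get((t, s), 0.0) for s in source_vec) / length
        total += math.log(max(avg, floor))
    return total

# Toy lexicon, invented for illustration only.
lex = {("总统", "president"): 0.8, ("访问", "visit"): 0.6}
related = log_vector_prob(["总统", "访问"], ["president", "visit"], lex)
unrelated = log_vector_prob(["总统", "访问"], ["bank", "river"], lex)
```

The reverse direction would use the same function with the arguments swapped and a lexicon estimated in the other direction.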
Cross-lingual Retrieval for NE translations
English NEs in the retrieved text are automatically tagged by IdentiFinder™ from BBN (Bikel et al., 1997).
Overall similarity score:
The NE pairs with the highest overall similarity scores are considered translations.
Since an NE can be translated in several different ways, and typos occur at times, among the top NE hypotheses with similar spelling, the one with the highest frequency is chosen as the translation.
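The final selection step might look roughly like the sketch below. The slide does not say how "similar spelling" is measured or how many top hypotheses are considered, so the use of `difflib.SequenceMatcher`, the `top_n` and `min_ratio` values, the function name `pick_translation`, and the toy scores and frequencies (including the misspelled variant) are all assumptions for illustration.

```python
from difflib import SequenceMatcher

def pick_translation(hypotheses, top_n=5, min_ratio=0.8):
    """Pick a translation from (english_ne, similarity_score, frequency) tuples.

    Among the top_n hypotheses by score, keep those spelled similarly to the
    best one, then return the most frequent of them.
    """
    if not hypotheses:
        return None
    ranked = sorted(hypotheses, key=lambda h: h[1], reverse=True)[:top_n]
    best_name = ranked[0][0].lower()
    variants = [
        h for h in ranked
        if SequenceMatcher(None, best_name, h[0].lower()).ratio() >= min_ratio
    ]
    # The best hypothesis always matches itself, so variants is never empty.
    return max(variants, key=lambda h: h[2])[0]

# Toy hypotheses, invented for illustration only.
hyps = [
    ("Ferdinand Verbeist", 0.91, 2),    # hypothetical typo with a high score
    ("Ferdinand Verbiest", 0.90, 40),   # frequent spelling
    ("Matteo Ricci", 0.55, 120),        # frequent but spelled differently
]
choice = pick_translation(hyps)
```

Here the frequent spelling wins over the higher-scoring typo, while the unrelated but very frequent name is ignored because its spelling differs from the best hypothesis.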
Sentence-based or Document-based?
Test data: Chinese newswire documents; 114 Chinese NEs are selected and translated manually
Indexed corpus: 963,478 English documents from the Xinhua News Agency
Retrieval model: TF-IDF
The top 1,000 results are regarded as the relevant text
The recall of document-based indexing is better (70% compared with 60%).
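As a rough illustration of TF-IDF retrieval over an indexed English corpus, the sketch below ranks documents by cosine similarity to a query and keeps the top results. It assumes the query is already a bag of English terms; the particular IDF variant, the function names (`build_index`, `cosine`, `retrieve`), and the toy documents are assumptions, not details from the talk.

```python
import math
from collections import Counter

def build_index(docs):
    """docs: list of token lists. Returns (TF-IDF vectors, IDF table)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed IDF variant: log(N / df) + 1, so common terms keep some weight.
    idf = {t: math.log(n / c) + 1.0 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def retrieve(query_tokens, vectors, idf, top_k=1000):
    """Return indices of up to top_k documents with non-zero similarity."""
    query = {t: tf * idf[t] for t, tf in Counter(query_tokens).items() if t in idf}
    scored = sorted(((cosine(query, v), i) for i, v in enumerate(vectors)), reverse=True)
    return [i for score, i in scored[:top_k] if score > 0]

# Toy corpus, invented for illustration only.
docs = [
    ["president", "visit", "beijing"],
    ["stock", "market", "fell"],
    ["president", "speech"],
]
vectors, idf = build_index(docs)
hits = retrieve(["president", "visit"], vectors, idf)
```

Sentence-based indexing would pass one token list per sentence to `build_index` instead of one per document; everything else stays the same.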
Experiment Results
Test dataset: NIST 2002 Machine Translation Evaluation test data
100 Chinese documents, 878 sentences, 25,430 words
2,469 NEs are automatically tagged (PER: 20%, LOC: 60%, ORG: 20%)
Only PER and LOC are considered
Among 1,898 tagged PERs and LOCs, 338 are true NEs not covered by the LDC lexicon.
Baseline system:
The CMU statistical MT system (Vogel et al., 2003).
Experiment Results