
Improving Named Entity Translation Combining Phonetic and Semantic Similarities

Fei Huang, Stephan Vogel, Alex Waibel
Language Technologies Institute, School of Computer Science, CMU
NAACL 2004

Introduction

In the 2001 C-E translation evaluation test data, 20% of NEs are not included in the 50K LDC C-E translation lexicon.

Most previous studies focused only on phonetic information.

Some NEs are not translated phonetically (e.g. “南懷仁, Ferdinand Verbiest”).

This work combines phonetic similarity (transliteration) and semantic similarity (context) to cover these non-transliterated NEs.

Source language: Chinese
Target language: English

Surface String Transliteration

Training data: LDC C-E dictionary
Bootstrapping unsupervised learning
Learning transliteration probabilities between pinyin and English letters

Pre-processing: romanize each Chinese word into pinyin.
0th iteration: use edit distance to generate mappings between Chinese and English word pairs.
Use the 3,000 word translations with minimum edit distance from the 0th iteration to estimate new transliteration probabilities.
Generate new translation mappings using the new transliteration probabilities, and repeat.
In each iteration, an additional 500 pairs with minimum transliteration cost are added to the existing NE pair list to update the transliteration probabilities.
Repeat until adding more NE pairs does not improve the extraction accuracy further.
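
The 0th iteration above can be sketched roughly as follows. This is a minimal illustration that assumes plain character-level Levenshtein distance with unit costs; the function names `edit_distance` and `seed_pairs` and the toy word pairs are hypothetical and do not come from the slides, which do not state the exact cost scheme.

```python
def edit_distance(a, b):
    """Unit-cost Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def seed_pairs(candidates, n):
    """Keep the n (pinyin, english) pairs with the smallest edit distance.

    Assumption: these low-distance pairs serve as the seed list from which
    letter-level transliteration probabilities would be estimated.
    """
    return sorted(
        candidates,
        key=lambda pair: edit_distance(pair[0].lower(), pair[1].lower()),
    )[:n]


# Toy dictionary entries (illustrative only).
pairs = [("beijing", "peking"), ("lundun", "london"), ("nanhuairen", "verbiest")]
seeds = seed_pairs(pairs, 2)
```

In later iterations the unit costs would be replaced by costs derived from the learned transliteration probabilities, and the seed list would grow by a fixed number of pairs per round, as the slide describes.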

Contextual Semantic Similarity

Training data: a subset of English Xinhua News corpus

Context Vector Selection:

POS Phi-Square:

Weight of POS:

Distance Weight of Location:

Weight Vector:
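
The formulas for these weights did not survive in this transcript. For orientation only, the standard phi-square statistic over a 2x2 contingency table can be sketched as below. How the authors build the table for each POS tag is not shown here, so the cell meanings in the docstring are an assumption, and `phi_square` is a hypothetical helper name.

```python
def phi_square(a, b, c, d):
    """Standard phi-square association score for a 2x2 contingency table.

    Assumed cell layout (not taken from the slides):
      a = both events observed, b = first only,
      c = second only,          d = neither.
    Returns a value in [0, 1]; 0 means no association.
    """
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:
        return 0.0
    return (a * d - b * c) ** 2 / denom
```

A score near 1 indicates that the two events almost always co-occur or almost never do; a score near 0 indicates independence.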


Semantic Similarity between Context Vectors

Semantic similarity:

P(vf|ve) is computed with a modified IBM translation model-2 [Brown et al. 1993]:

I: the length of the source vector
J: the length of the target vector
p(e|f): the word translation probability estimated from a C-E aligned corpus with IBM Model 1
P(ve|vf) is estimated in a similar way
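
The equation itself is missing from this transcript. As a rough illustration of how a translation probability between two context vectors can be scored from word translation probabilities, the sketch below uses the plain IBM Model 1 form (a product over target words of the average lexical probability over source words). It is not the modified Model 2 used in the slides; the function name, the `floor` parameter, and the toy probability table are all assumptions.

```python
def vector_translation_prob(target_vec, source_vec, t_prob, floor=1e-6):
    """Model-1-style probability of a target context vector given a source one.

    t_prob maps (target_word, source_word) -> lexical translation probability.
    Unseen word pairs fall back to `floor` (an assumed smoothing choice).
    """
    if not source_vec:
        return 0.0
    prob = 1.0
    for tw in target_vec:
        total = sum(t_prob.get((tw, sw), floor) for sw in source_vec)
        prob *= total / len(source_vec)  # uniform alignment, as in Model 1
    return prob


# Toy lexical table (illustrative values only).
toy_t_prob = {("missionary", "传教士"): 0.8, ("astronomy", "天文"): 0.6}
score = vector_translation_prob(
    ["missionary", "astronomy"], ["传教士", "天文"], toy_t_prob, floor=0.0
)
```

Scoring the opposite direction only requires swapping the two vectors and using a lexical table estimated in that direction.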

Cross-lingual Retrieval for NE translations

English NEs in the retrieved text are automatically tagged by IdentiFinder™ from BBN (Bikel et al., 1997).

Overall similarity score:

The NE pairs with the highest overall similarity scores are considered translations.

Since an NE can be translated in several different ways, and the text sometimes contains typos, the most frequent NE among the top hypotheses with similar spelling is chosen as the translation.
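
The selection step above might look like the following sketch. It assumes spelling similarity is measured with `difflib.SequenceMatcher` and a fixed threshold; the slides do not say which measure or cutoff was used, and `pick_translation`, `min_ratio`, and the toy scores and counts are hypothetical.

```python
from difflib import SequenceMatcher


def pick_translation(hypotheses, corpus_counts, min_ratio=0.8):
    """Choose the most frequent spelling among hypotheses close to the best one.

    hypotheses: list of (english_ne, overall_similarity_score)
    corpus_counts: mapping english_ne -> frequency in the retrieved text
    """
    if not hypotheses:
        return None
    best, _ = max(hypotheses, key=lambda h: h[1])
    # Group hypotheses whose spelling is close to the top-scoring one.
    cluster = [
        ne for ne, _ in hypotheses
        if SequenceMatcher(None, ne.lower(), best.lower()).ratio() >= min_ratio
    ]
    # Within the group, frequency decides, which filters out rare typos.
    return max(cluster, key=lambda ne: corpus_counts.get(ne, 0))


hyps = [("Verbeist", 0.91), ("Verbiest", 0.90), ("Beijing", 0.50)]
counts = {"Verbeist": 1, "Verbiest": 7, "Beijing": 100}
chosen = pick_translation(hyps, counts)
```

Here the misspelled variant scores slightly higher but is rare, so the frequent spelling wins, while the unrelated high-frequency name is excluded because its spelling is not close to the top hypothesis.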

Cross-lingual Retrieval for NE translations: Sentence-based or Document-based?

Test data: Chinese newswire documents
114 Chinese NEs are selected and translated manually
Indexed corpus: 963,478 English documents from the Xinhua News Agency
Retrieval model: TF-IDF
The top 1000 results are regarded as the relevant text
The recall of document-based indexing is better (70% compared with 60%).
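
A TF-IDF retrieval step of this kind can be sketched in a few lines. This is a toy, in-memory illustration that assumes whitespace tokenization, a smoothed log IDF, and cosine ranking; the actual indexing engine and weighting variant are not given in the slides, and `build_index`, `cosine`, and `retrieve` are hypothetical names.

```python
import math
from collections import Counter


def build_index(docs):
    """Return (idf table, one TF-IDF vector per document)."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for toks in tokenized for term in set(toks))
    n = len(docs)
    idf = {t: math.log(n / d) + 1.0 for t, d in df.items()}  # assumed smoothing
    vectors = [
        {t: c * idf[t] for t, c in Counter(toks).items()} for toks in tokenized
    ]
    return idf, vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def retrieve(query, docs, top_k=1000):
    """Rank document indices by TF-IDF cosine similarity to the query."""
    idf, vectors = build_index(docs)
    q = {t: c * idf[t] for t, c in Counter(query.lower().split()).items() if t in idf}
    ranked = sorted(range(len(docs)), key=lambda i: cosine(q, vectors[i]), reverse=True)
    return ranked[:top_k]


toy_docs = [
    "beijing hosts trade talks",
    "missionary ferdinand verbiest astronomy beijing",
    "stock markets fall",
]
top = retrieve("verbiest astronomy", toy_docs, top_k=2)
```

Switching between sentence-based and document-based indexing only changes what is passed in as `docs`: individual sentences in one case, whole documents in the other.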

Experiment Results

Test dataset: NIST 2002 Machine Translation Evaluation test data
100 Chinese documents, 878 sentences, 25430 words
2469 NEs are automatically tagged (PER: 20%, LOC: 60%, ORG: 20%)
Only PER and LOC are considered
Among 1,898 tagged PERs and LOCs, 338 are true NEs not covered by the LDC lexicon.

Baseline system: the CMU statistical MT system (Vogel et al., 2003)

