
An Integrated Approach for Arabic-English Named Entity Translation. Hany Hassan (IBM Cairo Technology Development Center), Jeffrey Sorensen (IBM T.J. Watson Research Center). ACL 2005 Workshop (on Feature Engineering for Machine Learning in NLP)



Page 1

An Integrated Approach for Arabic-English Named Entity Translation

Hany Hassan, IBM Cairo Technology Development Center
Jeffrey Sorensen, IBM T.J. Watson Research Center

ACL 2005 Workshop (on Feature Engineering for Machine Learning in NLP)

Page 2


Outline (1/2)

Named Entity (NE) translation is crucial for effective
– cross-language information retrieval
– machine translation

NEs (the focus is only on person names, location names and organization names)
– might be phonetically transliterated: person names
– might also be a mixture of phonetic transliteration and semantic translation: location and organization names

Page 3


Outline (2/2)

An integrated approach
– phrase-based translation
    advantage: frequently used NE phrases
    disadvantage: less frequent words
– word-based translation
    traditional statistical machine translation techniques such as IBM Model 1 (Brown et al., 1993)
    disadvantage: many-to-many phrase translations
– transliteration modules
    advantage: out of vocabulary, unknown words
    disadvantage: frequently used words

Aligning NEs across parallel corpora

Page 4


Related Work (1/2)

Huang et al., 2003:
– used a bilingual dictionary to extract NE pairs and deployed it iteratively to extract more NEs.

Moore, 2003:
– relies on orthographic clues, only suitable for language pairs with similar scripts and/or orthographic conventions

Page 5


Related Work (2/2)

Arabic-related transliteration
– Arbabi et al., 1998: developed a hybrid neural network and knowledge-based system to generate multiple English spellings for Arabic person names.
– Stalls and Knight, 1998: Arabic-English back transliteration.
– Al-Onaizan and Knight, 2002: spelling-based model which directly maps English letter sequences into Arabic letter sequences.

Page 6


Person names tend to be phonetically transliterated
– the idiomatic translation that has been established

For locations and organizations, the translation can be a mixture of translation and transliteration.

Page 7



A parallel corpus
– using NE identifiers similar to the systems described in (Florian et al., 2004) for NE detection.
– to separately acquire the phrases for the phrase based system
– the translation matrix for the word based system
– training data for the transliteration system

Page 8


Translation and Transliteration Modules

Word Based NE Translation
– Basic multi-cost NE Alignment
– Multi-cost NE Alignment by Content Words Elimination

Phrase Based Named Entity Translation

Named Entity Transliteration

System Integration and Decoding

Page 9


Basic multi-cost NE Alignment (1/3)

IBM Model 1 (Brown et al., 1993)

Huang et al., 2003
– multi-cost aligning approach
– The cost for aligning any source and target NE word is defined as:

Ed(we,wf): this phonetic-based edit distance employs an Editex style (Zobel and Dart, 1996) distance measure

Page 10


Basic multi-cost NE Alignment (2/3)

The Editex distance (d) between two letters a and b is: d(a,b) =
– 0 if both are identical
– 1 if they are in the same group
– 2 otherwise
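The letter-level rule above can be sketched in Python as follows. This is only an illustration: the slide does not list the letter groups, so `GROUPS` is an assumed, English-only grouping of similar-sounding letters, and the insertion/deletion cost `indel=2` is likewise an assumption. The function `editex_style` plugs the letter cost into an ordinary edit-distance recurrence; it is not claimed to be the exact measure used in the system.

```python
# Assumed letter groups for illustration only; the slide does not give them.
GROUPS = [set("aeiouy"), set("bp"), set("ckq"), set("dt"), set("lr"),
          set("mn"), set("gj"), set("fpv"), set("sxz"), set("csz")]


def letter_cost(a, b):
    """d(a, b): 0 if identical, 1 if in the same group, 2 otherwise."""
    if a == b:
        return 0
    if any(a in group and b in group for group in GROUPS):
        return 1
    return 2


def editex_style(s, t, indel=2):
    """Edit distance whose substitution cost is letter_cost.

    The insertion/deletion cost (indel) is an assumed constant.
    """
    prev = [j * indel for j in range(len(t) + 1)]
    for i, a in enumerate(s, 1):
        cur = [i * indel]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + indel,                    # delete a
                           cur[j - 1] + indel,                 # insert b
                           prev[j - 1] + letter_cost(a, b)))   # substitute
        prev = cur
    return prev[-1]
```

With these assumed groups, words that differ only in similar-sounding letters get a lower distance than words that differ in unrelated letters, which is the property the alignment cost relies on.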

Page 11


Basic multi-cost NE Alignment (3/3)

Page 12


Multi-cost NE Alignment by Content Words Elimination (1/2)

Content words might be aligned incorrectly to rare NE words

A two-phase alignment approach
– The first phase aligns the content words using a content-word-only translation matrix -> the aligned content words are removed
– Remaining words -> multi-cost alignment

Page 13


Multi-cost NE Alignment by Content Words Elimination (2/2)

Example
– Wsi: content words in the source sentence.
– NEsi: the Named Entity source words.
– Wti: the content words in the target sentence.
– NEti: the Named Entity target words.

Before content words elimination
– Source: Ws1 Ws2 NEs1 NEs2 Ws3 Ws4 Ws5
– Target: Wt1 Wt2 Wt3 NEt1 NEt2 NEt3 Wt4 NEt4

After content words elimination
– Source: NEs1 NEs2 Ws4 Ws5
– Target: Wt3 NEt1 NEt2 NEt3 NEt4
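The elimination step in this example can be sketched in Python. The slide does not say which content words were aligned to each other, so the pairs in `content_links` are a hypothetical alignment chosen to be consistent with the example; the helper name `eliminate_aligned_content` is also just illustrative. Words are identified by their token here, which works for the unique placeholder tokens of the example; a real sentence would need positions instead.

```python
def eliminate_aligned_content(source, target, content_links):
    """Drop content words that the first alignment phase already paired up."""
    aligned_src = {s for s, _ in content_links}
    aligned_tgt = {t for _, t in content_links}
    remaining_src = [w for w in source if w not in aligned_src]
    remaining_tgt = [w for w in target if w not in aligned_tgt]
    return remaining_src, remaining_tgt


source = "Ws1 Ws2 NEs1 NEs2 Ws3 Ws4 Ws5".split()
target = "Wt1 Wt2 Wt3 NEt1 NEt2 NEt3 Wt4 NEt4".split()

# Hypothetical output of the content-word-only alignment phase.
content_links = [("Ws1", "Wt1"), ("Ws2", "Wt2"), ("Ws3", "Wt4")]

remaining_source, remaining_target = eliminate_aligned_content(
    source, target, content_links)
```

The words left over (unaligned content words plus all NE words) are what the second, multi-cost alignment phase would then work on.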

Page 14


Phrase Based Named Entity Translation (1/2)

Uses the approach of Tillmann (2003) for block generation, with modifications suitable for NE phrase extraction.

A block is defined to be any pair of source and target phrases.

This approach starts from a word alignment generated by HMM Viterbi training (Vogel et al., 1996), which is done in both directions between source and target.
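As a rough illustration of block generation from a bidirectional word alignment, the Python sketch below intersects the two directional alignments and then collects every source span together with the smallest target span that is consistent with it. This is the generic consistency criterion from phrase extraction, used here as a stand-in; the NE-specific modifications are not described on the slide, and the name `extract_blocks` and the `max_len` limit are assumptions.

```python
def extract_blocks(src_to_tgt, tgt_to_src, src_len, max_len=3):
    """Return blocks ((i1, i2), (j1, j2)) consistent with the intersected links.

    Both alignments are sets of (source_index, target_index) pairs.
    """
    links = set(src_to_tgt) & set(tgt_to_src)   # keep links found in both directions
    blocks = []
    for i1 in range(src_len):
        for i2 in range(i1, min(i1 + max_len, src_len)):
            # Target positions linked to the source span [i1, i2].
            js = [j for i, j in links if i1 <= i <= i2]
            if not js:
                continue
            j1, j2 = min(js), max(js)
            # Consistency: nothing inside the target span links outside the source span.
            if all(i1 <= i <= i2 for i, j in links if j1 <= j <= j2):
                blocks.append(((i1, i2), (j1, j2)))
    return blocks


# Toy alignments: the link (2, 3) appears in one direction only and is dropped.
src_to_tgt = {(0, 0), (1, 2), (2, 1), (2, 3)}
tgt_to_src = {(0, 0), (1, 2), (2, 1)}
blocks = extract_blocks(src_to_tgt, tgt_to_src, src_len=3)
```

In the toy example, source words 1 and 2 swap order on the target side, so they form a block together, while the span covering source words 0 and 1 is rejected because its target span would pull in a word aligned to source word 2.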

Page 15


Phrase Based Named Entity Translation (2/2)

Page 16


Named Entity Transliteration (1/3)

Handles Out Of Vocabulary (OOV) words that are not covered by the word or phrase based models.

These source and target sequences form the blocks, which enables the modeling of vowel insertion.

Page 17


Named Entity Transliteration (2/3)

Arabic name -> “Shoukry”

The system tries to model bi-grams from the source language to n-grams in the target language as follows:
– $k -> shouk
– kr -> kr
– ry -> ry
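A toy Python sketch of this mapping follows. The three blocks are the ones listed above; the romanized source string `$kry` is inferred from those blocks, and the overlap-merging rule in `merge` is an assumption made purely so the example composes into a full word. The actual system decodes with a weighted finite state transducer rather than a lookup table.

```python
# Blocks taken from the example above: source bigram -> target n-gram.
BLOCKS = {"$k": "shouk", "kr": "kr", "ry": "ry"}


def source_bigrams(word):
    """Overlapping letter bigrams of a romanized source word."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def merge(left, right):
    """Append right to left, collapsing the longest suffix/prefix overlap.

    This merging rule is an assumption for illustration.
    """
    for k in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:k]):
            return left + right[k:]
    return left + right


def transliterate(word, blocks=BLOCKS):
    """Map each source bigram to its target block and merge the results.

    Raises KeyError for a bigram that has no block in the table.
    """
    out = ""
    for bigram in source_bigrams(word):
        out = merge(out, blocks[bigram])
    return out
```

Note how the block `$k -> shouk` inserts the vowels "ou" that have no counterpart in the source string, which is the vowel insertion behaviour the blocks are meant to capture.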

Page 18


Named Entity Transliteration (3/3)

Use the translation matrix from the word based alignment models.
– Translations with probabilities less than a certain threshold are filtered out.
– Pairs whose distance between the romanized Arabic and the English is greater than the threshold are also filtered out.
– The remaining highly confident name pairs are used to train a letter to letter translation matrix using HMM Viterbi training (Vogel et al., 1996).
– a source block s and a target block t, P(t | s)
– a Weighted Finite State Transducer (WFST) for translating any source sequence to a target sequence
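The two filtering steps can be sketched in Python as below. Everything concrete here is assumed for illustration: the candidate triples and their probabilities are made-up toy values, the thresholds `min_prob` and `max_dist` are arbitrary, and a plain Levenshtein distance stands in for whatever distance the system applies between the romanized Arabic and the English.

```python
def levenshtein(s, t):
    """Plain edit distance with unit costs (a stand-in distance measure)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # substitute or match
        prev = cur
    return prev[-1]


def filter_name_pairs(candidates, min_prob, max_dist):
    """Keep (arabic, english) pairs that pass both thresholds."""
    return [(ar, en) for ar, en, prob in candidates
            if prob >= min_prob and levenshtein(ar, en) <= max_dist]


# Toy (romanized Arabic, English, translation probability) triples; values are made up.
candidates = [("$kry", "shoukry", 0.6),
              ("$kry", "minister", 0.4),
              ("mSr", "egypt", 0.05)]

kept = filter_name_pairs(candidates, min_prob=0.1, max_dist=4)
```

In this toy run the second pair is dropped by the distance test and the third by the probability test, leaving only the pair that looks like a genuine transliteration to serve as training data.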

Page 19


System Integration and Decoding

Used a dynamic programming beam search decoder similar to the decoder described by Tillmann (2003).

Monolingual target data -> NE phrases
– The first language model is a trigram language model on NE phrases.
– The second language model is a class based language model with a class for unknown NEs.

Page 20


Experimental Setup (1/4)

Three NE categories, namely names of persons, organizations, and locations.

Trained on a news domain parallel corpus containing 2.8 million Arabic words and 3.4 million English words.

Monolingual English data was annotated with NE types.

Page 21


Experimental Setup (2/4)

Manually constructed a test set.

The BLEU score (Papineni et al., 2002) with a single reference translation was used for evaluation.

BLEU-3, which uses n-grams up to length 3, is used because a three-word phrase is a reasonable length for the various NE types.
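For reference, a minimal single-reference BLEU-3 can be sketched in Python as follows. It is an unsmoothed, sentence-level version written for illustration (the function name `bleu3` is an assumption), so it returns 0 whenever any n-gram order has no match, including for candidates shorter than three words; the evaluation reported in the talk may have been computed at corpus level.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Counter of all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu3(candidate, reference):
    """Unsmoothed BLEU with n-grams up to 3 against a single reference."""
    log_precision_sum = 0.0
    for n in (1, 2, 3):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        total = sum(cand.values())
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_precision_sum += math.log(clipped / total)
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1.0 - len(reference) / len(candidate))
    return bp * math.exp(log_precision_sum / 3)
```

For the four-token pair `a b c d` versus `a b c e`, the precisions are 3/4, 2/3 and 1/2, so the score is the cube root of their product, 0.25.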

Page 22


Experimental Setup (3/4)

Page 23


Experimental Setup (4/4)

Page 24


Conclusion and Future Work

We have presented an integrated system that can handle various NE categories and requires the regular parallel and monolingual corpora that are typically used to train any statistical machine translation system, along with an NE identifier.

We will evaluate the effect of the system on CLIR and MT tasks.