TRANSCRIPT
Automatic Translation of Named Entities in Multiple Languages Using Web Search Engines
Presented by Richard C. Wang
Supervised by Teruko Mitamura
December 15, 2005
Presentation Outline
Introduction
Related Work
System Implementation
Experimental Results
Conclusion
Future Work
Introduction
Machine Translation
- Reduces human labor for translating text
- One challenge: translating newly emerged proper nouns (named entities)
  - Movie, book, magazine, protein, cell, disease, person, location, company, organization names, etc.
  - No central database stores named entities (and their translations)
World Wide Web: an enormous unstructured corpus
- Contains named entities in various languages
Automatic translation of named entities in multiple languages
- Near-language-independent approach
- Utilizes popular search engines: Google, Yahoo!, AlltheWeb
Related Work
- Wu, Lin, and Chang (2005): English-to-Chinese NE MT
  - Surface pattern knowledge learner
  - Trained transliteration model
- Shima (2005): English-to-Japanese NE MT
  - Hand-coded transliteration model
  - Heuristic computations using N-grams
- Huang, Zhang, and Vogel (2005): Chinese-to-English NE MT
  - Cross-lingual query expansion
  - Trained transliteration model
  - IBM translation model
  - Frequency-Distance model
In contrast, our system requires neither training data nor a transliteration model.
System Architecture Overview
Search Engines (External) → Search Results → Segment Parser → Segments → Translation Candidate Extractor → Translation Candidates → Translation Candidate Filter → Filtered Translation Candidates → Candidate Score Calculator → Scored Translation Candidates
Querying the World Wide Web
We want to retrieve documents containing both the source word s and the target word t.
- Search for s using Google, Yahoo!, and AlltheWeb, requesting web pages written in the same language as t
- Current system supports the target languages English, Simplified/Traditional Chinese, Japanese, and Korean
  - A target language can be added easily (see the Adding a New Target Language slide)
- Current system allows s to be in any language, except that s and t must be in languages that use different character sets
  - e.g. (English, Chinese), (Korean, Japanese), (Hebrew, English)
  - This limitation can be overcome by using English as a pivot language (see the Future Work slides)
Preprocessing Returned Results
A search result snippet consists of a title and one or more segments. Our system preprocesses results by:
- Extracting each snippet and inserting it into N, such that no two snippets in N have duplicate titles
- Extracting each segment from each snippet in N and inserting it into G, such that no segment in G is a substring of another segment in G
  - This prevents words from receiving biased weights, since weights depend on occurrence frequencies
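The two preprocessing steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `preprocess`, the (title, snippet) input format, and splitting segments on sentence punctuation are all assumptions.

```python
import re

def preprocess(results):
    """Deduplicate snippets by title (N), then keep only maximal segments (G).

    `results` is a list of (title, snippet) pairs. Each snippet is split
    into segments on sentence punctuation; a segment is dropped if it is
    a substring of another segment, so word frequencies are not inflated
    by duplicated text.
    """
    seen_titles = set()
    snippets = []  # N: snippets with unique titles
    for title, snippet in results:
        if title not in seen_titles:
            seen_titles.add(title)
            snippets.append(snippet)

    segments = []
    for snippet in snippets:
        segments.extend(
            s.strip() for s in re.split(r"[.!?\u3002]", snippet) if s.strip()
        )

    # G: discard any segment that is a substring of another segment
    return [s for s in segments
            if not any(s != t and s in t for t in segments)]
```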
Extracting Translation Candidates
Translation candidate: any lonely cluster in the target language that resides in the same segment as the source word
- Our system uses regular expression patterns to extract lonely clusters
- Oftentimes at least one correct translation in the returned results is lonely
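Lonely-cluster extraction can be sketched with per-language regular expressions. This is only an illustration: the specific patterns, the Unicode ranges chosen, and the function name `extract_candidates` are assumptions, not the system's actual patterns.

```python
import re

# Illustrative patterns for maximal runs of target-language characters
# ("lonely clusters") inside a segment written in another script.
TARGET_PATTERNS = {
    # letters possibly joined by spaces, periods, hyphens, apostrophes
    "english":  re.compile(r"[A-Za-z][A-Za-z .\-']*[A-Za-z]"),
    # hiragana, katakana (incl. middle dot / long vowel mark), kanji
    "japanese": re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]+"),
    # CJK unified ideographs
    "chinese":  re.compile(r"[\u4e00-\u9fff]+"),
}

def extract_candidates(segment, target_lang):
    """Return every target-language cluster found in the segment."""
    return TARGET_PATTERNS[target_lang].findall(segment)
```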
Filtering Translation Candidates
Suppose candidate A is a substring of candidate B. If B occurs more than half as often as A across all segments, then A is discarded.
For example:

  Candidate               TF
  A "Back to the"         40
  B "Back to the Future"  25

Since TF(B) > 0.5 × TF(A), A is discarded.
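The filtering rule above can be sketched directly. The function name `filter_candidates` and the dict-based interface are assumptions for illustration.

```python
def filter_candidates(tf):
    """Apply the substring filter.

    `tf` maps candidate -> term frequency over all segments. A candidate
    A is discarded when some candidate B contains A as a substring and
    TF(B) > 0.5 * TF(A).
    """
    kept = {}
    for a, tf_a in tf.items():
        subsumed = any(a != b and a in b and tf_b > 0.5 * tf_a
                       for b, tf_b in tf.items())
        if not subsumed:
            kept[a] = tf_a
    return kept
```

With the slide's example, TF(B) = 25 > 0.5 × 40, so "Back to the" is dropped and "Back to the Future" survives.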
Ranking Translation Candidates
Example source word: "The Lord of the Rings"; target language: Japanese

  Feature  Definition
  TF_c     # of occurrences of c in all segments
  DF_c     # of segments that contain c
  CTF_c    # of occurrences of lonely c in all segments
  CDF_c    # of segments that contain lonely c
  NG_c     # of grams that c consists of
  WD_c     sum of inverse word distances between s and c in all segments

  Score_c = TF_c/max(TF) + DF_c/max(DF) + CTF_c/max(CTF) + CDF_c/max(CDF) + NG_c/max(NG) + WD_c/max(WD)
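The ranking score sums the six features, each normalized by its maximum over all candidates. A minimal sketch (function name and input format are assumptions):

```python
def score(features):
    """Compute the normalized-sum score for each candidate.

    `features` maps candidate -> dict with keys TF, DF, CTF, CDF, NG, WD.
    Each feature value is divided by the maximum of that feature over all
    candidates, and the six normalized values are summed.
    """
    names = ["TF", "DF", "CTF", "CDF", "NG", "WD"]
    maxima = {n: max(f[n] for f in features.values()) or 1 for n in names}
    return {c: sum(f[n] / maxima[n] for n in names)
            for c, f in features.items()}
```

A candidate that leads on every feature scores the maximum possible value of 6.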
Adding a New Target Language
Three basic elements:
- Tokenization pattern: a regular expression pattern for tokenization
- Search engine language codes: the codes for the target language for each of the search engines
- Other general properties: the common minimum number of grams/alphabets for named entities in the target language, and whether the language is spaced or non-spaced
Experimental Data

  Gold Test Word     Original Gold Translation                         Additional Gold Trans.
  纽约客             new yorker                                        The New Yorker
  牛虻               The Gadfly                                        gadfly
  汇丰银行           HSBC                                              Hong Kong and Shanghai Banking Corporation
  海豹               seal                                              seals
  Mt. Pinatubo       ピナツボ火山, ピナトゥボ火山                      ピナツボ山
  Roger Dingman      ロージャー・ディングマン, ロージャーディングマン  ロジャーディングマン
  Jean-Henri Dunant  アンリ・デュナン, アンリデュナン                  ジャン・アンリ・デュナン
  Charles Wang       チャールズ・ウォン, チャールズウォン              チャールズ・ワン

  Dataset  # Test Words  Source-Target
  EJ       202           English-Japanese
  CE       310           Simplified Chinese-English
Evaluation Metric
Translatable words: words for which our system is able to produce at least one translation candidate

  P = # Correctly Translated Words / # Translatable Words
  R = # Correctly Translated Words / # Gold Standard Test Words
  F1 = 2PR / (P + R)
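The metric can be computed directly from three counts; a minimal sketch (function name is an assumption):

```python
def evaluate(n_correct, n_translatable, n_gold):
    """Precision, recall, and F1 as defined in the evaluation metric."""
    p = n_correct / n_translatable
    r = n_correct / n_gold
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

For instance, 100 correct out of 155 translatable words and 202 gold test words gives P ≈ 64.5%, R ≈ 49.5%, F1 ≈ 56.0%, matching the EJ Top-1 row later in the talk.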
Usefulness of Features
[Figure: F1 scores on the EJ (202) and CE (310) datasets for cumulative heuristic methods, from TF alone up to CDF + CTF + Filter + WD + DF + NG + TF; y-axis 0%-70%]
Experimental Results (EJ)
[Figure: Performance of Top N Translations (EJ dataset): precision, recall, and F1 vs. N, for N up to 30]

  Test Set: English-Japanese
  # Test Words: 202
  # Translated Words: 155
  Average # Snippets: 127

         # Correctly Translated Words  Precision  Recall  F1
  Top 1  100                           64.52%     49.50%  56.02%
  Top 2  110                           70.97%     54.46%  61.62%
  Top 3  115                           74.19%     56.93%  64.43%
  Top 4  116                           74.84%     57.43%  64.99%
  Top 5  117                           75.48%     57.92%  65.55%
Experimental Results (CE)
  Test Set: Chinese-English
  # Test Words: 310
  # Translated Words: 274
  Average # Snippets: 79

         # Correctly Translated Words  Precision  Recall  F1
  Top 1  196                           71.53%     63.23%  67.12%
  Top 2  215                           78.47%     69.35%  73.63%
  Top 3  230                           83.94%     74.19%  78.77%
  Top 4  232                           84.67%     74.84%  79.45%
  Top 5  237                           86.50%     76.45%  81.16%

[Figure: Performance of Top N Translations (CE dataset): precision, recall, and F1 vs. N, for N up to 30]
Performance Comparison (EJ)

  Test Set (All Features)  # Translated  # Correct  Precision  Recall  F1
  Original Test Set        155           93         0.600      0.460   0.521
  Extended Test Set        155           100        0.645      0.495   0.560
Performance Comparison (CE)

  Test Set           Avg # Snippets  Top 1  Top 2  Top 3  Top 4  Top 5  (Precision)
  Original Test Set  79              64.6%  72.3%  77.0%  78.5%  80.2%
  Extended Test Set  79              71.5%  78.5%  83.9%  84.7%  86.5%
Conclusion
- Even though our system is not state-of-the-art, it can translate named entities in multiple languages with decent performance
- Most of the correct translations are ranked in the top 3
- Coverage of correct translations in the search results needs improvement
Future Work (1)
- Incorporate a language-specific transliteration model to improve performance on a particular language pair
- Try techniques similar to the cross-lingual query expansion proposed by Huang, Zhang, and Vogel (2005) to expand the coverage of correct translations in the returned search results
Future Work (2)
Translate named entities from non-English to non-English by using English as a pivot ("I am a pivot!"):

  神鬼戰士 (Traditional Chinese) ↔ Gladiator (English)
  角斗士 (Simplified Chinese)   ↔ Gladiator (English)
  グラディエーター (Japanese)   ↔ Gladiator (English)

Note: This is a real, working example.
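Pivot translation amounts to composing two ranked translation tables through English. A sketch under assumptions (the function name, the score-combination rule, and the scores below are all illustrative; the words are the slide's real example):

```python
def pivot_translate(source_word, to_english, from_english):
    """Translate source -> English -> target, combining candidate scores.

    `to_english` and `from_english` map a word to a dict of
    {translation: score}. For each target candidate reachable through
    some English pivot word, keep the best product of the two scores.
    """
    results = {}
    for en, s1 in to_english.get(source_word, {}).items():
        for tgt, s2 in from_english.get(en, {}).items():
            results[tgt] = max(results.get(tgt, 0.0), s1 * s2)
    return sorted(results, key=results.get, reverse=True)
```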
Future Work (3): Named Entity Translations
Search for closely related named entities, e.g.:
- Alon Lavie → Lori Levin, Donna Gates, Alex Waibel, Carnegie Mellon University, Language Technologies Institute
- Teruko Mitamura → Eric Nyberg, Carnegie Mellon University, Lori Levin, Language Technologies Institute, Keith Miller, Kathryn Baker, Owen Rambow
- Robert Frederking → Carnegie Mellon University, Ralf Brown, Eduard Hovy, Alan Black, Lori Levin, Eric Nyberg, Alon Lavie, Nancy Ide, Teruko Mitamura, Language Technologies Institute
- Carnegie Mellon University → Pennsylvania, Pittsburgh