TRANSCRIPT
Automatic Translation of Named Entities in Multiple Languages Using Web Search Engines
Presented by Richard C. Wang
Supervised by Teruko Mitamura
December 15, 2005
Presentation Outline
Introduction
Related Work
System Implementation
Experimental Results
Conclusion
Future Work
Introduction
Machine Translation
- Reduces human labor for translating text
- One challenge: translating newly emerged proper nouns (named entities)
  - Movie, book, magazine, protein, cell, disease, person, location, company, organization names, etc.
  - No central database stores named entities (and their translations)
World Wide Web: an enormous unstructured corpus
- Contains named entities in various languages
Automatic translation of named entities in multiple languages
- Near-language-independent approach
- Utilizes popular search engines: Google, Yahoo!, AlltheWeb
Related Work
- Wu, Lin, and Chang (2005): English-to-Chinese NE MT
  - Surface pattern knowledge learner
  - Trained transliteration model
- Shima (2005): English-to-Japanese NE MT
  - Hand-coded transliteration model
  - Heuristic computations using N-grams
- Huang, Zhang, and Vogel (2005): Chinese-to-English NE MT
  - Cross-lingual query expansion
  - Trained transliteration model
  - IBM translation model
  - Frequency-Distance model
In contrast, our system requires neither training data nor a transliteration model.
System Architecture Overview
Search Engines (External) → Search Results → Segment Parser → Segments → Translation Candidate Extractor → Translation Candidates → Translation Candidate Filter → Filtered Translation Candidates → Candidate Score Calculator → Scored Translation Candidates
Querying the World Wide Web
We want to retrieve documents containing both the source word s and the target word t.
- Search for s using Google, Yahoo!, and AlltheWeb, requesting web pages written in the same language as t
- Current system supports the target languages English, Simplified/Traditional Chinese, Japanese, and Korean
  - A target language can be added easily (see the Adding a New Target Language slide)
- Current system allows s to be in any language, except that s and t must be in languages that use different character sets
  - e.g. (English, Chinese), (Korean, Japanese), (Hebrew, English)
  - This limitation can be overcome by using English as a pivot language (see the Future Work slides)
Preprocessing Returned Results
A search result snippet consists of a title and one or more segments. Our system preprocesses results by:
- Extracting each snippet and inserting it into N, such that no two snippets in N have duplicate titles
- Extracting each segment from each snippet in N and inserting it into G, such that no segment in G is a substring of another segment in G
  - This prevents words from receiving biased weights, since weights depend on occurrence frequencies
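The two preprocessing steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `preprocess`, the (title, snippet) input format, and splitting segments on sentence punctuation are all assumptions.

```python
import re

def preprocess(results):
    """Deduplicate snippets by title (N), then keep only maximal segments (G).

    `results` is a list of (title, snippet) pairs. Each snippet is split
    into segments on sentence punctuation; a segment is dropped if it is
    a substring of another segment, so word frequencies are not inflated
    by duplicated text.
    """
    seen_titles = set()
    snippets = []  # N: snippets with unique titles
    for title, snippet in results:
        if title not in seen_titles:
            seen_titles.add(title)
            snippets.append(snippet)

    segments = []
    for snippet in snippets:
        segments.extend(
            s.strip() for s in re.split(r"[.!?\u3002]", snippet) if s.strip()
        )

    # G: discard any segment that is a substring of another segment
    return [s for s in segments
            if not any(s != t and s in t for t in segments)]
```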
Extracting Translation Candidates
Translation candidate: any lonely cluster in the target language that resides in the same segment as the source word
- Our system uses regular expression patterns to extract lonely clusters
- Oftentimes at least one correct translation in the returned results is lonely
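Lonely-cluster extraction can be sketched with per-language regular expressions. This is only an illustration: the specific patterns, the Unicode ranges chosen, and the function name `extract_candidates` are assumptions, not the system's actual patterns.

```python
import re

# Illustrative patterns for maximal runs of target-language characters
# ("lonely clusters") inside a segment written in another script.
TARGET_PATTERNS = {
    # letters possibly joined by spaces, periods, hyphens, apostrophes
    "english":  re.compile(r"[A-Za-z][A-Za-z .\-']*[A-Za-z]"),
    # hiragana, katakana (incl. middle dot / long vowel mark), kanji
    "japanese": re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]+"),
    # CJK unified ideographs
    "chinese":  re.compile(r"[\u4e00-\u9fff]+"),
}

def extract_candidates(segment, target_lang):
    """Return every target-language cluster found in the segment."""
    return TARGET_PATTERNS[target_lang].findall(segment)
```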
Filtering Translation Candidates
Suppose candidate A is a substring of candidate B. If B occurs more than half as often as A across all segments, then A is discarded.
For example:

  Candidate               TF
  A "Back to the"         40
  B "Back to the Future"  25

Since TF(B) > 0.5 × TF(A), A is discarded.
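The filtering rule above can be sketched directly. The function name `filter_candidates` and the dict-based interface are assumptions for illustration.

```python
def filter_candidates(tf):
    """Apply the substring filter.

    `tf` maps candidate -> term frequency over all segments. A candidate
    A is discarded when some candidate B contains A as a substring and
    TF(B) > 0.5 * TF(A).
    """
    kept = {}
    for a, tf_a in tf.items():
        subsumed = any(a != b and a in b and tf_b > 0.5 * tf_a
                       for b, tf_b in tf.items())
        if not subsumed:
            kept[a] = tf_a
    return kept
```

With the slide's example, TF(B) = 25 > 0.5 × 40, so "Back to the" is dropped and "Back to the Future" survives.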
Ranking Translation Candidates
Example source word: "The Lord of the Rings"; target language: Japanese

  Feature  Definition
  TF_c     # of occurrences of c in all segments
  DF_c     # of segments that contain c
  CTF_c    # of occurrences of lonely c in all segments
  CDF_c    # of segments that contain lonely c
  NG_c     # of grams that c consists of
  WD_c     sum of inverse word distances between s and c in all segments

  Score_c = TF_c/max(TF) + DF_c/max(DF) + CTF_c/max(CTF) + CDF_c/max(CDF) + NG_c/max(NG) + WD_c/max(WD)
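The ranking score sums the six features, each normalized by its maximum over all candidates. A minimal sketch (function name and input format are assumptions):

```python
def score(features):
    """Compute the normalized-sum score for each candidate.

    `features` maps candidate -> dict with keys TF, DF, CTF, CDF, NG, WD.
    Each feature value is divided by the maximum of that feature over all
    candidates, and the six normalized values are summed.
    """
    names = ["TF", "DF", "CTF", "CDF", "NG", "WD"]
    maxima = {n: max(f[n] for f in features.values()) or 1 for n in names}
    return {c: sum(f[n] / maxima[n] for n in names)
            for c, f in features.items()}
```

A candidate that leads on every feature scores the maximum possible value of 6.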
Adding a New Target Language
Three basic elements:
- Tokenization pattern: a regular expression pattern for tokenization
- Search engine language codes: the codes for the target language for each of the search engines
- Other general properties: the common minimum number of grams/alphabets for named entities in the target language, and whether the language is spaced or non-spaced
Experimental Data

  Gold Test Word     Original Gold Translation                         Additional Gold Trans.
  纽约客             new yorker                                        The New Yorker
  牛虻               The Gadfly                                        gadfly
  汇丰银行           HSBC                                              Hong Kong and Shanghai Banking Corporation
  海豹               seal                                              seals
  Mt. Pinatubo       ピナツボ火山, ピナトゥボ火山                      ピナツボ山
  Roger Dingman      ロージャー・ディングマン, ロージャーディングマン  ロジャーディングマン
  Jean-Henri Dunant  アンリ・デュナン, アンリデュナン                  ジャン・アンリ・デュナン
  Charles Wang       チャールズ・ウォン, チャールズウォン              チャールズ・ワン

  Dataset  # Test Words  Source-Target
  EJ       202           English-Japanese
  CE       310           Simplified Chinese-English
Evaluation Metric
Translatable words: words for which our system is able to produce at least one translation candidate

  P = # Correctly Translated Words / # Translatable Words
  R = # Correctly Translated Words / # Gold Standard Test Words
  F1 = 2PR / (P + R)
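The metric can be computed directly from three counts; a minimal sketch (function name is an assumption):

```python
def evaluate(n_correct, n_translatable, n_gold):
    """Precision, recall, and F1 as defined in the evaluation metric."""
    p = n_correct / n_translatable
    r = n_correct / n_gold
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

For instance, 100 correct out of 155 translatable words and 202 gold test words gives P ≈ 64.5%, R ≈ 49.5%, F1 ≈ 56.0%, matching the EJ Top-1 row later in the talk.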
Usefulness of Features
[Figure: F1 scores on the EJ (202) and CE (310) datasets for cumulative heuristic methods, from TF alone up to CDF + CTF + Filter + WD + DF + NG + TF; y-axis 0%-70%]
Experimental Results (EJ)
[Figure: Performance of Top N Translations (EJ dataset): precision, recall, and F1 vs. N, for N up to 30]

  Test Set: English-Japanese
  # Test Words: 202
  # Translated Words: 155
  Average # Snippets: 127

         # Correctly Translated Words  Precision  Recall  F1
  Top 1  100                           64.52%     49.50%  56.02%
  Top 2  110                           70.97%     54.46%  61.62%
  Top 3  115                           74.19%     56.93%  64.43%
  Top 4  116                           74.84%     57.43%  64.99%
  Top 5  117                           75.48%     57.92%  65.55%
Experimental Results (CE)
  Test Set: Chinese-English
  # Test Words: 310
  # Translated Words: 274
  Average # Snippets: 79

         # Correctly Translated Words  Precision  Recall  F1
  Top 1  196                           71.53%     63.23%  67.12%
  Top 2  215                           78.47%     69.35%  73.63%
  Top 3  230                           83.94%     74.19%  78.77%
  Top 4  232                           84.67%     74.84%  79.45%
  Top 5  237                           86.50%     76.45%  81.16%

[Figure: Performance of Top N Translations (CE dataset): precision, recall, and F1 vs. N, for N up to 30]
Performance Comparison (EJ)

  Test Set (All Features)  # Translated  # Correct  Precision  Recall  F1
  Original Test Set        155           93         0.600      0.460   0.521
  Extended Test Set        155           100        0.645      0.495   0.560
Performance Comparison (CE)

  Test Set           Avg # Snippets  Top 1  Top 2  Top 3  Top 4  Top 5  (Precision)
  Original Test Set  79              64.6%  72.3%  77.0%  78.5%  80.2%
  Extended Test Set  79              71.5%  78.5%  83.9%  84.7%  86.5%
Conclusion
- Even though our system is not state-of-the-art, it can translate named entities in multiple languages with decent performance
- Most of the correct translations are ranked in the top 3
- Coverage of correct translations in the search results needs improvement
Future Work (1)
- Incorporate a language-specific transliteration model to improve performance on a particular language pair
- Try techniques similar to the cross-lingual query expansion proposed by Huang, Zhang, and Vogel (2005) to expand the coverage of correct translations in the returned search results
Future Work (2)
Translate named entities from non-English to non-English by using English as a pivot ("I am a pivot!"):

  神鬼戰士 (Traditional Chinese) ↔ Gladiator (English)
  角斗士 (Simplified Chinese)   ↔ Gladiator (English)
  グラディエーター (Japanese)   ↔ Gladiator (English)

Note: This is a real, working example.
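Pivot translation amounts to composing two ranked translation tables through English. A sketch under assumptions (the function name, the score-combination rule, and the scores below are all illustrative; the words are the slide's real example):

```python
def pivot_translate(source_word, to_english, from_english):
    """Translate source -> English -> target, combining candidate scores.

    `to_english` and `from_english` map a word to a dict of
    {translation: score}. For each target candidate reachable through
    some English pivot word, keep the best product of the two scores.
    """
    results = {}
    for en, s1 in to_english.get(source_word, {}).items():
        for tgt, s2 in from_english.get(en, {}).items():
            results[tgt] = max(results.get(tgt, 0.0), s1 * s2)
    return sorted(results, key=results.get, reverse=True)
```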
Future Work (3): Named Entity Translations
Search for closely related named entities, e.g.:
- Alon Lavie → Lori Levin, Donna Gates, Alex Waibel, Carnegie Mellon University, Language Technologies Institute
- Teruko Mitamura → Eric Nyberg, Carnegie Mellon University, Lori Levin, Language Technologies Institute, Keith Miller, Kathryn Baker, Owen Rambow
- Robert Frederking → Carnegie Mellon University, Ralf Brown, Eduard Hovy, Alan Black, Lori Levin, Eric Nyberg, Alon Lavie, Nancy Ide, Teruko Mitamura, Language Technologies Institute
- Carnegie Mellon University → Pennsylvania, Pittsburgh