Design of a Multi-lingual MT for Real-time Broadcast Captioning
Course Project for 11-731
Ying Zhang (Joy)
[email protected]
Advisor: Eric Nyberg
April 18th, 2001
Project Description
A broadcasting company wishes to translate the captioning for their show. Translations must be provided from English to multiple target languages. Real-time, high-accuracy translations are required. If the captions are poorly formed, then they need not be translated, but the customer would like you to consider teaching a controlled language to the captioners, so that high-quality translation can be achieved.
Domain Analysis (Cont.)
• Special requirements
– Translating spoken language
– The system must perform in real time
– The system cannot rely on pre-editing or post-editing
– The system should provide positive information to users as much as possible
Domain Analysis
• Characteristics amenable to MT
– The domain is narrow
– Possible to build a large-scale monolingual corpus
– Not necessary to translate every utterance in the broadcast
• Well-defined discourse structure (greetings, etc.)
• No correspondence in another culture (“the bulls outrunning the bears today on Wall Street”)
Domain Analysis-Data(1)
• Fixed Patterns
000084 WWW.MEDIACAPTIONING.COM [CLOSED CAPTIONING
000085 PROVIDED BY BELL ATLANTIC, THE
000089 HEART OF COMMUNICATION] >>
000090 FROM CNN IN WASHINGTON, THIS
000091 IS "WORLDVIEW." I'M BERNARD
000094 SHAW. >> AND I'M JUDY WOODRUFF. >>>
000095 TALKS BETWEEN PROTESTANT AND
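Fixed patterns like these could be recognized with a small set of templates before any general translation is attempted. The following is a minimal sketch, assuming regular-expression templates written against the sample above; the template names and the `FIXED_PATTERNS` table are hypothetical and are not part of the existing system.

```python
import re

# Hypothetical templates for recurring caption boilerplate, written
# against the sample captions on this slide.
FIXED_PATTERNS = {
    "caption_credit": re.compile(
        r"\[CLOSED CAPTIONING PROVIDED BY (?P<sponsor>[^\],]+)"),
    "show_opening": re.compile(
        r'FROM (?P<network>\w+) IN (?P<city>\w+), THIS IS "(?P<show>[^"]+?)\.?"'),
    "anchor_intro": re.compile(r"\bI'M (?P<name>[A-Z]+ [A-Z]+)\."),
}


def join_caption_lines(lines):
    """Drop the leading frame counter from each line and join the text."""
    return " ".join(re.sub(r"^\d+\s+", "", line).strip() for line in lines)


def find_fixed_patterns(text):
    """Return (template_name, captured_fields) for every template match."""
    hits = []
    for name, pattern in FIXED_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((name, match.groupdict()))
    return hits
```

Each hit carries the variable slots (sponsor, show name, anchor name), so a target-language template would only need those slots filled in.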
Domain Analysis-Data(2)
• Idioms, Phrases and Acronyms
000218 SCHEDULE? >> THE DISARMAMENT OF ALL
000220 PARAMILITARIES INCLUDING THE IRA
000223 OR THE INSTITUTION OF A
000225 CABINET FOR THE NEW NORTHERN
000227 IRELAND ASSEMBLY INCLUDING SINN FEIN
000270 ALL THIS INDICATE THAT THE
000271 ORIGINAL GOOD FRIDAY AGREEMENT WAS
000276 JUST NOT REALISTIC? >> NO,
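Multi-word names such as GOOD FRIDAY AGREEMENT or SINN FEIN need to be handled as single units rather than word by word. A minimal sketch of one way to do this, assuming a greedy longest-match lookup against a phrase list; the function name `mark_phrases` and the phrase list are illustrative only.

```python
def mark_phrases(tokens, phrases, max_len=4):
    """Greedily group tokens into the longest known phrase at each position.

    `phrases` is a set of space-joined multi-word entries. Tokens that start
    no known phrase are passed through one at a time.
    """
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest span first, falling back to a single token.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if n == 1 or candidate in phrases:
                out.append(candidate)
                i += n
                break
    return out


# Hypothetical phrase list drawn from the sample captions on this slide.
PHRASES = {"GOOD FRIDAY AGREEMENT", "SINN FEIN", "NORTHERN IRELAND ASSEMBLY"}
print(mark_phrases("ORIGINAL GOOD FRIDAY AGREEMENT WAS".split(), PHRASES))
```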
Domain Analysis-Data(3)
• Sentence Boundary
000112 PARAMILITARIES MUST DISARM. MR. BLAIR AND
000140 BEFORE TOO LONG JUDY, YOU CAN SEE
000141 BEHIND ME THAT PEOPLE ARE
000143 STILL AT WORK HERE ALMOST 24 HOURS
000145 AFTER THE DEADLINE PASSED
000147 WHERE THERE'S LIFE, THERE'S HOPE
000149 AND WHEN THERE'S TALK, I GUESS
000151 THERE IS LIFE. WE'RE TOLD BY
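The samples show why sentence boundaries are hard here: a period does not always end a sentence (MR.), and some boundaries carry no punctuation at all. A minimal sketch of a punctuation-based splitter, assuming a small hand-written abbreviation list and treating `>>` as a speaker-change marker; it handles the first problem only and cannot recover boundaries with missing punctuation.

```python
import re

# Illustrative abbreviation list; a real one would be much longer.
ABBREVIATIONS = {"MR.", "MRS.", "MS.", "DR."}


def split_sentences(text):
    """Split caption text on sentence-final punctuation and '>>' markers."""
    sentences, current = [], []
    for token in text.split():
        if token.startswith(">>"):
            # Speaker or story change: close whatever is open.
            if current:
                sentences.append(" ".join(current))
                current = []
            continue
        current.append(token)
        if re.search(r'[.?!]"?$', token) and token not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Trailing fragment with no closing punctuation yet.
        sentences.append(" ".join(current))
    return sentences
```

A fragment such as "AFTER THE DEADLINE PASSED WHERE THERE'S LIFE" stays unsplit under this sketch, which is the kind of case where a controlled language for captioners would help.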
Assumptions (1)
• Partial translation is acceptable
– Users may know some English, although their vocabulary may not be large enough
– Users have visual information
– Users may have background information on the topic
– Provide only positive information to the user; do not translate everything unless confident
Assumptions (2)
• A 10-second delay is acceptable
10:01:12 am
10:01:22 am
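The delay assumption amounts to a per-caption deadline: a caption that arrives at 10:01:12 must be displayed by 10:01:22. A minimal sketch of that check, assuming caption arrival times are available as timestamps; the function names are hypothetical.

```python
from datetime import datetime, timedelta

MAX_DELAY = timedelta(seconds=10)  # the acceptable delay assumed above


def display_deadline(caption_time):
    """Latest time at which the translated caption may still be shown."""
    return caption_time + MAX_DELAY


def within_budget(caption_time, ready_time):
    """True if a translation finished at `ready_time` meets the deadline."""
    return ready_time <= display_deadline(caption_time)
```

A translation that misses the deadline would be dropped or replaced by the source text, in line with the partial-translation assumption.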
Technical Risks
• Performance constraints
– Real-time
– High quality, even if partial translation is acceptable
• Interface with hardware and software in broadcasting system
• Specialized user interface if a human translator works together with MT
• The domain of news broadcasting may be too wide to be covered by current MT technology
Business Risks
• If the quality or real-time requirement cannot be met, the customer will not accept this product
• The population of potential customers who need partial translation results is not large enough
• Human translators provided with transcribed captions can translate them in real time
• The sales force does not think it can sell this translation service
Technical Rationale
• Multi-engine machine translation system (the requirement of multiple target languages cannot be satisfied now)
• Automatically update the corpus/lexicon from news sources
• Provide only positive information: untranslated text has 0 information, wrong translation has negative information!
– Translate only chunks with high confidence
– Translate only simple structures; leave conjunctions and prepositions for complex structures untranslated
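The "positive information only" policy can be sketched as a filter over translated chunks: a chunk is shown in the target language only when its confidence clears a threshold, and is otherwise left in English. This is a minimal sketch under stated assumptions: the `Chunk` type, the confidence score, and the 0.8 threshold are hypothetical, and the bracketed strings stand in for real target-language output.

```python
from typing import List, NamedTuple, Optional


class Chunk(NamedTuple):
    source: str                 # original English text
    translation: Optional[str]  # engine output, or None if it produced nothing
    confidence: float           # hypothetical engine score in [0, 1]


def render_partial(chunks: List[Chunk], threshold: float = 0.8) -> str:
    """Emit translations only for confident chunks; keep the source otherwise."""
    out = []
    for chunk in chunks:
        if chunk.translation is not None and chunk.confidence >= threshold:
            out.append(chunk.translation)
        else:
            # Untranslated text carries zero information, not negative.
            out.append(chunk.source)
    return " ".join(out)


chunks = [
    Chunk("TALKS", "[tgt:talks]", 0.95),
    Chunk("BETWEEN", None, 0.0),
    Chunk("PROTESTANT", "[tgt:protestant]", 0.4),
]
print(render_partial(chunks))
```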
System Architecture
Nyberg and Mitamura (1997), "A Real-Time MT System for Translating Broadcast Captions", Proceedings of MT Summit VI
Extracting Lexicon/Phrase
• The lexicon and phrases used in the news domain change rapidly
• Comparable corpora exist
• Extracting lexicon/phrase from comparable corpus
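One family of approaches to extraction from comparable (non-parallel) corpora compares the contexts in which words occur: a word and its translation tend to appear near words that are themselves translations of each other. The following is a minimal sketch of that idea, assuming a small seed dictionary and tokenized sentences; it illustrates the general technique and is not the method this project commits to. All names and the toy data are hypothetical.

```python
import math
from collections import Counter


def context_vector(word, sentences, seed_words, window=2):
    """Count seed words occurring within `window` tokens of `word`."""
    counts = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != word:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i and tokens[j] in seed_words:
                    counts[tokens[j]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def best_candidate(src_word, src_sents, tgt_sents, seed, candidates, window=2):
    """Rank target candidates by context similarity; `seed` maps src -> tgt."""
    src_vec = context_vector(src_word, src_sents, set(seed), window)
    # Project the source context into target-language seed dimensions.
    projected = Counter()
    for k, v in src_vec.items():
        projected[seed[k]] += v
    tgt_seeds = set(seed.values())
    scored = [(cosine(projected, context_vector(c, tgt_sents, tgt_seeds, window)), c)
              for c in candidates]
    return max(scored)


# Toy data with placeholder tokens rather than real words.
SEED = {"s1": "t1", "s2": "t2", "s3": "t3"}
SRC = [["s1", "X", "s2"], ["s2", "X", "s1"]]
TGT = [["t1", "Y", "t2"], ["t3", "Z", "t3"]]
print(best_candidate("X", SRC, TGT, SEED, ["Y", "Z"]))
```

In the toy data, X and Y share the same seed neighbours, so Y is ranked above Z.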
Plan Overview
Augmenting rules for news domain
Constructing bilingual corpus
Research on extracting lexicon from comparable corpus
Adjust chart manager for partial translation
Research on effects of partial translation
Training EBMT
Resources
• Existing KBMT system
• Existing EBMT software
• Transcribed caption (monolingual) data
• Dictionary
Bibliography
• Nyberg and Mitamura, 1997, A Real-Time MT System for Translating Broadcast Captions, Proceedings of MT Summit VI
• David Turcato, A Unified Example-Based and Lexical Approach to Machine Translation, TMI 99
• Pascale Fung, A Statistical View on Bilingual Lexicon Extraction: From Parallel Corpora to Non-parallel Corpora, Lecture Notes in Artificial Intelligence, Springer Publisher, vol 1529, 1-17.