Design of a Multi-lingual MT for Real-time Broadcast Captioning

Course Project for 11-731

Ying Zhang (Joy)
Joy@cs.cmu.edu

Advisor: Eric Nyberg

April 18th, 2001

Project Description

A broadcasting company wishes to translate the captioning for their show. Translations must be provided from English to multiple target languages. Real-time, high-accuracy translations are required. If the captions are poorly formed, then they need not be translated, but the customer would like you to consider teaching a controlled language to the captioners, so that high-quality translation can be achieved.

Domain Analysis (Cont.)

• Special requirements
– Translating spoken language
– The system must perform in real-time
– The system cannot rely on pre-editing or post-editing
– The system should provide positive information to users as much as possible

Domain Analysis

• Characteristics amenable for MT
– The domain is narrow
– Possible to build large scale monolingual corpus
– Not necessary to translate every utterance in the broadcast
• Well-defined discourse structure (greetings, etc)
• No correspondence in another culture (“the bulls outrunning the bears today on Wall Street”)

Domain Analysis-Data(1)

Fixed Patterns

000084 WWW.MEDIACAPTIONING.COM [CLOSED CAPTIONING

000085 PROVIDED BY BELL ATLANTIC, THE

000089 HEART OF COMMUNICATION] >>

000090 FROM CNN IN WASHINGTON, THIS

000091 IS "WORLDVIEW." I'M BERNARD

000094 SHAW. >> AND I'M JUDY WOODRUFF. >>>

000095 TALKS BETWEEN PROTESTANT AND

Domain Analysis-Data(2)

• Idioms, Phrases and Acronyms

000218 SCHEDULE? >> THE DISARMAMENT OF ALL

000220 PARAMILITARIES INCLUDING THE IRA

000223 OR THE INSTITUTION OF A

000225 CABINET FOR THE NEW NORTHERN

000227 IRELAND ASSEMBLY INCLUDING SINN FEIN

000270 ALL THIS INDICATE THAT THE

000271 ORIGINAL GOOD FRIDAY AGREEMENT WAS

000276 JUST NOT REALISTIC? >> NO,

Domain Analysis-Data(3)

• Sentence Boundary

000112 PARAMILITARIES MUST DISARM. MR. BLAIR AND

00140 BEFORE TOO LONG JUDY, YOU CAN SEE

000141 BEHIND ME THAT PEOPLE ARE

000143 STILL AT WORK HERE ALMOST 24 HOURS

000145 AFTER THE DEADLINE PASSED

000147 WHERE THERE'S LIFE, THERE'S HOPE

000149 AND WHEN THERE'S TALK, I GUESS

000151 THERE IS LIFE. WE'RE TOLD BY
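The caption lines above break in the middle of sentences, and "MR." shows that a period does not always end one. A minimal sketch of how sentences might be reassembled from such a stream follows. The abbreviation list, the function names, and the treatment of ">>" as a speaker-change marker are assumptions made for illustration, not part of the original project.

```python
import re

# Hypothetical abbreviation list; a real system would need a much larger one.
ABBREVIATIONS = {"MR.", "MRS.", "MS.", "DR.", "U.S."}

def strip_caption_id(line):
    """Drop the leading numeric caption ID (e.g. '000112 ') if present."""
    return re.sub(r"^\d+\s+", "", line.strip())

def split_sentences(caption_lines):
    """Join caption lines into one token stream and cut it after tokens ending
    in '.', '?' or '!', unless the token is a known abbreviation.
    Tokens starting with '>>' (assumed speaker change) also close a sentence.
    Returns (complete_sentences, unfinished_remainder)."""
    tokens = " ".join(strip_caption_id(l) for l in caption_lines).split()
    sentences, current = [], []
    for tok in tokens:
        if tok.startswith(">>"):
            if current:
                sentences.append(" ".join(current))
                current = []
            continue
        current.append(tok)
        if tok[-1] in ".?!" and tok not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    return sentences, " ".join(current)

lines = [
    "000147 WHERE THERE'S LIFE, THERE'S HOPE",
    "000149 AND WHEN THERE'S TALK, I GUESS",
    "000151 THERE IS LIFE. WE'RE TOLD BY",
]
done, pending = split_sentences(lines)
print(done)     # complete sentences so far
print(pending)  # text still waiting for its sentence end
```

Returning the unfinished remainder separately lets a real-time loop hold it back until the next caption line arrives.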

Assumptions (1)

• Partial Translation is acceptable
– Users may know some English, although their vocabulary size may not be large enough
– Users have visual information
– Users may have background information for the topic
– Provide only positive information to user, do not translate everything unless confident

Assumptions (2)

• 10 seconds delay is acceptable

[Figure: timeline from 10:01:12 am to 10:01:22 am, a 10-second delay]

Risk Factors

• Technical risks
• Business risks

Technical Risks

• Performance constraints
– Real-time
– High-quality, even if partial translation is acceptable

• Interface with hardware and software in broadcasting system

• Specialized user interface if a human translator works together with MT

• The domain of news broadcasting may be too wide to be covered by current MT technology

Business Risks

• If the quality or real-time requirement cannot be met, the customer will not accept this product

• The population of potential customers who need partial translation results may not be large enough

• Human translators provided with transcribed captions may be able to translate them in real-time

• The sales force may not think they can sell this translation service

Technical Rationale

• Multi-engine machine translation system (the requirement of multiple target languages cannot be satisfied now)
• Automatically update the corpus/lexicon from news sources
• Provide only positive information: un-translated text carries 0 information, a wrong translation carries negative information!
– Translate only chunks with high confidence
– Translate only simple structures; for complex structures, leave conjunctions and prepositions untranslated
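The "translate only chunks with high confidence" rule can be sketched as a simple threshold filter. This is a minimal illustration under stated assumptions: the `Chunk` type, the `confidence` score in [0, 1], the 0.8 threshold, and the placeholder target strings are all hypothetical, and the slides do not say how confidence would actually be computed.

```python
from typing import NamedTuple

class Chunk(NamedTuple):
    source: str        # English chunk as captioned
    translation: str   # candidate output from some MT engine (placeholder here)
    confidence: float  # assumed engine confidence in [0, 1]

def render_partial(chunks, threshold=0.8):
    """Emit the translation for confident chunks and the original English
    for everything else, so low-confidence output never reaches the viewer."""
    return " ".join(
        c.translation if c.confidence >= threshold else c.source
        for c in chunks
    )

chunks = [
    Chunk("FROM CNN IN WASHINGTON", "[target text 1]", 0.95),
    Chunk("THE BULLS OUTRUNNING THE BEARS", "[target text 2]", 0.30),
]
print(render_partial(chunks))
```

Falling back to the English source rather than dropping the chunk matches the assumption that users know some English and have visual context; the untranslated chunk then carries zero information instead of negative information.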

System Architecture

Nyberg and Mitamura (1997), "A Real-Time MT System for Translating Broadcast Captions", Proceedings of MT Summit VI

Extracting Lexicon/Phrase

• The lexicon/phrase used in news domain changes rapidly

• Comparable corpus exists

• Extracting lexicon/phrase from comparable corpus
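One common way to extract a lexicon from a comparable (non-parallel) corpus is to compare context vectors through a small seed dictionary. The sketch below illustrates that general idea only; it is an assumption about how the step could be approached, not the method of this project. The function names, the window size, and the toy tokens are hypothetical.

```python
import math
from collections import Counter

def context_vector(word, sentences, seed_words, window=3):
    """Count how often each seed word occurs within `window` tokens of `word`."""
    counts = Counter()
    for sent in sentences:
        for i, tok in enumerate(sent):
            if tok != word:
                continue
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i and sent[j] in seed_words:
                    counts[sent[j]] += 1
    return counts

def cosine(u, v):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def rank_candidates(src_word, src_sents, tgt_sents, seed, candidates, window=3):
    """Rank target-language candidates for src_word by context similarity.
    `seed` maps known source words to their known target translations."""
    src_vec = context_vector(src_word, src_sents, set(seed), window)
    # Project the source vector into target space through the seed lexicon.
    projected = Counter()
    for w, c in src_vec.items():
        projected[seed[w]] += c
    tgt_seed = set(seed.values())
    scored = [
        (cosine(projected, context_vector(cand, tgt_sents, tgt_seed, window)), cand)
        for cand in candidates
    ]
    return sorted(scored, reverse=True)

# Toy data with made-up target tokens, for illustration only.
src = [["the", "ceasefire", "talks", "ended"], ["peace", "talks", "and", "ceasefire"]]
tgt = [["X", "T_TALKS", "T_PEACE"], ["Y", "T_OTHER"]]
seed = {"talks": "T_TALKS", "peace": "T_PEACE"}
print(rank_candidates("ceasefire", src, tgt, seed, ["X", "Y"]))
```

Candidates that appear near translations of the same seed words score higher. In a rapidly changing news domain the seed dictionary would presumably come from the existing lexicon.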

Comparable corpus

Plan Overview

• Augmenting rules for news domain
• Constructing bilingual corpus
• Research on extracting lexicon from comparable corpus
• Adjust chart manager for partial translation
• Research on effects of partial translation
• Training EBMT

Resources

• Existing KBMT system
• Existing EBMT software
• Transcribed caption (monolingual) data
• Dictionary

Bibliography

• Nyberg and Mitamura, 1997, A Real-Time MT System for Translating Broadcast Captions, Proceedings of MT Summit VI

• David Turcato, A Unified Example-Based and Lexical Approach to Machine Translation, TMI 99

• Pascale Fung, A Statistical View on Bilingual Lexicon Extraction: From Parallel Corpora to Non-parallel Corpora, Lecture Notes in Artificial Intelligence, Springer Publisher, vol 1529, 1-17.

Thanks!

• Questions?