2008 – copyright SYSTRAN
SYSTRAN Challenges and Recent Advances in Hybrid Machine Translation
Jean Senellart, Jin Yang, Jens Stephan
Overview
SYSTRAN – 40 years of innovation
The MT Challenges
SYSTRANLab
• Projects
• Hybrid Engines
• From Research to Products
CWMT08
Conclusions
SYSTRAN
40 years of history
Located in Paris (La Défense) and San Diego
70+ employees: ~20 linguists, ~30 engineers
• Including 10 PhDs
Core Technology
Core technology: “Rule-Based”
• Based on language description
• Analysis – Transfer – Generation paradigm
• Build a “syntax tree” based on hierarchical constituents with multi-level relationships
Multi-pass analysis
• Morphology Analysis
• Homograph Resolution
• Clause Boundary
• Syntagm Identification
• Syntactic Role Identification
• …
Rely heavily on linguistic resources
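The multi-pass analysis listed above can be pictured as a chain of passes, each adding annotations to a shared sentence structure. The sketch below is a toy illustration only: the pass names mirror the slide, but the `Token` structure, the tiny `LEXICON`, and the disambiguation rule are hypothetical and are not SYSTRAN's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Token:
    text: str
    pos_candidates: list = field(default_factory=list)  # filled by the morphology pass
    pos: str = ""                                       # filled by homograph resolution

# Hypothetical toy lexicon: word -> possible parts of speech.
LEXICON = {
    "the": ["DET"],
    "we": ["PRON"],
    "book": ["NOUN", "VERB"],   # a homograph
    "flights": ["NOUN"],
}

def morphology_pass(tokens):
    """Pass 1: look up every possible part of speech for each word."""
    for t in tokens:
        t.pos_candidates = LEXICON.get(t.text.lower(), ["UNK"])
    return tokens

def homograph_pass(tokens):
    """Pass 2: choose one reading per word using the previous word's tag (toy rule)."""
    prev = ""
    for t in tokens:
        if prev == "DET" and "NOUN" in t.pos_candidates:
            t.pos = "NOUN"
        elif prev == "PRON" and "VERB" in t.pos_candidates:
            t.pos = "VERB"
        else:
            t.pos = t.pos_candidates[0]
        prev = t.pos
    return tokens

# Later passes (clause boundary, syntagm identification, ...) would be appended here.
PASSES = [morphology_pass, homograph_pass]

def analyze(sentence):
    tokens = [Token(w) for w in sentence.split()]
    for run_pass in PASSES:
        tokens = run_pass(tokens)
    return tokens
```

Each pass relies on the annotations left by the previous one, which is why the order of the list matters and why the quality of the lexicon drives everything downstream.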
Languages
Chinese      882    Korean          78
Arabic       422    Italian         62
Spanish      358    Ukrainian       47
English      350    Polish          42
Hindi        325    Dutch           23
Portuguese   250    Serbo-Croatian  21
Russian      170    Greek           18
French       130    Czech           12
Japanese     125    Albanian         6
Urdu         100    Slovak           6
German       100
Farsi         82
22 source languages
70 language pairs
Dictionaries: 200K-1M entries per LP
~6M reference multi-source / multi-target dictionary
3600
SYSTRAN Activity
Retail products:
• Windows Desktop Product
• SYSTRAN Mobile on PDA
• Mac OS Dashboard Widget
Online Services
• SYSTRANBox, SYSTRANNet, SYSTRANLinks
Corporate customers
• Symantec, Cisco, Verizon, Ford, Daimler, Chemical Abstract…
Institutional Customers
• EC and US agencies
Portals - Online Translation
• “Babel Fish”, Google, Yahoo!, Microsoft Live, …
MT Challenges RBMT/SMT Strengths and Weaknesses - I
Rule-Based system builds a translation with available linguistic resources (dictionaries, rules)
Human-built resources
• Incremental
Track the translation process
• Predictable output
Some phenomena are hard to formalize
• Need semantic/pragmatic knowledge
Not designed to deal with exceptions to the rules
• … which are very frequent
MT Challenges RBMT/SMT Strengths and Weaknesses - II
Statistical system finds a translation within a choice of many, many possible translations
Very easy to build
• Automatic training process
Knowledge acquisition is easy…
• Not limited to predefined linguistic patterns – “phrase”
… but cannot “understand” or generalize information
• Not even elementary rules
Output is “unpredictable”
MT Challenges Corpus-Based or Rule-Based Approach?
No conflict between “corpus” and “rule-based” approaches
Possible to learn rules
• Already learns terminology – monolingual and multilingual
• Some approaches acquire complex rules
Possible to find the best translation amongst several translations
• “Decoding” can be constrained by syntactic restrictions
• Linguistic rules but corpus drives!
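One way to read "linguistic rules but corpus drives" is as constrained selection: a corpus-derived score ranks the candidate translations, and a syntactic check only removes candidates that break a rule. The sketch below assumes that reading. The function name, the scores, and the toy agreement check are hypothetical, and a real decoder would apply such constraints during search rather than afterwards.

```python
def pick_translation(candidates, is_syntactically_valid):
    """Return the best-scoring candidate that passes the syntactic check.

    candidates: list of (translation, corpus_score) pairs, higher score is better.
    is_syntactically_valid: predicate standing in for a linguistic constraint.
    """
    allowed = [c for c in candidates if is_syntactically_valid(c[0])]
    pool = allowed or candidates  # fall back if the constraint rejects everything
    return max(pool, key=lambda c: c[1])[0]

# Toy constraint (hypothetical): the French noun "maison" needs the article "la".
def agrees(translation):
    return "le maison" not in translation

# Hypothetical scores: the statistical model alone slightly prefers the wrong article.
candidates = [
    ("le maison blanche", 0.42),
    ("la maison blanche", 0.40),
    ("la blanche maison", 0.10),
]

best = pick_translation(candidates, agrees)
```

The rule never proposes a translation itself. It only narrows the pool, so the final choice among acceptable candidates is still made by the corpus score.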
SYSTRANLab
Research Projects Overview
Toward Hybrid Engines
• Collaborations
• Statistical Post-Edition
• Lattice Decoding
• Source Analysis Adaptation
From Research to Products
Research Projects
Resources Acquisition
• Consolidating a 6M entry multilingual dictionary
• Acquiring more from corpus – lexicon and rules
Linguistic Development
• Entity Recognition with local grammars
• Autonomous Generation modules
Introduction of corpus-based technology
Applications
• More interactive applications
• Professional Post-Edition Module (POEM)
SYSTRANLab Research Projects
The Phoenix Project
Collaboration with P. Koehn (University of Edinburgh)
Introduce corpus-based decision modules in SYSTRAN
Specialized modules
• Word Sense Disambiguation
• Lattice Generation
• Preposition / Determiner Choice
SYSTRANLab Research Projects
The Sphinx Project
Collaboration with CNRC
Sequential use of SYSTRAN and statistical engines (Statistical Post-Edition)
GALE (DARPA Project)
Participated in WMT07, NIST08
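Statistical Post-Edition chains the two engines: the rule-based system translates first, and a statistical model then "translates" that output into text closer to human references. The sketch below shows only the data flow, assuming the statistical engine is trained on pairs of (rule-based output, reference translation). The function names and the two toy engines are hypothetical stand-ins, not SYSTRAN or CNRC code.

```python
def build_spe_corpus(bitext, rbmt_translate):
    """Turn (source, reference) pairs into (RBMT output, reference) training pairs."""
    return [(rbmt_translate(src), ref) for src, ref in bitext]

def hybrid_translate(source, rbmt_translate, spe_translate):
    """Sequential use: rule-based translation first, statistical post-edition second."""
    return spe_translate(rbmt_translate(source))

# Hypothetical stand-in for the rule-based engine (a lookup table).
def toy_rbmt(source):
    return {"bonjour le monde": "good day the world"}.get(source, source)

# Hypothetical stand-in for a trained post-edition model (one learned correction).
def toy_spe(rbmt_output):
    return rbmt_output.replace("good day the", "hello")

training_pairs = build_spe_corpus([("bonjour le monde", "hello world")], toy_rbmt)
result = hybrid_translate("bonjour le monde", toy_rbmt, toy_spe)
```

Because both sides of the training pairs are in the target language, the statistical model only has to learn corrections to the rule-based output rather than a full translation.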
SYSTRANLab Research Projects
The Pegasus Project
Collaboration with H. Schwenk (Université du Maine)
Introduce linguistic knowledge in statistical engines
Participated in WMT08
SYSTRANLab Hybrid Engines
Introduce Self-Learning capability
Learn “post-edition rules”
Deep integration of statistical decision modules
Insert linguistic knowledge in statistical engines
HYBRID
CWMT08
Chinese-English MT evaluation
Primary: RBMT+SPE
Contrast: RBMT
• Started in 1994, 1.2M terms, S&T-focus
             BLEU4   BLEU4-SBP  NIST5   GTM     mWER    mPER    ICT
Primary-a    0.2275  0.2193     7.9180  0.7101  0.7209  0.5085  0.3262
Contrast-b   0.1956  0.1930     7.6356  0.7089  0.7165  0.5123  0.2942
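The table reports the error rates mWER and mPER next to BLEU and NIST; for error rates, lower is better. As a rough illustration of what they measure, the sketch below computes a word error rate (edit distance over tokens) and a position-independent error rate (bag-of-words mismatch). It assumes that the "m" prefix means the best score over multiple references and uses one common formulation of PER. The official CWMT08 scoring may differ in normalization and tokenization.

```python
from collections import Counter

def edit_distance(hyp, ref):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(prev[j] + 1,              # drop a hypothesis word
                           cur[j - 1] + 1,           # add a missing reference word
                           prev[j - 1] + (h != r)))  # substitute or match
        prev = cur
    return prev[-1]

def wer(hyp, ref):
    """Word error rate: edit distance normalized by reference length."""
    return edit_distance(hyp, ref) / len(ref)

def per(hyp, ref):
    """Position-independent error rate: word order is ignored."""
    matches = sum((Counter(hyp) & Counter(ref)).values())
    return (max(len(hyp), len(ref)) - matches) / len(ref)

def multi_ref(metric, hyp, refs):
    """Multi-reference variant (assumption): keep the best score over all references."""
    return min(metric(hyp, r) for r in refs)

ref = "the cat sat on the mat".split()
hyp = "the cat sat on mat".split()
scrambled = "mat the on sat cat the".split()
```

A hypothesis with the right words in the wrong order, such as `scrambled`, has a PER of zero but a non-zero WER, which is why the two columns in the table can move in different directions.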
CWMT08: SPE Usage
SPE module trained on 1.8M sentences
• CWMT08 training data not used
Not only translation but also annotation by RBMT
• Dates, numerals, etc.
Transfer model is filtered
• Exclusion of “bad rules” by rule-based filtering
• Examples are “random” quotes, entities appearing
Some expressions are “protected”
• Constituents are replaced with placeholders before SPE
• Translated with RBMT
• Re-injected in translation after SPE
SPE model for CWMT08 is trained using GIZA++ and decoded using Moses (www.statmt.org/moses)
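The "protected" expressions step can be sketched as mask, post-edit, restore. The example below is a minimal illustration under stated assumptions: only ISO-style dates are protected, the placeholder format `__PH0__` is invented for the example, and `toy_spe` is a hypothetical stand-in for the statistical post-edition step (it is not a Moses call).

```python
import re

# Assumption: dates like 2008-11-27 are the only protected constituents here.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
PLACEHOLDER_RE = re.compile(r"__PH(\d+)__")

def protect(rbmt_output, pattern=DATE_RE):
    """Replace protected constituents with numbered placeholders before SPE."""
    saved = []

    def mask(match):
        saved.append(match.group(0))
        return f"__PH{len(saved) - 1}__"

    return pattern.sub(mask, rbmt_output), saved

def reinject(post_edited, saved):
    """Put the RBMT-translated constituents back after SPE."""
    return PLACEHOLDER_RE.sub(lambda m: saved[int(m.group(1))], post_edited)

# Hypothetical stand-in for the post-edition model: one learned correction.
def toy_spe(text):
    return text.replace("meeting of", "meeting on")

rbmt_output = "The meeting of 2008-11-27 was held in Beijing"
masked, saved = protect(rbmt_output)
final = reinject(toy_spe(masked), saved)
```

The statistical step never sees the protected text, so it cannot alter a date or numeral that the rule-based engine has already rendered.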
Statistical Post-Edition - A Case Study
Case Study – SYMANTEC – English>Chinese
                                    BLEU    PERFECT  Improv / Degrad
SYSTRAN Raw                         20.89   2        -
SYSTRAN Cust                        34.49   4.8      ref
SYSTRAN Raw + Translation Model     46.86   7.4      -
SYSTRAN Cust + Translation Model    50.90   10.5     15
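The size of the effect is easier to see as differences between rows. The short calculation below uses only the BLEU figures from the table above; the helper name `gain` is introduced for illustration.

```python
# BLEU scores copied from the case-study table.
BLEU = {
    "SYSTRAN Raw": 20.89,
    "SYSTRAN Cust": 34.49,
    "SYSTRAN Raw + Translation Model": 46.86,
    "SYSTRAN Cust + Translation Model": 50.90,
}

def gain(before, after):
    """Return (absolute BLEU gain, relative gain in percent) between two rows."""
    delta = BLEU[after] - BLEU[before]
    return round(delta, 2), round(100 * delta / BLEU[before], 1)

customization_only = gain("SYSTRAN Raw", "SYSTRAN Cust")
model_on_raw = gain("SYSTRAN Raw", "SYSTRAN Raw + Translation Model")
model_on_cust = gain("SYSTRAN Cust", "SYSTRAN Cust + Translation Model")
```

Adding the translation model gains 25.97 BLEU points over the raw system and 16.41 points over the customized system, so the statistical layer helps in both cases and the two adaptations are partly complementary.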
Conclusions
Our approach is to start with a rule-based framework
• Developed techniques give very competitive results
• Major focus on “degradation” control
• Learn more advanced post-edition rules
Generic Translation – still a long way to go
• Bigger still better?
Domain Translation
• Quality is there – statistics provides adaptation and fluidity
Need dedicated applications, workflow
Bootstrapping new language pair development