2008 – copyright systran systran challenges and recent advances in hybrid machine translation jean...

2008 – copyright SYSTRAN

SYSTRAN Challenges and Recent Advances in Hybrid Machine Translation

Jean Senellart, Jin Yang, Jens Stephan

[email protected]


Overview

SYSTRAN – 40 years of innovation

The MT Challenges

SYSTRANLab

• Projects
• Hybrid Engines
• From Research to Products

CWMT08

Conclusions


SYSTRAN

40 years of history

Located in Paris (La Défense) and San Diego

70+ employees: ~20 linguists, ~30 engineers

• Including 10 PhDs


Core Technology

Core technology is “Rule-Based”

• Based on language description
• Analysis – Transfer – Generation paradigm
• Builds a “syntax tree” based on hierarchical constituents with multi-level relationships

Multi-pass analysis

• Morphology Analysis
• Homograph Resolution
• Clause Boundary
• Syntagm Identification
• Syntactic Role Identification
• …

Rely heavily on linguistic resources
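The multi-pass idea above can be illustrated with a minimal sketch in which each pass annotates the tokens and later passes rely on earlier annotations. Everything here is a toy assumption made for illustration (the three-word lexicon, the tag names, the two context rules); it is not SYSTRAN's actual analysis.

```python
from typing import Any, Callable, Dict, List, Set

Token = Dict[str, Any]
Pass = Callable[[List[Token]], List[Token]]

# Hypothetical toy lexicon: each form maps to its possible part-of-speech tags.
TOY_LEXICON: Dict[str, Set[str]] = {
    "to": {"PART"},
    "the": {"DET"},
    "record": {"NOUN", "VERB"},  # a homograph
}

def morphology_pass(tokens: List[Token]) -> List[Token]:
    """Pass 1: attach every candidate tag found in the lexicon."""
    for tok in tokens:
        tok["candidates"] = set(TOY_LEXICON.get(tok["form"], {"UNK"}))
    return tokens

def homograph_pass(tokens: List[Token]) -> List[Token]:
    """Pass 2: resolve ambiguous tokens using the tag chosen for the previous token."""
    prev_tag = None
    for tok in tokens:
        cands = tok["candidates"]
        if len(cands) == 1:
            tag = next(iter(cands))
        elif prev_tag == "DET" and "NOUN" in cands:
            tag = "NOUN"   # toy rule: determiner is followed by a noun
        elif prev_tag == "PART" and "VERB" in cands:
            tag = "VERB"   # toy rule: "to" is followed by a verb
        else:
            tag = sorted(cands)[0]  # arbitrary but deterministic fallback
        tok["tag"] = tag
        prev_tag = tag
    return tokens

def analyze(sentence: str, passes: List[Pass]) -> List[Token]:
    """Run the passes in order over a whitespace-tokenized sentence."""
    tokens: List[Token] = [{"form": w} for w in sentence.lower().split()]
    for run_pass in passes:
        tokens = run_pass(tokens)
    return tokens

tags = [t["tag"] for t in analyze("to record the record",
                                  [morphology_pass, homograph_pass])]
print(tags)
```

The point of the sketch is only the ordering: homograph resolution cannot run before morphology has proposed candidates, which is why the analysis is organized as successive passes.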


Languages

Chinese      882    Korean           78
Arabic       422    Italian          62
Spanish      358    Ukrainian        47
English      350    Polish           42
Hindi        325    Dutch            23
Portuguese   250    Serbo-Croatian   21
Russian      170    Greek            18
French       130    Czech            12
Japanese     125    Albanian          6
Urdu         100    Slovak            6
German       100
Farsi         82

22 source languages

70 language pairs

Dictionaries: 200K-1M entries per language pair (LP)

• ~6M-entry reference multi-source / multi-target dictionary

3600


SYSTRAN Activity

Retail products

• Windows Desktop Product
• SYSTRAN Mobile on PDA
• Mac OS Dashboard Widget

Online Services

• SYSTRANBox, SYSTRANNet, SYSTRANLinks

Corporate customers

• Symantec, Cisco, Verizon, Ford, Daimler, Chemical Abstract…

Institutional Customers

• EC and US agencies

Portals - Online Translation

• “Babel Fish”, Google, Yahoo!, Microsoft Live, …


MT Challenges RBMT/SMT Strengths and Weaknesses - I

Rule-Based system builds a translation with available linguistic resources (dictionaries, rules)

Human-built resources

• Incremental

Track the translation process

• Predictable output

Some phenomena are hard to formalize

• Need semantic/pragmatic knowledge

Not designed to deal with exceptions to the rules

• … which are very frequent


MT Challenges RBMT/SMT Strengths and Weaknesses - II

Statistical system finds a translation within a choice of many, many possible translations

Very easy to build

• Automatic training process

Knowledge acquisition is easy…

• Not limited to predefined linguistic patterns – “phrase”

… but cannot “understand” or generalize information

• Not even elementary rules

Output is “unpredictable”


MT Challenges Corpus-Based or Rule-Based Approach?

No conflict between “corpus” and “rule-based” approaches

Possible to learn rules

• Already learns terminology – monolingual and multilingual
• Some approaches acquire complex rules

Possible to find the best translation amongst several translations

• “Decoding” can be constrained by syntactic restrictions
• Linguistic rules but corpus drives!
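As a rough illustration of choosing the best translation among several while linguistic restrictions constrain the choice, the sketch below filters candidates with a rule and lets a corpus-derived score decide among the survivors. The candidate strings, the scores, and the `keeps_negation` rule are all made up for the example.

```python
from typing import Callable, List, Tuple

Candidate = Tuple[str, float]  # (translation, corpus-based score; higher is better)

def pick_best(candidates: List[Candidate],
              constraint: Callable[[str], bool]) -> str:
    """Keep candidates that satisfy the linguistic constraint, then take the best score.

    If no candidate satisfies the constraint, fall back to the unconstrained best.
    """
    allowed = [c for c in candidates if constraint(c[0])]
    pool = allowed or candidates
    return max(pool, key=lambda c: c[1])[0]

def keeps_negation(translation: str) -> bool:
    """Hypothetical restriction: the source was negated, so the output must contain 'not'."""
    return "not" in translation.split()

candidates = [
    ("he does come", -1.0),      # best corpus score, but drops the negation
    ("he does not come", -1.4),
    ("he not comes", -2.5),
]
best = pick_best(candidates, keeps_negation)
print(best)
```

The rule only narrows the search space; the score still makes the final decision, which matches the "linguistic rules but corpus drives" stance on the slide.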


SYSTRANLab

Research Projects Overview

Toward Hybrid Engines

Collaborations

Statistical Post-Edition

Lattice Decoding

Source Analysis Adaptation

From Research to Products


Research Projects

Resources Acquisition

• Consolidating a 6M entry multilingual dictionary
• Acquiring more from corpus – lexicon and rules

Linguistic Development

• Entity Recognition with local grammars
• Autonomous Generation modules

Introduction of corpus-based technology

Applications

• More interactive applications
• Professional Post-Edition Module (POEM)


SYSTRANLab Research Projects

The Phoenix Project

Collaboration with P. Koehn (University of Edinburgh)

Introduce corpus-based decision modules in SYSTRAN

Specialized modules

• Word Sense Disambiguation
• Lattice Generation
• Preposition / Determiner Choice



The Sphinx Project

Collaboration with CNRC

Sequential use of SYSTRAN and statistical engines (Statistical Post-Edition)

GALE (DARPA Project)

Participated in WMT07, NIST08
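The sequential arrangement used in statistical post-edition (a rule-based engine first, then a statistical engine that rewrites its output) can be sketched as two chained functions. This is a toy under stated assumptions: the word-for-word lexicon stands in for the rule-based engine, and a small phrase-replacement table stands in for a statistical model trained on pairs of rule-based output and reference translations. Neither reflects the real systems.

```python
from typing import Dict

def rule_based_translate(source: str, lexicon: Dict[str, str]) -> str:
    """Stand-in for the rule-based engine: word-for-word lookup, unknown words kept."""
    return " ".join(lexicon.get(word, word) for word in source.split())

def statistical_post_edit(draft: str, post_edit_table: Dict[str, str]) -> str:
    """Stand-in for the statistical engine: rewrite phrases of the draft.

    Longer phrases are applied first so they are not pre-empted by shorter ones.
    """
    for phrase in sorted(post_edit_table, key=len, reverse=True):
        draft = draft.replace(phrase, post_edit_table[phrase])
    return draft

# Hypothetical French-to-English resources, invented for the example.
toy_lexicon = {"le": "the", "disque": "disk", "dur": "hard"}
toy_post_edit_table = {"disk hard": "hard disk"}

draft = rule_based_translate("le disque dur", toy_lexicon)
final = statistical_post_edit(draft, toy_post_edit_table)
print(draft)
print(final)
```

Note that the second stage is monolingual: it maps target-language drafts to better target-language text, so it can be trained without touching the rule-based engine.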



The Pegasus Project

Collaboration with H. Schwenk (Université du Maine)

Introduce linguistic knowledge in statistical engines

Participated in WMT08


SYSTRANLab Hybrid Engines

Introduce Self-Learning capability

Learn “post-edition rules”

Deep integration of statistical decision modules

Insert linguistic knowledge in statistical engines

HYBRID


CWMT08

Chinese-English MT evaluation

Primary: RBMT+SPE

Contrast: RBMT

• Started in 1994, 1.2M terms, S&T focus

             BLEU4    BLEU4-SBP   NIST5    GTM      mWER     mPER     ICT
Primary-a    0.2275   0.2193      7.9180   0.7101   0.7209   0.5085   0.3262
Contrast-b   0.1956   0.1930      7.6356   0.7089   0.7165   0.5123   0.2942


CWMT08: SPE Usage

SPE module trained on 1.8M sentences

• CWMT08 training data not used

Not only translation but also annotation by RBMT

• Dates, numerals, etc.

Transfer model is filtered

• Exclusion of “bad rules” by rule-based filtering
• Examples are “random” quotes, entities appearing

Some expressions are “protected”

• Constituents are replaced with placeholders before SPE
• Translated with RBMT
• Re-injected in translation after SPE

The SPE model for CWMT08 is trained using GIZA++, and decoding uses Moses (www.statmt.org/moses)
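The "protected" expressions step described on this slide can be sketched as mask, post-edit, restore. The placeholder format `__PHn__`, the date pattern, the toy post-editor, and the date reformatting function are all assumptions made for illustration; the slide does not specify any of them.

```python
import re
from typing import Callable, List, Tuple

# Assumed pattern for a protected constituent (ISO-style dates only).
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

def protect(text: str, pattern: "re.Pattern[str]") -> Tuple[str, List[str]]:
    """Replace each match with a numbered placeholder and remember the original."""
    saved: List[str] = []

    def mask(match: "re.Match[str]") -> str:
        saved.append(match.group(0))
        return f"__PH{len(saved) - 1}__"

    return pattern.sub(mask, text), saved

def restore(text: str, saved: List[str],
            translate_constituent: Callable[[str], str]) -> str:
    """Re-inject each protected constituent, translated separately, after post-edition."""
    for index, original in enumerate(saved):
        text = text.replace(f"__PH{index}__", translate_constituent(original))
    return text

def toy_spe(text: str) -> str:
    """Hypothetical post-editor: it never sees the protected date, only the placeholder."""
    return text.replace("meeting on", "the meeting on")

def toy_rbmt_date(date: str) -> str:
    """Hypothetical rule-based rendering of a date (day/month/year)."""
    year, month, day = date.split("-")
    return f"{day}/{month}/{year}"

draft = "meeting on 2008-11-27 in Beijing"   # made-up example sentence
masked, saved = protect(draft, DATE_RE)
final = restore(toy_spe(masked), saved, toy_rbmt_date)
print(masked)
print(final)
```

Masking keeps the statistical stage from rewriting material such as dates and numerals, where a plausible-looking but wrong substitution would be hard to detect.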


Statistical Post-Edition: A Case Study

Case Study – SYMANTEC – English>Chinese

                                    BLEU    PERFECT   Improv / Degrad
SYSTRAN Raw                         20.89   2         -
SYSTRAN Cust                        34.49   4.8       ref
SYSTRAN Raw + Translation Model     46.86   7.4       -
SYSTRAN Cust + Translation Model    50.90   10.5      15
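To make the size of the differences explicit, the short calculation below takes the BLEU figures from the case study and reports each configuration's gain over the raw system. Only the BLEU column is used; the variable names are illustrative.

```python
# BLEU scores copied from the SYMANTEC English>Chinese case study above.
bleu = {
    "SYSTRAN Raw": 20.89,
    "SYSTRAN Cust": 34.49,
    "SYSTRAN Raw + Translation Model": 46.86,
    "SYSTRAN Cust + Translation Model": 50.90,
}

baseline = bleu["SYSTRAN Raw"]

# Absolute BLEU gain of each configuration over the raw rule-based output.
gains = {name: round(score - baseline, 2) for name, score in bleu.items()}

for name, gain in gains.items():
    print(f"{name}: {gain:+.2f} BLEU vs. SYSTRAN Raw")
```

Read this way, customization alone adds 13.60 BLEU, the translation model alone adds 25.97, and the two combined add 30.01, so the gains overlap rather than simply adding up.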


Conclusions

Our approach is to start with a rule-based framework

• Developed techniques give very competitive results
• Major focus on “degradation” control
• Learn more advanced post-edition rules

Generic Translation – still a long way to go

• Bigger still better?

Domain Translation

• Quality is there – statistics provide adaptation and fluidity

Need dedicated applications, workflow

Bootstrapping new language pair development
