
Using external sources of bilingual information for word-level quality estimation in translation technologies

Miquel Espla-Gomis

Universitat d’Alacant
Departament de Llenguatges i Sistemes Informàtics

January 25, 2016

Miquel Espla-Gomis SBI for word-level QE January 25, 2016 1 / 101

About the title

Quite long title for a dissertation; let’s dissect it:

Using external sources of bilingual information for word-level quality estimation in translation technologies



Translation technologies

Technologies that provide translators with translation proposals (hypotheses) for a SL segment to ease the translation task:

  machine translation (MT)
  translation memories (TM)
  interactive machine translation
  automatic post-editing
  etc.



Word-level quality estimation

Quality estimation: predicting the quality of a translation hypothesis without using preexisting reference translations

  helpful to estimate translation effort and, therefore, for budgeting
  allows choosing the most helpful translation technology

Word-level quality estimation: predicting the quality of each word in a translation hypothesis

  it can guide the translator during post-editing



External sources of bilingual information

Sources of bilingual information (SBI): any source of information that can be used to translate sub-segments:

  machine translation
  translation memories
  phrase tables
  etc.

External: sources of information that are independent of the translation technologies to be evaluated and can be used as black boxes


Where are we?

Computer-aided translation (CAT) environments provide different translation technologies to help translators...

... some CAT environments provide word-level QE, but only for MT...

... there are many approaches for word-level QE, but they are based on specifically built resources, which usually have to be built independently.


Where would we like to be?

We would like CAT environments in which word-level QE is provided for any translation technology available...

... without having to build any specific resources...

... that only require plugging in any SBI already available, either locally or on the Internet...

... and that are able to build their own SBI if needed.


Research steps

1 Definition of methods for word-level QE for TM-based CAT using SBI
  new problem identified
  more work devoted
  good starting point: more information available than in MT

2 Extending these methods to MT
  important differences between both technologies
  the two most important translation technologies are covered

3 Researching methods to create SBI for under-resourced language pairs
  developed in parallel
  improves the usability of the word-level QE methods developed


Main working hypothesis

It is possible to develop methods exclusively based on external SBI for word-level QE in TM-based CAT and MT.


Outline

1 Introduction to the problems addressed

2 Word-level QE in TM-based CAT

3 Word-level QE in MT

4 Building SBI for under-resourced language pairs

5 Contributions to the state-of-the-art


Introduction to the problems addressed

Outline

1 Introduction to the problems addressed
  Word-level QE in TM-based CAT
  Word-level QE in MT
  Building new SBI for under-resourced language pairs







Translation memories

Catalan                                                        English
S1: L’ EAMT es membre de l’IAMT                                T1: The EAMT is a member of the IAMT
S2: L’ Associacio Internacional per a la Traduccio Automatica  T2: The International Association for Machine Translation
S3: el congres d’ enguany se celebra a Lovaina                 T3: current year ’s conference is held in Leuven
. . .                                                          . . .

S′: Associacio Europea per a la Traduccio Automatica
S2: L’ Associacio Internacional per a la Traduccio Automatica
T2: The International Association for Machine Translation



Translation-memory-based CAT tools



Word-level QE for TM-based CAT

The problem: detecting which words in suggestions provided by TM-based CAT tools should be post-edited and which should be kept

Example

S′ = “Associacio Europea per a la Traduccio Automatica”
S2 = “L’ Associacio Internacional per a la Traduccio Automatica”
T2 = “The International Association for Machine Translation”
T′ = “European Association for Machine Translation”










Word-level QE for MT

The problem: detecting which words in MT output should be post-edited

Example

S′ = “Associacio Europea per a la Traduccio Automatica”
MT(S′) = “European Association for the Automatic Translation”
T′ = “European Association for Machine Translation”








Word-level QE: MT vs. TM-based CAT

T provided by MT for S′: T is a translation of S′ that may be adequate or not

T provided by TM-based CAT for S′: T is not a translation of S′, but is an adequate translation of a similar segment S










Creating new SBI for under-resourced languages

The scenario: no SBI available for a translation task between under-resourced languages

The solution: building new SBI by using the Web as a parallel corpus

  A parallel data crawler can be used to build a parallel corpus
  Many SBI can be automatically built from a parallel corpus:
    dictionaries
    MT systems
    bilingual concordancers
    etc.



The Web as a parallel corpus


Word-level QE in TM-based CAT









Related work

No previous work on word-level QE in TM-based CAT...

... except for the patent application by Kuhn et al. (2011): word-level QE in TM-based CAT by using statistical word alignments (no details)

Closest approach by Kranias and Samioutou (2004): automatic post-editing of translation suggestions using lexicons for word alignment and MT for post-editing






Word-level QE in TM-based CAT: a new problem

word-level QE is a concept that originates in MT

one of the keys to the success of TM-based CAT tools: segment-level QE with easily interpretable fuzzy-match scores

word-level QE for TM-based CAT is a new problem: is it relevant?

YES: a pilot study indicates that word-level QE for TM-based CAT could improve the productivity of translators by up to 14%



Rationale

Edit distance provides information about the words in S that do not appear in S′:

Example

S′: Associacio Europea per a la Traduccio Automatica
S: L’ Associacio Internacional per a la Traduccio Automatica
T: The International Association for Machine Translation
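As a rough illustration of this matching step, the sketch below marks which words of S are matched in S′. It assumes whitespace tokenisation and uses Python's `difflib.SequenceMatcher` as a stand-in for a word-level edit-distance alignment; the function name `matched_flags` is hypothetical and the slides do not specify the exact matching algorithm.

```python
from difflib import SequenceMatcher


def matched_flags(s_tokens, s_prime_tokens):
    """Return one boolean per word of S: True if the word is matched in S'.

    Assumption: difflib's matching blocks approximate the word-level
    edit-distance alignment mentioned on the slide.
    """
    matcher = SequenceMatcher(a=s_tokens, b=s_prime_tokens, autojunk=False)
    flags = [False] * len(s_tokens)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            flags[i] = True
    return flags


S = "L' Associacio Internacional per a la Traduccio Automatica".split()
S_PRIME = "Associacio Europea per a la Traduccio Automatica".split()

flags = matched_flags(S, S_PRIME)
unmatched = [word for word, ok in zip(S, flags) if not ok]
print(unmatched)  # words of S that do not appear in S'
```

For the running example, the unmatched words of S are the article and "Internacional", which is the information that later gets projected onto T.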



Rationale

The objective is to project source-side matching information onto T, to suggest which words to change (bad) or keep unedited (good):

Example: [figure: S′ matched against S, with the matching information projected onto the words of T]


Outline

2 Word-level QE in TM-based CAT
  Word-level QE in TM-based CAT using statistical word alignments
  Word-level QE in CAT using SBI-based word alignments
  Word-level QE in TM-based CAT directly using SBI
  Evaluation


Working hypothesis #1

H1: It is possible to use word alignments to estimate the quality of TM-based CAT suggestions at the word level



Rationale




[Figure: word alignments between S and T, with the words of S matched against S′; words of T are labelled GOOD or BAD]

Automatica and Machine aligned & Automatica matched ⟹ GOOD
Internacional and International aligned & Internacional unmatched ⟹ BAD



Rationale




[Figure: word alignments between S and T, with the words of S matched against S′; some words of T cannot be labelled]

for: not aligned ⟹ ???
The: contradictory evidence ⟹ ??? (voting scheme)
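The projection and voting idea can be sketched as follows. This is only an illustration under stated assumptions: the alignment pairs are hypothetical, the function name `estimate_quality` is invented here, and leaving ties unlabelled is one possible reading of the voting scheme rather than the author's documented rule.

```python
def estimate_quality(t_len, alignments, s_matched):
    """Label each word of T as "GOOD", "BAD" or None (no decision).

    alignments: set of (i, j) pairs, meaning word i of S is aligned with
                word j of T (hypothetical input).
    s_matched:  one boolean per word of S (True if matched in S').
    """
    labels = []
    for j in range(t_len):
        votes = [s_matched[i] for (i, jj) in alignments if jj == j]
        good = sum(votes)
        bad = len(votes) - good
        if good > bad:
            labels.append("GOOD")
        elif bad > good:
            labels.append("BAD")
        else:
            labels.append(None)  # not aligned, or contradictory evidence (tie)
    return labels


# S = L' Associacio Internacional per a la Traduccio Automatica
s_matched = [False, True, False, True, True, True, True, True]
T = "The International Association for Machine Translation".split()
# Hypothetical alignments (S index, T index); "for" is left unaligned and
# "The" receives one matched and one unmatched vote.
alignments = {(0, 0), (5, 0), (2, 1), (1, 2), (7, 4), (6, 5)}

labels = estimate_quality(len(T), alignments, s_matched)
for word, label in zip(T, labels):
    print(word, label)
```

Words that end up with `None` are exactly the ones counted as not covered in the evaluation.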










H2: It is possible to use any SBI to obtain word alignments



Using SBI to obtain word-alignments

Using sub-segment alignments for word alignments based on SBI

Example

S: L’ Associacio Internacional per a la Traduccio Automatica (Catalan)
T: The International Association for Machine Translation (English)



Segmentation


Catalan: L’, Associacio, Internacional, per, a, la, Traduccio, Automatica, L’ Associacio, Associacio Internacional, Internacional per, ...

English: The, International, Association, for, Machine, Translation, The International, International Association, Association for, for Machine, ...
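A minimal sketch of the segmentation step, assuming whitespace tokenisation and a maximum sub-segment length chosen by the user (the name `sub_segments` and the parameter `max_len` are illustrative, not taken from the slides):

```python
def sub_segments(tokens, max_len):
    """Return every contiguous sub-segment of 1..max_len words,
    shortest first, each as a space-joined string."""
    segments = []
    for length in range(1, max_len + 1):
        for start in range(len(tokens) - length + 1):
            segments.append(" ".join(tokens[start:start + length]))
    return segments


S = "L' Associacio Internacional per a la Traduccio Automatica".split()
segments = sub_segments(S, max_len=2)
print(segments[:8])   # the eight single words
print(segments[8:])   # the seven two-word sub-segments
```

The same function would be applied to T; longer values of `max_len` yield more context per sub-segment at the cost of more SBI queries.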



Sub-segment translation to obtain pairs (σ, τ)


Catalan → English
  L’ → The
  Associacio → Association
  Internacional → International
  per → by
  a → at
  la → the
  Traduccio → Translation
  Automatica → Automatic
  L’ Associacio → The Association
  Associacio Internacional → International Association
  Internacional per → International for
  ...

English → Catalan
  The → El
  International → Internacional
  Association → Associacio
  for → per a
  Machine → Maquina
  Translation → Traduccio
  The International → El Internacional
  International Association → Associacio Internacional
  Association for → Associacio per
  for Machine → per a Maquina
  ...



(σ, τ) filtering


[Figure: the sub-segment pairs obtained in the translation step (Catalan → English and English → Catalan); pairs (σ, τ) with σ not in S or τ not in T are discarded]
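The filtering step can be sketched as below, assuming that "σ in S" means that σ occurs as a contiguous, case-sensitive word sequence in S (and likewise for τ in T). The helper names are hypothetical.

```python
def occurs_in(sub_segment, sentence_tokens):
    """True if the words of sub_segment appear contiguously in the sentence."""
    sub = sub_segment.split()
    n = len(sub)
    return any(sentence_tokens[i:i + n] == sub
               for i in range(len(sentence_tokens) - n + 1))


def filter_pairs(pairs, s_tokens, t_tokens):
    """Keep only the pairs (sigma, tau) with sigma in S and tau in T."""
    return [(sigma, tau) for sigma, tau in pairs
            if occurs_in(sigma, s_tokens) and occurs_in(tau, t_tokens)]


S = "L' Associacio Internacional per a la Traduccio Automatica".split()
T = "The International Association for Machine Translation".split()
pairs = [
    ("Associacio", "Association"),
    ("per", "by"),                        # "by" does not occur in T
    ("L' Associacio", "The Association"),  # not contiguous in T
    ("Associacio Internacional", "International Association"),
]
kept = filter_pairs(pairs, S, T)
print(kept)
```

A real system would probably normalise case and punctuation before comparing; that choice is left open here.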



Sub-segment alignment

Example: [figure: grid of sub-segment alignments between S (L’ Associació Internacional per a la Traducció Automàtica) and T (The International Association for Machine Translation)]



Using SBI to obtain word-alignments?

1 Segmentation of both sentences

2 Translation of sub-segments (in both directions) using SBIs to obtain sub-segment pairs (σ, τ)

3 Filtering pairs (σ, τ) with σ not in S or τ not in T

4 Aligning sub-segments (σ, τ) between S and T

5 Aligning words by using sub-segment alignments






From sub-segment alignments to word alignments

Option A: Intuitive heuristic method based on the concept of pressure:

  Every sub-segment alignment applies a force of 1;
  Every sub-segment alignment applies, on each pair of words, a pressure corresponding to its force divided by its area;
  Alignment strength between two words: summation of all the pressures on a pair of words.
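The pressure heuristic described above can be sketched as follows. The representation of a sub-segment alignment as a tuple of start positions and sizes is an assumption made for this example, and the three alignments listed are hypothetical ones chosen so that a 1×1, a 2×2 and a 3×3 alignment all cover the pair (Associacio, Association).

```python
def alignment_strengths(sub_alignments, s_len, t_len):
    """Sum, for every word pair, the pressure of each covering alignment.

    sub_alignments: list of (s_start, s_size, t_start, t_size) tuples
                    (hypothetical representation).
    Each alignment has force 1 and area s_size * t_size, so it adds
    1 / (s_size * t_size) to every word pair it covers.
    """
    strength = [[0.0] * t_len for _ in range(s_len)]
    for s_start, s_size, t_start, t_size in sub_alignments:
        pressure = 1.0 / (s_size * t_size)
        for i in range(s_start, s_start + s_size):
            for j in range(t_start, t_start + t_size):
                strength[i][j] += pressure
    return strength


# S = L' Associacio Internacional per a la Traduccio Automatica (8 words)
# T = The International Association for Machine Translation (6 words)
sub_alignments = [
    (1, 1, 2, 1),  # Associacio <-> Association (1x1)
    (1, 2, 1, 2),  # Associacio Internacional <-> International Association (2x2)
    (0, 3, 0, 3),  # L' Associacio Internacional <-> The International Association (3x3)
]
strength = alignment_strengths(sub_alignments, s_len=8, t_len=6)
print(round(strength[1][2], 2))  # Associacio / Association
```

With these three alignments the strength for (Associacio, Association) is 1 + 1/4 + 1/9, which rounds to 1.36.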



Word alignment strength

Example: [figure: grid of sub-segment alignments between S and T]

Word-alignment strength between Associacio and Association:

$\frac{1}{1 \times 1} + \frac{1}{2 \times 2} + \frac{1}{3 \times 3} \simeq 1.36$




Example: [figure: grid of word-alignment strengths between the words of S and T, with values ranging from 0.11 to 1.36]



From sub-segment alignments to word alignments

Option B: Linear-combination approach:

Sub-segment alignments are grouped depending on their geometry (1×1, 1×2, 2×1, etc.), and a linear-combination approach is used to assign weights to them










H3: It is possible to use SBI to estimate the quality of TM-based CAT translation suggestions at the word level



Rationale

Align sub-segments between S and T (as in the method using SBI-based word alignment)

Obtain word-level QEs by:
  using a heuristic method that counts matched/unmatched aligned words (as in the method using word alignments)
  using a binary classifier to make quality estimations



Evidence from sub-segments

Evidence from sub-segments of length L = 1

Example: [figure: sub-segment alignments of length 1 between S and T, with S matched against S′]



Contradictory evidence from sub-segments

Contradictory evidence from sub-segments of length L = 1

Example: [figure: contradictory sub-segment alignments of length 1 between S and T, with S matched against S′]



Evidence from longer sub-segments

Longer sub-segments provide more context

Example: [figure: alignments of longer sub-segments between S and T, with S matched against S′]









Evaluation of all the approaches

Evaluation for the five approaches:
  statistical word alignment
  SBI-based word alignment (“pressure”)
  SBI-based word alignment (linear combination)
  SBI directly (heuristic)
  SBI directly (binary classification)

Metrics: accuracy (A) and fraction of words not covered (NC)
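A small sketch of how the two metrics could be computed. It assumes that accuracy is measured only over the words for which a method produces a label, and that NC is the percentage of words left unlabelled; the slides do not spell out these details, and the function name is hypothetical.

```python
def accuracy_and_not_covered(predicted, gold):
    """Return (A, NC) as percentages.

    predicted: "GOOD", "BAD" or None (word not covered) for each word.
    gold:      "GOOD" or "BAD" for each word.
    Assumption: accuracy is computed over covered words only.
    """
    covered = [(p, g) for p, g in zip(predicted, gold) if p is not None]
    not_covered = 100.0 * (len(gold) - len(covered)) / len(gold)
    if not covered:
        return 0.0, not_covered
    accuracy = 100.0 * sum(p == g for p, g in covered) / len(covered)
    return accuracy, not_covered


# T = The International Association for Machine Translation
predicted = [None, "BAD", "GOOD", None, "GOOD", "GOOD"]
gold = ["BAD", "BAD", "GOOD", "GOOD", "GOOD", "GOOD"]
acc, nc = accuracy_and_not_covered(predicted, gold)
print(f"A = {acc:.1f}%, NC = {nc:.1f}%")
```

Reporting both numbers matters because a method can reach high accuracy simply by declining to label the difficult words.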



Evaluation resources

TM and test set from the same domain for Spanish→English (ES→EN)

Different values of fuzzy-match score threshold (Θ)

Three SBI (MT): Google Translate, Apertium, and Power Translator



Using SBI-based word-alignments

training: ES→EN; in-domain; all SBI        test: ES→EN; all SBI

method                                      metric   Θ ≥ 60%   Θ ≥ 70%   Θ ≥ 80%   Θ ≥ 90%
statistical word alignment                  A(%)     93.9±.2   94.3±.2   95.1±.2   95.3±.3
                                            NC(%)    4.4±.1    4.5±.2    4.3±.2    4.4±.3
pressure SBI-based word alignment           A(%)     93.7±.2   94.5±.2   95.4±.2   96.0±.2
                                            NC(%)    10.0±.2   8.9±.2    8.2±.3    7.2±.3
linear SBI-based word alignment             A(%)     94.2±.2   94.8±.2   95.8±.2   96.4±.2
                                            NC(%)    9.8±.2    8.8±.2    8.1±.3    7.0±.3
heuristic method using SBI directly         A(%)     93.3±.2   93.9±.2   95.1±.2   96.5±.3
                                            NC(%)    5.1±.1    5.2±.2    5.5±.2    5.9±.3
binary classification using SBI directly    A(%)     95.1±.1   95.6±.2   96.4±.2   96.9±.2
                                            NC(%)    5.1±.1    5.2±.2    5.5±.2    5.9±.3



QE accuracy across domains

Training on out-of-domain TMs and re-using the models

Important: simulates performance on a translation proposal (translation unit) not seen during training

Comparing the method based on statistical word alignment and methods using SBI directly



QE accuracy across domains

training: ES→EN; out-of-domain; all SBI        test: ES→EN; all SBI

                           statistical word alignment   SBI-based classifier   SBI-based heuristic
Θ(%)   training domain     A(%)       NC(%)             A(%)                   A(%)       NC(%)
≥ 60   in domain           93.9±.2    6.1±.2            95.1±.1
       out of domain 1     91.8±.2    9.6±.2            93.3±.2                93.3±.2    5.1±.1
       out of domain 2     90.2±.2    11.7±.2           92.0±.2
≥ 70   in domain           94.3±.2    5.9±.2            95.6±.2
≥ 80   in domain           95.1±.2    5.4±.2            96.4±.2
≥ 90   in domain           95.3±.3    4.9±.3            96.9±.2




QE accuracy across SBI

Training on a translation task using Apertium and Power Translator separately as SBI

Evaluating on the same translation task using Google Translate as SBI

Comparing the methods using SBI directly



QE accuracy across SBI

training: ES→EN; in-domain; SBI ≠ Google        test: ES→EN; SBI = Google

        A(%) binary classification using SBI directly           A(%) heuristic method
Θ(%)    Apertium    Google Translate    Power Translator        using SBI directly
≥ 60    92.9±.2     95.1±.1             91.1±.2                 93.4±.2
≥ 70    93.4±.2     95.5±.2             93.0±.2                 94.3±.2
≥ 80    95.6±.2     96.3±.2             94.8±.2                 95.4±.2
≥ 90    96.7±.2     97.0±.2             96.5±.2                 96.7±.2



QE accuracy across language pairs

Impact of re-using models for different languages

Models trained on 9 different language pairs: EN→ES, DE→EN, EN→DE, FR→EN, EN→FR, FI→EN, EN→FI, ES→FR, FR→ES

Evaluation on ES→EN

Comparing the methods using SBI directly



QE accuracy across language pairs: impact of the TL

training: XX→YY; in-domain; all SBI        test: ES→EN; all SBI

to EN     Θ = 60%      from EN   Θ = 60%
ES→EN     95.1±.1      EN→ES     90.8±.2
DE→EN     92.5±.2      EN→DE     92.2±.2
FR→EN     92.7±.2      EN→FR     90.9±.2
FI→EN     92.4±.2      EN→FI     91.1±.2



QE accuracy across language pairs: best results

training: XX→YY; in-domain; all SBI        test: ES→EN; all SBI

training          Θ(%)
language pair     ≥ 60       ≥ 70       ≥ 80       ≥ 90
ES→EN             95.1±.1    95.5±.2    96.3±.2    97.1±.2
FR→EN             92.7±.2    94.2±.2    95.8±.2    96.7±.2
heuristic         93.4±.2    94.3±.2    95.4±.2    96.7±.2



Chapter conclusions

New problem identified that is relevant for professional translators

New methods for word alignments based on SBI

Three methods for word-level QE for CAT:
  Statistical word alignment: best coverage (in-domain), difficult to re-use models when domain changes
  SBI-based word alignment: worst coverage
  SBI directly for word-level QE: best accuracy, good results when re-using models across domains


Word-level QE in MT


Related work

Confidence estimation (Blatz et al., 2003): semantic features (WordNet), probabilistic lexicons, word posterior probabilities

Approaches using pseudo-references:
  word occurrence in n-best lists (Ueffing and Ney, 2007)
  pseudo-references (Bicici, 2013)
  inverse machine translation (MULTILIZER approach: Bojar et al., 2014)

General framework for translation QE in MT: QuEst (Specia, 2013) [now QuEst++ (Specia, 2015)]


H4: It is possible to take the SBI-based methods for word-level QE in TM-based CAT and adapt them for their use in MT


Adapting the existing methods?

We start with the binary classification approach based on SBI

We can reuse the idea of sub-segment alignments...

... but we do not have S and T: it is not possible to use exactly the same method


A new collection of features is needed:
  positive features using exact matches of translated sub-segments
  negative features using partial matches of translated sub-segments


How to use SBI to obtain positive evidence?

1 Source language sub-segments σ and target language sub-segments τ are translated using SBI

2 Words in a sub-segment τ that is related by SBI to a sub-segment σ that matches S′ get positive evidence

S′ = “Associacio Europea per a la Traduccio Automatica”
  (translate σ)
τ = “European Association for” ✓ ✓ ✓
MT(S′) = “European Association for the Automatic Translation”
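The positive-evidence idea can be sketched as a simple count per word of MT(S′). The pairs (σ, τ) below are hypothetical SBI outputs, the function names are illustrative, and the actual feature set used in the thesis is richer than this single counter.

```python
def find_all(sub, tokens):
    """Start positions at which the word list `sub` occurs in `tokens`."""
    n = len(sub)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == sub]


def positive_evidence(pairs, src_tokens, mt_tokens):
    """Count, for each word of MT(S'), how many pairs (sigma, tau) have
    sigma occurring in S' and tau occurring exactly in MT(S') over it."""
    counts = [0] * len(mt_tokens)
    for sigma, tau in pairs:
        if not find_all(sigma.split(), src_tokens):
            continue  # sigma does not match the source segment
        tau_tokens = tau.split()
        for start in find_all(tau_tokens, mt_tokens):
            for j in range(start, start + len(tau_tokens)):
                counts[j] += 1
    return counts


S_PRIME = "Associacio Europea per a la Traduccio Automatica".split()
MT = "European Association for the Automatic Translation".split()
pairs = [
    ("Associacio Europea per a", "European Association for"),  # hypothetical
    ("Traduccio Automatica", "Machine Translation"),           # no exact match
]
counts = positive_evidence(pairs, S_PRIME, MT)
print(list(zip(MT, counts)))
```

Only the first three words receive positive evidence here, mirroring the three check marks on the slide.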


How to use SBI to obtain negative evidence?

1 Source language sub-segments σ are translated using SBI

2 Translations τ are partially matched onto MT(S′)

3 Words in MT(S′) that do not match but are surrounded by matching words get negative evidence

Example

S′ = “Associacio Europea per a la Traduccio Automatica”
  (translate σ)
τ = “the Machine Translation” ✓ ✗ ✓
MT(S′) = “European Association for the Automatic Translation”
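A rough sketch of the partial-matching step, assuming that `difflib.SequenceMatcher` is an acceptable stand-in for the partial matching used in the thesis (which is not detailed on the slide). Words of MT(S′) lying between the first and last matched word of τ, but not matched themselves, get negative evidence; the function name is hypothetical.

```python
from difflib import SequenceMatcher


def negative_evidence(tau_tokens, mt_tokens):
    """Return positions in MT(S') that are unmatched but surrounded by
    words matched against tau."""
    matcher = SequenceMatcher(a=tau_tokens, b=mt_tokens, autojunk=False)
    matched = set()
    for block in matcher.get_matching_blocks():
        matched.update(range(block.b, block.b + block.size))
    if not matched:
        return []
    first, last = min(matched), max(matched)
    return [j for j in range(first, last + 1) if j not in matched]


MT = "European Association for the Automatic Translation".split()
tau = "the Machine Translation".split()
negatives = negative_evidence(tau, MT)
print([MT[j] for j in negatives])
```

In the example, "the" and "Translation" match while "Automatic" sits between them unmatched, so it is the only word flagged.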


Outline

3 Word-level QE in MT
  Evaluation





Experimental setting

Binary classification problem: GOOD and BAD classes

Metrics: $F_1^w$, $F_1^{BAD}$, and $F_1^{GOOD}$

Datasets:
  WMT’15: we did compete; 1 dataset (for translating Spanish into English)
  WMT’14: we did not compete; 4 datasets (for 4 language pairs)
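For reference, the three metrics can be computed as sketched below. The weighting of $F_1^w$ by the relative frequency of each class in the gold labels is an assumption about how the weighted score is defined; the function names are illustrative.

```python
def f1_for_class(gold, predicted, label):
    """F1 score of one class, from parallel lists of labels."""
    tp = sum(g == label and p == label for g, p in zip(gold, predicted))
    fp = sum(g != label and p == label for g, p in zip(gold, predicted))
    fn = sum(g == label and p != label for g, p in zip(gold, predicted))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f1(gold, predicted, labels=("GOOD", "BAD")):
    """Average of the per-class F1 scores, weighted by gold frequency
    (assumed definition of the weighted F1)."""
    total = len(gold)
    return sum(gold.count(label) / total * f1_for_class(gold, predicted, label)
               for label in labels)


gold = ["GOOD", "GOOD", "GOOD", "BAD", "BAD", "GOOD"]
predicted = ["GOOD", "GOOD", "GOOD", "GOOD", "BAD", "GOOD"]
f1_bad = f1_for_class(gold, predicted, "BAD")
f1_good = f1_for_class(gold, predicted, "GOOD")
f1_w = weighted_f1(gold, predicted)
print(f"F1_BAD={f1_bad:.3f} F1_GOOD={f1_good:.3f} F1_w={f1_w:.3f}")
```

Because BAD words are usually the minority class, $F_1^{BAD}$ is the harder and more informative of the three figures.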



SBI used

Machine translation: Google Translate and Apertium
  only one translation suggestion for each σ
  translation frequency not available

Bilingual concordancer: Reverso Context
  multiple translation suggestions possible for each σ
  translation frequency available



WMT’15 participant ranking

WMT’15 evaluation on EN→ES data

System ID                          $F_1^w$   $F_1^{BAD}$   $F_1^{GOOD}$
• UAlacant/OnLine-SBI-Baseline     71.5%     43.1%         78.1%
• HDCL/QUETCHPLUS                  72.6%     43.1%         79.4%
UAlacant/OnLine-SBI                69.5%     41.5%         76.1%
SAU/KERC-CRF                       77.4%     39.1%         86.4%
SAU/KERC-SLG-CRF                   77.4%     38.9%         86.4%
SHEF2/W2V-BI-2000                  65.4%     38.4%         71.6%
SHEF2/W2V-BI-2000-SIM              65.3%     38.4%         71.5%
SHEF1/QuEst++-AROW                 62.1%     38.4%         67.6%
...



Results WMT’14

SBI-based methods vs. best performing systems in WMT’14

language   OnLine-SBI                winner
pair       $F_1^w$   $F_1^{BAD}$     method                 $F_1^w$   $F_1^{BAD}$
EN→ES      61.4%     49.0%           FBK-UPV-UEDIN/RNN      62.0%     48.7%
ES→EN      75.9%     10.4%           RTM-DCU/RTM-GLMd       79.5%     29.1%
EN→DE      66.8%     43.1%           RTM-DCU/RTM-GLM        71.5%     45.3%
DE→EN      75.0%     40.3%           RTM-DCU/RTM-GLM        72.4%     26.1%

We are good at detecting the following error types: terminology, mistranslation, and unintelligible



Chapter conclusions

Confirmed: SBI can be used for word-level QE in MT

Leading performer among the approaches proposed

Small number of features required (70) when compared to other approaches: for example, Bicici (2013) uses 511K features

Independence with respect to the MT system used to produce the translation hypotheses


Building SBI for under-resourced language pairs









Rationale

1 Using Bitextor (Espla-Gomis, 2010) to build a parallel corpus for under-resourced language pairs

2 Using the parallel corpus obtained to build new SBI

3 Using the new SBI for the QE techniques developed



Working hypotheses #5 and #6

H5: It is possible to create new SBI that enable word-level QE for language pairs with no SBI available using Bitextor to crawl parallel data

H6: The results obtained in word-level QE for under-resourced language pairs can be improved by using new SBI obtained through parallel data crawling


Outline

4 Building SBI for under-resourced language pairs
  Using the Web as a parallel corpus
  Building SBI for English–Croatian
  New SBI in word-level QE in TM-based CAT




Related work

Approaches to detect parallel text in multilingual websites:
  URL similarity patterns: Ma and Liberman (1999), Nie et al. (1999), and Resnik and Smith (2003)
  HTML structure comparison: Resnik and Smith (2003), Sin et al. (2005), and Zhang et al. (2006)
  image co-occurrence: Papavassiliou et al. (2013)
  text comparison (mostly based on bag-of-words overlapping metrics): Chen et al. (2004), and Barbosa et al. (2012)



How does Bitextor work?

HTML structure comparison



How does Bitextor work?

Word overlap metrics based on Sánchez-Martínez & Carrasco (2011)










Building a TM for English–Croatian tourism domain

Crawling 21 bilingual websites from the tourism domain

Using two crawlers: Bitextor and ILSP Focused Crawler

For both crawlers, 2 configurations: accuracy-oriented and recall-oriented

Evaluation: manual revision of the resulting parallel corpus



Building a TM for English–Croatian tourism domain

Evaluation of corpora built by Bitextor and ILSP Focused Crawler

tool              setting    aligned documents   success rate   unique aligned segments
Focused Crawler   recall     3,294               73.86%         40,431
                  accuracy   2,406               90.76%         32,544
Bitextor          recall     7,787               74.70%         50,338
                  accuracy   3,758               94.79%         36,834










Experimental setup

Evaluation on word-level QE in TM-based CAT for English–Finnish

Parallel corpus crawled using Bitextor on about 10,700 websites (1,700,000 parallel segments)

Performance when using two kinds of SBI: phrase tables and phrase-based SMT (Moses)

Comparison to Google Translate as an SBI



Results for Finnish→English

Comparing Google with new SBI created with Bitextor

SBI                 metric   method           Θ ≥ 60%    Θ ≥ 70%    Θ ≥ 80%    Θ ≥ 90%
Google              A(%)     classification   93.2±.2    94.6±.2    94.7±.2    94.8±.3
                    A(%)     heuristic        91.6±.2    93.4±.3    94.1±.3    94.5±.3
                    NC(%)    -                11.2±.2    11.3±.2    11.5±.3    11.6±.4
phrases             A(%)     classification   87.7±.3    90.3±.3    91.8±.3    92.3±.4
                    A(%)     heuristic        85.9±.3    89.6±.3    91.5±.3    92.2±.4
                    NC(%)    -                23.5±.3    23.1±.3    23.1±.4    23.8±.6
Moses               A(%)     classification   93.1±.2    94.7±.2    94.9±.3    94.7±.4
                    A(%)     heuristic        93.1±.2    94.7±.2    94.9±.3    94.7±.4
                    NC(%)    -                45.5±.3    44.2±.4    44.9±.5    46.8±.7
Google + Moses      A(%)     classification   91.3±.2    92.8±.2    93.2±.3    93.0±.3
+ phrases           A(%)     heuristic        87.8±.2    90.5±.2    91.8±.3    92.6±.4
                    NC(%)    -                5.1±.2     4.3±.2     4.9±.2     5.4±.3



Chapter conclusions

Bitextor provides high-quality parallel corpora

Bitextor may enable the use of SBI-based approaches for word-level QE in TM-based CAT for under-resourced language pairs

Encouraging results:
  Phrase-based SMT system: accuracy comparable to that of Google Translate
  Phrase tables + Google Translate: the fraction of words not covered falls by more than half; accuracy falls noticeably


Contributions to the state-of-the-art









Where would we like to be?

We would like CAT environments in which word-level QE is provided for any translation technology available...

... without having to build any specific resources...

... that only require plugging in any SBI already available to the user...

... and that are able to build their own SBI if needed.



How does our work contribute?

We have developed methods for the 2 most used technologies in CAT, TM and MT,...

... that only need access to the SBI already available to the user...

... and, if no SBI are available, we have developed a methodology for creating new SBI from multilingual websites using the tool Bitextor.



New research paths

QE for parallel-corpora crawling
  QE as a new source of information for comparing documents
  Using word-level QE as a source of information for cleaning parallel data

Automatic or aided post-editing
  Adapting current techniques to suggest translation alternatives
  Extending the current work to consider insertions

Word-level QE for interactive machine translation
  Using the SBI-based word-alignment methods developed to keep track of the sub-segments translated
  Evaluating word-level QE for this new technology



Free/open-source software published

Software published (http://bit.do/mespla_software):
  Gamblr-CAT: Software for word-level QE in TM-based CAT
  Gamblr-MT: Software for word-level QE in MT
  Bitextor: Software for building parallel data from multilingual websites
  OmegaT-SessionLog: Plugin for tracking the actions of a translator when using the TM-based CAT tool OmegaT
  OmegaT-Marker-Plugin: Plugin for OmegaT that implements the heuristic word-level QE system
  Flyligner: Word-alignment software based on SBI


MOLTES GRACIES PER L’ATENCIO!

THANK YOU VERY MUCH FOR YOUR ATTENTION!

This work has been partially funded by:
  the Spanish government through projects TIN2009-14009-C02-01 and TIN2012-32615,
  the European Union through project Abu-MaTran (PIAP-GA-2012-324414), and
  the Kazakh Government through project UA KAZAKHUNIVERSITY1-15I.

