using external sources of bilingual information for word...
TRANSCRIPT
Using external sources of bilingual information forword-level quality estimation in translation
technologies
Miquel Espla-Gomis
Universitat d’AlacantDepartament de Llenguatges i Sistemes Informatics
January 25, 2016
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 1 / 101
About the title
Quite long title for a dissertation; let’s dissect it:
Using external sources of bilingual informationfor word-level quality estimationin translation technologies
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 2 / 101
Translation technologies
Using external sources of bilingual informationfor word-level quality estimationin translation technologies
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 3 / 101
Translation technologies
Technologies that provide translators with translation proposals(hypotheses) for a SL segment to ease the translation task:
machine translation (MT)translation memories (TM)interactive machine translationautomatic post-editingetc.
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 4 / 101
Translation technologies
Technologies that provide translators with translation proposals(hypotheses) for a SL segment to ease the translation task:
machine translation (MT)translation memories (TM)interactive machine translationautomatic post-editingetc.
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 4 / 101
Word-level quality estimation
Using external sources of bilingual informationfor word-level quality estimationin translation technologies
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 5 / 101
Word-level quality estimation
Quality estimation: predicting the quality of a translation hypothesiswithout using preexisting reference translations
helpful to estimate translation effort and, therefore, for budgetingallows choosing the most helpful translation technology
Word-level quality estimation: predicting the quality of each word ina translation hypothesis
it can guide the translator during post-editing
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 6 / 101
External sources of bilingual information
Using external sources of bilingual informationfor word-level quality estimationin translation technologies
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 7 / 101
External sources of bilingual information
Sources of bilingual information (SBI): any source of informationthat allows to translate sub-segments:
machine translationtranslation memoriesphrase tablesetc.
External: sources of information that are independent of thetranslation technologies to be evaluated and can be used as blackboxes
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 8 / 101
Where are we?
Computer aided-translation (CAT) environments provide differenttranslation technologies to help translators...
... some CAT environments provide word-level QE but it is only forMT...
... there are many approaches for word-level QE, but they are basedon specifically built resources which usually require to be builtindependently.
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 9 / 101
Where are we?
Computer aided-translation (CAT) environments provide differenttranslation technologies to help translators...
... some CAT environments provide word-level QE but it is only forMT...
... there are many approaches for word-level QE, but they are basedon specifically built resources which usually require to be builtindependently.
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 9 / 101
Where are we?
Computer aided-translation (CAT) environments provide differenttranslation technologies to help translators...
... some CAT environments provide word-level QE but it is only forMT...
... there are many approaches for word-level QE, but they are basedon specifically built resources which usually require to be builtindependently.
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 9 / 101
Where are we?
Computer aided-translation (CAT) environments provide differenttranslation technologies to help translators...
... some CAT environments provide word-level QE but it is only forMT...
... there are many approaches for word-level QE, but they are basedon specifically built resources which usually require to be builtindependently.
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 9 / 101
Where would we like to be?
We would like CAT environments in which word-level QE is providedfor any translation technology available...
... without having to build any specific resources...
... that only require to plug any SBI already available, either locally oron the Internet...
... and that are able to build their own SBI if needed.
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 10 / 101
Where would we like to be?
We would like CAT environments in which word-level QE is providedfor any translation technology available...
... without having to build any specific resources...
... that only require to plug any SBI already available, either locally oron the Internet...
... and that are able to build their own SBI if needed.
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 10 / 101
Where would we like to be?
We would like CAT environments in which word-level QE is providedfor any translation technology available...
... without having to build any specific resources...
... that only require to plug any SBI already available, either locally oron the Internet...
... and that are able to build their own SBI if needed.
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 10 / 101
Where would we like to be?
We would like CAT environments in which word-level QE is providedfor any translation technology available...
... without having to build any specific resources...
... that only require to plug any SBI already available, either locally oron the Internet...
... and that are able to build their own SBI if needed.
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 10 / 101
Where would we like to be?
We would like CAT environments in which word-level QE is providedfor any translation technology available...
... without having to build any specific resources...
... that only require to plug any SBI already available, either locally oron the Internet...
... and that are able to build their own SBI if needed.
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 10 / 101
Research steps
1 Definition of methods for word-level QE for TM-based CAT usingSBI
new problem identifiedmore work devotedgood starting point: more information available than in MT
2 Extending these methods to MTimportant differences between both technologiesthe two most important translation technologies are covered
3 Researching on methods to create SBI for under-resourcedlanguage pairs
developed in parallelimproves the usability of the word-level QE methods developed
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 11 / 101
Research steps
1 Definition of methods for word-level QE for TM-based CAT usingSBI
new problem identifiedmore work devotedgood starting point: more information available than in MT
2 Extending these methods to MTimportant differences between both technologiesthe two most important translation technologies are covered
3 Researching on methods to create SBI for under-resourcedlanguage pairs
developed in parallelimproves the usability of the word-level QE methods developed
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 11 / 101
Research steps
1 Definition of methods for word-level QE for TM-based CAT usingSBI
new problem identifiedmore work devotedgood starting point: more information available than in MT
2 Extending these methods to MTimportant differences between both technologiesthe two most important translation technologies are covered
3 Researching on methods to create SBI for under-resourcedlanguage pairs
developed in parallelimproves the usability of the word-level QE methods developed
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 11 / 101
Main working hypothesis
It is possible to develop methods exclusivelybased on external SBI for
word-level QE in TM-based (CAT) and MT.
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 12 / 101
Outline
1 Introduction to the problems addressed
2 Word-level QE in TM-based CAT
3 Word-level QE in MT
4 Building SBI for under-resourced language pairs
5 Contributions to the state-of-the-art
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 13 / 101
Introduction to the problems addressed
Outline
1 Introduction to the problems addressed
2 Word-level QE in TM-based CAT
3 Word-level QE in MT
4 Building SBI for under-resourced language pairs
5 Contributions to the state-of-the-art
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 14 / 101
Introduction to the problems addressed Word-level QE in TM-based CAT
Outline
1 Introduction to the problems addressedWord-level QE in TM-based CATWord-level QE in MTBuilding new SBI for under-resourced language pairs
2 Word-level QE in TM-based CAT
3 Word-level QE in MT
4 Building SBI for under-resourced language pairs
5 Contributions to the state-of-the-art
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 15 / 101
Introduction to the problems addressed Word-level QE in TM-based CAT
Translation memories
Catalan EnglishS1: L’ EAMT es membre de l’IAMT
T1: The EAMT is a member ofthe IAMT
S2: L’ Associacio Internacionalper a la Traduccio Automatica
T2: The International Associa-tion for Machine Translation
S3: el congres d’ enguany secelebra a Lovaina
T3: current year ’s conference isheld in Leuven
. . . . . .
S′: Associacio Europea per a la Traduccio Automatica
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 16 / 101
Introduction to the problems addressed Word-level QE in TM-based CAT
Translation memories
Catalan EnglishS1: L’ EAMT es membre de l’IAMT
T1: The EAMT is a member ofthe IAMT
S2: L’ Associacio Internacionalper a la Traduccio Automatica
T2: The International Associa-tion for Machine Translation
S3: el congres d’ enguany secelebra a Lovaina
T3: current year ’s conference isheld in Leuven
. . . . . .
S′: Associacio Europea per a la Traduccio Automatica
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 16 / 101
Introduction to the problems addressed Word-level QE in TM-based CAT
Translation memories
Catalan EnglishS1: L’ EAMT es membre de l’IAMT
T1: The EAMT is a member ofthe IAMT
S2: L’ Associacio Internacionalper a la Traduccio Automatica
T2: The International Associa-tion for Machine Translation
S3: el congres d’ enguany secelebra a Lovaina
T3: current year ’s conference isheld in Leuven
. . . . . .
S′: Associacio Europea per a la Traduccio AutomaticaS2: L’ Associacio Internacional per a la Traduccio Automatica
T2: The International Association for Machine Translation
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 17 / 101
Introduction to the problems addressed Word-level QE in TM-based CAT
Translation memories
Catalan EnglishS1: L’ EAMT es membre de l’IAMT
T1: The EAMT is a member ofthe IAMT
S2: L’ Associacio Internacionalper a la Traduccio Automatica
T2: The International Associa-tion for Machine Translation
S3: el congres d’ enguany secelebra a Lovaina
T3: current year ’s conference isheld in Leuven
. . . . . .
S′: Associacio Europea per a la Traduccio AutomaticaS2: L’ Associacio Internacional per a la Traduccio AutomaticaT2: The International Association for Machine Translation
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 17 / 101
Introduction to the problems addressed Word-level QE in TM-based CAT
Translation-memory-based CAT tools
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 18 / 101
Introduction to the problems addressed Word-level QE in TM-based CAT
Word-level QE for TM-based CAT
The problem: detecting which words in suggestions provided byTM-based CAT tools should be post-edited and which should be kept
Example
S′ = “Associacio Europea per a la Traduccio Automatica”S2 = “L’ Associacio Internacional per a la Traduccio Automatica”
T2 = “The International Association for Machine Translation”T ′=“European Association for Machine Translation”
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 19 / 101
Introduction to the problems addressed Word-level QE in MT
Outline
1 Introduction to the problems addressedWord-level QE in TM-based CATWord-level QE in MTBuilding new SBI for under-resourced language pairs
2 Word-level QE in TM-based CAT
3 Word-level QE in MT
4 Building SBI for under-resourced language pairs
5 Contributions to the state-of-the-art
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 20 / 101
Introduction to the problems addressed Word-level QE in MT
Word-level QE for MT
The problem: detecting which words in MT output should bepost-edited
Example
S′=“Associacio Europea per a la Traduccio Automatica”MT(S′)=“European Association for the Automatic Translation”
T ′=“European Association for Machine Translation”
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 21 / 101
Introduction to the problems addressed Word-level QE in MT
Word-level QE for MT
The problem: detecting which words in MT output should bepost-edited
Example
S′=“Associacio Europea per a la Traduccio Automatica”MT(S′)=“European Association for the Automatic Translation”
T ′=“European Association for Machine Translation”
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 22 / 101
Introduction to the problems addressed Word-level QE in MT
Word-level QE: MT vs. TM-based CAT
T provided by MT for S′: T is a translation of S′
that may be adequate or not
T provided by TM-based CAT for S′: T is not a translation of S′,but is an adequate translation of a similar segment S
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 23 / 101
Introduction to the problems addressed Building new SBI for under-resourced language pairs
Outline
1 Introduction to the problems addressedWord-level QE in TM-based CATWord-level QE in MTBuilding new SBI for under-resourced language pairs
2 Word-level QE in TM-based CAT
3 Word-level QE in MT
4 Building SBI for under-resourced language pairs
5 Contributions to the state-of-the-art
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 24 / 101
Introduction to the problems addressed Building new SBI for under-resourced language pairs
Creating new SBI for under-resourced languages
The scenario: no SBI available for a translation task betweenunder-resourced languages
The solution: building new SBI by using the Web as a parallel corpusA parallel data crawler can be used to build a parallel corpusMany SBI can be automatically built from a parallel corpus:
dictionariesMT systemsbilingual concordancersetc.
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 25 / 101
Introduction to the problems addressed Building new SBI for under-resourced language pairs
The Web as a parallel corpus
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 26 / 101
Word-level QE in TM-based CAT
Outline
1 Introduction to the problems addressed
2 Word-level QE in TM-based CAT
3 Word-level QE in MT
4 Building SBI for under-resourced language pairs
5 Contributions to the state-of-the-art
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 27 / 101
Word-level QE in TM-based CAT
Related work
No previous works on word-level QE in TM-based CAT...
...but the patent application by Kuhn et al. (2011): word-level QEin TM-based CAT by using statistical word alignments (no details)aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Closest approach by Kranias and Samioutou (2004): automaticpost-editing of translation suggestions using lexicons forword-alignment and MT for post-editing
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 28 / 101
Word-level QE in TM-based CAT
Related work
No previous works on word-level QE in TM-based CAT...
...but the patent application by Kuhn et al. (2011): word-level QEin TM-based CAT by using statistical word alignments (no details)
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 29 / 101
Word-level QE in TM-based CAT
Related work
No previous works on word-level QE in TM-based CAT...
...but the patent application by Kuhn et al. (2011): word-level QEin TM-based CAT by using statistical word alignments (no details)
Closest approach by Kranias and Samioutou (2004): automaticpost-editing of translation suggestions using glossaries forword-alignment and MT
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 30 / 101
Word-level QE in TM-based CAT
Word-level QE in TM-based CAT: a new problem
word-level QE is a concept that originates in MT
one of the keys of success of TM-based CAT tools:segment-level QE with easily-interpretable fuzzy-match scores
word-level QE for TM-based CAT is a new problem: is it relevant?
YES: pilot study indicates that word-level QE for TM-based CAT couldimprove productivity of translators up to 14%
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 31 / 101
Word-level QE in TM-based CAT
Word-level QE in TM-based CAT: a new problem
word-level QE is a concept that originates in MT
one of the keys of success of TM-based CAT tools:segment-level QE with easily-interpretable fuzzy-match scores
word-level QE for TM-based CAT is a new problem: is it relevant?
YES: pilot study indicates that word-level QE for TM-based CAT couldimprove productivity of translators up to 14%
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 31 / 101
Word-level QE in TM-based CAT
Word-level QE in TM-based CAT: a new problem
word-level QE is a concept that originates in MT
one of the keys of success of TM-based CAT tools:segment-level QE with easily-interpretable fuzzy-match scores
word-level QE for TM-based CAT is a new problem: is it relevant?
YES: pilot study indicates that word-level QE for TM-based CAT couldimprove productivity of translators up to 14%
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 31 / 101
Word-level QE in TM-based CAT
Word-level QE in TM-based CAT: a new problem
word-level QE is a concept that originates in MT
one of the keys of success of TM-based CAT tools:segment-level QE with easily-interpretable fuzzy-match scores
word-level QE for TM-based CAT is a new problem: is it relevant?
YES: pilot study indicates that word-level QE for TM-based CAT couldimprove productivity of translators up to 14%
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 31 / 101
Word-level QE in TM-based CAT
Rationale
Edit distance provides information about the words in S that do notappear in S′:
Example
Associacio Europea per a la Traduccio Automatica
L’ Associacio Internacional per a la Traduccio Automatica
The International Association for Machine Translation
S
T
S′
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 32 / 101
Word-level QE in TM-based CAT
Rationale
The objective is to project source-side matching information onto T , tosuggest which words to change (bad) or keep unedited (good):
Example
Associacio Europea per a la Traduccio Automatica
L’ Associacio Internacional per a la Traduccio Automatica
The International Association for Machine Translation
S
T
S′
����
����
PPPP
PPPP
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 33 / 101
Word-level QE in TM-based CAT Word-level QE in TM-based CAT using statistical word alignments
Outline
1 Introduction to the problems addressed
2 Word-level QE in TM-based CATWord-level QE in TM-based CAT using statistical wordalignmentsWord-level QE in CAT using SBI-based word alignmentsWord-level QE in TM-based CAT directly using SBIEvaluation
3 Word-level QE in MT
4 Building SBI for under-resourced language pairs
5 Contributions to the state-of-the-artMiquel Espla-Gomis SBI for word-level QE January 25, 2016 34 / 101
Word-level QE in TM-based CAT Word-level QE in TM-based CAT using statistical word alignments
Working hypothesis #1
H1: It is possible to use word alignments to estimatethe quality of TM-based CAT suggestions at the word level
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 35 / 101
Word-level QE in TM-based CAT Word-level QE in TM-based CAT using statistical word alignments
Rationale
Associacio Europea per a la Traduccio Automatica
L’ Associacio Internacional per a la Traduccio Automatica
The International Association for Machine Translation
S′S
T[GOOD][BAD]
\\\\\
ccc
cc
cc
Automatica and Machine aligned & Automatica matched =⇒ GOODInternacional and International aligned & Internacional unmatched =⇒ BAD
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 36 / 101
Word-level QE in TM-based CAT Word-level QE in TM-based CAT using statistical word alignments
Rationale
Associacio Europea per a la Traduccio Automatica
L’ Associacio Internacional per a la Traduccio Automatica
The International Association for Machine Translation
S′S
T
?
[?][?]
PPPP
PPPP
PPPP
PPP
for not aligned =⇒ ???The contradictory evidence =⇒ ??? (voting scheme)
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 37 / 101
Word-level QE in TM-based CAT Word-level QE in CAT using SBI-based word alignments
Outline
1 Introduction to the problems addressed
2 Word-level QE in TM-based CATWord-level QE in TM-based CAT using statistical wordalignmentsWord-level QE in CAT using SBI-based word alignmentsWord-level QE in TM-based CAT directly using SBIEvaluation
3 Word-level QE in MT
4 Building SBI for under-resourced language pairs
5 Contributions to the state-of-the-artMiquel Espla-Gomis SBI for word-level QE January 25, 2016 38 / 101
Word-level QE in TM-based CAT Word-level QE in CAT using SBI-based word alignments
Working hypothesis #2
H2: It is possible to use any SBI to obtain word alignments
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 39 / 101
Word-level QE in TM-based CAT Word-level QE in CAT using SBI-based word alignments
Using SBI to obtain word-alignments
Using sub-segment alignments for word alignments based on SBI
Example
S: L’ Associacio Internacional per a la Traduccio Automatica (Catalan)T : The International Association for Machine Translation (English)
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 40 / 101
Word-level QE in TM-based CAT Word-level QE in CAT using SBI-based word alignments
Segmentation
S: L’ Associacio Internacional per a la Traduccio Automatica (Catalan)T : The International Association for Machine Translation (English)
CatalanL’AssociacioInternacionalperalaTraduccioAutomaticaL’AssociacioAssociacio InternacionalInternacional per...
EnglishTheInternationalAssociationforMachineTranslationThe InternationalInternational AssociationAssociation forfor Machine...
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 41 / 101
Word-level QE in TM-based CAT Word-level QE in CAT using SBI-based word alignments
Sub-segment translation to obtain pairs (σ, τ)
S: L’ Associacio Internacional per a la Traduccio Automatica (Catalan)T : The International Association for Machine Translation (English)
Catalan→ EnglishL’ → TheAssociacio → AssociationInternacional → Internationalper → bya → atla → theTraduccio → TranslationAutomatica → AutomaticL’Associacio → The AssociationAssociacio Internacional →International AssociationInternacional per → International for...
English→ CatalanThe → ElInternational → InternacionalAssociation → Associaciofor → per aMachine → MaquinaTranslation → TraduccioThe International → El InternacionalInternational Association → AssociacioInternacionalAssociation for → Associacio perfor Machine → per a Maquina...
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 42 / 101
Word-level QE in TM-based CAT Word-level QE in CAT using SBI-based word alignments
(σ, τ) filtering
S: L’ Associacio Internacional per a la Traduccio Automatica (Catalan)T : The International Association for Machine Translation (English)
Catalan→ EnglishL’ → TheAssociacio → AssociationInternacional → Internationalper → bya → atla → theTraduccio → TranslationAutomatica → AutomaticL’Associacio → The AssociationAssociacio Internacional →International AssociationInternacional per → International for...
English→ CatalanThe → ElInternational → InternacionalAssociation → Associaciofor → per aMachine → MaquinaTranslation → TraduccioThe International → El InternacionalInternational Association → AssociacioInternacionalAssociation for → Associacio perfor Machine → per a Maquina...
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 43 / 101
Word-level QE in TM-based CAT Word-level QE in CAT using SBI-based word alignments
Sub-segment alignment
Example
International
Association
for
Machine
L’
Assoc
iació
Inte
rnac
iona
lpe
r a la
Tradu
cció
Autom
àtica
The
Translation
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 44 / 101
Word-level QE in TM-based CAT Word-level QE in CAT using SBI-based word alignments
Using SBI to obtain word-alignments?
1 Segmentation of both sentences2 Translation of sub-segments (in both directions) using SBIs to
obtain sub-segment pairs (σ, τ)
3 Filtering pairs (σ, τ) with σ not in S or τ not in T4 Aligning sub-segments (σ, τ) between S and T
5 Aligning words by using sub-segment alignments
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 45 / 101
Word-level QE in TM-based CAT Word-level QE in CAT using SBI-based word alignments
Using SBI to obtain word-alignments?
1 Segmentation of both sentences2 Translation of sub-segments (in both directions) using SBIs to
obtain sub-segment pairs (σ, τ)
3 Filtering pairs (σ, τ) with σ not in S or τ not in T4 Aligning sub-segments (σ, τ) between S and T5 Aligning words by using sub-segment alignments
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 45 / 101
Word-level QE in TM-based CAT Word-level QE in CAT using SBI-based word alignments
From sub-segment alignments to word alignments
Option A: Intuitive heuristic method based on the concept ofpressure:
Every sub-segment alignment applies force of 1;Every sub-segment alignment applies, on each pair of words, apressure corresponding to its force divided by its area;Alignment strength between two words: summation of all thepressures on a pair of words.
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 46 / 101
Word-level QE in TM-based CAT Word-level QE in CAT using SBI-based word alignments
From sub-segment alignments to word alignments
Option A: Intuitive heuristic method based on the concept ofpressure:
Every sub-segment alignment applies force of 1;Every sub-segment alignment applies, on each pair of words, apressure corresponding to its force divided by its area;Alignment strength between two words: summation of all thepressures on a pair of words.
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 46 / 101
Word-level QE in TM-based CAT Word-level QE in CAT using SBI-based word alignments
Word alignment strength
Example
International
Association
for
Machine
L’
Assoc
iació
Inte
rnac
iona
lpe
r a la
Tradu
cció
Autom
àtica
The
Translation
Word-alignmentstrength betweenAssociacio andAssociation:
11×1 + 1
2×2 + 13×3 '
1.36
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 47 / 101
Word-level QE in TM-based CAT Word-level QE in CAT using SBI-based word alignments
Word alignment strength
Example
International
Association
for
Machine
L’
Assoc
iació
Inte
rnac
iona
lpe
r a la
Tradu
cció
Autom
àtica
The
Translation
0.11 0.11
0.11
0.11
1.11
0.36
0.36
1.36
1.36
1
0.5 0.5
0.25
0.25
1.25
1.25
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 48 / 101
Word-level QE in TM-based CAT Word-level QE in CAT using SBI-based word alignments
Word alignment strength
Example
International
Association
for
Machine
L’
Assoc
iació
Inte
rnac
iona
lpe
r a la
Tradu
cció
Autom
àtica
The
Translation
0.11 0.11
0.11
0.11
1.11
0.36
0.36
1.36
1.36
1
0.5 0.5
0.25
0.25
1.25
1.25
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 49 / 101
Word-level QE in TM-based CAT Word-level QE in CAT using SBI-based word alignments
From sub-segment alignments to word alignments
Option B: Linear-combination approach:
Sub-segment alignments are grouped depending on their geometry(1× 1, 1× 2, 2× 1, etc.), and a linear-combination approach is usedto assign weights to them
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 50 / 101
Word-level QE in TM-based CAT Word-level QE in CAT using SBI-based word alignments
From sub-segment alignments to word alignments
Option B: Linear-combination approach:
Sub-segment alignments are grouped depending on their geometry(1× 1, 1× 2, 2× 1, etc.), and a linear-combination approach is usedto assign weights to them
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 50 / 101
Word-level QE in TM-based CAT Word-level QE in TM-based CAT directly using SBI
Outline
1 Introduction to the problems addressed
2 Word-level QE in TM-based CATWord-level QE in TM-based CAT using statistical wordalignmentsWord-level QE in CAT using SBI-based word alignmentsWord-level QE in TM-based CAT directly using SBIEvaluation
3 Word-level QE in MT
4 Building SBI for under-resourced language pairs
5 Contributions to the state-of-the-artMiquel Espla-Gomis SBI for word-level QE January 25, 2016 51 / 101
Word-level QE in TM-based CAT Word-level QE in TM-based CAT directly using SBI
Working hypothesis #3
H3: It is possible to use SBI to estimate the qualityof TM-based CAT translation suggestions at the word level
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 52 / 101
Word-level QE in TM-based CAT Word-level QE in TM-based CAT directly using SBI
Rationale
Align sub-segments between S and T (as in the method usingSBI-based word alignment)
Obtain word-level QEs by:using an heuristic method that counts matched/unmatched alignedwords (as in the method using word alignments)using a binary classifier to make quality estimations
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 53 / 101
Word-level QE in TM-based CAT Word-level QE in TM-based CAT directly using SBI
Evidence from sub-segments
Evidence from sub-segments of length L = 1
Example
Associacio Europea per a la Traduccio Automatica
L’ Associacio Internacional per a la Traduccio Automatica
The International Association for Machine Translation
S′S
T
\\\\\
ccc
cccc
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 54 / 101
Word-level QE in TM-based CAT Word-level QE in TM-based CAT directly using SBI
Contradictory evidence from sub-segments
Contradictory evidence from sub-segments of length L = 1
Example
Associacio Europea per a la Traduccio Automatica
L’ Associacio Internacional per a la Traduccio Automatica
The International Association for Machine Translation
S′S
T
PPPP
PPPP
PPPP
PPP
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 55 / 101
Word-level QE in TM-based CAT Word-level QE in TM-based CAT directly using SBI
Evidence from longer sub-segments
Longer sub-segments provide more context
Example
Associacio Europea per a la Traduccio Automatica
L’ Associacio Internacional per a la Traduccio Automatica
The International Association for Machine Translation
S′S
T
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 56 / 101
Word-level QE in TM-based CAT Evaluation
Outline
1 Introduction to the problems addressed
2 Word-level QE in TM-based CATWord-level QE in TM-based CAT using statistical wordalignmentsWord-level QE in CAT using SBI-based word alignmentsWord-level QE in TM-based CAT directly using SBIEvaluation
3 Word-level QE in MT
4 Building SBI for under-resourced language pairs
5 Contributions to the state-of-the-artMiquel Espla-Gomis SBI for word-level QE January 25, 2016 57 / 101
Word-level QE in TM-based CAT Evaluation
Evaluation of all the approaches
Evaluation for the five approaches:statistical word alignmentSBI-based word alignment (“pressure”)SBI-based word alignment (linear combination)SBI directly (heuristic)SBI directly (binary classification)
Metrics: accuracy (A) and fraction of words not covered (NC)
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 58 / 101
Word-level QE in TM-based CAT Evaluation
Evaluation resources
TM and test set from the same domain for Spanish→English(ES→EN)
Different values of fuzzy-match score threshold (Θ)
Three SBI (MT): Google Translate, Apertium, and PowerTranslator
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 59 / 101
Word-level QE in TM-based CAT Evaluation
Using SBI-based word-alignments
training: ES→EN; in-domain; all SBI test: ES→EN; all SBI
method metric Θ ≥ 60% Θ ≥ 70% Θ ≥ 80% Θ ≥ 90%
statistical word A(%) 93.9±.2 94.3±.2 95.1±.2 95.3±.3alignment NC(%) 4.4±.1 4.5±.2 4.3±.2 4.4±.3
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 60 / 101
Word-level QE in TM-based CAT Evaluation
Using SBI-based word-alignments
training: ES→EN; in-domain; all SBI test: ES→EN; all SBI
method metric Θ ≥ 60% Θ ≥ 70% Θ ≥ 80% Θ ≥ 90%
statistical word A(%) 93.9±.2 94.3±.2 95.1±.2 95.3±.3alignment NC(%) 4.4±.1 4.5±.2 4.3±.2 4.4±.3
pressure SBI-based A(%) 93.7±.2 94.5±.2 95.4±.2 96.0±.2word alignment NC(%) 10.0±.2 8.9±.2 8.2±.3 7.2±.3
linear SBI-based A(%) 94.2±.2 94.8±.2 95.8±.2 96.4±.2word alignment NC(%) 9.8±.2 8.8±.2 8.1±.3 7.0±.3
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 60 / 101
Word-level QE in TM-based CAT Evaluation
Using SBI-based word-alignments
training: ES→EN; in-domain; all SBI test: ES→EN; all SBI
method metric Θ ≥ 60% Θ ≥ 70% Θ ≥ 80% Θ ≥ 90%
statistical word A(%) 93.9±.2 94.3±.2 95.1±.2 95.3±.3alignment NC(%) 4.4±.1 4.5±.2 4.3±.2 4.4±.3
pressure SBI-based A(%) 93.7±.2 94.5±.2 95.4±.2 96.0±.2word alignment NC(%) 10.0±.2 8.9±.2 8.2±.3 7.2±.3
linear SBI-based A(%) 94.2±.2 94.8±.2 95.8±.2 96.4±.2word alignment NC(%) 9.8±.2 8.8±.2 8.1±.3 7.0±.3
heuristic method A(%) 93.3±.2 93.9±.2 95.1±.2 96.5±.3using SBI directly NC(%) 5.1±.1 5.2±.2 5.5±.2 5.9±.3
binary classification A(%) 95.1±.1 95.6±.2 96.4±.2 96.9±.2using SBI directly NC(%) 5.1±.1 5.2±.2 5.5±.2 5.9±.3
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 60 / 101
Word-level QE in TM-based CAT Evaluation
QE accuracy across domains
Training on out-of-domain TMs and re-using the models
Important: simulates performance on a translation proposal(translation unit) not seen during training
Comparing the method based on statistical word alignment andmethods using SBI directly
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 61 / 101
Word-level QE in TM-based CAT Evaluation
QE accuracy across domainstraining: ES→EN; out-of-domain; all SBI test: ES→EN; all SBI
Θ(%) training domainword alignment SBI-based approaches
statistical classifier heuristic
A (%) NC (%) A (%) A (%) NC (%)
≥ 60in domain 93.9±.2 6.1±.2 95.1±.1
out of domain 1 91.8±.2 9.6±.2 93.3±.2 93.3±.2 5.1±.1out of domain 2 90.2±.2 11.7±.2 92.0±.2
≥ 70in domain 94.3±.2 5.9±.2 95.6±.2
out of domain 1 92.7±.2 9.7±.2 94.0±.2 94.1±.2 5.2±.2out of domain 2 91.6±.2 11.6±.3 92.7±.2
≥ 80in domain 95.1±.2 5.4±.2 96.4±.2
out of domain 1 93.8±.2 9.2±.3 95.6±.2 95.3±.2 5.5±.2out of domain 2 93.2±.2 11.1±.3 94.3±.2
≥ 90in domain 95.3±.3 4.9±.3 96.9±.2
out of domain 1 94.6±.3 8.9±.3 96.5±.2 96.6±.2 5.9±.3out of domain 2 94.2±.3 10.3±.4 96.3±.2
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 62 / 101
Word-level QE in TM-based CAT Evaluation
QE accuracy across SBI
Training on a translation task using Apertium and PowerTranslator separately as SBI
Evaluating on the same translation task using Google Translateas SBI
Comparing the methods using SBI directly
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 63 / 101
Word-level QE in TM-based CAT Evaluation
QE accuracy across SBI
training: ES→EN; in-domain; SBI 6= Google test: ES→EN; SBI = Google
Θ(%)A (%) binary classification using SBI directly A (%) heuristic method
Apertium Google Translate Power Translator using SBI directly
≥ 60 92.9±.2 95.1±.1 91.1±.2 93.4±.2≥ 70 93.4±.2 95.5±.2 93.0±.2 94.3±.2≥ 80 95.6±.2 96.3±.2 94.8±.2 95.4±.2≥ 90 96.7±.2 97.0±.2 96.5±.2 96.7±.2
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 64 / 101
Word-level QE in TM-based CAT Evaluation
QE accuracy across language pairs
Impact of re-using models for different languages
Models trained on 9 different language pairs: EN→ES, DE→EN,EN→DE, FR→EN, EN→FR, FI→EN, EN→FI, ES→FR, FR→ES
Evaluation on ES→EN
Comparing the methods using SBI directly
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 65 / 101
Word-level QE in TM-based CAT Evaluation
QE accuracy across language pairs: impact of the TL
training: XX→YY; in-domain; all SBI test: ES→EN; all SBI
to EN Θ = 60% from EN Θ = 60%
ES→EN 95.1±.1 EN→ES 90.8±.2DE→EN 92.5±.2 EN→DE 92.2±.2FR→EN 92.7±.2 EN→FR 90.9±.2FI→EN 92.4±.2 EN→FI 91.1±.2
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 66 / 101
Word-level QE in TM-based CAT Evaluation
QE accuracy across language pairs: best results
training: XX→YY; in-domain; all SBI test: ES→EN; all SBI
training Θ(%)
language pair ≥ 60 ≥ 70 ≥ 80 ≥ 90
ES→EN 95.1±.1 95.5±.2 96.3±.2 97.1±.2FR→EN 92.7±.2 94.2±.2 95.8±.2 96.7±.2heuristic 93.4±.2 94.3±.2 95.4±.2 96.7±.2
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 67 / 101
Word-level QE in TM-based CAT Evaluation
Chapter conclusions
New problem identified that is relevant for professional translators
New methods for word alignments based on SBI
Three methods for word level QE for CATStatistical word alignment: best coverage (in-domain), difficult tore-use models when domain changesSBI-based word alignment: worst coverageSBI directly for word-level QE: best accuracy, good results whenre-using models across domains
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 68 / 101
Word-level QE in MT
Outline
1 Introduction to the problems addressed
2 Word-level QE in TM-based CAT
3 Word-level QE in MT
4 Building SBI for under-resourced language pairs
5 Contributions to the state-of-the-art
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 69 / 101
Word-level QE in MT
Related work
Confidence estimation (Blatz et al., 2003): semantic features(WordNet), probabilistic lexicons, word posterior probabilities
Approaches using pseudo-references:word occurrence in n-best lists (Ueffing and Ney, 2007)pseudo-references (Bicici, 2013)inverse machine translation (MULTILIZER approach: Bojar et al.,2014)
General framework for translation QE in MT: QuEst (Specia,2013) [now QuEst++ (Specia, 2015)]
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 70 / 101
Word-level QE in MT
Working hypothesis #4
H4: It is possible to take the SBI-based methodsfor word-level QE in TM-based CATand adapt them for their use in MT
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 71 / 101
Word-level QE in MT
Adapting the existing methods?
We start with the binary classification approach based on SBI
We can reuse the idea of sub-segment alignments...
... but we do not have S and T : it is not possible to use exactlythe same method
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 72 / 101
Word-level QE in MT
Adapting the existing methods?
New collection of features is neededpositive features using exact matches of translatedsub-segmentsnegative features using partial matches of translatedsub-segments
Example
S′=“Associacio Europea per a la Traduccio Automatica”MT(S′)=“European Association for the Automatic Translation”
T ′=“European Association for Machine Translation”
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 73 / 101
Word-level QE in MT
How to use SBI to obtain positive evidence?
1 Source language sub-segments σ and target languagesub-segments τ are translated using SBI
2 Words in a sub-segment τ that is related by SBI to a sub-segmentσ that matches S′ get positive evidence
S′ = “Associacio Europea per a la Traduccio Automatica”ytranslate σx
τ = “European Association for”3 3 3
MT(S′) = “European Association for the Automatic Translation”
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 74 / 101
Word-level QE in MT
How to use SBI to obtain negative evidence?
1 Source language sub-segments σ are translated using SBI2 Translations τ are partially matched onto MT(S′)3 Words in MT(S′) that do not match but are surrounded by
matching words get negative evidence
Example
S′ = “Associacio Europea per a la Traduccio Automatica”ytranslate σx
τ = the Machine Translation3 7 3
MT(S′) = “European Association the Automatic Translation”
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 75 / 101
Word-level QE in MT Evaluation
Outline
1 Introduction to the problems addressed
2 Word-level QE in TM-based CAT
3 Word-level QE in MTEvaluation
4 Building SBI for under-resourced language pairs
5 Contributions to the state-of-the-art
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 76 / 101
Word-level QE in MT Evaluation
Experimental setting
Binary classification problem: GOOD and BAD classes
Metrics: F w1 , F BAD
1 , and F GOOD1
Datasets:WMT’15: we did compete; 1 dataset (for translating Spanish intoEnglish)WMT’14: we did not compete; 4 datasets (for 4 language pairs)
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 77 / 101
Word-level QE in MT Evaluation
SBI used
Machine translation: Google Translate and Apertiumonly one translation suggestion for each σtranslation frequency not available
Bilingual concordancer: Reverso Contextmultiple translation suggestions possible for each σtranslation frequency available
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 78 / 101
Word-level QE in MT Evaluation
WMT’15 participant ranking
WMT’15 evaluation on EN→ES data
System ID F w1 F BAD
1 F GOOD1
• UAlacant/OnLine-SBI-Baseline 71.5% 43.1% 78.1%• HDCL/QUETCHPLUS 72.6% 43.1% 79.4%
UAlacant/OnLine-SBI 69.5% 41.5% 76.1%
SAU/KERC-CRF 77.4% 39.1% 86.4%SAU/KERC-SLG-CRF 77.4% 38.9% 86.4%SHEF2/W2V-BI-2000 65.4% 38.4% 71.6%SHEF2/W2V-BI-2000-SIM 65.3% 38.4% 71.5%SHEF1/QuEst++-AROW 62.1% 38.4% 67.6%...
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 79 / 101
Word-level QE in MT Evaluation
Results WMT’14
SBI-based methods vs. best performing systems in WMT’14
language OnLine-SBI
pair F w1 F BAD
1 winner method F w1 F BAD
1
EN→ES 61.4% 49.0% FBK-UPV-UEDIN/RNN 62.0% 48.7%ES→EN 75.9% 10.4% RTM-DCU/RTM-GLMd 79.5% 29.1%EN→DE 66.8% 43.1% RTM-DCU/RTM-GLM 71.5% 45.3%DE→EN 75.0% 40.3% RTM-DCU/RTM-GLM 72.4% 26.1%
We are good when detecting errors: terminology, mistranslation,and unintelligible
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 80 / 101
Word-level QE in MT Evaluation
Chapter conclusions
Confirmed: SBI can be used for word-level QE in MT
Leading performer among the approaches proposed
Small amount of features required (70) when compared to otherapproaches: for example Bicici (2013) uses 511K features
Independence with respect to the MT system used to producethe translation hypotheses
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 81 / 101
Building SBI for under-resourced language pairs
Outline
1 Introduction to the problems addressed
2 Word-level QE in TM-based CAT
3 Word-level QE in MT
4 Building SBI for under-resourced language pairs
5 Contributions to the state-of-the-art
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 82 / 101
Building SBI for under-resourced language pairs
Rationale
1 Using Bitextor (Espla-Gomis, 2010) to build a parallel corpus forunder-resourced language pairs
2 Using the parallel corpus obtained to build new SBI3 Using the new SBI for the QE techniques developed
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 83 / 101
Building SBI for under-resourced language pairs
Working hypotheses #5 and #6
H5: It is possible to create new SBI that enable word-level QEfor language pairs with no SBI available
using Bitextor to crawl parallel data
H6: The results obtained in word-level QE for under-resourcedlanguage pairs can be improved by using new SBI obtained
through parallel data crawling
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 84 / 101
Building SBI for under-resourced language pairs Using the Web as a parallel corpus
Outline
1 Introduction to the problems addressed
2 Word-level QE in TM-based CAT
3 Word-level QE in MT
4 Building SBI for under-resourced language pairsUsing the Web as a parallel corpusBuilding SBI for English–CroatianNew SBI in word-level QE in TM-based CAT
5 Contributions to the state-of-the-art
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 85 / 101
Building SBI for under-resourced language pairs Using the Web as a parallel corpus
Related work
Approaches to detect parallel text in multilingual websites:URL similarity patterns: Ma and Liberman (1999), Nie et al.(1999), and Resnik and Smith (2003)HTML structure comparison: Resnik and Smith (2003), Sin etal. (2005), and Zhang et al. (2006)image coocurrence: Papavassiliou et al. (2013)text comparison (mostly based on bag-of-words overlappingmetrics): Chen et al. (2004), and Barbosa et al. (2012)
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 86 / 101
Building SBI for under-resourced language pairs Using the Web as a parallel corpus
How does Bitextor work?
HTML structure comparison
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 87 / 101
Building SBI for under-resourced language pairs Using the Web as a parallel corpus
How does Bitextor work?
Word overlap metrics based on Sanchez-Martınez & Carrasco (2011)
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 88 / 101
Building SBI for under-resourced language pairs Building SBI for English–Croatian
Outline
1 Introduction to the problems addressed
2 Word-level QE in TM-based CAT
3 Word-level QE in MT
4 Building SBI for under-resourced language pairsUsing the Web as a parallel corpusBuilding SBI for English–CroatianNew SBI in word-level QE in TM-based CAT
5 Contributions to the state-of-the-art
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 89 / 101
Building SBI for under-resourced language pairs Building SBI for English–Croatian
Building a TM for English–Croatian tourism domain
Crawling 21 bilingual websites from the tourism domain
Using two crawlers: Bitextor and ILSP Focused Crawler
For both crawlers, 2 configurations: accuracy-oriented andrecall-oriented
Evaluation: resulting parallel corpus through manual revision
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 90 / 101
Building SBI for under-resourced language pairs Building SBI for English–Croatian
Building a TM for English–Croatian tourism domain
Evaluation of corpora built by Bitextor and ILSP Focused Crawler
tool setting aligned success unique aligneddocuments rate segments
Focused recall 3,294 73.86% 40,431Crawler accuracy 2,406 90.76% 32,544
Bitextor recall 7,787 74.70% 50,338accuracy 3,758 94.79% 36,834
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 91 / 101
Building SBI for under-resourced language pairs New SBI in word-level QE in TM-based CAT
Outline
1 Introduction to the problems addressed
2 Word-level QE in TM-based CAT
3 Word-level QE in MT
4 Building SBI for under-resourced language pairsUsing the Web as a parallel corpusBuilding SBI for English–CroatianNew SBI in word-level QE in TM-based CAT
5 Contributions to the state-of-the-art
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 92 / 101
Building SBI for under-resourced language pairs New SBI in word-level QE in TM-based CAT
Experimental setup
Evaluation on word-level QE in TM-based CAT forEnglish–Finnish
Parallel corpus crawled using Bitextor on about 10,700 websites(1,700,000 parallel segments)
Performance when using two kinds of SBI: phrase tables andphrase-based SMT (Moses)
Comparison to Google Translate as a SBI
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 93 / 101
Building SBI for under-resourced language pairs New SBI in word-level QE in TM-based CAT
Results for Finnish→English
Comparing Google with new SBI created with Bitextor
SBI metric method Θ ≥ 60% Θ ≥ 70% Θ ≥ 80% Θ ≥ 90%
Google A(%)classification 93.2±.2 94.6±.2 94.7±.2 94.8±.3
heuristic 91.6±.2 93.4±.3 94.1±.3 94.5±.3
NC(%) — 11.2±.2 11.3±.2 11.5±.3 11.6±.4
phrases A(%)classification 87.7±.3 90.3±.3 91.8±.3 92.3±.4
heuristic 85.9±.3 89.6±.3 91.5±.3 92.2±.4
NC(%) — 23.5±.3 23.1±.3 23.1±.4 23.8±.6
Moses A(%)classification 93.1±.2 94.7±.2 94.9±.3 94.7±.4
heuristic 93.1±.2 94.7±.2 94.9±.3 94.7±.4
NC(%) — 45.5±.3 44.2±.4 44.9±.5 46.8±.7
Google A(%)classification 91.3±.2 92.8±.2 93.2±.3 93.0±.3
+ Moses heuristic 87.8±.2 90.5±.2 91.8±.3 92.6±.4
+ phrases NC(%) — 5.1±.2 4.3±.2 4.9±.2 5.4±.3
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 94 / 101
Building SBI for under-resourced language pairs New SBI in word-level QE in TM-based CAT
Results for Finnish→English
Comparing Google with new SBI created with Bitextor
SBI metric method Θ ≥ 60% Θ ≥ 70% Θ ≥ 80% Θ ≥ 90%
Google A(%)classification 93.2±.2 94.6±.2 94.7±.2 94.8±.3
heuristic 91.6±.2 93.4±.3 94.1±.3 94.5±.3
NC(%) — 11.2±.2 11.3±.2 11.5±.3 11.6±.4
phrases A(%)classification 87.7±.3 90.3±.3 91.8±.3 92.3±.4
heuristic 85.9±.3 89.6±.3 91.5±.3 92.2±.4
NC(%) — 23.5±.3 23.1±.3 23.1±.4 23.8±.6
Moses A(%)classification 93.1±.2 94.7±.2 94.9±.3 94.7±.4
heuristic 93.1±.2 94.7±.2 94.9±.3 94.7±.4
NC(%) — 45.5±.3 44.2±.4 44.9±.5 46.8±.7
Google A(%)classification 91.3±.2 92.8±.2 93.2±.3 93.0±.3
+ Moses heuristic 87.8±.2 90.5±.2 91.8±.3 92.6±.4
+ phrases NC(%) — 5.1±.2 4.3±.2 4.9±.2 5.4±.3
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 94 / 101
Building SBI for under-resourced language pairs New SBI in word-level QE in TM-based CAT
Chapter conclusions
Bitextor provides high-quality parallel corpora
Bitextor may enable the use of SBI-based approaches forword-level QE in TM-based CAT for under-resourced languagepairs
Encouraging results:Phrase-based SMT system: accuracy comparable to that of GoogleTranslatePhrase tables + Google Translate: coverage improves more thantwice; accuracy falls noticeably
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 95 / 101
Contributions to the state-of-the-art
Outline
1 Introduction to the problems addressed
2 Word-level QE in TM-based CAT
3 Word-level QE in MT
4 Building SBI for under-resourced language pairs
5 Contributions to the state-of-the-art
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 96 / 101
Contributions to the state-of-the-art
Where would we like to be?
We would like CAT environments in which word-level QE is providedfor any translation technology available...
... without having to build any specific resources...
... that only require to plug any SBI already available for the user...
... and that are able to build their own SBI if needed.
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 97 / 101
Contributions to the state-of-the-art
How does our work contribute?
We have developed methods for the 2 most used technologies in CAT,TM and MT,...
... that only need to have access to the SBI already available for theuser...
... and if no SBI are available, we have developed a methodology forcreating new SBI from multilingual websites using the toolBitextor.
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 98 / 101
Contributions to the state-of-the-art
How does our work contribute?
We have developed methods for the 2 most used technologies in CAT,TM and MT,...
... that only need to have access to the SBI already available for theuser...
... and if no SBI are available, we have developed a methodology forcreating new SBI from multilingual websites using the toolBitextor.
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 98 / 101
Contributions to the state-of-the-art
How does our work contribute?
We have developed methods for the 2 most used technologies in CAT,TM and MT,...
... that only need to have access to the SBI already available for theuser...
... and if no SBI are available, we have developed a methodology forcreating new SBI from multilingual websites using the toolBitextor.
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 98 / 101
Contributions to the state-of-the-art
How does our work contribute?
We have developed methods for the 2 most used technologies in CAT,TM and MT,...
... that only need to have access to the SBI already available for theuser...
... and if no SBI are available, we have developed a methodology forcreating new SBI from multilingual websites using the toolBitextor.
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 98 / 101
Contributions to the state-of-the-art
New research paths
QE for parallel-corpora crawling
Automatic or aided post-editing
Word-level QE for interactive machine translation
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 99 / 101
Contributions to the state-of-the-art
QE for parallel-corpora crawling
QE for parallel-corpora crawlingQE as a new source of information for comparing documentsUsing word-level QE as a source of information for cleaning paralleldata
Automatic or aided post-editing
Word-level QE for interactive machine translation
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 99 / 101
Contributions to the state-of-the-art
Automatic or aided post-editing
QE for parallel-corpora crawling
Automatic or aided post-editingAdapting current techniques to suggest translation alternativesExtend the current work to consider insertions
Word-level QE for interactive machine translation
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 99 / 101
Contributions to the state-of-the-art
Word-level QE for interactive machine translation
QE for parallel-corpora crawling
Automatic or aided post-editing
Word-level QE for interactive machine translationUsing the SBI-based word-alignment methods developed to keeptrack of the sub-segments translatedEvaluating word-level QE for this new technology
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 99 / 101
Contributions to the state-of-the-art
Free/open-source software published
Software published (http://bit.do/mespla_software):Gamblr-CAT: Software for word-level QE in TM-based CATGamblr-MT: Software for word-level QE in MTBitextor: Software for building parallel data from multilingualwebsitesOmegaT-SessionLog: Plugin for traking the actions of atranslator when using the TM-based CAT tool OmegaTOmegaT-Marker-Plugin: Plugin for OmegaT that implementsthe heuristic word-level QE systemFlyligner: Word-alignment software based on SBI
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 100 / 101
Contributions to the state-of-the-art
MOLTES GRACIES PER L’ATENCIO!
THANK YOU VERY MUCH FOR YOURATENTION!
This work has been partially funded by:the Spanish government through projects TIN2009-14009-C02-01 andTIN2012-32615,the European Union through project Abu-MaTran(PIAP-GA-2012-324414), andthe Kazakh Government through project UA KAZAKHUNIVERSITY1-15I.
Miquel Espla-Gomis SBI for word-level QE January 25, 2016 101 / 101