In an ideal world:
Linguistic information is seamlessly combined with statistical information in translation systems to produce perfect translations.
We are moving in that direction:
◦ Morphology
◦ Syntax
◦ Semantics (SRL): (Wu & Fung 2009), (Liu & Gildea 2010), (Aziz et al. 2011)
Meanwhile…
◦ Linguistic information to evaluate MT quality: based on reference translations
◦ Linguistic information to estimate MT quality: using machine learning
◦ Linguistic information to detect errors in MT: automatic post-editing
Handle variations in MT (words and structure) wrt the reference, or identify differences between MT and reference:
◦ METEOR (Denkowski & Lavie 2011): words and phrases
◦ (Giménez & Màrquez 2010): matching of lexical, syntactic, semantic and discourse units
◦ (Lo & Wu 2011): SRL and manual matching of ‘who’ did ‘what’ to ‘whom’, etc.
◦ (Rios et al. 2011): automatic SRL with automatic (inexact) matching of predicates and arguments
Essentially: matching of linguistic units. Similar to n-gram matching metrics, but the units are not only words.
Metrics based on lexical units perform better.
Issues:
◦ Lack of (good) resources for certain languages
◦ Unreliable processing of incorrect translations
◦ Sparsity at sentence level, depending on the actual features, e.g. matching of named entities
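In code terms this is the same machinery as n-gram matching, applied to whatever units a linguistic processor extracts. A minimal sketch with clipped counts (unit extraction is assumed to happen upstream; real metrics such as METEOR add stemming, synonym and paraphrase matching):

```python
from collections import Counter

def unit_match_f1(mt_units, ref_units):
    """Precision/recall matching of linguistic units (words, lemmas,
    SRL predicates, named entities, ...) between MT output and a
    reference translation, with counts clipped as in n-gram metrics."""
    if not mt_units or not ref_units:
        return 0.0
    mt, ref = Counter(mt_units), Counter(ref_units)
    overlap = sum((mt & ref).values())      # clipped matches
    p = overlap / len(mt_units)
    r = overlap / len(ref_units)
    return 2 * p * r / (p + r) if p + r else 0.0

# Works for any unit type, e.g. plain tokens:
print(unit_match_f1("the student has chosen".split(),
                    "the student has not chosen the course".split()))
```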
Goal: given the output of an MT system for a given input, provide an estimate of its quality.
Uses:
◦ Filter bad-quality translations from post-editing
◦ Select “perfect” translations for publishing
◦ Flag unreliable translations to readers of the target language only
◦ Select the best translation for a given input when multiple MT/TM systems are available
NOT standard MT evaluation:
◦ Reference translations are NOT available
◦ Estimation for unseen translations
My approach:
◦ Translation unit: sentence
◦ Independent from the MT system
1. Define the aspect of quality to estimate and how to represent it
2. Identify and extract features that explain that aspect of quality
3. Collect examples of translations with different levels of quality and annotate them
4. Learn a model to predict quality scores for new translations and evaluate it
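These steps map onto a small skeleton. A hypothetical sketch, where `extract_features` stands in for step 2 and scikit-learn's SVR mirrors the SVM-for-regression setup used later in the talk:

```python
from sklearn.svm import SVR

def train_qe_model(annotated_triples, extract_features):
    """Steps 3-4: fit a regression model on (source, translation,
    human_score) triples; extract_features (step 2) must return a
    fixed-length numeric vector per sentence pair."""
    X = [extract_features(src, mt) for src, mt, _ in annotated_triples]
    y = [score for _, _, score in annotated_triples]
    model = SVR(kernel="rbf")
    model.fit(X, y)
    return model   # step 4: model.predict() scores new translations

# Step 1 is a design decision: here the score would be 1-4 post-editing effort.
```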
[Diagram: source text → MT system → translation; complexity indicators (source), confidence indicators (MT system), fluency indicators (translation) and adequacy indicators (source-target) feed the quality estimate]
Features can be shallow or linguistically motivated
(S/T/S-T) Sentence length
(S/T) Language model
(S/T) Token-type ratio
(S) Readability metrics: Flesch, etc.
(S) Average number of possible translations per word
(S) % of n-grams belonging to different frequency quartiles of a source language corpus
(T) Untranslated/OOV words
(T) Mismatching brackets, quotation marks
(S-T) Preservation of punctuation
(S-T) Word alignment score
etc.
These do well for estimating general quality wrt post-editing needs, but are not enough for other aspects of quality…
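Below, a minimal sketch of a few of these shallow indicators, with crude stand-ins (whitespace tokenization, a plain vocabulary set instead of a real LM or alignment model); all names are illustrative:

```python
def shallow_features(src, tgt, src_vocab):
    """Toy versions of some shallow indicators listed above."""
    s, t = src.split(), tgt.split()
    return {
        "src_length": len(s),                             # (S) sentence length
        "tgt_length": len(t),                             # (T) sentence length
        "length_ratio": len(t) / max(len(s), 1),          # (S-T)
        "src_type_token": len(set(s)) / max(len(s), 1),   # (S) token-type ratio
        "oov_words": sum(w not in src_vocab for w in s),  # (S) OOV wrt a corpus
        "untranslated": sum(w in set(s) for w in t),      # (T) source words kept verbatim
        "punct_preserved": float(src.count(",") == tgt.count(",")
                                 and src.count("?") == tgt.count("?")),  # (S-T)
    }
```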
Count-based:
(S/T/S-T) Content/non-content words
(S/T/S-T) Nouns/verbs/… NP/VP/…
(S/T/S-T) Deictics (references)
(S/T/S-T) Discourse markers (references)
(S/T/S-T) Named entities
(S/T/S-T) Zero subjects
(S/T/S-T) Pronominal subjects
(S/T/S-T) Negation indicators
(T) Subject-verb / adjective-noun agreement
(T) Language model of POS
(T) Grammar checking (dangling words)
(T) Coherence
Alignment-based:
(S-T) Correct translation of pronouns
(S-T) Matching of dependency relations
(S-T) Matching of named entities
(S-T) Alignment of parse trees
(S-T) Alignment of predicates & arguments
etc.
Some features are language-dependent; others need language-dependent resources but apply to most languages, e.g. an LM of POS tags.
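A sketch of that POS-LM feature, assuming NLTK's English tagger and a KenLM model trained on POS-tag sequences (the `pos_tags.arpa` file is hypothetical):

```python
import kenlm                    # assumes a LM trained on POS-tag sequences
from nltk import pos_tag        # assumes the English tagger models are installed

pos_lm = kenlm.Model("pos_tags.arpa")   # hypothetical model file

def pos_lm_feature(translation):
    """Length-normalized log-probability of the translation's
    POS-tag sequence under the tag LM."""
    tags = [tag for _, tag in pos_tag(translation.split())]
    return pos_lm.score(" ".join(tags)) / max(len(tags), 1)
```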
Count-based feature representation:
◦ Source/target only: count or proportion
◦ Contrastive features (S-T): very important, but not a simple matching of linguistic units
  Alignment may not be possible (e.g. clauses/phrases)
  Force the same linguistic phenomena in S and T? Verbs may be translated as nouns
  How to model different linguistic phenomena?
S = linguistic unit in source; T = linguistic unit in target
F = S − T        F = |S − T|        F = S / T        F = T / S        …
Count-based feature representation:
◦ Monotonicity of features
◦ Sparsity: is 0-0 as good as 10-10?
Our representation: precision and recall
◦ Does not rely on alignment
◦ Upper bound = 1 (also holds for S, T = 0)
◦ Lower bound = 0
P = min(S, T) / T        R = min(S, T) / S
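A small sketch contrasting the naive count encodings above with this precision/recall representation (the S = T = 0 case follows the slide's convention that the upper bound still holds):

```python
def naive_contrastive(S, T):
    """Unbounded encodings of a count-based feature, where S and T are
    counts of some linguistic unit in source and target."""
    return {"diff": S - T, "abs_diff": abs(S - T),
            "ratio": S / T if T else 0.0}

def precision_recall(S, T):
    """Bounded in [0, 1], no alignment needed; 0-0 is rewarded
    (scores 1.0) while 10-0 is penalized (scores 0.0)."""
    if S == 0 and T == 0:
        return 1.0, 1.0      # convention: upper bound holds for S, T = 0
    p = min(S, T) / T if T else 0.0
    r = min(S, T) / S if S else 0.0
    return p, r

print(precision_recall(0, 0))    # (1.0, 1.0)
print(precision_recall(10, 10))  # (1.0, 1.0) - 0-0 scores as well as 10-10
print(precision_recall(3, 5))    # (0.6, 1.0)
```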
S-T: (Pighin and Màrquez 2011): learn the expected projection of SRL from source to target
S-T: (Xiong et al. 2010)
◦ Target LM of words and POS tags, dangling words (link grammar parser), word posterior probabilities
S-T: (Bach et al. 2011)
◦ Sequences of words and POS tags, context, dependency structures, alignment info
Fine-grained, so they need a lot of training data: 72K sentences, 2.2M words and their manual correction (!)
Estimating post-editing effort: human scores (1-4): how much post-editing effort?
◦ 1: requires complete retranslation
◦ 2: a lot of post-editing needed, but quicker than retranslation
◦ 3: a little post-editing needed
◦ 4: fit for purpose

Estimating adequacy: human scores (1-4): to what degree does the translation convey the meaning of the original text?
◦ 1: completely inadequate
◦ 2: poorly adequate
◦ 3: fairly adequate
◦ 4: highly adequate
Machine learning algorithm: SVM for regression
Evaluation: Root Mean Square Error (RMSE)
RMSE = √( (1/N) · Σ_{j=1}^{N} (ŷ_j − y_j)² )
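A minimal sketch of this setup with scikit-learn and NumPy; the random placeholder data and the train/test split are purely illustrative:

```python
import numpy as np
from sklearn.svm import SVR

def rmse(y_true, y_pred):
    """RMSE = sqrt((1/N) * sum_j (y_hat_j - y_j)^2)"""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

# Placeholder data: one row of 96 shallow features per sentence,
# one 1-4 human score per sentence.
rng = np.random.default_rng(0)
X, y = rng.random((4000, 96)), rng.uniform(1, 4, 4000)

model = SVR(kernel="rbf")        # hyperparameters left at defaults here
model.fit(X[:3000], y[:3000])
print(rmse(y[3000:], model.predict(X[3000:])))
```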
English-Spanish Europarl data
◦ 4 SMT systems → 4 sets of 4,000 {source, translation, score} triples
Quality score: 1-4 post-editing effort
Features: 96 shallow versus 169 shallow + linguistic
Distribution of post-editing effort scores:

Score          MT1    MT2    MT3    MT4
1              4%     9%     10%    73%
2              25%    36%    39%    21%
3              54%    40%    43%    6%
4              17%    10%    9%     0%
Avg. quality   2.83   2.56   2.51   1.34
RMSE:

Languages   MT system   All features   No ling. features
en-es       MT1         0.600          0.574
en-es       MT2         0.682          0.671
en-es       MT3         0.671          0.654
en-es       MT4         0.541          0.534

Deviation of 17-22%
MT: The student still has claimed to take the exam at the end of the year - although she has not chosen course.
SRC: A estudante ainda tem pretensão de prestar vestibular no fim do ano – embora não tenha escolhido o curso
REF: The student still has the intention to take the exam at the end of the year – although she has not chosen the course.
Arabic-English Newswire data (GALE)
◦ 2 SMT systems (Rosetta team) → 2 sets of 2,585 {source, translation, score} triples
Quality score: 1-4 adequacy
Features: 82 shallow versus 122 shallow + linguistic
Distribution of adequacy scores:

Score          MT1    MT2
1              2%     2.3%
2              20%    23%
3              45%    46%
4              33%    28.7%
Avg. quality   3.11   3.00
RMSE:

Languages   MT system   All features   No ling. features
ar-en       MT1         0.762          0.771
ar-en       MT2         0.756          0.737

Deviation of 14-26%
Best performing:
◦ Length (words, content words, etc.): absolute numbers are better than proportions
◦ Language model / corpus frequency
◦ Ambiguity of source words
Shallow features are better than linguistic features:
◦ Except for one adequacy estimation system
Source/target features are better than contrastive features (shallow and linguistic):
◦ Absolute numbers are better than proportions
Issues:
◦ Feature representation
◦ Sparsity
◦ Need deeper features for adequacy estimation
◦ Annotation:
  1-4 post-editing effort: could be more objective
  1-4 adequacy: can we isolate adequacy from fluency?
◦ Language dependency
◦ Reliability of resources on low-quality translations
◦ Availability of resources
General vs specific errors
Bottom-up approach: word-based CE
◦ (Xiong et al. 2010): word posterior probability, dangling words (link grammar parser), target word & POS patterns
◦ (Bach et al. 2011): dependency relations, word and POS patterns, e.g. relating target words to patterns of POS tags in the source
◦ (Bach et al. 2011): best features are source-based
Top-down approach (ongoing work)
◦ Corpus-based analysis: generalize errors into categories
◦ Portuguese-English
◦ 150 sentences (2 domains, 2 MT systems)
◦ RBMT: more systematic errors

Linguistic indicator           Europarl MT1   News MT1   Europarl MT2   News MT2
Inflectional error             72             40         63             40
Incorrect voice                2              6          13             6
Mistranslated pronoun          61             40         63             35
Missing pronoun                34             13         23             7
Incorrect subject-verb order   6              10         12             9

• ~700 errors / 150 sentences
• 42 error categories: a few rules per category…
It is possible to estimate the quality of MT systems wrt post-editing needs using shallow, language- and system-independent features
Adequacy estimation is a harder problem
◦ Need more complex linguistic features…
Linguistic features are relevant:
◦ Directly useful for error detection (word-level CE)
◦ Directly useful for automatic post-editing
◦ But… for sentence-level CE:
  Issues with sparsity
  Issues with representation: length bias
Lucia [email protected]
Aziz, W., Rios, M. and Specia, L. 2011. Shallow Semantic Trees for SMT. WMT.
Denkowski, M. and Lavie, A. 2011. Meteor 1.3: Automatic Metric for Reliable Optimization and Evaluation of Machine Translation Systems. WMT.
Giménez, J. and Màrquez, L. 2010. Linguistic Measures for Automatic Machine Translation Evaluation. Machine Translation, Volume 24, Numbers 3-4.
Hardmeier, C. 2011. Improving Machine Translation Quality Prediction with Syntactic Tree Kernels. EAMT-2011.
Liu, D. and Gildea, D. 2010. Semantic role features for machine translation. 23rd Conference on Computational Linguistics.
Pado, S., Galley, M., Jurafsky, D., and Manning, C. 2009. Robust Machine Translation Evaluation with Entailment Features. ACL.
Pighin, D. and Màrquez, L. 2011. Automatic Projection of Semantic Structures: an Application to Pairwise Translation Ranking, SSST-5.
Tatsumi, M. and Roturier, J. 2010. Source Text Characteristics and Technical and Temporal Post-Editing Effort: What is Their Relationship? 2nd JEC Workshop, 43-51.
Wu, D. and Fung, P. 2009. Semantic roles for SMT: a hybrid two-pass model. HLT/NAACL.
Xiong, D., Zhang, M. and Li, H. 2010. Error Detection for SMT Using Linguistic Features. ACL-2010.
Best features (Pearson’s correlation) (S3 en-es):
[Bar chart: Pearson's correlation with human scores, y-axis from -0.6 to 1, for CE, Aborted nodes, SMT score, Ratio scores, LM target, LM source, Bi-phrase prob, TM, Sent length, BAD 117, BAD 76]
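Each bar in such a chart reduces to a single correlation computation; with SciPy, for toy numbers:

```python
from scipy.stats import pearsonr

human = [3, 2, 4, 1, 3, 2]               # toy 1-4 human scores
ce    = [2.7, 2.1, 3.5, 1.4, 3.0, 2.4]   # toy predicted CE scores
r, _ = pearsonr(ce, human)
print(f"Pearson r = {r:.3f}")
```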
Filtering out bad translations: 1-2 (S3 en-es)
◦ Average human scores in the top n translations:
[Line chart “Average scores x TOP N”: average human score (y-axis 2.5 to 3.7) of the top 100/200/300/500 translations as ranked by Human, CE, Aborted nodes, SMT score, Ratio scores, LM target, LM source, Bi-phrase prob, TM, Sent length, BAD 117, BAD 76]
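A sketch of how each curve is produced: rank sentences by one score, keep the top n, and average the human scores of those sentences (names are illustrative):

```python
def avg_human_in_top_n(scores, human, n):
    """Average human score of the n translations ranked best by
    `scores` (the CE prediction or any baseline feature)."""
    ranked = sorted(zip(scores, human), key=lambda x: x[0], reverse=True)
    top = [h for _, h in ranked[:n]]
    return sum(top) / len(top)

# e.g. for n in (100, 200, 300, 500):
#     print(n, avg_human_in_top_n(ce_scores, human_scores, n))
```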
QE x MT metrics: Pearson’s correlation (S3 en-es)
[Bar chart: Pearson's correlation, y-axis from -0.4 to 1, for BLEU-4, BLEU-2, NIST, TER, Meteor exact, Meteor porter and CE]
◦ QE score x human scores: Pearson's correlation across MT systems:

Test set   Training set   Pearson (QE, human)
S3 en-es   S1 en-es       0.478
S3 en-es   S2 en-es       0.517
S3 en-es   S3 en-es       0.542
S3 en-es   S4 en-es       0.423
S2 en-es   S1 en-es       0.531
S2 en-es   S2 en-es       0.562
S2 en-es   S3 en-es       0.547
S2 en-es   S4 en-es       0.442
SMT model global score and internal features:
◦ Distortion count, phrase probability, ...
◦ % of search nodes aborted, pruned, recombined, ...
◦ Language model using the n-best list as corpus
◦ Distance to the centre hypothesis in the n-best list
◦ Relative frequency of the words in the translation in the n-best list
◦ Ratio of the SMT model score of the top translation to the sum of the scores of all hypotheses in the n-best list, ...
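A sketch of two of these glass-box features, assuming the n-best list arrives as (hypothesis, model score) pairs with the best hypothesis first; exact score semantics vary by decoder:

```python
from collections import Counter

def nbest_features(nbest):
    """nbest: list of (hypothesis_string, model_score), best first."""
    scores = [score for _, score in nbest]
    total_score = sum(scores)
    # Ratio of the top hypothesis' model score to the sum over the list
    top_ratio = scores[0] / total_score if total_score else 0.0

    # Relative frequency of the top hypothesis' words in the n-best list,
    # treating the list itself as a small corpus
    counts = Counter(w for hyp, _ in nbest for w in hyp.split())
    corpus_size = sum(counts.values())
    top_words = nbest[0][0].split()
    rel_freq = (sum(counts[w] for w in top_words)
                / max(corpus_size * len(top_words), 1))
    return {"top_score_ratio": top_ratio, "nbest_word_freq": rel_freq}
```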
Best performing:
◦ Length (words, content words, etc.): absolute numbers are better than proportions
◦ Language model / corpus frequency
◦ Ambiguity of source words
Shallow features are better than linguistic features:
◦ Except for one adequacy estimation system
Source/target features are better than contrastive features (shallow and linguistic):
◦ Absolute numbers are better than proportions

Languages   MT system   All features   No ling. features   All features (abs.)
en-es       MT1         0.600          0.574               0.595
en-es       MT2         0.682          0.671               0.664
en-es       MT3         0.671          0.654               0.662
en-es       MT4         0.541          0.534               0.523