
Estimating Post-Editing Effort with Translation Quality Features

Oscar Sagemo

Uppsala University
Department of Linguistics and Philology
Språkteknologiprogrammet (Language Technology Programme)
Bachelor's Thesis in Language Technology

May 25, 2016

Supervisors: Sara Stymne, Nils-Erik Lindström


Abstract

The field of Quality Estimation aims to predict translation quality without reference translations, through machine learning. This thesis investigates using Quality Estimation (QE) to predict post-editing effort at the sentence level by focusing on the impact of features. Regression models are trained on two datasets emulating real-world scenarios, separated by the availability of post-edited training data. One dataset consists of English-German translations and post-editions, annotated with HTER; the other consists of Swedish-English translations and reference translations, annotated with TER.

A total of 16 features, including novel features measuring SMT reordering and English-German noun translation errors, are proposed and individually evaluated for the two scenarios. The best performing feature vectors for each scenario are both able to surpass a commonly used baseline set, with the biggest impact observed on the Swedish-English dataset.

The possibility of estimating post-editing effort without utilising post-edited data is explored. Through comparing human perceptions of post-editing effort with predictions made by a model trained without post-edited data, a correlation between the two is indicated.


Contents

Acknowledgements

1 Introduction
1.1 Purpose
1.2 Outline

2 Background
2.1 Quality Estimation
2.2 Conference on Statistical Machine Translation: Shared Task

3 Method
3.1 Framework
3.2 Baseline system
3.3 Feature extraction
3.3.1 Tools
3.4 Machine Learning
3.5 Evaluation
3.5.1 Post-editing effort
3.6 Data
3.6.1 WMT
3.6.2 SQE
3.6.3 Pre-processing

4 Proposed features
4.1 Reordering measures
4.2 Grammatical correspondence
4.3 Structural integrity
4.4 English-German Noun Translation Errors

5 Results and discussion
5.1 Preliminary experiments
5.1.1 Machine Learning
5.1.2 Parameterising ratio features
5.2 Feature performance
5.3 WMT Scenario
5.3.1 Shared task results
5.4 SQE Scenario
5.4.1 Human annotation

6 Conclusion and future work

A Features

Bibliography


Acknowledgements


1 Introduction

Machine Translation (MT) usage is becoming more widespread every day, with readily available online systems for fast look-ups as well as increasingly convenient frameworks allowing users to train their own engine with nothing more than a set of translations. Despite the surge in popularity, MT has not yet been able to consistently deliver fully automatic high-quality translations. A human in the loop is therefore needed to either cognitively process the output in order to make sense of it, or manually correct the output in order to reach a publishable quality. Therefore, rather than competing with manual translations as supplied by language service providers, MT has been adopted as a beneficial tool in the arsenal of professional translators.

In the translation industry, Computer-Aided Translation (CAT) tools have grown to be the standard practice, where translators routinely make use of Translation Memories (TM) to store past translations and present matching entries when found in new input text, effectively speeding up the workflow. Whenever a matching entry is found, it is commonly presented to the translator accompanied by a match score, conveying its correspondence with the input.

Incorporating MT into this framework proved intuitive: an efficient approach was to complement TM matches with MT suggestions, leaving the translator with the task of post-editing the translation. Whether or not post-editing MT takes less effort than translating from scratch depends on the quality of the translation, which at present must be assessed by the translator.

Assessing Machine Translation (MT) quality is a challenging task, whether attempted by human or machine. The underlying problem posing this challenge is the multivalence of natural language: there is no single correct translation output for any given input. Determining the quality of a translation is therefore commonly reduced to subjective human judgement based on predetermined metrics like fluency, adequacy or ranking, or to calculating a score from an automatic comparison with a provided reference translation, using metrics like BLEU (Papineni et al., 2002) or NIST (Doddington, 2002). The primary goal of this type of evaluation is to compare an MT system with earlier versions of itself or with other similar systems; coupled with the fact that providing human evaluators or reference translations requires resources, it is thus mainly done by developers of MT systems.

As machine translation usage has seen a vast increase in popularity while quality still varies, the need for quality assessment grows for the end user, who normally has very limited resources, thus posing the problem of assessing quality without reference translations. Quality Estimation aims to solve this as a prediction task, where feature vectors of translation quality indicators (features) are associated with quality labels through machine learning, in order to estimate quality scores of machine translations at run-time. Different applications for this estimated score include determining if a translation is sufficient for gisting, ranking several translations in order to select the best one, or deciding whether or not a sentence is worth post-editing. Ideally, the quality label is tailored to the intended application.

This thesis focuses on the post-editing application of quality estimation, where most modern approaches set the task as a regression problem: attempting to accurately predict an edit distance-based quality label by representing translations with feature vectors. The performance of such approaches relies on determining and extracting features that correlate strongly with the proposed quality label.

1.1 Purpose

This thesis investigates sentence-level quality estimation for post-editing machine translations by exploring the impact of a range of features aiming to convey translation quality. Both pre-existing and original features will be investigated on two different datasets representing different real-world scenarios where quality estimation for post-editing could be utilised, separated by language pairs and the availability of post-edited data.

The first scenario, with post-edited data (henceforth, WMT), is based on data provided by the Conference (previously Workshop) on Statistical Machine Translation for the shared task in QE, to which a submission will be made. The second scenario, without post-edited data (henceforth, SQE), is based on data provided by Semantix, a real language service provider in the early stages of employing post-editing MT as a translation method.

Through feature-focused experiments, I hope to make contributions by presenting novel features measuring reordering and language-specific noun translation errors as well as by investigating the possibility of estimating post-editing effort without using post-edited data.

1.2 Outline

The remainder of this thesis is structured as follows: Chapter 2 presents the relevant previous research in Quality Estimation and introduces the Conference on Statistical Machine Translation. Chapter 3 describes the framework adopted to explore feature impact and introduces and motivates the learning methods used as well as the data. Chapter 4 presents and motivates the features proposed for the two scenarios, both as adopted from previous quality estimation experiments and as originally crafted for this thesis. The results of the features as adapted for both scenarios are displayed and discussed in Chapter 5. Lastly, conclusions and suggestions for future work are presented in Chapter 6.


2 Background

In this chapter, the requisite background for this thesis is presented, as well as an introduction to the Conference on Statistical Machine Translation.

2.1 Quality Estimation

Early work in Quality Estimation (QE) was originally focused on assessing the confidence in any MT output for a given MT system. The benefits of estimating Confidence Measures for Speech Recognition had been studied extensively (Jiang, 2005), but such measures were not as common in other NLP fields. Motivated by the possibility of reference-free MT evaluation and the usefulness of confidence measures in Speech Recognition, Blatz et al. (2004) conducted the first comprehensive experimental study of possible methods for Confidence Estimation (CE) for Machine Translation. They explored different techniques of associating feature vectors representing translations with quality labels based on automatic MT evaluation metrics at the sentence and word levels.

A total of 91 features were proposed for the sentence-level CE experiments, a large portion of which were derived from information extracted from the MT system used to produce the translations; such features later came to be referred to as confidence features. They set the objective as a binary classification problem of correct/incorrect translations, obtained by thresholding the scores from the machine learning (ML) algorithms Naive Bayes and multi-layer perceptron. The response variables, or quality labels, proposed were NIST (Doddington, 2002) and a modified version of WER, described and motivated in Blatz et al. (2003).

This methodology was expanded by Quirk (2004), who used a similar set of features to train a variety of classifiers on manually annotated data and found that models trained on small human-tagged datasets outperformed those trained on large automatically tagged datasets.

Specia et al. (2009) explored the use of Partial Least Squares to estimate continuous scores as opposed to binary classes, over both automatically and manually tagged datasets. Furthermore, they proposed a wide array of new relevant features, separated by whether they require system-dependent information, as well as a method to identify relevant features in a systematic way. The term Quality Estimation then arose as a means to incorporate system-independent (black-box) and system-dependent (glass-box) features under the same name (Felice, 2012).

Specia and Farzindar (2010) set the task of using QE to filter out translations unfit for post-editing and explored the usage of Translation Edit Rate (TER) and Human-targeted Translation Edit Rate (HTER) (Snover et al., 2006) as quality labels. They were able to obtain good performance by computing HTER over a small set of machine translations and their post-editions.


TER and HTER (Snover et al., 2006) are error metrics that aim to calculate the amount of editing a human would need to perform to correct a machine translation for it to match a reference translation. They are derived from the same formula, as presented in Equation 1, but differ in the type of reference translation used: HTER uses a post-edited version of the machine translation, thus being a favorable metric for measuring post-editing effort, while TER uses an arbitrary reference translation, thus depicting general MT quality.

\text{TER} = \frac{\#\ \text{of edits}}{\text{average}\ \#\ \text{of reference words}} \qquad (1)

where possible edits include insertions, deletions, substitutions and shifts, and all operations have equal cost.
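As an illustration, the following is a minimal sketch of such an edit rate in Python. It omits the block-shift operation of full TER for brevity and assumes a single reference translation, so the denominator is simply the reference length:

def edit_distance(hyp, ref):
    """Word-level Levenshtein distance: insertions, deletions and
    substitutions, all with cost 1 (shifts of full TER omitted)."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[m][n]

def ter(hypothesis, reference):
    """Number of edits over number of reference words."""
    hyp, ref = hypothesis.split(), reference.split()
    return edit_distance(hyp, ref) / len(ref)

print(ter("Klicken Sie Alles reparieren .",
          'Klicken Sie auf " Alles reparieren . "'))
# 0.375: three insertions over eight reference words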

2.2 Conference on Statistical Machine Translation: Shared Task

The Conference on Statistical Machine Translation (WMT) is a yearly event, initially held in conjunction with the annual meeting of the Association for Computational Linguistics (ACL).1 Prior to each workshop, shared tasks in different fields of SMT are conducted with the end goal of advancing the field. Tasks in quality estimation have been included since 2012 (Callison-Burch et al., 2012), and the shared tasks have been the main forum for quality estimation development ever since.

The shared tasks in QE are commonly divided into sub-tasks, focusing on various units of translation. The different units considered for the 2016 shared task are documents, words, phrases and sentences. For the purpose of this thesis, the focus is on the sentence-level task, to which a submission was made based on the insights and experiments presented in this thesis.

The quality label proposed for this year's task is HTER, which sets the focus on predicting the post-editing effort needed to correct a translation. Quality labels from past years include post-editing time and different types of human annotations.

1 http://www.aclweb.org/


3 Method

This chapter describes the method used to explore the impact of features for the two separate scenarios, by presenting a baseline, describing and motivating the metrics chosen for the feature extraction and machine learning implementations, as well as presenting the data used.

3.1 Framework

The first scenario considered (WMT) emulates QE as approached with post-edited data at hand, by utilising the translations provided for the shared task, annotated with HTER scores. Participation in the WMT shared task involves submitting predictions produced by one final system, thus posing a performance-focused methodology. In a feature-driven approach, this translates to finding and extracting features that correlate strongly with the proposed quality label, as established by performance metrics. This approach was transferred to the second scenario, QE as approached without previous post-edited data (SQE), with the added step of constructing a dataset of translations annotated with TER.

A total of 16 features aiming to convey translation quality were proposed and tested for inclusion, of which two were specifically designed for the English-German translation direction and thus only tested on the WMT dataset. The features are described and motivated in Chapter 4.

Initial tests consisted of measuring each feature's individual performance in combination with a baseline set, in order to sort out the features with an overall negative impact on each dataset. The features with a positive impact were then concatenated and measured, one by one, to form the feature vector resulting in the best performance.

3.2 Baseline system

For the WMT shared task, in order to establish a common ground for measurement, a baseline system trained with 17 features is provided for the participants. These 17 baseline features (b17) are used as the foundation for all feature sets in this thesis, and performance is measured in relation to the baseline system.

The same features have been used for all five years of shared tasks in QE; the organisers note:

“... although the system is referred to as “baseline”, it is in fact a strong system. It has proved robust across a range of language pairs, MT systems, and text domains” (Bojar et al., 2015, p. 14)


The features quantify the complexity of the source sentence and the fluency of the target sentence, by utilising corpus frequencies, LM probabilities and token counts. A full list of these features is included in Appendix A.

3.3 Feature extraction

All instances in the training data are converted to feature vectors, by computing the various measurements as specified in the feature sets and storing them in a predefined order.

For instance, let the following three features define a feature set:

• number of tokens in the source sentence

• average source token length

• LM log probability of target sentence

The features would then be extracted from a translation and stored in a feature vector in that order. Example 2 shows an English-German translation from the WMT16 dataset and its feature representation as defined by the feature set above.

(2) Click Repair All . → Klicken Sie auf " Alles reparieren . "

[4.0, 3.75, -10.551]
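A minimal sketch of this extraction step is shown below. The LM score is stubbed out: in practice it comes from a trained target-side language model, so target_lm_logprob is a hypothetical placeholder returning the value from Example 2:

def target_lm_logprob(target_tokens):
    # Placeholder: a real extractor would query a trained n-gram LM here.
    return -10.551

def extract_features(source, target):
    src = source.split()
    return [
        float(len(src)),                      # number of source tokens
        sum(len(t) for t in src) / len(src),  # average source token length
        target_lm_logprob(target.split()),    # target LM log probability
    ]

src = "Click Repair All ."
tgt = 'Klicken Sie auf " Alles reparieren . "'
print(extract_features(src, tgt))  # [4.0, 3.75, -10.551]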

In order to apply a machine learning model to a new data instance, it must first be converted to the same feature vector representation used to train the model. Therefore, it is important to be able to extract features consistently and automatically. To this end, the open-source QuEst++ (Specia et al., 2015) toolkit is employed in this paper. The toolkit incorporates feature extraction and machine learning and provides both pre-defined feature extraction modules and interfaces allowing convenient implementation of new modules.

3.3.1 Tools

A majority of the proposed features require different linguistic analyses of the source and target segments. To this end, several well-known NLP tools were utilised and merged with the QuEst++ pipeline, through wrappers where possible. The following tools were used to extract the features:

• A modified version of the QuEst++ framework, with processors and features added and modified where needed, used to extract a majority of the features.

• Fast Align (Dyer et al., 2013) was used to generate word alignment files.

• Berkeley Parser (Petrov et al., 2006), trained with the included grammars for English and German and the Talbanken part of SUC (Nivre and Megyesi, 2007), to extract the PCFG and phrase structure-based features.

• SRILM (Stolcke, 2002), to train a Part-Of-Speech (POS) Language Model (LM) over the training dataset as well as to compute all LM-based segment probabilities and perplexities.

11

Page 12: Estimating Post-Editing Effort with Translation Quality ... › exarb › arch › sagemo2016.pdf · The field of Quality Estimation aims to predict translation quality without reference

• TreeTagger (Schmid, 1994), trained with the included models for English and German, and HunPos (Halácsy et al., 2007), trained on SUC (Megyesi, 2009), to obtain all POS-related features.

3.4 Machine Learning

The most prevalent algorithm in regression-based QE in the literature is Support Vector Machine (SVM) regression, which has also been employed for the baseline systems of the shared tasks. After brief initial experiments (see Section 5.1.1), it was adopted as the main ML algorithm used in this paper, as implemented by LibSVM (Chang and Lin, 2011) in QuEst++. A Radial Basis Function (RBF) kernel is used, as well as grid search optimisation of the cost, epsilon and gamma parameters from a 5-fold cross-validation of the training set.
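A sketch of an equivalent setup, using the scikit-learn SVR (itself backed by LibSVM) rather than QuEst++; the parameter grid and data here are illustrative stand-ins, not the values used in the thesis:

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

X_train = np.random.rand(200, 17)  # stand-in for baseline feature vectors
y_train = np.random.rand(200)      # stand-in for HTER/TER quality labels

grid = GridSearchCV(
    SVR(kernel="rbf"),                      # RBF kernel, as in the thesis
    param_grid={
        "C": [1, 10, 100],                  # cost
        "epsilon": [0.01, 0.1, 0.5],
        "gamma": [0.001, 0.01, 0.1],
    },
    cv=5,                                   # 5-fold cross-validation
    scoring="neg_mean_absolute_error",
)
grid.fit(X_train, y_train)
predictions = grid.best_estimator_.predict(X_train)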

3.5 Evaluation

As per QuEst++ methodology, ML performance was measured in terms of Mean Absolute Error (MAE) and Root Mean Square Error (RMSE), which are defined in Eqs. 3 and 4, where $\hat{y}_i$ are the predicted values and $y_i$ are the gold-standard values.

\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} |\hat{y}_i - y_i| \qquad (3)

\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2} \qquad (4)

Critique of evaluation metrics. The evaluation of systems submitted to the WMT shared tasks for the first four years (2012-2015) relied on measuring the MAE and RMSE as described in Section 3.5. In response to this, Graham (2015) conducted an analysis of the measures and pointed out possibilities of wrongly perceived performance gains from optimising predictions with respect to the error measures, such as minimising the variance of the prediction score distribution. As an alternative, she proposed using the unit-free Pearson correlation r, which has a tradition in assessing MT evaluation metrics.

The metric measures the linear association between two variables, which for QE purposes are the predicted and gold-standard scores, and is defined as seen in Eq. 5, where $\hat{y}_i$ are the predicted values and $y_i$ are the gold-standard values.

r = \frac{\sum_{i=1}^{N}(\hat{y}_i - \bar{\hat{y}})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N}(\hat{y}_i - \bar{\hat{y}})^2}\,\sqrt{\sum_{i=1}^{N}(y_i - \bar{y})^2}} \qquad (5)

This year's shared task used Pearson correlation as the main evaluation metric for the first time, followed by MAE and RMSE.
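All three metrics are straightforward to compute; the following is a direct NumPy transcription of Eqs. 3-5 with toy values:

import numpy as np

def mae(pred, gold):
    return np.mean(np.abs(pred - gold))

def rmse(pred, gold):
    return np.sqrt(np.mean((pred - gold) ** 2))

def pearson_r(pred, gold):
    p, g = pred - pred.mean(), gold - gold.mean()
    return np.sum(p * g) / (np.sqrt(np.sum(p ** 2)) * np.sqrt(np.sum(g ** 2)))

pred = np.array([0.20, 0.45, 0.10, 0.80])  # predicted scores
gold = np.array([0.25, 0.40, 0.05, 0.90])  # gold-standard scores
print(mae(pred, gold), rmse(pred, gold), pearson_r(pred, gold))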

3.5.1 Post-editing effort

The performance, as measured by the evaluation metrics, has different implications for the two separate scenarios, as the quality labels differ.


In their comparative study, Snover et al. (2006) show that HTER as an evaluation metric has a high correlation with human judgement, and that TER, while giving an over-estimate of edit rates, correlates reasonably well with both human judgement and HTER.

As HTER uses post-edited reference translations, it is reasonable to assume that the score correlates well with post-editing effort. However, TER scores use arbitrary reference translations, which results in a lower correlation with post-editing effort. Therefore, a well-predicted TER score conveys general translation quality more than post-editing effort.

In order to validate the TER predictions in terms of post-editing effort, a professional post-editor was tasked with annotating a part of the test set with a quality score indicating the amount of editing needed to correct each segment. The quality score is defined as a scale from 1 to 5, in accordance with the proposed quality labels of the 2012 shared task in QE (Callison-Burch et al., 2012). Koponen (2012) also used the same scale in her study comparing human perceptions of post-editing with post-editing operations. The scores were defined as follows:

• 1: The MT output is incomprehensible, with little or no information transferred accurately. It cannot be edited and needs to be translated from scratch.

• 2: 50-70% of the MT output needs to be edited. It requires a significant editing effort in order to reach publishable level.

• 3: 25-50% of the MT output needs to be edited. It contains different errors and mistranslations that need to be corrected.

• 4: 10-25% of the MT output needs to be edited. It is generally clear and intelligible.

• 5: The MT output is perfectly clear and intelligible. It is not necessarily a perfect translation, but requires little to no editing.

The annotations were then treated as classes and compared with a sorted list of the predicted TER scores by utilising a common Information Retrieval metric, Average Precision (AP) (Zhu, 2004). It is defined as shown in Equation 6, where k is the rank in the sequence of ordered TER scores, P(k) is the precision of the sublist at k, and rel(k) is a function returning 1 if the class at rank k matches the relevant score and 0 otherwise.

By measuring the average precision, inferences can be made about the correlation between the model's predictions and post-editing effort. The most interesting cases are the average precision for the top and bottom classes, as they convey prediction performance where it matters most.

Furthermore, the standard deviation and average score were computed over each class in order to investigate the distribution of predicted TER scores in relation to post-editing effort.

Lastly, the same metrics were applied to the gold-standard TER scores, in order to test the overall correlation between TER scores and post-editing effort.

\text{AP} = \frac{\sum_{k=1}^{n} P(k)\,\text{rel}(k)}{\text{number of instances of the class}} \qquad (6)
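A sketch of this per-class AP computation follows. The ranking direction is an assumption on my part (ascending TER for high-quality classes, descending for low-quality ones), as the text does not spell it out:

def average_precision(scores, classes, target_class, ascending=True):
    """AP of target_class over items ranked by predicted TER score.
    The final hit count equals the number of instances of the class,
    matching the denominator of Eq. 6."""
    ranked = sorted(zip(scores, classes), key=lambda x: x[0],
                    reverse=not ascending)
    hits, precision_sum = 0, 0.0
    for k, (_, cls) in enumerate(ranked, start=1):
        if cls == target_class:          # rel(k) = 1
            hits += 1
            precision_sum += hits / k    # P(k) at this rank
    return precision_sum / hits if hits else 0.0

pred_ter = [22.0, 35.5, 80.1, 41.0, 95.3, 30.2]
annot    = [5,    4,    2,    4,    1,    5   ]
print(average_precision(pred_ter, annot, target_class=5, ascending=True))
print(average_precision(pred_ter, annot, target_class=1, ascending=False))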


3.6 Data

3.6.1 WMT

The dataset used for the WMT scenario is provided by the organisers of the 2016 WMT shared task in Quality Estimation.

The organisers provided two different datasets, divided between the different sub-tasks. For the purpose of this thesis, only the dataset intended as input data for the sentence-level task was utilised. The dataset spans a total of 15,000 English-German translations from the IT domain, provided by unspecified industry partners. Each entry consists of a source segment, its machine translation, a post-edition of the translation and an edit distance score (HTER) derived from the post-edited version. The dataset was split into separate training and development sets, with gold-standard scores, and a test set without them, as shown in Table 3.1.

The translations were produced by a single in-house MT system, regarded as a “black-box” system since no system-dependent information was provided. These translations were then post-edited by professional translators. To capture the post-editing effort, HTER scores were computed between each MT translation and the corresponding post-edited version using the TER toolkit (Snover et al., 2009).

In addition to the translations, participants were also provided with language models, word-based translation models as well as raw n-gram frequency counts, collected from an IT domain-specific English-German parallel corpus which itself was not provided. These complementary files were provided to aid in extracting the baseline features (see Appendix A).

3.6.2 SQE

The dataset for the Semantix QE scenario (SQE) consists of 28,398 Swedish-English translations from the public sector domain, provided by Semantix. Each entry consists of a source segment, its machine translation and an edit distance score (TER) derived from a reference translation. The dataset is split into separate training and testing sets, as shown in Table 3.1.

The translations were produced by a domain-specific MT system trained with in-house human translations using Microsoft Translator Hub.1 No system-dependent information is accessible through the Microsoft Translator Hub framework.

Dataset     Segments
WMT-train   12,000
WMT-dev      1,000
WMT-test     2,000
SQE-train   21,000
SQE-test     7,100

Table 3.1: Size and division of the datasets

1 https://hub.microsofttranslator.com


Constructing a dataset

The data used to construct the SQE dataset consists of reversed English-to-Swedish human translations, since all available in-domain Swedish-English translations had been used to train the MT system, making them unsuitable candidates. After collecting the translations, the Swedish source side was re-translated with the in-house MT system, resulting in a set of Swedish source segments, their English machine translations and English reference translations. A TER score was computed between the machine-translated and reference segments, using the TER toolkit, to be used as quality labels.

Ideally, reference translations used for MT evaluation should consist of human translations. Studies have shown (Lembersky et al., 2012; Kurokawa et al., 2009) that the translation direction of training data for SMT systems impacts performance, due to unique characteristics of translated language. It is plausible to assume that reversing the translation direction also has an impact when used as training data for QE systems, as the reference translations are consequently derived from source texts and vice versa.

A better solution could have been to re-train the MT system after removing a part of the training data to be used as data for the QE system. However, this was not attempted due to time constraints.

In addition to the translations, I also used language models, word-based translation models and n-gram frequency counts, computed from the Swedish-English MT training data.

3.6.3 Pre-processing

As the SQE dataset consisted of unfiltered human-translated segments of varying lengths, the following cleaning steps were taken (a sketch of these filters follows the list):

• Deleted segments with two or fewer source or target words

• Deleted segments with more than 80 source or target words

• Deleted segments containing links or phone numbers

• Deleted segments consisting of >= 50% numbers

• Deleted segments containing markup tags

• Swapped any series of two or more whitespace characters with one

• Randomised the order of the segments and split into 75% training and 25% testing
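A sketch of these filters, assuming illustrative regular expressions for links, phone numbers and markup (the exact patterns used are not specified in the thesis):

import re
import random

LINK_OR_PHONE = re.compile(r"https?://|www\.|\+?\d[\d \-]{6,}\d")
MARKUP = re.compile(r"<[^>]+>")

def keep(src, tgt):
    """Apply the deletion criteria to both sides of a segment pair."""
    for seg in (src, tgt):
        words = seg.split()
        if len(words) <= 2 or len(words) > 80:
            return False
        if LINK_OR_PHONE.search(seg) or MARKUP.search(seg):
            return False
        if sum(w.isdigit() for w in words) / len(words) >= 0.5:
            return False
    return True

def clean(pairs):
    """Filter, normalise whitespace, shuffle and split 75/25."""
    cleaned = [(re.sub(r"\s{2,}", " ", s), re.sub(r"\s{2,}", " ", t))
               for s, t in pairs if keep(s, t)]
    random.shuffle(cleaned)
    split = int(0.75 * len(cleaned))
    return cleaned[:split], cleaned[split:]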

Additionally, before feature extraction, both datasets were tokenised using the tokenisation scripts included in the open-source SMT toolkit Moses (Koehn et al., 2007). The same toolkit was also used to truecase the SQE dataset, returning the initial words in each segment to their most probable casing, as this improved the performance of the baseline system. Truecasing was not applied to the WMT dataset, as the truecasing script utilises a corpus to compute case probabilities, and no corpora matching the domain-specific content were provided.


4 Proposed features

This chapter lists and motivates the features used in this thesis, separated by the category of information conveyed.

A total of 14 features were proposed for both scenarios; they were selected to capture sources and results of difficulties for SMT systems, by quantifying reordering measures, grammatical correspondence and structural integrity. Additionally, two language-specific features were proposed for the WMT scenario, quantifying noun translation errors from English to German.

4.1 Reordering measures

Reordering is problematic for MT in general, and especially so when the placement of verbs differs between languages. English, German and Swedish all follow an SVO pattern in simple sentences, but differ in verb placement in e.g. subordinate clauses and questions.

Three metrics that measure the amount of reordering done by the MT system were explored, to investigate a correlation between SMT reordering and quality labels. All metrics are based on alignments between individual words.

• Crossing score: the number of crossings in alignments between source and target

• Kendall Tau distance between alignments in source and target

• Squared Kendall Tau distance between alignments in source and target

The crossing score was suggested by Genzel (2010) for SMT reordering, and Tau was suggested by Birch and Osborne (2011) for use in a standard metric with a reference translation. To my knowledge, this thesis presents the first usage of these measures for quality estimation.

The features are computed by counting crossing link pairs in a word alignment file, where the number of crossing links considers crossings of all lengths. The Squared Kendall Tau Distance (SKTD) is defined as shown in Eq. 7.

\text{SKTD} = 1 - \sqrt{\frac{|\text{crossing link pairs}|}{|\text{link pairs}|}} \qquad (7)
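As an illustration, the following sketch computes the three measures over a word alignment represented as (source index, target index) links. Reading the unsquared Tau as the SKTD formula without the square root is my assumption, not stated in the text:

from itertools import combinations
from math import sqrt

def crossing_pairs(links):
    """Count link pairs that cross: ordered one way on the source side
    but the opposite way on the target side."""
    return sum((s1 - s2) * (t1 - t2) < 0
               for (s1, t1), (s2, t2) in combinations(links, 2))

def reordering_features(links):
    pairs = len(links) * (len(links) - 1) // 2   # total link pairs
    c = crossing_pairs(links)
    tau = 1 - c / pairs                          # assumed Tau reading
    sktd = 1 - sqrt(c / pairs)                   # Eq. 7
    return c, tau, sktd

# "er hat das Buch gelesen" vs "he has read the book": verb moved
links = [(0, 0), (1, 1), (2, 3), (3, 4), (4, 2)]
print(reordering_features(links))  # (2, 0.8, ~0.553)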

The amount of reordering as measured by these features can suffice to indicate irregularities in reordering through the learning methods. However, since they simply rely on counting crossings in 1-1 alignments, they could introduce noise. Furthermore, all the reordering measures only capture the difference in word order in a language-independent way. For a specific language pair like English–German or Swedish–English, it would be useful to be able to measure known word order divergences like verb placement, through more carefully designed and targeted measures. A better solution could be to adapt the feature to fit the expected reordering for specific translation directions and to quantify it based on infringements of word-order expectations.

4.2 Grammatical correspondence

Features measuring the relationship between different constituents in source and target are useful for measuring translation adequacy, i.e. whether or not certain elements of structure and meaning were conveyed in the translation.

Several features quantifying grammatical discrepancy are explored, mainly measured in terms of occurrences of syntactic phrases or POS tags.

• Ratio of percentage of verb phrases between source and target

• Ratio of percentage of noun phrases between source and target

• Ratio of percentage of nouns between source and target

• Ratio of percentage of pronouns between source and target

• Ratio of percentage of verbs between source and target

• Ratio of percentage of tokens consisting of alphabetic symbols between source and target

The relationship of token types, e.g. parts of speech, is commonly parameterised as the ratio of percentage (Specia et al., 2011; Felice, 2012), which normalises token counts by sentence length. However, normalising syntactic phrases by the total number of phrases is not as intuitive, as syntactic constructions vary between languages and phrase structure rules vary between different PCFGs. Therefore, different means of parameterising the relationship between syntactic constituents were briefly explored; these are presented in Section 5.1.2.

4.3 Structural integrity

Measuring the structural integrity of the source segment is intended to convey translatability, based on the assumption that ill-formed sentences are more difficult to translate. The structural integrity of the translated target segment conveys output fluency, i.e. how well-formed and fluent the sentence is in the target language.

Features measuring well-formedness as conveyed by syntactic parse trees were explored for both source and target. Additionally, POS language models were utilised for the target segment.

• Source PCFG average confidence of all possible parses in the parser n-best list

• Source PCFG log probability

• Target PCFG log probability

17

Page 18: Estimating Post-Editing Effort with Translation Quality ... › exarb › arch › sagemo2016.pdf · The field of Quality Estimation aims to predict translation quality without reference

• LM log perplexity of POS of the target

• LM log probability of POS of the target

Avramidis et al. (2011) proposed utilising PCFG parse probabilities and confidences as features, and POS language models showed promising results in the work of Tezcan et al. (2015). Small n-best list sizes (1-3) were used for the confidence feature due to difficulties in obtaining more parse trees for several of the input segments in the WMT dataset. The POS language models were trained with an order of 4, over the target side of the training data for each scenario separately.
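The POSLM features themselves were extracted with SRILM; as a self-contained stand-in, a count-based 4-gram POS model with add-one smoothing can compute the same kind of log probability and perplexity (the smoothing here is illustrative, not SRILM's):

import math
from collections import Counter

class PosLM:
    def __init__(self, tag_sequences, order=4):
        self.order = order
        self.ngrams, self.contexts = Counter(), Counter()
        self.vocab = set()
        for tags in tag_sequences:
            padded = ["<s>"] * (order - 1) + tags + ["</s>"]
            self.vocab.update(padded)
            for i in range(order - 1, len(padded)):
                gram = tuple(padded[i - order + 1:i + 1])
                self.ngrams[gram] += 1
                self.contexts[gram[:-1]] += 1

    def logprob(self, tags):
        """Add-one smoothed log probability of a POS-tag sequence."""
        padded = ["<s>"] * (self.order - 1) + tags + ["</s>"]
        total = 0.0
        for i in range(self.order - 1, len(padded)):
            gram = tuple(padded[i - self.order + 1:i + 1])
            p = ((self.ngrams[gram] + 1)
                 / (self.contexts[gram[:-1]] + len(self.vocab)))
            total += math.log(p)
        return total

    def perplexity(self, tags):
        return math.exp(-self.logprob(tags) / (len(tags) + 1))

lm = PosLM([["DT", "NN", "VBZ", "JJ"], ["DT", "JJ", "NN", "VBD"]])
print(lm.logprob(["DT", "NN", "VBD"]), lm.perplexity(["DT", "NN", "VBD"]))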

4.4 English-German Noun Translation Errors

Capturing common translation errors is intended as a direct measure of the results of MT difficulties. However, such features need to be defined with individual language pairs in mind, and are therefore expensive to craft.

Two novel features attempting to capture these errors in the English-German direction are explored for the WMT dataset.

• Ratio of Noun groups between source and target

• Ratio of Genitive constructions between source and target

In previous work on English–German SMT (Stymne et al., 2012), it is noted that the translation of noun compounds is problematic. English compounds, which are written as separate words, are commonly rendered as separate words or genitive constructions in German, instead of the idiomatic compound. Compounds tend to be common in technical domains, such as IT.

Since split compound nouns are a common translation error in German machine translations, a feature looking for sequences of nouns in the target text was implemented. The feature looks for noun groups in both source and target and is computed as the ratio of noun groups, where the noun group count is defined as the number of occurrences of sequences of two or more nouns.

Another common compound translation is the genitive construction, which can be over-produced in German. A feature that looks for possible genitive constructions in source and target was designed; it is computed as the ratio of genitive constructions, defined as follows:

German: any noun or proper noun preceded by a noun and the genitive article des/der.

English: any noun or proper noun preceded by a noun and the possessive clitic 's or the possessive preposition of.

Note that these patterns could also match other constructions, since “of” can have other uses and “der” is also used for the masculine nominative and feminine dative.
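A sketch of the two counts over POS-tagged segments follows; the tag names and patterns are simplified approximations of the TreeTagger/HunPos tagsets, not the exact implementation:

NOUN_TAGS = {"NN", "NNS", "NE", "NP"}   # noun and proper-noun tags (simplified)

def noun_groups(tagged):
    """Count occurrences of sequences of two or more consecutive nouns."""
    groups, run = 0, 0
    for _, tag in tagged:
        run = run + 1 if tag in NOUN_TAGS else 0
        if run == 2:                     # count each maximal run once
            groups += 1
    return groups

def genitive_constructions_de(tagged):
    """(Proper) noun preceded by a noun and the genitive article des/der."""
    return sum(
        tagged[i][1] in NOUN_TAGS
        and tagged[i + 1][0].lower() in {"des", "der"}
        and tagged[i + 2][1] in NOUN_TAGS
        for i in range(len(tagged) - 2)
    )

def ratio(source_count, target_count):
    return source_count / target_count if target_count else 0.0

source = [("baseball", "NN"), ("game", "NN")]   # English multi-noun chain
target = [("Baseball", "NN"), ("Spiel", "NN")]  # erroneously split compound
print(ratio(noun_groups(source), noun_groups(target)))  # 1.0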


5 Results and discussion

This chapter presents the results of the preliminary experiments regarding ML algorithms and ratio parameterisation, the impact of the proposed features for both scenarios separately, as well as additional evaluation for both scenarios.

5.1 Preliminary experiments

These brief experiments were carried out in order to make initial decisions on which to base all further experiments.

5.1.1 Machine Learning

The Orange toolkit (Demšar et al., 2004) was used to compare 6 ML algorithms for the baseline features on the development set from the WMT scenario; the results are presented in Table 5.1. The algorithms were chosen based on the implementations available in the Orange toolkit.

ML Algorithm                                  MAE     RMSE
SVM Regression                                13.942  19.814
Random Forest (RF) Regression                 15.527  20.159
Univariate Regression                         14.089  19.324
Stochastic Gradient Descent (SGD) Regression  22.012  29.876
Regression Tree                               20.485  27.028
Linear Regression                             14.089  19.324

Table 5.1: A comparison of the baseline performance between 6 ML algorithms

The SGD regression and regression tree algorithms performed considerably worse than the other four algorithms. Linear and univariate regression, while not commonly employed for tasks of this type, showed surprisingly good performance in this brief experiment. While RF regression has been seen in past QE studies (Rubino et al., 2012), its performance difference from SVM regression was deemed significant enough to not experiment with it further. Based on these results, coupled with the fact that it was the only algorithm implemented in the QuEst++ toolkit, SVM regression was chosen as the ML algorithm for the feature tests.

5.1.2 Parameterising ratio features

Three different means of quantifying the same relationship between constituents in source and target were explored when implementing the verb phrase ratio feature.

19

Page 20: Estimating Post-Editing Effort with Translation Quality ... › exarb › arch › sagemo2016.pdf · The field of Quality Estimation aims to predict translation quality without reference

They are defined as shown below, where $VP_{side}$ is the number of verb phrases in the respective side and $P_{side}$ is the total number of phrases.

Absolute difference: $|VP_{source} - VP_{target}|$

Ratio: $VP_{source} / VP_{target}$

Ratio of percentage: $(VP_{source} / P_{source}) / (VP_{target} / P_{target})$
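Written out as code, with hypothetical verb phrase and total phrase counts:

def absolute_difference(vp_s, vp_t):
    return abs(vp_s - vp_t)

def ratio(vp_s, vp_t):
    return vp_s / vp_t if vp_t else 0.0

def ratio_of_percentage(vp_s, p_s, vp_t, p_t):
    # Guard against empty counts, which can occur on short segments.
    return (vp_s / p_s) / (vp_t / p_t) if p_s and p_t and vp_t else 0.0

# e.g. 3 of 8 source phrases are VPs, 2 of 9 target phrases
print(absolute_difference(3, 2),       # 1
      ratio(3, 2),                     # 1.5
      ratio_of_percentage(3, 8, 2, 9)) # 1.6875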

To test the performance of the different measures, each measure was implemented as a feature and concatenated with the baseline features for the WMT system; the results are shown in Table 5.2.

Measure              MAE     RMSE
Absolute difference  13.864  19.553
Ratio                13.842  19.527
Ratio of percentage  13.834  19.515

Table 5.2: Performance in terms of MAE and RMSE for the different ratio implementations in the WMT scenario

The results point towards the ratio of percentage, which went against the initial intuition that normalising with the total number of phrases would introduce noise, due to the non-linear relationship in phrase constructions between the languages involved. This metric was also applied to the phrase ratios in the second scenario.

5.2 Feature performance

The results of the initial tests measuring the features' individual performance in combination with a baseline set are presented in Table 5.3 in terms of MAE and RMSE, and a comparison chart of the impact on both scenarios is shown in Figure 5.1. The impact is defined as the MAE difference in relation to the corresponding baseline, normalised by total MAE.

The features with an overall negative impact on each scenario were excluded from the feature combination tests, while the features with a positive impact were concatenated and measured, one by one, to form the feature vector resulting in the best performance for each scenario.

20

Page 21: Estimating Post-Editing Effort with Translation Quality ... › exarb › arch › sagemo2016.pdf · The field of Quality Estimation aims to predict translation quality without reference

Figure 5.1: A comparison of the normalised impact in MAE of the 14 language-independent features, as well as the WMT-specific noun group ratio for comparison.

The individual feature impact varied considerably between the two scenarios; however, some similarities can be observed: the noun, pronoun and noun phrase ratios had a comparable negative impact on both scenarios, while the verb ratio had a similar positive impact. This suggests that inserted or deleted verbs have a higher correlation with edit operations than nouns and pronouns. It is worth noting that both Swedish and German have compound nouns where English has multi-noun chains, e.g. “baseballmatch” and “Baseballspiel” vis-à-vis “baseball game”. This might affect the performance of the noun ratio feature as well as explain the positive impact of the noun group ratio feature, which was constructed for this very reason.

Furthermore, the source PCFG average confidence in a 3-best list had a negative impact on both scenarios, while the source and target PCFG probabilities had a positive impact; however, the impact difference between the latter features is rather significant despite both being positive.

The three reordering measures all showed different relations between the scenarios, which is surprising as they are all based on the same number of crossings. The Kendall Tau distance (Tau) was the only reordering measure with a positive impact on the WMT scenario, while both the squared version (SKTD) and Tau had a positive impact on the SQE scenario, with the Tau feature being one of the best performing features overall.

Another noteworthy observation is that no feature with a negative impact on the SQE scenario performed well on the WMT scenario, while there were many cases the other way around. The biggest differences observed are for the verb phrase ratio and target POSLM features, all having a significantly more positive impact on the Swedish-English SQE scenario.

21

Page 22: Estimating Post-Editing Effort with Translation Quality ... › exarb › arch › sagemo2016.pdf · The field of Quality Estimation aims to predict translation quality without reference

Feature                                              WMT MAE  WMT RMSE  SQE MAE  SQE RMSE
Baseline (b17)                                       13.826   19.507    20.751   27.230
b17 + Crossings                                      13.834   19.480    20.789   27.350
b17 + Tau                                            13.801   19.460    20.616   27.112
b17 + SKTD                                           13.836   19.468    20.718   27.206
b17 + verb phrase ratio                              13.834   19.515    20.607   27.100
b17 + ratio of noun phrases                          13.846   19.523    20.786   27.146
b17 + noun ratio                                     13.842   19.466    20.787   27.217
b17 + pronoun ratio                                  13.827   19.510    20.776   27.137
b17 + verb ratio                                     13.799   19.604    20.685   27.134
b17 + a-z token ratio                                13.848   19.488    20.693   27.174
b17 + Source PCFG average confidence in 3-best list  13.859   19.551    20.792   27.493
b17 + POSLM target log perplexity                    13.859   19.465    20.641   27.108
b17 + POSLM target log probability                   13.851   19.522    20.613   27.138
b17 + Source tree PCFG                               13.812   19.515    20.682   27.176
b17 + Target tree PCFG                               13.819   19.534    20.654   27.096
b17 + Noun Group Ratio                               13.759   19.503    NA       NA
b17 + Genitive constructions                         13.840   19.539    NA       NA

Table 5.3: Performance in terms of MAE and RMSE for all individual features

5.3 WMT Scenario

A majority of the proposed features proved to have a negative impact on the performance metrics in individual testing, leaving only 5 of 16 features with a positive impact:

• Noun group ratio

• Kendall Tau distance of alignments

• Source PCFG log probability

• Target PCFG log probability

• Ratio of percentage of verbs

The surprisingly small number of positive features may be a result of a disagreement between the proposed features and the data. The features mainly rely on linguistic analyses, while the data, being exclusively from the IT domain, is inherently irregular. POSLM and syntactic phrase features appear to be particularly unreliable, which may be due to the nature of the domain, where series of constituents of uncommon character are frequent, as demonstrated in the English-German example from the WMT dataset below:

Choose File > Save As , and choose Photoshop DCS 1.0 or Photoshop DCS 2.0 from the Format menu .

↓

Wählen Sie " Bearbeiten " " Voreinstellungen " ( Windows ) bzw. " Bridge CS4 " > " Voreinstellungen " ( Mac OS ) und klicken Sie auf " Miniaturen . "


This appears to affect syntactic parsers trained on out-of-domain PCFGs, as the parser often had difficulties generating more than 3 trees per sentence, and while the probabilities of the parse trees for both source and target slightly increased the performance of the model, they had a much higher impact on the SQE scenario. The performance difference between the POSLM-based features is even higher, even though the language models were trained on in-domain data, suggesting that the text domain may affect feature performance.

Of the novel features proposed in this thesis, the noun group ratio and Kendall Tau distance showed promising results both individually and in combination with the other features; the noun group ratio feature had the highest impact of all the proposed features.

Furthermore, of all the features with an individual positive impact on MAE, only the noun group ratio and Tau perform well on RMSE. This carries over to the performance when combined as well. The impact of the combined features is presented in Table 5.4, with the addition of the Pearson correlation metric, as motivated in Section 3.5.

Feature combinations  MAE     RMSE    r
baseline              13.826  19.507  0.381
+ Source PCFG         13.812  19.515  0.382
+ Target PCFG         13.805  19.560  0.383
+ Verb ratio          13.795  19.627  0.383
+ Tau                 13.757  19.522  0.384
+ Noun Group Ratio    13.723  19.552  0.386

Table 5.4: Performance in terms of MAE, RMSE and Pearson's r for the combined features resulting in the best performing feature set for WMT

Based on the results of the feature combinations, MAE seems to have a higher correspondence with the Pearson correlation than RMSE, as there is a linear relationship between the decrease in MAE and the increase in r, despite the slight increases in RMSE.

5.3.1 Shared task results

A submission to the 2016 shared task in sentence-level QE was made based on the best performing feature set, presented in Table 5.4. The submission surpassed the baseline and ranked 9 of 13, with a Pearson correlation of 0.363 on the test set, which is separate from the one used for the evaluation performed in this thesis (see Table 3.1).

Only the Pearson correlation is known at the time of writing, as the organisers are experiencing some issues with the MAE and RMSE metrics.

5.4 SQE Scenario

The following features had an individual positive impact on the performance metrics for the SQE scenario:

• Source PCFG log probability

23

Page 24: Estimating Post-Editing Effort with Translation Quality ... › exarb › arch › sagemo2016.pdf · The field of Quality Estimation aims to predict translation quality without reference

• Target PCFG log probability

• Kendall Tau distance of alignments

• Ratio of percentage of verbs

• Target POSLM perplexity

• Target POSLM log probability

• Ratio of tokens consisting of alphabetic symbols (a-z)1 between source and target

• Squared Kendall Tau Distance of alignments

• Ratio of percentage of verb phrases in source and target

The individual performance tests suggest that the proposed features were better suited to the SQE scenario, as 9 of 14 performed well and the well-performing features had a significantly larger average impact than the well-performing features for the WMT scenario.

The structural integrity of source and target as measured by the PCFG and POSLM features showed promising results, except for the PCFG average confidence feature, which may be hindered by the small list size of 3 for the number of trees considered, as imposed by the WMT scenario. Considering that texts from the public sector domain need to be comprehensible to a large number of people, it seems reasonable that violations of conventional structure carry a heavier weight. This furthers the hypothesis of text domain influence on feature performance.

Feature combination  MAE     RMSE    r
baseline             20.751  27.230  0.501
+ source pcfg        20.682  27.176  0.503
+ target pcfg        20.581  27.055  0.510
+ Tau                20.446  26.917  0.518
+ verb ratio         20.403  26.841  0.521
+ poslm prob         20.301  26.744  0.526
+ poslm perp         20.291  26.714  0.527
+ a-z ratio          20.230  26.629  0.531
+ sktd               20.212  26.635  0.531
+ vp ratio           20.221  26.638  0.531

Table 5.5: Performance in terms of MAE, RMSE and Pearson's r for the combined features resulting in the best performing feature set for SQE

As with the WMT combinations, there is a linear relationship between the decrease in MAE and the increase in r, with performance gains for each feature addition. However, there is an exception with the addition of the last feature, the verb phrase ratio. This is surprising, as the feature showed a strong positive impact when tested individually, and individual impact differences otherwise more or less correspond to the differences in performance when combined. It is possible that the decrease in performance is a result of a large number of features, causing noise. However, there appears to be no correlation between performance gain and number of features based on the previous additions.

1 The feature is denoted as a-z to conserve space, but it also includes the Swedish letters 'å', 'ä' and 'ö'.

Another interesting observation is that the Pearson correlation is significantly higher for the SQE scenario than for the WMT scenario, while the MAE and RMSE reflect the opposite. A possible explanation for the high error metrics is a high variation in the distribution of quality scores. In her study of evaluation metrics for QE, Graham (2015) shows that a low variance in quality scores results in lower MAE and RMSE scores, and vice versa, as these metrics are based on the absolute differences between all predictions and gold-standard labels.

To briefly test this hypothesis, the standard deviation was computed for the gold-standard quality scores in the test data from both scenarios:

SQE     WMT
20.628  31.424

Indeed, the variation is slightly higher in the distribution of scores for the WMT test set, which could affect the reliability of the error metrics, but the difference is not large enough to draw any conclusions.

5.4.1 Human annotation

In order to validate the predictions for the SQE scenario, they were compared with effort scores provided by one professional post-editor, as motivated and described in Section 3.5.1. Ideally, in order to avoid biased results, annotation is performed by more than one annotator and the score is averaged over all the annotators. Furthermore, only 240 test segments, about 3.4% of the test set, were annotated, and the class distribution was imbalanced. Due to this, the insights provided by the comparisons are not conclusive on their own, but they may provide some level of evaluation.

The average TER for each annotated class is shown in Table 5.6; both the predicted TER values and the gold-standard values are presented, along with the standard deviation. The average precision for the best class (5), the top two classes, as well as the two worst classes, is shown in Table 5.7. The performance for class 1 alone was excluded due to its low number of occurrences.

Class  Count  Gold-standard Avg. TER (Std-dev)  Predicted Avg. TER (Std-dev)
5      101    40.205 (23.166)                   47.834 (10.769)
4       95    51.307 (22.151)                   50.273 (8.773)
3       30    64.390 (28.753)                   47.854 (7.759)
2       11    93.673 (17.076)                   66.570 (9.236)
1        3    119.667 (15.035)                  72.545 (2.800)

Table 5.6: The average TER for each occurrence of the annotated classes from the predicted and gold-standard scores

The difference in standard deviation and average scores for the lower classes is due to an overall higher deviation in the distribution of gold-standard scores.

The average scores for each class indicate the relationship between TER scores and post-editing effort as perceived by the annotator. There is a linear relationship between higher perceived post-editing effort and TER score for the gold-standard distribution. The predicted values show a similar trend, but with a lower TER average for the third class.

Based on this small sample size, the results are somewhat encouraging. The gold-standard distribution indicates that a well-predicted TER score does in fact correlate with post-editing effort. Furthermore, as the two worst classes (1, 2) only constitute 6% of the annotated classes, it is noteworthy that the TER scores matched so well. This trend is observable in the average precision as well, as the average precision for the predicted values reaches 59%, despite the small number of occurrences.

The high precision for the combined top two classes is misleading, as together they constitute 81% of the annotated classes.

Classes  Gold standard  Predicted
5        63%            53%
5,4      93%            86%
1,2      47%            59%

Table 5.7: The average precision for the top and bottom annotated classes


6 Conclusion and future work

This thesis has investigated the impact of 16 features for two separate scenarios of quality estimation for post-editing. Among these, novel features modeling noun translation errors and SMT reordering were proposed. A majority of the proposed features (11/16) had a negative impact on the WMT scenario, while a majority (9/14) had a positive impact on the SQE scenario. The reason for this discrepancy in feature performance is believed to lie in the textual domain, as many features rely on linguistic analyses, which are here found to perform poorly on the IT-domain based WMT scenario. The relationship between text domain and QE features remains an interesting research topic.

Of the novel features proposed, the noun group ratio and Kendall Tau distance features showed particularly promising results. In the future, I would like to investigate an expanded set of translation errors, as well as adapt the concept of reordering measures as features to the expected reordering in specific translation directions.
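To illustrate the reordering idea, here is a minimal sketch of a Kendall tau distance over a word alignment permutation, under the simplifying assumption of a 1-1 alignment (cf. Birch and Osborne, 2011); it is not the exact feature implementation used in this thesis:

from itertools import combinations

def kendall_tau_distance(target_positions):
    # target_positions[i] is the target position that source word i is
    # aligned to; the distance is the fraction of word pairs translated
    # out of order (0 = monotone, 1 = completely reversed).
    pairs = list(combinations(range(len(target_positions)), 2))
    if not pairs:
        return 0.0
    discordant = sum(1 for i, j in pairs
                     if target_positions[i] > target_positions[j])
    return discordant / len(pairs)

print(kendall_tau_distance([0, 1, 2, 3]))   # 0.0: no reordering
print(kendall_tau_distance([2, 0, 3, 1]))   # 0.5: moderate reordering
print(kendall_tau_distance([3, 2, 1, 0]))   # 1.0: fully reversed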

The possibility of estimating post-editing effort without using post-edited data was explored by comparing predicted TER scores with human annotations. Due to the small number of annotated segments and an imbalanced annotation distribution, the results were inconclusive, but they indicated a correlation between the predicted scores and perceived post-editing effort.


A Features

17 Baseline Features

• number of tokens in the source sentence

• number of tokens in the target sentence

• average source token length

• LM probability of source sentence

• LM probability of target sentence

• number of occurrences of the target word within the target hypothesis (averaged for all words in the hypothesis - type/token ratio)

• average number of translations per source word in the sentence (as given by IBM 1 table thresholded such that prob(t|s) > 0.2)

• average number of translations per source word in the sentence (as given by IBM 1 table thresholded such that prob(t|s) > 0.01) weighted by the inverse frequency of each word in the source corpus

• percentage of unigrams in quartile 1 of frequency (lower frequency words) in a corpus of the source language (SMT training corpus)

• percentage of unigrams in quartile 4 of frequency (higher frequency words) in a corpus of the source language

• percentage of bigrams in quartile 1 of frequency of source words in a corpus of the source language

• percentage of bigrams in quartile 4 of frequency of source words in a corpus of the source language

• percentage of trigrams in quartile 1 of frequency of source words in a corpus of the source language

• percentage of trigrams in quartile 4 of frequency of source words in a corpus of the source language

• percentage of unigrams in the source sentence seen in a corpus (SMT training corpus)

• number of punctuation marks in the source sentence

• number of punctuation marks in the target sentence
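As a minimal illustration of the surface-level features in the list above, the sketch below computes a handful of them for a whitespace-tokenized sentence pair (assuming non-empty sentences); the LM, IBM 1 and corpus-frequency features would additionally require a language model, a lexical translation table and corpus statistics, and are omitted here:

import string

def surface_features(source, target):
    # Assumes both sentences are non-empty and whitespace-tokenized.
    src_toks, tgt_toks = source.split(), target.split()
    return {
        "src_token_count": len(src_toks),
        "tgt_token_count": len(tgt_toks),
        "avg_src_token_length": sum(map(len, src_toks)) / len(src_toks),
        # average occurrences of each target word in the hypothesis
        # (tokens divided by types)
        "tgt_avg_occurrences": len(tgt_toks) / len(set(tgt_toks)),
        "src_punctuation": sum(ch in string.punctuation for ch in source),
        "tgt_punctuation": sum(ch in string.punctuation for ch in target),
    }

print(surface_features("Det här är en mening .", "This is a sentence ."))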


Bibliography

Eleftherios Avramidis, Maja Popovic, David Vilar, and Aljoscha Burchardt. Evaluate with confidence estimation: Machine ranking of translation outputs using grammatical features. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 65–70. Association for Computational Linguistics, 2011.

Alexandra Birch and Miles Osborne. Reordering metrics for MT. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pages 1027–1035. Association for Computational Linguistics, 2011.

John Blatz, Erin Fitzgerald, George Foster, Simona Gandrabur, Cyril Goutte, Alex Kulesza, Alberto Sanchis, and Nicola Ueffing. Confidence estimation for machine translation. Final report, JHU/CLSP Summer Workshop, 2003.

John Blatz, Erin Fitzgerald, George Foster, Simona Gandrabur, Cyril Goutte, Alex Kulesza, Alberto Sanchis, and Nicola Ueffing. Confidence estimation for machine translation. In Proceedings of the 20th International Conference on Computational Linguistics, page 315. Association for Computational Linguistics, 2004.

Ondrej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Carolina Scarton, Lucia Specia, and Marco Turchi. Findings of the 2015 workshop on statistical machine translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 1–46, Lisbon, Portugal, September 2015. Association for Computational Linguistics. URL http://aclweb.org/anthology/W15-3001.

Chris Callison-Burch, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia, editors. Proceedings of the Seventh Workshop on Statistical Machine Translation. Association for Computational Linguistics, Montréal, Canada, June 2012. URL http://www.aclweb.org/anthology/W12-31.

Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.

Janez Demšar, Blaž Zupan, Gregor Leban, and Tomaz Curk. Orange: From experimental machine learning to interactive data mining. Springer, 2004.

George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research, pages 138–145. Morgan Kaufmann Publishers Inc., 2002.


Chris Dyer, Victor Chahuneau, and Noah A. Smith. A simple, fast, and effective reparameterization of IBM Model 2. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–649. Association for Computational Linguistics, 2013.

Mariano Felice. Linguistic indicators for quality estimation of machine translations. Universitat Autònoma de Barcelona & University of Wolverhampton, 2012.

Dmitriy Genzel. Automatically learning source-side reordering rules for large scale machine translation. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 376–384. Association for Computational Linguistics, 2010.

Yvette Graham. Improving evaluation of machine translation quality estimation. In 53rd Annual Meeting of the Association for Computational Linguistics and Seventh International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, pages 1804–1813, 2015.

Péter Halácsy, András Kornai, and Csaba Oravecz. HunPos: An open source trigram tagger. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 209–212. Association for Computational Linguistics, 2007.

Hui Jiang. Confidence measures for speech recognition: A survey. Speech Communication, 45(4):455–470, 2005.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics, 2007.

Maarit Koponen. Comparing human perceptions of post-editing effort with post-editing operations. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 181–190. Association for Computational Linguistics, 2012.

David Kurokawa, Cyril Goutte, and Pierre Isabelle. Automatic detection of translated text and its impact on machine translation. In Proceedings of MT Summit XII, the Twelfth Machine Translation Summit, International Association for Machine Translation hosted by the Association for Machine Translation in the Americas, 2009.

Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Adapting translation models to translationese improves SMT. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, EACL '12, pages 255–265, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics. ISBN 978-1-937284-19-0. URL http://dl.acm.org/citation.cfm?id=2380816.2380850.

Beata Megyesi. The open source tagger HunPos for Swedish. In Proceedings of the 17th Nordic Conference on Computational Linguistics (NODALIDA), 2009.


Joakim Nivre and Beata Megyesi. Bootstrapping a Swedish treebank using cross-corpus harmonization and annotation projection. In Proceedings of the 6th International Workshop on Treebanks and Linguistic Theories, pages 97–102. Citeseer, 2007.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics, 2002.

Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. Learning accurate, compact, and interpretable tree annotation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 433–440. Association for Computational Linguistics, 2006.

Christopher Quirk. Training a sentence-level machine translation confidence measure. In LREC. Citeseer, 2004.

Raphael Rubino, Jennifer Foster, Joachim Wagner, Johann Roturier, Rasul Samad Zadeh Kaljahi, and Fred Hollowood. DCU-Symantec submission for the WMT 2012 quality estimation task. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 138–144. Association for Computational Linguistics, 2012.

Helmut Schmid. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing, pages 44–49, Manchester, UK, 1994.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, pages 223–231, 2006.

Matthew G. Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. TER-Plus: Paraphrase, semantic, and alignment enhancements to translation edit rate. Machine Translation, 23(2-3):117–127, 2009.

Lucia Specia and Atefeh Farzindar. Estimating machine translation post-editing effort with HTER. In Proceedings of the Second Joint EM+/CNGL Workshop Bringing MT to the User: Research on Integrating MT in the Translation Industry (JEC 10), pages 33–41, 2010.

Lucia Specia, Marco Turchi, Nicola Cancedda, Marc Dymetman, and Nello Cristianini. Estimating the sentence-level quality of machine translation systems. In 13th Conference of the European Association for Machine Translation, pages 28–37, 2009.

Lucia Specia, Najeh Hajlaoui, Catalina Hallett, and Wilker Aziz. Predicting machine translation adequacy. In Machine Translation Summit, volume 13, pages 19–23, 2011.


Lucia Specia, Gustavo Paetzold, and Carolina Scarton. Multi-level translation quality prediction with QuEst++. In Proceedings of ACL-IJCNLP 2015 System Demonstrations, pages 115–120, Beijing, China, July 2015. Association for Computational Linguistics and The Asian Federation of Natural Language Processing. URL http://www.aclweb.org/anthology/P15-4020.

Andreas Stolcke. SRILM - an extensible language modeling toolkit. In Proceedings of the Seventh International Conference on Spoken Language Processing, Denver, Colorado, USA, 2002.

Sara Stymne, Nicola Cancedda, and Lars Ahrenberg. Generation of compound words for statistical machine translation into compounding languages. Submitted manuscript, 2012.

Arda Tezcan, Veronique Hoste, Bart Desmet, and Lieve Macken. UGENT-LT3 SCATE system for machine translation quality estimation. In Tenth Workshop on Statistical Machine Translation, 2015.

Mu Zhu. Recall, precision and average precision. Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, 2, 2004.
