results of the wmt20 metrics shared tasktems participating in the wmt parallel corpus fil-tering...

38
Proceedings of the 5th Conference on Machine Translation (WMT), pages 688–725 Online, November 19–20, 2020. c 2020 Association for Computational Linguistics 688 Results of the WMT20 Metrics Shared Task Nitika Mathur The University of Melbourne [email protected] Johnny Tian-Zheng Wei University of Southern California, [email protected] Markus Freitag Google Research [email protected] Qingsong Ma Tencent-CSIG, AI Evaluation Lab [email protected] Ondˇ rej Bojar Charles University, MFF ´ UFAL [email protected] Abstract This paper presents the results of the WMT20 Metrics Shared Task. Participants were asked to score the outputs of the translation systems competing in the WMT20 News Translation Task with automatic metrics. Ten research groups submitted 27 metrics, four of which are reference-less “metrics”. In addition, we computed five baseline metrics, includ- ing SENTBLEU, BLEU, TER and CHRF us- ing the SacreBLEU scorer. All metrics were evaluated on how well they correlate at the system-, document- and segment-level with the WMT20 official human scores. We present an extensive analysis on influence of reference translations on metric reliability, how well automatic metrics score human trans- lations, and we also flag major discrepancies between metric and human scores when eval- uating MT systems. Finally, we investigate whether we can use automatic metrics to flag incorrect human ratings. 1 Introduction The metrics shared task 1 has been a key component of WMT since 2008, serving as a way to validate the use of automatic MT evaluation metrics and drive the development of new metrics. We evaluate automatic metrics that score MT out- put by comparing them with a reference translation generated by human translators, who are instructed to translate “from scratch”, without post-editing from MT. In addition, following last year’s collab- oration with the WMT Quality Estimation (QE) task, we also invited submissions of reference-free metrics that compare MT outputs directly with the source segment. Similar to the last year’s editions, the source, ref- erence texts, and MT system outputs for the metric 1 http://www.statmt.org/wmt20/ metrics-task.html task come from the News Translation Task (Bar- rault et al., 2020, which we denote as Findings 2020). This year, the language pairs were English Chinese, Czech, German, Inuktitut, Japanese, Polish, Russian and Tamil. We further included sys- tems participating in the WMT parallel corpus fil- tering task (Koehn et al., 2020): Khmer and Pashto to English. 2 All metrics are evaluated based on their agree- ment with human evaluation. We evaluate met- rics at three levels: comparing MT systems on the entire testset, segments (either sentences or short paragraphs), and new this year, documents. We introduce document-level evaluation to incen- tivize the development of metrics that are take into account broader context of evaluated sentences or paragraphs, following the recent emergence of document-level MT techniques. Multiple References This year, we have two in- dependently generated references for English German, English Russian, and Chinese En- glish. This lets us investigate the influence of ref- erences and the utility of multiple references. We instructed participants to score MT systems against the references individually as well as with all avail- able references. In addition, we also supplied a set of references for English to German, that were gen- erated by asking linguists to paraphrase the WMT reference as much as possible (Freitag et al., 2020). These references are designed to minimise transla- tionese in the reference which could lead to metrics to be biased against systems that generate more natural text. 2 Note that the metrics task inputs also included MT sys- tems translating between German French in the News Translation Task, and English Khmer and Pashto from the WMT parallel corpus filtering task. We are unable to eval- uate metrics on these language pairs as human evaluation is not available

Upload: others

Post on 18-Jan-2021

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Results of the WMT20 Metrics Shared Tasktems participating in the WMT parallel corpus fil-tering task (Koehn et al.,2020): Khmer and Pashto to English.2 All metrics are evaluated

Proceedings of the 5th Conference on Machine Translation (WMT), pages 688–725Online, November 19–20, 2020. c©2020 Association for Computational Linguistics

688

Results of the WMT20 Metrics Shared Task

Nitika MathurThe University of Melbourne

[email protected]

Johnny Tian-Zheng WeiUniversity of Southern California,

[email protected]

Markus FreitagGoogle Research

[email protected]

Qingsong MaTencent-CSIG,

AI Evaluation [email protected]

Ondrej BojarCharles University,

MFF [email protected]

Abstract

This paper presents the results of the WMT20Metrics Shared Task. Participants were askedto score the outputs of the translation systemscompeting in the WMT20 News TranslationTask with automatic metrics. Ten researchgroups submitted 27 metrics, four of whichare reference-less “metrics”. In addition,we computed five baseline metrics, includ-ing SENTBLEU, BLEU, TER and CHRF us-ing the SacreBLEU scorer. All metrics wereevaluated on how well they correlate at thesystem-, document- and segment-level withthe WMT20 official human scores.

We present an extensive analysis on influenceof reference translations on metric reliability,how well automatic metrics score human trans-lations, and we also flag major discrepanciesbetween metric and human scores when eval-uating MT systems. Finally, we investigatewhether we can use automatic metrics to flagincorrect human ratings.

1 Introduction

The metrics shared task1 has been a key componentof WMT since 2008, serving as a way to validatethe use of automatic MT evaluation metrics anddrive the development of new metrics.

We evaluate automatic metrics that score MT out-put by comparing them with a reference translationgenerated by human translators, who are instructedto translate “from scratch”, without post-editingfrom MT. In addition, following last year’s collab-oration with the WMT Quality Estimation (QE)task, we also invited submissions of reference-freemetrics that compare MT outputs directly with thesource segment.

Similar to the last year’s editions, the source, ref-erence texts, and MT system outputs for the metric

1http://www.statmt.org/wmt20/metrics-task.html

task come from the News Translation Task (Bar-rault et al., 2020, which we denote as Findings2020). This year, the language pairs were English↔ Chinese, Czech, German, Inuktitut, Japanese,Polish, Russian and Tamil. We further included sys-tems participating in the WMT parallel corpus fil-tering task (Koehn et al., 2020): Khmer and Pashtoto English.2

All metrics are evaluated based on their agree-ment with human evaluation. We evaluate met-rics at three levels: comparing MT systems onthe entire testset, segments (either sentences orshort paragraphs), and new this year, documents.We introduce document-level evaluation to incen-tivize the development of metrics that are take intoaccount broader context of evaluated sentencesor paragraphs, following the recent emergence ofdocument-level MT techniques.

Multiple References This year, we have two in-dependently generated references for English ↔German, English↔ Russian, and Chinese→ En-glish. This lets us investigate the influence of ref-erences and the utility of multiple references. Weinstructed participants to score MT systems againstthe references individually as well as with all avail-able references. In addition, we also supplied a setof references for English to German, that were gen-erated by asking linguists to paraphrase the WMTreference as much as possible (Freitag et al., 2020).These references are designed to minimise transla-tionese in the reference which could lead to metricsto be biased against systems that generate morenatural text.

2Note that the metrics task inputs also included MT sys-tems translating between German ↔ French in the NewsTranslation Task, and English → Khmer and Pashto fromthe WMT parallel corpus filtering task. We are unable to eval-uate metrics on these language pairs as human evaluation isnot available

Page 2: Results of the WMT20 Metrics Shared Tasktems participating in the WMT parallel corpus fil-tering task (Koehn et al.,2020): Khmer and Pashto to English.2 All metrics are evaluated

689

Evaluating Human Translations Given that wehave multiple human translations, we asked partici-pants to evaluate each human translation using theother as a reference. For these language-pairs, atleast one of these human translations was includedin the human evaluation, so we can directly evalu-ate metrics on how they rank the human translationcompared to the MT systems.

Additional Human Evaluation Finally, wepose the question if some of the discrepancies be-tween metrics and human scores can be explainedby bad human ratings. We rerun some of the humanevaluations by using the same template, but switch-ing the rater pool from non-experts to professionallinguists. In particular, we rerun human evalua-tion for a subset of translations where all metricsdisagree with the WMT human evaluation. This ex-periment could reveal a new use case of automaticmetrics and indicate that automatic metrics can beused to identify bad ratings in human evaluations.

We first give an overview of the task (Sec-tion 2) and summarize the baseline (Section 3.1)and submitted (Section 3.2) metrics. The resultsfor system-, segment-, and document-level evalua-tion are provided in Sections 4, followed by a jointdiscussion Section 5. Section 6 describes our re-running of human evaluation with linguists beforewe summarise our findings in Section 7.

We will release data, code and additionalvisualisations in the metrics package to bemade available at http://www.statmt.org/

wmt20/results.html

2 Task Setup

This year, we provided task participants with onetest set for each examined language pair, i.e. aset of source texts (which are commonly ignoredby MT metrics), corresponding MT outputs (theseare the key inputs to be scored) and one or morereference translations.

In the system-level, metrics aim to correlate witha system’s score which is an average over manyhuman judgments of segment translation qualityproduced by the given system. In the segment-level, metrics aim to produce scores that correlatebest with a human ranking judgment of two out-put translations for a given source segment. Andfinally, we also trial document-level evaluation thisyear. (more on the manual quality assessment inSection 2.3).

Segments are sentences for all language pairsexcept English↔ German and Czech, and for En-glish → Chinese, which do not contain sentenceboundaries and are translated and evaluated at theparagraph-level.

Participants were free to choose which languagepairs and tracks (system/segment/document andreference-based/reference-free) they wanted to takepart in.

2.1 Source and Reference Texts

The source and reference texts we use are mainlysourced from this year’s WMT News TranslationTask (see Findings 2020).

The test set typically contains somewhere be-tween 1000 and 2000 segments for each translationdirection, with fewer segments for some paragraph-segmented test sets, and the English↔ Inuktitutdirections contain 2971 sentences.

All test sets are from the news domain, ex-cept the English↔ Inuktitut datasets which havea mix of in-domain text from Canadian Parlia-ment Hansards (1566 sentences) and out-of-domainnews documents (1405 sentences).

We also have systems from the parallel corpusfiltering task which are from the Wikipedia domain(also labelled newstest2020 in the metrics test set).The Khmer→ English and Pashto→ English con-tain 2320 and 2719 sentences respectively.

The reference translations provided in new-stest2020 were created in the same direction asthe MT systems were translating. The exceptionsare English ↔ Inuktitut, Khmer → English andPashto→ English, where the testset is a mixture of“source-original” and “target-original” texts.

2.2 System Outputs

The results of the Metrics Task are affected by theactual set of MT systems participating in a giventranslation direction. On one hand, if all systemsare very close in their translation quality, then evenhumans will struggle to rank them. This in turnwill make the task for MT metrics very hard. Onthe other hand, if the task includes a wide range ofsystems of varying quality, correlating with humansshould be generally easier. One can also expectthat if the evaluated systems are of different types,they will exhibit different error patterns and variousMT metrics can be differently sensitive to thesepatterns.

Page 3: Results of the WMT20 Metrics Shared Tasktems participating in the WMT parallel corpus fil-tering task (Koehn et al.,2020): Khmer and Pashto to English.2 All metrics are evaluated

690

• Parallel Corpus Filtering Task. This taskrequired participants to submit scores for eachsentence in the provided noisy parallel texts.These scores were used to subsample sentencepairs, which was then used to train a neuralmachine translation system (fairseq). Thiswas tested on a held-out subset of Wikipediatranslations.

• Regular News Tasks Systems. These areall the other MT systems in the evaluation;differing in whether they are trained only onWMT provided data (“Constrained”, or “Un-constrained”) as in the previous years.

With all language pairs, in addition to the sub-missions to the task, the test sets also include trans-lations from freely available web services (onlineMT systems), which are deemed unconstrained.

Overall, the results are based on 208 systemsacross 18 language pairs.

2.3 Manual Quality Assessment

Human scores were obtained using Direct Assess-ment, where annotators are asked to rate the ad-equacy of a translation compared to either thesource segment or a reference translation of thesame source. This year, human data was collectedfrom reference-based evaluations (or “monolin-gual”) and reference-free evaluations (or “bilin-gual”). The reference-based (monolingual) evalua-tions were crowdsourced, while the reference-less(bilingual) evaluations were mainly from MT re-searchers who committed their time to contributeto the manual evaluation for each submitted systemto the translation task.

Finally, following reports that MT system trans-lations might seem adequate when scored in isola-tion but not in context of the whole document, whenpossible, the ratings are collected for each segmentwith document context. Table 1 summarises thedetails of how human annotations were collectedfor various language-pairs at WMT 2020.

The English→ Inuktitut dataset, which containsa mix of in-domain (Hansard) and out-of-domain(news) data, was only evaluated on out-of-domainsegments, so for system level evaluation, we eval-uate metric scores computed on the news domainonly as well as the full test set.

See Findings 2020 for details on human evalua-tion.

2.3.1 System-level Golden Truth: DAFor the system-level evaluation, the collected con-tinuous DA scores, standardized for each annotator,are averaged across all assessed segments for eachMT system to produce a scalar rating for the sys-tem’s performance.

The underlying set of assessed segments is dif-ferent for each system. Thanks to the fact that thesystem-level DA score is an average over manyjudgments, mean scores are consistent and havebeen found to be reproducible (Graham et al.,2013). For more details see Findings 2020.

The score of an MT system is calculated as theaverage rating of the segments translated by thesystem.

2.3.2 Segment-level Golden Truth: DARRStarting from Bojar et al. (2017), when WMTfully switched to DA, we had to come up witha solid golden standard for segment-level judge-ments. Standard DA scores are reliable only whenaveraged over sufficient number of judgments.3

Fortunately, when we have at least two DAscores for translations of the same source input,it is possible to convert those DA scores into a rel-ative ranking judgement, if the difference in DAscores allows conclusion that one translation is bet-ter than the other. In the following, we denote thesere-interpreted DA judgements as “DARR”, to dis-tinguish it clearly from the relative ranking (“RR”)golden truth used in the past years.4

From the complete set of human assessments col-lected for the News Translation Task, all possiblepairs of DA judgements attributed to distinct trans-lations of the same source segment were convertedinto DARR better/worse judgements. Distinct trans-lations of the same source input whose DA scoresfell within 25 percentage points (which could have

3For segment-level evaluation, one would need to collectmany manual evaluations of the exact same segment as pro-duced by each MT system. Such a sampling would be howeverwasteful for the evaluation needed by WMT, so only some MTsystems happen to be evaluated for a given input segment. Inprinciple, we would like to return to DA’s standard segment-level evaluation in future, where a minimum of 15 humanjudgements of translation quality are collected per translationand combined to get highly accurate scores for translations,but this would increase annotation costs.

4Since the analogue rating scale employed by DA ismarked at the 0-25-50-75-100 points, we use 25 points as theminimum required difference between two system scores toproduce DARR judgements. Note that we rely on judgementscollected from known-reliable volunteers and crowd-sourcedworkers who passed DA’s quality control mechanism. Any in-consistency that could arise from reliance on DA judgementscollected from low quality crowd-sourcing is thus prevented.

Page 4: Results of the WMT20 Metrics Shared Tasktems participating in the WMT parallel corpus fil-tering task (Koehn et al.,2020): Khmer and Pashto to English.2 All metrics are evaluated

691

Language pairs source/reference crowd/researcher document context

iu-en reference crowd No*-en except iu-en reference crowd Yesen-*, de-fr, fr-de source mix of crowd and researcher* Yes

Table 1: Direct Assessment at WMT20. Note that researcher annotations can contain some amount of professionalannotations

been deemed equal quality) were omitted from theevaluation of segment-level metrics. Conversion ofscores in this way produced a large set of DARRjudgements for all language pairs, shown in Ta-ble 2 due to combinatorial advantage of extractingDARR judgements from all possible pairs of trans-lations of the same source input. We see that onlykm-en and ps-en can suffer from insufficient num-ber of these simulated pairwise comparisons.

The DARR judgements serve as the golden stan-dard for segment-level evaluation in WMT19.

2.3.3 Document-level Golden Truth: DARRAs segments were scored in document context, wecan compute document scores as the average hu-man rating of the segments in the document. Weacknowledge that this may be an oversimplification.First of all, we are hoping that human assessorshave indicated errors in document-level coherenceat at least one of the affected segments, but wehave no evidence that they actually do so. Second,document-level phenomena are rather scarce andaveraging segment-level scores is likely to aver-age out these sparse observations even if they weremarked at individual sentences. And lastly, in somesituations, lack of cross-sentence coherence can beso critical that any strategy of composing sentence-level scores is bound to downplay the severity ofthe error, see e.g. Vojtechova et al. (2019). Atthe current point, we have nothing better to startwith but we believe that better techniques will beproposed in the future.

Graham et al. (2017) recommend around av-eraging 100 annotations per document to obtainreliable document scores. Since the average num-ber of assessments we have is much less than that,we compute the ground truth in the same way asthe segment level evaluation.

We first compute document scores as the averageof all segment scores in the document, which wedenote as DOC-DA. We then generate DOC-DARRpairs of better and worse translations of the samesource document when there is at least a 25 point

difference in the raw DOC-DA scores. See Table 3for details.

In case of DARR (which we denote as DOC-DARR), all language pairs suffer from insufficientnumber of these simulated pairwise comparisons.

Similar to segment-level evaluation, we use theKendall Tau-like formula (Section 2) to evaluatemetric agreement with humans on the generatedpairwise DARR judgements.

Note that we do not include any human-translated segments in this evaluation. In addition,iu-en is excluded from document-level evaluationbecause its DA judgements were collected for iso-lated sentences.

3 Metrics

3.1 BaselinesWe agree with the call to use SacreBLEU (Post,2018) as the standard MT evaluation scorer. We nolonger report scores of the metrics from the Mosesscorer, which requires tokenized text. We use thefollowing metrics from the SacreBLEU scorer asbaselines, with the default parameters:

3.1.1 SacreBLEU baselines• BLEU (Papineni et al., 2002a) is the preci-

sion of n-grams of the MT output comparedto the reference, weighted by a brevitypenalty to punish overly short translations.BLEU+case.mixed+lang.LANGPAIR-+numrefs.1+smooth.exp+tok.13a-+version.1.4.14

We run SacreBLEU with the--sentence-score option to obtainsentence scores for SENTBLEU; this uses thesame parameters as BLEU. Although not it’sintended use, we also compute system- anddocument-level scores for SENTBLEU as themean segment score.

• TER (Snover et al., 2006) measuresthe number of edits (insertions, dele-tions, shifts and substitutions) required

Page 5: Results of the WMT20 Metrics Shared Tasktems participating in the WMT parallel corpus fil-tering task (Koehn et al.,2020): Khmer and Pashto to English.2 All metrics are evaluated

692

DA>1 Ave DA pairs DARR

cs-en 664 11.3 39187 14018de-en 785 11.0 43669 16584iu-en 2620 4.5 26120 8162ja-en 993 9.0 36169 15193pl-en 1001 11.8 64670 21121ru-en 991 10.0 44664 14024ta-en 997 7.6 26662 12789zh-en 2000 13.8 177492 62586km-en 1963 3.2 8295 3706ps-en 2204 3.1 7994 3507

en-cs 1418 10.3 68587 21121en-de 1418 6.9 30567 9339en-iu 1268 7.9 35384 13159en-ja 1000 9.6 41576 12830en-pl 1000 10.6 52003 17689en-ru 1971 5.7 28274 8330en-ta 1000 7.9 28974 9087en-zh 1418 10.6 72581 12652

Table 2: Segment-level: Number of judgements for DAconverted to DARR data; “DA>1” is the number ofsource input segments in the manual evaluation whereat least two translations of that same source input seg-ment received a DA judgement; “Ave” is the averagenumber of translations with at least one DA judgementavailable for the same source input segment; “DA pairs”is the number of all possible pairs of translations ofthe same source input resulting from “DA>1”; and“DARR” is the number of DA pairs with an absolutedifference in DA scores greater than the 25 percentagepoint margin.

to transform the MT output to thereference. TER+lang.LANGPAIR-+tok.tercom-nonorm-punct-noasian-uncased+version.1.4.14

• CHRF (Popovic, 2015) uses charactern-grams instead of word n-grams to com-pare the MT output with the reference 5.Version string: chrF2+lang.LANGPAIR-+numchars.6+space.false-+version.1.4.14.

3.1.2 CHRF++CHRF++ (Popovic, 2017) includes word unigramsand bigrams in addition to character ngrams. Weran the original Python implementation of the met-

5Note that the SacreBLEU scorer does not yet implementCHRF with multiple references

DOC-DA>1 Ave DOC-DA pairs DOC-DARR

cs-en 102 11.4 6041 1424de-en 118 11.0 6579 1866ja-en 80 8.9 2850 790pl-en 62 11.8 4012 635ru-en 91 9.9 4077 753ta-en 82 7.5 2126 684zh-en 155 13.8 13897 3085

en-cs 130 10.2 6162 1442en-de 130 6.9 2844 669en-iu 35 7.8 969 203en-ja 63 9.7 2686 469en-pl 63 10.7 3359 677en-ru 122 5.7 1768 387en-ta 63 7.9 1834 389en-zh 130 10.6 6667 651

Table 3: Document-level: Number of judgements forDOC-DA converted to DOC-DARR data; “DOC-DA>1”is the number of source input documents in the manualevaluation where we have DOC-DA scores for at leasttwo translations of that same source input documents;“Ave” is the average number of translations with at leastone DOC-DA judgement available for the same sourceinput document; “DOC-DA pairs” is the number of allpossible pairs of translations of the same source inputresulting from “DOC-DA>1”; and “DOC-DARR” is thenumber of DOC-DA pairs with an absolute differencein DOC-DA scores greater than the 25 percentage pointmargin.Note that iu-en is not included as document-contextwas not available for this evaluation.

ric 6 with the default parameters --ncorder 6--nwworder 2 --beta 2

3.2 Submissions

The rest of this section summarizes participatingmetrics.

3.2.1 BERT-BASE-L2, BERT-LARGE-L2,MBERT-L2

The three baselines were obtained by fine-tuningBERT (Devlin et al., 2019) on the ratings of WMTMetrics years 2015 to 2018, using a regressionloss. What distinguishes the metrics is the ini-tial BERT checkpoint: BERT-BASE-L2 uses a12-layer Transformer architecture pre-trained onEnglish data, MBERT-L2 is similar but trained

6chrF++.py available at https://github.com/m-popovic/chrF

Page 6: Results of the WMT20 Metrics Shared Tasktems participating in the WMT parallel corpus fil-tering task (Koehn et al.,2020): Khmer and Pashto to English.2 All metrics are evaluated

693

Scor

ing

leve

lm

etri

cfe

atur

esL

earn

edse

gdo

csy

sC

itatio

n/Pa

rtic

ipan

tA

vaila

bilit

yBaselines

SE

NT

BL

EU

n-gr

ams

•?

?Pa

pine

niet

al.(

2002

a)ht

tps:

//gith

ub.c

om/m

jpos

t/sac

rebl

euB

LE

Un-

gram

s−

−•

Papi

neni

etal

.(20

02a)

http

s://g

ithub

.com

/mjp

ost/s

acre

bleu

TE

Red

itdi

stan

ce•

Snov

eret

al.(

2006

)ht

tps:

//gith

ub.c

om/m

jpos

t/sac

rebl

euC

HR

Fch

arac

tern

-gra

ms

•Po

povi

c(2

015)

http

s://g

ithub

.com

/mjp

ost/s

acre

bleu

CH

RF

++

char

acte

rn-g

ram

s•

Popo

vic

(201

7)ht

tps:

//gith

ub.c

om/m

-pop

ovic

/chr

F

Reference-basedmetrics

PAR

BL

EU

para

phra

ses

•?

•U

niv

ofE

dinb

urgh

,Uni

vof

Tart

u,JH

UB

awde

net

al.(

2020

)no

tapu

blic

met

ric

PAR

CH

RF

++

para

phra

ses

•?

•U

niv

ofE

dinb

urgh

,Uni

vof

Tart

u,JH

UB

awde

net

al.(

2020

)no

tapu

blic

met

ric

PAR

ES

IMpa

raph

rase

sye

s•

Uni

vof

Edi

nbur

gh,U

niv

ofTa

rtu,

JHU

Baw

den

etal

.(20

20)

nota

publ

icm

etri

cP

RIS

Mpa

raph

rase

s•

John

sH

opki

nsU

nive

rsity

http

s://g

ithub

.com

/thom

pson

b/pr

ism

CH

AR

AC

TE

Rch

arac

tere

ditd

ista

nce

R

WT

HA

ache

nW

ang

etal

.(20

16)

http

s://g

ithub

.com

/rw

th-i

6/C

hara

cTE

RE

ED

char

acte

redi

tdis

tanc

e•

RW

TH

Aac

hen

Stan

chev

etal

.(20

19)

http

s://g

ithub

.com

/rw

th-i

6/E

xten

dedE

ditD

ista

nce

SW

SS

+M

ET

EO

Rse

man

ticsi

mila

rity

,X

uet

al.(

2020

)no

tapu

blic

met

ric

ME

Ew

ord

embe

ddin

gs•

IIIT

-Hyd

erab

ad,A

nany

aM

ukhe

rjee

and

Shar

ma

(202

0)no

tapu

blic

met

ric

YIS

Ico

ntex

tual

wor

dem

bedd

ings

N

RC

Lo

(201

9,20

20)

http

://ch

ikiu

-jac

kie-

lo.o

rg/h

ome/

inde

x.ph

p/yi

siB

ER

T-B

AS

E-L

2co

ntex

tual

wor

dem

bedd

ings

yes

G

oogl

e(D

evlin

etal

.,20

19)

(BL

EU

RT

code

,priv

ate

chec

kpoi

nt)

BE

RT-

LA

RG

E-L

2,co

ntex

tual

wor

dem

bedd

ings

yes

G

oogl

e(D

evlin

etal

.,20

19)

(BL

EU

RT

code

,priv

ate

chec

kpoi

nt)

MB

ER

T-L

2co

ntex

tual

wor

dem

bedd

ings

yes

G

oogl

e(D

evlin

etal

.,20

19)

(BL

EU

RT

code

,priv

ate

chec

kpoi

nt)

BL

EU

RT

cont

extu

alw

ord

embe

ddin

gsye

s•

Goo

gle

(Dev

linet

al.,

2019

)ht

tps:

//gith

ub.c

om/g

oogl

e-re

sear

ch/b

leur

tB

LE

UR

T-E

XT

EN

DE

Dco

ntex

tual

wor

dem

bedd

ings

yes

G

oogl

e(D

evlin

etal

.,20

19)

(BL

EU

RT

code

,priv

ate

chec

kpoi

nt)

YIS

I-C

OM

BII

cont

extu

alw

ord

embe

ddin

gsye

s•

Goo

gle

(Dev

linet

al.,

2019

)no

tapu

blic

met

ric

BL

EU

RT-

CO

MB

Ico

ntex

tual

wor

dem

bedd

ings

yes

G

oogl

e(D

evlin

etal

.,20

19)

nota

publ

icm

etri

cC

OM

ET

pred

icto

r-es

timat

orm

odel

yes

••

U

nbab

el(R

eiet

al.,

2020

b)ht

tps:

//gith

ub.c

om/U

nbab

el/C

OM

ET

CO

ME

T-R

AN

Kpr

edic

tor-

estim

ator

mod

elye

s•

Unb

abel

(Rei

etal

.,20

20b)

http

s://g

ithub

.com

/Unb

abel

/CO

ME

TC

OM

ET-

HT

ER

pred

icto

r-es

timat

orm

odel

yes

••

U

nbab

el(R

eiet

al.,

2020

b)ht

tps:

//gith

ub.c

om/U

nbab

el/C

OM

ET

CO

ME

T-2R

pred

icto

r-es

timat

orm

odel

yes

U

nbab

el(R

eiet

al.,

2020

b)ht

tps:

//gith

ub.c

om/U

nbab

el/C

OM

ET

CO

ME

T-M

QM

pred

icto

r-es

timat

orm

odel

yes

••

U

nbab

el(R

eiet

al.,

2020

b)ht

tps:

//gith

ub.c

om/U

nbab

el/C

OM

ET

BA

Q,E

Q?

?•

?no

tapu

blic

met

ric

src-based

CO

ME

T-Q

Epr

edic

tor-

estim

ator

mod

elye

s•

Unb

abel

(Rei

etal

.,20

20b)

http

s://g

ithub

.com

/Unb

abel

/CO

ME

TO

PE

NK

IWI-

BE

RT

pred

icto

r-es

timat

orm

odel

yes

U

nbab

elK

eple

reta

l.(2

019)

http

s://g

ithub

.com

/Unb

abel

/Ope

nKiw

iO

PE

NK

IWI-

XL

MR

pred

icto

r-es

timat

orm

odel

yes

U

nbab

elK

eple

reta

l.(2

019)

http

s://g

ithub

.com

/Unb

abel

/Ope

nKiw

iY

ISI-

2co

ntex

tual

wor

dem

bedd

ings

N

RC

Lo

and

Lar

kin

(202

0)

Tabl

e4:

Part

icip

ants

ofW

MT

20M

etri

csSh

ared

Task

.“•

”de

note

sth

atth

em

etri

cto

okpa

rtin

(som

eof

the

lang

uage

pair

s)of

the

segm

ent-

and/

ordo

cum

ent-

and/

orsy

stem

-le

vele

valu

atio

n.“

”in

dica

tes

that

the

docu

men

t-an

dsy

stem

-lev

elsc

ores

are

impl

ied,

sim

ply

taki

ngar

ithm

etic

(mac

ro-)

aver

age

ofse

gmen

t-le

vels

core

s.“−

”in

dica

tes

that

the

met

ric

didn

’tpa

rtic

ipat

eth

etr

ack

(Seg

/Doc

/Sys

-lev

el).

“?”

indi

cate

sth

atw

eco

mpu

ted

the

met

ric’

sdo

cum

ento

rsy

stem

scor

efo

rth

istr

ack

asth

em

acro

-ave

rage

ofse

gmen

tsc

ores

,tho

ugh

the

met

ric

isno

tdefi

ned

this

way

.A

met

ric

isle

arne

dif

itis

trai

ned

ona

QE

orm

etri

cev

alua

tion

data

set(

i.e.

pret

rain

ing

orpa

rser

sdo

n’tc

ount

,but

trai

ning

onW

MT

2019

met

rics

task

data

does

).

Page 7: Results of the WMT20 Metrics Shared Tasktems participating in the WMT parallel corpus fil-tering task (Koehn et al.,2020): Khmer and Pashto to English.2 All metrics are evaluated

694

on Wikipedia data in 102 languages, and BERT-LARGE-L2 is English-only with 24 layers.

3.2.2 BLEURT, BLEURT-EXTENDED,YISI-COMBI, BLEURT-YISI-COMBI

BLEURT (Sellam et al., 2020a) is a BERT-basedregression model trained twice: first on million syn-thetic pairs obtained by random perturbations, thenon ratings from years 2015 to 2019 of the WMTWorkshop. BLEURT-EXTENDED (Sellam et al.,2020b) is a BERT-based regression model trainedon human ratings of years 2015 to 2019 of theWMT Workshop, combined with BERT-Chinesefor to-Chinese sentence pairs. The main checkpointis a 24-layer Transformer, trained on a mixture ofWikipedia articles and training data from WMTNewstest in 20 languages.

YISI-COMBI: We are using YISI-1 on anmBERT model that is fine tuned on WMT datafor single reference submissions. We are usingaggregating internal scores in YISI over differentreferences for the final output for multi referencesubmission.

BLEURT-COMBI: We are using the same outputas YISI-COMBI for single reference submissions.We are mixing YISI-1, YISI-2 and BLEURTscores for different references for the multi ref-erence submission.

3.2.3 CHARACTERCHARACTER (Wang et al., 2016), identical to the2016 setup, is a character-level metric inspired bythe commonly applied translation edit rate (TER).It is defined as the minimum number of characteredits required to adjust a hypothesis, until it com-pletely matches the reference, normalized by thelength of the hypothesis sentence. CHARACTERcalculates the character-level edit distance whileperforming the shift edit on word level. Unlikethe strict matching criterion in TER, a hypothe-sis word is considered to match a reference wordand could be shifted, if the edit distance betweenthem is below a threshold value. The Levenshteindistance between the reference and the shifted hy-pothesis sequence is computed on the characterlevel. In addition, the lengths of hypothesis se-quences instead of reference sequences are usedfor normalizing the edit distance, which effectivelycounters the issue that shorter translations normallyachieve lower TER. Similarly to other character-level metrics, CHARACTER is generally appliedto nontokenized outputs and references, which also

holds for this year’s submission with one exception.This year tokenization was carried out for en-ruhypotheses and references before calculating thescores, since this results in large improvements interms of correlations. For other language pairs, notokenizer was used for pre-processing.

3.2.4 COMETCOMET* metrics (Rei et al., 2020b) were build us-ing the Estimator model or the Translation Rankingmodel proposed in Rei et al. (2020a). Those neuralmodels use XLM-RoBERTa to encode source, MThypothesis and reference in the same cross-lingualspace and then are optimised towards different ob-jectives. COMET (main metric) is an Estimatormodel that regresses on Direct Assessments (DA)from 2017 to 2019 and COMET-2R is a variant ofCOMET (main metric) that was trained to handlemultiple references at inference time. COMET-HTER and COMET-MQM follow the same archi-tecture but regress on Human-mediated TranslationEdit Rate (HTER) and a proprietary metric com-pliant with the Multidimensional Quality Metricsframework (MQM), respectively. COMET-Rankuses the Translation Ranking architecture to di-rectly optimize the distance between “better“ hy-pothesis and the respective source and reference,while pushing the “worse“ hypothesis away. ThisTranslation Ranking model was directly optimisedon DA relative-ranks from 2017 to 2019. Finally,COMET-QE removes the reference at input andproportionately reduces the dimensions of the esti-mator network to accommodate the reduced input.

3.2.5 EEDEED (Stanchev et al., 2019) is a character-basedmetric, which builds upon CDER. It is defined asthe minimum number of operations of an exten-sion to the conventional edit distance containing a“jump” operation. The edit distance operations (in-sertions, deletions and substitutions) are performedat the character level and jumps are performedwhen a blank space is reached. Furthermore, thecoverage of multiple characters in the hypothesis ispenalised by the introduction of a coverage penalty.The sum of the length of the reference and the cov-erage penalty is used as the normalisation term.

3.2.6 MEEMEE (Ananya Mukherjee and Sharma, 2020) isan automatic evaluation metric that leverages thesimilarity between embeddings of words in candi-

Page 8: Results of the WMT20 Metrics Shared Tasktems participating in the WMT parallel corpus fil-tering task (Koehn et al.,2020): Khmer and Pashto to English.2 All metrics are evaluated

695

date and reference sentences to assess translationquality. Unigrams are matched based on their sur-face forms, root forms and meanings which aids tocapture lexical, morphological and semantic equiv-alence. Semantic evaluation is achieved by usingpretrained fasttext embeddings provided by Face-book to calculate the word similarity score betweenthe candidate and the reference words. MEE com-putes evaluation score using three modules namelyexact match, root match and synonym match. Ineach module, fmean-score is calculated using har-monic mean of precision and recall by assigningmore weight to recall. The final translation score isobtained by taking average of fmean-scores fromindividual modules.

3.2.7 ESIMEnhanced Sequential Inference Model (Chen et al.,2017) is a neural model proposed for Natural Lan-guage Inference that has been adapted for MTevaluation by Mathur et al. (2019). It uses cross-sentence attention and sentence matching heuris-tics to generate a representation of the translationand the reference, which is fed to a feedforwardregressor. This year’s scores were submitted byBawden et al. (2020) as part of the submission onPARESIM.

3.3 OPENKIWI-BERT, OPENKIWI-XLMR

OPENKIWI-BERT and OPENKIWI-XLMR (Ke-pler et al., 2019) are state of the art Quality Estima-tion models developed for the WMT20 QE sharedtask and are trained with WMT Metrics data from2017 to 2019.

3.3.1 PARBLEU, PARCHRF++, PARESIMPARBLEU, PARCHRF++, and PARESIM (Baw-den et al., 2020) are variants of their respectivecore metrics computed against the provided hu-man reference and a set of automatically gener-ated paraphrases. PARBLEU used five paraphrases,while the other two used only one. Both BLEU andCHRF++ have in-built support for multiple refer-ences. For ESIM, we calculate the score for eachreference separately and then average them to getthe final score.

3.3.2 PRISM

PRISM (Thompson and Post, 2020) is a many-many multilingual neural machine translation sys-tem trained on data for 39 language pairs, withdata derived largely from WMT and Wikimatrix. It

casts machine translation evaluation as a zero-shotparaphrasing task, producing segment-level scoresby force-decoding between a system output anda reference, in both directions, and averaging themodel scores. System-level scores are producedby averaging segment-level ones. For evaluationin Inuktikut, Khmer, Pashto, and Tamil, we useda “Prism44” model that was retrained after addingWMT-provided data for these languages to its orig-inal training data set. All other languages wereevaluated with the original “Prism39” model.

3.3.3 SWSS+METEORSWSS (Semantically Weighted Sentence Similar-ity, Xu et al. 2020) is an approach to extractingsemantic core words, which are words that carryimportant semantic meanings in sentences, and us-ing them in MT evaluation. It uses UCCA (Uni-versal Conceptual Cognitive Annotation), a seman-tic representation framework, to identify semanticcore words, and then calculates sentence similar-ity scores on the overlap of semantic core wordsof sentence pairs. Taking sentence-level seman-tic structure information into consideration, SWSScan improve the performance of lexical metricswhen combined with them. The submitted metric(SWSS+METEOR) is a weighted combination ofSWSS and Meteor.

3.3.4 YISI-0, YISI-1, YISI-2YISI (Lo, 2019, 2020) is a unified semantic MTquality evaluation and estimation metric for lan-guages with different levels of available re-sources.YISI-1 is a reference-based MT evaluation met-ric that measures the semantic similarity betweena ma-chine translation and human references byaggregating the idf-weighted lexical semantic sim-ilarities based on the contextual embeddings ex-tracted from pretrained language models (BERT,CamemBERT, RoBERTa, XLM, XLM-RoBERTa,etc.) and optionally incorporating shallow seman-tic structures (denoted as YISI-1 SRL; not partic-ipating this year). YISI-0 is the degenerate ver-sion of YISI-1 that is ready-to-deploy to any lan-guage. It uses longest common character substringto measure the lexical similarity. YISI-2 (Lo andLarkin, 2020) is the bilingual, reference-less ver-sion for MT quality estimation, which uses bilin-gual mappings of the contextual embeddings ex-tracted from pretrained language models (XLMor XLM-RoBERTa) to evaluate the crosslinguallexical semantic similarity between the input and

Page 9: Results of the WMT20 Metrics Shared Tasktems participating in the WMT parallel corpus fil-tering task (Koehn et al.,2020): Khmer and Pashto to English.2 All metrics are evaluated

696

MT output. Like YISI-1, YISI-2 can exploit shal-low semantic structures as well (denoted as YISI-2 SRL; does not participate this year).

3.4 Pre-processingSince some metrics, such as BLEU, aim to achievea strong positive correlation with human assess-ment, while error metrics, such as TER, aim for astrong negative correlation, in previous years wecompare metrics via the absolute value |r| of agiven metric’s correlation with human assessment.However, this can mask instances of true negativecorrelation for metrics that aim for a positive corre-lation (and vice-versa).

For system, document and segment level scores,we reverse the sign of the score of error met-rics prior to the comparison with human scores,whether on the system, document or segment level:higher scores have to indicate better translationquality.

4 Results

4.1 System-Level EvaluationAs in previous years, we employ the Pearson cor-relation (r) as the main evaluation measure forsystem-level metrics. The Pearson correlation is asfollows:

r =

∑ni=1(Hi −H)(Mi −M)√∑n

i=1(Hi −H)2√∑n

i=1(Mi −M)2(1)

where Hi are human assessment scores of all sys-tems in a given translation direction, Mi are thecorresponding scores as predicted by a given met-ric. H and M are their means, respectively.

As recommended by Graham and Baldwin(2014), we employ Williams significance test(Williams, 1959) to identify differences in correla-tion that are statistically significant. Williams testis a test of significance of a difference in dependentcorrelations and therefore suitable for evaluationof metrics. Correlations not significantly outper-formed by any other metric for the given languagepair are highlighted in bold in all the results tablesthat show Pearson correlation of metric and humanscores.

Pearson correlation is ideal for reporting whethermetric scores have the same trend as human scores.In practice, we use metrics to make decisions com-paring MT systems, and Kendall’s Tau appears tobe more close to this use case, as it directly checks

whether the metric ordering of a pair of MT sys-tems agrees with the human ordering. However,unlike Pearson correlation, it is not sensitive towhether the metric score differences correspond tothe human score differences. We stay with Pearsoncorrelation for the official results, but also reportKendall’s Tau correlation in the appendix.

The calculation of Pearson correlation coeffi-cient is dependent on the mean, which is very sen-sitive to outliers. So if we have systems whosescores are far away from the rest of the systems,the presence of these “outlier” systems can give amisleadingly high impression of the correlations,and potentially change ranking of metrics. To avoidthis, we also report correlations over non-outliersystems only.

To remove outliers, we are guided by the robustoutlier detection method proposed for MT metricevaluation by Mathur et al. (2020). This method,recommended by the statistics literature (Iglewiczand Hoaglin, 1993; Rousseeuw and Hubert, 2011;Leys et al., 2013) depends on the median and themedian absolute deviation (MAD) which is the me-dian of the absolute difference between each pointand the median. The method removes systemswhose human scores are greater than 2.5 MADaway from the median.

The cutoff of 2.5 is subjective, and Leys et al.(2013) suggest the guidelines of using 3 (veryconservative), 2.5 (moderately conservative) or 2(poorly conservative), and recommends 2.5. Forsome language pairs, we override the 2.5 cutoff forsystems that are close to the cutoff. We give exam-ples in Section 5, and list all identified outliers inTable 15 in the Appendix.

4.1.1 System-Level ResultsTables 5 and 6 provide the system-level correlationsof metrics. These tables include results for all MTsystems, and in cases where we detect outliers, wealso report correlation without outliers.

This year, we also carry out an extended analysisof the impact of (multiple) human references, seethe following paragraphs.

Scoring Human Translation In this section, weinvestigate how well the metric submissions scorehuman translations. We have five language pairswhere two reference translations were provided byWMT. The manual DA scoring of News TranslationTask included all the out-of-English human refer-ences in the evaluation along with the MT systems.

Page 10: Results of the WMT20 Metrics Shared Tasktems participating in the WMT parallel corpus fil-tering task (Koehn et al.,2020): Khmer and Pashto to English.2 All metrics are evaluated

697

cs-e

nde

-en

ja-e

npl

-en

ru-e

nta

-en

zh-e

niu

-en

km-e

nps

-en

109

713

1012

159

76

SE

NT

BL

EU

0.84

40.

800

0.97

80.

786

0.97

40.

851

0.50

20.

284

0.91

60.

833

0.92

50.

829

0.94

80.

950

0.64

90.

469

0.96

90.

888

BL

EU

0.85

10.

800

0.98

50.

778

0.96

90.

826

0.54

90.

355

0.88

40.

761

0.91

60.

807

0.95

60.

957

0.56

90.

348

0.96

90.

888

TE

R0.

845

0.78

30.

993

0.76

60.

974

0.75

20.

586

0.34

60.

904

0.82

90.

805

0.79

50.

956

0.91

10.

733

0.61

60.

973

0.93

5C

HR

F+

+0.

867

0.80

40.

997

0.69

90.

974

0.87

10.

538

0.32

80.

894

0.83

30.

953

0.83

00.

975

0.95

50.

726

0.39

20.

983

0.90

0C

HR

F0.

872

0.80

60.

997

0.68

70.

968

0.86

10.

528

0.31

20.

890

0.83

10.

951

0.82

80.

976

0.95

40.

729

0.33

70.

978

0.89

8PA

RB

LE

U0.

834

0.77

40.

986

0.83

80.

970

0.83

30.

562

0.34

20.

877

0.74

40.

908

0.80

10.

958

0.95

30.

624

0.39

80.

971

0.93

9PA

RC

HR

F+

+0.

865

0.81

00.

998

0.70

80.

974

0.87

70.

551

0.34

70.

885

0.82

30.

942

0.82

50.

976

0.95

60.

720

0.29

60.

985

0.89

9C

HA

RA

CT

ER

0.84

40.

812

0.99

80.

687

0.97

00.

895

0.52

20.

325

0.92

70.

869

0.96

50.

880

0.96

40.

950

0.76

30.

410

0.97

70.

841

EE

D0.

884

0.83

80.

997

0.75

20.

974

0.90

40.

538

0.29

90.

926

0.87

20.

958

0.86

20.

956

0.93

20.

821

0.58

70.

990

0.93

0Y

ISI-

00.

876

0.82

50.

998

0.78

60.

972

0.86

70.

453

0.20

70.

938

0.87

40.

968

0.86

10.

956

0.91

80.

831

0.56

30.

986

0.93

2S

WS

S+

ME

TE

OR

−−

0.97

80.

919

0.47

20.

212

0.92

50.

876

0.96

70.

862

0.95

90.

926

0.76

60.

545

0.99

00.

946

ME

E0.

861

0.82

20.

995

0.71

20.

982

0.90

00.

464

0.29

50.

927

0.87

80.

950

0.83

50.

952

0.94

80.

771

0.56

20.

970

0.87

8P

RIS

M0.

818

0.72

00.

998

0.77

50.

974

0.86

90.

502

0.26

90.

908

0.83

90.

898

0.78

80.

957

0.94

50.

833

0.61

60.

950

0.96

6Y

ISI-

10.

832

0.74

60.

998

0.78

30.

982

0.86

80.

543

0.31

60.

915

0.83

30.

925

0.79

70.

961

0.94

20.

834

0.59

00.

977

0.95

3B

ER

T-B

AS

E-L

20.

775

0.69

30.

997

0.79

10.

971

0.78

90.

552

0.32

80.

919

0.83

60.

909

0.74

60.

967

0.92

90.

704

0.14

50.

967

0.94

5B

ER

T-L

AR

GE

-L2

0.78

40.

695

0.99

00.

800

0.97

40.

784

0.52

00.

282

0.92

50.

843

0.90

10.

760

0.96

20.

928

0.74

40.

211

0.95

90.

950

MB

ER

T-L

20.

798

0.71

50.

995

0.82

40.

969

0.81

10.

555

0.30

20.

908

0.80

50.

887

0.74

00.

959

0.93

50.

837

0.53

00.

980

0.93

8B

LE

UR

T0.

792

0.72

50.

996

0.77

00.

978

0.82

00.

591

0.37

10.

924

0.84

40.

906

0.76

80.

966

0.93

10.

771

0.32

00.

984

0.95

5B

LE

UR

T-E

XT

EN

DE

D0.

771

0.66

80.

985

0.81

80.

961

0.77

20.

551

0.29

80.

900

0.79

70.

897

0.74

30.

945

0.93

10.

789

0.35

90.

985

0.94

2E

SIM

0.79

00.

716

0.99

80.

808

0.98

30.

822

0.59

10.

358

0.92

80.

834

0.88

50.

801

0.96

30.

910

0.80

70.

514

0.92

90.

929

PAR

ES

IM-1

0.78

80.

712

0.99

80.

835

0.98

30.

819

0.59

10.

363

0.92

60.

828

0.88

50.

797

0.96

30.

910

0.80

00.

495

0.92

90.

929

CO

ME

T0.

783

0.69

40.

998

0.77

30.

964

0.82

80.

591

0.34

50.

923

0.83

60.

880

0.76

40.

952

0.93

10.

852

0.60

50.

971

0.94

1C

OM

ET-

2R0.

777

0.69

70.

998

0.77

20.

964

0.81

80.

584

0.33

20.

924

0.84

30.

881

0.77

00.

949

0.92

80.

872

0.64

40.

970

0.94

9C

OM

ET-

HT

ER

0.73

80.

661

0.99

50.

767

0.91

20.

702

0.44

60.

231

0.86

70.

741

0.72

60.

595

0.80

90.

873

0.77

00.

464

0.90

10.

862

CO

ME

T-M

QM

0.72

80.

612

0.99

10.

684

0.90

60.

707

0.42

40.

222

0.85

80.

746

0.76

70.

617

0.78

40.

862

0.84

10.

631

0.91

40.

880

CO

ME

T-R

AN

K0.

705

0.53

40.

964

0.75

70.

923

0.79

30.

483

0.28

40.

868

0.73

20.

787

0.66

40.

877

0.90

90.

158

0.21

40.

911

0.85

5B

AQ

DY

N−

−−

−−

−0.

956

0.92

8−

−−

BA

QS

TAT

IC−

−−

−−

−0.

960

0.93

3−

−−

CO

ME

T-Q

E0.

755

0.62

20.

939

0.80

50.

892

0.58

50.

447

0.21

80.

883

0.77

30.

795

0.67

20.

847

0.88

70.

685

0.66

10.

896

0.83

2O

PE

NK

IWI-

BE

RT

0.72

60.

698

0.98

90.

741

0.73

50.

546

0.35

50.

187

0.86

20.

695

0.64

50.

469

0.62

50.

774

-0.1

26-0

.671

0.75

10.

753

OP

EN

KIW

I-X

LM

R0.

760

0.68

00.

995

0.70

10.

931

0.71

40.

442

0.17

10.

859

0.69

70.

792

0.65

90.

905

0.89

90.

271

-0.5

770.

880

0.86

5Y

ISI-

20.

764

0.64

00.

988

0.40

40.

971

0.77

60.

437

0.23

00.

825

0.81

40.

849

0.76

10.

964

0.93

30.

676

0.37

10.

790

0.94

2

Tabl

e5:

Pear

son

corr

elat

ion

ofto

-Eng

lish

syst

em-l

evel

met

rics

with

DA

hum

anas

sess

men

tove

rMT

syst

ems

usin

gth

ene

wst

est2

020

refe

renc

es.F

orla

ngua

gepa

irs

that

cont

ain

outli

ersy

stem

s,w

eal

sosh

owco

rrel

atio

naf

terr

emov

ing

outli

ersy

stem

s(“

-out

”).C

orre

latio

nsof

met

rics

nots

igni

fican

tlyou

tper

form

edby

any

othe

rfor

that

lang

uage

pair

are

high

light

edin

bold

.

Page 11: Results of the WMT20 Metrics Shared Tasktems participating in the WMT parallel corpus fil-tering task (Koehn et al.,2020): Khmer and Pashto to English.2 All metrics are evaluated

698

en-c

sen

-de

en-j

aen

-pl

en-r

uen

-ta

en-z

hen

-iu F

UL

Len

-iu N

EW

S#

All

-out

All

-out

All

-out

All

-out

All

All

-out

All

All

-out

All

-out

1210

1411

119

1411

915

1212

118

118

SE

NT

BL

EU

0.84

00.

436

0.93

40.

823

0.94

60.

976

0.95

00.

772

0.98

10.

881

0.85

20.

927

0.12

90.

047

0.07

50.

172

BL

EU

0.82

50.

390

0.92

80.

825

0.94

50.

980

0.94

30.

743

0.98

00.

880

0.82

90.

928

0.16

30.

131

0.07

40.

111

TE

R0.

814

0.33

90.

941

0.84

80.

297

0.80

10.

893

0.55

30.

064

0.87

00.

883

-0.2

130.

384

0.13

30.

357

0.08

3C

HR

F+

+0.

833

0.34

90.

958

0.85

00.

952

0.94

50.

956

0.78

30.

983

0.92

90.

880

0.87

80.

328

0.12

80.

315

0.09

8C

HR

F0.

826

0.31

30.

962

0.86

20.

951

0.96

40.

957

0.79

30.

982

0.93

70.

890

0.92

30.

350

0.12

20.

336

0.09

1PA

RB

LE

U0.

870

0.54

30.

910

0.77

40.

869

0.81

30.

948

0.76

00.

959

0.87

10.

849

0.96

20.

194

0.46

40.

126

0.30

6PA

RC

HR

F+

+0.

860

0.43

80.

957

0.84

50.

955

0.95

10.

953

0.81

80.

975

−0.

948

−−

CH

AR

AC

TE

R0.

807

0.26

90.

961

0.86

80.

951

0.93

60.

935

0.72

60.

961

0.95

70.

851

0.90

50.

503

0.00

80.

515

0.12

1E

ED

0.81

70.

271

0.96

50.

869

0.95

50.

965

0.96

20.

789

0.98

00.

959

0.91

30.

928

0.51

90.

043

0.48

30.

122

ME

E0.

875

0.49

50.

954

0.82

0−

0.95

20.

733

0.72

40.

906

0.86

1−

0.28

70.

094

0.24

20.

113

YIS

I-0

0.79

70.

270

0.95

30.

889

0.96

70.

972

0.95

30.

783

0.97

10.

929

0.89

70.

362

0.52

50.

015

0.50

50.

095

PR

ISM

0.94

90.

805

0.95

80.

851

0.93

20.

921

0.95

80.

742

0.72

40.

863

0.45

20.

221

0.95

7-0

.043

0.94

50.

088

YIS

I-1

0.92

20.

664

0.97

10.

887

0.96

90.

967

0.96

40.

714

0.92

60.

973

0.90

90.

959

0.55

4-0

.217

0.52

3-0

.014

YIS

I-C

OM

BI

−0.

971

0.86

8−

−−

−−

−−

BL

EU

RT-

YIS

I-C

OM

BI

−0.

971

0.86

8−

−−

−−

−−

MB

ER

T-L

20.

946

0.78

20.

970

0.86

10.

977

0.96

90.

976

0.77

50.

946

0.94

40.

834

0.93

4−

−B

LE

UR

T-E

XT

EN

DE

D0.

989

0.96

00.

969

0.87

00.

944

0.95

30.

982

0.82

80.

980

0.94

00.

814

0.92

80.

823

0.12

20.

762

0.15

5E

SIM

0.90

80.

575

0.97

90.

894

0.99

30.

981

0.96

90.

698

0.96

70.

937

0.83

30.

972

0.81

40.

365

0.76

00.

418

PAR

ES

IM-1

0.91

90.

635

0.97

40.

886

0.98

90.

971

0.96

80.

705

0.96

40.

937

0.83

30.

983

0.81

40.

365

0.76

00.

418

CO

ME

T0.

978

0.92

60.

972

0.86

30.

974

0.96

90.

981

0.80

00.

925

0.94

40.

798

0.00

70.

860

0.02

80.

858

0.15

2C

OM

ET-

2R0.

983

0.94

20.

972

0.86

90.

986

0.97

80.

982

0.80

30.

872

0.95

90.

852

-0.0

660.

848

-0.0

080.

867

0.17

7C

OM

ET-

HT

ER

0.97

60.

917

0.95

10.

852

0.98

90.

974

0.97

40.

763

0.80

30.

925

0.68

1-0

.073

0.90

00.

142

0.88

80.

092

CO

ME

T-M

QM

0.97

40.

910

0.88

10.

840

0.97

40.

965

0.96

70.

766

0.78

80.

910

0.64

10.

084

0.87

00.

129

0.86

70.

172

CO

ME

T-R

AN

K0.

959

0.86

80.

877

0.86

00.

931

0.92

80.

957

0.76

00.

676

0.87

60.

511

0.54

00.

283

0.09

90.

392

0.25

2B

AQ

DY

N−

−−

−−

−0.

904

−−

BA

QS

TAT

IC−

−−

−−

−0.

958

−−

EQ

DY

N−

−−

−−

−0.

948

−−

EQ

STA

TIC

−−

−−

−−

0.97

6−

−C

OM

ET-

QE

0.98

90.

974

0.90

30.

831

0.95

30.

955

0.96

90.

804

0.80

70.

887

0.62

20.

375

0.90

50.

578

0.92

80.

651

OP

EN

KIW

I-B

ER

T0.

920

0.83

00.

852

0.82

90.

363

0.78

30.

903

0.45

00.

834

0.84

60.

370

0.55

10.

573

-0.6

020.

808

0.19

4O

PE

NK

IWI-

XL

MR

0.97

20.

911

0.96

80.

814

0.99

20.

976

0.95

70.

638

0.87

50.

910

0.67

6-0

.010

0.51

3-0

.668

0.68

0-0

.358

YIS

I-2

0.71

40.

353

0.89

90.

552

0.85

40.

646

0.47

0-0

.107

0.58

40.

922

0.92

3-0

.215

0.80

2-0

.257

0.83

00.

065

Tabl

e6:

Pear

son

corr

elat

ion

ofou

t-of

-Eng

lish

syst

em-l

evel

met

rics

with

DA

hum

anas

sess

men

tove

rM

Tsy

stem

sus

ing

the

new

stes

t202

0re

fere

nces

;Fo

rla

ngua

gepa

irs

that

cont

ain

outli

ersy

stem

s,w

eal

sosh

owco

rrel

atio

naf

terr

emov

ing

outli

ersy

stem

s.C

orre

latio

nsof

met

rics

nots

igni

fican

tlyou

tper

form

edby

any

othe

rfor

that

lang

uage

pair

are

high

light

edin

bold

.#

The

Eng

lish→

Inuk

titut

hum

anev

alua

tion

only

cont

aine

dth

ene

ws

subs

et,s

ow

ere

com

pute

en-i

usy

stem

scor

esof

met

rics

onth

ene

ws

subs

etof

the

test

set(

1405

sent

ence

s).

Not

eth

atth

esc

ores

ofPA

RB

LE

Uan

dPA

RC

HR

Fw

ere

com

pute

das

aver

age

ofse

gmen

tsco

res

Page 12: Results of the WMT20 Metrics Shared Tasktems participating in the WMT parallel corpus fil-tering task (Koehn et al.,2020): Khmer and Pashto to English.2 All metrics are evaluated

699

en-d

een

-de P

en-d

e Ben

-de P

en-z

hen

-zh B

de-e

nru

-en

zh-e

nH

IDH

uman

-BH

uman

-BH

uman

-AH

uman

-AH

uman

-BH

uman

-AH

uman

-BH

uman

-BH

uman

-BN

1212

1212

1313

1011

16

SE

NT

BL

EU

0.44

10.

851

0.63

90.

676

0.64

70.

837

0.43

70.

797

0.91

7B

LE

U0.

458

0.86

80.

672

0.66

50.

658

0.81

40.

480

0.73

80.

938

TE

R0.

233

0.49

50.

577

0.69

5-0

.131

-0.1

380.

466

0.81

20.

850

CH

RF

++

0.55

50.

917

0.74

80.

650

0.59

20.

805

0.43

70.

815

0.94

7C

HR

F0.

599

0.91

90.

772

0.64

50.

651

0.81

20.

442

0.82

10.

948

PAR

BL

EU

0.34

90.

676

0.58

00.

682

0.56

90.

787

0.49

80.

716

0.92

6PA

RC

HR

F+

+0.

573

0.89

00.

748

0.69

80.

559

0.77

60.

447

0.80

30.

950

CH

AR

AC

TE

R0.

472

0.89

00.

736

0.63

80.

687

0.85

00.

410

0.85

60.

938

EE

D0.

447

0.89

80.

685

0.64

60.

679

0.83

00.

466

0.86

10.

910

YIS

I-0

0.51

40.

892

0.72

80.

724

0.24

40.

274

0.56

60.

860

0.89

8S

WS

S+

ME

TE

OR

−−

−−

−−

−0.

866

0.91

4M

EE

0.51

20.

886

0.71

90.

642

−−

0.39

90.

855

0.94

1P

RIS

M0.

472

0.72

70.

731

0.74

20.

157

0.16

60.

591

0.83

70.

942

YIS

I-1

0.64

00.

895

0.83

00.

697

0.77

30.

916

0.71

30.

822

0.94

3Y

ISI-

CO

MB

I0.

607

0.89

10.

801

0.70

2−

−−

−−

BL

EU

RT-

YIS

I-C

OM

BI

0.60

70.

891

0.80

10.

702

−−

−−

−B

ER

T-B

AS

E-L

2−

−−

−−

−0.

785

0.81

30.

922

BE

RT-

LA

RG

E-L

2−

−−

−−

−0.

794

0.81

90.

923

MB

ER

T-L

20.

845

0.87

60.

875

0.81

00.

868

0.90

70.

748

0.78

90.

925

BL

EU

RT

−−

−−

−−

0.75

40.

823

0.92

3B

LE

UR

T-E

XT

EN

DE

D0.

888

0.89

60.

883

0.83

80.

865

0.91

00.

811

0.75

70.

914

ES

IM0.

719

0.92

00.

870

0.74

40.

837

0.92

40.

765

0.81

90.

911

PAR

ES

IM-1

0.68

70.

905

0.85

60.

763

0.82

20.

910

0.79

80.

815

0.91

1C

OM

ET

0.85

40.

894

0.87

90.

822

0.07

80.

062

0.75

90.

821

0.91

6C

OM

ET-

2R0.

820

0.86

60.

877

0.86

50.

009

-0.0

030.

756

0.83

70.

911

CO

ME

T-H

TE

R0.

840

0.87

10.

869

0.85

10.

006

-0.0

010.

761

0.71

80.

857

CO

ME

T-M

QM

0.83

90.

876

0.85

90.

825

0.15

80.

154

0.68

20.

722

0.84

6C

OM

ET-

RA

NK

0.78

20.

870

0.83

00.

794

0.57

80.

565

0.70

90.

725

0.89

6B

AQ

DY

N−

−−

−0.

739

−−

−0.

236

BA

QS

TAT

IC−

−−

−0.

915

−−

−0.

239

EQ

DY

N−

−−

−0.

729

−−

−−

EQ

STA

TIC

−−

−−

0.92

5−

−−

−C

OM

ET-

QE

0.88

50.

885

0.84

40.

844

0.47

30.

481

0.80

60.

749

0.86

5O

PE

NK

IWI-

BE

RT

0.74

10.

741

0.83

50.

835

0.48

70.

521

0.65

50.

682

0.74

2O

PE

NK

IWI-

XL

MR

0.73

60.

736

0.79

50.

795

0.05

30.

050

0.66

00.

694

0.89

3Y

ISI-

2-0

.333

-0.3

33-0

.039

-0.0

39-0

.190

-0.1

980.

123

0.51

30.

882

Tabl

e7:

Eva

luat

ing

Hum

antr

ansl

atio

n:Pe

arso

nco

rrel

atio

nof

met

rics

with

DA

hum

anas

sess

men

tfor

allM

Tsy

stem

spl

usH

uman

tran

slat

ion.

The

subs

crip

tBre

pres

ents

anal

tern

ate

refe

renc

e,P

repr

esen

tsa

para

phra

sed

refe

renc

e.N

isth

eto

taln

umbe

rof

MT

syst

ems

(exc

ludi

ngou

tlier

s)an

dH

IDis

the

iden

tity

ofth

ehu

man

tran

slat

ion

eval

uate

d.C

orre

latio

nsof

met

rics

nots

igni

fican

tlyou

tper

form

edby

any

othe

rfor

that

lang

uage

pair

are

high

light

edin

bold

.

Page 13: Results of the WMT20 Metrics Shared Tasktems participating in the WMT parallel corpus fil-tering task (Koehn et al.,2020): Khmer and Pashto to English.2 All metrics are evaluated

700

For to-English language pairs, only the secondaryhuman reference translations were manually scoredwith DA as the primary human reference transla-tion was shown to the monolingual annotators.

For these language pairs, the metrics can scorea human translation by using the other one as thereference translation. For simplicity, we add thesecond human reference translation to the list oftranslation outputs and observe how its scoring bythe given metric affects the correlation.

Table 7 shows how well the metrics correlatewith the WMT human evaluation when includinghuman translations as additional output. In mostcases, the correlation decreases as metrics strug-gle to correctly score translations that are differentfrom MT systems. Metrics that rely on fine-tuningon existing human assessments from the previousWMT campaigns (e.g. BLEURT, ESIM, COMET)can handle human translations much better on av-erage. Also, the Paraphrased references help thelexical metrics correctly identify the high qualityof human translations.

We present a deeper analysis of how metricsscore human translations in Section 5.1.2. We basethis discussion on scatterplots of human vs metricscores. We include scatterplots of selected metricsin Appendix B.

Influence of References Rewarding multiple al-ternative translations is the primary motivation be-hind multiple-reference based evaluation. It is gen-erally assumed that using multiple reference trans-lation for automatic evaluation is helpful as wecover a wider space of possible translations (Pap-ineni et al., 2002b; Dreyer and Marcu, 2012; Bojaret al., 2013). Nevertheless, new studies (Freitaget al., 2020) showed that multi-reference evaluationdoes not improve the correlation for high qualityoutput anymore. Since we have multiple referencesavailable for five language pairs, we can look athow much the choice of reference(s) influencescorrelation.

Table 8 compares metric correlations on the pri-mary reference set newstest2020, alternative refer-ence newstestB2020, paraphrased reference new-stestP2020 (only for English-German), or usingall available references newstestM2020. We onlyreport system-level correlations of metrics on MTsystems after discarding outliers.

4.2 Segment- and Document-LevelEvaluation

Segment-level evaluation relies on the manualjudgements collected in the News Translation Taskevaluation. This year, again we were unable tofollow the methodology outlined in Graham et al.(2015) for evaluating of segment-level metrics be-cause the sampling of segments did not providesufficient number of assessments of the same seg-ment. We therefore convert pairs of DA scores forcompeting translations to DARR better/worse pref-erences as described in Section 2.3.2. We furtherfollow the same process to generate DARR groundtruth for documents, as we do not have enoughannotations to obtain accurate human scores.

We measure the quality of metrics’ scoresagainst the DARR golden truth using a Kendall’sTau-like formulation, which is an adaptation of theconventional Kendall’s Tau coefficient. Since wedo not have a total order ranking of all translations,it is not possible to apply conventional Kendall’sTau given the current DARR human evaluationsetup (Graham et al., 2015).

Our Kendall’s Tau-like formulation, τ , is as fol-lows:

τ =|Concordant| − |Discordant||Concordant|+ |Discordant|

(2)

where Concordant is the set of all human compar-isons for which a given metric suggests the sameorder and Discordant is the set of all human com-parisons for which a given metric disagrees. Theformula is not specific with respect to ties, i.e. caseswhere the annotation says that the two outputs areequally good.

The way in which ties (both in human and met-ric judgement) were incorporated in computingKendall τ has changed across the years of WMTMetrics Tasks. Here we adopt the version used inWMT17 DARR evaluation. For a detailed discus-sion on other options, see also Machacek and Bojar(2014).

Whether or not a given comparison of a pair ofdistinct translations of the same source input, s1and s2, is counted as a concordant (Conc) or dis-concordant (Disc) pair is defined by the followingmatrix:

In previous years, we used bootstrap resampling(Koehn, 2004; Graham et al., 2014) to estimate con-fidence intervals for our Kendall’s Tau formulation,and metrics with non-overlapping 95% confidence

Page 14: Results of the WMT20 Metrics Shared Tasktems participating in the WMT parallel corpus fil-tering task (Koehn et al.,2020): Khmer and Pashto to English.2 All metrics are evaluated

701

en-d

een

-de B

en-d

e Pen

-de M

en-z

hen

-zh B

en-z

h Mde

-en

de-e

n Bde

-en M

ru-e

nru

-en B

ru-e

n Mzh

-en

zh-e

n Bzh

-en M

1111

1111

1212

129

99

1010

1015

1515

SE

NT

BL

EU

0.82

30.

837

0.81

50.

827

0.92

70.

911

0.91

90.

786

0.76

30.

788

0.83

30.

850

0.83

70.

950

0.92

80.

944

BL

EU

0.82

50.

844

0.83

00.

822

0.92

80.

899

0.91

30.

778

0.79

70.

805

0.76

10.

780

0.77

50.

957

0.93

40.

949

TE

R0.

848

0.86

00.

859

0.85

2-0

.213

-0.2

00-0

.203

0.76

60.

744

0.75

80.

829

0.83

20.

853

0.91

10.

875

0.91

1C

HR

F+

+0.

850

0.86

60.

876

0.85

80.

878

0.91

50.

885

0.69

90.

681

0.70

40.

833

0.83

90.

843

0.95

50.

948

0.95

2C

HR

F0.

862

0.87

40.

883

−0.

923

0.91

2−

0.68

70.

683

−0.

831

0.83

9−

0.95

40.

947

−PA

RB

LE

U0.

774

0.79

60.

724

0.79

40.

962

0.95

50.

959

0.83

80.

831

0.82

90.

744

0.76

70.

756

0.95

30.

934

0.94

5PA

RC

HR

F+

+0.

845

0.86

30.

865

0.85

60.

948

0.96

60.

896

0.70

80.

704

0.66

90.

823

0.83

40.

832

0.95

60.

950

0.95

6C

HA

RA

CT

ER

0.86

80.

889

0.83

50.

878

0.90

50.

908

0.90

10.

687

0.69

60.

713

0.86

90.

853

0.87

30.

950

0.94

20.

949

EE

D0.

869

0.87

10.

867

0.86

70.

928

0.92

30.

930

0.75

20.

747

0.75

20.

872

0.86

80.

879

0.93

20.

922

0.93

2Y

ISI-

00.

889

0.88

20.

873

0.88

60.

362

0.27

30.

332

0.78

60.

790

0.79

40.

874

0.86

70.

880

0.91

80.

911

0.91

8S

WS

S+

ME

TE

OR

−−

−−

−−

−−

−−

0.87

60.

891

0.89

10.

926

0.92

30.

929

ME

E0.

820

0.83

30.

830

0.82

0−

−−

0.71

20.

674

0.71

20.

878

0.87

60.

876

0.94

80.

940

0.94

8P

RIS

M0.

851

0.85

40.

839

0.85

10.

221

0.17

80.

221

0.77

50.

763

0.77

00.

839

0.84

10.

842

0.94

50.

949

0.94

5Y

ISI-

10.

887

0.88

60.

888

0.88

50.

959

0.95

90.

960

0.78

30.

781

0.78

10.

833

0.83

70.

838

0.94

20.

942

0.94

3Y

ISI-

CO

MB

I0.

868

0.87

30.

876

0.87

6−

−−

−−

−−

−−

−−

−B

LE

UR

T-Y

ISI-

CO

MB

I0.

868

0.87

30.

876

0.87

6−

−−

−−

−−

−−

−−

−B

ER

T-B

AS

E-L

2−

−−

−−

−−

0.79

10.

798

0.80

20.

836

0.83

30.

835

0.92

90.

936

0.93

3B

ER

T-L

AR

GE

-L2

−−

−−

−−

−0.

800

0.80

10.

812

0.84

30.

844

0.85

00.

928

0.93

50.

932

MB

ER

T-L

20.

861

0.86

20.

841

0.86

50.

934

0.92

50.

936

0.82

40.

825

0.83

40.

805

0.81

30.

816

0.93

50.

938

0.93

9B

LE

UR

T−

−−

−−

−−

0.77

00.

769

0.78

00.

844

0.84

70.

850

0.93

10.

936

0.93

5B

LE

UR

T-E

XT

EN

DE

D0.

870

0.87

00.

860

0.86

70.

928

0.92

30.

925

0.81

80.

805

0.81

20.

797

0.79

30.

795

0.93

10.

932

0.93

2E

SIM

0.89

40.

900

0.88

70.

898

0.97

20.

975

0.97

60.

808

0.79

80.

804

0.83

40.

842

0.83

90.

910

0.92

00.

916

PAR

ES

IM-1

0.88

60.

897

0.87

80.

890

0.98

30.

983

0.98

50.

835

0.80

70.

822

0.82

80.

840

0.83

50.

910

0.91

80.

915

CO

ME

T0.

863

0.86

40.

858

0.86

40.

007

-0.0

14-0

.004

0.77

30.

769

0.77

20.

836

0.83

60.

836

0.93

10.

936

0.93

4C

OM

ET-

2R0.

869

0.86

90.

866

0.87

5-0

.066

-0.0

76-0

.075

0.77

20.

764

0.77

10.

843

0.84

20.

843

0.92

80.

930

0.92

9C

OM

ET-

HT

ER

0.85

20.

855

0.84

80.

853

-0.0

73-0

.075

-0.0

740.

767

0.76

90.

768

0.74

10.

744

0.74

20.

873

0.86

90.

871

CO

ME

T-M

QM

0.84

00.

844

0.83

60.

842

0.08

40.

076

0.08

00.

684

0.68

60.

685

0.74

60.

750

0.74

80.

862

0.86

00.

861

CO

ME

T-R

AN

K0.

860

0.83

90.

831

0.85

20.

540

0.50

70.

530

0.75

70.

582

0.72

30.

732

0.74

30.

757

0.90

90.

908

0.91

9

Tabl

e8:

Influ

ence

ofre

fere

nces

:Pea

rson

corr

elat

ion

ofm

etri

csw

ithD

Ahu

man

asse

ssm

entf

orM

Tsy

stem

sex

clud

ing

outli

ers

inW

MT

2020

fora

llla

ngua

ge-p

airs

with

mul

tiple

refe

renc

es;c

orre

latio

nsof

met

rics

nots

igni

fican

tlyou

tper

form

edby

any

othe

rfo

rth

atla

ngua

gepa

irar

ehi

ghlig

hted

inbo

ld.

The

subs

crip

tBre

pres

ents

ase

cond

ary

refe

renc

e,P

repr

esen

tsa

para

phra

sed

refe

renc

e,M

repr

esen

tsal

lava

ilabl

ere

fere

nces

.N

ote

that

we

excl

ude

refe

renc

e-fr

eem

etri

csfr

omth

ista

ble,

soth

ew

inne

rsar

eno

tcom

para

ble

with

the

mai

nta

bles

.

Page 15: Results of the WMT20 Metrics Shared Tasktems participating in the WMT parallel corpus fil-tering task (Koehn et al.,2020): Khmer and Pashto to English.2 All metrics are evaluated

702

Metrics1 < s2 s1 = s2 s1 > s2

Hum

an s1 < s2 Conc Disc Discs1 = s2 − − −s1 > s2 Disc Disc Conc

intervals are identified as having statistically sig-nificant difference in performance. The tests areinconclusive for most metric pairs this year and wedo not include them in the paper.

4.2.1 Segment-Level ResultsResults of the segment-level human evaluation fortranslations sampled from the News TranslationTask are shown in Tables 9 and 10, We expect thatcomparing between segments translated by twoMT systems that are far apart in quality would be arelatively easier task for automatic metrics. So wealso include results after discarding segments thatwere translated by outlier systems.

Note that we do not include any human-translated segments in this evaluation.

4.3 Document-level Results

Results of the document-level human evaluationfor translations sampled from the News TranslationTask are shown in Tables 11 and 12.

5 Discussion

5.1 System-Level Results

In general, there is no clear best metric this yearacross all language pairs. For most language pairs,the William’s significance test results in large clus-ters of metrics. The set of “winners” according tothe test (i.e., the metrics that are not outperformedby any other metric) are typically not consistentacross language pairs.

The sample of systems we employ to evaluatemetrics is often small, as few as six MT systems forPashto→ English, for example. This can lead toinconclusive results, as identification of significantdifferences in correlations of metrics is unlikely atsuch a small sample size. Furthermore, Williamstest takes into account the correlation between eachpair of metrics, in addition to the correlation be-tween the metric scores themselves, and this lattercorrelation increases the likelihood of a significantdifference being identified. In extreme cases, thetest would have low power when comparing a met-ric that doesn’t correlate well with other metrics,

resulting in this metric not being outperformed byother metrics despite having a much lower value ofcorrelation.

To strengthen the conclusions of our evaluation,in past years (Bojar et al., 2016, 2017; Ma et al.,2018), we included significance test results forlarge hybrid-super-samples of systems 10K hybridsystems were created per language pair, with cor-responding DA human assessment scores by sam-pling pairs of systems from the News TranslationTask, creating hybrid systems by randomly select-ing each candidate translation from one of the twoselected systems. However, as WMT human an-notations are collected with document context in2020, this style of hybridization is susceptible tobreaking cross-segment references in MT outputsand it would be unreasonable to shuffle individualsegments. The creation of hybrid systems wouldneed to be done by sampling documents instead ofsegments from all sets of systems. Finally, it is pos-sible that including documents translated by outliersystems might falsely lead to high correlations. Webelieve that this merits further investigation basedon data from previous of metrics tasks, and we donot attempt it this year.

In the rest of this section, we present analysisof various aspects of system-level evaluation basedon scatterplots of all metrics. Appendix B containsscatterplots of metrics for each language pair. Weinclude BLEU, chrF, the “best” reference-basedmetric and the “best” reference-free metric (weacknowledge that this is not the best way to definethe best metric, but we choose the metric that ismost highly correlated with humans on the set ofall MT systems after removing outliers).

5.1.1 Influence of Domain in English→Inuktitut

English→ Inuktitut training data was the CanadianHansards domain, and the development data con-tained a small amount of news data. The test setwas a mix of in-domain data from the Hansards andnews documents. The evaluation was only doneon the out-of-domain news documents, so we alsolook at metric scores computed only on the subsetof news sentences.

Figure 1 shows that BLEU scores on the out-of-domain dataset are considerably smaller thanthe full dataset, showing that MT systems have ahigher quality on the in-domain dataset. The rela-tive scores of metrics remain mostly stable whenwe compare scores on the full test set to scores on

Page 16: Results of the WMT20 Metrics Shared Tasktems participating in the WMT parallel corpus fil-tering task (Koehn et al.,2020): Khmer and Pashto to English.2 All metrics are evaluated

703

cs-e

nde

-en

iu-e

nja

-en

km-e

npl

-en

ps-e

nru

-en

ta-e

nzh

-en

all

all-

out

all

all-

out

all

all-

out

all

all-

out

all

all

all-

out

all

all

all-

out

all

all-

out

all

all-

out

1401

894

6116

584

6185

8162

5381

1519

362

8637

0621

121

1797

935

0714

024

1102

012

789

8749

6258

653

610

SE

NT

BL

EU

0.06

80.

057

0.41

3-0

.025

0.18

20.

170

0.18

80.

061

0.22

6-0

.024

-0.0

460.

096

-0.0

05-0

.038

0.16

20.

069

0.09

30.

060

TE

R-0

.04

-0.0

60.

355

-0.1

370.

021

0.01

20.

044

-0.0

770.

125

-0.1

72-0

.196

-0.0

36-0

.117

-0.1

540.

046

-0.0

63-0

.01

-0.0

47C

HR

F+

+0.

090

0.07

50.

435

0.01

30.

246

0.25

10.

245

0.11

50.

275

0.03

40.

009

0.14

50.

054

0.01

80.

186

0.09

80.

130

0.09

6C

HR

F0.

086

0.07

20.

438

0.01

80.

254

0.26

00.

242

0.10

90.

267

0.02

80.

003

0.14

40.

049

0.01

20.

186

0.09

60.

132

0.09

8PA

RB

LE

U0.

058

0.03

80.

415

-0.0

390.

167

0.16

10.

198

0.07

40.

203

-0.0

25-0

.049

0.10

0-0

.011

-0.0

520.

159

0.06

40.

095

0.05

9PA

RC

HR

F+

+0.

096

0.08

20.

436

0.00

90.

232

0.23

50.

247

0.11

70.

267

0.02

70.

002

0.14

70.

044

0.00

70.

184

0.09

50.

132

0.09

9C

HA

RA

CT

ER

0.09

00.

087

0.44

00.

011

0.21

40.

220

0.22

10.

106

0.24

80.

023

-0.0

020.

172

0.05

70.

028

0.13

80.

078

0.12

30.

093

EE

D0.

091

0.07

80.

440

0.01

30.

256

0.25

80.

235

0.11

60.

271

0.04

50.

022

0.14

90.

053

0.01

80.

198

0.10

30.

129

0.09

3Y

ISI-

00.

072

0.06

50.

441

0.02

40.

261

0.26

30.

241

0.12

10.

268

0.03

50.

013

0.14

00.

065

0.03

00.

183

0.08

90.

127

0.09

3S

WS

S+

ME

TE

OR

−−

0.22

60.

218

0.22

80.

086

0.26

40.

011

-0.0

160.

130

0.04

80.

010

0.20

50.

120

0.13

30.

099

ME

E0.

063

0.04

50.

402

-0.0

40.

134

0.12

60.

187

0.06

40.

206

-0.0

84-0

.105

0.07

8-0

.041

-0.0

840.

114

0.03

20.

083

0.05

0Y

ISI-

10.

117

0.10

30.

468

0.05

10.

253

0.26

00.

277

0.12

80.

316

0.04

20.

023

0.14

70.

091

0.04

90.

248

0.16

20.

146

0.11

5B

ER

T-B

AS

E-L

20.

103

0.08

70.

454

0.02

60.

238

0.22

90.

263

0.12

90.

295

0.03

20.

013

0.15

90.

087

0.03

70.

223

0.13

50.

141

0.11

3B

ER

T-L

AR

GE

-L2

0.10

20.

087

0.45

60.

025

0.25

10.

249

0.26

20.

114

0.31

40.

044

0.02

70.

151

0.09

40.

047

0.24

50.

157

0.13

30.

102

MB

ER

T-L

20.

119

0.11

10.

442

0.00

10.

244

0.23

50.

251

0.12

00.

312

0.04

70.

029

0.15

10.

083

0.03

60.

227

0.13

90.

133

0.10

4B

LE

UR

T0.

126

0.11

80.

456

0.01

50.

258

0.25

60.

265

0.12

30.

327

0.05

70.

040

0.20

70.

093

0.04

60.

230

0.14

50.

137

0.10

7B

LE

UR

T-E

XT

EN

DE

D0.

127

0.11

30.

448

0.00

40.

259

0.25

90.

271

0.12

40.

330

0.04

40.

019

0.16

10.

101

0.05

70.

246

0.16

50.

137

0.10

7E

SIM

0.11

00.

103

0.45

40.

031

0.24

10.

233

0.23

90.

119

0.30

00.

058

0.04

50.

147

0.08

40.

044

0.20

80.

117

0.13

80.

108

PAR

ES

IM-1

0.10

50.

098

0.46

40.

051

0.24

90.

241

0.24

20.

121

0.29

20.

066

0.05

50.

149

0.08

90.

049

0.21

30.

123

0.13

90.

111

CO

ME

T0.

129

0.11

20.

485

0.09

00.

281

0.27

10.

274

0.12

70.

298

0.09

90.

085

0.15

80.

156

0.11

70.

241

0.16

30.

171

0.14

2C

OM

ET-

2R0.

120

0.10

70.

479

0.10

10.

257

0.25

10.

268

0.12

00.

308

0.09

80.

085

0.14

40.

148

0.11

00.

253

0.17

70.

163

0.13

6C

OM

ET-

HT

ER

0.10

30.

087

0.48

10.

088

0.19

80.

199

0.24

10.

095

0.26

90.

080

0.06

70.

116

0.13

10.

098

0.22

70.

151

0.13

50.

113

CO

ME

T-M

QM

0.10

80.

097

0.48

30.

100

0.21

50.

209

0.25

90.

112

0.28

20.

080

0.06

60.

141

0.13

70.

102

0.22

70.

158

0.14

10.

117

CO

ME

T-R

AN

K0.

099

0.09

60.

470

0.06

10.

188

0.18

10.

235

0.08

60.

228

0.07

30.

057

0.10

70.

118

0.08

20.

199

0.11

20.

142

0.11

7C

OM

ET-

QE

0.09

10.

072

0.41

00.

042

0.03

10.

020

0.15

30.

048

0.14

80.

039

0.02

90.

092

0.08

40.

049

0.16

30.

099

0.08

80.

070

OP

EN

KIW

I-B

ER

T0.

036

0.02

90.

379

0.01

3-0

.005

-0.0

090.

110

0.00

00.

168

-0.0

33-0

.043

0.07

6-0

.033

-0.0

670.

118

0.05

20.

029

0.02

0O

PE

NK

IWI-

XL

MR

0.09

30.

079

0.46

30.

074

0.05

60.

031

0.22

00.

086

0.24

40.

059

0.05

10.

106

0.09

20.

065

0.18

80.

109

0.11

50.

089

YIS

I-2

0.06

80.

054

0.41

30.

006

0.03

90.

028

0.20

40.

074

0.21

40.

048

0.04

20.

073

0.07

00.

056

0.19

90.

113

0.11

60.

084

PR

ISM

0.14

30.

135

0.47

50.

057

0.25

50.

254

0.27

20.

146

0.30

40.

109

0.09

30.

165

0.14

50.

111

0.23

70.

151

0.16

70.

138

BA

QD

YN

−−

−−

−−

−−

−0.

119

0.08

9B

AQ

STA

TIC

−−

−−

−−

−−

−0.

119

0.08

7

Tabl

e9:

Segm

ent-

leve

lmet

ric

resu

ltsfo

rto-

Eng

lish

lang

uage

pair

s:K

enda

ll’s

Tau

form

ulat

ion

ofse

gmen

t-le

velm

etri

csc

ores

with

DA

RR

scor

es.F

orla

ngua

gepa

irs

that

cont

ain

outli

ersy

stem

s,w

eal

sosh

owco

rrel

atio

naf

terd

isca

rdin

gse

gmen

tstr

ansl

ated

byou

tlier

syst

ems

Page 17: Results of the WMT20 Metrics Shared Tasktems participating in the WMT parallel corpus fil-tering task (Koehn et al.,2020): Khmer and Pashto to English.2 All metrics are evaluated

704

en-c

sen

-de

en-iu

en-j

aen

-pl

en-r

uen

-ta

en-z

hal

lal

l-ou

tal

lal

l-ou

tal

lal

l-ou

tal

lal

lal

l-ou

tal

lal

lal

l-ou

tal

l

2112

110

283

9339

4637

1315

954

9012

830

1768

993

1683

3090

8736

9512

652

SE

NT

BL

EU

0.43

20.

194

0.30

30.

155

0.20

6-0

.084

0.47

90.

153

0.06

70.

051

0.39

80.

206

0.39

6T

ER

0.31

70.

067

0.18

20.

044

-0.0

71-0

.337

-0.5

910.

003

-0.0

94-0

.121

0.20

30.

019

-0.3

6C

HR

F+

+0.

478

0.22

80.

367

0.21

50.

338

0.07

50.

506

0.25

50.

154

0.15

60.

579

0.34

90.

388

CH

RF

0.47

20.

229

0.37

90.

224

0.34

40.

095

0.50

60.

250

0.15

00.

153

0.58

90.

359

0.40

0PA

RB

LE

U0.

460

0.22

60.

299

0.13

60.

212

-0.0

510.

052

0.18

30.

088

0.06

20.

340

0.17

80.

356

PAR

CH

RF+

+0.

492

0.25

30.

355

0.19

2−

0.52

70.

272

0.16

70.

176

−0.

398

CH

AR

AC

TE

R0.

413

0.19

50.

311

0.17

90.

309

0.10

80.

471

0.19

80.

107

0.14

30.

525

0.27

00.

339

EE

D0.

458

0.21

00.

363

0.20

30.

361

0.10

90.

515

0.24

80.

151

0.15

50.

587

0.34

20.

393

YIS

I-0

0.43

20.

191

0.34

90.

212

0.36

20.

101

0.48

40.

233

0.13

20.

151

0.54

70.

336

0.31

9M

EE

0.41

10.

157

0.28

90.

128

-0.0

74-0

.272

−0.

125

0.02

50.

027

0.37

30.

168

−Y

ISI-

10.

550

0.32

00.

427

0.26

30.

251

0.08

20.

568

0.34

90.

209

0.25

60.

669

0.44

00.

463

YIS

I-C

OM

BI

−0.

399

0.22

4−

−−

−−

−B

LE

UR

T-C

OM

BI

−0.

399

0.22

4−

−−

−−

−M

BE

RT-

L2

0.56

70.

359

0.36

10.

202

−0.

541

0.35

00.

212

0.24

60.

587

0.33

40.

432

BL

EU

RT-

EX

TE

ND

ED

0.68

90.

517

0.44

70.

278

0.35

90.

112

0.53

30.

430

0.27

10.

305

0.64

30.

419

0.46

0E

SIM

0.46

90.

253

0.34

70.

195

0.12

2-0

.018

0.52

20.

312

0.20

30.

224

0.59

90.

363

0.39

1PA

RE

SIM

-10.

475

0.25

70.

343

0.19

70.

122

-0.0

180.

510

0.32

40.

209

0.23

00.

599

0.36

30.

396

CO

ME

T0.

668

0.48

70.

468

0.32

40.

322

0.07

80.

624

0.46

20.

316

0.34

40.

671

0.45

70.

432

CO

ME

T-2R

0.66

90.

512

0.46

30.

321

0.32

60.

078

0.63

00.

445

0.29

40.

343

0.67

60.

463

0.43

4C

OM

ET-

HT

ER

0.66

50.

500

0.44

00.

303

0.33

10.

088

0.60

10.

427

0.27

40.

292

0.64

00.

411

0.41

1C

OM

ET-

MQ

M0.

666

0.49

00.

423

0.27

50.

313

0.07

80.

588

0.42

40.

271

0.28

10.

635

0.41

30.

388

CO

ME

T-R

AN

K0.

629

0.40

80.

379

0.21

70.

297

0.09

70.

569

0.38

80.

207

0.22

90.

588

0.34

20.

380

CO

ME

T-Q

E0.

614

0.47

00.

347

0.23

3-0

.04

-0.0

510.

470

0.36

00.

211

0.26

40.

514

0.32

00.

346

OP

EN

KIW

I-B

ER

T0.

262

0.14

20.

168

0.05

8-0

.115

-0.2

33-0

.529

0.15

30.

035

0.16

40.

169

0.02

20.

077

OP

EN

KIW

I-X

LM

R0.

607

0.41

70.

369

0.22

40.

060

0.00

90.

553

0.34

70.

189

0.27

90.

604

0.35

40.

377

YIS

I-2

0.18

70.

104

0.29

60.

171

0.14

60.

073

0.38

30.

115

0.05

20.

146

0.54

50.

332

0.15

2P

RIS

M0.

619

0.41

40.

447

0.28

00.

452

0.19

50.

579

0.41

40.

274

0.28

30.

448

0.21

10.

397

BA

QD

YN

−−

−−

−−

−0.

351

BA

QS

TAT

IC−

−−

−−

−−

0.34

4E

QD

YN

−−

−−

−−

−0.

356

EQ

STA

TIC

−−

−−

−−

−0.

409

Tabl

e10

:Se

gmen

t-le

velm

etri

cre

sults

foro

ut-o

f-E

nglis

hla

ngua

gepa

irs:

Ken

dall’

sTa

ufo

rmul

atio

nof

segm

ent-

leve

lmet

ric

scor

esw

ithD

AR

Rsc

ores

;For

lang

uage

pair

sth

atco

ntai

nou

tlier

syst

ems,

we

also

show

corr

elat

ion

afte

rdis

card

ing

segm

ents

tran

slat

edby

outli

ersy

stem

s

Page 18: Results of the WMT20 Metrics Shared Tasktems participating in the WMT parallel corpus fil-tering task (Koehn et al.,2020): Khmer and Pashto to English.2 All metrics are evaluated

705

cs-e

nde

-en

iu-e

nja

-en

pl-e

nru

-en

ta-e

nzh

-en

all

all-

out

all

all-

out

all

all-

out

all

all-

out

all

all-

out

all

all-

out

all

all-

out

all

all-

out

1424

955

1866

495

3624

790

311

635

529

753

581

684

440

3085

2618

SE

NT

BL

EU

0.10

40.

058

0.60

1-0

.055

0.61

10.

417

0.41

30.

125

0.09

60.

059

0.11

30.

026

0.33

00.

150

0.21

10.

153

TE

R0.

115

0.06

80.

621

-0.0

020.

611

0.50

00.

370

0.04

80.

024

-0.0

590.

089

-0.0

090.

383

0.21

40.

197

0.14

1C

HR

F+

+0.

135

0.11

00.

624

0.00

60.

500

0.33

30.

435

0.15

80.

071

0.03

60.

169

0.08

80.

395

0.20

90.

199

0.13

9C

HR

F0.

126

0.09

10.

626

0.00

20.

611

0.41

70.

453

0.20

90.

065

0.02

10.

195

0.11

20.

395

0.20

00.

209

0.15

4PA

RB

LE

U0.

100

0.04

50.

630

-0.0

020.

556

0.41

70.

428

0.13

80.

065

0.03

20.

086

-0.0

190.

368

0.17

70.

201

0.14

3PA

RC

HR

F+

+0.

117

0.08

10.

642

0.04

20.

611

0.41

70.

438

0.16

40.

087

0.04

00.

171

0.09

50.

412

0.23

20.

203

0.14

6C

HA

RA

CT

ER

0.05

90.

049

0.64

60.

079

0.50

00.

250

0.41

00.

145

0.09

00.

051

0.18

70.

122

0.37

10.

196

0.21

90.

166

EE

D0.

105

0.06

40.

633

0.00

60.

722

0.58

30.

430

0.12

50.

080

0.01

70.

174

0.08

80.

395

0.20

00.

206

0.14

8Y

ISI-

00.

052

0.02

20.

616

0.00

60.

556

0.33

30.

425

0.12

50.

071

0.03

60.

187

0.09

80.

409

0.22

30.

196

0.13

9S

WS

S+

ME

TE

OR

−−

0.72

20.

583

0.37

70.

029

0.10

90.

047

0.21

10.

129

0.44

70.

291

0.20

10.

141

ME

E0.

126

0.11

40.

618

-0.0

060.

444

0.25

00.

438

0.19

00.

014

-0.0

130.

137

0.05

30.

398

0.24

50.

198

0.14

0Y

ISI-

10.

136

0.11

40.

640

0.03

40.

667

0.50

00.

420

0.11

90.

109

0.06

20.

150

0.03

30.

450

0.30

00.

210

0.15

3B

ER

T-B

AS

E-L

20.

164

0.13

90.

654

0.07

50.

778

0.66

70.

430

0.15

10.

046

-0.0

130.

179

0.06

40.

398

0.22

30.

206

0.14

9B

ER

T-L

AR

GE

-L2

0.13

10.

091

0.64

20.

030

0.72

20.

583

0.41

80.

119

0.02

7-0

.028

0.19

50.

084

0.43

90.

291

0.18

50.

124

MB

ER

T-L

20.

149

0.11

80.

621

-0.0

060.

833

0.75

00.

433

0.15

80.

033

-0.0

360.

232

0.12

60.

418

0.25

90.

216

0.16

2B

LE

UR

T0.

154

0.12

50.

641

0.03

80.

667

0.50

00.

420

0.10

00.

039

-0.0

090.

227

0.11

50.

418

0.25

90.

197

0.14

1B

LE

UR

T-E

XT

EN

DE

D0.

140

0.11

40.

633

0.01

40.

833

0.75

00.

430

0.11

30.

077

0.00

60.

243

0.14

30.

412

0.24

50.

198

0.14

1E

SIM

0.13

50.

110

0.67

00.

164

0.72

20.

583

0.40

00.

087

0.03

9-0

.017

0.17

40.

064

0.40

40.

236

0.20

30.

148

PAR

ES

IM-1

0.11

90.

093

0.67

00.

156

0.72

20.

583

0.39

20.

055

0.03

3-0

.021

0.17

10.

060

0.40

10.

232

0.20

80.

154

CO

ME

T0.

142

0.11

40.

626

-0.0

180.

667

0.50

00.

392

0.06

10.

112

0.07

00.

193

0.08

80.

395

0.21

80.

206

0.15

1C

OM

ET-

2R0.

138

0.11

60.

614

0.03

00.

778

0.66

70.

413

0.09

30.

090

0.04

70.

227

0.13

60.

404

0.23

20.

214

0.15

8C

OM

ET-

HT

ER

0.16

00.

133

0.63

80.

042

0.55

60.

333

0.41

50.

138

0.08

30.

040

0.16

90.

084

0.35

40.

191

0.15

00.

105

CO

ME

T-M

QM

0.14

00.

114

0.64

50.

075

0.61

10.

417

0.41

00.

119

0.08

00.

043

0.16

30.

081

0.38

60.

241

0.16

10.

117

CO

ME

T-R

AN

K0.

139

0.13

10.

615

-0.0

260.

667

0.50

00.

365

0.03

50.

112

0.09

60.

185

0.07

40.

325

0.15

40.

199

0.14

7C

OM

ET-

QE

0.09

10.

060

0.63

60.

042

0.38

90.

250

0.32

90.

023

-0.0

02-0

.028

0.15

30.

060

0.30

10.

127

0.16

90.

118

OP

EN

KIW

I-B

ER

T0.

087

0.06

40.

628

0.04

60.

444

0.25

00.

322

0.11

30.

096

0.07

70.

137

0.05

00.

281

0.14

50.

113

0.07

9O

PE

NK

IWI-

XL

MR

0.13

30.

114

0.61

30.

010

0.55

60.

500

0.41

80.

145

0.05

50.

017

0.15

50.

060

0.38

90.

227

0.18

70.

135

YIS

I-2

0.08

30.

072

0.54

7-0

.075

0.27

80.

250

0.38

50.

055

0.11

80.

153

0.24

80.

195

0.38

30.

196

0.19

90.

139

PR

ISM

0.16

90.

156

0.63

6-0

.002

0.66

70.

500

0.42

00.

119

0.10

20.

059

0.21

10.

102

0.40

60.

236

0.19

50.

138

BA

QD

YN

−−

−−

−−

−0.

223

0.17

2B

AQ

STA

TIC

−−

−−

−−

−0.

214

0.16

0

Tabl

e11

:D

ocum

ent-

leve

lmet

ric

resu

ltsfo

rto

-Eng

lish

lang

uage

pair

s:K

enda

ll’s

Tau

form

ulat

ion

ofse

gmen

t-le

velm

etri

csc

ores

with

DA

docu

men

t-le

velm

etri

csc

ores

with

DO

C-D

AR

Rju

dgem

ents

.For

lang

uage

pair

sth

atco

ntai

nou

tlier

syst

ems,

we

also

show

corr

elat

ion

afte

rdis

card

ing

docu

men

tstr

ansl

ated

byou

tlier

syst

ems.

Page 19: Results of the WMT20 Metrics Shared Tasktems participating in the WMT parallel corpus fil-tering task (Koehn et al.,2020): Khmer and Pashto to English.2 All metrics are evaluated

706

en-c

sen

-de

en-iu

en-j

aen

-pl

en-r

uen

-ta

en-z

hal

lal

l-ou

tal

lal

l-ou

tal

lal

l-ou

tal

lal

lal

l-ou

tal

lal

lal

l-ou

tal

l

1442

572

729

312

203

4846

967

725

438

738

999

651

SE

NT

BL

EU

0.68

00.

273

0.55

00.

359

0.59

6-0

.25

0.80

80.

510

0.15

00.

287

0.79

90.

596

0.59

8T

ER

0.69

10.

294

0.51

70.

308

0.56

7-0

.292

-0.0

70.

439

0.09

40.

178

0.74

80.

616

0.11

8C

HR

F+

+0.

692

0.29

40.

583

0.37

20.

547

-0.4

170.

829

0.53

60.

165

0.33

90.

866

0.59

60.

579

CH

RF

0.68

80.

290

0.59

70.

397

0.57

6-0

.333

0.83

80.

524

0.16

50.

307

0.87

20.

616

0.59

1PA

RB

LE

U0.

727

0.38

10.

528

0.26

90.

576

-0.2

080.

565

0.56

90.

236

0.26

60.

805

0.59

60.

625

PAR

CH

RF+

+0.

717

0.36

40.

575

0.34

0−

0.82

50.

560

0.20

50.

307

−0.

650

CH

AR

AC

TE

R0.

656

0.22

40.

520

0.26

30.

547

-0.2

920.

842

0.44

80.

047

0.32

80.

872

0.65

70.

613

EE

D0.

678

0.28

00.

569

0.34

00.

596

-0.2

920.

834

0.55

40.

228

0.27

70.

856

0.55

60.

588

YIS

I-0

0.65

30.

245

0.55

30.

327

0.64

5-0

.125

0.82

10.

554

0.22

80.

318

0.83

00.

515

0.44

1M

EE

0.71

40.

357

0.55

80.

314

0.52

7-0

.375

−0.

489

0.07

10.

307

0.80

50.

495

−Y

ISI-

10.

763

0.46

20.

605

0.35

90.

616

-0.0

830.

855

0.66

30.

339

0.34

90.

882

0.65

70.

690

YIS

I-C

OM

BI

−0.

594

0.35

3−

−−

−−

−B

LE

UR

T-C

OM

BI

−0.

594

0.35

3−

−−

−−

−M

BE

RT-

L2

0.78

10.

517

0.58

00.

308

−0.

842

0.64

80.

299

0.43

10.

861

0.59

60.

693

BL

EU

RT-

EX

TE

ND

ED

0.84

70.

664

0.63

50.

378

0.63

5-0

.167

0.85

10.

740

0.48

80.

437

0.85

60.

636

0.70

8E

SIM

0.72

00.

392

0.53

40.

237

0.50

7-0

.25

0.85

50.

616

0.31

50.

354

0.83

60.

596

0.67

4PA

RE

SIM

-10.

741

0.44

10.

528

0.23

70.

507

-0.2

50.

829

0.62

80.

331

0.36

40.

836

0.59

60.

662

CO

ME

T0.

845

0.66

40.

632

0.40

40.

547

-0.1

250.

847

0.68

70.

346

0.35

90.

897

0.75

80.

561

CO

ME

T-2R

0.85

90.

699

0.61

30.

391

0.59

6-0

.25

0.86

80.

725

0.40

90.

375

0.86

60.

636

0.55

8C

OM

ET-

HT

ER

0.84

90.

675

0.61

60.

410

0.60

6-0

.042

0.85

50.

669

0.30

70.

287

0.85

60.

657

0.50

2C

OM

ET-

MQ

M0.

849

0.67

50.

594

0.37

20.

567

-0.1

250.

817

0.67

80.

291

0.36

40.

836

0.65

70.

472

CO

ME

T-R

AN

K0.

803

0.57

30.

572

0.32

10.

586

-0.3

750.

812

0.60

70.

165

0.30

70.

841

0.59

60.

472

CO

ME

T-Q

E0.

839

0.65

00.

514

0.25

00.

212

-0.1

670.

812

0.61

90.

142

0.29

70.

820

0.65

70.

469

OP

EN

KIW

I-B

ER

T0.

655

0.39

90.

443

0.14

70.

488

0.08

30.

139

0.42

70.

024

0.34

40.

584

0.11

10.

459

OP

EN

KIW

I-X

LM

R0.

821

0.62

20.

589

0.31

40.

527

0.16

70.

859

0.59

20.

094

0.32

80.

856

0.59

60.

524

YIS

I-2

0.25

50.

105

0.41

60.

224

0.52

70.

333

0.68

00.

117

-0.1

180.

209

0.80

50.

434

0.22

3P

RIS

M0.

792

0.54

50.

594

0.37

80.

665

-0.0

420.

829

0.63

40.

283

0.32

80.

733

0.37

40.

511

BA

QD

YN

−−

−−

−−

−0.

567

BA

QS

TAT

IC−

−−

−−

−−

0.61

9E

QD

YN

−−

−−

−−

−0.

613

EQ

STA

TIC

−−

−−

−−

−0.

644

Tabl

e12

:D

ocum

ent-

leve

lm

etri

cre

sults

for

out-

of-E

nglis

hla

ngua

gepa

irs:

Ken

dall’

sTa

ufo

rmul

atio

nof

docu

men

t-le

vel

met

ric

scor

esw

ithD

OC

-DA

RR

judg

emen

ts.

For

lang

uage

pair

sth

atco

ntai

nou

tlier

syst

ems,

we

also

show

corr

elat

ion

afte

rdis

card

ing

docu

men

tstr

ansl

ated

byou

tlier

syst

ems

Page 20: Results of the WMT20 Metrics Shared Tasktems participating in the WMT parallel corpus fil-tering task (Koehn et al.,2020): Khmer and Pashto to English.2 All metrics are evaluated

707

0.4 0.2 0.0 0.2 0.4

5

10

15

20

25

BLEU

fullnews

Figure 1: English → Inuktitut: Human vs. BLEUscores on the full dataset vs. the news subset. Onlythe news subset was included in the human evaluation.Each dot corresponds to an MT system, the outlier onthe top-right is UQAM TANLE.

only the news subset that was evaluated. The mainexception is UQAM TANLE; BLEU scores are re-ally high on the out-of-domain data, and increasevery little when computed on the full dataset. Whenlooking at correlations with human scores (Table 7),we expected correlations to increase when com-puted over the news subset. This is true for mostmetrics such as COMET-QE, but the correlationstays the same or actually decreases for other met-rics like PARBLEU.

5.1.2 Scoring Human TranslationsThe alternate reference was included in the man-ual evaluation for German→ English, Russian→English and Chinese→ English. All human refer-ences were included in the out-of-English manualevaluation.7

German → English: HUMAN-B was rankedthird in the manual evaluation. The lexical met-rics (BLEU, CHRF, CHARACTER, EED, MEE,YISI-0, CHRF++, PARBLEU, PARCHRF) give ex-tremely low scores to the HUMAN-B reference.This is also true for PRISM and all the reference-free metrics except COMET-QE. The neural met-rics also give low scores to the human reference,however, the margin of error is much smaller.

7Findings 2020 in the official tables label the alternatereference in into-English direction simply as HUMAN. The“first” reference, which serves as the primary reference forus, was not scored manually in DA into English. Out ofEnglish, the primary reference for us is labelled HUMAN-Ain Findings.

COMET-QE is the only metric that gives highscores to the HUMAN-B reference.

Appendix B also shows the scatterplot “new-stestB2020” where HUMAN-B served as the refer-ence for the metrics. We see some differences inthe vertical axis but the general picture remains thesame even with this fairly different human transla-tion.

Russian→ English: The HUMAN-B referencewas ranked after 6 MT systems in the manual evalu-ation but still within the same cluster, so not signif-icantly distinguishable. Lexical metrics give rela-tively low scores to HUMAN-B. The neural metricsgive relatively higher scores, but score it aboveOnline-A and below ariel197197, i.e. differentlythan DA judgements.

Chinese→ English: The Human translation isranked 12th in the manual evaluation (in a giantcluster which puts together all but one top and onebottom system), and most metrics place it moreor less correctly. Many metrics, including lexicalmetrics, still have correlations above 0.9 even afterincluding the Human translation.

English→German: According to the WMT hu-man evaluation, the HUMAN-B reference receivesthe highest scores, the HUMAN-A reference isranked fourth and Human-P, which was generatedby linguists paraphrasing the WMT references, isranked lower at 10th place. Each human referencefalls into a separate cluster of significance.

Lexical metrics score around 10 MT systemsabove each WMT reference (using the other WMThuman translation as reference). COMET-QE andsome neural metrics (BLEURT, COMET-MQM,COMET-HTER and MBERT-L2) score HUMAN-A and HUMAN-B as better than all MT systems.

When using either of the WMT references, mostmetrics, including all the lexical metrics, score theparaphrased reference much lower than the rest ofthe systems. The COMET family of metrics andBLEURT-EXTENDED are the only metrics thatare able to recognise the merit of the paraphrasedreferences.

When using the paraphrased references, allreference-based metrics score the two human trans-lations above all MT systems, often by a largemargin. PRISM is the sole exception; it scoresthe HUMAN-B reference about half way betweenthe MT systems. Interestingly, most of these met-rics score HUMAN-A above HUMAN-B, i.e. dis-

Page 21: Results of the WMT20 Metrics Shared Tasktems participating in the WMT parallel corpus fil-tering task (Koehn et al.,2020): Khmer and Pashto to English.2 All metrics are evaluated

708

agreeing with DA judgements. Metric correlationswhen including HUMAN-A system drop dramati-cally when using the alternate WMT reference, butthe correlations are higher with the paraphrased ref-erence. This also holds when scoring HUMAN-Busing the paraphrased vs the main WMT reference(Table 7).

Of the reference-free metrics, COMET-QEscores the two WMT references above all MT sys-tems, and ranks the paraphrased reference similarto its rank in the manual evaluation. OPENKIWI-BERT and OPENKIWI-XLMR are a little bi-ased against these human translations, and YISI-2scores all human translations below all MT sys-tems.

English → Chinese The manual evaluationranks the two Human translations above all MTsystems, but most metrics give these much lowscores.

To summarize, we see that the current MT met-rics generally struggle to score human translationsagainst machine translations reliably. Rare excep-tions include primarily trained neural metrics andreference-less COMET-QE. While the metrics arenot really prepared to score human translations,we find this type of test relevant as more and morelanguage pairs are getting closer to the human trans-lation benchmark. A general-enough metric shouldbe thus able to score human translation comparablyand not rely on some idiosyncratic properties ofMT outputs. We hope that human translations willbe included in WMT DA scoring in the upcomingyears, too.

5.1.3 Influence of OutliersThere are no outlier systems for some language-pairs like Khmer→ English and English → Rus-sian. For others, we have systems whose score isfar away from the scores of the rest of the systems.As these outliers have a large influence on Pear-son correlation, computing the correlation withoutoutliers typically makes the task harder for metricsand results in a decrease in correlation.

For example, we identify three outliers in theGerman → English set; the quality of the lastsystem is extremely low compared to rest. Allreference-based metrics have high correlationswhen including all systems, but correlations dropwhen discarding outliers. In particular, CHRF andPARESIM both had a correlation of 0.95 whencomputed over all systems, but this drops to 0.69

and 0.83 respectively after removing outliers, re-vealing that PARESIM is more reliable with thislanguage pair. An even larger drop is observed forCHRF and CHRF++ in English→ Czech, from 0.8to 0.3. We find this particularly surprising becauseCHRF has always performed well on this languagepair, including in the evaluation on the graduallyreducing set of top N systems, i.e. in harder andharder conditions, see SACREBLEU-CHRF in Ap-pendix A.4 of Ma et al. (2019).

In some cases, metrics can be inaccurate whenscoring outliers, resulting in an increased corre-lation when correlation is recomputed over non-outlier systems. For example, with Chinese→ English, the score of WMTBIOMEDBASE-LINE score is much lower than the next system.Most metrics correctly rank it last as well, butCOMET-HTER, COMET-MQM, COMET-QE andOPENKIWI-BERT give it a higher score than thenext system(s). Note that the other metrics all havea correlation of above 0.9 even after removing theoutlier.

In other cases, removing outliers decreases thecorrelation of a metric and yet it helps its finaloutcome. For instance SENTBLEU averaged overall sentences becomes one of the “winners” in thesystem-level evaluation of translation into English(Table 5). If we trust the results without outliersmore, using averaged sentBLEU seems better thanusing plain old BLEU and not significantly worsethan any other metric going from English into sev-eral target languages.

For some language-pairs, we override the de-cisions made by the outlier detection algorithm,based on whether we believe including or remov-ing these systems from consideration would havean impact on the correlations: For example, withTamil→ English, the last two systems are not clas-sified as outliers by the algorithm, but their humanscores is some distance away from the rest of thesystems. CHRF, CHRF++ and PARCHRF++ are theonly metrics that correctly order these two systems.OPENKIWI-BERT and OPENKIWI-XLMR bothget these two systems wrong with a large margin.But for all metrics, removing these systems leadsto a significant drop in correlation. Thus we countthese two systems as outliers.

Another example is Japanese → English. Forthis language-pair, we have two clusters of 7 and3 systems. Metrics have high correlations whenconsidering all systems, but when looking at MT

Page 22: Results of the WMT20 Metrics Shared Tasktems participating in the WMT parallel corpus fil-tering task (Koehn et al.,2020): Khmer and Pashto to English.2 All metrics are evaluated

709

systems within individual clusters, there are dis-crepancies between the metric scores comparedto human scores. The outlier detection algorithmflags only the last two systems as outliers, but thepresence of the third system has a disproportion-ate impact on the correlation. We include all threesystems in the set of outliers.

The influence of references For all languagepairs where multiple references were available, thecorrelations are typically very close whether us-ing the primary reference or the alternate reference.For metrics where we do see a difference, there isno consistent pattern whether metrics prefer onereference or the other. We note that although thechange in correlations is small when comparingacross reference sets, the set of “winners” accord-ing to the William’s test for statistical significanceis not stable, particularly for English→ German.When combining references, in most cases, the cor-relation with multiple references lies between thecorrelation of the individual references. For exam-ple, with English→German, BLEU correlates bestwith the secondary reference with a correlation of0.844. But with multiple references, the correlationis 0.825, just above the correlation with the primaryreference with is 0.822 (Table 8).

There are a few exceptions where there is a smallincrease in metric correlation above both individualreferences. For example, the correlation of CHAR-ACTER with German → English increases from0.687 and 0.696 with a single reference to 0.713with both references ( Table 8). But there are nometrics which consistently show an improvementwith multiple references across multiple languagepairs.

5.1.4 Neural vs. Lexical MetricsFor many language pairs, when we look at cor-relation clustering of the reference-based metricsbased on their system-level scores, we end up withtwo major clusters: neural metrics and lexical met-rics. We have seen that lexical and neural metricsdiffer in how they score the human translations.For English→ German, all lexical metrics have aslightly higher correlation than any neural metricwhen evaluating MT systems. However, these met-rics make major errors evaluating the HUMAN-Atranslations with the HUMAN-B reference.

We also see such differences with some MT sys-tems. Selected examples:

• English→ Czech: All lexical metrics includ-

ing BLEU and CHRF are very biased towardsONLINE-B, with metric scores indicating thatthis system is better than all others by a largemargin. It is ranked 7th in the human evalua-tion. Neural metrics and reference-free met-rics are more or less correct when scoring thissystem. Surprisingly, ESIM is an exception tothis, and also ranks ONLINE-B on top.

• Polish→ English: Lexical metrics like BLEUgive very low scores to ONLINE-G.

• Tamil → English: Lexical metrics con-sistently score ONLINE-Z above MI-CROSOFT STC INDIA, but the remainingmetrics including the reference-free metricsrank them in the opposite order. The humanevaluation agrees with the lexical metrics.

• Khmer→ English: lexical metrics score thebest system lower than the next two, whereasmost neural metrics get the order of the topsystems right.

5.1.5 Other Discrepancies between Metricand Human Scores

Here we briefly draw attention to particularities wespotted when manually examining the results.

• German→ English: All metrics score Tohoku-AIP-NTT higher than OPPO, and UEDINhigher than PROMT NMT.

• Russian → English: ONLINE-A, which isranked 2nd in the human evaluation, receiveslow metric scores. In contrast, some met-rics including BLEU and PARBLEU chooseARIEL197197, which is ranked 6th in the hu-man evaluation, as the best system.

• Tamil → English: The highest ranked sys-tem according to human scores, GTCOM, re-ceives lower metric scores than the next threeto six systems. Metrics are biased towardsONLINE-A and against ONLINE-Z.

• Chinese→ English: HUOSHAN TRANSLATE

is a clear winner according to human evalua-tion, but BLEU ranks it lower than the next 3systems. The different between human scoresfor the next 8 systems is not statistically sig-nificant where metric ordering of the systemsdifferently than human scores and these dis-crepancies aren’t penalised harshly by Pear-son correlation.

Page 23: Results of the WMT20 Metrics Shared Tasktems participating in the WMT parallel corpus fil-tering task (Koehn et al.,2020): Khmer and Pashto to English.2 All metrics are evaluated

710

• English→ Chinese: HUOSHAN TRANSLATE

is a clear winner according to human eval-uation, but BLEU ranks it lower than thenext 3 systems. The different between hu-man scores for the next 8 systems is not sta-tistically significant where metric ordering ofthe systems differently than human scores andthese discrepancies aren’t penalised harshlyby Pearson correlation. While many metricsincluding BLEU have high correlations, oth-ers make major errors scoring the NIUTRANS.OPENKIWI-BERT assigns really low scoresto

Overall, we note that these metric-human discrep-ancies often feature online systems which are prob-ably more diverse that the MT system submissionsto the WMT shared tasks.

5.1.6 Pearson vs. Kendall TauOverall, we found that Pearson correlation doesn’talways give us the complete picture. In particular,outliers have a large influence on the correlationand can mask the presence of discrepancies be-tween metric and human scores with the rest of thesystems. But making a decision on which systemsto discard is not easy.

In this paper, we also explore Kendall’s Tau as analternative to Pearson correlation. Tables 16 and 17in the Appendix show Kendall Tau correlation ofmetrics over all MT systems (not including humantranslations).

Kendall’s Tau is less sensitive to outliers, anddirectly measures whether metrics agree with hu-mans when comparing pairs of systems. However,Kendall’s Tau doesn’t consider the differences inscores, and two metrics whose errors differ in mag-nitude can have the same Kendall’s Tau correlation(Figure 2).

5.2 Segment and Document-Level ResultsOn the more fine-grained evaluation scales, PRISMand the trained neural metrics (the COMET andBLEURT family of metrics) have a better agree-ment with human judgements than lexical metrics

The correlations of the to-English language pairsare consistently much lower, on average, comparedto that of the out-of-English language pairs. Thedifference could be due to the differing set of anno-tators: the to-English human evaluation was crowd-sourced and therefore is likely to be noisier.

Finally, we find that correlations drop markedlyfor most language pairs if we consider only the

0.2 0.0 0.20.4

0.3

0.2

0.1

0.0

0.1

0.2

OK­B: r = 0.73; tau = 0.62

0.2 0.0 0.2

0.50

0.48

0.46

0.44

0.42

0.40 EED: r = 0.97; tau = 0.62

Figure 2: Scatterplots of human scores against two met-rics that have the same Kendall Tau correlation withhuman scores, though OPENKIWI-BERT has bigger er-rors.

segment/document pairs that do not contain outliersystems. We suspect that as the quality of outliersystem translations is typically low, and most of thegenerated better-worse pairs that contain outlierscan be easy for metrics. Removing these pairswould make the task a lot harder. It is also verylikely that the remaining pairs of translations arenoisier, which decreases metric agreement withthese pairwise judgements.

The document-level correlations are typicallyhigher than segment-level correlations. This couldbe due to reduced noise in human scores when aver-aging the scores of multiple segments. Computingmetric scores over documents that contain multiplesegments also helps reduce metric noise.

5.3 Reference-Based Metrics vs.Reference-Free Metrics

We have four submissions of metrics that directlycompare MT outputs with the source segment:COMET-QE, OPENKIWI-BERT, OPENKIWI-XLMR, and YISI-2. Other members of theCOMET family of metrics use information fromboth the source and reference. The remaining met-rics compute scores by comparing the MT outputwith the reference.

While the task of comparing segments in differ-ent languages is harder than comparing segments inthe same language, reference-free metrics have oneadvantage: they are not encumbered by reference-bias. COMET-QE is the only metric that correctlygives a high score to the human translation in Ger-man→ English , and one of the few metrics thatdoes so for English→ Chinese.

This year, the reference-free metrics are highlycompetitive with reference-based metrics for alllanguage-pairs. For English → Tamil, COMET-

Page 24: Results of the WMT20 Metrics Shared Tasktems participating in the WMT parallel corpus fil-tering task (Koehn et al.,2020): Khmer and Pashto to English.2 All metrics are evaluated

711

QE which has a near perfect correlation of 0.97even after discarding outliers. In contrast, manyreference-based metrics including BLEU and chrFgive really high scores to ONLINE-B, which resultsin low correlations.

6 Use Automatic Metrics to DetectIncorrect Human Preference

It has been argued that non-expert translators lackknowledge of translation and so might not noticesubtle differences that make one translation betterthan another. Castilho et al. (2017) compared theevaluation of MT output of professional translatorsagainst crowd workers. Results showed that forall language pairs, the crowd workers tend to bemore accepting of the MT output by giving higherfluency and adequacy scores. Toral et al. (2018)showed that the ratings acquired by professionaltranslators show a wider gap between human andmachine translations compared to judgments bynon-experts. They recommend using professionallinguists for MT evaluation going forward. Laubliet al. (2020) show that non-experts assess paritybetween human and machine translation whereprofessional translators do not, indicating that theformer neglect more subtle differences betweendifferent translation outputs. Given the previouswork and the fact that the WMT human evalua-tion has been conducted with a mix of researchersand crowd workers, we rerun human evaluationfor a subset of the submissions with professionallinguists. In particular, we want to investigate ifwe can use the quality scores obtained by the au-tomatic metrics to detect incorrect human ratings.We filtered out all pairs of systems where the hu-man evaluation results disagree with all automaticmetrics. Taking the metric scores as a signal, wererun human evaluation for a subset of submis-sions for 2 language pairs: German→English andEnglish→German. We hired 10 professional lin-guists, who rerun the source-based direct assess-ment human evaluation with the same document-based template that has been used for the originalWMT ratings.

6.1 German→English

For German→English, we found that all automaticmetrics disagree with the human evaluation resultsfor OPPO and TOHOKU. OPPO yields a higherhuman rating, while all automatic metrics gave TO-HOKU a higher score. To investigate which of the

results to trust, we rerun the source-based directassessment for these 2 systems with professionallinguists. The results in Table 13 show that profes-sional linguists in fact prefer the output of TOHOKU

as predicted by all automatic metrics.

Evaluation OPPO TOHOKU

avg metric8.85 8.95

(HUMAN-A ref)avg metric

10.15 10.26(Human-B ref)WMT 84.6 81.5z-score 0.220 0.179

prof. linguist 81.0 81.7z-score -0.005 0.010

Table 13: WMT 2020 German→English comparingthe reference-based ratings acquired with crowd work-ers/researcher (WMT) against source-based ratings ac-quired with professional linguists.

6.2 English→GermanFor English→German, we rerun human evaluationfor the top 2 ranked MT systems (based on hu-man evaluation): OPPO, TOHOKU and the humantranslation HUMAN-A. The quality of human trans-lations is usually underestimated by automatic met-rics when computed with standard references. Thisis also visible in this year’s evaluation campaignwhere the average metric scores of all submissionfor the human translation HUMAN-A is much lowerwhen compared to the top MT submissions. Toovercome this problem, Freitag et al. (2020) intro-duced paraphrased references that also value thetranslation quality of human translations and al-ternative (less simple/monotonic) MT output. Aswe can see in Table 14, the average metric scoresof all submissions when computed with the para-phrased references HUMAN-P yield a much higherscore for the human translation HUMAN-A whencompared to all MT outputs.

The official WMT human evaluation ranked thehuman translation third, right behind the two MToutputs from OPPO and TOHOKU. Interestingly,based on the z-scores, WMT predicts OPPO to beof higher quality than TOHOKU which is in dis-agreement with most of the metric scores whencalculated against both types of reference transla-tions. Overall, the automatic metrics come to avery different ranking than the human evaluationfor the top performing submissions.

Page 25: Results of the WMT20 Metrics Shared Tasktems participating in the WMT parallel corpus fil-tering task (Koehn et al.,2020): Khmer and Pashto to English.2 All metrics are evaluated

712

Evaluation OPPO TOHOKU HUMAN-A

avg metric10.05 10.09 9.14

(Human-B ref)avg metric

11.93 12.07 15.74(Human-P ref)WMT 87.39 88.62 85.10z-score 0.495 0.468 0.379

prof. linguist 73.66 74.70 84.09z-score -0.051 -0.037 0.088

Table 14: WMT 2020 English→German comparingthe source-based ratings acquired with crowd work-ers/researcher (WMT) against source-based ratings ac-quired with professional linguists.

We rerun the human evaluation with the sametemplate, but with professional linguists. Inter-estingly, the human translation has been rankedfirst by a large margin. Furthermore, the MT out-put of TOHOKU has been rated as higher qualitywhen compared to the MT output from OPPO.The results of the human evaluation with profes-sional linguists yield a perfect correlation to themetric scores calculated with the paraphrased ref-erence. This indicates not only the advantages ofparaphrased references when scoring human trans-lations, but also that automatic metrics can be usedto identify incorrect human ratings.

7 Conclusion

This paper summarizes the results of WMT20shared task in machine translation evaluation, theMetrics Shared Task. Participating metrics wereevaluated in terms of their correlation with humanjudgement at the level of the whole test set (system-level evaluation), as well as at a more fine-grainedlevel (document-level evaluation and sentences orparagraphs for segment-level evaluation). We re-ported scores for standard metrics requiring thereference as well metrics that compare MT out-put directly with the source text. For system-level,best metrics reach over 0.95 Pearson correlationor better across several language pairs. In manycases, this correlation drops considerably when thecorrelation is recomputed after discarding outliersystems.

Computing Pearson correlation without outlierscan change the rankings of metrics, and selectingthese outlier systems is not an exact science. We re-port results both with all systems and after discard-ing outliers as together, and also include Kendall

Tau correlation, and hope that together, they givea more complete picture than just reporting onlyone of these numbers. In the end, we believe thatit is impossible to adequately describe data withsummary numbers, and that it’s best to visualisedata to understand patterns.

The results confirm the trends from previousyears, namely metrics based on word or sentence-level embeddings, achieve the highest perfor-mance (Ma et al., 2018, 2019).

For some language pairs, we had two referencesavailable. On these test sets, we found that com-puting scores with two references rarely helpedmetrics achieve a higher correlation than using ei-ther reference individually. This contradicts earlierresearch that shows that multiple references im-prove correlation (Bojar et al., 2013), but is in linewith more recent papers that show additional inde-pendent references might not be helpful (Freitaget al., 2020). We believe that the utility of addi-tional independent references is dependent on theMT systems evaluated, that perhaps they are not ashelpful when scoring high quality MT systems aswith low/mid quality MT.

In addition to scoring MT systems, this year, wealso requested scores for human reference transla-tions. This highlighted the difference between lexi-cal and embedding-based metrics, as lexical met-rics consistently gave low scores to human transla-tions. However, when using the English-Germanparaphrased references, all metrics scored the otherhuman references above all MT systems, highlight-ing the advantages of using paraphrased referenceswhen scoring human translations.

In addition to human references, there are someMT systems where metrics (either the majority ofmetrics, or only the lexical metrics) make majorerrors. It remains an open question as to what itis about these systems that metrics struggle withscoring them correctly.

Compared to last year, the performance of thereference-free metrics has improved, and the corre-lations this year are competitive with the reference-based metrics, and in many cases, outperformBLEU. In particular, COMET-QE is good at recog-nising the high quality of human translations whereBLEU falls short.

In terms of segment-level Kendall’s τ results,the standard metrics correlations was very low forthe to-English language pairs, particularly after dis-carding translations by outlier systems. The corre-

Page 26: Results of the WMT20 Metrics Shared Tasktems participating in the WMT parallel corpus fil-tering task (Koehn et al.,2020): Khmer and Pashto to English.2 All metrics are evaluated

713

lations of the out-of-English language pair are morein line with recent years, reaching a maximum ofabove 0.6.

It has been shown that context is really importantwhen humans are rating MT outputs (Toral et al.,2018), and the WMT human evaluation is movingtowards evaluating segments with the documentcontext (Barrault et al., 2019). This creates a mis-match with automatic metrics, all of which, thisyear, score each segment independently. This year,we introduce document-level evaluation of metrics.When computing document-level scores, some met-rics from the COMET family include documentcontext when computing segment scores within thedocument. All other metrics included in this year’sevaluation either use the average of the segmentscores or compute the document score based onstatistics computed independently for each segment.In the future, we hope to see more metrics that con-sider broader context when evaluating translationsat all three levels.

For this year, we are unable to draw any mean-ingful conclusions from the document-level evalua-tion task, as it is hard to tease apart the influence ofnoise in the ground truth, inadequate segment-leveltranslations and inadequate translation in contextof the document.

We believe that the noise in the DARR judge-ments is a big factor in the low correlations inthe to-English language pairs. We need furtherresearch into understanding the factors that con-tribute to the Kendall Tau scores and how much wecan trust these results.

There are shortcomings in the methods used toevaluate metrics at the system-, document-, andsegment-level, and we believe that improving meth-ods for evaluating and analysing automatic metricsis a rich area for future research.

Finally, we assume that any discrepancies be-tween metrics and WMT manual evaluation is ametric error, and we acknowledge that this mightnot be true in all cases. There is always scope forimprovement in human evaluation methodology,and the best practice recommendations for humanevaluation are always evolving.

We rerun human evaluation by using the sametemplate as the WMT evaluation, but switchingthe rater pool from non-experts to professional lin-guists for a subset of translations where all metricsdisagree with the WMT human evaluation. Thisexperiment revealed a new use case of automatic

metrics and demonstrated that automatic metricscan be used to identify bad ratings in human eval-uations. The new obtained ratings were in linewith the scores suggested by the automatic metricsand also confirmed the higher translation quality ofhuman translations when compared to MT output.

In this paper, we looked at how outliers influencemetric evaluation, and we wonder how the presenceof these systems influence DA annotations. In aperfect world, annotators score each translation onits own merits without being influenced by previ-ous instances. In this world, given the presenceof much worse translations, do annotators assignhigh scores to the remaining translations that lookrelatively better? Does an MT system receive anunfair advantage if it is consistently scored along-side a low-scoring outlier? And does standardisingthe scores of individual annotators exacerbate thisissue? These and other research questions remainopen this year, keeping the WMT tasks increas-ingly interesting as MT systems are getting closerto human performance.

Acknowledgments

Results in this shared task would not be possiblewithout tight collaboration with organizers of theWMT News Translation Task.

We are grateful to Google for sponsoring theevaluation of selected MT systems by professionallinguists. We thank all participants of the task,particularly Weiyu Wang for pointing out some in-consistencies with the original release of inputs,Jackie Lo, Ricardo Rei, and Craig Stewart for somehelpful feedback and Thibault Sellam for also sub-mitting the scores of finetuned BERT on our re-quest.

Nitika Mathur is supported by the AustralianResearch Council. Ondrej Bojar would like to ac-knowledge the support from the Czech ScienceFoundation (grant n. 19-26934X, NEUREM3).

ReferencesManish Shrivastava Ananya Mukherjee, Hema Ala and

Dipti Misra Sharma. 2020. Mee: An automaticmetric for evaluation using embeddings for machinetranslation. (in press).

Loıc Barrault, Ondrej Bojar, Marta R Costa-jussa,Christian Federmann, Mark Fishel, Yvette Gra-ham, Barry Haddow, Matthias Huck, Philipp Koehn,Shervin Malmasi, et al. 2019. Findings of the 2019Conference on Machine Translation (WMT19). In

Page 27: Results of the WMT20 Metrics Shared Tasktems participating in the WMT parallel corpus fil-tering task (Koehn et al.,2020): Khmer and Pashto to English.2 All metrics are evaluated

714

Proceedings of the Fourth Conference on MachineTranslation (Volume 2: Shared Task Papers, Day 1),pages 1–61.

Loıc Barrault, Magdalena Biesialska, Ondrej Bojar,Marta R. Costa-jussa, Christian Federmann, YvetteGraham, Roman Grundkiewicz, Barry Haddow,Matthias Huck, Eric Joanis, Tom Kocmi, PhilippKoehn, Chi-kiu Lo, Nikola Ljubesic, ChristofMonz, Makoto Morishita, Masaaki Nagata, Toshi-aki Nakazawa, Santanu Pal, Matt Post, and MarcosZampieri. 2020. Findings of the 2020 conferenceon machine translation (wmt20). In Proceedings ofthe Fifth Conference on Machine Translation, On-line. Association for Computational Linguistics.

Rachel Bawden, Biao Zhang, Andre Tattar, and MattPost. 2020. ParBLEU: Augmenting metrics with au-tomatic paraphrases for the WMT’20 metrics sharedtask. In Proceedings of the Fifth Conference on Ma-chine Translation (Volume 2: Shared Task Papers),Online. Association for Computational Linguistics.

Ondrej Bojar, Matous Machacek, Ales Tamchyna, andDaniel Zeman. 2013. Scratching the surface of pos-sible translations. In Text, Speech, and Dialogue,pages 465–474, Berlin, Heidelberg. Springer BerlinHeidelberg.

Ondrej Bojar, Yvette Graham, and Amir Kamran. 2017.Results of the wmt17 metrics shared task. In Pro-ceedings of the Second Conference on MachineTranslation, Volume 2: Shared Task Papers, pages489–513, Copenhagen, Denmark.

Ondrej Bojar, Yvette Graham, Amir Kamran, andMilos Stanojevic. 2016. Results of the WMT16Metrics Shared Task. In Proceedings of the FirstConference on Machine Translation, pages 199–231,Berlin, Germany.

Sheila Castilho, Joss Moorkens, Federico Gaspari,Andy Way, Panayota Georgakopoulou, Maria Gi-alama, Vilelmini Sosoni, and Rico Sennrich. 2017.Crowdsourcing for nmt evaluation: Professionaltranslators versus the crowd. Translating and theComputer, 39.

Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, HuiJiang, and Diana Inkpen. 2017. Enhanced LSTM fornatural language inference. In Proceedings of the55th Annual Meeting of the Association for Compu-tational Linguistics, Vancouver, Canada.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, andKristina Toutanova. 2019. BERT: Pre-training ofdeep bidirectional transformers for language under-standing. In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics: Human LanguageTechnologies, Volume 1 (Long and Short Papers),pages 4171–4186, Minneapolis, Minnesota. Associ-ation for Computational Linguistics.

Markus Dreyer and Daniel Marcu. 2012. HyTER:Meaning-equivalent semantics for translation evalu-ation. In Proceedings of the 2012 Conference of theNorth American Chapter of the Association for Com-putational Linguistics: Human Language Technolo-gies, pages 162–171, Montreal, Canada. Associationfor Computational Linguistics.

Markus Freitag, David Grangier, and Isaac Caswell.2020. BLEU might be guilty but references are notinnocent. In Proceedings of the 2020 Conferenceon Empirical Methods in Natural Language Process-ing (EMNLP), pages 61–71, Online. Association forComputational Linguistics.

Yvette Graham and Timothy Baldwin. 2014. Testingfor significance of increased correlation with humanjudgment. In Proceedings of the 2014 Conferenceon Empirical Methods in Natural Language Process-ing (EMNLP), pages 172–176, Doha, Qatar. Associ-ation for Computational Linguistics.

Yvette Graham, Timothy Baldwin, and Nitika Mathur.2015. Accurate evaluation of segment-level ma-chine translation metrics. In Proceedings of the2015 Conference of the North American Chapterof the Association for Computational Linguistics:Human Language Technologies, pages 1183–1191,Denver, Colorado. Association for ComputationalLinguistics.

Yvette Graham, Timothy Baldwin, Alistair Moffat, andJustin Zobel. 2013. Continuous measurement scalesin human evaluation of machine translation. In Pro-ceedings of the 7th Linguistic Annotation Workshopand Interoperability with Discourse, pages 33–41,Sofia, Bulgaria. Association for Computational Lin-guistics.

Yvette Graham, Qingsong Ma, Timothy Baldwin, QunLiu, Carla Parra, and Carolina Scarton. 2017. Im-proving evaluation of document-level machine trans-lation quality estimation. In Proceedings of the 15thConference of the European Chapter of the Associa-tion for Computational Linguistics: Volume 2, ShortPapers, pages 356–361, Valencia, Spain. Associa-tion for Computational Linguistics.

Yvette Graham, Nitika Mathur, and Timothy Baldwin.2014. Randomized significance tests in machinetranslation. In Proceedings of the Ninth Workshopon Statistical Machine Translation, pages 266–274,Baltimore, Maryland, USA. Association for Compu-tational Linguistics.

Boris Iglewicz and David Caster Hoaglin. 1993. Howto detect and handle outliers, volume 16. Asq Press.

Fabio Kepler, Jonay Trenous, Marcos Treviso, MiguelVera, and Andre F. T. Martins. 2019. OpenKiwi:An open source framework for quality estimation.In Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics: SystemDemonstrations, pages 117–122, Florence, Italy. As-sociation for Computational Linguistics.

Page 28: Results of the WMT20 Metrics Shared Tasktems participating in the WMT parallel corpus fil-tering task (Koehn et al.,2020): Khmer and Pashto to English.2 All metrics are evaluated

715

Philipp Koehn. 2004. Statistical significance testsfor machine translation evaluation. In Proceed-ings of the 2004 Conference on Empirical Meth-ods in Natural Language Processing, pages 388–395, Barcelona, Spain. Association for Computa-tional Linguistics.

Philipp Koehn, Vishrav Chaudhary, Ahmed El-Kishky,Naman Goyal, Peng-Jen Chen, and FranciscoGuzman. 2020. Findings of the wmt 2020 sharedtask on parallel corpus filtering and alignment. InProceedings of the Fifth Conference on MachineTranslation, Online. Association for ComputationalLinguistics.

Samuel Laubli, Sheila Castilho, Graham Neubig, RicoSennrich, Qinlan Shen, and Antonio Toral. 2020.A set of recommendations for assessing human–machine parity in language translation. Journal ofArtificial Intelligence Research, 67:653–672.

Christophe Leys, Christophe Ley, Olivier Klein,Philippe Bernard, and Laurent Licata. 2013. Detect-ing outliers: Do not use standard deviation aroundthe mean, use absolute deviation around the me-dian. Journal of Experimental Social Psychology,49(4):764–766.

Chi-kiu Lo. 2019. YiSi - a unified semantic MT qualityevaluation and estimation metric for languages withdifferent levels of available resources. In Proceed-ings of the Fourth Conference on Machine Transla-tion (Volume 2: Shared Task Papers, Day 1), pages507–513, Florence, Italy. Association for Computa-tional Linguistics.

Chi-kiu Lo. 2020. Extended study on using pretrainedlanguage models and YiSi-1 for machine translationevaluation. In Proceedings of the Fifth Conferenceon Machine Translation: Shared Task Papers.

Chi-kiu Lo and Samuel Larkin. 2020. Machine trans-lation reference-less evaluation using YiSi-2 withbilingual mappings of massive multilingual lan-guage model. In Proceedings of the Fifth Confer-ence on Machine Translation: Shared Task Papers.

Qingsong Ma, Ondrej Bojar, and Yvette Graham. 2018.Results of the WMT18 metrics shared task: Bothcharacters and embeddings achieve good perfor-mance. In Proceedings of the Third Conference onMachine Translation: Shared Task Papers, pages671–688, Belgium, Brussels. Association for Com-putational Linguistics.

Qingsong Ma, Johnny Wei, Ondrej Bojar, and YvetteGraham. 2019. Results of the WMT19 metricsshared task: Segment-level and strong MT sys-tems pose big challenges. In Proceedings of theFourth Conference on Machine Translation (Volume2: Shared Task Papers, Day 1), pages 62–90, Flo-rence, Italy. Association for Computational Linguis-tics.

Matous Machacek and Ondrej Bojar. 2014. Results ofthe WMT14 metrics shared task. In Proceedings ofthe Ninth Workshop on Statistical Machine Trans-lation, pages 293–301, Baltimore, Maryland, USA.Association for Computational Linguistics.

Nitika Mathur, Timothy Baldwin, and Trevor Cohn.2019. Putting evaluation in context: Contextualembeddings improve machine translation evaluation.In Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics, pages2799–2808, Florence, Italy. Association for Compu-tational Linguistics.

Nitika Mathur, Timothy Baldwin, and Trevor Cohn.2020. Tangled up in BLEU: Reevaluating the eval-uation of automatic machine translation evaluationmetrics. In Proceedings of the 58th Annual Meet-ing of the Association for Computational Linguistics,pages 4984–4997, Online. Association for Computa-tional Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002a. Bleu: a method for automatic eval-uation of machine translation. In Proceedings ofthe 40th Annual Meeting of the Association for Com-putational Linguistics, pages 311–318, Philadelphia,Pennsylvania, USA. Association for ComputationalLinguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002b. BLEU: A method for automaticevaluation of machine translation. In Proceedingsof the 40th Annual Meeting of the Association forComputational Linguistics (ACL 2002), pages 311–318, Philadelphia, USA.

Maja Popovic. 2015. chrF: character n-gram F-scorefor automatic MT evaluation. In Proceedings of theTenth Workshop on Statistical Machine Translation,pages 392–395, Lisbon, Portugal.

Maja Popovic. 2017. chrF++: words helping charac-ter n-grams. In Proceedings of the Second Con-ference on Machine Translation, pages 612–618,Copenhagen, Denmark. Association for Computa-tional Linguistics.

Matt Post. 2018. A call for clarity in reporting BLEUscores. In Proceedings of the Third Conference onMachine Translation: Research Papers, pages 186–191, Belgium, Brussels. Association for Computa-tional Linguistics.

Ricardo Rei, Craig Stewart, Ana C Farinha, and AlonLavie. 2020a. COMET: A neural framework for MTevaluation. In Proceedings of the 2020 Conferenceon Empirical Methods in Natural Language Process-ing (EMNLP), pages 2685–2702, Online. Associa-tion for Computational Linguistics.

Ricardo Rei, Craig Stewart, Ana C Farinha, and AlonLavie. 2020b. Unbabel’s participation in the wmt20metrics shared task. In Proceedings of the FifthConference on Machine Translation, Online. Asso-ciation for Computational Linguistics.

Page 29: Results of the WMT20 Metrics Shared Tasktems participating in the WMT parallel corpus fil-tering task (Koehn et al.,2020): Khmer and Pashto to English.2 All metrics are evaluated

716

Peter J Rousseeuw and Mia Hubert. 2011. Robuststatistics for outlier detection. Wiley Interdisci-plinary Reviews: Data Mining and Knowledge Dis-covery, 1(1):73–79.

Thibault Sellam, Dipanjan Das, and Ankur Parikh.2020a. BLEURT: Learning robust metrics for textgeneration. In Proceedings of the 58th Annual Meet-ing of the Association for Computational Linguis-tics, pages 7881–7892. Association for Computa-tional Linguistics.

Thibault Sellam, Amy Pu, Hyung Won Chung, Sebas-tian Gehrmann, Qijun Tan, Markus Freitag, Dipan-jan Das, and Ankur Parikh. 2020b. Learning to eval-uate translation beyond english: Bleurt submissionsto the wmt metrics 2020 shared task. In Proceedingsof the Fifth Conference on Machine Translation, On-line. Association for Computational Linguistics.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Lin-nea Micciulla, and John Makhoul. 2006. A studyof translation edit rate with targeted human annota-tion. In Proceedings of the Association for MachineTransaltion in the Americas, pages 223–231.

Peter Stanchev, Weiyue Wang, and Hermann Ney. 2019.Eed: Extended edit distance measure for machinetranslation. In Proceedings of the Fourth Confer-ence on Machine Translation (Volume 2: SharedTask Papers, Day 1), pages 514–520, Florence, Italy.Association for Computational Linguistics.

Brian Thompson and Matt Post. 2020. Automatic ma-chine translation evaluation in many languages viazero-shot paraphrasing. In Proceedings of the 2020Conference on Empirical Methods in Natural Lan-guage Processing (EMNLP), Online. Association forComputational Linguistics.

Antonio Toral, Sheila Castilho, Ke Hu, and Andy Way.2018. Attaining the Unattainable? ReassessingClaims of Human Parity in Neural Machine Trans-lation. In Proceedings of the Third Conference onMachine Translation: Research Papers, pages 113–123, Belgium, Brussels. Association for Computa-tional Linguistics.

Tereza Vojtechova, Michal Novak, Milos Kloucek, andOndrej Bojar. 2019. SAO WMT19 Test Suite: Ma-chine Translation of Audit Reports. In Proceedingsof the Fourth Conference on Machine Translation,Florence, Italy. Association for Computational Lin-guistics.

Weiyue Wang, Jan-Thorsten Peter, Hendrik Rosendahl,and Hermann Ney. 2016. CharacTer: Translationedit rate on character level. In Proceedings of theFirst Conference on Machine Translation: Volume2, Shared Task Papers, pages 505–510, Berlin, Ger-many. Association for Computational Linguistics.

Evan James Williams. 1959. Regression analysis. wi-ley.

Jin Xu, Yinuo Guo, and Junfeng Hu. 2020. Incorporatesemantic structures into machine translation evalua-tion via ucca. In Proceedings of the Fifth Confer-ence on Machine Translation, Online. Associationfor Computational Linguistics.

Page 30: Results of the WMT20 Metrics Shared Tasktems participating in the WMT parallel corpus fil-tering task (Koehn et al.,2020): Khmer and Pashto to English.2 All metrics are evaluated

717

A List of Outliers

lp Outliers

cs-en ZLABS-NLP.1149, CUNI-DOCTRANSFORMER.1457de-en YOLO.1052, ZLABS-NLP.1153, WMTBIOMEDBASELINE.387iu-en NIUTRANS.1206, FACEBOOK AI.729ja-en ONLINE-G.1564, ZLABS-NLP.66, ONLINE-Z.1640pl-en ZLABS-NLP.1162ru-en ZLABS-NLP.1164ta-en ONLINE-G.1568, TALP UPC.192zh-en WMTBIOMEDBASELINE.183en-cs ZLABS-NLP.1151, ONLINE-G.1555en-de ZLABS-NLP.179, WMTBIOMEDBASELINE.388, ONLINE-G.1556en-iu news UEDIN.1281, OPPO.722, UQAM TANLE.521en-iu full UEDIN.1281, OPPO.722, UQAM TANLE.521en-iu UEDIN.1281, OPPO.722, UQAM TANLE.521en-pl ONLINE-Z.1634, ZLABS-NLP.180, ONLINE-A.1576en-ta TALP UPC.1049, SJTU-NICT.386, ONLINE-G.1561

Table 15: List of all MT systems that we consider as outliers

B Scatterplots

Here we show scatterplots of human and metric scores of selected metrics.We report the correlation of each metric with human scores on all systems as well as all systems minus

the outliers. Note that we do not exclude human translations when computing these correlations.In the following scatterplots, the violet triangles indicate individual indicate MT system submissions by

researchers and pink downward triangles are online systems. 8 The red crosses are outlier systems.The black diamonds are human translations. For newstest2020 reference set, this is the HUMAN-A

translation, and for newstestB2020 reference set, this is the HUMAN-B translation. The plots for English→ German have two human translations included, and we annotate the label in the plot. In many cases,metric errors scoring these translations stand out.

Metric scores of MT systems with multiple references does not deviate from the scores of eitherreference. So we do not include the scatterplots of the other reference sets unless a human translation isincluded (which is interesting).

We will have scatterplots for all metrics over all reference sets in the metrics package to be madeavailable at http://www.statmt.org/wmt20/results.html

cs-en

0.1 0.0 0.1

22

24

26

28

30BLEU: r = 0.85/ 0.80

0.1 0.0 0.1

0.53

0.54

0.55

0.56

0.57

0.58

chrF: r = 0.87/ 0.81

0.1 0.0 0.1

0.42

0.41

0.40

0.39

0.38

0.37EED: r = 0.88/ 0.84

0.1 0.0 0.1

0.10

0.05

0.00

0.05

0.10

0.15

OpenKiwi­Bert: r = 0.73/ 0.70

8We distinguish between the two in these scatterplots as we notice that metrics often make errors when scoring onlinesystems.

Page 31: Results of the WMT20 Metrics Shared Tasktems participating in the WMT parallel corpus fil-tering task (Koehn et al.,2020): Khmer and Pashto to English.2 All metrics are evaluated

718

de-en newstest2020a

0.1 0.0 0.1 0.2

32

34

36

38

40

42

44BLEU: r = 0.81/ 0.48

0.1 0.0 0.1 0.2

0.60

0.62

0.64

0.66

0.68

chrF: r = 0.88/ 0.44

0.1 0.0 0.1 0.2

0.1

0.0

0.1

0.2

0.3

BLEURText: r = 0.96/ 0.81

0.1 0.0 0.1 0.2

0.20

0.25

0.30

0.35

COMET­QE: r = 0.96/ 0.81

aIncluding the YOLO.1052 system, which has an extremely low quality, would make it hard to distinguish between the rest ofthe systems, so these plots exclude the system.

iu-en

0.0 0.1 0.2

24

26

28

30BLEU: r = 0.57/ 0.35

0.0 0.1 0.2

0.46

0.48

0.50

0.52

chrF: r = 0.73/ 0.34

0.0 0.1 0.2

0.800

0.775

0.750

0.725

0.700

0.675

COMET­2R: r = 0.87/ 0.64

0.0 0.1 0.2

0.01

0.00

0.01

0.02

0.03

COMET­QE: r = 0.68/ 0.66

ja-en

0.2 0.0 0.2

12.5

15.0

17.5

20.0

22.5

25.0

27.5 BLEU: r = 0.97/ 0.83

0.2 0.0 0.2

0.400

0.425

0.450

0.475

0.500

0.525

chrF: r = 0.97/ 0.86

0.2 0.0 0.2

0.50

0.48

0.46

0.44

0.42

0.40 EED: r = 0.97/ 0.90

0.2 0.0 0.2

0.87

0.88

0.89

0.90

0.91YiSi­2: r = 0.97/ 0.78

pl-en

0.1 0.0 0.1

28

30

32

34

BLEU: r = 0.55/ 0.36

0.1 0.0 0.1

0.56

0.57

0.58

0.59

0.60

0.61

chrF: r = 0.53/ 0.31

0.1 0.0 0.1

0.15

0.20

0.25

0.30

BLEURT: r = 0.59/ 0.37

0.1 0.0 0.1

0.91

0.92

0.93

0.94

0.95YiSi­2: r = 0.44/ 0.23

Page 32: Results of the WMT20 Metrics Shared Tasktems participating in the WMT parallel corpus fil-tering task (Koehn et al.,2020): Khmer and Pashto to English.2 All metrics are evaluated

719

ru-en newstest2020

0.1 0.0 0.1

30

32

34

36

38

BLEU: r = 0.87/ 0.74

0.1 0.0 0.1

0.60

0.61

0.62

0.63

0.64

0.65

0.66chrF: r = 0.88/ 0.82

0.1 0.0 0.1

0.35

0.34

0.33

0.32

0.31

0.30

EED: r = 0.92/ 0.86

0.1 0.0 0.10.22

0.24

0.26

0.28

0.30

0.32

0.34

COMET­QE: r = 0.87/ 0.75

ru-en newstestB2020

0.1 0.0 0.1

32

34

36

38

40

42 BLEU: r = 0.88/ 0.78

0.1 0.0 0.1

0.61

0.62

0.63

0.64

0.65

0.66

0.67chrF: r = 0.89/ 0.84

0.1 0.0 0.1

0.58

0.60

0.62

0.64

MEE: r = 0.91/ 0.88

0.1 0.0 0.1

0.91

0.92

0.93

0.94

0.95YiSi­2: r = 0.82/ 0.81

ta-en

0.4 0.2 0.0 0.2

5

10

15

20

BLEU: r = 0.92/ 0.81

0.4 0.2 0.0 0.2

0.40

0.45

0.50

0.55chrF: r = 0.95/ 0.83

0.4 0.2 0.0 0.2

0.70

0.65

0.60

0.55

CharacTER: r = 0.96/ 0.88

0.4 0.2 0.0 0.20.87

0.88

0.89

0.90

0.91

0.92

YiSi­2: r = 0.85/ 0.76

zh-en newstest2020

0.2 0.015

20

25

30

35

BLEU: r = 0.94/ 0.94

0.2 0.0

0.45

0.50

0.55

0.60

0.65

chrF: r = 0.97/ 0.95

0.2 0.0

45

50

55

60

65

parchrf++: r = 0.97/ 0.95

0.2 0.0

0.10

0.15

0.20

0.25

0.30

OpenKiwi­XLMR: r = 0.90/ 0.89

Page 33: Results of the WMT20 Metrics Shared Tasktems participating in the WMT parallel corpus fil-tering task (Koehn et al.,2020): Khmer and Pashto to English.2 All metrics are evaluated

720

zh-en newstestB2020

0.2 0.0

15

20

25

30

BLEU: r = 0.94/ 0.93

0.2 0.0

0.45

0.50

0.55

0.60

chrF: r = 0.97/ 0.95

0.2 0.0

45

50

55

60

parchrf++: r = 0.97/ 0.95

0.2 0.00.88

0.89

0.90

0.91

0.92

YiSi­2: r = 0.96/ 0.93

km-en

0.2 0.0 0.210.0

12.5

15.0

17.5

20.0

22.5

25.0

BLEU: r = 0.97/ 0.97

0.2 0.0 0.20.375

0.400

0.425

0.450

0.475

0.500

0.525chrF: r = 0.98/ 0.98

0.2 0.0 0.2

0.52

0.50

0.48

0.46

0.44

0.42

0.40

EED: r = 0.99/ 0.99

0.2 0.0 0.2

0.05

0.10

0.15

0.20

COMET­QE: r = 0.90/ 0.90

ps-en

0.2 0.1 0.0

14

16

18

20

22

BLEU: r = 0.89/ 0.89

0.2 0.1 0.00.38

0.40

0.42

0.44

0.46

0.48

0.50chrF: r = 0.90/ 0.90

0.2 0.1 0.0

2.8

2.6

2.4

2.2

prism: r = 0.97/ 0.97

0.2 0.1 0.0

0.86

0.87

0.88

0.89

0.90

YiSi­2: r = 0.94/ 0.94

en-cs

0.5 0.0 0.5

20

25

30

35

40

BLEU: r = 0.82/ 0.39

0.5 0.0 0.5

0.50

0.55

0.60

0.65

chrF: r = 0.83/ 0.31

0.5 0.0 0.5

0.4

0.2

0.0

0.2

0.4

BLEURText: r = 0.99/ 0.96

0.5 0.0 0.5

0.2

0.3

0.4

0.5

COMET­QE: r = 0.99/ 0.97

Page 34: Results of the WMT20 Metrics Shared Tasktems participating in the WMT parallel corpus fil-tering task (Koehn et al.,2020): Khmer and Pashto to English.2 All metrics are evaluated

721

en-de newstest2020

0.0 0.5

15

20

25

30

35

40

P

B

BLEU: r = 0.54/ 0.31

0.0 0.5

0.50

0.55

0.60

0.65

P

B

chrF: r = 0.64/ 0.36

0.0 0.5

0.2

0.1

0.0

0.1

0.2

0.3

P

BBLEURText: r = 0.97/ 0.87

0.0 0.5

0.25

0.30

0.35

0.40

P

BCOMET­QE: r = 0.91/ 0.89

en-de newstestB2020

0.25 0.00 0.25 0.50

15

20

25

30

35

40

P

A

BLEU: r = 0.53/ 0.38

0.25 0.00 0.25 0.500.45

0.50

0.55

0.60

0.65

P

A

chrF: r = 0.59/ 0.39

0.25 0.00 0.25 0.50

0.2

0.1

0.0

0.1

0.2

0.3

P

ABLEURText: r = 0.97/ 0.88

0.25 0.00 0.25 0.50

0.25

0.30

0.35

0.40

P

ACOMET­QE: r = 0.90/ 0.85

en-de newstestP2020

0.0 0.58

10

12

14

16

B

ABLEU: r = 0.81/ 0.65

0.0 0.50.38

0.40

0.42

0.44

0.46

0.48

B

AchrF: r = 0.85/ 0.68

0.0 0.50.4

0.2

0.0

0.2 B A

BLEURText: r = 0.95/ 0.86

0.0 0.5

0.25

0.30

0.35

0.40

B

A

COMET­QE: r = 0.91/ 0.89

en-ja

0.2 0.0 0.2 0.4

15.0

17.5

20.0

22.5

25.0

27.5

BLEU: r = 0.95/ 0.95

0.2 0.0 0.2 0.4

0.225

0.250

0.275

0.300

0.325

0.350

0.375chrF: r = 0.95/ 0.95

0.2 0.0 0.2 0.4

0.2

0.3

0.4

0.5

esim: r = 0.99/ 0.99

0.2 0.0 0.2 0.4

0.1

0.2

0.3

0.4

OpenKiwi­XLMR: r = 0.99/ 0.99

Page 35: Results of the WMT20 Metrics Shared Tasktems participating in the WMT parallel corpus fil-tering task (Koehn et al.,2020): Khmer and Pashto to English.2 All metrics are evaluated

722

en-pl

0.5 0.0 0.5

20

22

24

26

28BLEU: r = 0.94/ 0.74

0.5 0.0 0.5

0.50

0.52

0.54

0.56

chrF: r = 0.96/ 0.79

0.5 0.0 0.50.2

0.1

0.0

0.1

0.2

0.3

BLEURText: r = 0.98/ 0.83

0.5 0.0 0.5

0.25

0.30

0.35

0.40

0.45

0.50

0.55COMET­QE: r = 0.97/ 0.80

en-ru

0.0 0.2 0.4

20

21

22

23

24

25

BLEU: r = 0.98/ 0.98

0.0 0.2 0.4

0.48

0.50

0.52

chrF: r = 0.98/ 0.98

0.0 0.2 0.4

46

47

48

49

50

chrF++: r = 0.98/ 0.98

0.0 0.2 0.4

0.30

0.35

0.40

0.45

0.50

OpenKiwi­XLMR: r = 0.87/ 0.87

en-ta

0.5 0.0 0.5

2

4

6

8

10

12

14BLEU: r = 0.88/ 0.83

0.5 0.0 0.5

0.30

0.35

0.40

0.45

0.50

0.55

chrF: r = 0.94/ 0.89

0.5 0.0 0.5

0.55

0.50

0.45

EED: r = 0.96/ 0.91

0.5 0.0 0.5

0.87

0.88

0.89

0.90

0.91

0.92YiSi­2: r = 0.92/ 0.92

en-zh newstest2020

0.2 0.4 0.6

30

35

40

45

50BLEU: r = 0.66/ 0.66

0.2 0.4 0.60.25

0.30

0.35

0.40

0.45 chrF: r = 0.65/ 0.65

0.2 0.4 0.6

0.20

0.25

0.30

0.35

0.40

EQ_static: r = 0.93/ 0.93

0.2 0.4 0.6

0.0

0.1

0.2

0.3

OpenKiwi­Bert: r = 0.49/ 0.49

Page 36: Results of the WMT20 Metrics Shared Tasktems participating in the WMT parallel corpus fil-tering task (Koehn et al.,2020): Khmer and Pashto to English.2 All metrics are evaluated

723

en-zh newstestB2020

0.2 0.425.0

27.5

30.0

32.5

35.0

37.5

40.0

BLEU: r = 0.81/ 0.81

0.2 0.40.24

0.26

0.28

0.30

0.32

0.34

0.36chrF: r = 0.81/ 0.81

0.2 0.4

0.35

0.40

0.45

esim: r = 0.92/ 0.92

0.2 0.4

0.0

0.1

0.2

0.3

OpenKiwi­Bert: r = 0.52/ 0.52

en-iu Full test set

0.25 0.00 0.255

10

15

20

25

BLEU: r = 0.16/ 0.13

0.25 0.00 0.25

0.25

0.30

0.35

0.40

chrF: r = 0.35/ 0.12

0.25 0.00 0.25

0.500

0.525

0.550

0.575

0.600

0.625

esim: r = 0.81/ 0.37

0.25 0.00 0.25

0.02

0.01

0.00

0.01

0.02

COMET­QE: r = 0.91/ 0.58

en-zh Out of domain (News) subset

0.25 0.00 0.25

5

10

15

20

25

BLEU: r = 0.07/ 0.11

0.25 0.00 0.25

0.25

0.30

0.35

0.40

0.45 chrF: r = 0.34/ 0.09

0.25 0.00 0.25

0.45

0.50

0.55

0.60

paresim­1: r = 0.76/ 0.42

0.25 0.00 0.250.02

0.01

0.00

0.01

0.02

0.03COMET­QE: r = 0.93/ 0.65

C Additional System-level Results

We also report Kendall Tau correlation of metrics at the system level.

Page 37: Results of the WMT20 Metrics Shared Tasktems participating in the WMT parallel corpus fil-tering task (Koehn et al.,2020): Khmer and Pashto to English.2 All metrics are evaluated

724

cs-en de-en ja-en pl-en ru-en ta-en zh-en iu-en km-en ps-en12 12 10 14 11 14 16 11 7 6

HUMAN RAW 0.727 0.758 0.778 0.429 0.673 0.604 0.650 0.891 0.905 1.000SENTBLEU 0.788 0.758 0.733 0.297 0.564 0.692 0.850 0.455 0.619 0.600BLEU 0.848 0.697 0.778 0.407 0.455 0.692 0.833 0.309 0.714 0.600TER 0.758 0.788 0.689 0.287 0.600 0.780 0.800 0.514 0.878 0.867CHRF++ 0.818 0.697 0.778 0.407 0.673 0.714 0.850 0.418 0.619 0.733CHRF 0.818 0.727 0.822 0.363 0.709 0.714 0.833 0.418 0.619 0.733PARBLEU 0.809 0.779 0.778 0.420 0.491 0.685 0.807 0.404 0.714 0.867PARCHRF++ 0.818 0.727 0.822 0.407 0.709 0.714 0.817 0.491 0.619 0.733CHARACTER 0.758 0.758 0.822 0.341 0.745 0.692 0.800 0.527 0.810 0.733EED 0.788 0.727 0.733 0.297 0.782 0.758 0.833 0.636 0.714 0.733YISI-0 0.758 0.758 0.689 0.231 0.782 0.802 0.833 0.600 0.714 0.733SWSS+METEOR − − 0.822 0.341 0.818 0.736 0.817 0.491 0.714 0.733MEE 0.758 0.697 0.867 0.363 0.709 0.692 0.783 0.636 0.714 0.733PRISM 0.758 0.727 0.867 0.341 0.564 0.648 0.800 0.673 0.714 0.867YISI-1 0.758 0.758 0.778 0.451 0.564 0.692 0.817 0.673 1.000 0.867BERT-BASE-L2 0.758 0.848 0.822 0.407 0.491 0.604 0.633 0.564 1.000 0.867BERT-LARGE-L2 0.758 0.848 0.867 0.341 0.564 0.626 0.700 0.527 1.000 0.867MBERT-L2 0.758 0.818 0.822 0.429 0.564 0.604 0.750 0.673 1.000 0.867BLEURT 0.758 0.788 0.822 0.407 0.600 0.604 0.650 0.527 1.000 0.867BLEURT-EXTENDED 0.727 0.848 0.778 0.341 0.455 0.582 0.617 0.527 0.905 0.867ESIM 0.727 0.848 0.822 0.451 0.491 0.670 0.717 0.636 1.000 0.867PARESIM-1 0.727 0.879 0.822 0.451 0.491 0.670 0.700 0.636 1.000 0.867COMET 0.727 0.758 0.778 0.407 0.564 0.626 0.733 0.636 1.000 0.867COMET-2R 0.727 0.788 0.778 0.451 0.527 0.582 0.717 0.600 1.000 0.867COMET-HTER 0.667 0.788 0.822 0.275 0.491 0.604 0.533 0.564 1.000 0.867COMET-MQM 0.667 0.727 0.822 0.275 0.455 0.582 0.517 0.636 1.000 1.000COMET-RANK 0.576 0.727 0.822 0.341 0.455 0.626 0.650 0.309 0.810 1.000BAQ DYN − − − − − − 0.817 − − −BAQ STATIC − − − − − − 0.867 − − −COMET-QE 0.697 0.788 0.778 0.297 0.455 0.516 0.550 0.491 0.905 0.733OPENKIWI-BERT 0.697 0.667 0.733 0.187 0.455 0.429 0.450 -0.055 0.714 0.467OPENKIWI-XLMR 0.727 0.636 0.822 0.275 0.418 0.560 0.567 0.018 1.000 0.867YISI-2 0.576 0.515 0.778 0.319 0.527 0.582 0.750 0.491 0.810 0.867

Table 16: Kendall Tau correlation of system-level metrics with DA human assessment for all MT systems notincluding Human translations. In addition to the metrics, we also include raw human scores where annotatorscores were not standardised.

Page 38: Results of the WMT20 Metrics Shared Tasktems participating in the WMT parallel corpus fil-tering task (Koehn et al.,2020): Khmer and Pashto to English.2 All metrics are evaluated

725

en-cs en-de en-ja en-pl en-ru en-ta en-zh en-iu full en-iu news12 14 11 14 9 15 12 11 11

HUMAN RAW 1.000 0.868 0.964 0.846 0.778 0.810 0.818 0.600 0.600SENTBLEU 0.515 0.802 0.855 0.604 0.944 0.867 0.727 0.236 0.273BLEU 0.515 0.802 0.818 0.582 0.889 0.829 0.727 0.236 0.236TER 0.515 0.824 0.018 0.641 0.556 0.752 0.242 0.309 0.309CHRF++ 0.485 0.868 0.782 0.604 0.889 0.829 0.727 0.309 0.309CHRF 0.485 0.868 0.818 0.604 0.889 0.810 0.727 0.345 0.309PARBLEU 0.504 0.736 0.611 0.633 0.761 0.842 0.718 0.404 0.345PARCHRF++ 0.515 0.846 0.818 0.670 0.889 − 0.727 − −CHARACTER 0.515 0.890 0.782 0.560 0.944 0.771 0.697 0.236 0.345EED 0.545 0.868 0.782 0.604 0.833 0.867 0.727 0.273 0.273YISI-0 0.545 0.846 0.818 0.604 0.944 0.790 0.515 0.236 0.345MEE 0.576 0.802 − 0.582 0.667 0.829 − 0.273 0.382PRISM 0.818 0.868 0.818 0.670 0.611 0.562 0.576 0.418 0.600YISI-1 0.606 0.868 0.782 0.626 0.833 0.810 0.758 0.091 0.273YISI-COMBI − 0.824 − − − − − − −BLEURT-YISI-COMBI − 0.824 − − − − − − −MBERT-L2 0.788 0.846 0.782 0.736 0.778 0.752 0.909 − −BLEURT-EXTENDED 0.879 0.802 0.782 0.780 0.833 0.771 0.848 0.382 0.345ESIM 0.606 0.912 0.855 0.692 0.833 0.752 0.788 0.382 0.455PARESIM-1 0.667 0.890 0.818 0.692 0.833 0.752 0.818 0.382 0.455COMET 0.909 0.846 0.745 0.736 0.722 0.771 0.606 0.382 0.382COMET-2R 0.909 0.890 0.891 0.714 0.611 0.790 0.606 0.309 0.418COMET-HTER 0.909 0.802 0.818 0.736 0.667 0.619 0.576 0.491 0.491COMET-MQM 0.909 0.802 0.818 0.736 0.667 0.619 0.545 0.527 0.455COMET-RANK 0.848 0.780 0.782 0.692 0.556 0.524 0.515 0.127 0.345BAQ DYN − − − − − − 0.697 − −BAQ STATIC − − − − − − 0.788 − −EQ DYN − − − − − − 0.727 − −EQ STATIC − − − − − − 0.818 − −COMET-QE 0.848 0.802 0.709 0.802 0.667 0.543 0.576 0.600 0.673OPENKIWI-BERT 0.758 0.780 0.236 0.538 0.722 0.314 0.606 -0.273 0.200OPENKIWI-XLMR 0.909 0.780 0.818 0.692 0.667 0.657 0.545 0.018 0.200YISI-2 0.485 0.582 0.527 0.077 0.444 0.886 0.121 0.309 0.455

Table 17: Kendall Tau correlation of out-of-English system-level metrics with DA human assessment for all MTsystems not including Human translations. In addition to the metrics, we also include raw human scores whereannotator scores were not standardised.