Rationale for a multilingual corpus for machine translation evaluation
Debbie Elliott
Anthony Hartley
Eric Atwell
Corpus Linguistics 2003, Lancaster, England
Outline
•Brief introduction to machine translation evaluation methods
•Corpus content for MT evaluation by end-users
•Why compile a new corpus?
•How large should our new corpus be?
•Which language pairs should be included?
•Which text types should be included?
•Conclusions
Machine translation evaluation methods (1)
Evaluation by developers
•Test suites are used to evaluate the translation of specific linguistic phenomena (e.g. before and after system modifications)
•Test suites contain short annotated test items with correct target translations
•They are used to test the handling of grammatical phenomena
•Vocabulary is limited
•Items are not rated in terms of frequency or relevance to a particular application
•Scoring is objective
Machine translation evaluation methods (2)
Evaluation by end-users
•Texts are translated by different MT systems (and often humans) for comparison
A number of methods can be used to evaluate MT output …
•Fidelity (the preservation of original content) can be evaluated by comparing segments of MT output with segments from the source text (bilingual evaluators) or from expert human translations (monolingual evaluators). Each segment is given a score
•Fluency (the extent to which the translation reads like an original text) can be evaluated by scoring each target text sentence
•Texts can be selected to reflect user needs
Machine translation evaluation methods (3)
Evaluation by end-users
•Scoring by human evaluators is subjective, so:
•Several evaluators are used and a mean score is calculated for each text
•Evaluators rate a number of texts translated by each system
•Human evaluation is expensive, so:
•Recent research has involved the investigation of automated evaluation methods
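The averaging step described on this slide can be sketched as follows. This is a minimal illustration only: the nested `ratings` layout, the system labels and all score values are hypothetical and are not taken from any evaluation reported here.

```python
from statistics import mean

# Hypothetical data: ratings[system][text] = scores from several evaluators.
ratings = {
    "MT1": {"text1": [0.6, 0.7, 0.5], "text2": [0.4, 0.5, 0.6]},
    "HT": {"text1": [0.9, 1.0, 0.8], "text2": [0.9, 0.9, 0.9]},
}


def text_means(ratings):
    """Mean score per text, averaged over the evaluators who rated it."""
    return {
        system: {text: mean(scores) for text, scores in texts.items()}
        for system, texts in ratings.items()
    }


def system_means(ratings):
    """Mean score per system, averaged over its per-text means."""
    return {
        system: mean(per_text.values())
        for system, per_text in text_means(ratings).items()
    }


if __name__ == "__main__":
    for system, score in system_means(ratings).items():
        print(f"{system}: {score:.2f}")
```

Averaging per text first, then per system, gives each text equal weight even if some texts were rated by more evaluators than others.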
Corpus content for MT evaluation by end-users
Essential:
•Source texts in one or more languages
•Machine translations of source texts by systems for evaluation
Not always essential:
•One or more expert human translations in selected target language(s) to be used as reference translations or for inclusion in evaluation with MT output
•Available evaluation scores if corpus is to be used to validate new automated evaluation methods
Why compile a new corpus for MT evaluation? (1)
Existing corpora have limitations: many projects have involved the use of small numbers of texts in only one language pair
•Carroll (Pierce 1966): 144 Russian sentences (scientific); 3 English human translations
•Nagao et al. (1985): 1,682 Japanese sentences (scientific); 0 human translations
•Shiwen (1993): 3,200 English sentences (random); 1 Chinese human translation
•IBM BLEU (Papineni et al. 2001): approx. 500 Chinese sentences (news stories); up to 4 English human translations
Why compile a new corpus for MT evaluation? (2)
Much research has made use of the DARPA 1994 corpus:
•Source texts: 100 French, 100 Spanish, 100 Japanese
•All newspaper articles of approx. 300-400 words/800 Japanese characters
•2 English human translations of each source text
•5 machine translations of each source text
•Scores for adequacy, fluency and informativeness for all 100 translations in each language pair by 5 MT systems and 1 human
Why compile a new corpus for MT evaluation? (3)
We need:
•a corpus that reflects user needs (not just newspaper articles)
•a larger number of language pairs with English as a source and target language
•sub-corpora (for each language pair) large enough to provide reliable evaluation results
•at least one human translation and several machine translations of each source text
•human evaluation results for selected attributes (e.g. fidelity and fluency) for the validation of new automated evaluation methods
•a corpus available to all for MT evaluation research
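One way human scores might be used to validate an automated method is to check how closely the automated scores track the human ones. The sketch below assumes Pearson correlation as the agreement measure; the slides do not name a specific validation statistic, and the function name `pearson` is illustrative.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical per-text scores, not taken from any real evaluation.
    human_fluency = [0.2, 0.5, 0.6, 0.9]
    automated_metric = [0.1, 0.4, 0.7, 0.8]
    print(round(pearson(human_fluency, automated_metric), 3))
```

A coefficient close to 1 would suggest that the automated method ranks texts much as the human evaluators do; a corpus that ships with human scores makes this check repeatable for any new method.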
How large should our new corpus be? (1)
The corpus must not be unnecessarily large:
•human MT evaluations are time-consuming and expensive
•expert human translations of each source text, if not already available, will be costly to produce
However:
•we need enough words to obtain reliable MT evaluation results
How large should our new corpus be? (2)
We carried out a statistical analysis of the DARPA 1994 scores for all three language pairs.
For each system, we calculated the mean score for each attribute, and the mean overall score, over increasing numbers of texts (1 to 100):
Source language French: Adequacy
Columns: MT1, MT2, MT3, MT4, MT5, HT
Text 1: score for text 1
Text 2: mean score for texts 1 & 2
etc.
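The cumulative calculation laid out above (score for text 1, then the mean of texts 1 and 2, and so on) can be sketched as a running mean. The function name `running_means` and the example scores are illustrative assumptions; the values are not DARPA data.

```python
from itertools import accumulate


def running_means(scores):
    """Return the mean of the first k scores for k = 1 .. len(scores)."""
    return [total / k for k, total in enumerate(accumulate(scores), start=1)]


if __name__ == "__main__":
    # Hypothetical per-text adequacy scores for one system.
    mt1_adequacy = [0.5, 0.7, 0.6, 0.4]
    for k, value in enumerate(running_means(mt1_adequacy), start=1):
        print(f"texts 1..{k}: mean adequacy {value:.3f}")
```

Plotting one such curve per system against the number of texts shows how quickly each system's mean score settles.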
[Chart: DARPA 1994 (French-English), mean adequacy scores for varying numbers of texts. x-axis: number of texts evaluated (1 to 100); y-axis: mean adequacy score (0 to 1); series: Candide, Globalink, Metal, Systran, Human, XLT]
[Chart: DARPA 1994 (French-English), mean overall scores for varying numbers of texts. x-axis: number of texts evaluated (1 to 100); y-axis: mean overall score (0 to 1); series: Candide, Globalink, Metal, Systran, Human, XLT]
How large should our new corpus be? (3)
Results from statistical analysis:
•10 texts (3,500 words), and often fewer, allow us to identify the highest (human) and lowest ranking system for individual attributes and overall scores
•10 texts allow us to identify the highest-ranking MT system as well (but up to 30 texts are required for informativeness)
•After approx. 30 texts (10,500 words), scores settle and fluctuate only within a relatively narrow range
•After approx. 40 texts (14,000 words) we have a clearer picture of how all five MT systems compare and further sampling confirms this
•Further research: the same statistical analysis will be performed using texts from our new corpus and our chosen metrics
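The kind of check behind these findings can be sketched as follows: compute each system's running mean, then find the smallest number of texts after which the ranking of systems no longer changes. This is a simplified sketch, not the analysis actually performed; the function names and example scores are hypothetical, and `WORDS_PER_TEXT = 350` is the figure implied by the slide (10 texts = 3,500 words).

```python
from itertools import accumulate

WORDS_PER_TEXT = 350  # implied by the slide: 10 texts = 3,500 words


def running_means(scores):
    """Mean of the first k scores for k = 1 .. len(scores)."""
    return [total / k for k, total in enumerate(accumulate(scores), start=1)]


def ranking_at(curves, k):
    """Systems ordered from highest to lowest running mean after k texts."""
    return sorted(curves, key=lambda system: curves[system][k - 1], reverse=True)


def texts_until_stable(scores_by_system):
    """Smallest k such that the ranking after k texts, and after every
    larger number of texts, equals the ranking over all texts."""
    curves = {s: running_means(v) for s, v in scores_by_system.items()}
    n = min(len(curve) for curve in curves.values())
    final = ranking_at(curves, n)
    stable_from = n
    for k in range(n, 0, -1):  # walk back from the full sample
        if ranking_at(curves, k) != final:
            break
        stable_from = k
    return stable_from


if __name__ == "__main__":
    # Hypothetical per-text scores for three systems.
    scores = {
        "SystemA": [0.2, 0.9, 0.9, 0.9],
        "SystemB": [0.8, 0.5, 0.5, 0.5],
        "Human": [1.0, 1.0, 1.0, 1.0],
    }
    k = texts_until_stable(scores)
    print(f"ranking stable from {k} texts (about {k * WORDS_PER_TEXT} words)")
```

A stable ranking is a weaker criterion than scores staying within a narrow band, so a fuller analysis would also track how much each running mean still moves as texts are added.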
Which language pairs should be included? (1)
•A variety of language pairs allows for the testing of portability of new evaluation methods
•The availability of MT systems for evaluation will influence our choice
•Our survey of MT users (ongoing since January 2003) is also providing guidelines …
[Chart: Language pairs translated by MT users (English as source language). y-axis: number of respondents (0 to 12); bars, in order shown: English-Spanish, English-French, English-German, English-Portuguese, English-Italian, English-Japanese, English-Danish, English-Dutch, English-Greek, English-Chinese, English-Finnish, English-Norwegian, English-Russian, English-Swedish, English-Vietnamese]
[Chart: Language pairs translated by MT users (English as target language). y-axis: number of respondents (0 to 10); bars, in order shown: German-English, French-English, Spanish-English, Italian-English, Japanese-English, Chinese-English, Portuguese-English, Finnish-English, Vietnamese-English]
Which language pairs should be included? (2)
Phase One: French, German, Spanish and Italian plus texts in typologically different languages (Chinese, Japanese) translated into English
Phase Two: Consider additional source languages (e.g. Portuguese and Russian into English)
Phase Three: English translated into other languages
Which text types should be included?
•MT systems are used in translation companies and international organisations to translate a number of different text types and topics
•These text types and a variety of topics must be represented in our corpus
•Our survey of MT users is providing guidelines on the kinds of texts and topics most frequently translated using MT systems …
[Chart: Text types machine translated by companies. y-axis: number of respondents (0 to 12); bars, in order shown: user manuals, technical docs, web pages, legal docs, internal company docs, business letters, instruction booklets, medical docs, calls for tender, memos, newspaper articles, scientific docs, software strings, academic papers, patents, financial docs, tourist/travel info]
[Chart: Text types machine translated by single users. y-axis: number of respondents (0 to 4); bars, in order shown: web pages, academic papers, newspaper articles, technical docs, tourist/travel info, scientific docs, medical docs, legal docs, internal company docs, business letters, patents, calls for tender, user manuals, software strings, memos, instruction booklets, financial docs]
Conclusions
•We aim to provide a minimum of 14,000 words per language pair (further research to be conducted)
•Text types will be based on responses to our survey, reflecting real MT use
•The text types for each language pair will be the same, to give balance
•Our corpus will be dynamic: updated to reflect changing trends in the MT user market
•The key feature will be detailed scores from human evaluations, available for research (particularly in automated MT evaluation)
•We plan to make our corpus and human evaluation results available online in 2004
Thank you
We welcome your questions