Download - Rationale for a multilingual corpus for machine translation evaluation Debbie Elliott Anthony Hartley Eric Atwell Corpus Linguistics 2003, Lancaster, England

Rationale for a multilingual corpus for machine translation evaluation

Debbie Elliott

Anthony Hartley

Eric Atwell

Corpus Linguistics 2003, Lancaster, England

Outline

•Brief introduction to machine translation evaluation methods

•Corpus content for MT evaluation by end-users

•Why compile a new corpus?

•How large should our new corpus be?

•Which language pairs should be included?

•Which text types should be included?

•Conclusions

Machine translation evaluation methods (1)

Evaluation by developers

•Test suites are used to evaluate the translation of specific linguistic phenomena (e.g. before and after system modifications)

•Test suites contain short annotated test items with correct target translations

•They are used to test the handling of grammatical phenomena

•Vocabulary is limited

•Items are not rated in terms of frequency or relevance to a particular application

•Scoring is objective


Evaluation by end-users

•Texts are translated by different MT systems (and often humans) for comparison

A number of methods can be used to evaluate MT output …

•Fidelity (the preservation of original content) can be evaluated by comparing segments of MT output with segments from the source text (bilingual evaluators) or from expert human translations (monolingual evaluators). Each segment is given a score

•Fluency (the extent to which the translation reads like an original text) can be evaluated by scoring each target text sentence

•Texts can be selected to reflect user needs


Evaluation by end-users

•Scoring by human evaluators is subjective, so:

•Several evaluators are used and a mean score is calculated for each text

•Evaluators rate a number of texts translated by each system

•Human evaluation is expensive, so:

•Recent research has involved the investigation of automated evaluation methods
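The averaging described above (several evaluators per text, then a number of texts per system) can be sketched as follows. This is a minimal illustration only: the function names, the data layout and the 0-1 rating scale are assumptions made for the example, not details taken from any particular evaluation.

```python
from statistics import mean


def text_scores(ratings):
    """Average each text's ratings over its evaluators.

    `ratings` maps a text id to a list of scores, one per evaluator.
    """
    return {text_id: mean(scores) for text_id, scores in ratings.items()}


def system_score(ratings):
    """Average the per-text means to get a single score for one system."""
    return mean(list(text_scores(ratings).values()))


# Hypothetical fluency ratings: 2 texts, 3 evaluators each.
example = {"text1": [0.5, 0.75, 1.0], "text2": [0.25, 0.25, 0.25]}
print(text_scores(example))
print(system_score(example))
```

Averaging per text first, and only then per system, keeps a text rated by more evaluators from weighing more heavily than the others.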

Corpus content for MT evaluation by end-users

Essential:

•Source texts in one or more languages

•Machine translations of source texts by systems for evaluation

Not always essential:

•One or more expert human translations in selected target language(s) to be used as reference translations or for inclusion in evaluation with MT output

•Available evaluation scores if corpus is to be used to validate new automated evaluation methods
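One common way to use existing human scores to validate an automated method is to check how closely the two sets of scores correlate. The sketch below assumes Pearson correlation as the measure and uses invented scores; the slides do not specify a validation procedure, so both are assumptions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2 items)")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for four translated texts.
human_scores = [0.25, 0.5, 0.75, 1.0]        # mean human fidelity scores
automated_scores = [0.125, 0.25, 0.375, 0.5]  # scores from a candidate metric
print(pearson(human_scores, automated_scores))
```

A coefficient close to 1 would suggest that the automated method ranks texts much as the human evaluators do; the function raises `ZeroDivisionError` if either list is constant.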

Why compile a new corpus for MT evaluation? (1)

Existing corpora have limitations: many projects have involved the use of small numbers of texts in only one language pair

•Carroll (Pierce 1966): 144 Russian sentences (scientific); 3 English human translations

•Nagao et al. (1985): 1,682 Japanese sentences (scientific); 0 human translations

•Shiwen (1993): 3,200 English sentences (random); 1 Chinese human translation

•IBM BLEU (Papineni et al. 2001): approx. 500 Chinese sentences (news stories); up to 4 English human translations


Much research has made use of the DARPA 1994 corpus:

•Source texts: 100 French, 100 Spanish, 100 Japanese

•All newspaper articles of approx. 300-400 words/800 Japanese characters

•2 English human translations of each source text

•5 machine translations of each source text

•Scores for adequacy, fluency and informativeness for the translations of all 100 source texts in each language pair, produced by 5 MT systems and 1 human translator


We need:

•a corpus that reflects user needs (not just newspaper articles)

•a larger number of language pairs with English as a source and target language

•sub-corpora (for each language pair) large enough to provide reliable evaluation results

•at least one human translation and several machine translations of each source text

•human evaluation results for selected attributes (e.g. fidelity and fluency) for the validation of new automated evaluation methods

•a corpus available to all for MT evaluation research

How large should our new corpus be? (1)

The corpus must not be unnecessarily large:

•human MT evaluations are time-consuming and expensive

•expert human translations of each source text, if not already available, will be costly to produce

However:

•we need enough words to obtain reliable MT evaluation results


We carried out a statistical analysis of the DARPA 1994 scores for all 3 language pairs:

We calculated the mean score for each attribute and the overall score for each system with varying numbers of texts (1 to 100):

Source language French: Adequacy

          MT1    MT2    MT3    MT4    MT5    HT
Text 1    Score for text 1
Text 2    Mean score for texts 1 & 2
etc.      etc.
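The running mean in the table above (row n holds the mean score over texts 1 to n) can be computed in a few lines. This is a minimal sketch; the function name and the sample scores are illustrative and are not DARPA data.

```python
from itertools import accumulate


def running_means(scores):
    """Return the mean of the first n scores for n = 1 .. len(scores)."""
    return [total / n for n, total in enumerate(accumulate(scores), start=1)]


# Hypothetical adequacy scores for one system on four texts.
adequacy = [0.5, 1.0, 0.75, 0.25]
print(running_means(adequacy))  # [0.5, 0.75, 0.75, 0.625]
```

Running the function once per system and per attribute gives one curve for each system, which is what a plot of mean score against number of texts evaluated displays.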

[Figure: DARPA 1994 (French-English), mean adequacy scores for varying numbers of texts. x-axis: number of texts evaluated (1 to 100); y-axis: mean adequacy score (0 to 1). Series: Candide, Globalink, Metal, Systran, Human, XLT]

[Figure: DARPA 1994 (French-English), mean overall scores for varying numbers of texts. x-axis: number of texts evaluated (1 to 100); y-axis: mean overall score (0 to 1). Series: Candide, Globalink, Metal, Systran, Human, XLT]


Results from statistical analysis:

•10 texts (3,500 words), and often fewer, allow us to identify the highest (human) and lowest ranking system for individual attributes and overall scores

•10 texts also allow us to identify the highest-ranking MT system (but up to 30 texts are required for informativeness)

•After approx. 30 texts (10,500 words), scores become consistent, fluctuating only within a relatively small range

•After approx. 40 texts (14,000 words) we have a clearer picture of how all five MT systems compare and further sampling confirms this

•Further research: the same statistical analysis will be performed using texts from our new corpus and our chosen metrics
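One way to make "further sampling confirms this" concrete is to look for the smallest number of texts after which the ranking of systems by running mean no longer changes. The sketch below assumes that stability criterion, which is an illustrative choice and not necessarily the one used in the analysis above; the system names and scores are invented. The figure of 350 words per text follows from the word counts quoted above (10 texts, 3,500 words).

```python
from itertools import accumulate

WORDS_PER_TEXT = 350  # from the figures above: 10 texts = 3,500 words


def running_means(scores):
    """Mean of the first n scores for n = 1 .. len(scores)."""
    return [total / n for n, total in enumerate(accumulate(scores), start=1)]


def ranking(means):
    """System names ordered from highest to lowest mean score."""
    return sorted(means, key=means.get, reverse=True)


def texts_until_stable(scores_by_system):
    """Smallest n whose ranking matches the ranking at every larger n.

    All systems must have been scored on the same number of texts.
    """
    curves = {name: running_means(s) for name, s in scores_by_system.items()}
    n_texts = len(next(iter(curves.values())))
    rankings = [
        ranking({name: curve[i] for name, curve in curves.items()})
        for i in range(n_texts)
    ]
    stable_from = n_texts
    # Walk backwards from the last text until the ranking differs.
    for n in range(n_texts, 0, -1):
        if rankings[n - 1] != rankings[-1]:
            break
        stable_from = n
    return stable_from


# Hypothetical scores for two systems on four texts.
scores = {"A": [0.25, 1.0, 1.0, 1.0], "B": [0.5, 0.5, 0.5, 0.5]}
n = texts_until_stable(scores)
print(n, "texts, roughly", n * WORDS_PER_TEXT, "words")
```

On real data the result depends on the order in which texts are sampled, so repeating the calculation over several shuffled orders would give a more cautious estimate.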

Which language pairs should be included? (1)

•A variety of language pairs allows for the testing of portability of new evaluation methods

•The availability of MT systems for evaluation will influence our choice

•Our survey of MT users (ongoing since January 2003) is also providing guidelines …

[Figure: Language pairs translated by MT users (English as source language). Bar chart; y-axis: number of respondents (0 to 12). Categories, in the order shown: English-Spanish, English-French, English-German, English-Portuguese, English-Italian, English-Japanese, English-Danish, English-Dutch, English-Greek, English-Chinese, English-Finnish, English-Norwegian, English-Russian, English-Swedish, English-Vietnamese]

[Figure: Language pairs translated by MT users (English as target language). Bar chart; y-axis: number of respondents (0 to 10). Categories, in the order shown: German-English, French-English, Spanish-English, Italian-English, Japanese-English, Chinese-English, Portuguese-English, Finnish-English, Vietnamese-English]

Which language pairs should be included? (2)

Phase One: French, German, Spanish and Italian plus texts in typologically different languages (Chinese, Japanese) translated into English

Phase Two: Consider additional source languages (e.g. Portuguese and Russian into English)

Phase Three: English translated into other languages

Which text types should be included?

•MT systems are used in translation companies and international organisations to translate a number of different text types and topics

•These text types and a variety of topics must be represented in our corpus

•Our survey of MT users is providing guidelines on the kinds of texts and topics most frequently translated using MT systems …

[Figure: Text types machine translated by companies. Bar chart; y-axis: number of respondents (0 to 12). Categories, in the order shown: user manuals, technical docs, web pages, legal docs, internal company docs, business letters, instruction booklets, emails, medical docs, calls for tender, memos, newspaper articles, scientific docs, software strings, academic papers, patents, financial docs, tourist/travel info]

[Figure: Text types machine translated by single users. Bar chart; y-axis: number of respondents (0 to 4). Categories, in the order shown: web pages, academic papers, newspaper articles, technical docs, emails, tourist/travel info, scientific docs, medical docs, legal docs, internal company docs, business letters, patents, calls for tender, user manuals, software strings, memos, instruction booklets, financial docs]

Conclusions

•We aim to provide a minimum of 14,000 words per language pair (further research to be conducted)

•Text types will be based on responses to our survey, reflecting real MT use

•The text types for each language pair will be the same, to give balance

•Our corpus will be dynamic: updated to reflect changing trends in the MT user market

•The key feature will be detailed scores from human evaluations, available for research (particularly in automated MT evaluation)

•We plan to make our corpus and human evaluation results available online in 2004

Thank you

We welcome your questions
