May 26th 2014

Continuous Operational Evaluation of Evolving Proprietary MT Solution’s Adequacy

Ekaterina [email protected]

Why?

MT Adequacy?

What?

Evaluation

Findings

Conclusion & Future Work

Outline

impending industry problem:

WHY?

MTE, May 26th 2014



How do we compare MT systems over time?


We measure MT quality continuously



BLEU? We want adequate translations


How do we define MT adequacy in business?

ADEQUACY


– accelerate time-to-delivery

– reduce translation costs

– achieve near-native fluency

adequacy: improving MT output’s acceptance for the task of post-editing

We aim to evaluate our MT systems continuously and compare the results over time

WHAT



We design our system’s improvements based on human end-user feedback




We do not evaluate translation quality directly; instead, we assess how MT output improves over time


no annotation effort required



We compare the results of two English<->Danish MT systems

THE EXAMPLE


BLEU

System   1      2
EN->DA   59.22  58.84
DA->EN   64.26  63.98

3 objective categories to evaluate MT output

– Does the MT output look better than before?

– Does the MT output look worse than before?

– Is it difficult for you to judge whether the MT output is better or not?

CATEGORIES


We will present MT output evaluation based on the Edit Distance (ED) score

EVALUATION


We compute how many edits it takes to transform the MT output into the human translation of the same source segment
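The edit count described here can be sketched as a standard Levenshtein distance. This is a minimal illustration, not the presenters' implementation: the slides do not say whether edits are counted over characters or words, so the hypothetical `edit_distance` function accepts any sequence, and the example segments are invented.

```python
def edit_distance(hyp, ref):
    """Levenshtein distance: the minimum number of insertions, deletions
    and substitutions needed to turn `hyp` into `ref`.
    Works on strings (character edits) or token lists (word edits)."""
    # prev[j] is the distance between the hyp prefix seen so far and ref[:j].
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete h
                            curr[j - 1] + 1,      # insert r
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Invented example segments (assumption: not taken from the talk's data).
mt_output = "jeg har en kat"
human_translation = "jeg har en hund"

char_edits = edit_distance(mt_output, human_translation)
word_edits = edit_distance(mt_output.split(), human_translation.split())
print(char_edits, word_edits)
```

Counting over tokens treats a wrong word as one edit regardless of its length, while counting over characters rewards near-miss word forms; which granularity was used in the talk is not stated.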


new MT ED   old MT ED

87.08       71.31
94.77       87.44
82.62       66.04
74.19       73.84
84.36       79.79
91.26       88.06
75.12       74.48

FINDINGS
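To make the size of the change easier to read, the per-row difference between the new and old ED scores can be computed directly from the reported figures. The sketch below assumes that a higher score is better, which is inferred from the "improved MT acceptance" finding rather than stated on the slides; the rows are unlabeled in the source, so they are indexed by position.

```python
# (new MT ED, old MT ED) pairs copied from the findings table, in slide order.
# The slides do not label the rows, so they are identified by position only.
scores = [
    (87.08, 71.31),
    (94.77, 87.44),
    (82.62, 66.04),
    (74.19, 73.84),
    (84.36, 79.79),
    (91.26, 88.06),
    (75.12, 74.48),
]

# Gain per row, rounded to the two decimals used on the slides.
gains = [round(new - old, 2) for new, old in scores]
mean_gain = round(sum(gains) / len(gains), 2)

for row, ((new, old), gain) in enumerate(zip(scores, gains), start=1):
    print(f"row {row}: new={new:.2f} old={old:.2f} gain={gain:+.2f}")
print(f"mean gain: {mean_gain:+.2f}")
```

Every row shows a positive difference, ranging from 0.35 to 16.58 points.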


             Y     X     N
Annotator 1  60%   36%   4%




Improved MT acceptance for the task of post-editing


Length variance between the MT output of the new and the old system does not affect MT acceptance


Modify ED to take into consideration the number of UNK words

Modify the metric so that it detects small improvements in the system

– such as number isolation

– tag protection

Take segment character length into consideration

– So as not to over-penalize shorter segments

FUTURE WORK
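The concern about shorter segments can be illustrated with a small sketch. It assumes, purely for illustration, a higher-is-better percentage obtained by dividing the edit count by the character length of the longer segment; the slides do not define how the ED score is normalised, and `length_normalised_score` is a hypothetical name.

```python
def length_normalised_score(edits, mt_len, ref_len):
    """Hypothetical higher-is-better percentage: 100 means no edits.
    For a Levenshtein edit count, dividing by the longer segment
    keeps the result between 0 and 100."""
    longest = max(mt_len, ref_len)
    if longest == 0:
        return 100.0  # two empty segments are identical
    return 100 * (longest - edits) / longest


# The same single edit costs a short segment far more than a long one.
short = length_normalised_score(edits=1, mt_len=5, ref_len=5)
long_ = length_normalised_score(edits=1, mt_len=50, ref_len=50)
print(short, long_)
```

Under this assumed normalisation, one edit lowers a 5-character segment to 80 but a 50-character segment only to 98, which is the kind of imbalance a length-aware metric would need to correct.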


Thank you
