May 26th 2014
Continuous Operational Evaluation of Evolving Proprietary MT Solution’s Adequacy
Ekaterina Stambolieva, [email protected]
Outline
• Why?
• MT Adequacy?
• What?
• Evaluation
• Findings
• Conclusion & Future Work
MTE, May 26th 2014
WHY?
Impending industry problem:
How do we compare MT systems over time?
We measure MT quality continuously
BLEU?
We want adequate translations
ADEQUACY
How do we define MT adequacy in business?
– accelerate time-to-delivery
– reduce translation costs
– achieve near-native fluency
Adequacy: improving MT output’s acceptance for the task of post-editing
WHAT
We aim to evaluate our MT systems continuously and compare results over time
We design our system’s improvements based on human end-user feedback
We do not evaluate translation quality directly; instead, we assess how MT output improves over time
No annotation effort required
Outline
• Why?
• MT Adequacy?
• What?
• Evaluation
  – Edit Distance
• Findings
• Conclusion & Future Work
THE EXAMPLE
We compare the results of 2 MT English<->Danish systems
BLEU
          1       2
EN->DA    59.22   58.84
DA->EN    64.26   63.98
CATEGORIES
3 objective categories to evaluate MT output
– Does the MT output look better than before?
– Does the MT output look worse than before?
– Is it difficult for you to judge whether the MT output is better or not?
EVALUATION
We will present MT output evaluation based on the Edit Distance (ED) score
We compute how many edits it takes to transform the MT output into the human translation of the same source segment
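The edit count described here can be sketched as a standard Levenshtein distance. This is a minimal illustration, not the presenter's implementation: the slides do not say whether edits are counted over characters or words, or how the count is scaled into the reported ED score, so the character-level choice and the function name `edit_distance` are assumptions.

```python
def edit_distance(mt_output: str, human_translation: str) -> int:
    """Character-level Levenshtein distance (assumed granularity).

    Counts the minimum number of single-character insertions, deletions
    and substitutions that turn mt_output into human_translation.
    """
    # prev[j] is the distance between the MT prefix processed so far
    # and the first j characters of the human translation.
    prev = list(range(len(human_translation) + 1))
    for i, mt_char in enumerate(mt_output, start=1):
        curr = [i]  # turning i MT characters into an empty string takes i deletions
        for j, ref_char in enumerate(human_translation, start=1):
            cost = 0 if mt_char == ref_char else 1
            curr.append(min(
                prev[j] + 1,         # delete mt_char
                curr[j - 1] + 1,     # insert ref_char
                prev[j - 1] + cost,  # substitute, or keep a matching character
            ))
        prev = curr
    return prev[-1]


# Classic textbook pair: three edits separate the two strings.
print(edit_distance("kitten", "sitting"))  # 3
```

In this setting `mt_output` would be one machine-translated segment and `human_translation` the human translation of the same source segment; a lower count means fewer edits separate the two.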
FINDINGS
new MT ED   old MT ED
87.08       71.31
94.77       87.44
82.62       66.04
74.19       73.84
84.36       79.79
91.26       88.06
75.12       74.48
              Y     X     N
Annotator 1   60%   36%   4%
Annotator 2   76%   16%   8%
Annotator 3   68%   24%   8%
Improved MT acceptance for the task of post-editing
Length variance between the MT output of the new and old systems does not affect MT acceptance
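As a reading aid for the ED table in the findings, the per-row difference and the column means can be computed as below. The numbers are copied from the table; the slides do not state what each row represents or how the ED score is scaled, so this sketch makes no claim beyond the arithmetic. The variable names are illustrative.

```python
from statistics import mean

# (new MT ED, old MT ED) pairs, copied row by row from the findings table.
ed_scores = [
    (87.08, 71.31),
    (94.77, 87.44),
    (82.62, 66.04),
    (74.19, 73.84),
    (84.36, 79.79),
    (91.26, 88.06),
    (75.12, 74.48),
]

# Per-row change from the old system to the new one.
deltas = [round(new - old, 2) for new, old in ed_scores]

# Column averages and their difference.
mean_new = round(mean(new for new, _ in ed_scores), 2)
mean_old = round(mean(old for _, old in ed_scores), 2)
mean_delta = round(mean_new - mean_old, 2)

print("per-row change:", deltas)
print("mean new:", mean_new, "mean old:", mean_old, "difference:", mean_delta)
```

Every row of the table has a higher new-system score than old-system score, but the size of the change varies considerably from row to row.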
FUTURE WORK
Modify ED to take into consideration the number of UNK words
Modify the metric so that it detects small improvements in the system
– such as number isolation
– tag protection
Take segment character length into consideration
– So as not to penalize shorter segments too much
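The segment-length point can be illustrated with a small sketch. It assumes, purely for illustration, that the score is a length-normalized similarity percentage; the slides do not define how ED is scaled, and `similarity_percent` and `levenshtein` are hypothetical names. Under that assumption a single edit costs a short segment far more than a long one, which is the kind of effect the last bullet proposes to correct.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level Levenshtein distance between a and b."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity_percent(mt_output: str, human_translation: str) -> float:
    """Assumed scoring: 100 * (1 - distance / length of the longer segment)."""
    longest = max(len(mt_output), len(human_translation))
    if longest == 0:
        return 100.0  # two empty segments are identical
    return 100.0 * (1 - levenshtein(mt_output, human_translation) / longest)


# One wrong character in a 2-character segment and in a 20-character segment.
short_score = similarity_percent("ok", "on")
long_score = similarity_percent("a" * 19 + "b", "a" * 20)
print(short_score, long_score)
```

Both pairs differ by exactly one edit, yet the short segment scores much lower than the long one. Any length-aware adjustment of the metric would need to narrow that gap; the slides leave the exact form of the adjustment open.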
Thank you