
Page 1: Measuring Confidence Intervals for MT Evaluation Metrics

June 2004, DARPA TIDES MT Workshop

Measuring Confidence Intervals for MT Evaluation Metrics

Ying Zhang, Stephan Vogel

Language Technologies Institute, Carnegie Mellon University

Page 2: Measuring Confidence Intervals for MT Evaluation Metrics


MT Evaluation Metrics

• Human Evaluations (LDC)
  – Fluency and Adequacy

• Automatic Evaluation Metrics
  – mWER: edit distance between the hypothesis and the closest reference translation
  – mPER: position-independent error rate

  – BLEU: $\mathrm{BLEU} = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)$
  – Modified BLEU: $\text{M-BLEU} = BP \cdot \sum_{n=1}^{N} w_n p_n$
  – NIST: $\mathrm{NIST} = BP' \cdot \sum_{n=1}^{N} \frac{\sum_{\text{all } w_1 \ldots w_n \text{ that co-occur}} \mathrm{Info}(w_1 \ldots w_n)}{\sum_{\text{all } w_1 \ldots w_n \text{ in hyp}} (1)}$
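To make the BLEU formula concrete, here is a minimal sketch (the function name and the example precision values are illustrative, not from the slides; the brevity penalty BP follows Papineni et al. 2002):

```python
import math

def bleu_from_precisions(p, weights, hyp_len, ref_len):
    """BLEU = BP * exp(sum_n w_n * log p_n), per the formula above.
    p[n] are the modified n-gram precisions; weights are typically 1/N.
    Assumes all precisions are nonzero (log(0) is undefined)."""
    # Brevity penalty: 1 if the hypothesis is longer than the reference,
    # otherwise exp(1 - ref_len / hyp_len).
    bp = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return bp * math.exp(sum(w * math.log(pn) for w, pn in zip(weights, p)))

# Illustrative values: 4-gram BLEU with uniform weights.
print(bleu_from_precisions([0.60, 0.35, 0.22, 0.15], [0.25] * 4, 56, 56))
```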

Page 3: Measuring Confidence Intervals for MT Evaluation Metrics


Measuring the Confidence Intervals

• One score per test set

• How accurate is this score?

• To measure the confidence interval, a population is required

• Building a test set with multiple human reference translations is expensive

• Bootstrapping (Efron 1986)
  – Introduced in 1979 as a computer-based method for estimating the standard errors of statistical estimates

– Resampling: creating an artificial population by sampling with replacement

  – Proposed by Franz Och (2003) to measure confidence intervals for automatic MT evaluation metrics (see the sketch below)
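A minimal sketch of the percentile-bootstrap idea described above, assuming a hypothetical `score_fn` that aggregates per-sentence statistics into one corpus-level score:

```python
import random

def bootstrap_ci(sentence_stats, score_fn, b=2000, alpha=0.05):
    """Percentile bootstrap: resample the test set's sentences with
    replacement B times, score each artificial test set, and read the
    confidence interval off the sorted replicate scores."""
    n = len(sentence_stats)
    replicates = sorted(
        score_fn([random.choice(sentence_stats) for _ in range(n)])
        for _ in range(b)
    )
    lo = replicates[int(b * alpha / 2)]        # e.g. the 50th of 2000
    hi = replicates[int(b * (1 - alpha / 2))]  # e.g. the 1950th of 2000
    return lo, hi
```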

Page 4: Measuring Confidence Intervals for MT Evaluation Metrics


A Schematic of the Bootstrapping Process

[Figure: schematic of the bootstrapping process: the original test set (with score Score0) is resampled with replacement into B artificial test sets, each of which is scored.]

Page 5: Measuring Confidence Intervals for MT Evaluation Metrics


An Efficient Implementation

• Translate and evaluate on 2,000 test sets? No way!

• Resample the n-gram precision information for the sentences (see the sketch below)
  – Most MT systems are context-independent at the sentence level
  – MT evaluation metrics are based on information collected for each test sentence
  – E.g., for BLEU and NIST, the per-sentence record looks like:
      RefLen: 61 52 56 59
      ClosestRefLen: 56
      1-gram: 56 46 428.41
  – Similar for human judgments and other MT metrics

• Approximation for the NIST information gain
• Scripts available at: http://projectile.is.cs.cmu.edu/research/public/tools/bootStrap/tutorial.htm
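A sketch of the efficient scheme, under the assumption that each cached record holds (hyp_len, closest_ref_len, [(match_n, total_n) for n = 1..4]); the record layout and function names are illustrative, not the layout used by the released scripts:

```python
import math
import random

def corpus_bleu(stats, max_n=4):
    """Re-aggregate corpus BLEU from cached per-sentence counts, so no
    re-translation or metric re-run is needed. Assumes nonzero matches."""
    hyp_len = sum(s[0] for s in stats)
    ref_len = sum(s[1] for s in stats)
    log_sum = 0.0
    for n in range(max_n):
        match = sum(s[2][n][0] for s in stats)
        total = sum(s[2][n][1] for s in stats)
        log_sum += math.log(match / total) / max_n   # uniform weights 1/N
    bp = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return bp * math.exp(log_sum)

def bootstrap_scores(stats, b=2000):
    """B bootstrap replicates: resample the cached sentence records with
    replacement and re-aggregate -- cheap enough for B = 2000."""
    n = len(stats)
    return [corpus_bleu([random.choice(stats) for _ in range(n)])
            for _ in range(b)]
```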

Page 6: Measuring Confidence Intervals for MT Evaluation Metrics


Confidence Intervals

• 7 MT systems from the June 2002 evaluation

• Observations:
  – Relative confidence interval: NIST < M-Bleu < Bleu
  – I.e., NIST scores have more discriminative power than BLEU

Page 7: Measuring Confidence Intervals for MT Evaluation Metrics


Are Two MT Systems Different?

• Comparing two MT systems' performance
  – Using the same method as for a single system, but resampling the score difference (a paired sketch follows below)
  – E.g., Diff(Sys1 - Sys2): Median = -1.7355, interval [-1.9056, -1.5453]
  – If the confidence interval of the difference contains 0, the two systems are not significantly different
  – M-Bleu and NIST have more discriminative power than Bleu
  – The automatic metrics correlate highly with the human ranking
  – Human judges prefer system E (a syntactic system) over system B (a statistical system), but the automatic metrics do not
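A minimal sketch of the paired comparison, reusing the hypothetical `score_fn` aggregation from the earlier sketches; one shared set of resampled sentence indices is applied to both systems:

```python
import random

def paired_diff_ci(stats_sys1, stats_sys2, score_fn, b=2000, alpha=0.05):
    """Paired bootstrap over two systems' cached statistics for the same
    test sentences: each replicate draws one index sample and scores both
    systems on it, so the difference is measured on identical data."""
    n = len(stats_sys1)
    assert len(stats_sys2) == n    # same test set for both systems
    diffs = sorted(
        score_fn([stats_sys1[i] for i in idx]) -
        score_fn([stats_sys2[i] for i in idx])
        for idx in ([random.randrange(n) for _ in range(n)] for _ in range(b))
    )
    lo = diffs[int(b * alpha / 2)]
    hi = diffs[int(b * (1 - alpha / 2))]
    # If [lo, hi] contains 0, the two systems are not significantly different.
    return lo, hi
```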

Page 8: Measuring Confidence Intervals for MT Evaluation Metrics


How much testing data is needed

[Four figure panels, one per metric, each plotting systems A-G against the percentage of testing data used (10%-100%): "NIST Scores" (y-axis: NIST score, 3.5-8), "BLEU Scores" (y-axis: BLEU score, 0-0.3), "M-Bleu Scores" (y-axis: M-Bleu score, 0.05-0.17), and "F+A Human Judgments based on Different Size of Testing Data" (y-axis: human judgment, 4-6).]

Page 9: Measuring Confidence Intervals for MT Evaluation Metrics


How much testing data is needed

• NIST scores increase steadily with the growing test set size

• The distance between the scores of the different systems remains stable when using 40% or more of the test set

• The confidence intervals become narrower for larger test sets

* System A (bootstrap size B = 2000)

Page 10: Measuring Confidence Intervals for MT Evaluation Metrics


How many reference translations are sufficient?

• Confidence intervals become narrower with more reference translations

• Roughly: 100% of the data with 1 reference ≈ 80-90% with 2 references ≈ 70-80% with 3 references ≈ 60-70% with 4 references

• One additional reference translation compensates for 10-15% of the testing data

* System A (bootstrap size B = 2000)

Page 11: Measuring Confidence Intervals for MT Evaluation Metrics


Bootstrap-t interval vs. normal/t interval

• Normal distribution / t-distribution
  – Assuming $\frac{\hat{\theta} - \theta}{\widehat{se}} \sim N(0, 1)$, the interval is $[\hat{\theta} - z^{(1-\alpha)} \cdot \widehat{se},\ \hat{\theta} - z^{(\alpha)} \cdot \widehat{se}]$

• Student's t-interval (when n is small)
  – Assuming $\frac{\hat{\theta} - \theta}{\widehat{se}} \sim t_{n-1}$, the interval is $[\hat{\theta} - t_{n-1}^{(1-\alpha)} \cdot \widehat{se},\ \hat{\theta} - t_{n-1}^{(\alpha)} \cdot \widehat{se}]$

• Bootstrap-t interval
  – For each bootstrap sample b, calculate $Z^*(b) = \frac{\hat{\theta}^*(b) - \hat{\theta}}{\widehat{se}^*(b)}$
  – The $\alpha$-th percentile is estimated by the value $\hat{t}^{(\alpha)}$ such that $\#\{Z^*(b) \leq \hat{t}^{(\alpha)}\} / B = \alpha$
  – The bootstrap-t interval is $[\hat{\theta} - \hat{t}^{(1-\alpha)} \cdot \widehat{se},\ \hat{\theta} - \hat{t}^{(\alpha)} \cdot \widehat{se}]$
  – E.g., if B = 1000, the 50th largest and the 950th largest values of $Z^*(b)$ give the bootstrap-t interval (a sketch follows below)
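A minimal sketch of the bootstrap-t computation, assuming the corpus score is a mean of per-sentence scores and the standard error is estimated as stdev/sqrt(n); both are simplifying assumptions, not the slides' exact estimator:

```python
import math
import random
import statistics

def bootstrap_t_interval(scores, b=1000, alpha=0.05):
    """Bootstrap-t: studentize each replicate with its own standard error,
    take the alpha and (1 - alpha) percentiles of Z*, then invert around
    the observed score, as in the formulas above."""
    n = len(scores)
    theta = statistics.mean(scores)
    se = statistics.stdev(scores) / math.sqrt(n)
    z_star = []
    for _ in range(b):
        sample = [random.choice(scores) for _ in range(n)]
        se_b = statistics.stdev(sample) / math.sqrt(n)
        z_star.append((statistics.mean(sample) - theta) / se_b)
    z_star.sort()
    t_lo = z_star[int(b * alpha)]        # ~the 50th value when B = 1000
    t_hi = z_star[int(b * (1 - alpha))]  # ~the 950th value when B = 1000
    return theta - t_hi * se, theta - t_lo * se
```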

Page 12: Measuring Confidence Intervals for MT Evaluation Metrics


Bootstrap-t interval vs. Normal/t interval (Cont.)

• The bootstrap-t interval assumes no particular distribution, but
  – it can give erratic results
  – it can be heavily influenced by a few outlying data points
• When B is large, the bootstrap sample scores are quite close to a normal distribution
• Assuming a normal distribution gives more reliable intervals (a sketch follows the histogram below); e.g., for the BLEU relative confidence interval (B = 500):
  – STDEV = 0.27 for the bootstrap-t interval
  – STDEV = 0.14 for the normal/Student's-t interval

[Figure: "Histogram of 2000 BLEU Scores" - frequency (0-160) vs. BLEU score.]
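A minimal sketch of the normal-approximation alternative, fitting a Gaussian to the bootstrap replicate scores (function name is illustrative; the z-quantile is computed rather than hard-coded):

```python
import statistics
from statistics import NormalDist

def normal_interval(replicates, alpha=0.05):
    """Normal-approximation interval from bootstrap replicate scores:
    with large B the replicates are close to Gaussian (see the histogram
    above), so mean +/- z * stdev estimates the interval stably."""
    mu = statistics.mean(replicates)
    sigma = statistics.stdev(replicates)
    z = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for a 95% interval
    return mu - z * sigma, mu + z * sigma
```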

Page 13: Measuring Confidence Intervals for MT Evaluation Metrics


The Number of Bootstrap Replications B

• The ideal bootstrap estimate of the confidence interval takes B → ∞
• Computational time increases linearly with B
• The greater B is, the smaller the standard deviation of the estimated confidence intervals; e.g., for BLEU's relative confidence interval:
  – STDEV = 0.60 when B = 100
  – STDEV = 0.27 when B = 500
• Two rules of thumb (see the sketch below):
  – Even a small B, say B = 100, is usually informative
  – B > 1000 gives quite satisfactory results
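One way to see the effect of B empirically is to repeat the whole bootstrap several times at a given B and measure how much the interval width wobbles; this is a hypothetical diagnostic, not a procedure from the slides:

```python
import random
import statistics

def interval_width_stdev(scores, b, trials=20):
    """Repeat the full bootstrap `trials` times with B = b replicates and
    return the standard deviation of the resulting 95% interval widths;
    the spread shrinks as B grows."""
    n = len(scores)
    widths = []
    for _ in range(trials):
        reps = sorted(
            statistics.mean(random.choices(scores, k=n)) for _ in range(b)
        )
        widths.append(reps[int(b * 0.975)] - reps[int(b * 0.025)])
    return statistics.stdev(widths)
```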

Page 14: Measuring Confidence Intervals for MT Evaluation Metrics


References

• Efron, B. and Tibshirani, R.: 1986, 'Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy', Statistical Science 1, pp. 54-77.

• Och, F.: 2003, 'Minimum Error Rate Training in Statistical Machine Translation', In Proc. of ACL, Sapporo, Japan.

• Bisani, M. and Ney, H.: 2004, 'Bootstrap Estimates for Confidence Intervals in ASR Performance Evaluation', In Proc. of ICASSP, Montreal, Canada, Vol. 1, pp. 409-412.

• Leusch, G., Ueffing, N. and Ney, H.: 2003, 'A Novel String-to-String Distance Measure with Applications to Machine Translation Evaluation', In Proc. of the 9th MT Summit, New Orleans, LA.

• Melamed, I. Dan, Green, Ryan and Turian, Joseph P.: 2003, 'Precision and Recall of Machine Translation', In Proc. of NAACL/HLT 2003, Edmonton, Canada.

• King, M., Popescu-Belis, A. and Hovy, E.: 2003, 'FEMTI: Creating and Using a Framework for MT Evaluation', In Proc. of the 9th MT Summit, New Orleans, LA, USA.

• Nießen, S., Och, F.J., Leusch, G. and Ney, H.: 2000, 'An Evaluation Tool for Machine Translation: Fast Evaluation for MT Research', In Proc. of LREC 2000, Athens, Greece.

• NIST Report: 2002, 'Automatic Evaluation of Machine Translation Quality Using N-gram Co-occurrence Statistics', http://www.nist.gov/speech/tests/mt/doc/ngram-study.pdf

• Papineni, Kishore, Roukos, Salim, et al.: 2002, 'BLEU: a Method for Automatic Evaluation of Machine Translation', In Proc. of the 40th ACL.

• Zhang, Ying, Vogel, Stephan and Waibel, Alex: 2004, 'Interpreting BLEU/NIST Scores: How Much Improvement Do We Need to Have a Better System?', In Proc. of LREC 2004, Lisbon, Portugal.