
Page 1

On Comparing Classifiers: Pitfalls to Avoid and Recommended Approach
Published by Steven L. Salzberg
Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA

Presenter: Jiyeon Kim (April 14th, 2014)

Page 2

Introduction

How does the researcher choose which classification algorithm to use for a new problem?

Comparing the effectiveness of different algorithms on public databases – opportunities or dangers?

Are the many comparisons that rely on widely shared datasets statistically valid?

Page 3

Contents

1 Definitions

2 Comparing Algorithms

3 Statistical Validity

4 Conclusions

> Candidate Questions

Page 4

1 Definitions

Paired T-Test

Hypothesis Testing

Significance Level (α)

P-Value

Page 5

1/1 Paired T-Test

• To determine whether two paired sets differ from each other in a significant way

• Assumption: the paired differences are independent and identically normally distributed
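To make this concrete, here is a minimal sketch (not from the paper or the slides) of a paired t-test in Python using scipy.stats.ttest_rel; the per-fold accuracy numbers for the two hypothetical classifiers A and B are made up for illustration.

```python
# A minimal sketch of a paired t-test on hypothetical per-fold accuracies
# of two classifiers A and B (the numbers are illustrative, not real results).
from scipy import stats

acc_a = [0.81, 0.79, 0.84, 0.80, 0.82, 0.78, 0.83, 0.80, 0.81, 0.79]
acc_b = [0.78, 0.77, 0.80, 0.79, 0.80, 0.76, 0.81, 0.78, 0.79, 0.77]

# ttest_rel runs the paired (dependent-samples) t-test on the differences
# acc_a[i] - acc_b[i].
t_stat, p_value = stats.ttest_rel(acc_a, acc_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

# Decision rule (see the next slides): reject H0 when p < significance level.
alpha = 0.05
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

The comparison of the p-value against α in the last step is exactly the decision rule spelled out on the next three slides.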

Page 6

1/2 Hypothesis Testing

• Null Hypothesis (H0) vs. Alternative Hypothesis (H1)

• Reject the null hypothesis (H0) if the p-value is less than the significance level

• e.g. For the paired t-test, H0: there is no difference between the two populations. H1: there is a statistically significant difference.

Page 7

1/3 Significance Level, α

• Informally, the percentage of the time in which the experimenter is willing to make an error (report a difference that is not really there)

• Usually the significance level is chosen to be 0.05 (or equivalently, 5%)

• A fixed probability of wrongly rejecting the null hypothesis H0 when it is in fact true ( = P(Type I error) )

Page 8

1/4 P-Value

• The probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true

• "Reject the null hypothesis (H0)" when the p-value turns out to be less than a certain significance level, often 0.05

Page 9

2 Comparing Algorithms

• Much of the empirical validation in classification research has serious experimental deficiencies

• Be careful before concluding that a new method is significantly better on well-studied datasets

Page 10

3 Statistical Validity

< Multiplicity Effect >

e.g. Assume that you do 154 experiments (two-tailed, paired t-tests) at significance level 0.05

- You have 154 chances to obtain a "significant" result
- Even if no real differences exist, the expected number of spuriously significant results is 154 * 0.05 = 7.7

So a handful of false positives is practically guaranteed!
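As an illustration of this effect (my own sketch, not part of the slides), the simulation below runs 154 paired t-tests on synthetic data in which the null hypothesis is true by construction; typically about 7 or 8 of them still come out "significant" at the 0.05 level.

```python
# Sketch of the multiplicity effect: 154 paired t-tests on synthetic data
# where the null hypothesis is TRUE by construction, yet roughly
# 154 * 0.05 = 7.7 of them come out "significant" at the 0.05 level.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n_tests, n_pairs = 0.05, 154, 30

false_positives = 0
for _ in range(n_tests):
    # Paired scores for two "algorithms" with no real difference between them.
    a = rng.normal(loc=0.80, scale=0.02, size=n_pairs)
    b = a + rng.normal(loc=0.0, scale=0.02, size=n_pairs)
    if stats.ttest_rel(a, b).pvalue < alpha:
        false_positives += 1

print(f"{false_positives} spuriously significant results "
      f"(expected about {n_tests * alpha:.1f})")
```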

Page 11

3 Statistical Validity

< Bonferroni Adjustment >

① Let α* be the error rate of each individual experiment
② Then (1 - α*) is the chance of reaching the right conclusion in that experiment
③ If we conduct n independent experiments, the chance of getting them all right is (1 - α*)ⁿ
④ So the chance that we will make at least one mistake is α = 1 - (1 - α*)ⁿ
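Step ④ is easy to turn into a one-line helper; this is a sketch (the function name is mine, not the paper's).

```python
# Sketch of step 4: the chance of making at least one mistake across
# n independent tests, each run at per-test significance level alpha_star.
def family_wise_error(alpha_star: float, n: int) -> float:
    return 1.0 - (1.0 - alpha_star) ** n

print(family_wise_error(0.05, 154))  # about 0.9996, the number on the next slide
```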

Page 12

3 Statistical Validity

< Bonferroni Adjustment >
e.g. (This is not correct usage!)

Assume again that you do 154 experiments (two-tailed, paired t-tests) at significance level 0.05

① The significance level for each experiment: α* = 0.05
② Then the right-conclusion rate is (1 - α*) = (1 - 0.05) = 0.95
③ The chance of getting them all right is (1 - 0.05)^154
④ So the chance of at least one mistake across all the experiments is
   1 - (1 - α*)ⁿ = 1 - (1 - 0.05)^154 ≈ 0.9996

Now you have a 99.96% error rate!

Page 13

3 Statistical Validity

< Bonferroni Adjustment >
⇒ "Then, what should we do?"

e.g. (This is the correct usage!)
① Require α = 1 - (1 - α*)^154 ≤ 0.05 in order to obtain results significant at the 0.05 level over 154 tests
② This gives α* ≤ 0.0003, which is far more stringent than the original significance level of 0.05!
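The inversion in step ① can likewise be sketched in a few lines (again, the function name is my own, not the paper's).

```python
# Sketch: invert alpha = 1 - (1 - alpha_star)**n to find the per-test level
# alpha_star that keeps the overall chance of a mistake at or below alpha.
def per_test_alpha(alpha: float, n: int) -> float:
    return 1.0 - (1.0 - alpha) ** (1.0 / n)

print(per_test_alpha(0.05, 154))  # about 0.00033, i.e. the alpha* <= 0.0003 above
print(per_test_alpha(0.01, 10))   # about 0.0010 (used again in Exam Question 2)
```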

Page 14

3 Statistical Validity

< Bonferroni Adjustment >

/ CAVEAT /
This argument is very rough, because it assumes that all the experiments are independent of one another!

Page 15

Statistical Validity / 3/1 Alternative statistical tests

* Recommended Tests

Simple Binomial Test

ANOVA (Analysis of Variance)

(with Bonferroni Adjustment)

Page 16

Statistical Validity / 3/1 Alternative statistical tests

To compare two algorithms (A and B), a comparison must consider four numbers:

① The number of examples that A got right and B got wrong ⇒ A > B
② The number of examples that B got right and A got wrong ⇒ B > A
③ The number that both algorithms got right
④ The number that both algorithms got wrong

Page 17

Statistical Validity / 3/1 Alternative statistical tests

Of those four numbers, only the first two distinguish the algorithms:

① The number of examples that A got right and B got wrong ⇒ A > B
② The number of examples that B got right and A got wrong ⇒ B > A

⇒ A simple but much improved way to compare them: the binomial test! (A sketch follows below.)
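Here is a minimal sketch of that binomial (sign) test using scipy.stats.binomtest (available in SciPy 1.7 and later, replacing the older binom_test); the counts n_a and n_b are hypothetical, and ties are simply left out of the test, as the slide suggests.

```python
# Sketch of the simple binomial (sign) test on hypothetical counts:
# n_a = examples A got right and B got wrong, n_b = the reverse.
# Ties (both right or both wrong) are thrown out of the test.
from scipy.stats import binomtest

n_a, n_b = 30, 18

# Under H0 ("no real difference"), each non-tied example is equally
# likely to favour A or B, i.e. a fair coin with p = 0.5.
result = binomtest(n_a, n=n_a + n_b, p=0.5, alternative="two-sided")
print(f"p-value = {result.pvalue:.4f}")
print("Reject H0" if result.pvalue < 0.05 else "Fail to reject H0")
```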

Page 18

Statistical Validity / 3/2 Community Experiments

• Even when using strict significance criteria and the appropriate significance tests, some results will still be mere 'accidents of chance'

• In order to deal with this phenomenon, the most helpful resolution is duplication!

Page 19

Statistical Validity / 3/3 Repeated Tuning

• Algorithms are tuned repeatedly on some datasets

• Whenever tuning takes place, every adjustment should be considered a separate experiment

e.g. If 10 'tuning' experiments were attempted, then the significance level should be 0.005 instead of 0.05
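A quick check of that example, under the simple α/n (Bonferroni) division rule: 0.05 / 10 = 0.005; the exact inversion of the formula from the earlier slides, 1 - (1 - 0.05)^(1/10) ≈ 0.0051, gives essentially the same threshold.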

Page 20

Statistical Validity / 3/3 Repeated Tuning

< Recommended Approach >

To establish the new algorithm's comparative merits:

① Choose another algorithm, the one most similar to the new one, to include in the comparison
② Choose a benchmark data set that illustrates the strengths of the new algorithm
③ Divide the data set into k subsets for cross-validation
④ Run the cross-validation
⑤ To compare the algorithms, use the appropriate statistical test

Page 21

Statistical Validity / 3/3 Repeated Tuning

< Cross-Validation >

(A) For each of the k subsets of the data set D, create a training set T = D minus that subset
(B) Divide each training set into two smaller subsets, T1 and T2; T1 is used for training, and T2 for tuning
(C) Once the parameters are optimized, re-run training on the larger set T
(D) Finally, measure accuracy on the held-out subset
(E) Overall accuracy is averaged across all k partitions; these k values also give an estimate of the variance of the algorithms

(A minimal code sketch of this procedure follows below.)
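Below is a minimal sketch of steps (A)-(E) using scikit-learn; the two classifiers (a decision tree and k-nearest neighbours), the parameter grids, and the synthetic data set are my own illustrative assumptions, not choices made in the paper.

```python
# Sketch of steps (A)-(E): k-fold cross-validation with an inner
# training/tuning split, comparing two illustrative classifiers.
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
k = 10
acc = {"tree": [], "knn": []}

for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
    # (A) T = D minus the held-out subset.
    X_T, y_T = X[train_idx], y[train_idx]
    X_hold, y_hold = X[test_idx], y[test_idx]

    # (B) Split T into T1 (training) and T2 (tuning).
    X_T1, X_T2, y_T1, y_T2 = train_test_split(X_T, y_T, test_size=0.25, random_state=0)

    for name, build, grid in [
        ("tree", lambda d: DecisionTreeClassifier(max_depth=d, random_state=0), [2, 4, 8, None]),
        ("knn", lambda m: KNeighborsClassifier(n_neighbors=m), [1, 3, 5, 9]),
    ]:
        # Tune: keep the parameter value that scores best on T2.
        best = max(grid, key=lambda p: build(p).fit(X_T1, y_T1).score(X_T2, y_T2))
        # (C) Re-train on the full training set T with the chosen parameter.
        model = build(best).fit(X_T, y_T)
        # (D) Measure accuracy on the held-out subset.
        acc[name].append(model.score(X_hold, y_hold))

# (E) Average accuracy over the k partitions, then compare the two algorithms.
print({name: round(float(np.mean(scores)), 3) for name, scores in acc.items()})
print(stats.ttest_rel(acc["tree"], acc["knn"]))
```

The last line applies a paired t-test across the k folds; per the earlier slides, a Bonferroni adjustment (or the simple binomial test) would apply if many such comparisons are run.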

Page 22

4 Conclusions

• No single technique is likely to work best on all databases

• Empirical comparisons should be done to validate algorithms, but these studies must be very careful!
  - Comparative work should be done in a statistically acceptable framework

• The contents above are meant to help experimental researchers steer clear of problems in designing a comparative study.

Page 23

> Exam Questions

1

Q) Why should we apply the Bonferroni adjustment when comparing classifiers?

Page 24

> Exam Questions

1

A) With multiple tests, the multiplicity effect arises if we use the same significance level for each individual test as for the whole set of tests. So we need a more stringent level for each experiment, obtained via the Bonferroni adjustment.

Page 25

> Exam Questions

2

Q) Assume that you will do 10 experiments to compare two classification algorithms.

Using the Bonferroni adjustment, determine the criterion for α* (the significance level for each experiment) in order to get results that are truly significant at the 0.01 level over the 10 tests.

Page 26

> Exam Questions

2

A)

α = 1 - (1 - α*)^10 ≤ 0.01
(1 - α*)^10 ≥ 0.99
1 - α* ≥ 0.99^(1/10) ≈ 0.9990

∴ α* ≤ 0.0010 (≈ 0.001)
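As a quick numerical check: 0.99^(1/10) = exp(ln(0.99)/10) ≈ exp(-0.001005) ≈ 0.99900, so α* ≤ 1 - 0.99900 ≈ 0.0010; the simple α/n division gives the nearly identical 0.01 / 10 = 0.001.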

Page 27

> Exam Questions

3

Q) Specify the difference between the paired t-test and the simple binomial test when comparing two algorithms.

Page 28

> Exam Questions

3

A)
Paired t-test: determines whether the mean difference between the two algorithms' scores is significantly different from zero

Binomial test: compares the percentage of times 'algorithm A > algorithm B' versus 'A < B', throwing out the ties

Page 29

Thank You. 감사합니다.