cs550 presentation - on comparing classifiers by Salzberg
DESCRIPTION
A presentation of the paper written by Steven L. Salzberg on comparing classifiers.

TRANSCRIPT
On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach (cited by 581)
Author: Steven L. Salzberg
Presented by: Mehmet Ali Abbasoğlu & Mustafa İlker Saraç
10.04.2014
Contents
1. Motivation
2. Comparing Algorithms
3. Definitions
4. Problems
5. Recommended Approach
6. Conclusion
Motivation
● Be careful with comparative studies of classification and other algorithms.
○ It is easy to reach statistically invalid conclusions.
● How do you choose which algorithm to use for a new problem?
● Using brute force, one can easily find a phenomenon or pattern that looks impressive.
○ REALLY?
Motivation
● You have lots of data.
○ Choose a data set from the UCI repository.
● You have many classification methods to compare.
But,
● Should any difference in classification accuracy that reaches statistical significance be reported as important?
○ Think again!
Comparing Algorithms
● Many new algorithms have problems, according to a survey conducted by Prechelt.
○ 29% were not evaluated on a real problem.
○ Only 8% were compared to more than one alternative on real data.
● A survey by Flexer on experimental neural network papers in leading journals:
○ Only 3 out of 43 used a separate data set for tuning parameters.
Comparing Algorithms
● Drawbacks of reporting results on a well-studied data set, e.g. a data set from the UCI repository:
○ It is hard to improve on existing results.
○ Prone to statistical accidents.
○ They are fine for seeing initial results for your new algorithm.
● It seems easy to change a known algorithm a little and then use comparisons to report improved results.
○ High risk of statistical invalidity.
○ Better to apply the algorithm to new problems.
Definitions
● Statistical significance
○ In statistics, a result is considered significant not because it is important or meaningful, but because it is unlikely to have occurred by chance alone.
● t-test
○ Used to determine whether two sets of data are significantly different from each other.
● p-value
○ The probability of observing results at least as extreme as those obtained, assuming the null hypothesis is true.
● null hypothesis
○ The default position, e.g. that there is no difference between the two algorithms being compared.
Problem 1 : Small repository of datasets
● It is difficult to produce major new results using well-studied and widely shared data.
● Suppose 100 people are studying the effect of algorithms A and B, and there is no real difference between them.
● On average, 5 of them will get results statistically significant at p ≤ 0.05.
● Clearly these results are due to chance.
○ The ones who get significant results will publish,
○ while the others will simply move on to other experiments.
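As a quick sanity check on the numbers above, here is a minimal stdlib-Python sketch (not from the paper) of what happens when 100 researchers each run a null experiment at the 0.05 level:

```python
n_researchers = 100
alpha = 0.05  # per-experiment significance threshold

# If there is no real difference between A and B, each experiment
# reaches p <= alpha purely by chance with probability alpha.
expected_false_positives = n_researchers * alpha

# Probability that at least one researcher gets a "significant" result.
p_at_least_one = 1 - (1 - alpha) ** n_researchers

print(expected_false_positives)   # 5.0
print(round(p_at_least_one, 3))   # 0.994
```

So with 100 independent null experiments, a handful of "significant" results and near-certain publication of at least one of them is exactly what chance predicts.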
Problem 2 : Statistical validity
● Statistics offers many tests that are designed to measure the significance of any difference.
● These tests are not designed with computational experiments in mind.
● For example:
○ 14 different variations of classifier algorithms
○ 11 different data sets
○ 154 experiments, i.e. 154 chances to be significant
○ The expected number of "significant" results by chance alone is 154 × 0.05 = 7.7
○ This is the multiplicity effect.
Problem 2 : Statistical validity
● Let the significance level for each test be α.
● The chance of making the right conclusion in one experiment is (1 − α).
● Assuming the experiments are independent of one another, the chance of getting all n experiments correct is (1 − α)^n.
● The chance of making at least one incorrect conclusion is 1 − (1 − α)^n.
● Substituting α = 0.05 and n = 154:
○ The chance of making at least one incorrect conclusion is 0.9996.
● To obtain results significant at the 0.05 level with 154 tests:
○ 1 − (1 − α)^n < 0.05  ⇒  α < 0.0003
● This adjustment is known as Bonferroni Adjustment.
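The arithmetic above can be checked in a few lines of stdlib Python; the exact threshold is solved from the family-wise formula, and the Bonferroni value α/n is its simple, slightly conservative approximation:

```python
alpha = 0.05
n_tests = 154  # 14 algorithm variants x 11 data sets

# Family-wise error rate: chance of at least one spurious
# "significant" result when all n tests are run at level alpha.
fwer = 1 - (1 - alpha) ** n_tests
print(round(fwer, 4))  # 0.9996

# Per-test level needed so the family-wise rate stays below 0.05:
# solve 1 - (1 - a)^n < 0.05 for a.
exact = 1 - (1 - alpha) ** (1 / n_tests)

# Bonferroni adjustment: alpha / n, a conservative approximation.
bonferroni = alpha / n_tests
print(round(exact, 6), round(bonferroni, 6))  # 0.000333 0.000325
```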
Problem 3 : Experiments are not independent
● The t-test assumes that the test sets for each algorithm are independent.
● Generally, two algorithms are compared on the same data set.
○ Obviously, the test sets are then not independent.
Problem 4 : Only considers overall accuracy
● When a common test set is used for comparing two algorithms, the comparison must consider four numbers:
○ A got right and B got wrong ( A > B )
○ B got right and A got wrong ( B > A )
○ Both algorithms got it right
○ Both algorithms got it wrong
● If only two algorithms are compared:
○ Throw out the ties.
○ Compare A > B vs. B > A.
● If more than two algorithms are compared:
○ Use Analysis of Variance (ANOVA).
○ Apply the Bonferroni adjustment for multiple tests.
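The two-algorithm comparison above amounts to an exact sign test on the discordant pairs, with ties discarded. A minimal stdlib-Python sketch; the counts at the bottom are hypothetical, not from the paper:

```python
from math import comb

def sign_test_p(a_wins: int, b_wins: int) -> float:
    """Two-sided exact binomial (sign) test on discordant pairs.

    a_wins: cases A got right and B got wrong ( A > B )
    b_wins: cases B got right and A got wrong ( B > A )
    Ties (both right or both wrong) are discarded beforehand.
    """
    n = a_wins + b_wins
    if n == 0:
        return 1.0
    k = min(a_wins, b_wins)
    # P(X <= k) under X ~ Binomial(n, 0.5), doubled for two sides.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical counts: A beat B on 30 examples, B beat A on 12.
print(round(sign_test_p(30, 12), 4))
```

Under the null hypothesis that the two algorithms are equally good, each discordant example is a fair coin flip, which is why the test uses Binomial(n, 0.5).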
Problem 5 : Repeated tuning
● Researchers tune their algorithms repeatedly to make them perform optimally on a data set.
● Whenever tuning takes place, every adjustment should really be considered a separate experiment.
○ For example, if 10 tuning experiments were attempted, the significance threshold should be 0.05 / 10 = 0.005 instead of 0.05.
● When one uses an algorithm that has been used before, the algorithm may already have been tuned on the public databases.
Problem 5 : Repeated tuning
● Recommended approach:
○ Reserve a portion of the training set as a tuning set.
○ Repeatedly test the algorithm and adjust its parameters on the tuning set.
○ Measure accuracy on the test data.
Problem 6 : Generalizing results
● Common methodological approach:
○ Pick several data sets from the UCI repository.
○ Perform a series of experiments,
■ measuring classification accuracy
■ and learning rates.
● It is not valid to make general statements about other data sets.
○ The repository is not an unbiased sample of classification problems.
● Someone can write an algorithm that works very well on some of the known data sets.
○ Anyone familiar with the data may be biased.
A Recommended Approach
1. Choose other algorithms to include in the comparison.
2. Choose a benchmark data set.
3. Divide the data set into k subsets for cross-validation.
○ Typically k = 10.
○ For small data sets, choose a larger k.
A Recommended Approach
4. Run cross-validation.
○ For each of the k subsets k_i of the data set D, create a training set T = D − k_i.
○ Divide T into two subsets: T1 (training) and T2 (tuning).
○ Once the parameters are optimized, re-run training on the full T.
○ Measure accuracy on the held-out subset k_i.
○ Overall accuracy is averaged across all k partitions.
5. Compare algorithms
● In the case of multiple data sets, the Bonferroni adjustment should be applied.
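Steps 3–5 can be sketched in stdlib Python. Everything below is a toy stand-in: the "classifier" is a hypothetical one-parameter threshold rule whose single parameter doubles as the hyperparameter (so T1 goes unused here); in a real setting you would fit the model on T1 and tune its hyperparameters on T2:

```python
import random

random.seed(0)

# Toy data set D of (feature, label) pairs: the label agrees with
# int(feature > 0.5) about 80% of the time.
D = [(x, int(x > 0.5) if random.random() > 0.2 else int(x <= 0.5))
     for x in (random.random() for _ in range(200))]

def accuracy(threshold, data):
    return sum(int(x > threshold) == y for x, y in data) / len(data)

k = 10
random.shuffle(D)
folds = [D[i::k] for i in range(k)]  # k disjoint subsets of D

fold_accuracies = []
for i in range(k):
    test = folds[i]  # held-out subset k_i
    T = [ex for j, f in enumerate(folds) if j != i for ex in f]  # T = D - k_i
    split = int(0.8 * len(T))
    T1, T2 = T[:split], T[split:]  # training / tuning split
    # Tune the parameter on the tuning set T2 only.
    best = max((t / 20 for t in range(21)), key=lambda t: accuracy(t, T2))
    # Measure accuracy once on the held-out subset with the chosen parameter.
    fold_accuracies.append(accuracy(best, test))

overall = sum(fold_accuracies) / k
print(round(overall, 3))
```

The key point the code mirrors is that the held-out subset is touched exactly once per fold, after all tuning decisions have been made on T2.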
Conclusion
● The authors do not mean to discourage empirical comparisons.
● They try to provide suggestions for avoiding the pitfalls.
● They suggest that:
○ Statistical tools should be used carefully.
○ Every detail of the experiment should be reported.
Thank you!