cs550 presentation - on comparing classifiers by Salzberg
DESCRIPTION
A presentation of the paper written by Steven L. Salzberg on comparing classifiers.

TRANSCRIPT
On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach (cited by 581)
Author: Steven L. Salzberg
Presented by: Mehmet Ali Abbasoğlu & Mustafa İlker Saraç
10.04.2014
Contents
1. Motivation
2. Comparing Algorithms
3. Definitions
4. Problems
5. Recommended Approach
6. Conclusion
Motivation
● Be careful with comparative studies of classification and other algorithms.
○ It is easy to reach statistically invalid conclusions.
● How do you choose which algorithm to use for a new problem?
● Using brute force, one can easily find a phenomenon or pattern that looks impressive.
○ REALLY?
Motivation
● You have lots of data.
○ Choose a data set from the UCI repository.
● You have many classification methods to compare.
But,
● Should any difference in classification accuracy that reaches statistical significance be reported as important?
○ Think again!
Comparing Algorithms
● Many new algorithms have problems, according to a survey conducted by Prechelt.
○ 29% were not evaluated on a real problem.
○ Only 8% were compared to more than one alternative on real data.
● A survey by Flexer on experimental neural network papers in leading journals:
○ Only 3 out of 43 used a separate data set for tuning parameters.
Comparing Algorithms
● Drawbacks of reporting results on a well-studied data set, e.g. a data set from the UCI repository:
○ It is hard to improve on existing results.
○ Prone to statistical accidents.
○ They are fine for seeing initial results for your new algorithm.
● It seems easy to change a known algorithm a little and then use comparisons to report improved results.
○ High risk of statistical invalidity.
○ Better to apply the algorithm to new problems.
Definitions
● Statistical significance
○ In statistics, a result is considered significant not because it is important or meaningful, but because it is unlikely to have occurred by chance alone.
● t-test
○ Used to determine whether two sets of data are significantly different from each other.
● p-value
○ The probability of observing results at least as extreme as those obtained, assuming the null hypothesis is true.
● null hypothesis
○ The default position, e.g. that there is no difference between the two algorithms being compared.
Problem 1 : Small repository of datasets
● It is difficult to produce major new results using well-studied and widely shared data.
● Suppose 100 people are studying the effect of algorithms A and B, and there is no real difference between them.
● On average, 5 of them will get results statistically significant at p ≤ 0.05.
● Clearly these results are due to chance.
○ The ones who get significant results will publish,
○ while the others will simply move on to other experiments.
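As a quick sanity check on the numbers above, here is a minimal stdlib-Python sketch (not from the paper) of what happens when 100 researchers each run a null experiment at the 0.05 level:

```python
n_researchers = 100
alpha = 0.05  # per-experiment significance threshold

# If there is no real difference between A and B, each experiment
# reaches p <= alpha purely by chance with probability alpha.
expected_false_positives = n_researchers * alpha

# Probability that at least one researcher gets a "significant" result.
p_at_least_one = 1 - (1 - alpha) ** n_researchers

print(expected_false_positives)   # 5.0
print(round(p_at_least_one, 3))   # 0.994
```

So with 100 independent null experiments, a handful of "significant" results and near-certain publication of at least one of them is exactly what chance predicts.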
Problem 2 : Statistical validity
● Statistics offers many tests that are designed to measure the significance of any difference.
● These tests are not designed with computational experiments in mind.
● For example:
○ 14 different variations of classifier algorithms
○ 11 different data sets
○ 154 experiments, i.e. 154 chances to be significant
○ The expected number of "significant" results by chance alone is 154 × 0.05 = 7.7
○ This is the multiplicity effect.
Problem 2 : Statistical validity
● Let the significance level for each test be α.
● The chance of making the right conclusion in one experiment is (1 − α).
● Assuming the experiments are independent of one another, the chance of getting all n experiments correct is (1 − α)^n.
● The chance of making at least one incorrect conclusion is 1 − (1 − α)^n.
● Substituting α = 0.05 and n = 154:
○ The chance of making at least one incorrect conclusion is 0.9996.
● To obtain results significant at the 0.05 level with 154 tests:
○ 1 − (1 − α)^n < 0.05  ⇒  α < 0.0003
● This adjustment is known as Bonferroni Adjustment.
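The arithmetic above can be checked in a few lines of stdlib Python; the exact threshold is solved from the family-wise formula, and the Bonferroni value α/n is its simple, slightly conservative approximation:

```python
alpha = 0.05
n_tests = 154  # 14 algorithm variants x 11 data sets

# Family-wise error rate: chance of at least one spurious
# "significant" result when all n tests are run at level alpha.
fwer = 1 - (1 - alpha) ** n_tests
print(round(fwer, 4))  # 0.9996

# Per-test level needed so the family-wise rate stays below 0.05:
# solve 1 - (1 - a)^n < 0.05 for a.
exact = 1 - (1 - alpha) ** (1 / n_tests)

# Bonferroni adjustment: alpha / n, a conservative approximation.
bonferroni = alpha / n_tests
print(round(exact, 6), round(bonferroni, 6))  # 0.000333 0.000325
```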
Problem 3 : Experiments are not independent
● The t-test assumes that the test sets for each algorithm are independent.
● Generally, two algorithms are compared on the same data set.
○ Obviously, the test sets are then not independent.
Problem 4 : Only considers overall accuracy
● When a common test set is used for comparing two algorithms, the comparison must consider four numbers:
○ A got right and B got wrong ( A > B )
○ B got right and A got wrong ( B > A )
○ Both algorithms got it right
○ Both algorithms got it wrong
● If only two algorithms are compared:
○ Throw out the ties.
○ Compare A > B vs. B > A.
● If more than two algorithms are compared:
○ Use Analysis of Variance (ANOVA).
○ Apply the Bonferroni adjustment for multiple tests.
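The two-algorithm comparison above amounts to an exact sign test on the discordant pairs, with ties discarded. A minimal stdlib-Python sketch; the counts at the bottom are hypothetical, not from the paper:

```python
from math import comb

def sign_test_p(a_wins: int, b_wins: int) -> float:
    """Two-sided exact binomial (sign) test on discordant pairs.

    a_wins: cases A got right and B got wrong ( A > B )
    b_wins: cases B got right and A got wrong ( B > A )
    Ties (both right or both wrong) are discarded beforehand.
    """
    n = a_wins + b_wins
    if n == 0:
        return 1.0
    k = min(a_wins, b_wins)
    # P(X <= k) under X ~ Binomial(n, 0.5), doubled for two sides.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical counts: A beat B on 30 examples, B beat A on 12.
print(round(sign_test_p(30, 12), 4))
```

Under the null hypothesis that the two algorithms are equally good, each discordant example is a fair coin flip, which is why the test uses Binomial(n, 0.5).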
Problem 5 : Repeated tuning
● Researchers tune their algorithms repeatedly to make them perform optimally on a data set.
● Whenever tuning takes place, every adjustment should really be considered a separate experiment.
○ For example, if 10 tuning experiments were attempted, the significance threshold should be 0.05 / 10 = 0.005 instead of 0.05.
● When one uses an algorithm that has been used before, the algorithm may already have been tuned on the public databases.
Problem 5 : Repeated tuning
● Recommended approach:
○ Reserve a portion of the training set as a tuning set.
○ Repeatedly test the algorithm and adjust its parameters on the tuning set.
○ Measure accuracy on the test data.
Problem 6 : Generalizing results
● Common methodological approach:
○ Pick several data sets from the UCI repository.
○ Perform a series of experiments,
■ measuring classification accuracy
■ and learning rates.
● It is not valid to make general statements about other data sets.
○ The repository is not an unbiased sample of classification problems.
● Someone can write an algorithm that works very well on some of the known data sets.
○ Anyone familiar with the data may be biased.
A Recommended Approach
1. Choose other algorithms to include in the comparison.
2. Choose a benchmark data set.
3. Divide the data set into k subsets for cross-validation.
○ Typically k = 10.
○ For small data sets, choose a larger k.
A Recommended Approach
4. Run cross-validation.
○ For each of the k subsets k_i of the data set D, create a training set T = D − k_i.
○ Divide T into two subsets: T1 (training) and T2 (tuning).
○ Once the parameters are optimized, re-run training on the full T.
○ Measure accuracy on the held-out subset k_i.
○ Overall accuracy is averaged across all k partitions.
5. Compare algorithms
● In the case of multiple data sets, the Bonferroni adjustment should be applied.
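Steps 3–5 can be sketched in stdlib Python. Everything below is a toy stand-in: the "classifier" is a hypothetical one-parameter threshold rule whose single parameter doubles as the hyperparameter (so T1 goes unused here); in a real setting you would fit the model on T1 and tune its hyperparameters on T2:

```python
import random

random.seed(0)

# Toy data set D of (feature, label) pairs: the label agrees with
# int(feature > 0.5) about 80% of the time.
D = [(x, int(x > 0.5) if random.random() > 0.2 else int(x <= 0.5))
     for x in (random.random() for _ in range(200))]

def accuracy(threshold, data):
    return sum(int(x > threshold) == y for x, y in data) / len(data)

k = 10
random.shuffle(D)
folds = [D[i::k] for i in range(k)]  # k disjoint subsets of D

fold_accuracies = []
for i in range(k):
    test = folds[i]  # held-out subset k_i
    T = [ex for j, f in enumerate(folds) if j != i for ex in f]  # T = D - k_i
    split = int(0.8 * len(T))
    T1, T2 = T[:split], T[split:]  # training / tuning split
    # Tune the parameter on the tuning set T2 only.
    best = max((t / 20 for t in range(21)), key=lambda t: accuracy(t, T2))
    # Measure accuracy once on the held-out subset with the chosen parameter.
    fold_accuracies.append(accuracy(best, test))

overall = sum(fold_accuracies) / k
print(round(overall, 3))
```

The key point the code mirrors is that the held-out subset is touched exactly once per fold, after all tuning decisions have been made on T2.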
Conclusion
● The authors do not mean to discourage empirical comparisons.
● They try to provide suggestions for avoiding the pitfalls.
● They suggest that:
○ Statistical tools should be used carefully.
○ Every detail of the experiment should be reported.
Thank you!