CSC314 / CSC763 Introduction to Machine Learning · 02/04/2014 · Comparing Learning Algorithms...

CSC314 / CSC763 Introduction to Machine Learning COMSATS Institute of Information Technology Dr. Adeel Nawab


Page 1: CSC314 / CSC763 Introduction to Machine Learning · 02/04/2014  · Comparing Learning Algorithms k-fold cross-validation Disadvantages of Accuracy as a Measure Confusion matrices

CSC314 / CSC763 Introduction to Machine Learning

COMSATS Institute of Information Technology

Dr. Adeel Nawab

Page 2:

More on Evaluating Hypotheses/Learning Algorithms

Lecture Outline:

• Review of Confidence Intervals for Discrete-Valued Hypotheses

• General Approach to Deriving Confidence Intervals

• Difference in Error of Two Hypotheses

• Comparing Learning Algorithms: k-fold cross-validation

• Disadvantages of Accuracy as a Measure: Confusion matrices and ROC graphs

Reading:

Chapter 5 of Mitchell

Chapter 5 of Witten and Frank, 2nd ed.

Reference: M. H. DeGroot, Probability and Statistics, 2nd Ed., Addison-Wesley, 1986.

Page 3:

Review of Confidence Intervals for Discrete-Valued Hypotheses

• Last lecture we examined the question of how we might evaluate a hypothesis. More precisely:

1. Given a hypothesis h and a data sample S of n instances drawn at random according to D, what is the best estimate of the accuracy of h over future instances drawn from D?

2. What is the possible error in this accuracy estimate?

Page 4:

Review of Confidence Intervals for Discrete-Valued Hypotheses (cont...)

In answer, we saw that, assuming:

– the n instances in S are drawn

∗ independently of one another

∗ independently of h

∗ according to probability distribution D

– n ≥ 30

then:

1. the most probable value of errorD(h) is errorS(h)

2. with approximately N% probability, errorD(h) lies in the interval

   errorS(h) ± z_N · sqrt( errorS(h) · (1 − errorS(h)) / n )    (1)
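To make Equation (1) concrete, here is a small sketch (not from the slides) that computes the two-sided interval directly; the sample numbers (12 errors on 40 test instances) are illustrative.

```python
import math

def error_confidence_interval(error_s, n, z_n=1.96):
    """Approximate N% confidence interval for errorD(h), given the sample
    error errorS(h) measured over n >= 30 instances (z_n = 1.96 for 95%)."""
    half_width = z_n * math.sqrt(error_s * (1.0 - error_s) / n)
    return error_s - half_width, error_s + half_width

# illustrative numbers: h misclassifies 12 of 40 instances, errorS(h) = 0.30
lo, hi = error_confidence_interval(0.30, 40)
print(f"95% interval: ({lo:.3f}, {hi:.3f})")  # 95% interval: (0.158, 0.442)
```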

Page 5:

Review of Confidence Intervals for Discrete-Valued Hypotheses (cont...)

Page 6:

Review (cont)

• Equation (1) was derived by observing:

– errorS(h) follows a Binomial distribution with

∗ mean value errorD(h) and

∗ standard deviation approximated by

   σ_errorS(h) ≈ sqrt( errorS(h) · (1 − errorS(h)) / n )

Page 7:

Review (cont)

– the N% confidence level for estimating the mean of a Normal distribution of a random variable Y with observed value y can be calculated by noting that µ falls into y ± z_N · σ N% of the time.

Page 8:

Two-Sided and One-Sided Bounds

• Confidence intervals discussed so far offer two-sided bounds – above and below.

• May only be interested in a one-sided bound, e.g. may only care about an upper bound on error – the answer to the question:

What is the probability that errorD(h) is at most U?

and not mind if the error is lower than our estimate.

Page 9:

Two-Sided and One-Sided Bounds

Page 10:

General Approach to Deriving Confidence Intervals – Central Limit Theorem

• Consider a set of independent, identically distributed random variables Y1 . . . Yn, all governed by an arbitrary probability distribution with mean µ and finite variance σ². Define the sample mean

   Ȳ ≡ (1/n) · (Y1 + . . . + Yn)

Page 11:

General Approach to Deriving Confidence Intervals – Central Limit Theorem

• Central Limit Theorem. As n → ∞, the distribution governing Ȳ approaches a Normal distribution with mean µ and variance σ²/n.

• Significance: we know the form of the distribution of the sample mean even if we do not know the distribution of the underlying Yi that are being observed.

• Useful because whenever we pick an estimator that is the mean of some sample (e.g. errorS(h)), the distribution governing the estimator can be approximated by the Normal distribution for suitably large n (typically n ≥ 30).

– e.g. use the Normal distribution to approximate the Binomial distribution that more accurately describes errorS(h).

Page 12:

General Approach to Deriving Confidence Intervals

• Now have a general approach to deriving confidence intervals for many estimation problems:

1. Pick the parameter p to estimate
– e.g. errorD(h)

2. Choose an estimator
– e.g. errorS(h)

3. Determine the probability distribution that governs the estimator
– e.g. errorS(h) is governed by the Binomial distribution, approximated by the Normal when n ≥ 30

Page 13:

General Approach to Deriving Confidence Intervals

4. Find the interval (Lower, Upper) such that N% of the probability mass falls in the interval
– e.g. use a table of z_N values

• Things are made easier if we pick an estimator that is the mean of some sample
– then (by the Central Limit Theorem) we can ignore the probability distribution underlying the sample and approximate the distribution governing the estimator by the Normal distribution.

Page 14:

Example: Difference in Error of Two Hypotheses

• Suppose

– we have two hypotheses h1 and h2 for a discrete-valued target function

– h1 is tested on sample S1, h2 on S2, with S1, S2 independently drawn from the same distribution

• Wish to estimate the difference d in true error between h1 and h2:

   d ≡ errorD(h1) − errorD(h2)

• Use the 4-step generic procedure to derive a confidence interval for d:
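Applying the 4-step procedure to d can be sketched as follows (not from the slides; the variance-sum approximation sigma_d ≈ sqrt(e1(1−e1)/n1 + e2(1−e2)/n2) follows Mitchell, Ch. 5, and the sample numbers are illustrative):

```python
import math

def difference_confidence_interval(e1, n1, e2, n2, z_n=1.96):
    """Approximate N% interval for d = errorD(h1) - errorD(h2), using the
    estimator d_hat = errorS1(h1) - errorS2(h2) and the sum of the two
    sample-error variances."""
    d_hat = e1 - e2
    sigma = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    return d_hat - z_n * sigma, d_hat + z_n * sigma

# illustrative numbers: errorS1(h1) = 0.30 over 100 instances,
# errorS2(h2) = 0.20 over 100 instances
lo, hi = difference_confidence_interval(0.30, 100, 0.20, 100)
# the interval straddles 0, so at 95% confidence we cannot conclude
# that h1 really has higher true error than h2
```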

Page 15:

Example: Difference in Error of Two Hypotheses

Page 16:

Comparing Learning Algorithms

• Suppose we want to compare two learning algorithms rather than two specific hypotheses.

• There is not complete agreement in the machine learning community about the best way to do this.

• One way to do this is to determine whether learning algorithm LA is better on average for learning a target function f than learning algorithm LB.

Page 17:

Comparing Learning Algorithms (cont.)

• By better on average here we mean relative performance across all training sets of size n drawn from instance distribution D. I.e. we want to estimate:

   E_{S⊂D}[ errorD(LA(S)) − errorD(LB(S)) ]

where L(S) is the hypothesis output by learner L using training set S, i.e. the expected difference in true error between hypotheses output by learners LA and LB, when trained using randomly selected training sets S drawn according to distribution D.

Page 18:

Comparing Learning Algorithms (cont.)

• But, given limited data D0, what is a good estimator?

– could partition D0 into training set S0 and test set T0, and measure

   error_T0(LA(S0)) − error_T0(LB(S0))

– even better, repeat this many times and average the results

Page 19:

Comparing Learning Algorithms (cont.)

• Rather than divide the limited training/testing data just once, do so multiple times and average the results – called k-fold cross-validation:

1. Partition data D0 into k disjoint test sets T1, T2, . . . , Tk of equal size, where this size is at least 30.
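The partition-train-test loop can be sketched as below; train_a, train_b and error_fn are placeholders (not defined in the slides) standing in for the two learners and an error measure.

```python
def k_fold_error_differences(data, k, train_a, train_b, error_fn):
    """Sketch of k-fold comparison: split data into k disjoint test sets;
    on each fold, train both learners on the remaining data and record the
    difference in test-set error. Returns delta_bar, the mean difference."""
    folds = [data[i::k] for i in range(k)]  # k disjoint, equal-sized sets
    deltas = []
    for i, test_set in enumerate(folds):
        train_set = [x for j, fold in enumerate(folds) if j != i for x in fold]
        h_a, h_b = train_a(train_set), train_b(train_set)
        deltas.append(error_fn(h_a, test_set) - error_fn(h_b, test_set))
    return sum(deltas) / k

# toy check with constant "learners" and a fixed error measure
delta_bar = k_fold_error_differences(
    list(range(90)), 3,
    train_a=lambda tr: "A", train_b=lambda tr: "B",
    error_fn=lambda h, te: 0.2 if h == "A" else 0.1)
```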

Page 20:

Comparing Learning Algorithms (cont.)

Page 21:

Comparing Learning Algorithms (cont.)

Page 22:

Comparing Learning Algorithms – Further Considerations

• Can determine approximate N% confidence intervals for the estimator δ̄ using a statistical test called a paired t test.

• A paired test is one where hypotheses are compared over identical samples (unlike the discussion of comparing hypotheses above).

• The t test uses the t distribution (instead of the Normal distribution).
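A minimal sketch of the t statistic computed from the per-fold differences δ1 . . . δk (the fold values below are made up for illustration):

```python
import math

def paired_t_statistic(deltas):
    """t = delta_bar / (s / sqrt(k)) for per-fold error differences, where
    s is the sample standard deviation; compare against the t distribution
    with k - 1 degrees of freedom."""
    k = len(deltas)
    delta_bar = sum(deltas) / k
    s2 = sum((d - delta_bar) ** 2 for d in deltas) / (k - 1)
    return delta_bar / math.sqrt(s2 / k)

t = paired_t_statistic([0.05, 0.03, 0.04, 0.06, 0.02])  # made-up fold values
# with 4 degrees of freedom the two-sided 95% critical value is about 2.776,
# so a t statistic this large would indicate a significant difference
```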

Page 23:

Comparing Learning Algorithms – Further Considerations

• Another paired test which is increasingly used is the Wilcoxon signed rank test.

• Has the advantage that, unlike the t test, it does not assume any particular distribution underlying the error (i.e. it is a non-parametric test).

• Rather than partitioning the available data D0 into k disjoint equal-sized partitions, can repeatedly randomly select a test set of n ≥ 30 examples from D0 and use the rest for training.

Page 24:

Comparing Learning Algorithms – Further Considerations

• Can do this indefinitely many times, to shrink confidence intervals to arbitrary width.

• However, test sets are no longer independently drawn from the underlying instance distribution D, since instances will recur in separate test sets.

• In k-fold cross-validation each instance is included in only one test set.

Page 25:

Disadvantages of Accuracy as a Measure (I)

Page 26:

Disadvantages of Accuracy as a Measure (I)

• Accuracy is not always a good measure. Consider a two-class classification problem where 995 of 1000 instances in a test sample are negative and 5 positive

– a classifier that always predicts negative will have an accuracy of 99.5% even though it never correctly predicts positive examples.
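The degenerate classifier above can be checked in a few lines:

```python
# 995 negative and 5 positive instances; the classifier always says "neg"
labels = ["neg"] * 995 + ["pos"] * 5
predictions = ["neg"] * 1000

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(accuracy)  # 0.995, despite never finding a single positive
```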

Page 27:

Confusion Matrices

• Can get deeper insights into classifier behaviour by using a confusion matrix.
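A minimal sketch of a binary confusion matrix and the rates derived from it (the labels and predictions below are illustrative, not from the slides):

```python
def confusion_matrix(labels, predictions):
    """2x2 counts for a binary task, returned as (TP, FN, FP, TN)."""
    tp = sum(y == "pos" and p == "pos" for y, p in zip(labels, predictions))
    fn = sum(y == "pos" and p == "neg" for y, p in zip(labels, predictions))
    fp = sum(y == "neg" and p == "pos" for y, p in zip(labels, predictions))
    tn = sum(y == "neg" and p == "neg" for y, p in zip(labels, predictions))
    return tp, fn, fp, tn

labels      = ["pos", "pos", "neg", "neg", "neg"]
predictions = ["pos", "neg", "pos", "neg", "neg"]
tp, fn, fp, tn = confusion_matrix(labels, predictions)
tpr = tp / (tp + fn)   # true positive rate (sensitivity)
fpr = fp / (fp + tn)   # false positive rate
```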

Page 28:

Confusion Matrices (cont)

Page 29:

Confusion Matrices (cont)

Page 30:

Confusion Matrices (cont)

Page 31:

Disadvantages of Accuracy as a Measure (II)

• Accuracy ignores the possibility of different misclassification costs

– incorrectly predicting +ve may be more or less costly than incorrectly predicting −ve

∗ not treating an ill patient vs. treating a healthy one

∗ refusing credit to a credit-worthy client vs. extending credit to a client who defaults

Page 32:

Disadvantages of Accuracy as a Measure (II)

• To address this, many classifiers have parameters that can be adjusted to allow increased TPR at the cost of increased FPR, or decreased FPR at the cost of decreased TPR.

• For each such parameter setting a (TPR, FPR) pair results, and the results may be plotted on a ROC graph (ROC = “receiver operating characteristic”).
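One common way to generate such pairs is to sweep a decision threshold over the classifier's scores; a small sketch (the scores and labels are made up for illustration):

```python
def roc_points(scores, labels):
    """(FPR, TPR) pair for every decision threshold over the classifier's
    scores; plotting these points gives the ROC graph."""
    points = []
    for thresh in sorted(set(scores), reverse=True):
        preds = ["pos" if s >= thresh else "neg" for s in scores]
        tp = sum(p == "pos" and y == "pos" for p, y in zip(preds, labels))
        fp = sum(p == "pos" and y == "neg" for p, y in zip(preds, labels))
        pos, neg = labels.count("pos"), labels.count("neg")
        points.append((fp / neg, tp / pos))
    return points

scores = [0.9, 0.8, 0.6, 0.4, 0.2]   # illustrative classifier scores
labels = ["pos", "pos", "neg", "pos", "neg"]
pts = roc_points(scores, labels)     # ends at (1.0, 1.0): everything "pos"
```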

Page 33:

Disadvantages of Accuracy as a Measure (II)

• Provides a graphical summary of trade-offs between sensitivity and specificity.

• The term originated in signal detection theory – e.g. identifying radar signals of enemy aircraft in noisy environments.

• See Witten and Frank, Chapter 5.7 and http://en.wikipedia.org/wiki/Receiver-operating-characteristic for more.

Page 34:

ROC Graphs

Page 35:

ROC Graphs – Example

Page 36:

Summary

• Confidence intervals give us a way of assessing how likely the true error of a hypothesis is to fall within an interval around the error observed over a sample.

• For many practical purposes we will be interested in a one-sided confidence interval only.

• The approach to confidence intervals for sample error may be generalized to apply to any estimator which is the mean of some sample

– e.g. may use this approach to derive a confidence interval for the estimated difference in true error between two hypotheses.

Page 37:

Summary

• Differences between learning algorithms, as opposed to hypotheses, are typically assessed by k-fold cross-validation.

• Accuracy (the complement of error) has a number of disadvantages as the sole measure of a learning algorithm.

• Deeper insight may be obtained using a confusion matrix, which allows us to distinguish numbers of false positives/negatives from true positives/negatives.

• Costs of different classification errors may be taken into account using ROC graphs.