CSI 5388: ROC Analysis


Page 1: CSI 5388:  ROC Analysis

CSI 5388: ROC Analysis

(Based on ROC Graphs: Notes and Practical Considerations for Data Mining Researchers by Tom Fawcett, unpublished, January 2003.)

Page 2: CSI 5388:  ROC Analysis

Evaluating Classification Systems

There are two issues in the evaluation of classification systems:
• What evaluation measure should we use?
• How can we ensure that the estimate we obtain is reliable?

I will briefly touch upon the second question, but the purpose of this lecture is to discuss the first question. In particular, I will introduce concepts in ROC Analysis.

Page 3: CSI 5388:  ROC Analysis

How can we ensure that the estimates we obtain are reliable?

There are different ways to address this issue:
• If the data set is very large, it is sometimes sufficient to use the hold-out method [though it is suggested to repeat it several times to confirm the results].
• The most common method is 10-fold cross-validation. It often gives a reliable enough estimate.
• For more reliability (but also more computation), the leave-one-out method is sometimes more appropriate.

To assess the reliability of our results, we can report the variance of the results (to notice overlaps) and perform paired t-tests (or other such statistical tests).
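For concreteness, here is a minimal sketch of a 10-fold cross-validation run with scikit-learn; the synthetic dataset and decision-tree classifier are placeholders, and the last line reports the mean and variance mentioned above.

```python
# Minimal sketch: 10-fold cross-validation with scikit-learn.
# The dataset and classifier are placeholders, for illustration only.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)  # synthetic data
clf = DecisionTreeClassifier(random_state=0)

# 10-fold CV: one accuracy estimate per fold.
scores = cross_val_score(clf, X, y, cv=10)
print("mean accuracy: %.3f, variance: %.5f" % (scores.mean(), scores.var()))
```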

Page 4: CSI 5388:  ROC Analysis

Common Evaluation Measures
1. Confusion Matrix

                               True Class
                        Positive                Negative
Hypothesized    Yes     True Positives (TP)     False Positives (FP)
Class           No      False Negatives (FN)    True Negatives (TN)
Column Totals           P                       N
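As a small illustration (not part of the original slides), the four cells can be tallied directly from the true and hypothesized labels; the encoding 1 = positive, 0 = negative is an assumption of this sketch.

```python
# Minimal sketch: tallying the confusion-matrix cells from true labels and
# hypothesized (predicted) labels. Labels are assumed to be 1 = positive, 0 = negative.
def confusion_counts(y_true, y_pred):
    TP = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    FP = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    FN = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    TN = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return TP, FP, FN, TN

# Column totals: P = TP + FN (all positives), N = FP + TN (all negatives).
TP, FP, FN, TN = confusion_counts([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(TP, FP, FN, TN)  # -> 2 1 1 1
```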

Page 5: CSI 5388:  ROC Analysis

Common Evaluation Measures
2. Accuracy, Precision, Recall, etc.

• FP Rate = FP/N (False Alarm Rate)
• Precision = TP/(TP+FP)
• Accuracy = (TP+TN)/(P+N)
• TP Rate = TP/P = Recall = Hit Rate = Sensitivity
• F-Score combines Precision and Recall; the most common form is the harmonic mean, F1 = 2 · Precision · Recall / (Precision + Recall), though a number of other formulas are also used.
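A brief sketch of these formulas in code, computed from the four confusion-matrix counts of the previous slide; the example counts are made up.

```python
# Minimal sketch: the measures above, computed from the confusion-matrix
# counts (P = TP + FN, N = FP + TN). Example counts are invented.
def basic_measures(TP, FP, FN, TN):
    P, N = TP + FN, FP + TN
    tp_rate   = TP / P            # recall, hit rate, sensitivity
    fp_rate   = FP / N            # false-alarm rate
    precision = TP / (TP + FP)
    accuracy  = (TP + TN) / (P + N)
    f1        = 2 * precision * tp_rate / (precision + tp_rate)  # harmonic-mean F-score
    return tp_rate, fp_rate, precision, accuracy, f1

print(basic_measures(TP=60, FP=10, FN=20, TN=110))
```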

Page 6: CSI 5388:  ROC Analysis

Common Evaluation Measures
3. Problem with these Measures

They describe the state of affairs at a fixed point in a larger evaluation space.

We could get a better grasp on the performance of our learning system if we could judge its behaviour at more than a single point in that space.

For that, we should consider the ROC Space.

Page 7: CSI 5388:  ROC Analysis

What does it mean to consider a larger evaluation space? (1)

Often, classifiers (e.g., Decision Trees, Rule Learning systems) only issue decisions: true or false.

There is no evaluation space to speak of: we can only judge the value of the classifier's decision.

WRONG!!! Inside these classifiers, there is a continuous measure that gets pitted against a threshold in order for a decision to be made.

If we could get to that inner process, we could estimate the behaviour of our system in a larger space.

Page 8: CSI 5388:  ROC Analysis

What does it mean to consider a larger evaluation space? (2)

But why do we care about the inner process?

Well, the classifier's decision relies on two separate processes: 1) the modeling of the data distribution; 2) the decision based on that modeling. Let's take the case where (1) is done very reliably, but (2) is badly done. In that case, we would end up with a bad classifier even though the most difficult part of the job (1) was well done.

It is useful to separate (1) from (2) so that (1), the most critical part of the process, can be estimated reliably. As well, if necessary, (2) can easily be modified and improved.

Page 9: CSI 5388:  ROC Analysis

A Concrete Look at the Issue: The Neural Network Case (1)

In a Multi-Layer Perceptron (MLP), we expect the output unit to issue 1 if the example is positive and 0 otherwise.

However, in practice, this is not what happens. The MLP issues a number between 0 and 1, which the user interprets as 0 or 1.

Usually, this is done by setting a threshold at .5, so that everything above .5 is positive and everything below .5 is negative.

However, this may be a bad threshold. Perhaps we would be better off considering a .75 threshold or a .25 one.

Page 10: CSI 5388:  ROC Analysis

A Concrete Look at the Issue: The Neural Network Case (2)

Please note that a .75 threshold would amount to decreasing the number of false positives at the expense of false negatives. Conversely, a threshold of .25 would amount to decreasing the number of false negatives, this time at the expense of false positives.

ROC Spaces allow us to explore such thresholds on a continuous basis.

They provide us with two advantages: 1) they can tell us what the best spot for our threshold is (given where our priority lies in terms of sensitivity to one type of error over the other) and 2) they allow us to see graphically the behaviour of our system over the whole range of possible tradeoffs.
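To make the trade-off concrete, here is a small illustrative sketch (the scores and labels are invented) that counts false positives and false negatives at thresholds .25, .5 and .75.

```python
# Minimal sketch of the threshold trade-off: the same continuous outputs
# (e.g., MLP activations in [0, 1]) thresholded at .25, .5 and .75.
scores = [0.95, 0.80, 0.70, 0.55, 0.45, 0.30, 0.20, 0.10]
labels = [1,    1,    1,    0,    1,    0,    0,    0]   # 1 = positive

for threshold in (0.25, 0.50, 0.75):
    preds = [1 if s >= threshold else 0 for s in scores]
    fp = sum(1 for p, t in zip(preds, labels) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(preds, labels) if p == 0 and t == 1)
    print(f"threshold {threshold:.2f}: FP = {fp}, FN = {fn}")
```

On this toy data, lowering the threshold to .25 removes the false negatives at the cost of more false positives, and raising it to .75 does the opposite, exactly the trade-off described above.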

Page 11: CSI 5388:  ROC Analysis

A Concrete Look at the Issue: Decision Trees

Unlike an MLP, a Decision Tree only returns a class label. However, we can ask how this label was computed internally.

It was computed by considering the proportion of instances of both classes at the leaf node the example fell in. The decision simply corresponds to the most prevalent class.

Rule learners use similar statistics on rule confidences and the confidence of a rule matching an instance.

There do, however, exist systems whose process cannot be translated into a score. For these systems, a score can be generated from an aggregation process. But is that what we really want to do???
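As an illustration of getting to that inner process, scikit-learn's decision trees expose the leaf class proportions through predict_proba; the dataset below is synthetic and only meant as a sketch.

```python
# Minimal sketch: recovering a score from a decision tree by asking for the
# class proportions at the leaf an example falls into. Data is synthetic.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

print(tree.predict(X[:3]))        # hard 0/1 decisions (most prevalent class at the leaf)
print(tree.predict_proba(X[:3]))  # leaf class proportions, usable as a continuous score
```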

Page 12: CSI 5388:  ROC Analysis

ROC Analysis

Now that we see how we can get a score rather than a decision from various classifiers, let's look at ROC Analysis per se.

Definition: ROC Graphs are two-dimensional graphs in which the TP Rate is plotted on the Y axis and the FP Rate is plotted on the X axis. A ROC graph depicts relative tradeoffs between benefits (true positives) and costs (false positives).
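A minimal sketch of such a graph, assuming scikit-learn and matplotlib are available; the labels and scores below are placeholders.

```python
# Minimal sketch: plotting an ROC graph (TP Rate vs. FP Rate) from true labels
# and continuous classifier scores.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

y_true  = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
y_score = [0.9, 0.85, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
plt.plot(fpr, tpr, marker="o")
plt.plot([0, 1], [0, 1], linestyle="--")   # y = x: random guessing
plt.xlabel("FP Rate")
plt.ylabel("TP Rate")
plt.show()
```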

Page 13: CSI 5388:  ROC Analysis

Points in a ROC Graph (1)

[Figure: ROC space with example classifiers A, B, C, D and E plotted; FP Rate on the X axis (0 to 1), TP Rate on the Y axis (0 to 1).]

Interesting Points:
• (0,0): a classifier that never issues a positive classification. No false positive errors, but no true positive results either.
• (1,1): a classifier that never issues a negative classification. No false negative errors, but no true negative results either.
• (0,1), D: perfect classification.

Page 14: CSI 5388:  ROC Analysis

Points in a ROC Graph (2)

[Figure: the same ROC space with classifiers A, B, C, D and E; FP Rate on the X axis, TP Rate on the Y axis.]

Interesting Points:
• Informally, one point is better than another if it is to the northwest of the other.
• Classifiers appearing on the left-hand side can be thought of as more conservative; those on the right-hand side are more liberal in their classification of positive examples.
• The diagonal y = x corresponds to random guessing.

Page 15: CSI 5388:  ROC Analysis

ROC Curves (1)

If a classifier issues a discrete outcome, then this corresponds to a point in ROC space. If it issues a ranking or a score, then, as discussed previously, it needs a threshold to issue a discrete classification.

In the continuous case, the threshold can be placed at various points. As a matter of fact, it can be slid from −∞ to +∞, producing different points in ROC space, which can be connected to trace a curve.

This can be done as in the algorithm described on the next slide.

Page 16: CSI 5388:  ROC Analysis

ROC Curves (2)

L = set of test instances; f(i) = continuous outcome of the classifier for instance i;
min and max = smallest and largest values returned by f;
increment = the smallest difference between any two f values;
P, N = number of positive and negative instances in L

for t = min to max by increment do
    FP ← 0
    TP ← 0
    for i ∈ L do
        if f(i) ≥ t then
            if i is a positive example then
                TP ← TP + 1
            else
                FP ← FP + 1
    add point (FP/N, TP/P) to the ROC Curve
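A rough, runnable Python rendering of the pseudocode above; the function and variable names are mine, labels are assumed to be 1 = positive, 0 = negative, and at least two distinct scores are assumed.

```python
# Direct (and deliberately inefficient) rendering of the threshold sweep above:
# one pass over the instances for every threshold value.
def roc_points_by_sweep(instances):
    """instances: list of (label, score) pairs, label 1 = positive, 0 = negative."""
    P = sum(1 for label, _ in instances if label == 1)
    N = len(instances) - P
    scores = sorted({score for _, score in instances})
    increment = min(b - a for a, b in zip(scores, scores[1:]))  # smallest gap between f values
    points = []
    t = scores[0]
    while t <= scores[-1] + 1e-12:          # small tolerance for floating-point drift
        TP = sum(1 for label, s in instances if s >= t and label == 1)
        FP = sum(1 for label, s in instances if s >= t and label == 0)
        points.append((FP / N, TP / P))
        t += increment
    return points

print(roc_points_by_sweep([(1, 0.9), (1, 0.8), (0, 0.7), (1, 0.6), (0, 0.4), (0, 0.2)]))
```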

Page 17: CSI 5388:  ROC Analysis

An example of two curves in ROC space

[Figure: two ROC curves plotted in ROC space, with FP Rate on the X axis and TP Rate on the Y axis.]

Page 18: CSI 5388:  ROC Analysis

ROC Curves: A Few Remarks (1)

ROC Curves allow us to make observations about the classifier. See the example in class (also Fig. 3 in Fawcett's paper): the classifier performs better in the conservative region of the space.

Its best accuracy corresponds to threshold .54 rather than .5. ROC Curves are useful in helping find the best threshold, which should not necessarily be set at .5. See the example on the next slide.
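As a sketch of threshold selection (with invented scores, not the data from Fawcett's figure), one can simply scan every distinct score as a candidate threshold and keep the one with the best accuracy.

```python
# Minimal sketch: finding the accuracy-maximizing threshold instead of
# assuming .5. Labels and scores below are placeholders.
def best_accuracy_threshold(y_true, y_score):
    P = sum(y_true)
    N = len(y_true) - P
    best_t, best_acc = None, -1.0
    for t in sorted(set(y_score)):          # each distinct score is a candidate threshold
        TP = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 1)
        TN = sum(1 for y, s in zip(y_true, y_score) if s < t and y == 0)
        acc = (TP + TN) / (P + N)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

print(best_accuracy_threshold([1, 1, 0, 1, 0, 0], [0.9, 0.7, 0.6, 0.54, 0.4, 0.2]))
```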

Page 19: CSI 5388:  ROC Analysis

ROC Curves: A Few Remarks (2)

Inst. #     1     2     3     4     5     6     7     8     9     10
True        p     p     p     p     p     p     n     n     n     n
Predict     y     y     y     y     y     y     y     y     n     n
Score      .99   .98   .98   .97   .95   .94   .65   .51   .48   .44

If the threshold is set at .5, the classifier will make two errors. Yet, if it is set at .7, it will make none. This can clearly be seen on a ROC graph.
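The claim can be checked directly in code against the table above.

```python
# Error counts for the 10-instance example at thresholds .5 and .7.
true  = ["p", "p", "p", "p", "p", "p", "n", "n", "n", "n"]
score = [.99, .98, .98, .97, .95, .94, .65, .51, .48, .44]

for t in (0.5, 0.7):
    errors = sum(1 for label, s in zip(true, score)
                 if (s >= t and label == "n") or (s < t and label == "p"))
    print(f"threshold {t}: {errors} error(s)")   # 2 errors at .5, 0 at .7
```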

Page 20: CSI 5388:  ROC Analysis

ROC Curves: Useful Property

ROC Graphs are insensitive to changes in class distribution. I.e., if the proportion of positive to negative instances changes in a test set, the ROC Curve will not change.

That's because the TP Rate is calculated using only statistics about the positive class, while the FP Rate is calculated using only statistics about the negative class. The two are never mixed.

This is important in domains where the distribution of the data changes from, say, month to month or place to place (e.g., fraud detection).
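A quick sanity check of this property (a sketch on invented data, using scikit-learn's roc_curve): replicating every negative instance ten times changes the class distribution but leaves the (FP Rate, TP Rate) points untouched.

```python
# Sketch: the ROC curve is unchanged when the class distribution is skewed.
import numpy as np
from sklearn.metrics import roc_curve

y_true  = [1, 1, 1, 0, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

# Build a skewed test set: every negative instance is repeated 10 times.
neg = [(y, s) for y, s in zip(y_true, y_score) if y == 0]
y_skew = y_true  + [y for y, _ in neg] * 9
s_skew = y_score + [s for _, s in neg] * 9

fpr1, tpr1, _ = roc_curve(y_true, y_score, drop_intermediate=False)
fpr2, tpr2, _ = roc_curve(y_skew, s_skew, drop_intermediate=False)
print(np.allclose(fpr1, fpr2) and np.allclose(tpr1, tpr2))  # True: same curve
```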

Page 21: CSI 5388:  ROC Analysis

Modification to the Algorithm for Creating ROC Curves

The algorithm on slide 16 is inefficient because it slides to the next point by a constant fixed factor. We can make the algorithm more efficient by looking at the outcome of the classifier and processing it dynamically.

As well, we can compute averages over various segments of the curve or remove all concavities in a ROC Curve (please see Section 5 of Fawcett's paper for a discussion of all these issues).
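One common way to process the outcomes dynamically, in the spirit of the efficient algorithm in Fawcett's Section 5 (the details here are my own): sort the instances once by decreasing score and sweep down the list, emitting a point whenever the score changes.

```python
# Sketch of the more efficient approach: one sort plus one pass, instead of
# one pass per threshold. Labels: 1 = positive, 0 = negative.
def efficient_roc_points(labels_scores):
    P = sum(1 for label, _ in labels_scores if label == 1)
    N = len(labels_scores) - P
    points, TP, FP = [], 0, 0
    prev_score = None
    for label, score in sorted(labels_scores, key=lambda x: -x[1]):
        if score != prev_score:              # new threshold: record the point so far
            points.append((FP / N, TP / P))
            prev_score = score
        if label == 1:
            TP += 1
        else:
            FP += 1
    points.append((FP / N, TP / P))          # final point, always (1, 1)
    return points

print(efficient_roc_points([(1, 0.9), (1, 0.8), (0, 0.7), (1, 0.6), (0, 0.4), (0, 0.2)]))
```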

Page 22: CSI 5388:  ROC Analysis

Area Under a ROC Curve (AUC)

The AUC is a good way to get a score for the general performance of a classifier and to compare it to that of another classifier.

There are two statistical properties of the AUC:
• The AUC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance.
• The AUC is closely related to the Gini Index (used in CART): Gini + 1 = 2 × AUC.

Note that for specific issues about the performance of a classifier, the AUC is not sufficient and the ROC Curve should be looked at, but generally, the AUC is a reliable measure.
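A small sketch of the first property (on invented data): the AUC estimated as the fraction of positive/negative pairs ranked correctly (ties counted as one half) should agree with the area computed by scikit-learn, and the Gini relation then follows.

```python
# Sketch: AUC as the probability that a random positive outranks a random negative.
from sklearn.metrics import roc_auc_score

y_true  = [1, 1, 1, 0, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.6, 0.6, 0.5, 0.4, 0.3, 0.2]

pos = [s for s, y in zip(y_score, y_true) if y == 1]
neg = [s for s, y in zip(y_score, y_true) if y == 0]
pairs = [(1.0 if p > n else 0.5 if p == n else 0.0) for p in pos for n in neg]
auc_rank = sum(pairs) / len(pairs)

auc_curve = roc_auc_score(y_true, y_score)        # area under the ROC curve
print(auc_rank, auc_curve, "Gini =", 2 * auc_curve - 1)
```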

Page 23: CSI 5388:  ROC Analysis

Averaging ROC Curves

A single ROC Curve is not sufficient to make conclusions about a classifier, since it corresponds to only a single trial and ignores the second question we asked on Slide 2.

In order to avoid this problem, we need to average several ROC Curves. There are two averaging techniques:
• Vertical Averaging
• Threshold Averaging

(These will be discussed in class and are discussed in Section 7 of Fawcett's paper.)
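A rough sketch of the idea behind vertical averaging (the interpolation details are my own choice, not necessarily Fawcett's): fix a grid of FP-rate values, read each curve's TP rate at those values, and average across curves. Threshold averaging would instead average the (FP Rate, TP Rate) points obtained at the same score thresholds.

```python
# Sketch of vertical averaging over several ROC curves (e.g., one per fold).
import numpy as np

def vertical_average(curves, grid=np.linspace(0, 1, 11)):
    """curves: list of (fpr, tpr) arrays with fpr increasing."""
    tprs = [np.interp(grid, fpr, tpr) for fpr, tpr in curves]
    return grid, np.mean(tprs, axis=0)

# Two toy ROC curves, as if from two cross-validation folds.
curve1 = (np.array([0.0, 0.2, 0.5, 1.0]), np.array([0.0, 0.6, 0.8, 1.0]))
curve2 = (np.array([0.0, 0.1, 0.4, 1.0]), np.array([0.0, 0.5, 0.9, 1.0]))
grid, mean_tpr = vertical_average([curve1, curve2])
print(list(zip(grid, mean_tpr)))
```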

Page 24: CSI 5388:  ROC Analysis

Additional Topics (Section 8 of Fawcett's paper)

• The ROC Convex Hull
• Decision problems with more than 2 classes
• Combining classifiers
• Alternatives to ROC Graphs:
  • DET Curves
  • Cost Curves
  • LC Index