applied machine learning lecture 6-2: evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · i...

Applied Machine LearningLecture 6-2: Evaluation methods

Selpi ([email protected])

The slides are further development of Richard Johansson's slides

February 11, 2020

mailto:[email protected]

Overview

introduction

evaluating classi�cation systems

evaluating scorers and rankers

evaluating regressors

detour: training losses and evaluation metrics

application-speci�c evaluation: a sample

why evaluate?

I predictive systems are rarely perfect

I so we need to measure how well our systems are doingI and compare alternatives

I selecting a meaningful evaluation protocol is crucial in machinelearning projectsI this is something we emphasize for thesis projects

I what is the performance needed to be �useful�?

how do we evaluate? intrinsic and extrinsic

I intrinsic evaluation: measure the performance in isolation,using some automatically computed metric

I extrinsic evaluation: I changed my predictor � how does thisa�ect the performance of my �downstream task�?I how much more money do I make?I how many more clicks do I get?

is there already an established evaluation procedure?

I for many classical prediction problems, there might be awell-established procedure

I some of which can be quite speci�c to the type of application:I word error rate for speech recognition systemsI BLEU score for machine translation systems

I in this lecture, we'll look at some of the most commongenerally applicable evaluation methods that areI intrinsicI automatic

I we'll come back to some application-speci�c metrics at the end

high-level setup

qualitative analysis

I it can be useful to look at �interesting�instancesI but not as an argument for why your

system is goodI or for systematic comparison

I error analysis � an art more than a scienceI we need to understand our data

I can help me understand what data I lack, orhelp me improve features

Overview

introduction






accuracy and error rate

accuracy =nbr. correct

N

error rate =nbr. incorrect

N

example

confusion matrix

[source]

https://en.wikipedia.org/wiki/Confusion_matrix

considering one class

I let's say we're interested in one class in particularI �spotting problems� or ��nding the needles in the haystack�

evaluation metrics derived from a confusion matrix

I several evaluation metrics are de�nedfrom the cells in a confusion matrixI depending on the scienti�c �eld

I these metrics often come in pairsI overgeneration vs undergeneration

TPFN FP

TN

relevant predicted

see https://en.wikipedia.org/wiki/Confusion_matrix

https://en.wikipedia.org/wiki/Confusion_matrix

precision and recall

I the precision and recall are commonlyused for evaluating classi�ers

P =TP

TP + FPR =

TP

TP + FN

I in scikit-learn:I sklearn.metrics.precision_score

I sklearn.metrics.recall_score

I sklearn.metrics.classification_report

TPFN FP

TN

relevant predicted

FN FP

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html

precision / recall tradeo�

FN

TN

FN

FPTN

TP FP

TP

F-score

I the F -score is the harmonic mean of the precision and recall

F =2 · P · RP + R

I in scikit-learn: sklearn.metrics.f1_score

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html

other similar pairs of metrics

I sensitivity/speci�city (in medicine)

Sen =TP

TP + FNSpe =

TN

TN + FP

I true positive rate/false positive rate

TPR =TP

TP + FNFPR =

FP

FP + TN

TPFN FP

TN

relevant predicted

example (continued)

Overview

introduction






what if it's more important to �nd the malignant tumors?

I how can we adjust our decisions?

breast cancer example

2 4 6 8 10Clump Thickness

2

4

6

8

10

Sing

le E

pith

elia

l Cel

l Size

ranking and scoring systems

I for instance, search engines

I but also some classi�ers (e.g. decision_function,predict_proba)

intuition: evaluating rankers

evaluating a search engine: quality of the �rst page

I when I use a search engine, how goodare the results on the �rst page?

I precision at k

P@k =nbr relevant items in top k

k

evaluating a search engine (another example)

precision / recall curve

I is there a way to evaluate ranker or scorer that does notdepend on the threshold?

I in a precision/recall curve, we compute the precision fordi�erent recall levelsI that is, for di�erent thresholds

I we visualize the relationship using a plotI good = curve is near top right

P / R curve (example)

0.0 0.2 0.4 0.6 0.8 1.0recall

0.0

0.2

0.4

0.6

0.8

1.0

prec

ision

comparing P/R curves

0.0 0.2 0.4 0.6 0.8 1.0recall

0.0

0.2

0.4

0.6

0.8

1.0

prec

ision

average precision

0.0 0.2 0.4 0.6 0.8 1.0recall

0.0

0.2

0.4

0.6

0.8

1.0

prec

ision

AP = 0.903

ROC curve (Receiver Operating Characteristic)

[source]

https://upload.wikimedia.org/wikipedia/commons/6/6b/Roccurves.png

Area Under Curve score

0.0 0.2 0.4 0.6 0.8 1.0false positive rate

0.0

0.2

0.4

0.6

0.8

1.0

true

posit

ive

rate

AUC = 0.992

Overview

introduction






example

2 4 6 8 10

0

100

200

300

400

2 4 6 8 10

0

100

200

300

400

errors

2 4 6 8 10

0

100

200

300

400

2 4 6 8 10

0

100

200

300

400

mean squared error and mean absolute error

MSE =1

N

N∑i=1

(yi − yi )2 MAE =

1

N

N∑i=1

|yi − yi |

(N = number of instances, yi = prediction for instance i)

R2: the coe�cient of determination

I the MSE and MAE scores depend on the scale

I they don't directly tell whether the regressor behaves �well�

I the coe�cient of determination (R2) is an evaluation scorethat does not depend on the scale

R2 = 1−∑

N

i=1(yi − yi )

2∑N

i=1(yi − y)2

I in case of a perfect regression, this score is 1

I if completely messed up, 0 or negative

MSE and R2 for the example

2 4 6 8 10

0

100

200

300

400 MSE = 3298.02, R2 = 0.75

2 4 6 8 10

0

100

200

300

400 MSE = 1106.66, R2 = 0.92

what to do about ordinal scales

I see e.g. Baccianella et al. (2009) Evaluation Measures for

Ordinal Regression

https://ieeexplore.ieee.org/document/5364825

https://ieeexplore.ieee.org/document/5364825

Overview

introduction






regression and classi�cation evaluation

MSE =1

N

N∑i=1

(yi − yi )2 error =

1

N

N∑i=1

1[yi 6= yi ]

I for regression, MSE is often usedI as the quality metric when we evaluateI as the loss function when we train

I but typically, for classi�cation what we compute duringtraining and testing is di�erentI such as hinge loss or log loss during training

loss functions

3 2 1 0 1 2 30.0

0.5

1.0

1.5

2.0

2.5

3.0

3.5

4.0 log losshinge losszero-one loss

Overview

introduction






application-speci�c quality metrics

I as we discussed, there may be an established evaluationprotocol for the task you're working on

I or you might design a new one

humans in the loop

I evaluating using people can be an alternative to automaticevaluationI if we don't have data for evaluating automaticallyI or if we want to evaluate something that is hard to measure

I expensive!

I reliable?

I replicable?

to conclude

Next lecture

I Neural networks

applied machine learning lecture 6-2: evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · i...

Documents