applied machine learning lecture 6-2: evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · i...

53

Upload: others

Post on 23-Aug-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

Applied Machine LearningLecture 6-2: Evaluation methods

Selpi ([email protected])

The slides are further development of Richard Johansson's slides

February 11, 2020

Page 2: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

Overview

introduction

evaluating classi�cation systems

evaluating scorers and rankers

evaluating regressors

detour: training losses and evaluation metrics

application-speci�c evaluation: a sample

Page 3: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

why evaluate?

I predictive systems are rarely perfect

I so we need to measure how well our systems are doingI and compare alternatives

I selecting a meaningful evaluation protocol is crucial in machinelearning projectsI this is something we emphasize for thesis projects

I what is the performance needed to be �useful�?

Page 4: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

how do we evaluate? intrinsic and extrinsic

I intrinsic evaluation: measure the performance in isolation,using some automatically computed metric

I extrinsic evaluation: I changed my predictor � how does thisa�ect the performance of my �downstream task�?I how much more money do I make?I how many more clicks do I get?

Page 5: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

is there already an established evaluation procedure?

I for many classical prediction problems, there might be awell-established procedure

I some of which can be quite speci�c to the type of application:I word error rate for speech recognition systemsI BLEU score for machine translation systems

I in this lecture, we'll look at some of the most commongenerally applicable evaluation methods that areI intrinsicI automatic

I we'll come back to some application-speci�c metrics at the end

Page 6: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

high-level setup

Page 7: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

high-level setup

Page 8: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

high-level setup

Page 9: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

qualitative analysis

I it can be useful to look at �interesting�instancesI but not as an argument for why your

system is goodI or for systematic comparison

I error analysis � an art more than a scienceI we need to understand our data

I can help me understand what data I lack, orhelp me improve features

Page 10: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

Overview

introduction

evaluating classi�cation systems

evaluating scorers and rankers

evaluating regressors

detour: training losses and evaluation metrics

application-speci�c evaluation: a sample

Page 11: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

accuracy and error rate

accuracy =nbr. correct

N

error rate =nbr. incorrect

N

Page 12: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

example

Page 13: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

confusion matrix

[source]

Page 14: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

considering one class

I let's say we're interested in one class in particularI �spotting problems� or ��nding the needles in the haystack�

Page 15: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

evaluation metrics derived from a confusion matrix

I several evaluation metrics are de�nedfrom the cells in a confusion matrixI depending on the scienti�c �eld

I these metrics often come in pairsI overgeneration vs undergeneration

TPFN FP

TN

relevant predicted

see https://en.wikipedia.org/wiki/Confusion_matrix

Page 16: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

precision and recall

I the precision and recall are commonlyused for evaluating classi�ers

P =TP

TP + FPR =

TP

TP + FN

I in scikit-learn:I sklearn.metrics.precision_score

I sklearn.metrics.recall_score

I sklearn.metrics.classification_report

TPFN FP

TN

relevant predicted

FN FP

Page 17: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

precision / recall tradeo�

FN

TN

FN

FPTN

TP FP

TP

Page 18: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

F-score

I the F -score is the harmonic mean of the precision and recall

F =2 · P · RP + R

I in scikit-learn: sklearn.metrics.f1_score

Page 19: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

other similar pairs of metrics

I sensitivity/speci�city (in medicine)

Sen =TP

TP + FNSpe =

TN

TN + FP

I true positive rate/false positive rate

TPR =TP

TP + FNFPR =

FP

FP + TN

TPFN FP

TN

relevant predicted

Page 20: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

example (continued)

Page 21: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

Overview

introduction

evaluating classi�cation systems

evaluating scorers and rankers

evaluating regressors

detour: training losses and evaluation metrics

application-speci�c evaluation: a sample

Page 22: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

what if it's more important to �nd the malignant tumors?

I how can we adjust our decisions?

Page 23: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

breast cancer example

2 4 6 8 10Clump Thickness

2

4

6

8

10

Sing

le E

pith

elia

l Cel

l Size

Page 24: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

ranking and scoring systems

I for instance, search engines

I but also some classi�ers (e.g. decision_function,predict_proba)

Page 25: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

intuition: evaluating rankers

Page 26: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

evaluating a search engine: quality of the �rst page

I when I use a search engine, how goodare the results on the �rst page?

I precision at k

P@k =nbr relevant items in top k

k

Page 27: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

evaluating a search engine (another example)

Page 28: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

precision / recall curve

I is there a way to evaluate ranker or scorer that does notdepend on the threshold?

I in a precision/recall curve, we compute the precision fordi�erent recall levelsI that is, for di�erent thresholds

I we visualize the relationship using a plotI good = curve is near top right

Page 29: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

P / R curve (example)

0.0 0.2 0.4 0.6 0.8 1.0recall

0.0

0.2

0.4

0.6

0.8

1.0

prec

ision

Page 30: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

P / R curve (example)

0.0 0.2 0.4 0.6 0.8 1.0recall

0.0

0.2

0.4

0.6

0.8

1.0

prec

ision

Page 31: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

P / R curve (example)

0.0 0.2 0.4 0.6 0.8 1.0recall

0.0

0.2

0.4

0.6

0.8

1.0

prec

ision

Page 32: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

P / R curve (example)

0.0 0.2 0.4 0.6 0.8 1.0recall

0.0

0.2

0.4

0.6

0.8

1.0

prec

ision

Page 33: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

P / R curve (example)

0.0 0.2 0.4 0.6 0.8 1.0recall

0.0

0.2

0.4

0.6

0.8

1.0

prec

ision

Page 34: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

P / R curve (example)

0.0 0.2 0.4 0.6 0.8 1.0recall

0.0

0.2

0.4

0.6

0.8

1.0

prec

ision

Page 35: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

comparing P/R curves

0.0 0.2 0.4 0.6 0.8 1.0recall

0.0

0.2

0.4

0.6

0.8

1.0

prec

ision

Page 36: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

average precision

0.0 0.2 0.4 0.6 0.8 1.0recall

0.0

0.2

0.4

0.6

0.8

1.0

prec

ision

AP = 0.903

Page 37: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

ROC curve (Receiver Operating Characteristic)

[source]

Page 38: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

Area Under Curve score

0.0 0.2 0.4 0.6 0.8 1.0false positive rate

0.0

0.2

0.4

0.6

0.8

1.0

true

posit

ive

rate

AUC = 0.992

Page 39: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

Overview

introduction

evaluating classi�cation systems

evaluating scorers and rankers

evaluating regressors

detour: training losses and evaluation metrics

application-speci�c evaluation: a sample

Page 40: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

example

2 4 6 8 10

0

100

200

300

400

2 4 6 8 10

0

100

200

300

400

Page 41: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

errors

2 4 6 8 10

0

100

200

300

400

2 4 6 8 10

0

100

200

300

400

Page 42: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

mean squared error and mean absolute error

MSE =1

N

N∑i=1

(yi − yi )2 MAE =

1

N

N∑i=1

|yi − yi |

(N = number of instances, yi = prediction for instance i)

Page 43: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

R2: the coe�cient of determination

I the MSE and MAE scores depend on the scale

I they don't directly tell whether the regressor behaves �well�

I the coe�cient of determination (R2) is an evaluation scorethat does not depend on the scale

R2 = 1−∑

N

i=1(yi − yi )

2∑N

i=1(yi − y)2

I in case of a perfect regression, this score is 1

I if completely messed up, 0 or negative

Page 44: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

MSE and R2 for the example

2 4 6 8 10

0

100

200

300

400 MSE = 3298.02, R2 = 0.75

2 4 6 8 10

0

100

200

300

400 MSE = 1106.66, R2 = 0.92

Page 45: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

what to do about ordinal scales

I see e.g. Baccianella et al. (2009) Evaluation Measures for

Ordinal Regression

Page 46: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

Overview

introduction

evaluating classi�cation systems

evaluating scorers and rankers

evaluating regressors

detour: training losses and evaluation metrics

application-speci�c evaluation: a sample

Page 47: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

regression and classi�cation evaluation

MSE =1

N

N∑i=1

(yi − yi )2 error =

1

N

N∑i=1

1[yi 6= yi ]

I for regression, MSE is often usedI as the quality metric when we evaluateI as the loss function when we train

I but typically, for classi�cation what we compute duringtraining and testing is di�erentI such as hinge loss or log loss during training

Page 48: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

loss functions

3 2 1 0 1 2 30.0

0.5

1.0

1.5

2.0

2.5

3.0

3.5

4.0 log losshinge losszero-one loss

Page 49: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

Overview

introduction

evaluating classi�cation systems

evaluating scorers and rankers

evaluating regressors

detour: training losses and evaluation metrics

application-speci�c evaluation: a sample

Page 50: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

application-speci�c quality metrics

I as we discussed, there may be an established evaluationprotocol for the task you're working on

I or you might design a new one

Page 51: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

humans in the loop

I evaluating using people can be an alternative to automaticevaluationI if we don't have data for evaluating automaticallyI or if we want to evaluate something that is hard to measure

I expensive!

I reliable?

I replicable?

Page 52: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

to conclude

Page 53: Applied Machine Learning Lecture 6-2: Evaluation methodsrichajo/dit866/lectures/l6/l6_2.pdf · I the coe cient of determination ( R 2) is an evaluation score that does not depend

Next lecture

I Neural networks