
Page 1:

Lecture 15: Model Selection and Validation, Cross Validation

Hao Helen Zhang

October 15, 2020

Page 2:

Outline

Read: ESL 7.4, 7.10

Training set, Validation set, Test Set

Hold-out method

K-fold Cross Validation

Model-based Selection Criteria

Page 3:

Model Assessment

Given data $(x_i, y_i)$, $i = 1, \cdots, n$, where $x_i$ is the input and $y_i$ is the response, we fit the model $f$.

Model assessment is extremely important, as it can

measure the quality of a fitted model.

compare different methods and make an optimal choice.

select the best model within a model class $\mathcal{F} = \{f_\alpha\}$, where $\alpha$ is the model index

choose optimal tuning parameters in regularization methods.

Page 4:

Model Selection and Assessment

Two Goals:

Model Assessment: after choosing a final model, estimate its prediction error (generalization performance) on new data.

Model Selection: estimate the performance of different models and choose the best model.

Different models:

Is logistic regression $f_1$ better than LDA $f_2$?

From the same model class

Is 3-nn better than 5-nn?

What is the best $k$ for the model class $\mathcal{F} = \{f_k, 1 \le k \le n\}$?

Page 5:

Bias-Variance Trade-off

As complexity increases, the model tends to

adapt to more complicated underlying structures (decrease in bias)

have increased estimation error (increase in variance)

In between there is an optimal model complexity, giving the minimum test error.

Role of tuning parameter α (k in kNN, or λ in LASSO)

Tuning parameter α controls the model complexity

Find the best α to produce the minimum test error

Page 6:

Model Performance Evaluation

How do we evaluate the performance of f ?

its performance on the real-world data (“generalization”performance)

we want to test f on the data that it has never seen before

this set of data is called the “test” set

the test data is independent of the training data

Why not training data?

the risk of overfitting

Page 7:

Loss and Errors

If Y is continuous, two commonly-used loss functions are

squared error: $L(Y, f(X)) = (Y - f(X))^2$

absolute error: $L(Y, f(X)) = |Y - f(X)|$

If Y takes values in $\{1, \ldots, K\}$, the common loss functions are

0-1 loss: $L(Y, f(X)) = I(Y \neq f(X))$

log-likelihood: $L(Y, p(X)) = -2\sum_{k=1}^{K} I(Y = k)\log p_k(X) = -2\log p_Y(X)$
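A minimal numerical illustration of these losses (NumPy only; the responses, predicted labels, and class-probability matrix below are made-up toy values):

```python
import numpy as np

# Continuous response: squared-error and absolute-error losses
y = np.array([1.0, 2.5, 0.3])
f_x = np.array([0.8, 2.0, 0.9])
squared_loss = (y - f_x) ** 2        # (Y - f(X))^2
absolute_loss = np.abs(y - f_x)      # |Y - f(X)|

# Categorical response taking values {1, ..., K}: 0-1 loss and deviance
y_class = np.array([1, 3, 2])        # true labels
f_class = np.array([1, 2, 2])        # predicted labels
zero_one_loss = (y_class != f_class).astype(float)      # I(Y != f(X))

# p[i, k-1] = estimated probability that observation i belongs to class k
p = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6],
              [0.2, 0.5, 0.3]])
deviance = -2.0 * np.log(p[np.arange(3), y_class - 1])  # -2 log p_Y(X)

print(squared_loss, absolute_loss, zero_one_loss, deviance)
```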

Page 8:

Training error and Test error

Training error: the average loss over the training samples

$$\mathrm{TrainErr} = \frac{1}{n}\sum_{i=1}^{n} L(y_i, f(x_i))$$

Test error: Given a test data set $(x_i', y_i')$ for $i = 1, \ldots, n'$,

$$\mathrm{TestErr} = \frac{1}{n'}\sum_{i=1}^{n'} L(y_i', f(x_i')).$$

TestErr is an estimate of PE. The larger $n'$, the better the estimate.

Prediction error: the expected error over all the inputs

$$\mathrm{PE} = E_{(X, Y', \mathrm{data})}\left[\frac{1}{n}\sum_{i=1}^{n} L(Y_i', f(X_i))\right],$$

where $Y_i' = f(X_i) + \varepsilon_i'$, $i = 1, \cdots, n$, are independent of $y_1, \cdots, y_n$. Expectation is taken w.r.t. all random quantities.
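A small simulation sketch of these quantities (everything here is an assumed toy setup: a sine true function, Gaussian noise, and a deliberately flexible degree-9 polynomial fit). It contrasts TrainErr with an estimate of PE obtained by repeatedly drawing fresh responses $Y_i' = f(X_i) + \varepsilon_i'$ at the same inputs:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = np.linspace(0, 1, n)
f_true = lambda t: np.sin(2 * np.pi * t)          # assumed true regression function
y = f_true(x) + rng.normal(scale=0.3, size=n)     # training responses

# Fit an overly flexible model (degree-9 polynomial)
coef = np.polyfit(x, y, deg=9)
f_hat = np.polyval(coef, x)

train_err = np.mean((y - f_hat) ** 2)             # TrainErr

# Estimate PE: average squared loss against fresh responses Y' = f(X) + eps',
# drawn at the same inputs, averaged over many replications
pe_hat = np.mean([
    np.mean((f_true(x) + rng.normal(scale=0.3, size=n) - f_hat) ** 2)
    for _ in range(200)
])
print(train_err, pe_hat)   # TrainErr is typically noticeably smaller than PE
```

This also previews the next slides: the more flexible the fit, the wider the gap between the two numbers.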

Page 9:

Examples

Regression:

$$\mathrm{TrainErr} = \frac{1}{n}\sum_{i=1}^{n}(y_i - f(x_i))^2$$

Classification:

log-likelihood loss: TrainErr is the sample log-likelihood

$$\mathrm{TrainErr} = -\frac{2}{n}\sum_{i=1}^{n}\log p_{y_i}(x_i),$$

also called the cross-entropy loss or deviance.

0-1 loss: the error is the misclassification rate

$$\mathrm{TrainErr} = \frac{1}{n}\sum_{i=1}^{n} I(y_i \neq f(x_i)), \qquad \mathrm{TestErr} = \frac{1}{n'}\sum_{i=1}^{n'} I(y_i' \neq f(x_i')).$$

TestErr is also called the expected misclassification rate.

Page 10:

What is Wrong with Training Error

TrainErr is not a good estimate of TestErr

TrainErr is often quite different from TestErr.

TrainErr can dramatically underestimate TestErr.

TrainErr is not an objective measure of the generalization performance

we built f based on the training data.

TrainErr decreases with model complexity.

Training error drops to zero if the model is complex enough.
A model with zero training error overfits the training data; over-fitted models typically generalize poorly.

Page 12:

Common Techniques

How do we estimate TestErr accurately? Consider

data-rich situation

data-insufficient situation

Common Techniques (Efron 2004, JASA)

Hold Out Method (data-rich situation)

Resampling methods:

cross validation
bootstrap

Analytical estimation criteria

Page 13:

Data-Rich Situation: Hold Out Method

Key Idea: Estimate the test error by

hold out a subset of the training data from the fitting process
apply the fitted model to those held-out observations and calculate the prediction errors

Page 14:

Hold Out Approach: without tuning

Randomly divide the data into two parts: a training set and atest or hold-out set.

The model is fit on the training set, and the fitted model isused to predict the responses for the observations in theheld-out set.

The resulting error provides an estimate of the test error (e.g.,MSE in regression, misclassification rate in classification).
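A sketch of this hold-out estimate (a generic scikit-learn linear model and a 70/30 split are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, -2.0]) + rng.normal(size=200)

# Randomly split into a training set and a hold-out (test) set
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_tr, y_tr)
holdout_mse = np.mean((y_te - model.predict(X_te)) ** 2)   # estimate of TestErr
print(holdout_mse)
```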

Page 15:

Hold Out Approach: with tuning

Randomly divide the dataset into three parts:

training set: to fit the models
tuning (validation) set: to estimate the prediction error for model selection
test set: to assess the generalization error of the final model

The typical proportions are respectively: 50%, 25%, 25%.
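One way to produce such a 50/25/25 split is to call scikit-learn's train_test_split twice; the ridge penalties searched over below are arbitrary illustrative values:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=400)

# 50% training set, then split the remaining half equally into tuning and test sets
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Tune lambda on the validation set, then assess the chosen model on the test set
lambdas = [0.01, 0.1, 1.0, 10.0]
val_mse = {lam: np.mean((y_val - Ridge(alpha=lam).fit(X_tr, y_tr).predict(X_val)) ** 2)
           for lam in lambdas}
best_lam = min(val_mse, key=val_mse.get)
final = Ridge(alpha=best_lam).fit(X_tr, y_tr)
test_mse = np.mean((y_te - final.predict(X_te)) ** 2)
print(best_lam, test_mse)
```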

Page 16:

Drawbacks of Hold Out Approach

The TestErr estimate can be highly variable, depending on which observations are included in the training set and which are held out.

Only a subset of the observations (those not held out) areused to fit the model.

The estimated error tends to overestimate the test error for the model fit on the entire data set.

Page 17:

Resampling Methods

These methods refit a model to samples formed from the training set

Cross-validation

bootstrap techniques

They provide estimates of test-set prediction error, and thestandard deviation and bias of our parameter estimates.

Page 18:

K-fold Cross Validation

The simplest and most popular way to estimate the test error.

1. Randomly split the data into K roughly equal parts.
2. For each k = 1, ..., K:
leave the k-th portion out and fit the model using the other K − 1 parts; denote the solution by $f^{-k}(x)$
calculate the prediction error of $f^{-k}(x)$ on the k-th (left-out) portion.
3. Average the errors.

Define the Index function (allocating memberships)

$$\kappa : \{1, \ldots, n\} \to \{1, \ldots, K\}.$$

Then the K-fold cross-validation estimate of prediction error is

$$\mathrm{CV} = \frac{1}{n}\sum_{i=1}^{n} L(y_i, f^{-\kappa(i)}(x_i))$$

Typical choice: K = 5 or 10.
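A from-scratch sketch of this estimate (K = 5 and ordinary least squares are stand-ins; any model with fit/predict methods would do):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def kfold_cv_error(X, y, K=5, seed=0):
    """K-fold CV estimate of prediction error under squared-error loss."""
    n = len(y)
    rng = np.random.default_rng(seed)
    # Index function kappa: assigns each observation to one of the K folds
    kappa = rng.permutation(np.repeat(np.arange(K), int(np.ceil(n / K)))[:n])
    losses = np.empty(n)
    for k in range(K):
        held_out = (kappa == k)
        # Fit on the other K-1 parts, predict the left-out part
        model = LinearRegression().fit(X[~held_out], y[~held_out])
        losses[held_out] = (y[held_out] - model.predict(X[held_out])) ** 2
    return losses.mean()   # CV = (1/n) sum_i L(y_i, f^{-kappa(i)}(x_i))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X[:, 0] + rng.normal(size=100)
print(kfold_cv_error(X, y, K=5))
```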

Page 20:

Choose α

Given tuning parameter α, the model fit is $f^{-k}(x, \alpha)$. The CV(α) curve is

$$\mathrm{CV}(\alpha) = \frac{1}{n}\sum_{i=1}^{n} L(y_i, f^{-\kappa(i)}(x_i, \alpha))$$

CV (α) provides an estimate of the test error curve

We find the best parameter $\hat{\alpha}$ by minimizing $\mathrm{CV}(\alpha)$.

Final model is $f(x, \hat{\alpha})$.
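The same recipe with scikit-learn's CV utilities (α here is a ridge penalty; the candidate grid is an arbitrary illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 20))
y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=200)

alphas = np.logspace(-3, 3, 13)
# CV(alpha): average squared-error loss over 10 folds, for each candidate alpha
cv_curve = [-cross_val_score(Ridge(alpha=a), X, y,
                             scoring="neg_mean_squared_error", cv=10).mean()
            for a in alphas]
alpha_hat = alphas[int(np.argmin(cv_curve))]     # minimizer of CV(alpha)
final_model = Ridge(alpha=alpha_hat).fit(X, y)   # final model f(x, alpha_hat)
print(alpha_hat)
```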

Page 21:

Application Examples

For regression problems, apply CV to

shrinkage methods: select λ

best subset selection: select the best model

nonparametric methods: select the smoothing parameter

For classification methods, apply CV to

kNN: select k

SVM: select λ

Page 22:

Leave-One-Out (LOO) Cross Validation

If K = n, we have leave-one-out cross validation.

κ(i) = i

LOO-CV is approximately unbiased for the true prediction error

LOO-CV can have high variance because the n “training sets”are so similar to one another

Computation:

Computational burden is generally considerable, requiring n model fits.

For special models like linear smoothers, computation can be done quickly, for example via Generalized Cross Validation (GCV).

Page 23:

Statistical Properties of Cross Validation Error Estimates

Cross-validation error is very nearly unbiased for test error.

The slight bias is due to the fact that the training set in cross-validation is slightly smaller than the actual data set (e.g., for LOOCV the training set size is n − 1 when there are n observed cases).

In nearly all situations, the effect of this bias is conservative, i.e., the estimated fit is slightly biased in the direction suggesting a poorer fit. In practice, this bias is rarely a concern.

The variance of CV-error can be large.

In practice, to reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds.
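A sketch of that averaging with scikit-learn's RepeatedKFold (5 folds repeated 10 times is an arbitrary choice):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
y = X[:, 0] + rng.normal(size=100)

# 5-fold CV repeated 10 times with different random partitions, then averaged
rkf = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)
fold_errors = -cross_val_score(LinearRegression(), X, y,
                               scoring="neg_mean_squared_error", cv=rkf)
print(fold_errors.mean(), fold_errors.std())
```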

Page 24:

Performance of CV

Example: Two-class classification problem

Linear model with best subsets regression of subset size p

In ESL Figure 7.9 (p is the tuning parameter),

both curves pick the best p = 10 (the true model).

CV curve is flat beyond 10.

Standard error bars are the standard errors of the individual misclassification error rates

Page 25:

Standard Errors for K-fold CV

Let $CV_k(f_\alpha)$ denote the validation error of $f_\alpha$ on the k-th fold, or simply $CV_k(\alpha)$. Here $\alpha \in \{\alpha_1, \ldots, \alpha_m\}$, a set of candidate values for α. Then we have

The cross validation error

$$CV(\alpha) = \frac{1}{K}\sum_{k=1}^{K} CV_k(\alpha).$$

The standard deviation of CV (α) is calculated as

$$SE(\alpha) = \sqrt{\mathrm{Var}(CV(\alpha))}.$$
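A sketch of these per-fold quantities (the variance of the K-fold average is estimated here as the sample variance of the fold errors divided by K, the usual approximation; ridge regression is a stand-in model):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

def cv_mean_and_se(X, y, alpha, K=10, seed=0):
    """Return CV(alpha) and its standard error from the fold errors CV_k(alpha)."""
    fold_errors = []
    for tr, val in KFold(n_splits=K, shuffle=True, random_state=seed).split(X):
        model = Ridge(alpha=alpha).fit(X[tr], y[tr])
        fold_errors.append(np.mean((y[val] - model.predict(X[val])) ** 2))  # CV_k(alpha)
    fold_errors = np.array(fold_errors)
    cv = fold_errors.mean()                      # CV(alpha) = (1/K) sum_k CV_k(alpha)
    se = fold_errors.std(ddof=1) / np.sqrt(K)    # SE(alpha) = sqrt(Var-hat(CV(alpha)))
    return cv, se
```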

Page 26:

One-Standard-Error Rule for K-fold CV

Let
$$\hat{\alpha} = \arg\min_{\alpha \in \{\alpha_1, \ldots, \alpha_m\}} CV(\alpha).$$

First we find $\hat{\alpha}$; then we move α in the direction of increasing regularization as far as we can, as long as

$$CV(\alpha) \le CV(\hat{\alpha}) + SE(\hat{\alpha}).$$

Choose the most parsimonious model whose error is no more than one standard error above the best error.

Idea: “All else equal (up to one standard error), go for the simplest model.”

In this example, the one-SE rule would choose p = 9.
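A hedged sketch of the rule, continuing the previous function (here "more regularization" is taken to mean a larger ridge penalty; for other methods the direction of increasing regularization differs):

```python
import numpy as np

def one_se_rule(alphas, cv_values, se_values, larger_is_more_regularized=True):
    """Most regularized alpha whose CV error is within one SE of the minimum."""
    alphas = np.asarray(alphas)
    cv_values = np.asarray(cv_values)
    se_values = np.asarray(se_values)
    best = int(np.argmin(cv_values))                   # index of alpha-hat = argmin CV(alpha)
    threshold = cv_values[best] + se_values[best]      # CV(alpha-hat) + SE(alpha-hat)
    admissible = cv_values <= threshold                # models within one SE of the best
    return (alphas[admissible].max() if larger_is_more_regularized
            else alphas[admissible].min())
```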

Page 27:

[Figure (ESL Figure 7.9): prediction error and tenfold cross-validation curve, estimated from a single training set; misclassification error versus subset size p, with one-standard-error bars.]

Page 28:

The Wrong and Right Way

Consider a simpler classifier for microarrays:

1. Starting with 5,000 genes, find the top 200 genes having the largest correlation with the class label.

2. Carry out nearest-centroid classification using the top 200 genes.

How do we select the tuning parameter in the classifier?

Way 1: apply cross-validation to step 2

Way 2: apply cross-validation to steps 1 and 2

Which is right? Cross-validating the whole procedure (Way 2).
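A sketch of Way 2, with the screening step repeated inside every CV fold (nearest-centroid classification and plain correlation screening stand in for the procedure above; the dimensions are shrunk and the features are pure noise, so the honest error is about 50%):

```python
import numpy as np
from sklearn.neighbors import NearestCentroid
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 1000))       # 50 samples, 1000 "genes" (pure noise)
y = rng.integers(0, 2, size=50)       # two classes, unrelated to X

n_top = 20
errors = []
for tr, te in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    # Screening happens INSIDE the fold, using only the training portion
    corr = np.abs(np.corrcoef(X[tr].T, y[tr])[-1, :-1])
    top = np.argsort(corr)[-n_top:]
    clf = NearestCentroid().fit(X[tr][:, top], y[tr])
    errors.append(np.mean(clf.predict(X[te][:, top]) != y[te]))
print(np.mean(errors))   # close to 0.5, the honest error for pure-noise features
```

Screening on all samples first and then cross-validating only step 2 (Way 1) would report a misleadingly low error on this same noise data.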

Page 29:

Generalized Cross Validation

If a linear fitting method $\hat{y} = Sy$ satisfies

$$y_i - f^{-i}(x_i) = \frac{y_i - f(x_i)}{1 - S_{ii}},$$

then we do NOT need to fit the model n times. Using the approximation $S_{ii} \approx \mathrm{trace}(S)/n$, we have the approximate score

$$\mathrm{GCV} = \frac{1}{n}\sum_{i=1}^{n}\left[\frac{y_i - f(x_i)}{1 - \mathrm{trace}(S)/n}\right]^2$$

computational advantage

GCV (α) is always smoother than CV (α)
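A sketch for ridge regression, whose fit is a linear smoother $\hat{y} = Sy$ with $S = X(X^TX + \lambda I)^{-1}X^T$, so the GCV score needs only a single fit per λ (the data and the λ grid are illustrative):

```python
import numpy as np

def ridge_gcv(X, y, lam):
    """GCV score for ridge regression, a linear smoother y_hat = S y."""
    n = len(y)
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T)  # smoother matrix
    y_hat = S @ y
    df = np.trace(S)                                  # effective degrees of freedom
    return np.mean(((y - y_hat) / (1.0 - df / n)) ** 2)

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 10))
y = X[:, 0] - X[:, 1] + rng.normal(size=100)
lams = np.logspace(-3, 3, 13)
best_lam = lams[int(np.argmin([ridge_gcv(X, y, lam) for lam in lams]))]
print(best_lam)
```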

Page 30:

Model-Based Methods: Information Criteria

Covariance penalties

Cp

AIC

BIC

Stein’s UBR

They can be used for both model selection and test errorestimation.

Page 31:

Generalized Information Criteria (GIC)

Information criteria:

$$\mathrm{GIC}(\text{model}) = -2 \cdot \mathrm{loglik} + \alpha \cdot \mathrm{df},$$

where the degrees of freedom (df) of $f$ is the number of effective parameters of the model.

Examples: In linear regression settings, we use

$$\mathrm{AIC} = n\log(\mathrm{RSS}/n) + 2 \cdot \mathrm{df},$$

$$\mathrm{BIC} = n\log(\mathrm{RSS}/n) + \log(n) \cdot \mathrm{df},$$

where $\mathrm{RSS} = \sum_{i=1}^{n}[y_i - f(x_i)]^2$ (the residual sum of squares).
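A sketch of these RSS-based forms for an ordinary least-squares fit (df is taken as the number of columns of X, so an intercept, if wanted, should be included as a column of ones; some references add a constant or an error-variance term, which is omitted here to match the formulas above):

```python
import numpy as np

def aic_bic_ols(X, y):
    """AIC and BIC for an OLS fit using the RSS-based forms above."""
    n, df = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    aic = n * np.log(rss / n) + 2 * df
    bic = n * np.log(rss / n) + np.log(n) * df
    return aic, bic

# Compare two nested designs: BIC penalizes the larger model more heavily
rng = np.random.default_rng(7)
X = rng.normal(size=(100, 8))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=100)
print(aic_bic_ols(X[:, :2], y), aic_bic_ols(X, y))
```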

Page 32:

Optimism of Training Error Rate

training error:

$$\mathrm{err} = \frac{1}{n}\sum_{i=1}^{n} L(y_i, f(x_i))$$

test error:

$$\mathrm{Err} = E_{(X, Y, \mathrm{data})}[L(Y, f(X))].$$

in-sample error rate (an estimate of prediction error):

$$\mathrm{Err}_{in} = \frac{1}{n}\sum_{i=1}^{n} E_{y^{new}} E_{y}\, L(y_i^{new}, f(x_i))$$

Optimism:

$$\mathrm{op} = \mathrm{Err}_{in} - E_y\, \mathrm{err}$$

Page 33:

Optimism

op is typically positive (Why?)

For squared error, 0-1, and other loss functions,

$$\mathrm{op} = \frac{2}{n}\sum_{i=1}^{n}\mathrm{Cov}(\hat{y}_i, y_i).$$

The amount by which err under-estimates the true error depends on the strength of the correlation between $y_i$ and its prediction $\hat{y}_i$.
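For a linear fit this covariance penalty has a simple closed form, which is what links optimism to the $C_p$/AIC-type penalties above (a standard result, stated here as a reminder rather than derived): if $\hat{y} = Sy$ with $d = \mathrm{trace}(S)$ effective parameters and uncorrelated errors of variance $\sigma^2$, then

$$\sum_{i=1}^{n}\mathrm{Cov}(\hat{y}_i, y_i) = \sigma^2\,\mathrm{trace}(S) = d\,\sigma^2, \qquad \text{so} \qquad \mathrm{op} = \frac{2}{n}\, d\, \sigma^2 \quad\text{and}\quad E_y[\mathrm{Err}_{in}] = E_y[\mathrm{err}] + \frac{2}{n}\, d\, \sigma^2.$$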
