ESL Chapter 7: Model Selection (Rob Tibshirani)
Outline
• Bias-variance tradeoff
• Optimism of training error
• Estimates of in-sample prediction error
• BIC
• VC dimension
• Cross-validation (chapter 3), bootstrap
Model selection
• Loss functions
L(Y, Ŷ(X)) = (Y − Ŷ(X))^2  (squared error),
L(G, Ĝ(X)) = I(G ≠ Ĝ(X))  (0–1 loss),
L(G, p̂(X)) = −2 Σ_{k=1}^K I(G = k) log p̂_k(X) = −2 log p̂_G(X)  (log-likelihood).
• Training error:
  err = (1/N) Σ_{i=1}^N L(y_i, f̂(x_i)).
• Generalization error:
  Err = E[L(Y, f̂(X))].
[Figure: prediction error versus model complexity, with training-sample and test-sample curves; the low-complexity end is high bias/low variance, the high-complexity end is low bias/high variance. The training curve decreases steadily while the test curve is U-shaped.]
Behavior of test-sample and training-sample error as the model complexity is varied.
Bias-variance decomposition
Err(x_0) = E[(Y − f̂(x_0))^2 | X = x_0]
         = σ²_ε + [E f̂(x_0) − f(x_0)]^2 + E[f̂(x_0) − E f̂(x_0)]^2
         = σ²_ε + Bias^2(f̂(x_0)) + Var(f̂(x_0))
         = Irreducible Error + Bias^2 + Variance.
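The decomposition can be checked by simulation. A minimal sketch in NumPy: at a single point x_0 we repeatedly redraw the training responses, form a deliberately biased shrinkage estimator, and compare the Monte Carlo estimate of Err(x_0) with σ²_ε + Bias² + Variance (the estimator, sample sizes, and shrinkage factor are illustrative choices, not from the slides):

```python
# Numerical check of Err(x0) = sigma^2 + Bias^2 + Var at a single point.
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0            # noise s.d. (sigma_eps)
f_x0 = 2.0             # true regression function value f(x0)
n_train, n_rep = 25, 100_000
shrink = 0.8           # shrinkage -> introduces bias, reduces variance

# Each row is one training set of responses at x0: y = f(x0) + eps
y = f_x0 + sigma * rng.standard_normal((n_rep, n_train))
f_hat = shrink * y.mean(axis=1)             # estimator, one value per training set

# Independent test response Y at x0 for each replication
y_test = f_x0 + sigma * rng.standard_normal(n_rep)

err = np.mean((y_test - f_hat) ** 2)        # Err(x0)
bias2 = (f_hat.mean() - f_x0) ** 2          # Bias^2
var = f_hat.var()                           # Var(f_hat)
print(err, sigma**2 + bias2 + var)          # the two should agree closely
```

Here Bias = 0.8·2 − 2 = −0.4 and Var = 0.64·σ²/25, so the two printed numbers should both be near 1.19.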
For k-nearest neighbors:
Err(x_0) = E[(Y − f̂_k(x_0))^2 | X = x_0]
         = σ²_ε + [f(x_0) − (1/k) Σ_{ℓ=1}^k f(x_(ℓ))]^2 + σ²_ε/k.
For linear regression with p inputs:
Err(x_0) = E[(Y − f̂_p(x_0))^2 | X = x_0]
         = σ²_ε + [f(x_0) − E f̂_p(x_0)]^2 + ||h(x_0)||^2 σ²_ε,
and averaging over the training inputs,
(1/N) Σ_{i=1}^N Err(x_i) = σ²_ε + (1/N) Σ_{i=1}^N [f(x_i) − E f̂(x_i)]^2 + (p/N) σ²_ε.   (1)
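The k-NN decomposition can be verified numerically under a fixed design (the x_i held fixed, only the noise redrawn), where the variance term is exactly σ²_ε/k and the bias uses the average of f over the k nearest design points. A sketch, with an illustrative f and design:

```python
# Check Err(x0) = sigma^2 + bias^2 + sigma^2/k for k-NN, fixed design.
import numpy as np

rng = np.random.default_rng(1)
sigma, k, n_rep = 0.5, 5, 100_000
x = np.linspace(-1.0, 1.0, 21)       # fixed design points
f_true = lambda t: t ** 2            # illustrative true regression function
x0 = 0.0

nearest = np.argsort(np.abs(x - x0))[:k]   # k design points nearest x0

# Redraw the noise many times; f_hat_k(x0) = mean response over the k neighbors
y = f_true(x) + sigma * rng.standard_normal((n_rep, x.size))
f_hat = y[:, nearest].mean(axis=1)

y_test = f_true(x0) + sigma * rng.standard_normal(n_rep)
err = np.mean((y_test - f_hat) ** 2)

bias2 = (f_true(x[nearest]).mean() - f_true(x0)) ** 2   # squared bias term
theory = sigma**2 + bias2 + sigma**2 / k
print(err, theory)
```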
Classification and 0-1 loss
• Bias and variance do not add as they do for squared error: variance tends to dominate, while bias is tolerable as long as you are on the correct side of the decision boundary. Hence biased methods often do well!
• Friedman (1996), "On Bias, Variance, 0–1 Loss...", shows
  Pr(ŷ ≠ y_B) = Φ( sign(1/2 − f) · (E f̂ − 1/2) / √(Var f̂) ),
  where y_B is the Bayes classifier (Exercise 7.63).
• Hence on the wrong side of the decision boundary, increasing the variance can help.
[Figure: four panels of error curves (vertical scale 0.0–0.4): k-NN regression and k-NN classification versus number of neighbors k (50 down to 0); linear-model regression and linear-model classification versus subset size p (1 to 20).]
Pred error (orange), squared bias (green) and variance (blue) for a simulated
example. Top row is regression with squared error loss; bottom row is
classification with 0–1 loss. Models are k-nearest neighbors (left) and best
subset regression of size p (right). Variance and bias curves are the same in
regression and classification, but the prediction error curve is different.
Optimism of the Training Error Rate
• Training error:
  err = (1/N) Σ_{i=1}^N L(y_i, f̂(x_i)).
• In-sample error rate:
  Err_in = (1/N) Σ_{i=1}^N E_{Y^0}[L(Y_i^0, f̂(x_i)) | T].
• Optimism:
  op ≡ Err_in − err.
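The optimism can be illustrated by Monte Carlo: fit least squares on a fixed design, redraw the responses many times, and compare the average of Err_in − err with (2/N) Σ_i Cov(ŷ_i, y_i), estimating each covariance across replications. The design, dimensions, and coefficients below are illustrative:

```python
# Monte Carlo illustration of E(op) = (2/N) * sum_i Cov(yhat_i, y_i).
import numpy as np

rng = np.random.default_rng(2)
N, d, sigma, n_rep = 30, 4, 1.0, 50_000
X = rng.standard_normal((N, d))              # fixed design
beta = np.array([1.0, -2.0, 0.5, 0.0])
f = X @ beta                                 # true mean vector
H = X @ np.linalg.inv(X.T @ X) @ X.T         # hat matrix: yhat = H y

Y = f + sigma * rng.standard_normal((n_rep, N))
Yhat = Y @ H.T

err = np.mean((Y - Yhat) ** 2, axis=1)       # training error, per replication
# In-sample error: fresh responses Y0 at the same inputs x_i
Y0 = f + sigma * rng.standard_normal((n_rep, N))
err_in = np.mean((Y0 - Yhat) ** 2, axis=1)

op = err_in - err                            # optimism, replication by replication
cov_sum = sum(np.cov(Yhat[:, i], Y[:, i])[0, 1] for i in range(N))
print(op.mean(), 2 / N * cov_sum)            # both approx (2/N)*d*sigma^2
```

For this least-squares fit both quantities should be near 2·d·σ²/N ≈ 0.267.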
• For squared error, 0–1, and other loss functions, one can show quite generally that
  E(op) = (2/N) Σ_{i=1}^N Cov(ŷ_i, y_i).
• For a linear fit with d inputs or basis functions,
  Σ_{i=1}^N Cov(ŷ_i, y_i) = d σ²_ε
  for the additive error model Y = f(X) + ε, and so
  E_y(Err_in) = E_y(err) + 2·(d/N)·σ²_ε.
• C_p and AIC statistics:
  C_p = err + 2·(d/N)·σ̂²_ε,
  −2·E[log Pr_θ̂(Y)] ≈ −(2/N)·E[loglik] + 2·(d/N).   (2)
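A sketch of C_p in use for subset-size selection: fit nested least-squares models of increasing size on synthetic data and minimize err + 2·(d/N)·σ̂²_ε, with σ̂²_ε taken from the largest model (a common convention). The data-generating setup is illustrative:

```python
# Choosing a subset size with C_p on synthetic nested models.
import numpy as np

rng = np.random.default_rng(3)
N, p_max = 200, 10
X = rng.standard_normal((N, p_max))
beta = np.zeros(p_max)
beta[:3] = [2.0, -1.0, 1.5]            # only the first 3 predictors carry signal
y = X @ beta + rng.standard_normal(N)

def rss(Xs, y):
    # residual sum of squares of least squares on the columns Xs
    coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    r = y - Xs @ coef
    return r @ r

sigma2_hat = rss(X, y) / (N - p_max)   # noise variance from the full model
cp = []
for d in range(1, p_max + 1):          # nested subsets: the first d columns
    err = rss(X[:, :d], y) / N
    cp.append(err + 2 * d / N * sigma2_hat)

best_d = int(np.argmin(cp)) + 1
print(best_d)                          # often 3 with this setup
```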
CP, AIC and linear operators
• f̂ = Sy (e.g. linear regression, ridge regression, cubic smoothing splines); y_i = f(x_i) + ε_i, E(ε_i) = 0.
• Err_in = σ²_ε + (1/N) Σ_i [f(x_i) − (Sf)_i]^2 + σ²_ε tr(SᵀS)/N.
•
  E(err) = (1/N) E||(I − S)y||^2
         = (1/N) fᵀ(I − S)ᵀ(I − S)f + (1/N) E[εᵀ(I − S)ᵀ(I − S)ε]
         = Bias^2 + (1/N) σ²_ε tr(I − 2S + SᵀS).
• Hence
  Err_in − E(err) = (2/N) σ²_ε tr(S),
  and
  Cov(ŷ, y) = σ²_ε S.
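Since the optimism of a linear smoother is governed by tr(S), tr(S) acts as an effective number of parameters. A small sketch for ridge regression, where S = X(XᵀX + λI)⁻¹Xᵀ and tr(S) runs from p (λ = 0, ordinary least squares) down to 0 as λ → ∞ (sizes are illustrative):

```python
# Effective degrees of freedom tr(S) for the ridge smoother matrix.
import numpy as np

rng = np.random.default_rng(4)
N, p = 50, 8
X = rng.standard_normal((N, p))

def ridge_df(lam):
    # S = X (X'X + lam I)^-1 X'; return its trace
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    return np.trace(S)

print(ridge_df(0.0))    # equals p for full-rank X (ordinary least squares)
print(ridge_df(1e8))    # shrinks toward 0 as lam grows
```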
[Figure: two panels — log-likelihood loss (left) and 0–1 loss (right) — with Train, Test, and AIC curves plotted against the number of basis functions (2 to 128).]
AIC used for model selection for the phoneme recognition example. The logistic regression coefficient function β(f) = Σ_{m=1}^M h_m(f) θ_m is modeled in M spline basis functions. In the left panel we see the AIC statistic used to estimate Err_in using log-likelihood loss. Included is an estimate of Err based on a test sample. It does well except for the over-parametrized case (M = 256 parameters for N = 1000 observations). In the right panel the same is done for 0–1 loss. Although the AIC formula does not strictly apply here, it does a reasonable job in this case.
BIC: Bayesian information criterion
•
  BIC = −2·loglik + (log N)·d.
• Under the Gaussian model with σ²_ε known, −2·loglik equals (up to a constant) Σ_i (y_i − f̂(x_i))^2/σ²_ε, which is N·err/σ²_ε for squared error loss. Hence we can write
  BIC = (N/σ²_ε)·[err + (log N)·(d/N)·σ²_ε].
• Given candidate models M_m, m = 1, . . . , M, and a prior distribution Pr(θ_m | M_m), the posterior probability of a given model is
  Pr(M_m | Z) ∝ Pr(M_m) · Pr(Z | M_m)
             ∝ Pr(M_m) · ∫ Pr(Z | θ_m, M_m) Pr(θ_m | M_m) dθ_m,
where Z represents the training data {x_i, y_i}_1^N.
• Posterior odds:
  Pr(M_m | Z) / Pr(M_ℓ | Z) = [Pr(M_m) / Pr(M_ℓ)] · [Pr(Z | M_m) / Pr(Z | M_ℓ)].
• The rightmost quantity
  BF(Z) = Pr(Z | M_m) / Pr(Z | M_ℓ)
  is called the Bayes factor.
• Typically we assume that the prior over models is uniform, so
that Pr(Mm) is constant.
• A Laplace approximation to the integral gives
  log Pr(Z | M_m) = log Pr(Z | θ̂_m, M_m) − (d_m/2)·log N + O(1),
  where θ̂_m is the maximum likelihood estimate and d_m is the number of free parameters.
• If we define our loss function to be
  −2 log Pr(Z | θ̂_m, M_m),
  this is equivalent to the BIC criterion.
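A sketch of BIC = −2·loglik + (log N)·d used for model comparison: two Gaussian regression models fit to synthetic data whose truth is linear; the smaller-BIC model is preferred, and with N large BIC tends to pick the sparser true model. The data, degrees, and parameter count convention are illustrative:

```python
# Comparing polynomial regression models with BIC under a Gaussian likelihood.
import numpy as np

rng = np.random.default_rng(5)
N = 500
x = rng.uniform(-2, 2, N)
y = 1.0 + 2.0 * x + rng.standard_normal(N)      # truth is linear

def bic_poly(deg):
    # least-squares polynomial fit, maximized Gaussian log-likelihood
    X = np.vander(x, deg + 1)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / N                   # MLE of the noise variance
    loglik = -N / 2 * (np.log(2 * np.pi * sigma2) + 1)
    d = deg + 2                                  # coefficients + variance parameter
    return -2 * loglik + np.log(N) * d

print(bic_poly(1), bic_poly(5))                  # degree 1 should win
```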
VC dimension
• A class of indicator functions f(x, α), indexed by a parameter α;
  e.g. f_1 = I(α_0 + α_1 x > 0), or f_2 = I(sin(αx) > 0).
• VC dimension of a class {f(x, α)} is defined to be the largest
number of points (in some configuration) that can be shattered
by members of {f(x, α)}.
A set of points is said to be shattered by a class of functions if,
no matter how we assign a binary label to each point, a
member of the class can perfectly separate them.
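Shattering can be checked by brute force for small point sets. A sketch for lines in the plane, i.e. the class I(a_0 + a_1 x_1 + a_2 x_2 > 0): three points in general position are shattered, but the four corners of a square are not (the XOR labeling fails). The grid search over coefficients and the point sets are illustrative choices:

```python
# Brute-force shattering check for linear indicator functions in the plane.
from itertools import product

def separable(points, labels):
    # crude grid search over candidate coefficients (a0, a1, a2)
    grid = [i / 4 for i in range(-12, 13)]
    for a0, a1, a2 in product(grid, repeat=3):
        if all((a0 + a1 * x + a2 * y > 0) == bool(l)
               for (x, y), l in zip(points, labels)):
            return True
    return False

def shattered(points):
    # every one of the 2^n binary labelings must be separable
    return all(separable(points, labs)
               for labs in product([0, 1], repeat=len(points)))

three = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
four = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(shattered(three), shattered(four))   # True False
```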
• VC dim(f_1) = 2; VC dim(f_2) = ∞.
• VC bound: with probability at least 1 − η over training sets,
  Err ≤ err + (ε/2)·(1 + √(1 + 4·err/ε)),
  where ε = a_1 · (h·[log(a_2 N/h) + 1] − log(η/4)) / N and h is the VC dimension.
The first three panels show that the class of lines in the plane can shatter
three points. The last panel shows that this class cannot shatter four
points, as no line will put the hollow points on one side and the solid
points on the other. Hence the VC dimension of the class of straight
lines in the plane is three. Note that a class of nonlinear curves could
shatter four points, and hence has VC dimension greater than three.
[Figure: the function sin(50·x) plotted for x ∈ [0, 1], oscillating between −1.0 and 1.0.]
The solid curve is the function sin(50x) for x ∈ [0, 1]. The blue (solid)
and green (hollow) points illustrate how the associated indicator function
I(sin(αx) > 0) can shatter (separate) an arbitrarily large number of
points by choosing an appropriately high frequency α.
[Figure: three boxplot panels, labeled AIC, BIC, and SRM, each showing "% Increase Over Best" (0–100 scale) for the four scenarios reg/KNN, reg/linear, class/KNN, class/linear.]
Boxplots show the distribution of the relative error
100 × [Err(α̂) − min_α Err(α)] / [max_α Err(α) − min_α Err(α)] over the four
scenarios, where α̂ indexes the chosen model. This is the error in using the
chosen model relative to the best model. There are 100 training sets each of
size 50 represented in each boxplot, with the errors computed on test sets of
size 500.
Cross-validation
Simple, and best overall; see chapter 3.
[Figure: misclassification error (0.0–0.6) versus subset size p (1 to 20), showing two curves.]
Prediction error (orange) and tenfold cross-validation curve (green)
estimated from a single training set, from the class/linear scenario
defined earlier.
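A minimal K-fold cross-validation sketch, here used to choose k for a 1-D k-NN regression fit; the data-generating process, fold count, and candidate k values are illustrative:

```python
# K-fold cross-validation for choosing k in k-NN regression (NumPy only).
import numpy as np

rng = np.random.default_rng(6)
N, K = 100, 10
x = rng.uniform(0, 1, N)
y = np.sin(4 * x) + 0.3 * rng.standard_normal(N)
folds = np.array_split(rng.permutation(N), K)     # the same folds for every k

def knn_predict(x_tr, y_tr, x_te, k):
    # average the responses of the k nearest training points (1-D features)
    idx = np.argsort(np.abs(x_tr[None, :] - x_te[:, None]), axis=1)[:, :k]
    return y_tr[idx].mean(axis=1)

def cv_error(k):
    errs = []
    for f in folds:
        mask = np.ones(N, dtype=bool)
        mask[f] = False                            # hold out fold f
        pred = knn_predict(x[mask], y[mask], x[f], k)
        errs.append(np.mean((y[f] - pred) ** 2))
    return float(np.mean(errs))

scores = {k: cv_error(k) for k in (1, 3, 5, 10, 30)}
best_k = min(scores, key=scores.get)
print(best_k, scores[best_k])
```

Fixing the folds once, outside the loop over k, keeps the comparison between candidate models fair.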
Cross-validation
Bias due to reduced training set size
[Figure: hypothetical learning curve, 1 − Err versus training-set size from 0 to 200.]
Hypothetical learning curve for a classifier on a given task; a plot of
1 − Err versus the size of the training set N . With a dataset of 200
observations, fivefold cross-validation would use training sets of size 160,
which would behave much like the full set. However, with a dataset of 50
observations fivefold cross-validation would use training sets of size 40,
and this would result in a considerable overestimate of prediction error.
Bootstrap methods
• We wish to assess the statistical accuracy of a quantity S(Z)
computed from our dataset Z.
• B training sets Z∗b, b = 1, . . . , B each of size N are drawn with
replacement from the original dataset.
• The quantity of interest S(Z) is computed from each bootstrap
training set, and the values S(Z∗1), . . . , S(Z∗B) are used to
assess the statistical accuracy of S(Z).
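The recipe above can be sketched in a few lines for S(Z) = the sample median, whose standard error has no simple closed form. For N(0,1) data the sampling s.d. of the median is about 1.2533/√N, so the bootstrap estimate should land near that (data and sizes are illustrative):

```python
# Bootstrap standard error of a statistic S(Z), here the sample median.
import numpy as np

rng = np.random.default_rng(7)
N, B = 200, 2000
z = rng.standard_normal(N)                    # the dataset Z

stats = np.empty(B)
for b in range(B):
    zb = rng.choice(z, size=N, replace=True)  # bootstrap sample Z*b
    stats[b] = np.median(zb)                  # S(Z*b)

se_boot = stats.std(ddof=1)                   # bootstrap standard error
print(se_boot, 1.2533 / np.sqrt(N))           # compare with the asymptotic value
```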
• The bootstrap is useful for estimating the standard error of a statistic
  s(Z): we use the standard error of the bootstrap values
  s(Z∗1), s(Z∗2), . . . , s(Z∗B).
• E.g. s(Z) could be the prediction from a cubic spline curve at some
  fixed predictor value x.
• There is often more than one way to draw bootstrap samples; e.g. for
  a smoother, we could draw samples from the data or draw samples from
  the residuals.
• The bootstrap is "non-parametric", i.e. it doesn't assume a parametric
  distribution for the data. If we carry the bootstrap out parametrically (i.e.
  draw from a normal distribution), then we get the usual textbook
  (Fisher information-based) formulas for standard errors as N → ∞.
• We can get confidence intervals for an underlying population parameter
  from the percentiles of the bootstrap values
  s(Z∗1), s(Z∗2), . . . , s(Z∗B). There are other, more sophisticated ways
  to form confidence intervals via the bootstrap.
Bootstrap estimation of prediction error
•
  Err_boot = (1/B)(1/N) Σ_{b=1}^B Σ_{i=1}^N L(y_i, f̂^{*b}(x_i)).
•
  Pr{observation i ∈ bootstrap sample b} = 1 − (1 − 1/N)^N ≈ 1 − e^{−1} = 0.632.
• Can be a poor estimate. Consider 1-NN with two equal classes and class
  labels independent of the features: then
  Err_boot = 0.5 · (1 − .632) = .184, while the true value is 0.5!
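The 0.632 inclusion probability is easy to verify by simulation: draw bootstrap samples and count how often a fixed observation appears (sizes are illustrative):

```python
# Check Pr{observation i in bootstrap sample b} = 1 - (1 - 1/N)^N ~ 0.632.
import numpy as np

rng = np.random.default_rng(8)
N, B = 100, 20_000
hits = 0
for b in range(B):
    idx = rng.integers(0, N, size=N)   # bootstrap sample: N draws with replacement
    hits += 0 in idx                    # did observation 0 make it in?

p_hat = hits / B
p_exact = 1 - (1 - 1 / N) ** N
print(p_hat, p_exact, 1 - np.exp(-1))
```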
• Leave-one-out bootstrap:
  Err^(1) = (1/N) Σ_{i=1}^N (1/|C^{−i}|) Σ_{b ∈ C^{−i}} L(y_i, f̂^{*b}(x_i)),
  where C^{−i} is the set of bootstrap samples b that do not contain
  observation i.
• .632 bootstrap estimator:
  Err^(.632) = .368 · err + .632 · Err^(1).
• .632+ bootstrap estimator:
  Err^(.632+) = (1 − ŵ) · err + ŵ · Err^(1),
  where ŵ = .632/(1 − .368·R̂) and R̂ is the "overfitting rate".
[Figure: two boxplot panels, labeled Cross-validation and Bootstrap, each showing "% Increase Over Best" (0–100 scale) for the four scenarios reg/KNN, reg/linear, class/KNN, class/linear.]
Boxplots show the distribution of the relative error
100 · [Err(α̂) − min_α Err(α)] / [max_α Err(α) − min_α Err(α)] over the four
scenarios. This is the error in using the chosen model relative to the best
model. There are 20 training sets represented in each boxplot.