ESL Chapter 7: Model Selection (Rob Tibshirani)
Outline
• Bias-variance tradeoff
• Optimism of training error
• Estimates of in-sample prediction error
• BIC
• VC dimension
• Cross-validation (chapter 3), bootstrap
Model selection
• Loss functions
L(Y, Ŷ(X)) = (Y − Ŷ(X))^2  (squared error),
L(G, Ĝ(X)) = I(G ≠ Ĝ(X))  (0–1 loss),
L(G, p̂(X)) = −2 Σ_{k=1}^K I(G = k) log p̂_k(X) = −2 log p̂_G(X)  (log-likelihood).
• Training error:
  err = (1/N) Σ_{i=1}^N L(y_i, f̂(x_i)).
• Generalization error:
  Err = E[L(Y, f̂(X))].
[Figure: prediction error versus model complexity, with training-sample and test-sample curves; the low-complexity end is high bias/low variance, the high-complexity end is low bias/high variance. The training curve decreases steadily while the test curve is U-shaped.]
Behavior of test-sample and training-sample error as the model complexity is varied.
Bias-variance decomposition
Err(x_0) = E[(Y − f̂(x_0))^2 | X = x_0]
         = σ²_ε + [E f̂(x_0) − f(x_0)]^2 + E[f̂(x_0) − E f̂(x_0)]^2
         = σ²_ε + Bias^2(f̂(x_0)) + Var(f̂(x_0))
         = Irreducible Error + Bias^2 + Variance.
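The decomposition can be checked by simulation. A minimal sketch in NumPy: at a single point x_0 we repeatedly redraw the training responses, form a deliberately biased shrinkage estimator, and compare the Monte Carlo estimate of Err(x_0) with σ²_ε + Bias² + Variance (the estimator, sample sizes, and shrinkage factor are illustrative choices, not from the slides):

```python
# Numerical check of Err(x0) = sigma^2 + Bias^2 + Var at a single point.
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0            # noise s.d. (sigma_eps)
f_x0 = 2.0             # true regression function value f(x0)
n_train, n_rep = 25, 100_000
shrink = 0.8           # shrinkage -> introduces bias, reduces variance

# Each row is one training set of responses at x0: y = f(x0) + eps
y = f_x0 + sigma * rng.standard_normal((n_rep, n_train))
f_hat = shrink * y.mean(axis=1)             # estimator, one value per training set

# Independent test response Y at x0 for each replication
y_test = f_x0 + sigma * rng.standard_normal(n_rep)

err = np.mean((y_test - f_hat) ** 2)        # Err(x0)
bias2 = (f_hat.mean() - f_x0) ** 2          # Bias^2
var = f_hat.var()                           # Var(f_hat)
print(err, sigma**2 + bias2 + var)          # the two should agree closely
```

Here Bias = 0.8·2 − 2 = −0.4 and Var = 0.64·σ²/25, so the two printed numbers should both be near 1.19.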
For k-nearest neighbors:
Err(x_0) = E[(Y − f̂_k(x_0))^2 | X = x_0]
         = σ²_ε + [f(x_0) − (1/k) Σ_{ℓ=1}^k f(x_(ℓ))]^2 + σ²_ε/k.
For linear regression with p inputs:
Err(x_0) = E[(Y − f̂_p(x_0))^2 | X = x_0]
         = σ²_ε + [f(x_0) − E f̂_p(x_0)]^2 + ||h(x_0)||^2 σ²_ε,
and averaging over the training inputs,
(1/N) Σ_{i=1}^N Err(x_i) = σ²_ε + (1/N) Σ_{i=1}^N [f(x_i) − E f̂(x_i)]^2 + (p/N) σ²_ε.   (1)
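The k-NN decomposition can be verified numerically under a fixed design (the x_i held fixed, only the noise redrawn), where the variance term is exactly σ²_ε/k and the bias uses the average of f over the k nearest design points. A sketch, with an illustrative f and design:

```python
# Check Err(x0) = sigma^2 + bias^2 + sigma^2/k for k-NN, fixed design.
import numpy as np

rng = np.random.default_rng(1)
sigma, k, n_rep = 0.5, 5, 100_000
x = np.linspace(-1.0, 1.0, 21)       # fixed design points
f_true = lambda t: t ** 2            # illustrative true regression function
x0 = 0.0

nearest = np.argsort(np.abs(x - x0))[:k]   # k design points nearest x0

# Redraw the noise many times; f_hat_k(x0) = mean response over the k neighbors
y = f_true(x) + sigma * rng.standard_normal((n_rep, x.size))
f_hat = y[:, nearest].mean(axis=1)

y_test = f_true(x0) + sigma * rng.standard_normal(n_rep)
err = np.mean((y_test - f_hat) ** 2)

bias2 = (f_true(x[nearest]).mean() - f_true(x0)) ** 2   # squared bias term
theory = sigma**2 + bias2 + sigma**2 / k
print(err, theory)
```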
Classification and 0-1 loss
• Bias and variance do not add as they do for squared error: variance tends to dominate, while bias is tolerable as long as you are on the correct side of the decision boundary. Hence biased methods often do well!
• Friedman (1996), "On Bias, Variance, 0–1 Loss...", shows
  Pr(ŷ ≠ y_B) = Φ( sign(1/2 − f) · (E f̂ − 1/2) / √(Var f̂) ),
  where y_B is the Bayes classifier (Exercise 7.63).
• Hence on the wrong side of the decision boundary, increasing the variance can help.
[Figure: four panels of error curves (vertical scale 0.0–0.4): k-NN regression and k-NN classification versus number of neighbors k (50 down to 0); linear-model regression and linear-model classification versus subset size p (1 to 20).]
Pred error (orange), squared bias (green) and variance (blue) for a simulated
example. Top row is regression with squared error loss; bottom row is
classification with 0–1 loss. Models are k-nearest neighbors (left) and best
subset regression of size p (right). Variance and bias curves are the same in
regression and classification, but the prediction error curve is different.
Optimism of the Training Error Rate
• Training error:
  err = (1/N) Σ_{i=1}^N L(y_i, f̂(x_i)).
• In-sample error rate:
  Err_in = (1/N) Σ_{i=1}^N E_{Y^0}[L(Y_i^0, f̂(x_i)) | T].
• Optimism:
  op ≡ Err_in − err.
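The optimism can be illustrated by Monte Carlo: fit least squares on a fixed design, redraw the responses many times, and compare the average of Err_in − err with (2/N) Σ_i Cov(ŷ_i, y_i), estimating each covariance across replications. The design, dimensions, and coefficients below are illustrative:

```python
# Monte Carlo illustration of E(op) = (2/N) * sum_i Cov(yhat_i, y_i).
import numpy as np

rng = np.random.default_rng(2)
N, d, sigma, n_rep = 30, 4, 1.0, 50_000
X = rng.standard_normal((N, d))              # fixed design
beta = np.array([1.0, -2.0, 0.5, 0.0])
f = X @ beta                                 # true mean vector
H = X @ np.linalg.inv(X.T @ X) @ X.T         # hat matrix: yhat = H y

Y = f + sigma * rng.standard_normal((n_rep, N))
Yhat = Y @ H.T

err = np.mean((Y - Yhat) ** 2, axis=1)       # training error, per replication
# In-sample error: fresh responses Y0 at the same inputs x_i
Y0 = f + sigma * rng.standard_normal((n_rep, N))
err_in = np.mean((Y0 - Yhat) ** 2, axis=1)

op = err_in - err                            # optimism, replication by replication
cov_sum = sum(np.cov(Yhat[:, i], Y[:, i])[0, 1] for i in range(N))
print(op.mean(), 2 / N * cov_sum)            # both approx (2/N)*d*sigma^2
```

For this least-squares fit both quantities should be near 2·d·σ²/N ≈ 0.267.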
• For squared error, 0–1, and other loss functions, one can show quite generally that
  E(op) = (2/N) Σ_{i=1}^N Cov(ŷ_i, y_i).
• For a linear fit with d inputs or basis functions,
  Σ_{i=1}^N Cov(ŷ_i, y_i) = d σ²_ε
  for the additive error model Y = f(X) + ε, and so
  E_y(Err_in) = E_y(err) + 2·(d/N)·σ²_ε.
• C_p and AIC statistics:
  C_p = err + 2·(d/N)·σ̂²_ε,
  −2·E[log Pr_θ̂(Y)] ≈ −(2/N)·E[loglik] + 2·(d/N).   (2)
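A sketch of C_p in use for subset-size selection: fit nested least-squares models of increasing size on synthetic data and minimize err + 2·(d/N)·σ̂²_ε, with σ̂²_ε taken from the largest model (a common convention). The data-generating setup is illustrative:

```python
# Choosing a subset size with C_p on synthetic nested models.
import numpy as np

rng = np.random.default_rng(3)
N, p_max = 200, 10
X = rng.standard_normal((N, p_max))
beta = np.zeros(p_max)
beta[:3] = [2.0, -1.0, 1.5]            # only the first 3 predictors carry signal
y = X @ beta + rng.standard_normal(N)

def rss(Xs, y):
    # residual sum of squares of least squares on the columns Xs
    coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    r = y - Xs @ coef
    return r @ r

sigma2_hat = rss(X, y) / (N - p_max)   # noise variance from the full model
cp = []
for d in range(1, p_max + 1):          # nested subsets: the first d columns
    err = rss(X[:, :d], y) / N
    cp.append(err + 2 * d / N * sigma2_hat)

best_d = int(np.argmin(cp)) + 1
print(best_d)                          # often 3 with this setup
```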
CP, AIC and linear operators
• f̂ = Sy (e.g. linear regression, ridge regression, cubic smoothing splines); y_i = f(x_i) + ε_i, E(ε_i) = 0.
• Err_in = σ²_ε + (1/N) Σ_i [f(x_i) − (Sf)_i]^2 + σ²_ε tr(SᵀS)/N.
•
  E(err) = (1/N) E||(I − S)y||^2
         = (1/N) fᵀ(I − S)ᵀ(I − S)f + (1/N) E[εᵀ(I − S)ᵀ(I − S)ε]
         = Bias^2 + (1/N) σ²_ε tr(I − 2S + SᵀS).
• Hence
  Err_in − E(err) = (2/N) σ²_ε tr(S),
  and
  Cov(ŷ, y) = σ²_ε S.
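Since the optimism of a linear smoother is governed by tr(S), tr(S) acts as an effective number of parameters. A small sketch for ridge regression, where S = X(XᵀX + λI)⁻¹Xᵀ and tr(S) runs from p (λ = 0, ordinary least squares) down to 0 as λ → ∞ (sizes are illustrative):

```python
# Effective degrees of freedom tr(S) for the ridge smoother matrix.
import numpy as np

rng = np.random.default_rng(4)
N, p = 50, 8
X = rng.standard_normal((N, p))

def ridge_df(lam):
    # S = X (X'X + lam I)^-1 X'; return its trace
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    return np.trace(S)

print(ridge_df(0.0))    # equals p for full-rank X (ordinary least squares)
print(ridge_df(1e8))    # shrinks toward 0 as lam grows
```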
[Figure: two panels — log-likelihood loss (left) and 0–1 loss (right) — with Train, Test, and AIC curves plotted against the number of basis functions (2 to 128).]
AIC used for model selection for the phoneme recognition example. The logistic regression coefficient function β(f) = Σ_{m=1}^M h_m(f) θ_m is modeled in M spline basis functions. In the left panel we see the AIC statistic used to estimate Err_in using log-likelihood loss. Included is an estimate of Err based on a test sample. It does well except for the over-parametrized case (M = 256 parameters for N = 1000 observations). In the right panel the same is done for 0–1 loss. Although the AIC formula does not strictly apply here, it does a reasonable job in this case.
BIC: Bayesian information criterion
•
  BIC = −2·loglik + (log N)·d.
• Under the Gaussian model with σ²_ε known, −2·loglik equals (up to a constant) Σ_i (y_i − f̂(x_i))^2/σ²_ε, which is N·err/σ²_ε for squared error loss. Hence we can write
  BIC = (N/σ²_ε)·[err + (log N)·(d/N)·σ²_ε].
• Given candidate models M_m, m = 1, . . . , M, and a prior distribution Pr(θ_m | M_m), the posterior probability of a given model is
  Pr(M_m | Z) ∝ Pr(M_m) · Pr(Z | M_m)
             ∝ Pr(M_m) · ∫ Pr(Z | θ_m, M_m) Pr(θ_m | M_m) dθ_m,
where Z represents the training data {x_i, y_i}_1^N.
• Posterior odds:
  Pr(M_m | Z) / Pr(M_ℓ | Z) = [Pr(M_m) / Pr(M_ℓ)] · [Pr(Z | M_m) / Pr(Z | M_ℓ)].
• The rightmost quantity
  BF(Z) = Pr(Z | M_m) / Pr(Z | M_ℓ)
  is called the Bayes factor.
• Typically we assume that the prior over models is uniform, so
that Pr(Mm) is constant.
• A Laplace approximation to the integral gives
  log Pr(Z | M_m) = log Pr(Z | θ̂_m, M_m) − (d_m/2)·log N + O(1),
  where θ̂_m is the maximum likelihood estimate and d_m is the number of free parameters.
• If we define our loss function to be
  −2 log Pr(Z | θ̂_m, M_m),
  this is equivalent to the BIC criterion.
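A sketch of BIC = −2·loglik + (log N)·d used for model comparison: two Gaussian regression models fit to synthetic data whose truth is linear; the smaller-BIC model is preferred, and with N large BIC tends to pick the sparser true model. The data, degrees, and parameter count convention are illustrative:

```python
# Comparing polynomial regression models with BIC under a Gaussian likelihood.
import numpy as np

rng = np.random.default_rng(5)
N = 500
x = rng.uniform(-2, 2, N)
y = 1.0 + 2.0 * x + rng.standard_normal(N)      # truth is linear

def bic_poly(deg):
    # least-squares polynomial fit, maximized Gaussian log-likelihood
    X = np.vander(x, deg + 1)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / N                   # MLE of the noise variance
    loglik = -N / 2 * (np.log(2 * np.pi * sigma2) + 1)
    d = deg + 2                                  # coefficients + variance parameter
    return -2 * loglik + np.log(N) * d

print(bic_poly(1), bic_poly(5))                  # degree 1 should win
```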
VC dimension
• A class of indicator functions f(x, α), indexed by a parameter α;
  e.g. f_1 = I(α_0 + α_1 x > 0), or f_2 = I(sin(αx) > 0).
• VC dimension of a class {f(x, α)} is defined to be the largest
number of points (in some configuration) that can be shattered
by members of {f(x, α)}.
A set of points is said to be shattered by a class of functions if,
no matter how we assign a binary label to each point, a
member of the class can perfectly separate them.
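Shattering can be checked by brute force for small point sets. A sketch for lines in the plane, i.e. the class I(a_0 + a_1 x_1 + a_2 x_2 > 0): three points in general position are shattered, but the four corners of a square are not (the XOR labeling fails). The grid search over coefficients and the point sets are illustrative choices:

```python
# Brute-force shattering check for linear indicator functions in the plane.
from itertools import product

def separable(points, labels):
    # crude grid search over candidate coefficients (a0, a1, a2)
    grid = [i / 4 for i in range(-12, 13)]
    for a0, a1, a2 in product(grid, repeat=3):
        if all((a0 + a1 * x + a2 * y > 0) == bool(l)
               for (x, y), l in zip(points, labels)):
            return True
    return False

def shattered(points):
    # every one of the 2^n binary labelings must be separable
    return all(separable(points, labs)
               for labs in product([0, 1], repeat=len(points)))

three = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
four = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(shattered(three), shattered(four))   # True False
```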
• VC dim(f_1) = 2; VC dim(f_2) = ∞.
• VC bound: with probability at least 1 − η over training sets,
  Err ≤ err + (ε/2)·(1 + √(1 + 4·err/ε)),
  where ε = a_1 · (h·[log(a_2 N/h) + 1] − log(η/4)) / N and h is the VC dimension.
The first three panels show that the class of lines in the plane can shatter
three points. The last panel shows that this class cannot shatter four
points, as no line will put the hollow points on one side and the solid
points on the other. Hence the VC dimension of the class of straight
lines in the plane is three. Note that a class of nonlinear curves could
shatter four points, and hence has VC dimension greater than three.
[Figure: the function sin(50·x) plotted for x ∈ [0, 1], oscillating between −1.0 and 1.0.]
The solid curve is the function sin(50x) for x ∈ [0, 1]. The blue (solid)
and green (hollow) points illustrate how the associated indicator function
I(sin(αx) > 0) can shatter (separate) an arbitrarily large number of
points by choosing an appropriately high frequency α.
[Figure: three boxplot panels, labeled AIC, BIC, and SRM, each showing "% Increase Over Best" (0–100 scale) for the four scenarios reg/KNN, reg/linear, class/KNN, class/linear.]
Boxplots show the distribution of the relative error
100 × [Err(α̂) − min_α Err(α)] / [max_α Err(α) − min_α Err(α)] over the four
scenarios, where α̂ indexes the chosen model. This is the error in using the
chosen model relative to the best model. There are 100 training sets each of
size 50 represented in each boxplot, with the errors computed on test sets of
size 500.
Cross-validation
Simple, and best overall; see chapter 3.
[Figure: misclassification error (0.0–0.6) versus subset size p (1 to 20), showing two curves.]
Prediction error (orange) and tenfold cross-validation curve (green)
estimated from a single training set, from the class/linear scenario
defined earlier.
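A minimal K-fold cross-validation sketch, here used to choose k for a 1-D k-NN regression fit; the data-generating process, fold count, and candidate k values are illustrative:

```python
# K-fold cross-validation for choosing k in k-NN regression (NumPy only).
import numpy as np

rng = np.random.default_rng(6)
N, K = 100, 10
x = rng.uniform(0, 1, N)
y = np.sin(4 * x) + 0.3 * rng.standard_normal(N)
folds = np.array_split(rng.permutation(N), K)     # the same folds for every k

def knn_predict(x_tr, y_tr, x_te, k):
    # average the responses of the k nearest training points (1-D features)
    idx = np.argsort(np.abs(x_tr[None, :] - x_te[:, None]), axis=1)[:, :k]
    return y_tr[idx].mean(axis=1)

def cv_error(k):
    errs = []
    for f in folds:
        mask = np.ones(N, dtype=bool)
        mask[f] = False                            # hold out fold f
        pred = knn_predict(x[mask], y[mask], x[f], k)
        errs.append(np.mean((y[f] - pred) ** 2))
    return float(np.mean(errs))

scores = {k: cv_error(k) for k in (1, 3, 5, 10, 30)}
best_k = min(scores, key=scores.get)
print(best_k, scores[best_k])
```

Fixing the folds once, outside the loop over k, keeps the comparison between candidate models fair.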
Cross-validation
Bias due to reduced training set size
[Figure: hypothetical learning curve, 1 − Err versus training-set size from 0 to 200.]
Hypothetical learning curve for a classifier on a given task; a plot of
1 − Err versus the size of the training set N . With a dataset of 200
observations, fivefold cross-validation would use training sets of size 160,
which would behave much like the full set. However, with a dataset of 50
observations fivefold cross-validation would use training sets of size 40,
and this would result in a considerable overestimate of prediction error.
Bootstrap methods
• We wish to assess the statistical accuracy of a quantity S(Z)
computed from our dataset Z.
• B training sets Z∗b, b = 1, . . . , B each of size N are drawn with
replacement from the original dataset.
• The quantity of interest S(Z) is computed from each bootstrap
training set, and the values S(Z∗1), . . . , S(Z∗B) are used to
assess the statistical accuracy of S(Z).
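The recipe above can be sketched in a few lines for S(Z) = the sample median, whose standard error has no simple closed form. For N(0,1) data the sampling s.d. of the median is about 1.2533/√N, so the bootstrap estimate should land near that (data and sizes are illustrative):

```python
# Bootstrap standard error of a statistic S(Z), here the sample median.
import numpy as np

rng = np.random.default_rng(7)
N, B = 200, 2000
z = rng.standard_normal(N)                    # the dataset Z

stats = np.empty(B)
for b in range(B):
    zb = rng.choice(z, size=N, replace=True)  # bootstrap sample Z*b
    stats[b] = np.median(zb)                  # S(Z*b)

se_boot = stats.std(ddof=1)                   # bootstrap standard error
print(se_boot, 1.2533 / np.sqrt(N))           # compare with the asymptotic value
```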
• The bootstrap is useful for estimating the standard error of a statistic
  s(Z): we use the standard error of the bootstrap values
  s(Z∗1), s(Z∗2), . . . , s(Z∗B).
• E.g. s(Z) could be the prediction from a cubic spline curve at some
  fixed predictor value x.
• There is often more than one way to draw bootstrap samples; e.g. for
  a smoother, we could draw samples from the data or draw samples from
  the residuals.
• The bootstrap is "non-parametric", i.e. it doesn't assume a parametric
  distribution for the data. If we carry the bootstrap out parametrically (i.e.
  draw from a normal distribution), then we get the usual textbook
  (Fisher information-based) formulas for standard errors as N → ∞.
• We can get confidence intervals for an underlying population parameter
  from the percentiles of the bootstrap values
  s(Z∗1), s(Z∗2), . . . , s(Z∗B). There are other, more sophisticated ways
  to form confidence intervals via the bootstrap.
Bootstrap estimation of prediction error
•
  Err_boot = (1/B)(1/N) Σ_{b=1}^B Σ_{i=1}^N L(y_i, f̂^{*b}(x_i)).
•
  Pr{observation i ∈ bootstrap sample b} = 1 − (1 − 1/N)^N ≈ 1 − e^{−1} = 0.632.
• Can be a poor estimate. Consider 1-NN with two equal classes and class
  labels independent of the features: then
  Err_boot = 0.5 · (1 − .632) = .184, while the true value is 0.5!
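The 0.632 inclusion probability is easy to verify by simulation: draw bootstrap samples and count how often a fixed observation appears (sizes are illustrative):

```python
# Check Pr{observation i in bootstrap sample b} = 1 - (1 - 1/N)^N ~ 0.632.
import numpy as np

rng = np.random.default_rng(8)
N, B = 100, 20_000
hits = 0
for b in range(B):
    idx = rng.integers(0, N, size=N)   # bootstrap sample: N draws with replacement
    hits += 0 in idx                    # did observation 0 make it in?

p_hat = hits / B
p_exact = 1 - (1 - 1 / N) ** N
print(p_hat, p_exact, 1 - np.exp(-1))
```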
• Leave-one-out bootstrap:
  Err^(1) = (1/N) Σ_{i=1}^N (1/|C^{−i}|) Σ_{b ∈ C^{−i}} L(y_i, f̂^{*b}(x_i)),
  where C^{−i} is the set of bootstrap samples b that do not contain
  observation i.
• .632 bootstrap estimator:
  Err^(.632) = .368 · err + .632 · Err^(1).
• .632+ bootstrap estimator:
  Err^(.632+) = (1 − ŵ) · err + ŵ · Err^(1),
  where ŵ = .632/(1 − .368·R̂) and R̂ is the "overfitting rate".
[Figure: two boxplot panels, labeled Cross-validation and Bootstrap, each showing "% Increase Over Best" (0–100 scale) for the four scenarios reg/KNN, reg/linear, class/KNN, class/linear.]
Boxplots show the distribution of the relative error
100 · [Err(α̂) − min_α Err(α)] / [max_α Err(α) − min_α Err(α)] over the four
scenarios. This is the error in using the chosen model relative to the best
model. There are 20 training sets represented in each boxplot.