
The Estimation of Prediction Error: Covariance Penalties and Cross-Validation

Bradley EFRON

Having constructed a data-based estimation rule, perhaps a logistic regression or a classification tree, the statistician would like to know its performance as a predictor of future cases. There are two main theories concerning prediction error: (1) penalty methods such as $C_p$, Akaike's information criterion, and Stein's unbiased risk estimate that depend on the covariance between data points and their corresponding predictions; and (2) cross-validation and related nonparametric bootstrap techniques. This article concerns the connection between the two theories. A Rao–Blackwell type of relation is derived in which nonparametric methods such as cross-validation are seen to be randomized versions of their covariance penalty counterparts. The model-based penalty methods offer substantially better accuracy, assuming that the model is believable.

KEY WORDS: $C_p$; Degrees of freedom; Nonparametric estimates; Parametric bootstrap; Rao–Blackwellization; SURE.

1. INTRODUCTION

Prediction problems arise in the following way: A model $m(\cdot)$, for example an ordinary linear regression, is fit to some data $\mathbf{y}$, producing an estimate $\hat{\boldsymbol\mu} = m(\mathbf{y})$; we wonder how well $\hat{\boldsymbol\mu}$ will predict a future dataset independently generated from the same mechanism that produced $\mathbf{y}$. Two quite separate statistical theories are used to answer this question, cross-validation and what we will call covariance penalties, the latter including Mallows' $C_p$, Akaike's information criterion (AIC), and Stein's unbiased risk estimate (SURE). This article concerns the relationship between the two theories.

Figure 1 illustrates a simple prediction problem. Data $(x_i, y_i)$ have been observed for 157 healthy volunteers, with $x_i$ age and $y_i$ a measure of total kidney function. The original goal was to study the decline in function over time, an important factor in kidney transplantation. The response variable $y$ is a composite of several standard kidney function indices. A robust locally linear smoother "lowess(x, y, f = 1/3)" ($f$ controlling the local window width) produces $\hat{\boldsymbol\mu}$, the indicated regression curve, with sum of squared residuals
\[
\text{err} \equiv \sum_{i=1}^{157} (y_i - \hat\mu_i)^2 = 495.1. \tag{1.1}
\]
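As a concrete illustration, the following minimal sketch computes the apparent error of (1.1), assuming arrays `age` and `tot` holding the kidney data; it uses statsmodels' `lowess` as a stand-in for the robust lowess smoother in the text, which differs in some implementation details.

```python
# Sketch only: apparent error err = sum (y_i - mu_hat_i)^2 from a lowess fit.
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def apparent_error(x, y, frac=1/3):
    """Fit a lowess curve and return (fitted values, apparent error err)."""
    fit = lowess(y, x, frac=frac, return_sorted=False)  # mu_hat_i at each x_i
    err = np.sum((y - fit) ** 2)                         # err as in (1.1)
    return fit, err

# usage (age, tot assumed available):  mu_hat, err = apparent_error(age, tot)
```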

However err, the apparent error, is an optimistic assessment of how well the curve in Figure 1 would predict future $y$ values, because lowess has fit the curve to this particular dataset. How well can we expect $\hat{\boldsymbol\mu}$ to perform on future data?

In this case the two theories give almost identical estimates of "Err," the true predictive error of $\hat{\boldsymbol\mu}$: $\widehat{\text{Err}} = 538.8$ for cross-validation and 538.3 for the covariance penalty method, 9% larger than (1.1). Sections 2–4 describe these calculations.

Cross-validation and the related bootstrap techniques of Efron (1983) are completely nonparametric. Covariance penalties, on the other hand, are model based, in this case relying on an estimated version of the standard additive homoscedastic model $y_i = \mu_i + \varepsilon_i$. Nonparametric methods are often preferable, but we will show that cross-validation pays a substantial price in terms of decreased estimating efficiency.

Bradley Efron is Professor, Department of Statistics, Stanford University, Stanford, CA 94305 (E-mail: brad@stat.stanford.edu). The author is grateful to Dr. Bryan D. Myers for bringing the kidney function estimation problem and data to the author's attention, and for several helpful discussions.

The model used to estimate a covariance penalty can also be employed to improve cross-validation, by averaging the cross-validation estimate of Err over a collection of the model's possible datasets. This is the subject of Section 4, where it is shown that the averaged cross-validation estimate nearly equals the covariance penalty estimate of Err. Roughly speaking, covariance penalties are a Rao–Blackwellized version of cross-validation (and also of the nonparametric bootstrap; Sec. 6) and as such enjoy increased efficiency for estimating prediction error.

Covariance penalties originated in the work of Mallows (1973), Akaike (1973), and Stein (1981). The formula was extended to generalized linear models in Efron (1986). Sections 2 and 3 broaden the penalty formula to include all models, and also develop it in a conditional setting that facilitates comparisons with cross-validation and the nonparametric bootstrap. Versions of the covariance penalty appear in Breiman (1992), Ye (1998), and Tibshirani and Knight (1999), with Ye's article being particularly relevant here.

2. $C_p$ AND SURE

Covariance penalty methods first arose in the context where prediction error, say $Q(y_i, \hat\mu_i)$, is measured by squared error,
\[
Q(y_i, \hat\mu_i) = (y_i - \hat\mu_i)^2. \tag{2.1}
\]
Mallows (1973) considered prediction error for the homoscedastic model
\[
\mathbf{y} \sim (\boldsymbol\mu, \sigma^2\mathbf{I}), \tag{2.2}
\]
the notation indicating that the components of $\mathbf{y}$ are uncorrelated, $y_i$ having mean $\mu_i$ and variance $\sigma^2$.

Suppose that we are using a linear estimation rule,
\[
\hat{\boldsymbol\mu} = M\mathbf{y}, \tag{2.3}
\]
where $M$ is an $n \times n$ matrix not depending on $\mathbf{y}$. Define
\[
\text{err}_i = (y_i - \hat\mu_i)^2 \quad\text{and}\quad \text{Err}_i = E_0\bigl(y_i^0 - \hat\mu_i\bigr)^2, \tag{2.4}
\]
the expectation "$E_0$" being over $y_i^0 \sim (\mu_i, \sigma^2)$ independent of $\mathbf{y}$, with $\hat\mu_i$ held fixed. Mallows showed that
\[
\widehat{\text{Err}}_i \equiv \text{err}_i + 2\sigma^2 M_{ii} \tag{2.5}
\]


Figure 1. Kidney Data: An Omnibus Measure of Kidney Function Plotted versus Age for n = 157 Healthy Volunteers. Fitted curve is lowess(x, y, f = 1/3); sum of squared residuals 495.1. How well can we expect this curve to predict future (x, y) pairs?

is an unbiased estimator for the expectation of $\text{Err}_i$, leading to the $C_p$ formula for estimating $\text{Err} = \sum_{i=1}^n \text{Err}_i$,
\[
\widehat{\text{Err}} = \text{err} + 2\sigma^2\,\text{trace}(M), \qquad \text{err} = \sum_{i=1}^n \text{err}_i. \tag{2.6}
\]
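A minimal sketch of the $C_p$ computation (2.6) for a linear rule $\hat{\boldsymbol\mu} = M\mathbf{y}$ follows, assuming the smoother matrix $M$, the data vector $y$, and a variance estimate `sigma2` are already available; the function name is illustrative, not from the paper.

```python
# Sketch only: C_p estimate Err_hat = err + 2 * sigma^2 * trace(M), (2.6).
import numpy as np

def cp_estimate(M, y, sigma2):
    """Return (err, Err_hat) for the linear rule mu_hat = M @ y."""
    mu_hat = M @ y
    err = np.sum((y - mu_hat) ** 2)      # apparent error
    df = np.trace(M)                      # degrees of freedom of the linear rule
    return err, err + 2.0 * sigma2 * df   # covariance-penalized error estimate
```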

In practice we usually need to replace $\sigma^2$ with an estimate $\hat\sigma^2$, as in the examples that follow; see section 7 of Efron (1986).

Dropping the linearity assumption, let $\hat{\boldsymbol\mu} = m(\mathbf{y})$ be any rule at all for estimating $\boldsymbol\mu$ from $\mathbf{y}$. Taking expectations in the identity
\[
(y_i - \mu_i)^2 + (\hat\mu_i - \mu_i)^2 = (y_i - \hat\mu_i)^2 + 2(\hat\mu_i - \mu_i)(y_i - \mu_i), \tag{2.7}
\]
and using $E(y_i - \mu_i)^2 = E_0(y_i^0 - \mu_i)^2$, gives a convenient expression for the expectation of $\text{Err}_i$ (2.4),
\[
E\{\text{Err}_i\} = E\{\text{err}_i\} + 2\,\text{cov}(\hat\mu_i, y_i). \tag{2.8}
\]
Because $\text{cov}(\hat\mu_i, y_i)$ equals $\sigma^2 M_{ii}$ for a linear rule, (2.8) is seen to be a generalization of (2.5). In words, we must add a covariance penalty to the apparent error $\text{err}_i$ in order to unbiasedly estimate $\text{Err}_i$.

Formula (2.8) is not directly applicable, since $\text{cov}(\hat\mu_i, y_i)$ is not an observable statistic. Stein (1981) overcame this impediment in the Gaussian case
\[
\mathbf{y} \sim N(\boldsymbol\mu, \sigma^2\mathbf{I}), \tag{2.9}
\]
by showing that
\[
\text{cov}_i \equiv \text{cov}(\hat\mu_i, y_i) = \sigma^2 E\{\partial\hat\mu_i/\partial y_i\} \tag{2.10}
\]
[assuming (2.9) and a differentiability condition on the mapping $\hat{\boldsymbol\mu} = m(\mathbf{y})$]. Because $\partial\hat\mu_i/\partial y_i$ is observable, this leads to Stein's unbiased risk estimate (SURE) for total prediction error,
\[
\widehat{\text{Err}} = \text{err} + 2\sigma^2\sum_{i=1}^n \frac{\partial\hat\mu_i}{\partial y_i}. \tag{2.11}
\]
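For a black-box fitting rule the partial derivatives in (2.11) can be approximated numerically. The sketch below, with illustrative names `fit` and `eps`, uses centered finite differences as a stand-in for the analytic derivatives that SURE formally requires.

```python
# Sketch only: SURE estimate (2.11) with numeric derivatives d mu_hat_i / d y_i.
import numpy as np

def sure_estimate(fit, y, sigma2, eps=1e-4):
    """Return err + 2*sigma2*sum_i d(mu_hat_i)/d(y_i) for a rule fit(y)->mu_hat."""
    mu_hat = fit(y)
    err = np.sum((y - mu_hat) ** 2)
    dsum = 0.0
    for i in range(len(y)):
        y_plus, y_minus = y.copy(), y.copy()
        y_plus[i] += eps
        y_minus[i] -= eps
        dsum += (fit(y_plus)[i] - fit(y_minus)[i]) / (2 * eps)  # centered difference
    return err + 2.0 * sigma2 * dsum
```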

In the linear case it is now common, as in Hastie and Tibshirani (1990), to define $\text{trace}(M)$ as the degrees of freedom (df) of the rule $\hat{\boldsymbol\mu} = M\mathbf{y}$. If we are in the usual regression or analysis of variance (ANOVA) situation where $M$ is a projection matrix, then $\text{trace}(M) = p$, the dimension of the projected space, agreeing with the usual df definition. As in Ye (1998), we can extend this definition to
\[
\text{df} = \sum_{i=1}^n \frac{\text{cov}(\hat\mu_i, y_i)}{\sigma^2} \tag{2.12}
\]
for a general rule $\hat{\boldsymbol\mu} = m(\mathbf{y})$.

Traditional applications of linear models try to keep $\text{df} \ll n$. Because $\sum_{i=1}^n M_{ii} = \text{df}$, this can be interpreted as $M_{ii} = O(1/n)$ in reasonable experimental designs. Similarly, the informal order of magnitude calculations that follow assume
\[
\text{cov}(\hat\mu_i, y_i) = O(1/n). \tag{2.13}
\]
This might better be stated as "$O(\text{df}/n)$," the crucial ingredient for the asymptotics being a small value of $\text{df}/n$.

The bootstrap, or more exactly the parametric bootstrap, suggests a direct way of estimating the covariance penalty $\text{cov}(\hat\mu_i, y_i)$. Let $\hat{f}$ be an assumed density for $\mathbf{y}$. In the Gaussian case we might take $\hat{f} = N(\hat{\boldsymbol\mu}, \hat\sigma^2\mathbf{I})$, with $\hat{\boldsymbol\mu} = m(\mathbf{y})$ and $\hat\sigma^2$ obtained from the residuals of some "big" model presumed to have negligible bias. We then generate a large number "$B$" of simulated observations and estimates from $\hat{f}$,
\[
\hat{f} \rightarrow \mathbf{y}^* \rightarrow \hat{\boldsymbol\mu}^* = m(\mathbf{y}^*), \tag{2.14}
\]
and estimate $\text{cov}_i = \text{cov}(\hat\mu_i, y_i)$ from the observed bootstrap covariance, say
\[
\widehat{\text{cov}}_i = \sum_{b=1}^B \hat\mu_i^{*b}\bigl(y_i^{*b} - y_i^{*\cdot}\bigr)\Big/(B-1), \qquad y_i^{*\cdot} = \sum_b y_i^{*b}\Big/B, \tag{2.15}
\]
leading to the $\widehat{\text{Err}}$ estimate
\[
\widehat{\text{Err}} = \text{err} + 2\sum_{i=1}^n \widehat{\text{cov}}_i. \tag{2.16}
\]
Both Breiman (1992) and Ye (1998) proposed variations on (2.14) intended to improve the efficiency of the bootstrap estimation procedure; see Remark A.
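The following minimal sketch implements the parametric bootstrap covariance penalty (2.14)–(2.16) under the Gaussian resampling model, assuming a fitting function `fit(y) -> mu_hat` and a residual variance estimate `sigma2`; all names are illustrative.

```python
# Sketch only: parametric bootstrap estimate of Err = err + 2 * sum_i cov_hat_i.
import numpy as np

def parametric_bootstrap_err(fit, y, sigma2, B=1000, rng=None):
    rng = np.random.default_rng(rng)
    mu_hat = fit(y)
    n = len(y)
    ystars = mu_hat + rng.normal(scale=np.sqrt(sigma2), size=(B, n))  # (2.14)
    mustars = np.array([fit(ystar) for ystar in ystars])
    ybar = ystars.mean(axis=0)
    cov_hat = np.sum(mustars * (ystars - ybar), axis=0) / (B - 1)     # (2.15)
    err = np.sum((y - mu_hat) ** 2)
    return err + 2.0 * np.sum(cov_hat), cov_hat / sigma2              # (2.16) and df_i
```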

Figure 2 displays SURE and parametric bootstrap estimates of the coordinatewise degrees of freedom $\text{df}_i$ for the kidney data. The two sets of estimates, $\partial\hat\mu_i/\partial y_i$ and $\widehat{\text{cov}}_i/\hat\sigma^2$, are plotted versus $\text{age}_i$, vividly demonstrating the decreased stability of the lowess fitting process near the extremes of the age scale. The resampling algorithm (2.14) employed
\[
\mathbf{y}^* = \hat{\boldsymbol\mu} + \boldsymbol\varepsilon^*, \tag{2.17}
\]
with the components of $\boldsymbol\varepsilon^*$ a random sample of size $n$ from the empirical distribution of the observed residuals $\hat\varepsilon_j = y_j - \hat\mu_j$ (having $\hat\sigma^2 = 3.17$). Almost identical results were obtained taking $\varepsilon_i^* \sim N(0, \hat\sigma^2)$. The lowess estimator was chosen here because it is nonlinear and unsmooth, making the df calculations more challenging.

Figure 2. Coordinatewise Degrees of Freedom for the Lowess Fit of Figure 1, Plotted versus Age. Open circles: SURE estimate $\widehat{\text{df}}_i = \partial\hat\mu_i/\partial y_i$; solid line: parametric bootstrap estimates $\widehat{\text{cov}}_i/\hat\sigma^2$, (2.14)–(2.15), B = 1,000. Total df estimates 6.85 (SURE) and 6.67 (parametric bootstrap). The coordinatewise bootstrap estimates are noticeably less noisy.

The two methods gave similar estimates for the total degrees of freedom $\text{df} = \sum\widehat{\text{df}}_i$: 6.85 using SURE and 6.67 ± .30 with the bootstrap, the ± value indicating simulation error, estimated as
\[
\frac{n}{\hat\sigma^2}\left[\frac{\sum_b\bigl(C^{*b} - C^{*\cdot}\bigr)^2}{B(B-1)}\right]^{1/2},
\qquad
C^{*b} = \sum_{i=1}^n \hat\mu_i^{*b}\bigl(y_i^{*b} - y_i^{*\cdot}\bigr)\Big/n,
\qquad
C^{*\cdot} = \sum_b C^{*b}\Big/B. \tag{2.18}
\]
However, the componentwise bootstrap estimates are noticeably less noisy, having standard deviation 2.5 times smaller than the SURE values over the range $20 \le \text{age} \le 75$.

Remark A. It is not necessary that the bootstrap model $\hat{f}$ in (2.14) be based on $\hat{\boldsymbol\mu} = m(\mathbf{y})$. The solid curve in Figure 2 was recomputed starting from the bigger model (more degrees of freedom) $\hat{f} = N(\hat{\boldsymbol\mu}, \hat\sigma^2\mathbf{I})$, with $\hat{\boldsymbol\mu}$ the fit from lowess(x, y, f = 1/6), but still using f = 1/3 for $m(\mathbf{y}^*)$ at the final step of (2.14). This gave almost the same results as in Figure 2.

The ultimate "bigger model" is
\[
\hat{f} = N(\mathbf{y}, \hat\sigma^2\mathbf{I}). \tag{2.19}
\]
This choice, which is the one made in Ye (1998), Shen, Huang, and Ye (2002), Shen and Ye (2002), and Breiman (1992), has the advantage of not requiring model assumptions. It pays for this luxury with increased estimation error; the $\widehat{\text{df}}_i$ plot looks more like the open circles than the solid line in Figure 2. The author prefers checking the $\widehat{\text{df}}_i$ estimates against moderately bigger models, such as lowess(x, y, 1/6), rather than going all the way to (2.19); see Remark C.

In fact, the exact choice of $\hat{f}$ is often quite unimportant. Notice that $\text{df}_i \equiv \text{cov}(\hat\mu_i, y_i)/\sigma^2$ is the linear regression coefficient of $\hat\mu_i$ on $y_i$. If the regression function $E\{\hat\mu_i|y_i\}$ is roughly linear in $y_i$, then its slope can be estimated by a wide variety of devices. Algorithm 1 of Ye (1998) takes $\mathbf{y}^*$ in (2.14) from a shrunken version of (2.19),
\[
\mathbf{y}^* \sim N(\mathbf{y}, c\hat\sigma^2\mathbf{I}), \tag{2.20}
\]
with $c$ a constant between .6 and 1, and estimates $\text{df}_i$ by the linear regression coefficient of $\hat\mu_i^*$ on $y_i^*$. Breiman's "little bootstrap" (1992) employs a related technique, the "little" referring to using $c < 1$ in (2.20), and winds up recommending $c$ between .6 and .8 (though $c = 1$ gave slightly superior accuracy in his simulation experiments). Shen and Ye (2002) used an equivalent form of covariance estimation with $c = .5$.

Remark B. The parametric bootstrap algorithm (2.14)–(2.15) can also be used to assess the difference between fits obtained from two models, say Model A and Model B. We will think of A as the smaller of the two, that is, the one with fewer degrees of freedom, though this is not essential. The estimated difference of prediction error is
\[
\widehat{\Delta\text{Err}} = \Delta\text{err} + 2\sum_{i=1}^n \widehat{\text{cov}}\bigl(\Delta\hat\mu_i^*, y_i^*\bigr), \tag{2.21}
\]
$\Delta$ denoting "Model A minus Model B."

Calculation (2.21) was carried out for the kidney data, with lowess(x, y, f = 2/3) for Model A and lowess(x, y, f = 1/3) for Model B; $\Delta\text{err} = 498.5 - 495.1 = 3.4$. With $\hat{f}$ in (2.14) estimated from Model A, 1,000 parametric bootstraps (each requiring both model fits) gave $-18.4$ for the second term in (2.21), so
\[
\widehat{\Delta\text{Err}} = 3.4 - 18.4 = -15.0,
\]

favoring the smaller Model A.

The 1,000 pairs of bootstrap fits $\hat{\boldsymbol\mu}(A)^*$ and $\hat{\boldsymbol\mu}(B)^*$ contain useful information beyond evaluating the second term of (2.21). Figure 3 displays the thousand values of
\[
\widehat{\Delta\text{Err}}^* = \Delta\text{err}^* - 18.4. \tag{2.22}
\]
This can be considered as a null hypothesis distribution for testing "Model B is no improvement over Model A." In this case the observed $\widehat{\Delta\text{Err}}$ falls in the lower part of the distribution, but for a larger observed value, say $\widehat{\Delta\text{Err}} = 20.0$, we might use the histogram to assign the approximate $p$ value $\#\{\widehat{\Delta\text{Err}}^* > \widehat{\Delta\text{Err}}\}/1{,}000$.

This calculation ignores the fact that the penalty $-18.4$ in (2.22) is itself variable. For linear models the penalty is a constant, obviating concern. In general the penalty term is an order of magnitude smaller than $\Delta\text{err}$, and not likely to contribute much to the bootstrap variability of $\widehat{\Delta\text{Err}}^*$. This was checked here using a second level of bootstrapping, which made very little difference to Figure 3.


Figure 3. 1,000 Bootstrap Replications of $\widehat{\Delta\text{Err}}^*$ for the Difference Between lowess(x, y, 2/3) and lowess(x, y, 1/3), Kidney Data. The point estimate $\widehat{\Delta\text{Err}} = -15.0$ is in the lower part of the histogram.

Remark C. The parametric bootstrap estimate (2.14)–(2.15), unlike SURE, does not depend on $\hat{\boldsymbol\mu} = m(\mathbf{y})$ being differentiable, or even continuous. A simulation experiment was run taking the true model for the kidney data to be $\mathbf{y} \sim N(\boldsymbol\mu, \sigma^2\mathbf{I})$, with $\sigma^2 = 3.17$ and $\boldsymbol\mu$ the lowess(x, y, f = 1/6) fit, a noticeably rougher curve than that of Figure 1. A discontinuous adaptive estimation rule $\hat{\boldsymbol\mu} = m(\mathbf{y})$ was used: Polynomial regressions of $y$ on $x$, for powers of $x$ from 0 to 7, were fit, with the one having the minimum $C_p$ value selected to be $\hat{\boldsymbol\mu}$.

Because this is a simulation experiment, we can estimate the true expected difference between Err and err (2.8); 1,000 simulations of $\mathbf{y}$ gave
\[
E\{\text{Err} - \text{err}\} = 33.1 \pm 2.02. \tag{2.23}
\]
The parametric bootstrap estimate (2.14)–(2.16) worked well here, 1,000 replications of $\mathbf{y}^* \sim N(\hat{\boldsymbol\mu}, \hat\sigma^2\mathbf{I})$, with $\hat{\boldsymbol\mu}$ from lowess(x, y, f = 1/3), yielding
\[
\widehat{\text{Err}} - \text{err} = 31.4 \pm 2.85. \tag{2.24}
\]
In contrast, bootstrapping from $\mathbf{y}^* \sim N(\mathbf{y}, \hat\sigma^2\mathbf{I})$ as in (2.19) gave $14.6 \pm 1.82$, badly underestimating the true difference 33.1. Starting with the true $\boldsymbol\mu$ equal to the seventh-degree polynomial fit gave nearly the same results as (2.23).

3. GENERAL COVARIANCE PENALTIES

The covariance penalty theory of Section 2 can be generalized beyond squared error to a wide class of error measures. The $q$ class of error measures (Efron 1986) begins with any concave function $q(\cdot)$ of a real-valued argument; $Q(y, \hat\mu)$, the assessed error for outcome $y$ given prediction $\hat\mu$, is then defined to be
\[
Q(y, \hat\mu) = q(\hat\mu) + \dot q(\hat\mu)(y - \hat\mu) - q(y) \qquad \bigl[\dot q(\hat\mu) = dq/d\mu\big|_{\hat\mu}\bigr]. \tag{3.1}
\]

Figure 4. Tangency Construction (3.1) for General Error Measure $Q(y, \hat\mu)$; $q(\cdot)$ Is an Arbitrary Concave Function. The illustrated case has $q(\mu) = \mu(1-\mu)$ and $Q(y, \hat\mu) = (y - \hat\mu)^2$.

$Q(y, \hat\mu)$ is the tangency function to $q(\cdot)$, as illustrated in Figure 4; (3.1) is a familiar construct in convex analysis (Rockafellar 1970). The choice $q(\mu) = \mu(1-\mu)$ gives squared error, $Q(y, \hat\mu) = (y - \hat\mu)^2$.

Our examples will include the Bernoulli case $\mathbf{y} \sim \text{Be}(\boldsymbol\mu)$, where we have $n$ independent observations $y_i$,
\[
y_i = \begin{cases} 1 & \text{probability } \mu_i \\ 0 & \text{probability } 1 - \mu_i \end{cases} \qquad \text{for } \mu_i \in [0,1]. \tag{3.2}
\]

Two common error functions used for Bernoulli observations are counting error,
\[
q(\mu) = \min(\mu, 1-\mu) \;\rightarrow\; Q(y, \hat\mu) = \begin{cases} 0 & \text{if } y, \hat\mu \text{ on same side of } 1/2 \\ 1 & \text{if } y, \hat\mu \text{ on different sides of } 1/2 \end{cases} \tag{3.3}
\]
(see Remark F), and binomial deviance,
\[
q(\mu) = -2\bigl[\mu\log(\mu) + (1-\mu)\log(1-\mu)\bigr] \;\rightarrow\; Q(y, \hat\mu) = \begin{cases} -2\log\hat\mu & \text{if } y = 1 \\ -2\log(1-\hat\mu) & \text{if } y = 0. \end{cases} \tag{3.4}
\]
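The tangency construction (3.1) is easy to compute once $q$ and its derivative are supplied. The minimal sketch below, with illustrative helper names, reproduces the two Bernoulli examples (3.3) and (3.4) apart from the boundary case $\hat\mu = 1/2$.

```python
# Sketch only: the q-class error Q(y, mu) = q(mu) + qdot(mu)(y - mu) - q(y).
import numpy as np

def Q(y, mu, q, qdot):
    return q(mu) + qdot(mu) * (y - mu) - q(y)

# counting error (3.3): q(mu) = min(mu, 1 - mu)
q_count = lambda m: np.minimum(m, 1 - m)
qdot_count = lambda m: np.where(m < 0.5, 1.0, -1.0)

# binomial deviance (3.4), with q(0) = q(1) = 0 as in (3.5)
def q_dev(m):
    m = np.clip(m, 1e-12, 1 - 1e-12)          # avoid log(0) at y = 0, 1
    return -2 * (m * np.log(m) + (1 - m) * np.log1p(-m))
qdot_dev = lambda m: -2 * np.log(m / (1 - m))

# e.g. Q(1, 0.8, q_dev, qdot_dev) equals -2*log(0.8), as in (3.4)
```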

By a linear transformation we can always make
\[
q(0) = q(1) = 0, \tag{3.5}
\]
which is convenient for Bernoulli calculations.

We assume that some unknown probability mechanism $f$ has given the observed data $\mathbf{y}$, from which we estimate the expectation vector $\boldsymbol\mu = E_f\{\mathbf{y}\}$ according to the rule $\hat{\boldsymbol\mu} = m(\mathbf{y})$,
\[
f \rightarrow \mathbf{y} \rightarrow \hat{\boldsymbol\mu} = m(\mathbf{y}). \tag{3.6}
\]
Total error will be assessed by summing the component errors,
\[
Q(\mathbf{y}, \hat{\boldsymbol\mu}) = \sum_{i=1}^n Q(y_i, \hat\mu_i). \tag{3.7}
\]


The following definitions lead to a general version of the $C_p$ formula (2.8). Letting
\[
\text{err}_i = Q(y_i, \hat\mu_i) \quad\text{and}\quad \text{Err}_i = E_0\bigl\{Q(y_i^0, \hat\mu_i)\bigr\}, \tag{3.8}
\]
as in (2.4), with $\hat\mu_i$ fixed in the expectation and $y_i^0$ from an independent copy of $\mathbf{y}$, define the
\[
\text{Optimism:}\quad O_i = O_i(f, \mathbf{y}) = \text{Err}_i - \text{err}_i \tag{3.9}
\]
and
\[
\text{Expected optimism:}\quad \Omega_i = \Omega_i(f) = E_f\{O_i(f, \mathbf{y})\}. \tag{3.10}
\]
Finally, let
\[
\hat\lambda_i = -\dot q(\hat\mu_i)/2. \tag{3.11}
\]

For $q(\mu) = \mu(1-\mu)$, the squared error case, $\hat\lambda_i = \hat\mu_i - 1/2$; for counting error (3.3), $\hat\lambda_i = -1/2$ or $1/2$ as $\hat\mu_i$ is less than or greater than $1/2$; for binomial deviance (3.4),
\[
\hat\lambda_i = \log\bigl(\hat\mu_i/(1 - \hat\mu_i)\bigr), \tag{3.12}
\]
the logit parameter. [If $Q(y, \mu)$ is the deviance function for any exponential family, then $\lambda$ is the corresponding natural parameter; see sec. 6 of Efron 1986.]

Optimism Theorem. For error measure $Q(y, \hat\mu)$ (3.1) we have
\[
E\{\text{Err}_i\} = E\{\text{err}_i\} + \Omega_i, \tag{3.13}
\]
where
\[
\Omega_i = 2\,\text{cov}(\hat\lambda_i, y_i), \tag{3.14}
\]
the expectations and covariance being with respect to $f$ (3.6).

Proof. $\text{Err}_i = \text{err}_i + O_i$ by definition, immediately giving (3.13). From (3.1) we calculate
\[
\text{Err}_i = q(\hat\mu_i) + \dot q(\hat\mu_i)(\mu_i - \hat\mu_i) - E\{q(y_i^0)\},
\qquad
\text{err}_i = q(\hat\mu_i) + \dot q(\hat\mu_i)(y_i - \hat\mu_i) - q(y_i), \tag{3.15}
\]
and so, from (3.9)–(3.11),
\[
O_i = 2\hat\lambda_i(y_i - \mu_i) + q(y_i) - E\{q(y_i^0)\}. \tag{3.16}
\]
Because $E\{q(y_i^0)\} = E\{q(y_i)\}$, $y_i^0$ being a copy of $y_i$, taking expectations in (3.16) verifies (3.14).

The optimism theorem generalizes Stein's result for squared error (2.8) to the $q$ class of error measures. It was developed by Efron (1986) in a generalized linear model (GLM) context, but as verified here it applies to any probability mechanism $f \rightarrow \mathbf{y}$. Even independence is not required among the components of $\mathbf{y}$, though it is convenient to assume independence in the conditional covariance computations that follow.

Parametric bootstrap computations can be used to estimate the penalty $\Omega_i = 2\,\text{cov}(\hat\lambda_i, y_i)$, as in (2.14), the only change being the substitution of $\hat\lambda_i^* = -\dot q(\hat\mu_i^*)/2$ for $\hat\mu_i^*$ in (2.15),
\[
\widehat{\text{cov}}_i = \sum_{b=1}^B \hat\lambda_i^{*b}\bigl(y_i^{*b} - y_i^{*\cdot}\bigr)\Big/(B-1). \tag{3.17}
\]
Method (3.17) was suggested in remark J of Efron (1986). Shen et al. (2002), working with deviance in exponential families, employed a "shrunken" version of (3.17), as in (2.20).

Section 4 relates covariance penalties to cross-validation. In doing so it helps to work with a conditional version of $\text{cov}_i$. Let $\mathbf{y}_{(i)}$ indicate the data vector with $y_i$ deleted,
\[
\mathbf{y}_{(i)} = (y_1, y_2, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n), \tag{3.18}
\]
and define the conditional covariance
\[
\text{cov}_{(i)} = E\bigl\{\hat\lambda_i\cdot(y_i - \mu_i)\,\big|\,\mathbf{y}_{(i)}\bigr\} \equiv E_{(i)}\bigl\{\hat\lambda_i\cdot(y_i - \mu_i)\bigr\}, \tag{3.19}
\]
$E_{(i)}$ indicating $E\{\cdot|\mathbf{y}_{(i)}\}$; likewise $\Omega_{(i)} = 2\,\text{cov}_{(i)}$. In situation (2.1)–(2.3), $\text{cov}_{(i)} = \text{cov}_i = \sigma^2 M_{ii}$, but in general we only have $E\{\text{cov}_{(i)}\} = \text{cov}_i$. The conditional version of (3.13),
\[
E_{(i)}\{\text{Err}_i\} = E_{(i)}\{\text{err}_i\} + \Omega_{(i)}, \tag{3.20}
\]
is a more refined statement of the optimism theorem. The SURE formula (2.10) also applies conditionally, $\text{cov}_{(i)} = \sigma^2 E_{(i)}\{\partial\hat\mu_i/\partial y_i\}$, assuming normality (2.9).

Figure 5 illustrates conditional and unconditional covariance calculations for subject $i = 93$ of the kidney study (the open circle in Fig. 1). Here we have used squared error and the Gaussian model $\mathbf{y}^* \sim N(\hat{\boldsymbol\mu}, \hat\sigma^2\mathbf{I})$, $\hat\sigma^2 = 3.17$, with $\hat{\boldsymbol\mu} = \text{lowess}(x, y, 1/3)$. The conditional and unconditional covariances are nearly the same, $\widehat{\text{cov}}_{(i)} = .221$ versus $\widehat{\text{cov}}_i = .218$, but the dependence of $\hat\mu_i^*$ on $y_i^*$ is much clearer conditioning on $\mathbf{y}_{(i)}$.

The conditional approach is computationally expensive. We would need to repeat the conditional resampling procedure of Figure 5 separately for each of the $n$ cases, whereas a single set of unconditional resamples suffices for all $n$. Here we will use the conditional covariances (3.19) mainly for theoretical purposes. The less expensive unconditional approach performed well in all of our examples.

Figure 5. Conditional and Unconditional Covariance Calculations for Subject i = 93, Kidney Study. Open circles: 200 pairs $(y_i^*, \hat\mu_i^*)$, unconditional resamples $\mathbf{y}^* \sim N(\hat{\boldsymbol\mu}, \hat\sigma^2\mathbf{I})$, $\widehat{\text{cov}}_i = .218$. Dots: 100 conditional resamples $y_i^* \sim N(\hat\mu_i, \hat\sigma^2)$, $\mathbf{y}_{(i)}$ fixed, $\widehat{\text{cov}}_{(i)} = .221$. Vertical line at $\hat\mu_{93} = 1.36$.


There is, however, one situation where the conditional covariances are easily computed: the Bernoulli case $\mathbf{y} \sim \text{Be}(\boldsymbol\mu)$. In this situation it is easy to see that
\[
\text{cov}_{(i)} = \mu_i(1 - \mu_i)\bigl[\hat\lambda_i\bigl(\mathbf{y}_{(i)}, 1\bigr) - \hat\lambda_i\bigl(\mathbf{y}_{(i)}, 0\bigr)\bigr], \tag{3.21}
\]
the notation indicating the two possible values of $\hat\lambda_i$ with $\mathbf{y}_{(i)}$ fixed and $y_i$ either 1 or 0. This leads to estimates
\[
\widehat{\text{cov}}_{(i)} = \hat\mu_i(1 - \hat\mu_i)\bigl[\hat\lambda_i\bigl(\mathbf{y}_{(i)}, 1\bigr) - \hat\lambda_i\bigl(\mathbf{y}_{(i)}, 0\bigr)\bigr]. \tag{3.22}
\]
Calculating $\widehat{\text{cov}}_{(i)}$ for $i = 1, 2, \ldots, n$ requires only $n$ recomputations of $m(\cdot)$, one for each $i$, the same number as for cross-validation. For reasons discussed next, (3.22) will be termed the Steinian.
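A minimal sketch of the Steinian estimate (3.22) for Bernoulli data with deviance error follows, where $\hat\lambda_i$ is the logit of $\hat\mu_i$ as in (3.12); `fit(y) -> mu_hat` is an illustrative stand-in for the rule $m(\cdot)$.

```python
# Sketch only: Steinian cov_hat_(i) = mu_hat_i(1-mu_hat_i)[lam_i(y_(i),1) - lam_i(y_(i),0)].
import numpy as np

def steinian_cov(fit, y):
    n = len(y)
    mu_hat = fit(y)
    logit = lambda m: np.log(m / (1 - m))          # lambda_hat for deviance, (3.12)
    cov = np.empty(n)
    for i in range(n):
        y1, y0 = y.copy(), y.copy()
        y1[i], y0[i] = 1, 0                        # the two possible values of y_i
        lam1 = logit(fit(y1)[i])
        lam0 = logit(fit(y0)[i])
        cov[i] = mu_hat[i] * (1 - mu_hat[i]) * (lam1 - lam0)
    # For simplicity this refits both cases (2n fits); only n are strictly needed,
    # since one of y1, y0 coincides with the observed y.
    return cov                                      # penalty estimate: Omega_hat_(i) = 2 * cov[i]
```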

There is no general equivalent to the Gaussian SURE formula (2.10), that is, an unbiased estimator for $\text{cov}_{(i)}$. However, a useful approximation can be derived as follows. Let $t_i(y_i^*) = \hat\lambda_i(\mathbf{y}_{(i)}, y_i^*)$ indicate $\hat\lambda_i^*$ as a function of $y_i^*$ with $\mathbf{y}_{(i)}$ fixed, and denote $\dot t_i = \partial t_i(y_i^*)/\partial y_i^*\big|_{\hat\mu_i}$; in Figure 5, $\dot t_i$ is the slope of the solid curve as it crosses the vertical line. Suppose $y_i^*$ has bootstrap mean and variance $(\hat\mu_i, V_i)$. Taylor series yield a simple approximation for $\widehat{\text{cov}}_{(i)}$,
\[
\widehat{\text{cov}}_{(i)} = E_{(i)}\bigl\{\hat\lambda_i^*\cdot(y_i^* - \hat\mu_i)\bigr\}
\doteq E_{(i)}\bigl\{\bigl[t_i(\hat\mu_i) + \dot t_i\cdot(y_i^* - \hat\mu_i)\bigr](y_i^* - \hat\mu_i)\bigr\}
= V_i\,\dot t_i, \tag{3.23}
\]
only $y_i^*$ being random here. The Steinian (3.22) is a discretized version of (3.23) applied to the Bernoulli case, which has $V_i = \hat\mu_i(1 - \hat\mu_i)$.

If $Q(y, \mu)$ is the deviance function for a one-parameter exponential family, then $\lambda_i$ is the natural parameter and $d\lambda_i/d\mu_i = 1/V_i$. Therefore
\[
\widehat{\text{cov}}_{(i)} = V_i\,\frac{\partial\hat\lambda_i}{\partial y_i^*}\bigg|_{\hat\mu_i}
= V_i\,\frac{1}{V_i}\,\frac{\partial\hat\mu_i}{\partial y_i^*}\bigg|_{\hat\mu_i}
= \frac{\partial\hat\mu_i}{\partial y_i^*}\bigg|_{\hat\mu_i}. \tag{3.24}
\]
This is a centralized version of the SURE estimate, where now $\partial\hat\mu_i/\partial y_i$ is evaluated at $\hat\mu_i$ instead of $y_i$. [In the exponential family representation of the Gaussian case (2.9), $Q(y, \mu) = (y - \mu)^2/\sigma^2$, so the factor $\sigma^2$ in (2.10) has been absorbed into $Q$ in (3.24).]

Remark D. The centralized version of SURE in (3.24) gives the correct total degrees of freedom for maximum likelihood estimation in a $p$-parameter generalized linear model, or more generally in a $p$-parameter curved exponential family (Efron 1975),
\[
\sum_{i=1}^n \frac{\partial\hat\mu_i}{\partial y_i}\bigg|_{\mathbf{y} = \hat{\boldsymbol\mu}} = p. \tag{3.25}
\]
The usual uncentralized version of SURE does not satisfy (3.25) in curved families.

Using deviance error and maximum likelihood estimation in a curved exponential family makes $\text{err}_i = -2\log f_{\hat\mu_i}(y_i) + \text{constant}$. Combining (3.14), (3.24), and (3.25) gives
\[
\widehat{\text{Err}} = -2\left[\sum_i \log f_{\hat\mu_i}(y_i) - p + \text{constant}\right]. \tag{3.26}
\]
Choosing among competing models by minimizing $\widehat{\text{Err}}$ is equivalent to maximizing the penalized likelihood $\sum\log f_{\hat\mu_i}(y_i) - p$, which is Akaike's information criterion (AIC). These results generalize those for GLM's in section 6 of Efron (1986) and will not be verified here.

Remark E. It is easy to show that the true prediction error $\text{Err}_i$ (3.8) satisfies
\[
\text{Err}_i = Q(\mu_i, \hat\mu_i) + D(\mu_i) \qquad \bigl[D(\mu_i) \equiv E\{Q(y_i, \mu_i)\}\bigr]. \tag{3.27}
\]
For squared error this reduces to the familiar result $E_0(y_i^0 - \hat\mu_i)^2 = (\mu_i - \hat\mu_i)^2 + \sigma^2$. In the Bernoulli case (3.2), $D(\mu_i) = q(\mu_i)$, and the basic result (3.16) can be simplified to
\[
\text{Bernoulli case:}\quad O_i = 2\hat\lambda_i(y_i - \mu_i), \tag{3.28}
\]
using (3.5).

Remark F. The $q$ class includes an asymmetric version of counting error (3.3) that allows the decision boundary to be at a point $\pi_1$ in $(0,1)$ other than $1/2$. Letting $\pi_0 = 1 - \pi_1$ and $\rho = (\pi_0/\pi_1)^{1/2}$,
\[
q(\mu) = \min\Bigl\{\rho\mu,\; \frac{1}{\rho}(1-\mu)\Bigr\}
\;\rightarrow\;
Q(y, \hat\mu) = \begin{cases}
0 & \text{if } y, \hat\mu \text{ on same side of } \pi_1 \\
\rho & \text{if } y = 1 \text{ and } \hat\mu \le \pi_1 \\
1/\rho & \text{if } y = 0 \text{ and } \hat\mu > \pi_1.
\end{cases} \tag{3.29}
\]
Now $Q(1,0)/Q(0,1) = \pi_0/\pi_1$. This is the appropriate loss structure for a simple hypothesis-testing situation in which we want to compensate for unequal prior sampling probabilities.

4. THE RELATIONSHIP WITH CROSS-VALIDATION

Cross-validation is the most widely used error prediction technique. This section relates cross-validation to covariance penalties, more exactly to conditional parametric bootstrap covariance penalties. A Rao–Blackwell type of relationship will be developed: If we average cross-validation estimates across the bootstrap datasets used to calculate the conditional covariances, then we get, to a good approximation, the covariance penalty. The implication is that covariance penalties are more accurate than cross-validation, assuming of course that we trust the parametric bootstrap model. A similar conclusion is reached in Shen et al. (2002).

The cross-validation estimate of prediction error for coordinate $i$ is
\[
\widetilde{\text{Err}}_i = Q(y_i, \tilde\mu_i), \tag{4.1}
\]
where $\tilde\mu_i$ is the $i$th coordinate of the estimate of $\boldsymbol\mu$ based on the deleted dataset $\mathbf{y}_{(i)} = (y_1, y_2, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n)$, say
\[
\tilde\mu_i = m\bigl(\mathbf{y}_{(i)}\bigr)_i \tag{4.2}
\]
(see Remark H). Equivalently, cross-validation estimates the optimism $O_i = \text{Err}_i - \text{err}_i$ by
\[
\widetilde{O}_i = \widetilde{\text{Err}}_i - \text{err}_i = Q(y_i, \tilde\mu_i) - Q(y_i, \hat\mu_i). \tag{4.3}
\]
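A minimal sketch of the leave-one-out optimism estimate (4.3) follows, written in the training-set form anticipated in Remark H: `fit(x_tr, y_tr, x0)` returns the prediction at `x0` from the rule trained on `(x_tr, y_tr)`, and `Q` is the error measure; all names are illustrative.

```python
# Sketch only: O_tilde_i = Q(y_i, mu_tilde_i) - Q(y_i, mu_hat_i) for each i.
import numpy as np

def cv_optimism(fit, x, y, Q):
    n = len(y)
    mu_hat = np.array([fit(x, y, x[i]) for i in range(n)])     # full-data fit
    O = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        mu_tilde_i = fit(x[keep], y[keep], x[i])               # deleted fit, (4.2)
        O[i] = Q(y[i], mu_tilde_i) - Q(y[i], mu_hat[i])        # (4.3)
    return O                                                   # sum(O) estimates total optimism
```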


Lemma. Letting $\hat\lambda_i = -\dot q(\hat\mu_i)/2$ and $\tilde\lambda_i = -\dot q(\tilde\mu_i)/2$, as in (3.11),
\[
\widetilde{O}_i = 2(\hat\lambda_i - \tilde\lambda_i)(y_i - \mu_i) - Q(\tilde\mu_i, \hat\mu_i) - 2(\hat\lambda_i - \tilde\lambda_i)(\tilde\mu_i - \mu_i). \tag{4.4}
\]
This is verified directly from definition (3.1).

The lemma helps connect cross-validation with the conditional covariance penalties of (3.19)–(3.20). Cross-validation itself is conditional, in the sense that $\mathbf{y}_{(i)}$ is fixed in the calculation of $\widetilde{O}_i$, so it is reasonable to suspect some sort of connection. Suppose that we estimate $\text{cov}_{(i)}$ by bootstrap sampling as in (3.17), but now with $\mathbf{y}_{(i)}$ fixed and only $y_i^*$ random, say with density $\tilde{f}_i$. The form of (4.4) makes it especially convenient for $y_i^*$ to have conditional expectation $\tilde\mu_i$ (rather than the obvious choice $\hat\mu_i$), which we denote by
\[
E_{(i)}\{y_i^*\} \equiv E_{\tilde{f}_i}\bigl\{y_i^*\,\big|\,\mathbf{y}_{(i)}\bigr\} = \tilde\mu_i. \tag{4.5}
\]
In a Bernoulli situation we would take $y_i^* \sim \text{Be}(\tilde\mu_i)$.

Denote the bootstrap versions of $\hat\mu_i$ and $\hat\lambda_i$ as $\hat\mu_i^* = m(\mathbf{y}_{(i)}, y_i^*)_i$ and $\hat\lambda_i^* = -\dot q(\hat\mu_i^*)/2$.

Theorem 1. With $y_i^* \sim \tilde{f}_i$ satisfying (4.5) and $\mathbf{y}_{(i)}$ fixed,
\[
E_{(i)}\bigl\{\widetilde{O}_i^*\bigr\} = 2\,\widehat{\text{cov}}_{(i)} - E_{(i)}\bigl\{Q(\tilde\mu_i, \hat\mu_i^*)\bigr\}, \tag{4.6}
\]
$\widehat{\text{cov}}_{(i)}$ being the conditional covariance estimate $E_{(i)}\{\hat\lambda_i^*\cdot(y_i^* - \tilde\mu_i)\}$.

Proof. Applying the lemma with $\mu_i \rightarrow \tilde\mu_i$, $y_i \rightarrow y_i^*$, $\hat\lambda_i \rightarrow \hat\lambda_i^*$, and $\hat\mu_i \rightarrow \hat\mu_i^*$ gives
\[
\widetilde{O}_i^* = 2(\hat\lambda_i^* - \tilde\lambda_i)(y_i^* - \tilde\mu_i) - Q(\tilde\mu_i, \hat\mu_i^*). \tag{4.7}
\]
Notice that $\tilde\mu_i$ and $\tilde\lambda_i$ stay fixed in (4.7), because they depend only on $\mathbf{y}_{(i)}$, and that this same fact eliminates the last term in (4.4). Taking conditional expectations $E_{(i)}$ in (4.7) completes the proof.

In (4.6), $2\,\widehat{\text{cov}}_{(i)}$ equals $\widehat\Omega_{(i)}$, the estimate of the conditional covariance penalty $\Omega_{(i)}$ (3.20). Typically $\widehat\Omega_{(i)}$ is of order $O_p(1/n)$, as in (2.13), while the remainder term $E_{(i)}\{Q(\tilde\mu_i, \hat\mu_i^*)\}$ is only $O_p(1/n^2)$; see Remark H. The implication is that
\[
E_{(i)}\bigl\{\widetilde{O}_i^*\bigr\} \doteq \widehat\Omega_{(i)} = 2\cdot\widehat{\text{cov}}_{(i)}. \tag{4.8}
\]
In other words, averaging the cross-validation estimate $\widetilde{O}_i^*$ over $\tilde{f}_i$, the distribution of $y_i^*$ used to calculate the covariance penalty $\widehat\Omega_{(i)}$, gives approximately $\widehat\Omega_{(i)}$ itself. If we think of $\tilde{f}_i$ as summarizing all available information for the unknown distribution of $y_i$, that is, as a sufficient statistic, then $\widehat\Omega_{(i)}$ is a Rao–Blackwellized improvement on $\widetilde{O}_i$.

This same phenomenon occurs beyond the conditional framework of the theorem. Figure 6 concerns cross-validation of the lowess(x, y, 1/3) curve in Figure 1. Using the same unconditional resampling model (2.17) as in Figure 2, $B = 200$ bootstrap replications of the cross-validation estimate (4.3) were generated,
\[
\widetilde{O}_i^{*b} = Q\bigl(y_i^{*b}, \tilde\mu_i^{*b}\bigr) - Q\bigl(y_i^{*b}, \hat\mu_i^{*b}\bigr),
\qquad i = 1, 2, \ldots, n \text{ and } b = 1, 2, \ldots, B. \tag{4.9}
\]
The small points in Figure 6 indicate individual values $\widetilde{O}_i^{*b}/2\hat\sigma^2$. The triangles show averages over the 200 replications, $\widetilde{O}_i^{*\cdot}/2\hat\sigma^2$. There is striking agreement with the covariance penalty curve $\widehat{\text{cov}}_i/\hat\sigma^2$ from Figure 2, confirming $E\{\widetilde{O}_i^*\} \doteq \widehat\Omega_i$ as in (4.8). Nearly the same results were obtained bootstrapping from $\tilde{\boldsymbol\mu}$ rather than $\hat{\boldsymbol\mu}$.

Figure 6. Small Dots Indicate 200 Bootstrap Replications of Cross-Validation Optimism Estimates (4.9); Triangles: Their Averages Closely Match the Covariance Penalty Curve From Figure 2. (Vertical distance plotted in df units.)

Approximation (4.8) can be made more explicit in the case of squared error loss applied to linear rules $\hat{\boldsymbol\mu} = M\mathbf{y}$ that are "self-stable," that is, where the cross-validation estimate (4.2) satisfies
\[
\tilde\mu_i = \sum_{j \ne i}\widetilde{M}_{ij}y_j, \qquad \widetilde{M}_{ij} = \frac{M_{ij}}{1 - M_{ii}}. \tag{4.10}
\]

Self-stable rules include all the usual regression and ANOVA estimates, as well as spline methods; see Remark I. Suppose we are resampling from $y_i^* \sim \tilde{f}_i$ with mean and variance
\[
y_i^* \sim (\mu_i, \sigma^2), \tag{4.11}
\]
where $\mu_i$ might differ from $\hat\mu_i$ or $\tilde\mu_i$. The covariance penalty $\Omega_i$ is then estimated by $\widehat\Omega_i = 2\sigma^2 M_{ii}$.

Using (4.10), it is straightforward to calculate the conditional expectation of the cross-validation estimate $\widetilde{O}_i$,
\[
E_{\tilde{f}_i}\bigl\{\widetilde{O}_i^*\,\big|\,\mathbf{y}_{(i)}\bigr\}
= \widehat\Omega_i\cdot\bigl[1 - M_{ii}/2\bigr]\left[1 + \left(\frac{\mu_i - \tilde\mu_i}{\sigma}\right)^2\right]. \tag{4.12}
\]

If $\mu_i = \tilde\mu_i$, then (4.12) becomes $E_{(i)}\{\widetilde{O}_i^*\} = \widehat\Omega_i[1 - M_{ii}/2]$, an exact version of (4.8). The choice $\mu_i = \hat\mu_i$ results in
\[
E_{\tilde{f}_i}\bigl\{\widetilde{O}_i^*\,\big|\,\mathbf{y}_{(i)}\bigr\}
= \widehat\Omega_i[1 - M_{ii}/2]\left[1 + \left(\frac{M_{ii}(y_i - \tilde\mu_i)}{\sigma}\right)^2\right]. \tag{4.13}
\]
In both cases the conditional expectation of $\widetilde{O}_i^*$ is $\widehat\Omega_i[1 + O_p(1/n)]$, where the $O_p(1/n)$ term tends to be slightly negative.

The unconditional expectation of $\widetilde{O}_i$, with respect to the true distribution $\mathbf{y} \sim (\boldsymbol\mu, \sigma^2\mathbf{I})$, looks like (4.12),
\[
E\{\widetilde{O}_i\} = \Omega_i[1 - M_{ii}/2]\left[1 + \sum_{j \ne i}\widetilde{M}_{ij}^2 + \Delta_i^2\right], \tag{4.14}
\]
$\Omega_i$ equaling the covariance penalty $2\sigma^2 M_{ii}$, and
\[
\Delta_i^2 = \left[\left(\mu_i - \sum_{j \ne i}\widetilde{M}_{ij}\mu_j\right)\Big/\sigma\right]^2. \tag{4.15}
\]

For $M$ a projection matrix, $M^2 = M$, the term $\sum\widetilde{M}_{ij}^2 = M_{ii}/(1 - M_{ii})$; $E\{\widetilde{O}_i\}$ exceeds $\Omega_i$, but only by a factor of $1 + O(1/n)$ if $\Delta_i^2 = 0$. Notice that
\[
\sum_{j \ne i}\widetilde{M}_{ij}\mu_j - \mu_i = E\{\tilde\mu_i\} - \mu_i, \tag{4.16}
\]
so that $\Delta_i^2$ will be large if the cross-validation estimator $\tilde\mu_i$ is badly biased.

Finally, suppose $\mathbf{y} \sim N(\boldsymbol\mu, \sigma^2\mathbf{I})$ and $\hat{\boldsymbol\mu} = M\mathbf{y}$ is a self-stable projection estimator. Then the coefficient of variation of $\widetilde{O}_i$ satisfies
\[
\text{CV}\{\widetilde{O}_i\}^2 = \frac{2 + 4(1 - M_{ii})\Delta_i^2}{\bigl[1 + (1 - M_{ii})\Delta_i^2\bigr]^2} \doteq 2, \tag{4.17}
\]
the last approximation being quite accurate unless $\tilde\mu_i$ is badly biased. This says that $\widetilde{O}_i$ must always be a highly variable estimate of its expectation (4.14), or approximately of $\Omega_i = 2\sigma^2 M_{ii}$. However, it is still possible for the sum $\widetilde{O} = \sum_i\widetilde{O}_i$ to estimate $\Omega = \sum_i\Omega_i = 2\sigma^2\,\text{df}$ with reasonable accuracy.

As an example, $\hat{\boldsymbol\mu} = M\mathbf{y}$ was fit to the kidney data, where $M$ represented a natural spline with 8 degrees of freedom (including the intercept). One hundred simulated data vectors $\mathbf{y}^* \sim N(\hat{\boldsymbol\mu}, \hat\sigma^2\mathbf{I})$ were independently generated, $\hat\sigma^2 = 3.17$, each giving a cross-validated df estimate $\widetilde{\text{df}}^* = \widetilde{O}^*/2\hat\sigma^2$. These had empirical mean and standard deviation
\[
\widetilde{\text{df}}^* \sim 8.34 \pm 1.64. \tag{4.18}
\]
Of course there is no reason to use cross-validation here, because the covariance penalty estimate $\widehat{\text{df}}$ always equals the correct df value 8. This is an extreme example of the Rao–Blackwell type of result in Theorem 1, showing the cross-validation estimate $\widetilde{\text{df}}$ as a randomized version of $\widehat{\text{df}}$.

Remark G. Theorem 1 applies directly to grouped cross-validation, in which the observations are removed in groups rather than singly. Suppose group $i$ consists of observations $(y_{i1}, y_{i2}, \ldots, y_{iJ})$, and likewise $\hat{\boldsymbol\mu}_i = (\hat\mu_{i1}, \ldots, \hat\mu_{iJ})$ and $\tilde{\boldsymbol\mu}_i = (\tilde\mu_{i1}, \ldots, \tilde\mu_{iJ})$; $\mathbf{y}_{(i)}$ equals $\mathbf{y}$ with group $i$ removed, and $\tilde{\boldsymbol\mu}_i = m(\mathbf{y}_{(i)})_{i1, i2, \ldots, iJ}$. Theorem 1 then holds as stated, with $\widetilde{O}_i^* = \sum_j\widetilde{O}_{ij}^*$, $\widehat{\text{cov}}_{(i)} = \sum_j\widehat{\text{cov}}_{(ij)}$, and so forth. Another way to say this is that, by additivity, the theory of Sections 2–4 can be extended to vector observations $y_i$.

Remark H. The full dataset for a prediction problem, the "training set," is of the form
\[
\mathbf{v} = (v_1, v_2, \ldots, v_n) \qquad \text{with } v_i = (x_i, y_i), \tag{4.19}
\]
$x_i$ being a $p$ vector of observed covariates, such as age for the kidney data, and $y_i$ a response. Covariance penalties operate in a regression theory framework where the $x_i$ are considered fixed ancillaries, whether or not they are random, which is why notation such as $\hat{\boldsymbol\mu} = m(\mathbf{y})$ can suppress $x$. Cross-validation, however, changes $x$ as well as $\mathbf{y}$. In this framework it is more precise to write the prediction rule as
\[
m(x, \mathbf{v}) \qquad \text{for } x \in \mathcal{X}, \tag{4.20}
\]
indicating that the training set $\mathbf{v}$ determines a rule $m(\cdot, \mathbf{v})$, which then can be evaluated at any $x$ in the predictor space $\mathcal{X}$; (4.2) is better expressed as $\tilde\mu_i = m(x_i, \mathbf{v}_{(i)})$.

In the cross-validation framework we can suppose that $\mathbf{v}$ has been produced by random sampling ("iid") from some $(p+1)$-dimensional distribution $F$,
\[
F \stackrel{\text{iid}}{\longrightarrow} v_1, v_2, \ldots, v_n. \tag{4.21}
\]
Standard influence function arguments, as in chapter 2 of Hampel, Ronchetti, Rousseeuw, and Stahel (1986), give the first-order approximation
\[
\hat\mu_i - \tilde\mu_i = m(x_i, \mathbf{v}) - m\bigl(x_i, \mathbf{v}_{(i)}\bigr) \doteq \frac{\text{IF}_i - \overline{\text{IF}}_{(i)}}{n}, \tag{4.22}
\]
where $\text{IF}_j = \text{IF}(v_j; m(x_i, \cdot), F)$ is the influence function for $\hat\mu_i$ evaluated at $v_j$, and $\overline{\text{IF}}_{(i)} = \sum_{j \ne i}\text{IF}_j/(n-1)$.

The point here is that $\hat\mu_i - \tilde\mu_i$ is $O_p(1/n)$ in situations where the influence function exists boundedly; see Li (1987) for a more careful discussion. In situation (4.10), $\hat\mu_i - \tilde\mu_i = M_{ii}(y_i - \tilde\mu_i)$, so that $M_{ii} = O(1/n)$, as in (2.13), implies $\hat\mu_i - \tilde\mu_i = O_p(1/n)$. Similarly $\hat\mu_i^* - \tilde\mu_i = O_p(1/n)$ in (4.6). If the function $q(\mu)$ of Figure 4 is locally quadratic near $\tilde\mu_i$, then $Q(\tilde\mu_i, \hat\mu_i^*)$ in (4.6) will be $O_p(1/n^2)$, as claimed in (4.8).

Order of magnitude asymptotics are only a rough guide to practice, and are not crucial to the methods discussed here. In any given situation bootstrap calculations such as (3.17) will give valid estimates whether or not (2.13) is meaningful.

Remark I. A prediction rule is "self-stable" if adding a new point $(x_i, y_i)$ that falls exactly on the prediction surface does not change the prediction at $x_i$; in notation (4.20), if
\[
m\bigl(x_i, \mathbf{v}_{(i)} \cup (x_i, \tilde\mu_i)\bigr) = \tilde\mu_i. \tag{4.23}
\]
This implies $\tilde\mu_i = \sum_{j \ne i}M_{ij}y_j + M_{ii}\tilde\mu_i$ for a linear rule, which is (4.10). Any "least-$Q$" rule, which chooses $\hat{\boldsymbol\mu}$ by minimizing $\sum Q(y_i, \mu_i)$ over some candidate collection of possible $\boldsymbol\mu$'s, must be self-stable, and this class can be extended by adding penalty terms, as with smoothing splines. Maximum likelihood estimation in ordinary or curved GLM's belongs to the least-$Q$ class.

5. A SIMULATION

Here is a small simulation study intended to illustrate covariance penalty/cross-validation relationships in a Bernoulli data setting. Figure 7 shows the underlying model used to generate the simulations. There are 30 bivariate vectors $x_i$ and their associated probabilities $\mu_i$,
\[
(x_i, \mu_i), \qquad i = 1, 2, \ldots, 30, \tag{5.1}
\]
from which we generated 200 30-dimensional Bernoulli response vectors
\[
\mathbf{y} \sim \text{Be}(\boldsymbol\mu), \tag{5.2}
\]
as in (3.2). [The underlying model (5.1) was itself randomly generated by 30 independent replications of
\[
Y_i \sim \text{Be}\Bigl(\frac12\Bigr) \quad\text{and}\quad x_i \sim N_2\Bigl(\Bigl(Y_i - \frac12, 0\Bigr), \mathbf{I}\Bigr), \tag{5.3}
\]
with $\mu_i$ the Bayesian posterior $\text{Prob}\{Y_i = 1|x_i\}$.]

Figure 7. Underlying Model Used for Simulation Study: n = 30 Bivariate Vectors $x_i$ and Associated Probabilities $\mu_i$ (5.1).

Our prediction rule $\hat{\boldsymbol\mu} = m(\mathbf{y})$ was based on the coefficients

with microi the Bayesian posterior ProbYi = 1|xi]Our prediction rule micro = m(y) was based on the coefficients

for Fisherrsquos linear discriminant boundary α + β primex = 0

microi = 1[1 + eminus(α+β primexi)

] (54)

Equation (213) of Efron (1983) describes the (α β) computa-tions Rule (54) is not the logistic regression estimate of microi andin fact will be somewhat more efficient given mechanism (53)(Efron 1975)

Binomial deviance error (3.4) was used to assess prediction accuracy. Three estimates of the total expected optimism $\Omega = \sum_{i=1}^{30}\Omega_i$ (3.10) were computed for each of the 200 $\mathbf{y}$ vectors: the cross-validation estimate $\widetilde{O} = \sum\widetilde{O}_i$ (4.3), the parametric bootstrap estimate $2\sum\widehat{\text{cov}}_i$ (3.17) with $\mathbf{y}^* \sim \text{Be}(\hat{\boldsymbol\mu})$, and the Steinian $2\sum\widehat{\text{cov}}_{(i)}$ (3.22).

timates (ie estimates of optimism2) The Steinian and para-metric bootstrap gave similar results correlation 72 with thebootstrap estimates slightly but consistently larger Strikingly

(a)

(b)

Figure 8 Degree-of-Freedom Estimates (optimism2) 200 Simula-tions (52) and (54) The two covariance penalty estimates Steinian (a)and parametric bootstrap (b) have about one-third the standard devia-tion of cross-validation Error measured by binomial deviance (34) trueΩ2 = 157

628 Journal of the American Statistical Association September 2004

the cross-validation estimates were much more variable havingabout three terms larger standard deviation than either covari-ance penalty All three methods were reasonably well centerednear the true value 2 = 157

Figure 8 exemplifies the Rao–Blackwell relationship (4.8), which guarantees that cross-validation will be more variable than covariance penalties. The comparison would have been more extreme if we had estimated $\boldsymbol\mu$ by logistic regression rather than (5.4), in which case the covariance penalties would be nearly constant while cross-validation would still vary.

In our simulation study we can calculate the true total optimism (3.28) for each $\mathbf{y}$,
\[
O = 2\sum_{i=1}^n \hat\lambda_i\cdot(y_i - \mu_i). \tag{5.5}
\]
Figure 9 plots the Steinian estimates versus $O/2$ for the 200 simulations. The results illustrate an unfortunate phenomenon noted in Efron (1983): Optimism estimates tend to be small when they should be large, and vice versa. Cross-validation or the parametric bootstrap exhibited the same inverse relationships. The best we can hope for is to estimate the expected optimism $\Omega$.

If we are trying to estimate $\text{Err} = \text{err} + O$ with $\widehat{\text{Err}} = \text{err} + \widehat\Omega$, then
\[
\widehat{\text{Err}} - \text{Err} = \widehat\Omega - O, \tag{5.6}
\]
so inverse relationships such as those in Figure 9 make $\widehat{\text{Err}}$ less accurate. Table 1 shows estimates of $E\{(\widehat{\text{Err}} - \text{Err})^2\}$ from the simulation experiment.

None of the methods did very much better than simply estimating Err by the apparent error, that is, taking $\widehat\Omega = 0$, and cross-validation was actually worse. It is easy to read too much

Figure 9. Steinian Estimate versus True Optimism/2 (5.5) for the 200 Simulations. Similar inverse relationships hold for parametric bootstrap or cross-validation.

Table 1. Average $(\widehat{\text{Err}} - \text{Err})^2$ for the 200 Simulations. "Apparent" takes $\widehat{\text{Err}} = \text{err}$ (i.e., $\widehat\Omega = 0$); Outsample averages discussed in Remark L. All three methods outperformed err when the prediction rule was ordinary logistic regression.

                          Steinian   ParBoot   CrossVal   Apparent
E(Err-hat − Err)²             53.9      52.9       63.3       57.8
Outsample                     59.4      58.2       68.9       64.1
Logistic regression           36.2      34.6       33.2       53.8

into these numbers. The two points at the extreme right of Figure 9 contribute heavily to the comparison, as do other details of the computation; see Remarks J and L. Perhaps the main point is that the efficiency of covariance penalties helps more in estimating $\Omega$ than in estimating Err. Estimating $\Omega$ can be important in its own right because it provides df values for the comparison, formal or informal, of different models, as emphasized in Ye (1998). Also the values of $\widehat{\text{df}}_i$ as a function of $x_i$, as in Figure 2, are a useful diagnostic for the geometry of the fitting process.

The bottom line of Table 1 reports $E\{(\widehat{\text{Err}} - \text{Err})^2\}$ for the prediction rule "ordinary logistic regression" rather than (5.4). Now all three methods handily beat the apparent error. The average prediction error Err was much bigger for logistic regression, 61.5 versus 29.3 for (5.4), but Err was easier to estimate for logistic regression.

Remark J. Four of the cross-validation estimates, corresponding to the rightmost points in Figure 9, were negative (ranging from −.9 to −2.8). These were truncated at 0 in Figure 8 and Table 1. The parametric bootstrap estimates were based on only $B = 100$ replications per case, leaving substantial simulation error. Standard components-of-variance calculations for the 200 cases were used in Figure 8 and Table 1 to approximate the ideal situation $B = \infty$.

Remark K. The asymptotics in Li (1985) imply that in his setting it is possible to estimate the optimism itself, rather than its expectation. However, the form of (5.5) strongly suggests that $O$ is unestimable in the Bernoulli case, since it directly involves the unobservable componentwise differences $y_i - \mu_i$.

Remark L. $\text{Err} = \sum\text{Err}_i$ (3.8) is the total prediction error at the $n$ observed covariate points $x_i$. "Outsample error,"
\[
\text{Err}_{\text{out}} = n\cdot E_0\bigl\{Q\bigl(y^0, m(x^0, \mathbf{v})\bigr)\bigr\}, \tag{5.7}
\]
where the training set $\mathbf{v}$ is fixed while $v^0 = (x^0, y^0)$ is an independent random test point drawn from $F$ (4.21), is the natural setting for cross-validation. (The factor $n$ is included for comparison with Err.) See section 7 of Efron (1986). The second line of Table 1 shows that replacing Err with $\text{Err}_{\text{out}}$ did not affect our comparisons. Formula (4.14) suggests that this might be less true if our estimation rule had been badly biased.

Table 2 shows the comparative ranks of $|\widehat{\text{Err}} - \text{Err}|$ for the four methods of Table 1 applied to rule (5.4). For example, the Steinian was best in 14 of the 200 simulations and worst only once. The corresponding ranks are also shown for $|\widehat{\text{Err}} - \text{Err}_{\text{out}}|$, with very similar results. Cross-validation performed poorly; apparent error tended to be either best or worst; the Steinian was usually second or third, while the parametric bootstrap spread rather evenly across the four ranks.

Table 2. Left: Comparative Ranks of |Err-hat − Err| for the 200 Simulations (5.1)–(5.4); Right: Same for |Err-hat − Err_out|

Rank        Stein  ParBoot  CrVal  App      Stein  ParBoot  CrVal  App
1              14       48     33  105         17       50     32  101
2             106       58     31    5        104       58     35    3
3              79       56     55   10         78       54     55   13
4               1       38     81   80          1       38     78   83
Mean rank    2.34     2.42   2.92  2.33       2.31     2.40   2.90  2.39

6. THE NONPARAMETRIC BOOTSTRAP

Nonparametric bootstrap methods for estimating prediction error depend on simple random resamples $\mathbf{v}^* = (v_1^*, v_2^*, \ldots, v_n^*)$ from the training set $\mathbf{v}$ (4.19), rather than parametric resamples as in (2.14). Efron (1983) examined the relationship between the nonparametric bootstrap and cross-validation. This section develops a Rao–Blackwell type of connection between the nonparametric and parametric bootstrap methods, similar to Section 4's cross-validation results.

Suppose we have constructed $B$ nonparametric bootstrap samples $\mathbf{v}^*$, each of which gives a bootstrap estimate $\hat{\boldsymbol\mu}^*$, with $\hat\mu_i^* = m(x_i, \mathbf{v}^*)$ in the notation of (4.20). Let $N_i^b$ indicate the number of times $v_i$ occurs in bootstrap sample $\mathbf{v}^{*b}$, $b = 1, 2, \ldots, B$; define the indicator
\[
I_i^b(h) = \begin{cases} 1 & \text{if } N_i^b = h \\ 0 & \text{if } N_i^b \ne h, \end{cases} \tag{6.1}
\]
$h = 0, 1, \ldots, n$, and let $Q_i(h)$ be the average error when $N_i^b = h$,
\[
Q_i(h) = \sum_b I_i^b(h)\,Q\bigl(y_i, \hat\mu_i^{*b}\bigr)\Big/\sum_b I_i^b(h). \tag{6.2}
\]
We expect $Q_i(0)$, the average error when $v_i$ is not involved in the bootstrap prediction of $y_i$, to be larger than $Q_i(1)$, which will be larger than $Q_i(2)$, and so on.

A useful class of nonparametric bootstrap optimism estimates takes the form
\[
\bar{O}_i = \sum_{h=1}^n B(h)\,S_i(h), \qquad S_i(h) = \frac{Q_i(0) - Q_i(h)}{h}, \tag{6.3}
\]
the "$S$" standing for "slope." Letting $p_n(h)$ be the binomial resampling probability
\[
p_n(h) = \text{Prob}\bigl\{\text{Bi}(n, 1/n) = h\bigr\} = \binom{n}{h}\frac{(n-1)^{n-h}}{n^n}, \tag{6.4}
\]
section 8 of Efron (1983) considers two particular choices of $B(h)$:
\[
\text{"}\omega^{(\text{boot})}\text{":}\;\; B(h) = h(h-1)p_n(h) \quad\text{and}\quad \text{"}\omega^{(0)}\text{":}\;\; B(h) = h\,p_n(h). \tag{6.5}
\]
Here we will concentrate on (6.3) with $B(h) = h\,p_n(h)$, which is convenient but not crucial to the discussion. Then $B(h)$ is a probability distribution, $\sum_1^n B(h) = 1$, with expectation
\[
\sum_1^n B(h)\cdot h = 1 + \frac{n-1}{n}. \tag{6.6}
\]
The estimate $\bar{O}_i = \sum_1^n B(h)S_i(h)$ is seen to be a weighted average of the downward slopes $S_i(h)$. Most of the weight is on the first few values of $h$, because $B(h)$ rapidly approaches the shifted Poisson(1) density $e^{-1}/(h-1)!$ as $n \rightarrow \infty$.

We first consider a conditional version of the nonparametric bootstrap. Define $\mathbf{v}_{(i)}(h)$ to be the augmented training set
\[
\mathbf{v}_{(i)}(h) = \mathbf{v}_{(i)} \cup \{h \text{ copies of } (x_i, y_i)\}, \qquad h = 0, 1, \ldots, n, \tag{6.7}
\]
giving corresponding estimates $\hat\mu_i(h) = m(x_i, \mathbf{v}_{(i)}(h))$ and $\hat\lambda_i(h) = -\dot q(\hat\mu_i(h))/2$. For $\mathbf{v}_{(i)}(0) = \mathbf{v}_{(i)}$, the training set with $v_i = (x_i, y_i)$ removed, $\hat\mu_i(0) = \tilde\mu_i$ (4.2) and $\hat\lambda_i(0) = \tilde\lambda_i$. The conditional version of definition (6.3) is
\[
\bar{O}_{(i)} = \sum_{h=1}^n B(h)\,S_i(h), \qquad S_i(h) = \bigl[Q(y_i, \tilde\mu_i) - Q\bigl(y_i, \hat\mu_i(h)\bigr)\bigr]\big/h. \tag{6.8}
\]
This is defined to be the conditional nonparametric bootstrap estimate of the conditional optimism $\Omega_{(i)}$ (3.20). Notice that setting $B(h) = (1, 0, \ldots, 0)$ would make $\bar{O}_{(i)}$ equal $\widetilde{O}_i$, the cross-validation estimate (4.3).

As before, we can average $\bar{O}_{(i)}(\mathbf{y})$ over conditional parametric resamples $\mathbf{y}^* = (\mathbf{y}_{(i)}, y_i^*)$ (4.5), with $\mathbf{y}_{(i)}$ and $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ fixed. That is, we can parametrically average the nonparametric bootstrap. The proof of Theorem 1 applies here, giving a similar result.

Theorem 2. Assuming (4.5), the conditional parametric bootstrap expectation of $\bar{O}_{(i)}^* = \bar{O}_{(i)}(\mathbf{y}_{(i)}, y_i^*)$ is
\[
E_{(i)}\bigl\{\bar{O}_{(i)}^*\bigr\}
= 2\sum_{h=1}^n B(h)\,\frac{\widehat{\text{cov}}_{(i)}(h)}{h}
- \sum_{h=1}^n B(h)\,\frac{E_{(i)}\bigl\{Q\bigl(\tilde\mu_i, \hat\mu_i^*(h)\bigr)\bigr\}}{h}, \tag{6.9}
\]
where
\[
\widehat{\text{cov}}_{(i)}(h) = E_{(i)}\bigl\{\hat\lambda_i^*(h)\bigl(y_i^* - \tilde\mu_i\bigr)\bigr\}. \tag{6.10}
\]
The second term on the right side of (6.9) is $O_p(1/n^2)$, as in (4.8), giving
\[
E_{(i)}\bigl\{\bar{O}_{(i)}^*\bigr\} \doteq 2\sum_{h=1}^n B(h)\,\frac{\widehat{\text{cov}}_{(i)}(h)}{h}. \tag{6.11}
\]

Point $v_i$ has $h$ times more weight in the augmented training set $\mathbf{v}_{(i)}(h)$ than in $\mathbf{v} = \mathbf{v}_{(i)}(1)$, so, as in (4.22), influence function calculations suggest
\[
\hat\mu_i^*(h) - \tilde\mu_i \doteq h\cdot\bigl(\hat\mu_i^* - \tilde\mu_i\bigr)
\quad\text{and}\quad
\widehat{\text{cov}}_{(i)}(h) \doteq h\cdot\widehat{\text{cov}}_{(i)}, \tag{6.12}
\]
$\hat\mu_i^* = \hat\mu_i^*(1)$, so that (6.11) becomes
\[
E_{(i)}\bigl\{\bar{O}_{(i)}^*\bigr\} \doteq 2\,\widehat{\text{cov}}_{(i)} = \widehat\Omega_{(i)}. \tag{6.13}
\]
Averaging the conditional nonparametric bootstrap estimates over parametric resamples $(\mathbf{y}_{(i)}, y_i^*)$ results in a close approximation to the conditional covariance penalty $\widehat\Omega_{(i)}$.


Expression (6.9) can be exactly evaluated for linear projection estimates $\hat{\boldsymbol\mu} = M\mathbf{y}$ (using squared error loss),
\[
M = X(X'X)^{-1}X', \qquad X' = (x_1, x_2, \ldots, x_n). \tag{6.14}
\]
Then the projection matrix corresponding to $\mathbf{v}_{(i)}(h)$ has $i$th diagonal element
\[
M_{ii}(h) = \frac{hM_{ii}}{1 + (h-1)M_{ii}}, \qquad M_{ii} = M_{ii}(1) = x_i'(X'X)^{-1}x_i, \tag{6.15}
\]
and if $y_i^* \sim (\tilde\mu_i, \sigma^2)$ with $\mathbf{y}_{(i)}$ fixed, then $\widehat{\text{cov}}_{(i)}(h) = \sigma^2 M_{ii}(h)$. Using (6.6) and the self-stable relationship $\hat\mu_i - \tilde\mu_i = M_{ii}(y_i - \tilde\mu_i)$, (6.9) can be evaluated as
\[
E_{(i)}\bigl\{\bar{O}_{(i)}^*\bigr\} = \widehat\Omega_{(i)}\cdot[1 - 4M_{ii}]. \tag{6.16}
\]

In this case (6.13) errs by a factor of only $[1 + O(1/n)]$.

Result (6.12) implies an approximate Rao–Blackwell relationship between nonparametric and parametric bootstrap optimism estimates when both are carried out conditionally on $\mathbf{v}_{(i)}$. As with cross-validation, this relationship seems to extend to the more familiar unconditional bootstrap estimator. Figure 10 concerns the kidney data and squared error loss, where this time the fitting rule $\hat{\boldsymbol\mu} = m(\mathbf{y})$ is "loess(tot $\sim$ age, span = .5)." Loess, unlike lowess, is a linear rule $\hat{\boldsymbol\mu} = M\mathbf{y}$, although it is not self-stable. The solid curve traces the coordinatewise covariance penalty $\widehat{\text{df}}_i$ estimates $M_{ii}$, as a function of $\text{age}_i$.

The small points in Figure 10 represent individual unconditional nonparametric bootstrap $\widehat{\text{df}}_i$ estimates $\bar{O}_i^*/2\hat\sigma^2$ (6.3), evaluated for 50 parametric bootstrap data vectors $\mathbf{y}^*$ obtained as in (2.17); Remark M provides the details. Their means across the 50 replications, the triangles, follow the $M_{ii}$ curve. As with cross-validation, if we attempt to improve nonparametric bootstrap optimism estimates by averaging across the $\mathbf{y}^*$ vectors giving the covariance penalty $\widehat\Omega_i$, we wind up close to $\widehat\Omega_i$ itself.

As in Figure 8, we can expect nonparametric bootstrap df estimates to be much more variable than covariance penalties. Various versions of the nonparametric bootstrap, particularly the ".632 rule," outperformed cross-validation in Efron (1983) and Efron and Tibshirani (1997), and may be preferred when nonparametric error predictions are required. However, covariance penalty methods offer increased accuracy whenever their underlying models are believable.

A general verification of the results of Figure 10, linking the unconditional versions of the nonparametric and parametric bootstrap df estimates, is conjectural at this point. Remark N outlines a plausibility argument.

Remark M. Figure 10 involved two resampling levels. Parametric bootstrap samples $\mathbf{y}^{*a} = \hat{\boldsymbol\mu} + \boldsymbol\varepsilon^{*a}$ were drawn as in (2.17), for $a = 1, 2, \ldots, 50$, with $\hat{\boldsymbol\mu}$ and the residuals $\hat\varepsilon_j = y_j - \hat\mu_j$ determined by loess(span = .5); then $B = 200$ nonparametric bootstrap samples were drawn from each set $\mathbf{v}^{*a} = (v_1^{*a}, v_2^{*a}, \ldots, v_n^{*a})$, with say $N_j^{ab}$ repetitions of $v_j^{*a} = (x_j, y_j^{*a})$ in the $ab$th nonparametric resample, $b = 1, 2, \ldots, B$. For each "$a$", the $n \times B$ matrices of counts $N_i^{ab}$ and estimates $\hat\mu_i^{*ab}$ gave $Q_i(h)^{*a}$, $S_i(h)^{*a}$, and $\bar{O}_i^{*a}$, as in (6.2)–(6.3). The points $\bar{O}_i^{*a}/2\hat\sigma^2$ (with $\hat\sigma^2 = \sum\hat\varepsilon_j^2/n$) are the small dots in Figure 10, while the triangles are their averages over $a = 1, 2, \ldots, 50$. Standard $t$ tests accepted the null hypotheses that the averages were centered on the solid curve.

Figure 10. Small Dots Indicate 50 Parametric Bootstrap Replications of Unconditional Nonparametric Optimism Estimates (6.3); Triangles: Their Averages Closely Follow the Covariance Penalty Estimates (solid curve). Vertical distance plotted in df units. Here the estimation rule is loess(span = .5). See Remark M for details.

Remark N. Theorem 2 applies to parametric averaging of the conditional nonparametric bootstrap. For the usual unconditional nonparametric bootstrap, the bootstrap weights $N_j$ on the points $v_j = (x_j, y_j)$ in $\mathbf{v}_{(i)}$ vary, so that the last term in (4.4) is no longer negated by assumption (4.5). Instead it adds a remainder term to (6.9),
\[
-2\sum_{h=1}^n B(h)\,E_{(i)}\bigl\{\bigl(\hat\lambda_i^*(h) - \tilde\lambda_i^*\bigr)\bigl(\tilde\mu_i^* - \tilde\mu_i\bigr)\bigr\}\big/h. \tag{6.17}
\]
Here $\tilde\mu_i^* = m(x_i, \mathbf{v}_{(i)}^*)$, where $\mathbf{v}_{(i)}^*$ puts weight $N_j$ on $v_j$ for $j \ne i$, and $\tilde\lambda_i^* = -\dot q(\tilde\mu_i^*)/2$.

To justify approximation (6.13), we need to show that (6.17) is $o_p(1/n)$. This can be demonstrated explicitly for linear projections (6.14). The result seems plausible in general, since $\hat\lambda_i^*(h) - \tilde\lambda_i^*$ is $O_p(1/n)$, while $\tilde\mu_i^* - \tilde\mu_i$, the nonparametric bootstrap deviation of $\tilde\mu_i^*$ from $\tilde\mu_i$, would usually be $O_p(1/\sqrt{n}\,)$.

7. SUMMARY

Figure 11 classifies prediction error estimates on two criteria: parametric (model-based) versus nonparametric, and conditional versus unconditional. The classification can also be described by which parts of the training set $(x_j, y_j)$, $j = 1, 2, \ldots, n$, are varied in the error rate computations. The Steinian only varies $y_i$ in estimating the $i$th error rate, keeping all the covariates $x_j$ and also $y_j$ for $j \ne i$ fixed; at the other extreme, the nonparametric bootstrap simultaneously varies the entire training set.

Figure 11. Two-Way Classification of Prediction Error Estimates Discussed in This Article. The conditional methods are local in the sense that only the $i$th case data are varied in estimating the $i$th error rate.

Here are some comparisons and comments concerning the four methods:

• The parametric methods require modeling assumptions in order to carry out the covariance penalty calculations. When these assumptions are justified, the Rao–Blackwell type of results of Sections 4 and 6 imply that the parametric techniques will be more efficient than their nonparametric counterparts, particularly for estimating degrees of freedom.

• The modeling assumptions need not rely on the estimation rule $\hat\mu = m(y)$ under investigation. We can use "bigger" models, as in Remark A, that is, ones less likely to be biased.

• Modeling assumptions are less important for rules $\hat\mu = m(y)$ that are close to linear. In genuinely linear situations, such as those needed for the $C_p$ and AIC criteria, the covariance corrections are constants that do not depend on the model at all. The centralized version of SURE (3.25) extends this property to maximum likelihood estimation in curved families.

• Local methods extrapolate error estimates from small changes in the training set. Global methods make much larger changes in the training set, of a size commensurate with actual random sampling, which is an advantage in dealing with "rough" rules $m(y)$ such as nearest neighbors or classification trees; see Efron and Tibshirani (1997).

• Stein's SURE criterion (2.11) is local, because it depends on partial derivatives, and parametric (2.9), without being model based. It performed more like cross-validation than the parametric bootstrap in the situation of Figure 2.

• The computational burden in our examples was less for global methods. Equation (2.18), with $\lambda^{*b}_i$ replacing $\hat\mu^{*b}_i$ for general error measures, helps determine the number of replications $B$ required for the parametric bootstrap; a rough sketch of this calculation follows the list. Grouping, the usual labor-saving tactic in applying cross-validation, can also be applied to covariance penalty methods, as in Remark G, though now it is not clear that this is computationally helpful.

• As shown in Remark B, the bootstrap method's computations can also be used for hypothesis tests comparing the efficacy of different models.
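
Picking up the computational-burden point from the list, the following sketch (squared error only, again with the illustrative stand-in smoother; the Monte Carlo error is written in the spirit of (2.18) rather than copied from it) computes the parametric bootstrap covariance penalty (2.15)–(2.16) together with a simulation standard error that can be monitored when choosing $B$:

```python
import numpy as np
# assumes the illustrative fit_predict smoother from the earlier sketch is in scope

def cov_penalty_df(x, y, B=1000, rng=np.random.default_rng(2)):
    n = len(y)
    mu_hat = fit_predict(x, y, x)
    resid = y - mu_hat
    sigma2_hat = np.mean(resid ** 2)
    y_star = mu_hat + rng.choice(resid, size=(B, n), replace=True)   # resamples as in (2.17)
    mu_star = np.array([fit_predict(x, ys, x) for ys in y_star])
    dev = y_star - y_star.mean(axis=0)                               # y*_bi - y*._i
    cov_i = (mu_star * dev).sum(axis=0) / (B - 1)                    # covariance estimates (2.15)
    df_hat = cov_i.sum() / sigma2_hat                                # total df estimate
    err = np.sum((y - mu_hat) ** 2)
    Err_hat = err + 2 * cov_i.sum()                                  # prediction error estimate (2.16)
    # rough Monte Carlo standard error of df_hat, in the spirit of (2.18)
    C_b = (mu_star * dev).mean(axis=1)                               # per-replication pieces
    mc_se = (n / sigma2_hat) * C_b.std(ddof=1) / np.sqrt(B)
    return df_hat, Err_hat, mc_se
```

Increasing $B$ until the returned Monte Carlo standard error is small relative to df_hat is one practical stopping rule.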

Accurate estimation of prediction error tends to be difficult in practice, particularly when applied to the choice between competing rules $\hat\mu = m(y)$. In the author's opinion it will often be worth chancing plausible modeling assumptions for the covariance penalty estimates, rather than relying entirely on nonparametric methods.

[Received December 2002. Revised October 2003.]

REFERENCES

Akaike, H. (1973), "Information Theory and an Extension of the Maximum Likelihood Principle," Second International Symposium on Information Theory, 267–281.

Breiman, L. (1992), "The Little Bootstrap and Other Methods for Dimensionality Selection in Regression: X-Fixed Prediction Error," Journal of the American Statistical Association, 87, 738–754.

Efron, B. (1975), "Defining the Curvature of a Statistical Problem (With Applications to Second Order Efficiency)" (with discussion), The Annals of Statistics, 3, 1189–1242.

——— (1975), "The Efficiency of Logistic Regression Compared to Normal Discriminant Analyses," Journal of the American Statistical Association, 70, 892–898.

——— (1983), "Estimating the Error Rate of a Prediction Rule: Improvement on Cross-Validation," Journal of the American Statistical Association, 78, 316–331.

——— (1986), "How Biased Is the Apparent Error Rate of a Prediction Rule?" Journal of the American Statistical Association, 81, 461–470.


Efron, B., and Tibshirani, R. (1997), "Improvements on Cross-Validation: The .632+ Bootstrap Method," Journal of the American Statistical Association, 92, 548–560.

Hampel, F., Ronchetti, E., Rousseeuw, P., and Stahel, W. (1986), Robust Statistics: The Approach Based on Influence Functions, New York: Wiley.

Hastie, T., and Tibshirani, R. (1990), Generalized Additive Models, London: Chapman & Hall.

Li, K. (1985), "From Stein's Unbiased Risk Estimates to the Method of Generalized Cross-Validation," The Annals of Statistics, 13, 1352–1377.

——— (1987), "Asymptotic Optimality for Cp, CL, Cross-Validation and Generalized Cross-Validation: Discrete Index Set," The Annals of Statistics, 15, 958–975.

Mallows, C. (1973), "Some Comments on Cp," Technometrics, 15, 661–675.

Rockafellar, R. (1970), Convex Analysis, Princeton, NJ: Princeton University Press.

Shen, X., Huang, H.-C., and Ye, J. (2004), "Adaptive Model Selection and Assessment for Exponential Family Models," Technometrics, to appear.

Shen, X., and Ye, J. (2002), "Adaptive Model Selection," Journal of the American Statistical Association, 97, 210–221.

Stein, C. (1981), "Estimation of the Mean of a Multivariate Normal Distribution," The Annals of Statistics, 9, 1135–1151.

Tibshirani, R., and Knight, K. (1999), "The Covariance Inflation Criterion for Adaptive Model Selection," Journal of the Royal Statistical Society, Ser. B, 61, 529–546.

Ye, J. (1998), "On Measuring and Correcting the Effects of Data Mining and Model Selection," Journal of the American Statistical Association, 93, 120–131.

Comment

Prabir BURMAN

Prabir Burman is Professor, Department of Statistics, University of California, Davis, CA 95616 (E-mail: burman@wald.ucdavis.edu).

I would like to begin by thanking Professor Efron for writing a paper that sheds new light on cross-validation and related methods, along with his proposals for stable model selection procedures. Stability can be an issue for ordinary cross-validation, especially for not-so-smooth procedures such as stepwise regression and other such sequential procedures. If we use the language of learning and test sets, ordinary cross-validation uses a learning set of size n − 1 and a test set of size 1. If one can average over a test set of infinite size, then one gets a stable estimator. Professor Efron demonstrates this leads to a Rao–Blackwellization of ordinary cross-validation.

The parametric bootstrap proposed here does require knowing the conditional distribution of an observation given the rest, which in turn requires a knowledge of the unknown parameters. Professor Efron argues that for a near-linear case this poses no problem. A question naturally arises: What happens to those cases where the methods are considerably more complicated, such as stepwise methods?

Another issue that is not entirely clear is the choice between the conditional and unconditional bootstrap methods. The conditional bootstrap seems to be better, but it can be quite expensive computationally. Can the unconditional bootstrap be used as a general method always?

It seems important to point out that ordinary cross-validation is not as inadequate as the present paper seems to suggest. If model selection is the goal, then estimation of the overall prediction error is what one can concentrate on. Even if the componentwise errors are not necessarily small, ordinary cross-validation may still provide reasonable estimates, especially if the sample size n is at least moderate and the estimation procedure is reasonably smooth.

In this connection I would like to point out that methods such as repeated v-fold (or multifold) cross-validation or repeated (or bootstrapped) learning–testing can improve on ordinary cross-validation because the test set sizes are not necessarily small (see, e.g., the CART book by Breiman, Friedman, Olshen, and Stone 1984; Burman 1989; Zhang 1993). In addition, such methods can reduce computational costs substantially. In a repeated v-fold cross-validation the data are repeatedly randomly split into v groups of roughly equal sizes. For each repeat there are v learning sets and the corresponding test sets. Each learning set is of size n(1 − 1/v) approximately, and each test set is of size n/v. In a repeated learning–testing method the data are randomly split into a learning set of size n(1 − p) and a test set of size np, where 0 < p < 1. If one takes a small v, say v = 3, in a v-fold cross-validation, or a value of p = 1/3 in a repeated learning–testing method, then each test set is of size n/3. However, a correction term is needed in order to account for the fact that each learning set is of a size that is considerably smaller than n (Burman 1989).
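
A small sketch of repeated v-fold cross-validation of the kind described above; the particular correction term used (add the full-data apparent error, subtract the average apparent error of the fold models over all n cases) is an assumed reading of Burman (1989) and should be checked against that paper, and fit_predict and loss are user-supplied placeholders:

```python
import numpy as np

def repeated_vfold_cv(x, y, fit_predict, loss, v=3, repeats=10,
                      rng=np.random.default_rng(3)):
    n = len(y)
    apparent = loss(y, fit_predict(x, y, x)).mean()       # full-data apparent error
    cv, corrected = [], []
    for _ in range(repeats):
        folds = np.array_split(rng.permutation(n), v)
        cv_err, fold_apparent = 0.0, 0.0
        for test in folds:
            train = np.setdiff1d(np.arange(n), test)
            mu_test = fit_predict(x[train], y[train], x[test])
            cv_err += loss(y[test], mu_test).sum() / n
            # assumed correction ingredient: fold model evaluated on all n cases
            fold_apparent += loss(y, fit_predict(x[train], y[train], x)).mean() / v
        cv.append(cv_err)
        corrected.append(cv_err + apparent - fold_apparent)
    return np.mean(cv), np.mean(corrected)

# e.g., squared error:  repeated_vfold_cv(x, y, fit_predict, lambda y, m: (y - m) ** 2)
```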

I ran a simulation for the classification case with the model: Y is Bernoulli(π(X)), where π(X) = 1 − sin²(2πX) and X is Uniform(0, 1). A majority voting scheme was used among the k nearest neighbors. True misclassification errors (in percent) and their standard errors (in parentheses) are given in Table 1, along with ordinary cross-validation, corrected threefold cross-validation with 10 repeats, and corrected repeated learning–testing (RLT) methods with p = .33 and 30 repeats. The sample size is n = 100 and the number of replications is 25,000. It can be seen that the corrected v-fold cross-validation or repeated learning–testing methods can provide some improvement over ordinary cross-validation.
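
For concreteness, a single replication of this simulation setup can be sketched as follows (illustrative code only, not the 25,000-replication study behind Table 1; a large fresh test sample stands in for the true misclassification error):

```python
import numpy as np

rng = np.random.default_rng(4)

def pi_x(x):
    return 1.0 - np.sin(2 * np.pi * x) ** 2

def knn_vote(x_train, y_train, x_eval, k):
    """k-nearest-neighbor majority vote in one dimension."""
    nbrs = np.argsort(np.abs(x_eval[:, None] - x_train[None, :]), axis=1)[:, :k]
    return (y_train[nbrs].mean(axis=1) > 0.5).astype(int)

n, k = 100, 7
x = rng.uniform(size=n)
y = rng.binomial(1, pi_x(x))

# "true" misclassification error of the rule built on (x, y), via a large test sample
x_test = rng.uniform(size=20000)
y_test = rng.binomial(1, pi_x(x_test))
true_err = np.mean(knn_vote(x, y, x_test, k) != y_test)

# ordinary (leave-one-out) cross-validation estimate
cv_err = np.mean([knn_vote(np.delete(x, i), np.delete(y, i), x[i:i + 1], k)[0] != y[i]
                  for i in range(n)])
print(round(100 * true_err, 1), round(100 * cv_err, 1))   # in percent, as in Table 1
```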

I would like to end my comments with thanks to Professor Efron for providing significant and valuable insights into the subject of model selection and for developing new methods that are improvements over a popular method such as ordinary cross-validation.

Table 1. Classification Error Rates (in percent)

TCH              k = 7          k = 9          k = 11
True             22.82 (4.14)   24.55 (4.87)   27.26 (5.43)
CV               22.95 (7.01)   24.82 (7.18)   27.65 (7.55)
Threefold CV     22.74 (5.61)   24.56 (5.56)   27.04 (5.82)
RLT              22.70 (5.70)   24.55 (5.65)   27.11 (5.95)

© 2004 American Statistical Association
Journal of the American Statistical Association
September 2004, Vol. 99, No. 467, Theory and Methods
DOI 10.1198/016214504000000908

2) with y(i) fixed then cov(i) = σ 2Mii(h) Us-ing (66) and the self-stable relationship microi minus microi = Mii( yi minus microi)(69) can be evaluated as

E(i)Olowasti = (i) middot [1 minus 4Mii] (616)

In this case (613) errs by a factor of only [1 + O(1n)]Result (612) implies an approximate RaondashBlackwell rela-

tionship between nonparametric and parametric bootstrap opti-mism estimates when both are carried out conditionally on v(i)As with cross-validation this relationship seems to extend tothe more familiar unconditional bootstrap estimator Figure 10concerns the kidney data and squared error loss where thistime the fitting rule micro = m(y) is ldquoloess(tot sim age span = 5)rdquoLoess unlike lowess is a linear rule micro = My although it is notself-stable The solid curve traces the coordinatewise covari-ance penalty dfi estimates Mii as a function of agei

The small points in Figure 10 represent individual uncon-ditional nonparametric bootstrap dfi estimates Olowast

i 2σ 2 (63)

evaluated for 50 parametric bootstrap data vectors ylowast obtainedas in (217) Remark M provides the details Their means acrossthe 50 replications the triangles follow the Mii curve As withcross-validation if we attempt to improve nonparametric boot-strap optimism estimates by averaging across the ylowast vectorsgiving the covariance penalty i we wind up close to i it-self

As in Figure 8 we can expect nonparametric bootstrap df esti-mates to be much more variable than covariance penalties Var-ious versions of the nonparametric bootstrap particularly theldquo632 rulerdquo outperformed cross-validation in Efron (1983) andEfron and Tibshirani (1997) and may be preferred when non-parametric error predictions are required However covariancepenalty methods offer increased accuracy whenever their un-derlying models are believable

A general verification of the results of Figure 10 linkingthe unconditional versions of the nonparametric and parametricbootstrap df estimates is conjectural at this point Remark Noutlines a plausibility argument

Remark M Figure 10 involved two resampling levels Para-metric bootstrap samples ylowasta = micro+εlowasta were drawn as in (217)for a = 12 50 with micro and the residuals εj = yj minus microj

determined by loess(span = 5) then B = 200 nonparametricbootstrap samples were drawn from each set vlowasta = (vlowasta

1 vlowasta2

vlowastan ) with say Nab

j repetitions of vlowastaj = (xj ylowasta

j ) in theabth nonparametric resample b = 12 B For each ldquoardquo then times B matrices of counts Nab

i and estimates microlowastabi gave Qi(h)lowasta

Figure 10 Small Dots Indicate 50 Parametric Bootstrap Replications of Unconditional Nonparametric Optimism Estimates (63) TrianglesTheir Averages Closely Follow the Covariance Penalty Estimates (solid curve) Vertical distance plotted in df units Here the estimation rule isloess(span = 5) See Remark M for details

Efron The Estimation of Prediction Error 631

Si(h)lowasta and Olowastai as in (62)ndash(63) The points Olowasta

i 2σ 2 (withσ 2 = sum

ε2j n) are the small dots in Figure 10 while the tri-

angles are their averages over a = 12 50 Standard t testsaccepted the null hypotheses that the averages were centered onthe solid curve

Remark N. Theorem 2 applies to parametric averaging of the conditional nonparametric bootstrap. For the usual unconditional nonparametric bootstrap the bootstrap weights N_j on the points v_j = (x_j, y_j) in v_(i) vary, so that the last term in (4.4) is no longer negated by assumption (4.5). Instead it adds a remainder term to (6.9),

    −2 Σ_{h=1}^n B(h) E_(i){(λ̂*_i(h) − λ̂*_i)(μ̂*_i − μ̂_i)}/h.    (6.17)

Here μ̂*_i = m(x_i, v*_(i)), where v*_(i) puts weight N_j on v_j for j ≠ i, and λ̂*_i = −q̇(μ̂*_i)/2.

To justify approximation (6.13) we need to show that (6.17) is o_p(1/n). This can be demonstrated explicitly for linear projections (6.14). The result seems plausible in general since λ̂*_i(h) − λ̂*_i is O_p(1/n), while μ̂*_i − μ̂_i, the nonparametric bootstrap deviation of μ̂*_i from μ̂_i, would usually be O_p(1/√n).
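The order-of-magnitude claim can be written out in one line of bookkeeping (a heuristic under the rates just quoted, not a proof; in LaTeX form):

    % Each summand of (6.17) is a product of the two rates quoted above,
    % and the weights B(h)/h sum to at most 1:
    \[
    \bigl(\hat\lambda^*_i(h)-\hat\lambda^*_i\bigr)\bigl(\hat\mu^*_i-\hat\mu_i\bigr)
      = O_p(1/n)\,O_p(1/\sqrt{n}) = O_p\bigl(n^{-3/2}\bigr)
    \quad\Longrightarrow\quad
    -2\sum_{h=1}^{n}\frac{B(h)}{h}\,
      E_{(i)}\bigl\{(\hat\lambda^*_i(h)-\hat\lambda^*_i)(\hat\mu^*_i-\hat\mu_i)\bigr\}
      = O_p\bigl(n^{-3/2}\bigr) = o_p(1/n).
    \]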

7. SUMMARY

Figure 11 classifies prediction error estimates on two criteria: parametric (model based) versus nonparametric, and conditional versus unconditional. The classification can also be described by which parts of the training set {(x_j, y_j), j = 1, 2, ..., n} are varied in the error rate computations. The Steinian only varies y_i in estimating the ith error rate, keeping all the covariates x_j, and also y_j for j ≠ i, fixed; at the other extreme, the nonparametric bootstrap simultaneously varies the entire training set.

Figure 11. Two-Way Classification of Prediction Error Estimates Discussed in This Article. The conditional methods are local in the sense that only the ith case data are varied in estimating the ith error rate.

Here are some comparisons and comments concerning the four methods:

• The parametric methods require modeling assumptions in order to carry out the covariance penalty calculations. When these assumptions are justified, the Rao–Blackwell type of results of Sections 4 and 6 imply that the parametric techniques will be more efficient than their nonparametric counterparts, particularly for estimating degrees of freedom.

• The modeling assumptions need not rely on the estimation rule μ̂ = m(y) under investigation. We can use "bigger" models, as in Remark A, that is, ones less likely to be biased.

• Modeling assumptions are less important for rules μ̂ = m(y) that are close to linear. In genuinely linear situations, such as those needed for the C_p and AIC criteria, the covariance corrections are constants that do not depend on the model at all. The centralized version of SURE (3.25) extends this property to maximum likelihood estimation in curved families.

• Local methods extrapolate error estimates from small changes in the training set. Global methods make much larger changes in the training set, of a size commensurate with actual random sampling, which is an advantage in dealing with "rough" rules m(y) such as nearest neighbors or classification trees; see Efron and Tibshirani (1997).

• Stein's SURE criterion (2.11) is local, because it depends on partial derivatives, and parametric (2.9), without being model based. It performed more like cross-validation than the parametric bootstrap in the situation of Figure 2.

• The computational burden in our examples was less for global methods. Equation (2.18), with λ̂*b_i replacing μ̂*b_i for general error measures, helps determine the number of replications B required for the parametric bootstrap; a small sketch follows this list. Grouping, the usual labor-saving tactic in applying cross-validation, can also be applied to covariance penalty methods, as in Remark G, though now it is not clear that this is computationally helpful.

• As shown in Remark B, the bootstrap method's computations can also be used for hypothesis tests comparing the efficacy of different models.
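To make the point about equation (2.18) concrete, here is a minimal R sketch of the simulation-error calculation one monitors when choosing B (squared-error case; the function name and packaging are mine, not the paper's):

    # Total df estimate and its Monte Carlo error from B parametric bootstrap
    # replications, following (2.15) and (2.18); mustar and ystar are B x n
    # matrices of bootstrap fits mu*b_i and bootstrap data y*b_i.
    # For a general error measure, replace mustar by the lambda*b_i values.
    df.total.se <- function(mustar, ystar, sigma2) {
      B <- nrow(mustar); n <- ncol(mustar)
      ybar <- colMeans(ystar)                                   # y*._i
      Cb <- rowSums(mustar * sweep(ystar, 2, ybar)) / n         # C*b, b = 1, ..., B
      df.hat <- (n / sigma2) * sum(Cb) / (B - 1)                # sum_i cov_i / sigma^2
      se <- (n / sigma2) * sqrt(sum((Cb - mean(Cb))^2) / (B * (B - 1)))   # (2.18)
      c(df = df.hat, sim.error = se)    # increase B until sim.error is acceptably small
    }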

Accurate estimation of prediction error tends to be difficult in practice, particularly when applied to the choice between competing rules μ̂ = m(y). In the author's opinion, it will often be worth chancing plausible modeling assumptions for the covariance penalty estimates rather than relying entirely on nonparametric methods.

[Received December 2002. Revised October 2003.]

REFERENCES

Akaike, H. (1973), "Information Theory and an Extension of the Maximum Likelihood Principle," Second International Symposium on Information Theory, 267–281.

Breiman, L. (1992), "The Little Bootstrap and Other Methods for Dimensionality Selection in Regression: X-Fixed Prediction Error," Journal of the American Statistical Association, 87, 738–754.

Efron, B. (1975), "Defining the Curvature of a Statistical Problem (With Applications to Second Order Efficiency)," with discussion, The Annals of Statistics, 3, 1189–1242.

Efron, B. (1975), "The Efficiency of Logistic Regression Compared to Normal Discriminant Analyses," Journal of the American Statistical Association, 70, 892–898.

Efron, B. (1983), "Estimating the Error Rate of a Prediction Rule: Improvement on Cross-Validation," Journal of the American Statistical Association, 78, 316–331.

Efron, B. (1986), "How Biased Is the Apparent Error Rate of a Prediction Rule?" Journal of the American Statistical Association, 81, 461–470.


Efron, B., and Tibshirani, R. (1997), "Improvements on Cross-Validation: The .632+ Bootstrap Method," Journal of the American Statistical Association, 92, 548–560.

Hampel, F., Ronchetti, E., Rousseeuw, P., and Stahel, W. (1986), Robust Statistics: The Approach Based on Influence Functions, New York: Wiley.

Hastie, T., and Tibshirani, R. (1990), Generalized Additive Models, London: Chapman & Hall.

Li, K. (1985), "From Stein's Unbiased Risk Estimates to the Method of Generalized Cross-Validation," The Annals of Statistics, 13, 1352–1377.

Li, K. (1987), "Asymptotic Optimality for Cp, CL, Cross-Validation and Generalized Cross-Validation: Discrete Index Set," The Annals of Statistics, 15, 958–975.

Mallows, C. (1973), "Some Comments on Cp," Technometrics, 15, 661–675.

Rockafellar, R. (1970), Convex Analysis, Princeton, NJ: Princeton University Press.

Shen, X., Huang, H.-C., and Ye, J. (2004), "Adaptive Model Selection and Assessment for Exponential Family Models," Technometrics, to appear.

Shen, X., and Ye, J. (2002), "Adaptive Model Selection," Journal of the American Statistical Association, 97, 210–221.

Stein, C. (1981), "Estimation of the Mean of a Multivariate Normal Distribution," The Annals of Statistics, 9, 1135–1151.

Tibshirani, R., and Knight, K. (1999), "The Covariance Inflation Criterion for Adaptive Model Selection," Journal of the Royal Statistical Society, Ser. B, 61, 529–546.

Ye, J. (1998), "On Measuring and Correcting the Effects of Data Mining and Model Selection," Journal of the American Statistical Association, 93, 120–131.

Comment

Prabir BURMAN

I would like to begin by thanking Professor Efron for writing a paper that sheds new light on cross-validation and related methods, along with his proposals for stable model selection procedures. Stability can be an issue for ordinary cross-validation, especially for not-so-smooth procedures such as stepwise regression and other such sequential procedures. If we use the language of learning and test sets, ordinary cross-validation uses a learning set of size n − 1 and a test set of size 1. If one can average over a test set of infinite size, then one gets a stable estimator; Professor Efron demonstrates this leads to a Rao–Blackwellization of ordinary cross-validation.

The parametric bootstrap proposed here does require knowing the conditional distribution of an observation given the rest, which in turn requires a knowledge of the unknown parameters. Professor Efron argues that for a near-linear case this poses no problem. A question naturally arises: What happens to those cases where the methods are considerably more complicated, such as stepwise methods?

Another issue that is not entirely clear is the choice between the conditional and unconditional bootstrap methods. The conditional bootstrap seems to be better, but it can be quite expensive computationally. Can the unconditional bootstrap be used as a general method always?

It seems important to point out that ordinary cross-validation is not as inadequate as the present paper seems to suggest. If model selection is the goal, then estimation of the overall prediction error is what one can concentrate on. Even if the componentwise errors are not necessarily small, ordinary cross-validation may still provide reasonable estimates, especially if the sample size n is at least moderate and the estimation procedure is reasonably smooth.

Prabir Burman is Professor, Department of Statistics, University of California, Davis, CA 95616 (E-mail: burman@wald.ucdavis.edu).

In this connection I would like to point out that methods such as repeated v-fold (or multifold) cross-validation or repeated (or bootstrapped) learning–testing can improve on ordinary cross-validation because the test set sizes are not necessarily small (see, e.g., the CART book by Breiman, Friedman, Olshen, and Stone 1984; Burman 1989; Zhang 1993). In addition, such methods can reduce computational costs substantially. In a repeated v-fold cross-validation the data are repeatedly randomly split into v groups of roughly equal sizes. For each repeat there are v learning sets and the corresponding test sets. Each learning set is of size n(1 − 1/v) approximately, and each test set is of size n/v. In a repeated learning–testing method the data are randomly split into a learning set of size n(1 − p) and a test set of size np, where 0 < p < 1. If one takes a small v, say v = 3, in a v-fold cross-validation, or a value of p = 1/3 in a repeated learning–testing method, then each test set is of size n/3. However, a correction term is needed in order to account for the fact that each learning set is of a size that is considerably smaller than n (Burman 1989).
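As a concrete picture of the repeated v-fold scheme just described, here is a minimal R sketch; the fitting, prediction, and loss functions are placeholders supplied by the user, and the correction term of Burman (1989) is not shown.

    # Illustrative sketch (not Professor Burman's code) of repeated v-fold
    # cross-validation for an arbitrary rule.
    repeated.vfold.cv <- function(x, y, fit, pred, loss, v = 3, repeats = 10) {
      n <- length(y)
      cv.reps <- replicate(repeats, {
        fold <- sample(rep(1:v, length.out = n))   # random split into v groups
        per.case <- numeric(n)
        for (k in 1:v) {
          test  <- which(fold == k)                # test set of size ~ n/v
          model <- fit(x[-test], y[-test])         # learning set of size ~ n(1 - 1/v)
          per.case[test] <- loss(y[test], pred(model, x[test]))
        }
        mean(per.case)
      })
      mean(cv.reps)                                # average over the repeated random splits
    }

For a classification rule, loss would be the 0–1 misclassification indicator, with fit and pred given by whatever rule is under study (for example, a k-nearest-neighbor classifier).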

I ran a simulation for the classification case with the model: Y is Bernoulli(π(X)), where π(X) = 1 − sin²(2πX) and X is Uniform(0, 1). A majority voting scheme was used among the k nearest neighbors. True misclassification errors (in percent) and their standard errors (in parentheses) are given in Table 1, along with ordinary cross-validation, corrected threefold cross-validation with 10 repeats, and corrected repeated learning–testing (RLT) methods with p = .33 and 30 repeats. The sample size is n = 100 and the number of replications is 25,000. It can be seen that the corrected v-fold cross-validation or repeated learning–testing methods can provide some improvement over ordinary cross-validation.
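To reproduce the flavor of this experiment, here is a rough R sketch of one Monte Carlo replication, with the true error approximated on a large independent test set (the names and the test-set device are mine; no attempt is made to match the exact design behind Table 1):

    # Data-generating model and k-nearest-neighbor majority vote, one replication.
    pi.x <- function(x) 1 - sin(2 * pi * x)^2      # pi(X) = 1 - sin^2(2 pi X)

    one.rep <- function(n = 100, k = 7, n.test = 10000) {
      x  <- runif(n);       y  <- rbinom(n, 1, pi.x(x))
      x0 <- runif(n.test);  y0 <- rbinom(n.test, 1, pi.x(x0))
      knn.vote <- function(x.new)                  # majority vote among k nearest neighbors
        sapply(x.new, function(z) {
          nb <- order(abs(x - z))[1:k]
          as.numeric(mean(y[nb]) > 0.5)
        })
      mean(knn.vote(x0) != y0) * 100               # true misclassification error, in percent
    }

The various estimates in Table 1 (ordinary, threefold, and repeated learning–testing cross-validation) would then be applied to the same simulated training sets.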

I would like to end my comments with thanks to Professor Efron for providing significant and valuable insights into the subject of model selection and for developing new methods that are improvements over a popular method such as ordinary cross-validation.

Table 1. Classification Error Rates (in percent)

    TCH             k = 7          k = 9          k = 11
    True            22.82 (4.14)   24.55 (4.87)   27.26 (5.43)
    CV              22.95 (7.01)   24.82 (7.18)   27.65 (7.55)
    Threefold CV    22.74 (5.61)   24.56 (5.56)   27.04 (5.82)
    RLT             22.70 (5.70)   24.55 (5.65)   27.11 (5.95)

© 2004 American Statistical Association
Journal of the American Statistical Association
September 2004, Vol. 99, No. 467, Theory and Methods
DOI 10.1198/016214504000000908

In this case (613) errs by a factor of only [1 + O(1n)]Result (612) implies an approximate RaondashBlackwell rela-

tionship between nonparametric and parametric bootstrap opti-mism estimates when both are carried out conditionally on v(i)As with cross-validation this relationship seems to extend tothe more familiar unconditional bootstrap estimator Figure 10concerns the kidney data and squared error loss where thistime the fitting rule micro = m(y) is ldquoloess(tot sim age span = 5)rdquoLoess unlike lowess is a linear rule micro = My although it is notself-stable The solid curve traces the coordinatewise covari-ance penalty dfi estimates Mii as a function of agei

The small points in Figure 10 represent individual uncon-ditional nonparametric bootstrap dfi estimates Olowast

i 2σ 2 (63)

evaluated for 50 parametric bootstrap data vectors ylowast obtainedas in (217) Remark M provides the details Their means acrossthe 50 replications the triangles follow the Mii curve As withcross-validation if we attempt to improve nonparametric boot-strap optimism estimates by averaging across the ylowast vectorsgiving the covariance penalty i we wind up close to i it-self

As in Figure 8 we can expect nonparametric bootstrap df esti-mates to be much more variable than covariance penalties Var-ious versions of the nonparametric bootstrap particularly theldquo632 rulerdquo outperformed cross-validation in Efron (1983) andEfron and Tibshirani (1997) and may be preferred when non-parametric error predictions are required However covariancepenalty methods offer increased accuracy whenever their un-derlying models are believable

A general verification of the results of Figure 10 linkingthe unconditional versions of the nonparametric and parametricbootstrap df estimates is conjectural at this point Remark Noutlines a plausibility argument

Remark M Figure 10 involved two resampling levels Para-metric bootstrap samples ylowasta = micro+εlowasta were drawn as in (217)for a = 12 50 with micro and the residuals εj = yj minus microj

determined by loess(span = 5) then B = 200 nonparametricbootstrap samples were drawn from each set vlowasta = (vlowasta

1 vlowasta2

vlowastan ) with say Nab

j repetitions of vlowastaj = (xj ylowasta

j ) in theabth nonparametric resample b = 12 B For each ldquoardquo then times B matrices of counts Nab

i and estimates microlowastabi gave Qi(h)lowasta

Figure 10 Small Dots Indicate 50 Parametric Bootstrap Replications of Unconditional Nonparametric Optimism Estimates (63) TrianglesTheir Averages Closely Follow the Covariance Penalty Estimates (solid curve) Vertical distance plotted in df units Here the estimation rule isloess(span = 5) See Remark M for details

Efron The Estimation of Prediction Error 631

Si(h)lowasta and Olowastai as in (62)ndash(63) The points Olowasta

i 2σ 2 (withσ 2 = sum

ε2j n) are the small dots in Figure 10 while the tri-

angles are their averages over a = 12 50 Standard t testsaccepted the null hypotheses that the averages were centered onthe solid curve

Remark N Theorem 2 applies to parametric averaging ofthe conditional nonparametric bootstrap For the usual uncondi-tional nonparametric bootstrap the bootstrap weights Nj on thepoints vj = (xj yj) in v(i) vary so that the last term in (44) is nolonger negated by assumption (45) Instead it adds a remainderterm to (69)

minus2sum

h=1

B(h)E(i)(

λlowasti (h) minus λlowast

i

)(microlowast

i minus microi)h (617)

Here microlowasti = m(xivlowast

(i)) where vlowast(i) puts weight Nj on vj for j = i

and λlowasti = minusq(microlowast

i )2To justify approximation (613) we need to show that (617)

is op(1n) This can be demonstrated explicitly for linear pro-jections (614) The result seems plausible in general sinceλlowast

i (h) minus λlowasti is Op(1n) while microlowast

i minus microi the nonparametric boot-strap deviation of microlowast

i from microi would usually be Op(1radic

n )

7 SUMMARY

Figure 11 classifies prediction error estimates on two crite-ria Parametric (model-based) versus nonparametric and con-ditional versus unconditional The classification can also bedescribed by which parts of the training set (xj yj) j =12 n are varied in the error rate computations TheSteinian only varies yi in estimating the ith error rate keep-ing all the covariates xj and also yj for j = i fixed at the otherextreme the nonparametric bootstrap simultaneously varies theentire training set

Here are some comparisons and comments concerning thefour methods

bull The parametric methods require modeling assumptionsin order to carry out the covariance penalty calculations

Figure 11 Two-Way Classification of Prediction Error Estimates Dis-cussed in This Article The conditional methods are local in the sensethat only the i th case data are varied in estimating the i th error rate

When these assumptions are justified the RaondashBlackwelltype of results of Sections 4 and 6 imply that the paramet-ric techniques will be more efficient than their nonpara-metric counterparts particularly for estimating degrees offreedom

bull The modeling assumptions need not rely on the estimationrule micro = m(y) under investigation We can use ldquobiggerrdquomodels as in Remark A that is ones less likely to be bi-ased

bull Modeling assumptions are less important for rules micro =m(y) that are close to linear In genuinely linear situationssuch as those needed for the Cp and AIC criteria the co-variance corrections are constants that do not depend onthe model at all The centralized version of SURE (325)extends this property to maximum likelihood estimation incurved families

bull Local methods extrapolate error estimates from smallchanges in the training set Global methods make muchlarger changes in the training set of a size commensuratewith actual random sampling which is an advantage indealing with ldquoroughrdquo rules m(y) such as nearest neighborsor classification trees see Efron and Tibshirani (1997)

bull Steinrsquos SURE criterion (211) is local because it dependson partial derivatives and parametric (29) without beingmodel based It performed more like cross-validation thanthe parametric bootstrap in the situation of Figure 2

bull The computational burden in our examples was less forglobal methods Equation (218) with λlowastb

i replacing microlowastbi

for general error measures helps determine the numberof replications B required for the parametric bootstrapGrouping the usual labor-saving tactic in applying cross-validation can also be applied to covariance penalty meth-ods as in Remark G though now it is not clear that this iscomputationally helpful

bull As shown in Remark B the bootstrap methodrsquos computa-tions can also be used for hypothesis tests comparing theefficacy of different models

Accurate estimation of prediction error tends to be difficult inpractice particularly when applied to the choice between com-peting rules micro = m(y) In the authorrsquos opinion it will often beworth chancing plausible modeling assumptions for the covari-ance penalty estimates rather than relying entirely on nonpara-metric methods

[Received December 2002 Revised October 2003]

REFERENCES

Akaike H (1973) ldquoInformation Theory and an Extension of the MaximumLikelihood Principlerdquo Second International Symposium on Information The-ory 267ndash281

Breiman L (1992) ldquoThe Little Bootstrap and Other Methods for Dimensional-ity Selection in Regression X-Fixed Prediction Errorrdquo Journal of the Ameri-can Statistical Association 87 738ndash754

Efron B (1975) ldquoDefining the Curvature of a Statistical Problem (With Appli-cations to Second Order Efficiency)rdquo (with discussion) The Annals of Statis-tics 3 1189ndash1242

(1975) ldquoThe Efficiency of Logistic Regression Compared to NormalDiscriminant Analysesrdquo Journal of the American Statistical Association 70892ndash898

(1983) ldquoEstimating the Error Rate of a Prediction Rule Improvementon Cross-Validationrdquo Journal of the American Statistical Association 78316ndash331

(1986) ldquoHow Biased Is the Apparent Error Rate of a Prediction RulerdquoJournal of the American Statistical Association 81 461ndash470

632 Journal of the American Statistical Association September 2004

Efron B and Tibshirani R (1997) ldquoImprovements on Cross-ValidationThe 632+ Bootstrap Methodrdquo Journal of the American Statistical Associa-tion 92 548ndash560

Hampel F Ronchetti E Rousseeuw P and Stahel W (1986) Robust Statis-tics the Approach Based on Influence Functions New York Wiley

Hastie T and Tibshirani R (1990) Generalized Linear Models LondonChapman amp Hall

Li K (1985) ldquoFrom Steinrsquos Unbiased Risk Estimates to the Method of Gener-alized Cross-Validationrdquo The Annals of Statistics 13 1352ndash1377

(1987) ldquoAsymptotic Optimality for CpCL Cross-Validation and Gen-eralized Cross-Validation Discrete Index Setrdquo The Annals of Statistics 15958ndash975

Mallows C (1973) ldquoSome Comments on Cp rdquo Technometrics 15 661ndash675

Rockafellar R (1970) Convex Analysis Princeton NJ Princeton UniversityPress

Shen X Huang H-C and Ye J (2004) ldquoAdaptive Model Selection and As-sessment for Exponential Family Modelsrdquo Technometrics to appear

Shen X and Ye J (2002) ldquoAdaptive Model Selectionrdquo Journal of the Ameri-can Statistical Association 97 210ndash221

Stein C (1981) ldquoEstimation of the Mean of a Multivariate Normal Distribu-tionrdquo The Annals of Statistics 9 1135ndash1151

Tibshirani R and Knight K (1999) ldquoThe Covariance Inflation Criterion forAdaptive Model Selectionrdquo Journal of the Royal Statistical Society Ser B61 529ndash546

Ye J (1998) ldquoOn Measuring and Correcting the Effects of Data Miningand Model Selectionrdquo Journal of the American Statistical Association 93120ndash131

CommentPrabir BURMAN

I would like to begin by thanking Professor Efron for writinga paper that sheds new light on cross-validation and relatedmethods along with his proposals for stable model selec-tion procedures Stability can be an issue for ordinary cross-validation especially for not-so-smooth procedures such asstepwise regression and other such sequential procedures Ifwe use the language of learning and test sets ordinary cross-validation uses a learning set of size n minus 1 and a test set ofsize 1 If one can average over a test set of infinite size thenone gets a stable estimator Professor Efron demonstrates thisleads to a RaondashBlackwellization of ordinary cross-validation

The parametric bootstrap proposed here does require know-ing the conditional distribution of an observation given the restwhich in turn requires a knowledge of the unknown parametersProfessor Efron argues that for a near-linear case this poses noproblem A question naturally arises What happens to thosecases where the methods are considerably more complicatedsuch as stepwise methods

Another issue that is not entirely clear is the choice betweenthe conditional and unconditional bootstrap methods The con-ditional bootstrap seems to be better but it can be quite expen-sive computationally Can the unconditional bootstrap be usedas a general method always

It seems important to point out that ordinary cross-validationis not as inadequate as the present paper seems to suggestIf model selection is the goal then estimation of the overallprediction error is what one can concentrate on Even if thecomponentwise errors are not necessarily small ordinary cross-validation may still provide reasonable estimates especially ifthe sample size n is at least moderate and the estimation proce-dure is reasonably smooth

In this connection I would like to point out that methods suchas repeated v-fold (or multifold) cross-validation or repeated (orbootstrapped) learning-testing can improve on ordinary cross-validation because the test set sizes are not necessarily small(see eg the CART book by Breiman Friedman Olshen andStone 1984 Burman 1989 Zhang 1993) In addition suchmethods can reduce computational costs substantially In a re-peated v-fold cross-validation the data are repeatedly randomly

Prabir Burman is Professor Department of Statistics University of Califor-nia Davis CA 95616 (E-mail burmanwalducdavisedu)

split into v groups of roughly equal sizes For each repeat thereare v learning sets and the corresponding test sets Each learn-ing set is of size n(1 minus 1v) approximately and each test set isof size nv In a repeated learningndashtesting method the data arerandomly split into a learning set of size n(1 minus p) and a test setof size np where 0 lt p lt 1 If one takes a small v say v = 3in a v-fold cross-validation or a value of p = 13 in a repeatedlearningndashtesting method then each test set is of size n3 How-ever a correction term is needed in order to account for the factthat each learning set is of a size that is considerably smallerthan n (Burman 1989)

I ran a simulation for the classification case with the modelY is Bernoulli(π(X)) where π(X) = 1 minus sin2(2πX) and X isUniform(01) A majority voting scheme was used among thek nearest neighbor neighbors True misclassification errors (inpercent) and their standard errors (in parentheses) are given inTable 1 along with ordinary cross-validation corrected three-fold cross-validation with 10 repeats and corrected repeated-testing (RLT) methods with p = 33 and 30 repeats The samplesize is n = 100 and the number of replications is 25000 It canbe seen that the corrected v-fold cross-validation or repeatedlearningndashtesting methods can provide some improvement overordinary cross-validation

I would like to end my comments with thanks to Profes-sor Efron for providing significant and valuable insights intothe subject of model selection and for developing new methodsthat are improvements over a popular method such as ordinarycross-validation

Table 1 Classification Error Rates (in percent)

TCH k = 7 k = 9 k = 11

True 2282(414) 2455(487) 2726(543)CV 2295(701) 2482(718) 2765(755)Threefold CV 2274(561) 2456(556) 2704(582)RLT 2270(570) 2455(565) 2711(595)

copy 2004 American Statistical AssociationJournal of the American Statistical Association

September 2004 Vol 99 No 467 Theory and MethodsDOI 101198016214504000000908

Page 4: The Estimation of Prediction Error: Covariance Penalties ...jordan/sail/... · The Estimation of Prediction Error: Covariance Penalties and Cross-Validation Bradley E FRON Having

622 Journal of the American Statistical Association September 2004

Figure 3. 1,000 Bootstrap Replications of ΔErr* for the Difference Between lowess(x, y, 2/3) and lowess(x, y, 1/3), Kidney Data. The point estimate ΔÊrr = −15.0 is in the lower part of the histogram.

Remark C. The parametric bootstrap estimate (2.14)–(2.15), unlike SURE, does not depend on µ̂ = m(y) being differentiable or even continuous. A simulation experiment was run taking the true model for the kidney data to be y ∼ N(µ, σ²I), with σ² = 3.17 and µ the lowess(x, y, f = 1/6) fit, a noticeably rougher curve than that of Figure 1. A discontinuous adaptive estimation rule µ̂ = m(y) was used: polynomial regressions of y on x, for powers of x from 0 to 7, were fit, with the one having the minimum Cp value selected to be µ̂.

Because this is a simulation experiment, we can estimate the true expected difference between Err and err (2.8); 1,000 simulations of y gave

    E{Err − err} = 33.1 ± .202.    (2.23)

The parametric bootstrap estimate (2.14)–(2.16) worked well here, 1,000 replications of y* ∼ N(µ̂, σ̂²I), with µ̂ from lowess(x, y, f = 1/3), yielding

    Êrr − err = 31.4 ± .285.    (2.24)

In contrast, bootstrapping from y* ∼ N(y, σ̂²I), as in (2.19), gave 14.6 ± .182, badly underestimating the true difference 33.1. Starting with the true µ equal to the seventh-degree polynomial fit gave nearly the same results as (2.23).

3. GENERAL COVARIANCE PENALTIES

The covariance penalty theory of Section 2 can be generalized beyond squared error to a wide class of error measures. The q class of error measures (Efron 1986) begins with any concave function q(·) of a real-valued argument; Q(y, µ), the assessed error for outcome y given prediction µ, is then defined to be

    Q(y, µ) = q(µ) + q̇(µ)(y − µ) − q(y)    [q̇(µ) = dq/dµ].    (3.1)

Figure 4. Tangency Construction (3.1) for General Error Measure Q(y, µ); q(·) Is an Arbitrary Concave Function. The illustrated case has q(µ) = µ(1 − µ) and Q(y, µ) = (y − µ)².

Q(y, µ) is the tangency function to q(·), as illustrated in Figure 4; (3.1) is a familiar construct in convex analysis (Rockafellar 1970). The choice q(µ) = µ(1 − µ) gives squared error, Q(y, µ) = (y − µ)².

Our examples will include the Bernoulli case y ∼ Be(µ), where we have n independent observations y_i,

    y_i = 1 with probability µ_i,  y_i = 0 with probability 1 − µ_i,  for µ_i ∈ [0, 1].    (3.2)

Two common error functions used for Bernoulli observations are counting error,

    q(µ) = min(µ, 1 − µ)  →  Q(y, µ) = { 0 if y, µ on the same side of 1/2;  1 if y, µ on different sides of 1/2 }    (3.3)

(see Remark F), and binomial deviance,

    q(µ) = −2[µ log(µ) + (1 − µ) log(1 − µ)]  →  Q(y, µ) = { −2 log µ if y = 1;  −2 log(1 − µ) if y = 0 }.    (3.4)

By a linear transformation we can always make

    q(0) = q(1) = 0,    (3.5)

which is convenient for Bernoulli calculations.

We assume that some unknown probability mechanism f has given the observed data y, from which we estimate the expectation vector µ = E_f{y} according to the rule µ̂ = m(y):

    f → y → µ̂ = m(y).    (3.6)

Total error will be assessed by summing the component errors,

    Q(y, µ̂) = Σ_{i=1}^{n} Q(y_i, µ̂_i).    (3.7)
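The tangency construction (3.1) is simple to compute. The following sketch is an illustration added here, not code from the article; it assumes a numerical derivative when q̇ is not supplied, and checks that q(µ) = µ(1 − µ) reproduces squared error and that the deviance choice (3.4) gives back −2 log µ and −2 log(1 − µ).

    import numpy as np

    def Q(y, mu, q, dq=None, eps=1e-6):
        """Tangency-construction error (3.1): Q(y, mu) = q(mu) + q'(mu)(y - mu) - q(y).
        If the derivative dq is not supplied, a central finite difference is used."""
        if dq is None:
            dq = lambda m: (q(m + eps) - q(m - eps)) / (2 * eps)
        return q(mu) + dq(mu) * (y - mu) - q(y)

    # squared error: q(mu) = mu(1 - mu)  ->  Q(y, mu) = (y - mu)^2
    q_sq = lambda m: m * (1 - m)
    print(Q(0.9, 0.4, q_sq), (0.9 - 0.4) ** 2)      # both 0.25

    # binomial deviance: q(mu) = -2[mu log mu + (1 - mu) log(1 - mu)]
    q_dev = lambda m: -2 * (m * np.log(m) + (1 - m) * np.log(1 - m)) if 0 < m < 1 else 0.0
    mu = 0.7
    print(Q(1, mu, q_dev), -2 * np.log(mu))         # both ~0.713
    print(Q(0, mu, q_dev), -2 * np.log(1 - mu))     # both ~2.408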


The following definitions lead to a general version of the Cp formula (2.8). Letting

    err_i = Q(y_i, µ̂_i)  and  Err_i = E_0{Q(y⁰_i, µ̂_i)},    (3.8)

as in (2.4), with µ̂_i fixed in the expectation and y⁰_i from an independent copy of y, define the

    Optimism:  O_i = O_i(f, y) = Err_i − err_i    (3.9)

and

    Expected optimism:  Ω_i = Ω_i(f) = E_f{O_i(f, y)}.    (3.10)

Finally, let

    λ̂_i = −q̇(µ̂_i)/2.    (3.11)

For q(µ) = µ(1 − µ), the squared error case, λ̂_i = µ̂_i − 1/2; for counting error (3.3), λ̂_i = −1/2 or 1/2 as µ̂_i is less than or greater than 1/2; for binomial deviance (3.4),

    λ̂_i = log(µ̂_i/(1 − µ̂_i)),    (3.12)

the logit parameter. [If Q(y, µ) is the deviance function for any exponential family, then λ is the corresponding natural parameter; see sec. 6 of Efron 1986.]

Optimism Theorem. For error measure Q(y, µ) (3.1), we have

    E{Err_i} = E{err_i} + Ω_i,    (3.13)

where

    Ω_i = 2 cov(λ̂_i, y_i),    (3.14)

the expectations and covariance being with respect to f (3.6).

Proof. Err_i = err_i + O_i by definition, immediately giving (3.13). From (3.1) we calculate

    Err_i = q(µ̂_i) + q̇(µ̂_i)(µ_i − µ̂_i) − E{q(y⁰_i)},
    err_i = q(µ̂_i) + q̇(µ̂_i)(y_i − µ̂_i) − q(y_i),    (3.15)

and so, from (3.9)–(3.11),

    O_i = 2λ̂_i(y_i − µ_i) + q(y_i) − E{q(y⁰_i)}.    (3.16)

Because E{q(y⁰_i)} = E{q(y_i)}, y⁰_i being a copy of y_i, taking expectations in (3.16) verifies (3.14).

The optimism theorem generalizes Stein's result for squared error (2.8) to the q class of error measures. It was developed by Efron (1986) in a generalized linear model (GLM) context, but as verified here, it applies to any probability mechanism f → y. Even independence is not required among the components of y, though it is convenient to assume independence in the conditional covariance computations that follow.

Parametric bootstrap computations can be used to estimate the penalty Ω_i = 2 cov(λ̂_i, y_i), as in (2.14), the only change being the substitution of λ̂*_i = −q̇(µ̂*_i)/2 for µ̂*_i in (2.15):

    ĉov_i = Σ_{b=1}^{B} λ̂*b_i (y*b_i − y*·_i) / (B − 1).    (3.17)

Method (3.17) was suggested in remark J of Efron (1986). Shen and Ye (2002), working with deviance in exponential families, employed a "shrunken" version of (3.17), as in (2.20).
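For squared error, λ̂_i is µ̂_i up to an additive constant, so the penalty reduces to twice the empirical covariance between fitted values and resampled responses. The sketch below is an added illustration under assumed ingredients (a generic fitting function fit, a Gaussian plug-in model (µ̂, σ̂²)); it is not the article's code.

    import numpy as np

    def covariance_penalty(y, fit, sigma2, B=200, rng=None):
        """Unconditional parametric bootstrap covariance penalties, squared-error case:
        Omega_i ~= 2 * cov(muhat_i, y_i), as in (2.14)-(2.15) and (3.17).
        `fit` maps a response vector to a vector of fitted values muhat = m(y)."""
        rng = np.random.default_rng(rng)
        mu_hat = fit(y)                                  # center of the resampling model
        Y = mu_hat + rng.normal(0.0, np.sqrt(sigma2), size=(B, len(y)))   # y*b ~ N(muhat, sigma2 I)
        M = np.array([fit(yb) for yb in Y])              # muhat*b = m(y*b)
        # empirical covariance across resamples (cf. (2.15), here centering both factors)
        cov_i = ((Y - Y.mean(0)) * (M - M.mean(0))).sum(0) / (B - 1)
        return 2.0 * cov_i                               # per-coordinate penalties

    # toy use: ordinary least squares on a cubic polynomial basis
    x = np.linspace(-1, 1, 60)
    X = np.vander(x, 4)
    ols = lambda v: X @ np.linalg.lstsq(X, v, rcond=None)[0]
    y = np.sin(2 * x) + np.random.default_rng(0).normal(0, 0.3, 60)
    omega = covariance_penalty(y, ols, sigma2=0.09, B=400, rng=1)
    print(omega.sum() / (2 * 0.09))                      # roughly 4 = trace of the hat matrix (df)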

Section 4 relates covariance penalties to cross-validation. In doing so it helps to work with a conditional version of ĉov_i. Let y_(i) indicate the data vector with y_i deleted,

    y_(i) = (y_1, y_2, ..., y_{i−1}, y_{i+1}, ..., y_n),    (3.18)

and define the conditional covariance

    cov_(i) = E{λ̂_i · (y_i − µ_i) | y_(i)} ≡ E_(i){λ̂_i · (y_i − µ_i)},    (3.19)

E_(i) indicating E{· | y_(i)}; likewise Ω_(i) = 2 cov_(i). In situation (2.1)–(2.3), cov_(i) = cov_i = σ²M_ii, but in general we only have E{cov_(i)} = cov_i. The conditional version of (3.13),

    E_(i){Err_i} = E_(i){err_i} + Ω_(i),    (3.20)

is a more refined statement of the optimism theorem. The SURE formula (2.10) also applies conditionally, cov_(i) = σ²E_(i){∂µ̂_i/∂y_i}, assuming normality (2.9).

Figure 5 illustrates conditional and unconditional covariance calculations for subject i = 93 of the kidney study (the open circle in Fig. 1). Here we have used squared error and the Gaussian model y* ∼ N(µ̂, σ̂²I), σ̂² = 3.17, with µ̂ = lowess(x, y, 1/3). The conditional and unconditional covariances are nearly the same, ĉov_(i) = .221 versus ĉov_i = .218, but the dependence of µ̂*_i on y*_i is much clearer conditioning on y_(i).

The conditional approach is computationally expensive. We would need to repeat the conditional resampling procedure of Figure 5 separately for each of the n cases, whereas a single set of unconditional resamples suffices for all n. Here we will use the conditional covariances (3.19) mainly for theoretical purposes. The less expensive unconditional approach performed well in all of our examples.
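A minimal sketch of the two calculations of Figure 5, added for illustration (the smoother and helper names are assumptions): the unconditional version resamples the whole vector y* ∼ N(µ̂, σ̂²I), while the conditional version fixes y_(i) and resamples only y*_i.

    import numpy as np

    def cov_unconditional(y, fit, sigma2, i, B=200, rng=0):
        rng = np.random.default_rng(rng)
        mu_hat = fit(y)
        ys = mu_hat + rng.normal(0, np.sqrt(sigma2), (B, len(y)))   # y* ~ N(muhat, sigma2 I)
        mi = np.array([fit(v)[i] for v in ys])                      # muhat*_i
        return np.cov(mi, ys[:, i])[0, 1]                           # covhat_i

    def cov_conditional(y, fit, sigma2, i, B=200, rng=0):
        rng = np.random.default_rng(rng)
        mu_hat_i = fit(y)[i]
        mi, yi = np.empty(B), np.empty(B)
        for b in range(B):
            yb = y.copy()
            yb[i] = mu_hat_i + rng.normal(0, np.sqrt(sigma2))       # only y*_i varies, y_(i) fixed
            yi[b] = yb[i]
            mi[b] = fit(yb)[i]
        return np.cov(mi, yi)[0, 1]                                 # covhat_(i)

    # usage with a simple running-mean smoother standing in for m(y)
    fit = lambda y: np.convolve(y, np.ones(5) / 5, mode="same")
    y = np.random.default_rng(2).normal(size=50)
    print(cov_unconditional(y, fit, 1.0, i=25), cov_conditional(y, fit, 1.0, i=25))  # both ~0.2

Both values estimate the same quantity here because the smoother is linear; the point of Figure 5 is that the two procedures agree while the conditional plot shows the dependence of µ̂*_i on y*_i more clearly.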

Figure 5. Conditional and Unconditional Covariance Calculations for Subject i = 93, Kidney Study. Open circles: 200 pairs (y*_i, µ̂*_i), unconditional resamples y* ∼ N(µ̂, σ̂²I), ĉov_i = .218. Dots: 100 conditional resamples y*_i ∼ N(µ̂_i, σ̂²), y_(i) fixed, ĉov_(i) = .221. Vertical line at µ̂_93 = 1.36.


There is, however, one situation where the conditional covariances are easily computed, the Bernoulli case y ∼ Be(µ). In this situation it is easy to see that

    cov_(i) = µ_i(1 − µ_i)[λ̂_i(y_(i), 1) − λ̂_i(y_(i), 0)],    (3.21)

the notation indicating the two possible values of λ̂_i with y_(i) fixed and y_i either 1 or 0. This leads to estimates

    ĉov_(i) = µ̂_i(1 − µ̂_i)[λ̂_i(y_(i), 1) − λ̂_i(y_(i), 0)].    (3.22)

Calculating ĉov_(i) for i = 1, 2, ..., n requires only n recomputations of m(·), one for each i, the same number as for cross-validation. For reasons discussed next, (3.22) will be termed the Steinian.
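In the Bernoulli case the Steinian (3.22) needs only the refits with y_i set to 1 and to 0 (one of which coincides with the original fit, so only n extra refits are required). A sketch assuming binomial deviance, so that λ̂_i is the logit of µ̂_i, and a generic fitting rule m(y); this is an added illustration, not the article's implementation.

    import numpy as np

    def steinian(y, m, eps=1e-12):
        """Steinian covariance penalty estimates (3.22) for Bernoulli data,
        binomial deviance error, where lambdahat_i = logit(muhat_i)."""
        logit = lambda p: np.log((p + eps) / (1 - p + eps))
        mu_hat = m(y)
        cov = np.empty(len(y))
        for i in range(len(y)):
            y1, y0 = y.copy(), y.copy()
            y1[i], y0[i] = 1, 0
            lam1 = logit(m(y1)[i])        # lambdahat_i(y_(i), 1)
            lam0 = logit(m(y0)[i])        # lambdahat_i(y_(i), 0)
            cov[i] = mu_hat[i] * (1 - mu_hat[i]) * (lam1 - lam0)   # eq. (3.22)
        return cov                        # Omegahat_(i) = 2 * cov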

There is no general equivalent to the Gaussian SURE formula (2.10), that is, an unbiased estimator for cov_(i). However, a useful approximation can be derived as follows. Let t_i(y*_i) = λ̂_i(y_(i), y*_i) indicate λ̂*_i as a function of y*_i with y_(i) fixed, and denote ṫ_i = ∂t_i(y*_i)/∂y*_i evaluated at µ̂_i; in Figure 5, ṫ_i is the slope of the solid curve as it crosses the vertical line. Suppose y*_i has bootstrap mean and variance (µ̂_i, V_i). Taylor series yield a simple approximation for ĉov_(i):

    ĉov_(i) = E_(i){λ̂_i · (y*_i − µ̂_i)}
            ≐ E_(i){[t_i(µ̂_i) + ṫ_i · (y*_i − µ̂_i)](y*_i − µ̂_i)}
            = V_i ṫ_i,    (3.23)

only y*_i being random here. The Steinian (3.22) is a discretized version of (3.23) applied to the Bernoulli case, which has V_i = µ̂_i(1 − µ̂_i).

If Q(y, µ) is the deviance function for a one-parameter exponential family, then λ is the natural parameter and dλ_i/dµ_i = 1/V_i. Therefore

    ĉov_(i) ≐ V_i ∂λ̂_i/∂y*_i |_{µ̂_i} = V_i (1/V_i) ∂µ̂_i/∂y*_i |_{µ̂_i} = ∂µ̂_i/∂y*_i |_{µ̂_i}.    (3.24)

This is a centralized version of the SURE estimate, where now ∂µ̂_i/∂y_i is evaluated at µ̂_i instead of y_i. [In the exponential family representation of the Gaussian case (2.9), Q(y, µ) = (y − µ)²/σ², so the factor σ² in (2.10) has been absorbed into Q in (3.24).]

Remark D. The centralized version of SURE in (3.24) gives the correct total degrees of freedom for maximum likelihood estimation in a p-parameter generalized linear model, or more generally in a p-parameter curved exponential family (Efron 1975):

    Σ_{i=1}^{n} ∂µ̂_i/∂y_i |_{y=µ̂} = p.    (3.25)

The usual uncentralized version of SURE does not satisfy (3.25) in curved families.
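The centralized SURE quantity ∂µ̂_i/∂y_i evaluated at y = µ̂ in (3.24)–(3.25) can be approximated by finite differences around the fitted vector. The following sketch is an added illustration under that finite-difference assumption.

    import numpy as np

    def centralized_sure_df(y, m, eps=1e-4):
        """Approximate sum_i d(muhat_i)/d(y_i) evaluated at y = muhat, as in (3.25),
        by central finite differences around the fitted vector."""
        mu_hat = m(y)
        df = 0.0
        for i in range(len(y)):
            up, dn = mu_hat.copy(), mu_hat.copy()
            up[i] += eps
            dn[i] -= eps
            df += (m(up)[i] - m(dn)[i]) / (2 * eps)   # partial derivative at muhat
        return df

    # for a linear projection m(y) = X (X'X)^{-1} X' y this returns p, the column rank of X
    x = np.linspace(0, 1, 40)
    X = np.vander(x, 3)
    proj = lambda v: X @ np.linalg.lstsq(X, v, rcond=None)[0]
    y = np.random.default_rng(0).normal(size=40)
    print(centralized_sure_df(y, proj))               # ~3.0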

Using deviance error and maximum likelihood estimation in a curved exponential family makes err_i = −2 log f_{µ̂_i}(y_i) + constant. Combining (3.14), (3.24), and (3.25) gives

    Êrr = −2 [ Σ_i log f_{µ̂_i}(y_i) − p + constant ].    (3.26)

Choosing among competing models by minimizing Êrr is equivalent to maximizing the penalized likelihood Σ log f_{µ̂_i}(y_i) − p, which is Akaike's information criterion (AIC). These results generalize those for GLM's in section 6 of Efron (1986) and will not be verified here.

Remark E. It is easy to show that the true prediction error Err_i (3.8) satisfies

    Err_i = Q(µ_i, µ̂_i) + D(µ_i)    [D(µ_i) ≡ E{Q(y_i, µ_i)}].    (3.27)

For squared error this reduces to the familiar result E_0(y⁰_i − µ̂_i)² = (µ_i − µ̂_i)² + σ². In the Bernoulli case (3.2), D(µ_i) = q(µ_i), and the basic result (3.16) can be simplified to

    Bernoulli case:  O_i = 2λ̂_i(y_i − µ_i),    (3.28)

using (3.5).

Remark F. The q class includes an asymmetric version of counting error (3.3) that allows the decision boundary to be at a point π_1 in (0, 1) other than 1/2. Letting π_0 = 1 − π_1 and ρ = (π_0/π_1)^{1/2},

    q(µ) = min{ρµ, (1/ρ)(1 − µ)}
      →  Q(y, µ) = { 0 if y, µ on the same side of π_1;  ρ if y = 1 and µ ≤ π_1;  1/ρ if y = 0 and µ > π_1 }.    (3.29)

Now Q(1, 0)/Q(0, 1) = π_0/π_1. This is the appropriate loss structure for a simple hypothesis-testing situation in which we want to compensate for unequal prior sampling probabilities.

4. THE RELATIONSHIP WITH CROSS-VALIDATION

Cross-validation is the most widely used error prediction technique. This section relates cross-validation to covariance penalties, more exactly to conditional parametric bootstrap covariance penalties. A Rao–Blackwell type of relationship will be developed: If we average cross-validation estimates across the bootstrap datasets used to calculate the conditional covariances, then we get, to a good approximation, the covariance penalty. The implication is that covariance penalties are more accurate than cross-validation, assuming of course that we trust the parametric bootstrap model. A similar conclusion is reached in Shen and Ye (2002).

The cross-validation estimate of prediction error for coordinate i is

    Ẽrr_i = Q(y_i, µ̃_i),    (4.1)

where µ̃_i is the ith coordinate of the estimate of µ based on the deleted dataset y_(i) = (y_1, y_2, ..., y_{i−1}, y_{i+1}, ..., y_n), say

    µ̃_i = m(y_(i))_i    (4.2)

(see Remark H). Equivalently, cross-validation estimates the optimism O_i = Err_i − err_i by

    Õ_i = Ẽrr_i − err_i = Q(y_i, µ̃_i) − Q(y_i, µ̂_i).    (4.3)
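Equations (4.1)–(4.3) amount to a leave-one-out loop. A sketch, added here for concreteness with squared error and an assumed helper fit_predict(x_train, y_train, x_new); it is not the article's code.

    import numpy as np

    def cv_optimism(x, y, fit_predict, Q=lambda y, mu: (y - mu) ** 2):
        """Cross-validation optimism estimates (4.3):
        O~_i = Q(y_i, mu~_i) - Q(y_i, muhat_i), with mu~_i fit on the data minus case i."""
        n = len(y)
        mu_hat = fit_predict(x, y, x)                         # full-data fit, muhat = m(y)
        O = np.empty(n)
        for i in range(n):
            keep = np.arange(n) != i
            mu_tilde_i = fit_predict(x[keep], y[keep], x[i:i + 1])[0]   # m(y_(i))_i, eq. (4.2)
            O[i] = Q(y[i], mu_tilde_i) - Q(y[i], mu_hat[i])
        return O                                              # sum(O)/(2*sigma2) estimates df

    # toy use with a degree-2 polynomial regression
    fp = lambda xt, yt, xn: np.polyval(np.polyfit(xt, yt, 2), xn)
    x = np.linspace(0, 1, 50)
    y = np.sin(3 * x) + np.random.default_rng(1).normal(0, .2, 50)
    print(cv_optimism(x, y, fp).sum() / (2 * .04))            # roughly 3 df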


Lemma. Letting λ̂_i = −q̇(µ̂_i)/2 and λ̃_i = −q̇(µ̃_i)/2, as in (3.11),

    Õ_i = 2(λ̂_i − λ̃_i)(y_i − µ̂_i) − Q(µ̃_i, µ̂_i) − 2(λ̂_i − λ̃_i)(µ̃_i − µ̂_i).    (4.4)

This is verified directly from definition (3.1).

The lemma helps connect cross-validation with the conditional covariance penalties of (3.19)–(3.20). Cross-validation itself is conditional, in the sense that y_(i) is fixed in the calculation of Õ_i, so it is reasonable to suspect some sort of connection. Suppose that we estimate cov_(i) by bootstrap sampling as in (3.17), but now with y_(i) fixed and only y*_i random, say with density f̂_i. The form of (4.4) makes it especially convenient for y*_i to have conditional expectation µ̃_i (rather than the obvious choice µ̂_i), which we denote by

    E_(i){y*_i} ≡ E_{f̂_i}{y*_i | y_(i)} = µ̃_i.    (4.5)

In a Bernoulli situation we would take y*_i ∼ Be(µ̃_i).

Denote the bootstrap versions of µ̂_i and λ̂_i as µ̂*_i = m(y_(i), y*_i)_i and λ̂*_i = −q̇(µ̂*_i)/2.

Theorem 1. With y*_i ∼ f̂_i satisfying (4.5) and y_(i) fixed,

    E_(i){Õ*_i} = 2 ĉov_(i) − E_(i){Q(µ̃_i, µ̂*_i)},    (4.6)

ĉov_(i) being the conditional covariance estimate E_(i){λ̂*_i · (y*_i − µ̃_i)}.

Proof. Applying the lemma to the bootstrap dataset (y_(i), y*_i), that is, with y_i → y*_i, λ̂_i → λ̂*_i, and µ̂_i → µ̂*_i, while µ̃_i and λ̃_i are unchanged, gives

    Õ*_i = 2(λ̂*_i − λ̃_i)(y*_i − µ̃_i) − Q(µ̃_i, µ̂*_i).    (4.7)

Notice that µ̃_i and λ̃_i stay fixed in (4.7) because they depend only on y_(i); the same fact collapses the first and last terms of (4.4) into the single term shown. Taking conditional expectations E_(i) in (4.7), and using (4.5), completes the proof.

In (4.6), 2 ĉov_(i) equals Ω̂_(i), the estimate of the conditional covariance penalty Ω_(i) (3.20). Typically Ω̂_(i) is of order O_p(1/n), as in (2.13), while the remainder term E_(i){Q(µ̃_i, µ̂*_i)} is only O_p(1/n²); see Remark H. The implication is that

    E_(i){Õ*_i} ≐ Ω̂_(i) = 2 · ĉov_(i).    (4.8)

In other words, averaging the cross-validation estimate Õ*_i over f̂_i, the distribution of y*_i used to calculate the covariance penalty Ω̂_(i), gives approximately Ω̂_(i) itself. If we think of f̂_i as summarizing all available information for the unknown distribution of y_i, that is, as a sufficient statistic, then Ω̂_(i) is a Rao–Blackwellized improvement on Õ_i.

This same phenomenon occurs beyond the conditional framework of the theorem. Figure 6 concerns cross-validation of the lowess(x, y, 1/3) curve in Figure 1. Using the same unconditional resampling model (2.17) as in Figure 2, B = 200 bootstrap replications of the cross-validation estimate (4.3) were generated,

    Õ*b_i = Q(y*b_i, µ̃*b_i) − Q(y*b_i, µ̂*b_i),   i = 1, 2, ..., n and b = 1, 2, ..., B.    (4.9)

The small points in Figure 6 indicate individual values Õ*b_i/2σ̂². The triangles show averages over the 200 replications, Õ*·_i/2σ̂². There is striking agreement with the covariance penalty curve ĉov_i/σ̂² from Figure 2, confirming E{Õ*_i} ≐ Ω̂_i, as in (4.8). Nearly the same results were obtained bootstrapping from µ̃ rather than µ̂.

Figure 6. Small Dots Indicate 200 Bootstrap Replications of Cross-Validation Optimism Estimates (4.9); Triangles: Their Averages Closely Match the Covariance Penalty Curve From Figure 2. (Vertical distance plotted in df units.)

as in (48) Nearly the same results were obtained bootstrappingfrom micro rather than micro

Approximation (48) can be made more explicit in the caseof squared error loss applied to linear rules micro = My that areldquoself-stablerdquo that is where the cross-validation estimate (42)satisfies

microi =sum

j =i

Mijyj Mij = Mij

(1 minus Mii) (410)

Self-stable rules include all the usual regression and ANOVA estimates, as well as spline methods; see Remark I. Suppose we are resampling from y*_i ∼ f̂_i with mean and variance

    y*_i ∼ (µ̄_i, σ²),    (4.11)

where the resampling center µ̄_i might differ from µ̂_i or µ̃_i. The covariance penalty Ω_i is then estimated by Ω̂_i = 2σ²M_ii.

Using (4.10), it is straightforward to calculate the conditional expectation of the cross-validation estimate Õ*_i:

    E_{f̂_i}{Õ*_i | y_(i)} = Ω̂_i · [1 − M_ii/2][1 + ((µ̄_i − µ̃_i)/σ)²].    (4.12)

If µ̄_i = µ̃_i, then (4.12) becomes E_(i){Õ*_i} = Ω̂_i[1 − M_ii/2], an exact version of (4.8). The choice µ̄_i = µ̂_i results in

    E_{f̂_i}{Õ*_i | y_(i)} = Ω̂_i[1 − M_ii/2][1 + (M_ii(y_i − µ̃_i)/σ)²].    (4.13)

In both cases the conditional expectation of Õ*_i is Ω̂_i[1 + O_p(1/n)], where the O_p(1/n) term tends to be slightly negative.

The unconditional expectation of Õ_i, with respect to the true distribution y ∼ (µ, σ²I), looks like (4.12),

    E{Õ_i} = Ω_i[1 − M_ii/2][1 + Σ_{j≠i} M̃²_ij + Δ²_i],    (4.14)

Ω_i equaling the covariance penalty 2σ²M_ii, and

    Δ²_i = [(µ_i − Σ_{j≠i} M̃_ij µ_j)/σ]².    (4.15)

For M a projection matrix, M² = M, the term Σ_{j≠i} M̃²_ij = M_ii/(1 − M_ii); E{Õ_i} exceeds Ω_i, but only by a factor of 1 + O(1/n) if Δ²_i = 0. Notice that

    Σ_{j≠i} M̃_ij µ_j − µ_i = E{µ̃_i} − µ_i,    (4.16)

so that Δ²_i will be large if the cross-validation estimator µ̃_i is badly biased.

Finally, suppose y ∼ N(µ, σ²I) and µ̂ = My is a self-stable projection estimator. Then the coefficient of variation of Õ_i is

    CV{Õ_i} = [ (2 + 4(1 − M_ii)Δ²_i) / (1 + (1 − M_ii)Δ²_i)² ]^{1/2} ≐ √2,    (4.17)

the last approximation being quite accurate unless µ̃_i is badly biased. This says that Õ_i must always be a highly variable estimate of its expectation (4.14), or approximately of Ω_i = 2σ²M_ii. However, it is still possible for the sum Õ = Σ_i Õ_i to estimate Ω = Σ_i Ω_i = 2σ²·df with reasonable accuracy.

As an example, µ̂ = My was fit to the kidney data, where M represented a natural spline with 8 degrees of freedom (including the intercept). One hundred simulated data vectors y* ∼ N(µ̂, σ̂²I) were independently generated, σ̂² = 3.17, each giving a cross-validated df estimate df̃* = Õ*/2σ̂². These had empirical mean and standard deviation

    df̃* ∼ 8.34 ± 1.64.    (4.18)

Of course there is no reason to use cross-validation here, because the covariance penalty estimate df̂ always equals the correct df value 8. This is an extreme example of the Rao–Blackwell type of result in Theorem 1, showing the cross-validation estimate df̃ to be a randomized version of df̂.
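The experiment behind (4.18) is easy to mimic for any self-stable projection rule µ̂ = My, for which the covariance penalty df is exactly tr(M) while the cross-validated df estimate Õ/2σ² varies from sample to sample. The sketch below uses a polynomial projection as a stand-in for the article's natural spline basis (an assumption) and the leave-one-out shortcut implied by (4.10).

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, sigma = 100, 8, 1.0
    x = np.linspace(-1, 1, n)
    X = np.vander(x, p)                              # projection fit with p = 8 columns
    H = X @ np.linalg.solve(X.T @ X, X.T)            # hat matrix M; penalty df = trace(M) = 8
    mu = np.sin(4 * x)

    df_cv = []
    for _ in range(100):
        y = mu + sigma * rng.normal(size=n)
        mu_hat = H @ y
        h = np.diag(H)
        # self-stable leave-one-out shortcut from (4.10): mu~_i = (muhat_i - M_ii y_i)/(1 - M_ii)
        mu_tilde = (mu_hat - h * y) / (1 - h)
        O = np.sum((y - mu_tilde) ** 2 - (y - mu_hat) ** 2)   # cross-validation optimism (4.3)
        df_cv.append(O / (2 * sigma ** 2))

    print(np.trace(H), np.mean(df_cv), np.std(df_cv))  # exact 8 vs noisy CV df estimates, cf. (4.18)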

Remark G. Theorem 1 applies directly to grouped cross-validation, in which the observations are removed in groups rather than singly. Suppose group i consists of observations (y_{i1}, y_{i2}, ..., y_{iJ}), and likewise µ̂_i = (µ̂_{i1}, ..., µ̂_{iJ}) and µ̃_i = (µ̃_{i1}, ..., µ̃_{iJ}); y_(i) equals y with group i removed, and µ̃_i = m(y_(i))_{i1,i2,...,iJ}. Theorem 1 then holds as stated, with Õ*_i = Σ_j Õ*_{ij}, ĉov_(i) = Σ_j ĉov_(ij), and so forth. Another way to say this is that, by additivity, the theory of Sections 2–4 can be extended to vector observations y_i.

Remark H. The full dataset for a prediction problem, the "training set," is of the form

    v = (v_1, v_2, ..., v_n),  with v_i = (x_i, y_i),    (4.19)

x_i being a p-vector of observed covariates, such as age for the kidney data, and y_i a response. Covariance penalties operate in a regression theory framework where the x_i are considered fixed ancillaries, whether or not they are random, which is why notation such as µ̂ = m(y) can suppress x. Cross-validation, however, changes x as well as y. In this framework it is more precise to write the prediction rule as

    m(x, v)  for x ∈ X,    (4.20)

indicating that the training set v determines a rule m(·, v), which can then be evaluated at any x in the predictor space X; (4.2) is better expressed as µ̃_i = m(x_i, v_(i)).

In the cross-validation framework we can suppose that v has been produced by random sampling ("iid") from some (p + 1)-dimensional distribution F,

    F → v_1, v_2, ..., v_n  (iid).    (4.21)

Standard influence function arguments, as in chapter 2 of Hampel, Ronchetti, Rousseeuw, and Stahel (1986), give the first-order approximation

    µ̂_i − µ̃_i = m(x_i, v) − m(x_i, v_(i)) ≐ [IF_i − ĪF_(i)]/n,    (4.22)

where IF_j = IF(v_j; m(x_i, v), F) is the influence function for µ̂_i evaluated at v_j, and ĪF_(i) = Σ_{j≠i} IF_j/(n − 1).

The point here is that µ̂_i − µ̃_i is O_p(1/n) in situations where the influence function exists boundedly; see Li (1987) for a more careful discussion. In situation (4.10), µ̂_i − µ̃_i = M_ii(y_i − µ̃_i), so that M_ii = O(1/n), as in (2.13), implies µ̂_i − µ̃_i = O_p(1/n). Similarly, µ̂*_i − µ̃_i = O_p(1/n) in (4.6). If the function q(µ) of Figure 4 is locally quadratic near µ̃_i, then Q(µ̃_i, µ̂*_i) in (4.6) will be O_p(1/n²), as claimed in (4.8).

Order of magnitude asymptotics are only a rough guide to practice, and are not crucial to the methods discussed here. In any given situation, bootstrap calculations such as (3.17) will give valid estimates whether or not (2.13) is meaningful.

Remark I. A prediction rule is "self-stable" if adding a new point (x_i, y_i) that falls exactly on the prediction surface does not change the prediction at x_i; in notation (4.20), if

    m(x_i, v_(i) ∪ (x_i, µ̃_i)) = µ̃_i.    (4.23)

This implies µ̃_i = Σ_{j≠i} M_ij y_j + M_ii µ̃_i for a linear rule, which is (4.10). Any "least-Q" rule, which chooses µ̂ by minimizing Σ Q(y_i, µ_i) over some candidate collection of possible µ's, must be self-stable, and this class can be extended by adding penalty terms, as with smoothing splines. Maximum likelihood estimation in ordinary or curved GLM's belongs to the least-Q class.

5. A SIMULATION

Here is a small simulation study intended to illustrate covariance penalty/cross-validation relationships in a Bernoulli data setting. Figure 7 shows the underlying model used to generate the simulations. There are 30 bivariate vectors x_i and their associated probabilities µ_i,

    (x_i, µ_i),   i = 1, 2, ..., 30,    (5.1)

from which we generated 200 30-dimensional Bernoulli response vectors

    y ∼ Be(µ),    (5.2)

as in (3.2). [The underlying model (5.1) was itself randomly generated by 30 independent replications of

    Y_i ∼ Be(1/2)  and  x_i ∼ N_2((Y_i − 1/2, 0), I),    (5.3)

with µ_i the Bayesian posterior Prob{Y_i = 1 | x_i}.]

Figure 7. Underlying Model Used for Simulation Study: n = 30 Bivariate Vectors x_i and Associated Probabilities µ_i (5.1).

Our prediction rule µ̂ = m(y) was based on the coefficients for Fisher's linear discriminant boundary α̂ + β̂′x = 0,

    µ̂_i = 1/[1 + e^{−(α̂ + β̂′x_i)}].    (5.4)

Equation (2.13) of Efron (1983) describes the (α̂, β̂) computations. Rule (5.4) is not the logistic regression estimate of µ_i, and in fact it will be somewhat more efficient given mechanism (5.3) (Efron 1975).
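A sketch of the simulation setup (5.1)–(5.4), added for illustration. Efron (1983, eq. 2.13) specifies the exact (α̂, β̂); here the standard plug-in form of Fisher's linear discriminant is used instead, which is an assumption rather than the article's computation.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 30

    # underlying model (5.1)-(5.3): Y_i ~ Be(1/2), x_i | Y_i ~ N2((Y_i - 1/2, 0), I)
    Y = rng.integers(0, 2, n)
    x = rng.normal(size=(n, 2)) + np.c_[Y - 0.5, np.zeros(n)]
    mu = 1.0 / (1.0 + np.exp(-x[:, 0]))        # Bayes posterior Prob{Y_i = 1 | x_i}

    def lda_probs(x, y):
        """Fisher linear discriminant coefficients (standard plug-in form, an assumption)
        turned into probabilities as in (5.4)."""
        x1, x0 = x[y == 1], x[y == 0]
        S = ((len(x1) - 1) * np.cov(x1.T) + (len(x0) - 1) * np.cov(x0.T)) / (len(x) - 2)
        beta = np.linalg.solve(S, x1.mean(0) - x0.mean(0))
        alpha = -0.5 * (x1.mean(0) + x0.mean(0)) @ beta + np.log(len(x1) / len(x0))
        return 1.0 / (1.0 + np.exp(-(alpha + x @ beta)))      # muhat_i, eq. (5.4)

    # one simulated response vector as in (5.2): given x_i, Y_i is itself Bernoulli(mu_i)
    y = Y
    mu_hat = lda_probs(x, y)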

Binomial deviance error (3.4) was used to assess prediction accuracy. Three estimates of the total expected optimism Ω = Σ_{i=1}^{30} Ω_i (3.10) were computed for each of the 200 y vectors: the cross-validation estimate Õ = Σ Õ_i (4.3); the parametric bootstrap estimate 2Σ ĉov_i (3.17), with y* ∼ Be(µ̂); and the Steinian 2Σ ĉov_(i) (3.22).

The results appear in Figure 8 as histograms of the 200 df estimates (i.e., estimates of optimism/2). The Steinian and parametric bootstrap gave similar results, correlation .72, with the bootstrap estimates slightly but consistently larger. Strikingly, the cross-validation estimates were much more variable, having about three times larger standard deviation than either covariance penalty. All three methods were reasonably well centered near the true value Ω/2 = 1.57.

Figure 8. Degree-of-Freedom Estimates (optimism/2), 200 Simulations, (5.2) and (5.4). The two covariance penalty estimates, Steinian (a) and parametric bootstrap (b), have about one-third the standard deviation of cross-validation. Error measured by binomial deviance (3.4); true Ω/2 = 1.57.

Figure 8 exemplifies the Rao–Blackwell relationship (4.8), which guarantees that cross-validation will be more variable than covariance penalties. The comparison would have been more extreme if we had estimated µ by logistic regression rather than (5.4), in which case the covariance penalties would be nearly constant while cross-validation would still vary.

In our simulation study we can calculate the true total optimism (3.28) for each y,

    O = 2 Σ_{i=1}^{n} λ̂_i · (y_i − µ_i).    (5.5)

Figure 9 plots the Steinian estimates versus O/2 for the 200 simulations. The results illustrate an unfortunate phenomenon noted in Efron (1983): Optimism estimates tend to be small when they should be large, and vice versa. Cross-validation or the parametric bootstrap exhibited the same inverse relationships. The best we can hope for is to estimate the expected optimism.

If we are trying to estimate Err = err + O with Êrr = err + Ω̂, then

    Êrr − Err = Ω̂ − O,    (5.6)

so inverse relationships such as those in Figure 9 make Êrr less accurate. Table 1 shows estimates of E(Êrr − Err)² from the simulation experiment.

None of the methods did very much better than simply estimating Err by the apparent error, that is, taking Ω̂ = 0, and cross-validation was actually worse.

Figure 9. Steinian Estimate versus True Optimism/2 (5.5) for the 200 Simulations. Similar inverse relationships hold for parametric bootstrap or cross-validation.

Table 1. Average (Êrr − Err)² for the 200 Simulations. "Apparent" Takes Êrr = err (i.e., Ω̂ = 0); Outsample Averages Discussed in Remark L. All Three Methods Outperformed err When the Prediction Rule Was Ordinary Logistic Regression

                          Steinian   ParBoot   CrossVal   Apparent
    E(Êrr − Err)²           53.9       52.9       63.3       57.8
    Outsample               59.4       58.2       68.9       64.1
    Logistic regression     36.2       34.6       33.2       53.8

It is easy to read too much into these numbers. The two points at the extreme right of Figure 9 contribute heavily to the comparison, as do other details of the computation; see Remarks J and L. Perhaps the main point is that the efficiency of covariance penalties helps more in estimating Ω than in estimating Err. Estimating Ω can be important in its own right, because it provides df values for the comparison, formal or informal, of different models, as emphasized in Ye (1998). Also, the values of df_i as a function of x_i, as in Figure 2, are a useful diagnostic for the geometry of the fitting process.

The bottom line of Table 1 reports E(Êrr − Err)² for the prediction rule "ordinary logistic regression" rather than (5.4). Now all three methods handily beat the apparent error. The average prediction error Err was much bigger for logistic regression, 61.5 versus 29.3 for (5.4), but Err was easier to estimate for logistic regression.

Remark J. Four of the cross-validation estimates, corresponding to the rightmost points in Figure 9, were negative (ranging from −9 to −28). These were truncated at 0 in Figure 8 and Table 1. The parametric bootstrap estimates were based on only B = 100 replications per case, leaving substantial simulation error. Standard components-of-variance calculations for the 200 cases were used in Figure 8 and Table 1 to approximate the ideal situation B = ∞.

Remark K. The asymptotics in Li (1985) imply that in his setting it is possible to estimate the optimism itself, rather than its expectation. However, the form of (5.5) strongly suggests that O is unestimable in the Bernoulli case, since it directly involves the unobservable componentwise differences y_i − µ_i.

Remark L. Err = Σ Err_i (3.8) is the total prediction error at the n observed covariate points x_i. "Outsample error,"

    Err_out = n · E_0{Q(y⁰, m(x⁰, v))},    (5.7)

where the training set v is fixed while v⁰ = (x⁰, y⁰) is an independent random test point drawn from F (4.21), is the natural setting for cross-validation. (The factor n is included for comparison with Err.) See section 7 of Efron (1986). The second line of Table 1 shows that replacing Err with Err_out did not affect our comparisons. Formula (4.14) suggests that this might be less true if our estimation rule had been badly biased.

Table 2 shows the comparative ranks of |Êrr − Err| for the four methods of Table 1, applied to rule (5.4). For example, the Steinian was best in 14 of the 200 simulations, and worst only once. The corresponding ranks are also shown for |Êrr − Err_out|, with very similar results. Cross-validation performed poorly; apparent error tended to be either best or worst; the Steinian was usually second or third, while the parametric bootstrap spread rather evenly across the four ranks.


Table 2. Left: Comparative Ranks of |Êrr − Err| for the 200 Simulations (5.1)–(5.4); Right: Same for |Êrr − Err_out|

    Rank      Stein   ParBoot   CrVal   App       Stein   ParBoot   CrVal   App
    1           14       48       33    105         17       50       32    101
    2          106       58       31      5        104       58       35      3
    3           79       56       55     10         78       54       55     13
    4            1       38       81     80          1       38       78     83
    Mean rank  2.34     2.42     2.92   2.33       2.31     2.40     2.90   2.39

6. THE NONPARAMETRIC BOOTSTRAP

Nonparametric bootstrap methods for estimating prediction error depend on simple random resamples v* = (v*_1, v*_2, ..., v*_n) from the training set v (4.19), rather than parametric resamples as in (2.14). Efron (1983) examined the relationship between the nonparametric bootstrap and cross-validation. This section develops a Rao–Blackwell type of connection between the nonparametric and parametric bootstrap methods, similar to Section 4's cross-validation results.

Suppose we have constructed B nonparametric bootstrap samples v*, each of which gives a bootstrap estimate µ̂*, with µ̂*_i = m(x_i, v*) in the notation of (4.20). Let N^b_i indicate the number of times v_i occurs in bootstrap sample v*b, b = 1, 2, ..., B; define the indicator

    I^b_i(h) = { 1 if N^b_i = h;  0 if N^b_i ≠ h },    (6.1)

h = 0, 1, ..., n, and let Q_i(h) be the average error when N^b_i = h,

    Q_i(h) = Σ_b I^b_i(h) Q(y_i, µ̂*b_i) / Σ_b I^b_i(h).    (6.2)

We expect Q_i(0), the average error when v_i is not involved in the bootstrap prediction of y_i, to be larger than Q_i(1), which will be larger than Q_i(2), and so on.

A useful class of nonparametric bootstrap optimism estimates takes the form

    Ô_i = Σ_{h=1}^{n} B(h) S_i(h),   S_i(h) = [Q_i(0) − Q_i(h)]/h,    (6.3)

the "S" standing for "slope." Letting p_n(h) be the binomial resampling probability

    p_n(h) = Prob{Bi(n, 1/n) = h} = (n choose h) (n − 1)^{n−h}/n^n,    (6.4)

section 8 of Efron (1983) considers two particular choices of B(h):

    "ω̂(boot)":  B(h) = h(h − 1)p_n(h)   and   "ω̂(0)":  B(h) = h p_n(h).    (6.5)

Here we will concentrate on (6.3) with B(h) = h p_n(h), which is convenient but not crucial to the discussion. Then B(h) is a probability distribution, Σ_{h=1}^{n} B(h) = 1, with expectation

    Σ_{h=1}^{n} B(h) · h = 1 + (n − 1)/n.    (6.6)

The estimate Ô_i = Σ_{h=1}^{n} B(h)S_i(h) is seen to be a weighted average of the downward slopes S_i(h). Most of the weight is on the first few values of h, because B(h) rapidly approaches the shifted Poisson(1) density e^{−1}/(h − 1)! as n → ∞.
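The estimator (6.3) with B(h) = h p_n(h) can be computed from one set of ordinary bootstrap resamples by grouping cases on their repetition counts N^b_i. The sketch below is an added illustration built directly from (6.1)–(6.5), with an assumed refitting helper and squared error; it is not the article's code.

    import numpy as np
    from math import comb

    def nonpar_boot_optimism(x, y, fit_predict, B=200, Q=lambda y, m: (y - m) ** 2, rng=0):
        """Nonparametric bootstrap optimism estimates (6.3), weights B(h) = h * p_n(h)."""
        rng = np.random.default_rng(rng)
        n = len(y)
        err = np.full((B, n), np.nan)          # Q(y_i, muhat*b_i)
        N = np.zeros((B, n), dtype=int)        # repetition counts N_i^b
        for b in range(B):
            idx = rng.integers(0, n, n)        # simple random resample of the n cases
            N[b] = np.bincount(idx, minlength=n)
            mu_star = fit_predict(x[idx], y[idx], x)
            err[b] = Q(y, mu_star)
        pn = np.array([comb(n, h) * (n - 1) ** (n - h) / n ** n for h in range(n + 1)])  # (6.4)
        O = np.zeros(n)
        for i in range(n):
            out = N[:, i] == 0
            if not out.any():
                continue
            Q0 = err[out, i].mean()                        # Q_i(0), eq. (6.2)
            for h in range(1, n + 1):
                hit = N[:, i] == h
                if hit.any():
                    O[i] += h * pn[h] * (Q0 - err[hit, i].mean()) / h   # B(h) * S_i(h)
        return O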

We first consider a conditional version of the nonparametric bootstrap. Define v_(i)(h) to be the augmented training set

    v_(i)(h) = v_(i) ∪ {h copies of (x_i, y_i)},   h = 0, 1, ..., n,    (6.7)

giving corresponding estimates µ̂_i(h) = m(x_i, v_(i)(h)) and λ̂_i(h) = −q̇(µ̂_i(h))/2. For v_(i)(0) = v_(i), the training set with v_i = (x_i, y_i) removed, µ̂_i(0) = µ̃_i (4.2) and λ̂_i(0) = λ̃_i. The conditional version of definition (6.3) is

    Ô_(i) = Σ_{h=1}^{n} B(h) S_i(h),   S_i(h) = [Q(y_i, µ̃_i) − Q(y_i, µ̂_i(h))]/h.    (6.8)

This is defined to be the conditional nonparametric bootstrap estimate of the conditional optimism Ω_(i) (3.20). Notice that setting B(h) = (1, 0, ..., 0) would make Ô_(i) equal Õ_i, the cross-validation estimate (4.3).

As before, we can average Ô_(i)(y) over conditional parametric resamples y* = (y_(i), y*_i) (4.5), with y_(i) and x = (x_1, x_2, ..., x_n) fixed. That is, we can parametrically average the nonparametric bootstrap. The proof of Theorem 1 applies here, giving a similar result.

Theorem 2. Assuming (4.5), the conditional parametric bootstrap expectation of Ô*_(i) = Ô_(i)(y_(i), y*_i) is

    E_(i){Ô*_(i)} = 2 Σ_{h=1}^{n} B(h) ĉov_(i)(h)/h − Σ_{h=1}^{n} B(h) E_(i){Q(µ̃_i, µ̂*_i(h))}/h,    (6.9)

where

    ĉov_(i)(h) = E_(i){λ̂*_i(h)(y*_i − µ̃_i)}.    (6.10)

The second term on the right side of (6.9) is O_p(1/n²), as in (4.8), giving

    E_(i){Ô*_(i)} ≐ 2 Σ_{h=1}^{n} B(h) ĉov_(i)(h)/h.    (6.11)

Point v_i has h times more weight in the augmented training set v_(i)(h) than in v = v_(i)(1), so, as in (4.22), influence function calculations suggest

    µ̂*_i(h) − µ̃_i ≐ h · (µ̂*_i − µ̃_i)   and   ĉov_(i)(h) ≐ h · ĉov_(i),    (6.12)

µ̂*_i = µ̂*_i(1), so that (6.11) becomes

    E_(i){Ô*_(i)} ≐ 2 ĉov_(i) = Ω̂_(i).    (6.13)

Averaging the conditional nonparametric bootstrap estimates over parametric resamples (y_(i), y*_i) results in a close approximation to the conditional covariance penalty Ω̂_(i).


Expression (6.9) can be exactly evaluated for linear projection estimates µ̂ = My (using squared error loss),

    M = X(X′X)^{−1}X′,   X′ = (x_1, x_2, ..., x_n).    (6.14)

Then the projection matrix corresponding to v_(i)(h) has ith diagonal element

    M_ii(h) = h M_ii/[1 + (h − 1)M_ii],   M_ii = M_ii(1) = x′_i(X′X)^{−1}x_i,    (6.15)

and if y*_i ∼ (µ̃_i, σ²) with y_(i) fixed, then ĉov_(i)(h) = σ²M_ii(h). Using (6.6) and the self-stable relationship µ̂_i − µ̃_i = M_ii(y_i − µ̃_i), (6.9) can be evaluated as

    E_(i){Ô*_(i)} = Ω̂_(i) · [1 − 4M_ii].    (6.16)

In this case (6.13) errs by a factor of only [1 + O(1/n)].

Result (6.12) implies an approximate Rao–Blackwell relationship between nonparametric and parametric bootstrap optimism estimates when both are carried out conditionally on v_(i). As with cross-validation, this relationship seems to extend to the more familiar unconditional bootstrap estimator. Figure 10 concerns the kidney data and squared error loss, where this time the fitting rule µ̂ = m(y) is "loess(tot ∼ age, span = .5)". Loess, unlike lowess, is a linear rule µ̂ = My, although it is not self-stable. The solid curve traces the coordinatewise covariance penalty df_i estimates M_ii as a function of age_i.

The small points in Figure 10 represent individual unconditional nonparametric bootstrap df_i estimates Ô*_i/2σ̂² (6.3), evaluated for 50 parametric bootstrap data vectors y* obtained as in (2.17); Remark M provides the details. Their means across the 50 replications, the triangles, follow the M_ii curve. As with cross-validation, if we attempt to improve nonparametric bootstrap optimism estimates by averaging across the y* vectors giving the covariance penalty Ω̂_i, we wind up close to Ω̂_i itself.

As in Figure 8, we can expect nonparametric bootstrap df estimates to be much more variable than covariance penalties. Various versions of the nonparametric bootstrap, particularly the ".632 rule," outperformed cross-validation in Efron (1983) and Efron and Tibshirani (1997), and may be preferred when nonparametric error predictions are required. However, covariance penalty methods offer increased accuracy whenever their underlying models are believable.

A general verification of the results of Figure 10, linking the unconditional versions of the nonparametric and parametric bootstrap df estimates, is conjectural at this point; Remark N outlines a plausibility argument.

Remark M. Figure 10 involved two resampling levels. Parametric bootstrap samples y*a = µ̂ + ε*a were drawn as in (2.17), for a = 1, 2, ..., 50, with µ̂ and the residuals ε̂_j = y_j − µ̂_j determined by loess(span = .5); then B = 200 nonparametric bootstrap samples were drawn from each set v*a = (v*a_1, v*a_2, ..., v*a_n), with, say, N^{ab}_j repetitions of v*a_j = (x_j, y*a_j) in the abth nonparametric resample, b = 1, 2, ..., B. For each "a," the n × B matrices of counts N^{ab}_i and estimates µ̂*ab_i gave Q_i(h)*a, S_i(h)*a, and Ô*a_i, as in (6.2)–(6.3). The points Ô*a_i/2σ̂² (with σ̂² = Σ ε̂²_j/n) are the small dots in Figure 10, while the triangles are their averages over a = 1, 2, ..., 50. Standard t tests accepted the null hypotheses that the averages were centered on the solid curve.

Figure 10. Small Dots Indicate 50 Parametric Bootstrap Replications of Unconditional Nonparametric Optimism Estimates (6.3); Triangles: Their Averages Closely Follow the Covariance Penalty Estimates (solid curve). Vertical distance plotted in df units. Here the estimation rule is loess(span = .5); see Remark M for details.

Remark N. Theorem 2 applies to parametric averaging of the conditional nonparametric bootstrap. For the usual unconditional nonparametric bootstrap, the bootstrap weights N_j on the points v_j = (x_j, y_j) in v_(i) vary, so that the last term in (4.4) is no longer negated by assumption (4.5). Instead it adds a remainder term to (6.9),

    −2 Σ_{h=1}^{n} B(h) E_(i){(λ̂*_i(h) − λ̃*_i)(µ̃*_i − µ̃_i)}/h.    (6.17)

Here µ̃*_i = m(x_i, v*_(i)), where v*_(i) puts weight N_j on v_j for j ≠ i, and λ̃*_i = −q̇(µ̃*_i)/2.

To justify approximation (6.13), we need to show that (6.17) is o_p(1/n). This can be demonstrated explicitly for linear projections (6.14). The result seems plausible in general, since λ̂*_i(h) − λ̃*_i is O_p(1/n), while µ̃*_i − µ̃_i, the nonparametric bootstrap deviation of µ̃*_i from µ̃_i, would usually be O_p(1/√n).

7. SUMMARY

Figure 11 classifies prediction error estimates on two criteria: parametric (model-based) versus nonparametric, and conditional versus unconditional. The classification can also be described by which parts of the training set {(x_j, y_j), j = 1, 2, ..., n} are varied in the error rate computations. The Steinian only varies y_i in estimating the ith error rate, keeping all the covariates x_j, and also y_j for j ≠ i, fixed; at the other extreme, the nonparametric bootstrap simultaneously varies the entire training set.

Here are some comparisons and comments concerning the four methods:

• The parametric methods require modeling assumptions in order to carry out the covariance penalty calculations. When these assumptions are justified, the Rao–Blackwell type of results of Sections 4 and 6 imply that the parametric techniques will be more efficient than their nonparametric counterparts, particularly for estimating degrees of freedom.

• The modeling assumptions need not rely on the estimation rule µ̂ = m(y) under investigation. We can use "bigger" models, as in Remark A, that is, ones less likely to be biased.

• Modeling assumptions are less important for rules µ̂ = m(y) that are close to linear. In genuinely linear situations, such as those needed for the Cp and AIC criteria, the covariance corrections are constants that do not depend on the model at all. The centralized version of SURE (3.25) extends this property to maximum likelihood estimation in curved families.

• Local methods extrapolate error estimates from small changes in the training set. Global methods make much larger changes in the training set, of a size commensurate with actual random sampling, which is an advantage in dealing with "rough" rules m(y) such as nearest neighbors or classification trees; see Efron and Tibshirani (1997).

• Stein's SURE criterion (2.11) is local, because it depends on partial derivatives, and parametric (2.9), without being model based. It performed more like cross-validation than the parametric bootstrap in the situation of Figure 2.

• The computational burden in our examples was less for global methods. Equation (2.18), with λ̂*b_i replacing µ̂*b_i for general error measures, helps determine the number of replications B required for the parametric bootstrap. Grouping, the usual labor-saving tactic in applying cross-validation, can also be applied to covariance penalty methods, as in Remark G, though now it is not clear that this is computationally helpful.

• As shown in Remark B, the bootstrap method's computations can also be used for hypothesis tests comparing the efficacy of different models.

Figure 11. Two-Way Classification of Prediction Error Estimates Discussed in This Article. The conditional methods are local in the sense that only the ith case data are varied in estimating the ith error rate.

Accurate estimation of prediction error tends to be difficult in practice, particularly when applied to the choice between competing rules µ̂ = m(y). In the author's opinion, it will often be worth chancing plausible modeling assumptions for the covariance penalty estimates, rather than relying entirely on nonparametric methods.

[Received December 2002. Revised October 2003.]

REFERENCES

Akaike, H. (1973), "Information Theory and an Extension of the Maximum Likelihood Principle," in Second International Symposium on Information Theory, pp. 267–281.

Breiman, L. (1992), "The Little Bootstrap and Other Methods for Dimensionality Selection in Regression: X-Fixed Prediction Error," Journal of the American Statistical Association, 87, 738–754.

Efron, B. (1975), "Defining the Curvature of a Statistical Problem (With Applications to Second Order Efficiency)" (with discussion), The Annals of Statistics, 3, 1189–1242.

——— (1975), "The Efficiency of Logistic Regression Compared to Normal Discriminant Analysis," Journal of the American Statistical Association, 70, 892–898.

——— (1983), "Estimating the Error Rate of a Prediction Rule: Improvement on Cross-Validation," Journal of the American Statistical Association, 78, 316–331.

——— (1986), "How Biased Is the Apparent Error Rate of a Prediction Rule?" Journal of the American Statistical Association, 81, 461–470.

Efron, B., and Tibshirani, R. (1997), "Improvements on Cross-Validation: The .632+ Bootstrap Method," Journal of the American Statistical Association, 92, 548–560.

Hampel, F., Ronchetti, E., Rousseeuw, P., and Stahel, W. (1986), Robust Statistics: The Approach Based on Influence Functions, New York: Wiley.

Hastie, T., and Tibshirani, R. (1990), Generalized Additive Models, London: Chapman & Hall.

Li, K. (1985), "From Stein's Unbiased Risk Estimates to the Method of Generalized Cross-Validation," The Annals of Statistics, 13, 1352–1377.

——— (1987), "Asymptotic Optimality for Cp, CL, Cross-Validation and Generalized Cross-Validation: Discrete Index Set," The Annals of Statistics, 15, 958–975.

Mallows, C. (1973), "Some Comments on Cp," Technometrics, 15, 661–675.

Rockafellar, R. (1970), Convex Analysis, Princeton, NJ: Princeton University Press.

Shen, X., Huang, H.-C., and Ye, J. (2004), "Adaptive Model Selection and Assessment for Exponential Family Models," Technometrics, to appear.

Shen, X., and Ye, J. (2002), "Adaptive Model Selection," Journal of the American Statistical Association, 97, 210–221.

Stein, C. (1981), "Estimation of the Mean of a Multivariate Normal Distribution," The Annals of Statistics, 9, 1135–1151.

Tibshirani, R., and Knight, K. (1999), "The Covariance Inflation Criterion for Adaptive Model Selection," Journal of the Royal Statistical Society, Ser. B, 61, 529–546.

Ye, J. (1998), "On Measuring and Correcting the Effects of Data Mining and Model Selection," Journal of the American Statistical Association, 93, 120–131.

Comment

Prabir BURMAN

I would like to begin by thanking Professor Efron for writing a paper that sheds new light on cross-validation and related methods, along with his proposals for stable model selection procedures. Stability can be an issue for ordinary cross-validation, especially for not-so-smooth procedures such as stepwise regression and other such sequential procedures. If we use the language of learning and test sets, ordinary cross-validation uses a learning set of size n − 1 and a test set of size 1. If one can average over a test set of infinite size, then one gets a stable estimator. Professor Efron demonstrates that this leads to a Rao–Blackwellization of ordinary cross-validation.

The parametric bootstrap proposed here does require knowing the conditional distribution of an observation given the rest, which in turn requires a knowledge of the unknown parameters. Professor Efron argues that for a near-linear case this poses no problem. A question naturally arises: What happens in those cases where the methods are considerably more complicated, such as stepwise methods?

Another issue that is not entirely clear is the choice between the conditional and unconditional bootstrap methods. The conditional bootstrap seems to be better, but it can be quite expensive computationally. Can the unconditional bootstrap be used as a general method always?

It seems important to point out that ordinary cross-validation is not as inadequate as the present paper seems to suggest. If model selection is the goal, then estimation of the overall prediction error is what one can concentrate on. Even if the componentwise errors are not necessarily small, ordinary cross-validation may still provide reasonable estimates, especially if the sample size n is at least moderate and the estimation procedure is reasonably smooth.

In this connection I would like to point out that methods such as repeated v-fold (or multifold) cross-validation or repeated (or bootstrapped) learning–testing can improve on ordinary cross-validation, because the test set sizes are not necessarily small (see, e.g., the CART book by Breiman, Friedman, Olshen, and Stone 1984; Burman 1989; Zhang 1993). In addition, such methods can reduce computational costs substantially. In a repeated v-fold cross-validation the data are repeatedly randomly split into v groups of roughly equal sizes. For each repeat there are v learning sets and the corresponding test sets. Each learning set is of size n(1 − 1/v), approximately, and each test set is of size n/v. In a repeated learning–testing method the data are randomly split into a learning set of size n(1 − p) and a test set of size np, where 0 < p < 1. If one takes a small v, say v = 3, in a v-fold cross-validation, or a value of p = 1/3 in a repeated learning–testing method, then each test set is of size n/3. However, a correction term is needed in order to account for the fact that each learning set is of a size that is considerably smaller than n (Burman 1989).

I ran a simulation for the classification case with the model: Y is Bernoulli(π(X)), where π(X) = 1 − sin²(2πX) and X is Uniform(0, 1). A majority voting scheme was used among the k nearest neighbors. True misclassification errors (in percent) and their standard errors (in parentheses) are given in Table 1, along with ordinary cross-validation, corrected threefold cross-validation with 10 repeats, and corrected repeated learning–testing (RLT) methods with p = .33 and 30 repeats. The sample size is n = 100 and the number of replications is 25,000. It can be seen that the corrected v-fold cross-validation or repeated learning–testing methods can provide some improvement over ordinary cross-validation.

I would like to end my comments with thanks to Professor Efron for providing significant and valuable insights into the subject of model selection and for developing new methods that are improvements over a popular method such as ordinary cross-validation.

Prabir Burman is Professor, Department of Statistics, University of California, Davis, CA 95616 (E-mail: burman@wald.ucdavis.edu).

Table 1. Classification Error Rates (in percent)

    Technique        k = 7           k = 9           k = 11
    True             22.82 (4.14)    24.55 (4.87)    27.26 (5.43)
    CV               22.95 (7.01)    24.82 (7.18)    27.65 (7.55)
    Threefold CV     22.74 (5.61)    24.56 (5.56)    27.04 (5.82)
    RLT              22.70 (5.70)    24.55 (5.65)    27.11 (5.95)

© 2004 American Statistical Association
Journal of the American Statistical Association, September 2004, Vol. 99, No. 467, Theory and Methods
DOI 10.1198/016214504000000908


Remark D The centralized version of SURE in (324) givesthe correct total degrees of freedom for maximum likelihoodestimation in a p-parameter generalized linear model or moregenerally in a p-parameter curved exponential family (Efron1975)

nsum

i=1

partmicroi

partyi

∣∣∣∣y=micro

= p (325)

The usual uncentralized version of SURE does not satisfy(325) in curved families

Using deviance error and maximum likelihood estimationin a curved exponential family makes erri = minus2 log fmicroi( yi) +constant Combining (314) (324) and (325) gives

Err = minus 2

[sum

i

log fmicroi( yi) minus p + constant

]

(326)

Choosing among competing models by minimizing Erris equivalent to maximizing the penalized likelihoodsum

log fmicroi( yi) minus p which is Akaikersquos information criterion(AIC) These results generalize those for GLMrsquos in section 6of Efron (1986) and will not be verified here

Remark E It is easy to show that the true prediction errorErri (38) satisfies

Erri = Q(microi microi) + D(microi)[D(microi) equiv EQ( yimicroi)

] (327)

For squared error this reduces to the familiar result E0( y0i minus

microi)2 = (microi minus microi)

2 + σ 2 In the Bernoulli case (32) D(microi) =q(microi) and the basic result (316) can be simplified to

Bernoulli case Oi = 2λi( yi minus microi) (328)

using (35)

Remark F The q class includes an asymmetric version ofcounting error (33) that allows the decision boundary to be ata point π1 in (01) other than 12 Letting π0 = 1 minus π1 andρ = (π0π1)

12

q(micro) = min

ρmicro1

ρ(1 minus micro)

rarr Q( y micro) =

0 if y micro same side of π1

ρ if y = 1 and micro le π11

ρif y = 0 and micro gt π1

(329)

Now Q(10)Q(01) = π0π1 This is the appropriate lossstructure for a simple hypothesis-testing situation in which wewant to compensate for unequal prior sampling probabilities

4 THE RELATIONSHIP WITH CROSSndashVALIDATION

Cross-validation is the most widely used error predictiontechnique This section relates cross-validation to covariancepenalties more exactly to conditional parametric bootstrap co-variance penalties A RaondashBlackwell type of relationship willbe developed If we average cross-validation estimates acrossthe bootstrap datasets used to calculate the conditional covari-ances then we get to a good approximation the covariancepenalty The implication is that covariance penalties are moreaccurate than cross-validation assuming of course that we trustthe parametric bootstrap model A similar conclusion is reachedin Shen et al (2002)

The cross-validation estimate of prediction error for coordi-nate i is

Erri = Q( yi microi) (41)

where microi is the ith coordinate of the estimate of micro based on thedeleted dataset y(i) = ( y1 y2 yiminus1 yi+1 yn) say

microi = m(y(i)

)

i (42)

(see Remark H) Equivalently cross-validation estimates theoptimism Oi = Erri minus erri by

Oi = Erri minus erri = Q( yi microi) minus Q( yi microi) (43)

Efron The Estimation of Prediction Error 625

Lemma Letting λi = minusq(microi)2 and λi = minusq(microi)2 asin (311)

Oi = 2(λi minus λi)( yi minus microi) minus Q(microi microi) minus 2(λi minus λi)(microi minus microi)

(44)This is verified directly from definition (31)

The lemma helps connect cross-validation with the condi-tional covariance penalties of (319)ndash(320) Cross-validationitself is conditional in the sense that y(i) is fixed in the calcu-lation of Oi so it is reasonable to suspect some sort of con-nection Suppose that we estimate cov(i) by bootstrap samplingas in (317) but now with y(i) fixed and only ylowast

i random saywith density fi The form of (44) makes it especially conve-nient for ylowast

i to have conditional expectation microi (rather than theobvious choice microi) which we denote by

E(i)ylowasti equiv Efi

ylowast

i

∣∣y(i)

= microi (45)

In a Bernoulli situation we would take ylowasti sim Be(microi)

Denote the bootstrap versions of microi and λi as microlowasti = m(y(i) ylowast

i )

and λlowasti = minusq(microlowast

i )2

Theorem 1 With ylowasti sim fi satisfying (45) and y(i) fixed

E(i)Olowasti = 2cov(i) minus E(i)Q(microi micro

lowasti ) (46)

cov(i) being the conditional covariance estimate E(i)λlowasti middot ( ylowast

i minusmicroi)

Proof Applying the lemma with microi rarr microi yi rarr ylowasti

λi rarr λlowasti and microi rarr microlowast

i gives

Olowasti = 2(λlowast

i minus λi)( ylowasti minus microi) minus Q(microi micro

lowasti ) (47)

Notice that microi and λi stay fixed in (47) because they dependonly on y(i) and that this same fact eliminates the last termin (44) Taking conditional expectations E(i) in (47) completesthe proof

In (46) 2cov(i) equals (i) the estimate of the condi-tional covariance penalty (i) (320) Typically (i) is of orderOp(1n) as in (213) while the remainder term E(i)Q(microi micro

lowasti )

is only Op(1n2) See Remark H The implication is that

E(i)Olowasti = (i) = 2 middot cov(i) (48)

In other words averaging the cross-validation estimate Olowasti

over fi the distribution of ylowasti used to calculate the covari-

ance penalty (i) gives approximately (i) itself If we thinkof fi as summarizing all available information for the unknowndistribution of yi that is as a sufficient statistic then (i) is aRaondashBlackwellized improvement on Oi

This same phenomenon occurs beyond the conditional frame-work of the theorem Figure 6 concerns cross-validation of thelowess(xy13) curve in Figure 1 Using the same uncondi-tional resampling model (217) as in Figure 2 B = 200 boot-strap replications of the cross-validation estimate (43) weregenerated

Olowastbi = Q

(ylowastb

i microlowastbi

) minus Q(ylowastb

i microlowastbi

)

i = 12 n and b = 12 B (49)

The small points in Figure 6 indicate individual values Olowastbi

2σ 2 The triangles show averages over the 200 replicationsOlowastmiddot

i 2σ 2 There is striking agreement with the covariance

Figure 6 Small Dots Indicate 200 Bootstrap Replications of Cross-Validation Optimism Estimates (49) Triangles Their Averages Closely Matchthe Covariance Penalty Curve From Figure 2 (Vertical distance plotted in df units)

626 Journal of the American Statistical Association September 2004

penalty curve coviσ2 from Figure 2 confirming EOlowast

i = i

as in (48) Nearly the same results were obtained bootstrappingfrom micro rather than micro

Approximation (48) can be made more explicit in the caseof squared error loss applied to linear rules micro = My that areldquoself-stablerdquo that is where the cross-validation estimate (42)satisfies

microi =sum

j =i

Mijyj Mij = Mij

(1 minus Mii) (410)

Self-stable rules include all the usual regression and ANOVAestimate as well as spline methods see Remark I Suppose weare resampling from yi sim fi with mean and variance

ylowasti sim (microi σ

2) (411)

where microi might differ from microi or microi The covariance penalty i

is then estimated by i = 2σ 2MiiUsing (410) it is straightforward to calculate the conditional

expectation of the cross-validation estimate Oi

Efi

Olowast

i

∣∣y(i)

= i middot [1 minus Mii2][

1 +(

microi minus microi

σ

)2]

(412)

If microi = microi then (412) becomes E(i)Olowasti = i[1minusMii2] an ex-

act version of (48) The choice microi = microi results in

Efi

Olowast

i

∣∣y(i)

= i[1 minus Mii2][

1 +(

Miiyi minus microi

σ

)2]

(413)

In both cases the conditional expectation of Olowasti is i[1 +

Op(1n)] where the Op(1n) term tends to be slightly negativeThe unconditional expectation of Oi with respect to the true

distribution y sim (micro σ 2I) looks like (412)

EOi = i[1 minus Mii2][

1 +sum

j =i

M2ij + 2

i

]

(414)

i equaling the covariance penalty 2σ 2Mii and

2i =

[(

microi minussum

j =i

Mijmicroj

)

σ

]2

(415)

For M a projection matrix M2 = M the termsum

M2ij = Mii(1minus

Mii) EOi exceeds i but only by a factor of 1 + O(1n) if2

i = 0 Notice thatsum

j =i

Mijmicroj minus microi = Emicroi minus microi (416)

so that 2i will be large if the cross-validation estimator microi is

badly biasedFinally suppose y sim N(micro σ 2I) and micro = My is a self-stable

projection estimator Then the coefficient of variation of Oi is

CVOi = 2 + 4(1 minus Mii)2i

(1 + 2(1 minus Mii)2i )

2= 2 (417)

the last approximation being quite accurate unless microi is badlybiased This says that Oi must always be a highly variableestimate of its expectation (414) or approximately of i =2σ 2Mii However it is still possible for the sum O = iOi toestimate ii = 2σ 2df with reasonable accuracy

As an example micro = My was fit to the kidney data whereM represented a natural spline with 8 degrees of freedom (in-cluding the intercept) One hundred simulated data vectors ylowast simN(microj σ

2I) were independently generated σ 2 = 317 each giv-ing a cross-validated df estimate df

lowast = Olowast2σ 2 These had em-pirical mean and standard deviation

dflowast sim 834 plusmn 164 (418)

Of course there is no reason to use cross-validation here be-cause the covariance penalty estimate df always equals thecorrect df value 8 This is an extreme example of the RaondashBlackwell type of result in Theorem 1 showing the cross-validation estimate df as a randomized version of df

Remark G Theorem 1 applies directly to grouped cross-validation in which the observations are removed in groupsrather than singly Suppose group i consists of observations( yi1 yi2 yiJ) and likewise microi = (microi1 microiJ) microi =(microi microiJ) y(i) equals y with group i removed andmicroi = m(y(i))i1i2iJ Theorem 1 then holds as stated with

Olowasti = sum

j Olowastij cov(i) = sum

j cov(ij) and so forth Another wayto say this is that by additivity the theory of Sections 2ndash4 canbe extended to vector observations yi

Remark H The full dataset for a prediction problem theldquotraining setrdquo is of the form

v = (v1 v2 vn) with vi = (xi yi) (419)

xi being a p vector of observed covariates such as age for thekidney data and yi a response Covariance penalties operatein a regression theory framework where the xi are consideredfixed ancillaries whether or not they are random which is whynotation such as micro = m(y) can suppress x Cross-validationhowever changes x as well as y In this framework it is moreprecise to write the prediction rule as

m(xv) for x isin X (420)

indicating that the training set v determines a rule m(middotv) whichthen can be evaluated at any x in the predictor space X (42) isbetter expressed as microi = m(xiv(i))

In the cross-validation framework we can suppose that vhas been produced by random sampling (ldquoiidrdquo) from some(p + 1)-dimensional distribution F

Fiidrarr v1 v2 vn (421)

Standard influence function arguments as in chapter 2 of Ham-pel Ronchetti Rousseeuw and Stahel (1986) give the first-order approximation

microi minus microi = m(xiv) minus m(xiv(i)

) = IFi minus IF(i)

n (422)

where IFj = IF(vj m(xiv)F) is the influence function for microi

evaluated at vj and IF(i) = sum

j =i IFj(n minus 1)The point here is that microi minus microi is Op(1n) in situations

where the influence function exists boundedly see Li (1987)for a more careful discussion In situation (410) microi minus microi =Mii( yi minus microi) so that Mii = O(1n) as in (213) implies microi minusmicroi = Op(1n) Similarly microlowast

i minus microi = Op(1n) in (46) If the

Efron The Estimation of Prediction Error 627

function q(micro) of Figure 4 is locally quadratic near microi thenQ(microi micro

lowasti ) in (46) will be Op(1n2) as claimed in (48)

Order of magnitude asymptotics are only a rough guide topractice and are not crucial to the methods discussed here Inany given situation bootstrap calculations such as (317) willgive valid estimates whether or not (213) is meaningful

Remark I A prediction rule is ldquoself-stablerdquo if adding a newpoint (xi yi) that falls exactly on the prediction surface does notchange the prediction at xi in notation (420) if

m(xiv(i) cup (xi microi)

) = microi (423)

This implies microi = sum

j =i Mijyj + Miimicroi for a linear rule whichis (410) Any ldquoleast-Qrdquo rule which chooses micro by minimizingsum

Q( yimicroi) over some candidate collection of possible microrsquosmust be self-stable and this class can be extended by addingpenalty terms as with smoothing splines Maximum likelihoodestimation in ordinary or curved GLMrsquos belongs to the least-Qclass

5 A SIMULATION

Here is a small simulation study intended to illustrate covari-ance penaltycross-validation relationships in a Bernoulli datasetting Figure 7 shows the underlying model used to generatethe simulations There are 30 bivariate vectors xi and their as-sociated probabilities microi

(ximicroi) i = 12 30 (51)

from which we generated 200 30-dimensional Bernoulli re-sponse vectors

y sim Be(micro) (52)

Figure 7 Underlying Model Used for Simulation Study n = 30 Bivari-ate Vectors xi and Associated Probabilities microi (51)

as in (32) [The underlying model (51) was itself randomlygenerated by 30 independent replications of

Yi sim Be

(1

2

)

and xi sim N2

((

Yi minus 1

20

)

I)

(53)

with microi the Bayesian posterior ProbYi = 1|xi]Our prediction rule micro = m(y) was based on the coefficients

for Fisherrsquos linear discriminant boundary α + β primex = 0

microi = 1[1 + eminus(α+β primexi)

] (54)

Equation (213) of Efron (1983) describes the (α β) computa-tions Rule (54) is not the logistic regression estimate of microi andin fact will be somewhat more efficient given mechanism (53)(Efron 1975)

Binomial deviance error (34) was used to assess predictionaccuracy Three estimates of the total expected optimism =sum30

i=1 i (310) were computed for each of the 200 y vectorsthe cross-validation estimate O = sum

Oi (43) the parametricbootstrap estimate 2

sumcovi (317) with ylowast sim Be(micro) and the

Steinian 2sum

cov(i) (322)The results appear in Figure 8 as histograms of the 200 df es-

timates (ie estimates of optimism2) The Steinian and para-metric bootstrap gave similar results correlation 72 with thebootstrap estimates slightly but consistently larger Strikingly

(a)

(b)

Figure 8 Degree-of-Freedom Estimates (optimism2) 200 Simula-tions (52) and (54) The two covariance penalty estimates Steinian (a)and parametric bootstrap (b) have about one-third the standard devia-tion of cross-validation Error measured by binomial deviance (34) trueΩ2 = 157

628 Journal of the American Statistical Association September 2004

the cross-validation estimates were much more variable havingabout three terms larger standard deviation than either covari-ance penalty All three methods were reasonably well centerednear the true value 2 = 157

Figure 8 exemplifies the RaondashBlackwell relationship (48)which guarantees that cross-validation will be more variablethan covariance penalties The comparison would have beenmore extreme if we had estimated micro by logistic regressionrather than (54) in which case the covariance penalties wouldbe nearly constant while cross-validation would still vary

In our simulation study we can calculate the true total opti-mism (328) for each y

O = 2nsum

i=1

λi middot ( yi minus microi) (55)

Figure 9 plots the Steinian estimates versus O2 for the200 simulations The results illustrate an unfortunate phe-nomenon noted in Efron (1983) Optimism estimates tend tobe small when they should be large and vice versa Cross-validation or the parametric bootstrap exhibited the same in-verse relationships The best we can hope for is to estimate theexpected optimism

If we are trying to estimate Err = err+ O with Err = err + then

Err minus Err = minus O (56)

so inverse relationships such that those in Figure 9 make Errless accurate Table 1 shows estimates of E(Err minus Err)2 fromthe simulation experiment

None of the methods did very much better than simply es-timating Err by the apparent error that is taking = 0 andcross-validation was actually worse It is easy to read too much

Figure 9 Steinian Estimate versus True Optimism2 (55) for the200 Simulations Similar inverse relationships hold for parametric boot-strap or cross-validation

Table 1 Average (Err minus Err )2 for the 200 Simulations ldquoApparentrdquoTakes Err = err (ie Ω = 0) Outsample Averages Discussed

in Remark L All Three Methods Outperformed err WhenPrediction Rule Was Ordinary Logistic Regression

Steinian ParBoot CrossVal Apparent

E(Err minus Err)2 539 529 633 578Outsample 594 582 689 641

Logistic regression 362 346 332 538

into these numbers The two points at the extreme right of Fig-ure 9 contribute heavily to the comparison as do other detailsof the computation see Remarks J and L Perhaps the mainpoint is that the efficiency of covariance penalties helps morein estimating than in estimating Err Estimating can beimportant in its own right because it provides df values for thecomparison formal or informal of different models as empha-sized in Ye (1998) Also the values of dfi as a function of xias in Figure 2 are a useful diagnostic for the geometry of thefitting process

The bottom line of Table 1 reports E(Err minus Err)2 for theprediction rule ldquoordinary logistic regressionrdquo rather than (54)Now all three methods handily beat the apparent error The av-erage prediction Err was much bigger for logistic regression615 versus 293 for (54) but Err was easier to estimate forlogistic regression

Remark J Four of the cross-validation estimates corre-sponding to the rightmost points in Figure 9 were negative(ranging from minus9 to minus28) These were truncated at 0 in Fig-ure 8 and Table 1 The parametric bootstrap estimates werebased on only B = 100 replications per case leaving substantialsimulation error Standard components-of-variance calculationsfor the 200 cases were used in Figure 8 and Table 1 to approx-imate the ideal situation B = infin

Remark K The asymptotics in Li (1985) imply that in hissetting it is possible to estimate the optimism itself rather thanits expectation However the form of (55) strongly suggeststhat O is unestimable in the Bernoulli case since it directly in-volves the unobservable componentwise differences yi minus microi

Remark L Err = sumErri (38) is the total prediction error at

the n observed covariate points xi ldquoOutsample errorrdquo

Errout = n middot E0Q

(y0m(x0v)

) (57)

where the training set v is fixed while v0 = (x0 y0) is an inde-pendent random test point drawn from F (49) is the naturalsetting for cross-validation (The factor n is included for com-parison with Err) See section 7 of Efron (1986) The secondline of Table 1 shows that replacing Err with Errout did not af-fect our comparisons Formula (414) suggests that this mightbe less true if our estimation rule had been badly biased

Table 2 shows the comparative ranks of |Err minus Err| for thefour methods of Table 1 applied to rule (54) For example theSteinian was best in 14 of the 200 simulations and worst onlyonce The corresponding ranks are also shown for |ErrminusErrout|with very similar results Cross-validation performed poorlyapparent error tended to be either best or worst the Steinian wasusually second or third while the parametric bootstrap spreadrather evenly across the four ranks

Efron The Estimation of Prediction Error 629

Table 2 Left Comparative Ranks of |Err minus Err| for the 200 Simulations(51)ndash(54) Right Same for |Err minus Errout|

Stein ParBoot CrVal App Stein ParBoot CrVal App

1 14 48 33 105 17 50 32 1012 106 58 31 5 104 58 35 33 79 56 55 10 78 54 55 134 1 38 81 80 1 38 78 83

Mean 234 242 292 233 231 240 290 239rank

6 THE NONPARAMETRIC BOOTSTRAP

Nonparametric bootstrap methods for estimating predictionerror depend on simple random resamples vlowast = (vlowast

1 vlowast2 vlowast

n)

from the training set v (417) rather than parametric resam-ples as in (214) Efron (1983) examined the relationship be-tween the nonparametric bootstrap and cross-validation Thissection develops a RaondashBlackwell type of connection betweenthe nonparametric and parametric bootstrap methods similar toSection 4rsquos cross-validation results

Suppose we have constructed B nonparametric bootstrapsamples vlowast each of which gives a bootstrap estimate microlowastwith microlowast

i = m(xivlowast) in the notation of (418) Let Nbi indi-

cate the number of times vi occurs in bootstrap sample vlowastbb = 12 B define the indicator

Ibi (h) =

1 if Nbi = h

0 if Nbi = h

(61)

h = 01 n and let Qi(h) be the average error when Nbi = h

Qi(h) =sum

b

Ibi (h)Q

(yi micro

lowastbi

)sum

b

Ibi (h) (62)

We expect Qi(0) the average error when vi not involved in thebootstrap prediction of yi to be larger than Qi(1) which will belarger than Qi(2) and so on

A useful class of nonparametric bootstrap optimism esti-mates takes the form

Oi =nsum

h=1

B(h)Si(h) Si(h) = Qi(0) minus Q(h)

h (63)

the ldquoSrdquo standing for ldquosloperdquo Letting Pn(h) be the binomial re-sampling probability

pn(h) = ProbBi(n1n) = h =(

nh

)(n minus 1)nminush

nn (64)

section 8 of Efron (1983) considers two particular choices ofB(h)

ldquoω(boot)rdquo B(h) = h(h minus 1)pn(h) and(65)

ldquoω(0)rdquo B(h) = hpn(h)

Here we will concentrate on (63) with B(h) = hpn(h) whichis convenient but not crucial to the discussion Then B(h) is aprobability distribution

sumn1 B(h) = 1 with expectation

nsum

1

B(h) middot h = 1 + n minus 1

n (66)

The estimate Oi = sumn1 B(h)Si(h) is seen to be a weighted av-

erage of the downward slopes Si(h) Most of the weight is onthe first few values of h because B(h) rapidly approaches theshifted Poisson(1) density eminus1(h minus 1) as n rarr infin

We first consider a conditional version of the nonparametricbootstrap Define v(i)(h) to be the augmented training set

v(i)(h) = v(i) cup h copies of (xi yi) h = 01 n

(67)

giving corresponding estimates microi(h) = m(xiv(i)(h)) andλi(h) = minusq(microi(h))2 For v(i)(0) = v(i) the training set withvi = (xi yi) removed microi(0) = microi (42) and λi(0) = λi Theconditional version of definition (63) is

O(i) =nsum

h=1

B(h)Si(h)

Si(h) = [Q( yi microi) minus Q

(yi microi(h)

)]h (68)

This is defined to be the conditional nonparametric bootstrapestimate of the conditional optimism (i) (320) Notice thatsetting B(h) = (100 0) would make O(i) equal Oi thecross-validation estimate (43)

As before we can average O(i)(y) over conditional para-metric resamples ylowast = (y(i) ylowast

i ) (45) with y(i) and x =(x1 x2 xn) fixed That is we can parametrically averagethe nonparametric bootstrap The proof of Theorem 1 applieshere giving a similar result

Theorem 2 Assuming (45) the conditional parametricbootstrap expectation of Olowast

(i) = O(i)(y(i) ylowasti ) is

E(i)Olowast

(i)

= 2nsum

h=1

B(h)cov(i)(h)h

minusnsum

h=1

B(h)E(i)Q

(microi micro

lowasti (h)

)h (69)

where

cov(i)(h) = E(i) λi(h)lowast( ylowasti minus microi) (610)

The second term on the right side of (69) is Op(1n2) asin (48) giving

E(i)Olowast

(i)

= 2nsum

h=1

B(h)cov(i)(h)

h (611)

Point vi has h times more weight in the augmented training setv(i)(h) than in v = vi(1) so as in (420) influence functioncalculations suggest

microlowasti (h) minus microi = h middot (microlowast

i minus microi) and cov(i)(h) = h middot cov(i)

(612)

microlowasti = microlowast

i (1) so that (611) becomes

E(i)Olowast

(i)

= 2cov(i) = (i) (613)

Averaging the conditional nonparametric bootstrap estimatesover parametric resamples (y(i) ylowast

i ) results in a close approx-imation to the conditional covariance penalty (i)

630 Journal of the American Statistical Association September 2004

Expression (69) can be exactly evaluated for linear projec-tion estimates micro = My (using squared error loss)

M = X(X1X)minus1Xprime Xprime = (x1 x2 xn) (614)

Then the projection matrix corresponding to v(i)(h) has ith di-agonal element

Mii(h) = hMii

1 + (h minus 1)Mii Mii = Mii(1) = xprime

i(XprimeX)minus1xi

(615)

and if ylowasti sim (microi σ

2) with y(i) fixed then cov(i) = σ 2Mii(h) Us-ing (66) and the self-stable relationship microi minus microi = Mii( yi minus microi)(69) can be evaluated as

E(i)Olowasti = (i) middot [1 minus 4Mii] (616)

In this case (613) errs by a factor of only [1 + O(1n)]Result (612) implies an approximate RaondashBlackwell rela-

tionship between nonparametric and parametric bootstrap opti-mism estimates when both are carried out conditionally on v(i)As with cross-validation this relationship seems to extend tothe more familiar unconditional bootstrap estimator Figure 10concerns the kidney data and squared error loss where thistime the fitting rule micro = m(y) is ldquoloess(tot sim age span = 5)rdquoLoess unlike lowess is a linear rule micro = My although it is notself-stable The solid curve traces the coordinatewise covari-ance penalty dfi estimates Mii as a function of agei

The small points in Figure 10 represent individual uncon-ditional nonparametric bootstrap dfi estimates Olowast

i 2σ 2 (63)

evaluated for 50 parametric bootstrap data vectors ylowast obtainedas in (217) Remark M provides the details Their means acrossthe 50 replications the triangles follow the Mii curve As withcross-validation if we attempt to improve nonparametric boot-strap optimism estimates by averaging across the ylowast vectorsgiving the covariance penalty i we wind up close to i it-self

As in Figure 8 we can expect nonparametric bootstrap df esti-mates to be much more variable than covariance penalties Var-ious versions of the nonparametric bootstrap particularly theldquo632 rulerdquo outperformed cross-validation in Efron (1983) andEfron and Tibshirani (1997) and may be preferred when non-parametric error predictions are required However covariancepenalty methods offer increased accuracy whenever their un-derlying models are believable

A general verification of the results of Figure 10 linkingthe unconditional versions of the nonparametric and parametricbootstrap df estimates is conjectural at this point Remark Noutlines a plausibility argument

Remark M Figure 10 involved two resampling levels Para-metric bootstrap samples ylowasta = micro+εlowasta were drawn as in (217)for a = 12 50 with micro and the residuals εj = yj minus microj

determined by loess(span = 5) then B = 200 nonparametricbootstrap samples were drawn from each set vlowasta = (vlowasta

1 vlowasta2

vlowastan ) with say Nab

j repetitions of vlowastaj = (xj ylowasta

j ) in theabth nonparametric resample b = 12 B For each ldquoardquo then times B matrices of counts Nab

i and estimates microlowastabi gave Qi(h)lowasta

Figure 10 Small Dots Indicate 50 Parametric Bootstrap Replications of Unconditional Nonparametric Optimism Estimates (63) TrianglesTheir Averages Closely Follow the Covariance Penalty Estimates (solid curve) Vertical distance plotted in df units Here the estimation rule isloess(span = 5) See Remark M for details

Efron The Estimation of Prediction Error 631

Si(h)lowasta and Olowastai as in (62)ndash(63) The points Olowasta

i 2σ 2 (withσ 2 = sum

ε2j n) are the small dots in Figure 10 while the tri-

angles are their averages over a = 12 50 Standard t testsaccepted the null hypotheses that the averages were centered onthe solid curve

Remark N Theorem 2 applies to parametric averaging ofthe conditional nonparametric bootstrap For the usual uncondi-tional nonparametric bootstrap the bootstrap weights Nj on thepoints vj = (xj yj) in v(i) vary so that the last term in (44) is nolonger negated by assumption (45) Instead it adds a remainderterm to (69)

minus2sum

h=1

B(h)E(i)(

λlowasti (h) minus λlowast

i

)(microlowast

i minus microi)h (617)

Here microlowasti = m(xivlowast

(i)) where vlowast(i) puts weight Nj on vj for j = i

and λlowasti = minusq(microlowast

i )2To justify approximation (613) we need to show that (617)

is op(1n) This can be demonstrated explicitly for linear pro-jections (614) The result seems plausible in general sinceλlowast

i (h) minus λlowasti is Op(1n) while microlowast

i minus microi the nonparametric boot-strap deviation of microlowast

i from microi would usually be Op(1radic

n )

7 SUMMARY

Figure 11 classifies prediction error estimates on two crite-ria Parametric (model-based) versus nonparametric and con-ditional versus unconditional The classification can also bedescribed by which parts of the training set (xj yj) j =12 n are varied in the error rate computations TheSteinian only varies yi in estimating the ith error rate keep-ing all the covariates xj and also yj for j = i fixed at the otherextreme the nonparametric bootstrap simultaneously varies theentire training set

Here are some comparisons and comments concerning thefour methods

bull The parametric methods require modeling assumptionsin order to carry out the covariance penalty calculations

Figure 11 Two-Way Classification of Prediction Error Estimates Dis-cussed in This Article The conditional methods are local in the sensethat only the i th case data are varied in estimating the i th error rate

When these assumptions are justified the RaondashBlackwelltype of results of Sections 4 and 6 imply that the paramet-ric techniques will be more efficient than their nonpara-metric counterparts particularly for estimating degrees offreedom

bull The modeling assumptions need not rely on the estimationrule micro = m(y) under investigation We can use ldquobiggerrdquomodels as in Remark A that is ones less likely to be bi-ased

bull Modeling assumptions are less important for rules micro =m(y) that are close to linear In genuinely linear situationssuch as those needed for the Cp and AIC criteria the co-variance corrections are constants that do not depend onthe model at all The centralized version of SURE (325)extends this property to maximum likelihood estimation incurved families

bull Local methods extrapolate error estimates from smallchanges in the training set Global methods make muchlarger changes in the training set of a size commensuratewith actual random sampling which is an advantage indealing with ldquoroughrdquo rules m(y) such as nearest neighborsor classification trees see Efron and Tibshirani (1997)

bull Steinrsquos SURE criterion (211) is local because it dependson partial derivatives and parametric (29) without beingmodel based It performed more like cross-validation thanthe parametric bootstrap in the situation of Figure 2

bull The computational burden in our examples was less forglobal methods Equation (218) with λlowastb

i replacing microlowastbi

for general error measures helps determine the numberof replications B required for the parametric bootstrapGrouping the usual labor-saving tactic in applying cross-validation can also be applied to covariance penalty meth-ods as in Remark G though now it is not clear that this iscomputationally helpful

bull As shown in Remark B the bootstrap methodrsquos computa-tions can also be used for hypothesis tests comparing theefficacy of different models

Accurate estimation of prediction error tends to be difficult inpractice particularly when applied to the choice between com-peting rules micro = m(y) In the authorrsquos opinion it will often beworth chancing plausible modeling assumptions for the covari-ance penalty estimates rather than relying entirely on nonpara-metric methods

[Received December 2002 Revised October 2003]

REFERENCES

Akaike H (1973) ldquoInformation Theory and an Extension of the MaximumLikelihood Principlerdquo Second International Symposium on Information The-ory 267ndash281

Breiman L (1992) ldquoThe Little Bootstrap and Other Methods for Dimensional-ity Selection in Regression X-Fixed Prediction Errorrdquo Journal of the Ameri-can Statistical Association 87 738ndash754

Efron B (1975) ldquoDefining the Curvature of a Statistical Problem (With Appli-cations to Second Order Efficiency)rdquo (with discussion) The Annals of Statis-tics 3 1189ndash1242

(1975) ldquoThe Efficiency of Logistic Regression Compared to NormalDiscriminant Analysesrdquo Journal of the American Statistical Association 70892ndash898

(1983) ldquoEstimating the Error Rate of a Prediction Rule Improvementon Cross-Validationrdquo Journal of the American Statistical Association 78316ndash331

(1986) ldquoHow Biased Is the Apparent Error Rate of a Prediction RulerdquoJournal of the American Statistical Association 81 461ndash470

632 Journal of the American Statistical Association September 2004

Efron B and Tibshirani R (1997) ldquoImprovements on Cross-ValidationThe 632+ Bootstrap Methodrdquo Journal of the American Statistical Associa-tion 92 548ndash560

Hampel F Ronchetti E Rousseeuw P and Stahel W (1986) Robust Statis-tics the Approach Based on Influence Functions New York Wiley

Hastie T and Tibshirani R (1990) Generalized Linear Models LondonChapman amp Hall

Li K (1985) ldquoFrom Steinrsquos Unbiased Risk Estimates to the Method of Gener-alized Cross-Validationrdquo The Annals of Statistics 13 1352ndash1377

(1987) ldquoAsymptotic Optimality for CpCL Cross-Validation and Gen-eralized Cross-Validation Discrete Index Setrdquo The Annals of Statistics 15958ndash975

Mallows C (1973) ldquoSome Comments on Cp rdquo Technometrics 15 661ndash675

Rockafellar R (1970) Convex Analysis Princeton NJ Princeton UniversityPress

Shen X Huang H-C and Ye J (2004) ldquoAdaptive Model Selection and As-sessment for Exponential Family Modelsrdquo Technometrics to appear

Shen X and Ye J (2002) ldquoAdaptive Model Selectionrdquo Journal of the Ameri-can Statistical Association 97 210ndash221

Stein C (1981) ldquoEstimation of the Mean of a Multivariate Normal Distribu-tionrdquo The Annals of Statistics 9 1135ndash1151

Tibshirani R and Knight K (1999) ldquoThe Covariance Inflation Criterion forAdaptive Model Selectionrdquo Journal of the Royal Statistical Society Ser B61 529ndash546

Ye J (1998) ldquoOn Measuring and Correcting the Effects of Data Miningand Model Selectionrdquo Journal of the American Statistical Association 93120ndash131

CommentPrabir BURMAN

I would like to begin by thanking Professor Efron for writinga paper that sheds new light on cross-validation and relatedmethods along with his proposals for stable model selec-tion procedures Stability can be an issue for ordinary cross-validation especially for not-so-smooth procedures such asstepwise regression and other such sequential procedures Ifwe use the language of learning and test sets ordinary cross-validation uses a learning set of size n minus 1 and a test set ofsize 1 If one can average over a test set of infinite size thenone gets a stable estimator Professor Efron demonstrates thisleads to a RaondashBlackwellization of ordinary cross-validation

The parametric bootstrap proposed here does require know-ing the conditional distribution of an observation given the restwhich in turn requires a knowledge of the unknown parametersProfessor Efron argues that for a near-linear case this poses noproblem A question naturally arises What happens to thosecases where the methods are considerably more complicatedsuch as stepwise methods

Another issue that is not entirely clear is the choice betweenthe conditional and unconditional bootstrap methods The con-ditional bootstrap seems to be better but it can be quite expen-sive computationally Can the unconditional bootstrap be usedas a general method always

It seems important to point out that ordinary cross-validationis not as inadequate as the present paper seems to suggestIf model selection is the goal then estimation of the overallprediction error is what one can concentrate on Even if thecomponentwise errors are not necessarily small ordinary cross-validation may still provide reasonable estimates especially ifthe sample size n is at least moderate and the estimation proce-dure is reasonably smooth

In this connection I would like to point out that methods suchas repeated v-fold (or multifold) cross-validation or repeated (orbootstrapped) learning-testing can improve on ordinary cross-validation because the test set sizes are not necessarily small(see eg the CART book by Breiman Friedman Olshen andStone 1984 Burman 1989 Zhang 1993) In addition suchmethods can reduce computational costs substantially In a re-peated v-fold cross-validation the data are repeatedly randomly

Prabir Burman is Professor Department of Statistics University of Califor-nia Davis CA 95616 (E-mail burmanwalducdavisedu)

split into v groups of roughly equal sizes For each repeat thereare v learning sets and the corresponding test sets Each learn-ing set is of size n(1 minus 1v) approximately and each test set isof size nv In a repeated learningndashtesting method the data arerandomly split into a learning set of size n(1 minus p) and a test setof size np where 0 lt p lt 1 If one takes a small v say v = 3in a v-fold cross-validation or a value of p = 13 in a repeatedlearningndashtesting method then each test set is of size n3 How-ever a correction term is needed in order to account for the factthat each learning set is of a size that is considerably smallerthan n (Burman 1989)

I ran a simulation for the classification case with the modelY is Bernoulli(π(X)) where π(X) = 1 minus sin2(2πX) and X isUniform(01) A majority voting scheme was used among thek nearest neighbor neighbors True misclassification errors (inpercent) and their standard errors (in parentheses) are given inTable 1 along with ordinary cross-validation corrected three-fold cross-validation with 10 repeats and corrected repeated-testing (RLT) methods with p = 33 and 30 repeats The samplesize is n = 100 and the number of replications is 25000 It canbe seen that the corrected v-fold cross-validation or repeatedlearningndashtesting methods can provide some improvement overordinary cross-validation

I would like to end my comments with thanks to Profes-sor Efron for providing significant and valuable insights intothe subject of model selection and for developing new methodsthat are improvements over a popular method such as ordinarycross-validation

Table 1 Classification Error Rates (in percent)

TCH k = 7 k = 9 k = 11

True 2282(414) 2455(487) 2726(543)CV 2295(701) 2482(718) 2765(755)Threefold CV 2274(561) 2456(556) 2704(582)RLT 2270(570) 2455(565) 2711(595)

copy 2004 American Statistical AssociationJournal of the American Statistical Association

September 2004 Vol 99 No 467 Theory and MethodsDOI 101198016214504000000908

Page 6: The Estimation of Prediction Error: Covariance Penalties ...jordan/sail/... · The Estimation of Prediction Error: Covariance Penalties and Cross-Validation Bradley E FRON Having

624 Journal of the American Statistical Association September 2004

There is however one situation where the conditional co-variances are easily computed the Bernoulli case y sim Be(micro)In this situation it is easy to see that

cov(i) = microi(1 minus microi)[λi

(y(i)1

) minus λi(y(i)0

)] (321)

the notation indicating the two possible values of λi with y(i)

fixed and yi either 1 or 0 This leads to estimates

cov(i) = microi(1 minus microi)[λi

(y(i)1

) minus λi(y(i)0

)] (322)

Calculating cov(i) for i = 12 n requires only n recompu-tations of m(middot) one for each i the same number as for cross-validation For reasons discussed next (322) will be termedthe Steinian

There is no general equivalent to the Gaussian SURE for-mula (210) that is an unbiased estimator for cov(i) How-ever a useful approximation can be derived as follows Letti( ylowast

i ) = λi( y(i) ylowasti ) indicate λlowast

i as a function of ylowasti with y(i)

fixed and denote ti = part ti( ylowasti )partylowast

i |microi in Figure 5 ti is the slopeof the solid curve as it crosses the vertical line Suppose ylowast

i hasbootstrap mean and variance (microi Vi) Taylor series yield a sim-ple approximation for cov(i)

cov(i) = E(i)λi middot ( ylowasti minus microi)

= E(i)[ti(microi) + ti middot ( ylowast

i minus microi)]( ylowasti minus microi)

= Viti (323)

only ylowasti being random here The Steinian (322) is a discretized

version of (323) applied to the Bernoulli case which has Vi =microi(1 minus microi)

If Q( y micro) is the deviance function for a one-parameter expo-nential family then λi is the natural parameter and dλidmicroi =1Vi Therefore

cov(i) = Vipartλi

partylowasti

∣∣∣∣microi

= Vi1

Vi

partmicroi

partylowasti

∣∣∣∣microi

= partmicroi

partylowasti

∣∣∣∣microi

(324)

This is a centralized version of the SURE estimate where nowpartmicroipartyi is evaluated at microi instead of yi [In the exponentialfamily representation of the Gaussian case (29) Q( y micro) =( yminusmicro)2σ 2 so the factor σ 2 in (210) has been absorbed into Qin (324)]

Remark D The centralized version of SURE in (324) givesthe correct total degrees of freedom for maximum likelihoodestimation in a p-parameter generalized linear model or moregenerally in a p-parameter curved exponential family (Efron1975)

nsum

i=1

partmicroi

partyi

∣∣∣∣y=micro

= p (325)

The usual uncentralized version of SURE does not satisfy(325) in curved families

Using deviance error and maximum likelihood estimationin a curved exponential family makes erri = minus2 log fmicroi( yi) +constant Combining (314) (324) and (325) gives

Err = minus 2

[sum

i

log fmicroi( yi) minus p + constant

]

(326)

Choosing among competing models by minimizing Erris equivalent to maximizing the penalized likelihoodsum

log fmicroi( yi) minus p which is Akaikersquos information criterion(AIC) These results generalize those for GLMrsquos in section 6of Efron (1986) and will not be verified here

Remark E It is easy to show that the true prediction errorErri (38) satisfies

Erri = Q(microi microi) + D(microi)[D(microi) equiv EQ( yimicroi)

] (327)

For squared error this reduces to the familiar result E0( y0i minus

microi)2 = (microi minus microi)

2 + σ 2 In the Bernoulli case (32) D(microi) =q(microi) and the basic result (316) can be simplified to

Bernoulli case Oi = 2λi( yi minus microi) (328)

using (35)

Remark F The q class includes an asymmetric version ofcounting error (33) that allows the decision boundary to be ata point π1 in (01) other than 12 Letting π0 = 1 minus π1 andρ = (π0π1)

12

q(micro) = min

ρmicro1

ρ(1 minus micro)

rarr Q( y micro) =

0 if y micro same side of π1

ρ if y = 1 and micro le π11

ρif y = 0 and micro gt π1

(329)

Now Q(10)Q(01) = π0π1 This is the appropriate lossstructure for a simple hypothesis-testing situation in which wewant to compensate for unequal prior sampling probabilities

4 THE RELATIONSHIP WITH CROSSndashVALIDATION

Cross-validation is the most widely used error predictiontechnique This section relates cross-validation to covariancepenalties more exactly to conditional parametric bootstrap co-variance penalties A RaondashBlackwell type of relationship willbe developed If we average cross-validation estimates acrossthe bootstrap datasets used to calculate the conditional covari-ances then we get to a good approximation the covariancepenalty The implication is that covariance penalties are moreaccurate than cross-validation assuming of course that we trustthe parametric bootstrap model A similar conclusion is reachedin Shen et al (2002)

The cross-validation estimate of prediction error for coordi-nate i is

Erri = Q( yi microi) (41)

where microi is the ith coordinate of the estimate of micro based on thedeleted dataset y(i) = ( y1 y2 yiminus1 yi+1 yn) say

microi = m(y(i)

)

i (42)

(see Remark H) Equivalently cross-validation estimates theoptimism Oi = Erri minus erri by

Oi = Erri minus erri = Q( yi microi) minus Q( yi microi) (43)

Efron The Estimation of Prediction Error 625

Lemma Letting λi = minusq(microi)2 and λi = minusq(microi)2 asin (311)

Oi = 2(λi minus λi)( yi minus microi) minus Q(microi microi) minus 2(λi minus λi)(microi minus microi)

(44)This is verified directly from definition (31)

The lemma helps connect cross-validation with the condi-tional covariance penalties of (319)ndash(320) Cross-validationitself is conditional in the sense that y(i) is fixed in the calcu-lation of Oi so it is reasonable to suspect some sort of con-nection Suppose that we estimate cov(i) by bootstrap samplingas in (317) but now with y(i) fixed and only ylowast

i random saywith density fi The form of (44) makes it especially conve-nient for ylowast

i to have conditional expectation microi (rather than theobvious choice microi) which we denote by

E(i)ylowasti equiv Efi

ylowast

i

∣∣y(i)

= microi (45)

In a Bernoulli situation we would take ylowasti sim Be(microi)

Denote the bootstrap versions of microi and λi as microlowasti = m(y(i) ylowast

i )

and λlowasti = minusq(microlowast

i )2

Theorem 1 With ylowasti sim fi satisfying (45) and y(i) fixed

E(i)Olowasti = 2cov(i) minus E(i)Q(microi micro

lowasti ) (46)

cov(i) being the conditional covariance estimate E(i)λlowasti middot ( ylowast

i minusmicroi)

Proof Applying the lemma with microi rarr microi yi rarr ylowasti

λi rarr λlowasti and microi rarr microlowast

i gives

Olowasti = 2(λlowast

i minus λi)( ylowasti minus microi) minus Q(microi micro

lowasti ) (47)

Notice that microi and λi stay fixed in (47) because they dependonly on y(i) and that this same fact eliminates the last termin (44) Taking conditional expectations E(i) in (47) completesthe proof

In (46) 2cov(i) equals (i) the estimate of the condi-tional covariance penalty (i) (320) Typically (i) is of orderOp(1n) as in (213) while the remainder term E(i)Q(microi micro

lowasti )

is only Op(1n2) See Remark H The implication is that

E(i)Olowasti = (i) = 2 middot cov(i) (48)

In other words averaging the cross-validation estimate Olowasti

over fi the distribution of ylowasti used to calculate the covari-

ance penalty (i) gives approximately (i) itself If we thinkof fi as summarizing all available information for the unknowndistribution of yi that is as a sufficient statistic then (i) is aRaondashBlackwellized improvement on Oi

This same phenomenon occurs beyond the conditional frame-work of the theorem Figure 6 concerns cross-validation of thelowess(xy13) curve in Figure 1 Using the same uncondi-tional resampling model (217) as in Figure 2 B = 200 boot-strap replications of the cross-validation estimate (43) weregenerated

Olowastbi = Q

(ylowastb

i microlowastbi

) minus Q(ylowastb

i microlowastbi

)

i = 12 n and b = 12 B (49)

The small points in Figure 6 indicate individual values Olowastbi

2σ 2 The triangles show averages over the 200 replicationsOlowastmiddot

i 2σ 2 There is striking agreement with the covariance

Figure 6 Small Dots Indicate 200 Bootstrap Replications of Cross-Validation Optimism Estimates (49) Triangles Their Averages Closely Matchthe Covariance Penalty Curve From Figure 2 (Vertical distance plotted in df units)

626 Journal of the American Statistical Association September 2004

penalty curve coviσ2 from Figure 2 confirming EOlowast

i = i

as in (48) Nearly the same results were obtained bootstrappingfrom micro rather than micro

Approximation (48) can be made more explicit in the caseof squared error loss applied to linear rules micro = My that areldquoself-stablerdquo that is where the cross-validation estimate (42)satisfies

microi =sum

j =i

Mijyj Mij = Mij

(1 minus Mii) (410)

Self-stable rules include all the usual regression and ANOVAestimate as well as spline methods see Remark I Suppose weare resampling from yi sim fi with mean and variance

ylowasti sim (microi σ

2) (411)

where microi might differ from microi or microi The covariance penalty i

is then estimated by i = 2σ 2MiiUsing (410) it is straightforward to calculate the conditional

expectation of the cross-validation estimate Oi

Efi

Olowast

i

∣∣y(i)

= i middot [1 minus Mii2][

1 +(

microi minus microi

σ

)2]

(412)

If microi = microi then (412) becomes E(i)Olowasti = i[1minusMii2] an ex-

act version of (48) The choice microi = microi results in

Efi

Olowast

i

∣∣y(i)

= i[1 minus Mii2][

1 +(

Miiyi minus microi

σ

)2]

(413)

In both cases the conditional expectation of Olowasti is i[1 +

Op(1n)] where the Op(1n) term tends to be slightly negativeThe unconditional expectation of Oi with respect to the true

distribution y sim (micro σ 2I) looks like (412)

EOi = i[1 minus Mii2][

1 +sum

j =i

M2ij + 2

i

]

(414)

i equaling the covariance penalty 2σ 2Mii and

2i =

[(

microi minussum

j =i

Mijmicroj

)

σ

]2

(415)

For M a projection matrix M2 = M the termsum

M2ij = Mii(1minus

Mii) EOi exceeds i but only by a factor of 1 + O(1n) if2

i = 0 Notice thatsum

j =i

Mijmicroj minus microi = Emicroi minus microi (416)

so that 2i will be large if the cross-validation estimator microi is

badly biasedFinally suppose y sim N(micro σ 2I) and micro = My is a self-stable

projection estimator Then the coefficient of variation of Oi is

CVOi = 2 + 4(1 minus Mii)2i

(1 + 2(1 minus Mii)2i )

2= 2 (417)

the last approximation being quite accurate unless microi is badlybiased This says that Oi must always be a highly variableestimate of its expectation (414) or approximately of i =2σ 2Mii However it is still possible for the sum O = iOi toestimate ii = 2σ 2df with reasonable accuracy

As an example micro = My was fit to the kidney data whereM represented a natural spline with 8 degrees of freedom (in-cluding the intercept) One hundred simulated data vectors ylowast simN(microj σ

2I) were independently generated σ 2 = 317 each giv-ing a cross-validated df estimate df

lowast = Olowast2σ 2 These had em-pirical mean and standard deviation

dflowast sim 834 plusmn 164 (418)

Of course there is no reason to use cross-validation here be-cause the covariance penalty estimate df always equals thecorrect df value 8 This is an extreme example of the RaondashBlackwell type of result in Theorem 1 showing the cross-validation estimate df as a randomized version of df

Remark G Theorem 1 applies directly to grouped cross-validation in which the observations are removed in groupsrather than singly Suppose group i consists of observations( yi1 yi2 yiJ) and likewise microi = (microi1 microiJ) microi =(microi microiJ) y(i) equals y with group i removed andmicroi = m(y(i))i1i2iJ Theorem 1 then holds as stated with

Olowasti = sum

j Olowastij cov(i) = sum

j cov(ij) and so forth Another wayto say this is that by additivity the theory of Sections 2ndash4 canbe extended to vector observations yi

Remark H The full dataset for a prediction problem theldquotraining setrdquo is of the form

v = (v1 v2 vn) with vi = (xi yi) (419)

xi being a p vector of observed covariates such as age for thekidney data and yi a response Covariance penalties operatein a regression theory framework where the xi are consideredfixed ancillaries whether or not they are random which is whynotation such as micro = m(y) can suppress x Cross-validationhowever changes x as well as y In this framework it is moreprecise to write the prediction rule as

m(xv) for x isin X (420)

indicating that the training set v determines a rule m(middotv) whichthen can be evaluated at any x in the predictor space X (42) isbetter expressed as microi = m(xiv(i))

In the cross-validation framework we can suppose that vhas been produced by random sampling (ldquoiidrdquo) from some(p + 1)-dimensional distribution F

Fiidrarr v1 v2 vn (421)

Standard influence function arguments as in chapter 2 of Ham-pel Ronchetti Rousseeuw and Stahel (1986) give the first-order approximation

microi minus microi = m(xiv) minus m(xiv(i)

) = IFi minus IF(i)

n (422)

where IFj = IF(vj m(xiv)F) is the influence function for microi

evaluated at vj and IF(i) = sum

j =i IFj(n minus 1)The point here is that microi minus microi is Op(1n) in situations

where the influence function exists boundedly see Li (1987)for a more careful discussion In situation (410) microi minus microi =Mii( yi minus microi) so that Mii = O(1n) as in (213) implies microi minusmicroi = Op(1n) Similarly microlowast

i minus microi = Op(1n) in (46) If the

Efron The Estimation of Prediction Error 627

function q(micro) of Figure 4 is locally quadratic near microi thenQ(microi micro

lowasti ) in (46) will be Op(1n2) as claimed in (48)

Order of magnitude asymptotics are only a rough guide topractice and are not crucial to the methods discussed here Inany given situation bootstrap calculations such as (317) willgive valid estimates whether or not (213) is meaningful

Remark I A prediction rule is ldquoself-stablerdquo if adding a newpoint (xi yi) that falls exactly on the prediction surface does notchange the prediction at xi in notation (420) if

m(xiv(i) cup (xi microi)

) = microi (423)

This implies microi = sum

j =i Mijyj + Miimicroi for a linear rule whichis (410) Any ldquoleast-Qrdquo rule which chooses micro by minimizingsum

Q( yimicroi) over some candidate collection of possible microrsquosmust be self-stable and this class can be extended by addingpenalty terms as with smoothing splines Maximum likelihoodestimation in ordinary or curved GLMrsquos belongs to the least-Qclass

5. A SIMULATION

Here is a small simulation study intended to illustrate covariance penalty/cross-validation relationships in a Bernoulli data setting. Figure 7 shows the underlying model used to generate the simulations. There are 30 bivariate vectors $x_i$ and their associated probabilities $\mu_i$,

\[
(x_i, \mu_i), \qquad i = 1, 2, \ldots, 30, \tag{5.1}
\]

from which we generated 200 30-dimensional Bernoulli response vectors

\[
y \sim \mathrm{Be}(\mu), \tag{5.2}
\]

as in (3.2). [The underlying model (5.1) was itself randomly generated by 30 independent replications of

\[
Y_i \sim \mathrm{Be}\Bigl(\tfrac{1}{2}\Bigr) \quad\text{and}\quad x_i \sim N_2\Biggl(\binom{Y_i - \frac{1}{2}}{0},\ I\Biggr), \tag{5.3}
\]

with $\mu_i$ the Bayesian posterior $\mathrm{Prob}\{Y_i = 1 \mid x_i\}$.]

Figure 7. Underlying Model Used for Simulation Study: n = 30 Bivariate Vectors $x_i$ and Associated Probabilities $\mu_i$ (5.1).

Our prediction rule $\hat\mu = m(y)$ was based on the coefficients for Fisher's linear discriminant boundary $\hat\alpha + \hat\beta'x = 0$,

\[
\hat\mu_i = 1\big/\bigl[1 + e^{-(\hat\alpha + \hat\beta' x_i)}\bigr]. \tag{5.4}
\]

Equation (2.13) of Efron (1983) describes the $(\hat\alpha, \hat\beta)$ computations. Rule (5.4) is not the logistic regression estimate of $\mu_i$, and in fact will be somewhat more efficient given mechanism (5.3) (Efron 1975).
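The following minimal Python sketch mirrors this setup under the reconstruction of (5.3) given above (class means $(\pm\tfrac12, 0)$, identity covariance); the discriminant coefficients use the standard normal-theory plug-in estimates, which stand in here for the $(\hat\alpha, \hat\beta)$ computations of Efron (1983). All names and numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30

# underlying model (5.1)/(5.3): labels, bivariate features, true posterior probabilities
Y = rng.binomial(1, 0.5, n)
x = rng.standard_normal((n, 2)) + np.c_[Y - 0.5, np.zeros(n)]
mu = 1.0 / (1.0 + np.exp(-x[:, 0]))          # Prob{Y=1|x} under equal priors, unit covariance

def fisher_lda(x, y):
    """Normal-theory plug-in for the linear discriminant coefficients in (5.4)."""
    x0, x1 = x[y == 0], x[y == 1]
    S = ((len(x0) - 1) * np.cov(x0.T) + (len(x1) - 1) * np.cov(x1.T)) / (len(x) - 2)
    beta = np.linalg.solve(S, x1.mean(0) - x0.mean(0))
    alpha = np.log(len(x1) / len(x0)) - 0.5 * (x1.mean(0) + x0.mean(0)) @ beta
    return alpha, beta

def predict(alpha, beta, x):
    return 1.0 / (1.0 + np.exp(-(alpha + x @ beta)))

y = rng.binomial(1, mu)                      # one simulated response vector, as in (5.2)
alpha_hat, beta_hat = fisher_lda(x, y)
mu_hat = predict(alpha_hat, beta_hat, x)     # rule (5.4) applied to the 30 cases
```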

Binomial deviance error (3.4) was used to assess prediction accuracy. Three estimates of the total expected optimism $\Omega = \sum_{i=1}^{30}\Omega_i$ (3.10) were computed for each of the 200 $y$ vectors: the cross-validation estimate $\hat O = \sum \hat O_i$ (4.3), the parametric bootstrap estimate $2\sum\widehat{\mathrm{cov}}_i$ (3.17) with $y^* \sim \mathrm{Be}(\hat\mu)$, and the Steinian $2\sum\widehat{\mathrm{cov}}_{(i)}$ (3.22).

The results appear in Figure 8 as histograms of the 200 df estimates (i.e., estimates of optimism/2). The Steinian and parametric bootstrap gave similar results, correlation .72, with the bootstrap estimates slightly but consistently larger. Strikingly, the cross-validation estimates were much more variable, having about three times larger standard deviation than either covariance penalty. All three methods were reasonably well centered near the true value $\Omega/2 = 1.57$.

Figure 8. Degree-of-Freedom Estimates (optimism/2), 200 Simulations (5.2) and (5.4). The two covariance penalty estimates, Steinian (a) and parametric bootstrap (b), have about one-third the standard deviation of cross-validation. Error measured by binomial deviance (3.4); true $\Omega/2$ = 1.57.

Figure 8 exemplifies the Rao–Blackwell relationship (4.8), which guarantees that cross-validation will be more variable than covariance penalties. The comparison would have been more extreme if we had estimated $\mu$ by logistic regression rather than (5.4), in which case the covariance penalties would be nearly constant while cross-validation would still vary.

In our simulation study we can calculate the true total optimism (3.28) for each $y$,

\[
O = 2\sum_{i=1}^{n} \hat\lambda_i \cdot (y_i - \mu_i). \tag{5.5}
\]
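Continuing the sketch above (same $x$, $y$, $\mu$, `fisher_lda`, and `predict`; everything remains illustrative), the fragment below computes one cross-validation optimism estimate (4.3) and the corresponding true optimism (5.5) on the binomial deviance scale, using the fact that for deviance error $\hat\lambda_i = -\dot q(\hat\mu_i)/2$ works out to the logit of $\hat\mu_i$.

```python
import numpy as np

def Q(y, p):                                       # binomial deviance error (3.4)
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -2 * (y * np.log(p) + (1 - y) * np.log(1 - p))

alpha_hat, beta_hat = fisher_lda(x, y)
mu_hat = predict(alpha_hat, beta_hat, x)

O_cv = 0.0                                         # cross-validation estimate (4.3)
for i in range(n):
    keep = np.arange(n) != i
    a_i, b_i = fisher_lda(x[keep], y[keep])        # rule refit without case i
    mu_tilde_i = predict(a_i, b_i, x[i])
    O_cv += Q(y[i], mu_tilde_i) - Q(y[i], mu_hat[i])

lam_hat = np.log(mu_hat / (1 - mu_hat))            # lambda_hat_i = logit(mu_hat_i) for deviance
O_true = 2 * np.sum(lam_hat * (y - mu))            # true optimism (5.5), uses the known mu
print(O_cv / 2, O_true / 2)                        # df-scale values, as plotted in Figures 8 and 9
```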

Figure 9 plots the Steinian estimates versus $O/2$ for the 200 simulations. The results illustrate an unfortunate phenomenon noted in Efron (1983): Optimism estimates tend to be small when they should be large, and vice versa. Cross-validation or the parametric bootstrap exhibited the same inverse relationships. The best we can hope for is to estimate the expected optimism.

If we are trying to estimate Err = err + O with $\widehat{\mathrm{Err}} = \mathrm{err} + \hat\Omega$, then

\[
\widehat{\mathrm{Err}} - \mathrm{Err} = \hat\Omega - O, \tag{5.6}
\]

so inverse relationships such as those in Figure 9 make $\widehat{\mathrm{Err}}$ less accurate. Table 1 shows estimates of $E(\widehat{\mathrm{Err}} - \mathrm{Err})^2$ from the simulation experiment.

None of the methods did very much better than simply estimating Err by the apparent error, that is, taking $\hat\Omega = 0$, and cross-validation was actually worse. It is easy to read too much into these numbers. The two points at the extreme right of Figure 9 contribute heavily to the comparison, as do other details of the computation; see Remarks J and L. Perhaps the main point is that the efficiency of covariance penalties helps more in estimating $\Omega$ than in estimating Err. Estimating $\Omega$ can be important in its own right, because it provides df values for the comparison, formal or informal, of different models, as emphasized in Ye (1998). Also, the values of $\mathrm{df}_i$ as a function of $x_i$, as in Figure 2, are a useful diagnostic for the geometry of the fitting process.

Figure 9. Steinian Estimate versus True Optimism/2 (5.5) for the 200 Simulations. Similar inverse relationships hold for parametric bootstrap or cross-validation.

Table 1. Average (Êrr − Err)² for the 200 Simulations. "Apparent" Takes Êrr = err (i.e., Ω̂ = 0). Outsample Averages Discussed in Remark L. All Three Methods Outperformed err When the Prediction Rule Was Ordinary Logistic Regression

                         Steinian   ParBoot   CrossVal   Apparent
E(Êrr − Err)²              5.39       5.29       6.33       5.78
Outsample                  5.94       5.82       6.89       6.41
Logistic regression        3.62       3.46       3.32       5.38

The bottom line of Table 1 reports $E(\widehat{\mathrm{Err}} - \mathrm{Err})^2$ for the prediction rule "ordinary logistic regression" rather than (5.4). Now all three methods handily beat the apparent error. The average prediction error Err was much bigger for logistic regression, 61.5 versus 29.3 for (5.4), but Err was easier to estimate for logistic regression.

Remark J. Four of the cross-validation estimates, corresponding to the rightmost points in Figure 9, were negative (ranging from −0.9 to −2.8). These were truncated at 0 in Figure 8 and Table 1. The parametric bootstrap estimates were based on only B = 100 replications per case, leaving substantial simulation error. Standard components-of-variance calculations for the 200 cases were used in Figure 8 and Table 1 to approximate the ideal situation B = ∞.

Remark K. The asymptotics in Li (1985) imply that in his setting it is possible to estimate the optimism itself, rather than its expectation. However, the form of (5.5) strongly suggests that $O$ is unestimable in the Bernoulli case, since it directly involves the unobservable componentwise differences $y_i - \mu_i$.

Remark L. Err = $\sum \mathrm{Err}_i$ (3.8) is the total prediction error at the $n$ observed covariate points $x_i$. "Outsample error,"

\[
\mathrm{Err}_{\mathrm{out}} = n \cdot E_0\bigl\{Q\bigl(y_0, m(x_0, v)\bigr)\bigr\}, \tag{5.7}
\]

where the training set $v$ is fixed while $v_0 = (x_0, y_0)$ is an independent random test point drawn from $F$ (4.9), is the natural setting for cross-validation. (The factor $n$ is included for comparison with Err.) See section 7 of Efron (1986). The second line of Table 1 shows that replacing Err with $\mathrm{Err}_{\mathrm{out}}$ did not affect our comparisons. Formula (4.14) suggests that this might be less true if our estimation rule had been badly biased.
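To make (5.7) concrete, a Monte Carlo approximation under the Section 5 sketch (continuing the earlier code: `Q`, `predict`, `fisher_lda`, `alpha_hat`, `beta_hat`, and `n` are reused, and everything remains illustrative) fixes the training-set fit and averages the deviance error over a large sample of fresh test points drawn from the reconstructed $F$ of (5.3).

```python
import numpy as np

rng = np.random.default_rng(6)
m = 20000                                          # fresh test points (x0, y0) drawn from F
Y0 = rng.binomial(1, 0.5, m)
x0 = rng.standard_normal((m, 2)) + np.c_[Y0 - 0.5, np.zeros(m)]
mu0 = 1.0 / (1.0 + np.exp(-x0[:, 0]))
y0 = rng.binomial(1, mu0)

Err_out = n * Q(y0, predict(alpha_hat, beta_hat, x0)).mean()   # n * E_0{Q(y0, m(x0, v))}  (5.7)
print(Err_out)
```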

Table 2 shows the comparative ranks of $|\widehat{\mathrm{Err}} - \mathrm{Err}|$ for the four methods of Table 1, applied to rule (5.4). For example, the Steinian was best in 14 of the 200 simulations, and worst only once. The corresponding ranks are also shown for $|\widehat{\mathrm{Err}} - \mathrm{Err}_{\mathrm{out}}|$, with very similar results. Cross-validation performed poorly, apparent error tended to be either best or worst, the Steinian was usually second or third, while the parametric bootstrap spread rather evenly across the four ranks.

Table 2. Left: Comparative Ranks of |Êrr − Err| for the 200 Simulations (5.1)–(5.4). Right: Same for |Êrr − Err_out|

Rank         Stein   ParBoot   CrVal   App      Stein   ParBoot   CrVal   App
1              14       48       33    105        17       50       32    101
2             106       58       31      5       104       58       35      3
3              79       56       55     10        78       54       55     13
4               1       38       81     80         1       38       78     83
Mean rank    2.34     2.42     2.92   2.33      2.31     2.40     2.90   2.39

6. THE NONPARAMETRIC BOOTSTRAP

Nonparametric bootstrap methods for estimating prediction error depend on simple random resamples $v^* = (v^*_1, v^*_2, \ldots, v^*_n)$ from the training set $v$ (4.17), rather than parametric resamples as in (2.14). Efron (1983) examined the relationship between the nonparametric bootstrap and cross-validation. This section develops a Rao–Blackwell type of connection between the nonparametric and parametric bootstrap methods, similar to Section 4's cross-validation results.

Suppose we have constructed $B$ nonparametric bootstrap samples $v^*$, each of which gives a bootstrap estimate $\mu^*$, with $\mu^*_i = m(x_i, v^*)$ in the notation of (4.18). Let $N^b_i$ indicate the number of times $v_i$ occurs in bootstrap sample $v^{*b}$, $b = 1, 2, \ldots, B$; define the indicator

\[
I^b_i(h) =
\begin{cases}
1 & \text{if } N^b_i = h\\
0 & \text{if } N^b_i \neq h,
\end{cases}
\tag{6.1}
\]

$h = 0, 1, \ldots, n$, and let $\bar Q_i(h)$ be the average error when $N^b_i = h$,

\[
\bar Q_i(h) = \sum_b I^b_i(h)\,Q\bigl(y_i, \mu^{*b}_i\bigr) \Big/ \sum_b I^b_i(h). \tag{6.2}
\]

We expect $\bar Q_i(0)$, the average error when $v_i$ is not involved in the bootstrap prediction of $y_i$, to be larger than $\bar Q_i(1)$, which will be larger than $\bar Q_i(2)$, and so on.

A useful class of nonparametric bootstrap optimism estimates takes the form

\[
\hat O_i = \sum_{h=1}^{n} B(h)\,S_i(h), \qquad S_i(h) = \frac{\bar Q_i(0) - \bar Q_i(h)}{h}, \tag{6.3}
\]

the "$S$" standing for "slope." Letting $p_n(h)$ be the binomial resampling probability

\[
p_n(h) = \mathrm{Prob}\{\mathrm{Bi}(n, 1/n) = h\} = \binom{n}{h}\frac{(n-1)^{n-h}}{n^n}, \tag{6.4}
\]

section 8 of Efron (1983) considers two particular choices of $B(h)$:

\[
\text{``}\omega(\mathrm{boot})\text{''}: \ B(h) = h(h-1)p_n(h) \qquad\text{and}\qquad \text{``}\omega(0)\text{''}: \ B(h) = h\,p_n(h). \tag{6.5}
\]

Here we will concentrate on (6.3) with $B(h) = h\,p_n(h)$, which is convenient but not crucial to the discussion. Then $B(h)$ is a probability distribution, $\sum_1^n B(h) = 1$, with expectation

\[
\sum_{1}^{n} B(h)\cdot h = 1 + \frac{n-1}{n}. \tag{6.6}
\]

The estimate $\hat O_i = \sum_1^n B(h)S_i(h)$ is seen to be a weighted average of the downward slopes $S_i(h)$. Most of the weight is on the first few values of $h$, because $B(h)$ rapidly approaches the shifted Poisson(1) density $e^{-1}/(h-1)!$ as $n \to \infty$.
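A minimal sketch of (6.1)–(6.3) with the $\omega(0)$ weights $B(h) = h\,p_n(h)$, assuming squared error, a toy dataset, and a simple linear-regression rule as a stand-in for the smoothers used in the article; values of $h$ never observed among the $B$ resamples are simply skipped. All names and numbers are illustrative.

```python
import numpy as np
from math import comb

def pn(h, n):                                   # binomial resampling probability (6.4)
    return comb(n, h) * (n - 1) ** (n - h) / n ** n

rng = np.random.default_rng(3)
n, B = 40, 2000
x = np.sort(rng.uniform(0, 10, n))
y = np.sin(x) + 0.5 * rng.standard_normal(n)    # hypothetical data

def fit_predict(xt, yt, x0):                    # stand-in rule: simple linear regression
    A = np.c_[np.ones_like(xt), xt]
    b = np.linalg.lstsq(A, yt, rcond=None)[0]
    return b[0] + b[1] * x0

# tabulate Q(y_i, mu_i^{*b}) by the count N_i^b of times case i appears in resample b  (6.1)-(6.2)
err_sum = np.zeros((n, n + 1))
err_cnt = np.zeros((n, n + 1))
for b in range(B):
    idx = rng.integers(0, n, n)                 # nonparametric bootstrap resample v*
    Nb = np.bincount(idx, minlength=n)
    err = (y - fit_predict(x[idx], y[idx], x)) ** 2
    err_sum[np.arange(n), Nb] += err
    err_cnt[np.arange(n), Nb] += 1

Qbar = np.divide(err_sum, err_cnt, out=np.full_like(err_sum, np.nan), where=err_cnt > 0)
Bh = np.array([h * pn(h, n) for h in range(n + 1)])       # omega(0) weights B(h) = h p_n(h)
S = (Qbar[:, [0]] - Qbar[:, 1:]) / np.arange(1, n + 1)    # slopes S_i(h), h = 1,...,n
O_hat = np.nansum(Bh[1:] * S, axis=1)                     # per-case estimates O_hat_i  (6.3)
print(O_hat.sum())                                        # total nonparametric bootstrap optimism
```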

We first consider a conditional version of the nonparametric bootstrap. Define $v_{(i)}(h)$ to be the augmented training set

\[
v_{(i)}(h) = v_{(i)} \cup \{h \text{ copies of } (x_i, y_i)\}, \qquad h = 0, 1, \ldots, n, \tag{6.7}
\]

giving corresponding estimates $\hat\mu_i(h) = m(x_i, v_{(i)}(h))$ and $\hat\lambda_i(h) = -\dot q(\hat\mu_i(h))/2$. For $v_{(i)}(0) = v_{(i)}$, the training set with $v_i = (x_i, y_i)$ removed, $\hat\mu_i(0) = \tilde\mu_i$ (4.2) and $\hat\lambda_i(0) = \tilde\lambda_i$. The conditional version of definition (6.3) is

\[
\hat O_{(i)} = \sum_{h=1}^{n} B(h)\,S_i(h), \qquad S_i(h) = \bigl[Q(y_i, \tilde\mu_i) - Q\bigl(y_i, \hat\mu_i(h)\bigr)\bigr]\big/h. \tag{6.8}
\]

This is defined to be the conditional nonparametric bootstrap estimate of the conditional optimism $\Omega_{(i)}$ (3.20). Notice that setting $B(h) = (1, 0, 0, \ldots, 0)$ would make $\hat O_{(i)}$ equal $\hat O_i$, the cross-validation estimate (4.3).

As before, we can average $\hat O_{(i)}(y)$ over conditional parametric resamples $y^* = (y_{(i)}, y^*_i)$ (4.5), with $y_{(i)}$ and $x = (x_1, x_2, \ldots, x_n)$ fixed. That is, we can parametrically average the nonparametric bootstrap. The proof of Theorem 1 applies here, giving a similar result.

Theorem 2. Assuming (4.5), the conditional parametric bootstrap expectation of $\hat O^*_{(i)} = \hat O_{(i)}(y_{(i)}, y^*_i)$ is

\[
E_{(i)}\bigl\{\hat O^*_{(i)}\bigr\} = 2\sum_{h=1}^{n} B(h)\,\frac{\widehat{\mathrm{cov}}_{(i)}(h)}{h}
\;-\; \sum_{h=1}^{n} B(h)\,\frac{E_{(i)}\bigl\{Q\bigl(\tilde\mu_i, \mu^*_i(h)\bigr)\bigr\}}{h}, \tag{6.9}
\]

where

\[
\widehat{\mathrm{cov}}_{(i)}(h) = E_{(i)}\bigl\{\hat\lambda_i(h)^*\,\bigl(y^*_i - \tilde\mu_i\bigr)\bigr\}. \tag{6.10}
\]

The second term on the right side of (6.9) is $O_p(1/n^2)$, as in (4.8), giving

\[
E_{(i)}\bigl\{\hat O^*_{(i)}\bigr\} \doteq 2\sum_{h=1}^{n} B(h)\,\frac{\widehat{\mathrm{cov}}_{(i)}(h)}{h}. \tag{6.11}
\]

Point $v_i$ has $h$ times more weight in the augmented training set $v_{(i)}(h)$ than in $v = v_{(i)}(1)$, so as in (4.20) influence function calculations suggest

\[
\mu^*_i(h) - \tilde\mu_i \doteq h\cdot\bigl(\mu^*_i - \tilde\mu_i\bigr) \quad\text{and}\quad
\widehat{\mathrm{cov}}_{(i)}(h) \doteq h\cdot\widehat{\mathrm{cov}}_{(i)}, \tag{6.12}
\]

$\mu^*_i = \mu^*_i(1)$, so that (6.11) becomes

\[
E_{(i)}\bigl\{\hat O^*_{(i)}\bigr\} \doteq 2\,\widehat{\mathrm{cov}}_{(i)} = \hat\Omega_{(i)}. \tag{6.13}
\]

Averaging the conditional nonparametric bootstrap estimates over parametric resamples $(y_{(i)}, y^*_i)$ results in a close approximation to the conditional covariance penalty $\hat\Omega_{(i)}$.


Expression (6.9) can be exactly evaluated for linear projection estimates $\hat\mu = My$ (using squared error loss),

\[
M = X(X'X)^{-1}X', \qquad X' = (x_1, x_2, \ldots, x_n). \tag{6.14}
\]

Then the projection matrix corresponding to $v_{(i)}(h)$ has $i$th diagonal element

\[
M_{ii}(h) = \frac{h\,M_{ii}}{1 + (h-1)M_{ii}}, \qquad M_{ii} = M_{ii}(1) = x_i'(X'X)^{-1}x_i, \tag{6.15}
\]

and if $y^*_i \sim (\tilde\mu_i, \sigma^2)$ with $y_{(i)}$ fixed, then $\widehat{\mathrm{cov}}_{(i)}(h) = \sigma^2 M_{ii}(h)$. Using (6.6) and the self-stable relationship $\hat\mu_i - \tilde\mu_i = M_{ii}(y_i - \tilde\mu_i)$, (6.9) can be evaluated as

\[
E_{(i)}\bigl\{\hat O^*_{(i)}\bigr\} = \hat\Omega_{(i)}\cdot[1 - 4M_{ii}]. \tag{6.16}
\]
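A quick numerical check of (6.15), under an arbitrary simulated design (all numbers illustrative): append $h$ copies of $x_i$ to the deleted design and compare the total weight that the refitted value at $x_i$ places on their common response with $hM_{ii}/[1 + (h-1)M_{ii}]$.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, i, h = 25, 3, 0, 4
X = rng.standard_normal((n, p))

M = X @ np.linalg.solve(X.T @ X, X.T)                 # hat matrix of the full design (6.14)
Mii = M[i, i]

# augmented design v_(i)(h): the deleted design plus h copies of x_i, as in (6.7)
X_aug = np.vstack([np.delete(X, i, axis=0)] + [X[i:i + 1]] * h)
M_aug = X_aug @ np.linalg.solve(X_aug.T @ X_aug, X_aug.T)
Mii_h = M_aug[-1, -h:].sum()      # weight of the h copies' common response on the fit at x_i

print(np.isclose(Mii_h, h * Mii / (1 + (h - 1) * Mii)))   # True: matches (6.15)
```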

In this case (6.13) errs by a factor of only $[1 + O(1/n)]$.

Result (6.12) implies an approximate Rao–Blackwell relationship between nonparametric and parametric bootstrap optimism estimates when both are carried out conditionally on $v_{(i)}$. As with cross-validation, this relationship seems to extend to the more familiar unconditional bootstrap estimator. Figure 10 concerns the kidney data and squared error loss, where this time the fitting rule $\hat\mu = m(y)$ is "loess(tot $\sim$ age, span = .5)". Loess, unlike lowess, is a linear rule $\hat\mu = My$, although it is not self-stable. The solid curve traces the coordinatewise covariance penalty $\mathrm{df}_i$ estimates $M_{ii}$ as a function of age$_i$.

The small points in Figure 10 represent individual unconditional nonparametric bootstrap $\mathrm{df}_i$ estimates $\hat O^*_i/2\hat\sigma^2$ (6.3), evaluated for 50 parametric bootstrap data vectors $y^*$ obtained as in (2.17); Remark M provides the details. Their means across the 50 replications, the triangles, follow the $M_{ii}$ curve. As with cross-validation, if we attempt to improve nonparametric bootstrap optimism estimates by averaging across the $y^*$ vectors giving the covariance penalty $\hat\Omega_i$, we wind up close to $\hat\Omega_i$ itself.

As in Figure 8, we can expect nonparametric bootstrap df estimates to be much more variable than covariance penalties. Various versions of the nonparametric bootstrap, particularly the ".632 rule," outperformed cross-validation in Efron (1983) and Efron and Tibshirani (1997), and may be preferred when nonparametric error predictions are required. However, covariance penalty methods offer increased accuracy whenever their underlying models are believable.

A general verification of the results of Figure 10, linking the unconditional versions of the nonparametric and parametric bootstrap df estimates, is conjectural at this point; Remark N outlines a plausibility argument.

Remark M. Figure 10 involved two resampling levels: Parametric bootstrap samples $y^{*a} = \hat\mu + \varepsilon^{*a}$ were drawn as in (2.17), for $a = 1, 2, \ldots, 50$, with $\hat\mu$ and the residuals $\hat\varepsilon_j = y_j - \hat\mu_j$ determined by loess(span = .5); then $B = 200$ nonparametric bootstrap samples were drawn from each set $v^{*a} = (v^{*a}_1, v^{*a}_2, \ldots, v^{*a}_n)$, with say $N^{ab}_j$ repetitions of $v^{*a}_j = (x_j, y^{*a}_j)$ in the $ab$th nonparametric resample, $b = 1, 2, \ldots, B$. For each "$a$," the $n \times B$ matrices of counts $N^{ab}_i$ and estimates $\mu^{*ab}_i$ gave $\bar Q_i(h)^{*a}$, $S_i(h)^{*a}$, and $\hat O^{*a}_i$ as in (6.2)–(6.3). The points $\hat O^{*a}_i/2\hat\sigma^2$ (with $\hat\sigma^2 = \sum \hat\varepsilon_j^2/n$) are the small dots in Figure 10, while the triangles are their averages over $a = 1, 2, \ldots, 50$. Standard $t$ tests accepted the null hypotheses that the averages were centered on the solid curve.

Figure 10. Small Dots Indicate 50 Parametric Bootstrap Replications of Unconditional Nonparametric Optimism Estimates (6.3); Triangles: Their Averages Closely Follow the Covariance Penalty Estimates (solid curve). Vertical distance plotted in df units. Here the estimation rule is loess(span = .5). See Remark M for details.

Remark N. Theorem 2 applies to parametric averaging of the conditional nonparametric bootstrap. For the usual unconditional nonparametric bootstrap, the bootstrap weights $N_j$ on the points $v_j = (x_j, y_j)$ in $v_{(i)}$ vary, so that the last term in (4.4) is no longer negated by assumption (4.5). Instead it adds a remainder term to (6.9),

\[
-2\sum_{h=1}^{n} B(h)\,E_{(i)}\bigl\{\bigl(\lambda^*_i(h) - \lambda^*_i\bigr)\bigl(\mu^*_i - \hat\mu_i\bigr)\bigr\}\big/h. \tag{6.17}
\]

Here $\mu^*_i = m(x_i, v^*_{(i)})$, where $v^*_{(i)}$ puts weight $N_j$ on $v_j$ for $j \neq i$, and $\lambda^*_i = -\dot q(\mu^*_i)/2$.

To justify approximation (6.13), we need to show that (6.17) is $o_p(1/n)$. This can be demonstrated explicitly for linear projections (6.14). The result seems plausible in general, since $\lambda^*_i(h) - \lambda^*_i$ is $O_p(1/n)$, while $\mu^*_i - \hat\mu_i$, the nonparametric bootstrap deviation of $\mu^*_i$ from $\hat\mu_i$, would usually be $O_p(1/\sqrt{n}\,)$.

7. SUMMARY

Figure 11 classifies prediction error estimates on two criteria: Parametric (model-based) versus nonparametric, and conditional versus unconditional. The classification can also be described by which parts of the training set $\{(x_j, y_j),\ j = 1, 2, \ldots, n\}$ are varied in the error rate computations: The Steinian only varies $y_i$ in estimating the $i$th error rate, keeping all the covariates $x_j$ and also $y_j$ for $j \neq i$ fixed; at the other extreme, the nonparametric bootstrap simultaneously varies the entire training set.

Figure 11. Two-Way Classification of Prediction Error Estimates Discussed in This Article. The conditional methods are local in the sense that only the ith case data are varied in estimating the ith error rate.

Here are some comparisons and comments concerning the four methods:

• The parametric methods require modeling assumptions in order to carry out the covariance penalty calculations. When these assumptions are justified, the Rao–Blackwell type of results of Sections 4 and 6 imply that the parametric techniques will be more efficient than their nonparametric counterparts, particularly for estimating degrees of freedom.

• The modeling assumptions need not rely on the estimation rule $\hat\mu = m(y)$ under investigation. We can use "bigger" models, as in Remark A, that is, ones less likely to be biased.

• Modeling assumptions are less important for rules $\hat\mu = m(y)$ that are close to linear. In genuinely linear situations, such as those needed for the $C_p$ and AIC criteria, the covariance corrections are constants that do not depend on the model at all. The centralized version of SURE (3.25) extends this property to maximum likelihood estimation in curved families.

• Local methods extrapolate error estimates from small changes in the training set. Global methods make much larger changes in the training set, of a size commensurate with actual random sampling, which is an advantage in dealing with "rough" rules $m(y)$ such as nearest neighbors or classification trees; see Efron and Tibshirani (1997).

• Stein's SURE criterion (2.11) is local, because it depends on partial derivatives, and parametric (2.9), without being model based. It performed more like cross-validation than the parametric bootstrap in the situation of Figure 2.

• The computational burden in our examples was less for global methods. Equation (2.18), with $\lambda^{*b}_i$ replacing $\mu^{*b}_i$ for general error measures, helps determine the number of replications $B$ required for the parametric bootstrap. Grouping, the usual labor-saving tactic in applying cross-validation, can also be applied to covariance penalty methods, as in Remark G, though now it is not clear that this is computationally helpful.

• As shown in Remark B, the bootstrap method's computations can also be used for hypothesis tests comparing the efficacy of different models.

Accurate estimation of prediction error tends to be difficult in practice, particularly when applied to the choice between competing rules $\hat\mu = m(y)$. In the author's opinion, it will often be worth chancing plausible modeling assumptions for the covariance penalty estimates, rather than relying entirely on nonparametric methods.

[Received December 2002. Revised October 2003.]

REFERENCES

Akaike, H. (1973), "Information Theory and an Extension of the Maximum Likelihood Principle," Second International Symposium on Information Theory, 267–281.

Breiman, L. (1992), "The Little Bootstrap and Other Methods for Dimensionality Selection in Regression: X-Fixed Prediction Error," Journal of the American Statistical Association, 87, 738–754.

Efron, B. (1975), "Defining the Curvature of a Statistical Problem (With Applications to Second Order Efficiency)" (with discussion), The Annals of Statistics, 3, 1189–1242.

——— (1975), "The Efficiency of Logistic Regression Compared to Normal Discriminant Analysis," Journal of the American Statistical Association, 70, 892–898.

——— (1983), "Estimating the Error Rate of a Prediction Rule: Improvement on Cross-Validation," Journal of the American Statistical Association, 78, 316–331.

——— (1986), "How Biased Is the Apparent Error Rate of a Prediction Rule?" Journal of the American Statistical Association, 81, 461–470.

Efron, B., and Tibshirani, R. (1997), "Improvements on Cross-Validation: The .632+ Bootstrap Method," Journal of the American Statistical Association, 92, 548–560.

Hampel, F., Ronchetti, E., Rousseeuw, P., and Stahel, W. (1986), Robust Statistics: The Approach Based on Influence Functions, New York: Wiley.

Hastie, T., and Tibshirani, R. (1990), Generalized Additive Models, London: Chapman & Hall.

Li, K. (1985), "From Stein's Unbiased Risk Estimates to the Method of Generalized Cross-Validation," The Annals of Statistics, 13, 1352–1377.

——— (1987), "Asymptotic Optimality for Cp, CL, Cross-Validation and Generalized Cross-Validation: Discrete Index Set," The Annals of Statistics, 15, 958–975.

Mallows, C. (1973), "Some Comments on Cp," Technometrics, 15, 661–675.

Rockafellar, R. (1970), Convex Analysis, Princeton, NJ: Princeton University Press.

Shen, X., Huang, H.-C., and Ye, J. (2004), "Adaptive Model Selection and Assessment for Exponential Family Models," Technometrics, to appear.

Shen, X., and Ye, J. (2002), "Adaptive Model Selection," Journal of the American Statistical Association, 97, 210–221.

Stein, C. (1981), "Estimation of the Mean of a Multivariate Normal Distribution," The Annals of Statistics, 9, 1135–1151.

Tibshirani, R., and Knight, K. (1999), "The Covariance Inflation Criterion for Adaptive Model Selection," Journal of the Royal Statistical Society, Ser. B, 61, 529–546.

Ye, J. (1998), "On Measuring and Correcting the Effects of Data Mining and Model Selection," Journal of the American Statistical Association, 93, 120–131.

Comment
Prabir BURMAN

I would like to begin by thanking Professor Efron for writing a paper that sheds new light on cross-validation and related methods, along with his proposals for stable model selection procedures. Stability can be an issue for ordinary cross-validation, especially for not-so-smooth procedures such as stepwise regression and other such sequential procedures. If we use the language of learning and test sets, ordinary cross-validation uses a learning set of size n − 1 and a test set of size 1. If one can average over a test set of infinite size, then one gets a stable estimator. Professor Efron demonstrates this leads to a Rao–Blackwellization of ordinary cross-validation.

The parametric bootstrap proposed here does require knowing the conditional distribution of an observation given the rest, which in turn requires a knowledge of the unknown parameters. Professor Efron argues that for a near-linear case this poses no problem. A question naturally arises: What happens to those cases where the methods are considerably more complicated, such as stepwise methods?

Another issue that is not entirely clear is the choice between the conditional and unconditional bootstrap methods. The conditional bootstrap seems to be better, but it can be quite expensive computationally. Can the unconditional bootstrap be used as a general method always?

It seems important to point out that ordinary cross-validation is not as inadequate as the present paper seems to suggest. If model selection is the goal, then estimation of the overall prediction error is what one can concentrate on. Even if the componentwise errors are not necessarily small, ordinary cross-validation may still provide reasonable estimates, especially if the sample size n is at least moderate and the estimation procedure is reasonably smooth.

In this connection I would like to point out that methods such as repeated v-fold (or multifold) cross-validation or repeated (or bootstrapped) learning–testing can improve on ordinary cross-validation, because the test set sizes are not necessarily small (see, e.g., the CART book by Breiman, Friedman, Olshen, and Stone 1984; Burman 1989; Zhang 1993). In addition, such methods can reduce computational costs substantially. In a repeated v-fold cross-validation, the data are repeatedly randomly split into v groups of roughly equal sizes. For each repeat there are v learning sets and the corresponding test sets. Each learning set is of size n(1 − 1/v), approximately, and each test set is of size n/v. In a repeated learning–testing method, the data are randomly split into a learning set of size n(1 − p) and a test set of size np, where 0 < p < 1. If one takes a small v, say v = 3, in a v-fold cross-validation, or a value of p = 1/3 in a repeated learning–testing method, then each test set is of size n/3. However, a correction term is needed in order to account for the fact that each learning set is of a size that is considerably smaller than n (Burman 1989).

Prabir Burman is Professor, Department of Statistics, University of California, Davis, CA 95616 (E-mail: [email protected]).

I ran a simulation for the classification case, with the model Y is Bernoulli(π(X)), where π(X) = 1 − sin²(2πX) and X is Uniform(0, 1). A majority voting scheme was used among the k nearest neighbors. True misclassification errors (in percent) and their standard errors (in parentheses) are given in Table 1, along with ordinary cross-validation, corrected threefold cross-validation with 10 repeats, and corrected repeated learning–testing (RLT) methods with p = .33 and 30 repeats. The sample size is n = 100 and the number of replications is 25,000. It can be seen that the corrected v-fold cross-validation or repeated learning–testing methods can provide some improvement over ordinary cross-validation.
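The following minimal sketch illustrates the splitting scheme just described, with a toy k-nearest-neighbor majority-vote rule and data resembling the simulation above, purely for illustration. It performs plain repeated v-fold cross-validation; the Burman (1989) correction term for the reduced learning-set size is not shown.

```python
import numpy as np

def repeated_vfold_error(x, y, fit_predict, loss, v=3, repeats=10, seed=0):
    """Plain repeated v-fold cross-validation estimate of prediction error.
    (The correction for the smaller learning-set size is omitted in this sketch.)"""
    rng = np.random.default_rng(seed)
    n, est = len(y), []
    for _ in range(repeats):
        folds = np.array_split(rng.permutation(n), v)
        for test in folds:
            train = np.setdiff1d(np.arange(n), test)
            pred = fit_predict(x[train], y[train], x[test])
            est.append(loss(y[test], pred).mean())
    return float(np.mean(est))

def knn_vote(xtr, ytr, xte, k=7):                 # toy k-nearest-neighbor majority vote
    nbrs = np.argsort(np.abs(xte[:, None] - xtr[None, :]), axis=1)[:, :k]
    return (ytr[nbrs].mean(axis=1) > 0.5).astype(int)

rng = np.random.default_rng(5)
X = rng.uniform(0, 1, 100)
pi = 1 - np.sin(2 * np.pi * X) ** 2               # pi(X) as in the model described above
Y = rng.binomial(1, pi)
zero_one = lambda y, yhat: (y != yhat).astype(float)
print(repeated_vfold_error(X, Y, knn_vote, zero_one, v=3, repeats=10))
```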

I would like to end my comments with thanks to Professor Efron for providing significant and valuable insights into the subject of model selection, and for developing new methods that are improvements over a popular method such as ordinary cross-validation.

Table 1. Classification Error Rates (in percent)

                  k = 7           k = 9           k = 11
True           22.82 (4.14)    24.55 (4.87)    27.26 (5.43)
CV             22.95 (7.01)    24.82 (7.18)    27.65 (7.55)
Threefold CV   22.74 (5.61)    24.56 (5.56)    27.04 (5.82)
RLT            22.70 (5.70)    24.55 (5.65)    27.11 (5.95)

© 2004 American Statistical Association
Journal of the American Statistical Association
September 2004, Vol. 99, No. 467, Theory and Methods
DOI 10.1198/016214504000000908

Of course there is no reason to use cross-validation here be-cause the covariance penalty estimate df always equals thecorrect df value 8 This is an extreme example of the RaondashBlackwell type of result in Theorem 1 showing the cross-validation estimate df as a randomized version of df

Remark G Theorem 1 applies directly to grouped cross-validation in which the observations are removed in groupsrather than singly Suppose group i consists of observations( yi1 yi2 yiJ) and likewise microi = (microi1 microiJ) microi =(microi microiJ) y(i) equals y with group i removed andmicroi = m(y(i))i1i2iJ Theorem 1 then holds as stated with

Olowasti = sum

j Olowastij cov(i) = sum

j cov(ij) and so forth Another wayto say this is that by additivity the theory of Sections 2ndash4 canbe extended to vector observations yi

Remark H The full dataset for a prediction problem theldquotraining setrdquo is of the form

v = (v1 v2 vn) with vi = (xi yi) (419)

xi being a p vector of observed covariates such as age for thekidney data and yi a response Covariance penalties operatein a regression theory framework where the xi are consideredfixed ancillaries whether or not they are random which is whynotation such as micro = m(y) can suppress x Cross-validationhowever changes x as well as y In this framework it is moreprecise to write the prediction rule as

m(xv) for x isin X (420)

indicating that the training set v determines a rule m(middotv) whichthen can be evaluated at any x in the predictor space X (42) isbetter expressed as microi = m(xiv(i))

In the cross-validation framework we can suppose that vhas been produced by random sampling (ldquoiidrdquo) from some(p + 1)-dimensional distribution F

Fiidrarr v1 v2 vn (421)

Standard influence function arguments as in chapter 2 of Ham-pel Ronchetti Rousseeuw and Stahel (1986) give the first-order approximation

microi minus microi = m(xiv) minus m(xiv(i)

) = IFi minus IF(i)

n (422)

where IFj = IF(vj m(xiv)F) is the influence function for microi

evaluated at vj and IF(i) = sum

j =i IFj(n minus 1)The point here is that microi minus microi is Op(1n) in situations

where the influence function exists boundedly see Li (1987)for a more careful discussion In situation (410) microi minus microi =Mii( yi minus microi) so that Mii = O(1n) as in (213) implies microi minusmicroi = Op(1n) Similarly microlowast

i minus microi = Op(1n) in (46) If the

Efron The Estimation of Prediction Error 627

function q(micro) of Figure 4 is locally quadratic near microi thenQ(microi micro

lowasti ) in (46) will be Op(1n2) as claimed in (48)

Order of magnitude asymptotics are only a rough guide topractice and are not crucial to the methods discussed here Inany given situation bootstrap calculations such as (317) willgive valid estimates whether or not (213) is meaningful

Remark I A prediction rule is ldquoself-stablerdquo if adding a newpoint (xi yi) that falls exactly on the prediction surface does notchange the prediction at xi in notation (420) if

m(xiv(i) cup (xi microi)

) = microi (423)

This implies microi = sum

j =i Mijyj + Miimicroi for a linear rule whichis (410) Any ldquoleast-Qrdquo rule which chooses micro by minimizingsum

Q( yimicroi) over some candidate collection of possible microrsquosmust be self-stable and this class can be extended by addingpenalty terms as with smoothing splines Maximum likelihoodestimation in ordinary or curved GLMrsquos belongs to the least-Qclass

5 A SIMULATION

Here is a small simulation study intended to illustrate covari-ance penaltycross-validation relationships in a Bernoulli datasetting Figure 7 shows the underlying model used to generatethe simulations There are 30 bivariate vectors xi and their as-sociated probabilities microi

(ximicroi) i = 12 30 (51)

from which we generated 200 30-dimensional Bernoulli re-sponse vectors

y sim Be(micro) (52)

Figure 7 Underlying Model Used for Simulation Study n = 30 Bivari-ate Vectors xi and Associated Probabilities microi (51)

as in (32) [The underlying model (51) was itself randomlygenerated by 30 independent replications of

Yi sim Be

(1

2

)

and xi sim N2

((

Yi minus 1

20

)

I)

(53)

with microi the Bayesian posterior ProbYi = 1|xi]Our prediction rule micro = m(y) was based on the coefficients

for Fisherrsquos linear discriminant boundary α + β primex = 0

microi = 1[1 + eminus(α+β primexi)

] (54)

Equation (213) of Efron (1983) describes the (α β) computa-tions Rule (54) is not the logistic regression estimate of microi andin fact will be somewhat more efficient given mechanism (53)(Efron 1975)

Binomial deviance error (34) was used to assess predictionaccuracy Three estimates of the total expected optimism =sum30

i=1 i (310) were computed for each of the 200 y vectorsthe cross-validation estimate O = sum

Oi (43) the parametricbootstrap estimate 2

sumcovi (317) with ylowast sim Be(micro) and the

Steinian 2sum

cov(i) (322)The results appear in Figure 8 as histograms of the 200 df es-

timates (ie estimates of optimism2) The Steinian and para-metric bootstrap gave similar results correlation 72 with thebootstrap estimates slightly but consistently larger Strikingly

(a)

(b)

Figure 8 Degree-of-Freedom Estimates (optimism2) 200 Simula-tions (52) and (54) The two covariance penalty estimates Steinian (a)and parametric bootstrap (b) have about one-third the standard devia-tion of cross-validation Error measured by binomial deviance (34) trueΩ2 = 157

628 Journal of the American Statistical Association September 2004

the cross-validation estimates were much more variable havingabout three terms larger standard deviation than either covari-ance penalty All three methods were reasonably well centerednear the true value 2 = 157

Figure 8 exemplifies the RaondashBlackwell relationship (48)which guarantees that cross-validation will be more variablethan covariance penalties The comparison would have beenmore extreme if we had estimated micro by logistic regressionrather than (54) in which case the covariance penalties wouldbe nearly constant while cross-validation would still vary

In our simulation study we can calculate the true total opti-mism (328) for each y

O = 2nsum

i=1

λi middot ( yi minus microi) (55)

Figure 9 plots the Steinian estimates versus O2 for the200 simulations The results illustrate an unfortunate phe-nomenon noted in Efron (1983) Optimism estimates tend tobe small when they should be large and vice versa Cross-validation or the parametric bootstrap exhibited the same in-verse relationships The best we can hope for is to estimate theexpected optimism

If we are trying to estimate Err = err+ O with Err = err + then

Err minus Err = minus O (56)

so inverse relationships such that those in Figure 9 make Errless accurate Table 1 shows estimates of E(Err minus Err)2 fromthe simulation experiment

None of the methods did very much better than simply es-timating Err by the apparent error that is taking = 0 andcross-validation was actually worse It is easy to read too much

Figure 9 Steinian Estimate versus True Optimism2 (55) for the200 Simulations Similar inverse relationships hold for parametric boot-strap or cross-validation

Table 1 Average (Err minus Err )2 for the 200 Simulations ldquoApparentrdquoTakes Err = err (ie Ω = 0) Outsample Averages Discussed

in Remark L All Three Methods Outperformed err WhenPrediction Rule Was Ordinary Logistic Regression

Steinian ParBoot CrossVal Apparent

E(Err minus Err)2 539 529 633 578Outsample 594 582 689 641

Logistic regression 362 346 332 538

into these numbers The two points at the extreme right of Fig-ure 9 contribute heavily to the comparison as do other detailsof the computation see Remarks J and L Perhaps the mainpoint is that the efficiency of covariance penalties helps morein estimating than in estimating Err Estimating can beimportant in its own right because it provides df values for thecomparison formal or informal of different models as empha-sized in Ye (1998) Also the values of dfi as a function of xias in Figure 2 are a useful diagnostic for the geometry of thefitting process

The bottom line of Table 1 reports E(Err minus Err)2 for theprediction rule ldquoordinary logistic regressionrdquo rather than (54)Now all three methods handily beat the apparent error The av-erage prediction Err was much bigger for logistic regression615 versus 293 for (54) but Err was easier to estimate forlogistic regression

Remark J Four of the cross-validation estimates corre-sponding to the rightmost points in Figure 9 were negative(ranging from minus9 to minus28) These were truncated at 0 in Fig-ure 8 and Table 1 The parametric bootstrap estimates werebased on only B = 100 replications per case leaving substantialsimulation error Standard components-of-variance calculationsfor the 200 cases were used in Figure 8 and Table 1 to approx-imate the ideal situation B = infin

Remark K The asymptotics in Li (1985) imply that in hissetting it is possible to estimate the optimism itself rather thanits expectation However the form of (55) strongly suggeststhat O is unestimable in the Bernoulli case since it directly in-volves the unobservable componentwise differences yi minus microi

Remark L Err = sumErri (38) is the total prediction error at

the n observed covariate points xi ldquoOutsample errorrdquo

Errout = n middot E0Q

(y0m(x0v)

) (57)

where the training set v is fixed while v0 = (x0 y0) is an inde-pendent random test point drawn from F (49) is the naturalsetting for cross-validation (The factor n is included for com-parison with Err) See section 7 of Efron (1986) The secondline of Table 1 shows that replacing Err with Errout did not af-fect our comparisons Formula (414) suggests that this mightbe less true if our estimation rule had been badly biased

Table 2 shows the comparative ranks of |Err minus Err| for thefour methods of Table 1 applied to rule (54) For example theSteinian was best in 14 of the 200 simulations and worst onlyonce The corresponding ranks are also shown for |ErrminusErrout|with very similar results Cross-validation performed poorlyapparent error tended to be either best or worst the Steinian wasusually second or third while the parametric bootstrap spreadrather evenly across the four ranks

Efron The Estimation of Prediction Error 629

Table 2 Left Comparative Ranks of |Err minus Err| for the 200 Simulations(51)ndash(54) Right Same for |Err minus Errout|

Stein ParBoot CrVal App Stein ParBoot CrVal App

1 14 48 33 105 17 50 32 1012 106 58 31 5 104 58 35 33 79 56 55 10 78 54 55 134 1 38 81 80 1 38 78 83

Mean 234 242 292 233 231 240 290 239rank

6 THE NONPARAMETRIC BOOTSTRAP

Nonparametric bootstrap methods for estimating predictionerror depend on simple random resamples vlowast = (vlowast

1 vlowast2 vlowast

n)

from the training set v (417) rather than parametric resam-ples as in (214) Efron (1983) examined the relationship be-tween the nonparametric bootstrap and cross-validation Thissection develops a RaondashBlackwell type of connection betweenthe nonparametric and parametric bootstrap methods similar toSection 4rsquos cross-validation results

Suppose we have constructed B nonparametric bootstrapsamples vlowast each of which gives a bootstrap estimate microlowastwith microlowast

i = m(xivlowast) in the notation of (418) Let Nbi indi-

cate the number of times vi occurs in bootstrap sample vlowastbb = 12 B define the indicator

Ibi (h) =

1 if Nbi = h

0 if Nbi = h

(61)

h = 01 n and let Qi(h) be the average error when Nbi = h

Qi(h) =sum

b

Ibi (h)Q

(yi micro

lowastbi

)sum

b

Ibi (h) (62)

We expect Qi(0) the average error when vi not involved in thebootstrap prediction of yi to be larger than Qi(1) which will belarger than Qi(2) and so on

A useful class of nonparametric bootstrap optimism esti-mates takes the form

Oi =nsum

h=1

B(h)Si(h) Si(h) = Qi(0) minus Q(h)

h (63)

the ldquoSrdquo standing for ldquosloperdquo Letting Pn(h) be the binomial re-sampling probability

pn(h) = ProbBi(n1n) = h =(

nh

)(n minus 1)nminush

nn (64)

section 8 of Efron (1983) considers two particular choices ofB(h)

ldquoω(boot)rdquo B(h) = h(h minus 1)pn(h) and(65)

ldquoω(0)rdquo B(h) = hpn(h)

Here we will concentrate on (63) with B(h) = hpn(h) whichis convenient but not crucial to the discussion Then B(h) is aprobability distribution

sumn1 B(h) = 1 with expectation

nsum

1

B(h) middot h = 1 + n minus 1

n (66)

The estimate Oi = sumn1 B(h)Si(h) is seen to be a weighted av-

erage of the downward slopes Si(h) Most of the weight is onthe first few values of h because B(h) rapidly approaches theshifted Poisson(1) density eminus1(h minus 1) as n rarr infin

We first consider a conditional version of the nonparametricbootstrap Define v(i)(h) to be the augmented training set

v(i)(h) = v(i) cup h copies of (xi yi) h = 01 n

(67)

giving corresponding estimates microi(h) = m(xiv(i)(h)) andλi(h) = minusq(microi(h))2 For v(i)(0) = v(i) the training set withvi = (xi yi) removed microi(0) = microi (42) and λi(0) = λi Theconditional version of definition (63) is

O(i) =nsum

h=1

B(h)Si(h)

Si(h) = [Q( yi microi) minus Q

(yi microi(h)

)]h (68)

This is defined to be the conditional nonparametric bootstrapestimate of the conditional optimism (i) (320) Notice thatsetting B(h) = (100 0) would make O(i) equal Oi thecross-validation estimate (43)

As before we can average O(i)(y) over conditional para-metric resamples ylowast = (y(i) ylowast

i ) (45) with y(i) and x =(x1 x2 xn) fixed That is we can parametrically averagethe nonparametric bootstrap The proof of Theorem 1 applieshere giving a similar result

Theorem 2 Assuming (45) the conditional parametricbootstrap expectation of Olowast

(i) = O(i)(y(i) ylowasti ) is

E(i)Olowast

(i)

= 2nsum

h=1

B(h)cov(i)(h)h

minusnsum

h=1

B(h)E(i)Q

(microi micro

lowasti (h)

)h (69)

where

cov(i)(h) = E(i) λi(h)lowast( ylowasti minus microi) (610)

The second term on the right side of (69) is Op(1n2) asin (48) giving

E(i)Olowast

(i)

= 2nsum

h=1

B(h)cov(i)(h)

h (611)

Point vi has h times more weight in the augmented training setv(i)(h) than in v = vi(1) so as in (420) influence functioncalculations suggest

microlowasti (h) minus microi = h middot (microlowast

i minus microi) and cov(i)(h) = h middot cov(i)

(612)

microlowasti = microlowast

i (1) so that (611) becomes

E(i)Olowast

(i)

= 2cov(i) = (i) (613)

Averaging the conditional nonparametric bootstrap estimatesover parametric resamples (y(i) ylowast

i ) results in a close approx-imation to the conditional covariance penalty (i)

630 Journal of the American Statistical Association September 2004

Expression (69) can be exactly evaluated for linear projec-tion estimates micro = My (using squared error loss)

M = X(X1X)minus1Xprime Xprime = (x1 x2 xn) (614)

Then the projection matrix corresponding to v(i)(h) has ith di-agonal element

Mii(h) = hMii

1 + (h minus 1)Mii Mii = Mii(1) = xprime

i(XprimeX)minus1xi

(615)

and if ylowasti sim (microi σ

2) with y(i) fixed then cov(i) = σ 2Mii(h) Us-ing (66) and the self-stable relationship microi minus microi = Mii( yi minus microi)(69) can be evaluated as

E(i)Olowasti = (i) middot [1 minus 4Mii] (616)

In this case (613) errs by a factor of only [1 + O(1n)]Result (612) implies an approximate RaondashBlackwell rela-

tionship between nonparametric and parametric bootstrap opti-mism estimates when both are carried out conditionally on v(i)As with cross-validation this relationship seems to extend tothe more familiar unconditional bootstrap estimator Figure 10concerns the kidney data and squared error loss where thistime the fitting rule micro = m(y) is ldquoloess(tot sim age span = 5)rdquoLoess unlike lowess is a linear rule micro = My although it is notself-stable The solid curve traces the coordinatewise covari-ance penalty dfi estimates Mii as a function of agei

The small points in Figure 10 represent individual uncon-ditional nonparametric bootstrap dfi estimates Olowast

i 2σ 2 (63)

evaluated for 50 parametric bootstrap data vectors ylowast obtainedas in (217) Remark M provides the details Their means acrossthe 50 replications the triangles follow the Mii curve As withcross-validation if we attempt to improve nonparametric boot-strap optimism estimates by averaging across the ylowast vectorsgiving the covariance penalty i we wind up close to i it-self

As in Figure 8 we can expect nonparametric bootstrap df esti-mates to be much more variable than covariance penalties Var-ious versions of the nonparametric bootstrap particularly theldquo632 rulerdquo outperformed cross-validation in Efron (1983) andEfron and Tibshirani (1997) and may be preferred when non-parametric error predictions are required However covariancepenalty methods offer increased accuracy whenever their un-derlying models are believable

A general verification of the results of Figure 10 linkingthe unconditional versions of the nonparametric and parametricbootstrap df estimates is conjectural at this point Remark Noutlines a plausibility argument

Remark M Figure 10 involved two resampling levels Para-metric bootstrap samples ylowasta = micro+εlowasta were drawn as in (217)for a = 12 50 with micro and the residuals εj = yj minus microj

determined by loess(span = 5) then B = 200 nonparametricbootstrap samples were drawn from each set vlowasta = (vlowasta

1 vlowasta2

vlowastan ) with say Nab

j repetitions of vlowastaj = (xj ylowasta

j ) in theabth nonparametric resample b = 12 B For each ldquoardquo then times B matrices of counts Nab

i and estimates microlowastabi gave Qi(h)lowasta

Figure 10 Small Dots Indicate 50 Parametric Bootstrap Replications of Unconditional Nonparametric Optimism Estimates (63) TrianglesTheir Averages Closely Follow the Covariance Penalty Estimates (solid curve) Vertical distance plotted in df units Here the estimation rule isloess(span = 5) See Remark M for details

Efron The Estimation of Prediction Error 631

Si(h)lowasta and Olowastai as in (62)ndash(63) The points Olowasta

i 2σ 2 (withσ 2 = sum

ε2j n) are the small dots in Figure 10 while the tri-

angles are their averages over a = 12 50 Standard t testsaccepted the null hypotheses that the averages were centered onthe solid curve

Remark N Theorem 2 applies to parametric averaging ofthe conditional nonparametric bootstrap For the usual uncondi-tional nonparametric bootstrap the bootstrap weights Nj on thepoints vj = (xj yj) in v(i) vary so that the last term in (44) is nolonger negated by assumption (45) Instead it adds a remainderterm to (69)

minus2sum

h=1

B(h)E(i)(

λlowasti (h) minus λlowast

i

)(microlowast

i minus microi)h (617)

Here microlowasti = m(xivlowast

(i)) where vlowast(i) puts weight Nj on vj for j = i

and λlowasti = minusq(microlowast

i )2To justify approximation (613) we need to show that (617)

is op(1n) This can be demonstrated explicitly for linear pro-jections (614) The result seems plausible in general sinceλlowast

i (h) minus λlowasti is Op(1n) while microlowast

i minus microi the nonparametric boot-strap deviation of microlowast

i from microi would usually be Op(1radic

n )

7 SUMMARY

Figure 11 classifies prediction error estimates on two crite-ria Parametric (model-based) versus nonparametric and con-ditional versus unconditional The classification can also bedescribed by which parts of the training set (xj yj) j =12 n are varied in the error rate computations TheSteinian only varies yi in estimating the ith error rate keep-ing all the covariates xj and also yj for j = i fixed at the otherextreme the nonparametric bootstrap simultaneously varies theentire training set

Here are some comparisons and comments concerning thefour methods

bull The parametric methods require modeling assumptionsin order to carry out the covariance penalty calculations

Figure 11 Two-Way Classification of Prediction Error Estimates Dis-cussed in This Article The conditional methods are local in the sensethat only the i th case data are varied in estimating the i th error rate

When these assumptions are justified the RaondashBlackwelltype of results of Sections 4 and 6 imply that the paramet-ric techniques will be more efficient than their nonpara-metric counterparts particularly for estimating degrees offreedom

bull The modeling assumptions need not rely on the estimationrule micro = m(y) under investigation We can use ldquobiggerrdquomodels as in Remark A that is ones less likely to be bi-ased

bull Modeling assumptions are less important for rules micro =m(y) that are close to linear In genuinely linear situationssuch as those needed for the Cp and AIC criteria the co-variance corrections are constants that do not depend onthe model at all The centralized version of SURE (325)extends this property to maximum likelihood estimation incurved families

bull Local methods extrapolate error estimates from smallchanges in the training set Global methods make muchlarger changes in the training set of a size commensuratewith actual random sampling which is an advantage indealing with ldquoroughrdquo rules m(y) such as nearest neighborsor classification trees see Efron and Tibshirani (1997)

bull Steinrsquos SURE criterion (211) is local because it dependson partial derivatives and parametric (29) without beingmodel based It performed more like cross-validation thanthe parametric bootstrap in the situation of Figure 2

bull The computational burden in our examples was less forglobal methods Equation (218) with λlowastb

i replacing microlowastbi

for general error measures helps determine the numberof replications B required for the parametric bootstrapGrouping the usual labor-saving tactic in applying cross-validation can also be applied to covariance penalty meth-ods as in Remark G though now it is not clear that this iscomputationally helpful

bull As shown in Remark B the bootstrap methodrsquos computa-tions can also be used for hypothesis tests comparing theefficacy of different models

Accurate estimation of prediction error tends to be difficult inpractice particularly when applied to the choice between com-peting rules micro = m(y) In the authorrsquos opinion it will often beworth chancing plausible modeling assumptions for the covari-ance penalty estimates rather than relying entirely on nonpara-metric methods

[Received December 2002 Revised October 2003]

REFERENCES

Akaike H (1973) ldquoInformation Theory and an Extension of the MaximumLikelihood Principlerdquo Second International Symposium on Information The-ory 267ndash281

Breiman L (1992) ldquoThe Little Bootstrap and Other Methods for Dimensional-ity Selection in Regression X-Fixed Prediction Errorrdquo Journal of the Ameri-can Statistical Association 87 738ndash754

Efron B (1975) ldquoDefining the Curvature of a Statistical Problem (With Appli-cations to Second Order Efficiency)rdquo (with discussion) The Annals of Statis-tics 3 1189ndash1242

(1975) ldquoThe Efficiency of Logistic Regression Compared to NormalDiscriminant Analysesrdquo Journal of the American Statistical Association 70892ndash898

(1983) ldquoEstimating the Error Rate of a Prediction Rule Improvementon Cross-Validationrdquo Journal of the American Statistical Association 78316ndash331

(1986) ldquoHow Biased Is the Apparent Error Rate of a Prediction RulerdquoJournal of the American Statistical Association 81 461ndash470

632 Journal of the American Statistical Association September 2004

Efron B and Tibshirani R (1997) ldquoImprovements on Cross-ValidationThe 632+ Bootstrap Methodrdquo Journal of the American Statistical Associa-tion 92 548ndash560

Hampel F Ronchetti E Rousseeuw P and Stahel W (1986) Robust Statis-tics the Approach Based on Influence Functions New York Wiley

Hastie T and Tibshirani R (1990) Generalized Linear Models LondonChapman amp Hall

Li K (1985) ldquoFrom Steinrsquos Unbiased Risk Estimates to the Method of Gener-alized Cross-Validationrdquo The Annals of Statistics 13 1352ndash1377

(1987) ldquoAsymptotic Optimality for CpCL Cross-Validation and Gen-eralized Cross-Validation Discrete Index Setrdquo The Annals of Statistics 15958ndash975

Mallows C (1973) ldquoSome Comments on Cp rdquo Technometrics 15 661ndash675

Rockafellar R (1970) Convex Analysis Princeton NJ Princeton UniversityPress

Shen X Huang H-C and Ye J (2004) ldquoAdaptive Model Selection and As-sessment for Exponential Family Modelsrdquo Technometrics to appear

Shen X and Ye J (2002) ldquoAdaptive Model Selectionrdquo Journal of the Ameri-can Statistical Association 97 210ndash221

Stein C (1981) ldquoEstimation of the Mean of a Multivariate Normal Distribu-tionrdquo The Annals of Statistics 9 1135ndash1151

Tibshirani R and Knight K (1999) ldquoThe Covariance Inflation Criterion forAdaptive Model Selectionrdquo Journal of the Royal Statistical Society Ser B61 529ndash546

Ye J (1998) ldquoOn Measuring and Correcting the Effects of Data Miningand Model Selectionrdquo Journal of the American Statistical Association 93120ndash131

Comment
Prabir BURMAN

Prabir Burman is Professor, Department of Statistics, University of California, Davis, CA 95616 (E-mail: [email protected]).

I would like to begin by thanking Professor Efron for writing a paper that sheds new light on cross-validation and related methods, along with his proposals for stable model selection procedures. Stability can be an issue for ordinary cross-validation, especially for not-so-smooth procedures such as stepwise regression and other such sequential procedures. If we use the language of learning and test sets, ordinary cross-validation uses a learning set of size n − 1 and a test set of size 1. If one can average over a test set of infinite size, then one gets a stable estimator. Professor Efron demonstrates that this leads to a Rao–Blackwellization of ordinary cross-validation.

The parametric bootstrap proposed here does require knowing the conditional distribution of an observation given the rest, which in turn requires a knowledge of the unknown parameters. Professor Efron argues that for a near-linear case this poses no problem. A question naturally arises: What happens in those cases where the methods are considerably more complicated, such as stepwise methods?

Another issue that is not entirely clear is the choice between the conditional and unconditional bootstrap methods. The conditional bootstrap seems to be better, but it can be quite expensive computationally. Can the unconditional bootstrap always be used as a general method?

It seems important to point out that ordinary cross-validation is not as inadequate as the present paper seems to suggest. If model selection is the goal, then estimation of the overall prediction error is what one can concentrate on. Even if the componentwise errors are not necessarily small, ordinary cross-validation may still provide reasonable estimates, especially if the sample size n is at least moderate and the estimation procedure is reasonably smooth.

In this connection I would like to point out that methods such as repeated v-fold (or multifold) cross-validation or repeated (or bootstrapped) learning–testing can improve on ordinary cross-validation because the test set sizes are not necessarily small (see, e.g., the CART book by Breiman, Friedman, Olshen, and Stone 1984; Burman 1989; Zhang 1993). In addition, such methods can reduce computational costs substantially. In a repeated v-fold cross-validation the data are repeatedly randomly split into v groups of roughly equal sizes. For each repeat there are v learning sets and the corresponding test sets. Each learning set is of size n(1 − 1/v), approximately, and each test set is of size n/v. In a repeated learning–testing method the data are randomly split into a learning set of size n(1 − p) and a test set of size np, where 0 < p < 1. If one takes a small v, say v = 3, in a v-fold cross-validation, or a value of p = 1/3 in a repeated learning–testing method, then each test set is of size n/3. However, a correction term is needed in order to account for the fact that each learning set is of a size that is considerably smaller than n (Burman 1989).
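As a purely illustrative sketch of the repeated v-fold procedure just described, the following Python function assumes generic fit, predict, and loss callables supplied by the user (these names are not from the text); the Burman (1989) correction term for the reduced learning-set size is deliberately omitted.

```python
import numpy as np

def repeated_vfold_cv(x, y, fit, predict, loss, v=3, repeats=10, seed=None):
    """Repeated v-fold cross-validation estimate of mean prediction error.

    x : (n, p) array of covariates;  y : (n,) array of responses.
    fit(x_train, y_train) -> fitted model
    predict(model, x_test) -> predictions
    loss(y_test, y_pred)  -> array of per-case losses

    Each repeat splits the data at random into v groups of roughly equal
    size; every group serves once as the test set, so each learning set
    has about n(1 - 1/v) cases and each test set about n/v cases.
    The correction term of Burman (1989) is not applied here.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    fold_errors = []
    for _ in range(repeats):
        groups = rng.permutation(n) % v          # random assignment to v folds
        for g in range(v):
            test = groups == g
            model = fit(x[~test], y[~test])
            fold_errors.append(loss(y[test], predict(model, x[test])).mean())
    return float(np.mean(fold_errors))
```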

I ran a simulation for the classification case with the model: Y is Bernoulli(π(X)), where π(X) = 1 − sin²(2πX) and X is Uniform(0, 1). A majority voting scheme was used among the k nearest neighbors. True misclassification errors (in percent) and their standard errors (in parentheses) are given in Table 1, along with ordinary cross-validation, corrected threefold cross-validation with 10 repeats, and corrected repeated learning–testing (RLT) methods with p = .33 and 30 repeats. The sample size is n = 100 and the number of replications is 25,000. It can be seen that the corrected v-fold cross-validation or repeated learning–testing methods can provide some improvement over ordinary cross-validation.
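For concreteness, here is a minimal sketch of the data-generating model and of ordinary (leave-one-out) cross-validation for the k-nearest-neighbor rule used in this simulation. The repeated threefold and RLT variants, their correction terms, and the 25,000-replication loop are not reproduced, and the size of the test sample used to approximate the true error is an arbitrary choice.

```python
import numpy as np

def one_replication(n=100, k=7, n_test=20000, seed=None):
    """One replication: X ~ Uniform(0, 1), Y ~ Bernoulli(pi(X)) with
    pi(x) = 1 - sin(2*pi*x)**2, and prediction by majority vote among
    the k nearest neighbors in x.

    Returns (loo_cv_error, true_error), the latter approximated on an
    independent test sample of size n_test.
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(size=n)
    y = rng.binomial(1, 1 - np.sin(2 * np.pi * x) ** 2)

    def knn_vote(x0, xs, ys):
        nearest = np.argsort(np.abs(xs - x0))[:k]   # k closest training points
        return int(ys[nearest].mean() > 0.5)        # majority vote (k odd)

    # ordinary (leave-one-out) cross-validation error
    loo_cv_error = np.mean([
        knn_vote(x[i], np.delete(x, i), np.delete(y, i)) != y[i]
        for i in range(n)
    ])

    # true misclassification error, approximated on an independent sample
    x_new = rng.uniform(size=n_test)
    y_new = rng.binomial(1, 1 - np.sin(2 * np.pi * x_new) ** 2)
    true_error = np.mean([knn_vote(x0, x, y) != y0
                          for x0, y0 in zip(x_new, y_new)])
    return loo_cv_error, true_error
```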

I would like to end my comments with thanks to Professor Efron for providing significant and valuable insights into the subject of model selection and for developing new methods that are improvements over a popular method such as ordinary cross-validation.

Table 1. Classification Error Rates (in percent)

                 k = 7          k = 9          k = 11
True             22.82 (4.14)   24.55 (4.87)   27.26 (5.43)
CV               22.95 (7.01)   24.82 (7.18)   27.65 (7.55)
Threefold CV     22.74 (5.61)   24.56 (5.56)   27.04 (5.82)
RLT              22.70 (5.70)   24.55 (5.65)   27.11 (5.95)

© 2004 American Statistical Association
Journal of the American Statistical Association
September 2004, Vol. 99, No. 467, Theory and Methods
DOI 10.1198/016214504000000908



Page 9: The Estimation of Prediction Error: Covariance Penalties ...jordan/sail/... · The Estimation of Prediction Error: Covariance Penalties and Cross-Validation Bradley E FRON Having

Efron The Estimation of Prediction Error 627

function q(micro) of Figure 4 is locally quadratic near microi thenQ(microi micro

lowasti ) in (46) will be Op(1n2) as claimed in (48)

Order of magnitude asymptotics are only a rough guide topractice and are not crucial to the methods discussed here Inany given situation bootstrap calculations such as (317) willgive valid estimates whether or not (213) is meaningful

Remark I A prediction rule is ldquoself-stablerdquo if adding a newpoint (xi yi) that falls exactly on the prediction surface does notchange the prediction at xi in notation (420) if

m(xiv(i) cup (xi microi)

) = microi (423)

This implies microi = sum

j =i Mijyj + Miimicroi for a linear rule whichis (410) Any ldquoleast-Qrdquo rule which chooses micro by minimizingsum

Q( yimicroi) over some candidate collection of possible microrsquosmust be self-stable and this class can be extended by addingpenalty terms as with smoothing splines Maximum likelihoodestimation in ordinary or curved GLMrsquos belongs to the least-Qclass

5 A SIMULATION

Here is a small simulation study intended to illustrate covari-ance penaltycross-validation relationships in a Bernoulli datasetting Figure 7 shows the underlying model used to generatethe simulations There are 30 bivariate vectors xi and their as-sociated probabilities microi

(ximicroi) i = 12 30 (51)

from which we generated 200 30-dimensional Bernoulli re-sponse vectors

y sim Be(micro) (52)

Figure 7 Underlying Model Used for Simulation Study n = 30 Bivari-ate Vectors xi and Associated Probabilities microi (51)

as in (32) [The underlying model (51) was itself randomlygenerated by 30 independent replications of

Yi sim Be

(1

2

)

and xi sim N2

((

Yi minus 1

20

)

I)

(53)

with microi the Bayesian posterior ProbYi = 1|xi]Our prediction rule micro = m(y) was based on the coefficients

for Fisherrsquos linear discriminant boundary α + β primex = 0

microi = 1[1 + eminus(α+β primexi)

] (54)

Equation (213) of Efron (1983) describes the (α β) computa-tions Rule (54) is not the logistic regression estimate of microi andin fact will be somewhat more efficient given mechanism (53)(Efron 1975)

Binomial deviance error (34) was used to assess predictionaccuracy Three estimates of the total expected optimism =sum30

i=1 i (310) were computed for each of the 200 y vectorsthe cross-validation estimate O = sum

Oi (43) the parametricbootstrap estimate 2

sumcovi (317) with ylowast sim Be(micro) and the

Steinian 2sum

cov(i) (322)The results appear in Figure 8 as histograms of the 200 df es-

timates (ie estimates of optimism2) The Steinian and para-metric bootstrap gave similar results correlation 72 with thebootstrap estimates slightly but consistently larger Strikingly

(a)

(b)

Figure 8 Degree-of-Freedom Estimates (optimism2) 200 Simula-tions (52) and (54) The two covariance penalty estimates Steinian (a)and parametric bootstrap (b) have about one-third the standard devia-tion of cross-validation Error measured by binomial deviance (34) trueΩ2 = 157

628 Journal of the American Statistical Association September 2004

the cross-validation estimates were much more variable havingabout three terms larger standard deviation than either covari-ance penalty All three methods were reasonably well centerednear the true value 2 = 157

Figure 8 exemplifies the RaondashBlackwell relationship (48)which guarantees that cross-validation will be more variablethan covariance penalties The comparison would have beenmore extreme if we had estimated micro by logistic regressionrather than (54) in which case the covariance penalties wouldbe nearly constant while cross-validation would still vary

In our simulation study we can calculate the true total opti-mism (328) for each y

O = 2nsum

i=1

λi middot ( yi minus microi) (55)

Figure 9 plots the Steinian estimates versus O2 for the200 simulations The results illustrate an unfortunate phe-nomenon noted in Efron (1983) Optimism estimates tend tobe small when they should be large and vice versa Cross-validation or the parametric bootstrap exhibited the same in-verse relationships The best we can hope for is to estimate theexpected optimism

If we are trying to estimate Err = err+ O with Err = err + then

Err minus Err = minus O (56)

so inverse relationships such that those in Figure 9 make Errless accurate Table 1 shows estimates of E(Err minus Err)2 fromthe simulation experiment

None of the methods did very much better than simply es-timating Err by the apparent error that is taking = 0 andcross-validation was actually worse It is easy to read too much

Figure 9 Steinian Estimate versus True Optimism2 (55) for the200 Simulations Similar inverse relationships hold for parametric boot-strap or cross-validation

Table 1 Average (Err minus Err )2 for the 200 Simulations ldquoApparentrdquoTakes Err = err (ie Ω = 0) Outsample Averages Discussed

in Remark L All Three Methods Outperformed err WhenPrediction Rule Was Ordinary Logistic Regression

Steinian ParBoot CrossVal Apparent

E(Err minus Err)2 539 529 633 578Outsample 594 582 689 641

Logistic regression 362 346 332 538

into these numbers The two points at the extreme right of Fig-ure 9 contribute heavily to the comparison as do other detailsof the computation see Remarks J and L Perhaps the mainpoint is that the efficiency of covariance penalties helps morein estimating than in estimating Err Estimating can beimportant in its own right because it provides df values for thecomparison formal or informal of different models as empha-sized in Ye (1998) Also the values of dfi as a function of xias in Figure 2 are a useful diagnostic for the geometry of thefitting process

The bottom line of Table 1 reports E(Err minus Err)2 for theprediction rule ldquoordinary logistic regressionrdquo rather than (54)Now all three methods handily beat the apparent error The av-erage prediction Err was much bigger for logistic regression615 versus 293 for (54) but Err was easier to estimate forlogistic regression

Remark J Four of the cross-validation estimates corre-sponding to the rightmost points in Figure 9 were negative(ranging from minus9 to minus28) These were truncated at 0 in Fig-ure 8 and Table 1 The parametric bootstrap estimates werebased on only B = 100 replications per case leaving substantialsimulation error Standard components-of-variance calculationsfor the 200 cases were used in Figure 8 and Table 1 to approx-imate the ideal situation B = infin

Remark K The asymptotics in Li (1985) imply that in hissetting it is possible to estimate the optimism itself rather thanits expectation However the form of (55) strongly suggeststhat O is unestimable in the Bernoulli case since it directly in-volves the unobservable componentwise differences yi minus microi

Remark L Err = sumErri (38) is the total prediction error at

the n observed covariate points xi ldquoOutsample errorrdquo

Errout = n middot E0Q

(y0m(x0v)

) (57)

where the training set v is fixed while v0 = (x0 y0) is an inde-pendent random test point drawn from F (49) is the naturalsetting for cross-validation (The factor n is included for com-parison with Err) See section 7 of Efron (1986) The secondline of Table 1 shows that replacing Err with Errout did not af-fect our comparisons Formula (414) suggests that this mightbe less true if our estimation rule had been badly biased

Table 2 shows the comparative ranks of |Err minus Err| for thefour methods of Table 1 applied to rule (54) For example theSteinian was best in 14 of the 200 simulations and worst onlyonce The corresponding ranks are also shown for |ErrminusErrout|with very similar results Cross-validation performed poorlyapparent error tended to be either best or worst the Steinian wasusually second or third while the parametric bootstrap spreadrather evenly across the four ranks

Efron The Estimation of Prediction Error 629

Table 2 Left Comparative Ranks of |Err minus Err| for the 200 Simulations(51)ndash(54) Right Same for |Err minus Errout|

Stein ParBoot CrVal App Stein ParBoot CrVal App

1 14 48 33 105 17 50 32 1012 106 58 31 5 104 58 35 33 79 56 55 10 78 54 55 134 1 38 81 80 1 38 78 83

Mean 234 242 292 233 231 240 290 239rank

6 THE NONPARAMETRIC BOOTSTRAP

Nonparametric bootstrap methods for estimating predictionerror depend on simple random resamples vlowast = (vlowast

1 vlowast2 vlowast

n)

from the training set v (417) rather than parametric resam-ples as in (214) Efron (1983) examined the relationship be-tween the nonparametric bootstrap and cross-validation Thissection develops a RaondashBlackwell type of connection betweenthe nonparametric and parametric bootstrap methods similar toSection 4rsquos cross-validation results

Suppose we have constructed B nonparametric bootstrapsamples vlowast each of which gives a bootstrap estimate microlowastwith microlowast

i = m(xivlowast) in the notation of (418) Let Nbi indi-

cate the number of times vi occurs in bootstrap sample vlowastbb = 12 B define the indicator

Ibi (h) =

1 if Nbi = h

0 if Nbi = h

(61)

h = 01 n and let Qi(h) be the average error when Nbi = h

Qi(h) =sum

b

Ibi (h)Q

(yi micro

lowastbi

)sum

b

Ibi (h) (62)

We expect Qi(0) the average error when vi not involved in thebootstrap prediction of yi to be larger than Qi(1) which will belarger than Qi(2) and so on

A useful class of nonparametric bootstrap optimism esti-mates takes the form

Oi =nsum

h=1

B(h)Si(h) Si(h) = Qi(0) minus Q(h)

h (63)

the ldquoSrdquo standing for ldquosloperdquo Letting Pn(h) be the binomial re-sampling probability

pn(h) = ProbBi(n1n) = h =(

nh

)(n minus 1)nminush

nn (64)

section 8 of Efron (1983) considers two particular choices ofB(h)

ldquoω(boot)rdquo B(h) = h(h minus 1)pn(h) and(65)

ldquoω(0)rdquo B(h) = hpn(h)

Here we will concentrate on (63) with B(h) = hpn(h) whichis convenient but not crucial to the discussion Then B(h) is aprobability distribution

sumn1 B(h) = 1 with expectation

nsum

1

B(h) middot h = 1 + n minus 1

n (66)

The estimate Oi = sumn1 B(h)Si(h) is seen to be a weighted av-

erage of the downward slopes Si(h) Most of the weight is onthe first few values of h because B(h) rapidly approaches theshifted Poisson(1) density eminus1(h minus 1) as n rarr infin

We first consider a conditional version of the nonparametricbootstrap Define v(i)(h) to be the augmented training set

v(i)(h) = v(i) cup h copies of (xi yi) h = 01 n

(67)

giving corresponding estimates microi(h) = m(xiv(i)(h)) andλi(h) = minusq(microi(h))2 For v(i)(0) = v(i) the training set withvi = (xi yi) removed microi(0) = microi (42) and λi(0) = λi Theconditional version of definition (63) is

O(i) =nsum

h=1

B(h)Si(h)

Si(h) = [Q( yi microi) minus Q

(yi microi(h)

)]h (68)

This is defined to be the conditional nonparametric bootstrapestimate of the conditional optimism (i) (320) Notice thatsetting B(h) = (100 0) would make O(i) equal Oi thecross-validation estimate (43)

As before we can average O(i)(y) over conditional para-metric resamples ylowast = (y(i) ylowast

i ) (45) with y(i) and x =(x1 x2 xn) fixed That is we can parametrically averagethe nonparametric bootstrap The proof of Theorem 1 applieshere giving a similar result

Theorem 2 Assuming (45) the conditional parametricbootstrap expectation of Olowast

(i) = O(i)(y(i) ylowasti ) is

E(i)Olowast

(i)

= 2nsum

h=1

B(h)cov(i)(h)h

minusnsum

h=1

B(h)E(i)Q

(microi micro

lowasti (h)

)h (69)

where

cov(i)(h) = E(i) λi(h)lowast( ylowasti minus microi) (610)

The second term on the right side of (69) is Op(1n2) asin (48) giving

E(i)Olowast

(i)

= 2nsum

h=1

B(h)cov(i)(h)

h (611)

Point vi has h times more weight in the augmented training setv(i)(h) than in v = vi(1) so as in (420) influence functioncalculations suggest

microlowasti (h) minus microi = h middot (microlowast

i minus microi) and cov(i)(h) = h middot cov(i)

(612)

microlowasti = microlowast

i (1) so that (611) becomes

E(i)Olowast

(i)

= 2cov(i) = (i) (613)

Averaging the conditional nonparametric bootstrap estimatesover parametric resamples (y(i) ylowast

i ) results in a close approx-imation to the conditional covariance penalty (i)

630 Journal of the American Statistical Association September 2004

Expression (69) can be exactly evaluated for linear projec-tion estimates micro = My (using squared error loss)

M = X(X1X)minus1Xprime Xprime = (x1 x2 xn) (614)

Then the projection matrix corresponding to v(i)(h) has ith di-agonal element

Mii(h) = hMii

1 + (h minus 1)Mii Mii = Mii(1) = xprime

i(XprimeX)minus1xi

(615)

and if ylowasti sim (microi σ

2) with y(i) fixed then cov(i) = σ 2Mii(h) Us-ing (66) and the self-stable relationship microi minus microi = Mii( yi minus microi)(69) can be evaluated as

E(i)Olowasti = (i) middot [1 minus 4Mii] (616)

In this case (613) errs by a factor of only [1 + O(1n)]Result (612) implies an approximate RaondashBlackwell rela-

tionship between nonparametric and parametric bootstrap opti-mism estimates when both are carried out conditionally on v(i)As with cross-validation this relationship seems to extend tothe more familiar unconditional bootstrap estimator Figure 10concerns the kidney data and squared error loss where thistime the fitting rule micro = m(y) is ldquoloess(tot sim age span = 5)rdquoLoess unlike lowess is a linear rule micro = My although it is notself-stable The solid curve traces the coordinatewise covari-ance penalty dfi estimates Mii as a function of agei

The small points in Figure 10 represent individual uncon-ditional nonparametric bootstrap dfi estimates Olowast

i 2σ 2 (63)

evaluated for 50 parametric bootstrap data vectors ylowast obtainedas in (217) Remark M provides the details Their means acrossthe 50 replications the triangles follow the Mii curve As withcross-validation if we attempt to improve nonparametric boot-strap optimism estimates by averaging across the ylowast vectorsgiving the covariance penalty i we wind up close to i it-self

As in Figure 8 we can expect nonparametric bootstrap df esti-mates to be much more variable than covariance penalties Var-ious versions of the nonparametric bootstrap particularly theldquo632 rulerdquo outperformed cross-validation in Efron (1983) andEfron and Tibshirani (1997) and may be preferred when non-parametric error predictions are required However covariancepenalty methods offer increased accuracy whenever their un-derlying models are believable

A general verification of the results of Figure 10 linkingthe unconditional versions of the nonparametric and parametricbootstrap df estimates is conjectural at this point Remark Noutlines a plausibility argument

Remark M Figure 10 involved two resampling levels Para-metric bootstrap samples ylowasta = micro+εlowasta were drawn as in (217)for a = 12 50 with micro and the residuals εj = yj minus microj

determined by loess(span = 5) then B = 200 nonparametricbootstrap samples were drawn from each set vlowasta = (vlowasta

1 vlowasta2

vlowastan ) with say Nab

j repetitions of vlowastaj = (xj ylowasta

j ) in theabth nonparametric resample b = 12 B For each ldquoardquo then times B matrices of counts Nab

i and estimates microlowastabi gave Qi(h)lowasta

Figure 10 Small Dots Indicate 50 Parametric Bootstrap Replications of Unconditional Nonparametric Optimism Estimates (63) TrianglesTheir Averages Closely Follow the Covariance Penalty Estimates (solid curve) Vertical distance plotted in df units Here the estimation rule isloess(span = 5) See Remark M for details

Efron The Estimation of Prediction Error 631

Si(h)lowasta and Olowastai as in (62)ndash(63) The points Olowasta

i 2σ 2 (withσ 2 = sum

ε2j n) are the small dots in Figure 10 while the tri-

angles are their averages over a = 12 50 Standard t testsaccepted the null hypotheses that the averages were centered onthe solid curve

Remark N Theorem 2 applies to parametric averaging ofthe conditional nonparametric bootstrap For the usual uncondi-tional nonparametric bootstrap the bootstrap weights Nj on thepoints vj = (xj yj) in v(i) vary so that the last term in (44) is nolonger negated by assumption (45) Instead it adds a remainderterm to (69)

minus2sum

h=1

B(h)E(i)(

λlowasti (h) minus λlowast

i

)(microlowast

i minus microi)h (617)

Here microlowasti = m(xivlowast

(i)) where vlowast(i) puts weight Nj on vj for j = i

and λlowasti = minusq(microlowast

i )2To justify approximation (613) we need to show that (617)

is op(1n) This can be demonstrated explicitly for linear pro-jections (614) The result seems plausible in general sinceλlowast

i (h) minus λlowasti is Op(1n) while microlowast

i minus microi the nonparametric boot-strap deviation of microlowast

i from microi would usually be Op(1radic

n )

7 SUMMARY

Figure 11 classifies prediction error estimates on two crite-ria Parametric (model-based) versus nonparametric and con-ditional versus unconditional The classification can also bedescribed by which parts of the training set (xj yj) j =12 n are varied in the error rate computations TheSteinian only varies yi in estimating the ith error rate keep-ing all the covariates xj and also yj for j = i fixed at the otherextreme the nonparametric bootstrap simultaneously varies theentire training set

Here are some comparisons and comments concerning thefour methods

bull The parametric methods require modeling assumptionsin order to carry out the covariance penalty calculations

Figure 11 Two-Way Classification of Prediction Error Estimates Dis-cussed in This Article The conditional methods are local in the sensethat only the i th case data are varied in estimating the i th error rate

When these assumptions are justified the RaondashBlackwelltype of results of Sections 4 and 6 imply that the paramet-ric techniques will be more efficient than their nonpara-metric counterparts particularly for estimating degrees offreedom

bull The modeling assumptions need not rely on the estimationrule micro = m(y) under investigation We can use ldquobiggerrdquomodels as in Remark A that is ones less likely to be bi-ased

bull Modeling assumptions are less important for rules micro =m(y) that are close to linear In genuinely linear situationssuch as those needed for the Cp and AIC criteria the co-variance corrections are constants that do not depend onthe model at all The centralized version of SURE (325)extends this property to maximum likelihood estimation incurved families

bull Local methods extrapolate error estimates from smallchanges in the training set Global methods make muchlarger changes in the training set of a size commensuratewith actual random sampling which is an advantage indealing with ldquoroughrdquo rules m(y) such as nearest neighborsor classification trees see Efron and Tibshirani (1997)

bull Steinrsquos SURE criterion (211) is local because it dependson partial derivatives and parametric (29) without beingmodel based It performed more like cross-validation thanthe parametric bootstrap in the situation of Figure 2

bull The computational burden in our examples was less forglobal methods Equation (218) with λlowastb

i replacing microlowastbi

for general error measures helps determine the numberof replications B required for the parametric bootstrapGrouping the usual labor-saving tactic in applying cross-validation can also be applied to covariance penalty meth-ods as in Remark G though now it is not clear that this iscomputationally helpful

bull As shown in Remark B the bootstrap methodrsquos computa-tions can also be used for hypothesis tests comparing theefficacy of different models

Accurate estimation of prediction error tends to be difficult inpractice particularly when applied to the choice between com-peting rules micro = m(y) In the authorrsquos opinion it will often beworth chancing plausible modeling assumptions for the covari-ance penalty estimates rather than relying entirely on nonpara-metric methods

[Received December 2002. Revised October 2003.]

REFERENCES

Akaike, H. (1973), "Information Theory and an Extension of the Maximum Likelihood Principle," Second International Symposium on Information Theory, 267–281.

Breiman, L. (1992), "The Little Bootstrap and Other Methods for Dimensionality Selection in Regression: X-Fixed Prediction Error," Journal of the American Statistical Association, 87, 738–754.

Efron, B. (1975), "Defining the Curvature of a Statistical Problem (With Applications to Second Order Efficiency)" (with discussion), The Annals of Statistics, 3, 1189–1242.

——— (1975), "The Efficiency of Logistic Regression Compared to Normal Discriminant Analysis," Journal of the American Statistical Association, 70, 892–898.

——— (1983), "Estimating the Error Rate of a Prediction Rule: Improvement on Cross-Validation," Journal of the American Statistical Association, 78, 316–331.

——— (1986), "How Biased Is the Apparent Error Rate of a Prediction Rule?" Journal of the American Statistical Association, 81, 461–470.

Efron, B., and Tibshirani, R. (1997), "Improvements on Cross-Validation: The .632+ Bootstrap Method," Journal of the American Statistical Association, 92, 548–560.

Hampel, F., Ronchetti, E., Rousseeuw, P., and Stahel, W. (1986), Robust Statistics: The Approach Based on Influence Functions, New York: Wiley.

Hastie, T., and Tibshirani, R. (1990), Generalized Additive Models, London: Chapman & Hall.

Li, K. (1985), "From Stein's Unbiased Risk Estimates to the Method of Generalized Cross-Validation," The Annals of Statistics, 13, 1352–1377.

——— (1987), "Asymptotic Optimality for Cp, CL, Cross-Validation and Generalized Cross-Validation: Discrete Index Set," The Annals of Statistics, 15, 958–975.

Mallows, C. (1973), "Some Comments on Cp," Technometrics, 15, 661–675.

Rockafellar, R. (1970), Convex Analysis, Princeton, NJ: Princeton University Press.

Shen, X., Huang, H.-C., and Ye, J. (2004), "Adaptive Model Selection and Assessment for Exponential Family Models," Technometrics, to appear.

Shen, X., and Ye, J. (2002), "Adaptive Model Selection," Journal of the American Statistical Association, 97, 210–221.

Stein, C. (1981), "Estimation of the Mean of a Multivariate Normal Distribution," The Annals of Statistics, 9, 1135–1151.

Tibshirani, R., and Knight, K. (1999), "The Covariance Inflation Criterion for Adaptive Model Selection," Journal of the Royal Statistical Society, Ser. B, 61, 529–546.

Ye, J. (1998), "On Measuring and Correcting the Effects of Data Mining and Model Selection," Journal of the American Statistical Association, 93, 120–131.

Comment

Prabir BURMAN

I would like to begin by thanking Professor Efron for writing a paper that sheds new light on cross-validation and related methods, along with his proposals for stable model selection procedures. Stability can be an issue for ordinary cross-validation, especially for not-so-smooth procedures such as stepwise regression and other such sequential procedures. If we use the language of learning and test sets, ordinary cross-validation uses a learning set of size n − 1 and a test set of size 1. If one can average over a test set of infinite size, then one gets a stable estimator. Professor Efron demonstrates that this leads to a Rao–Blackwellization of ordinary cross-validation.

The parametric bootstrap proposed here does require knowing the conditional distribution of an observation given the rest, which in turn requires knowledge of the unknown parameters. Professor Efron argues that for a near-linear case this poses no problem. A question naturally arises: What happens in those cases where the methods are considerably more complicated, such as stepwise methods?

Another issue that is not entirely clear is the choice between the conditional and unconditional bootstrap methods. The conditional bootstrap seems to be better, but it can be quite expensive computationally. Can the unconditional bootstrap always be used as a general method?

It seems important to point out that ordinary cross-validation is not as inadequate as the present paper seems to suggest. If model selection is the goal, then estimation of the overall prediction error is what one can concentrate on. Even if the componentwise errors are not necessarily small, ordinary cross-validation may still provide reasonable estimates, especially if the sample size n is at least moderate and the estimation procedure is reasonably smooth.

In this connection I would like to point out that methods such as repeated v-fold (or multifold) cross-validation or repeated (or bootstrapped) learning–testing can improve on ordinary cross-validation, because the test set sizes are not necessarily small (see, e.g., the CART book by Breiman, Friedman, Olshen, and Stone 1984; Burman 1989; Zhang 1993). In addition, such methods can reduce computational costs substantially. In a repeated v-fold cross-validation the data are repeatedly randomly split into v groups of roughly equal sizes. For each repeat there are v learning sets and the corresponding test sets; each learning set is of size n(1 − 1/v), approximately, and each test set is of size n/v. In a repeated learning–testing method the data are randomly split into a learning set of size n(1 − p) and a test set of size np, where 0 < p < 1. If one takes a small v, say v = 3, in a v-fold cross-validation, or a value of p = 1/3 in a repeated learning–testing method, then each test set is of size n/3. However, a correction term is needed in order to account for the fact that each learning set is of a size that is considerably smaller than n (Burman 1989); a schematic version is sketched below.
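A minimal sketch of repeated v-fold cross-validation in Python follows. The function names are mine, and the correction term is written out as I read Burman (1989), namely adding back the gap between the full-data resubstitution error and the average resubstitution error of the fold fits; it should be checked against that paper before serious use.

```python
import numpy as np

def repeated_vfold_cv(x, y, fit_predict, loss, v=3, repeats=10, seed=None):
    """Repeated v-fold cross-validation with a learning-set-size correction.

    fit_predict(x_train, y_train, x_new) -> predictions at x_new.
    loss(y, pred) -> vector of case-wise losses.
    x, y are numpy arrays; the correction follows my reading of Burman (1989).
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    apparent = loss(y, fit_predict(x, y, x)).mean()   # full-data resubstitution

    estimates = []
    for _ in range(repeats):
        folds = np.array_split(rng.permutation(n), v)
        cv_terms, fold_apparent = [], []
        for test_idx in folds:
            train_idx = np.setdiff1d(np.arange(n), test_idx)
            pred_test = fit_predict(x[train_idx], y[train_idx], x[test_idx])
            cv_terms.append(loss(y[test_idx], pred_test).mean())
            # resubstitution error over the full sample of the fold fit,
            # used only in the correction term (assumed form)
            pred_all = fit_predict(x[train_idx], y[train_idx], x)
            fold_apparent.append(loss(y, pred_all).mean())
        corrected = np.mean(cv_terms) + apparent - np.mean(fold_apparent)
        estimates.append(corrected)
    return float(np.mean(estimates))
```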

I ran a simulation for the classification case with the model: Y is Bernoulli(π(X)), where π(X) = 1 − sin²(2πX) and X is Uniform(0, 1). A majority voting scheme was used among the k nearest neighbors. True misclassification errors (in percent) and their standard errors (in parentheses) are given in Table 1, along with ordinary cross-validation, corrected threefold cross-validation with 10 repeats, and a corrected repeated learning–testing (RLT) method with p = 1/3 and 30 repeats. The sample size is n = 100 and the number of replications is 25,000. It can be seen that the corrected v-fold cross-validation or repeated learning–testing methods can provide some improvement over ordinary cross-validation.
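For reference, here is a sketch of one Monte Carlo replication of this setup in Python, with helper names of my own; it reproduces only the "True" and ordinary "CV" quantities of Table 1, not the corrected variants.

```python
import numpy as np

def knn_classify(x_train, y_train, x_new, k):
    """Majority vote among the k nearest neighbors (one-dimensional x)."""
    d = np.abs(x_new[:, None] - x_train[None, :])
    nbrs = np.argsort(d, axis=1)[:, :k]
    return (y_train[nbrs].mean(axis=1) > 0.5).astype(int)

def one_replication(n=100, k=7, n_test=10_000, seed=None):
    rng = np.random.default_rng(seed)
    pi = lambda u: 1.0 - np.sin(2.0 * np.pi * u) ** 2
    x = rng.uniform(size=n)
    y = rng.binomial(1, pi(x))
    # "true" misclassification error, approximated on a large test sample
    x0 = rng.uniform(size=n_test)
    y0 = rng.binomial(1, pi(x0))
    true_err = np.mean(knn_classify(x, y, x0, k) != y0)
    # ordinary (leave-one-out) cross-validation estimate
    loo = [knn_classify(np.delete(x, i), np.delete(y, i), x[i:i + 1], k)[0] != y[i]
           for i in range(n)]
    return 100 * true_err, 100 * float(np.mean(loo))   # in percent
```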

I would like to end my comments with thanks to Professor Efron for providing significant and valuable insights into the subject of model selection and for developing new methods that are improvements over a popular method such as ordinary cross-validation.

Table 1. Classification Error Rates (in percent)

Method           k = 7          k = 9          k = 11
True             22.82 (4.14)   24.55 (4.87)   27.26 (5.43)
CV               22.95 (7.01)   24.82 (7.18)   27.65 (7.55)
Threefold CV     22.74 (5.61)   24.56 (5.56)   27.04 (5.82)
RLT              22.70 (5.70)   24.55 (5.65)   27.11 (5.95)

Prabir Burman is Professor, Department of Statistics, University of California, Davis, CA 95616 (E-mail: [email protected]).

© 2004 American Statistical Association
Journal of the American Statistical Association
September 2004, Vol. 99, No. 467, Theory and Methods
DOI 10.1198/016214504000000908


7 SUMMARY

Figure 11 classifies prediction error estimates on two crite-ria Parametric (model-based) versus nonparametric and con-ditional versus unconditional The classification can also bedescribed by which parts of the training set (xj yj) j =12 n are varied in the error rate computations TheSteinian only varies yi in estimating the ith error rate keep-ing all the covariates xj and also yj for j = i fixed at the otherextreme the nonparametric bootstrap simultaneously varies theentire training set

Here are some comparisons and comments concerning thefour methods

bull The parametric methods require modeling assumptionsin order to carry out the covariance penalty calculations

Figure 11 Two-Way Classification of Prediction Error Estimates Dis-cussed in This Article The conditional methods are local in the sensethat only the i th case data are varied in estimating the i th error rate

When these assumptions are justified the RaondashBlackwelltype of results of Sections 4 and 6 imply that the paramet-ric techniques will be more efficient than their nonpara-metric counterparts particularly for estimating degrees offreedom

bull The modeling assumptions need not rely on the estimationrule micro = m(y) under investigation We can use ldquobiggerrdquomodels as in Remark A that is ones less likely to be bi-ased

bull Modeling assumptions are less important for rules micro =m(y) that are close to linear In genuinely linear situationssuch as those needed for the Cp and AIC criteria the co-variance corrections are constants that do not depend onthe model at all The centralized version of SURE (325)extends this property to maximum likelihood estimation incurved families

bull Local methods extrapolate error estimates from smallchanges in the training set Global methods make muchlarger changes in the training set of a size commensuratewith actual random sampling which is an advantage indealing with ldquoroughrdquo rules m(y) such as nearest neighborsor classification trees see Efron and Tibshirani (1997)

bull Steinrsquos SURE criterion (211) is local because it dependson partial derivatives and parametric (29) without beingmodel based It performed more like cross-validation thanthe parametric bootstrap in the situation of Figure 2

bull The computational burden in our examples was less forglobal methods Equation (218) with λlowastb

i replacing microlowastbi

for general error measures helps determine the numberof replications B required for the parametric bootstrapGrouping the usual labor-saving tactic in applying cross-validation can also be applied to covariance penalty meth-ods as in Remark G though now it is not clear that this iscomputationally helpful

bull As shown in Remark B the bootstrap methodrsquos computa-tions can also be used for hypothesis tests comparing theefficacy of different models

Accurate estimation of prediction error tends to be difficult inpractice particularly when applied to the choice between com-peting rules micro = m(y) In the authorrsquos opinion it will often beworth chancing plausible modeling assumptions for the covari-ance penalty estimates rather than relying entirely on nonpara-metric methods

[Received December 2002 Revised October 2003]

REFERENCES

Akaike H (1973) ldquoInformation Theory and an Extension of the MaximumLikelihood Principlerdquo Second International Symposium on Information The-ory 267ndash281

Breiman L (1992) ldquoThe Little Bootstrap and Other Methods for Dimensional-ity Selection in Regression X-Fixed Prediction Errorrdquo Journal of the Ameri-can Statistical Association 87 738ndash754

Efron B (1975) ldquoDefining the Curvature of a Statistical Problem (With Appli-cations to Second Order Efficiency)rdquo (with discussion) The Annals of Statis-tics 3 1189ndash1242

(1975) ldquoThe Efficiency of Logistic Regression Compared to NormalDiscriminant Analysesrdquo Journal of the American Statistical Association 70892ndash898

(1983) ldquoEstimating the Error Rate of a Prediction Rule Improvementon Cross-Validationrdquo Journal of the American Statistical Association 78316ndash331

(1986) ldquoHow Biased Is the Apparent Error Rate of a Prediction RulerdquoJournal of the American Statistical Association 81 461ndash470

632 Journal of the American Statistical Association September 2004

Efron B and Tibshirani R (1997) ldquoImprovements on Cross-ValidationThe 632+ Bootstrap Methodrdquo Journal of the American Statistical Associa-tion 92 548ndash560

Hampel F Ronchetti E Rousseeuw P and Stahel W (1986) Robust Statis-tics the Approach Based on Influence Functions New York Wiley

Hastie T and Tibshirani R (1990) Generalized Linear Models LondonChapman amp Hall

Li K (1985) ldquoFrom Steinrsquos Unbiased Risk Estimates to the Method of Gener-alized Cross-Validationrdquo The Annals of Statistics 13 1352ndash1377

(1987) ldquoAsymptotic Optimality for CpCL Cross-Validation and Gen-eralized Cross-Validation Discrete Index Setrdquo The Annals of Statistics 15958ndash975

Mallows C (1973) ldquoSome Comments on Cp rdquo Technometrics 15 661ndash675

Rockafellar R (1970) Convex Analysis Princeton NJ Princeton UniversityPress

Shen X Huang H-C and Ye J (2004) ldquoAdaptive Model Selection and As-sessment for Exponential Family Modelsrdquo Technometrics to appear

Shen X and Ye J (2002) ldquoAdaptive Model Selectionrdquo Journal of the Ameri-can Statistical Association 97 210ndash221

Stein C (1981) ldquoEstimation of the Mean of a Multivariate Normal Distribu-tionrdquo The Annals of Statistics 9 1135ndash1151

Tibshirani R and Knight K (1999) ldquoThe Covariance Inflation Criterion forAdaptive Model Selectionrdquo Journal of the Royal Statistical Society Ser B61 529ndash546

Ye J (1998) ldquoOn Measuring and Correcting the Effects of Data Miningand Model Selectionrdquo Journal of the American Statistical Association 93120ndash131

CommentPrabir BURMAN

I would like to begin by thanking Professor Efron for writinga paper that sheds new light on cross-validation and relatedmethods along with his proposals for stable model selec-tion procedures Stability can be an issue for ordinary cross-validation especially for not-so-smooth procedures such asstepwise regression and other such sequential procedures Ifwe use the language of learning and test sets ordinary cross-validation uses a learning set of size n minus 1 and a test set ofsize 1 If one can average over a test set of infinite size thenone gets a stable estimator Professor Efron demonstrates thisleads to a RaondashBlackwellization of ordinary cross-validation

The parametric bootstrap proposed here does require know-ing the conditional distribution of an observation given the restwhich in turn requires a knowledge of the unknown parametersProfessor Efron argues that for a near-linear case this poses noproblem A question naturally arises What happens to thosecases where the methods are considerably more complicatedsuch as stepwise methods

Another issue that is not entirely clear is the choice betweenthe conditional and unconditional bootstrap methods The con-ditional bootstrap seems to be better but it can be quite expen-sive computationally Can the unconditional bootstrap be usedas a general method always

It seems important to point out that ordinary cross-validationis not as inadequate as the present paper seems to suggestIf model selection is the goal then estimation of the overallprediction error is what one can concentrate on Even if thecomponentwise errors are not necessarily small ordinary cross-validation may still provide reasonable estimates especially ifthe sample size n is at least moderate and the estimation proce-dure is reasonably smooth

In this connection I would like to point out that methods suchas repeated v-fold (or multifold) cross-validation or repeated (orbootstrapped) learning-testing can improve on ordinary cross-validation because the test set sizes are not necessarily small(see eg the CART book by Breiman Friedman Olshen andStone 1984 Burman 1989 Zhang 1993) In addition suchmethods can reduce computational costs substantially In a re-peated v-fold cross-validation the data are repeatedly randomly

Prabir Burman is Professor Department of Statistics University of Califor-nia Davis CA 95616 (E-mail burmanwalducdavisedu)

split into v groups of roughly equal sizes For each repeat thereare v learning sets and the corresponding test sets Each learn-ing set is of size n(1 minus 1v) approximately and each test set isof size nv In a repeated learningndashtesting method the data arerandomly split into a learning set of size n(1 minus p) and a test setof size np where 0 lt p lt 1 If one takes a small v say v = 3in a v-fold cross-validation or a value of p = 13 in a repeatedlearningndashtesting method then each test set is of size n3 How-ever a correction term is needed in order to account for the factthat each learning set is of a size that is considerably smallerthan n (Burman 1989)

I ran a simulation for the classification case with the modelY is Bernoulli(π(X)) where π(X) = 1 minus sin2(2πX) and X isUniform(01) A majority voting scheme was used among thek nearest neighbor neighbors True misclassification errors (inpercent) and their standard errors (in parentheses) are given inTable 1 along with ordinary cross-validation corrected three-fold cross-validation with 10 repeats and corrected repeated-testing (RLT) methods with p = 33 and 30 repeats The samplesize is n = 100 and the number of replications is 25000 It canbe seen that the corrected v-fold cross-validation or repeatedlearningndashtesting methods can provide some improvement overordinary cross-validation

I would like to end my comments with thanks to Profes-sor Efron for providing significant and valuable insights intothe subject of model selection and for developing new methodsthat are improvements over a popular method such as ordinarycross-validation

Table 1 Classification Error Rates (in percent)

TCH k = 7 k = 9 k = 11

True 2282(414) 2455(487) 2726(543)CV 2295(701) 2482(718) 2765(755)Threefold CV 2274(561) 2456(556) 2704(582)RLT 2270(570) 2455(565) 2711(595)

copy 2004 American Statistical AssociationJournal of the American Statistical Association

September 2004 Vol 99 No 467 Theory and MethodsDOI 101198016214504000000908

Page 11: The Estimation of Prediction Error: Covariance Penalties ...jordan/sail/... · The Estimation of Prediction Error: Covariance Penalties and Cross-Validation Bradley E FRON Having

Efron The Estimation of Prediction Error 629

Table 2 Left Comparative Ranks of |Err minus Err| for the 200 Simulations(51)ndash(54) Right Same for |Err minus Errout|

Stein ParBoot CrVal App Stein ParBoot CrVal App

1 14 48 33 105 17 50 32 1012 106 58 31 5 104 58 35 33 79 56 55 10 78 54 55 134 1 38 81 80 1 38 78 83

Mean 234 242 292 233 231 240 290 239rank

6 THE NONPARAMETRIC BOOTSTRAP

Nonparametric bootstrap methods for estimating predictionerror depend on simple random resamples vlowast = (vlowast

1 vlowast2 vlowast

n)

from the training set v (417) rather than parametric resam-ples as in (214) Efron (1983) examined the relationship be-tween the nonparametric bootstrap and cross-validation Thissection develops a RaondashBlackwell type of connection betweenthe nonparametric and parametric bootstrap methods similar toSection 4rsquos cross-validation results

Suppose we have constructed B nonparametric bootstrapsamples vlowast each of which gives a bootstrap estimate microlowastwith microlowast

i = m(xivlowast) in the notation of (418) Let Nbi indi-

cate the number of times vi occurs in bootstrap sample vlowastbb = 12 B define the indicator

Ibi (h) =

1 if Nbi = h

0 if Nbi = h

(61)

h = 01 n and let Qi(h) be the average error when Nbi = h

Qi(h) =sum

b

Ibi (h)Q

(yi micro

lowastbi

)sum

b

Ibi (h) (62)

We expect Qi(0) the average error when vi not involved in thebootstrap prediction of yi to be larger than Qi(1) which will belarger than Qi(2) and so on

A useful class of nonparametric bootstrap optimism esti-mates takes the form

Oi =nsum

h=1

B(h)Si(h) Si(h) = Qi(0) minus Q(h)

h (63)

the ldquoSrdquo standing for ldquosloperdquo Letting Pn(h) be the binomial re-sampling probability

pn(h) = ProbBi(n1n) = h =(

nh

)(n minus 1)nminush

nn (64)

section 8 of Efron (1983) considers two particular choices ofB(h)

ldquoω(boot)rdquo B(h) = h(h minus 1)pn(h) and(65)

ldquoω(0)rdquo B(h) = hpn(h)

Here we will concentrate on (63) with B(h) = hpn(h) whichis convenient but not crucial to the discussion Then B(h) is aprobability distribution

sumn1 B(h) = 1 with expectation

nsum

1

B(h) middot h = 1 + n minus 1

n (66)

The estimate Oi = sumn1 B(h)Si(h) is seen to be a weighted av-

erage of the downward slopes Si(h) Most of the weight is onthe first few values of h because B(h) rapidly approaches theshifted Poisson(1) density eminus1(h minus 1) as n rarr infin

We first consider a conditional version of the nonparametricbootstrap Define v(i)(h) to be the augmented training set

v(i)(h) = v(i) cup h copies of (xi yi) h = 01 n

(67)

giving corresponding estimates microi(h) = m(xiv(i)(h)) andλi(h) = minusq(microi(h))2 For v(i)(0) = v(i) the training set withvi = (xi yi) removed microi(0) = microi (42) and λi(0) = λi Theconditional version of definition (63) is

O(i) =nsum

h=1

B(h)Si(h)

Si(h) = [Q( yi microi) minus Q

(yi microi(h)

)]h (68)

This is defined to be the conditional nonparametric bootstrapestimate of the conditional optimism (i) (320) Notice thatsetting B(h) = (100 0) would make O(i) equal Oi thecross-validation estimate (43)

As before we can average O(i)(y) over conditional para-metric resamples ylowast = (y(i) ylowast

i ) (45) with y(i) and x =(x1 x2 xn) fixed That is we can parametrically averagethe nonparametric bootstrap The proof of Theorem 1 applieshere giving a similar result

Theorem 2 Assuming (45) the conditional parametricbootstrap expectation of Olowast

(i) = O(i)(y(i) ylowasti ) is

E(i)Olowast

(i)

= 2nsum

h=1

B(h)cov(i)(h)h

minusnsum

h=1

B(h)E(i)Q

(microi micro

lowasti (h)

)h (69)

where

cov(i)(h) = E(i) λi(h)lowast( ylowasti minus microi) (610)

The second term on the right side of (69) is Op(1n2) asin (48) giving

E(i)Olowast

(i)

= 2nsum

h=1

B(h)cov(i)(h)

h (611)

Point vi has h times more weight in the augmented training setv(i)(h) than in v = vi(1) so as in (420) influence functioncalculations suggest

microlowasti (h) minus microi = h middot (microlowast

i minus microi) and cov(i)(h) = h middot cov(i)

(612)

microlowasti = microlowast

i (1) so that (611) becomes

E(i)Olowast

(i)

= 2cov(i) = (i) (613)

Averaging the conditional nonparametric bootstrap estimatesover parametric resamples (y(i) ylowast

i ) results in a close approx-imation to the conditional covariance penalty (i)

630 Journal of the American Statistical Association September 2004

Expression (69) can be exactly evaluated for linear projec-tion estimates micro = My (using squared error loss)

M = X(X1X)minus1Xprime Xprime = (x1 x2 xn) (614)

Then the projection matrix corresponding to v(i)(h) has ith di-agonal element

Mii(h) = hMii

1 + (h minus 1)Mii Mii = Mii(1) = xprime

i(XprimeX)minus1xi

(615)

and if ylowasti sim (microi σ

2) with y(i) fixed then cov(i) = σ 2Mii(h) Us-ing (66) and the self-stable relationship microi minus microi = Mii( yi minus microi)(69) can be evaluated as

E(i)Olowasti = (i) middot [1 minus 4Mii] (616)

In this case (613) errs by a factor of only [1 + O(1n)]Result (612) implies an approximate RaondashBlackwell rela-

tionship between nonparametric and parametric bootstrap opti-mism estimates when both are carried out conditionally on v(i)As with cross-validation this relationship seems to extend tothe more familiar unconditional bootstrap estimator Figure 10concerns the kidney data and squared error loss where thistime the fitting rule micro = m(y) is ldquoloess(tot sim age span = 5)rdquoLoess unlike lowess is a linear rule micro = My although it is notself-stable The solid curve traces the coordinatewise covari-ance penalty dfi estimates Mii as a function of agei

The small points in Figure 10 represent individual uncon-ditional nonparametric bootstrap dfi estimates Olowast

i 2σ 2 (63)

evaluated for 50 parametric bootstrap data vectors ylowast obtainedas in (217) Remark M provides the details Their means acrossthe 50 replications the triangles follow the Mii curve As withcross-validation if we attempt to improve nonparametric boot-strap optimism estimates by averaging across the ylowast vectorsgiving the covariance penalty i we wind up close to i it-self

As in Figure 8 we can expect nonparametric bootstrap df esti-mates to be much more variable than covariance penalties Var-ious versions of the nonparametric bootstrap particularly theldquo632 rulerdquo outperformed cross-validation in Efron (1983) andEfron and Tibshirani (1997) and may be preferred when non-parametric error predictions are required However covariancepenalty methods offer increased accuracy whenever their un-derlying models are believable

A general verification of the results of Figure 10 linkingthe unconditional versions of the nonparametric and parametricbootstrap df estimates is conjectural at this point Remark Noutlines a plausibility argument

Remark M Figure 10 involved two resampling levels Para-metric bootstrap samples ylowasta = micro+εlowasta were drawn as in (217)for a = 12 50 with micro and the residuals εj = yj minus microj

determined by loess(span = 5) then B = 200 nonparametricbootstrap samples were drawn from each set vlowasta = (vlowasta

1 vlowasta2

vlowastan ) with say Nab

j repetitions of vlowastaj = (xj ylowasta

j ) in theabth nonparametric resample b = 12 B For each ldquoardquo then times B matrices of counts Nab

i and estimates microlowastabi gave Qi(h)lowasta

Figure 10 Small Dots Indicate 50 Parametric Bootstrap Replications of Unconditional Nonparametric Optimism Estimates (63) TrianglesTheir Averages Closely Follow the Covariance Penalty Estimates (solid curve) Vertical distance plotted in df units Here the estimation rule isloess(span = 5) See Remark M for details

Efron The Estimation of Prediction Error 631

Si(h)lowasta and Olowastai as in (62)ndash(63) The points Olowasta

i 2σ 2 (withσ 2 = sum

ε2j n) are the small dots in Figure 10 while the tri-

angles are their averages over a = 12 50 Standard t testsaccepted the null hypotheses that the averages were centered onthe solid curve

Remark N Theorem 2 applies to parametric averaging ofthe conditional nonparametric bootstrap For the usual uncondi-tional nonparametric bootstrap the bootstrap weights Nj on thepoints vj = (xj yj) in v(i) vary so that the last term in (44) is nolonger negated by assumption (45) Instead it adds a remainderterm to (69)

minus2sum

h=1

B(h)E(i)(

λlowasti (h) minus λlowast

i

)(microlowast

i minus microi)h (617)

Here microlowasti = m(xivlowast

(i)) where vlowast(i) puts weight Nj on vj for j = i

and λlowasti = minusq(microlowast

i )2To justify approximation (613) we need to show that (617)

is op(1n) This can be demonstrated explicitly for linear pro-jections (614) The result seems plausible in general sinceλlowast

i (h) minus λlowasti is Op(1n) while microlowast

i minus microi the nonparametric boot-strap deviation of microlowast

i from microi would usually be Op(1radic

n )

7 SUMMARY

Figure 11 classifies prediction error estimates on two crite-ria Parametric (model-based) versus nonparametric and con-ditional versus unconditional The classification can also bedescribed by which parts of the training set (xj yj) j =12 n are varied in the error rate computations TheSteinian only varies yi in estimating the ith error rate keep-ing all the covariates xj and also yj for j = i fixed at the otherextreme the nonparametric bootstrap simultaneously varies theentire training set

Here are some comparisons and comments concerning thefour methods

bull The parametric methods require modeling assumptionsin order to carry out the covariance penalty calculations

Figure 11 Two-Way Classification of Prediction Error Estimates Dis-cussed in This Article The conditional methods are local in the sensethat only the i th case data are varied in estimating the i th error rate

When these assumptions are justified the RaondashBlackwelltype of results of Sections 4 and 6 imply that the paramet-ric techniques will be more efficient than their nonpara-metric counterparts particularly for estimating degrees offreedom

bull The modeling assumptions need not rely on the estimationrule micro = m(y) under investigation We can use ldquobiggerrdquomodels as in Remark A that is ones less likely to be bi-ased

bull Modeling assumptions are less important for rules micro =m(y) that are close to linear In genuinely linear situationssuch as those needed for the Cp and AIC criteria the co-variance corrections are constants that do not depend onthe model at all The centralized version of SURE (325)extends this property to maximum likelihood estimation incurved families

bull Local methods extrapolate error estimates from smallchanges in the training set Global methods make muchlarger changes in the training set of a size commensuratewith actual random sampling which is an advantage indealing with ldquoroughrdquo rules m(y) such as nearest neighborsor classification trees see Efron and Tibshirani (1997)

bull Steinrsquos SURE criterion (211) is local because it dependson partial derivatives and parametric (29) without beingmodel based It performed more like cross-validation thanthe parametric bootstrap in the situation of Figure 2

bull The computational burden in our examples was less forglobal methods Equation (218) with λlowastb

i replacing microlowastbi

for general error measures helps determine the numberof replications B required for the parametric bootstrapGrouping the usual labor-saving tactic in applying cross-validation can also be applied to covariance penalty meth-ods as in Remark G though now it is not clear that this iscomputationally helpful

bull As shown in Remark B the bootstrap methodrsquos computa-tions can also be used for hypothesis tests comparing theefficacy of different models

Accurate estimation of prediction error tends to be difficult inpractice particularly when applied to the choice between com-peting rules micro = m(y) In the authorrsquos opinion it will often beworth chancing plausible modeling assumptions for the covari-ance penalty estimates rather than relying entirely on nonpara-metric methods

[Received December 2002 Revised October 2003]

REFERENCES

Akaike H (1973) ldquoInformation Theory and an Extension of the MaximumLikelihood Principlerdquo Second International Symposium on Information The-ory 267ndash281

Breiman L (1992) ldquoThe Little Bootstrap and Other Methods for Dimensional-ity Selection in Regression X-Fixed Prediction Errorrdquo Journal of the Ameri-can Statistical Association 87 738ndash754

Efron B (1975) ldquoDefining the Curvature of a Statistical Problem (With Appli-cations to Second Order Efficiency)rdquo (with discussion) The Annals of Statis-tics 3 1189ndash1242

(1975) ldquoThe Efficiency of Logistic Regression Compared to NormalDiscriminant Analysesrdquo Journal of the American Statistical Association 70892ndash898

(1983) ldquoEstimating the Error Rate of a Prediction Rule Improvementon Cross-Validationrdquo Journal of the American Statistical Association 78316ndash331

(1986) ldquoHow Biased Is the Apparent Error Rate of a Prediction RulerdquoJournal of the American Statistical Association 81 461ndash470

632 Journal of the American Statistical Association September 2004

Efron B and Tibshirani R (1997) ldquoImprovements on Cross-ValidationThe 632+ Bootstrap Methodrdquo Journal of the American Statistical Associa-tion 92 548ndash560

Hampel F Ronchetti E Rousseeuw P and Stahel W (1986) Robust Statis-tics the Approach Based on Influence Functions New York Wiley

Hastie T and Tibshirani R (1990) Generalized Linear Models LondonChapman amp Hall

Li K (1985) ldquoFrom Steinrsquos Unbiased Risk Estimates to the Method of Gener-alized Cross-Validationrdquo The Annals of Statistics 13 1352ndash1377

(1987) ldquoAsymptotic Optimality for CpCL Cross-Validation and Gen-eralized Cross-Validation Discrete Index Setrdquo The Annals of Statistics 15958ndash975

Mallows C (1973) ldquoSome Comments on Cp rdquo Technometrics 15 661ndash675

Rockafellar R (1970) Convex Analysis Princeton NJ Princeton UniversityPress

Shen X Huang H-C and Ye J (2004) ldquoAdaptive Model Selection and As-sessment for Exponential Family Modelsrdquo Technometrics to appear

Shen X and Ye J (2002) ldquoAdaptive Model Selectionrdquo Journal of the Ameri-can Statistical Association 97 210ndash221

Stein C (1981) ldquoEstimation of the Mean of a Multivariate Normal Distribu-tionrdquo The Annals of Statistics 9 1135ndash1151

Tibshirani R and Knight K (1999) ldquoThe Covariance Inflation Criterion forAdaptive Model Selectionrdquo Journal of the Royal Statistical Society Ser B61 529ndash546

Ye J (1998) ldquoOn Measuring and Correcting the Effects of Data Miningand Model Selectionrdquo Journal of the American Statistical Association 93120ndash131

CommentPrabir BURMAN

I would like to begin by thanking Professor Efron for writinga paper that sheds new light on cross-validation and relatedmethods along with his proposals for stable model selec-tion procedures Stability can be an issue for ordinary cross-validation especially for not-so-smooth procedures such asstepwise regression and other such sequential procedures Ifwe use the language of learning and test sets ordinary cross-validation uses a learning set of size n minus 1 and a test set ofsize 1 If one can average over a test set of infinite size thenone gets a stable estimator Professor Efron demonstrates thisleads to a RaondashBlackwellization of ordinary cross-validation

The parametric bootstrap proposed here does require know-ing the conditional distribution of an observation given the restwhich in turn requires a knowledge of the unknown parametersProfessor Efron argues that for a near-linear case this poses noproblem A question naturally arises What happens to thosecases where the methods are considerably more complicatedsuch as stepwise methods

Another issue that is not entirely clear is the choice betweenthe conditional and unconditional bootstrap methods The con-ditional bootstrap seems to be better but it can be quite expen-sive computationally Can the unconditional bootstrap be usedas a general method always

It seems important to point out that ordinary cross-validationis not as inadequate as the present paper seems to suggestIf model selection is the goal then estimation of the overallprediction error is what one can concentrate on Even if thecomponentwise errors are not necessarily small ordinary cross-validation may still provide reasonable estimates especially ifthe sample size n is at least moderate and the estimation proce-dure is reasonably smooth

In this connection I would like to point out that methods suchas repeated v-fold (or multifold) cross-validation or repeated (orbootstrapped) learning-testing can improve on ordinary cross-validation because the test set sizes are not necessarily small(see eg the CART book by Breiman Friedman Olshen andStone 1984 Burman 1989 Zhang 1993) In addition suchmethods can reduce computational costs substantially In a re-peated v-fold cross-validation the data are repeatedly randomly

Prabir Burman is Professor Department of Statistics University of Califor-nia Davis CA 95616 (E-mail burmanwalducdavisedu)

split into v groups of roughly equal sizes For each repeat thereare v learning sets and the corresponding test sets Each learn-ing set is of size n(1 minus 1v) approximately and each test set isof size nv In a repeated learningndashtesting method the data arerandomly split into a learning set of size n(1 minus p) and a test setof size np where 0 lt p lt 1 If one takes a small v say v = 3in a v-fold cross-validation or a value of p = 13 in a repeatedlearningndashtesting method then each test set is of size n3 How-ever a correction term is needed in order to account for the factthat each learning set is of a size that is considerably smallerthan n (Burman 1989)

I ran a simulation for the classification case with the modelY is Bernoulli(π(X)) where π(X) = 1 minus sin2(2πX) and X isUniform(01) A majority voting scheme was used among thek nearest neighbor neighbors True misclassification errors (inpercent) and their standard errors (in parentheses) are given inTable 1 along with ordinary cross-validation corrected three-fold cross-validation with 10 repeats and corrected repeated-testing (RLT) methods with p = 33 and 30 repeats The samplesize is n = 100 and the number of replications is 25000 It canbe seen that the corrected v-fold cross-validation or repeatedlearningndashtesting methods can provide some improvement overordinary cross-validation

I would like to end my comments with thanks to Profes-sor Efron for providing significant and valuable insights intothe subject of model selection and for developing new methodsthat are improvements over a popular method such as ordinarycross-validation

Table 1 Classification Error Rates (in percent)

TCH k = 7 k = 9 k = 11

True 2282(414) 2455(487) 2726(543)CV 2295(701) 2482(718) 2765(755)Threefold CV 2274(561) 2456(556) 2704(582)RLT 2270(570) 2455(565) 2711(595)

copy 2004 American Statistical AssociationJournal of the American Statistical Association

September 2004 Vol 99 No 467 Theory and MethodsDOI 101198016214504000000908

Page 12: The Estimation of Prediction Error: Covariance Penalties ...jordan/sail/... · The Estimation of Prediction Error: Covariance Penalties and Cross-Validation Bradley E FRON Having

630 Journal of the American Statistical Association September 2004

Expression (69) can be exactly evaluated for linear projec-tion estimates micro = My (using squared error loss)

M = X(X1X)minus1Xprime Xprime = (x1 x2 xn) (614)

Then the projection matrix corresponding to v(i)(h) has ith di-agonal element

Mii(h) = hMii

1 + (h minus 1)Mii Mii = Mii(1) = xprime

i(XprimeX)minus1xi

(615)

and if ylowasti sim (microi σ

2) with y(i) fixed then cov(i) = σ 2Mii(h) Us-ing (66) and the self-stable relationship microi minus microi = Mii( yi minus microi)(69) can be evaluated as

E(i)Olowasti = (i) middot [1 minus 4Mii] (616)

In this case (613) errs by a factor of only [1 + O(1n)]Result (612) implies an approximate RaondashBlackwell rela-

tionship between nonparametric and parametric bootstrap opti-mism estimates when both are carried out conditionally on v(i)As with cross-validation this relationship seems to extend tothe more familiar unconditional bootstrap estimator Figure 10concerns the kidney data and squared error loss where thistime the fitting rule micro = m(y) is ldquoloess(tot sim age span = 5)rdquoLoess unlike lowess is a linear rule micro = My although it is notself-stable The solid curve traces the coordinatewise covari-ance penalty dfi estimates Mii as a function of agei

The small points in Figure 10 represent individual uncon-ditional nonparametric bootstrap dfi estimates Olowast

i 2σ 2 (63)

evaluated for 50 parametric bootstrap data vectors ylowast obtainedas in (217) Remark M provides the details Their means acrossthe 50 replications the triangles follow the Mii curve As withcross-validation if we attempt to improve nonparametric boot-strap optimism estimates by averaging across the ylowast vectorsgiving the covariance penalty i we wind up close to i it-self

As in Figure 8 we can expect nonparametric bootstrap df esti-mates to be much more variable than covariance penalties Var-ious versions of the nonparametric bootstrap particularly theldquo632 rulerdquo outperformed cross-validation in Efron (1983) andEfron and Tibshirani (1997) and may be preferred when non-parametric error predictions are required However covariancepenalty methods offer increased accuracy whenever their un-derlying models are believable

A general verification of the results of Figure 10 linkingthe unconditional versions of the nonparametric and parametricbootstrap df estimates is conjectural at this point Remark Noutlines a plausibility argument

Remark M Figure 10 involved two resampling levels Para-metric bootstrap samples ylowasta = micro+εlowasta were drawn as in (217)for a = 12 50 with micro and the residuals εj = yj minus microj

determined by loess(span = 5) then B = 200 nonparametricbootstrap samples were drawn from each set vlowasta = (vlowasta

1 vlowasta2

vlowastan ) with say Nab

j repetitions of vlowastaj = (xj ylowasta

j ) in theabth nonparametric resample b = 12 B For each ldquoardquo then times B matrices of counts Nab

i and estimates microlowastabi gave Qi(h)lowasta

Figure 10 Small Dots Indicate 50 Parametric Bootstrap Replications of Unconditional Nonparametric Optimism Estimates (63) TrianglesTheir Averages Closely Follow the Covariance Penalty Estimates (solid curve) Vertical distance plotted in df units Here the estimation rule isloess(span = 5) See Remark M for details

Efron The Estimation of Prediction Error 631

Si(h)lowasta and Olowastai as in (62)ndash(63) The points Olowasta

i 2σ 2 (withσ 2 = sum

ε2j n) are the small dots in Figure 10 while the tri-

angles are their averages over a = 12 50 Standard t testsaccepted the null hypotheses that the averages were centered onthe solid curve

Remark N Theorem 2 applies to parametric averaging ofthe conditional nonparametric bootstrap For the usual uncondi-tional nonparametric bootstrap the bootstrap weights Nj on thepoints vj = (xj yj) in v(i) vary so that the last term in (44) is nolonger negated by assumption (45) Instead it adds a remainderterm to (69)

minus2sum

h=1

B(h)E(i)(

λlowasti (h) minus λlowast

i

)(microlowast

i minus microi)h (617)

Here microlowasti = m(xivlowast

(i)) where vlowast(i) puts weight Nj on vj for j = i

and λlowasti = minusq(microlowast

i )2To justify approximation (613) we need to show that (617)

is op(1n) This can be demonstrated explicitly for linear pro-jections (614) The result seems plausible in general sinceλlowast

i (h) minus λlowasti is Op(1n) while microlowast

i minus microi the nonparametric boot-strap deviation of microlowast

i from microi would usually be Op(1radic

n )

7 SUMMARY

Figure 11 classifies prediction error estimates on two crite-ria Parametric (model-based) versus nonparametric and con-ditional versus unconditional The classification can also bedescribed by which parts of the training set (xj yj) j =12 n are varied in the error rate computations TheSteinian only varies yi in estimating the ith error rate keep-ing all the covariates xj and also yj for j = i fixed at the otherextreme the nonparametric bootstrap simultaneously varies theentire training set

Here are some comparisons and comments concerning thefour methods

bull The parametric methods require modeling assumptionsin order to carry out the covariance penalty calculations

Figure 11 Two-Way Classification of Prediction Error Estimates Dis-cussed in This Article The conditional methods are local in the sensethat only the i th case data are varied in estimating the i th error rate

When these assumptions are justified the RaondashBlackwelltype of results of Sections 4 and 6 imply that the paramet-ric techniques will be more efficient than their nonpara-metric counterparts particularly for estimating degrees offreedom

bull The modeling assumptions need not rely on the estimationrule micro = m(y) under investigation We can use ldquobiggerrdquomodels as in Remark A that is ones less likely to be bi-ased

bull Modeling assumptions are less important for rules micro =m(y) that are close to linear In genuinely linear situationssuch as those needed for the Cp and AIC criteria the co-variance corrections are constants that do not depend onthe model at all The centralized version of SURE (325)extends this property to maximum likelihood estimation incurved families

• Local methods extrapolate error estimates from small changes in the training set. Global methods make much larger changes in the training set, of a size commensurate with actual random sampling, which is an advantage in dealing with "rough" rules m(y) such as nearest neighbors or classification trees; see Efron and Tibshirani (1997).

• Stein's SURE criterion (2.11) is local, because it depends on partial derivatives, and parametric (2.9), without being model based. It performed more like cross-validation than the parametric bootstrap in the situation of Figure 2. (A small numerical sketch following this list contrasts the partial-derivative and bootstrap-covariance calculations.)

• The computational burden in our examples was less for global methods. Equation (2.18), with λ_i^{*b} replacing μ̂_i^{*b} for general error measures, helps determine the number of replications B required for the parametric bootstrap. Grouping, the usual labor-saving tactic in applying cross-validation, can also be applied to covariance penalty methods, as in Remark G, though now it is not clear that this is computationally helpful.

• As shown in Remark B, the bootstrap method's computations can also be used for hypothesis tests comparing the efficacy of different models.
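To make the local-versus-global and parametric-versus-nonparametric contrasts concrete, here is a small R sketch (my own illustration, not code from the article) that computes the degrees of freedom of a linear smoother three ways: exactly, as the trace of the hat matrix; by numerical partial derivatives in the spirit of SURE (2.11); and by a parametric bootstrap covariance estimate in the spirit of Sections 2 and 3. The data-generating curve and noise level are arbitrary stand-ins.

  set.seed(2)
  n <- 60
  x <- seq(0, 1, length.out = n)
  mu <- sin(2 * pi * x)              # synthetic truth (illustration only)
  sigma <- 0.3
  y <- mu + rnorm(n, 0, sigma)

  # A linear smoother mu.hat = M y: here cubic polynomial regression.
  X <- model.matrix(~ poly(x, 3))
  M <- X %*% solve(crossprod(X), t(X))      # hat matrix
  fit <- function(y) drop(M %*% y)

  # (1) Exact degrees of freedom for a linear rule: sum of the M_ii.
  df.exact <- sum(diag(M))

  # (2) "Local" estimate via numerical partial derivatives d mu.hat_i / d y_i.
  eps <- 1e-6
  df.sure <- sum(sapply(1:n, function(i) {
    y1 <- y; y1[i] <- y1[i] + eps
    (fit(y1)[i] - fit(y)[i]) / eps
  }))

  # (3) Parametric bootstrap covariance estimate: simulate y* ~ N(mu.hat, sigma.hat^2 I),
  #     refit, and sum the empirical covariances of mu.hat*_i with y*_i, scaled by sigma.hat^2.
  mu.hat <- fit(y)
  sigma.hat <- sqrt(sum((y - mu.hat)^2) / (n - df.exact))
  B <- 2000
  ystar  <- mu.hat + matrix(rnorm(n * B, 0, sigma.hat), n, B)
  mustar <- M %*% ystar
  cov.i  <- rowSums((mustar - rowMeans(mustar)) * (ystar - rowMeans(ystar))) / (B - 1)
  df.boot <- sum(cov.i) / sigma.hat^2

  round(c(exact = df.exact, sure = df.sure, bootstrap = df.boot), 2)   # all near 4

For this linear rule the three numbers agree up to Monte Carlo error; for a genuinely nonlinear rule m(y) the exact trace is unavailable, and the local and global estimates need no longer agree.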

Accurate estimation of prediction error tends to be difficult in practice, particularly when applied to the choice between competing rules μ̂ = m(y). In the author's opinion it will often be worth chancing plausible modeling assumptions for the covariance penalty estimates, rather than relying entirely on nonparametric methods.

[Received December 2002. Revised October 2003.]

REFERENCES

Akaike, H. (1973), "Information Theory and an Extension of the Maximum Likelihood Principle," in Second International Symposium on Information Theory, pp. 267–281.

Breiman, L. (1992), "The Little Bootstrap and Other Methods for Dimensionality Selection in Regression: X-Fixed Prediction Error," Journal of the American Statistical Association, 87, 738–754.

Efron, B. (1975), "Defining the Curvature of a Statistical Problem (With Applications to Second Order Efficiency)" (with discussion), The Annals of Statistics, 3, 1189–1242.

——— (1975), "The Efficiency of Logistic Regression Compared to Normal Discriminant Analysis," Journal of the American Statistical Association, 70, 892–898.

——— (1983), "Estimating the Error Rate of a Prediction Rule: Improvement on Cross-Validation," Journal of the American Statistical Association, 78, 316–331.

——— (1986), "How Biased Is the Apparent Error Rate of a Prediction Rule?" Journal of the American Statistical Association, 81, 461–470.


Efron, B., and Tibshirani, R. (1997), "Improvements on Cross-Validation: The .632+ Bootstrap Method," Journal of the American Statistical Association, 92, 548–560.

Hampel, F., Ronchetti, E., Rousseeuw, P., and Stahel, W. (1986), Robust Statistics: The Approach Based on Influence Functions, New York: Wiley.

Hastie, T., and Tibshirani, R. (1990), Generalized Additive Models, London: Chapman & Hall.

Li, K.-C. (1985), "From Stein's Unbiased Risk Estimates to the Method of Generalized Cross-Validation," The Annals of Statistics, 13, 1352–1377.

——— (1987), "Asymptotic Optimality for C_p, C_L, Cross-Validation and Generalized Cross-Validation: Discrete Index Set," The Annals of Statistics, 15, 958–975.

Mallows, C. (1973), "Some Comments on C_p," Technometrics, 15, 661–675.

Rockafellar, R. (1970), Convex Analysis, Princeton, NJ: Princeton University Press.

Shen, X., Huang, H.-C., and Ye, J. (2004), "Adaptive Model Selection and Assessment for Exponential Family Models," Technometrics, to appear.

Shen, X., and Ye, J. (2002), "Adaptive Model Selection," Journal of the American Statistical Association, 97, 210–221.

Stein, C. (1981), "Estimation of the Mean of a Multivariate Normal Distribution," The Annals of Statistics, 9, 1135–1151.

Tibshirani, R., and Knight, K. (1999), "The Covariance Inflation Criterion for Adaptive Model Selection," Journal of the Royal Statistical Society, Ser. B, 61, 529–546.

Ye, J. (1998), "On Measuring and Correcting the Effects of Data Mining and Model Selection," Journal of the American Statistical Association, 93, 120–131.

Comment
Prabir BURMAN

I would like to begin by thanking Professor Efron for writing a paper that sheds new light on cross-validation and related methods, along with his proposals for stable model selection procedures. Stability can be an issue for ordinary cross-validation, especially for not-so-smooth procedures such as stepwise regression and other such sequential procedures. If we use the language of learning and test sets, ordinary cross-validation uses a learning set of size n − 1 and a test set of size 1. If one can average over a test set of infinite size, then one gets a stable estimator. Professor Efron demonstrates this leads to a Rao–Blackwellization of ordinary cross-validation.

The parametric bootstrap proposed here does require knowing the conditional distribution of an observation given the rest, which in turn requires a knowledge of the unknown parameters. Professor Efron argues that for a near-linear case this poses no problem. A question naturally arises: What happens in those cases where the methods are considerably more complicated, such as stepwise methods?

Another issue that is not entirely clear is the choice between the conditional and unconditional bootstrap methods. The conditional bootstrap seems to be better, but it can be quite expensive computationally. Can the unconditional bootstrap always be used as a general method?

It seems important to point out that ordinary cross-validation is not as inadequate as the present paper seems to suggest. If model selection is the goal, then estimation of the overall prediction error is what one can concentrate on. Even if the componentwise errors are not necessarily small, ordinary cross-validation may still provide reasonable estimates, especially if the sample size n is at least moderate and the estimation procedure is reasonably smooth.

In this connection I would like to point out that methods such as repeated v-fold (or multifold) cross-validation or repeated (or bootstrapped) learning–testing can improve on ordinary cross-validation, because the test set sizes are not necessarily small (see, e.g., the CART book by Breiman, Friedman, Olshen, and Stone 1984; Burman 1989; Zhang 1993). In addition, such methods can reduce computational costs substantially. In a repeated v-fold cross-validation the data are repeatedly randomly split into v groups of roughly equal sizes. For each repeat there are v learning sets and the corresponding test sets. Each learning set is of size n(1 − 1/v), approximately, and each test set is of size n/v. In a repeated learning–testing method the data are randomly split into a learning set of size n(1 − p) and a test set of size np, where 0 < p < 1. If one takes a small v, say v = 3, in a v-fold cross-validation, or a value of p = 1/3 in a repeated learning–testing method, then each test set is of size n/3. However, a correction term is needed in order to account for the fact that each learning set is of a size that is considerably smaller than n (Burman 1989).

Prabir Burman is Professor, Department of Statistics, University of California, Davis, CA 95616 (E-mail: burman@wald.ucdavis.edu).

I ran a simulation for the classification case with the model Y ~ Bernoulli(π(X)), where π(X) = 1 − sin²(2πX) and X ~ Uniform(0, 1). A majority voting scheme was used among the k nearest neighbors. True misclassification errors (in percent) and their standard errors (in parentheses) are given in Table 1, along with ordinary cross-validation, corrected threefold cross-validation with 10 repeats, and corrected repeated learning–testing (RLT) methods with p = .33 and 30 repeats. The sample size is n = 100 and the number of replications is 25,000. It can be seen that the corrected v-fold cross-validation or repeated learning–testing methods can provide some improvement over ordinary cross-validation.
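A rough R sketch of one replication of this simulation follows; it is my own reconstruction, not Professor Burman's code, and it omits the Burman (1989) correction term for the reduced learning-set size. The k-nearest-neighbor classifier comes from the class package.

  library(class)   # knn() and knn.cv() for majority-vote k-nearest-neighbor classification
  set.seed(3)

  n <- 100
  x <- runif(n)
  y <- factor(rbinom(n, 1, 1 - sin(2 * pi * x)^2))   # Y ~ Bernoulli(pi(X)), pi(X) = 1 - sin^2(2*pi*X)
  k <- 7

  # Ordinary (leave-one-out) cross-validation.
  cv.loo <- mean(knn.cv(matrix(x), y, k = k) != y)

  # Repeated threefold cross-validation, 10 repeats (uncorrected; Burman 1989
  # adds a correction for learning sets of size n(1 - 1/v) rather than n).
  v <- 3; repeats <- 10
  errs <- replicate(repeats, {
    fold <- sample(rep(1:v, length.out = n))
    mean(sapply(1:v, function(g) {
      test <- fold == g
      pred <- knn(matrix(x[!test]), matrix(x[test]), y[!test], k = k)
      mean(pred != y[test])
    }))
  })
  cv.vfold <- mean(errs)

  c(loo = cv.loo, threefold = cv.vfold)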

I would like to end my comments with thanks to Professor Efron for providing significant and valuable insights into the subject of model selection, and for developing new methods that are improvements over a popular method such as ordinary cross-validation.

Table 1. Classification Error Rates (in percent)

                  k = 7          k = 9          k = 11
  True            22.82 (4.14)   24.55 (4.87)   27.26 (5.43)
  CV              22.95 (7.01)   24.82 (7.18)   27.65 (7.55)
  Threefold CV    22.74 (5.61)   24.56 (5.56)   27.04 (5.82)
  RLT             22.70 (5.70)   24.55 (5.65)   27.11 (5.95)

© 2004 American Statistical Association
Journal of the American Statistical Association
September 2004, Vol. 99, No. 467, Theory and Methods
DOI 10.1198/016214504000000908
