Model Selection: Non-nested Testing and Regression Shrinkage and Selection via the Lasso
Frank A. Wolak
Economics 276, Department of Economics
Stanford University
March 9, 2017
Model Selection–Non-Nested Tests
Researchers often want to select between competing econometric models
Often these models are non-nested
Cannot write one model as a restricted version of the other model
Classical likelihood-ratio based approach to non-nested hypothesis testing
Cox's non-nested test assumes the researcher knows the true model under the null hypothesis
Vuong's (1989) non-nested test assumes that each parametric model is equidistant from the true model under the null hypothesis
Does not assume the researcher knows the true model under the null hypothesis
Model Selection–Shrinkage and Variable Selection
Researchers often want to include a large number of regressors
Two approaches are typically employed
Estimator that smoothly shrinks all parameter estimates towards zero: ridge regression
More recently, the "least absolute shrinkage and selection operator" (Lasso), which zeros out certain coefficients
Advantage of shrinkage methods is that all variables have non-zero coefficients in the model
But this can also be a disadvantage, because the included variables are imprecisely estimated
Lasso attempts to reduce the number of variables in the model to only the "most important" ones, but this complicates the inference process:
No such thing as a valid standard error
Cox’s Non-Nested Hypothesis Test–Regression Version I
The researcher would like to test
H: $y_t = X_t'\beta + \varepsilon_t$, where the $\varepsilon_t$ are i.i.d. $N(0, \sigma^2)$, versus
K: $y_t = Z_t'\alpha + \eta_t$, where the $\eta_t$ are i.i.d. $N(0, \omega^2)$,
where $y_t$ is a scalar, $X_t$ is a $(K \times 1)$ vector, and $Z_t$ is an $(L \times 1)$ vector
Compute $\hat\beta$, the OLS estimate of $\beta$, and $\hat\alpha$, the OLS estimate of $\alpha$, which are also the ML estimates of these parameters
Compute the OLS estimate of $\delta$ in $y_t = X_t'\beta + (Z_t'\hat\alpha)\delta + \xi_t$
The two-sided t-test of the null hypothesis H: $\delta = 0$ versus K: $\delta \neq 0$ is equivalent to the non-nested test given above (a simulated sketch follows)
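Below is a minimal simulated sketch of this artificial-regression version of the test in Python; the data-generating process, sample size, and variable names are illustrative assumptions, not part of the slides.

```python
# Hedged sketch: regression version of the Cox-type non-nested test.
# Model H uses regressors X; model K uses regressors Z. Here H is true.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
T = 500
X = rng.normal(size=(T, 2))                         # model H regressors
Z = 0.5 * X + rng.normal(size=(T, 2))               # model K regressors, correlated with X
y = X @ np.array([1.0, -1.0]) + rng.normal(size=T)  # H is the true model in this simulation

# Step 1: OLS fit of model K gives the fitted values Z'alpha_hat
alpha_hat = np.linalg.lstsq(Z, y, rcond=None)[0]
k_fitted = Z @ alpha_hat

# Step 2: augment model H with the fitted values from model K
augmented = sm.add_constant(np.column_stack([X, k_fitted]))
res = sm.OLS(y, augmented).fit()

# Step 3: two-sided t-test of delta = 0 (the last coefficient)
print("t-stat on delta:", res.tvalues[-1], "p-value:", res.pvalues[-1])
# Failing to reject delta = 0 means model K adds nothing beyond model H
```

Reversing the roles of the two models, as on the next slide, amounts to swapping X and Z in the sketch.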
Cox’s Non-Nested Hypothesis Test–Regression Version II
Note that testing
H: $y_t = Z_t'\alpha + \eta_t$, where the $\eta_t$ are i.i.d. $N(0, \omega^2)$, versus
K: $y_t = X_t'\beta + \varepsilon_t$, where the $\varepsilon_t$ are i.i.d. $N(0, \sigma^2)$,
proceeds the same way: the researcher computes the OLS estimate of $\delta$ in $y_t = Z_t'\alpha + (X_t'\hat\beta)\delta + \xi_t$
The two-sided t-test of the null hypothesis H: $\delta = 0$ versus K: $\delta \neq 0$ is equivalent to the non-nested test given above
Note that one can reject or fail to reject both models, which leaves the researcher with two "valid" models or no valid model
Vuong's (1989) non-nested hypothesis test avoids this outcome
Vuong’s Non-Nested Hypothesis Test–General Version I
Let $X_t = (Y_t', Z_t')'$, let $h(y|z)$ equal the unknown true conditional density of $y$ given $z$, and let $m(y, z)$ equal the true joint density of $(y, z)$
Let $f(Y_t|Z_t, \theta)$ equal a parametric conditional density of $Y_t$ given $Z_t$ and $g(Y_t|Z_t, \alpha)$ equal an alternative parametric conditional density of $Y_t$ given $Z_t$
Let $L_T^f(\theta) = \sum_{t=1}^{T} \ln f(Y_t|Z_t, \theta)$, with $\hat\theta_T$ the quasi-ML estimate of $\theta^*$, and $L_T^g(\alpha) = \sum_{t=1}^{T} \ln g(Y_t|Z_t, \alpha)$, with $\hat\alpha_T$ the quasi-ML estimate of $\alpha^*$
Let the Kullback-Leibler Information Criterion (KLIC) distance between $h(y|z)$ and $f(y|z, \theta^*)$ equal
$\mathrm{KLIC}(h(y|z), f(y|z, \theta^*)) = E[\ln[h(Y|Z)/f(Y|Z, \theta^*)]]$, where $E[\cdot]$ denotes the expectation with respect to $m(y, z)$
Vuong’s Non-Nested Hypothesis Test–General Version II
Note that $\mathrm{KLIC}(f, g) \geq 0$ for any two densities $f$ and $g$, and $\mathrm{KLIC}(f, f) = 0$, so that $\mathrm{KLIC}(f, g)$ is a "metric"
Vuong's test null hypothesis is H: $\mathrm{KLIC}(h(y|z), f(y|z, \theta^*)) = \mathrm{KLIC}(h(y|z), g(y|z, \alpha^*))$
This is equivalent to H: $E[\ln[f(Y_t|Z_t, \theta^*)/g(Y_t|Z_t, \alpha^*)]] = 0$
The alternative hypothesis is K: $\mathrm{KLIC}(h(y|z), f(y|z, \theta^*)) < \mathrm{KLIC}(h(y|z), g(y|z, \alpha^*))$, i.e., $f(y|z, \theta^*)$ is closer to $h(y|z)$
Or K: $\mathrm{KLIC}(h(y|z), f(y|z, \theta^*)) > \mathrm{KLIC}(h(y|z), g(y|z, \alpha^*))$, i.e., $g(y|z, \alpha^*)$ is closer to $h(y|z)$
Equivalently, Vuong's null hypothesis is $E[\ln f(Y_t|Z_t, \theta^*)] = E[\ln g(Y_t|Z_t, \alpha^*)]$: the expected log-likelihood function values for the two models are equal, versus the alternative that one model has a larger expected log-likelihood function value
Vuong’s Non-Nested Hypothesis Test–General Version III
Note that $E[\ln[f(Y_t|Z_t, \theta^*)/g(Y_t|Z_t, \alpha^*)]]$ is unobserved, but it can be consistently estimated by

$LR(\hat\theta_T, \hat\alpha_T) = \sum_{t=1}^{T} \ln\left[\frac{f(Y_t|Z_t, \hat\theta_T)}{g(Y_t|Z_t, \hat\alpha_T)}\right] = L_T^f(\hat\theta_T) - L_T^g(\hat\alpha_T)$

Note that $LR(\hat\theta_T, \hat\alpha_T)$ is the difference in two log-likelihood function values
Under the null hypothesis, $\frac{1}{\sqrt{T}}\frac{LR(\hat\theta_T, \hat\alpha_T)}{\hat\omega_T}$ converges in distribution to a $N(0, 1)$ random variable, where

$\hat\omega_T^2 = \frac{1}{T}\sum_{t=1}^{T}\left[\ln\frac{f(Y_t|Z_t, \hat\theta_T)}{g(Y_t|Z_t, \hat\alpha_T)}\right]^2 - \left[\frac{1}{T}\sum_{t=1}^{T}\ln\frac{f(Y_t|Z_t, \hat\theta_T)}{g(Y_t|Z_t, \hat\alpha_T)}\right]^2,$

the sample variance of the observation-by-observation difference in the optimized log-likelihood function values
Define $W_t = \ln f(Y_t|Z_t, \hat\theta_T) - \ln g(Y_t|Z_t, \hat\alpha_T)$
In this notation the test statistic is $\sqrt{T}\,\bar{W}/s$, where $\bar{W} = \frac{1}{T}\sum_{t=1}^{T} W_t$ and $s^2 = \frac{1}{T}\sum_{t=1}^{T}(W_t - \bar{W})^2$ (computed in the sketch below)
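Below is a minimal Python sketch of the statistic for two competing linear-normal models; the data-generating process and the helper function are illustrative assumptions, not from the slides.

```python
# Hedged sketch: Vuong statistic comparing two Gaussian regression models.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
T = 1000
x = rng.normal(size=T)
z = rng.normal(size=T)
y = 1.0 + 0.8 * x + 0.3 * z + rng.normal(size=T)  # the truth uses both x and z

def per_obs_loglik(y, w):
    """OLS fit of y on (1, w); Gaussian per-observation log-likelihoods at the QML estimates."""
    X = np.column_stack([np.ones_like(w), w])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sigma2 = resid @ resid / len(y)               # ML estimate of the error variance
    return norm.logpdf(y, loc=X @ beta, scale=np.sqrt(sigma2))

W = per_obs_loglik(y, x) - per_obs_loglik(y, z)   # W_t = ln f_t - ln g_t
stat = np.sqrt(T) * W.mean() / W.std()            # W.std() uses 1/T, matching the slide
print("Vuong statistic:", stat)                   # |stat| > 1.96: reject equal fit at 5%
```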
Vuong’s Non-Nested Hypothesis Test–General Version IV
Note that rejecting the null hypothesis implies either that $f(y|z, \theta^*)$ is closer to $h(y|z)$ than $g(y|z, \alpha^*)$, or vice versa
The researcher cannot be left with the outcome that both models are rejected
Three possible outcomes: (1) no difference between the models, (2) $f$ is superior to $g$, or (3) $g$ is superior to $f$
The testing procedure is useful for distinguishing between two economic models for the same observed outcome, given the same observables
Testing between common values and private values in auctions
Testing between Cournot and Stackelberg equilibrium in oligopolies
Variable Selection
OLS estimation with large numbers of regressors raises two concerns:
Prediction accuracy: OLS estimates have low bias but large variance. Shrinking might help improve MSE.
Interpretation: When the number of predictors is large, interpretation may be difficult. Would like a smaller subset.
Standard alternative techniques are ridge regression and subset selection (such as stepwise regression). But both have drawbacks.
Ridge regression does not set any coefficients to 0:
$\hat\beta_r = (X'X + kI)^{-1}X'y$, where $k$ is the ridge factor, a positive scalar
Note that $E(\hat\beta_r) \neq \beta$, but for suitable $k$ its MSE is less than the MSE of the OLS estimator
Subset selection is a discrete process, and hence it is uncertain which regressors remain, particularly with correlation among regressors
Definition
Tibshirani (1996) proposes the lasso (least absolute shrinkage and selection operator) estimator.
Standardize: $\sum_{i=1}^{N} x_{ij}/N = 0$, $\sum_{i=1}^{N} x_{ij}^2/N = 1$, $\sum_{i=1}^{N} y_i/N = 0$.
The lasso estimator $\hat\beta$ is the solution to

$\min_{\beta} \sum_{i=1}^{N}\Bigl(y_i - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^2 \quad \text{s.t.} \quad \sum_{j=1}^{p}|\beta_j| \leq t,$

where $t \geq 0$ is a tuning parameter; the constraint is not binding if $t \geq \sum_{j=1}^{p}|\hat\beta_j^o|$, where $\hat\beta^o$ is the OLS estimate.
Equivalently, we can write

$\hat\beta = \arg\min_{\beta}\Bigl\{\sum_{i=1}^{N}\Bigl(y_i - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^2 + \lambda\sum_{j=1}^{p}|\beta_j|\Bigr\}.$
No closed-form solution, but the penalized form is easy to compute numerically; a minimal sketch follows.
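A minimal sketch of the penalized form using scikit-learn on simulated data; note (an implementation detail, not from the slides) that sklearn's Lasso minimizes $(1/(2N))\,\mathrm{RSS} + \alpha\sum_j|\beta_j|$, so its $\alpha$ corresponds to $\lambda/(2N)$ in the notation above.

```python
# Hedged sketch: lasso fit on standardized, simulated data.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, p = 100, 10
X = rng.normal(size=(N, p))
beta_true = np.array([3.0, -2.0, 1.5] + [0.0] * (p - 3))
y = X @ beta_true + rng.normal(size=N)

# Standardize as on the slide: columns of X to mean 0 and (1/N)*sum x^2 = 1; y to mean 0
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = y - y.mean()

fit = Lasso(alpha=0.1).fit(X, y)
print(fit.coef_)  # several coefficients are exactly 0: selection and shrinkage at once
```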
Motivation
Not satisfied with the OLS estimates.
Prediction accuracy: OLS estimates often have low bias but large variance. Shrinking might help improve MSE.
Interpretation: When the number of predictors is large, interpretation may be difficult. Would like a smaller subset.
Standard alternative techniques are ridge regression and subset selection. But both have drawbacks.
Subset selection is a discrete process and hence extremely variable.
Ridge regression does not set any coefficients to 0.
Lasso attempts to combine the strengths of both.
Outline
Definition
Related methods
Intuition
Details:
Implementation
Tuning parameter
Asymptotics
Standard errors
Ridge regression
The ridge estimator $\hat\beta^r$ is defined as

$\hat\beta^r = \arg\min_{\beta} \sum_{i=1}^{N}\Bigl(y_i - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^2 \quad \text{s.t.} \quad \sum_{j=1}^{p}\beta_j^2 \leq t,$

or, equivalently,

$\hat\beta^r = \arg\min_{\beta}\Bigl\{\sum_{i=1}^{N}\Bigl(y_i - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^2 + \lambda\sum_{j=1}^{p}\beta_j^2\Bigr\}.$

Unlike the lasso estimator, the ridge estimator has an explicit solution (checked numerically below):

$\hat\beta^r = (X'X + \lambda I)^{-1}X'y.$
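A quick numerical check of the explicit solution against scikit-learn's Ridge, whose documented objective $\|y - X\beta\|^2 + \lambda\sum_j\beta_j^2$ matches the penalized form above; the simulated data are an illustrative assumption.

```python
# Hedged sketch: closed-form ridge solution vs. a library solver.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
N, p, lam = 200, 5, 3.0
X = rng.normal(size=(N, p))
y = X @ rng.normal(size=p) + rng.normal(size=N)

beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
beta_solver = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
print(np.allclose(beta_closed, beta_solver))  # True
```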
Least squares with constrained Lq-norm
Both the lasso and the ridge estimators belong to a general family of least squares estimators, called bridge estimators (Frank and Friedman, 1993):

$\hat\beta = \arg\min_{\beta} \sum_{i=1}^{N}\Bigl(y_i - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^2 \quad \text{s.t.} \quad \|\beta\|_q \leq t,$

where $0 < q < \infty$ and

$\|\beta\|_q = \sum_{j=1}^{p}|\beta_j|^q.$

Lasso is $q = 1$, and ridge is $q = 2$.
Subset selection
Retain only M of the p OLS estimates.
Many methods for doing this, e.g., best-subset selection, forward-stepwise selection, backward-stepwise selection, forward-stagewise regression, etc.
For illustration, just focus on best-subset selection.
Find the subset of size M that gives the smallest residual sum of squares (a brute-force sketch follows this list).
M can be chosen by using cross-validation to estimate prediction error, or by using something like the AIC criterion.
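For small $p$, best-subset selection can be carried out by exhaustive search. Below is a minimal sketch on simulated data (the data-generating process is an illustrative assumption); the search is combinatorial and becomes infeasible quickly as $p$ grows.

```python
# Hedged sketch: best-subset selection of size M by exhaustive search.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
N, p, M = 100, 6, 2
X = rng.normal(size=(N, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=N)

best_rss, best_subset = np.inf, None
for subset in combinations(range(p), M):
    Xs = X[:, subset]
    beta = np.linalg.lstsq(Xs, y, rcond=None)[0]
    rss = np.sum((y - Xs @ beta) ** 2)
    if rss < best_rss:
        best_rss, best_subset = rss, subset
print("best subset of size", M, ":", best_subset)  # expect (0, 3)
```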
Intuition: Orthonormal design
In the orthonormal design case ($X'X = I$), subset selection, lasso, and ridge all have closed-form solutions. Note that $\hat\beta^o = X'y$ is the OLS estimate.
Best subset selection of size M reduces to choosing the M largest OLS coefficients in absolute value and setting the rest to 0.
Ridge solutions have the form

$\hat\beta_j^r = \frac{1}{1 + \lambda}\hat\beta_j^o.$

Lasso solutions have the form

$\hat\beta_j = \mathrm{sign}(\hat\beta_j^o)\,(|\hat\beta_j^o| - \lambda)_+,$

where $x_+ = \max\{x, 0\}$ (verified numerically below).
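The soft-thresholding formula is easy to verify numerically. The sketch below builds an orthonormal design with a QR decomposition and compares the closed form with a numerical lasso solver; the conversion $\alpha = \lambda/N$ reflects scikit-learn's $(1/(2N))\,\mathrm{RSS}$ scaling and is an implementation assumption, not from the slides.

```python
# Hedged sketch: lasso soft-thresholding under an orthonormal design.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, p = 200, 5
X, _ = np.linalg.qr(rng.normal(size=(N, p)))     # columns orthonormal: X'X = I
beta_true = np.array([2.0, -1.0, 0.5, 0.0, 0.0])
y = X @ beta_true + 0.5 * rng.normal(size=N)

beta_ols = X.T @ y                                # OLS when X'X = I
lam = 0.3
soft = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

fit = Lasso(alpha=lam / N, fit_intercept=False, tol=1e-10).fit(X, y)
print(np.allclose(soft, fit.coef_, atol=1e-6))    # True (up to solver tolerance)
```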
Intuition: Orthonormal design
Figure: The three estimators as functions of $\hat\beta_j^o$ (Hastie, Tibshirani, and Friedman, Table 3.4): best subset keeps $\hat\beta_j^o$ when $|\hat\beta_j^o| \geq |\hat\beta_{(M)}|$ and sets it to 0 otherwise, ridge shrinks proportionally to $\hat\beta_j^o/(1 + \lambda)$, and the lasso soft-thresholds to $\mathrm{sign}(\hat\beta_j^o)(|\hat\beta_j^o| - \lambda)_+$; a 45° line shows the unrestricted estimate for reference.
Intuition: Kinks in constraint set
Back to the non-orthonormal case. Why does the lasso result in coefficients that are exactly 0?
Figure: Lasso constraint set vs. ridge constraint set (Hastie, Tibshirani, and Friedman, Figure 3.11). Shown are contours of the least squares error function (ellipses) together with the constraint regions $|\beta_1| + |\beta_2| \leq t$ (lasso, a diamond) and $\beta_1^2 + \beta_2^2 \leq t^2$ (ridge, a disk); each method finds the first point where the elliptical contours hit its constraint region, and the diamond's corners make solutions with some $\beta_j$ exactly zero likely.
Intuition: Constraint set in three dimensions
Figure: The lasso constraint set defines a rhomboid in higher dimensions.
Lq constraint sets
The value $q = 0$ corresponds to variable subset selection, as the penalty simply counts the number of nonzero parameters; $q = 1$ corresponds to the lasso, and $q = 2$ to ridge regression.
Thinking of $|\beta_j|^q$ as (proportional to) the negative log-prior density for $\beta_j$, these contours are also the equi-contours of the prior distribution of the parameters. For $q \leq 1$ the prior concentrates more mass in the coordinate directions; the $q = 1$ case corresponds to an independent double exponential (Laplace) prior with density $(1/2\tau)\exp(-|\beta|/\tau)$ and $\tau = 1/\lambda$.
The case $q = 1$ (lasso) is the smallest $q$ such that the constraint region is convex; for $q > 1$, $|\beta_j|^q$ is differentiable at 0 and loses the lasso's ability to set coefficients exactly to zero.
Figure: Contours of constant value of $\sum_j|\beta_j|^q$ for $q = 4, 2, 1, 0.5, 0.1$ (Hastie, Tibshirani, and Friedman, Figure 3.12).
Bayes estimators
Posterior distribution: $f(\beta|y, x) \propto f(y, x|\beta)\,\pi(\beta)$.
Consider the estimator obtained as the posterior mode, i.e., by maximizing the (log-)posterior density with respect to $\beta$.
Recall our estimators have the form

$\arg\min_{\beta}\Bigl\{\sum_{i=1}^{N}\Bigl(y_i - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^2 + \lambda\sum_{j=1}^{p}|\beta_j|^q\Bigr\}.$

Think of the first term as (the negative of) $\log f(y, x|\beta)$; it is the usual least squares term (e.g., in MLE) under normality.
Think of the second term as (the negative of) $\log \pi(\beta)$.
When $q = 2$, it is proportional to the normal log-density.
When $q = 1$, it is proportional to the double-exponential (Laplace) log-density.
When $q = 0$, it simply counts the number of nonzero parameters.
Thus subset selection, ridge, and lasso estimators can all be thought of as Bayes estimators with different priors (a short derivation sketch follows).
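To make the correspondence explicit, here is a short sketch of the posterior-mode calculation for $q = 1$, assuming Gaussian errors with known variance $\sigma^2$ and an i.i.d. prior $\pi$ on the coefficients (these assumptions are ours, for illustration):

```latex
% Log-posterior under Gaussian errors with known variance \sigma^2:
\log f(\beta \mid y, x)
  = -\frac{1}{2\sigma^2} \sum_{i=1}^{N} \Bigl( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^{2}
    + \sum_{j=1}^{p} \log \pi(\beta_j) + \text{const}.
% With the Laplace prior \pi(\beta_j) = (1/2\tau) \exp(-|\beta_j|/\tau),
% maximizing the log-posterior is equivalent to minimizing
%   \sum_{i} \bigl( y_i - \sum_{j} \beta_j x_{ij} \bigr)^2 + \lambda \sum_{j} |\beta_j|,
% with \lambda = 2\sigma^2/\tau: exactly the lasso problem.
```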
Implementation: Least Angle Regression
A very fast and efficient algorithm by Efron, Hastie, Johnstone, and Tibshirani (2004) requires the same order of computation as that of a single least squares fit using $p$ predictors.
Computes the entire path over $t \geq 0$. For ease of output they typically reparametrize: $s = t / \sum_{j=1}^{p}|\hat\beta_j^o| \in [0, 1]$.
Available as an R package (a scikit-learn sketch follows).
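The slides point to the R implementation; a comparable path computation is available in scikit-learn's lars_path, sketched here on simulated data (the data-generating process is an illustrative assumption).

```python
# Hedged sketch: computing the full lasso path with LARS.
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
N, p = 100, 8
X = rng.normal(size=(N, p))
y = X @ np.array([3.0, -2.0, 1.0, 0, 0, 0, 0, 0]) + rng.normal(size=N)

alphas, active, coefs = lars_path(X, y, method="lasso")
print("active set (in order of entry):", active)  # indices of the variables selected
print("coefficient path shape:", coefs.shape)     # (p, number of knots along the path)
```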
Implementation: Least Angle Regression
Figure: Typical lasso output from the LARS algorithm: standardized coefficients plotted against $s = |\beta|/\max|\beta| \in [0, 1]$, with each coefficient path becoming nonzero at the step where its variable enters the model.
Selection of tuning parameter: Cross-validation
Choose t to minimize (an estimate of) prediction error.
In regression models, we usually mean $PE = E(y - \hat y)^2$.
Estimate PE by K-fold cross-validation.
For instance, $K = N$ gives the usual leave-one-out estimator:

$CV(t) = \frac{1}{N}\sum_{i=1}^{N}\bigl(y_i - \hat y_i^{-i}\bigr)^2,$

where $\hat y_i^{-i}$ is the prediction for observation $i$ from the fit that omits observation $i$ (a sketch follows).
Efron et al. also give a $C_p$-type selection criterion.
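A minimal sketch of K-fold cross-validation using scikit-learn's LassoCV, which searches over the penalty $\lambda$ rather than the bound $t$ (the two parametrizations trace out the same set of solutions); the simulated data are an illustrative assumption.

```python
# Hedged sketch: choosing the lasso penalty by 5-fold cross-validation.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
N, p = 200, 10
X = rng.normal(size=(N, p))
y = X @ np.array([2.0, -1.0] + [0.0] * 8) + rng.normal(size=N)

fit = LassoCV(cv=5).fit(X, y)       # CV estimate of prediction error over a grid of alphas
print("selected alpha:", fit.alpha_)
print("coefficients:", fit.coef_)   # irrelevant predictors are typically zeroed out
```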
Asymptotic properties
Knight and Fu (2000) conduct an asymptotic analysis under the classical setting of fixed $p$ and $N \to \infty$. They assume a linear model

$y_i = x_i'\beta + \varepsilon_i,$

where $\varepsilon_1, \ldots, \varepsilon_N$ are i.i.d. with mean 0 and variance $\sigma^2$, together with the regularity conditions

$C_N = \frac{1}{N}\sum_{i=1}^{N} x_i x_i' \to C,$

where $C$ is a nonnegative definite matrix, and

$\frac{1}{N}\max_{1 \leq i \leq N} x_i' x_i \to 0.$
Asymptotic properties
The objective function associated with the bridge estimator $\hat\beta_N$ is

$\frac{1}{N}\sum_{i=1}^{N}(y_i - x_i'\beta)^2 + \frac{\lambda_N}{N}\sum_{j=1}^{p}|\beta_j|^\gamma.$

The limiting behaviour of $\hat\beta_N$ is derived from studying the asymptotic behaviour of the objective function. They show that
If $\lambda_N = o(N)$, then $\hat\beta_N \stackrel{p}{\to} \beta$ (illustrated in the simulation below).
For $\gamma \geq 1$, if $\lambda_N/\sqrt{N} \to \lambda_0 \geq 0$, then

$\sqrt{N}(\hat\beta_N - \beta) \stackrel{d}{\to} \arg\min V(u),$

where $V(u)$ is a long expression given in the paper.
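A small simulation consistent with the consistency result: taking $\lambda_N = \sqrt{N}$ (so $\lambda_N = o(N)$), the lasso estimate approaches the true $\beta$ as $N$ grows. The design and the rescaling $\alpha = \lambda_N/(2N)$, which maps the objective above into scikit-learn's $(1/(2N))\,\mathrm{RSS} + \alpha\|\beta\|_1$ form, are illustrative assumptions.

```python
# Hedged sketch: lasso consistency when lambda_N = o(N).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
beta = np.array([1.0, -1.0, 0.0])
for N in [100, 1000, 10000]:
    X = rng.normal(size=(N, 3))
    y = X @ beta + rng.normal(size=N)
    lam_N = np.sqrt(N)                                 # lambda_N / N -> 0
    fit = Lasso(alpha=lam_N / (2 * N), fit_intercept=False).fit(X, y)
    print(N, np.round(fit.coef_, 3))                   # approaches (1, -1, 0) as N grows
```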
Standard errors
Delicate issue!
Tibshirani's approximate closed form. Osborne, Presnell, and Turlach (2000) claim that this formula does not yield an appropriate estimate. Also, having an estimated variance of 0 for predictors with $\hat\beta_j = 0$ is inadequate.
Bootstrap. Doesn't really work well either. Furthermore, a theorem from Kyung, Gill, Ghosh, and Casella (2010): the bootstrap standard errors of $\hat\beta_j$ are inconsistent if $\beta_j = 0$.
By studying the dual of the lasso optimization problem, Osborne, Presnell, and Turlach (2000) derive an estimate that yields a positive standard error for all coefficients.
However, since the distribution of individual lasso estimates will typically have a condensation of probability at zero, summarizing uncertainty by standard errors may not be appropriate to begin with (illustrated in the sketch below).
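The condensation at zero is easy to see in a simulation. Below is a minimal sketch of the pairs bootstrap for lasso coefficients; the data-generating process and penalty level are illustrative assumptions.

```python
# Hedged sketch: bootstrapping lasso coefficients; replicates for a truly
# zero coefficient pile up exactly at 0, so a standard error is a poor summary.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, B = 200, 500
X = rng.normal(size=(N, 3))
y = X @ np.array([2.0, 0.0, 0.0]) + rng.normal(size=N)

boot = np.empty((B, 3))
for b in range(B):
    idx = rng.integers(0, N, size=N)  # resample (x_i, y_i) pairs with replacement
    boot[b] = Lasso(alpha=0.1).fit(X[idx], y[idx]).coef_

print("bootstrap SEs:", boot.std(axis=0))
print("share of replicates exactly 0:", (boot == 0.0).mean(axis=0))
```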
Conclusion
Sizable literature about the lasso.
Computation: Efron et al. (2004), Osborne et al. (2000).
Standard errors: Osborne et al. (2000), Fan and Li (2001), Zou (2006).
Model selection: Zhao and Yu (2006), Fan and Li (2001), Huang, Horowitz, and Ma (2008).
Generalizations: fused lasso (Tibshirani et al., 2005), grouped lasso (Yuan and Lin, 2006), elastic net (Zou and Hastie, 2005), adaptive lasso (Zou, 2006), Bayesian lasso (Park and Casella, 2008).