Model Selection: Non-nested Testing and Regression Shrinkage and Selection via the Lasso
Frank A. Wolak
Economics 276, Department of Economics
Stanford University
March 9, 2017
Model Selection–Non-Nested Tests
Researchers often want to select between competing econometric models
Often these models are non-nested
Cannot write one model as a restricted version of the other model
Classical likelihood-ratio based approach to non-nested hypothesis testing
Cox's non-nested test assumes the researcher knows the true model under the null hypothesis
Vuong's (1989) non-nested test assumes that each parametric model is equidistant from the true model under the null hypothesis
Does not assume the researcher knows the true model under the null hypothesis
Model Selection–Shrinkage and Variable Selection
Researchers often want to include a large number of regressors
Two approaches are typically employed
Estimator that smoothly shrinks all parameter estimates towards zero: ridge regression
More recently, the "least absolute shrinkage and selection operator" (Lasso), which zeros out certain coefficients
Advantage of shrinkage methods is that all variables have non-zero coefficients in the model
But this can also be a disadvantage, because the included variables are imprecisely estimated
Lasso attempts to reduce the number of variables in the model to only the "most important" ones, but this complicates the inference process:
No such thing as a valid standard error
Cox’s Non-Nested Hypothesis Test–Regression Version I
The researcher would like to test
H: $y_t = X_t'\beta + \varepsilon_t$, where the $\varepsilon_t$ are i.i.d. $N(0, \sigma^2)$, versus
K: $y_t = Z_t'\alpha + \eta_t$, where the $\eta_t$ are i.i.d. $N(0, \omega^2)$,
where $y_t$ is a scalar, $X_t$ is a $(K \times 1)$ vector, and $Z_t$ is an $(L \times 1)$ vector
Compute $\hat\beta$, the OLS estimate of $\beta$, and $\hat\alpha$, the OLS estimate of $\alpha$, which are also the ML estimates of these parameters
Compute the OLS estimate of $\delta$ in $y_t = X_t'\beta + (Z_t'\hat\alpha)\delta + \xi_t$
The two-sided t-test of the null hypothesis H: $\delta = 0$ versus K: $\delta \neq 0$ is equivalent to the non-nested test given above (a simulated sketch follows)
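Below is a minimal simulated sketch of this artificial-regression version of the test in Python; the data-generating process, sample size, and variable names are illustrative assumptions, not part of the slides.

```python
# Hedged sketch: regression version of the Cox-type non-nested test.
# Model H uses regressors X; model K uses regressors Z. Here H is true.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
T = 500
X = rng.normal(size=(T, 2))                         # model H regressors
Z = 0.5 * X + rng.normal(size=(T, 2))               # model K regressors, correlated with X
y = X @ np.array([1.0, -1.0]) + rng.normal(size=T)  # H is the true model in this simulation

# Step 1: OLS fit of model K gives the fitted values Z'alpha_hat
alpha_hat = np.linalg.lstsq(Z, y, rcond=None)[0]
k_fitted = Z @ alpha_hat

# Step 2: augment model H with the fitted values from model K
augmented = sm.add_constant(np.column_stack([X, k_fitted]))
res = sm.OLS(y, augmented).fit()

# Step 3: two-sided t-test of delta = 0 (the last coefficient)
print("t-stat on delta:", res.tvalues[-1], "p-value:", res.pvalues[-1])
# Failing to reject delta = 0 means model K adds nothing beyond model H
```

Reversing the roles of the two models, as on the next slide, amounts to swapping X and Z in the sketch.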
Cox’s Non-Nested Hypothesis Test–Regression Version II
Note that testing
H: $y_t = Z_t'\alpha + \eta_t$, where the $\eta_t$ are i.i.d. $N(0, \omega^2)$, versus
K: $y_t = X_t'\beta + \varepsilon_t$, where the $\varepsilon_t$ are i.i.d. $N(0, \sigma^2)$,
proceeds the same way: the researcher computes the OLS estimate of $\delta$ in $y_t = Z_t'\alpha + (X_t'\hat\beta)\delta + \xi_t$
The two-sided t-test of the null hypothesis H: $\delta = 0$ versus K: $\delta \neq 0$ is equivalent to the non-nested test given above
Note that one can reject or fail to reject both models, which leaves the researcher with two "valid" models or no valid model
Vuong's (1989) non-nested hypothesis test avoids this outcome
Vuong’s Non-Nested Hypothesis Test–General Version I
Let $X_t = (Y_t', Z_t')'$, let $h(y|z)$ equal the unknown true conditional density of $y$ given $z$, and let $m(y, z)$ equal the true joint density of $(y, z)$
Let $f(Y_t|Z_t, \theta)$ equal a parametric conditional density of $Y_t$ given $Z_t$ and $g(Y_t|Z_t, \alpha)$ equal an alternative parametric conditional density of $Y_t$ given $Z_t$
Let $L_T^f(\theta) = \sum_{t=1}^{T} \ln f(Y_t|Z_t, \theta)$, with $\hat\theta_T$ the quasi-ML estimate of $\theta^*$, and $L_T^g(\alpha) = \sum_{t=1}^{T} \ln g(Y_t|Z_t, \alpha)$, with $\hat\alpha_T$ the quasi-ML estimate of $\alpha^*$
Let the Kullback-Leibler Information Criterion (KLIC) distance between $h(y|z)$ and $f(y|z, \theta^*)$ equal
$\mathrm{KLIC}(h(y|z), f(y|z, \theta^*)) = E[\ln[h(Y|Z)/f(Y|Z, \theta^*)]]$, where $E[\cdot]$ denotes the expectation with respect to $m(y, z)$
Vuong’s Non-Nested Hypothesis Test–General Version II
Note that $\mathrm{KLIC}(f, g) \geq 0$ for any two densities $f$ and $g$, and $\mathrm{KLIC}(f, f) = 0$, so that $\mathrm{KLIC}(f, g)$ is a "metric"
Vuong's test null hypothesis is H: $\mathrm{KLIC}(h(y|z), f(y|z, \theta^*)) = \mathrm{KLIC}(h(y|z), g(y|z, \alpha^*))$
This is equivalent to H: $E[\ln[f(Y_t|Z_t, \theta^*)/g(Y_t|Z_t, \alpha^*)]] = 0$
The alternative hypothesis is K: $\mathrm{KLIC}(h(y|z), f(y|z, \theta^*)) < \mathrm{KLIC}(h(y|z), g(y|z, \alpha^*))$, i.e., $f(y|z, \theta^*)$ is closer to $h(y|z)$
Or K: $\mathrm{KLIC}(h(y|z), f(y|z, \theta^*)) > \mathrm{KLIC}(h(y|z), g(y|z, \alpha^*))$, i.e., $g(y|z, \alpha^*)$ is closer to $h(y|z)$
Equivalently, Vuong's null hypothesis is $E[\ln f(Y_t|Z_t, \theta^*)] = E[\ln g(Y_t|Z_t, \alpha^*)]$: the expected log-likelihood function values for the two models are equal, versus the alternative that one model has a larger expected log-likelihood function value
Vuong’s Non-Nested Hypothesis Test–General Version III
Note that $E[\ln[f(Y_t|Z_t, \theta^*)/g(Y_t|Z_t, \alpha^*)]]$ is unobserved, but it can be consistently estimated by

$LR(\hat\theta_T, \hat\alpha_T) = \sum_{t=1}^{T} \ln\left[\frac{f(Y_t|Z_t, \hat\theta_T)}{g(Y_t|Z_t, \hat\alpha_T)}\right] = L_T^f(\hat\theta_T) - L_T^g(\hat\alpha_T)$

Note that $LR(\hat\theta_T, \hat\alpha_T)$ is the difference in two log-likelihood function values
Under the null hypothesis, $\frac{1}{\sqrt{T}}\frac{LR(\hat\theta_T, \hat\alpha_T)}{\hat\omega_T}$ converges in distribution to a $N(0, 1)$ random variable, where

$\hat\omega_T^2 = \frac{1}{T}\sum_{t=1}^{T}\left[\ln\frac{f(Y_t|Z_t, \hat\theta_T)}{g(Y_t|Z_t, \hat\alpha_T)}\right]^2 - \left[\frac{1}{T}\sum_{t=1}^{T}\ln\frac{f(Y_t|Z_t, \hat\theta_T)}{g(Y_t|Z_t, \hat\alpha_T)}\right]^2,$

the sample variance of the observation-by-observation difference in the optimized log-likelihood function values
Define $W_t = \ln f(Y_t|Z_t, \hat\theta_T) - \ln g(Y_t|Z_t, \hat\alpha_T)$
In this notation the test statistic is $\sqrt{T}\,\bar{W}/s$, where $\bar{W} = \frac{1}{T}\sum_{t=1}^{T} W_t$ and $s^2 = \frac{1}{T}\sum_{t=1}^{T}(W_t - \bar{W})^2$ (computed in the sketch below)
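Below is a minimal Python sketch of the statistic for two competing linear-normal models; the data-generating process and the helper function are illustrative assumptions, not from the slides.

```python
# Hedged sketch: Vuong statistic comparing two Gaussian regression models.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
T = 1000
x = rng.normal(size=T)
z = rng.normal(size=T)
y = 1.0 + 0.8 * x + 0.3 * z + rng.normal(size=T)  # the truth uses both x and z

def per_obs_loglik(y, w):
    """OLS fit of y on (1, w); Gaussian per-observation log-likelihoods at the QML estimates."""
    X = np.column_stack([np.ones_like(w), w])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sigma2 = resid @ resid / len(y)               # ML estimate of the error variance
    return norm.logpdf(y, loc=X @ beta, scale=np.sqrt(sigma2))

W = per_obs_loglik(y, x) - per_obs_loglik(y, z)   # W_t = ln f_t - ln g_t
stat = np.sqrt(T) * W.mean() / W.std()            # W.std() uses 1/T, matching the slide
print("Vuong statistic:", stat)                   # |stat| > 1.96: reject equal fit at 5%
```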
Vuong’s Non-Nested Hypothesis Test–General Version IV
Note that rejecting the null hypothesis implies either that $f(y|z, \theta^*)$ is closer to $h(y|z)$ than $g(y|z, \alpha^*)$, or vice versa
The researcher cannot be left with the outcome that both models are rejected
Three possible outcomes: (1) no difference between the models, (2) $f$ is superior to $g$, or (3) $g$ is superior to $f$
The testing procedure is useful for distinguishing between two economic models for the same observed outcome, given the same observables
Testing between common values and private values in auctions
Testing between Cournot and Stackelberg equilibrium in oligopolies
Variable Selection
OLS estimation with large numbers of regressors raises two concerns:
Prediction accuracy: OLS estimates have low bias but large variance. Shrinking might help improve MSE.
Interpretation: When the number of predictors is large, interpretation may be difficult. Would like a smaller subset.
Standard alternative techniques are ridge regression and subset selection (such as stepwise regression). But both have drawbacks.
Ridge regression does not set any coefficients to 0:
$\hat\beta_r = (X'X + kI)^{-1}X'y$, where $k$ is the ridge factor, a positive scalar
Note that $E(\hat\beta_r) \neq \beta$, but for suitable $k$ its MSE is less than the MSE of the OLS estimator
Subset selection is a discrete process, and hence it is uncertain which regressors remain, particularly with correlation among regressors
Definition
Tibshirani (1996) proposes the lasso (least absolute shrinkage and selection operator) estimator.
Standardize: $\sum_{i=1}^{N} x_{ij}/N = 0$, $\sum_{i=1}^{N} x_{ij}^2/N = 1$, $\sum_{i=1}^{N} y_i/N = 0$.
The lasso estimator $\hat\beta$ is the solution to

$\min_{\beta} \sum_{i=1}^{N}\Bigl(y_i - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^2 \quad \text{s.t.} \quad \sum_{j=1}^{p}|\beta_j| \leq t,$

where $t \geq 0$ is a tuning parameter; the constraint is not binding if $t \geq \sum_{j=1}^{p}|\hat\beta_j^o|$, where $\hat\beta^o$ is the OLS estimate.
Equivalently, we can write

$\hat\beta = \arg\min_{\beta}\Bigl\{\sum_{i=1}^{N}\Bigl(y_i - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^2 + \lambda\sum_{j=1}^{p}|\beta_j|\Bigr\}.$
No closed-form solution, but the penalized form is easy to compute numerically; a minimal sketch follows.
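A minimal sketch of the penalized form using scikit-learn on simulated data; note (an implementation detail, not from the slides) that sklearn's Lasso minimizes $(1/(2N))\,\mathrm{RSS} + \alpha\sum_j|\beta_j|$, so its $\alpha$ corresponds to $\lambda/(2N)$ in the notation above.

```python
# Hedged sketch: lasso fit on standardized, simulated data.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, p = 100, 10
X = rng.normal(size=(N, p))
beta_true = np.array([3.0, -2.0, 1.5] + [0.0] * (p - 3))
y = X @ beta_true + rng.normal(size=N)

# Standardize as on the slide: columns of X to mean 0 and (1/N)*sum x^2 = 1; y to mean 0
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = y - y.mean()

fit = Lasso(alpha=0.1).fit(X, y)
print(fit.coef_)  # several coefficients are exactly 0: selection and shrinkage at once
```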
Motivation
Not satisfied with the OLS estimates.
Prediction accuracy: OLS estimates often have low bias but large variance. Shrinking might help improve MSE.
Interpretation: When the number of predictors is large, interpretation may be difficult. Would like a smaller subset.
Standard alternative techniques are ridge regression and subset selection. But both have drawbacks.
Subset selection is a discrete process and hence extremely variable.
Ridge regression does not set any coefficients to 0.
Lasso attempts to combine the strengths of both.
Outline
Definition
Related methods
Intuition
Details:
Implementation
Tuning parameter
Asymptotics
Standard errors
Ridge regression
The ridge estimator $\hat\beta^r$ is defined as

$\hat\beta^r = \arg\min_{\beta} \sum_{i=1}^{N}\Bigl(y_i - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^2 \quad \text{s.t.} \quad \sum_{j=1}^{p}\beta_j^2 \leq t,$

or, equivalently,

$\hat\beta^r = \arg\min_{\beta}\Bigl\{\sum_{i=1}^{N}\Bigl(y_i - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^2 + \lambda\sum_{j=1}^{p}\beta_j^2\Bigr\}.$

Unlike the lasso estimator, the ridge estimator has an explicit solution (checked numerically below):

$\hat\beta^r = (X'X + \lambda I)^{-1}X'y.$
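A quick numerical check of the explicit solution against scikit-learn's Ridge, whose documented objective $\|y - X\beta\|^2 + \lambda\sum_j\beta_j^2$ matches the penalized form above; the simulated data are an illustrative assumption.

```python
# Hedged sketch: closed-form ridge solution vs. a library solver.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
N, p, lam = 200, 5, 3.0
X = rng.normal(size=(N, p))
y = X @ rng.normal(size=p) + rng.normal(size=N)

beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
beta_solver = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
print(np.allclose(beta_closed, beta_solver))  # True
```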
Least squares with constrained Lq-norm
Both the lasso and the ridge estimators belong to a general family of least squares estimators, called bridge estimators (Frank and Friedman, 1993):

$\hat\beta = \arg\min_{\beta} \sum_{i=1}^{N}\Bigl(y_i - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^2 \quad \text{s.t.} \quad \|\beta\|_q \leq t,$

where $0 < q < \infty$ and

$\|\beta\|_q = \sum_{j=1}^{p}|\beta_j|^q.$

Lasso is $q = 1$, and ridge is $q = 2$.
Subset selection
Retain only M of the p OLS estimates.
Many methods for doing this, e.g., best-subset selection, forward-stepwise selection, backward-stepwise selection, forward-stagewise regression, etc.
For illustration, just focus on best-subset selection.
Find the subset of size M that gives the smallest residual sum of squares (a brute-force sketch follows this list).
M can be chosen by using cross-validation to estimate prediction error, or by using something like the AIC criterion.
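For small $p$, best-subset selection can be carried out by exhaustive search. Below is a minimal sketch on simulated data (the data-generating process is an illustrative assumption); the search is combinatorial and becomes infeasible quickly as $p$ grows.

```python
# Hedged sketch: best-subset selection of size M by exhaustive search.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
N, p, M = 100, 6, 2
X = rng.normal(size=(N, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=N)

best_rss, best_subset = np.inf, None
for subset in combinations(range(p), M):
    Xs = X[:, subset]
    beta = np.linalg.lstsq(Xs, y, rcond=None)[0]
    rss = np.sum((y - Xs @ beta) ** 2)
    if rss < best_rss:
        best_rss, best_subset = rss, subset
print("best subset of size", M, ":", best_subset)  # expect (0, 3)
```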
Intuition: Orthonormal design
In the orthonormal design case ($X'X = I$), subset selection, lasso, and ridge all have closed-form solutions. Note that $\hat\beta^o = X'y$ is the OLS estimate.
Best subset selection of size M reduces to choosing the M largest OLS coefficients in absolute value and setting the rest to 0.
Ridge solutions have the form

$\hat\beta_j^r = \frac{1}{1 + \lambda}\hat\beta_j^o.$

Lasso solutions have the form

$\hat\beta_j = \mathrm{sign}(\hat\beta_j^o)\,(|\hat\beta_j^o| - \lambda)_+,$

where $x_+ = \max\{x, 0\}$ (verified numerically below).
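The soft-thresholding formula is easy to verify numerically. The sketch below builds an orthonormal design with a QR decomposition and compares the closed form with a numerical lasso solver; the conversion $\alpha = \lambda/N$ reflects scikit-learn's $(1/(2N))\,\mathrm{RSS}$ scaling and is an implementation assumption, not from the slides.

```python
# Hedged sketch: lasso soft-thresholding under an orthonormal design.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, p = 200, 5
X, _ = np.linalg.qr(rng.normal(size=(N, p)))     # columns orthonormal: X'X = I
beta_true = np.array([2.0, -1.0, 0.5, 0.0, 0.0])
y = X @ beta_true + 0.5 * rng.normal(size=N)

beta_ols = X.T @ y                                # OLS when X'X = I
lam = 0.3
soft = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

fit = Lasso(alpha=lam / N, fit_intercept=False, tol=1e-10).fit(X, y)
print(np.allclose(soft, fit.coef_, atol=1e-6))    # True (up to solver tolerance)
```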
Intuition: Orthonormal design
Figure: The three estimators as functions of $\hat\beta_j^o$ (Hastie, Tibshirani, and Friedman, Table 3.4): best subset keeps $\hat\beta_j^o$ when $|\hat\beta_j^o| \geq |\hat\beta_{(M)}|$ and sets it to 0 otherwise, ridge shrinks proportionally to $\hat\beta_j^o/(1 + \lambda)$, and the lasso soft-thresholds to $\mathrm{sign}(\hat\beta_j^o)(|\hat\beta_j^o| - \lambda)_+$; a 45° line shows the unrestricted estimate for reference.
Intuition: Kinks in constraint set
Back to the non-orthonormal case. Why does the lasso result in coefficients that are exactly 0?
Figure: Lasso constraint set vs. ridge constraint set (Hastie, Tibshirani, and Friedman, Figure 3.11). Shown are contours of the least squares error function (ellipses) together with the constraint regions $|\beta_1| + |\beta_2| \leq t$ (lasso, a diamond) and $\beta_1^2 + \beta_2^2 \leq t^2$ (ridge, a disk); each method finds the first point where the elliptical contours hit its constraint region, and the diamond's corners make solutions with some $\beta_j$ exactly zero likely.
Intuition: Constraint set in three dimensions
Figure: The lasso constraint set defines a rhomboid in higher dimensions.
Lq constraint sets
The value $q = 0$ corresponds to variable subset selection, as the penalty simply counts the number of nonzero parameters; $q = 1$ corresponds to the lasso, and $q = 2$ to ridge regression.
Thinking of $|\beta_j|^q$ as (proportional to) the negative log-prior density for $\beta_j$, these contours are also the equi-contours of the prior distribution of the parameters. For $q \leq 1$ the prior concentrates more mass in the coordinate directions; the $q = 1$ case corresponds to an independent double exponential (Laplace) prior with density $(1/2\tau)\exp(-|\beta|/\tau)$ and $\tau = 1/\lambda$.
The case $q = 1$ (lasso) is the smallest $q$ such that the constraint region is convex; for $q > 1$, $|\beta_j|^q$ is differentiable at 0 and loses the lasso's ability to set coefficients exactly to zero.
Figure: Contours of constant value of $\sum_j|\beta_j|^q$ for $q = 4, 2, 1, 0.5, 0.1$ (Hastie, Tibshirani, and Friedman, Figure 3.12).
Bayes estimators
Posterior distribution: $f(\beta|y, x) \propto f(y, x|\beta)\,\pi(\beta)$.
Consider the estimator obtained as the posterior mode, i.e., by maximizing the (log-)posterior density with respect to $\beta$.
Recall our estimators have the form

$\arg\min_{\beta}\Bigl\{\sum_{i=1}^{N}\Bigl(y_i - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^2 + \lambda\sum_{j=1}^{p}|\beta_j|^q\Bigr\}.$

Think of the first term as (the negative of) $\log f(y, x|\beta)$; it is the usual least squares term (e.g., in MLE) under normality.
Think of the second term as (the negative of) $\log \pi(\beta)$.
When $q = 2$, it is proportional to the normal log-density.
When $q = 1$, it is proportional to the double-exponential (Laplace) log-density.
When $q = 0$, it simply counts the number of nonzero parameters.
Thus subset selection, ridge, and lasso estimators can all be thought of as Bayes estimators with different priors (a short derivation sketch follows).
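To make the correspondence explicit, here is a short sketch of the posterior-mode calculation for $q = 1$, assuming Gaussian errors with known variance $\sigma^2$ and an i.i.d. prior $\pi$ on the coefficients (these assumptions are ours, for illustration):

```latex
% Log-posterior under Gaussian errors with known variance \sigma^2:
\log f(\beta \mid y, x)
  = -\frac{1}{2\sigma^2} \sum_{i=1}^{N} \Bigl( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^{2}
    + \sum_{j=1}^{p} \log \pi(\beta_j) + \text{const}.
% With the Laplace prior \pi(\beta_j) = (1/2\tau) \exp(-|\beta_j|/\tau),
% maximizing the log-posterior is equivalent to minimizing
%   \sum_{i} \bigl( y_i - \sum_{j} \beta_j x_{ij} \bigr)^2 + \lambda \sum_{j} |\beta_j|,
% with \lambda = 2\sigma^2/\tau: exactly the lasso problem.
```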
Implementation: Least Angle Regression
A very fast and efficient algorithm by Efron, Hastie, Johnstone, and Tibshirani (2004) requires the same order of computation as that of a single least squares fit using $p$ predictors.
Computes the entire path over $t \geq 0$. For ease of output they typically reparametrize: $s = t / \sum_{j=1}^{p}|\hat\beta_j^o| \in [0, 1]$.
Available as an R package (a scikit-learn sketch follows).
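The slides point to the R implementation; a comparable path computation is available in scikit-learn's lars_path, sketched here on simulated data (the data-generating process is an illustrative assumption).

```python
# Hedged sketch: computing the full lasso path with LARS.
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
N, p = 100, 8
X = rng.normal(size=(N, p))
y = X @ np.array([3.0, -2.0, 1.0, 0, 0, 0, 0, 0]) + rng.normal(size=N)

alphas, active, coefs = lars_path(X, y, method="lasso")
print("active set (in order of entry):", active)  # indices of the variables selected
print("coefficient path shape:", coefs.shape)     # (p, number of knots along the path)
```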
Implementation: Least Angle Regression
Figure: Typical lasso output from the LARS algorithm: standardized coefficients plotted against $s = |\beta|/\max|\beta| \in [0, 1]$, with each coefficient path becoming nonzero at the step where its variable enters the model.
Selection of tuning parameter: Cross-validation
Choose t to minimize (an estimate of) prediction error.
In regression models, we usually mean $PE = E(y - \hat y)^2$.
Estimate PE by K-fold cross-validation.
For instance, $K = N$ gives the usual leave-one-out estimator:

$CV(t) = \frac{1}{N}\sum_{i=1}^{N}\bigl(y_i - \hat y_i^{-i}\bigr)^2,$

where $\hat y_i^{-i}$ is the prediction for observation $i$ from the fit that omits observation $i$ (a sketch follows).
Efron et al. also give a $C_p$-type selection criterion.
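A minimal sketch of K-fold cross-validation using scikit-learn's LassoCV, which searches over the penalty $\lambda$ rather than the bound $t$ (the two parametrizations trace out the same set of solutions); the simulated data are an illustrative assumption.

```python
# Hedged sketch: choosing the lasso penalty by 5-fold cross-validation.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
N, p = 200, 10
X = rng.normal(size=(N, p))
y = X @ np.array([2.0, -1.0] + [0.0] * 8) + rng.normal(size=N)

fit = LassoCV(cv=5).fit(X, y)       # CV estimate of prediction error over a grid of alphas
print("selected alpha:", fit.alpha_)
print("coefficients:", fit.coef_)   # irrelevant predictors are typically zeroed out
```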
Asymptotic properties
Knight and Fu (2000) conduct an asymptotic analysis under the classical setting of fixed $p$ and $N \to \infty$. They assume a linear model

$y_i = x_i'\beta + \varepsilon_i,$

where $\varepsilon_1, \ldots, \varepsilon_N$ are i.i.d. with mean 0 and variance $\sigma^2$, together with the regularity conditions

$C_N = \frac{1}{N}\sum_{i=1}^{N} x_i x_i' \to C,$

where $C$ is a nonnegative definite matrix, and

$\frac{1}{N}\max_{1 \leq i \leq N} x_i' x_i \to 0.$
Asymptotic properties
The objective function associated with the bridge estimator $\hat\beta_N$ is

$\frac{1}{N}\sum_{i=1}^{N}(y_i - x_i'\beta)^2 + \frac{\lambda_N}{N}\sum_{j=1}^{p}|\beta_j|^\gamma.$

The limiting behaviour of $\hat\beta_N$ is derived from studying the asymptotic behaviour of the objective function. They show that
If $\lambda_N = o(N)$, then $\hat\beta_N \stackrel{p}{\to} \beta$ (illustrated in the simulation below).
For $\gamma \geq 1$, if $\lambda_N/\sqrt{N} \to \lambda_0 \geq 0$, then

$\sqrt{N}(\hat\beta_N - \beta) \stackrel{d}{\to} \arg\min V(u),$

where $V(u)$ is a long expression given in the paper.
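A small simulation consistent with the consistency result: taking $\lambda_N = \sqrt{N}$ (so $\lambda_N = o(N)$), the lasso estimate approaches the true $\beta$ as $N$ grows. The design and the rescaling $\alpha = \lambda_N/(2N)$, which maps the objective above into scikit-learn's $(1/(2N))\,\mathrm{RSS} + \alpha\|\beta\|_1$ form, are illustrative assumptions.

```python
# Hedged sketch: lasso consistency when lambda_N = o(N).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
beta = np.array([1.0, -1.0, 0.0])
for N in [100, 1000, 10000]:
    X = rng.normal(size=(N, 3))
    y = X @ beta + rng.normal(size=N)
    lam_N = np.sqrt(N)                                 # lambda_N / N -> 0
    fit = Lasso(alpha=lam_N / (2 * N), fit_intercept=False).fit(X, y)
    print(N, np.round(fit.coef_, 3))                   # approaches (1, -1, 0) as N grows
```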
Standard errors
Delicate issue!
Tibshirani's approximate closed form. Osborne, Presnell, and Turlach (2000) claim that this formula does not yield an appropriate estimate. Also, having an estimated variance of 0 for predictors with $\hat\beta_j = 0$ is inadequate.
Bootstrap. Doesn't really work well either. Furthermore, a theorem from Kyung, Gill, Ghosh, and Casella (2010): the bootstrap standard errors of $\hat\beta_j$ are inconsistent if $\beta_j = 0$.
By studying the dual of the lasso optimization problem, Osborne, Presnell, and Turlach (2000) derive an estimate that yields a positive standard error for all coefficients.
However, since the distribution of individual lasso estimates will typically have a condensation of probability at zero, summarizing uncertainty by standard errors may not be appropriate to begin with (illustrated in the sketch below).
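The condensation at zero is easy to see in a simulation. Below is a minimal sketch of the pairs bootstrap for lasso coefficients; the data-generating process and penalty level are illustrative assumptions.

```python
# Hedged sketch: bootstrapping lasso coefficients; replicates for a truly
# zero coefficient pile up exactly at 0, so a standard error is a poor summary.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, B = 200, 500
X = rng.normal(size=(N, 3))
y = X @ np.array([2.0, 0.0, 0.0]) + rng.normal(size=N)

boot = np.empty((B, 3))
for b in range(B):
    idx = rng.integers(0, N, size=N)  # resample (x_i, y_i) pairs with replacement
    boot[b] = Lasso(alpha=0.1).fit(X[idx], y[idx]).coef_

print("bootstrap SEs:", boot.std(axis=0))
print("share of replicates exactly 0:", (boot == 0.0).mean(axis=0))
```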
Conclusion
Sizable literature about the lasso.
Computation: Efron et al. (2004), Osborne et al. (2000).
Standard errors: Osborne et al. (2000), Fan and Li (2001), Zou (2006).
Model selection: Zhao and Yu (2006), Fan and Li (2001), Huang, Horowitz, and Ma (2008).
Generalizations: fused lasso (Tibshirani et al., 2005), grouped lasso (Yuan and Lin, 2006), elastic net (Zou and Hastie, 2005), adaptive lasso (Zou, 2006), Bayesian lasso (Park and Casella, 2008).