Model Selection and Estimation in Regression with Grouped Variables

Upload: halden
Post on 21-Mar-2016

Page 1: Model Selection and Estimation in Regression with Grouped Variables

Model Selection and Estimation in Regression with Grouped Variables

Page 2: Model Selection and Estimation in Regression with Grouped Variables

Remember…
• Consider fitting a simple linear model with arbitrary explanatory variables X1, X2, X3 and continuous Y.

• If we want to determine whether X1, X2, X3 are predictive of Y, we need to take into account the groups of variables derived from X1, X2, X3.

• 2nd example: ANOVA (the dummy variables of a factor form the groups).
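As a toy illustration of the ANOVA case, here is a minimal Python sketch (the helper name and factor levels are invented for the example) of how one categorical factor expands into a group of dummy columns that should enter or leave the model together:

```python
# Hypothetical sketch: one categorical factor -> one group of dummy columns.

def dummy_group(values, levels):
    """Encode a categorical variable as dummy columns (last level = baseline).

    All columns produced here belong to a single 'group': a grouped
    selection method keeps or drops them as a unit.
    """
    kept = levels[:-1]  # drop the final level as the reference category
    return [[1 if v == lev else 0 for lev in kept] for v in values]

# A 3-level factor contributes a group of 2 dummy columns:
rows = dummy_group(["a", "c", "b", "a"], levels=["a", "b", "c"])
print(rows)  # [[1, 0], [0, 0], [0, 1], [1, 0]]
```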

Page 3: Model Selection and Estimation in Regression with Grouped Variables

Remember…
• Group LARS proceeds in two steps:

1) A solution path indexed by a tuning parameter λ is built. (The solution path is just a "path" of how the estimated coefficients move through space as a function of λ.)

2) The final model is selected on the solution path by some "minimal risk" criterion.

Page 4: Model Selection and Estimation in Regression with Grouped Variables

Notation
• Model form:

  Y = Σ_{j=1}^{J} X_j β_j + ε

• Assume we have J factors/groups of variables.
• Y is (n x 1).
• ε ~ MVN(0, σ² I).
• p_j is the number of variables in group j.
• X_j is the (n x p_j) design matrix for group j.
• β_j is the coefficient vector for group j.
• Each X_j is centered/orthonormalized and Y is centered.
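The last bullet can be sketched as follows. This is an assumed preprocessing step using a thin QR decomposition, one common way to get orthonormal columns (not necessarily the authors' exact procedure):

```python
import numpy as np

# Sketch (assumed preprocessing): center each group's design matrix X_j
# and orthonormalize its columns so that X_j' X_j = I_{p_j}.

def center_orthonormalize(Xj):
    Xj = Xj - Xj.mean(axis=0)    # center each column
    Q, _ = np.linalg.qr(Xj)      # Q spans the same space, orthonormal columns
    return Q

rng = np.random.default_rng(0)
Xj = rng.normal(size=(50, 3))    # one group with p_j = 3 variables
Q = center_orthonormalize(Xj)
print(np.allclose(Q.T @ Q, np.eye(3)))  # True: columns are orthonormal
```

Because Q's columns are linear combinations of the centered columns, they remain centered as well.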

Page 5: Model Selection and Estimation in Regression with Grouped Variables

Remember…
Group LARS Solution Path Algorithm (Refresher):

1. Compute the current 'most correlated set' (A) by adding the factor that maximizes the "correlation" between the current residual and the factor (accounting for factor size).

2. Move the coefficient vector (β) in the direction of the projection of the current residual onto the factors in (A).

3. Continue down this path until a new factor (outside (A)) has the same correlation as the factors in (A). Add that new factor to (A).

4. Repeat steps 2-3 until no more factors can be added to (A).

• (Note: the solution path is piecewise linear, so the algorithm is computationally efficient!)
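The selection rule in step 1 can be sketched as follows. This is an illustrative fragment only (not the full piecewise-linear path), assuming the "correlation" of factor j with the current residual r is ||X_j' r||^2 / p_j, i.e. a squared norm scaled by group size; the data and names are invented:

```python
import numpy as np

# Sketch of the group selection rule: score each factor by the squared
# norm of X_j' r, scaled by the group size p_j, and pick the maximizer.

def most_correlated_group(X_groups, r):
    scores = [float(np.sum((Xj.T @ r) ** 2)) / Xj.shape[1] for Xj in X_groups]
    return int(np.argmax(scores)), scores

rng = np.random.default_rng(1)
X1 = rng.normal(size=(100, 2))   # group 1: p_1 = 2, carries the signal
X2 = rng.normal(size=(100, 3))   # group 2: p_2 = 3, pure noise
y = X1 @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=100)

j, scores = most_correlated_group([X1, X2], y)
print(j)  # group 1 carries the signal, so it is selected first
```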

Page 6: Model Selection and Estimation in Regression with Grouped Variables

Cp Criterion (How to Select a Final Model)

• In Gaussian regression problems, an unbiased estimate of the "true risk" is

  C_p = ||Y - μ̂||² / σ² - n + 2·df,

where μ̂ denotes the fitted values.

• When the full design matrix X is orthonormal, it can be shown that an unbiased estimate for "df" is:

  Σ_j I(||β̂_j|| > 0) + Σ_j (||β̂_j|| / ||β̂_j^LS||)·(p_j - 1)

• Note: in the orthonormal case, the Group LARS solution is a group-wise shrunken version of the least-squares fits X_j' Y.

Page 7: Model Selection and Estimation in Regression with Grouped Variables

Degrees-of-Freedom Calculation (Intuition)

• When the full design matrix X is orthonormal, it can be shown that an unbiased estimate for "df" is:

  Σ_j I(||β̂_j|| > 0) + Σ_j (||β̂_j|| / ||β̂_j^LS||)·(p_j - 1)

• Note: in the orthonormal case, the Group LARS solution is a group-wise shrunken version of the least-squares fits X_j' Y.

• The general formula for "df" takes the same form on an arbitrary design.
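The Cp-style criterion can be sketched numerically as follows; `cp_criterion` is a hypothetical helper, and df is supplied as a precomputed number rather than derived from the degrees-of-freedom estimate:

```python
# Sketch of the Cp-style selection criterion, of the form
#   C_p = RSS / sigma^2 - n + 2 * df,
# minimized along the solution path to pick the final model.

def cp_criterion(residuals, sigma2, df):
    n = len(residuals)
    rss = sum(e * e for e in residuals)
    return rss / sigma2 - n + 2 * df

# A fit with smaller residuals can still lose if it spends too many df:
print(cp_criterion([1.0] * 10, sigma2=1.0, df=2))  # 10/1 - 10 + 2*2 = 4.0
print(cp_criterion([0.5] * 10, sigma2=1.0, df=6))  # 2.5/1 - 10 + 2*6 = 4.5
```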

Page 8: Model Selection and Estimation in Regression with Grouped Variables

Real Dataset Example
• The famous birthweight dataset from Hosmer & Lemeshow.

• Y = baby birthweight; 2 continuous predictors (age and weight of the mother) and 6 categorical predictors.

• For the continuous predictors, 3rd-order polynomials are used as the "factors".

• For the categorical predictors, dummy variables are used, excluding the final level.

• 75%/25% train/test split.

• Methods compared: Group LARS and backward stepwise (variable-wise LARS isn't applicable here).
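The factor construction and split described above can be sketched as follows; the helper names, seed, and example ages are invented for illustration:

```python
import random

# Sketch: each continuous predictor contributes a 3rd-order polynomial
# group, and the rows get a 75%/25% train/test split.

def poly_factor(x, degree=3):
    """One continuous predictor -> a group of [x, x^2, x^3] columns."""
    return [[v ** d for d in range(1, degree + 1)] for v in x]

def train_test_split(rows, train_frac=0.75, seed=0):
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    cut = int(train_frac * len(rows))
    return [rows[i] for i in idx[:cut]], [rows[i] for i in idx[cut:]]

ages = [19, 33, 25, 20]
print(poly_factor(ages)[0])  # [19, 361, 6859]
```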

Page 9: Model Selection and Estimation in Regression with Grouped Variables

Real Dataset Example

[Figure: Cp along the Group LARS solution path; the final model is chosen at the minimal Cp.]

Page 10: Model Selection and Estimation in Regression with Grouped Variables

Real Dataset Example

• Factors selected:

Group LARS: all factors except Number of Physician Visits during the First Trimester.

Backward stepwise: all factors except Number of Physician Visits during the First Trimester and Mother's Weight.

Page 11: Model Selection and Estimation in Regression with Grouped Variables

Real Dataset Example

Test Set Prediction MSE:
  Group LARS            463047
  Backward Stepwise     506706
  Overall Test Set MSE  533035

Page 12: Model Selection and Estimation in Regression with Grouped Variables

Simulation Example #1
• 17 random variables Z1, Z2, …, Z16, W were independently drawn from a Normal(0,1).

• Xi = (Zi + W) / sqrt(2)

• Y = X3^3 + X3^2 + X3 + (1/3)·X6^3 - X6^2 + (2/3)·X6 + ε

• ε ~ N(0, 2^2)

• Each simulation has 100 observations; 200 simulations were run.

• Methods compared: Group LARS, LARS, least squares, backward stepwise.

• All 3rd-order main effects are considered.
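The data-generating process above can be sketched as follows (the function name and seed are arbitrary):

```python
import numpy as np

# Sketch of Simulation Example #1: Z1..Z16 and W are iid N(0,1),
# Xi = (Zi + W)/sqrt(2) (so the Xi are correlated through W), and
# Y depends on polynomial terms of X3 and X6 plus N(0, 2^2) noise.

def simulate_example1(n=100, seed=0):
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(n, 16))
    W = rng.normal(size=(n, 1))
    X = (Z + W) / np.sqrt(2)
    x3, x6 = X[:, 2], X[:, 5]        # 0-based columns for X3 and X6
    eps = rng.normal(scale=2.0, size=n)
    Y = x3**3 + x3**2 + x3 + (1/3)*x6**3 - x6**2 + (2/3)*x6 + eps
    return X, Y

X, Y = simulate_example1()
print(X.shape, Y.shape)  # (100, 16) (100,)
```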

Page 13: Model Selection and Estimation in Regression with Grouped Variables

Simulation Example #1

                               Group LARS   LARS    OLS    Stepwise
Mean Test Set Prediction MSE      5.32      5.73   10.94     7.45
Mean # of Factors Present         7.425     9.435  16        6.565

Page 14: Model Selection and Estimation in Regression with Grouped Variables

Simulation Example #2
• 20 random variables X1, X2, …, X20 were generated as in Example #1.

• X11, X12, …, X20 are trichotomized as 0, 1, or 2 if they are smaller than the 33rd percentile of a Normal(0,1), larger than the 66th percentile, or in between.

• Y = X3^3 + X3^2 + X3 + (1/3)·X6^3 - X6^2 + (2/3)·X6 + 2·I(X11 = 0) + I(X11 = 1) + ε

• ε ~ N(0, 2^2)

• Each simulation has 100 observations; 200 simulations were run.

• Methods compared: Group LARS, LARS, least squares, backward stepwise.

• All 3rd-order main effects/categorical factors are considered.
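The trichotomization step can be sketched as follows; the cut points are hard-coded approximations of the N(0,1) percentiles (scipy's `norm.ppf` would give them exactly):

```python
import numpy as np

# Sketch of the trichotomization in Simulation Example #2: code a value
# as 0 if below the 33rd percentile of N(0,1), 1 if above the 66th,
# and 2 in between (matching the order stated on the slide).

LO, HI = -0.4307, 0.4307  # approx. 33rd and 66th percentiles of N(0,1)

def trichotomize(x):
    """0 below the 33rd percentile, 1 above the 66th, 2 in between."""
    return np.where(x < LO, 0, np.where(x > HI, 1, 2))

print(trichotomize(np.array([-1.0, 0.0, 1.0])).tolist())  # [0, 2, 1]
```

The three resulting levels would then be dummy-encoded so that each trichotomized variable forms one factor/group.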

Page 15: Model Selection and Estimation in Regression with Grouped Variables

Simulation Example #2

                               Group LARS   LARS    OLS    Stepwise
Mean Test Set Prediction MSE      5.43      5.98    9.88     7.55
Mean # of Factors Present         9.61      9.53   20        8.62

Page 16: Model Selection and Estimation in Regression with Grouped Variables

Conclusion
• Group LARS provides an improvement over traditional backward stepwise selection + OLS, but it still over-selects factors.

• In the simulations, stepwise selection tends to under-select factors relative to Group LARS, and it performs more poorly.

• Simulation #1 suggests that LARS over-selects factors because it enters individual variables into the model rather than full factors.

• Group LARS is also computationally efficient, thanks to its piecewise-linear solution path algorithm.

• ||X_j' r||² / p_j is the formula for the "correlation" between a factor j and the current residual r. It may select a factor even when only a couple of its derived inputs are predictive and the rest are redundant.

Page 17: Model Selection and Estimation in Regression with Grouped Variables

EL FIN