Multiple Regression 2: Model Selection

Page 1:

Model Selection 1

Multiple Regression 2

Page 2: Model Selection1. 1. Regress Y on each k potential X variables. 2. Determine the best single variable model. 3. Regress Y on the best variable and each

Model Selection 2

1. Regress Y on each of the k potential X variables.
2. Determine the best single-variable model.
3. Regress Y on the best variable and each of the remaining k − 1 variables.
4. Determine the best model that includes the previous best variable and one new best variable.
5. If the adjusted-R2 declines, the standard error of the regression increases, the t-statistic of the best variable is insignificant, or the coefficients are theoretically inconsistent, STOP and use the previous best model.

Repeat steps 2-4 until stopped or until the all-variable model is reached (a code sketch follows below).

A Forward Selection Heuristic
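Below is a minimal sketch of this loop, assuming the candidate predictors sit in a pandas DataFrame X and the response in a Series y, and using statsmodels OLS for each regression; the name forward_select is illustrative, and for brevity the step-5 stopping checks are reduced to the adjusted-R2 rule.

```python
import statsmodels.api as sm

def forward_select(X, y):
    """Greedy forward selection using adjusted R-squared as the criterion."""
    selected, remaining = [], list(X.columns)
    best_adj_r2 = -float("inf")
    while remaining:
        # Fit a model adding each remaining candidate to the current best set.
        trials = []
        for var in remaining:
            model = sm.OLS(y, sm.add_constant(X[selected + [var]])).fit()
            trials.append((model.rsquared_adj, var))
        adj_r2, best_var = max(trials)
        # STOP when the best addition no longer improves adjusted R-squared.
        if adj_r2 <= best_adj_r2:
            break
        selected.append(best_var)
        remaining.remove(best_var)
        best_adj_r2 = adj_r2
    return selected
```

The remaining step-5 checks (standard error, t-statistics, coefficient signs) can be added inside the loop using the fitted model's bse, tvalues, and params attributes.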

Page 3:

Model Selection 3

The idea behind Forward Selection

- If the adjusted-R2 declines when an additional variable is added, then the added value of the variable does not outweigh its modeling cost.

- If the standard error increases, then the additional variable has not improved estimation.

- If the t-statistic of one of the variables is insignificant, then there may be too many variables.

- If the coefficients are inconsistent with theory, that may indicate multicollinearity effects.

Page 4:

Model Selection 4

1. Regress Y on all k potential X variables.
2. Use t-tests to determine which X has the least significance.
3. If this X does not meet some minimum level of significance, remove it from the model.
4. Regress Y on the remaining set of k − 1 X variables.

Repeat steps 2-4 until all remaining Xs meet the minimum significance level (a sketch follows below).

The Backward Elimination Heuristic
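A matching sketch of this procedure under the same assumptions (pandas DataFrame X, Series y, statsmodels OLS), with the threshold defaulting to the .10 "significance level to stay" mentioned on a later slide:

```python
import statsmodels.api as sm

def backward_eliminate(X, y, alpha_to_stay=0.10):
    """Drop the least significant X one at a time until every
    remaining X meets the significance-level-to-stay threshold."""
    remaining = list(X.columns)
    while remaining:
        model = sm.OLS(y, sm.add_constant(X[remaining])).fit()
        pvalues = model.pvalues.drop("const")  # ignore the intercept
        worst = pvalues.idxmax()               # least significant X
        if pvalues[worst] <= alpha_to_stay:
            break                              # all Xs meet the threshold
        remaining.remove(worst)
    return remaining
```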

Page 5:

Model Selection 5

Use Tests One at a Time

The tests should be used one at a time.

• The t-test on X1 can tell you to drop X1 and keep X2-X6.

• The t-test on X2 can tell you to drop X2 and keep X1 and X3-X6.

• Together, they do not necessarily tell you to drop both and keep only X3-X6, because each test is conditional on the other variable remaining in the model.

Page 6:

Model Selection 6

The idea behind Backward Elimination

If an X's t-statistic is not significant, we can remove that X and simplify the model while still maintaining the model's high R-square.

Typical stopping rule

Continue until all Xs meet some target "significance level to stay" (often .10 or .15, to keep more Xs).

Page 7:

Model Selection 7

The forward and backward heuristics may or may not result in the same end model. Generally, however, the resulting models should be quite similar.

Backward elimination requires that you start with a model that includes all possible explanatory variables, which is not always practical: Excel, for example, will only run a regression with up to 16 variables.

Concordance

Page 8:

Model Selection 8

When using many variables in a regression, it may be the case that some of the explanatory variables are highly correlated with other explanatory variables. In the extreme, when two of the variables are exactly linearly related, the multiple regression fails because the coefficients cannot be estimated reliably.

Simple indicators are: a failure of the F-test; an increase in the standard error; an insignificant t-statistic for a previously significant variable; theoretically inconsistent coefficients.

Recall also that when using a categorical variable, one of the categories must be "left out"; otherwise its dummy variables are perfectly collinear with the intercept.

Multi-collinearity

Page 9:

Model Selection 9

The variance inflation factors (VIFs) should be calculated after reaching a candidate stopping point in a multiple regression selection method.

A VIF is calculated for each independent variable by regressing that independent variable against the other independent variables; with the R2 from that auxiliary regression, VIF = 1 / (1 − R2).

A simple rule of thumb is that the VIFs should be less than 4.

VIF as a measure of multi-collinearity
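A minimal sketch of this calculation, assuming the selected predictors are the columns of a pandas DataFrame X and using statsmodels for the auxiliary regressions; statsmodels also provides variance_inflation_factor in statsmodels.stats.outliers_influence, which computes the same quantity.

```python
import pandas as pd
import statsmodels.api as sm

def vifs(X):
    """VIF for each column of X: regress that column on the other
    columns and apply VIF = 1 / (1 - R2)."""
    out = {}
    for col in X.columns:
        others = sm.add_constant(X.drop(columns=col))
        r2 = sm.OLS(X[col], others).fit().rsquared
        out[col] = 1.0 / (1.0 - r2)
    return pd.Series(out)

# Rule of thumb from this slide: investigate any VIF of 4 or more.
```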

Page 10:

Model Selection 10

The forward and backward heuristics rely on adding or deleting one variable at a time.

It is however possible to evaluate the statistical significance of including a set of variables by constructing the partial F-statistic.

Subsets of Variables

Page 11:

Multiple regression 5 -- The partial F test 11

The “full” and “reduced” models

Suppose there are r variables in the group.

Define the full model to be the one with all Xs (all k predictors).

Define the reduced model to be the one with the group left out (it has k − r variables).

Page 12:

Multiple regression 5 -- The partial F test 12

Partial F Statistic

Look at the increase in the sum of squared errors, SSE_Reduced − SSE_Full, to see how much of the explained variation is lost.

Divide this by r, the number of variables in the group.

Put this in ratio to the MSE of the full model.

This is called the partial F statistic.

Page 13:

Multiple regression 5 -- The partial F test 13

Partial F Statistic

This has an F distribution with r numerator and (n − k − 1) denominator degrees of freedom:

F_Partial = ((SSE_R − SSE_F) / r) / MSE_F
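A sketch of the computation, assuming full and reduced are fitted statsmodels OLS results for the two models; statsmodels' compare_f_test method on the full-model results performs the same test directly.

```python
from scipy import stats

def partial_f(full, reduced):
    """Partial F statistic from two fitted statsmodels OLS results."""
    r = int(full.df_model - reduced.df_model)  # variables in the group
    f = ((reduced.ssr - full.ssr) / r) / full.mse_resid
    p = stats.f.sf(f, r, full.df_resid)        # df: r and (n - k - 1)
    return f, p
```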

Page 14:

Multiple regression 5 -- The partial F test 14

Two regression runs: Full and Reduced (regression output for each shown on the slide).

Page 15:

Multiple regression 5 -- The partial F test 15

The Partial F for 4 variables

H0: the four variable coefficients are insignificant
H1: at least one variable coefficient in the group is useful

F = ((889.042 − 765.939) / 4) / 9.456 = 30.776 / 9.456 = 3.255

The correct F distribution to test against has 4 numerator and 81 denominator degrees of freedom. The critical value for an F(4, 60) distribution is 2.53 at a significance level of .05 and 3.65 at a significance level of .01, so the group is jointly significant at the .05 level but not at the .01 level.
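Rather than interpolating from an F(4, 60) table, the exact F(4, 81) critical values and the p-value of the observed statistic can be checked with scipy:

```python
from scipy import stats

f_stat, dfn, dfd = 3.255, 4, 81
crit_05 = stats.f.ppf(0.95, dfn, dfd)   # exact .05 critical value for F(4, 81)
crit_01 = stats.f.ppf(0.99, dfn, dfd)   # exact .01 critical value
p_value = stats.f.sf(f_stat, dfn, dfd)  # upper-tail p-value of the observed F
print(crit_05, crit_01, p_value)
```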

Page 16:

Multiple Regression 4: Indicator Variables

16

Extensions

• Two lines, different slopes

• More than two categories

• Multicategory, multislope

Page 17:

Multiple regression 5 -- The partial F test 17

Fit two lines with different slopes

Recall that using the Executive variable alone created a salary model with two lines having different intercepts.

Adding the variable Alpha Experience resulted in a model also having two lines with different intercepts.

But what if there is an interaction effect between Executive status and Alpha experience?

Page 18:

Multiple regression 5 -- The partial F test 18

Create two new variables.

The Executive status variable has two categories: 0 and 1.

Create two variables from Alpha experience so that:
◦ the first (call it Alpha0) retains Alpha's value when Executive = 0, and equals 0 otherwise;
◦ the second (call it Alpha1) retains Alpha's value when Executive = 1, and equals 0 otherwise.

Using the three variables (Executive status and the two Alpha variables) results in a model with two lines having different intercepts and different slopes, capturing a simple interaction effect between the variables. A sketch of the construction follows below.
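A minimal pandas/statsmodels sketch of the construction; the data and the column names (Salary, Executive, Alpha, Alpha0, Alpha1) are hypothetical.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical data: Executive is a 0/1 dummy, Alpha is years of Alpha experience.
df = pd.DataFrame({
    "Salary":    [52, 60, 71, 80, 94, 105],
    "Executive": [0, 0, 0, 1, 1, 1],
    "Alpha":     [1, 3, 5, 1, 3, 5],
})

# Split Alpha into one slope variable per Executive category.
df["Alpha0"] = df["Alpha"] * (df["Executive"] == 0)  # Alpha when Executive = 0, else 0
df["Alpha1"] = df["Alpha"] * (df["Executive"] == 1)  # Alpha when Executive = 1, else 0

X = sm.add_constant(df[["Executive", "Alpha0", "Alpha1"]])
fit = sm.OLS(df["Salary"], X).fit()
# Fitted lines: intercept b0 with the Alpha0 slope for non-executives,
# intercept b0 + b_Executive with the Alpha1 slope for executives.
```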

Page 19:

Model Selection 19

Executive Status variable

Page 20:

Model Selection 20

Executive Status and Alpha Experience

Page 21:

Model Selection 21

Executive Status and Alpha Experience with Interaction