1
Building the Regression Model –I
Selection and Validation
KNN Ch. 9 (pp. 343-375)
2
The Model Building Process
Collect and prepare data
Reduction of explanatory variables for exploratory/ observational studies
Refine model and select best model
Validate model – if it passes the checks then adopt it
All four of the above have several intermediate steps. These are outlined in Fig. 9.1, page 344 of KNN
3
The Model Building Process
Data collection
Controlled Experiments (levels, treatments)
With supplemental variables (incorporate uncontrollable variables in regression model rather than in the experiment)
Confirmatory Observational Studies (hypothesis testing, primary variables and risk factors)
Exploratory Observational Studies (Measurement errors/problems, duplication of variables, spurious variables, sample size; are but some of the issues here)
4
The Model Building Process
Data Preparation
What are the standard techniques here? It's an easy guess: a rough-cut approach is to look at various plots and identify obvious problems such as outliers, spurious variables, etc.
Preliminary Model Investigation
Scatter Plots and Residual Plots (For what?)
Functional forms and transformations (of entire data or some explanatory variables or predicted variable?)
Interactions and ….. intuition
5
The Model Building Process
Reduction of Explanatory Variables
Generally an issue for Controlled Experiments with Supplemental Variables and for Exploratory Observational Studies
It is not difficult to guess that for Exploratory Observational Studies, this is more serious
Identification of good subsets of the explanatory variables and their functional forms and any interactions, is perhaps the most difficult problem in multiple regression analysis
Need to be careful of specification bias and latent explanatory variables.
6
The Model Building Process
Model Refinement and Selection
Diagnostics for candidate models
Lack-of-fit tests if repeat obs. available
“Best” model’s # of variables should be used as benchmark for investigating other models with similar number of variables
Model Validation
Robustness and Usability of regression coefficients
Usability of regression function. Does it all make sense ?
7
All Possible Regressions: Variable Reduction
Usually many explanatory variables (p-1) present at the outset
Select the best subset of these variables
"Best": the smallest subset of variables which provides an adequate prediction of Y.
Multicollinearity is usually a problem when all variables are in the model.
Variable selection may be based on the coefficient of determination Rp2 or on the SSEp statistic (equivalent procedures).
8
Rp2 is highest (and SSEp lowest) when all the variables are in the model.
One intends to find the point at which adding more variables causes a very small increase in Rp2 or a very small decrease in SSEp.
Given a value of p, we compute the maximum of Rp2 (or the minimum of SSEp) and then compare the several maxima (minima).
See the Surgical Unit Example on page 350 of KNN.
All Possible Regressions: Variable Reduction
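This all-subsets search is easy to sketch with numpy alone. The data and the helper name `fit_sse` below are made up for illustration, not taken from KNN's Surgical Unit example:

```python
# Sketch: SSE_p and R^2_p for every subset of three candidate predictors.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 30
X = rng.normal(size=(n, 3))                    # three candidate predictors
y = 2.0 + 6.5 * X[:, 0] - 0.2 * X[:, 2] + rng.normal(scale=0.5, size=n)

def fit_sse(cols):
    """SSE of the least-squares fit using the given predictor columns."""
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return float(resid @ resid)

ssto = float(((y - y.mean()) ** 2).sum())      # SSE of the intercept-only model
for k in range(4):                             # subset sizes 0..3
    for cols in itertools.combinations(range(3), k):
        sse = fit_sse(cols)
        print(cols, round(sse, 3), round(1 - sse / ssto, 4))
```

Within each subset size one keeps the smallest SSEp (largest Rp2), then compares those winners across sizes, as described above.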
9
A Simple Example

Regression Analysis
The regression equation is
Y = 0.236 + 9.09 X1 - 0.330 X2 - 0.203 X3

Predictor       Coef     StDev       T      P
Constant      0.2361    0.2545    0.93  0.355
X1             9.090     1.718    5.29  0.000
X2           -0.3303    0.2229   -1.48  0.141
X3          -0.20286   0.05894   -3.44  0.001

S = 1.802   R-Sq = 95.7%   R-Sq(adj) = 95.6%

Regression Analysis
The regression equation is
Y = 0.408 + 6.55 X1 - 0.173 X3

Predictor       Coef     StDev       T      P
Constant      0.4078    0.2276    1.79  0.075
X1            6.5506    0.1201   54.54  0.000
X3          -0.17253   0.05551   -3.11  0.002

S = 1.810   R-Sq = 95.6%   R-Sq(adj) = 95.5%

Regression Analysis
The regression equation is
Y = 0.014 + 6.50 X1

Predictor       Coef     StDev       T      P
Constant      0.0144    0.1949    0.07  0.941
X1            6.4957    0.1225   53.05  0.000

S = 1.866   R-Sq = 95.3%   R-Sq(adj) = 95.3%
10
Rp2 does not take into account the number of parameters (p) and never decreases as p increases.
This is a mathematical property, but it may not make sense practically.
However, useless explanatory variables can actually worsen the predictive power of the model. How?
The adjusted coefficient of multiple determination always accounts for the increase in p:
Ra2 = 1 − [SSEp / (n − p)] / [SSTO / (n − 1)]
The Ra2 and MSEp criteria are equivalent.
When can MSEp actually increase with p?
All Possible Regressions: Variable Reduction
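The penalty built into Ra2 can be sketched numerically. The data, seed, and helper name `r2_and_adj` below are made up for illustration; the formula is the one above:

```python
# Sketch: adjusted R-squared penalizes extra parameters, while plain
# R-squared never decreases when a predictor is added. Data are made up.
import numpy as np

rng = np.random.default_rng(1)
n = 25
x1 = rng.normal(size=n)
junk = rng.normal(size=n)                      # predictor unrelated to y
y = 1.0 + 3.0 * x1 + rng.normal(scale=1.0, size=n)

def r2_and_adj(*cols):
    """Return (R^2, adjusted R^2) for a model with the given predictors."""
    Z = np.column_stack([np.ones(n), *cols])
    p = Z.shape[1]                             # number of parameters
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    sse = float(((y - Z @ beta) ** 2).sum())
    ssto = float(((y - y.mean()) ** 2).sum())
    r2 = 1 - sse / ssto
    ra2 = 1 - (n - 1) / (n - p) * sse / ssto   # adjusted R^2
    return r2, ra2

r2_small, ra2_small = r2_and_adj(x1)
r2_big, ra2_big = r2_and_adj(x1, junk)
print(r2_small, r2_big, ra2_small, ra2_big)
```

Adding the junk column can only raise R^2, but the (n − 1)/(n − p) factor pushes Ra2 back down.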
11
A Simple Example

Regression Analysis
The regression equation is
Y = 21.7 + 12.8 X1 - 0.88 X2 - 5.93 X3

Predictor     Coef    StDev      T      P
Constant     21.69    14.77   1.47  0.381
X1          12.763    9.225   1.38  0.398
X2          -0.877    1.099  -0.80  0.571
X3          -5.927    2.033  -2.92  0.210

S = 2.878   R-Sq = 99.3%   R-Sq(adj) = 97.1%

Regression Analysis
The regression equation is
Y = 27.8 + 5.45 X1 - 6.37 X3

Predictor     Coef    StDev      T      P
Constant     27.76    11.45   2.43  0.136
X1          5.4534   0.9666   5.64  0.030
X3          -6.370    1.769  -3.60  0.069

S = 2.603   R-Sq = 98.8%   R-Sq(adj) = 97.7%

Regression Analysis
The regression equation is
Y = - 10.4 + 8.05 X1

Predictor      Coef    StDev      T      P
Constant    -10.363    9.738  -1.06  0.365
X1            8.049    1.439   5.59  0.011

S = 5.816   R-Sq = 91.2%   R-Sq(adj) = 88.3%
Interesting
12
The Cp criterion is concerned with the total MSE of the n fitted values.
The total error for any fitted value is a sum of a bias component and a random error component:
Ŷi − μi is the total error, where μi is the "true" mean response of Y when X = Xi.
The bias is E{Ŷi} − μi and the random error is Ŷi − E{Ŷi}.
Then the total mean squared error is shown to be:
Σ (i = 1 to n) E{(Ŷi − μi)²} = Σ (i = 1 to n) ( [E{Ŷi} − μi]² + σ²{Ŷi} )
When the above is divided by the variance of the actual Y values, i.e., by σ², we get the criterion Γp.
The estimator Cp of Γp is what we shall use.
All Possible Regressions: Variable Reduction
13
Choose a model with small Cp.
Cp should be as close as possible to p. When all variables are included then obviously Cp = p (= P).
If the model has very little bias, then E{Ŷi} ≈ μi and E(Cp) ≈ p.
When we plot a line through the origin at 45° and plot the (p, Cp) points: for models with little bias, the points will fall almost on the straight line; for models with substantial bias, the points will fall much above the line; and if the points fall below the line, then such models have no bias, just some random sampling error.
The estimator of Γp is:
Cp = SSEp / MSE(X1, …, XP−1) − (n − 2p)
All Possible Regressions: Variable Reduction
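The Cp formula above is easy to evaluate over all subsets. The data and the helper name `sse` below are made up for illustration; note that the full model lands exactly at Cp = P, as the slide says:

```python
# Sketch: Cp for every subset of three candidate predictors.
# Cp = SSE_p / MSE_full - (n - 2p). Data are made up.
import itertools
import numpy as np

rng = np.random.default_rng(2)
n = 40
X = rng.normal(size=(n, 3))
y = 4.0 + 2.0 * X[:, 0] + rng.normal(size=n)   # only X1 truly matters

def sse(cols):
    """SSE of the least-squares fit using the given predictor columns."""
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return float(((y - Z @ beta) ** 2).sum())

P = 4                                          # parameters in the full model
mse_full = sse((0, 1, 2)) / (n - P)            # MSE(X1, X2, X3)
for k in range(1, 4):
    for cols in itertools.combinations(range(3), k):
        p = k + 1                              # intercept + k slopes
        cp = sse(cols) / mse_full - (n - 2 * p)
        print(cols, round(cp, 2))
```

Subsets containing the truly relevant predictor should give Cp near p; badly biased subsets sit far above it.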
14
The PRESSp criterion:
PRESSp = Σ (i = 1 to n) (Yi − Ŷi(i))²
where Ŷi(i) is the predicted value of Yi when the ith observation is not in the dataset.
Choose models with small values of PRESSp.
It may seem that one will have to run "n" separate regressions in order to calculate PRESSp. Not so, as we will see later.
All Possible Regressions: Variable Reduction
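The shortcut hinted at above is the hat-matrix identity Yi − Ŷi(i) = ei / (1 − hii), which gives PRESSp from a single fit. The data below are made up; the brute-force loop is included only to confirm the identity:

```python
# Sketch: PRESS from one regression via leverages, checked against
# n leave-one-out regressions. Data are made up.
import numpy as np

rng = np.random.default_rng(3)
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

# One regression: PRESS from ordinary residuals and hat-matrix diagonals.
H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix
e = y - H @ y                                  # ordinary residuals
press_fast = float(((e / (1 - np.diag(H))) ** 2).sum())

# Brute force: n separate leave-one-out regressions, for comparison.
press_slow = 0.0
for i in range(n):
    mask = np.arange(n) != i
    beta, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    press_slow += float((y[i] - X[i] @ beta) ** 2)

print(press_fast, press_slow)
```

The two numbers agree to machine precision, so the "n regressions" cost never has to be paid.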
15
Best Subsets Algorithm:
Best subsets (a limited number) are identified according to pre-specified criteria.
Requires much less computational effort than evaluating all possible subsets.
Provides "good" subsets along with the best, which is quite useful.
When the pool of X variables is large, this algorithm can run out of steam. What then? We will see in the ensuing discussion.
Best Subsets
16
Best Subsets Regression (Note: “s” is the square root of MSEp)
Response variable is Y
Adj.
Vars R-Sq R-Sq C-p s X1 X2 X3
1 95.3 95.3 11.9 1.8656 X
1 94.7 94.7 30.8 1.9801 X
2 95.6 95.5 4.2 1.8101 X X
2 95.3 95.2 13.8 1.8718 X X
3 95.7 95.6 4.0 1.8023 X X X
Response variable is Y
Adj.
Vars R-Sq R-Sq C-p s X1 X2 X3 X4
1 95.3 95.3 13.4 1.8656 X
1 94.7 94.7 32.4 1.9801 X
2 95.6 95.5 5.6 1.8101 X X
2 95.5 95.4 9.8 1.8374 X X
3 95.7 95.6 3.9 1.7927 X X X
3 95.7 95.6 5.3 1.8023 X X X
4 95.7 95.6 5.0 1.7936 X X X X
A Simple Example
17
Forward Stepwise Regression
An iterative procedure. Based on the partial F* (or t*) statistic, one decides whether to add a variable or not. One variable at a time is considered. Before we see the actual algorithm, here are some levers:
Minimum acceptable F to enter (FE)
Minimum acceptable F to remove (FR)
Minimum acceptable Tolerance (Tmin)
Maximum number of iterations (N)
And here is the general form of the test statistic:
F*k = MSR(Xk | other Xs already in the model) / MSE(Xk, other Xs already in the model) = [ bk / s{bk} ]²
18
Forward Stepwise Regression
The procedure:
1. Run a simple linear regression of each candidate X variable with the Y variable.
2. If none of the individual F values is larger than the cut-off FE value, then stop. Else, enter the variable with the largest F.
3. Now run the regression of each remaining variable with Y, given that the variable entered in step 2 is already in the model.
4. Repeat step 2. If a candidate is found, then check for tolerance. If the tolerance (1 − Rk2) is not larger than the cut-off tolerance value Tmin, then choose a different candidate. If none is available, then terminate. Else, add the candidate variable.
5. Calculate the partial F for the variable entered in step 2, given that the variable entered in step 4 is already in the model. If this F is less than FR, then remove the variable entered in step 2; else keep it. If the number of iterations equals N, terminate; if not, proceed to step 6.
6. Check from the results of step 1 which is the next candidate variable to enter. If the number of iterations is exceeded, then terminate.
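The steps above can be sketched in a simplified form. This is not KNN's exact algorithm: the tolerance check and iteration cap are omitted, and the data, thresholds, and helper names (`sse`, `partial_f`) are made up for illustration:

```python
# Simplified forward stepwise sketch: add the variable with the largest
# partial F if it exceeds FE; then re-test included variables against FR.
import numpy as np

rng = np.random.default_rng(4)
n = 50
X = rng.normal(size=(n, 4))
y = 1.0 + 5.0 * X[:, 0] + 2.0 * X[:, 2] + rng.normal(size=n)

def sse(cols):
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in sorted(cols)])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return float(((y - Z @ beta) ** 2).sum())

def partial_f(var, current):
    """F* for adding `var` to the model already holding `current`."""
    with_var = current | {var}
    p = len(with_var) + 1                      # parameters incl. intercept
    extra_ss = sse(current) - sse(with_var)    # SS explained by `var` alone
    return extra_ss / (sse(with_var) / (n - p))

FE, FR = 4.0, 4.0                              # entry / removal cut-offs
model = set()
while True:
    candidates = [j for j in range(4) if j not in model]
    if not candidates:
        break
    best = max(candidates, key=lambda j: partial_f(j, model))
    if partial_f(best, model) < FE:
        break
    model.add(best)
    # Re-test each earlier variable; drop any whose partial F fell below FR.
    for j in sorted(model - {best}):
        if partial_f(j, model - {j}) < FR:
            model.remove(j)
print(sorted(model))
```

With strong signals on X1 and X3 of the simulated data (columns 0 and 2), both enter and survive the removal checks.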
19
Other Stepwise Regression Procedures
Backward Stepwise Regression
Exact opposite of the forward procedure. Sometimes preferred to forward stepwise. Think about how this procedure would work, and why (or under which conditions) you would use it instead of forward stepwise.
Forward Selection
Similar to forward stepwise, except that the variable-dropping part is not present.
Backward Elimination
Similar to backward stepwise, except that the variable-adding part is not present.
20
An Example
Let us go through the example (Fig. 9.7) on page 366 of KNN.
21
Some other Selection Criteria
Akaike Information Criterion (AIC)
– Imposes a penalty for adding regressors
– AIC = e^(2p/n) · SSEp/n, where e^(2p/n) is the penalty factor
– Harsher penalty than Ra2 (How?)
– Model with lowest AIC is preferred
– AIC is used for in-sample and out-of-sample forecasting performance measurement
– Useful for nested and non-nested models and for determining lag length in autoregressive models (Ch. 12)
22
Some other Selection Criteria
Schwarz Information Criterion (SIC)
– SIC = n^(p/n) · SSEp/n
– Similar to AIC
– Imposes a stricter penalty than AIC
– Has similar advantages as AIC
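Both criteria, in the multiplicative forms given on these slides, are one-liners. The numbers below are made up; the comparison only illustrates that the SIC penalty grows faster than the AIC penalty once n exceeds e²:

```python
# Sketch of the slide's AIC and SIC forms; SSE and n values are made up.
import numpy as np

def aic(sse, n, p):
    """AIC in the multiplicative form on the slide: e^(2p/n) * SSE/n."""
    return np.exp(2 * p / n) * sse / n

def sic(sse, n, p):
    """SIC in the matching form: n^(p/n) * SSE/n."""
    return n ** (p / n) * sse / n

n = 50
print(aic(100.0, n, 3), aic(100.0, n, 4))      # penalty grows with p
print(sic(100.0, n, 3), sic(100.0, n, 4))      # grows faster for n > e^2
```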
23
Model Validation
Checking the prediction ability of the model.
Methods for model validation:
1. Collection of new data:
- Select a new sample, with the same variables, of size n*;
- Compute the mean squared prediction error:
MSPR = Σ (i = 1 to n*) (Yi − Ŷi)² / n*
2. Comparison of results with theoretical expectations;
3. Data splitting into two data sets: model building and validation.
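A minimal sketch of method 3 (data splitting) combined with the MSPR formula above. The data, split sizes, and coefficients are made up; the point is that the model is fit on the building set only and MSPR is computed on the held-out cases:

```python
# Sketch: split data into build/validation sets, fit on the build set,
# and compute MSPR on the validation set. Data are made up.
import numpy as np

rng = np.random.default_rng(5)
n, n_star = 60, 20                             # build / validation sizes
X = rng.normal(size=(n + n_star, 2))
y = 0.5 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n + n_star)

Z = np.column_stack([np.ones(n + n_star), X])
beta, *_ = np.linalg.lstsq(Z[:n], y[:n], rcond=None)   # fit on build set only

y_hat = Z[n:] @ beta                           # predict the held-out cases
mspr = float(((y[n:] - y_hat) ** 2).sum()) / n_star
mse_build = float(((y[:n] - Z[:n] @ beta) ** 2).sum()) / (n - 3)
print(round(mspr, 3), round(mse_build, 3))
```

If MSPR is close to the building-set MSE, the model's predictive ability holds up on new data.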