![Page 1: SADC Course in Statistics Choosing the best model (Session 08)](https://reader035.vdocument.in/reader035/viewer/2022062619/5515fa9655034694308b48fd/html5/thumbnails/1.jpg)
SADC Course in Statistics
Choosing the “best” model
(Session 08)
Learning Objectives
At the end of this session, you will be able to
• use a simple descriptive approach to select the most appropriate subset of explanatory variables
• apply methods of variable selection (based on statistical tests) in a meaningful way to get the “best” model
• appreciate the effect on t-probabilities when x’s are added or dropped from a model
• understand the dangers of using automatic selection procedures
Example of choosing “best” set of x’s
Consider data (fictitious) from a retrospective study of patients surviving less than 4 months after being diagnosed as having acute leukaemia.
Objective: To identify factors affecting survival time.
Variables were:
y = survival time (days) after diagnosis
x1 = no. of chemotherapy sessions
x2 = total volume of blood transfused
x3 = no. of days of hospital care
x4 = age of patient (years).
![Page 4: SADC Course in Statistics Choosing the best model (Session 08)](https://reader035.vdocument.in/reader035/viewer/2022062619/5515fa9655034694308b48fd/html5/thumbnails/4.jpg)
Start with a matrix plot
Summary statistics for all regressions
How many possible regression models exist?
Example with x1 and x3 to show summaries:
---------+---------------------------------------
  Source |       SS   df       MS      F   Prob>F
---------+---------------------------------------
   Model | 1488.691    2  744.346   6.07   0.0188
Residual | 1227.072   10  122.707
---------+---------------------------------------
   Total | 2715.763   12  226.314
---------+---------------------------------------
No. of parameters fitted (p) = 3 (the constant plus the coefficients of x1 and x3)
R2p (R2 for the model with p parameters) = 1488.69 / 2715.76 = 0.5482
Adjusted R2p = 1 – 122.71 / 226.31 = 0.4578
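The arithmetic behind these summaries can be checked with a few lines of Python. This is only a sketch: the sums of squares and degrees of freedom are copied from the analysis of variance table above, and the function names are illustrative choices, not part of any package.

```python
# Sums of squares and degrees of freedom for the model with x1 and x3,
# copied from the analysis of variance table above.
ss_model, df_model = 1488.691, 2
ss_resid, df_resid = 1227.072, 10
ss_total, df_total = 2715.763, 12


def r_squared(ss_model, ss_total):
    """Proportion of the total variation in y explained by the model."""
    return ss_model / ss_total


def adjusted_r_squared(ss_resid, df_resid, ss_total, df_total):
    """1 minus the ratio of the residual mean square to the total mean square."""
    return 1 - (ss_resid / df_resid) / (ss_total / df_total)


r2 = r_squared(ss_model, ss_total)
adj_r2 = adjusted_r_squared(ss_resid, df_resid, ss_total, df_total)
f_ratio = (ss_model / df_model) / (ss_resid / df_resid)

print(f"R2 = {r2:.4f}, adjusted R2 = {adj_r2:.4f}, F = {f_ratio:.2f}")
```

Unlike R2, the adjusted R2 penalises extra parameters through the residual degrees of freedom, so it can fall when an unhelpful x is added.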
Descriptive approach (all regressions)
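| No. of x’s | p = No. of parameters | Terms in model | R2 | Adj. R2 | Res. M.S. |
|---|---|---|---|---|---|
| None | 1 | None | 0 | 0 | 226.3 |
| 1 | 2 | x1 | 0.534 | 0.492 | 115.1 |
| 1 | 2 | x2 | 0.666 | 0.636 | 82.4 |
| 1 | 2 | x3 | 0.286 | 0.221 | 176.3 |
| 1 | 2 | x4 | 0.675 | 0.645 | 80.4 |
| 2 | 3 | x1, x2 | 0.979 | 0.974 | 5.8 |
| 2 | 3 | x1, x3 | 0.548 | 0.458 | 122.7 |
| 2 | 3 | x1, x4 | 0.972 | 0.967 | 7.5 |
| 2 | 3 | x2, x3 | 0.847 | 0.816 | 41.5 |
| 2 | 3 | x2, x4 | 0.680 | 0.616 | 86.9 |
| 2 | 3 | x3, x4 | 0.935 | 0.922 | 17.6 |
| 3 | 4 | x1, x2, x3 | 0.982 | 0.976 | 5.4 |
| 3 | 4 | x1, x2, x4 | 0.982 | 0.976 | 5.3 |
| 3 | 4 | x1, x3, x4 | 0.981 | 0.975 | 5.7 |
| 3 | 4 | x2, x3, x4 | 0.973 | 0.964 | 8.2 |
| 4 | 5 | x1, x2, x3, x4 | 0.982 | 0.974 | 6.0 |

Here p counts the constant as well as the x’s, as in the example with x1 and x3 (p = 3).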
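With four candidate x’s there are 2^4 = 16 possible models, counting the model with no x’s, which is why the table has 16 rows. The sketch below enumerates the subsets and picks the model with the smallest residual mean square for each number of x’s. The figures are copied from the table above; the variable names are illustrative only.

```python
from itertools import combinations

variables = ["x1", "x2", "x3", "x4"]

# Every subset of the four x's, including the empty subset.
all_subsets = [s for k in range(len(variables) + 1)
               for s in combinations(variables, k)]
print(len(all_subsets), "possible models")

# (terms, R2, adjusted R2, residual mean square), copied from the table above.
results = [
    ((), 0.000, 0.000, 226.3),
    (("x1",), 0.534, 0.492, 115.1),
    (("x2",), 0.666, 0.636, 82.4),
    (("x3",), 0.286, 0.221, 176.3),
    (("x4",), 0.675, 0.645, 80.4),
    (("x1", "x2"), 0.979, 0.974, 5.8),
    (("x1", "x3"), 0.548, 0.458, 122.7),
    (("x1", "x4"), 0.972, 0.967, 7.5),
    (("x2", "x3"), 0.847, 0.816, 41.5),
    (("x2", "x4"), 0.680, 0.616, 86.9),
    (("x3", "x4"), 0.935, 0.922, 17.6),
    (("x1", "x2", "x3"), 0.982, 0.976, 5.4),
    (("x1", "x2", "x4"), 0.982, 0.976, 5.3),
    (("x1", "x3", "x4"), 0.981, 0.975, 5.7),
    (("x2", "x3", "x4"), 0.973, 0.964, 8.2),
    (("x1", "x2", "x3", "x4"), 0.982, 0.974, 6.0),
]

# Best model of each size: the one with the smallest residual mean square.
best_by_size = {}
for row in results:
    size = len(row[0])
    if size not in best_by_size or row[3] < best_by_size[size][3]:
        best_by_size[size] = row

for size in sorted(best_by_size):
    terms, r2, adj_r2, res_ms = best_by_size[size]
    label = ", ".join(terms) if terms else "None"
    print(f"{size} x's: {label:<16} R2={r2:.3f}  Adj.R2={adj_r2:.3f}  Res.M.S.={res_ms}")

# Smallest residual mean square over all 16 models.
overall_best = min(results, key=lambda row: row[3])
```

Ranking by residual mean square gives the same ordering as ranking by adjusted R2, because adjusted R2 is 1 minus the residual mean square divided by the (fixed) total mean square. Note how little is gained beyond two x’s: the best two-variable model already has a residual mean square of 5.8.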
![Page 7: SADC Course in Statistics Choosing the best model (Session 08)](https://reader035.vdocument.in/reader035/viewer/2022062619/5515fa9655034694308b48fd/html5/thumbnails/7.jpg)
A descriptive approach… continued
Plot R2 versus no. of parameters (p) in model
Which model would you select on the basis of these results?
![Page 8: SADC Course in Statistics Choosing the best model (Session 08)](https://reader035.vdocument.in/reader035/viewer/2022062619/5515fa9655034694308b48fd/html5/thumbnails/8.jpg)
A descriptive approach… continued
Alternatively, plot the residual mean square versus the number of parameters (p). A small residual mean square is good!
Which model would you select on the basis of the residual mean square?
An inferential approach…
Use a sequential procedure to select variables that contribute most, and significantly, to the regression model.
Three popular methods exist:
• Forward selection
• Backward elimination
• Stepwise regression
Forward selection …
Select the “best” single variable: x4, which has the highest R2 (0.675) among the one-variable models in the all-regressions table (slide 6)
Ask, “Is it contributing significantly?” Answer: Yes (see below)
-----------------------------------------
     y |    Coef.  Std. Err.      t  P>|t|
-------+---------------------------------
    x4 |  -.73816     .1546   -4.77  0.001
const. |   117.57    5.2622   22.34  0.000
-----------------------------------------
Now consider 2-variable models with x4.
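As a check on the output for the model with x4 alone, each t-value is the coefficient divided by its standard error, referred to a t-distribution with n – p degrees of freedom. The calculation below assumes n = 13 observations, inferred from the total of 12 degrees of freedom in the earlier analysis of variance table, with p = 2 parameters (x4 and the constant):

$$
t_{x_4} = \frac{-0.73816}{0.1546} \approx -4.77, \qquad
t_{\text{const}} = \frac{117.57}{5.2622} \approx 22.34, \qquad
\text{d.f.} = n - p = 13 - 2 = 11.
$$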
Two-variable models with x4
-----------------------------------------
     y |    Coef.   Std.Err.       t  P>|t|
-------+---------------------------------
    x4 |  -.61395    .04864  -12.62  0.000
    x1 |   1.4400    .13842   10.40  0.000
const. |   103.10    2.1240   48.54  0.000
-----------------------------------------
    x4 |  -.45694    .69595   -0.66  0.526
    x2 |   .31090    .74861    0.42  0.687
const. |   94.160    56.627    1.66  0.127
-----------------------------------------
    x4 |  -.72460    .07233  -10.02  0.000
    x3 |  -1.1999    .18902   -6.35  0.000
const. |   131.28    3.2748   40.09  0.000
-----------------------------------------
Three-variable models with x4, x1
-----------------------------------------
     y |    Coef.   Std.Err.       t  P>|t|
-------+---------------------------------
    x4 |  -.23654    .17329   -1.37  0.205
    x1 |   1.4519    .11700   12.41  0.000
    x2 |   .41611    .18561    2.24  0.052
const. |   71.648    14.142    5.07  0.001
-----------------------------------------
    x4 |  -.64280    .04454  -14.43  0.000
    x1 |   1.0519    .22368    4.70  0.001
    x3 |  -.41004    .19923   -2.06  0.070
const. |   111.68    4.5625   24.48  0.000
-----------------------------------------
Model with x1, x2 and x4 would be selected, despite x4 now being non-significant!
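The forward selection steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the procedure used by any particular package: the data are invented (the leukaemia data are not listed in these slides), the function names are hypothetical, and a fixed F-to-enter threshold of 4.0 stands in for the p-value cut-off that software would normally apply. For a single added variable, the F-to-enter value equals the square of that variable's t-value.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def residual_ss(columns, y):
    """Residual sum of squares from regressing y on a constant plus columns."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(design[0])
    xtx = [[sum(row[j] * row[k] for row in design) for k in range(p)]
           for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(design, y)) for j in range(p)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2
               for row, yi in zip(design, y))


def forward_selection(x_columns, y, f_to_enter=4.0):
    """Add one x at a time while the best candidate's F-to-enter is large enough."""
    selected, remaining = [], list(range(len(x_columns)))
    current_rss = residual_ss([], y)
    while remaining:
        # Fit each candidate alongside the x's already chosen; keep the best.
        trials = [(residual_ss([x_columns[j] for j in selected + [c]], y), c)
                  for c in remaining]
        new_rss, best = min(trials)
        df_resid = len(y) - (len(selected) + 2)  # constant + chosen x's + candidate
        f_stat = (current_rss - new_rss) / (new_rss / df_resid)
        if f_stat < f_to_enter:
            break
        selected.append(best)
        remaining.remove(best)
        current_rss = new_rss
    return selected


# Invented data: y depends on columns 0 and 2 only, plus a small disturbance.
x_columns = [
    [-1, -1, -1, -1, 1, 1, 1, 1],
    [-1, -1, 1, 1, -1, -1, 1, 1],
    [-1, 1, -1, 1, -1, 1, -1, 1],
]
noise = [-0.1, 0.1, 0.1, -0.1, 0.1, -0.1, -0.1, 0.1]
y = [10 + 3 * a - 2 * c + e
     for a, c, e in zip(x_columns[0], x_columns[2], noise)]

selected = forward_selection(x_columns, y)
print("Columns selected, in order of entry:", selected)
```

Once a variable has entered, this procedure never reconsiders it. That is exactly how x4 stays in the selected model above despite its p-value of 0.205, and it is the weakness that stepwise regression tries to address.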
Backward elimination gives x1, x2
---------------------------------------
   y |    Coef.   Std.Err.      t  P>|t|
-----+---------------------------------
  x1 |   1.5511    .74477    2.08  0.071
  x2 |   .51017     .7238    0.70  0.501
  x3 |   .10191     .7547    0.14  0.896
  x4 |  -.14406     .7091   -0.20  0.844
---------------------------------------
  x1 |   1.4519    .11700   12.41  0.000
  x2 |   .41611    .18561    2.24  0.052
  x4 |  -.23654    .17329   -1.37  0.205
---------------------------------------
  x1 |   1.4683    .12130   12.10  0.000
  x2 |   .66225    .04585   14.44  0.000
---------------------------------------
Stepwise selection procedure…
This is similar to forward selection, but at each stage of the process, all x’s in the model are re-assessed to check if those that entered the model at an earlier stage still remain “important”.
Note: Software packages allow automatic use of one of these with pre-specified p-values for selection and deletion of variables. Usually available only with quantitative x’s.
Discussion… in small groups
• Look back at the results. What do you observe with the forward and backward procedures? Do they give the same results?
• Did the selection using the forward procedure seem sensible, given that for x4 the p-value = 0.205?
• Can you work out what model would result from a stepwise selection procedure?
• Is it a good idea to use such automatic selection procedures available in software packages? If not, why not?
Discussion continued…
Suppose a medical researcher told you that a model without x2 was not meaningful. How would you proceed with your model selection?
What other latent (lurking) variables, measurable or non-measurable, might affect y?
What further steps would you undertake before accepting the final model?
Practical work follows to ensure learning objectives are achieved…