Dundee Epidemiology and Biostatistics Unit
Correlation and Regression
Peter T. Donnan
Professor of Epidemiology and Biostatistics
Objectives of Workshop
• Scatterplot
• Correlation
• Correlation to regression
• Simple linear regression
• Parameterisation of the model
• Assumptions
Scatterplot
• The existence of a statistical association between two variables is most apparent in the appearance of a diagram called a scatterplot.

Scatterplot of two continuous variables
[Scatterplot: discharge rate (age/sex adjusted) against deprivation (Jarman index)]
Matrix of scatterplots of pairs of continuous variables
[Scatterplot matrix: DISRATE, DISTDGH, DEPREV, LISTPART]
Correlation
Correlation is a measure of association between two continuous factors
Correlations lie between –1 and +1
A correlation coefficient close to zero indicates weak or no correlation
Correlation
If one factor increases with the other – POSITIVE correlation
e.g. cancer risk & tobacco exposure
If one factor decreases while the other increases – NEGATIVE correlation
e.g. cholesterol and haemorrhagic stroke risk
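The coefficient itself is easy to compute outside SPSS; a minimal Python sketch (scipy assumed, data invented for illustration):

```python
# Pearson correlation in Python (illustrative only; the course uses SPSS).
import numpy as np
from scipy.stats import pearsonr

tobacco = np.array([0, 5, 10, 15, 20, 30, 40])          # hypothetical exposure
risk = np.array([1.0, 1.3, 1.9, 2.2, 3.1, 3.8, 5.0])    # hypothetical cancer risk

r, p = pearsonr(tobacco, risk)   # r lies between -1 and +1; p tests H0: r = 0
print(f"r = {r:.3f}, p = {p:.4f}")   # positive r: risk rises with exposure
```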
Pearson Correlation

Correlations (Pearson correlation, with 2-tailed significance in brackets; N = 27 for every pair)

            DISRATE          DEPREV           DISTDGH           LISTPART
DISRATE     1.000            .462* (.015)     -.301 (.127)      .072 (.721)
DEPREV      .462* (.015)     1.000            -.470* (.013)     .326 (.097)
DISTDGH     -.301 (.127)     -.470* (.013)    1.000             -.708** (.000)
LISTPART    .072 (.721)      .326 (.097)      -.708** (.000)    1.000

*. Correlation is significant at the 0.05 level (2-tailed).
**. Correlation is significant at the 0.01 level (2-tailed).
Warnings about correlation
Correlation does NOT measure causality, only association
High correlation of increased benefit from drug and the star sign Gemini!
Increased admissions for pneumonia correlated with high tides!
Beware non-normally distributed data
Try to normalise data using transformations
Correlation to Regression Model
Need to more formally assess the relationship between a pair of continuous variables
Fit a regression line – the Model
Usually assume the line is linear but may also fit polynomial curves
Fitted linear regression line and 95% CI for mean
R² × 100 is the % variation explained by the line (21%)
[Scatterplot of DISRATE against DEPREV with fitted linear regression line and 95% CI; Rsq = 0.2134]
How is relationship formulated?
Simplest equation is:

yᵢ = a + b·xᵢ + eᵢ

y is the outcome; a is the intercept;
b is the slope related to x, the explanatory variable; and
e is the error term or random 'noise'
How is relationship formulated?

yᵢ = a + b·xᵢ + eᵢ

We have a set of values for y, the outcome, and a corresponding set of values for x, the explanatory variable
Need to estimate values of a, the intercept, and b, the slope
e is the residual and is assumed to have a Normal distribution
Alternative Terminology

y                      x
Outcome                Explanatory
Dependent variable     Independent variable
Predicted              Predictor
How is the fitted line obtained?
Use the method of least squares (LS)
Seek to minimise the squared vertical differences between each point and the fitted line
Results in parameter estimates or regression coefficients of slope (b) and intercept (a)
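These least-squares estimates have a simple closed form; a short numpy sketch with invented age/LDL values shows the calculation the software performs:

```python
# Least-squares slope and intercept from first principles (data invented).
import numpy as np

x = np.array([45.0, 50, 55, 60, 65, 70, 75])              # e.g. age
y = np.array([1.70, 1.66, 1.60, 1.55, 1.52, 1.45, 1.40])  # e.g. min LDL (made up)

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
residuals = y - (a + b * x)       # the e terms; LS minimises sum(residuals**2)
print(f"intercept a = {a:.3f}, slope b = {b:.4f}")
```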
Method of least squares
Consider the regression of minimum LDL cholesterol achieved on age
• Select Regression → Linear….
• Dependent (y) – Min LDL achieved
• Independent (x) – Age_Base
N.B. 0.008 may look very small but represents:
the DECREASE in LDL achieved for each increase of one unit of age, i.e. ONE year
Output from SPSS linear regression

Coefficients(a)

Model                B       Std. Error   Beta    t        Sig.
1  (Constant)        2.024   .105                 19.340   .000
   Age at baseline   -.008   .002         -.121   -4.546   .000

a. Dependent Variable: Min LDL achieved
H₀: slope b = 0
Test t = slope/SE = -0.008/0.002 = -4.546 (from the unrounded B and SE) with p < 0.001, so statistically significant
Predicted LDL = 2.024 - 0.008 × Age
Output from SPSS linear regression

Coefficients(a)

Model                B       Std. Error   Beta    t        Sig.
1  (Constant)        2.024   .105                 19.340   .000
   Age at baseline   -.008   .002         -.121   -4.546   .000

a. Dependent Variable: Min LDL achieved
Predicted LDL achieved = 2.024 - 0.008 × Age
So for a man aged 65 the predicted LDL achieved = 2.024 - 0.008 × 65 = 1.504

Prediction Equation from linear regression
Age Predicted Min LDL
45 1.664
55 1.584
65 1.504
75 1.424
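The table follows directly from the fitted equation; for example:

```python
# Reproducing the prediction table from Predicted LDL = 2.024 - 0.008 x Age.
def predicted_min_ldl(age):
    return 2.024 - 0.008 * age

for age in (45, 55, 65, 75):
    print(age, round(predicted_min_ldl(age), 3))
# 45 -> 1.664, 55 -> 1.584, 65 -> 1.504, 75 -> 1.424, as in the table
```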
Assumptions of Regression
1. Relationship is linear
2. Outcome variable, and hence residuals or error terms, are approx. Normally distributed
Testing regression coefficients
H₀: slope b = 0
Test statistic, t = slope/SE
If |t| > 1.96 then the slope is statistically significant
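The same test, sketched from the reported slope and standard error (scipy assumed; the normal approximation stands in for the exact t distribution used by SPSS):

```python
# Testing H0: b = 0 from the reported coefficient and standard error.
from scipy import stats

b, se = -0.008, 0.002
t = b / se                      # -4.0 from the rounded values; SPSS reports
                                # -4.546 because it uses unrounded B and SE
p = 2 * stats.norm.sf(abs(t))   # two-sided p-value, normal approximation
print(f"t = {t:.2f}, p = {p:.2g}")
```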
Parameterisation of model
What does the regression coefficient mean?
Continuous:
B represents the change in outcome (y) for each 1 unit change in the x variable
e.g. Blood pressure = a + 2.3 × Age
≡ Each increase of 1 year increases BP by 2.3 mmHg
Parameterisation of model
What does the regression coefficient mean?
Binary:
B represents the change in outcome (y) for each 1 unit change in the x variable (x is binary)
e.g. Blood pressure = a + 5.1 × Gender
≡ Men have higher BP by 5.1 mmHg compared to women (M = 1; F = 0)
Parameterisation of model
What does the regression coefficient mean?
Categories:
If ordinal, same as continuous: B represents the change in outcome (y) for each 1 unit change in the x variable (x is ordinal)
e.g. Blood pressure = a + 1.8 × Deprivation
≡ A 1 category increase in deprivation increases BP by 1.8 mmHg (Carstairs social deprivation scale 1 to 7)
Parameterisation of model
What does the regression coefficient mean?
Categories:
If > 2 categories, create k − 1 dummy variables (k categories)
e.g. Smoking: 1 = current, 2 = ex, 3 = non-smoker
requires 2 dummy variables in the model
Parameterisation of model
What does the regression coefficient mean?
Categories:

          Smoke1   Smoke2
Current   1        0
Ex        0        1
Non       0        0

Blood pressure = a + 10 × Smoke1 + 2 × Smoke2
≡ Current smoking increases BP by 10 mmHg and ex-smoking increases BP by 2 mmHg compared with non-smokers
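Such dummy coding is mechanical; a small pandas sketch (invented data) reproduces the table above:

```python
# Creating k-1 dummy variables for a 3-level smoking factor (data invented;
# pandas assumed). Non-smoker is the reference category.
import pandas as pd

df = pd.DataFrame({"smoking": ["current", "ex", "non", "non", "current"]})
dummies = pd.get_dummies(df["smoking"]).astype(int)  # columns: current, ex, non
df["smoke1"] = dummies["current"]   # 1 for current smokers, else 0
df["smoke2"] = dummies["ex"]        # 1 for ex-smokers, else 0
print(df)                           # non-smokers are 0 on both dummies
```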
Entering Multidimensional Space: Multiple Regression
Peter T. Donnan
Professor of Epidemiology and Biostatistics
Statistics for Health Research

Objectives of session
• Recognise the need for multiple regression
• Understand methods of selecting variables
• Understand strengths and weaknesses of selection methods
• Carry out multiple regression in SPSS and interpret the output
Why do we need multiple regression?
Research is not as simple as the effect of one variable on one outcome,
especially with observational data
Need to assess many factors simultaneously; more realistic models
Consider fitted line of y = a + b₁x₁ + b₂x₂
[3-D plot: Dependent (y) plotted against Explanatory (x₁) and Explanatory (x₂)]

3-dimensional scatterplot from SPSS of Min LDL in relation to baseline LDL and age
When to use multiple regression modelling (1)
Assess the relationship between two variables while adjusting or allowing for another variable
Sometimes the second variable is considered a 'nuisance' factor
Example: glucose uptake data allowing for different cell types
When to use multiple regression modelling (2)
In an RCT, whenever there is imbalance between arms of the trial at baseline in characteristics of subjects
e.g. survival in colorectal cancer on two different randomised therapies adjusted for age, gender, co-morbidity
When to use multiple regression modelling (2)
A special case of this is when adjusting for baseline level of the primary outcome in an RCT
Baseline level added as a factor in the regression model
This will be covered in the Trials part of the course
When to use multiple regression modelling (3)
With observational data, in order to produce a prognostic equation for future prediction of risk of mortality
e.g. predicting future risk of CHD used 10-year data from the Framingham cohort
When to use multiple regression modelling (4)
With observational data, in order to adjust for possible confounders
e.g. survival in colorectal cancer in those with hypertension adjusted for age, gender, social deprivation and co-morbidity
Definition of Confounding
A confounder is a factor which is related to both the variable of interest (explanatory) and the outcome, but is not an intermediary in a causal pathway
Example of Confounding
[Diagram: Smoking related to both Deprivation (exposure) and Lung Cancer (outcome)]
But, also worth adjusting for factors only related to outcome
[Diagram: Exercise related to Lung Cancer but not to Deprivation]
Not worth adjusting for an intermediate factor in a causal pathway
[Diagram: Exercise → Blood viscosity → Stroke]
In a causal pathway each factor is merely a marker of the other factors, i.e. correlated – collinearity
SPSS: Add both baseline LDL and age in the independent box in linear regression
Output from SPSS linear regression on Age at baseline

Coefficients(a)

Model                B       Std. Error   Beta    t        Sig.   95% CI for B      Tolerance   VIF
1  (Constant)        2.024   .105                 19.340   .000   1.819 to 2.229
   Age at baseline   -.008   .002         -.121   -4.546   .000   -.011 to -.004    1.000       1.000

a. Dependent Variable: Min LDL achieved
Output from SPSS linear regression on Baseline LDL

Coefficients(a)

Model             B      Std. Error   Beta    t        Sig.   95% CI for B
1  (Constant)     .668   .066                 10.091   .000   .538 to .798
   Baseline LDL   .257   .018         .351    13.950   .000   .221 to .293

a. Dependent Variable: Min LDL achieved
Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .360a   .130       .129                .6753538

a. Predictors: (Constant), Age at baseline, Baseline LDL
Coefficients(a)

Model                B       Std. Error   Beta    t        Sig.   95% CI for B
1  (Constant)        1.003   .124                 8.086    .000   .760 to 1.246
   Baseline LDL      .250    .019         .342    13.516   .000   .214 to .286
   Age at baseline   -.005   .002         -.081   -3.187   .001   -.008 to -.002

a. Dependent Variable: Min LDL achieved
Output: Multiple regression
R² now improved to 13%
Both variables still significant INDEPENDENTLY of each other
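Outside SPSS the same two-predictor model can be fitted in one line; a sketch with statsmodels, where the DataFrame df and the column names ldl_min, ldl_base and age are assumptions for illustration, not the course dataset's actual names:

```python
# Two-predictor linear model analogous to the SPSS fit above.
import statsmodels.formula.api as smf

model = smf.ols("ldl_min ~ ldl_base + age", data=df).fit()
print(model.summary())   # B, SE, t, Sig. and R-squared, as in the SPSS tables
print(model.rsquared)    # about 0.13 in the lecture's example
```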
How do you select which variables to enter the model?
• Usually consider: what hypotheses are you testing?
• If main 'exposure' variable, enter first and assess confounders one at a time
• For derivation of a CPR (clinical prediction rule) you want powerful predictors
• Also clinically important factors, e.g. cholesterol in CHD prediction
• Significance is important but
• It is acceptable to have an 'important' variable without statistical significance
How do you decide what variables to enter in the model?
Correlations? With great difficulty!

3-dimensional scatterplot from SPSS of Time from Surgery in relation to Duke's staging and age
Approaches to model building
1. Let scientific or clinical factors guide selection
2. Use automatic selection algorithms
3. A mixture of the above
1) Let Science or Clinical factors guide selection
Baseline LDL cholesterol is an important factor determining LDL outcome, so enter first
Next allow for age and gender
Add adherence as important?
Add BMI and smoking?
1) Let Science or Clinical factors guide selection
Results in model of:
1. Baseline LDL
2. Age and gender
3. Adherence
4. BMI and smoking
Is this a 'good' model?
1) Let Science or Clinical factors guide selection: Final Model
Note three variables entered but not statistically significant
1) Let Science or Clinical factors guide selection
Is this the 'best' model? Should I leave out the non-significant factors (Model 2)?

Model   Adj R²   F from ANOVA   No. of parameters p
1       0.137    37.48          7
2       0.134    72.021         4

Adj R² lower, F has increased and the number of parameters is smaller in the 2nd model. Is this better?
Kullback-Leibler Information
Kullback and Leibler (1951) quantified the meaning of 'information' – related to Fisher's 'sufficient statistics'
Basically we have reality f, and a model g to approximate f
So K-L information is I(f,g)
[Diagram: curves f and g]
Kullback-Leibler Information
We want to minimise I(f,g) to obtain the best model over other models
I(f,g) is the information lost or 'distance' between reality and a model, so we need to minimise:

I(f,g) = ∫ f(x) log( f(x) / g(x) ) dx
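For discrete distributions the integral becomes a sum, which scipy computes directly; a toy illustration:

```python
# Toy numerical K-L information for discrete f ('reality') and g (a model);
# scipy's entropy(pk, qk) computes sum(f * log(f / g)).
import numpy as np
from scipy.stats import entropy

f = np.array([0.5, 0.3, 0.2])
g = np.array([0.4, 0.4, 0.2])
print(entropy(f, g))   # I(f,g) >= 0, and 0 only when g matches f exactly
```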
Akaike's Information Criterion
It turns out that the function I(f,g) is related to a very simple measure of goodness-of-fit:
Akaike's Information Criterion or AIC
Selection Criteria
• With a large number of factors, type 1 error is large; likely to have a model with many variables
• Two standard criteria:
1) Akaike's Information Criterion (AIC)
2) Schwarz's Bayesian Information Criterion (BIC)
• Both penalise models with a large number of variables if sample size is large
Akaike's Information Criterion

AIC = -2 × log likelihood + 2p

• Where p = number of parameters and -2 × log likelihood is in the output
• Hence AIC penalises models with a large number of variables
• Select the model that minimises (-2LL + 2p)
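The formula is one line of code. Note the lecture's arithmetic is easiest to reconcile if SPSS also counts the residual variance as a parameter, which is an assumption on my part:

```python
# AIC from a log likelihood: AIC = -2*LL + 2p.
def aic(log_likelihood, n_params):
    return -2.0 * log_likelihood + 2 * n_params

print(aic(-1409.6, 7))   # 2833.2 with the 7 listed parameters
print(aic(-1409.6, 8))   # 2835.2, close to the SPSS value of ~2835
```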
Generalized linear models
• Unfortunately the standard REGRESSION in SPSS does not give these statistics
• Need to use
Analyze → Generalized Linear Models…..
Generalized linear models. Default is linear
• Add Min LDL achieved as dependent, as in REGRESSION in SPSS
• Next go to predictors…..
Generalized linear models: Predictors
• WARNING!
• Make sure you add the predictors in the correct box
• Categorical in FACTORS box
• Continuous in COVARIATES box
Generalized linear models: Model
• Add all factors and covariates in the model as main effects
Generalized Linear Models Parameter Estimates
Note identical to REGRESSION output
Generalized Linear Models Goodness-of-fit
Note output gives the log likelihood and AIC = 2835 (AIC = -2 × -1409.6 + 2p ≈ 2835)
Footnote explains smaller AIC is 'better'
Let Science or Clinical factors guide selection: 'Optimal' model
• The log likelihood is a measure of GOODNESS-OF-FIT
• Seek 'optimal' model that maximises the log likelihood or minimises the AIC

Model                                  Log likelihood   p   AIC
1  Full model                          -1409.6          7   2835.6
2  Non-significant variables removed   -1413.6          4   2837.2

Change is 1.6
1) Let Science or Clinical factors guide selection
Key points:
1. Results demonstrate a significant association with baseline LDL, Age and Adherence
2. Difficult choices with Gender, smoking and BMI
3. AIC only changes by 1.6 when removed
4. Generally changes of 4 or more in AIC are considered important
1) Let Science or Clinical factors guide selection
Key points:
1. Conclude little to choose between models
2. AIC actually lower with larger model, and consider Gender and BMI important factors, so keep larger model but have to justify
3. Model building manual, logical, transparent and under your control
2) Use automatic selection procedures
These are based on automatic mechanical algorithms, usually related to statistical significance
Common ones are stepwise, forward or backward elimination
Can be selected in SPSS using 'Method' in the dialogue box
2) Use automatic selection procedures (e.g. Stepwise)
Select Method = Stepwise
2) Use automatic selection procedures (e.g. Stepwise)
[SPSS output: 1st step, 2nd step, Final Model]
2) Change in AIC with Stepwise selection
Note: only available from Generalized Linear Models

Step   Model          Log likelihood   AIC      Change in AIC   No. of parameters p
1      Baseline LDL   -1423.1          2852.2   -               2
2      + Adherence    -1418.0          2844.1   8.1             3
3      + Age          -1413.6          2837.2   6.9             4
2) Advantages and disadvantages of stepwise
Advantages
• Simple to implement
• Gives a parsimonious model
• Selection is certainly objective
Disadvantages
• Non-stable selection – stepwise considers many models that are very similar
• P-value on entry may be smaller once the procedure is finished, so exaggeration of p-value
• Predictions in an external dataset are usually worse for stepwise procedures
2) Automatic procedures: Backward elimination
Backward starts by eliminating the least significant factor from the full model and has a few advantages over forward:
• Modeller has to consider the 'full' model and sees results for all factors simultaneously
• Correlated factors can remain in the model (in forward methods they may not even enter)
• Criteria for removal tend to be more lax in backward, so end up with more parameters
2) Use automatic selection procedures (e.g. Backward)
Select Method = Backward
2) Backward elimination in SPSS
[SPSS output: 1st step – Gender removed; 2nd step – BMI removed; Final Model]
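The idea is easy to sketch outside SPSS. A minimal backward elimination by p-value, assuming a pandas DataFrame df with outcome ldl_min and numeric candidate predictors (the column names are invented, and SPSS's own removal criteria differ in detail):

```python
# Backward elimination: repeatedly drop the least significant term.
import statsmodels.formula.api as smf

candidates = ["ldl_base", "age", "adherence", "gender", "bmi", "smoking"]
while candidates:
    model = smf.ols("ldl_min ~ " + " + ".join(candidates), data=df).fit()
    pvals = model.pvalues.drop("Intercept")
    worst = pvals.idxmax()          # least significant remaining term
    if pvals[worst] <= 0.05:        # all remaining terms significant: stop
        break
    candidates.remove(worst)        # e.g. gender removed first, then bmi

print("retained:", candidates)
print(model.summary())
```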
Summary of automatic selection
• Automatic selection may not give 'optimal' model (may leave out important factors)
• Different methods may give different results (forward vs. backward elimination)
• Backward elimination preferred as less stringent
• Too easily fitted in SPSS!
• Model assessment still requires some thought
3) A mixture of automatic procedures and self selection
• Use automatic procedures as a guide
• Think about what factors are important
• Add 'important' factors
• Do not blindly follow statistical significance
• Consider AIC
Summary of Model selection
• Selection of factors for multiple regression models requires some judgement
• Automatic procedures are available but treat results with caution
• They are easily fitted in SPSS
• Check AIC or log likelihood for fit
• Parsimonious models are better
Remember Occam's Razor
'Entia non sunt multiplicanda praeter necessitatem'
'Entities must not be multiplied beyond necessity'
William of Ockham, 14th-century friar and logician, 1288–1347
Summary
• Multiple regression models are the most used analytical tool in quantitative research
• They are easily fitted in SPSS
• Model assessment requires some thought
• Parsimony is better – Occam's Razor
Summary
After fitting any model, check assumptions:
• Functional form – linearity or not
• Check residuals for normality (see the sketch below)
• Check residuals for outliers
• All accomplished within SPSS
• See publications for further info

Donnelly LA, Palmer CNA, Whitley AL, Lang C, Doney ASF, Morris AD, Donnan PT. Apolipoprotein E genotypes are associated with lipid lowering response to statin treatment in diabetes: A Go-DARTS study. Pharmacogenetics and Genomics, 2008; 18: 279-87.
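The residual checks sketched in Python rather than SPSS; model is a fitted statsmodels OLS result as in the earlier sketches (matplotlib and scipy assumed):

```python
# Checking regression assumptions from the fitted model's residuals.
import matplotlib.pyplot as plt
from scipy import stats

resid = model.resid

stats.probplot(resid, dist="norm", plot=plt)   # Q-Q plot: a near-straight line
plt.show()                                     # suggests approximate normality

plt.scatter(model.fittedvalues, resid)         # curvature -> non-linearity;
plt.axhline(0, color="grey")                   # fanning -> unequal variance
plt.xlabel("Fitted values"); plt.ylabel("Residuals")
plt.show()

print(resid[abs(resid) > 3 * resid.std()])     # crude screen for outliers
```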
Practical on Multiple Regression
Read in 'LDL Data.sav'
1) Try fitting a multiple regression model on Min LDL obtained, using forward and backward elimination. Are the results the same? Add other factors than those considered in the presentation, such as BMI and smoking. Remember the goal is to assess the association of APOE with LDL response.
2) Try fitting multiple regression models for Min Chol achieved. Is the model similar to that found for Min LDL?