Dundee Epidemiology and Biostatistics Unit
Correlation and Regression
Peter T. Donnan
Professor of Epidemiology and Biostatistics
Objectives of Workshop
• Scatterplot
• Correlation
• Correlation to regression
• Simple linear regression
• Parameterisation of the model
• Assumptions
Scatterplot
• The existence of a statistical association between two variables is most apparent in the appearance of a diagram called a scatterplot.

Scatterplot of two continuous variables
[Scatterplot: discharge rate (age/sex adjusted) against deprivation (Jarman index)]
Matrix of scatterplots of pairs of continuous variables
[Scatterplot matrix: DISRATE, DISTDGH, DEPREV, LISTPART]
Correlation
Correlation is a measure of association between two continuous factors
Correlations lie between –1 and +1
A correlation coefficient close to zero indicates weak or no correlation
Correlation
If one factor increases with the other – POSITIVE correlation
e.g. cancer risk & tobacco exposure
If one factor decreases while the other increases – NEGATIVE correlation
e.g. cholesterol and haemorrhagic stroke risk
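The coefficient itself is easy to compute outside SPSS; a minimal Python sketch (scipy assumed, data invented for illustration):

```python
# Pearson correlation in Python (illustrative only; the course uses SPSS).
import numpy as np
from scipy.stats import pearsonr

tobacco = np.array([0, 5, 10, 15, 20, 30, 40])          # hypothetical exposure
risk = np.array([1.0, 1.3, 1.9, 2.2, 3.1, 3.8, 5.0])    # hypothetical cancer risk

r, p = pearsonr(tobacco, risk)   # r lies between -1 and +1; p tests H0: r = 0
print(f"r = {r:.3f}, p = {p:.4f}")   # positive r: risk rises with exposure
```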
Pearson Correlation

Correlations (Pearson correlation, with 2-tailed significance in brackets; N = 27 for every pair)

            DISRATE          DEPREV           DISTDGH           LISTPART
DISRATE     1.000            .462* (.015)     -.301 (.127)      .072 (.721)
DEPREV      .462* (.015)     1.000            -.470* (.013)     .326 (.097)
DISTDGH     -.301 (.127)     -.470* (.013)    1.000             -.708** (.000)
LISTPART    .072 (.721)      .326 (.097)      -.708** (.000)    1.000

*. Correlation is significant at the 0.05 level (2-tailed).
**. Correlation is significant at the 0.01 level (2-tailed).
Warnings about correlation
Correlation does NOT measure causality, only association
High correlation of increased benefit from drug and the star sign Gemini!
Increased admissions for pneumonia correlated with high tides!
Beware non-normally distributed data
Try to normalise data using transformations
Correlation to Regression Model
Need to more formally assess the relationship between a pair of continuous variables
Fit a regression line – the Model
Usually assume the line is linear but may also fit polynomial curves
Fitted linear regression line and 95% CI for mean
R² × 100 is the % variation explained by the line (21%)
[Scatterplot of DISRATE against DEPREV with fitted linear regression line and 95% CI; Rsq = 0.2134]
How is relationship formulated?
Simplest equation is:

yᵢ = a + b·xᵢ + eᵢ

y is the outcome; a is the intercept;
b is the slope related to x, the explanatory variable; and
e is the error term or random 'noise'
How is relationship formulated?

yᵢ = a + b·xᵢ + eᵢ

We have a set of values for y, the outcome, and a corresponding set of values for x, the explanatory variable
Need to estimate values of a, the intercept, and b, the slope
e is the residual and is assumed to have a Normal distribution
Alternative Terminology

y                      x
Outcome                Explanatory
Dependent variable     Independent variable
Predicted              Predictor
How is the fitted line obtained?
Use the method of least squares (LS)
Seek to minimise the squared vertical differences between each point and the fitted line
Results in parameter estimates or regression coefficients of slope (b) and intercept (a)
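These least-squares estimates have a simple closed form; a short numpy sketch with invented age/LDL values shows the calculation the software performs:

```python
# Least-squares slope and intercept from first principles (data invented).
import numpy as np

x = np.array([45.0, 50, 55, 60, 65, 70, 75])              # e.g. age
y = np.array([1.70, 1.66, 1.60, 1.55, 1.52, 1.45, 1.40])  # e.g. min LDL (made up)

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
residuals = y - (a + b * x)       # the e terms; LS minimises sum(residuals**2)
print(f"intercept a = {a:.3f}, slope b = {b:.4f}")
```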
Method of least squares
Consider the regression of minimum LDL cholesterol achieved on age
• Select Regression → Linear….
• Dependent (y) – Min LDL achieved
• Independent (x) – Age_Base
N.B. 0.008 may look very small but represents:
the DECREASE in LDL achieved for each increase of one unit of age, i.e. ONE year
Output from SPSS linear regression

Coefficients(a)

Model                B       Std. Error   Beta    t        Sig.
1  (Constant)        2.024   .105                 19.340   .000
   Age at baseline   -.008   .002         -.121   -4.546   .000

a. Dependent Variable: Min LDL achieved
H₀: slope b = 0
Test t = slope/SE = -0.008/0.002 = -4.546 (from the unrounded B and SE) with p < 0.001, so statistically significant
Predicted LDL = 2.024 - 0.008 × Age
Output from SPSS linear regression

Coefficients(a)

Model                B       Std. Error   Beta    t        Sig.
1  (Constant)        2.024   .105                 19.340   .000
   Age at baseline   -.008   .002         -.121   -4.546   .000

a. Dependent Variable: Min LDL achieved
Predicted LDL achieved = 2.024 - 0.008 × Age
So for a man aged 65 the predicted LDL achieved = 2.024 - 0.008 × 65 = 1.504

Prediction Equation from linear regression
Age Predicted Min LDL
45 1.664
55 1.584
65 1.504
75 1.424
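The table follows directly from the fitted equation; for example:

```python
# Reproducing the prediction table from Predicted LDL = 2.024 - 0.008 x Age.
def predicted_min_ldl(age):
    return 2.024 - 0.008 * age

for age in (45, 55, 65, 75):
    print(age, round(predicted_min_ldl(age), 3))
# 45 -> 1.664, 55 -> 1.584, 65 -> 1.504, 75 -> 1.424, as in the table
```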
Assumptions of Regression
1. Relationship is linear
2. Outcome variable, and hence residuals or error terms, are approx. Normally distributed
Testing regression coefficients
H₀: slope b = 0
Test statistic, t = slope/SE
If |t| > 1.96 then the slope is statistically significant
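The same test, sketched from the reported slope and standard error (scipy assumed; the normal approximation stands in for the exact t distribution used by SPSS):

```python
# Testing H0: b = 0 from the reported coefficient and standard error.
from scipy import stats

b, se = -0.008, 0.002
t = b / se                      # -4.0 from the rounded values; SPSS reports
                                # -4.546 because it uses unrounded B and SE
p = 2 * stats.norm.sf(abs(t))   # two-sided p-value, normal approximation
print(f"t = {t:.2f}, p = {p:.2g}")
```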
Parameterisation of model
What does the regression coefficient mean?
Continuous:
B represents the change in outcome (y) for each 1 unit change in the x variable
e.g. Blood pressure = a + 2.3 × Age
≡ Each increase of 1 year increases BP by 2.3 mmHg
Parameterisation of model
What does the regression coefficient mean?
Binary:
B represents the change in outcome (y) for each 1 unit change in the x variable (x is binary)
e.g. Blood pressure = a + 5.1 × Gender
≡ Men have higher BP by 5.1 mmHg compared to women (M = 1; F = 0)
Parameterisation of model
What does the regression coefficient mean?
Categories:
If ordinal, same as continuous: B represents the change in outcome (y) for each 1 unit change in the x variable (x is ordinal)
e.g. Blood pressure = a + 1.8 × Deprivation
≡ A 1 category increase in deprivation increases BP by 1.8 mmHg (Carstairs social deprivation scale 1 to 7)
Parameterisation of model
What does the regression coefficient mean?
Categories:
If > 2 categories, create k − 1 dummy variables (k categories)
e.g. Smoking: 1 = current, 2 = ex, 3 = non-smoker
requires 2 dummy variables in the model
Parameterisation of model
What does the regression coefficient mean?
Categories:

          Smoke1   Smoke2
Current   1        0
Ex        0        1
Non       0        0

Blood pressure = a + 10 × Smoke1 + 2 × Smoke2
≡ Current smoking increases BP by 10 mmHg and ex-smoking increases BP by 2 mmHg compared with non-smokers
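Such dummy coding is mechanical; a small pandas sketch (invented data) reproduces the table above:

```python
# Creating k-1 dummy variables for a 3-level smoking factor (data invented;
# pandas assumed). Non-smoker is the reference category.
import pandas as pd

df = pd.DataFrame({"smoking": ["current", "ex", "non", "non", "current"]})
dummies = pd.get_dummies(df["smoking"]).astype(int)  # columns: current, ex, non
df["smoke1"] = dummies["current"]   # 1 for current smokers, else 0
df["smoke2"] = dummies["ex"]        # 1 for ex-smokers, else 0
print(df)                           # non-smokers are 0 on both dummies
```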
Entering Multidimensional Space: Multiple Regression
Peter T. Donnan
Professor of Epidemiology and Biostatistics
Statistics for Health Research

Objectives of session
• Recognise the need for multiple regression
• Understand methods of selecting variables
• Understand strengths and weaknesses of selection methods
• Carry out multiple regression in SPSS and interpret the output
Why do we need multiple regression?
Research is not as simple as the effect of one variable on one outcome,
especially with observational data
Need to assess many factors simultaneously; more realistic models
Consider fitted line of y = a + b₁x₁ + b₂x₂
[3-D plot: Dependent (y) plotted against Explanatory (x₁) and Explanatory (x₂)]

3-dimensional scatterplot from SPSS of Min LDL in relation to baseline LDL and age
When to use multiple regression modelling (1)
Assess the relationship between two variables while adjusting or allowing for another variable
Sometimes the second variable is considered a 'nuisance' factor
Example: glucose uptake data allowing for different cell types
When to use multiple regression modelling (2)
In an RCT, whenever there is imbalance between arms of the trial at baseline in characteristics of subjects
e.g. survival in colorectal cancer on two different randomised therapies adjusted for age, gender, co-morbidity
When to use multiple regression modelling (2)
A special case of this is when adjusting for baseline level of the primary outcome in an RCT
Baseline level added as a factor in the regression model
This will be covered in the Trials part of the course
When to use multiple regression modelling (3)
With observational data, in order to produce a prognostic equation for future prediction of risk of mortality
e.g. predicting future risk of CHD used 10-year data from the Framingham cohort
When to use multiple regression modelling (4)
With observational data, in order to adjust for possible confounders
e.g. survival in colorectal cancer in those with hypertension adjusted for age, gender, social deprivation and co-morbidity
Definition of Confounding
A confounder is a factor which is related to both the variable of interest (explanatory) and the outcome, but is not an intermediary in a causal pathway
Example of Confounding
[Diagram: Smoking related to both Deprivation (exposure) and Lung Cancer (outcome)]
But, also worth adjusting for factors only related to outcome
[Diagram: Exercise related to Lung Cancer but not to Deprivation]
Not worth adjusting for an intermediate factor in a causal pathway
[Diagram: Exercise → Blood viscosity → Stroke]
In a causal pathway each factor is merely a marker of the other factors, i.e. correlated – collinearity
SPSS: Add both baseline LDL and age in the independent box in linear regression
Output from SPSS linear regression on Age at baseline

Coefficients(a)

Model                B       Std. Error   Beta    t        Sig.   95% CI for B      Tolerance   VIF
1  (Constant)        2.024   .105                 19.340   .000   1.819 to 2.229
   Age at baseline   -.008   .002         -.121   -4.546   .000   -.011 to -.004    1.000       1.000

a. Dependent Variable: Min LDL achieved
Output from SPSS linear regression on Baseline LDL

Coefficients(a)

Model             B      Std. Error   Beta    t        Sig.   95% CI for B
1  (Constant)     .668   .066                 10.091   .000   .538 to .798
   Baseline LDL   .257   .018         .351    13.950   .000   .221 to .293

a. Dependent Variable: Min LDL achieved
Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .360a   .130       .129                .6753538

a. Predictors: (Constant), Age at baseline, Baseline LDL
Coefficients(a)

Model                B       Std. Error   Beta    t        Sig.   95% CI for B
1  (Constant)        1.003   .124                 8.086    .000   .760 to 1.246
   Baseline LDL      .250    .019         .342    13.516   .000   .214 to .286
   Age at baseline   -.005   .002         -.081   -3.187   .001   -.008 to -.002

a. Dependent Variable: Min LDL achieved
Output: Multiple regression
R² now improved to 13%
Both variables still significant INDEPENDENTLY of each other
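Outside SPSS the same two-predictor model can be fitted in one line; a sketch with statsmodels, where the DataFrame df and the column names ldl_min, ldl_base and age are assumptions for illustration, not the course dataset's actual names:

```python
# Two-predictor linear model analogous to the SPSS fit above.
import statsmodels.formula.api as smf

model = smf.ols("ldl_min ~ ldl_base + age", data=df).fit()
print(model.summary())   # B, SE, t, Sig. and R-squared, as in the SPSS tables
print(model.rsquared)    # about 0.13 in the lecture's example
```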
How do you select which variables to enter the model?
• Usually consider: what hypotheses are you testing?
• If main 'exposure' variable, enter first and assess confounders one at a time
• For derivation of a CPR (clinical prediction rule) you want powerful predictors
• Also clinically important factors, e.g. cholesterol in CHD prediction
• Significance is important but
• It is acceptable to have an 'important' variable without statistical significance
How do you decide what variables to enter in the model?
Correlations? With great difficulty!

3-dimensional scatterplot from SPSS of Time from Surgery in relation to Duke's staging and age
Approaches to model building
1. Let scientific or clinical factors guide selection
2. Use automatic selection algorithms
3. A mixture of the above
1) Let Science or Clinical factors guide selection
Baseline LDL cholesterol is an important factor determining LDL outcome, so enter first
Next allow for age and gender
Add adherence as important?
Add BMI and smoking?
1) Let Science or Clinical factors guide selection
Results in model of:
1. Baseline LDL
2. Age and gender
3. Adherence
4. BMI and smoking
Is this a 'good' model?
1) Let Science or Clinical factors guide selection: Final Model
Note three variables entered but not statistically significant
1) Let Science or Clinical factors guide selection
Is this the 'best' model? Should I leave out the non-significant factors (Model 2)?

Model   Adj R²   F from ANOVA   No. of parameters p
1       0.137    37.48          7
2       0.134    72.021         4

Adj R² lower, F has increased and the number of parameters is smaller in the 2nd model. Is this better?
Kullback-Leibler Information
Kullback and Leibler (1951) quantified the meaning of 'information' – related to Fisher's 'sufficient statistics'
Basically we have reality f, and a model g to approximate f
So K-L information is I(f,g)
[Diagram: curves f and g]
Kullback-Leibler Information
We want to minimise I(f,g) to obtain the best model over other models
I(f,g) is the information lost or 'distance' between reality and a model, so we need to minimise:

I(f,g) = ∫ f(x) log( f(x) / g(x) ) dx
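For discrete distributions the integral becomes a sum, which scipy computes directly; a toy illustration:

```python
# Toy numerical K-L information for discrete f ('reality') and g (a model);
# scipy's entropy(pk, qk) computes sum(f * log(f / g)).
import numpy as np
from scipy.stats import entropy

f = np.array([0.5, 0.3, 0.2])
g = np.array([0.4, 0.4, 0.2])
print(entropy(f, g))   # I(f,g) >= 0, and 0 only when g matches f exactly
```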
Akaike's Information Criterion
It turns out that the function I(f,g) is related to a very simple measure of goodness-of-fit:
Akaike's Information Criterion or AIC
Selection Criteria
• With a large number of factors, type 1 error is large; likely to have a model with many variables
• Two standard criteria:
1) Akaike's Information Criterion (AIC)
2) Schwarz's Bayesian Information Criterion (BIC)
• Both penalise models with a large number of variables if sample size is large
Akaike's Information Criterion

AIC = -2 × log likelihood + 2p

• Where p = number of parameters and -2 × log likelihood is in the output
• Hence AIC penalises models with a large number of variables
• Select the model that minimises (-2LL + 2p)
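The formula is one line of code. Note the lecture's arithmetic is easiest to reconcile if SPSS also counts the residual variance as a parameter, which is an assumption on my part:

```python
# AIC from a log likelihood: AIC = -2*LL + 2p.
def aic(log_likelihood, n_params):
    return -2.0 * log_likelihood + 2 * n_params

print(aic(-1409.6, 7))   # 2833.2 with the 7 listed parameters
print(aic(-1409.6, 8))   # 2835.2, close to the SPSS value of ~2835
```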
Generalized linear models
• Unfortunately the standard REGRESSION in SPSS does not give these statistics
• Need to use
Analyze → Generalized Linear Models…..
Generalized linear models. Default is linear
• Add Min LDL achieved as dependent, as in REGRESSION in SPSS
• Next go to predictors…..
Generalized linear models: Predictors
• WARNING!
• Make sure you add the predictors in the correct box
• Categorical in FACTORS box
• Continuous in COVARIATES box
Generalized linear models: Model
• Add all factors and covariates in the model as main effects
Generalized Linear Models Parameter Estimates
Note identical to REGRESSION output
Generalized Linear Models Goodness-of-fit
Note output gives the log likelihood and AIC = 2835 (AIC = -2 × -1409.6 + 2p ≈ 2835)
Footnote explains smaller AIC is 'better'
Let Science or Clinical factors guide selection: 'Optimal' model
• The log likelihood is a measure of GOODNESS-OF-FIT
• Seek 'optimal' model that maximises the log likelihood or minimises the AIC

Model                                  Log likelihood   p   AIC
1  Full model                          -1409.6          7   2835.6
2  Non-significant variables removed   -1413.6          4   2837.2

Change is 1.6
1) Let Science or Clinical factors guide selection
Key points:
1. Results demonstrate a significant association with baseline LDL, Age and Adherence
2. Difficult choices with Gender, smoking and BMI
3. AIC only changes by 1.6 when removed
4. Generally changes of 4 or more in AIC are considered important
1) Let Science or Clinical factors guide selection
Key points:
1. Conclude little to choose between models
2. AIC actually lower with larger model, and consider Gender and BMI important factors, so keep larger model but have to justify
3. Model building manual, logical, transparent and under your control
2) Use automatic selection procedures
These are based on automatic mechanical algorithms, usually related to statistical significance
Common ones are stepwise, forward or backward elimination
Can be selected in SPSS using 'Method' in the dialogue box
2) Use automatic selection procedures (e.g. Stepwise)
Select Method = Stepwise
2) Use automatic selection procedures (e.g. Stepwise)
[SPSS output: 1st step, 2nd step, Final Model]
2) Change in AIC with Stepwise selection
Note: only available from Generalized Linear Models

Step   Model          Log likelihood   AIC      Change in AIC   No. of parameters p
1      Baseline LDL   -1423.1          2852.2   -               2
2      + Adherence    -1418.0          2844.1   8.1             3
3      + Age          -1413.6          2837.2   6.9             4
2) Advantages and disadvantages of stepwise
Advantages
• Simple to implement
• Gives a parsimonious model
• Selection is certainly objective
Disadvantages
• Non-stable selection – stepwise considers many models that are very similar
• P-value on entry may be smaller once the procedure is finished, so exaggeration of p-value
• Predictions in an external dataset are usually worse for stepwise procedures
2) Automatic procedures: Backward elimination
Backward starts by eliminating the least significant factor from the full model and has a few advantages over forward:
• Modeller has to consider the 'full' model and sees results for all factors simultaneously
• Correlated factors can remain in the model (in forward methods they may not even enter)
• Criteria for removal tend to be more lax in backward, so end up with more parameters
2) Use automatic selection procedures (e.g. Backward)
Select Method = Backward
2) Backward elimination in SPSS
[SPSS output: 1st step – Gender removed; 2nd step – BMI removed; Final Model]
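The idea is easy to sketch outside SPSS. A minimal backward elimination by p-value, assuming a pandas DataFrame df with outcome ldl_min and numeric candidate predictors (the column names are invented, and SPSS's own removal criteria differ in detail):

```python
# Backward elimination: repeatedly drop the least significant term.
import statsmodels.formula.api as smf

candidates = ["ldl_base", "age", "adherence", "gender", "bmi", "smoking"]
while candidates:
    model = smf.ols("ldl_min ~ " + " + ".join(candidates), data=df).fit()
    pvals = model.pvalues.drop("Intercept")
    worst = pvals.idxmax()          # least significant remaining term
    if pvals[worst] <= 0.05:        # all remaining terms significant: stop
        break
    candidates.remove(worst)        # e.g. gender removed first, then bmi

print("retained:", candidates)
print(model.summary())
```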
Summary of automatic selection
• Automatic selection may not give 'optimal' model (may leave out important factors)
• Different methods may give different results (forward vs. backward elimination)
• Backward elimination preferred as less stringent
• Too easily fitted in SPSS!
• Model assessment still requires some thought
3) A mixture of automatic procedures and self selection
• Use automatic procedures as a guide
• Think about what factors are important
• Add 'important' factors
• Do not blindly follow statistical significance
• Consider AIC
Summary of Model selection
• Selection of factors for multiple regression models requires some judgement
• Automatic procedures are available but treat results with caution
• They are easily fitted in SPSS
• Check AIC or log likelihood for fit
• Parsimonious models are better
Remember Occam's Razor
'Entia non sunt multiplicanda praeter necessitatem'
'Entities must not be multiplied beyond necessity'
William of Ockham, 14th-century friar and logician, 1288–1347
Summary
• Multiple regression models are the most used analytical tool in quantitative research
• They are easily fitted in SPSS
• Model assessment requires some thought
• Parsimony is better – Occam's Razor
Summary
After fitting any model, check assumptions:
• Functional form – linearity or not
• Check residuals for normality (see the sketch below)
• Check residuals for outliers
• All accomplished within SPSS
• See publications for further info

Donnelly LA, Palmer CNA, Whitley AL, Lang C, Doney ASF, Morris AD, Donnan PT. Apolipoprotein E genotypes are associated with lipid lowering response to statin treatment in diabetes: A Go-DARTS study. Pharmacogenetics and Genomics, 2008; 18: 279-87.
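The residual checks sketched in Python rather than SPSS; model is a fitted statsmodels OLS result as in the earlier sketches (matplotlib and scipy assumed):

```python
# Checking regression assumptions from the fitted model's residuals.
import matplotlib.pyplot as plt
from scipy import stats

resid = model.resid

stats.probplot(resid, dist="norm", plot=plt)   # Q-Q plot: a near-straight line
plt.show()                                     # suggests approximate normality

plt.scatter(model.fittedvalues, resid)         # curvature -> non-linearity;
plt.axhline(0, color="grey")                   # fanning -> unequal variance
plt.xlabel("Fitted values"); plt.ylabel("Residuals")
plt.show()

print(resid[abs(resid) > 3 * resid.std()])     # crude screen for outliers
```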
Practical on Multiple Regression
Read in 'LDL Data.sav'
1) Try fitting a multiple regression model on Min LDL obtained, using forward and backward elimination. Are the results the same? Add other factors than those considered in the presentation, such as BMI and smoking. Remember the goal is to assess the association of APOE with LDL response.
2) Try fitting multiple regression models for Min Chol achieved. Is the model similar to that found for Min LDL?