Post on 24-Feb-2016


OLS & Logistic Regression Analysis – A Recap

Cristina Penaloza & Eoin Maloney

Health Economics Unit

Outline

• What is regression analysis?

• Relevance of regression analysis

• Regression modelling process

– OLS regression

– Logistic regression

• Exercise

What is Regression Analysis?

“Regression analysis is concerned with the study of the dependence of one variable, the dependent variable, on one or more other variables, the explanatory variables, with a view to estimating and/or predicting the (population) mean or average value of the dependent variable in terms of known or fixed (in repeated sampling) values of the explanatory variables.”

Gujarati (1995: 16)

Terminology

• Dependent variable: explained variable, outcome variable, outcome, response variable, regressand, output variable, predicted value, predictand, endogenous variable

• Explanatory variable: independent variable, predictor variable, predictor, regressor, stimulus/control variable, exogenous variable

• Disturbance (random error) term; in the sample, the residual (residual error)

Causation / correlation

• Regression vs causation

– “A statistical relationship, however strong and however suggestive, can never establish causal connection: our ideas of causation must come from outside statistics”

Gujarati (1995: 20)

• Regression vs correlation

– Correlation analysis: seeks to measure the strength of linear association between two variables

– Regression analysis: seeks to estimate or predict the average value of one variable on the basis of fixed values of other variables
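The distinction between correlation and regression can be illustrated with a minimal sketch. The ages and blood pressures below are hypothetical numbers made up for illustration, and the function names are not from the slides. Correlation treats the two variables symmetrically, whereas the regression slope depends on which variable is treated as dependent.

```python
import math

# Hypothetical data (illustrative only): age in years, diastolic BP in mmHg.
age = [66, 68, 70, 72, 75, 78, 80, 83]
dibp = [82, 85, 84, 88, 90, 89, 93, 95]


def mean(values):
    return sum(values) / len(values)


def pearson_r(x, y):
    """Correlation: a symmetric, unit-free measure of linear association."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def regression_slope(x, y):
    """Regression of y on x: change in the average of y per unit of x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


r_xy = pearson_r(age, dibp)                 # same value as pearson_r(dibp, age)
r_yx = pearson_r(dibp, age)
slope_y_on_x = regression_slope(age, dibp)  # mmHg per year of age
slope_x_on_y = regression_slope(dibp, age)  # years of age per mmHg

print(f"correlation: {r_xy:.3f}")
print(f"slope of DIBP on age: {slope_y_on_x:.3f}")
print(f"slope of age on DIBP: {slope_x_on_y:.3f}")
```

Swapping the roles of the variables leaves the correlation unchanged but changes the slope; the product of the two slopes equals the squared correlation.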

Why study regression?

• Adjusting for baseline characteristics in Economic Evaluation (Nathwani et al. 2004; Manca et al. 2005; Hoch et al. 2002)

• Predicting/mapping utility-based outcome measures for use in Economic Evaluation (Gray et al. 2006; Kaambwa et al. 2011; Sengupta et al. 2004)

• Predicting costs for use in Economic Evaluation (Smith et al. 2007; Bonizzato et al. 2000; Baumeister et al. 2009)

• Constructing cost-effectiveness acceptability curves (CEACs) (Hoch et al. 2006)

• Regression imputation for missing data (Billingham et al. 2002; Engels & Diehr, 2003; Blazer et al. 1995)

• Explaining factors which cause variation in outcome and cost data (Barber & Thompson, 2004; Kaambwa et al. 2008; Raine et al. 2010)

The regression modelling process

1. Statement of hypothesis (theory)
2. Specification of the model
3. Obtaining the data
4. Estimation of the regression model
5. Diagnostic analysis
6. Hypothesis testing
7. Prediction/forecasting

1. Statement of hypothesis

Example: High blood pressure and older people

“Amongst those over the age of 65, the incidence of high diastolic blood pressure (DIBP) increases with age. Therefore, DIBP is, in part, explained by age.”

2. Specification of the model

In functional form:

Mean diastolic blood pressure, DIBP, is some function of age, A:

DIBP = f(A)   (1)

2. Specification of the model (cntd)

In mathematical (linear) form:

$Y = \beta_1 + \beta_2 X$   (2)

where Y = mean DIBP and X = age

$\beta_1$ & $\beta_2$ = parameters

[Figure: Linear relationship. E(Y|X) plotted against X, with x1, x3, x4 and x6 marked on the X axis.]

2. Specification of the model (cntd)

Econometric (regression) model:

$Y = \beta_1 + \beta_2 X + u$   (3)

where Y = mean DIBP, the dependent variable

X = age, the explanatory variable

u = disturbance (random error) term

$\beta_1$ & $\beta_2$ = parameters
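As a rough illustration of the econometric model, the sketch below simulates observations from Y = β1 + β2X + u. The parameter values, the age range and the normal distribution for u are all assumptions chosen for illustration, not values from the slides.

```python
import random

random.seed(0)  # fixed seed so the simulated sample is reproducible

# Hypothetical population parameters (illustrative assumptions only).
BETA_1 = 60.0  # intercept, mmHg
BETA_2 = 0.3   # slope, mmHg per year of age
SIGMA = 5.0    # standard deviation of the disturbance u


def simulate(n):
    """Draw n observations (x, y, u) from Y = BETA_1 + BETA_2 * X + u."""
    data = []
    for _ in range(n):
        x = random.uniform(65, 90)       # age of an older person
        u = random.gauss(0.0, SIGMA)     # disturbance: everything not explained by age
        y = BETA_1 + BETA_2 * x + u
        data.append((x, y, u))
    return data


sample = simulate(1000)
mean_u = sum(u for _, _, u in sample) / len(sample)
print(f"average disturbance in the sample: {mean_u:.3f}")
```

In real data only X and Y are observed; the simulation keeps u only to show that Y deviates from the line β1 + β2X by exactly the disturbance.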


The error term (u)

• Omitted explanatory variables

• Measurement error

• Wrong functional form

• Unavailability of data

• Inherent randomness

etc….

3 & 4. Data / estimation of parameters

• Obtaining the data

– observed values of Y and X

• Estimation of the parameters

– Y and X are the variables (“known”)

– $\beta_1$ and $\beta_2$ are the parameters and u is the disturbance (“unknown”)

5. Diagnostic analysis

• Is the model correctly specified?

• Have all assumptions been met?

• Are there any unusual observations or outliers that may unduly influence results?

More of this later this morning…


6. Hypothesis Testing

• Is estimate statistically close to a postulated value? Or are estimates in accord with expectations from theory?

• Only after model has been shown to be adequate
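One common way to check whether an estimate is statistically close to a postulated value is a t-test on the slope. The sketch below is a minimal illustration under stated assumptions: the data are hypothetical, the postulated value is zero, and the classical OLS assumptions are taken to hold. The critical value 2.447 is the standard two-sided 5% value of the t distribution with 6 degrees of freedom.

```python
import math

# Hypothetical data (illustrative only): age in years, diastolic BP in mmHg.
age = [66, 68, 70, 72, 75, 78, 80, 83]
dibp = [82, 85, 84, 88, 90, 89, 93, 95]

n = len(age)
mean_x = sum(age) / n
mean_y = sum(dibp) / n
sxx = sum((x - mean_x) ** 2 for x in age)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(age, dibp))

# OLS estimates of the intercept (b1) and slope (b2).
b2 = sxy / sxx
b1 = mean_y - b2 * mean_x

# Residual variance with n - 2 degrees of freedom, then the slope's standard error.
residuals = [y - (b1 + b2 * x) for x, y in zip(age, dibp)]
rss = sum(e ** 2 for e in residuals)
sigma2_hat = rss / (n - 2)
se_b2 = math.sqrt(sigma2_hat / sxx)

postulated_b2 = 0.0  # null hypothesis: age has no effect on DIBP
t_stat = (b2 - postulated_b2) / se_b2

T_CRIT_5PCT_6DF = 2.447  # two-sided 5% critical value, t distribution, 6 df
reject = abs(t_stat) > T_CRIT_5PCT_6DF

print(f"slope = {b2:.3f}, se = {se_b2:.3f}, t = {t_stat:.2f}, reject H0: {reject}")
```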


7. Forecasting or Prediction

• If hypothesis or theory being tested is confirmed, then future values of the dependent variable can be predicted or forecast

• Policy recommendations

The practice of regression modelling

[Flowchart: Hypothesis / theory → Model specification → Data → Estimation → Specification testing and diagnostic testing → Is the model adequate? (No / Yes) → Hypothesis testing → Policy: prediction and forecasting]

Sample regression

• In practice we will never observe the population regression line.

• Instead we take a random sample of observations in order to estimate the $\beta$s.

• We distinguish the sample regression from the population regression as follows:

Sample regression

Mathematical model: $\hat{Y}_i = \hat{\beta}_1 + \hat{\beta}_2 X_i$

Econometric model: $Y_i = \hat{\beta}_1 + \hat{\beta}_2 X_i + \hat{u}_i$

where $\hat{Y}_i$ = estimator of $E(Y \mid X_i)$

$\hat{\beta}_1$ = estimator of $\beta_1$

$\hat{\beta}_2$ = estimator of $\beta_2$

$\hat{u}_i$ = estimate of $u_i$

Population regression

Mathematical model: $Y_i = \beta_1 + \beta_2 X_i$

Econometric model: $Y_i = \beta_1 + \beta_2 X_i + u_i$

where $Y_i$ (in the mathematical model) = $E(Y \mid X_i)$

$\beta_1$ = constant/Y intercept

$\beta_2$ = coefficient for $X_i$

$u_i$ = error term

[Figure: Sample regression line $\hat{Y}_i = \hat{\beta}_1 + \hat{\beta}_2 X_i$ plotted as Y against X, with observed values Y1 to Y4 and fitted values $\hat{Y}_1$ to $\hat{Y}_4$ at X1 to X4.]

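[Figure: Sample regression $Y_i = \hat{\beta}_1 + \hat{\beta}_2 X_i + \hat{u}_i$ plotted as Y against X, with observed values Y1 to Y4, fitted values $\hat{Y}_1$ to $\hat{Y}_4$ and residuals $\hat{u}_1$ to $\hat{u}_4$ at X1 to X4.]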

The Ordinary Least Squares (OLS) Model

The dependent variable is modelled as a linear function of predictor (independent) variables. The dependent variable is continuous, e.g. blood pressure, cholesterol level or weight.

OLS

• What factors cause variation in an individual’s diastolic blood pressure?

• What variables explain movement in men’s cholesterol levels?

• What variables are predictive of high birth weight in a population of mothers from Birmingham?

The dependent variable can take on any numerical value within the limits of the range of that variable.

OLS

The OLS method seeks to minimise the residual sum of squares:

$$\sum_{i=1}^{n} \hat{u}_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$

$$\sum_{i=1}^{n} \hat{u}_i^2 = \sum_{i=1}^{n} (Y_i - \hat{\beta}_1 - \hat{\beta}_2 X_i)^2$$
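For a single explanatory variable, the values of the two estimates that minimise the residual sum of squares have a well-known closed form. The sketch below applies it to hypothetical data (the numbers and function names are illustrative, not from the slides) and checks that moving away from the estimates increases the residual sum of squares.

```python
# Hypothetical data (illustrative only): age in years, diastolic BP in mmHg.
age = [66, 68, 70, 72, 75, 78, 80, 83]
dibp = [82, 85, 84, 88, 90, 89, 93, 95]


def ols_fit(x, y):
    """Closed-form OLS estimates (intercept, slope) for one regressor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def rss(intercept, slope, x, y):
    """Residual sum of squares for a candidate line."""
    return sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))


b1_hat, b2_hat = ols_fit(age, dibp)
best = rss(b1_hat, b2_hat, age, dibp)

print(f"intercept = {b1_hat:.3f}, slope = {b2_hat:.3f}, RSS = {best:.3f}")
# Any other line gives a larger RSS, for example:
print(f"RSS with slope shifted by 0.05: {rss(b1_hat, b2_hat + 0.05, age, dibp):.3f}")
```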

Minimising the residual…

[Figure: Y plotted against X at X1 to X4, with the residuals $\hat{u}_1$ to $\hat{u}_4$ marked.]

Describing the overall fit of the estimated model

The coefficient of determination, or $R^2$, is a measure of the ‘goodness of fit’ of a regression, i.e. the proportion of the variation in $Y_i$ which is explained by the regression

$0 \le R^2 \le 1$

But focusing solely on maximising $R^2$ is not a good idea! (Other measures will be considered this afternoon…)
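As a minimal sketch, the coefficient of determination can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. The data below are hypothetical and the variable names are illustrative assumptions.

```python
# Hypothetical data (illustrative only): age in years, diastolic BP in mmHg.
age = [66, 68, 70, 72, 75, 78, 80, 83]
dibp = [82, 85, 84, 88, 90, 89, 93, 95]

n = len(age)
mean_x = sum(age) / n
mean_y = sum(dibp) / n

# OLS fit of DIBP on age.
sxx = sum((x - mean_x) ** 2 for x in age)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(age, dibp))
slope = sxy / sxx
intercept = mean_y - slope * mean_x
fitted = [intercept + slope * x for x in age]

# Total variation in Y and the part left unexplained by the regression.
tss = sum((y - mean_y) ** 2 for y in dibp)
rss = sum((y - f) ** 2 for y, f in zip(dibp, fitted))

r_squared = 1 - rss / tss
print(f"R-squared = {r_squared:.3f}")
```

With one explanatory variable and an intercept, this value equals the squared correlation between X and Y.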

Models for Categorical Dependent Variables

For use with dependent variables that are either dichotomous (e.g. an individual has CVD or not) or polytomous (e.g. low, medium or high cholesterol level). Both are quite common in health-related datasets.

Models for Categorical Dependent Variables

Focus: binary response variable. Independent variables are used to predict whether or not some event will occur.

Based on certain described characteristics:

• Will an individual get cancer or not?

• Will a patient survive or die?

• Will an individual develop CVD or not?

Coding of outcomes: usually coded 1 if the attribute of interest is present and 0 otherwise.

Approach to be used: logistic regression, best for a dichotomous dependent variable with continuous and categorical independent variables.

Other commonly used approaches: Probit & Nested Logit

Major difference from Ordinary Linear Regression

• Uses a link function for the relationship between the dependent and independent variables

• Substitutes maximum likelihood estimation (MLE) of a link function of the dependent variable for OLS regression's least squares estimation of the dependent variable itself

MLE: a method of estimating unknown parameters in such a way that the probability of observing the given values of the dependent variable is as high as possible (i.e. maximised)
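To make the idea of maximum likelihood concrete, the sketch below fits a logistic model with one predictor by Newton-Raphson iterations on the Bernoulli log-likelihood. It is an illustration under stated assumptions, not the procedure of any particular statistics package: the data are hypothetical, age is rescaled to decades relative to 60, and the function names are made up for this example.

```python
import math

# Hypothetical data: age rescaled as (age - 60) / 10, and CVD status (1 = yes).
x_scaled = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
cvd = [0, 0, 1, 0, 1, 0, 1, 1]


def sigmoid(z):
    """Inverse of the logit link: maps a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(b0, b1, x, y):
    """Log of the probability of observing y under parameters (b0, b1)."""
    total = 0.0
    for xi, yi in zip(x, y):
        p = sigmoid(b0 + b1 * xi)
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total


def fit_logit(x, y, iterations=25):
    """Newton-Raphson search for the parameters that maximise the likelihood."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0                # gradient of the log-likelihood
        i00 = i01 = i11 = 0.0        # information matrix (negative Hessian)
        for xi, yi in zip(x, y):
            p = sigmoid(b0 + b1 * xi)
            w = p * (1 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            i00 += w
            i01 += w * xi
            i11 += w * xi * xi
        det = i00 * i11 - i01 * i01
        # Newton step: solve the 2x2 system (information) * step = gradient.
        b0 += (i11 * g0 - i01 * g1) / det
        b1 += (i00 * g1 - i01 * g0) / det
    return b0, b1


b0_hat, b1_hat = fit_logit(x_scaled, cvd)
ll_start = log_likelihood(0.0, 0.0, x_scaled, cvd)
ll_fit = log_likelihood(b0_hat, b1_hat, x_scaled, cvd)
print(f"b0 = {b0_hat:.3f}, b1 = {b1_hat:.3f}")
print(f"log-likelihood: start {ll_start:.3f}, fitted {ll_fit:.3f}")
```

The fitted parameters give a higher log-likelihood than the starting values, which is what "as high as possible" means in practice.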


Issues to consider…

• Why are OLS models not suitable for dichotomous data?

• Logit transformation – Link Function

• Marginal & Conditional Odds and Probability

Suppose we want to model Yi = β0 + β1X1 + ε, but

Yi = 1 if the i-th individual has the attribute of interest (e.g. CVD)

Yi = 0 otherwise

and

• β0 is the coefficient on the constant term,

• β1 is the coefficient on the independent variable,

• X1 is the independent variable, e.g. age, and

• ε is the error term.

Let Yi = 1 if the i-th individual has CVD, and 0 otherwise.

Let Yi also take the values 1 and 0 with probabilities pi and 1 − pi, respectively,

i.e. P(Yi = 1) = P(CVD = 1) = pi

P(Yi = 0) = P(CVD = 0) = 1 − pi

Why not just use Simple Linear (OLS) regression?

Consider a simple OLS regression model

CVD = β0 + β1Age + ε,

Assumptions

a) ε ~ N(0, σ²)

b) var(ε) is constant, i.e. homoscedasticity

Binary outcome variables violate these assumptions…

Why not just use Simple Linear (OLS) regression?

• CVD is binary: it takes on only two values. Consequently, for any given age ‘ε’ can also take only two values, and therefore the ‘normality of residuals’ assumption is violated.

• The error terms are heteroscedastic (their variance, p(1 − p), changes with the probability p), so the regression assumption that the variance of the error term is constant is violated.

• The predicted probabilities can be greater than 1 or less than 0, which can be a problem if the predicted values are used in a subsequent analysis!
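The out-of-range problem is easy to reproduce. The sketch below fits an OLS line to a binary outcome using deliberately simple hypothetical data (everyone aged 50 or over has CVD, nobody younger does) and shows fitted "probabilities" below 0 and above 1. The data and names are illustrative assumptions.

```python
# Hypothetical data: ages 20 to 80, with CVD coded 1 from age 50 upwards.
ages = list(range(20, 81))
cvd = [1 if a >= 50 else 0 for a in ages]

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(cvd) / n

# OLS fit of the binary outcome on age (a "linear probability model").
sxx = sum((x - mean_x) ** 2 for x in ages)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, cvd))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

fitted = [intercept + slope * x for x in ages]

print(f"fitted value at age 20: {fitted[0]:.3f}")   # below 0
print(f"fitted value at age 80: {fitted[-1]:.3f}")  # above 1
```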

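Logit transformation

1. Move from probabilities to Odds

$$\text{Odds}_i = \frac{P_i}{1 - P_i} = \frac{P(\text{CVD exists})}{P(\text{CVD doesn't exist})}$$

2. Take logs of both sides, to get log-odds or Logit

$$\text{logit}(P_i) = \log(\text{odds}_i) = \log\left(\frac{P_i}{1 - P_i}\right) = \beta_0 + \beta_1 \text{Age}_i$$

or equivalently,

$$P_i = P(\text{CVD exists}) = \frac{\exp(\beta_0 + \beta_1 \text{Age}_i)}{1 + \exp(\beta_0 + \beta_1 \text{Age}_i)}$$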
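The two steps of the transformation and its inverse can be written as three small functions. This is a minimal sketch with illustrative function names; the probabilities passed in are arbitrary examples.

```python
import math


def odds(p):
    """Step 1: probability to odds."""
    return p / (1 - p)


def logit(p):
    """Step 2: odds to log-odds (the logit)."""
    return math.log(odds(p))


def inv_logit(z):
    """Back from log-odds to a probability: exp(z) / (1 + exp(z))."""
    return math.exp(z) / (1 + math.exp(z))


for p in (0.1, 0.4, 0.5, 0.75, 0.9):
    print(f"p = {p:.2f}  odds = {odds(p):.3f}  logit = {logit(p):+.3f}")

# Any value of the linear predictor maps back to a probability between 0 and 1.
for z in (-5.0, 0.0, 5.0):
    print(f"log-odds = {z:+.1f}  p = {inv_logit(z):.4f}")
```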

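Moving from probabilities to odds removes the ceiling restriction (a probability cannot exceed 1, but odds can be any positive number). The Logit transformation then removes the floor restriction (odds cannot fall below 0, but log-odds can take any value).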

Logistic Regression Output

Part of this output is in the form of odds, odds ratios and probabilities.

An understanding of these concepts (both marginal and conditional) is therefore essential to interpreting logistic regression output.

Key question to be explored: What factors determine the probability that an individual will or will not develop CVD?

Marginal & Conditional odds

               Smokers   Non-smokers   Row total
CVD               75          40          115
No CVD            25          60           85
Column total     100         100          200

• The odds of having CVD are 115/85 = 1.353. These are the marginal or unconditional odds of having CVD.

• The conditional odds of having CVD, given the category “Smokers”, are 75:25, or 3. A smoker is 3 times as likely to have CVD as not to have it.

• The conditional odds of having CVD, given the category “Non-smokers”, are 40:60, or 0.67. A non-smoker is 0.67 times as likely to have CVD as not to have it.


Probability

The probability of having CVD is 115/200 = 0.575

The probability of having CVD given that one is a smoker is 75/100 = 0.75

The probability of having CVD given that one is a non-smoker is 40/100 = 0.40
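The marginal and conditional odds and probabilities can be recomputed directly from the four cell counts of the smoking and CVD table (75, 40, 25 and 60). The sketch below uses those counts; the dictionary layout and function names are illustrative choices.

```python
# Cell counts from the smoking / CVD table.
table = {
    "smoker": {"cvd": 75, "no_cvd": 25},
    "non_smoker": {"cvd": 40, "no_cvd": 60},
}


def odds_from_counts(events, non_events):
    """Odds: events relative to non-events."""
    return events / non_events


def prob_from_counts(events, non_events):
    """Probability: events relative to everyone."""
    return events / (events + non_events)


total_cvd = sum(group["cvd"] for group in table.values())        # 115
total_no_cvd = sum(group["no_cvd"] for group in table.values())  # 85

marginal_odds = odds_from_counts(total_cvd, total_no_cvd)
marginal_prob = prob_from_counts(total_cvd, total_no_cvd)
print(f"marginal: odds = {marginal_odds:.3f}, probability = {marginal_prob:.3f}")

conditional = {}
for name, group in table.items():
    conditional[name] = (
        odds_from_counts(group["cvd"], group["no_cvd"]),
        prob_from_counts(group["cvd"], group["no_cvd"]),
    )
    print(f"{name}: odds = {conditional[name][0]:.3f}, "
          f"probability = {conditional[name][1]:.3f}")
```

In each row the odds equal p / (1 − p) for the corresponding probability p.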

Odds Ratio

• The odds ratio of smokers (numerator) to non-smokers (denominator) for CVD is 3/0.667 = 4.5. (This means that the odds of smokers having CVD are 4.5 times as high as those of non-smokers having CVD.)

• The odds ratio is the cross-product ratio, i.e.

$$\frac{75 \times 60}{40 \times 25} = 4.5$$

• When one moves from being a non-smoker to a smoker, the odds of having CVD increase by 350% (i.e. from odds of 0.667 for non-smokers to 3 for smokers)

Alternative interpretation of Odds Ratio

• Loosely, smokers are 4.5 times as likely to have CVD as non-smokers, and the risk of having CVD is 4.5 times greater for smokers than for non-smokers. Strictly, these statements refer to odds rather than probabilities: the ratio of the two probabilities is 0.75/0.40 = 1.875.

• The odds of CVD for smokers are 350% higher than the odds of CVD for non-smokers ((4.5 − 1.00) × 100%)

• The predicted odds for smokers are 4.5 times the odds for non-smokers.

• A one unit change in the independent variable Smoker (non-smoker to smoker) increases the odds of having CVD by a factor of 4.5.
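The sketch below recomputes the odds ratio from the table counts (75, 40, 25, 60); the exact cross-product ratio is (75 × 60)/(40 × 25) = 4.5. It also shows the link to logistic regression: with a single 0/1 predictor the maximum likelihood fit reproduces the observed group odds, so the coefficients can be written down directly as log-odds. Variable names are illustrative.

```python
import math

# Cell counts from the smoking / CVD table.
cvd_smokers, no_cvd_smokers = 75, 25
cvd_non_smokers, no_cvd_non_smokers = 40, 60

odds_smokers = cvd_smokers / no_cvd_smokers              # 3.0
odds_non_smokers = cvd_non_smokers / no_cvd_non_smokers  # about 0.667

# Odds ratio two ways: ratio of conditional odds, and cross-product ratio.
odds_ratio = odds_smokers / odds_non_smokers
cross_product = (cvd_smokers * no_cvd_non_smokers) / (cvd_non_smokers * no_cvd_smokers)

# Percentage change in the odds when moving from non-smoker to smoker.
pct_increase = (odds_ratio - 1) * 100

# Logistic regression with one 0/1 predictor (smoker = 1):
# intercept = log-odds for non-smokers, slope = log of the odds ratio.
b0 = math.log(odds_non_smokers)
b1 = math.log(odds_ratio)
p_smoker = math.exp(b0 + b1) / (1 + math.exp(b0 + b1))

# For comparison: the ratio of probabilities (relative risk) is much smaller.
relative_risk = (cvd_smokers / 100) / (cvd_non_smokers / 100)

print(f"odds ratio = {odds_ratio:.3f} (cross-product {cross_product:.3f})")
print(f"increase in odds = {pct_increase:.1f}%")
print(f"exp(b1) = {math.exp(b1):.3f}, predicted P(CVD | smoker) = {p_smoker:.3f}")
print(f"relative risk = {relative_risk:.3f}")
```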

References

• Altman D.G. 1991. Practical Statistics for Medical Research (London: Chapman & Hall/CRC)

• Gujarati D.N. 1995. Basic Econometrics (New York: McGraw-Hill, Inc)

• Johnston J. and J. DiNardo. 1997. Econometric Methods (London: The McGraw-Hill Companies, Inc)

• Long J.S. 1997. Regression Models for Categorical and Limited Dependent Variables. A Volume in the Sage Series for Advanced Quantitative Techniques (Thousand Oaks, CA: Sage Publications)

• Wang, MinQi, James M. Eddy, Eugene C. Fitzhugh. 1995. "Application of Odds Ratio and Logistic Models in Epidemiology and Health Research," Health Values 19: 59-62.

• Nathwani et al. 2004. “An economic evaluation of a European cohort from a multinational trial of linezolid versus teicoplanin in serious Gram-positive bacterial infections: the importance of treatment setting in evaluating treatment effects” International Journal of Antimicrobial Agents 23: 315–324

• Manca A, Hawkins N, Sculpher M. 2005. “Estimating mean QALYs in trial-based cost-effectiveness analysis: the importance of controlling for baseline utility” Health Economics 14: 487-496

• Hoch et al. 2002. “Something old, something new, something blue: a framework for the marriage of health econometrics and cost-effectiveness analysis” Health Econ 11: 415–430

• Gray et al. 2006. "Estimating the association between SF-12 responses and EQ-5D utility values by response mapping", Med Decis Making, vol. 26, no. 1, pp. 18-29

• Kaambwa et al. 2011. “Mapping utility scores from the Barthel index", Eur. Journal of Health Economics, DOI: 10.1007/s10198-011-0364-5

• Sengupta et al. 2004. "Mapping the SF-12 to the HUI3 and VAS in a managed care population", Med Care, 42, 9: 927-937

• Smith et al. 2007. “Predicting Costs of Care in Chronic Kidney Disease: The Role of Comorbid Conditions”, The Internet Journal of Nephrology 4, 1

• Bonizzato et al. 2000. “Community-based mental health care: to what extent are service costs associated with clinical, social and service history variables?” Psychological Medicine, 30: 1205-1215

• Baumeister et al. 2009. “Predictive modeling of health care costs: do cardiovascular risk markers improve prediction?” European Journal of Cardiovascular Prevention & Rehabilitation

• Hoch et al. 2006. “Using the net benefit regression framework to construct cost-effectiveness acceptability curves: an example using data from a trial of external loop recorders versus Holter monitoring for ambulatory monitoring of "community acquired" syncope”, BMC Health Services Research, 6:68

• Billingham LJ et al. 2002. “Patterns, costs and cost-effectiveness of care in a trial of chemotherapy for advanced non-small cell lung cancer: evidence from a randomised trial” Lung Cancer 37: 219-225

• Engels, J.M. & Diehr, P. 2003. “Imputation of missing longitudinal data: a comparison of methods”, Journal of Clinical Epidemiology 56: 968–976

• Blazer et al. 1995. “Health Services Access and Use among Older Adults in North Carolina: Urban vs Rural Residents” American Journal of Public Health, 85, 10: 1384-1390

• Barber, J. & Thompson, S. 2004. “Multiple regression of cost data: use of generalised linear models”, J Health Serv Res Policy 9: 197-204

• Kaambwa, B., Bryan, S., Barton, P., Parker, H., Martin, G., Hewitt, G., Parker, S., & Wilson, A. 2008. "Costs and health outcomes of intermediate care: results from five UK case study sites", Health Soc. Care Community 16: 573-581

• Raine et al. 2010. “Social variations in access to hospital care for patients with colorectal, breast, and lung cancer between 1999 and 2006: retrospective analysis of hospital episode statistics”, BMJ 340: b5479


Exercises

• OLS regression

• Logistic Regression