binary logistic regression

Post on 18-Jan-2016


DESCRIPTION

Binary Logistic Regression “To be or not to be, that is the question..”(William Shakespeare, “Hamlet”). Binary Logistic Regression. Also known as “logistic” or sometimes “logit” regression Foundation from which more complex models derived - PowerPoint PPT Presentation

TRANSCRIPT

Binary Logistic Regression

“To be or not to be, that is the question.” (William Shakespeare, “Hamlet”)

Binary Logistic Regression

Also known as “logistic” or sometimes “logit” regression

Foundation from which more complex models derived, e.g., multinomial regression and ordinal logistic regression

Dichotomous Variables

Two categories indicating whether an event has occurred or some characteristic is present

Sometimes called “binary” or “binomial” variables

Dichotomous DVs

Placed in foster care or not

Diagnosed with a disease or not

Abused or not

Pregnant or not

Service provided or not

Single (Dichotomous) IV Example

DV = continue fostering, 0 = no, 1 = yes

Customary to code category of interest 1 and the other category 0

IV = married, 0 = not married, 1 = married

N = 131 foster families

Are two-parent families more likely to continue fostering than one-parent families?

Crosstabulation

Table 2.1

Relationship between marital status and continuation is statistically significant [χ²(1, N = 131) = 5.65, p = .017]

A higher percentage of two-parent families (62.20%) than single-parent families (40.82%) planned to continue fostering

Strength & Direction of Relationships

Different ways to quantify the relationship between IV(s) and DV

Probabilities

Odds

Odds Ratio (OR)

• Also abbreviated as e^B, Exp(B) (on SPSS output), or exp(B)

% change

Roadmap to Computations

Probabilities

Odds: p / (1 - p)

Odds Ratios: Odds(1) / Odds(0)

% change: 100(OR - 1)

Probabilities

Percentages in Table 2.1 as probabilities (e.g., 62.20% as .6220)

p

• Probability that event will occur (continue)

• e.g., probability that one-parent families plan to continue is .4082

1 – p

• Probability that event will not occur (not continue)

• e.g., probability that one-parent families do not plan to continue is .5918 (1 - .4082)

Odds

Ratio of probability that event will occur to probability that it will not

e.g., odds of continuation for one-parent families are .69 (.4082 / .5918)

Can range from 0 to positive infinity

odds = p / (1 - p)

Probabilities and Odds

Table 2.2

Odds = 1

• Both outcomes equally likely

Odds > 1

• Probability that event will occur greater than probability that it will not

Odds < 1

• Probability that event will occur less than probability that it will not

Odds Ratio (OR)

Odds of the event for one value of the IV (two-parent families) divided by the odds for a different value of the IV, usually a value one unit lower (one-parent families)

e.g., odds of continuing for two-parent families more than double the odds for one-parent families

• OR = 1.6455 / .6898 = 2.39
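The arithmetic so far (probabilities to odds to an odds ratio) can be sketched in a few lines of Python. This is only an illustration: the function and variable names are assumptions made for this example, and the only inputs taken from the slides are the two probabilities from Table 2.1 (.4082 and .6220).

```python
def odds(p):
    """Convert a probability into odds: p / (1 - p)."""
    return p / (1 - p)

# Probabilities of planning to continue, as reported from Table 2.1
p_one_parent = 0.4082
p_two_parent = 0.6220

odds_one_parent = odds(p_one_parent)   # about .69
odds_two_parent = odds(p_two_parent)   # about 1.65

# Odds ratio: odds for IV = 1 divided by odds for IV = 0
odds_ratio = odds_two_parent / odds_one_parent

print(round(odds_one_parent, 4), round(odds_two_parent, 4), round(odds_ratio, 2))
```

The odds ratio is built from the odds, not from the probabilities directly, which is why it (2.39) is larger than the simple ratio of the two probabilities.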

OR (cont’d)

Plays a central role in quantifying the strength and direction of relationships between IVs and DVs in binary, multinomial, and ordinal logistic regression

OR < 1 indicates a negative relationship

OR > 1 indicates a positive relationship

OR = 1 indicates no linear relationship

ORs > 1

e.g., OR of 2.39

A one-unit increase in the independent variable increases the odds of continuing by a factor of 2.39

The odds of continuing are 2.39 times higher for two-parent compared to one-parent families

ORs < 1

e.g., OR = .50

A one-unit increase in the independent variable decreases the odds of continuing by a factor of .50

The odds that two-parent families will continue are .50 (or one-half) of the odds that one-parent families will continue

ORs < 1 (cont’d)

Compute reciprocal (i.e., 1 / .50 = 2.00)

Express relationship as opposite event of interest (e.g., discontinuing)

A one-unit increase in the independent variable increases the odds of discontinuing by a factor of 2.00

The odds that two-parent families will discontinue are 2.00 times (or twice) the odds of one-parent families

OR to Percentage Change

% change = 100(OR – 1)

Alternative way to express OR

e.g., A one-unit increase in the independent variable increases the odds of continuing by 139.00%

• 100(2.39 – 1) = 139.00

e.g., A one-unit increase in the independent variable decreases the odds of continuing by 50.00%

• 100(.50 – 1) = -50.00

Comparing OR > 1 and OR < 1

Compute reciprocal of one of the ORs

e.g., OR of 2.00 and an OR of .50

• Reciprocal of .50 is 2.00 (1 / .50 = 2.00)

• ORs are equal in size (but not in direction of the relationship)
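A minimal Python sketch of the two conversions just described: expressing an OR as a percentage change, and taking the reciprocal of an OR below 1 so that it can be compared in size with an OR above 1. The helper names `percent_change` and `reciprocal` are hypothetical, chosen only for this illustration.

```python
def percent_change(odds_ratio):
    """Percentage change in the odds per one-unit increase: 100(OR - 1)."""
    return 100 * (odds_ratio - 1)

def reciprocal(odds_ratio):
    """Reciprocal of an OR, i.e., the OR for the opposite event."""
    return 1 / odds_ratio

change_up = percent_change(2.39)    # odds of continuing increase
change_down = percent_change(0.50)  # odds of continuing decrease
flipped = reciprocal(0.50)          # OR for discontinuing

print(round(change_up, 2), round(change_down, 2), flipped)
```

Because the reciprocal of .50 equals 2.00, an OR of .50 and an OR of 2.00 describe relationships of the same strength in opposite directions.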

Qualitative Descriptors for OR

Table 2.3

Use cautiously with IVs that aren’t dichotomous

Question & Answer

Are two-parent families more likely to continue fostering than one-parent families?

Yes. The odds of continuing are 2.39 times (139%) higher for two-parent compared to one-parent families. The probability of continuing is .41 for one-parent families and .62 for two-parent families.

Binary Logistic Regression Example

DV = continue fostering, 0 = no, 1 = yes

Customary to code category of interest 1 and the other category 0

IV = married, 0 = not married, 1 = married

N = 131 foster families

Are two-parent families more likely to continue fostering than one-parent families?

Statistical Significance

Table 2.4

Relationship between marital status and continuation is statistically significant (Wald χ² = 5.544, p = .019)

Direction of Relationship

B = slope

Positive slope, positive relationship

• OR > 1

Negative slope, negative relationship

• OR < 1

0 slope, no linear relationship

• OR = 1

Direction/Strength of Relationship

Positive relationship between marital status and continuation

Two-parent families more likely to continue

B = .869

Exp(B) = OR = 2.385

• % change = 100(2.385 - 1) = 139%

The odds of continuing are 2.39 times (139%) higher for two-parent compared to one-parent families

Roadmap to Computations

Logits: ln(p / (1 – p)) = L

• L is short for ln(p / (1 – p))

Odds: e^L

Probabilities: e^L / (1 + e^L)

Odds Ratios: Odds(1) / Odds(0)

% change: 100(OR - 1)

Binary Logistic Regression Model

ln(π / (1 - π)) = α + β1X1 + β2X2 + … + βkXk, or

ln(π / (1 - π)) = η

π is the probability of the event

η (eta) is the abbreviation for the linear predictor (right hand side of this equation)

k = number of independent variables

Logit Link

ln(π / (1 - π))

Log of the odds that the DV equals 1 (event occurs)

Connects (i.e., links) DV to linear combination of IVs

Estimated Logits (L)

ln(p / (1 - p)) = a + B1X1 + B2X2 + … + BkXk

ln(p / (1 – p))

• Log of the odds that the DV equals 1 (event occurs)

• Estimated logit, L

• Does not have intuitive or substantive meaning

• Useful for examining curvilinear relationships and interaction effects

• Primarily useful for estimating probabilities, odds, and ORs

Estimated Logits (L)

L(Continue) = a + B_Married X_Married

L(Continue) = -.372 + (.869)(X_Married)

a = intercept

B = slope

Logit to Odds

If L = 0: Odds = e^L = e^0 = 1.00

If L = .50: Odds = e^L = e^.50 = 1.65

If L = 1.00: Odds = e^L = e^1.00 = 2.72

Logits to Odds (cont’d)

Table 2.4

One-parent families

• L(Continue) = -.372 = -.372 + (.869)(0)

• Odds of continuing = e^(-.372) = .69

Two-parent families

• L(Continue) = .497 = -.372 + (.869)(1)

• Odds of continuing = e^(.497) = 1.65

Odds to OR

OR = 1.65 / .69 = 2.39, or

e^(.869) = 2.39, labeled Exp(B)

Table 2.4
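The route from estimated logits to odds and then to the OR can be checked with a short Python sketch. The intercept (-.372) and slope (.869) are the values reported in the slides; the variable names are assumptions for this example. Because the slides report coefficients rounded to three decimals, recomputed values may differ from the slides in the last digit.

```python
import math

a = -0.372   # intercept reported in the slides
b = 0.869    # slope for married (0 = not married, 1 = married)

# Estimated logits for each group
logit_one_parent = a + b * 0
logit_two_parent = a + b * 1

# Odds = e^L
odds_one_parent = math.exp(logit_one_parent)
odds_two_parent = math.exp(logit_two_parent)

# Two equivalent routes to the odds ratio
or_from_odds = odds_two_parent / odds_one_parent
or_from_slope = math.exp(b)   # labeled Exp(B) on SPSS output

print(round(odds_one_parent, 3), round(odds_two_parent, 3), round(or_from_slope, 3))
```

The two routes agree because e^(a + B) / e^a = e^B, so the intercept cancels out of the odds ratio.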

OR to Percentage Change

% change = 100(OR – 1)

e.g., A one-unit increase in the independent variable increases the odds of continuing by 139.00%

• 100(2.39 – 1) = 139.00

e.g., A one-unit increase in the independent variable decreases the odds of continuing by 50.00%

• 100(.50 – 1) = -50.00

Logits to Probabilities

One-parent families, L(Continue) = -.372

Two-parent families, L(Continue) = .497

p(Continue) = e^L / (1 + e^L)

One-parent families: p(Continue) = e^(-.372) / (1 + e^(-.372)) = .41

Two-parent families: p(Continue) = e^(.497) / (1 + e^(.497)) = .62
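The logit-to-probability conversion, p = e^L / (1 + e^L), can be sketched as follows. The function name `logit_to_probability` is hypothetical; the two logits (-.372 and .497) are the values reported in the slides.

```python
import math

def logit_to_probability(logit):
    """Convert an estimated logit L into a probability: e^L / (1 + e^L)."""
    return math.exp(logit) / (1 + math.exp(logit))

p_one_parent = logit_to_probability(-0.372)
p_two_parent = logit_to_probability(0.497)

print(round(p_one_parent, 2), round(p_two_parent, 2))
```

A logit of 0 corresponds to a probability of .50, so the negative logit for one-parent families gives a probability below .50 and the positive logit for two-parent families gives one above it.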

Question & Answer

Are two-parent families more likely to continue fostering than one-parent families?

Yes. The odds of continuing are 2.39 times (139%) higher for two-parent compared to one-parent families. The probability of continuing is .41 for one-parent families and .62 for two-parent families.

Single (Quantitative) IV Example

DV = continue fostering, 0 = no, 1 = yes

• Customary to code category of interest 1 and other category 0

IV = number of resources

N = 131 foster families

Are foster families with more resources more likely to continue fostering?

Statistical Significance

Table 2.5

Relationship between resources and continuation is statistically significant (Wald χ² = 4.924, p = .026)

H0: β = 0, β ≥ 0, β ≤ 0, same as H0: OR = 1, OR ≥ 1, OR ≤ 1

Likelihood ratio χ² better than Wald

Direction/Strength of Relationship

Positive relationship between resources and continuation

Families with more resources are more likely to continue

B = .212

Exp(B) = OR = 1.237

• % change = 100(1.237 – 1) = 24%

The odds of continuing are 1.24 times (24%) higher for each additional resource

Estimated Logits

L(Continue) = -1.227 + (.212)(X)
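To see how this equation turns into logits, odds, and probabilities across the range of the IV, one can tabulate it for 1 through 11 resources. The sketch below is an illustration only: the intercept and slope come from the slides, while the function names and the loop are assumptions made for this example.

```python
import math

A = -1.227   # intercept reported in the slides
B = 0.212    # slope for number of resources

def logit(resources):
    """Estimated logit for a given number of resources."""
    return A + B * resources

def odds(resources):
    """Estimated odds of continuing: e^L."""
    return math.exp(logit(resources))

def probability(resources):
    """Estimated probability of continuing: e^L / (1 + e^L)."""
    return odds(resources) / (1 + odds(resources))

# One row per number of resources: logit, odds, probability
for x in range(1, 12):
    print(x, round(logit(x), 2), round(odds(x), 2), round(probability(x), 2))
```

Each additional resource adds the same amount (.212) to the logit, multiplies the odds by the same factor (about 1.24), and changes the probability by an amount that depends on where the family starts.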

Figures

Resources.xls

Figure: Effect of Resources on Continuation (Logits), logits plotted against Resources

Resources: 1 2 3 4 5 6 7 8 9 10 11

Logits: -1.01 -0.80 -0.59 -0.38 -0.16 0.05 0.26 0.47 0.68 0.90 1.11

Figure: Effect of Resources on Continuation (Odds), odds plotted against Resources

Resources: 1 2 3 4 5 6 7 8 9 10 11

Odds: 0.36 0.45 0.55 0.69 0.85 1.05 1.30 1.60 1.98 2.45 3.03

Figure: Effect of Resources on Continuation (Probabilities), probabilities plotted against Resources

Resources: 1 2 3 4 5 6 7 8 9 10 11

Probabilities: 0.27 0.31 0.36 0.41 0.46 0.51 0.56 0.62 0.66 0.71 0.75

Question & Answer

Are foster families with more resources more likely to continue fostering?

Yes. The odds of continuing are 1.24 times (24%) higher for each additional resource. The probability of continuing is .31 for families with 2 resources, .51 for families with 6 resources, and .71 for families with 10 resources.

Relationship of Linear Predictor to Logits, Odds & p

Relationship between linear predictor and logits is linear

Relationship between linear predictor and odds is non-linear

Relationship between linear predictor and p is non-linear

Challenge is to summarize changes in odds and probabilities associated with changes in IVs in the most meaningful and parsimonious way

Figure: Logit as Function of Linear Predictor (horizontal axis: Linear Predictor, -3.00 to 3.00; vertical axis: Logit, -3.00 to 3.00)

Figure: Odds as Function of Linear Predictor (horizontal axis: Linear Predictor, -3.00 to 3.00; vertical axis: Odds, .00 to 21.00)

Figure: Probabilities as Function of Linear Predictor (horizontal axis: Linear Predictor, -3.00 to 3.00; vertical axis: Probability, .00 to 1.00)

IVs to z-scores

z-scores (standard scores)

Only the IV (not DV)--semi-standardized slopes

One-unit increase in the IV refers to a one-standard-deviation increase

OR interpreted as expected change in the odds associated with a one standard deviation increase in the IV

Conversion to z-scores changes intercept, slope, and OR, but not associated test statistics
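Why the intercept, slope, and OR change under z-scoring can be shown by rewriting the raw-score equation. Substituting X = mean + SD × z into L = a + BX gives L = (a + B × mean) + (B × SD) × z. The sketch below applies this using the raw coefficients (-1.227 and .212) and the mean (6.60) and standard deviation (1.93) of resources reported elsewhere in the slides. Variable names are assumptions, and results can differ slightly from the slides' tables because the inputs are rounded.

```python
import math

a_raw, b_raw = -1.227, 0.212   # raw-score intercept and slope
mean, sd = 6.60, 1.93          # mean and SD of resources

# Semi-standardized coefficients: only the IV is converted to z-scores
a_z = a_raw + b_raw * mean     # logit at the mean (z = 0)
b_z = b_raw * sd               # change in logit per one-SD increase
or_z = math.exp(b_z)           # OR per one-SD increase

def probability(z):
    """Estimated probability of continuing at a given z-score."""
    logit = a_z + b_z * z
    return math.exp(logit) / (1 + math.exp(logit))

print(round(or_z, 2), [round(probability(z), 2) for z in (-2, 0, 2)])
```

The rescaling changes what "one unit" means, not the fitted curve, which is why the test statistics are unaffected.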

Table 2.6 (compare to Table 2.5)

Figures

zResources.xls

Figure: Effect of zResources on Continuation (Probabilities), probabilities plotted against Standardized Resources

Standardized Resources: -3 -2 -1 0 1 2 3

Probabilities: 0.26 0.34 0.44 0.54 0.64 0.73 0.80

Question & Answer

Are foster families with more resources more likely to continue fostering?

Yes. The odds of continuing are 1.51 times (51%) higher for each one standard deviation (1.93) increase in resources. The probability of continuing is .34 for families with resources two standard deviations below the mean, .54 for families with the mean number of resources (6.60), and .73 for families with resources two standard deviations above the mean.

IVs Centered

Centering

Typically center on mean

Useful when testing interactions, curvilinear relationships, or when no meaningful 0 point (e.g., no family with 0 resources)

Centering doesn’t change slope, OR, or associated test statistics, but does change the intercept
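A short Python sketch of why centering moves only the intercept: with c = X - mean, the equation L = a + BX becomes L = (a + B × mean) + B × c, so the slope (and therefore the OR) is untouched. The raw coefficients and the mean (6.60) are the values reported in the slides; the function names are hypothetical.

```python
import math

a_raw, b = -1.227, 0.212   # raw-score intercept and slope
mean = 6.60                # mean number of resources

# Centered model: same slope, new intercept (the logit at the mean)
a_centered = a_raw + b * mean

def logit_raw(x):
    """Logit from the raw-score equation."""
    return a_raw + b * x

def logit_centered(c):
    """Logit from the centered equation, where c = x - mean."""
    return a_centered + b * c

def probability(c):
    """Estimated probability of continuing at a centered value c."""
    return math.exp(logit_centered(c)) / (1 + math.exp(logit_centered(c)))

print(round(a_centered, 3), round(probability(-4), 2), round(probability(0), 2))
```

The centered intercept is interpretable as the logit for a family with the mean number of resources, which is useful here because no family has 0 resources.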

Table 2.7 (compare to Table 2.5)

Figures

cResources.xls

Figure: Effect of cResources on Continuation (Probabilities), probabilities plotted against Centered Resources

Centered Resources: -5 -4 -3 -2 -1 0 1 2 3 4 5

Probabilities: 0.29 0.34 0.39 0.44 0.49 0.54 0.60 0.65 0.69 0.74 0.77

Question & Answer

Are foster families with more resources more likely to continue fostering?

Yes. The odds of continuing are 1.24 times (24%) higher for each additional resource. The probability of continuing is .34 for families with 4 resources below the mean, .54 for families with the mean number of resources (6.60), and .74 for families with 4 resources above the mean.

Multiple IV Example

DV = continue fostering, 0 = no, 1 = yes

• Customary to code the category of interest as 1 and the other category as 0

IV = married, 0 = not married, 1 = married

IV = number of resources (z-scores)

N = 131 foster families

Are foster families with more resources more likely to continue fostering, controlling for marital status?

Statistical Significance

Table 2.12

Relationship between set of IVs and continuation is statistically significant (χ² = 6.58, p = .037)

H0: β1 = β2 = βk = 0, same as H0: ψ1 = ψ2 = ψk = 1

ψ (psi) is symbol for population value of OR

Statistical Significance (cont’d)

Table 2.13

Relationship between resources and continuation is not statistically significant, controlling for marital status (χ² = .92, p = .338)

Relationship between marital status and continuation is not statistically significant, controlling for resources (χ² = 1.42, p = .234)

H0: β = 0, β ≥ 0, β ≤ 0, same as H0: ψ = 1, ψ ≥ 1, ψ ≤ 1

ψ (psi) is symbol for population value of OR

Likelihood ratio χ² better than Wald

Statistical Significance (cont’d)

Table 2.9

Relationship between resources and continuation is not statistically significant, controlling for marital status (χ² = .91, p = .340)

Relationship between marital status and continuation is not statistically significant, controlling for resources (χ² = 1.41, p = .235)

H0: β = 0, β ≥ 0, β ≤ 0, same as H0: ψ = 1, ψ ≥ 1, ψ ≤ 1

ψ (psi) is symbol for population value of OR

Wald χ², but likelihood ratio χ² better

Estimated Logits

L(Continue) = -.183 + (.228)(X_zResources) + (.570)(X_Married)
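With two IVs, the same logit-to-odds-to-probability steps apply; the equation is simply evaluated at a chosen combination of values. The following Python sketch is illustrative only: the three coefficients are those in the equation above, and the function names and argument names are assumptions.

```python
import math

A = -0.183          # intercept
B_ZRESOURCES = 0.228  # slope for resources (z-scores)
B_MARRIED = 0.570     # slope for married (0 = not married, 1 = married)

def logit(z_resources, married):
    """Estimated logit for a given z-score and marital status."""
    return A + B_ZRESOURCES * z_resources + B_MARRIED * married

def odds(z_resources, married):
    """Estimated odds of continuing: e^L."""
    return math.exp(logit(z_resources, married))

def probability(z_resources, married):
    """Estimated probability of continuing: e^L / (1 + e^L)."""
    o = odds(z_resources, married)
    return o / (1 + o)

# Families with the mean number of resources (z = 0)
print(round(probability(0, 0), 2), round(probability(0, 1), 2))

# OR for each IV, holding the other IV constant
or_zresources = math.exp(B_ZRESOURCES)
or_married = math.exp(B_MARRIED)
```

Because the model has no interaction term, the OR for marital status is the same at every level of resources, and vice versa.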

ORs & Percentage Change

OR_zResources = 1.256 (ns)

• The odds of continuing are 1.26 times (26%) higher for each one standard deviation (1.93) increase in resources, controlling for marital status

OR_Married = 1.769 (ns)

• The odds of continuing are 1.77 times (77%) higher for two-parent compared to one-parent families, controlling for resources

Figures

Married & zResources.xls

Figure: Effect of Resources and Marital Status on Plans to Continue Fostering (Odds), odds plotted against Standardized Resources

Standardized Resources: -3 -2 -1 0 1 2 3

One-Parent: 0.42 0.53 0.66 0.83 1.05 1.31 1.65

Two-Parent: 0.74 0.93 1.17 1.47 1.85 2.32 2.92

Figure: Effect of Resources and Marital Status on Plans to Continue Fostering (Probabilities), probabilities plotted against Standardized Resources

Standardized Resources: -3 -2 -1 0 1 2 3

One-Parent: 0.30 0.35 0.40 0.45 0.51 0.57 0.62

Two-Parent: 0.43 0.48 0.54 0.60 0.65 0.70 0.74

Presenting Odds and Probabilities in Tables

Tables 2.10 and 2.11

Question & Answer

Are foster families with more resources more likely to continue fostering, controlling for marital status?

No (ns). The odds of continuing are 1.26 times (26%) higher for each one standard deviation (1.93) increase in resources, controlling for marital status.

Cont’d

Question & Answer (cont’d)

For one-parent families the probability of continuing is .35 for families with resources two standard deviations below the mean, .45 for families with the mean number of resources, and .57 for families with resources two standard deviations above the mean. For two-parent families the probability of continuing is .48 for families with resources two standard deviations below the mean, .60 for families with the mean number of resources, and .70 for families with resources two standard deviations above the mean.

Comparing the Relative Strength of IVs

Size of slope and OR depend on how the IV is measured

When IVs measured the same way (e.g., two dichotomous IVs or two continuous IVs transformed to z-scores) relative strength can be compared

Nothing comparable to standardized slope (Beta)

Nested Models

Full model: IV1, IV2, IV3

Two-IV models nested within it: (IV1, IV2), (IV2, IV3), (IV1, IV3)

One-IV models nested within those: IV1, IV2, IV3

Nested Models (cont’d)

One regression model is nested within another if it contains a subset of variables included in the model within which it’s nested, and same cases are analyzed in both models

The more complex model called the “full model”

The nested model called the “reduced model”

Comparison of full and reduced models allows you to examine whether one or more variable(s) in the full model contribute to explanation of the DV

Sequential Entry of IVs

Used to compare full and reduced models

• e.g., family resources entered first, and then marital status

Fchange used in linear regression

Sequential Entry of IVs (cont’d)

SPSS GZLM doesn’t allow sequential entry of IVs

• Estimate models separately and compare omnibus likelihood ratio χ² values

Reduced model χ²(1) = 5.168

Full model χ²(2) = 6.585

χ² difference = 6.585 – 5.168 = 1.417

df difference = 2 – 1 = 1

p = .234

Chi-square Difference.xls
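The chi-square difference test can be reproduced with the Python standard library alone. This sketch relies on a known identity that holds only when the df difference is 1: the upper-tail probability of a chi-square value x equals erfc(sqrt(x / 2)). The function name `chi2_upper_tail_df1` is hypothetical, and for other df values a statistics library would be needed instead.

```python
import math

def chi2_upper_tail_df1(x):
    """Upper-tail p-value for a chi-square statistic with 1 df.

    Uses the identity P(X > x) = erfc(sqrt(x / 2)), valid for df = 1 only.
    """
    return math.erfc(math.sqrt(x / 2))

chi2_reduced, df_reduced = 5.168, 1   # reduced model (resources only)
chi2_full, df_full = 6.585, 2         # full model (resources + marital status)

chi2_difference = chi2_full - chi2_reduced
df_difference = df_full - df_reduced
p_value = chi2_upper_tail_df1(chi2_difference)

print(round(chi2_difference, 3), df_difference, round(p_value, 3))
```

A non-significant difference, as here, suggests that adding marital status to a model already containing resources does not significantly improve it.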

Assumptions Necessary for Testing Hypotheses

No assumptions unique to binary logistic regression other than ones discussed in GZLM lecture

Model Evaluation

Evaluate your model before you test hypotheses or interpret substantive results

• Outliers

• Analogs of R²

Outliers

Atypical cases

Can lead to flawed conclusions

Can provide theoretical insights

Common causes

• Data entry errors

• Model misspecification

• Rare events

Outliers (cont’d)

Leverage

Residuals

• Standardized or unstandardized deviance residuals

Influence

• Cook’s D

Leverage

Think of a seesaw

Leverage value for each case

Cases with greater leverage can exert a disproportionately large influence

No clear benchmarks

• Identify cases with substantially different leverage values than those of other cases

Residuals

Difference between actual and estimated values of the DV for a case

Residual for each case

Large residual indicates a case for which model fits poorly

Residuals (cont’d)

Standardized or unstandardized deviance residuals

• Not normally distributed

• Values less than -2 or greater than +2 warrant some concern

• Values less than -3 or greater than +3 merit close inspection

Influence

Cases whose deletion results in substantial changes to regression coefficients

Cook’s D for each case

• Approximate aggregate change in regression parameters resulting from deletion of a case

Values of 1.0 or more indicate a problematic degree of influence for an individual case

Index Plot

Scatterplot

Horizontal axis (X)

• Case id

Vertical axis (Y)

• Leverage values, or

• Residuals, or

• Cook’s D

Index Plot: Leverage Values

Index Plot: Standardized Deviance Residuals

Index Plot: Cook’s D

Analogs of R²

None in standard use and each may give different results

Typically much smaller than R² values in linear regression

Difficult to interpret

Multicollinearity

SPSS GZLM doesn’t compute multicollinearity statistics

Use SPSS linear regression

Problematic levels

• Tolerance < .10 or VIF > 10

Additional Topics

Polychotomous IVs

Curvilinear relationships

Interactions

Overview of the Process

Select IVs and decide whether to test curvilinear relationships or interactions

Carefully screen and clean data

Transform and code variables as needed

Estimate regression model

Examine assumptions necessary to estimate binary regression model, examine model fit, and revise model as needed

Overview of the Process (cont’d)

Test hypotheses about the overall model and specific model parameters, such as ORs

Create tables and graphs to present results in the most meaningful and parsimonious way

Interpret results of the estimated model in terms of logits, probabilities, odds, and odds ratios, as appropriate

Additional Regression Models for Dichotomous DVs

Binary probit regression

Substantive results essentially indistinguishable from binary logistic regression

Choice between this and binary logistic regression largely one of convenience and discipline-specific convention

Many researchers prefer binary logistic regression because it provides odds ratios whereas probit regression does not, and binary logistic regression comes with a wider variety of fit statistics

Additional Regression Models for Dichotomous DVs (cont’d)

Complementary log-log (clog-log) and log-log models

• Probability of the event is very small or large

Loglinear regression

• Limited to categorical IVs

Discriminant analysis

• Limited to continuous IVs
