
Page 1:

Unit 4a: Basic Logistic (Binomial Logit) Regression Analysis

© Andrew Ho, Harvard Graduate School of Education Unit 4a – Slide 1

http://xkcd.com/74/
http://xkcd.com/210/

Page 2:

• Exploratory Data Analysis with dichotomous outcome variables
• How our familiar regression model fails our data
• An initial look at logistic regression results


Multiple Regression Analysis (MRA)

Yᵢ = β₀ + β₁X₁ᵢ + β₂X₂ᵢ + εᵢ

Do your residuals meet the required assumptions?
• Test for residual normality.
• Use influence statistics to detect atypical data points.

If your residuals are not independent:
• Replace OLS by GLS regression analysis.
• Use individual growth modeling.
• Specify a multilevel model.
• If time is a predictor, you need discrete-time survival analysis…

If your outcome is categorical, you need to use…
• Binomial logistic regression analysis (dichotomous outcome)
• Multinomial logistic regression analysis (polytomous outcome)

If you have more predictors than you can deal with:
• Create taxonomies of fitted models and compare them.
• Form composites of the indicators of any common construct.
• Conduct a Principal Components Analysis.
• Use Cluster Analysis.
• Use Factor Analysis: EFA or CFA?

If your outcome vs. predictor relationship is non-linear:
• Use non-linear regression analysis.
• Transform the outcome or predictor.

Course Roadmap: Unit 4a – Today’s Topic Area

Page 3:


Dataset: AT_HOME.txt

Overview: Sub-sample from the 1976 Canadian National Labor Force Survey (LFS) on the participation of married Canadian women in the labor force, in which whether a woman works as a homemaker (vs. taking a job outside the home) is investigated as a function of the husband’s salary and the presence of children in the home.

Source: Atkinson et al., 1977

Sample size: 434 married women

Info: If you are interested in this topic, you might find the Gender & Work Database informative. This collaborative project, led by Leah Vosko, Canada Research Chair in Feminist Political Economy, provides access to a library of theoretical and empirical papers on the relationship of gender and work, summary statistical tables containing descriptive statistics that document women’s positions, compensation, etc., and links to many of the major datasets on women’s labor force participation in Canada, including the LFS itself, and also the Survey of Work Arrangements, the General Social Survey, the Survey of Labour and Income Dynamics, the National Population Health Survey, and many others.

Note: I’ve removed other obvious controls and question predictors to simplify my presentation of the logistic regression approach.

Broad Research Question: In 1976, were married Canadian women who had children at home and husbands with higher salaries more likely to work at home rather than joining the labor force (when compared to their married peers with no children at home and husbands who earn less)?

The Data: A Historical Look at Canadian Gender and Work Patterns

[Path diagrams: Husband’s Income and Children as predictors of Works at Home, first as direct effects, then with a lurking confound noted.]

Page 4:


Structure of Dataset

Col. # | Variable Name | Variable Description | Variable Metric/Labels
1 | HOME | Does the married woman work in the home as a homemaker? | Dichotomous outcome variable: 0 = no, 1 = yes
2 | HUBSAL | The husband’s annual income in 1976 Canadian dollars. | $1000’s
3 | CHILD | Are there children present in the home? | Dichotomous predictor variable: 0 = no, 1 = yes

[Excerpt of raw AT_HOME.txt data rows (HOME HUBSAL CHILD): 1 15 1, 1 13 1, 1 45 1, …]

We’ve already demonstrated the use of dichotomous predictors. Why not dichotomous outcome variables?

We’ll try it and see. What could possibly go wrong?

HOME is a categorical (dichotomous) outcome variable.

What’s the best way to model the relationship between a binary outcome and regular predictors like CHILD and HUBSAL?

Eyes on the Data

Page 5:


*---------------------------------------------------------------------------------
* Input the raw dataset, name and label the variables and selected values.
*---------------------------------------------------------------------------------
* Input the target dataset:
infile HOME HUBSAL CHILD ///
    using "C:\Users\Andrew Ho\Documents\Dropbox\S-052\Raw Data\AT_HOME.txt"

* Label the principal variables:
label variable HOME "Is Woman a Homemaker?"
label variable HUBSAL "Husband's Annual Salary (in $1,000)"
label variable CHILD "Are Children Present in the Home?"

* Label the values of important categorical variables:
* Dichotomous outcome HOME:
label define homelbl 0 "In Labor Force" 1 "Homemaker"
label values HOME homelbl
* Dichotomous secondary question predictor CHILD:
label define childlbl 0 "No Child" 1 "Children at Home"
label values CHILD childlbl

*--------------------------------------------------------------------------------
* Obtain descriptive statistics on the sample HOME/HUBSAL relationship.
*--------------------------------------------------------------------------------
* Examine the sample univariate distribution of HOME:
hist HOME, discrete percent ylabel(0(20)100) xlabel(0(1)1) name(Unit4a_g1)
summarize HOME

* Inspect the sample bivariate relationship of outcome HOME and predictor HUBSAL:
scatter HOME HUBSAL, jitter(7) msize(small) name(Unit4a_g2, replace)
graph hbox HUBSAL, over(HOME, descending) name(Unit4a_g3, replace)


Standard input statements

As I have illustrated in earlier Stata code, where categorical variables are involved, you can define a value label (homelbl) to contain the value labels, and then associate the label with the variable of interest, when needed.

Requests standard univariate descriptive plots and statistics on the dichotomous outcome, HOME.

Requests a bivariate plot of the dichotomous outcome HOME on the continuous predictor HUBSAL.

Loading the Data, Visualizing/Summarizing the Outcome Variable

Page 6:


Visualizing/Summarizing the Dichotomous Outcome Variable

. summarize HOME

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        HOME |       434    .7050691     .456538          0          1

For dichotomous 0/1 variables, some notation and properties to remember:

The number of observations is n = 434. For dichotomous variables, we can describe the mean, in this case .705, more generally as p̂, the proportion of 1s. 70.5% of women in the sample are homemakers.

We can define q̂ = 1 − p̂, the proportion of 0s. 29.5% of women in the sample are not homemakers.

The standard deviation is √(p̂q̂). It is maximized when p̂ = q̂ = .5. As proportions become more extreme, standard deviations drop. For example, when p̂ = .705, √(p̂q̂) ≈ .456.

The standard deviation is 0 when p̂ or q̂ is 1 or 0.

The more extreme the proportion, the less the spread of the distribution. Does this make sense?
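These properties are easy to verify numerically. A quick sketch in Python, using the sample values from the output above:

```python
import math

n = 434
p = 0.7050691  # sample proportion of homemakers, from `summarize HOME`
q = 1 - p      # proportion in the labor force

# SD of a 0/1 variable is sqrt(p*q); Stata applies the n/(n-1) sample correction.
sd = math.sqrt(p * q * n / (n - 1))
print(round(sd, 4))  # 0.4565, matching Stata's Std. Dev.

# The spread is maximized at p = .5 and shrinks as p becomes extreme:
for prop in (0.5, 0.705, 0.9, 0.99):
    print(prop, round(math.sqrt(prop * (1 - prop)), 3))
```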

[Histogram of HOME (“Is Woman a Homemaker?”): x-axis 0/1, y-axis Percent (0–100), with p̂ = 70.5% at HOME = 1 and q̂ = 29.5% at HOME = 0.]

. hist HOME, discrete percent ylabel(0(20)100) xlabel(0(1)1)

Page 7:

[Figure: jittered scatterplot of HOME (“Is Woman a Homemaker?”, 0–1) against HUBSAL (“Husband’s Annual Salary (in $1,000)”, 0–50).]


The Bivariate Distribution of HOME on HUBSAL

Jittered scatterplot showing the sample relationship between the dichotomous outcome variable HOME and the continuous predictor, HUBSAL.

[Horizontal box plots of HUBSAL for the two HOME categories, “In Labor Force” and “Homemaker.”]

Page 8:


Assumption (Linear Outcome/Predictor Relationships): “In the population, … the bivariate relationship between the outcome and each predictor must be linear.”

How does failure of the assumption affect OLS regression analysis? If the modeled relationship is not linear, then it will be misrepresented by the linear regression analysis, and the fundamental underpinnings of the entire analysis are at risk:
• The OLS-estimated regression slope will not represent the population relationship.
• Assumptions about the population residuals (sometimes called, simply, “errors”) will be violated.
• Estimated residuals will be incorrect.
• Statistical inference will be incorrect.

High-priority conditions must be met for accurate statistical inference with linear OLS regression. (Most of this falls under the heading of “independent and identically normally distributed errors.”)

Regression as a Model for the Conditional Mean

At any slice of the outcome at a given predictor value, the residuals should have an average of 0 in the population. Variance around this 0 point should be similar across predictor values (homoscedasticity).

Page 9:



The Linearity Assumption

[Figure: jittered scatterplot of HOME vs. HUBSAL with two fitted lines, a blue local polynomial fit and a red linear fit.]

The blue line, the “local polynomial fit,” is trying to describe the conditional mean of HOME at any point HUBSAL.

The red line, our familiar linear regression model, is trying to do the same thing under a linear constraint. What is our predicted value when …

Page 10:



Fitting a linear model to a dichotomous outcome variable

. regress HOME HUBSAL

      Source |       SS       df       MS              Number of obs =     434
-------------+------------------------------           F(  1,   432) =   20.94
       Model |  4.17245021     1  4.17245021           Prob > F      =  0.0000
    Residual |  86.0763977   432  .199250921           R-squared     =  0.0462
-------------+------------------------------           Adj R-squared =  0.0440
       Total |  90.2488479   433  .208426901           Root MSE      =  .44638

------------------------------------------------------------------------------
        HOME |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      HUBSAL |   .0140262   .0030651     4.58   0.000     .0080019    .0200506
       _cons |   .5015271   .0493712    10.16   0.000     .4044894    .5985648
------------------------------------------------------------------------------

We can rewrite our familiar regression model to account for the probabilistic interpretation of dichotomous outcomes.

On average, in the population, two women whose husbands’ salaries differ by $1,000 have probabilities of being a homemaker that differ by an estimated 1.4 percentage points.

When HUBSAL = 0, the predicted probability that a woman will be a homemaker is 50.2%.

We can reject the null hypothesis that this difference is 0 in the population (t = 4.58, p < .001).
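To see where this linear model gets into trouble, here is a quick Python sketch that plugs salaries into the fitted line, using the coefficients from the output above:

```python
# Fitted linear probability model from the slide: HOME-hat = .5015 + .0140*HUBSAL
b0, b1 = 0.5015271, 0.0140262

def lpm(hubsal):
    """Predicted 'probability' from the linear fit."""
    return b0 + b1 * hubsal

print(round(lpm(0), 3))   # 0.502, the intercept interpretation above
print(round(lpm(50), 3))  # 1.203: an impossible probability above 1
```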

Page 11:


Residual Diagnostics

[Figure: scatterplot of HOME vs. HUBSAL with the fitted regression line.]

[Figure: residuals vs. fitted values; residuals run from −1 to .5 across fitted values of roughly .4 to 1.2.]

There is quite a healthy amount of vertical variation in the middle range of fitted values, and very little vertical variation in the extremes.

We are often fairly forgiving of heteroscedasticity. We might resolve it with “weighted least squares,” if anything.

In the case of a dichotomous outcome variable, the problems (heteroscedasticity, nonlinearity) are so predictable, the implications are so atheoretical (predictions outside [0, 1], a linear fit to a nonlinear relationship), and the alternatives are so attractive and straightforward (logistic regression), that we never fit linear models to dichotomous outcomes.

Page 12:

[Figure: normal Q–Q plot of standardized residuals against the inverse normal, both axes −4 to 4.]


[Figure: histogram of standardized residuals (−2 to 1), frequency 0 to 150.]

Residual Normality

Residuals certainly don’t seem normally distributed.

. swilk STDRESID

                   Shapiro-Wilk W test for normal data

    Variable |    Obs         W          V          z     Prob>z
-------------+---------------------------------------------------
    STDRESID |    434    0.75581     72.325     10.225    0.00000

Not surprising that we reject the null hypothesis of normally distributed population residuals.

Page 13:



Wouldn’t it be nice…

Wouldn’t it be nice if there were some way to fit a… nonlinear… regression model to these data?

One way to think about this nonlinear model might be as a transformed outcome variable that “stretches” extreme proportions, accounting for the smaller variance we know exists in that region…

Page 14:


P(HOME = 1) = 1 / (1 + e^−(β₀ + β₁·HUBSAL))

Because of the linear model’s flaws, we recommend the non-linear Logistic Function as a credible and interpretable model for representing the population relationship between the underlying probability that a married woman is a homemaker and predictors like the husband’s salary, HUBSAL:

In a Logistic Regression Model, the outcome is specified in a way that is consistent with our intuition about the analysis of categorical outcomes: we model the underlying probability that a married woman is a homemaker.

The population Logistic Regression Model has a non-linear functional form so that it can provide the properties that we require for a hypothesized relationship between a probability and its predictors.

In a Logistic Regression Model, the hypothesized trend line:
• Cannot drop below zero (the “lower asymptote”),
• Cannot exceed unity (the “upper asymptote”),
• Makes a smooth and sensible transition between these asymptotes.

But what do these parameters represent?

The Logistic Regression Model
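These asymptote properties can be checked numerically. A minimal sketch in Python, using illustrative parameter values (not the fitted ones):

```python
import math

def logistic(x, b0, b1):
    """P = 1 / (1 + e^-(b0 + b1*x)), the logistic function from the slide."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

b0, b1 = -0.5, 0.1  # illustrative values
probs = [logistic(x, b0, b1) for x in range(-100, 101, 10)]

# Strictly between the asymptotes 0 and 1, smoothly increasing for b1 > 0:
print(round(min(probs), 6), round(max(probs), 6))
print(all(a < b for a, b in zip(probs, probs[1:])))  # True
```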

Page 15:


Here are some Excel plots, to provide intuition into how the Logistic Regression Model works.

[Figure: three logistic curves of p(HOME = 1) against HUBSAL (−40 to 40), plotted from a table of Excel-computed probabilities. The parameter values implied by the tabulated probabilities: a shared slope of β₁ = 0.1 and intercept parameters β₀ = 0.1, 1.0, and 2.0.]

When β₀ is larger, the logistic curve cuts through the vertical axis at a higher elevation (i.e., the intercept is larger).

All logistic curves approach an upper asymptote of 1 and a lower asymptote of 0.

The “Intercept” Parameter, β₀, in a logistic regression model.

Page 16:


[Figure: three logistic curves of p(HOME = 1) against HUBSAL (−40 to 40), again from an Excel table. The parameter values implied by the tabulated probabilities: a shared intercept parameter of β₀ = 0.1 and slope parameters β₁ = 0.06, 0.10, and 0.14.]

When β₁ is larger, the logistic curve approaches the upper asymptote more steeply.

And a few more…

The “Slope” Parameter, β₁, in a logistic regression model.
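The two parameter roles can be checked numerically. A sketch in Python; the β values below mirror the curves implied by the Excel tables above, and the midpoint-slope identity dP/dx = β₁·P·(1 − P) is a standard property of the logistic function:

```python
import math

def logistic(x, b0, b1):
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# Larger b0 lifts the curve where it crosses the vertical axis:
heights = [logistic(0, b0, 0.1) for b0 in (0.1, 1.0, 2.0)]
print([round(h, 3) for h in heights])  # increasing: [0.525, 0.731, 0.881]

# Larger b1 steepens the curve; at the midpoint (P = .5) the slope is b1/4:
for b1 in (0.06, 0.10, 0.14):
    x_mid = -0.1 / b1                      # solves b0 + b1*x = 0 with b0 = 0.1
    p = logistic(x_mid, 0.1, b1)
    print(b1, round(b1 * p * (1 - p), 3))  # b1/4: 0.015, 0.025, 0.035
```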

Page 17:


P(HOME = 1) = 1 / (1 + e^−(β₀ + β₁·HUBSAL))

This will be our statistical model for relating a categorical outcome to predictors. We will fit it to data using nonlinear regression analysis.

We consider the non-linear Logistic Regression Model for representing the hypothesized population relationship between the dichotomous outcome, HOME, and predictors.

The outcome being modeled is the underlying probability that the value of outcome HOME equals 1.

Parameter β₁ determines the slope of the curve, but is not equal to it (in fact, the slope is different at every point on the curve).

Parameter β₀ determines the intercept of the curve, but is not equal to it.

The Logistic Regression Model

Page 18:


Building the Logistic Regression Model: The Unconditional Model

. logit HOME

Iteration 0:   log likelihood = -263.22441
Iteration 1:   log likelihood = -263.22441

Logistic regression                               Number of obs   =        434
                                                  LR chi2(0)      =       0.00
                                                  Prob > chi2     =          .
Log likelihood = -263.22441                       Pseudo R2       =     0.0000

------------------------------------------------------------------------------
        HOME |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |   .8715548   .1052638     8.28   0.000     .6652415    1.077868
------------------------------------------------------------------------------

To gain our footing, we can fit an unconditional logistic model:

This should look familiar: the intercept is the log-odds of the unconditional percentage of women who are homemakers in our sample:

. summarize HOME

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        HOME |       434    .7050691     .456538          0          1

We recall from multilevel modeling that we wish to maximize the likelihood (“maximum likelihood”).

Because the likelihood is a product of many, many small probabilities, we instead maximize the sum of log-likelihood contributions, making a negative number as close to zero as possible.

Later, we’ll use the difference in −2 × log-likelihoods (the deviance) in a statistical test to compare models.
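For the unconditional model, this log likelihood can be written down directly: each of the 306 homemakers contributes log(p̂) and each of the 128 labor-force members contributes log(1 − p̂). A sketch in Python:

```python
import math

ones, zeros = 306, 128  # homemakers and labor-force members (434 * .705 = 306)
n = ones + zeros
p = ones / n            # 0.7050691...

# Sum of Bernoulli log-likelihood contributions at the sample proportion:
loglik = ones * math.log(p) + zeros * math.log(1 - p)
print(round(loglik, 2))  # -263.22, matching Stata's log likelihood
```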

Page 19:


Building the Logistic Regression Model

. logit HOME HUBSAL

Iteration 0:   log likelihood = -263.22441
Iteration 1:   log likelihood = -252.20292
Iteration 2:   log likelihood = -252.02492
Iteration 3:   log likelihood = -252.02479
Iteration 4:   log likelihood = -252.02479

Logistic regression                               Number of obs   =        434
                                                  LR chi2(1)      =      22.40
                                                  Prob > chi2     =     0.0000
Log likelihood = -252.02479                       Pseudo R2       =     0.0425

------------------------------------------------------------------------------
        HOME |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      HUBSAL |   .0808408   .0184165     4.39   0.000     .0447451    .1169364
       _cons |  -.2371923   .2626906    -0.90   0.367    -.7520565    .2776718
------------------------------------------------------------------------------

Our fitted model

Before we interpret these coefficients directly, it is generally easiest to visualize the fitted model graphically.

We notice that our log likelihood is closer to zero than before (a better fit, from −263 to −252), but the model took a few more iterations to converge (increased complexity given the predictor).

Equivalently, the deviance (−2 × log-likelihood) decreases from about 526 to 504.
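The deviance comparison can be reproduced from the two log likelihoods reported above; a quick Python check:

```python
ll_null = -263.22441  # logit HOME (unconditional)
ll_full = -252.02479  # logit HOME HUBSAL

dev_null, dev_full = -2 * ll_null, -2 * ll_full
print(round(dev_null, 1), round(dev_full, 1))  # 526.4 504.0

# The drop in deviance is the LR chi-square that Stata reports:
print(f"{dev_null - dev_full:.2f}")  # 22.40
```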

Page 20:

[Figures: jittered scatterplots of HOME (“Is Woman a Homemaker?”) against HUBSAL (“Husband’s Annual Salary (in $1000)”, 0–50), overlaid with the fitted curves.]

Graphical Interpretation of the Logistic Regression Model

p̂(HOME = 1) = 1 / (1 + e^−(−.237 + .081·HUBSAL))

Comparing local polynomial, linear, and logistic fits to the data.
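A sketch of the fitted curve in Python, using the coefficients above, confirms that the logistic predictions stay strictly inside (0, 1) across the observed salary range, unlike the earlier linear fit:

```python
import math

def p_home(hubsal):
    """Fitted model: P(HOME = 1) = 1 / (1 + e^-(-.237 + .081*HUBSAL))."""
    return 1.0 / (1.0 + math.exp(-(-0.2371923 + 0.0808408 * hubsal)))

# Predicted probabilities rise smoothly with salary but never reach 1:
for hubsal in (0, 10, 25, 50):
    print(hubsal, round(p_home(hubsal), 3))
```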