advanced models and methods in behavioral research
Advanced Methods and Models in Behavioral Research –
Advanced Models and Methods in Behavioral Research
• Chris Snijders
• [email protected]
• 3 ects
• http://www.chrissnijders.com/ammbr (=studyguide)
• literature: Field book + separate course material
• laptop exam (+ assignments)
ToDo (if not done yet):
Enroll in 0a611
The methods package
• MMBR (6 ects)
– Blumberg: questions, reliability, validity, research design
– Field: SPSS: factor analysis, multiple regression, ANcOVA, sample size etc
• AMMBR (3 ects)
– Field (1 chapter): logistic regression
– literature through website: conjoint analysis, multi-level regression
Models and methods: topics
• t-test, Cronbach's alpha, etc
• multiple regression, analysis of (co)variance and factor analysis
• logistic regression
• conjoint analysis / repeated measures
– Stata next to SPSS
– “Finding new questions”
– Some data collection
In the background: “now you should be able to deal with data on your own”
Methods in brief (1)
• Logistic regression: target Y, predictors Xi. Y is a binary variable (0/1).
- Why not just multiple regression?
- Interpretation is more difficult
- goodness of fit is non-standard
- ...
(and it is a chapter in Field)
Methods in brief (2)
• Conjoint analysis
Underlying assumption: for each user, the "utility" of an offer can be written as
U(x1,x2, ... , xn) = c0 + c1 x1 + ... + cn xn
- 10 Euro p/m
- 2 years fixed
- free phone
- ...
How attractive is this offer to you?
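A minimal Python sketch of the linear utility formula above. The attribute names, the coefficients, the intercept and the second offer are hypothetical values chosen only for illustration; in an actual conjoint analysis the coefficients would be estimated from respondents' ratings.

```python
def utility(offer, weights, intercept=0.0):
    """Linear utility U = c0 + c1*x1 + ... + cn*xn for one offer."""
    return intercept + sum(weights[name] * value for name, value in offer.items())

# Hypothetical coefficients c1..cn (not taken from the course material).
weights = {"price_per_month": -0.08, "years_fixed": -0.50, "free_phone": 1.20}

# Offer A mirrors the example offer: 10 Euro p/m, 2 years fixed, free phone.
offer_a = {"price_per_month": 10, "years_fixed": 2, "free_phone": 1}
# Offer B is an invented alternative for comparison.
offer_b = {"price_per_month": 15, "years_fixed": 1, "free_phone": 0}

u_a = utility(offer_a, weights, intercept=5.0)
u_b = utility(offer_b, weights, intercept=5.0)
print("Utility of offer A:", round(u_a, 2))
print("Utility of offer B:", round(u_b, 2))
```

Under these made-up coefficients offer A scores higher than offer B; with other coefficients the ranking could reverse.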
Conjoint analysis as an “in between method”
Between:
- Which phone do you like and why?
- What would your favorite phone be?
And:
- Let’s keep track of what people buy.
We have:
Local Master Thesis example:
Fiber to the home
- Speed: really fast
- Price: sort of high
- Installation: free!
- Your neighbors: are in!
How attractive is this to you?
(Roel Schuring)
Coming up with new ideas (3)
“More research is necessary”
But on what?
YOU: come up with sensible new ideas, given previous research
Stata next to SPSS
• It’s just better (faster, better written, more possibilities, better programmable …)
• Multi-level regression is much easier than in SPSS
• It’s good to be exposed to more than just a single statistics package (your knowledge should not be based on “where to click” arguments)
• More stable
• BTW Supports OSX as well… (anybody?)
Every advantage has a disadvantage
• Output less “polished”
• It takes some extra work to get you started
• The Logistic Regression chapter in the Field book uses SPSS (but still readable for the larger part)
• (and it’s not campus software, but subfaculty software)
• Installation …
If on Windows, try downloading
• www.chrissnijders.com/ammbr/TUeStata12-zip.exe
Logistic Regression Analysis
Credit where credit is due: slides adapted from Gerrit Rooks
That is: your Y variable is 0/1: Now what?
The main points
1. Why do we have to know and sometimes use logistic regression?
2. What is the underlying model? What is maximum likelihood estimation?
3. Logistics of logistic regression analysis
   1. Estimate coefficients
   2. Assess model fit
   3. Interpret coefficients
   4. Check residuals
4. An SPSS example
Suppose we have 100 observations with information about an individual's age and whether or not this individual had some kind of heart disease (CHD)

ID    age   CHD
1     20    0
2     23    0
3     24    0
4     25    1
…
98    64    0
99    65    1
100   69    1
A graphic representation of the data
[Figure: CHD plotted against Age]
Let’s just try regression analysis
pr(CHD|age) = -.54 +.022*Age
... linear regression is not a suitable model for probabilities
pr(CHD|age) = -.54 +.0218107*Age
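A small Python sketch that plugs a few ages into the fitted line from the slide shows why it is unsuitable for probabilities. The ages 20, 45 and 75 are arbitrary illustration values, not observations from the data set.

```python
def linear_probability(age, intercept=-0.54, slope=0.0218107):
    """'Probability' of CHD according to the fitted straight line."""
    return intercept + slope * age

for age in (20, 45, 75):
    print(age, round(linear_probability(age), 3))

p_young = linear_probability(20)  # comes out below 0
p_old = linear_probability(75)    # comes out above 1
```

A straight line has no floor or ceiling, so for young ages it drops below 0 and for old ages it exceeds 1.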
In this graph for 8 age groups, I plotted the probability of having a heart disease (proportion)
A nonlinear model is probably better here
Something like this
This is the logistic regression model
$$\Pr(Y \mid X_1) = \frac{1}{1 + e^{-(b_0 + b_1 X_1)}}$$
Predicted probabilities are always between 0 and 1
The part $b_0 + b_1 X_1$ is similar to classic regression analysis
Side note: this is similar to MMBR …
Suppose Y is a percentage (so between 0 and 1).
Then consider
$$Y = \frac{1}{1 + e^{-(b_0 + b_1 X_1)}}$$
…which will ensure that the estimated Y will vary between 0 and 1, and after some rearranging this is the same as
$$\log\left(\frac{Y}{1 - Y}\right) = b_0 + b_1 X_1$$
… (continued)
And one “solution” might be:
- Change all Y values that are 0 to 0.001
- Change all Y values that are 1 to 0.999
Now run regression on log(Y/(1-Y)) …
… but that really is sort of higgledy-piggledy …
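The workaround can be sketched in a few lines of Python. The example values in `ys` are made up; the 0.001 and 0.999 bounds are the ones from the slide. Note that `clamp` also moves any value below 0.001 or above 0.999, which for exact 0s and 1s is precisely the replacement the slide describes.

```python
import math

def clamp(y, low=0.001, high=0.999):
    """Keep y inside [low, high], so that exact 0 and 1 become 0.001 and 0.999."""
    return min(max(y, low), high)

def logit(y):
    """log(Y / (1 - Y)); undefined for Y = 0 or Y = 1."""
    return math.log(y / (1.0 - y))

def inverse_logit(z):
    """Maps any real number back into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

ys = [0.0, 0.25, 0.5, 1.0]  # hypothetical percentages
transformed = [logit(clamp(y)) for y in ys]
for y, z in zip(ys, transformed):
    print(y, round(z, 3))
```

The transformed values for 0 and 1 depend entirely on the arbitrary choice of 0.001 and 0.999, which is one reason the approach is unsatisfactory.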
Logistics of logistic regression
1. How do we estimate the coefficients?
2. How do we assess model fit?
3. How do we interpret coefficients?
4. How do we check regression assumptions?
Kinds of estimation in regression
• Ordinary Least Squares (we fit a line through a cloud of dots)
• Maximum likelihood (we find the parameters that are the most likely, given our data)
We never bothered to consider maximum likelihood in standard multiple regression, because one can show that both methods lead to exactly the same estimator (in multiple regression, that is; in general they differ).
Actually, maximum likelihood has superior statistical properties (efficiency, consistency, invariance, …)
Maximum likelihood estimation
• Method of maximum likelihood yields values for the unknown parameters that maximize the probability of obtaining the observed set of data
$$\Pr(Y \mid X_1) = \frac{1}{1 + e^{-(b_0 + b_1 X_1)}}$$
Unknown parameters: $b_0$ and $b_1$
Maximum likelihood estimation
• First we have to construct the “likelihood function” (probability of obtaining the observed set of data).
Likelihood = pr(obs1)*pr(obs2)*pr(obs3)…*pr(obsn)
Assuming that observations are independent
Log-likelihood
• For technical reasons the likelihood is transformed into the log-likelihood (then you just maximize the sum of the logged probabilities)
LL= ln[pr(obs1)]+ln[pr(obs2)]+ln[pr(obs3)]…+ln[pr(obsn)]
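A Python sketch of both quantities for a given pair of coefficients. It uses only the seven (age, CHD) rows that are visible on the data slide, not the full 100 observations, so the numbers are illustrative; the values -5.3 and .11 are the rounded estimates reported later in the SPSS output.

```python
import math

def predicted_probability(x, b0, b1):
    """Logistic model: Pr(Y = 1 | x)."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def log_likelihood(xs, ys, b0, b1):
    """Sum of ln[pr(obs_i)], assuming independent observations."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = predicted_probability(x, b0, b1)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

# The seven rows shown on the data slide (a subset of the 100 cases).
ages = [20, 23, 24, 25, 64, 65, 69]
chd = [0, 0, 0, 1, 0, 1, 1]

b0, b1 = -5.3, 0.11

# Likelihood as a product of the probabilities of the observed outcomes.
likelihood = 1.0
for x, y in zip(ages, chd):
    p = predicted_probability(x, b0, b1)
    likelihood *= p if y == 1 else 1.0 - p

ll = log_likelihood(ages, chd, b0, b1)
print("Likelihood:", likelihood)
print("Log-likelihood:", ll)
```

With many observations the product of probabilities quickly becomes a tiny number, while the sum of logs stays manageable; that is the technical reason for working with the log-likelihood.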
Some subtleties
• In OLS, we did not need stochastic assumptions to be able to calculate a best-fitting line (we need them only for the estimates of the confidence intervals). With maximum likelihood estimation we need them from the start
(and let us not be bothered at this point by how the confidence intervals are calculated in maximum likelihood)
Note: optimizing log-likelihoods is difficult
• It’s iterative (“searching the landscape”)
– it might not converge
– it might converge to the wrong answer
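To make "iterative" concrete, here is a sketch of one common search method, Newton-Raphson, for a logistic regression with a single predictor. This is an assumption for illustration: the slides do not say which algorithm SPSS or Stata use. The eight data points are hypothetical and chosen so that the exact answer is known (25% successes at x = 0 and 75% at x = 1).

```python
import math

def fit_logistic(xs, ys, max_iter=25, tol=1e-8):
    """Newton-Raphson search for b0 and b1; returns (b0, b1, iterations)."""
    b0, b1 = 0.0, 0.0
    for iteration in range(1, max_iter + 1):
        g0 = g1 = 0.0          # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0  # information matrix (negative Hessian)
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det == 0.0:
            raise ValueError("information matrix is singular")
        # Solve the 2x2 system (information matrix) * step = gradient.
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if max(abs(step0), abs(step1)) < tol:
            return b0, b1, iteration
    raise RuntimeError("did not converge")

# Hypothetical data: 1 of 4 successes at x = 0, 3 of 4 successes at x = 1.
xs = [0, 0, 0, 0, 1, 1, 1, 1]
ys = [0, 0, 0, 1, 1, 1, 1, 0]

b0, b1, n_iter = fit_logistic(xs, ys)
print(round(b0, 4), round(b1, 4), n_iter)
```

For this data the estimates should match the closed-form values b0 = ln(1/3) and b1 = 2 ln(3). The `RuntimeError` branch corresponds to the "might not converge" warning, which happens for instance when the predictor separates the outcomes perfectly.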
Nasty implication: extreme cases should be left out
(some handwaving here)
SPSS output
Estimation of coefficients: SPSS Results
Variables in the Equation
                    B       S.E.    Wald     df   Sig.   Exp(B)
Step 1a  age        ,111    ,024    21,254   1    ,000   1,117
         Constant   -5,309  1,134   21,935   1    ,000   ,005
a. Variable(s) entered on step 1: age.
$$\Pr(Y \mid X_1) = \frac{1}{1 + e^{-(-5.3 + .11 X_1)}}$$
This function fits best: other values of b0 and b1 give worse results (that is, other values have a smaller likelihood value)
Illustration 1: suppose we chose .05X instead of .11X
$$\Pr(Y \mid X_1) = \frac{1}{1 + e^{-(-5.3 + .05 X_1)}}$$
Illustration 2: suppose we chose .40X instead of .11X
$$\Pr(Y \mid X_1) = \frac{1}{1 + e^{-(-5.3 + .40 X_1)}}$$
Logistics of logistic regression
• Estimate the coefficients (and their conf.int.)
• Assess model fit
– Between model comparisons
– Pseudo R2 (similar to multiple regression)
– Predictive accuracy
• Interpret coefficients
• Check regression assumptions
Model fit: comparisons between models
$$\chi^2 = 2\,[LL(\text{New}) - LL(\text{baseline})]$$
(New = full model, baseline = reduced model)
The log-likelihood ratio test statistic can be used to test the fit of a model
The test statistic has a chi-square distribution
NOTE This is sort of similar to the variance decomposition tables you see in MR!
Between model comparisons: the likelihood ratio test
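Full model:
$$P(Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_1)}}$$
Reduced model:
$$P(Y) = \frac{1}{1 + e^{-(b_0)}}$$
$$\chi^2 = 2\,[LL(\text{New}) - LL(\text{baseline})]$$
The model including only an intercept is often called the empty model. SPSS uses this model as a default.
$$\chi^2 = 2LL(\text{New}) - 2LL(\text{baseline})$$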
Between model comparison: SPSS output

Omnibus Tests of Model Coefficients
                 Chi-square   df   Sig.
Step 1   Step    29,310       1    ,000
         Block   29,310       1    ,000
         Model   29,310       1    ,000
This is the test statistic, and its associated significance

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      107,353a            ,254                   ,341
a. Estimation terminated at iteration number 5 because parameter estimates changed by less than ,001.
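The test statistic and its significance can be recomputed by hand. In the Python sketch below, the -2 log-likelihood of the new model (107,353) comes from the Model Summary; the baseline value is not printed in the output and is back-computed here as 107,353 + 29,310, which is an assumption of this sketch. The p-value formula is valid for one degree of freedom only.

```python
import math

def likelihood_ratio_test_df1(neg2ll_baseline, neg2ll_new):
    """Chi-square statistic for one added predictor and its p-value (df = 1)."""
    chi_square = neg2ll_baseline - neg2ll_new
    # For df = 1: P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value

neg2ll_new = 107.353                    # from the Model Summary
neg2ll_baseline = neg2ll_new + 29.310   # back-computed, not shown by SPSS here

chi_square, p_value = likelihood_ratio_test_df1(neg2ll_baseline, neg2ll_new)
print(round(chi_square, 3), p_value)
```

The p-value is far below ,001, which is why SPSS prints ,000 in the Sig. column.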
Overall model fit: pseudo R²
Just like in multiple regression, pseudo R² ranges 0.0 to 1.0
– Cox and Snell: cannot theoretically reach 1
– Nagelkerke: adjusted so that it can reach 1
$$R^2_{\text{LOGIT}} = \frac{-2LL(\text{Model})}{-2LL(\text{Empty})}$$
– LL(Empty): log-likelihood of the model before any predictors were entered
– LL(Model): log-likelihood of the model that you want to test
NOTE: R² in logistic regression tends to be (even) smaller than in multiple regression
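The Cox & Snell and Nagelkerke values in the CHD output can be reproduced from the -2 log-likelihoods. The Python sketch below uses the standard textbook definitions of both measures, which the slides themselves do not spell out; the empty-model value is again back-computed as 107,353 + 29,310, and n = 100 is the number of observations in the CHD example.

```python
import math

def cox_snell_r2(neg2ll_empty, neg2ll_model, n):
    """Cox & Snell: 1 - exp(-(model chi-square) / n)."""
    return 1.0 - math.exp(-(neg2ll_empty - neg2ll_model) / n)

def nagelkerke_r2(neg2ll_empty, neg2ll_model, n):
    """Cox & Snell divided by its maximum attainable value."""
    max_cox_snell = 1.0 - math.exp(-neg2ll_empty / n)
    return cox_snell_r2(neg2ll_empty, neg2ll_model, n) / max_cox_snell

neg2ll_model = 107.353
neg2ll_empty = neg2ll_model + 29.310  # back-computed from the chi-square
n = 100

cs = cox_snell_r2(neg2ll_empty, neg2ll_model, n)
nk = nagelkerke_r2(neg2ll_empty, neg2ll_model, n)
print(round(cs, 3), round(nk, 3))
```

Because the maximum of Cox & Snell is below 1, dividing by it always makes Nagelkerke the larger of the two.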
Overall model fit: Classification table
We predict 74% correctly

Classification Table(a)
                           Predicted chd
Observed                   0      1      Percentage Correct
Step 1   chd   0           45     12     78,9
               1           14     29     67,4
         Overall Percentage              74,0
a. The cut value is ,500

14 cases had a CHD while according to our model this shouldn't have happened
12 cases didn't have a CHD while according to our model this should have happened
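The percentages in the classification table follow directly from the four cell counts. A short Python sketch, with the counts copied from the table and keyed as (observed, predicted); the function name is made up for this example:

```python
# Cell counts from the classification table: (observed chd, predicted chd).
counts = {(0, 0): 45, (0, 1): 12, (1, 0): 14, (1, 1): 29}

def percentage_correct(counts, observed=None):
    """Percentage classified correctly, overall or for one observed class."""
    classes = [observed] if observed is not None else [0, 1]
    correct = sum(counts[(c, c)] for c in classes)
    total = sum(n for (obs, _), n in counts.items() if obs in classes)
    return 100.0 * correct / total

pct_no_chd = percentage_correct(counts, observed=0)
pct_chd = percentage_correct(counts, observed=1)
pct_overall = percentage_correct(counts)
print(round(pct_no_chd, 1), round(pct_chd, 1), round(pct_overall, 1))
```

The two off-diagonal counts (12 and 14) are the misclassified cases.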
Logistics of logistic regression
• Estimate the coefficients
• Assess model fit
• Interpret coefficients
– Direction
– Significance
– Magnitude
• Check regression assumptions
The Odds Ratio
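We had:
$$p(Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_1 + \dots + b_n X_n)}} = \frac{e^{(b_0 + b_1 X_1 + \dots + b_n X_n)}}{1 + e^{(b_0 + b_1 X_1 + \dots + b_n X_n)}}$$
And after some rearranging we can get
$$\frac{p(Y)}{1 - p(Y)} = e^{(b_0 + b_1 X_1 + \dots + b_n X_n)}$$
Magnitude of association: Percentage change in odds
$$\text{Odds}_i = \frac{\text{prob}_{\text{event}}}{1 - \text{prob}_{\text{event}}}$$

Probability   Odds
25%           0.33
50%           1
75%           3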
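The probability/odds table can be reproduced with two one-line conversions. A minimal Python sketch; the function names are made up for this example:

```python
def odds(probability):
    """Odds = prob / (1 - prob)."""
    return probability / (1.0 - probability)

def probability_from_odds(o):
    """Inverse conversion: prob = odds / (1 + odds)."""
    return o / (1.0 + o)

for p in (0.25, 0.50, 0.75):
    print(f"{p:.0%} -> odds {odds(p):.2f}")
```

Odds run from 0 to infinity, with 1 as the neutral point where the event is as likely as not.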
Interpreting coefficients: direction
• original b reflects changes in logit: b>0 implies positive relationship
• exponentiated b reflects the “changes in odds”: exp(b) > 1 implies a positive relationship
$$\text{logit} = \ln\left(\frac{p(y)}{1 - p(y)}\right) = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$
3. Interpreting coefficients: magnitude
• The slope coefficient (b) is interpreted as the rate of change in the "log odds" as X changes … not very useful.
• exp(b) is the effect of the independent variable on the odds, more useful for calculating the size of an effect
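$$\text{logit} = \ln\left(\frac{p(y)}{1 - p(y)}\right) = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$
$$\text{Odds} = \frac{p(y)}{1 - p(y)} = e^{b_0}\, e^{b_1 x_1}\, e^{b_2 x_2} \cdots e^{b_n x_n}$$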
Variables in the Equation
                    B       S.E.    Wald     df   Sig.   Exp(B)
Step 1a  age        ,111    ,024    21,254   1    ,000   1,117
         Constant   -5,309  1,134   21,935   1    ,000   ,005
a. Variable(s) entered on step 1: age.
• For the age variable:
– Percentage change in odds = (exponentiated coefficient – 1) * 100 = 12%, or “the odds times 1,117”
– A one unit increase in age will result in a 12% increase in the odds that the person will have a CHD
– So if a soccer player is one year older, the odds that (s)he will have CHD are 12% higher
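The 12% figure can be checked in Python from the reported coefficient B = ,111. The ten-year comparison at the end is an added illustration, not something shown on the slide.

```python
import math

b_age = 0.111  # B for age from the SPSS output

odds_ratio = math.exp(b_age)            # Exp(B)
pct_change = (odds_ratio - 1.0) * 100.0  # percentage change in odds per year

# Effects multiply on the odds scale: ten years means Exp(B) to the power 10.
odds_ratio_10_years = math.exp(10 * b_age)

print(round(odds_ratio, 3), round(pct_change, 1), round(odds_ratio_10_years, 2))
```

Note that the percentage applies to the odds, not to the probability: ten extra years multiply the odds by roughly 3, not by 10 times 12%.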
Magnitude of association
Ref=0 Ref=1
Another way to get an idea of the size of effects: Calculating predicted probabilities
$$\Pr(Y \mid X_1) = \frac{1}{1 + e^{-(-5.3 + .11 X_1)}}$$
For somebody of 20 years old, the predicted probability is .04
For somebody of 70 years old, the predicted probability is .91
But this gets more complicated when you have more than a single X-variable
(see blackboard)
Conclusion: if you consider the effect of a variable on the predicted probability, the size of the effect of X1 depends on the value of X2! (yuck!)
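Both points can be illustrated in Python. The first part recomputes the predicted probabilities for ages 20 and 70 with the rounded coefficients -5.3 and .11. The second part borrows the two-predictor penalty-kick model that appears in the SPSS illustration at the end of these slides (1,28 + 0,065*previous - 0,230*pswq); the specific values of previous and pswq plugged in are arbitrary choices for this sketch.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def p_chd(age):
    """Predicted probability of CHD from the one-predictor model."""
    return logistic(-5.3 + 0.11 * age)

p_20 = p_chd(20)
p_70 = p_chd(70)
print(round(p_20, 3), round(p_70, 3))

def p_scored(previous, pswq):
    """Predicted probability of scoring from the two-predictor model."""
    return logistic(1.28 + 0.065 * previous - 0.230 * pswq)

# Same change in 'previous' (40 -> 50), evaluated at two levels of 'pswq'.
gain_calm = p_scored(50, 10) - p_scored(40, 10)
gain_worried = p_scored(50, 20) - p_scored(40, 20)
print(round(gain_calm, 3), round(gain_worried, 3))
```

The coefficient of previous is the same in both comparisons, yet the change in predicted probability differs, because the logistic curve is steeper near .5 than near 0 or 1.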
Testing significance of coefficients
• In linear regression analysis this statistic is used to test significance (estimate divided by the standard error of the estimate, compared against the t-distribution):
$$t = \frac{b}{SE_b}$$
• In logistic regression something similar exists:
$$\text{Wald} = \frac{b}{SE_b}$$
• however, when b is large, standard error tends to become inflated, hence underestimation (Type II errors are more likely)
Note: This is not the Wald Statistic SPSS presents!!!
Interpreting coefficients: significance
• SPSS presents
$$\text{Wald} = \frac{b^2}{SE_b^2}$$
• While Andy Field thinks SPSS presents this (at least in the 2nd version of the book):
$$\text{Wald} = \frac{b}{SE_b}$$
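A quick Python check against the CHD output (B = ,111, S.E. = ,024, Wald = 21,254) shows which of the two versions SPSS reports. The inputs are the rounded values printed in the table, so the result only approximately matches the reported statistic.

```python
import math

def wald_statistics(b, se):
    """Return both versions: b / SE and (b / SE) squared."""
    z = b / se
    return z, z * z

z_age, wald_age = wald_statistics(0.111, 0.024)

# Two-sided p-value of z under the standard normal distribution,
# identical to the chi-square (df = 1) p-value of z squared.
p_age = math.erfc(abs(z_age) / math.sqrt(2.0))

print(round(z_age, 3), round(wald_age, 2), p_age)
```

The squared version lands close to the reported 21,254, while the unsquared ratio is about 4.6, so the Wald column in SPSS is the squared one.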
Logistics of logistic regression
• Estimate the coefficients
• Assess model fit
• Interpret coefficients
• Check regression assumptions
Checking assumptions
• Influential data points & Residuals
– Follow Samantha’s tips
• Hosmer & Lemeshow
– Divides sample in subgroups
– Checks whether there are differences between observed and predicted between subgroups
– Test should not be significant, if so: indication of lack of fit
Hosmer & Lemeshow
Test divides sample in subgroups, checks whether difference between observed and predicted is about equal in these groups
Test should not be significant (indicating no difference)
Examining residuals in logistic regression
1. Isolate points for which the model fits poorly
2. Isolate influential data points
Residual statistics: Field’s rules of thumb
Time for a summary …
Logistic regression
• Y = 0/1
• Multiple regression (or ANcOVA) is not right
• You consider either the odds or the log(odds)
• It is estimated through “maximum likelihood”
• Interpretation is a bit more complicated than normal
• Assumption testing is a bit more concrete than in multiple regression
Advanced Methods and Models in Behavioral Research – 2008/2009
Make sure to
• enroll in studyweb (0a611)
• Read the Field chapter on logistic regression
• Go through the slides as well
• Bring your laptop next time: we’ll go through a logistic regression in Stata
Illustration with SPSS (without the outlier part)
• Penalty kicks data, variables:
– Scored: outcome variable, 0 = penalty missed, and 1 = penalty scored
– Pswq: degree to which a player worries
– Previous: percentage of penalties scored by a particular player in their career
SPSS OUTPUT Logistic Regression

Case Processing Summary
Unweighted Cases(a)                        N     Percent
Selected Cases     Included in Analysis    75    100,0
                   Missing Cases           0     ,0
                   Total                   75    100,0
Unselected Cases                           0     ,0
Total                                      75    100,0
a. If weight is in effect, see classification table for the total number of cases.
Tells you something about the number of observations and missings

Dependent Variable Encoding
Original Value     Internal Value
Missed Penalty     0
Scored Penalty     1
Block 0: Beginning Block
This table is based on the empty model, i.e. only the constant in the model
$$P(Y) = \frac{1}{1 + e^{-(b_0)}}$$

Classification Table(a,b)
                                                   Predicted: Result of Penalty Kick
Observed                                           Missed Penalty   Scored Penalty   Percentage Correct
Step 0   Result of Penalty Kick   Missed Penalty   0                35               ,0
                                  Scored Penalty   0                40               100,0
         Overall Percentage                                                          53,3
a. Constant is included in the model.
b. The cut value is ,500

Variables in the Equation
                    B      S.E.   Wald   df   Sig.   Exp(B)
Step 0   Constant   ,134   ,231   ,333   1    ,564   1,143

Variables not in the Equation
                                  Score    df   Sig.
Step 0   Variables   previous     34,109   1    ,000
                     pswq         34,193   1    ,000
         Overall Statistics       41,558   2    ,000
These variables will be entered in the model later on
Block 1: Method = Enter

Omnibus Tests of Model Coefficients
                 Chi-square   df   Sig.
Step 1   Step    54,977       2    ,000
         Block   54,977       2    ,000
         Model   54,977       2    ,000
This is the test statistic:
$$\chi^2 = 2\,[LL(\text{New}) - LL(\text{baseline})]$$
Block is useful to check significance of individual coefficients, see Field

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      48,662a             ,520                   ,694
a. Estimation terminated at iteration number 6 because parameter estimates changed by less than ,001.
-2 Log likelihood: New model (after dividing by -2)
Note: Nagelkerke is larger than Cox
Block 1: Method = Enter (Continued)

Variables in the Equation
                    B       S.E.    Wald    df   Sig.   Exp(B)
Step 1a  previous   ,065    ,022    8,609   1    ,003   1,067
         pswq       -,230   ,080    8,309   1    ,004   ,794
         Constant   1,280   1,670   ,588    1    ,443   3,598
a. Variable(s) entered on step 1: previous, pswq.
B: estimates. S.E.: standard error of the estimates. Sig.: significance based on Wald statistic. Exp(B): change in odds.

Classification Table(a)
                                                   Predicted: Result of Penalty Kick
Observed                                           Missed Penalty   Scored Penalty   Percentage Correct
Step 1   Result of Penalty Kick   Missed Penalty   30               5                85,7
                                  Scored Penalty   7                33               82,5
         Overall Percentage                                                          84,0
a. The cut value is ,500
Predictive accuracy has improved (was 53%)
How is the classification table constructed?
$$\text{Pred. } P(Y) = \frac{1}{1 + e^{-(1{,}28 + 0{,}065 \cdot \text{previous} - 0{,}230 \cdot \text{pswq})}}$$
# cases not predicted correctly: the off-diagonal cells of the classification table (5 and 7)
pswq   previous   scored   Predict. prob.   predicted
18     56         1        .68              1
17     35         1        .41              0
20     45         0        .40              0
10     42         0        .85              1

With the cut value of ,500, a case with a predicted probability above .5 is classified as 1 (scored), otherwise as 0 (missed). Counting the combinations of scored and predicted over all 75 cases gives the classification table.
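The construction can be traced in Python for the four example cases on the slide. The coefficients are the rounded estimates from the Variables in the Equation table, so a predicted probability can differ from the slide in the last printed digit; the function names are made up for this sketch.

```python
import math

def predicted_probability(pswq, previous):
    """Predicted P(scored) from the rounded coefficients."""
    z = 1.28 + 0.065 * previous - 0.230 * pswq
    return 1.0 / (1.0 + math.exp(-z))

def classify(probability, cut=0.5):
    """Apply the cut value: above the cut is predicted as 1 (scored)."""
    return 1 if probability > cut else 0

# (pswq, previous, scored) for the four cases shown on the slide.
rows = [(18, 56, 1), (17, 35, 1), (20, 45, 0), (10, 42, 0)]

# Classification table keyed as (observed, predicted).
table = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
predictions = []
for pswq, previous, scored in rows:
    p = predicted_probability(pswq, previous)
    predicted = classify(p)
    predictions.append(predicted)
    table[(scored, predicted)] += 1
    print(pswq, previous, scored, round(p, 2), predicted)

print(table)
```

In this four-case subset, one case lands in each cell: two are classified correctly and two are not. Running the same loop over all 75 cases would produce the full table from the SPSS output.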