Bayesian estimation
Why and How to Run Your First Bayesian Model
Rens van de Schoot, Rensvandeschoot.com
Classical null hypothesis testing
Wainer: "One Cheer for Null-Hypothesis Significance Testing" (1999; Psych. Meth., 4, 212-213)
… however …
NHT vs. Bayes
Pr(Data | H0) ≠ Pr(Hi | Data)
Bayes Theorem
Pr(Hi | Data) = Pr(Data | Hi) × Pr(Hi) / Pr(Data)
Posterior ∝ prior × data
Posterior probability is proportional to the product of the prior probability and the likelihood
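A minimal sketch of the theorem for a finite set of hypotheses. The function name and the numbers are hypothetical, chosen only to show how prior and likelihood are multiplied and then normalized.

```python
def posterior_probabilities(priors, likelihoods):
    """Apply Bayes' theorem to a finite set of hypotheses.

    priors[i]      = Pr(Hi)
    likelihoods[i] = Pr(Data | Hi)
    returns        Pr(Hi | Data) for every i
    """
    # Numerator of Bayes' theorem: prior times likelihood, per hypothesis.
    unnormalized = [p * l for p, l in zip(priors, likelihoods)]
    # Denominator: Pr(Data), the sum of the numerators over all hypotheses.
    evidence = sum(unnormalized)
    return [u / evidence for u in unnormalized]


# Hypothetical numbers: two hypotheses that are equally likely a priori,
# with the data four times as likely under H1 as under H0.
posterior = posterior_probabilities(priors=[0.5, 0.5], likelihoods=[0.05, 0.20])
print(posterior)
```

With equal priors the posterior odds equal the likelihood ratio, so H1 ends up four times as probable as H0.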
Bayes theorem: prior, data and posterior
Intelligence (IQ)
[Figure: prior knowledge, prior 1, on an IQ axis running from -∞ to ∞]
Intelligence Interval   Cognitive Designation
40 - 54                 Severely challenged (<1%)
55 - 69                 Challenged (2.3% of test takers)
70 - 84                 Below average
85 - 114                Average (68% of test takers)
115 - 129               Above average
130 - 144               Gifted (2.3% of test takers)
145 - 159               Genius (less than 1% of test takers)
160 - 175               Extraordinary genius
[Figures: prior knowledge, priors 2 to 5, each on an IQ axis running from 40 to 180; for priors 3 to 5 the value 100 is marked]
[Figure: prior knowledge, overview of priors 1 to 5 on an axis running from -∞ to ∞]
[Figures: prior, data and posterior shown together on an IQ axis running from -∞ to ∞]
Prior - Data
[Figures: prior and data shown together on an IQ axis running from 40 to 180, with the value 100 marked]
How to obtain posterior?
In complex models, the posterior is often intractable (impossible to compute exactly)
Solution: approximate posterior by simulation
– Simulate many draws from posterior distribution
– Compute mode, median, mean, 95% interval et cetera from the simulated draws
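A minimal sketch of summarizing a posterior from simulated draws. The draws here are a stand-in generated from an assumed Normal(102, 3.4) distribution; in practice they would come from an MCMC sampler. The helper name central_interval is hypothetical.

```python
import random
import statistics

random.seed(1)

# Stand-in for MCMC output: 10,000 draws from a posterior that is assumed,
# purely for illustration, to be Normal(mean=102, sd=3.4).
draws = [random.gauss(102.0, 3.4) for _ in range(10_000)]


def central_interval(samples, level=0.95):
    """Central interval: cut (1 - level) / 2 of the draws from each tail."""
    ordered = sorted(samples)
    tail = (1.0 - level) / 2.0
    lower = ordered[int(tail * len(ordered))]
    upper = ordered[int((1.0 - tail) * len(ordered)) - 1]
    return lower, upper


post_mean = statistics.fmean(draws)
post_median = statistics.median(draws)
lower, upper = central_interval(draws)
print(f"mean={post_mean:.2f} median={post_median:.2f} "
      f"95% interval=({lower:.2f}, {upper:.2f})")
```

The mode is left out because, for continuous draws, it needs a density estimate or a histogram rather than a direct count.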
ANOVA example
4 unknown parameters μj (j = 1,...,4) and one common but unknown σ².
Statistical model:
Y = μ1*D1 + μ2*D2 + μ3*D3 + μ4*D4 + E
with E ~ N(0, σ²)
The Gibbs sampler
Specify prior: Pr(μ1, μ2, μ3, μ4, σ²)
Prior (μj) ~ Nor(μ0, var0)
Prior (μj) ~ Nor(0, 10000)
Prior (σ²) ~ IG(0.001, 0.001)
Prior is Inverse Gamma (shape, scale)
The Gibbs sampler
Combining the prior with the likelihood gives the posterior:
Post ( μ1, μ2, μ3, μ4, σ² | data )
…this is a 5-dimensional distribution…
The Gibbs sampler
Iterative evaluation via conditional distributions:
Post ( μ1 | μ2, μ3, μ4, σ², data ) ∝ Prior (μ1) × Data (μ1)
Post ( μ2 | μ1, μ3, μ4, σ², data ) ∝ Prior (μ2) × Data (μ2)
Post ( μ3 | μ1, μ2, μ4, σ², data ) ∝ Prior (μ3) × Data (μ3)
Post ( μ4 | μ1, μ2, μ3, σ², data ) ∝ Prior (μ4) × Data (μ4)
Post ( σ² | μ1, μ2, μ3, μ4, data ) ∝ Prior (σ²) × Data (σ²)
1. Assign starting values
2. Sample μ1 from conditional distribution
3. Sample μ2 from conditional distribution
4. Sample μ3 from conditional distribution
5. Sample μ4 from conditional distribution
6. Sample σ² from conditional distribution
7. Go to step 2 until enough iterations
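The seven steps can be sketched in a few lines for the four-group ANOVA model, using the priors from the slides. The data are simulated with hypothetical group means and a hypothetical residual SD, and the function names are illustrative; this is a sketch of a standard conjugate Gibbs sampler, not the code behind the slides.

```python
import random

random.seed(42)

# Simulated stand-in data: 4 groups of 25 observations each.
# The group means and the residual SD of 1.0 are hypothetical.
true_means = [4.5, 3.5, 5.0, 6.5]
groups = [[random.gauss(m, 1.0) for _ in range(25)] for m in true_means]

# Priors from the slides: mu_j ~ N(0, 10000), sigma2 ~ IG(0.001, 0.001).
MU0, VAR0 = 0.0, 10000.0
SHAPE0, SCALE0 = 0.001, 0.001


def sample_mu(y, sigma2):
    """Draw mu_j from its normal full conditional given sigma2 and group data y."""
    precision = 1.0 / VAR0 + len(y) / sigma2
    mean = (MU0 / VAR0 + sum(y) / sigma2) / precision
    return random.gauss(mean, (1.0 / precision) ** 0.5)


def sample_sigma2(mus):
    """Draw sigma2 from its inverse-gamma full conditional given all means."""
    n_total = sum(len(y) for y in groups)
    sse = sum((obs - mu) ** 2 for y, mu in zip(groups, mus) for obs in y)
    shape = SHAPE0 + n_total / 2.0
    scale = SCALE0 + sse / 2.0
    # If X ~ Gamma(shape, rate=scale) then 1/X ~ IG(shape, scale).
    # random.gammavariate takes (shape, scale), so the rate is passed as 1/scale.
    return 1.0 / random.gammavariate(shape, 1.0 / scale)


def gibbs(n_iter, start_mus, start_sigma2):
    mus, sigma2 = list(start_mus), start_sigma2   # step 1: starting values
    chain = []
    for _ in range(n_iter):                       # step 7: repeat
        for j, y in enumerate(groups):            # steps 2-5: sample each mu_j
            mus[j] = sample_mu(y, sigma2)
        sigma2 = sample_sigma2(mus)               # step 6: sample sigma2
        chain.append((*mus, sigma2))
    return chain


chain = gibbs(n_iter=2000, start_mus=[3.0, 5.0, 8.0, 3.0], start_sigma2=10.0)
kept = chain[len(chain) // 2:]                    # discard first half as burn-in
posterior_means = [sum(col) / len(kept) for col in zip(*kept)]
print([round(v, 2) for v in posterior_means])
```

Because the priors are conjugate, every conditional distribution is a known family and can be sampled directly, which is what makes the Gibbs sampler practical here.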
The Gibbs sampler
Iteration   μ1     μ2     μ3     μ4     σ²
1           3.00   5.00   8.00   3.00   10
2           3.75   4.25   7.00   4.30   8
3           3.65   4.11   6.78   5.55   5
...
15          4.45   3.19   5.08   6.55   1.1
...
199         4.59   3.75   5.21   6.36   1.2
200         4.36   3.45   4.65   6.99   1.3
The Gibbs sampler
[Figures: trace plot of the sampled values; posterior distribution obtained from the trace plot]
Burn In
Gibbs sampler must run t iterations 'burn in' before we reach target distribution f(Z)
– How many iterations are needed to converge on the target distribution?
Diagnostics
– Examine graph of burn in
– Try different starting values
– Run several chains in parallel
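Running several chains in parallel allows a numerical check next to the graphs. The sketch below computes a Gelman-Rubin style potential scale reduction (PSR) for one parameter. The chains are simulated stand-ins, and the exact PSR definition used by a given program may differ slightly from this textbook form.

```python
import random
import statistics


def potential_scale_reduction(chains):
    """Gelman-Rubin style PSR for one parameter; chains must have equal length."""
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    # Average variance within the chains.
    within = statistics.fmean([statistics.variance(c) for c in chains])
    # Variance between the chain means (this is B / n in the usual notation).
    between_over_n = statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between_over_n
    return (pooled / within) ** 0.5


random.seed(7)

# Hypothetical case 1: four chains sampling the same Normal(102, 3.4) target.
mixed = [[random.gauss(102.0, 3.4) for _ in range(1000)] for _ in range(4)]

# Hypothetical case 2: four chains stuck around different values.
stuck = [[random.gauss(center, 3.4) for _ in range(1000)]
         for center in (90.0, 98.0, 106.0, 114.0)]

psr_mixed = potential_scale_reduction(mixed)
psr_stuck = potential_scale_reduction(stuck)
print(f"PSR, chains agree:    {psr_mixed:.3f}")
print(f"PSR, chains disagree: {psr_stuck:.3f}")
```

A PSR close to 1 says the chains are indistinguishable from each other; it does not replace looking at the trace plots.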
Convergence
[Figures: series of convergence plots]
Conclusion about convergence
Burn-in: Mplus deletes first half of chain
Run multiple chains (Mplus default 2)
– Decrease Bconvergence: default .05 but better use .01
ALWAYS do a graphical evaluation of each and every parameter
Summing up
Probability: Degree of belief
Prior: What is known before observing the data
Posterior: What is known after observing the data
Informative prior: Tool to include subjective knowledge
Non-informative prior: Try to express absence of prior knowledge; posterior mainly determined by data
MCMC methods: Simulation (sampling) techniques to obtain the posterior distribution and all posterior summary measures
Convergence: Important to check
IQ
N = 20
Data are generated: Mean = 102, SD = 15
[Figures: IQ plots]
Prior type   Prior variance used               Posterior mean IQ score   95% C.I./C.C.I.
ML                                             102.00                    94.42 – 109.57
Prior 1      A                                 101.99                    94.35 – 109.62
Prior 2a     M or A, large variance, SD=100    101.99                    94.40 – 109.42
Prior 2b     M or A, medium variance, SD=10    101.99                    94.89 – 109.07
Prior 2c     M or A, small variance, SD=1      102.00                    100.12 – 103.87
Prior 3      A                                 102.03                    94.22 – 109.71
Prior 4      W, medium variance, SD=10         102.00                    97.76 – 106.80
Prior 5      W, small variance, SD=1           102.00                    100.20 – 103.90
Prior 6a     W, large variance, SD=100         99.37                     92.47 – 106.10
Prior 6b     W, medium variance, SD=10         86.56                     80.17 – 92.47
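The pattern in the table (a wide prior changes little, a narrow prior pulls the estimate and tightens the interval) can be sketched with the textbook normal-normal update. This sketch treats the sampling SD as known and uses hypothetical prior means of 102 and 80, so it is not expected to reproduce the table's numbers, which came from full model estimation with priors that are not fully specified here.

```python
def normal_posterior(prior_mean, prior_sd, sample_mean, sample_sd, n):
    """Posterior of a normal mean, treating the sampling SD as known."""
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = n / sample_sd ** 2
    post_variance = 1.0 / (prior_precision + data_precision)
    # Posterior mean is a precision-weighted average of prior mean and sample mean.
    post_mean = post_variance * (prior_precision * prior_mean
                                 + data_precision * sample_mean)
    return post_mean, post_variance ** 0.5


# Data summary from the slides: N = 20, mean = 102, SD = 15.
# The prior means (102 and 80) are hypothetical choices for illustration.
for prior_mean in (102.0, 80.0):
    for prior_sd in (100.0, 10.0, 1.0):
        mean, sd = normal_posterior(prior_mean, prior_sd, 102.0, 15.0, 20)
        lower, upper = mean - 1.96 * sd, mean + 1.96 * sd
        print(f"prior N({prior_mean:.0f}, SD={prior_sd:>5.1f}) -> "
              f"posterior mean {mean:6.2f}, 95% interval {lower:6.2f} to {upper:6.2f}")
```

A prior centered on the wrong value only does damage when its variance is small, which is the situation the last rows of the table illustrate.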
Uncertainty in Classical Statistics
Uncertainty = sampling distribution
– Estimate population parameter by a sample estimate
– Imagine drawing an infinity of samples
– Distribution of the estimate over samples
Problem is that we have only one sample
– Estimate and its sampling distribution
– Estimate 95% confidence interval
Inference in Classical Statistics
What does 95% confidence interval actually mean?
– Over an infinity of samples, 95% of the resulting intervals contain the true population value
– But we have only one sample
– We never know if our present estimate and confidence interval is one of those 95% or not
Inference in Classical Statistics
What does 95% confidence interval NOT mean?
We have a 95% probability that the true population value is within the limits of our confidence interval
We only have an aggregate assurance that in the long run 95% of our confidence intervals contain the true population value
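The long-run reading of a confidence interval can be checked by simulation. In this sketch the population mean, SD and sample size are hypothetical, and the critical value 2.093 is the 97.5% quantile of the t distribution with 19 degrees of freedom.

```python
import random
import statistics

random.seed(2024)

TRUE_MEAN, TRUE_SD, N = 100.0, 15.0, 20   # hypothetical population and sample size
T_CRIT = 2.093                            # t quantile, 97.5%, df = N - 1 = 19
REPLICATIONS = 5000

covered = 0
for _ in range(REPLICATIONS):
    # Draw a new sample and build its 95% confidence interval for the mean.
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / N ** 0.5
    lower, upper = mean - T_CRIT * se, mean + T_CRIT * se
    # Count the interval if it contains the true population mean.
    covered += lower <= TRUE_MEAN <= upper

coverage = covered / REPLICATIONS
print(f"{coverage:.3f} of the intervals contain the true mean")
```

The proportion describes the procedure over many samples; any single interval either contains the true value or it does not.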
Uncertainty in Bayesian Statistics
Uncertainty = probability distribution for the population parameter
In classical statistics the population parameter has one single true value
In Bayesian statistics we imagine a distribution of possible values of population parameter
Inference in Bayesian Statistics
What does a 95% central credibility interval mean?
We have a 95% probability that the population value is within the limits of our credibility interval
What have we learned so far?
Results are compromise of prior & data
However:
-> non/low-informative priors
-> informative priors
-> misspecification of the prior
-> convergence
Results are easier to communicate (e.g. CCI compared to confidence interval)
Software
WinBUGS / OpenBUGS: Bayesian inference Using Gibbs Sampling. Very general, user must set up model
R packages: LearnBayes, R2WinBUGS, MCMCpack
MLwiN: Special implementation for multilevel regression
AMOS: Special implementation for SEM
Mplus: Very general (SEM + ML + many other models)
MPLUS - ML
DATA: FILE IS data.dat;
VARIABLE: NAMES ARE IQ;
ANALYSIS: ESTIMATOR IS ML;
MODEL: [IQ];
MPLUS – BAYES: default settings
DATA: FILE IS data.dat;
VARIABLE: NAMES ARE IQ;
ANALYSIS: ESTIMATOR IS BAYES;
MODEL: [IQ];
MPLUS – BAYES: default settings
Prior for IQ:
Prior mean = 0
Prior variance = 10^10
[Figure: default prior on the IQ axis, with the value 0 marked]
MPLUS – BAYES: change prior
DATA: FILE IS data.dat;
VARIABLE: NAMES ARE IQ;
ANALYSIS: ESTIMATOR IS BAYES;
MODEL: [IQ] (p1);
MPLUS – BAYES: change prior
DATA: FILE IS data.dat;
VARIABLE: NAMES ARE IQ;
ANALYSIS: ESTIMATOR IS BAYES;
MODEL: [IQ] (p1);
MODEL PRIOR: p1 ~ N(a,b);
a = prior mean
b = prior variance (Mplus specifies the normal prior by its variance, not its precision)
MPLUS – BAYES: change prior
DATA: FILE IS data.dat;
VARIABLE: NAMES ARE IQ;
ANALYSIS: ESTIMATOR IS BAYES;
MODEL: [IQ] (p1);
MODEL PRIOR: p1 ~ N(100,10);
MPLUS – BAYES: change prior
DATA: FILE IS data.dat;
VARIABLE: NAMES ARE IQ;
ANALYSIS: ESTIMATOR IS BAYES;
MODEL: [IQ] (p1);
MODEL PRIOR: p1 ~ N(100,10);
PLOT: type is plot2;
MPLUS – BAYES: change prior
DATA: FILE IS data.dat;
VARIABLE: NAMES ARE IQ;
ANALYSIS: ESTIMATOR IS BAYES;
  CHAINS = 4;
  BITERATIONS = (1000);
  BCONVERGENCE = .01;
MODEL: [IQ] (p1);
MODEL PRIOR: p1 ~ N(100,10);
PLOT: type is plot2;
MPLUS – BAYES: change prior
DATA: FILE IS data.dat;
VARIABLE: NAMES ARE IQ;
ANALYSIS: ESTIMATOR IS BAYES;
  CHAINS = 4;
  BITERATIONS = (1000);
  BCONVERGENCE = .01;
MODEL: [IQ] (p1);
MODEL PRIOR: p1 ~ N(100,10);
PLOT: type is plot2;
OUTPUT: stand sampstat TECH4 TECH8;
Bayesian updating
Dynamic interactionism: adolescents are believed to develop through a dynamic and reciprocal transaction between personality and the environment
In 1998, Asendorpf and Wilpers stated that "empirical evidence on the relative strength of personality effects on relationships and vice versa is surprisingly limited"
Back in 1998, there had been very few longitudinal studies about personality development. Personality was not often used as outcome variable because it was seen as stable
These authors investigated for the first time personality and relationships over time in a sample of young students (n = 132) after their transition to university. The main conclusion of their analyses was that personality influenced change in social relationships, but not vice versa.
Bayesian updating
In 2001, Neyer and Asendorpf replicated the personality–relationship model, but now using a large representative sample of young adults
Based on the previous results Neyer and Asendorpf "[…] hypothesized that personality effects would have a clear superiority over relationships effects"
In line with Asendorpf and Wilpers, they concluded that "Path analyses showed that once initial correlations were controlled, personality traits predicted change in various aspects of social relationships, whereas effects of antecedent relationships on personality were rare and restricted to very specific relationships with one's pre-school children"
Bayesian updating
[Figure: cross-lagged path model linking T1 Extraversion and T1 Friends to T2 Extraversion and T2 Friends, with paths β1 to β4, correlations r1 and r2, and residuals e1 and e2; one cross-lagged path is labeled "Hypothesized to be >0" and the other "Hypothesized to be 0"]
Bayesian updating
In 2003 Asendorpf and van Aken continued working on studies into personality–relationship transaction. The authors stated that
"The aim of the present study was to apply the methodology used by Asendorpf and Wilpers (1998) and Neyer and Asendorpf (2001) to the study of personality–relationship transaction over adolescence, to try to replicate key findings of these earlier studies, particularly the dominance of […] traits over relationship quality"
Asendorpf and van Aken confirmed previous findings:
"The stronger effect was an extraversion effect on perceived support from peers. This result replicates, once more, similar findings in adulthood." (p.653)
Bayesian updating
In 2010, Sturaro, Denissen, van Aken, and Asendorpf, once again, investigated the personality–relationship transaction model
Sturaro et al. found some contradictory results compared to the previously described studies
"[The Five-Factor theory] predicts significant paths from personality to change in social relationship quality, whereas it does not predict social relationship quality to have an impact on personality change. Contrary to our expectation, however, personality did not predict changes in relationship quality"
Bayesian updating
In conclusion, the four papers described above clearly illustrate how theory building works in daily practice.
Asendorpf and Wilpers (1998) started with testing theoretical ideas on the association between personality and social relationships, tracing back to McCrae and Costa (1996),
and although their results were replicated by Neyer and Asendorpf (2001), and Asendorpf and van Aken (2003),
Sturaro, Denissen, van Aken, and Asendorpf (2010) were not able to do so. This latter finding led to re-formulations of the original theoretical ideas.
Bayesian updating
Why not update the results instead of testing the null hypothesis over and over again?
Let’s use Bayesian updating and impose subjective priors
In the first scenario we only focus on those data sets with similar age groups.
Therefore we first re-analyze the data of Neyer and Asendorpf (2001) without using prior knowledge. Thereafter, we re-analyze the data of Sturaro et al. (2010) using prior information based on the data of Neyer and Asendorpf; both data sets contain young adults between 17-30 years of age.
Bayesian updating
In the second scenario we assume the relation between personality and social relationships is independent of age and we re-analyze the data of Sturaro et al. using prior information taken from Neyer and Asendorpf and from Asendorpf and van Aken.
In this second scenario we make a strong assumption, namely that the cross lagged effects for young adolescents are equal to the cross lagged effects of young adults.
This assumption implies similar developmental trajectories across age groups and indicates a full replication study.
Scenario 1
Model 1: Neyer & Asendorpf data without prior knowledge
Model 2: Sturaro et al. data without prior knowledge
Model 3: Sturaro et al. data with priors based on Model 1
      Model 1                           Model 2                           Model 3
      Estimate (SD)    95% PPI          Estimate (SD)    95% PPI          Estimate (SD)    95% PPI
β1    0.605 (0.037)    0.532 - 0.676    0.291 (0.063)    0.169 - 0.424    0.337 (0.058)    0.228 - 0.449
β2    0.293 (0.047)    0.199 - 0.386    0.157 (0.103)    -0.042 - 0.364   0.287 (0.082)    0.130 - 0.448
β3    0.131 (0.046)    0.043 - 0.222    0.029 (0.079)    -0.132 - 0.180   0.106 (0.072)    -0.038 - 0.247
β4    -0.026 (0.039)   -0.100 - 0.051   0.303 (0.081)    0.144 - 0.462    0.249 (0.067)    0.111 - 0.375
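The logic of updating can be sketched with a normal approximation: treat the earlier study's estimate and SD as a normal prior and the new study's estimate and SD as a normal likelihood summary. The function name and the inflate parameter are hypothetical. The slides do not state how the Model 1 results were turned into priors, so this sketch is not expected to reproduce the Model 3 column.

```python
def update_normal(prior_mean, prior_sd, estimate, estimate_sd, inflate=1.0):
    """Combine a normal prior with a normal likelihood summary.

    inflate > 1 widens the prior SD, i.e. trusts the earlier study less.
    """
    prior_precision = 1.0 / (prior_sd * inflate) ** 2
    data_precision = 1.0 / estimate_sd ** 2
    post_precision = prior_precision + data_precision
    # Precision-weighted average of the prior mean and the new estimate.
    post_mean = (prior_precision * prior_mean
                 + data_precision * estimate) / post_precision
    return post_mean, post_precision ** -0.5


# beta2 from the Scenario 1 table: Model 1 as prior, Model 2 as new evidence.
mean, sd = update_normal(prior_mean=0.293, prior_sd=0.047,
                         estimate=0.157, estimate_sd=0.103)
print(f"updated beta2: {mean:.3f} (SD {sd:.3f})")

# The same update with a prior that is ten times wider (a hypothetical choice).
wide_mean, wide_sd = update_normal(0.293, 0.047, 0.157, 0.103, inflate=10.0)
print(f"updated beta2, wide prior: {wide_mean:.3f} (SD {wide_sd:.3f})")
```

The updated estimate always lies between the prior mean and the new estimate, and it moves toward the new data as the prior is widened.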
Scenario 2
Model 4: Asendorpf & van Aken data without prior knowledge
Model 5: Asendorpf & van Aken data with priors based on Model 1
Model 6: Sturaro et al. data with priors based on Model 5
      Model 4                           Model 5                           Model 6
      Estimate (SD)    95% PPI          Estimate (SD)    95% PPI          Estimate (SD)    95% PPI
β1    0.512 (0.069)    0.376 - 0.649    0.537 (0.059)    0.424 - 0.654    0.313 (0.059)    0.199 - 0.427
β2    0.115 (0.083)    -0.049 - 0.277   0.140 (0.071)    0.005 - 0.283    0.246 (0.087)    0.079 - 0.420
β3    0.217 (0.106)    0.006 - 0.426    0.212 (0.079)    0.057 - 0.361    0.100 (0.076)    -0.052 - 0.248
β4    0.072 (0.055)    -0.036 - 0.179   0.073 (0.051)    -0.030 - 0.171   0.259 (0.070)    0.116 - 0.393
Final results Sturaro et al.
Model 3 (Scenario 1): Sturaro et al. data with priors based on Model 1
Model 6 (Scenario 2): Sturaro et al. data with priors based on Model 5
Model 2: Sturaro et al. data without prior knowledge
      Model 3                           Model 6                           Model 2
      Estimate (SD)    95% PPI          Estimate (SD)    95% PPI          Estimate (SD)    95% PPI
β1    0.337 (0.058)    0.228 - 0.449    0.313 (0.059)    0.199 - 0.427    0.291 (0.063)    0.169 - 0.424
β2    0.287 (0.082)    0.130 - 0.448    0.246 (0.087)    0.079 - 0.420    0.157 (0.103)    -0.042 - 0.364
β3    0.106 (0.072)    -0.038 - 0.247   0.100 (0.076)    -0.052 - 0.248   0.029 (0.079)    -0.132 - 0.180
β4    0.249 (0.067)    0.111 - 0.375    0.259 (0.070)    0.116 - 0.393    0.303 (0.081)    0.144 - 0.462
Conclusions
The updating procedure in both scenarios leads us to conclude that using subjective priors narrows the 95% intervals.
=> More certainty about the relations
However…
Conclusions
Using subjective priors never changed the real issue, namely that Sturaro et al. found opposite effects to Neyer and Asendorpf.
The results supported the robustness of a conclusion that effects occurring between ages 17 and 23 are different from those occurring between ages 18-30, i.e., the clearly higher age in the Neyer and Asendorpf data.
Overall Conclusions
Excellent tool to include prior knowledge if available
Estimates (including intervals) always lie in the sample space if prior is chosen wisely
Results are easier to communicate
Better small-sample performance, large-sample theory not needed
Analyses can be made less computationally demanding
BUT: Bayes doesn’t solve misspecification of the model