Bayesian estimation

Why and How to Run Your First Bayesian Model

Rens van de Schoot, rensvandeschoot.com

Classical null hypothesis testing

Wainer (1999): "One Cheer for Null-Hypothesis Significance Testing", Psychological Methods, 4, 212-213

… however …

NHT vs. Bayes

Pr(Data | H0) ≠ Pr(Hi | Data)

Bayes Theorem

Pr(Hi | Data) = Pr(Data | Hi) × Pr(Hi) / Pr(Data)

Posterior ∝ prior × data

Posterior probability is proportional to the product of the prior probability and the likelihood
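
The proportionality can be checked with a small numerical sketch. The two hypotheses, their prior probabilities, and the likelihood values below are made-up numbers chosen only for illustration, and `posterior` is a hypothetical helper name.

```python
def posterior(priors, likelihoods):
    """Return Pr(Hi | Data) for each hypothesis from Pr(Hi) and Pr(Data | Hi)."""
    # Prior times likelihood gives the unnormalized posterior.
    unnormalized = [p * l for p, l in zip(priors, likelihoods)]
    # Pr(Data) is the normalizing constant: the sum over all hypotheses.
    evidence = sum(unnormalized)
    return [u / evidence for u in unnormalized]


# Hypothetical numbers: H0 and H1 equally likely a priori,
# and the data four times as likely under H1 as under H0.
post = posterior(priors=[0.5, 0.5], likelihoods=[0.1, 0.4])
print(post)
```

Because the priors are equal here, the posterior probabilities differ only through the likelihoods; changing the priors shifts the result accordingly.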

Bayes theorem: prior, data and posterior

Intelligence (IQ)

[Figure: IQ axis running from -∞ to ∞]

Prior Knowledge

[Figure: prior 1 on the IQ axis from -∞ to ∞]

Intelligence Interval   Cognitive Designation

40 - 54                 Severely challenged (<1%)

55 - 69                 Challenged (2.3% of test takers)

70 - 84                 Below average

85 - 114                Average (68% of test takers)

115 - 129               Above average

130 - 144               Gifted (2.3% of test takers)

145 - 159               Genius (less than 1% of test takers)

160 - 175               Extraordinary genius

Prior Knowledge

[Figures: priors 2 to 5 on the IQ axis from 40 to 180, with 100 marked on the axis for priors 3 to 5]

[Figure: overview of priors 1 to 5 on an axis from -∞ to ∞]

Prior, Data, Posterior

[Figures: prior, data, and posterior shown step by step on the IQ axis from -∞ to ∞]

Prior - Data

[Figures: prior and data on the IQ axis from 40 to 180, with 100 marked on the axis]

How to obtain the posterior?

In complex models, the posterior is often intractable (impossible to compute exactly)

Solution: approximate the posterior by simulation

– Simulate many draws from the posterior distribution

– Compute mode, median, mean, 95% interval, et cetera from the simulated draws
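
A minimal sketch of the second step, summarizing simulated draws. The draws here are a stand-in generated from a normal distribution with an assumed mean of 102 and standard deviation of 3.35; in a real analysis they would come from the sampler. The function name `summarize` is hypothetical.

```python
import random
import statistics

rng = random.Random(2024)
# Stand-in for posterior draws (assumed distribution, for illustration only).
draws = [rng.gauss(102.0, 3.35) for _ in range(10000)]


def summarize(draws):
    """Posterior mean, median, and 95% central interval from simulated draws."""
    # quantiles(..., n=40) returns 39 cut points in steps of 2.5%,
    # so the first and last are the 2.5% and 97.5% quantiles.
    cuts = statistics.quantiles(draws, n=40)
    return {
        "mean": statistics.fmean(draws),
        "median": statistics.median(draws),
        "interval": (cuts[0], cuts[-1]),
    }


summary = summarize(draws)
print(summary)
```

The mode is left out because for continuous draws it requires a density estimate or a histogram first.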

ANOVA example

4 unknown parameters μj (j = 1, ..., 4) and one common but unknown σ².

Statistical model:

Y = μ1*D1 + μ2*D2 + μ3*D3 + μ4*D4 + E

with E ~ N(0, σ²)

The Gibbs sampler

Specify prior: Pr(μ1, μ2, μ3, μ4, σ²)

Prior (μj) ~ Nor(μ0, var0), for example Prior (μj) ~ Nor(0, 10000)

Prior (σ²) ~ IG(0.001, 0.001)

The prior for σ² is Inverse Gamma with parameters (shape, scale)

The Gibbs sampler

Combining the prior with the likelihood provides the posterior:

Post ( μ1, μ2, μ3, μ4, σ² | data )

… this is a 5-dimensional distribution …

The Gibbs sampler

Iterative evaluation via conditional distributions:

Post ( μ1 | μ2, μ3, μ4, σ², data ) ∝ Prior (μ1) × Data (μ1)

…

Post ( σ² | μ1, μ2, μ3, μ4, data ) ∝ Prior (σ²) × Data (σ²)

1. Assign starting values

2. Sample μ1 from its conditional distribution

…

6. Sample σ² from its conditional distribution

7. Go to step 2 until enough iterations have been run
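
The steps above can be sketched for the ANOVA model with the priors Nor(0, 10000) and IG(0.001, 0.001). This is a minimal illustration, not the implementation used by any particular package. The function name `gibbs_anova`, the starting values, and the simulated data (four groups of 25 with made-up true means) are assumptions. The conditional distributions used are the standard conjugate results: normal for each μj and inverse gamma for σ².

```python
import random
import statistics


def gibbs_anova(groups, n_iter=2000, mu0=0.0, var0=10000.0,
                shape0=0.001, scale0=0.001, seed=1):
    """Gibbs sampler for group means mu_j and a common residual variance sigma2."""
    rng = random.Random(seed)
    n_total = sum(len(y) for y in groups)
    # Step 1: assign starting values (arbitrary choices).
    mu = [0.0] * len(groups)
    sigma2 = 10.0
    draws = []
    for _ in range(n_iter):
        # Steps 2-5: sample each mu_j from its normal conditional distribution.
        for j, y in enumerate(groups):
            cond_var = 1.0 / (1.0 / var0 + len(y) / sigma2)
            cond_mean = cond_var * (mu0 / var0 + sum(y) / sigma2)
            mu[j] = rng.gauss(cond_mean, cond_var ** 0.5)
        # Step 6: sample sigma2 from its inverse-gamma conditional distribution,
        # drawn as the reciprocal of a gamma variate (gammavariate takes a scale).
        sse = sum((v - mu[j]) ** 2 for j, y in enumerate(groups) for v in y)
        gamma_draw = rng.gammavariate(shape0 + n_total / 2.0,
                                      1.0 / (scale0 + sse / 2.0))
        sigma2 = 1.0 / gamma_draw
        # Step 7: store the draw and repeat.
        draws.append((tuple(mu), sigma2))
    return draws


# Hypothetical data: four groups of 25 observations with residual SD 1.
data_rng = random.Random(7)
true_means = [4.5, 3.5, 5.0, 6.5]
groups = [[data_rng.gauss(m, 1.0) for _ in range(25)] for m in true_means]

draws = gibbs_anova(groups)
kept = draws[len(draws) // 2:]  # discard the first half as burn-in
post_means = [statistics.fmean([d[0][j] for d in kept]) for j in range(4)]
post_sigma2 = statistics.fmean([d[1] for d in kept])
print(post_means, post_sigma2)
```

With priors this diffuse, the posterior means of μ1 to μ4 should sit very close to the group sample means.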

The Gibbs sampler

Iteration   μ1     μ2     μ3     μ4     σ²

1           3.00   5.00   8.00   3.00   10

2           3.75   4.25   7.00   4.30   8

3           3.65   4.11   6.78   5.55   5

...         ...    ...    ...    ...    ...

15          4.45   3.19   5.08   6.55   1.1

...         ...    ...    ...    ...    ...

199         4.59   3.75   5.21   6.36   1.2

200         4.36   3.45   4.65   6.99   1.3

The Gibbs sampler

[Figure: trace plot of the sampled values]

[Figure: posterior distribution built from the draws in the trace plot]

Burn In

The Gibbs sampler must run t iterations ('burn in') before we reach the target distribution f(Z)

– How many iterations are needed to converge on the target distribution?

Diagnostics

– Examine graph of burn in

– Try different starting values

– Run several chains in parallel

Convergence

[Figures: series of convergence plots]

Conclusion about convergence

Burn-in: Mplus deletes the first half of the chain

Run multiple chains (Mplus default: 2)

– Decrease BCONVERGENCE: the default is .05, but it is better to use .01

ALWAYS do a graphical evaluation of each and every parameter
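
Alongside the plots, a numerical check across chains can be sketched as below. This is a generic Gelman-Rubin style potential scale reduction, written under the assumption that the first half of each chain is discarded as burn-in. It is not claimed to reproduce the exact statistic or stopping rule Mplus uses. The function name and the simulated chains are hypothetical.

```python
import random
import statistics


def potential_scale_reduction(chains):
    """Gelman-Rubin style statistic for one parameter across equal-length chains."""
    # Discard the first half of every chain as burn-in.
    kept = [chain[len(chain) // 2:] for chain in chains]
    n = len(kept[0])
    # Average within-chain variance.
    within = statistics.fmean([statistics.variance(c) for c in kept])
    # Variance of the chain means (between-chain variation).
    between = statistics.variance([statistics.fmean(c) for c in kept])
    pooled = (n - 1) / n * within + between
    return (pooled / within) ** 0.5


rng = random.Random(11)
# Two chains exploring the same distribution: value close to 1.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(2)]
# Two chains stuck in different regions: value well above 1.
stuck = [[rng.gauss(0.0, 1.0) for _ in range(1000)],
         [rng.gauss(5.0, 1.0) for _ in range(1000)]]

psr_mixed = potential_scale_reduction(mixed)
psr_stuck = potential_scale_reduction(stuck)
print(psr_mixed, psr_stuck)
```

A value near 1 means the chains agree; it does not replace the graphical evaluation of each parameter.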

Summing up

Probability: Degree of belief

Prior: What is known before observing the data

Posterior: What is known after observing the data

Informative prior: Tool to include subjective knowledge

Non-informative prior: Try to express absence of prior knowledge; posterior mainly determined by data

MCMC methods: Simulation (sampling) techniques to obtain the posterior distribution and all posterior summary measures

Convergence: Important to check

IQ

Data are generated: N = 20, Mean = 102, SD = 15

[Figures: plots for the IQ example]

Prior      Type     Prior variance used        Posterior mean IQ score   95% C.I./C.C.I.

ML                                             102.00                    94.42 – 109.57

Prior 1    A                                   101.99                    94.35 – 109.62

Prior 2a   M or A   large variance, SD=100     101.99                    94.40 – 109.42

Prior 2b   M or A   medium variance, SD=10     101.99                    94.89 – 109.07

Prior 2c   M or A   small variance, SD=1       102.00                    100.12 – 103.87

Prior 3    A                                   102.03                    94.22 – 109.71

Prior 4    W        medium variance, SD=10     102.00                    97.76 – 106.80

Prior 5    W        small variance, SD=1       102.00                    100.20 – 103.90

Prior 6a   W        large variance, SD=100     99.37                     92.47 – 106.10

Prior 6b   W        medium variance, SD=10     86.56                     80.17 – 92.47
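
How the prior variance pulls the posterior can be sketched with the textbook normal prior and normal data update. This sketch assumes the data SD of 15 is known and uses hypothetical prior means of 100 and 80, because the prior means behind the table are not given here. It is not meant to reproduce the table values.

```python
from statistics import NormalDist


def normal_posterior(prior_mean, prior_sd, data_mean, data_sd, n):
    """Posterior mean and SD for a normal mean with known data SD."""
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = n / data_sd ** 2
    post_var = 1.0 / (prior_precision + data_precision)
    # The posterior mean is a precision-weighted average of prior mean and data mean.
    post_mean = post_var * (prior_precision * prior_mean
                            + data_precision * data_mean)
    return post_mean, post_var ** 0.5


def central_interval(mean, sd, level=0.95):
    """Central credibility interval of a normal posterior."""
    dist = NormalDist(mean, sd)
    tail = (1.0 - level) / 2.0
    return dist.inv_cdf(tail), dist.inv_cdf(1.0 - tail)


# Data summary from the slide: N = 20, mean = 102, SD = 15.
# Hypothetical priors: a diffuse one and a narrow one centered far from the data.
diffuse = normal_posterior(100.0, 100.0, 102.0, 15.0, 20)
narrow_wrong = normal_posterior(80.0, 1.0, 102.0, 15.0, 20)

print(diffuse, central_interval(*diffuse))
print(narrow_wrong, central_interval(*narrow_wrong))
```

The diffuse prior leaves the posterior mean at about the sample mean, while the narrow prior centered elsewhere drags the posterior toward the prior mean and shrinks the interval, the same pattern as the last rows of the table.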

Uncertainty in Classical Statistics

Uncertainty = sampling distribution

– Estimate the population parameter by a sample estimate

– Imagine drawing an infinity of samples

– Distribution of the estimate over samples

Problem is that we have only one sample

– Estimate the parameter and its sampling distribution

– Estimate the 95% confidence interval

Inference in Classical Statistics

What does a 95% confidence interval actually mean?

– Over an infinity of samples, 95% of the intervals contain the true population value

– But we have only one sample

– We never know if our present estimate and confidence interval is one of those 95% or not

Inference in Classical Statistics

What does a 95% confidence interval NOT mean?

It does not mean that we have a 95% probability that the true population value is within the limits of our confidence interval

We only have an aggregate assurance that in the long run 95% of our confidence intervals contain the true population value
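
The long-run reading can be illustrated by simulation. The sketch assumes a normal population with mean 100 and a known SD of 15, samples of size 20, and a z-based interval. All of these values and the function name `coverage` are choices made for the illustration.

```python
import random
import statistics


def coverage(true_mean=100.0, sd=15.0, n=20, n_samples=5000, seed=3):
    """Share of 95% confidence intervals that contain the true mean."""
    rng = random.Random(seed)
    z = statistics.NormalDist().inv_cdf(0.975)
    half_width = z * sd / n ** 0.5  # known-SD interval, same width every time
    hits = 0
    for _ in range(n_samples):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        sample_mean = statistics.fmean(sample)
        if sample_mean - half_width <= true_mean <= sample_mean + half_width:
            hits += 1
    return hits / n_samples


share = coverage()
print(share)
```

The share is a property of the procedure over many samples; any single interval either contains the true value or it does not.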


Uncertainty in Bayesian Statistics

Uncertainty = probability distribution for the population parameter

In classical statistics the population parameter has one single true value

In Bayesian statistics we imagine a distribution of possible values of population parameter


Inference in Bayesian Statistics

What does a 95% central credibility interval mean?

We have a 95% probability that the population value is within the limits of our credibility interval

What have we learned so far?

Results are compromise of prior & data

However:

-> non/low-informative priors

-> informative priors

-> misspecification of the prior

-> convergence

Results are easier to communicate (e.g., CCI compared to confidence interval)

Software

WinBUGS/OpenBUGS: Bayesian inference Using Gibbs Sampling. Very general, user must set up model

R packages: LearnBayes, R2WinBUGS, MCMCpack

MLwiN: Special implementation for multilevel regression

AMOS: Special implementation for SEM

Mplus: Very general (SEM + ML + many other models)

MPLUS - ML

DATA: FILE IS data.dat;

VARIABLE: NAMES ARE IQ;

ANALYSIS: ESTIMATOR IS ML;

MODEL: [IQ];

MPLUS – BAYES: default settings

ESTIMATOR IS BAYES;

MODEL: [IQ];

MPLUS – BAYES: default settings

Prior for IQ:

Prior mean = 0

Prior variance = 10^10

[Figure: default prior on the IQ axis, centered at 0]

MPLUS – BAYES: change prior

Label the parameter, then give the label a prior:

ESTIMATOR IS BAYES;

MODEL: [IQ] (p1);

MODEL PRIOR: p1 ~ N(a,b);

a = prior mean

b = prior variance (Mplus specifies the normal prior by its variance)

For example:

MODEL PRIOR: p1 ~ N(100,10);

Request plots:

PLOT: TYPE IS PLOT2;

Full input:

DATA: FILE IS data.dat;

VARIABLE: NAMES ARE IQ;

ANALYSIS: ESTIMATOR IS BAYES;
  CHAINS = 4;
  BITERATIONS = (1000);
  BCONVERGENCE = .01;

MODEL: [IQ] (p1);

MODEL PRIOR: p1 ~ N(100,10);

PLOT: TYPE IS PLOT2;

OUTPUT: stand sampstat TECH4 TECH8;

Bayesian updating

Dynamic interactionism: adolescents are believed to develop through a dynamic and reciprocal transaction between personality and the environment


In 1998, Asendorpf and Wilpers stated that "empirical evidence on the relative strength of personality effects on relationships and vice versa is surprisingly limited"

Back in 1998, there had been very few longitudinal studies about personality development. Personality was not often used as an outcome variable because it was seen as stable

These authors investigated for the first time personality and relationships over time in a sample of young students (n = 132) after their transition to university. The main conclusion of their analyses was that personality influenced change in social relationships, but not vice versa.

Bayesian updating

In 2001, Neyer and Asendorpf replicated the personality–relationship model, but now using a large representative sample of young adults

Based on the previous results, Neyer and Asendorpf "[…] hypothesized that personality effects would have a clear superiority over relationships effects"

In line with Asendorpf and Wilpers, they concluded that "Path analyses showed that once initial correlations were controlled, personality traits predicted change in various aspects of social relationships, whereas effects of antecedent relationships on personality were rare and restricted to very specific relationships with one's pre-school children"

Bayesian updating

[Figure: cross-lagged panel model with T1 Extraversion, T2 Extraversion, T1 Friends, and T2 Friends; paths β1 to β4, correlations r1 and r2, and residuals e1 and e2. One cross-lagged path is hypothesized to be > 0 and the other to be 0]

Bayesian updating

In 2003, Asendorpf and van Aken continued working on studies into personality–relationship transaction. The authors stated that

"The aim of the present study was to apply the methodology used by Asendorpf and Wilpers (1998) and Neyer and Asendorpf (2001) to the study of personality–relationship transaction over adolescence, to try to replicate key findings of these earlier studies, particularly the dominance of […] traits over relationship quality"

Asendorpf and van Aken confirmed previous findings:

"The stronger effect was an extraversion effect on perceived support from peers. This result replicates, once more, similar findings in adulthood." (p. 653)

Bayesian updating

In 2010, Sturaro, Denissen, van Aken, and Asendorpf, once again, investigated the personality–relationship transaction model

Sturaro et al. found some contradictory results compared to the previously described studies

"[The Five-Factor theory] predicts significant paths from personality to change in social relationship quality, whereas it does not predict social relationship quality to have an impact on personality change. Contrary to our expectation, however, personality did not predict changes in relationship quality"


Bayesian updating

In conclusion, the four papers described above clearly illustrate how theory building works in daily practice.

Asendorpf and Wilpers (1998) started with testing theoretical ideas on the association between personality and social relationships, tracing back to McCrae and Costa (1996), and although their results were replicated by Neyer and Asendorpf (2001), and Asendorpf and van Aken (2003), Sturaro, Denissen, van Aken, and Asendorpf (2010) were not able to do so. This latter finding led to re-formulations of the original theoretical ideas.

Bayesian updating

Why not update the results instead of testing the null hypothesis over and over again?

Let’s use Bayesian updating and impose subjective priors

In the first scenario we only focus on those data sets with similar age groups.

Therefore we first re-analyze the data of Neyer and Asendorpf (2001) without using prior knowledge. Thereafter, we re-analyze the data of Sturaro et al. (2010) using prior information based on the data of Neyer and Asendorpf; both data sets contain young adults between 17-30 years of age.
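
The idea of carrying one study's posterior forward as the next study's prior can be sketched with a normal prior and normal data with known variance. The two small data sets, the data variance of 0.04, and the starting prior are made-up numbers. The sketch also assumes both studies estimate the same parameter, which is the assumption the scenarios here examine.

```python
def update(prior_mean, prior_var, data, data_var):
    """Normal prior combined with normal data (known variance): posterior mean, variance."""
    prior_precision = 1.0 / prior_var
    data_precision = len(data) / data_var
    post_var = 1.0 / (prior_precision + data_precision)
    data_mean = sum(data) / len(data)
    post_mean = post_var * (prior_precision * prior_mean
                            + data_precision * data_mean)
    return post_mean, post_var


# Hypothetical effect estimates from two studies of the same parameter.
study_1 = [0.62, 0.55, 0.70, 0.58]
study_2 = [0.30, 0.41, 0.25]

# Analyze study 1 with a diffuse prior, then reuse its posterior as the prior for study 2.
m1, v1 = update(0.0, 100.0, study_1, 0.04)
m2, v2 = update(m1, v1, study_2, 0.04)

# Same result as analyzing all the data at once with the diffuse prior.
m_all, v_all = update(0.0, 100.0, study_1 + study_2, 0.04)
print((m2, v2), (m_all, v_all))
```

Updating in sequence and pooling all the data give the same posterior in this simple model, which is why the choice of which earlier studies count as relevant prior information matters so much.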


In the second scenario we assume the relation between personality and social relationships is independent of age and we re-analyze the data of Sturaro et al. using prior information taken from Neyer and Asendorpf and from Asendorpf and van Aken.

In this second scenario we make a strong assumption, namely that the cross-lagged effects for young adolescents are equal to the cross-lagged effects of young adults.

This assumption implies similar developmental trajectories across age groups and indicates a full replication study.

Bayesian updating

[Figure: cross-lagged panel model relating T1/T2 Extraversion and T1/T2 Friends, with paths β1 to β4, correlations r1 and r2, and residuals e1 and e2]

Scenario 1

Model 1: Neyer & Asendorpf data, without prior knowledge

Model 2: Sturaro et al. data

Model 3: Sturaro et al. data, with priors based on Model 1

      Model 1                           Model 2                           Model 3
      Estimate (SD)    95% PPI          Estimate (SD)    95% PPI          Estimate (SD)    95% PPI
β1    0.605 (0.037)    0.532 - 0.676    0.291 (0.063)    0.169 - 0.424    0.337 (0.058)    0.228 - 0.449
β2    0.293 (0.047)    0.199 - 0.386    0.157 (0.103)    -0.042 - 0.364   0.287 (0.082)    0.130 - 0.448
β3    0.131 (0.046)    0.043 - 0.222    0.029 (0.079)    -0.132 - 0.180   0.106 (0.072)    -0.038 - 0.247
β4    -0.026 (0.039)   -0.100 - 0.051   0.303 (0.081)    0.144 - 0.462    0.249 (0.067)    0.111 - 0.375

Scenario 2

Model 4: Asendorpf & van Aken data

Model 5: with priors based on Model 1

Model 6: Sturaro et al. data, with priors based on Model 5

      Model 4                           Model 5                           Model 6
      Estimate (SD)    95% PPI          Estimate (SD)    95% PPI          Estimate (SD)    95% PPI
β1    0.512 (0.069)    0.376 - 0.649    0.537 (0.059)    0.424 - 0.654    0.313 (0.059)    0.199 - 0.427
β2    0.115 (0.083)    -0.049 - 0.277   0.140 (0.071)    0.005 - 0.283    0.246 (0.087)    0.079 - 0.420
β3    0.217 (0.106)    0.006 - 0.426    0.212 (0.079)    0.057 - 0.361    0.100 (0.076)    -0.052 - 0.248
β4    0.072 (0.055)    -0.036 - 0.179   0.073 (0.051)    -0.030 - 0.171   0.259 (0.070)    0.116 - 0.393
Final results Sturaro et al.

      Scenario 1, Model 3:              Scenario 2, Model 6:              Model 2:
      Sturaro et al. data               Sturaro et al. data               Sturaro et al. data
      Estimate (SD)    95% PPI          Estimate (SD)    95% PPI          Estimate (SD)    95% PPI
β1    0.337 (0.058)    0.228 - 0.449    0.313 (0.059)    0.199 - 0.427    0.291 (0.063)    0.169 - 0.424
β2    0.287 (0.082)    0.130 - 0.448    0.246 (0.087)    0.079 - 0.420    0.157 (0.103)    -0.042 - 0.364
β3    0.106 (0.072)    -0.038 - 0.247   0.100 (0.076)    -0.052 - 0.248   0.029 (0.079)    -0.132 - 0.180
β4    0.249 (0.067)    0.111 - 0.375    0.259 (0.070)    0.116 - 0.393    0.303 (0.081)    0.144 - 0.462

Conclusions

The updating procedure of both scenarios leads us to conclude that using subjective priors narrows the 95% posterior intervals (PPI).

=> More certainty about the relations

However…

Conclusions

Using subjective priors never changed the real issue, namely that Sturaro et al. found opposite effects to Neyer and Asendorpf.

The results supported the robustness of a conclusion that effects occurring between ages 17 and 23 are different from those occurring between ages 18-30, i.e., the clearly higher age in the Neyer and Asendorpf data.

Overall Conclusions

Excellent tool to include prior knowledge if available

Estimates (including intervals) always lie in the sample space if prior is chosen wisely

Results are easier to communicate

Better small-sample performance, large-sample theory not needed

Analyses can be made less computationally demanding

BUT: Bayes doesn’t solve misspecification of the model
