
Page 1

Weakly informative priors

Andrew Gelman
Department of Statistics and Department of Political Science

Columbia University

21 Oct 2011

Collaborators (in order of appearance): Gary King, Frederic Bois, Aleks Jakulin, Vince Dorie, Sophia Rabe-Hesketh, Jingchen Liu, Yeojin Chung, Matt Schofield . . .

Page 2

Weakly informative priors for . . .

- Two applied examples
  1. Identifying a three-component mixture (1990)
  2. Population variation in toxicology (1996)
- Some default priors:
  3. Logistic regression (2008)
  4. Hierarchical models (2006, 2011)
  5. Covariance matrices (2011)
  6. Mixture models (2011)

Page 3

1. Identifying a three-component mixture

- Maximum likelihood estimate blows up
- Bayes posterior with flat prior blows up too!

Page 4

Priors!

- Mixture component 1: mean has a N(−0.4, 0.4²) prior, standard deviation has an inverse-χ²(4, 0.4²) prior
- Mixture component 2: mean has a N(+0.4, 0.4²) prior, standard deviation has an inverse-χ²(4, 0.4²) prior
- Mixture component 3: mean has a N(0, 3²) prior, standard deviation has an inverse-χ²(4, 0.8²) prior
- The three mixture proportions have a Dirichlet(19, 19, 4) prior (see the prior-simulation sketch below)
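As an added illustration (not part of the original slides), the sketch below simulates one draw from these priors and plots the implied three-component mixture density. The helper rinvchisq, the seed, and the plotting grid are my own choices; the inverse-χ²(ν, s²) priors are treated as priors on the variances.

# A minimal sketch: simulate one draw from the priors listed above and look at
# the kind of three-component mixture they imply.
set.seed(1)

# Scaled inverse-chi^2(nu, s^2) draws for a variance: nu * s^2 / chi^2_nu
rinvchisq <- function(n, nu, s2) nu * s2 / rchisq(n, nu)

mu    <- c(rnorm(1, -0.4, 0.4), rnorm(1, 0.4, 0.4), rnorm(1, 0, 3))
sigma <- sqrt(c(rinvchisq(1, 4, 0.4^2), rinvchisq(1, 4, 0.4^2), rinvchisq(1, 4, 0.8^2)))

# Dirichlet(19, 19, 4) draw via normalized gamma variates
g      <- rgamma(3, shape = c(19, 19, 4), rate = 1)
lambda <- g / sum(g)

# Density of the implied mixture on a grid
x    <- seq(-3, 3, length.out = 200)
dens <- lambda[1] * dnorm(x, mu[1], sigma[1]) +
        lambda[2] * dnorm(x, mu[2], sigma[2]) +
        lambda[3] * dnorm(x, mu[3], sigma[3])
plot(x, dens, type = "l", main = "One draw from the prior on the mixture")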

Page 5

Separating into Republicans, Democrats, and open seats

- Beyond “weakly informative” by using incumbency information

Page 6

2. Weakly informative priors for population variation in toxicology

- Pharmacokinetic parameters such as the “Michaelis-Menten coefficient”
- Wide uncertainty: prior guess for θ is 15, with a factor of 100 of uncertainty: log θ ∼ N(log(15), log(10)²) (±2 sd on the log scale spans a factor of 10² = 100)
- Population model: data on several people j, log θj ∼ N(log(15), log(10)²) ????
- Hierarchical prior distribution:
  - log θj ∼ N(µ, σ²), σ ≈ log(2)
  - µ ∼ N(log(15), log(10)²)
- Weakly informative (see the simulation sketch below)
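As an added illustration, the following sketch simulates from the hierarchical prior above. The number of people J and holding σ fixed at log(2) are illustrative simplifications (in the full model σ would itself get a prior).

# A minimal sketch: simulate the implied population variation in theta
# (e.g., a Michaelis-Menten coefficient) across J people.
set.seed(1)
J <- 8                                    # illustrative number of people

mu        <- rnorm(1, log(15), log(10))   # population geometric mean, very uncertain
sigma     <- log(2)                       # person-to-person sd on the log scale (approx)
log_theta <- rnorm(J, mu, sigma)
theta     <- exp(log_theta)

round(theta, 1)
# People vary by roughly a factor of 2 around exp(mu), while exp(mu) itself
# is only known to within about a factor of 100.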

Page 7

Concepts

- Wip instead of noninformative prior or informative prior
- You’re using wips already!
- Prior as a placeholder
- Model as a placeholder
- “Life is what happens to you while you’re busy making other plans”
- Full Bayes vs. Bayesian point estimates
- Hierarchical modeling as a unifying framework

Page 8

3. Logistic regression

[Figure: the logistic curve y = logit⁻¹(x), with slope 1/4 at the center]

Page 9

A clean example

[Figure: data y vs. x with the fitted curve, estimated Pr(y=1) = logit⁻¹(−1.40 + 0.33 x); slope at the center = 0.33/4]

Page 10

The problem of separation

[Figure: completely separated data y vs. x; the fitted logistic curve has slope = infinity?]

Page 11

Separation is no joke!

glm (vote ~ female + black + income, family=binomial(link="logit"))

                 1960                               1968
            coef.est  coef.se               coef.est  coef.se
(Intercept)    -0.14     0.23   (Intercept)     0.47     0.24
female          0.24     0.14   female         -0.01     0.15
black          -1.03     0.36   black          -3.64     0.59
income          0.03     0.06   income         -0.03     0.07

                 1964                               1972
            coef.est  coef.se               coef.est  coef.se
(Intercept)    -1.15     0.22   (Intercept)     0.67     0.18
female         -0.09     0.14   female         -0.25     0.12
black         -16.83   420.40   black          -2.63     0.27
income          0.19     0.06   income          0.09     0.05

Page 12

Regularization in action!

Page 13

Information in prior distributions

- Informative prior dist
  - A full generative model for the data
- Noninformative prior dist
  - Let the data speak
  - Goal: valid inference for any θ
- Weakly informative prior dist
  - Purposely include less information than we actually have

Page 14

Weakly informative priors for logistic regression

- Separation in logistic regression
- Some prior info: logistic regression coefs are almost always between −5 and 5:
  - 5 on the logit scale takes you from 0.01 to 0.50, or from 0.50 to 0.99
  - Smoking and lung cancer
- Independent Cauchy prior dists with center 0 and scale 2.5
- Rescale each predictor to have mean 0 and sd 1/2
- Fast implementation using EM; easy adaptation of glm (see the bayesglm sketch below)
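The glm adaptation mentioned above is available as bayesglm() in the arm package. Here is a minimal sketch of a call mirroring the earlier glm() example; the data frame dat is a placeholder, and the prior arguments are written out explicitly rather than relying on defaults.

# A minimal sketch of the weakly informative default prior in practice.
library(arm)

fit <- bayesglm(vote ~ female + black + income,
                family = binomial(link = "logit"),
                data = dat,              # placeholder data frame
                prior.scale = 2.5,       # scale of the t prior on each coefficient
                prior.df = 1)            # df = 1 gives the Cauchy; larger df gives t priors
display(fit)
# See the 'scaled' argument in ?bayesglm for the mean-0 / sd-1/2 input rescaling
# described above.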

Page 15

Prior distributions

[Figure: prior distribution(s) for θ, plotted over (−10, 10)]

Page 16

Another example

  Dose   #deaths/#animals
 −0.86        0/5
 −0.30        1/5
 −0.05        3/5
  0.73        5/5

- Slope of a logistic regression of Pr(death) on dose (fit in the sketch below):
  - Maximum likelihood est is 7.8 ± 4.9
  - With the weakly informative prior: Bayes est is 4.4 ± 1.9
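A minimal sketch of this comparison in R (added here, not from the slides): the four dose groups are entered directly and fit with glm() and with bayesglm() from the arm package. The printed estimates should roughly match the numbers quoted above.

library(arm)

dose   <- c(-0.86, -0.30, -0.05, 0.73)
deaths <- c(0, 1, 3, 5)
n      <- c(5, 5, 5, 5)

fit_mle   <- glm(cbind(deaths, n - deaths) ~ dose,
                 family = binomial(link = "logit"))
fit_bayes <- bayesglm(cbind(deaths, n - deaths) ~ dose,
                      family = binomial(link = "logit"),
                      prior.scale = 2.5, prior.df = 1)

display(fit_mle)    # slope roughly 7.8 with se about 4.9
display(fit_bayes)  # slope pulled toward zero, roughly 4.4 with se about 1.9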

Page 17

Maximum likelihood and Bayesian estimates

[Figure: probability of death vs. dose, showing the fitted curves from glm and bayesglm]

- Which is truly conservative?
- The sociology of shrinkage

Page 18

Evaluation using a corpus of datasets

- Compare classical glm to Bayesian estimates using various prior distributions
- Evaluate using 5-fold cross-validation and average predictive error
- The optimal prior distribution for the β’s is (approximately) Cauchy(0, 1)
- Our Cauchy(0, 2.5) prior distribution is weakly informative! (A cross-validation sketch follows.)
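The corpus evaluation itself is not reproduced here, but the following added sketch shows the kind of 5-fold cross-validation being described, comparing glm and bayesglm by average negative log test likelihood on a single hypothetical data frame dat with a 0/1 outcome named y and predictors x1, x2 (all placeholder names).

library(arm)

cv_logloss <- function(formula, data, fitter, K = 5) {
  folds <- sample(rep(1:K, length.out = nrow(data)))   # random fold assignment
  loss  <- numeric(K)
  for (k in 1:K) {
    fit <- fitter(formula, family = binomial(link = "logit"),
                  data = data[folds != k, ])
    p   <- predict(fit, newdata = data[folds == k, ], type = "response")
    y   <- data$y[folds == k]                          # assumes the outcome column is named y
    loss[k] <- -mean(y * log(p) + (1 - y) * log(1 - p))
  }
  mean(loss)
}

cv_logloss(y ~ x1 + x2, dat, glm)       # classical maximum likelihood
cv_logloss(y ~ x1 + x2, dat, bayesglm)  # Cauchy(0, 2.5) weakly informative prior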

Page 19

Expected predictive loss, avg over a corpus of datasets

[Figure: −log test likelihood vs. scale of prior, averaged over the corpus of datasets, for t priors with df = 0.5, 1, 2, 4, and 8, with reference values for GLM (1.79), BBR(l), and BBR(g)]

Page 20

4. Inference for hierarchical variance parameters

[Figure: marginal likelihood for σα, plotted over (0, 30)]

Page 21

Hierarchical variance parameters: 1. Full Bayes

- What is a good “weakly informative prior”?
  - log σα ∼ Uniform(−∞, ∞)
  - σα ∼ Uniform(0, ∞)
  - σα ∼ Inverse-gamma(0.001, 0.001)
  - σα ∼ Cauchy⁺(0, A), i.e., half-Cauchy (see the sampling sketch below)
- Polson and Scott (2011): “The half-Cauchy occupies a sensible ‘middle ground’ . . . it performs very well near the origin, but does not lead to drastic compromises in other parts of the parameter space.”
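To make the candidates above concrete, here is a small added sketch that draws from each prior and compares quantiles. The scale A = 25 and the upper bound used to stand in for the improper uniform are illustrative choices, not values from the slides.

set.seed(1)
n <- 1e5
A <- 25

sigma_halfcauchy <- abs(rcauchy(n, 0, A))                      # half-Cauchy(0, A)
sigma_invgamma   <- 1 / sqrt(rgamma(n, 0.001, rate = 0.001))   # sd implied by Inv-gamma(0.001, 0.001) on the variance
sigma_uniform    <- runif(n, 0, 1000)                          # Uniform(0, B), a proper stand-in for Uniform(0, Inf)

# Quantiles show how much mass each prior puts near 0 and far out in the tail
sapply(list(half_Cauchy = sigma_halfcauchy,
            inv_gamma   = sigma_invgamma,
            uniform     = sigma_uniform),
       quantile, probs = c(0.05, 0.5, 0.95))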

Page 22

Problems with inverse-gamma prior

- Inv-gamma prior cuts off at 0

Page 23

Problems with uniform prior

- Uniform prior doesn’t cut off the long tail

Page 24

Hierarchical variance parameters: 2. Point estimation

- [Estimate ± standard error] as an approximation to full Bayes
- Point estimation as a goal in itself
- Problems with the boundary estimate, σ̂α = 0

Page 25

The problem of boundary estimates: 8-schools example

[Figure: marginal likelihood for σα in the 8-schools example, plotted over (0, 30)]

Page 26

The problem of boundary estimates: simulation

[Figure, left panel: sampling distribution of the maximum marginal likelihood estimate σ̂α over 1000 simulations (true value σα = 0.5), plotted against the hierarchical scale parameter. Right panel: 100 draws of the marginal likelihood p(y | σα), each corresponding to a different random draw of y.]
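The original simulation is not reproduced here, but the following added sketch illustrates the same phenomenon with lme4: grouped data are simulated with a true group-level sd of 0.5, the sd is estimated by (restricted) maximum marginal likelihood, and a noticeable fraction of the estimates collapse to the boundary. The group sizes, number of groups, and number of replications are illustrative.

library(lme4)

one_sim <- function(J = 8, n = 5, sigma_alpha = 0.5, sigma_y = 1) {
  group <- factor(rep(1:J, each = n))
  alpha <- rnorm(J, 0, sigma_alpha)
  y     <- alpha[as.integer(group)] + rnorm(J * n, 0, sigma_y)
  fit   <- suppressMessages(lmer(y ~ 1 + (1 | group)))   # boundary fits trigger singular-fit messages
  as.data.frame(VarCorr(fit))$sdcor[1]                   # estimated group-level sd
}

set.seed(1)
est <- replicate(200, one_sim())
mean(est < 1e-6)   # proportion of estimates stuck at the boundary (zero)
hist(est, main = "Sampling distribution of the estimated group-level sd")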

Page 27

The problem of boundary estimates: simulation

- Box and Tiao (1973):
- All variance parameters want to become lost in the noise
- When does it hurt to estimate σ̂α = 0?

Page 28

Point estimate of a hierarchical variance parameter

- Desirable properties:
  - Point estimate should never be 0
  - But . . . no nonzero lower bound
  - Estimate should respect the likelihood
  - Bias and variance should be as good as the mle
  - Should be easy to compute
  - Should be Bayesian

Page 29

Boundary-avoiding point estimate!

[Figure: posterior for σα with a Gamma(0,∞) prior, showing the likelihood, prior, and posterior]

Page 30

Boundary-avoiding weakly-informative point estimate

[Figure: posterior for σα with a Gamma(0, 1/25) prior, showing the likelihood, prior, and posterior]

Page 31

Gamma (not inverse-gamma) prior on σα

[Figure, three panels: the posterior for σα under a uniform prior, a Gamma(0,∞) prior, and a Gamma(0, 1/25) prior, each showing the likelihood, prior, and posterior]

The posterior mode is shifted at most one standard error from the boundary.

Page 32

5. Boundary estimate of group-level correlation

Point estimates of the correlation ρ̂ from a hierarchical varying-intercept, varying-slope model:

[Figure, left panel: sampling distribution of the maximum marginal likelihood estimate ρ̂ over 1000 simulations (true value ρ = 0), plotted against the hierarchical correlation parameter. Right panel: 100 draws of the profile marginal likelihood p(y | ρ), each corresponding to a different random draw of y.]

- Boundary-avoiding prior: ρ ∼ Beta(2, 2)
- Better statistical properties than the mle

Page 33

Weakly informative priors for covariance matrix

- Boundary-avoiding prior:
  - Wishart (not inverse-Wishart) prior
  - Generalization of the gamma
- Full Bayes:
  - Scaled inverse-Wishart prior, Diag*Q*Diag (see the note below)
  - Generalization of the half-t for σα and Beta(2, 2) for ρ
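As a reference for the “Diag*Q*Diag” notation above, one common way to write the scaled inverse-Wishart prior is the following (this display is added here; the identity scale matrix and the choice of priors on the ξk are illustrative, not from the slides):

    Σ = Diag(ξ) Q Diag(ξ),   Q ∼ Inv-Wishart(ν, I),   ξk given separate (e.g., half-t) priors,

so the correlations of Σ come from Q alone, while the standard deviation of the k-th component is ξk √(Qkk).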

Page 34

6. Weakly informative priors for mixture models

- Well-known problem of fitting the mixture model likelihood
- The maximum likelihood fits are weird, with a single point taking half the mixture
- Bayes with a flat prior is just as bad
- These solutions don’t “look” like mixtures
- There must be additional prior information (or, to put it another way, regularization)
- Simple constraints, for example, a prior dist on the variance ratio
- Weakly informative prior: use a hierarchical model for the scale parameters

Page 35

General theory for wips

- (Unknown) true prior, ptrue(θ) = N(θ | µ0, σ0²)
- Your subjective prior, psubj(θ) = N(θ | µ1, σ1²)
- Weakly informative prior, pwip(θ) = N(θ | µ1, (κσ1)²), with κ > 1
- Tradeoffs if κ is too low or too high (see the simulation sketch below)
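As an added illustration of the tradeoff in the last bullet (not from the slides), the sketch below draws θ from the true prior, deliberately mis-centers the subjective prior, and tracks the mean squared error of the posterior mean as κ grows. All numeric settings are illustrative.

set.seed(1)
n_sim   <- 1e4
mu0 <- 0;  sigma0 <- 1      # true prior (unknown to the analyst)
mu1 <- 1;  sigma1 <- 1      # subjective prior, mis-centered
sigma_y <- 1                # sd of a single observation y | theta

theta <- rnorm(n_sim, mu0, sigma0)
y     <- rnorm(n_sim, theta, sigma_y)

mse_for_kappa <- function(kappa) {
  tau <- kappa * sigma1
  # posterior mean for a normal mean with known variance and N(mu1, tau^2) prior
  post_mean <- (y / sigma_y^2 + mu1 / tau^2) / (1 / sigma_y^2 + 1 / tau^2)
  mean((post_mean - theta)^2)
}

kappas <- c(0.5, 1, 2, 4, 8, 16)
round(sapply(kappas, mse_for_kappa), 3)
# Small kappa leans too hard on the mis-centered prior; very large kappa throws
# away the prior information and the error approaches that of the raw data alone.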

Page 36

What have we learned?

- Models need structure but not too much structure
- Conservatism in statistics
- Priors for full Bayes vs. priors for point estimation
- Formalize the losses in supplying too much or too little prior info
