
Page 1

Weakly informative priors

Andrew Gelman
Department of Statistics and Department of Political Science

Columbia University

21 Oct 2011

Collaborators (in order of appearance): Gary King, Frederic Bois, Aleks Jakulin, Vince Dorie, Sophia Rabe-Hesketh, Jingchen Liu, Yeojin Chung, Matt Schofield . . .

Page 2

Weakly informative priors for . . .

- Two applied examples
  1. Identifying a three-component mixture (1990)
  2. Population variation in toxicology (1996)
- Some default priors:
  3. Logistic regression (2008)
  4. Hierarchical models (2006, 2011)
  5. Covariance matrices (2011)
  6. Mixture models (2011)

Page 3

1. Identifying a three-component mixture

- Maximum likelihood estimate blows up
- Bayes posterior with flat prior blows up too!

Page 4

Priors!

- Mixture component 1: mean has a N(−0.4, 0.4²) prior, standard deviation has an inverse-χ²(4, 0.4²) prior
- Mixture component 2: mean has a N(+0.4, 0.4²) prior, standard deviation has an inverse-χ²(4, 0.4²) prior
- Mixture component 3: mean has a N(0, 3²) prior, standard deviation has an inverse-χ²(4, 0.8²) prior
- The three mixture proportions have a Dirichlet(19, 19, 4) prior (see the prior-simulation sketch below)
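As an added illustration (not part of the original slides), the sketch below simulates one draw from these priors and plots the implied three-component mixture density. The helper rinvchisq, the seed, and the plotting grid are my own choices; the inverse-χ²(ν, s²) priors are treated as priors on the variances.

# A minimal sketch: simulate one draw from the priors listed above and look at
# the kind of three-component mixture they imply.
set.seed(1)

# Scaled inverse-chi^2(nu, s^2) draws for a variance: nu * s^2 / chi^2_nu
rinvchisq <- function(n, nu, s2) nu * s2 / rchisq(n, nu)

mu    <- c(rnorm(1, -0.4, 0.4), rnorm(1, 0.4, 0.4), rnorm(1, 0, 3))
sigma <- sqrt(c(rinvchisq(1, 4, 0.4^2), rinvchisq(1, 4, 0.4^2), rinvchisq(1, 4, 0.8^2)))

# Dirichlet(19, 19, 4) draw via normalized gamma variates
g      <- rgamma(3, shape = c(19, 19, 4), rate = 1)
lambda <- g / sum(g)

# Density of the implied mixture on a grid
x    <- seq(-3, 3, length.out = 200)
dens <- lambda[1] * dnorm(x, mu[1], sigma[1]) +
        lambda[2] * dnorm(x, mu[2], sigma[2]) +
        lambda[3] * dnorm(x, mu[3], sigma[3])
plot(x, dens, type = "l", main = "One draw from the prior on the mixture")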

Page 5

Separating into Republicans, Democrats, and open seats

- Beyond “weakly informative” by using incumbency information

Page 6

2. Weakly informative priors for population variation in toxicology

- Pharmacokinetic parameters such as the “Michaelis-Menten coefficient”
- Wide uncertainty: prior guess for θ is 15, with a factor of 100 of uncertainty: log θ ∼ N(log(15), log(10)²) (±2 sd on the log scale spans a factor of 10² = 100)
- Population model: data on several people j, log θj ∼ N(log(15), log(10)²) ????
- Hierarchical prior distribution:
  - log θj ∼ N(µ, σ²), σ ≈ log(2)
  - µ ∼ N(log(15), log(10)²)
- Weakly informative (see the simulation sketch below)
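As an added illustration, the following sketch simulates from the hierarchical prior above. The number of people J and holding σ fixed at log(2) are illustrative simplifications (in the full model σ would itself get a prior).

# A minimal sketch: simulate the implied population variation in theta
# (e.g., a Michaelis-Menten coefficient) across J people.
set.seed(1)
J <- 8                                    # illustrative number of people

mu        <- rnorm(1, log(15), log(10))   # population geometric mean, very uncertain
sigma     <- log(2)                       # person-to-person sd on the log scale (approx)
log_theta <- rnorm(J, mu, sigma)
theta     <- exp(log_theta)

round(theta, 1)
# People vary by roughly a factor of 2 around exp(mu), while exp(mu) itself
# is only known to within about a factor of 100.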

Page 7

Concepts

- Wip instead of noninformative prior or informative prior
- You’re using wips already!
- Prior as a placeholder
- Model as a placeholder
- “Life is what happens to you while you’re busy making other plans”
- Full Bayes vs. Bayesian point estimates
- Hierarchical modeling as a unifying framework

Page 8

3. Logistic regression

[Figure: the logistic curve y = logit⁻¹(x), with slope 1/4 at the center]

Page 9

A clean example

[Figure: data y vs. x with the fitted curve, estimated Pr(y=1) = logit⁻¹(−1.40 + 0.33 x); slope at the center = 0.33/4]

Page 10

The problem of separation

[Figure: completely separated data y vs. x; the fitted logistic curve has slope = infinity?]

Page 11

Separation is no joke!

glm (vote ~ female + black + income, family=binomial(link="logit"))

                 1960                               1968
            coef.est  coef.se               coef.est  coef.se
(Intercept)    -0.14     0.23   (Intercept)     0.47     0.24
female          0.24     0.14   female         -0.01     0.15
black          -1.03     0.36   black          -3.64     0.59
income          0.03     0.06   income         -0.03     0.07

                 1964                               1972
            coef.est  coef.se               coef.est  coef.se
(Intercept)    -1.15     0.22   (Intercept)     0.67     0.18
female         -0.09     0.14   female         -0.25     0.12
black         -16.83   420.40   black          -2.63     0.27
income          0.19     0.06   income          0.09     0.05

Page 12

Regularization in action!

Page 13

Information in prior distributions

- Informative prior dist
  - A full generative model for the data
- Noninformative prior dist
  - Let the data speak
  - Goal: valid inference for any θ
- Weakly informative prior dist
  - Purposely include less information than we actually have

Page 14

Weakly informative priors for logistic regression

- Separation in logistic regression
- Some prior info: logistic regression coefs are almost always between −5 and 5:
  - 5 on the logit scale takes you from 0.01 to 0.50, or from 0.50 to 0.99
  - Smoking and lung cancer
- Independent Cauchy prior dists with center 0 and scale 2.5
- Rescale each predictor to have mean 0 and sd 1/2
- Fast implementation using EM; easy adaptation of glm (see the bayesglm sketch below)
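The glm adaptation mentioned above is available as bayesglm() in the arm package. Here is a minimal sketch of a call mirroring the earlier glm() example; the data frame dat is a placeholder, and the prior arguments are written out explicitly rather than relying on defaults.

# A minimal sketch of the weakly informative default prior in practice.
library(arm)

fit <- bayesglm(vote ~ female + black + income,
                family = binomial(link = "logit"),
                data = dat,              # placeholder data frame
                prior.scale = 2.5,       # scale of the t prior on each coefficient
                prior.df = 1)            # df = 1 gives the Cauchy; larger df gives t priors
display(fit)
# See the 'scaled' argument in ?bayesglm for the mean-0 / sd-1/2 input rescaling
# described above.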

Page 15

Prior distributions

[Figure: prior distribution(s) for θ, plotted over (−10, 10)]

Page 16

Another example

  Dose   #deaths/#animals
 −0.86        0/5
 −0.30        1/5
 −0.05        3/5
  0.73        5/5

- Slope of a logistic regression of Pr(death) on dose (fit in the sketch below):
  - Maximum likelihood est is 7.8 ± 4.9
  - With the weakly informative prior: Bayes est is 4.4 ± 1.9
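A minimal sketch of this comparison in R (added here, not from the slides): the four dose groups are entered directly and fit with glm() and with bayesglm() from the arm package. The printed estimates should roughly match the numbers quoted above.

library(arm)

dose   <- c(-0.86, -0.30, -0.05, 0.73)
deaths <- c(0, 1, 3, 5)
n      <- c(5, 5, 5, 5)

fit_mle   <- glm(cbind(deaths, n - deaths) ~ dose,
                 family = binomial(link = "logit"))
fit_bayes <- bayesglm(cbind(deaths, n - deaths) ~ dose,
                      family = binomial(link = "logit"),
                      prior.scale = 2.5, prior.df = 1)

display(fit_mle)    # slope roughly 7.8 with se about 4.9
display(fit_bayes)  # slope pulled toward zero, roughly 4.4 with se about 1.9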

Page 17

Maximum likelihood and Bayesian estimates

[Figure: probability of death vs. dose, showing the fitted curves from glm and bayesglm]

- Which is truly conservative?
- The sociology of shrinkage

Page 18

Evaluation using a corpus of datasets

- Compare classical glm to Bayesian estimates using various prior distributions
- Evaluate using 5-fold cross-validation and average predictive error
- The optimal prior distribution for the β’s is (approximately) Cauchy(0, 1)
- Our Cauchy(0, 2.5) prior distribution is weakly informative! (A cross-validation sketch follows.)
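The corpus evaluation itself is not reproduced here, but the following added sketch shows the kind of 5-fold cross-validation being described, comparing glm and bayesglm by average negative log test likelihood on a single hypothetical data frame dat with a 0/1 outcome named y and predictors x1, x2 (all placeholder names).

library(arm)

cv_logloss <- function(formula, data, fitter, K = 5) {
  folds <- sample(rep(1:K, length.out = nrow(data)))   # random fold assignment
  loss  <- numeric(K)
  for (k in 1:K) {
    fit <- fitter(formula, family = binomial(link = "logit"),
                  data = data[folds != k, ])
    p   <- predict(fit, newdata = data[folds == k, ], type = "response")
    y   <- data$y[folds == k]                          # assumes the outcome column is named y
    loss[k] <- -mean(y * log(p) + (1 - y) * log(1 - p))
  }
  mean(loss)
}

cv_logloss(y ~ x1 + x2, dat, glm)       # classical maximum likelihood
cv_logloss(y ~ x1 + x2, dat, bayesglm)  # Cauchy(0, 2.5) weakly informative prior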

Page 19

Expected predictive loss, avg over a corpus of datasets

[Figure: −log test likelihood vs. scale of prior, averaged over the corpus of datasets, for t priors with df = 0.5, 1, 2, 4, and 8, with reference values for GLM (1.79), BBR(l), and BBR(g)]

Page 20

4. Inference for hierarchical variance parameters

[Figure: marginal likelihood for σα, plotted over (0, 30)]

Page 21

Hierarchical variance parameters: 1. Full Bayes

- What is a good “weakly informative prior”?
  - log σα ∼ Uniform(−∞, ∞)
  - σα ∼ Uniform(0, ∞)
  - σα ∼ Inverse-gamma(0.001, 0.001)
  - σα ∼ Cauchy⁺(0, A), i.e., half-Cauchy (see the sampling sketch below)
- Polson and Scott (2011): “The half-Cauchy occupies a sensible ‘middle ground’ . . . it performs very well near the origin, but does not lead to drastic compromises in other parts of the parameter space.”
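To make the candidates above concrete, here is a small added sketch that draws from each prior and compares quantiles. The scale A = 25 and the upper bound used to stand in for the improper uniform are illustrative choices, not values from the slides.

set.seed(1)
n <- 1e5
A <- 25

sigma_halfcauchy <- abs(rcauchy(n, 0, A))                      # half-Cauchy(0, A)
sigma_invgamma   <- 1 / sqrt(rgamma(n, 0.001, rate = 0.001))   # sd implied by Inv-gamma(0.001, 0.001) on the variance
sigma_uniform    <- runif(n, 0, 1000)                          # Uniform(0, B), a proper stand-in for Uniform(0, Inf)

# Quantiles show how much mass each prior puts near 0 and far out in the tail
sapply(list(half_Cauchy = sigma_halfcauchy,
            inv_gamma   = sigma_invgamma,
            uniform     = sigma_uniform),
       quantile, probs = c(0.05, 0.5, 0.95))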

Page 22

Problems with inverse-gamma prior

- Inv-gamma prior cuts off at 0

Page 23

Problems with uniform prior

- Uniform prior doesn’t cut off the long tail

Page 24

Hierarchical variance parameters: 2. Point estimation

- [Estimate ± standard error] as an approximation to full Bayes
- Point estimation as a goal in itself
- Problems with the boundary estimate, σ̂α = 0

Page 25

The problem of boundary estimates: 8-schools example

[Figure: marginal likelihood for σα in the 8-schools example, plotted over (0, 30)]

Page 26

The problem of boundary estimates: simulation

[Figure, left panel: sampling distribution of the maximum marginal likelihood estimate σ̂α over 1000 simulations (true value σα = 0.5), plotted against the hierarchical scale parameter. Right panel: 100 draws of the marginal likelihood p(y | σα), each corresponding to a different random draw of y.]
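The original simulation is not reproduced here, but the following added sketch illustrates the same phenomenon with lme4: grouped data are simulated with a true group-level sd of 0.5, the sd is estimated by (restricted) maximum marginal likelihood, and a noticeable fraction of the estimates collapse to the boundary. The group sizes, number of groups, and number of replications are illustrative.

library(lme4)

one_sim <- function(J = 8, n = 5, sigma_alpha = 0.5, sigma_y = 1) {
  group <- factor(rep(1:J, each = n))
  alpha <- rnorm(J, 0, sigma_alpha)
  y     <- alpha[as.integer(group)] + rnorm(J * n, 0, sigma_y)
  fit   <- suppressMessages(lmer(y ~ 1 + (1 | group)))   # boundary fits trigger singular-fit messages
  as.data.frame(VarCorr(fit))$sdcor[1]                   # estimated group-level sd
}

set.seed(1)
est <- replicate(200, one_sim())
mean(est < 1e-6)   # proportion of estimates stuck at the boundary (zero)
hist(est, main = "Sampling distribution of the estimated group-level sd")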

Page 27

The problem of boundary estimates: simulation

- Box and Tiao (1973):
- All variance parameters want to become lost in the noise
- When does it hurt to estimate σ̂α = 0?

Page 28

Point estimate of a hierarchical variance parameter

- Desirable properties:
  - Point estimate should never be 0
  - But . . . no nonzero lower bound
  - Estimate should respect the likelihood
  - Bias and variance should be as good as the mle
  - Should be easy to compute
  - Should be Bayesian

Page 29

Boundary-avoiding point estimate!

[Figure: posterior for σα with a Gamma(0,∞) prior, showing the likelihood, prior, and posterior]

Page 30

Boundary-avoiding weakly-informative point estimate

[Figure: posterior for σα with a Gamma(0, 1/25) prior, showing the likelihood, prior, and posterior]

Page 31

Gamma (not inverse-gamma) prior on σα

[Figure, three panels: the posterior for σα under a uniform prior, a Gamma(0,∞) prior, and a Gamma(0, 1/25) prior, each showing the likelihood, prior, and posterior]

The posterior mode is shifted at most one standard error from the boundary.

Page 32

5. Boundary estimate of group-level correlation

Point estimates of the correlation ρ̂ from a hierarchical varying-intercept, varying-slope model:

[Figure, left panel: sampling distribution of the maximum marginal likelihood estimate ρ̂ over 1000 simulations (true value ρ = 0), plotted against the hierarchical correlation parameter. Right panel: 100 draws of the profile marginal likelihood p(y | ρ), each corresponding to a different random draw of y.]

- Boundary-avoiding prior: ρ ∼ Beta(2, 2)
- Better statistical properties than the mle

Page 33

Weakly informative priors for covariance matrix

- Boundary-avoiding prior:
  - Wishart (not inverse-Wishart) prior
  - Generalization of the gamma
- Full Bayes:
  - Scaled inverse-Wishart prior, Diag*Q*Diag (see the note below)
  - Generalization of the half-t for σα and Beta(2, 2) for ρ
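As a reference for the “Diag*Q*Diag” notation above, one common way to write the scaled inverse-Wishart prior is the following (this display is added here; the identity scale matrix and the choice of priors on the ξk are illustrative, not from the slides):

    Σ = Diag(ξ) Q Diag(ξ),   Q ∼ Inv-Wishart(ν, I),   ξk given separate (e.g., half-t) priors,

so the correlations of Σ come from Q alone, while the standard deviation of the k-th component is ξk √(Qkk).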

Page 34

6. Weakly informative priors for mixture models

- Well-known problem of fitting the mixture model likelihood
- The maximum likelihood fits are weird, with a single point taking half the mixture
- Bayes with a flat prior is just as bad
- These solutions don’t “look” like mixtures
- There must be additional prior information (or, to put it another way, regularization)
- Simple constraints, for example, a prior dist on the variance ratio
- Weakly informative prior: use a hierarchical model for the scale parameters

Page 35

General theory for wips

- (Unknown) true prior, ptrue(θ) = N(θ | µ0, σ0²)
- Your subjective prior, psubj(θ) = N(θ | µ1, σ1²)
- Weakly informative prior, pwip(θ) = N(θ | µ1, (κσ1)²), with κ > 1
- Tradeoffs if κ is too low or too high (see the simulation sketch below)
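As an added illustration of the tradeoff in the last bullet (not from the slides), the sketch below draws θ from the true prior, deliberately mis-centers the subjective prior, and tracks the mean squared error of the posterior mean as κ grows. All numeric settings are illustrative.

set.seed(1)
n_sim   <- 1e4
mu0 <- 0;  sigma0 <- 1      # true prior (unknown to the analyst)
mu1 <- 1;  sigma1 <- 1      # subjective prior, mis-centered
sigma_y <- 1                # sd of a single observation y | theta

theta <- rnorm(n_sim, mu0, sigma0)
y     <- rnorm(n_sim, theta, sigma_y)

mse_for_kappa <- function(kappa) {
  tau <- kappa * sigma1
  # posterior mean for a normal mean with known variance and N(mu1, tau^2) prior
  post_mean <- (y / sigma_y^2 + mu1 / tau^2) / (1 / sigma_y^2 + 1 / tau^2)
  mean((post_mean - theta)^2)
}

kappas <- c(0.5, 1, 2, 4, 8, 16)
round(sapply(kappas, mse_for_kappa), 3)
# Small kappa leans too hard on the mis-centered prior; very large kappa throws
# away the prior information and the error approaches that of the raw data alone.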

Page 36

What have we learned?

- Models need structure but not too much structure
- Conservatism in statistics
- Priors for full Bayes vs. priors for point estimation
- Formalize the losses in supplying too much or too little prior info
