An Extended BIC for Model Selection
TRANSCRIPT
JSM: July-August 2007 - slide #1
An Extended BIC for Model Selection
JSM Meeting 2007, Salt Lake City
Surajit Ray, Boston University (Dept. of Mathematics and Statistics)
Joint work with James Berger, Duke University; Susie Bayarri, University of Valencia; Woncheol Jang, University of Georgia; Luis R. Pericchi, University of Puerto Rico
Introduction and Motivation
➲ Motivation and Outline
➲ Bayesian Information Criterion: BIC
➲ First Ambiguity: What is n?
➲ Second Ambiguity: Can n be different for different parameters?
➲ Third Ambiguity: What is the number of parameters?
➲ Example: Random effects group means
➲ Example: Common mean, differing variances
➲ Problem with n and p
EBIC
The prior
EBIC*
Effective sample size
Simulation
Motivation and Outline
Initiated from a SAMSI Social Sciences working group on “getting the model right” in structural equation modeling.
Introduced in the Wald Lecture by Jim Berger.
Specific characteristics:
  They have ‘independent cases’ (individuals).
  Often ‘multilevel’ or ‘random effects’ models (p grows as n grows).
  Regularity conditions may not be satisfied.
Problems with BIC.
Can a generalization of BIC work? Goal: use only standard software output and closed-form expressions.
Present a generalization of BIC (‘extended BIC’ or EBIC) and some examples in the linear regression scenario.
Work in progress (comments are welcome!)
Bayesian Information Criterion: BIC
BIC = 2 l(θ̂ | x) − p log(n),
where
  l(θ̂ | x) is the log-likelihood at the MLE θ̂ given x,
  p is the number of estimated parameters,
  n is the cardinality of x.
Schwarz’s result: as n → ∞ (with p fixed), this is an approximation (up to a constant) to twice the log marginal likelihood of the model, m(x) = ∫ f(x | θ) π(θ) dθ, so that
  m(x) = c_π e^{BIC/2} (1 + o(1)).
BIC-based approximations to Bayes factors further ignore the c_π’s.
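The criterion itself is a one-liner; a minimal sketch (not from the slides), where loglik_hat is the maximized log-likelihood:

```python
import math

def bic(loglik_hat, p, n):
    # BIC = 2 l(theta_hat | x) - p log(n); larger values favor the model
    # under this sign convention.
    return 2 * loglik_hat - p * math.log(n)
```

The ambiguities discussed next concern what to plug in for n and p, not the formula itself.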
First Ambiguity: What is n?
Consider time series y_it,
  y_it = β y_{i,t−1} + ε_it,  i = 1, …, n*,  t = 1, …, T,  ε_it ∼ N(0, σ²).
When the goal is to estimate the mean level of y, consider two extreme cases:
1. β = 0: the y_it are i.i.d., and the estimated parameter has sample size n = n* × T.
2. β = 1: clearly the effective sample size equals n = n*.
See Thiebaux & Zwiers (1984) for discussion of effective sample size in time series.
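A sketch of one way to interpolate between these two extremes, using the standard AR(1) autocorrelation correction in the spirit of Thiebaux & Zwiers; the floor of one effective observation per independent series is an assumption added here, not something stated on the slide:

```python
def ar1_effective_sample_size(beta, n_star, T):
    # Per-series effective size T(1 - beta)/(1 + beta) from the AR(1)
    # autocorrelation correction, floored at 1 so each independent series
    # always contributes at least one effective observation.
    per_series = max(1.0, T * (1 - beta) / (1 + beta))
    return n_star * per_series
```

At β = 0 this gives n* × T; at β = 1 it collapses to n*, matching the two cases above.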
Second Ambiguity: Can n be different for different parameters?
Consider data from r groups (group index p):
  y_ip = µ_p + ε_ip,  i = 1, …, n*,  p = 1, …, r,  ε_ip ∼ N(0, σ²),
with µ_p and σ² both unknown.

  Group 1:  y_11  y_21  …  y_{n*,1}
  Group 2:  y_12  y_22  …  y_{n*,2}
  ⋮
  Group p:  y_1p  y_2p  …  y_{n*,p}
  ⋮
  Group r:  y_1r  y_2r  …  y_{n*,r}
Different parameters can have different sample sizes:
the sample size for estimating each µp is n∗
the sample size for estimating σ2 is n∗ × r
Computation for this example yields

  Î = ( (n*/σ̂²) I_{r×r}   0
         0                 n/(2σ̂⁴) ),

where n = n* × r and σ̂² = (1/n) Σ_i Σ_l (X_il − X̄_i)².
Third Ambiguity: What is the number of parameters?
Example: Random effects group means
In the previous group-means example, with p groups and r observations X_il per group, where X_il ∼ N(µ_i, σ²) for l = 1, …, r, consider the multilevel version in which
  µ_i ∼ N(ξ, τ²),
with ξ and τ² unknown.
What is the number of parameters? (See also Pauler (1998).)
(1) If τ² = 0, there are only two parameters: ξ and σ².
(2) If τ² is huge, is the number of parameters p + 3 (the p means along with ξ, τ², and σ²)?
Continued ...
Example: Random effects group means
(3) But, if one integrates out µ = (µ_1, …, µ_p), then

  f(x | σ², ξ, τ²) = ∫ f(x | µ, σ²) π(µ | ξ, τ²) dµ
                   ∝ σ^{−p(r−1)} exp( −S²/(2σ²) ) ∏_{i=1}^{p} exp( −(x̄_i − ξ)² / (2(σ²/r + τ²)) ),

where S² = Σ_i Σ_l (x_il − x̄_i)², so p = 3 if one can work directly with f(x | σ², ξ, τ²).
Note: This seems to be common practice in the social sciences: latent variables are integrated out, so the remaining parameters are the only unknowns when applying BIC.
Note: In this example the effective sample sizes should be ≈ pr for σ², ≈ p for ξ and τ², and ≈ r for the µ_i’s.
Example: Common mean, differing variances (Cox mixtures)
Suppose each independent observation X_i, i = 1, …, n, has probability 1/2 of arising from N(θ, 1) and probability 1/2 of arising from N(θ, 1000).
Clearly half of the observations are essentially worthless, so the ‘effective sample size’ is roughly n/2.
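A sketch applying the information-based definition of effective sample size proposed later in the talk to this example; as a simplification (an assumption of this sketch), each observation's component is treated as known, so its Fisher information about θ is the reciprocal of its variance:

```python
n = 100
# Per-case Fisher information about theta, treating each observation's
# component as known: 1/1 for the N(theta, 1) half, 1/1000 for the other.
per_case = [1.0] * (n // 2) + [1.0 / 1000] * (n // 2)
# n_eff = (sum of per-case informations)^2 / (sum of their squares)
total = sum(per_case)
n_eff = total ** 2 / sum(v * v for v in per_case)
```

The high-variance half contributes almost nothing, so n_eff comes out within a fraction of a percent of n/2.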
Problem with n and p
Problems with n:
  Is n the number of vector observations or the number of real observations?
  Different θ_i can have different effective sample sizes.
  Some observations can be more informative than others, as in mixture models and models with mixed continuous and discrete observations.
Problems with p:
  What is p with random effects, latent variables, mixture models, etc.?
  Often p grows with n.
EBIC: a proposed solution
Based on a modified Laplace approximation to m(x) for good priors:
  The good priors can deal with growing p.
  The Laplace approximation retains the constant c_π in the expansion and is often good even for moderate n.
Implicitly uses ‘effective sample size’ (discussion later).
Computable using standard software, via MLEs and observed information matrices.
Laplace approximation
1. Preliminary: ‘nice’ reparameterization.
2. Taylor expansion. By a Taylor series expansion of e^{l(θ)} about the MLE θ̂,

  m(x) = ∫ f(x | θ) π(θ) dθ = ∫ e^{l(θ)} π(θ) dθ
       ≈ ∫ exp[ l(θ̂) + (θ − θ̂)ᵀ ∇l(θ̂) − ½ (θ − θ̂)ᵀ Î (θ − θ̂) ] π(θ) dθ,

where ∇ denotes the gradient and Î = (Î_jk) is the observed information matrix.
3. Approximation to the marginal m(x). If θ̂ occurs in the interior of the parameter space, so that ∇l(θ̂) = 0 (if not, the analysis must proceed as in Haughton 1991, 1993), mild conditions yield

  m(x) = e^{l(θ̂)} ∫ e^{−½ (θ − θ̂)ᵀ Î (θ − θ̂)} π(θ) dθ (1 + o_n(1)).
Choosing a good prior
Integrate out common parameters.
Orthogonalize the parameters. Let O be orthogonal and D = diag(d_i), i = 1, …, p_1, such that
  Σ = Oᵀ D O.
Make the change of variables ξ = O θ^(1), and let ξ̂ = O θ̂^(1). If there is no θ^(2) to be integrated out, Σ = Î^{−1} = Oᵀ D O.
We now use a prior that is independent in the ξ_i, i.e., π(ξ) = ∏_{i=1}^{p_1} π_i(ξ_i). The approximation to m(x) is then

  m(x) ≈ e^{l(θ̂)} (2π)^{p/2} |Î|^{−1/2} [ ∏_{i=1}^{p_1} ∫ (1/√(2π d_i)) e^{−(ξ_i − ξ̂_i)²/(2 d_i)} π_i(ξ_i) dξ_i ],

where we recall that the d_i are the diagonal elements of D from Σ = Oᵀ D O, that is, Var(ξ̂_i); they will appear often in what follows.
Univariate priors
For the priors π_i(ξ_i), there are several possibilities:
1. Jeffreys recommended the Cauchy(0, b_i) density

  π_i^C(ξ_i) = ∫_0^∞ N( ξ_i | 0, 1/(λ_i b_i) ) Ga( λ_i | ½, ½ ) dλ_i,

where (b_i)^{−1} = (d_i)^{−1}/n_i is the unit information for ξ_i, with n_i being the effective sample size for ξ_i (Kass & Wasserman, 1995).
Note: for i.i.d. scenarios, d_i is like σ_i²/n_i, so b_i is like the variance of one observation.
2. Use other sensible (objective) testing priors, like the intrinsic prior.
3. The goal is to get closed-form expressions: we use priors proposed in Berger (1985), which are very close to both the Cauchy and the intrinsic priors and produce closed-form expressions for m(x) for normal likelihoods.
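The scale-mixture representation in item 1 can be checked numerically: integrating the Normal against the Ga(½, ½) mixing density recovers a Cauchy with scale 1/√b_i. A sketch (the step size and upper integration limit are ad hoc choices):

```python
import math

def mixture_density(xi, b, h=1e-3, lam_max=60.0):
    # ∫_0^∞ N(xi | 0, 1/(lam*b)) Ga(lam | 1/2, 1/2) dlam by the midpoint rule
    total, lam = 0.0, h / 2
    while lam < lam_max:
        normal = math.sqrt(lam * b / (2 * math.pi)) * math.exp(-lam * b * xi * xi / 2)
        gamma = (0.5 ** 0.5 / math.gamma(0.5)) * lam ** -0.5 * math.exp(-lam / 2)
        total += normal * gamma * h
        lam += h
    return total

def cauchy_density(xi, scale):
    return 1.0 / (math.pi * scale * (1 + (xi / scale) ** 2))
```

The two densities agree pointwise up to quadrature error.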
Berger’s robust priors
For b_i ≥ d_i, the prior is given by

  π_i^R(ξ_i) = ∫_0^1 N( ξ_i | 0, (d_i + b_i)/(2λ_i) − d_i ) · 1/(2√λ_i) dλ_i.

  Introduced by Berger in the context of robust analyses.
  No closed-form expression.
  Not widely used in model selection.
  Behaves like a Cauchy prior with parameter b_i.
  Gives a closed-form expression for the approximation to m(x):

  m(x) ≈ e^{l(θ̂)} (2π)^{p/2} |Î|^{−1/2} ∏_{i=1}^{p_1} [ 1/√(2π(d_i + b_i)) · (1 − e^{−v_i})/(√2 v_i) ],  v_i = ξ̂_i²/(d_i + b_i),

where (d_i)^{−1} is the global information about ξ_i and (b_i)^{−1} is the unit information.
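The per-coordinate closed form can be verified numerically: convolving the N(ξ̂_i, d_i) kernel with the mixture representation of the prior gives, for each λ, a N(0, (d_i + b_i)/(2λ)) density evaluated at ξ̂_i, which is then integrated over λ ∈ (0, 1). A sketch with arbitrary test values:

```python
import math

def factor_closed_form(xi_hat, d, b):
    # 1/sqrt(2*pi*(d+b)) * (1 - e^{-v}) / (sqrt(2) * v), v = xi_hat^2 / (d + b)
    v = xi_hat ** 2 / (d + b)
    return (1 - math.exp(-v)) / (math.sqrt(2 * math.pi * (d + b)) * math.sqrt(2) * v)

def factor_numeric(xi_hat, d, b, h=1e-5):
    # integrate N(xi_hat | 0, (d+b)/(2*lam)) * 1/(2*sqrt(lam)) over lam in (0, 1)
    total, lam = 0.0, h / 2
    while lam < 1:
        var = (d + b) / (2 * lam)
        dens = math.exp(-xi_hat ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
        total += dens / (2 * math.sqrt(lam)) * h
        lam += h
    return total
```

Agreement to many digits confirms the algebra behind the product term in m(x).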
Proposal for EBIC
Finally, we have, as the approximation to 2 log m(x),

  EBIC ≡ 2 l(θ̂) + (p − p_1) log(2π) − log|Î_22| − Σ_{i=1}^{p_1} log(1 + n_i)
         + 2 Σ_{i=1}^{p_1} log[ (1 − e^{−v_i}) / (√2 v_i) ],  where v_i = ξ̂_i²/(b_i + d_i).

(The error as an approximation to 2 log m(x) is o_n(1); exact for normals.)
If no parameters are integrated out (so that p_1 = p), then

  EBIC = 2 l(θ̂) − Σ_{i=1}^{p} log(1 + n_i) + 2 Σ_{i=1}^{p} log[ (1 − e^{−v_i}) / (√2 v_i) ].

If all n_i = n, the dominant terms in the expression (as n → ∞) are 2 l(θ̂) − p log n. The third term (the ‘constant’ ignored by Schwarz) is negative.
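For the p_1 = p case this is straightforward to compute from standard output. A sketch; the argument names (xi_hat, d, b, n_eff for the per-coordinate ξ̂_i, d_i, b_i, n_i) are choices made here, not notation from the slides:

```python
import math

def ebic(loglik_hat, xi_hat, d, b, n_eff):
    # EBIC = 2 l(theta_hat) - sum_i log(1 + n_i)
    #        + 2 sum_i log[(1 - e^{-v_i}) / (sqrt(2) v_i)],
    # with v_i = xi_hat_i^2 / (b_i + d_i); case with p_1 = p.
    val = 2 * loglik_hat
    for xh, di, bi, ni in zip(xi_hat, d, b, n_eff):
        v = xh * xh / (bi + di)
        val -= math.log(1 + ni)
        # (1 - e^{-v})/(sqrt(2) v) -> 1/sqrt(2) as v -> 0
        ratio = (1 - math.exp(-v)) / (math.sqrt(2) * v) if v > 1e-12 else 1 / math.sqrt(2)
        val += 2 * math.log(ratio)
    return val
```

At ξ̂_i = 0 the prior term contributes 2 log(1/√2) = −log 2 per coordinate, the small extra penalty that BIC omits.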
Comparison of EBIC and BIC
Compare this EBIC with BIC:

  EBIC = 2 l(θ̂) − Σ_{i=1}^{p} log(1 + n_i) − k_π
  BIC  = 2 l(θ̂) − p log n,

where k_π > 0 is the ‘ignored’ constant from π.
It is easy to see intuitively that ‘BIC makes two mistakes’:
  It penalizes too much with p log n. Note that n_i ≤ n.
  It penalizes too little by ignoring the prior (the third term, a new penalization, is absent).
The second ‘mistake’ actually offsets the first (in this sense, BIC is ‘lucky’), but often the first mistake completely overwhelms this little help.
EBIC*: Modification More Favorable to Complex Models
Priors centered at zero penalize complex models too much?
Priors centered around the MLE (Raftery, 1996) favor complex models too much?
An attractive compromise: Cauchy-type priors centered at zero, but with the scales b_i chosen so as to maximize the marginal likelihood of the model.
This is an empirical Bayes alternative, popularized in the robust Bayesian literature (Berger, 1994).
The b_i that maximizes m(x) can easily be seen to be

  b̂_i = max{ d_i, ξ̂_i²/w − d_i },  with w such that e^w = 1 + 2w, or w ≈ 1.3.

Problem: when ξ_i = 0, this empirical Bayes choice can result in inconsistency as n_i → ∞.
Solution: prevent b_i from being less than n_i d_i. This results, after an accurate approximation (for fixed p), in

  EBIC* ≈ 2 l(θ̂) − Σ_{i=1}^{p} log(1 + n_i) − Σ_{i=1}^{p} log( 3v_i + 2 max{v_i, 1} ),

again a very simple expression.
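A sketch of the EBIC* penalty, plus a bisection check of the w equation (the exact positive root of e^w = 1 + 2w is ≈ 1.256, which the slide rounds to 1.3):

```python
import math

def ebic_star(loglik_hat, v, n_eff):
    # EBIC* ≈ 2 l(theta_hat) - sum_i log(1 + n_i)
    #         - sum_i log(3 v_i + 2 max{v_i, 1})
    return (2 * loglik_hat
            - sum(math.log(1 + ni) for ni in n_eff)
            - sum(math.log(3 * vi + 2 * max(vi, 1.0)) for vi in v))

def solve_w(lo=1e-6, hi=5.0, iters=100):
    # positive root of e^w = 1 + 2w by bisection; f < 0 on (0, root), f > 0 after
    f = lambda w: math.exp(w) - (1 + 2 * w)
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2
```

Note the v_i cap at 1 inside the max keeps the penalty bounded below, implementing the b_i ≥ n_i d_i safeguard.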
Effective Sample size
Calculate Î = (Î_jk), the expected information matrix, at the MLE:

  Î_jk = −E_{X|θ} [ ∂² log f(x | θ) / ∂θ_j ∂θ_k ] evaluated at θ = θ̂.

For each case x_i, compute Î_i = (Î_{i,jk}), having (j, k) entry

  Î_{i,jk} = −E_{X_i|θ} [ ∂² log f_i(x_i | θ) / ∂θ_j ∂θ_k ] evaluated at θ = θ̂.

Define Î_jj = Σ_{i=1}^{n} Î_{i,jj}, and define n_j as follows:
  Define information weights w_ij = Î_{i,jj} / Σ_{k=1}^{n} Î_{k,jj}.
  Define the effective sample size for θ_j as

  n_j = Î_jj / Σ_{i=1}^{n} w_ij Î_{i,jj} = (Î_jj)² / Σ_{i=1}^{n} (Î_{i,jj})².
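This definition can be sanity-checked on the group-means example: the per-case information about a group mean is 1/σ² for the cases in that group (0 otherwise), and about σ² it is 1/(2σ⁴) for every case. A sketch with arbitrary values of r, n*, σ²:

```python
def effective_sample_size(per_case_info_jj):
    # n_j = (sum_i I_{i,jj})^2 / sum_i (I_{i,jj})^2
    total = sum(per_case_info_jj)
    return total ** 2 / sum(v * v for v in per_case_info_jj)

r, n_star, sigma2 = 4, 10, 2.0
# information about mu_1: 1/sigma2 for its n_star cases, 0 for the rest
info_mu1 = [1 / sigma2] * n_star + [0.0] * ((r - 1) * n_star)
# information about sigma2: 1/(2 sigma2^2) for every one of the n = n_star*r cases
info_s2 = [1 / (2 * sigma2 ** 2)] * (r * n_star)
```

The formula recovers n* for each group mean and n* × r for σ², exactly the sample sizes argued for earlier in the talk.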
Effective Sample size: Continued
Intuitively, Σ_i w_ij Î_{i,jj} is a weighted measure of the information ‘per observation’, and dividing the total information about θ_j by this information per case seems plausible as an effective sample size.
This is just a tentative proposal
Works for most of the examples we discussed
Research in progress
A small linear regression simulation:
  Y_{n×1} = X_{n×8} β_{8×1} + ε_{n×1},  ε ∼ N(0, σ²),
  β = (3, 1.5, 0, 0, 2, 0, 0, 0),

and design matrix X with rows drawn from N(0, Σ_x), where

  Σ_x = ( 1 ρ ⋯ ρ
          ρ 1 ⋯ ρ
          ⋮ ⋮ ⋱ ⋮
          ρ ρ ⋯ 1 ).
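The equicorrelated design can be generated without a linear-algebra library via a shared-factor construction: √ρ·z₀ + √(1−ρ)·z_j has unit variance and pairwise correlation ρ. A sketch (make_data is a helper named here, not code from the talk):

```python
import random

BETA = [3, 1.5, 0, 0, 2, 0, 0, 0]

def make_data(n, rho, sigma, seed=0):
    # Rows of X ~ N(0, Sigma_x) with 1 on the diagonal and rho off-diagonal,
    # built as x_j = sqrt(rho)*z0 + sqrt(1-rho)*z_j with a shared factor z0.
    rng = random.Random(seed)
    X, Y = [], []
    for _ in range(n):
        z0 = rng.gauss(0, 1)
        row = [rho ** 0.5 * z0 + (1 - rho) ** 0.5 * rng.gauss(0, 1)
               for _ in range(8)]
        X.append(row)
        Y.append(sum(b * x for b, x in zip(BETA, row)) + rng.gauss(0, sigma))
    return X, Y
```

Each simulation cell in the figures below corresponds to one (n, ρ, σ) setting of this generator.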
Linear regression Example
[Figure: Percentage of true models selected under different situations. Methods compared: abf2, aic, bic, Gbic, Gbic*, hbf; panels for ρ = 0, .2, .5 and σ = 1, 2, 3; n ranging from 30 to 1000.]
Linear regression Example
[Figure: Average number of 0’s that were determined to be 0 (true model). Methods compared: abf2, aic, bic, Gbic, Gbic*, hbf; panels for ρ = 0, .2, .5 and σ = 1, 2, 3; n ranging from 30 to 1000.]
Linear regression Example
[Figure: Average number of non-zeroes that were determined to be zero (small is good). Methods compared: abf2, aic, bic, Gbic, Gbic*, hbf; panels for ρ = 0, .2, .5 and σ = 1, 2, 3; n ranging from 30 to 1000.]
Conclusions and future work
BIC can be improved by
  keeping rather than ignoring the prior;
  using a sensible, objective prior producing closed-form expressions;
  carefully acknowledging the “effective sample size” for each parameter to guide the choice of the scale of the priors.
Work in progress:
  A satisfactory definition of ‘effective sample size’ (if possible). The current proposal works in most cases.
  Extension to the non-independent case.
References
1. Berger, J.O. (1985). Statistical Decision Theory and Bayesian Analysis, 2nd edition. Springer-Verlag.
2. Berger, J.O., Ghosh, J.K., and Mukhopadhyay (2003). Approximations and consistency of Bayes factors as model dimension grows. Journal of Statistical Planning and Inference 112, 241–258.
3. Berger, J., Pericchi, L., and Varshavsky, J. (1998). Bayes factors and marginal distributions in invariant situations. Sankhya A 60, 307–321.
4. Bollen, K., Ray, S., and Zavisca, J. (2007). A scaled unit information prior approximation to the Bayes factor. In preparation.