An Extended BIC for Model Selection
TRANSCRIPT
JSM: July-August 2007 - slide #1
An Extended BIC for Model Selection
JSM Meeting 2007, Salt Lake City
Surajit Ray, Boston University (Dept. of Mathematics and Statistics)
Joint work with James Berger, Duke University; Susie Bayarri, University of Valencia; Woncheol Jang, University of Georgia; Luis R. Pericchi, University of Puerto Rico
Introduction and Motivation
➲ Motivation and Outline
➲ Bayesian Information Criterion: BIC
➲ First Ambiguity: What is n?
➲ Second Ambiguity: Can n be different for different parameters?
➲ Third Ambiguity: What is the number of parameters?
➲ Example: Random effects group means
➲ Example: Common mean, differing variances
➲ Problem with n and p
EBIC
The prior
EBIC*
Effective sample size
Simulation
Motivation and Outline
Initiated from a SAMSI Social Sciences working group on “getting the model right” in structural equation modeling.
Introduced in the Wald Lecture by Jim Berger.
Specific characteristics:
  They have ‘independent cases’ (individuals).
  Often ‘multilevel’ or ‘random effects’ models (p grows as n grows).
  Regularity conditions may not be satisfied.
Problems with BIC.
Can a generalization of BIC work? Goal: use only standard software output and closed-form expressions.
Present a generalization of BIC (‘extended BIC’ or EBIC) and some examples in the linear regression scenario.
Work in progress (comments are welcome!)
Bayesian Information Criterion: BIC
BIC = 2 l(θ̂ | x) − p log(n),
where
  l(θ̂ | x) is the log-likelihood at the MLE θ̂ given x,
  p is the number of estimated parameters,
  n is the cardinality of x.
Schwarz’s result: as n → ∞ (with p fixed), this is an approximation (up to a constant) to twice the log marginal likelihood of the model, m(x) = ∫ f(x | θ) π(θ) dθ, so that
  m(x) = c_π e^{BIC/2} (1 + o(1)).
BIC-based approximations to Bayes factors further ignore the c_π’s.
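The criterion itself is a one-liner; a minimal sketch (not from the slides), where loglik_hat is the maximized log-likelihood:

```python
import math

def bic(loglik_hat, p, n):
    # BIC = 2 l(theta_hat | x) - p log(n); larger values favor the model
    # under this sign convention.
    return 2 * loglik_hat - p * math.log(n)
```

The ambiguities discussed next concern what to plug in for n and p, not the formula itself.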
First Ambiguity: What is n?
Consider time series y_it,
  y_it = β y_{i,t−1} + ε_it,  i = 1, …, n*,  t = 1, …, T,  ε_it ∼ N(0, σ²).
When the goal is to estimate the mean level of y, consider two extreme cases:
1. β = 0: the y_it are i.i.d., and the estimated parameter has sample size n = n* × T.
2. β = 1: clearly the effective sample size equals n = n*.
See Thiebaux & Zwiers (1984) for discussion of effective sample size in time series.
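A sketch of one way to interpolate between these two extremes, using the standard AR(1) autocorrelation correction in the spirit of Thiebaux & Zwiers; the floor of one effective observation per independent series is an assumption added here, not something stated on the slide:

```python
def ar1_effective_sample_size(beta, n_star, T):
    # Per-series effective size T(1 - beta)/(1 + beta) from the AR(1)
    # autocorrelation correction, floored at 1 so each independent series
    # always contributes at least one effective observation.
    per_series = max(1.0, T * (1 - beta) / (1 + beta))
    return n_star * per_series
```

At β = 0 this gives n* × T; at β = 1 it collapses to n*, matching the two cases above.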
Second Ambiguity: Can n be different for different parameters?
Consider data from r groups (group index p):
  y_ip = µ_p + ε_ip,  i = 1, …, n*,  p = 1, …, r,  ε_ip ∼ N(0, σ²),
with µ_p and σ² both unknown.

  Group 1:  y_11  y_21  …  y_{n*,1}
  Group 2:  y_12  y_22  …  y_{n*,2}
  ⋮
  Group p:  y_1p  y_2p  …  y_{n*,p}
  ⋮
  Group r:  y_1r  y_2r  …  y_{n*,r}
Different parameters can have different sample sizes:
the sample size for estimating each µp is n∗
the sample size for estimating σ2 is n∗ × r
Computation for this example yields

  Î = ( (n*/σ̂²) I_{r×r}   0
         0                 n/(2σ̂⁴) ),

where n = n* × r and σ̂² = (1/n) Σ_i Σ_l (X_il − X̄_i)².
Third Ambiguity: What is the number of parameters?
Example: Random effects group means
In the previous group-means example, with p groups and r observations X_il per group, where X_il ∼ N(µ_i, σ²) for l = 1, …, r, consider the multilevel version in which
  µ_i ∼ N(ξ, τ²),
with ξ and τ² unknown.
What is the number of parameters? (See also Pauler (1998).)
(1) If τ² = 0, there are only two parameters: ξ and σ².
(2) If τ² is huge, is the number of parameters p + 3 (the p means along with ξ, τ², and σ²)?
Continued ...
Example: Random effects group means
(3) But, if one integrates out µ = (µ_1, …, µ_p), then

  f(x | σ², ξ, τ²) = ∫ f(x | µ, σ²) π(µ | ξ, τ²) dµ
                   ∝ σ^{−p(r−1)} exp( −S²/(2σ²) ) ∏_{i=1}^{p} exp( −(x̄_i − ξ)² / (2(σ²/r + τ²)) ),

where S² = Σ_i Σ_l (x_il − x̄_i)², so p = 3 if one can work directly with f(x | σ², ξ, τ²).
Note: This seems to be common practice in the social sciences: latent variables are integrated out, so the remaining parameters are the only unknowns when applying BIC.
Note: In this example the effective sample sizes should be ≈ pr for σ², ≈ p for ξ and τ², and ≈ r for the µ_i’s.
Example: Common mean, differing variances (Cox mixtures)
Suppose each independent observation X_i, i = 1, …, n, has probability 1/2 of arising from N(θ, 1) and probability 1/2 of arising from N(θ, 1000).
Clearly half of the observations are essentially worthless, so the ‘effective sample size’ is roughly n/2.
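A sketch applying the information-based definition of effective sample size proposed later in the talk to this example; as a simplification (an assumption of this sketch), each observation's component is treated as known, so its Fisher information about θ is the reciprocal of its variance:

```python
n = 100
# Per-case Fisher information about theta, treating each observation's
# component as known: 1/1 for the N(theta, 1) half, 1/1000 for the other.
per_case = [1.0] * (n // 2) + [1.0 / 1000] * (n // 2)
# n_eff = (sum of per-case informations)^2 / (sum of their squares)
total = sum(per_case)
n_eff = total ** 2 / sum(v * v for v in per_case)
```

The high-variance half contributes almost nothing, so n_eff comes out within a fraction of a percent of n/2.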
Problem with n and p
Problems with n:
  Is n the number of vector observations or the number of real observations?
  Different θ_i can have different effective sample sizes.
  Some observations can be more informative than others, as in mixture models and models with mixed continuous and discrete observations.
Problems with p:
  What is p with random effects, latent variables, mixture models, etc.?
  Often p grows with n.
EBIC: a proposed solution
Based on a modified Laplace approximation to m(x) for good priors:
  The good priors can deal with growing p.
  The Laplace approximation retains the constant c_π in the expansion and is often good even for moderate n.
Implicitly uses ‘effective sample size’ (discussion later).
Computable using standard software, via MLEs and observed information matrices.
Laplace approximation
1. Preliminary: ‘nice’ reparameterization.
2. Taylor expansion. By a Taylor series expansion of e^{l(θ)} about the MLE θ̂,

  m(x) = ∫ f(x | θ) π(θ) dθ = ∫ e^{l(θ)} π(θ) dθ
       ≈ ∫ exp[ l(θ̂) + (θ − θ̂)ᵀ ∇l(θ̂) − ½ (θ − θ̂)ᵀ Î (θ − θ̂) ] π(θ) dθ,

where ∇ denotes the gradient and Î = (Î_jk) is the observed information matrix.
3. Approximation to the marginal m(x). If θ̂ occurs in the interior of the parameter space, so that ∇l(θ̂) = 0 (if not, the analysis must proceed as in Haughton 1991, 1993), mild conditions yield

  m(x) = e^{l(θ̂)} ∫ e^{−½ (θ − θ̂)ᵀ Î (θ − θ̂)} π(θ) dθ (1 + o_n(1)).
Choosing a good prior
Integrate out common parameters.
Orthogonalize the parameters. Let O be orthogonal and D = diag(d_i), i = 1, …, p_1, such that
  Σ = Oᵀ D O.
Make the change of variables ξ = O θ^(1), and let ξ̂ = O θ̂^(1). If there is no θ^(2) to be integrated out, Σ = Î^{−1} = Oᵀ D O.
We now use a prior that is independent in the ξ_i, i.e., π(ξ) = ∏_{i=1}^{p_1} π_i(ξ_i). The approximation to m(x) is then

  m(x) ≈ e^{l(θ̂)} (2π)^{p/2} |Î|^{−1/2} [ ∏_{i=1}^{p_1} ∫ (1/√(2π d_i)) e^{−(ξ_i − ξ̂_i)²/(2 d_i)} π_i(ξ_i) dξ_i ],

where we recall that the d_i are the diagonal elements of D from Σ = Oᵀ D O, that is, Var(ξ̂_i); they will appear often in what follows.
Univariate priors
For the priors π_i(ξ_i), there are several possibilities:
1. Jeffreys recommended the Cauchy(0, b_i) density

  π_i^C(ξ_i) = ∫_0^∞ N( ξ_i | 0, 1/(λ_i b_i) ) Ga( λ_i | ½, ½ ) dλ_i,

where (b_i)^{−1} = (d_i)^{−1}/n_i is the unit information for ξ_i, with n_i being the effective sample size for ξ_i (Kass & Wasserman, 1995).
Note: for i.i.d. scenarios, d_i is like σ_i²/n_i, so b_i is like the variance of one observation.
2. Use other sensible (objective) testing priors, like the intrinsic prior.
3. The goal is to get closed-form expressions: we use priors proposed in Berger (1985), which are very close to both the Cauchy and the intrinsic priors and produce closed-form expressions for m(x) for normal likelihoods.
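The scale-mixture representation in item 1 can be checked numerically: integrating the Normal against the Ga(½, ½) mixing density recovers a Cauchy with scale 1/√b_i. A sketch (the step size and upper integration limit are ad hoc choices):

```python
import math

def mixture_density(xi, b, h=1e-3, lam_max=60.0):
    # ∫_0^∞ N(xi | 0, 1/(lam*b)) Ga(lam | 1/2, 1/2) dlam by the midpoint rule
    total, lam = 0.0, h / 2
    while lam < lam_max:
        normal = math.sqrt(lam * b / (2 * math.pi)) * math.exp(-lam * b * xi * xi / 2)
        gamma = (0.5 ** 0.5 / math.gamma(0.5)) * lam ** -0.5 * math.exp(-lam / 2)
        total += normal * gamma * h
        lam += h
    return total

def cauchy_density(xi, scale):
    return 1.0 / (math.pi * scale * (1 + (xi / scale) ** 2))
```

The two densities agree pointwise up to quadrature error.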
Berger’s robust priors
For b_i ≥ d_i, the prior is given by

  π_i^R(ξ_i) = ∫_0^1 N( ξ_i | 0, (d_i + b_i)/(2λ_i) − d_i ) · 1/(2√λ_i) dλ_i.

  Introduced by Berger in the context of robust analyses.
  No closed-form expression.
  Not widely used in model selection.
  Behaves like a Cauchy prior with parameter b_i.
  Gives a closed-form expression for the approximation to m(x):

  m(x) ≈ e^{l(θ̂)} (2π)^{p/2} |Î|^{−1/2} ∏_{i=1}^{p_1} [ 1/√(2π(d_i + b_i)) · (1 − e^{−v_i})/(√2 v_i) ],  v_i = ξ̂_i²/(d_i + b_i),

where (d_i)^{−1} is the global information about ξ_i and (b_i)^{−1} is the unit information.
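The per-coordinate closed form can be verified numerically: convolving the N(ξ̂_i, d_i) kernel with the mixture representation of the prior gives, for each λ, a N(0, (d_i + b_i)/(2λ)) density evaluated at ξ̂_i, which is then integrated over λ ∈ (0, 1). A sketch with arbitrary test values:

```python
import math

def factor_closed_form(xi_hat, d, b):
    # 1/sqrt(2*pi*(d+b)) * (1 - e^{-v}) / (sqrt(2) * v), v = xi_hat^2 / (d + b)
    v = xi_hat ** 2 / (d + b)
    return (1 - math.exp(-v)) / (math.sqrt(2 * math.pi * (d + b)) * math.sqrt(2) * v)

def factor_numeric(xi_hat, d, b, h=1e-5):
    # integrate N(xi_hat | 0, (d+b)/(2*lam)) * 1/(2*sqrt(lam)) over lam in (0, 1)
    total, lam = 0.0, h / 2
    while lam < 1:
        var = (d + b) / (2 * lam)
        dens = math.exp(-xi_hat ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
        total += dens / (2 * math.sqrt(lam)) * h
        lam += h
    return total
```

Agreement to many digits confirms the algebra behind the product term in m(x).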
Proposal for EBIC
Finally, we have, as the approximation to 2 log m(x),

  EBIC ≡ 2 l(θ̂) + (p − p_1) log(2π) − log|Î_22| − Σ_{i=1}^{p_1} log(1 + n_i)
         + 2 Σ_{i=1}^{p_1} log[ (1 − e^{−v_i}) / (√2 v_i) ],  where v_i = ξ̂_i²/(b_i + d_i).

(The error as an approximation to 2 log m(x) is o_n(1); exact for normals.)
If no parameters are integrated out (so that p_1 = p), then

  EBIC = 2 l(θ̂) − Σ_{i=1}^{p} log(1 + n_i) + 2 Σ_{i=1}^{p} log[ (1 − e^{−v_i}) / (√2 v_i) ].

If all n_i = n, the dominant terms in the expression (as n → ∞) are 2 l(θ̂) − p log n. The third term (the ‘constant’ ignored by Schwarz) is negative.
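For the p_1 = p case this is straightforward to compute from standard output. A sketch; the argument names (xi_hat, d, b, n_eff for the per-coordinate ξ̂_i, d_i, b_i, n_i) are choices made here, not notation from the slides:

```python
import math

def ebic(loglik_hat, xi_hat, d, b, n_eff):
    # EBIC = 2 l(theta_hat) - sum_i log(1 + n_i)
    #        + 2 sum_i log[(1 - e^{-v_i}) / (sqrt(2) v_i)],
    # with v_i = xi_hat_i^2 / (b_i + d_i); case with p_1 = p.
    val = 2 * loglik_hat
    for xh, di, bi, ni in zip(xi_hat, d, b, n_eff):
        v = xh * xh / (bi + di)
        val -= math.log(1 + ni)
        # (1 - e^{-v})/(sqrt(2) v) -> 1/sqrt(2) as v -> 0
        ratio = (1 - math.exp(-v)) / (math.sqrt(2) * v) if v > 1e-12 else 1 / math.sqrt(2)
        val += 2 * math.log(ratio)
    return val
```

At ξ̂_i = 0 the prior term contributes 2 log(1/√2) = −log 2 per coordinate, the small extra penalty that BIC omits.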
Comparison of EBIC and BIC
Compare this EBIC with BIC:

  EBIC = 2 l(θ̂) − Σ_{i=1}^{p} log(1 + n_i) − k_π
  BIC  = 2 l(θ̂) − p log n,

where k_π > 0 is the ‘ignored’ constant from π.
It is easy to see intuitively that ‘BIC makes two mistakes’:
  It penalizes too much with p log n. Note that n_i ≤ n.
  It penalizes too little by ignoring the prior (the third term, a new penalization, is absent).
The second ‘mistake’ actually offsets the first (in this sense, BIC is ‘lucky’), but often the first mistake completely overwhelms this little help.
EBIC*: Modification More Favorable to Complex Models
Priors centered at zero penalize complex models too much?
Priors centered around the MLE (Raftery, 1996) favor complex models too much?
An attractive compromise: Cauchy-type priors centered at zero, but with the scales b_i chosen so as to maximize the marginal likelihood of the model.
This is an empirical Bayes alternative, popularized in the robust Bayesian literature (Berger, 1994).
The b_i that maximizes m(x) can easily be seen to be

  b̂_i = max{ d_i, ξ̂_i²/w − d_i },  with w such that e^w = 1 + 2w, or w ≈ 1.3.

Problem: when ξ_i = 0, this empirical Bayes choice can result in inconsistency as n_i → ∞.
Solution: prevent b_i from being less than n_i d_i. This results, after an accurate approximation (for fixed p), in

  EBIC* ≈ 2 l(θ̂) − Σ_{i=1}^{p} log(1 + n_i) − Σ_{i=1}^{p} log( 3v_i + 2 max{v_i, 1} ),

again a very simple expression.
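A sketch of the EBIC* penalty, plus a bisection check of the w equation (the exact positive root of e^w = 1 + 2w is ≈ 1.256, which the slide rounds to 1.3):

```python
import math

def ebic_star(loglik_hat, v, n_eff):
    # EBIC* ≈ 2 l(theta_hat) - sum_i log(1 + n_i)
    #         - sum_i log(3 v_i + 2 max{v_i, 1})
    return (2 * loglik_hat
            - sum(math.log(1 + ni) for ni in n_eff)
            - sum(math.log(3 * vi + 2 * max(vi, 1.0)) for vi in v))

def solve_w(lo=1e-6, hi=5.0, iters=100):
    # positive root of e^w = 1 + 2w by bisection; f < 0 on (0, root), f > 0 after
    f = lambda w: math.exp(w) - (1 + 2 * w)
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2
```

Note the v_i cap at 1 inside the max keeps the penalty bounded below, implementing the b_i ≥ n_i d_i safeguard.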
Effective Sample size
Calculate Î = (Î_jk), the expected information matrix, at the MLE:

  Î_jk = −E_{X|θ} [ ∂² log f(x | θ) / ∂θ_j ∂θ_k ] evaluated at θ = θ̂.

For each case x_i, compute Î_i = (Î_{i,jk}), having (j, k) entry

  Î_{i,jk} = −E_{X_i|θ} [ ∂² log f_i(x_i | θ) / ∂θ_j ∂θ_k ] evaluated at θ = θ̂.

Define Î_jj = Σ_{i=1}^{n} Î_{i,jj}, and define n_j as follows:
  Define information weights w_ij = Î_{i,jj} / Σ_{k=1}^{n} Î_{k,jj}.
  Define the effective sample size for θ_j as

  n_j = Î_jj / Σ_{i=1}^{n} w_ij Î_{i,jj} = (Î_jj)² / Σ_{i=1}^{n} (Î_{i,jj})².
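This definition can be sanity-checked on the group-means example: the per-case information about a group mean is 1/σ² for the cases in that group (0 otherwise), and about σ² it is 1/(2σ⁴) for every case. A sketch with arbitrary values of r, n*, σ²:

```python
def effective_sample_size(per_case_info_jj):
    # n_j = (sum_i I_{i,jj})^2 / sum_i (I_{i,jj})^2
    total = sum(per_case_info_jj)
    return total ** 2 / sum(v * v for v in per_case_info_jj)

r, n_star, sigma2 = 4, 10, 2.0
# information about mu_1: 1/sigma2 for its n_star cases, 0 for the rest
info_mu1 = [1 / sigma2] * n_star + [0.0] * ((r - 1) * n_star)
# information about sigma2: 1/(2 sigma2^2) for every one of the n = n_star*r cases
info_s2 = [1 / (2 * sigma2 ** 2)] * (r * n_star)
```

The formula recovers n* for each group mean and n* × r for σ², exactly the sample sizes argued for earlier in the talk.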
Effective Sample size: Continued
Intuitively, Σ_i w_ij Î_{i,jj} is a weighted measure of the information ‘per observation’, and dividing the total information about θ_j by this information per case seems plausible as an effective sample size.
This is just a tentative proposal
Works for most of the examples we discussed
Research in progress
A small linear regression simulation:
  Y_{n×1} = X_{n×8} β_{8×1} + ε_{n×1},  ε ∼ N(0, σ²),
  β = (3, 1.5, 0, 0, 2, 0, 0, 0),

and design matrix X with rows drawn from N(0, Σ_x), where

  Σ_x = ( 1 ρ ⋯ ρ
          ρ 1 ⋯ ρ
          ⋮ ⋮ ⋱ ⋮
          ρ ρ ⋯ 1 ).
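The equicorrelated design can be generated without a linear-algebra library via a shared-factor construction: √ρ·z₀ + √(1−ρ)·z_j has unit variance and pairwise correlation ρ. A sketch (make_data is a helper named here, not code from the talk):

```python
import random

BETA = [3, 1.5, 0, 0, 2, 0, 0, 0]

def make_data(n, rho, sigma, seed=0):
    # Rows of X ~ N(0, Sigma_x) with 1 on the diagonal and rho off-diagonal,
    # built as x_j = sqrt(rho)*z0 + sqrt(1-rho)*z_j with a shared factor z0.
    rng = random.Random(seed)
    X, Y = [], []
    for _ in range(n):
        z0 = rng.gauss(0, 1)
        row = [rho ** 0.5 * z0 + (1 - rho) ** 0.5 * rng.gauss(0, 1)
               for _ in range(8)]
        X.append(row)
        Y.append(sum(b * x for b, x in zip(BETA, row)) + rng.gauss(0, sigma))
    return X, Y
```

Each simulation cell in the figures below corresponds to one (n, ρ, σ) setting of this generator.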
Linear regression Example
[Figure: Percentage of true models selected under different situations. Methods compared: abf2, aic, bic, Gbic, Gbic*, hbf; panels for ρ = 0, .2, .5 and σ = 1, 2, 3; n ranging from 30 to 1000.]
Linear regression Example
[Figure: Average number of 0’s that were determined to be 0 (true model). Methods compared: abf2, aic, bic, Gbic, Gbic*, hbf; panels for ρ = 0, .2, .5 and σ = 1, 2, 3; n ranging from 30 to 1000.]
Linear regression Example
[Figure: Average number of non-zeroes that were determined to be zero (small is good). Methods compared: abf2, aic, bic, Gbic, Gbic*, hbf; panels for ρ = 0, .2, .5 and σ = 1, 2, 3; n ranging from 30 to 1000.]
Conclusions and future work
BIC can be improved by
  keeping rather than ignoring the prior;
  using a sensible, objective prior producing closed-form expressions;
  carefully acknowledging the “effective sample size” for each parameter to guide the choice of the scale of the priors.
Work in progress:
  A satisfactory definition of ‘effective sample size’ (if possible). The current proposal works in most cases.
  Extension to the non-independent case.
References
1. Berger, J.O. (1985). Statistical Decision Theory and Bayesian Analysis, 2nd edition. Springer-Verlag.
2. Berger, J.O., Ghosh, J.K., and Mukhopadhyay (2003). Approximations and consistency of Bayes factors as model dimension grows. Journal of Statistical Planning and Inference 112, 241–258.
3. Berger, J., Pericchi, L., and Varshavsky, J. (1998). Bayes factors and marginal distributions in invariant situations. Sankhya A 60, 307–321.
4. Bollen, K., Ray, S., and Zavisca, J. (2007). A scaled unit information prior approximation to the Bayes factor. In preparation.