Deviance Information Criterion for Model Selection: Justification and Variation∗
Yong Li, Renmin University of China
Jun Yu, Singapore Management University
Tao Zeng, Zhejiang University
November 23, 2017
Abstract
Deviance information criterion (DIC) has been extensively used for model selection based on the MCMC output. Although it is understood as a Bayesian version of AIC, a rigorous justification has not been provided in the literature. In this paper, we show that, when the plug-in predictive distribution is used, DIC has a rigorous decision-theoretic justification in a frequentist setup. Under a set of regularity conditions, we show that DIC asymptotically chooses the model that gives the smallest expected Kullback-Leibler divergence between the data generating process (DGP) and the plug-in predictive distribution. An alternative expression for DIC, based on the Bayesian predictive distribution, is proposed. The new DIC has a smaller penalty term than the original DIC and is very easy to compute from the MCMC output. It is invariant to reparameterization and yields a smaller expected loss than the original DIC asymptotically.
JEL classification: C11, C12, G12
Keywords: AIC; DIC; Bayesian Predictive Distribution; Plug-in Predictive Distribution; Loss Function; Model Comparison; Expected Loss
1 Introduction
A highly important statistical inference problem often faced by model builders and empirical researchers is model selection (Phillips, 1995, 1996). Many penalty-based information criteria have been proposed to select from candidate models. In the frequentist statistical framework,

∗We wish to thank Eric Renault, Peter Phillips and David Spiegelhalter for their helpful comments. Yong Li, Hanqing Advanced Institute of Economics and Finance, Renmin University of China, Beijing, China 100872. Jun Yu, School of Economics and Lee Kong Chian School of Business, Singapore Management University, 90 Stamford Rd, Singapore 178903. Email for Jun Yu: [email protected]. URL: http://www.mysmu.edu/faculty/yujun/. Tao Zeng, School of Economics and Academy of Financial Research, Zhejiang University, Zhejiang, China 310027. Li gratefully acknowledges the financial support of the Chinese Natural Science Fund (No. 71271221), Program for New Century Excellent Talents in University.
the most popular information criterion is AIC. Arguably one of the most important develop-
ments in the Bayesian literature in recent years is the deviance information criterion (DIC)
of Spiegelhalter, et al (2002) for model selection.1 DIC is understood as a Bayesian version
of AIC. Like AIC, it trades off a measure of model adequacy against a measure of complexity
and is concerned with how replicate data predict the observed data. Unlike AIC, DIC takes
prior information into account.
DIC is constructed based on the posterior distribution of the log-likelihood or the deviance,
and has several desirable features. Firstly, DIC is easy to calculate when the likelihood
function is available in closed-form and the posterior distributions of models are obtained by
Markov chain Monte Carlo (MCMC) simulation. Secondly, it is applicable to a wide range of
statistical models. Thirdly, unlike the Bayes factors (BF), it is not subject to Jeffreys-Lindley’s
paradox and can be calculated when non-informative or improper priors are used.
However, as acknowledged in Spiegelhalter, et al (2002, 2014), the decision-theoretic jus-
tification of DIC is not rigorous in the literature. In fact, in the heuristic justification given
by Spiegelhalter, et al (2002), the frequentist framework and the Bayesian framework were
mixed together. The first contribution of the present paper is to provide a rigorous decision-
theoretic justification to DIC purely in a frequentist setup. It can be shown that DIC is an
asymptotically unbiased estimator of the expected Kullback-Leibler (KL) divergence between
the data generating process (DGP) and the plug-in predictive distribution, when the posterior
mean is used. This justification is similar to how AIC has been justified.
Moreover, DIC is not invariant to reparameterization because the plug-in predictive distribution is used in defining the loss function. In the Bayesian framework, an alternative predictive distribution is the Bayesian predictive distribution. Naturally, the KL divergence between the DGP and the Bayesian predictive distribution can be used as the loss function, which can in turn be used to derive a new information criterion for model comparison. Unlike the plug-in predictive distribution, the Bayesian predictive distribution is invariant to reparameterization. The second contribution of the present paper is to develop a new information
criterion that is an asymptotically unbiased estimator of the new expected KL divergence
under a general framework. Compared with the information criterion developed in Ando and
Tsay (2010) in the independent and identically distributed (iid) environment, our information
criterion does not need the iid assumption and has a simpler expression. Moreover, it is easier
to compare our information criterion with other information criteria.
Our theoretical results show that asymptotically the expected loss implied by the Bayesian
predictive distribution is smaller than that implied by the plug-in predictive distribution.
Hence, from the predictive viewpoint, the Bayesian predictive distribution is a better predic-
1According to Spiegelhalter et al. (2014), Spiegelhalter et al. (2002) was the third most cited paper in international mathematical sciences between 1998 and 2008. Up to October 2017, it had received 4898 citations on the Web of Knowledge and over 8623 on Google Scholar.
tive distribution. This represents another important advantage of using the Bayesian predic-
tive distribution and hence our new information criterion.
The paper is organized as follows. Section 2 explains how to treat model selection as a decision problem and briefly reviews the decision-theoretic justification of AIC, which helps explain our theory for DIC. Section 3 provides a rigorous decision-theoretic
justification to DIC of Spiegelhalter, et al (2002) under a set of regularity conditions, and shows
why DIC can be explained as the Bayesian version of AIC. In Section 4, based on the Bayesian
predictive distribution, a new information criterion is proposed. Its theoretical properties are
established and comparisons with other information criteria are also made in this section.
Section 5 concludes the paper. The Appendix collects the proof of the theoretical results in
the paper. To save space, the proof of Lemma 3.2 is provided in a technical supplement to
the present article (Li et al., 2017).
2 Decision-theoretic Justification of AIC
There are essentially two strands of literature on model selection.2 The first strand aims to answer the following question: which model best explains the observed data? The BF (Kass and Raftery, 1995) and its variations belong to this strand. They compare models by examining “posterior probabilities” given the observed data and search for the “true” model.
BIC is a large sample approximation to BF although it is based on the maximum likelihood
estimator. The second strand aims to answer the following question: which model gives the best predictions of future observations generated by the same mechanism that gives rise to the observed data? Clearly this is a utility-based approach where the utility is set to be the
prediction. Ideally, we would like to choose the model that gives the best overall predictions
of future values. Some cross validation-based criteria have been developed where the original
sample is split into a training set and a validation set (Vehtari and Lampinen, 2002; Zhang and
Yang, 2015). Unfortunately, different ways of sample splitting often lead to different outcomes.
Alternatively, based on hypothetically replicate data generated by the same mechanism that
gives rise to the observed data, some predictive information criteria have been proposed for
model selection. They minimize a loss function associated with the predictive decisions. AIC
and DIC are two well-known criteria in this framework. After the decision is made about
which model should be used for prediction, a unique prediction action for future observations
can be obtained to fulfill the original goal. The latter approach is what we follow in the present
paper. Given the relevance of prediction in economics, not surprisingly, such an approach to
model selection has been widely used in applications.
2For more information about the literature, see Vehtari and Ojanen (2012) and Burnham and Anderson (2002).
2.1 Predictive model selection as a decision problem
Assume that the probabilistic behavior of the observed data, y = (y1, y2, · · · , yn)′ ∈ Y, is described by a set of probabilistic models {Mk}_{k=1}^K := {p(y|θk, Mk)}_{k=1}^K, where θk is the set of parameters in candidate model Mk and p(·) denotes a probability density function (pdf). Formally, the model selection problem can be taken as a decision problem to select a model among {Mk}_{k=1}^K, where the action space has K elements, namely {dk}_{k=1}^K, with dk meaning that Mk is selected.

For the decision problem, a loss function ℓ(y, dk), which measures the loss of decision dk as a function of y, must be specified. Given the loss function, the expected loss (or risk) can be defined as (Berger, 1985)

\[
\mathrm{Risk}(d_k) = E_{\mathbf{y}}\left[\ell(\mathbf{y}, d_k)\right] = \int \ell(\mathbf{y}, d_k)\, g(\mathbf{y})\, d\mathbf{y},
\]

where g(y) is the pdf of the DGP of y. Hence, the model selection problem is equivalent to optimizing the statistical decision,

\[
k^* = \arg\min_k \mathrm{Risk}(d_k).
\]

Based on the set of candidate models {Mk}_{k=1}^K, the model M_{k*} with the decision d_{k*} is selected.
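As a concrete illustration of this decision problem, when the DGP is known the risk of each decision can be approximated by Monte Carlo: simulate many observed/replicate dataset pairs, fit each candidate model to the observed data, and average the loss on the replicate data. The sketch below is our own illustration, not from the paper; the DGP and the two candidate models (M1 estimates a normal mean, M2 fixes it at zero) are hypothetical choices, and the loss is the plug-in predictive log score used later in this section.

```python
import numpy as np

# Monte Carlo estimate of Risk(d_k) = E_y E_yrep[-2 ln p(yrep | theta_hat(y), M_k)].
# Hypothetical DGP: y_t ~ N(0.5, 1). Candidate M1 fits the mean; M2 fixes it at 0.
rng = np.random.default_rng(0)
n, reps = 100, 2000

def neg2_loglik_normal(y, mu):
    # -2 ln p(y | mu) for N(mu, 1) data (unit variance)
    return np.sum(np.log(2 * np.pi) + (y - mu) ** 2)

risk = {"M1": 0.0, "M2": 0.0}
for _ in range(reps):
    y = rng.normal(0.5, 1.0, n)      # observed data
    yrep = rng.normal(0.5, 1.0, n)   # replicate data from the same mechanism
    risk["M1"] += neg2_loglik_normal(yrep, y.mean())  # plug-in estimate from y
    risk["M2"] += neg2_loglik_normal(yrep, 0.0)       # no free parameter
risk = {k: v / reps for k, v in risk.items()}

# The decision rule selects the model with the smaller estimated risk.
assert risk["M1"] < risk["M2"]
```

Here M1 attains the smaller expected loss, so the decision rule selects d1.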
Let yrep be the replicate data independently generated by the same mechanism that gives rise to the observed data y. Assume the sample size in yrep is the same as that in y. Consider the predictive density of this replicate experiment for a candidate model Mk. The plug-in predictive density for Mk can be expressed as p(yrep|θ̂n(y), Mk), where θ̂n(y) is the quasi maximum likelihood (QML) estimate of θk based on y (when there is no confusion we simply write θ̂n(y) as θ̂n or even θ̂), defined by

\[
\hat{\theta}_n(\mathbf{y}) = \arg\max_{\theta_k \in \Theta} \ln p(\mathbf{y}|\theta_k, M_k),
\]

which is the global maximum interior to Θ.

The quantity that has been used to measure the quality of the candidate model, in terms of its ability to make predictions, is the KL divergence between g(yrep) and p(yrep|θ̂n(y), Mk) multiplied by 2:

\[
2 \times KL\left[g(\mathbf{y}_{rep}),\, p(\mathbf{y}_{rep}|\hat{\theta}_n(\mathbf{y}), M_k)\right]
= 2 E_{\mathbf{y}_{rep}}\left[\ln \frac{g(\mathbf{y}_{rep})}{p(\mathbf{y}_{rep}|\hat{\theta}_n(\mathbf{y}), M_k)}\right]
= 2 \int \ln \frac{g(\mathbf{y}_{rep})}{p(\mathbf{y}_{rep}|\hat{\theta}_n(\mathbf{y}), M_k)}\, g(\mathbf{y}_{rep})\, d\mathbf{y}_{rep}.
\]
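For intuition, this KL divergence can be checked numerically in a case where it is available in closed form. The following sketch is our own illustration (not part of the paper): it compares a Monte Carlo estimate of 2 × KL between two hypothetical univariate normal densities with the known analytic value.

```python
import numpy as np

# 2 x KL[g, p] = 2 E_g[ln g(y) - ln p(y)] for two univariate normals, estimated
# by Monte Carlo and compared with the closed form
# KL[N(m0,s0^2) || N(m1,s1^2)] = ln(s1/s0) + (s0^2 + (m0-m1)^2)/(2 s1^2) - 1/2.
rng = np.random.default_rng(1)
m0, s0 = 0.0, 1.0   # DGP g (hypothetical)
m1, s1 = 0.5, 1.2   # plug-in density p (hypothetical)

def log_norm_pdf(y, m, s):
    return -0.5 * np.log(2 * np.pi * s**2) - (y - m) ** 2 / (2 * s**2)

y = rng.normal(m0, s0, 1_000_000)
kl2_mc = 2 * np.mean(log_norm_pdf(y, m0, s0) - log_norm_pdf(y, m1, s1))
kl2_exact = 2 * (np.log(s1 / s0) + (s0**2 + (m0 - m1) ** 2) / (2 * s1**2) - 0.5)

assert abs(kl2_mc - kl2_exact) < 0.01
```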
Naturally the loss function associated with decision dk is

\[
\ell(\mathbf{y}, d_k) = 2 \times KL\left[g(\mathbf{y}_{rep}),\, p(\mathbf{y}_{rep}|\hat{\theta}_n(\mathbf{y}), M_k)\right].
\]

As a result, the model selection problem is

\[
k^* = \arg\min_k \mathrm{Risk}(d_k) = \arg\min_k E_{\mathbf{y}}\left[\ell(\mathbf{y}, d_k)\right]
= \arg\min_k 2 E_{\mathbf{y}} E_{\mathbf{y}_{rep}}\left[\ln \frac{g(\mathbf{y}_{rep})}{p(\mathbf{y}_{rep}|\hat{\theta}_n(\mathbf{y}), M_k)}\right]
= \arg\min_k \left\{ E_{\mathbf{y}} E_{\mathbf{y}_{rep}}\left[2 \ln g(\mathbf{y}_{rep})\right] + E_{\mathbf{y}} E_{\mathbf{y}_{rep}}\left[-2 \ln p(\mathbf{y}_{rep}|\hat{\theta}_n(\mathbf{y}), M_k)\right] \right\}.
\]

Since g(yrep) is the DGP, E_{yrep}[2 ln g(yrep)] is the same across all candidate models and, hence, is dropped from the above equation. Consequently,

\[
k^* = \arg\min_k \mathrm{Risk}(d_k) = \arg\min_k E_{\mathbf{y}} E_{\mathbf{y}_{rep}}\left[-2 \ln p(\mathbf{y}_{rep}|\hat{\theta}_n(\mathbf{y}), M_k)\right].
\]

The smaller the Risk(dk), the better the candidate model performs when using p(yrep|θ̂n(y), Mk) to predict g(yrep). The optimal decision makes it necessary to evaluate the risk. AIC is an asymptotically unbiased estimator of E_y E_{yrep}[−2 ln p(yrep|θ̂n(y), Mk)].
2.2 AIC for predictive model selection
To show the asymptotic unbiasedness of AIC, let us first fix some notation. When there is no confusion, we simply write the candidate model p(y|θk, Mk) as p(y|θ), where θ = (θ1, . . . , θP)′ ∈ Θ ⊆ R^P. Under the iid assumption, let y = (y1, y2, · · · , yn)′ denote the observed data, yrep = (y1,rep, · · · , yn,rep)′ denote the replicate data, and n be the sample size in both sets of data. Although −2 ln p(y|θ̂n(y)) is a natural estimate of E_{yrep}[−2 ln p(yrep|θ̂n(y))], it is asymptotically biased. Let

\[
c(\mathbf{y}) = E_{\mathbf{y}_{rep}}\left[-2 \ln p(\mathbf{y}_{rep}|\hat{\theta}_n(\mathbf{y}))\right] - \left[-2 \ln p(\mathbf{y}|\hat{\theta}_n(\mathbf{y}))\right] \quad (1)
\]

be the “optimism”. Under a set of regularity conditions, one can show that E_y(c(y)) → 2P as n → ∞. Hence, if we let AIC = −2 ln p(y|θ̂n(y)) + 2P, then, as n → ∞,

\[
E_{\mathbf{y}}(\mathrm{AIC}) - E_{\mathbf{y}} E_{\mathbf{y}_{rep}}\left[-2 \ln p(\mathbf{y}_{rep}|\hat{\theta}_n(\mathbf{y}))\right] \to 0.
\]
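The claim E_y(c(y)) → 2P can be checked by simulation. The sketch below is our own illustration, not from the paper: for an iid N(μ, σ²) model (so P = 2), the out-of-sample expectation in Equation (1) is available in closed form under a hypothetical N(0, 1) DGP, and the average simulated optimism should be close to 2P = 4 for moderate n.

```python
import numpy as np

# Simulate the "optimism" c(y) of Equation (1) for an iid N(mu, sigma^2) model
# (P = 2 free parameters) and check that its average is close to 2P = 4.
rng = np.random.default_rng(2)
n, reps = 100, 4000

def neg2_loglik(y, mu, var):
    return np.sum(np.log(2 * np.pi * var) + (y - mu) ** 2 / var)

c_vals = []
for _ in range(reps):
    y = rng.normal(0.0, 1.0, n)
    mu_hat, var_hat = y.mean(), y.var()          # QML estimates from y
    in_sample = neg2_loglik(y, mu_hat, var_hat)  # -2 ln p(y | theta_hat(y))
    # E_yrep[-2 ln p(yrep | theta_hat(y))] in closed form for a N(0,1) DGP:
    out_sample = n * (np.log(2 * np.pi * var_hat) + (1.0 + mu_hat**2) / var_hat)
    c_vals.append(out_sample - in_sample)

mean_c = np.mean(c_vals)
assert abs(mean_c - 4.0) < 1.0   # close to 2P = 4 for moderate n
```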
To see why a penalty term, 2P, is needed in AIC, let

\[
\theta_n^p := \arg\min_{\theta \in \Theta} \frac{1}{n} KL\left[g(\mathbf{y}),\, p(\mathbf{y}|\theta)\right] \quad (2)
\]

be the pseudo-true parameter value; let θ̂n(yrep) be the QML estimate of θ obtained from yrep; and let p(y|θ) be a “good approximation” to the DGP, a concept that will be defined formally in the next section. Note that

\[
\begin{aligned}
E_{\mathbf{y}} E_{\mathbf{y}_{rep}}\left[-2 \ln p(\mathbf{y}_{rep}|\hat{\theta}_n(\mathbf{y}))\right]
={} & \underbrace{E_{\mathbf{y}} E_{\mathbf{y}_{rep}}\left[-2 \ln p(\mathbf{y}_{rep}|\hat{\theta}_n(\mathbf{y}_{rep}))\right]}_{T_1} \\
& + \underbrace{E_{\mathbf{y}} E_{\mathbf{y}_{rep}}\left[-2 \ln p(\mathbf{y}_{rep}|\theta_n^p)\right] - E_{\mathbf{y}} E_{\mathbf{y}_{rep}}\left[-2 \ln p(\mathbf{y}_{rep}|\hat{\theta}_n(\mathbf{y}_{rep}))\right]}_{T_2} \\
& + \underbrace{E_{\mathbf{y}} E_{\mathbf{y}_{rep}}\left[-2 \ln p(\mathbf{y}_{rep}|\hat{\theta}_n(\mathbf{y}))\right] - E_{\mathbf{y}} E_{\mathbf{y}_{rep}}\left[-2 \ln p(\mathbf{y}_{rep}|\theta_n^p)\right]}_{T_3}.
\end{aligned}
\]
Clearly, the term T1 is the same as E_y[−2 ln p(y|θ̂n(y))]. The term T2 is the expectation of the likelihood ratio statistic based on the replicate data. Under a set of regularity conditions that ensure √n-consistency and asymptotic normality of the QML estimate, we have T2 = T3 + o(1). To approximate the term T3, if θ̂n(y) is a consistent estimate of θpn, we have

\[
T_3 = E_{\mathbf{y}}\left\{ -2 E_{\mathbf{y}_{rep}}\left[\frac{\partial \ln p(\mathbf{y}_{rep}|\theta_n^p)}{\partial \theta'}\right] \left(\hat{\theta}_n(\mathbf{y}) - \theta_n^p\right) \right\}
+ E_{\mathbf{y}}\left\{ E_{\mathbf{y}_{rep}}\left[-\left(\hat{\theta}_n(\mathbf{y}) - \theta_n^p\right)' \frac{\partial^2 \ln p(\mathbf{y}_{rep}|\theta_n^p)}{\partial \theta \partial \theta'} \left(\hat{\theta}_n(\mathbf{y}) - \theta_n^p\right)\right] \right\} + o(1).
\]

By the definition of θpn, we have E_{yrep}[∂ ln p(yrep|θpn)/∂θ] = 0, implying that

\[
E_{\mathbf{y}}\left\{ -2 E_{\mathbf{y}_{rep}}\left[\frac{\partial \ln p(\mathbf{y}_{rep}|\theta_n^p)}{\partial \theta'} \left(\hat{\theta}_n(\mathbf{y}) - \theta_n^p\right)\right] \right\}
= -2 E_{\mathbf{y}_{rep}}\left[\frac{\partial \ln p(\mathbf{y}_{rep}|\theta_n^p)}{\partial \theta'}\right] E_{\mathbf{y}}\left[\hat{\theta}_n(\mathbf{y}) - \theta_n^p\right] = 0.
\]
Consequently, under the same regularity conditions used to approximate T2, we have

\[
T_3 = \mathrm{tr}\left\{ E_{\mathbf{y}_{rep}}\left[\frac{\partial^2 \ln p(\mathbf{y}_{rep}|\theta_n^p)}{\partial \theta \partial \theta'}\right] E_{\mathbf{y}}\left[-\left(\hat{\theta}_n(\mathbf{y}) - \theta_n^p\right)\left(\hat{\theta}_n(\mathbf{y}) - \theta_n^p\right)'\right] \right\} = P + o(1),
\]

where tr denotes the trace of a matrix. Following Burnham and Anderson (2002), we have

\[
E_{\mathbf{y}} E_{\mathbf{y}_{rep}}\left[-2 \ln p(\mathbf{y}_{rep}|\hat{\theta}_n(\mathbf{y}))\right]
= E_{\mathbf{y}}\left[-2 \ln p(\mathbf{y}|\hat{\theta}_n(\mathbf{y})) + 2P\right] + o(1) = E_{\mathbf{y}}(\mathrm{AIC}) + o(1),
\]

that is, AIC is an asymptotically unbiased estimator of E_y E_{yrep}[−2 ln p(yrep|θ̂n(y))]. From the decision viewpoint, among candidate models, AIC selects the model that minimizes the expected loss asymptotically when the plug-in predictive distribution is used for making predictions.
It is clear that the decision-theoretic justification of AIC rests on a frequentist framework. Specifically, it requires a careful choice of the KL divergence function, the use of QML estimation, and a set of regularity conditions that ensure the √n-consistency and the asymptotic normality of the QML estimates. The penalty term in AIC arises from two sources. First, the pseudo-true value has to be estimated. Second, the estimate obtained from the observed data is not the same as that from the replicate data. Moreover, as pointed out in Burnham and Anderson (2002), the justification of AIC requires the candidate model to be a “good approximation” to the DGP for the trace in T3 to be P asymptotically, although a formal definition of “good approximation” was not provided.
3 Decision-theoretic Justification of DIC
3.1 DIC
Spiegelhalter, et al (2002) proposed DIC for Bayesian model selection. The criterion is based on the deviance

\[
D(\theta) = -2 \ln p(\mathbf{y}|\theta),
\]

and takes the form of

\[
\mathrm{DIC} = \overline{D(\theta)} + P_D. \quad (3)
\]

The first term, interpreted as a Bayesian measure of model fit, is defined as the posterior expectation of the deviance, that is,

\[
\overline{D(\theta)} = E_{\theta|\mathbf{y}}\left[D(\theta)\right] = E_{\theta|\mathbf{y}}\left[-2 \ln p(\mathbf{y}|\theta)\right].
\]

The better the model fits the data, the larger the log-likelihood value and hence the smaller the value of D̄(θ). The second term, used to measure the model complexity and also known as the “effective number of parameters”, is defined as the difference between the posterior mean of the deviance and the deviance evaluated at the posterior mean of the parameters:

\[
P_D = \overline{D(\theta)} - D(\bar{\theta}_n(\mathbf{y})) = -2 \int \left[\ln p(\mathbf{y}|\theta) - \ln p(\mathbf{y}|\bar{\theta}_n(\mathbf{y}))\right] p(\theta|\mathbf{y})\, d\theta, \quad (4)
\]

where θ̄n(y) is the Bayesian estimator based on y, more precisely the posterior mean of θ, ∫ θ p(θ|y) dθ. When there is no confusion, we simply write θ̄n(y) as θ̄n or even θ̄.

DIC can be rewritten in two equivalent forms:

\[
\mathrm{DIC} = D(\bar{\theta}_n) + 2 P_D, \quad (5)
\]

and

\[
\mathrm{DIC} = 2\overline{D(\theta)} - D(\bar{\theta}_n) = -4 E_{\theta|\mathbf{y}}\left[\ln p(\mathbf{y}|\theta)\right] + 2 \ln p(\mathbf{y}|\bar{\theta}_n). \quad (6)
\]

DIC defined in Equation (5) bears similarity to AIC of Akaike (1973) and can be interpreted as a classical “plug-in” measure of fit plus a measure of complexity (i.e., 2PD, also known as the penalty term). In Equation (3) the Bayesian measure, D̄(θ), is the same as D(θ̄n) + PD, which already includes a penalty term for model complexity and, thus, may be better thought of as a measure of model adequacy rather than pure goodness of fit.
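Given draws from the posterior (from MCMC or, as here, sampled exactly), DIC and PD in Equations (3)-(6) take only a few lines to compute. The sketch below is our own illustration with a hypothetical conjugate normal-mean model (known unit variance, normal prior on the mean), so the posterior can be sampled directly; with a diffuse prior, PD should be close to P = 1.

```python
import numpy as np

# DIC from posterior draws for y_i ~ N(mu, 1) with prior mu ~ N(0, 100).
rng = np.random.default_rng(3)
n = 200
y = rng.normal(1.0, 1.0, n)

# Exact conjugate posterior of mu (stands in for MCMC output).
post_var = 1.0 / (n + 1.0 / 100.0)
post_mean = post_var * y.sum()
draws = rng.normal(post_mean, np.sqrt(post_var), 20_000)

def deviance(mu):
    # D(mu) = -2 ln p(y | mu) for each value in the array mu
    return np.sum(np.log(2 * np.pi) + (y[:, None] - mu) ** 2, axis=0)

D_bar = deviance(draws).mean()              # posterior mean of the deviance
D_at_mean = deviance(np.array([post_mean]))[0]
p_D = D_bar - D_at_mean                     # effective number of parameters
DIC = D_bar + p_D                           # Equation (3)

assert abs(p_D - 1.0) < 0.05                # close to P = 1 under a diffuse prior
assert abs(DIC - (D_at_mean + 2 * p_D)) < 1e-8   # Equation (5) equivalence
```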
However, as stated explicitly in Spiegelhalter et al. (2002) (Section 7.3 on Page 603 and
the first paragraph on Page 605), the justification of DIC is informal and heuristic. In fact, it
mixes a frequentist setup and a Bayesian setup. In this section, we provide a rigorous decision-
theoretic justification of DIC purely in a frequentist setup. Specifically, we show that, when a
proper loss function is selected, DIC is an asymptotically unbiased estimator of the expected
loss.
3.2 Decision-theoretic justification of DIC
When developing DIC, Spiegelhalter, et al (2002) assume in their Section 2.2 that there is a true distribution for y and a pseudo-true parameter value θpn for a candidate model, and in their Section 7.1 that there is an independent replicate data set yrep. All these assumptions are identical to those made when justifying AIC. Furthermore, as explained in Section 7.1 of Spiegelhalter, et al (2002), the goal for model selection is to estimate the expected loss where the expectation is taken with respect to yrep|θpn. The assumptions and the goal indicate that a frequentist framework was considered. On the other hand, since the “optimism” associated with the
natural estimator depends on a pseudo-true parameter value θpn, instead of replacing it with
a frequentist estimator and then finding the asymptotic property of the “optimism”, Spiegel-
halter, et al (2002) replaces θpn with a random quantity θ and then calculates the posterior
expectation of the “optimism”; see Sections 7.1 and 7.3. As a result, a Bayesian framework
is adopted when studying the behavior of the “optimism”.
Spiegelhalter, et al (2002) did not explicitly specify the KL divergence function. However, from Equation (33) on Page 602, the loss function defined in the first paragraph on Page 603, and Equation (40) on Page 603 in their paper, one may deduce that the following KL divergence

\[
KL\left[p(\mathbf{y}_{rep}|\theta),\, p(\mathbf{y}_{rep}|\bar{\theta}_n(\mathbf{y}))\right]
= E_{\mathbf{y}_{rep}|\theta}\left[\ln \frac{p(\mathbf{y}_{rep}|\theta)}{p(\mathbf{y}_{rep}|\bar{\theta}_n(\mathbf{y}))}\right] \quad (7)
\]

was used.3 Hence,

\[
2 \times KL\left[p(\mathbf{y}_{rep}|\theta),\, p(\mathbf{y}_{rep}|\bar{\theta}_n(\mathbf{y}))\right]
= 2 E_{\mathbf{y}_{rep}|\theta}\left[\ln p(\mathbf{y}_{rep}|\theta)\right] + E_{\mathbf{y}_{rep}|\theta}\left[-2 \ln p(\mathbf{y}_{rep}|\bar{\theta}_n(\mathbf{y}))\right]. \quad (8)
\]

With this KL function, unfortunately, the first term on the right hand side of Equation (8) is no longer a constant across candidate models. This is because, when the pseudo-true value is replaced by a random quantity θ, the first term on the right hand side of Equation (8) is model dependent. Clearly, another KL divergence function is needed.
As in AIC, we first consider the plug-in predictive distribution p(yrep|θ̄n(y)) in the following KL divergence

\[
KL\left[g(\mathbf{y}_{rep}),\, p(\mathbf{y}_{rep}|\bar{\theta}_n(\mathbf{y}))\right]
= E_{\mathbf{y}_{rep}}\left[\ln \frac{g(\mathbf{y}_{rep})}{p(\mathbf{y}_{rep}|\bar{\theta}_n(\mathbf{y}))}\right].
\]

The corresponding expected loss function of a statistical decision dk for model selection is

\[
\mathrm{Risk}(d_k) = E_{\mathbf{y}} E_{\mathbf{y}_{rep}}\left[2 \ln \frac{g(\mathbf{y}_{rep})}{p(\mathbf{y}_{rep}|\bar{\theta}_n(\mathbf{y}), M_k)}\right]
= E_{\mathbf{y}} E_{\mathbf{y}_{rep}}\left[2 \ln g(\mathbf{y}_{rep})\right] + E_{\mathbf{y}} E_{\mathbf{y}_{rep}}\left[-2 \ln p(\mathbf{y}_{rep}|\bar{\theta}_n(\mathbf{y}), M_k)\right].
\]

3In Equation (33) of Spiegelhalter, et al (2002), the expectation is taken with respect to yrep|θt, which corresponds to the candidate model. In AIC, the expectation is taken with respect to yrep, which corresponds to the DGP.
Since E_y E_{yrep}[2 ln g(yrep)] is the same across candidate models, minimizing the expected loss function Risk(dk) is equivalent to minimizing

\[
E_{\mathbf{y}} E_{\mathbf{y}_{rep}}\left[-2 \ln p(\mathbf{y}_{rep}|\bar{\theta}_n(\mathbf{y}), M_k)\right].
\]

Denote the selected model by M_{k*}. Then p(yrep|θ̄n(y), M_{k*}) is used to generate future observations, where θ̄n(y) is the posterior mean of θ in M_{k*}.
We are now in a position to provide a rigorous decision-theoretic justification of DIC in a frequentist framework based on a set of regularity conditions. Let y^t := (y0, y1, . . . , yt) for any 0 ≤ t ≤ n and let lt(y^t, θ) := ln p(y^t|θ) − ln p(y^{t−1}|θ) be the conditional log-likelihood of the tth observation for any 1 ≤ t ≤ n. When there is no confusion, we abbreviate lt(y^t, θ) as lt(θ), so that the log-likelihood function ln p(y|θ) is Σ_{t=1}^n lt(θ).4 Let ∇^j lt(θ) denote the jth derivative of lt(θ), with ∇^0 lt(θ) := lt(θ). Furthermore, define

\[
s(\mathbf{y}^t, \theta) := \frac{\partial \ln p(\mathbf{y}^t|\theta)}{\partial \theta} = \sum_{i=1}^t \nabla l_i(\theta), \qquad
h(\mathbf{y}^t, \theta) := \frac{\partial^2 \ln p(\mathbf{y}^t|\theta)}{\partial \theta \partial \theta'} = \sum_{i=1}^t \nabla^2 l_i(\theta),
\]
\[
s_t(\theta) := \nabla l_t(\theta) = s(\mathbf{y}^t, \theta) - s(\mathbf{y}^{t-1}, \theta), \qquad
h_t(\theta) := \nabla^2 l_t(\theta) = h(\mathbf{y}^t, \theta) - h(\mathbf{y}^{t-1}, \theta),
\]
\[
B_n(\theta) := Var\left[\frac{1}{\sqrt{n}} \sum_{t=1}^n \nabla l_t(\theta)\right], \qquad
H_n(\theta) := \frac{1}{n} \sum_{t=1}^n h_t(\theta),
\]
\[
J_n(\theta) := \frac{1}{n} \sum_{t=1}^n \left[s_t(\theta) - \bar{s}(\theta)\right]\left[s_t(\theta) - \bar{s}(\theta)\right]', \qquad
\bar{s}(\theta) := \frac{1}{n} \sum_{t=1}^n s_t(\theta),
\]
\[
L_n(\theta) := \ln p(\theta|\mathbf{y}), \qquad
L_n^{(j)}(\theta) := \partial^j \ln p(\theta|\mathbf{y})/\partial \theta^j,
\]
\[
\bar{H}_n(\theta) := \int H_n(\theta)\, g(\mathbf{y})\, d\mathbf{y}, \qquad
\bar{J}_n(\theta) := \int J_n(\theta)\, g(\mathbf{y})\, d\mathbf{y}.
\]
In this paper, we impose the following regularity conditions.
Assumption 1: Θ ⊂ R^P is compact.

Assumption 2: {yt}_{t=1}^∞ satisfies the strong mixing condition with mixing coefficient α(m) = O(m^{−2r/(r−2)−ε}) for some ε > 0 and r > 2.

Assumption 3: For all t, lt(θ) satisfies the standard measurability and continuity conditions, and the eight-times differentiability condition on F^t_{−∞} × Θ, where F^t_{−∞} = σ(yt, yt−1, · · · ).

Assumption 4: For j = 0, 1, 2 and any θ, θ′ ∈ Θ, ‖∇^j lt(θ) − ∇^j lt(θ′)‖ ≤ c_t^j(y^t) ‖θ − θ′‖, where c_t^j(y^t) is a positive random variable with the following two properties: sup_t E‖c_t^j(y^t)‖ < ∞ and (1/n) Σ_{t=1}^n (c_t^j(y^t) − E(c_t^j(y^t))) →p 0.

Assumption 5: For j = 0, 1, . . . , 8, there exist Mt(y^t) and M < ∞ such that for all θ ∈ Θ, ∇^j lt(θ) exists, sup_{θ∈Θ} ‖∇^j lt(θ)‖ ≤ Mt(y^t), and sup_t E‖Mt(y^t)‖^{r+δ} ≤ M for some δ > 0, where r is the same as that in Assumption 2.

Assumption 6: {∇^j lt(θ)} is L2-near epoch dependent with respect to {yt} of size −1 for j = 0, 1 and of size −1/2 for j = 2, uniformly on Θ.

Assumption 7: Let θpn be the pseudo-true value that minimizes the KL loss between the DGP and the candidate model,

\[
\theta_n^p = \arg\min_{\theta \in \Theta} \frac{1}{n} \int \ln \frac{g(\mathbf{y})}{p(\mathbf{y}|\theta)}\, g(\mathbf{y})\, d\mathbf{y},
\]

where {θpn} is a sequence of minimizers that are interior to Θ uniformly in n. For all ε > 0,

\[
\limsup_{n \to \infty}\ \sup_{\Theta \setminus N(\theta_n^p,\, \varepsilon)} \frac{1}{n} \sum_{t=1}^n \left\{ E\left[l_t(\theta)\right] - E\left[l_t(\theta_n^p)\right] \right\} < 0, \quad (9)
\]

where N(θpn, ε) is the open ball of radius ε around θpn.

Assumption 8: The sequence {H̄n(θpn)} is negative definite and {Bn(θpn)} is positive definite, both uniformly in n.

Assumption 9: H̄n(θpn) + Bn(θpn) = o(1).

Assumption 10: The prior density p(θ) is eight-times continuously differentiable, p(θpn) > 0, and ∫ ‖θ‖² p(θ) dθ < ∞.

4In the definition of the log-likelihood, we ignore the initial condition ln p(y0). For weakly dependent data, the impact of ignoring the initial condition is asymptotically negligible.
Remark 3.1 Assumption 1 is the compactness condition. Assumption 2 and Assumption
6 imply weak dependence in yt and lt. The first part of Assumption 3 is the continuity
condition. Assumption 4 is the Lipschitz condition for lt first introduced in Andrews (1987) to
develop the uniform law of large numbers for dependent and heterogeneous stochastic processes.
Assumption 5 contains the domination condition for lt. Assumption 7 is the identification
condition. These assumptions are well-known primitive conditions for developing the QML
theory, namely consistency and asymptotic normality, for dependent and heterogeneous data;
see, for example, Gallant and White (1988) and Wooldridge (1994).
Remark 3.2 The eight-times differentiability condition in Assumption 3 and the domination condition for up to the eighth derivative of lt are important to develop a high order stochastic Laplace expansion. In particular, as shown in Kass et al (1990), these two conditions, together with the well-known consistency condition for QML given by (10) below, are sufficient for developing the Laplace expansion. This consistency condition requires that, for any ε > 0, there exists K1(ε) > 0 such that

\[
\lim_{n \to \infty} P\left\{ \sup_{\Theta \setminus N(\theta_n^p,\, \varepsilon)} \frac{1}{n} \sum_{t=1}^n \left[l_t(\theta) - l_t(\theta_n^p)\right] < -K_1(\varepsilon) \right\} = 1. \quad (10)
\]

Our Assumption 7 is clearly more primitive than the consistency condition (10). In the following lemma, we show that Assumptions 1-7, including the identification condition (9), are sufficient to ensure (10), as well as the concentration condition around the posterior mode given by Chen (1985) and the concentration condition around the QML estimator given by Kim (1994, 1998). Together with Assumption 10, the concentration condition suggests that the stochastic Laplace expansion can be applied to the posterior distribution and the asymptotic normality of the posterior distribution can be established. To the best of our knowledge, this is the first time in the literature that primitive conditions have been proposed for the stochastic Laplace expansion. Assumption 10 ensures that the second moment of the prior is bounded. As argued in Geweke (2001), such a condition typically leads to a finite second moment of the posterior. Moreover, it implies that the prior is negligible asymptotically.
Lemma 3.1 If Assumptions 1-7 hold true, then Equation (10) holds. Furthermore, if Assumptions 1-7 hold true, for any ε > 0, there exists K2(ε) > 0 such that

\[
\lim_{n \to \infty} P\left\{ \sup_{\Theta \setminus N(\hat{\theta}_n,\, \varepsilon)} \frac{1}{n} \left[\sum_{t=1}^n l_t(\theta) - \sum_{t=1}^n l_t(\theta_n^p)\right] < -K_2(\varepsilon) \right\} = 1. \quad (11)
\]

If, in addition, Assumption 10 holds true, for any ε > 0, there exists K3(ε) > 0 such that

\[
\lim_{n \to \infty} P\left\{ \sup_{\Theta \setminus N(\overleftrightarrow{\theta}_n,\, \varepsilon)} \frac{1}{n} \left(\sum_{t=1}^n \left[l_t(\theta) - l_t(\theta_n^p)\right] + \ln p(\theta) - \ln p(\theta_n^p)\right) < -K_3(\varepsilon) \right\} = 1, \quad (12)
\]

where \(\overleftrightarrow{\theta}_n := \arg\max_{\theta \in \Theta} \left\{\sum_{t=1}^n l_t(\theta) + \ln p(\theta)\right\}\) is the posterior mode.
Remark 3.3 Assumption 9 gives the exact requirement for a “good approximation”. This generalizes the definition of the “information matrix equality”; see White (1996). We now specify a few cases where Assumption 9 is satisfied. The details of these cases can be found in White (1996) and Wooldridge (1994).

1. According to White (1996, p55), a model is correctly specified in its entirety if there exists θ0 ∈ Θ such that

\[
p(\mathbf{y}|\theta_0) = g(\mathbf{y}). \quad (13)
\]

In this case, θpn = θ0 for all n and H̄n(θ0) + Bn(θ0) = 0, implying Assumption 9 (White, 1996, p93).

2. For any t, if Cov[st(θpn), st+j(θpn)] = 0 for all j ≥ 1, then J̄n(θpn) = Bn(θpn). For any t, if Var[st(θpn)] = −E[ht(θpn)], then J̄n(θpn) + H̄n(θpn) = 0. If both conditions are satisfied, we have H̄n(θpn) + Bn(θpn) = 0, again implying Assumption 9.

3. Suppose yt is a ky × 1 vector which can be partitioned into a kw × 1 vector of endogenous variables wt and a kz × 1 vector of exogenous variables zt. Let xt be a subset of (zt, wt−1, zt−1, . . . , w1, z1). According to White (1996, p55), a model is correctly specified for the dependent variable wt if there exists θ0 ∈ Θ such that

\[
p(w_t|x_t, \theta_0) = g(w_t|x_t), \quad \text{all } x_t,\ t = 1, 2, \ldots. \quad (14)
\]

If lt(θ) = ln p(wt|xt, θ), then E[st(θ0)|xt] = 0. Under condition (14), θpn = θ0 for all n, E[st(θ0)st(θ0)′] = −E[ht(θ0)], and J̄n(θ0) = −H̄n(θ0). If the model is further assumed to be dynamically complete in distribution, in the sense that

\[
g(w_t|x_t, w_{t-1}, z_{t-1}, \ldots, w_1, z_1) = g(w_t|x_t),
\]

then E[st(θ0)st+j(θ0)′] = 0 and J̄n(θ0) = Bn(θ0). Hence, H̄n(θ0) + Bn(θ0) = 0, implying Assumption 9; see White (1996, p95-97).

4. A model is correctly specified for the conditional mean and the conditional variance if there exists θ0 ∈ Θ such that

\[
E(w_t|x_t) = m_t(x_t, \theta_0), \quad Var(w_t|x_t) = \Omega_t(x_t, \theta_0), \quad \text{for any } t,
\]

where mt(xt, θ) and Ωt(xt, θ) are the conditional mean and the conditional variance of the model. In this case, the QML method that maximizes the Gaussian quasi-log-likelihood function can be used. Let

\[
l_t(\theta) := -\frac{1}{2} \ln |\Omega_t(x_t, \theta)| - \frac{1}{2} u_t' \Omega_t(x_t, \theta)^{-1} u_t,
\]

where ut = ut(θ) = wt − mt(xt, θ). It is easy to show that E[st(θpn)|xt] = 0. If the model is further assumed to be dynamically complete in mean and variance, in the sense that

\[
E(w_t|x_t, w_{t-1}, z_{t-1}, \ldots, w_1, z_1) = E(w_t|x_t), \quad Var(w_t|x_t, w_{t-1}, z_{t-1}, \ldots, w_1, z_1) = Var(w_t|x_t), \quad (15)
\]

then θpn = θ0, st(θ0) is a martingale difference sequence, and J̄n(θ0) = Bn(θ0). If

\[
E\left[\mathrm{vec}(u_t u_t')\, u_t' \,\middle|\, x_t\right] = 0,
\]
\[
E\left[\left\{\mathrm{vec}(u_t u_t') - \mathrm{vec}\left(\Omega_t(x_t, \theta_0)\right)\right\}\left\{\mathrm{vec}(u_t u_t') - \mathrm{vec}\left(\Omega_t(x_t, \theta_0)\right)\right\}' \,\middle|\, x_t\right] = 2 N_{k_w} \left[\Omega_t(x_t, \theta_0) \otimes \Omega_t(x_t, \theta_0)\right],
\]

where vec is the column-wise vectorization, N_{k_w} = D_{k_w}(D_{k_w}' D_{k_w})^{-1} D_{k_w}', and D_{k_w} is the k_w² × k_w(k_w + 1)/2 duplication matrix, then J̄n(θ0) = −H̄n(θ0). Hence, H̄n(θ0) + Bn(θ0) = 0, implying Assumption 9; see Wooldridge (1994, p2690).

The above four cases all imply that H̄n(θpn) + Bn(θpn) = 0, which is stronger than what Assumption 9 requires. We now give an example where H̄n(θpn) + Bn(θpn) = o(1) but is not exactly zero.
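The information matrix equality underlying Cases 1 and 2 is easy to verify numerically for a correctly specified model. The sketch below is our own example (not from the paper): iid N(μ0, σ0²) data with the score and Hessian of lt evaluated at the true parameters, checking that Var[st] matches −E[ht].

```python
import numpy as np

# Check the information matrix equality Var[s_t] = -E[h_t] for an iid
# N(mu0, sig2_0) model evaluated at the true parameter values (Cases 1 and 2).
rng = np.random.default_rng(4)
mu0, sig2_0 = 0.5, 2.0
y = rng.normal(mu0, np.sqrt(sig2_0), 500_000)
u = y - mu0

# Per-observation scores of l_t = -0.5 ln(2 pi sig2) - u^2/(2 sig2),
# with respect to (mu, sig2).
s = np.stack([u / sig2_0,
              -0.5 / sig2_0 + u**2 / (2 * sig2_0**2)])
J_hat = s @ s.T / y.size                       # empirical Var[s_t] (mean is ~0)

# Per-observation Hessians, averaged.
h11 = -1.0 / sig2_0
h12 = np.mean(-u / sig2_0**2)
h22 = np.mean(0.5 / sig2_0**2 - u**2 / sig2_0**3)
H_hat = np.array([[h11, h12], [h12, h22]])

assert np.allclose(J_hat, -H_hat, atol=0.02)   # information matrix equality
```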
Example 3.1 Let the DGP be

\[
y_t = x_{1t}\beta_0 + x_{2t}\gamma_0 + \varepsilon_t, \qquad \varepsilon_t \overset{iid}{\sim} N(0, \sigma_0^2),
\]

where (x1t, x2t) is iid over t and independent of εt. Assume that γ0 = δ0/n^{1/2}, where δ0 is an unknown constant. Let the candidate model be

\[
y_t = x_{1t}\beta + v_t, \qquad v_t \overset{iid}{\sim} N(0, \sigma^2).
\]

In this case,

\[
l_t(\theta) = -\frac{1}{2}\ln 2\pi - \frac{1}{2}\ln\sigma^2 - \frac{(y_t - x_{1t}\beta)^2}{2\sigma^2},
\]

where θ = (β, σ²)′. The pseudo-true value is θpn = (βpn, σ²pn)′, which maximizes E[lt(θ)] and can be expressed as

\[
\beta_n^p = \beta_0 + b\gamma_0, \qquad \sigma_{pn}^2 = \sigma_0^2 + c\gamma_0^2,
\]

where b = [E(x²1t)]⁻¹ E(x1t x2t) and c = E(x²2t) − [E(x1t x2t)]² [E(x²1t)]⁻¹. Hence,

\[
-E\left[h_t(\theta_n^p)\right] = \begin{pmatrix}
\dfrac{E(x_{1t}^2)}{\sigma_{pn}^2} & 0 \\[1ex]
0 & -\dfrac{1}{2(\sigma_{pn}^2)^2} + \dfrac{\sigma_0^2 + c\gamma_0^2}{(\sigma_{pn}^2)^3}
\end{pmatrix},
\qquad
-\bar{H}_n(\theta_n^p) = -\frac{1}{n}\sum_{t=1}^n E\left[h_t(\theta_n^p)\right] = -E\left[h_t(\theta_n^p)\right].
\]

From the iid assumption, we have

\[
Var\left(s_t(\theta_n^p)\right) = E\left(s_t(\theta_n^p) s_t(\theta_n^p)'\right) = \begin{pmatrix}
\dfrac{\sigma_0^2 E(x_{1t}^2) + d_1\gamma_0^2}{(\sigma_{pn}^2)^2} & \dfrac{d_2\gamma_0^3}{2(\sigma_{pn}^2)^3} \\[1ex]
\dfrac{d_2\gamma_0^3}{2(\sigma_{pn}^2)^3} & -\dfrac{1}{4(\sigma_{pn}^2)^2} + \dfrac{3(\sigma_0^2)^2 + 6c\sigma_0^2\gamma_0^2 + d_3\gamma_0^4}{4(\sigma_{pn}^2)^4}
\end{pmatrix},
\]

where dj = E[x_{1t}^{3−j}(x_{2t} − x_{1t}b)^{j+1}] for j = 1, 2, 3, and

\[
B_n(\theta_n^p) = Var\left(\frac{1}{\sqrt{n}}\sum_{t=1}^n s_t(\theta_n^p)\right) = \frac{1}{n}\sum_{t=1}^n Var\left(s_t(\theta_n^p)\right) = \bar{J}_n(\theta_n^p).
\]

Hence,

\[
\lim_{n\to\infty} B_n(\theta_n^p) = \lim_{n\to\infty} \bar{J}_n(\theta_n^p) = \lim_{n\to\infty} \left[-\bar{H}_n(\theta_n^p)\right] = \begin{pmatrix}
\dfrac{E(x_{1t}^2)}{\sigma_0^2} & 0 \\[1ex]
0 & \dfrac{1}{2(\sigma_0^2)^2}
\end{pmatrix},
\]

since γ0 = δ0/n^{1/2}. Thus, H̄n(θpn) + Bn(θpn) = o(1). However, H̄n(θpn) + Bn(θpn) ≠ 0 for any finite n.
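The moment formulas in Example 3.1 can be verified by simulation for a fixed γ0. The sketch below is our own check, with hypothetical distributional choices for the regressors (x1t, x2t iid standard normal and independent, so b = 0, c = 1, and d3 = 3); it compares the analytic (2,2) entries of −E[ht(θpn)] and Var[st(θpn)] with empirical averages.

```python
import numpy as np

# Numerical check of the moment formulas in Example 3.1 for a fixed gamma_0.
# Hypothetical choices: x1, x2 iid standard normal (so b = 0, c = 1, d3 = 3).
rng = np.random.default_rng(5)
N = 1_000_000
beta0, gamma0, sig2_0 = 1.0, 0.3, 1.0
x1 = rng.normal(size=N)
x2 = rng.normal(size=N)
eps = rng.normal(size=N)
y = x1 * beta0 + x2 * gamma0 + eps

# Pseudo-true values: beta_pn = beta0 + b*gamma0 = beta0, sig2_pn = sig2_0 + c*gamma0^2.
sig2_pn = sig2_0 + gamma0**2
u = y - x1 * beta0

# (2,2) entry of -E[h_t]: -1/(2 sig2_pn^2) + (sig2_0 + c gamma0^2)/sig2_pn^3.
negH22 = -np.mean(0.5 / sig2_pn**2 - u**2 / sig2_pn**3)
assert abs(negH22 - (-0.5 / sig2_pn**2 + sig2_pn / sig2_pn**3)) < 0.01

# (2,2) entry of Var[s_t]:
# -1/(4 sig2_pn^2) + (3 sig2_0^2 + 6 c sig2_0 gamma0^2 + d3 gamma0^4)/(4 sig2_pn^4).
s2 = -0.5 / sig2_pn + u**2 / (2 * sig2_pn**2)
d3 = 3.0
var_s2_formula = -0.25 / sig2_pn**2 + (3 * sig2_0**2 + 6 * sig2_0 * gamma0**2
                                       + d3 * gamma0**4) / (4 * sig2_pn**4)
assert abs(np.var(s2) - var_s2_formula) < 0.01
```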
To develop the Laplace expansion, we need to fix more notation. For convenience of exposition, we let H_n^{(j)}(θ) = (1/n) Σ_{t=1}^n ∇^j lt(θ) for j = 3, 4, 5. Let π(θ) = ln p(θ), and let p̂, π̂, ∇^j p̂, and ∇^j π̂ be the values of the functions p(θ), π(θ), ∇^j p(θ) and ∇^j π(θ) evaluated at θ̂n. The next lemma extends the result in Kass et al (1990) to a higher order in matrix form.
Lemma 3.2 Under Assumptions 1-10, we have

\[
\frac{\int l_t(\theta)\, p(\theta)\, p(\mathbf{y}|\theta)\, d\theta}{\int p(\theta)\, p(\mathbf{y}|\theta)\, d\theta}
= l_t(\hat{\theta}_n) + \frac{1}{n} B_{t,1} + \frac{1}{n^2}\left(B_{t,2} - B_{t,3}\right) + O_p(n^{-3}), \quad (16)
\]

where

\[
B_{t,1} = -\frac{1}{2}\,\mathrm{tr}\!\left[H_n(\hat{\theta}_n)^{-1}\nabla^2 l_t(\hat{\theta}_n)\right]
- \nabla l_t(\hat{\theta}_n)' H_n(\hat{\theta}_n)^{-1}\frac{\nabla\hat{p}}{\hat{p}}
- \frac{1}{2}\,\mathrm{vec}\!\left(H_n(\hat{\theta}_n)^{-1}\right)' H_n^{(3)}(\hat{\theta}_n) H_n(\hat{\theta}_n)^{-1}\nabla l_t(\hat{\theta}_n), \quad (17)
\]

\[
\begin{aligned}
B_{t,2} ={} & -\frac{1}{8}\nabla l_t(\hat{\theta}_n)' H_n(\hat{\theta}_n)^{-1} H_n^{(5)}(\hat{\theta}_n)'\,\mathrm{vec}\!\left[H_n(\hat{\theta}_n)^{-1}\otimes \mathrm{vec}\!\left(H_n(\hat{\theta}_n)^{-1}\right)\right] \\
& +\frac{35}{48}\nabla l_t(\hat{\theta}_n)' A_2 H_n(\hat{\theta}_n)^{-1} H_n^{(3)}(\hat{\theta}_n)'\,\mathrm{vec}\!\left(H_n(\hat{\theta}_n)^{-1}\right)
-\frac{35}{48}\nabla l_t(\hat{\theta}_n)' H_n(\hat{\theta}_n)^{-1} H_n^{(3)}(\hat{\theta}_n)'\,\mathrm{vec}\!\left(H_n(\hat{\theta}_n)^{-1}\right) A_1 \\
& -\frac{5}{8}\nabla l_t(\hat{\theta}_n)' H_n(\hat{\theta}_n)^{-1}\frac{\nabla\hat{p}}{\hat{p}}\,\mathrm{tr}[A_2]
+\frac{35}{24}\nabla l_t(\hat{\theta}_n)' H_n(\hat{\theta}_n)^{-1}\frac{\nabla\hat{p}}{\hat{p}}\, A_1 \\
& -\frac{5}{4}\nabla l_t(\hat{\theta}_n)' H_n(\hat{\theta}_n)^{-1} H_n^{(3)}(\hat{\theta}_n)'\,\mathrm{vec}\!\left(H_n(\hat{\theta}_n)^{-1}\right)\mathrm{tr}\!\left[H_n(\hat{\theta}_n)^{-1}\frac{\nabla^2\hat{p}}{\hat{p}}\right]
+\frac{1}{2}\nabla l_t(\hat{\theta}_n)' H_n(\hat{\theta}_n)^{-1}\left(\frac{\nabla^3\hat{p}}{\hat{p}}\right)'\mathrm{vec}\!\left(H_n(\hat{\theta}_n)^{-1}\right) \\
& -\frac{5}{16}\,\mathrm{tr}[A_2]\,\mathrm{tr}\!\left[H_n(\hat{\theta}_n)^{-1}\nabla^2 l_t(\hat{\theta}_n)\right]
+\frac{35}{48}\, A_1\,\mathrm{tr}\!\left[H_n(\hat{\theta}_n)^{-1}\nabla^2 l_t(\hat{\theta}_n)\right] \\
& -\frac{5}{12}\,\mathrm{vec}\!\left(H_n(\hat{\theta}_n)^{-1}\right)' H_n^{(3)}(\hat{\theta}_n) H_n(\hat{\theta}_n)^{-1}\nabla^3 l_t(\hat{\theta}_n)'\,\mathrm{vec}\!\left(H_n(\hat{\theta}_n)^{-1}\right)
+\frac{1}{8}\,\mathrm{tr}\!\left(\left[H_n(\hat{\theta}_n)^{-1}\otimes \mathrm{vec}\!\left(H_n(\hat{\theta}_n)^{-1}\right)\right]'\nabla^4 l_t(\hat{\theta}_n)\right) \\
& -\frac{5}{4}\,\mathrm{vec}\!\left(H_n(\hat{\theta}_n)^{-1}\right)' H_n^{(3)}(\hat{\theta}_n) H_n(\hat{\theta}_n)^{-1}\nabla^2 l_t(\hat{\theta}_n) H_n(\hat{\theta}_n)^{-1}\frac{\nabla\hat{p}}{\hat{p}}
+\frac{1}{2}\,\mathrm{vec}\!\left(H_n(\hat{\theta}_n)^{-1}\right)'\nabla^3 l_t(\hat{\theta}_n) H_n(\hat{\theta}_n)^{-1}\frac{\nabla\hat{p}}{\hat{p}} \\
& +\frac{3}{4}\,\mathrm{tr}\!\left[\nabla^2 l_t(\hat{\theta}_n) H_n(\hat{\theta}_n)^{-1}\right]\mathrm{tr}\!\left[H_n(\hat{\theta}_n)^{-1}\frac{\nabla^2\hat{p}}{\hat{p}}\right],
\end{aligned} \quad (18)
\]

\[
B_{t,3} = B_4 \times B_{t,1}, \quad (19)
\]

with

\[
B_4 = -\frac{1}{2}\,\mathrm{tr}\!\left[H_n(\hat{\theta}_n)^{-1}\frac{\nabla^2\hat{p}}{\hat{p}}\right]
+\frac{1}{2}\,\mathrm{vec}\!\left(H_n(\hat{\theta}_n)^{-1}\right)' H_n^{(3)}(\hat{\theta}_n) H_n(\hat{\theta}_n)^{-1}\frac{\nabla\hat{p}}{\hat{p}}
-\frac{5}{24}\, A_1 + \frac{1}{8}\,\mathrm{tr}[A_2], \quad (20)
\]

\[
A_1 = \mathrm{vec}\!\left(H_n(\hat{\theta}_n)^{-1}\right)' H_n^{(3)}(\hat{\theta}_n) H_n(\hat{\theta}_n)^{-1} H_n^{(3)}(\hat{\theta}_n)'\,\mathrm{vec}\!\left(H_n(\hat{\theta}_n)^{-1}\right),
\qquad
A_2 = \left[H_n(\hat{\theta}_n)^{-1}\otimes \mathrm{vec}\!\left(H_n(\hat{\theta}_n)^{-1}\right)\right]' H_n^{(4)}(\hat{\theta}_n).
\]
We are now in a position to develop a high order expansion of PD and DIC.
Lemma 3.3 Under Assumptions 1-10, we have

\[
P_D = P + \frac{1}{n} C_1 + \frac{1}{n} C_2 + O_p(n^{-2}),
\]
\[
\mathrm{DIC} = \mathrm{AIC} + \frac{1}{n} D_1 + \frac{1}{n} D_2 + O_p(n^{-2}),
\]

where

\[
C_1 = -\frac{-13 + 15P}{12}\, A_1 - \frac{1 - 2P}{4}\,\mathrm{tr}[A_2], \qquad
C_2 = \frac{3 - P}{2}\, C_{21} - P\, C_{22} + (1 - P)\, C_{23},
\]
\[
D_1 = -\frac{1 - 2P}{2}\,\mathrm{tr}[A_2] - \frac{-23 + 25P}{12}\, A_1, \qquad
D_2 = (2 - P)\, C_{21} - 2P\, C_{22} + (1 - 2P)\, C_{23},
\]
\[
C_{21} = \nabla\hat{\pi}' H_n(\hat{\theta}_n)^{-1} H_n^{(3)}(\hat{\theta}_n)'\,\mathrm{vec}\!\left(H_n(\hat{\theta}_n)^{-1}\right),
\qquad
C_{22} = \mathrm{tr}\!\left[H_n(\hat{\theta}_n)^{-1}\nabla^2\hat{\pi}\right],
\qquad
C_{23} = \nabla\hat{\pi}' H_n(\hat{\theta}_n)^{-1}\nabla\hat{\pi},
\]

where A1 and A2 are defined as in Lemma 3.2.
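Lemma 3.3 says that PD deviates from P only through Op(n⁻¹) prior-dependent terms. In the conjugate normal-mean model (our own illustration, not from the paper: y_i ~ N(μ, 1) with prior μ ~ N(0, τ²)), PD has the exact closed form n/(n + 1/τ²), which makes the finite-sample effect of the prior explicit:

```python
# In the conjugate model y_i ~ N(mu, 1), mu ~ N(0, tau2), the posterior variance
# is v = 1/(n + 1/tau2), and p_D = E_post[D(mu)] - D(post_mean) = n * v exactly.
def p_D(n, tau2):
    return n / (n + 1.0 / tau2)

n = 100
assert p_D(n, tau2=100.0) > 0.999            # diffuse prior: p_D close to P = 1
assert p_D(n, tau2=0.005) < 0.5              # tight prior shrinks p_D below P
assert abs(p_D(n, tau2=1.0) - 1.0) < 2.0 / n # deviation from P is O(1/n)
```

The deviation 1 − PD = (1/τ²)/(n + 1/τ²) is of order 1/n for any fixed τ², consistent with the C2 term in the lemma.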
Theorem 3.1 Under Assumptions 1-10, we have

\[
E_{\mathbf{y}} E_{\mathbf{y}_{rep}}\left[-2 \ln p(\mathbf{y}_{rep}|\bar{\theta}_n(\mathbf{y}))\right] = E_{\mathbf{y}}\left[\mathrm{DIC} + o_p(1)\right] = E_{\mathbf{y}}\left[\mathrm{DIC}\right] + o(1).
\]
Remark 3.4 Like AIC, DIC is an asymptotically unbiased estimator of E_y E_{yrep}[−2 ln p(yrep|θ̄n(y))], according to Theorem 3.1. Hence, the decision-theoretic justification of DIC is that DIC selects a model that asymptotically minimizes the expected loss, which is the expected KL divergence between the DGP and the plug-in predictive distribution p(yrep|θ̄n(y)), where the expectation is taken with respect to the DGP. A key difference between AIC and DIC is that the plug-in predictive distribution is based on different estimators. In AIC the QML estimate, θ̂n(y), is used, while in DIC the Bayesian posterior mean, θ̄n(y), is used. That is why DIC is explained as the Bayesian version of AIC.
15
Remark 3.5 The justification of DIC remains valid if the posterior mean is replaced with
the posterior mode or with the QML estimate and PD is replaced with P . This is because the
justification of DIC requires that the information matrix identity holds true asymptotically,
and that the posterior distribution converges to a Normal distribution (the posterior mean
converges to the posterior mode and the posterior variance converges to zero).
Remark 3.6 In AIC, the number of degrees of freedom, P , is used to measure the model
complexity. When the prior is informative, it imposes additional restrictions on the parameter
space and, hence, PD may not be close to P for a finite n. A useful contribution of DIC is to
provide a way to measure the model complexity when the prior information is incorporated;
see Brooks (2002). From Lemma 3.3, the effect of prior information on $P_D$ is reflected by $C_2$ and is of order $O_p\left(n^{-1}\right)$. Similarly, the effect of prior information on DIC is reflected by $D_2$ and is also of order $O_p\left(n^{-1}\right)$.
Remark 3.7 As pointed out in Spiegelhalter, et al (2014), the consistency of the Bayes factor (BF) requires that a true model exists and is among the candidate models. In contrast, AIC and DIC are prediction-based criteria designed to find the model that predicts best among the candidates. Neither AIC nor DIC attempts to find the true model.
Remark 3.8 If $p(\mathbf{y}|\theta)$ has a closed-form expression, DIC is trivially computable from the MCMC output. This computational tractability, together with the versatility of MCMC and the fact that DIC is incorporated into the Bayesian software WinBUGS, is among the reasons why DIC has enjoyed such a wide range of applications.
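To make the remark concrete, the following is a minimal sketch (ours, not code from the paper) of assembling DIC from posterior draws. It assumes a toy $N(\mu,1)$ model with a flat prior, whose posterior can be sampled directly; with a real MCMC sampler only the `draws` line would change:

```python
import numpy as np

# Toy model: y_t ~ N(mu, 1) with unknown mean mu; `draws` plays the role of
# MCMC output for mu (with a flat prior the posterior is N(ybar, 1/n)).
rng = np.random.default_rng(0)
y = rng.normal(loc=1.0, scale=1.0, size=200)
draws = rng.normal(loc=y.mean(), scale=1.0 / np.sqrt(len(y)), size=5000)

def log_lik(mu):
    """Log-likelihood ln p(y | mu) under the N(mu, 1) model."""
    return -0.5 * len(y) * np.log(2 * np.pi) - 0.5 * np.sum((y - mu) ** 2)

dev_draws = np.array([-2.0 * log_lik(mu) for mu in draws])  # deviance per draw
dev_bar = dev_draws.mean()                   # posterior mean deviance
dev_at_mean = -2.0 * log_lik(draws.mean())   # deviance at the posterior mean
p_d = dev_bar - dev_at_mean                  # effective number of parameters P_D
dic = dev_at_mean + 2.0 * p_d                # DIC = D(theta_bar) + 2 P_D

print(p_d)   # close to 1: one free parameter
print(dic)
```

With a weak prior, `p_d` should land near the number of free parameters, matching $P_D \approx P$ from Lemma 3.3.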
4 DIC Based on Bayesian Predictive Distribution
The above decision-theoretic justification of DIC is based on a loss function constructed from the plug-in predictive distribution. Unfortunately, the plug-in predictive distribution is not invariant to reparameterization and, hence, the corresponding DIC can be sensitive to the parameterization. From the pure Bayesian viewpoint, the Bayesian predictive distribution, not the plug-in predictive distribution, is the proper predictive distribution, and it is invariant to reparameterization. Hence, the loss function and the corresponding information criterion will be invariant to reparameterization; see Ando and Tsay (2010) and Spiegelhalter, et al (2014).
Let $p\left(\mathbf{y}_{rep}|\mathbf{y}, M_k\right)$ be the Bayesian predictive distribution, that is,
\[
p\left(\mathbf{y}_{rep}|\mathbf{y}, M_k\right) = \int p\left(\mathbf{y}_{rep}|\theta, M_k\right) p\left(\theta|\mathbf{y}, M_k\right) d\theta.
\]
The KL divergence based on the Bayesian predictive distribution is
\[
\mathrm{KL}\left[g\left(\mathbf{y}_{rep}\right), p\left(\mathbf{y}_{rep}|\mathbf{y}, M_k\right)\right]
= \mathbf{E}_{\mathbf{y}_{rep}}\left(\ln g\left(\mathbf{y}_{rep}\right)\right) - \mathbf{E}_{\mathbf{y}_{rep}}\left(\ln p\left(\mathbf{y}_{rep}|\mathbf{y}, M_k\right)\right). \tag{21}
\]
The expected loss for a statistical decision $d_k$ that selects model $M_k$ is
\[
\mathrm{Risk}(d_k) = \mathbf{E}_{\mathbf{y}}\left\{2\times\mathrm{KL}\left[g\left(\mathbf{y}_{rep}\right), p\left(\mathbf{y}_{rep}|\mathbf{y}, M_k\right)\right]\right\}
= \mathbf{E}_{\mathbf{y}}\left[\mathbf{E}_{\mathbf{y}_{rep}}\left(2\ln g\left(\mathbf{y}_{rep}\right)\right)\right]
+ \mathbf{E}_{\mathbf{y}}\left[\mathbf{E}_{\mathbf{y}_{rep}}\left(-2\ln p\left(\mathbf{y}_{rep}|\mathbf{y}, M_k\right)\right)\right].
\]
A better model is expected to yield a smaller value of $\mathrm{Risk}(d_k)$. Since $\mathbf{E}_{\mathbf{y}_{rep}}\left(2\ln g\left(\mathbf{y}_{rep}\right)\right)$ is the same across all candidate models, it is dropped from (21) when comparing models. As a result, we propose to choose the model that gives the smallest value of (again we suppress $M_k$ for notational simplicity)
\[
\mathbf{E}_{\mathbf{y}}\mathbf{E}_{\mathbf{y}_{rep}}\left(-2\ln p\left(\mathbf{y}_{rep}|\mathbf{y}\right)\right)
= \int\int -2\ln p\left(\mathbf{y}_{rep}|\mathbf{y}\right) g\left(\mathbf{y}_{rep}\right) g(\mathbf{y})\, d\mathbf{y}_{rep}\, d\mathbf{y}.
\]
Let the selected model be $M_{k^*}$ (i.e., the optimal decision is $d_{k^*}$). Then $p\left(\mathbf{y}_{rep}|\mathbf{y}, M_{k^*}\right)$, which is the closest to $g\left(\mathbf{y}_{rep}\right)$ in terms of the expected KL divergence, is used to generate predictions of future observations.
4.1 ICAT

Under the iid assumption, Ando and Tsay (2010) showed that
\[
\mathbf{E}_{\mathbf{y}}\mathbf{E}_{\mathbf{y}_{rep}}\left[-2\ln p\left(\mathbf{y}_{rep}|\mathbf{y}\right)\right]
= \mathbf{E}_{\mathbf{y}}\left\{-2\ln p(\mathbf{y}|\mathbf{y}) + \mathrm{tr}\left[J_{AT}^{-1}\left(\hat{\theta}_{AT}\right) I_{AT}\left(\hat{\theta}_{AT}\right)\right]\right\} + o(1),
\]
where
\[
\hat{\theta}_{AT} = \arg\max_{\theta\in\Theta}\left\{2\ln p(\mathbf{y}|\theta) + \ln p(\theta)\right\},
\]
\[
J_{AT}(\theta) = -\frac{1}{n}\sum_{t=1}^n \frac{\partial^2\ln\xi\left(y_t|\theta\right)}{\partial\theta\partial\theta'}, \qquad
I_{AT}(\theta) = \frac{1}{n}\sum_{t=1}^n \frac{\partial\ln\xi\left(y_t|\theta\right)}{\partial\theta}\frac{\partial\ln\xi\left(y_t|\theta\right)}{\partial\theta'},
\]
with $\ln\xi\left(y_t|\theta\right) = \ln p\left(y_t|\theta\right) + \ln p(\theta)/(2n)$ for $t = 1, \cdots, n$.
Remark 4.1 Ando and Tsay (2010) defined the following information criterion
\[
\mathrm{IC}_{AT} = -2\ln p(\mathbf{y}|\mathbf{y}) + \mathrm{tr}\left[I_{AT}^{-1}\left(\hat{\theta}_{AT}\right) J_{AT}\left(\hat{\theta}_{AT}\right)\right]. \tag{22}
\]
Since $\mathrm{IC}_{AT}$ is constructed from the Bayesian predictive distribution, it is invariant to reparameterization. Interestingly, the first term in $\mathrm{IC}_{AT}$ differs from that in AIC or DIC. To compute the first term in $\mathrm{IC}_{AT}$ from the MCMC output, note that
\[
-2\ln p(\mathbf{y}|\mathbf{y}) = -2\ln\left(\int p(\mathbf{y}|\theta) p(\theta|\mathbf{y}) d\theta\right)
\approx -2\ln\left(\frac{1}{J}\sum_{j=1}^J p\left(\mathbf{y}|\theta^{(j)}\right)\right),
\]
where $\left\{\theta^{(j)}\right\}_{j=1}^J$ are $J$ effective random draws from the posterior distribution $p(\theta|\mathbf{y})$. To compute the penalty term $\mathrm{tr}\left[I_{AT}^{-1}\left(\hat{\theta}_{AT}\right) J_{AT}\left(\hat{\theta}_{AT}\right)\right]$, one first needs to obtain $\hat{\theta}_{AT}$. In general, $\hat{\theta}_{AT}$ is not available analytically and, hence, one has to use a numerical method to find it. One then needs to calculate $I_{AT}(\theta)$ and $J_{AT}(\theta)$ and to invert $I_{AT}(\theta)$.
Remark 4.2 Under the assumptions that the prior is $O_p(1)$ and that the candidate model encompasses the DGP, Ando and Tsay simplified the information criterion to
\[
\mathrm{IC}_{AT} = -2\ln p(\mathbf{y}|\mathbf{y}) + P. \tag{23}
\]
In this case, the penalty term has a very simple expression: it equals the number of degrees of freedom and no longer depends on the prior information.
4.2 DICBP

Using $-2\ln p(\mathbf{y}|\mathbf{y})$ as the first term makes it difficult to compare $\mathrm{IC}_{AT}$ with DIC, AIC or BIC. In this paper, we propose a Bayesian predictive distribution-based information criterion whose first term is the same as that of DIC, i.e., $D\left(\bar{\theta}\right)$, without resorting to the iid assumption. It turns out that such a choice not only facilitates the comparison with DIC, AIC or BIC, but also leads to a much simpler expression for the penalty term. In the following theorem we propose a new information criterion based on the KL divergence between the DGP and the Bayesian predictive distribution and show its asymptotic unbiasedness.
Theorem 4.1 Define the information criterion as
\[
\mathrm{DIC}^{BP} = D\left(\bar{\theta}\right) + (1+\ln 2) P_D, \tag{24}
\]
where $P_D$ is defined in (4). Under Assumptions 1-10, we have
\[
\mathbf{E}_{\mathbf{y}}\mathbf{E}_{\mathbf{y}_{rep}}\left[-2\ln p\left(\mathbf{y}_{rep}|\mathbf{y}\right)\right] = \mathbf{E}_{\mathbf{y}}\left[\mathrm{DIC}^{BP}\right] + o(1).
\]
Remark 4.3 The justification of $\mathrm{DIC}^{BP}$ remains valid if the posterior mean is replaced with the posterior mode or with the QML estimate and if $P_D$ is replaced with $P$. Clearly, the penalty term in $\mathrm{DIC}^{BP}$ is smaller than that in DIC, AIC, and BIC (i.e., $(1+\ln 2) P_D \approx (1+\ln 2) P$ as opposed to $2P_D$ in DIC, $2P$ in AIC, and $P\ln n$ in BIC).
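To make the size comparison concrete, a small sketch (with illustrative values of $P$ and $n$, not taken from the paper, and using the approximation $P_D \approx P$) tabulates the four penalty terms:

```python
import math

# Penalty terms of the four criteria discussed above, with P_D approximated
# by P (a good approximation when the prior is weak). The values of P and n
# below are purely illustrative.
def penalties(P, n):
    return {
        "DIC": 2 * P,                      # 2 * P_D
        "AIC": 2 * P,
        "BIC": P * math.log(n),
        "DIC_BP": (1 + math.log(2)) * P,   # (1 + ln 2) * P_D ~= 1.693 * P
    }

for P, n in [(3, 100), (10, 1000)]:
    print(P, n, penalties(P, n))
```

For any $P > 0$ and $n \ge 8$, the ordering $(1+\ln 2)P < 2P < P\ln n$ visible in the output mirrors the inequality stated in the remark.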
Remark 4.4 It can be shown that $\mathrm{DIC}^{BP} = \mathrm{IC}_{AT} + o_p(1)$ and $\mathbf{E}_{\mathbf{y}}\left[\mathrm{DIC}^{BP}\right] = \mathbf{E}_{\mathbf{y}}\left(\mathrm{IC}_{AT}\right) + o(1)$. Clearly $(1+\ln 2) P_D \approx (1+\ln 2) P > P$, implying $-2\ln p\left(\mathbf{y}|\bar{\theta}\right) < -2\ln p(\mathbf{y}|\mathbf{y})$. To see why this is the case, note that
\[
p\left(\mathbf{y}|\bar{\theta}\right) = p\left(\mathbf{y}\,\Big|\int\theta\, p(\theta|\mathbf{y})\, d\theta\right), \qquad
p(\mathbf{y}|\mathbf{y}) = \int p(\mathbf{y}|\theta)\, p(\theta|\mathbf{y})\, d\theta.
\]
Under Assumptions 1-10, $p(\theta|\mathbf{y})$ is approximately Gaussian and $p(\mathbf{y}|\theta)$ is approximately concave in $\theta$ around the posterior mode. The inequality $-2\ln p\left(\mathbf{y}|\bar{\theta}\right) < -2\ln p(\mathbf{y}|\mathbf{y})$ then follows from Jensen's inequality.
Remark 4.5 Like $\mathrm{IC}_{AT}$, $\mathrm{DIC}^{BP}$ is based on the Bayesian predictive distribution and, hence, is invariant to reparameterization. $\mathrm{DIC}^{BP}$ has several further advantages. First, $\mathrm{DIC}^{BP}$ is developed without resorting to the iid assumption. Second, $\mathrm{DIC}^{BP}$ is easier to compute. When the MCMC output is available, $\mathrm{IC}_{AT}$ requires evaluating
\[
\ln p(\mathbf{y}|\mathbf{y}) = \ln\left(\int p(\mathbf{y}|\theta) p(\theta|\mathbf{y}) d\theta\right)
\approx \ln\left(\frac{1}{J}\sum_{j=1}^J p\left(\mathbf{y}|\theta^{(j)}\right)\right),
\]
while $\mathrm{DIC}^{BP}$ requires evaluating
\[
\int\ln p(\mathbf{y}|\theta)\, p(\theta|\mathbf{y})\, d\theta \approx \frac{1}{J}\sum_{j=1}^J\ln p\left(\mathbf{y}|\theta^{(j)}\right).
\]
Numerically, the log-likelihood function is usually much more stable than the likelihood function. Thus, for the same value of $J$, the accuracy of $\frac{1}{J}\sum_{j=1}^J\ln p\left(\mathbf{y}|\theta^{(j)}\right)$ is often much higher than that of $\ln\left(\frac{1}{J}\sum_{j=1}^J p\left(\mathbf{y}|\theta^{(j)}\right)\right)$. Moreover, $\mathrm{DIC}^{BP}$ is as easy to compute as DIC. Since DIC is monitored in WinBUGS, no additional effort is needed to calculate $\mathrm{DIC}^{BP}$. Third, like DIC, the penalty term depends on the prior information. Hence, the prior information is incorporated in $\mathrm{DIC}^{BP}$.
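The numerical point can be illustrated with hypothetical log-likelihood values at posterior draws (the magnitudes below are made up for illustration). Averaging log-likelihoods is trivially stable, while naively exponentiating them underflows; the log-sum-exp trick shown at the end is the standard remedy if $\ln p(\mathbf{y}|\mathbf{y})$ must be evaluated anyway:

```python
import numpy as np

# Hypothetical draws of ln p(y|theta^(j)). With n in the hundreds, these are
# large negative numbers, so p(y|theta^(j)) underflows in double precision.
rng = np.random.default_rng(1)
loglik_draws = -1500.0 + rng.normal(size=5000)

# Mean of log-likelihoods (the DIC^BP ingredient): trivially stable.
mean_log = loglik_draws.mean()

# Naive log of the averaged likelihoods (the IC_AT ingredient):
# exp(-1500) == 0.0 exactly, so the naive estimate collapses to -inf.
naive_log_mean = np.log(np.mean(np.exp(loglik_draws)))

# Standard remedy: the log-sum-exp trick.
m = loglik_draws.max()
stable_log_mean = m + np.log(np.mean(np.exp(loglik_draws - m)))

print(mean_log, naive_log_mean, stable_log_mean)
```

Note also that the stabilized log-mean always exceeds the mean-log, which is Jensen's inequality from Remark 4.4 in sample form.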
Remark 4.6 A recent literature suggests minimizing the posterior mean of the KL divergence between $g\left(\mathbf{y}_{rep}\right)$ and $p\left(\mathbf{y}_{rep}|\theta\right)$, i.e.,
\[
\int\left[\int\ln\frac{g\left(\mathbf{y}_{rep}\right)}{p\left(\mathbf{y}_{rep}|\theta\right)}\, g\left(\mathbf{y}_{rep}\right) d\mathbf{y}_{rep}\right] p(\theta|\mathbf{y})\, d\theta
= \int\ln g\left(\mathbf{y}_{rep}\right) g\left(\mathbf{y}_{rep}\right) d\mathbf{y}_{rep}
- \int\int\left[\ln p\left(\mathbf{y}_{rep}|\theta\right) g\left(\mathbf{y}_{rep}\right) d\mathbf{y}_{rep}\right] p(\theta|\mathbf{y})\, d\theta.
\]
Hence, the corresponding expected loss for a statistical decision $d_k$ is
\begin{align*}
\mathrm{Risk}(d_k)
&= \mathbf{E}_{\mathbf{y}}\left\{\int\left[\int\ln\frac{g\left(\mathbf{y}_{rep}\right)}{p\left(\mathbf{y}_{rep}|\theta, M_k\right)}\, g\left(\mathbf{y}_{rep}\right) d\mathbf{y}_{rep}\right] p\left(\theta|\mathbf{y}, M_k\right) d\theta\right\}\\
&= \int\ln g\left(\mathbf{y}_{rep}\right) g\left(\mathbf{y}_{rep}\right) d\mathbf{y}_{rep}
- \mathbf{E}_{\mathbf{y}}\left\{\int\int\ln p\left(\mathbf{y}_{rep}|\theta, M_k\right) g\left(\mathbf{y}_{rep}\right) d\mathbf{y}_{rep}\, p\left(\theta|\mathbf{y}, M_k\right) d\theta\right\}.
\end{align*}
Since the first term is constant across models, van der Linde (2005, 2012), Plummer (2008), Ando (2007) and Ando (2012) proposed to choose the model that minimizes
\[
\mathbf{E}_{\mathbf{y}}\left\{\int\int\left[-2\ln p\left(\mathbf{y}_{rep}|\theta\right) g\left(\mathbf{y}_{rep}\right) d\mathbf{y}_{rep}\right] p(\theta|\mathbf{y})\, d\theta\right\}.
\]
Under the iid assumption, it was shown that
\[
\mathbf{E}_{\mathbf{y}}\left[\int\int\left[-2\ln p\left(\mathbf{y}_{rep}|\theta\right) g\left(\mathbf{y}_{rep}\right) d\mathbf{y}_{rep}\right] p(\theta|\mathbf{y})\, d\theta\right]
\approx \mathbf{E}_{\mathbf{y}}\left[D\left(\bar{\theta}\right) + 3 P_D\right],
\]
leading to an information criterion called BPIC by Ando (2007). Clearly, the target here is different from that under DIC and also from that under $\mathrm{DIC}^{BP}$. In the words of Spiegelhalter, et al (2014) and Vehtari and Ojanen (2012), BPIC chooses an "average target" rather than a "representative target". Although BPIC can select a model, it cannot tell the user how to actually predict future observations because there is no utility corresponding to this criterion.
Remark 4.7 To understand why the penalty term in BPIC is larger than that in $\mathrm{DIC}^{BP}$, note that the loss employed by BPIC is
\[
\int\int -2\ln\left(p\left(\mathbf{y}_{rep}|\theta\right)\right) g\left(\mathbf{y}_{rep}\right) p(\theta|\mathbf{y})\, d\theta\, d\mathbf{y}_{rep}
= \int\left[\int -2\ln\left(p\left(\mathbf{y}_{rep}|\theta\right)\right) p(\theta|\mathbf{y})\, d\theta\right] g\left(\mathbf{y}_{rep}\right) d\mathbf{y}_{rep},
\]
while the loss employed by $\mathrm{DIC}^{BP}$ is
\[
\int -2\ln\left(\int p\left(\mathbf{y}_{rep}|\theta\right) p(\theta|\mathbf{y})\, d\theta\right) g\left(\mathbf{y}_{rep}\right) d\mathbf{y}_{rep}.
\]
By Jensen's inequality,
\[
-2\ln\left(\int p\left(\mathbf{y}_{rep}|\theta\right) p(\theta|\mathbf{y})\, d\theta\right)
< \int -2\ln\left(p\left(\mathbf{y}_{rep}|\theta\right)\right) p(\theta|\mathbf{y})\, d\theta.
\]
Hence, the expected loss in $\mathrm{DIC}^{BP}$ is smaller than that in BPIC for the same candidate model.
4.3 Expected loss of DIC and DICBP

From the decision viewpoint, DIC and $\mathrm{DIC}^{BP}$ lead to different statistical decisions. If the optimal model selected by DIC differs from that selected by $\mathrm{DIC}^{BP}$, the two decisions are obviously different. Even when DIC and $\mathrm{DIC}^{BP}$ select the same model, the two decisions still differ because the predictions come from two different distributions, namely $p\left(\mathbf{y}_{rep}|\bar{\theta}(\mathbf{y}), M_k\right)$ versus $p\left(\mathbf{y}_{rep}|\mathbf{y}, M_k\right)$. In both cases, therefore, the expected losses implied by DIC and $\mathrm{DIC}^{BP}$ differ. Although it is known that the Bayesian predictive distribution is a full predictive distribution and invariant to reparameterization, questions such as "which predictive distribution should be used for making predictions?" and "is there any difference between using the two predictive distributions for making predictions?" remain unanswered in the literature. The theoretical development of $\mathrm{DIC}^{BP}$ allows us to answer these two important questions.

With the two information criteria, the action space is larger than before. Denote the action space by $\left\{d_{k0}, d_{k1}\right\}_{k=1}^K$, where $d_{ka}$ ($a\in\{0,1\}$) means that $M_k$ is selected and the predictions come from $p\left(\mathbf{y}_{rep}|\bar{\theta}_k(\mathbf{y}), M_k\right)$ if $a=0$ (i.e., DIC is the corresponding information criterion), but from $p\left(\mathbf{y}_{rep}|\mathbf{y}, M_k\right)$ if $a=1$ (i.e., $\mathrm{DIC}^{BP}$ is the corresponding information criterion). Let the two KL divergence functions be represented uniformly as
\[
\ell\left(\mathbf{y}, d_{ka}\right) = 2\times\mathrm{KL}\left[g\left(\mathbf{y}_{rep}\right), p\left(\mathbf{y}_{rep}|\mathbf{y}, d_{ka}\right)\right],
\]
where $p\left(\mathbf{y}_{rep}|\mathbf{y}, d_{k0}\right) := p\left(\mathbf{y}_{rep}|\bar{\theta}_k(\mathbf{y}), M_k\right)$ and $p\left(\mathbf{y}_{rep}|\mathbf{y}, d_{k1}\right) := p\left(\mathbf{y}_{rep}|\mathbf{y}, M_k\right)$. The expected loss associated with $d_{ka}$ is
\[
\mathrm{Risk}\left(d_{ka}\right) = \mathbf{E}_{\mathbf{y}}\left(\ell\left(\mathbf{y}, d_{ka}\right)\right) = \int\ell\left(\mathbf{y}, d_{ka}\right) g(\mathbf{y})\, d\mathbf{y}.
\]
Hence, the model selection problem is equivalent to the following statistical decision:
\[
\min_{a\in\{0,1\}}\min_{k\in\{1,\cdots,K\}}\mathrm{Risk}\left(d_{ka}\right). \tag{25}
\]
According to Lemma 3.3, $P_D = P + o_p(1)$. Since $P > 0$ and $\ln 2 + 1 < 2$, for any model $M_k$ we have $\mathrm{DIC} > \mathrm{DIC}^{BP}$ with probability approaching one (w.p.a.1). As a result, $\mathbf{E}_{\mathbf{y}}\left(\mathrm{DIC}\right) > \mathbf{E}_{\mathbf{y}}\left(\mathrm{DIC}^{BP}\right)$ w.p.a.1. Following Theorem 3.1 and Theorem 4.1, w.p.a.1, we have
\[
\mathbf{E}_{\mathbf{y}}\mathbf{E}_{\mathbf{y}_{rep}}\left[-2\ln p\left(\mathbf{y}_{rep}|\bar{\theta}_k(\mathbf{y}), M_k\right)\right]
> \mathbf{E}_{\mathbf{y}}\mathbf{E}_{\mathbf{y}_{rep}}\left[-2\ln p\left(\mathbf{y}_{rep}|\mathbf{y}, M_k\right)\right],
\]
\[
2\mathbf{E}_{\mathbf{y}}\mathbf{E}_{\mathbf{y}_{rep}}\ln g\left(\mathbf{y}_{rep}\right)-\mathbf{E}_{\mathbf{y}}\mathbf{E}_{\mathbf{y}_{rep}}\left[2\ln p\left(\mathbf{y}_{rep}|\bar{\theta}_k(\mathbf{y}), M_k\right)\right]
> 2\mathbf{E}_{\mathbf{y}}\mathbf{E}_{\mathbf{y}_{rep}}\ln g\left(\mathbf{y}_{rep}\right)-\mathbf{E}_{\mathbf{y}}\mathbf{E}_{\mathbf{y}_{rep}}\left[2\ln p\left(\mathbf{y}_{rep}|\mathbf{y}, M_k\right)\right],
\]
and
\[
\mathrm{Risk}\left(d_{k0}\right) = \mathbf{E}_{\mathbf{y}}\left(\ell\left(\mathbf{y}, d_{k0}\right)\right) > \mathbf{E}_{\mathbf{y}}\left(\ell\left(\mathbf{y}, d_{k1}\right)\right) = \mathrm{Risk}\left(d_{k1}\right). \tag{26}
\]
Hence, w.p.a.1,
\[
\min_{k\in\{1,\cdots,K\}}\mathrm{Risk}\left(d_{k0}\right) > \min_{k\in\{1,\cdots,K\}}\mathrm{Risk}\left(d_{k1}\right), \tag{27}
\]
and
\[
\arg\min_{a\in\{0,1\}}\left[\min_{k\in\{1,\cdots,K\}}\mathrm{Risk}\left(d_{ka}\right)\right] = 1.
\]
This means that, w.p.a.1, the optimal solution to the statistical decision problem given in Section 2.1 is obtained by $\mathrm{DIC}^{BP}$. This is true even when $\arg\min_{k\in\{1,\cdots,K\}}\mathrm{Risk}\left(d_{k0}\right) = \arg\min_{k\in\{1,\cdots,K\}}\mathrm{Risk}\left(d_{k1}\right)$. Therefore, as far as the expected loss is concerned, the Bayesian predictive distribution, not the plug-in predictive distribution, should be used for making predictions.
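The inequality in (26) can be illustrated in closed form with a toy example of our own (not from the paper): a $N(0,1)$ DGP fitted by a $N(\mu,1)$ model under a flat prior, for which both expected losses have analytic expressions:

```python
import math

# DGP: y_t ~ N(0,1). Model: y_t ~ N(mu,1), flat prior, so the posterior of mu
# is N(ybar, 1/n). Plug-in predictive: N(ybar, 1). Bayesian predictive:
# N(ybar, 1 + 1/n). Averaging -2 ln p(y_rep|.) over y_rep ~ N(0,1) and over
# ybar ~ N(0, 1/n) gives the closed forms below (per future observation).
def risk_plugin(n):
    # E_ybar E_yrep[-2 ln N(yrep; ybar, 1)] = ln(2*pi) + 1 + 1/n
    return math.log(2 * math.pi) + 1 + 1 / n

def risk_bayes(n):
    # E_ybar E_yrep[-2 ln N(yrep; ybar, 1 + 1/n)] = ln(2*pi*(1+1/n)) + 1
    s2 = 1 + 1 / n
    return math.log(2 * math.pi * s2) + (1 + 1 / n) / s2

for n in [10, 100, 1000]:
    print(n, risk_plugin(n), risk_bayes(n))
```

Here the gap is $1/n - \ln(1+1/n) > 0$ for every $n$, so the Bayesian predictive risk is strictly smaller, consistent with (26), though both risks converge as $n\to\infty$.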
5 Conclusion

This paper provides a rigorous decision-theoretic justification of DIC based on a set of regularity conditions but without requiring the iid assumption. The candidate model is not required to encompass the DGP. It is shown that DIC is an asymptotically unbiased estimator of the expected KL divergence between the DGP and the plug-in predictive distribution.

Based on the Bayesian predictive distribution, a new information criterion ($\mathrm{DIC}^{BP}$) is constructed for model selection. Its first term has the same expression as that in DIC, but its penalty term is smaller. The asymptotic justification of $\mathrm{DIC}^{BP}$ is provided in the same way as DIC is justified. The expected loss of $\mathrm{DIC}^{BP}$ is compared with that of DIC and BPIC. It is shown that, as $n\to\infty$, $\mathrm{DIC}^{BP}$ leads to a smaller expected loss than DIC and BPIC. From the decision viewpoint, the Bayesian predictive distribution and $\mathrm{DIC}^{BP}$, not the plug-in predictive distribution or DIC, lead to the optimal decision action.
Although the theoretical framework under which we justify DIC and $\mathrm{DIC}^{BP}$ is general, it requires the consistency of the posterior mean, the asymptotic normal approximation to the posterior distribution, and the asymptotic normality of the QML estimator. When the candidate model contains latent variables whose number grows with $n$, and the parameter space is enlarged to include the latent variables, the consistency and the asymptotic normality may not hold, in which case DIC and $\mathrm{DIC}^{BP}$ are not justified. Moreover, when the data are nonstationary, the asymptotic normality may not hold; in this case, it remains unknown whether DIC and $\mathrm{DIC}^{BP}$ are still justified.
Appendix
Notations
- $:=$ : definitional equality
- $o(1)$ : tends to zero
- $o_p(1)$ : tends to zero in probability
- $\stackrel{p}{\to}$ : converges in probability
- $\bar{\theta}_n$ : posterior mean
- $\overleftrightarrow{\theta}_n$ : posterior mode
- $\hat{\theta}_n$ : QML estimate
- $\theta_n^p$ : pseudo-true parameter
- $\hat{\theta}_{AT}$ : arg max of $2\ln p(\mathbf{y}|\theta) + \ln p(\theta)$
- $\tilde{\theta}_n$ : arg max of $\ln p\left(\mathbf{y}_{rep}|\theta\right) + \ln p\left(\mathbf{y}|\theta\right) + \ln p\left(\theta\right)$
Proof of Lemma 3.1
\[
\frac{1}{n}\sum_{t=1}^n\left[l_t(\theta) - l_t\left(\theta_n^p\right)\right]
= \frac{1}{n}\sum_{t=1}^n\left(l_t(\theta) - E\left[l_t(\theta)\right]\right)
+ \frac{1}{n}\sum_{t=1}^n\left(E\left[l_t(\theta)\right] - E\left[l_t\left(\theta_n^p\right)\right]\right)
+ \frac{1}{n}\sum_{t=1}^n\left(E\left[l_t\left(\theta_n^p\right)\right] - l_t\left(\theta_n^p\right)\right).
\]
From (9), we know that for any $\varepsilon > 0$, there exist $\delta_1(\varepsilon) > 0$ and $N(\varepsilon) > 0$ such that, for all $n > N(\varepsilon)$,
\[
\frac{1}{n}\sum_{t=1}^n\left\{E\left[l_t(\theta)\right] - E\left[l_t\left(\theta_n^p\right)\right]\right\} < -\delta_1(\varepsilon),
\]
if $\theta\in\Theta\backslash N\left(\theta_n^p,\varepsilon\right)$. Thus, for any $\varepsilon>0$, if $\theta\in\Theta\backslash N\left(\theta_n^p,\varepsilon\right)$, for all $n>N(\varepsilon)$,
\[
\frac{1}{n}\sum_{t=1}^n\left[l_t(\theta) - l_t\left(\theta_n^p\right)\right]
< \frac{1}{n}\sum_{t=1}^n\left(l_t(\theta) - E\left[l_t(\theta)\right]\right) - \delta_1(\varepsilon)
+ \frac{1}{n}\sum_{t=1}^n\left(E\left[l_t\left(\theta_n^p\right)\right] - l_t\left(\theta_n^p\right)\right),
\]
and
\begin{align*}
\sup_{\Theta\backslash N\left(\theta_n^p,\varepsilon\right)}\frac{1}{n}\sum_{t=1}^n\left[l_t(\theta)-l_t\left(\theta_n^p\right)\right]
&\le \sup_{\Theta\backslash N\left(\theta_n^p,\varepsilon\right)}\frac{1}{n}\sum_{t=1}^n\left(l_t(\theta)-E\left[l_t(\theta)\right]\right)-\delta_1(\varepsilon)
+ \frac{1}{n}\sum_{t=1}^n\left(E\left[l_t\left(\theta_n^p\right)\right]-l_t\left(\theta_n^p\right)\right)\\
&\le \sup_{\Theta\backslash N\left(\theta_n^p,\varepsilon\right)}\left|\frac{1}{n}\sum_{t=1}^n\left(l_t(\theta)-E\left[l_t(\theta)\right]\right)\right|-\delta_1(\varepsilon)
+\left|\frac{1}{n}\sum_{t=1}^n\left(E\left[l_t\left(\theta_n^p\right)\right]-l_t\left(\theta_n^p\right)\right)\right|\\
&\le 2\sup_{\theta\in\Theta}\left|\frac{1}{n}\sum_{t=1}^n\left(l_t(\theta)-E\left[l_t(\theta)\right]\right)\right|-\delta_1(\varepsilon). \tag{28}
\end{align*}
Under Assumptions 1-6, the uniform convergence condition is satisfied, i.e.,
\[
P\left(\sup_{\theta\in\Theta}\left|\frac{1}{n}\sum_{t=1}^n\left(l_t(\theta)-E\left[l_t(\theta)\right]\right)\right|<\varepsilon\right)\to 1. \tag{29}
\]
From the uniform convergence, if we choose $\delta_2$ such that $0<\delta_2<\delta_1(\varepsilon)/2$, we have
\[
P\left[\sup_{\theta\in\Theta}\left|\frac{1}{n}\sum_{t=1}^n\left(l_t(\theta)-E\left[l_t(\theta)\right]\right)\right|<\delta_2\right]\to 1.
\]
Hence,
\[
P\left[2\sup_{\theta\in\Theta}\left|\frac{1}{n}\sum_{t=1}^n\left(l_t(\theta)-E\left[l_t(\theta)\right]\right)\right|-\delta_1(\varepsilon)<2\delta_2-\delta_1(\varepsilon)\right]\to 1.
\]
From (28), we have
\[
P\left[2\sup_{\theta\in\Theta}\left|\frac{1}{n}\sum_{t=1}^n\left(l_t(\theta)-E\left[l_t(\theta)\right]\right)\right|-\delta_1(\varepsilon)<2\delta_2-\delta_1(\varepsilon)\right]
\le P\left[\sup_{\Theta\backslash N\left(\theta_n^p,\varepsilon\right)}\frac{1}{n}\left[\sum_{t=1}^n l_t(\theta)-\sum_{t=1}^n l_t\left(\theta_n^p\right)\right]<2\delta_2-\delta_1(\varepsilon)\right].
\]
Letting $K_1(\varepsilon)=-\left(2\delta_2-\delta_1(\varepsilon)\right)>0$, we have, for any $\varepsilon$,
\[
\lim_{n\to\infty}P\left[\sup_{\Theta\backslash N\left(\theta_n^p,\varepsilon\right)}\frac{1}{n}\left[\sum_{t=1}^n l_t(\theta)-\sum_{t=1}^n l_t\left(\theta_n^p\right)\right]<-K_1(\varepsilon)\right]=1,
\]
which proves the consistency condition given by (10). The proofs of the other two concentration conditions, (11) and (12), are similar and hence omitted.
Proof of Lemma 3.2: See the technical supplement (Li, et al, 2017).
Proof of Lemma 3.3

From the definition,
\begin{align*}
P_D &= \int -2\left[\ln p\left(\mathbf{y}|\theta\right) - \ln p\left(\mathbf{y}|\bar{\theta}\right)\right] p\left(\theta|\mathbf{y}\right) d\theta
= -2\int\ln p\left(\mathbf{y}|\theta\right) p\left(\theta|\mathbf{y}\right) d\theta + 2\ln p\left(\mathbf{y}|\bar{\theta}\right)\\
&= -2\sum_{t=1}^n\int l_t(\theta)\, p\left(\theta|\mathbf{y}\right) d\theta + 2\ln p\left(\mathbf{y}|\bar{\theta}\right).
\end{align*}
By Lemma 3.2 we have
\[
\int l_t(\theta)\, p(\theta|\mathbf{y})\, d\theta = l_t\left(\hat{\theta}_n\right) + \frac{1}{n}B_{t,1} + \frac{1}{n^2}\left(B_{t,2}-B_{t,3}\right) + O_p\left(\frac{1}{n^3}\right),
\]
where $B_{t,i}$ is defined in Lemma 3.2 for $i=1,2,3$. Note that
\begin{align*}
\frac{1}{n}\sum_{t=1}^n B_{t,1}
&= -\frac{1}{2}\mathrm{tr}\left[H_n\left(\hat{\theta}_n\right)^{-1}\frac{1}{n}\sum_{t=1}^n\nabla^2 l_t\left(\hat{\theta}_n\right)\right]
- \frac{1}{n}\sum_{t=1}^n\nabla l_t\left(\hat{\theta}_n\right)' H_n\left(\hat{\theta}_n\right)^{-1}\frac{\nabla p}{p}\\
&\quad - \frac{1}{2}\mathrm{vec}\left(H_n\left(\hat{\theta}_n\right)^{-1}\right)' H_n^{(3)}\left(\hat{\theta}_n\right) H_n\left(\hat{\theta}_n\right)^{-1}\frac{1}{n}\sum_{t=1}^n\nabla l_t\left(\hat{\theta}_n\right)\\
&= -\frac{1}{2}\mathrm{tr}\left[H_n\left(\hat{\theta}_n\right)^{-1} H_n\left(\hat{\theta}_n\right)\right] = -\frac{1}{2}P,
\end{align*}
since $\sum_{t=1}^n\nabla l_t\left(\hat{\theta}_n\right) = 0$. For the same reason,
with the shorthand $H_n = H_n\left(\hat{\theta}_n\right)$, $H_n^{(3)} = H_n^{(3)}\left(\hat{\theta}_n\right)$ and $\nabla^k l_t = \nabla^k l_t\left(\hat{\theta}_n\right)$,
\begin{align*}
\frac{1}{n}\sum_{t=1}^n B_{t,2}
&= -\frac{5}{16}\mathrm{tr}\left[A_2\right]\mathrm{tr}\left[H_n^{-1}\frac{1}{n}\sum_{t=1}^n\nabla^2 l_t\right]
+ \frac{35}{48}A_1\,\mathrm{tr}\left[H_n^{-1}\frac{1}{n}\sum_{t=1}^n\nabla^2 l_t\right]
- \frac{5}{12}\mathrm{vec}\left(H_n^{-1}\right)' H_n^{(3)} H_n^{-1}\frac{1}{n}\sum_{t=1}^n\nabla^3 l_t'\,\mathrm{vec}\left(H_n^{-1}\right)\\
&\quad + \frac{1}{8}\mathrm{tr}\left[\left[H_n^{-1}\otimes\mathrm{vec}\left(H_n^{-1}\right)\right]'\frac{1}{n}\sum_{t=1}^n\nabla^4 l_t\right]
- \frac{5}{4}\mathrm{vec}\left(H_n^{-1}\right)' H_n^{(3)} H_n^{-1}\frac{1}{n}\sum_{t=1}^n\nabla^2 l_t\, H_n^{-1}\frac{\nabla p}{p}\\
&\quad + \frac{1}{2}\mathrm{vec}\left(H_n^{-1}\right)'\frac{1}{n}\sum_{t=1}^n\nabla^3 l_t\, H_n^{-1}\frac{\nabla p}{p}
+ \frac{3}{4}\mathrm{tr}\left[H_n^{-1}\frac{1}{n}\sum_{t=1}^n\nabla^2 l_t\right]\mathrm{tr}\left[H_n^{-1}\frac{\nabla^2 p}{p}\right]\\
&= -\frac{5}{16}\mathrm{tr}\left[A_2\right] P + \frac{35}{48}A_1 P - \frac{5}{12}A_1 + \frac{1}{8}\mathrm{tr}\left[A_2\right]
- \frac{3}{4}\mathrm{vec}\left(H_n^{-1}\right)' H_n^{(3)} H_n^{-1}\frac{\nabla p}{p}
+ \frac{3}{4}P\,\mathrm{tr}\left[H_n^{-1}\frac{\nabla^2 p}{p}\right]\\
&= \frac{2-5P}{16}\mathrm{tr}\left[A_2\right] + \frac{-20+35P}{48}A_1 - \frac{3}{4}C_{21} + \frac{3P}{4}C_{22} + \frac{3P}{4}C_{23},
\end{align*}
where $C_{21}$, $C_{22}$ and $C_{23}$ are defined in Lemma 3.3. Clearly, the first two terms are not related to the prior while the last three terms are related to the prior. We can also show that
\begin{align*}
\frac{1}{n}\sum_{t=1}^n B_{t,3}
&= B_4\,\frac{1}{n}\sum_{t=1}^n B_{t,1}\\
&= -\frac{P}{2}\left(-\frac{1}{2}\mathrm{tr}\left[H_n\left(\hat{\theta}_n\right)^{-1}\frac{\nabla^2 p}{p}\right]
+ \frac{1}{2}\mathrm{vec}\left(H_n\left(\hat{\theta}_n\right)^{-1}\right)' H_n^{(3)}\left(\hat{\theta}_n\right) H_n\left(\hat{\theta}_n\right)^{-1}\frac{\nabla p}{p}\right)
- \frac{P}{2}\left(-\frac{5}{24}A_1 + \frac{1}{8}\mathrm{tr}\left[A_2\right]\right)\\
&= \frac{P}{4}C_{22} + \frac{P}{4}C_{23} - \frac{P}{4}C_{21} + \frac{P}{2}\left(\frac{5}{24}A_1 - \frac{1}{8}\mathrm{tr}\left[A_2\right]\right).
\end{align*}
Hence,
\begin{align*}
\sum_{t=1}^n\int l_t(\theta)\, p\left(\theta|\mathbf{y}\right) d\theta
&= \sum_{t=1}^n l_t\left(\hat{\theta}_n\right) - \frac{P}{2}
+ \frac{1}{n}\left(\frac{2-5P}{16}\mathrm{tr}\left[A_2\right] + \frac{-20+35P}{48}A_1 - \frac{3}{4}C_{21} + \frac{3P}{4}C_{22} + \frac{3P}{4}C_{23}\right)\\
&\quad + \frac{1}{n}\left[-\frac{P}{4}C_{22} - \frac{P}{4}C_{23} + \frac{P}{4}C_{21} - \frac{P}{2}\left(\frac{5}{24}A_1 - \frac{1}{8}\mathrm{tr}\left[A_2\right]\right)\right] + O_p\left(n^{-2}\right)\\
&= \sum_{t=1}^n l_t\left(\hat{\theta}_n\right) - \frac{P}{2}
+ \frac{1}{n}\left(\frac{1-2P}{8}\mathrm{tr}\left[A_2\right] + \frac{-10+15P}{24}A_1 - \frac{3-P}{4}C_{21} + \frac{P}{2}C_{22} + \frac{P}{2}C_{23}\right) + O_p\left(n^{-2}\right).
\end{align*}
From the stochastic expansion, we have
\[
\hat{\theta}_n = \bar{\theta}_n - \frac{1}{n}H_n\left(\hat{\theta}_n\right)^{-1}\nabla\pi
+ \frac{1}{2n}H_n\left(\hat{\theta}_n\right)^{-1}H_n^{(3)}\left(\hat{\theta}_n\right)'\mathrm{vec}\left(H_n\left(\hat{\theta}_n\right)^{-1}\right) + O_p\left(\frac{1}{n^2}\right),
\]
and
\[
\bar{\theta}_n - \hat{\theta}_n = \frac{1}{n}H_n\left(\hat{\theta}_n\right)^{-1}\nabla\pi
- \frac{1}{2n}H_n\left(\hat{\theta}_n\right)^{-1}H_n^{(3)}\left(\hat{\theta}_n\right)'\mathrm{vec}\left(H_n\left(\hat{\theta}_n\right)^{-1}\right) + O_p\left(\frac{1}{n^2}\right). \tag{30}
\]
By the Taylor expansion, we get
\begin{align*}
\ln p\left(\mathbf{y}|\bar{\theta}_n\right)
&= \ln p\left(\mathbf{y}|\hat{\theta}_n\right)
+ \frac{\partial\ln p\left(\mathbf{y}|\hat{\theta}_n\right)}{\partial\theta'}\left(\bar{\theta}_n-\hat{\theta}_n\right)
+ \frac{1}{2}\left(\bar{\theta}_n-\hat{\theta}_n\right)'\frac{\partial^2\ln p\left(\mathbf{y}|\hat{\theta}_n\right)}{\partial\theta\partial\theta'}\left(\bar{\theta}_n-\hat{\theta}_n\right) + O_p\left(\frac{1}{n^2}\right)\\
&= \ln p\left(\mathbf{y}|\hat{\theta}_n\right)
+ \frac{1}{2n}\nabla\pi' H_n\left(\hat{\theta}_n\right)^{-1} H_n\left(\hat{\theta}_n\right) H_n\left(\hat{\theta}_n\right)^{-1}\nabla\pi\\
&\quad + \frac{1}{8n}\mathrm{vec}\left(H_n\left(\hat{\theta}_n\right)^{-1}\right)'\frac{1}{n}\sum_{t=1}^n\nabla^3 l_t\left(\hat{\theta}_n\right) H_n\left(\hat{\theta}_n\right)^{-1} H_n\left(\hat{\theta}_n\right) H_n\left(\hat{\theta}_n\right)^{-1}\frac{1}{n}\sum_{t=1}^n\nabla^3 l_t\left(\hat{\theta}_n\right)'\mathrm{vec}\left(H_n\left(\hat{\theta}_n\right)^{-1}\right)\\
&\quad - \frac{1}{2n}\nabla\pi' H_n\left(\hat{\theta}_n\right)^{-1} H_n\left(\hat{\theta}_n\right) H_n\left(\hat{\theta}_n\right)^{-1}\frac{1}{n}\sum_{t=1}^n\nabla^3 l_t\left(\hat{\theta}_n\right)'\mathrm{vec}\left(H_n\left(\hat{\theta}_n\right)^{-1}\right) + O_p\left(n^{-2}\right)\\
&= \ln p\left(\mathbf{y}|\hat{\theta}_n\right) - \frac{1}{2n}C_{21} + \frac{1}{2n}C_{23} + \frac{1}{8n}A_1 + O_p\left(n^{-2}\right), \tag{31}
\end{align*}
where the first-order term vanishes because $\partial\ln p\left(\mathbf{y}|\hat{\theta}_n\right)/\partial\theta = 0$.
Hence, from (30) and (31) we have
\begin{align*}
P_D &= -2\sum_{t=1}^n\int l_t(\theta)\, p\left(\theta|\mathbf{y}\right) d\theta + 2\ln p\left(\mathbf{y}|\bar{\theta}_n\right)\\
&= -2\ln p\left(\mathbf{y}|\hat{\theta}_n\right) + P
+ \frac{1}{n}\left(-\frac{1-2P}{4}\mathrm{tr}\left[A_2\right] - \frac{-10+15P}{12}A_1 + \frac{3-P}{2}C_{21} - PC_{22} - PC_{23}\right)
+ 2\ln p\left(\mathbf{y}|\bar{\theta}_n\right) + O_p\left(n^{-2}\right)\\
&= P + \frac{1}{n}\left(-C_{21}+C_{23}+\frac{1}{4}A_1\right)
+ \frac{1}{n}\left(-\frac{1-2P}{4}\mathrm{tr}\left[A_2\right] - \frac{-10+15P}{12}A_1 + \frac{3-P}{2}C_{21} - PC_{22} - PC_{23}\right) + O_p\left(n^{-2}\right)\\
&= P + \frac{1}{n}\left(-\frac{1-2P}{4}\mathrm{tr}\left[A_2\right] - \frac{-13+15P}{12}A_1\right)
+ \frac{1}{n}\left(\frac{1-P}{2}C_{21} - PC_{22} + (1-P)C_{23}\right) + O_p\left(n^{-2}\right)\\
&= P + \frac{1}{n}C_1 + \frac{1}{n}C_2 + O_p\left(n^{-2}\right), \tag{32}
\end{align*}
where $C_1$ and $C_2$ are defined in Lemma 3.3. From the definition of DIC, (31) and (32), we have
\begin{align*}
\mathrm{DIC} &= -4\sum_{t=1}^n\int l_t(\theta)\, p\left(\theta|\mathbf{y}\right) d\theta + 2\ln p\left(\mathbf{y}|\bar{\theta}\right)
= -2\sum_{t=1}^n\int l_t(\theta)\, p\left(\theta|\mathbf{y}\right) d\theta + P_D\\
&= -2\ln p\left(\mathbf{y}|\hat{\theta}_n\right) + P
+ \frac{1}{n}\left(-\frac{1-2P}{4}\mathrm{tr}\left[A_2\right] - \frac{-10+15P}{12}A_1 + \frac{3-P}{2}C_{21} - PC_{22} - PC_{23}\right)\\
&\quad + P + \frac{1}{n}\left(-\frac{1-2P}{4}\mathrm{tr}\left[A_2\right] - \frac{-13+15P}{12}A_1 + \frac{1-P}{2}C_{21} - PC_{22} + (1-P)C_{23}\right) + O_p\left(n^{-2}\right)\\
&= -2\ln p\left(\mathbf{y}|\hat{\theta}_n\right) + 2P
+ \frac{1}{n}\left[-\frac{1-2P}{2}\mathrm{tr}\left[A_2\right] - \frac{-23+30P}{12}A_1 + (2-P)C_{21} - 2PC_{22} + (1-2P)C_{23}\right] + O_p\left(n^{-2}\right)\\
&= \mathrm{AIC} + \frac{1}{n}D_1 + \frac{1}{n}D_2 + O_p\left(n^{-2}\right),
\end{align*}
where $D_1$ and $D_2$ are defined in Lemma 3.3.
Proof of Theorem 3.1

We write $H_n\left(\theta_n^p\right)$ as $H_n$, $B_n\left(\theta_n^p\right)$ as $B_n$, and let $C_n = H_n^{-1} B_n H_n^{-1}$. Under Assumptions 1-10, we can show that
\[
\bar{\theta}_n(\mathbf{y}) = \hat{\theta}_n(\mathbf{y}) + O_p\left(n^{-1}\right) \tag{33}
\]
by (30). Then we have
\[
\bar{\theta}_n(\mathbf{y}) = \theta_n^p + O_p\left(n^{-1/2}\right), \qquad
\frac{1}{\sqrt{n}} B_n^{-1/2}\frac{\partial\ln p\left(\mathbf{y}_{rep}|\theta_n^p\right)}{\partial\theta}\stackrel{d}{\to} N\left(0, I_P\right), \tag{34}
\]
and
\[
C_n^{-1/2}\sqrt{n}\left(\bar{\theta}_n(\mathbf{y})-\theta_n^p\right)\stackrel{d}{\to} N\left(0, I_P\right). \tag{35}
\]
Note that
\begin{align*}
\mathbf{E}_{\mathbf{y}}\mathbf{E}_{\mathbf{y}_{rep}}\left(-2\ln p\left(\mathbf{y}_{rep}|\bar{\theta}_n(\mathbf{y})\right)\right)
&= \underbrace{\left[\mathbf{E}_{\mathbf{y}}\mathbf{E}_{\mathbf{y}_{rep}}\left(-2\ln p\left(\mathbf{y}_{rep}|\bar{\theta}_n\left(\mathbf{y}_{rep}\right)\right)\right)\right]}_{T_1}\\
&\quad + \underbrace{\left[\mathbf{E}_{\mathbf{y}}\mathbf{E}_{\mathbf{y}_{rep}}\left(-2\ln p\left(\mathbf{y}_{rep}|\theta_n^p\right)\right)-\mathbf{E}_{\mathbf{y}}\mathbf{E}_{\mathbf{y}_{rep}}\left(-2\ln p\left(\mathbf{y}_{rep}|\bar{\theta}_n\left(\mathbf{y}_{rep}\right)\right)\right)\right]}_{T_2}\\
&\quad + \underbrace{\left[\mathbf{E}_{\mathbf{y}}\mathbf{E}_{\mathbf{y}_{rep}}\left(-2\ln p\left(\mathbf{y}_{rep}|\bar{\theta}_n(\mathbf{y})\right)\right)-\mathbf{E}_{\mathbf{y}}\mathbf{E}_{\mathbf{y}_{rep}}\left(-2\ln p\left(\mathbf{y}_{rep}|\theta_n^p\right)\right)\right]}_{T_3}.
\end{align*}
Now let us analyze $T_2$ and $T_3$. First expand $\ln p\left(\mathbf{y}_{rep}|\theta_n^p\right)$ around $\bar{\theta}_n\left(\mathbf{y}_{rep}\right)$:
\begin{align*}
\ln p\left(\mathbf{y}_{rep}|\theta_n^p\right)
&= \ln p\left(\mathbf{y}_{rep}|\bar{\theta}_n\left(\mathbf{y}_{rep}\right)\right)
+\frac{\partial\ln p\left(\mathbf{y}_{rep}|\bar{\theta}_n\left(\mathbf{y}_{rep}\right)\right)}{\partial\theta'}\left(\theta_n^p-\bar{\theta}_n\left(\mathbf{y}_{rep}\right)\right)\\
&\quad+\frac{1}{2}\left(\theta_n^p-\bar{\theta}_n\left(\mathbf{y}_{rep}\right)\right)'\frac{\partial^2\ln p\left(\mathbf{y}_{rep}|\bar{\theta}_n\left(\mathbf{y}_{rep}\right)\right)}{\partial\theta\partial\theta'}\left(\theta_n^p-\bar{\theta}_n\left(\mathbf{y}_{rep}\right)\right)+o_p(1)\\
&= \ln p\left(\mathbf{y}_{rep}|\bar{\theta}_n\left(\mathbf{y}_{rep}\right)\right)
+\frac{\partial\ln p\left(\mathbf{y}_{rep}|\hat{\theta}_n\left(\mathbf{y}_{rep}\right)\right)}{\partial\theta'}\left(\theta_n^p-\bar{\theta}_n\left(\mathbf{y}_{rep}\right)\right)\\
&\quad+\left(\bar{\theta}_n\left(\mathbf{y}_{rep}\right)-\hat{\theta}_n\left(\mathbf{y}_{rep}\right)\right)'\frac{\partial^2\ln p\left(\mathbf{y}_{rep}|\bar{\theta}_n\left(\mathbf{y}_{rep}\right)\right)}{\partial\theta\partial\theta'}\left(\theta_n^p-\bar{\theta}_n\left(\mathbf{y}_{rep}\right)\right)\\
&\quad+\frac{1}{2}\left(\theta_n^p-\bar{\theta}_n\left(\mathbf{y}_{rep}\right)\right)'\frac{\partial^2\ln p\left(\mathbf{y}_{rep}|\bar{\theta}_n\left(\mathbf{y}_{rep}\right)\right)}{\partial\theta\partial\theta'}\left(\theta_n^p-\bar{\theta}_n\left(\mathbf{y}_{rep}\right)\right)+o_p(1)\\
&= \ln p\left(\mathbf{y}_{rep}|\bar{\theta}_n\left(\mathbf{y}_{rep}\right)\right)
+\frac{1}{2}\left(\theta_n^p-\bar{\theta}_n\left(\mathbf{y}_{rep}\right)\right)'\frac{\partial^2\ln p\left(\mathbf{y}_{rep}|\bar{\theta}_n\left(\mathbf{y}_{rep}\right)\right)}{\partial\theta\partial\theta'}\left(\theta_n^p-\bar{\theta}_n\left(\mathbf{y}_{rep}\right)\right)+o_p(1),
\end{align*}
from (33) and the first-order condition $\partial\ln p\left(\mathbf{y}_{rep}|\hat{\theta}_n\left(\mathbf{y}_{rep}\right)\right)/\partial\theta=0$. Then we have
\begin{align*}
T_2 &= \mathbf{E}_{\mathbf{y}}\mathbf{E}_{\mathbf{y}_{rep}}\left[-2\ln p\left(\mathbf{y}_{rep}|\theta_n^p\right)+2\ln p\left(\mathbf{y}_{rep}|\bar{\theta}_n\left(\mathbf{y}_{rep}\right)\right)\right]\\
&= \mathbf{E}_{\mathbf{y}}\mathbf{E}_{\mathbf{y}_{rep}}\left[-\left(\bar{\theta}_n\left(\mathbf{y}_{rep}\right)-\theta_n^p\right)'\frac{\partial^2\ln p\left(\mathbf{y}_{rep}|\bar{\theta}_n\left(\mathbf{y}_{rep}\right)\right)}{\partial\theta\partial\theta'}\left(\bar{\theta}_n\left(\mathbf{y}_{rep}\right)-\theta_n^p\right)+o_p(1)\right]\\
&= \mathbf{E}_{\mathbf{y}_{rep}}\left[-\left(\bar{\theta}_n\left(\mathbf{y}_{rep}\right)-\theta_n^p\right)'\frac{\partial^2\ln p\left(\mathbf{y}_{rep}|\bar{\theta}_n\left(\mathbf{y}_{rep}\right)\right)}{\partial\theta\partial\theta'}\left(\bar{\theta}_n\left(\mathbf{y}_{rep}\right)-\theta_n^p\right)\right]+o(1)\\
&= \mathbf{E}_{\mathbf{y}}\left[-\left(\bar{\theta}_n(\mathbf{y})-\theta_n^p\right)'\frac{\partial^2\ln p\left(\mathbf{y}|\bar{\theta}_n(\mathbf{y})\right)}{\partial\theta\partial\theta'}\left(\bar{\theta}_n(\mathbf{y})-\theta_n^p\right)\right]+o(1),
\end{align*}
by Assumptions 5-6 and the dominated convergence theorem (Chung, 2001; DasGupta, 2008). Next, we expand $\ln p\left(\mathbf{y}_{rep}|\bar{\theta}_n(\mathbf{y})\right)$ around $\theta_n^p$:
\[
\ln p\left(\mathbf{y}_{rep}|\bar{\theta}_n(\mathbf{y})\right)
= \ln p\left(\mathbf{y}_{rep}|\theta_n^p\right)
+\frac{\partial\ln p\left(\mathbf{y}_{rep}|\theta_n^p\right)}{\partial\theta'}\left(\bar{\theta}_n(\mathbf{y})-\theta_n^p\right)
+\frac{1}{2}\left(\bar{\theta}_n(\mathbf{y})-\theta_n^p\right)'\frac{\partial^2\ln p\left(\mathbf{y}_{rep}|\theta_n^p\right)}{\partial\theta\partial\theta'}\left(\bar{\theta}_n(\mathbf{y})-\theta_n^p\right)+o_p(1).
\]
Substituting this expansion into $T_3$, we have
\begin{align*}
T_3 &= \mathbf{E}_{\mathbf{y}}\mathbf{E}_{\mathbf{y}_{rep}}\left(-2\ln p\left(\mathbf{y}_{rep}|\bar{\theta}_n(\mathbf{y})\right)\right)-\mathbf{E}_{\mathbf{y}}\mathbf{E}_{\mathbf{y}_{rep}}\left(-2\ln p\left(\mathbf{y}_{rep}|\theta_n^p\right)\right)\\
&= \mathbf{E}_{\mathbf{y}}\mathbf{E}_{\mathbf{y}_{rep}}\left[-2\frac{\partial\ln p\left(\mathbf{y}_{rep}|\theta_n^p\right)}{\partial\theta'}\left(\bar{\theta}_n(\mathbf{y})-\theta_n^p\right)
-\left(\bar{\theta}_n(\mathbf{y})-\theta_n^p\right)'\frac{\partial^2\ln p\left(\mathbf{y}_{rep}|\theta_n^p\right)}{\partial\theta\partial\theta'}\left(\bar{\theta}_n(\mathbf{y})-\theta_n^p\right)+o_p(1)\right]\\
&= -2\mathbf{E}_{\mathbf{y}_{rep}}\left(\frac{\partial\ln p\left(\mathbf{y}_{rep}|\theta_n^p\right)}{\partial\theta'}\right)\mathbf{E}_{\mathbf{y}}\left[\bar{\theta}_n(\mathbf{y})-\theta_n^p\right]
+\mathbf{E}_{\mathbf{y}}\left[-\left(\bar{\theta}_n(\mathbf{y})-\theta_n^p\right)'\mathbf{E}_{\mathbf{y}_{rep}}\left(\frac{\partial^2\ln p\left(\mathbf{y}_{rep}|\theta_n^p\right)}{\partial\theta\partial\theta'}\right)\left(\bar{\theta}_n(\mathbf{y})-\theta_n^p\right)\right]+o(1)\\
&= \mathbf{E}_{\mathbf{y}}\left[-\sqrt{n}\left(\bar{\theta}_n(\mathbf{y})-\theta_n^p\right)'\mathbf{E}_{\mathbf{y}}\left(\frac{1}{n}\frac{\partial^2\ln p\left(\mathbf{y}|\theta_n^p\right)}{\partial\theta\partial\theta'}\right)\sqrt{n}\left(\bar{\theta}_n(\mathbf{y})-\theta_n^p\right)\right]+o(1),
\end{align*}
since
\[
\mathbf{E}_{\mathbf{y}}\mathbf{E}_{\mathbf{y}_{rep}}\left[-2\frac{\partial\ln p\left(\mathbf{y}_{rep}|\theta_n^p\right)}{\partial\theta'}\left(\bar{\theta}_n(\mathbf{y})-\theta_n^p\right)\right]
= \mathbf{E}_{\mathbf{y}_{rep}}\left[-2\frac{\partial\ln p\left(\mathbf{y}_{rep}|\theta_n^p\right)}{\partial\theta'}\right]\mathbf{E}_{\mathbf{y}}\left[\bar{\theta}_n(\mathbf{y})-\theta_n^p\right]=0
\]
by (34), (35), and the dominated convergence theorem.

Note that
\[
\frac{1}{n}\frac{\partial^2\ln p\left(\mathbf{y}|\bar{\theta}_n(\mathbf{y})\right)}{\partial\theta\partial\theta'}
= \mathbf{E}_{\mathbf{y}}\left(\frac{1}{n}\frac{\partial^2\ln p\left(\mathbf{y}|\theta_n^p\right)}{\partial\theta\partial\theta'}\right)+o_p(1)
\]
under Assumptions 1-10 and by the uniform law of large numbers. Hence, we get
\begin{align*}
T_2 &= \mathbf{E}_{\mathbf{y}}\left[-\left(\bar{\theta}_n(\mathbf{y})-\theta_n^p\right)'\frac{\partial^2\ln p\left(\mathbf{y}|\bar{\theta}_n(\mathbf{y})\right)}{\partial\theta\partial\theta'}\left(\bar{\theta}_n(\mathbf{y})-\theta_n^p\right)\right]+o(1)\\
&= \mathbf{E}_{\mathbf{y}}\left[-\sqrt{n}\left(\bar{\theta}_n(\mathbf{y})-\theta_n^p\right)'\mathbf{E}_{\mathbf{y}}\left(\frac{1}{n}\frac{\partial^2\ln p\left(\mathbf{y}|\theta_n^p\right)}{\partial\theta\partial\theta'}\right)\sqrt{n}\left(\bar{\theta}_n(\mathbf{y})-\theta_n^p\right)+o_p(1)\right]+o(1)
= T_3+o(1).
\end{align*}
So we only need to analyze $T_3$. Note that
\begin{align*}
T_3 &= \mathbf{E}_{\mathbf{y}}\left[\sqrt{n}\left(\bar{\theta}_n(\mathbf{y})-\theta_n^p\right)'\left(-H_n\right)\sqrt{n}\left(\bar{\theta}_n(\mathbf{y})-\theta_n^p\right)\right]+o(1)\\
&= \mathbf{E}_{\mathbf{y}}\left[\left(C_n^{-1/2}\sqrt{n}\left(\bar{\theta}_n(\mathbf{y})-\theta_n^p\right)\right)' C_n^{1/2}\left(-H_n\right) C_n^{1/2}\, C_n^{-1/2}\sqrt{n}\left(\bar{\theta}_n(\mathbf{y})-\theta_n^p\right)\right]+o(1)\\
&= \mathrm{tr}\left\{\left(-H_n\right) C_n^{1/2}\,\mathbf{E}_{\mathbf{y}}\left[C_n^{-1/2}\sqrt{n}\left(\bar{\theta}_n(\mathbf{y})-\theta_n^p\right)\sqrt{n}\left(\bar{\theta}_n(\mathbf{y})-\theta_n^p\right)' C_n^{-1/2}\right] C_n^{1/2}\right\}+o(1),
\end{align*}
and
\[
\mathbf{E}_{\mathbf{y}}\left[C_n^{-1/2}\sqrt{n}\left(\bar{\theta}_n(\mathbf{y})-\theta_n^p\right)\sqrt{n}\left(\bar{\theta}_n(\mathbf{y})-\theta_n^p\right)' C_n^{-1/2}\right] = I_P+o(1).
\]
Hence,
\[
T_3 = \mathrm{tr}\left(\left(-H_n\right) C_n^{1/2} C_n^{1/2}\right)+o(1) = \mathrm{tr}\left(\left(-H_n\right) C_n\right)+o(1)
= \mathrm{tr}\left(\left(-H_n\right)\left(-H_n\right)^{-1} B_n\left(-H_n\right)^{-1}\right)+o(1)
= \mathrm{tr}\left(B_n\left(-H_n\right)^{-1}\right)+o(1),
\]
and
\begin{align*}
\mathbf{E}_{\mathbf{y}}\left[\mathbf{E}_{\mathbf{y}_{rep}}\left(-2\ln p\left(\mathbf{y}_{rep}|\bar{\theta}_n(\mathbf{y})\right)\right)\right]
&= \mathbf{E}_{\mathbf{y}}\left[\mathbf{E}_{\mathbf{y}_{rep}}\left(-2\ln p\left(\mathbf{y}_{rep}|\bar{\theta}_n\left(\mathbf{y}_{rep}\right)\right)\right)\right]+2\mathrm{tr}\left(B_n\left(-H_n\right)^{-1}\right)+o(1)\\
&= \mathbf{E}_{\mathbf{y}}\left(-2\ln p\left(\mathbf{y}|\bar{\theta}_n(\mathbf{y})\right)\right)+2\mathrm{tr}\left(B_n\left(-H_n\right)^{-1}\right)+o(1)\\
&= \mathbf{E}_{\mathbf{y}}\left(-2\ln p\left(\mathbf{y}|\bar{\theta}_n(\mathbf{y})\right)+2P\right)+o(1),
\end{align*}
under the condition that $B_n+H_n = o(1)$.

Following Lemma 3.3, we get $P_D = P+o_p(1)$. Finally, we have
\begin{align*}
\mathbf{E}_{\mathbf{y}}\mathbf{E}_{\mathbf{y}_{rep}}\left[-2\ln p\left(\mathbf{y}_{rep}|\bar{\theta}_n(\mathbf{y})\right)\right]
&= \mathbf{E}_{\mathbf{y}}\left[-2\ln p\left(\mathbf{y}|\bar{\theta}_n\right)+2P+o_p(1)\right]\\
&= \mathbf{E}_{\mathbf{y}}\left[D\left(\bar{\theta}_n\right)+2P_D+o_p(1)\right]
= \mathbf{E}_{\mathbf{y}}\left[\mathrm{DIC}+o_p(1)\right] = \mathbf{E}_{\mathbf{y}}\left[\mathrm{DIC}\right]+o(1),
\end{align*}
by Assumption 10 and the dominated convergence theorem.
Proof of Theorem 4.1

Denote
\[
\tilde{\theta}_n := \arg\max_{\theta}\left\{\ln p\left(\mathbf{y}_{rep}|\theta\right) + \ln p\left(\mathbf{y}|\theta\right) + \ln p\left(\theta\right)\right\}.
\]
Then we have the following three lemmas under the condition that $\mathbf{y}$ and $\mathbf{y}_{rep}$ are independent. These three lemmas are useful for proving Theorem 4.1.

Lemma 5.1 Under Assumptions 1-7 and 10, $\tilde{\theta}_n\stackrel{p}{\to}\theta_n^p$.

Proof. The proof follows the argument of Theorem 4.2 in Wooldridge (1994) and Bester and Hansen (2006). Let
\[
Q_n(\theta) = n^{-1}\sum_{t=1}^n l_t\left(y_t,\theta\right) + n^{-1}\sum_{t=1}^n l_t\left(y_{rep,t},\theta\right) + n^{-1}\ln p(\theta)
\]
and $\bar{Q}_n(\theta) = n^{-1}E\left[\ln p\left(\mathbf{y}_{rep}|\theta\right)+\ln p\left(\mathbf{y}|\theta\right)\right]$. Then we need to show that, for each $\varepsilon>0$,
\[
P\left[\sup_{\theta\in\Theta}\left|Q_n(\theta)-\bar{Q}_n(\theta)\right|>\varepsilon\right]\to 0.
\]
Let $\delta>0$ be a number to be set later. Because $\Theta$ is compact, there exists a finite number of spheres of radius $\delta$ about $\theta_j$, say $\zeta_\delta\left(\theta_j\right)=\left\{\theta\in\Theta: \left\|\theta-\theta_j\right\|\le\delta\right\}$, $j=1,\ldots,K(\delta)$, which cover $\Theta$ (Gallant and White, 1988). Set $\zeta_j=\zeta_\delta\left(\theta_j\right)$, $K=K(\delta)$. It follows that
\[
P\left[\sup_{\theta\in\Theta}\left|Q_n(\theta)-\bar{Q}_n(\theta)\right|>\varepsilon\right]
\le P\left[\max_{1\le j\le K}\sup_{\theta\in\zeta_j}\left|Q_n(\theta)-\bar{Q}_n(\theta)\right|>\varepsilon\right]
\le \sum_{j=1}^K P\left[\sup_{\theta\in\zeta_j}\left|Q_n(\theta)-\bar{Q}_n(\theta)\right|>\varepsilon\right].
\]
For all $\theta\in\zeta_j$,
\begin{align*}
\left|Q_n(\theta)-\bar{Q}_n(\theta)\right|
&\le \left|Q_n(\theta)-Q_n\left(\theta_j\right)\right|+\left|Q_n\left(\theta_j\right)-\bar{Q}_n\left(\theta_j\right)\right|+\left|\bar{Q}_n\left(\theta_j\right)-\bar{Q}_n(\theta)\right|\\
&\le \frac{1}{n}\sum_{t=1}^n\left|l_t\left(y_t,\theta\right)-l_t\left(y_t,\theta_j\right)\right|
+\frac{1}{n}\sum_{t=1}^n\left|l_t\left(y_{rep,t},\theta\right)-l_t\left(y_{rep,t},\theta_j\right)\right|\\
&\quad +\left|\frac{1}{n}\sum_{t=1}^n\left(l_t\left(y_t,\theta_j\right)-E\left[l_t\left(\theta_j\right)\right]\right)\right|
+\left|\frac{1}{n}\sum_{t=1}^n\left(l_t\left(y_{rep,t},\theta_j\right)-E\left[l_t\left(\theta_j\right)\right]\right)\right|\\
&\quad +\frac{2}{n}\sum_{t=1}^n\left|E\left[l_t(\theta)\right]-E\left[l_t\left(\theta_j\right)\right]\right|
+\left|\frac{\ln p(\theta)}{n}\right|,
\end{align*}
where $E\left[l_t(\theta)\right]:=E\left[l_t\left(y_t,\theta\right)\right]=E\left[l_t\left(y_{rep,t},\theta\right)\right]$. By Assumption 4, for all $\theta\in\zeta_j$,
\[
\left|l_t\left(y_t,\theta\right)-l_t\left(y_t,\theta_j\right)\right|\le c_t\left(y_t\right)\left\|\theta-\theta_j\right\|\le\delta c_t\left(y_t\right)
\]
and
\[
\left|E\left[l_t(\theta)\right]-E\left[l_t\left(\theta_j\right)\right]\right|\le\delta\bar{c}_t,
\]
where $\bar{c}_t=E\left[c_t\left(y_t\right)\right]=E\left[c_t\left(y_{rep,t}\right)\right]$. Similarly, $\left|l_t\left(y_{rep,t},\theta\right)-l_t\left(y_{rep,t},\theta_j\right)\right|\le\delta c_t\left(y_{rep,t}\right)$. Thus, we have
\begin{align*}
\sup_{\theta\in\zeta_j}\left|Q_n(\theta)-\bar{Q}_n(\theta)\right|
&\le \frac{\delta}{n}\sum_{t=1}^n c_t\left(y_t\right)
+\left|\frac{1}{n}\sum_{t=1}^n\left(l_t\left(y_t,\theta_j\right)-E\left[l_t\left(\theta_j\right)\right]\right)\right|
+\frac{\delta}{n}\sum_{t=1}^n\bar{c}_t
+\frac{\delta}{n}\sum_{t=1}^n c_t\left(y_{rep,t}\right)\\
&\quad+\left|\frac{1}{n}\sum_{t=1}^n\left(l_t\left(y_{rep,t},\theta_j\right)-E\left[l_t\left(\theta_j\right)\right]\right)\right|
+\frac{\delta}{n}\sum_{t=1}^n\bar{c}_t
+\left|\frac{\ln p(\theta)}{n}\right|\\
&\le 4\delta\bar{C}
+\delta\left|\frac{1}{n}\sum_{t=1}^n\left(c_t\left(y_t\right)-\bar{c}_t\right)\right|
+\left|\frac{1}{n}\sum_{t=1}^n\left(l_t\left(y_t,\theta_j\right)-E\left[l_t\left(\theta_j\right)\right]\right)\right|\\
&\quad+\delta\left|\frac{1}{n}\sum_{t=1}^n\left(c_t\left(y_{rep,t}\right)-\bar{c}_t\right)\right|
+\left|\frac{1}{n}\sum_{t=1}^n\left(l_t\left(y_{rep,t},\theta_j\right)-E\left[l_t\left(\theta_j\right)\right]\right)\right|
+\left|\frac{\ln p(\theta)}{n}\right|,
\end{align*}
where $n^{-1}\sum_{t=1}^n\bar{c}_t\le\bar{C}<\infty$ by Assumption 4. Write
\[
S_{n,j} = \delta\left|\frac{1}{n}\sum_{t=1}^n\left(c_t\left(y_t\right)-\bar{c}_t\right)\right|
+\left|\frac{1}{n}\sum_{t=1}^n\left(l_t\left(y_t,\theta_j\right)-E\left[l_t\left(\theta_j\right)\right]\right)\right|
+\delta\left|\frac{1}{n}\sum_{t=1}^n\left(c_t\left(y_{rep,t}\right)-\bar{c}_t\right)\right|
+\left|\frac{1}{n}\sum_{t=1}^n\left(l_t\left(y_{rep,t},\theta_j\right)-E\left[l_t\left(\theta_j\right)\right]\right)\right|
+\sup_{\theta\in\Theta}\left|\frac{\ln p(\theta)}{n}\right|,
\]
so that $P\left[\sup_{\theta\in\zeta_j}\left|Q_n(\theta)-\bar{Q}_n(\theta)\right|>\varepsilon\right]\le P\left[S_{n,j}>\varepsilon-4\delta\bar{C}\right]$. Now choose $\delta$ such that $0<\delta<\varepsilon/\left(8\bar{C}\right)$ (so that $\varepsilon-4\delta\bar{C}>\varepsilon/2$). Then
\[
P\left[\sup_{\theta\in\zeta_j}\left|Q_n(\theta)-\bar{Q}_n(\theta)\right|>\varepsilon\right]\le P\left[S_{n,j}>\varepsilon/2\right].
\]
Next, choose $n_0$ so that $P\left[S_{n,j}>\varepsilon/2\right]\le\varepsilon/K$ for all $n\ge n_0$ and all $j=1,\ldots,K$ by Assumptions 1-10, since $K$ is finite. Hence,
\[
P\left[\sup_{\theta\in\Theta}\left|Q_n(\theta)-\bar{Q}_n(\theta)\right|>\varepsilon\right]\to 0.
\]
It then follows that $Q_n(\theta)$ satisfies a uniform law of large numbers, and the consistency of $\tilde{\theta}_n$ follows by the usual argument.
Lemma 5.2 Under Assumptions 1-10, $\left(-2H_n\right)^{-1/2}\sqrt{n}\left(\tilde{\theta}_n-\theta_n^p\right)\stackrel{d}{\to}N\left(0,I_P\right)$.

Proof. The proof follows Bester and Hansen (2006). By Lemma 5.1, we have
\begin{align*}
0 &= \frac{1}{n}\sum_{t=1}^n\left[\nabla l_t\left(y_t,\tilde{\theta}_n\right)+\nabla l_t\left(y_{rep,t},\tilde{\theta}_n\right)\right]+\frac{1}{n}\nabla\ln p\left(\tilde{\theta}_n\right)\\
&= \frac{1}{n}\sum_{t=1}^n\left[\nabla l_t\left(y_t,\theta_n^p\right)+\nabla l_t\left(y_{rep,t},\theta_n^p\right)\right]
+\frac{1}{n}\sum_{t=1}^n\left[\nabla^2 l_t\left(y_t,\dot{\theta}_{n1}\right)+\nabla^2 l_t\left(y_{rep,t},\dot{\theta}_{n1}\right)\right]\left(\tilde{\theta}_n-\theta_n^p\right)
+\frac{\nabla\ln p\left(\tilde{\theta}_n\right)}{n},
\end{align*}
where $\dot{\theta}_{n1}$ is an intermediate value between $\tilde{\theta}_n$ and $\theta_n^p$. It follows that
\[
\sqrt{n}\left(\tilde{\theta}_n-\theta_n^p\right)
=\left(-n^{-1}\sum_{t=1}^n\left[\nabla^2 l_t\left(y_t,\dot{\theta}_{n1}\right)+\nabla^2 l_t\left(y_{rep,t},\dot{\theta}_{n1}\right)\right]\right)^{-1}
\left(n^{-1/2}\sum_{t=1}^n\left[\nabla l_t\left(y_t,\theta_n^p\right)+\nabla l_t\left(y_{rep,t},\theta_n^p\right)\right]+n^{-1/2}\nabla\ln p\left(\tilde{\theta}_n\right)\right).
\]
Under the assumptions, we have
\[
n^{-1/2}\nabla\ln p\left(\tilde{\theta}_n\right)=o_p(1),\qquad
-n^{-1}\sum_{t=1}^n\nabla^2 l_t\left(y_t,\dot{\theta}_{n1}\right)\stackrel{p}{\to}-H_n,\qquad
-n^{-1}\sum_{t=1}^n\nabla^2 l_t\left(y_{rep,t},\dot{\theta}_{n1}\right)\stackrel{p}{\to}-H_n,
\]
\[
\left[-H_n\right]^{-1/2}n^{-1/2}\sum_{t=1}^n\nabla l_t\left(y_t,\theta_n^p\right)\stackrel{d}{\to}N\left(0,I_P\right),\qquad
\left[-H_n\right]^{-1/2}n^{-1/2}\sum_{t=1}^n\nabla l_t\left(y_{rep,t},\theta_n^p\right)\stackrel{d}{\to}N\left(0,I_P\right).
\]
Note that $\mathrm{Var}\left(n^{-1/2}\sum_{t=1}^n\nabla l_t\left(y_t,\theta_n^p\right)\right)\to-H_n$ as $n\to\infty$. By the central limit theorem and the Cramer-Wold device, we get
\[
\left(-2H_n\right)^{-1/2}\sqrt{n}\left(\tilde{\theta}_n-\theta_n^p\right)\stackrel{d}{\to}N\left(0,I_P\right)
\quad\text{or}\quad
\sqrt{n/2}\left(-H_n\right)^{-1/2}\left(\tilde{\theta}_n-\theta_n^p\right)\stackrel{d}{\to}N\left(0,I_P\right).
\]
Lemma 5.3 Under Assumptions 1-10, the asymptotic joint distribution of $\sqrt{n}\left(\tilde{\theta}_n-\theta_n^p\right)$ and $\sqrt{n}\left(\overleftrightarrow{\theta}_n-\theta_n^p\right)$ is
\[
\left[\begin{array}{cc}
\left(-2H_n\right)^{-1} & \left(-2H_n\right)^{-1}\\
\left(-2H_n\right)^{-1} & \left(-H_n\right)^{-1}
\end{array}\right]^{-1/2}
\left[\begin{array}{c}
\sqrt{n}\left(\tilde{\theta}_n-\theta_n^p\right)\\
\sqrt{n}\left(\overleftrightarrow{\theta}_n-\theta_n^p\right)
\end{array}\right]\stackrel{d}{\to}N\left(0,I_{2P}\right).
\]
Proof. By Lemma 5.2, we have
\[
\sqrt{n}\left(\tilde{\theta}_n-\theta_n^p\right)
=\left(-n^{-1}\sum_{t=1}^n\left[\nabla^2 l_t\left(y_t,\dot{\theta}_{n1}\right)+\nabla^2 l_t\left(y_{rep,t},\dot{\theta}_{n1}\right)\right]\right)^{-1}
\left(n^{-1/2}\sum_{t=1}^n\left[\nabla l_t\left(y_t,\theta_n^p\right)+\nabla l_t\left(y_{rep,t},\theta_n^p\right)\right]+n^{-1/2}\nabla\ln p\left(\tilde{\theta}_n\right)\right),
\]
and
\[
\sqrt{n}\left(\overleftrightarrow{\theta}_n-\theta_n^p\right)
=\left(-n^{-1}\sum_{t=1}^n\nabla^2 l_t\left(y_t,\dot{\theta}_{n2}\right)\right)^{-1}
\left(n^{-1/2}\sum_{t=1}^n\nabla l_t\left(y_t,\theta_n^p\right)+n^{-1/2}\nabla\ln p\left(\overleftrightarrow{\theta}_n\right)\right),
\]
where $\dot{\theta}_{n2}$ is an intermediate value between $\overleftrightarrow{\theta}_n$ and $\theta_n^p$. Hence, we have
\begin{align*}
\mathrm{Cov}\left(\sqrt{n}\left(\tilde{\theta}_n-\theta_n^p\right),\sqrt{n}\left(\overleftrightarrow{\theta}_n-\theta_n^p\right)\right)
&= E\left(\sqrt{n}\left(\tilde{\theta}_n-\theta_n^p\right)\sqrt{n}\left(\overleftrightarrow{\theta}_n-\theta_n^p\right)'\right)\\
&= E\left[\left(-n^{-1}\sum_{t=1}^n\left[\nabla^2 l_t\left(y_t,\theta_n^p\right)+\nabla^2 l_t\left(y_{rep,t},\theta_n^p\right)\right]\right)^{-1}
n^{-1/2}\sum_{t=1}^n\nabla l_t\left(y_t,\theta_n^p\right)\right.\\
&\qquad\left.\times\left[n^{-1/2}\sum_{t=1}^n\nabla l_t\left(y_t,\theta_n^p\right)\right]'\left(-n^{-1}\sum_{t=1}^n\nabla^2 l_t\left(y_t,\theta_n^p\right)\right)^{-1}\right]+o_p(1)\\
&= \left(-2H_n\right)^{-1}\left(-H_n\right)\left(-H_n\right)^{-1}+o_p(1)
= \left(-2H_n\right)^{-1}+o_p(1).
\end{align*}
Then the stated joint distribution follows.
We are now in a position to prove Theorem 4.1. Under Assumptions 1-10, it can be shown that
\[
\mathbf{E}_{\mathbf{y}}\mathbf{E}_{\mathbf{y}_{rep}}\left(-2\ln p\left(\mathbf{y}_{rep}|\mathbf{y}\right)\right)
=\mathbf{E}_{\mathbf{y}}\left[-2\ln p\left(\mathbf{y}|\overleftrightarrow{\theta}_n\right)+(1+\ln 2)P\right]+o(1).
\]
By the Laplace approximation (Tierney et al., 1989; Kass et al., 1990) and Lemma 5.2, we have
\begin{align*}
p\left(\mathbf{y}_{rep}|\mathbf{y}\right)
&=\int p\left(\mathbf{y}_{rep}|\theta\right)p\left(\theta|\mathbf{y}\right)d\theta
=\frac{\int p\left(\mathbf{y}_{rep}|\theta\right)p\left(\mathbf{y}|\theta\right)p\left(\theta\right)d\theta}{\int p\left(\mathbf{y}|\theta\right)p\left(\theta\right)d\theta}
=\frac{\int\exp\left(-nh_N\left(\theta\right)\right)d\theta}{\int\exp\left(-nh_D\left(\theta\right)\right)d\theta}\\
&=\frac{\left|\nabla^2 h_N\left(\tilde{\theta}_n\right)\right|^{-1/2}\exp\left(-nh_N\left(\tilde{\theta}_n\right)\right)}
{\left|\nabla^2 h_D\left(\overleftrightarrow{\theta}_n\right)\right|^{-1/2}\exp\left(-nh_D\left(\overleftrightarrow{\theta}_n\right)\right)}
\left(1+O_p\left(\frac{1}{n}\right)\right)+O_p\left(\frac{1}{n^2}\right),
\end{align*}
where
\[
h_N\left(\theta\right)=-\frac{1}{n}\left(\ln p\left(\mathbf{y}_{rep}|\theta\right)+\ln p\left(\mathbf{y}|\theta\right)+\ln p\left(\theta\right)\right),\qquad
h_D\left(\theta\right)=-\frac{1}{n}\left(\ln p\left(\mathbf{y}|\theta\right)+\ln p\left(\theta\right)\right).
\]
Note that
\[
\ln\frac{\left|\nabla^2 h_N\left(\tilde{\theta}_n\right)\right|^{-1/2}\exp\left(-nh_N\left(\tilde{\theta}_n\right)\right)}
{\left|\nabla^2 h_D\left(\overleftrightarrow{\theta}_n\right)\right|^{-1/2}\exp\left(-nh_D\left(\overleftrightarrow{\theta}_n\right)\right)}
=-\frac{1}{2}\left(\ln\left|\nabla^2 h_N\left(\tilde{\theta}_n\right)\right|-\ln\left|\nabla^2 h_D\left(\overleftrightarrow{\theta}_n\right)\right|\right)
+\left[-nh_N\left(\tilde{\theta}_n\right)+nh_D\left(\overleftrightarrow{\theta}_n\right)\right].
\]
The first term is
\begin{align*}
-\frac{1}{2}\left(\ln\left|\nabla^2 h_N\left(\tilde{\theta}_n\right)\right|-\ln\left|\nabla^2 h_D\left(\overleftrightarrow{\theta}_n\right)\right|\right)
&=-\frac{1}{2}\ln\left|-\frac{1}{n}\frac{\partial^2\ln p\left(\mathbf{y}_{rep}|\tilde{\theta}_n\right)}{\partial\theta\partial\theta'}
-\frac{1}{n}\frac{\partial^2\ln p\left(\mathbf{y}|\tilde{\theta}_n\right)}{\partial\theta\partial\theta'}
-\frac{1}{n}\frac{\partial^2\ln p\left(\tilde{\theta}_n\right)}{\partial\theta\partial\theta'}\right|\\
&\quad+\frac{1}{2}\ln\left|-\frac{1}{n}\frac{\partial^2\ln p\left(\mathbf{y}|\overleftrightarrow{\theta}_n\right)}{\partial\theta\partial\theta'}
-\frac{1}{n}\frac{\partial^2\ln p\left(\overleftrightarrow{\theta}_n\right)}{\partial\theta\partial\theta'}\right|\\
&=-\frac{1}{2}\ln\left|-H_n-H_n\right|+\frac{1}{2}\ln\left|-H_n\right|+o_p(1)\\
&=-\frac{1}{2}\ln\left(2^P\left|-H_n\right|\right)+\frac{1}{2}\ln\left|-H_n\right|+o_p(1)
=-\frac{1}{2}P\ln 2+o_p(1). \tag{36}
\end{align*}
Here we can see how $\ln 2$ shows up in the penalty term.

The second term is
\begin{align*}
-nh_N\left(\tilde{\theta}_n\right)+nh_D\left(\overleftrightarrow{\theta}_n\right)
&=\ln p\left(\mathbf{y}_{rep}|\tilde{\theta}_n\right)+\ln p\left(\mathbf{y}|\tilde{\theta}_n\right)+\ln p\left(\tilde{\theta}_n\right)
-\ln p\left(\mathbf{y}|\overleftrightarrow{\theta}_n\right)-\ln p\left(\overleftrightarrow{\theta}_n\right)\\
&=\ln p\left(\mathbf{y}_{rep}|\tilde{\theta}_n\right)+\ln p\left(\mathbf{y}|\tilde{\theta}_n\right)-\ln p\left(\mathbf{y}|\overleftrightarrow{\theta}_n\right)+o_p(1)\\
&=\ln p\left(\mathbf{y}_{rep}|\overleftrightarrow{\theta}_n\right)+E_1+E_2+o_p(1), \tag{37}
\end{align*}
where
\[
E_1=\ln p\left(\mathbf{y}_{rep}|\tilde{\theta}_n\right)-\ln p\left(\mathbf{y}_{rep}|\overleftrightarrow{\theta}_n\right),\qquad
E_2=\ln p\left(\mathbf{y}|\tilde{\theta}_n\right)-\ln p\left(\mathbf{y}|\overleftrightarrow{\theta}_n\right).
\]
We can further decompose E1 as
E1 = E11 + E12,
where
E11 = ln p(yrep|θn
)− ln p (yrep|θpn) , E12 = ln p (yrep|θpn)− ln p
(yrep|
←→θ n
).
For E11, we have
E11 = ln p(yrep|θn
)− ln p (yrep|θpn)
=1√n
∂ ln p (yrep|θpn)
∂θ′√n(θn − θpn
)+
1
2
√n(θn − θpn
)′ 1
n
∂2 ln p (yrep|θpn)
∂θ∂θ′√n(θn − θpn
)+ op (1) .
Following Assumption 1-10 and Lemma 5.3, we can similarly prove that
1√n
∂ ln p (yrep|θpn)
∂θ′√n(θn − θpn
)=√n(←→θ (yrep)− θpn
)′(−n−1
n∑t=1
52lt(ytrep,θ
pn
))√n(θn − θpn
)+ op (1)
=√n(←→θ (yrep)− θpn
)′(−Hn)
√n(θn − θpn
)+ op (1)
= tr
[(−Hn)
√n(θn − θpn
)√n(←→θ (yrep)− θpn
)′]+ op (1)
=√n(←→θ (yrep)− θpn
)′(−Hn)1/2 (−Hn)−1/2 (−Hn) (−2Hn)−1/2
× (−2Hn)1/2√n(θn − θpn
)+ op (1)
= 2−1/2[(−Hn)1/2
√n(←→θ (yrep)− θpn
)]′(−2Hn)1/2
√n(θn − θpn
)+ op (1) .
where [(−Hn)1/2
√n(←→θ n (yrep)− θpn
)]′(−2Hn)1/2
√n(θn − θpn
)=
(−Hn)1/2(−n−1
n∑t=1
52lt(ytrep, θn3
))−1n−1/2
n∑t=1
5lt(ytrep,θ
pn
)× (−2Hn)1/2
(−n−1
n∑t=1
[52lt
(yt, θn1
)+52lt
(ytrep, θn1
)])−1
×[n−1/2
n∑t=1
5lt(yt,θpn
)+ n−1/2
n∑t=1
5lt(ytrep,θ
pn
)]
=
[(−Hn)−1/2 n−1/2
n∑t=1
5lt(ytrep,θ
pn
)]′(−2Hn)−1/2
×[n−1/2
n∑t=1
5lt(yt,θpn
)+ n−1/2
n∑t=1
5lt(ytrep,θ
pn
)]+ op (1)
35
=
[(−Hn)−1/2 n−1/2
n∑t=1
5lt(ytrep,θ
pn
)]′(−2Hn)−1/2 n−1/2
n∑t=1
5lt(yt,θpn
)+2−1/2
[(−Hn)−1/2 n−1/2
n∑t=1
5lt(ytrep,θ
pn
)]′(−Hn)−1/2
×n−1/2n∑t=1
5lt(ytrep,θ
pn
)+ op (1) ,
with θn3 lying between←→θ n (yrep) and θpn. Hence,[
(−Hn)−1/2 n−1/2n∑t=1
5lt(ytrep,θ
pn
)]′(−Hn)−1/2 n−1/2
n∑t=1
5lt(ytrep,θ
pn
) d→ χ2 (P ) . (38)
Thus, we have
\[
E_{y}E_{y^{rep}}\left(\left[\left(-H_{n}\right)^{-1/2}\frac{1}{\sqrt{n}}\sum_{t=1}^{n}\nabla l_{t}\left(y_{t}^{rep},\theta_{n}^{p}\right)\right]'\left(-2H_{n}\right)^{-1/2}\frac{1}{\sqrt{n}}\sum_{t=1}^{n}\nabla l_{t}\left(y_{t},\theta_{n}^{p}\right)\right)=0, \tag{39}
\]
since $y$ and $y^{rep}$ are independent. Hence, we have
\[
E_{y}E_{y^{rep}}\left[\frac{1}{\sqrt{n}}\frac{\partial\ln p\left(y^{rep}|\theta_{n}^{p}\right)}{\partial\theta'}\sqrt{n}\left(\bar{\theta}_{n}-\theta_{n}^{p}\right)\right]
=E_{y}E_{y^{rep}}\left[2^{-1/2}\left[\left(-H_{n}\right)^{1/2}\sqrt{n}\left(\overleftrightarrow{\theta}_{n}\left(y^{rep}\right)-\theta_{n}^{p}\right)\right]'\left(-2H_{n}\right)^{1/2}\sqrt{n}\left(\bar{\theta}_{n}-\theta_{n}^{p}\right)+o_{p}\left(1\right)\right]
\]
\[
=E_{y}E_{y^{rep}}\left(2^{-1/2}\left[\left(-H_{n}\right)^{-1/2}\frac{1}{\sqrt{n}}\sum_{t=1}^{n}\nabla l_{t}\left(y_{t}^{rep},\theta_{n}^{p}\right)\right]'\left(-2H_{n}\right)^{-1/2}\frac{1}{\sqrt{n}}\sum_{t=1}^{n}\nabla l_{t}\left(y_{t},\theta_{n}^{p}\right)\right)
\]
\[
+E_{y}E_{y^{rep}}\left(2^{-1}\left[\left(-H_{n}\right)^{-1/2}\frac{1}{\sqrt{n}}\sum_{t=1}^{n}\nabla l_{t}\left(y_{t}^{rep},\theta_{n}^{p}\right)\right]'\left(-H_{n}\right)^{-1/2}\frac{1}{\sqrt{n}}\sum_{t=1}^{n}\nabla l_{t}\left(y_{t}^{rep},\theta_{n}^{p}\right)\right)+o\left(1\right)
=\frac{1}{2}P+o\left(1\right). \tag{40}
\]
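The two expectations entering (40) can be illustrated numerically. The sketch below is not part of the proof: it assumes, as a stand-in, that the standardized score sums for $y$ and $y^{rep}$ behave like independent $N(0,I_{P})$ vectors (here with $P=3$, an arbitrary choice). The cross product then averages to zero as in (39), the own product averages to $P$ as in (38), and the combined first-order term tends to $P/2$ as in (40).

```python
import numpy as np

# Stand-in (assumption): u and v play the roles of the standardized score
# sums built from y_rep and y; independence mirrors that of y and y_rep.
rng = np.random.default_rng(0)
P, N = 3, 200_000
u = rng.standard_normal((N, P))  # standardized scores from y_rep
v = rng.standard_normal((N, P))  # standardized scores from y, independent of u

cross = np.mean(np.sum(u * v, axis=1))  # analogue of (39): mean near 0
own = np.mean(np.sum(u * u, axis=1))    # analogue of (38): mean near P
# Analogue of (40): the 2^{-1/2}(-2H_n)^{-1/2} factors contribute 1/2 each.
first_order = 0.5 * cross + 0.5 * own   # tends to P/2

print(round(cross, 2), round(own, 2), round(first_order, 2))
```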
Moreover,
\[
\frac{1}{2}\sqrt{n}\left(\bar{\theta}_{n}-\theta_{n}^{p}\right)'\frac{1}{n}\frac{\partial^{2}\ln p\left(y^{rep}|\theta_{n}^{p}\right)}{\partial\theta\partial\theta'}\sqrt{n}\left(\bar{\theta}_{n}-\theta_{n}^{p}\right)
=-\frac{1}{2}\sqrt{n}\left(\bar{\theta}_{n}-\theta_{n}^{p}\right)'\left(-H_{n}\right)\sqrt{n}\left(\bar{\theta}_{n}-\theta_{n}^{p}\right)+o_{p}\left(1\right)
\]
\[
=-\frac{1}{2}\sqrt{n}\left(\bar{\theta}_{n}-\theta_{n}^{p}\right)'\left(-2H_{n}\right)^{1/2}\left(-2H_{n}\right)^{-1/2}\left(-H_{n}\right)\left(-2H_{n}\right)^{-1/2}\left(-2H_{n}\right)^{1/2}\sqrt{n}\left(\bar{\theta}_{n}-\theta_{n}^{p}\right)+o_{p}\left(1\right)
\]
\[
=-\frac{1}{4}\left[\left(-2H_{n}\right)^{1/2}\sqrt{n}\left(\bar{\theta}_{n}-\theta_{n}^{p}\right)\right]'\left(-2H_{n}\right)^{1/2}\sqrt{n}\left(\bar{\theta}_{n}-\theta_{n}^{p}\right)+o_{p}\left(1\right),
\]
where
\[
\left[\left(-2H_{n}\right)^{1/2}\sqrt{n}\left(\bar{\theta}_{n}-\theta_{n}^{p}\right)\right]'\left(-2H_{n}\right)^{1/2}\sqrt{n}\left(\bar{\theta}_{n}-\theta_{n}^{p}\right)\overset{d}{\to}\chi^{2}\left(P\right). \tag{41}
\]
From (40) and (41),
\[
E_{y}E_{y^{rep}}\left(E_{11}\right)=\left(\frac{1}{2}-\frac{1}{4}\right)P+o\left(1\right)=\frac{1}{4}P+o\left(1\right).
\]
For $E_{12}$, we have
\[
E_{12}=\ln p\left(y^{rep}|\theta_{n}^{p}\right)-\ln p\left(y^{rep}|\theta_{n}^{p}\right)-\frac{1}{\sqrt{n}}\frac{\partial\ln p\left(y^{rep}|\theta_{n}^{p}\right)}{\partial\theta'}\sqrt{n}\left(\overleftrightarrow{\theta}_{n}-\theta_{n}^{p}\right)
-\frac{1}{2}\left(\overleftrightarrow{\theta}_{n}-\theta_{n}^{p}\right)'\frac{\partial^{2}\ln p\left(y^{rep}|\theta_{n}^{p}\right)}{\partial\theta\partial\theta'}\left(\overleftrightarrow{\theta}_{n}-\theta_{n}^{p}\right)+o_{p}\left(1\right).
\]
Since
\[
-\left(\overleftrightarrow{\theta}_{n}-\theta_{n}^{p}\right)'\frac{\partial^{2}\ln p\left(y^{rep}|\theta_{n}^{p}\right)}{\partial\theta\partial\theta'}\left(\overleftrightarrow{\theta}_{n}-\theta_{n}^{p}\right)\overset{d}{\to}\chi^{2}\left(P\right), \tag{42}
\]
\[
E_{y^{rep}}\left(\frac{1}{\sqrt{n}}\frac{\partial\ln p\left(y^{rep}|\theta_{n}^{p}\right)}{\partial\theta}\right)=0,\qquad
E_{y}\left(\sqrt{n}\left(\overleftrightarrow{\theta}_{n}-\theta_{n}^{p}\right)\right)=o\left(1\right), \tag{43}
\]
from (42) and (43) we have
\[
E_{y}E_{y^{rep}}\left(E_{12}\right)=\frac{1}{2}P+o\left(1\right).
\]
Then
\[
E_{y}E_{y^{rep}}\left(E_{1}\right)=E_{y}E_{y^{rep}}\left(\ln p\left(y^{rep}|\bar{\theta}_{n}\right)-\ln p\left(y^{rep}|\overleftrightarrow{\theta}_{n}\right)\right)
=E_{y}E_{y^{rep}}\left(E_{11}+E_{12}\right)=\frac{3}{4}P+o\left(1\right). \tag{44}
\]
Similarly, we can decompose $E_{2}=\ln p\left(y|\bar{\theta}_{n}\right)-\ln p\left(y|\overleftrightarrow{\theta}_{n}\right)$ as $E_{2}=E_{21}+E_{22}$, where
\[
E_{21}=\ln p\left(y|\bar{\theta}_{n}\right)-\ln p\left(y|\theta_{n}^{p}\right),\qquad
E_{22}=\ln p\left(y|\theta_{n}^{p}\right)-\ln p\left(y|\overleftrightarrow{\theta}_{n}\right).
\]
From the discussion above, $\ln p\left(y|\bar{\theta}_{n}\right)-\ln p\left(y|\theta_{n}^{p}\right)$ has the same asymptotic properties as $\ln p\left(y^{rep}|\bar{\theta}_{n}\right)-\ln p\left(y^{rep}|\theta_{n}^{p}\right)$, so we have
\[
E_{y}E_{y^{rep}}\left(E_{21}\right)=E_{y}E_{y^{rep}}\left(E_{11}\right)=\frac{1}{4}P+o\left(1\right).
\]
For $E_{22}$, we have
\[
\ln p\left(y|\theta_{n}^{p}\right)-\ln p\left(y|\overleftrightarrow{\theta}_{n}\right)
=\ln p\left(y|\overleftrightarrow{\theta}_{n}\right)+\frac{1}{\sqrt{n}}\frac{\partial\ln p\left(y|\overleftrightarrow{\theta}_{n}\right)}{\partial\theta'}\sqrt{n}\left(\theta_{n}^{p}-\overleftrightarrow{\theta}_{n}\right)
+\frac{1}{2}\sqrt{n}\left(\overleftrightarrow{\theta}_{n}-\theta_{n}^{p}\right)'\frac{1}{n}\frac{\partial^{2}\ln p\left(y|\overleftrightarrow{\theta}_{n}\right)}{\partial\theta\partial\theta'}\sqrt{n}\left(\overleftrightarrow{\theta}_{n}-\theta_{n}^{p}\right)
-\ln p\left(y|\overleftrightarrow{\theta}_{n}\right)+o_{p}\left(1\right), \tag{45}
\]
where
\[
\frac{1}{\sqrt{n}}\frac{\partial\ln p\left(y|\overleftrightarrow{\theta}_{n}\right)}{\partial\theta'}=o_{p}\left(1\right),\qquad
-\sqrt{n}\left(\overleftrightarrow{\theta}_{n}-\theta_{n}^{p}\right)'\frac{1}{n}\frac{\partial^{2}\ln p\left(y|\overleftrightarrow{\theta}_{n}\right)}{\partial\theta\partial\theta'}\sqrt{n}\left(\overleftrightarrow{\theta}_{n}-\theta_{n}^{p}\right)\overset{d}{\to}\chi^{2}\left(P\right).
\]
Then we can get
\[
E_{y}E_{y^{rep}}\left(E_{22}\right)=-\frac{1}{2}P+o\left(1\right),
\]
and
\[
E_{y}E_{y^{rep}}\left(E_{2}\right)=E_{y}E_{y^{rep}}\left(\ln p\left(y|\bar{\theta}_{n}\right)-\ln p\left(y|\overleftrightarrow{\theta}_{n}\right)\right)
=E_{y}E_{y^{rep}}\left(E_{21}+E_{22}\right)=-\frac{1}{4}P+o\left(1\right). \tag{46}
\]
Note that $\bar{\theta}_{n}=\overleftrightarrow{\theta}_{n}+o_{p}\left(n^{-1/2}\right)$ by Lemma 3.3. Mimicking the proof of Theorem 3.1, we get
\[
E_{y}E_{y^{rep}}\ln p\left(y^{rep}|\overleftrightarrow{\theta}_{n}\right)=E_{y}\left[\ln p\left(y|\overleftrightarrow{\theta}_{n}\right)\right]-P. \tag{47}
\]
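The AIC-type bias correction in (47), namely that the in-sample log likelihood overstates its out-of-sample counterpart by $P$, is easy to check by simulation in a toy model. The sketch below is illustrative only: it uses an i.i.d. $N(\mu,1)$ model with $\mu$ estimated by the sample mean, so $P=1$; these choices are assumptions, not part of the proof.

```python
import numpy as np

# Monte Carlo check of the P-unit optimism in (47) for a toy model:
# y and y_rep are independent N(0, 1) samples; mu is estimated from y only.
rng = np.random.default_rng(2)
n, R = 50, 100_000
c = -0.5 * np.log(2 * np.pi)             # per-observation normalizing constant

y = rng.standard_normal((R, n))          # R replications of the data y
y_rep = rng.standard_normal((R, n))      # independent replicate data y_rep
mu_hat = y.mean(axis=1, keepdims=True)   # estimator based on y only (P = 1)

ll_in = np.sum(c - 0.5 * (y - mu_hat) ** 2, axis=1).mean()       # E ln p(y|mu_hat)
ll_out = np.sum(c - 0.5 * (y_rep - mu_hat) ** 2, axis=1).mean()  # E ln p(y_rep|mu_hat)
print(round(ll_in - ll_out, 2))          # close to P = 1
```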
With (36), (44), (46) and (47), we have
\[
E_{y}\left[E_{y^{rep}}\ln p\left(y^{rep}|y\right)\right]
=E_{y}E_{y^{rep}}\ln\left[\frac{\left|\nabla^{2}h_{N}\left(\bar{\theta}_{n}\right)\right|^{-1/2}\exp\left(-nh_{N}\left(\bar{\theta}_{n}\right)\right)}{\left|\nabla^{2}h_{D}\left(\overleftrightarrow{\theta}_{n}\right)\right|^{-1/2}\exp\left(-nh_{D}\left(\overleftrightarrow{\theta}_{n}\right)\right)}\left(1+O_{p}\left(\frac{1}{n}\right)\right)\right]+O_{p}\left(\frac{1}{n^{2}}\right)
\]
\[
=-\frac{1}{2}\left(\ln\left|\nabla^{2}h_{N}\left(\bar{\theta}_{n}\right)\right|-\ln\left|\nabla^{2}h_{D}\left(\overleftrightarrow{\theta}_{n}\right)\right|\right)
+E_{y}E_{y^{rep}}\left[\ln p\left(y^{rep}|\overleftrightarrow{\theta}_{n}\right)+\ln p\left(y^{rep}|\bar{\theta}_{n}\right)-\ln p\left(y^{rep}|\overleftrightarrow{\theta}_{n}\right)+\ln p\left(y|\bar{\theta}_{n}\right)-\ln p\left(y|\overleftrightarrow{\theta}_{n}\right)\right]+o\left(1\right)
\]
\[
=-\frac{P}{2}\ln 2+E_{y}E_{y^{rep}}\left[\ln p\left(y^{rep}|\overleftrightarrow{\theta}_{n}\right)\right]+E_{y}E_{y^{rep}}\left[E_{1}+E_{2}\right]+o\left(1\right)
=-\frac{P}{2}\ln 2+E_{y}E_{y^{rep}}\left[\ln p\left(y^{rep}|\overleftrightarrow{\theta}_{n}\right)\right]+\frac{3}{4}P-\frac{1}{4}P+o\left(1\right)
\]
\[
=E_{y}E_{y^{rep}}\left(\ln p\left(y^{rep}|\overleftrightarrow{\theta}_{n}\right)\right)+\left(\frac{1}{2}-\frac{\ln 2}{2}\right)P+o\left(1\right)
=E_{y}\ln p\left(y|\overleftrightarrow{\theta}_{n}\right)-P+\left(\frac{1}{2}-\frac{\ln 2}{2}\right)P+o\left(1\right)
\]
\[
=E_{y}\left[\ln p\left(y|\overleftrightarrow{\theta}_{n}\right)-\frac{1+\ln 2}{2}P\right]+o\left(1\right) \tag{48}
\]
\[
=E_{y}\left[\ln p\left(y|\bar{\theta}_{n}\right)-\frac{1+\ln 2}{2}P\right]+o\left(1\right).
\]
Therefore, $-2\ln p\left(y|\bar{\theta}_{n}\right)+\left(1+\ln 2\right)P$ is an asymptotically unbiased estimator of $E_{y^{rep}}\left(-2\ln p\left(y^{rep}|y\right)\right)$.
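The conclusion suggests a simple recipe for the criterion: evaluate the log likelihood at a point estimate computed from the MCMC output and add the penalty $(1+\ln 2)P$. The sketch below is an illustration under stated assumptions: the i.i.d. normal model, the use of the posterior mean of the draws as $\bar{\theta}_{n}$, the synthetic "MCMC draws" (simulated directly rather than from a sampler), and the helper names `new_dic` and `normal_loglik` are all ours, not the paper's.

```python
import math
import numpy as np

def new_dic(y, theta_draws, loglik):
    """Sketch of the modified criterion: -2 times the log likelihood at the
    posterior mean of the draws, plus a (1 + ln 2) * P penalty, where P is
    the parameter dimension.  theta_draws is an (S, P) array of MCMC draws."""
    theta_bar = theta_draws.mean(axis=0)   # posterior mean from the draws
    P = theta_draws.shape[1]               # number of parameters
    return -2.0 * loglik(y, theta_bar) + (1.0 + math.log(2.0)) * P

# Illustrative model: i.i.d. normal data, theta = (mu, log sigma).
def normal_loglik(y, theta):
    mu, log_sigma = theta
    sigma = math.exp(log_sigma)
    return float(np.sum(-0.5 * math.log(2 * math.pi) - log_sigma
                        - 0.5 * ((y - mu) / sigma) ** 2))

rng = np.random.default_rng(1)
y = rng.normal(0.0, 1.0, size=500)
# Stand-in for MCMC output: draws scattered around the MLE (assumption).
draws = np.column_stack([
    rng.normal(y.mean(), y.std() / math.sqrt(len(y)), size=2000),
    rng.normal(math.log(y.std()), 0.03, size=2000),
])
print(round(new_dic(y, draws, normal_loglik), 1))
```

Because the posterior mean of the synthetic draws sits close to the MLE, the criterion is essentially the maximized deviance plus $(1+\ln 2)\times 2$ here.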
References

1. Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. Second International Symposium on Information Theory, Springer Verlag, 1, 267-281.
2. Andrews, D. W. K. (1987). Consistency in nonlinear econometric models: A generic uniform law of large numbers. Econometrica, 55(6), 1465-1471.
3. Andrews, D. W. K. (1988). Laws of large numbers for dependent non-identically distributed random variables. Econometric Theory, 4(3), 458-467.
4. Ando, T. (2007). Bayesian predictive information criterion for the evaluation of hierarchical Bayesian and empirical Bayes models. Biometrika, 443-458.
5. Ando, T. (2012). Predictive Bayesian model selection. American Journal of Mathematical and Management Sciences, 31(1-2), 13-38.
6. Ando, T. and Tsay, R. (2010). Predictive likelihood for Bayesian model selection and averaging. International Journal of Forecasting, 26, 744-763.
7. Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis. 2nd edition. Springer-Verlag.
8. Bester, C. A. and Hansen, C. (2006). Bias reduction for Bayesian and frequentist estimators. SSRN Working Paper Series.
9. Brooks, S. (2002). Discussion on the paper by Spiegelhalter, Best, Carlin, and van der Linde (2002). Journal of the Royal Statistical Society Series B, 64, 616-618.
10. Burnham, K. and Anderson, D. (2002). Model Selection and Multi-model Inference: A Practical Information-theoretic Approach. Springer.
11. Chen, C. (1985). On asymptotic normality of limiting density functions with Bayesian implications. Journal of the Royal Statistical Society Series B, 47, 540-546.
12. Chung, K. L. (2001). A Course in Probability Theory. Academic Press.
13. DasGupta, A. (2008). Asymptotic Theory of Statistics and Probability. Springer Science & Business Media.
14. Gallant, A. R. and White, H. (1988). A Unified Theory of Estimation and Inference for Nonlinear Dynamic Models. Blackwell.
15. Geweke, J. and Keane, M. (2001). Computationally intensive methods for integration in econometrics. Handbook of Econometrics, 5, 3463-3568.
16. Geweke, J. (2005). Contemporary Bayesian Econometrics and Statistics. Wiley-Interscience.
17. Ghosh, J. and Ramamoorthi, R. (2003). Bayesian Nonparametrics. Springer Verlag.
18. Kass, R., Tierney, L. and Kadane, J. (1990). The validity of posterior expansions based on Laplace's method. In Bayesian and Likelihood Methods in Statistics and Econometrics, ed. by S. Geisser, J. S. Hodges, S. J. Press and A. Zellner. Elsevier Science Publishers B.V.: North-Holland.
19. Jennrich, R. I. (1969). Asymptotic properties of non-linear least squares estimators. The Annals of Mathematical Statistics, 40(2), 633-643.
20. Kim, J. (1994). Bayesian asymptotic theory in a time series model with a possible nonstationary process. Econometric Theory, 10, 764-773.
21. Kim, J. (1998). Large sample properties of posterior densities, Bayesian information criterion and the likelihood principle in nonstationary time series models. Econometrica, 66, 359-380.
22. Li, Y., Yu, J. and Zeng, T. (2017). Online Supplement to 'Deviance Information Criterion for Model Selection: Justification and Variation'. Singapore Management University.
23. Magnus, J. R. and Neudecker, H. (1986). Symmetry, 0-1 matrices and Jacobians: A review. Econometric Theory, 2(2), 157-190.
24. Phillips, P. C. B. (1995). Bayesian model selection and prediction with empirical application (with discussions). Journal of Econometrics, 69(1), 289-331.
25. Phillips, P. C. B. (1996). Econometric model determination. Econometrica, 64, 763-812.
26. Plummer, M. (2008). Penalized loss functions for Bayesian model comparison. Biostatistics, 9(3), 523-539.
27. Spiegelhalter, D., Best, N., Carlin, B. and van der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society Series B, 64, 583-639.
28. Spiegelhalter, D., Best, N., Carlin, B. and van der Linde, A. (2014). The deviance information criterion: 12 years on. Journal of the Royal Statistical Society Series B, 76, 485-493.
29. Schervish, M. J. (2012). Theory of Statistics. Springer Science & Business Media.
30. Tierney, L., Kass, R. E. and Kadane, J. B. (1989). Fully exponential Laplace approximations to expectations and variances of nonpositive functions. Journal of the American Statistical Association, 84(407), 710-716.
31. van der Linde, A. (2005). DIC in variable selection. Statistica Neerlandica, 59(1), 45-56.
32. van der Linde, A. (2012). A Bayesian view of model complexity. Statistica Neerlandica, 66, 253-271.
33. Vehtari, A. and Ojanen, J. (2012). A survey of Bayesian predictive methods for model assessment, selection and comparison. Statistics Surveys, 6, 142-228.
34. Vehtari, A. and Lampinen, J. (2002). Bayesian model assessment and comparison using cross-validation predictive densities. Neural Computation, 14, 2439-2468.
35. White, H. (1996). Estimation, Inference and Specification Analysis. Cambridge University Press, Cambridge, UK.
36. Wooldridge, J. M. (1994). Estimation and inference for dependent processes. Handbook of Econometrics, 4, 2639-2738.
37. Zhang, Y. and Yang, Y. (2015). Cross-validation for selecting a model selection procedure. Journal of Econometrics, 187(1), 95-112.