Biometrika (2021), xx, x, pp. 1–10
Printed in Great Britain
Nonparametric Empirical Bayes Estimation On
Heterogeneous Data
BY TRAMBAK BANERJEE
Analytics, Information and Operations Management, University of Kansas,
Lawrence, Kansas 66045, U.S.A.
LUELLA J. FU
Department of Mathematics, San Francisco State University,
San Francisco, California 94132, U.S.A.
GARETH M. JAMES
Department of Data Sciences and Operations, University of Southern California,
Los Angeles, California 90089, U.S.A.
WENGUANG SUN
Department of Data Sciences and Operations, University of Southern California,
Los Angeles, California 90089, U.S.A.
SUMMARY
The simultaneous estimation of many parameters based on data collected from corresponding studies is a key research problem that has received renewed attention in the high-dimensional setting. Many practical situations involve heterogeneous data where heterogeneity is captured by a nuisance parameter. Effectively pooling information across samples while correctly accounting for heterogeneity presents a significant challenge in large-scale estimation problems. We address this issue by introducing the "Nonparametric Empirical-Bayes Structural Tweedie" (NEST) estimator, which efficiently estimates the unknown effect sizes and properly adjusts for heterogeneity via a generalized version of Tweedie's formula. For the normal means problem, NEST simultaneously handles the two main selection biases introduced by heterogeneity: selection bias in the mean, which cannot be effectively corrected without also correcting for selection bias in the variance. Our theoretical results show that NEST has strong asymptotic properties without requiring explicit assumptions about the prior. Extensions to other two-parameter members of the exponential family are discussed. Simulation studies show that NEST outperforms competing methods, with substantial efficiency gains in many settings. The proposed method is demonstrated on estimating the batting averages of baseball players and Sharpe ratios of mutual fund returns.
Some key words: Compound decision; double shrinkage estimation; non-parametric empirical Bayes; kernelized Stein's discrepancy; Tweedie's formula.
© 2018 Biometrika Trust
1. INTRODUCTION
1.1. Large-scale estimation and Empirical Bayes
Suppose that we are interested in estimating a vector of parameters η = (η1, . . . , ηn) based on the summary statistics Y1, . . . , Yn from n study units. The setting where $Y_i \mid \mu_i \sim N(\mu_i, \sigma^2)$ is the most well-known example, but the broader scope includes the compound estimation of Poisson parameters λi, binomial parameters pi, and other members of the exponential family. High-dimensional settings have required researchers to solve this problem in new ways and for new purposes. In modern large-scale applications it is often of interest to perform simultaneous and selective inference (Benjamini & Yekutieli, 2011; Berk et al., 2013; Weinstein et al., 2013). For example, there has been recent work on how to construct valid simultaneous confidence intervals after a selection procedure is applied (Lee et al., 2016). In multiple testing, as well as related ranking and selection problems, it is often desirable to incorporate estimates of the effect sizes ηi in the decision process to prioritize the selection of more scientifically meaningful hypotheses (Benjamini & Hochberg, 1997; Sun & McLain, 2012; He et al., 2015; Henderson & Newton, 2016; Basu et al., 2017).
However, the simultaneous inference of thousands of means, or other parameters, is challenging because, as described in Efron (2011), the large scale of the problem introduces selection bias, wherein some data points are large merely by chance, causing traditional estimators to overestimate the corresponding means. Shrinkage estimation, exemplified by the seminal work of James & Stein (1961), has been widely used in simultaneous inference. There are several popular classes of methods, including linear shrinkage estimators (James & Stein, 1961; Efron & Morris, 1975; Berger, 1976), non-linear thresholding-based estimators motivated by sparse priors (Donoho & Johnstone, 1994; Johnstone & Silverman, 2004; Abramovich et al., 2006), and both Bayes and empirical Bayes estimators with unspecified priors (Brown & Greenshtein, 2009; Jiang & Zhang, 2009; Castillo & van der Vaart, 2012). This article focuses on a class of estimators based on Tweedie's formula (Robbins, 1956)¹. The formula is an elegant shrinkage estimator, for distributions from the exponential family, that has recently received renewed interest (Brown & Greenshtein, 2009; Efron, 2011; Koenker & Mizera, 2014). Tweedie's formula is simple and intuitive, and its implementation under the f-modeling strategy for empirical Bayes estimation only requires the estimation of the marginal distribution of Yi. This property is particularly appealing for large-scale estimation problems where such estimates can be easily constructed from the observed data. The resultant empirical Bayes estimator enjoys optimality properties (Brown & Greenshtein, 2009) and delivers superior numerical performance. The work of Efron (2011) further convincingly demonstrates that Tweedie's formula provides an effective bias correction tool when estimating thousands of parameters simultaneously.
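To fix ideas, the classical homoscedastic version of Tweedie's formula, $\delta(y) = y + \sigma^2 \frac{d}{dy}\log f(y)$, can be sketched in a few lines. The following Python snippet is our own illustration (the paper's accompanying software is in R, and all names and simulation settings below are ours), with the marginal density $f$ estimated by a Gaussian kernel density estimate whose derivative is available in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 1000, 1.0
mu = rng.normal(0.0, 2.0, n)              # latent effect sizes
y = mu + rng.normal(0.0, sigma, n)        # observed summary statistics

# Gaussian kernel density estimate of the marginal f(y), with an
# analytic derivative so the score f'(y)/f(y) needs no differencing
h = 1.06 * y.std() * n ** (-1 / 5)        # Silverman's rule-of-thumb bandwidth
d = (y[:, None] - y[None, :]) / h         # d[i, j] = (y_i - y_j) / h
phi = np.exp(-0.5 * d**2)                 # unnormalized Gaussian kernel
f_hat = phi.mean(axis=1)                  # proportional to f_hat(y_i)
df_hat = (-d / h * phi).mean(axis=1)      # proportional to f_hat'(y_i)

# Tweedie's formula: posterior mean = y + sigma^2 * d/dy log f(y)
delta = y + sigma**2 * df_hat / f_hat

mse_raw = np.mean((y - mu) ** 2)          # risk of the unbiased estimate
mse_tweedie = np.mean((delta - mu) ** 2)  # risk after shrinkage
```

On simulated effects such as these, the shrinkage estimate attains a noticeably smaller mean squared error than the raw observations, which is precisely the selection-bias correction described above.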
1.2. Issues with heterogeneous data
Most of the research in this area has been restricted to models of the form f(yi | ηi) where the distribution of Yi is solely a function of ηi. In situations involving a nuisance parameter θ it is generally assumed to be known and identical for all Yi. For example, homoscedastic Gaussian models of the form $Y_i \mid \mu_i, \sigma \overset{\text{ind.}}{\sim} N(\mu_i, \sigma^2)$ involve a common nuisance parameter θ = σ² for all i. However, in large-scale studies when the data are collected from heterogeneous sources, the nuisance parameters may vary over the n study units. Perhaps the most common example, and the setting we concentrate most on, involves heteroscedastic errors, where σ² varies over Yi. Microarray data (Erickson & Sabatti, 2005; Chiaretti et al., 2004), returns on mutual funds

¹ A reviewer pointed out that the formula appears even earlier in the astronomy literature (Dyson, 1926; Eddington, 1940), wherein Frank Dyson credits the formula to Sir Arthur Eddington.
(Brown et al., 1992), and state-wide school performance gaps (Sun & McLain, 2012) are all instances of large-scale data where genes, funds, or schools have heterogeneous variances. Heteroscedastic errors also arise in analysis of variance and linear regression settings (Weinstein et al., 2018). Moreover, in compound binomial problems, heterogeneity arises through unequal sample sizes across different study units. Unfortunately, the conventional Tweedie's formula assumes identical nuisance parameters across study units and so cannot eliminate selection bias for heterogeneous data. Moreover, various works show that failing to account for heterogeneity leads to inefficient shrinkage estimators (Weinstein et al., 2018), methods with invalid false discovery rates (Efron, 2008; Cai & Sun, 2009), unstable multiple testing procedures (Tusher et al., 2001) and suboptimal ranking and selection algorithms (Henderson & Newton, 2016), exacerbating the replicability crisis in large-scale studies. Few methodologies are available to address this issue.
For Gaussian data, a common goal is to find the estimator, or make the decision, that minimizes the expected squared error loss. A plausible-seeming solution might be to scale each Yi by its estimated standard deviation Si so that a homoscedastic method could be applied to Xi = Yi/Si, before undoing the scaling on the final estimate of µi. Indeed, this is essentially the approach taken whenever we compute standardized test statistics, such as t-values and z-values. A similar standardization is performed in the binomial setting when we compute pi = Xi/mi, where Xi is the number of successes and mi the number of trials. However, this approach, which disregards important structural information, can be highly inefficient. More advanced methods have been developed, but all suffer from various limitations. For instance, the methods proposed by Xie et al. (2012), Tan (2015), Jing et al. (2016), Kou & Yang (2017), and Zhang & Bhattacharya (2017) are designed for heteroscedastic data but assume a parametric Gaussian prior or a semi-parametric Gaussian mixture prior, which leads to a loss of efficiency when the prior is misspecified. Moreover, existing methods, such as Xie et al. (2012) and Weinstein et al. (2018), often assume that the nuisance parameters are known and use a consistent estimator for implementation. However, when a large number of units are investigated simultaneously, traditional sample variance estimators may similarly suffer from selection bias, which often leads to severe deterioration in the mean squared error for estimating the means.
1.3. The proposed approach and main contributions
In the homogeneous setting, Tweedie's formula estimates ηi using the score function of Yi, but that approach does not immediately extend to heterogeneous data. Instead, this article proposes a two-step approach, "Nonparametric Empirical Bayes Structural Tweedie" (NEST), which first estimates the bivariate score function for data from a two-parameter exponential family, and second predicts ηi using a generalized version of Tweedie's formula that effectively incorporates the structural information encoded in the variances. A significant challenge in the heterogeneous setting is how to pool information from different study units effectively while accounting for the heterogeneity captured by the possibly unknown nuisance parameters. NEST addresses the issue through a double shrinkage method that simultaneously incorporates the structural information encoded in both the primary (e.g. Yi) and auxiliary (e.g. Si) data.

NEST has several clear advantages. First, it simultaneously handles the two main selection biases introduced by heterogeneity: selection bias in the primary parameter (the mean), which cannot be effectively corrected without also correcting for selection bias in the nuisance parameter (the variance). By producing more accurate estimates of the nuisance parameters, NEST in general yields improved shrinkage factors for estimating the primary parameters. Second, NEST makes no explicit assumptions about the prior since it uses a nonparametric method to directly estimate the optimal shrinkage directions. Third, NEST exploits the structure of the entire sample and avoids the information loss that occurs in the discretization step of the grouping methods (Weinstein et al., 2018). Fourth, NEST only requires a few simple assumptions to achieve strong asymptotic properties. Finally, NEST provides a general estimation framework for members of the two-parameter exponential family. It is fast to implement, produces stable estimates, and is robust against model mis-specification. We demonstrate numerically that NEST can provide high levels of estimation accuracy relative to a host of benchmark methods.
1.4. Organization
The rest of the paper is structured as follows. Section 2 first derives a generalized Tweedie's formula for a hierarchical Gaussian model where both the mean and variance parameters are unknown. We then develop a convex optimization approach to estimating the unknown shrinkage factors in the formula. In Section 3, we first describe the theoretical setup, then justify the optimization criterion, and finally establish asymptotic theories for the proposed NEST estimator. Extensions to other members of the two-parameter exponential family are discussed in Section 4. Simulation studies are carried out in Section 5 to compare NEST to competing methods. Two data applications are presented in Section 6. The proofs, as well as additional theoretical and numerical results, are provided in the Supplementary Material.
2. DOUBLE SHRINKAGE ESTIMATION ON HETEROSCEDASTIC NORMAL DATA
2.1. Generalized Tweedie's formula for a hierarchical Normal model
Suppose we collect mi observations for the ith study unit, i = 1, . . . , n. The data are normally distributed obeying the following hierarchical model:
$$Y_{ij} \mid \mu_i, \tau_i \overset{\text{i.i.d.}}{\sim} N(\mu_i, 1/\tau_i), \quad j = 1, \ldots, m_i, \qquad (1)$$
$$\mu_i \mid \tau_i \overset{\text{ind.}}{\sim} G_\mu(\cdot \mid \tau_i), \quad \tau_i \overset{\text{i.i.d.}}{\sim} H_\tau(\cdot), \quad i = 1, \ldots, n.$$
We view µi and τi, both of which are unknown, as the primary and nuisance parameters, respectively. The prior distributions Gµ(·|τi) and Hτ(·) are unspecified. We develop a nonparametric empirical Bayes method for jointly estimating the mean and variance under the squared error loss. The general setting of two-parameter exponential families is considered in Section 4.
Let $Y_i = m_i^{-1}\sum_{j=1}^{m_i} Y_{ij}$ and $S_i^2 = (m_i - 1)^{-1}\sum_{j=1}^{m_i}(Y_{ij} - Y_i)^2$ respectively denote the sample mean and sample variance. Further, let $\boldsymbol Y = (Y_1, \ldots, Y_n)$ and $\boldsymbol S = (S_1^2, \ldots, S_n^2)$ be the vectors of summary statistics, and $\boldsymbol y = (y_1, \ldots, y_n)$ and $\boldsymbol s = (s_1^2, \ldots, s_n^2)$ the observed values. Denote by $\boldsymbol\delta = (\delta_1, \ldots, \delta_n)^T$ an estimator for $\boldsymbol\mu = (\mu_1, \ldots, \mu_n)^T$ based on $(\boldsymbol Y, \boldsymbol S)$. Consider the squared error loss $l_n(\boldsymbol\mu, \boldsymbol\delta) = n^{-1}\sum_{i=1}^n \ell(\mu_i, \delta_i)$, where $\ell(\mu_i, \delta_i) = (\delta_i - \mu_i)^2$. The compound Bayes risk is
$$r(\boldsymbol\delta, G) = \frac{1}{n}\sum_{i=1}^n \iiint \ell(\mu_i, \delta_i)\, f_{m_i}(y, s^2 \mid \boldsymbol\psi_i)\, dy\, ds^2\, dG(\boldsymbol\psi_i), \qquad (2)$$
where $\boldsymbol\psi_i = (\mu_i, \tau_i)$, $G(\boldsymbol\psi_i) = G_\mu(\mu_i \mid \tau_i)\, H_\tau(\tau_i)$, and $f_{m_i}(y, s^2 \mid \boldsymbol\psi_i)$ is the likelihood function of $(Y_i, S_i^2)$. The Bayes estimator that minimizes (2) is given by $\boldsymbol\delta^\pi = (\delta_1^\pi, \ldots, \delta_n^\pi)$, where
$$\delta_i^\pi := \delta^\pi(y_i, s_i^2, m_i) = E(\mu_i \mid Y_i = y_i, S_i^2 = s_i^2, m_i).$$
Unfortunately, Model (1) in its full generality does not allow a closed form expression for $\delta^\pi$. The next proposition exploits the properties of the exponential family to form a new representation of the posterior distribution of the canonical parameters. The useful representation is subsequently employed in Corollary 1 to construct our proposed estimator.
PROPOSITION 1. Consider hierarchical Model (1) with $m_i > 3$. Let $f_m(y, s^2) = \int f_m(y, s^2 \mid \boldsymbol\psi)\, dG(\boldsymbol\psi)$ denote the joint density of $(Y, S^2)$. The posterior distribution of $(\mu_i, \tau_i)$ belongs to a two-parameter exponential family with density
$$f_{m_i}(\mu_i, \tau_i \mid y_i, s_i^2) \propto \exp\{\boldsymbol\eta_i^T T(\mu_i, \tau_i) - A(\boldsymbol\eta_i)\}\, g_\mu(\mu_i \mid \tau_i)\, h_\tau(\tau_i),$$
where $T(\mu_i, \tau_i) = (\tau_i\mu_i, \tau_i/2)$, $\boldsymbol\eta_i = \{m_i y_i,\; -m_i y_i^2 - (m_i - 1)s_i^2\} := (\eta_{1i}, \eta_{2i})$,
$$\gamma(\eta_{1i}, \eta_{2i}) = \frac{-\eta_{2i} - m_i^{-1}\eta_{1i}^2}{m_i - 1}, \quad A(\boldsymbol\eta_i) = -\tfrac{1}{2}(m_i - 3)\log \gamma(\eta_{1i}, \eta_{2i}) + \log f_{m_i}\{m_i^{-1}\eta_{1i},\, \gamma(\eta_{1i}, \eta_{2i})\}.$$
The two-parameter exponential family representation of the joint posterior distribution of (µi, τi) in Proposition 1 is particularly useful because, as shown in Corollary 1, it allows one to write down the Bayes estimators of ζi := τiµi and τi immediately.
COROLLARY 1. Under the hierarchical Model (1) with $m_i > 3$, the Bayes estimators of $(\zeta_i, \tau_i)$ under squared error loss are, respectively,
$$\zeta_i^\pi := \zeta^\pi(y_i, s_i^2, m_i) = E(\zeta_i \mid y_i, s_i^2, m_i) = y_i\, E(\tau_i \mid y_i, s_i^2, m_i) + m_i^{-1} w_1(y_i, s_i^2; m_i), \qquad (3)$$
$$\tau_i^\pi := \tau^\pi(y_i, s_i^2, m_i) = E(\tau_i \mid y_i, s_i^2, m_i) = \frac{m_i - 3}{(m_i - 1)s_i^2} - \frac{2}{m_i - 1}\, w_2(y_i, s_i^2; m_i), \qquad (4)$$
where $w_1(y, s^2; m) := \frac{\partial}{\partial y}\log f_m(y, s^2)$ and $w_2(y, s^2; m) := \frac{\partial}{\partial s^2}\log f_m(y, s^2)$.
Corollary 1 is significant for several reasons. First, it generalizes Tweedie's formula, which can be recovered from Equation (3) by treating τi as a known constant and dividing both sides by τi. Second, the score functions $w_{1i} := w_1(y_i, s_i^2; m_i)$ and $w_{2i} := w_2(y_i, s_i^2; m_i)$ correspond to optimal shrinkage factors that should be applied to $(y_i, s_i^2)$ when both µi and τi are unknown. Hence the optimal estimation of µi requires a "double shrinkage" procedure that operates on both the ζ and τ dimensions. Third, Corollary 1 can be employed to construct estimators for functions of (µi, τi) such as the Sharpe ratio in finance applications (Section 6.2).
2.2. Estimation of shrinkage factors via convex optimization
We begin by introducing some notation. Let $\boldsymbol x = (y, s^2)$ be a generic pair of observations from the distribution with marginal density $f_m(\boldsymbol x)$, which we assume is continuously differentiable with support on $\mathcal X \subseteq \mathbb R \times \mathbb R^+$. Denote the score function
$$\boldsymbol w(\boldsymbol x; m) = \nabla_{\boldsymbol x}\log f_m(\boldsymbol x) := \{w_1(\boldsymbol x; m),\, w_2(\boldsymbol x; m)\}. \qquad (5)$$
Next, for $i = 1, \ldots, n$, let $\boldsymbol z_i = (\boldsymbol x_i, m_i)$ where $\boldsymbol x_i$ is an observation from a distribution with density $f_{m_i}(\boldsymbol x)$.

Let $K(\boldsymbol z_i, \boldsymbol z_j) = \exp\{-(\boldsymbol z_i - \boldsymbol z_j)^T \Omega (\boldsymbol z_i - \boldsymbol z_j)/2\}$ be the radial basis function (RBF) kernel with $\Omega_{3\times 3}$ being the inverse of the sample covariance matrix of $(\boldsymbol z_1, \ldots, \boldsymbol z_n)$. Denote
$$W_0^{n\times 2} = \{\boldsymbol w(\boldsymbol x_1; m_1), \ldots, \boldsymbol w(\boldsymbol x_n; m_n)\}^T, \qquad (6)$$
where $\boldsymbol w(\boldsymbol x_i; m_i) = \nabla_{\boldsymbol x}\log f_{m_i}(\boldsymbol x)\big|_{\boldsymbol x = \boldsymbol x_i} := \{w_1(\boldsymbol x_i; m_i),\, w_2(\boldsymbol x_i; m_i)\}$. We denote $w_k(\boldsymbol x_i; m_i)$ by $w_{ki}$, $k = 1, 2$, for the remainder of this article.
Let $\nabla_{z_{jk}} K(\boldsymbol z_i, \boldsymbol z_j)$ be the partial derivative of $K(\boldsymbol z_i, \boldsymbol z_j)$ with respect to the $k$th component of $\boldsymbol z_j$. The following matrices are needed in our proposed estimator:
$$K_{n\times n} = [K_{ij}]_{1\le i \le n,\, 1\le j \le n}, \quad \nabla K_{n\times 2} = [\nabla K_{ik}]_{1\le i \le n,\, 1\le k \le 2},$$
where $K_{ij} = K(\boldsymbol z_i, \boldsymbol z_j)$ and $\nabla K_{ik} = \sum_{j=1}^n \nabla_{z_{jk}} K(\boldsymbol z_i, \boldsymbol z_j)$. Next we formally define our proposed Nonparametric Empirical-Bayes Structural Tweedie (NEST) estimator.
DEFINITION 1. Consider hierarchical Model (1) with $m_i > 3$. For a fixed regularization parameter $\lambda > 0$, let $W_n(\lambda) = (\boldsymbol w^1_\lambda, \ldots, \boldsymbol w^n_\lambda)^T$, where $\boldsymbol w^i_\lambda = (w^i_{1,\lambda}, w^i_{2,\lambda})$, be the solution to the following quadratic optimization problem:
$$\min_{W \in \mathbb R^{n\times 2}}\; \frac{1}{n^2}\,\mathrm{trace}\left(W^T K W + 2 W^T \nabla K\right) + \rho_n(W; \lambda), \qquad (7)$$
where $\rho_n(W; \lambda)$ is a penalty on the elements of $W$. Then the NEST estimator for $\mu_i$ is
$$\delta_i^{ds}(\lambda) = y_i + \frac{1}{m_i\, \tau_i^{ds}(\lambda)}\, w^i_{1,\lambda}, \qquad (8)$$
where
$$\tau_i^{ds}(\lambda) = \frac{m_i - 3}{(m_i - 1)s_i^2} - \frac{2}{m_i - 1}\, w^i_{2,\lambda}, \qquad (9)$$
with the superscript $ds$ denoting "double shrinkage".
Although not immediately obvious, we show in Section 3 that, under the compound estimation setting, minimizing the first term in the objective function (7) is asymptotically equivalent to minimizing the kernelized Stein's discrepancy (KSD; Liu et al., 2016; Chwialkowski et al., 2016). Roughly speaking, the KSD measures how far a given $n\times 2$ matrix $W$ is from the true score matrix $W_0$. A key property of the KSD is that it is always non-negative and is equal to 0 if and only if $W$ and $W_0$ are equal. Hence, solving the convex program (7) is equivalent to finding a $W$ that is as close as possible to $W_0$. Since the oracle rule that minimizes the Bayes risk (2) is constructed based on $W_0$, we can expect that the NEST estimator based on $W$ would be asymptotically optimal. Theory underpinning this intuition, as well as the asymptotic optimality of $\delta_i^{ds}(\lambda)$, is established in Section 3.

The second term in Equation (7), $\rho_n(W; \lambda)$, is a penalty function that increases as the elements of $W$ move further away from 0. In this article, we take $\rho_n(W; \lambda) = (\lambda/n^2)\sum_{i=1}^n\sum_{k=1}^2 W_{ik}^2 = (\lambda/n^2)\|W\|_F^2$, where $W_{ik}$ are the elements of the matrix $W$ and $\|W\|_F$ is the Frobenius norm of $W$. A large $\lambda$ forces the estimated shrinkage factors towards 0, and in the limit the NEST estimate is simply the unbiased estimate $y_i$. An alternative approach, as pursued in James et al. (2020), is to penalize the lack of smoothness in $\boldsymbol w$ as a function of $\boldsymbol x$.
A key characteristic of $\delta_i^{ds}(\lambda)$ in Equation (8) is that it exploits the joint structural information available in both $Y_i$ and $S_i^2$ through $W_n(\lambda)$. Although the loss function only involves the means, we perform shrinkage on both the mean and variance dimensions. Inspecting Equations (8) and (9), we expect that the improved accuracy achieved by $\tau_i^{ds}(\lambda)$ leads to better shrinkage factors for $\delta_i^{ds}(\lambda)$ and hence an additional reduction in the estimation risk. In Section 4 we adopt a similar strategy to extend the estimation framework presented in Definition 1 to other distributions in the two-parameter exponential family.
We end this section with a discussion of the case of equal sample sizes, i.e. $m_i = m$ for all $i$. For instance, the leukemia dataset analyzed in Jing et al. (2016) consists of the expression levels of n = 5,000 genes for m = 27 acute lymphoblastic leukemia patients. The heterogeneity across the n units is due to intrinsic variability rather than a varying number of replicates. When the $m_i$'s are equal, the RBF kernel $K(\cdot, \cdot)$ needs to be modified to avoid a singular sample covariance matrix. Denote by $K(\boldsymbol x_i, \boldsymbol x_j) = \exp\{-0.5(\boldsymbol x_i - \boldsymbol x_j)^T \Omega (\boldsymbol x_i - \boldsymbol x_j)\}$ the modified RBF kernel with $\Omega_{2\times 2}$ being the inverse of the sample covariance matrix of $(\boldsymbol x_1, \ldots, \boldsymbol x_n)$. Correspondingly, in Definition 1, $W_n(\lambda)$ are the estimates of the shrinkage factors $W_0^{n\times 2} = \{\boldsymbol w(\boldsymbol x_1; m), \ldots, \boldsymbol w(\boldsymbol x_n; m)\}^T$, where $\boldsymbol w(\boldsymbol x_i; m) = \{w_1(\boldsymbol x_i; m),\, w_2(\boldsymbol x_i; m)\} := (w_{1i}, w_{2i})$.
2.3. Details around implementation
In this section we discuss details around the implementation of NEST. First note that Equation (7) can be solved separately for the two columns of $W$, which respectively yield the estimates of $w_{1i}$ and $w_{2i}$. Next, the solution to Equation (7) is available in the closed form $W_n(\lambda) = -B(\lambda)\nabla K$, where $B(\lambda) = (K + \lambda I)^{-1}$. However, in our implementation the closed form solution is replaced by a convex program that directly solves (7) with the constraint $W\boldsymbol a \preceq \boldsymbol b$, where $\boldsymbol a = (0, 1)^T$ and $\boldsymbol b = (b_1, \ldots, b_n)$ with $b_i = \frac{1}{2}(m_i - 3)/s_i^2 - \kappa$ for some $\kappa > 0$. Inspecting Equation (9) shows that adding this constraint guarantees that $\tau_i^{ds}(\lambda) > 0$. This is desirable in both numerical and theoretical analyses.
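The unconstrained closed form $W_n(\lambda) = -(K + \lambda I)^{-1}\nabla K$ is straightforward to code. The Python sketch below is purely illustrative (it uses the equal-sample-size kernel of Section 2.2, a fixed $\lambda$ of our choosing, and omits the positivity constraint $W\boldsymbol a \preceq \boldsymbol b$ just described; the actual implementation solves the constrained program in R):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, lam = 400, 10, 1.0
tau = rng.gamma(5.0, 0.5, n)
mu = rng.normal(0.0, 0.5 / np.sqrt(tau))
Y = rng.normal(mu[:, None], (1 / np.sqrt(tau))[:, None], (n, m))
x = np.column_stack([Y.mean(axis=1), Y.var(axis=1, ddof=1)])  # (ybar, s^2)

# modified RBF kernel for equal m: Omega is the inverse sample covariance
Omega = np.linalg.inv(np.cov(x.T))
diff = x[:, None, :] - x[None, :, :]                 # diff[i, j] = x_i - x_j
K = np.exp(-0.5 * np.einsum("ijk,kl,ijl->ij", diff, Omega, diff))
# d/dx_j K(x_i, x_j) = Omega (x_i - x_j) K_ij; the matrix nabla-K sums over j
gradK = np.einsum("kl,ijl,ij->ik", Omega, diff, K)

W = -np.linalg.solve(K + lam * np.eye(n), gradK)     # closed-form solution
w1, w2 = W[:, 0], W[:, 1]

tau_ds = (m - 3) / ((m - 1) * x[:, 1]) - 2 / (m - 1) * w2    # Eq. (9)
delta_ds = x[:, 0] + w1 / (m * np.clip(tau_ds, 0.05, None))  # Eq. (8)
```

For the RBF kernel, $\nabla_{z_j} K(z_i, z_j) = \Omega(z_i - z_j)K_{ij}$, which is what the second einsum computes before summing over $j$; the clipping of the estimated precision is our crude substitute for the constraint.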
The practical implementation requires a data-driven scheme for choosing $\lambda$. Consider the modified cross validation scheme of Brown et al. (2013), which involves splitting $Y_{ij}$ into two parts: $U_{ij} = Y_{ij} + \alpha\epsilon_{ij}$ and $V_{ij} = Y_{ij} - (1/\alpha)\epsilon_{ij}$, where $\epsilon_{ij} \sim N(0, \tau_i^{-1})$, and $U_{ij}$ and $V_{ij}$ are used to construct the estimator and to choose the tuning parameter, respectively. However, in our setup $\tau_i$ is unknown; hence we sample the $\epsilon_{ij}$'s from $N(0, S_i^2)$. Let $V_i = m_i^{-1}\sum_{j=1}^{m_i} V_{ij}$, $U_i = m_i^{-1}\sum_{j=1}^{m_i} U_{ij}$, $\mathcal U = \{U_{ij} : 1 \le i \le n,\, 1 \le j \le m_i\}$ and $\mathcal V = \{V_{ij} : 1 \le i \le n,\, 1 \le j \le m_i\}$. Then, conditional on $(\mu_i, \tau_i)$, $U_i$ and $V_i$ are approximately independent with mean $E(Y_i)$ when $m$ is large. Define $\vartheta_n(\lambda; \mathcal U, \mathcal V) = \frac{1}{n}\sum_{i=1}^n \{V_i - \delta_i^{ds}(U_i; \mathcal U, \lambda)\}^2$. The tuning parameter is chosen as $\hat\lambda := \arg\min_{\lambda\in\Lambda} \vartheta_n(\lambda)$. In our numerical studies of Sections 5 and 6, we set $\alpha = 1/2$ and $n^{-2}\Lambda = [1, 10]$. The tuning parameter $\hat\lambda$ is obtained from this scheme and then used to estimate $\mu_i$ via Equation (8).
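The splitting scheme can be sketched end to end as follows. This is our illustrative Python translation (with the unconstrained closed-form solver of Section 2.3 wrapped in a helper we define here, and a deliberately coarse grid for $\lambda$), not the authors' implementation:

```python
import numpy as np

def nest_closed_form(x, lam):
    """Unconstrained closed form W = -(K + lam I)^{-1} grad K (Section 2.3)."""
    Omega = np.linalg.inv(np.cov(x.T))
    diff = x[:, None, :] - x[None, :, :]
    K = np.exp(-0.5 * np.einsum("ijk,kl,ijl->ij", diff, Omega, diff))
    gradK = np.einsum("kl,ijl,ij->ik", Omega, diff, K)
    return -np.linalg.solve(K + lam * np.eye(len(x)), gradK)

rng = np.random.default_rng(3)
n, m, alpha = 300, 20, 0.5
tau = rng.gamma(5.0, 0.5, n)
mu = rng.normal(0.0, 0.5 / np.sqrt(tau))
Y = rng.normal(mu[:, None], (1 / np.sqrt(tau))[:, None], (n, m))
s2 = Y.var(axis=1, ddof=1)

# split each observation: U for fitting, V for validation (Brown et al., 2013);
# tau_i is unknown, so the noise is drawn with variance S_i^2
eps = rng.normal(0.0, np.sqrt(s2)[:, None], (n, m))
U, V = Y + alpha * eps, Y - eps / alpha
Ubar, Vbar, Us2 = U.mean(axis=1), V.mean(axis=1), U.var(axis=1, ddof=1)

def cv_loss(lam):
    W = nest_closed_form(np.column_stack([Ubar, Us2]), lam)
    tau_ds = (m - 3) / ((m - 1) * Us2) - 2 / (m - 1) * W[:, 1]
    delta = Ubar + W[:, 0] / (m * np.clip(tau_ds, 0.05, None))
    return np.mean((Vbar - delta) ** 2)   # validation criterion vartheta_n

grid = [0.1, 1.0, 10.0]
lam_hat = min(grid, key=cv_loss)
```

The chosen `lam_hat` would then be plugged back into Equation (8) on the original data; the coarse grid and the clipping are simplifications of ours.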
Remark 1. $\vartheta_n(\lambda; \mathcal U, \mathcal V)$ provides a practical criterion for choosing $\lambda$ and works well empirically in our numerical studies. However, $\vartheta_n$ is not an unbiased estimate of the true risk. We note in Section 5 that the relative risk of the data-driven NEST is slightly off from 1 when m = 10 (e.g. the left panels in Figs 1 & 2). As m increases, the covariance between $U_{ij}$ and $V_{ij}$ converges to 0 and the risk of our data-driven estimator is almost identical to that of an oracle (which selects the optimal $\lambda$ based on the true risk). The development of a SURE criterion is a challenging topic requiring further research. See also Ignatiadis & Wager (2019) for related discussions.
We are developing an R package, nest, to implement the NEST estimator in Definition 1.
The R code that reproduces the numerical results in Sections 5 and 6 can be downloaded from
the link: github.com/trambakbanerjee/DS paper.
3. THEORY
3.1. Kernelized Stein’s Discrepancy
In this section we introduce the kernelized Stein's discrepancy (KSD) measure (Liu et al., 2016; Chwialkowski et al., 2016) and discuss its connection to the quadratic program (7). While the KSD has been used in various contexts, including goodness of fit tests (Liu et al., 2016; Yang et al., 2018), variational inference (Liu & Wang, 2016) and Monte Carlo integration (Oates et al., 2017), its connections to compound estimation and empirical Bayes methodology were established only recently (Banerjee et al., 2020). The analysis in this and the following sections is geared towards the case $m_i = m$ for $i = 1, \ldots, n$. Under this setting, $(\boldsymbol x_1, \ldots, \boldsymbol x_n)$ constitute an i.i.d. random sample from $f_m(\boldsymbol x)$. The case of unequal $m_i$'s can be analyzed in a similar fashion by first assuming that the $m_i$'s are a random sample from a distribution with mass function $q(\cdot)$ and are independent of $(\mu_i, \tau_i)$. Then $\boldsymbol z = (\boldsymbol x, m)$ has distribution with density $p(\boldsymbol z) := q(m) f_m(\boldsymbol x)$ and, with $\boldsymbol z_i = (\boldsymbol x_i, m_i)$, $(\boldsymbol z_1, \ldots, \boldsymbol z_n)$ are realizations of an i.i.d. random sample from $p(\boldsymbol z)$.

Suppose $X$ and $X'$ are i.i.d. copies from the marginal distribution of $(Y, S^2)$ that has density $f$, wherein the dependence on $m$ is implicit. Denote by $\boldsymbol w(X)$ and $\boldsymbol w(X')$ the score functions, defined in Equation (5), at $X$ and $X'$ respectively. Suppose $\tilde f$ is an arbitrary density function on the support of $(Y, S^2)$, for which we similarly define $\tilde{\boldsymbol w}(X)$. The KSD, formally defined as
$$S(\tilde f, f) = E_{X, X' \sim f}\left[\{\tilde{\boldsymbol w}(X) - \boldsymbol w(X)\}^T K(X, X')\,\{\tilde{\boldsymbol w}(X') - \boldsymbol w(X')\}\right], \qquad (10)$$
provides a discrepancy measure between $\tilde f$ and $f$ in the sense that $S(\tilde f, f)$ tends to increase when there is a bigger disparity between $\tilde{\boldsymbol w}$ and $\boldsymbol w$ (or equivalently, between $\tilde f$ and $f$), and
$$S(\tilde f, f) \ge 0, \quad \text{with } S(\tilde f, f) = 0 \text{ if and only if } f = \tilde f.$$
The direct evaluation of $S(\tilde f, f)$ is difficult because $\boldsymbol w$ is unknown. Liu et al. (2016) introduced an alternative representation of the KSD that does not directly involve $\boldsymbol w$:
$$S(\tilde f, f) = E_f\, \kappa[\tilde{\boldsymbol w}(X), \tilde{\boldsymbol w}(X')](X, X') = E_f\left\{\frac{1}{n(n-1)}\sum_{i=1}^n\sum_{j=1}^n \kappa[\tilde{\boldsymbol w}(X_i), \tilde{\boldsymbol w}(X_j)](X_i, X_j)\, I(i \ne j)\right\} = E_f\{M_n(\tilde W)\} := M(\tilde W), \qquad (11)$$
where $X_1, \ldots, X_n$ is a random sample from $f$, $E_f$ denotes expectation under $f$, and
$$\kappa[\tilde{\boldsymbol w}(\boldsymbol x), \tilde{\boldsymbol w}(\boldsymbol x')](\boldsymbol x, \boldsymbol x') = \tilde{\boldsymbol w}(\boldsymbol x)^T \tilde{\boldsymbol w}(\boldsymbol x')\, K(\boldsymbol x, \boldsymbol x') + \tilde{\boldsymbol w}(\boldsymbol x)^T \nabla_{\boldsymbol x'} K(\boldsymbol x, \boldsymbol x') + \nabla_{\boldsymbol x} K(\boldsymbol x, \boldsymbol x')^T \tilde{\boldsymbol w}(\boldsymbol x') + \mathrm{trace}\{\nabla_{\boldsymbol x, \boldsymbol x'} K(\boldsymbol x, \boldsymbol x')\} \qquad (12)$$
is a smooth and symmetric positive definite kernel function associated with the U-statistic $M_n(\tilde W)$. The implementation of the NEST estimator in Definition 1 boils down to the estimation of $W_0$ via the convex program (7), which corresponds to minimizing
$$M_{\lambda(n), n}(\tilde W) = \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \kappa[\tilde{\boldsymbol w}(X_i), \tilde{\boldsymbol w}(X_j)](X_i, X_j) + \frac{\lambda(n)}{n^2}\|\tilde W\|_F^2 \qquad (13)$$
with respect to $\tilde W$. A key observation is that if the empirical criterion $M_{\lambda(n), n}(\tilde W)$ is asymptotically equal to the population KSD criterion $M(\tilde W)$, then minimizing $M_{\lambda(n), n}(\tilde W)$ with respect to $\tilde W = \{\tilde{\boldsymbol w}(\boldsymbol x_1), \ldots, \tilde{\boldsymbol w}(\boldsymbol x_n)\}$ is effectively the process of finding a $\tilde W$ that is as close as possible to $W_0$ in Equation (6). This intuitively justifies the NEST estimator in Definition 1. In what follows, we denote $\lambda(n)$ by $\lambda$ and keep its dependence on $n$ implicit. Next we show that NEST provides an asymptotically optimal solution to the compound estimation problem.
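As a sanity check on Equations (11) and (12), the U-statistic is simple to compute when the truth is known. The self-contained Python illustration below (ours, not the authors') uses a one-dimensional standard normal, whose true score is $w(x) = -x$, with an RBF kernel whose derivatives are worked out by hand; the KSD evaluated at the true score should be near zero, while a misspecified score yields a clearly positive value:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
x = rng.normal(0.0, 1.0, n)      # sample from f = N(0,1); true score is -x

def ksd(score):
    """U-statistic form of the KSD, Eqs. (11)-(12), RBF kernel, 1-D case."""
    d = x[:, None] - x[None, :]
    K = np.exp(-0.5 * d**2)
    w = score(x)
    kappa = (w[:, None] * w[None, :] * K   # w(x) w(x') K(x, x')
             + w[:, None] * d * K          # w(x) d/dx' K;  d/dx' K = d K
             - d * K * w[None, :]          # (d/dx K) w(x'); d/dx K = -d K
             + (1 - d**2) * K)             # trace term d^2/dx dx' K
    np.fill_diagonal(kappa, 0.0)           # the I(i != j) indicator in (11)
    return kappa.sum() / (n * (n - 1))

ksd_true = ksd(lambda t: -t)       # true score of N(0, 1)
ksd_wrong = ksd(lambda t: -t / 4)  # score of N(0, 4): misspecified
```

Zeroing the diagonal implements the $I(i \ne j)$ indicator of Equation (11); only the difference between the candidate and true scores contributes to the expectation, which is why the misspecified score is penalized.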
3.2. Asymptotic properties of NEST
This section studies the asymptotic properties of the NEST estimator. We begin by defining an oracle estimator $\boldsymbol\delta^* = (\delta_1^*, \ldots, \delta_n^*)$ for $\boldsymbol\mu$, where
$$\delta_i^* := \delta^*(y_i, s_i^2, m) = \frac{\zeta^\pi(y_i, s_i^2, m)}{\tau^\pi(y_i, s_i^2, m)} = y_i + \frac{1}{m\,\tau^\pi(y_i, s_i^2, m)}\, w_1(y_i, s_i^2, m), \qquad (14)$$
and $\zeta^\pi$ and $\tau^\pi$ are respectively the Bayes estimators of $\tau\mu$ and $\tau$ as defined in Equations (3) and (4). Viewing the proposed NEST estimator $\delta_i^{ds}(\lambda)$ as a data-driven approximation to $\delta_i^*$, we study the quality of this approximation for large $n$ and fixed $m$.

Remark 2. The oracle estimator $\boldsymbol\delta^*$, which serves as the benchmark in our theoretical analysis, is different from the Bayes estimator $\boldsymbol\delta^\pi$ defined in Section 2.1: $\boldsymbol\delta^\pi$ has full knowledge of the prior distributions $G_\mu(\cdot\mid\tau)$ and $H_\tau(\cdot)$, whereas $\boldsymbol\delta^*$ only has information on the true score functions $(w_{1i}, w_{2i})$. Under the special setting of conjugate priors, $\boldsymbol\delta^*$ and $\boldsymbol\delta^\pi$ coincide (Corollary 2).
We impose the following regularity conditions, where $f$ denotes the density function of the joint marginal distribution of $(Y, S^2)$ and $E_f$ denotes expectation under $f$.

Assumption 1. $E_f\,|\kappa[\tilde{\boldsymbol w}(X_i), \tilde{\boldsymbol w}(X_j)](X_i, X_j)|^2 < \infty$ for any $(i, j) \in \{1, \ldots, n\}$.

Assumption 2. $\int_{\mathcal X} \boldsymbol g(\boldsymbol x)^T K(\boldsymbol x, \boldsymbol x')\, \boldsymbol g(\boldsymbol x')\, d\boldsymbol x\, d\boldsymbol x' > 0$ for any $\boldsymbol g : \mathcal X \to \mathbb R^2$ such that $0 < \|\boldsymbol g\|_2^2 < \infty$.

Assumption 3. For some $\epsilon_i \in (0, 1)$, $i = 1, 2, 3$, $E_G\{\exp(\epsilon_1|\mu|)\} < \infty$, $E_H\{\exp(\epsilon_2/\tau)\} < \infty$ and $E_H\{\exp(\epsilon_3\tau)\} < \infty$.

Assumption 1 is a standard moment condition on $\kappa[\tilde{\boldsymbol w}(X_i), \tilde{\boldsymbol w}(X_j)](X_i, X_j)$ [see, for example, Section 5.5 in Serfling (2009)], which is needed for establishing the central limit theorem for the U-statistic $M_n(\tilde W)$ in Equation (11). Assumption 2 is a condition from Liu et al. (2016) and Chwialkowski et al. (2016) for ensuring that $K(\boldsymbol x, \boldsymbol x')$ is integrally strictly positive definite. This guarantees that the KSD $S(\tilde f, f)$ is a valid discrepancy measure in the sense that $S(\tilde f, f) \ge 0$ and $S(\tilde f, f) = 0$ if and only if $f = \tilde f$. Assumption 3 represents moment conditions on the prior distributions. Together, they ensure that with high probability $|\mu| \le \log n$ and $1/\log n \le \tau \le \log n$ as $n \to \infty$. This is formalized in Lemma 1 in Appendix A.4. These conditions allow a cleaner statement of our theoretical results. It is likely that Assumption 3 can be further relaxed, but we do not seek full generality here.
Theorem 1 below establishes the asymptotic consistency of the sample criterion $M_{\lambda,n}(\tilde W)$ around the population criterion $M(\tilde W)$.

THEOREM 1. If $\lambda(n)/\sqrt n \to 0$ as $n \to \infty$ then, under Assumption 1, we have
$$\left| M_{\lambda,n}(\tilde W) - M(\tilde W) \right| = O_p\!\left(n^{-1/2}\right).$$
Moreover, along with the fact that $M(W_0) = 0$, Theorem 1 establishes that $M_{\lambda,n}(\tilde W)$ is an appropriate optimization criterion.
Theorem 2 establishes the consistency of the estimated score functions.
THEOREM 2. If $\lim_{n\to\infty} c_n n^{-1/2} = 0$ then, under the conditions of Theorem 1 and Assumption 2, we have
$$\lim_{n\to\infty} P\left\{\frac{1}{n}\left\| W_n(\lambda) - W_0 \right\|_F^2 \ge c_n^{-1}\epsilon\right\} = 0, \quad \text{for any } \epsilon > 0.$$
Theorem 3 establishes the optimality theory of $\boldsymbol\delta^{ds}$ by showing that (a) the average squared error between $\boldsymbol\delta^{ds}(\lambda)$ and $\boldsymbol\delta^*$ is asymptotically small, and (b) the estimation loss of NEST converges in probability to that of its oracle counterpart as $n \to \infty$.

THEOREM 3. Suppose $\lambda(n)/\sqrt n \to 0$ as $n \to \infty$ and Assumptions 1-3 hold. If $\lim_{n\to\infty} c_n n^{-1/2}\log^6 n = 0$, then $\frac{c_n}{n}\left\|\boldsymbol\delta^{ds}(\lambda) - \boldsymbol\delta^*\right\|_2^2 = o_p(1)$. Furthermore, if additionally $\lim_{n\to\infty} c_n n^{-1/4}\log^6 n = 0$, then $c_n\left| l_n(\boldsymbol\delta^{ds}(\lambda), \boldsymbol\mu) - l_n(\boldsymbol\delta^*, \boldsymbol\mu)\right| = o_p(1)$.
As we mentioned earlier, $\delta_i^*$ in Equation (14) is not, in general, the Bayes estimator $\delta_i^\pi$ of $\mu_i$. This follows since, dropping the subscript $i$,
$$\delta^*(y, s^2) = \frac{E(\tau\mu \mid y, s^2)}{E(\tau \mid y, s^2)} = \frac{E\{\tau\, E(\mu \mid y, s^2, \tau) \mid y, s^2\}}{E(\tau \mid y, s^2)} \ne E(\mu \mid y, s^2) = \delta^\pi(y, s^2)$$
unless $\mu$ and $\tau$ are conditionally independent given $y$ and $s^2$, in which case $\delta^*(y, s^2) = \delta^\pi(y, s^2)$. A natural setting where $E(\mu \mid y, s^2, \tau)$ is indeed independent of $\tau$ is the popular scenario of conjugate priors, under which the posterior expectation of $\mu$ is a linear combination of the prior expectation of $\mu$ and the maximum likelihood estimate $y$ (Diaconis & Ylvisaker, 1979).
COROLLARY 2. Consider hierarchical Model (1) where $G_\mu(\cdot\mid\tau)$ and $H_\tau(\cdot)$ are, respectively, conjugate prior distributions of $\mu\mid\tau$ and $\tau$. With $m > 3$, let $\boldsymbol\delta^\pi$ denote the Bayes estimator of $\boldsymbol\mu$. Under the conditions of Theorem 3, if $\lim_{n\to\infty} c_n n^{-1/2}\log^6 n = 0$, then
$$\frac{c_n}{n}\left\|\boldsymbol\delta^{ds}(\lambda) - \boldsymbol\delta^\pi\right\|_2^2 = o_p(1).$$
Furthermore, under the same conditions, if $\lim_{n\to\infty} c_n n^{-1/4}\log^6 n = 0$, then
$$c_n\left| l_n(\boldsymbol\delta^{ds}(\lambda), \boldsymbol\mu) - l_n(\boldsymbol\delta^\pi, \boldsymbol\mu)\right| = o_p(1).$$
4. EXTENSIONS
This section considers the extension of our methodology to several well known members of the two-parameter exponential family, focusing on settings where the nuisance parameter is known. Although our proposed estimation framework is motivated by the double shrinkage idea, the approach nonetheless handles the case with known nuisance parameters. We discuss four examples, in each of which we derive the Bayes estimator of the natural parameter in a manner similar to Corollary 1. The Bayes estimator in these examples relies on the unknown score function (of the marginal density of the sufficient statistic), which can be efficiently estimated using the ideas in Section 2.2. The numerical performance of the new estimators considered in this section is investigated in Appendix B.3.
Example 1 (Location mixture of Gaussians). Consider the following hierarchical model:
$$Y_i \mid \mu_i, \tau_i \overset{\text{ind.}}{\sim} N(\mu_i, 1/\tau_i), \quad \mu_i \overset{\text{i.i.d.}}{\sim} G_\mu(\cdot), \quad i = 1, \ldots, n, \qquad (15)$$
where the $\tau_i$ are known and $G_\mu(\cdot)$ is an unspecified prior. Equation (15) represents the heteroscedastic normal means problem with known variances $1/\tau_i$ [see, for example, Weinstein et al. (2018)]. In this setting, the sufficient statistic for $\mu_i$ is $Y_i$ and the Bayes estimator of $\mu_i$ is given by
$$\mu_i^\pi := E(\mu_i \mid y_i, \tau_i) = y_i + \frac{1}{\tau_i}\frac{\partial}{\partial y_i}\log f(y_i \mid \tau_i),$$
where $f(\cdot \mid \tau_i)$ is the density of the distribution of $Y_i$ given $\tau_i$, marginalizing out $\mu_i$. From Section 2.2 and with $m_i = 1$, $\boldsymbol x_i = (y_i, \tau_i)$, the NEST estimate of $\mu_i$ is given by $\delta_i^{nest}(\lambda) = y_i + \frac{1}{\tau_i}\, w^i_{1,\lambda}$.
Example 2 (Scale mixture of Gamma distributions). Consider the following model:
$$Y_{ij} \mid \alpha_i, 1/\beta_i \overset{\text{i.i.d.}}{\sim} \Gamma(\alpha_i, 1/\beta_i), \qquad 1/\beta_i \overset{\text{i.i.d.}}{\sim} G(\cdot), \tag{16}$$
where the shape parameters $\alpha_i$ are known and $G(\cdot)$ is an unspecified prior distribution on the scale parameters $1/\beta_i$. Here $T_i = \sum_{j=1}^{m} Y_{ij}$ is a sufficient statistic and $T_i \mid \alpha_i, \beta_i \overset{\text{ind.}}{\sim} \Gamma(m\alpha_i, 1/\beta_i)$. The posterior distribution of $1/\beta_i$ belongs to a one-parameter exponential family with density
$$f(1/\beta_i \mid T_i, \alpha_i) \propto \exp\big\{-T_i\beta_i + (m\alpha_i - 1)\log T_i - \log f(T_i\mid\alpha_i)\big\}, \tag{17}$$
where $f(\cdot\mid\alpha_i)$ is the density of $T_i$ given $\alpha_i$, marginalizing out $1/\beta_i$. From Equation (17), the Bayes estimator of $\beta_i$ is given by
$$\beta_i^{\pi} := E(\beta_i\mid T_i, \alpha_i) = \frac{m\alpha_i - 1}{T_i} - \frac{\partial}{\partial T_i}\log f(T_i\mid\alpha_i).$$
With $x_i = (T_i, \alpha_i)$, the NEST estimate of $\beta_i$ is given by $\delta_i^{nest}(\lambda) = (m\alpha_i - 1)/T_i - w_{i1,\lambda}$.
Example 3 (Shape mixture of Gamma distributions). We consider the following model:
$$Y_{ij} \mid \alpha_i, 1/\beta_i \overset{\text{i.i.d.}}{\sim} \Gamma(\alpha_i, 1/\beta_i), \qquad \alpha_i \overset{\text{i.i.d.}}{\sim} G(\cdot), \tag{18}$$
where the scale parameters $1/\beta_i$ are known and $G(\cdot)$ is an unspecified prior distribution on the shape parameters $\alpha_i$. Let $Y_i = \sum_{j=1}^{m} Y_{ij}$. Then $Y_i \mid \alpha_i, \beta_i \overset{\text{ind.}}{\sim} \Gamma(m\alpha_i, 1/\beta_i)$ and $T_i = \log Y_i$ is a sufficient statistic. Moreover, the posterior distribution of $\alpha_i$ belongs to a one-parameter exponential family with density
$$f(\alpha_i \mid T_i, 1/\beta_i) \propto \exp\big\{(m\alpha_i)T_i - \beta_i\exp(T_i) - \log f(T_i\mid 1/\beta_i)\big\}, \tag{19}$$
where $f(\cdot\mid 1/\beta_i)$ is the density of $T_i$ given $1/\beta_i$, marginalizing out $\alpha_i$. From Equation (19), the Bayes estimator of $\alpha_i$ is given by
$$\alpha_i^{\pi} := E(\alpha_i\mid T_i, 1/\beta_i) = \frac{\beta_i\exp(T_i)}{m} + \frac{1}{m}\,\frac{\partial}{\partial T_i}\log f(T_i\mid 1/\beta_i).$$
With $x_i = (T_i, 1/\beta_i)$, the NEST estimate of $\alpha_i$ is $\delta_i^{nest}(\lambda) = \beta_i\exp(T_i)/m + m^{-1} w_{i1,\lambda}$.
Example 4 (Scale mixture of Weibulls). We consider the following model:
$$Y_{ij} \mid k_i, \beta_i \overset{\text{i.i.d.}}{\sim} \text{Weibull}(k_i, \beta_i), \qquad \beta_i \overset{\text{i.i.d.}}{\sim} G(\cdot), \tag{20}$$
with density $f(y\mid k, \beta) = \beta k y^{k-1}\exp(-\beta y^k)$. In Equation (20) the shape parameters $k_i$ are known, $G(\cdot)$ is an unspecified prior distribution on the scale parameters $\beta_i$, $T_i = \sum_{j=1}^{m} Y_{ij}^{k_i}$ is a sufficient statistic, and $T_i \mid k_i, 1/\beta_i \overset{\text{ind.}}{\sim} \Gamma(m, 1/\beta_i)$. From Example 2 with $\alpha_i = 1$, the Bayes estimator of $\beta_i$ is
$$\beta_i^{\pi} := E(\beta_i\mid T_i, k_i) = \frac{m - 1}{T_i} - \frac{\partial}{\partial T_i}\log f(T_i\mid k_i).$$
With $x_i = (T_i, k_i)$, the NEST estimate of $\beta_i$ is given by $\delta_i^{nest}(\lambda) = (m - 1)/T_i - w_{i1,\lambda}$.
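A quick sanity check on the formulas in Examples 2 and 4: if the prior $G$ degenerates to a point mass at $1/\beta_0$, then $T$ is marginally $\Gamma(m\alpha, 1/\beta_0)$, its log-density has derivative $(m\alpha - 1)/T - \beta_0$, and the Bayes rule returns $\beta_0$ for every $T$. The values of $m$, $\alpha$, $\beta_0$ and the evaluation points below are arbitrary illustrative choices.

```python
# Degenerate-prior check of the Example 2 Bayes rule: with a point-mass prior at
# 1/beta0, T | alpha ~ Gamma(m*alpha, scale 1/beta0) marginally, so
# d/dT log f(T | alpha) = (m*alpha - 1)/T - beta0 and the rule recovers beta0 exactly.
m, alpha, beta0 = 8, 2.5, 1.7    # arbitrary illustrative values

def score(T):
    return (m * alpha - 1.0) / T - beta0   # exact marginal score under the point mass

for T in (5.0, 11.76, 20.0):
    beta_pi = (m * alpha - 1.0) / T - score(T)
    print(T, beta_pi)                       # equals beta0 for every T
```

The cancellation of the $(m\alpha - 1)/T$ terms is exactly what makes the score the only unknown ingredient in these rules.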
The preceding examples present settings with a known nuisance parameter. When both parameters are unknown, extending our estimation framework to an arbitrary member of the two-parameter exponential family is difficult. The main reason is that in the Gaussian case the sufficient statistics are independent and their marginal distributions are known, whereas for other distributions, such as the Gamma and Beta, the joint distribution of the two sufficient statistics is generally unknown. This impedes a full generalization of our approach. We anticipate that an iterative scheme that conducts shrinkage estimation on the primary and nuisance coordinates in turn may be developed by combining the ideas in Examples 2 and 3 above. We do not pursue those extensions in this article.
5. NUMERICAL EXPERIMENTS
5.1. Compound Estimation of Normal Means
We focus on the hierarchical Model (1) and compare six approaches for estimating $\mu$. These approaches fall into three types. The first consists of the NEST oracle method, which conducts double shrinkage with $\lambda$ chosen to minimize the true loss (DS orc), and the proposed NEST method, where $\lambda$ is chosen using modified cross-validation (DS). The second comprises linear shrinkage methods: the group linear estimator (Grp Linear) of Weinstein et al. (2018); the semi-parametric monotonically constrained SURE estimator that shrinks towards the grand mean (XKB.SG) from Xie et al. (2012); and, from the same paper, the parametric SURE estimator that shrinks towards a general data-driven location (XKB.M). These methods assume that the true variances are known and, for implementation, we employ sample variances. The third type is the g-modelling approach of Gu & Koenker (2017a,b). This method is the nearest competitor to NEST as it estimates the joint prior distribution of the mean and variance using nonparametric maximum likelihood estimation (NPMLE) techniques (Kiefer & Wolfowitz, 1956; Laird, 1978). For the linear shrinkage methods we use code provided by Weinstein et al. (2018), while for NPMLE we rely on the R package REBayes (Koenker & Gu, 2017).
The aforementioned six approaches are evaluated in five simulation settings, with the goal of assessing the relative performance of the competing estimators as the heterogeneity in the variances $\sigma_i^2$ is varied while the sample sizes $m_i$ are kept fixed at $m$. The five settings fall into four types: a setting where means and variances are independent; a setting where means and variances are correlated; a sparse setting; and two settings that represent departures from the Normal data-generating model. For each setting we set $n = 1{,}000$ and compute the average squared error risk of each competing estimator of $\mu$ across 50 Monte Carlo repetitions. Figures 1 to 5 plot the relative risk, namely the ratio of the average squared error risk of a competing estimator to that of DS orc, so that a ratio bigger than 1 represents poorer risk performance relative to the NEST oracle method.
The first setting, Figure 1, corresponds to the independent case. Here, for each $i = 1, \ldots, n$, $\mu_i \overset{\text{i.i.d.}}{\sim} 0.7\,N(0, 0.1) + 0.3\,N(\pm 1, 3)$ and $\sigma_i^2 \overset{\text{i.i.d.}}{\sim} U(0.5, u)$, where we let $u$ vary across four levels: 0.5, 1, 2, 3. The three plots in Figure 1 show the relative risks as $u$ varies for $m = 10, 30$ and $50$ (left to right). For $m = 10$, the competing methods split into two levels of performance: the group with the lowest relative risks consists of NPMLE and DS, while the three linear shrinkage methods exhibit substantially higher relative risks. Moreover, as heterogeneity increases with increasing $u$, the gap between the two groups' relative risks widens, indicating that both NPMLE and the proposed NEST method are particularly useful for compound estimation of normal means when the variances are unknown and heterogeneous and the sample size for estimating those variances is itself small. As $m$ increases, the performance of the three linear shrinkage methods improves, which is expected as there are now more replicates per study unit with which to construct a relatively reliable estimate of the unknown variances. However, the performance of DS improves too, and in particular at $m = 50$ (Figure 1, right) NPMLE exhibits a slightly higher relative risk than DS.
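The relative-risk normalization used throughout Figures 1 to 5 is easy to mimic in a few lines. The sketch below is a simplified homoscedastic stand-in, not the paper's simulation code: it contrasts the unshrunk means with the oracle Bayes rule and reports a ratio above 1, in the same way the figures normalize each competitor by DS orc.

```python
import numpy as np

rng = np.random.default_rng(7)
n, m, A, sigma2, reps = 1000, 10, 1.0, 2.0, 50   # assumed toy configuration

se_naive, se_bayes = 0.0, 0.0
for _ in range(reps):
    mu = rng.normal(0.0, np.sqrt(A), n)
    ybar = rng.normal(mu, np.sqrt(sigma2 / m))    # sufficient statistic per unit
    shrink = A / (A + sigma2 / m)                 # oracle Bayes shrinkage factor
    se_naive += np.mean((ybar - mu) ** 2)
    se_bayes += np.mean((shrink * ybar - mu) ** 2)

rel_risk = se_naive / se_bayes   # > 1: the unshrunk estimator loses to the oracle
print(rel_risk)
```

Here the theoretical ratio is $(A + \sigma^2/m)/A = 1.2$, so the Monte Carlo relative risk should sit close to that value.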
The second setting, Figure 2, corresponds to the correlated case. The precisions $\tau_i = 1/\sigma_i^2$ are generated independently from a gamma mixture, with an even chance of drawing $\Gamma(20, \text{rate} = 20)$ or $\Gamma(20, \text{rate} = u)$, and given $\tau_i$ the means $\mu_i$ are independently $N(\pm 0.5/\tau_i, 0.5^2)$. In this
[Figure 1 here: relative risk versus $u$ for DS, Grp Linear, NPMLE, XKB.M and XKB.SG.]
Fig. 1. Comparison of relative risks when $(\mu_i, \sigma_i^2)$ are independent. Here $\mu_i \overset{\text{i.i.d.}}{\sim} 0.7\,N(0, 0.1) + 0.3\,N(\pm 1, 3)$ and $\sigma_i^2 \overset{\text{i.i.d.}}{\sim} U(0.5, u)$. Plots show $m = 10, 30, 50$ left to right.
setting, the magnitude of the variances increases with $u$ and the means grow with the variances. Again, for the left plot ($m = 10$), the same groups as in Figure 1 perform well, although Grp Linear has a lower relative risk than the other two linear shrinkage methods considered here. However, the pattern as $m$ grows is more pronounced than before: the NEST method maintains the lowest risk across all three plots.
[Figure 2 here: relative risk versus $u$ for DS, Grp Linear, NPMLE, XKB.M and XKB.SG.]
Fig. 2. Comparison of relative risks for correlated $(\mu_i, \sigma_i^2)$. Here $\tau_i = 1/\sigma_i^2 \overset{\text{i.i.d.}}{\sim} 0.5\,\Gamma(20, \text{rate} = 20) + 0.5\,\Gamma(20, \text{rate} = u)$ and $\mu_i\mid\tau_i \overset{\text{ind.}}{\sim} N(\pm 0.5/\tau_i, 0.5^2)$. Plots show $m = 10, 30, 50$ left to right.
The third setting, Figure 3, corresponds to the sparse case. The precisions $\tau_i$ are drawn from the same gamma mixture as in Figure 2, but each $\mu_i$ comes from $N(\pm 0.5/\tau_i, 1)$ with probability 0.3 and from a point mass at 0 with probability 0.7. We see similar patterns to those in Figures 1 and 2 at $m = 10$. For $m = 30$ and $50$, the relative risks of the linear shrinkage methods are now higher than their levels at $m = 10$. This is not surprising: while the risk performance of all methods improves with larger sample sizes, NPMLE and DS exhibit a bigger improvement in risk than Grp Linear, XKB.M and XKB.SG.
The fourth and fifth settings, Figures 4 and 5, correspond to settings where the data $Y_{ij}\mid(\mu_i, \sigma_i^2)$ are not normally distributed. In Figure 4, the $Y_{ij}\mid(\mu_i, \sigma_i^2)$ are generated independently
[Figure 3 here: relative risk versus $u$ for DS, Grp Linear, NPMLE, XKB.M and XKB.SG.]
Fig. 3. Comparison of relative risks when $\mu$ is sparse. Here $\tau_i \overset{\text{i.i.d.}}{\sim} 0.5\,\Gamma(20, \text{rate} = 20) + 0.5\,\Gamma(20, \text{rate} = u)$ and $\mu_i\mid\tau_i \overset{\text{ind.}}{\sim} 0.7\,\delta(0) + 0.3\,N(\pm 0.5/\tau_i, 0.5^2)$. Plots show $m = 10, 30, 50$ left to right.
from a uniform distribution between $\mu_i \pm \sqrt{3}\sigma_i$, the $\sigma_i^2$ are sampled independently from a $N(u, 1)$ distribution truncated below at 0.01, and $\mu_i\mid\sigma_i \overset{\text{ind.}}{\sim} 0.8\,N(\sigma_i^2/4, 0.25) + 0.2\,N(\sigma_i^2, 1)$.

[Figure 4 here: relative risk versus $u$ for DS, Grp Linear, NPMLE, XKB.M and XKB.SG.]
Fig. 4. Comparison of relative risk for non-normal data. Here $Y_{ij}\mid\mu_i, \sigma_i \overset{\text{i.i.d.}}{\sim} U(\mu_i - \sqrt{3}\sigma_i, \mu_i + \sqrt{3}\sigma_i)$, the $\sigma_i^2$ are sampled independently from $N(u, 1)$ truncated below at 0.01 and $\mu_i\mid\sigma_i^2 \overset{\text{ind.}}{\sim} 0.8\,N(0.25\sigma_i^2, 0.25) + 0.2\,N(\sigma_i^2, 1)$. Plots show $m = 10, 30, 50$ left to right.

For Figure 5, the $Y_{ij}\mid(\mu_i, \sigma_i^2)$ are generated with additional non-normal noise, as $N(\mu_i, \sigma_i^2) + \text{Lap}(0, 1)$, with $\sigma_i^2 \overset{\text{i.i.d.}}{\sim} U(0.1, u)$ and $\mu_i\mid\sigma_i \overset{\text{ind.}}{\sim} 0.8\,N(\sigma_i^2/2, 0.5) + 0.2\,N(2\sigma_i^2, 2)$. Across both these settings,
and for Setting 5 (Figure 5) in particular, the proposed NEST method demonstrates robustness to departures from the Normal model. Proposition 7 in Barp et al. (2019) guarantees that, in general, the influence function of minimum KSD estimators, such as the NEST estimator, is bounded under data corruption; the behavior of the NEST estimator in Settings 4 and 5 is potentially an instance of this robustness property.
In Appendix B we consider additional simulation experiments in which we evaluate the risk performance of these competing estimators with $n$ fixed at 100 (Appendix B.1) and with unequal sample sizes $m_i$ (Appendix B.2).
[Figure 5 here: relative risk versus $u$ for DS, Grp Linear, NPMLE, XKB.M and XKB.SG.]
Fig. 5. Comparison of relative risk for non-normal data. Here $Y_{ij}\mid\mu_i, \sigma_i \overset{\text{i.i.d.}}{\sim} N(\mu_i, \sigma_i^2) + \text{Lap}(0, 1)$, $\sigma_i^2 \overset{\text{i.i.d.}}{\sim} U(0.1, u)$ and $\mu_i\mid\sigma_i^2 \overset{\text{ind.}}{\sim} 0.8\,N(0.5\sigma_i^2, 0.5) + 0.2\,N(2\sigma_i^2, 2)$. Plots show $m = 10, 30, 50$ left to right.
5.2. Compound Estimation of Ratios
In this section we demonstrate the use of the NEST estimation framework for compound estimation of the $n$ ratios $\theta_i = \sqrt{m_i}\,\mu_i/\sigma_i$, which represent a popular financial metric for assessing mutual fund performance (see Section 6.2 for a related real data application involving compound estimation of mutual fund Sharpe ratios). We evaluate the performance of the same six methods with $n$ fixed at 1,000 and $m_i = m$ for $i = 1, \ldots, n$. The data $Y_{ij}$ are generated independently from $N(\mu_i, \sigma_i^2)$, the variances $\sigma_i^2$ are simulated as in Figure 5, and the means are independently drawn from a mixture model with half chance $N(-\sigma_i^2/2, 0.5^2)$ and half chance $N(\sigma_i^2/2, 0.5^2)$. Figure 6 shows the relative risk performance of the competing estimators of $\theta = (\theta_1, \ldots, \theta_n)$. We continue to see that DS is among the better performers and, especially for $m > 10$, it dominates the other methods. As $u$ increases the heterogeneity in the data grows, and the relative risks of the linear shrinkage methods across all $m$ first decrease and then increase. The shift in the behavior of these estimators is related to the observation that, as $u$ increases, the centers of the mixture model that generates the $\mu_i$ are, on average, further apart. This makes estimating the numerator of the ratio easier for all methods up to a point; as heterogeneity increases further, the risks of the methods that use the sample standard deviation in the denominator of $\theta_i$ deteriorate relative to the risk of DS. Moreover, NPMLE exhibits a substantially higher relative risk than DS when $m > 10$, and at $m = 50$ the linear shrinkage methods are on par with NPMLE.
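One reason methods that plug the sample standard deviation into the denominator degrade here is that the naive ratio $\sqrt{m}\,\bar Y/S$ is biased upward: for normal data $\bar Y$ and $S$ are independent, and $E(\sigma/S) = \sqrt{(m-1)/2}\,\Gamma\{(m-2)/2\}/\Gamma\{(m-1)/2\} > 1$ by Jensen's inequality. A small Monte Carlo check with toy values (a single common $(\mu, \sigma)$ across units, which is not the paper's design):

```python
import numpy as np
from math import gamma, sqrt

rng = np.random.default_rng(3)
n, m, mu, sigma = 200_000, 10, 0.5, 1.0   # toy configuration, not the paper's design
theta = sqrt(m) * mu / sigma              # common target ratio

Y = rng.normal(mu, sigma, size=(n, m))
naive = sqrt(m) * Y.mean(axis=1) / Y.std(axis=1, ddof=1)

# For normal data, Ybar and S are independent, so E(naive) = theta * E(sigma/S) with
# E(sigma/S) = sqrt((m-1)/2) * Gamma((m-2)/2) / Gamma((m-1)/2) > 1 by Jensen.
bias_factor = sqrt((m - 1) / 2) * gamma((m - 2) / 2) / gamma((m - 1) / 2)
print(naive.mean() / theta, bias_factor)
```

For $m = 10$ the bias factor is roughly 1.09, so even before any shrinkage of the means, an estimator that also shrinks the variances has room to improve on the plug-in ratio.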
6. REAL DATA ANALYSES
6.1. Baseball Data
We analyze the monthly data on the number of "at bats" and "hits" for all U.S. Major League baseball players from the regular seasons of 2002–2011. In this analysis we focus on both pitchers and non-pitchers, using an approach similar to that of Gu & Koenker (2017a). The data are available from the R package REBayes and have been aggregated into half-seasons to produce an unbalanced panel. It includes observations on 932 players who have at least ten at bats in any half-season and appear in no fewer than five half-seasons (there are a total of 20 half-seasons in which a player can appear).
[Figure 6 here: relative risk versus $u$ for DS, Grp Linear, NPMLE, XKB.M and XKB.SG.]
Fig. 6. Comparison of relative risk for estimating $\theta$. Here $\sigma_i^2 \overset{\text{i.i.d.}}{\sim} U(0.1, u)$ and $\mu_i\mid\sigma_i \overset{\text{ind.}}{\sim} N(\pm 0.5\sigma_i^2, 0.5^2)$. Plots show $m = 10, 30, 50$ left to right.
Following Brown (2008), let the transformed batting average $Y_{ij}$ for player $i\ (= 1, \ldots, n)$ at time $j\ (= 1, \ldots, m_i)$ be
$$Y_{ij} = \arcsin\sqrt{\frac{H_{ij} + 0.25}{N_{ij} + 0.5}},$$
where $H_{ij}$ denotes the number of "hits" and $N_{ij}$ the number of "at bats" at time $j$ for player $i$. We assume that $Y_{ij} \sim N(\mu_i, v_{ij}^2/\tau_i)$, where $\mu_i = \arcsin(\sqrt{p_i})$, $p_i$ being player $i$'s batting success probability, and $v_{ij}^2 = 1/(4N_{ij})$. Here the $1/\tau_i$ are player-specific scale parameters as described in Gu & Koenker (2017a). Under this setup, the sufficient statistics are
$$\hat\mu_i = \Big(\sum_{j=1}^{m_i} 1/v_{ij}^2\Big)^{-1}\sum_{j=1}^{m_i} Y_{ij}/v_{ij}^2 \sim N(\mu_i, \bar v_i^2/\tau_i) \quad\text{with } \bar v_i^2 = \Big(4\sum_{j=1}^{m_i} N_{ij}\Big)^{-1},$$
$$S_i^2 = \frac{1}{m_i - 1}\sum_{j=1}^{m_i} (Y_{ij} - \hat\mu_i)^2/v_{ij}^2 \quad\text{with } (m_i - 1)S_i^2\tau_i \sim \chi^2_{m_i - 1}.$$
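The transform and sufficient statistics above translate directly into code. The helper below is hypothetical (its name and the example hits/at-bats values are not from the paper) and implements only the formulas in this section:

```python
import numpy as np

def batting_stats(H, N):
    """Arcsine-transformed averages and per-player sufficient statistics.

    H, N: hits and at-bats for one player across m_i half-seasons.
    Returns (mu_hat, vbar2, S2) as defined in Section 6.1.
    """
    H, N = np.asarray(H, float), np.asarray(N, float)
    Y = np.arcsin(np.sqrt((H + 0.25) / (N + 0.5)))
    v2 = 1.0 / (4.0 * N)                      # v_ij^2 = 1/(4 N_ij)
    mu_hat = np.sum(Y / v2) / np.sum(1.0 / v2)  # precision-weighted mean
    vbar2 = 1.0 / (4.0 * np.sum(N))           # its sampling-variance scale
    S2 = np.sum((Y - mu_hat) ** 2 / v2) / (len(Y) - 1)
    return mu_hat, vbar2, S2

mu_hat, vbar2, S2 = batting_stats([20, 35, 15], [80, 120, 60])
print(mu_hat, vbar2, S2)
```

Note that weighting by $1/v_{ij}^2 = 4N_{ij}$ means half-seasons with more at bats dominate the player's summary, as intended.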
In this analysis, the goal is to use the 2002–2011 data to predict the batting averages of the players in 2012. Players are divided into three categories: all, non-pitchers, and pitchers. We consider the following seven estimators of $\mu_i$: two nonparametric maximum likelihood based estimators, denoted NPMLE-Indep and NPMLE-Dep, which assume, respectively, independent and dependent priors on $(\mu_i, 1/\tau_i)$; the sufficient statistics $\hat\mu = (\hat\mu_1, \ldots, \hat\mu_n)$ of $\mu$; the grand mean across all players in the 2011 season, $\bar Y_{2011} = n^{-1}\sum_{i=1}^n Y_{i,2011}$; the proposed NEST estimator (DS) and its oracle counterpart (DS orc); and the naive estimator that uses the 2011 batting averages, $Y_{2011} = (Y_{1,2011}, \ldots, Y_{n,2011})$. To assess how well these methods predict the 2012 batting averages $Y_{2012} = (Y_{1,2012}, \ldots, Y_{n,2012})$, we consider three criteria for evaluating any estimate $\delta_i$ of $\mu_i$: total squared error from Brown (2008), defined as
$$\mathrm{TSE}(\delta) = \sum_{i=1}^n \big\{(Y_{i,2012} - \delta_i)^2 - (4N_{i,2012})^{-1}\big\};$$
normalized squared error from Gu & Koenker (2017a), defined as
$$\mathrm{NSE}(\delta) = \sum_{i=1}^n 4N_{i,2012}(Y_{i,2012} - \delta_i)^2;$$
and total squared error on a probability scale from Jiang et al. (2010), defined as
$$\mathrm{TSE}_p(\hat p) = \sum_{i=1}^n \big\{(p_{i,2012} - \hat p_i)^2 - p_{i,2012}(1 - p_{i,2012})(4N_{i,2012})^{-1}\big\},$$
where $\hat p_i = \sin^2(\delta_i)$ and $p_{i,2012} = \sin^2(Y_{i,2012})$.
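The three prediction criteria can be written compactly; the function names below are hypothetical and follow the definitions just given, with small made-up input vectors for illustration.

```python
import numpy as np

def tse(delta, y2012, n2012):
    # Brown (2008): total squared error, debiased by the binomial sampling variance
    return np.sum((y2012 - delta) ** 2 - 1.0 / (4.0 * n2012))

def nse(delta, y2012, n2012):
    # Gu & Koenker (2017a): squared error weighted by the at-bat precision 4*N
    return np.sum(4.0 * n2012 * (y2012 - delta) ** 2)

def tse_p(delta, y2012, n2012):
    # Jiang et al. (2010): error on the probability scale p = sin^2(.)
    p_hat, p = np.sin(delta) ** 2, np.sin(y2012) ** 2
    return np.sum((p - p_hat) ** 2 - p * (1.0 - p) / (4.0 * n2012))

delta = np.array([0.55, 0.60])
y2012 = np.array([0.58, 0.52])
n2012 = np.array([100.0, 150.0])
print(tse(delta, y2012, n2012), nse(delta, y2012, n2012), tse_p(delta, y2012, n2012))
```

The subtraction terms in TSE and TSE$_p$ remove the irreducible sampling noise of the 2012 observations, so these criteria can legitimately take small or even negative values for good predictions.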
Table 1. Performance of the competing estimators relative to the performance of the naive estimator $Y_{2011}$. Here $\mathrm{R\text{-}TSE}(\delta) = \mathrm{TSE}(\delta)/\mathrm{TSE}(Y_{2011})$, with similar definitions for R-NSE and R-TSE$_p$. The smallest two relative errors are bolded

                                      NPMLE-Indep  NPMLE-Dep  $\hat\mu$  $\bar Y_{2011}$  DS     DS orc
All
  n for estimation: 932    R-TSE      0.463        0.469      0.352      1.958            0.351  0.348
  n for prediction: 370    R-NSE      0.668        0.674      0.676      1.495            0.672  0.671
                           R-TSE$_p$  0.582        0.591      0.501      1.783            0.499  0.497
Nonpitchers
  n for estimation: 792    R-TSE      0.535        0.546      0.503      0.551            0.478  0.477
  n for prediction: 325    R-NSE      0.656        0.661      0.679      0.973            0.663  0.663
                           R-TSE$_p$  0.651        0.663      0.619      0.682            0.593  0.593
Pitchers
  n for estimation: 140    R-TSE      0.659        0.655      0.629      0.804            0.602  0.601
  n for prediction: 45     R-NSE      0.659        0.659      0.649      0.769            0.598  0.596
                           R-TSE$_p$  0.124        0.117      0.133      0.334            0.076  0.074
In Table 1 we report the performance of the competing estimators relative to the performance of the naive estimator $Y_{2011}$, where $\mathrm{R\text{-}TSE}(\delta) = \mathrm{TSE}(\delta)/\mathrm{TSE}(Y_{2011})$, with similar definitions for R-NSE and R-TSE$_p$. Thus, a smaller value of R-TSE, R-NSE or R-TSE$_p$ indicates relatively better prediction error. For at least two of the three performance metrics, DS exhibits the best relative risk in every player category. It is interesting to note that the sufficient statistics $\hat\mu$ are quite competitive in this example, while NPMLE with independent priors dominates the version with dependent priors for "All" and "Nonpitchers". The compound estimation problem for "Pitchers" is an example of a setting where $n$ is relatively small, and there DS demonstrates better risk performance than NPMLE on all three metrics. We report similar performance of DS in such small-$n$ settings in the numerical experiments of Appendix B.1.
6.2. Mutual Fund Sharpe Ratios
In this section we analyze a dataset of $n_1 = 5{,}000$ monthly mutual fund returns spanning the 12 months from January 2014 to December 2014, sourced from Wharton Research Data Services (Wharton School, 1993). The goal of this analysis is to use Sharpe ratios constructed from the data on the first 6 months, January 2014 – June 2014, to predict the corresponding Sharpe ratios for the next 6 months. Formally, let $Y_{ij}$ denote the excess return of fund $i\ (= 1, \ldots, n_1)$ in month $j\ (= 1, \ldots, m_{i,1})$ over the return on the 3-month treasury yield. Denote by
$$\bar Y_i = m_{i,1}^{-1}\sum_{j=1}^{m_{i,1}} Y_{ij}, \qquad S_i^2 = (m_{i,1} - 1)^{-1}\sum_{j=1}^{m_{i,1}} (Y_{ij} - \bar Y_i)^2, \qquad \delta_i^{naive} = \bar Y_i/\sqrt{S_i^2},$$
respectively, the sample mean, the sample variance and the observed Sharpe ratio of the monthly excess returns. In our analysis we include funds that have at least 4 months' worth of returns available during January 2014 – June 2014, so that $m_{i,1} \in [4, 6]$. Of the 5,000 funds available during these first 6 months, $n_2 = 4{,}927$ appear in the next 6 months, July 2014 – December 2014, and have at least 3 months of returns available during this period. For our prediction exercise, we use these $n_2$ funds to assess the performance of various estimators for predicting $\theta_i = \mu_i/\sigma_i$, where $\mu_i$ and $\sigma_i$ are the sample mean and sample standard deviation of the excess returns of the $n_2$ funds during the next 6 months.
We consider the following estimators of $\theta = (\theta_1, \ldots, \theta_n)$: DS, DS orc, Grp Linear and XKB.SG from Section 5.1, as well as NPMLE-Indep and NPMLE-Dep from Section 6.1. Additionally, we consider the SURE estimator XKB.G from Xie et al. (2012) which, unlike XKB.SG, imposes a Normal prior on $\mu$ and shrinks towards the grand mean. Note that for predicting $\theta_i$, Grp Linear, XKB.SG and XKB.G rely on the sample variances $S_i^2$.

Table 2. Performance of the competing estimators relative to the performance of the naive estimator $\delta^{naive}$. Here $\mathrm{R\text{-}TSE}(\delta) = \mathrm{TSE}(\delta)/\mathrm{TSE}(\delta^{naive})$, with similar definitions for R-WSE and R-WAE. The smallest two relative errors are bolded

         n1    n2    Group Linear  XKB.G  XKB.SG  DS     DS orc
R-TSE    5000  4927  0.900         0.928  0.902   0.677  0.677
R-WSE    5000  4927  0.899         0.928  0.902   0.676  0.676
R-WAE    5000  4927  1.028         0.972  0.966   0.779  0.778

To evaluate the performance of these estimators for predicting $\theta$, we consider three criteria: total squared error, $\mathrm{TSE}(\delta) = \sum_{i=1}^{n_2}(\theta_i - \delta_i)^2$; weighted squared error, $\mathrm{WSE}(\delta) = \sum_{i=1}^{n_2} m_{i,2}(\theta_i - \delta_i)^2$; and weighted absolute error, $\mathrm{WAE}(\delta) = \sum_{i=1}^{n_2} m_{i,2}\,|1 - \delta_i/\theta_i|$. In Table 2 we present the performance of the competing estimators relative to the performance of the naive estimator $\delta^{naive} = (\delta_i^{naive} : 1 \leq i \leq n)$, so that a smaller value of R-TSE, R-WSE or R-WAE indicates relatively better prediction error. NPMLE-Indep and NPMLE-Dep revealed convergence issues on this data, and so we do not include them in Table 2. Across all performance measures, DS has the smallest relative risk among the competing estimators considered in this example. With respect to the weighted absolute error, Grp Linear appears to do relatively worse in predicting the small Sharpe ratios and exhibits an R-WAE bigger than 1. When compared against the three linear shrinkage methods considered here, DS demonstrates the overall value of joint shrinkage estimation of the means $\mu_i$ and the variances $\sigma_i^2$ for predicting $\theta$.
ACKNOWLEDGEMENTS
We thank the editor and three anonymous referees for taking the time to review this pa-
per. Their suggestions have led to significant improvements in the quality and content of this
manuscript.
SUPPLEMENTARY MATERIAL
Supplementary material available at Biometrika online includes proofs of main theorems and
additional numerical experiments.
REFERENCES
ABRAMOVICH, F., BENJAMINI, Y., DONOHO, D. L. & JOHNSTONE, I. M. (2006). Adapting to unknown sparsity by controlling the false discovery rate. Ann. Statist. 34, 584–653.
BANERJEE, T., LIU, Q., MUKHERJEE, G. & SUN, W. (2020). A general framework for empirical Bayes estimation in discrete linear exponential family. arXiv preprint arXiv:1910.08997 (accepted in Journal of Machine Learning Research).
BARP, A., BRIOL, F.-X., DUNCAN, A., GIROLAMI, M. & MACKEY, L. (2019). Minimum Stein discrepancy estimators. In Advances in Neural Information Processing Systems.
BASU, P., CAI, T. T., DAS, K. & SUN, W. (2017). Weighted false discovery rate control in large-scale multiple testing. Journal of the American Statistical Association 0, 0–0.
BENJAMINI, Y. & HOCHBERG, Y. (1997). Multiple hypotheses testing with weights. Scandinavian Journal of Statistics 24, 407–418.
BENJAMINI, Y. & YEKUTIELI, D. (2011). False discovery rate–adjusted multiple confidence intervals for selected parameters. Journal of the American Statistical Association 100, 71–81.
BERGER, J. O. (1976). Admissible minimax estimation of a multivariate normal mean with arbitrary quadratic loss. Ann. Statist. 4, 223–226.
BERK, R., BROWN, L., BUJA, A., ZHANG, K. & ZHAO, L. (2013). Valid post-selection inference. Ann. Statist. 41, 802–837.
BROWN, L. D. (2008). In-season prediction of batting averages: A field test of empirical Bayes and Bayes methodologies. The Annals of Applied Statistics, 113–152.
BROWN, L. D. & GREENSHTEIN, E. (2009). Nonparametric empirical Bayes and compound decision approaches to estimation of a high-dimensional vector of normal means. The Annals of Statistics 37, 1685–1704.
BROWN, L. D., GREENSHTEIN, E. & RITOV, Y. (2013). The Poisson compound decision problem revisited. Journal of the American Statistical Association 108, 741–749.
BROWN, S. J., GOETZMANN, W., IBBOTSON, R. G. & ROSS, S. A. (1992). Survivorship bias in performance studies. The Review of Financial Studies 5, 553–580.
CAI, T. T. & SUN, W. (2009). Simultaneous testing of grouped hypotheses: Finding needles in multiple haystacks. J. Amer. Statist. Assoc. 104, 1467–1481.
CASTILLO, I. & VAN DER VAART, A. (2012). Needles and straw in a haystack: Posterior concentration for possibly sparse sequences. Ann. Statist. 40, 2069–2101.
CHIARETTI, S., LI, X., GENTLEMAN, R., VITALE, A., VIGNETTI, M., MANDELLI, F., RITZ, J. & FOA, R. (2004). Gene expression profile of adult T-cell acute lymphocytic leukemia identifies distinct subsets of patients with different response to therapy and survival. Blood 103, 2771–2778.
CHWIALKOWSKI, K., STRATHMANN, H. & GRETTON, A. (2016). A kernel test of goodness of fit. JMLR: Workshop and Conference Proceedings.
DIACONIS, P. & YLVISAKER, D. (1979). Conjugate priors for exponential families. The Annals of Statistics, 269–281.
DONOHO, D. L. & JOHNSTONE, I. M. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika 81, 425.
DYSON, F. (1926). A method for correcting series of parallax observations. Monthly Notices of the Royal Astronomical Society 86, 686–706.
EDDINGTON, A. S. (1940). The correction of statistics for accidental error. Monthly Notices of the Royal Astronomical Society 100, 354–361.
EFRON, B. (2008). Microarrays, empirical Bayes and the two-groups model. Statist. Sci. 23, 1–22.
EFRON, B. (2011). Tweedie's formula and selection bias. Journal of the American Statistical Association 106, 1602–1614.
EFRON, B. & MORRIS, C. N. (1975). Data analysis using Stein's estimator and its generalizations. Journal of the American Statistical Association 70, 311–319.
ERICKSON, S. & SABATTI, C. (2005). Empirical Bayes estimation of a sparse vector of gene expression change. Statistical Applications in Genetics and Molecular Biology 4, 1132.
GU, J. & KOENKER, R. (2017a). Empirical Bayesball remixed: Empirical Bayes methods for longitudinal data. Journal of Applied Econometrics 32, 575–599.
GU, J. & KOENKER, R. (2017b). Unobserved heterogeneity in income dynamics: An empirical Bayes perspective. Journal of Business & Economic Statistics 35, 1–16.
HE, L., SARKAR, S. K. & ZHAO, Z. (2015). Capturing the severity of type II errors in high-dimensional multiple testing. Journal of Multivariate Analysis 142, 106–116.
HENDERSON, N. C. & NEWTON, M. A. (2016). Making the cut: improved ranking and selection for large-scale inference. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 78, 1467–9868.
IGNATIADIS, N. & WAGER, S. (2019). Covariate-powered empirical Bayes estimation. In Advances in Neural Information Processing Systems.
JAMES, G. M., RADCHENKO, P. & RAVA, B. (2020). Irrational exuberance: Correcting bias in probability estimates. Journal of the American Statistical Association, 1–14.
JAMES, W. & STEIN, C. (1961). Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics. Berkeley, Calif.: University of California Press.
JIANG, W. & ZHANG, C.-H. (2009). General maximum likelihood empirical Bayes estimation of normal means. Ann. Statist. 37, 1647–1684.
JIANG, W., ZHANG, C.-H. et al. (2010). Empirical Bayes in-season prediction of baseball batting averages. In Borrowing Strength: Theory Powering Applications – A Festschrift for Lawrence D. Brown. Institute of Mathematical Statistics, pp. 263–273.
JING, B.-Y., LI, Z., PAN, G. & ZHOU, W. (2016). On SURE-type double shrinkage estimation. Journal of the American Statistical Association 111, 1696–1704.
JOHNSTONE, I. M. & SILVERMAN, B. W. (2004). Needles and straw in haystacks: empirical Bayes estimates to possibly sparse sequences. Annals of Statistics 32, 1594–1649.
KIEFER, J. & WOLFOWITZ, J. (1956). Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. The Annals of Mathematical Statistics, 887–906.
KOENKER, R. & GU, J. (2017). REBayes: Empirical Bayes mixture methods in R. Journal of Statistical Software 82, 1–26.
KOENKER, R. & MIZERA, I. (2014). Convex optimization, shape constraints, compound decisions, and empirical Bayes rules. Journal of the American Statistical Association 109, 674–685.
KOU, S. C. & YANG, J. J. (2017). Optimal Shrinkage Estimation in Heteroscedastic Hierarchical Linear Models, chap. 25. Cham: Springer International Publishing, pp. 249–284.
LAIRD, N. (1978). Nonparametric maximum likelihood estimation of a mixing distribution. Journal of the American Statistical Association 73, 805–811.
LAURENT, B. & MASSART, P. (2000). Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, 1302–1338.
LEE, J. D., SUN, D. L., SUN, Y. & TAYLOR, J. E. (2016). Exact post-selection inference, with application to the lasso. Ann. Statist. 44, 907–927.
LIU, Q., LEE, J. D. & JORDAN, M. I. (2016). A kernelized Stein discrepancy for goodness-of-fit tests. In Proceedings of the International Conference on Machine Learning (ICML).
LIU, Q. & WANG, D. (2016). Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems.
OATES, C. J., GIROLAMI, M. & CHOPIN, N. (2017). Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79, 695–718.
ROBBINS, H. (1956). An empirical Bayes approach to statistics. Proc. Third Berkeley Symp. on Math. Statist. and Prob. 1, 157–163.
SERFLING, R. J. (2009). Approximation Theorems of Mathematical Statistics, vol. 162. John Wiley & Sons.
SUN, W. & MCLAIN, A. C. (2012). Multiple testing of composite null hypotheses in heteroscedastic models. Journal of the American Statistical Association 107, 673–687.
TAN, Z. (2015). Improved minimax estimation of a multivariate normal mean under heteroscedasticity. Bernoulli 21, 574–603.
TUSHER, V. G., TIBSHIRANI, R. & CHU, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences 98, 5116–5121.
WEINSTEIN, A., FITHIAN, W. & BENJAMINI, Y. (2013). Selection adjusted confidence intervals with more power to determine the sign. Journal of the American Statistical Association 108, 165–176.
WEINSTEIN, A., MA, Z., BROWN, L. D. & ZHANG, C.-H. (2018). Group-linear empirical Bayes estimates for a heteroscedastic normal mean. Journal of the American Statistical Association 0, 1–13.
WHARTON SCHOOL (1993). Wharton research data services. https://wrds-web.wharton.upenn.edu/wrds/.
XIE, X., KOU, S. & BROWN, L. D. (2012). SURE estimates for a heteroscedastic hierarchical model. Journal of the American Statistical Association 107, 1465–1479.
YANG, J., LIU, Q., RAO, V. & NEVILLE, J. (2018). Goodness-of-fit testing for discrete distributions via Stein discrepancy. In International Conference on Machine Learning.
ZHANG, X. & BHATTACHARYA, A. (2017). Empirical Bayes, SURE, and sparse normal mean models. Preprint.
Supplementary Material for “Nonparametric Empirical
Bayes Estimation On Heterogeneous Data”
A. PROOFS
A.1. Proof of Proposition 1
Proposition 1 follows by recalling that, under the hierarchical model of equation (1),
$$f_{m_i}(y_i, s_i^2 \mid \mu_i, \tau_i) \propto \exp\Big\{-\frac{\tau_i}{2}\big[m_i y_i^2 + (m_i - 1)s_i^2\big] + m_i\tau_i\mu_i y_i - \frac{m_i}{2}\tau_i\mu_i^2 + \frac{m_i - 3}{2}\log s_i^2\Big\}.$$
Therefore, from Bayes' theorem,
$$f_{m_i}(\mu_i, \tau_i \mid y_i, s_i^2) = \frac{f_{m_i}(y_i, s_i^2\mid\mu_i, \tau_i)}{f_{m_i}(y_i, s_i^2)}\,g(\mu_i\mid\tau_i)h(\tau_i) \propto \exp\big\{\eta_i^T T(\mu_i, \tau_i) - A(\eta_i)\big\}\,g(\mu_i\mid\tau_i)h(\tau_i).$$
Here $\eta_i = \big(m_i y_i,\; -m_i y_i^2 - (m_i - 1)s_i^2\big) := (\eta_{1i}, \eta_{2i})$, $T(\mu_i, \tau_i) = (\tau_i\mu_i, \tau_i/2)$ and
$$A(\eta_i) = -0.5(m_i - 3)\log\gamma(\eta_{1i}, \eta_{2i}) + \log f_{m_i}\big\{m_i^{-1}\eta_{1i},\, \gamma(\eta_{1i}, \eta_{2i})\big\}, \qquad \gamma(\eta_{1i}, \eta_{2i}) = \frac{-\eta_{2i} - m_i^{-1}\eta_{1i}^2}{m_i - 1},$$
with $f_m(y, s^2) = \int\!\!\int f_m(y, s^2\mid\mu, \tau)\,g_\mu(\mu\mid\tau)h_\tau(\tau)\,d\mu\,d\tau$ the marginal density function of $(Y, S^2)$.
Corollary 1 is a consequence of the properties of the exponential family of distributions and follows from Proposition 1 under squared error loss. From Proposition 1, dropping the subscript $i$, we have
\[
\tau^\pi := \tau^\pi(y, s^2, m) = E(\tau \mid y, s^2, m) = 2\,\frac{\partial A(\eta)}{\partial \eta_2} = \frac{m-3}{(m-1)s^2} - \frac{2}{m-1}\, w_2(y, s^2; m).
\]
Furthermore, with $\zeta = \tau\mu$,
\[
\zeta^\pi := \zeta^\pi(y, s^2, m) = E(\zeta \mid y, s^2, m) = \frac{\partial A(\eta)}{\partial \eta_1} = \frac{(m-3)y}{(m-1)s^2} + m^{-1} w_1(y, s^2; m) - \frac{2y}{m-1}\, w_2(y, s^2; m) = y\, E(\tau \mid y, s^2, m) + m^{-1} w_1(y, s^2; m).
\]
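The two formulas above admit a direct numerical sanity check: approximate the marginal $f_m(y, s^2)$ by Monte Carlo under a concrete prior pair, recover $w_1$ and $w_2$ by numerically differentiating its logarithm, and compare the resulting $\tau^\pi$ and $\zeta^\pi$ with posterior expectations computed directly. The normal and gamma priors below are illustrative choices of ours (the paper leaves $g$ and $h$ unspecified), as are all function names.

```python
# Monte Carlo check of Corollary 1: tau^pi and zeta^pi recovered from the
# log-derivatives w1, w2 of the marginal density f_m(y, s^2) of (Y, S^2).
# The normal/gamma priors are illustrative only; the paper leaves g, h free.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
m, K = 10, 200_000
tau = rng.gamma(shape=5.0, scale=0.2, size=K)   # draws from h: E(tau) = 1
mu = rng.normal(0.0, 1.0, size=K)               # draws from g(.|tau)

def joint(y, s2):
    """f1(y | mu, tau) * f2(s2 | tau) evaluated at every prior draw."""
    f1 = stats.norm.pdf(y, loc=mu, scale=1.0 / np.sqrt(m * tau))
    # (m - 1) S^2 tau ~ chi^2_{m-1}, so transform the chi-square density
    f2 = stats.chi2.pdf((m - 1) * s2 * tau, df=m - 1) * (m - 1) * tau
    return f1 * f2

def log_marginal(y, s2):
    return np.log(np.mean(joint(y, s2)))

y0, s20, h = 0.5, 1.0, 1e-4                     # central finite differences
w1 = (log_marginal(y0 + h, s20) - log_marginal(y0 - h, s20)) / (2 * h)
w2 = (log_marginal(y0, s20 + h) - log_marginal(y0, s20 - h)) / (2 * h)

# Corollary 1 formulas
tau_pi = (m - 3) / ((m - 1) * s20) - 2.0 * w2 / (m - 1)
zeta_pi = y0 * tau_pi + w1 / m

# direct posterior expectations via self-normalized importance weights
wts = joint(y0, s20)
tau_direct = np.sum(tau * wts) / np.sum(wts)
zeta_direct = np.sum(tau * mu * wts) / np.sum(wts)
print(tau_pi - tau_direct, zeta_pi - zeta_direct)   # both close to zero
```

Because the same prior draws are reused for every density evaluation, the approximated marginal is a smooth function of $(y, s^2)$, so the finite differences are stable and the two pairs of estimates agree up to discretization error.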
A.2. Proof of Theorem 1
Consider the sample criterion $M_{\lambda,n}(\mathbf W)$ and the population criterion $M(\mathbf W)$. We have
\[
\big| M_{\lambda,n}(\mathbf W) - M(\mathbf W) \big| \le \Big| \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n \kappa[w(X_i), w(X_j)](X_i, X_j)\, I(i \neq j) - M(\mathbf W) \Big| + \Big| \frac{1}{n^2} \sum_{i=1}^n \kappa[w(X_i), w(X_i)](X_i, X_i) \Big| + \frac{\lambda(n)}{n^2} \|\mathbf W\|_F^2 := I_1 + I_2 + I_3. \tag{1}
\]
Define $M_n(\mathbf W) = \{n(n-1)\}^{-1} \sum_{i=1}^n \sum_{j=1}^n \kappa[w(X_i), w(X_j)](X_i, X_j)\, I(i \neq j)$. Then
\[
I_1 \le |M_n(\mathbf W) - M(\mathbf W)| + n^{-1} |M_n(\mathbf W)|.
\]
By Assumption 1, $n^{-1}|M_n(\mathbf W)|$ is $O_p(n^{-1})$. Now, note that $M_n(\mathbf W)$ is an unbiased estimator of $M(\mathbf W)$ and is a U-statistic with a symmetric kernel function $\kappa[w(X_i), w(X_j)](X_i, X_j)$. From Assumption 1, $\kappa[w(X_i), w(X_j)](X_i, X_j)$ has finite second moments. Moreover, from Theorem 4.1 of Liu et al. (2016),
2 BANERJEE, FU, JAMES AND SUN
$M_n(\mathbf W)$ is a non-degenerate U-statistic whenever $f_{\mathbf W} \neq f$. Thus, from the CLT for U-statistics (Serfling, 2009, Section 5.5), $|M_n(\mathbf W) - M(\mathbf W)|$ is $O_p(n^{-1/2})$.
For the terms $I_2$ and $I_3$ in equation (1), we have
\[
I_2 + I_3 \le \{1 + \lambda(n)\} \Big| \frac{1}{n^2} \sum_{i=1}^n \kappa[w(X_i), w(X_i)](X_i, X_i) \Big|.
\]
From Assumption 1, $I_2 + I_3$ is $O_p[\{1 + \lambda(n)\}/n]$. Theorem 1 is proved by combining these results and using the condition $\lambda(n) n^{-1/2} \to 0$.
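The decomposition into $I_1$, $I_2$ and $I_3$ uses the elementary fact that the $n^{-2}$ double sum differs from the U-statistic $M_n(\mathbf W)$ only through the diagonal terms and an $O(n^{-1})$ normalization factor. A toy illustration of that gap, with a simple fixed Gaussian kernel of our choosing rather than the paper's kernelized Stein discrepancy:

```python
# The n^{-2} double sum (diagonal included) and the U-statistic (diagonal
# excluded) differ by an O(1/n) term, as in the bounds on I_1 and I_2.
# The Gaussian kernel is a toy stand-in for the paper's KSD kernel.
import numpy as np

rng = np.random.default_rng(0)

def kappa(x, y):
    return np.exp(-0.5 * (x - y) ** 2)   # symmetric toy kernel

def v_and_u(x):
    n = len(x)
    K = kappa(x[:, None], x[None, :])
    v = K.sum() / n**2                            # includes diagonal
    u = (K.sum() - np.trace(K)) / (n * (n - 1))   # excludes diagonal
    return v, u

gaps = []
for n in (100, 1000):
    v, u = v_and_u(rng.normal(size=n))
    gaps.append(abs(v - u))
print(gaps)   # the gap shrinks roughly like 1/n
```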
A.3. Proof of Theorem 2
Denote $\lambda(n)$ by $\lambda$ and keep its dependence on $n$ implicit for notational ease. According to Proposition 3.3 of Liu et al. (2016), Assumption 2 implies that $M(\mathbf W)$ is a valid discrepancy measure in the sense that $M(\mathbf W) > 0$ if and only if $f_{\mathbf W} \neq f$. It follows that given a fixed $\epsilon > 0$, there exists a $\delta > 0$ such that
\[
P\Big[ \frac{c_n}{n} \big\| \mathbf W_n(\lambda) - \mathbf W_0 \big\|_F^2 \ge \epsilon \Big] \le P\big[ c_n \big\{ M(\mathbf W_n(\lambda)) - M(\mathbf W_0) \big\} \ge \delta \big].
\]
But the right-hand side in the display above is upper bounded by the sum of three terms: $P[c_n\{M(\mathbf W_n(\lambda)) - M_{\lambda,n}(\mathbf W_n(\lambda))\} \ge \delta/3]$, $P[c_n\{M_{\lambda,n}(\mathbf W_n(\lambda)) - M_{\lambda,n}(\mathbf W_0)\} \ge \delta/3]$ and $P[c_n\{M_{\lambda,n}(\mathbf W_0) - M(\mathbf W_0)\} \ge \delta/3]$. From Theorem 1, the first term goes to zero as $n \to \infty$, while the second term is zero since $M_{\lambda,n}(\mathbf W_n(\lambda)) \le M_{\lambda,n}(\mathbf W_0)$.
Consider the third term $P[c_n\{M_{\lambda,n}(\mathbf W_0) - M(\mathbf W_0)\} \ge \delta/3]$. Here $M(\mathbf W_0) = 0$ and we can write
\[
\big| M_{\lambda,n}(\mathbf W_0) \big| \le \big| M_{\lambda,n}(\mathbf W_0) - \bar M_{\lambda,n}(\mathbf W_0) \big| + \big| \bar M_{\lambda,n}(\mathbf W_0) \big| := T_1 + T_2,
\]
where $\bar M_{\lambda,n}(\mathbf W_0) = M_n(\mathbf W_0) + \lambda n^{-2} \|\mathbf W_0\|_F^2$ and
\[
M_n(\mathbf W_0) = \{n(n-1)\}^{-1} \sum_{i=1}^n \sum_{j=1}^n \kappa[w(X_i), w(X_j)](X_i, X_j)\, I(i \neq j).
\]
So we have
\[
T_1 \le \frac{1}{n} \big| M_n(\mathbf W_0) \big| + \frac{1}{n^2} \Big| \sum_{i=1}^n \kappa[w(X_i), w(X_i)](X_i, X_i) \Big|, \qquad T_2 \le \big| M_n(\mathbf W_0) \big| + \lambda n^{-2} \|\mathbf W_0\|_F^2.
\]
Therefore,
\[
\big| M_{\lambda,n}(\mathbf W_0) \big| \le T_1 + T_2 \le (1 + n^{-1}) \big| M_n(\mathbf W_0) \big| + \frac{1 + \lambda}{n^2} \Big| \sum_{i=1}^n \kappa[w(X_i), w(X_i)](X_i, X_i) \Big|. \tag{2}
\]
From Theorem 4.1 of Liu et al. (2016), $M_n(\mathbf W)$ is a degenerate U-statistic when $\mathbf W = \mathbf W_0$. From Assumption 1 and the CLT for U-statistics (cf. Section 5.5.2 of Serfling, 2009), $M_n(\mathbf W_0)$ is $O_p(n^{-1})$. Also, from Assumption 1, the second term in equation (2) above is $O_p\{(1 + \lambda)/n\}$. Finally, using these results in equation (2) along with the condition that $\lambda n^{-1/2} \to 0$ as $n \to \infty$, we conclude that the third term
\[
P\big[ c_n \big\{ M_{\lambda,n}(\mathbf W_0) - M(\mathbf W_0) \big\} \ge \delta/3 \big] \to 0
\]
as $n \to \infty$. The desired result thus follows.
A.4. Proof of Theorem 3 and Corollary 2
We first state two lemmata that are needed for proving Theorem 3. Throughout, $c_0, c_1, \ldots$ denote generic positive constants which may vary in different statements.

LEMMA 1. If Assumption 3 holds, then with probability tending to 1, $|\mu| \le C_1 \log n$ and $C_2/\log n \le \tau \le C_3 \log n$ for some positive constants $C_1$, $C_2$ and $C_3$.
NEST Estimation On Heterogeneous Data 3
LEMMA 2. Consider Model (1). Suppose Assumption 3 holds. Then with probability tending to 1, $|w_{1,i}| \le c_0 (\log n)^2$ and $E(\tau_i \mid y_i, s_i^2) \ge c_1/\log n$.
Lemmata 1 and 2 are proved in Appendices A.5 and A.6, respectively. We now prove Theorem 3.
To establish the first part of Theorem 3, note that
\[
\frac{1}{n} \|\boldsymbol\delta^{ds}(\lambda) - \boldsymbol\delta^*\|_2^2 = \frac{1}{n m^2} \sum_{i=1}^n \bigg| \frac{w_{1,i}}{\tau_i^\pi} - \frac{w^i_{1,\lambda}}{\tau_i^{ds}(\lambda)} \bigg|^2
\le \frac{2}{n m^2} \sum_{i=1}^n \frac{1}{\{\tau_i^{ds}(\lambda)\}^2} \big| w_{1,i} - w^i_{1,\lambda} \big|^2 + \frac{2}{n m^2} \sum_{i=1}^n |w_{1,i}|^2 \bigg| \frac{1}{\tau_i^{ds}(\lambda)} - \frac{1}{\tau_i^\pi} \bigg|^2 := T_1 + T_2.
\]
Consider the first term $T_1$. From the discussion in Section 2.3, there is a positive constant $c_0$ such that $\tau_i^{ds}(\lambda) > c_0 > 0$ for all $i = 1, \ldots, n$. It follows that for some constant $c_1 > 0$ depending on the fixed $m$,
\[
T_1 \le \frac{c_1}{n} \big\| \mathbf w_1 - \mathbf w_{1,\lambda} \big\|_2^2 \le \frac{c_1}{n} \big\| \mathbf W_0 - \mathbf W_n(\lambda) \big\|_F^2, \tag{3}
\]
where $\mathbf w_{1,\lambda} = (w^1_{1,\lambda}, \ldots, w^n_{1,\lambda})$ and $\mathbf w_1 = (w_{1,1}, \ldots, w_{1,n})$. From Theorem 2, the last term on the right-hand side of the inequality in equation (3) is $O_p(n^{-1/2})$.
Next consider the second term $T_2$. We have
\[
T_2 \le \frac{c_2}{n} \sum_{i=1}^n \bigg| \frac{w_{1,i}}{\tau_i^\pi} \bigg|^2 \big| w_{2,i} - w^i_{2,\lambda} \big|^2. \tag{4}
\]
We will use Lemma 2 to bound the terms $|w_{1,i}/\tau_i^\pi|$ in equation (4). First note that Model (1), Assumption 3 and Lemma 1 imply that with probability tending to 1, $|Y_i| \le c_0 \log n$ and
\[
(m-1) n^{-1} \le (m-1) S_i^2 \tau_i \le (m-1) + 2\{(m-1)\log n\}^{1/2} + 2\log n
\]
(cf. Lemma 1 of Laurent & Massart, 2000). So, conditional on these events, we have, from Lemma 2, $|w_{1,i}/\tau_i^\pi| \le c_3 \log^3 n$. Thus,
\[
T_2 \le \frac{c_4 \log^6 n}{n} \sum_{i=1}^n \big| w_{2,i} - w^i_{2,\lambda} \big|^2,
\]
which is $O_p(\log^6 n / \sqrt n)$ from Theorem 2. Thus $n^{-1} \|\boldsymbol\delta^{ds}(\lambda) - \boldsymbol\delta^*\|_2^2$ is $O_p(\log^6 n / \sqrt n)$.
Now we prove the second part of Theorem 3. Observe that $|l_n(\boldsymbol\mu, \boldsymbol\delta^*) - l_n(\boldsymbol\mu, \boldsymbol\delta^{ds}(\lambda))|$ equals
\[
\frac{1}{n} \Big| \|\boldsymbol\mu - \boldsymbol\delta^*\|_2 - \|\boldsymbol\mu - \boldsymbol\delta^{ds}(\lambda)\|_2 \Big|\; \Big| \|\boldsymbol\mu - \boldsymbol\delta^*\|_2 + \|\boldsymbol\mu - \boldsymbol\delta^{ds}(\lambda)\|_2 \Big|,
\]
and the triangle inequality implies
\[
\frac{1}{\sqrt n} \Big| \|\boldsymbol\mu - \boldsymbol\delta^*\|_2 - \|\boldsymbol\mu - \boldsymbol\delta^{ds}(\lambda)\|_2 \Big| \le \frac{1}{\sqrt n} \big\| \boldsymbol\delta^{ds}(\lambda) - \boldsymbol\delta^* \big\|_2. \tag{5}
\]
The quantity on the right-hand side of the inequality in equation (5) is $O_p(\log^3 n / n^{1/4})$ from the first part of Theorem 3. Thus, it follows from equation (5) that $\{l_n(\boldsymbol\mu, \boldsymbol\delta^{ds}(\lambda))\}^{1/2} \le \{l_n(\boldsymbol\mu, \boldsymbol\delta^*)\}^{1/2} + O_p(\log^3 n / n^{1/4})$, and
\[
\big| l_n(\boldsymbol\mu, \boldsymbol\delta^*) - l_n(\boldsymbol\mu, \boldsymbol\delta^{ds}(\lambda)) \big| \le 4 \{l_n(\boldsymbol\mu, \boldsymbol\delta^*)\}^{1/2}\, \Big| \{l_n(\boldsymbol\mu, \boldsymbol\delta^*)\}^{1/2} - \{l_n(\boldsymbol\mu, \boldsymbol\delta^{ds}(\lambda))\}^{1/2} \Big|\, \{1 + o_p(1)\}. \tag{6}
\]
Now Assumption 3, together with Lemmata 1 and 2, implies that $l_n(\boldsymbol\mu, \boldsymbol\delta^*)$ is $O_p(\log^6 n)$. Thus, from equations (5) and (6) and the first part of Theorem 3, we have the desired result.
Corollary 2 is a consequence of Theorem 2 of Diaconis & Ylvisaker (1979). When applied to the hierarchical model of equation (1) with conjugate priors, it establishes that the posterior expectation of $\mu_i$ is a linear combination of the prior mean and $y_i$, with weights proportional to the prior sample size $m_0$ and $m$, respectively. For instance, consider the normal conjugate model
\[
Y_{ij} \mid \mu_i, \tau_i \overset{\text{i.i.d.}}{\sim} N(\mu_i, 1/\tau_i), \qquad \mu_i \mid \tau_i \overset{\text{ind.}}{\sim} N(\mu_0, 1/\tau_i), \qquad \tau_i \overset{\text{i.i.d.}}{\sim} \Gamma(\alpha, \beta). \tag{7}
\]
Under model (7), standard calculations give $\delta_i^\pi = (\mu_0 + m y_i)/(m + 1)$, where $\mu_0$ is the prior mean and $m_0 = 1$ is the prior sample size. Moreover, under model (7) we also have
\[
\log f(y_i, s_i^2) = c_0 + \frac{m-3}{2} \log s_i^2 - (\alpha + m/2) \log\Big\{ \beta^{-1} + 0.5(m-1)s_i^2 + \frac{0.5\, m}{m+1}(y_i - \mu_0)^2 \Big\},
\]
where $c_0$ is a constant independent of $(y_i, s_i^2)$. Differentiating the above display gives
\[
w_1(y_i, s_i^2) = -\frac{(\alpha + m/2)\, m (y_i - \mu_0)/(m+1)}{\beta^{-1} + 0.5(m-1)s_i^2 + 0.5\, m (y_i - \mu_0)^2/(m+1)}, \qquad
w_2(y_i, s_i^2) = \frac{m-3}{2 s_i^2} - \frac{0.5\,(\alpha + m/2)(m-1)}{\beta^{-1} + 0.5(m-1)s_i^2 + 0.5\, m (y_i - \mu_0)^2/(m+1)}.
\]
Substituting these expressions for $w_1(y_i, s_i^2)$ and $w_2(y_i, s_i^2)$ into equation (14) gives $\delta_i^* = \delta_i^\pi$. This suffices to prove the statement of Corollary 2.
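The substitution can also be checked numerically. The sketch below implements the displayed $w_1$ and $w_2$ together with the rule $\delta^* = y + w_1/(m\,\tau^\pi)$, which is our reading of equation (14) of the main text, and confirms that it reproduces the conjugate posterior mean $(\mu_0 + m y)/(m+1)$:

```python
# Check of Corollary 2 under the conjugate model (7): substituting w1, w2
# into delta* = y + w1 / (m * tau_pi) recovers (mu0 + m*y) / (m + 1).
# The form delta* = y + w1/(m*tau_pi) is our reading of equation (14).
m, alpha, beta, mu0 = 12, 3.0, 2.0, 0.7
y, s2 = 1.9, 1.3                                  # an arbitrary data point

D = 1.0 / beta + 0.5 * (m - 1) * s2 + 0.5 * m / (m + 1) * (y - mu0) ** 2
w1 = -(alpha + m / 2) * (m / (m + 1)) * (y - mu0) / D
w2 = (m - 3) / (2 * s2) - 0.5 * (alpha + m / 2) * (m - 1) / D

tau_pi = (m - 3) / ((m - 1) * s2) - 2.0 * w2 / (m - 1)   # = (alpha + m/2)/D
delta_star = y + w1 / (m * tau_pi)
delta_pi = (mu0 + m * y) / (m + 1)
print(delta_star, delta_pi)   # equal up to floating point
```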
A.5. Proof of Lemma 1
The proof of Lemma 1 follows directly from Assumption 3 and Markov's inequality. For example, fix $\nu_2 > 0$ and note that, for $r = \epsilon_2^{-\nu_2} > 1$,
\[
P\bigg( \tau \le \frac{\epsilon_2^{1+\nu_2}}{\log n} \bigg) \le \frac{E_H\{\exp(\epsilon_2/\tau)\}}{n^r}.
\]
A.6. Proof of Lemma 2
Recall that $f(y_i, s_i^2) = \int_{\mathbb R^+} \int_{\mathbb R} f_1(y_i \mid \mu, \tau)\, f_2(s_i^2 \mid \tau)\, g(\mu \mid \tau)\, h(\tau)\, d\mu\, d\tau$, where $g(\cdot \mid \tau)$ and $h(\cdot)$ are, respectively, the density functions associated with the distribution functions $G_\mu(\cdot \mid \tau)$ and $H_\tau(\cdot)$ in equation (1), $f_1$ is the density of a Gaussian random variable with mean $\mu$ and variance $1/(m\tau)$, and $f_2$ is the density of $S^2$ where $(m-1)S^2\tau \sim \chi^2_{m-1}$. We denote the partial derivative of $f(y, s^2)$ with respect to $y$ by $f'_{(1)}(y, s^2)$. From Model (1), $|f'_{(1)}(y_i, s_i^2)| \le T_1 + T_2$, where
\[
T_1 = m |Y_i| \int_{\mathbb R^+} \int_{\mathbb R} \tau\, f_1(y_i \mid \mu, \tau)\, f_2(s_i^2 \mid \tau)\, g(\mu \mid \tau)\, h(\tau)\, d\mu\, d\tau, \qquad
T_2 = m \int_{\mathbb R^+} \int_{\mathbb R} |\mu|\, \tau\, f_1(y_i \mid \mu, \tau)\, f_2(s_i^2 \mid \tau)\, g(\mu \mid \tau)\, h(\tau)\, d\mu\, d\tau. \tag{8}
\]
Consider the term $T_1$ in equation (8) above. Fix $\nu_3 > 0$ such that $\epsilon_3^{-\nu_3} > 4m$ in Lemma 1. With $C_3 = \epsilon_3^{-1-\nu_3}$ we have
\[
\begin{aligned}
T_1 &\le m |Y_i| f(y_i, s_i^2)\, C_3 \log n + m |Y_i| \int_{\mathbb R^+} \int_{\mathbb R} \tau\, I(\tau \ge C_3 \log n)\, f_1(y_i \mid \mu, \tau)\, f_2(s_i^2 \mid \tau)\, g(\mu \mid \tau)\, h(\tau)\, d\mu\, d\tau \\
&\le m |Y_i| f(y_i, s_i^2)\, C_3 \log n + m c_0 |Y_i|\, E_H\big\{ \tau^{5/2} I(\tau \ge C_3 \log n) \big\} \qquad (9) \\
&\le m |Y_i| f(y_i, s_i^2)\, C_3 \log n + m c_1 |Y_i| \big\{ P(\tau \ge C_3 \log n) \big\}^{1/2} \qquad (10) \\
&\le m |Y_i| f(y_i, s_i^2)\, C_3 \log n + m c_2 |Y_i|\, n^{-2m}. \qquad (11)
\end{aligned}
\]
In equation (9) we have used the facts that $f_1(y_i \mid \mu, \tau) \le \{\tau/(2\pi)\}^{1/2}$ and $f_2(s_i^2 \mid \tau) \le c_2 \tau$, while for equations (10) and (11) we use the Cauchy–Schwarz inequality, Assumption 1 and Lemma 1.
Now, consider the term $T_2$ in equation (8). With $C_3 = \epsilon_3^{-1-\nu_3}$ and $C_1 = \epsilon_1^{-1-\nu_1}$, where $\nu_1 > 0$ is such that $\epsilon_1^{-\nu_1} > 4m$ in Lemma 1, the term $T_2$ can be expressed as the sum of the following four integrals:
\[
\begin{aligned}
I_1 &= m \int_{\mathbb R^+} \int_{\mathbb R} |\mu|\, I(|\mu| \le C_1 \log n)\, \tau\, I(\tau \le C_3 \log n)\, f_1(y_i \mid \mu, \tau)\, f_2(s_i^2 \mid \tau)\, g(\mu \mid \tau)\, h(\tau)\, d\mu\, d\tau, \qquad (12) \\
I_2 &= m \int_{\mathbb R^+} \int_{\mathbb R} |\mu|\, I(|\mu| \ge C_1 \log n)\, \tau\, I(\tau \le C_3 \log n)\, f_1(y_i \mid \mu, \tau)\, f_2(s_i^2 \mid \tau)\, g(\mu \mid \tau)\, h(\tau)\, d\mu\, d\tau, \qquad (13) \\
I_3 &= m \int_{\mathbb R^+} \int_{\mathbb R} |\mu|\, I(|\mu| \le C_1 \log n)\, \tau\, I(\tau \ge C_3 \log n)\, f_1(y_i \mid \mu, \tau)\, f_2(s_i^2 \mid \tau)\, g(\mu \mid \tau)\, h(\tau)\, d\mu\, d\tau, \\
I_4 &= m \int_{\mathbb R^+} \int_{\mathbb R} |\mu|\, I(|\mu| \ge C_1 \log n)\, \tau\, I(\tau \ge C_3 \log n)\, f_1(y_i \mid \mu, \tau)\, f_2(s_i^2 \mid \tau)\, g(\mu \mid \tau)\, h(\tau)\, d\mu\, d\tau.
\end{aligned}
\]
We will bound $I_1$ and $I_2$; the other two integrals can be bounded using similar arguments. For $I_1$ in equation (12), note that $I_1 \le c_0 (\log^2 n)\, f(y_i, s_i^2)$, and for $I_2$ in equation (13),
\[
I_2 \le c_1 (\log^{5/2} n) \big\{ P(|\mu| \ge C_1 \log n) \big\}^{1/2} \le c_2 (\log^{5/2} n)\, n^{-2m}. \tag{14}
\]
In equation (14) we have used $f_1(y_i \mid \mu, \tau) \le \{\tau/(2\pi)\}^{1/2}$, $f_2(s_i^2 \mid \tau) \le c_2 \tau$, the Cauchy–Schwarz inequality, Assumption 1 and Lemma 1. Similarly, we can show that $I_3 \le c_0 \log n / n^{2m}$ and $I_4 \le c_1 / n^{4m}$. Finally, putting the upper bounds for $I_i$, $i = 1, \ldots, 4$, together, we get
\[
T_2 \le c_0 (\log^2 n)\, f(y_i, s_i^2) + c_1 \frac{\log^{5/2} n}{n^{2m}}. \tag{15}
\]
From equations (11) and (15) we have
\[
|w_1(Y_i, S_i^2)| \le c_0 |Y_i| \log n + c_1 \log^2 n + \frac{c_2 |Y_i| + \log^{5/2} n}{f(y_i, s_i^2)\, n^{2m}}.
\]
Moreover, noting that Model (1) and Lemma 1 imply that $|Y_i| \le c_3 \log n$ with high probability, we have, conditional on this event,
\[
|w_1(Y_i, S_i^2)| \le c_4 \log^2 n + c_5 \frac{\log n + \log^{5/2} n}{f(y_i, s_i^2)\, n^{2m}}. \tag{16}
\]
We will now analyze the behavior of $f(y_i, s_i^2)$ that appears in the display above. Gaussian concentration implies that with high probability $(Y_i - \mu_i)^2 m \tau_i \le 2 \log n$ and so, conditional on this event,
\[
f_1(y_i \mid \mu, \tau) \ge \frac{c_0 \sqrt\tau}{n}. \tag{17}
\]
Moreover, using the chi-square concentration in Lemma 1 of Laurent & Massart (2000), $(m-1) S_i^2 \tau_i \le (m-1) + 2\{(m-1)\log n\}^{1/2} + 2 \log n$ and $S_i^2 \tau_i \ge n^{-1}$ with high probability. It follows that, conditional on this event,
\[
f_2(s_i^2 \mid \tau) \ge \frac{c_1 \tau}{a_n n^{(m+1)/2} \log n}, \tag{18}
\]
where $a_n = \exp[\{(m-1)\log n\}^{1/2}]$. Using equations (17) and (18), we have
\[
f(y_i, s_i^2) \ge \frac{c_2}{a_n n^{(m+3)/2} \log n} \int_{\mathbb R^+} \tau^{3/2} h(\tau)\, d\tau. \tag{19}
\]
Now, applying Assumption 3 and Lemma 1 to the quantity $\int_{\mathbb R^+} \tau^{3/2} h(\tau)\, d\tau$ in equation (19), we conclude that
\[
\int_{\mathbb R^+} \tau^{3/2} h(\tau)\, d\tau \ge \frac{c_3}{\log^{3/2} n}\, P( \tau \ge C_2/\log n ). \tag{20}
\]
So, with equations (19) and (20) and Lemma 1, we have, with high probability,
\[
f(y_i, s_i^2) \ge \frac{c_4}{a_n n^{(m+3)/2} \log^{5/2} n}. \tag{21}
\]
The first statement of Lemma 2 thus follows from equations (16) and (21).
We will now prove the second statement of Lemma 2. Fix $\nu_2 > 0$ such that $\epsilon_2^{-\nu_2} > 2m$ in Lemma 1 and let $C_2 = \epsilon_2^{1+\nu_2}$. First note that from Assumption 3, $P(\tau \le C_2/\log n) \le c_0/n^{2m}$. Now, Markov's inequality implies
\[
E(\tau \mid y_i, s_i^2) \ge \frac{C_2}{\log n} \big\{ 1 - P(\tau \le C_2/\log n \mid y_i, s_i^2) \big\}.
\]
Moreover,
\[
P(\tau \le C_2/\log n \mid y_i, s_i^2) = f(y_i, s_i^2)^{-1} \int_0^{C_2/\log n} h(\tau) \Big\{ \int_{\mathbb R} g(\mu \mid \tau)\, f_1(y_i \mid \mu, \tau)\, f_2(s_i^2 \mid \tau)\, d\mu \Big\}\, d\tau.
\]
Now $f_1(y_i \mid \mu, \tau) \le \{\tau/(2\pi)\}^{1/2}$ and $f_2(s_i^2 \mid \tau) \le c_1 \tau$, where $c_1 > 0$ is a constant. So for some positive constant $c_2$,
\[
P(\tau \le C_2/\log n \mid y_i, s_i^2) \le \frac{c_2}{f(y_i, s_i^2)} \int_0^{C_2/\log n} \tau^{3/2} h(\tau) \Big\{ \int_{\mathbb R} g(\mu \mid \tau)\, d\mu \Big\}\, d\tau \le \frac{c_3}{f(y_i, s_i^2) \log^{3/2} n} \int_0^{C_2/\log n} h(\tau)\, d\tau = \frac{c_3}{f(y_i, s_i^2) \log^{3/2} n}\, P(\tau \le C_2/\log n).
\]
Thus, from the above display, Assumption 3 and Lemma 1,
\[
E(\tau \mid y_i, s_i^2) \ge \frac{C_2}{\log n} \Big\{ 1 - \frac{c_3 n^{-2m}}{f(y_i, s_i^2) \log^{3/2} n} \Big\}.
\]
Finally, equation (21) and the above display prove the second statement of Lemma 2.
B. ADDITIONAL NUMERICAL EXPERIMENTS
B.1. Numerical Experiments with small n
In this section, we evaluate the risk performance of the six competing approaches of sections 5.1 and 5.2 when n = 100. We consider simulation settings 1, 2 and 4 from section 5.1, and the setting of the ratio estimation example from section 5.2. Figures 7 to 10 present the relative risk ratios of the competing estimators for the four simulation settings. We continue to observe that as heterogeneity increases in the data, DS dominates the three linear shrinkage estimators considered here. Moreover, DS dominates NPMLE across all settings, with the exception of setting 1 at m = 10 (figure 7, left), where it remains almost as good as NPMLE as heterogeneity rises.
B.2. Numerical Experiments with unequal sample sizes mi
In this section, we present the risk performance of the six competing approaches of sections 5.1 and 5.2 when the sample sizes mi differ across the n = 1000 units of study. We continue to use the simulation settings of sections 5.1 and 5.2, and only change how m = (m1, . . . , mn) is generated in these settings. Figures 11 and 12 present the relative risks of the competing estimators across the six simulation settings of sections 5.1 and 5.2. With the exception of setting 3 in section 5.1, m is generated
[Figure 7: three panels of relative-risk curves omitted; legend: DS, Grp Linear, NPMLE, XKB.M, XKB.SG.]
Fig. 7. Comparison of relative risks when (µi, σ²i) are independent. Here µi ∼ 0.7N(0, 0.1) + 0.3N(±1, 3) i.i.d. and σ²i ∼ U(0.5, u) i.i.d. Plots show m = 10, 30, 50, left to right.
[Figure 8: three panels of relative-risk curves omitted; legend: DS, Grp Linear, NPMLE, XKB.M, XKB.SG.]
Fig. 8. Comparison of relative risks for correlated (µi, σ²i). Here τi = 1/σ²i ∼ 0.5 Γ(20, rate = 20) + 0.5 Γ(20, rate = u) i.i.d. and µi | τi ∼ N(±0.5/τi, 0.5²) independently. Plots show m = 10, 30, 50, left to right.
as a fixed vector of size n with elements sampled randomly from (10, 15, 20, . . . , 50) with replacement, while for setting 3 we generate m by sampling randomly from (30, 35, . . . , 100) with replacement. For settings 1, 2 and 3 (figure 11), we note that both NPMLE and DS dominate the three linear shrinkage estimators, with DS being as good as NPMLE in these three settings. Settings 4 and 5 from section 5.1 represent scenarios wherein the data Yij | (µi, σ²i) are not normally distributed. Under both these settings, DS provides an overall better risk performance than the competing estimators as the heterogeneity in the data increases with u. Setting 6 (figure 12, right) is the ratio estimation scenario considered in section 5.2, and we see that DS dominates the competing estimators of θi = √mi µi/σi considered here.
B.3. Gamma and Weibull mixtures
In this section we provide numerical evidence demonstrating the performance of the NEST estimator for compound estimation of the scale parameter of Gamma and Weibull mixtures in examples 2 and 4 of section 4. To choose λ in these settings, we use two-fold cross-validation, randomly splitting the observed data Yi = (Yij : 1 ≤ j ≤ m) for unit i into two approximately equal parts (Ui, Vi) such that Yi = (Ui, Vi).
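The split can be sketched as follows. Since the NEST fitting routine of examples 2 and 4 is not reproduced here, `fit_and_loss` uses a simple linear shrinkage rule as a stand-in; only the two-fold cross-validation mechanics mirror the text, and all names are ours.

```python
# Two-fold cross-validation for lambda: split each unit's observations
# Y_i = (U_i, V_i), fit on one half, score on the other, then swap roles.
# The shrinkage rule is a stand-in for the actual NEST fit.
import numpy as np

rng = np.random.default_rng(7)
n, m = 200, 30
theta = rng.gamma(5.0, 1.0, size=n)                  # unit-level parameters
Y = rng.normal(theta[:, None], 2.0, size=(n, m))     # m observations per unit

def fit_and_loss(train, test, lam):
    """Stand-in fit: shrink unit means toward the grand mean by lam."""
    means = train.mean(axis=1)
    est = (1 - lam) * means + lam * means.mean()
    return np.mean((est - test.mean(axis=1)) ** 2)

def choose_lambda(Y, grid):
    cols = rng.permutation(Y.shape[1])
    half = Y.shape[1] // 2
    U, V = Y[:, cols[:half]], Y[:, cols[half:]]
    cv = [0.5 * (fit_and_loss(U, V, lam) + fit_and_loss(V, U, lam))
          for lam in grid]                           # average both directions
    return grid[int(np.argmin(cv))]

lam_hat = choose_lambda(Y, np.linspace(0.0, 1.0, 21))
print(lam_hat)
```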
[Figure 9: three panels of relative-risk curves omitted; legend: DS, Grp Linear, NPMLE, XKB.M, XKB.SG.]
Fig. 9. Comparison of relative risk for non-normal data. Here Yij | µi, σi ∼ U(µi − √3 σi, µi + √3 σi) i.i.d., σ²i are sampled independently from N(u, 1) truncated below at 0.01, and µi | σ²i ∼ 0.8N(0.25σ²i, 0.25) + 0.2N(σ²i, 1) independently. Plots show m = 10, 30, 50, left to right.
[Figure 10: three panels of relative-risk curves omitted; legend: DS, Grp Linear, NPMLE, XKB.M, XKB.SG.]
Fig. 10. Comparison of relative risk for estimating θ. Here σ²i ∼ U(0.1, u) i.i.d. and µi | σi ∼ N(±0.5σ²i, 0.5²) independently. Plots show m = 10, 30, 50, left to right.
Scale mixture of Gamma distributions. Consider the setting of example 2 in section 4 where, for j = 1, . . . , m and i = 1, . . . , n, Yij | αi, 1/βi ∼ Γ(αi, 1/βi) i.i.d., with 1/βi ∼ G(·) i.i.d. Here the shape parameters α = (α1, . . . , αn) are known and G(·) is an unspecified prior distribution on the scale parameters 1/βi. We fix m = 30 and consider two scenarios: in scenario 1, βi ∼ Γ(7, 1) i.i.d. and α is a fixed vector of size n with elements sampled randomly from (1, 3, 5) with replacement; in scenario 2, βi ∼ χ²_5 i.i.d. and α is a fixed vector of size n with elements sampled randomly from (1, 2, 3) with replacement. We consider four competing estimators of β across the two scenarios as n varies: the NEST estimator from example 2 with λ obtained from two-fold cross-validation, denoted DS; the NPMLE-based estimator of β from Koenker & Gu (2017); the maximum likelihood estimator (MLE) of βi, which is mαi / Σ_{j=1}^m Yij; and DS orc, the NEST estimator of example 2 with oracle λ. The average squared error risk of these estimators is calculated over 100 Monte Carlo repetitions, and figure 13 presents the ratio of the average risks, with the average risk of DS orc in the denominator. For scenario 1 (figure 13, left), the NEST estimator is
[Figure 11: three panels of relative-risk curves omitted; legend: DS, Grp Linear, NPMLE, XKB.M, XKB.SG.]
Fig. 11. Left to right: simulation settings 1, 2 and 3. Here m is a fixed vector of size n with elements sampled randomly from (10, 15, 20, . . . , 50) with replacement for settings 1 and 2, and from (30, 35, . . . , 100) with replacement for setting 3.
[Figure 12: three panels of relative-risk curves omitted; legend: DS, Grp Linear, NPMLE, XKB.M, XKB.SG.]
Fig. 12. Left to right: simulation settings 4, 5 and 6 (from section 5.2). In each setting m is a fixed vector of size n with elements sampled randomly from (10, 15, 20, . . . , 50) with replacement.
almost as good as the NPMLE estimator of β for large n. Scenario 2, on the other hand, represents a challenging setting where the βi are smaller than those in scenario 1. In this setting, the NEST estimator has a lower relative risk than the NPMLE across all n.
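For reference, the maximum likelihood benchmark for scenario 1 can be reproduced in a few lines. The simulation below (our illustration, not the authors' code) checks that the stated MLE mαi/Σj Yij concentrates around the true βi:

```python
# MLE of beta_i in the Gamma scale mixture, scenario 1:
# Y_ij | alpha_i, beta_i ~ Gamma(shape alpha_i, scale 1/beta_i), and with
# alpha_i known the MLE of beta_i is m * alpha_i / sum_j Y_ij.
import numpy as np

rng = np.random.default_rng(3)
n, m = 1000, 30
beta = rng.gamma(7.0, 1.0, size=n)             # beta_i ~ Gamma(7, 1)
alpha = rng.choice([1.0, 3.0, 5.0], size=n)    # fixed, known shape vector
Y = rng.gamma(alpha[:, None], 1.0 / beta[:, None], size=(n, m))

beta_mle = m * alpha / Y.sum(axis=1)
# per unit, E(beta_mle / beta) = m * alpha_i / (m * alpha_i - 1), near 1
print(np.mean(beta_mle / beta))
```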
Scale mixture of Weibull distributions. Consider the setting of example 4 in section 4 where, for j = 1, . . . , m and i = 1, . . . , n, Yij | ki, βi ∼ Weibull(ki, βi) i.i.d., with βi ∼ G(·) i.i.d. Here the shape parameters k = (k1, . . . , kn) are known and G(·) is an unspecified prior distribution on the scale parameters βi. We fix m = 30 and consider two scenarios: in scenario 1, βi = |Zi| where Zi ∼ N(1, 0.1²) i.i.d.; in scenario 2, βi ∼ (1/3)δ(0.75) + (1/3)δ(1) + (1/3)δ(1.25) i.i.d. In each case, k is a fixed vector of size n with elements sampled randomly from (1, 2, 3) with replacement. Analogous to figure 13, figure 14 presents the relative risks of the competing estimators of β across the two scenarios as n varies. The competing estimators include the NEST estimator from example 4 with λ obtained from two-fold cross-validation, denoted DS; the NPMLE-based estimator of β; the maximum likelihood estimator
[Figure 13: two panels of relative-risk curves omitted; legend: DS, MLE, NPMLE.]
Fig. 13. Comparison of relative risk for estimating β under a scale mixture of Gamma distributions. Left: scenario 1, βi ∼ Γ(7, 1) i.i.d. and α a fixed vector of size n with elements sampled randomly from (1, 3, 5) with replacement. Right: scenario 2, βi ∼ χ²_5 i.i.d. and α a fixed vector of size n with elements sampled randomly from (1, 2, 3) with replacement. Here m is fixed at 30.
[Figure 14: two panels of relative-risk curves omitted; legend: DS, MLE, NPMLE.]
Fig. 14. Comparison of relative risk for estimating β under a scale mixture of Weibull distributions. Left: scenario 1, βi = |Zi| where Zi ∼ N(1, 0.1²) i.i.d. Right: scenario 2, βi ∼ (1/3)δ(0.75) + (1/3)δ(1) + (1/3)δ(1.25) i.i.d. Here k is a fixed vector of size n with elements sampled randomly from (1, 2, 3) with replacement and m is fixed at 30.
(MLE) of βi, which is m / Σ_{j=1}^m Y_{ij}^{k_i}; and DS orc, the NEST estimator of example 4 with oracle λ. We note that for both scenarios, and particularly for scenario 2, the NEST estimator has a lower relative risk than the NPMLE for large n.
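The Weibull MLE quoted above can be checked the same way. Under the rate parameterization implied by the formula, with density k β y^{k−1} exp(−β y^k) (an assumption on our part), Y^k is exponential with rate β, so the MLE is m/Σj Y^{ki}_{ij}. The script below is our illustration:

```python
# MLE of beta_i in the Weibull scale mixture: with known shape k_i and
# density k*beta*y^(k-1)*exp(-beta*y^k) (our assumed parameterization),
# Y^k ~ Exponential(rate beta), so the MLE is m / sum_j Y_ij^{k_i}.
import numpy as np

rng = np.random.default_rng(5)
n, m = 1000, 30
beta = np.abs(rng.normal(1.0, 0.1, size=n))    # scenario 1: beta_i = |Z_i|
k = rng.choice([1.0, 2.0, 3.0], size=n)        # fixed, known shape vector
# inverse-transform sampling: Y = (E / beta)^(1/k) with E ~ Exp(1)
Y = (rng.exponential(size=(n, m)) / beta[:, None]) ** (1.0 / k[:, None])

beta_mle = m / (Y ** k[:, None]).sum(axis=1)
# per unit, E(beta_mle / beta) = m / (m - 1), near 1
print(np.mean(beta_mle / beta))
```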