Biometrika (2021), xx, x, pp. 1–10
Printed in Great Britain
Nonparametric Empirical Bayes Estimation On
Heterogeneous Data
BY TRAMBAK BANERJEE
Analytics, Information and Operations Management, University of Kansas,
Lawrence, Kansas 66045, U.S.A.
LUELLA J. FU
Department of Mathematics, San Francisco State University,
San Francisco, California 94132, U.S.A.
GARETH M. JAMES
Department of Data Sciences and Operations, University of Southern California,
Los Angeles, California 90089, U.S.A.
WENGUANG SUN
Department of Data Sciences and Operations, University of Southern California,
Los Angeles, California 90089, U.S.A.
SUMMARY
The simultaneous estimation of many parameters based on data collected from corresponding studies is a key research problem that has received renewed attention in the high-dimensional setting. Many practical situations involve heterogeneous data where heterogeneity is captured by a nuisance parameter. Effectively pooling information across samples while correctly accounting for heterogeneity presents a significant challenge in large-scale estimation problems. We address this issue by introducing the "Nonparametric Empirical-Bayes Structural Tweedie" (NEST) estimator, which efficiently estimates the unknown effect sizes and properly adjusts for heterogeneity via a generalized version of Tweedie's formula. For the normal means problem, NEST simultaneously handles the two main selection biases introduced by heterogeneity: selection bias in the mean, which cannot be effectively corrected without also correcting for selection bias in the variance. Our theoretical results show that NEST has strong asymptotic properties without requiring explicit assumptions about the prior. Extensions to other two-parameter members of the exponential family are discussed. Simulation studies show that NEST outperforms competing methods, with substantial efficiency gains in many settings. The proposed method is demonstrated on estimating the batting averages of baseball players and Sharpe ratios of mutual fund returns.
Some key words: Compound decision; double shrinkage estimation; non-parametric empirical Bayes; kernelized Stein's discrepancy; Tweedie's formula.
© 2018 Biometrika Trust
1. INTRODUCTION
1.1. Large-scale estimation and Empirical Bayes
Suppose that we are interested in estimating a vector of parameters η = (η1, . . . , ηn) based on the summary statistics Y1, . . . , Yn from n study units. The setting where $Y_i \mid \mu_i \sim N(\mu_i, \sigma^2)$ is the most well-known example, but the broader scope includes the compound estimation of Poisson parameters λi, binomial parameters pi, and other members of the exponential family. High-dimensional settings have required researchers to solve this problem in new ways and for new purposes. In modern large-scale applications it is often of interest to perform simultaneous and selective inference (Benjamini & Yekutieli, 2011; Berk et al., 2013; Weinstein et al., 2013). For example, there has been recent work on how to construct valid simultaneous confidence intervals after a selection procedure is applied (Lee et al., 2016). In multiple testing, as well as related ranking and selection problems, it is often desirable to incorporate estimates of the effect sizes ηi in the decision process to prioritize the selection of more scientifically meaningful hypotheses (Benjamini & Hochberg, 1997; Sun & McLain, 2012; He et al., 2015; Henderson & Newton, 2016; Basu et al., 2017).
However, the simultaneous inference of thousands of means, or other parameters, is challenging because, as described in Efron (2011), the large scale of the problem introduces selection bias, wherein some data points are large merely by chance, causing traditional estimators to overestimate the corresponding means. Shrinkage estimation, exemplified by the seminal work of James & Stein (1961), has been widely used in simultaneous inference. There are several popular classes of methods, including linear shrinkage estimators (James & Stein, 1961; Efron & Morris, 1975; Berger, 1976), non-linear thresholding-based estimators motivated by sparse priors (Donoho & Johnstone, 1994; Johnstone & Silverman, 2004; Abramovich et al., 2006), and both Bayes and empirical Bayes estimators with unspecified priors (Brown & Greenshtein, 2009; Jiang & Zhang, 2009; Castillo & van der Vaart, 2012). This article focuses on a class of estimators based on Tweedie's formula (Robbins, 1956)¹. The formula is an elegant shrinkage estimator, for distributions from the exponential family, that has recently received renewed interest (Brown & Greenshtein, 2009; Efron, 2011; Koenker & Mizera, 2014). Tweedie's formula is simple and intuitive, and its implementation under the f-modeling strategy for empirical Bayes estimation only requires the estimation of the marginal distribution of Yi. This property is particularly appealing for large-scale estimation problems where such estimates can be easily constructed from the observed data. The resultant empirical Bayes estimator enjoys optimality properties (Brown & Greenshtein, 2009) and delivers superior numerical performance. The work of Efron (2011) further convincingly demonstrates that Tweedie's formula provides an effective bias correction tool when estimating thousands of parameters simultaneously.
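To fix ideas, the classical homoscedastic version of Tweedie's formula, $\delta(y) = y + \sigma^2 \frac{d}{dy}\log f(y)$, can be sketched in a few lines. The following Python snippet is our own illustration (the paper's accompanying software is in R, and all names and simulation settings below are ours), with the marginal density $f$ estimated by a Gaussian kernel density estimate whose derivative is available in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 1000, 1.0
mu = rng.normal(0.0, 2.0, n)              # latent effect sizes
y = mu + rng.normal(0.0, sigma, n)        # observed summary statistics

# Gaussian kernel density estimate of the marginal f(y), with an
# analytic derivative so the score f'(y)/f(y) needs no differencing
h = 1.06 * y.std() * n ** (-1 / 5)        # Silverman's rule-of-thumb bandwidth
d = (y[:, None] - y[None, :]) / h         # d[i, j] = (y_i - y_j) / h
phi = np.exp(-0.5 * d**2)                 # unnormalized Gaussian kernel
f_hat = phi.mean(axis=1)                  # proportional to f_hat(y_i)
df_hat = (-d / h * phi).mean(axis=1)      # proportional to f_hat'(y_i)

# Tweedie's formula: posterior mean = y + sigma^2 * d/dy log f(y)
delta = y + sigma**2 * df_hat / f_hat

mse_raw = np.mean((y - mu) ** 2)          # risk of the unbiased estimate
mse_tweedie = np.mean((delta - mu) ** 2)  # risk after shrinkage
```

On simulated effects such as these, the shrinkage estimate attains a noticeably smaller mean squared error than the raw observations, which is precisely the selection-bias correction described above.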
1.2. Issues with heterogeneous data
Most of the research in this area has been restricted to models of the form f(yi | ηi) where the distribution of Yi is solely a function of ηi. In situations involving a nuisance parameter θ it is generally assumed to be known and identical for all Yi. For example, homoscedastic Gaussian models of the form $Y_i \mid \mu_i, \sigma \overset{\text{ind.}}{\sim} N(\mu_i, \sigma^2)$ involve a common nuisance parameter θ = σ² for all i. However, in large-scale studies when the data are collected from heterogeneous sources, the nuisance parameters may vary over the n study units. Perhaps the most common example, and the setting we concentrate most on, involves heteroscedastic errors, where σ² varies over Yi. Microarray data (Erickson & Sabatti, 2005; Chiaretti et al., 2004), returns on mutual funds

¹ A reviewer pointed out that the formula appears even earlier in the astronomy literature (Dyson, 1926; Eddington, 1940), wherein Frank Dyson credits the formula to Sir Arthur Eddington.
(Brown et al., 1992), and state-wide school performance gaps (Sun & McLain, 2012) are all instances of large-scale data where genes, funds, or schools have heterogeneous variances. Heteroscedastic errors also arise in analysis of variance and linear regression settings (Weinstein et al., 2018). Moreover, in compound binomial problems, heterogeneity arises through unequal sample sizes across different study units. Unfortunately, the conventional Tweedie's formula assumes identical nuisance parameters across study units and so cannot eliminate selection bias for heterogeneous data. Moreover, various works show that failing to account for heterogeneity leads to inefficient shrinkage estimators (Weinstein et al., 2018), methods with invalid false discovery rates (Efron, 2008; Cai & Sun, 2009), unstable multiple testing procedures (Tusher et al., 2001) and suboptimal ranking and selection algorithms (Henderson & Newton, 2016), exacerbating the replicability crisis in large-scale studies. Few methodologies are available to address this issue.
For Gaussian data, a common goal is to find the estimator, or make the decision, that minimizes the expected squared error loss. A plausible-seeming solution might be to scale each Yi by its estimated standard deviation Si so that a homoscedastic method could be applied to Xi = Yi/Si, before undoing the scaling on the final estimate of µi. Indeed, this is essentially the approach taken whenever we compute standardized test statistics, such as t-values and z-values. A similar standardization is performed in the binomial setting when we compute pi = Xi/mi, where Xi is the number of successes and mi the number of trials. However, this approach, which disregards important structural information, can be highly inefficient. More advanced methods have been developed, but all suffer from various limitations. For instance, the methods proposed by Xie et al. (2012), Tan (2015), Jing et al. (2016), Kou & Yang (2017), and Zhang & Bhattacharya (2017) are designed for heteroscedastic data but assume a parametric Gaussian prior or a semi-parametric Gaussian mixture prior, which leads to a loss of efficiency when the prior is misspecified. Moreover, existing methods, such as Xie et al. (2012) and Weinstein et al. (2018), often assume that the nuisance parameters are known and use a consistent estimator for implementation. However, when a large number of units are investigated simultaneously, traditional sample variance estimators may similarly suffer from selection bias, which often leads to severe deterioration in the mean squared error for estimating the means.
1.3. The proposed approach and main contributions
In the homogeneous setting, Tweedie's formula estimates ηi using the score function of Yi, but that approach does not immediately extend to heterogeneous data. Instead, this article proposes a two-step approach, "Nonparametric Empirical Bayes Structural Tweedie" (NEST), which first estimates the bivariate score function for data from a two-parameter exponential family, and second predicts ηi using a generalized version of Tweedie's formula that effectively incorporates the structural information encoded in the variances. A significant challenge in the heterogeneous setting is how to pool information from different study units effectively while accounting for the heterogeneity captured by the possibly unknown nuisance parameters. NEST addresses the issue through a double shrinkage method that simultaneously incorporates the structural information encoded in both the primary (e.g. Yi) and auxiliary (e.g. Si) data.

NEST has several clear advantages. First, it simultaneously handles the two main selection biases introduced by heterogeneity: selection bias in the primary parameter (the mean), which cannot be effectively corrected without also correcting for selection bias in the nuisance parameter (the variance). By producing more accurate estimates of the nuisance parameters, NEST in general yields improved shrinkage factors for estimating the primary parameters. Second, NEST makes no explicit assumptions about the prior since it uses a nonparametric method to directly estimate the optimal shrinkage directions. Third, NEST exploits the structure of the entire sample and avoids the information loss that occurs in the discretization step of the grouping methods (Weinstein et al., 2018). Fourth, NEST only requires a few simple assumptions to achieve strong asymptotic properties. Finally, NEST provides a general estimation framework for members of the two-parameter exponential family. It is fast to implement, produces stable estimates, and is robust against model mis-specification. We demonstrate numerically that NEST can provide high levels of estimation accuracy relative to a host of benchmark methods.
1.4. Organization
The rest of the paper is structured as follows. Section 2 first derives a generalized Tweedie's formula for a hierarchical Gaussian model where both the mean and variance parameters are unknown. We then develop a convex optimization approach to estimating the unknown shrinkage factors in the formula. In Section 3, we first describe the theoretical setup, then justify the optimization criterion, and finally establish asymptotic theories for the proposed NEST estimator. Extensions to other members of the two-parameter exponential family are discussed in Section 4. Simulation studies are carried out in Section 5 to compare NEST to competing methods. Two data applications are presented in Section 6. The proofs, as well as additional theoretical and numerical results, are provided in the Supplementary Material.
2. DOUBLE SHRINKAGE ESTIMATION ON HETEROSCEDASTIC NORMAL DATA
2.1. Generalized Tweedie's formula for a hierarchical Normal model
Suppose we collect mi observations for the ith study unit, i = 1, . . . , n. The data are normally distributed obeying the following hierarchical model:
$$Y_{ij} \mid \mu_i, \tau_i \overset{\text{i.i.d.}}{\sim} N(\mu_i, 1/\tau_i), \quad j = 1, \ldots, m_i, \qquad (1)$$
$$\mu_i \mid \tau_i \overset{\text{ind.}}{\sim} G_\mu(\cdot \mid \tau_i), \quad \tau_i \overset{\text{i.i.d.}}{\sim} H_\tau(\cdot), \quad i = 1, \ldots, n.$$
We view µi and τi, both of which are unknown, as the primary and nuisance parameters, respectively. The prior distributions Gµ(·|τi) and Hτ(·) are unspecified. We develop a nonparametric empirical Bayes method for jointly estimating the mean and variance under the squared error loss. The general setting of two-parameter exponential families is considered in Section 4.
Let $Y_i = m_i^{-1}\sum_{j=1}^{m_i} Y_{ij}$ and $S_i^2 = (m_i - 1)^{-1}\sum_{j=1}^{m_i}(Y_{ij} - Y_i)^2$ respectively denote the sample mean and sample variance. Further, let $\boldsymbol Y = (Y_1, \ldots, Y_n)$ and $\boldsymbol S = (S_1^2, \ldots, S_n^2)$ be the vectors of summary statistics, and $\boldsymbol y = (y_1, \ldots, y_n)$ and $\boldsymbol s = (s_1^2, \ldots, s_n^2)$ the observed values. Denote by $\boldsymbol\delta = (\delta_1, \ldots, \delta_n)^T$ an estimator for $\boldsymbol\mu = (\mu_1, \ldots, \mu_n)^T$ based on $(\boldsymbol Y, \boldsymbol S)$. Consider the squared error loss $l_n(\boldsymbol\mu, \boldsymbol\delta) = n^{-1}\sum_{i=1}^n \ell(\mu_i, \delta_i)$, where $\ell(\mu_i, \delta_i) = (\delta_i - \mu_i)^2$. The compound Bayes risk is
$$r(\boldsymbol\delta, G) = \frac{1}{n}\sum_{i=1}^n \iiint \ell(\mu_i, \delta_i)\, f_{m_i}(y, s^2 \mid \boldsymbol\psi_i)\, dy\, ds^2\, dG(\boldsymbol\psi_i), \qquad (2)$$
where $\boldsymbol\psi_i = (\mu_i, \tau_i)$, $G(\boldsymbol\psi_i) = G_\mu(\mu_i \mid \tau_i)\, H_\tau(\tau_i)$, and $f_{m_i}(y, s^2 \mid \boldsymbol\psi_i)$ is the likelihood function of $(Y_i, S_i^2)$. The Bayes estimator that minimizes (2) is given by $\boldsymbol\delta^\pi = (\delta_1^\pi, \ldots, \delta_n^\pi)$, where
$$\delta_i^\pi := \delta^\pi(y_i, s_i^2, m_i) = E(\mu_i \mid Y_i = y_i, S_i^2 = s_i^2, m_i).$$
Unfortunately, Model (1) in its full generality does not allow a closed form expression for $\delta^\pi$. The next proposition exploits the properties of the exponential family to form a new representation of the posterior distribution of the canonical parameters. The useful representation is subsequently employed in Corollary 1 to construct our proposed estimator.
PROPOSITION 1. Consider hierarchical Model (1) with $m_i > 3$. Let $f_m(y, s^2) = \int f_m(y, s^2 \mid \boldsymbol\psi)\, dG(\boldsymbol\psi)$ denote the joint density of $(Y, S^2)$. The posterior distribution of $(\mu_i, \tau_i)$ belongs to a two-parameter exponential family with density
$$f_{m_i}(\mu_i, \tau_i \mid y_i, s_i^2) \propto \exp\{\boldsymbol\eta_i^T T(\mu_i, \tau_i) - A(\boldsymbol\eta_i)\}\, g_\mu(\mu_i \mid \tau_i)\, h_\tau(\tau_i),$$
where $T(\mu_i, \tau_i) = (\tau_i\mu_i, \tau_i/2)$, $\boldsymbol\eta_i = \{m_i y_i,\; -m_i y_i^2 - (m_i - 1)s_i^2\} := (\eta_{1i}, \eta_{2i})$,
$$\gamma(\eta_{1i}, \eta_{2i}) = \frac{-\eta_{2i} - m_i^{-1}\eta_{1i}^2}{m_i - 1}, \quad A(\boldsymbol\eta_i) = -\tfrac{1}{2}(m_i - 3)\log \gamma(\eta_{1i}, \eta_{2i}) + \log f_{m_i}\{m_i^{-1}\eta_{1i},\, \gamma(\eta_{1i}, \eta_{2i})\}.$$
The two-parameter exponential family representation of the joint posterior distribution of (µi, τi) in Proposition 1 is particularly useful because, as shown in Corollary 1, it allows one to write down the Bayes estimators of ζi := τiµi and τi immediately.
COROLLARY 1. Under the hierarchical Model (1) with $m_i > 3$, the Bayes estimators of $(\zeta_i, \tau_i)$ under squared error loss are, respectively,
$$\zeta_i^\pi := \zeta^\pi(y_i, s_i^2, m_i) = E(\zeta_i \mid y_i, s_i^2, m_i) = y_i\, E(\tau_i \mid y_i, s_i^2, m_i) + m_i^{-1} w_1(y_i, s_i^2; m_i), \qquad (3)$$
$$\tau_i^\pi := \tau^\pi(y_i, s_i^2, m_i) = E(\tau_i \mid y_i, s_i^2, m_i) = \frac{m_i - 3}{(m_i - 1)s_i^2} - \frac{2}{m_i - 1}\, w_2(y_i, s_i^2; m_i), \qquad (4)$$
where $w_1(y, s^2; m) := \frac{\partial}{\partial y}\log f_m(y, s^2)$ and $w_2(y, s^2; m) := \frac{\partial}{\partial s^2}\log f_m(y, s^2)$.
Corollary 1 is significant for several reasons. First, it generalizes Tweedie's formula, which can be recovered from Equation (3) by treating τi as a known constant and dividing both sides by τi. Second, the score functions $w_{1i} := w_1(y_i, s_i^2; m_i)$ and $w_{2i} := w_2(y_i, s_i^2; m_i)$ correspond to optimal shrinkage factors that should be applied to $(y_i, s_i^2)$ when both µi and τi are unknown. Hence the optimal estimation of µi requires a "double shrinkage" procedure that operates on both the ζ and τ dimensions. Third, Corollary 1 can be employed to construct estimators for functions of (µi, τi) such as the Sharpe ratio in finance applications (Section 6.2).
2.2. Estimation of shrinkage factors via convex optimization
We begin by introducing some notation. Let $\boldsymbol x = (y, s^2)$ be a generic pair of observations from the distribution with marginal density $f_m(\boldsymbol x)$, which we assume is continuously differentiable with support on $\mathcal X \subseteq \mathbb R \times \mathbb R^+$. Denote the score function
$$\boldsymbol w(\boldsymbol x; m) = \nabla_{\boldsymbol x}\log f_m(\boldsymbol x) := \{w_1(\boldsymbol x; m),\, w_2(\boldsymbol x; m)\}. \qquad (5)$$
Next, for $i = 1, \ldots, n$, let $\boldsymbol z_i = (\boldsymbol x_i, m_i)$ where $\boldsymbol x_i$ is an observation from a distribution with density $f_{m_i}(\boldsymbol x)$.

Let $K(\boldsymbol z_i, \boldsymbol z_j) = \exp\{-(\boldsymbol z_i - \boldsymbol z_j)^T \Omega (\boldsymbol z_i - \boldsymbol z_j)/2\}$ be the radial basis function (RBF) kernel with $\Omega_{3\times 3}$ being the inverse of the sample covariance matrix of $(\boldsymbol z_1, \ldots, \boldsymbol z_n)$. Denote
$$W_0^{n\times 2} = \{\boldsymbol w(\boldsymbol x_1; m_1), \ldots, \boldsymbol w(\boldsymbol x_n; m_n)\}^T, \qquad (6)$$
where $\boldsymbol w(\boldsymbol x_i; m_i) = \nabla_{\boldsymbol x}\log f_{m_i}(\boldsymbol x)\big|_{\boldsymbol x = \boldsymbol x_i} := \{w_1(\boldsymbol x_i; m_i),\, w_2(\boldsymbol x_i; m_i)\}$. We denote $w_k(\boldsymbol x_i; m_i)$ by $w_{ki}$, $k = 1, 2$, for the remainder of this article.
Let $\nabla_{z_{jk}} K(\boldsymbol z_i, \boldsymbol z_j)$ be the partial derivative of $K(\boldsymbol z_i, \boldsymbol z_j)$ with respect to the $k$th component of $\boldsymbol z_j$. The following matrices are needed in our proposed estimator:
$$K_{n\times n} = [K_{ij}]_{1\le i \le n,\, 1\le j \le n}, \quad \nabla K_{n\times 2} = [\nabla K_{ik}]_{1\le i \le n,\, 1\le k \le 2},$$
where $K_{ij} = K(\boldsymbol z_i, \boldsymbol z_j)$ and $\nabla K_{ik} = \sum_{j=1}^n \nabla_{z_{jk}} K(\boldsymbol z_i, \boldsymbol z_j)$. Next we formally define our proposed Nonparametric Empirical-Bayes Structural Tweedie (NEST) estimator.
DEFINITION 1. Consider hierarchical Model (1) with $m_i > 3$. For a fixed regularization parameter $\lambda > 0$, let $W_n(\lambda) = (\boldsymbol w^1_\lambda, \ldots, \boldsymbol w^n_\lambda)^T$, where $\boldsymbol w^i_\lambda = (w^i_{1,\lambda}, w^i_{2,\lambda})$, be the solution to the following quadratic optimization problem:
$$\min_{W \in \mathbb R^{n\times 2}}\; \frac{1}{n^2}\,\mathrm{trace}\left(W^T K W + 2 W^T \nabla K\right) + \rho_n(W; \lambda), \qquad (7)$$
where $\rho_n(W; \lambda)$ is a penalty on the elements of $W$. Then the NEST estimator for $\mu_i$ is
$$\delta_i^{ds}(\lambda) = y_i + \frac{1}{m_i\, \tau_i^{ds}(\lambda)}\, w^i_{1,\lambda}, \qquad (8)$$
where
$$\tau_i^{ds}(\lambda) = \frac{m_i - 3}{(m_i - 1)s_i^2} - \frac{2}{m_i - 1}\, w^i_{2,\lambda}, \qquad (9)$$
with the superscript $ds$ denoting "double shrinkage".
Although not immediately obvious, we show in Section 3 that, under the compound estimation setting, minimizing the first term in the objective function (7) is asymptotically equivalent to minimizing the kernelized Stein's discrepancy (KSD; Liu et al., 2016; Chwialkowski et al., 2016). Roughly speaking, the KSD measures how far a given $n\times 2$ matrix $W$ is from the true score matrix $W_0$. A key property of the KSD is that it is always non-negative and is equal to 0 if and only if $W$ and $W_0$ are equal. Hence, solving the convex program (7) is equivalent to finding a $W$ that is as close as possible to $W_0$. Since the oracle rule that minimizes the Bayes risk (2) is constructed based on $W_0$, we can expect that the NEST estimator based on $W$ would be asymptotically optimal. Theory underpinning this intuition, as well as the asymptotic optimality of $\delta_i^{ds}(\lambda)$, is established in Section 3.

The second term in Equation (7), $\rho_n(W; \lambda)$, is a penalty function that increases as the elements of $W$ move further away from 0. In this article, we take $\rho_n(W; \lambda) = (\lambda/n^2)\sum_{i=1}^n\sum_{k=1}^2 W_{ik}^2 = (\lambda/n^2)\|W\|_F^2$, where $W_{ik}$ are the elements of the matrix $W$ and $\|W\|_F$ is the Frobenius norm of $W$. A large $\lambda$ forces the estimated shrinkage factors towards 0, and in the limit the NEST estimate is simply the unbiased estimate $y_i$. An alternative approach, as pursued in James et al. (2020), is to penalize the lack of smoothness in $\boldsymbol w$ as a function of $\boldsymbol x$.
A key characteristic of $\delta_i^{ds}(\lambda)$ in Equation (8) is that it exploits the joint structural information available in both $Y_i$ and $S_i^2$ through $W_n(\lambda)$. Although the loss function only involves the means, we perform shrinkage on both the mean and variance dimensions. Inspecting Equations (8) and (9), we expect that the improved accuracy achieved by $\tau_i^{ds}(\lambda)$ leads to better shrinkage factors for $\delta_i^{ds}(\lambda)$ and hence an additional reduction in the estimation risk. In Section 4 we adopt a similar strategy to extend the estimation framework presented in Definition 1 to other distributions in the two-parameter exponential family.
We end this section with a discussion of the case of equal sample sizes, i.e. $m_i = m$ for all $i$. For instance, the leukemia dataset analyzed in Jing et al. (2016) consists of the expression levels of n = 5,000 genes for m = 27 acute lymphoblastic leukemia patients. The heterogeneity across the n units is due to intrinsic variability rather than a varying number of replicates. When the $m_i$'s are equal, the RBF kernel $K(\cdot, \cdot)$ needs to be modified to avoid a singular sample covariance matrix. Denote by $K(\boldsymbol x_i, \boldsymbol x_j) = \exp\{-0.5(\boldsymbol x_i - \boldsymbol x_j)^T \Omega (\boldsymbol x_i - \boldsymbol x_j)\}$ the modified RBF kernel with $\Omega_{2\times 2}$ being the inverse of the sample covariance matrix of $(\boldsymbol x_1, \ldots, \boldsymbol x_n)$. Correspondingly, in Definition 1, $W_n(\lambda)$ are the estimates of the shrinkage factors $W_0^{n\times 2} = \{\boldsymbol w(\boldsymbol x_1; m), \ldots, \boldsymbol w(\boldsymbol x_n; m)\}^T$, where $\boldsymbol w(\boldsymbol x_i; m) = \{w_1(\boldsymbol x_i; m),\, w_2(\boldsymbol x_i; m)\} := (w_{1i}, w_{2i})$.
2.3. Details around implementation
In this section we discuss details around the implementation of NEST. First note that Equation (7) can be solved separately for the two columns of $W$, which respectively yield the estimates of $w_{1i}$ and $w_{2i}$. Next, the solution to Equation (7) is available in the closed form $W_n(\lambda) = -B(\lambda)\nabla K$, where $B(\lambda) = (K + \lambda I)^{-1}$. However, in our implementation the closed form solution is replaced by a convex program that directly solves (7) with the constraint $W\boldsymbol a \preceq \boldsymbol b$, where $\boldsymbol a = (0, 1)^T$ and $\boldsymbol b = (b_1, \ldots, b_n)$ with $b_i = \frac{1}{2}(m_i - 3)/s_i^2 - \kappa$ for some $\kappa > 0$. Inspecting Equation (9) shows that adding this constraint guarantees that $\tau_i^{ds}(\lambda) > 0$. This is desirable in both numerical and theoretical analyses.
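The unconstrained closed form $W_n(\lambda) = -(K + \lambda I)^{-1}\nabla K$ is straightforward to code. The Python sketch below is purely illustrative (it uses the equal-sample-size kernel of Section 2.2, a fixed $\lambda$ of our choosing, and omits the positivity constraint $W\boldsymbol a \preceq \boldsymbol b$ just described; the actual implementation solves the constrained program in R):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, lam = 400, 10, 1.0
tau = rng.gamma(5.0, 0.5, n)
mu = rng.normal(0.0, 0.5 / np.sqrt(tau))
Y = rng.normal(mu[:, None], (1 / np.sqrt(tau))[:, None], (n, m))
x = np.column_stack([Y.mean(axis=1), Y.var(axis=1, ddof=1)])  # (ybar, s^2)

# modified RBF kernel for equal m: Omega is the inverse sample covariance
Omega = np.linalg.inv(np.cov(x.T))
diff = x[:, None, :] - x[None, :, :]                 # diff[i, j] = x_i - x_j
K = np.exp(-0.5 * np.einsum("ijk,kl,ijl->ij", diff, Omega, diff))
# d/dx_j K(x_i, x_j) = Omega (x_i - x_j) K_ij; the matrix nabla-K sums over j
gradK = np.einsum("kl,ijl,ij->ik", Omega, diff, K)

W = -np.linalg.solve(K + lam * np.eye(n), gradK)     # closed-form solution
w1, w2 = W[:, 0], W[:, 1]

tau_ds = (m - 3) / ((m - 1) * x[:, 1]) - 2 / (m - 1) * w2    # Eq. (9)
delta_ds = x[:, 0] + w1 / (m * np.clip(tau_ds, 0.05, None))  # Eq. (8)
```

For the RBF kernel, $\nabla_{z_j} K(z_i, z_j) = \Omega(z_i - z_j)K_{ij}$, which is what the second einsum computes before summing over $j$; the clipping of the estimated precision is our crude substitute for the constraint.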
The practical implementation requires a data-driven scheme for choosing $\lambda$. Consider the modified cross validation scheme of Brown et al. (2013), which involves splitting $Y_{ij}$ into two parts: $U_{ij} = Y_{ij} + \alpha\epsilon_{ij}$ and $V_{ij} = Y_{ij} - (1/\alpha)\epsilon_{ij}$, where $\epsilon_{ij} \sim N(0, \tau_i^{-1})$, and $U_{ij}$ and $V_{ij}$ are used to construct the estimator and to choose the tuning parameter, respectively. However, in our setup $\tau_i$ is unknown; hence we sample the $\epsilon_{ij}$'s from $N(0, S_i^2)$. Let $V_i = m_i^{-1}\sum_{j=1}^{m_i} V_{ij}$, $U_i = m_i^{-1}\sum_{j=1}^{m_i} U_{ij}$, $\mathcal U = \{U_{ij} : 1 \le i \le n,\, 1 \le j \le m_i\}$ and $\mathcal V = \{V_{ij} : 1 \le i \le n,\, 1 \le j \le m_i\}$. Then, conditional on $(\mu_i, \tau_i)$, $U_i$ and $V_i$ are approximately independent with mean $E(Y_i)$ when $m$ is large. Define $\vartheta_n(\lambda; \mathcal U, \mathcal V) = \frac{1}{n}\sum_{i=1}^n \{V_i - \delta_i^{ds}(U_i; \mathcal U, \lambda)\}^2$. The tuning parameter is chosen as $\hat\lambda := \arg\min_{\lambda\in\Lambda} \vartheta_n(\lambda)$. In our numerical studies of Sections 5 and 6, we set $\alpha = 1/2$ and $n^{-2}\Lambda = [1, 10]$. The tuning parameter $\hat\lambda$ is obtained from this scheme and then used to estimate $\mu_i$ via Equation (8).
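The splitting scheme can be sketched end to end as follows. This is our illustrative Python translation (with the unconstrained closed-form solver of Section 2.3 wrapped in a helper we define here, and a deliberately coarse grid for $\lambda$), not the authors' implementation:

```python
import numpy as np

def nest_closed_form(x, lam):
    """Unconstrained closed form W = -(K + lam I)^{-1} grad K (Section 2.3)."""
    Omega = np.linalg.inv(np.cov(x.T))
    diff = x[:, None, :] - x[None, :, :]
    K = np.exp(-0.5 * np.einsum("ijk,kl,ijl->ij", diff, Omega, diff))
    gradK = np.einsum("kl,ijl,ij->ik", Omega, diff, K)
    return -np.linalg.solve(K + lam * np.eye(len(x)), gradK)

rng = np.random.default_rng(3)
n, m, alpha = 300, 20, 0.5
tau = rng.gamma(5.0, 0.5, n)
mu = rng.normal(0.0, 0.5 / np.sqrt(tau))
Y = rng.normal(mu[:, None], (1 / np.sqrt(tau))[:, None], (n, m))
s2 = Y.var(axis=1, ddof=1)

# split each observation: U for fitting, V for validation (Brown et al., 2013);
# tau_i is unknown, so the noise is drawn with variance S_i^2
eps = rng.normal(0.0, np.sqrt(s2)[:, None], (n, m))
U, V = Y + alpha * eps, Y - eps / alpha
Ubar, Vbar, Us2 = U.mean(axis=1), V.mean(axis=1), U.var(axis=1, ddof=1)

def cv_loss(lam):
    W = nest_closed_form(np.column_stack([Ubar, Us2]), lam)
    tau_ds = (m - 3) / ((m - 1) * Us2) - 2 / (m - 1) * W[:, 1]
    delta = Ubar + W[:, 0] / (m * np.clip(tau_ds, 0.05, None))
    return np.mean((Vbar - delta) ** 2)   # validation criterion vartheta_n

grid = [0.1, 1.0, 10.0]
lam_hat = min(grid, key=cv_loss)
```

The chosen `lam_hat` would then be plugged back into Equation (8) on the original data; the coarse grid and the clipping are simplifications of ours.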
Remark 1. $\vartheta_n(\lambda; \mathcal U, \mathcal V)$ provides a practical criterion for choosing $\lambda$ and works well empirically in our numerical studies. However, $\vartheta_n$ is not an unbiased estimate of the true risk. We note in Section 5 that the relative risk of the data-driven NEST is slightly off from 1 when m = 10 (e.g. the left panels in Figs 1 & 2). As m increases, the covariance between $U_{ij}$ and $V_{ij}$ converges to 0 and the risk of our data-driven estimator is almost identical to that of an oracle (which selects the optimal $\lambda$ based on the true risk). The development of a SURE criterion is a challenging topic requiring further research. See also Ignatiadis & Wager (2019) for related discussions.
We are developing an R package, nest, to implement the NEST estimator in Definition 1.
The R code that reproduces the numerical results in Sections 5 and 6 can be downloaded from
the link: github.com/trambakbanerjee/DS paper.
3. THEORY
3.1. Kernelized Stein’s Discrepancy
In this section we introduce the kernelized Stein's discrepancy (KSD) measure (Liu et al., 2016; Chwialkowski et al., 2016) and discuss its connection to the quadratic program (7). While the KSD has been used in various contexts, including goodness of fit tests (Liu et al., 2016; Yang et al., 2018), variational inference (Liu & Wang, 2016) and Monte Carlo integration (Oates et al., 2017), its connections to compound estimation and empirical Bayes methodology were established only recently (Banerjee et al., 2020). The analysis in this and the following sections is geared towards the case $m_i = m$ for $i = 1, \ldots, n$. Under this setting, $(\boldsymbol x_1, \ldots, \boldsymbol x_n)$ constitute an i.i.d. random sample from $f_m(\boldsymbol x)$. The case of unequal $m_i$'s can be analyzed in a similar fashion by first assuming that the $m_i$'s are a random sample from a distribution with mass function $q(\cdot)$ and are independent of $(\mu_i, \tau_i)$. Then $\boldsymbol z = (\boldsymbol x, m)$ has distribution with density $p(\boldsymbol z) := q(m) f_m(\boldsymbol x)$ and, with $\boldsymbol z_i = (\boldsymbol x_i, m_i)$, $(\boldsymbol z_1, \ldots, \boldsymbol z_n)$ are realizations of an i.i.d. random sample from $p(\boldsymbol z)$.

Suppose $X$ and $X'$ are i.i.d. copies from the marginal distribution of $(Y, S^2)$ that has density $f$, wherein the dependence on $m$ is implicit. Denote by $\boldsymbol w(X)$ and $\boldsymbol w(X')$ the score functions, defined in Equation (5), at $X$ and $X'$ respectively. Suppose $\tilde f$ is an arbitrary density function on the support of $(Y, S^2)$, for which we similarly define $\tilde{\boldsymbol w}(X)$. The KSD, formally defined as
$$S(\tilde f, f) = E_{X, X' \sim f}\left[\{\tilde{\boldsymbol w}(X) - \boldsymbol w(X)\}^T K(X, X')\,\{\tilde{\boldsymbol w}(X') - \boldsymbol w(X')\}\right], \qquad (10)$$
provides a discrepancy measure between $\tilde f$ and $f$ in the sense that $S(\tilde f, f)$ tends to increase when there is a bigger disparity between $\tilde{\boldsymbol w}$ and $\boldsymbol w$ (or equivalently, between $\tilde f$ and $f$), and
$$S(\tilde f, f) \ge 0, \quad \text{with } S(\tilde f, f) = 0 \text{ if and only if } f = \tilde f.$$
The direct evaluation of $S(\tilde f, f)$ is difficult because $\boldsymbol w$ is unknown. Liu et al. (2016) introduced an alternative representation of the KSD that does not directly involve $\boldsymbol w$:
$$S(\tilde f, f) = E_f\, \kappa[\tilde{\boldsymbol w}(X), \tilde{\boldsymbol w}(X')](X, X') = E_f\left\{\frac{1}{n(n-1)}\sum_{i=1}^n\sum_{j=1}^n \kappa[\tilde{\boldsymbol w}(X_i), \tilde{\boldsymbol w}(X_j)](X_i, X_j)\, I(i \ne j)\right\} = E_f\{M_n(\tilde W)\} := M(\tilde W), \qquad (11)$$
where $X_1, \ldots, X_n$ is a random sample from $f$, $E_f$ denotes expectation under $f$, and
$$\kappa[\tilde{\boldsymbol w}(\boldsymbol x), \tilde{\boldsymbol w}(\boldsymbol x')](\boldsymbol x, \boldsymbol x') = \tilde{\boldsymbol w}(\boldsymbol x)^T \tilde{\boldsymbol w}(\boldsymbol x')\, K(\boldsymbol x, \boldsymbol x') + \tilde{\boldsymbol w}(\boldsymbol x)^T \nabla_{\boldsymbol x'} K(\boldsymbol x, \boldsymbol x') + \nabla_{\boldsymbol x} K(\boldsymbol x, \boldsymbol x')^T \tilde{\boldsymbol w}(\boldsymbol x') + \mathrm{trace}\{\nabla_{\boldsymbol x, \boldsymbol x'} K(\boldsymbol x, \boldsymbol x')\} \qquad (12)$$
is a smooth and symmetric positive definite kernel function associated with the U-statistic $M_n(\tilde W)$. The implementation of the NEST estimator in Definition 1 boils down to the estimation of $W_0$ via the convex program (7), which corresponds to minimizing
$$M_{\lambda(n), n}(\tilde W) = \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \kappa[\tilde{\boldsymbol w}(X_i), \tilde{\boldsymbol w}(X_j)](X_i, X_j) + \frac{\lambda(n)}{n^2}\|\tilde W\|_F^2 \qquad (13)$$
with respect to $\tilde W$. A key observation is that if the empirical criterion $M_{\lambda(n), n}(\tilde W)$ is asymptotically equal to the population KSD criterion $M(\tilde W)$, then minimizing $M_{\lambda(n), n}(\tilde W)$ with respect to $\tilde W = \{\tilde{\boldsymbol w}(\boldsymbol x_1), \ldots, \tilde{\boldsymbol w}(\boldsymbol x_n)\}$ is effectively the process of finding a $\tilde W$ that is as close as possible to $W_0$ in Equation (6). This intuitively justifies the NEST estimator in Definition 1. In what follows, we denote $\lambda(n)$ by $\lambda$ and keep its dependence on $n$ implicit. Next we show that NEST provides an asymptotically optimal solution to the compound estimation problem.
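As a sanity check on Equations (11) and (12), the U-statistic is simple to compute when the truth is known. The self-contained Python illustration below (ours, not the authors') uses a one-dimensional standard normal, whose true score is $w(x) = -x$, with an RBF kernel whose derivatives are worked out by hand; the KSD evaluated at the true score should be near zero, while a misspecified score yields a clearly positive value:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
x = rng.normal(0.0, 1.0, n)      # sample from f = N(0,1); true score is -x

def ksd(score):
    """U-statistic form of the KSD, Eqs. (11)-(12), RBF kernel, 1-D case."""
    d = x[:, None] - x[None, :]
    K = np.exp(-0.5 * d**2)
    w = score(x)
    kappa = (w[:, None] * w[None, :] * K   # w(x) w(x') K(x, x')
             + w[:, None] * d * K          # w(x) d/dx' K;  d/dx' K = d K
             - d * K * w[None, :]          # (d/dx K) w(x'); d/dx K = -d K
             + (1 - d**2) * K)             # trace term d^2/dx dx' K
    np.fill_diagonal(kappa, 0.0)           # the I(i != j) indicator in (11)
    return kappa.sum() / (n * (n - 1))

ksd_true = ksd(lambda t: -t)       # true score of N(0, 1)
ksd_wrong = ksd(lambda t: -t / 4)  # score of N(0, 4): misspecified
```

Zeroing the diagonal implements the $I(i \ne j)$ indicator of Equation (11); only the difference between the candidate and true scores contributes to the expectation, which is why the misspecified score is penalized.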
3.2. Asymptotic properties of NEST
This section studies the asymptotic properties of the NEST estimator. We begin by defining an oracle estimator $\boldsymbol\delta^* = (\delta_1^*, \ldots, \delta_n^*)$ for $\boldsymbol\mu$, where
$$\delta_i^* := \delta^*(y_i, s_i^2, m) = \frac{\zeta^\pi(y_i, s_i^2, m)}{\tau^\pi(y_i, s_i^2, m)} = y_i + \frac{1}{m\,\tau^\pi(y_i, s_i^2, m)}\, w_1(y_i, s_i^2, m), \qquad (14)$$
and $\zeta^\pi$ and $\tau^\pi$ are respectively the Bayes estimators of $\tau\mu$ and $\tau$ as defined in Equations (3) and (4). Viewing the proposed NEST estimator $\delta_i^{ds}(\lambda)$ as a data-driven approximation to $\delta_i^*$, we study the quality of this approximation for large $n$ and fixed $m$.

Remark 2. The oracle estimator $\boldsymbol\delta^*$, which serves as the benchmark in our theoretical analysis, is different from the Bayes estimator $\boldsymbol\delta^\pi$ defined in Section 2.1: $\boldsymbol\delta^\pi$ has full knowledge of the prior distributions $G_\mu(\cdot\mid\tau)$ and $H_\tau(\cdot)$, whereas $\boldsymbol\delta^*$ only has information on the true score functions $(w_{1i}, w_{2i})$. Under the special setting of conjugate priors, $\boldsymbol\delta^*$ and $\boldsymbol\delta^\pi$ coincide (Corollary 2).
We impose the following regularity conditions, where $f$ denotes the density function of the joint marginal distribution of $(Y, S^2)$ and $E_f$ denotes expectation under $f$.

Assumption 1. $E_f\,|\kappa[\tilde{\boldsymbol w}(X_i), \tilde{\boldsymbol w}(X_j)](X_i, X_j)|^2 < \infty$ for any $(i, j) \in \{1, \ldots, n\}$.

Assumption 2. $\int_{\mathcal X} \boldsymbol g(\boldsymbol x)^T K(\boldsymbol x, \boldsymbol x')\, \boldsymbol g(\boldsymbol x')\, d\boldsymbol x\, d\boldsymbol x' > 0$ for any $\boldsymbol g : \mathcal X \to \mathbb R^2$ such that $0 < \|\boldsymbol g\|_2^2 < \infty$.

Assumption 3. For some $\epsilon_i \in (0, 1)$, $i = 1, 2, 3$, $E_G\{\exp(\epsilon_1|\mu|)\} < \infty$, $E_H\{\exp(\epsilon_2/\tau)\} < \infty$ and $E_H\{\exp(\epsilon_3\tau)\} < \infty$.

Assumption 1 is a standard moment condition on $\kappa[\tilde{\boldsymbol w}(X_i), \tilde{\boldsymbol w}(X_j)](X_i, X_j)$ [see, for example, Section 5.5 in Serfling (2009)], which is needed for establishing the central limit theorem for the U-statistic $M_n(\tilde W)$ in Equation (11). Assumption 2 is a condition from Liu et al. (2016) and Chwialkowski et al. (2016) for ensuring that $K(\boldsymbol x, \boldsymbol x')$ is integrally strictly positive definite. This guarantees that the KSD $S(\tilde f, f)$ is a valid discrepancy measure in the sense that $S(\tilde f, f) \ge 0$ and $S(\tilde f, f) = 0$ if and only if $f = \tilde f$. Assumption 3 represents moment conditions on the prior distributions. Together, they ensure that with high probability $|\mu| \le \log n$ and $1/\log n \le \tau \le \log n$ as $n \to \infty$. This is formalized in Lemma 1 in Appendix A.4. These conditions allow a cleaner statement of our theoretical results. It is likely that Assumption 3 can be further relaxed, but we do not seek full generality here.
Theorem 1 below establishes the asymptotic consistency of the sample criterion $M_{\lambda,n}(\tilde W)$ around the population criterion $M(\tilde W)$.

THEOREM 1. If $\lambda(n)/\sqrt n \to 0$ as $n \to \infty$ then, under Assumption 1, we have
$$\left| M_{\lambda,n}(\tilde W) - M(\tilde W) \right| = O_p\!\left(n^{-1/2}\right).$$
Moreover, along with the fact that $M(W_0) = 0$, Theorem 1 establishes that $M_{\lambda,n}(\tilde W)$ is an appropriate optimization criterion.
Theorem 2 establishes the consistency of the estimated score functions.
THEOREM 2. If $\lim_{n\to\infty} c_n n^{-1/2} = 0$ then, under the conditions of Theorem 1 and Assumption 2, we have
$$\lim_{n\to\infty} P\left\{\frac{1}{n}\left\| W_n(\lambda) - W_0 \right\|_F^2 \ge c_n^{-1}\epsilon\right\} = 0, \quad \text{for any } \epsilon > 0.$$
Theorem 3 establishes the optimality theory of $\boldsymbol\delta^{ds}$ by showing that (a) the average squared error between $\boldsymbol\delta^{ds}(\lambda)$ and $\boldsymbol\delta^*$ is asymptotically small, and (b) the estimation loss of NEST converges in probability to that of its oracle counterpart as $n \to \infty$.

THEOREM 3. Suppose $\lambda(n)/\sqrt n \to 0$ as $n \to \infty$ and Assumptions 1-3 hold. If $\lim_{n\to\infty} c_n n^{-1/2}\log^6 n = 0$, then $\frac{c_n}{n}\left\|\boldsymbol\delta^{ds}(\lambda) - \boldsymbol\delta^*\right\|_2^2 = o_p(1)$. Furthermore, if additionally $\lim_{n\to\infty} c_n n^{-1/4}\log^6 n = 0$, then $c_n\left| l_n(\boldsymbol\delta^{ds}(\lambda), \boldsymbol\mu) - l_n(\boldsymbol\delta^*, \boldsymbol\mu)\right| = o_p(1)$.
As we mentioned earlier, $\delta_i^*$ in Equation (14) is not, in general, the Bayes estimator $\delta_i^\pi$ of $\mu_i$. This follows since, dropping the subscript $i$,
$$\delta^*(y, s^2) = \frac{E(\tau\mu \mid y, s^2)}{E(\tau \mid y, s^2)} = \frac{E\{\tau\, E(\mu \mid y, s^2, \tau) \mid y, s^2\}}{E(\tau \mid y, s^2)} \ne E(\mu \mid y, s^2) = \delta^\pi(y, s^2)$$
unless $\mu$ and $\tau$ are conditionally independent given $y$ and $s^2$, in which case $\delta^*(y, s^2) = \delta^\pi(y, s^2)$. A natural setting where $E(\mu \mid y, s^2, \tau)$ is indeed independent of $\tau$ is the popular scenario of conjugate priors, under which the posterior expectation of $\mu$ is a linear combination of the prior expectation of $\mu$ and the maximum likelihood estimate $y$ (Diaconis & Ylvisaker, 1979).
COROLLARY 2. Consider hierarchical Model (1) where $G_\mu(\cdot\mid\tau)$ and $H_\tau(\cdot)$ are, respectively, conjugate prior distributions of $\mu\mid\tau$ and $\tau$. With $m > 3$, let $\boldsymbol\delta^\pi$ denote the Bayes estimator of $\boldsymbol\mu$. Under the conditions of Theorem 3, if $\lim_{n\to\infty} c_n n^{-1/2}\log^6 n = 0$, then
$$\frac{c_n}{n}\left\|\boldsymbol\delta^{ds}(\lambda) - \boldsymbol\delta^\pi\right\|_2^2 = o_p(1).$$
Furthermore, under the same conditions, if $\lim_{n\to\infty} c_n n^{-1/4}\log^6 n = 0$, then
$$c_n\left| l_n(\boldsymbol\delta^{ds}(\lambda), \boldsymbol\mu) - l_n(\boldsymbol\delta^\pi, \boldsymbol\mu)\right| = o_p(1).$$
4. EXTENSIONS
This section considers the extension of our methodology to several well known members of the two-parameter exponential family, focusing on settings where the nuisance parameter is known. Although our proposed estimation framework is motivated by the double shrinkage idea, the approach nonetheless handles the case with known nuisance parameters. We discuss four examples, in each of which we derive the Bayes estimator of the natural parameter in a manner similar to Corollary 1. The Bayes estimator in these examples relies on the unknown score function (of the marginal density of the sufficient statistic), which can be efficiently estimated using the ideas in Section 2.2. The numerical performance of the new estimators considered in this section is investigated in Appendix B.3.
Example 1 (Location mixture of Gaussians). Consider the following hierarchical model:
$$Y_i \mid \mu_i, \tau_i \overset{\text{ind.}}{\sim} N(\mu_i, 1/\tau_i), \quad \mu_i \overset{\text{i.i.d.}}{\sim} G_\mu(\cdot), \quad i = 1, \ldots, n, \qquad (15)$$
where the $\tau_i$ are known and $G_\mu(\cdot)$ is an unspecified prior. Equation (15) represents the heteroscedastic normal means problem with known variances $1/\tau_i$ [see, for example, Weinstein et al. (2018)]. In this setting, the sufficient statistic for $\mu_i$ is $Y_i$ and the Bayes estimator of $\mu_i$ is given by
$$\mu_i^\pi := E(\mu_i \mid y_i, \tau_i) = y_i + \frac{1}{\tau_i}\frac{\partial}{\partial y_i}\log f(y_i \mid \tau_i),$$
where $f(\cdot \mid \tau_i)$ is the density of the distribution of $Y_i$ given $\tau_i$, marginalizing out $\mu_i$. From Section 2.2 and with $m_i = 1$, $\boldsymbol x_i = (y_i, \tau_i)$, the NEST estimate of $\mu_i$ is given by $\delta_i^{nest}(\lambda) = y_i + \frac{1}{\tau_i}\, w^i_{1,\lambda}$.
Example 2 (Scale mixture of Gamma distributions). Consider the following model:
$$Y_{ij} \mid \alpha_i, 1/\beta_i \overset{\text{i.i.d.}}{\sim} \Gamma(\alpha_i, 1/\beta_i), \qquad 1/\beta_i \overset{\text{i.i.d.}}{\sim} G(\cdot), \tag{16}$$
where the shape parameters $\alpha_i$ are known and $G(\cdot)$ is an unspecified prior distribution on the scale parameters $1/\beta_i$. Here $T_i = \sum_{j=1}^{m} Y_{ij}$ is a sufficient statistic and $T_i \mid \alpha_i, \beta_i \overset{\text{ind.}}{\sim} \Gamma(m\alpha_i, 1/\beta_i)$. The posterior distribution of $1/\beta_i$ belongs to a one-parameter exponential family with density
$$f(1/\beta_i \mid T_i, \alpha_i) \propto \exp\big\{-T_i\beta_i + (m\alpha_i - 1)\log T_i - \log f(T_i\mid\alpha_i)\big\}, \tag{17}$$
where $f(\cdot\mid\alpha_i)$ is the density of $T_i$ given $\alpha_i$, marginalizing out $1/\beta_i$. From Equation (17), the Bayes estimator of $\beta_i$ is given by
$$\beta_i^{\pi} := E(\beta_i\mid T_i, \alpha_i) = \frac{m\alpha_i - 1}{T_i} - \frac{\partial}{\partial T_i}\log f(T_i\mid\alpha_i).$$
With $x_i = (T_i, \alpha_i)$, the NEST estimate of $\beta_i$ is given by $\delta_i^{nest}(\lambda) = (m\alpha_i - 1)/T_i - w_{i1,\lambda}$.
Example 3 (Shape mixture of Gamma distributions). We consider the following model:
$$Y_{ij} \mid \alpha_i, 1/\beta_i \overset{\text{i.i.d.}}{\sim} \Gamma(\alpha_i, 1/\beta_i), \qquad \alpha_i \overset{\text{i.i.d.}}{\sim} G(\cdot), \tag{18}$$
where the scale parameters $1/\beta_i$ are known and $G(\cdot)$ is an unspecified prior distribution on the shape parameters $\alpha_i$. Let $Y_i = \sum_{j=1}^{m} Y_{ij}$. Then $Y_i \mid \alpha_i, \beta_i \overset{\text{ind.}}{\sim} \Gamma(m\alpha_i, 1/\beta_i)$ and $T_i = \log Y_i$ is a sufficient statistic. Moreover, the posterior distribution of $\alpha_i$ belongs to a one-parameter exponential family with density
$$f(\alpha_i \mid T_i, 1/\beta_i) \propto \exp\big\{(m\alpha_i)T_i - \beta_i\exp(T_i) - \log f(T_i\mid 1/\beta_i)\big\}, \tag{19}$$
where $f(\cdot\mid 1/\beta_i)$ is the density of $T_i$ given $1/\beta_i$, marginalizing out $\alpha_i$. From Equation (19), the Bayes estimator of $\alpha_i$ is given by
$$\alpha_i^{\pi} := E(\alpha_i\mid T_i, 1/\beta_i) = \frac{\beta_i\exp(T_i)}{m} + \frac{1}{m}\,\frac{\partial}{\partial T_i}\log f(T_i\mid 1/\beta_i).$$
With $x_i = (T_i, 1/\beta_i)$, the NEST estimate of $\alpha_i$ is $\delta_i^{nest}(\lambda) = \beta_i\exp(T_i)/m + m^{-1} w_{i1,\lambda}$.
Example 4 (Scale mixture of Weibulls). We consider the following model:
$$Y_{ij} \mid k_i, \beta_i \overset{\text{i.i.d.}}{\sim} \text{Weibull}(k_i, \beta_i), \qquad \beta_i \overset{\text{i.i.d.}}{\sim} G(\cdot), \tag{20}$$
with density $f(y\mid k, \beta) = \beta k y^{k-1}\exp(-\beta y^k)$. In Equation (20) the shape parameters $k_i$ are known, $G(\cdot)$ is an unspecified prior distribution on the scale parameters $\beta_i$, $T_i = \sum_{j=1}^{m} Y_{ij}^{k_i}$ is a sufficient statistic, and $T_i \mid k_i, 1/\beta_i \overset{\text{ind.}}{\sim} \Gamma(m, 1/\beta_i)$. From Example 2 with $\alpha_i = 1$, the Bayes estimator of $\beta_i$ is
$$\beta_i^{\pi} := E(\beta_i\mid T_i, k_i) = \frac{m - 1}{T_i} - \frac{\partial}{\partial T_i}\log f(T_i\mid k_i).$$
With $x_i = (T_i, k_i)$, the NEST estimate of $\beta_i$ is given by $\delta_i^{nest}(\lambda) = (m - 1)/T_i - w_{i1,\lambda}$.
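A quick sanity check on the formulas in Examples 2 and 4: if the prior $G$ degenerates to a point mass at $1/\beta_0$, then $T$ is marginally $\Gamma(m\alpha, 1/\beta_0)$, its log-density has derivative $(m\alpha - 1)/T - \beta_0$, and the Bayes rule returns $\beta_0$ for every $T$. The values of $m$, $\alpha$, $\beta_0$ and the evaluation points below are arbitrary illustrative choices.

```python
# Degenerate-prior check of the Example 2 Bayes rule: with a point-mass prior at
# 1/beta0, T | alpha ~ Gamma(m*alpha, scale 1/beta0) marginally, so
# d/dT log f(T | alpha) = (m*alpha - 1)/T - beta0 and the rule recovers beta0 exactly.
m, alpha, beta0 = 8, 2.5, 1.7    # arbitrary illustrative values

def score(T):
    return (m * alpha - 1.0) / T - beta0   # exact marginal score under the point mass

for T in (5.0, 11.76, 20.0):
    beta_pi = (m * alpha - 1.0) / T - score(T)
    print(T, beta_pi)                       # equals beta0 for every T
```

The cancellation of the $(m\alpha - 1)/T$ terms is exactly what makes the score the only unknown ingredient in these rules.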
The preceding examples present settings with a known nuisance parameter. When both parameters are unknown, extending our estimation framework to an arbitrary member of the two-parameter exponential family is difficult. The main reason is that in the Gaussian case the sufficient statistics are independent and their marginal distributions are known, whereas for other distributions, such as the Gamma and Beta, the joint distribution of the two sufficient statistics is generally unknown. This impedes a full generalization of our approach. We anticipate that an iterative scheme that conducts shrinkage estimation on the primary and nuisance coordinates in turn may be developed by combining the ideas in Examples 2 and 3 above. We do not pursue those extensions in this article.
5. NUMERICAL EXPERIMENTS
5.1. Compound Estimation of Normal Means
We focus on the hierarchical Model (1) and compare six approaches for estimating $\mu$. These approaches fall into three types. The first consists of the NEST oracle method, which conducts double shrinkage with $\lambda$ chosen to minimize the true loss (DS orc), and the proposed NEST method, where $\lambda$ is chosen using modified cross-validation (DS). The second comprises linear shrinkage methods: the group linear estimator (Grp Linear) of Weinstein et al. (2018); the semi-parametric monotonically constrained SURE estimator that shrinks towards the grand mean (XKB.SG) from Xie et al. (2012); and, from the same paper, the parametric SURE estimator that shrinks towards a general data-driven location (XKB.M). These methods assume that the true variances are known and, for implementation, we employ sample variances. The third type is the g-modelling approach of Gu & Koenker (2017a,b). This method is the nearest competitor to NEST as it estimates the joint prior distribution of the mean and variance using nonparametric maximum likelihood estimation (NPMLE) techniques (Kiefer & Wolfowitz, 1956; Laird, 1978). For the linear shrinkage methods we use code provided by Weinstein et al. (2018), while for NPMLE we rely on the R package REBayes (Koenker & Gu, 2017).
The aforementioned six approaches are evaluated in five simulation settings, with the goal of assessing the relative performance of the competing estimators as the heterogeneity in the variances $\sigma_i^2$ is varied while the sample sizes $m_i$ are kept fixed at $m$. The five settings fall into four types: a setting where means and variances are independent; a setting where means and variances are correlated; a sparse setting; and two settings that represent departures from the Normal data-generating model. For each setting we set $n = 1{,}000$ and compute the average squared error risk of each competing estimator of $\mu$ across 50 Monte Carlo repetitions. Figures 1 to 5 plot the relative risk, namely the ratio of the average squared error risk of a competing estimator to that of DS orc, so that a ratio bigger than 1 represents poorer risk performance relative to the NEST oracle method.
The first setting, Figure 1, corresponds to the independent case. Here, for each $i = 1, \ldots, n$, $\mu_i \overset{\text{i.i.d.}}{\sim} 0.7\,N(0, 0.1) + 0.3\,N(\pm 1, 3)$ and $\sigma_i^2 \overset{\text{i.i.d.}}{\sim} U(0.5, u)$, where we let $u$ vary across four levels: 0.5, 1, 2, 3. The three plots in Figure 1 show the relative risks as $u$ varies for $m = 10, 30$ and $50$ (left to right). For $m = 10$, the competing methods split into two levels of performance: the group with the lowest relative risks consists of NPMLE and DS, while the three linear shrinkage methods exhibit substantially higher relative risks. Moreover, as heterogeneity increases with increasing $u$, the gap between the two groups' relative risks widens, indicating that both NPMLE and the proposed NEST method are particularly useful for compound estimation of normal means when the variances are unknown and heterogeneous and the sample size for estimating those variances is itself small. As $m$ increases, the performance of the three linear shrinkage methods improves, which is expected as there are now more replicates per study unit with which to construct a relatively reliable estimate of the unknown variances. However, the performance of DS improves too, and in particular at $m = 50$ (Figure 1, right) NPMLE exhibits a slightly higher relative risk than DS.
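The relative-risk normalization used throughout Figures 1 to 5 is easy to mimic in a few lines. The sketch below is a simplified homoscedastic stand-in, not the paper's simulation code: it contrasts the unshrunk means with the oracle Bayes rule and reports a ratio above 1, in the same way the figures normalize each competitor by DS orc.

```python
import numpy as np

rng = np.random.default_rng(7)
n, m, A, sigma2, reps = 1000, 10, 1.0, 2.0, 50   # assumed toy configuration

se_naive, se_bayes = 0.0, 0.0
for _ in range(reps):
    mu = rng.normal(0.0, np.sqrt(A), n)
    ybar = rng.normal(mu, np.sqrt(sigma2 / m))    # sufficient statistic per unit
    shrink = A / (A + sigma2 / m)                 # oracle Bayes shrinkage factor
    se_naive += np.mean((ybar - mu) ** 2)
    se_bayes += np.mean((shrink * ybar - mu) ** 2)

rel_risk = se_naive / se_bayes   # > 1: the unshrunk estimator loses to the oracle
print(rel_risk)
```

Here the theoretical ratio is $(A + \sigma^2/m)/A = 1.2$, so the Monte Carlo relative risk should sit close to that value.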
The second setting, Figure 2, corresponds to the correlated case. The precisions $\tau_i = 1/\sigma_i^2$ are generated independently from a gamma mixture, with an even chance of drawing $\Gamma(20, \text{rate} = 20)$ or $\Gamma(20, \text{rate} = u)$, and given $\tau_i$ the means $\mu_i$ are independently $N(\pm 0.5/\tau_i, 0.5^2)$. In this
[Figure 1 here: relative risk versus $u$ for DS, Grp Linear, NPMLE, XKB.M and XKB.SG.]
Fig. 1. Comparison of relative risks when $(\mu_i, \sigma_i^2)$ are independent. Here $\mu_i \overset{\text{i.i.d.}}{\sim} 0.7\,N(0, 0.1) + 0.3\,N(\pm 1, 3)$ and $\sigma_i^2 \overset{\text{i.i.d.}}{\sim} U(0.5, u)$. Plots show $m = 10, 30, 50$ left to right.
setting, the magnitude of the variances increases with $u$ and the means grow with the variances. Again, for the left plot ($m = 10$), the same groups as in Figure 1 perform well, although Grp Linear has a lower relative risk than the other two linear shrinkage methods considered here. However, the pattern as $m$ grows is more pronounced than before: the NEST method maintains the lowest risk across all three plots.
[Figure 2 here: relative risk versus $u$ for DS, Grp Linear, NPMLE, XKB.M and XKB.SG.]
Fig. 2. Comparison of relative risks for correlated $(\mu_i, \sigma_i^2)$. Here $\tau_i = 1/\sigma_i^2 \overset{\text{i.i.d.}}{\sim} 0.5\,\Gamma(20, \text{rate} = 20) + 0.5\,\Gamma(20, \text{rate} = u)$ and $\mu_i\mid\tau_i \overset{\text{ind.}}{\sim} N(\pm 0.5/\tau_i, 0.5^2)$. Plots show $m = 10, 30, 50$ left to right.
The third setting, Figure 3, corresponds to the sparse case. The precisions $\tau_i$ are drawn from the same gamma mixture as in Figure 2, but each $\mu_i$ comes from $N(\pm 0.5/\tau_i, 1)$ with probability 0.3 and from a point mass at 0 with probability 0.7. We see similar patterns to those in Figures 1 and 2 at $m = 10$. For $m = 30$ and $50$, the relative risks of the linear shrinkage methods are now higher than their levels at $m = 10$. This is not surprising: while the risk performance of all methods improves with larger sample sizes, NPMLE and DS exhibit a bigger improvement in risk than Grp Linear, XKB.M and XKB.SG.
The fourth and fifth settings, Figures 4 and 5, correspond to settings where the data $Y_{ij}\mid(\mu_i, \sigma_i^2)$ are not normally distributed. In Figure 4, the $Y_{ij}\mid(\mu_i, \sigma_i^2)$ are generated independently
[Figure 3 here: relative risk versus $u$ for DS, Grp Linear, NPMLE, XKB.M and XKB.SG.]
Fig. 3. Comparison of relative risks when $\mu$ is sparse. Here $\tau_i \overset{\text{i.i.d.}}{\sim} 0.5\,\Gamma(20, \text{rate} = 20) + 0.5\,\Gamma(20, \text{rate} = u)$ and $\mu_i\mid\tau_i \overset{\text{ind.}}{\sim} 0.7\,\delta(0) + 0.3\,N(\pm 0.5/\tau_i, 0.5^2)$. Plots show $m = 10, 30, 50$ left to right.
from a uniform distribution between $\mu_i \pm \sqrt{3}\sigma_i$, the $\sigma_i^2$ are sampled independently from a $N(u, 1)$ distribution truncated below at 0.01, and $\mu_i\mid\sigma_i \overset{\text{ind.}}{\sim} 0.8\,N(\sigma_i^2/4, 0.25) + 0.2\,N(\sigma_i^2, 1)$.

[Figure 4 here: relative risk versus $u$ for DS, Grp Linear, NPMLE, XKB.M and XKB.SG.]
Fig. 4. Comparison of relative risk for non-normal data. Here $Y_{ij}\mid\mu_i, \sigma_i \overset{\text{i.i.d.}}{\sim} U(\mu_i - \sqrt{3}\sigma_i, \mu_i + \sqrt{3}\sigma_i)$, the $\sigma_i^2$ are sampled independently from $N(u, 1)$ truncated below at 0.01 and $\mu_i\mid\sigma_i^2 \overset{\text{ind.}}{\sim} 0.8\,N(0.25\sigma_i^2, 0.25) + 0.2\,N(\sigma_i^2, 1)$. Plots show $m = 10, 30, 50$ left to right.

For Figure 5, the $Y_{ij}\mid(\mu_i, \sigma_i^2)$ are generated with additional non-normal noise, as $N(\mu_i, \sigma_i^2) + \text{Lap}(0, 1)$, with $\sigma_i^2 \overset{\text{i.i.d.}}{\sim} U(0.1, u)$ and $\mu_i\mid\sigma_i \overset{\text{ind.}}{\sim} 0.8\,N(\sigma_i^2/2, 0.5) + 0.2\,N(2\sigma_i^2, 2)$. Across both these settings,
and for Setting 5 (Figure 5) in particular, the proposed NEST method demonstrates robustness to departures from the Normal model. Proposition 7 in Barp et al. (2019) guarantees that, in general, the influence function of minimum KSD estimators, such as the NEST estimator, is bounded under data corruption; the behavior of the NEST estimator in Settings 4 and 5 is potentially an instance of this robustness property.
In Appendix B we consider additional simulation experiments in which we evaluate the risk performance of these competing estimators with $n$ fixed at 100 (Appendix B.1) and with unequal sample sizes $m_i$ (Appendix B.2).
[Figure 5 here: relative risk versus $u$ for DS, Grp Linear, NPMLE, XKB.M and XKB.SG.]
Fig. 5. Comparison of relative risk for non-normal data. Here $Y_{ij}\mid\mu_i, \sigma_i \overset{\text{i.i.d.}}{\sim} N(\mu_i, \sigma_i^2) + \text{Lap}(0, 1)$, $\sigma_i^2 \overset{\text{i.i.d.}}{\sim} U(0.1, u)$ and $\mu_i\mid\sigma_i^2 \overset{\text{ind.}}{\sim} 0.8\,N(0.5\sigma_i^2, 0.5) + 0.2\,N(2\sigma_i^2, 2)$. Plots show $m = 10, 30, 50$ left to right.
5.2. Compound Estimation of Ratios
In this section we demonstrate the use of the NEST estimation framework for compound estimation of the $n$ ratios $\theta_i = \sqrt{m_i}\,\mu_i/\sigma_i$, which represent a popular financial metric for assessing mutual fund performance (see Section 6.2 for a related real data application involving compound estimation of mutual fund Sharpe ratios). We evaluate the performance of the same six methods with $n$ fixed at 1,000 and $m_i = m$ for $i = 1, \ldots, n$. The data $Y_{ij}$ are generated independently from $N(\mu_i, \sigma_i^2)$, the variances $\sigma_i^2$ are simulated as in Figure 5, and the means are independently drawn from a mixture model with half chance $N(-\sigma_i^2/2, 0.5^2)$ and half chance $N(\sigma_i^2/2, 0.5^2)$. Figure 6 shows the relative risk performance of the competing estimators of $\theta = (\theta_1, \ldots, \theta_n)$. We continue to see that DS is among the better performers and, especially for $m > 10$, it dominates the other methods. As $u$ increases the heterogeneity in the data grows, and the relative risks of the linear shrinkage methods across all $m$ first decrease and then increase. The shift in the behavior of these estimators is related to the observation that, as $u$ increases, the centers of the mixture model that generates the $\mu_i$ are, on average, further apart. This makes estimating the numerator of the ratio easier for all methods up to a point; as heterogeneity increases further, the risks of the methods that use the sample standard deviation in the denominator of $\theta_i$ deteriorate relative to the risk of DS. Moreover, NPMLE exhibits a substantially higher relative risk than DS when $m > 10$, and at $m = 50$ the linear shrinkage methods are on par with NPMLE.
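One reason methods that plug the sample standard deviation into the denominator degrade here is that the naive ratio $\sqrt{m}\,\bar Y/S$ is biased upward: for normal data $\bar Y$ and $S$ are independent, and $E(\sigma/S) = \sqrt{(m-1)/2}\,\Gamma\{(m-2)/2\}/\Gamma\{(m-1)/2\} > 1$ by Jensen's inequality. A small Monte Carlo check with toy values (a single common $(\mu, \sigma)$ across units, which is not the paper's design):

```python
import numpy as np
from math import gamma, sqrt

rng = np.random.default_rng(3)
n, m, mu, sigma = 200_000, 10, 0.5, 1.0   # toy configuration, not the paper's design
theta = sqrt(m) * mu / sigma              # common target ratio

Y = rng.normal(mu, sigma, size=(n, m))
naive = sqrt(m) * Y.mean(axis=1) / Y.std(axis=1, ddof=1)

# For normal data, Ybar and S are independent, so E(naive) = theta * E(sigma/S) with
# E(sigma/S) = sqrt((m-1)/2) * Gamma((m-2)/2) / Gamma((m-1)/2) > 1 by Jensen.
bias_factor = sqrt((m - 1) / 2) * gamma((m - 2) / 2) / gamma((m - 1) / 2)
print(naive.mean() / theta, bias_factor)
```

For $m = 10$ the bias factor is roughly 1.09, so even before any shrinkage of the means, an estimator that also shrinks the variances has room to improve on the plug-in ratio.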
6. REAL DATA ANALYSES
6.1. Baseball Data
We analyze the monthly data on the number of "at bats" and "hits" for all U.S. Major League baseball players from the regular seasons of 2002–2011. In this analysis we focus on both pitchers and non-pitchers, using an approach similar to that of Gu & Koenker (2017a). The data are available from the R package REBayes and have been aggregated into half-seasons to produce an unbalanced panel. It includes observations on 932 players who have at least ten at bats in any half-season and appear in no fewer than five half-seasons (there are a total of 20 half-seasons in which a player can appear).
[Figure 6 here: relative risk versus $u$ for DS, Grp Linear, NPMLE, XKB.M and XKB.SG.]
Fig. 6. Comparison of relative risk for estimating $\theta$. Here $\sigma_i^2 \overset{\text{i.i.d.}}{\sim} U(0.1, u)$ and $\mu_i\mid\sigma_i \overset{\text{ind.}}{\sim} N(\pm 0.5\sigma_i^2, 0.5^2)$. Plots show $m = 10, 30, 50$ left to right.
Following Brown (2008), let the transformed batting average $Y_{ij}$ for player $i\ (= 1, \ldots, n)$ at time $j\ (= 1, \ldots, m_i)$ be
$$Y_{ij} = \arcsin\sqrt{\frac{H_{ij} + 0.25}{N_{ij} + 0.5}},$$
where $H_{ij}$ denotes the number of "hits" and $N_{ij}$ the number of "at bats" at time $j$ for player $i$. We assume that $Y_{ij} \sim N(\mu_i, v_{ij}^2/\tau_i)$, where $\mu_i = \arcsin(\sqrt{p_i})$, $p_i$ being player $i$'s batting success probability, and $v_{ij}^2 = 1/(4N_{ij})$. Here the $1/\tau_i$ are player-specific scale parameters as described in Gu & Koenker (2017a). Under this setup, the sufficient statistics are
$$\hat\mu_i = \Big(\sum_{j=1}^{m_i} 1/v_{ij}^2\Big)^{-1}\sum_{j=1}^{m_i} Y_{ij}/v_{ij}^2 \sim N(\mu_i, \bar v_i^2/\tau_i) \quad\text{with } \bar v_i^2 = \Big(4\sum_{j=1}^{m_i} N_{ij}\Big)^{-1},$$
$$S_i^2 = \frac{1}{m_i - 1}\sum_{j=1}^{m_i} (Y_{ij} - \hat\mu_i)^2/v_{ij}^2 \quad\text{with } (m_i - 1)S_i^2\tau_i \sim \chi^2_{m_i - 1}.$$
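The transform and sufficient statistics above translate directly into code. The helper below is hypothetical (its name and the example hits/at-bats values are not from the paper) and implements only the formulas in this section:

```python
import numpy as np

def batting_stats(H, N):
    """Arcsine-transformed averages and per-player sufficient statistics.

    H, N: hits and at-bats for one player across m_i half-seasons.
    Returns (mu_hat, vbar2, S2) as defined in Section 6.1.
    """
    H, N = np.asarray(H, float), np.asarray(N, float)
    Y = np.arcsin(np.sqrt((H + 0.25) / (N + 0.5)))
    v2 = 1.0 / (4.0 * N)                      # v_ij^2 = 1/(4 N_ij)
    mu_hat = np.sum(Y / v2) / np.sum(1.0 / v2)  # precision-weighted mean
    vbar2 = 1.0 / (4.0 * np.sum(N))           # its sampling-variance scale
    S2 = np.sum((Y - mu_hat) ** 2 / v2) / (len(Y) - 1)
    return mu_hat, vbar2, S2

mu_hat, vbar2, S2 = batting_stats([20, 35, 15], [80, 120, 60])
print(mu_hat, vbar2, S2)
```

Note that weighting by $1/v_{ij}^2 = 4N_{ij}$ means half-seasons with more at bats dominate the player's summary, as intended.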
In this analysis, the goal is to use the 2002–2011 data to predict the batting averages of the players in 2012. Players are divided into three categories: all, non-pitchers, and pitchers. We consider the following seven estimators of $\mu_i$: two nonparametric maximum likelihood based estimators, denoted NPMLE-Indep and NPMLE-Dep, which assume, respectively, independent and dependent priors on $(\mu_i, 1/\tau_i)$; the sufficient statistics $\hat\mu = (\hat\mu_1, \ldots, \hat\mu_n)$ of $\mu$; the grand mean across all players in the 2011 season, $\bar Y_{2011} = n^{-1}\sum_{i=1}^n Y_{i,2011}$; the proposed NEST estimator (DS) and its oracle counterpart (DS orc); and the naive estimator that uses the 2011 batting averages, $Y_{2011} = (Y_{1,2011}, \ldots, Y_{n,2011})$. To assess how well these methods predict the 2012 batting averages $Y_{2012} = (Y_{1,2012}, \ldots, Y_{n,2012})$, we consider three criteria for evaluating any estimate $\delta_i$ of $\mu_i$: total squared error from Brown (2008), defined as
$$\mathrm{TSE}(\delta) = \sum_{i=1}^n \big\{(Y_{i,2012} - \delta_i)^2 - (4N_{i,2012})^{-1}\big\};$$
normalized squared error from Gu & Koenker (2017a), defined as
$$\mathrm{NSE}(\delta) = \sum_{i=1}^n 4N_{i,2012}(Y_{i,2012} - \delta_i)^2;$$
and total squared error on a probability scale from Jiang et al. (2010), defined as
$$\mathrm{TSE}_p(\hat p) = \sum_{i=1}^n \big\{(p_{i,2012} - \hat p_i)^2 - p_{i,2012}(1 - p_{i,2012})(4N_{i,2012})^{-1}\big\},$$
where $\hat p_i = \sin^2(\delta_i)$ and $p_{i,2012} = \sin^2(Y_{i,2012})$.
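The three prediction criteria can be written compactly; the function names below are hypothetical and follow the definitions just given, with small made-up input vectors for illustration.

```python
import numpy as np

def tse(delta, y2012, n2012):
    # Brown (2008): total squared error, debiased by the binomial sampling variance
    return np.sum((y2012 - delta) ** 2 - 1.0 / (4.0 * n2012))

def nse(delta, y2012, n2012):
    # Gu & Koenker (2017a): squared error weighted by the at-bat precision 4*N
    return np.sum(4.0 * n2012 * (y2012 - delta) ** 2)

def tse_p(delta, y2012, n2012):
    # Jiang et al. (2010): error on the probability scale p = sin^2(.)
    p_hat, p = np.sin(delta) ** 2, np.sin(y2012) ** 2
    return np.sum((p - p_hat) ** 2 - p * (1.0 - p) / (4.0 * n2012))

delta = np.array([0.55, 0.60])
y2012 = np.array([0.58, 0.52])
n2012 = np.array([100.0, 150.0])
print(tse(delta, y2012, n2012), nse(delta, y2012, n2012), tse_p(delta, y2012, n2012))
```

The subtraction terms in TSE and TSE$_p$ remove the irreducible sampling noise of the 2012 observations, so these criteria can legitimately take small or even negative values for good predictions.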
Table 1. Performance of the competing estimators relative to the performance of the naive estimator $Y_{2011}$. Here $\mathrm{R\text{-}TSE}(\delta) = \mathrm{TSE}(\delta)/\mathrm{TSE}(Y_{2011})$, with similar definitions for R-NSE and R-TSE$_p$. The smallest two relative errors are bolded

                                      NPMLE-Indep  NPMLE-Dep  $\hat\mu$  $\bar Y_{2011}$  DS     DS orc
All
  n for estimation: 932    R-TSE      0.463        0.469      0.352      1.958            0.351  0.348
  n for prediction: 370    R-NSE      0.668        0.674      0.676      1.495            0.672  0.671
                           R-TSE$_p$  0.582        0.591      0.501      1.783            0.499  0.497
Nonpitchers
  n for estimation: 792    R-TSE      0.535        0.546      0.503      0.551            0.478  0.477
  n for prediction: 325    R-NSE      0.656        0.661      0.679      0.973            0.663  0.663
                           R-TSE$_p$  0.651        0.663      0.619      0.682            0.593  0.593
Pitchers
  n for estimation: 140    R-TSE      0.659        0.655      0.629      0.804            0.602  0.601
  n for prediction: 45     R-NSE      0.659        0.659      0.649      0.769            0.598  0.596
                           R-TSE$_p$  0.124        0.117      0.133      0.334            0.076  0.074
In Table 1 we report the performance of the competing estimators relative to the performance of the naive estimator $Y_{2011}$, where $\mathrm{R\text{-}TSE}(\delta) = \mathrm{TSE}(\delta)/\mathrm{TSE}(Y_{2011})$, with similar definitions for R-NSE and R-TSE$_p$. Thus, a smaller value of R-TSE, R-NSE or R-TSE$_p$ indicates relatively better prediction error. For at least two of the three performance metrics, DS exhibits the best relative risk in every player category. It is interesting to note that the sufficient statistics $\hat\mu$ are quite competitive in this example, while NPMLE with independent priors dominates the version with dependent priors for "All" and "Nonpitchers". The compound estimation problem for "Pitchers" is an example of a setting where $n$ is relatively small, and there DS demonstrates better risk performance than NPMLE on all three metrics. We report similar performance of DS in such small-$n$ settings in the numerical experiments of Appendix B.1.
6.2. Mutual Fund Sharpe Ratios
In this section we analyze a dataset of $n_1 = 5{,}000$ monthly mutual fund returns spanning the 12 months from January 2014 to December 2014, sourced from Wharton Research Data Services (Wharton School, 1993). The goal of this analysis is to use Sharpe ratios constructed from the data on the first 6 months, January 2014 – June 2014, to predict the corresponding Sharpe ratios for the next 6 months. Formally, let $Y_{ij}$ denote the excess return of fund $i\ (= 1, \ldots, n_1)$ in month $j\ (= 1, \ldots, m_{i,1})$ over the return on the 3-month treasury yield. Denote by
$$\bar Y_i = m_{i,1}^{-1}\sum_{j=1}^{m_{i,1}} Y_{ij}, \qquad S_i^2 = (m_{i,1} - 1)^{-1}\sum_{j=1}^{m_{i,1}} (Y_{ij} - \bar Y_i)^2, \qquad \delta_i^{naive} = \bar Y_i/\sqrt{S_i^2},$$
respectively, the sample mean, the sample variance and the observed Sharpe ratio of the monthly excess returns. In our analysis we include funds that have at least 4 months' worth of returns available during January 2014 – June 2014, so that $m_{i,1} \in [4, 6]$. Of the 5,000 funds available during these first 6 months, $n_2 = 4{,}927$ appear in the next 6 months, July 2014 – December 2014, and have at least 3 months of returns available during this period. For our prediction exercise, we use these $n_2$ funds to assess the performance of various estimators for predicting $\theta_i = \mu_i/\sigma_i$, where $\mu_i$ and $\sigma_i$ are the sample mean and sample standard deviation of the excess returns of the $n_2$ funds during the next 6 months.
We consider the following estimators of $\theta = (\theta_1, \ldots, \theta_n)$: DS, DS orc, Grp Linear and XKB.SG from Section 5.1, as well as NPMLE-Indep and NPMLE-Dep from Section 6.1. Additionally, we consider the SURE estimator XKB.G from Xie et al. (2012) which, unlike XKB.SG, imposes a Normal prior on $\mu$ and shrinks towards the grand mean. Note that for predicting $\theta_i$, Grp Linear, XKB.SG and XKB.G rely on the sample variances $S_i^2$.

Table 2. Performance of the competing estimators relative to the performance of the naive estimator $\delta^{naive}$. Here $\mathrm{R\text{-}TSE}(\delta) = \mathrm{TSE}(\delta)/\mathrm{TSE}(\delta^{naive})$, with similar definitions for R-WSE and R-WAE. The smallest two relative errors are bolded

         n1    n2    Group Linear  XKB.G  XKB.SG  DS     DS orc
R-TSE    5000  4927  0.900         0.928  0.902   0.677  0.677
R-WSE    5000  4927  0.899         0.928  0.902   0.676  0.676
R-WAE    5000  4927  1.028         0.972  0.966   0.779  0.778

To evaluate the performance of these estimators for predicting $\theta$, we consider three criteria: total squared error, $\mathrm{TSE}(\delta) = \sum_{i=1}^{n_2}(\theta_i - \delta_i)^2$; weighted squared error, $\mathrm{WSE}(\delta) = \sum_{i=1}^{n_2} m_{i,2}(\theta_i - \delta_i)^2$; and weighted absolute error, $\mathrm{WAE}(\delta) = \sum_{i=1}^{n_2} m_{i,2}\,|1 - \delta_i/\theta_i|$. In Table 2 we present the performance of the competing estimators relative to the performance of the naive estimator $\delta^{naive} = (\delta_i^{naive} : 1 \leq i \leq n)$, so that a smaller value of R-TSE, R-WSE or R-WAE indicates relatively better prediction error. NPMLE-Indep and NPMLE-Dep revealed convergence issues on this data, and so we do not include them in Table 2. Across all performance measures, DS has the smallest relative risk among the competing estimators considered in this example. With respect to the weighted absolute error, Grp Linear appears to do relatively worse in predicting the small Sharpe ratios and exhibits an R-WAE bigger than 1. When compared against the three linear shrinkage methods considered here, DS demonstrates the overall value of joint shrinkage estimation of the means $\mu_i$ and the variances $\sigma_i^2$ for predicting $\theta$.
ACKNOWLEDGEMENTS
We thank the editor and three anonymous referees for taking the time to review this pa-
per. Their suggestions have led to significant improvements in the quality and content of this
manuscript.
SUPPLEMENTARY MATERIAL
Supplementary material available at Biometrika online includes proofs of main theorems and
additional numerical experiments.
REFERENCES
ABRAMOVICH, F., BENJAMINI, Y., DONOHO, D. L. & JOHNSTONE, I. M. (2006). Adapting to unknown sparsity by controlling the false discovery rate. Ann. Statist. 34, 584–653.
BANERJEE, T., LIU, Q., MUKHERJEE, G. & SUN, W. (2020). A general framework for empirical Bayes estimation in discrete linear exponential family. arXiv preprint arXiv:1910.08997 (accepted in Journal of Machine Learning Research).
BARP, A., BRIOL, F.-X., DUNCAN, A., GIROLAMI, M. & MACKEY, L. (2019). Minimum Stein discrepancy estimators. In Advances in Neural Information Processing Systems.
BASU, P., CAI, T. T., DAS, K. & SUN, W. (2017). Weighted false discovery rate control in large-scale multiple testing. Journal of the American Statistical Association 0, 0–0.
BENJAMINI, Y. & HOCHBERG, Y. (1997). Multiple hypotheses testing with weights. Scandinavian Journal of Statistics 24, 407–418.
BENJAMINI, Y. & YEKUTIELI, D. (2011). False discovery rate–adjusted multiple confidence intervals for selected parameters. Journal of the American Statistical Association 100, 71–81.
BERGER, J. O. (1976). Admissible minimax estimation of a multivariate normal mean with arbitrary quadratic loss. Ann. Statist. 4, 223–226.
BERK, R., BROWN, L., BUJA, A., ZHANG, K. & ZHAO, L. (2013). Valid post-selection inference. Ann. Statist. 41, 802–837.
BROWN, L. D. (2008). In-season prediction of batting averages: A field test of empirical Bayes and Bayes methodologies. The Annals of Applied Statistics, 113–152.
BROWN, L. D. & GREENSHTEIN, E. (2009). Nonparametric empirical Bayes and compound decision approaches to estimation of a high-dimensional vector of normal means. The Annals of Statistics 37, 1685–1704.
BROWN, L. D., GREENSHTEIN, E. & RITOV, Y. (2013). The Poisson compound decision problem revisited. Journal of the American Statistical Association 108, 741–749.
BROWN, S. J., GOETZMANN, W., IBBOTSON, R. G. & ROSS, S. A. (1992). Survivorship bias in performance studies. The Review of Financial Studies 5, 553–580.
CAI, T. T. & SUN, W. (2009). Simultaneous testing of grouped hypotheses: Finding needles in multiple haystacks. J. Amer. Statist. Assoc. 104, 1467–1481.
CASTILLO, I. & VAN DER VAART, A. (2012). Needles and straw in a haystack: Posterior concentration for possibly sparse sequences. Ann. Statist. 40, 2069–2101.
CHIARETTI, S., LI, X., GENTLEMAN, R., VITALE, A., VIGNETTI, M., MANDELLI, F., RITZ, J. & FOA, R. (2004). Gene expression profile of adult T-cell acute lymphocytic leukemia identifies distinct subsets of patients with different response to therapy and survival. Blood 103, 2771–2778.
CHWIALKOWSKI, K., STRATHMANN, H. & GRETTON, A. (2016). A kernel test of goodness of fit. JMLR: Workshop and Conference Proceedings.
DIACONIS, P. & YLVISAKER, D. (1979). Conjugate priors for exponential families. The Annals of Statistics, 269–281.
DONOHO, D. L. & JOHNSTONE, I. M. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika 81, 425.
DYSON, F. (1926). A method for correcting series of parallax observations. Monthly Notices of the Royal Astronomical Society 86, 686–706.
EDDINGTON, A. S. (1940). The correction of statistics for accidental error. Monthly Notices of the Royal Astronomical Society 100, 354–361.
EFRON, B. (2008). Microarrays, empirical Bayes and the two-groups model. Statist. Sci. 23, 1–22.
EFRON, B. (2011). Tweedie's formula and selection bias. Journal of the American Statistical Association 106, 1602–1614.
EFRON, B. & MORRIS, C. N. (1975). Data analysis using Stein's estimator and its generalizations. Journal of the American Statistical Association 70, 311–319.
ERICKSON, S. & SABATTI, C. (2005). Empirical Bayes estimation of a sparse vector of gene expression change. Statistical Applications in Genetics and Molecular Biology 4, 1132.
GU, J. & KOENKER, R. (2017a). Empirical Bayesball remixed: Empirical Bayes methods for longitudinal data. Journal of Applied Econometrics 32, 575–599.
GU, J. & KOENKER, R. (2017b). Unobserved heterogeneity in income dynamics: An empirical Bayes perspective. Journal of Business & Economic Statistics 35, 1–16.
HE, L., SARKAR, S. K. & ZHAO, Z. (2015). Capturing the severity of type II errors in high-dimensional multiple testing. Journal of Multivariate Analysis 142, 106–116.
HENDERSON, N. C. & NEWTON, M. A. (2016). Making the cut: improved ranking and selection for large-scale inference. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 78, 1467–9868.
IGNATIADIS, N. & WAGER, S. (2019). Covariate-powered empirical Bayes estimation. In Advances in Neural Information Processing Systems.
JAMES, G. M., RADCHENKO, P. & RAVA, B. (2020). Irrational exuberance: Correcting bias in probability estimates. Journal of the American Statistical Association, 1–14.
JAMES, W. & STEIN, C. (1961). Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics. Berkeley, Calif.: University of California Press.
JIANG, W. & ZHANG, C.-H. (2009). General maximum likelihood empirical Bayes estimation of normal means. Ann. Statist. 37, 1647–1684.
JIANG, W., ZHANG, C.-H. et al. (2010). Empirical Bayes in-season prediction of baseball batting averages. In Borrowing Strength: Theory Powering Applications – A Festschrift for Lawrence D. Brown. Institute of Mathematical Statistics, pp. 263–273.
JING, B.-Y., LI, Z., PAN, G. & ZHOU, W. (2016). On SURE-type double shrinkage estimation. Journal of the American Statistical Association 111, 1696–1704.
JOHNSTONE, I. M. & SILVERMAN, B. W. (2004). Needles and straw in haystacks: empirical Bayes estimates to possibly sparse sequences. Annals of Statistics 32, 1594–1649.
KIEFER, J. & WOLFOWITZ, J. (1956). Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. The Annals of Mathematical Statistics, 887–906.
KOENKER, R. & GU, J. (2017). REBayes: Empirical Bayes mixture methods in R. Journal of Statistical Software 82, 1–26.
KOENKER, R. & MIZERA, I. (2014). Convex optimization, shape constraints, compound decisions, and empirical Bayes rules. Journal of the American Statistical Association 109, 674–685.
KOU, S. C. & YANG, J. J. (2017). Optimal Shrinkage Estimation in Heteroscedastic Hierarchical Linear Models, chap. 25. Cham: Springer International Publishing, pp. 249–284.
LAIRD, N. (1978). Nonparametric maximum likelihood estimation of a mixing distribution. Journal of the American Statistical Association 73, 805–811.
LAURENT, B. & MASSART, P. (2000). Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, 1302–1338.
LEE, J. D., SUN, D. L., SUN, Y. & TAYLOR, J. E. (2016). Exact post-selection inference, with application to the lasso. Ann. Statist. 44, 907–927.
LIU, Q., LEE, J. D. & JORDAN, M. I. (2016). A kernelized Stein discrepancy for goodness-of-fit tests. In Proceedings of the International Conference on Machine Learning (ICML).
LIU, Q. & WANG, D. (2016). Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems.
OATES, C. J., GIROLAMI, M. & CHOPIN, N. (2017). Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79, 695–718.
ROBBINS, H. (1956). An empirical Bayes approach to statistics. Proc. Third Berkeley Symp. on Math. Statist. and Prob. 1, 157–163.
SERFLING, R. J. (2009). Approximation Theorems of Mathematical Statistics, vol. 162. John Wiley & Sons.
SUN, W. & MCLAIN, A. C. (2012). Multiple testing of composite null hypotheses in heteroscedastic models. Journal of the American Statistical Association 107, 673–687.
TAN, Z. (2015). Improved minimax estimation of a multivariate normal mean under heteroscedasticity. Bernoulli 21, 574–603.
TUSHER, V. G., TIBSHIRANI, R. & CHU, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences 98, 5116–5121.
WEINSTEIN, A., FITHIAN, W. & BENJAMINI, Y. (2013). Selection adjusted confidence intervals with more power to determine the sign. Journal of the American Statistical Association 108, 165–176.
WEINSTEIN, A., MA, Z., BROWN, L. D. & ZHANG, C.-H. (2018). Group-linear empirical Bayes estimates for a heteroscedastic normal mean. Journal of the American Statistical Association 0, 1–13.
WHARTON SCHOOL (1993). Wharton research data services. https://wrds-web.wharton.upenn.edu/wrds/.
XIE, X., KOU, S. & BROWN, L. D. (2012). SURE estimates for a heteroscedastic hierarchical model. Journal of the American Statistical Association 107, 1465–1479.
YANG, J., LIU, Q., RAO, V. & NEVILLE, J. (2018). Goodness-of-fit testing for discrete distributions via Stein discrepancy. In International Conference on Machine Learning.
ZHANG, X. & BHATTACHARYA, A. (2017). Empirical Bayes, SURE, and sparse normal mean models. Preprint.
Supplementary Material for “Nonparametric Empirical
Bayes Estimation On Heterogeneous Data”
A. PROOFS
A.1. Proof of Proposition 1
Proposition 1 follows by recalling that, under the hierarchical model of equation (1),
$$f_{m_i}(y_i, s_i^2 \mid \mu_i, \tau_i) \propto \exp\Big\{-\frac{\tau_i}{2}\big[m_i y_i^2 + (m_i - 1)s_i^2\big] + m_i\tau_i\mu_i y_i - \frac{m_i}{2}\tau_i\mu_i^2 + \frac{m_i - 3}{2}\log s_i^2\Big\}.$$
Therefore, from Bayes' theorem,
$$f_{m_i}(\mu_i, \tau_i \mid y_i, s_i^2) = \frac{f_{m_i}(y_i, s_i^2\mid\mu_i, \tau_i)}{f_{m_i}(y_i, s_i^2)}\,g(\mu_i\mid\tau_i)h(\tau_i) \propto \exp\big\{\eta_i^T T(\mu_i, \tau_i) - A(\eta_i)\big\}\,g(\mu_i\mid\tau_i)h(\tau_i).$$
Here $\eta_i = \big(m_i y_i,\; -m_i y_i^2 - (m_i - 1)s_i^2\big) := (\eta_{1i}, \eta_{2i})$, $T(\mu_i, \tau_i) = (\tau_i\mu_i, \tau_i/2)$ and
$$A(\eta_i) = -0.5(m_i - 3)\log\gamma(\eta_{1i}, \eta_{2i}) + \log f_{m_i}\big\{m_i^{-1}\eta_{1i},\, \gamma(\eta_{1i}, \eta_{2i})\big\}, \qquad \gamma(\eta_{1i}, \eta_{2i}) = \frac{-\eta_{2i} - m_i^{-1}\eta_{1i}^2}{m_i - 1},$$
with $f_m(y, s^2) = \int\!\!\int f_m(y, s^2\mid\mu, \tau)\,g_\mu(\mu\mid\tau)h_\tau(\tau)\,d\mu\,d\tau$ the marginal density function of $(Y, S^2)$.
Corollary 1 is a consequence of the properties of the exponential family of distributions and follows from Proposition 1 under squared error loss. From Proposition 1, dropping the subscript $i$, we have
\[
\tau^\pi := \tau^\pi(y, s^2, m) = E(\tau \mid y, s^2, m) = 2\,\frac{\partial A(\eta)}{\partial \eta_2} = \frac{m-3}{(m-1)s^2} - \frac{2}{m-1}\, w_2(y, s^2; m).
\]
Furthermore, with $\zeta = \tau\mu$,
\[
\zeta^\pi := \zeta^\pi(y, s^2, m) = E(\zeta \mid y, s^2, m) = \frac{\partial A(\eta)}{\partial \eta_1} = \frac{(m-3)y}{(m-1)s^2} + m^{-1} w_1(y, s^2; m) - \frac{2y}{m-1}\, w_2(y, s^2; m) = y\, E(\tau \mid y, s^2, m) + m^{-1} w_1(y, s^2; m).
\]
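The two formulas above admit a direct numerical sanity check: approximate the marginal $f_m(y, s^2)$ by Monte Carlo under a concrete prior pair, recover $w_1$ and $w_2$ by numerically differentiating its logarithm, and compare the resulting $\tau^\pi$ and $\zeta^\pi$ with posterior expectations computed directly. The normal and gamma priors below are illustrative choices of ours (the paper leaves $g$ and $h$ unspecified), as are all function names.

```python
# Monte Carlo check of Corollary 1: tau^pi and zeta^pi recovered from the
# log-derivatives w1, w2 of the marginal density f_m(y, s^2) of (Y, S^2).
# The normal/gamma priors are illustrative only; the paper leaves g, h free.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
m, K = 10, 200_000
tau = rng.gamma(shape=5.0, scale=0.2, size=K)   # draws from h: E(tau) = 1
mu = rng.normal(0.0, 1.0, size=K)               # draws from g(.|tau)

def joint(y, s2):
    """f1(y | mu, tau) * f2(s2 | tau) evaluated at every prior draw."""
    f1 = stats.norm.pdf(y, loc=mu, scale=1.0 / np.sqrt(m * tau))
    # (m - 1) S^2 tau ~ chi^2_{m-1}, so transform the chi-square density
    f2 = stats.chi2.pdf((m - 1) * s2 * tau, df=m - 1) * (m - 1) * tau
    return f1 * f2

def log_marginal(y, s2):
    return np.log(np.mean(joint(y, s2)))

y0, s20, h = 0.5, 1.0, 1e-4                     # central finite differences
w1 = (log_marginal(y0 + h, s20) - log_marginal(y0 - h, s20)) / (2 * h)
w2 = (log_marginal(y0, s20 + h) - log_marginal(y0, s20 - h)) / (2 * h)

# Corollary 1 formulas
tau_pi = (m - 3) / ((m - 1) * s20) - 2.0 * w2 / (m - 1)
zeta_pi = y0 * tau_pi + w1 / m

# direct posterior expectations via self-normalized importance weights
wts = joint(y0, s20)
tau_direct = np.sum(tau * wts) / np.sum(wts)
zeta_direct = np.sum(tau * mu * wts) / np.sum(wts)
print(tau_pi - tau_direct, zeta_pi - zeta_direct)   # both close to zero
```

Because the same prior draws are reused for every density evaluation, the approximated marginal is a smooth function of $(y, s^2)$, so the finite differences are stable and the two pairs of estimates agree up to discretization error.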
A.2. Proof of Theorem 1
Consider the sample criterion $M_{\lambda,n}(\mathbf W)$ and the population criterion $M(\mathbf W)$. We have
\[
\big| M_{\lambda,n}(\mathbf W) - M(\mathbf W) \big| \le \Big| \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n \kappa[w(X_i), w(X_j)](X_i, X_j)\, I(i \neq j) - M(\mathbf W) \Big| + \Big| \frac{1}{n^2} \sum_{i=1}^n \kappa[w(X_i), w(X_i)](X_i, X_i) \Big| + \frac{\lambda(n)}{n^2} \|\mathbf W\|_F^2 := I_1 + I_2 + I_3. \tag{1}
\]
Define $M_n(\mathbf W) = \{n(n-1)\}^{-1} \sum_{i=1}^n \sum_{j=1}^n \kappa[w(X_i), w(X_j)](X_i, X_j)\, I(i \neq j)$. Then
\[
I_1 \le |M_n(\mathbf W) - M(\mathbf W)| + n^{-1} |M_n(\mathbf W)|.
\]
By Assumption 1, $n^{-1}|M_n(\mathbf W)|$ is $O_p(n^{-1})$. Now, note that $M_n(\mathbf W)$ is an unbiased estimator of $M(\mathbf W)$ and is a U-statistic with a symmetric kernel function $\kappa[w(X_i), w(X_j)](X_i, X_j)$. From Assumption 1, $\kappa[w(X_i), w(X_j)](X_i, X_j)$ has finite second moments. Moreover, from Theorem 4.1 of Liu et al. (2016),
2 BANERJEE, FU, JAMES AND SUN
$M_n(\mathbf W)$ is a non-degenerate U-statistic whenever $f_{\mathbf W} \neq f$. Thus, from the CLT for U-statistics (Serfling, 2009, Section 5.5), $|M_n(\mathbf W) - M(\mathbf W)|$ is $O_p(n^{-1/2})$.
For the terms $I_2$ and $I_3$ in equation (1), we have
\[
I_2 + I_3 \le \{1 + \lambda(n)\} \Big| \frac{1}{n^2} \sum_{i=1}^n \kappa[w(X_i), w(X_i)](X_i, X_i) \Big|.
\]
From Assumption 1, $I_2 + I_3$ is $O_p[\{1 + \lambda(n)\}/n]$. Theorem 1 is proved by combining these results and using the condition $\lambda(n) n^{-1/2} \to 0$.
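The decomposition into $I_1$, $I_2$ and $I_3$ uses the elementary fact that the $n^{-2}$ double sum differs from the U-statistic $M_n(\mathbf W)$ only through the diagonal terms and an $O(n^{-1})$ normalization factor. A toy illustration of that gap, with a simple fixed Gaussian kernel of our choosing rather than the paper's kernelized Stein discrepancy:

```python
# The n^{-2} double sum (diagonal included) and the U-statistic (diagonal
# excluded) differ by an O(1/n) term, as in the bounds on I_1 and I_2.
# The Gaussian kernel is a toy stand-in for the paper's KSD kernel.
import numpy as np

rng = np.random.default_rng(0)

def kappa(x, y):
    return np.exp(-0.5 * (x - y) ** 2)   # symmetric toy kernel

def v_and_u(x):
    n = len(x)
    K = kappa(x[:, None], x[None, :])
    v = K.sum() / n**2                            # includes diagonal
    u = (K.sum() - np.trace(K)) / (n * (n - 1))   # excludes diagonal
    return v, u

gaps = []
for n in (100, 1000):
    v, u = v_and_u(rng.normal(size=n))
    gaps.append(abs(v - u))
print(gaps)   # the gap shrinks roughly like 1/n
```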
A.3. Proof of Theorem 2
Denote $\lambda(n)$ by $\lambda$ and keep its dependence on $n$ implicit for notational ease. According to Proposition 3.3 of Liu et al. (2016), Assumption 2 implies that $M(\mathbf W)$ is a valid discrepancy measure in the sense that $M(\mathbf W) > 0$ if and only if $f_{\mathbf W} \neq f$. It follows that given a fixed $\epsilon > 0$, there exists a $\delta > 0$ such that
\[
P\Big[ \frac{c_n}{n} \big\| \mathbf W_n(\lambda) - \mathbf W_0 \big\|_F^2 \ge \epsilon \Big] \le P\big[ c_n \big\{ M(\mathbf W_n(\lambda)) - M(\mathbf W_0) \big\} \ge \delta \big].
\]
But the right-hand side in the display above is upper bounded by the sum of three terms: $P[c_n\{M(\mathbf W_n(\lambda)) - M_{\lambda,n}(\mathbf W_n(\lambda))\} \ge \delta/3]$, $P[c_n\{M_{\lambda,n}(\mathbf W_n(\lambda)) - M_{\lambda,n}(\mathbf W_0)\} \ge \delta/3]$ and $P[c_n\{M_{\lambda,n}(\mathbf W_0) - M(\mathbf W_0)\} \ge \delta/3]$. From Theorem 1, the first term goes to zero as $n \to \infty$, while the second term is zero since $M_{\lambda,n}(\mathbf W_n(\lambda)) \le M_{\lambda,n}(\mathbf W_0)$.
Consider the third term $P[c_n\{M_{\lambda,n}(\mathbf W_0) - M(\mathbf W_0)\} \ge \delta/3]$. Here $M(\mathbf W_0) = 0$ and we can write
\[
\big| M_{\lambda,n}(\mathbf W_0) \big| \le \big| M_{\lambda,n}(\mathbf W_0) - \bar M_{\lambda,n}(\mathbf W_0) \big| + \big| \bar M_{\lambda,n}(\mathbf W_0) \big| := T_1 + T_2,
\]
where $\bar M_{\lambda,n}(\mathbf W_0) = M_n(\mathbf W_0) + \lambda n^{-2} \|\mathbf W_0\|_F^2$ and
\[
M_n(\mathbf W_0) = \{n(n-1)\}^{-1} \sum_{i=1}^n \sum_{j=1}^n \kappa[w(X_i), w(X_j)](X_i, X_j)\, I(i \neq j).
\]
So we have
\[
T_1 \le \frac{1}{n} \big| M_n(\mathbf W_0) \big| + \frac{1}{n^2} \Big| \sum_{i=1}^n \kappa[w(X_i), w(X_i)](X_i, X_i) \Big|, \qquad T_2 \le \big| M_n(\mathbf W_0) \big| + \lambda n^{-2} \|\mathbf W_0\|_F^2.
\]
Therefore,
\[
\big| M_{\lambda,n}(\mathbf W_0) \big| \le T_1 + T_2 \le (1 + n^{-1}) \big| M_n(\mathbf W_0) \big| + \frac{1 + \lambda}{n^2} \Big| \sum_{i=1}^n \kappa[w(X_i), w(X_i)](X_i, X_i) \Big|. \tag{2}
\]
From Theorem 4.1 of Liu et al. (2016), $M_n(\mathbf W)$ is a degenerate U-statistic when $\mathbf W = \mathbf W_0$. From Assumption 1 and the CLT for U-statistics (cf. Section 5.5.2 of Serfling, 2009), $M_n(\mathbf W_0)$ is $O_p(n^{-1})$. Also, from Assumption 1, the second term in equation (2) above is $O_p\{(1 + \lambda)/n\}$. Finally, using these results in equation (2) along with the condition that $\lambda n^{-1/2} \to 0$ as $n \to \infty$, we conclude that the third term
\[
P\big[ c_n \big\{ M_{\lambda,n}(\mathbf W_0) - M(\mathbf W_0) \big\} \ge \delta/3 \big] \to 0
\]
as $n \to \infty$. The desired result thus follows.
A.4. Proof of Theorem 3 and Corollary 2
We first state two lemmata that are needed for proving Theorem 3. Throughout, $c_0, c_1, \ldots$ denote generic positive constants which may vary in different statements.

LEMMA 1. If Assumption 3 holds, then with probability tending to 1, $|\mu| \le C_1 \log n$ and $C_2/\log n \le \tau \le C_3 \log n$ for some positive constants $C_1$, $C_2$ and $C_3$.
NEST Estimation On Heterogeneous Data 3
LEMMA 2. Consider Model (1). Suppose Assumption 3 holds. Then with probability tending to 1, $|w_{1,i}| \le c_0 (\log n)^2$ and $E(\tau_i \mid y_i, s_i^2) \ge c_1/\log n$.
Lemmata 1 and 2 are proved in Appendices A.5 and A.6, respectively. We now prove Theorem 3.
To establish the first part of Theorem 3, note that
\[
\frac{1}{n} \|\boldsymbol\delta^{ds}(\lambda) - \boldsymbol\delta^*\|_2^2 = \frac{1}{n m^2} \sum_{i=1}^n \bigg| \frac{w_{1,i}}{\tau_i^\pi} - \frac{w^i_{1,\lambda}}{\tau_i^{ds}(\lambda)} \bigg|^2
\le \frac{2}{n m^2} \sum_{i=1}^n \frac{1}{\{\tau_i^{ds}(\lambda)\}^2} \big| w_{1,i} - w^i_{1,\lambda} \big|^2 + \frac{2}{n m^2} \sum_{i=1}^n |w_{1,i}|^2 \bigg| \frac{1}{\tau_i^{ds}(\lambda)} - \frac{1}{\tau_i^\pi} \bigg|^2 := T_1 + T_2.
\]
Consider the first term $T_1$. From the discussion in Section 2.3, there is a positive constant $c_0$ such that $\tau_i^{ds}(\lambda) > c_0 > 0$ for all $i = 1, \ldots, n$. It follows that for some constant $c_1 > 0$ depending on the fixed $m$,
\[
T_1 \le \frac{c_1}{n} \big\| \mathbf w_1 - \mathbf w_{1,\lambda} \big\|_2^2 \le \frac{c_1}{n} \big\| \mathbf W_0 - \mathbf W_n(\lambda) \big\|_F^2, \tag{3}
\]
where $\mathbf w_{1,\lambda} = (w^1_{1,\lambda}, \ldots, w^n_{1,\lambda})$ and $\mathbf w_1 = (w_{1,1}, \ldots, w_{1,n})$. From Theorem 2, the last term on the right-hand side of the inequality in equation (3) is $O_p(n^{-1/2})$.
Next consider the second term $T_2$. We have
\[
T_2 \le \frac{c_2}{n} \sum_{i=1}^n \bigg| \frac{w_{1,i}}{\tau_i^\pi} \bigg|^2 \big| w_{2,i} - w^i_{2,\lambda} \big|^2. \tag{4}
\]
We will use Lemma 2 to bound the terms $|w_{1,i}/\tau_i^\pi|$ in equation (4). First note that Model (1), Assumption 3 and Lemma 1 imply that with probability tending to 1, $|Y_i| \le c_0 \log n$ and
\[
(m-1) n^{-1} \le (m-1) S_i^2 \tau_i \le (m-1) + 2\{(m-1)\log n\}^{1/2} + 2\log n
\]
(cf. Lemma 1 of Laurent & Massart, 2000). So, conditional on these events, we have, from Lemma 2, $|w_{1,i}/\tau_i^\pi| \le c_3 \log^3 n$. Thus,
\[
T_2 \le \frac{c_4 \log^6 n}{n} \sum_{i=1}^n \big| w_{2,i} - w^i_{2,\lambda} \big|^2,
\]
which is $O_p(\log^6 n / \sqrt n)$ from Theorem 2. Thus $n^{-1} \|\boldsymbol\delta^{ds}(\lambda) - \boldsymbol\delta^*\|_2^2$ is $O_p(\log^6 n / \sqrt n)$.
Now we prove the second part of Theorem 3. Observe that $|l_n(\boldsymbol\mu, \boldsymbol\delta^*) - l_n(\boldsymbol\mu, \boldsymbol\delta^{ds}(\lambda))|$ equals
\[
\frac{1}{n} \Big| \|\boldsymbol\mu - \boldsymbol\delta^*\|_2 - \|\boldsymbol\mu - \boldsymbol\delta^{ds}(\lambda)\|_2 \Big|\; \Big| \|\boldsymbol\mu - \boldsymbol\delta^*\|_2 + \|\boldsymbol\mu - \boldsymbol\delta^{ds}(\lambda)\|_2 \Big|,
\]
and the triangle inequality implies
\[
\frac{1}{\sqrt n} \Big| \|\boldsymbol\mu - \boldsymbol\delta^*\|_2 - \|\boldsymbol\mu - \boldsymbol\delta^{ds}(\lambda)\|_2 \Big| \le \frac{1}{\sqrt n} \big\| \boldsymbol\delta^{ds}(\lambda) - \boldsymbol\delta^* \big\|_2. \tag{5}
\]
The quantity on the right-hand side of the inequality in equation (5) is $O_p(\log^3 n / n^{1/4})$ from the first part of Theorem 3. Thus, it follows from equation (5) that $\{l_n(\boldsymbol\mu, \boldsymbol\delta^{ds}(\lambda))\}^{1/2} \le \{l_n(\boldsymbol\mu, \boldsymbol\delta^*)\}^{1/2} + O_p(\log^3 n / n^{1/4})$, and
\[
\big| l_n(\boldsymbol\mu, \boldsymbol\delta^*) - l_n(\boldsymbol\mu, \boldsymbol\delta^{ds}(\lambda)) \big| \le 4 \{l_n(\boldsymbol\mu, \boldsymbol\delta^*)\}^{1/2}\, \Big| \{l_n(\boldsymbol\mu, \boldsymbol\delta^*)\}^{1/2} - \{l_n(\boldsymbol\mu, \boldsymbol\delta^{ds}(\lambda))\}^{1/2} \Big|\, \{1 + o_p(1)\}. \tag{6}
\]
Now Assumption 3, together with Lemmata 1 and 2, implies that $l_n(\boldsymbol\mu, \boldsymbol\delta^*)$ is $O_p(\log^6 n)$. Thus, from equations (5) and (6) and the first part of Theorem 3, we have the desired result.
Corollary 2 is a consequence of Theorem 2 of Diaconis & Ylvisaker (1979). When applied to the hierarchical model of equation (1) with conjugate priors, it establishes that the posterior expectation of $\mu_i$ is a linear combination of the prior mean and $y_i$, with weights proportional to the prior sample size $m_0$ and $m$, respectively. For instance, consider the normal conjugate model
\[
Y_{ij} \mid \mu_i, \tau_i \overset{\text{i.i.d.}}{\sim} N(\mu_i, 1/\tau_i), \qquad \mu_i \mid \tau_i \overset{\text{ind.}}{\sim} N(\mu_0, 1/\tau_i), \qquad \tau_i \overset{\text{i.i.d.}}{\sim} \Gamma(\alpha, \beta). \tag{7}
\]
Under model (7), standard calculations give $\delta_i^\pi = (\mu_0 + m y_i)/(m + 1)$, where $\mu_0$ is the prior mean and $m_0 = 1$ is the prior sample size. Moreover, under model (7) we also have
\[
\log f(y_i, s_i^2) = c_0 + \frac{m-3}{2} \log s_i^2 - (\alpha + m/2) \log\Big\{ \beta^{-1} + 0.5(m-1)s_i^2 + \frac{0.5\, m}{m+1}(y_i - \mu_0)^2 \Big\},
\]
where $c_0$ is a constant independent of $(y_i, s_i^2)$. Differentiating the above display gives
\[
w_1(y_i, s_i^2) = -\frac{(\alpha + m/2)\, m (y_i - \mu_0)/(m+1)}{\beta^{-1} + 0.5(m-1)s_i^2 + 0.5\, m (y_i - \mu_0)^2/(m+1)}, \qquad
w_2(y_i, s_i^2) = \frac{m-3}{2 s_i^2} - \frac{0.5\,(\alpha + m/2)(m-1)}{\beta^{-1} + 0.5(m-1)s_i^2 + 0.5\, m (y_i - \mu_0)^2/(m+1)}.
\]
Substituting these expressions for $w_1(y_i, s_i^2)$ and $w_2(y_i, s_i^2)$ into equation (14) gives $\delta_i^* = \delta_i^\pi$. This suffices to prove the statement of Corollary 2.
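The substitution can also be checked numerically. The sketch below implements the displayed $w_1$ and $w_2$ together with the rule $\delta^* = y + w_1/(m\,\tau^\pi)$, which is our reading of equation (14) of the main text, and confirms that it reproduces the conjugate posterior mean $(\mu_0 + m y)/(m+1)$:

```python
# Check of Corollary 2 under the conjugate model (7): substituting w1, w2
# into delta* = y + w1 / (m * tau_pi) recovers (mu0 + m*y) / (m + 1).
# The form delta* = y + w1/(m*tau_pi) is our reading of equation (14).
m, alpha, beta, mu0 = 12, 3.0, 2.0, 0.7
y, s2 = 1.9, 1.3                                  # an arbitrary data point

D = 1.0 / beta + 0.5 * (m - 1) * s2 + 0.5 * m / (m + 1) * (y - mu0) ** 2
w1 = -(alpha + m / 2) * (m / (m + 1)) * (y - mu0) / D
w2 = (m - 3) / (2 * s2) - 0.5 * (alpha + m / 2) * (m - 1) / D

tau_pi = (m - 3) / ((m - 1) * s2) - 2.0 * w2 / (m - 1)   # = (alpha + m/2)/D
delta_star = y + w1 / (m * tau_pi)
delta_pi = (mu0 + m * y) / (m + 1)
print(delta_star, delta_pi)   # equal up to floating point
```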
A.5. Proof of Lemma 1
The proof of Lemma 1 follows directly from Assumption 3 and Markov's inequality. For example, fix $\nu_2 > 0$ and note that, for $r = \epsilon_2^{-\nu_2} > 1$,
\[
P\bigg( \tau \le \frac{\epsilon_2^{1+\nu_2}}{\log n} \bigg) \le \frac{E_H\{\exp(\epsilon_2/\tau)\}}{n^r}.
\]
A.6. Proof of Lemma 2
Recall that $f(y_i, s_i^2) = \int_{\mathbb R^+} \int_{\mathbb R} f_1(y_i \mid \mu, \tau)\, f_2(s_i^2 \mid \tau)\, g(\mu \mid \tau)\, h(\tau)\, d\mu\, d\tau$, where $g(\cdot \mid \tau)$ and $h(\cdot)$ are, respectively, the density functions associated with the distribution functions $G_\mu(\cdot \mid \tau)$ and $H_\tau(\cdot)$ in equation (1), $f_1$ is the density of a Gaussian random variable with mean $\mu$ and variance $1/(m\tau)$, and $f_2$ is the density of $S^2$ where $(m-1)S^2\tau \sim \chi^2_{m-1}$. We denote the partial derivative of $f(y, s^2)$ with respect to $y$ by $f'_{(1)}(y, s^2)$. From Model (1), $|f'_{(1)}(y_i, s_i^2)| \le T_1 + T_2$, where
\[
T_1 = m |Y_i| \int_{\mathbb R^+} \int_{\mathbb R} \tau\, f_1(y_i \mid \mu, \tau)\, f_2(s_i^2 \mid \tau)\, g(\mu \mid \tau)\, h(\tau)\, d\mu\, d\tau, \qquad
T_2 = m \int_{\mathbb R^+} \int_{\mathbb R} |\mu|\, \tau\, f_1(y_i \mid \mu, \tau)\, f_2(s_i^2 \mid \tau)\, g(\mu \mid \tau)\, h(\tau)\, d\mu\, d\tau. \tag{8}
\]
Consider the term $T_1$ in equation (8) above. Fix $\nu_3 > 0$ such that $\epsilon_3^{-\nu_3} > 4m$ in Lemma 1. With $C_3 = \epsilon_3^{-1-\nu_3}$ we have
\[
\begin{aligned}
T_1 &\le m |Y_i| f(y_i, s_i^2)\, C_3 \log n + m |Y_i| \int_{\mathbb R^+} \int_{\mathbb R} \tau\, I(\tau \ge C_3 \log n)\, f_1(y_i \mid \mu, \tau)\, f_2(s_i^2 \mid \tau)\, g(\mu \mid \tau)\, h(\tau)\, d\mu\, d\tau \\
&\le m |Y_i| f(y_i, s_i^2)\, C_3 \log n + m c_0 |Y_i|\, E_H\big\{ \tau^{5/2} I(\tau \ge C_3 \log n) \big\} \qquad (9) \\
&\le m |Y_i| f(y_i, s_i^2)\, C_3 \log n + m c_1 |Y_i| \big\{ P(\tau \ge C_3 \log n) \big\}^{1/2} \qquad (10) \\
&\le m |Y_i| f(y_i, s_i^2)\, C_3 \log n + m c_2 |Y_i|\, n^{-2m}. \qquad (11)
\end{aligned}
\]
In equation (9) we have used the facts that $f_1(y_i \mid \mu, \tau) \le \{\tau/(2\pi)\}^{1/2}$ and $f_2(s_i^2 \mid \tau) \le c_2 \tau$, while for equations (10) and (11) we use the Cauchy–Schwarz inequality, Assumption 1 and Lemma 1.
Now, consider the term $T_2$ in equation (8). With $C_3 = \epsilon_3^{-1-\nu_3}$ and $C_1 = \epsilon_1^{-1-\nu_1}$, where $\nu_1 > 0$ is such that $\epsilon_1^{-\nu_1} > 4m$ in Lemma 1, the term $T_2$ can be expressed as the sum of the following four integrals:
\[
\begin{aligned}
I_1 &= m \int_{\mathbb R^+} \int_{\mathbb R} |\mu|\, I(|\mu| \le C_1 \log n)\, \tau\, I(\tau \le C_3 \log n)\, f_1(y_i \mid \mu, \tau)\, f_2(s_i^2 \mid \tau)\, g(\mu \mid \tau)\, h(\tau)\, d\mu\, d\tau, \qquad (12) \\
I_2 &= m \int_{\mathbb R^+} \int_{\mathbb R} |\mu|\, I(|\mu| \ge C_1 \log n)\, \tau\, I(\tau \le C_3 \log n)\, f_1(y_i \mid \mu, \tau)\, f_2(s_i^2 \mid \tau)\, g(\mu \mid \tau)\, h(\tau)\, d\mu\, d\tau, \qquad (13) \\
I_3 &= m \int_{\mathbb R^+} \int_{\mathbb R} |\mu|\, I(|\mu| \le C_1 \log n)\, \tau\, I(\tau \ge C_3 \log n)\, f_1(y_i \mid \mu, \tau)\, f_2(s_i^2 \mid \tau)\, g(\mu \mid \tau)\, h(\tau)\, d\mu\, d\tau, \\
I_4 &= m \int_{\mathbb R^+} \int_{\mathbb R} |\mu|\, I(|\mu| \ge C_1 \log n)\, \tau\, I(\tau \ge C_3 \log n)\, f_1(y_i \mid \mu, \tau)\, f_2(s_i^2 \mid \tau)\, g(\mu \mid \tau)\, h(\tau)\, d\mu\, d\tau.
\end{aligned}
\]
We will bound $I_1$ and $I_2$; the other two integrals can be bounded using similar arguments. For $I_1$ in equation (12), note that $I_1 \le c_0 (\log^2 n)\, f(y_i, s_i^2)$, and for $I_2$ in equation (13),
\[
I_2 \le c_1 (\log^{5/2} n) \big\{ P(|\mu| \ge C_1 \log n) \big\}^{1/2} \le c_2 (\log^{5/2} n)\, n^{-2m}. \tag{14}
\]
In equation (14) we have used $f_1(y_i \mid \mu, \tau) \le \{\tau/(2\pi)\}^{1/2}$, $f_2(s_i^2 \mid \tau) \le c_2 \tau$, the Cauchy–Schwarz inequality, Assumption 1 and Lemma 1. Similarly, we can show that $I_3 \le c_0 \log n / n^{2m}$ and $I_4 \le c_1 / n^{4m}$. Finally, putting the upper bounds for $I_i$, $i = 1, \ldots, 4$, together, we get
\[
T_2 \le c_0 (\log^2 n)\, f(y_i, s_i^2) + c_1 \frac{\log^{5/2} n}{n^{2m}}. \tag{15}
\]
From equations (11) and (15) we have
\[
|w_1(Y_i, S_i^2)| \le c_0 |Y_i| \log n + c_1 \log^2 n + \frac{c_2 |Y_i| + \log^{5/2} n}{f(y_i, s_i^2)\, n^{2m}}.
\]
Moreover, noting that Model (1) and Lemma 1 imply that $|Y_i| \le c_3 \log n$ with high probability, we have, conditional on this event,
\[
|w_1(Y_i, S_i^2)| \le c_4 \log^2 n + c_5 \frac{\log n + \log^{5/2} n}{f(y_i, s_i^2)\, n^{2m}}. \tag{16}
\]
We will now analyze the behavior of $f(y_i, s_i^2)$ that appears in the display above. Gaussian concentration implies that with high probability $(Y_i - \mu_i)^2 m \tau_i \le 2 \log n$ and so, conditional on this event,
\[
f_1(y_i \mid \mu, \tau) \ge \frac{c_0 \sqrt\tau}{n}. \tag{17}
\]
Moreover, using the chi-square concentration in Lemma 1 of Laurent & Massart (2000), $(m-1) S_i^2 \tau_i \le (m-1) + 2\{(m-1)\log n\}^{1/2} + 2 \log n$ and $S_i^2 \tau_i \ge n^{-1}$ with high probability. It follows that, conditional on this event,
\[
f_2(s_i^2 \mid \tau) \ge \frac{c_1 \tau}{a_n n^{(m+1)/2} \log n}, \tag{18}
\]
where $a_n = \exp[\{(m-1)\log n\}^{1/2}]$. Using equations (17) and (18), we have
\[
f(y_i, s_i^2) \ge \frac{c_2}{a_n n^{(m+3)/2} \log n} \int_{\mathbb R^+} \tau^{3/2} h(\tau)\, d\tau. \tag{19}
\]
Now, applying Assumption 3 and Lemma 1 to the quantity $\int_{\mathbb R^+} \tau^{3/2} h(\tau)\, d\tau$ in equation (19), we conclude that
\[
\int_{\mathbb R^+} \tau^{3/2} h(\tau)\, d\tau \ge \frac{c_3}{\log^{3/2} n}\, P( \tau \ge C_2/\log n ). \tag{20}
\]
So, with equations (19) and (20) and Lemma 1, we have, with high probability,
\[
f(y_i, s_i^2) \ge \frac{c_4}{a_n n^{(m+3)/2} \log^{5/2} n}. \tag{21}
\]
The first statement of Lemma 2 thus follows from equations (16) and (21).
We will now prove the second statement of Lemma 2. Fix $\nu_2 > 0$ such that $\epsilon_2^{-\nu_2} > 2m$ in Lemma 1 and let $C_2 = \epsilon_2^{1+\nu_2}$. First note that from Assumption 3, $P(\tau \le C_2/\log n) \le c_0/n^{2m}$. Now, Markov's inequality implies
\[
E(\tau \mid y_i, s_i^2) \ge \frac{C_2}{\log n} \big\{ 1 - P(\tau \le C_2/\log n \mid y_i, s_i^2) \big\}.
\]
Moreover,
\[
P(\tau \le C_2/\log n \mid y_i, s_i^2) = f(y_i, s_i^2)^{-1} \int_0^{C_2/\log n} h(\tau) \Big\{ \int_{\mathbb R} g(\mu \mid \tau)\, f_1(y_i \mid \mu, \tau)\, f_2(s_i^2 \mid \tau)\, d\mu \Big\}\, d\tau.
\]
Now $f_1(y_i \mid \mu, \tau) \le \{\tau/(2\pi)\}^{1/2}$ and $f_2(s_i^2 \mid \tau) \le c_1 \tau$, where $c_1 > 0$ is a constant. So for some positive constant $c_2$,
\[
P(\tau \le C_2/\log n \mid y_i, s_i^2) \le \frac{c_2}{f(y_i, s_i^2)} \int_0^{C_2/\log n} \tau^{3/2} h(\tau) \Big\{ \int_{\mathbb R} g(\mu \mid \tau)\, d\mu \Big\}\, d\tau \le \frac{c_3}{f(y_i, s_i^2) \log^{3/2} n} \int_0^{C_2/\log n} h(\tau)\, d\tau = \frac{c_3}{f(y_i, s_i^2) \log^{3/2} n}\, P(\tau \le C_2/\log n).
\]
Thus, from the above display, Assumption 3 and Lemma 1,
\[
E(\tau \mid y_i, s_i^2) \ge \frac{C_2}{\log n} \Big\{ 1 - \frac{c_3 n^{-2m}}{f(y_i, s_i^2) \log^{3/2} n} \Big\}.
\]
Finally, equation (21) and the above display prove the second statement of Lemma 2.
B. ADDITIONAL NUMERICAL EXPERIMENTS
B.1. Numerical Experiments with small n
In this section, we evaluate the risk performance of the six competing approaches of sections 5.1 and 5.2 when n = 100. We consider simulation settings 1, 2 and 4 from section 5.1, and the setting of the ratio estimation example from section 5.2. Figures 7 to 10 present the relative risk ratios of the competing estimators for the four simulation settings. We continue to observe that as heterogeneity increases in the data, DS dominates the three linear shrinkage estimators considered here. Moreover, DS dominates NPMLE across all settings, with the exception of setting 1 at m = 10 (figure 7, left), where it remains almost as good as NPMLE as heterogeneity rises.
B.2. Numerical Experiments with unequal sample sizes mi
In this section, we present the risk performance of the six competing approaches of sections 5.1 and 5.2 when the sample sizes mi differ across the n = 1000 units of study. We continue to use the simulation settings of sections 5.1 and 5.2, and only change how m = (m1, . . . , mn) is generated in these settings. Figures 11 and 12 present the relative risks of the competing estimators across the six simulation settings of sections 5.1 and 5.2. With the exception of setting 3 in section 5.1, m is generated
[Figure 7: three panels of relative-risk curves omitted; legend: DS, Grp Linear, NPMLE, XKB.M, XKB.SG.]
Fig. 7. Comparison of relative risks when (µi, σ²i) are independent. Here µi ∼ 0.7N(0, 0.1) + 0.3N(±1, 3) i.i.d. and σ²i ∼ U(0.5, u) i.i.d. Plots show m = 10, 30, 50, left to right.
[Figure 8: three panels of relative-risk curves omitted; legend: DS, Grp Linear, NPMLE, XKB.M, XKB.SG.]
Fig. 8. Comparison of relative risks for correlated (µi, σ²i). Here τi = 1/σ²i ∼ 0.5 Γ(20, rate = 20) + 0.5 Γ(20, rate = u) i.i.d. and µi | τi ∼ N(±0.5/τi, 0.5²) independently. Plots show m = 10, 30, 50, left to right.
as a fixed vector of size n with elements sampled randomly from (10, 15, 20, . . . , 50) with replacement, while for setting 3 we generate m by sampling randomly from (30, 35, . . . , 100) with replacement. For settings 1, 2 and 3 (figure 11), we note that both NPMLE and DS dominate the three linear shrinkage estimators, with DS being as good as NPMLE in these three settings. Settings 4 and 5 from section 5.1 represent scenarios wherein the data Yij | (µi, σ²i) are not normally distributed. Under both these settings, DS provides an overall better risk performance than the competing estimators as the heterogeneity in the data increases with u. Setting 6 (figure 12, right) is the ratio estimation scenario considered in section 5.2, and we see that DS dominates the competing estimators of θi = √mi µi/σi considered here.
B.3. Gamma and Weibull mixtures
In this section we provide numerical evidence demonstrating the performance of the NEST estimator for compound estimation of the scale parameter of Gamma and Weibull mixtures in examples 2 and 4 of section 4. To choose λ in these settings, we use two-fold cross-validation, randomly splitting the observed data Yi = (Yij : 1 ≤ j ≤ m) for unit i into two approximately equal parts (Ui, Vi) such that Yi = (Ui, Vi).
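The split can be sketched as follows. Since the NEST fitting routine of examples 2 and 4 is not reproduced here, `fit_and_loss` uses a simple linear shrinkage rule as a stand-in; only the two-fold cross-validation mechanics mirror the text, and all names are ours.

```python
# Two-fold cross-validation for lambda: split each unit's observations
# Y_i = (U_i, V_i), fit on one half, score on the other, then swap roles.
# The shrinkage rule is a stand-in for the actual NEST fit.
import numpy as np

rng = np.random.default_rng(7)
n, m = 200, 30
theta = rng.gamma(5.0, 1.0, size=n)                  # unit-level parameters
Y = rng.normal(theta[:, None], 2.0, size=(n, m))     # m observations per unit

def fit_and_loss(train, test, lam):
    """Stand-in fit: shrink unit means toward the grand mean by lam."""
    means = train.mean(axis=1)
    est = (1 - lam) * means + lam * means.mean()
    return np.mean((est - test.mean(axis=1)) ** 2)

def choose_lambda(Y, grid):
    cols = rng.permutation(Y.shape[1])
    half = Y.shape[1] // 2
    U, V = Y[:, cols[:half]], Y[:, cols[half:]]
    cv = [0.5 * (fit_and_loss(U, V, lam) + fit_and_loss(V, U, lam))
          for lam in grid]                           # average both directions
    return grid[int(np.argmin(cv))]

lam_hat = choose_lambda(Y, np.linspace(0.0, 1.0, 21))
print(lam_hat)
```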
[Figure 9: three panels of relative-risk curves omitted; legend: DS, Grp Linear, NPMLE, XKB.M, XKB.SG.]
Fig. 9. Comparison of relative risk for non-normal data. Here Yij | µi, σi ∼ U(µi − √3 σi, µi + √3 σi) i.i.d., σ²i are sampled independently from N(u, 1) truncated below at 0.01, and µi | σ²i ∼ 0.8N(0.25σ²i, 0.25) + 0.2N(σ²i, 1) independently. Plots show m = 10, 30, 50, left to right.
[Figure 10: three panels of relative-risk curves omitted; legend: DS, Grp Linear, NPMLE, XKB.M, XKB.SG.]
Fig. 10. Comparison of relative risk for estimating θ. Here σ²i ∼ U(0.1, u) i.i.d. and µi | σi ∼ N(±0.5σ²i, 0.5²) independently. Plots show m = 10, 30, 50, left to right.
Scale mixture of Gamma distributions. Consider the setting of example 2 in section 4 where, for j = 1, . . . , m and i = 1, . . . , n, Yij | αi, 1/βi ∼ Γ(αi, 1/βi) i.i.d., with 1/βi ∼ G(·) i.i.d. Here the shape parameters α = (α1, . . . , αn) are known and G(·) is an unspecified prior distribution on the scale parameters 1/βi. We fix m = 30 and consider two scenarios: in scenario 1, βi ∼ Γ(7, 1) i.i.d. and α is a fixed vector of size n with elements sampled randomly from (1, 3, 5) with replacement; in scenario 2, βi ∼ χ²_5 i.i.d. and α is a fixed vector of size n with elements sampled randomly from (1, 2, 3) with replacement. We consider four competing estimators of β across the two scenarios as n varies: the NEST estimator from example 2 with λ obtained from two-fold cross-validation, denoted DS; the NPMLE-based estimator of β from Koenker & Gu (2017); the maximum likelihood estimator (MLE) of βi, which is mαi / Σ_{j=1}^m Yij; and DS orc, the NEST estimator of example 2 with oracle λ. The average squared error risk of these estimators is calculated over 100 Monte Carlo repetitions, and figure 13 presents the ratio of the average risks, with the average risk of DS orc in the denominator. For scenario 1 (figure 13, left), the NEST estimator is
[Figure 11: three panels of relative-risk curves omitted; legend: DS, Grp Linear, NPMLE, XKB.M, XKB.SG.]
Fig. 11. Left to right: simulation settings 1, 2 and 3. Here m is a fixed vector of size n with elements sampled randomly from (10, 15, 20, . . . , 50) with replacement for settings 1 and 2, and from (30, 35, . . . , 100) with replacement for setting 3.
[Figure 12: three panels of relative-risk curves omitted; legend: DS, Grp Linear, NPMLE, XKB.M, XKB.SG.]
Fig. 12. Left to right: simulation settings 4, 5 and 6 (from section 5.2). In each setting m is a fixed vector of size n with elements sampled randomly from (10, 15, 20, . . . , 50) with replacement.
almost as good as the NPMLE estimator of β for large n. Scenario 2, on the other hand, represents a challenging setting where the βi are smaller than those in scenario 1. In this setting, the NEST estimator has a lower relative risk than the NPMLE across all n.
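For reference, the maximum likelihood benchmark for scenario 1 can be reproduced in a few lines. The simulation below (our illustration, not the authors' code) checks that the stated MLE mαi/Σj Yij concentrates around the true βi:

```python
# MLE of beta_i in the Gamma scale mixture, scenario 1:
# Y_ij | alpha_i, beta_i ~ Gamma(shape alpha_i, scale 1/beta_i), and with
# alpha_i known the MLE of beta_i is m * alpha_i / sum_j Y_ij.
import numpy as np

rng = np.random.default_rng(3)
n, m = 1000, 30
beta = rng.gamma(7.0, 1.0, size=n)             # beta_i ~ Gamma(7, 1)
alpha = rng.choice([1.0, 3.0, 5.0], size=n)    # fixed, known shape vector
Y = rng.gamma(alpha[:, None], 1.0 / beta[:, None], size=(n, m))

beta_mle = m * alpha / Y.sum(axis=1)
# per unit, E(beta_mle / beta) = m * alpha_i / (m * alpha_i - 1), near 1
print(np.mean(beta_mle / beta))
```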
Scale mixture of Weibull distributions. Consider the setting of example 4 in section 4 where, for j = 1, . . . , m and i = 1, . . . , n, Yij | ki, βi ∼ Weibull(ki, βi) i.i.d., with βi ∼ G(·) i.i.d. Here the shape parameters k = (k1, . . . , kn) are known and G(·) is an unspecified prior distribution on the scale parameters βi. We fix m = 30 and consider two scenarios: in scenario 1, βi = |Zi| where Zi ∼ N(1, 0.1²) i.i.d.; in scenario 2, βi ∼ (1/3)δ(0.75) + (1/3)δ(1) + (1/3)δ(1.25) i.i.d. In each case, k is a fixed vector of size n with elements sampled randomly from (1, 2, 3) with replacement. Analogous to figure 13, figure 14 presents the relative risks of the competing estimators of β across the two scenarios as n varies. The competing estimators include the NEST estimator from example 4 with λ obtained from two-fold cross-validation, denoted DS; the NPMLE-based estimator of β; the maximum likelihood estimator
[Figure 13: two panels of relative-risk curves omitted; legend: DS, MLE, NPMLE.]
Fig. 13. Comparison of relative risk for estimating β under a scale mixture of Gamma distributions. Left: scenario 1, βi ∼ Γ(7, 1) i.i.d. and α a fixed vector of size n with elements sampled randomly from (1, 3, 5) with replacement. Right: scenario 2, βi ∼ χ²_5 i.i.d. and α a fixed vector of size n with elements sampled randomly from (1, 2, 3) with replacement. Here m is fixed at 30.
[Figure 14: two panels of relative-risk curves omitted; legend: DS, MLE, NPMLE.]
Fig. 14. Comparison of relative risk for estimating β under a scale mixture of Weibull distributions. Left: scenario 1, βi = |Zi| where Zi ∼ N(1, 0.1²) i.i.d. Right: scenario 2, βi ∼ (1/3)δ(0.75) + (1/3)δ(1) + (1/3)δ(1.25) i.i.d. Here k is a fixed vector of size n with elements sampled randomly from (1, 2, 3) with replacement and m is fixed at 30.
(MLE) of βi, which is m / Σ_{j=1}^m Y_{ij}^{k_i}; and DS orc, the NEST estimator of example 4 with oracle λ. We note that for both scenarios, and particularly for scenario 2, the NEST estimator has a lower relative risk than the NPMLE for large n.
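The Weibull MLE quoted above can be checked the same way. Under the rate parameterization implied by the formula, with density k β y^{k−1} exp(−β y^k) (an assumption on our part), Y^k is exponential with rate β, so the MLE is m/Σj Y^{ki}_{ij}. The script below is our illustration:

```python
# MLE of beta_i in the Weibull scale mixture: with known shape k_i and
# density k*beta*y^(k-1)*exp(-beta*y^k) (our assumed parameterization),
# Y^k ~ Exponential(rate beta), so the MLE is m / sum_j Y_ij^{k_i}.
import numpy as np

rng = np.random.default_rng(5)
n, m = 1000, 30
beta = np.abs(rng.normal(1.0, 0.1, size=n))    # scenario 1: beta_i = |Z_i|
k = rng.choice([1.0, 2.0, 3.0], size=n)        # fixed, known shape vector
# inverse-transform sampling: Y = (E / beta)^(1/k) with E ~ Exp(1)
Y = (rng.exponential(size=(n, m)) / beta[:, None]) ** (1.0 / k[:, None])

beta_mle = m / (Y ** k[:, None]).sum(axis=1)
# per unit, E(beta_mle / beta) = m / (m - 1), near 1
print(np.mean(beta_mle / beta))
```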