empirical bayes in the unbalanced case 1 introduction

59
Empirical Bayes in the unbalanced case by Ragnar Norberg Abstract The present paper treats empirical Bayes decision situations where the observational design varies among the decision units — the unbal- anced case. By placing a distribution on the designs, access is given to standard asymptotic theory. Two main avenues are described. One is to restrict the model by the technique of conjugate priors, which is adapted here to the unbalanced case using a modified definition of sufficiency. The other is to restrict the method, which seems to be the only passable way in situations with nonparametric distributions and unbalanced designs. Special attention is given to linear estimators. Empirical Bayes decisions are constructed by estimating the parame- ters involved in the (possibly restricted) Bayes decision. Asymptotic properties of parameter estimators are discussed. Some new notions of asymptotic optimality of empirical Bayes rules are introduced, and weak sufficient conditions for asymptotic optimality are given. Keywords: Empirical Bayes, unbalanced design, conjugate priors, suf- ficiency, parametric and nonparametric models, unrestricted and re- stricted decision rules, linear estimators, optimal parameter estima- tion, asymptotic optimality. 1 Introduction 1A. Background, issues, and results. The purpose of the present paper is to present a version of the empirical Bayes theory that permits a system- atic treatment of situations where the observational design varies among the decision units, henceforth referred to as the unbalanced case. The empirical Bayes theory was launched by Robbins (1955, 1964) as an approach that is applicable when the same decision problem presents itself repeatedly and independently with a fixed but unknown a priori distribu- tion of the parameter” (Robbins, 1964). This formulation implied that the observational design is the same for all the decision problems. Later con- tributions, notably by Swamy (1971), Wind (1973), and Bunke & Gladitz (1974), dealt with sequences of regression experiments with different regres- sor matrices. Asymptotic optimality in the sense of convergence of the risk 1

Upload: nguyenhanh

Post on 04-Jan-2017

221 views

Category:

Documents


0 download

TRANSCRIPT

Empirical Bayes in the unbalanced case

by Ragnar Norberg

Abstract

The present paper treats empirical Bayes decision situations wherethe observational design varies among the decision units — the unbal-anced case. By placing a distribution on the designs, access is givento standard asymptotic theory. Two main avenues are described. Oneis to restrict the model by the technique of conjugate priors, whichis adapted here to the unbalanced case using a modified definition ofsufficiency. The other is to restrict the method, which seems to be theonly passable way in situations with nonparametric distributions andunbalanced designs. Special attention is given to linear estimators.Empirical Bayes decisions are constructed by estimating the parame-ters involved in the (possibly restricted) Bayes decision. Asymptoticproperties of parameter estimators are discussed. Some new notionsof asymptotic optimality of empirical Bayes rules are introduced, andweak sufficient conditions for asymptotic optimality are given.Keywords: Empirical Bayes, unbalanced design, conjugate priors, suf-ficiency, parametric and nonparametric models, unrestricted and re-stricted decision rules, linear estimators, optimal parameter estima-tion, asymptotic optimality.

1 Introduction

1A. Background, issues, and results. The purpose of the present paperis to present a version of the empirical Bayes theory that permits a system-atic treatment of situations where the observational design varies among thedecision units, henceforth referred to as the unbalanced case.

The empirical Bayes theory was launched by Robbins (1955, 1964) as anapproach that is applicable when the same decision problem presents itselfrepeatedly and independently with a fixed but unknown a priori distribu-tion of the parameter” (Robbins, 1964). This formulation implied that theobservational design is the same for all the decision problems. Later con-tributions, notably by Swamy (1971), Wind (1973), and Bunke & Gladitz(1974), dealt with sequences of regression experiments with different regres-sor matrices. Asymptotic optimality in the sense of convergence of the risk

1

to the Bayes risk was established by placing regularity conditions on the se-quence of regressor matrices. The present author (Norberg, 1980) pursuedthese problems in Hachemeister’s (1975) context of empirical linear Bayesestimation and proved asymptotic optimality under the condition that thecoefficients in the linear Bayes estimator are bounded functions of the pa-rameters. The condition was verified for some standard models, includingthe traditional regression model with known correlation structure. A laterwork (Norberg, 1982) centered on the problem of estimating the first andsecond order moments occurring in the linear Bayes estimators. A classof weighted least squares estimators was studied, and the properties of theEGLS (estimated generalized least squares) estimator obtained by insertingestimated optimal weights, was studied by simulations. In a follow-up of thatpaper Neuhaus (1984) assumed the designs to be iid (independent and iden-tically distributed) random elements and proved results on consistency andasymptotic normality of the weighted least squares estimators with weightsdepending on the design only. The iid-device of Neuhaus ensures that theobservational units become stochastic replicates, which lend themselves toa rich asymptotic theory.

The present paper undertakes to merge well established empirical Bayestheory with the results described above, and complete the theory at certainpoints to obtain a unified methodology for a variety of situations coveringparametric, semiparametric and nonparametric families of distributions, andunbalanced designs.

Section 2 outlines the general framework model and motivates it by threeexamples. Section 3 formulates the empirical Bayes decision problem. Thefirst step in the construction of its solution is to find the Bayes decision byknown distributions. This is the issue of Section 4.

In Section 5 the potentials — and limitations — of the technique ofconjugate priors are investigated. The outline of De Groot (1970) is followedclosely and adapted to the unbalanced case. It proves serviceable to modifythe traditional definition of sufficiency to account for the import of thedesign.

The conjugate priors approach amounts to restricting the family of dis-tributions, and it will often result in a model that is unable to reflect thereality. In Section 6 it is argued that, in quest for simple decision rules, it ismore appropriate to restrict the class of admitted rules and judge their prop-erties within a realistic model. As an important example of this approach,the linear Bayes estimation procedure is given special attention. A resulton restricted minimaxity of linear Bayes estimators is stated and proved.

2

Section 7 treats the second step in the construction of the empiricalBayes rule, which is to estimate the parameters occurring in the (possiblyrestricted) Bayes rules obtained in Sections 4 and 6. In the unbalanced casethe problem of estimating the exact Bayes decision may be difficult even insimple parametric models, and it is usually insurmountable in semi- or non-parametric models. This circumstance adds to the motivation of restrictedBayes rules. Section 8 is devoted to estimation of the first and second or-der moments occurring in linear Bayes estimators. The EGLS-estimator isrevisited. It can be shown to have the same asymptotic variance as the GLS-estimator. Hesselager (1988a) conjectured and proved that Neuhaus’ (1984)result on asymptotic normality of the weighted least squares estimators isvalid also conditionally, for almost all given sequences of designs. His proofrests on a lemma given here, which is used to establish an analogous resultfor the maximum likelihood estimator.

Section 9 addresses the question of asymptotic optimality of empiricalBayes decisions. The notion of asymptotic optimality is defined in a waythat works in situations where asymptotic optimality in the traditional sensehas not been proved.

Throughout, the discussion of restricted Bayes and empirical Bayes deci-sions centers on linear estimation by quadratic loss. Linear estimation servesas an archetype of methods that are parametrized in order to compensate thecomplexity of nonparametric models with unbalanced designs. Quadraticloss is unbounded and exposes itself fully to the problem of bounding therisk of an empirical procedure.

The presentation is partly expository, aiming not only at telling what isdone, but also why.

1B. Notational conventions and mathematical prerequisites. Scalarsare written in ordinary italics, and so are quantities whose dimensions arenot specified. Matrices and vectors are written in boldface upper and lowercase letters, respectively. Occasionally the dimensions of matrices are in-dicated by topscripts. Thus, Ar×s is a matrix with r rows and s columns.The transpose of A is denoted by A′. Vectors without a transposition markare of column type. The trace of a square matrix As×s is denoted by trA.The trace operator is invariant under (permitted) cyclical permutation ofthe factors in a matrix product;

tr(AB) = tr(BA). (1.1)

3

The norm |A| of a matrix A = (aij) is defined by

|A|2 = tr(A′A) =∑

i

∑j

a2ij , (1.2)

the straightforward generalization of the usual Euclidean vector-norm. Theinequality

|AB| ≤ |A||B| (1.3)

is easily verified (Norberg, 1980). The space of r× s matrices is denoted byRr×s. The set of pds (positive definite symmetric) s× s matrices is denotedby Rs×s

> .Greek letters are reserved for unobservable quantities, which may be ei-

ther non-random parameters or latent random variables. In contexts whereit is necessary to distinguish random quantities from points in their rangespaces, the former are written in boldface, e.g. x. The expectation, covari-ance, and variance operators are denoted by E, Cov, and V ar, respectively.Thus, for any two random vectors x and y, not necessarily of the same di-mension, write Cov(x, y′) = E(xy′)−ExEy′ and V arx = Cov(x, x′). Recallthe general rules Ex = EE(x|θ) and

Cov(x, y′) = CovE(x|θ) , E(y′|θ)+ ECov(x, y′|θ) . (1.4)

LetM be a linear space with an inner product < , >. The correspondingnorm | | is defined by |m|2 =< m,m > , m ∈M. The distance between twopoints m and m in M is

|m− m| , (1.5)

which defines a metric. Assume that M is closed in this metric and that Mis a closed linear subspace of M. Keep m ∈M fixed, and let m vary in M.The distance in (1.5) is minimized at a unique point m, which is called theprojection of m onto M and is uniquely determined by the normal equations

< m− m , m >= 0 , ∀m ∈ M . (1.6)

4

Two special inner product spaces are needed in the following. Firstly,the space Rs with the inner product < , >A and corresponding norm | |A,defined by

< m, m >A = m′Am ,

= tr(Amm′) .

where As×s is pds. Secondly, the space of all squared integrable randoms-vectors with inner product , A and corresponding norm ‖ ‖A definedby

m, mA= E < m, m >A . (1.7)

2 The model – general set-up and some examples

2A. The framework model. Consider a sequence of observational unitslabeled by i = 1, . . . , I, . . .. The I first units are observable at the time ofconsideration and constitute the current sample, whereas the remainder ofthe sequence represents future observations.

The model framework will be built in steps, starting from a parsimoniousfixed-effects model specified as follows. With each unit i there is associatedan observable output quantity xi, an unknown parameter θi representingsome hidden characteristics of the unit, and an observable input or designquantity ci comprising certain explanatory covariates and the observationaldesign. The range of xi depends on the design and is denoted by Xci . Typi-cally the observation is vector-valued so that Xci ⊂ Rni , the Euclidean spaceof dimension ni = n(ci). Denote the range of the θi by T and that of the ciby C. The basic assumption is the following.

(i) The observational units are stochastically independent, and each outputxi has a pdf of the form f(·|θi, ci) (depending on i only through θi and ci)with respect to a basic measure µci . Put

F = f(·|θ, c) ; θ ∈ T , c ∈ C . (2.1)

Apart from the common shape of the pdf’s, the assumption (i) estab-lishes no relationship between the observational units: the outputs are

5

stochastically independent, and the θi assume their values in T indepen-dently. A much stronger model is obtained by assuming that all θi areequal to θ, say, so that x1, . . . ,xI can be pooled into one sample with jointpdf

∏Ii=1 f(xi|θ, ci). Robbins’ (1955, 1964) so-called empirical Bayes set-up

places itself between these two extreme solutions by envisaging the param-eters as outcomes of independent selections from a common distribution.Thereby a certain similarity is introduced between them, and at the sametime they are allowed to vary among the units. More specifically, the exten-sion to an empirical Bayes model goes as follows.

(ii) Assumption (i) is the conditional model for the xi, given θi = θi , i =1, 2, . . .. The θi are iid according to a pdf g with respect to a measure ν,

g ∈ G . (2.2)

The bulk of early empirical Bayes theory rests on the assumption thatall ci are equal to c, say, which will be referred to as the case with balanceddesign. The key point in the balanced design assumption is that the pairs(xi, θi) , i = 1, 2, . . ., become iid, so that standard large sample theory basedon the laws of large numbers and the central limit theorem can be invoked.In the unbalanced case with varying designs it is necessary to impose reg-ularity conditions on the ci to ensure the desired asymptotic properties ofstatistical procedures. A very simple way of getting round the problem isthe following assumption.

(iii) Assumptions (i) and (ii) constitute the conditional model for the (xi, θi),given ci = ci, i = 1, 2, . . .. The ci are iid according to a pdf h with respectto a measure η,

h ∈ H . (2.3)

Under the assumptions (i)–(iii) the triplets (xi, θi, ci) , i = 1, 2, . . ., becomeiid, and so the basis for application of standard large sample theory is re-stored. Assumption (iii) seems particularly apt in the non-experimentalcontext, where the designs are often produced by mechanisms of a more orless random nature. It may also be appropriate in the experimental contextsince the designation of the experiments is then controlled by an experi-menter and can be arranged so as to comply with the iid assumption.

6

Formally, the designs can be taken as random also in the submodel madeup of (i)–(ii): just let each ci be distributed according to a hi placing mass 1at ci. All measures introduced here are defined on some suitable sigmaalge-bra, which is not made visible. At the base of the model there is a probabilityspace (Ω,B, P ), which will also play a reticent role in the following. Theoperators E, V ar, etc. could appropriately be equipped with a subscript toindicate their dependence on the probability measure, but this is generallynot done.

When only one unit is under consideration, it is not necessary to dragalong with the subscript i. Therefore, to facilitate the presentation, thesequence (xi, θi, ci), i = 1, . . . , I, is extended with a current unit, (x, θ, c),which will frequently be focused on in what follows.

In the full model (i)–(iii) the joint pdf of (x, θ, c) is

f(x|θ, c)g(θ)h(c) . (2.4)

The joint pdf of (x, c) is obtained by integrating over θ in (2.4), and is

fg(x|c)h(c) , (2.5)

where

fg(x|c) =∫Tf(x|θ, c)g(θ)dν(θ) (2.6)

is the conditional pdf of x for given c = c. The conditional pdf of θ for given(x, c) = (x, c) is the ratio of the expressions in (2.4) and (2.5),

g(θ|x, c) =f(x|θ, c)g(θ)fg(x|c)

. (2.7)

The distribution of c is immaterial because of the assumed independence ofθ and c.

The function f(·|θ, c) will be referred to as the kernel density in thefollowing. It is often called the likelihood (function), but this term is unfor-tunate in the present context where θ is the outcome of a random variablestemming from a distribution, which has a frequency interpretation and can(in principle) be estimated. Then the marginal density of the observables

7

(x, c), given by (2.5), can appropriately be termed the likelihood since itforms the basis of statistical inferences. Sometimes it is convenient to speakof the kernel density as the conditional likelihood, that is, the likelihoodfunction in the conditional stage (i) model for given (θ, c) = (θ, c).

The marginal pdf g will be called the prior density in spite of its fre-quency interpretation. This convention is in accordance with tradition. Pri-ors representing subjective beliefs can suitably be referred to as subjectivepriors. The term prior indicates that g represents the knowledge about θbefore observations are made. Accordingly, the conditional pdf in (2.7) iscalled the posterior density since it represents the knowledge after observa-tion of (x, c) = (x, c).

2B. Examples. To fix ideas and motivate the general model, consider thefollowing examples, which will be referred to throughout.

Example 1 (Poisson kernels). In some lines of insurance it is customary tolet the premium of each individual risk depend on its current claims record.One example of such individual experience rating, familiar to most people,is merit rating in automobile insurance.

Most of the merit rating schemes used in practice take only on the num-bers of claims into consideration. Consider a class comprising I independentrisks. Let xij , j = 1, . . . , ni, be the numbers of claims reported in ni yearsby risk No. i. They are assumed to be independent, with each xij dis-tributed according to the Poisson law with parameter θipij . Here pij is anobservable measure of risk exposure in year j, e.g. the time exposed to riskas insured or the mileage, and θi is the individual claim intensity per unitrisk exposure. In this case ci = (ni, pi), with pi = (pij , . . . , pini)

′. The pdfof xi = (xi1,...,xini)

′ is

f(xi|θi, ni, pi) = (ni∏

j=1

pxij

ij

xij !)θ

∑j

xij

i e−θi

∑j

pij , (2.8)

xi ∈ Xci = N ni , θi ∈ T = (0,∞), the basic measure µci being the countingmeasure on Xci . This completes the specification of the model at stage (i)of the general outline.

Consider the problem of assessing θi for each individual risk i = 1, . . . , I.So far the model establishes no relationship between the risks. The θi arecompletely disconnected — any specific value of the sequence (θ1, θ2, . . .) is

8

just as likely at the outset as any other, and the samples xi are stochasticallyindependent. Thus, each θi has to be estimated from the experience xi ofrisk i alone. The traditional solution would be to estimate θi by the ML(maximum likelihood) estimator,

θi =∑j

xij/∑j

pij , (2.9)

which is also UMVU (uniformly minimum variance unbiased).The estimator θi is poor if

∑j pij is small — its variance is θi/

∑j pij .

The problem becomes acute when the insurer is faced with a new risk withno experience of its own (

∑j pij = 0). The fixed effects model at stage (i)

renders no possibility of assessing θi in a rational manner in the absence ofobservations. However, an insurer who wants to remain in business, has tofix a premium somehow. The common practice is to use, as an initial esti-mate,

∑i

∑j xij/

∑i

∑j pij , the observed mean risk premium of the present

portfolio. This solution is approved to by practical insurance people andcustomers, but has no support whatever in the stage (i) model. Obviously,the model fails to reflect the essential circumstance that automobile riskshave something in common, which justifies the use of portfoliowide statisticsin an assessment of each single risk. The mathematical way of accounting forthis notion of similarity is assumption (ii) in the general outline, whereby therisks are viewed as random selections from a population of risks—different,but of a similar origin. 2

Example 2 (Binomial kernels). A certain type of items can be attributedeither of two quality characteristics, defect or intact. A factory delivers itemsin large batches. Denote by θi the proportion of intact items in batch No.i: it is unknown and represents the quality of the batch. To prevent poorbatches from being supplied to the trade, a quality control is arranged. Fromeach batch i a random sample of ni items is controlled, and the outcome isrecorded as xi = (xi1, . . . ,xini)

′, where xij is 0 or 1 according as the j-thitem in the sample is defect or intact. The design is simply the sample size,ci = ni, Xni = 0, 1ni , and θi ∈ (0, 1) = T . Suppose the ni are smallcompared to the size of the batches. Then each xi can reasonably be viewedas a sequence of Bernoulli trials with pdf

f(xi|θi, ni) = θ

∑j

xij

i (1− θi)ni−

∑j

xij , (2.10)

9

the basic measure µni being the counting measure on Xni .This completes the specification of item (i) in the general outline above.

The model establishes no relationship between the batches and the sam-ples drawn from them. For each batch i the quality θi has to be assessedfrom xi alone. The ML and UMVU estimator is

θi =ni∑

j=1

xij/ni . (2.11)

Suppose I batches have already been subjected to control and that, onthe whole, the values of θ1, . . . , θI are close to 1, indicating that the batchesare of high quality. It may be felt that this piece of information ought tobe taken into account in the assessment of future batches. The batchesstem from the same manufacturing process, and the quality of the processitself must be decisive of the qualities of the single batches. This motivatesassumption (ii) in the general outline. The pdf g represents the quality ofthe manufacturing process.

Finally, the model can be extended by a stage (iii) assumption, placingdistributions on the sample sizes ni.2

Example 3 (linear regression). Assume that the observational units form asequence of related regression problems. For the time being, concentrate onthe current unit. At stage (i) of the general model outline it is assumedthat the vector of outputs xn×1 is distributed according to Nn(Y b, vP−1),where Y n×q and Pn×n are known and bq×1 and v1×1 are unobservable. Thematrix P is taken to be pds, implying that v > 0. In this set-up the designis c = (n, Y, P ) and the latent quantity is θ = (b, v). The kernel density is

f(x|b, v, n, Y, P ) = (2π)−n/2v−n/2|P |12 exp(− 1

2v|x− Y b|2P ) . (2.12)

Consider the problem of estimating (b, v) by the ML method. Replace xby x in (2.12). For each fixed v > 0 the expression on the right is maximizedat any b solving

Y ′P (x− Y b) = 0q×1 , (2.13)

10

the normal equations determining the < , >P -projection of x onto thecolumn space of Y . Then, insert b = b into (2.12) and maximize withrespect to v. It is easily shown that maximum is attained at n−r

n v, where

v =1

n− r|x− Y b|2P . (2.14)

The nature of the solution depends on r = rankY (≤ min(q, n)). If r = q,then (2.13) possesses a unique solution and delivers the ML estimator

b = (Y ′PY )−1Y ′Px . (2.15)

If r < q, there is a (q − r)-dimensional space of solutions, and the MLestimation of b is indeterminate. If r < n, then v > 0 a.s. If r = n, thenv = 0 a.s., reflecting the fact that the observations contain no informationabout the erratic terms. (In the special case r = q = n (2.14) reduces tob = Y −1x — perfect fit of the empirical regression to the observations.) Tosum up, the ML principle delivers a meaningful solution only if r = q <n. In this case the ML estimator is (b, n−r

n v) given by (2.14) and (2.15).The estimator (b, v) is unbiased and, being based on a minimal sufficientstatistic, it is even UMVU. Note that v is well defined whenever r < n.

Like in the previous examples, it can be concluded that the fixed effectmodel considered here renders no possibility of assessing the parameterswhen the data are scanty. One is then urged to consider the possibility ofutilizing knowledge about other similar observations by adding a type (ii)assumption from the general set-up.2

3 Formulation of the Decision problem

3A. The empirical Bayes decision problem. For the current unit adecision is to be selected in a space M of possible decisions. A loss functionl : M × T → R specifies that the loss incurred by making the decision mwhen θ = θ is l(m, θ). Only trivial modifications are required in the followingif l depends also on the outcome c of the design.

The observations constitute the available information. The primary in-formation is the observation (x, c) from the current unit, and the secondaryinformation is the observations

11

(x, c)I = (xi, ci) ; i = 1, . . . , I

from the collateral units i = 1, . . . , I. The decision problem consists indesigning a decision function,

mI(x, c) ; (x, c)I , (3.1)

with values in M such that, on the whole, the loss l(mI , θ) becomes smallin a suitable sense.

A measure of the overall performance of the decision function mI is themean loss,

ρg,h(mI) = El(mI , θ) , (3.2)

which is called the overall risk of mI . This is a highly convenient criterionsince it is scalarvalued and, therefore, orders the set of possible decisionfunctions and gives precise content to the notion of optimality. By the ruleof iterated expectations, (3.2) may be written as

ρg,h(mI) = Eρg(mI |θ, c) , (3.3)

where

ρg(mI |θ, c) = El(mI , θ)|θ, c . (3.4)

Considered as a function of given values (θ, c) of (θ, c), ρg(mI |θ, c) is theso-called risk function commonly used as a criterion in the context of fixedeffects models at stage (i) of the general outline. In non-trivial cases it isnot possible to find a decision function which minimizes the risk functionuniformly for all θ ∈ T . Therefore, results on optimality can only be ob-tained within constrained classes of decision rules (e.g. unbiased or invariantprocedures) or by evaluating the risk function by some functional like thesupremum or the weighted average in (3.2).

By the risk criterion the problem of selecting an optimal decision func-tion for the i-th unit amounts to minimizing the expected value in (3.2) with

12

respect to the function mI . One cannot hope to find a decision function thatminimizes the risk uniformly for all g ∈ G. The difficulty is the same as forthe risk function, and again a possible escape would be to form a weightedaverage, this time over G, to arrive at a scalarvalued performance criterion.The weights attached to the subsets of G would represent an assessmenton the part of the statistician of their relative importance and, essentially,constitute a subjective prior in the sense of Bayesian statistics.

3B. Sketch of a two-stage procedure. Subjective priors are not allowedfor in an empirical Bayes approach, and the problem of finding a decisionrule that performs well for all g must be attacked differently.

Matters would be greatly simplified if g were known, since then thedecision function could be permitted to depend on g. Thus, as a first step,look for an optimal choice in the extended class of decision functions thatare allowed to depend both on the observations and the prior. Write (3.2)as

ρg,h(mI) = EEl(mI , θ)|(x, c) , (x, c)I . (3.5)

Minimum is obtained by minimizing the inmost integral for fixed valuesof the observations. Due to the mutual independence of the observationalunits and the special form of the integrand, the problem reduces to mini-mizing

El(m, θ)|x, c (3.6)

with respect to m varying in the class of decision functions depending onlyon (x, c) and g. The solution, when it exists, is called the Bayes decisionfunction or Bayes rule against g. Let it be denoted by

mg(x, c) . (3.7)

Since (2.7) is independent of h, this is also the case with (3.6) and (3.7).The following section is devoted to the construction of Bayes solutions.

The collateral data (x, c)I , which in Section 2 were argued to be ofrelevance, have dropped out of the analysis and do not appear in the solution(3.7). This is so because g was held fixed (“assumed known”). In the full

13

model it is not, however, and this is where the collateral data come into play.The second step in the two-stage procedure consists in estimating the Bayesdecision in (3.7) from the observations (x, c)I to obtain a genuine decisionfunction depending only on the available data. This is the issue of Sections7 and 8.

Finally, the resulting decision rule has to be assessed by the perfor-mance criterion (3.2) to ascertain that the two-stage procedure serves theproclaimed purpose. This problem is treated in Section 9.

4 Bayes decisions by known distributions

4A. Constructing the Bayes solution. Consider now the auxiliary prob-lem of constructing the Bayes decision function in (3.7) when g is known.Then the current unit can be treated isolated from collateral data.

The problem is to minimize the integral in (3.6). It is termed the poste-rior risk and may be written as

ρg(m|x, c) =∫Tl(m, θ)g(θ|x, c)dν(θ) , (4.1)

where g(θ|x, c) is the posterior density in (2.7). From (2.7) and (4.1) it isseen that, to obtain the Bayes solution, it suffices to minimize

∫Tl(m, θ)f(x|θ, c)g(θ)dν(θ) .

The density in (2.6), which appears in the denominator of (2.7), is notneeded for that purpose.

Clearly, minimizing the posterior risk in (4.1) for each given (x, c) isequivalent to minimizing the risk (at c),

ρg(m|c) = El(m, θ)|c= Eρg(m|x, c)|c (4.2)

or, as already pointed out in Paragraph 3B, the overall risk,

ρg,h(m) = El(m, θ)= Eρg(m|c)= Eρg(m|x, c) . (4.3)

14

It is servicable to give special attention to the case where no observationsare available. Then the set of possible decision functions is just the decisionspace M, and the three concepts of risk in (4.1)–(4.3) coincide with theprior risk,

ρg(m) =∫Tl(m, θ)g(θ)dν(θ) . (4.4)

Assume that a prior Bayes decision in M exists, and denote it by mg.

The corresponding prior Bayes risk is ρg = ρg(m).Now, return to the general case and assume that a Bayes decision func-

tion mg exists. By comparison of (4.1) and (4.4), it follows that

mg(x, c) = mg(·|x,c) , (4.5)

and the minima of (4.2) and (4.3) are, respectively, the Bayes risk,

ρg(c) = Eρg(·|x,c)|c , (4.6)

and the overall Bayes risk,

ρg,h = Eρg(c) = Eρg(·|x,c) . (4.7)

Thus, formally the situation is the same after (x, c) has been observed asbefore, the only difference being that the prior distribution has been replacedby the posterior distribution in regard of the additional information that hasbeen gained.

It turned out that, due to the independence of θ and c, the essentialpart of the analysis could be carried out for fixed values of the designs. Thedistribution h comes into play only by formation of the overall risk as aweighted average of the risks at different design points. Therefore, the anal-ysis will henceforth be performed in the conditional model (i)–(ii) as long asonly one unit is under consideration. This applies to Paragraphs 4B through6C.

4B. Estimation by squared loss. Suppose the purpose of the decision isto estimate a vector-valued function ms×1(θ) of the latent variable. Then

15

the natural decision space is the range M = m(T ), which will be taken tobe all of Rs. Let the loss incurred by an estimator m(x) be the A-weightedsquared estimation error,

l(m, m) = |m− m|2A= (m− m)′A(m− m)= trA(m− m)(m− m)′ , (4.8)

see (1.7). The risk is

ρ(m|c) = ‖m− m‖2A= trA∆(m|c), (4.9)

where ‖ ‖A is the norm defined by (1.7) with E(·|c) in the place of E, and

∆(m|c) = E(m− m)(m− m)′|c (4.10)

is the risk matrix of m. Clearly, m is better than a given competitor ˇm if∆( ˇm|c) ≥ ∆(m|c), where the inequality signifies that the difference of thematrices is psds (positive semidefinite symmetric).

Let M be the class of all squared integrable estimators. It is a closedlinear subspace in the space of all random s-vectors. Consider the problemof finding an optimal estimator in a subclass M ⊂ M. If M is a closedlinear subspace, then the optimal estimator m exists and is the projectionof m onto M. It is uniquely determined by the normal equations (1.6),which now become

tr[AE(m− m)m′|c] = 0 , ∀m ∈ M . (4.11)

A sufficient condition for (4.11) to hold true is that

E(m− m)m′|c = 0s×s , ∀m ∈ M . (4.12)

Consider first the trivial space M = Rs of constant estimators. Theoptimal estimate m is determined by (4.12), which now specializes to

16

(Em−m)m′ = 0 , ∀m ∈ Rs . (4.13)

Thus,

m = Em , (4.14)

and the the risk matrix ∆ = ∆(m) is

∆ = V arm . (4.15)

The space M coincides with M in the case where no observations areat hand. Thus, (4.14) and (4.15) represent the prior Bayes solution.

In the presence of some observations x the (unrestricted) Bayes solutionin the class M of squared integrable estimators is readily obtained from(4.5), (4.6), (4.14) and (4.15). The Bayes estimator is

m = E(m|x, c) , (4.16)

and the Bayes risk matrix is

∆(c) = EV ar(m|x, c)|c . (4.17)

By use of (1.4), ∆(c) can be cast as

∆(c) = ∆ − V ar(m|c) , (4.18)

which shows how the estimation is improved when prior knowledge is sup-plemented by information from observations.

5 Specification of the model

5A. General considerations. Section 4 was dealing with general densities.In any particular case the families F and G have to be specified to form asuitable statistical model for the problem under consideration. This is theissue of the present chapter.

17

A mathematical model is not an attempt to describe all features of aphenomenon in their right proportions. Modeling necessarily means magni-fying some features and leaving others out, and a good model is one thatfocuses the essentials and neglects the less important details. Thus, thereare two objectives of modeling, which in a sense are conflicting. On theone hand, the model ought to be realistic in the sense of being capable ofproviding a close approximation to any possible candidate of the underlyingdistribution. This pulls in the direction of specifying wide families of distri-butions. On the other hand, the model should be mathematically simple andwell structured, which pulls in the direction of specifying narrow families ofdistributions in which posterior calculations and parameter estimation arefeasible.

Building a two-stage model of the kind studied here, usually starts withthe specification of the kernel family F . In many situations the choice ofkernels is motivated by some kind of physical reasoning. This was the casein each of the introductory Examples 1 and 2.

There is much more latitude in the choice of the family G of prior densi-ties. For example, there is a multitude of possible candidates for a distribu-tion of the individual claim intensities θi in Example 2. If one is convincedthat a unimodal distribution is adequate, then some simple parametric classof densities can serve as G. But if one suspects that the prior density g maybe bimodal (it might have a peak in the upper part of the interval (0,1) dueto clumsies and road-hogs) or even more complex, then the nonparametricfamily is more appropriate. In any case, one can say that the deliberationsunderlying the choice of G will typically be directed at fitting rather thanexplaining.

The balance of the present section follows closely the exposition of DeGroot (1970), but extends it to the unbalanced case and adds some accu-racy. In particular, the definition of sufficiency is adapted to the unbalancedcase. The regression model is examined in some detail. The examples to bestudied are standard, and the technicalities are mostly well known. Theyare reexamined here to determine the potentials — and limitations — ofunrestricted Bayes rules in (restricted) parametric models when confrontedwith real life problems.

Consider now the situation where F is given and it remains to specifyG with a view to the requirements of realism and simplicity. A constructiveapproach is to start by looking for a family of priors for which posteriorcalculations are easy to perform, and then—if such a family is found—judgewhether it is rich enough to be realistic. For each c ∈ C , x ∈ Xc, and

18

prior density g, the posterior density g(·|x, c) is given by (2.7). In search ofpriors g that produce simple expressions in θ on the right of (4.1), only thenumerator in (2.7) is of interest since the denominator is independent of θ.Thus, write

g(θ|x, c) ∝ f(x|θ, c)g(θ) , (5.1)

where the symbol ∝ signifies that the two expressions are proportional,considered as functions of θ. If the right hand side expression in (5.1) isrecognized as the essential part of a well known density, then the normingconstant fg(x|c) in (2.7) can be picked from the formula of this density, andit is not necessary to calculate the integral in (2.6).

The following two examples give guidance to a general procedure for se-lecting G.

Example 1 (continued). From (2.8) it is seen that

f(x | θ, n, p) ∝ θ∑

xje−θ∑

pj . (5.2)

As prior take the gamma distribution Ga(γ, δ) with density

gγ,δ(θ) =δγ

Γ(γ)θγ−1e−θδ

∝ θγ−1e−θδ , (5.3)

θ ∈ (0,∞) , γ, δ > 0. Here Γ is the gamma function.The posterior density is easily determined by the device described in

connection with (5.1). One finds

gγ,δ(θ | x, n, p) = g∑xj+γ,∑

pj+δ(θ) . (5.4)

The result (5.4) is very convenient, since it implies that all posteriorcalculations can be based on formulas for the gamma distributions. Byvirtue of (4.5) and (4.6), it suffices to derive the Bayes solution against theprior Ga(γ, δ). Thereafter, as observations are made, the decision can becurrently updated simply by inserting (

∑xj + γ ,

∑pj + δ) in the place of

(γ, δ) in the general formula. It is easily shown that

19

Eθ = γ/δ (5.5)

and

V arθ = γ/δ2 . (5.6)

As shown in Paragraph 4B, (5.5) and (5.6) are the Bayes solution to theproblem of estimating θ by quadratic loss. Now suppose the observationsx described above are available. By (4.16), (5.4), and (5.5), the Bayesestimator is the posterior mean,

θ = (∑

xj + γ)/(∑

pj + δ) . (5.7)

By (4.17), (5.4), and (5.6), the Bayes risk is easily found to be

ρ(c) = γ/δ(∑

pj + δ) . (5.8)

It can safely be stated that the family of gamma distributions meets thesimplicity requirement formulated in the previous paragraph. Whether italso meets the requirement of realism, depends on the situation at hand.

The following concluding remarks anticipate some later events. Intro-ducing

ζ =∑

pj/(∑

pj + δ) , (5.9)

(5.7) can be written as

θ = ζθ + (1− ζ)Eθ . (5.10)

where θ is the ML estimator given by (2.9). Formula (5.10) expresses theBayes estimator as a weighted average of the optimal estimator based en-tirely on the information contained in the sample and the prior estimatebased entirely on the information contained in the prior distribution. Ageneral version of the formula is discussed in Paragraph 5C.

20

Example 2 (continued). Take as prior the beta distribution Be(γ, δ) withdensity

gγ,δ(θ) =1

B(γ, δ)θγ−1(1− θ)δ−1 , (5.11)

θ ∈ (0, 1) (and zero elsewhere), γ, δ > 0. Here B is the beta function.The posterior distribution is found by the same technique as in the pre-

vious example. Inspection of the product of the expressions in (2.10) and(5.11) yields

gγ,δ(θ|x) = g∑xj+γ,n−∑

xj+δ(θ) . (5.12)

Again a pure prior analysis is sufficient to work out the formulas of Bayessolutions. One finds

Eθ =γ

γ + δ(5.13)

and

V arθ =γδ

(1 + γ + δ)(γ + δ)2. (5.14)

From (5.12) it follows that the Bayes estimator based on the observationsx is

θ =∑

xj + δ

n+ γ + δ. (5.15)

The corresponding Bayes risk is

ρ =γδ

(n+ γ + δ)(1 + γ + δ)(γ + δ). (5.16)

Also in this example the proposed class of priors is mathematically con-venient. As regards the question of realism, the beta family is fairly wide,but fails to include for instance bimodal distributions.

21

Finally, observe that also in the present case the Bayes estimator can bewritten in the form (5.10), with θ given by (2.11) and

ζ =n

n+ γ + δ. (5.17)

5B. Conjugate families of priors. In the two examples studied above,the search for tractable priors was successful. In either case there existed asimple family G of prior densities with the very convenient property that theposterior densities themselves belong to G. More precisely, and in generalterms, it turned out that

g(·|x, c) ∈ G∀c ∈ C , x ∈ Xc , g ∈ G , (5.18)

or, by virtue of (5.1), for each c ∈ C , x ∈ Xc , g ∈ G, there exists a g′′ ∈ Gsuch that

f(x|θ, c)g(θ) ∝ g′′(θ) . (5.19)

A family G with this property is said to be closed under sampling from thefamily F .

An examination of the examples in Paragraph 5A reveals that they havethe following features in common. Firstly, G essentially contains the densi-ties in F in the sense that

for each c ∈ C and x ∈ Xc there exists a g′ ∈ Gsuch that

f(x|θ, c) ∝ g′(θ) .() (5.20)

Secondly, G is essentially closed under the formation of products in the sensethatfor each pair g′, g ∈ G there exists a g′′ ∈ Gsuch that g′(θ)g(θ) ∝ g′′(θ) . ()It is immediately realized that (5.20) and (5) imply closedness of G undersampling from F as defined in (5.19).

Alone and by itself, closedness under sampling is not a distinctive featuresince it can always be attained by choosing G sufficiently wide, e.g. the trivial

22

family of all densities. It becomes significant only in conjunction with thesimplicity requirement, as was demonstrated in Paragraph 5A. If G is closedunder sampling and at the same time possesses nice, closed formulas forcalculation of prior Bayes solutions and other functions of interest, thenposterior calculations can be made by use of the same formulas. One shouldtherefore look for a narrow family G satisfying (5.20) and (5). Property(5.20) gives the key to a general prescription. Obviously, the narrowestfamily of priors satistying (5.20) is

G′ = g′ : g′(θ) ∝ f(x|θ, c) forsome c ∈ C and x ∈ Xc , (5.21)

which will be called the family of priors induced by F . Clearly, G’ also sat-isfies (5) and is indeed closed under sampling if

for each c′, c ∈ C, x′ ∈ Xc′ , and c ∈ Xc, there exist a c′′ ∈ C and an x′′ ∈ Xc′′

such that

f(x′|θ, c′)f(x|θ, c) ∝ f(x′′|θ, c′′) . (5.22)

In most situations condition (5.22) is trivially fulfilled if C is taken to in-clude all possible designs: the product on the left of (5.22) is simply theconditional density of the combined sample x′′ = (x′,x), where x′ and x areconditionally independent observations corresponding to the designs c′ andc, respectively.

In the presence of (5.22) the induced family G’ in (5.21) is the narrowestfamily of priors satisfying (5.20) and (5). Sometimes G’ may be judgedas too narrow so that it is desirable to extend it. For instance, in thePoisson/gamma case studied in Example 1, the family of priors induced bythe kernel densities (2.8) as

∑xj ranges in 0, 1, . . . , and

∑pj ranges in

(0,∞) is the family of gamma densities (5.3) with γ a positive integer andδ > 0. It is convenient to allow γ and δ to be any values in the naturalparameter space of the gamma distribution and let the full gamma familyof densities serve as the family of priors. This increases the flexibility of themodel, and nothing is lost in terms of mathematical simplicity. The familyG finally obtained, possibly after extending to the natural parameter space,will be called the natural conjugate family to F provided that the property(5) is preserved.

23

Having constructed a conjugate family G, it finally remains to judgewhether it is flexible enough to serve as a realistic model. This judgementhas to be based on deliberations of the kind mentioned in Paragraph 5A.If the last point turns out negative, then one will usually have to resortto methods that utilize only certain aspects (e.g. certain moments) of theunconditional distributions. Such methods are presented in Section 6.

Return to Example 1 where the scheme applied to the Poisson kernelsleads to a simple, parametric conjugate family of priors. This is so becausethe essential part of the kernel densities, from which the conjugate priorsare induced by the construction (5.21), depend on the design n, p and theoutput x only through the functions

∑pj and

∑xj . Similar circumstances

are seen to underlie the results in Example 2, where the binomial kernelsalso turned out to possess a parametric conjugate family.

In terms of the general set-up, the crucial point in the two examples isthat there exists a function sk×1(x, c) of fixed finite dimension and a functionl such that

f(x|θ, c) ∝ lθ, s(x, c). (5.23)

Under the condition (5.23) the induced family of priors in (5.21) is nec-essarily parametric, that is, its elements are indexed by a parameter of finitedimension k. The following theorem explicates the result.

Theorem 5.1 If the density f(x|θ, c) is of the form (5.23) with s of fixedfinite dimension, then the induced family of priors is parametric.

When (5.23) is fulfilled, s is said to be sufficient for the family F . Thereason for this phraseology is that, for any prior g, the posterior density ofθ depends on the observations only through s(x, c), as is seen by replacingf(x|θ, c) in (2.6) and (2.7) by the expression on the right of (5.23). Thus,to make statements about θ, it suffices to know the value of s. The con-verse is also true. If the posterior density depends on (x, c) only throughs(x, c) for each prior g, then s is sufficient. To see this, pick a prior den-sity g such that g(θ) > 0 for each θ ∈ T . Then it follows from (2.7) thatf(x|θ, c) ∝ g(θ|x, c)/g(θ), and since g(θ|x, c) depends on (x, c) only throughs(x, c), (5.23) is fulfilled.

5C. Examples. Example 1 (continued). For the Poisson distributions,(∑pj ,

∑xj) is sufficient. Thus, the total exposure and the total number of

24

Poisson events summarize all information contained in the data concerningthe value of the frequency parameter θ.2

Example 2 (continued). For the binomial distributions, (n,∑

xj) is suffi-cient. Once the sample size and the number of observed intacts are known,it is of no use to know which of the articles were intact when the purpose isto assess the proportion of defective items in the batch. 2

Example 3 (continued). Now, turn to stage (ii) of the outline of the generalmodel. In quest for a convenient family G of priors, write out the essentialpart of the expression on the right of (2.12) to obtain

f(x|b, v, n, Y, P ) ∝ v−n/2 exp− 12v

(b′Y ′PY b− 2b′Y ′Px+ x′Px). (5.24)

It is seen that a sufficient statistic for F is

s = (n, Y ′PY , Y ′Px , x′Px). (5.25)

If r = q < n, then (5.25) can be replaced by

s = (n, Y ′PY, b, v) , (5.26)

since the functions on the right of (5.25) and (5.26) correspond one-to-one.

Case 1: Normal priors of b by fixed v = ϕ. The construction of aconjugate family of priors is simple if restriction is made to distributionswhich place all mass at the hyperplane v = ϕ. Then (5.24) reduces to

f(x|b, n, Y, P ) ∝ exp−12(b′Y ′Φ−1Y b− 2b′Y ′Φ−1x) , (5.27)

where

Φ = ϕP−1 (5.28)

has been introduced for convenience. A sufficient statistic for F is

s = (n, Y ′PY , Y ′Px) . (5.29)

25

Considered as a function of b, the kernel density in (5.27) is shaped asa normal density, and so the conjugate family to F is the family of normaldensities

G = gβ,Λ ; β ∈ Rq , Λ ∈ Rq×q> (5.30)

given by

gβ,Λ(b) = (2π)−q/2|Λ|−1/2 exp−12(b− β)′Λ−1(b− β)

∝ exp−12(b′Λ−1b− 2b′Λ−1β) . (5.31)

Assume that b is distributed according to (5.31). The first two moments ofb are known to be

Eb = β , (5.32)

V arb = Λ . (5.33)

The essential part of the posterior density of b is the product of theexpressions in (5.27) and (5.31), which is

gβ,Λ(b|x, c) ∝ exp[−12b′(Y ′Φ−1Y + Λ−1)b− 2b′(Y ′Φ−1x+ Λ−1β)].(5.34)

The expression on the right of (5.34) is the essential part of a normaldensity, whose parameters are determined by comparison with the generalexpression in (5.31). It follows that

gβ,Λ(b|x, c) = gβ,Λ(b) , (5.35)

where

β = Λ(Y ′Φ−1x + Λ−1β , )Λ = (Y ′Φ−1Y + Λ−1)−1 . (5.36)

26

It can be concluded from (5.36) and (5.36) that the Bayes estimator ofb with respect to squared loss is

b = (Y ′Φ−1Y + Λ−1)−1(Y ′Φ−1x + Λ−1β) . (5.37)

If Y is of full rank, then (5.37) can be cast as

b = (Y ′Φ−1Y + Λ−1)−1(Y ′Φ−1Y b+ Λ−1β)= Zb+ (I − Z)β , (5.38)

where b is defined by (2.15) and

Zq×q = (ΛY ′Φ−1Y + I)−1ΛY ′Φ−1Y

= ΛY ′Φ−1Y (ΛY ′Φ−1Y + I)−1 .

= I − (ΛY ′Φ−1Y + I)−1 , (5.39)

which is well defined regardless of the rank of Y . In particular, in thecase of no observations (n = 0) Z can be defined as 0q×q, which makes(5.37) include the prior estimate (5.32) as a special case. Formula (5.38)expresses the Bayes estimator as a matrix weighted mean of the sampleestimator b and the prior estimate β. The weight matrix Z is called thecredibility matrix since it represents the relative amount of credence attachedto the observations. It increases with increasing precision Y ′Φ−1Y of b anddecreases with increasing precision Λ−1 of b or precision of the prior estimateβ, roughly speaking. An important consequence of extending the model byplacing a distribution on b, is that b can be estimated meaningfully by (5.37)also when Y is not of full rank, even in the case of no observations.

The Bayes risk matrix ∆(c) is simply Λ in (5.36) or, by (5.39),

∆(c) = (Y ′Φ−1Y + Λ)−1

= (I − Z)Λ . (5.40)

It is, of course, smaller than the prior Bayes risk matrix Λ, and it tends to0q×q if the precision Y ′Φ−1Y of the sample estimator increases (its weight Ztends to I) or if the precision Λ−1 of the prior estimate increases (its weightI − Z tends to I).

27

Case 2: Normal-gamma priors for (b,v). Now return to the originalfamily F of kernels given by (5.24), without the restriction v = ϕ.Inspection of (5.24), taken as a function of (b,v), reveals that it is shapedas a member of the family of densities

G = gβ,Λ,γ,δ ; β ∈ Rq , Λ ∈ Rq×q , γ > 0, δ > 0 (5.41)

given by

gβ,Λ,γ,δ(b, v) ∝ v−q/2 exp− 12v

(b− β)′Λ−1(b− β)v−(γ−1) exp(−δ/v)

= v−(γ+q/2−1) exp− 12v

(b′Λ−1b− 2b′Λ−1β + β′Λ−1β + 2δ).(5.42)

When (b,v) has the density in (5.42), then the conditional distribution of bfor fixed v = v is N (β, vΛ), and the marginal distribution of v−1 is Ga(γ, δ).

The family G given by (5.41) and (5.42) is induced by the kernels in(5.24), saturates the natural parameter space, and is closed under the for-mation of products in the sense of (5), and hence is the family of naturalconjugates to F .

The posterior density of (b,v) is obtained as the product of the functionsin (5.24) and (5.42). Inspection of the two expressions yields

gβ,Λ,γ,δ(b, v|x, c) = gβ,Λ,γ,δ(b, v) (5.43)

with

β = (Y ′PY + Λ−1)−1(Y ′Px + Λ−1β) ,Λ = (Y ′PY + Λ−1)−1 ,

γ = n/2 + γ ,

δ =12(β′Λ−1β + x′Px− β′Λ−1β) + δ . (5.44)

The Bayes estimator of b by squared loss is b = β. The posterior analysiscan be compiled from De Groot’s treatment of the iid case and shall not bedisplayed here.2

28

6 Restricted Bayes solutions – Linear Bayes esti-mation

6A. The rationale of restricted decisions. The crux of conjugate anal-ysis is to restrict the families F and G to obtain easy posterior calculationsand, in particular, a simple construction of the Bayes solution by (4.5) and(4.6). Often, however, the feasible restrictions turn out to be too severeso that the resulting model disregards or twists important features of thereality it is supposed to describe. For instance, the gamma family arrivedat in Example 1 is unsuitable if the family of prior densities is required toallow for a possible peak at high values of θ. The point is carried also inExample 3. The first family of conjugate priors obtained in Paragraph 5Crequired the variance component v to be constant, not depending on b. Thesecond compels the variance matrices of b and the erratic terms to dependon a common variance component — a rather artificial restriction. Even thenormality assumption may be unjustifiable in some situations.

Oversimplifying the reality is not a passable way to get round a realproblem, but there is another avenue. If the motive of restricting the modelessentially is to attain tractable decisions, it would be more expedient andlogical to place restrictions directly on the class of admitted decisions andseek the Bayes solution in the restricted class.

6B. Linear estimators. The convenience of linear methods leads to es-timators that are non-homogeneous linear functions of some vectorvaluedstatistic tn×1(x). To save notation, let t itself play the role of the observa-tions. Thus, assume that the observations are a squared integrable vectorxn×1, and consider the class M of linear estimators of the form

m = g(c) +G(c)x , (6.1)

where g(c)s×1 and G(c)s×n are functions of the design c. The class M is aclosed linear space, and so the optimal solution, to be called the linear Bayesor LB solution, is obtained by projecting m onto M. Denote the projectionby

m = γ(c) + Γ(c)x . (6.2)

It is determined by the sufficient normal equations (4.12), which now become

29

E(m− γ(c)− Γ(c)x)(g′(c) + x′G′(c))|c = 0s×n, ∀g(c) ∈ Rs, G(c) ∈ Rs×n,(6.3)

or equivalently,

Em− γ(c)− Γ(c)E(x|c) = 0s×1 , (6.4)

E(mx′|c)− γ(c)E(x′|c)− Γ(c)E(xx′|c) = 0s×n . (6.5)

Postmultiplying in (6.4) by E(x′|c) and subtracting the result from (6.5),one obtains the equivalent set of equations,

Cov(m,x′|c) = Γ(c)V ar(x|c) , (6.6)

γ(c) = Em− Γ(c)E(x|c) . (6.7)

Inserting the solution

Γ(c) = Cov(m,x′|c)(V ar(x|c))−1 (6.8)

and γ(c) from (6.7) into (6.2), one arrives at the LB-estimator

m = Em+ Cov(m,x′|c)(V ar(x|c))−1(x− E(x|c)) . (6.9)

The LB risk matrix, ∆(c) = ∆(m|c), is defined by (4.10) with m = m.An easy calculation yields

∆(c) = V arm− Cov(m,x′|c)(V ar(x|c))−1Cov(x,m′|c) . (6.10)

The LB solution depends on the joint distribution of m and x onlythrough their first and second order unconditional moments. The LB es-timator is of the form m = m + Γ(c)(x − E(x|c)), the sum of the priorBayes estimate (4.14) based on no observations and an adjustment term

30

that depends on the deviation of the observations from their mean. Themagnitude of the adjustment depends on the variation and covariation ofthe estimand and the observations: the stronger the covariation, the greaterthe adjustment.

The risk matrix in (6.10) is of the form

∆(c) = ∆ − Cov(m,x′|c)(V ar(x|c))−1Cov(x,m′|c) , (6.11)

which should be compared with (4.18). In the special case when both mand x are scalars, (6.10) reduces to V arm1 − Cor2(m,x|c), where Cordenotes the coefficient of correlation.

It is noteworthy that the weight matrix A does not appear in the Bayesestimators and risk matrices in (4.16), (4.17) and (6.9), (6.10). The weight-ing affects only the risk given by (4.9). Obviously, the estimator and therisk matrix are the basic entities of any Bayes solution obtained by solvingthe sufficient normal equations (4.12). In particular, it follows that for themere purpose of constructing m or m, one could as well put A = I, by whichthe risk in (4.9) reduces to

∑sk=1E(mk−mk)2|c. Consequently, the Bayes

estimator can be constructed componentwise, using ordinary quadratic lossfor the scalarvalued estimands.

Example 3 (continued). Now drop all assumptions about the shapes of thedistributions and assume only that xn×1 is square integrable with conditionalfirst and second order moments of the form

E(x|θ, c) = Y b , (6.12)

V ar(x|θ, c) = vP−1 , (6.13)

with n, Y, P known functions of the design c, and b and v some functions ofthe unobservable θ. (Then F is the non-parametric family of all distribu-tions with the first two moments given by (6.12) and (6.13), θ is of infinitedimension, and G is the class of all priors that make x square integrable.)

Introduce

β = Eb , Λ = V arb , ϕ = Ev (6.14)

31

and Φ as in (5.28). The moments required in the LB estimator of b are β,Λ, Cov(b, x′|c) = ΛY ′, E(x|c) = Y β, and V ar(x|c) = Y ΛY ′ + Φ. Insertingthese elements into (6.9) and (6.10), one obtains the LB solution

b = ΛY ′(Y ΛY ′ + Φ)−1x+ (I − ΛY ′(Y ΛY ′ + Φ)−1Y )β , (6.15)∆(c) = Λ− ΛY ′(Y ΛY ′ + Φ)−1Y Λ . (6.16)

The LB solution depends on the distributions only through the momentsβ,Λ,Φ and must be the same for all points in the model space with thesame values of these moments. In particular, (6.15) and (6.16) are validfor the normal distributions. On the other hand, as the unrestricted Bayessolution in (5.37) and (5.40) obtained under normal assumptions is linear,it must also be LB. And since the family of normal distributions generatesall possible values of the moments β,Λ,Φ, the solution (6.15)–(6.16) must,in fact, be identical to the one in (5.37) and (5.40). This example points toa general technique of constructing LB solutions.

Actually, the present derivation of the LB solution could have been dispensed with since (5.37) and (5.40) are already there. The direct LB construction is, nevertheless, of value since it delivers alternative formulas. Which formulas to use depends on n and q. When n < q, (6.15) and (6.16) are the more convenient since they involve matrix inversions of order n×n. When n > q, it is the other way around since (5.37) and (5.40) involve matrix inversions of order q×q (assuming that Φ is easily inverted, e.g. diagonal). As a bonus, the equivalence of (5.40) and (6.16) delivers, without visible algebraic manipulations, a well-known identity much used in multivariate analysis. Note, finally, that the list (5.39) of equivalent expressions for Z can be extended by

Z = ΛY′(YΛY′ + Φ)^{-1} Y .   (6.17)
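By way of illustration, the following numerical sketch (Python with NumPy; all dimensions and parameter values are invented) computes the LB solution (6.15)–(6.16), which requires an n×n inversion, and checks that the credibility matrix Z of (6.17) agrees with the q×q form (Λ^{-1} + Y′Φ^{-1}Y)^{-1} Y′Φ^{-1}Y, which is presumably one of the expressions in the list (5.39); the equality is the standard matrix identity alluded to above.

    import numpy as np

    rng = np.random.default_rng(0)
    n, q = 7, 3                          # n observations, q regression coefficients (illustrative sizes)

    Y = rng.normal(size=(n, q))          # design matrix Y (a known function of the design c)
    beta = rng.normal(size=q)            # prior mean beta = Eb
    L_ = rng.normal(size=(q, q))
    Lam = L_ @ L_.T + np.eye(q)          # prior covariance Lambda = Var b (positive definite)
    Phi = np.diag(rng.uniform(0.5, 2.0, size=n))   # Phi, here diagonal
    x = rng.normal(size=n)               # a purely illustrative observation vector

    # LB solution (6.15)-(6.16): inversion of order n x n
    K = Lam @ Y.T @ np.linalg.inv(Y @ Lam @ Y.T + Phi)
    b_lb = K @ x + (np.eye(q) - K @ Y) @ beta
    Delta = Lam - K @ Y @ Lam

    # credibility matrix Z of (6.17) and the equivalent q x q form
    Z_n = K @ Y
    Z_q = np.linalg.solve(np.linalg.inv(Lam) + Y.T @ np.linalg.inv(Phi) @ Y,
                          Y.T @ np.linalg.inv(Phi) @ Y)
    print(np.allclose(Z_n, Z_q))         # True: the two expressions for Z agree
    print(b_lb, Delta)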

6C. A minimax property of the linear Bayes estimator. A natural question is, how much poorer is the LB estimator than the unrestricted Bayes estimator? Since m̃ may be regarded as a linear approximation to m̂, one should expect them to be close. In a number of important models m̂ is linear, so that m̃ = m̂. This was the case in Examples 1–3 when the natural conjugate family was taken as the family of priors. The books of Raiffa & Schlaifer (1961), De Groot (1970), and Box & Tiao (1973) offer many other examples of Bayes estimators that are linear. Jewell (1974a,b) found conjugate priors for the multiparameter exponential family of distributions and established that the Bayes estimator of the exponential parameter with respect to any such conjugate prior is a linear function of the canonical sufficient statistic. He published this discovery as a result on exact credibility. Diaconis & Ylvisaker (1979) gave a rigorous proof of Jewell's result and proved a partial converse, which says that the Jewell priors are characterized by the linear form of the Bayes estimator. Also Ferguson's (1972) famous Dirichlet priors for the nonparametric kernels lead to linear Bayes estimators.

Except for notable papers by Hartigan (1969) and Goldstein (1975a,b), early results on LB estimation in the statistics literature were reached only incidentally as byproducts in the manufacturing of models that yield simple Bayes procedures. A number of examples can be found by ascertaining whether the L can be placed before the B. Much earlier, actuaries came to the LB approach by another route. Urged by the necessity of producing simple rating formulas that could be operated on a large scale, they started with the L (Whitney, 1918, Bailey, 1945, 1950) and only at a much later stage settled the question of optimality by adding the B (Bühlmann, 1967, 1969).

Consider now a fixed family F of kernel densities and let G be some family of priors, not necessarily conjugate to F. To visualize the dependence on the prior, equip Bayes estimators and Bayes risks with a subscript g. Introduce the class of Bayes estimators M_G = {m̂_g ; g ∈ G} and the class of LB estimators M̃_G = {m̃_g ; g ∈ G}.

Exact credibility deals with the problem of determining a G with the property that m̃_g = m̂_g for all g ∈ G or, equivalently,

M̃_G = M_G .   (6.18)

This problem has been solved for certain standard kernel families F with nice properties, e.g. the parametric families in Examples 1–3 or the nonparametric family for samples of iid observations (Ferguson, 1972). When dealing with more complex kernel families, e.g. nonparametric families for observations with varying design, one must be content to require less from the LB solutions and replace (6.18) by the weaker

M̃_G ⊂ M_G ,   (6.19)

which says that every LB estimator is exact Bayes for some g ∈ G.

Let α be the mapping that takes g ∈ G to the (vector of) moments occurring in the LB estimator in (6.9). Partition G into the contours

G_a = {g ; g ∈ G, α(g) = a} ,   (6.20)

a ∈ α(G). Since m̃_g and ρ̃_g depend only on the value of α(g), write simply m̃_a and ρ̃_a (allowing a slight abuse of notation and deleting the argument c).

Lemma 6.1 Under the condition (6.19) each m̃_a ∈ M̃_G is minimax with respect to the restricted family G_a, that is,

sup_{g∈G_a} ρ_g(m̃_a) = inf_{m∈M} sup_{g∈G_a} ρ_g(m) .   (6.21)

Proof: Fix a ∈ α(G). By assumption (6.19) there exists a g′ ∈ G such that

m̃_a = m̂_{g′} .   (6.22)

The following relations are valid for each m ∈ M:

sup_{g∈G_a} ρ_g(m̃_a) = ρ̃_a = ρ_{g′} ≤ ρ_{g′}(m) ≤ sup_{g∈G_a} ρ_g(m) .   (6.23)

The first equality in (6.23) follows since ρ_g(m̃_a) is constant and equal to ρ̃_a on G_a; m̃_a is an "equalizer rule" (Berger, 1985). The second equality is due to (6.22). The succeeding inequalities are trivial. Comparison of the first and final expression in (6.23) yields (6.21). □

The lemma clarifies the structure of the argument given by Bunke & Gladitz (1974) for the restricted minimax property of the LB estimator in the regression case. They used the fact that b̃ in (6.15) is Bayes in the normal case.

The restricted minimax property within G_a might seem to be of limited value since g is unknown and α(g) may be any point in α(G). However, as will be substantiated in Section 9, it becomes significant when α(g) can be consistently estimated from data.


7 Empirical Bayes decision functions

7A. Constructing the decision functions. The Bayes decision function (3.7) against a specific g cannot be expected to perform well for all priors in the entire set G of possibilities. It is merely a first auxiliary step in the two-stage procedure described in Paragraph 3B.

The second step consists in estimating the Bayes decision function from the available data (x, c)_I to arrive at a function m*_I, say, of the form (3.1). This function is typically constructed by insertion of an estimator g* of g in (3.7) or, if a restricted Bayes solution like the LB estimator in (6.2) is arranged at the first stage, by inserting estimators of those parameters that are involved in the restricted Bayes decision function. Hopefully, the resulting decision function performs almost as well as the Bayes rule at each point g ∈ G. That question can only be settled by an eventual evaluation of m*_I by its risk.

As a provisional minimum requirement, m*_I should converge to the Bayes decision function in a suitable sense. Following Robbins (1955), the decision rule is said to be empirical Bayes (against G), abbreviated EB, if

m*_I →p m̂_g ,  ∀ g ∈ G .   (7.1)

Here and elsewhere → signifies convergence as I → ∞, and →p denotes convergence in probability (with respect to some appropriate metric in M). Other modes of convergence needed in the following are →d, convergence in distribution, →wp1, convergence almost surely or with probability 1, and →2nd, convergence in mean square (i.e. with respect to the norm ‖ ‖ in the space of square integrable random vectors, see (1.7)).

The feasibility of the EB or any restricted empirical Bayes procedure depends on the specification of the basic model entities F and G. The following Paragraphs 7B–C treat briefly the major cases, ordered by decreasing specificity of the families of distributions.

7B. The parametric case. Assume that both F and G are parametric families, that is, T is a finite-dimensional Euclidean set, and

G = {g_α ; α ∈ A}   (7.2)

with A ⊂ R^k for some k. Then the Bayes solution in (3.7) is parametrized by α ∈ A and may be denoted by

m̂_α .   (7.3)

In this case a natural approach is to estimate α by some standard parametric method and replace α in (7.3) by the resulting estimator α* to obtain the empirical decision rule

m*_I = m̂_{α*} .   (7.4)

Clearly, if α* →p α for all α ∈ A and the Bayes rule in (7.3) is a continuous function of α, then (7.4) is EB.

In many situations the ML construction applies. The unconditional likelihood function for the observation (x, c) is given by (2.5) with g = g_α. For simplicity, replace the index g_α by α. The ML estimator α* is obtained by maximizing the likelihood of the entire sample,

L = ∏_{i=1}^{I} f_α(x_i|c_i) h_i(c_i) ,   (7.5)

or its logarithm

log L = ∑_{i=1}^{I} log f_α(x_i|c_i) + ∑_{i=1}^{I} log h_i(c_i) .   (7.6)

In most situations the ML estimator is obtained as the unique solution α* of the necessary condition of an extremum,

(∂/∂a) log L |_{a=α*} = ∑_{i=1}^{I} (∂/∂a) log f_a(x_i|c_i) |_{a=α*} = 0_{k×1} .   (7.7)

It is seen that the maximization is performed by maximizing the first term on the right of (7.6), which is the log likelihood of the x_i in the stage (ii) conditional model for fixed designs. Thus, the appearance of the ML estimator depends only on the realized values of the designs, not on the mechanisms that have produced them. One should not be misled by this fact to believe that the stage (iii) model is redundant and that one could as well work in the fixed-designs model at stage (ii). Reasoning ad hoc within the framework of stage (ii), it is true under mild regularity conditions on the sequence (c_1, c_2, . . .) that α* for large I is approximately normally distributed with mean α and variance matrix

{ −E [ ∑_{i=1}^{I} (∂²/∂a ∂a′) log f_a(x_i|c_i) |_{a=α} ] }^{-1} ,   (7.8)

the expectation being with respect to the x_i for given designs c_i. The variance matrix in (7.8) depends on the realized sequence (c_1, c_2, . . .). In the extended model with random designs c_i, their densities h_i govern the outcome of the matrix in (7.8) and are thus decisive for the informativeness of the sample.

Now, adopt again assumption (iii) of the general model, by which all h_i are equal to a common density h. Then the (x_i, c_i) become iid and, under standard regularity conditions (see e.g. Serfling, 1980),

α* →wp1 α ,   (7.9)

and

√I (α* − α) →d N(0, Ξ_α) ,   (7.10)

with

Ξ_α = { −E [ (∂²/∂a ∂a′) log f_a(x_i|c_i) |_{a=α} ] }^{-1} ,   (7.11)

the expectation now being with respect to the (x_i, c_i). In particular, (7.9) implies that (7.4) is an EB procedure.

Example 1 (continued). In the Poisson/gamma model given by (2.8) and (5.3), the function f_α(x_i|c_i) is the negative binomial pdf,

f_{γ,δ}(x_i|p_i) = (Γ(x_i + γ) / (x_i! Γ(γ))) (δ/(p_i + δ))^γ (p_i/(p_i + δ))^{x_i} ,   (7.12)

where p_i = ∑_j p_{ij}. The entries in the derivative (∂/∂α) log L are

(∂/∂γ) log L = ∑_{i=1}^{I} ψ(x_i + γ) − I ψ(γ) + ∑_{i=1}^{I} log( δ/(p_i + δ) ) ,

(∂/∂δ) log L = I γ/δ − ∑_{i=1}^{I} (x_i + γ)/(p_i + δ) ,   (7.13)

where ψ is the derivative of log Γ. The ML equation (7.7) is messy, particularly in the unbalanced case, and can only be solved by numerical methods.
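As an indication of what the numerical solution amounts to, here is a minimal sketch (Python with NumPy/SciPy; data simulated from invented parameter values) that maximizes the unconditional log likelihood built from (7.12) over (γ, δ) by a derivative-free method. It is not the author's algorithm, merely one way of carrying out the numerical maximization referred to above.

    import numpy as np
    from scipy.special import gammaln
    from scipy.optimize import minimize

    rng = np.random.default_rng(1)
    I = 2000
    p = rng.uniform(0.5, 10.0, size=I)        # exposures p_i, varying over units (unbalanced design)
    gamma_true, delta_true = 2.0, 4.0         # invented true values
    theta = rng.gamma(gamma_true, 1.0 / delta_true, size=I)   # gamma(gamma, delta) prior
    x = rng.poisson(p * theta)                # Poisson(p_i * theta_i) kernels

    def neg_loglik(par):
        g, d = np.exp(par)                    # log-parametrization keeps gamma, delta > 0
        ll = (gammaln(x + g) - gammaln(g) - gammaln(x + 1.0)
              + g * np.log(d / (p + d)) + x * np.log(p / (p + d)))   # log of (7.12), summed below
        return -ll.sum()

    res = minimize(neg_loglik, x0=np.log([1.0, 1.0]), method="Nelder-Mead")
    gamma_hat, delta_hat = np.exp(res.x)
    print(gamma_hat, delta_hat)               # close to (2.0, 4.0) for large I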

A simpler approach is to arrange an MM (moment method) based on

E(x_i^{(k)} | p_i) = p_i^k ν_k ,   (7.14)

where

ν_k = E θ_i^k ,   (7.15)

k = 1, 2, . . .. In the present two-parameter model one needs

ν_1 = γ/δ ,  ν_2 = γ/δ² + γ²/δ² .   (7.16)

An MM estimator is obtained by identifying weighted sample moments with their expected values, and solving with respect to the parameters. The estimators obtained in this manner are of the form

ν_1* = ∑_{i=1}^{I} w_1(p_i) x_i / ∑_{i=1}^{I} w_1(p_i) p_i ,

ν_2* = ∑_{i=1}^{I} w_2(p_i) x_i^{(2)} / ∑_{i=1}^{I} w_2(p_i) p_i² .   (7.17)

The parametrization by (ν_1, ν_2) is convenient since, by (7.14), the expected values of the factorial moments are linear functions of the ν_k so that linear methods can be employed. This device, which is applicable in a number of situations, was put forth by Norberg (1982), and the present example was examined thoroughly there.

To each choice of weights w_1 and w_2 there corresponds an MM estimator. How to choose the weights is discussed in Section 8. □
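A minimal sketch of the MM recipe (7.14)–(7.17) on simulated data follows, with the weights w_1 and w_2 simply set to 1 for the illustration (the choice of weights is the topic of Section 8); the back-transformation to (γ, δ) uses (7.16). All parameter values are invented.

    import numpy as np

    rng = np.random.default_rng(2)
    I = 5000
    p = rng.uniform(0.5, 10.0, size=I)                    # exposures (unbalanced design)
    gamma_true, delta_true = 2.0, 4.0
    theta = rng.gamma(gamma_true, 1.0 / delta_true, size=I)
    x = rng.poisson(p * theta).astype(float)

    w1 = np.ones(I)                                       # one admissible weight choice, for illustration
    w2 = np.ones(I)

    nu1 = np.sum(w1 * x) / np.sum(w1 * p)                 # uses E(x_i|p_i)          = p_i   * nu_1, cf. (7.14)
    nu2 = np.sum(w2 * x * (x - 1)) / np.sum(w2 * p**2)    # uses E(x_i(x_i - 1)|p_i) = p_i^2 * nu_2

    # back-transform via (7.16): nu_1 = gamma/delta, nu_2 - nu_1^2 = gamma/delta^2
    delta_hat = nu1 / (nu2 - nu1**2)
    gamma_hat = nu1 * delta_hat
    print(gamma_hat, delta_hat)                           # close to (2.0, 4.0) for large I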

The maximum likelihood equations (7.7) can in principle be solved, but in the unbalanced case they are often unwieldy. Already in this very simple Poisson/gamma model one must resort to numerical methods. In more complex models, like the regression models encountered in Example 3, the numerical work is rather demanding and may be judged as excessive, given that parametric models often are chosen, at the sacrifice of realism, to gain simplicity.

7C. The semi- and nonparametric cases. Assume that F is parametric and G is nonparametric. An example is more illuminating than lengthy discussion.

Example 1 (continued). The Poisson kernels are well motivated and are maintained. As for the priors, more flexibility is now allowed for, and the nonparametric family is given the role of G.

It is easily verified (Robbins, 1955) that the posterior mean of θ_i is of the form

θ̂_{g,i} = ((x_i + 1)/p_i) · f_g(x_i + 1|p_i) / f_g(x_i|p_i) .   (7.18)

In the balanced case with p_1 = p_2 = . . . = p, f_g(x|p) is consistently estimated by its empirical counterpart, hence

θ_i* = ((x_i + 1)/p) · #{i′ ; x_{i′} = x_i + 1} / #{i′ ; x_{i′} = x_i} ,   (7.19)

is an EB estimator against all priors g.

In the unbalanced case this construction does not work, at least not if only the sufficient statistics (x_i, p_i) are available. At any rate it cannot be applied to the unit with the largest exposure, since no replicates are observed.

The estimator in (7.19) also has other shortcomings. It has a large variance, and will typically have a ragged appearance. In particular, it is unacceptable as a formula for experience rating in insurance since it is not an increasing function of x_i (the insured might gain a premium reduction by reporting another claim). It is easy to check that the Bayes estimator in (7.18) is an increasing function of x_i.
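The following small simulation sketch (invented parameter values) computes the estimator (7.19) in the balanced Poisson case and contrasts it with the exact Bayes estimator (7.18), which under the gamma prior used here reduces to (x_i + γ)/(p + δ); the nonparametric estimator typically shows the large variance and ragged, non-monotone behaviour just described.

    import numpy as np

    rng = np.random.default_rng(3)
    I, p = 10000, 2.0                        # balanced case: common exposure p
    theta = rng.gamma(2.0, 0.25, size=I)     # any prior will do; (7.19) does not use its form
    x = rng.poisson(p * theta)

    counts = np.bincount(x, minlength=x.max() + 2)   # empirical frequencies of the observed values

    def robbins(xi):
        # (7.19): theta*_i = ((x_i + 1)/p) * #{i': x_i' = x_i + 1} / #{i': x_i' = x_i}
        return (xi + 1) / p * counts[xi + 1] / counts[xi]

    theta_star = np.array([robbins(xi) for xi in x])
    theta_bayes = (x + 2.0) / (p + 4.0)      # (7.18) under the (here known) gamma(2, 4) prior
    print(np.mean((theta_star - theta)**2), np.mean((theta_bayes - theta)**2))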

Martz & Krutchkoff (1969) arranged an EB estimator of the regression coefficient vector in the semiparametric variation of the normal regression model in Example 3, with v = ϕ fixed. The construction resembles the one in (7.19) as the marginal density of the x_i is replaced by a density estimator. Again the technique requires balanced design, and the resulting estimator is highly unstable.

Johns (1957) constructed EB estimators in the case where both F and G are nonparametric. Balanced design is, of course, an essential assumption, and the resulting estimators share the unpleasant properties of the estimator in (7.19).

7D. Restricted empirical Bayes procedures. By way of conclusion, the unrestricted Bayes solution seems not to provide a basis for construction of useful empirical procedures in complex situations with nonparametric G and unbalanced design. This circumstance is, perhaps, the most important reason for studying restricted Bayes procedures. They typically depend on the underlying distributions only through a finite set of parameter functions, e.g. certain first and second order moments, which can be estimated even if the model is nonparametric and the design is unbalanced. In this respect the LB approach is representative of a more general methodology that consists in restricting the space of decision functions sufficiently to obtain restricted Bayes solutions which can be reliably estimated even if the model itself is of high complexity.

To be specific, let α be a vector of basic parameters determining the first and second order moments appearing in the LB estimator in (6.2), e.g. the parameters in (6.14) for the linear regression model. Visualize the dependence of the coefficients in (6.7) and (6.8) on α by writing

γ_α(c) ,  Γ_α(c) .   (7.20)

Suppose α* is an estimator of α. An empirical linear estimator,

m*_I = γ*(c) + Γ*(c) x ,   (7.21)

is obtained upon replacing the coefficients in (6.2) by

γ*(c) = γ_{α*}(c) ,  Γ*(c) = Γ_{α*}(c) .   (7.22)

Definitions analogous to (7.1) are made for estimators of restricted Bayes solutions. For instance, m*_I in (7.21) is ELB (empirical linear Bayes) if it estimates the LB estimator m̃_g consistently for all g in G.

Before turning to a detailed description of the ELB procedure, it may be remarked that efforts have been made to smooth nonparametric EB estimators of the type described in the previous paragraph, see e.g. Maritz (1970). One might say that the aim of the empirical variation of the LB approach is the same, only that the smoothing is performed already at the outset by the restriction to linear estimators.
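A compact end-to-end sketch of the two-stage procedure in the Poisson setting of Example 1 (simulated data, unit MM weights, invented parameter values): the estimated first and second order moments are inserted into the LB formula, giving the empirical linear estimator (7.21) in the familiar credibility form. The LB coefficients used below follow from the Poisson moment structure E(x_i|θ_i, p_i) = Var(x_i|θ_i, p_i) = p_iθ_i, which gives Cov(θ_i, x_i|p_i) = p_iλ and Var(x_i|p_i) = p_iν_1 + p_i²λ with λ = Var θ_i.

    import numpy as np

    rng = np.random.default_rng(4)
    I = 5000
    p = rng.uniform(0.5, 10.0, size=I)
    theta = rng.gamma(2.0, 0.25, size=I)
    x = rng.poisson(p * theta).astype(float)

    # first stage: estimate the moments by the MM of (7.17) with unit weights
    nu1 = x.sum() / p.sum()
    lam = np.sum(x * (x - 1)) / np.sum(p**2) - nu1**2      # lambda = Var(theta)

    # second stage: insert the estimates into the LB formula, cf. (7.21)-(7.22)
    z = p * lam / (nu1 + p * lam)                          # credibility weights, one per unit
    theta_elb = z * (x / p) + (1 - z) * nu1
    print(np.mean((theta_elb - theta)**2))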

8 Parameter estimation in empirical linear Bayes problems

8A. Description of a general procedure and a review of some previous results. For the sake of concreteness, consider the nonparametric regression model treated in Example 3, Paragraph 6B. An ELB estimator of b is obtained from (6.15) upon replacing the parameters in (6.14) by estimators based on I observed replicates of the situation.

Introduce

Ψ = Λ + ββ′ , (8.1)

and, instead of estimating the parameters in (6.14), consider the equivalent problem of estimating

(β, Ψ, ϕ) .   (8.2)

The point is that the empirical first and second order moments of the observations, which form the natural basis for estimation of the first and second order moments, have expectations that are linear in the components of (8.2). More specifically, for each observational unit i = 1, . . . , I,

E(x_i|c_i) = Y_i β ,   (8.3)

E(x_i x_i′|c_i) = Y_i Ψ Y_i′ + ϕ P_i^{-1} ,   (8.4)

or, in the full rank case,

E(b_i|c_i) = β ,   (8.5)

E(b_i b_i′|c_i) = Ψ + ϕ (Y_i′ P_i Y_i)^{-1} ,   (8.6)

E(v_i|c_i) = ϕ .   (8.7)

This way the situation is made accessible to linear estimation methods. The relations (8.3)–(8.4) or (8.5)–(8.7), whichever are chosen, can be written in compact form as

E(s_i|c_i) = A_i α ,   (8.8)

where A_i is a matrix of coefficients which can be compiled from the expressions on the right of (8.3)–(8.4) or (8.5)–(8.7), and α is a vector-valued parameter function made up of the different entries in β, Ψ and ϕ. Put

Σ_i = Var(s_i|c_i) .   (8.9)

The coefficient matrix A_i is a function of the design only;

A_i = A(c_i) .   (8.10)

The variance Σ_i is a function of the design and certain parameters τ occurring in moments up to order four;

Σ_i = Σ(c_i, τ) .   (8.11)

Concatenate the statistics from the I units and introduce

s = (s_1′, . . . , s_I′)′ ,  A = (A_1′, . . . , A_I′)′ ,  Σ = diag(Σ_1, . . . , Σ_I) ,  c = (c_1′, . . . , c_I′)′ .

From (8.8) and (8.9) and the independence of the units one gathers

E(s|c) = A α ,   (8.12)

Var(s|c) = Σ .   (8.13)

If the variances Σ_i were known, then the best unbiased linear (in s) estimator of α would be the so-called GLS (generalized least squares) estimator,

α*_{Σ^{-1}} = (A′ Σ^{-1} A)^{-1} A′ Σ^{-1} s = (∑_i A_i′ Σ_i^{-1} A_i)^{-1} ∑_i A_i′ Σ_i^{-1} s_i .   (8.14)

Since the Σ_i are unknown, the GLS estimator is only an auxiliary construction. It motivates a class of weighted least squares estimators of the form

α*_W = (∑_i A_i′ W_i A_i)^{-1} ∑_i A_i′ W_i s_i ,   (8.15)

with W_i = W(c_i) some pds matrices.

The procedure described above was adopted in the LB context by Norberg (1982). In a follow-up of that paper Neuhaus (1984) introduced the crucial assumption (iii) in the general model and could prove a number of results on the sampling properties of α*_W. Rewrite (8.15) as

α*_W = α + ( (1/I) ∑_i V_i )^{-1} (1/I) ∑_i z_i ,   (8.16)

where

V_i = A_i′ W_i A_i   (8.17)

and

z_i = A_i′ W_i (s_i − A_i α) .   (8.18)

By (8.8) and (8.9),

E(z_i|c_i) = 0 ,   (8.19)

and

Var(z_i|c_i) = A_i′ W_i Σ_i W_i A_i .   (8.20)

It follows that

E z_i = E E(z_i|c_i) = 0   (8.21)

and

Var z_i = Var E(z_i|c_i) + E Var(z_i|c_i) = E(A_i′ W_i Σ_i W_i A_i) .   (8.22)

From (8.16) and (8.19)–(8.22) one obtains

E α*_W = E(α*_W|c) = α ,   (8.23)

Var(α*_W|c) = (∑_i V_i)^{-1} ∑_i A_i′ W_i Σ_i W_i A_i (∑_i V_i)^{-1} ,   (8.24)

and

Var α*_W = E{ (∑_i V_i)^{-1} ∑_i A_i′ W_i Σ_i W_i A_i (∑_i V_i)^{-1} } .   (8.25)

Since the units are assumed to be iid, standard asymptotic theory can be invoked. Apply the strong law of large numbers together with (8.21) to the second term on the right of (8.16) to obtain

α*_W →wp1 α .   (8.26)

The asymptotic distribution of α*_W is obtained by casting (8.16) as

√I (α*_W − α) = ( (1/I) ∑_i V_i )^{-1} (1/√I) ∑_i z_i .   (8.27)

By the strong law of large numbers,

(1/I) ∑_i V_i →wp1 E V_i ,   (8.28)

and by the central limit theorem and (8.21) and (8.22),

(1/√I) ∑_i z_i →d N(0, E(A_i′ W_i Σ_i W_i A_i)) .   (8.29)

From (8.27)–(8.29) it follows that

√I (α*_W − α) →d N(0, Ξ_W) ,   (8.30)

with

Ξ_W = (E V_i)^{-1} E(A_i′ W_i Σ_i W_i A_i) (E V_i)^{-1} .   (8.31)

Neuhaus proved (8.23)–(8.25) and (8.30). He proved (8.26) with →p instead of →wp1, and Hesselager (1988a) pointed out the strong convergence. All results cited here are valid provided that the displayed moments exist, which is tacitly assumed.
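To make the general recipe concrete, the following sketch computes the weighted least squares estimator (8.15) (Python with NumPy, simulated data), not for the regression model of 8A but for the simpler Poisson structure of Example 1, with s_i = (x_i, x_i(x_i − 1))′ and A_i = diag(p_i, p_i²), so that (8.8) holds with α = (ν_1, ν_2)′. The two weight choices shown are arbitrary illustrations, and all parameter values are invented.

    import numpy as np

    rng = np.random.default_rng(5)
    I = 4000
    p = rng.uniform(0.5, 10.0, size=I)
    theta = rng.gamma(2.0, 0.25, size=I)
    x = rng.poisson(p * theta).astype(float)

    # per-unit statistics s_i and coefficient matrices A_i = A(c_i) with E(s_i|c_i) = A_i alpha, cf. (8.8)
    def A_of(pi):
        return np.array([[pi, 0.0], [0.0, pi**2]])

    s = np.stack([x, x * (x - 1)], axis=1)

    def wls(weights):
        # alpha*_W = (sum_i A_i' W_i A_i)^{-1} sum_i A_i' W_i s_i, cf. (8.15)
        V = np.zeros((2, 2))
        u = np.zeros(2)
        for pi, si, Wi in zip(p, s, weights):
            Ai = A_of(pi)
            V += Ai.T @ Wi @ Ai
            u += Ai.T @ Wi @ si
        return np.linalg.solve(V, u)

    alpha_unit = wls([np.eye(2)] * I)                            # W_i = I
    alpha_scaled = wls([np.diag([1/pi, 1/pi**2]) for pi in p])   # another admissible weight choice
    print(alpha_unit, alpha_scaled)      # both consistent for (nu_1, nu_2) = (0.5, 0.375) here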

8B. The EGLS procedure. The mentioned optimum property of the GLS pseudo-estimator in (8.14) means precisely that

Var(α*_{Σ^{-1}}|c) ≤ Var(α*_W|c)   (8.32)

or

(∑_i A_i′ Σ_i^{-1} A_i)^{-1} ≤ (∑_i V_i)^{-1} ∑_i A_i′ W_i Σ_i W_i A_i (∑_i V_i)^{-1}   (8.33)

for all weights W. This property carries over to the asymptotic variances.

Lemma 8.1 The GLS pseudo-estimator has a smaller covariance matrix than any estimator of the form (8.15), and the same is true for the asymptotic variances in (8.31), that is,

Ξ_{Σ^{-1}} ≤ Ξ_W .   (8.34)

Proof: The inequality (8.33) is equivalent to

( (1/I) ∑_i A_i′ Σ_i^{-1} A_i )^{-1} ≤ ( (1/I) ∑_i V_i )^{-1} (1/I) ∑_i A_i′ W_i Σ_i W_i A_i ( (1/I) ∑_i V_i )^{-1} .   (8.35)

Now, let I → ∞ and use the strong law of large numbers to obtain (8.34). □

The merits of the GLS pseudo-estimator motivate the EGLS (estimated generalized least squares) estimator,

α* = α*_{Σ*^{-1}} ,   (8.36)

obtained by replacing the parameter τ in (8.11) by an estimator τ* and using the inverses of the estimated variances

Σ_i* = Σ(c_i, τ*)   (8.37)

as weights.

In the fully nonparametric case τ is of infinite dimension, and so is also the set of parameters involved in the sequence (8.11) of conditional variance matrices, unless the designs are very regular, roughly speaking. In particular, if the design is balanced, then all Σ_i are equal, and a consistent estimator of their common value can in general be found. Note that E Σ(c_i, τ) can be estimated in the iid situation at stage (iii) of the general model, but that does not solve the problem of finding the optimal weights for given designs. (In Paragraph 8C it will be shown, however, that in large samples only the unconditional moments count, in a sense.)

It can be concluded that in the unbalanced case a parametric structure has to be imposed on the covariance function Σ(c, τ) somehow. This could be done directly, or by restricting F to some parametric family of densities. Norberg (1982) examines the parametrization by a finite-dimensional τ for F Poisson, generalized Poisson, binomial, and multinormal, and reports on simulation experiments that shed light on the small sample properties of the EGLS estimator. Hesselager (1988b) has supplied further evidence.

The asymptotic properties of the EGLS estimator are, as one could hope, the same as those of the GLS estimator under suitable conditions on the estimator τ* and the matrices A(c_i) and Σ(c_i, τ). A number of results on consistency, convergence in distribution and also in mean square are summarized in HUMAK (1984).
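A scalar sketch of the EGLS idea (8.36)–(8.37) in the same Poisson illustration: with s_i = x_i and A_i = p_i, the conditional variance Σ_i = p_iν_1 + p_i²λ is a parametric function Σ(c_i, τ) with τ = (ν_1, λ); a preliminary unit-weight fit supplies τ*, which is then plugged into Σ to form the weights. All numerical values are invented.

    import numpy as np

    rng = np.random.default_rng(6)
    I = 4000
    p = rng.uniform(0.5, 10.0, size=I)
    theta = rng.gamma(2.0, 0.25, size=I)
    x = rng.poisson(p * theta).astype(float)

    # step 1: preliminary (unit-weight) estimates of tau = (nu_1, lambda)
    nu1_0 = x.sum() / p.sum()
    lam_0 = max(np.sum(x * (x - 1)) / np.sum(p**2) - nu1_0**2, 1e-8)

    # step 2: EGLS, cf. (8.36)-(8.37): plug tau* into Sigma and use W_i = (Sigma*_i)^{-1}
    Sigma_hat = p * nu1_0 + p**2 * lam_0
    w = 1.0 / Sigma_hat
    nu1_egls = np.sum(w * p * x) / np.sum(w * p**2)
    print(nu1_0, nu1_egls)    # both consistent for nu_1 = 0.5; EGLS has the smaller asymptotic variance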

8C. Almost sure convergence of conditional distributions. Assumption (iii) in the general model ensured that asymptotic distributions could be found also in the unbalanced case. It turns out that these asymptotic properties are valid also conditionally, given the designs, with probability one. The following lemma is the key to results of this kind.

Lemma 8.2 Let (z_i, c_i), i = 1, 2, . . . , be a sequence of iid random pairs, with z_i ∈ R^k. If

Var z_i = Σ   (8.38)

is finite and pds, and

E(z_i|c_i) = 0 ,   (8.39)

then

(1/√I) ∑_{i=1}^{I} z_i →d|c_1,c_2,... N(0, Σ)  wp1 .   (8.40)

Proof: The technique of the proof consists in verifying that the Lindeberg-Feller conditions (see e.g. Serfling, 1980) are satisfied for almost all sequences c_1, c_2, . . ..

Assume first that the z_i are real-valued, and put Var z_i = σ². Introduce

σ_i² = Var(z_i|c_i)   (8.41)

and

b_I² = ∑_{i=1}^{I} σ_i² .   (8.42)

Since the σ_i² are iid with finite mean

E σ_i² = Var z_i − Var E(z_i|c_i) = σ² − 0 = σ² ,   (8.43)

the strong law of large numbers yields

(1/I) b_I² →wp1 σ² .   (8.44)

As an immediate consequence,

σ_I² / b_I² →wp1 0  and  b_I² →wp1 +∞ .   (8.45)

The first part of (8.45) follows by writing

σ_I² / b_I² = (b_I² − b_{I−1}²) / b_I²   (8.46)

            = 1 − ((I−1)/I) ( ((1/(I−1)) b_{I−1}²) / ((1/I) b_I²) )   (8.47)

and using (8.44). The contents of (8.45) are precisely the first Lindeberg-Feller condition.

The second Lindeberg-Feller condition requires that

∑_{i=1}^{I} E( 1{z_i² > ε b_I²} z_i² | c_1, c_2, . . . ) / b_I² →wp1 0 ,  ∀ ε > 0 .   (8.48)

(Here 1_A denotes the indicator function of the event A.) Fix a number q > 0. By the second part of (8.45) and the strong law of large numbers,

lim sup_I ∑_{i=1}^{I} E( 1{z_i² > ε b_I²} z_i² | c_1, c_2, . . . ) / b_I²
    ≤ lim_I ( (1/I) ∑_{i=1}^{I} E( 1{z_i² > q} z_i² | c_i ) ) / ( (1/I) b_I² )    wp1
    = E E( 1{z_1² > q} z_1² | c_1 ) / σ²    wp1
    = E( 1{z_1² > q} z_1² ) / σ² .   (8.49)

By choice of q, the expression in (8.49) can be made arbitrarily small and (8.48) follows, so (8.40) is proved for k = 1.

The result in k dimensions stated in the lemma follows by the Cramér–Wold device (see e.g. Billingsley, 1968) upon application of the result for k = 1 to all linear functions a′z_i, a ∈ R^k. The problem with an uncountable number of exceptional null sets is tackled by passage to the limit through countable a and arguments in the limiting distribution function, just like in the construction of conditional distributions. □

The results (8.26) and (8.30) concern the asymptotic properties of the marginal distribution of the estimator α*_W. Now, since the designs c_i are observable, it might seem more appropriate to judge α*_W by the properties of its conditional distribution for given designs, also asymptotically as I increases. However, (8.23) implies

Var α*_W = E Var(α*_W|c) ,   (8.50)

so that, on the average, nothing can be gained in terms of estimation error by taking the values of the designs into account. Moreover, multiplying by I in (8.24) and using the strong law of large numbers yields

I Var(α*_W|c) →wp1 Ξ_W ,   (8.51)

which says, roughly speaking, that in large samples the marginal distribution contains just as much information as the conditional distribution, given the designs. This is really not surprising: being symmetric in the pairs (x_i, c_i), i = 1, . . . , I, the estimator α*_W depends on the c_i only through their empirical distribution, which converges almost surely to the marginal distribution with pdf h. These informal considerations were stated precisely by Hesselager (1988a) as follows. The result (8.30) is valid with probability 1 also conditionally, given the designs, that is,

√I (α*_W − α) →d|c_1,c_2,... N(0, Ξ_W)  wp1 .   (8.52)

In the proof Hesselager applied Lemma 8.2 to the second factor on the right of (8.16). Condition (8.39) is the same as (8.19), and Σ in (8.38) is given by (8.22).
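A small Monte Carlo sketch of the content of (8.51)–(8.52) in the scalar Poisson illustration used above: one design sequence is drawn and then held fixed, and the conditional variance of √I(α*_W − α) for that fixed sequence is compared with Ξ_W. Here W_i = 1/p_i (so that α*_W = ∑x_i/∑p_i), and the prior and all parameter values are invented.

    import numpy as np

    rng = np.random.default_rng(7)
    I, R = 400, 2000
    p = rng.uniform(0.5, 10.0, size=I)     # ONE realized design sequence c_1,...,c_I, held fixed below
    nu1, lam = 0.5, 0.125                  # true values of nu_1 and lambda = Var(theta)

    vals = np.empty(R)
    for r in range(R):
        theta = rng.gamma(2.0, 0.25, size=I)
        x = rng.poisson(p * theta)
        vals[r] = np.sqrt(I) * (x.sum() / p.sum() - nu1)   # sqrt(I)(alpha*_W - alpha)

    # Xi_W of (8.31) for this scalar case: E(Var(x_i|p_i)) / (E p_i)^2, with Var(x_i|p_i) = p_i*nu1 + p_i^2*lam;
    # the expectations over h are approximated by averages over the realized designs
    Xi_W = np.mean(p * nu1 + p**2 * lam) / np.mean(p)**2
    print(vals.var(), Xi_W)                # close, in line with (8.51)-(8.52)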

The basic lemma yields an analogous result for the maximum likelihood estimator in the parametric case.

Theorem 8.4 The result (7.10) is valid with probability 1 also conditionally, given the designs, that is,

√I (α* − α) →d|c_1,c_2,... N(0, Ξ_α)  wp1 .   (8.53)

Proof (sketched): The result is easily obtained by examining the proof of (7.10) in the iid case, see Serfling (1980). It is shown that

√I (α* − α) − D_n √I a_n →wp1 0 ,   (8.54)

where D_n and a_n satisfy

D_n →wp1 Ξ_α   (8.55)

and

√I a_n = (1/√I) (∂/∂a) log L |_{a=α} = (1/√I) ∑_{i=1}^{I} (∂/∂a) log f_a(x_i|c_i) |_{a=α} →d|c_1,c_2,... N(0, Ξ_α^{-1}) .   (8.56)

Lemma 8.2 applies to z_i = (∂/∂a) log f_a(x_i|c_i) |_{a=α}, and so the convergence in (8.56) holds also conditionally, given c_1, c_2, . . ., almost surely. This, together with (8.54) and (8.55), proves (8.53). □

9 Asymptotic optimality

9A. The classical notion of asymptotic optimality. Having chosen the risk as the basic criterion for assessing the performance of decision functions, it remains to measure the quality of empirical Bayes rules by this yardstick. The risk of a decision rule m_I of the form (3.1) for the current unit with design c is

ρ_g(m_I|c) = E{ l(m_I, θ) | c } .   (9.1)

It depends in general on both g and h, contrary to the risk in (4.2) of decision functions based solely on the current observation. The dependence on h is not made visible in the notation.

Computation of the risk in (9.1) presents difficulties already in simple models, and the problem of finding an optimal m_I is usually insurmountable. Small sample properties of empirical Bayes rules may be investigated by Monte Carlo simulations, see e.g. Maritz (1970). A promising approach is to use the bootstrap, which has been applied to empirical Bayes problems by Laird & Louis (1987) and in a recent work of Hesselager (1988c).

The asymptotic properties of empirical decision rules are usually within the reach of theoretical study, and allow for discussion of the question of optimality. Robbins' (1964) classical definition of asymptotic optimality can be adapted to the unbalanced case in a straightforward manner. An EB decision rule m*_I (or rather the sequence {m*_I}_{I=1,2,...}) is said to be asymptotically optimal (against G) at c, abbreviated a.o.|c, if

ρ_g(m*_I|c) →p ρ_g(c) ,  ∀ g ∈ G .   (9.2)

Restricted asymptotic optimality is defined in the obvious analogous way for empirical approximations to restricted Bayes rules. For instance, m*_I in (7.21) is an a.o.|c ELB rule if its risk converges to the LB risk ρ̃_g(c).

Clearly, the a.o.|c property implies the EB property as defined in (7.1). Ascertaining a.o.|c amounts to proving convergence of the integral in (9.1) under the weak assumption (7.1) of convergence in measure of the function m*_I. The following version of the dominated convergence theorem, here phrased in probabilistic terms, suggests itself as the obvious tool. Let x_n and y_n be sequences of random variables satisfying x_n →p x, y_n →p y, 0 ≤ x_n ≤ y_n for all n, and E y_n → E y < ∞. Then E x < ∞ and E x_n → E x.

In the present context l(m*_I, θ) and l(m̂_g, θ) play the role of x_n and x, respectively. The problem is to find a function of the observables that could play the role of the dominating y_n. If the loss function is bounded, there is of course no problem. But already the standard problem of estimation by squared loss presents great difficulties. Deeley & Zimmer (1976) proved asymptotic optimality for an EB estimator of a location parameter under the strict assumption that the kernel and priors are unimodal and symmetric.

In an attempt to circumvent the problem, Rutherford & Krutchkoff (1969) invented the concept of "ε asymptotic optimality". The idea is to truncate the empirical decision rule to secure integrability and convergence of the risk. A problem is that the truncation rule required to ensure the ε-approximation to the Bayes risk depends on g.

Norberg (1980) proved asymptotic optimality of the ELB estimator in (7.21) under conditions which in the unbalanced case are to read as

γ*(c) →2nd γ(c) ,   (9.3)

and

|Γ(c)| < k(c) with E k(c) < ∞ ,   (9.4)

and verified these conditions for some standard situations, including the regression model in Paragraph 6B. The proof is an immediate consequence of

‖ m − m*_I ‖_A ≤ ‖ m − m̃_g ‖_A + ‖ m̃_g − m*_I ‖_A
             ≤ ρ̃_g(c) + ‖ γ(c) − γ*(c) ‖_A + ‖ (Γ(c) − Γ*(c)) x ‖_A .   (9.5)

The second term in (9.5) converges to 0 by (9.3), and so does the third term by

|(Γ(c) − Γ*(c)) x| ≤ |Γ(c) − Γ*(c)| |x|   (9.6)

(recall (1.3) and (1.7)).

The condition (9.3) is usually easy to verify: typically γ is estimated by an MM estimator which converges in mean square to γ. The condition (9.4) takes care of the other part of the linear formula and ensures convergence of the risk under the weak requirement that Γ* be consistent.

The conditions (9.3) and (9.4) are different in nature and serve to tackle two different components of the problem. Now, condition (9.3) is the untraditional one, and it invites a reconsideration of the whole problem of asymptotic optimality. Clearly, the difficulties are rooted in the fact that the very definition of an EB rule (possibly restricted) requires only convergence in probability to the Bayes rule, while the definition of a.o.|c requires convergence of an integral. Taken alone, convergence in probability implies nothing as to convergence of integrals. Therefore, it is worthwhile looking for alternative definitions that match better.

9B. An alternative definition of asymptotic optimality. Norberg (1980) questions the relevance of the traditional concept of a.o. defined by (9.2). The risk in (9.1) is an average over all possible outcomes of the sequence (x_1, c_1), (x_2, c_2), . . . of collateral data, both the one that actually occurs and all other hypothetical possibilities. Now, taking the point of view that the actual development of the history is all that matters, look at the conditional risk

ρ_g(m*_I | c, (x, c)_I) = E{ l(m*_I, θ) | c, (x, c)_I } ,   (9.7)

where (x, c)_I = {(x_i, c_i)}_{i=1,...,I}. Define asymptotic optimality in the sense a.o.|c,I by

ρ_g(m*_I | c, (x, c)_I) →p ρ_g(c) ,  ∀ g ∈ G .   (9.8)

Clearly, a.o.|c,I is a weaker requirement than a.o.|c (only relaxations of a.o.|c are of interest, since a.o.|c presents so many mathematical difficulties). In fact,

ρ_g(m*_I|c) = E{ ρ_g(m*_I | c, (x, c)_I) | c } ,   (9.9)

and since ρ_g(m*_I | c, (x, c)_I) ≥ ρ_g(c), it follows that (9.2) implies (9.8).

Consider a class M₀ of decision functions possessing an M₀-Bayes rule against each g ∈ G. Assume that the class M₀,G of M₀-Bayes rules is parametric, that is,

M₀,G = {m_{α(g)} ; g ∈ G}   (9.10)

for some parameter function α : G → A, an open finite-dimensional Euclidean set. Denote the Bayes risk against g accordingly by ρ_{α(g)}(c). Let α* be a consistent estimator of α(g),

α* →p α(g) ,  ∀ g ∈ G ,   (9.11)

and construct the empirical decision function

m*_I = m_{α*} .   (9.12)

If m_α is continuous in α, then m*_I is an empirical M₀-Bayes rule:

m*_I →p m_{α(g)} ,  ∀ g ∈ G .   (9.13)

Theorem 9.1 Consider the situation given by (9.10)–(9.13). If ρ_g(m_a|c) is a continuous function of a ∈ A, then the rule m*_I in (9.12) is a.o.|c,I in M₀.

Proof: By the independence of (x, c) and (x, c)_I and the assumed continuity, ρ_g(m_{α*} | c, (x, c)_I) = ρ_g(m_{α*}|c) →p ρ_g(m_{α(g)}|c) = ρ_{α(g)}(c), which proves everything. □

Typical examples are M₀ = M, the class of all decision functions, in the parametric case, and M₀ = M̃, the class of linear estimators. The conditions in the theorem are trivially fulfilled in virtually all interesting specializations of these two major cases.

From a mathematical point of view the theorem is trivial. Its import is on the interpretational and practical side. Those who find the a.o.|c,I variation of a.o. adequate are relieved of the burdensome task of bounding the loss function in most situations.
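To illustrate the a.o.|c,I mechanism of the theorem in the Poisson illustration used earlier: as the number I of collateral units grows, the conditional risk of the ELB rule, i.e. the risk of the linear rule with its coefficients frozen at the values estimated from the collateral data, approaches the LB risk. The sketch below (simulated data, invented parameter values) evaluates that conditional risk in closed form from the true moments of the current unit.

    import numpy as np

    rng = np.random.default_rng(8)
    nu1, lam = 0.5, 0.125            # true first and second order moments (invented)
    p_current = 3.0                  # design c of the current unit

    def risk(gam0, Gam, p):
        # E(gam0 + Gam*x - theta)^2 for a linear rule with FIXED coefficients, under the true moments
        var_x = p * nu1 + p**2 * lam
        return Gam**2 * var_x - 2 * Gam * p * lam + lam + (gam0 + Gam * p * nu1 - nu1)**2

    rho_LB = lam * nu1 / (nu1 + p_current * lam)          # the LB risk for the current unit

    for I in (50, 500, 5000, 50000):
        p = rng.uniform(0.5, 10.0, size=I)                # collateral designs
        theta = rng.gamma(2.0, 0.25, size=I)
        x = rng.poisson(p * theta).astype(float)
        nu1_s = x.sum() / p.sum()                         # moments estimated from the collateral data
        lam_s = max(np.sum(x * (x - 1)) / np.sum(p**2) - nu1_s**2, 1e-8)
        z = p_current * lam_s / (nu1_s + p_current * lam_s)
        gam0, Gam = (1 - z) * nu1_s, z / p_current        # empirical linear rule (7.21) for the current unit
        print(I, risk(gam0, Gam, p_current), rho_LB)      # the conditional risk approaches the LB risk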

9C. Additional remarks. Lemma 6.1 becomes significant in conjunction with asymptotic optimality. Roughly speaking, the set G of possible priors is eventually narrowed to G_{α(g)} by the consistency of α*. Thus, in the limit it is the properties on the contours G_a that count in the comparison of empirical decision rules. Lemma 6.1 implies that an a.o. (in the one or the other sense) ELB estimator is asymptotically minimax.

Finally, pursuing the remark at the end of Paragraph 9A, here are some tentative thoughts on the problem of proving a.o.|c. The recipe in Paragraph 9B was to relax the concept of a.o. so as to comply with the →p involved in the definition of EB. Alternatively, but in the same vein, one could try to fortify the →p requirement in the definition of EB in a way that complies with the concept a.o.|c. The natural starting point is to look at the metric, if any, defined in the space of decision functions by the risk. For instance, estimation by squared loss leads to the metric of the norm ‖ ‖_A, and the mode of convergence that suggests itself is →2nd. If the M₀-Bayes estimator m_α (let it be real-valued) possesses continuous first order partial derivatives, then

m_{α*} − m_α = ∇′m_{α**} (α* − α) ,   (9.14)

where ∇m_α = (∂/∂a) m_a |_{a=α} and α** is a point on the line segment joining α and α*. By the triangle inequality and (9.14),

‖ m − m_{α*} ‖ ≤ ‖ m − m_α ‖ + ‖ m_α − m_{α*} ‖ = ρ_α(c) + ‖ ∇′m_{α**} (α* − α) ‖ .   (9.15)

It follows that m_{α*} is a.o. in M₀ if the second term on the right of (9.15) converges to 0. Unfortunately, the problem of bounding the integrand persists. It would have been resolved if Young's form of Taylor's theorem (Serfling, 1980, p. 45) could be extended to the multivariate case. Then (essentially) α** could have been replaced by α in the latter term in (9.15), and α* →2nd α would imply a.o.|c.

The difficulties met with in this section on asymptotic optimality add another aspect to the deliberations on restricting the model versus restricting the method. Parametric models often have infinite support. It may be seen as a paradox that restricting the model entails an unrestricted extension of the world and, in particular, difficulties with the very natural notion of a.o.

References


Bailey, A.L. (1945). A generalized theory of credibility. Proc. Cas. Actuarial Soc. 32, 13–20.

Bailey, A.L. (1950). Credibility procedures, La Place's generalization of Bayes' rule, and the combination of collateral knowledge with observed data. Proc. Cas. Actuarial Soc. 37, 7–23. Discussion: Proc. Cas. Actuarial Soc. 37, 94–115.

Berger, J.O. (1985). Statistical Decision Theory and Bayesian Analysis. Springer, New York.

Billingsley, P. (1968). Convergence of Probability Measures. Wiley, New York.

Box, G.E. and Tiao, G.C. (1973). Bayesian inference in statistical analysis. Addison-Wesley, New York.

Bunke, H. and Gladitz, J. (1974). Empirical linear Bayes decision rules for a sequence of linear models with different regressor matrices. Mathematische Operationsforschung und Statistik 5, 235–244.

Bühlmann, H. (1967). Experience rating and credibility. ASTIN Bull. 4, 199–207.

Bühlmann, H. (1969). Experience rating and credibility. ASTIN Bull. 5, 157–165.

Deeley, J.J. & Zimmer, W.J. (1976). Asymptotic optimality of the empirical Bayes procedure. Ann. Statist. 4, 576–580.

De Groot, M. (1970). Optimal statistical decisions. McGraw-Hill, New York.

Diaconis, P. and Ylvisaker, D. (1979). Conjugate priors for exponential families. Ann. Statist. 7, 269–281.

Ferguson, T.S. (1972). A Bayesian analysis of some nonparametric problems. Ann. Statist. 1, 209–230.

Goldstein, M. (1975a). Approximate Bayes solution to some nonparametric problems. Ann. Statist. 3, 512–517.

Goldstein, M. (1975b). Bayesian nonparametric estimates. Ann. Statist. 3, 736–740.

Hachemeister, C. (1975). Credibility for regression models with application to trend. In Credibility: Theory and Applications (ed. P.M. Kahn), 129–163. Academic Press, New York.

Hartigan, J.A. (1969). Linear Bayesian methods. J. Royal Statist. Soc. B 31, 446–454.

Hesselager, O. (1988a). On the asymptotic distribution of weighted least squares estimators. Scand. Actuarial J. 1988, 69–76.

Hesselager, O. (1988b). Estimation of variance components in hierarchical regression models with nested classification. Working paper No. 69, Laboratory of Actuarial Mathematics, University of Copenhagen.

Hesselager, O. (1988c). On the application of bootstrap in some empirical linear Bayes estimation problems. Working paper No. 76, Laboratory of Actuarial Mathematics, University of Copenhagen.

HUMAK, K.M.S. (1984). Statistische Methoden der Modellbildung III. Akademie-Verlag, Berlin.

Jewell, W.S. (1974a). Exact multidimensional credibility. Mitt. Verein. Schweiz. Vers.Math. 74, 193–214.

Jewell, W.S. (1974b). Regularity conditions for exact credibility. ASTIN Bull. 8, 336–341.

Johns Jr., M.V. (1957). Nonparametric empirical Bayes procedures. Ann. Math. Statist. 28, 649–669.

Laird, N.M. & Louis, T.A. (1987). Empirical Bayes confidence intervals based on bootstrap samples. J. Am. Statist. Assoc. 82, 739–750.

Maritz, J.S. (1970). Empirical Bayes Methods. Methuen, London.

Martz, H.F. and Krutchkoff, R.G. (1969). Empirical Bayes estimators in a multiple linear regression model. Biometrika 56, 367–374.

Neuhaus, W. (1984). Inference about parameters in empirical linear Bayes estimation problems. Scand. Actuarial J. 1984, 131–142.

Norberg, R. (1980). Empirical Bayes credibility. Scand. Actuarial J. 1980, 177–194.

Norberg, R. (1982). On optimal parameter estimation in credibility. Insurance: Math. Econ. 1, 73–89.

Norberg, R. (1986). Hierarchical credibility: analysis of a random effect linear model with nested classification. Scand. Actuarial J. 1986, 204–222.

Raiffa, H.A. and Schlaifer, R. (1961). Applied statistical decision theory. Graduate School of Business Administration, Harvard University, Boston.

Robbins, H. (1955). An empirical Bayes approach to statistics. Proc. Third Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, 157–163. University of California Press.

Robbins, H. (1964). The empirical Bayes approach to statistical problems. Ann. Math. Statist. 35, 1–20.

Rutherford, J.R. & Krutchkoff, R.G. (1969). ε asymptotic optimality of empirical Bayes estimators. Biometrika 56, 220–223.

Serfling, R.J. (1980). Approximation Theorems of Mathematical Statistics. Wiley, New York.

Swamy, P.A.V.B. (1971). Statistical inference in random coefficient regression models. Lecture Notes in Operations Research and Mathematical Systems 55. Springer-Verlag, Berlin.

Whitney, A.W. (1918). The theory of experience rating. Proc. Cas. Actuarial Soc. 4, 274–292.

Wind, S. (1973). An empirical Bayes approach to multiple linear regression. Ann. Statist. 1, 93–103.