
Asymptotic Theory of Statistical Estimation¹

Jiantao Jiao
Department of Electrical Engineering and Computer Sciences
University of California, Berkeley
Email: [email protected]

September 11, 2019

¹ Summary of chapters in [1].

Contents

1 The Problem of Statistical Estimation
  1.1 Formulation of the Problem of Statistical Estimation
  1.2 Some Examples
    1.2.1 Hodges' and Lehmann's Result
    1.2.2 Estimation of the Mean of a Normal Distribution
  1.3 Consistency. Methods for Constructing Consistent Estimators
    1.3.1 An Existence Theorem
    1.3.2 Method of Moments
    1.3.3 Method of Maximum Likelihood
    1.3.4 Bayesian Estimates
  1.4 Inequalities for Probabilities of Large Deviations
    1.4.1 Convergence of θ̂ε to θ
    1.4.2 Some Basic Theorems and Lemmas
    1.4.3 Examples
  1.5 Lower Bounds on the Risk Function
  1.6 Regular Statistical Experiments. The Cramer-Rao Inequality
    1.6.1 Regular Statistical Experiments
    1.6.2 The Cramer-Rao Inequality
    1.6.3 Bounds on the Hellinger Distance r_2^2(θ; θ′) in Regular Experiments
  1.7 Approximating Estimators by Means of Sums of Independent Random Variables
  1.8 Asymptotic Efficiency
    1.8.1 Basic Definition
    1.8.2 Examples
    1.8.3 Bahadur's Asymptotic Efficiency
    1.8.4 Efficiency in C. R. Rao's Sense
  1.9 Two Theorems on the Asymptotic Behavior of Estimators
    1.9.1 Examples

2 Local Asymptotic Normality of Families of Distributions
  2.1 Independent Identically Distributed Observations
  2.2 Local Asymptotic Normality (LAN)
  2.3 Independent Nonhomogeneous Observations
  2.4 Multidimensional Parameter Set
  2.5 Characterizations of Limiting Distributions of Estimators
    2.5.1 Estimators of an Unknown Parameter when the LAN Condition is Fulfilled
    2.5.2 Regular Parameter Estimators
  2.6 Asymptotic Efficiency under LAN Conditions
  2.7 Asymptotically Minimax Risk Bound
  2.8 Some Corollaries. Superefficient Estimators

3 Some Applications to Nonparametric Estimation
  3.1 A Minimax Bound on Risks
  3.2 Bounds on Risks for Some Smooth Functionals
  3.3 Examples of Asymptotically Efficient Estimators
  3.4 Estimation of Unknown Density
  3.5 Minimax Bounds on Estimators for Density

Acknowledgment

Thanks to the Russian mathematicians for providing excellent research monographs.

Chapter 1

The Problem of Statistical Estimation

1.1 Formulation of the Problem of Statistical Estimation

Let (𝒳, 𝔛, Pθ, θ ∈ Θ) be a statistical experiment generated by the observation X. Let ϕ be a measurable function from (Θ, 𝔅) into (𝒴, 𝔜). Consider the problem of estimating the value of ϕ(θ) at a point θ based on the observation X, whose distribution is Pθ. Our only information about θ is that θ ∈ Θ. As an estimator for ϕ(θ) one can choose any function T(X) of the observations with values in (𝒴, 𝔜). Therefore the following problem arises naturally: how does one choose a statistic T which estimates ϕ(θ) in the best possible manner? However, what is the meaning of the expression "in the best (possible) manner"?

Assume that on the set 𝒴 × 𝒴 there is a real-valued nonnegative function W(y1, y2), which we call the loss function and which has the following meaning: if the observation X is distributed according to the distribution Pθ, then utilizing the statistic T(X) to estimate ϕ(θ) yields a loss equal to W(T(X), ϕ(θ)). Averaging over all possible values of X, we arrive at the risk function

RW(T; θ) = EθW(T(X), ϕ(θ)), (1.1)

which will be chosen as the measure of the quality of the statistic T as an estimator of ϕ(θ) for a given loss function W. Thus a partial ordering is introduced on the space of all estimators of ϕ(θ): the estimator T1 is preferable to T2 if RW(T1; θ) ≤ RW(T2; θ) for all θ ∈ Θ.

In view of the last definition, an estimator T of the function ϕ(θ) is called inadmissible with respect to the loss function W if there exists an estimator T* such that

RW(T*; θ) ≤ RW(T; θ) for all θ ∈ Θ, and RW(T*; θ) < RW(T; θ) for some θ ∈ Θ. (1.2)

An estimator which is not inadmissible is called admissible.

Although the approach described above is commonly used, it is not free of certain shortcomings. First, many estimators turn out to be incomparable, and second, the choice of the loss function is arbitrary to a substantial degree. Sometimes it is possible to find estimators which are optimal within a certain class which is smaller than the class of all estimators. One such class is the class of unbiased estimators: an estimator T is called an unbiased estimator of the function ϕ(θ) if EθT = ϕ(θ) for all θ ∈ Θ.

Furthermore, if the initial experiment is invariant with respect to a group of transformations, it is natural to confine ourselves to the class of estimators which do not violate the symmetry of the problem. This is called the invariance principle.

Comparing estimators T1, T2 according to their behavior at the "least favorable" points, we arrive at the notion of a minimax estimator. An estimator T0 is called the minimax estimator of ϕ(θ) in Θ1 ⊂ Θ relative to the loss function W if

sup_{θ∈Θ1} RW(T0; θ) = inf_T sup_{θ∈Θ1} RW(T; θ), (1.3)

where the inf is taken over all estimators T of ϕ(θ).

If Θ is a subset of a finite-dimensional Euclidean space, then statistical estimation problems based on this experiment are called parametric estimation problems. Below we shall mainly deal with parametric problems. Moreover, we shall always assume that Θ is an open subset of a finite-dimensional Euclidean space R^k, and that the family of distributions Pθ and the densities p(x; θ) = dPθ/dµ are defined on the closure Θ̄ of the set Θ. By 𝔅 we shall denote the σ-algebra of Borel subsets of Θ.

In the parametric case it is usually the parameter itself that is estimated (i.e. ϕ(θ) = θ). In this case the loss function W is defined on the set Θ × Θ, and as a rule we shall consider loss functions which possess the following properties:

1. W(u, v) = w(u − v).


2. The function w(u) is defined and nonnegative on R^k; moreover, w(0) = 0 and w(u) is continuous at u = 0 but is not identically zero.

3. The function w is symmetric, i.e. w(u) = w(−u).

4. The sets {u : w(u) < c} are convex for all c > 0.

5. The sets {u : w(u) < c} are convex for all c > 0 and are bounded for all c > 0 sufficiently small.

The functions w will also be called loss functions. The first three properties are quite natural and do not require additional comments. Property 4 in the case of a one-dimensional parameter set means that the function w(u) is non-decreasing on [0, ∞). Denote by W the class of loss functions satisfying 1–4; the same notation will also be used for the corresponding set of functions w. Denote by W′ the class of functions satisfying 1–5. The notation W_p (W′_p) will be used for the set of functions w in W (W′) which possess a polynomial majorant. Denote by W_{e,α} (W′_{e,α}) the set of functions w belonging to W (W′) whose growth as |u| → ∞ is slower than any one of the functions e^{ε|u|^α}, ε > 0.

Clearly, all loss functions of the form |u − v|^r, r > 0, as well as the indicator loss function W_A(u, v) = I(u − v ∉ A), belong to the class W′_p.

Theorem 1.1.1 (Blackwell's theorem). Let the family {Pθ, θ ∈ Θ} possess a sufficient statistic T. Let the loss function be of the form w(u − v), where w(u) is a convex function on R^k. Let θ* be an arbitrary estimator for θ. Then there exists an estimator T* representable in the form g(T) and such that for all θ ∈ Θ,

Eθ w(T* − θ) ≤ Eθ w(θ* − θ). (1.4)

If θ* is an unbiased estimator for θ, T* can also be chosen to be unbiased.

Consider again a parametric statistical experiment. Now we assume that θ is a random variable with a known distribution Q on Θ. In such a situation the estimation problem is called the estimation problem in the Bayesian formulation. Assume, for simplicity, that Q possesses a density q with respect to the Lebesgue measure. If, as before, the losses are measured by means of the function w, then the mean loss obtained using the estimator T (the so-called Bayesian risk of the estimator T) is equal to

rw(T) = EQ w(T − θ) = ∫_Θ q(θ) dθ ∫_𝒳 w(T(x) − θ) Pθ(dx) = ∫_Θ Rw(T; θ) q(θ) dθ. (1.5)

In the Bayesian setup the risk of an estimator is thus a single number, and one can talk about the best estimator T̃ minimizing the risk rw:

rw(T̃) = min_T rw(T). (1.6)

Here we assume the minimum is achievable. The estimator T̃ is called the Bayesian estimator with respect to the loss function w and the prior distribution Q.

Evidently the form of the optimal estimator T̃ depends on the prior density q. On the other hand, one may expect that as the sample size increases to infinity the Bayesian estimator T̃ ceases to depend on the prior distribution Q within a wide class of these distributions (e.g. those Q for which q > 0 on Θ). Therefore, for an asymptotic treatment of Bayesian problems the exact knowledge of q is no longer so obligatory.
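As an aside, the identity (1.5) is easy to probe numerically. The following is a minimal simulation sketch (an assumed example, not from [1]): for quadratic loss w(u) = u^2 in the model X1, . . . , Xn ~ N(θ, 1) with prior θ ~ N(0, 1), the Bayes estimator is the posterior mean nX̄/(n + 1), with Bayesian risk 1/(n + 1), smaller than the Bayesian risk 1/n of X̄.

```python
import numpy as np

# Monte Carlo sketch of the Bayesian risk (1.5) for quadratic loss w(u) = u^2:
# X_1,...,X_n ~ N(theta, 1), prior theta ~ N(0, 1). The Bayes estimator is the
# posterior mean n*xbar/(n+1); we compare its Bayes risk with that of xbar.
rng = np.random.default_rng(0)
n, reps = 10, 200_000

theta = rng.normal(0.0, 1.0, size=reps)               # draws from the prior Q
x = rng.normal(theta[:, None], 1.0, size=(reps, n))   # n observations each
xbar = x.mean(axis=1)
post_mean = n * xbar / (n + 1)                        # Bayes estimator

print(f"risk of posterior mean ~ {np.mean((post_mean - theta)**2):.4f}"
      f" (theory {1/(n+1):.4f})")
print(f"risk of sample mean    ~ {np.mean((xbar - theta)**2):.4f}"
      f" (theory {1/n:.4f})")
```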

1.2 Some Examples

1.2.1 Hodges’ and Lehmann’s Result

We shall first formulate a simple general criterion, due to Lehmann, which is useful for proving the minimax property of certain estimators.

Theorem 1.2.1. Let Tk be a Bayesian estimator with respect to the distribution λk on Θ and the loss function W, k = 1, 2, . . .. If the estimator T is such that for all θ ∈ Θ,

EθW(θ, T) ≤ lim sup_k ∫_Θ EθW(θ, Tk) dλk(θ), (1.7)

then it is minimax.

As a corollary to this theorem we obtain the following result of Hodges and Lehmann.


Theorem 1.2.2. Let T be an estimator which is Bayesian with respect to W and a probability measure λ on Θ. Denote by Θ0 ⊂ Θ the support of λ. If

1. EθW(θ, T) ≡ c for all θ ∈ Θ0,

2. EθW(θ, T) ≤ c for all θ ∈ Θ,

then T is a minimax estimator.

1.2.2 Estimation of the Mean of a Normal Distribution

Let Xj possess the normal distribution N(θ, 1) on the real line. Denote by Tk an estimator which is Bayesian with respect to the normal distribution λk with mean zero and variance σk^2 = k. Since the loss function is quadratic,

Tk = (nk/(nk + 1)) X̄. (1.8)

Therefore

∫_{−∞}^{∞} Eu(Tk − u)^2 dλk = k/(nk + 1). (1.9)

For all θ ∈ Θ,

Eθ(X̄ − θ)^2 = n^{−1} = lim_k ∫_{−∞}^{∞} Eu(Tk − u)^2 dλk, (1.10)

and it follows from Theorem 1.2.1 that X̄ is a minimax estimator. Consequently the equivariant estimator X̄ is also optimal in the class of equivariant estimators. We note immediately that X̄ is admissible as well (this can easily be shown using results based on Fisher information given later). Hence in the problem under consideration the estimator X̄ of the parameter θ has all the possible virtues: it is unbiased, admissible, minimax, and equivariant. These properties of X̄ are retained also in the case when the Xj are normally distributed in R^2.

If, however, the Xj are normally distributed in R^k, k ≥ 3, with the density N(θ, J), then the statistic X̄ relative to the loss function W(θ, t) = |θ − t|^2 loses all of its remarkable properties. It is even inadmissible in this case.

We now present briefly Stein's method of constructing estimators which are better than X̄. The following simple relation is the basis for Stein's construction. Let ξ be a normally distributed random variable with mean a and variance σ^2, let ϕ(x) be absolutely continuous on R^1, and let E|ϕ′(ξ)| < ∞. Then

σ^2 Eϕ′(ξ) = E(ξ − a)ϕ(ξ). (1.11)

Indeed, integrating by parts we obtain

Eϕ′(ξ) = (1/(σ√(2π))) ∫_{−∞}^{∞} ϕ′(x) e^{−(x−a)^2/(2σ^2)} dx = −(1/(σ√(2π))) ∫_{−∞}^{∞} ϕ(x) d(e^{−(x−a)^2/(2σ^2)}) = σ^{−2} Eϕ(ξ)(ξ − a). (1.12)

Now let ξ = (ξ1, . . . , ξk) be a normal random vector in R^k with mean a and correlation matrix σ^2 J, where J is the identity matrix. Furthermore, let the function ϕ = (ϕ1, . . . , ϕk) : R^k → R^k be differentiable with E|∂ϕi(ξ)/∂ξi| < ∞. Under these assumptions the following identity is obtained from Stein's identity:

E[σ^2 ∂ϕi(ξ)/∂ξi − (ξi − ai)ϕi(ξ)] = 0. (1.13)
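The identity (1.11) is easy to check by simulation. A quick sketch (an assumed example, not from [1]) with ϕ(x) = tanh(x):

```python
import numpy as np

# Numeric check of Stein's identity (1.11): for xi ~ N(a, sigma^2) and smooth
# phi, sigma^2 * E[phi'(xi)] = E[(xi - a) * phi(xi)]. Here phi = tanh, so
# phi'(x) = 1 - tanh(x)^2.
rng = np.random.default_rng(1)
a, sigma = 0.7, 1.3
xi = rng.normal(a, sigma, size=1_000_000)

lhs = sigma**2 * np.mean(1 - np.tanh(xi)**2)
rhs = np.mean((xi - a) * np.tanh(xi))
print(f"sigma^2 E[phi'(xi)] ~ {lhs:.4f},  E[(xi - a) phi(xi)] ~ {rhs:.4f}")
```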

Return now to the sequence of iid observations X1, . . . , Xn, where Xj possesses a normal distribution in R^k with the density N(θ, J). An estimator of θ will be sought among the estimators of the form

θ̂n = X̄ + n^{−1} g(X̄), (1.14)

where g(x) = (g1, . . . , gk) : R^k → R^k. In view of (1.13),

Eθ|X̄ − θ|^2 − Eθ|X̄ + n^{−1}g(X̄) − θ|^2 = −2n^{−1} Eθ⟨X̄ − θ, g(X̄)⟩ − n^{−2} Eθ|g(X̄)|^2 (1.15)
= −2n^{−2} Eθ(∑_{i=1}^{k} ∂gi/∂xi (X̄)) − n^{−2} Eθ|g(X̄)|^2. (1.16)


Assume now that the function g can be represented in the form g(x) = ∇ ln ϕ(x), where ϕ(x) is a twice differentiable function from R^k to R^1. Then

∑_{i=1}^{k} ∂gi/∂xi (x) = ∑_{i=1}^{k} ∂/∂xi ((1/ϕ(x)) ∂ϕ(x)/∂xi) = −|g|^2 + (1/ϕ)Δϕ, (1.17)

where Δ = ∑_{i=1}^{k} ∂^2/∂xi^2 is the Laplace operator. Consequently, for the above choice of g,

Eθ|X̄ − θ|^2 − Eθ|X̄ + n^{−1}g(X̄) − θ|^2 = n^{−2} Eθ|g|^2 − 2n^{−2} Eθ[(1/ϕ(X̄)) Δϕ(X̄)]. (1.18)

The right hand side of the last equality is obviously positive provided ϕ(x) is a positive nonconstant superharmonic function, which means that Δϕ ≤ 0. Since there are no nonconstant superharmonic functions bounded from below on the real line or in the plane, the proposed approach does not improve on the estimator X̄ in these two cases. However, in spaces of dimension k ≥ 3 there exists a substantial number of nonnegative superharmonic functions. Consider, for example, the function

ϕk(x) = |x|^{−(k−2)} for |x| ≥ √(k−2), and ϕk(x) = (k−2)^{−(k−2)/2} e^{((k−2)−|x|^2)/2} for |x| ≤ √(k−2). (1.19)

This function is positive and superharmonic in R^k, k ≥ 3, and

∇ ln ϕk(x) = −((k−2)/|x|^2) x for |x| ≥ √(k−2), and ∇ ln ϕk(x) = −x for |x| ≤ √(k−2). (1.20)

Thus the James–Stein estimator

X̄ + (1/n) ∇ ln ϕk(X̄) = (1 − (k−2)/(n|X̄|^2)) X̄ for |X̄| ≥ √(k−2), and (1 − 1/n) X̄ for |X̄| ≤ √(k−2), (1.21)

is uniformly better than X̄. It is worth mentioning, however, that as n → ∞ the improvement is of order n^{−2}, while Eθ|X̄ − θ|^2 = k/n is of order n^{−1}.

Another improvement on the estimator X̄ may be obtained by setting ϕ(x) = |x|^{−(k−2)}. Here ϕ(x) is a harmonic function and the corresponding estimator is of the form

θ̂n = (1 − (k−2)/(n|X̄|^2)) X̄. (1.22)

This estimator was suggested by Stein. For Stein's estimator,

Eθ|X̄ − θ|^2 − Eθ|θ̂n − θ|^2 = ((k−2)/n)^2 Eθ|X̄|^{−2} = ((k−2)/n)^2 (2π)^{−k/2} ∫_{R^k} |θ + x/√n|^{−2} e^{−|x|^2/2} dx. (1.23)

As before, as n → ∞,

Eθ|X̄ − θ|^2 − Eθ|θ̂n − θ|^2 = O(n^{−2}) (1.24)

provided θ ≠ 0. However, for θ = 0 the difference equals (k−2)/n. Thus for θ = 0 Stein's estimator is substantially better than X̄:

Eθ|X̄ − θ|^2 / Eθ|θ̂n − θ|^2 = k/2 > 1. (1.25)
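A quick Monte Carlo sketch (assumed parameters, not from [1]) of the ratio (1.25) at θ = 0:

```python
import numpy as np

# Monte Carlo check of (1.25): at theta = 0 and k >= 3, Stein's estimator
# (1.22) satisfies E|xbar|^2 / E|theta_hat|^2 -> k/2. Here xbar ~ N(0, J/n).
rng = np.random.default_rng(2)
k, n, reps = 5, 50, 100_000

xbar = rng.normal(0.0, 1.0, size=(reps, k)) / np.sqrt(n)
shrink = 1.0 - (k - 2) / (n * np.sum(xbar**2, axis=1, keepdims=True))
stein = shrink * xbar

mse_xbar = np.mean(np.sum(xbar**2, axis=1))    # E|xbar - 0|^2 = k/n
mse_stein = np.mean(np.sum(stein**2, axis=1))  # about 2/n at theta = 0
print(f"risk ratio ~ {mse_xbar / mse_stein:.3f}  (theory k/2 = {k/2})")
```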

1.3 Consistency. Methods for Constructing Consistent Estimators

1.3.1 An Existence Theorem

Consider a sequence of statistical experiments (𝒳^n, 𝔛^n, P_θ^n, θ ∈ Θ) generated by observations X^n = (X1, X2, . . . , Xn), where X1, X2, . . . is a sequence of independent observations with values in (𝒳, 𝔛). Would it not be possible to achieve arbitrary precision by increasing indefinitely the number of observations n?

A sequence of statistics {Tn(X1, X2, . . . , Xn)} is called a consistent sequence of estimators for the value ϕ(θ) of the function ϕ : Θ → R^l if

Tn → ϕ(θ) in P_θ^n-probability as n → ∞. (1.26)


Below we shall consider more general families of experiments than sequences of repeated samples, and we shall now extend the definition of consistent estimators to this formally more general case.

Consider the family of experiments (𝒳^(ε), 𝔛^(ε), P_θ^(ε), θ ∈ Θ) generated by observations X^ε, where ε is a real parameter. For our purposes it is sufficient to deal with the case when ε ∈ (0, 1) and asymptotic estimation problems are studied as ε → 0. Observe that the case of a sequence of experiments (𝒳^n, 𝔛^n, P_θ^n, θ ∈ Θ) is a particular case of this scheme: choose ε = n^{−1}, n = 1, 2, . . ..

The family of statistics Tε = Tε(X^ε) is called a consistent family of estimators for the value ϕ(θ) of the function ϕ : Θ → R^l if Tε → ϕ(θ) in P_θ^(ε)-probability as ε → 0 for all θ ∈ Θ.

A family of statistics {Tε} is called a uniformly consistent family of estimators for ϕ(θ) on the set K ⊂ Θ if Tε → ϕ(θ) in P_θ^(ε)-probability as ε → 0 uniformly in θ ∈ K. The latter means that for any δ > 0,

sup_{θ∈K} P_θ^(ε)(|Tε(X^ε) − ϕ(θ)| > δ) → 0. (1.27)

Observe that since the only information about the parameter θ is that it belongs to the set Θ, the useful notions are uniform consistency in Θ or in compact sets K ⊂ Θ.

In this section we shall consider the most commonly employed methods for constructing consistent estimators. We first verify that in the case of repeated sampling, consistent estimators exist under very general assumptions. For repeated experiments we consider on Θ the distances between the measures Pθ(A) = ∫_A f(x; θ) dν and Pθ′(A) = ∫_A f(x; θ′) dν given by

rp(θ; θ′) = (∫_𝒳 |f^{1/p}(x; θ) − f^{1/p}(x; θ′)|^p ν(dx))^{1/p}, p ≥ 1. (1.28)

Clearly 0 ≤ rp(θ; θ′) ≤ 2^{1/p}, and in view of the condition Pθ ≠ Pθ′ for θ ≠ θ′, rp(θ; θ′) = 0 if and only if θ = θ′. Below we shall often use the distances r1 and r2; the latter is the Hellinger distance.
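For p = 2 and Gaussian location families r2 has a closed form, which makes a quick numerical check possible. A sketch under assumed unit-variance normals (not from [1]):

```python
import numpy as np
from scipy.integrate import quad

# Check of (1.28) with p = 2: between N(t1, 1) and N(t2, 1) one has
# r_2^2 = 2 * (1 - exp(-(t1 - t2)**2 / 8)); verify by numerical quadrature.
def r2_squared(t1, t2):
    f = lambda x, t: np.exp(-(x - t)**2 / 2) / np.sqrt(2 * np.pi)
    integrand = lambda x: (np.sqrt(f(x, t1)) - np.sqrt(f(x, t2)))**2
    val, _ = quad(integrand, -np.inf, np.inf)
    return val

t1, t2 = 0.0, 1.5
print(f"quadrature:  {r2_squared(t1, t2):.6f}")
print(f"closed form: {2 * (1 - np.exp(-(t1 - t2)**2 / 8)):.6f}")
```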

Theorem 1.3.1. Let the conditions

1. inf_{θ′: |θ−θ′|>δ} r1(θ; θ′) > 0 for all δ > 0, θ ∈ Θ;

2. lim_{θ′→θ} r1(θ; θ′) = 0

be satisfied. Then there exists a sequence {Tn} of estimators consistent for θ. If the conditions above are satisfied uniformly in θ belonging to a compact set K ⊂ Θ, then there exists a sequence of estimators uniformly consistent in K.

Theorem 1.3.1 is an existence theorem which cannot be used for the actual determination of consistent estimators. We now turn to a consideration of practically usable methods.

1.3.2 Method of moments

This method was suggested by K. Pearson and is historically the first general method for the construction of estimators. In general terms it can be described as follows.

Let (𝒳^(ε), 𝔛^(ε), P_θ^(ε), θ ∈ Θ), Θ ⊂ R^k, be a family of statistical experiments and let g1(θ), . . . , gk(θ) be k real-valued functions on Θ. Assume that there exist consistent estimators gε1, . . . , gεk for g1(θ), . . . , gk(θ). The method of moments recommends that we choose as an estimator for θ the solution of the system of equations

gi(θ) = gεi, i = 1, 2, . . . , k. (1.29)

To specify this system, consider once more a statistical experiment generated by a sequence of iid random variables X1, X2, . . .. Assume that the Xi are real-valued random variables with a finite k-th moment and let αν(θ) = EθX1^ν, ν ≤ k. Denote by aν the ν-th sample moment, i.e. aν = (∑_{i=1}^n Xi^ν)/n. It is known that aν is a consistent estimator for αν. Thus, in view of the above, one can choose as an estimator for θ the solution of the system of equations

αν(θ) = aν, ν = 1, 2, . . . , k. (1.30)

Theorem 1.3.2. Let the functions αν(θ) possess continuous partial derivatives on Θ and let the Jacobian det|∂αν/∂θi|, θ = (θ1, . . . , θk), be different from zero everywhere on Θ. Let the method of moments equations possess a unique solution Tn with probability approaching 1 as n → ∞. Then this solution is a consistent estimator for θ.
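As a concrete illustration (an assumed example, not from [1]), for a Gamma(a, b) sample the first two moment equations α1 = ab, α2 = ab^2 + (ab)^2 invert in closed form:

```python
import numpy as np

# Method of moments (1.30) for Gamma with shape a and scale b:
# alpha_1 = a*b, alpha_2 = a*b^2 + (a*b)^2, hence b = var/mean, a = mean/b.
rng = np.random.default_rng(3)
a_true, b_true, n = 2.5, 1.8, 100_000
x = rng.gamma(a_true, b_true, size=n)

a1 = np.mean(x)          # first sample moment
a2 = np.mean(x**2)       # second sample moment
var = a2 - a1**2         # implied variance a*b^2

b_hat = var / a1
a_hat = a1 / b_hat
print(f"method of moments: a ~ {a_hat:.3f}, b ~ {b_hat:.3f}")
```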


1.3.3 Method of Maximum Likelihood

This method, suggested by R. A. Fisher, is one of the most commonly used general methods for the determination of consistent estimators.

Let dPθ/dµ = p(x; θ) and let X be the observation. The function p(X; θ) is called the likelihood function corresponding to the experiment; thus p(X; θ) is a random function of θ defined on Θ ⊂ R^k. The statistic θ̂ defined by the relation

p(X; θ̂) = sup_{θ∈Θ} p(X; θ) (1.31)

is called the maximum likelihood estimator for the parameter θ based on the observation X. Obviously, it may turn out that the maximization has no solution; below we shall consider the case when the solution does exist. If there are several maximizers we shall assume, unless otherwise specified, that any one of them is a maximum likelihood estimator.

If p(X; θ) is a smooth function of θ and θ̂ ∈ Θ, then θ̂ is necessarily also a solution of the likelihood equation

∂/∂θi ln p(X; θ) = 0, i = 1, 2, . . . , k, θ = (θ1, . . . , θk). (1.32)
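When (1.31) or (1.32) has no closed-form solution, the maximization is carried out numerically. A minimal sketch for the Cauchy location model (an assumed example, not from [1]):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Numerical solution of (1.31) for f(x; theta) = 1 / (pi * (1 + (x - theta)^2)):
# minimize the negative log-likelihood over a bounded interval around the median.
rng = np.random.default_rng(4)
theta_true, n = 1.0, 500
x = theta_true + rng.standard_cauchy(n)

def neg_log_lik(theta):
    # -sum_i ln f(X_i; theta), dropping the constant n * ln(pi)
    return np.sum(np.log1p((x - theta)**2))

res = minimize_scalar(neg_log_lik, bounds=(np.median(x) - 5, np.median(x) + 5),
                      method="bounded")
print(f"MLE theta_hat ~ {res.x:.4f} (true theta = {theta_true})")
```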

To prove the consistency of maximum likelihood estimators it is convenient to utilize the following simple general result.

Lemma 1.3.1. Let (𝒳^(ε), 𝔛^(ε), P_θ^(ε), θ ∈ Θ) be a family of experiments and let the likelihood functions pε(X^ε; θ) correspond to these experiments. Set

Z_{ε,θ}(u) = Zε(u) = pε(X^ε; θ + u) / pε(X^ε; θ), u ∈ U = Θ − θ. (1.33)

Then in order for the maximum likelihood estimator θ̂ε to be consistent it is sufficient that for all θ ∈ Θ and γ > 0,

P_θ^(ε)( sup_{|u|>γ} Z_{ε,θ}(u) > 1 ) → 0 (1.34)

as ε → 0. If the last relation is uniform in θ ∈ K ⊂ Θ, then the estimator θ̂ε is uniformly consistent in K.

We now turn to the case of independent identically distributed observations X1, X2, . . . , Xn, where Xi possesses the density f(x; θ) with respect to a measure ν. The maximum likelihood estimator θ̂n is a solution of the equation

∏_{i=1}^{n} f(Xi; θ̂n) = sup_θ ∏_{i=1}^{n} f(Xi; θ). (1.35)

We show that under quite general assumptions θ̂n is a consistent estimator.

Theorem 1.3.3. Let Θ be a bounded open set in R^k, let f(x; θ) be a continuous function of θ on Θ̄ for ν-almost all x, and let the following conditions be fulfilled:

1. For all θ ∈ Θ and all γ > 0,

inf_{|θ′−θ|>γ} r2^2(θ; θ′) = kθ(γ) > 0. (1.36)

2. For all θ ∈ Θ,

(∫_𝒳 sup_{|h|≤δ} (f^{1/2}(x; θ) − f^{1/2}(x; θ + h))^2 dν)^{1/2} = ωθ(δ) → 0 (1.37)

as δ → 0.

Then for all θ ∈ Θ the estimator θ̂n → θ as n → ∞ with probability one.

In order to prove the consistency of maximum likelihood estimators in the case of an unbounded parameter set, additional conditions dealing with the interrelation between f(x; θ) and f(x; θ′) as |θ − θ′| → ∞ are needed. The simplest variant of such a condition is the following.

Theorem 1.3.4. Let Θ be an open set in R^k, let f(x; θ) be a continuous function of θ for ν-almost all x, and let the conditions of Theorem 1.3.3 as well as the following condition be fulfilled: for all θ ∈ Θ,

lim_{c→∞} ∫_𝒳 sup_{|u|≥c} f^{1/2}(x; θ) f^{1/2}(x; θ + u) dν < 1. (1.38)

Then θ̂n is a consistent estimator for θ.


1.3.4 Bayesian Estimates

Theorem 1.3.5. Let Θ be an open bounded set in R^k and let the density f(x; θ) satisfy the following conditions:

1. inf_{|θ−θ′|>γ} ∫_𝒳 (f^{1/2}(x; θ) − f^{1/2}(x; θ′))^2 dν = kθ(γ) > 0 for all θ ∈ Θ, γ > 0.

2. ∫_𝒳 (f^{1/2}(x; θ + h) − f^{1/2}(x; θ))^2 dν = O((ln|h|)^{−2}) as h → 0, for all θ ∈ Θ.

Then the estimator tn, which is Bayesian relative to the loss function W(u; θ) = |u − θ|^α, α ≥ 1, and a prior density q(θ) which is continuous and positive on Θ, is a consistent estimator of the parameter θ.

1.4 Inequalities for Probabilities of Large Deviations

1.4.1 Convergence of θ̂ε to θ

Let a family of statistical experiments (𝒳^(ε), 𝔛^(ε), P_θ^(ε), θ ∈ Θ) generated by observations X^ε be given. Here Θ is an open subset of R^k. Let pε(X^ε; θ) = (dP_θ^(ε)/dνε)(X^ε), where νε is a measure on 𝔛^(ε). Consider the random function

ζε(u) = pε(X^ε; θ + u) / pε(X^ε; θ), (1.39)

where θ is the "true" value of the parameter. We have seen above (see Lemma 1.3.1) that as long as the function ζε(u) is sufficiently small for large u, the maximum likelihood estimators θ̂ε constructed from the observations X^ε are consistent on Θ. One can expect that by classifying in some manner the decrease of ζε(u) as u → ∞ and ε → 0 we could pinpoint more closely the nature of the convergence of θ̂ε to θ. If, for example, for any γ > 0,

P_θ^(ε)( sup_{|u|>γ} pε(X^ε; θ + εu)/pε(X^ε; θ) ≥ 1 ) → 0 (1.40)

as ε → 0, then evidently also

P_θ^(ε)(ε^{−1}|θ̂ε − θ| > γ) → 0 (1.41)

as ε → 0. This simple device is essentially the basis for the proofs of all the theorems stated below. Set

Z_{ε,θ}(u) = Zε(u) = pε(X^ε; θ + ϕ(ε)u) / pε(X^ε; θ), (1.42)

where ϕ(ε) denotes some nondegenerate normalizing matrix factor; it is also assumed that |ϕ(ε)| → 0 as ε → 0. Thus the function Z_{ε,θ} is defined on the set Uε = (ϕ(ε))^{−1}(Θ − θ).

Below we shall denote by G the set of families of functions {gε(y)} possessing the following properties:

1. For fixed ε, gε(y) increases monotonically to ∞ as a function of y defined on [0, ∞).

2. For any N > 0,

lim_{y→∞, ε→0} y^N e^{−gε(y)} = 0. (1.43)

For example, if gε(y) = y^α, α > 0, then {gε} ∈ G. We shall agree throughout this section to denote nonnegative constants by the letter B, while the letter b will be reserved for positive constants. When we wish to emphasize the dependence of these constants on certain parameters a1, a2, . . ., we shall sometimes write B(a1, a2, . . .).

Theorem 1.4.1. Let the functions Z_{ε,θ}(u) be continuous with probability 1 and possess the following properties: to an arbitrary compact set K ⊂ Θ there correspond nonnegative numbers M1 and m1 (depending on K) and functions gε^K(y) = gε(y), {gε} ∈ G, such that

1. There exist numbers α > k and m ≥ α such that for all θ ∈ K,

sup_{|u1|≤R, |u2|≤R} |u2 − u1|^{−α} E_θ^(ε) |Z_{ε,θ}^{1/m}(u2) − Z_{ε,θ}^{1/m}(u1)|^m ≤ M1(1 + R^{m1}). (1.44)

2. For all u ∈ Uε, θ ∈ K,

E_θ^(ε) Z_{ε,θ}^{1/2}(u) ≤ e^{−gε(|u|)}. (1.45)

Then the maximum likelihood estimator θ̂ε is consistent and for all ε sufficiently small, 0 < ε < ε0,

sup_{θ∈K} P_θ^(ε)(|(ϕ(ε))^{−1}(θ̂ε − θ)| > H) ≤ B0 e^{−b0 gε(H)}, (1.46)

where B0, b0 > 0 are constants.

This theorem, as well as Theorem 1.4.2 below, may appear to be exceedingly cumbersome, involving conditions which are difficult to verify. In the next subsection we shall, however, illustrate the usefulness of these theorems by applying them to sequences of independent homogeneous observations. Both Theorems 1.4.1 and 1.4.2 also play a significant role in the subsequent chapters.

Corollary 1.4.1. Under the conditions of Theorem 1.4.1, for any N > 0 we have

lim_{H→∞, ε→0} H^N sup_{θ∈K} P_θ^(ε)(|(ϕ(ε))^{−1}(θ̂ε − θ)| > H) = 0. (1.47)

Corollary 1.4.2. Under the conditions of Theorem 1.4.1, for any function w ∈ W_p,

lim sup_{ε→0} E_θ^(ε) w((ϕ(ε))^{−1}(θ̂ε − θ)) < ∞. (1.48)

If we replace the maximum likelihood estimator by a Bayesian one, condition (1) of Theorem 1.4.1 may be substantially weakened, the only requirement being that α is positive.

Theorem 1.4.2. Let the functions Z_{ε,θ}(u) possess the following properties: to every compact set K ⊂ Θ there correspond numbers M1 > 0, m1 ≥ 0 and functions gε^K(y) ∈ G such that

1. For some α > 0 and all θ ∈ K,

sup_{|u1|≤R, |u2|≤R} |u2 − u1|^{−α} E_θ^(ε) |Z_{ε,θ}^{1/2}(u2) − Z_{ε,θ}^{1/2}(u1)|^2 ≤ M1(1 + R^{m1}). (1.49)

2. For all u ∈ Uε, θ ∈ K,

E_θ^(ε) Z_{ε,θ}^{1/2}(u) ≤ e^{−gε(|u|)}. (1.50)

Let {tε} be a family of estimators, Bayesian with respect to a prior density q—continuous and positive on K and possessing in Θ a polynomial majorant—and a loss function Wε(u, v) = l((ϕ(ε))^{−1}(u − v)), where

1. l ∈ W′_p;

2. there exist numbers γ > 0, H0 ≥ 0 such that for H ≥ H0,

sup{l(u) : |u| ≤ H^γ} − inf{l(u) : |u| ≥ H} ≤ 0. (1.51)

Then for any N,

lim_{H→∞, ε→0} H^N sup_{θ∈K} P_θ^(ε)(|(ϕ(ε))^{−1}(tε − θ)| > H) = 0. (1.52)

If in addition l(u) = τ(|u|), then for all ε sufficiently small, 0 < ε < ε0,

sup_{θ∈K} P_θ^(ε)(|(ϕ(ε))^{−1}(tε − θ)| > H) ≤ B0 e^{−b0 gε(H)}. (1.53)

Corollary 1.4.3. Under the conditions of Theorem 1.4.2, for any function w ∈ W_p,

lim sup_{ε→0} E_θ^(ε) w((ϕ(ε))^{−1}(tε − θ)) < ∞. (1.54)


1.4.2 Some Basic Theorems and Lemmas

In this subsection we shall consider a sequence of statistical experiments (𝒳^n, 𝔛^n, P_θ^n, θ ∈ Θ), where Θ is an open subset of R^k, generated by a sequence of homogeneous independent observations X1, X2, . . . , Xn with common density f(x; θ) with respect to a measure ν. Based on the results of the preceding section we shall continue the study of consistent estimators {Tn} for θ, with the aim of determining the rate of convergence of {Tn} to θ for certain classes of estimators {Tn}. It will be shown that the rate of convergence depends on the asymptotic behavior of the Hellinger distance

r2(θ; θ + h) = (∫_𝒳 |f^{1/2}(x; θ + h) − f^{1/2}(x; θ)|^2 dν)^{1/2} (1.55)

as h → 0.

Theorem 1.4.3. Let Θ be a bounded interval in R^1, let f(x; θ) be a continuous function of θ on Θ̄ for ν-almost all x, and let the following conditions be satisfied:

1. There exists a number α > 1 such that

sup_{θ∈Θ} sup_h |h|^{−α} r2^2(θ; θ + h) = A < ∞. (1.56)

2. To any compact set K there corresponds a positive number a(K) = a > 0 such that

r2^2(θ; θ + h) ≥ a|h|^α / (1 + |h|^α), θ ∈ K. (1.57)

Then the maximum likelihood estimator θ̂n is defined, is consistent, and

sup_{θ∈K} Pθ(n^{1/α}|θ̂n − θ| > H) ≤ B0 e^{−b0 a H^α}, (1.58)

where the positive constants B0, b0 do not depend on K.

A version of Theorem 1.4.3 for an unbounded interval can be stated as follows.

Theorem 1.4.4. Let Θ be an interval in R^1 (not necessarily bounded), let f(x; θ) be a continuous function of θ for ν-almost all x, and let the following conditions be satisfied: there exist numbers α > 1 and γ > 0 such that

1. sup_{|θ|<R} sup_h |h|^{−α} r2^2(θ; θ + h) ≤ M(1 + R^m), where M, m are constants.

2. To any compact set K ⊂ Θ there corresponds a number a(K) = a such that

r2^2(θ; θ + h) ≥ a|h|^α / (1 + |h|^α), θ ∈ K. (1.59)

3. To any compact set K ⊂ Θ there corresponds a number c = c(K) such that

sup_{|h|>R} ∫_𝒳 f^{1/2}(x; θ) f^{1/2}(x; θ + h) dν ≤ cR^{−γ}, θ ∈ K. (1.60)

Then a maximum likelihood estimator is defined, is consistent, and for all n ≥ n0(γ) the following inequalities are satisfied: to any number Λ > 0 there correspond positive constants Bi, bi (depending also on α, M, m, a, γ, c) such that

sup_{θ∈Θ} Pθ(n^{1/α}|θ̂n − θ| > H) ≤ B1 e^{−b1 a H^α} for H < Λn^{1/α}, and ≤ B2 e^{−b2 c n ln(H/n^{1/α})} for H ≥ Λn^{1/α}. (1.61)

Remark 1.4.1. The last inequality in Theorem 1.4.4 is equivalent to the following:

sup_{θ∈Θ} Pθ(|θ̂n − θ| > δ) ≤ B1 e^{−b1 a n δ^α} for δ < Λ, and ≤ B2 e^{−b2 c n ln δ} for δ ≥ Λ. (1.62)


We now present two theorems on Bayesian estimators. In these theorems Θ is an open subset of R^k, and the Bayesian estimators are constructed with respect to a positive continuous prior density q on Θ possessing a polynomial majorant. These theorems are analogous to Theorems 1.4.3 and 1.4.4: the first deals with the case of a bounded set Θ and the second with an unbounded one. However, the transition to Bayesian estimators allows us to substantially weaken the restrictions on f whenever the dimension k of the parameter set is greater than 1.

Theorem 1.4.5. Let Θ be an open bounded set in R^k and let the following conditions be satisfied: there exists a number α > 0 such that

1. sup_{θ∈Θ} |h|^{−α} r2^2(θ; θ + h) = A < ∞.

2. To any compact set K there corresponds a positive number a(K) = a > 0 such that

r2^2(θ; θ + h) ≥ a|h|^α / (1 + |h|^α), θ ∈ K. (1.63)

Finally, let {tn} be a sequence of estimators which are Bayesian with respect to the prior density q and the loss function Wn(u, v) = l(n^{1/α}|u − v|), where l ∈ W′_p. Then

sup_{θ∈K} Pθ(n^{1/α}|tn − θ| > H) ≤ B e^{−b a H^α}. (1.64)

Here the constants B, b are positive and depend only on A, α and the diameter of Θ.

Theorem 1.4.6. Let Θ be an open set in R^k and let the following conditions be satisfied: there exist numbers α > 0, γ > 0 such that

1. sup_{|θ|≤R} |h|^{−α} r2^2(θ; θ + h) ≤ M1(1 + R^{m1}), (1.65)

where M1, m1 are positive numbers.

2. To any compact set K ⊂ Θ there corresponds a positive number a(K) = a > 0 such that

r2^2(θ; θ + h) ≥ a|h|^α / (1 + |h|^α), θ ∈ K. (1.66)

3. To any compact set K ⊂ Θ there corresponds a positive number c(K) = c > 0 such that

sup_{|h|>R} ∫_𝒳 f^{1/2}(x; θ) f^{1/2}(x; θ + h) dν ≤ cR^{−γ}, θ ∈ K. (1.67)

Finally, let {tn} be a sequence of estimators which are Bayesian with respect to the prior density q and the loss function Wn(u, v) = l(n^{1/α}|u − v|), where l ∈ W_p. Then to any number Λ > 0 there correspond positive constants B1, B2, b1, b2 such that for all n > n0(m1, γ),

sup_{θ∈K} Pθ(n^{1/α}|tn − θ| > H) ≤ B1 e^{−b1 a H^α} for H ≤ Λn^{1/α}, and ≤ B2 e^{−b2 c n ln(H/n^{1/α})} for H > Λn^{1/α}. (1.68)

Theorem 1.4.7. Let conditions 1–3 of Theorem 1.4.6 be satisfied. Furthermore, let {tn} be a sequence of estimators which are Bayesian with respect to the prior density q and the loss function Wn(u, v) = l(n^{1/α}|u − v|), where l ∈ W′_p and, moreover, for some γ0 > 0, H0 > 0 and all H > H0 the inequality

sup{l(u) : |u| ≤ H^{γ0}} − inf{l(u) : |u| > H} ≤ 0 (1.69)

is fulfilled. Then for all n ≥ n0,

sup_{θ∈K} Pθ(n^{1/α}|tn − θ| > H) ≤ B_N H^{−N}, (1.70)

whatever the number N > 0 is.


One can formulate for maximum likelihood estimators theorems which are similar to Theorems 1.4.3 and 1.4.4 also in the case when Θ ⊂ R^k, k > 1; in these cases, in place of the Hellinger distance r2 one should take the distance

r_m(θ; θ + h) = (∫_𝒳 |f^{1/2}(x; θ + h) − f^{1/2}(x; θ)|^m dν)^{1/m}, m > k, (1.71)

and require that

r_m^m(θ; θ + h) ≥ a|h|^α / (1 + |h|^α), α > k. (1.72)

We shall not dwell on this point in detail, but present the following result which is somewhat different from the preceding theorems.

Theorem 1.4.8. Let Θ be a bounded open convex set in R^k, let f(x; θ) be a continuous function of θ on Θ̄ for ν-almost all x, and let the following conditions be satisfied:

1. There exists a number α > 0 such that

∫_𝒳 sup_{θ1,θ2∈Θ} ((f^{1/2}(x; θ2) − f^{1/2}(x; θ1))^2 / |θ1 − θ2|^α) dν = A < ∞. (1.73)

2. There exists a number β > 0 such that for all θ ∈ Θ,

r2^2(θ; θ + h) ≥ a(θ)|h|^β / (1 + |h|^β), (1.74)

where a(θ) > 0.

Then the maximum likelihood estimator θ̂n is defined, is consistent, and for any λ ≤ 1/β,

Pθ(n^λ|θ̂n − θ| > H) ≤ B n^{β^{−1}−(2α)^{−1}} e^{−n^{1−λβ} b a(θ) H^β}. (1.75)

Here the positive constants B, b depend only on A, α, β and the diameter of the set Θ.

1.4.3 Examples

Example 1.4.1. Let (X1, X2, . . . , Xn) be a sample from the normal distribution N(a, σ^2), where a, −∞ < a < ∞, is an unknown parameter. The conditions of Theorem 1.4.4 are fulfilled with α = 2. The maximum likelihood estimator for a is X̄, and for this estimator the inequality of Theorem 1.4.4 is satisfied with α = 2. Since L(X̄) = N(a, σ^2/n), this inequality can be substantially improved.

Example 1.4.2. Now let the variables Xi ~ Bern(p), 0 < p < 1. The conditions of Theorem 1.4.3 are satisfied if 0 < p0 ≤ p ≤ p1 < 1, again with α = 2. The maximum likelihood estimator for p once again is X̄. Since the set of values the parameter p can take is compact, we have

Pp(√n|X̄ − p| > H) ≤ B1 e^{−b1 H^2}. (1.76)

This bound can be substantially refined using Hoeffding's inequality, which applied to our case asserts that

Pp(√n|X̄ − p| > H) ≤ 2e^{−2H^2}, 0 < p < 1. (1.77)

Example 1.4.3. Let θ be a location parameter; this means that 𝒳 = Θ = R^k and the distribution Pθ in R^k possesses a density of the form f(x − θ) with respect to the Lebesgue measure λ. In this case we always have

∫_{R^k} |f^{1/2}(x + h) − f^{1/2}(x)|^2 dx = r(h) ≥ a|h|^2 / (1 + |h|^2), a > 0. (1.78)

Indeed, denote by ϕ(t) the Fourier transform of the function f^{1/2}(x). It follows from the Parseval equality that

lim inf_{|h|→0} r(h)/|h|^2 = lim inf_{|h|→0} (4/|h|^2) ∫ sin^2(⟨t, h⟩/2) |ϕ(t)|^2 dt = inf_{|h|=1} ∫ ⟨t, h⟩^2 |ϕ(t)|^2 dt = µ ≥ 0. (1.79)

However, λ{t : |ϕ(t)| > 0} > 0, so that the quadratic form on the right hand side is positive definite and µ > 0 (the case µ = ∞ is not excluded). Moreover, it is clear that r(h) → 2 as |h| → ∞.


Example 1.4.4. Let (X1, X2, . . . , Xn) be a sample from the uniform distribution on the interval [θ − 1/2, θ + 1/2], where θ is the parameter to be estimated. This is a particular case of the location model with f(x) = 1 for |x| < 1/2 and f(x) = 0 for |x| > 1/2. It is easy to see that

∫ |f^{1/2}(x + h) − f^{1/2}(x)|^2 dx = 2|h|, (1.80)

so that α = 1. The statistic tn = (Xmax + Xmin)/2 is a Bayesian estimator with respect to the quadratic loss function and the uniform prior distribution (Pitman's estimator), and in view of Theorem 1.4.5,

Pθ(n|tn − θ| > H) ≤ B1 e^{−b1 H}, H < n/2. (1.81)

Taking into account the exact form of the distribution of the statistic tn, one can write for n ≥ 2,

Pθ(n|tn − θ| > H) = 2 ∫_H^{n/2} (1 − 2u/n)^{n−1} du ≤ 2 ∫_H^{∞} e^{−u} du = 2e^{−H}. (1.82)
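A small simulation sketch (assumed parameters, not from [1]) makes the n-rate of the midrange visible and checks the tail bound (1.82):

```python
import numpy as np

# Example 1.4.4 numerically: for Uniform[theta - 1/2, theta + 1/2] the midrange
# t_n converges at rate n (alpha = 1); check P(n|t_n - theta| > H) <= 2e^{-H}.
rng = np.random.default_rng(5)
theta, n, reps, H = 0.3, 200, 100_000, 2.0

x = rng.uniform(theta - 0.5, theta + 0.5, size=(reps, n))
t_n = (x.max(axis=1) + x.min(axis=1)) / 2

tail = np.mean(n * np.abs(t_n - theta) > H)
print(f"P(n|t_n - theta| > {H}) ~ {tail:.5f}, bound 2e^-H = {2*np.exp(-H):.5f}")
print(f"n * E|t_n - theta| ~ {n * np.mean(np.abs(t_n - theta)):.3f} (stays O(1))")
```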

Example 1.4.5. Let (X1, X2, . . . , Xn) be a sample from a population with a Gamma distribution with unknown location parameter θ, i.e., let Xj possess on the real line the probability density

(x − θ)^{p−1} e^{θ−x} / Γ(p), θ ≤ x < ∞, (1.83)

where p > 0 is known. Once again the conditions of Example 1.4.3 are satisfied, and as h → 0,

r(h) ~ h^2 for p > 2, and r(h) ~ h^p for p < 2. (1.84)

Example 1.4.6. Experiments with a finite Fisher information amount serve as a source for a large number of examples. We will discuss this case in more detail later.

1.5 Lower Bounds on the Risk Function

Let a repeated sample X1, X2, . . . , Xn of size n with density f(x; θ) with respect to some measure ν be given. As usual, θ ∈ Θ ⊂ R^k. If Tn = Tn(X1, X2, . . . , Xn) is an estimator for θ, we set

S^(m)(Tn; θ) = Eθ|Tn − θ|^m. (1.85)

Theorem 1.5.1. Let for all θ ∈ Θ,

r2^2(θ; θ′) = ∫_𝒳 (√f(x; θ) − √f(x; θ′))^2 ν(dx) ≤ K1(θ)|θ − θ′|^α, K1 > 0, α > 0, (1.86)

as long as |θ − θ′| ≤ h1(θ), h1 > 0. Denote by j a vector of unit length in R^k. Then for any sequence of estimators {Tn},

lim inf_{n→∞} n^{m/α} (S^(m)(Tn; θ) + S^(m)(Tn; θ + (2nK1(θ))^{−1/α} j)) > 0 (1.87)

for all θ ∈ Θ and m ≥ 1.

Theorem 1.5.1 establishes an asymptotic lower bound on the risk function for arbitrary estimators; a comparison of this result with Theorems 1.4.3–1.4.7 shows that Bayesian estimators and maximum likelihood estimators are optimal as far as the order of decrease of the risk is concerned under power (polynomial) loss functions.

If we denote

S̄^(m)(Tn; θ) = inf_{|j|=1} S^(m)(Tn; θ + (2nK1(θ))^{−1/α} j), (1.88)

then for n > (2K1 h1^α)^{−1} one can show that

inf_{Tn} (S^(1)(Tn; θ) + S̄^(1)(Tn; θ)) ≥ 2^{−5} (2K1 n)^{−1/α}, (1.89)

which implies

inf_{Tn} (S^(m)(Tn; θ) + S̄^(m)(Tn; θ)) ≥ 2^{−m+1} 2^{−5m} (2K1 n)^{−m/α}. (1.90)

In the last step we have used the elementary inequality

a^m + b^m ≥ 2^{1−m}(a + b)^m, a > 0, b > 0, m ≥ 1. (1.91)


1.6 Regular Statistical Experiments. The Cramer-Rao Inequality

1.6.1 Regular Statistical Experiments

Consider the statistical experiment (𝒳, 𝔛, Pθ, θ ∈ Θ), where Θ is an open subset of R^k and all the measures Pθ are absolutely continuous with respect to a measure ν on 𝒳; moreover, dPθ/dν = p(x; θ).

Let p(x; θ) be a continuous function of θ on Θ for ν-almost all x. Assume that for ν-almost all x the density p(x; u) is differentiable at the point u = θ and that all the integrals

I_jj(θ) = ∫_𝒳 |∂p(x; θ)/∂θj|^2 ν(dx)/p(x; θ), j = 1, . . . , k, (1.92)

are finite. Here the integration is carried out over those x for which p(x; θ) ≠ 0, so that

I_jj(θ) = Eθ[(∂p(X; θ)/∂θj) (1/p(X; θ))]^2. (1.93)

Let us agree to interpret all integrals of the form

∫_𝒳 (a(x; θ)/p(x; θ)) ν(dx) (1.94)

as integrals over the set {x : p(x; θ) ≠ 0}, i.e. as

Eθ[a(X; θ)/p^2(X; θ)]. (1.95)

The Cauchy-Schwarz inequality implies that together with the I_jj(θ) all the integrals

I_ij(θ) = ∫_𝒳 (∂p(x; θ)/∂θi)(∂p(x; θ)/∂θj) ν(dx)/p(x; θ), i, j = 1, 2, . . . , k, (1.96)

are convergent. The matrix I(θ) whose ij-th entry is I_ij(θ) is called Fisher's information matrix.

Denote by L2(ν) the Hilbert space of functions square integrable with respect to the measure ν on 𝒳, with the scalar product (ϕ, ψ)ν = ∫_𝒳 ϕ(x)ψ(x)ν(dx) and the norm ‖ϕ‖ν. Note that Fisher's information amount can be written as

I(θ) = 4 ∫_𝒳 (∂p^{1/2}(x; θ)/∂θ)^2 dν = 4 ‖∂p^{1/2}/∂θ‖ν^2, (1.97)

and the information matrix as

I(θ) = 4 ∫_𝒳 (∂p^{1/2}(x; θ)/∂θ)(∂p^{1/2}(x; θ)/∂θ)^T dν. (1.98)
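Representation (1.97) can be probed by simulation, since it equals Eθ[(∂ ln p(X; θ)/∂θ)^2]. A sketch for the N(θ, σ^2) location model (an assumed example, not from [1]):

```python
import numpy as np

# Monte Carlo evaluation of the Fisher information amount
# I(theta) = E_theta[(d/dtheta ln p(X; theta))^2] for N(theta, sigma^2),
# where the score is (x - theta)/sigma^2 and the exact answer is 1/sigma^2.
rng = np.random.default_rng(6)
theta, sigma, m = 0.5, 2.0, 1_000_000

x = rng.normal(theta, sigma, size=m)
score = (x - theta) / sigma**2
print(f"Monte Carlo I(theta) ~ {np.mean(score**2):.5f}, exact = {1/sigma**2}")
```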

We say that the experiment possesses finite Fisher information at a point θ ∈ Θ if the function p^{1/2}(·; u) is differentiable at the point u = θ in L2(ν).

Set, for brevity, p^{1/2}(x; θ) = g(x; θ). The differentiability of the function g in L2(ν) means the following: there exists a function ψ : 𝒳 × Θ → R^k such that

∫_𝒳 |ψ(x; u)|^2 dν = ‖ψ(·; u)‖ν^2 < ∞, (1.99)

∫_𝒳 |g(x; θ + h) − g(x; θ) − ⟨ψ(x; θ), h⟩|^2 dν = o(|h|^2), h → 0. (1.100)

The matrix

I(θ) = 4 ∫_𝒳 ψ(x; θ)(ψ(x; θ))^T dν (1.101)

will be called, as before, the Fisher information matrix, and the integral

I(θ) = 4 ∫_𝒳 |ψ(x; θ)|^2 dν, Θ ⊂ R^1, (1.102)

will be called Fisher's information amount (measure).

15

If in addition the density p(x; u) is differentiable at the point θ, then of course

ψ(x; θ) = ∂p^{1/2}(x; θ)/∂θ (1.103)

and, for example,

I(θ) = 4 ∫_𝒳 |ψ(x; θ)|^2 dν = ∫_𝒳 (|∂p(x; θ)/∂θ|^2 / p(x; θ)) dν. (1.104)

Since our intention is to retain the classical expressions also in the case when the density p is not differentiable in the usual sense, we shall adhere to the following convention. Let c(s) be a differentiable function from R^1 into R^1. Formal differentiation yields

∂c(p(x; θ))/∂θ = c′(p(x; θ)) ∂p(x; θ)/∂θ (1.105)
= 2c′(p(x; θ)) p^{1/2}(x; θ) ∂p^{1/2}(x; θ)/∂θ (1.106)
= 2c′(p(x; θ)) p^{1/2}(x; θ) ψ(x; θ). (1.107)

The expression on the right hand side is defined provided the function g(x; θ) = p^{1/2}(x; θ) is differentiable in the mean square, and in this case we shall set by definition

∂c(p)/∂θ = 2c′(p) p^{1/2} ψ. (1.108)

For example, we have

∂p(x; θ)/∂θ = 2p^{1/2}(x; θ) ψ(x; θ), (1.109)
∂ ln p(x; θ)/∂θ = 2p^{−1/2}(x; θ) ψ(x; θ), (1.110)

and so on. Utilizing this convention, in the case of an experiment with finite Fisher information at the point θ we can use for the Fisher information matrix the notation

I(θ) = ∫_𝒳 (∂p/∂θ)(∂p/∂θ)^T p^{−1}(x; θ) dν. (1.111)

We shall adhere to this notation below. Moreover, below we shall always use the very same notation ∂/∂θ for ordinary derivatives as well as for derivatives in the mean square. In general, in order to construct a well developed theory of experiments with finite Fisher information it is necessary to impose certain smoothness conditions on the family {p(x; θ)}. We shall utilize the following definition:

Definition 1.6.1 (Regular Statistical Experiment). A statistical experiment E is called regular in Θ if

1. p(x; θ) is a continuous function of θ on Θ for ν-almost all x;

2. E possesses finite Fisher information at each point θ ∈ Θ;

3. the function ψ(·; θ) is continuous in the space L2(ν).

Note that conditions 1–3 are not totally independent; for example, if Θ ⊂ R^1 then conditions 2–3 imply condition 1. Namely, if the density p(x; θ) satisfies conditions 2–3 it can be modified on sets of ν-measure zero (these sets may depend on θ) in such a manner that it becomes a continuous function of θ. Indeed, the measure ν may be considered to be a probability measure, so that p^{1/2}(x; θ) is a random function of θ satisfying the condition

E(p^{1/2}(·; θ + h) − p^{1/2}(·; θ))^2 ≤ Bh^2, (1.112)

where B is a constant. In view of Kolmogorov's continuity criterion, there exists a modification of p^{1/2}(·; θ) which is continuous with probability one.


It should be verified that the "regularity" property of the experiment and its information matrix do not depend on the choice of the measure ν dominating the family {Pθ}. Let µ be another such measure and let q(x; θ) = dPθ/dµ. Then q(x; θ) = p(x; θ)γ(x), where γ does not depend on θ. Indeed, we have

dPθ/d(µ + ν)(x) = p(x; θ) dν/d(µ + ν)(x) = q(x; θ) dµ/d(µ + ν)(x). (1.113)

Consequently the functions p, q either both satisfy the conditions in Definition 1.6.1 or both do not satisfy these conditions. Moreover,

∫_𝒳 (∂√p/∂θ)(∂√p/∂θ)^T dν = ∫_𝒳 (∂√q/∂θ)(∂√q/∂θ)^T dµ. (1.114)

The following lemma is a simple corollary of Definition 1.6.1.

Lemma 1.6.1. Let E = {𝒳, 𝔛, Pθ, θ ∈ Θ} be a regular experiment. Then

1. The matrix I(θ) is continuous on Θ (i.e. all the functions I_ij(θ) are continuous on Θ).

2. The integrals I_ij(θ) converge uniformly on an arbitrary compact set K ⊂ Θ, i.e.

lim_{A→∞} sup_{θ∈K} ∫_𝒳 (|∂p/∂θi ∂p/∂θj| / p(x; θ)) I{x : |∂p/∂θi ∂p/∂θj| / p(x; θ) > A} dν = 0. (1.115)

3. If the interval {t + su : 0 ≤ s ≤ 1} belongs to Θ, then

g(x; t + u) − g(x; t) = ∫_0^1 ⟨∂g(x; t + su)/∂t, u⟩ ds (1.116)

in L2(ν) (i.e. the integral on the right hand side is the limit in L2(ν) of the corresponding Riemann sums).

Lemma 1.6.2. Let (𝒳, 𝔛, Pθ, θ ∈ Θ) be a regular experiment and let the statistic T : 𝒳 → R^1 be such that the function EuT^2 is bounded in a neighbourhood of a point θ ∈ Θ. Then the function EuT is continuously differentiable in a neighborhood of θ and

∂EuT/∂u = ∂/∂u ∫_𝒳 T(x) p(x; u) ν(dx) = ∫_𝒳 T(x) (∂p(x; u)/∂u) ν(dx). (1.117)

Remark 1.6.1. It follows from (1.117) and the Fubini theorem that for all bounded T we have

∫_𝒳 T(x)[p(x; θ + u) − p(x; θ)] dν = ∫_𝒳 T(x) ∫_0^1 ⟨∂p(x; θ + su)/∂θ, u⟩ ds dν, (1.118)

provided the interval {θ + su : 0 ≤ s ≤ 1} ⊂ Θ. Consequently, for ν-almost all x,

p(x; θ + u) − p(x; θ) = ∫_0^1 ⟨∂p(x; θ + su)/∂θ, u⟩ ds. (1.119)

Setting T(x) ≡ 1, we obtain

∫_𝒳 (∂p(x; θ)/∂θ) dν = ∂/∂θ ∫_𝒳 p(x; θ) dν = ∂1/∂θ = 0. (1.120)

Theorem 1.6.1. Let E1 = {𝒳1, 𝔛1, P_θ^1, θ ∈ Θ} and E2 = {𝒳2, 𝔛2, P_θ^2, θ ∈ Θ} be regular statistical experiments with Fisher informations I1(θ), I2(θ), respectively. Then the experiment E = E1 × E2 is also regular and, moreover,

I(θ) = I1(θ) + I2(θ), (1.121)

where I(θ) is the Fisher information of the experiment E.

Theorem 1.6.2. Let E = {𝒳, 𝔛, Pθ, θ ∈ Θ} be a regular experiment and let the experiment {𝒴, 𝔜, Qθ, θ ∈ Θ} generated by a statistic T : 𝒳 → 𝒴 be a part of the experiment E. Then the latter possesses finite Fisher information at all points θ ∈ Θ and, moreover,

Ī(θ) ≤ I(θ), (1.122)

where Ī(θ), I(θ) are the Fisher information matrices of the experiments {𝒴, 𝔜, Qθ, θ ∈ Θ} and {𝒳, 𝔛, Pθ, θ ∈ Θ}, respectively. If the density p(x; θ) is a continuously differentiable function of θ for ν-almost all x, then equality in (1.122) holds for all θ ∈ Θ if and only if T is a sufficient statistic for {Pθ}.


1.6.2 The Cramer-Rao Inequality

Theorem 1.6.3 (The Cramer-Rao Inequality). Let (𝒳, 𝔛, Pθ, θ ∈ Θ) be a regular experiment with an information matrix I(θ) > 0, Θ ⊂ R^k. If the statistic T = (T1, T2, . . . , Tk) : 𝒳 → R^k is such that the risk Eu|T − u|^2 is bounded in a neighborhood of the point θ ∈ Θ, then the bias

d(u) = EuT − u (1.123)

is continuously differentiable in a neighborhood of θ and the following matrix inequality is satisfied:

Eθ(T − θ)(T − θ)^T ≥ (J + ∂d(θ)/∂θ) I^{−1}(θ) (J + ∂d(θ)/∂θ)^T + d(θ)d(θ)^T, (1.124)

where J is the identity matrix. In particular, if Θ ⊂ R^1 and I(θ) > 0, then

Eθ|T − θ|^2 ≥ (1 + d′(θ))^2 / I(θ) + d^2(θ). (1.125)
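For an unbiased estimator (d ≡ 0) the bound (1.125) reads Eθ|T − θ|^2 ≥ 1/I(θ). A sketch checking it for X̄ in the Bern(θ) model, where the bound is attained (an assumed example, not from [1]):

```python
import numpy as np

# Cramer-Rao check for xbar in Bern(theta): the experiment of n observations
# has I(theta) = n/(theta*(1-theta)), d(theta) = 0, and xbar attains the bound.
rng = np.random.default_rng(7)
theta, n, reps = 0.3, 40, 200_000

xbar = rng.binomial(n, theta, size=reps) / n
print(f"E(xbar - theta)^2 ~ {np.mean((xbar - theta)**2):.6f}")
print(f"1/I(theta) = theta*(1-theta)/n = {theta*(1-theta)/n:.6f}")
```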

The Cramer-Rao inequality allows us to show that the existence of consistent estimators is connected, in a certain sense, with an unlimited inflow of Fisher information. Consider, for example, a sequence of regular statistical experiments E^(n) where the parameter set Θ is a bounded interval (a, b) on the real line. A sequence of consistent estimators {Tn} for θ ∈ Θ may exist only if for an arbitrary [α, β] ⊂ (a, b),

inf_{[α,β]} I(θ; E^(n)) → ∞ (1.126)

as n → ∞. Indeed, let the sequence Tn → θ in probability for all θ ∈ Θ. Since the set Θ is bounded, one can assume that |Tn| ≤ max(|a| + 1, |b| + 1). Therefore Eθ|Tn − θ|^2 → 0 for all θ. Assume that for some interval [α, β], I(θ; E^(n)) < M for all θ ∈ [α, β]. In view of the Cramer-Rao inequality,

(1 + d′n(θ))^2 / M + d_n^2(θ) → 0, θ ∈ [α, β]. (1.127)

Here dn(θ) is the bias of the estimator Tn. In this case dn(θ) → 0 for θ ∈ [α, β], and d′n(θ) → −1 for all θ ∈ [α, β]. However, in view of Lemma 1.6.2,

|d′n(θ)| = |∫_𝒳 Tn (∂/∂θ)(dP_θ^(n)/dν) dν − 1| ≤ (EθTn^2)^{1/2} I^{1/2}(θ; E^(n)) + 1 ≤ √M (|a| + |b| + 2) + 1. (1.128)

Lebesgue's theorem yields

lim_n ∫_α^τ d′n(u) du = ∫_α^τ lim_n d′n(u) du = −(τ − α), α < τ ≤ β. (1.129)

However, in this case

dn(τ) = ∫_α^τ d′n(u) du + dn(α) → −(τ − α) ≠ 0. (1.130)

The contradiction obtained proves our assertion.

The Cramer-Rao inequality is exact in the sense that there are cases when equality is attained. However, equality in the Cramer-Rao inequality is possible only for some special families of distributions. We shall exemplify this using a family with a one-dimensional parameter. In view of Lemma 1.6.2, for the estimator T,

d′(θ) + 1 = (d/dθ) EθT = ∫_𝒳 (T − EθT) (∂p(x; θ)/∂θ) ν(dx), (1.131)

and from the Cauchy-Schwarz inequality we have

Eθ(T − EθT)^2 ≥ (1 + d′(θ))^2 / I(θ). (1.132)

Moreover, Eθ(T − EθT)^2 = Eθ(T − θ)^2 − d^2(θ). Thus equality in the Cramer-Rao inequality is attainable only if

Eθ(T − EθT)^2 = (1 + d′(θ))^2 / I(θ) = (∂EθT/∂θ)^2 / I(θ). (1.133)


As is known, equality in the Cauchy-Schwarz inequality is attained only if

T − EθT = k(θ) ∂ ln p(x; θ)/∂θ. (1.134)

Analogous results are also valid for a multi-dimensional parameter.

Estimators for which the equality sign holds in the Cramer-Rao inequality are called efficient. They possess a number of nice properties. First, it follows from Theorem 1.6.2 that an efficient estimator is a sufficient statistic for θ. Second, under certain conditions an efficient estimator is admissible with respect to a quadratic loss function. Namely, the following result is valid.

Theorem 1.6.4. Let E = (𝒳, 𝔛, Pθ) be a regular experiment with a one-dimensional parameter set Θ = (a, b), −∞ ≤ a < b ≤ ∞. If the integrals

∫_a I(u) du, ∫^b I(u) du, I(u) = I(u; E), (1.135)

taken over neighborhoods of the endpoints a and b respectively, are both divergent, then the efficient estimator T0 for the parameter θ is admissible with respect to a quadratic loss function.

Example 1.6.1. In the one-dimensional Gaussian mean estimation problem N(θ, σ^2) we have the Fisher information

I(θ) = nσ^{−2}, (1.136)

where σ^2 is the variance of the variable Xi, and in view of Theorem 1.6.4 the efficient estimator X̄ is also admissible. Note that if the parameter set Θ is not the whole real line but, for example, the ray (0, ∞), then the conditions of Theorem 1.6.4 are no longer satisfied (the integral ∫_0 nσ^{−2} dθ is convergent), and the efficient estimator X̄ is inadmissible (it is worse than the estimator max(0, X̄)).

Example 1.6.2. Let us consider the Bern(θ) example. The sample average X̄ is an efficient estimator, and I(θ) = 1/(θ(1 − θ)). Since

∫_0 I(u) du = ∫_0 du/(u(1 − u)) = ∞, ∫^1 I(u) du = ∫^1 du/(u(1 − u)) = ∞, (1.137)

we know X̄ is an admissible estimator provided Θ = (0, 1).

1.6.3 Bounds on the Hellinger Distance r_2^2(θ; θ′) in Regular Experiments

Consider a regular experiment and assume that Θ is a convex subset of R^k.

Theorem 1.6.5. The following inequality is valid:

r2^2(θ; θ + h) = ∫_𝒳 |p^{1/2}(x; θ + h) − p^{1/2}(x; θ)|^2 dν ≤ (|h|^2/4) ∫_0^1 tr I(θ + sh) ds. (1.138)

Moreover, if K is a compact subset of Θ, then uniformly in K,

lim inf_{h→0} |h|^{−2} ∫_𝒳 |p^{1/2}(x; θ + h) − p^{1/2}(x; θ)|^2 dν ≥ (1/4) inf_{|u|=1} ⟨I(θ)u, u⟩. (1.139)

Let a sequence of iid observations X1, X2, . . . , Xn generate regular experiments with probability density f(x; θ), θ ∈ Θ ⊂ R^k, and information matrix I(θ). Let Θ be a convex bounded set and let the matrix I(θ) be strictly positive at each point θ. Assume also that for any compact K ⊂ Θ and all δ > 0,

inf_{θ∈K} inf_{|h|≥δ} ∫_𝒳 |f^{1/2}(x; θ + h) − f^{1/2}(x; θ)|^2 dν > 0. (1.140)

It follows from Theorem 1.6.5 and the last inequality that to any compact set K ⊂ Θ there correspond two constants a, A > 0 such that for all θ ∈ K,

a|h|^2/(1 + |h|^2) ≤ ∫_𝒳 |f^{1/2}(x; θ + h) − f^{1/2}(x; θ)|^2 dν ≤ A|h|^2. (1.141)

This, together with the theorems on large deviations of maximum likelihood and Bayesian estimators as well as the lower bounds on the risk function, implies that the rate of convergence of the best estimators to the value θ is of order n^{−1/2}.

For example, if Θ ⊂ R^1 and θ̂n is a maximum likelihood estimator, then it follows from Theorem 1.4.2 that for all p > 0,

lim sup_{n→∞} sup_{θ∈K} Eθ|√n(θ̂n − θ)|^p < ∞. (1.142)


1.7 Approximating Estimators by Means of Sums of Independent Random Variables

Let X1, X2, . . . be iid observations with density f(x; θ) with respect to the measure ν, Θ ⊂ R^k. If for h → 0,

∫_𝒳 |f^{1/2}(x; θ + h) − f^{1/2}(x; θ)|^2 dν ~ |h|^α, (1.143)

then the results obtained on large deviations of estimators allow us to hope that under some additional restrictions the distributions of the random variables

n^{1/α}(θ̂n − θ), n^{1/α}(tn − θ) (1.144)

converge to proper limit distributions as n → ∞. Such theorems will indeed be proved in the subsequent chapters. Here, as an example, we shall present some results along these lines for regular experiments.

Let the experiments En generated by X1, X2, . . . , Xn be regular. Let I(θ) denote the Fisher information matrix and set l(x; θ) = ln f(x; θ). Below we shall assume that Θ is a convex bounded open subset of R^k and that uniformly with respect to θ ∈ K ⊂ Θ,

inf_{|h|≥δ} ∫_𝒳 |f^{1/2}(x; θ + h) − f^{1/2}(x; θ)|^2 dν > 0. (1.145)

Theorem 1.7.1. Let the function l(x; θ) be twice differentiable in θ for all x. Assume, furthermore, that there exists a number δ, 0 < δ ≤ 1, such that for any compact set K ⊂ Θ,

sup_{θ∈K} Eθ|∂l(X1; θ)/∂θ|^{2+δ} < ∞, (1.146)

sup_{θ∈K} Eθ|∂^2 l(X1; θ)/∂θi∂θj|^{1+δ} < ∞, (1.147)

Eθ( sup_{θ,θ′∈K} |∂^2 l(X1; θ)/∂θi∂θj − ∂^2 l(X1; θ′)/∂θi∂θj| · |θ − θ′|^{−δ} ) < ∞. (1.148)

Then there exists a number ε > 0 such that, with Pθ-probability one, uniformly in θ ∈ K as n → ∞ we have

√n I(θ)(θ̂n − θ) = (1/√n) ∑_{j=1}^{n} ∂l(Xj; θ)/∂θ + O(n^{−ε}). (1.149)

Theorem 1.7.2. Let the conditions of Theorem 1.7.1 be satisfied. If tn is a Bayesian estimator with respect to a prior density q(θ) > 0 continuous on Θ and a convex loss function w ∈ W_p, then there exists ε > 0 such that with Pθ-probability one as n → ∞ we have

√n I(θ)(tn − θ) = (1/√n) ∑_{j=1}^{n} ∂l(Xj; θ)/∂θ + O(n^{−ε}). (1.150)

Theorems 1.7.1 and 1.7.2 imply that in the regular case the estimators θ̂n, tn are well approximated by sums of the independent random variables ∂l(Xj; θ)/∂θ; in particular, an appeal to limit theorems for sums of independent random variables allows us to obtain analogous limit theorems for the estimators θ̂n, tn. For example, it follows directly from the central limit theorem that under certain conditions the differences √n(θ̂n − θ), √n(tn − θ) are asymptotically normal with mean zero and correlation matrix I^{−1}(θ).
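A simulation sketch of this CLT consequence in the Bernoulli model (an assumed example, not from [1]):

```python
import numpy as np

# CLT consequence of (1.149) for Bern(theta): the MLE is xbar, and
# sqrt(n)*(xbar - theta) is approximately N(0, I^{-1}(theta)) with
# I(theta) = 1/(theta*(1-theta)) per observation.
rng = np.random.default_rng(8)
theta, n, reps = 0.3, 2_000, 100_000

xbar = rng.binomial(n, theta, size=reps) / n
z = np.sqrt(n) * (xbar - theta)
print(f"variance of sqrt(n)(MLE - theta) ~ {z.var():.5f}")
print(f"I^-1(theta) = theta*(1-theta)  = {theta*(1-theta):.5f}")
```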

If θ is a one-dimensional parameter, the law of the iterated logarithm for θ̂n, tn follows from the law of the iterated logarithm for sums of independent random variables. We have

Pθ( lim sup_n √(nI(θ)/(2 ln ln n)) (θ̂n − θ) = − lim inf_n √(nI(θ)/(2 ln ln n)) (θ̂n − θ) = 1 ) = 1, (1.151)

Pθ( lim sup_n √(nI(θ)/(2 ln ln n)) (tn − θ) = − lim inf_n √(nI(θ)/(2 ln ln n)) (tn − θ) = 1 ) = 1. (1.152)


1.8 Asymptotic Efficiency

1.8.1 Basic Definition

The term asymptotically efficient estimator was introduced by R. Fisher to designate consistent asymptotically normal estimators with asymptotically minimal variance. The motivation was that estimators of this kind should be preferable from the asymptotic point of view. The program outlined by R. Fisher consisted in showing that:

1. if θ̂n is a maximum likelihood estimator, then under some natural regularity conditions the difference √n(θ̂n − θ) is asymptotically normal with parameters (0, I^{−1}(θ));

2. if Tn is an asymptotically normal sequence of estimators, then

lim inf_{n→∞} Eθ(√n(Tn − θ))^2 ≥ I^{−1}(θ), Θ ⊂ R^1. (1.153)

Should these conjectures be verified to be true, one could then indeed consider maximum likelihood estimators to be asymptotically best (in the class of asymptotically normal estimators). However, the program as stated above cannot be realized, for the simple reason that estimators with minimal variance do not exist.

Indeed, let Tn be an arbitrary sequence of estimators and let the difference √n(Tn − θ) be asymptotically normal with parameters (0, σ²(θ)). If σ²(θ0) ≠ 0, define the estimator

T̄n = { Tn, if |Tn − θ0| > n^{−1/4}; θ0, if |Tn − θ0| ≤ n^{−1/4}. (1.154)

It is easy to verify that the sequence √n(T̄n − θ) is asymptotically normal with parameters (0, σ̄²(θ)), where σ̄²(θ) = σ²(θ) if θ ≠ θ0 and σ̄²(θ0) = 0 < σ²(θ0). In particular, applying this method of improving estimators to maximum likelihood estimators one can construct (under regularity conditions) a sequence of asymptotically normal estimators {T̄n} such that

lim_{n→∞} E_θ(√n(T̄n − θ))² ≤ I^{−1}(θ), (1.155)

and at certain points

lim_{n→∞} E_θ(√n(T̄n − θ))² < I^{−1}(θ). (1.156)

Estimators {Tn} such that for θ ∈ Θ,

lim_{n→∞} E_θ(√n(Tn − θ))² ≤ I^{−1}(θ), (1.157)

and for at least one point θ the strict inequality is valid, are called superefficient with respect to a quadratic loss function, and points θ at which the strict inequality holds are called points of superefficiency. Thus if X1, X2, . . . are iid N(θ, 1) random variables, then X̄ is the maximum likelihood estimator for θ, and the estimator

Tn = { X̄, if |X̄| > n^{−1/4}; αX̄, if |X̄| ≤ n^{−1/4}, (1.158)

where |α| < 1, is a superefficient estimator with superefficiency point θ = 0. Chronologically this is the first example of a superefficient estimator, and it is due to Hodges.

Various authors have proposed several definitions of the notion of an asymptotically efficient estimator which retain the asymptotic efficiency of maximum likelihood estimators. We shall now present some of these definitions.

Returning to the N(θ, 1) model, we note that for all θ, E_θ(√n(X̄ − θ))² ≤ 1, while one can find a sequence of parameter values θn → 0 such that

lim inf_{n→∞} E_{θn}(√n(Tn − θn))² > 1 (1.159)

(for example, one can choose θn = c/√n, c ≠ 0). Thus in the vicinity of the point of superefficiency there are located points θn where the estimator Tn behaves worse than X̄. The first definition of asymptotic efficiency is based precisely on the minimax property which X̄ (but not Tn) possesses.
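This local deterioration is easy to see in simulation. The sketch below (ours; it uses Hodges' truncation (1.154) with θ0 = 0, equivalently (1.158) with α = 0, and exploits the fact that for N(θ, 1) data the sample mean is exactly N(θ, 1/n)) evaluates the normalized quadratic risk at θ = 0 and at points θ = c/√n:

import numpy as np

rng = np.random.default_rng(1)
n, reps = 10_000, 100_000

def hodges(xbar):
    # Tn of (1.154) with theta_0 = 0
    return np.where(np.abs(xbar) > n ** -0.25, xbar, 0.0)

for theta in [0.0, 2.0 / np.sqrt(n), 5.0 / np.sqrt(n)]:
    xbar = theta + rng.standard_normal(reps) / np.sqrt(n)  # exact law of the mean
    risk = n * np.mean((hodges(xbar) - theta) ** 2)
    print(f"theta = {theta:.4f}:  n E(Tn - theta)^2 = {risk:.3f}")

At θ = 0 the risk is near 0 (superefficiency); at θ = c/√n the truncation still forces Tn ≈ 0, so the risk is approximately c², exceeding the risk 1 of X̄, which is the content of (1.159).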

The merit of this definition is that it is in complete correspondence with the principle of comparing estimators based ontheir risk functions. Unlike Fisher’s definition, it does not limit the class of competing estimators to asymptotically normalestimators and therefore makes sense also for nonregular statistical problems.

21

Definition 1.8.1. Let (X^(ε), 𝒳^(ε), P_θ^(ε), θ ∈ Θ) be a family of statistical experiments. A family of estimators θε is called wε-asymptotically efficient in K ⊂ Θ (asymptotically efficient with respect to the family of loss functions wε) if for any nonempty open set U ⊂ K the relation

lim inf_{ε→0} [ inf_{Tε} sup_{u∈U} E_u^(ε) wε(Tε − u) − sup_{u∈U} E_u^(ε) wε(θε − u) ] = 0 (1.160)

is satisfied.

As this definition plays a basic role in our further exposition, we shall present a few comments.

1. Should the square bracket on the left-hand side of (1.160) vanish, it would mean that θε is a minimax estimator in U for the loss function wε. Therefore one can assert that an estimator is called asymptotically efficient in Θ if it is asymptotically minimax in any nonempty (and independent of ε) open set U ⊂ Θ.

2. Let wε(x) = w(x) ∈ W and |w(x)| < c, so that the loss function does not depend on ε and is bounded. Then clearly for any estimator θε uniformly consistent in K,

lim_{ε→0} sup_{u∈K} E_u^(ε) wε(θε − u) = 0. (1.161)

This implies (1.160), and thus any uniformly consistent estimator will be asymptotically efficient for such w. Obviously such loss functions are not of great interest. In order to take into account more subtle differences between the estimators it is necessary that the loss function itself depend on ε. For example, for regular experiments generated by iid observations it is natural to set wn(x) = w(√n x), w ∈ W.

3. Definition 1.8.1 can be localized in the following manner. The estimator θε is called wε-asymptotically efficient at a point θ ∈ Θ provided

lim_{δ→0} lim inf_{ε→0} [ inf_{Tε} sup_{|u−θ|<δ} E_u^(ε) wε(Tε − u) − sup_{|u−θ|<δ} E_u^(ε) wε(θε − u) ] = 0. (1.162)

Obviously, asymptotic efficiency in K implies asymptotic efficiency at any interior point θ ∈ K.

In relation to Definition 1.8.1, Fisher's program is now realized: in a wide class of problems asymptotically efficient estimators exist, and under regularity conditions such estimators are maximum likelihood estimators.

Observe that the left-hand side of (1.160) is bounded from above by 0 for any estimator θε. Therefore, in order to prove asymptotic efficiency of an estimator θ*ε it is sufficient first to verify that, uniformly in U ⊂ K, the limit

lim_{ε→0} E_u^(ε) wε(θ*ε − u) = L(u) (1.163)

exists, and next to prove that for any estimator Tε and any nonempty open set U ⊂ K the inequality

lim inf_{ε→0} sup_{u∈U} E_u^(ε) wε(Tε − u) ≥ sup_{u∈U} L(u) (1.164)

is valid. A general method for obtaining lower bounds of the type (1.164) is given by the following theorem.

Theorem 1.8.1. Let (X^(ε), 𝒳^(ε), P_θ^(ε), θ ∈ Θ), Θ ⊂ R^k, be a family of statistical experiments and let tε,q be a Bayesian estimator of the parameter θ with respect to the loss function wε and the prior density q. Assume that for any continuous prior density which is positive at the point u ∈ K the relation

lim_{ε→0} E_u^(ε) wε(tε,q − u) = L(u) (1.165)

is satisfied and the function L(u) is continuous and bounded in K. Then for any estimator Tε and any nonempty open set U ⊂ K the relation (1.164) is fulfilled.

Remark 1.8.1. It is sufficient to check that (1.165) is fulfilled not for all prior densities q, but only for a certain subclass Q, provided that for any U ⊂ K,

sup_U L(u) = sup_{q∈Q} ∫_U L(u)q(u)du. (1.166)

For example, one may choose Q = {qδ(u; t)}, where

qδ(u; t) = { 1/λ({u : |u − t| ≤ δ}), if |u − t| ≤ δ; 0, if |u − t| > δ, (1.167)

and λ denotes Lebesgue measure.


Corollary 1.8.1. If the conditions of Theorem 1.8.1 are fulfilled and for some estimator θ*ε equality (1.163) is satisfied uniformly in u ∈ K, then the estimator θ*ε is wε-asymptotically efficient in K.

Corollary 1.8.2. If the conditions of Theorem 1.8.1 are satisfied for wε(x) = (ϕ(ε)x)², Θ ⊂ R¹, then for an estimator θ*ε to be wε-asymptotically efficient in K it is necessary that for any density q(u), u ∈ K, the equality

lim_{ε→0} ∫_K E_u^(ε) (ϕ(ε)(θ*ε − tε,q))² q(u)du = 0 (1.168)

be fulfilled.

1.8.2 Examples

Example 1.8.1. Consider a sequence of iid observations X1, X2, . . . , Xn with density f(x; θ) satisfying the conditions of Theorems 1.7.1 and 1.7.2. In view of Theorem 1.7.2 there exists the limit

lim_{n→∞} E_u w(√n(tn,qδ − u)) = √(det I(u)/(2π)^k) ∫_{R^k} w(y) e^{−⟨I(u)y,y⟩/2} dy = L(u) (1.169)

and hence for any sequence of estimators {Tn} the inequality

lim inf_{n→∞} sup_{u∈U} E_u w(√n(Tn − u)) ≥ sup_{u∈U} L(u) (1.170)

is valid for all U ⊂ Θ. In view of Theorems 1.7.1 and 1.7.2, for Bayesian and maximum likelihood estimators equality is attained in the last inequality, and these estimators are thus asymptotically efficient.

Example 1.8.2. Let X1, X2, . . . , Xn be iid random variables with uniform distribution on [θ − 1/2, θ + 1/2]. If one chooses qδ for the prior distribution, the posterior distribution after n observations will be uniform on the interval

[αn, βn] = [θ − δ, θ + δ] ∩ [max Xi − 1/2, min Xi + 1/2]. (1.171)

Therefore, if w is an even function increasing on [0, ∞), the Bayesian estimator is tn,qδ = (αn + βn)/2. Since

lim_{n→∞} P_u( n(1/2 + u − max Xi) > s1, n(min Xi + 1/2 − u) > s2 ) = e^{−s1−s2}, (1.172)

we have for any δ,

lim_n E_u w(n(tn,qδ − u)) = E w((τ1 + τ2)/2), (1.173)

where τ1, −τ2 are iid Exp(1) random variables. Thus in this example the classical estimator for θ, namely tn = (max Xi + min Xi)/2, is again asymptotically efficient. However, in this case the Fisher information is not defined and the asymptotic distribution of n(tn − θ) is not normal.
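The n-rate and the exponential limit law are easy to confirm by simulation. The following sketch (ours, under the uniform model above) compares n(tn − θ) with the limit (τ1 + τ2)/2, writing τ2 = −τ2′ with τ1, τ2′ iid Exp(1):

import numpy as np

rng = np.random.default_rng(2)
theta, n, reps = 0.3, 1000, 5000

x = rng.uniform(theta - 0.5, theta + 0.5, size=(reps, n))
tn = 0.5 * (x.max(axis=1) + x.min(axis=1))   # midrange estimator
zn = n * (tn - theta)                        # note the normalization n, not sqrt(n)

tau = rng.exponential(size=(reps, 2))
limit = 0.5 * (tau[:, 0] - tau[:, 1])        # (tau1 + tau2)/2 with tau2 = -tau2'
print("var of n(tn - theta):", zn.var(), " limit var:", limit.var())  # both near 1/2
print("P(|.| > 0.5):", (np.abs(zn) > 0.5).mean(), (np.abs(limit) > 0.5).mean())

Both variances are close to 1/2 and the tail probabilities agree, while √n-normalization would send the rescaled error to zero.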

1.8.3 Bahadur’s Asymptotic Efficiency

Consider once more the family (X^(ε), 𝒳^(ε), P_θ^(ε), θ ∈ Θ), and let Tε be an estimator for θ constructed from this experiment. R. Bahadur suggested measuring the asymptotic efficiency of the estimators {Tε} by the degree of concentration of the estimator in an interval of fixed (independent of ε) length centered at θ, i.e., by the magnitude of the probability

P_θ^(ε)(|Tε − θ| < γ). (1.174)

As follows from the results on inequalities for probabilities of large deviations, under quite general assumptions the probabilities

P_θ^(ε)(|Tε − θ| ≥ γ) (1.175)

for a fixed γ decrease, in the case of “nice” estimators, exponentially in ε^{−1}. If ε^{−1/α} is the “correct” normalizing factor, then it is natural to select as the measure of asymptotic efficiency the supremum over all Tε of the expressions

lim inf_{γ→0} lim inf_{ε→0} (ε/γ^α) ln P_θ^(ε)(|Tε − θ| ≥ γ). (1.176)

The following asymptotic inequality in the iid observation model serves as the basis for calculations connected with efficient estimators in the Bahadur sense.


Theorem 1.8.2. Let {Tn} be a consistent estimator for θ. Then for all γ′ > γ,

lim inf_{n→∞} (1/n) ln P_θ(|Tn − θ| > γ) ≥ − ∫_X f(x; θ + γ′) ln [ f(x; θ + γ′) / f(x; θ) ] ν(dx). (1.177)

Theorem 1.8.2 is actually a particular case of Stein's lemma in hypothesis testing. It provides the bound on the probability of an error of the first kind for testing the hypothesis θ against the alternative θ + γ′ by means of a test constructed from Tn.

Inequality (1.177) results in the following bound on asymptotic efficiency in the Bahadur sense: for any estimator Tn,

lim inf_{γ→0} lim_{n→∞} (1/(γ^α n)) ln P_θ(|Tn − θ| > γ) ≥ lim_{γ→0} (1/γ^α) ∫_X f(x; θ + γ) ln [ f(x; θ) / f(x; θ + γ) ] ν(dx). (1.178)

Assume now that f(x; θ) possesses finite Fisher information I(θ). Then α = 2 should be taken in (1.178); moreover, under this assumption the normal distribution ought to play a prominent role, since the estimators which are commonly used are asymptotically normally distributed.
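In the N(θ, 1) location model the bound is attained with equality by the sample mean, and this can be seen in closed form. The following sketch (ours; it uses I(θ) = 1 and the fact that X̄ is exactly N(θ, 1/n), so the tail probability is available via the complementary error function) shows (1/(nγ²)) ln P_θ(|X̄ − θ| > γ) approaching −I/2:

import math

gamma = 0.1
for n in [10**2, 10**3, 10**4, 10**5]:
    tail = math.erfc(math.sqrt(n) * gamma / math.sqrt(2.0))  # P(|Xbar - theta| > gamma)
    print(n, math.log(tail) / (n * gamma * gamma))           # -> -I/2 = -0.5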

Following Bahadur we shall call the quantity τn = τn(θ, γ, Tn), defined by the equality

P_θ(|Tn − θ| > γ) = √(2/π) ∫_{γ/τn}^∞ e^{−u²/2} du = P(|ξ| ≥ γ/τn), (1.179)

where L(ξ) = N(0, 1), the efficient standard deviation of the estimator Tn. Clearly the sequence {Tn} is consistent if and only if τn → 0 as n → ∞ for any γ > 0 and θ ∈ Θ. If the variable Tn is normal with mean θ, then τn² coincides with the variance of Tn. The well-known inequality

(z/(1 + z²)) e^{−z²/2} < ∫_z^∞ e^{−u²/2} du < (1/z) e^{−z²/2} (1.180)

allows us to assert that

lim inf_{γ→0} lim inf_{n→∞} (1/(nγ²)) ln P_θ(|Tn − θ| > γ) = −(1/2) lim inf_{γ→0} lim_{n→∞} 1/(nτn²). (1.181)

Theorem 1.8.3. Let the parameter set Θ = (a, b) ⊂ R¹ and let the density f(x; θ) satisfy all the conditions of Theorem 1.7.1. Then for any estimator Tn,

lim inf_{γ→0} lim inf_{n→∞} (1/(nγ²)) ln P_θ(|Tn − θ| > γ) = −(1/2) lim inf_{γ→0} lim_{n→∞} 1/(nτn²) ≥ −(1/2) I(θ), (1.182)

where I(θ) is the Fisher information. If, in addition, −∞ < a < b < ∞, the function ∂²/∂θ² ln f(x; θ) < 0 for all θ ∈ Θ and almost all x ∈ X, and θn is a maximum likelihood estimator, then

−(1/2) lim_{γ→0} lim_{n→∞} n^{−1} τn^{−2}(θ; γ, θn) = −(1/2) I(θ). (1.183)

Thus under specific assumptions, for example under the conditions of Theorem 1.8.3, maximum likelihood estimators are asymptotically efficient, and the phenomenon of superefficiency is not present under Bahadur's definition. It is somewhat unfortunate, however, that the class of efficient estimators now also contains certain undesirable estimators such as Hodges' estimator.

Bahadur's definition does not completely eliminate the evil of superefficiency. This phenomenon may occur, for example, due to the fact that the hypotheses θ = t and θ = t + γ are very well distinguishable. Consider the following example.

Example 1.8.3. Assume that X1, X2, . . . , Xn are independent observations on the line with a uniform distribution on the interval |x − θ| ≤ 1/2, where the parameter θ ∈ R¹. The accepted best estimator for θ is

θn = (max Xj + min Xj)/2, (1.184)

and its efficiency is

lim_{γ→0} lim_{n→∞} (1/(nγ)) ln P_θ(|θn − θ| > γ) = −2. (1.185)

In order to prove the last relation it is sufficient to verify that θn − θ possesses a density equal to n(1 − 2|α|)^{n−1} for |α| ≤ 1/2, vanishing outside this interval.

Now set

θ̄n = { θn, if at least one observation Xj ∉ [−1/2, 1/2]; 0, if all Xj ∈ [−1/2, 1/2]. (1.186)


Then

P_θ(|θ̄n| > γ) = 0, θ = 0, (1.187)

P_θ(|θ̄n − θ| > γ) ≤ P_θ(|θn − θ| > γ) + (1 − |θ|)^n, |θ| < 1, θ ≠ 0, (1.188)

P_θ(|θ̄n − θ| > γ) = P_θ(|θn − θ| > γ), |θ| ≥ 1. (1.189)

Consequently, for 0 < γ < 1/2,

ln P_θ(|θ̄n − θ| > γ) ≤ { −∞, θ = 0; ln( e^{−2nγ} + e^{−n|θ|} ), θ ≠ 0, (1.190)

So that Bahadur’s efficiency is

limγ→0

limn→∞

1

nγlnPθ(|θn − θ| > γ) ≤

{−∞ θ = 0

−2 θ 6= 0(1.191)

and the point θ = 0 is a point of superefficiency.

Example 1.8.4. An analogous example may be constructed also for observations with finite Fisher information. As above, let X1, X2, . . . , Xn be independent observations on the real line with density f(x − θ), where f(x) = 0 for |x| ≥ 1 and f(x) > 0 for |x| < 1. The function f(x) is assumed to be sufficiently smooth so that

I = ∫_{−1}^{1} |f′(x)|²/f(x) dx < ∞. (1.192)

If moreover (ln f(x))″ ≤ 0 on [−1, 1], then, using an argument similar to the one utilized in the proof of Theorem 1.8.3, one can show that for a maximum likelihood estimator θn,

lim_{γ→0} lim_{n→∞} (1/(nγ²)) ln P_θ(|θn − θ| > γ) = −(1/2) I. (1.193)

Set

θ̄n = { θn, if at least one observation Xj ∉ [−1, 1]; 0, if all Xj ∈ [−1, 1]. (1.194)

It turns out that

lim_{γ→0} lim_{n→∞} (1/(nγ²)) ln P_θ(|θ̄n − θ| > γ) ≤ { −∞, θ = 0; −(1/2) I, θ ≠ 0, (1.195)

and the estimator θ̄n is superefficient in the Bahadur sense at the point θ = 0.

1.8.4 Efficiency in C. R. Rao’s Sense

Behind the definition suggested by C. R. Rao are quite different considerations: the efficiency of the estimators is determined here by some other properties rather than by their closeness to the estimated parameter. We shall limit ourselves to the case of iid observations X1, X2, . . . , Xn with common density f(x; θ), θ ∈ R¹, for which there exists finite Fisher information I(θ). C. R. Rao admits for comparison only those estimators Tn = Tn(X1, X2, . . . , Xn) whose distributions possess densities ϕn(·; θ) with respect to a measure νn, where ϕn(·; θ) is an absolutely continuous function of θ. Rao's motivation is as follows: since the likelihood ratio pn(X^n; θ)/pn(X^n; θ′) plays a basic role in statistical problems, for a “nice” estimator Tn the ratio ϕn(Tn; θ)/ϕn(Tn; θ′) should be close to the likelihood ratio pn(X^n; θ)/pn(X^n; θ′). This idea is formalized in the following definition: a sequence of estimators {Tn} is asymptotically efficient if

lim_{n→∞} (1/√n) | (d/dθ) ln pn(X^n; θ) − (d/dθ) ln ϕn(Tn; θ) | = 0 (1.196)

in probability for all θ ∈ Θ. Since it is difficult to verify (1.196) directly, C. R. Rao proposes as the basic definition the following.


Definition 1.8.2. A statistic Tn is called asymptotically efficient in Rao's sense if there exist two nonrandom functions α(θ), β(θ) such that

(1/√n) (d/dθ) ln pn(X^n; θ) − α(θ) − β(θ)√n(Tn − θ) → 0 (1.197)

in probability as n → ∞.

It follows directly from Theorems 1.7.1 and 1.7.2 that if the conditions of these theorems are satisfied, then maximum likelihood estimators as well as Bayesian estimators are asymptotically efficient in C. R. Rao's sense, provided we choose α(θ) = 0, β(θ) = I(θ).

It can be shown that under the conditions of Theorem 1.7.1 and some additional restrictions on the distribution of Tn, the Fisher information contained in Tn coincides asymptotically with the Fisher information contained in the whole sample:

lim_{n→∞} (1/(nI(θ))) E_θ | (d/dθ) ln ϕn(Tn; θ) |² = 1. (1.198)

It is pleasing to realize that although there is no complete unanimity among statisticians as to the notion of efficiency, theclasses of estimators which different groups of statisticians consider to be efficient become identical provided some reasonableconditions are satisfied (such as the conditions of Theorem 1.7.1).

1.9 Two Theorems on the Asymptotic Behavior of Estimators

Consider a family of statistical experiments (X^(ε), 𝒳^(ε), P_θ^(ε), θ ∈ Θ) generated by observations Xε. Here Θ is an open subset of R^k. Let pε(Xε; θ) = (dP_θ^(ε)/dνε)(Xε), where νε is a measure on 𝒳^(ε). Set

Zε,θ(u) = Zε(u) = pε(Xε; θ + ϕ(ε)u) / pε(Xε; θ), (1.199)

where ϕ(ε) denotes some nondegenerate normalizing matrix; it is also assumed that |ϕ(ε)| → 0 as ε → 0. Thus the function Zε,θ is defined on the set Uε = (ϕ(ε))^{−1}(Θ − θ).

Denote by C0(R^k) the normed space of functions continuous on R^k which vanish at infinity, equipped with the norm |ψ| = sup_y |ψ(y)|.

Below we shall denote by G the set of families of functions {gε(y)} possessing the following properties:

1. For fixed ε, gε(y) is a function defined on [0, ∞) monotonically increasing to ∞;

2. For any N > 0,

lim_{y→∞, ε→0} y^N e^{−gε(y)} = 0. (1.200)

For example, if gε(y) = y^α, α > 0, then {gε} ∈ G.

Denote by θε the maximum likelihood estimator for θ.

Theorem 1.9.1. Let the parameter set Θ be an open subset of R^k and let the functions Zε,θ(u) be continuous with probability one and possess the following properties:

1. To any compact K ⊂ Θ there correspond numbers a(K) = a and B(K) = B and functions g_ε^K(y) = gε(y), {gε} ∈ G, such that

(a) there exist numbers α > k, m ≥ α such that for θ ∈ K,

sup_{|u1|≤R, |u2|≤R, u1,u2∈Uε} |u2 − u1|^{−α} E_θ^(ε) |Z_{ε,θ}^{1/m}(u2) − Z_{ε,θ}^{1/m}(u1)|^m ≤ B(1 + R^a); (1.201)

(b) for all u ∈ Uε, θ ∈ K,

E_θ^(ε) Z_{ε,θ}^{1/2}(u) ≤ e^{−gε(|u|)}. (1.202)

2. Uniformly in θ ∈ K as ε → 0, the marginal (finite-dimensional) distributions of the random functions Zε,θ(u) converge to the marginal distributions of random functions Zθ(u) = Z(u), where Z ∈ C0(R^k).

3. The limit functions Zθ(u) attain, with probability one, their maximum at the unique point u(θ) = u.


Then, uniformly in θ ∈ K, the distributions of the random variables ϕ^{−1}(ε)(θε − θ) converge to the distribution of u(θ), and for any continuous loss function w ∈ Wp we have, uniformly in θ ∈ K,

lim_{ε→0} E_θ^(ε) w(ϕ^{−1}(ε)(θε − θ)) = E w(u(θ)). (1.203)

A similar theorem can be proved for Bayesian estimators as well. Consider the family {tε} of Bayesian estimators with respect to the loss function Wε(u, v) = l(ϕ^{−1}(ε)|u − v|) and prior density q. We shall assume that l ∈ Wp and that, moreover, for all sufficiently large H and sufficiently small γ,

inf_{|u|>H} l(u) − sup_{|u|≤Hγ} l(u) ≥ 0. (1.204)

Denote by Q the set of continuous positive functions q : Rk → R1 possessing a polynomial majorant.

Theorem 1.9.2. Let {tε} be a family of Bayesian estimators with respect to a loss function l(ϕ^{−1}(ε)|u − v|), where l ∈ Wp satisfies (1.204), and a prior density q ∈ Q. Assume that the normalized likelihood ratio Zε,θ(u) possesses the following properties:

1. To any compact K ⊂ Θ there correspond numbers a(K) = a, B(K) = B and functions g_ε^K(y) = gε(y) ∈ G such that

(a) for some α > 0 and all θ ∈ K,

sup_{|u1|≤R, |u2|≤R} |u2 − u1|^{−α} E_θ^(ε) |Z_{ε,θ}^{1/2}(u2) − Z_{ε,θ}^{1/2}(u1)|² ≤ B(1 + R^a); (1.205)

(b) for all u ∈ Uε, θ ∈ K,

E_θ^(ε) Z_{ε,θ}^{1/2}(u) ≤ e^{−gε(|u|)}. (1.206)

2. The marginal distributions of the random functions Zε,θ(u) converge, uniformly in θ ∈ K as ε → 0, to the marginal distributions of random functions Zθ(u) = Z(u).

3. The random function

ψ(s) = ∫_{R^k} l(s − u) [ Z(u) / ∫_{R^k} Z(v)dv ] du (1.207)

attains its absolute minimum value at the unique point τ(θ) = τ.

Then the distributions of the random variables ϕ^{−1}(ε)(tε − θ) converge, uniformly in θ ∈ K, to the distribution of τ, and for any continuous loss function w ∈ Wp we have, uniformly in θ ∈ K,

lim_{ε→0} E_θ^(ε) w(ϕ^{−1}(ε)(tε − θ)) = E w(τ(θ)). (1.208)

Remark 1.9.1. Note that condition 1(a) is substantially weaker than the analogous condition of Theorem 1.9.1. Condition 3 is automatically satisfied if l is a convex function with a unique minimum, for example, if l(u) = |u|^p, p ≥ 1.

1.9.1 Examples

Example 1.9.1. Consider iid observations Xj with density f(x; θ), θ ∈ Θ ⊂ R^k, with respect to some measure ν. Assume that the requirements for approximating the likelihood ratio by a sum of iid random variables (Section 1.7) are satisfied; in particular, the Fisher information I(θ) exists.

As we already know, if one chooses ϕ(n) = 1/√n and sets

Zn(u) = Π_{j=1}^n f(Xj; θ + u/√n) / f(Xj; θ), (1.209)

then the conditions of Theorem 1.9.2 will be fulfilled. In view of our assumptions,

ln Zn(u) = (1/√n) Σ_{j=1}^n ⟨∂/∂θ ln f(Xj; θ), u⟩ + (1/(2n)) Σ_{j=1}^n ⟨∂²/∂θ² ln f(Xj; θ) u, u⟩ + Rn, Rn → 0. (1.210)


In the same manner as in the approximation of the likelihood ratio by sums of iid random variables, we can prove that

(1/(2n)) Σ_{j=1}^n ⟨∂²/∂θ² ln f(Xj; θ) u, u⟩ → −(1/2)⟨I(θ)u, u⟩ (1.211)

with probability one as n → ∞. Therefore the marginal distributions of Zn(u) converge to the marginal distributions of the random function

Z(u) = e^{⟨ξ,u⟩ − ⟨I(θ)u,u⟩/2}, (1.212)

where ξ is a normal random vector in R^k with mean zero and covariance matrix I(θ) (cf. the linear term in (1.210)).
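The expansion (1.210) and the limit process (1.212) can be checked numerically in a one-dimensional exponential-family model. The sketch below (ours; it assumes the Bernoulli(θ) family, for which I(θ) = 1/(θ(1 − θ)) and the log-likelihood depends on the data only through the number of ones) compares ln Zn(u) with ξn u − I u²/2, where ξn is the normalized score sum:

import numpy as np

rng = np.random.default_rng(3)
theta, n, reps, u = 0.3, 100_000, 2000, 1.5
I = 1.0 / (theta * (1.0 - theta))        # Bernoulli Fisher information

k = rng.binomial(n, theta, size=reps)    # sufficient statistic
def loglik(p):
    return k * np.log(p) + (n - k) * np.log(1.0 - p)

lnZ = loglik(theta + u / np.sqrt(n)) - loglik(theta)
xi = (k - n * theta) / (theta * (1 - theta)) / np.sqrt(n)  # n^{-1/2} * score sum
print("mean |lnZ - (xi*u - I*u^2/2)|:",
      np.abs(lnZ - (xi * u - 0.5 * I * u * u)).mean())     # small remainder Rn
print("var of xi:", xi.var(), " I(theta):", I)             # xi ~ N(0, I(theta))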

Example 1.9.2. Let X^(n) = (X1, X2, . . . , Xn) be a sample from a uniform distribution on the interval [θ − 1/2, θ + 1/2], θ ∈ R¹. The parameter to be estimated is θ. If we choose ϕ(n) = n^{−1}, then the function

Zn(u) = Π_{j=1}^n f(Xj − θ − u/n) / f(Xj − θ) (1.213)

satisfies the conditions of Theorem 1.9.2. Here f(x) is the density of the uniform distribution on [−1/2, 1/2]. The behavior of the marginal distributions of Zn does not depend on θ. Setting θ = 0, we obtain

Zn(u) = { 1, if n(max Xj − 1/2) < u < n(min Xj + 1/2); 0, if u ∉ [n(max Xj − 1/2), n(min Xj + 1/2)]. (1.214)

We have shown that the random variables −n(max Xj − 1/2), n(min Xj + 1/2) are asymptotically independent and possess in the limit the exponential distribution with parameter one. Thus condition 2 of Theorem 1.9.2 is also fulfilled, and the limiting process is

Z(u) = { 1, if −τ− < u < τ+; 0, if u ∉ [−τ−, τ+], (1.215)

where τ−, τ+ are iid Exp(1) random variables. The conditions of Theorem 1.9.1 are obviously not satisfied in this example, since both Zn(u) and Z(u) are discontinuous.
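The two random endpoints of the box process (1.215) are easy to simulate. The following sketch (ours, with θ = 0) confirms that both endpoints are approximately Exp(1) and asymptotically independent:

import numpy as np

rng = np.random.default_rng(4)
n, reps = 2000, 5000
x = rng.uniform(-0.5, 0.5, size=(reps, n))     # theta = 0

tau_plus = n * (x.min(axis=1) + 0.5)           # right endpoint of {u : Zn(u) = 1}
tau_minus = -n * (x.max(axis=1) - 0.5)         # left endpoint is -tau_minus
print("means:", tau_plus.mean(), tau_minus.mean())         # both near 1 (Exp(1))
print("vars :", tau_plus.var(), tau_minus.var())           # both near 1
print("corr :", np.corrcoef(tau_plus, tau_minus)[0, 1])    # near 0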


Chapter 2

Local Asymptotic Normality of Families of Distributions

In a number of interesting papers of Hajek, Le Cam and other authors, it was proved that many important properties of statistical estimators follow from the asymptotic normality of the logarithm of the likelihood ratio for neighboring hypotheses, regardless of the nature of the observations which produced the given likelihood function. This chapter is devoted to an investigation of the conditions under which this property is valid for various models, and to corollaries of this property.

2.1 Independent Identically Distributed Observations

In this section we shall establish an important property of a family of regular statistical experiments generated by a sequence of iid observations. Let Ei = {Xi, 𝒳i, Pθ, θ ∈ Θ} be a regular experiment corresponding to the i-th observation and let Xi be the i-th observation. The set Θ, as always, will be considered an open subset of R^k. Let E^(n) = E1 × E2 × . . . × En and let

Zn,θ(u) = Π_{j=1}^n f(Xj; θ + u n^{−1/2}) / f(Xj; θ) (2.1)

be the normalized likelihood ratio. The following important theorem is due to Le Cam.

Theorem 2.1.1. If the Ei are regular experiments in Θ and det I(θ) ≠ 0 for θ ∈ Θ, then the normalized likelihood ratio Zn,θ(u) can be written as

Zn,θ(u) = exp{ (1/√n) Σ_{j=1}^n ⟨∂ ln f(Xj; θ)/∂θ, u⟩ − (1/2)⟨I(θ)u, u⟩ + ψn(u, θ) }. (2.2)

Moreover,

P_θ^n(|ψn(u, θ)| > ε) → 0 (2.3)

as n → ∞ for every u ∈ R^k, ε > 0, and

L( (1/√n) Σ_{j=1}^n ∂ ln f(Xj; θ)/∂θ | P_θ^n ) → N(0, I(θ)). (2.4)

Remark 2.1.1. Using the substitution u = I(θ)^{−1/2}v, the assertion of Theorem 2.1.1 can be restated as follows: if the conditions of the theorem are satisfied, then

Z̃n,θ(v) = dP^n_{θ+(nI(θ))^{−1/2}v} / dP^n_θ (X^n) = e^{⟨v,Δn,θ⟩ − |v|²/2 + ψn(v,θ)}, (2.5)

where P_θ(|ψn(v, θ)| > ε) → 0 as n → ∞ and

L(Δn,θ | P_θ^n) → N(0, J). (2.6)

The requirement of regularity of an experiment was not fully utilized in the proof of Theorem 2.1.1. Actually, one can state a condition which is sufficient for the validity of Theorem 2.1.1 without introducing any auxiliary measure ν, but rather by imposing the requirement directly on the family Pθ in a neighborhood of a given point θ = t. We shall now formulate this condition.


Condition 2.1.1 (Condition At). Let Pθ,c(·) and Pθ,s(·) represent the absolutely continuous and the singular components, respectively, of the measure Pθ(·) with respect to Pt(·) (t fixed). Define

ζ(u) = [ dP_{t+u,c}/dPt (X1) ]^{1/2} − 1 (2.7)

and assume the following conditions are satisfied:

1. The process ζ(u) is differentiable in L2(Pt) at u = 0, i.e., there exists a random vector ϕ such that as u → 0,

E_t (ζ(u) − ⟨ϕ, u⟩)² = o(|u|²). (2.8)

2. As u → 0,

∫ dP_{t+u,s} = o(|u|²). (2.9)

It is easy to verify that the proof of Theorem 2.1.1 remains unchanged if Condition 2.1.1 is satisfied. In this connection, we should set

I(t) = 4 E_t ϕϕ^T. (2.10)

We also have the following uniform version of Theorem 2.1.1.

Theorem 2.1.2. If the Ei are regular experiments in Θ and det I(θ) ≠ 0 for all θ ∈ Θ, then for any compact set K ⊂ Θ, any sequence θn ⊂ K and any u ∈ R^k the following representation is valid as n → ∞:

Zn,θn(u) = exp{ (1/√n) ⟨Σ_{j=1}^n ∂ ln f(Xj; θn)/∂θ, u⟩ − (1/2)⟨I(θn)u, u⟩ + ψn(u, θn) }; (2.11)

moreover, for any u ∈ R^k, ε > 0,

P^n_{θn}(|ψn(u, θn)| > ε) → 0, n → ∞, (2.12)

L( n^{−1/2} I^{−1/2}(θn) Σ_{j=1}^n ∂ ln f(Xj; θn)/∂θ | P^n_{θn} ) → N(0, J), n → ∞. (2.13)

2.2 Local Asymptotic Normality (LAN)

It is important to note that the property of the likelihood ratio proved in Theorem 2.1.1 is valid in a substantially larger class of cases than that of independent observations.

In this connection it is desirable to state the following general definition. Let Eε = (X^(ε), 𝒳^(ε), P_θ^(ε), θ ∈ Θ), Θ ⊂ R^k, be a family of statistical experiments and let Xε be the corresponding observation. As usual, we shall refer to dP^(ε)_{θ2}/dP^(ε)_{θ1}(Xε), the derivative of the absolutely continuous component of the measure P^(ε)_{θ2} with respect to the measure P^(ε)_{θ1} evaluated at the observation Xε, as the likelihood ratio.

Definition 2.2.1 (Local Asymptotic Normality (LAN)). A family P_θ^(ε) is called locally asymptotically normal (LAN) at a point t ∈ Θ as ε → 0 if for some nondegenerate k × k matrix ϕ(ε) = ϕ(ε, t) and any u ∈ R^k the representation

Zε,t(u) = dP^(ε)_{t+ϕ(ε)u} / dP^(ε)_t (Xε) = e^{⟨u,Δε,t⟩ − |u|²/2 + ψε(u,t)} (2.14)

is valid, where

L(Δε,t | P_t^(ε)) → N(0, J) (2.15)

as ε → 0. Here J is the identity k × k matrix, and moreover for any u ∈ R^k we have

ψε(u, t) → 0 (2.16)

in P_t^(ε)-probability as ε → 0.


First, it follows from Theorem 2.1.1 that in the case of iid regular experiments with a nondegenerate matrix I(t), the LAN condition is fulfilled at the point θ = t if we set

ε = 1/n, ϕ(ε, t) = (nI(t))^{−1/2}, Δε,t = (nI(t))^{−1/2} Σ_{j=1}^n ∂ ln f(Xj; t)/∂t. (2.17)

We now state a simple theorem which allows us to check the LAN condition for a one-dimensional parameter set.

Theorem 2.2.1 (Hajek). Let Θ ⊂ R¹ and let the density f(x; θ) of the experiments Ei satisfy the following conditions:

1. The function f(x; θ) is absolutely continuous in θ in some neighborhood of the point θ = t for all x ∈ X;

2. The derivative ∂f(x; θ)/∂θ exists for every θ belonging to this neighborhood for ν-almost all x ∈ X;

3. The function I(θ) is continuous and positive at θ = t.

Then the family P_θ^n generated by the independent experiments E1, . . . , En with density f satisfies the LAN condition at θ = t, where ε, ϕ, Δ are given by

ε = 1/n, ϕ(ε, t) = (nI(t))^{−1/2}, Δε,t = (nI(t))^{−1/2} Σ_{j=1}^n ∂ ln f(Xj; t)/∂t. (2.18)

Later a uniform version of the LAN condition will be needed. We now present the corresponding definition.

Definition 2.2.2. A family P_θ^(ε), θ ∈ Θ ⊂ R^k, is called uniformly asymptotically normal in Θ1 ⊂ Θ if for some nondegenerate matrix ϕ(ε, t) and arbitrary sequences tn ⊂ Θ1, εn → 0, un → u ∈ R^k such that tn + ϕ(εn, tn)un ∈ Θ1, the representation

Zεn,tn(un) = dP^(εn)_{tn+ϕ(εn,tn)un} / dP^(εn)_{tn} (Xεn) = e^{⟨Δεn,tn, un⟩ − |un|²/2 + ψεn(un,tn)} (2.19)

is valid; here

L(Δεn,tn | P^(εn)_{tn}) → N(0, J), εn → 0, (2.20)

and the sequence ψεn(un, tn) converges to zero in P^(εn)_{tn}-probability.

Theorem 2.1.2 implies that for iid regular experiments with a matrix I(θ) nondegenerate in Θ, the corresponding family of distributions is uniformly asymptotically normal in any compact set K ⊂ Θ; moreover, ϕ(ε, t) and Δε,t may be computed using formula (2.18).

2.3 Independent Nonhomogeneous Observations

Let X1, X2, . . . , Xn be independent observations, but now assume that the density fj of the observation Xj with respect to the measure νj depends on j.

We shall assume that every experiment Ej = (Xj, 𝒳j, Pθ,j, θ ∈ Θ) is regular, and we shall study the problem of what additional conditions should be imposed on the family of experiments so that the measure P_θ^n corresponding to the experiment E^(n) = E1 × E2 × . . . × En satisfies the LAN condition.

Clearly the likelihood ratio in this case becomes

dP^n_{θ2}/dP^n_{θ1} = Π_{j=1}^n fj(Xj; θ2)/fj(Xj; θ1). (2.21)

Assume for the time being that Θ ⊂ R¹, denote by Ij(θ) the Fisher information of the experiment Ej, and set

Ψ²(n, θ) = Σ_{j=1}^n Ij(θ). (2.22)


Theorem 2.3.1. Let the Ej be regular experiments, Θ ⊂ R¹, Ψ²(n, t) > 0, and for any k > 0,

lim_{n→∞} sup_{|u|<k} (1/Ψ²(n, t)) Σ_{j=1}^n ∫ ( ∂/∂θ √fj(x; t + u/Ψ(n, t)) − ∂/∂t √fj(x; t) )² νj(dx) = 0, (2.23)

and moreover let Lindeberg's condition be satisfied: for every ε > 0,

lim_{n→∞} (1/Ψ²(n, t)) Σ_{j=1}^n E_t { |f′j(Xj; t)/fj(Xj; t)|² I( |f′j/fj| > εΨ(n, t) ) } = 0. (2.24)

Then the family of measures

P_θ^n(A) = ∫···∫_A Π_{j=1}^n fj(xj; θ) νj(dxj) (2.25)

satisfies at the point θ = t the LAN condition with

ε = 1/n, ϕ(n) = ϕ(n, t) = Ψ^{−1}(n, t), Δn,t = ϕ(n, t) Σ_{j=1}^n f′j(Xj; t)/fj(Xj; t). (2.26)

Analyzing the proof of Theorem 2.3.1, one can easily verify that it remains valid also for a “sequence of series” of independent observations, i.e., for triangular array models.

Indeed, Lindeberg's theorem, as well as the theorem concerning the relative stability of sums of random variables, remains valid for a sequence of series, and this is even more true of the remaining, purely analytical, calculations. Thus the following theorem holds.

Theorem 2.3.2. Let Ejn, j = 1, 2, . . . , n; n = 1, 2, . . . be a sequence of series of independent regular experiments, where fjn is the density of the experiment Ejn with respect to a σ-finite measure νjn and Ijn(θ) is the corresponding information amount. Let

Ψ²(n, t) = Σ_{j=1}^n Ijn(t) > 0. (2.27)

If, moreover, the conditions of Theorem 2.3.1 are satisfied with fj replaced by fjn, then the family

P_θ^n(A) = ∫···∫_A Π_{j=1}^n fjn(xj; θ) νjn(dxj) (2.28)

satisfies at the point θ = t the LAN condition with ϕ(n) = Ψ^{−1}(n, t),

Δn,t = ϕ(n) Σ_{j=1}^n f′jn(Xjn; t)/fjn(Xjn; t). (2.29)

Conditions (2.23) and (2.24) may seem quite complicated. We shall now present some cruder but more easily verifiable sufficient conditions. Condition (2.24) is the Lindeberg condition, and it is well known that it follows from Lyapunov's condition: for some δ > 0,

(1/[Ψ(n, t)]^{2+δ}) Σ_{j=1}^n E_t | f′j/fj (Xj; t) |^{2+δ} → 0, n → ∞. (2.30)

If the functions (f_j^{1/2}(x; θ))′ are absolutely continuous in θ for almost all x, it is easy to devise a simpler sufficient condition for the validity of condition (2.23). Indeed, by the Cauchy–Schwarz inequality,

∫ ( ∂/∂θ √fj(x; θ) − ∂/∂t √fj(x; t) )² νj(dx) = ∫ ( ∫_t^θ ∂²/∂v² √fj(x; v) dv )² νj(dx) (2.31)

≤ (θ − t) ∫_t^θ dv ∫ | ∂²/∂v² √fj(x; v) |² νj(dx). (2.32)

Consequently, condition (2.23) follows from the condition

lim_{n→∞} (1/Ψ⁴(n, t)) sup_{|θ−t|<|u|/Ψ(n,t)} Σ_{j=1}^n ∫ | ∂²/∂θ² √fj(x; θ) |² νj(dx) = 0. (2.33)


2.4 Multidimensional Parameter Set

Let us assume once again that the statistical experiment E^(n) is generated by a sequence of independent regular experiments Ej with densities fj(x; θ), but now Θ ⊂ R^k. As above, let Ij(θ) be the Fisher information matrix of the j-th experiment and assume that the matrix

Ψ²(n, t) = Σ_{j=1}^n Ij(t) (2.34)

is positive definite. Hence there exists a positive definite matrix

Ψ^{−1}(n, t) = ( Σ_{j=1}^n Ij(t) )^{−1/2}. (2.35)

Theorem 2.4.1. Let Θ ⊂ R^k, let the matrix Ψ²(n, θ) be positive definite, and let the following conditions be satisfied:

1. for any k > 0,

lim_{n→∞} sup_{|u|<k} Σ_{j=1}^n ∫ ⟨ ∂f_j^{1/2}(x; t + Ψ^{−1}(n, t)u)/∂t − ∂f_j^{1/2}(x; t)/∂t, Ψ^{−1}(n, t)u ⟩² νj(dx) = 0; (2.36)

2. Lindeberg's condition: for any ε > 0, u ∈ R^k,

lim_{n→∞} Σ_{j=1}^n E_t { ⟨ Ψ^{−1}(n, t)u, ∂ ln fj(Xj; t)/∂t ⟩² I( |⟨Ψ^{−1}(n, t)u, ∂ ln fj(Xj; t)/∂t⟩| > ε ) } = 0. (2.37)

Then the family of measures

P_θ^n(A) = ∫···∫_A Π_{j=1}^n fj(xj; θ) νj(dxj) (2.38)

satisfies the LAN condition at θ = t with

ϕ(n) = ( Σ_{j=1}^n Ij(t) )^{−1/2} = Ψ^{−1}(n, t), Δn,t = ϕ(n) Σ_{j=1}^n f′j(Xj; t)/fj(Xj; t). (2.39)

Evidently Theorem 2.4.1, subject to corresponding modifications, is valid also for a sequence of series. Somewhat strengthening the conditions of Theorem 2.4.1, one can ensure the uniform asymptotic normality of the corresponding family of distributions.

Theorem 2.4.2. Let E^(n) = E1 × . . . × En, where Ej is a regular experiment with density fj(x; θ), x ∈ Xj, θ ∈ Θ ⊂ R^k. Assume that the matrix

Ψ²(n, θ) = Σ_{j=1}^n Ij(θ) (2.40)

is positive definite for some n uniformly in Θ1 ⊂ Θ and, moreover, let the following conditions be satisfied:

1. The random vectors ηj(θ) = ∂ ln fj(Xj; θ)/∂θ satisfy Lindeberg's condition uniformly in θ ∈ Θ1: for any ε > 0, u ∈ R^k,

lim_{n→∞} sup_{θ∈Θ1} Σ_{j=1}^n E_θ^n { ⟨ηj(θ), Ψ^{−1}(n, θ)u⟩² I( |⟨ηj(θ), Ψ^{−1}(n, θ)u⟩| > ε ) } = 0. (2.41)

2. For any k > 0,

lim_{n→∞} sup_{θ∈Θ1} sup_{|u|<k} Σ_{j=1}^n ∫ [ ⟨ ∂f_j^{1/2}(x; θ + Ψ^{−1}(n, θ)u)/∂θ − ∂f_j^{1/2}(x; θ)/∂θ, Ψ^{−1}(n, θ)u ⟩ ]² νj(dx) = 0. (2.42)

Then the condition of uniform asymptotic normality is fulfilled for the family P_θ^n in the domain Θ1.


2.5 Characterizations of Limiting Distributions of Estimators

2.5.1 Estimators of an Unknown Parameter when the LAN Condition is Fulfilled

We shall now begin the study of properties of estimators of an unknown parameter in the case when the LAN condition is fulfilled. Here we shall not confine ourselves to nonrandomized estimators but shall assume that, given the value of the observation Xε, the statistician can randomly choose an estimator θε of the parameter θ in accordance with a conditional distribution Pε(θε ∈ A | Xε) which does not depend on θ. The measure generated by this mode of estimation which corresponds to the value θ = t will be denoted by P_t^(ε), and the expectation with respect to this measure by E_t^(ε). If the LAN condition is satisfied, then one can show that

sup_{|ξ|<1} | ∫ ξ dP^(ε)_{t+ϕ(ε)u} − ∫ ξ e^{⟨u,Δε,t⟩ − |u|²/2} dP_t^(ε) | → 0. (2.43)

We shall try to describe the class of possible limiting distributions of appropriately normalized estimators under the LAN conditions. It was shown in Chapter 1 that, under some restrictions, the maximum likelihood estimator θn in the case of iid observations possesses the following limiting distribution as n → ∞:

L(I(t)^{1/2} n^{1/2}(θn − t) | P_t^n) → N(0, 1). (2.44)

Obviously this distribution is not the only one possible. As the examples of superefficient estimators show, there exist asymptotically normal estimators which at a given point possess an arbitrarily small variance of the limiting distribution, while at the remaining points the variance is one. Below, in Theorem 2.8.2, it will be shown that these estimators ought not to be considered, since they are “bad” in a certain sense in neighborhoods of radius of order n^{−1/2} around the superefficiency points. Since the exact value of the parameter is unknown to the statistician, it is natural to restrict the study of limiting distributions to estimators such that a small variation in the “true” value of the parameter yields a small variation in the limiting distribution of the estimator. Such regular estimators are discussed in this section.

However, the limiting distribution is not necessarily normal even in the class of regular estimators. Let, for example, a sample X1, X2, . . . , Xn represent iid observations from N(θ, 1). Consider the randomized estimator

θn = X̄ + n^{−1/2}η, (2.45)

where η is a random variable independent of X1, . . . , Xn with distribution G(x). It is clear that in this case

L(√n(θn − t) | P_t^n) → N(0, 1) ∗ G, (2.46)

where ∗ denotes convolution. In a very interesting paper by Hajek it was established that, provided the LAN conditions are satisfied, there are no limiting distributions other than N(0, 1) ∗ G for regular estimators.
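The convolution structure of (2.46) is immediate to verify by simulation. The sketch below (ours; it assumes G = Uniform(−1, 1) and uses the exact law of the sample mean) checks that √n(θn − t) has variance 1 + Var(η):

import numpy as np

rng = np.random.default_rng(7)
theta, n, reps = 0.7, 4000, 100_000

xbar = theta + rng.standard_normal(reps) / np.sqrt(n)  # exact law of the mean
eta = rng.uniform(-1.0, 1.0, reps)                     # G = Uniform(-1, 1)
tn = xbar + eta / np.sqrt(n)                           # randomized estimator (2.45)
z = np.sqrt(n) * (tn - theta)
print("var of sqrt(n)(tn - theta):", z.var())          # near 1 + 1/3
print("variance of N(0,1)*G:", 1.0 + 1.0 / 3.0)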

2.5.2 Regular Parameter Estimators

Definition 2.5.1 (Regular Estimators). Let a family P_θ^(ε) satisfy the LAN condition with the normalizing matrix ϕ(ε) at the point θ = t. An estimator θε (possibly a randomized one) of the parameter θ is called regular at the point θ = t if for some proper distribution function F(x) the weak convergence

L(ϕ^{−1}(ε)(θε − (t + ϕ(ε)u)) | P^(ε)_{t+ϕ(ε)u}) → F (2.47)

is valid as ε → 0 for any u ∈ R^k, and this convergence is uniform in |u| < c for any c > 0.

We shall briefly discuss this definition. For u = 0 it implies that the random variable ϕ^{−1}(ε)(θε − t) possesses the proper limiting distribution F(x) as ε → 0, provided the true value of the parameter is t. It is quite natural to require that this convergence be uniform in t. The condition in Definition 2.5.1 represents a weakened version of this requirement, since |ϕ(ε)u| → 0 as ε → 0 for any u ∈ R^k. In particular, it is satisfied at each point t ∈ Δ if the relation

L(ϕ^{−1}(ε)(θε − t) | P_t^(ε)) → F(t, ·) (2.48)

is valid for some normalizing matrix ϕ^{−1}(ε) and some function F(t, ·) continuous in t, uniformly in t ∈ Δ as ε → 0.

The question arises: why should only limiting distributions with the normalizing matrix ϕ^{−1}(ε) be considered in this definition? Are there estimators which possess proper limiting distributions with a “better” normalizing matrix? To formulate a precise result it is necessary to make precise the meaning of a normalizing matrix Ψ(ε) which is not better than the matrix ϕ^{−1}(ε).

For a one-dimensional parameter set this question does not involve any difficulties: clearly a normalizing factor Ψ(ε) is not better than ϕ^{−1}(ε) if the product Ψ(ε)ϕ(ε) is bounded as ε → 0.


Analogously, in the case Θ ⊂ R^k, k > 1, a normalizing matrix Ψ(ε) is called not better than ϕ^{−1}(ε) if for some constant c,

‖Ψ(ε)ϕ(ε)‖ = sup_{|x|=1} |Ψ(ε)ϕ(ε)x| ≤ c. (2.49)

This definition is quite natural. Indeed, if for some family of random variables ξε the family xε = ϕ^{−1}(ε)ξε is compact, then it is evident from (2.49) that Ψ(ε)ξε = Ψ(ε)ϕ(ε)xε is also compact, and therefore the matrix Ψ(ε) “stretches” the vector ξε by an order of magnitude not larger than the matrix ϕ^{−1}(ε) does.

The following lemma shows that regular estimators with a normalizing matrix Ψ(ε) do not exist if condition (2.49) is not fulfilled.

Lemma 2.5.1. Let a family P_θ^(ε) satisfy at the point θ = t the LAN condition, and let relation (2.16) be valid uniformly in |u| < 1. Furthermore, let the family of matrices Ψ(ε) be such that

‖Ψ(ε)ϕ(ε)‖ → ∞, ε → 0. (2.50)

Then there are no estimators of the parameter θ such that for some proper distribution function F(x),

L(Ψ(ε)[θε − (t + ϕ(ε)u)] | P^(ε)_{t+ϕ(ε)u}) → F (2.51)

as ε → 0 uniformly in |u| < 1.

Theorem 2.5.1. Let the family P_θ^(ε) satisfy the LAN condition at θ = t and let θε be a family of estimators (possibly randomized) of the parameter θ which is regular at θ = t. Then

1. the limiting distribution F(x) of the random vector ζε = ϕ^{−1}(ε)(θε − t) is a composition of N(0, J) and some other distribution G(x): F = N(0, J) ∗ G;

2. G(x) is the limiting distribution law of the difference ζε − Δε,t as ε → 0.

A refinement of Theorem 2.5.1 is the following.

Theorem 2.5.2. Let the conditions of Theorem 2.5.1 be fulfilled. Then the random variables ζε − Δε,t and Δε,t are asymptotically independent in the sense that the weak convergence

P_t^(ε)(ζε − Δε,t < x, Δε,t < y) → G(x)Φ(y) (2.52)

is valid as ε → 0; here Φ(y) is the distribution function of the normal law N(0, J).

2.6 Asymptotic Efficiency under LAN Conditions

Various definitions of asymptotic efficiency were discussed in Chapter 1. Here we shall prove some theorems interrelating these definitions, using the results obtained in this chapter.

We know that in the case of iid observations, asymptotic efficiency in the Fisher sense reduces to the requirement of asymptotic normality of estimators with parameters (0, I^{−1}(θ)). The following definition, which relates to a more general situation, is in complete agreement with the classical one.

Definition 2.6.1 (Asymptotic Efficiency in Fisher's sense). Let a family of measures P_θ^(ε) satisfy the LAN condition with the normalizing matrix ϕ(ε) at the point θ = t. A family of estimators θε is called asymptotically efficient in Fisher's sense at the point θ = t if

L(ϕ^{−1}(ε)(θε − t) | P_t^(ε)) → N(0, J) (2.53)

as ε → 0.

J. Wolfowitz proposed a different definition of efficiency of statistical estimators. His reasoning is roughly as follows. Asymptotic efficiency in Fisher's sense is natural if we confine ourselves to estimators whose distributions converge uniformly to the limiting normal distribution with zero mean. However, there are no logical foundations for such a restriction, because by enlarging the class of estimators one may possibly obtain estimators that are better in a certain sense. Of course, one cannot omit the requirement of uniform convergence, due to the existence of superefficient estimators, but it may be reasonable to omit the requirement of asymptotic normality.


However, how can one compare two families of estimators θε^(1), θε^(2) when one is asymptotically normal but the other is not? Wolfowitz suggests comparing the quality of estimators by the degree of their “concentration” about the true value of the parameter. More precisely, in the case of a one-dimensional parameter space Θ, he proposes to consider as the better one the family for which the P_θ^(ε)-probability that the estimator takes a value in the interval [θ − a(ε), θ + a(ε)] is the largest. However, two questions arise in this connection. First, how should one select the family a(ε)? For overly small a(ε) all estimators would be equally bad, since the probability P_θ^(ε)(θε^(i) ∈ [θ − a(ε), θ + a(ε)]) will be close to zero, while for a(ε) too large the proposed criterion will not be sensitive enough, since for too many families of estimators this probability will be close to one. If, however, the family of distributions P_θ^(ε) satisfies the LAN condition, it is then natural to put a(ε) = λϕ(ε), which leads to Definition 2.6.2. The second question is: what is the multidimensional analog of this method of comparing estimators? Kaufman suggested replacing symmetric intervals in the case Θ ⊂ R^k by symmetric convex sets. We thus arrive at the following definition.

Definition 2.6.2 (Asymptotic Efficiency in Wolfowitz's sense). Let Θ ⊂ R^k and let the family of measures P_θ^(ε) satisfy the LAN condition with the normalizing matrix ϕ(ε) at the point θ = t. A family of estimators θε will be called asymptotically efficient in Wolfowitz's sense at the point θ = t if for any regular family Tε and any centrally symmetric convex set A ⊂ R^k the relation

lim_{ε→0} P_t^(ε)(ϕ^{−1}(ε)(θε − t) ∈ A) ≥ lim sup_{ε→0} P_t^(ε)(ϕ^{−1}(ε)(Tε − t) ∈ A) (2.54)

is valid.

Note that in this definition it is not formally required that the family θε be regular. However, this definition could hardly be considered natural if there existed no regular estimators which are efficient in Wolfowitz's sense. It will be shown below that under quite general conditions both maximum likelihood and Bayesian estimators are efficient in Wolfowitz's sense and are regular. Here we shall present a sufficient condition for efficiency in Wolfowitz's sense.

Theorem 2.6.1. If a family of estimators θε is asymptotically efficient in the sense of Definition 2.6.1, then it is alsoasymptotically efficient in the sense of Definition 2.6.2.

Theorem 2.5.1 and the results of the last section also allow us to obtain asymptotic lower bounds for the risks of regular estimators.

Theorem 2.6.2. Let the family Tε be regular at the point t ∈ R^k and let w(x) ≥ 0, x ∈ R^k, be a continuous function satisfying

1. w(x) = w(−x);

2. the set {x : w(x) < c} is convex in R^k for any c > 0.

Then

lim inf_{ε→0} E_t^(ε) w[ϕ^{−1}(ε)(Tε − t)] ≥ E w(ξ), (2.55)

where L(ξ) = N(0, J).

Corollary 2.6.1. If the family Tε is regular at the point θ = t, then the matrix inequality

lim inf_{ε→0} [ ϕ^{−1}(ε) E_t^(ε) (Tε − t)(Tε − t)^T ϕ^{−1}(ε)^T ] ≥ J (2.56)

is valid.

Thus a regular estimator cannot have a covariance matrix which is “better” than the limiting covariance matrix of an estimator asymptotically efficient in Fisher's sense.

We now return to the definition of asymptotic efficiency in Rao's sense. Recall that a family of estimators Tε is asymptotically efficient in Rao's sense at the point θ = t if for some matrix B(t) which does not depend on the observations the relation

ϕ^{−1}(ε, t)(Tε − t) − B(t)Δε,t → 0 (2.57)

in P_t^(ε)-probability is valid as ε → 0.

If the estimator Tε is regular, then Theorem 2.5.2 implies that the difference ϕ^{−1}(ε, t)(Tε − t) − Δε,t is asymptotically independent of Δε,t, and therefore in the case of regular estimators relation (2.57) can be fulfilled only if B(t) = J. It follows from this and from the second assertion of Theorem 2.5.1 that for regular estimators asymptotic efficiency in Rao's sense coincides with asymptotic efficiency in Wolfowitz's sense.


It follows from the above that under the LAN condition it is natural to normalize the estimator by means of the factor ϕ^{−1}(ε, t). The corresponding loss function is wε(Tε − t) = w(ϕ^{−1}(ε, t)(Tε − t)).

Later, in Theorem 2.7.1, it will be shown that for any function w ∈ W and any estimator θε, under the LAN conditions the relation

lim_{δ→0} lim inf_{ε→0} sup_{|θ−t|<δ} E_θ^(ε) w(ϕ^{−1}(ε, t)(θε − θ)) ≥ E w(ξ), L(ξ) = N(0, J), (2.58)

is valid. From (2.58) and (1.162) we obtain that the estimator Tε is asymptotically efficient for the loss function w(ϕ^{−1}(ε, t)x) at the point θ = t provided that

lim_{δ→0} lim inf_{ε→0} sup_{|θ−t|<δ} E_θ^(ε) w(ϕ^{−1}(ε, t)(Tε − θ)) = E w(ξ). (2.59)

Below, under the LAN conditions, we shall refer to an estimator Tε which satisfies (2.59) as an asymptotically efficient estimator for the loss function w(ϕ^{−1}(ε, t)x) at the point θ = t.

2.7 Asymptotically Minimax Risk Bound

In Chapter 1, in the discussion of asymptotic efficiency, we established a theorem which yields an asymptotically minimax bound for the risks of arbitrary statistical estimators provided the asymptotic properties of Bayesian estimators are known. The LAN condition allows us to strengthen this result. The following interesting theorem is due to Hajek.

Theorem 2.7.1. Let the family P_θ^(ε) satisfy the LAN condition at the point θ = t with the normalizing matrix ϕ(ε), and let tr ϕ(ε)ϕ(ε)^T → 0 as ε → 0. Then for any family of estimators Tε, any loss function w ∈ W_{ε,2}, and any δ > 0 the inequality

lim inf_{ε→0} sup_{|θ−t|<δ} E_θ^(ε)[w(ϕ^{−1}(ε)(Tε − θ))] ≥ (2π)^{−k/2} ∫_{R^k} w(x) e^{−|x|²/2} dx = E w(ξ) (2.60)

is valid; here L(ξ) = N(0, J). If, moreover, Θ ⊂ R¹ and w(x) ∈ W¹_{ε,2}, then the equality

lim_{δ→0} lim_{ε→0} sup_{|θ−t|<δ} E_θ^(ε)[w(ϕ^{−1}(ε)(Tε − θ))] = E w(ξ) (2.61)

is possible if and only if the difference ϕ^{−1}(ε)(Tε − t) − Δε,t → 0 in P_t^(ε)-probability as ε → 0.

Remark 2.7.1. Since the first assertion of Theorem 2.7.1 is valid for any family of estimators Tε, it can be written in the form

lim inf_{ε→0} inf_{Tε} sup_{|θ−t|<δ} E_θ^(ε) w[ϕ^{−1}(ε)(Tε − θ)] ≥ E w(ξ). (2.62)

Thus Theorem 2.7.1 yields an asymptotic minimax lower bound for a wide class of loss functions. Below we shall see that in many important particular cases this bound is exact.

Remark 2.7.2. Denote by Kb the cube in R^k whose vertices have coordinates ±b. If we drop the condition tr ϕ(ε)ϕ(ε)^T → 0, we can replace the basic inequality of Theorem 2.7.1 by the inequality

lim_{b→∞} lim inf_{ε→0} sup_{θ: ϕ^{−1}(ε)(θ−t)∈Kb} E_θ^(ε)[w(ϕ^{−1}(ε)(Tε − θ))] ≥ E w(ξ). (2.63)

Moreover, it follows from the proof that for any b > 0 the inequality

lim inf_{ε→0} sup_{θ: ϕ^{−1}(ε)(θ−t)∈Kb} E_θ^(ε)[w(ϕ^{−1}(ε)(Tε − θ))] ≥ (2π)^{−k/2} ∫_{K_{√b}} w(y) e^{−|y|²/2} dy · (1 − b^{−1/2})^k (2.64)

is valid.

Analogously, one can somewhat strengthen the second assertion of the theorem as well, replacing it by the assertion that the equality

lim_{b→∞} lim_{ε→0} sup_{θ: ϕ^{−1}(ε)|θ−t|≤b} E_θ^(ε)[w(ϕ^{−1}(ε)(Tε − θ))] = E w(ξ) (2.65)

is possible if and only if the difference ϕ^{−1}(ε)(Tε − t) − Δε,t → 0 in P_t^(ε)-probability as ε → 0.


Remark 2.7.3. Inequality (2.64) presents a nontrivial lower bound only if b > 1. One can establish the following coarser but nontrivial lower bound for any b > 0:

lim inf_{ε→0} sup_{θ: ϕ^{−1}(ε)(θ−t)∈Kb} E_θ^(ε)[w(ϕ^{−1}(ε)(Tε − θ))] ≥ 2^{−k} (2π)^{−k/2} ∫_{K_{b/2}} w(y) e^{−|y|²/2} dy. (2.66)

Remark 2.7.4. It follows from the proof that the first assertion of Theorem 2.7.1 remains valid if on the left-hand side of the basic inequality w is replaced by wε, where wε ∈ W_{l,2} is a family of functions converging to w(x) for almost all x as ε → 0.

2.8 Some Corollaries. Superefficient Estimators

Comparing Theorems 2.7.1 and 2.6.2, one arrives at the following conclusion: the asymptotic lower bound on the risks of regular estimators derived in Section 2.6 is also the minimax asymptotic bound on the risks of arbitrary estimators. For example, setting w(x) = I(x ∈ A^c), where A is a convex set in R^k symmetric with respect to the origin, we obtain from Theorem 2.7.1 the following assertion (cf. Theorem 2.6.1).

Theorem 2.8.1. If the family P_θ^(ε) satisfies the LAN condition at the point θ = t with the normalizing matrix ϕ(ε), then for any convex centrally symmetric set A and any δ > 0 the inequality

lim sup_{ε→0} sup_{Tε} inf_{|θ−t|<δ} P_θ^(ε)(ϕ^{−1}(ε)(Tε − θ) ∈ A) ≤ (2π)^{−k/2} ∫_A e^{−|x|²/2} dx (2.67)

is valid.

Setting w(x) = ⟨h, x⟩² = h^T x x^T h, we obtain from Theorem 2.7.1 that under the LAN conditions the matrix inequality

lim inf_{ε→0} ϕ^{−1}(ε) sup_{|θ−t|<δ} E_θ^(ε)(Tε − θ)(Tε − θ)^T ϕ^{−1}(ε)^T ≥ J (2.68)

is valid for any δ > 0.

In accordance with the definition at the end of Section 2.6, the estimator θε is asymptotically efficient for the loss function w(ϕ^{−1}(ε, t)x) at the point t provided

lim_{δ→0} lim inf_{ε→0} sup_{|θ−t|<δ} E_θ^(ε) w(ϕ^{−1}(ε, t)(θε − θ)) = E w(ξ). (2.69)

In agreement with Chapter 1, we shall call the estimator Tε a superefficient estimator for the loss function w(ϕ^{−1}(ε, θ)x) in Θ provided that for all θ ∈ Θ

lim_{ε→0} E_θ^(ε) w(ϕ^{−1}(ε, θ)(Tε − θ)) ≤ E w(ξ), L(ξ) = N(0, J), (2.70)

and for at least one point t ∈ Θ the strict inequality

lim_{ε→0} E_t^(ε) w(ϕ^{−1}(ε, t)(Tε − t)) < E w(ξ) (2.71)

is valid.

Stein's example shows that when the dimension of the parameter space is k ≥ 3, there exist asymptotically efficient estimators for the loss function w = |x|² which are superefficient at one point of the set R^k. The following theorem shows that for k = 1 such a result is impossible for any loss function belonging to a sufficiently wide class of functions.

Theorem 2.8.2. If the family P_θ^(ε) satisfies the LAN condition at the point θ = t, Θ ⊂ R¹, and Tε is an asymptotically efficient estimator at the point θ = t for some loss function w0 ∈ W′, then Tε cannot be superefficient at this point for any loss function w ∈ W.

Theorem 2.7.1 also allows us to investigate properties of superefficient estimators in the case of a multidimensional parameter set. We have the following theorem.

Theorem 2.8.3. Let Tε be an estimator of the vector θ ∈ Θ ⊂ R^k in a parametric family P_θ^(ε) satisfying the LAN condition at θ = t. Assume that every component of the vector Tε is an asymptotically efficient (at the point θ = t) estimator of the corresponding component of the vector θ for some loss function w0 ∈ W′. Then the estimator Tε cannot be superefficient at the point θ = t for any loss function w(x) ∈ W, x ∈ R^k.

Corollary 2.8.1. The components of Stein’s vector estimator are not asymptotically efficient estimators of the componentsof the mean value vector in a multivariate normal distribution provided k ≥ 3.


Chapter 3

Some Applications to Nonparametric Estimation

Nonparametric estimation is a large branch of mathematical statistics dealing with problems of estimating functionals of elements of some functional spaces, in situations when these elements are not determined by specifying a finite number of parameters. In this chapter we shall show by means of several examples how the ideas of parametric estimation presented in the previous chapters can be applied to problems of this kind.

3.1 A Minimax Bound on Risks

The problem of parametric estimation can be considered as a particular case of the following more general statistical problem, which we shall, for the time being, consider only for iid observations. Thus let X1, X2, . . . , Xn be a sequence of iid random variables with values in a measurable space (X, 𝒳) and let F(·) be their common distribution (unknown to the observer), which belongs to some (known) class of distributions F on 𝒳. Let Φ(F) be a real functional. The problem consists of estimating the functional Φ(F) based on the observations X1, X2, . . . , Xn. We shall basically be concerned with bounds on the precision of estimation and with asymptotically best estimators.

In particular, if the class F is one-parametric, F = {F(·; θ), θ ∈ Θ} and Φ(F(·; θ)) = θ, then we arrive at the usual problem of estimating a one-parametric distribution. Retaining the same set F and considering different functionals Φ, we arrive at the problem of estimating a function of the parameter θ. However, more interesting examples can be obtained by considering other classes of distributions F.

Example 3.1.1. Let F be a subset of the family of distributions such that the integral

E_F |ϕ(X)| = ∫ |ϕ(x)| F(dx), ϕ : X → R¹, (3.1)

is finite, and let Φ(F) = ∫ ϕ(x)F(dx). Evidently, in this case the following statistic is a rather “good” estimator of the functional Φ(F):

ϕ̄n = (1/n) Σ_{i=1}^n ϕ(Xi). (3.2)

Will this estimator be the best in a certain sense? The answer to this question depends, of course, on how “extended” the family F is. For example, if F is a parametric set, then this estimator is in general not the best in any asymptotic sense (consider the uniform location model).

Nevertheless, it follows from the theorems proved below that for a sufficiently “substantial” set F this empirical mean estimator cannot be asymptotically improved for all F ∈ F.

Example 3.1.2. Consider one of the possible generalizations of the preceding example. Let F be a subset of a class of distributions for which the functionals

∫ |ϕi(x)| F(dx), i = 1, 2, . . . , r, ϕi : X → R¹, (3.3)

are finite, and let ϕ0 : R^r → R¹ be a sufficiently smooth function. Consider the functional

Φ(F) = ϕ0( ∫ ϕ(x)F(dx) ), (3.4)

in which ϕ(x) is a vector in R^r with coordinates ϕ1(x), . . . , ϕr(x). Methods for constructing asymptotically efficient (and minimax) estimators for smooth parametric families F were considered above. But how can one construct an asymptotically best (in a certain sense) estimator of the functional Φ(F) if the function F is known only approximately, with precision up to a neighborhood of some fixed function F0 in the corresponding functional space? It will be shown that such an estimator is, for example, the function of the arithmetic means

Φn = ϕ0(ϕ̄n). (3.5)
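A concrete instance of the plug-in estimator (3.5) is the variance functional: taking ϕ(x) = (x, x²) and ϕ0(m1, m2) = m2 − m1², we get Φ(F) = Var_F(X). The following sketch (ours; it assumes F = Exp(1), so Φ(F) = 1, the central fourth moment is 9, and the delta method predicts √n(Φn − Φ(F)) → N(0, 9 − 1)) checks the asymptotic variance of Φn = ϕ0(ϕ̄n):

import numpy as np

rng = np.random.default_rng(8)
n, reps = 2000, 5000

x = rng.exponential(size=(reps, n))               # F = Exp(1)
phin = (x ** 2).mean(axis=1) - x.mean(axis=1) ** 2  # Phi_n = phi0(phibar_n)
z = np.sqrt(n) * (phin - 1.0)
print("empirical var of sqrt(n)(Phi_n - Phi):", z.var())  # near mu4 - sigma^4 = 8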

Example 3.1.3. Let F be the restriction of the set of distributions on R¹ to those for which the median t is uniquely determined as the solution of the equation F(t) = 1/2, and let Φ(F) = med F.

A natural nonparametric estimator of med F is the sample median, i.e., the [n/2]-th order statistic. It follows from the arguments presented in this and the succeeding sections that this estimator is asymptotically minimax if the class F is sufficiently “massive” in a neighborhood of the given distribution F0.

In the present section, under some regularity restrictions on F and Φ, a minimax lower bound on the quality of nonparametric estimators will be derived. The idea of the argument presented below was heuristically stated by Stein and was developed in detail in a number of papers by Levit. It is an extremely simple idea.

Let F be a distribution in F; denote Φ(F) by t. Consider a smooth parametric family ϕ = {Fh(x)} ⊂ F which “passes” through the “point” F at h = t (Ft = F) and such that the value of the parameter h on this family coincides with the value of the functional Φ in some neighborhood of h = t, i.e., Φ(Fh) = h. The smoothness of the family ϕ will be, for the time being, interpreted in the sense that the Fisher information quantity I(F, ϕ) exists for this family and that the LAN condition with normalization (I(F, ϕ)n)^{−1/2} is satisfied.

Now it is easy to obtain a minimax bound on the risks for the problem of estimating the functional Φ(F) with a loss function w ∈ W. Indeed, for any estimator Φn of the functional Φ and any δ > 0 the inequalities

sup_{F∈F} E_F w(√n(Φn − Φ(F))) ≥ sup_{{Fh}} E_{Fh} w(√n(Φn − Φ(Fh))) ≥ sup_{|h−t|<δ} E_h w(√n(Φn − h)) (3.6)

are self-evident. In view of Theorem 2.7.1, we have for any δ > 0,

lim inf_{n→∞} inf_{Φn} sup_{|h−t|<δ} E_h w(√n(Φn − h)) ≥ (1/√(2π)) ∫ w(x I^{−1/2}(F, ϕ)) e^{−x²/2} dx, (3.7)

and we have obtained the desired lower bound on the risks. This bound depends on the choice of the smooth family ϕ = {Fh} ⊂ F with the properties indicated above: clearly, the smaller I(F; ϕ) is, the larger this bound becomes.

Remark 3.1.1. The most stringent requirement on the family ϕ is the requirement that Φ(Fh) = h. We shall show that theinequality

lim infn→∞

infΦn

supF∈F

w(√n(Φn − Φ(F ))) ≥ 1√

∫w(xI−1/2(F,ϕ))e−x

2/2dx (3.8)

remains valid if this requirement is replaced by a somewhat weaker one:

Φ(Fh) = h + o(h − t) (3.9)

as h → t. For this purpose, it is evidently sufficient to check that

lim inf_{n→∞} sup_{|h−t|<δ} Eh w(√n(Φn − h − o(h − t))) ≥ (2π)^{−1/2} ∫ w(x I^{−1/2}(F, ϕ)) e^{−x²/2} dx. (3.10)

The proof of this refined version of Theorem 2.7.1 does not differ from the proof of Theorem 2.7.1 if one observes that any function w ∈ W is continuous a.e. in R1.

As a result of this discussion, the following definition would seem natural. Consider all possible families ϕ = {Fh} ⊂ F parametrized by a real parameter h with the properties:

1. Ft = F;

2. Φ(Fh) = h + o(h − t) as h → t;


3. the random variable

ηu = [ (dF^c_{t+u}/dFt)(X) ]^{1/2} (3.11)

(where F^c_{t+u} denotes the component of F_{t+u} which is absolutely continuous with respect to Ft) possesses a derivative with respect to u in the mean square sense at u = 0, so that for some random variable η0,

lim_{u→0} (1/u²) EF [ ((dF^c_{t+u}/dFt)(X))^{1/2} − 1 − uη0 ]² = 0 (3.12)

and, moreover,

lim_{u→0} (1/u²) ∫ (√(dF_{t+u}(x)) − √(dFt(x)))² = EF η0² < ∞. (3.13)

Let I(F; ϕ) = 4EF η0².
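The third condition ties I(F; ϕ) = 4EFη0² to the local behavior of the Hellinger distance in (3.13). As a numerical illustration (a sketch, not from [1]; the Gaussian location family Fh = N(h, 1) is an assumed choice), 4ρ0²(F_{t+u}, Ft)/u² should approach the classical Fisher information, here equal to 1:

import numpy as np

# Numerical sketch of (3.13) for the assumed family F_h = N(h, 1):
# rho0^2(F_{t+u}, F_t) ~ u^2 * E eta0^2, so 4 * rho0^2 / u^2 -> I = 1.
x = np.linspace(-12.0, 12.0, 200001)
dx = x[1] - x[0]
dens = lambda m: np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2.0 * np.pi)

for u in (0.1, 0.01, 0.001):
    rho2 = np.sum((np.sqrt(dens(u)) - np.sqrt(dens(0.0))) ** 2) * dx
    print(f"u = {u}:  4 * rho0^2 / u^2 = {4.0 * rho2 / u ** 2:.6f}")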

Definition 3.1.1 (Information Quantity in Nonparametric Estimation Problems). The quantity

inf_ϕ I(F0, ϕ) = I(F0), (3.14)

where ϕ runs over the families in F satisfying the conditions above, is called the information quantity in the estimation problem Φ(F), F ∈ F, at the point F = F0. If there are no parametric families ϕ ⊂ F which satisfy the conditions above, we set I(F0) = ∞.

It follows from this definition that in the case I(F0) < ∞ there exists a sequence of parametric families ϕN = {F^N_h}, N = 1, 2, . . ., satisfying the conditions above such that

I(F0, ϕN) → I(F0), N → ∞. (3.15)

We have the following theorem.

Theorem 3.1.1. If I(F) = inf_{F∈F} I(F) > 0, then for the problem of estimating the functional Φ(F), F ∈ F, the following minimax bound from below on the risks is valid for any loss function w ∈ W:

lim inf_{n→∞} inf_{Φn} sup_{F∈F} EF w(√n(Φn − Φ(F))) ≥ sup_{F0∈F} (2π)^{−1/2} ∫ w(x I(F0)^{−1/2}) e^{−x²/2} dx = (2π)^{−1/2} ∫ w(x I(F)^{−1/2}) e^{−x²/2} dx. (3.16)

The quantity I(F0) appearing in this theorem evidently depends also on the set F. Moreover, the quantity I(F0) in general is not determined by the local structure of the set F in the neighborhood of F0.

If we want to localize the assertion of Theorem 3.1.1, i.e., to obtain a nonparametric bound from below on the risks analogous to Hajek's bound in the neighborhood of a given distribution (see Theorem 2.7.1), then it is necessary first of all to decide which neighborhoods of the given distribution, i.e., which topologies on the set of distributions, will be considered.

In the parametric case the specification of a topology by means of a Euclidean metric on the set Θ is sufficiently natural. Unfortunately, in a nonparametric situation there is, in general, no such natural method of defining a topology, although some requirements related to this topology follow from the actual problem.

First, the topology should be such that the functional Φ(F) being estimated is continuous in this topology, since otherwise consistent estimation is impossible. Other requirements on this topology follow from the properties of the nonparametric information quantity. Next, if we want the quantity I(F0) to be determined by the local structure of the set F in the neighborhood of a given distribution, it is necessary to require that any family Fh satisfying the conditions above be continuous in the topology under consideration. In this connection we introduce the following definition.

Definition 3.1.2 (Coordinated Topology). A topology R is called coordinated with the estimation problem (Φ(F), F ∈ F) if

1. the functional Φ(F) is continuous on F in this topology;

2. any family ϕ = {Fh} satisfying the three natural conditions above is continuous at h = t = Φ(F) in this topology.

For not too degenerate estimation problems, the choice of topologies R coordinated with the estimation problem is sufficiently large. We shall not dwell on this, but remark that for any estimation problem the second requirement in Definition 3.1.2 is fulfilled for the topology defined by Hellinger's distance ρ0 and for any weaker topology. Indeed, for |h − t| < δ,

ρ0(Fh, Ft) = ∫(√dFh − √dFt)² ≤ c(h − t)² (3.17)


in view of the third of the three natural conditions, and therefore any family satisfying that condition is continuous at the point h = t in this topology. The first condition in Definition 3.1.2 is also generally not too restrictive.

If we confine ourselves to neighborhoods U(F0) ⊂ F of a fixed distribution F0 in topologies coordinated with the estimation problem Φ(F), F ∈ F, then the arguments leading to Theorem 3.1.1 can be localized, since in this case on the left-hand side of (3.6), instead of taking the upper bound over F ∈ F, we can take the upper bound over F ∈ U(F0). We thus arrive at the local version of Theorem 3.1.1.

Theorem 3.1.2. Assume I(F0) > 0 and let UN(F0) be an arbitrary sequence of neighborhoods of the distribution F0 in a topology coordinated with the estimation problem Φ(F), F ∈ F, such that UN(F0) ↓ F0 as N → ∞. Then for any loss function w ∈ W the following asymptotically minimax bound on the risk is valid:

lim_{N→∞} lim inf_{n→∞} inf_{Φn} sup_{F∈UN(F0)} EF w(√n(Φn − Φ(F))) ≥ (2π)^{−1/2} ∫ w(x I^{−1/2}(F0)) e^{−x²/2} dx. (3.18)

In accordance with the point of view of this book, an estimator Φn such that for F0 ∈ F and any sequence UN(F0) of neighborhoods converging to F0 in the topology R the relation

lim_{N→∞} lim inf_{n→∞} sup_{F∈UN(F0)} EF w(√n(Φn − Φ(F))) = (2π)^{−1/2} ∫ w(x I^{−1/2}(F0)) e^{−x²/2} dx (3.19)

is valid will be called an (F, R, w)-asymptotically efficient nonparametric estimator of the functional Φ(F) at the point F0.

In connection with these definitions and Theorems 3.1.1 and 3.1.2, several questions arise:

1. How can one compute I(F) for a given problem of nonparametric estimation?

2. Is the bound (3.18) attainable, i.e., are there asymptotically efficient nonparametric estimators for a given estimation problem?

3. If the answer to the second question is positive, for which estimators is the bound (3.18) attained in specific cases?

Answers to these questions are closely interconnected. Indeed, the inequality I(F) ≤ I(F, ϕ) follows from Definition 3.1.1. On the other hand, if for some positive functional A(F), continuous in the topology R, a family of estimators Φn is found such that for some loss function w ∈ W and some domain U ⊂ F,

lim sup_{n→∞} sup_{F∈U} | EF w(√n(Φn − Φ(F))) − (2π)^{−1/2} ∫ w(x A^{−1/2}(F)) e^{−x²/2} dx | = 0, (3.20)

then it follows from (3.18) and the monotonicity of w that A(F0) ≤ I(F0). Thus if it is possible to construct a sequence of parametric families ϕr = {F^r_h}, r = 1, 2, . . ., such that the corresponding information quantities I(F0, ϕr) converge to A(F0) as r → ∞, together with a sequence of estimators Φn satisfying relation (3.20), then I(F0) = A(F0) and Φn is an (F, R, w)-asymptotically efficient (in U) nonparametric estimator. We shall adhere to this outline of investigation of the properties of nonparametric estimators in the next sections for the examples considered above.

3.2 Bounds on Risks for Some Smooth Functionals

Definition 3.2.1 (Differentiability in von Mises' sense). Let F be a set of distributions on (X, 𝔛) such that for any F1 ∈ F, F2 ∈ F, h ∈ (0, 1) the distribution (1 − h)F1 + hF2 ∈ F. A functional Φ(F), F ∈ F, is differentiable in von Mises' sense in F if for any distributions F1 ∈ F, F2 ∈ F, and for some functional l(F, y), F ∈ F, y ∈ X, the equality

Φ(F1 + h(F2 − F1)) = Φ(F1) + h ∫ l(F1, y)(F2(dy) − F1(dy)) + o(h) (3.21)

is valid as h → 0.

For differentiable functionals Φ(F) one can find a class of parametric families satisfying the three natural conditions. This class is convenient because the problem of minimizing the corresponding information quantity I(F, ϕ) is easily solved within it. Evidently, in general, minimization with respect to this class may not lead to I(F), the nonparametric information quantity. However, in almost all known examples, the bound on the quality of estimation obtained in this way is asymptotically the best.


Thus suppose we solve the problem of estimating a functional Φ(F), F ∈ F, which is differentiable in von Mises' sense. Consider a parametric family of distributions {Fh} defined by the equality

Fh(dx) = F(dx)[1 + (h − t)ψ(x)], t = Φ(F). (3.22)

Clearly, the conditions

∫ψ(x)F(dx) = 0, |ψ(x)| < N (3.23)

are sufficient for (3.22) to define a probability measure for |h − t| < δ = δ(N). Assume also that Fh ∈ F for |h − t| < δ(N) for any ψ satisfying (3.23) with some N > 0.

The first natural condition in the first section is automatically fulfilled for the family (3.22). Setting

F1(Γ) = F(Γ), F2(Γ) = F(Γ) + (1/N) ∫_Γ ψ(x)F(dx), (3.24)

we obtain from (3.21) that

Φ(Fh) = Φ(F1 + (h − t)N(F2 − F1)) = Φ(F) + (h − t) ∫ l(F, y)ψ(y)F(dy) + o(h − t). (3.25)

This implies that the second natural condition in the first section is fulfilled for the family (3.22) provided

∫ l(F, y)ψ(y)F(dy) = 1. (3.26)

Furthermore, since under condition (3.23) we have for the family (3.22)

[dFh/dF(x)]^{1/2} = [1 + (h − t)ψ(x)]^{1/2} = 1 + (1/2)(h − t)ψ(x) + o(h − t), (3.27)

the third natural condition is also fulfilled and, moreover,

η0 = (1/2)ψ(x), I(F, ϕ) = ∫ψ²(x)F(dx). (3.28)

Equations (3.23) and (3.26) yield

∫[l(F, y) − EF l(F, X)]ψ(y)F(dy) = 1. (3.29)

From here and the Cauchy–Schwarz inequality, we obtain the following bound from below on the information quantities of parametric families of the form (3.22):

I(F, ϕ) = ∫ψ²(x)F(dx) ≥ [∫[l(F, y) − EF l(F, X)]²F(dy)]^{−1} = σ^{−2}(l, F). (3.30)

If the functional l(F, x) is bounded in x, then setting

ψ(x) = ψ0(x) = (l(F, x) − EF l(F, X))σ^{−2}(l, F), (3.31)

we arrive at a parametric family for which the lower bound (3.30) is attained.

Assume now that the functional l(F, y) is unbounded but square integrable with respect to the measure F. We show that in this case there exists a sequence of parametric families of the form (3.22) whose information quantities are arbitrarily close to σ^{−2}(l, F). Let l^{(N)}(F, ·) be a sequence of bounded functions converging to l(F, ·) in L2(F) as N → ∞. Then as N → ∞, the following relations are obviously valid:

EF l^{(N)}(F, X) = ∫ l^{(N)}(F, x)F(dx) → EF l(F, X), (3.32)

σ²_N(l, F) = ∫ (l(F, x) − EF l(F, X))(l^{(N)}(F, x) − EF l^{(N)}(F, X))F(dx) → σ²(l, F). (3.33)

Clearly the parametric family (3.22) in which ψ(x) is of the form

ψ^{(N)}(x) = (l^{(N)}(F, x) − EF l^{(N)}(F, X))σ^{−2}_N(l, F) (3.34)


satisfies (3.23)–(3.26); moreover, (3.33) easily yields that

I(F, ψ^{(N)}) → σ^{−2}(l, F) (3.35)

as N → ∞. Relation (3.35) and Definition 3.1.1 imply the inequality

I(F) ≤ σ^{−2}(l, F), (3.36)

provided the parametric family (3.22) with ψ^{(N)}(x) in place of ψ(x) belongs to F for N > 0, |h − t| < δ(N). This inequality, together with Theorem 3.1.2, yields the following assertion:

Theorem 3.2.1. If the functional Φ(F) is differentiable in the sense of (3.21), l(F, ·) ∈ L2(F), the family (3.22) with ψ = ψ^{(M)}, M > 0, belongs to F for all |h − Φ(F0)| < δ(M), δ(M) > 0, w ∈ W, and the sequence of neighborhoods UN(F0) and the topology R satisfy the conditions of Theorem 3.1.2, then

lim_{N→∞} lim inf_{n→∞} inf_{Φn} sup_{F∈UN(F0)} EF w(√n(Φn − Φ(F))) ≥ (2π)^{−1/2} ∫ w(xσ(l, F0)) e^{−x²/2} dx, (3.37)

where σ²(l, F) = ∫[l(F, y) − EF l(F, X)]²F(dy).

The functional σ²(l, F0), as well as the bound (3.37), can be computed in many specific cases without much difficulty.

Example 3.2.1. Consider the functional Φ(F) given in Example 3.1.1 on the set F2 of distributions F such that ∫|ϕ(x)|²F(dx) < ∞. In this case, due to the linearity of Φ, we have for any F1, F2 ∈ F2,

Φ(F1 + h(F2 − F1)) = Φ(F1) + h ∫ ϕ(x)[F2(dx) − F1(dx)]. (3.38)

Therefore l(F, x) = ϕ(x) and

σ²(l, F) = ∫[ϕ(x) − EFϕ(X)]²F(dx). (3.39)
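To make (3.22), (3.30), and (3.31) concrete in this example, one can check numerically that the optimally tilted family satisfies Φ(Fh) = h exactly and has information σ^{−2}(l, F). The following sketch uses assumed choices (ϕ(x) = x and a discrete F with finite support); it is an illustration, not a construction from [1]:

import numpy as np

# Sketch with assumed choices (phi(x) = x, discrete F): the tilted family
# (3.22) with the optimal psi_0 of (3.31) satisfies Phi(F_h) = h exactly
# and has information I(F, phi) = sigma^{-2}(l, F), matching (3.30).
xs = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])    # support points (assumed)
p = np.array([0.10, 0.20, 0.30, 0.25, 0.15])  # probabilities of F
t = p @ xs                                     # t = Phi(F), the mean of F
sigma2 = p @ (xs - t) ** 2                     # sigma^2(l, F), l(F, x) = x

psi0 = (xs - t) / sigma2                       # the optimal psi of (3.31)
for h in (t - 0.1, t, t + 0.1):
    ph = p * (1.0 + (h - t) * psi0)            # the tilted law F_h of (3.22)
    assert np.all(ph > 0) and abs(ph.sum() - 1.0) < 1e-12
    print(f"h = {h:+.3f}   Phi(F_h) = {ph @ xs:+.3f}")

print("I(F, phi) =", p @ psi0 ** 2, "  sigma^{-2}(l, F) =", 1.0 / sigma2)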

Example 3.2.2. Let

Φ(F) = ∫···∫ ϕ(x1, x2, . . . , xm)F(dx1)···F(dxm), (3.40)

where ϕ(x1, x2, . . . , xm) is a symmetric function of x1, x2, . . . , xm and F is the set of distributions such that

∫···∫ |ϕ(x1, x2, . . . , xm)|²F(dx1)···F(dxm) < ∞. (3.41)

In this case it is easy to verify that

Φ(F1 + h(F2 − F1)) = Φ(F1) + mh ∫···∫ ϕ(y, x2, . . . , xm)F1(dx2)···F1(dxm)[F2(dy) − F1(dy)] + o(h). (3.42)

Therefore,

l(F, x) = m ∫···∫ ϕ(x, x2, . . . , xm)F(dx2)···F(dxm), (3.43)

σ²(l, F) = m² ∫ [∫···∫ ϕ(x, x2, . . . , xm)F(dx2)···F(dxm) − Φ(F)]² F(dx). (3.44)
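A Monte Carlo illustration of (3.44) (all choices below are assumptions for the illustration: m = 2, kernel ϕ(x1, x2) = x1x2, F uniform on [0, 1]): then σ²(l, F) = 4(EX)²Var(X) = 1/12, and nVar(Un) for the U-statistic of the form (3.89) in section 3.3 should be close to it.

import numpy as np

# Assumed choices: m = 2, phi(x1, x2) = x1 * x2, F = Uniform[0, 1];
# then sigma^2(l, F) = 4 (EX)^2 Var X = 1/12 ~ 0.0833 by (3.44).
rng = np.random.default_rng(1)
n, reps = 60, 4000

def u_stat(x):
    # U-statistic for the symmetric kernel x1 * x2: the sum over ordered
    # pairs i != j of x_i * x_j is (sum x)^2 - sum x^2; divide by n(n-1).
    m = len(x)
    return (x.sum() ** 2 - (x ** 2).sum()) / (m * (m - 1))

vals = np.array([u_stat(rng.random(n)) for _ in range(reps)])
print("n * Var(U_n):", n * vals.var())   # close to 1/12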

Example 3.2.3. Consider the functional given in Example 3.1.2 on the set F2 of Example 3.2.1; recall that in this case ϕ : X → Rr. Assume that the function ϕ0 is continuously differentiable. Under these assumptions we obtain from Taylor's formula

Φ(F1 + h(F2 − F1)) − Φ(F1) = ϕ0[∫ϕ(x)F1(dx) + h∫ϕ(x)(F2(dx) − F1(dx))] − ϕ0(∫ϕ(x)F1(dx)) (3.45)
= h ⟨∇ϕ0(∫ϕ(x)F1(dx)), ∫ϕ(y)(F2(dy) − F1(dy))⟩ + o(h). (3.46)

Thus,

l(F, y) = ⟨∇ϕ0(EFϕ(X)), ϕ(y)⟩ (3.47)

and therefore

σ²(l, F) = ∫ ⟨ϕ(y) − EFϕ(X), ∇ϕ0(EFϕ(X))⟩² F(dy). (3.48)


Example 3.2.4. The functional med F considered in Example 3.1.3 under the conditions stipulated there can be defined as the root t = Φ(F) of the equation

∫ sign(x − t)F(dx) = 0. (3.49)

Consider the more general functional Φ(F) which represents the root t = Φ(F) of the equation

EFϕ(X, t) = ∫ ϕ(x, t)F(dx) = 0, (3.50)

where ϕ : X × R1 → R1 and the function ϕ and the set F are such that

1. for any distribution F ∈ F, equation (3.50) possesses the unique solution Φ(F);

2. the function λF(h) = EFϕ(X, h) is continuous in h for F ∈ F, differentiable at h = t = Φ(F), F ∈ F, and the derivative λ′F(t) ≠ 0;

3. for F ∈ F, the relation EF|ϕ(X, Φ(F))|² < ∞ is satisfied;

4. for any F1, F2 ∈ F, the relation Φ(F1 + h(F2 − F1)) → Φ(F1) as h → 0 is valid.

From the equalities

∫ϕ(x, Φ(F1))F1(dx) = 0, (3.51)

∫ϕ(x, Φ(F1 + h(F2 − F1)))[F1(dx) + h(F2(dx) − F1(dx))] = 0, (3.52)

and conditions 1, 2, and 4 we easily obtain that

λ_{F1}[Φ(F1 + h(F2 − F1))] − λ_{F1}(Φ(F1)) = (λ′_{F1}(Φ(F1)) + o(h))(Φ(F1 + h(F2 − F1)) − Φ(F1)) (3.53)
= −h ∫ϕ[x, Φ(F1 + h(F2 − F1))](F2(dx) − F1(dx)). (3.54)

Here (3.53) follows from the Taylor formula for λ_{F1}(h) at the point h = Φ(F1), and (3.54) follows from (3.52) and the fact that λ_{F1}(Φ(F1)) = 0.

Utilizing once again conditions 2 and 4, we arrive at the equalities

l(F, y) = −ϕ(y, Φ(F))/λ′F(Φ(F)), (3.55)

σ²(l, F) = ∫ϕ²(y, Φ(F))F(dy)·[λ′F(Φ(F))]^{−2}. (3.56)

Condition 4 can be replaced by the following: the function ϕ(x, h) varies monotonically in h. If this function is strictly monotone in h, then condition 1 can be omitted. In particular, for a functional of the type considered in Example 3.1.3, conditions 1–4 are fulfilled if F is the set of distributions possessing a positive density f(x) with respect to Lebesgue measure on the real line. Moreover,

λ′F(med F) = 2f(med F), σ²(l, F) = [4f²(med F)]^{−1}. (3.57)

To conclude this section we shall consider two examples in which the lower bound on the risks is obtained not from Theorem 3.2.1 but by a direct choice of a suitable parametric family. The first example may be viewed as a generalization of the preceding one.

Example 3.2.5. Let the functions ϕ1(x, h), . . . , ϕr(x, h) : R1 × R1 → R1 and the family of distributions F on the real line be such that the following conditions are satisfied:

1. For any F ∈ F, the equations

∫ϕi(x, t)F(dx) = 0, i = 1, 2, . . . , r, (3.58)

have a common solution t = Φ(F) and, moreover, this solution is unique;

2. The functions λ^i_F(h) = EFϕi(X, h) are continuous in h for F ∈ F and the derivatives dλ^i_F/dh exist at h = Φ(F);


3. The functions ϕi(x, h) belong to L2(F) for all F ∈ F and are continuous with respect to h in L2(F) at h = t = Φ(F);

4. The determinant |BF(Φ(F))| ≠ 0, where BF(h) = ‖b^{ij}_F(h)‖ and b^{ij}_F(h) = ∫ϕi(x, h)ϕj(x, h)F(dx). Note that |BF| is evidently a Gram determinant, and thus this condition is the condition of linear independence of the functions ϕi(x, Φ(F)) in L2(F) for F ∈ F.

To avoid some technical difficulties we shall assume, in addition, that the functions ϕi are bounded for x ∈ R1, h ∈ R1, and we shall seek a parametric family Fh ∈ F of the form

Fh(dx) = F(dx)[1 + (h − t)ψ(x, h)], (3.59)

ψ(x, h) = ∑_{j=1}^r γj(h)[ϕj(x, h) − EFϕj(X, h)]. (3.60)

The function ψ(x, h) satisfies condition (3.23) for any h, and hence Fh is a distribution. The equalities

∫ϕi(x, h)[1 + (h − t)ψ(x, h)]F(dx) = 0, i = 1, 2, . . . , r, (3.61)

follow from the condition Φ(Fh) = h. From here and from (3.58), noting the choice of ψ(x, h), we arrive at a system of equations for the coefficients γ1(h), . . . , γr(h) (which so far are not determined):

∑_{j=1}^r [b^{ij}_F(h) − EFϕi(X, h)EFϕj(X, h)]γj(h) = −(λ^i_F(h) − λ^i_F(t))/(h − t). (3.62)

It follows from the four conditions in this example that for |h − t| < δ the system (3.62) possesses a unique solution, which converges as h → t to the unique solution of the system

∑_{j=1}^r b^{ij}_F(t)γj(t) = −dλ^i_F(t)/dt, i = 1, 2, . . . , r. (3.63)

Let the function ψ(x, h) and the family of distributions Fh, |h − t| < δ, be chosen in accordance with (3.60) with the functions γj(h) defined by (3.62), and assume that this family Fh belongs to F. We obtain from (3.60)

[dFh/dF(x)]^{1/2} = 1 + (1/2)(h − t)ψ(x, h) + o(h − t). (3.64)

From here and the third condition of this example it follows that the first condition given in section 1 is fulfilled for the family (3.60), as well as the equality for the information quantity

I(t, F) = ∫ψ²(x, t)F(dx). (3.65)

Next we obtain (observing that EFϕj(X, t) = λ^j_F(t) = 0 by (3.58))

∫ψ²(x, t)F(dx) = ∫ |∑_{j=1}^r γj(t)ϕj(x, t)|² F(dx) = ∑_{i,j=1}^r γi(t)γj(t)b^{ij}_F(t) = I(t, F), (3.66)

where γ1(t), . . . , γr(t) is the solution of the system of equations (3.63).

The quantity I(t, F) may be interpreted geometrically, provided the “true” distribution F(x) belongs to some parametric family Gh with a known (up to the parameter h) density g(x, h) with respect to some σ-finite measure ν on X and, furthermore, Gt(x) = F(x), Φ(Gh) = h.

Then relation (3.58) can be rewritten in the form

∫ϕi(x, h)g(x, h)ν(dx) = 0, i = 1, 2, . . . , r. (3.67)

Assume that these identities may be differentiated with respect to h at h = t under the integral sign, and let J(x, t) = g′t(x, t)/g(x, t). Then we obtain from (3.67)

−dλ^i_F(t)/dt = ∫ϕi(x, t)J(x, t)F(dx) = ⟨ϕi(·, t), J(·, t)⟩F. (3.68)


From here and from (3.63) we obtain that ∑γjϕj is the projection in L2(F) of the vector J onto the subspace A generated by the vectors ϕ1(x, t), . . . , ϕr(x, t), and hence

I(t, F) = ‖∑γjϕj‖²F (3.69)

is the square of the length of this projection. Evidently I(t, F) ≤ ‖J‖²F = Ig(t), where Ig(t) is the information quantity of the density g(x, t), and equality holds if and only if J(x, t) ∈ A. Thus I(t, F) can be somewhat loosely interpreted as the part of the Fisher information on the parameter t of the density g which is contained in the relations (3.67).

If the family (3.60) with the functions γi(h) defined by (3.62) belongs to F for |u − t| < δ, then I(t, F) ≥ I(F) and, by means of Theorem 3.1.2, we obtain the lower bound on the risk of estimators of Φ(F), F ∈ F. Only slightly more complicated is the argument in the case of unbounded functions ϕi(x, h). In this case it is necessary to approximate ψ(x, h) by means of a sequence of bounded functions ψN(x, h) converging to ψ(x, h) in L2(F) and to postulate that the family obtained from (3.60) by replacing ψ with ψN belongs to F for |u − t| < δ(N).

Example 3.2.6. Consider the problem of estimating a location parameter in the so-called two-sample problem, which can be described as follows. Let F be the set of all distributions on the real line possessing an absolutely continuous density f(x) with respect to the Lebesgue measure such that, moreover,

If = ∫ (f′(x))²/f(x) dx < ∞. (3.70)

Let a two-dimensional sample (X1, Y1), . . . , (Xn, Yn) from a general population with distribution functions F(x), F(y − t), F ∈ F, be given. It is required to estimate t. Thus in this example the set F is the totality of all two-dimensional distributions of the form F(x)F(y − t), t ∈ R1, F ∈ F.

In a somewhat different manner, this problem can be described as follows. It is required to estimate the functional t = Φ(F, G) defined as the unique common root of the equations

∫∫ [ϕ(x + t) − ϕ(y)]dF(x)dG(y) = 0 (3.71)

for an arbitrary bounded measurable function ϕ. In such a formalization this problem may be viewed as an infinite-dimensional generalization of the preceding example. Let J(x) = f′(x)/f(x) and consider the parametric family of distributions in R² with density

fh(x)fh(y − t) = [f(x)f(x + h − t)f(y − h)f(y − t)]^{1/2} [∫(f(z)f(z + h − t))^{1/2}dz]^{−2}. (3.72)

Clearly this family belongs to F, and the first two conditions of section 1 are satisfied for it. Moreover, the condition If < ∞ assures the validity of the relations

∫[f(x)f(x + h − t)]^{1/2}dx = 1 + O((h − t)²), (3.73)

[f(x + h − t)/f(x)]^{1/2} = 1 + (1/2)(h − t)J(x) + o(h − t) (3.74)

as h → t. This implies that the third natural condition of section 1 is satisfied and that the equalities

ψ(x, y) = (1/2)(J(x) − J(y − t)), I(t, f) = If/2 (3.75)

are valid.

The topology R in this problem may be chosen quite naturally: a sequence of distributions Fn(x) × Fn(y − hn) converges to the distribution F(x)F(y − h) provided Ifn → If and hn → h.

From Theorem 3.1.2, we obtain the following minimax bound on the risks with w ∈ W in the two-sample problem:

lim_{N→∞} lim inf_{n→∞} inf_{Φn} sup_{F∈UN(F0)} EF w(√n(Φn − Φ(F))) ≥ (2π)^{−1/2} ∫ w(x·2^{1/2}·I^{−1/2}_{f0}) e^{−x²/2} dx. (3.76)

The example considered is also of interest because when computing I(F) here one cannot restrict oneself to linear parametric families of the type (3.22), since linear parametric families are not contained in F.


3.3 Examples of Asymptotically Efficient Estimators

As is known, for parametric families of distributions there exist general methods for constructing estimators that are asymptotically efficient in various senses.

Unfortunately, in a nonparametric situation there are as yet no general methods of this kind. For the first five examples considered in section 2, estimators that are uniformly asymptotically efficient in a corresponding subclass of F were constructed for a wide class of loss functions w. However, the methods of construction differ from one example to another. Their description, as well as the proof of asymptotic efficiency, would occupy a large amount of space and would presumably be useless for other classes of such problems.

Therefore we shall confine ourselves here to the construction of asymptotically efficient estimators in the two simplest cases: Example 3.2.1 and Example 3.2.3.

We start with an investigation of the properties of the arithmetic mean ϕ̄n = n^{−1}∑_{i=1}^n ϕ(Xi) as an estimator of the functional

Φ(F) = ∫ϕ(x)F(dx). (3.77)

In the class of distributions F2 of Example 3.2.1, this estimator is not even uniformly consistent. Therefore it is necessary to consider a “narrower” class of distributions.

Denote by F^α_2 ⊂ F2 the class of distributions such that, for some chosen function α = α(N) which tends to zero as N → ∞, the inequality

∫_{|ϕ|>N} |ϕ(x)|²F(dx) ≤ α(N) (3.78)

is fulfilled. In this class the Lindeberg condition for the random variables ϕ(X1), . . . , ϕ(Xn) is obviously fulfilled uniformly in F, and thus the weak convergence

L(√n[ϕ̄n − Φ(F)] | F) → N(0, σ²(F)), (3.79)

where σ²(F) = ∫|ϕ(x) − EFϕ(X)|²F(dx), is also uniform in F ∈ F^α_2 for any function α(N) → 0.

Furthermore, the equality

EF ζ²n = σ²(F) (3.80)

is clearly valid for the sequence ζn = √n(ϕ̄n − Φ(F)). From these relations, the uniform (in F ∈ F^α_2) integrability of the sequence ζ²n follows.

From (3.79) and the uniform integrability, for any function w ∈ W such that

w(x) ≤ c(|x|² + 1), (3.81)

the relation

lim_{n→∞} sup_{F∈F^α_2} | EF w(√n(ϕ̄n − Φ(F))) − (2π)^{−1/2} ∫ w(xσ(F)) e^{−x²/2} dx | = 0 (3.82)

follows.

If we introduce on the set F^α_2 any topology R coordinated with the problem of estimating the functional of Example 3.2.1 in which the functional σ²(F) is continuous, then, as follows from the arguments at the end of section 1 and from (3.82), ϕ̄n is an (F2, R, w)-asymptotically efficient (uniformly in F^α_2) estimator of the functional for any loss function w ∈ W satisfying condition (3.81), and I(F) = σ^{−2}(F).

Note that for faster growing loss functions the estimator ϕ̄n will not, in general, be asymptotically efficient. However, one can construct a corresponding truncation of the estimator ϕ̄n which is asymptotically efficient for any w ∈ W.

We now turn to the study of an estimator of the functional Φ(F) = ϕ0(∫ϕ(x)F(dx)) of Example 3.1.2. A lower bound in this case was obtained in Theorem 3.2.1, where σ(l, F) was computed in Example 3.2.3. We shall now prove that in a corresponding class of distribution functions this bound cannot be improved and that the estimator

ϕ0(ϕ̄n) (3.83)

is asymptotically efficient.

Assume now that the function ϕ0 : Rr → R1 possesses bounded first-order derivatives in all its arguments satisfying the Lipschitz condition. The set of distributions F^α_2 is defined here as in the first example of this section, but the function ϕ : X → R1 is now replaced by a function ϕ : X → Rr.

Let Φn be an estimator of the form

Φn = ϕ0(ϕ̄n). (3.84)


Then, expanding the function ϕ0 by Taylor's formula and taking into account the boundedness and the Lipschitz condition for ∇ϕ0, we obtain via the mean value theorem, for some constant c > 0 common to all F ∈ F^α_2, the inequality

|Φn − Φ(F) − ⟨ϕ̄n − EFϕ(X), ∇ϕ0(EFϕ(X))⟩| ≤ c|ϕ̄n − EFϕ(X)|²[1 + |ϕ̄n − EFϕ(X)|]^{−1}. (3.85)

Let

ζn = ⟨ϕ̄n − EFϕ(X), ∇ϕ0(EFϕ(X))⟩. (3.86)

As in the preceding example we easily obtain that ζn is, uniformly in F^α_2, asymptotically normal with parameters (0, n^{−1}σ²(F)), where

σ²(F) = ∫⟨ϕ(y) − EFϕ(X), ∇ϕ0(EFϕ(X))⟩²F(dy). (3.87)

Moreover, EF ζ²n = σ²(F)/n. This implies the uniform (in F^α_2) integrability of the random variables nζ²n by Scheffé's lemma. Analogously we verify the uniform integrability of n|ϕ̄n − EFϕ(X)|².

The last assertion and (3.85) allow us to obtain, for any function w ∈ W satisfying condition (3.81), the relation

lim_{n→∞} sup_{F∈F^α_2} | EF w(√n(Φn − Φ(F))) − (2π)^{−1/2} ∫ w(xσ(F)) e^{−x²/2} dx | = 0. (3.88)

As above, (3.88) implies the uniform (in F^α_2) asymptotic efficiency of the estimator Φn = ϕ0(ϕ̄n) in the corresponding topology for the indicated class of loss functions, as well as the equality I(F) = σ^{−2}(F).
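An illustrative simulation of (3.88) (a sketch; the choices ϕ(x) = x, ϕ0(u) = u², F = Exp(1) are assumptions): here Φ(F) = (EFX)² = 1 and σ²(F) = (2EFX)²·VarF(X) = 4, so nVar(Φn) should approach 4.

import numpy as np

# Assumed choices: phi(x) = x, phi0(u) = u^2, F = Exp(1); by (3.87),
# sigma^2(F) = (2 E X)^2 Var X = 4 for the plug-in estimator (3.84).
rng = np.random.default_rng(2)
n, reps = 500, 10000
samples = rng.exponential(1.0, size=(reps, n))
phi_n = samples.mean(axis=1) ** 2             # Phi_n = phi0(phi_bar_n)
print("n * Var(Phi_n):", n * phi_n.var())     # close to 4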

To conclude this section we shall describe, without proofs, asymptotically efficient nonparametric estimators in some other cases.

In Example 3.2.2, U-estimators, i.e., estimators of the form

Un = (C^m_n)^{−1} ∑_{Sn} ϕ(X_{α1}, . . . , X_{αm}), (3.89)

where

Sn = {(α1, . . . , αm) : 1 ≤ α1 < α2 < . . . < αm ≤ n}, (3.90)

are asymptotically efficient for loss functions satisfying condition (3.81). For a wider class of loss functions, certain truncations of U-estimators are asymptotically efficient. In Examples 3.2.4 and 3.1.2, under quite general assumptions, Huber M-estimators and their truncations are asymptotically efficient. These are defined as solutions of the equation

∫ϕ(x, t)dFn(x) = 0, (3.91)

where Fn is the empirical distribution function. In Example 3.2.5 the information quantity, under natural restrictions on F, coincides with I(t, F), and asymptotically efficient nonparametric estimators may be constructed recursively. For Example 3.2.6, estimators which are asymptotically normal with parameters (0, 1/(2If)) have been constructed in a number of papers.
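A minimal sketch of an M-estimator (3.91) for location (the particular score is an assumption, not the book's prescription: Huber's ψ with the conventional threshold k = 1.345, with SciPy's brentq as the root-finder):

import numpy as np
from scipy.optimize import brentq

# Solve sum_i psi(X_i - t) = 0, i.e. integrate phi(x, t) = psi(x - t)
# against the empirical distribution F_n, as in (3.91). Huber's psi with
# k = 1.345 is an assumed, conventional choice.
def huber_psi(u, k=1.345):
    return np.clip(u, -k, k)

def m_estimate(x, k=1.345):
    score = lambda t: huber_psi(x - t, k).sum()   # decreasing in t
    return brentq(score, x.min(), x.max())        # root is bracketed here

rng = np.random.default_rng(3)
x = np.concatenate([rng.standard_normal(95), 10.0 + rng.standard_normal(5)])
print("Huber M-estimate:", m_estimate(x), "  sample mean:", x.mean())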

3.4 Estimation of Unknown Density

We have seen that, in the case of estimation of, say, a one-dimensional parameter of a distribution in the regular case, there usually exist √n-consistent estimators provided consistent ones exist. In a nonregular case, one can construct estimators which converge even faster. In the nonparametric case the situation is quite different. Here there are many interesting problems for which the nonparametric information quantity introduced above vanishes on the whole set of distributions F under consideration, but a consistent estimator (with a rate of convergence slower than √n) is nevertheless possible.

Note that in parametric problems the equality I(θ) ≡ 0 on a whole interval implies the nonexistence of consistent estimators of the parameter on this interval, since the density then does not depend on the parameter θ.

This type of problem includes the problems of estimating a probability density at a fixed point or on the whole real line, derivatives of a density, the mode of a distribution based on independent observations, the spectral density based on observations of a stationary process, and others.

In this section we shall consider only one example of this type of problem, namely, estimation of a probability density at a point based on observations in R1.

Let X1, X2, . . . , Xn ∈ R1 be a sample from a population with unknown density f(x). If f(x) depends on a finite number of parameters and is a known function of x and of these parameters, we again arrive at a problem of parametric estimation. If, however, the only thing that is known is that f(x) belongs to a sufficiently large class F of functions, then the problem of estimating f(x) becomes infinite-dimensional, i.e., nonparametric.


We proceed from the empirical distribution function Fn(x) = νn(x)/n, where νn(x) is the number of observations smaller than x. Fn(x) is a well-known estimator for the distribution function F(x). Setting

χ(x) = I(x ≥ 0), (3.92)

we have the representation

Fn(x) = (1/n) ∑_{k=1}^n χ(x − Xk). (3.93)

As is known, the function Fn(x) is close to the actual distribution function

F(x) = ∫_{−∞}^x f(y)dy (3.94)

provided n is sufficiently large. Therefore one would expect that its derivative is close to f(x) = F′(x). However,

F′n(x) = (1/n) ∑_{k=1}^n δ(x − Xk), (3.95)

where δ(x) is the Dirac δ-function, which is not a function in the sense of classical analysis. It would therefore be natural to “smooth” Fn(x) and use as an estimator of the density the derivative of such a smoothed function. We thus arrive at estimators of the form

fn(x) = (1/(nhn)) ∑_{i=1}^n V((x − Xi)/hn), (3.96)

where V(x) is absolutely integrable and satisfies the condition

∫_{−∞}^∞ V(x)dx = 1, (3.97)

while the sequence hn is such that

hn → 0, nhn → ∞. (3.98)

The class of estimators (3.96) was first introduced by Rosenblatt and Parzen; they are called Parzen–Rosenblatt estimators. Obviously the convergence fn(x) → f(x) in a certain sense is valid only under some restrictions on f(x). If, for example, f(x) possesses points of discontinuity, then the convergence is uniform for no choice of hn and V(x). If it is known beforehand that f belongs to a certain class of continuous functions, then one can find in the class (3.96) estimators which converge to f(x) at a given rate.
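A direct transcription of (3.96) (an illustrative sketch: the Gaussian kernel V and the bandwidth hn = n^{−1/3}, the choice made below for the Lipschitz class Σ(1, L), are assumptions):

import numpy as np

# Parzen-Rosenblatt estimator (3.96); V = N(0,1) density (integrates to 1),
# h = n^{-1/3} as for the Lipschitz class below -- assumed choices.
def parzen_rosenblatt(x_grid, sample, h):
    V = lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    u = (x_grid[:, None] - sample[None, :]) / h
    return V(u).mean(axis=1) / h     # (1/(n h)) sum_i V((x - X_i)/h)

rng = np.random.default_rng(4)
n = 4000
sample = rng.standard_normal(n)
grid = np.array([-1.0, 0.0, 1.0])
est = parzen_rosenblatt(grid, sample, h=n ** (-1.0 / 3.0))
true_f = np.exp(-0.5 * grid ** 2) / np.sqrt(2.0 * np.pi)
print("estimate:", est.round(4), "  true f:", true_f.round(4))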

We shall discuss this point in more detail. Let it be known that f(x), x ∈ R1, belongs to the class of functions satisfying the Lipschitz condition with constant L:

|f(x2) − f(x1)| ≤ L|x2 − x1|. (3.99)

Denote by Σ(1, L) the set of all such functions. Let fn(x) be an estimator of the form (3.96). We shall bound the quantity

Dn(x) = Ef(fn(x) − f(x))² = Ef(fn(x) − Ef fn(x))² + (Ef fn(x) − f(x))². (3.100)

First we shall consider the bias term. Clearly,

Ef fn(x) − f(x) = (1/hn) ∫ V((x − y)/hn)[f(y) − f(x)]dy = ∫ V(z)[f(x − hnz) − f(x)]dz. (3.101)

If the function |zV(z)| is integrable, then we obtain from the last relation the bound

|Ef fn(x) − f(x)| ≤ Lhn ∫|zV(z)|dz, (3.102)

which is valid for f ∈ Σ(1, L). In the same manner,

Ef(fn(x) − Ef fn(x))² = (1/(nh²n)) {∫V²((x − y)/hn)f(y)dy − [Ef V((x − X1)/hn)]²} ≤ (1/(nhn)) ∫V²(z)f(x − hnz)dz. (3.103)


If V² is integrable, then for some constant c common to all f ∈ Σ(1, L) the inequality

Ef(fn(x) − Ef fn(x))² ≤ c/(nhn) (3.104)

is valid. Evidently the best bound (in order of magnitude) for Dn(x) is obtained if we set hn = n^{−1/3}. For this choice of hn we have Dn(x) ≤ c1 n^{−2/3}, where, as is easy to verify, the constant c1 does not depend on x or on f ∈ Σ(1, L). We thus obtain the following result:

If hn = n^{−1/3} and the functions |xV(x)| and V²(x) are integrable, then for an estimator fn(x) of an unknown density f(x) ∈ Σ(1, L) constructed in accordance with (3.96) the inequality

sup_{f∈Σ(1,L)} sup_{x∈R1} Ef(fn(x) − f(x))² ≤ cn^{−2/3} (3.105)

is valid for all n.

The result can be generalized in various directions. In particular, loss functions which are not quadratic may be considered, as well as classes of functions f other than Σ(1, L). It is not difficult to verify that if V(x) decreases rapidly as |x| → ∞, then for any integer k > 0,

sup_n sup_{f∈Σ(1,L)} sup_{x∈R1} Ef|(fn(x) − f(x))n^{1/3}|^{2k} < ∞. (3.106)

This fact evidently implies that for any loss function w(x) whose growth is at most polynomial as |x| → ∞ and for any estimator of the form (3.96) with hn = n^{−1/3} and with a finite (compactly supported), say, function V(x) satisfying ∫_{−∞}^∞ V(x)dx = 1, the relation

sup_n sup_{f∈Σ(1,L)} sup_{x∈R1} Ef w((fn(x) − f(x))n^{1/3}) < ∞ (3.107)

is valid.

Consider now yet another generalization, to other families of functions f. We shall see in particular that for families of f satisfying more stringent smoothness conditions one can find, among estimators of the form (3.96), estimators which converge to f(x) even faster; the attainable rate of convergence depends substantially on the smoothness of f.

Denote by Σ(β, L), β = k + α, 0 < α ≤ 1, k ≥ 0, the class of functions possessing k-th order derivatives and such that for xi ∈ R1,

|f^{(k)}(x2) − f^{(k)}(x1)| ≤ L|x2 − x1|^α, (3.108)

and set Σ(β) = ∪_{L>0} Σ(β, L). Thus Σ(β), β = k + α, is the class of functions with k-th order derivatives satisfying Hölder's condition with exponent α.

In order not to specify each time the conditions for convergence of the corresponding integrals, we shall confine ourselves below to the study of procedures of type (3.96) with bounded functions V(x).

Theorem 3.4.1. For an estimator of the form (3.96) with hn = n^{−1/(2β+1)}, β = k + α, and a bounded function V(x) satisfying the condition ∫_{−∞}^∞ V(x)dx = 1 and the conditions

∫_{−∞}^∞ x^j V(x)dx = 0, j = 1, 2, . . . , k, (3.109)

the inequality

sup_n sup_{f∈Σ(β,L)} sup_{x∈R1} Ef[(fn(x) − f(x))n^{β/(2β+1)}]² < ∞ (3.110)

is valid for any L > 0.
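Kernels satisfying (3.109) for k ≥ 2 can be obtained by multiplying a symmetric base kernel by a polynomial (a standard device, sketched here; this is not the construction used in [1]). With φ the N(0, 1) density, solving a + b = 1, a + 3b = 0 gives V(x) = (3/2 − x²/2)φ(x), whose first three moments vanish while ∫V = 1:

import numpy as np

# Sketch of a kernel satisfying (3.109) for k = 3 (assumed construction,
# not from [1]): V(x) = (3/2 - x^2/2) * N(0,1)-density.
x = np.linspace(-10.0, 10.0, 400001)
dx = x[1] - x[0]
phi = np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)
V = (1.5 - 0.5 * x ** 2) * phi

for j in range(4):
    print(f"moment {j}:", round(float(np.sum(x ** j * V) * dx), 6))
# prints moment 0 = 1 and moments 1-3 = 0, as (3.109) requires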

We shall now show that this result can be extended to a substantially wider class of loss functions.

Theorem 3.4.2. Assume the conditions of Theorem 3.4.1 are fulfilled and let w(x) ∈ W_{e,2}. Then for any L > 0 the relation

sup_n sup_{f∈Σ(β,L)} sup_{x∈R1} Ef w(n^{β/(2β+1)}(fn(x) − f(x))) < ∞ (3.111)

is valid.


3.5 Minimax Bounds on Estimators for Density

We have shown that when the information f ∈ Σ(β, L) is available, there exist estimators of the density which converge to it at the rate n^{−β/(2β+1)}. Are there, however, even more rapidly convergent estimators? This problem was first considered by Cencov. For one class of measures of deviation of fn from f, as was shown by Cencov, the answer is negative if one considers minimax bounds. The result presented below is due to Farrell. We shall not only establish the existence of a minimax bound from below of order n^{−β/(2β+1)} but also indicate some quantitative bounds.

Denote by Fn the class of all possible estimators of a density based on the observations X1, X2, . . . , Xn. Let w(x) be an arbitrary symmetric function, monotone for x > 0, such that w(0) = 0 and w(x) ≢ 0. As above, we shall denote the class of such functions by W.

Theorem 3.5.1. For any L > 0, x0 ∈ R1, k ≥ 0, α > 0, and w ∈ W, the inequality

lim inf_{n→∞} inf_{fn∈Fn} sup_{f∈Σ(β,L)} Ef w((fn(x0) − f(x0))n^{β/(2β+1)}) > 0 (3.112)

is valid. Here β = k + α.

Proof. Let f0(x) be an arbitrary density belonging to Σ(β, L/2) which does not vanish for any x ∈ R1, and let g(x) ∈ Σ(β, L/2) be finite (compactly supported) and satisfy ∫g(x)dx = 0, g(0) ≠ 0.

Elementary verification shows that the function

ϕn(x, θ) = f0(x) + θ n^{−βδ} g((x − x0)n^δ κ) (3.113)

for any |θ| < κ^{−β}, δ > 0, belongs to the set Σ(β, L). Moreover, for all n ≥ n0 this function is a probability density.

Consider now a sample X1, X2, . . . , Xn from a population with density ϕn(x, θ), |θ| < κ^{−β}. Denote by P^n_θ the family of measures generated by this sample. One can show that for δ = 1/(2β + 1) the LAN condition is satisfied and the Fisher information is

I0 = κ^{−1} ∫g²(y)dy / f0(x0). (3.114)

To complete the proof, the following lemma is required; it is a direct consequence of Remark 2.7.3 immediately following Theorem 2.7.1.

Lemma 3.5.1. For any estimator Tn of the parameter θ, |θ| < κ^{−β}, in the parametric family (3.113), any loss function w0 which is even and monotone for x > 0, and any c ≤ I0^{1/2}κ^{−β}, the inequality

lim inf_{n→∞} sup_{|u|<c/I0^{1/2}} E^n_u w0((Tn − u)I0^{1/2}) ≥ (2π)^{−1/2}·2^{−1} ∫_{|y|<c/2} w0(y)e^{−y²/2}dy (3.115)

is valid.

Since ϕn ∈ Σ(β, L), using the notation γ = β/(2β + 1) we obtain, for any estimator fn(x) of the density f and any constant c ≤ I0^{1/2}κ^{−β}, the inequality

sup_{f∈Σ(β,L)} Ef w((fn(x0) − f(x0))n^γ) ≥ sup_{|θ|I0^{1/2}<c} E^n_θ w((fn(x0) − ϕn(x0, θ))n^γ). (3.116)

By means of the density estimator fn one can, in particular, construct an estimator of the parameter θ in the parametric family ϕn using the formula

θn = [fn(x0) − f0(x0)]g^{−1}(0)n^γ. (3.117)

The following equality is self-evident:

(θn − θ)g(0) = [fn(x0) − ϕn(x0, θ)]n^γ. (3.118)

Setting w0(x) = w(xg(0)I0^{−1/2}), one obtains the inequality

lim inf_{n→∞} sup_{f∈Σ(β,L)} Ef w((fn(x0) − f(x0))n^γ) ≥ (1/(2√(2π))) ∫_{|x|<I0^{1/2}κ^{−β}/2} w(xg(0)I0^{−1/2})e^{−x²/2}dx, (3.119)

which is valid for any estimator fn(x0). Since κ is arbitrarily small, the proof is complete.


Remark 3.5.1. Note that (3.119) gives a bound from below for the minimax risk if one maximizes the right-hand side. It would be of interest to obtain the exact achievable bound, as was done in the parametric case. This problem has not been solved yet.

Remark 3.5.2. It is not too difficult to strengthen the assertion of Theorem 3.5.1, rendering it in a certain sense uniform in x0. More precisely, the inequality of Theorem 3.5.1 can be replaced by the inequality

lim inf_n inf_{fn∈Fn} inf_{x0∈[a,b]} sup_{f∈Σ(β,L)} Ef w((fn(x0) − f(x0))n^{β/(2β+1)}) > 0, (3.120)

which is valid for all −∞ < a < b < ∞.


Bibliography

[1] I. A. Ibragimov and R. Z. Has'minskii, Statistical Estimation: Asymptotic Theory (translated by S. Kotz). Springer-Verlag, New York, 1981.
