Elements of Statistical Inference

Chiranjit Mukhopadhyay

Indian Institute of Science

Statistical inference mainly deals with estimation and hypothesis testing about unknown population parameters, given a set of observations on the variable whose population behaviour we want to study or model. Using the developed model for forecasting or prediction purposes also falls under the realm of statistical inference, but its general theory will not be discussed here and will be introduced only when it first arises in the context of regression. Here we shall only give the flavour of the general mathematical statistical treatment of the problem of estimation and hypothesis testing about unknown population parameters.

1 Some Terminologies

1.1 Parameter

Parameters are quantities defined in the population. To emphasise this fact we shall refer to them as population parameters in this subsection. Population parameters are quantities derived from the population probability model, like its mean, median, variance, 90-th percentile, or even the entire c.d.f. itself. In practice we often assume observations arising from a given parametric family of probability models like the Binomial, Geometric, Negative Binomial, Poisson, exponential, Gamma, Normal, Weibull etc. from external or empirical considerations. In such situations population parameters are nothing but those quantities which appear in the expressions of the p.m.f. or p.d.f. of those probability models, like the p of the Binomial, Geometric or Negative Binomial distribution, or λ of the Poisson or exponential model, or (µ, σ²) of the Normal probability model, or (α, λ) of the Gamma distribution, or (λ, β) of the Weibull model. This is because in these parametric families any other population quantity of interest can be expressed in terms of these basic model parameters. For instance the standard deviation of a Bernoulli population is √(p(1 − p)), the mean of a Gamma population is α/λ, and the 75-th percentile of a Normal population is µ + 0.6745 × σ.

1.2 Statistic

Statistics are quantities which are computed from the data or observations. For example, for a random sample Y1, Y2, . . . , Yn from a N(µ, σ²) population, the sample mean Ȳ = (1/n) ∑_{i=1}^n Yi or the naive sample variance s²_n = (1/n) ∑_{i=1}^n (Yi − Ȳ)² are statistics, as opposed to their respective population counterparts µ and σ², which are population parameters. The sample proportion p̂ = (# Successes in the sample)/n is another example of a statistic with a sample of size n from a Bernoulli 0-1 or Success-Failure population. In general a statistic is a function of the observations Y1, Y2, . . . , Yn which is denoted by T(Y1, Y2, . . . , Yn), to emphasise the fact that it is a formula used to derive quantities based on a sample Y1, Y2, . . . , Yn.


1.3 Sampling Distribution

In classical statistics the optimality of any method is judged in terms of its repeated use over different samples from the same population. This philosophy gives rise to the consideration of the possible values a statistic T(Y1, Y2, . . . , Yn) can take and the frequency with which it takes such values over repeated sampling. Or in other words, we consider the probability distribution of a statistic T(Y1, Y2, . . . , Yn) over repeated sampling. This probability distribution is called the sampling distribution of T.

For example consider the sample mean Ȳ. Though for a given population its mean µ is something fixed (but unknown), we cannot expect to get the same value of the sample mean Ȳ for all the different possible samples that we can draw from the population. However we can consider and theoretically derive how Ȳ would behave over repeated sampling for all possible samples, in terms of its probability distribution. This probability distribution of Ȳ over all possible samples is called the sampling distribution of the sample mean Ȳ.

Example 1: Consider a population consisting of only four numbers 1, 2, 3 and 4, with 30% 1's, 40% 2's, 20% 3's and the remaining 10% 4's. That is, the probability distribution in the population may be expressed in terms of the p.m.f. pY(y) of the population random variable Y as follows:

y       1    2    3    4
pY(y)  0.3  0.4  0.2  0.1

Now consider drawing a random sample of size 2 from this population and the three statistics Ȳ = (Y1 + Y2)/2, s²_2 = (1/2) ∑_{i=1}^2 (Yi − Ȳ)² and s²_1 = ∑_{i=1}^2 (Yi − Ȳ)², where Y1 and Y2 are the two observations. The sampling distributions of these three statistics can be figured out by considering all possible samples of size 2 that can be drawn from this population, the corresponding probabilities of drawing each such sample, and the values of each of the statistics for every such sample. These steps are presented in the following table:

Possible Samples  {1,1}  {1,2}  {1,3}  {1,4}  {2,2}  {2,3}  {2,4}  {3,3}  {3,4}  {4,4}
Probability        0.09   0.24   0.12   0.06   0.16   0.16   0.08   0.04   0.04   0.01
Ȳ                  1      1.5    2      2.5    2      2.5    3      3      3.5    4
s²_2               0      0.25   1      2.25   0      0.25   1      0      0.25   0
s²_1               0      0.5    2      4.5    0      0.5    2      0      0.5    0

Finally, consolidating the distinct values that have been assumed by these statistics and adding the corresponding probabilities of obtaining the samples for the same values of the statistics, we obtain the sampling distributions of these statistics as follows:

Sampling Distribution of the Sample Mean Ȳ

ȳ        1.0   1.5   2.0   2.5   3.0   3.5   4.0
pȲ(ȳ)   0.09  0.24  0.28  0.22  0.12  0.04  0.01


Sampling Distribution of s²_2                  Sampling Distribution of s²_1

s²_2    0     0.25  1     2.25                 s²_1    0     0.5   2     4.5
prob.  0.30  0.44  0.20  0.06                  prob.  0.30  0.44  0.20  0.06
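The tabulation above can also be carried out mechanically. The following Python sketch (added here as an illustration; it is not part of the original notes) enumerates all ordered samples of size 2 from the population of Example 1 and consolidates the probabilities of the distinct values of Ȳ, s²_2 and s²_1, reproducing the three sampling distributions just derived (up to floating-point rounding).

    from itertools import product
    from collections import defaultdict

    # Population p.m.f. of Example 1
    pmf = {1: 0.3, 2: 0.4, 3: 0.2, 4: 0.1}

    def sampling_distribution(statistic):
        """Accumulate P(statistic = value) over all ordered samples (y1, y2)."""
        dist = defaultdict(float)
        for y1, y2 in product(pmf, repeat=2):
            dist[statistic(y1, y2)] += pmf[y1] * pmf[y2]
        return dict(sorted(dist.items()))

    ybar = lambda y1, y2: (y1 + y2) / 2
    s2_2 = lambda y1, y2: ((y1 - ybar(y1, y2)) ** 2 + (y2 - ybar(y1, y2)) ** 2) / 2
    s2_1 = lambda y1, y2: (y1 - ybar(y1, y2)) ** 2 + (y2 - ybar(y1, y2)) ** 2

    print(sampling_distribution(ybar))  # {1.0: 0.09, 1.5: 0.24, 2.0: 0.28, ...}
    print(sampling_distribution(s2_2))  # {0.0: 0.30, 0.25: 0.44, 1.0: 0.20, 2.25: 0.06}
    print(sampling_distribution(s2_1))  # {0.0: 0.30, 0.5: 0.44, 2.0: 0.20, 4.5: 0.06}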

2 Estimation

Suppose we have a probability model in the population characterised by the p.d.f. f(y|θ) (or p.m.f. p(y|θ)) and we have n independent and identically distributed (henceforth called i.i.d.) observations Y1, Y2, . . . , Yn on the population random variable Y, which has the p.d.f. f(y|θ) (or p.m.f. p(y|θ)). The first inference problem at hand is to estimate the unknown population parameter θ. The problem of estimation has two prongs - point estimation and interval estimation. In the first case, i.e. point estimation, as the name suggests, one deals with a single valued estimator of θ; while in the latter case, i.e. interval estimation, one reports an interval of values which is supposed to contain the true unknown value of the parameter θ.

2.1 Point Estimation

We start our discussion on point estimation by first considering the nature of a “good” estimator. That is, we shall first try to understand the kind of properties and behaviour a reasonable estimator should have. Or in other words, we first discuss the desirable criteria for “good” estimators. Then, failing to devise any straightforward method of obtaining such estimators, we shall next resort to some general methods of estimation which are reasonably well-behaved for large samples for the so called “regular” models.

Let the statistics θ̂ and θ̂′ be two different estimators of the population parameter θ. Say for example, with a sample from a N(µ, 1) population, for the unknown population parameter µ, let µ̂ be the sample mean and µ̂′ be the sample median, as µ is both the population mean as well as the population median. Likewise, with a sample from a Poisson(λ) population, for the unknown population parameter λ, let λ̂ denote the sample mean and λ̂′ denote the sample variance, as λ may be interpreted as both the population mean and the population variance. Now θ̂ would be considered to be better than θ̂′ if

Probθ(θ − a < θ̂ < θ + b) ≥ Probθ(θ − a < θ̂′ < θ + b) ∀a, b > 0 and ∀θ ∈ Θ (1)

where the probabilities above are according to the sampling distributions of θ̂ and θ̂′ (it is subscripted with a θ to emphasise the fact that these probabilities in general depend on the unknown θ), and Θ is the parameter space, the set of all possible values the unknown population parameter θ can take. A slightly weaker version of inequality (1) is

Eθ[(θ̂ − θ)²] ≤ Eθ[(θ̂′ − θ)²]  ∀θ ∈ Θ    (2)

where the expectation Eθ[·] is w.r.t. the sampling distributions of the estimators θ̂ and θ̂′ (again the Eθ[·] is subscripted with a θ to emphasise the fact that the expectation in general depends on the unknown θ). Inequality (2) is weaker than inequality (1) in the sense that if inequality (1) is satisfied by two estimators θ̂ and θ̂′, then it implies that they must satisfy inequality (2). Since inequality (2), though weaker, is less clumsy to deal with, as it does not involve additional arbitrary constants like a and b as in inequality (1), “goodness” of an estimator is measured in terms of this weaker criterion. This criterion is called the Mean Squared Error or MSE, which is formally defined as follows.

Definition 1: The Mean Squared Error or MSE of an estimator θ̂ is given by MSEθ(θ̂) = Eθ[(θ̂ − θ)²].

Intuitively, if θ̂ is an estimator of θ, then the error of θ̂ is (θ̂ − θ), the amount by which it misses the quantity θ it is trying to estimate. But this error could be either positive or negative, and in general we are not interested in its sign but only its magnitude. A smooth way of getting rid of the sign of a quantity is to square it.¹ Thus we consider the squared error (θ̂ − θ)². Now this squared error is a random quantity, as its value depends on the value of the estimator θ̂, which varies from sample to sample. One way of consolidating the value of this random criterion, viz. the squared error, would be to look at its mean over repeated sampling, or by how much the estimator θ̂ misses its target value of θ on an average. This leads to the criterion mean squared error Eθ[(θ̂ − θ)²].
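As a small numerical illustration of the MSE criterion (added here; not part of the original notes), the following Python sketch approximates, by repeated sampling, the MSE of the two competing estimators of µ mentioned above for a N(µ, 1) population: the sample mean µ̂ and the sample median µ̂′. The particular values of µ, n and the number of replications are arbitrary choices for the illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    mu, n, reps = 2.0, 25, 100_000

    samples = rng.normal(mu, 1.0, size=(reps, n))    # repeated samples from N(mu, 1)
    mean_hat = samples.mean(axis=1)                  # sample mean of each sample
    median_hat = np.median(samples, axis=1)          # sample median of each sample

    mse = lambda est: np.mean((est - mu) ** 2)       # Monte Carlo estimate of E[(estimator - mu)^2]
    print("MSE of sample mean  :", mse(mean_hat))    # close to 1/n = 0.04
    print("MSE of sample median:", mse(median_hat))  # larger, roughly pi/(2n)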

Note that though the MSE of an estimator θ̂ does not depend on the value of θ̂ (since we have averaged it out), in general it still depends on θ. Thus the MSE of θ̂ is a function of θ. Since we do not know the value of θ, which is the crux of the problem, one may now seek a super estimator which has the smallest MSE among all the estimators of θ, no matter what the value of θ may be, or for all values of θ ∈ Θ. Unfortunately such super estimators do not exist, because when θ = θ0, some fixed value in Θ, no other estimator θ̂ can have a smaller MSE than the (rather silly) estimator θ̂′ ≡ θ0. That is, if we allow anything and everything in the universe to be an estimator of θ and then seek an estimator which uniformly minimises the MSE, that is simply asking for too much.

The solution becomes apparent once we have realised the problem. If we can somehow keep trivial and silly estimators out of consideration and then seek an MSE minimising estimator, there might be some hope of obtaining a solution. That is, we have to keep our search for a “good” or MSE minimising estimator confined within a narrower class of “reasonable” estimators. One such smaller or restricted class of estimators which is usually considered in practice is the class of unbiased estimators, defined as follows.

Definition 2: An estimator θ̂ of the unknown population parameter θ is called unbiased, if Eθ[θ̂] = θ ∀θ ∈ Θ.

¹Getting rid of the sign of a quantity is a recurring problem in statistics and we always use squaring for accomplishing this task. Simply ignoring the sign is not a smooth procedure, as the graph of f(x) = |x| for −∞ < x < ∞ exhibits at x = 0. The closest one can come to this through a smooth procedure is by squaring it (study the graph of g(x) = x² for −∞ < x < ∞). Raising the quantity, whose sign we want to get rid of, to any even power would also be a smooth procedure, but the magnitude is distorted the least when this power is 2 (for instance compare the graphs of f(x) = |x|, g(x) = x² and h(x) = x⁴ for −∞ < x < ∞). We want the process of ridding the sign to be smooth because otherwise it poses mathematical difficulties, like not being able to use standard calculus methods for subsequent theoretical development.


The above definition requires the mean of the sampling distribution of an unbiased estimator to coincide with the unknown population parameter it is trying to estimate, for all its possible values. Intuitively, this means that if an estimator is unbiased, on an average it will always hit the target, no matter what or where the target is. This looks like a fairly reasonable thing to demand from an estimator. Notice that the unbiasedness requirement also gets rid of trivial constant estimators, as clearly an estimator like θ̂ ≡ θ0 has an expected value of θ0, which does not coincide with the value of the unknown population parameter θ for any value other than θ0 in Θ.

Example 2: Suppose Y1, Y2, . . . , Yn is a random sample from an arbitrary population with mean µ. Then since E[Yi] = µ ∀µ and ∀i, Y5 is an unbiased estimator of µ, so is (Y2 + Y3)/2, so is 0.2Y8 + 0.5Y11 + 0.3Y13, and so is of course the sample mean Ȳ = (1/n) ∑_{i=1}^n Yi. □

The above example goes to show that usually the class of unbiased estimators is quite large. Now the task is to seek an estimator with minimum MSE among this reasonably large class of unbiased estimators. Unbiased estimators have a simple, neatly interpretable expression for their MSE, as the following theorem and corollary show.

Theorem 1: MSEθ(θ̂) = Vθ[θ̂] + Bθ(θ̂)², where Vθ[θ̂] is the variance of the estimator θ̂ w.r.t. its sampling distribution and Bθ(θ̂) = Eθ[θ̂ − θ] is the bias of the estimator θ̂.

Proof:

MSEθ(θ̂) = Eθ[(θ̂ − θ)²]
        = Eθ[{(θ̂ − Eθ[θ̂]) + (Eθ[θ̂] − θ)}²]
        = Eθ[(θ̂ − Eθ[θ̂])²] + (Eθ[θ̂] − θ)² + 2Eθ[(θ̂ − Eθ[θ̂])(Eθ[θ̂] − θ)]
          (because Eθ[θ] = θ and (Eθ[θ̂] − θ)² is a constant, whose expectation is the constant itself)
        = Vθ[θ̂] + Bθ(θ̂)² + 2(Eθ[θ̂] − θ) Eθ[θ̂ − Eθ[θ̂]]
          (by the definitions of Vθ[θ̂] and Bθ(θ̂), and the constancy of Eθ[θ̂] − θ)
        = Vθ[θ̂] + Bθ(θ̂)²   (as Eθ[θ̂ − Eθ[θ̂]] = 0)   □

Corollary 1: If θ̂ is unbiased for θ, MSEθ(θ̂ ) = Vθ[θ̂ ].

Proof: If θ̂ is unbiased for θ, its bias Bθ(θ̂) = Eθ[θ̂ − θ] = 0, by the definition of unbiasedness, and the result follows from Theorem 1. □

In view of the above corollary, it is customary to report the value of an (unbiased) point estimator together with its √(Vθ[θ̂]), called its standard error. Corollary 1 reduces the task of seeking an MSE minimising estimator among the class of unbiased estimators to that of seeking an unbiased estimator with a uniformly minimum variance or standard error. This leads to the discussion of obtaining in general the “best” point estimators, called the Uniformly Minimum Variance Unbiased Estimators or UMVUE.
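A quick Monte Carlo check of Theorem 1 and Corollary 1 (added for illustration; not part of the original notes): for the biased naive sample variance s²_n of a Normal sample, the simulated MSE should agree with the simulated variance plus the squared bias.

    import numpy as np

    rng = np.random.default_rng(1)
    mu, sigma2, n, reps = 0.0, 4.0, 10, 200_000

    samples = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
    s2_n = samples.var(axis=1, ddof=0)        # naive variance with divisor n (biased for sigma^2)

    mse = np.mean((s2_n - sigma2) ** 2)       # E[(s2_n - sigma^2)^2]
    bias = np.mean(s2_n) - sigma2             # B(s2_n), theoretically -sigma^2/n
    var = np.var(s2_n)                        # V[s2_n]
    print("MSE          :", mse)
    print("Var + Bias^2 :", var + bias ** 2)  # should nearly coincide with the MSE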


2.1.1 UMVUE

The foregoing discussion goes to show the desirability of having an UMVUE. But unfortunately there is no direct method of obtaining an UMVUE for any parameter of a given population probability model. By a direct method we mean a ready made algorithmic technique which readily yields a (maybe computerised numerical) solution as soon as the problem of obtaining an UMVUE is stated or formulated. Instead there are a couple of theorems which help one show or prove an estimator to be UMVUE. Before discussing these theorems, however, it is first necessary to introduce a couple more concepts and results, without which the main theorems regarding UMVUE would remain inaccessible. The first such concept is called sufficiency.

Definition 3: A statistic T(Y1, Y2, . . . , Yn) is said to be sufficient for θ if the conditional distribution of the original observations Y1, Y2, . . . , Yn given T(Y1, Y2, . . . , Yn) = t does not depend on θ.

Sufficiency plays a central role in mathematical statistics, not just for UMVUE. Intuitively, sufficient statistics provide a way of reducing the data without losing any information about the unknown parameter θ. This is because if one has the value of the sufficient statistic T but the original data set Y1, Y2, . . . , Yn is lost, one can still reconstruct a set of Y1, Y2, . . . , Yn (using for example a random number generator), as this does not require knowledge of the unknown θ (by definition), which is equivalent to the original data set in the sense that its probability distribution remains the same as that of the original data set. Thus if sufficient statistics exist, one need not carry around the entire original raw data set for drawing inference about the model parameters. Just having the values of the sufficient statistics is good enough or sufficient, as these statistics carry all the relevant information about θ contained in the observations Y1, Y2, . . . , Yn.

Example 3: Suppose Y1 and Y2 are i.i.d. Poisson(λ), i.e. we have a sample of size 2 from a Poisson population. Consider the statistic T = Y1 + Y2.

P(Y1 = y|T = t)
= P(Y1 = y, Y2 = t − y) / P(T = t)
= P(Y1 = y) P(Y2 = t − y) / P(T = t)   (because Y1 and Y2 are independent and T ∼ Poisson(2λ))
= [e^{−2λ} λ^{y+t−y} / (y!(t − y)!)] / [e^{−2λ} (2λ)^t / t!]
= (t choose y) (1/2)^y (1/2)^{t−y},

which does not depend on the unknown population parameter λ. It should now be easy to see that if we had Y1, Y2, . . . , Yn, a sample of size n from a Poisson(λ) population, and T = Y1 + Y2 + · · · + Yn, then

P(Y1 = y1, Y2 = y2, . . . , Yn = yn | T = t) = [t! / (y1! y2! · · · yn!)] (1/n)^{y1} (1/n)^{y2} · · · (1/n)^{yn},

which does not depend on λ. Thus, according to the definition, T = ∑_{i=1}^n Yi is a sufficient statistic for a Poisson sample. This is because if one has the value of T as t, one can reconstruct a version of the original sample by generating a set of values from a Multinomial(t; 1/n, . . . , 1/n) distribution, without bothering to carry around all the n values Y1, Y2, . . . , Yn. □
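The sufficiency claim of Example 3 can be checked by simulation. The following Python sketch (an added illustration, not part of the original notes) draws many Poisson samples of size 2 for two different values of λ, conditions on T = Y1 + Y2 = t, and shows that the conditional distribution of Y1 is in both cases approximately the same Binomial(t, 1/2) distribution, free of λ.

    import numpy as np
    from math import comb

    rng = np.random.default_rng(2)
    t, reps = 6, 400_000

    for lam in (1.0, 5.0):
        y1 = rng.poisson(lam, reps)
        y2 = rng.poisson(lam, reps)
        kept = y1[(y1 + y2) == t]                       # keep only samples with T = t
        freq = [np.mean(kept == y) for y in range(t + 1)]
        print(f"lambda = {lam}:", np.round(freq, 3))

    # Exact conditional p.m.f.: Binomial(t, 1/2), which does not involve lambda
    print("Binomial(t, 1/2):", [round(comb(t, y) / 2 ** t, 3) for y in range(t + 1)])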

Now trying to intuitively guess, obtain and then show a statistic is sufficient from the definition, as has been done in Example 3 above, is an arduous if not an impossible task. Fortunately there is a theorem, called the Factorisation Theorem, which helps one obtain a sufficient statistic in a routine manner from the expression of the p.m.f./p.d.f. of a probability model. At this point it will also be wise to broaden our horizon by considering all the unknown parameters in the population at once. Thus from now on we shall use a bold-faced θ to denote the vector (more than one) of unknown parameters, while preserving the notation θ in case of a single unknown.² Before presenting the Factorisation Theorem we need to introduce another extremely important statistical concept called the likelihood function.

Definition 4: If Y1, Y2, . . . , Yn are i.i.d. with p.d.f. f(y|θ) (or p.m.f. p(y|θ)), with realised values Y1 = y1, Y2 = y2, . . . , Yn = yn, the likelihood function of θ is given by L(θ|y1, y2, . . . , yn) = ∏_{i=1}^n f(yi|θ) (or ∏_{i=1}^n p(yi|θ) in the discrete case).

Very loosely speaking, the likelihood function sort of gives the probability of observing the data at hand given a value of the model parameter θ. But since θ is unknown, we try to view this quantity in its totality as a function of the unknown θ as it varies over its domain Θ. It is important to realise that in the expression of the likelihood function, the variable of interest is θ and not the observed data Y1 = y1, Y2 = y2, . . . , Yn = yn. It is something akin to a probability only when viewed as a function of y1, y2, . . . , yn, but since the likelihood must be viewed as a function of θ it is not a probability.

Example 4: A. If Y1, Y2, . . . , Yn are i.i.d. N(µ, σ²), with realised values Y1 = y1, Y2 = y2, . . . , Yn = yn, then the likelihood function of (µ, σ²) is given by

L(µ, σ²|y1, y2, . . . , yn) = (2πσ²)^{−n/2} exp{−(1/(2σ²)) ∑_{i=1}^n (yi − µ)²} = (2πσ²)^{−n/2} exp{−(1/(2σ²)) [∑_{i=1}^n (yi − ȳ)² + n(ȳ − µ)²]}    (3)

B. If Y1, Y2, . . . , Yn are i.i.d. Bernoulli(p), so that each Yi is 0-1 valued, assuming the value 1 with probability p and the value 0 with probability 1 − p, which may be expressed as the p.m.f. p(y|p) = p^y (1 − p)^{1−y} for y = 0, 1, with realised values Y1 = y1, Y2 = y2, . . . , Yn = yn, then the likelihood function of p is given by

L(p|y1, y2, . . . , yn) = p^{∑_{i=1}^n yi} (1 − p)^{n − ∑_{i=1}^n yi}    (4)

²This convention of using bold-face for a vector and ordinary font for a scalar - be it the parameters or statistics - will be used throughout these notes.


C. If Y1, Y2, . . . , Yn are i.i.d. Poisson(λ) with realised values Y1 = y1, Y2 = y2, . . . , Yn = yn, then the likelihood function of λ is given by

L(λ|y1, y2, . . . , yn) = e^{−nλ} λ^{∑_{i=1}^n yi} / ∏_{i=1}^n yi!    (5)
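To make the notion of a likelihood function concrete, the following Python sketch (added here; not part of the original notes) evaluates the Poisson log-likelihood corresponding to (5) on a grid of λ values for an assumed observed sample; the grid value maximising it sits near the sample mean, anticipating the method of maximum likelihood of §2.1.2.

    import numpy as np
    from scipy.special import gammaln

    y = np.array([3, 0, 2, 4, 1, 2, 3, 1])      # an assumed observed Poisson sample

    def poisson_loglik(lam, y):
        # log L(lambda|y) = -n*lambda + (sum_i y_i) * log(lambda) - sum_i log(y_i!)
        return -len(y) * lam + y.sum() * np.log(lam) - gammaln(y + 1).sum()

    grid = np.linspace(0.1, 6.0, 1000)
    loglik = np.array([poisson_loglik(lam, y) for lam in grid])
    print("grid value maximising the log-likelihood:", grid[np.argmax(loglik)])
    print("sample mean:", y.mean())             # the two should be close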

Theorem 2 (Factorisation Theorem): If the population random variable Y has p.d.f. f(y|θ) (or p.m.f. p(y|θ)), then given the observed data Y1 = y1, Y2 = y2, . . . , Yn = yn, the statistic T(Y1, Y2, . . . , Yn) is sufficient for θ if and only if the likelihood function can be factored as (t being the realised value of T)

L(θ|y1, y2, . . . , yn) = g(t(y1, y2, . . . , yn), θ) h(y1, y2, . . . , yn).    (6)

That is, t(y1, y2, . . . , yn) is sufficient for θ ⇐⇒ the likelihood function can be factored into two components, where the expression of the first component involves θ and terms involving y1, y2, . . . , yn appearing only through t(y1, y2, . . . , yn), and the expression of the second component involves only y1, y2, . . . , yn without any term involving θ.

Proof: We shall present the proof for discrete Y, which is a little more intuitive and illustrative but less technical than the continuous case.

“only if” or ⇒ part: Suppose T(Y1, Y2, . . . , Yn) is sufficient for θ and let t(y1, y2, . . . , yn) = t denote the observed value of T. Then

L(θ|y1, y2, . . . , yn)
= P(Y1 = y1, Y2 = y2, . . . , Yn = yn|θ)   (by definition of the likelihood function)
= P(Y1 = y1, Y2 = y2, . . . , Yn = yn, T = t|θ)   (as the two events are the same)
= P(T = t|θ) P(Y1 = y1, Y2 = y2, . . . , Yn = yn|T = t, θ)   (by definition of conditional probability)
= g(t, θ) h(y1, y2, . . . , yn)

where g(t, θ) = P(T = t|θ), which involves only θ and t (without directly involving y1, y2, . . . , yn - the only way y1, y2, . . . , yn appear in the expression of g(t, θ) is through t); and h(y1, y2, . . . , yn) = P(Y1 = y1, Y2 = y2, . . . , Yn = yn|T = t, θ), which does not involve θ, by definition of the sufficiency of T.

“if” or ⇐ part: Suppose the probability model of Y1, Y2, . . . , Yn is such that it admits the factorisation (6). Then

P(Y1 = y1, Y2 = y2, . . . , Yn = yn|T = t, θ)
= P(Y1 = y1, Y2 = y2, . . . , Yn = yn, T = t|θ) / P(T = t|θ)
= g(t, θ) h(y1, y2, . . . , yn) / ∑_{{y′1, . . . , y′n : T(y′1, . . . , y′n) = t}} g(t, θ) h(y′1, . . . , y′n)
= g(t, θ) h(y1, y2, . . . , yn) / [g(t, θ) ∑_{{y′1, . . . , y′n : T(y′1, . . . , y′n) = t}} h(y′1, . . . , y′n)]
= h(y1, y2, . . . , yn) / ∑_{{y′1, . . . , y′n : T(y′1, . . . , y′n) = t}} h(y′1, . . . , y′n),

which does not depend on θ. □

Example 5: A. i. Consider the Normal likelihood given in (3). Assume that σ² is known but not µ. Let T(Y1, Y2, . . . , Yn) = Ȳ, so that t = ȳ. Then the likelihood function (3) can be factorised into g(t, µ) = e^{−n(t−µ)²/(2σ²)} and h(y1, y2, . . . , yn) = (2πσ²)^{−n/2} e^{−(1/(2σ²)) ∑_{i=1}^n (yi − ȳ)²}, showing that Ȳ is sufficient for µ in a N(µ, σ²) model with known σ².

A. ii. Again consider the Normal likelihood given in (3). This time assume that µ is known but not σ². Let T(Y1, Y2, . . . , Yn) = ∑_{i=1}^n (Yi − µ)². Note that this T is a statistic because µ is known. In this case define g(t, σ²) = (2πσ²)^{−n/2} e^{−t/(2σ²)} and h(y1, y2, . . . , yn) = 1, so that L(σ²|y1, y2, . . . , yn) = g(t, σ²) h(y1, y2, . . . , yn). Thus in this case ∑_{i=1}^n (Yi − µ)² is sufficient for σ².

A. iii. Finally consider the Normal likelihood in (3) with both (µ, σ²) unknown. Note that in this case the unknown parameter is vector valued with θ = (µ, σ²). In this case we should have a vector valued sufficient statistic T. Thus let T = (Ȳ, ∑_{i=1}^n (Yi − Ȳ)²), g(t, θ) = (2πσ²)^{−n/2} e^{−(1/(2σ²)) (∑_{i=1}^n (yi − ȳ)² + n(ȳ − µ)²)} and h(y1, y2, . . . , yn) = 1, so that L(θ|y1, y2, . . . , yn) = g(t, θ) h(y1, y2, . . . , yn). Thus in this case (Ȳ, ∑_{i=1}^n (Yi − Ȳ)²) is sufficient for θ = (µ, σ²).

B. For the Bernoulli likelihood in (4) let T = ∑_{i=1}^n Yi, g(t, p) = p^t (1 − p)^{n−t} and h(y1, y2, . . . , yn) = 1. Then L(p|y1, y2, . . . , yn) = g(t, p) h(y1, y2, . . . , yn) and thus ∑_{i=1}^n Yi is sufficient for p.

C. ∑_{i=1}^n Yi is sufficient for λ of the Poisson model, because for the Poisson model we may define T = ∑_{i=1}^n Yi, g(t, λ) = e^{−nλ} λ^t, and h(y1, y2, . . . , yn) = 1/∏_{i=1}^n yi!, so that the Poisson likelihood in (5) equals g(t, λ) h(y1, y2, . . . , yn). □

We are now in a position to state the first theorem pertaining to UMVUE, due to Rao and Blackwell, which is as follows.

Theorem 3 (Rao-Blackwell Theorem): If T is sufficient for θ and U is any unbiased estimate of θ, then the statistic h(T) = E[U|T] is also unbiased for θ and Vθ[h(T)] ≤ Vθ[U].

Proof: First note that h(T), the conditional expectation of U given T, is a statistic, i.e. can be completely determined from the data, because since T is sufficient, the conditional distribution of Y1, Y2, . . . , Yn given T, and hence E[U(Y1, Y2, . . . , Yn)|T] = h(T), does not depend on θ.

Next observe that h(T) is unbiased for θ. This is because Eθ[h(T)] = Eθ[E[U|T]] = Eθ[U] = θ ∀θ ∈ Θ. The last equality follows because U is unbiased for θ, and the one before that follows because E[E[X|Y]] = E[X]. Now,

Vθ[U] = Eθ[(U − θ)²]   (because U is unbiased for θ)
      = Eθ[{(U − h(T)) + (h(T) − θ)}²]
      = Eθ[(U − h(T))²] + Eθ[(h(T) − θ)²] + 2Eθ[(U − h(T))(h(T) − θ)]
      = Eθ[(U − h(T))²] + Vθ[h(T)] + 2Eθ[E[(U − h(T))(h(T) − θ)|T]]
        (because h(T) is unbiased for θ and E[E[X|Y]] = E[X])
      = Eθ[(U − h(T))²] + Vθ[h(T)] + 2Eθ[(h(T) − θ) E[(U − h(T))|T]]
        (because given T, (h(T) − θ) is a constant and hence can be taken out of the inner expectation)
      = Eθ[(U − h(T))²] + Vθ[h(T)]
        (because E[(U − h(T))|T] = E[U|T] − E[h(T)|T] = h(T) − h(T) = 0)
      ≥ Vθ[h(T)]   (because Eθ[(U − h(T))²] ≥ 0)   □

The Rao-Blackwell theorem only goes to show how an arbitrary unbiased estimator may be improved, in the sense of reducing its variance while preserving its unbiasedness. It states that the way to reduce the variance of an unbiased estimator while preserving its unbiasedness is to consider a new statistic which is the conditional expectation of the unbiased estimator given a sufficient statistic. Incidentally, this process of taking the conditional expectation of a statistic given a sufficient statistic is called Rao-Blackwellisation. As such the theorem does not directly state anything about an estimator being a UMVUE. Indeed, according to the theorem, if U1 and U2 are two unbiased estimators of θ then they can respectively be improved upon by considering h1(T) = E[U1|T] and h2(T) = E[U2|T], but it does not say anything comparing the variances of h1(T) and h2(T). This problem is resolved by introducing another concept called completeness.

Definition 5: A statistic T is called complete if Eθ[g(T)] = 0 ∀θ ∈ Θ =⇒ g(T) ≡ 0.³

³The intuition behind completeness of a statistic can be understood only together with sufficiency. The opposite of a sufficient statistic is called an ancillary statistic. A statistic is called ancillary if its distribution does not depend on the unknown θ. For example, if Y1, Y2, . . . , Yn are i.i.d. Uniform[θ − 1/2, θ + 1/2] or Uniform[θ, θ + 1], the smallest and largest observations, denoted by (Y(1), Y(n)), are sufficient (use the factorisation theorem) but Y(n) − Y(1) is ancillary in either case. A sufficient statistic packs in itself all the information the data has got about the unknown θ, while an ancillary statistic is useless in drawing any inference about θ since its distribution does not depend on θ. Now a sufficient statistic is most successful in optimal data reduction when there is no ancillary statistic. In all the instances of Example 5, the number of parameters and the corresponding number of sufficient statistics agree, because in those cases there does not exist any ancillary statistic. But in the Uniform[θ − 1/2, θ + 1/2] or Uniform[θ, θ + 1] examples above, the sufficient statistic is two-dimensional for a single parameter θ. This is because of the presence of the ancillary statistic. Now a weaker notion of ancillarity, called first-order ancillarity, occurs when the expectation of a non-constant statistic does not depend on θ. Ancillarity requires the entire distribution to be free of θ, while first-order ancillarity requires the same only of its mean. Completeness of a sufficient statistic guarantees that there is no first-order ancillary statistic and thus no ancillary statistic (because ancillarity implies first-order ancillarity). This is because if T is complete and some function of it has a constant expectation c no matter what the value of θ is, then the function must identically equal c (if Eθ[f(T)] = c ∀θ ∈ Θ, then Eθ[f(T) − c] = 0 ∀θ ∈ Θ, implying f(T) ≡ c by completeness). Thus if a sufficient statistic is also complete it does not contain any ancillarity, and as a result we can expect good results from functions of this sufficient statistic, such as an UMVUE for example.


Theorem 4 (Lehmann-Scheffe Theorem): If T is a complete sufficient statistic then for any unbiased estimate U of θ, h(T) = E[U|T] is the unique UMVUE for θ.

Proof: We have already shown that if T is sufficient, for any arbitrary unbiased estimate U of θ, Vθ[h(T)] ≤ Vθ[U], where h(T) = E[U|T]. Now if U1 and U2 are two different unbiased estimators of θ and h1(T) = E[U1|T] and h2(T) = E[U2|T], then since Eθ[h1(T)] = Eθ[h2(T)] = θ ∀θ ∈ Θ, Eθ[h1(T) − h2(T)] = 0 ∀θ ∈ Θ, which implies h1(T) ≡ h2(T), since T is complete. Thus no matter which unbiased estimator U one may start with, after its Rao-Blackwellisation w.r.t. a complete sufficient statistic T, one always ends up with the same or unique Rao-Blackwellised version h(T), whose variance is no greater than that of any arbitrary unbiased estimator of θ. Thus h(T) is the unique UMVUE for θ. □

Example 6: A random variable Y is said to belong to the exponential family of distributions if it has the p.d.f. (or p.m.f.)

f(y|η) = exp{∑_{i=1}^k ηi Ti(y) + A(η) + B(y)}    (7)

where η = (η1, η2, . . . , ηk) is the vector of unknown parameters and A(·), B(·), T1(·), T2(·), . . ., Tk(·) are known functions.

P.d.f/p.m.f of many standard probability models can be expressed in the form given in (7).

If Y ∼ B(n, p), the Binomial distribution with p as the parameter of interest and known n,

p(y|p) = (n choose y) p^y (1 − p)^{n−y} = exp{ηy + n log(1/(1 + e^η)) + log (n choose y)}

where η = log(p/(1 − p)). This is in the form of (7) with k = 1, T1(y) = y, A(η) = n log(1/(1 + e^η)) and B(y) = log (n choose y).

If Y ∼ N(µ, σ²), the Normal distribution with (µ, σ²) as the unknown parameters,

f(y|µ, σ²) = (1/√(2πσ²)) exp{−(1/(2σ²))(y − µ)²} = exp{(η1 y + η2 y²) + (1/2)(η1²/(2η2) + log(−η2) − log π)}

where η1 = µ/σ² and η2 = −1/(2σ²). This is in the form of (7) with k = 2, T1(y) = y, T2(y) = y², A(η) = (1/2)(η1²/(2η2) + log(−η2) − log π) and B(y) = 0.

If Y ∼ Poisson(λ), p(y|λ) = e^{−λ} λ^y / y! = exp{ηy − e^η − log y!}, where η = log λ. This is in the form of (7) with k = 1, T1(y) = y, A(η) = −e^η and B(y) = −log y!.

If Y ∼ Gamma(α, λ), with both α and λ as unknown parameters,

f(y|α, λ) = (λ^α / Γ(α)) y^{α−1} e^{−λy} = exp{(η1 log y + η2 y) + (η1 log(−η2) − log Γ(η1)) − log y}

where η1 = α and η2 = −λ. This is in the form of (7) with k = 2, T1(y) = log y, T2(y) = y, A(η) = η1 log(−η2) − log Γ(η1) and B(y) = −log y.

Similarly, the Negative Binomial, Hypergeometric and Beta distributions can also be shown to belong to the exponential family.⁴ Note that the parameter η in (7) is not the natural parameter of interest for the standard models as discussed above. It is called the canonical parameter. However, in all cases it will be the case that there exists a one-to-one relationship between the natural parameter of a model and the canonical parameter, as illustrated in the above important special models. We need not worry too much about this issue. As shall be seen shortly, we shall be able to draw inference about the natural parameter of interest only. Introduction of the concept of the exponential family only facilitates the theoretical development, in the sense that it allows us to prove results like sufficiency and completeness and hence the UMVUE for many probability models of interest at once in a unified manner.

Sufficiency: If the distribution of Y belongs to the exponential family, and Y1, Y2, . . . , Yn is an i.i.d. sample of Y, then it is immediate from the Factorisation Theorem that T(Y1, Y2, . . . , Yn) = (∑_{i=1}^n T1(Yi), . . . , ∑_{i=1}^n Tk(Yi)) is sufficient for η. This is because, in this case, the likelihood of η is

L(η|y1, y2, . . . , yn) = exp{∑_{j=1}^k ηj ∑_{i=1}^n Tj(yi) + nA(η)} exp{∑_{i=1}^n B(yi)} = g(t, η) h(y1, y2, . . . , yn)    (8)

where g(t, η) = exp{∑_{j=1}^k ηj ∑_{i=1}^n Tj(yi) + nA(η)} and h(y1, y2, . . . , yn) = exp{∑_{i=1}^n B(yi)}.

Completeness: We shall prove completeness of the exponential family only for k = 1, which is easily generalisable to higher dimensions. Suppose the distribution of Y belongs to the exponential family, and Eη[g(Y)] = 0 ∀η. Then from (7) it follows that ∫{g(y) exp(B(y))} exp(ηy) dy = 0 ∀η.⁵ But the l.h.s. is nothing but the Laplace transformation of g(y) exp(B(y)). Since the Laplace transformation of a function is unique, g(y) exp(B(y)) ≡ 0, implying g(y) ≡ 0.⁶

Thus for the exponential family, by the Rao-Blackwell and Lehmann-Scheffe theorems, any function of T is the UMVUE of its expectation. This is because the conditional expectation of any function of T given T is that function itself, and any statistic is an unbiased estimate of its own expectation. Now since T is also complete, such a function of T is the unique UMVUE of its expectation.

⁴However, not all important probability models useful in applications belong to the exponential family. For example, the Weibull distribution with p.d.f. f(y|λ, β) = λβ y^{β−1} e^{−λy^β} can at best be simplified to exp{(β log y − λ e^{β log y}) + (log λ + log β) − log y}, which is not in the form given in (7).

⁵In case Y is discrete this integral is to be replaced by a summation and the rest of the argument carries through.

⁶Completeness of the exponential family should be understood in terms of the distribution of the corresponding sufficient statistic T. Say for example, if Y ∼ N(0, σ²) then it belongs to the exponential family. But Y is not complete. This is because, for example, Eσ²[Y^α] = 0 ∀σ² > 0 for α = 1, 3, 5, . . ., the odd powers. Or for that matter the expectation of any odd function of Y is 0. But those functions are not identically 0, showing that Y is not complete. But in this case the sufficient statistic is Y², which of course is complete.


Thus if Y ∼ B(n, p), since Ep[Y/n] = p and Y is complete and sufficient, the sample proportion p̂ = Y/n is the UMVUE of the population proportion p.

If Y1, Y2, . . . , Yn are i.i.d. N(µ, σ²), then (∑_{i=1}^n Yi, ∑_{i=1}^n Yi²) is complete and sufficient for (µ/σ², −1/(2σ²)). But since (∑_{i=1}^n Yi, ∑_{i=1}^n Yi²) ↔ (Ȳ = (1/n) ∑_{i=1}^n Yi, s²_{n−1} = (1/(n−1)) ∑_{i=1}^n (Yi − Ȳ)²) and (µ/σ², −1/(2σ²)) ↔ (µ, σ²) are one-to-one, the sample mean and variance (Ȳ, s²_{n−1}) are complete and sufficient for (µ, σ²). Now since E_{µ,σ²}[Ȳ] = (1/n) ∑_{i=1}^n E_{µ,σ²}[Yi] = (1/n) ∑_{i=1}^n µ = µ, Ȳ is the UMVUE of µ. Similarly, since

E_{µ,σ²}[s²_{n−1}]
= (1/(n−1)) E_{µ,σ²}[∑_{i=1}^n (Yi − Ȳ)²]
= (1/(n−1)) E_{µ,σ²}[∑_{i=1}^n Yi² − nȲ²]
= (1/(n−1)) ∑_{i=1}^n {V_{µ,σ²}[Yi] + (E_{µ,σ²}[Yi])²} − (n/(n−1)) {V_{µ,σ²}[Ȳ] + (E_{µ,σ²}[Ȳ])²}
  (because for any random variable X, E[X²] = V[X] + (E[X])²)
= (1/(n−1)) ∑_{i=1}^n (σ² + µ²) − (n/(n−1)) (σ²/n + µ²)
  (because V_{µ,σ²}[Ȳ] = (1/n²) ∑_{i=1}^n V_{µ,σ²}[Yi] = (1/n²) ∑_{i=1}^n σ² = σ²/n, and we have just shown that E_{µ,σ²}[Ȳ] = µ)
= (n/(n−1) − 1/(n−1)) σ² = σ²,

s²_{n−1} is the UMVUE of σ².

If Y1, Y2, . . . , Yn are i.i.d. Poisson(λ) then ∑_{i=1}^n Yi is complete and sufficient. Since Eλ[∑_{i=1}^n Yi] = nλ, Ȳ = (1/n) ∑_{i=1}^n Yi is the UMVUE of λ. □
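The unbiasedness computations above can be confirmed numerically. The Python sketch below (added as an illustration; not part of the original notes) simulates repeated Normal samples and checks that the averages of Ȳ and s²_{n−1} over repeated sampling are close to µ and σ² respectively, while the naive divisor-n variance is biased downwards by the factor (n−1)/n.

    import numpy as np

    rng = np.random.default_rng(3)
    mu, sigma2, n, reps = 1.5, 2.0, 8, 300_000

    samples = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
    ybar = samples.mean(axis=1)
    s2_unbiased = samples.var(axis=1, ddof=1)   # divisor n - 1
    s2_naive = samples.var(axis=1, ddof=0)      # divisor n

    print("E[Ybar]      ~", ybar.mean(), "  (mu =", mu, ")")
    print("E[s^2_{n-1}] ~", s2_unbiased.mean(), "  (sigma^2 =", sigma2, ")")
    print("E[s^2_n]     ~", s2_naive.mean(), "  ((n-1)/n * sigma^2 =", (n - 1) / n * sigma2, ")")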

We move on to the other method of obtaining UMVUE after presenting an interesting example of obtaining an UMVUE through Rao-Blackwellisation.

Example 7: If T denotes the life of a system, its reliability at time t is defined as P[T > t]. The exponential distribution is a very popular model for the life of a system. So let T have an exponential distribution with mean θ, denoted by T ∼ exp(θ), and suppose we are interested in estimating the reliability of the system at a given mission time t. For this purpose suppose we observe the time to failure or life of n independent systems, T1 = t1, . . . , Tn = tn. Now if T ∼ exp(θ), P[T > t] = e^{−t/θ}, and T̄ = (1/n) ∑_{i=1}^n Ti is the UMVUE of θ, because ∑_{i=1}^n Ti is a complete and sufficient statistic for exp(θ) and Eθ[(1/n) ∑_{i=1}^n Ti] = θ. But this does not imply that e^{−t/T̄} is the UMVUE for e^{−t/θ}. In fact e^{−t/T̄} is not even an unbiased estimate of e^{−t/θ}, and this leads to the search for an UMVUE of e^{−t/θ}.

With the first observation T1, the estimator I_t(T1) = 1 if T1 > t and 0 otherwise, the indicator function of the event T1 > t, is an unbiased estimate of P[T > t] for any life distribution of T (not just the exponential), because Eθ[I_t(T1)] = P[T > t]. Now for the exponential distribution, since ∑_{i=1}^n Ti is a complete and sufficient statistic for θ, by the Rao-Blackwell and Lehmann-Scheffe theorems E[I_t(T1)|∑_{i=1}^n Ti] would be the UMVUE for P[T > t]. The derivation of this quantity only requires some probability calculation, which is as follows.

Let S = ∑_{i=1}^n Ti and R = ∑_{i=2}^n Ti. Since I_t(T1) is a function of T1, its conditional expectation is found by deriving the conditional density f_{T1|S}(t1|s) of T1|S, which requires the joint density f_{T1,S}(t1, s) of (T1, S). This is accomplished by considering the joint density of (T1, R) (which is easy because of the independence of T1 and R) and then considering the one-to-one onto transformation (T1, R) ↔ (T1, S) defined as T1 = T1 and S = R + T1. Note that S ∼ Gamma(n, 1/θ) and likewise R ∼ Gamma(n − 1, 1/θ). Thus the density of R is

f_R(r) = (1/(θ^{n−1} (n − 2)!)) r^{n−2} e^{−r/θ}


and since T1 ∼ exp(θ), the joint density of (T1, R) is given by

f_{T1,R}(t1, r) = (1/(θ^n (n − 2)!)) r^{n−2} e^{−(r+t1)/θ}.

Now applying the one-to-one onto transformation (T1, R) ↔ (T1, S) we get the joint density of (T1, S) as

f_{T1,S}(t1, s) = (1/(θ^n (n − 2)!)) (s − t1)^{n−2} e^{−s/θ}.

The conditional density f_{T1|S}(t1|s) of T1|S is found by dividing f_{T1,S}(t1, s) by the marginal density f_S(s) of S, which, since it is a Gamma(n, 1/θ) density, gives

f_{T1|S}(t1|s) = ((n − 1)/s) (1 − t1/s)^{n−2},   0 < t1 < s.

Now since I_t(T1) is the indicator function of the event T1 > t, E[I_t(T1)|S] is found by integrating f_{T1|S}(t1|s) from t to t1's upper bound s. This integral is found as follows.

E[I_t(T1)|S]
= ((n − 1)/s) ∫_t^s (1 − t1/s)^{n−2} dt1
= (n − 1) ∫_0^{1−t/s} u^{n−2} du
  (with the substitution u = 1 − t1/s we have du = −dt1/s, and the range of the integral goes from 1 − t/s to 0, which after adjusting for the negative sign becomes 0 to 1 − t/s)
= (1 − t/s)^{n−1}

Thus the UMVUE of P[T > t] = e^{−t/θ}, the reliability at mission time t of a system having an exponential life distribution with mean θ, based on the observations T1, . . . , Tn, is given by (1 − t/∑_{i=1}^n Ti)^{n−1}. □
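The following Python sketch (added here; not part of the original notes) compares the plug-in estimator e^{−t/T̄} with the UMVUE (1 − t/∑Ti)^{n−1} over repeated exponential samples. The values of θ, n and the mission time t are arbitrary choices for the illustration; the UMVUE averages out to essentially the true reliability, while the plug-in estimator shows a small systematic bias.

    import numpy as np

    rng = np.random.default_rng(4)
    theta, n, t, reps = 10.0, 5, 4.0, 400_000
    true_rel = np.exp(-t / theta)                 # P[T > t] = e^{-t/theta}

    samples = rng.exponential(theta, size=(reps, n))
    total = samples.sum(axis=1)                   # S = sum of the n observed lifetimes

    plug_in = np.exp(-t / (total / n))            # e^{-t/Tbar}
    umvue = np.where(total > t, (1 - t / total) ** (n - 1), 0.0)  # (1 - t/S)^{n-1}, and 0 when S <= t

    print("true reliability:", true_rel)
    print("mean of plug-in :", plug_in.mean())    # systematically off from the true value
    print("mean of UMVUE   :", umvue.mean())      # essentially unbiased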

Now we shall present the second and final method of determining an UMVUE. Though so far, while discussing the theory, we have pretended as if our interest lies only in the natural population parameter θ, as Example 7 reveals, many times we might be interested in some function φ(θ) of the original population parameter θ. Note that this does not require redoing the previous Rao-Blackwell Lehmann-Scheffe theory of determining the UMVUE. In this case one confines oneself to unbiased estimators of φ(θ) and restricts one's search for the UMVUE only among functions of complete sufficient statistics of θ.

So now suppose we are interested in estimating some parametric function φ(θ) of θ. In this discussion we shall only deal with scalar valued θ and φ(θ).

Theorem 5 (Cramer-Rao Theorem): If the population p.d.f. f(y|θ) is “regular”, Y1, Y2, . . . , Yn are i.i.d. f(y|θ) and T(Y1, Y2, . . . , Yn) is an unbiased estimate of φ(θ), then

Vθ[T] ≥ −{φ′(θ)}² / {n Eθ[∂² log f(Y|θ)/∂θ²]}.⁷

⁷In the discrete case, replace the p.d.f. f(y|θ) by the p.m.f. p(y|θ) in the statement of the theorem and replace the integrals by summations in its proof, and the same result follows through.


Proof: Let Y = (Y1, Y2, . . . , Yn) and y = (y1, y2, . . . , yn). Since the joint density of Y1, Y2, . . . , Yn is the same as the likelihood function of θ, as defined in Definition 4, Eθ[g(Y)] = ∫_{ℝⁿ} g(y) L(θ|y) dy for any function g of Y, and ∫_{ℝⁿ} L(θ|y) dy = 1.

Partially differentiating the above w.r.t. θ within the integral sign (which is allowed since the model is assumed to be “regular”) we get

∫_{ℝⁿ} ∂L(θ|y)/∂θ dy = 0

which can be rewritten as

Eθ[∂ log L(θ|Y)/∂θ] = ∫_{ℝⁿ} {(1/L(θ|y)) ∂L(θ|y)/∂θ} L(θ|y) dy = 0    (9)

If we again differentiate (9) w.r.t. θ under the integral sign (which is allowed under a second “regularity” condition) we get

∫_{ℝⁿ} {(1/L(θ|y)) (∂L(θ|y)/∂θ) (∂L(θ|y)/∂θ) + L(θ|y) (∂/∂θ)((1/L(θ|y)) ∂L(θ|y)/∂θ)} dy
= ∫_{ℝⁿ} {(∂ log L(θ|y)/∂θ) ((1/L(θ|y)) ∂L(θ|y)/∂θ) + (∂/∂θ)(∂ log L(θ|y)/∂θ)} L(θ|y) dy
= ∫_{ℝⁿ} {(∂ log L(θ|y)/∂θ)² + ∂² log L(θ|y)/∂θ²} L(θ|y) dy = 0

The last equality implies that

Eθ[(∂ log L(θ|Y)/∂θ)²] = −Eθ[∂² log L(θ|Y)/∂θ²].    (10)

Now if T(Y) is an unbiased estimator of φ(θ),

∫_{ℝⁿ} T(y) L(θ|y) dy = φ(θ).    (11)

Differentiating (11) under the integral sign (which is allowed under a third “regularity” condition) we obtain

∫_{ℝⁿ} T(y) (∂ log L(θ|y)/∂θ) L(θ|y) dy = φ′(θ)

and by (9)

∫_{ℝⁿ} {T(y) − φ(θ)} (∂ log L(θ|y)/∂θ) L(θ|y) dy = φ′(θ).    (12)

Hence by (9), (11) and (12),

Covθ[T(Y), ∂ log L(θ|Y)/∂θ] = φ′(θ).    (13)


Now by (9) and (10),

Vθ[∂ log L(θ|Y)/∂θ] = −Eθ[∂² log L(θ|Y)/∂θ²] = −Eθ[(∂²/∂θ²) {∑_{i=1}^n log(f(Yi|θ))}] = −n Eθ[∂² log(f(Y|θ))/∂θ²].    (14)

Let ρ_{T(Y), ∂ log L(θ|Y)/∂θ} denote the correlation coefficient between T(Y) and ∂ log L(θ|Y)/∂θ. Then by (13) and (14)

ρ²_{T(Y), ∂ log L(θ|Y)/∂θ} = {Covθ[T(Y), ∂ log L(θ|Y)/∂θ]}² / (Vθ[T(Y)] Vθ[∂ log L(θ|Y)/∂θ]) = {φ′(θ)}² / (Vθ[T(Y)] (−n Eθ[∂² log f(Y|θ)/∂θ²])) ≤ 1

implying Vθ[T] ≥ −{φ′(θ)}² / {n Eθ[∂² log f(Y|θ)/∂θ²]}. □

The quantity −{φ′(θ)}² / {n Eθ[∂² log f(Y|θ)/∂θ²]} is called the Cramer-Rao lower bound (CRLB). The quantity −Eθ[∂² log f(Y|θ)/∂θ²], which by virtue of (10) (for n = 1) equals Eθ[(∂ log f(Y|θ)/∂θ)²], is called the Fisher Information and is denoted by I(θ). The Fisher Information for the entire sample, denoted by In(θ), which in general is given by (10), reduces to nI(θ) in the i.i.d. case. The intuitive reason behind calling this quantity “information” will be clear in the next sub-subsection when we take up the Method of Maximum Likelihood or Maximum Likelihood Estimation (MLE). In this connection note that the CRLB will be attained by an unbiased estimator T(Y) of φ(θ) if and only if its correlation with ∂ log L(θ|Y)/∂θ is ±1, which happens if and only if T(Y) − φ(θ) is a constant multiple of ∂ log L(θ|Y)/∂θ, since by (9) Eθ[∂ log L(θ|Y)/∂θ] = 0. But since in general this constant multiple may depend on θ, it may be said that the CRLB is attained by an unbiased estimator T(Y) of φ(θ) if and only if

∂ log L(θ|Y)/∂θ = A(θ) {T(Y) − φ(θ)}    (15)

Example 8: A. If Y1, Y2, . . . , Yn are i.i.d. Bernoulli(p)

I(p) = Vp

[∂

∂p{Y log(p) + (1− Y ) log(1− p)}

]= Vp

[Y

p(1− p)

]=

1

p(1− p)

16

Page 17: Elements of Statistical Inference - Indian Institute of Sciencemgmt.iisc.ac.in/CM/LectureNotes/statistical_inference_frequentist.pdf · Indian Institute of Science Statistical inference

and Vp[p̂ = 1

n

∑ni=1 Yi

]= p(1−p)

n= 1In(p)

, the CRLB, showing the sample proportion to be theUMVUE for the population proportion.

B. Let Y1, Y2, . . . , Yn be i.i.d. N(µ, 1). Then

In(µ) = −Eµ[(∂²/∂µ²) log(∏_{i=1}^n {(1/√(2π)) e^{−(1/2)(Yi−µ)²}})] = −Eµ[(∂/∂µ) ∑_{i=1}^n (Yi − µ)] = n

and Vµ[Ȳ] = 1/n = 1/In(µ). Since Ȳ is unbiased for µ it is an UMVUE, as its variance attains the CRLB.

C. Let Y1, Y2, . . . , Yn be i.i.d. exponential with mean θ. Then f(y|θ) = (1/θ) e^{−y/θ}. Thus (∂²/∂θ²) log f(y|θ) = (∂/∂θ){−(1/θ) + (y/θ²)} = (1/θ²) − 2(y/θ³), so that I(θ) = −Eθ[(1/θ²) − 2(Y/θ³)] = 1/θ² and In(θ) = n/θ². Now if we are interested in estimating θ, since Eθ[Ȳ] = θ and Vθ[Ȳ] = θ²/n = 1/In(θ) = the CRLB, Ȳ is the UMVUE for θ. □
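Example 8 C can be verified numerically. The short Python sketch below (added for illustration; not from the original notes) simulates repeated exponential samples and compares the Monte Carlo variance of Ȳ with the Cramer-Rao lower bound θ²/n.

    import numpy as np

    rng = np.random.default_rng(5)
    theta, n, reps = 3.0, 20, 200_000

    ybar = rng.exponential(theta, size=(reps, n)).mean(axis=1)
    print("Var[Ybar] (simulated):", ybar.var())
    print("CRLB = theta^2 / n   :", theta ** 2 / n)   # the two should nearly agree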

2.1.2 MLE

So far we have discussed some criteria under which an estimator may be called “good” and have only provided at best some indirect methods of obtaining such estimators. In general it is desirable to have some automatic methods which yield estimators that are at least approximately “good”. One such method is the method of Maximum Likelihood (ML), yielding an MLE. If a UMVUE does not exist or cannot be obtained using the results discussed in §2.1.1, then by far the most popular method of estimation employed in practice is the method of Maximum Likelihood. This is partly because, instead of having an assortment of theorems guiding one in the search of a “good” estimator as in the case of UMVUE, it is an automatic method producing at least numerical estimates; and partly (actually philosophically the only reason) because the method yields estimators which are “good” in the asymptotic sense and thus work very well for large samples.

Let the parameter of interest θ be vector valued, and let L(θ|y) denote the likelihood of θ given the observations y, where the likelihood function L(θ|y) is as has been defined in Definition 4.

Definition 6: θ̂ is called the Maximum Likelihood Estimator (MLE) of θ if L(θ̂|y) ≥ L(θ|y) ∀θ ∈ Θ.

Intuitively, granting the informal interpretation of the likelihood function discussed in the paragraph following Definition 4, the MLE θ̂ is that value of θ for which one is most likely to observe a data set such as the one at hand, namely Y = y, among all possible values of θ ∈ Θ. Other than this intuitive appeal, in the so-called “regular”⁸ cases, computation of the MLE is conceptually straightforward. All one has to do is find the maxima of L(θ|y), which is typically accomplished by solving the system of equations ∂L(θ|y)/∂θ |_{θ=θ̂} = 0, or equivalently by solving

∂ℓ(θ|y)/∂θ |_{θ=θ̂} = 0   where ℓ(θ|y) = log(L(θ|y))    (16)

⁸A model f(y|θ) or p(y|θ) is regular if it satisfies the conditions required for interchanging the differentiation and integration signs in Theorem 5 yielding the CRLB.


((16) is usually easier to deal with because the likelihood function is a product, which after taking logs becomes a sum), and then showing the resulting solution θ̂ to be the global maxima⁹.

Example 9: A. If Y1, Y2, . . . , Yn are i.i.d. Bernoulli(p),

L(p|y) = p^{∑_{i=1}^n yi} (1 − p)^{n − ∑_{i=1}^n yi}  ⇒  ℓ(p|y) = log(p/(1 − p)) ∑_{i=1}^n yi + n log(1 − p)

and

∂ℓ(p|y)/∂p |_{p=p̂} = 0  ⇒  p̂ = (1/n) ∑_{i=1}^n yi

Since ∂²ℓ(p|y)/∂p² = −(∑_{i=1}^n yi / p² + (n − ∑_{i=1}^n yi)/(1 − p)²) < 0 ∀ 0 < p < 1, ℓ(p|y) is concave and thus p̂ = (1/n) ∑_{i=1}^n yi, the sample proportion, is the MLE of the unknown population proportion p.

B. Let Y1, Y2, . . . , Yn be i.i.d. N(µ, σ²). Then

L(µ, σ²) = (2πσ²)^{−n/2} exp{−(1/(2σ²)) ∑_{i=1}^n (yi − µ)²}  ⇒  ℓ(µ, σ²) = −(n/2) log(2π) − (n/2) log(σ²) − (1/(2σ²)) ∑_{i=1}^n (yi − µ)²

Now

(∂ℓ(µ, σ²|y)/∂µ, ∂ℓ(µ, σ²|y)/∂(σ²)) |_{(µ=µ̂, σ²=σ̂²)} = (0, 0)
⇒ (∑_{i=1}^n (yi − µ̂), ∑_{i=1}^n (yi − µ̂)² / (2(σ̂²)²)) = (0, n/(2σ̂²))
⇒ (µ̂, σ̂²) = (ȳ, (1/n) ∑_{i=1}^n (yi − ȳ)²).

The second derivative matrix of ℓ(µ, σ²|y), with entries ∂²ℓ/∂µ², ∂²ℓ/∂µ∂(σ²), ∂²ℓ/∂(σ²)∂µ and ∂²ℓ/∂(σ²)², equals

− [ n/σ²                                (1/(σ²)²) ∑_{i=1}^n (yi − µ)
    (1/(σ²)²) ∑_{i=1}^n (yi − µ)        (1/(σ²)²) {(1/σ²) ∑_{i=1}^n (yi − µ)² − n/2} ],

which is negative-definite at (µ̂, σ̂²), and is asymptotically so as n → ∞ in probability¹⁰, because by the law of large numbers one n-th of this matrix converges in probability to

− [ 1/σ²    0
    0       1/(2(σ²)²) ].

Hence (µ̂, σ̂²) = (ȳ, (1/n) ∑_{i=1}^n (yi − ȳ)²) is the MLE of (µ, σ²). □
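When the likelihood equations of (16) do not have a convenient closed form, the maximisation is done numerically. As an added illustration (not part of the original notes), the Python sketch below maximises the Normal log-likelihood of Example 9 B numerically by minimising the negative log-likelihood with scipy, and recovers (µ̂, σ̂²) = (ȳ, (1/n)∑(yi − ȳ)²); the simulated data set is an arbitrary choice.

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import norm

    rng = np.random.default_rng(6)
    y = rng.normal(2.0, 1.5, size=50)                # an assumed observed sample

    def neg_loglik(params, y):
        mu, log_sigma2 = params                      # optimise over log(sigma^2) so that sigma^2 > 0
        sigma = np.sqrt(np.exp(log_sigma2))
        return -np.sum(norm.logpdf(y, loc=mu, scale=sigma))

    res = minimize(neg_loglik, x0=np.array([0.0, 0.0]), args=(y,))
    mu_hat, sigma2_hat = res.x[0], float(np.exp(res.x[1]))

    print("numerical MLE:", mu_hat, sigma2_hat)
    print("closed form  :", y.mean(), y.var(ddof=0))  # ybar and (1/n) * sum((y - ybar)^2)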

Having understood the method, i.e. how to implement it, let us now turn our attention to the more important question of the justification of this method. That is, why in a given situation should the MLE be considered and what optimality property does it possess? We shall present the proofs only in the single parameter case, which are easily extendable to the multi or vector-valued parameter case. Also we shall only be concerned with regular cases. For non-regular cases ML may or may not be the best approach for estimation. But before discussing the optimality of the MLE, we have to first introduce another criterion for an estimator being “good”, called consistency.

⁹Showing something to be a global maxima in general is a daunting task, but in most MLE computations this feat is typically accomplished by showing the log-likelihood is concave.

¹⁰A sequence of random variables {Xn}_{n=1}^∞ is said to converge to a random variable X in probability, notationally represented as Xn →_P X, if ∀ε > 0 the sequence of real numbers P[|Xn − X| < ε] → 1 as n → ∞.

Definition 7: An estimator θ̂n, based on a sample of size n, is said to be consistent for θ if θ̂n →_P θ ∀θ ∈ Θ.

That is, if an estimator is consistent, it has the desirable property of converging to the true unknown value as the sample size increases.

Example 10 (Weak Law of Large Numbers): Let Ȳn = (1/n) ∑_{i=1}^n Yi denote the sample mean based on n i.i.d. observations Y1, Y2, . . . , Yn from an arbitrary population having mean µ and variance σ². Since E[Ȳn] = µ (Example 2) and V[Ȳn] = σ²/n (Example 6), by Chebyshev's inequality

P[|Ȳn − µ| < ε] = P[|Ȳn − µ| < (√n ε/σ)(σ/√n)] ≥ 1 − σ²/(nε²) → 1 as n → ∞

Thus the sample mean is always a consistent estimate of the population mean, provided the population mean and variance exist. □
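A quick simulation of Example 10 (added here for illustration; not part of the original notes): for a fixed ε, the proportion of repeated samples in which Ȳn lies within ε of µ tends to 1 as the sample size n grows. The population and the value of ε are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(7)
    mu, sigma, eps, reps = 5.0, 2.0, 0.2, 2000

    for n in (10, 100, 1000, 5000):
        ybar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
        print(n, np.mean(np.abs(ybar - mu) < eps))   # P[|Ybar_n - mu| < eps] -> 1 as n grows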

Theorem 6: Assume the probability model is “regular”. Then as n → ∞, (16) always has a solution with probability tending to 1. Furthermore such a solution is unique with probability tending to 1, and it is also consistent for θ.

Proof: The above theorem essentially has three claims, namely existence of a solution, uniqueness of the solution and finally consistency of the solution. We shall prove existence, consistency and uniqueness, in that order, after some preliminaries.

Let θ0 denote the true unknown value of the parameter of interest θ. Consider the

random variable L(θ|Y )

L(θ0|Y ), where L(θ|Y ) is the Likelihood function of the data set Y =

(Y1, Y2, . . . , Yn ). Since arithmetic mean is always greater than geometric mean, so are theirlogarithms, which means logarithm of an expectation or arithmetic mean is always greaterthan the expectation of the logarithms (because that’s what the logarithm of a geometricmean reduces to). Thus,

$$\log\left(E_{\theta_0}\left[\frac{L(\theta|Y)}{L(\theta_0|Y)}\right]\right) > E_{\theta_0}\left[\log\left(\frac{L(\theta|Y)}{L(\theta_0|Y)}\right)\right],$$

where $E_{\theta_0}[g(Y)]$ denotes the expectation of the random variable $g(Y)$ when $\theta$ equals its true unknown value $\theta_0$. Now $\forall\theta\in\Theta$,

$$E_{\theta_0}\left[\frac{L(\theta|Y)}{L(\theta_0|Y)}\right] = \int_{\Re^n}\frac{L(\theta|y)}{L(\theta_0|y)}\prod_{i=1}^n f(y_i|\theta_0)\,dy = \int_{\Re^n}\frac{L(\theta|y)}{L(\theta_0|y)}\,L(\theta_0|y)\,dy = \int_{\Re^n}\prod_{i=1}^n f(y_i|\theta)\,dy = 1.$$

The left-hand side of the earlier inequality is therefore $\log 1 = 0$, so that $\forall\theta\in\Theta$,
$$E_{\theta_0}\left[\log L(\theta_0|Y)\right] \ge E_{\theta_0}\left[\log L(\theta|Y)\right].$$


But $\frac{1}{n}\log L(\theta|Y) = \frac{1}{n}\sum_{i=1}^n\log f(Y_i|\theta)$ $\forall\theta\in\Theta$, and thus by the law of large numbers (see Example 10 above), with probability tending to 1, $\frac{1}{n}\log L(\theta|Y)$ is arbitrarily close to $\frac{1}{n}E_{\theta_0}\left[\log L(\theta|Y)\right]$. Hence $\forall\theta\in\Theta$,
$$\log L(\theta_0|Y) \ge \log L(\theta|Y)\ \text{ with probability tending to 1.} \qquad (17)$$

Existence: All the statements below are true with probability tending to 1. According to (17), $\log L(\theta|Y)$, i.e. $\ell(\theta|Y)$, has a maximum at $\theta_0$. Hence its derivative must vanish at $\theta_0$. Therefore equation (16) must have a solution (at $\theta_0$ in particular). This shows that a solution of equation (16) exists.

Consistency: Substituting $\theta=\hat\theta$, the MLE of $\theta$, in (17) we get $\log L(\theta_0|Y) \ge \log L(\hat\theta|Y)$ with probability tending to 1. But on the other hand, by the definition of the MLE (vide Definition 6), $\log L(\hat\theta|Y) \ge \log L(\theta|Y)$ $\forall\theta\in\Theta$, and thus in particular for $\theta=\theta_0$, $\log L(\hat\theta|Y) \ge \log L(\theta_0|Y)$. Therefore the MLE $\hat\theta$ has to coincide with $\theta_0$, the true unknown value of the parameter of interest, with probability tending to 1 as $n\to\infty$, showing that the MLE is consistent.

Uniqueness: We shall outline the proof for the single-parameter case. The proof for the general multi-parameter case is analogous but, since it requires a little additional notation, is not included here. The MLE is obtained by solving equation (16) and then checking whether the solution is at least a local maximum (because equation (16) is also satisfied by local minima as well as saddle points of the log-likelihood function $\ell(\theta|Y)$) by looking at the sign of $\frac{\partial^2\ell(\theta|Y)}{\partial\theta^2}\Big|_{\theta=\hat\theta}$, which needs to be negative for $\hat\theta$ to be a maximum. Let $\hat\theta_c$ be any consistent estimator of $\theta$. Then by the law of large numbers

$$\frac{1}{n}\frac{\partial^2\ell(\theta|Y)}{\partial\theta^2}\Bigg|_{\theta=\hat\theta_c} = \frac{1}{n}\sum_{i=1}^n\frac{\partial^2\log f(Y_i|\theta)}{\partial\theta^2}\Bigg|_{\theta=\hat\theta_c} \xrightarrow{P} E_{\theta_0}\left[\frac{\partial^2\log f(Y|\theta)}{\partial\theta^2}\Bigg|_{\theta=\theta_0}\right].$$

Now by (10) (for $n=1$), $\forall\theta\in\Theta$,
$$E_\theta\left[\frac{\partial^2\log f(Y|\theta)}{\partial\theta^2}\right] = -E_\theta\left[\left(\frac{\partial\log f(Y|\theta)}{\partial\theta}\right)^2\right] < 0,$$

and hence $E_{\theta_0}\left[\frac{\partial^2\log f(Y|\theta)}{\partial\theta^2}\Big|_{\theta=\theta_0}\right]<0$, which means that if $\hat\theta_c$ is a consistent estimator then, by the law of large numbers, with probability tending to 1, $\frac{\partial^2\ell(\theta|Y)}{\partial\theta^2}\Big|_{\theta=\hat\theta_c}<0$. Now let $\hat\theta_1<\hat\theta_2$ be two distinct MLE's. Then they must both be consistent. Furthermore, since the regularity conditions guarantee that the likelihood function is smooth, there must exist a minimum $\hat\theta_3$ such that $\hat\theta_1<\hat\theta_3<\hat\theta_2$. Since $\hat\theta_3$ is a minimum, $\frac{\partial^2\ell(\theta|Y)}{\partial\theta^2}\Big|_{\theta=\hat\theta_3}>0$. But since $\hat\theta_1$ and $\hat\theta_2$ are consistent and $\hat\theta_1<\hat\theta_3<\hat\theta_2$, so is $\hat\theta_3$, in which case $\frac{\partial^2\ell(\theta|Y)}{\partial\theta^2}\Big|_{\theta=\hat\theta_3}<0$, which is a contradiction, and thus one cannot have more than one MLE. □


In the foregoing proof of uniqueness, we saw that $\frac{\partial^2\ell(\theta|Y)}{\partial\theta^2}\Big|_{\theta=\hat\theta}$ played a crucial role in establishing an estimate to be an MLE. It was also seen that the negative of this quantity approximately equals $-E_\theta\left[\frac{\partial^2\ell(\theta|Y)}{\partial\theta^2}\right] = I_n(\theta)$ at $\theta=\theta_0$. The quantity $I(\theta) = \frac{1}{n}I_n(\theta)$ also arose in §2.1.1 in the context of the CRLB, where $I(\theta)$ was called the Fisher Information. It is called “information” for the following reasons, stated below without proof:

(i) By (10),
$$I(\theta) = -E_\theta\left[\frac{\partial^2\ell(\theta|Y)}{\partial\theta^2}\right] = E_\theta\left[\left(\frac{\partial\ell(\theta|Y)}{\partial\theta}\right)^2\right] > 0 \quad\forall\theta\in\Theta$$
(a small numerical check of this identity, and of the additivity in (ii), is given after this list).

(ii) $I(\theta)$ is additive in the sense that if $I^{(1)}(\theta)$ and $I^{(2)}(\theta)$ respectively denote the informations for two samples $Y_1$ and $Y_2$ of size 1 each, then the information $I_2(\theta)$ for the combined sample $\{Y_1,Y_2\}$ of size 2 equals $I^{(1)}(\theta)+I^{(2)}(\theta)$, provided of course they are independent.

(iii) If $I^{(T)}(\theta)$ denotes the information based on a statistic $T(Y)$, then $I(\theta)\ge I^{(T)}(\theta)$, with equality iff $T$ is sufficient. This shows that one always loses some information by summarising raw data, unless it is summarised using sufficient statistics, in which case no information is lost relative to the original raw data.

(iv) A variety of distance measures between the two distributions corresponding to two different values of $\theta$ have an approximately increasing relationship with $I(\theta)$. What this means is that $I(\theta)$ can be viewed as a measure of how sensitive the population distribution is to changes in $\theta$.
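
The following is the small numerical check of facts (i) and (ii) promised above (an illustrative sketch added here, using a single Bernoulli$(p)$ observation, for which $I(p)=1/\{p(1-p)\}$; it assumes NumPy and the variable names are hypothetical):

import numpy as np

p = 0.3
# exact expectations over the two possible outcomes y = 0, 1
ys = np.array([0, 1])
probs = np.array([1 - p, p])
score = ys / p - (1 - ys) / (1 - p)                     # d/dp log f(y|p)
second = -ys / p**2 - (1 - ys) / (1 - p) ** 2           # d^2/dp^2 log f(y|p)

info_from_second = -(probs * second).sum()              # -E[second derivative]
info_from_score = (probs * score**2).sum()              # E[score^2]
print(info_from_second, info_from_score, 1 / (p * (1 - p)))   # all three coincide

n = 25
print(n * info_from_score)   # information in an i.i.d. sample of size n (additivity, fact (ii))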

Apart from the above mathematical facts, the so-called likelihood principle says that whatever a data set $Y$ has to say about the unknown parameter $\theta$ is packed in the likelihood or log-likelihood function $\ell(\theta|Y)$. Now, for the same parameter and two different data sets, the one with the sharper (more peaked) $\ell(\theta|Y)$ is more informative about $\theta$; that is, the flatter the likelihood, the less the information about $\theta$. $I_n(\theta)$ $(=nI(\theta))$, the negative of the expected value of the second derivative of the log-likelihood, simply measures how peaked one can expect this log-likelihood function to be. The negative sign simply ensures that the quantity is positive. The larger this value, the more peaked the likelihood is expected to be, and thus the more the information about $\theta$; hence the name “information”.

The Fisher Information, being an expectation, is a population quantity, and thus depicts the kind of informative behaviour to expect from a likelihood, in general, for an arbitrary sample from the population. But for a given set of data the peakedness of the likelihood can be measured exactly, without resorting to its expected value. This consideration gives rise to the sample version of the Fisher information, called the Observed Information, which is given by $\hat I_n(\theta) = -\frac{\partial^2\ell(\theta|Y)}{\partial\theta^2}$. Note that by the law of large numbers, $\forall\theta\in\Theta$, $\frac{1}{n}\hat I_n(\theta)\xrightarrow{P}I(\theta)$, which allows us to consistently estimate the Fisher Information by the Observed Information.

In the multi-parameter or vector-valued $\theta$ case, instead of a scalar Fisher/observed information we have a Fisher Information Matrix $I(\theta)$ and its sample analogue, the Observed Information Matrix $\hat I(\theta)$, which are as follows. Let $\theta=(\theta_1,\ldots,\theta_k)$ have $k$ components. Then

$$I(\theta) = -\left(\begin{array}{cccc}
E_\theta\!\left[\frac{\partial^2\ell(\theta|Y)}{\partial\theta_1^2}\right] & E_\theta\!\left[\frac{\partial^2\ell(\theta|Y)}{\partial\theta_1\partial\theta_2}\right] & \cdots & E_\theta\!\left[\frac{\partial^2\ell(\theta|Y)}{\partial\theta_1\partial\theta_k}\right] \\[8pt]
E_\theta\!\left[\frac{\partial^2\ell(\theta|Y)}{\partial\theta_2\partial\theta_1}\right] & E_\theta\!\left[\frac{\partial^2\ell(\theta|Y)}{\partial\theta_2^2}\right] & \cdots & E_\theta\!\left[\frac{\partial^2\ell(\theta|Y)}{\partial\theta_2\partial\theta_k}\right] \\[8pt]
\vdots & \vdots & \ddots & \vdots \\[4pt]
E_\theta\!\left[\frac{\partial^2\ell(\theta|Y)}{\partial\theta_k\partial\theta_1}\right] & E_\theta\!\left[\frac{\partial^2\ell(\theta|Y)}{\partial\theta_k\partial\theta_2}\right] & \cdots & E_\theta\!\left[\frac{\partial^2\ell(\theta|Y)}{\partial\theta_k^2}\right]
\end{array}\right)$$

and likewise

$$\hat I(\theta) = -\left(\begin{array}{cccc}
\frac{1}{n}\sum_{i=1}^n\frac{\partial^2\ell(\theta|Y_i)}{\partial\theta_1^2} & \frac{1}{n}\sum_{i=1}^n\frac{\partial^2\ell(\theta|Y_i)}{\partial\theta_1\partial\theta_2} & \cdots & \frac{1}{n}\sum_{i=1}^n\frac{\partial^2\ell(\theta|Y_i)}{\partial\theta_1\partial\theta_k} \\[8pt]
\frac{1}{n}\sum_{i=1}^n\frac{\partial^2\ell(\theta|Y_i)}{\partial\theta_2\partial\theta_1} & \frac{1}{n}\sum_{i=1}^n\frac{\partial^2\ell(\theta|Y_i)}{\partial\theta_2^2} & \cdots & \frac{1}{n}\sum_{i=1}^n\frac{\partial^2\ell(\theta|Y_i)}{\partial\theta_2\partial\theta_k} \\[8pt]
\vdots & \vdots & \ddots & \vdots \\[4pt]
\frac{1}{n}\sum_{i=1}^n\frac{\partial^2\ell(\theta|Y_i)}{\partial\theta_k\partial\theta_1} & \frac{1}{n}\sum_{i=1}^n\frac{\partial^2\ell(\theta|Y_i)}{\partial\theta_k\partial\theta_2} & \cdots & \frac{1}{n}\sum_{i=1}^n\frac{\partial^2\ell(\theta|Y_i)}{\partial\theta_k^2}
\end{array}\right).$$
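
As an illustration (a sketch added here, not part of the notes), the observed information matrix can also be obtained numerically by finite-differencing the log-likelihood; for Normal data it should be close to the per-observation Fisher information matrix $\mathrm{diag}\left(1/\sigma^2,\ 1/(2\sigma^4)\right)$ obtained in the earlier example. The helper names below are hypothetical.

import numpy as np

rng = np.random.default_rng(3)
mu0, sigma2_0 = 1.0, 4.0
y = rng.normal(mu0, np.sqrt(sigma2_0), size=5000)
n = len(y)

def loglik(theta):
    mu, s2 = theta
    return -0.5 * n * np.log(2 * np.pi * s2) - np.sum((y - mu) ** 2) / (2 * s2)

def hessian(f, theta, h=1e-4):
    # central finite differences for the k x k matrix of second derivatives
    k = len(theta)
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            e_i, e_j = np.zeros(k), np.zeros(k)
            e_i[i], e_j[j] = h, h
            H[i, j] = (f(theta + e_i + e_j) - f(theta + e_i - e_j)
                       - f(theta - e_i + e_j) + f(theta - e_i - e_j)) / (4 * h * h)
    return H

theta_hat = np.array([y.mean(), np.mean((y - y.mean()) ** 2)])   # MLE of (mu, sigma^2)
obs_info = -hessian(loglik, theta_hat) / n                       # averaged observed information matrix
fisher = np.diag([1 / sigma2_0, 1 / (2 * sigma2_0**2)])
print(np.round(obs_info, 4))
print(np.round(fisher, 4))    # the two matrices should be close for a sample this large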

As far as the optimality of the MLE is concerned, so far we have only shown that it is consistent. Though the MLE is often the starting point of the search for a UMVUE, its relationship with the UMVUE still remains to be examined. Actually, the connection between the two in the single-parameter case is best studied through Theorem 5, establishing the CRLB. Thus suppose there exists an unbiased estimator $T(Y)$ of a parametric function $\phi(\theta)$ which attains the CRLB. Then by (15) and (16), $\hat\theta$, the MLE of $\theta$, must satisfy $T(Y)=\phi(\hat\theta)$, which goes on to show that the UMVUE of $\phi(\theta)$ is a function of the MLE. In case $\phi(\theta)\equiv\theta$ and $T(Y)$ is unbiased for $\theta$ attaining the CRLB, the above argument also shows that $T(Y)$ must then equal the MLE $\hat\theta$.

The connection between UMVUE and MLE in the multi-parameter case is easiest to appreciate for the exponential family of distributions, introduced in Example 6, which covers most of the probability models applied in practice. Again, as usual, we shall show the mathematics for the continuous case, which carries through in exactly the same way in the discrete case upon replacing the p.d.f. by the p.m.f. and the integrals by summations. By (7), since $f(y|\eta)$ is a p.d.f.,
$$\int_{\Re}\exp\left\{\sum_{l=1}^k\eta_l T_l(y)+B(y)\right\}dy = e^{-A(\eta)}.$$

Differentiating the above w.r.t. $\eta_j$, interchanging the differentiation and integration signs, denoting $\frac{\partial A(\eta)}{\partial\eta_j}$ by $A_j(\eta)$, and then multiplying both sides by $e^{A(\eta)}$, we get, $\forall j=1,2,\ldots,k$,
$$\int_{\Re}T_j(y)\exp\left\{\sum_{l=1}^k\eta_l T_l(y)+B(y)+A(\eta)\right\}dy = -A_j(\eta),$$

or $\int_{\Re}T_j(y)f(y|\eta)\,dy = E_\eta[T_j(Y)] = -A_j(\eta)$. Thus $T_j(Y)$ is an unbiased estimator of $-A_j(\eta)$, and as shown in Example 6, based on a sample of size $n$, $\frac{1}{n}\sum_{i=1}^n T_j(Y_i)$ is the UMVUE of $-A_j(\eta)$. But by (9) the likelihood equation (16) in this case reduces to

$$\frac{1}{n}\sum_{i=1}^n T_j(Y_i) = -A_j(\hat\eta) \quad\forall j=1,2,\ldots,k,$$


where $\hat\eta$ is the MLE of $\eta$, showing that functions of $T$, being UMVUE estimates of their expectations (vide Example 6), are necessarily functions of the MLE $\hat\eta$.
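
As a quick illustration of this fact (a worked sketch added here for concreteness, using the Poisson model written in the natural parameterisation implied by (7) as used above; the sign conventions below follow that form), note that
$$f(y|\lambda) = \frac{e^{-\lambda}\lambda^y}{y!} = \exp\{\eta\,T(y)+B(y)+A(\eta)\},\qquad \eta=\log\lambda,\ T(y)=y,\ B(y)=-\log y!,\ A(\eta)=-e^{\eta},$$
so that $-A'(\eta)=e^{\eta}=\lambda=E_\eta[Y]$. The likelihood equation $\frac{1}{n}\sum_{i=1}^n Y_i=-A'(\hat\eta)$ then gives $\hat\lambda=e^{\hat\eta}=\bar Y$, so here $\bar Y$ is simultaneously the MLE of $\lambda$ and the UMVUE of $E[Y]=\lambda$.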

We shall close our discussion of the MLE after proving an important theorem. This theorem, among other things, such as enabling one to approximate the sampling distribution of the MLE in large samples, also helps one to see the connection between the MLE and the CRLB.

Theorem 7: For “regular” models and large $n$, the sampling distribution of the MLE $\hat\theta$ may be approximated by a $k$-variate Normal distribution with mean $\theta_0$ and variance-covariance matrix $I_n^{-1}(\theta_0)$, where $\theta_0$ is the true unknown value of $\theta$.

Proof: We shall give the outline of the proof only for $k=1$. For general $k$ the proof is similar but a little more technical in nature, and hence omitted. By Taylor's theorem, expanding the l.h.s. of the likelihood equation (16) about $\theta_0$, the true unknown value of $\theta$, we get

$$\frac{\partial\ell(\theta|Y)}{\partial\theta}\Bigg|_{\theta=\hat\theta} = \frac{\partial\ell(\theta|Y)}{\partial\theta}\Bigg|_{\theta=\theta_0} + (\hat\theta-\theta_0)\,\frac{\partial^2\ell(\theta|Y)}{\partial\theta^2}\Bigg|_{\theta=\theta^*},$$

where $\theta^*$ lies between $\hat\theta$ and $\theta_0$. But since by (16) the l.h.s. of the above equation is 0,

$$\hat\theta-\theta_0 = \frac{\partial\ell(\theta|Y)}{\partial\theta}\Bigg|_{\theta=\theta_0}\Bigg/\left\{-\frac{\partial^2\ell(\theta|Y)}{\partial\theta^2}\Bigg|_{\theta=\theta^*}\right\}.$$

Multiplying both sides of the above by $\sqrt{nI(\theta_0)}$, and multiplying the numerator and denominator of the r.h.s. by $\frac{1}{n}$, this equation may be rewritten as

$$(\hat\theta-\theta_0)\sqrt{nI(\theta_0)} = \frac{\left\{\frac{1}{n}\frac{\partial\ell(\theta|Y)}{\partial\theta}\Big|_{\theta=\theta_0}\right\}\Big/\sqrt{I(\theta_0)/n}}{\left\{\frac{1}{n}\frac{\partial^2\ell(\theta|Y)}{\partial\theta^2}\Big|_{\theta=\theta^*}\right\}\Big/\{-I(\theta_0)\}}. \qquad (18)$$

Now by the law of large numbers, the definition of $I(\theta_0)$, and because $\theta^*$ is sandwiched between $\theta_0$ and $\hat\theta$, with $\hat\theta\xrightarrow{P}\theta_0$,

$$\frac{1}{n}\frac{\partial^2\ell(\theta|Y)}{\partial\theta^2}\Bigg|_{\theta=\theta^*} = \frac{1}{n}\sum_{i=1}^n\frac{\partial^2\log f(Y_i|\theta)}{\partial\theta^2}\Bigg|_{\theta=\theta^*} \xrightarrow{P} -I(\theta_0).$$

Thus the denominator of (18) converges in probability to 1. For handling the numerator, note that

$$\frac{1}{n}\frac{\partial\ell(\theta|Y)}{\partial\theta}\Bigg|_{\theta=\theta_0} = \frac{1}{n}\sum_{i=1}^n U_i,$$

where $U_i = \frac{\partial\log f(Y_i|\theta)}{\partial\theta}\Big|_{\theta=\theta_0}$. By (9), $E_{\theta_0}[U_i]=0$, and by (9) and (10), $V_{\theta_0}[U_i]=I(\theta_0)=\sigma^2$ (say). Thus, denoting $\frac{1}{n}\sum_{i=1}^n U_i$ by $\bar U$, the numerator of (18) may be written as $\frac{\bar U-0}{\sqrt{\sigma^2/n}}$, whose distribution approaches that of a standard Normal distribution as $n\to\infty$ by the Central Limit Theorem. Thus the distribution of $(\hat\theta-\theta_0)\sqrt{nI(\theta_0)}$, the l.h.s. of (18),


must also approach that of a standard Normal distribution as $n\to\infty$, implying that $\hat\theta$ is asymptotically Normal with mean $\theta_0$ and variance $1/\{nI(\theta_0)\}$. □
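
Theorem 7 is easy to visualise by simulation. The following sketch (added here for illustration; it assumes Exponential data with rate $\lambda$, for which $\hat\lambda=1/\bar Y$ and $I(\lambda)=1/\lambda^2$, and all names are hypothetical) checks that $\sqrt{nI(\lambda_0)}\,(\hat\lambda-\lambda_0)$ behaves like a standard Normal variable:

import numpy as np

rng = np.random.default_rng(4)
lam0, n, reps = 2.0, 500, 10000
y = rng.exponential(scale=1 / lam0, size=(reps, n))
lam_hat = 1 / y.mean(axis=1)                    # MLE of the rate in each replicated sample
z = np.sqrt(n / lam0**2) * (lam_hat - lam0)     # sqrt(n I(lambda0)) * (MLE - true value)

print(z.mean(), z.std())                        # approximately 0 and 1
print(np.mean(np.abs(z) < 1.96))                # approximately 0.95, as for a standard Normal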

The above theorem shows that the asymptotic variance of the MLE is the same as the CRLB. In this sense the MLE may be called an “efficient” estimator. However, a general discussion of efficiency is beyond the scope of these notes, and thus we shall wrap up our discussion of the MLE by making a couple of final remarks.

Note that the optimality properties of the MLE, like consistency and asymptotic Normality with the CRLB as the asymptotic variance, are all large-sample properties. That is, they hold only as the sample size $n\to\infty$. Thus it must be borne in mind that one can expect the MLE to behave reasonably well only when the sample size is large, unless other results show it to be optimal for small samples.

The second point is regarding numerical computation. Since more often than not the likelihood equation (16) does not admit a closed-form analytical solution, one has to employ numerical methods. In such situations one typically employs a standard numerical method like Newton-Raphson to directly solve (16), or utilises the method of steepest ascent for direct maximisation of $\ell(\theta|Y)$. Newton-Raphson requires the second derivative matrix of $\ell(\theta|Y)$. When one uses the expected values of these second derivatives in the Newton-Raphson algorithm, the method is called Rao's method of scoring, which typically converges faster than the usual Newton-Raphson. A minimal sketch contrasting the two iterations is given below.
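
The promised sketch follows (added for illustration, not from the notes; it uses the Cauchy location model with unit scale, for which the per-observation Fisher information is the constant $1/2$, so scoring replaces the observed second derivative by $-n/2$; all names are hypothetical):

import numpy as np

rng = np.random.default_rng(5)
theta0 = 3.0
y = theta0 + rng.standard_cauchy(size=400)

def score(theta):                       # d l(theta|y) / d theta for the Cauchy location model
    u = y - theta
    return np.sum(2 * u / (1 + u**2))

def second_derivative(theta):           # d^2 l(theta|y) / d theta^2
    u = y - theta
    return np.sum((2 * u**2 - 2) / (1 + u**2) ** 2)

theta_nr = theta_fs = np.median(y)      # a sensible consistent starting value
for _ in range(20):
    theta_nr = theta_nr - score(theta_nr) / second_derivative(theta_nr)   # Newton-Raphson
    theta_fs = theta_fs + score(theta_fs) / (len(y) * 0.5)                # scoring: use n*I(theta) = n/2 instead
print(theta_nr, theta_fs)               # both iterations converge to (essentially) the same MLE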

$1/\sqrt{nI(\theta_0)}$ is called the asymptotic standard error of the MLE. Since one does not know $\theta_0$, the true unknown value of $\theta$, one may replace $\theta_0$ by the MLE $\hat\theta$ in the expression for $I(\theta)$, or alternatively report $1/\sqrt{\hat I_n(\hat\theta)}$, based on the observed information, as the estimated asymptotic standard error of the MLE, which is also a consistent estimator. As shall be shortly seen in §2.2, the section on interval estimation, a standard error is typically best interpreted in its intuitive sense when the sampling distribution of the estimator is Normal. In the case of the MLE, by virtue of Theorem 7, this is so, and this asymptotic standard error is therefore a fairly interpretable quantity whose use and interpretation will be further clarified in §2.2.

2.1.3 Other Methods

For a given estimation problem, though we first strive to procure a UMVUE, with the search possibly guided by first obtaining an analytical expression for the MLE, because of computational difficulties one sometimes resorts to other alternative methods. In this subsection we shall discuss a few such methods.

Method of Moments (MM)
In this method, given an i.i.d. sample $Y_1,Y_2,\ldots,Y_n$ from the population, one equates the sample moment(s) with the theoretical population moment(s), which are typically some functions of the original population parameters of interest $\theta$. If $\theta$ has $k$ components one typically considers the first $k$ raw or central moments, whichever is convenient. Let $\mu_r$ denote the $r$-th raw moment in the population and $\overline{Y^r}$ denote the corresponding $r$-th raw moment in the sample. Then


$$\mu_r = \int_{\Re}y^r f(y|\theta)\,dy = g_r(\theta)\ \text{(say)} \qquad\text{and}\qquad \overline{Y^r} = \frac{1}{n}\sum_{i=1}^n Y_i^r.$$

Now MM requires one to form the system of $k$ equations
$$g_r(\theta) = \overline{Y^r} \quad\text{for } r=1,2,\ldots,k,$$
which are then solved for $\theta$ to obtain the MM estimator $\hat\theta_M$ of $\theta$.

Example 11: Let $Y_1,Y_2,\ldots,Y_n$ be i.i.d. Gamma$(\alpha,\lambda)$ with population p.d.f. $\frac{\lambda^\alpha}{\Gamma(\alpha)}y^{\alpha-1}e^{-\lambda y}$. Now, as can be immediately seen, the method of ML would require messing around with the di-gamma function (the derivative of the $\log\Gamma(\cdot)$ function), which is not exactly a routine numerical task. But by appealing to the method of moments we immediately observe that the population mean, or the first raw moment, equals $\alpha/\lambda$, and the population variance, or the second central moment, equals $\alpha/\lambda^2$. Equating these two to their respective (unbiased) sample counterparts $\bar Y = \frac{1}{n}\sum_{i=1}^n Y_i$ and $s^2_{n-1}=\frac{1}{n-1}\sum_{i=1}^n(Y_i-\bar Y)^2$ and solving for $(\alpha,\lambda)$, we immediately obtain the MM estimates of $(\alpha,\lambda)$ as $\hat\alpha_M=\bar Y^2/s^2_{n-1}$ and $\hat\lambda_M=\bar Y/s^2_{n-1}$. □
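
In code, Example 11 amounts to nothing more than the following (an illustrative sketch; the simulated data and the names are arbitrary choices):

import numpy as np

rng = np.random.default_rng(7)
alpha0, lam0 = 3.0, 2.0
y = rng.gamma(shape=alpha0, scale=1 / lam0, size=1000)   # numpy parameterises the Gamma by scale = 1/lambda

ybar = y.mean()
s2 = y.var(ddof=1)            # unbiased sample variance s^2_{n-1}
alpha_mm = ybar**2 / s2       # alpha_hat_M = Ybar^2 / s^2_{n-1}
lam_mm = ybar / s2            # lambda_hat_M = Ybar / s^2_{n-1}
print(alpha_mm, lam_mm)       # should be close to (3, 2) for a sample this large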

Now let us address the issue of the optimality of MM. By the law of large numbers, $\overline{Y^r}\xrightarrow{P}\mu_r$. Now if the $g_r(\cdot)$ functions are such that $(g_1(\theta),\ldots,g_k(\theta))\longleftrightarrow(\theta_1,\ldots,\theta_k)$ is one-to-one with a continuous inverse (as in the Gamma example above), then $\hat\theta_M$ is a consistent estimator of $\theta$. Furthermore, the $\overline{Y^r}$'s are unbiased estimators of the $\mu_r$'s. (But this does not imply that $\hat\theta_M$ is unbiased for $\theta$, unless of course the $g_r(\cdot)$'s are linear, which is extremely rare.) Other than these, there is no other compelling reason to use MM. In fact, MM estimators are in general less efficient than the MLE, where the notion of efficiency was very briefly touched upon in §2.1.2. A small simulation illustrating this for the Gamma model is sketched below.
