
Notes on Mathematical Statistics and Data Analysis, 3e, by J.A. Rice (2006)

YAP Von Bing, Statistics and Applied Probability, NUS

November 18, 2011

1 Introduction

These notes were the result of a course on Mathematical Statistics for year-two undergraduates at the National University of Singapore offered in Semester 1 (Fall) 2011. Topics covered were

• Selected items in chapters 1-6, on probability

• Chapter 7 Survey sampling: up to, but excluding 7.4 ratio estimation.

• Chapter 8 Estimation: except for 8.6 Bayesian estimation.

• Chapter 9 Testing hypotheses: up to, but excluding 9.7 hanging rootograms.

The materials here are meant to enhance the very readable and useful book. Most of them were implemented in the course, some in a less polished state initially than presented here, but eventually tidied up. The others were not implemented, but I think are good ideas for a future course.

2 Notation

The following are some rules:

1. A random variable is denoted by a capital Roman letter, and the corresponding small letter denotes its realisation. In Example A (page 213), there is “X̄ = 938.5”. By the rule, “x̄ = 938.5” is preferable. The phrase “random variable equals constant” is restricted to probability statements, like P(X = 1).

2. An estimator of a parameter θ is a random variable, denoted by θ̂. On page 262, the parameter is λ0, but the estimator is written λ̂. By the rule, the estimator should be λ̂0. A realisation of θ̂ is called an estimate, which is not denoted by θ̂. In Example A (page 261), the estimator of λ is λ̂ = X̄. Later, there is “λ̂ = 24.9”. This would be rephrased as something like “24.9 is an estimate of λ”.


3 Sampling

3.1 With replacement, then without

While simple random sampling from a finite population is a very good starting point, the dependence among the samples gets in the way of variance calculations. This is an obstacle to comprehending the standard error (SE). Going first with sampling with replacement, there are independent and identically distributed (IID) random variables X1, . . . , Xn with expectation µ and variance σ², both unknown constants. The sample mean X̄ is an estimator of µ, i.e., its realisation x̄ is an estimate of µ. The actual error in x̄ is unknown, but is around SD(X̄), which is σ/√n. This is an important definition:

SE(x̄) := SD(X̄)

Now σ² has to be estimated using the data x1, . . . , xn. A reasonable estimate is v, the realisation of the sample variance

V = (1/n) ∑_{i=1}^{n} (Xi − X̄)²

Since E(V) = (n − 1)σ²/n, an unbiased estimator of σ² is

S² = nV/(n − 1)

The approximation of the SE by √(v/n) or s/√n is called the bootstrap. After the above is presented, we may go back to adjust the formulae for sampling without replacement from a population of size N. The estimator is still X̄, but the SE should be multiplied by

√(1 − (n − 1)/(N − 1))

which is important if n is large relative to the population size. The unbiased estimator of σ² is (N − 1)S²/N, which matters only if N is small.

The random variable V is usually written as σ̂². I reserve σ̂² for a generic estimator of σ². For example, using the method of moments, µ̂ = X̄ and σ̂² = V.
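
As a small illustration of these formulae, here is a Python sketch (the population and sample are hypothetical simulated data; only numpy is assumed):

    import numpy as np

    rng = np.random.default_rng(0)
    N = 1000                                           # hypothetical population size
    population = rng.exponential(scale=2.0, size=N)
    n = 50
    x = rng.choice(population, size=n, replace=False)  # a simple random sample

    xbar = x.mean()                                    # estimate of mu
    v = x.var(ddof=0)                                  # realisation of V (divisor n)
    s2 = n * v / (n - 1)                               # realisation of S^2 = nV/(n - 1)

    se = np.sqrt(s2 / n)                               # bootstrap SE, sampling with replacement
    fpc = np.sqrt(1 - (n - 1) / (N - 1))               # finite population correction
    print(xbar, se, se * fpc)                          # last value: SE without replacement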

3.2 S² or V?

Let X1, . . . , Xn be IID with mean µ and variance σ², both unknown. Consider confidence intervals for µ. S² must be used when the t distribution applies, i.e., when the X’s are normally distributed. Otherwise, n should be large enough for the normal approximation to be good, in which case there is little difference between the two.

For the calculation of the SE, S² is more accurate.

3.3 Sampling without replacement

Let X1, . . . , Xn be a simple random sample from a population of size N with variance σ². Lemma B (page 207) states: if i ≠ j, then cov(Xi, Xj) = −σ²/(N − 1). The proof assumes, without justification, that the X’s are exchangeable, or more specifically, that the conditional distribution of Xj given Xi is the same as that of X2 given X1.
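
The covariance formula itself is easy to check by simulation; a minimal sketch (hypothetical population, numpy only):

    import numpy as np

    rng = np.random.default_rng(0)
    N = 20
    population = rng.normal(size=N)
    sigma2 = population.var(ddof=0)            # population variance (divisor N)

    # Draw many simple random samples of size 2 and estimate cov(Xi, Xj).
    reps = 100_000
    pairs = np.array([rng.choice(population, size=2, replace=False)
                      for _ in range(reps)])
    cov_hat = np.cov(pairs[:, 0], pairs[:, 1])[0, 1]

    print(cov_hat, -sigma2 / (N - 1))          # the two values should be close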


3.4 Systematic sampling

On page 238, systematic sampling is defined. Essentially, the population is arranged in a rectangle. A row (or column) is chosen at random (equal probability) as the sample. It is said that periodic structure can cause bias. This statement cannot apply to the sample mean, which is unbiased for the population mean. Perhaps the point is: its variance is larger than that for a simple random sample of the same size.
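
To make the variance point concrete, here is a sketch under one set of assumptions: a hypothetical population laid out in a 100 × 10 rectangle whose values depend mainly on the column, so that taking a random column as the sample (the systematic scheme, when the listing runs along the rows) is much noisier than a simple random sample of the same size, even though both sample means are unbiased (numpy only):

    import numpy as np

    rng = np.random.default_rng(0)
    rows, cols = 100, 10
    N, n = rows * cols, rows                        # population size and sample size

    # Periodic structure: the value of a unit depends mainly on its column.
    col_effect = 10 * np.sin(2 * np.pi * np.arange(cols) / cols)
    population = col_effect + rng.normal(scale=1.0, size=(rows, cols))

    sigma2 = population.var(ddof=0)

    # Systematic sampling: one of the 10 columns, each with probability 1/10.
    col_means = population.mean(axis=0)
    var_systematic = col_means.var(ddof=0)          # exact variance of the sample mean

    # Simple random sample of the same size n, without replacement.
    var_srs = sigma2 / n * (N - n) / (N - 1)

    print(var_systematic, var_srs)                  # systematic is far larger here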

4 Estimation

4.1 Motivation

The main idea is a natural extension from sampling with replacement. IID random variables X1, . . . , Xn are used to construct an estimator θ̂, a random variable, for a parameter θ, an unknown constant. Given data, i.e., realisations x1, . . . , xn, the corresponding realisation of θ̂ is an estimate of θ. The actual error in the estimate is unknown, but is roughly SD(θ̂). General definition:

SE(estimate) := SD(estimator)

The SE often involves some unknown parameters. Obtaining an approximate SE, by replacing the parameters by their estimates, is called the bootstrap.

4.2 Sample size

It is important to have a strict definition of sample size:

sample size = the number of IID data

The sample size is usually denoted by n, but here is a confusing example. Let X ∼ Binomial(n, p). The sample size is 1, not n (it would be even more confusing to write “n is 1, not n”). That X is equivalent to n IID Bernoulli(p) random variables is the job of sufficiency.

4.3 What does θ mean?

Often, θ denotes two things, both in the parameter space Θ: (a) a particular unknown constant, (b) a generic element. This double-think is quite unavoidable in maximum likelihood, but for method of moments (MOM), it can be sidestepped by keeping Θ implicit. For example, the estimation problem on pages 255-257 can be stated as follows: For n = 1,207, assume that X1, . . . , Xn are IID Poisson(λ) random variables, where λ is an unknown positive constant. Given data x1, . . . , xn, estimate λ and find an approximate SE. The MOM estimator λ̂ is constructed without referring to the parameter space: the positive real numbers.

4.4 Alpha particle emissions (page 256)

From the book’s table, 10220/1207 ≈ 8.467, not close to the reported 8.392. In Berk (1966), the first three categories (0, 1, 2 emissions) were reported as 0, 4, 14, and the average was 8.3703. Since 8.3703 × 1207 = 10102.95, the total number of emissions is most likely 10103. Combined with the reported s² = 8.6592, the sum of squares is close to 95008.54, likely 95009. Going through all possible counts for 17, 18, 19, 20 emissions, the complete table is reconstructed as Table 1.


Emissions  Frequency
0          0
1          4
2          14
3          28
4          56
5          105
6          126
7          146
8          164
9          161
10         123
11         101
12         74
13         53
14         23
15         15
16         9
17         3
18         0
19         1
20         1
Total      1207

Table 1: Reconstructed Table 10 A of Berk (1966). x̄ = 8.3703, s² = 8.6596.
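
A quick check of the reconstruction, together with the resulting estimate of λ and its approximate SE (numpy only):

    import numpy as np

    emissions = np.arange(21)
    freq = np.array([0, 4, 14, 28, 56, 105, 126, 146, 164, 161, 123,
                     101, 74, 53, 23, 15, 9, 3, 0, 1, 1])

    n = freq.sum()                                          # 1207
    xbar = (emissions * freq).sum() / n                     # 8.3703...
    s2 = ((emissions - xbar) ** 2 * freq).sum() / (n - 1)   # 8.6596...

    # MOM (and ML) estimate of lambda with the usual plug-in SE, since var = lambda.
    lam_hat, se = xbar, np.sqrt(xbar / n)                   # SE about 0.083
    print(n, xbar, s2, lam_hat, se)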

4.5 Rainfall (pages 264-265)

Let X1, . . . , X227 be IID gamma(α, λ) random variables, where the shape α and the rate λ are unknown constants. The MOM estimator of (α, λ) is

(α̂, λ̂) = (X̄²/V, X̄/V)

The SEs, SD(α̂) and SD(λ̂), have no known algebraic expressions, but can be estimated by Monte Carlo, if (α, λ) is known. It is unknown, estimated as (0.38, 1.67). The bootstrap assumes

SD(α̂) ≈ SD(0̂.38), SD(λ̂) ≈ SD(1̂.67)

where (0̂.38, 1̂.67) is the MOM estimator of (0.38, 1.67). This odd-looking notation is entirely consistent. To get a realisation of 0̂.38, generate x1, . . . , x227 from the gamma(0.38, 1.67) distribution and compute x̄²/v.

The double-act of Monte Carlo and bootstrap can be used to assess bias. Define

bias(α̂) = E(α̂) − α

By the bootstrap,

E(α̂) − α ≈ E(0̂.38) − 0.38

which can be estimated by Monte Carlo. The bias in (0.38, 1.67) is around (0.10, 0.02). The maximum likelihood estimates have smaller bias and SE.
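
A minimal Python sketch of this Monte Carlo/bootstrap double-act (the seed and the number of replications are arbitrary; numpy only):

    import numpy as np

    rng = np.random.default_rng(0)
    n, reps = 227, 10_000
    alpha0, lam0 = 0.38, 1.67                        # the estimates, used as the "truth"

    def mom(x):
        """Method-of-moments estimates (shape, rate) for the gamma."""
        xbar, v = x.mean(), x.var(ddof=0)
        return xbar ** 2 / v, xbar / v

    # Each replication: generate a gamma(0.38, 1.67) sample of size 227, re-estimate.
    est = np.array([mom(rng.gamma(shape=alpha0, scale=1 / lam0, size=n))
                    for _ in range(reps)])

    se = est.std(axis=0, ddof=1)                        # bootstrap SEs of (alpha-hat, lambda-hat)
    bias = est.mean(axis=0) - np.array([alpha0, lam0])  # bootstrap estimates of the bias
    print(se, bias)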


4.6 Maximum likelihood

Unlike MOM, in maximum likelihood (ML) the two meanings of θ clash, like an actor playing two characters who appear in the same scene. The safer route is to attach the subscript 0 to the unknown constant. This is worked out in the Poisson case.

Let X1, . . . , Xn be IID Poisson(λ0) random variables, where λ0 > 0 is an unknown constant. The aim is to estimate λ0 using realisations x1, . . . , xn. Let λ > 0. The likelihood at λ is the probability of the data, supposing that they come from Poisson(λ):

L(λ) = Pr(X1 = x1, . . . , Xn = xn) = ∏_{i=1}^{n} λ^{xi} e^{−λ}/xi!

On (0, ∞), L has a unique maximum at x̄; x̄ is the ML estimate of λ0. The ML estimator is λ̂0 = X̄.
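
A quick check, assuming ∑ xi > 0: the log-likelihood is ℓ(λ) = (∑ xi) log λ − nλ − ∑ log xi!, so ℓ′(λ) = (∑ xi)/λ − n, which is positive for λ < x̄ and negative for λ > x̄.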

The distinction between λ0 and λ helps make the motivation of maximum likelihood clear. However, for routine use, λ0 is clunky (I have typed tutorial problems where the subscript appears in the correct places; it is hard on the eyes). The following is more economical, at the cost of some mental gymnastics.

Let X1, . . . , Xn be IID Poisson(λ) random variables, where λ > 0 is an unknown constant. [Switch.] The likelihood function on (0, ∞) is

L(λ) = Pr(X1 = x1, . . . , Xn = xn) = ∏_{i=1}^{n} λ^{xi} e^{−λ}/xi!

which has a unique maximum at x̄. [Switch back.] The ML estimator of λ is λ̂ = X̄.

With some practice, it gets more efficient. The random likelihood function is a product of the respective random densities:

L(λ) = ∏_{i=1}^{n} λ^{Xi} e^{−λ}/Xi!

Since L has a unique maximum at X̄, λ̂ = X̄.

The distinction between a generic θ and a fixed unknown θ0 is useful in the motivation and in the proof of large-sample properties of the ML estimator (pages 276-279). Otherwise, θ0 can be quietly dropped without confusion.

4.7 Hardy-Weinberg equilibrium (pages 273-275)

By the Hardy-Weinberg equilibrium, the proportions of the population having 0, 1 and 2 a alleles are respectively (1 − θ)², 2θ(1 − θ) and θ². Hence the number of a alleles in a random individual has the binomial(2, θ) distribution; so the total number of a alleles in a simple random sample of size n ≪ N has the binomial(2n, θ) distribution.

Let X1, X2, X3 be the number of individuals of genotype AA, Aa and aa in the sample. The number of a alleles in the sample is X2 + 2X3 ∼ binomial(2n, θ), so

var((X2 + 2X3)/(2n)) = θ(1 − θ)/(2n)

The Monte Carlo is unnecessary. The variance can also be obtained directly using the expectation and variance of the trinomial distribution, but the binomial route is much shorter.

Since there is no cited work, it is not clear whether the data were a simple random sample of the Chinese population of Hong Kong in 1937. If not, the trinomial distribution is in doubt.


4.8 MOM on the multinomial

Let X = (X1, . . . , Xr) have the multinomial(n, p) distribution, where p is an unknown fixed probability vector. The first moment µ1 = np. Since the sample size is 1, the estimator of µ1 is X. The MOM estimator of p is X/n. This illustrates the definition of sample size.

Let r = 3. Under Hardy-Weinberg equilibrium, p = ((1 − θ)², 2θ(1 − θ), θ²). There are three different MOM estimators of θ.
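
To spell this out: matching the first and third cell proportions gives θ̂ = 1 − √(X1/n) and θ̂ = √(X3/n) respectively, while matching the middle cell means solving the quadratic 2θ(1 − θ) = X2/n.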

4.9 Confidence interval for normal parameters (page 280)

Let x1, . . . , xn be realisations from IID random variables X1, . . . , Xn with normal(µ, σ²) distribution, where µ and σ² are unknown constants. The (1 − α)-confidence interval for µ,

(x̄ − (s/√n) t_{n−1}(α/2), x̄ + (s/√n) t_{n−1}(α/2)),

is justified by the fact that

P(X̄ − (S/√n) t_{n−1}(α/2) ≤ µ ≤ X̄ + (S/√n) t_{n−1}(α/2)) = 1 − α

Likewise,

P(nV/χ²_{n−1}(α/2) ≤ σ² ≤ nV/χ²_{n−1}(1 − α/2)) = 1 − α

justifies the confidence interval for σ²:

(nv/χ²_{n−1}(α/2), nv/χ²_{n−1}(1 − α/2))

This illustrates the superiority of V over σ̂².
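
Both intervals take only a few lines in Python (hypothetical data; scipy is assumed for the t and χ² quantiles):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    x = rng.normal(loc=10.0, scale=3.0, size=25)         # hypothetical data
    n, alpha = len(x), 0.05

    xbar = x.mean()
    v = x.var(ddof=0)                                    # realisation of V
    s = np.sqrt(n * v / (n - 1))                         # realisation of S

    t = stats.t.ppf(1 - alpha / 2, df=n - 1)             # t_{n-1}(alpha/2), upper point
    ci_mu = (xbar - s / np.sqrt(n) * t, xbar + s / np.sqrt(n) * t)

    chi_upper = stats.chi2.ppf(1 - alpha / 2, df=n - 1)  # chi^2_{n-1}(alpha/2)
    chi_lower = stats.chi2.ppf(alpha / 2, df=n - 1)      # chi^2_{n-1}(1 - alpha/2)
    ci_sigma2 = (n * v / chi_upper, n * v / chi_lower)

    print(ci_mu, ci_sigma2)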

4.10 The Fisher information

Let X have density f(x|θ), θ ∈ Θ, an open subset of R^p. The Fisher information in X is defined as

I(θ) = −E[∂²/∂θ² log f(X|θ)]

When the above exists, it equals the more general definition

I(θ) = E{[∂/∂θ log f(X|θ)]²}

In all the examples, and most applications, they are equivalent, and the first definition is easier to use.
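
A standard example: for X ∼ Poisson(λ), log f(X|λ) = X log λ − λ − log X!, so ∂²/∂λ² log f(X|λ) = −X/λ², and I(λ) = E(X)/λ² = 1/λ.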


4.11 Asymptotic normality of ML estimators

Let θ̂ be the ML estimator of θ based on IID random variables X1, . . . , Xn with density f(x|θ), where θ is an unknown constant. Let I(θ) be the Fisher information in X1. Under regularity conditions, as n → ∞,

√(nI(θ)) (θ̂ − θ) → N(0, 1)

Note that I(θ) is the Fisher information of a single unit of the IID X’s, not that of all of them, which is n times larger.

For large n, approximately

θ̂ ∼ N(θ, I(θ)^{−1}/n)

I(θ)^{−1} plays a role similar to that of the population variance in sampling.
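
For example, for IID Poisson(λ) data, I(λ) = 1/λ, so λ̂ = X̄ is approximately N(λ, λ/n) for large n; this agrees with the central limit theorem, since var(X1) = λ.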

4.12 Asymptotic normality: multinomial, regression

This is essentially the content of the paragraph before Example C on page 283. Let X = (X1, . . . , Xr) have a multinomial(n, p(θ)) distribution, where θ is an unknown constant. Let I(θ) be the Fisher information. Then as n → ∞,

√(I(θ)) (θ̂ − θ) → N(0, 1)

Proof: X has the same distribution as the sum of n IID random vectors with the multinomial(1, p(θ)) distribution. The ML estimator of θ based on these data is θ̂ (by sufficiency or direct calculation). The Fisher information of a multinomial(1, p(θ)) random vector is I(θ)/n. Now apply the result in the previous section.

If the X’s are independent but not identically distributed, like in regression, asymptotic normality of the MLEs does not follow from the previous section. A new definition of sample size and a new theorem are needed.

4.13 Characterisation of sufficiency

Let X1, . . . , Xn be IID random variables with density f(x|θ), where θ ∈ Θ. Let T be a function of the X’s. The sample space S is a disjoint union of St across all possible values t, where

St = {x : T(x) = t}

The conditional distribution of X given T = t depends on θ in general. If, for every t, it is the same for every θ ∈ Θ, then T is sufficient for θ. This gives a characterisation: T is sufficient for θ if and only if there is a function q(x) such that for every t and every θ ∈ Θ,

f(x|t) = q(x), x ∈ St

This clarifies the proof of the factorisation theorem (page 307).

Logically, sufficiency is a property of Θ. “T is sufficient for Θ” is more correct.
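
For example, let X1, . . . , Xn be IID Poisson(λ) and T = X1 + · · · + Xn. Given T = t, the conditional distribution of (X1, . . . , Xn) is multinomial(t, (1/n, . . . , 1/n)) whatever λ is, so T is sufficient, with q(x) = t!/(x1! · · · xn! n^t) on St.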


4.14 Problem 4 (page 313)

Let X1, . . . , X10 be IID with distribution P(X = 0) = 2θ/3, P(X = 1) = θ/3, P(X = 2) = 2(1 − θ)/3, P(X = 3) = (1 − θ)/3, where θ ∈ (0, 1) is an unknown constant. The MOM estimator of θ is

θ̃ = 7/6 − X̄/2

The ML estimate based on the given numerical data can be readily found. That of general x1, . . . , x10, or the ML estimator, is hard because the distribution of X1 is not in the form f(x|θ). One possibility is

f(x|θ) = (2θ/3)^{y0} (θ/3)^{y1} (2(1 − θ)/3)^{y2} ((1 − θ)/3)^{y3}, yj = 1{x=j}

It follows that the ML estimator is

θ̂ = Y/n

where Y is the number of times 0 or 1 turns up.

This is an interesting example. θ̃ and θ̂ are different; both are unbiased. θ̂ is efficient, but not θ̃:

var(θ̂) = θ(1 − θ)/n, var(θ̃) = 1/(18n) + θ(1 − θ)/n

They illustrate relative efficiency more directly than on pages 299-300, and the sample size interpretation of relative efficiency. Y is sufficient for θ, so E(θ̃|Y) has smaller variance than θ̃. Another clue to the superiority of ML: if 0, 1, 2, 3 are replaced by any other four distinct numbers, the ML estimator stays the same, but the MOM estimator will be fooled into adapting. Fundamentally, the distribution is multinomial.
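
A small simulation checking the unbiasedness and the two variance formulae (the values of θ, n and the number of replications are arbitrary; numpy only):

    import numpy as np

    rng = np.random.default_rng(0)
    theta, n, reps = 0.4, 10, 200_000
    p = [2 * theta / 3, theta / 3, 2 * (1 - theta) / 3, (1 - theta) / 3]

    x = rng.choice(4, size=(reps, n), p=p)       # reps samples of size n from {0, 1, 2, 3}
    theta_mom = 7 / 6 - x.mean(axis=1) / 2       # realisations of theta-tilde
    theta_ml = (x <= 1).mean(axis=1)             # realisations of theta-hat = Y/n

    print(theta_mom.mean(), theta_ml.mean())     # both close to theta
    print(theta_mom.var(), 1 / (18 * n) + theta * (1 - theta) / n)
    print(theta_ml.var(), theta * (1 - theta) / n)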

5 Hypothesis tests

Let X = (X1, . . . , Xr) be multinomial(n, p(θ)), where θ ∈ Θ, an open subset of R. Let θ̂ be the ML estimator of θ based on X. Then the null distribution of Pearson’s

X² = ∑_{i=1}^{r} [Xi − n pi(θ̂)]²/(n pi(θ̂))

is asymptotically χ²_{r−2}. Let X be obtained from a Poisson distribution, so that θ = λ ∈ (0, ∞). The above distribution holds for λ̂ based on X. Example B (page 345) assumes it also holds for λ̂ = X̄, the ML estimator based on the raw data, before collapsing. This is unsupported. An extreme counter-example: collapse Poisson(λ) to the indicator of 0, i.e., the data are binomial(n, e^{−λ}). Since r = 2, theory predicts X² ≡ 0, as can be verified directly, but plugging λ̂ = X̄ into X² gives a positive random variable. Using R, the correct ML estimate of λ in Example B is 2.39, giving X² = 79.2.
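
The correct calculation takes only a few lines. A sketch in Python (the collapsed counts below are hypothetical placeholders, not the data of Example B; scipy is assumed):

    import numpy as np
    from scipy import stats
    from scipy.optimize import minimize_scalar

    counts = np.array([20, 30, 25, 12, 8, 3, 2])    # hypothetical cells: 0, 1, ..., 5 and >= 6
    k = len(counts) - 1                             # the last cell collects k, k+1, ...

    def cell_probs(lam):
        p = stats.poisson.pmf(np.arange(k), lam)
        return np.append(p, stats.poisson.sf(k - 1, lam))   # P(X >= k) for the last cell

    def neg_loglik(lam):
        return -np.sum(counts * np.log(cell_probs(lam)))

    # ML estimate of lambda based on the collapsed (multinomial) data.
    lam_hat = minimize_scalar(neg_loglik, bounds=(0.01, 50), method="bounded").x

    # Pearson's X^2 with this estimate has an asymptotic chi^2 distribution on r - 2 df;
    # plugging in the raw-data estimate (the sample mean) instead would not.
    expected = counts.sum() * cell_probs(lam_hat)
    X2 = np.sum((counts - expected) ** 2 / expected)
    print(lam_hat, X2, stats.chi2.sf(X2, df=len(counts) - 2))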

6 Minor typographical errors

• Page A39: In the solution to Problem 55 on page 324, the asymptotic variance has 1 + 2θ in the denominator, not 1 + θ. Refer to Fisher (1970) page 313.

• Page A26: Bliss and Fisher (1953) is on pages 176-200.
