Durham Maths Question Set – Stats Concepts II 13-14 (All Merged)


Question set and solutions for the Durham Stats Concepts II course as set in 2013–2014.


Statistical Concepts 2013/14 – Sheet 0 – Probability Revision

For this course, it is crucial that you have excellent knowledge of the material presented in the first-year Probability course. These exercises are strongly recommended. You may wish to go over the lecture notes and summary handouts of the Probability course if you struggle with some of these exercises. Some of these may be discussed in class, and full solutions will be handed out.

1. I have ten coins in my pocket. Nine of them are ordinary coins with equal chances of coming up head and tail when tossed, and one has two heads.

(a) If I take one of the coins at random from my pocket, what is the probability that it is the coin with two heads?

(b) If I toss the coin and it comes up heads, what is the probability that it is the coin with two heads?

(c) If I toss the coin a further n times and it comes up heads every time, what is the probability that it is the coin with two heads?

(d) If I toss the coin one further time and it comes up tails, what is the probability that it is one of the nine ordinary coins?

2. Let U be a random quantity which has probability density function

f(u) = 1 for u ∈ [0, 1], and f(u) = 0 otherwise.

Calculate

(a) P [U ≥ 0.3]

(b) E [U ]

(c) Var [U ]

(d) E [log(1 + U)]

(e) If the random quantity Y has a uniform distribution on the interval [a, b], express Y in terms of U above and hence find E [Y ] and Var [Y ].

3. Let Y be a random quantity which has a Poisson distribution. Suppose that E [Y ] = 3.

(a) What is Var [Y ]?

(b) Suppose that I take a large number of independent random quantities, each with the same distribution as Y . Why should I suppose that a normal distribution would be a good approximation to the distribution of their average Ȳ?

(c) Use part (b) to calculate an interval which would have approximately a 98% probability of containing Ȳ based on 100 such random quantities.

Page 2: Durham Maths Question Set - Stats Concepts II 13-14 All Merged

4. An entomologist has a large colony of ants which he knows contains just two types, A and B, of similar appearance. By spending some time examining each ant carefully he can discover which type it is, but he decides, on grounds of cost, to use the much quicker method of classifying an ant as type A if its length is less than 8 mm and as type B otherwise.

He knows that lengths of each of the two types of ants are normally distributed: type A with expectation 6.5 mm and standard deviation 0.8 mm, and type B with expectation 9.4 mm and standard deviation 0.9 mm.

What proportion of (i) type A and (ii) type B ants would he misclassify by this method?

If the colony consists of 70% of type A and 30% of type B, what proportion of all ants would he misclassify?

It is thought that the number of ants misclassified may be reduced by choosing a critical point other than 8 mm. Discuss!

5. The mad king has captured Anne, Betty and Charles. He would like to kill them all, but, as he is a fan of probability puzzles, he offers them the following challenge. The following morning, each of the three prisoners will be escorted to a cell. They will each enter the cell simultaneously, but each through a different door. Immediately before entering the cell, a hat will be placed on the head of each prisoner. The colour of the hat will be either red or blue, and the choice for each prisoner will be decided by the flip of a fair coin, independently for each prisoner. As they enter the room, each prisoner will be able to see the colours of the hats of the other two prisoners, but not the colour of their own hat. No communication of any kind is allowed between the prisoners. At the moment that all of the prisoners enter the cell, and observe the colours of their comrades' hats, each may choose either to remain silent or, instantly, to guess the colour of the hat on their head. If at least one prisoner guesses correctly the colour of their hat, and nobody guesses incorrectly, all the prisoners will be set free. Otherwise, they will all be executed.

The prisoners are allowed a meeting beforehand to discuss their strategy. They immediately realise that one possible strategy would be for Anne to guess that her hat was red, and Betty and Charles to stay silent. This strategy gives a probability of 1/2 that all will go free. Is there a better strategy?

Page 3: Durham Maths Question Set - Stats Concepts II 13-14 All Merged

Statistical Concepts 2013/14 – Solutions 0 – Probability Revision

1. (a) Each of the coins is equally likely to be chosen. The ten probabilities must sum to one since exactly one coin must be chosen. Hence, P [coin with two heads] = 1/10.

(b) P [head | coin with two heads] = 1 and P [head | fair coin] = 1/2. Therefore, applying Bayes' theorem,

P [coin with two heads | head] = (1 × 1/10) / (1 × 1/10 + 1/2 × 9/10) = 2/11 ≈ 0.182

(c) Start from the beginning. A total of n + 1 heads have occurred. But

P [n + 1 heads | coin with two heads] = 1

and

P [n + 1 heads | fair coin] = 1/2^(n+1)

Therefore, applying Bayes' theorem as in the previous part,

P [coin with two heads | n + 1 heads] = (1 × 1/10) / (1 × 1/10 + 1/2^(n+1) × 9/10) = 2^(n+1) / (9 + 2^(n+1))

(d) This probability is 1; verify this using Bayes' theorem.
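As a quick numerical check of parts (b) and (c) (an illustration, not part of the original solutions), the posterior can be computed with exact fractions:

```python
from fractions import Fraction

def posterior_two_headed(k):
    """Posterior probability that the chosen coin is the two-headed one,
    after observing k heads in k tosses, with prior 1/10."""
    prior = Fraction(1, 10)
    like_biased = Fraction(1)        # the two-headed coin always shows heads
    like_fair = Fraction(1, 2) ** k  # a fair coin shows k heads with prob (1/2)^k
    num = like_biased * prior
    return num / (num + like_fair * (1 - prior))

# k = 1 reproduces part (b): 2/11; general k reproduces part (c) with k = n + 1.
```

For example, posterior_two_headed(1) returns 2/11, matching part (b).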

2. Using the basic rules for handling probability density functions, expectations and variances, we have:

(a) P [U ≥ 0.3] = ∫_{0.3}^∞ f(u) du = ∫_{0.3}^1 1 du = 0.7.

(b) E [U ] = ∫_{−∞}^∞ u f(u) du = ∫_0^1 u du = 1/2.

(c) E [U²] = ∫_0^1 u² f(u) du = 1/3, so Var [U ] = E [U²] − E [U ]² = 1/3 − 1/4 = 1/12.

(d) E [log(1 + U)] = ∫_0^1 log(1 + u) f(u) du = [(1 + u)(log(1 + u) − 1)]_0^1 = 2(log 2 − 1) − (−1) = 2 log 2 − 1.

(e) It is straightforward to show that Y = a + (b − a)U has the required uniform distribution. Hence, E [Y ] = a + (b − a)/2 = (a + b)/2 and Var [Y ] = (b − a)²/12.

3. Recall that if Y has the Poisson distribution with parameter λ then E [Y ] = λ and Var [Y ] = λ.

(a) Therefore, Var [Y ] = 3.

(b) The central limit theorem says that when Y1, . . . , Yn are independent and identically distributed with mean µ and variance σ², the distribution of √n(Ȳ − µ)/σ for large n will be approximately N(0, 1).

(c) Applying the central limit theorem with σ = √3 ≈ 1.732 and √n = 10, the distribution of Z = 10(Ȳ − 3)/1.732 should be approximately N(0, 1). But from tables of the normal distribution, if Z ∼ N(0, 1) then P [Z ≤ 2.33] = 0.99 and so P [|Z| ≤ 2.33] = 0.98. Therefore, P [|Ȳ − 3| ≤ 0.40] ≈ 0.98; that is, P [Ȳ ∈ [2.60, 3.40]] ≈ 0.98. Thus, [2.60, 3.40] is the required interval.
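As a check on part (c), a short Python sketch (not part of the original solutions) using the standard library's NormalDist; the solutions use the table value 2.33 rather than the exact 0.99 quantile 2.326:

```python
from statistics import NormalDist
import math

lam, n = 3, 100                         # Poisson mean and sample size
sigma = math.sqrt(lam)                  # for a Poisson distribution, variance = mean
z = NormalDist().inv_cdf(0.99)          # a two-sided 98% interval uses the 0.99 quantile
half_width = z * sigma / math.sqrt(n)   # approximately 0.40
interval = (lam - half_width, lam + half_width)   # approximately (2.60, 3.40)
```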

Page 4: Durham Maths Question Set - Stats Concepts II 13-14 All Merged

4. Let Y denote the length of an ant. Then,

P [misclassify an ant | type A] = P [Y > 8 | A] = P [Z > 1.875] ≈ 0.0304

where Z ∼ N(0, 1). Similarly, P [misclassify an ant | type B] = P [Y < 8 | B] = P [Z < −1.556] ≈ 0.0599.

The overall misclassification rate is given by

P [misclassify an ant] = P [misclassify | type A] P [type A] + P [misclassify | type B] P [type B] = 0.0304 × 0.7 + 0.0599 × 0.3 ≈ 0.0392

Intuitively, the cutoff point should be made larger to reduce the overall error rate: this reduces the error rate for the type A ants, the more numerous of the two populations.

[The following remarks go beyond the intuitive explanation suggested above. First, notice that the original cutoff point of 8 mm is slightly larger than the average of the type A and type B expectations (7.95 mm): this is to account for the difference in standard deviations. In general, for two populations with densities f1(y) and f2(y) and associated proportions p1 and p2, it can be shown that the classification regions are determined by the solution(s) to the equation p1 f1(y) = p2 f2(y), and with this choice the overall misclassification rate is minimised. We can see that this might be the case by noticing that pi fi(y) is proportional to the conditional probability that an ant of length y belongs to population i; and an intuitive rule would be to assign such an ant to the population with the largest of these conditional probabilities. This rule generalises to any number of populations, and it turns out that the overall misclassification rate is minimised.]
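The quoted misclassification rates can be reproduced in a few lines of Python (an illustrative sketch using the standard library's NormalDist; figures are rounded as in the solution):

```python
from statistics import NormalDist

type_A = NormalDist(mu=6.5, sigma=0.8)   # lengths of type A ants
type_B = NormalDist(mu=9.4, sigma=0.9)   # lengths of type B ants
cutoff = 8.0                             # classify as A below 8 mm, B above

p_mis_A = 1 - type_A.cdf(cutoff)         # type A classified as B: ~0.0304
p_mis_B = type_B.cdf(cutoff)             # type B classified as A: ~0.0599
overall = 0.7 * p_mis_A + 0.3 * p_mis_B  # ~0.0392
```

Varying `cutoff` makes it easy to explore the trade-off discussed in the bracketed remarks.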

5. Here is the best strategy. Each prisoner does the following. If the hat colours of the other two prisoners are different, they say nothing. If the colours are the same, they guess the opposite colour. This method will be successful unless all of the hats are of the same colour. The chance that all are of the same colour is 1/4. (There is a 1/8 chance that all are red, and a 1/8 chance that all are blue.) Therefore this strategy has probability 3/4 of success.
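Since there are only 8 equally likely hat configurations, the claimed 3/4 success probability can be verified by brute force; here is one possible Python enumeration (an illustration, not part of the original solutions):

```python
from itertools import product

def guess(hats, i):
    """Prisoner i sees the other two hats; guesses the opposite colour if
    they match, otherwise stays silent (returns None)."""
    others = [hats[j] for j in range(3) if j != i]
    if others[0] == others[1]:
        return 1 - others[0]
    return None

wins = 0
for hats in product([0, 1], repeat=3):    # the 8 equally likely configurations
    guesses = [(i, guess(hats, i)) for i in range(3)]
    guesses = [(i, g) for i, g in guesses if g is not None]
    # success: at least one guess, and every guess correct
    if guesses and all(g == hats[i] for i, g in guesses):
        wins += 1
# wins == 6 of the 8 configurations, i.e. success probability 3/4
```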

Page 5: Durham Maths Question Set - Stats Concepts II 13-14 All Merged

2H Statistical Concepts 2013/14 — Sheet 1 — Sampling

1. [Hand in to be marked] This question illustrates sampling distributions and related ideas in the context of a very small population with known x-values. The population comprises 5 individuals with x-values {1, 3, 3, 7, 9}. A sample of size two (resulting in values Y1 and Y2) is drawn at random and without replacement from the population.

(a) What is the value of N? Compute the population mean and the population variance.

(b) What is the value of n? Write down the (sampling) distribution of Y1. What is the (sampling) distribution of Y2?

(c) Derive the exact (sampling) distribution of Ȳ and, in this case, check directly the formulae for E [Ȳ] and Var [Ȳ] given in lectures.

(d) Derive the exact (sampling) distribution for the range of the two sample values (the largest minus the smallest) and show that in this case the sample range is not an unbiased estimator of the population range (the largest minus smallest of the population values). Under what general conditions on the population size and values and the sample size will the sample range be an unbiased estimator for the population range?

2. This is a simple numerical (no context) exercise on some basic things you should have learned so far. A simple random sample (without replacement) of size 25 from a population of size 2000 yielded the following values:

104 109 111 109  87  86  80 119  88 122
 91 103  99 108  96 104  98  98  83 107
 79  87  94  92  97

For the above data, ∑ Yj = 2451 and ∑ Yj² = 243505 (sums over j = 1, . . . , 25).

(a) Calculate an unbiased estimate of the population mean and of the population total.

(b) Calculate unbiased estimates of the population variance and of Var [Ȳ].

(c) Compute (estimated) standard errors for the population mean and for the populationtotal.

3. Among three boys, Andy has 3 sweets, Bill has 4 sweets and Charles has 5 sweets. Among three girls, Doreen has 4 sweets, Eve has 6 sweets and Florence has 8 sweets. One boy is selected at random, with number of sweets B1, and independently, one girl is selected with number of sweets G1. Let D1 = G1 − B1.

(a) Find the sampling distribution of D1 and thus find, directly, the expected value and variance of D1.

(b) Find the expected value and variance of D1 by first finding the corresponding values for G1 and B1, and check that you get the same answers.

(c) A second boy is selected at random from the remaining two boys, with number of sweets B2, and a second girl is selected with number of sweets G2. Let D2 = G2 − B2. Find the sampling distribution of D̄ = (D1 + D2)/2 and thus find the expected value and variance of D̄.

(d) Find the expected value and variance of D̄ using the formulae for E(D1 + D2) and Var(D1 + D2), and check that you get the same answers.

Page 6: Durham Maths Question Set - Stats Concepts II 13-14 All Merged

4. Show that with simple random sampling without replacement from a finite population the random quantity

s²_Ȳ = (s²/n) [1 − n/N]

is an unbiased estimator of Var [Ȳ], where

s² = [1/(n − 1)] ∑_{i=1}^n (Yi − Ȳ)².

[Hint: First show that

∑_{i=1}^n (Yi − Ȳ)² = ∑_{i=1}^n Yi² − nȲ²

and then use the expression for Var [Ȳ] given in lectures in combination with the general result that E [Z²] = Var [Z] + [E [Z]]² for any random quantity Z.]

Page 7: Durham Maths Question Set - Stats Concepts II 13-14 All Merged

Statistical Concepts 2013/14 – Solutions 1 – Sampling

1. (a) N = 5. µ = 23/5 = 4.6, σ² = (149/5) − (4.6)² = 8.64

(b) n = 2. P [Yj = 1] = P [Yj = 7] = P [Yj = 9] = 0.2, P [Yj = 3] = 0.4, j = 1, 2

(c) Possible samples and corresponding values for the sample mean and range are

Sample (y1, y2)  (1,3) (1,3) (1,7) (1,9) (3,3) (3,7) (3,9) (3,7) (3,9) (7,9)
Mean (ȳ)           2     2     4     5     3     5     6     5     6     8
Range (r)          2     2     6     8     0     4     6     4     6     2

Hence, the sampling distribution of Ȳ is

ȳ            2    3    4    5    6    8
P [Ȳ = ȳ]  0.2  0.1  0.1  0.3  0.2  0.1

E [Ȳ] = 2 × 0.2 + · · · + 8 × 0.1 = 4.6 = µ and E [Ȳ²] = 2² × 0.2 + · · · + 8² × 0.1 = 24.4. Thus, Var [Ȳ] = 24.4 − (4.6)² = 3.24. This agrees with

Var [Ȳ] = [(N − n)/(N − 1)] × σ²/n = [(5 − 2)/(5 − 1)] × 8.64/2 = 3.24

(d) The sampling distribution of R is

r           0    2    4    6    8
P [R = r]  0.1  0.3  0.2  0.3  0.1

E [R] = 0 × 0.1 + · · · + 8 × 0.1 = 4 < population range = 9 − 1 = 8

In general, the sample range can never be larger than the population range. Therefore it will be biased if there is a positive probability that the sample range will be smaller than the population range. Therefore there are two situations:

• If the population range is zero (all values in the population are the same), the sample range will always be zero and will be unbiased.

• If the population range is positive: let a denote the maximum value and b the minimum value in the population; the sample range can only be unbiased if every sample must contain at least one a and at least one b, as otherwise there would be positive probability of obtaining a sample with smaller range than the population. The only way to guarantee that both a and b appear in the sample is if n > N − min(Na, Nb), where Nx denotes the number of times the value x appears in the population.
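The whole of parts (c) and (d) can be confirmed by enumerating the C(5, 2) = 10 equally likely samples; a short Python sketch with exact fractions (not part of the original solutions):

```python
from itertools import combinations
from fractions import Fraction

pop = [1, 3, 3, 7, 9]
# treat the two 3s as distinct individuals: all C(5,2) = 10 samples equally likely
samples = list(combinations(range(5), 2))
means = [Fraction(pop[i] + pop[j], 2) for i, j in samples]
ranges = [abs(pop[i] - pop[j]) for i, j in samples]

E_mean = sum(means) / len(means)                              # = 4.6 = mu
Var_mean = sum(m**2 for m in means) / len(means) - E_mean**2  # = 3.24
E_range = Fraction(sum(ranges), len(ranges))                  # = 4 < population range 8
```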

2. n = 25, N = 2000, ∑ Yj = 2451, ∑ Yj² = 243505, so ∑ (Yj − Ȳ)² = 243505 − (2451²/25) = 3208.96

(a) Ȳ = 98.04 is an unbiased estimate of the population mean µ; and T = 2000 × 98.04 = 196080 is an unbiased estimate of the population total τ.

(b) [(2000 − 1)/2000] × [3208.96/(25 − 1)] = 133.64 is an unbiased estimate of the population variance σ².

s²_Ȳ = (s²/n)(1 − n/N) = [3208.96/(25 × 24)] × (1 − 25/2000) ≈ 5.28, an unbiased estimate of Var [Ȳ].

(c) The estimated SE of Ȳ as an estimate of the population mean µ is s_Ȳ = 2.298; and the estimated SE of T as an estimate of the population total τ is 2000 times this; namely, s_T = 4596.
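All of the figures in this solution follow directly from the two sums; a Python sketch as a numerical check (not part of the original solutions):

```python
import math

n, N = 25, 2000
sum_y, sum_y2 = 2451, 243505

ybar = sum_y / n                         # 98.04, estimate of the mean
total = N * ybar                         # 196080, estimate of the total
ss = sum_y2 - sum_y**2 / n               # 3208.96, corrected sum of squares
s2 = ss / (n - 1)
var_pop_hat = (N - 1) / N * s2           # ~133.64, estimate of sigma^2
s2_ybar = s2 / n * (1 - n / N)           # ~5.28, estimate of Var[Ybar]
se_mean = math.sqrt(s2_ybar)             # ~2.298
se_total = N * se_mean                   # ~4596
```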

3. (a) The possible values of D1 are −1, 0, 1, 1, 2, 3, 3, 4, 5, each with probability 1/9. Therefore E(D1) = (1 + 0 − 1 + 3 + 2 + 1 + 5 + 4 + 3)/9 = 2, and Var(D1) = E[(D1 − 2)²] = (1 + 4 + 9 + 1 + 0 + 1 + 9 + 4 + 1)/9 = 10/3.

(b) E(B1) = (3 + 4 + 5)/3 = 4 and E(G1) = (4 + 6 + 8)/3 = 6, so

E(G1 − B1) = E(G1) − E(B1) = 6 − 4 = 2.

Var(B1) = E[(B1 − 4)²] = 2/3 and Var(G1) = E[(G1 − 6)²] = 8/3,

so, as G1 and B1 are independent,

Var(G1 − B1) = Var(G1) + Var(B1) = 10/3.

(c) If we choose two boys and two girls, then we leave one boy and one girl behind. Call their values B3 and G3, with D3 = G3 − B3. As D1 + D2 + D3 = 6, we have D̄ = (6 − D3)/2. D3 has the same distribution as D1, so the possible values of D̄ are 7/2, 6/2, 5/2, 5/2, 4/2, 3/2, 3/2, 2/2, 1/2, each with probability 1/9. So, we can find E(D̄) and Var(D̄) directly from this distribution, or from

E(D̄) = (6 − E(D3))/2 = (6 − 2)/2 = 2,

Var(D̄) = Var((6 − D3)/2) = Var(D3)/4 = 10/12.

(d) E(D̄) = (1/2)(E(D1) + E(D2)) = 2 and

Var(D̄) = (1/4)Var(D1 + D2) = (1/4)(Var(D1) + Var(D2) + 2 Cov(D1, D2)).

We have Var(D1) = Var(D2) = 10/3 and

Cov(D1, D2) = Cov(G1 − B1, G2 − B2) = Cov(G1, G2) + Cov(B1, B2),

as G1, G2 are independent of B1, B2, so Cov(B1, G2) = Cov(G1, B2) = 0. From results in lectures, we have that the covariance between any two values sampled without replacement from a population is minus the variance of a single sample, divided by one less than the population size, so that

Cov(G1, G2) = −Var(G1)/(3 − 1) = −8/6 and Cov(B1, B2) = −Var(B1)/(3 − 1) = −2/6,

giving Cov(D1, D2) = −8/6 − 2/6 = −10/6, so

Var(D̄) = (1/4)(10/3 + 10/3 − 20/6) = 10/12

4. (n − 1)s² = ∑_{j=1}^n (Yj − Ȳ)² = ∑_{j=1}^n (Yj² − 2Ȳ Yj + Ȳ²) = ∑_{j=1}^n Yj² − 2Ȳ ∑_{j=1}^n Yj + nȲ² = ∑_{j=1}^n Yj² − 2Ȳ × nȲ + nȲ² = ∑_{j=1}^n Yj² − nȲ².

In what follows use (i) E [Y²] = Var [Y ] + (E [Y ])² for any random quantity Y , and (ii) Var [Ȳ] = [(N − n)/(N − 1)] × σ²/n. Then

(n − 1)E [s²] = ∑_{j=1}^n E [Yj²] − nE [Ȳ²]
= ∑_{j=1}^n (σ² + µ²) − n[µ² + ((N − n)/(N − 1)) × σ²/n]
= nσ² [1 − (N − n)/(n(N − 1))]
= [(n − 1)N/(N − 1)] σ²

Therefore, E [s²] = [N/(N − 1)] σ². Hence

E [s²_Ȳ] = (1/n)[1 − n/N] E [s²] = (1/n)[1 − n/N] × [N/(N − 1)] σ² = [(N − n)/(N − 1)] × σ²/n = Var [Ȳ]

Hence, s²_Ȳ is an unbiased estimator of Var [Ȳ].


2H Statistical Concepts 2013/14 – Sheet 2 – Estimators and Confidence Intervals

1. [Hand in to be marked] At the time of a historic potential challenge for the leadership of the Conservative party (the “stalking horse” affair where Sir Anthony Meyer challenged Mrs Thatcher for the leadership of the Conservative party), the Independent newspaper performed an opinion poll to assess the level of support for Mrs Thatcher. They asked 150 of the 377 Conservative MPs whether or not they felt it was time for a change of leader and used the results to draw conclusions about the level of support for Mrs Thatcher in the whole of the parliamentary party.

Supposing the actual level of support to be 40% among the 377 Conservative MPs, (i) calculate the standard deviation of the proportion of the sample supporting Mrs Thatcher, assuming simple random sampling; and (ii) using the Central Limit Theorem, estimate the chance that the level of support in a sample of size 150 will be in error by more than 1%.

Suppose that 50 in the sample of 150 said they supported Mrs Thatcher. Compute an approximate 95% confidence interval for the percentage support (without assuming that the actual level of support is 40%). Discuss whether or not this interval is consistent with an actual level of support of 40%.

2. In a private library the books are kept on 130 shelves of similar size. The numbers of books on 15 shelves selected at random were found to have a sum of 381 and a sum of squares of 9947. Estimate the total number of books in the library and provide an estimated standard error for your estimate. Give an approximate 95% confidence interval for the total number of books in the library, and comment on the reliability of the approximation in this instance.

3. In auditing, the following sampling method is sometimes used to estimate the total unknown value α = a1 + a2 + · · · + aN of an inventory of N items, where ai is the (as yet unknown) “audit value” of item i and, as is often the case, a “book value” bi of each item i = 1, . . . , N is readily available. [Think of second-hand cars with their published “blue book” values, or stamps with their catalogue values.] A simple random sample without replacement of size n is taken from the inventory, and for each item j in the sample the difference Dj = Aj − Bj between the audited value Aj and the book value Bj is recorded and the sample average D̄ = Ā − B̄ is formed. The total inventory value α is estimated as V = N D̄ + β, where β = b1 + b2 + · · · + bN is the known sum of the book values of the inventory.

(a) Show that V is an unbiased estimator of the total value α of the inventory.

(b) Find an expression for the variance of the estimator V in terms of the population variances σ²a and σ²b of the inventory values and book values and their covariance σab, where you may assume

Cov(Ā, B̄) = [(N − n)/(N − 1)] × σab/n

[Note σab is defined to be ∑_{i=1}^N (ai − µa)(bi − µb)/N, where µa = α/N and µb = β/N are the inventory and book value population means; and when a = b we get the usual variance formula.]

(c) Under what conditions on the variances and covariances of the inventory value and book value populations will the variance of V be smaller than that of the usual estimator NĀ of the total inventory α?

(d) Under what circumstances will the answer to (c) be useful in practice?

Page 11: Durham Maths Question Set - Stats Concepts II 13-14 All Merged

Statistical Concepts 2013/14 – Solutions 2 – Estimators and Confidence Intervals

1. (i) Assuming true population proportion p = 0.4, the standard deviation of the estimator p̂ is

σ_p̂ = √[ (p(1 − p)/n) × (N − n)/(N − 1) ] = √[ (0.4(1 − 0.4)/150) × (377 − 150)/(377 − 1) ] = 0.031 (about 3%)

(ii) P [|p̂ − 0.4| > 0.01] = 2(1 − Φ(0.01/0.031)) = 0.748 (about a 75% chance)

(iii) p̂ = 50/150 = 0.333. The SE of p̂ is

s_p̂ = √[ (p̂(1 − p̂)/(n − 1)) × (1 − n/N) ] = 0.03

Hence, approximate 95% limits are 0.333 ± 1.96 × 0.03, leading to the interval [27.5%, 39.2%]. Since 40% is outside this interval, the data are not consistent with this level of support.
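A short Python check of parts (i)–(iii), assuming the standard library's NormalDist for Φ (small differences from the quoted figures are table rounding):

```python
import math
from statistics import NormalDist

N, n, p = 377, 150, 0.4
# (i) SD of the sample proportion under simple random sampling
sd = math.sqrt(p * (1 - p) / n * (N - n) / (N - 1))          # ~0.031
# (ii) chance the sample proportion errs by more than 1%
prob_err = 2 * (1 - NormalDist().cdf(0.01 / sd))             # ~0.748

# (iii) approximate 95% CI from the observed 50/150
phat = 50 / 150
se = math.sqrt(phat * (1 - phat) / (n - 1) * (1 - n / N))    # ~0.03
ci = (phat - 1.96 * se, phat + 1.96 * se)                    # ~(0.275, 0.392)
```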

2. n = 15, N = 130, ∑ Yj = 381, ∑ Yj² = 9947

s² = (9947 − 381²/15)/(15 − 1) = 19.257.

T = NȲ = 3302, an unbiased estimate of the total number of books in the library.

s²_Ȳ = (s²/n)(1 − n/N) = 1.136.

s_T = N s_Ȳ = 138.54, the SE of T.

z_{0.025} = 1.96. Hence, an approximate 95% CI for the total number of books has limits T ± z_{0.025} s_T, which evaluates to [3030, 3574] (nearest integer).

As n is not “large”, the accuracy of the CLT-based CI cannot be guaranteed.
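The same calculation in a few lines of Python (an illustrative check, not part of the original solutions):

```python
import math

n, N = 15, 130
sum_y, sum_y2 = 381, 9947

s2 = (sum_y2 - sum_y**2 / n) / (n - 1)       # ~19.257
T = N * sum_y / n                             # 3302, estimated total
se_T = N * math.sqrt(s2 / n * (1 - n / N))    # ~138.5
ci = (T - 1.96 * se_T, T + 1.96 * se_T)       # ~(3030, 3574)
```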

3. (a) E [V ] = N E [D̄] + β = N(µa − µb) + β = (α − β) + β = α

(b) Var [V ] = N² Var [D̄] = N² Var [Ā − B̄] = N² [Var [Ā] + Var [B̄] − 2 Cov(Ā, B̄)]
= N² [ (σ²a/n) × (N − n)/(N − 1) + (σ²b/n) × (N − n)/(N − 1) − 2 (σab/n) × (N − n)/(N − 1) ]
= [N²(N − n)/(n(N − 1))] × (σ²a + σ²b − 2σab)

(c) Var [NĀ] = N² × [(N − n)/(N − 1)] × σ²a/n > Var [V ] when σ²b < 2σab.

(d) Useful, provided we have knowledge about the relative magnitudes of σ²b and σab. We know the value of σ²b but not σab. The closer the audit and book values are related, as measured by σab, the more useful V would be.


Statistical Concepts 2013/14 – Sheet 3 – Probability Models and Goodness of Fit

1. [Hand in to be marked] The Poisson distribution has been used by traffic engineers as a model for light traffic, based on the rationale that if the rate is approximately constant and the traffic is light, so that cars move independently of each other, the distribution of counts of cars in a given time interval should be nearly Poisson. The following table shows the numbers of right turns during 300 three-minute intervals at a specific road intersection over various hours of the day and various days of the week.

# right turns   0   1   2   3   4   5   6   7   8   9  10  11  12  13+
count          14  30  36  68  43  43  30  14  10   6   4   1   1    0

Estimate the rate parameter λ in the Poisson distribution. After pooling the last five cells (explain why we do this), assess the fit of the Poisson distribution using Pearson’s chi-square statistic. Carefully explain any lack of fit.

2. Are birthrates constant throughout the year? Here are all the births in Sweden, in 1935, grouped by season.

Spring (Apr-June) 23,385

Summer (Jul-Aug) 14,978

Autumn (Sep-Oct) 14,106

Winter (Nov-Mar) 35,804

(a) Carry out a chi-square test of the constant birthrate hypothesis for these data. Comment on any divergences from constant birthrate.

(b) From the given data, construct a 95% confidence interval for the proportion of spring births. Comment on the relationship of this analysis with part (a).

3. Capture-recapture. How does one estimate the number of fish in a lake? The following technique is actually used for this and other problems concerning sizes of wildlife populations. A net is set up to catch some fish. These are marked and returned to the lake. At a later date another batch of fish are caught. The size of the fish population can then be estimated from seeing how many of the marked fish have been caught in the second sample.

Argue that if M marked fish from the first stage are returned to the lake, then the probability distribution P [Y = y | M, n, N ] of the number Y of marked fish caught in a second catch of n fish, when the total number of fish in the lake is N , is given by

P [Y = y | M, n, N ] = C(M, y) C(N − M, n − y) / C(N, n),

where C(a, b) denotes the binomial coefficient “a choose b”.

Find E [Y | M, n, N ] and hence suggest an estimator for N . Note that you can evaluate the expectation without using P [Y = y | M, n, N ]; explain how. Evaluate your estimator for the possible values of y in the case when 6 fish are marked in the first stage and 10 fish are to be caught in the second stage.

For an observed value y of Y , discuss how you might use P [Y = y | M, n, N ] (when considered as a function of N) as providing an alternative route to estimating N . [To clarify this, you might consider by way of an example plotting P [Y = 3 | M = 6, n = 10, N ] as a function of N , corresponding to the specific situation described above in which 3 marked fish are caught at the second stage.]

The probability model in this question is known as the hypergeometric distribution; it appears in many contexts.


Statistical Concepts 2013/14 – Solutions 3 – Probability Models and Goodness of Fit

1. Total = 1168, n = 300, λ̂ = 1168/300 = 3.893. Pool cells to ensure E ≥ 5 in each cell.

        0     1     2     3     4     5     6     7     8    9+   Total
O      14    30    36    68    43    43    30    14    10    12     300
E     6.1  23.8  46.3  60.1  58.5  45.6  29.6  16.4   8.0   5.5     300
X²   10.2   1.6   2.3   1.0   4.1   0.1  0.01   0.4   0.5   7.8   27.93

Pearson’s chi-square statistic is X² = 27.93. Degrees of freedom: ν = (10 − 1) − 1 = 8. Significance probability: P [X² > 27.93 | Poisson model] = 0.0005, which is very strong evidence against the Poisson model. There are too many zero counts and too many counts of 9 or more. The Poisson model is unlikely to be reasonable, as the traffic rate will tend to vary over different times of the day and on different days of the week. A Poisson model would more likely hold at a specific place during a specific time period on the same day of the week, such as 7:00 am to 8:00 am on a Sunday, when traffic density is more likely to be light and nearly constant.
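The fitted expected counts and the statistic can be reproduced from the raw table; the following Python sketch (an illustration, not part of the original solutions) recomputes λ̂, pools the tail into a single 9+ cell, and evaluates Pearson's statistic:

```python
import math

counts = [14, 30, 36, 68, 43, 43, 30, 14, 10, 6, 4, 1, 1, 0]  # turns 0, 1, ..., 13+
n = sum(counts)                                      # 300 intervals
lam = sum(k * c for k, c in enumerate(counts)) / n   # 1168/300 ~ 3.893

# pool cells 9, 10, ... into one "9+" cell so that every expected count is >= 5
obs = counts[:9] + [sum(counts[9:])]
probs = [math.exp(-lam) * lam**k / math.factorial(k) for k in range(9)]
probs.append(1 - sum(probs))                         # P[Y >= 9]
exp = [n * p for p in probs]
X2 = sum((o - e)**2 / e for o, e in zip(obs, exp))
# X2 matches the 27.93 in the table up to rounding of the E values, on 8 d.o.f.
```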

2. (a) Under the simple model where births are independent and occur with the same probability every day (assume 365 days in a year), where we have seen a random sample of n = 88,273 births from a hypothetical infinite population of such births, here is the layout of the chi-square calculation.

Season              Obs. freq O   Probability p   Exp. freq E = np   O − E   (O − E)²/E
Spring (Apr–Jun)       23,385        0.24932            22,008       1,377     86.16
Summer (Jul–Aug)       14,978        0.16986            14,994         −16      0.02
Autumn (Sep–Oct)       14,106        0.16712            14,752        −646     28.29
Winter (Nov–Mar)       35,804        0.41370            36,519        −715     14.00

From the table, the χ² statistic is 128.47, which should be compared to the null distribution of χ² with 3 degrees of freedom. The p-value for this statistic is very small (much smaller than the smallest value in your χ² tables), so we can reject the null hypothesis of equal birthrates on all days with a very low significance probability. Note that with very large data sets we can be sensitive to very small discrepancies from the null distribution. Looking at the table, we see more births than we would expect in spring, compensated by fewer in autumn and winter.
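A compact Python check of the χ² computation (a sketch, not part of the original solutions; the day counts per season give the probabilities p = days/365):

```python
births = {"Spring": 23385, "Summer": 14978, "Autumn": 14106, "Winter": 35804}
days = {"Spring": 91, "Summer": 62, "Autumn": 61, "Winter": 151}  # 365 in total

n = sum(births.values())                          # 88273 births
expected = {s: n * days[s] / 365 for s in births}
X2 = sum((births[s] - expected[s])**2 / expected[s] for s in births)
# X2 ~ 128.5, to be compared with the chi-square distribution on 3 d.o.f.
```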

(b) To see how far off equal birthrates the data are, we may assess the probability of birth in each season. For spring, the estimate is p̂ = 23385/88273 = 0.265. Our confidence interval is therefore 0.265 ± 1.96√(0.265 × 0.735/88273) = 0.265 ± 0.0029. If birth rates were constant over days, then this probability would be 0.249. Note that this value does not lie within the confidence interval. The observed value is about 6% above the equal-rate value, and our confidence interval suggests that the true ratio is between 5% and 7% above. You can check that the confidence intervals for autumn and winter similarly fail to include the equal-rate values.


3. Assuming no mixing, no deaths, etc., the expression for the quantity P [Y = y | M, n, N ] follows from 1H Probability; compare for example to counting the number of aces in a poker hand.

Y = Y1 + · · · + Yn, where Yi = 1 if the i-th fish caught is marked and Yi = 0 otherwise. Then

E [Y | M, n, N ] = n E [Y1 | M, n, N ] = nM/N

Equating the number of marked fish y (in the second sample) to nM/N gives the estimator N∗ = nM/y. For M = 6, n = 10, we obtain

y    0   1   2   3   4   5   6
N∗   ∞  60  30  20  15  12  10

For observed Y = y, we can use P [Y = y | M, n, N ] as a likelihood function for N , and find the value N̂ which maximises this function. This is the maximum likelihood estimator. For the above example with, say, y = 3, we have

P [Y = 3 | M = 6, n = 10, N ] = C(6, 3) C(N − 6, 7) / C(N, 10) ∝ C(N − 6, 7) / C(N, 10) ≡ l(N)

l(N) = 0 for N < 13. Plot l(N) for N ≥ 13. Note that

l(N + 1)/l(N) = [(N − 5)(N − 9)] / [(N − 12)(N + 1)] ≥ 1 for N ≤ 19

Therefore, the m.l.e. of N is not unique: N̂ = 19 or 20 (cf. N∗ = 20 when y = 3).
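The tie between N̂ = 19 and N̂ = 20 can be confirmed by evaluating the likelihood directly (a Python sketch using math.comb, not part of the original solutions):

```python
from math import comb

M, n, y = 6, 10, 3

def likelihood(N):
    """Hypergeometric probability P[Y = 3 | M = 6, n = 10, N]; zero for N < 13."""
    if N < M + (n - y):
        return 0.0
    return comb(M, y) * comb(N - M, n - y) / comb(N, n)

best = max(range(13, 200), key=likelihood)
# likelihood(19) and likelihood(20) are equal and maximal; cf. N* = nM/y = 20
```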


Statistical Concepts 2013/14 – Sheet 4 – Probability Models

1. [Hand in to be marked] A r.q. Y has a gamma distribution with parameters α, λ > 0 if its p.d.f. is

f(y | α, λ) = [1/Γ(α)] λ^α y^{α−1} e^{−λy},  y ≥ 0,

and zero otherwise.

(a) Show that Var [Y | α, λ] = α/λ².

(b) Show that the moment generating function of Y is given by

M_Y(t) = (λ/(λ − t))^α,  t < λ,

and explain the condition on t.

(c) Deduce that if Y1, . . . , Yn is an independent random sample from such a gamma distribution then their sum also has a gamma distribution, and give the parameters.

2. Overfield and Klauber (1980) published data on the incidence of tuberculosis in relation to blood groups in a sample of Eskimos. For the data below, is there any association of the disease and blood group within the ABO system?

Severity                O    A   AB    B
Moderate-to-advanced    7    5    3   13
Minimal                27   32    8   18
Not present            55   50    7   24

3. A family contains just two boys. Write down, evaluate, and sketch the likelihood function for the family size n, assuming the simple Bernoulli model where successive births are independent and where the probability of a boy is 1/2 and the same for a girl. What is the m.l.e. n̂ of n?


Statistical Concepts 2013/14 – Solutions 4 – Probability Models

1. (a) E[Y | α, λ] = α/λ (from lectures)

E[Y² | α, λ] = ∫₀^∞ y² (1/Γ(α)) λ^α y^{α−1} e^{−λy} dy
             = (Γ(α+2)/(λ²Γ(α))) ∫₀^∞ (1/Γ(α+2)) λ^{α+2} y^{(α+2)−1} e^{−λy} dy
             = Γ(α+2)/(λ²Γ(α))
             = (α+1)αΓ(α)/(λ²Γ(α)) = (α+1)α/λ²

Therefore

Var[Y | α, λ] = (α+1)α/λ² − (α/λ)² = α/λ²

(b) We can calculate the moment generating function similarly:

M_Y(t) = E[e^{tY} | α, λ] = ∫₀^∞ e^{ty} (1/Γ(α)) λ^α y^{α−1} e^{−λy} dy
       = (λ/(λ−t))^α ∫₀^∞ (1/Γ(α)) (λ−t)^α y^{α−1} e^{−(λ−t)y} dy = (λ/(λ−t))^α   for λ − t > 0

The condition on t is required as the integral does not converge/exist for other values of t. As the moment generating function is defined for t in an open interval including 0, it can be used to find all moments E(Y^r), and hence uniquely specifies the distribution of Y (see your 1H Probability notes, or the textbook by Rice, for more details on mgf's if needed to refresh your knowledge).

(c) Put Y = Y_1 + · · · + Y_n. Then (1H Probability course), because the Y_i are independent,

M_Y(t) = M_{Y_1}(t) M_{Y_2}(t) · · · M_{Y_n}(t) = (λ/(λ−t))^{nα}

Therefore, Y ∼ Gamma(nα, λ).

2. Observed values are

Severity                 O   A  AB   B
Moderate-to-advanced     7   5   3  13
Minimal                 27  32   8  18
Not present             55  50   7  24

Expected values are

Severity                     O      A     AB      B
Moderate-to-advanced     10.01   9.78   2.02   6.18
Minimal                  30.38  29.70   6.14  18.78
Not present              48.61  47.52   9.83  30.04

The (O − E)²/E entries are


Severity                     O      A     AB      B
Moderate-to-advanced     0.904  2.339  0.471  7.510
Minimal                  0.376  0.178  0.560  0.032
Not present              0.840  0.130  0.815  1.214

giving a total of 15.36957. Under the hypothesis of no association between disease and blood group, the null distribution is a chi-square distribution with (r − 1)(c − 1) = 3 × 2 = 6 degrees of freedom. The observed value corresponds to a significance level of 0.0176. (In R this can be computed as 1-pchisq(15.36957,6).)

[Note that one of the cells, [AB, moderate], has a small expected value. If we had a large observed value in this cell, then our chi-square distribution approximation would not be reliable.]

Thus, there is some evidence of association of disease and blood group within the ABO system, which is mainly confined to A and B in "Moderate-to-advanced", especially B.
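The expected counts and Pearson statistic above can be reproduced with a short Python sketch using only the standard library (the p-value itself would still come from tables or the R call 1-pchisq given above):

```python
# Observed counts: rows = severity, columns = blood groups O, A, AB, B
observed = [
    [7, 5, 3, 13],    # Moderate-to-advanced
    [27, 32, 8, 18],  # Minimal
    [55, 50, 7, 24],  # Not present
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count under independence: (row total) * (column total) / total
expected = [[r * c / total for c in col_totals] for r in row_totals]

# Pearson chi-square statistic and degrees of freedom
x2 = sum((o - e) ** 2 / e
         for o_row, e_row in zip(observed, expected)
         for o, e in zip(o_row, e_row))
df = (len(observed) - 1) * (len(observed[0]) - 1)
print(round(x2, 2), df)  # 15.37 6
```

The statistic agrees with 15.36957 to rounding, and the degrees of freedom match (r − 1)(c − 1) = 6.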

3. Likelihood l(n) for n is given by

l(n) = P[2 boys | n] = \binom{n}{2} (1/2)^n = n(n−1)/2^{n+1}   for n = 2, 3, 4, 5, . . .

and takes values 1/4, 3/8, 3/8, 5/16, . . . Thus, the m.l.e. n̂ is not unique, as both n = 3 and n = 4 maximise l(n). [Obviously, l(0) = l(1) = 0.]
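A few lines of Python (a sketch, not part of the original solution) confirm the tie at n = 3 and n = 4; exact fractions avoid any floating-point doubt:

```python
from fractions import Fraction
from math import comb

def likelihood(n):
    """P[exactly 2 boys | family size n] under the Bernoulli(1/2) model."""
    return Fraction(comb(n, 2), 2 ** n)

values = {n: likelihood(n) for n in range(2, 30)}
best = max(values.values())
print([n for n, v in values.items() if v == best])  # [3, 4]
print(values[2], values[3], values[4], values[5])   # 1/4 3/8 3/8 5/16
```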



Statistical Concepts 2013/14 – Sheet 5 – Likelihood

1. [Hand in to be marked] Let y_1, . . . , y_n be a random sample from a geometric distribution

P[Y = y | p] = (1 − p) p^{y−1},   y = 1, 2, . . .

where p ∈ [0, 1]. [Remember from 1H Probability that this is the distribution for the number of trials to the first failure in a sequence of independent trials with success probability p.] Show that, if the model is correct, it is sufficient to know the sample size n and the sample mean ȳ to evaluate the likelihood function. Find the maximum likelihood estimator (m.l.e.) of p in terms of ȳ.

In an ecological study of the feeding behaviour of birds, the number of hops between flights was counted for several birds. Fit a geometric distribution to the following data using the m.l.e. of p, and test for goodness of fit using Pearson's chi-square statistic, remembering the "rule of thumb" to "pool" counts in adjacent cells so that all resulting cell expectations are at least 5.

Number of hops   1  2  3  4  5  6  7  8  9 10 11 12
Count           48 31 20  9  6  5  4  2  1  1  2  1

2. Let y_1, . . . , y_n be a random sample from an exponential distribution with p.d.f.

f(y | τ) = (1/τ) e^{−y/τ},   0 ≤ y < ∞,

and zero otherwise, where τ > 0.

(a) Show that if the model is correct it is sufficient to know the sample size n and the sample mean ȳ to evaluate the likelihood function for any value of τ.

(b) Find the maximum likelihood estimator τ̂ of τ in terms of ȳ.

(c) Show that τ̂ is an unbiased estimator of τ; that is, E[τ̂ | τ] = τ for all τ > 0.

More problems on other side


3. Suppose that a parameter θ can assume one of three possible values θ_1 = 1, θ_2 = 10 and θ_3 = 20. The distribution of a discrete random quantity Y, with possible values y_1, y_2, y_3, y_4, depends on θ as follows:

       θ_1  θ_2  θ_3
y_1    .1   .2   .4
y_2    .1   .2   .3
y_3    .2   .3   .1
y_4    .6   .3   .2

Thus, each column gives the distribution of Y given the value of θ at the head of the column.

(a) Write down the parameter space Θ.

(b) A single observation of Y is made. Sketch the likelihood function and evaluate the m.l.e. θ̂ of θ for each of the possible values of Y.

(c) Evaluate the sampling distribution of θ̂; that is, for each θ compute the probability distribution of θ̂, based on a single observation of Y. Display your answer in tabular form.

(d) Is θ̂ an unbiased estimator of θ? Prove your result!

4. The random quantity Y has a geometric distribution with probability function

P[Y = y | p] = (1 − p) p^{y−1},   y = 1, 2, . . . ,   p ∈ [0, 1]

Show that P[Y > y | p] = p^y. Recall from 1H that Y counts the number of trials to the first 'failure' in a sequence of Bernoulli trials, each with success probability p.

As part of a quality control procedure for a certain mass production process, batches containing very large numbers of components from the production are inspected for defectives. We will assume the process is in equilibrium and denote by q the overall proportion of defective components produced.

The inspection procedure is as follows. During each shift n batches are selected from the production and for each such batch components are inspected until a defective one is found, and the number of inspected components is recorded. At the end of the shift, there may be some inspected batches which have not yet yielded a defective component; and for such batches the number of inspected components is recorded.

Suppose at the end of one such inspection shift, a defective component was detected in each of r of the batches, the recorded numbers of inspected components being y_1, . . . , y_r. Inspection of the remaining s = n − r batches was incomplete, the recorded numbers of inspected components being c_1, . . . , c_s.

(a) Show that the likelihood function for q based on these data is

l(q) = q^r (1 − q)^{y+c−r},   q ∈ [0, 1],

where y = y_1 + · · · + y_r and c = c_1 + · · · + c_s.

(b) Therefore, show that the maximum likelihood estimate of q is q̂ = 1/a, where a = (y + c)/r.


Statistical Concepts 2013/14 – Solutions 5 – Likelihood

1. The likelihood is l(p) = ∏_{i=1}^n (1−p) p^{y_i−1} = (1−p)^n p^{n(ȳ−1)}. Therefore, by the factorisation criterion, ȳ is sufficient for p, if we know n.

Put L(p) = log l(p). Then

L′(p) = −n/(1−p) + n(ȳ−1)/p = 0  →  p̂ = (ȳ−1)/ȳ

(This is the maximum as L″(p) is negative for p ∈ [0, 1].)

For the data, Σy_i = 1×48 + 2×31 + · · · + 12×1 = 363 and n = 48 + 31 + · · · + 1 = 130. Hence, p̂ = (363 − 130)/363 = 0.642. Noting that E_{k+1} = p̂E_k and E_1 = 130(1 − p̂) = 46.6 (1 dp), we obtain the following table:

Hops     1     2     3     4    5    6    7    8    9   10   11  12+  Total
O       48    31    20     9    6    5    4    2    1    1    2    1    130
E     46.6  29.9  19.2  12.3  7.9  5.1  3.3  2.1  1.3  0.9  0.6  1.0    130
X²    .045  .042  .035  .891 .458 .001   .397 (cells 7 to 12+ pooled)  1.868

Pooled cells have expectation 9.092 = 130 P[Y ≥ 7 | p̂ = 0.642] and a contribution of 0.3967 to X². Degrees of freedom = 7 − 1 − 1 = 5, X² = 1.868 and

P[χ²₅ > 1.868 | geometric model] = 0.8672

This "large" significance probability suggests that the geometric model fits well—perhaps too well. However, why should the geometric distribution model the number of hops?
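The fit can be checked numerically; the sketch below (assuming the pooling at 7+ described above, and not part of the original solution) reproduces p̂ and the pooled Pearson statistic:

```python
# Observed hop counts for 1..12 hops
counts = [48, 31, 20, 9, 6, 5, 4, 2, 1, 1, 2, 1]
n = sum(counts)                                                  # 130 birds
total_hops = sum(k * c for k, c in enumerate(counts, start=1))   # 363
p_hat = (total_hops - n) / total_hops                            # (ybar - 1)/ybar

# Expected counts: E_k = n(1-p)p^(k-1) for k = 1..6, then pool 7+
expected = [n * (1 - p_hat) * p_hat ** (k - 1) for k in range(1, 7)]
expected.append(n * p_hat ** 6)                  # tail: P[Y >= 7] = p^6
observed = counts[:6] + [sum(counts[6:])]

x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1 - 1                       # 7 cells, 1 fitted parameter
print(round(p_hat, 3), round(x2, 2), df)  # 0.642 1.87 5
```

The tiny difference from the quoted 1.868 comes from carrying p̂ at full precision rather than rounding to 0.642.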

2. (a) Likelihood, l(τ) = ∏_{i=1}^n τ^{−1} exp(−y_i/τ) = τ^{−n} exp(−nȳ/τ) for τ > 0. Hence, it is sufficient to know the values of (n, ȳ) to compute l(τ) (by the factorisation criterion).

(b) L(τ) = −n log τ − nȳ/τ. Hence, the m.l.e. satisfies

L′(τ) = −n/τ + nȳ/τ² = 0  →  τ̂ = ȳ

[as L″(τ̂) = −n/ȳ² < 0]

(c) E[τ̂ | τ] = E[Ȳ | τ] = E[Y | τ], where Y ∼ Gamma(α, λ) with α = 1, λ = 1/τ. Thus, E[τ̂ | τ] = α/λ = τ (unbiased).

3. (a) Θ = {1, 10, 20}.

(b) θ̂(y_1) = 20, θ̂(y_2) = 20, θ̂(y_3) = 10, θ̂(y_4) = 1.

(c) There are three different (sampling) distributions (displayed as columns in the table below) for θ̂, one for each θ ∈ Θ.

          θ
θ̂      1   10   20
20     .2   .4   .7
10     .2   .3   .1
 1     .6   .3   .2

E.g., P[θ̂ = 20 | θ = 1] = P[Y = y_1 | θ = 1] + P[Y = y_2 | θ = 1] = 0.1 + 0.1 = 0.2.

(d) E[θ̂ | θ = 1] = 20 × 0.2 + 10 × 0.2 + 1 × 0.6 = 6.6 ≠ 1. Therefore, E[θ̂ | θ] ≠ θ for at least one value of θ; so θ̂ is not an unbiased estimator of θ.
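The sampling distribution of θ̂ in part (c) is just a push-forward of the model table, so it can be enumerated directly (a sketch with the question's table hard-coded):

```python
# Model: P[Y = y_j | theta] for theta in {1, 10, 20} (columns of the table)
model = {
    1:  [0.1, 0.1, 0.2, 0.6],   # distribution of y1..y4 given theta = 1
    10: [0.2, 0.2, 0.3, 0.3],
    20: [0.4, 0.3, 0.1, 0.2],
}
# m.l.e. for each observed value y1..y4 (argmax across columns)
mle = [20, 20, 10, 1]

# Sampling distribution of theta-hat, and its expectation, for each theta
means = {}
for theta, probs in model.items():
    dist = {}
    for est, pr in zip(mle, probs):
        dist[est] = dist.get(est, 0) + pr
    means[theta] = sum(est * pr for est, pr in dist.items())

print({t: round(m, 1) for t, m in means.items()})  # {1: 6.6, 10: 11.3, 20: 15.2}
```

None of the three expectations equals its θ, which is the bias conclusion of part (d) in one line.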


4. (a) P[Y > y] = 1 − P[Y ≤ y] = 1 − Σ_{r=1}^y (1 − p) p^{r−1} = p^y. The likelihood, i.e. the probability of the observed data for a given p, is

P[Y_1 = y_1, . . . , Y_r = y_r, Y_{r+1} > c_1, . . . , Y_n > c_s | p] = ∏_{i=1}^r (1−p) p^{y_i−1} ∏_{j=1}^s p^{c_j} = (1−p)^r p^{y−r} p^c

(b) Therefore the log-likelihood for q = 1 − p is

L(q) = r log q + (y + c − r) log(1 − q)

and

L′(q) = r/q − (y + c − r)/(1 − q) = 0  →  q̂ = 1/a

where a = (y + c)/r.
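A quick numerical check of q̂ = r/(y + c) (a sketch using made-up illustrative numbers for r, y and c, not data from the question): maximise l(q) = q^r (1−q)^{y+c−r} on a grid and compare with the closed form.

```python
from math import log

# Hypothetical illustrative shift data: r batches yielded a defective after
# y total inspections; the censored batches accumulated c further inspections.
r, y, c = 8, 150, 90

def log_lik(q):
    """Log of l(q) = q^r (1 - q)^(y + c - r)."""
    return r * log(q) + (y + c - r) * log(1 - q)

# Grid search over (0, 1) and comparison with the closed form q-hat = r/(y+c)
grid = [k / 100000 for k in range(1, 100000)]
q_grid = max(grid, key=log_lik)
q_closed = r / (y + c)
print(round(q_grid, 5), round(q_closed, 5))
```

The grid maximiser and the closed-form m.l.e. agree to the grid resolution.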


Statistical Concepts 2013/14 – Sheet 6 – Likelihood

1. [Hand in to be marked] An independent sample x = (x_1, . . . , x_n) of size n is drawn from a Rayleigh distribution with p.d.f.

f(x | α) = (x/α) e^{−x²/2α},   x > 0,
         = 0,                  x ≤ 0,

with unknown parameter α > 0.

(a) Show that the maximum likelihood estimator for α is α̂ = Σ_{i=1}^n x_i²/2n.

(b) If X has a Rayleigh distribution, parameter α, show that E(X²) = 2α. Therefore, show that Fisher's information for a sample of size one is 1/α². Therefore, write down the information in a sample of size n.

(c) Calculate an approximate 95% confidence interval for α if n is large.

2. An offspring in a breeding experiment can be of three types with probabilities, independently of other offspring,

(2 + p)/4,   (1 − p)/2,   p/4

(a) Show that for n offspring the probability that there are a, b and c of the three types, respectively, is of the form

K (2 + p)^a (1 − p)^b p^c

where K does not depend on p.

(b) Show that the maximum likelihood estimate p̂ of p is a root of np² + (2b + c − a)p − 2c = 0.

(c) Suppose that an experiment gives a = 58, b = 33 and c = 9. Find the m.l.e. p̂.

(d) Find Fisher's information, and give an approximate 95% confidence interval for p.

(e) Use p̂ to calculate expected frequencies of the three types of offspring, and test the adequacy of the genetic model using Pearson's chi-square statistic.

3. A random quantity Y has a uniform distribution on the interval (0, θ) so that its p.d.f. is given by

f(y | θ) = 1/θ,   0 < y ≤ θ,

and zero elsewhere. Why must θ be positive?

A random sample Y_1, . . . , Y_n is drawn from this distribution in order to learn about the value of θ. Show that the joint p.d.f. is

f(y_1, . . . , y_n | θ) = 1/θ^n,   0 < m ≤ θ,

and zero elsewhere, where m = max{y_1, . . . , y_n}.

Sketch the likelihood function of θ. What is the maximum likelihood estimate θ̂ of θ?

Derive the exact sampling distribution of M = max{Y_1, . . . , Y_n}, and hence show that

E[θ̂ | θ] = (1 − 1/(n + 1)) θ

Is it surprising that θ̂ "under-estimates" θ? Provide an unbiased estimator of θ.

[Hint: To calculate the sampling distribution of M, first calculate its c.d.f. by noting that M ≤ m if and only if all the Y_i are less than or equal to m; then find its p.d.f.]


4. Suppose that each individual in a very large population must fall into exactly one of k mutually exclusive classes C_1, . . . , C_k with probabilities p_1, . . . , p_k. In a random sample of n such individuals from the population, let Y_1, . . . , Y_k be the numbers falling in C_1, . . . , C_k, respectively.

(a) Reason that

P[Y_1 = y_1, Y_2 = y_2, . . . , Y_k = y_k | p_1, p_2, . . . , p_k] = K p_1^{y_1} p_2^{y_2} · · · p_k^{y_k}

where K depends on y_1, . . . , y_k but not on p_1, . . . , p_k. For the remainder of this question you will not need to know the expression for K.

(b) What is the distribution of the sum of any (proper) subset of Y_1, . . . , Y_k? For example, what is the distribution of Y_2, or of Y_2 + Y_4?

(c) Suppose that in the population of twins, males (M) and females (F) are equally likely to occur and that the probability that the twins are identical is θ. If twins are not identical their genders are independent. Show that P[MM] = P[FF] = (1 + θ)/4 and P[MF] = (1 − θ)/2.

(d) Suppose that n twins are sampled and it is found that y_1 are MM, y_2 are FF and y_3 are MF, but it is not known which twins are identical. Find the m.l.e. θ̂ of θ in terms of n and y_1, y_2, y_3.

(e) Is θ̂ an unbiased estimator of θ?

(f) Compute the variance of θ̂ exactly.

(g) Find Fisher's information for this sampling experiment, and therefore find the large sample approximation to the variance of θ̂. Compare your answer to the result of part (f).


Statistical Concepts 2013/14 – Solutions 6 – Likelihood

1. (a) The likelihood of x is

l(α) = ∏_{i=1}^n f(x_i | α) = ∏_{i=1}^n (x_i/α) exp(−x_i²/2α) = ((∏_{i=1}^n x_i)/α^n) exp(−Σ_{i=1}^n x_i²/2α)

Therefore

L(α) = log(l(α)) = log(∏_{i=1}^n x_i) − n log(α) − T(x)/2α

where T(x) = Σ_{i=1}^n x_i², so that

(d/dα) L(α) = −n/α + T(x)/2α² = 0

when α̂ = T(x)/2n. To check that this is a maximum, we evaluate

(d²/dα²) L(α) = n/α² − T(x)/α³ = (1/α²)[n − T(x)/α]

so

(d²/dα²) L(α̂) = (1/α̂²)[n − 2n] = −4n³/T(x)² < 0,

so α̂ is the maximum.

(b) Integrating by parts, we have

E(X²) = ∫₀^∞ x² f(x|α) dx = ∫₀^∞ x² (x/α) e^{−x²/2α} dx = [−x² e^{−x²/2α}]₀^∞ + 2∫₀^∞ x e^{−x²/2α} dx = 2α

Therefore, Fisher's information, for a sample of size one, is

I(α) = −E((d²/dα²) log f(X|α)) = E((1/α²)(X²/α − 1)) = (1/α²)(2α/α − 1) = 1/α²

so that the information in a sample of size n is n/α².

(c) For large n, the probability distribution of α̂ is approximately normal, mean α, variance 1/nI(α). Therefore an approximate 95% confidence interval for α is α̂ ± 1.96√(1/nI(α)). We estimate α by α̂, so the approximate 95% confidence interval is α̂ ± 1.96√(T(x)²/4n³).
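As a numerical sanity check on E(X²) = 2α (a sketch, not part of the original solution), the integral can be approximated by a simple midpoint Riemann sum using only the standard library:

```python
from math import exp

def rayleigh_second_moment(alpha, upper=40.0, steps=400_000):
    """Midpoint Riemann sum for the integral of x^2 * (x/alpha) * exp(-x^2/(2*alpha))
    over (0, upper); the tail beyond `upper` is negligible for these alpha."""
    h = upper / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * h
        total += x * x * (x / alpha) * exp(-x * x / (2 * alpha))
    return total * h

# Each approximation should sit very close to the closed form 2*alpha
for alpha in (0.5, 1.0, 2.0):
    print(alpha, round(rayleigh_second_moment(alpha), 4))
```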


2. (a)

P[A = a, B = b, C = c | p, n] = (n!/(a! b! c!)) [(2 + p)/4]^a [(1 − p)/2]^b [p/4]^c = K(2 + p)^a (1 − p)^b p^c

(b) L(p) = const + a log(2 + p) + b log(1 − p) + c log p

L′(p) = a/(2 + p) − b/(1 − p) + c/p = 0  →  np² + (2b + c − a)p − 2c = 0

(c) a = 58, b = 33, c = 9, n = 100 → 100p² + 17p − 18 = 0 → p̂ = (−17 + √7489)/200 = 0.3477.

This value is a maximum as

L″(p) = −(a/(2 + p)² + b/(1 − p)² + c/p²) < 0

(d) For a sample of size 1, given p,

E(a) = (2 + p)/4,   E(b) = (1 − p)/2,   E(c) = p/4.

So, Fisher's information is

I(p) = −E(L″(p)) = E(a)/(2 + p)² + E(b)/(1 − p)² + E(c)/p² = 1/(4(2 + p)) + 1/(2(1 − p)) + 1/(4p)

which at p̂ we approximate as

I(p̂) = 1.59202

The large sample approximation to the variance of p̂, with n = 100, is therefore

Var(p̂) ≈ 1/(n I(p̂)) = 1/159.202

Hence, the large sample SE of p̂ is s_p̂ = 1/√159.202 = 0.07925, leading to [0.1924, 0.5030] as an approximate 95% CI for p.

(e) Computing expected values (E) in the usual way we obtain:

O               58      33      9    100
E            58.69   32.62   8.69    100
(O − E)²/E   .0082   .0045  .0109  .0236

Degrees of freedom = 3 − 1 − 1 = 1, X² = 0.0236 and

P[χ²₁ > 0.0236 | genetic model is correct] = 0.8779

This "large" significance probability suggests that the genetic model fits well—perhaps too well.
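The arithmetic of parts (c)-(e) is easy to reproduce (a sketch, not part of the original solution; the χ²₁ p-value itself would still come from tables or R's pchisq):

```python
from math import sqrt

a, b, c = 58, 33, 9
n = a + b + c

# (c) m.l.e.: positive root of n p^2 + (2b + c - a) p - 2c = 0
B, C = 2 * b + c - a, -2 * c
p_hat = (-B + sqrt(B * B - 4 * n * C)) / (2 * n)

# (d) Fisher information per observation, evaluated at p-hat
info = 1 / (4 * (2 + p_hat)) + 1 / (2 * (1 - p_hat)) + 1 / (4 * p_hat)
se = 1 / sqrt(n * info)
ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)

# (e) Pearson statistic against the fitted genetic model
expected = [n * (2 + p_hat) / 4, n * (1 - p_hat) / 2, n * p_hat / 4]
x2 = sum((o - e) ** 2 / e for o, e in zip([a, b, c], expected))

print(round(p_hat, 4), round(info, 5), round(se, 5))
print(tuple(round(v, 4) for v in ci), round(x2, 4))
```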

3. We must have θ > 0 for f(y|θ) to be non-negative.

The joint p.d.f. is

f(y_1, . . . , y_n | θ) = θ^{−n} for 0 < y_i ≤ θ, i = 1, . . . , n, and 0 otherwise.

But {0 < y_i ≤ θ, i = 1, . . . , n} ≡ {0 < m ≤ θ}, where m = max{y_1, . . . , y_n}, and the result follows.

The likelihood function is

l(θ) = 0 if θ < m,   l(θ) = θ^{−n} if θ ≥ m,

so that θ̂ = m.

P[M ≤ m | θ] = P[Y_1 ≤ m, . . . , Y_n ≤ m | θ] = 0 if m ≤ 0;  (m/θ)^n if 0 < m ≤ θ;  1 if m > θ.

Hence

E[θ̂ | θ] = ∫₀^θ m (n m^{n−1}/θ^n) dm = (1 − 1/(n + 1)) θ

We would expect the true value of θ to be larger than the largest observation, so the result is not surprising.

An unbiased estimator of θ is [(n + 1)/n] M, which is bigger than M = θ̂.


4. (a) One way of observing y_1, . . . , y_k is

C_1 C_1 · · · C_1 (y_1 times)  C_2 C_2 · · · C_2 (y_2 times)  · · ·  C_k C_k · · · C_k (y_k times)

with probability

p_1^{y_1} p_2^{y_2} · · · p_k^{y_k}

Thus

P[Y_1 = y_1, Y_2 = y_2, . . . , Y_k = y_k | p_1, p_2, . . . , p_k] = K p_1^{y_1} p_2^{y_2} · · · p_k^{y_k}

where K is the number of different ways this event can occur.

(b) Let Y = the sum of a proper subset of Y_1, Y_2, . . . , Y_k and p the sum of the corresponding subset of p_1, p_2, . . . , p_k. Then Y is the number of observations that fall into the corresponding disjoint union of classes C_1, C_2, . . . , C_k. Thus, Y ∼ Binomial(n, p).

(c)

P[MM] = P[MM | I] P[I] + P[MM | Iᶜ] P[Iᶜ] = (1/2)θ + (1/2)(1/2)(1 − θ) = (1 + θ)/4 = P[FF]

P[MF] = 1 − P[FF] − P[MM] = (1 − θ)/2

(d) The likelihood is

l(θ) = C ((1 + θ)/4)^{y_1} ((1 + θ)/4)^{y_2} ((1 − θ)/2)^{y_3}

Therefore, the log-likelihood is

L(θ) = constant + (y_1 + y_2) log(1 + θ) + y_3 log(1 − θ)

Differentiating with respect to θ we obtain

L′(θ) = (y_1 + y_2)/(1 + θ) − y_3/(1 − θ) = 0

which has solution θ* = 1 − 2y_3/n. This θ* will be the maximum likelihood estimator provided that θ* ≥ 0, so then θ̂ = θ* = 1 − 2y_3/n; otherwise θ̂ = 0. (Check that θ̂ is the maximum by checking L″(θ) < 0 as in part (g) below.)

(e) Observe that the expectation of θ* is found as

E[θ* | θ] = 1 − (2/n) E[Y_3 | θ] = 1 − (2/n) n ((1 − θ)/2) = θ

so θ* is unbiased. We know that θ̂ ≥ θ*, but actually there is a positive probability that θ* < 0, in which case θ̂ = 0 > θ*; hence E[θ̂ | θ] > θ, so the estimator θ̂ is biased.

(f) For large n, we have θ̂ ≈ θ*. So, we may find the variance of θ̂ approximately as

Var[θ̂ | θ] ≈ Var[θ* | θ] = (2/n)² Var[Y_3 | θ] = (2/n)² n ((1 − θ)/2)((1 + θ)/2) = (1 − θ²)/n

(g)

L″(θ) = −(y_1 + y_2)/(1 + θ)² − y_3/(1 − θ)²

For a sample of size 1, given θ,

E(y_1) = E(y_2) = (1 + θ)/4,   E(y_3) = (1 − θ)/2.

So, Fisher's information is

I(θ) = −E(L″(θ)) = (E(y_1) + E(y_2))/(1 + θ)² + E(y_3)/(1 − θ)² = ((1+θ)/2)/(1 + θ)² + ((1−θ)/2)/(1 − θ)² = 1/(1 − θ²)

The large sample approximation to the variance of θ̂ is therefore

Var(θ̂) ≈ 1/(n I(θ)) = (1 − θ²)/n

which, in this case, is the same value as found in (f).
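Part (f)'s variance of θ* can be verified by direct enumeration over the Binomial(n, (1−θ)/2) distribution of Y_3 (a sketch, not part of the original solution):

```python
from math import comb

def var_theta_star(n, theta):
    """Exact variance of theta* = 1 - 2*Y3/n, where Y3 ~ Binomial(n, (1-theta)/2)."""
    q = (1 - theta) / 2
    pmf = [comb(n, y) * q ** y * (1 - q) ** (n - y) for y in range(n + 1)]
    est = [1 - 2 * y / n for y in range(n + 1)]
    mean = sum(e * p for e, p in zip(est, pmf))
    return sum((e - mean) ** 2 * p for e, p in zip(est, pmf))

# Matches the closed form (1 - theta^2)/n for several n and theta
for n in (5, 20):
    for theta in (0.2, 0.5, 0.8):
        assert abs(var_theta_star(n, theta) - (1 - theta ** 2) / n) < 1e-10
print("exact variance matches (1 - theta^2)/n")
```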


Statistical Concepts 2013/14 – Sheet 7 – Sample information

1. [Hand in to be marked] A thousand individuals were classified according to gender and whether or not they were colourblind:

              Male  Female
Normal         442     514
Colourblind     38       6

According to genetic theory, each individual, independently of other individuals, has the following probabilities of belonging to the above categories:

              Male   Female
Normal        p/2    pq + p²/2
Colourblind   q/2    q²/2

where q = 1 − p.

(a) Show that the maximum likelihood estimate q̂ of q is 0.0871, to four decimal places.

(b) Compute the large sample estimated standard error for the maximum likelihood estimate, using the "observed information". Hence, find an approximate 99% confidence interval for q.

2. Evaluate and compare

(i) the estimated sample information, nI(θ̂),

and

(ii) the observed information, −L″(θ̂), for the given sample,

for each of the following situations.

(a) A sample X from a binomial distribution, parameters n (known) and p (unknown).

(b) An iid sample Y_1, ..., Y_n of size n, from a Poisson distribution, parameter λ.


Statistical Concepts 2013/14 – Solutions 7 – Sample information

1. (a) Putting a = 442, b = 514, c = 38, d = 6, n = 1000, the likelihood is

l(q) = constant × p^a (pq + p²/2)^b q^c (q²)^d ∝ (1 − q)^{a+b} (1 + q)^b q^{c+2d}

L(q) = log(l(q)) = constant + (a + b) log(1 − q) + b log(1 + q) + (c + 2d) log q

L′(q) = −(a + b)/(1 − q) + b/(1 + q) + (c + 2d)/q = 0  →  (a + 2b + c + 2d)q² + aq − (c + 2d) = 0
     →  1520q² + 442q − 50 = 0  →  760q² + 221q − 25 = 0  →  q̂ = 0.0871

(Check q̂ is a maximum by checking L″(q̂) < 0 as below.)

(b) Differentiating again, we have

L″(q) = −[(a + b)/(1 − q)² + b/(1 + q)² + (c + 2d)/q²]

Substituting the observed values of a, b, c, d and q̂, the observed information is

−L″(q̂) = 1147.127 + 434.935 + 6590.733 = 8172.79

Hence, the estimated standard error of q̂ is s_q̂ = √(1/(−L″(q̂))) = 0.011. As z_{0.005} = 2.5758, an approximate 99% confidence interval for q has limits

0.0871 ± 2.5758 × 0.011 → [0.059, 0.116]
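These numbers are straightforward to reproduce (a sketch, not part of the original solution; slight differences in the later digits come from carrying q̂ at full precision):

```python
from math import sqrt

a, b, c, d = 442, 514, 38, 6

# m.l.e.: positive root of (a + 2b + c + 2d) q^2 + a q - (c + 2d) = 0
A, B, C = a + 2 * b + c + 2 * d, a, -(c + 2 * d)
q_hat = (-B + sqrt(B * B - 4 * A * C)) / (2 * A)

# Observed information -L''(q-hat) and the resulting standard error
obs_info = (a + b) / (1 - q_hat) ** 2 + b / (1 + q_hat) ** 2 + (c + 2 * d) / q_hat ** 2
se = 1 / sqrt(obs_info)

z = 2.5758                      # 99.5% point of the standard normal
ci = (q_hat - z * se, q_hat + z * se)
print(round(q_hat, 4), round(se, 4))
print(tuple(round(v, 3) for v in ci))
```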


2. (a) (i) First find Fisher's information for a sample, z say, of size 1. As in lectures, the likelihood is

l(p) = f(z|p) = p^z (1 − p)^{1−z}

So, if L(p) = ln(l(p)), then

L″(p) = −z/p² − (1 − z)/(1 − p)²

so that

I(p) = −E(L″(p)) = p/p² + (1 − p)/(1 − p)² = 1/(p(1 − p))

From lectures, the maximum likelihood estimate for p given the binomial sample X = x is p̂ = x/n. Therefore, we estimate the sample information as

n I(p̂) = n/(p̂(1 − p̂)) = n/((x/n)(1 − x/n)) = n³/(x(n − x))

(ii) Alternately, writing out the likelihood for the sample X = x of size n, we have

l(p) = (n!/(x!(n − x)!)) p^x (1 − p)^{n−x}

so that

L″(p) = −x/p² − (n − x)/(1 − p)²

Therefore the observed information is

−L″(p̂) = x/p̂² + (n − x)/(1 − p̂)² = x/(x/n)² + (n − x)/(1 − x/n)² = n³/(x(n − x))

Observe that (i) and (ii) are the same in this case.

(b) (i) First, find Fisher's information for a sample, y, of size 1. As in lectures, the likelihood is

f(y|λ) = e^{−λ} λ^y / y!

Therefore

L″(λ) = −y/λ²

so that

I(λ) = −E(L″(λ)) = 1/λ

From lectures, the maximum likelihood estimate for λ given sample values y_1, ..., y_n is λ̂ = ȳ, the mean of the n observations. Therefore, we estimate the sample information as

n I(λ̂) = n/λ̂ = n/ȳ

(ii) Alternately, writing out the likelihood for the sample we have

l(λ) = ∏_{i=1}^n e^{−λ} λ^{y_i} / y_i!

so that

L″(λ) = −Σ_{i=1}^n y_i / λ²

Therefore the observed information is

−L″(λ̂) = Σ_{i=1}^n y_i / λ̂² = nȳ/ȳ² = n/ȳ

Observe that (i) and (ii) are the same in this case. (They are not always the same!)


Statistical Concepts 2013/14 – Sheet 8 – LR Tests

1. [Hand in to be marked] An independent, identically distributed sample, x = (x_1, ..., x_n), of size n, is drawn from a Poisson distribution with parameter λ. We want to test the null hypothesis H_0 : λ = λ_1 against the alternative hypothesis H_1 : λ = λ_2, where λ_1 < λ_2.

(a) Write down the likelihood ratio for the data, and show that all likelihood ratio tests of H_0 against H_1 are of the form: Reject H_0 if Σ_{i=1}^n x_i > c, for some c.

(b) Suppose that n = 50, λ_1 = 2, λ_2 = 3. By using the central limit theorem, find, approximately,

(i) the value of c for which the significance level of the test is 0.01.

(ii) the power of the test for this choice of c.

2. We want to construct a test of hypothesis H_0 against H_1, based on observation of a random quantity Y, which takes possible values 1, 2, 3, 4, 5, with probabilities, given H_0 and H_1, as follows.

       1   2   3   4   5
H_0   .4  .2  .2  .1  .1
H_1   .1  .2  .2  .2  .3

(a) Suppose that α_0(δ) is the probability that the test δ accepts H_1, if H_0 is true, and α_1(δ) is the probability that δ accepts H_0, if H_1 is true. Suppose that we are a little more concerned to avoid making the first type of error than we are to avoid making the second type of error. Therefore, we decide to construct the test δ* which minimises the quantity γ(δ) = 1.5α_0(δ) + α_1(δ). Find the test, δ*, and find the values of α_0(δ*), α_1(δ*).

(b) In the above example, suppose that we replace γ(δ) = 1.5α_0(δ) + α_1(δ) by γ(δ, c) = cα_0(δ) + α_1(δ). Find the corresponding optimal test δ_c, and find the corresponding values α_0(δ_c), α_1(δ_c) for each value of c > 0.

3. If gene frequencies AA, Aa, aa are in Hardy-Weinberg equilibrium, then the gene frequencies are (1 − θ)², 2θ(1 − θ), θ², for some value of θ.

Suppose that we wish to test the null hypothesis H_0 : θ = 1/3, against the alternative H_1 : θ = 2/3, based on the number of individuals, x_1, x_2, x_3, with the given genotypes in a sample of n individuals.

(a) Find the general form of the likelihood ratio test.

(b) If n = 36, find, approximately, the test with significance level 0.01, and find the power of this test.

[Hint: You will need to find the mean and variance of (x_3 − x_1). First find these for a sample of size n = 1.]

Comment on possible improvements to this choice of test procedure.


Statistical Concepts 2013/14 – Solutions 8 – LR tests.

1. (a) The Poisson distribution, parameter λ, has frequency function

f(x|λ) = e^{−λ} λ^x / x!,   x = 0, 1, ...

Therefore the likelihood, for the data x = (x_1, ..., x_n), given λ, is

l(λ) = ∏_{i=1}^n e^{−λ} λ^{x_i} / x_i! = e^{−nλ} λ^{Σ_{i=1}^n x_i} / ∏_{i=1}^n x_i!

Therefore the likelihood ratio for the data is

LR(x) = l(λ_2)/l(λ_1) = e^{−n(λ_2−λ_1)} (λ_2/λ_1)^{Σ_{i=1}^n x_i}

Each likelihood ratio test is of form: Reject H_0 if LR(x) > k for some k. As LR(x) is a monotone function of Σ_{i=1}^n x_i, this is equivalent to the test: Reject H_0 if Σ_{i=1}^n x_i > c for some c.

(b) As each X_i has a Poisson distribution, parameter λ, X_i has mean and variance equal to the value of λ. Therefore, T = Σ_{i=1}^n X_i has mean and variance equal to nλ. If n = 50, then, by the central limit theorem, T is approximately normal, so that approximately T is distributed as N(nλ, nλ).

(i) We want to choose c so that, if n = 50 and λ = 2, then P(T > c) = 0.01. In this case, approximately, T is N(100, 100). Therefore we want

0.01 = P(T > c) = 1 − P((T − 100)/10 ≤ (c − 100)/10) ≈ 1 − Φ((c − 100)/10)

Therefore, from tables, (c − 100)/10 = 2.33, so that c = 123.3.

(ii) The power of the test is the probability of rejecting H_0 if H_1 is true, i.e. we want to calculate P(T > 123.3) if n = 50 and λ = 3. In this case, approximately, T is N(150, 150). Therefore, the power of the test is

P(T > 123.3) = 1 − P((T − 150)/√150 ≤ (123.3 − 150)/√150) ≈ 1 − Φ((123.3 − 150)/√150) = 1 − Φ(−2.18) = 0.985


2. (a) As shown in the lectures, the test which minimises aα_0(δ) + bα_1(δ) is to accept H_0 if LR(y) < a/b, to accept H_1 if LR(y) > a/b, and to accept either if LR(y) = a/b, where LR(y) = f_1(y)/f_0(y). The likelihood ratio values are as follows.

          1   2   3   4   5
H_0      .4  .2  .2  .1  .1
H_1      .1  .2  .2  .2  .3
LR(y)   .25   1   1   2   3

As a = 1.5, b = 1, the optimal test δ* accepts H_1 if Y is 4 or 5, and accepts H_0 if Y is 1, 2 or 3.

Therefore α_0(δ*), the probability that δ* accepts H_1 if H_0 is true, equals the probability of observing Y to be 4 or 5, given H_0, which is 0.2. Similarly, α_1(δ*) equals the probability of observing Y to be 1, 2 or 3, given H_1, which is 0.5.

(b) The acceptance set for H_0 is empty if c < 0.25, with (α_0, α_1) = (1, 0). For 0.25 < c < 1, add the value y = 1, with (α_0, α_1) = (0.6, 0.1). For 1 < c < 2, also add the values 2 and 3, with (α_0, α_1) = (0.2, 0.5). For 2 < c < 3, also add 4, with (α_0, α_1) = (0.1, 0.7). For c > 3, add y = 5, with (α_0, α_1) = (0, 1).

3. (a) The likelihood, for general θ, is

l(θ) = f(x_1, x_2, x_3|θ) = (n!/(x_1! x_2! x_3!)) [(1 − θ)²]^{x_1} [2θ(1 − θ)]^{x_2} [θ²]^{x_3}

Therefore, the likelihood ratio is

LR(x_1, x_2, x_3) = l(2/3)/l(1/3) = 4^{(x_3−x_1)}

The likelihood ratio is monotonic in x_3 − x_1. Therefore, the general form of the LR test is: Reject H_0 if [x_3 − x_1] > c.

(b) As the sample size is reasonably large, approximately, by the central limit theorem, X = X_3 − X_1 has a normal distribution.

To find the mean and variance of this distribution, consider a sample of n = 1. For general θ, the possible values of X if n = 1 are −1, 0, +1, with probabilities (1 − θ)², 2θ(1 − θ), θ² respectively. Therefore, if θ = 1/3, X takes values −1, 0, 1 with probabilities 4/9, 4/9, 1/9, so that E(X) = −1/3, Var(X) = 4/9.

Therefore, the distribution of X, when n = 36, is approximately normal, with mean µ = −36/3 = −12, and variance σ² = 36 × (4/9) = 4².

We want to choose a value for c so that P(X > c) = 0.01, when X ∼ N(−12, 4²). From normal tables, the upper 99% point of the standard normal is 2.33. Therefore c = −12 + 2.33 × 4 = −2.68 gives the critical value for a test at significance level 0.01.

From the symmetry of the specification, the distribution of X under H_1 is, approximately, X ∼ N(12, 4²). So, the power of the test, namely 1 − P(X < −2.68) given H_1, is, approximately, 1 − Φ((−2.68 − 12)/4) = 1 − Φ(−3.67) = 0.9999.

Note that when a test has better power than significance level, we may often be able to change the critical value to reduce the significance level at small cost to the power. For example, choosing c = 0 gives significance level 0.0013, and power 0.9987.


Statistical Concepts 2013/14 – Sheet 9 – LR Tests

1. [Hand in to be marked] We observe a series of n counts, x1, ..., xn. Our null hypothesisH0 is that each count xi is Poisson, with a common parameter λ, while our alternativehypothesis, H1, is that each xi is Poisson, but with different parameters λ1, ..., λn.

(a) Given H0, what is the maximum likelihood estimate for λ? Given H1, what is themaximum likelihood estimate for each λi? Show that, if Λ is the generalised likelihoodratio, then the corresponding test statistic is

−2 log(Λ) = 2n∑

i=1

xi log(xix

)

where x is the sample mean. How many degrees of freedom does the null distributionhave?

(b) In a study done at the National Institute of Science and Technology, 1980, asbestosfibres on filters were counted as part of a project to develop measurement standards forasbestos concentration. Assessment of the numbers of fibres in each of 23 grid squaresgave the following counts:

31 34 26 28 29 27 27 24 19 34 27 21 18 30 18 17 31 16 24 24 28 18 22

Carry out the above test as to whether the counts have the same Poisson distribution andreport your conclusions.

2. Let y_1, . . . , y_n be a random sample from N(µ, σ²), where the value of σ² is known. Show that the likelihood ratio test of the hypothesis µ = µ_0, for some specified value of µ_0, is equivalent to rejecting the hypothesis when the ratio

|ȳ − µ_0| / (σ/√n)

is "large", where ȳ is the sample average. What is the exact significance level when µ_0 = 0, σ = 1, n = 9, ȳ = 1?


Statistical Concepts 2013/14 – Solutions 9 – LR tests

1. Under H0, x1, ..., xn are an independent sample from a Poisson distribution, parameter λ. As shown inlectures, the maximum likelihood estimate for λ is therefore λ = x, the sample mean. Under H1, each xiindividually is Poisson, parameter λi, so the maximum likelihood estimator for each λi is λi = xi. Thelikelihood ratio is therefore

Λ =

∏ni=1 f(xi|λ)∏ni=1 f(xi|λi)

=

∏ni=1 λ

xie−λ/xi!∏ni=1 λi

xie−λi/xi!

=

n∏i=1

(x

xi)xiexi−x

The likelihood ratio test statistic is therefore

−2 ln(Λ) = −2 ∑_{i=1}^n [xi ln(x̄/xi) + (xi − x̄)] = 2 ∑_{i=1}^n xi ln(xi/x̄)

Under H1 there are n independent parameters, while under H0 there is only one parameter λ. Therefore, asymptotically −2 ln(Λ) has a χ² distribution with n − 1 degrees of freedom.

(b) For the given data, 2 ∑_{i=1}^n xi ln(xi/x̄) = 27.11. With 23 observations, we have 22 degrees of freedom. From the tables of the χ² distribution, we see that the p-value (i.e. the probability of exceeding this value, given the null distribution) is around 0.2. This provides only very weak evidence against the null hypothesis of a common value of λ. On the other hand, the sample size is fairly small, so the asymptotic approximation is not fully reliable, and we might expect the test to have quite low power.

2. The likelihood function is

l(µ) = ∏_{i=1}^n f(yi|µ) = ∏_{i=1}^n (1/(σ√(2π))) e^{−(yi−µ)²/(2σ²)}

As

∑_{i=1}^n (yi − µ)² = ∑_{i=1}^n (yi − ȳ)² + n(ȳ − µ)²

the log-likelihood can be written:

L(µ) = constant − (n/(2σ²))(ȳ − µ)²

The unrestricted m.l.e. of µ is µ̂ = ȳ and the restricted m.l.e. is µ̂ = µ0. Hence,

2[L(µ̂) − L(µ0)] = (n/σ²)(ȳ − µ0)²

with dim(Θ) − dim(ω) = 1 − 0 = 1 degree of freedom. Hence, rejecting when 2[L(µ̂) − L(µ0)] is "large" is equivalent to rejecting when

|ȳ − µ0| / (σ/√n)

is "large". When the null hypothesis (µ = µ0) is "true",

Z = (Ȳ − µ0) / (σ/√n)

has a N(0, 1) distribution, and its value is 3 when µ0 = 0, σ = 1, n = 9, ȳ = 1. Thus the p-value of this test (which can also be called the exact significance level) is P[|Z| ≥ 3] = 0.002699796 (computed as 2*(1-pnorm(3)) in R), which is strong evidence against the null hypothesis. Note that in this example, 2[L(µ̂) − L(µ0)] = Z² ∼ χ²₁, exactly.
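The same two-sided p-value can be reproduced without R, writing the standard normal tail in terms of the complementary error function (a sketch; math.erfc is the only dependency):

```python
import math

def normal_two_sided_p(z):
    """P[|Z| >= z] for Z ~ N(0,1), using 1 - Phi(z) = erfc(z/sqrt(2))/2."""
    return math.erfc(z / math.sqrt(2))

z = (1 - 0) / (1 / math.sqrt(9))   # (ybar - mu0) / (sigma / sqrt(n)) = 3
print(normal_two_sided_p(z))       # about 0.0026998, matching 2*(1-pnorm(3))
```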


Statistical Concepts 2013/14 – Sheet 10 – Small sample statistics and distribution theory

1. [Hand in to be marked] Suppose that a pharmaceutical company must estimate the average increase in blood pressure of patients who take a certain new drug. Suppose that only six patients (randomly selected from the population of all patients) can be used in the initial phase of human testing. Assume that the probability distribution of changes in blood pressure from which our sample was selected is normal, with unknown mean, µ, and unknown variance, σ².

(a) Suppose that we use the sample variance s² to estimate the population variance σ². Find the probability that s² will overestimate σ² by at least 50%.

(b) Suppose that the increase in blood pressure, in points, for each of the sample of six patients is as follows:

1.7, 3.0, 0.8, 3.4, 2.7, 2.1

Evaluate a 95% confidence interval for µ from these data. Compare it with the interval that you would obtain using large-sample approximations.

(c) Evaluate a 95% confidence interval for σ2 from these data.

2. If Z has a normal probability distribution, mean µ, variance σ², and Y = e^Z, then find the probability density function of Y. [Y is said to have a lognormal density as log(Y) is normally distributed.]

3. Let X and Y have the joint density

f(x, y) = (6/7)(x + y)²,  0 ≤ x ≤ 1, 0 ≤ y ≤ 1

(a) By integrating over appropriate regions, find
(i) P(X > Y)
(ii) P(X + Y ≤ 1)
(iii) P(X ≤ 1/2)

(b) Find the marginal density of X.
(c) Write down the marginal density of Y.
(d) Find the conditional density of X given Y.
(e) Write down the conditional density of Y given X.
(f) Are X and Y independent? Explain!


Statistical Concepts 2013/14 – Solutions 10 – Small sample statistics and distribution theory

1. (a) As each Xi ∼ N(µ, σ²), (n−1)s²/σ² has a chi-square distribution with n − 1 degrees of freedom, where s² = (1/(n−1)) ∑_{i=1}^n (Xi − X̄)² is the sample variance, and in this example n = 6. Therefore if χ²_α is the upper α point of the chi-square distribution with 5 df, then

α = P(∑_{i=1}^n (Xi − X̄)²/σ² ≥ χ²_α) = P(s² ≥ (χ²_α/5)σ²)

Therefore, P(s² ≥ 1.5σ²) corresponds to the value of α for which χ²_α/5 = 1.5, which, from detailed tables, or from using R, is 0.186.

[The version of the tables distributed in class gives χ²_{0.2} = 7.29, χ²_{0.15} = 8.12, identifying the probability as being a bit lower than 0.2.]

(b) From the given data, x̄ = 2.283, s = 0.950. The appropriate 95% interval is

x̄ ± t_{0.025} s/√n

where t_{0.025} is the upper 0.025 point of the t-distribution with (6−1) = 5 degrees of freedom, which is 2.571 from the tables. Therefore the interval is

2.283 ± (2.571)(0.950)/√6 = 2.283 ± 0.997

The large-sample approximation in this problem would be to suppose that s² was a very accurate estimate for σ² (which we saw above is rather optimistic), and therefore to use the interval

x̄ ± z_{0.025} s/√n

where z_{0.025} is the upper 0.025 point of the normal distribution, namely 1.96 replacing the value 2.571 in the above interval (and so giving a narrower interval, 2.283 ± 0.760, based on ignoring the substantial uncertainty arising from estimating the variance from a small sample).

(c) The 95% confidence interval for σ² based on a normal sample of size n is

( ∑_{i=1}^n (Xi − X̄)² / χ²_{n−1}(0.025) ,  ∑_{i=1}^n (Xi − X̄)² / χ²_{n−1}(0.975) )

From the given data, n = 6 and ∑_{i=1}^6 (Xi − X̄)² = 4.51. The upper 0.025 and 0.975 points of the chi-square distribution with 5 df are 12.83 and 0.831, so the 95% interval for σ² is (0.35, 5.43).
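The numbers in (b) and (c) can be reproduced directly from the six observations (a sketch; the table values t_{5,0.025} = 2.571 and the chi-square points 12.83, 0.831 are hard-coded rather than computed):

```python
import math

data = [1.7, 3.0, 0.8, 3.4, 2.7, 2.1]
n = len(data)
xbar = sum(data) / n
ss = sum((x - xbar) ** 2 for x in data)      # sum of squared deviations, 4.51
s = math.sqrt(ss / (n - 1))                  # sample standard deviation

t_crit = 2.571                               # t_{5, 0.025} from tables
half = t_crit * s / math.sqrt(n)
print(round(xbar, 3), round(half, 3))        # 2.283 and 0.997

chi_lo, chi_hi = 12.83, 0.831                # upper 0.025 and 0.975 points, 5 df
print(round(ss / chi_lo, 2), round(ss / chi_hi, 2))   # interval (0.35, 5.43)
```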


2.

fZ(z) = (1/(σ√(2π))) e^{−(z−µ)²/(2σ²)}

Therefore z = s(y) = log y, so ds(y)/dy = 1/y, and

fY(y) = fZ(s(y)) |ds(y)/dy| = (1/(σ√(2π) y)) e^{−(log y−µ)²/(2σ²)}

for y > 0 and zero otherwise.

3. (a) (i) P[X > Y] = 1/2 by symmetry, or ∫₀¹ [∫₀^x (6/7)(x+y)² dy] dx.

(ii) P[X + Y ≤ 1] = ∫₀¹ [∫₀^{1−x} (6/7)(x+y)² dy] dx = 3/14.

(iii) P[X ≤ 1/2] = ∫₀¹ [∫₀^{1/2} (6/7)(x+y)² dx] dy = 2/7.

(b) fX(x) = ∫₀¹ (6/7)(x+y)² dy = (2/7)(3x² + 3x + 1) for x ∈ [0, 1] and zero o/w.

(c) Similarly, by symmetry, fY(y) = (2/7)(3y² + 3y + 1) for y ∈ [0, 1] and zero o/w.

(d)

f(x|y) = f(x,y)/fY(y) = 3(x+y)²/(3y²+3y+1) for x ∈ [0, 1], and 0 otherwise.

(e) Similarly, by symmetry,

f(y|x) = f(x,y)/fX(x) = 3(x+y)²/(3x²+3x+1) for y ∈ [0, 1], and 0 otherwise.

(f) X and Y are not independent because their joint pdf is not the product of the two marginal densities for all x, y. Equivalently, the conditional densities are not equal to the corresponding marginal densities.


Statistical Concepts 2013/14 – Sheet 11 – Distribution theory

1. [Hand in to be marked] Suppose that X and Y are independent random quantities, each with exponential pdf

f(z) = λe^{−λz} for z > 0 and 0 otherwise.

Let U = X + Y and V = X/Y.

(a) Find the joint pdf of U and V .

(b) Find the marginal pdfs of U and V .

(c) Are U and V independent? Justify your answer.

2. Suppose that Y and Z are independent random quantities, where Y has a chi-square distribution with n df, and Z has a standard normal distribution.

Let

X = Z/√(Y/n) and W = Y

(i) Find the joint pdf of W and X.

(ii) Deduce that the pdf of X is

fX(x) = [Γ((n+1)/2) / (√(nπ) Γ(n/2))] (1 + x²/n)^{−(n+1)/2}

[This is the pdf of the t-distribution with n df.]


Statistical Concepts 2013/14 – Solutions 11 – Distribution theory

1. (a) The inverse function to

u = r1(x, y) = x + y,  v = r2(x, y) = x/y

is

x = s1(u, v) = uv/(1 + v),  y = s2(u, v) = u/(1 + v)

over u > 0, v > 0.

The Jacobian, J, namely the determinant

| ∂s1/∂u  ∂s1/∂v |
| ∂s2/∂u  ∂s2/∂v |

has absolute value |J| = u/(1 + v)². Hence, since X and Y are independent,

fU,V(u, v) = fX,Y(x, y)|J| = fX(x)fY(y)|J| = λe^{−λuv/(1+v)} λe^{−λu/(1+v)} u/(1 + v)² = λ²e^{−λu} u/(1 + v)²

for positive u and v and zero otherwise.

(b) The marginal pdf of U is

fU(u) = ∫_{−∞}^{∞} fU,V(u, v) dv = ∫₀^∞ λ²e^{−λu} u/(1 + v)² dv = λ²u e^{−λu},  u > 0

and similarly,

fV(v) = 1/(1 + v)²,  v > 0

(c) As

fU,V(u, v) = fU(u)fV(v)

for all u, v, U and V are independent.
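A quick Monte Carlo check of these marginals is possible (a sketch, taking λ = 1 and a fixed seed; the targets E[U] = 2/λ and P[V ≤ 1] = 1/2 follow from the densities fU and fV derived above):

```python
import random

random.seed(1)
lam = 1.0
n = 100_000

u_vals, v_vals = [], []
for _ in range(n):
    x = random.expovariate(lam)
    y = random.expovariate(lam)
    u_vals.append(x + y)          # U = X + Y, with fU(u) = lam^2 u e^(-lam u)
    v_vals.append(x / y)          # V = X/Y, with fV(v) = 1/(1+v)^2

mean_u = sum(u_vals) / n                        # should be near 2/lam = 2
p_v_le_1 = sum(v <= 1 for v in v_vals) / n      # should be near 1/2
print(mean_u, p_v_le_1)
```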


2. (a) The inverse function to

w = r1(y, z) = y,  x = r2(y, z) = z/√(y/n)

is

y = s1(x, w) = w,  z = s2(x, w) = x√(w/n)

over w > 0, −∞ < x < +∞.

The Jacobian, J, namely the determinant

| ∂s1/∂x  ∂s1/∂w |
| ∂s2/∂x  ∂s2/∂w |

has absolute value |J| = √(w/n). Hence, since Y and Z are independent with pdfs

fY(y) = (1/(2^{n/2}Γ(n/2))) y^{(n/2)−1} e^{−y/2},  fZ(z) = (1/√(2π)) e^{−z²/2}

we have

fW,X(w, x) = fY,Z(y, z)|J| = fY(w) fZ(x√(w/n)) √(w/n) = c w^{(n+1)/2−1} e^{−(1/2)(1+x²/n)w}

where

c = 1/(2^{(n+1)/2} √(nπ) Γ(n/2))

(b) The marginal pdf of X is therefore

fX(x) = ∫₀^∞ fW,X(w, x) dw = c ∫₀^∞ w^{(n+1)/2−1} e^{−w h(x)} dw

where h(x) = (1/2)(1 + x²/n).

Recalling the gamma integral

∫₀^∞ x^{a−1} e^{−bx} dx = Γ(a)/b^a

(as the gamma pdf integrates to 1), we have

fX(x) = c Γ((n+1)/2)/h(x)^{(n+1)/2} = [Γ((n+1)/2) / (√(nπ) Γ(n/2))] (1 + x²/n)^{−(n+1)/2}
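As a sanity check, the derived density can be integrated numerically and compared to 1 (a sketch using composite Simpson's rule and math.gamma, with n = 5 degrees of freedom chosen arbitrarily):

```python
import math

def t_pdf(x, n):
    """pdf of the t-distribution with n df, as derived above."""
    c = math.gamma((n + 1) / 2) / (math.sqrt(n * math.pi) * math.gamma(n / 2))
    return c * (1 + x * x / n) ** (-(n + 1) / 2)

def simpson(f, a, b, m=10_000):
    """Composite Simpson's rule with m (even) subintervals."""
    h = (b - a) / m
    s = f(a) + f(b)
    s += 4 * sum(f(a + (2 * k - 1) * h) for k in range(1, m // 2 + 1))
    s += 2 * sum(f(a + 2 * k * h) for k in range(1, m // 2))
    return s * h / 3

total = simpson(lambda x: t_pdf(x, 5), -200, 200)
print(total)   # close to 1 (the tails beyond +/-200 are negligible for n = 5)
```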


Statistical Concepts 2013/14 – Sheet 12 – Bayesian statistics

1. [Hand in to be marked] Joe is a trainee manager, and his boss decides he should prepare a report on the durability of the light bulbs used in the corporation's offices. His boss wants to know what proportion last longer than the 900 hours claimed by the manufacturer as the time that at least 90% should survive. What should Joe do?

Fred, who looks after light bulb replacement, tells Joe that he has been sceptical about the manufacturer's claims for years, and he reckons it is more like 80%. Joe is a careful type and decides to pin Fred down a bit, offering him a choice between there being 75%, 80%, 85% and 90% which survive beyond 900 hours and getting him to say how relatively likely he thinks those percentages are. Fred says he reckons that 80% is about 4 times more likely than 75%, and about twice as likely as 85%, and that 85% is about 4 times as likely as 90%.

Joe knows that since his boss is an ex-engineer he is going to demand some facts to back up the speculation. Joe decides to monitor the lifetimes of the next 30 bulbs installed in offices. Fortunately, since the lights are left permanently on (to show passers-by how well the corporation is doing financially), he simply has to record the time of installation and wait for 900 hours.

At the end of the study, Joe is able to write up his report. Of his 30 bulbs, 4 have failed. Assuming that Joe accepts Fred's opinions as the honest opinions of an expert, what should he conclude about the proportion of bulbs which last beyond 900 hours?

2. Suppose that you have a blood test for a rare disease. The proportion of people who currently have this disease is .001. The blood test comes back with two possible results: positive, which is some indication that you may have the disease, or negative. Suppose that the test may give the wrong result: if you have the disease, it will give a negative reading with probability .05; likewise, a false positive result will happen with probability .05.

You have three blood tests and they are all positive. What is the probability of you having the disease, assuming blood test results are conditionally independent given disease state?

3. An automatic machine in a small factory produces metal parts. Most of the time (90% from long records) it produces 95% good parts and the remaining have to be scrapped. Other times, the machine slips into a less productive mode and only produces 70% good parts. The foreman observes the quality of parts that are produced by the machine and wants to stop and adjust the machine when she believes that the machine is not working well. Suppose that the first dozen parts produced are given by the sequence

s, u, s, s, s, s, s, s, s, u, s, u

where s = satisfactory and u = unsatisfactory. After observing this sequence, what is the probability that the machine is in its 'good' state, assuming outcomes are conditionally independent given the state of the machine? If the foreman wishes to stop the machine when the probability of 'good state' is under .7, when should she stop it?

After observing the above sequence, what is the probability that the next two parts produced are unsatisfactory?

Page 42: Durham Maths Question Set - Stats Concepts II 13-14 All Merged

4. Suppose that a parameter θ can assume one of three possible values θ1 = 1, θ2 = 10 and θ3 = 20. The distribution of a discrete random quantity Y, with possible values y1, y2, y3, y4, depends on θ as follows:

      θ1   θ2   θ3
y1    .1   .2   .4
y2    .1   .2   .3
y3    .2   .3   .1
y4    .6   .3   .2

Thus, each column gives the distribution of Y given the value of θ at the head of the column.

Suppose that the parameter θ assumes its possible values 1, 10 and 20, with prior probabilities 0.5, 0.25 and 0.25 respectively. In what follows, assume observations are conditionally independent given θ.

(a) Suppose y2 is observed. What is the posterior distribution of θ? What is the mode of this distribution? Compare it with the mle of θ based on y2.

(b) Suppose a second observation is made and y1 is observed. What does the posterior distribution for θ become?

(c) Suppose a third observation were contemplated. Find the conditional probability distribution of this "future" observation given that y2 and y1 have been observed. How might this conditional distribution help in predicting the outcome of the third observation?


Statistical Concepts 2013/14 – Solutions 12 – Bayesian Statistics

1. Data D: “4 failures out of 30”.

Model(M)   %    P[M]    P[D | M]                           P[M | D]
M1         75   2/15    (30 choose 4)(0.75)^26 (0.25)^4    0.055704
M2         80   8/15    (30 choose 4)(0.80)^26 (0.20)^4    0.488715
M3         85   4/15    (30 choose 4)(0.85)^26 (0.15)^4    0.373958
M4         90   1/15    (30 choose 4)(0.90)^26 (0.10)^4    0.081623

The data essentially confirm Fred's belief that the rate is 80% (or at least between 80% and 85%).
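The posterior column can be reproduced directly (a sketch in Python; math.comb supplies the binomial coefficient, which in any case cancels in the normalisation):

```python
import math

rates = [0.75, 0.80, 0.85, 0.90]      # candidate survival proportions
priors = [2/15, 8/15, 4/15, 1/15]     # Fred's elicited prior weights

# Likelihood of "4 failures out of 30", i.e. 26 survivors
like = [math.comb(30, 4) * p**26 * (1 - p)**4 for p in rates]

joint = [pr * l for pr, l in zip(priors, like)]
total = sum(joint)
post = [j / total for j in joint]
print([round(p, 4) for p in post])   # 80% carries about 0.49 posterior probability
```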

2. Let D mean "has disease" and D̄ mean "does not have disease".

Prior: P[D] = 0.001, P[D̄] = 0.999.

Likelihood: P[+ + + | D] = (0.95)³, P[+ + + | D̄] = (0.05)³.

Posterior: P[D | + + +] ∝ (0.95)³ × 0.001, P[D̄ | + + +] ∝ (0.05)³ × 0.999.

Hence, P[D | + + +] = 0.872868, P[D̄ | + + +] = 0.127132.
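A two-line computation confirms the posterior (a sketch; note how three conditionally independent positives overwhelm the low base rate of 0.001):

```python
# Bayes' theorem for three conditionally independent positive tests
prior_d, prior_not = 0.001, 0.999
like_d, like_not = 0.95 ** 3, 0.05 ** 3

post_d = like_d * prior_d / (like_d * prior_d + like_not * prior_not)
print(post_d)   # about 0.872868
```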

3.

P[G] = 0.90, P[B] = 0.10
P[S | G] = 0.95, P[U | G] = 0.05
P[S | B] = 0.70, P[U | B] = 0.30

P[G | sequence] ∝ P[G] P[sequence | G] = 0.9 × (0.95)⁹(0.05)³ → 0.394217
P[B | sequence] ∝ P[B] P[sequence | B] = 0.1 × (0.70)⁹(0.30)³ → 0.605783

P[G | SU] ∝ P[G] P[SU | G] = 0.90 × 0.95 × 0.05 → 0.670588
P[B | SU] ∝ P[B] P[SU | B] = 0.10 × 0.70 × 0.30 → 0.329412

As P[G | SU] < 0.70, she will stop after the second item, which is unsatisfactory.

P[UU | sequence] = P[UU | G] P[G | sequence] + P[UU | B] P[B | sequence] = (0.05)² × 0.394217 + (0.30)² × 0.605783 = 0.055506
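The stopping rule can be traced by updating the posterior part by part (a sketch; by conditional independence the sequential updates reproduce the batch posterior 0.394217, and the probability of 'good' first drops below 0.7 at the second part, as found above):

```python
seq = ["s", "u", "s", "s", "s", "s", "s", "s", "s", "u", "s", "u"]

p_good = 0.90                        # prior P[machine in good state]
lik = {"g": {"s": 0.95, "u": 0.05},  # P[part quality | machine state]
       "b": {"s": 0.70, "u": 0.30}}

stop_at = None
for i, part in enumerate(seq, start=1):
    num = lik["g"][part] * p_good
    p_good = num / (num + lik["b"][part] * (1 - p_good))
    if stop_at is None and p_good < 0.7:
        stop_at = i

print(stop_at, round(p_good, 6))     # stops at part 2; final posterior 0.394217
```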


4. θ1 = 1, θ2 = 10, θ3 = 20 with prior probabilities 0.50, 0.25, 0.25.

(a)

f(θ1|y2) ∝ f(θ1)f(y2|θ1) = 0.50 × 0.1 ∝ 2 → 2/7
f(θ2|y2) ∝ f(θ2)f(y2|θ2) = 0.25 × 0.2 ∝ 2 → 2/7
f(θ3|y2) ∝ f(θ3)f(y2|θ3) = 0.25 × 0.3 ∝ 3 → 3/7

The mode is θ3 = 20, and the mle based on y2 is also θ̂(y2) = 20.

(b) f(θ|y2, y1) ∝ f(y1|θ)f(θ|y2):

f(θ1|y2, y1) ∝ 2 × 0.1 ∝ 2 → 1/9
f(θ2|y2, y1) ∝ 2 × 0.2 ∝ 4 → 2/9
f(θ3|y2, y1) ∝ 3 × 0.4 ∝ 12 → 6/9

(c) f(y|y1, y2) = ∑_{i=1}^3 f(y|θi)f(θi|y1, y2):

f(y1|y1, y2) = 0.1 × 1/9 + 0.2 × 2/9 + 0.4 × 6/9 = 29/90
f(y2|y1, y2) = 0.1 × 1/9 + 0.2 × 2/9 + 0.3 × 6/9 = 23/90
f(y3|y1, y2) = 0.2 × 1/9 + 0.3 × 2/9 + 0.1 × 6/9 = 14/90
f(y4|y1, y2) = 0.6 × 1/9 + 0.3 × 2/9 + 0.2 × 6/9 = 24/90

which, apart from y3, is a fairly "flat" distribution.
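The whole question reduces to repeated multiply-and-normalise steps, which is easy to script (a sketch; fractions keeps the answers exact):

```python
from fractions import Fraction as F

# Columns of the table: P[y | theta] for theta = 1, 10, 20
table = {"y1": [F(1, 10), F(2, 10), F(4, 10)],
         "y2": [F(1, 10), F(2, 10), F(3, 10)],
         "y3": [F(2, 10), F(3, 10), F(1, 10)],
         "y4": [F(6, 10), F(3, 10), F(2, 10)]}

post = [F(1, 2), F(1, 4), F(1, 4)]           # prior over theta
for obs in ["y2", "y1"]:                     # observe y2, then y1
    post = [p * l for p, l in zip(post, table[obs])]
    s = sum(post)
    post = [p / s for p in post]
print(post)                                  # 1/9, 2/9, 6/9 (reduced to 2/3)

# Posterior predictive distribution for a third observation
pred = {y: sum(l * p for l, p in zip(table[y], post)) for y in table}
print(pred["y1"], pred["y3"])                # 29/90 and 14/90 (reduced to 7/45)
```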


Statistical Concepts 2013/14 – Sheet 13 – Bayesian Statistics

1. [Hand in to be marked] Show that if X ∼ Beta(a, b) then

E[X | a, b] = a/(a + b),  Var[X | a, b] = ab/((a + b)²(a + b + 1))

In question 1, problem sheet 12, Joe elicits Fred's prior beliefs in the form of a discrete distribution. Suppose instead that Joe had managed to elicit from Fred that his mean and standard deviation for the percentage of lightbulbs lasting more than 900 hours are 82% and 4%, respectively.

Use a Beta distribution to capture Fred's prior beliefs and calculate the posterior mean and posterior standard deviation for the percentage of lightbulbs lasting more than 900 hours, given that 4 out of the 30 lightbulbs had failed by 900 hours.

2. Independent observations y1, . . . , yn are such that yi (i = 1, . . . , n) is a realisation from a Poisson distribution with mean θti, where t1, . . . , tn are known positive constants and θ is an unknown positive parameter. [It may be helpful to regard yi as the number of events occurring in an interval of length ti in a Poisson process of constant rate θ, where the n intervals are non-overlapping.]

Prior beliefs about θ are represented by a Gamma(a, b) distribution, for specified constants a and b. Show that the posterior distribution for θ is Gamma(a + y, b + t), where y = y1 + . . . + yn and t = t1 + . . . + tn.

In all of what follows, put a = b = 0 in the posterior distribution for θ, corresponding to a limiting form of "vague" prior beliefs.

A new extrusion process for the manufacture of artificial fibre is under investigation. It is assumed that the incidence of flaws along the length of the fibre follows a Poisson process with a constant mean number of flaws per metre. The numbers of flaws in five fibres of lengths 10, 15, 25, 30, and 40 metres were found to be 3, 2, 7, 6 and 10, respectively.

Find the posterior distribution for the mean number of flaws per metre of fibre, and compute the posterior mean and variance of the mean number of flaws per metre.

Show that the probability that a new fibre of length 5 metres will not contain any flaws is exactly (24/25)^28. [Hint: "average" the probability of this event for any θ with respect to the posterior distribution of θ.]


Statistical Concepts 2013/14 – Solutions 13 – Bayesian Statistics

1.

E[X] = [Γ(a+b)/(Γ(a)Γ(b))] ∫₀¹ x^{(a+1)−1}(1 − x)^{b−1} dx = [Γ(a+b)/(Γ(a)Γ(b))] × Γ(a+1)Γ(b)/Γ(a+b+1) = a/(a+b)

Similarly,

E[X²] = [Γ(a+b)/(Γ(a)Γ(b))] × Γ(a+2)Γ(b)/Γ(a+b+2) = a(a+1)/((a+b)(a+b+1))

Therefore, Var[X] = E[X²] − (E[X])² = ab/((a+b)²(a+b+1)).

Equating the mean and variance, a/(a+b) = 0.82 and ab/((a+b)²(a+b+1)) = (0.04)², gives a = 74.825 and b = 16.425.

With s = 26 survivors and f = 4 failures, posterior beliefs for p ∼ Beta(a+s, b+f) = Beta(100.825, 20.425). Therefore, E[p | s = 26, f = 4] = 100.825/(100.825 + 20.425) = 100.825/121.25 = 0.831546, and Var[p | s = 26, f = 4] = 100.825 × 20.425/(121.25² × 122.25) = 0.001146 = (0.03385)².

Hence, the posterior expectation for the percentage of lightbulbs lasting more than 900 hours is approximately 83.2% and the posterior standard deviation is approximately 3.4%.
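The elicitation and conjugate update can be scripted (a sketch; the two moment equations are solved in closed form, using a + b = m(1 − m)/v − 1 and a = m(a + b)):

```python
m, sd = 0.82, 0.04                      # elicited prior mean and sd
v = sd ** 2
ab = m * (1 - m) / v - 1                # a + b, from the variance equation
a, b = m * ab, (1 - m) * ab
print(round(a, 3), round(b, 3))         # 74.825 and 16.425

s, f = 26, 4                            # survivors and failures out of 30
a_post, b_post = a + s, b + f           # conjugate Beta update
mean = a_post / (a_post + b_post)
var = a_post * b_post / ((a_post + b_post) ** 2 * (a_post + b_post + 1))
print(round(mean, 4), round(var ** 0.5, 4))   # posterior mean about 0.832, sd about 0.034
```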

2.

likelihood ∝ ∏_{i=1}^n (θti)^{yi} e^{−θti} ∝ θ^y e^{−θt}

prior ∝ θ^{a−1} e^{−bθ}

posterior ∝ θ^{a−1} e^{−bθ} × θ^y e^{−θt} = θ^{a+y−1} e^{−(b+t)θ}

Hence, the posterior distribution is Gamma(a + y, b + t). When a = b = 0,

f(θ|y, t) = (t^y/Γ(y)) θ^{y−1} e^{−tθ}

In the example, y = 3 + 2 + 7 + 6 + 10 = 28 and t = 10 + 15 + 25 + 30 + 40 = 120. Hence,

f(θ|y, t) = (120^28/Γ(28)) θ^27 e^{−120θ}

E[θ | y = 28, t = 120] = 28/120 = 0.233... and Var[θ | y = 28, t = 120] = 28/120² = 0.00194.

Let Y be the number of flaws in a new fibre of length T = 5. We want

P[Y = 0 | y = 28, t = 120, T = 5] = ∫₀^∞ e^{−5θ} f(θ|y = 28, t = 120) dθ
  = ∫₀^∞ (120^28/Γ(28)) θ^27 e^{−(120+5)θ} dθ
  = (120/125)^28 ∫₀^∞ (125^28/Γ(28)) θ^27 e^{−125θ} dθ
  = (24/25)^28 = 0.318856
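The closed form and its decimal value are quickly verified (a sketch; the integral collapses because the integrand is a Gamma(28, 125) density up to the factor (120/125)^28):

```python
y = 3 + 2 + 7 + 6 + 10          # total flaws observed, 28
t = 10 + 15 + 25 + 30 + 40      # total length inspected, 120 metres
T = 5                           # length of the new fibre

# Posterior predictive P[no flaws in length T] = (t / (t + T))^y
p_no_flaws = (t / (t + T)) ** y
print(p_no_flaws)               # (24/25)^28, about 0.3189
```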


Statistical Concepts 2013/14 – Sheet 14 – Bayesian Statistics

1. [Hand in to be marked] Suppose that the heights of individuals in a certain population have a normal distribution for which the value of the mean height µ is unknown but the standard deviation is known to be 2 inches. Suppose also that prior beliefs about µ can be adequately represented by a normal distribution with a mean of 68 inches and a standard deviation of 1 inch. Suppose 10 people are selected at random from the population and their average height is found to be 69.5 inches.

(a) What is the posterior distribution of µ?

(b) (i) Which interval, 1 inch long, had the highest prior probability of containing µ? What is the value of this probability?

(ii) Which interval, 1 inch long, has the highest posterior probability of containing µ? What is the value of this probability?

(c) What is the posterior probability that the next person selected at random from the population will have height greater than 70 inches?

(d) What happens to the posterior distribution in this problem when the number of people n whose heights we measure becomes very large? Investigate this by (i) seeing what happens when n becomes very large in the formulae you used for part (a); (ii) using the general theoretical result on limiting posterior distributions. Check that (i) and (ii) give the same answer in this case.

2. Albert, a geologist, is examining the amount of radiation being emitted by a geological formation in order to assess the risk to health of people whose homes are built on it. He would like to learn about the average amount of radiation λ being absorbed per minute by individual residents. His mean and standard deviation for λ are 100 particles/minute and 10 particles/minute, and he is willing to use a gamma distribution to represent his beliefs about λ. Albert would like to have more precise knowledge about λ. He has an instrument which measures the exposure which would have been received by a human standing at the same location as the instrument for one minute. Since he is dealing with radioactivity, he believes his machine measurements follow a Poisson distribution with mean λ. However, he does not know how many measurements he needs to make to sufficiently reduce his uncertainty about λ. How many measurements would you advise him to make if he wishes his expected posterior variance for λ to be 4 or less? [HINT: first find what his posterior distribution for λ would be for n observations and then use the first-year probability result E[X] = E[E[X | Y]] to help you compute the expectation of his posterior variance for λ.]

3. When gene frequencies are in equilibrium, the genotypes Hp1-1, Hp1-2 and Hp2-2 of Haptoglobin occur with probabilities (1−θ)², 2θ(1−θ) and θ², respectively. In a study of 190 people the corresponding sample numbers were 10, 68 and 112. Assuming a uniform prior distribution for θ over the interval (0, 1), compute the posterior distribution for θ. Compute the posterior expectation and variance for θ.

Find a "large sample" 99% Bayesian confidence interval for θ, based on these data.


4. Suppose that y1, . . . , yn is a random sample from a uniform distribution on (0, θ), where θ is an unknown positive parameter. Show that the likelihood function l(θ) is given by

l(θ) = 1/θⁿ for m < θ < ∞, and 0 otherwise,   (1)

where m = max{y1, . . . , yn}. Suppose that the prior distribution for θ is a Pareto distribution

f(θ) = abᵃ/θ^{a+1} for b < θ < ∞, and 0 otherwise,   (2)

where a and b are specified positive constants. Show that the posterior distribution for θ is also Pareto with constants a + n and max{b, m}.

Now put a = b = 0 in the posterior distribution (corresponding to "vague" prior beliefs), and with this specification show that (m, mα^{−1/n}) is the 100(1 − α)% highest posterior density (HPD) credibility interval for θ; that is, the posterior density at any point inside the interval is greater than that of any point outside the interval. Is this interval a 100(1 − α)% confidence interval in the frequentist sense? If so, show this to be the case.


Statistical Concepts 2013/14 – Solutions 14 – Bayesian Statistics

1. (a) Height, Y ∼ N(µ, 2²), so σ² = 4. µ ∼ N(68, 1²), so µ0 = 68, σ0² = 1. n = 10 and ȳ = 69.5. Therefore, µ|data ∼ N(µn, σn²) where

µn = (µ0/σ0² + nȳ/σ²) / (1/σ0² + n/σ²) = (68/1² + 10 × 69.5/2²) / (1/1² + 10/2²) = 69.07 inches

and

1/σn² = 1/σ0² + n/σ² = 1/1² + 10/2² = 3.5

so that σn = 0.53 inches. Thus, µ|data ∼ N(69.07, 0.53²).

(b) (i) 68 ± 0.5 → (67.5, 68.5). Probability is P[|Z| ≤ 0.5] = 0.3829.

(ii) 69.07 ± 0.5 → (68.57, 69.57). Probability is P[|Z| ≤ 0.5√3.5] = 0.6505.

(c) If X ∼ N(w, σ²) with w ∼ N(µ, v²), then X ∼ N(µ, σ² + v²). In this problem, the posterior predictive distribution for Y, the height of a further person selected from the population, is therefore N(69.07, 4.281). Therefore

P(Y > 70) = 1 − P((Y − 69.07)/√4.281 ≤ (70 − 69.07)/√4.281) = 1 − Φ(0.45) = 0.326
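Parts (a) and (c) are a few lines of arithmetic (a sketch; the normal cdf is written via math.erf):

```python
import math

def phi(z):
    """Standard normal cdf."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

sigma2, mu0, s0_2 = 4.0, 68.0, 1.0      # data variance, prior mean, prior variance
n, ybar = 10, 69.5

prec = 1 / s0_2 + n / sigma2            # posterior precision, 3.5
mu_n = (mu0 / s0_2 + n * ybar / sigma2) / prec
var_n = 1 / prec
print(round(mu_n, 2), round(math.sqrt(var_n), 2))   # 69.07 and 0.53

# Predictive probability that a new height exceeds 70 inches
pred_var = sigma2 + var_n
p = 1 - phi((70 - mu_n) / math.sqrt(pred_var))
print(round(p, 3))   # about 0.33, matching 1 - Phi(0.45) = 0.326 above
```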

(d) (i)

µn = (µ0/σ0² + nȳ/σ²) / (1/σ0² + n/σ²) → ȳ, as n → ∞

1/σn² = 1/σ0² + n/σ² ⇒ σn² → σ²/n, as n → ∞

(ii) The general large-sample result is that the posterior distribution of µ tends to a normal distribution N(µ̂, −1/L''(µ̂)) as n → ∞, where µ̂ is the maximum likelihood estimator for µ and L is the log likelihood (or equivalently N(µ̂, 1/(nI(µ̂))), where I is Fisher's information, i.e. minus the expected value of L''). In this case, we have posterior normality for all sample sizes. The m.l.e. is the sample mean ȳ as previously found, where the log likelihood was shown to be

L(µ) = constant − n(ȳ − µ)²/(2σ²)

so that

L''(µ) = −n/σ²

and the large-sample limit for the posterior variance is σ²/n, agreeing with the values found directly in (i).


2. For a Gamma(a, b) prior, E[λ] = a/b = 100 and Var[λ] = a/b² = 100 gives a = 100 and b = 1. For general a and b, and with a Poisson likelihood, the posterior for λ is Gamma(a + nȳ, b + n). Thus

Var[λ | y] = (a + nȳ)/(b + n)²

Therefore

E[Var[λ | y]] = (a + nE[Ȳ])/(b + n)² = a/(b(b + n))

because E[Ȳ] = E[E[Ȳ | λ]] = E[λ] = a/b. With a = 100 and b = 1, we require

100/(1 + n) ≤ 4

which gives n = 24.
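A short search confirms the smallest adequate sample size (a sketch; it simply evaluates the expected posterior variance a/(b(b + n)) found above):

```python
a, b = 100, 1        # gamma prior matching mean 100 and sd 10
target = 4.0

n = 0
while a / (b * (b + n)) > target:   # expected posterior variance for n readings
    n += 1
print(n)   # 24
```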

3. For sample numbers A = 10, B = 68, C = 112, the likelihood is proportional to

[(1−θ)²]^A [2θ(1−θ)]^B [θ²]^C ∝ θ^{2C+B} (1−θ)^{2A+B}

and with a uniform prior for θ on (0, 1) the posterior is proportional to the likelihood, and we recognise that this is a Beta(2C+B+1, 2A+B+1) distribution with 2C+B+1 = 293 and 2A+B+1 = 89.

Hence, E[θ | data] = 293/(89+293) = 0.767 and Var[θ | data] = 293 × 89/(382² × 383) = 0.0004666; the posterior SD is 0.0216.

A 99% confidence interval can be based on these values, or, almost exactly the same, based on the result that for large samples or "vague" prior information, the posterior distribution for θ is approximately normal with mean θ̂ = (2C+B)/(2n) = 292/380 = 0.7684 and variance −1/L''(θ̂) = θ̂(1−θ̂)/(2n) = 0.0004683. In either case the confidence interval is approximately 0.768 ± 2.575 × 0.02164 → (0.713, 0.824), where z_{0.005} = 2.575.
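The posterior moments and the approximate 99% interval are easily reproduced (a sketch; z_{0.005} = 2.575 is taken from tables, as in the solution, and the interval is centred at the Beta posterior mean rather than the mle, so the endpoints differ from (0.713, 0.824) only in the third decimal place):

```python
import math

A, B, C = 10, 68, 112                 # genotype counts; 190 people in total
a, b = 2 * C + B + 1, 2 * A + B + 1   # Beta posterior parameters: 293 and 89

mean = a / (a + b)
var = a * b / ((a + b) ** 2 * (a + b + 1))
sd = math.sqrt(var)
print(round(mean, 3), round(sd, 4))   # 0.767 and 0.0216

z = 2.575                             # upper 0.005 point of N(0,1)
print(round(mean - z * sd, 3), round(mean + z * sd, 3))   # roughly (0.711, 0.823)
```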


4. The joint pdf is

f(y1, . . . , yn|θ) = 1/θⁿ if yi < θ < ∞ for i = 1, . . . , n, and 0 otherwise.

But yi < θ < ∞ for i = 1, . . . , n is equivalent to m = max{y1, . . . , yn} < θ < ∞. When f(y|θ) is considered as a function of θ (for given data y1, . . . , yn) the likelihood is

l(θ) = 1/θⁿ for m < θ < ∞, and 0 otherwise.

The prior for θ is

f(θ) = abᵃ/θ^{a+1} for b < θ < ∞, and 0 otherwise.

The posterior for θ is

f(θ|y) ∝ 1/θ^{a+1+n} for b < θ < ∞ and m < θ < ∞, and 0 otherwise.

But b < θ < ∞ and m < θ < ∞ is equivalent to max{b, m} < θ < ∞. Hence, the posterior distribution for θ is Pareto with parameters a + n and max{b, m}.

When a = b = 0, the posterior is

f(θ|y) = nmⁿ/θ^{n+1} for m < θ < ∞, and 0 otherwise.

Then

∫_m^{mα^{−1/n}} f(θ|y) dθ = ∫_m^{mα^{−1/n}} nmⁿ/θ^{n+1} dθ = [−(m/θ)ⁿ]_m^{mα^{−1/n}} = 1 − α

Hence, the interval is a 100(1 − α)% credibility interval for θ. Inspection of the graph of f(θ|y) (which is decreasing on (m, ∞)) and the placement of the interval shows it to be a highest posterior density 100(1 − α)% credibility interval.

Let M denote the random quantity M = max{Y1, . . . , Yn}. Then because [M ≤ m] = [Y1 ≤ m, . . . , Yn ≤ m] for any m, it follows that the distribution function P[M ≤ m | θ] of M for any fixed θ is given by

P[M ≤ m | θ] = 0 for m ≤ 0;  (m/θ)ⁿ for 0 < m < θ;  1 for m ≥ θ.

Therefore, as required,

P[M ≤ θ < Mα^{−1/n} | θ] = P[θα^{1/n} < M ≤ θ | θ] = [θ/θ]ⁿ − [θα^{1/n}/θ]ⁿ = 1 − α


Statistical Concepts 2013/14 – Sheet 15 – Regression

1. [Hand in to be marked] The following data give the diameter d in mm and the height h in metres for 20 Norway spruce trees situated in a very small part of a Swedish forest.

d   140  134  180  177  178  114  221  122  237   82
h  10.0 10.0 12.0 12.0 15.0 10.5 17.0 12.0 15.0  7.0

d   152  166   86  157   74  173  172  153  190  196
h  11.5 13.0  7.0 11.5  7.0 14.0 11.5 11.0 14.5 16.0

Diameter is very easily measured, but height is much more difficult to estimate in a dense forest. It is therefore of interest to attempt to predict height from diameter. Make a scatter plot of diameter versus height ("by hand" or using R), and comment briefly on the apparent strength of relationship between the two variables.

In what follows you should use the following information:

∑di = 3104, ∑hi = 237.5, ∑di² = 518202, ∑dihi = 39046.5, ∑hi² = 2977.25

Fit a straight line using the method of least squares, and draw it on your scatter plot. Does it seem a sensible line to fit to these data? Evaluate the estimate s of σ, where s² is the usual unbiased estimate of the assumed common error variance σ². Give a 95% confidence interval for the slope of the line: is it statistically significantly different from zero?

2. Consider the simple linear regression model

y = β0 + β1x + ε

with data (x1, y1), ..., (xn, yn).

(a) Derive the equations of the usual least squares estimators β̂0, β̂1.

(b) Show that if the errors ε1, ..., εn are independent normal random quantities, mean 0, variance σ², then the usual least squares estimators are also the maximum likelihood estimators of β0, β1.

(c) Show that

Cov(β̂0, β̂1) = −σ²x̄ / ∑i(xi − x̄)²

[Hint: Evaluate Var(ȳ) both directly and also using the relation ȳ = β̂0 + x̄β̂1.]


3. Zebedee is trying to measure the elasticity of a spring. The elasticity is the increase in the length of the spring if a 1g mass is suspended from the end. The length of the spring can only be measured when a mass is attached to the end, otherwise it curls up in a tangled ball. Zebedee has available two masses, one of 10g and one of 20g. He is impatient to get home for dinner and so he only has time to perform 100 measurements. How many of the 100 measurements should be made with the 10g mass if he wants to estimate the elasticity as accurately as possible? Prove your answer is correct. You may assume that the expected increase in the length of the spring is exactly proportional to the mass attached, that the measurements are independent, and that the variation in repeated measurements is the same for both masses.

[Hint: Set the problem up as a simple linear regression problem, where you want to choose the number of 10g readings and the number of 20g readings to minimise the variance of the estimate for the slope coefficient.]


Statistical Concepts 2013/14 – Solutions 15 – Regression

1. n = 20, d̄ = 155.2, h̄ = 11.875, Sdd = 36461.2, Sdh = 2186.5, Shh = 156.9375.

The plot indicates a fairly strong, approximately linear, positive relationship between height and diameter.

[Figure 1: Scatter plot of the Norway spruce data (height against diameter).]

For hi = β0 + β1di + zi we have

β̂1 = Sdh/Sdd = 2186.5/36461.2 = 0.06 (2 dp)

β̂0 = h̄ − β̂1d̄ = 2.57 (2 dp)

The estimate of the assumed common error variance σ² is

s² = (1/(n−2))(Shh − S²dh/Sdd) = 1.4343

Hence, s = 1.1976.

The confidence interval for the slope β1 has limits

β̂1 ± (s/√Sdd) t_{18,0.025}

where t_{18,0.025} = 2.101. The confidence interval [0.047, 0.073] does not contain 0, offering further empirical evidence that variation in height can be explained in part by variation in diameter.

[Checking whether zero is in this confidence interval is exactly equivalent to testing the null hypothesis that β1 is zero at the 0.05 level.]
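All of these quantities follow from the five summary sums given on the sheet (a sketch; t_{18,0.025} = 2.101 is hard-coded from tables):

```python
import math

n = 20
sum_d, sum_h = 3104, 237.5
sum_d2, sum_dh, sum_h2 = 518202, 39046.5, 2977.25

Sdd = sum_d2 - sum_d ** 2 / n          # 36461.2
Sdh = sum_dh - sum_d * sum_h / n       # 2186.5
Shh = sum_h2 - sum_h ** 2 / n          # 156.9375

b1 = Sdh / Sdd                         # least squares slope
b0 = sum_h / n - b1 * sum_d / n        # least squares intercept
s = math.sqrt((Shh - Sdh ** 2 / Sdd) / (n - 2))

half = 2.101 * s / math.sqrt(Sdd)      # t_{18, 0.025} from tables
print(round(b1, 3), round(b0, 2), round(s, 4))    # 0.06, 2.57, 1.1976
print(round(b1 - half, 3), round(b1 + half, 3))   # 0.047 and 0.073
```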

2. (a) Let the sum of squares be

$$Q = \sum_i \left(y_i - (\beta_0 + \beta_1 x_i)\right)^2$$

We want to choose $\beta_0, \beta_1$ to minimise $Q$.

$$\frac{\partial Q}{\partial \beta_0} = -2\sum_i \left(y_i - (\beta_0 + \beta_1 x_i)\right) \quad\text{and}\quad \frac{\partial Q}{\partial \beta_1} = -2\sum_i \left(y_i - (\beta_0 + \beta_1 x_i)\right)x_i$$

Setting these equations to zero gives the following pair of equations for $\hat{\beta}_0, \hat{\beta}_1$.

$$\hat{\beta}_0 n + \hat{\beta}_1 \sum_i x_i = \sum_i y_i$$

$$\hat{\beta}_0 \sum_i x_i + \hat{\beta}_1 \sum_i x_i^2 = \sum_i x_i y_i$$

The first equation gives

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

which, substituting into the second, gives

$$\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$$

(b) The likelihood of observation $y_i$ for given values $x_i, \beta_0, \beta_1, \sigma^2$ is

$$f(y_i \mid x_i, \beta_0, \beta_1, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2\sigma^2}\left(y_i - (\beta_0 + \beta_1 x_i)\right)^2}$$

Therefore the likelihood of the sample is

$$f(y \mid x, \beta_0, \beta_1, \sigma^2) = \frac{1}{\sigma^n (2\pi)^{n/2}}\, e^{-\frac{1}{2\sigma^2}\sum_i \left(y_i - (\beta_0 + \beta_1 x_i)\right)^2}$$

Therefore, we maximise the likelihood by minimising the exponent, namely $\sum_i \left(y_i - (\beta_0 + \beta_1 x_i)\right)^2$, and this is exactly the least squares solution obtained above.


(c) Since $\bar{y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{x}$, we have

$$\mathrm{Var}(\bar{y}) = \mathrm{Var}(\hat{\beta}_0) + \bar{x}^2\, \mathrm{Var}(\hat{\beta}_1) + 2\bar{x}\, \mathrm{Cov}(\hat{\beta}_0, \hat{\beta}_1).$$

Therefore, for $\bar{x} \neq 0$, as $y_1, \ldots, y_n$ are independent, each with variance $\sigma^2$, and using the formulae for $\mathrm{Var}(\hat{\beta}_0)$, $\mathrm{Var}(\hat{\beta}_1)$ derived in lectures, we have

$$\mathrm{Cov}(\hat{\beta}_0, \hat{\beta}_1) = \frac{1}{2\bar{x}}\left(\mathrm{Var}(\bar{y}) - \mathrm{Var}(\hat{\beta}_0) - \bar{x}^2\, \mathrm{Var}(\hat{\beta}_1)\right)$$

$$= \frac{1}{2\bar{x}}\left(\frac{\sigma^2}{n} - \frac{\sigma^2 \sum_i x_i^2}{n\sum_i (x_i - \bar{x})^2} - \frac{\sigma^2 \bar{x}^2}{\sum_i (x_i - \bar{x})^2}\right)$$

$$= \frac{\sigma^2}{2\bar{x}}\left(\frac{-2n\bar{x}^2}{n\sum_i (x_i - \bar{x})^2}\right) = -\frac{\sigma^2 \bar{x}}{\sum_i (x_i - \bar{x})^2}$$

[If $\bar{x} = 0$, then $\hat{\beta}_0 = \bar{y}$, and you can check directly that $\mathrm{Cov}(\hat{\beta}_0, \hat{\beta}_1) = 0$.]

3. Take $m$ observations at 10g and $100 - m$ at 20g.

Model for length $y$ is
$$y = \beta_0 + \beta_1 x + z$$
where $x$ = "weight attached to spring", $\beta_0$ = "length of spring", $\beta_1$ = "elasticity of spring" and $\mathrm{Var}[z] = \sigma^2$.

As $\mathrm{Var}[\hat{\beta}_1] = \sigma^2 / \sum_i (x_i - \bar{x})^2$, it is straightforward to show that $\sigma^2 / \mathrm{Var}[\hat{\beta}_1] = m(100 - m)$, which is maximised at $m = 50$.
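The identity $\sum_i (x_i - \bar{x})^2 = m(100 - m)$ and the optimal split can also be checked by brute force. The following Python sketch (illustrative only, not part of the original solutions) evaluates the sum of squares for every split of the 100 measurements and finds the one that minimises $\mathrm{Var}[\hat{\beta}_1]$, i.e. maximises the sum of squares.

```python
def sum_sq(m):
    """Sum of squares about the mean when m readings use the 10g mass
    and 100 - m readings use the 20g mass."""
    xs = [10] * m + [20] * (100 - m)
    xbar = sum(xs) / 100
    return sum((x - xbar) ** 2 for x in xs)

# the split maximising sum_sq minimises Var of the slope estimate
best = max(range(1, 100), key=sum_sq)

# spot-check the algebraic identity sum_sq(m) == m * (100 - m)
identity_ok = all(abs(sum_sq(m) - m * (100 - m)) < 1e-6 for m in (1, 25, 50, 99))
```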


Statistical Concepts 2013/14 – Sheet 16 – Regression

1. [Hand in to be marked]

In a study of how wheat yield depends on fertilizer, suppose that funds are available for only seven experimental observations. Therefore X (fertilizer in lb/acre) is set to seven different levels, with one observation Y (yield in bushels/acre) for each value of X. The data are as follows.

X  100  200  300  400  500  600  700
Y   40   50   50   70   65   65   80

(a) Fit the least squares line Y = β0 + β1X.

(b) Find and interpret the multiple $R^2$ value for the above data.

(c) Suppose that we intend to use 550 pounds of fertiliser. Find a 95% confidence interval for the expected yield, assuming the usual simple linear regression model with normal errors.

(d) Suppose that we intend to apply 550 pounds of fertiliser for a single plot, giving yield $Y^*$. Explain how you would modify the above confidence interval to give a 95% prediction interval for $Y^*$.

2. (a) In the usual simple linear regression model, where $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ and $r_i = y_i - \hat{y}_i$, show that

i. $\sum r_i = 0$

ii. $\sum x_i r_i = 0$

iii. $\sum \hat{y}_i r_i = 0$

Hence show that the sample correlation between the values of the independent variable, $x_i$, and the residuals, $r_i$, is zero, and the sample correlation between the fitted values, $\hat{y}_i$, and the residuals is zero. What are the implications for graphical diagnostics of the underlying assumptions of the simple linear regression model?

(b) If the error variance in the regression model is $\sigma^2$, show that

$$\mathrm{Var}(r_i) = \sigma^2\left[1 - \frac{1}{n} - \frac{(x_i - \bar{x})^2}{SS_{xx}}\right]$$

Comment on the implications of this result for the least squares fit of a line to a collection of points for which a single $x$ value is very much larger than all of the remaining $x$ values.


Statistical Concepts 2013/14 – Solutions 16 – Regression

1. (a) For the given data

$$\bar{X} = 400, \quad \bar{Y} = 60, \quad SS_{xy} = 16{,}500, \quad SS_{xx} = 280{,}000, \quad SS_{yy} = 1150$$

Therefore,

$$\hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}} = 0.059, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} = 36.4$$

(b) $R^2$ is the square of the sample correlation coefficient, so

$$R^2 = \frac{SS_{xy}^2}{SS_{xx}\, SS_{yy}} = 0.845$$

$R^2$ may be interpreted as the proportion of the variance in $y$ explained by the regression, i.e. the ratio $\sum_i (\hat{y}_i - \bar{y})^2 / \sum_i (y_i - \bar{y})^2$. (Here $R^2$ is reasonably large, as the regression explains most of the variation.)

(c) The residual sum of squares is $RSS = \sum_i (y_i - \hat{y}_i)^2 = 177.68$. Therefore the estimate for the error variance $\sigma^2$ is $s^2 = RSS/(7 - 2) = 35.5$.

The estimate for expected yield when $x = 550$ is

$$\hat{y} = \hat{\beta}_0 + 550\,\hat{\beta}_1 = 69$$

The standard error of this estimate is

$$s\sqrt{\frac{1}{7} + \frac{(550 - \bar{x})^2}{SS_{xx}}} = 5.96\sqrt{0.223}$$

Therefore a 95% confidence interval for expected yield, assuming normal errors, is

$$69 \pm t_{0.025,5} \times 5.96\sqrt{0.223} = 69 \pm 7$$

(d) For the prediction interval for a single yield, the estimate is the same but the standard error becomes

$$s\sqrt{1 + \frac{1}{7} + \frac{(550 - \bar{x})^2}{SS_{xx}}}$$

so the prediction interval becomes $69 \pm 17$.
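The whole calculation for this question can be reproduced from the raw data. This Python sketch is illustrative and not part of the original solutions; the value 2.571 for $t_{5,0.025}$ is taken from tables.

```python
import math

X = [100, 200, 300, 400, 500, 600, 700]   # fertilizer, lb/acre
Y = [40, 50, 50, 70, 65, 65, 80]          # yield, bushels/acre
n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n

SSxx = sum((x - xbar) ** 2 for x in X)
SSyy = sum((y - ybar) ** 2 for y in Y)
SSxy = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y))

b1 = SSxy / SSxx                           # slope
b0 = ybar - b1 * xbar                      # intercept
R2 = SSxy ** 2 / (SSxx * SSyy)
s2 = (SSyy - SSxy ** 2 / SSxx) / (n - 2)   # RSS / (n - 2)

t = 2.571                                  # t_{5, 0.025} from tables
x0 = 550
yhat = b0 + b1 * x0
ci_half = t * math.sqrt(s2 * (1 / n + (x0 - xbar) ** 2 / SSxx))      # CI half-width
pi_half = t * math.sqrt(s2 * (1 + 1 / n + (x0 - xbar) ** 2 / SSxx))  # PI half-width
```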


2. (a) i.
$$\sum_i r_i = \sum_i (y_i - \hat{\beta}_0 - x_i\hat{\beta}_1) = n(\bar{y} - \bar{x}\hat{\beta}_1 - \hat{\beta}_0) = 0$$

ii.
$$\sum_i x_i r_i = \sum_i x_i(y_i - \hat{\beta}_0 - x_i\hat{\beta}_1) = \sum_i x_i\left(y_i - (\bar{y} - \bar{x}\hat{\beta}_1) - x_i\hat{\beta}_1\right) = \sum_i x_i(y_i - \bar{y}) - \hat{\beta}_1 x_i(x_i - \bar{x})$$

$$= SS_{xy} - \hat{\beta}_1 SS_{xx} = SS_{xy} - \frac{SS_{xy}}{SS_{xx}}\, SS_{xx} = 0$$

iii.
$$\sum_i \hat{y}_i r_i = \sum_i (\hat{\beta}_0 + x_i\hat{\beta}_1)\, r_i = \hat{\beta}_0 \sum_i r_i + \hat{\beta}_1 \sum_i x_i r_i = 0$$

The sample correlation between the residuals and the $x$ values will be zero if the sample covariance is zero, which, as the sum of the residuals is zero, is equivalent to the condition that $\sum x_i r_i = 0$, and similarly for the sample correlation between the fitted values, $\hat{y}_i$, and the residuals.

What this implies is that in plots of residuals versus $x$ values or against fitted values there cannot be a non-zero linear fit. We would expect there to be a random scatter, and any pattern in the plot may suggest a problem with the regression model.
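The three identities are easy to verify numerically, e.g. on the fertilizer data from Sheet 16. This is a quick illustrative Python check, not part of the original solutions; the sums are zero only up to floating-point error.

```python
X = [100, 200, 300, 400, 500, 600, 700]
Y = [40, 50, 50, 70, 65, 65, 80]
n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n

# least squares fit
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / \
     sum((x - xbar) ** 2 for x in X)
b0 = ybar - b1 * xbar

fitted = [b0 + b1 * x for x in X]
r = [y - f for y, f in zip(Y, fitted)]          # residuals

sum_r = sum(r)                                   # identity i:   ~0
sum_xr = sum(x * ri for x, ri in zip(X, r))      # identity ii:  ~0
sum_fr = sum(f * ri for f, ri in zip(fitted, r)) # identity iii: ~0
```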

(b)
$$r_i = y_i - (\bar{y} - \bar{x}\hat{\beta}_1) - \hat{\beta}_1 x_i = y_i - \bar{y} + \frac{(\bar{x} - x_i)\sum_j (x_j - \bar{x})\, y_j}{SS_{xx}}$$

$$= y_i\left[1 - \frac{1}{n} - \frac{(x_i - \bar{x})^2}{SS_{xx}}\right] - \sum_{j \neq i} y_j\left[\frac{1}{n} + \frac{(x_i - \bar{x})(x_j - \bar{x})}{SS_{xx}}\right]$$

As $y_1, \ldots, y_n$ are independent, and each has variance $\sigma^2$, letting $Q_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{SS_{xx}}$, we have

$$\mathrm{Var}(r_i) = \sigma^2[1 - Q_i]^2 + \sigma^2 \sum_{j \neq i} \left[\frac{1}{n} + \frac{(x_i - \bar{x})(x_j - \bar{x})}{SS_{xx}}\right]^2$$

$$= \sigma^2[1 - Q_i]^2 + \sigma^2 \sum_j \left[\frac{1}{n} + \frac{(x_i - \bar{x})(x_j - \bar{x})}{SS_{xx}}\right]^2 - \sigma^2 Q_i^2$$

$$= \sigma^2\left[(1 - Q_i)^2 + Q_i - Q_i^2\right] = \sigma^2[1 - Q_i]$$

Observe that $\mathrm{Var}(r_i)$ is a decreasing function of $(x_i - \bar{x})^2$. Therefore, if a single $x$ value, $x_j$ say, is very much larger than all of the remaining $x$ values, then the variance of the $j$th residual will be much smaller than the variance of all the other residuals. (You can check that moving $x_j$ to infinity reduces the variance of $r_j$ to zero.) This means that the least squares line will have a very small residual for this point. Therefore, if a single value $x_j$ is very different from the rest of the values, then the least squares line will essentially be the line that goes through the sample mean $(\bar{x}, \bar{y})$ of the remaining points and also goes through $(x_j, y_j)$. In some circumstances it may be of concern that a single unusual point has been allowed to determine the slope of the line.


Statistical Concepts 2013/14 – Sheet 17 – Unpaired & Paired Comparisons

1. [Hand in to be marked] "Aerobic capacity", the peak oxygen intake per unit of body weight of an individual performing a strenuous activity, is a measure of work capacity. In a comparative study, measurements of aerobic capacities were recorded for a group of 20 Peruvian highland natives and for a group of 10 USA lowlanders acclimatised as adults in high altitudes. The following summary statistics were obtained from the data

     Peruvians   U.S. subjects
ȳ       46.3          38.5
s        5.0           5.8

(i) Estimate the difference in mean aerobic capacities between the two populations and give the standard error of your estimate, stating any assumptions you make.

(ii) Test at the 2% significance level the hypothesis that there is no difference between the mean aerobic capacities in the two populations.

(iii) Construct a 98% confidence interval for the difference in mean aerobic capacities between the two populations.

State any assumptions you have made and suggest ways in which you might informally validate them if you had access to the original data.

2. A study was carried out to compare the starting salaries of male and female college graduates who found jobs after leaving the same program from an (American) institution. Matched pairs were formed of one male and one female graduate with the same subject mix and grade point average. The starting salaries (in dollars) were as follows.

Pair       1      2      3      4      5      6      7      8      9     10
Male   29300  41500  40400  38500  43500  37800  69500  41200  38400  59200
Female 28800  41600  39800  38500  42600  38000  69200  40100  38200  58500

(a) Test the hypothesis that there is no difference between average starting salaries between sexes, at significance level 5%. State your assumptions.

(b) Construct a 95% confidence interval for the difference between the mean salaries for males and females.

(c) Explain why the test procedure

“Reject the hypothesis that the mean difference is zero if the 95% confidence interval for the mean difference does not contain the value zero”

is a valid significance test at level 5%.

(d) Suppose that we forget that the above individuals were selected as matched pairs, and treat the 10 males as a random sample from male graduating students and the ten females as a random sample of female graduating students. (The standard deviation for the male values is 11665 and for females is 11617). Find a 95% confidence interval for the difference between the means of the two groups.


Compare your answer with the matched pair analysis and explain why you have reached different conclusions.

[This comparison is only to provide some order of magnitude comparison of the differences we might expect from a matched pairs versus a simple two sample experiment. If the design really is a matched pairs design, we do not have independent samples from the two populations, although if the ten individuals chosen are fairly representative of the general population then the order of magnitude comparison may be about right.]

(e) Under what circumstances might it be better not to use matched pairs in this type of study?


Statistical Concepts 2013/14 – Solutions 17 – Unpaired & Paired Comparisons

1. I will assume that the samples were drawn at random from both populations. As sample sizes are small, I will assume normality for both populations, which can be validated informally by looking at histograms or normal quantile plots if the data were available. Also, I will assume that the population variances are equal; the sample variances lend some support for this assumption.

(i) Pooled estimate of assumed common variance

$$s_p^2 = \frac{19 \times 5.0^2 + 9 \times 5.8^2}{19 + 9} = 27.78$$

Estimated difference in population means is $D = 46.3 - 38.5 = 7.8$ with estimated standard error

$$s_D = \sqrt{27.78\left(\frac{1}{20} + \frac{1}{10}\right)} = 2.04$$

(ii) The test statistic is

$$t = \frac{D}{s_D} = \frac{7.8}{2.04} = 3.82$$

If there is no difference between the means then the distribution of $t$ will be a t-distribution with $10 + 20 - 2 = 28$ degrees of freedom, and $t_{28,0.01} = 2.467$. Our test is to reject the hypothesis of equal means with significance level 0.02 if $|t| > 2.467$, and so, in this case, we can reject the hypothesis at this level.

(iii) Similarly, 98% confidence limits for the difference in population mean aerobic capacities are $7.8 \pm 2.04 \times 2.467 \rightarrow [2.8, 12.8]$. (Note that zero is not in this confidence interval.)
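The pooled two-sample calculation above can be checked numerically from the summary statistics. This is an illustrative Python sketch, not part of the original solutions; the value 2.467 for $t_{28,0.01}$ is taken from tables.

```python
import math

# summary statistics from the question
n1, ybar1, s1 = 20, 46.3, 5.0    # Peruvian highland natives
n2, ybar2, s2 = 10, 38.5, 5.8    # U.S. subjects

# pooled estimate of the assumed common variance
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)

D = ybar1 - ybar2                           # estimated difference in means
se = math.sqrt(sp2 * (1 / n1 + 1 / n2))     # its standard error
t = D / se                                  # test statistic

tcrit = 2.467                               # t_{28, 0.01} from tables
ci = (D - tcrit * se, D + tcrit * se)       # 98% confidence interval
```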


2. (a) Assuming the pair differences $D_i = X_i - Y_i$ are iid $N(\mu, \sigma^2)$ random quantities, we want to test the hypothesis that $\mu = 0$. The pairwise data differences (male - female) are

$$500, -100, 600, 0, 900, -200, 300, 1100, 200, 700$$

The sample mean of these numbers is $\bar{D} = 400$ and the sample standard deviation is $s_D = 434.61$. The test statistic is

$$t = \frac{\bar{D}}{s_D/\sqrt{n}} = \frac{400}{434.61/\sqrt{10}} = 2.91$$

If the mean difference is zero, then $t$ has a t-distribution with $n - 1 = 9$ degrees of freedom. As $t_{0.025,9} = 2.26$, the observed value of $|t|$ is larger than the critical value and the test rejects the hypothesis of equality of means at 5%.

(b) Similarly, a 95% confidence interval for the mean difference is

$$\bar{D} \pm t_{0.025,9}\, s_D\sqrt{\frac{1}{n}} = 400 \pm 2.26 \times 434.6\sqrt{\frac{1}{10}} = 400 \pm 310.6$$

(Note that this interval does not contain zero.)
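Parts (a) and (b) can be checked numerically from the raw salary data. This is an illustrative Python sketch, not part of the original solutions; 2.26 is the tabulated value of $t_{9,0.025}$ used in the solution.

```python
import math
from statistics import mean, stdev

male = [29300, 41500, 40400, 38500, 43500, 37800, 69500, 41200, 38400, 59200]
female = [28800, 41600, 39800, 38500, 42600, 38000, 69200, 40100, 38200, 58500]

d = [m - f for m, f in zip(male, female)]  # pairwise differences
n = len(d)
dbar = mean(d)                             # sample mean of differences
sd = stdev(d)                              # sample standard deviation
t = dbar / (sd / math.sqrt(n))             # paired t statistic

tcrit = 2.26                               # t_{9, 0.025} from tables
half = tcrit * sd / math.sqrt(n)           # half-width of the 95% CI
```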

(c) If the mean difference is zero, then the probability that we will obtain a sample for which the 95% confidence interval for the mean difference contains the value zero is 0.95. (This is by the definition of the confidence interval.)

Therefore, the probability that the corresponding test, which rejects the hypothesis that the mean difference is zero when the 95% confidence interval does not contain the value zero, will reject the hypothesis if it is true is 0.05, so that the test is a valid significance test at level 5%.

[Of course, there is nothing special about this example, and the simple argument above shows that, if we can construct the confidence interval for a quantity, then we can always construct the corresponding significance test for the value of the quantity.]

(d) Ignoring the matching of pairs, our confidence interval is

$$\bar{X} - \bar{Y} \pm t_{0.025,18}\, s_p\sqrt{\frac{1}{m} + \frac{1}{n}}$$

where $\bar{X} - \bar{Y} = 400$, $m = n = 10$ and $s_p^2$ is the pooled variance estimate given by

$$s_p = \sqrt{\frac{9 s_x^2 + 9 s_y^2}{18}} = 11641$$

so the confidence interval is $400 \pm 10985$. Notice that this interval is much wider than the previous interval and does contain zero, near the centre. This is because we have not controlled for the substantial variability in the sample means which is due to variability in the values of individual male and female scores.

(e) We should not use matched pairs if the criterion that we use for matching has little to do with the response we are measuring. In this case we will not eliminate variability due to blocking, but we will halve the number of degrees of freedom in our t-statistic, corresponding to the loss of information in our assessment of the underlying variability, as we will only learn about population variability by considering differences in the individual pairs.

[Of course, there may be practical reasons why a matched pair experiment might be more difficult or expensive.]


Statistical Concepts 2013/14 – Sheet 18 – Nonparametric methods

1. A study was done to compare the performances of engine bearings made of different compounds. Ten bearings of each type were tested. The following table gives the times until failure (in units of millions of cycles):

Type I   15.21   3.03  12.95  12.51  16.04  16.84   9.92   9.30   5.53   5.60
Type II  12.75   4.67  12.78   6.79   9.37   4.26   4.53   4.69   3.19   4.47

(a) Assuming normal distributions are good models for lifetimes of the two types of bearing, test the hypothesis that there is no difference in lifetime between the two types of bearing.

(b) Test the same hypothesis using the nonparametric Wilcoxon rank sum method. Try computing it using the normal approximation and using tables of the statistic.

(c) Which of the two methods, (a) or (b), do you think gives a better test of the hypothesis?

(d) Suppose, instead, all that is recorded for each bearing above is whether the time to failure was short, medium or long, where short is 6 or less, medium is 6 to 14, and long is greater than 14. Use the Wilcoxon rank sum statistic to test the hypothesis of no difference.

2. An experiment was performed to compare microbiological and hydroxylamine methods for analysis of ampicillin dosages. Pairs of tablets were analysed by the two methods. The following data give the percentages of ampicillin measured to be in each pair of tablets using these methods (so relative to the real amount).

Pair  Microbiological  Hydroxylamine
  1        97.2             97.2
  2       105.8             97.8
  3        99.5             96.2
  4       100.0            101.8
  5        93.8             88.0
  6        79.2             74.0
  7        72.0             75.0
  8        72.0             67.5
  9        69.5             65.8
 10        20.5             21.2
 11        95.2             94.8
 12        90.8             95.8
 13        96.2             98.0
 14        96.2             99.0
 15        91.0            100.2

Perform an appropriate nonparametric test of the hypothesis that there is no systematic difference between the two methods. Do your calculations two ways:

(a) Using a normal approximation to the distribution of the statistic.

(b) Using tables which give exact critical values.

Do your answers suggest that there is a systematic difference between the methods?


Statistical Concepts 2013/14 – Solutions 18 – Nonparametric methods

1. (a) We do a pooled variance t-test. We have $\bar{x} = 10.69$, $\bar{y} = 6.75$, $s_x = 4.82$ and $s_y = 3.60$. Hence the pooled variance is

$$s^2 = \frac{9 s_x^2 + 9 s_y^2}{18} = 18.1$$

Hence the test statistic is

$$\frac{10.69 - 6.75}{\sqrt{18.1\left(\frac{1}{10} + \frac{1}{10}\right)}} = 2.065$$

which is just on the verge of being significant at the 5% level, since $t_{0.025}$ for 18 d.f. is 2.101.

(b) We combine the two samples together, find the ranks, and get

Type I   18   1  17  14  19  20  13  11   8   9
Type II  15   6  16  10  12   3   5   7   2   4

from which we find that the sum of the type I ranks is T = 130.

i. According to the normal approximation, if the hypothesis is true, $T$ should come from a normal distribution with mean $\frac{1}{2} \times 10 \times 21 = 105$ and variance $10 \times 10 \times 21/12 = 175$. But $(130 - 105)/\sqrt{175} = 1.89$, which again is on the verge of significance, since the critical value for 5% significance is 1.96.

ii. From the table which you were given, we reject at level $\alpha = 0.05$ if $T$ is less than $T_L = 79$ or greater than $T_U = 10(10 + 10 + 1) - 79 = 131$. So we just fail to reject the hypothesis, but we would reject at the 10% level.

(c) Neither of these methods suggests rejecting the null hypothesis very strongly, nor do they disagree very much. Nevertheless, the Wilcoxon method is better for this data as there is no reason to suppose normality of the population, and the data don't look very normal either.
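The ranking and normal approximation in part (b) can be reproduced directly. This Python sketch is illustrative, not part of the original solutions; it relies on there being no ties in these data, so a simple dictionary from value to rank suffices.

```python
import math

type1 = [15.21, 3.03, 12.95, 12.51, 16.04, 16.84, 9.92, 9.30, 5.53, 5.60]
type2 = [12.75, 4.67, 12.78, 6.79, 9.37, 4.26, 4.53, 4.69, 3.19, 4.47]

# rank the pooled sample (no ties occur in these data)
pooled = sorted(type1 + type2)
rank = {v: i + 1 for i, v in enumerate(pooled)}

T = sum(rank[v] for v in type1)       # rank sum for Type I
n = m = 10
mu = n * (n + m + 1) / 2              # mean of T under the null hypothesis
var = n * m * (n + m + 1) / 12        # variance of T under the null hypothesis
z = (T - mu) / math.sqrt(var)         # standardised statistic
```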

(d) The reduced data table is

         Short  Medium  Long
Type I     3      4       3
Type II    6      4       0

In the above table, 3 observations tie at the largest value, 8 observations tie at the next largest value, and 9 observations tie at the smallest value. Therefore the midrank of the smallest 9 is $(1 + 9)/2 = 5$, the midrank of the next 8 is $9 + (8 + 1)/2 = 13.5$ and the midrank of the largest 3 is $17 + (3 + 1)/2 = 19$. Therefore, the Wilcoxon statistic is $W_S = 3 \times 5 + 4 \times 13.5 + 3 \times 19 = 126$.

The expectation of $W_S$ is $\frac{1}{2} \times 10 \times 21 = 105$, as before.

The variance of $W_S$ is

$$10 \times 10 \times 21/12 - \frac{10 \times 10\left((9^3 - 9) + (8^3 - 8) + (3^3 - 3)\right)}{12 \times 20 \times 19} = 147.6$$

Therefore, under the hypothesis of no treatment difference, $W_S$ has approximately a normal distribution with mean 105 and variance 147.6, so

$$P(W_S > 126) = P\left(\frac{W_S - 105}{12.15} > \frac{126 - 105}{12.15}\right) = 1 - \Phi(1.73) = 0.042$$

so that the two-sided test of no difference would reject only at around 10%, which again is fairly weak evidence of difference between the two types.
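The midrank and tie-corrected variance arithmetic above can be checked as follows (an illustrative Python sketch, not part of the original solutions; the tie-group sizes and Type I category counts are read off the reduced data table).

```python
import math

n = m = 10
N = n + m
tie_sizes = [9, 8, 3]        # 9 short, 8 medium, 3 long failure times
counts_type1 = [3, 4, 3]     # Type I bearings: short, medium, long

# midrank of each tied group: start of group + average offset within it
midranks, start = [], 0
for size in tie_sizes:
    midranks.append(start + (size + 1) / 2)
    start += size

WS = sum(c * r for c, r in zip(counts_type1, midranks))   # rank sum with midranks
mu = n * (N + 1) / 2                                       # mean under H0
var = (n * m * (N + 1) / 12                                # variance with
       - n * m * sum(s**3 - s for s in tie_sizes)          # tie correction
       / (12 * N * (N - 1)))
z = (WS - mu) / math.sqrt(var)
```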


2. The paired structure means that the Wilcoxon signed rank test is appropriate. The differences of the data, and the ranks of those which are non-zero, are

Pair  Difference  Rank of |Difference|
  1       0.0             –
  2       8.0            13.0
  3       3.3             7.0
  4      −1.8             3.5
  5       5.8            12.0
  6       5.2            11.0
  7      −3.0             6.0
  8       4.5             9.0
  9       3.7             8.0
 10      −0.7             2.0
 11       0.4             1.0
 12      −5.0            10.0
 13      −1.8             3.5
 14      −2.8             5.0
 15      −9.2            14.0

The sum of those ranks with positive associated differences is 61. We calculate significance levels using n = 14 since one of the differences is 0.

(a) The expectation of the signed rank sum is $n(n + 1)/4 = 52.5$. The variance is $n(n + 1)(2n + 1)/24 = 253.75$ and so the standard deviation is 15.9. The standardised observed value is $(61 - 52.5)/15.9 = 0.535$, which is not significant for any normal level of $\alpha$. We do not reject the null hypothesis.

(b) The table which you were given is for the minimum of $W_+$ (the sum of the positive signed ranks) and $W_-$ (the sum of the negative signed ranks). The sum of the ranks with negative associated sign is 44. For $n = 14$ and $\alpha = 0.05$, the critical value is 22. Our value is larger, so we do not reject the null hypothesis. Note that we would not even reject the hypothesis at the 10% level.

We do not find sufficient evidence to conclude that there is a systematic difference.
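The signed rank computation, including the midranks for the tied |differences| at pairs 4 and 13, can be reproduced as follows (an illustrative Python sketch, not part of the original solutions).

```python
import math

micro = [97.2, 105.8, 99.5, 100.0, 93.8, 79.2, 72.0, 72.0,
         69.5, 20.5, 95.2, 90.8, 96.2, 96.2, 91.0]
hydro = [97.2, 97.8, 96.2, 101.8, 88.0, 74.0, 75.0, 67.5,
         65.8, 21.2, 94.8, 95.8, 98.0, 99.0, 100.2]

d = [round(a - b, 1) for a, b in zip(micro, hydro)]
d = [x for x in d if x != 0]          # drop the zero difference (pair 1)
n = len(d)

# midranks of |differences|, averaging ranks within tied groups
absd = sorted(abs(x) for x in d)
rank, i = {}, 0
while i < len(absd):
    j = i
    while j < len(absd) and absd[j] == absd[i]:
        j += 1
    rank[absd[i]] = (i + 1 + j) / 2   # average of ranks i+1 .. j
    i = j

W_plus = sum(rank[abs(x)] for x in d if x > 0)    # positive signed rank sum
W_minus = sum(rank[abs(x)] for x in d if x < 0)   # negative signed rank sum
mu = n * (n + 1) / 4                              # mean under H0
sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)    # standard deviation under H0
z = (W_plus - mu) / sd
```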