TRANSCRIPT
STAT 6720
Mathematical Statistics II
Spring Semester 2001
Dr. Jürgen Symanzik
Utah State University
Department of Mathematics and Statistics
3900 Old Main Hill
Logan, UT 84322-3900
Tel.: (435) 797-0696
FAX: (435) 797-1822
e-mail: [email protected]
Contents
Acknowledgements 1
6 Limit Theorems (Continued from Stat 6710) 1
6.3 Strong Laws of Large Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
6.4 Central Limit Theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
7 Sample Moments 18
7.1 Random Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
7.2 Sample Moments and the Normal Distribution . . . . . . . . . . . . . . . . . . 21
8 The Theory of Point Estimation 26
8.1 The Problem of Point Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 26
8.2 Properties of Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
8.3 Sufficient Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
8.4 Unbiased Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
8.5 Lower Bounds for the Variance of an Estimate . . . . . . . . . . . . . . . . . . 47
8.6 The Method of Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
8.7 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 57
8.8 Decision Theory - Bayes and Minimax Estimation . . . . . . . . . . . . . . . . 63
9 Hypothesis Testing 71
9.1 Fundamental Notions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
9.2 The Neyman-Pearson Lemma . . . . . . . . . . . . . . . . . . . . . . . . . . 76
9.3 Monotone Likelihood Ratios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
9.4 Unbiased and Invariant Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
10 More on Hypothesis Testing 95
10.1 Likelihood Ratio Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
10.2 Parametric Chi-Squared Tests . . . . . . . . . . . . . . . . . . . . . . . . . 100
10.3 t-Tests and F-Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
10.4 Bayes and Minimax Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
11 Confidence Estimation 112
11.1 Fundamental Notions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
11.2 Shortest-Length Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . 116
11.3 Confidence Intervals and Hypothesis Tests . . . . . . . . . . . . . . . . . . . 121
11.4 Bayes Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
12 Nonparametric Inference 129
12.1 Nonparametric Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
12.2 Single-Sample Hypothesis Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
12.3 More on Order Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
13 Some Results from Sampling 146
13.1 Simple Random Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
13.2 Stratified Random Samples . . . . . . . . . . . . . . . . . . . . . . . . . . 149
14 Some Results from Sequential Statistical Inference 153
14.1 Fundamentals of Sequential Sampling . . . . . . . . . . . . . . . . . . . . . . . 153
14.2 Sequential Probability Ratio Tests . . . . . . . . . . . . . . . . . . . . . . . . . 157
Index 161
Acknowledgements
I would like to thank my students, Hanadi B. Eltahir, Rich Madsen, and Bill Morphet, who
helped during the Fall 1999 and Spring 2000 semesters in typesetting these lecture notes
using LaTeX, and for their suggestions on how to improve some of the material presented in class.
Thanks are also due to my Fall 2000 and Spring 2001 semester students, Weiping Deng, David
B. Neal, Emily Simmons, Shea Watrin, Aonan Zhang, and Qian Zhao, for their comments
that helped to improve these lecture notes.
In addition, I particularly would like to thank Mike Minnotte and Dan Coster, who previously
taught this course at Utah State University, for providing me with their lecture notes and other
materials related to this course. Their lecture notes, combined with additional material from
Rohatgi (1976) and other sources listed below, form the basis of the script presented here.
The textbook required for this class is:
- Rohatgi, V. K. (1976): An Introduction to Probability Theory and Mathematical Statistics, John Wiley and Sons, New York, NY.
A Web page dedicated to this class is accessible at:
http://www.math.usu.edu/~symanzik/teaching/2001_stat6720/stat6720.html
This course closely follows Rohatgi (1976) as described in the syllabus. Additional material
originates from the lectures by Professors Hering, Trenkler, Gather, and Kreienbrock that I
attended while studying at the Universität Dortmund, Germany, from the collection of Masters and
PhD Preliminary Exam questions from Iowa State University, Ames, Iowa, and from the following
textbooks:
- Bandelow, C. (1981): Einführung in die Wahrscheinlichkeitstheorie, Bibliographisches Institut, Mannheim, Germany.
- Büning, H., and Trenkler, G. (1978): Nichtparametrische statistische Methoden, Walter de Gruyter, Berlin, Germany.
- Casella, G., and Berger, R. L. (1990): Statistical Inference, Wadsworth & Brooks/Cole, Pacific Grove, CA.
- Fisz, M. (1989): Wahrscheinlichkeitsrechnung und mathematische Statistik, VEB Deutscher Verlag der Wissenschaften, Berlin, German Democratic Republic.
- Gibbons, J. D., and Chakraborti, S. (1992): Nonparametric Statistical Inference (Third Edition, Revised and Expanded), Dekker, New York, NY.
- Johnson, N. L., Kotz, S., and Balakrishnan, N. (1994): Continuous Univariate Distributions, Volume 1 (Second Edition), Wiley, New York, NY.
- Johnson, N. L., Kotz, S., and Balakrishnan, N. (1995): Continuous Univariate Distributions, Volume 2 (Second Edition), Wiley, New York, NY.
- Kelly, D. G. (1994): Introduction to Probability, Macmillan, New York, NY.
- Lehmann, E. L. (1983): Theory of Point Estimation (1991 Reprint), Wadsworth & Brooks/Cole, Pacific Grove, CA.
- Lehmann, E. L. (1986): Testing Statistical Hypotheses (Second Edition - 1994 Reprint), Chapman & Hall, New York, NY.
- Mood, A. M., Graybill, F. A., and Boes, D. C. (1974): Introduction to the Theory of Statistics (Third Edition), McGraw-Hill, Singapore.
- Parzen, E. (1960): Modern Probability Theory and Its Applications, Wiley, New York, NY.
- Searle, S. R. (1971): Linear Models, Wiley, New York, NY.
- Tamhane, A. C., and Dunlop, D. D. (2000): Statistics and Data Analysis - From Elementary to Intermediate, Prentice Hall, Upper Saddle River, NJ.

Additional definitions, integrals, sums, etc. originate from the following formula collections:

- Bronstein, I. N., and Semendjajew, K. A. (1985): Taschenbuch der Mathematik (22. Auflage), Verlag Harri Deutsch, Thun, Switzerland.
- Bronstein, I. N., and Semendjajew, K. A. (1986): Ergänzende Kapitel zu Taschenbuch der Mathematik (4. Auflage), Verlag Harri Deutsch, Thun, Switzerland.
- Sieber, H. (1980): Mathematische Formeln - Erweiterte Ausgabe E, Ernst Klett, Stuttgart, Germany.
J�urgen Symanzik, May 7, 2001
Lecture 01:
Mo 01/08/01

6 Limit Theorems (Continued from Stat 6710)
6.3 Strong Laws of Large Numbers
Definition 6.3.1:
Let $\{X_i\}_{i=1}^{\infty}$ be a sequence of rv's. Let $T_n = \sum_{i=1}^{n} X_i$. We say that $\{X_i\}$ obeys the SLLN with respect to a sequence of norming constants $\{B_i\}_{i=1}^{\infty}$, $B_i > 0$, $B_i \uparrow \infty$, if there exists a sequence of centering constants $\{A_i\}_{i=1}^{\infty}$ such that
$$B_n^{-1}(T_n - A_n) \stackrel{a.s.}{\longrightarrow} 0.$$

Note:
Unless otherwise specified, we will only use the case $B_n = n$ in this section.
Theorem 6.3.2:
$X_n \stackrel{a.s.}{\longrightarrow} X \iff \lim\limits_{n\to\infty} P(\sup\limits_{m \geq n} |X_m - X| > \varepsilon) = 0 \;\; \forall \varepsilon > 0$.

Proof: (see also Rohatgi, page 249, Theorem 11)
WLOG, we can assume that $X = 0$ since $X_n \stackrel{a.s.}{\longrightarrow} X$ implies $X_n - X \stackrel{a.s.}{\longrightarrow} 0$. Thus, we have to prove:
$$X_n \stackrel{a.s.}{\longrightarrow} 0 \iff \lim_{n\to\infty} P(\sup_{m \geq n} |X_m| > \varepsilon) = 0 \;\; \forall \varepsilon > 0$$
Choose $\varepsilon > 0$ and define
$$A_n(\varepsilon) = \{\sup_{m \geq n} |X_m| > \varepsilon\}, \qquad C = \{\lim_{n\to\infty} X_n = 0\}.$$
"$\Longrightarrow$":
Since $X_n \stackrel{a.s.}{\longrightarrow} 0$, we know that $P(C) = 1$ and therefore $P(C^c) = 0$.
Let $B_n(\varepsilon) = C \cap A_n(\varepsilon)$. Note that $B_{n+1}(\varepsilon) \subseteq B_n(\varepsilon)$ and for the limit set $\bigcap_{n=1}^{\infty} B_n(\varepsilon) = \emptyset$. It follows that
$$\lim_{n\to\infty} P(B_n(\varepsilon)) = P\Big(\bigcap_{n=1}^{\infty} B_n(\varepsilon)\Big) = 0.$$
We also have
$$P(B_n(\varepsilon)) = P(A_n \cap C) = 1 - P(C^c \cup A_n^c) = 1 - \underbrace{P(C^c)}_{=0} - P(A_n^c) + \underbrace{P(C^c \cap A_n^c)}_{=0} = P(A_n)$$
$$\Longrightarrow \lim_{n\to\infty} P(A_n(\varepsilon)) = 0$$
"$\Longleftarrow$":
Assume that $\lim\limits_{n\to\infty} P(A_n(\varepsilon)) = 0 \;\forall \varepsilon > 0$ and define $D(\varepsilon) = \{\limsup\limits_{n\to\infty} |X_n| > \varepsilon\}$.
Since $D(\varepsilon) \subseteq A_n(\varepsilon) \;\forall n \in I\!N$, it follows that $P(D(\varepsilon)) = 0 \;\forall \varepsilon > 0$. Also,
$$C^c = \{\lim_{n\to\infty} X_n \neq 0\} \subseteq \bigcup_{k=1}^{\infty} \Big\{\limsup_{n\to\infty} |X_n| > \frac{1}{k}\Big\}.$$
$$\Longrightarrow 1 - P(C) \leq \sum_{k=1}^{\infty} P\Big(D\Big(\frac{1}{k}\Big)\Big) = 0 \Longrightarrow X_n \stackrel{a.s.}{\longrightarrow} 0$$
Note:
(i) $X_n \stackrel{a.s.}{\longrightarrow} 0$ implies that $\forall \varepsilon > 0 \;\forall \delta > 0 \;\exists n_0 \in I\!N: P(\sup\limits_{n \geq n_0} |X_n| > \varepsilon) < \delta$.
(ii) Recall that for a given sequence of events $\{A_n\}_{n=1}^{\infty}$,
$$A = \limsup_{n\to\infty} A_n = \lim_{n\to\infty} \bigcup_{k=n}^{\infty} A_k = \bigcap_{n=1}^{\infty} \bigcup_{k=n}^{\infty} A_k$$
is the event that infinitely many of the $A_n$ occur. We write $P(A) = P(A_n \; i.o.)$, where $i.o.$ stands for "infinitely often".
(iii) Using the terminology defined in (ii) above, we can rewrite Theorem 6.3.2 as
$$X_n \stackrel{a.s.}{\longrightarrow} 0 \iff P(|X_n| > \varepsilon \; i.o.) = 0 \;\; \forall \varepsilon > 0.$$
Lecture 02:
We 01/10/01

Theorem 6.3.3: Borel-Cantelli Lemma
(i) 1st BC-Lemma:
Let $\{A_n\}_{n=1}^{\infty}$ be a sequence of events such that $\sum_{n=1}^{\infty} P(A_n) < \infty$. Then $P(A) = 0$.
(ii) 2nd BC-Lemma:
Let $\{A_n\}_{n=1}^{\infty}$ be a sequence of independent events such that $\sum_{n=1}^{\infty} P(A_n) = \infty$. Then $P(A) = 1$.
Proof:
(i):
$$P(A) = P\Big(\lim_{n\to\infty} \bigcup_{k=n}^{\infty} A_k\Big) = \lim_{n\to\infty} P\Big(\bigcup_{k=n}^{\infty} A_k\Big) \leq \lim_{n\to\infty} \sum_{k=n}^{\infty} P(A_k) = \lim_{n\to\infty} \Bigg(\sum_{k=1}^{\infty} P(A_k) - \sum_{k=1}^{n-1} P(A_k)\Bigg) \stackrel{(*)}{=} 0$$
$(*)$ holds since $\sum_{n=1}^{\infty} P(A_n) < \infty$.

(ii): We have $A^c = \bigcup_{n=1}^{\infty} \bigcap_{k=n}^{\infty} A_k^c$. Therefore,
$$P(A^c) = P\Big(\lim_{n\to\infty} \bigcap_{k=n}^{\infty} A_k^c\Big) = \lim_{n\to\infty} P\Big(\bigcap_{k=n}^{\infty} A_k^c\Big).$$
If we choose $n_0 > n$, it holds that
$$\bigcap_{k=n}^{\infty} A_k^c \subseteq \bigcap_{k=n}^{n_0} A_k^c.$$
Therefore,
$$P\Big(\bigcap_{k=n}^{\infty} A_k^c\Big) \leq \lim_{n_0\to\infty} P\Big(\bigcap_{k=n}^{n_0} A_k^c\Big) \stackrel{indep.}{=} \lim_{n_0\to\infty} \prod_{k=n}^{n_0} (1 - P(A_k)) \stackrel{(+)}{\leq} \lim_{n_0\to\infty} \exp\Bigg(-\sum_{k=n}^{n_0} P(A_k)\Bigg) = 0$$
$$\Longrightarrow P(A) = 1$$
$(+)$ holds since
$$1 - \exp\Bigg(-\sum_{j=n}^{n_0} \alpha_j\Bigg) \leq 1 - \prod_{j=n}^{n_0} (1 - \alpha_j) \leq \sum_{j=n}^{n_0} \alpha_j \quad \text{for } n_0 > n \text{ and } 0 \leq \alpha_j \leq 1.$$
Example 6.3.4:
Independence is necessary for the 2nd BC-Lemma:
Let $\Omega = (0, 1)$ and $P$ a uniform distribution on $\Omega$.
Let $A_n = (0, \frac{1}{n})$. Therefore,
$$\sum_{n=1}^{\infty} P(A_n) = \sum_{n=1}^{\infty} \frac{1}{n} = \infty.$$
But for any $\omega \in \Omega$, $A_n$ occurs only for $n = 1, 2, \ldots, \lfloor \frac{1}{\omega} \rfloor$, where $\lfloor \frac{1}{\omega} \rfloor$ denotes the largest integer ("floor") that is $\leq \frac{1}{\omega}$. Therefore, $P(A) = P(A_n \; i.o.) = 0$.
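This counterexample can be checked numerically. The following Python sketch (an illustration, not part of the original notes; the single seeded draw and the cutoff of 10,000 events are arbitrary choices) verifies that for one fixed $\omega$ the events $A_n = (0, \frac{1}{n})$ occur only up to $n = \lfloor \frac{1}{\omega} \rfloor$, even though the partial sums of $P(A_n)$ grow without bound:

```python
import random

def occurrences(omega, n_max):
    """Indices n <= n_max for which the event A_n = (0, 1/n) occurs for this omega."""
    return [n for n in range(1, n_max + 1) if 0 < omega < 1 / n]

random.seed(0)
omega = random.random()            # one draw from the uniform distribution on (0, 1)
occ = occurrences(omega, 10_000)

# A_n occurs exactly for n = 1, ..., floor(1/omega): only finitely many events occur.
assert occ == list(range(1, int(1 / omega) + 1))

# Yet the probabilities are not summable: the harmonic partial sums keep growing.
partial = sum(1 / n for n in range(1, 10_001))
print(len(occ), round(partial, 2))
```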
Lemma 6.3.5: Kolmogorov's Inequality
Let $\{X_i\}_{i=1}^{\infty}$ be a sequence of independent rv's with common mean 0 and variances $\sigma_i^2$. Let $T_n = \sum_{i=1}^{n} X_i$. Then it holds:
$$P\Big(\max_{1 \leq k \leq n} |T_k| \geq \varepsilon\Big) \leq \frac{\sum_{i=1}^{n} \sigma_i^2}{\varepsilon^2} \quad \forall \varepsilon > 0$$
Proof:
See Rohatgi, page 268, Lemma 2.
Lemma 6.3.6: Kronecker's Lemma
For any real numbers $x_n$, if $\sum_{n=1}^{\infty} x_n$ converges to $s < \infty$ and $B_n \uparrow \infty$, then it holds:
$$\frac{1}{B_n} \sum_{k=1}^{n} B_k x_k \to 0 \text{ as } n \to \infty$$
Proof:
See Rohatgi, page 269, Lemma 3.
Theorem 6.3.7: Cauchy Criterion
$X_n \stackrel{a.s.}{\longrightarrow} X \iff \lim\limits_{n\to\infty} P(\sup\limits_{m} |X_{n+m} - X_n| < \varepsilon) = 1 \;\; \forall \varepsilon > 0$.
Proof:
See Rohatgi, page 270, Theorem 5.
Theorem 6.3.8:
If $\sum_{n=1}^{\infty} Var(X_n) < \infty$, then $\sum_{n=1}^{\infty} (X_n - E(X_n))$ converges almost surely.
Proof:
See Rohatgi, page 272, Theorem 6.
Corollary 6.3.9:
Let $\{X_i\}_{i=1}^{\infty}$ be a sequence of independent rv's. Let $\{B_i\}_{i=1}^{\infty}$, $B_i > 0$, $B_i \uparrow \infty$, be a sequence of norming constants. Let $T_n = \sum_{i=1}^{n} X_i$.
If $\sum_{i=1}^{\infty} \frac{Var(X_i)}{B_i^2} < \infty$, then it holds:
$$\frac{T_n - E(T_n)}{B_n} \stackrel{a.s.}{\longrightarrow} 0$$
Proof:
This Corollary follows directly from Theorem 6.3.8 and Lemma 6.3.6.
Lemma 6.3.10: Equivalence Lemma
Let $\{X_i\}_{i=1}^{\infty}$ and $\{X_i'\}_{i=1}^{\infty}$ be sequences of rv's. Let $T_n = \sum_{i=1}^{n} X_i$ and $T_n' = \sum_{i=1}^{n} X_i'$.
If the series $\sum_{i=1}^{\infty} P(X_i \neq X_i') < \infty$, then the sequences $\{X_i\}$ and $\{X_i'\}$ are tail-equivalent and $T_n$ and $T_n'$ are convergence-equivalent, i.e., for $B_n \uparrow \infty$ the sequences $\frac{1}{B_n} T_n$ and $\frac{1}{B_n} T_n'$ converge on the same event and to the same limit, except for a null set.
Proof:
See Rohatgi, page 266, Lemma 1.
Lemma 6.3.11:
Let $X$ be a rv with $E(|X|) < \infty$. Then it holds:
$$\sum_{n=1}^{\infty} P(|X| \geq n) \leq E(|X|) \leq 1 + \sum_{n=1}^{\infty} P(|X| \geq n)$$

Proof:
Continuous case only:
Let $X$ have a pdf $f$. Then it holds:
$$E(|X|) = \int_{-\infty}^{\infty} |x| f(x) \, dx = \sum_{k=0}^{\infty} \int_{k \leq |x| < k+1} |x| f(x) \, dx$$
$$\Longrightarrow \sum_{k=0}^{\infty} k \, P(k \leq |X| < k+1) \leq E(|X|) \leq \sum_{k=0}^{\infty} (k+1) \, P(k \leq |X| < k+1)$$
It is
$$\sum_{k=0}^{\infty} k \, P(k \leq |X| < k+1) = \sum_{k=0}^{\infty} \sum_{n=1}^{k} P(k \leq |X| < k+1) = \sum_{n=1}^{\infty} \sum_{k=n}^{\infty} P(k \leq |X| < k+1) = \sum_{n=1}^{\infty} P(|X| \geq n)$$
Similarly,
$$\sum_{k=0}^{\infty} (k+1) \, P(k \leq |X| < k+1) = \sum_{n=1}^{\infty} P(|X| \geq n) + \sum_{k=0}^{\infty} P(k \leq |X| < k+1) = \sum_{n=1}^{\infty} P(|X| \geq n) + 1$$
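As a quick sanity check of the two bounds, one can take $X \sim Exp(1)$, where both sides are available in closed form: $E(|X|) = 1$ and $P(|X| \geq n) = e^{-n}$, so the tail sum is a geometric series. The following Python sketch (an illustration, not part of the original notes) evaluates both sides:

```python
import math

# For X ~ Exp(1): E|X| = 1 and P(|X| >= n) = exp(-n), a geometric series in n.
tail_sum = sum(math.exp(-n) for n in range(1, 200))   # ~ 1 / (e - 1) ~ 0.582
E_abs_X = 1.0

# The two-sided bound of Lemma 6.3.11:
assert tail_sum <= E_abs_X <= 1 + tail_sum
print(round(tail_sum, 4))
```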
Lecture 03:
Fr 01/12/01

Theorem 6.3.12:
Let $\{X_i\}_{i=1}^{\infty}$ be a sequence of independent rv's. Then it holds:
$$X_n \stackrel{a.s.}{\longrightarrow} 0 \iff \sum_{n=1}^{\infty} P(|X_n| > \varepsilon) < \infty \;\; \forall \varepsilon > 0$$
Proof:
See Rohatgi, page 265, Theorem 3.
Theorem 6.3.13: Kolmogorov's SLLN
Let $\{X_i\}_{i=1}^{\infty}$ be a sequence of iid rv's. Let $T_n = \sum_{i=1}^{n} X_i$. Then it holds:
$$\frac{T_n}{n} = \overline{X}_n \stackrel{a.s.}{\longrightarrow} \mu < \infty \iff E(|X|) < \infty \text{ (and then } \mu = E(X))$$

Proof:
"$\Longrightarrow$":
Suppose that $\overline{X}_n \stackrel{a.s.}{\longrightarrow} \mu < \infty$. It is $T_n = \sum_{i=1}^{n} X_i = \sum_{i=1}^{n-1} X_i + X_n = T_{n-1} + X_n$.
$$\Longrightarrow \frac{X_n}{n} = \underbrace{\frac{T_n}{n}}_{\stackrel{a.s.}{\longrightarrow} \mu} - \underbrace{\frac{n-1}{n}}_{\to 1} \cdot \underbrace{\frac{T_{n-1}}{n-1}}_{\stackrel{a.s.}{\longrightarrow} \mu} \stackrel{a.s.}{\longrightarrow} 0$$
By Theorem 6.3.12, we have
$$\sum_{n=1}^{\infty} P\Big(\Big|\frac{X_n}{n}\Big| \geq 1\Big) < \infty, \text{ i.e., } \sum_{n=1}^{\infty} P(|X_n| \geq n) < \infty.$$
$\stackrel{\text{Lemma 6.3.11}}{\Longrightarrow} E(|X|) < \infty$
$\stackrel{\text{Th. 6.2.7 (WLLN)}}{\Longrightarrow} \overline{X}_n \stackrel{p}{\longrightarrow} E(X)$
Since $\overline{X}_n \stackrel{a.s.}{\longrightarrow} \mu$, it follows by Theorem 6.1.26 that $\overline{X}_n \stackrel{p}{\longrightarrow} \mu$. Therefore, it must hold that $\mu = E(X)$.
"$\Longleftarrow$": Let $E(|X|) < \infty$. Define truncated rv's:
$$X_k' = \begin{cases} X_k, & \text{if } |X_k| \leq k \\ 0, & \text{otherwise} \end{cases} \qquad T_n' = \sum_{k=1}^{n} X_k', \qquad \overline{X}_n' = \frac{T_n'}{n}$$
Then it holds:
$$\sum_{k=1}^{\infty} P(X_k \neq X_k') = \sum_{k=1}^{\infty} P(|X_k| > k) \leq \sum_{k=1}^{\infty} P(|X_k| \geq k) \stackrel{iid}{=} \sum_{k=1}^{\infty} P(|X| \geq k) \stackrel{\text{Lemma 6.3.11}}{\leq} E(|X|) < \infty$$
By Lemma 6.3.10, it follows that $T_n$ and $T_n'$ are convergence-equivalent. Thus, it is sufficient to prove that $\overline{X}_n' \stackrel{a.s.}{\longrightarrow} E(X)$.
We now establish the conditions needed in Corollary 6.3.9. It is
$$Var(X_n') \leq E((X_n')^2) = \int_{-n}^{n} x^2 f_X(x) \, dx = \sum_{k=0}^{n-1} \int_{k \leq |x| < k+1} x^2 f_X(x) \, dx \leq \sum_{k=0}^{n-1} (k+1)^2 \int_{k \leq |x| < k+1} f_X(x) \, dx$$
$$= \sum_{k=0}^{n-1} (k+1)^2 P(k \leq |X| < k+1) \leq \sum_{k=0}^{n} (k+1)^2 P(k \leq |X| < k+1)$$
$$\Longrightarrow \sum_{n=1}^{\infty} \frac{1}{n^2} Var(X_n') \leq \sum_{n=1}^{\infty} \sum_{k=0}^{n} \frac{(k+1)^2}{n^2} P(k \leq |X| < k+1)$$
$$= \sum_{n=1}^{\infty} \sum_{k=1}^{n} \frac{(k+1)^2}{n^2} P(k \leq |X| < k+1) + \sum_{n=1}^{\infty} \frac{1}{n^2} P(0 \leq |X| < 1)$$
$$\stackrel{(*)}{\leq} \sum_{k=1}^{\infty} (k+1)^2 P(k \leq |X| < k+1) \Bigg(\sum_{n=k}^{\infty} \frac{1}{n^2}\Bigg) + 2 P(0 \leq |X| < 1) \quad (A)$$
$(*)$ holds since $\sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6} \approx 1.645 < 2$ and the first two sums can be rearranged as follows:

    n | k                   k | n
    1 | 1                   1 | 1, 2, 3, ...
    2 | 1, 2       =>       2 | 2, 3, ...
    3 | 1, 2, 3             3 | 3, ...
    : | :                   : | :
It is
$$\sum_{n=k}^{\infty} \frac{1}{n^2} = \frac{1}{k^2} + \frac{1}{(k+1)^2} + \frac{1}{(k+2)^2} + \ldots \leq \frac{1}{k^2} + \frac{1}{k(k+1)} + \frac{1}{(k+1)(k+2)} + \ldots = \frac{1}{k^2} + \sum_{n=k+1}^{\infty} \frac{1}{n(n-1)}$$
From Bronstein, page 30, # 7, we know that
$$1 = \frac{1}{1 \cdot 2} + \frac{1}{2 \cdot 3} + \frac{1}{3 \cdot 4} + \ldots + \frac{1}{n(n+1)} + \ldots = \frac{1}{1 \cdot 2} + \frac{1}{2 \cdot 3} + \frac{1}{3 \cdot 4} + \ldots + \frac{1}{(k-1) k} + \sum_{n=k+1}^{\infty} \frac{1}{n(n-1)}$$
$$\Longrightarrow \sum_{n=k+1}^{\infty} \frac{1}{n(n-1)} = 1 - \frac{1}{1 \cdot 2} - \frac{1}{2 \cdot 3} - \frac{1}{3 \cdot 4} - \ldots - \frac{1}{(k-1) k} = \frac{1}{2} - \frac{1}{2 \cdot 3} - \frac{1}{3 \cdot 4} - \ldots - \frac{1}{(k-1) k}$$
$$= \frac{1}{3} - \frac{1}{3 \cdot 4} - \ldots - \frac{1}{(k-1) k} = \frac{1}{4} - \ldots - \frac{1}{(k-1) k} = \ldots = \frac{1}{k}$$
$$\Longrightarrow \sum_{n=k}^{\infty} \frac{1}{n^2} \leq \frac{1}{k^2} + \sum_{n=k+1}^{\infty} \frac{1}{n(n-1)} = \frac{1}{k^2} + \frac{1}{k} \leq \frac{2}{k}$$
Using this result in (A), we get
$$\sum_{n=1}^{\infty} \frac{1}{n^2} Var(X_n') \leq 2 \sum_{k=1}^{\infty} \frac{(k+1)^2}{k} P(k \leq |X| < k+1) + 2 P(0 \leq |X| < 1)$$
$$= 2 \sum_{k=0}^{\infty} k \, P(k \leq |X| < k+1) + 4 \sum_{k=1}^{\infty} P(k \leq |X| < k+1) + 2 \sum_{k=1}^{\infty} \frac{1}{k} P(k \leq |X| < k+1) + 2 P(0 \leq |X| < 1)$$
$$\stackrel{(B)}{\leq} 2 E(|X|) + 4 + 2 + 2 < \infty$$
To establish (B), we use a relation from the Proof of Lemma 6.3.11, i.e.,
$$\sum_{k=0}^{\infty} k \, P(k \leq |X| < k+1) \stackrel{\text{Proof}}{=} \sum_{n=1}^{\infty} P(|X| \geq n) \stackrel{\text{Lemma 6.3.11}}{\leq} E(|X|)$$
Thus, the conditions needed in Corollary 6.3.9 are met. With $B_n = n$, it follows that
$$\frac{1}{n} T_n' - \frac{1}{n} E(T_n') \stackrel{a.s.}{\longrightarrow} 0 \quad (C)$$
Since $E(X_n') \to E(X)$ as $n \to \infty$, it follows by Kronecker's Lemma (6.3.6) that $\frac{1}{n} E(T_n') \to E(X)$. Thus, when we replace $\frac{1}{n} E(T_n')$ by $E(X)$ in (C), we get
$$\frac{1}{n} T_n' \stackrel{a.s.}{\longrightarrow} E(X) \stackrel{\text{Lemma 6.3.10}}{\Longrightarrow} \frac{1}{n} T_n \stackrel{a.s.}{\longrightarrow} E(X)$$
since $T_n$ and $T_n'$ are convergence-equivalent.
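Kolmogorov's SLLN can be watched at work by simulation. The following Python sketch (an illustration, not part of the original notes; the Uniform(0,1) population, the seed, and the sample sizes are arbitrary choices) shows the gap $|\overline{X}_n - E(X)|$ shrinking as $n$ grows:

```python
import random

random.seed(42)

def mean_gap(n, mu=0.5):
    # |sample mean - E(X)| for n iid Uniform(0,1) draws; E(|X|) < infinity,
    # so Kolmogorov's SLLN applies with E(X) = 0.5.
    total = sum(random.random() for _ in range(n))
    return abs(total / n - mu)

gaps = [mean_gap(n) for n in (100, 10_000, 1_000_000)]
print([round(g, 4) for g in gaps])
assert gaps[-1] < 0.01     # the sample mean is close to E(X) for large n
```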
Lecture 04:
We 01/17/01

6.4 Central Limit Theorems
Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of rv's with cdf's $\{F_n\}_{n=1}^{\infty}$. Suppose that the mgf $M_n(t)$ of $X_n$ exists.
Questions: Does $M_n(t)$ converge? Does it converge to a mgf $M(t)$? If it does converge, does it hold that $X_n \stackrel{d}{\longrightarrow} X$ for some rv $X$?

Example 6.4.1:
Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of rv's such that $P(X_n = -n) = 1$. Then the mgf is $M_n(t) = E(e^{tX_n}) = e^{-tn}$. So
$$\lim_{n\to\infty} M_n(t) = \begin{cases} 0, & t > 0 \\ 1, & t = 0 \\ \infty, & t < 0 \end{cases}$$
So $M_n(t)$ does not converge to a mgf, and $F_n(x) \to F(x) = 1 \;\forall x$. But $F(x)$ is not a cdf.

Note:
Due to Example 6.4.1, the existence of mgf's $M_n(t)$ that converge to something is not enough to conclude convergence in distribution.
Conversely, suppose that $X_n$ has mgf $M_n(t)$, $X$ has mgf $M(t)$, and $X_n \stackrel{d}{\longrightarrow} X$. Does it hold that $M_n(t) \to M(t)$?
Not necessarily! See Rohatgi, page 277, Example 2, as a counterexample. Thus, convergence in distribution of rv's that all have mgf's does not imply the convergence of mgf's.
However, we can make the following statement in the next Theorem:

Theorem 6.4.2: Continuity Theorem
Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of rv's with cdf's $\{F_n\}_{n=1}^{\infty}$ and mgf's $\{M_n(t)\}_{n=1}^{\infty}$. Suppose that $M_n(t)$ exists for $|t| \leq t_0 \;\forall n$. If there exists a rv $X$ with cdf $F$ and mgf $M(t)$, which exists for $|t| \leq t_1 < t_0$, such that $\lim\limits_{n\to\infty} M_n(t) = M(t) \;\forall t \in [-t_1, t_1]$, then $F_n \stackrel{w}{\longrightarrow} F$, i.e., $X_n \stackrel{d}{\longrightarrow} X$.
Example 6.4.3:
Let $X_n \sim Bin(n, \frac{\lambda}{n})$. Recall (e.g., from Theorem 3.3.12 and related Theorems) that for $X \sim Bin(n, p)$ the mgf is $M_X(t) = (1 - p + p e^t)^n$. Thus,
$$M_n(t) = \Big(1 - \frac{\lambda}{n} + \frac{\lambda}{n} e^t\Big)^n = \Big(1 + \frac{\lambda(e^t - 1)}{n}\Big)^n \stackrel{(*)}{\longrightarrow} e^{\lambda(e^t - 1)} \text{ as } n \to \infty.$$
In $(*)$ we use the fact that $\lim\limits_{n\to\infty} (1 + \frac{x}{n})^n = e^x$. Recall that $e^{\lambda(e^t - 1)}$ is the mgf of a rv $X$ where $X \sim Poisson(\lambda)$. Thus, we have established the well-known result that the Binomial distribution approaches the Poisson distribution, given that $n \to \infty$ in such a way that $np = \lambda > 0$.
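The convergence in Example 6.4.3 can also be observed directly on the pmf's. The following Python sketch (an illustration, not part of the original notes; $\lambda = 3$ and the grid of $n$ values are arbitrary choices) compares the $Bin(n, \frac{\lambda}{n})$ and $Poisson(\lambda)$ pmf's:

```python
import math

def binom_pmf(n, p, j):
    return math.comb(n, j) * p**j * (1 - p)**(n - j)

def poisson_pmf(lam, j):
    return math.exp(-lam) * lam**j / math.factorial(j)

lam = 3.0
for n in (10, 100, 1000):
    # Largest pointwise difference over the first few support points.
    max_diff = max(abs(binom_pmf(n, lam / n, j) - poisson_pmf(lam, j)) for j in range(10))
    print(n, round(max_diff, 5))

# The pointwise difference shrinks as n grows, as Example 6.4.3 predicts.
assert max(abs(binom_pmf(1000, lam / 1000, j) - poisson_pmf(lam, j)) for j in range(10)) < 1e-3
```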
Note:
Recall Theorem 3.3.11: Suppose that $\{X_n\}_{n=1}^{\infty}$ is a sequence of rv's with characteristic functions $\{\phi_n(t)\}_{n=1}^{\infty}$. Suppose that
$$\lim_{n\to\infty} \phi_n(t) = \phi(t) \;\forall t \in (-h, h) \text{ for some } h > 0,$$
and $\phi(t)$ is the characteristic function of a rv $X$. Then $X_n \stackrel{d}{\longrightarrow} X$.
Theorem 6.4.4: Lindeberg-Lévy Central Limit Theorem
Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of iid rv's with $E(X_i) = \mu$ and $0 < Var(X_i) = \sigma^2 < \infty$. Then it holds for $\overline{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$ that
$$\frac{\sqrt{n}(\overline{X}_n - \mu)}{\sigma} \stackrel{d}{\longrightarrow} Z$$
where $Z \sim N(0, 1)$.
Proof:
Let $Z \sim N(0, 1)$. According to Theorem 3.3.12 (v), the characteristic function of $Z$ is $\phi_Z(t) = \exp(-\frac{1}{2} t^2)$.
Let $\phi(t)$ be the characteristic function of $X_i$. We now determine the characteristic function $\phi_n(t)$ of $\frac{\sqrt{n}(\overline{X}_n - \mu)}{\sigma}$:
$$\phi_n(t) = E\Bigg[\exp\Bigg(it \, \frac{\sqrt{n}(\frac{1}{n}\sum_{i=1}^{n} X_i - \mu)}{\sigma}\Bigg)\Bigg] = \int_{-\infty}^{\infty} \ldots \int_{-\infty}^{\infty} \exp\Bigg(it \, \frac{\sqrt{n}(\frac{1}{n}\sum_{i=1}^{n} x_i - \mu)}{\sigma}\Bigg) dF_X$$
$$= \exp\Big(-\frac{it\sqrt{n}\mu}{\sigma}\Big) \int_{-\infty}^{\infty} \exp\Big(\frac{itx_1}{\sqrt{n}\sigma}\Big) dF_{X_1} \ldots \int_{-\infty}^{\infty} \exp\Big(\frac{itx_n}{\sqrt{n}\sigma}\Big) dF_{X_n} = \Bigg[\phi\Big(\frac{t}{\sqrt{n}\sigma}\Big) \exp\Big(-\frac{it\mu}{\sqrt{n}\sigma}\Big)\Bigg]^n$$
Recall from Theorem 3.3.5 that if the $k$th moment exists, then $\phi^{(k)}(0) = i^k E(X^k)$. Also recall the definition of a Taylor series in MacLaurin's form:
$$f(x) = f(0) + \frac{f'(0)}{1!} x + \frac{f''(0)}{2!} x^2 + \frac{f'''(0)}{3!} x^3 + \ldots + \frac{f^{(n)}(0)}{n!} x^n + \ldots,$$
e.g.,
$$f(x) = e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \ldots$$
Thus, if we develop a Taylor series for $\phi(s)$ around $s = 0$ and evaluate it at $s = \frac{t}{\sqrt{n}\sigma}$, we get:
$$\phi\Big(\frac{t}{\sqrt{n}\sigma}\Big) = \phi(0) + \frac{t}{\sqrt{n}\sigma}\phi'(0) + \frac{1}{2}\Big(\frac{t}{\sqrt{n}\sigma}\Big)^2 \phi''(0) + \ldots = 1 + \frac{ti\mu}{\sqrt{n}\sigma} - \frac{1}{2} t^2 \, \frac{\mu^2 + \sigma^2}{n\sigma^2} + o\Bigg(\Big(\frac{t}{\sqrt{n}\sigma}\Big)^2\Bigg)$$
Here we make use of the Landau symbol "o". In general, if we write $u(x) = o(v(x))$ for $x \to L$, this implies $\lim\limits_{x\to L} \frac{u(x)}{v(x)} = 0$, i.e., $u(x)$ goes to 0 faster than $v(x)$, or $v(x)$ goes to $\infty$ faster than $u(x)$. We say that $u(x)$ is of smaller order than $v(x)$ as $x \to L$. Examples are $\frac{1}{x^3} = o(\frac{1}{x^2})$ and $x^2 = o(x^3)$ for $x \to \infty$. See Rohatgi, page 6, for more details on the Landau symbols "O" and "o".
Similarly, if we develop a Taylor series for $\exp(-\frac{it\mu}{\sqrt{n}\sigma})$, we get:
$$\exp\Big(-\frac{it\mu}{\sqrt{n}\sigma}\Big) = 1 - \frac{ti\mu}{\sqrt{n}\sigma} - \frac{1}{2} \frac{t^2\mu^2}{n\sigma^2} + o\Bigg(\Big(\frac{t}{\sqrt{n}\sigma}\Big)^2\Bigg)$$
Combining these results, we get:
$$\phi_n(t) = \Bigg[\Bigg(1 + \frac{ti\mu}{\sqrt{n}\sigma} - \frac{1}{2} t^2 \frac{\mu^2+\sigma^2}{n\sigma^2} + o\Big(\Big(\frac{t}{\sqrt{n}\sigma}\Big)^2\Big)\Bigg)\Bigg(1 - \frac{ti\mu}{\sqrt{n}\sigma} - \frac{1}{2} \frac{t^2\mu^2}{n\sigma^2} + o\Big(\Big(\frac{t}{\sqrt{n}\sigma}\Big)^2\Big)\Bigg)\Bigg]^n$$
$$= \Bigg[1 - \frac{ti\mu}{\sqrt{n}\sigma} - \frac{1}{2}\frac{t^2\mu^2}{n\sigma^2} + \frac{ti\mu}{\sqrt{n}\sigma} + \frac{t^2\mu^2}{n\sigma^2} - \frac{1}{2} t^2 \frac{\mu^2+\sigma^2}{n\sigma^2} + o\Big(\Big(\frac{t}{\sqrt{n}\sigma}\Big)^2\Big)\Bigg]^n$$
$$= \Bigg[1 - \frac{1}{2}\frac{t^2}{n} + o\Big(\Big(\frac{t}{\sqrt{n}\sigma}\Big)^2\Big)\Bigg]^n \stackrel{(*)}{\longrightarrow} \exp\Big(-\frac{t^2}{2}\Big) \text{ as } n \to \infty$$
Thus, $\lim\limits_{n\to\infty} \phi_n(t) = \phi_Z(t) \;\forall t$. For a proof of $(*)$, see Rohatgi, page 278, Lemma 1. According to the Note above, it holds that
$$\frac{\sqrt{n}(\overline{X}_n - \mu)}{\sigma} \stackrel{d}{\longrightarrow} Z.$$
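A small Monte Carlo sketch (an illustration, not part of the original notes; the Uniform(0,1) population, $n = 50$, the seed, and the 20,000 replications are arbitrary choices) shows Theorem 6.4.4 at work: the standardized sample mean puts roughly 95% of its mass in $[-1.96, 1.96]$, as a $N(0,1)$ rv does:

```python
import math
import random

random.seed(1)

def standardized_mean(n, mu=0.5, sigma=math.sqrt(1 / 12)):
    # sqrt(n) * (sample mean - mu) / sigma for n iid Uniform(0,1) draws.
    xbar = sum(random.random() for _ in range(n)) / n
    return math.sqrt(n) * (xbar - mu) / sigma

reps = 20_000
zs = [standardized_mean(50) for _ in range(reps)]
frac_within = sum(abs(z) <= 1.96 for z in zs) / reps
print(round(frac_within, 3))    # close to the N(0,1) value P(|Z| <= 1.96) ~ 0.95
assert abs(frac_within - 0.95) < 0.02
```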
Lecture 05:
Fr 01/19/01

Definition 6.4.5:
Let $X_1, X_2$ be iid non-degenerate rv's with common cdf $F$. Let $a_1, a_2 > 0$. We say that $F$ is stable if there exist constants $A$ and $B$ (depending on $a_1$ and $a_2$) such that $B^{-1}(a_1 X_1 + a_2 X_2 - A)$ also has cdf $F$.

Note:
When generalizing the previous definition to sequences of rv's, we have the following examples of stable distributions:
- $X_i$ iid Cauchy. Then $\frac{1}{n} \sum_{i=1}^{n} X_i \sim$ Cauchy (here $B_n = n$, $A_n = 0$).
- $X_i$ iid $N(0, 1)$. Then $\frac{1}{\sqrt{n}} \sum_{i=1}^{n} X_i \sim N(0, 1)$ (here $B_n = \sqrt{n}$, $A_n = 0$).

Definition 6.4.6:
Let $\{X_i\}_{i=1}^{\infty}$ be a sequence of iid rv's with common cdf $F$. Let $T_n = \sum_{i=1}^{n} X_i$. $F$ belongs to the domain of attraction of a distribution $V$ if there exist norming and centering constants $\{B_n\}_{n=1}^{\infty}$, $B_n > 0$, and $\{A_n\}_{n=1}^{\infty}$ such that
$$P(B_n^{-1}(T_n - A_n) \leq x) = F_{B_n^{-1}(T_n - A_n)}(x) \to V(x) \text{ as } n \to \infty$$
at all continuity points $x$ of $V$.

Note:
A very general Theorem from Loève states that only stable distributions can have domains of attraction. From the practical point of view, a wide class of distributions $F$ belongs to the domain of attraction of the Normal distribution.
Theorem 6.4.7: Lindeberg Central Limit Theorem
Let $\{X_i\}_{i=1}^{\infty}$ be a sequence of independent non-degenerate rv's with cdf's $\{F_i\}_{i=1}^{\infty}$. Assume that $E(X_k) = \mu_k$ and $Var(X_k) = \sigma_k^2 < \infty$. Let $s_n^2 = \sum_{k=1}^{n} \sigma_k^2$.
If the $F_k$ are absolutely continuous with pdf's $f_k = F_k'$, assume that it holds for all $\varepsilon > 0$ that
$$(A) \quad \lim_{n\to\infty} \frac{1}{s_n^2} \sum_{k=1}^{n} \int_{\{|x - \mu_k| > \varepsilon s_n\}} (x - \mu_k)^2 F_k'(x) \, dx = 0.$$
If the $X_k$ are discrete rv's with support $\{x_{kl}\}$ and probabilities $\{p_{kl}\}$, $l = 1, 2, \ldots$, assume that it holds for all $\varepsilon > 0$ that
$$(B) \quad \lim_{n\to\infty} \frac{1}{s_n^2} \sum_{k=1}^{n} \sum_{|x_{kl} - \mu_k| > \varepsilon s_n} (x_{kl} - \mu_k)^2 p_{kl} = 0.$$
The conditions (A) and (B) are called the Lindeberg Condition (LC). If either LC holds, then
$$\frac{\sum_{k=1}^{n} (X_k - \mu_k)}{s_n} \stackrel{d}{\longrightarrow} Z$$
where $Z \sim N(0, 1)$.

Proof:
Similar to the proof of Theorem 6.4.4, we can use characteristic functions again. An alternative proof is given in Rohatgi, pages 282-288.

Note:
Feller shows that the LC is a necessary condition if $\frac{\sigma_n^2}{s_n^2} \to 0$ and $s_n^2 \to \infty$ as $n \to \infty$.
Corollary 6.4.8:
Let $\{X_i\}_{i=1}^{\infty}$ be a sequence of iid rv's such that $\frac{1}{\sqrt{n}} \sum_{i=1}^{n} X_i$ has the same distribution for all $n$. If $E(X_i) = 0$ and $Var(X_i) = 1$, then $X_i \sim N(0, 1)$.

Proof:
Let $F$ be the common cdf of $\frac{1}{\sqrt{n}} \sum_{i=1}^{n} X_i$ for all $n$ (including $n = 1$). By the CLT,
$$\lim_{n\to\infty} P\Bigg(\frac{1}{\sqrt{n}} \sum_{i=1}^{n} X_i \leq x\Bigg) = \Phi(x),$$
where $\Phi(x)$ denotes $P(Z \leq x)$ for $Z \sim N(0, 1)$. Also, $P(\frac{1}{\sqrt{n}} \sum_{i=1}^{n} X_i \leq x) = F(x)$ for each $n$. Therefore, we must have $F(x) = \Phi(x)$.
Note:
In general, if $X_1, X_2, \ldots$ are independent rv's such that there exists a constant $A$ with $P(|X_n| \leq A) = 1 \;\forall n$, then the LC is satisfied if $s_n^2 \to \infty$ as $n \to \infty$. Why?
Suppose that $s_n^2 \to \infty$ as $n \to \infty$. Since the $|X_k|$'s are uniformly bounded (by $A$), so are the rv's $(X_k - E(X_k))$. Thus, for every $\varepsilon > 0$ there exists an $N_\varepsilon$ such that if $n \geq N_\varepsilon$ then
$$P(|X_k - E(X_k)| < \varepsilon s_n, \; k = 1, \ldots, n) = 1.$$
This implies that the LC holds since we would integrate (or sum) over the empty set, i.e., the set $\{|x - \mu_k| > \varepsilon s_n\} = \emptyset$.
The converse also holds. For a sequence of uniformly bounded independent rv's, a necessary and sufficient condition for the CLT to hold is that $s_n^2 \to \infty$ as $n \to \infty$.
Example 6.4.9:
Let $\{X_i\}_{i=1}^{\infty}$ be a sequence of independent rv's such that $E(X_k) = 0$, $\varrho_k = E(|X_k|^{2+\delta}) < \infty$ for some $\delta > 0$, and $\sum_{k=1}^{n} \varrho_k = o(s_n^{2+\delta})$.
Does the LC hold? It is:
$$\frac{1}{s_n^2} \sum_{k=1}^{n} \int_{\{|x| > \varepsilon s_n\}} x^2 f_k(x) \, dx \stackrel{(A)}{\leq} \frac{1}{s_n^2} \sum_{k=1}^{n} \int_{\{|x| > \varepsilon s_n\}} \frac{|x|^{2+\delta}}{\varepsilon^\delta s_n^\delta} f_k(x) \, dx \leq \frac{1}{s_n^2 \varepsilon^\delta s_n^\delta} \sum_{k=1}^{n} \int_{-\infty}^{\infty} |x|^{2+\delta} f_k(x) \, dx$$
$$= \frac{1}{s_n^2 \varepsilon^\delta s_n^\delta} \sum_{k=1}^{n} \varrho_k = \frac{1}{\varepsilon^\delta} \Bigg(\frac{\sum_{k=1}^{n} \varrho_k}{s_n^{2+\delta}}\Bigg) \stackrel{(B)}{\longrightarrow} 0 \text{ as } n \to \infty$$
$(A)$ holds since for $|x| > \varepsilon s_n$, it is $\frac{|x|^\delta}{\varepsilon^\delta s_n^\delta} > 1$. $(B)$ holds since $\sum_{k=1}^{n} \varrho_k = o(s_n^{2+\delta})$.
Thus, the LC is satisfied and the CLT holds.
Note:
(i) In general, if there exists a $\delta > 0$ such that
$$\frac{\sum_{k=1}^{n} E(|X_k - \mu_k|^{2+\delta})}{s_n^{2+\delta}} \longrightarrow 0 \text{ as } n \to \infty,$$
then the LC holds.
(ii) Both the CLT and the WLLN hold for a large class of sequences of rv's $\{X_i\}_{i=1}^{n}$. If the $\{X_i\}$'s are independent uniformly bounded rv's, i.e., if $P(|X_n| \leq M) = 1 \;\forall n$, the WLLN (as formulated in Theorem 6.2.3) holds. The CLT holds provided that $s_n^2 \to \infty$ as $n \to \infty$.
If the rv's $\{X_i\}$ are iid, then the CLT is a stronger result than the WLLN since the CLT provides an estimate of the probability $P(\frac{1}{n} |\sum_{i=1}^{n} X_i - n\mu| \geq \varepsilon) \approx 1 - P(|Z| \leq \frac{\varepsilon}{\sigma} \sqrt{n})$, where $Z \sim N(0, 1)$, and the WLLN follows. However, note that the CLT requires the existence of a 2nd moment while the WLLN does not.
(iii) If the $\{X_i\}$ are independent (but not identically distributed) rv's, the CLT may apply while the WLLN does not.
(iv) See Rohatgi, pages 289-293, for additional details and examples.
7 Sample Moments

7.1 Random Sampling

Definition 7.1.1:
Let $X_1, \ldots, X_n$ be iid rv's with common cdf $F$. We say that $\{X_1, \ldots, X_n\}$ is a (random) sample of size $n$ from the population distribution $F$. The vector of values $\{x_1, \ldots, x_n\}$ is called a realization of the sample. A rv $g(X_1, \ldots, X_n)$ which is a Borel-measurable function of $X_1, \ldots, X_n$ and does not depend on any unknown parameter is called a (sample) statistic.

Definition 7.1.2:
Let $X_1, \ldots, X_n$ be a sample of size $n$ from a population with distribution $F$. Then
$$\overline{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$
is called the sample mean and
$$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \overline{X})^2 = \frac{1}{n-1} \Bigg(\sum_{i=1}^{n} X_i^2 - n \overline{X}^2\Bigg)$$
is called the sample variance.

Definition 7.1.3:
Let $X_1, \ldots, X_n$ be a sample of size $n$ from a population with distribution $F$. The function
$$\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} I_{(-\infty, x]}(X_i)$$
is called the empirical cumulative distribution function (empirical cdf).

Note:
For any fixed $x \in I\!R$, $\hat{F}_n(x)$ is a rv.

Theorem 7.1.4:
The rv $\hat{F}_n(x)$ has pmf
$$P\Big(\hat{F}_n(x) = \frac{j}{n}\Big) = \binom{n}{j} (F(x))^j (1 - F(x))^{n-j}, \quad j \in \{0, 1, \ldots, n\},$$
with $E(\hat{F}_n(x)) = F(x)$ and $Var(\hat{F}_n(x)) = \frac{F(x)(1 - F(x))}{n}$.
Proof:
It is $I_{(-\infty, x]}(X_i) \sim Bin(1, F(x))$. Then $n \hat{F}_n(x) \sim Bin(n, F(x))$.
The results follow immediately.
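Since $n \hat{F}_n(x) \sim Bin(n, F(x))$, the mean and variance in Theorem 7.1.4 can be verified exactly by enumerating the binomial pmf. A Python sketch (an illustration, not part of the original notes; $n = 20$ and $F(x) = 0.3$ are arbitrary choices):

```python
import math

def ecdf_moments(n, F_x):
    """Exact mean and variance of F_hat_n(x), using n * F_hat_n(x) ~ Bin(n, F(x))."""
    pmf = [math.comb(n, j) * F_x**j * (1 - F_x)**(n - j) for j in range(n + 1)]
    mean = sum((j / n) * p for j, p in enumerate(pmf))
    var = sum((j / n - mean) ** 2 * p for j, p in enumerate(pmf))
    return mean, var

n, F_x = 20, 0.3
mean, var = ecdf_moments(n, F_x)
assert abs(mean - F_x) < 1e-12                    # E(F_hat_n(x)) = F(x)
assert abs(var - F_x * (1 - F_x) / n) < 1e-12     # Var(F_hat_n(x)) = F(x)(1 - F(x))/n
print(round(mean, 4), round(var, 6))
```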
Corollary 7.1.5:
By the WLLN, it follows that
$$\hat{F}_n(x) \stackrel{p}{\longrightarrow} F(x).$$

Corollary 7.1.6:
By the CLT, it follows that
$$\frac{\sqrt{n}(\hat{F}_n(x) - F(x))}{\sqrt{F(x)(1 - F(x))}} \stackrel{d}{\longrightarrow} Z,$$
where $Z \sim N(0, 1)$.

Theorem 7.1.7: Glivenko-Cantelli Theorem
$\hat{F}_n(x)$ converges uniformly to $F(x)$, i.e., it holds for all $\varepsilon > 0$ that
$$\lim_{n\to\infty} P\Big(\sup_{-\infty < x < \infty} |\hat{F}_n(x) - F(x)| > \varepsilon\Big) = 0.$$
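The uniform convergence in the Glivenko-Cantelli Theorem can be watched at work for a Uniform(0,1) sample, where $F(x) = x$ and the supremum is attained just before or at the jump points of $\hat{F}_n$. A Python sketch (an illustration, not part of the original notes; the seed and sample sizes are arbitrary choices):

```python
import random

random.seed(7)

def sup_distance_uniform(n):
    # sup_x |F_hat_n(x) - x| for a Uniform(0,1) sample (F(x) = x); for sorted data
    # the supremum occurs just before or at the jump points of F_hat_n.
    xs = sorted(random.random() for _ in range(n))
    return max(max(abs((i + 1) / n - x), abs(i / n - x)) for i, x in enumerate(xs))

print([round(sup_distance_uniform(n), 3) for n in (100, 10_000)])
assert sup_distance_uniform(100_000) < 0.01   # the sup distance shrinks with n
```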
Lecture 06:
Mo 01/22/01

Definition 7.1.8:
Let $X_1, \ldots, X_n$ be a sample of size $n$ from a population with distribution $F$. We call
$$a_k = \frac{1}{n} \sum_{i=1}^{n} X_i^k$$
the sample moment of order $k$ and
$$b_k = \frac{1}{n} \sum_{i=1}^{n} (X_i - a_1)^k = \frac{1}{n} \sum_{i=1}^{n} (X_i - \overline{X})^k$$
the sample central moment of order $k$.

Note:
It is $b_1 = 0$ and $b_2 = \frac{n-1}{n} S^2$.
Theorem 7.1.9:
Let $X_1, \ldots, X_n$ be a sample of size $n$ from a population with distribution $F$. Assume that $E(X) = \mu$, $Var(X) = \sigma^2$, and $E((X - \mu)^k) = \mu_k$ exist. Then it holds:
(i) $E(a_1) = E(\overline{X}) = \mu$
(ii) $Var(a_1) = Var(\overline{X}) = \frac{\sigma^2}{n}$
(iii) $E(b_2) = \frac{n-1}{n} \sigma^2$
(iv) $Var(b_2) = \frac{\mu_4 - \mu_2^2}{n} - \frac{2(\mu_4 - 2\mu_2^2)}{n^2} + \frac{\mu_4 - 3\mu_2^2}{n^3}$
(v) $E(S^2) = \sigma^2$
(vi) $Var(S^2) = \frac{\mu_4}{n} - \frac{n-3}{n(n-1)} \mu_2^2$

Proof:
(i)
$$E(\overline{X}) = \frac{1}{n} \sum_{i=1}^{n} E(X_i) = \frac{n}{n} \mu = \mu$$
(ii)
$$Var(\overline{X}) = \Big(\frac{1}{n}\Big)^2 \sum_{i=1}^{n} Var(X_i) = \frac{\sigma^2}{n}$$
(iii)
$$E(b_2) = E\Bigg(\frac{1}{n} \sum_{i=1}^{n} X_i^2 - \frac{1}{n^2} \Big(\sum_{i=1}^{n} X_i\Big)^2\Bigg) = E(X^2) - \frac{1}{n^2} E\Bigg(\sum_{i=1}^{n} X_i^2 + \sum_{i \neq j} X_i X_j\Bigg)$$
$$\stackrel{(*)}{=} E(X^2) - \frac{1}{n^2} \big(n E(X^2) + n(n-1)\mu^2\big) = \frac{n-1}{n} \big(E(X^2) - \mu^2\big) = \frac{n-1}{n} \sigma^2$$
$(*)$ holds since $X_i$ and $X_j$ are independent and then, due to Theorem 4.5.3, it holds that $E(X_i X_j) = E(X_i) E(X_j)$.
See Rohatgi, pages 303-306, for the proof of parts (iv) through (vi) and results regarding the 3rd and 4th moments and covariances.
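The identities $E(S^2) = \sigma^2$ and $E(b_2) = \frac{n-1}{n}\sigma^2$ can be verified exactly on a tiny discrete population by enumerating all equally likely samples. A Python sketch (an illustration, not part of the original notes; the population $\{0, 1, 2\}$ and $n = 2$ are arbitrary choices):

```python
from itertools import product
from statistics import mean

# Enumerate all equally likely ordered samples of size n = 2 from the
# uniform distribution on {0, 1, 2}, and average S^2 and b_2 over them.
support = [0, 1, 2]
sigma2 = mean((x - mean(support)) ** 2 for x in support)   # population variance = 2/3

n = 2
s2_vals, b2_vals = [], []
for sample in product(support, repeat=n):
    xbar = mean(sample)
    ss = sum((x - xbar) ** 2 for x in sample)
    s2_vals.append(ss / (n - 1))    # sample variance S^2
    b2_vals.append(ss / n)          # sample central moment b_2

assert abs(mean(s2_vals) - sigma2) < 1e-12              # E(S^2) = sigma^2
assert abs(mean(b2_vals) - (n - 1) / n * sigma2) < 1e-12  # E(b_2) = (n-1)/n sigma^2
```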
7.2 Sample Moments and the Normal Distribution

Theorem 7.2.1:
Let $X_1, \ldots, X_n$ be iid $N(\mu, \sigma^2)$ rv's. Then $\overline{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$ and $(X_1 - \overline{X}, \ldots, X_n - \overline{X})$ are independent.

Proof:
By computing the joint mgf of $(\overline{X}, X_1 - \overline{X}, X_2 - \overline{X}, \ldots, X_n - \overline{X})$, we can use Theorem 4.6.3 (iv) to show independence. We will use the following two facts:
(1):
$$M_{\overline{X}}(t) = M_{\frac{1}{n} \sum_{i=1}^{n} X_i}(t) \stackrel{(A)}{=} \prod_{i=1}^{n} M_{X_i}\Big(\frac{t}{n}\Big) \stackrel{(B)}{=} \Bigg[\exp\Big(\frac{t}{n}\mu + \frac{\sigma^2 t^2}{2n^2}\Big)\Bigg]^n = \exp\Bigg(t\mu + \frac{\sigma^2 t^2}{2n}\Bigg)$$
$(A)$ holds by Theorem 4.6.4 (i). $(B)$ follows from Theorem 3.3.12 (vi) since the $X_i$'s are iid.
(2):
$$M_{X_1 - \overline{X}, X_2 - \overline{X}, \ldots, X_n - \overline{X}}(t_1, t_2, \ldots, t_n) \stackrel{\text{Def. 4.6.1}}{=} E\Bigg[\exp\Bigg(\sum_{i=1}^{n} t_i (X_i - \overline{X})\Bigg)\Bigg] = E\Bigg[\exp\Bigg(\sum_{i=1}^{n} t_i X_i - \overline{X} \sum_{i=1}^{n} t_i\Bigg)\Bigg]$$
$$= E\Bigg[\exp\Bigg(\sum_{i=1}^{n} X_i (t_i - \overline{t})\Bigg)\Bigg], \text{ where } \overline{t} = \frac{1}{n} \sum_{i=1}^{n} t_i$$
$$= E\Bigg[\prod_{i=1}^{n} \exp\big(X_i (t_i - \overline{t})\big)\Bigg] \stackrel{(C)}{=} \prod_{i=1}^{n} E\big(\exp(X_i (t_i - \overline{t}))\big) = \prod_{i=1}^{n} M_{X_i}(t_i - \overline{t})$$
$$\stackrel{(D)}{=} \prod_{i=1}^{n} \exp\Bigg(\mu(t_i - \overline{t}) + \frac{\sigma^2 (t_i - \overline{t})^2}{2}\Bigg) = \exp\Bigg(\mu \underbrace{\sum_{i=1}^{n} (t_i - \overline{t})}_{=0} + \frac{\sigma^2}{2} \sum_{i=1}^{n} (t_i - \overline{t})^2\Bigg) = \exp\Bigg(\frac{\sigma^2}{2} \sum_{i=1}^{n} (t_i - \overline{t})^2\Bigg)$$
$(C)$ follows from Theorem 4.5.3 since the $X_i$'s are independent. $(D)$ holds since we evaluate $M_X(h) = \exp(\mu h + \frac{\sigma^2 h^2}{2})$ for $h = t_i - \overline{t}$.
From (1) and (2), it follows:
$$M_{\overline{X}, X_1 - \overline{X}, \ldots, X_n - \overline{X}}(t, t_1, \ldots, t_n) \stackrel{\text{Def. 4.6.1}}{=} E\Big[\exp\big(t\overline{X} + t_1(X_1 - \overline{X}) + \ldots + t_n(X_n - \overline{X})\big)\Big]$$
$$= E\Big[\exp\big(t\overline{X} + t_1 X_1 - t_1 \overline{X} + \ldots + t_n X_n - t_n \overline{X}\big)\Big] = E\Bigg[\exp\Bigg(\sum_{i=1}^{n} X_i t_i - \Big(\sum_{i=1}^{n} t_i - t\Big)\overline{X}\Bigg)\Bigg]$$
$$= E\Bigg[\exp\Bigg(\sum_{i=1}^{n} X_i t_i - (t_1 + \ldots + t_n - t) \, \frac{\sum_{i=1}^{n} X_i}{n}\Bigg)\Bigg] = E\Bigg[\exp\Bigg(\sum_{i=1}^{n} X_i \Big(t_i - \frac{t_1 + \ldots + t_n - t}{n}\Big)\Bigg)\Bigg]$$
$$= E\Bigg[\prod_{i=1}^{n} \exp\Bigg(X_i \, \frac{n t_i - n\overline{t} + t}{n}\Bigg)\Bigg], \text{ where } \overline{t} = \frac{1}{n} \sum_{i=1}^{n} t_i$$
$$\stackrel{(E)}{=} \prod_{i=1}^{n} E\Bigg[\exp\Bigg(X_i \, \frac{t + n(t_i - \overline{t})}{n}\Bigg)\Bigg] = \prod_{i=1}^{n} M_{X_i}\Bigg(\frac{t + n(t_i - \overline{t})}{n}\Bigg)$$
$$\stackrel{(F)}{=} \prod_{i=1}^{n} \exp\Bigg(\frac{\mu[t + n(t_i - \overline{t})]}{n} + \frac{\sigma^2}{2} \frac{1}{n^2} [t + n(t_i - \overline{t})]^2\Bigg)$$
$$= \exp\Bigg(\frac{\mu}{n}\Bigg(nt + n \underbrace{\sum_{i=1}^{n} (t_i - \overline{t})}_{=0}\Bigg) + \frac{\sigma^2}{2n^2} \sum_{i=1}^{n} \big(t + n(t_i - \overline{t})\big)^2\Bigg)$$
$$= \exp(\mu t) \exp\Bigg(\frac{\sigma^2}{2n^2}\Bigg(n t^2 + 2nt \underbrace{\sum_{i=1}^{n} (t_i - \overline{t})}_{=0} + n^2 \sum_{i=1}^{n} (t_i - \overline{t})^2\Bigg)\Bigg)$$
$$= \exp\Bigg(\mu t + \frac{\sigma^2}{2n} t^2\Bigg) \exp\Bigg(\frac{\sigma^2}{2} \sum_{i=1}^{n} (t_i - \overline{t})^2\Bigg) \stackrel{(1)\&(2)}{=} M_{\overline{X}}(t) \, M_{X_1 - \overline{X}, \ldots, X_n - \overline{X}}(t_1, \ldots, t_n)$$
Thus, $\overline{X}$ and $(X_1 - \overline{X}, \ldots, X_n - \overline{X})$ are independent by Theorem 4.6.3 (iv). $(E)$ follows from Theorem 4.5.3 since the $X_i$'s are independent. $(F)$ holds since we evaluate $M_X(h) = \exp(\mu h + \frac{\sigma^2 h^2}{2})$ for $h = \frac{t + n(t_i - \overline{t})}{n}$.
Corollary 7.2.2:
$\overline{X}$ and $S^2$ are independent.

Proof:
This can be seen since $S^2$ is a function of the vector $(X_1 - \overline{X}, \ldots, X_n - \overline{X})$, and $(X_1 - \overline{X}, \ldots, X_n - \overline{X})$ is independent of $\overline{X}$, as previously shown in Theorem 7.2.1. We can use Theorem 4.2.7 to formally complete this proof.
Corollary 7.2.3:
$$\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}.$$

Proof:
Recall the following facts:
- If $Z \sim N(0, 1)$, then $Z^2 \sim \chi^2_1$.
- If $Y_1, \ldots, Y_n \sim$ iid $\chi^2_1$, then $\sum_{i=1}^{n} Y_i \sim \chi^2_n$.
- For $\chi^2_n$, the mgf is $M(t) = (1 - 2t)^{-n/2}$.
- If $X_i \sim N(\mu, \sigma^2)$, then $\frac{X_i - \mu}{\sigma} \sim N(0, 1)$ and $\frac{(X_i - \mu)^2}{\sigma^2} \sim \chi^2_1$.
Therefore, $\sum_{i=1}^{n} \frac{(X_i - \mu)^2}{\sigma^2} \sim \chi^2_n$ and $\frac{(\overline{X} - \mu)^2}{(\sigma/\sqrt{n})^2} = \frac{n(\overline{X} - \mu)^2}{\sigma^2} \sim \chi^2_1$. $\quad (*)$
Now consider
$$\sum_{i=1}^{n} (X_i - \mu)^2 = \sum_{i=1}^{n} \big((X_i - \overline{X}) + (\overline{X} - \mu)\big)^2 = \sum_{i=1}^{n} \big((X_i - \overline{X})^2 + 2(X_i - \overline{X})(\overline{X} - \mu) + (\overline{X} - \mu)^2\big)$$
$$= (n-1)S^2 + 0 + n(\overline{X} - \mu)^2$$
Therefore,
$$\underbrace{\sum_{i=1}^{n} \frac{(X_i - \mu)^2}{\sigma^2}}_{W} = \underbrace{\frac{n(\overline{X} - \mu)^2}{\sigma^2}}_{U} + \underbrace{\frac{(n-1)S^2}{\sigma^2}}_{V}$$
We have an expression of the form $W = U + V$.
Since $U$ and $V$ are functions of $\overline{X}$ and $S^2$, we know by Corollary 7.2.2 that they are independent and also that their mgf's factor by Theorem 4.6.3 (iv). Now we can write:
$$M_W(t) = M_U(t) M_V(t) \Longrightarrow M_V(t) = \frac{M_W(t)}{M_U(t)} \stackrel{(*)}{=} \frac{(1 - 2t)^{-n/2}}{(1 - 2t)^{-1/2}} = (1 - 2t)^{-(n-1)/2}$$
Note that this is the mgf of $\chi^2_{n-1}$ by the uniqueness of mgf's. Thus, $V = \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$.
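Corollary 7.2.3 can be checked by simulation against the known moments of the chi-squared distribution (mean $k$ and variance $2k$ for $k$ degrees of freedom). A Python sketch (an illustration, not part of the original notes; $n = 5$, $\mu = 10$, $\sigma = 2$, the seed, and the replication count are arbitrary choices):

```python
import random
from statistics import mean, variance

random.seed(3)
n, mu, sigma = 5, 10.0, 2.0
reps = 20_000

stats = []
for _ in range(reps):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    # (n-1) S^2 / sigma^2 should be chi-squared with n - 1 = 4 degrees of freedom.
    stats.append((n - 1) * variance(xs) / sigma**2)

# chi^2_4 has mean 4 and variance 8.
print(round(mean(stats), 2), round(variance(stats), 2))
assert abs(mean(stats) - (n - 1)) < 0.1
assert abs(variance(stats) - 2 * (n - 1)) < 0.4
```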
Corollary 7.2.4:
$$\frac{\sqrt{n}(\overline{X} - \mu)}{S} \sim t_{n-1}.$$

Proof:
Recall the following facts:
- If $Z \sim N(0, 1)$, $Y \sim \chi^2_n$, and $Z, Y$ are independent, then $\frac{Z}{\sqrt{Y/n}} \sim t_n$.
- $Z_1 = \frac{\sqrt{n}(\overline{X} - \mu)}{\sigma} \sim N(0, 1)$, $Y_{n-1} = \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$, and $Z_1, Y_{n-1}$ are independent.
Therefore,
$$\frac{\sqrt{n}(\overline{X} - \mu)}{S} = \frac{(\overline{X} - \mu)/(\sigma/\sqrt{n})}{S/\sigma} = \frac{\frac{\overline{X} - \mu}{\sigma/\sqrt{n}}}{\sqrt{\frac{S^2(n-1)}{\sigma^2(n-1)}}} = \frac{Z_1}{\sqrt{\frac{Y_{n-1}}{n-1}}} \sim t_{n-1}.$$
Corollary 7.2.5:
Let $(X_1, \ldots, X_m) \sim$ iid $N(\mu_1, \sigma_1^2)$ and $(Y_1, \ldots, Y_n) \sim$ iid $N(\mu_2, \sigma_2^2)$. Let $X_i, Y_j$ be independent $\forall i, j$. Then it holds:
$$\frac{\overline{X} - \overline{Y} - (\mu_1 - \mu_2)}{\sqrt{[(m-1)S_1^2/\sigma_1^2] + [(n-1)S_2^2/\sigma_2^2]}} \cdot \sqrt{\frac{m + n - 2}{\sigma_1^2/m + \sigma_2^2/n}} \sim t_{m+n-2}$$
In particular, if $\sigma_1 = \sigma_2$, then:
$$\frac{\overline{X} - \overline{Y} - (\mu_1 - \mu_2)}{\sqrt{(m-1)S_1^2 + (n-1)S_2^2}} \cdot \sqrt{\frac{mn(m + n - 2)}{m + n}} \sim t_{m+n-2}$$

Proof:
Homework.
Corollary 7.2.6:
Let $(X_1, \ldots, X_m) \sim$ iid $N(\mu_1, \sigma_1^2)$ and $(Y_1, \ldots, Y_n) \sim$ iid $N(\mu_2, \sigma_2^2)$. Let $X_i, Y_j$ be independent $\forall i, j$. Then it holds:
$$\frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} \sim F_{m-1, n-1}$$
In particular, if $\sigma_1 = \sigma_2$, then $\frac{S_1^2}{S_2^2} \sim F_{m-1, n-1}$.

Proof:
Recall that, if $Y_1 \sim \chi^2_m$ and $Y_2 \sim \chi^2_n$ are independent, then
$$F = \frac{Y_1/m}{Y_2/n} \sim F_{m,n}.$$
Now, $C_1 = \frac{(m-1)S_1^2}{\sigma_1^2} \sim \chi^2_{m-1}$ and $C_2 = \frac{(n-1)S_2^2}{\sigma_2^2} \sim \chi^2_{n-1}$. Therefore,
$$\frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} = \frac{\frac{(m-1)S_1^2}{\sigma_1^2(m-1)}}{\frac{(n-1)S_2^2}{\sigma_2^2(n-1)}} = \frac{C_1/(m-1)}{C_2/(n-1)} \sim F_{m-1, n-1}.$$
If $\sigma_1 = \sigma_2$, then $\frac{S_1^2}{S_2^2} \sim F_{m-1, n-1}$.
Lecture 07:
We 01/24/01
8 The Theory of Point Estimation
8.1 The Problem of Point Estimation
Let X be a rv de�ned on a probability space (; L; P ). Suppose that the cdf F of X depends
on some set of parameters and that the functional form of F is known except for a �nite
number of these parameters.
De�nition 8.1.1:
The set of admissible values of � is called the parameter space �. If F� is the cdf of X
when � is the parameter, the set fF� : � 2 �g is the family of cdf's. Likewise, we speak of
the family of pdf's if X is continuous, and the family of pmf's if X is discrete.
Example 8.1.2:
X � Bin(n; p); p unknown. Then � = p and � = fp : 0 < p < 1g.
X � N(�; �2); (�; �2) unknown. Then � = (�; �2) and � = f(�; �2) : �1 < � <1; �2 > 0g.
De�nition 8.1.3:
Let X be a sample from F�; � 2 � � IR. Let a statistic T (X) map IRn to �. We call T (X)
an estimator of � and T (x) for a realization x of X an (point) estimate of �. In practice,
the term estimate is used for both.
Example 8.1.4:
Let $X_1,\ldots,X_n$ be iid Bin$(1, p)$, $p$ unknown. Estimates of $p$ include:
$$T_1(X) = \bar{X}, \qquad T_2(X) = X_1, \qquad T_3(X) = \frac{1}{2}, \qquad T_4(X) = \frac{X_1+X_2}{3}$$
Obviously, not all estimates are equally good.
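That "not all estimates are equally good" can be made concrete by comparing mean squared errors. The sketch below (my illustration, not part of the notes; $p = 0.3$ and $n = 20$ are arbitrary) estimates the MSE of $T_1$ through $T_4$ by Monte Carlo.

```python
import random
import statistics

def mse_of_estimators(p=0.3, n=20, reps=20000, seed=3):
    """Monte Carlo MSE of four estimates of p for iid Bin(1, p) data."""
    rng = random.Random(seed)
    errs = {"T1": [], "T2": [], "T3": [], "T4": []}
    for _ in range(reps):
        xs = [1 if rng.random() < p else 0 for _ in range(n)]
        errs["T1"].append((sum(xs) / n - p) ** 2)         # sample mean
        errs["T2"].append((xs[0] - p) ** 2)               # first observation only
        errs["T3"].append((0.5 - p) ** 2)                 # constant 1/2
        errs["T4"].append(((xs[0] + xs[1]) / 3 - p) ** 2) # biased two-obs estimate
    return {k: statistics.fmean(v) for k, v in errs.items()}

mse = mse_of_estimators()
```

For $p = 0.3$ the ordering is MSE$(T_1)$ < MSE$(T_3)$ < MSE$(T_4)$ < MSE$(T_2)$; note that $T_3 = \frac{1}{2}$ would win for $p$ close to $\frac{1}{2}$, so which estimate is "better" can depend on the unknown $p$.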
8.2 Properties of Estimates
Definition 8.2.1:
Let $\{X_i\}_{i=1}^\infty$ be a sequence of iid rv's with cdf $F_\theta$, $\theta \in \Theta$. A sequence of point estimates $T_n(X_1,\ldots,X_n) = T_n$ is called

• (weakly) consistent for $\theta$ if $T_n \stackrel{p}{\longrightarrow} \theta$ as $n\to\infty$ for all $\theta\in\Theta$

• strongly consistent for $\theta$ if $T_n \stackrel{a.s.}{\longrightarrow} \theta$ as $n\to\infty$ for all $\theta\in\Theta$

• consistent in the $r$-th mean for $\theta$ if $T_n \stackrel{r}{\longrightarrow} \theta$ as $n\to\infty$ for all $\theta\in\Theta$
Example 8.2.2:
Let $\{X_i\}_{i=1}^\infty$ be a sequence of iid Bin$(1, p)$ rv's. Let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$. Since $E(X_i) = p$, it follows by the WLLN that $\bar{X}_n \stackrel{p}{\longrightarrow} p$, i.e., consistency, and by the SLLN that $\bar{X}_n \stackrel{a.s.}{\longrightarrow} p$, i.e., strong consistency.
However, a consistent estimate need not be unique. We may even have infinitely many consistent estimates, e.g.,
$$\frac{\sum_{i=1}^n X_i + a}{n + b} \;\stackrel{p}{\longrightarrow}\; p \quad \text{for all finite } a, b \in \mathbb{R}.$$
Theorem 8.2.3:
If $T_n$ is a sequence of estimates such that $E(T_n) \to \theta$ and $Var(T_n) \to 0$ as $n\to\infty$, then $T_n$ is consistent for $\theta$.
Proof:
$$P(|T_n - \theta| > \varepsilon) \overset{(A)}{\le} \frac{E((T_n-\theta)^2)}{\varepsilon^2} = \frac{E[((T_n - E(T_n)) + (E(T_n) - \theta))^2]}{\varepsilon^2} = \frac{Var(T_n) + 2E[(T_n - E(T_n))(E(T_n)-\theta)] + (E(T_n)-\theta)^2}{\varepsilon^2}$$
$$= \frac{Var(T_n) + (E(T_n)-\theta)^2}{\varepsilon^2} \;\overset{(B)}{\longrightarrow}\; 0 \text{ as } n\to\infty$$
(A) holds due to Corollary 3.5.2 (Markov's Inequality). The cross term vanishes since $E(T_n) - \theta$ is a constant and $E(T_n - E(T_n)) = 0$. (B) holds since $Var(T_n) \to 0$ as $n\to\infty$ and $E(T_n) \to \theta$ as $n\to\infty$.
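Theorem 8.2.3 in action: for $\bar{X}_n$ with iid Bin$(1,p)$ data, $E(\bar{X}_n) = p$ and $Var(\bar{X}_n) = p(1-p)/n \to 0$, so $P(|\bar{X}_n - p| > \varepsilon)$ must shrink. A small Monte Carlo sketch (my illustration; $p = 0.4$ and $\varepsilon = 0.05$ are arbitrary):

```python
import random

def exceedance_prob(n, p=0.4, eps=0.05, reps=5000, seed=4):
    """Estimate P(|Xbar_n - p| > eps) for iid Bin(1, p) data."""
    rng = random.Random(seed)
    count = 0
    for _ in range(reps):
        xbar = sum(1 if rng.random() < p else 0 for _ in range(n)) / n
        if abs(xbar - p) > eps:
            count += 1
    return count / reps

# The exceedance probability drops toward 0 as n grows.
probs = [exceedance_prob(n) for n in (10, 100, 1000)]
```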
Definition 8.2.4:
Let $G$ be a group of Borel-measurable functions of $\mathbb{R}^n$ onto itself which is closed under composition and inverse. A family of distributions $\{P_\theta : \theta \in \Theta\}$ is invariant under $G$ if for each $g \in G$ and for all $\theta \in \Theta$, there exists a unique $\theta' = \bar{g}(\theta)$ such that the distribution of $g(X)$ is $P_{\theta'}$ whenever the distribution of $X$ is $P_\theta$. We call $\bar{g}$ the induced function on $\Theta$, since
$$P_\theta(g(X) \in A) = P_{\bar{g}(\theta)}(X \in A).$$
Example 8.2.5:
Let $(X_1,\ldots,X_n)$ be iid $N(\mu,\sigma^2)$ with joint pdf
$$f(x_1,\ldots,x_n) = \frac{1}{(\sqrt{2\pi}\,\sigma)^n}\exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2\right).$$
The group of linear transformations $G$ has elements
$$g(x_1,\ldots,x_n) = (ax_1+b,\ldots,ax_n+b), \qquad a > 0,\; -\infty < b < \infty.$$
The pdf of $g(X)$ is
$$f^*(x_1^*,\ldots,x_n^*) = \frac{1}{(\sqrt{2\pi}\,a\sigma)^n}\exp\!\left(-\frac{1}{2a^2\sigma^2}\sum_{i=1}^n (x_i^* - a\mu - b)^2\right), \qquad x_i^* = ax_i + b,\; i = 1,\ldots,n.$$
So $\{f : -\infty < \mu < \infty,\; \sigma^2 > 0\}$ is invariant under this group $G$, with $\bar{g}(\mu,\sigma^2) = (a\mu+b,\, a^2\sigma^2)$, where $-\infty < a\mu+b < \infty$ and $a^2\sigma^2 > 0$.
Definition 8.2.6:
Let $G$ be a group of transformations that leaves $\{F_\theta : \theta \in \Theta\}$ invariant. An estimate $T$ is invariant under $G$ if
$$T(g(X_1),\ldots,g(X_n)) = T(X_1,\ldots,X_n) \quad \forall g \in G.$$
Definition 8.2.7:
An estimate $T$ is:

• location invariant if $T(X_1+a,\ldots,X_n+a) = T(X_1,\ldots,X_n)$ for all $a \in \mathbb{R}$

• scale invariant if $T(cX_1,\ldots,cX_n) = T(X_1,\ldots,X_n)$ for all $c \in \mathbb{R}\setminus\{0\}$

• permutation invariant if $T(X_{i_1},\ldots,X_{i_n}) = T(X_1,\ldots,X_n)$ for all permutations $(i_1,\ldots,i_n)$ of $(1,\ldots,n)$
Example 8.2.8:
Let $F_\theta \equiv N(\mu,\sigma^2)$.
$S^2$ is location invariant.
$\bar{X}$ and $S^2$ are both permutation invariant.
Neither $\bar{X}$ nor $S^2$ is scale invariant.
Note:
Different sources make different use of the term invariant. Mood, Graybill & Boes (1974), for example, define location invariant as $T(X_1+a,\ldots,X_n+a) = T(X_1,\ldots,X_n) + a$ (page 332) and scale invariant as $T(cX_1,\ldots,cX_n) = cT(X_1,\ldots,X_n)$ (page 336). According to their definition, $\bar{X}$ is location invariant and scale invariant.
8.3 Sufficient Statistics
Definition 8.3.1:
Let $X = (X_1,\ldots,X_n)$ be a sample from $\{F_\theta : \theta \in \Theta \subseteq \mathbb{R}^k\}$. A statistic $T = T(X)$ is sufficient for $\theta$ (or for the family of distributions $\{F_\theta : \theta \in \Theta\}$) iff the conditional distribution of $X$ given $T = t$ does not depend on $\theta$ (except possibly on a null set $A$ where $P_\theta(T \in A) = 0$ for all $\theta$).
Note:
(i) The sample $X$ is always sufficient, but this is not particularly interesting and usually is excluded from further considerations.
(ii) Idea: once we have "reduced" from $X$ to $T(X)$, we have captured all the information in $X$ about $\theta$.
(iii) Usually, there are several sufficient statistics for a given family of distributions.
Example 8.3.2:
Let $X = (X_1,\ldots,X_n)$ be iid Bin$(1, p)$ rv's. To estimate $p$, can we ignore the order and simply count the number of "successes"?
Let $T(X) = \sum_{i=1}^n X_i$. It is
$$P\Big(X_1=x_1,\ldots,X_n=x_n \,\Big|\, \sum_{i=1}^n X_i = t\Big) = \frac{P(X_1=x_1,\ldots,X_n=x_n,\, T=t)}{P(T=t)} = \begin{cases} \dfrac{P(X_1=x_1,\ldots,X_n=x_n)}{P(T=t)}, & \sum_{i=1}^n x_i = t\\[6pt] 0, & \text{otherwise}\end{cases}$$
$$= \begin{cases} \dfrac{p^t(1-p)^{n-t}}{\binom{n}{t}\, p^t(1-p)^{n-t}}, & \sum_{i=1}^n x_i = t\\[6pt] 0, & \text{otherwise}\end{cases} \;=\; \begin{cases} \dfrac{1}{\binom{n}{t}}, & \sum_{i=1}^n x_i = t\\[6pt] 0, & \text{otherwise}\end{cases}$$
This does not depend on $p$. Thus, $T = \sum_{i=1}^n X_i$ is sufficient for $p$.
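The computation above says: given $T = t$, every arrangement of $t$ successes among $n$ trials is equally likely, with probability $1/\binom{n}{t}$, whatever $p$ is. This can be verified exactly by enumeration (an added sketch, not part of the notes):

```python
import math
from itertools import product

def conditional_given_sum(n, t, p):
    """Exact conditional pmf of (X_1,...,X_n) iid Bin(1,p) given sum = t."""
    joint = {}
    for xs in product((0, 1), repeat=n):
        if sum(xs) == t:
            joint[xs] = p ** sum(xs) * (1 - p) ** (n - sum(xs))
    total = sum(joint.values())  # = P(T = t)
    return {xs: pr / total for xs, pr in joint.items()}

n, t = 5, 2
cond_a = conditional_given_sum(n, t, 0.2)
cond_b = conditional_given_sum(n, t, 0.7)
# Both conditional pmf's are uniform over the C(5,2) = 10 arrangements,
# and identical for the two different values of p.
```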
Example 8.3.3:
Let $X = (X_1,\ldots,X_n)$ be iid Poisson$(\lambda)$. Is $T = \sum_{i=1}^n X_i$ sufficient for $\lambda$? Note that $T \sim$ Poisson$(n\lambda)$. It is
$$P\Big(X_1=x_1,\ldots,X_n=x_n \,\Big|\, T = t\Big) = \begin{cases} \dfrac{\prod_{i=1}^n \frac{e^{-\lambda}\lambda^{x_i}}{x_i!}}{\frac{e^{-n\lambda}(n\lambda)^t}{t!}}, & \sum_{i=1}^n x_i = t\\[8pt] 0, & \text{otherwise}\end{cases} = \begin{cases} \dfrac{\frac{e^{-n\lambda}\lambda^{\sum x_i}}{\prod x_i!}}{\frac{e^{-n\lambda}(n\lambda)^t}{t!}}, & \sum_{i=1}^n x_i = t\\[8pt] 0, & \text{otherwise}\end{cases} = \begin{cases} \dfrac{t!}{n^t \prod_{i=1}^n x_i!}, & \sum_{i=1}^n x_i = t\\[8pt] 0, & \text{otherwise}\end{cases}$$
This does not depend on $\lambda$. Thus, $T = \sum_{i=1}^n X_i$ is sufficient for $\lambda$.
Example 8.3.4:
Let $X_1, X_2$ be iid Poisson$(\lambda)$. Is $T = X_1 + 2X_2$ sufficient for $\lambda$? It is
$$P(X_1=0, X_2=1 \mid X_1+2X_2 = 2) = \frac{P(X_1=0, X_2=1)}{P(X_1+2X_2 = 2)} = \frac{P(X_1=0, X_2=1)}{P(X_1=0, X_2=1) + P(X_1=2, X_2=0)}$$
$$= \frac{e^{-\lambda}(e^{-\lambda}\lambda)}{e^{-\lambda}(e^{-\lambda}\lambda) + \left(\frac{e^{-\lambda}\lambda^2}{2}\right)e^{-\lambda}} = \frac{1}{1 + \frac{\lambda}{2}}$$
This still depends on $\lambda$. Thus, $T = X_1 + 2X_2$ is not sufficient for $\lambda$.
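The conditional probability $\frac{1}{1+\lambda/2}$ can be recomputed numerically to confirm that it genuinely varies with $\lambda$ (an added sketch, not part of the notes):

```python
import math

def cond_prob(lam):
    """P(X1=0, X2=1 | X1 + 2*X2 = 2) for X1, X2 iid Poisson(lam)."""
    def pois(k):
        return math.exp(-lam) * lam ** k / math.factorial(k)
    p01 = pois(0) * pois(1)   # X1 = 0, X2 = 1
    p20 = pois(2) * pois(0)   # X1 = 2, X2 = 0
    return p01 / (p01 + p20)

probs = {lam: cond_prob(lam) for lam in (0.5, 1.0, 2.0)}
```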
Note:
Definition 8.3.1 can be difficult to check. In addition, it requires a candidate statistic. We need something constructive that helps in finding sufficient statistics without having to check Definition 8.3.1. The next Theorem helps in finding such statistics.
Lecture 08:
Fr 01/26/01

Theorem 8.3.5: Factorization Criterion
Let $X_1,\ldots,X_n$ be rv's with pdf (or pmf) $f(x_1,\ldots,x_n \mid \theta)$, $\theta \in \Theta$. Then $T(X_1,\ldots,X_n)$ is sufficient for $\theta$ iff we can write
$$f(x_1,\ldots,x_n \mid \theta) = h(x_1,\ldots,x_n)\; g(T(x_1,\ldots,x_n) \mid \theta),$$
where $h$ does not depend on $\theta$ and $g$ does not depend on $x_1,\ldots,x_n$ except as a function of $T$.
Proof:
Discrete case only.
"$\Longrightarrow$":
Suppose $T(X)$ is sufficient for $\theta$. Let
$$g(t \mid \theta) = P_\theta(T(X) = t), \qquad h(x) = P(X = x \mid T(X) = t).$$
Then it holds:
$$f(x \mid \theta) = P_\theta(X = x) \overset{(*)}{=} P_\theta(X = x,\, T(X) = T(x) = t) = P_\theta(T(X) = t)\; P(X = x \mid T(X) = t) = g(t \mid \theta)\, h(x)$$
(*) holds since $X = x$ implies that $T(X) = T(x) = t$.
"$\Longleftarrow$": Suppose the factorization holds. For fixed $t_0$, it is
$$P_\theta(T(X) = t_0) = \sum_{\{x :\, T(x)=t_0\}} P_\theta(X = x) = \sum_{\{x :\, T(x)=t_0\}} h(x)\, g(T(x) \mid \theta) = g(t_0 \mid \theta) \sum_{\{x :\, T(x)=t_0\}} h(x) \qquad (A)$$
If $P_\theta(T(X) = t_0) > 0$, it holds:
$$P_\theta(X = x \mid T(X) = t_0) = \frac{P_\theta(X = x,\, T(X) = t_0)}{P_\theta(T(X) = t_0)} = \begin{cases} \dfrac{P_\theta(X = x)}{P_\theta(T(X) = t_0)}, & \text{if } T(x) = t_0\\[6pt] 0, & \text{otherwise}\end{cases}$$
$$\overset{(A)}{=} \begin{cases} \dfrac{g(t_0 \mid \theta)\, h(x)}{g(t_0 \mid \theta) \sum_{\{x : T(x)=t_0\}} h(x)}, & \text{if } T(x) = t_0\\[8pt] 0, & \text{otherwise}\end{cases} \;=\; \begin{cases} \dfrac{h(x)}{\sum_{\{x : T(x)=t_0\}} h(x)}, & \text{if } T(x) = t_0\\[8pt] 0, & \text{otherwise}\end{cases}$$
This last expression does not depend on $\theta$. Thus, $T(X)$ is sufficient for $\theta$.
Note:
(i) In the Theorem above, $\theta$ and $T$ may be vectors.
(ii) If $T$ is sufficient for $\theta$, then any 1-to-1 mapping of $T$ is also sufficient for $\theta$. However, this does not hold for arbitrary functions of $T$.
Example 8.3.6:
Let $X_1,\ldots,X_n$ be iid Bin$(1, p)$. It is
$$P(X_1=x_1,\ldots,X_n=x_n \mid p) = p^{\sum x_i}(1-p)^{n - \sum x_i}.$$
Thus, $h(x_1,\ldots,x_n) = 1$ and $g(\sum x_i \mid p) = p^{\sum x_i}(1-p)^{n-\sum x_i}$.
Hence, $T = \sum_{i=1}^n X_i$ is sufficient for $p$.
Example 8.3.7:
Let $X_1,\ldots,X_n$ be iid Poisson$(\lambda)$. It is
$$P(X_1=x_1,\ldots,X_n=x_n \mid \lambda) = \prod_{i=1}^n \frac{e^{-\lambda}\lambda^{x_i}}{x_i!} = \frac{e^{-n\lambda}\lambda^{\sum x_i}}{\prod x_i!}.$$
Thus, $h(x_1,\ldots,x_n) = \frac{1}{\prod x_i!}$ and $g(\sum x_i \mid \lambda) = e^{-n\lambda}\lambda^{\sum x_i}$.
Hence, $T = \sum_{i=1}^n X_i$ is sufficient for $\lambda$.
Example 8.3.8:
Let $X_1,\ldots,X_n$ be iid $N(\mu,\sigma^2)$ where $\mu \in \mathbb{R}$ and $\sigma^2 > 0$ are both unknown. It is
$$f(x_1,\ldots,x_n \mid \mu,\sigma^2) = \frac{1}{(\sqrt{2\pi}\,\sigma)^n}\exp\!\left(-\frac{\sum(x_i-\mu)^2}{2\sigma^2}\right) = \frac{1}{(\sqrt{2\pi}\,\sigma)^n}\exp\!\left(-\frac{\sum x_i^2}{2\sigma^2} + \frac{\mu\sum x_i}{\sigma^2} - \frac{n\mu^2}{2\sigma^2}\right).$$
Hence, $T = \left(\sum_{i=1}^n X_i,\; \sum_{i=1}^n X_i^2\right)$ is sufficient for $(\mu, \sigma^2)$.
Example 8.3.9:
Let $X_1,\ldots,X_n$ be iid $U(\theta, \theta+1)$ where $-\infty < \theta < \infty$. It is
$$f(x_1,\ldots,x_n \mid \theta) = \begin{cases} 1, & \theta < x_i < \theta+1 \;\; \forall i \in \{1,\ldots,n\}\\ 0, & \text{otherwise}\end{cases} = \prod_{i=1}^n I_{(\theta,\infty)}(x_i)\, I_{(-\infty,\theta+1)}(x_i) = I_{(\theta,\infty)}(\min(x_i))\; I_{(-\infty,\theta+1)}(\max(x_i))$$
Hence, $T = (X_{(1)}, X_{(n)})$ is sufficient for $\theta$.
Definition 8.3.10:
Let $\{f_\theta(x) : \theta \in \Theta\}$ be a family of pdf's (or pmf's). We say the family is complete if
$$E_\theta(g(X)) = 0 \quad \forall\theta\in\Theta$$
implies that
$$P_\theta(g(X) = 0) = 1 \quad \forall\theta\in\Theta.$$
We say a statistic $T(X)$ is complete if the family of distributions of $T$ is complete.
Example 8.3.11:
Let $X_1,\ldots,X_n$ be iid Bin$(1, p)$. We have seen in Example 8.3.6 that $T = \sum_{i=1}^n X_i$ is sufficient for $p$. Is it also complete?
We know that $T \sim$ Bin$(n, p)$. Thus,
$$E_p(g(T)) = \sum_{t=0}^n g(t)\binom{n}{t} p^t(1-p)^{n-t} = 0 \quad \forall p \in (0,1)$$
implies that
$$(1-p)^n \sum_{t=0}^n g(t)\binom{n}{t}\left(\frac{p}{1-p}\right)^t = 0 \quad \forall p \in (0,1).$$
However, $\sum_{t=0}^n g(t)\binom{n}{t}\left(\frac{p}{1-p}\right)^t$ is a polynomial in $\frac{p}{1-p}$ which is equal to 0 for all $p \in (0,1)$ only if all of its coefficients are 0.
Therefore, $g(t) = 0$ for $t = 0, 1, \ldots, n$. Hence, $T$ is complete.
Lecture 09:
Mo 01/29/01

Example 8.3.12:
Let $X_1,\ldots,X_n$ be iid $N(\theta, \theta^2)$. We know from Example 8.3.8 that $T = \left(\sum_{i=1}^n X_i, \sum_{i=1}^n X_i^2\right)$ is sufficient for $\theta$. Is it also complete?
We know that $\sum_{i=1}^n X_i \sim N(n\theta, n\theta^2)$. Therefore,
$$E\!\left(\Big(\sum_{i=1}^n X_i\Big)^2\right) = n\theta^2 + n^2\theta^2 = n(n+1)\theta^2, \qquad E\!\left(\sum_{i=1}^n X_i^2\right) = n(\theta^2 + \theta^2) = 2n\theta^2.$$
It follows that
$$E\!\left(2\Big(\sum_{i=1}^n X_i\Big)^2 - (n+1)\sum_{i=1}^n X_i^2\right) = 0 \quad \forall\theta.$$
But $g(x_1,\ldots,x_n) = 2\left(\sum x_i\right)^2 - (n+1)\sum x_i^2$ is not identically 0.
Therefore, $T$ is not complete.
Note:
Recall from Section 5.2 what it means to say the family of distributions $\{f_\theta : \theta \in \Theta\}$ is a one-parameter (or $k$-parameter) exponential family.
Theorem 8.3.13:
Let $\{f_\theta : \theta \in \Theta\}$ be a $k$-parameter exponential family, written as $f_\theta(x) = \exp\left(\sum_{i=1}^k T_i(x)Q_i(\theta) + D(\theta) + S(x)\right)$ as in Section 5.2, with statistics $T_1,\ldots,T_k$. Then the family of distributions of $(T_1(X),\ldots,T_k(X))$ is also a $k$-parameter exponential family, given by
$$g_\theta(t) = \exp\!\left(\sum_{i=1}^k t_i Q_i(\theta) + D(\theta) + S^*(t)\right)$$
for a suitable $S^*(t)$.
Proof:
The proof follows from our Theorems regarding the transformation of rv's.
Theorem 8.3.14:
Let $\{f_\theta : \theta \in \Theta\}$ be a $k$-parameter exponential family with $k \le n$ and let $T_1,\ldots,T_k$ be statistics as in Theorem 8.3.13. Suppose that the range of $Q = (Q_1,\ldots,Q_k)$ contains an open set in $\mathbb{R}^k$. Then $T = (T_1(X),\ldots,T_k(X))$ is a complete sufficient statistic.
Proof:
Discrete case and $k = 1$ only.
Write $Q(\theta) = \eta$ and let $(a, b)$ be an open interval contained in the range of $Q$.
It follows from the Factorization Criterion (Theorem 8.3.5) that $T$ is sufficient for $\theta$. Thus, we only have to show that $T$ is complete, i.e., that
$$E_\eta(g(T(X))) = \sum_t g(t)\, P_\eta(T(X) = t) \overset{(A)}{=} \sum_t g(t)\exp(\eta t + D(\theta) + S^*(t)) = 0 \quad \forall\eta \qquad (B)$$
implies $g(t) = 0$ for all $t$. Note that in (A) we make use of a result established in Theorem 8.3.13.
We now define functions $g^+$ and $g^-$ as:
$$g^+(t) = \begin{cases} g(t), & \text{if } g(t) \ge 0\\ 0, & \text{otherwise}\end{cases} \qquad\qquad g^-(t) = \begin{cases} -g(t), & \text{if } g(t) < 0\\ 0, & \text{otherwise}\end{cases}$$
It is $g(t) = g^+(t) - g^-(t)$, where both functions, $g^+$ and $g^-$, are non-negative. Using $g^+$ and $g^-$, it turns out that (B) is equivalent to
$$\sum_t g^+(t)\exp(\eta t + S^*(t)) = \sum_t g^-(t)\exp(\eta t + S^*(t)) \quad \forall\eta \qquad (C)$$
where the term $\exp(D(\theta))$ in (A) drops out as a constant on both sides.
If we fix $\eta_0 \in (a, b)$ and define
$$p^+(t) = \frac{g^+(t)\exp(\eta_0 t + S^*(t))}{\sum_t g^+(t)\exp(\eta_0 t + S^*(t))}, \qquad p^-(t) = \frac{g^-(t)\exp(\eta_0 t + S^*(t))}{\sum_t g^-(t)\exp(\eta_0 t + S^*(t))},$$
it is obvious that $p^+(t) \ge 0$ and $p^-(t) \ge 0$ for all $t$, and by construction $\sum_t p^+(t) = 1$ and $\sum_t p^-(t) = 1$. Hence, $p^+$ and $p^-$ are both pmf's.
From (C), it follows for the mgf's $M^+$ and $M^-$ of $p^+$ and $p^-$ that
$$M^+(\eta) = \sum_t e^{\eta t} p^+(t) = \frac{\sum_t g^+(t)\exp((\eta_0+\eta)t + S^*(t))}{\sum_t g^+(t)\exp(\eta_0 t + S^*(t))} \overset{(C)}{=} \frac{\sum_t g^-(t)\exp((\eta_0+\eta)t + S^*(t))}{\sum_t g^-(t)\exp(\eta_0 t + S^*(t))} = \sum_t e^{\eta t} p^-(t) = M^-(\eta)$$
for all $\eta \in (\underbrace{a - \eta_0}_{<0},\; \underbrace{b - \eta_0}_{>0})$.
By the uniqueness of mgf's it follows that $p^+(t) = p^-(t)$ for all $t$.
$\Longrightarrow g^+(t) = g^-(t)$ for all $t$
$\Longrightarrow g(t) = 0$ for all $t$
$\Longrightarrow T$ is complete
8.4 Unbiased Estimation
De�nition 8.4.1:
Let $\{F_\theta : \theta \in \Theta\}$, $\Theta \subseteq \mathbb{R}$, be a nonempty set of cdf's. A Borel-measurable function $T$ from $\mathbb{R}^n$ to $\Theta$ is called unbiased for $\theta$ (or an unbiased estimate of $\theta$) if
$$E_\theta(T) = \theta \quad \forall\theta\in\Theta.$$
Any function $d(\theta)$ for which an unbiased estimate $T$ exists is called an estimable function.
If $T$ is biased,
$$b(\theta, T) = E_\theta(T) - \theta$$
is called the bias of $T$.
Example 8.4.2:
If the $k$-th population moment exists, the $k$-th sample moment is an unbiased estimate of it. If $Var(X) = \sigma^2$, the sample variance $S^2$ is an unbiased estimate of $\sigma^2$.
However, note that for $X_1,\ldots,X_n$ iid $N(\mu,\sigma^2)$, $S$ is not an unbiased estimate of $\sigma$:
$$\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1} = \text{Gamma}\!\left(\frac{n-1}{2},\, 2\right)$$
$$\Longrightarrow\; E\!\left(\sqrt{\frac{(n-1)S^2}{\sigma^2}}\right) = \int_0^\infty \sqrt{x}\;\frac{x^{\frac{n-1}{2}-1} e^{-\frac{x}{2}}}{2^{\frac{n-1}{2}}\,\Gamma(\frac{n-1}{2})}\,dx = \frac{\sqrt{2}\,\Gamma(\frac{n}{2})}{\Gamma(\frac{n-1}{2})} \int_0^\infty \frac{x^{\frac{n}{2}-1} e^{-\frac{x}{2}}}{2^{\frac{n}{2}}\,\Gamma(\frac{n}{2})}\,dx \overset{(*)}{=} \frac{\sqrt{2}\,\Gamma(\frac{n}{2})}{\Gamma(\frac{n-1}{2})}$$
$$\Longrightarrow\; E(S) = \sigma\,\sqrt{\frac{2}{n-1}}\;\frac{\Gamma(\frac{n}{2})}{\Gamma(\frac{n-1}{2})}$$
(*) holds since $\frac{x^{\frac{n}{2}-1} e^{-x/2}}{2^{\frac{n}{2}}\Gamma(\frac{n}{2})}$ is the pdf of a Gamma$(\frac{n}{2}, 2)$ distribution and thus the integral is 1.
So $S$ is biased for $\sigma$ and
$$b(\sigma, S) = \sigma\left(\sqrt{\frac{2}{n-1}}\;\frac{\Gamma(\frac{n}{2})}{\Gamma(\frac{n-1}{2})} - 1\right).$$
Note:
If $T$ is unbiased for $\theta$, $g(T)$ is not necessarily unbiased for $g(\theta)$ (unless $g$ is a linear function).
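The $\Gamma$-function expression for $E(S)$ can be sanity-checked by simulation (an added sketch, not part of the notes; $\sigma = 2$ and $n = 5$ are arbitrary choices):

```python
import math
import random
import statistics

def expected_s_exact(sigma, n):
    """E(S) = sigma * sqrt(2/(n-1)) * Gamma(n/2) / Gamma((n-1)/2)."""
    return sigma * math.sqrt(2 / (n - 1)) * math.gamma(n / 2) / math.gamma((n - 1) / 2)

def expected_s_sim(sigma=2.0, n=5, reps=40000, seed=7):
    """Monte Carlo estimate of E(S) for iid N(0, sigma^2) samples of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        xs = [rng.gauss(0.0, sigma) for _ in range(n)]
        total += statistics.stdev(xs)  # S with divisor n-1
    return total / reps

sigma, n = 2.0, 5
exact = expected_s_exact(sigma, n)
approx = expected_s_sim(sigma, n)
```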
Lecture 10:
We 01/31/01

Example 8.4.3:
Unbiased estimates may not exist (see Rohatgi, page 351, Example 2) or they may be absurd, as in the following case:
Let $X \sim$ Poisson$(\lambda)$ and let $d(\lambda) = e^{-2\lambda}$. Consider $T(X) = (-1)^X$ as an estimate. It is
$$E_\lambda(T(X)) = e^{-\lambda}\sum_{x=0}^\infty (-1)^x \frac{\lambda^x}{x!} = e^{-\lambda}\sum_{x=0}^\infty \frac{(-\lambda)^x}{x!} = e^{-\lambda} e^{-\lambda} = e^{-2\lambda} = d(\lambda)$$
Hence $T$ is unbiased for $d(\lambda)$, but since $T$ alternates between $-1$ and $1$ while $d(\lambda) > 0$, $T$ is not a good estimate.
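A quick numerical check of the series computation above (an added sketch; $\lambda = 1.5$ is arbitrary, and the series is truncated at 50 terms, far past convergence):

```python
import math

def expected_value(lam, terms=50):
    """E((-1)^X) for X ~ Poisson(lam), computed from the truncated series."""
    return sum((-1) ** x * math.exp(-lam) * lam ** x / math.factorial(x)
               for x in range(terms))

lam = 1.5
series = expected_value(lam)  # unbiased for d(lam) = exp(-2*lam) ...
target = math.exp(-2 * lam)   # ... even though the estimate itself is always +1 or -1
```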
Note:
If there exist two unbiased estimates $T_1$ and $T_2$ of $\theta$, then any estimate of the form $\alpha T_1 + (1-\alpha)T_2$ for $0 < \alpha < 1$ will also be an unbiased estimate of $\theta$. Which one should we choose?
De�nition 8.4.4:
The mean square error of an estimate $T$ of $\theta$ is defined as
$$MSE(\theta, T) = E_\theta((T-\theta)^2) = Var_\theta(T) + (b(\theta, T))^2.$$
Let $\{T_i\}_{i=1}^\infty$ be a sequence of estimates of $\theta$. If
$$\lim_{i\to\infty} MSE(\theta, T_i) = 0 \quad \forall\theta\in\Theta,$$
then $\{T_i\}$ is called a mean-squared-error consistent (MSE-consistent) sequence of estimates of $\theta$.
Note:
(i) If we allow all estimates and compare their MSE, it generally depends on $\theta$ which estimate is better. For example, $\hat\theta = 17$ is perfect if $\theta = 17$, but it is lousy otherwise.
(ii) If we restrict ourselves to the class of unbiased estimates, then $MSE(\theta, T) = Var_\theta(T)$.
(iii) MSE-consistency means that both the bias and the variance of $T_i$ approach 0 as $i\to\infty$.
Definition 8.4.5:
Let $\theta_0 \in \Theta$ and let $U(\theta_0)$ be the class of all unbiased estimates $T$ of $\theta_0$ such that $E_{\theta_0}(T^2) < \infty$. Then $T_0 \in U(\theta_0)$ is called a locally minimum variance unbiased estimate (LMVUE) at $\theta_0$ if
$$E_{\theta_0}((T_0 - \theta_0)^2) \le E_{\theta_0}((T - \theta_0)^2) \quad \forall T \in U(\theta_0).$$
Definition 8.4.6:
Let $U$ be the class of all unbiased estimates $T$ of $\theta \in \Theta$ such that $E_\theta(T^2) < \infty$ for all $\theta \in \Theta$. Then $T_0 \in U$ is called a uniformly minimum variance unbiased estimate (UMVUE) of $\theta$ if
$$E_\theta((T_0 - \theta)^2) \le E_\theta((T - \theta)^2) \quad \forall\theta\in\Theta \;\forall T \in U.$$
An Excursion into Logic II
In our first "Excursion into Logic" in Lectures 10 & 11 of Stat 6710 Mathematical Statistics I on 9/20/2000 and 9/22/2000, we established the following results:
A ⇒ B is equivalent to ¬B ⇒ ¬A, which is equivalent to ¬A ∨ B:

A B | A ⇒ B | ¬A | ¬B | ¬B ⇒ ¬A | ¬A ∨ B
1 1 |   1   |  0 |  0 |    1    |   1
1 0 |   0   |  0 |  1 |    0    |   0
0 1 |   1   |  1 |  0 |    1    |   1
0 0 |   1   |  1 |  1 |    1    |   1

When dealing with formal proofs, there exists one more technique to show A ⇒ B. Equivalently, we can show (A ∧ ¬B) ⇒ 0, a technique called Proof by Contradiction. This means, assuming that A and ¬B hold, we show that this implies 0, i.e., something that is always false, i.e., a contradiction. And here is the corresponding truth table:

A B | A ⇒ B | ¬B | A ∧ ¬B | (A ∧ ¬B) ⇒ 0
1 1 |   1   |  0 |   0    |      1
1 0 |   0   |  1 |   1    |      0
0 1 |   1   |  0 |   0    |      1
0 0 |   1   |  1 |   0    |      1
Note:
We make use of this proof technique in the Proof of the next Theorem.
Example:
Let A: $x = 5$ and B: $x^2 = 25$. Obviously A ⇒ B.
But we can also prove this in the following way:
A: $x = 5$ and ¬B: $x^2 \neq 25$
⟹ $x^2 = 25$ ∧ $x^2 \neq 25$
This is impossible, i.e., a contradiction. Thus, A ⇒ B.
Theorem 8.4.7:
Let $U$ be the class of all unbiased estimates $T$ of $\theta \in \Theta$ with $E_\theta(T^2) < \infty$ for all $\theta$, and suppose that $U$ is non-empty. Let $U_0$ be the set of all unbiased estimates of 0, i.e.,
$$U_0 = \{\nu : E_\theta(\nu) = 0,\; E_\theta(\nu^2) < \infty \;\forall\theta\in\Theta\}.$$
Then $T_0 \in U$ is UMVUE iff
$$E_\theta(\nu T_0) = 0 \quad \forall\theta\in\Theta\;\forall\nu\in U_0.$$
Proof:
Note that $E_\theta(\nu T_0)$ always exists, by the Cauchy-Schwarz inequality, since $\nu$ and $T_0$ both have finite second moments.
"$\Longrightarrow$:"
We suppose that $T_0 \in U$ is UMVUE and that $E_{\theta_0}(\nu_0 T_0) \neq 0$ for some $\theta_0 \in \Theta$ and some $\nu_0 \in U_0$.
It holds that
$$E_\theta(T_0 + \lambda\nu_0) = E_\theta(T_0) = \theta \quad \forall\lambda\in\mathbb{R}\;\forall\theta\in\Theta.$$
Therefore, $T_0 + \lambda\nu_0 \in U$ for all $\lambda\in\mathbb{R}$.
Also, $E_{\theta_0}(\nu_0^2) > 0$ (since otherwise $P_{\theta_0}(\nu_0 = 0) = 1$ and then $E_{\theta_0}(\nu_0 T_0) = 0$).
Now let
$$\lambda = -\frac{E_{\theta_0}(T_0\nu_0)}{E_{\theta_0}(\nu_0^2)}.$$
Then,
$$E_{\theta_0}((T_0 + \lambda\nu_0)^2) = E_{\theta_0}(T_0^2 + 2\lambda T_0\nu_0 + \lambda^2\nu_0^2) = E_{\theta_0}(T_0^2) - 2\,\frac{(E_{\theta_0}(T_0\nu_0))^2}{E_{\theta_0}(\nu_0^2)} + \frac{(E_{\theta_0}(T_0\nu_0))^2}{E_{\theta_0}(\nu_0^2)} = E_{\theta_0}(T_0^2) - \frac{(E_{\theta_0}(T_0\nu_0))^2}{E_{\theta_0}(\nu_0^2)} < E_{\theta_0}(T_0^2),$$
and therefore,
$$Var_{\theta_0}(T_0 + \lambda\nu_0) < Var_{\theta_0}(T_0).$$
This means that $T_0$ is not UMVUE, i.e., a contradiction!
"$\Longleftarrow$:"
Suppose $E_\theta(\nu T_0) = 0$ for some $T_0 \in U$, for all $\theta\in\Theta$ and all $\nu \in U_0$.
We choose $T \in U$; then also $T_0 - T \in U_0$ and
$$E_\theta(T_0(T_0 - T)) = 0 \quad \forall\theta\in\Theta.$$
It follows by the Cauchy-Schwarz inequality (Theorem 4.5.7 (ii)) that
$$E_\theta(T_0^2) = E_\theta(T_0 T) \le (E_\theta(T_0^2))^{\frac{1}{2}}\,(E_\theta(T^2))^{\frac{1}{2}}.$$
This implies
$$(E_\theta(T_0^2))^{\frac{1}{2}} \le (E_\theta(T^2))^{\frac{1}{2}}$$
and
$$Var_\theta(T_0) \le Var_\theta(T),$$
where $T$ is an arbitrary unbiased estimate of $\theta$. Thus, $T_0$ is UMVUE.
Lecture 11:
Mo 02/05/01

Theorem 8.4.8:
Let $U$ be the non-empty class of unbiased estimates of $\theta \in \Theta$ as defined in Theorem 8.4.7. Then there exists at most one UMVUE $T \in U$ for $\theta$.
Proof:
Suppose $T_0, T_1 \in U$ are both UMVUE.
Then $T_1 - T_0 \in U_0$, $Var_\theta(T_0) = Var_\theta(T_1)$, and $E_\theta(T_0(T_1 - T_0)) = 0$ for all $\theta\in\Theta$
$\Longrightarrow E_\theta(T_0^2) = E_\theta(T_0 T_1)$
$\Longrightarrow Cov_\theta(T_0, T_1) = E_\theta(T_0 T_1) - E_\theta(T_0)E_\theta(T_1) = E_\theta(T_0^2) - (E_\theta(T_0))^2 = Var_\theta(T_0) = Var_\theta(T_1) \;\forall\theta\in\Theta$
$\Longrightarrow \rho_{T_0 T_1} = 1 \;\forall\theta\in\Theta$
$\Longrightarrow P_\theta(aT_0 + bT_1 = 0) = 1$ for some $a, b$, $\forall\theta\in\Theta$
$\Longrightarrow \theta = E_\theta(T_0) = E_\theta(-\tfrac{b}{a}T_1) = E_\theta(T_1) \;\forall\theta\in\Theta$
$\Longrightarrow -\tfrac{b}{a} = 1$
$\Longrightarrow P_\theta(T_0 = T_1) = 1 \;\forall\theta\in\Theta$
Theorem 8.4.9:
(i) If a UMVUE $T$ exists for a real function $d(\theta)$, then $\alpha T$ is the UMVUE for $\alpha\,d(\theta)$, $\alpha \in \mathbb{R}$.
(ii) If UMVUE's $T_1$ and $T_2$ exist for real functions $d_1(\theta)$ and $d_2(\theta)$, respectively, then $T_1 + T_2$ is the UMVUE for $d_1(\theta) + d_2(\theta)$.
Proof:
Homework.
Theorem 8.4.10:
If a sample consists of $n$ independent observations $X_1,\ldots,X_n$ from the same distribution, the UMVUE, if it exists, is permutation invariant.
Proof:
Homework.
Theorem 8.4.11: Rao-Blackwell
Let $\{F_\theta : \theta\in\Theta\}$ be a family of cdf's, and let $h$ be any statistic in $U$, where $U$ is the non-empty class of all unbiased estimates of $\theta$ with $E_\theta(h^2) < \infty$. Let $T$ be a sufficient statistic for $\{F_\theta : \theta\in\Theta\}$. Then the conditional expectation $E(h \mid T)$ is independent of $\theta$ and it is an unbiased estimate of $\theta$. Additionally,
$$E_\theta((E(h \mid T) - \theta)^2) \le E_\theta((h - \theta)^2) \quad \forall\theta\in\Theta,$$
with equality iff $h = E(h \mid T)$.
Proof:
By Theorem 4.7.3, $E_\theta(E(h \mid T)) = E_\theta(h) = \theta$.
Since the distribution of $X$ given $T$ does not depend on $\theta$ due to sufficiency, neither does $E(h \mid T)$ depend on $\theta$.
Thus, we only have to show that
$$E_\theta((E(h \mid T))^2) \le E_\theta(h^2) = E_\theta(E(h^2 \mid T)),$$
i.e., that
$$(E(h \mid T))^2 \le E(h^2 \mid T).$$
But the Cauchy-Schwarz inequality (Theorem 4.5.7 (ii)) gives us
$$(E(h \mid T))^2 \le E(h^2 \mid T)\,E(1 \mid T) = E(h^2 \mid T).$$
Equality holds iff
$$E_\theta((E(h \mid T))^2) = E_\theta(h^2)$$
$\Longleftrightarrow E_\theta(E(h^2 \mid T) - (E(h \mid T))^2) = 0$
$\Longleftrightarrow E_\theta(Var(h \mid T)) = 0$
$\Longleftrightarrow Var(h \mid T) = 0$
$\Longleftrightarrow E(h^2 \mid T) = (E(h \mid T))^2$
$\Longleftrightarrow h$ is a function of $T$ and $h = E(h \mid T)$.
For the proof of the last step, see Rohatgi, pages 170-171, Theorem 2, Corollary, and Proof of the Corollary.
Theorem 8.4.12: Lehmann-Scheffé
If $T$ is a complete sufficient statistic and if there exists an unbiased estimate $h$ of $\theta$, then $E(h \mid T)$ is the (unique) UMVUE.
Proof:
Suppose that $h_1, h_2 \in U$. Then $E(h_1 \mid T)$ and $E(h_2 \mid T)$ are both unbiased for $\theta$ by Theorem 8.4.11.
Therefore,
$$E_\theta(E(h_1 \mid T) - E(h_2 \mid T)) = 0 \quad \forall\theta\in\Theta.$$
Since $T$ is complete, $E(h_1 \mid T) = E(h_2 \mid T)$.
Therefore, $E(h \mid T)$ must be the same for all $h \in U$, and $E(h \mid T)$ improves on every $h \in U$. Therefore, $E(h \mid T)$ is UMVUE by Theorem 8.4.11.
Note:
We can use Theorem 8.4.12 to find the UMVUE in two ways if we have a complete sufficient statistic $T$:
(i) If we can find an unbiased estimate $h(T)$, it will be the UMVUE since $E(h(T) \mid T) = h(T)$.
(ii) If we have any unbiased estimate $h$ and if we can calculate $E(h \mid T)$, then $E(h \mid T)$ will be the UMVUE. The process of determining the UMVUE this way is often called Rao-Blackwellization.
(iii) Even if a complete sufficient statistic does not exist, the UMVUE may still exist (see Rohatgi, pages 357-358, Example 10).
Example 8.4.13:
Let $X_1,\ldots,X_n$ be iid Bin$(1, p)$. Then $T = \sum_{i=1}^n X_i$ is a complete sufficient statistic, as seen in Examples 8.3.6 and 8.3.11.
Since $E(X_1) = p$, $X_1$ is an unbiased estimate of $p$. However, since $X_1$ is not a function of $T$, $X_1$ is not the UMVUE obtained via part (i) of the Note above.
We can use part (ii) of the Note above to construct the UMVUE. It is
$$P(X_1 = x \mid T) = \begin{cases} \dfrac{T}{n}, & x = 1\\[6pt] \dfrac{n-T}{n}, & x = 0\end{cases}$$
$\Longrightarrow E(X_1 \mid T) = \frac{T}{n} = \bar{X}$
$\Longrightarrow \bar{X}$ is the UMVUE for $p$
If we are interested in the UMVUE for $d(p) = p(1-p) = p - p^2 = Var(X_1)$, we can find it in the following way:
$$E(T) = np$$
$$E(T^2) = E\!\left(\sum_{i=1}^n X_i^2 + \sum_{i=1}^n \sum_{j=1, j\neq i}^n X_i X_j\right) = np + n(n-1)p^2$$
$$\Longrightarrow\; E\!\left(\frac{nT}{n(n-1)}\right) = \frac{np}{n-1}, \qquad E\!\left(\frac{T^2}{n(n-1)}\right) = \frac{p}{n-1} + p^2$$
$$\Longrightarrow\; E\!\left(\frac{nT - T^2}{n(n-1)}\right) = \frac{np}{n-1} - \frac{p}{n-1} - p^2 = \frac{(n-1)p}{n-1} - p^2 = p - p^2 = d(p)$$
Thus, due to part (i) of the Note above, $\frac{nT - T^2}{n(n-1)}$ is the UMVUE for $d(p) = p(1-p)$.
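The unbiasedness claim for $\frac{nT - T^2}{n(n-1)}$ can be verified exactly by summing against the Bin$(n, p)$ pmf over a small grid of $n$ and $p$ (an added sketch, not part of the notes):

```python
import math

def expected_umvue(n, p):
    """E[(nT - T^2)/(n(n-1))] for T ~ Bin(n, p), computed exactly."""
    total = 0.0
    for t in range(n + 1):
        pmf = math.comb(n, t) * p ** t * (1 - p) ** (n - t)
        total += (n * t - t * t) / (n * (n - 1)) * pmf
    return total

# For every (n, p) in the grid, the expectation should equal p*(1-p).
checks = [(n, p, expected_umvue(n, p)) for n in (2, 5, 10) for p in (0.1, 0.5, 0.9)]
```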
Lecture 12:
We 02/07/01

8.5 Lower Bounds for the Variance of an Estimate
Theorem 8.5.1: Cramér-Rao Lower Bound (CRLB)
Let $\Theta$ be an open interval of $\mathbb{R}$. Let $\{f_\theta : \theta\in\Theta\}$ be a family of pdf's or pmf's. Assume that the set $\{x : f_\theta(x) = 0\}$ is independent of $\theta$.
Let $\psi(\theta)$ be defined on $\Theta$ and let it be differentiable for all $\theta\in\Theta$. Let $T$ be an unbiased estimate of $\psi(\theta)$ such that $E_\theta(T^2) < \infty$ for all $\theta\in\Theta$. Suppose that
(i) $\frac{\partial f_\theta(x)}{\partial\theta}$ is defined for all $\theta\in\Theta$,
(ii) for a pdf $f_\theta$
$$\frac{\partial}{\partial\theta}\left(\int f_\theta(x)\,dx\right) = \int \frac{\partial f_\theta(x)}{\partial\theta}\,dx = 0 \quad \forall\theta\in\Theta$$
or for a pmf $f_\theta$
$$\frac{\partial}{\partial\theta}\left(\sum_x f_\theta(x)\right) = \sum_x \frac{\partial f_\theta(x)}{\partial\theta} = 0 \quad \forall\theta\in\Theta,$$
(iii) for a pdf $f_\theta$
$$\frac{\partial}{\partial\theta}\left(\int T(x) f_\theta(x)\,dx\right) = \int T(x)\,\frac{\partial f_\theta(x)}{\partial\theta}\,dx \quad \forall\theta\in\Theta$$
or for a pmf $f_\theta$
$$\frac{\partial}{\partial\theta}\left(\sum_x T(x) f_\theta(x)\right) = \sum_x T(x)\,\frac{\partial f_\theta(x)}{\partial\theta} \quad \forall\theta\in\Theta.$$
Let $\lambda : \Theta \to \mathbb{R}$ be any measurable function. Then it holds that
$$(\psi'(\theta))^2 \le E_\theta\big((T(X) - \lambda(\theta))^2\big)\; E_\theta\!\left(\left(\frac{\partial\log f_\theta(X)}{\partial\theta}\right)^2\right) \quad \forall\theta\in\Theta. \qquad (A)$$
Further, for any $\theta_0 \in \Theta$, either $\psi'(\theta_0) = 0$ and equality holds in (A) for $\theta = \theta_0$, or we have
$$E_{\theta_0}\big((T(X) - \lambda(\theta_0))^2\big) \ge \frac{(\psi'(\theta_0))^2}{E_{\theta_0}\!\left(\left(\frac{\partial\log f_\theta(X)}{\partial\theta}\right)^2\right)}. \qquad (B)$$
Finally, if equality holds in (B), then there exists a real number $K(\theta_0) \neq 0$ such that
$$T(X) - \lambda(\theta_0) = K(\theta_0)\left.\frac{\partial\log f_\theta(X)}{\partial\theta}\right|_{\theta=\theta_0} \qquad (C)$$
with probability 1, provided that $T$ is not a constant.
Note:
(i) Conditions (i), (ii), and (iii) are called regularity conditions. Conditions under which they hold can be found in Rohatgi, pages 11-13, Parts 12 and 13.
(ii) The right-hand side of inequality (B) is called the Cramér-Rao Lower Bound of $\theta_0$, or, in symbols, CRLB$(\theta_0)$.
Proof:
From (ii), we get
$$E_\theta\!\left(\frac{\partial}{\partial\theta}\log f_\theta(X)\right) = \int \left(\frac{\partial}{\partial\theta}\log f_\theta(x)\right) f_\theta(x)\,dx = \int \left(\frac{\partial}{\partial\theta} f_\theta(x)\right)\frac{1}{f_\theta(x)}\, f_\theta(x)\,dx = \int \frac{\partial}{\partial\theta} f_\theta(x)\,dx = 0$$
$$\Longrightarrow\; E_\theta\!\left(\lambda(\theta)\,\frac{\partial}{\partial\theta}\log f_\theta(X)\right) = 0$$
From (iii), we get
$$E_\theta\!\left(T(X)\,\frac{\partial}{\partial\theta}\log f_\theta(X)\right) = \int \left(T(x)\,\frac{\partial}{\partial\theta}\log f_\theta(x)\right) f_\theta(x)\,dx = \int \left(T(x)\,\frac{\partial}{\partial\theta} f_\theta(x)\right)\frac{1}{f_\theta(x)}\, f_\theta(x)\,dx = \int T(x)\,\frac{\partial}{\partial\theta} f_\theta(x)\,dx$$
$$\overset{(iii)}{=} \frac{\partial}{\partial\theta}\left(\int T(x) f_\theta(x)\,dx\right) = \frac{\partial}{\partial\theta} E_\theta(T(X)) = \frac{\partial}{\partial\theta}\psi(\theta) = \psi'(\theta)$$
$$\Longrightarrow\; E_\theta\!\left((T(X) - \lambda(\theta))\,\frac{\partial}{\partial\theta}\log f_\theta(X)\right) = \psi'(\theta)$$
$$\Longrightarrow\; (\psi'(\theta))^2 = \left(E_\theta\!\left((T(X) - \lambda(\theta))\,\frac{\partial}{\partial\theta}\log f_\theta(X)\right)\right)^2 \overset{(*)}{\le} E_\theta\big((T(X) - \lambda(\theta))^2\big)\; E_\theta\!\left(\left(\frac{\partial}{\partial\theta}\log f_\theta(X)\right)^2\right),$$
i.e., (A) holds. (*) follows from the Cauchy-Schwarz inequality (Theorem 4.5.7 (ii)).
If $\psi'(\theta_0) \neq 0$, then the left-hand side of (A) is $> 0$. Therefore, the right-hand side is $> 0$. Thus,
$$E_{\theta_0}\!\left(\left(\frac{\partial}{\partial\theta}\log f_\theta(X)\right)^2\right) > 0,$$
and (B) follows directly from (A).
If $\psi'(\theta_0) = 0$, but equality does not hold in (A), then again
$$E_{\theta_0}\!\left(\left(\frac{\partial}{\partial\theta}\log f_\theta(X)\right)^2\right) > 0,$$
and (B) follows directly from (A).
Finally, if equality holds in (B), then $\psi'(\theta_0) \neq 0$ (because $T$ is not constant). Thus, $E_{\theta_0}((T(X) - \lambda(\theta_0))^2) > 0$. The Cauchy-Schwarz inequality (Theorem 4.5.7 (iii)) gives equality iff there exist constants $(\alpha, \beta) \in \mathbb{R}^2 \setminus \{(0,0)\}$ such that
$$P\!\left(\alpha\,(T(X) - \lambda(\theta_0)) + \beta\left.\frac{\partial}{\partial\theta}\log f_\theta(X)\right|_{\theta=\theta_0} = 0\right) = 1.$$
This implies $K(\theta_0) = -\frac{\beta}{\alpha}$ and (C) holds. Since $T$ is not a constant, it also holds that $K(\theta_0) \neq 0$.
Example 8.5.2:
If we take $\lambda(\theta) = \psi(\theta)$, we get from (B)
$$Var_\theta(T(X)) \ge \frac{(\psi'(\theta))^2}{E_\theta\!\left(\left(\frac{\partial\log f_\theta(X)}{\partial\theta}\right)^2\right)}. \qquad (*)$$
If we have $\psi(\theta) = \theta$, the inequality (*) above reduces to
$$Var_\theta(T(X)) \ge \left(E_\theta\!\left(\left(\frac{\partial\log f_\theta(X)}{\partial\theta}\right)^2\right)\right)^{-1}.$$
Finally, if $X = (X_1,\ldots,X_n)$ iid with identical $f_\theta(x)$, the inequality (*) reduces to
$$Var_\theta(T(X)) \ge \frac{(\psi'(\theta))^2}{n\, E_\theta\!\left(\left(\frac{\partial\log f_\theta(X_1)}{\partial\theta}\right)^2\right)}.$$
Example 8.5.3:
Let $X_1,\ldots,X_n$ be iid Bin$(1, p)$ and let $X = \sum_{i=1}^n X_i \sim$ Bin$(n, p)$, $p \in \Theta = (0,1) \subseteq \mathbb{R}$. Let
$$\psi(p) = E(T(X)) = \sum_{x=0}^n T(x)\binom{n}{x} p^x(1-p)^{n-x}.$$
$\psi(p)$ is differentiable with respect to $p$ under the summation sign since it is a finite polynomial in $p$.
Since $X = \sum_{i=1}^n X_i$ with $f_p(x_1) = p^{x_1}(1-p)^{1-x_1}$, $x_1 \in \{0,1\}$,
$$\Longrightarrow\; \log f_p(x_1) = x_1 \log p + (1-x_1)\log(1-p)$$
$$\Longrightarrow\; \frac{\partial}{\partial p}\log f_p(x_1) = \frac{x_1}{p} - \frac{1-x_1}{1-p} = \frac{x_1(1-p) - p(1-x_1)}{p(1-p)} = \frac{x_1 - p}{p(1-p)}$$
$$\Longrightarrow\; E_p\!\left(\left(\frac{\partial}{\partial p}\log f_p(X_1)\right)^2\right) = \frac{Var(X_1)}{p^2(1-p)^2} = \frac{1}{p(1-p)}$$
So, if $\psi(p) = \lambda(p) = p$ and if $T$ is unbiased for $p$, then
$$Var_p(T(X)) \ge \frac{1}{n\,\frac{1}{p(1-p)}} = \frac{p(1-p)}{n}.$$
Since $Var(\bar{X}) = \frac{p(1-p)}{n}$, $\bar{X}$ attains the CRLB. Therefore, $\bar{X}$ is the UMVUE.
Lecture 13:
Fr 02/09/01

Example 8.5.4:
Let $X_1,\ldots,X_n$ be iid $U(0, \theta)$, $\theta \in \Theta = (0,\infty) \subseteq \mathbb{R}$, and let $X$ denote one such observation.
$$f_\theta(x) = \frac{1}{\theta}\, I_{(0,\theta)}(x) \;\Longrightarrow\; \log f_\theta(x) = -\log\theta \;\Longrightarrow\; \left(\frac{\partial}{\partial\theta}\log f_\theta(x)\right)^2 = \frac{1}{\theta^2} \;\Longrightarrow\; E_\theta\!\left(\left(\frac{\partial}{\partial\theta}\log f_\theta(X)\right)^2\right) = \frac{1}{\theta^2}$$
Thus, the "CRLB" is $\frac{\theta^2}{n}$.
We know that $\frac{n+1}{n}X_{(n)}$ is the UMVUE since it is a function of the complete sufficient statistic $X_{(n)}$ (see Homework) and $E(X_{(n)}) = \frac{n}{n+1}\theta$. It is
$$Var\!\left(\frac{n+1}{n}X_{(n)}\right) = \frac{\theta^2}{n(n+2)} < \frac{\theta^2}{n}\;???$$
How is this possible? Quite simply because one of the required conditions for Theorem 8.5.1 does not hold: the support of $X$ depends on $\theta$.
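The paradox can be reproduced by simulation: the variance of $\frac{n+1}{n}X_{(n)}$ lands near $\frac{\theta^2}{n(n+2)}$, strictly below the formal bound $\frac{\theta^2}{n}$ (an added sketch; $\theta = 3$ and $n = 10$ are arbitrary):

```python
import random
import statistics

def umvue_variance_sim(theta=3.0, n=10, reps=40000, seed=11):
    """Monte Carlo mean and variance of (n+1)/n * X_(n) for iid U(0, theta)."""
    rng = random.Random(seed)
    ests = [(n + 1) / n * max(rng.uniform(0.0, theta) for _ in range(n))
            for _ in range(reps)]
    return statistics.fmean(ests), statistics.pvariance(ests)

theta, n = 3.0, 10
mean_est, var_est = umvue_variance_sim(theta, n)
theory_var = theta ** 2 / (n * (n + 2))  # theta^2 / (n(n+2))
naive_bound = theta ** 2 / n             # the formal "CRLB", not valid here
```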
Theorem 8.5.5: Chapman, Robbins, Kiefer Inequality (CRK Inequality)
Let $\Theta \subseteq \mathbb{R}$. Let $\{f_\theta : \theta\in\Theta\}$ be a family of pdf's or pmf's. Let $\psi(\theta)$ be defined on $\Theta$. Let $T$ be an unbiased estimate of $\psi(\theta)$ such that $E_\theta(T^2) < \infty$ for all $\theta\in\Theta$.
If $\theta \neq \vartheta$, $\theta, \vartheta \in \Theta$, assume that $f_\theta(x)$ and $f_\vartheta(x)$ are different. Also assume that there exists a $\vartheta \in \Theta$ such that $\vartheta \neq \theta$ and
$$S(\vartheta) = \{x : f_\vartheta(x) > 0\} \subseteq S(\theta) = \{x : f_\theta(x) > 0\}.$$
Then it holds that
$$Var_\theta(T(X)) \ge \sup_{\{\vartheta :\, S(\vartheta)\subseteq S(\theta),\, \vartheta\neq\theta\}} \frac{(\psi(\vartheta) - \psi(\theta))^2}{Var_\theta\!\left(\frac{f_\vartheta(X)}{f_\theta(X)}\right)} \quad \forall\theta\in\Theta.$$
Proof:
Since $T$ is unbiased, it follows that
$$E_\vartheta(T(X)) = \psi(\vartheta) \quad \forall\vartheta\in\Theta.$$
For $\vartheta \neq \theta$ and $S(\vartheta) \subseteq S(\theta)$, it follows that
$$\int_{S(\theta)} T(x)\,\frac{f_\vartheta(x) - f_\theta(x)}{f_\theta(x)}\, f_\theta(x)\,dx = E_\vartheta(T(X)) - E_\theta(T(X)) = \psi(\vartheta) - \psi(\theta)$$
and
$$0 = \int_{S(\theta)} \frac{f_\vartheta(x) - f_\theta(x)}{f_\theta(x)}\, f_\theta(x)\,dx = E_\theta\!\left(\frac{f_\vartheta(X)}{f_\theta(X)} - 1\right).$$
Therefore,
$$Cov_\theta\!\left(T(X),\; \frac{f_\vartheta(X)}{f_\theta(X)} - 1\right) = \psi(\vartheta) - \psi(\theta).$$
It follows by the Cauchy-Schwarz inequality (Theorem 4.5.7 (ii)) that
$$(\psi(\vartheta) - \psi(\theta))^2 = \left(Cov_\theta\!\left(T(X),\, \frac{f_\vartheta(X)}{f_\theta(X)} - 1\right)\right)^2 \le Var_\theta(T(X))\; Var_\theta\!\left(\frac{f_\vartheta(X)}{f_\theta(X)} - 1\right) = Var_\theta(T(X))\; Var_\theta\!\left(\frac{f_\vartheta(X)}{f_\theta(X)}\right).$$
Thus,
$$Var_\theta(T(X)) \ge \frac{(\psi(\vartheta) - \psi(\theta))^2}{Var_\theta\!\left(\frac{f_\vartheta(X)}{f_\theta(X)}\right)}.$$
Finally, we take the supremum of the right-hand side with respect to $\{\vartheta : S(\vartheta) \subseteq S(\theta),\; \vartheta \neq \theta\}$, which completes the proof.
Note:
(i) The CRK inequality holds without the previous regularity conditions.
(ii) An alternative form of the CRK inequality is:
Let $\theta, \theta+\delta \in \Theta$, $\delta \neq 0$, be distinct with $S(\theta+\delta) \subseteq S(\theta)$. Let $\psi(\theta) = \theta$. Define
$$J = J(\theta, \delta) = \frac{1}{\delta^2}\left(\left(\frac{f_{\theta+\delta}(X)}{f_\theta(X)}\right)^2 - 1\right).$$
Then the CRK inequality reads as
$$Var_\theta(T(X)) \ge \frac{1}{\inf_\delta E_\theta(J)},$$
with the infimum taken over $\delta \neq 0$ such that $S(\theta+\delta) \subseteq S(\theta)$.
(iii) The CRK inequality works for discrete $\Theta$; the CRLB does not work in such cases.
Example 8.5.6:
Let $X \sim U(0, \theta)$, $\theta > 0$, based on a single observation. The required conditions for the CRLB are not met. Recall from Example 8.5.4 that $\frac{n+1}{n}X_{(n)}$ is UMVUE with $Var(\frac{n+1}{n}X_{(n)}) = \frac{\theta^2}{n(n+2)} < \frac{\theta^2}{n} =$ "CRLB".
Let $\psi(\theta) = \theta$. If $\vartheta < \theta$, then $S(\vartheta) \subseteq S(\theta)$. It is
$$E_\theta\!\left(\left(\frac{f_\vartheta(X)}{f_\theta(X)}\right)^2\right) = \int_0^\vartheta \left(\frac{\theta}{\vartheta}\right)^2 \frac{1}{\theta}\,dx = \frac{\theta}{\vartheta}, \qquad E_\theta\!\left(\frac{f_\vartheta(X)}{f_\theta(X)}\right) = \int_0^\vartheta \frac{\theta}{\vartheta}\cdot\frac{1}{\theta}\,dx = 1$$
$$\Longrightarrow\; Var_\theta(T(X)) \ge \sup_{\{\vartheta :\, 0<\vartheta<\theta\}} \frac{(\vartheta - \theta)^2}{\frac{\theta}{\vartheta} - 1} = \sup_{\{\vartheta :\, 0<\vartheta<\theta\}} \vartheta(\theta - \vartheta) = \frac{\theta^2}{4}$$
Since $X$ is complete and sufficient and $2X$ is unbiased for $\theta$, $T(X) = 2X$ is the UMVUE. It is
$$Var_\theta(2X) = 4\,Var_\theta(X) = 4\cdot\frac{\theta^2}{12} = \frac{\theta^2}{3} > \frac{\theta^2}{4}.$$
Since the CRK lower bound is not attained by the UMVUE, it is not attained by any unbiased estimate of $\theta$.
Definition 8.5.7:
Let $T_1, T_2$ be unbiased estimates of $\theta$ with $E_\theta(T_1^2) < \infty$ and $E_\theta(T_2^2) < \infty$ for all $\theta\in\Theta$. We define the efficiency of $T_1$ relative to $T_2$ by
$$\mathrm{eff}_\theta(T_1, T_2) = \frac{Var_\theta(T_1)}{Var_\theta(T_2)}$$
and say that $T_1$ is more efficient than $T_2$ if $\mathrm{eff}_\theta(T_1, T_2) < 1$.
Definition 8.5.8:
Assume the regularity conditions of Theorem 8.5.1 are satisfied by a family of cdf's $\{F_\theta : \theta\in\Theta\}$, $\Theta \subseteq \mathbb{R}$. An unbiased estimate $T$ for $\theta$ is most efficient for the family $\{F_\theta\}$ if
$$Var_\theta(T) = \left(E_\theta\!\left(\left(\frac{\partial\log f_\theta(X)}{\partial\theta}\right)^2\right)\right)^{-1}.$$
Definition 8.5.9:
Let $T$ be the most efficient estimate for the family of cdf's $\{F_\theta : \theta\in\Theta\}$, $\Theta \subseteq \mathbb{R}$. Then the efficiency of any unbiased estimate $T_1$ of $\theta$ is defined as
$$\mathrm{eff}_\theta(T_1) = \mathrm{eff}_\theta(T_1, T) = \frac{Var_\theta(T_1)}{Var_\theta(T)}.$$
Definition 8.5.10:
$T_1$ is asymptotically (most) efficient if $T_1$ is asymptotically unbiased, i.e., $\lim_{n\to\infty} E_\theta(T_1) = \theta$, and $\lim_{n\to\infty} \mathrm{eff}_\theta(T_1) = 1$, where $n$ is the sample size.
Lecture 14:
Mo 02/12/01

Theorem 8.5.11:
A necessary and sufficient condition for an estimate $T$ of $\theta$ to be most efficient is that $T$ is sufficient and
$$\frac{1}{K(\theta)}\,(T(x) - \theta) = \frac{\partial\log f_\theta(x)}{\partial\theta} \quad \forall\theta\in\Theta, \qquad (*)$$
where $K(\theta)$ is defined as in Theorem 8.5.1 and the regularity conditions for Theorem 8.5.1 hold.
Proof:
"$\Longrightarrow$:"
Theorem 8.5.1 says that if $T$ is most efficient, then (*) holds.
Assume that $\Theta = \mathbb{R}$. We define
$$C(\theta_0) = \int_{-\infty}^{\theta_0} \frac{1}{K(\theta)}\,d\theta, \qquad D(\theta_0) = \int_{-\infty}^{\theta_0} \frac{\theta}{K(\theta)}\,d\theta, \qquad R(x) = \lim_{\theta\to-\infty} \log f_\theta(x) - c(x),$$
where $c(x)$ is the constant of integration arising below.
Integrating (*) with respect to $\theta$ gives
$$\int_{-\infty}^{\theta_0} \frac{1}{K(\theta)}\,T(x)\,d\theta - \int_{-\infty}^{\theta_0} \frac{\theta}{K(\theta)}\,d\theta = \int_{-\infty}^{\theta_0} \frac{\partial\log f_\theta(x)}{\partial\theta}\,d\theta$$
$\Longrightarrow T(x)\,C(\theta_0) - D(\theta_0) = \log f_\theta(x)\big|_{-\infty}^{\theta_0} + c(x)$
$\Longrightarrow T(x)\,C(\theta_0) - D(\theta_0) = \log f_{\theta_0}(x) - \lim_{\theta\to-\infty}\log f_\theta(x) + c(x)$
$\Longrightarrow T(x)\,C(\theta_0) - D(\theta_0) = \log f_{\theta_0}(x) - R(x)$
Therefore,
$$f_{\theta_0}(x) = \exp\big(T(x)\,C(\theta_0) - D(\theta_0) + R(x)\big),$$
which belongs to an exponential family. Thus, $T$ is sufficient.
"$\Longleftarrow$:"
From (*), we get
$$E_\theta\!\left(\left(\frac{\partial\log f_\theta(X)}{\partial\theta}\right)^2\right) = \frac{1}{(K(\theta))^2}\, Var_\theta(T(X)).$$
Additionally, it holds that
$$E_\theta\!\left((T(X) - \theta)\,\frac{\partial\log f_\theta(X)}{\partial\theta}\right) = 1,$$
as shown in the Proof of Theorem 8.5.1 (with $\psi(\theta) = \lambda(\theta) = \theta$).
Using (*) in the line above, we get
$$K(\theta)\, E_\theta\!\left(\left(\frac{\partial\log f_\theta(X)}{\partial\theta}\right)^2\right) = 1, \quad\text{i.e.,}\quad K(\theta) = \left(E_\theta\!\left(\left(\frac{\partial\log f_\theta(X)}{\partial\theta}\right)^2\right)\right)^{-1}.$$
Therefore,
$$Var_\theta(T(X)) = \left(E_\theta\!\left(\left(\frac{\partial\log f_\theta(X)}{\partial\theta}\right)^2\right)\right)^{-1},$$
i.e., $T$ is most efficient for $\theta$.
Note:
Instead of saying "a necessary and sufficient condition for an estimate $T$ of $\theta$ to be most efficient ..." in the previous Theorem, we could say that "an estimate $T$ of $\theta$ is most efficient iff ...", i.e., "necessary and sufficient" means the same as "iff".
8.6 The Method of Moments
Definition 8.6.1:
Let $X_1,\ldots,X_n$ be iid with pdf (or pmf) $f_\theta$, $\theta\in\Theta$. We assume that the first $k$ moments $m_1,\ldots,m_k$ of $f_\theta$ exist. If $\theta$ can be written as
$$\theta = h(m_1,\ldots,m_k),$$
where $h : \mathbb{R}^k \to \mathbb{R}$ is a Borel-measurable function, the method of moments estimate (mom) of $\theta$ is
$$\hat\theta_{mom} = T(X_1,\ldots,X_n) = h\!\left(\frac{1}{n}\sum_{i=1}^n X_i,\; \frac{1}{n}\sum_{i=1}^n X_i^2,\; \ldots,\; \frac{1}{n}\sum_{i=1}^n X_i^k\right).$$
Note:
(i) The Definition above can also be used to estimate joint moments. For example, we use $\frac{1}{n}\sum_{i=1}^n X_iY_i$ to estimate $E(XY)$.
(ii) Since $E\!\left(\frac{1}{n}\sum_{i=1}^n X_i^j\right) = m_j$, method of moments estimates are unbiased for the population moments. The WLLN and the CLT say that these estimates are consistent and asymptotically Normal as well.
(iii) If $\theta$ is not a linear function of the population moments, $\hat\theta_{mom}$ will, in general, not be unbiased. However, it will be consistent and (usually) asymptotically Normal.
(iv) Method of moments estimates do not exist if the related moments do not exist.
(v) Method of moments estimates may not be unique. If there exist multiple choices for the mom, one usually takes the estimate involving the lowest-order sample moment.
(vi) Alternative method of moments estimates can be obtained from central moments (rather than from raw moments) or by using moments other than the first $k$ moments.
Example 8.6.2:
Let $X_1,\ldots,X_n$ be iid $N(\mu,\sigma^2)$.
Since $\mu = m_1$, it is $\hat\mu_{mom} = \bar{X}$.
This is an unbiased, consistent, and asymptotically Normal estimate.
Since $\sigma = \sqrt{m_2 - m_1^2}$, it is $\hat\sigma_{mom} = \sqrt{\frac{1}{n}\sum_{i=1}^n X_i^2 - \bar{X}^2}$.
This is a consistent, asymptotically Normal estimate. However, it is not unbiased.
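The two moment estimates are direct functions of the first two sample moments; a small sketch (added here for illustration; $\mu = 10$ and $\sigma = 3$ are arbitrary choices, and the large sample size makes both estimates land close to the truth):

```python
import math
import random

def mom_normal(xs):
    """Method of moments estimates (mu_hat, sigma_hat) for N(mu, sigma^2)."""
    n = len(xs)
    m1 = sum(xs) / n                 # first sample moment
    m2 = sum(x * x for x in xs) / n  # second sample moment
    return m1, math.sqrt(m2 - m1 * m1)

rng = random.Random(12)
data = [rng.gauss(10.0, 3.0) for _ in range(50000)]
mu_hat, sigma_hat = mom_normal(data)
```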
Example 8.6.3:
Let $X_1,\ldots,X_n$ be iid Poisson$(\lambda)$.
We know that $E(X_1) = Var(X_1) = \lambda$.
Thus, $\bar{X}$ and $\frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2$ are possible choices for the mom of $\lambda$. Due to part (v) of the Note above, one uses $\hat\lambda_{mom} = \bar{X}$.
8.7 Maximum Likelihood Estimation
Definition 8.7.1:
Let $(X_1,\ldots,X_n)$ be an $n$-rv with pdf (or pmf) $f_\theta(x_1,\ldots,x_n)$, $\theta\in\Theta$. We call the function
$$L(\theta; x_1,\ldots,x_n) = f_\theta(x_1,\ldots,x_n)$$
of $\theta$ the likelihood function.
Note:
(i) Often $\theta$ is a vector of parameters.
(ii) If $X_1,\ldots,X_n$ are iid with pdf (or pmf) $f_\theta(x)$, then $L(\theta; x_1,\ldots,x_n) = \prod_{i=1}^n f_\theta(x_i)$.
Definition 8.7.2:
A maximum likelihood estimate (MLE) is a non-constant estimate $\hat\theta_{ML}$ such that
$$L(\hat\theta_{ML}; x_1,\ldots,x_n) = \sup_{\theta\in\Theta} L(\theta; x_1,\ldots,x_n).$$
Note:
It is often convenient to work with $\log L$ when determining the maximum likelihood estimate. Since the log is monotone, the maximum is attained at the same $\theta$.
Example 8.7.3:
Let X₁, …, Xₙ be iid N(μ, σ²), where μ and σ² are unknown.

L(μ, σ²; x₁, …, xₙ) = (1 / (σⁿ (2π)^{n/2})) exp( − Σ_{i=1}^n (xᵢ − μ)² / (2σ²) )

⟹ log L(μ, σ²; x₁, …, xₙ) = −(n/2) log σ² − (n/2) log(2π) − Σ_{i=1}^n (xᵢ − μ)² / (2σ²)

The MLE must satisfy

∂ log L/∂μ = (1/σ²) Σ_{i=1}^n (xᵢ − μ) = 0   (A)

∂ log L/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^n (xᵢ − μ)² = 0   (B)

These are the two likelihood equations. From equation (A) we get μ̂_ML = X̄. Substituting this for μ into equation (B) and solving for σ², we get σ̂²_ML = (1/n) Σ_{i=1}^n (Xᵢ − X̄)². Note that σ̂²_ML is biased for σ².

Formally, we still have to verify that we found a maximum (and not a minimum) and that the likelihood function does not take its absolute maximum at a parameter θ on the edge of the parameter space Θ, which would not be detectable by our approach for local extrema.
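The closed-form solutions of the likelihood equations (A) and (B) can be checked numerically. This sketch (our own illustration; the parameter values 5 and 2 and the sample size are assumptions) computes the MLE's and verifies that nearby parameter values give a strictly smaller log-likelihood.

```python
import math
import random

random.seed(1)
x = [random.gauss(5.0, 2.0) for _ in range(1000)]
n = len(x)

def loglik(mu, sigma2):
    # log L(mu, sigma^2) for an iid N(mu, sigma^2) sample
    return (-n / 2 * math.log(sigma2) - n / 2 * math.log(2 * math.pi)
            - sum((xi - mu) ** 2 for xi in x) / (2 * sigma2))

mu_hat = sum(x) / n                               # solves equation (A)
s2_hat = sum((xi - mu_hat) ** 2 for xi in x) / n  # solves (B); note 1/n, not 1/(n-1)

# The closed-form MLE beats any nearby parameter values.
best = loglik(mu_hat, s2_hat)
for dmu in (-0.1, 0.1):
    for ds in (-0.1, 0.1):
        assert loglik(mu_hat + dmu, s2_hat + ds) < best
print(mu_hat, s2_hat)
```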
Lecture 15:
We 02/14/01

Example 8.7.4:
Let X₁, …, Xₙ be iid U(θ − 1/2, θ + 1/2).

L(θ; x₁, …, xₙ) = 1 if θ − 1/2 ≤ xᵢ ≤ θ + 1/2 ∀ i = 1, …, n; 0 otherwise.

Therefore, any θ̂(X) such that max(X) − 1/2 ≤ θ̂(X) ≤ min(X) + 1/2 is an MLE. Obviously, the MLE is not unique.
Example 8.7.5:
Let X ∼ Bin(1, p), p ∈ [1/4, 3/4].

L(p; x) = p if x = 1; 1 − p if x = 0.

This is maximized by

p̂ = 3/4 if x = 1; 1/4 if x = 0, i.e., p̂ = (2x + 1)/4.

It is

E_p(p̂) = (3/4) p + (1/4)(1 − p) = (1/2) p + 1/4

MSE_p(p̂) = E_p((p̂ − p)²)
= E_p( ((2X + 1)/4 − p)² )
= (1/16) E_p( (2X + 1 − 4p)² )
= (1/16) E_p( 4X² + 4X − 16pX − 8p + 1 + 16p² )
= (1/16) ( 4(p(1 − p) + p²) + 4p − 16p² − 8p + 1 + 16p² )
= 1/16

So p̂ is biased with MSE_p(p̂) = 1/16. If we compare this with p̃ = 1/2 regardless of the data, we have

MSE_p(1/2) = E_p((1/2 − p)²) = (1/2 − p)² ≤ 1/16 ∀ p ∈ [1/4, 3/4].

Thus, in this example the MLE is worse than the trivial estimate when comparing their MSE's.
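The MSE comparison in Example 8.7.5 can be verified by direct enumeration, since X takes only the values 0 and 1. This short check (our own illustration) confirms that the MLE's MSE is constant 1/16 on [1/4, 3/4] and that the trivial estimate is never worse there.

```python
# Exact MSE of p_hat = (2X+1)/4 for X ~ Bin(1, p), by enumeration,
# versus the trivial estimate p_tilde = 1/2.
def mse_phat(p):
    # X=1 w.p. p -> p_hat = 3/4; X=0 w.p. 1-p -> p_hat = 1/4
    return p * (3/4 - p) ** 2 + (1 - p) * (1/4 - p) ** 2

def mse_trivial(p):
    return (1/2 - p) ** 2

for k in range(51):
    p = 0.25 + 0.5 * k / 50                   # grid over [1/4, 3/4]
    assert abs(mse_phat(p) - 1/16) < 1e-12    # MSE of the MLE is constant 1/16
    assert mse_trivial(p) <= 1/16 + 1e-12     # trivial estimate is never worse
print("both MSE claims verified on the grid")
```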
Theorem 8.7.6:
Let T be a sufficient statistic for f_θ(x), θ ∈ Θ. If a unique MLE of θ exists, it is a function of T.

Proof:
Since T is sufficient, we can write

f_θ(x) = h(x) g_θ(T(x))

due to the Factorization Criterion (Theorem 8.3.5). Maximizing the likelihood function with respect to θ treats h(x) as a constant and therefore is equivalent to maximizing g_θ(T(x)) with respect to θ. But g_θ(T(x)) involves x only through T.

Note:
(i) MLE's may not be unique (however, they frequently are).
(ii) MLE's are not necessarily unbiased.
(iii) MLE's may not exist.
(iv) If a unique MLE exists, it is a function of a sufficient statistic.
(v) Often (but not always), the MLE will be a sufficient statistic itself.
Theorem 8.7.7:
Suppose the regularity conditions of Theorem 8.5.1 hold and θ belongs to an open interval in IR. If an estimate θ̂ of θ attains the CRLB, it is the unique MLE.

Proof:
If θ̂ attains the CRLB, it follows by Theorem 8.5.1 that

∂ log f_θ(X)/∂θ = (1/K(θ)) (θ̂(X) − θ)  w.p. 1.

Thus, θ̂ satisfies the likelihood equations.

We define A(θ) = 1/K(θ). Then it follows that

∂² log f_θ(X)/∂θ² = A′(θ)(θ̂(X) − θ) − A(θ).

The Proof of Theorem 8.5.11 gives us

A(θ) = E_θ( (∂ log f_θ(X)/∂θ)² ) > 0.

So

∂² log f_θ(X)/∂θ² |_{θ=θ̂} = −A(θ̂) < 0,

i.e., log f_θ(X) has a maximum at θ̂. Thus, θ̂ is the MLE.

Note:
The previous Theorem does not imply that every MLE is most efficient.
Theorem 8.7.8:
Let {f_θ : θ ∈ Θ} be a family of pdf's (or pmf's) with Θ ⊆ IR^k, k ≥ 1. Let h : Θ → Λ be a mapping of Θ onto Λ ⊆ IR^p, 1 ≤ p ≤ k. If θ̂ is an MLE of θ, then h(θ̂) is an MLE of h(θ).

Proof:
For each λ ∈ Λ, we define

Θ_λ = {θ : θ ∈ Θ, h(θ) = λ}

and

M(λ; x) = sup_{θ∈Θ_λ} L(θ; x),

the likelihood function induced by h.

Let θ̂ be an MLE and a member of Θ_λ̂, where λ̂ = h(θ̂). It holds that

M(λ̂; x) = sup_{θ∈Θ_λ̂} L(θ; x) ≥ L(θ̂; x),

but also

M(λ̂; x) ≤ sup_{λ∈Λ} M(λ; x) = sup_{λ∈Λ} ( sup_{θ∈Θ_λ} L(θ; x) ) = sup_{θ∈Θ} L(θ; x) = L(θ̂; x).

Therefore,

M(λ̂; x) = L(θ̂; x) = sup_{λ∈Λ} M(λ; x).

Thus, λ̂ = h(θ̂) is an MLE.
Example 8.7.9:
Let X₁, …, Xₙ be iid Bin(1, p). Let h(p) = p(1 − p).
Since the MLE of p is p̂ = X̄, the MLE of h(p) is h(p̂) = X̄(1 − X̄).
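The invariance result of Theorem 8.7.8 can be checked numerically for Example 8.7.9: a grid search of the Bernoulli likelihood recovers p̂ = X̄, and the MLE of p(1 − p) is simply that value plugged into h. This is our own illustration; the true p = 0.3 and the sample size are assumptions.

```python
import math
import random

random.seed(2)
x = [1 if random.random() < 0.3 else 0 for _ in range(5000)]
n, s = len(x), sum(x)

def loglik(p):
    # Bernoulli log-likelihood: s successes, n - s failures
    return s * math.log(p) + (n - s) * math.log(1 - p)

# Grid-search the MLE of p; by invariance (Theorem 8.7.8), the MLE of
# h(p) = p(1-p) is h evaluated at the MLE of p.
grid = [k / 10000 for k in range(1, 10000)]
p_hat = max(grid, key=loglik)

assert abs(p_hat - s / n) < 1e-3   # grid argmax agrees with the closed form X-bar
h_hat = p_hat * (1 - p_hat)
print(p_hat, h_hat)
```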
Theorem 8.7.10:
Consider the following conditions a pdf f_θ can fulfill:

(i) ∂ log f_θ/∂θ, ∂² log f_θ/∂θ², and ∂³ log f_θ/∂θ³ exist for all θ ∈ Θ for all x. Also,

∫_{−∞}^{∞} ∂f_θ(x)/∂θ dx = E_θ( ∂ log f_θ(X)/∂θ ) = 0 ∀ θ ∈ Θ.

(ii) ∫_{−∞}^{∞} ∂²f_θ(x)/∂θ² dx = 0 ∀ θ ∈ Θ.

(iii) −∞ < ∫_{−∞}^{∞} (∂² log f_θ(x)/∂θ²) f_θ(x) dx < 0 ∀ θ ∈ Θ.

(iv) There exists a function H(x) such that for all θ ∈ Θ:

| ∂³ log f_θ(x)/∂θ³ | < H(x) and ∫_{−∞}^{∞} H(x) f_θ(x) dx = M(θ) < ∞.

(v) There exists a function g(θ) that is positive and twice differentiable for every θ ∈ Θ, and there exists a function H(x) such that for all θ ∈ Θ:

| ∂²/∂θ² ( g(θ) ∂ log f_θ(x)/∂θ ) | < H(x) and ∫_{−∞}^{∞} H(x) f_θ(x) dx = M(θ) < ∞.

In case multiple of these conditions are fulfilled, we can make the following statements:

(i) (Cramér) Conditions (i), (iii), and (iv) imply that, with probability approaching 1 as n → ∞, the likelihood equation has a consistent solution.

(ii) (Cramér) Conditions (i), (ii), (iii), and (iv) imply that a consistent solution θ̂ₙ of the likelihood equation is asymptotically Normal, i.e.,

(√n/σ)(θ̂ₙ − θ) →_d Z,

where Z ∼ N(0, 1) and σ² = ( E_θ( (∂ log f_θ(X)/∂θ)² ) )^{−1}.

(iii) (Kulldorf) Conditions (i), (iii), and (v) imply that, with probability approaching 1 as n → ∞, the likelihood equation has a consistent solution.

(iv) (Kulldorf) Conditions (i), (ii), (iii), and (v) imply that a consistent solution θ̂ₙ of the likelihood equation is asymptotically Normal.

Note:
In case of a pmf f_θ, we can define similar conditions as in Theorem 8.7.10.
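Statement (ii) of Theorem 8.7.10 can be illustrated by simulation. For Poisson(λ), the MLE is X̄ and E_λ((∂ log f_λ(X)/∂λ)²) = 1/λ, so √(n/λ)(X̄ − λ) should be approximately N(0, 1). This Monte Carlo sketch is our own illustration; λ = 2, n = 400, and the replication count are assumed values.

```python
import math
import random
import statistics

random.seed(3)

def rpois(lam):
    # Knuth-style Poisson sampler (stdlib only, illustrative)
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

# sqrt(n/lam) * (X-bar - lam) should be approximately N(0, 1),
# since the Fisher information of Poisson(lam) is 1/lam.
lam, n, reps = 2.0, 400, 2000
z = []
for _ in range(reps):
    xbar = sum(rpois(lam) for _ in range(n)) / n
    z.append(math.sqrt(n / lam) * (xbar - lam))

print(statistics.fmean(z), statistics.pstdev(z))
```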
Lecture 16:
Fr 02/16/01

8.8 Decision Theory – Bayes and Minimax Estimation

Let {f_θ : θ ∈ Θ} be a family of pdf's (or pmf's). Let X₁, …, Xₙ be a sample from f_θ. Let A be the set of possible actions (or decisions) that are open to the statistician in a given situation, e.g.,

A = {reject H₀, do not reject H₀} (Hypothesis testing, see Chapter 9)
A = {artefact found is of Greek origin, artefact found is of Roman origin} (Classification)
A = Θ (Estimation)

Definition 8.8.1:
A decision function d is a statistic, i.e., a Borel-measurable function, that maps IRⁿ into A. If X = x is observed, the statistician takes action d(x) ∈ A.

Note:
For the remainder of this Section, we restrict ourselves to A = Θ, i.e., we are facing the problem of estimation.

Definition 8.8.2:
A non-negative function L that maps Θ × A into IR is called a loss function. The value L(θ, a) is the loss incurred by the statistician if he/she takes action a when θ is the true parameter value.

Definition 8.8.3:
Let D be a class of decision functions that map IRⁿ into A. Let L be a loss function on Θ × A. The function R that maps Θ × D into IR, defined as

R(θ, d) = E_θ(L(θ, d(X))),

is called the risk function of d at θ.

Example 8.8.4:
Let A = Θ ⊆ IR. Let L(θ, a) = (θ − a)². Then it holds that

R(θ, d) = E_θ(L(θ, d(X))) = E_θ((θ − d(X))²) = E_θ((θ − θ̂)²).

Note that this is just the MSE. If θ̂ is unbiased, this would just be Var(θ̂).
Note:
The basic problem of decision theory is that we would like to find a decision function d ∈ D such that R(θ, d) is minimized for all θ ∈ Θ. Unfortunately, this is usually not possible.

Definition 8.8.5:
The minimax principle is to choose the decision function d* ∈ D such that

max_{θ∈Θ} R(θ, d*) ≤ max_{θ∈Θ} R(θ, d) ∀ d ∈ D.

Note:
If the problem of interest is an estimation problem, we call a d* that satisfies the condition in Definition 8.8.5 a minimax estimate of θ.
Example 8.8.6:
Let X ∼ Bin(1, p), p ∈ Θ = {1/4, 3/4} = A.
We consider the following loss function:

p     a     L(p, a)
1/4   1/4   0
1/4   3/4   2
3/4   1/4   5
3/4   3/4   0

The set of decision functions consists of the following four functions:

d₁(0) = 1/4, d₁(1) = 1/4
d₂(0) = 1/4, d₂(1) = 3/4
d₃(0) = 3/4, d₃(1) = 1/4
d₄(0) = 3/4, d₄(1) = 3/4

First, we evaluate the loss function for these four decision functions:

L(1/4, d₁(0)) = L(1/4, 1/4) = 0    L(1/4, d₁(1)) = L(1/4, 1/4) = 0
L(3/4, d₁(0)) = L(3/4, 1/4) = 5    L(3/4, d₁(1)) = L(3/4, 1/4) = 5
L(1/4, d₂(0)) = L(1/4, 1/4) = 0    L(1/4, d₂(1)) = L(1/4, 3/4) = 2
L(3/4, d₂(0)) = L(3/4, 1/4) = 5    L(3/4, d₂(1)) = L(3/4, 3/4) = 0
L(1/4, d₃(0)) = L(1/4, 3/4) = 2    L(1/4, d₃(1)) = L(1/4, 1/4) = 0
L(3/4, d₃(0)) = L(3/4, 3/4) = 0    L(3/4, d₃(1)) = L(3/4, 1/4) = 5
L(1/4, d₄(0)) = L(1/4, 3/4) = 2    L(1/4, d₄(1)) = L(1/4, 3/4) = 2
L(3/4, d₄(0)) = L(3/4, 3/4) = 0    L(3/4, d₄(1)) = L(3/4, 3/4) = 0

Then, the risk function

R(p, dᵢ) = E_p(L(p, dᵢ(X))) = L(p, dᵢ(0)) · P_p(X = 0) + L(p, dᵢ(1)) · P_p(X = 1)

takes the following values:

i   R(1/4, dᵢ)                 R(3/4, dᵢ)                 max_{p∈{1/4, 3/4}} R(p, dᵢ)
1   0                          5                          5
2   (3/4)·0 + (1/4)·2 = 1/2    (1/4)·5 + (3/4)·0 = 5/4    5/4
3   (3/4)·2 + (1/4)·0 = 3/2    (1/4)·0 + (3/4)·5 = 15/4   15/4
4   2                          0                          2

Hence,

min_{i∈{1,2,3,4}} max_{p∈{1/4, 3/4}} R(p, dᵢ) = 5/4.

Thus, d₂ is the minimax estimate.
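The risk table of Example 8.8.6 is easy to recompute exactly. The following sketch (our own illustration, using exact rational arithmetic) enumerates the four decision rules and confirms that d₂ is minimax with maximum risk 5/4.

```python
from fractions import Fraction as F

# Recompute the risk table of Example 8.8.6 exactly.
# X ~ Bin(1, p); P(X=1) = p, P(X=0) = 1 - p. Loss L(p, a):
loss = {(F(1,4), F(1,4)): 0, (F(1,4), F(3,4)): 2,
        (F(3,4), F(1,4)): 5, (F(3,4), F(3,4)): 0}

# The four decision functions d_i = (d_i(0), d_i(1)):
rules = {1: (F(1,4), F(1,4)), 2: (F(1,4), F(3,4)),
         3: (F(3,4), F(1,4)), 4: (F(3,4), F(3,4))}

def risk(p, d):
    # R(p, d) = L(p, d(0)) P(X=0) + L(p, d(1)) P(X=1)
    return loss[(p, d[0])] * (1 - p) + loss[(p, d[1])] * p

max_risk = {i: max(risk(p, d) for p in (F(1,4), F(3,4)))
            for i, d in rules.items()}
minimax_rule = min(max_risk, key=max_risk.get)
print(minimax_rule, max_risk[minimax_rule])
```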
Note:
Minimax estimation does not require any unusual assumptions. However, it tends to be very conservative.

Definition 8.8.7:
Suppose we consider θ to be a rv with pdf π(θ) on Θ. We call π the a priori distribution (or prior distribution).

Note:
f(x | θ) is the conditional density of x given a fixed θ. The joint density of x and θ is

f(x, θ) = π(θ) f(x | θ),

the marginal density of x is

g(x) = ∫ f(x, θ) dθ,

and the a posteriori distribution (or posterior distribution), which gives the distribution of θ after sampling, has pdf (or pmf)

h(θ | x) = f(x, θ) / g(x).
Definition 8.8.8:
The Bayes risk of a decision function d is defined as

R(π, d) = E_π(R(θ, d)),

where π is the a priori distribution.

Note:
If θ is a continuous rv and X is of continuous type, then

R(π, d) = E_π(R(θ, d))
= ∫ R(θ, d) π(θ) dθ
= ∫ E_θ(L(θ, d(X))) π(θ) dθ
= ∫ ( ∫ L(θ, d(x)) f(x | θ) dx ) π(θ) dθ
= ∫ ∫ L(θ, d(x)) f(x | θ) π(θ) dx dθ
= ∫ ∫ L(θ, d(x)) f(x, θ) dx dθ
= ∫ g(x) ( ∫ L(θ, d(x)) h(θ | x) dθ ) dx

Similar expressions can be written if θ and/or X are discrete.
Lecture 17:
Tu 02/20/01

Definition 8.8.9:
A decision function d* is called a Bayes rule if d* minimizes the Bayes risk, i.e., if

R(π, d*) = inf_{d∈D} R(π, d).

Theorem 8.8.10:
Let A = Θ ⊆ IR. Let L(θ, d(x)) = (θ − d(x))². In this case, a Bayes rule is

d(x) = E(θ | X = x).

Proof:
Minimizing

R(π, d) = ∫ g(x) ( ∫ (θ − d(x))² h(θ | x) dθ ) dx,

where g is the marginal pdf of X and h is the conditional pdf of θ given x, is the same as minimizing

∫ (θ − d(x))² h(θ | x) dθ.

However, this is minimized when d(x) = E(θ | X = x), as shown in Stat 6710, Homework 4 (ii), for the unconditional case.

Note:
Under the conditions of Theorem 8.8.10, d(x) = E(θ | X = x) is called the Bayes estimate.
Example 8.8.11:
Let X ∼ Bin(n, p). Let L(p, d(x)) = (p − d(x))².
Let π(p) = 1 ∀ p ∈ (0, 1), i.e., π ∼ U(0, 1), be the a priori distribution of p.
Then it holds that:

f(x, p) = (n choose x) pˣ (1 − p)^{n−x}

g(x) = ∫ f(x, p) dp = ∫₀¹ (n choose x) pˣ (1 − p)^{n−x} dp

h(p | x) = f(x, p) / g(x)
= (n choose x) pˣ (1 − p)^{n−x} / ∫₀¹ (n choose x) pˣ (1 − p)^{n−x} dp
= pˣ (1 − p)^{n−x} / ∫₀¹ pˣ (1 − p)^{n−x} dp

E(p | x) = ∫₀¹ p h(p | x) dp
= ∫₀¹ p^{x+1} (1 − p)^{n−x} dp / ∫₀¹ pˣ (1 − p)^{n−x} dp
= B(x + 2, n − x + 1) / B(x + 1, n − x + 1)
= [ Γ(x + 2) Γ(n − x + 1) / Γ(n + 3) ] / [ Γ(x + 1) Γ(n − x + 1) / Γ(n + 2) ]
= (x + 1) / (n + 2)

Thus, the Bayes rule is

p̂_Bayes = d*(X) = (X + 1) / (n + 2).
The Bayes risk of d*(X) is

R(π, d*(X)) = E_π(R(p, d*(X)))
= ∫₀¹ π(p) R(p, d*(X)) dp
= ∫₀¹ π(p) E_p(L(p, d*(X))) dp
= ∫₀¹ π(p) E_p((p − d*(X))²) dp
= ∫₀¹ E_p( ((X + 1)/(n + 2) − p)² ) dp
= ∫₀¹ E_p( ((X + 1)/(n + 2))² − 2p (X + 1)/(n + 2) + p² ) dp
= (1/(n + 2)²) ∫₀¹ E_p( (X + 1)² − 2p(n + 2)(X + 1) + p²(n + 2)² ) dp
= (1/(n + 2)²) ∫₀¹ E_p( X² + 2X + 1 − 2p(n + 2)(X + 1) + p²(n + 2)² ) dp
= (1/(n + 2)²) ∫₀¹ ( np(1 − p) + (np)² + 2np + 1 − 2p(n + 2)(np + 1) + p²(n + 2)² ) dp
= (1/(n + 2)²) ∫₀¹ ( np − np² + n²p² + 2np + 1 − 2n²p² − 2np − 4np² − 4p + n²p² + 4np² + 4p² ) dp
= (1/(n + 2)²) ∫₀¹ ( 1 − 4p + np − np² + 4p² ) dp
= (1/(n + 2)²) ∫₀¹ ( 1 + (n − 4)p + (4 − n)p² ) dp
= (1/(n + 2)²) [ p + ((n − 4)/2) p² + ((4 − n)/3) p³ ]₀¹
= (1/(n + 2)²) ( 1 + (n − 4)/2 + (4 − n)/3 )
= (1/(n + 2)²) · (6 + 3n − 12 + 8 − 2n)/6
= (1/(n + 2)²) · (n + 2)/6
= 1/(6(n + 2))

Now we compare the Bayes rule d*(X) with the MLE p̂_ML = X/n. This estimate has Bayes risk

R(π, X/n) = ∫₀¹ E_p( (X/n − p)² ) dp
= ∫₀¹ (1/n²) E_p( (X − np)² ) dp
= ∫₀¹ np(1 − p)/n² dp
= ∫₀¹ p(1 − p)/n dp
= (1/n) [ p²/2 − p³/3 ]₀¹
= 1/(6n),

which is, as expected, larger than the Bayes risk of d*(X).
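Both closed-form Bayes risks can be checked numerically by enumerating X and integrating the risk over the uniform prior. This sketch is our own illustration; n = 10 and the midpoint-rule resolution are assumed values.

```python
import math

# Numerically check the Bayes risks derived above for X ~ Bin(n, p),
# uniform prior on p, squared-error loss:
#   Bayes rule (X+1)/(n+2) -> 1/(6(n+2));  MLE X/n -> 1/(6n).
n = 10

def risk(p, est):
    # R(p, d) = E_p[(d(X) - p)^2], by direct enumeration over x = 0..n
    return sum(math.comb(n, x) * p**x * (1 - p)**(n - x) * (est(x) - p) ** 2
               for x in range(n + 1))

def bayes_risk(est, m=2000):
    # Midpoint rule for the integral of R(p, d) over the U(0, 1) prior
    return sum(risk((k + 0.5) / m, est) for k in range(m)) / m

br_bayes = bayes_risk(lambda x: (x + 1) / (n + 2))
br_mle = bayes_risk(lambda x: x / n)

assert abs(br_bayes - 1 / (6 * (n + 2))) < 1e-5
assert abs(br_mle - 1 / (6 * n)) < 1e-5
print(br_bayes, br_mle)
```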
Theorem 8.8.12:
Let {f_θ : θ ∈ Θ} be a family of pdf's (or pmf's). Suppose that an estimate d* of θ is a Bayes estimate corresponding to some prior distribution π on Θ. If the risk function R(θ, d*) is constant on Θ, then d* is a minimax estimate of θ.

Proof:
Homework.
9 Hypothesis Testing

9.1 Fundamental Notions

We assume that X = (X₁, …, Xₙ) is a random sample from a population distribution F_θ, θ ∈ Θ ⊆ IR^k, where the functional form of F_θ is known, except for the parameter θ. We also assume that Θ contains at least two points.

Definition 9.1.1:
A parametric hypothesis is an assumption about the unknown parameter θ.
The null hypothesis is of the form

H₀ : θ ∈ Θ₀ ⊂ Θ.

The alternative hypothesis is of the form

H₁ : θ ∈ Θ₁ = Θ − Θ₀.

Definition 9.1.2:
If Θ₀ (or Θ₁) contains only one point, we say that H₀ and Θ₀ (or H₁ and Θ₁) are simple. In this case, the distribution of X is completely specified under the null (or alternative) hypothesis.
If Θ₀ (or Θ₁) contains more than one point, we say that H₀ and Θ₀ (or H₁ and Θ₁) are composite.

Example 9.1.3:
Let X₁, …, Xₙ be iid Bin(1, p). Examples for hypotheses are p = 1/2 (simple), p ≤ 1/2 (composite), p ≠ 1/4 (composite), etc.

Note:
The problem of testing a hypothesis can be described as follows: Given a sample point x, find a decision rule that will lead to a decision to accept or reject the null hypothesis. This means we partition the space IRⁿ into two disjoint sets C and Cᶜ such that, if x ∈ C, we reject H₀ : θ ∈ Θ₀ (and we accept H₁). Otherwise, if x ∈ Cᶜ, we accept H₀ that X ∼ F_θ, θ ∈ Θ₀.
Definition 9.1.4:
Let X ∼ F_θ, θ ∈ Θ. Let C be a subset of IRⁿ such that, if x ∈ C, then H₀ is rejected (with probability 1), i.e.,

C = {x ∈ IRⁿ : H₀ is rejected for this x}.

The set C is called the critical region.

Definition 9.1.5:
If we reject H₀ when it is true, we call this a Type I error. If we fail to reject H₀ when it is false, we call this a Type II error. Usually, H₀ and H₁ are chosen such that the Type I error is considered more serious.

Example 9.1.6:
We first consider a non-statistical example, in this case a jury trial. Our hypotheses are that the defendant is innocent or guilty. Our possible decisions are guilty or not guilty. Since it is considered worse to punish the innocent than to let the guilty go free, we make innocence the null hypothesis. Thus, we have

                        Truth (unknown)
Decision (known)        Innocent (H₀)    Guilty (H₁)
Not Guilty (H₀)         Correct          Type II Error
Guilty (H₁)             Type I Error     Correct

The jury tries to make a decision "beyond a reasonable doubt", i.e., it tries to make the probability of a Type I error small.

Definition 9.1.7:
If C is the critical region, then P_θ(C), θ ∈ Θ₀, is a probability of Type I error, and P_θ(Cᶜ), θ ∈ Θ₁, is a probability of Type II error.

Note:
We would like both error probabilities to be 0, but this is usually not possible. We usually settle for fixing the probability of Type I error to be small, e.g., 0.05 or 0.01, and minimizing the probability of Type II error.
Lecture 18:
We 02/21/01

Definition 9.1.8:
Every Borel-measurable mapping φ of IRⁿ → [0, 1] is called a test function. φ(x) is the probability of rejecting H₀ when x is observed.
If φ is the indicator function of a subset C ⊆ IRⁿ, φ is called a nonrandomized test and C is the critical region of this test function.
Otherwise, if φ is not an indicator function of a subset C ⊆ IRⁿ, φ is called a randomized test.

Definition 9.1.9:
Let φ be a test function of the hypothesis H₀ : θ ∈ Θ₀ against the alternative H₁ : θ ∈ Θ₁. We say that φ has a level of significance of α (or φ is a level-α-test or φ is of size α) if

E_θ(φ(X)) = P_θ(reject H₀) ≤ α ∀ θ ∈ Θ₀.

In short, we say that φ is a test for the problem (α, Θ₀, Θ₁).

Definition 9.1.10:
Let φ be a test for the problem (α, Θ₀, Θ₁). For every θ ∈ Θ, we define

β_φ(θ) = E_θ(φ(X)) = P_θ(reject H₀).

We call β_φ(θ) the power function of φ. For any θ ∈ Θ₁, β_φ(θ) is called the power of φ against the alternative θ.

Definition 9.1.11:
Let Φ_α be the class of all tests for (α, Θ₀, Θ₁). A test φ₀ ∈ Φ_α is called a most powerful (MP) test against an alternative θ ∈ Θ₁ if

β_{φ₀}(θ) ≥ β_φ(θ) ∀ φ ∈ Φ_α.

Definition 9.1.12:
Let Φ_α be the class of all tests for (α, Θ₀, Θ₁). A test φ₀ ∈ Φ_α is called a uniformly most powerful (UMP) test if

β_{φ₀}(θ) ≥ β_φ(θ) ∀ φ ∈ Φ_α ∀ θ ∈ Θ₁.
Example 9.1.13:
Let X₁, …, Xₙ be iid N(μ, 1), μ ∈ Θ = {μ₀, μ₁}, μ₀ < μ₁.
Let H₀ : Xᵢ ∼ N(μ₀, 1) vs. H₁ : Xᵢ ∼ N(μ₁, 1).
Intuitively, reject H₀ when X̄ is too large, i.e., if X̄ ≥ k for some k.
Under H₀ it holds that X̄ ∼ N(μ₀, 1/n).
For a given α, we can solve the following equation for k:

P_{μ₀}(X̄ > k) = P( (X̄ − μ₀)/(1/√n) > (k − μ₀)/(1/√n) ) = P(Z > z_α) = α

Here, (X̄ − μ₀)/(1/√n) = Z ∼ N(0, 1), and z_α is defined in such a way that P(Z > z_α) = α, i.e., z_α is the upper α-quantile of the N(0, 1) distribution. It follows that (k − μ₀)/(1/√n) = z_α and therefore k = μ₀ + z_α/√n.

Thus, we obtain the nonrandomized test

φ(x) = 1 if x̄ > μ₀ + z_α/√n; 0 otherwise.

φ has power

β_φ(μ₁) = P_{μ₁}( X̄ > μ₀ + z_α/√n )
= P( (X̄ − μ₁)/(1/√n) > (μ₀ − μ₁)√n + z_α )
= P( Z > z_α − √n(μ₁ − μ₀) )   [where √n(μ₁ − μ₀) > 0]
> α

The probability of a Type II error is

P(Type II error) = 1 − β_φ(μ₁).
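The power formula β_φ(μ₁) = P(Z > z_α − √n(μ₁ − μ₀)) from Example 9.1.13 is easy to evaluate. This sketch uses assumed values μ₀ = 0, μ₁ = 0.5, n = 25, α = 0.05 (our own choices for illustration) and the standard normal cdf via the error function.

```python
import math

def phi_cdf(z):
    # Standard normal cdf via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Assumed values for illustration:
mu0, mu1, n, alpha = 0.0, 0.5, 25, 0.05
z_alpha = 1.6449  # upper 0.05 quantile of N(0, 1)

# Power: beta(mu1) = P(Z > z_alpha - sqrt(n) (mu1 - mu0))
power = 1 - phi_cdf(z_alpha - math.sqrt(n) * (mu1 - mu0))
type2 = 1 - power  # probability of a Type II error
print(power, type2)
```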
Example 9.1.14:
Let X ∼ Bin(6, p), p ∈ Θ = (0, 1).
H₀ : p = 1/2; H₁ : p ≠ 1/2.
Desired level of significance: α = 0.05.
Reasonable plan: Since E_{p=1/2}(X) = 3, reject H₀ when |X − 3| ≥ c for some constant c. But how should we select c?

x      c = |x − 3|    P_{p=1/2}(X = x)    P_{p=1/2}(|X − 3| ≥ c)
0, 6   3              0.015625            0.03125
1, 5   2              0.093750            0.21875
2, 4   1              0.234375            0.68750
3      0              0.312500            1.00000

Thus, there is no nonrandomized test with α = 0.05.
What can we do instead? There are three possibilities:

(i) Reject if |X − 3| = 3, i.e., use a nonrandomized test of size α = 0.03125.
(ii) Reject if |X − 3| ≥ 2, i.e., use a nonrandomized test of size α = 0.21875.
(iii) Reject if |X − 3| = 3, do not reject if |X − 3| ≤ 1, and reject with probability (0.05 − 0.03125)/(2 · 0.093750) = 0.1 if |X − 3| = 2. Thus, we obtain the randomized test

φ(x) = 1 if x = 0, 6; 0.1 if x = 1, 5; 0 if x = 2, 3, 4.

This test has size

α = E_{p=1/2}(φ(X)) = 1 · 0.015625 · 2 + 0.1 · 0.093750 · 2 + 0 · (0.234375 · 2 + 0.3125) = 0.05

as intended. The power of φ can be calculated for any p ≠ 1/2 and it is

β_φ(p) = P_p(X = 0 or X = 6) + 0.1 · P_p(X = 1 or X = 5).
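The size and power function of the randomized test in Example 9.1.14 can be computed by direct enumeration over x = 0, …, 6. The alternative value p = 0.8 used below is our own choice for illustration.

```python
import math

# Randomized test of Example 9.1.14 for X ~ Bin(6, p):
# phi(x) = 1 if x in {0, 6}, 0.1 if x in {1, 5}, 0 otherwise.
def pmf(x, p):
    return math.comb(6, x) * p**x * (1 - p)**(6 - x)

def phi(x):
    return {0: 1.0, 6: 1.0, 1: 0.1, 5: 0.1}.get(x, 0.0)

def beta(p):
    # Power function: beta_phi(p) = E_p(phi(X))
    return sum(phi(x) * pmf(x, p) for x in range(7))

size = beta(0.5)
print(size, beta(0.8))
```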
Lecture 19:
Fr 02/23/01

9.2 The Neyman–Pearson Lemma

Let {f_θ : θ ∈ Θ = {θ₀, θ₁}} be a family of possible distributions of X. f_θ represents the pdf (or pmf) of X. For convenience, we write f₀(x) = f_{θ₀}(x) and f₁(x) = f_{θ₁}(x).

Theorem 9.2.1: Neyman–Pearson Lemma (NP Lemma)
Suppose we wish to test H₀ : X ∼ f₀(x) vs. H₁ : X ∼ f₁(x), where fᵢ is the pdf (or pmf) of X under Hᵢ, i = 0, 1, and both H₀ and H₁ are simple.

(i) Any test of the form

φ(x) = 1 if f₁(x) > k f₀(x); γ(x) if f₁(x) = k f₀(x); 0 if f₁(x) < k f₀(x)   (*)

for some k ≥ 0 and 0 ≤ γ(x) ≤ 1, is most powerful of its significance level for testing H₀ vs. H₁.

If k = ∞, the test

φ(x) = 1 if f₀(x) = 0; 0 if f₀(x) > 0   (**)

is most powerful of size (or significance level) 0 for testing H₀ vs. H₁.

(ii) Given 0 ≤ α ≤ 1, there exists a test of the form (*) or (**) with γ(x) = γ (i.e., a constant) such that

E_{θ₀}(φ(X)) = α.
Proof:
We prove the continuous case only.

(i):
Let φ be a test satisfying (*). Let φ* be any test with E_{θ₀}(φ*(X)) ≤ E_{θ₀}(φ(X)). It holds that

∫ (φ(x) − φ*(x))(f₁(x) − k f₀(x)) dx
= ∫_{f₁>kf₀} (φ(x) − φ*(x))(f₁(x) − k f₀(x)) dx + ∫_{f₁<kf₀} (φ(x) − φ*(x))(f₁(x) − k f₀(x)) dx

since

∫_{f₁=kf₀} (φ(x) − φ*(x))(f₁(x) − k f₀(x)) dx = 0.

It is

∫_{f₁>kf₀} (φ(x) − φ*(x))(f₁(x) − k f₀(x)) dx = ∫_{f₁>kf₀} (1 − φ*(x))(f₁(x) − k f₀(x)) dx ≥ 0,

since there both factors are ≥ 0, and

∫_{f₁<kf₀} (φ(x) − φ*(x))(f₁(x) − k f₀(x)) dx = ∫_{f₁<kf₀} (0 − φ*(x))(f₁(x) − k f₀(x)) dx ≥ 0,

since there both factors are ≤ 0. Therefore,

0 ≤ ∫ (φ(x) − φ*(x))(f₁(x) − k f₀(x)) dx
= E_{θ₁}(φ(X)) − E_{θ₁}(φ*(X)) − k (E_{θ₀}(φ(X)) − E_{θ₀}(φ*(X))).

Since E_{θ₀}(φ(X)) ≥ E_{θ₀}(φ*(X)), it holds that

β_φ(θ₁) − β_{φ*}(θ₁) ≥ k (E_{θ₀}(φ(X)) − E_{θ₀}(φ*(X))) ≥ 0,

i.e., φ is most powerful.

If k = ∞, any test φ* of size 0 must be 0 on the set {x | f₀(x) > 0}. Therefore,

β_φ(θ₁) − β_{φ*}(θ₁) = E_{θ₁}(φ(X)) − E_{θ₁}(φ*(X)) = ∫_{{x | f₀(x)=0}} (1 − φ*(x)) f₁(x) dx ≥ 0,

i.e., φ is most powerful of size 0.
(ii):
If α = 0, then use (**). Otherwise, assume that 0 < α ≤ 1 and γ(x) = γ. It is

E_{θ₀}(φ(X)) = P_{θ₀}(f₁(X) > k f₀(X)) + γ P_{θ₀}(f₁(X) = k f₀(X))
= 1 − P_{θ₀}(f₁(X) ≤ k f₀(X)) + γ P_{θ₀}(f₁(X) = k f₀(X))
= 1 − P_{θ₀}( f₁(X)/f₀(X) ≤ k ) + γ P_{θ₀}( f₁(X)/f₀(X) = k ).

Note that the last step is valid since P_{θ₀}(f₀(X) = 0) = 0.

Therefore, given 0 < α ≤ 1, we want to find k and γ such that E_{θ₀}(φ(X)) = α, i.e.,

P_{θ₀}( f₁(X)/f₀(X) ≤ k ) − γ P_{θ₀}( f₁(X)/f₀(X) = k ) = 1 − α.

Note that f₁(X)/f₀(X) is a rv and, therefore, P_{θ₀}( f₁(X)/f₀(X) ≤ k ) is a cdf as a function of k.

If there exists a k₀ such that

P_{θ₀}( f₁(X)/f₀(X) ≤ k₀ ) = 1 − α,

we choose γ = 0 and k = k₀.

Otherwise, if there exists no such k₀, then there exists a k₁ such that

P_{θ₀}( f₁(X)/f₀(X) < k₁ ) ≤ 1 − α < P_{θ₀}( f₁(X)/f₀(X) ≤ k₁ ),

i.e., the cdf has a jump at k₁. In this case, we choose k = k₁ and

γ = [ P_{θ₀}( f₁(X)/f₀(X) ≤ k₁ ) − (1 − α) ] / P_{θ₀}( f₁(X)/f₀(X) = k₁ ).
[Figure (Theorem 9.2.1): the cdf F(x) of f₁(X)/f₀(X) under H₀, with a jump at k₁; the level 1 − α lies between P(f₁/f₀ < k₁) and P(f₁/f₀ ≤ k₁).]
Let us verify that these values for k₁ and γ meet the necessary conditions:
Obviously,

P_{θ₀}( f₁(X)/f₀(X) ≤ k₁ ) − [ ( P_{θ₀}( f₁(X)/f₀(X) ≤ k₁ ) − (1 − α) ) / P_{θ₀}( f₁(X)/f₀(X) = k₁ ) ] · P_{θ₀}( f₁(X)/f₀(X) = k₁ ) = 1 − α.

Also, since

P_{θ₀}( f₁(X)/f₀(X) ≤ k₁ ) > 1 − α,

it follows that γ ≥ 0, and since

[ P_{θ₀}( f₁(X)/f₀(X) ≤ k₁ ) − (1 − α) ] / P_{θ₀}( f₁(X)/f₀(X) = k₁ )
≤ [ P_{θ₀}( f₁(X)/f₀(X) ≤ k₁ ) − P_{θ₀}( f₁(X)/f₀(X) < k₁ ) ] / P_{θ₀}( f₁(X)/f₀(X) = k₁ )
= P_{θ₀}( f₁(X)/f₀(X) = k₁ ) / P_{θ₀}( f₁(X)/f₀(X) = k₁ ) = 1,

it follows that γ ≤ 1. Overall, 0 ≤ γ ≤ 1 as required.
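The construction of k₁ and γ in part (ii) can be carried out concretely for a discrete family. The following sketch is our own illustration with an assumed setup (X ∼ Bin(6, p), H₀ : p = 0.5 vs. H₁ : p = 0.8, α = 0.05): it walks the cdf of the likelihood ratio under H₀ until it jumps past 1 − α, then sets γ by the formula above.

```python
import math

# Assumed discrete example: X ~ Bin(n, p), H0: p = p0 vs. H1: p = p1.
n, p0, p1, alpha = 6, 0.5, 0.8, 0.05

def pmf(x, p):
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

# Order sample points by the likelihood ratio f1/f0 (increasing in x here).
pts = sorted(range(n + 1), key=lambda x: pmf(x, p1) / pmf(x, p0))

# Walk the cdf of f1/f0 under H0 until it jumps past 1 - alpha at k1,
# then set gamma = (P(ratio <= k1) - (1 - alpha)) / P(ratio = k1).
cum = 0.0
for x in pts:
    cum += pmf(x, p0)
    if cum >= 1 - alpha - 1e-12:
        k1 = pmf(x, p1) / pmf(x, p0)
        gamma = (cum - (1 - alpha)) / pmf(x, p0)
        break

def phi(x):
    r = pmf(x, p1) / pmf(x, p0)
    return 1.0 if r > k1 + 1e-12 else (gamma if abs(r - k1) < 1e-12 else 0.0)

size = sum(phi(x) * pmf(x, p0) for x in range(n + 1))
print(size, k1, gamma)
```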
Lecture 21:
We 02/28/01

Theorem 9.2.2:
If a sufficient statistic T exists for the family {f_θ : θ ∈ Θ = {θ₀, θ₁}}, then the Neyman–Pearson most powerful test is a function of T.

Proof:
Homework.
Example 9.2.3:
We want to test H₀ : X ∼ N(0, 1) vs. H₁ : X ∼ Cauchy(1, 0), based on a single observation. It is

f₁(x)/f₀(x) = [ (1/π) · 1/(1 + x²) ] / [ (1/√(2π)) exp(−x²/2) ] = √(2/π) · exp(x²/2)/(1 + x²).

The MP test is

φ(x) = 1 if √(2/π) · exp(x²/2)/(1 + x²) > k; 0 otherwise,

where k is determined such that E_{H₀}(φ(X)) = α.

If α < 0.113, we reject H₀ if |x| > z_{α/2}, where z_{α/2} is the upper α/2 quantile of a N(0, 1) distribution.

If α > 0.113, we reject H₀ if |x| > k₁ or if |x| < k₂, where k₁ > 0, k₂ > 0, such that

exp(k₁²/2)/(1 + k₁²) = exp(k₂²/2)/(1 + k₂²) and ∫_{k₂}^{k₁} (1/√(2π)) exp(−x²/2) dx = (1 − α)/2.
[Figure (Example 9.2.3a): the likelihood ratio f₁(x)/f₀(x) on (−2, 2); it attains its value at x = 0 again at x = ±1.585, with k₂ < |x| < k₁ marking where the ratio dips below that level.]

[Figure (Example 9.2.3b): the N(0, 1) density f₀(x), with tail areas α/2 = 0.113/2 beyond ±1.585.]
Why is α = 0.113 so interesting?
For x = 0, it is

f₁(x)/f₀(x) = √(2/π) ≈ 0.7979.

Similarly, for x ≈ −1.585 and x ≈ 1.585, it is

f₁(x)/f₀(x) = √(2/π) · exp((±1.585)²/2)/(1 + (±1.585)²) ≈ 0.7979 ≈ f₁(0)/f₀(0).

More importantly, P_{H₀}(|X| > 1.585) = 0.113.
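Both numerical claims of Example 9.2.3 are easy to verify: the likelihood ratio takes (approximately) the same value at x = 0 and x = ±1.585, and the N(0, 1) tail probability beyond ±1.585 is about 0.113. A quick check with the error function:

```python
import math

def ratio(x):
    # f1(x)/f0(x) = sqrt(2/pi) * exp(x^2/2) / (1 + x^2)
    return math.sqrt(2 / math.pi) * math.exp(x * x / 2) / (1 + x * x)

def tail(c):
    # P(|Z| > c) for Z ~ N(0, 1), via the error function
    return 2 * (1 - 0.5 * (1 + math.erf(c / math.sqrt(2))))

print(ratio(0.0), ratio(1.585), tail(1.585))
```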
9.3 Monotone Likelihood Ratios

Suppose we want to test H₀ : θ ≤ θ₀ vs. H₁ : θ > θ₀ for a family of pdf's {f_θ : θ ∈ Θ ⊆ IR}. In general, it is not possible to find a UMP test. However, there exist conditions under which UMP tests exist.

Definition 9.3.1:
Let {f_θ : θ ∈ Θ ⊆ IR} be a family of pdf's (pmf's) on a one-dimensional parameter space. We say the family {f_θ} has a monotone likelihood ratio (MLR) in statistic T(X) if for θ₁ < θ₂, whenever f_{θ₁} and f_{θ₂} are distinct, the ratio f_{θ₂}(x)/f_{θ₁}(x) is a nondecreasing function of T(x) on the set of values x for which at least one of f_{θ₁} and f_{θ₂} is > 0.

Note:
We can also define families of densities with nonincreasing MLR in T(X), but such families can be treated by symmetry.

Example 9.3.2:
Let X₁, …, Xₙ ∼ U[0, θ], θ > 0. Then the joint pdf is

f_θ(x) = 1/θⁿ if 0 ≤ x_max ≤ θ; 0 otherwise, i.e., f_θ(x) = (1/θⁿ) I_{[0,θ]}(x_max),

where x_max = max_{i=1,…,n} xᵢ.

Let θ₂ > θ₁. Then

f_{θ₂}(x)/f_{θ₁}(x) = (θ₁/θ₂)ⁿ · I_{[0,θ₂]}(x_max)/I_{[0,θ₁]}(x_max).

It is

I_{[0,θ₂]}(x_max)/I_{[0,θ₁]}(x_max) = 1 if x_max ∈ [0, θ₁]; ∞ if x_max ∈ (θ₁, θ₂],

since for x_max ∈ [0, θ₁] it holds that x_max ∈ [0, θ₂], but for x_max ∈ (θ₁, θ₂] it is I_{[0,θ₁]}(x_max) = 0.

⟹ as T(X) = X_max increases, the density ratio goes from (θ₁/θ₂)ⁿ to ∞
⟹ f_{θ₂}/f_{θ₁} is a nondecreasing function of T(X) = X_max
⟹ the family of U[0, θ] distributions has a MLR in T(X) = X_max
Theorem 9.3.3:
The one-parameter exponential family f_θ(x) = exp(Q(θ)T(x) + D(θ) + S(x)), where Q(θ) is nondecreasing, has a MLR in T(X).

Proof:
Homework.

Example 9.3.4:
Let X = (X₁, …, Xₙ) be a random sample from the Poisson family with parameter λ > 0. Then the joint pmf is

f_λ(x) = Π_{i=1}^n ( e^{−λ} λ^{xᵢ} / xᵢ! ) = e^{−nλ} λ^{Σxᵢ} Π_{i=1}^n (1/xᵢ!) = exp( −nλ + (Σ_{i=1}^n xᵢ) log(λ) − Σ_{i=1}^n log(xᵢ!) ),

which belongs to the one-parameter exponential family.

Since Q(λ) = log(λ) is a nondecreasing function of λ, it follows by Theorem 9.3.3 that the Poisson family with parameter λ > 0 has a MLR in T(X) = Σ_{i=1}^n Xᵢ.

We can verify this result by Definition 9.3.1:

f_{λ₂}(x)/f_{λ₁}(x) = ( λ₂^{Σxᵢ} / λ₁^{Σxᵢ} ) · ( e^{−nλ₂} / e^{−nλ₁} ) = (λ₂/λ₁)^{Σxᵢ} e^{−n(λ₂−λ₁)}.

If λ₂ > λ₁, then λ₂/λ₁ > 1 and (λ₂/λ₁)^{Σxᵢ} is a nondecreasing function of Σxᵢ. Therefore, f_λ has a MLR in T(X) = Σ_{i=1}^n Xᵢ.
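The MLR computation in Example 9.3.4 can be checked numerically: the ratio (λ₂/λ₁)^t e^{−n(λ₂−λ₁)} depends on the data only through t = Σxᵢ and is nondecreasing in t. The parameter values below are our own choices for illustration.

```python
import math

def joint_ratio(t, n, lam1, lam2):
    # f_{lam2}(x)/f_{lam1}(x) = (lam2/lam1)^t * exp(-n (lam2 - lam1));
    # depends on x only through t = sum(x_i)
    return (lam2 / lam1) ** t * math.exp(-n * (lam2 - lam1))

n, lam1, lam2 = 5, 1.0, 2.5
vals = [joint_ratio(t, n, lam1, lam2) for t in range(30)]

# Nondecreasing in t, as the MLR property requires for lam2 > lam1.
assert all(a <= b for a, b in zip(vals, vals[1:]))
print(vals[0], vals[-1])
```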
Theorem 9.3.5:
Let X ∼ f_θ, θ ∈ Θ ⊆ IR, where the family {f_θ} has a MLR in T(X).
For testing H₀ : θ ≤ θ₀ vs. H₁ : θ > θ₀, θ₀ ∈ Θ, any test of the form

φ(x) = 1 if T(x) > t₀; γ if T(x) = t₀; 0 if T(x) < t₀   (*)

has a nondecreasing power function and is UMP of its size E_{θ₀}(φ(X)) = α, provided the size is not 0.

Also, for every 0 ≤ α ≤ 1 and every θ₀ ∈ Θ, there exist a t₀ and a γ (−∞ ≤ t₀ ≤ ∞, 0 ≤ γ ≤ 1) such that the test of form (*) is the UMP size α test of H₀ vs. H₁.
Proof:
"⟹":
Let θ₁ < θ₂, θ₁, θ₂ ∈ Θ, and suppose E_{θ₁}(φ(X)) > 0, i.e., the size is > 0. Since f_θ has a MLR in T, f_{θ₂}(x)/f_{θ₁}(x) is a nondecreasing function of T. Therefore, any test of form (*) is equivalent to a test

φ(x) = 1 if f_{θ₂}(x)/f_{θ₁}(x) > k; γ if f_{θ₂}(x)/f_{θ₁}(x) = k; 0 if f_{θ₂}(x)/f_{θ₁}(x) < k   (**)

which by the Neyman–Pearson Lemma (Theorem 9.2.1) is MP of its size α for testing H₀ : θ = θ₁ vs. H₁ : θ = θ₂.

Lecture 22:
Fr 03/02/01

Let Φ_α be the class of tests of size α, and let φ_t ∈ Φ_α be the trivial test φ_t(x) = α ∀ x. Then φ_t has size and power α. The power of the MP test φ of form (*) must be at least α, as the MP test φ cannot have power less than the trivial test φ_t, i.e.,

E_{θ₂}(φ(X)) ≥ E_{θ₂}(φ_t(X)) = α = E_{θ₁}(φ(X)).

Thus, for θ₂ > θ₁,

E_{θ₂}(φ(X)) ≥ E_{θ₁}(φ(X)),

i.e., the power function of the test φ of form (*) is a nondecreasing function of θ.

Now let θ₁ = θ₀ and θ₂ > θ₀. We know that a test φ of form (*) is MP for H₀ : θ = θ₀ vs. H₁ : θ = θ₂ > θ₀, provided that its size α₀ = E_{θ₀}(φ(X)) is > 0.

Notice that the test φ of form (*) does not depend on θ₂. It only depends on t₀ and γ. Therefore, the test φ of form (*) is MP for all θ₂ ∈ Θ₁. Thus, this test is UMP for simple H₀ : θ = θ₀ vs. composite H₁ : θ > θ₀ with size E_{θ₀}(φ(X)) = α₀.

Since φ is UMP for the class Φ′′ of tests (φ′′ ∈ Φ′′) satisfying

E_{θ₀}(φ′′(X)) ≤ α₀,

φ must also be UMP for the more restrictive class Φ′′′ of tests (φ′′′ ∈ Φ′′′) satisfying

E_θ(φ′′′(X)) ≤ α₀ ∀ θ ≤ θ₀.

But since the power function of φ is nondecreasing, it holds for φ that

E_θ(φ(X)) ≤ E_{θ₀}(φ(X)) = α₀ ∀ θ ≤ θ₀.

Thus, φ is the UMP size α₀ test of H₀ : θ ≤ θ₀ vs. H₁ : θ > θ₀ if α₀ > 0.

"⟸":
Use the Neyman–Pearson Lemma (Theorem 9.2.1).
Note:
By interchanging inequalities throughout Theorem 9.3.5 and its proof, we see that this Theorem also provides a solution of the dual problem H₀′ : θ ≥ θ₀ vs. H₁′ : θ < θ₀.

Theorem 9.3.6:
For the one-parameter exponential family, there exists a UMP two-sided test of H₀ : θ ≤ θ₁ or θ ≥ θ₂ (where θ₁ < θ₂) vs. H₁ : θ₁ < θ < θ₂ of the form

φ(x) = 1 if c₁ < T(x) < c₂; γᵢ if T(x) = cᵢ, i = 1, 2; 0 if T(x) < c₁ or T(x) > c₂.

Note:
UMP tests for H₀ : θ₁ ≤ θ ≤ θ₂ and H₀′′ : θ = θ₀ do not exist for one-parameter exponential families.
9.4 Unbiased and Invariant Tests

If we look at all size α tests in the class Φ_α, there exists no UMP test for many hypotheses. Can we find UMP tests if we reduce Φ_α by reasonable restrictions?

Definition 9.4.1:
A size α test φ of H₀ : θ ∈ Θ₀ vs. H₁ : θ ∈ Θ₁ is unbiased if

E_θ(φ(X)) ≥ α ∀ θ ∈ Θ₁.

Note:
This condition means that β_φ(θ) ≤ α ∀ θ ∈ Θ₀ and β_φ(θ) ≥ α ∀ θ ∈ Θ₁. In other words, the power of this test is never less than α.

Definition 9.4.2:
Let U_α be the class of all unbiased size α tests of H₀ vs. H₁. If there exists a test φ ∈ U_α that has maximal power for all θ ∈ Θ₁, we call φ a UMP unbiased (UMPU) size α test.

Note:
It holds that U_α ⊆ Φ_α. A UMP test φ* ∈ Φ_α will have β_{φ*}(θ) ≥ α ∀ θ ∈ Θ₁ since we must compare all tests φ* with the trivial test φ(x) = α. Thus, if a UMP test exists in Φ_α, it is also a UMPU test in U_α.

Example 9.4.3:
Let X₁, …, Xₙ be iid N(μ, σ²), where σ² > 0 is known. Consider H₀ : μ = μ₀ vs. H₁ : μ ≠ μ₀.
From the Neyman–Pearson Lemma, we know that for μ₁ > μ₀, the MP test is of the form

φ₁(X) = 1 if X̄ > μ₀ + (σ/√n) z_α; 0 otherwise,

and for μ₂ < μ₀, the MP test is of the form

φ₂(X) = 1 if X̄ < μ₀ − (σ/√n) z_α; 0 otherwise.

If a test is UMP, it must have the same rejection region as φ₁ and φ₂. However, these two rejection regions are different (actually, their intersection is empty). Thus, there exists no UMP test.

We next state a helpful Theorem and then continue with this example and see how we can find a UMPU test.
Theorem 9.4.4:
Let c1; : : : ; cn 2 IR be constants and f1(x); : : : ; fn+1(x) be real{valued functions. Let C be the
class of functions �(x) satisfying 0 � �(x) � 1 andZ 1
�1�(x)fi(x)dx = ci 8i = 1; : : : ; n:
If �� 2 C satis�es
��(x) =
8>>>><>>>>:1; if fn+1(x) >
nXi=1
kifi(x)
0; if fn+1(x) <nXi=1
kifi(x)
for some constants k1; : : : ; kn 2 IR, then �� maximizesZ 1
�1�(x)fn+1(x)dx among all � 2 C.
Proof:
Let φ*(x) be as above. Let φ(x) be any other function in C. Since 0 ≤ φ(x) ≤ 1 ∀x, it is

  (φ*(x) − φ(x)) (f_{n+1}(x) − Σ_{i=1}^n ki fi(x)) ≥ 0  ∀x.

This holds since if φ*(x) = 1, the left factor is ≥ 0 and the right factor is ≥ 0. If φ*(x) = 0,
the left factor is ≤ 0 and the right factor is ≤ 0.
Therefore,

  0 ≤ ∫ (φ*(x) − φ(x)) (f_{n+1}(x) − Σ_{i=1}^n ki fi(x)) dx
    = ∫ φ*(x) f_{n+1}(x) dx − ∫ φ(x) f_{n+1}(x) dx − Σ_{i=1}^n ki (∫ φ*(x) fi(x) dx − ∫ φ(x) fi(x) dx),

where each term in the last sum equals ci − ci = 0.
Thus,

  ∫ φ*(x) f_{n+1}(x) dx ≥ ∫ φ(x) f_{n+1}(x) dx.

Note:
(i) If f_{n+1} is a pdf, then φ* maximizes the power.
(ii) The Theorem above is the Neyman-Pearson Lemma if n = 1, f1 = f_{θ0}, f2 = f_{θ1}, and
c1 = α.
Example 9.4.3: (continued)
So far, we have seen that there exists no UMP test for H0 : θ = θ0 vs H1 : θ ≠ θ0.
We will show that

  φ3(x) = 1, if X̄ < θ0 − (σ/√n) z_{α/2} or if X̄ > θ0 + (σ/√n) z_{α/2};  0, otherwise

is a UMPU size α test.
Due to Theorem 9.2.2, we only have to consider functions of the sufficient statistic T(X) = X̄.
Let τ² = σ²/n.
To be unbiased and of size α, a test φ must have

(i) ∫ φ(t) f_{θ0}(t) dt = α, and

(ii) (∂/∂θ) ∫ φ(t) f_θ(t) dt |_{θ=θ0} = ∫ φ(t) ((∂/∂θ) f_θ(t)) |_{θ=θ0} dt = 0, i.e., we have a minimum at θ0.

We want to maximize ∫ φ(t) f_θ(t) dt, θ ≠ θ0, such that conditions (i) and (ii) hold.

Lecture 23:
Mo 03/05/01

We choose an arbitrary θ1 ≠ θ0 and let

  f1(t) = f_{θ0}(t)
  f2(t) = (∂/∂θ) f_θ(t) |_{θ=θ0}
  f3(t) = f_{θ1}(t)

We now consider how the conditions on φ* in Theorem 9.4.4 can be met:

  f3(t) > k1 f1(t) + k2 f2(t)
  ⟺ (1/(√(2π) τ)) exp(−(x̄ − θ1)²/(2τ²)) > (k1/(√(2π) τ)) exp(−(x̄ − θ0)²/(2τ²)) + (k2/(√(2π) τ)) exp(−(x̄ − θ0)²/(2τ²)) ((x̄ − θ0)/τ²)
  ⟺ exp(−(x̄ − θ1)²/(2τ²)) > k1 exp(−(x̄ − θ0)²/(2τ²)) + k2 exp(−(x̄ − θ0)²/(2τ²)) ((x̄ − θ0)/τ²)
  ⟺ exp(((x̄ − θ0)² − (x̄ − θ1)²)/(2τ²)) > k1 + k2 (x̄ − θ0)/τ²
  ⟺ exp(x̄(θ1 − θ0)/τ² − (θ1² − θ0²)/(2τ²)) > k1 + k2 (x̄ − θ0)/τ²

Note that the left-hand side of this inequality is increasing in x̄ if θ1 > θ0 and decreasing in
x̄ if θ1 < θ0. Either way, we can choose k1 and k2 such that the linear function in x̄ crosses
the exponential function in x̄ at the two points

  θ_L = θ0 − (σ/√n) z_{α/2},   θ_U = θ0 + (σ/√n) z_{α/2}.

Obviously, φ3 satisfies (i). We still need to check that φ3 satisfies (ii) and that β_{φ3}(θ) has a
minimum at θ0, but we omit this part from our proof here.
φ3 is of the form φ* in Theorem 9.4.4 and therefore φ3 is UMP in C. But the trivial test
φ_t(x) ≡ α also satisfies (i) and (ii) above. Therefore, β_{φ3}(θ) ≥ α ∀θ ≠ θ0. This means that
φ3 is unbiased.
Overall, φ3 is a UMPU test of size α.
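As a quick numerical check on this example, the power function of φ3 is β(θ) = Φ(−z_{α/2} − √n(θ − θ0)/σ) + 1 − Φ(z_{α/2} − √n(θ − θ0)/σ). A small sketch (not from the notes; the helper names are my own) verifies that this power attains its minimum value α at θ = θ0, which is exactly the unbiasedness property shown above:

```python
import math

def norm_cdf(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def z_upper(p):
    # upper-p quantile of N(0,1), found by bisection
    lo, hi = -10.0, 10.0
    while hi - lo > 1e-12:
        mid = 0.5 * (lo + hi)
        if norm_cdf(mid) < 1.0 - p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def power_two_sided(theta, theta0=0.0, sigma=1.0, n=25, alpha=0.05):
    # power of phi3: reject when |Xbar - theta0| > (sigma/sqrt(n)) z_{alpha/2}
    z = z_upper(alpha / 2)
    shift = math.sqrt(n) * (theta - theta0) / sigma
    return norm_cdf(-z - shift) + 1.0 - norm_cdf(z - shift)

assert abs(power_two_sided(0.0) - 0.05) < 1e-9   # size alpha at theta0
assert power_two_sided(0.3) > 0.05               # power exceeds alpha away from theta0
assert power_two_sided(-0.3) > 0.05
```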
Definition 9.4.5:
A test φ is said to be α-similar on a subset Θ* of Θ if

  β_φ(θ) = E_θ(φ(X)) = α  ∀θ ∈ Θ*.

A test φ is said to be similar on Θ* ⊆ Θ if it is α-similar on Θ* for some α, 0 ≤ α ≤ 1.

Note:
The trivial test φ(x) ≡ α is α-similar on every Θ* ⊆ Θ.
Theorem 9.4.6:
Let φ be an unbiased test of size α for H0 : θ ∈ Θ0 vs H1 : θ ∈ Θ1 such that β_φ(θ) is a
continuous function in θ. Then φ is α-similar on the boundary Λ = Θ̄0 ∩ Θ̄1, where Θ̄0 and
Θ̄1 are the closures of Θ0 and Θ1, respectively.

Proof:
Let θ ∈ Λ. There exist sequences {θn} and {θ'n} with θn ∈ Θ0 and θ'n ∈ Θ1 such that
lim_{n→∞} θn = θ and lim_{n→∞} θ'n = θ.
By continuity, β_φ(θn) → β_φ(θ) and β_φ(θ'n) → β_φ(θ).
Since β_φ(θn) ≤ α implies β_φ(θ) ≤ α and since β_φ(θ'n) ≥ α implies β_φ(θ) ≥ α, it must hold
that β_φ(θ) = α ∀θ ∈ Λ.
Lecture 24:
We 03/07/01

Definition 9.4.7:
A test φ that is UMP among all α-similar tests on the boundary Λ = Θ̄0 ∩ Θ̄1 is called a
UMP α-similar test.

Theorem 9.4.8:
Suppose β_φ(θ) is continuous in θ for all tests φ of H0 : θ ∈ Θ0 vs H1 : θ ∈ Θ1. If a size α
test of H0 vs H1 is UMP α-similar, then it is UMP unbiased.

Proof:
Let φ0 be UMP α-similar and of size α. This means that E_θ(φ0(X)) ≤ α ∀θ ∈ Θ0.
Since the trivial test φ(x) ≡ α is α-similar, it must hold for φ0 that β_{φ0}(θ) ≥ α ∀θ ∈ Θ1 since
φ0 is UMP α-similar. This implies that φ0 is unbiased.
Since β_φ(θ) is continuous in θ, we see from Theorem 9.4.6 that the class of unbiased tests is
a subclass of the class of α-similar tests. Since φ0 is UMP in the larger class, it is also UMP
in the subclass. Thus, φ0 is UMPU.
Note:
The continuity of the power function β_φ(θ) cannot always be checked easily.
Example 9.4.9:
Let X1, ..., Xn ~ N(θ, 1).
Let H0 : θ ≤ 0 vs H1 : θ > 0.
Since the family of densities has a MLR in Σ_{i=1}^n Xi, we could use Theorem 9.3.5 to find a UMP
test. However, we want to illustrate the use of Theorem 9.4.8 here.
It is Λ = {0} and the power function

  β_φ(θ) = ∫_{ℝ^n} φ(x) (1/√(2π))^n exp(−(1/2) Σ_{i=1}^n (xi − θ)²) dx

of any test φ is continuous in θ. Thus, due to Theorem 9.4.6, any unbiased size α test of H0
is α-similar on Λ.
We need a UMP test of H0' : θ = 0 vs H1 : θ > 0.
By the NP Lemma, a MP test of H0' : θ = 0 vs H1' : θ = θ1, where θ1 > 0, is given by

  φ(x) = 1, if exp((Σxi²)/2 − (Σ(xi − θ1)²)/2) > k0;  0, otherwise

or equivalently, by Theorem 9.2.2,

  φ(x) = 1, if T = Σ_{i=1}^n Xi > k;  0, otherwise

Since under H0', T ~ N(0, n), k is determined by α = P_{θ=0}(T > k) = P(T/√n > k/√n), i.e.,
k = √n zα.
φ is independent of θ1 for every θ1 > 0. So φ is UMP α-similar for H0' vs. H1.
Finally, φ is of size α, since for θ ≤ 0, it holds that

  E_θ(φ(X)) = P_θ(T > √n zα)
            = P((T − nθ)/√n > zα − √n θ)
        (*) ≤ P(Z > zα)
            = α

(*) holds since (T − nθ)/√n ~ N(0, 1) for θ ≤ 0 and zα − √n θ ≥ zα for θ ≤ 0.
Thus all the requirements are met for Theorem 9.4.8, i.e., β_φ is continuous and φ is UMP
α-similar and of size α, and thus φ is UMPU.
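The size claim E_θ(φ(X)) ≤ α for all θ ≤ 0 can be checked numerically: under θ, (T − nθ)/√n ~ N(0, 1), so E_θ(φ(X)) = 1 − Φ(zα − √n θ). A small sketch (function names are mine; zα for α = 0.05 is hard-coded from a Normal table):

```python
import math

def norm_cdf(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def size_one_sided(theta, n=25, z_alpha=1.6448536269514722):
    # P_theta(sum of X_i > sqrt(n) z_alpha) = 1 - Phi(z_alpha - sqrt(n) theta)
    return 1.0 - norm_cdf(z_alpha - math.sqrt(n) * theta)

assert abs(size_one_sided(0.0) - 0.05) < 1e-7       # exactly alpha on the boundary
assert size_one_sided(-0.2) < 0.05                  # below alpha throughout H0
assert size_one_sided(0.2) > 0.05                   # above alpha throughout H1
```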
Note:
Rohatgi, pages 428-430, lists Theorems (without proofs) stating that for Normal data, one-
and two-tailed t-tests, one- and two-tailed χ²-tests, two-sample t-tests, and F-tests are all
UMPU.
Note:
Recall from Definition 8.2.4 that a class of distributions is invariant under a group G of trans-
formations if for each g ∈ G and for each θ ∈ Θ there exists a unique θ' ∈ Θ such that if
X ~ P_θ, then g(X) ~ P_{θ'}.

Definition 9.4.10:
A group G of transformations on X leaves a hypothesis testing problem invariant if G
leaves both {P_θ : θ ∈ Θ0} and {P_θ : θ ∈ Θ1} invariant, i.e., if y = g(x) ~ h_θ(y), then
{f_θ(x) : θ ∈ Θ0} ≡ {h_θ(y) : θ ∈ Θ0} and {f_θ(x) : θ ∈ Θ1} ≡ {h_θ(y) : θ ∈ Θ1}.
Note:
We want two types of invariance for our tests:

Measurement Invariance: If y = g(x) is a 1-to-1 mapping, the decision based on y should
be the same as the decision based on x. If φ(x) is the test based on x and φ*(y) is the
test based on y, then it must hold that φ(x) = φ*(g(x)) = φ*(y).

Formal Invariance: If two tests have the same structure, i.e., the same Θ, the same pdf's (or
pmf's), and the same hypotheses, then we should use the same test in both problems.
So, if the transformed problem in terms of y has the same formal structure as that of
the problem in terms of x, we must have that φ*(y) = φ(x) = φ*(g(x)).

We can combine these two requirements in the following definition:

Definition 9.4.11:
An invariant test with respect to a group G of transformations is any test φ such that

  φ(x) = φ(g(x))  ∀x ∀g ∈ G.
Example 9.4.12:
Let X ~ Bin(n, p). Let H0 : p = 1/2 vs. H1 : p ≠ 1/2.
Let G = {g1, g2}, where g1(x) = n − x and g2(x) = x.
If φ is invariant, then φ(x) = φ(n − x). Is the test problem invariant? For g2, the answer is
obvious.
For g1, we get:

  g1(X) = n − X ~ Bin(n, 1 − p)
  H0 : p = 1/2 :  {f_p(x) : p = 1/2} = {h_p(g1(x)) : p = 1/2} = Bin(n, 1/2)
  H1 : p ≠ 1/2 :  {f_p(x) : p ≠ 1/2} = {h_p(g1(x)) : p ≠ 1/2}, where both sides equal {Bin(n, p) : p ≠ 1/2}

So all the requirements in Definition 9.4.10 are met. If, for example, n = 10, the test

  φ(x) = 1, if x ∈ {0, 1, 2, 8, 9, 10};  0, otherwise

is invariant under G. For example, φ(4) = 0 = φ(10 − 4) = φ(6), and, in general,
φ(x) = φ(10 − x) ∀x ∈ {0, 1, ..., 10}.
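The invariance requirement is easy to verify mechanically. A small sketch (the function name is mine, not from the notes) encodes the n = 10 test above and checks φ(x) = φ(n − x) for every x:

```python
def phi(x, n=10):
    # the invariant test from Example 9.4.12: reject in both extreme tails
    return 1 if x in {0, 1, 2, n - 2, n - 1, n} else 0

# invariance under g1(x) = n - x (g2 is the identity, so nothing to check there)
for x in range(11):
    assert phi(x) == phi(10 - x)
```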
Lecture 25:
Fr 03/09/01

Example 9.4.13:
Let X1, ..., Xn ~ N(μ, σ²) where both μ and σ² > 0 are unknown. It is X̄ ~ N(μ, σ²/n) and
(n−1)S²/σ² ~ χ²_{n−1}, and X̄ and S² are independent.
Let H0 : μ ≤ 0 vs. H1 : μ > 0.
Let G be the group of scale changes:

  G = {g_c(x̄, s²), c > 0 : g_c(x̄, s²) = (c x̄, c² s²)}

The problem is invariant because, when g_c(x̄, s²) = (c x̄, c² s²), then
(i) c X̄ and c² S² are independent.
(ii) c X̄ ~ N(cμ, c²σ²/n) ∈ {N(μ, σ²/n)}.
(iii) (n−1) c² S² / (c² σ²) ~ χ²_{n−1}.
So, this is the same family of distributions, and Definition 9.4.10 holds because μ ≤ 0 implies
that cμ ≤ 0 (for c > 0).
An invariant test satisfies φ(x̄, s²) ≡ φ(c x̄, c² s²), c > 0, s² > 0, x̄ ∈ ℝ.
Let c = 1/s. Then φ(x̄, s²) ≡ φ(x̄/s, 1), so invariant tests depend on (x̄, s²) only through x̄/s.
If x̄1/s1 ≠ x̄2/s2, then there exists no c > 0 such that (x̄2, s2²) = (c x̄1, c² s1²). So invariance places
no restrictions on φ between points with different values of x̄/s. Thus, invariant tests are exactly those that depend
only on x̄/s, which are equivalent to tests that are based only on t = x̄/(s/√n). Since this mapping
is 1-to-1, the invariant test will use T = X̄/(S/√n) ~ t_{n−1} if μ = 0. Note that this test does not
depend on the nuisance parameter σ². Invariance often produces such results.
Definition 9.4.14:
Let G be a group of transformations on the space of X. We say a statistic T(x) is maximal
invariant under G if
(i) T is invariant, i.e., T(x) = T(g(x)) ∀g ∈ G, and
(ii) T is maximal, i.e., T(x1) = T(x2) implies that x1 = g(x2) for some g ∈ G.
Example 9.4.15:
Let x = (x1, ..., xn) and g_c(x) = (x1 + c, ..., xn + c).
Consider T(x) = (xn − x1, xn − x2, ..., xn − x_{n−1}).
It is T(g_c(x)) = (xn − x1, xn − x2, ..., xn − x_{n−1}) = T(x), so T is invariant.
If T(x) = T(x'), then xn − xi = x'n − x'i ∀i = 1, 2, ..., n−1.
This implies that xi − x'i = xn − x'n = c ∀i = 1, 2, ..., n−1.
Thus, g_c(x') = (x'1 + c, ..., x'n + c) = x.
Therefore, T is maximal invariant.
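Both defining properties can be checked numerically for the shift group. A short sketch (helper names are my own; the shifts are chosen to be exactly representable in binary floating point so the tuple comparisons are exact):

```python
def T(x):
    # candidate maximal invariant: differences from the last coordinate
    return tuple(x[-1] - xi for xi in x[:-1])

def shift(x, c):
    # the group action g_c(x) = (x_1 + c, ..., x_n + c)
    return [xi + c for xi in x]

x = [1.0, 4.0, 2.5, 3.0]
# (i) invariance: T is unchanged by any common shift
assert T(x) == T(shift(x, 7.25)) == T(shift(x, -2.5))
# (ii) maximality: equal T values force a single common shift c
y = shift(x, 3.0)
assert T(x) == T(y) and len({yi - xi for xi, yi in zip(x, y)}) == 1
```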
Definition 9.4.16:
Let Iα be the class of all invariant tests of size α of H0 : θ ∈ Θ0 vs. H1 : θ ∈ Θ1. If there
exists a UMP member in Iα, it is called the UMP invariant test of H0 vs H1.

Theorem 9.4.17:
Let T(x) be maximal invariant with respect to G. A test φ is invariant under G iff φ is a
function of T.

Proof:
"⟹":
Let φ be invariant under G. If T(x1) = T(x2), then there exists a g ∈ G such that x1 = g(x2).
Thus, it follows from invariance that φ(x1) = φ(g(x2)) = φ(x2). Since φ is the same whenever
T(x1) = T(x2), φ must be a function of T.
"⟸": Let φ be a function of T, i.e., φ(x) = h(T(x)). It follows that

  φ(g(x)) = h(T(g(x))) (*)= h(T(x)) = φ(x).

(*) holds since T is invariant.
This means that φ is invariant.
Example 9.4.18:
Consider the test problem

  H0 : X ~ f0(x1 − θ, ..., xn − θ) vs. H1 : X ~ f1(x1 − θ, ..., xn − θ),

where θ ∈ ℝ.
Let G be the group of transformations with

  g_c(x) = (x1 + c, ..., xn + c),

where c ∈ ℝ and n ≥ 2.
As shown in Example 9.4.15, a maximal invariant statistic is T(X) = (X1 − Xn, ..., X_{n−1} − Xn)
= (T1, ..., T_{n−1}). Due to Theorem 9.4.17, an invariant test φ depends on X only through T.
Since the transformation

  (T1, ..., T_{n−1}, Z) = (X1 − Xn, ..., X_{n−1} − Xn, Xn)

is 1-to-1, there exist inverses Xn = Z and Xi = Ti + Xn = Ti + Z ∀i = 1, ..., n − 1.
Applying Theorem 4.3.5 and integrating out the last component Z (= Xn) gives us the joint
pdf of T = (T1, ..., T_{n−1}).
Thus, under Hi, i = 0, 1, the joint pdf of T is given by

  ∫_{−∞}^{∞} fi(t1 + z, t2 + z, ..., t_{n−1} + z, z) dz,

which is independent of θ. The problem is thus reduced to testing a simple hypothesis against
a simple alternative. By the NP Lemma (Theorem 9.2.1), the MP test is

  φ(t1, ..., t_{n−1}) = 1, if λ(t) > c;  0, if λ(t) < c

where t = (t1, ..., t_{n−1}) and

  λ(t) = ∫_{−∞}^{∞} f1(t1 + z, ..., t_{n−1} + z, z) dz / ∫_{−∞}^{∞} f0(t1 + z, ..., t_{n−1} + z, z) dz.

In the homework assignment, we use this result to construct a UMP invariant test of

  H0 : X ~ N(θ, 1) vs. H1 : X ~ Cauchy(1, θ),

where a Cauchy(1, θ) distribution has pdf f(x, θ) = (1/π) · 1/(1 + (x − θ)²), where θ ∈ ℝ.
Lecture 26:
Mo 03/19/01

10 More on Hypothesis Testing

10.1 Likelihood Ratio Tests

Definition 10.1.1:
The likelihood ratio test statistic for

  H0 : θ ∈ Θ0 vs. H1 : θ ∈ Θ1 = Θ − Θ0

is

  λ(x) = sup_{θ∈Θ0} f_θ(x) / sup_{θ∈Θ} f_θ(x).

The likelihood ratio test (LRT) is the test function

  φ(x) = I_{[0,c)}(λ(x)),

for some constant c ∈ [0, 1], where c is usually chosen in such a way to make φ a test of size
α.

Note:
(i) We have to select c such that 0 ≤ c ≤ 1 since 0 ≤ λ(x) ≤ 1.
(ii) LRT's are strongly related to MLE's. If θ̂ is the unrestricted MLE of θ over Θ and θ̂0 is
the MLE of θ over Θ0, then λ(x) = f_{θ̂0}(x) / f_{θ̂}(x).
Example 10.1.2:
Let X1, ..., Xn be a sample from N(μ, 1). We want to construct a LRT for

  H0 : μ = μ0 vs. H1 : μ ≠ μ0.

It is μ̂0 = μ0 and μ̂ = X̄. Thus,

  λ(x) = (2π)^{−n/2} exp(−(1/2) Σ(xi − μ0)²) / ((2π)^{−n/2} exp(−(1/2) Σ(xi − x̄)²))
       = exp(−(n/2)(x̄ − μ0)²).

The LRT rejects H0 if λ(x) ≤ c, or equivalently, |x̄ − μ0| ≥ √(−(2 log c)/n). This means the LRT
rejects H0 : μ = μ0 if x̄ is too far from μ0.
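The equivalence between rejecting for small λ(x) and rejecting for large |x̄ − μ0| can be checked directly. A small sketch (function name and the data are mine):

```python
import math

def lrt_lambda(xs, mu0):
    # lambda(x) = exp(-(n/2) * (xbar - mu0)^2) for N(mu, 1) data
    n = len(xs)
    xbar = sum(xs) / n
    return math.exp(-0.5 * n * (xbar - mu0) ** 2)

xs = [0.2, -0.5, 1.1, 0.4]
n, c = len(xs), 0.1
xbar = sum(xs) / n
# |xbar - mu0| threshold that is equivalent to lambda <= c
cutoff = math.sqrt(-2.0 * math.log(c) / n)
for mu0 in (0.0, 2.0):
    assert (lrt_lambda(xs, mu0) <= c) == (abs(xbar - mu0) >= cutoff)
```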
Theorem 10.1.3:
If T(X) is sufficient for θ and λ*(t) and λ(x) are LRT statistics based on T and X respectively,
then

  λ*(T(x)) = λ(x)  ∀x,

i.e., the LRT can be expressed as a function of every sufficient statistic for θ.

Proof:
Since T is sufficient, it follows from Theorem 8.3.5 that its pdf (or pmf) factorizes as f_θ(x) =
g_θ(T) h(x). Therefore we get:

  λ(x) = sup_{θ∈Θ0} f_θ(x) / sup_{θ∈Θ} f_θ(x)
       = sup_{θ∈Θ0} g_θ(T) h(x) / sup_{θ∈Θ} g_θ(T) h(x)
       = sup_{θ∈Θ0} g_θ(T) / sup_{θ∈Θ} g_θ(T)
       = λ*(T(x))

Thus, our simplified expression for λ(x) indeed only depends on a sufficient statistic T.

Theorem 10.1.4:
If for a given α, 0 ≤ α ≤ 1, and for a simple hypothesis H0 and a simple alternative H1 a non-
randomized test based on the NP Lemma and a LRT exist, then these tests are equivalent.

Note:
Usually, LRT's perform well since they are often UMP or UMPU size α tests. However, this
does not always hold. Rohatgi, Example 4, pages 440-441, cites an example where the LRT is
not unbiased and is even worse than the trivial test φ(x) ≡ α.

Theorem 10.1.5:
Under some regularity conditions on f_θ(x), the rv −2 log λ(X) under H0 has asymptotically
a chi-squared distribution with ν degrees of freedom, where ν equals the difference between
the number of independent parameters in Θ and Θ0, i.e.,

  −2 log λ(X) →d χ²_ν under H0.
Note:
The regularity conditions required for Theorem 10.1.5 are basically the same as for Theorem
8.7.10. By "independent" parameters we understand parameters that are unspecified, i.e.,
free to vary.
Example 10.1.6:
Let X1, ..., Xn ~ N(μ, σ²) where μ ∈ ℝ and σ² > 0 are both unknown.
Let H0 : μ = μ0 vs. H1 : μ ≠ μ0.
We have θ = (μ, σ²), Θ = {(μ, σ²) : μ ∈ ℝ, σ² > 0} and Θ0 = {(μ0, σ²) : σ² > 0}.
It is θ̂0 = (μ0, (1/n) Σ_{i=1}^n (xi − μ0)²) and θ̂ = (x̄, (1/n) Σ_{i=1}^n (xi − x̄)²).
Now, the LR test statistic λ(x) can be determined:

  λ(x) = f_{θ̂0}(x) / f_{θ̂}(x)
       = [(n/(2π))^{n/2} (Σ(xi − μ0)²)^{−n/2} exp(−Σ(xi − μ0)² / (2 (1/n) Σ(xi − μ0)²))]
         / [(n/(2π))^{n/2} (Σ(xi − x̄)²)^{−n/2} exp(−Σ(xi − x̄)² / (2 (1/n) Σ(xi − x̄)²))]
       = (Σ(xi − x̄)² / Σ(xi − μ0)²)^{n/2}
       = ((Σxi² − n x̄²) / (Σxi² − 2μ0 Σxi + nμ0² − n x̄² + n x̄²))^{n/2}
       = (1 / (1 + n(x̄ − μ0)² / Σ(xi − x̄)²))^{n/2}

Note that this is a decreasing function of the square of

  t(X) = √n (X̄ − μ0) / √((1/(n−1)) Σ(Xi − X̄)²) = √n (X̄ − μ0) / S  (*)~ t_{n−1}.

(*) holds due to Corollary 7.2.4.
So we reject H0 if |t(x)| is large. Now,

  −2 log λ(x) = −2 (−n/2) log(1 + n(x̄ − μ0)² / Σ(xi − x̄)²)
              = n log(1 + n(x̄ − μ0)² / Σ(xi − x̄)²)
Under H0, √n(X̄ − μ0)/σ ~ N(0, 1) and Σ(Xi − X̄)²/σ² ~ χ²_{n−1}, and both are independent.
Therefore, under H0,

  n(X̄ − μ0)² / ((1/(n−1)) Σ(Xi − X̄)²) ~ F_{1,n−1}.

Thus, the mgf of −2 log λ(X) under H0 is

  Mn(t) = E_{H0}(exp(−2t log λ(X)))
        = E_{H0}(exp(nt log(1 + F/(n−1))))
        = E_{H0}((1 + F/(n−1))^{nt})
        = ∫_0^∞ [Γ(n/2) / (Γ((n−1)/2) Γ(1/2) (n−1)^{1/2})] f^{−1/2} (1 + f/(n−1))^{−n/2} (1 + f/(n−1))^{nt} df

Note that

  f_{1,n−1}(f) = [Γ(n/2) / (Γ((n−1)/2) Γ(1/2) (n−1)^{1/2})] f^{−1/2} (1 + f/(n−1))^{−n/2} I_{[0,∞)}(f)

is the pdf of a F_{1,n−1} distribution.
Let y = (1 + f/(n−1))^{−1}; then f/(n−1) = (1−y)/y and df = −((n−1)/y²) dy.
Thus,

  Mn(t) = [Γ(n/2) / (Γ((n−1)/2) Γ(1/2) (n−1)^{1/2})] ∫_0^∞ f^{−1/2} (1 + f/(n−1))^{nt − n/2} df
        = [Γ(n/2) / (Γ((n−1)/2) Γ(1/2) (n−1)^{1/2})] (n−1)^{1/2} ∫_0^1 y^{(n−3)/2 − nt} (1−y)^{−1/2} dy
    (*) = [Γ(n/2) / (Γ((n−1)/2) Γ(1/2))] B((n−1)/2 − nt, 1/2)
        = [Γ(n/2) / (Γ((n−1)/2) Γ(1/2))] Γ((n−1)/2 − nt) Γ(1/2) / Γ(n/2 − nt)
        = Γ(n/2) Γ((n−1)/2 − nt) / (Γ((n−1)/2) Γ(n/2 − nt)),   t < 1/2 − 1/(2n)

(*) holds since the integral represents the Beta function (see also Example 8.8.11).

As n → ∞, we can apply Stirling's formula, which states that

  Γ(α(n) + 1) = (α(n))! ≈ √(2π) (α(n))^{α(n)+1/2} exp(−α(n)).
Lecture 27:
We 03/21/01

So,

  Mn(t) ≈ [√(2π) ((n−2)/2)^{(n−1)/2} exp(−(n−2)/2) · √(2π) ((n(1−2t)−3)/2)^{(n(1−2t)−2)/2} exp(−(n(1−2t)−3)/2)]
        / [√(2π) ((n−3)/2)^{(n−2)/2} exp(−(n−3)/2) · √(2π) ((n(1−2t)−2)/2)^{(n(1−2t)−1)/2} exp(−(n(1−2t)−2)/2)]

        = ((n−2)/(n−3))^{(n−2)/2} ((n−2)/2)^{1/2} ((n(1−2t)−3)/(n(1−2t)−2))^{(n(1−2t)−2)/2} ((n(1−2t)−2)/2)^{−1/2}

        = (1 + 1/(n−3))^{(n−2)/2} · (1 − 1/(n(1−2t)−2))^{(n(1−2t)−2)/2} · ((n−2)/(n(1−2t)−2))^{1/2},

where, as n → ∞, the first factor → e^{1/2}, the second factor → e^{−1/2}, and the third factor → (1−2t)^{−1/2}.
Thus,

  Mn(t) → 1/(1−2t)^{1/2} as n → ∞.

Note that this is the mgf of a χ²₁ distribution. Therefore, it follows by the Continuity Theorem
(Theorem 6.4.2) that

  −2 log λ(X) →d χ²₁.

Obviously, we could use Theorem 10.1.5 (after checking that the regularity conditions hold)
as well to obtain the same result. Since under H0 one parameter (σ²) is unspecified and under
H1 two parameters (μ, σ²) are unspecified, it is ν = 2 − 1 = 1.
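The χ²₁ limit can also be seen in simulation. The following sketch (sample size, replication count, and seed are arbitrary choices of mine) draws data under H0 and compares −2 log λ(X) = n log(1 + n(x̄ − μ0)²/Σ(xi − x̄)²) to χ²₁ benchmarks:

```python
import math
import random

random.seed(42)
n, reps, mu0 = 50, 4000, 0.0
vals = []
for _ in range(reps):
    xs = [random.gauss(mu0, 1.0) for _ in range(n)]  # H0 is true
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)
    vals.append(n * math.log(1.0 + n * (xbar - mu0) ** 2 / ss))

mean_val = sum(vals) / reps                     # chi^2_1 has mean 1
tail = sum(v > 3.841 for v in vals) / reps      # 3.841 is the upper 5% point of chi^2_1
```

With these settings the sample mean of −2 log λ lands near 1 and roughly 5% of the replicates exceed 3.841, in line with the asymptotic χ²₁ distribution.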
10.2 Parametric Chi-Squared Tests

Definition 10.2.1: Normal Variance Tests
Let X1, ..., Xn be a sample from a N(μ, σ²) distribution where μ may be known or unknown
and σ² > 0 is unknown. The following table summarizes the χ² tests that are typically being
used:

                             Reject H0 at level α if
      H0       H1        μ known                             μ unknown
  I   σ ≥ σ0   σ < σ0    Σ(xi − μ)² ≤ σ0² χ²_{n,1−α}         s² ≤ (σ0²/(n−1)) χ²_{n−1,1−α}
  II  σ ≤ σ0   σ > σ0    Σ(xi − μ)² ≥ σ0² χ²_{n,α}           s² ≥ (σ0²/(n−1)) χ²_{n−1,α}
  III σ = σ0   σ ≠ σ0    Σ(xi − μ)² ≤ σ0² χ²_{n,1−α/2}       s² ≤ (σ0²/(n−1)) χ²_{n−1,1−α/2}
                         or Σ(xi − μ)² ≥ σ0² χ²_{n,α/2}      or s² ≥ (σ0²/(n−1)) χ²_{n−1,α/2}

Note:
(i) In Definition 10.2.1, σ0 is any fixed positive constant.
(ii) Tests I and II are UMPU if μ is unknown and UMP if μ is known.
(iii) In test III, the constants have been chosen in such a way as to give equal probability to
each tail. This is the usual approach. However, this may result in a biased test.
(iv) χ²_{n,1−α} is the (lower) α quantile and χ²_{n,α} is the (upper) 1−α quantile, i.e., for X ~ χ²_n,
it holds that P(X ≤ χ²_{n,1−α}) = α and P(X ≤ χ²_{n,α}) = 1 − α.
(v) We can also use χ² tests to test for equality of binomial probabilities, as shown in the
next few Theorems.
Theorem 10.2.2:
Let X1, ..., Xk be independent rv's with Xi ~ Bin(ni, pi), i = 1, ..., k. Then it holds that

  T = Σ_{i=1}^k ((Xi − ni pi) / √(ni pi (1 − pi)))² →d χ²_k

as n1, ..., nk → ∞.

Proof:
Homework
Corollary 10.2.3:
Let X1, ..., Xk be as in Theorem 10.2.2 above. We want to test the hypothesis H0 : p1 =
p2 = ... = pk = p, where p is a known constant. An approximate level-α test rejects H0 if

  y = Σ_{i=1}^k ((xi − ni p) / √(ni p (1 − p)))² ≥ χ²_{k,α}.

Theorem 10.2.4:
Let X1, ..., Xk be independent rv's with Xi ~ Bin(ni, p), i = 1, ..., k. Then the MLE of p is

  p̂ = Σ_{i=1}^k xi / Σ_{i=1}^k ni.

Proof:
This can be shown by using the joint likelihood function or by the fact that ΣXi ~ Bin(Σni, p)
and for X ~ Bin(n, p), the MLE is p̂ = x/n.
Theorem 10.2.5:
Let X1, ..., Xk be independent rv's with Xi ~ Bin(ni, pi), i = 1, ..., k. An approximate
level-α test of H0 : p1 = p2 = ... = pk = p, where p is unknown, rejects H0 if

  y = Σ_{i=1}^k ((xi − ni p̂) / √(ni p̂ (1 − p̂)))² ≥ χ²_{k−1,α},

where p̂ = Σxi / Σni.

Theorem 10.2.6:
Let (X1, ..., Xk) be a multinomial rv with parameters n, p1, p2, ..., pk where Σ_{i=1}^k pi = 1 and
Σ_{i=1}^k Xi = n. Then it holds that

  Uk = Σ_{i=1}^k (Xi − n pi)² / (n pi) →d χ²_{k−1}

as n → ∞.
An approximate level-α test of H0 : p1 = p01, p2 = p02, ..., pk = p0k rejects H0 if

  Σ_{i=1}^k (xi − n p0i)² / (n p0i) > χ²_{k−1,α}.
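The statistic in Theorem 10.2.6 is simple to compute. A small sketch (the function name and the fair-die counts are mine) for k = 6 cells, so the reference distribution is χ² with k − 1 = 5 degrees of freedom:

```python
def chisq_stat(counts, probs):
    # U_k = sum over cells of (X_i - n p_i)^2 / (n p_i)
    n = sum(counts)
    return sum((x - n * p) ** 2 / (n * p) for x, p in zip(counts, probs))

counts = [18, 22, 17, 25, 20, 18]          # 120 rolls of a die
u = chisq_stat(counts, [1.0 / 6.0] * 6)    # expected count is 20 per face
# here u is about 2.3, well below chi^2_{5,0.05} = 11.07: do not reject fairness
assert abs(u - 2.3) < 1e-6
```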
Proof:
Case k = 2 only:

  U2 = (X1 − np1)²/(np1) + (X2 − np2)²/(np2)
     = (X1 − np1)²/(np1) + (n − X1 − n(1 − p1))²/(n(1 − p1))
     = (X1 − np1)² (1/(np1) + 1/(n(1 − p1)))
     = (X1 − np1)² ((1 − p1) + p1)/(np1(1 − p1))
     = (X1 − np1)²/(np1(1 − p1))

By the CLT, (X1 − np1)/√(np1(1 − p1)) →d N(0, 1). Therefore, U2 →d χ²₁.
Lecture 28:
Fr 03/23/01

Theorem 10.2.7:
Let X1, ..., Xn be a sample from X. Let H0 : X ~ F, where the functional form of F is
known completely. We partition the real line into k disjoint Borel sets A1, ..., Ak and let
P(X ∈ Ai) = pi, where pi > 0 ∀i = 1, ..., k.
Let Yj = # Xi's in Aj = Σ_{i=1}^n I_{Aj}(Xi), ∀j = 1, ..., k.
Then (Y1, ..., Yk) has a multinomial distribution with parameters n, p1, p2, ..., pk.

Theorem 10.2.8:
Let X1, ..., Xn be a sample from X. Let H0 : X ~ F_θ, where θ = (θ1, ..., θr) is unknown.
Let the MLE θ̂ exist. We partition the real line into k disjoint Borel sets A1, ..., Ak and let
P_{θ̂}(X ∈ Ai) = p̂i, where p̂i > 0 ∀i = 1, ..., k.
Let Yj = # Xi's in Aj = Σ_{i=1}^n I_{Aj}(Xi), ∀j = 1, ..., k.
Then it holds that

  Vk = Σ_{i=1}^k (Yi − n p̂i)² / (n p̂i) →d χ²_{k−r−1}.

An approximate level-α test of H0 : X ~ F_θ rejects H0 if

  Σ_{i=1}^k (yi − n p̂i)² / (n p̂i) > χ²_{k−r−1,α},

where r is the number of parameters in θ that have to be estimated.
10.3 t-Tests and F-Tests

Definition 10.3.1: One- and Two-Tailed t-Tests
Let X1, ..., Xn be a sample from a N(μ, σ²) distribution where σ² > 0 may be known or
unknown and μ is unknown. Let X̄ = (1/n) Σ_{i=1}^n Xi and S² = (1/(n−1)) Σ_{i=1}^n (Xi − X̄)².
The following table summarizes the z- and t-tests that are typically being used:

                             Reject H0 at level α if
      H0       H1        σ² known                          σ² unknown
  I   μ ≤ μ0   μ > μ0    X̄ ≥ μ0 + (σ/√n) zα               X̄ ≥ μ0 + (s/√n) t_{n−1,α}
  II  μ ≥ μ0   μ < μ0    X̄ ≤ μ0 + (σ/√n) z_{1−α}          X̄ ≤ μ0 + (s/√n) t_{n−1,1−α}
  III μ = μ0   μ ≠ μ0    |X̄ − μ0| ≥ (σ/√n) z_{α/2}        |X̄ − μ0| ≥ (s/√n) t_{n−1,α/2}

Note:
(i) In Definition 10.3.1, μ0 is any fixed constant.
(ii) These tests are based on just one sample and are often called one-sample t-tests.
(iii) Tests I and II are UMP and test III is UMPU if σ² is known. Tests I, II, and III are
UMPU and UMP invariant if σ² is unknown.
(iv) For large n (≥ 30), we can use z-tables instead of t-tables. Also, for large n we can
drop the Normality assumption due to the CLT. However, for small n, none of these
simplifications is justified.
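The one-sample t statistic underlying these tests can be sketched as follows (the function name is mine, not from the notes):

```python
import math

def one_sample_t(xs, mu0):
    # t = (xbar - mu0) / (s / sqrt(n)), compared against t_{n-1} quantiles
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)  # unbiased sample variance
    return (xbar - mu0) / math.sqrt(s2 / n)

# xbar = 3, s^2 = 2.5, so t = (3 - 2) / sqrt(2.5 / 5) = sqrt(2)
assert abs(one_sample_t([1.0, 2.0, 3.0, 4.0, 5.0], 2.0) - math.sqrt(2.0)) < 1e-12
```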
Definition 10.3.2: Two-Sample t-Tests
Let X1, ..., Xm be a sample from a N(μ1, σ1²) distribution where σ1² > 0 may be known or
unknown and μ1 is unknown. Let Y1, ..., Yn be a sample from a N(μ2, σ2²) distribution where
σ2² > 0 may be known or unknown and μ2 is unknown.
Let X̄ = (1/m) Σ_{i=1}^m Xi and S1² = (1/(m−1)) Σ_{i=1}^m (Xi − X̄)².
Let Ȳ = (1/n) Σ_{i=1}^n Yi and S2² = (1/(n−1)) Σ_{i=1}^n (Yi − Ȳ)².
Let Sp² = ((m−1)S1² + (n−1)S2²)/(m+n−2).
The following table summarizes the z- and t-tests that are typically being used:

                                    Reject H0 at level α if
      H0            H1            σ1², σ2² known                          σ1², σ2² unknown, σ1 = σ2
  I   μ1−μ2 ≤ δ    μ1−μ2 > δ    X̄ − Ȳ ≥ δ + zα √(σ1²/m + σ2²/n)         X̄ − Ȳ ≥ δ + t_{m+n−2,α} Sp √(1/m + 1/n)
  II  μ1−μ2 ≥ δ    μ1−μ2 < δ    X̄ − Ȳ ≤ δ + z_{1−α} √(σ1²/m + σ2²/n)    X̄ − Ȳ ≤ δ + t_{m+n−2,1−α} Sp √(1/m + 1/n)
  III μ1−μ2 = δ    μ1−μ2 ≠ δ    |X̄ − Ȳ − δ| ≥ z_{α/2} √(σ1²/m + σ2²/n)  |X̄ − Ȳ − δ| ≥ t_{m+n−2,α/2} Sp √(1/m + 1/n)

Note:
(i) In Definition 10.3.2, δ is any fixed constant.
(ii) All tests are UMPU and UMP invariant.
(iii) If σ1² = σ2² = σ² (which is unknown), then Sp² is an unbiased estimate of σ². We should
check that σ1² = σ2² with an F-test.
(iv) For large m + n, we can use z-tables instead of t-tables. Also, for large m and large n
we can drop the Normality assumption due to the CLT. However, for small m or small
n, none of these simplifications is justified.
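For the unknown-variance column with δ = 0, the pooled statistic can be sketched as follows (the function name is mine):

```python
import math

def pooled_t(xs, ys, delta=0.0):
    m, n = len(xs), len(ys)
    xbar, ybar = sum(xs) / m, sum(ys) / n
    s1_sq = sum((x - xbar) ** 2 for x in xs) / (m - 1)
    s2_sq = sum((y - ybar) ** 2 for y in ys) / (n - 1)
    # pooled variance, an unbiased estimate of the common sigma^2
    sp_sq = ((m - 1) * s1_sq + (n - 1) * s2_sq) / (m + n - 2)
    return (xbar - ybar - delta) / math.sqrt(sp_sq * (1.0 / m + 1.0 / n))

# equal variances, means 2 vs 3: t = -1 / sqrt(1 * (1/3 + 1/3)) = -sqrt(3/2)
assert abs(pooled_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]) + math.sqrt(1.5)) < 1e-12
```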
Definition 10.3.3: Paired t-Tests
Let (X1, Y1), ..., (Xn, Yn) be a sample from a bivariate N(μ1, μ2, σ1², σ2², ρ) distribution where
all 5 parameters are unknown.
Let Di = Xi − Yi ~ N(μ1 − μ2, σ1² + σ2² − 2ρσ1σ2).
Let D̄ = (1/n) Σ_{i=1}^n Di and Sd² = (1/(n−1)) Σ_{i=1}^n (Di − D̄)².
The following table summarizes the t-tests that are typically being used:

      H0            H1            Reject H0 at level α if
  I   μ1−μ2 ≤ δ    μ1−μ2 > δ    D̄ ≥ δ + (Sd/√n) t_{n−1,α}
  II  μ1−μ2 ≥ δ    μ1−μ2 < δ    D̄ ≤ δ + (Sd/√n) t_{n−1,1−α}
  III μ1−μ2 = δ    μ1−μ2 ≠ δ    |D̄ − δ| ≥ (Sd/√n) t_{n−1,α/2}

Note:
(i) In Definition 10.3.3, δ is any fixed constant.
(ii) These tests are special cases of one-sample tests. All the properties stated in the Note
following Definition 10.3.1 hold.
(iii) We could do a test based on Normality assumptions if σ² = σ1² + σ2² − 2ρσ1σ2 were
known, but that is a very unrealistic assumption.
Definition 10.3.4: F-Tests
Let X1, ..., Xm be a sample from a N(μ1, σ1²) distribution where μ1 may be known or unknown
and σ1² is unknown. Let Y1, ..., Yn be a sample from a N(μ2, σ2²) distribution where μ2 may
be known or unknown and σ2² is unknown.
Recall that

  Σ_{i=1}^m (Xi − X̄)²/σ1² ~ χ²_{m−1},   Σ_{i=1}^n (Yi − Ȳ)²/σ2² ~ χ²_{n−1},

and

  [Σ_{i=1}^m (Xi − X̄)²/((m−1)σ1²)] / [Σ_{i=1}^n (Yi − Ȳ)²/((n−1)σ2²)] = (σ2²/σ1²)(S1²/S2²) ~ F_{m−1,n−1}.

The following table summarizes the F-tests that are typically being used:

                              Reject H0 at level α if
      H0          H1          μ1, μ2 known                                        μ1, μ2 unknown
  I   σ1² ≤ σ2²   σ1² > σ2²   [(1/m)Σ(Xi−μ1)²] / [(1/n)Σ(Yi−μ2)²] ≥ F_{m,n,α}    S1²/S2² ≥ F_{m−1,n−1,α}
  II  σ1² ≥ σ2²   σ1² < σ2²   [(1/n)Σ(Yi−μ2)²] / [(1/m)Σ(Xi−μ1)²] ≥ F_{n,m,α}    S2²/S1² ≥ F_{n−1,m−1,α}
  III σ1² = σ2²   σ1² ≠ σ2²   [(1/m)Σ(Xi−μ1)²] / [(1/n)Σ(Yi−μ2)²] ≥ F_{m,n,α/2}  S1²/S2² ≥ F_{m−1,n−1,α/2}
                              or [(1/n)Σ(Yi−μ2)²] / [(1/m)Σ(Xi−μ1)²] ≥ F_{n,m,α/2}  or S2²/S1² ≥ F_{n−1,m−1,α/2}

Note:
(i) Tests I and II are UMPU and UMP invariant if μ1 and μ2 are unknown.
(ii) Test III uses equal tails and therefore may not be unbiased.
(iii) If an F-test (at level α1) and a t-test (at level α2) are both performed, the combined
test has level α = 1 − (1 − α1)(1 − α2) ≥ max(α1, α2) (≈ α1 + α2 if both are small).
Lecture 29:
Mo 03/26/01

10.4 Bayes and Minimax Tests
Hypothesis testing may be conducted in a decision-theoretic framework. Here our action
space A consists of two options: a0 = fail to reject H0 and a1 = reject H0.
Usually, we assume no loss for a correct decision. Thus, our loss function looks like:

  L(θ, a0) = 0, if θ ∈ Θ0;  a(θ), if θ ∈ Θ1
  L(θ, a1) = b(θ), if θ ∈ Θ0;  0, if θ ∈ Θ1

We consider the following special cases:

0-1 loss: a(θ) = b(θ) = 1, i.e., all errors are equally bad.

Generalized 0-1 loss: a(θ) = cII, b(θ) = cI, i.e., all Type I errors are equally bad and all
Type II errors are equally bad, and Type I errors are worse than Type II errors or vice
versa.

Then the risk function can be written as

  R(θ, d(X)) = L(θ, a0) P_θ(d(X) = a0) + L(θ, a1) P_θ(d(X) = a1)
             = a(θ) P_θ(d(X) = a0), if θ ∈ Θ1;  b(θ) P_θ(d(X) = a1), if θ ∈ Θ0

The minimax rule minimizes

  max_θ {a(θ) P_θ(d(X) = a0), b(θ) P_θ(d(X) = a1)}.

Theorem 10.4.1:
The minimax rule d for testing

  H0 : θ = θ0 vs. H1 : θ = θ1

under the generalized 0-1 loss function rejects H0 if

  f_{θ1}(x) / f_{θ0}(x) ≥ k,

where k is chosen such that

  R(θ0, d(X)) = R(θ1, d(X))
  ⟺ cII P_{θ1}(d(X) = a0) = cI P_{θ0}(d(X) = a1)
  ⟺ cII P_{θ1}(f_{θ1}(X)/f_{θ0}(X) < k) = cI P_{θ0}(f_{θ1}(X)/f_{θ0}(X) ≥ k).
Proof:
Let d* be any other rule.

- If R(θ0, d) < R(θ0, d*), then

    R(θ0, d) = R(θ1, d) < max{R(θ0, d*), R(θ1, d*)}.

  So d* is not minimax.

- If R(θ0, d) ≥ R(θ0, d*), i.e.,

    cI P_{θ0}(d = a1) = R(θ0, d) ≥ R(θ0, d*) = cI P_{θ0}(d* = a1),

  then

    P(reject H0 | H0 true) = P_{θ0}(d = a1) ≥ P_{θ0}(d* = a1).

  By the NP Lemma, the rule d is MP of its size. Thus,

    P_{θ1}(d = a1) ≥ P_{θ1}(d* = a1) ⟺ P_{θ1}(d = a0) ≤ P_{θ1}(d* = a0)
    ⟹ R(θ1, d) ≤ R(θ1, d*)
    ⟹ max{R(θ0, d), R(θ1, d)} = R(θ1, d) ≤ R(θ1, d*) ≤ max{R(θ0, d*), R(θ1, d*)} ∀d*
    ⟹ d is minimax
Example 10.4.2:
Let X1, ..., Xn be iid N(μ, 1). Let H0 : μ = μ0 vs. H1 : μ = μ1 > μ0.
As we have seen before, f_{μ1}(x)/f_{μ0}(x) ≥ k1 is equivalent to x̄ ≥ k2.
Therefore, we choose k2 such that

  cII P_{μ1}(X̄ < k2) = cI P_{μ0}(X̄ ≥ k2)
  ⟺ cII Φ(√n(k2 − μ1)) = cI (1 − Φ(√n(k2 − μ0))),

where Φ(z) = P(Z ≤ z) for Z ~ N(0, 1).
Given cI, cII, μ0, μ1, and n, we can solve (numerically) for k2 using Normal tables.
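The defining equation cII Φ(√n(k2 − μ1)) = cI(1 − Φ(√n(k2 − μ0))) has a left side increasing and a right side decreasing in k2, so a bisection finds the root. A sketch (function names are mine; Φ is built from `math.erf`):

```python
import math

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def minimax_cutoff(mu0, mu1, n, c_i, c_ii, tol=1e-10):
    # root of g(k2) = cII Phi(sqrt(n)(k2 - mu1)) - cI (1 - Phi(sqrt(n)(k2 - mu0)));
    # g is increasing in k2, so bisection applies
    def g(k2):
        return (c_ii * norm_cdf(math.sqrt(n) * (k2 - mu1))
                - c_i * (1.0 - norm_cdf(math.sqrt(n) * (k2 - mu0))))
    lo, hi = mu0 - 10.0, mu1 + 10.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# symmetric costs put the cutoff midway between the two means
assert abs(minimax_cutoff(0.0, 1.0, 25, 1.0, 1.0) - 0.5) < 1e-6
```

Making Type I errors costlier (cI > cII) pushes the cutoff upward, i.e., the rule rejects less readily.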
Note:
Now suppose we have a prior distribution π(θ) on Θ. Then the Bayes risk of a decision rule
d (under the loss function introduced before) is

  R(π, d) = E_π(R(θ, d(X)))
          = ∫_Θ R(θ, d) π(θ) dθ
          = ∫_{Θ0} b(θ) π(θ) P_θ(d(X) = a1) dθ + ∫_{Θ1} a(θ) π(θ) P_θ(d(X) = a0) dθ

if π is a pdf.
The Bayes risk for a pmf π looks similar (see Rohatgi, page 461).
Theorem 10.4.3:
The Bayes rule for testing H0 : θ = θ0 vs. H1 : θ = θ1 under the prior π(θ0) = π0 and
π(θ1) = π1 = 1 − π0 and the generalized 0-1 loss function is to reject H0 if

  f_{θ1}(x) / f_{θ0}(x) ≥ cI π0 / (cII π1).

Proof:
We wish to minimize R(π, d). We know that

  R(π, d) (Def. 8.8.8)= E_π(R(θ, d))
          (Def. 8.8.3)= E_π(E_θ(L(θ, d(X))))
                (Note)= ∫ g(x) (∫ L(θ, d(x)) h(θ | x) dθ) dx,   with g the marginal and h(θ | x) the posterior,
          (Def. 4.7.1)= ∫ g(x) E(L(θ, d(X)) | X = x) dx
                      = E_X(E(L(θ, d(X)) | X)).

Therefore, it is sufficient to minimize E(L(θ, d(X)) | X).
The a posteriori distribution of θ is

  h(θ | x) = π(θ) f_θ(x) / Σ_θ π(θ) f_θ(x)
           = π0 f_{θ0}(x) / (π0 f_{θ0}(x) + π1 f_{θ1}(x)), if θ = θ0;
             π1 f_{θ1}(x) / (π0 f_{θ0}(x) + π1 f_{θ1}(x)), if θ = θ1

Therefore,

  E(L(θ, d(X)) | X = x) = cI h(θ0 | x), if θ = θ0, d(x) = a1
                          cII h(θ1 | x), if θ = θ1, d(x) = a0
                          0, if θ = θ0, d(x) = a0
                          0, if θ = θ1, d(x) = a1

This will be minimized if we reject H0, i.e., d(x) = a1, when cI h(θ0 | x) ≤ cII h(θ1 | x)

  ⟹ cI π0 f_{θ0}(x) ≤ cII π1 f_{θ1}(x)
  ⟹ f_{θ1}(x) / f_{θ0}(x) ≥ cI π0 / (cII π1)
Note:
For minimax rules and Bayes rules, the significance level α is no longer predetermined.

Example 10.4.4:
Let X1, ..., Xn be iid N(μ, 1). Let H0 : μ = μ0 vs. H1 : μ = μ1 > μ0. Let cI = cII.
By Theorem 10.4.3, the Bayes rule d rejects H0 if

  f_{μ1}(x) / f_{μ0}(x) ≥ π0/(1 − π0)
  ⟹ exp(−Σ(xi − μ1)²/2 + Σ(xi − μ0)²/2) ≥ π0/(1 − π0)
  ⟹ exp((μ1 − μ0) Σxi + n(μ0² − μ1²)/2) ≥ π0/(1 − π0)
  ⟹ (μ1 − μ0) Σxi + n(μ0² − μ1²)/2 ≥ ln(π0/(1 − π0))
  ⟹ (1/n) Σxi ≥ (1/n) ln(π0/(1 − π0))/(μ1 − μ0) + (μ0 + μ1)/2

If π0 = 1/2, then we reject H0 if x̄ ≥ (μ0 + μ1)/2.
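For cI = cII the rejection threshold on x̄ has the closed form derived above; a one-line sketch (the function name is mine):

```python
import math

def bayes_cutoff(mu0, mu1, n, pi0):
    # reject H0 when xbar >= this value (generalized 0-1 loss with cI = cII)
    return math.log(pi0 / (1.0 - pi0)) / (n * (mu1 - mu0)) + (mu0 + mu1) / 2.0

assert bayes_cutoff(0.0, 2.0, 10, 0.5) == 1.0   # pi0 = 1/2 gives the midpoint
assert bayes_cutoff(0.0, 2.0, 10, 0.8) > 1.0    # stronger prior on H0 raises the bar
```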
Note:
We can generalize Theorem 10.4.3 to the case of classifying among k options θ1, ..., θk. If we
use the 0-1 loss function

  L(θi, d) = 1, if d(X) = θj for some j ≠ i;  0, if d(X) = θi,

then the Bayes rule is to pick θi if

  πi f_{θi}(x) ≥ πj f_{θj}(x)  ∀j ≠ i.

Lecture 30:
We 03/28/01

Example 10.4.5:
Let X1, ..., Xn be iid N(μ, 1). Let μ1 < μ2 < μ3 and let π1 = π2 = π3.
Choose μ = μi if

  πi exp(−Σ(xk − μi)²/2) ≥ πj exp(−Σ(xk − μj)²/2),  j ≠ i, j = 1, 2, 3.

Similar to Example 10.4.4, these conditions can be transformed as follows:

  x̄(μi − μj) ≥ (μi − μj)(μi + μj)/2,  j ≠ i, j = 1, 2, 3.

In our particular example, we get the following decision rules:
(i) Choose μ1 if x̄ ≤ (μ1 + μ2)/2 (and x̄ ≤ (μ1 + μ3)/2).
(ii) Choose μ2 if x̄ ≥ (μ1 + μ2)/2 and x̄ ≤ (μ2 + μ3)/2.
(iii) Choose μ3 if x̄ ≥ (μ2 + μ3)/2 (and x̄ ≥ (μ1 + μ3)/2).
Note that in (i) and (iii) the condition in parentheses automatically holds when the other
condition holds.
If μ1 = 0, μ2 = 2, and μ3 = 4, we have the decision rules:
(i) Choose μ1 if x̄ ≤ 1.
(ii) Choose μ2 if 1 ≤ x̄ ≤ 3.
(iii) Choose μ3 if x̄ ≥ 3.
We do not have to worry about how to handle the boundary since the probability that the rv will
realize on any of the two boundary points is 0.
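With equal priors, these rules reduce to choosing the mean closest to x̄; a sketch (the function name is mine):

```python
def classify(xbar, mus=(0.0, 2.0, 4.0)):
    # equal priors: the Bayes rule picks the mean nearest to xbar
    return min(mus, key=lambda mu: (xbar - mu) ** 2)

assert classify(0.7) == 0.0   # xbar <= 1
assert classify(2.4) == 2.0   # 1 <= xbar <= 3
assert classify(3.6) == 4.0   # xbar >= 3
```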
11 Confidence Estimation

11.1 Fundamental Notions
Let X be a rv and a, b be fixed positive numbers, a < b. Then

  P(a < X < b) = P(a < X and X < b)
               = P(a < X and X/b < 1)
               = P(a < X and aX/b < a)
               = P(aX/b < a < X)

The interval I(X) = (aX/b, X) is an example of a random interval. I(X) contains the value
a with a certain fixed probability.
For example, if X ~ U(0, 1), a = 1/4, and b = 3/4, then the interval I(X) = (X/3, X) contains 1/4
with probability 1/2.

Definition 11.1.1:
Let P_θ, θ ∈ Θ ⊆ ℝ^k, be a set of probability distributions of a rv X. A family of subsets
S(x) of Θ, where S(x) depends on x but not on θ, is called a family of random sets. In
particular, if θ ∈ Θ ⊆ ℝ and S(x) is an interval (θ̲(x), θ̄(x)) where θ̲(x) and θ̄(x) depend on
x but not on θ, we call S(X) a random interval, with θ̲(X) and θ̄(X) as lower and upper
bounds, respectively. θ̲(X) may be −∞ and θ̄(X) may be +∞.

Note:
Frequently in inference, we are not interested in estimating a parameter or testing a hypoth-
esis about it. Instead, we are interested in establishing a lower or upper bound (or both) for
one or multiple parameters.

Definition 11.1.2:
A family of subsets S(x) of Θ ⊆ ℝ^k is called a family of confidence sets at confidence
level 1 − α if

  P_θ(S(X) ∋ θ) ≥ 1 − α  ∀θ ∈ Θ,

where 0 < α < 1 is usually small.
The quantity

  inf_θ P_θ(S(X) ∋ θ)

is called the confidence coefficient.
Definition 11.1.3:
For k = 1, we use the following names for some of the confidence sets defined in Definition
11.1.2:
(i) If S(x) = (θ̲(x), ∞), then θ̲(x) is called a level 1 − α lower confidence bound.
(ii) If S(x) = (−∞, θ̄(x)), then θ̄(x) is called a level 1 − α upper confidence bound.
(iii) S(x) = (θ̲(x), θ̄(x)) is called a level 1 − α confidence interval (CI).

Definition 11.1.4:
A family of 1 − α level confidence sets {S(x)} is called uniformly most accurate (UMA)
if

  P_{θ'}(S(X) ∋ θ) ≤ P_{θ'}(S'(X) ∋ θ)  ∀θ, θ' ∈ Θ

and for any 1 − α level family of confidence sets S'(X).
Theorem 11.1.5:
Let X_1, ..., X_n ~ F_θ, θ ∈ Θ, where Θ is an interval on IR. Let T(X, θ) be a function on
IR^n × Θ such that for each θ, T(X, θ) is a statistic, and as a function of θ, T is strictly
monotone (either increasing or decreasing) in θ at every value of x ∈ IR^n.
Let Λ ⊆ IR be the range of T, and let the equation λ = T(x, θ) be solvable for θ for every
λ ∈ Λ and every x ∈ IR^n.
If the distribution of T(X, θ) is independent of θ, then we can construct a confidence interval
for θ at any level.
Proof:
Choose α such that 0 < α < 1. Then we can choose λ_1(α) < λ_2(α) (which may not necessarily
be unique) such that

  P_θ(λ_1(α) < T(X, θ) < λ_2(α)) ≥ 1−α  ∀θ.

Since the distribution of T(X, θ) is independent of θ, λ_1(α) and λ_2(α) also do not depend on
θ.
If T(X, θ) is increasing in θ, solve the equation λ_1(α) = T(X, θ) for θ̲(X) and λ_2(α) = T(X, θ)
for θ̄(X).
If T(X, θ) is decreasing in θ, solve the equation λ_1(α) = T(X, θ) for θ̄(X) and λ_2(α) = T(X, θ)
for θ̲(X).
In either case, it holds that

  P_θ(θ̲(X) < θ < θ̄(X)) ≥ 1−α  ∀θ.

Note:

(i) Solvability is guaranteed if T is continuous and strictly monotone as a function of θ.

(ii) If T is not monotone, we can still use this Theorem to get confidence sets that may not
be confidence intervals.
Example 11.1.6:
Let X_1, ..., X_n ~ N(μ, σ²), where μ and σ² > 0 are both unknown. We seek a 1−α level
confidence interval for μ.
Note that

  T(X, μ) = (X̄ − μ) / (S/√n) ~ t_{n−1}

is independent of μ and monotone (decreasing) in μ.
We choose λ_1(α) and λ_2(α) such that

  P(λ_1(α) < T(X, μ) < λ_2(α)) = 1−α

and solve for μ, which yields

  P(X̄ − S λ_2(α)/√n < μ < X̄ − S λ_1(α)/√n) = 1−α.

Thus,

  (X̄ − S λ_2(α)/√n, X̄ − S λ_1(α)/√n)

is a 1−α level CI for μ. We commonly choose λ_2(α) = −λ_1(α) = t_{n−1,α/2}.
Example 11.1.7:
Let X_1, ..., X_n ~ U(0, θ).
We know that θ̂ = max(X_i) = Max_n is the MLE for θ and sufficient for θ.
The pdf of Max_n is given by

  f_n(y) = (n y^{n−1} / θ^n) I_{(0,θ)}(y).

Then the rv T_n = Max_n/θ has the pdf

  h_n(t) = n t^{n−1} I_{(0,1)}(t),

which is independent of θ. T_n is monotone (decreasing) in θ.
We now have to find numbers λ_1(α) and λ_2(α) such that

  P(λ_1(α) < T_n < λ_2(α)) = 1−α
  ⟹ n ∫_{λ_1}^{λ_2} t^{n−1} dt = 1−α
  ⟹ λ_2^n − λ_1^n = 1−α.

If we choose λ_2 = 1 and λ_1 = α^{1/n}, then (Max_n, α^{−1/n} Max_n) is a 1−α level CI for θ. This
holds since

  P(α^{1/n} < Max_n/θ < 1) = P(α^{−1/n} > θ/Max_n > 1)
    = P(α^{−1/n} Max_n > θ > Max_n)
    = 1−α.
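The coverage claim in Example 11.1.7 can be checked by simulation. A minimal Python sketch (Python is not used elsewhere in these notes; θ, n, α, and the replication count are illustrative choices):

```python
import numpy as np

# Monte Carlo check of Example 11.1.7: for X_1,...,X_n ~ U(0, theta),
# the interval (Max_n, alpha**(-1/n) * Max_n) should cover theta with
# probability 1 - alpha.  All numeric settings here are illustrative.
rng = np.random.default_rng(0)
theta, n, alpha, reps = 3.0, 10, 0.10, 100_000

maxn = rng.uniform(0, theta, size=(reps, n)).max(axis=1)
covered = (maxn < theta) & (theta < alpha ** (-1 / n) * maxn)
print(round(covered.mean(), 2))   # close to 1 - alpha = 0.90
```

The empirical coverage agrees with the exact calculation P(α^{1/n} < Max_n/θ < 1) = 1−α.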
Lecture 31:
Fr 03/30/01

11.2 Shortest-Length Confidence Intervals
In practice, we usually want not only an interval with coverage probability 1−α for θ, but if
possible the shortest (most precise) such interval.
Definition 11.2.1:
A rv T(X, θ) whose distribution is independent of θ is called a pivot.
Note:
The methods we will discuss here can provide the shortest interval based on a given pivot.
They will not guarantee that there is no other pivot with a shorter minimal interval.
Example 11.2.2:
Let X_1, ..., X_n ~ N(μ, σ²), where σ² > 0 is known. The obvious pivot for μ is

  T_μ(X) = (X̄ − μ) / (σ/√n) ~ N(0, 1).

Suppose that (a, b) is an interval such that P(a < Z < b) = 1−α, where Z ~ N(0, 1).
A 1−α level CI based on this pivot is found by

  1−α = P( a < (X̄ − μ)/(σ/√n) < b ) = P( X̄ − bσ/√n < μ < X̄ − aσ/√n ).

The length of the interval is L = (b − a) σ/√n.
To minimize L, we must choose a and b such that b − a is minimal while

  Φ(b) − Φ(a) = (1/√(2π)) ∫_a^b e^{−x²/2} dx = 1−α,

where Φ(z) = P(Z ≤ z).
To find a minimum, we can differentiate these expressions with respect to a. However, b is
not a constant but an implicit function of a. Formally, we could write db(a)/da; however, this
is usually shortened to db/da.
Here we get

  d/da (Φ(b) − Φ(a)) = φ(b) (db/da) − φ(a) = d/da (1−α) = 0

and

  dL/da = (σ/√n)(db/da − 1) = (σ/√n)(φ(a)/φ(b) − 1).

The minimum occurs when φ(a) = φ(b), which happens when a = b or a = −b. If we select
a = b, then Φ(b) − Φ(a) = Φ(a) − Φ(a) = 0 ≠ 1−α. Thus, we must have b = −a = z_{α/2}.
Thus, the shortest CI based on T_μ is

  (X̄ − z_{α/2} σ/√n, X̄ + z_{α/2} σ/√n).
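The optimality of the symmetric choice b = −a = z_{α/2} can be confirmed numerically. A hedged Python sketch (α and the grid resolution are illustrative choices): sweep the lower-tail mass γ over (0, α) and observe that the interval length is minimized at γ = α/2.

```python
from scipy.stats import norm

# Numerical check of Example 11.2.2: among all (a, b) with
# Phi(b) - Phi(a) = 1 - alpha, the symmetric split of the tail mass
# minimizes b - a (and hence the CI length (b - a) * sigma / sqrt(n)).
alpha = 0.05
best = min(
    # g = lower-tail mass; b - a is the resulting interval length
    (norm.ppf(1 - (alpha - g)) - norm.ppf(g), g)
    for g in [i * alpha / 1000 for i in range(1, 1000)]
)
print(round(best[1], 4))                   # optimal lower-tail mass: 0.025
print(round(norm.ppf(1 - alpha / 2), 2))  # z_{alpha/2} = 1.96
```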
Definition 11.2.3:
A pdf f(x) is unimodal iff there exists an x* such that f(x) is nondecreasing for x ≤ x* and
f(x) is nonincreasing for x ≥ x*.
Theorem 11.2.4:
Let f(x) be a unimodal pdf. If the interval [a, b] satisfies

(i) ∫_a^b f(x) dx = 1−α,

(ii) f(a) = f(b) > 0, and

(iii) a ≤ x* ≤ b, where x* is a mode of f(x),

then the interval [a, b] is the shortest of all intervals which satisfy condition (i).
Proof:
Let [a′, b′] be any interval with b′ − a′ < b − a. We will show that this implies
∫_{a′}^{b′} f(x) dx < 1−α, i.e., a contradiction.
We assume that a′ ≤ a. The case a < a′ is similar.

• Suppose that b′ ≤ a. Then a′ ≤ b′ ≤ a ≤ x*.
[Figure omitted ("Theorem 11.2.4a"): sketch of the unimodal pdf with a′ < b′ < a < x* < b
marked on the x-axis.]
It follows that

  ∫_{a′}^{b′} f(x) dx ≤ f(b′)(b′ − a′)   | x ≤ b′ ≤ x* ⟹ f(x) ≤ f(b′)
    ≤ f(a)(b′ − a′)   | b′ ≤ a ≤ x* ⟹ f(b′) ≤ f(a)
    < f(a)(b − a)   | b′ − a′ < b − a and f(a) > 0
    ≤ ∫_a^b f(x) dx   | f(x) ≥ f(a) for a ≤ x ≤ b
    = 1−α   | by (i)
• Suppose b′ > a. We can immediately exclude b′ > b, since then b′ − a′ > b − a, i.e.,
[a′, b′] wouldn't be of shorter length than [a, b]. Thus, we have to consider the case
a′ ≤ a < b′ < b.
[Figure omitted ("Theorem 11.2.4b"): sketch of the unimodal pdf with a′ ≤ a < b′ < b and
the mode x* marked on the x-axis.]
It holds that

  ∫_{a′}^{b′} f(x) dx = ∫_a^b f(x) dx + ∫_{a′}^{a} f(x) dx − ∫_{b′}^{b} f(x) dx.

Note that ∫_{a′}^{a} f(x) dx ≤ f(a)(a − a′) and ∫_{b′}^{b} f(x) dx ≥ f(b)(b − b′). Therefore, we get

  ∫_{a′}^{a} f(x) dx − ∫_{b′}^{b} f(x) dx ≤ f(a)(a − a′) − f(b)(b − b′)
    = f(a)((a − a′) − (b − b′))   | since f(a) = f(b)
    = f(a)((b′ − a′) − (b − a))
    < 0.

Thus,

  ∫_{a′}^{b′} f(x) dx < 1−α.
Note:
Example 11.2.2 is a special case of Theorem 11.2.4.
Example 11.2.5:
Let X_1, ..., X_n ~ N(μ, σ²), where μ is known. The obvious pivot for σ² is

  T_{σ²}(X) = Σ(X_i − μ)² / σ² ~ χ²_n.

So

  P( a < Σ(X_i − μ)²/σ² < b ) = 1−α
  ⟺ P( Σ(X_i − μ)²/b < σ² < Σ(X_i − μ)²/a ) = 1−α.

We wish to minimize

  L = (1/a − 1/b) Σ(X_i − μ)²

such that ∫_a^b f_n(t) dt = 1−α, where f_n(t) is the pdf of a χ²_n distribution.
We get

  f_n(b) (db/da) − f_n(a) = 0

and

  dL/da = (−1/a² + (1/b²)(db/da)) Σ(X_i − μ)² = (−1/a² + f_n(a)/(b² f_n(b))) Σ(X_i − μ)².

We obtain a minimum if a² f_n(a) = b² f_n(b).
Note that, in practice, the equal-tail values χ²_{n,α/2} and χ²_{n,1−α/2} are used, which do not
result in shortest-length CIs. The reason for this choice is simple: when these tests were
developed, computers that could solve these equations numerically did not exist, so people
generally had to rely on tabulated values. Manually solving the equation above for each case
obviously wasn't feasible.
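Today the condition a² f_n(a) = b² f_n(b), coupled with the coverage constraint, is easy to solve numerically. A hedged sketch (n and α are illustrative choices; `excess` is a helper name introduced here, not from the notes):

```python
from scipy.optimize import brentq
from scipy.stats import chi2

# Numerical sketch for Example 11.2.5: find a < b with
#   integral_a^b f_n(t) dt = 1 - alpha   and   a^2 f_n(a) = b^2 f_n(b),
# where f_n is the chi^2_n pdf.
n, alpha = 10, 0.05

def excess(a):
    # given a, choose b so the interval has probability 1 - alpha,
    # then return the defect in the shortest-length condition
    b = chi2.ppf(chi2.cdf(a, n) + 1 - alpha, n)
    return a**2 * chi2.pdf(a, n) - b**2 * chi2.pdf(b, n)

a = brentq(excess, 1e-6, chi2.ppf(alpha, n) * 0.999)
b = chi2.ppf(chi2.cdf(a, n) + 1 - alpha, n)
# equal-tail constants for comparison
a_et, b_et = chi2.ppf(alpha / 2, n), chi2.ppf(1 - alpha / 2, n)
print(1/a - 1/b < 1/a_et - 1/b_et)   # shortest-length beats equal tails
```

The interval length is proportional to 1/a − 1/b, so the comparison above is exactly the shortest-length claim in the text.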
Example 11.2.6:
Let X_1, ..., X_n ~ U(0, θ). Let Max_n = max X_i = X_(n). Since T_n = Max_n/θ has pdf
n t^{n−1} I_{(0,1)}(t), which does not depend on θ, T_n can be selected as our pivot. The
density of T_n is strictly increasing for n ≥ 2, so we cannot find constants a and b as in
Example 11.2.5.
If P(a < T_n < b) = 1−α, then P(Max_n/b < θ < Max_n/a) = 1−α.
We wish to minimize

  L = Max_n (1/a − 1/b)

such that ∫_a^b n t^{n−1} dt = b^n − a^n = 1−α.
We get

  n b^{n−1} − n a^{n−1} (da/db) = 0 ⟹ da/db = b^{n−1}/a^{n−1}

and

  dL/db = Max_n (−(1/a²)(da/db) + 1/b²) = Max_n (−b^{n−1}/a^{n+1} + 1/b²)
    = Max_n (a^{n+1} − b^{n+1}) / (b² a^{n+1}) < 0 for 0 ≤ a < b ≤ 1.

Thus, L does not have a local minimum. However, since dL/db < 0, L is strictly decreasing
as a function of b. It is minimized when b = 1, i.e., when b is as large as possible. The
corresponding a is selected as a = α^{1/n}.
The shortest 1−α level CI based on T_n is (Max_n, α^{−1/n} Max_n).
Lecture 32:
Mo 04/02/01

11.3 Confidence Intervals and Hypothesis Tests
Example 11.3.1:
Let X_1, ..., X_n ~ N(μ, σ²), where σ² > 0 is known. In Example 11.2.2 we have shown that
the interval

  (X̄ − z_{α/2} σ/√n, X̄ + z_{α/2} σ/√n)

is a 1−α level CI for μ.
Suppose we define a test φ of H_0: μ = μ_0 vs. H_1: μ ≠ μ_0 that rejects H_0 iff μ_0 does not
fall in this interval. Then,

  P_{μ_0}(Type I error) = P_{μ_0}(X̄ − z_{α/2} σ/√n ≥ μ_0) + P_{μ_0}(X̄ + z_{α/2} σ/√n ≤ μ_0)
    = P_{μ_0}( |X̄ − μ_0| / (σ/√n) ≥ z_{α/2} ) = α,

i.e., φ has size α.
Conversely, if φ(x, μ_0) is a family of size α tests of H_0: μ = μ_0, the set
{μ_0 : φ(x, μ_0) fails to reject H_0} is a level 1−α confidence set for μ_0.
Theorem 11.3.2:
Write H_0(θ_0) for H_0: θ = θ_0, and H_1(θ_0) for the alternative. Let A(θ_0), θ_0 ∈ Θ, denote the
acceptance region of a level-α test of H_0(θ_0). For each possible observation x, define

  S(x) = {θ : x ∈ A(θ), θ ∈ Θ}.

Then S(x) is a family of 1−α level confidence sets for θ.
If, moreover, A(θ_0) is UMP for (α, H_0(θ_0), H_1(θ_0)), then S(x) minimizes P_θ(S(X) ∋ θ_0)
∀θ ∈ H_1(θ_0) among all 1−α level families of confidence sets, i.e., S(x) is UMA.
Proof:
It holds that S(x) ∋ θ iff x ∈ A(θ). Therefore,

  P_θ(S(X) ∋ θ) = P_θ(X ∈ A(θ)) ≥ 1−α.

Let S*(X) be any other family of 1−α level confidence sets. Define A*(θ) = {x : S*(x) ∋ θ}.
Then,

  P_θ(X ∈ A*(θ)) = P_θ(S*(X) ∋ θ) ≥ 1−α.

Since A(θ_0) is UMP, it holds that

  P_θ(X ∈ A*(θ_0)) ≥ P_θ(X ∈ A(θ_0))  ∀θ ∈ H_1(θ_0).

This implies that

  P_θ(S*(X) ∋ θ_0) ≥ P_θ(X ∈ A(θ_0)) = P_θ(S(X) ∋ θ_0)  ∀θ ∈ H_1(θ_0).
Example 11.3.3:
Let X be a rv that belongs to a one-parameter exponential family with pdf
f_θ(x) = exp(Q(θ)T(x) + S'(x) + D(θ)), where Q(θ) is nondecreasing. We consider a test
H_0: θ = θ_0 vs. H_1: θ < θ_0.
The acceptance region of a UMP size α test of H_0 has the form A(θ_0) = {x : T(x) > c(θ_0)}.
For θ ≤ θ_0,

  P_θ(T(X) ≤ c(θ)) = α = P_{θ_0}(T(X) ≤ c(θ_0)) ≤ P_θ(T(X) ≤ c(θ_0))

(since for a UMP test, it holds that power ≥ size), so c(θ) ≤ c(θ_0). Therefore, we can
choose c(θ) nondecreasing.
A level 1−α CI for θ is then

  S(x) = {θ : x ∈ A(θ)} = (−∞, c^{−1}(T(x))),

where c^{−1}(T(x)) = sup_θ {θ : c(θ) < T(x)}.
Example 11.3.4:
Let X ~ Exp(θ) with f_θ(x) = (1/θ) e^{−x/θ} I_{(0,∞)}(x). Then Q(θ) = −1/θ and T(x) = x.
We want to test H_0: θ = θ_0 vs. H_1: θ < θ_0.
The acceptance region of a UMP size α test of H_0 has the form A(θ_0) = {x : x ≥ c(θ_0)}, where

  α = ∫_0^{c(θ_0)} f_{θ_0}(x) dx = ∫_0^{c(θ_0)} (1/θ_0) e^{−x/θ_0} dx = 1 − e^{−c(θ_0)/θ_0}.

Thus,

  e^{−c(θ_0)/θ_0} = 1 − α ⟹ c(θ_0) = θ_0 log(1/(1−α)).

Therefore, the UMA family of 1−α level confidence sets is of the form

  S(x) = {θ : x ∈ A(θ)}
    = {θ : θ ≤ x / log(1/(1−α))}
    = [0, x / log(1/(1−α))].
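The coverage of this UMA upper bound can be verified by simulation. A hedged Python sketch (the true θ, α, and the replication count are illustrative choices):

```python
import math
import numpy as np

# Monte Carlo check of Example 11.3.4: the UMA upper confidence bound
# x / log(1/(1 - alpha)) based on a single observation X ~ Exp(theta)
# should cover the true theta with probability 1 - alpha.
rng = np.random.default_rng(1)
theta, alpha, reps = 2.0, 0.05, 100_000

x = rng.exponential(scale=theta, size=reps)     # one observation per run
upper = x / math.log(1 / (1 - alpha))
covered = theta <= upper
print(round(covered.mean(), 2))   # close to 1 - alpha = 0.95
```

The exact coverage is P(X ≥ θ log(1/(1−α))) = e^{−log(1/(1−α))} = 1−α, which the simulation reproduces.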
Note:
Just as we frequently restrict the class of tests (when UMP tests don't exist), we can make
the same sorts of restrictions on CIs.
Definition 11.3.5:
A family S(x) of confidence sets for a parameter θ is said to be unbiased at level 1−α if

  P_θ(S(X) ∋ θ) ≥ 1−α and P_{θ′}(S(X) ∋ θ) ≤ 1−α  ∀θ, θ′ ∈ Θ with θ′ ≠ θ.

If S(x) is unbiased and minimizes P_{θ′}(S(X) ∋ θ) among all unbiased CIs at level 1−α, it is
called uniformly most accurate unbiased (UMAU).
Theorem 11.3.6:
Let A(θ_0) be the acceptance region of a UMPU size α test of H_0: θ = θ_0 vs. H_1: θ ≠ θ_0
(for all θ_0). Then S(x) = {θ : x ∈ A(θ)} is a UMAU family of confidence sets at level 1−α.
Proof:
Since A(θ) is unbiased, it holds for θ′ ≠ θ that

  P_{θ′}(S(X) ∋ θ) = P_{θ′}(X ∈ A(θ)) ≤ 1−α,

while P_θ(S(X) ∋ θ) = P_θ(X ∈ A(θ)) ≥ 1−α. Thus, S is unbiased.
Let S*(x) be any other unbiased family of level 1−α confidence sets, where

  A*(θ) = {x : S*(x) ∋ θ}.

It holds that

  P_{θ′}(X ∈ A*(θ)) = P_{θ′}(S*(X) ∋ θ) ≤ 1−α.

Therefore, A*(θ) is the acceptance region of an unbiased size α test. Thus,

  P_{θ′}(S*(X) ∋ θ) = P_{θ′}(X ∈ A*(θ))
    ≥ P_{θ′}(X ∈ A(θ))   | A UMPU
    = P_{θ′}(S(X) ∋ θ).
Lecture 33:
We 04/04/01

Theorem 11.3.7:
Let Θ be an interval on IR and f_θ be the pdf of X. Let S(X) be a family of 1−α level CIs,
where S(X) = (θ̲(X), θ̄(X)), with θ̲ and θ̄ increasing functions of X and θ̄(X) − θ̲(X) a finite
rv.
Then it holds that

  E_θ(θ̄(X) − θ̲(X)) = ∫ (θ̄(x) − θ̲(x)) f_θ(x) dx = ∫_{θ′≠θ} P_θ(S(X) ∋ θ′) dθ′  ∀θ ∈ Θ.

Proof:
It holds that θ̄ − θ̲ = ∫_{θ̲}^{θ̄} dθ′. Thus, for all θ ∈ Θ,

  E_θ(θ̄(X) − θ̲(X)) = ∫_{IR^n} (θ̄(x) − θ̲(x)) f_θ(x) dx
    = ∫_{IR^n} ( ∫_{θ̲(x)}^{θ̄(x)} dθ′ ) f_θ(x) dx
    = ∫_{IR} ( ∫_{θ̄^{−1}(θ′)}^{θ̲^{−1}(θ′)} f_θ(x) dx ) dθ′   | interchanging the order of integration; the inner limits lie in IR^n
    = ∫_{IR} P_θ( X ∈ [θ̄^{−1}(θ′), θ̲^{−1}(θ′)] ) dθ′
    = ∫_{IR} P_θ(S(X) ∋ θ′) dθ′
    = ∫_{θ′≠θ} P_θ(S(X) ∋ θ′) dθ′.
Note:
Theorem 11.3.7 says that the expected length of the CI is the probability that S(X) includes
the false θ′, averaged over all false values θ′ ≠ θ.
Corollary 11.3.8:
If S(X) is UMAU, then E_θ(θ̄(X) − θ̲(X)) is minimized among all unbiased families of CIs.
Proof:
In Theorem 11.3.7 we have shown that

  E_θ(θ̄(X) − θ̲(X)) = ∫_{θ′≠θ} P_θ(S(X) ∋ θ′) dθ′.

Since a UMAU CI minimizes this probability for all θ′, the entire integral is minimized.
Example 11.3.9:
Let X_1, ..., X_n ~ N(μ, σ²), where σ² > 0 is known.
By Example 11.2.2, (X̄ − z_{α/2} σ/√n, X̄ + z_{α/2} σ/√n) is the shortest 1−α level CI for μ.
By Example 9.4.3, the equivalent test is UMPU. So by Theorem 11.3.6 this interval is UMAU,
and by Corollary 11.3.8 it has shortest expected length as well.
Example 11.3.10:
Let X_1, ..., X_n ~ N(μ, σ²), where μ and σ² > 0 are both unknown.
Note that

  T(X, σ²) = (n−1)S²/σ² =: T* ~ χ²_{n−1}.

Thus,

  P_{σ²}( λ_1 < (n−1)S²/σ² < λ_2 ) = 1−α ⟺ P_{σ²}( (n−1)S²/λ_2 < σ² < (n−1)S²/λ_1 ) = 1−α.

We now define P(ψ) as

  P(ψ) = P_{σ²}( (n−1)S²/λ_2 < σ′² < (n−1)S²/λ_1 )
    = P( T*/λ_2 < ψ < T*/λ_1 )
    = P( λ_1 ψ < T* < λ_2 ψ ),

where ψ = σ′²/σ².
If our test is unbiased, then it follows from Definition 11.3.5 that

  P(1) = 1−α and P(ψ) < 1−α  ∀ψ ≠ 1.

This implies that we can find λ_1, λ_2 such that P(1) = 1−α and

  dP(ψ)/dψ |_{ψ=1} = λ_2 f_{T*}(λ_2) − λ_1 f_{T*}(λ_1) = 0,

where f_{T*} is the pdf that relates to T*, i.e., the pdf of a χ²_{n−1} distribution. We can solve
for λ_1, λ_2 numerically. Then,

  ( (n−1)S²/λ_2, (n−1)S²/λ_1 )

is an unbiased 1−α level CI for σ².
Rohatgi, Theorem 4(b), pages 428-429, states that the related test is UMPU. Therefore, by
Theorem 11.3.6 and Corollary 11.3.8, our CI is UMAU with shortest expected length among
all unbiased intervals.
Note that this CI is different from the equal-tail CI based on Definition 10.2.1, III, and from
the shortest-length CI obtained in Example 11.2.5.
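The unbiasedness condition λ_2 f(λ_2) = λ_1 f(λ_1) can be solved numerically in the same way as the shortest-length condition in Example 11.2.5. A hedged sketch (n and α are illustrative choices; `defect` is a helper name introduced here):

```python
from scipy.optimize import brentq
from scipy.stats import chi2

# Numerical sketch for Example 11.3.10: choose lambda_1 < lambda_2 with
#   P(lambda_1 < T* < lambda_2) = 1 - alpha   and
#   lambda_2 f(lambda_2) = lambda_1 f(lambda_1),
# where f is the chi^2_{n-1} pdf.
n, alpha = 15, 0.05
df = n - 1

def defect(l1):
    l2 = chi2.ppf(chi2.cdf(l1, df) + 1 - alpha, df)
    return l1 * chi2.pdf(l1, df) - l2 * chi2.pdf(l2, df)

l1 = brentq(defect, 1e-6, chi2.ppf(alpha, df) * 0.999)
l2 = chi2.ppf(chi2.cdf(l1, df) + 1 - alpha, df)
print(round(chi2.cdf(l2, df) - chi2.cdf(l1, df), 4))   # coverage 0.95
```

With the solved λ_1, λ_2, the interval ((n−1)S²/λ_2, (n−1)S²/λ_1) is the unbiased CI described above.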
11.4 Bayes Confidence Intervals

Definition 11.4.1:
Given a posterior distribution h(θ | x), a level 1−α credible set (Bayesian confidence
set) is any set A such that

  P(θ ∈ A | x) = ∫_A h(θ | x) dθ = 1 − α.

Note:
If A is an interval, we speak of a Bayesian confidence interval.
Example 11.4.2:
Let X ~ Bin(n, p) and π(p) ~ U(0, 1).
In Example 8.8.11, we have shown that

  h(p | x) = p^x (1−p)^{n−x} / ∫_0^1 p^x (1−p)^{n−x} dp · I_{(0,1)}(p)
    = B(x+1, n−x+1)^{−1} p^x (1−p)^{n−x} I_{(0,1)}(p)
    = Γ(n+2)/(Γ(x+1)Γ(n−x+1)) p^x (1−p)^{n−x} I_{(0,1)}(p)
    ~ Beta(x+1, n−x+1),

where B(a, b) = Γ(a)Γ(b)/Γ(a+b) is the beta function evaluated at a and b, and
Beta(x+1, n−x+1) represents a Beta distribution with parameters x+1 and n−x+1.
Using the observed value for x and tables of incomplete beta integrals or a numerical
approach, we can find λ_1 and λ_2 such that P_{p|x}(λ_1 < p < λ_2) = 1−α. So (λ_1, λ_2) is a
credible interval for p.
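The "numerical approach" mentioned above is immediate with a modern beta quantile function. A hedged Python sketch of an equal-tail credible interval (n, x, α are illustrative choices, not values from the notes):

```python
from scipy.stats import beta

# Sketch for Example 11.4.2: with X ~ Bin(n, p) and a U(0,1) prior,
# the posterior is Beta(x+1, n-x+1); an equal-tail level 1-alpha
# credible interval uses the posterior quantiles.
n, x, alpha = 20, 7, 0.05
post = beta(x + 1, n - x + 1)
lam1, lam2 = post.ppf(alpha / 2), post.ppf(1 - alpha / 2)
print(round(post.cdf(lam2) - post.cdf(lam1), 2))   # posterior mass 0.95
```

Theorem 11.2.4 could instead be used to shift (λ_1, λ_2) to the shortest such interval, since the Beta posterior here is unimodal.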
Note:

(i) The definitions and interpretations of credible intervals and confidence intervals are quite
different. Therefore, very different intervals may result.

(ii) We can often use Theorem 11.2.4 to find the shortest credible interval.
Lecture 34:
Mo 04/09/01

Example 11.4.3:
Let X_1, ..., X_n be iid N(θ, 1) and π(θ) ~ N(0, 1). We want to construct a Bayesian level
1−α CI for θ.
By Definition 8.8.7, the posterior distribution of θ given x is

  h(θ | x) = π(θ) f(x | θ) / g(x),

where

  g(x) = ∫_{−∞}^{∞} f(x, θ) dθ
    = ∫_{−∞}^{∞} π(θ) f(x | θ) dθ
    = ∫_{−∞}^{∞} (1/√(2π)) exp(−θ²/2) · (1/(√(2π))^n) exp( −(1/2) Σ_{i=1}^n (x_i − θ)² ) dθ
    = ∫_{−∞}^{∞} (1/(2π)^{(n+1)/2}) exp( −(1/2)( Σ x_i² − 2θ Σ x_i + nθ² ) − θ²/2 ) dθ
    = ∫_{−∞}^{∞} (1/(2π)^{(n+1)/2}) exp( −(1/2) Σ x_i² + θ n x̄ − (n/2)θ² − θ²/2 ) dθ
    = (1/(2π)^{(n+1)/2}) exp( −(1/2) Σ x_i² ) ∫_{−∞}^{∞} exp( −((n+1)/2)( θ² − 2θ n x̄/(n+1) ) ) dθ
    = (1/(2π)^{(n+1)/2}) exp( −(1/2) Σ x_i² ) ∫_{−∞}^{∞} exp( −((n+1)/2)( θ − n x̄/(n+1) )² + ((n+1)/2)( n x̄/(n+1) )² ) dθ
    = (1/(2π)^{(n+1)/2}) exp( −(1/2) Σ x_i² + n²x̄²/(2(n+1)) ) √(2π/(n+1)) ·
      ∫_{−∞}^{∞} (1/√(2π/(n+1))) exp( −(1/2) (θ − n x̄/(n+1))² / (1/(n+1)) ) dθ
      | the remaining integral = 1, since the integrand is the pdf of a N(n x̄/(n+1), 1/(n+1)) distribution
    = ( (n+1)^{−1/2} / (2π)^{n/2} ) exp( −(1/2) Σ x_i² + n²x̄²/(2(n+1)) ).

Therefore,

  h(θ | x) = π(θ) f(x | θ) / g(x)
    = [ (1/√(2π)) exp(−θ²/2) (1/(√(2π))^n) exp( −(1/2) Σ (x_i − θ)² ) ]
      / [ ( (n+1)^{−1/2} / (2π)^{n/2} ) exp( −(1/2) Σ x_i² + n²x̄²/(2(n+1)) ) ]
    = ( √(n+1) / √(2π) ) exp( −(1/2) Σ x_i² + θ n x̄ − (n/2)θ² − θ²/2 + (1/2) Σ x_i² − n²x̄²/(2(n+1)) )
    = ( √(n+1) / √(2π) ) exp( −((n+1)/2)( θ² − 2θ n x̄/(n+1) + (2/(n+1)) · n²x̄²/(2(n+1)) ) )
    = ( 1/√(2π/(n+1)) ) exp( −(1/2) (θ − n x̄/(n+1))² / (1/(n+1)) ),

i.e.,

  h(θ | x) ~ N( n x̄/(n+1), 1/(n+1) ).

Therefore, a Bayesian level 1−α CI for θ is

  ( (n/(n+1)) X̄ − z_{α/2}/√(n+1), (n/(n+1)) X̄ + z_{α/2}/√(n+1) ).

The shortest ("classical") level 1−α CI for θ (treating θ as fixed) is

  ( X̄ − z_{α/2}/√n, X̄ + z_{α/2}/√n ),

as seen in Example 11.3.9.
Thus, the Bayesian CI is slightly shorter than the classical CI, since we use additional
information in constructing the Bayesian CI.
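The length comparison at the end of Example 11.4.3 is easy to quantify. A hedged sketch (n and α are illustrative choices):

```python
import math
from scipy.stats import norm

# Sketch for Example 11.4.3: compare the half-lengths of the Bayesian
# level 1-alpha CI, z_{alpha/2}/sqrt(n+1), and the classical CI,
# z_{alpha/2}/sqrt(n).
n, alpha = 10, 0.05
z = norm.ppf(1 - alpha / 2)
bayes_half = z / math.sqrt(n + 1)
classical_half = z / math.sqrt(n)
print(bayes_half < classical_half)   # True: the Bayesian CI is shorter
```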
12 Nonparametric Inference

12.1 Nonparametric Estimation

Definition 12.1.1:
A statistical method which does not rely on assumptions about the distributional form of a
rv (except, perhaps, that it is absolutely continuous, or purely discrete) is called a
nonparametric or distribution-free method.

Note:
Unless otherwise specified, we make the following assumptions for the remainder of this
chapter: Let X_1, ..., X_n be iid ~ F, where F is unknown. Let P be the class of all possible
distributions of X.

Definition 12.1.2:
A statistic T(X) is sufficient for a family of distributions P if the conditional distribution of
X given T = t is the same for all F ∈ P.

Example 12.1.3:
Let X_1, ..., X_n be absolutely continuous. Let T = (X_(1), ..., X_(n)) be the order statistic.
It holds that

  f(x | T = t) = 1/n!,

so T is sufficient for the family of absolutely continuous distributions on IR.

Definition 12.1.4:
A family of distributions P is complete if the only unbiased estimate of 0 is 0 itself, i.e.,

  E_F(h(X)) = 0 ∀F ∈ P ⟹ h(x) = 0 ∀x.

Definition 12.1.5:
A statistic T(X) is complete in relation to P if the class of induced distributions of T is
complete.

Theorem 12.1.6:
The order statistic (X_(1), ..., X_(n)) is a complete sufficient statistic, provided that
X_1, ..., X_n are of either (pure) discrete or (pure) continuous type.
Definition 12.1.7:
A parameter g(F) is called estimable if it has an unbiased estimate, i.e., if there exists a
T(X) such that

  E_F(T(X)) = g(F)  ∀F ∈ P.

Example 12.1.8:
Let P be the class of distributions for which second moments exist. Then X̄ is unbiased for
μ(F) = ∫ x dF(x). Thus, μ(F) is estimable.

Definition 12.1.9:
The degree m of an estimable parameter g(F) is the smallest sample size for which an
unbiased estimate exists for all F ∈ P.
An unbiased estimate based on a sample of size m is called a kernel.

Lemma 12.1.10:
There exists a symmetric kernel for every estimable parameter.
Proof:
Let T(X_1, ..., X_m) be a kernel of g(F). Define

  T_s(X_1, ..., X_m) = (1/m!) Σ T(X_{i_1}, ..., X_{i_m}),

where the summation is over all m! permutations (i_1, ..., i_m) of {1, ..., m}.
Clearly, T_s is symmetric and E(T_s) = g(F).

Example 12.1.11:

(i) E(X_1) = μ(F), so μ(F) has degree 1 with kernel X_1.

(ii) E(I_{(c,∞)}(X_1)) = P_F(X > c), where c is a known constant. So g(F) = P_F(X > c) has
degree 1 with kernel I_{(c,∞)}(X_1).

(iii) There exists no T(X_1) such that E(T(X_1)) = σ²(F) = ∫ (x − μ(F))² dF(x).
But E(T(X_1, X_2)) = E(X_1² − X_1X_2) = σ²(F). So σ²(F) has degree 2 with kernel
X_1² − X_1X_2. Note that X_2² − X_2X_1 is another kernel.

(iv) A symmetric kernel for σ²(F) is

  T_s(X_1, X_2) = (1/2)((X_1² − X_1X_2) + (X_2² − X_1X_2)) = (1/2)(X_1 − X_2)².
Lecture 35:
We 04/11/01

Definition 12.1.12:
Let g(F) be an estimable parameter of degree m. Let X_1, ..., X_n be a sample of size n, n ≥ m.
Given a kernel T(X_{i_1}, ..., X_{i_m}) of g(F), we define a U-statistic by

  U(X_1, ..., X_n) = (1 / C(n, m)) Σ_c T_s(X_{i_1}, ..., X_{i_m}),

where T_s is defined as in Lemma 12.1.10 and the summation c is over all C(n, m) = n!/(m!(n−m)!)
combinations of m integers (i_1, ..., i_m) from {1, ..., n}. U(X_1, ..., X_n) is symmetric in the
X_i's, and E_F(U(X)) = g(F) for all F.
Example 12.1.13:
For estimating μ(F), the degree of μ(F) is m = 1.
Symmetric kernel:

  T_s(X_i) = X_i, i = 1, ..., n.

U-statistic:

  U_μ(X) = (1 / C(n, 1)) Σ_c X_i
    = (1 · (n−1)! / n!) Σ_c X_i
    = (1/n) Σ_{i=1}^n X_i
    = X̄.

For estimating σ²(F), the degree of σ²(F) is m = 2.
Symmetric kernel:

  T_s(X_{i_1}, X_{i_2}) = (1/2)(X_{i_1} − X_{i_2})², i_1, i_2 = 1, ..., n, i_1 ≠ i_2.

U-statistic:

  U_{σ²}(X) = (1 / C(n, 2)) Σ_{i_1<i_2} (1/2)(X_{i_1} − X_{i_2})²
    = (1 / C(n, 2)) (1/4) Σ_{i_1≠i_2} (X_{i_1} − X_{i_2})²
    = ((n−2)! · 2! / n!) (1/4) Σ_{i_1≠i_2} (X_{i_1} − X_{i_2})²
    = (1/(2n(n−1))) Σ_{i_1} Σ_{i_2≠i_1} (X_{i_1}² − 2 X_{i_1} X_{i_2} + X_{i_2}²)
    = (1/(2n(n−1))) [ (n−1) Σ_{i_1=1}^n X_{i_1}² − 2 (Σ_{i_1=1}^n X_{i_1})(Σ_{i_2=1}^n X_{i_2}) + 2 Σ_{i=1}^n X_i² + (n−1) Σ_{i_2=1}^n X_{i_2}² ]
    = (1/(2n(n−1))) [ n Σ X_{i_1}² − Σ X_{i_1}² − 2 (Σ X_{i_1})² + 2 Σ X_i² + n Σ X_{i_2}² − Σ X_{i_2}² ]
    = (1/(n(n−1))) [ n Σ_{i=1}^n X_i² − (Σ_{i=1}^n X_i)² ]
    = (1/(n(n−1))) [ n Σ_{i=1}^n ( X_i − (1/n) Σ_{j=1}^n X_j )² ]
    = (1/(n−1)) Σ_{i=1}^n (X_i − X̄)²
    = S².
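The identity U_{σ²} = S² derived above can be verified directly on data. A hedged sketch (the data vector is an arbitrary illustrative choice):

```python
import numpy as np
from itertools import combinations

# Numerical check of Example 12.1.13: the U-statistic built from the
# symmetric kernel (x_i - x_j)^2 / 2 equals the sample variance S^2.
x = np.array([1.3, -0.2, 2.7, 0.9, 1.1, -1.4])

# average of the kernel over all C(n, 2) pairs i < j
u_stat = np.mean([(xi - xj) ** 2 / 2 for xi, xj in combinations(x, 2)])
s2 = x.var(ddof=1)   # S^2 with the 1/(n-1) divisor
print(bool(np.isclose(u_stat, s2)))   # True
```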
Theorem 12.1.14:
Let P be the class of all absolutely continuous or all purely discrete distribution functions on
IR. Any estimable function g(F), F ∈ P, has a unique estimate that is unbiased and
symmetric in the observations and has uniformly minimum variance among all unbiased
estimates.
Proof:
Let X_1, ..., X_n iid ~ F ∈ P, with T(X_1, ..., X_n) an unbiased estimate of g(F).
We define

  T_i = T_i(X_1, ..., X_n) = T(X_{i_1}, X_{i_2}, ..., X_{i_n}), i = 1, 2, ..., n!,

over all possible permutations (i_1, ..., i_n) of {1, ..., n}. Let T̄ = (1/n!) Σ_{i=1}^{n!} T_i.
Then

  E_F(T̄) = g(F)

and

  Var(T̄) = E(T̄²) − (E(T̄))²
    = E[ ( (1/n!) Σ_{i=1}^{n!} T_i )² ] − [g(F)]²
    = (1/n!)² Σ_{i=1}^{n!} Σ_{j=1}^{n!} E(T_i T_j) − [g(F)]²
    ≤ (1/n!)² Σ_{i=1}^{n!} Σ_{j=1}^{n!} √( E(T_i²) E(T_j²) ) − [g(F)]²   | Cauchy-Schwarz
    = E(T²) − [g(F)]²   | each T_i has the same distribution as T
    = Var(T).

Equality holds iff T_i = T_j ∀i, j = 1, ..., n!
⟹ T̄ is symmetric in (X_1, ..., X_n) and T̄ = T
⟹ by Rohatgi, Problem 4, page 538, T̄ is a function of the order statistic
⟹ by Rohatgi, Theorem 1, page 535, the order statistic is a complete sufficient statistic
⟹ by Note (i) following Theorem 8.4.12, T̄ is the UMVUE.

Corollary 12.1.15:
If T(X_1, ..., X_n) is unbiased for g(F), F ∈ P, the corresponding U-statistic is an essentially
unique UMVUE.
Definition 12.1.16:
Suppose we have independent samples X_1, ..., X_m iid ~ F ∈ P and Y_1, ..., Y_n iid ~ G ∈ P
(G may or may not equal F). Let g(F, G) be an estimable function with unbiased estimator
T(X_1, ..., X_k, Y_1, ..., Y_l).
Define

  T_s(X_1, ..., X_k, Y_1, ..., Y_l) = (1/(k! l!)) Σ_{P_X} Σ_{P_Y} T(X_{i_1}, ..., X_{i_k}, Y_{j_1}, ..., Y_{j_l})

(where P_X and P_Y range over the permutations of the X's and the Y's) and

  U(X, Y) = (1 / (C(m, k) C(n, l))) Σ_{C_X} Σ_{C_Y} T_s(X_{i_1}, ..., X_{i_k}, Y_{j_1}, ..., Y_{j_l})

(where C_X and C_Y range over the combinations of the X's and the Y's).
U is called a generalized U-statistic.
Example 12.1.17:
Let X_1, ..., X_m and Y_1, ..., Y_n be independent random samples from F and G, respectively,
with F, G ∈ P. We wish to estimate

  g(F, G) = P_{F,G}(X ≤ Y).

Let us define

  Z_ij = 1 if X_i ≤ Y_j, and Z_ij = 0 if X_i > Y_j,

for each pair X_i, Y_j, i = 1, 2, ..., m, j = 1, 2, ..., n.
Then Σ_{i=1}^m Z_ij is the number of X's ≤ Y_j, and Σ_{j=1}^n Z_ij is the number of Y's ≥ X_i.
It holds that

  E(I(X_i ≤ Y_j)) = g(F, G) = P_{F,G}(X ≤ Y),

and the degrees are k = l = 1, so we use

  U(X, Y) = (1/(C(m, 1) C(n, 1))) Σ_{C_X} Σ_{C_Y} T_s(X_{i_1}, Y_{j_1})
    = ((m−1)!(n−1)!/(m! n!)) Σ_{C_X} Σ_{C_Y} (1/(1! 1!)) Σ_{P_X} Σ_{P_Y} T(X_{i_1}, Y_{j_1})
    = (1/(mn)) Σ_{i=1}^m Σ_{j=1}^n I(X_i ≤ Y_j).

This Mann-Whitney estimator (or Wilcoxon two-sample estimator) is unbiased and
symmetric in the X's and Y's. It follows by Corollary 12.1.15 that it has minimum variance.
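The double sum (1/(mn)) Σ_i Σ_j I(X_i ≤ Y_j) is a one-liner with broadcasting. A hedged sketch (the two samples are arbitrary illustrative data):

```python
import numpy as np

# Sketch for Example 12.1.17: the Mann-Whitney estimator of P(X <= Y)
# is the fraction of pairs (X_i, Y_j) with X_i <= Y_j.
x = np.array([0.5, 1.2, -0.3, 2.1])
y = np.array([1.0, 0.8, 1.9])

# (1/(mn)) * sum_i sum_j I(X_i <= Y_j), via an m-by-n comparison matrix
u = np.mean(x[:, None] <= y[None, :])
print(round(float(u), 4))   # 7 of 12 pairs satisfy X_i <= Y_j: 0.5833
```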
Lecture 36:
Fr 04/13/01

12.2 Single-Sample Hypothesis Tests
Let X_1, ..., X_n be a sample from a distribution F. The problem of fit is to test the
hypothesis that the sample X_1, ..., X_n is from some specified distribution against the
alternative that it is from some other distribution, i.e., H_0: F = F_0 vs. H_1: F(x) ≠ F_0(x)
for some x.

Definition 12.2.1:
Let X_1, ..., X_n iid ~ F, and let the corresponding empirical cdf be

  F*_n(x) = (1/n) Σ_{i=1}^n I_{(−∞,x]}(X_i).

The statistic

  D_n = sup_x | F*_n(x) − F(x) |

is called the two-sided Kolmogorov-Smirnov statistic (K-S statistic).
The one-sided K-S statistics are

  D_n^+ = sup_x [F*_n(x) − F(x)] and D_n^− = sup_x [F(x) − F*_n(x)].
Theorem 12.2.2:
For any continuous distribution F, the K-S statistics D_n, D_n^−, D_n^+ are distribution free.
Proof:
Let X_(1), ..., X_(n) be the order statistics of X_1, ..., X_n, i.e., X_(1) ≤ X_(2) ≤ ... ≤ X_(n), and
define X_(0) = −∞ and X_(n+1) = +∞.
Then,

  F*_n(x) = i/n for X_(i) ≤ x < X_(i+1), i = 0, ..., n.

Therefore,

  D_n^+ = max_{0≤i≤n} { sup_{X_(i)≤x<X_(i+1)} [ i/n − F(x) ] }
    = max_{0≤i≤n} { i/n − inf_{X_(i)≤x<X_(i+1)} F(x) }
    (*) = max_{0≤i≤n} { i/n − F(X_(i)) }
    = max{ max_{1≤i≤n} ( i/n − F(X_(i)) ), 0 }.

(*) holds since F is nondecreasing on [X_(i), X_(i+1)).
Note that D_n^+ is a function of the F(X_(i)). In order to make some inference about D_n^+, the
distribution of F(X_(i)) must be known. We know from the Probability Integral Transformation
(see Rohatgi, page 203, Theorem 1) that for a rv X with continuous cdf F_X, it holds that
F_X(X) ~ U(0, 1).
Thus, F(X_(i)) is the i-th order statistic of a sample from U(0, 1), independent of F. Therefore,
the distribution of D_n^+ is independent of F.
Similarly, the distribution of

  D_n^− = max{ max_{1≤i≤n} ( F(X_(i)) − (i−1)/n ), 0 }

is independent of F.
Since

  D_n = sup_x | F*_n(x) − F(x) | = max{ D_n^+, D_n^− },

the distribution of D_n is also independent of F.
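The max formulas from the proof give a direct way to compute the K-S statistics. A hedged sketch (F_0 = N(0, 1) and the data are arbitrary illustrative choices):

```python
import numpy as np
from scipy.stats import norm

# Sketch for Definition 12.2.1 / Theorem 12.2.2: compute D_n^+, D_n^-,
# and D_n from the order statistics via the max formulas in the proof.
x = np.sort(np.array([-0.9, 0.1, 0.4, 1.3, 2.2]))
n = len(x)
F = norm.cdf(x)               # hypothesized cdf at the order statistics
i = np.arange(1, n + 1)

d_plus = max(np.max(i / n - F), 0.0)
d_minus = max(np.max(F - (i - 1) / n), 0.0)
d = max(d_plus, d_minus)
print(round(d, 4))   # 0.3398 for this illustrative data set
```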
Theorem 12.2.3:
If F is continuous, then

  P(D_n ≤ ν + 1/(2n)) =
    0, if ν ≤ 0;
    ∫_{1/(2n)−ν}^{1/(2n)+ν} ∫_{3/(2n)−ν}^{3/(2n)+ν} ··· ∫_{(2n−1)/(2n)−ν}^{(2n−1)/(2n)+ν} f(u) du, if 0 < ν < (2n−1)/(2n);
    1, if ν ≥ (2n−1)/(2n),

where

  f(u) = f(u_1, ..., u_n) = n!, if 0 < u_1 < u_2 < ... < u_n < 1, and 0 otherwise,

is the joint pdf of the order statistic of a sample of size n from U(0, 1).
Note:
As Gibbons & Chakraborti (1992), page 108{109, point out, this result must be interpreted
carefully. Consider the case n = 2.
For 0 < � < 34 , it holds that
P (D2 � � +1
4) =
Z �+ 14
14��
0<u1<u2<1
Z �+ 34
34��
2! du2 du1:
Note that the integration limits overlap if
� +1
4� �� + 3
4
() � � 1
4
136
When 0 < ν < 1/4, the constraint 0 < u_1 < u_2 < 1 holds automatically. Thus, for 0 < ν < 1/4,
it holds that

  P(D_2 ≤ ν + 1/4) = ∫_{1/4−ν}^{1/4+ν} ∫_{3/4−ν}^{3/4+ν} 2! du_2 du_1
    = 2! ∫_{1/4−ν}^{1/4+ν} ( u_2 |_{3/4−ν}^{3/4+ν} ) du_1
    = 2! ∫_{1/4−ν}^{1/4+ν} 2ν du_1
    = 2! (2ν) u_1 |_{1/4−ν}^{1/4+ν}
    = 2! (2ν)².
4 , the region of integration is as follows:
u1
u2
1
10
• • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •
••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••
1/4 + nu
3/4 - nu
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
•••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••
3/4 - nu
Area 1 Area 2
Note to Theorem 12.2.3
Thus, for 1/4 ≤ ν < 3/4, it holds that

  P(D_2 ≤ ν + 1/4) = ∫_{1/4−ν}^{1/4+ν} ∫_{3/4−ν}^{3/4+ν} 2! du_2 du_1   (subject to 0 < u_1 < u_2 < 1)
    = ∫_{3/4−ν}^{ν+1/4} ∫_{u_1}^{1} 2! du_2 du_1 + ∫_{0}^{3/4−ν} ∫_{3/4−ν}^{1} 2! du_2 du_1
    = 2 [ ∫_{3/4−ν}^{ν+1/4} ( u_2 |_{u_1}^{1} ) du_1 + ∫_{0}^{3/4−ν} ( u_2 |_{3/4−ν}^{1} ) du_1 ]
    = 2 [ ∫_{3/4−ν}^{ν+1/4} (1 − u_1) du_1 + ∫_{0}^{3/4−ν} (1 − 3/4 + ν) du_1 ]
    = 2 [ ( u_1 − u_1²/2 ) |_{3/4−ν}^{ν+1/4} + ( u_1/4 + ν u_1 ) |_{0}^{3/4−ν} ]
    = 2 [ (ν + 1/4) − (ν + 1/4)²/2 − (−ν + 3/4) + (−ν + 3/4)²/2 + (−ν + 3/4)/4 + ν(−ν + 3/4) ]
    = 2 [ ν + 1/4 − ν²/2 − ν/4 − 1/32 + ν − 3/4 + ν²/2 − (3/4)ν + 9/32 − ν/4 + 3/16 − ν² + (3/4)ν ]
    = 2 [ −ν² + (3/2)ν − 1/16 ]
    = −2ν² + 3ν − 1/8.

Combining these results gives

  P(D_2 ≤ ν + 1/4) =
    0, if ν ≤ 0;
    2! (2ν)², if 0 < ν < 1/4;
    −2ν² + 3ν − 1/8, if 1/4 ≤ ν < 3/4;
    1, if ν ≥ 3/4.
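The algebra in the piecewise cdf above can be sanity-checked at the breakpoints: the two pieces must agree at ν = 1/4, and the second must reach 1 at ν = 3/4.

```python
# Sanity check of the piecewise cdf of D_2 derived above:
#   P(D_2 <= nu + 1/4) = 2*(2*nu)**2          for 0 < nu < 1/4
#                      = -2*nu**2 + 3*nu - 1/8  for 1/4 <= nu < 3/4
low = lambda nu: 2 * (2 * nu) ** 2
high = lambda nu: -2 * nu**2 + 3 * nu - 1 / 8
print(low(0.25), high(0.25), high(0.75))   # 0.5 0.5 1.0
```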
Lecture 37:
Mo 04/16/01

Theorem 12.2.4:
Let F be a continuous cdf. Then it holds for all z ≥ 0:

  lim_{n→∞} P(D_n ≤ z/√n) = L_1(z) = 1 − 2 Σ_{i=1}^{∞} (−1)^{i−1} exp(−2i²z²).
Theorem 12.2.5:
Let F be a continuous cdf. Then it holds:

  P(D_n^+ ≤ z) = P(D_n^− ≤ z) =
    0, if z ≤ 0;
    ∫_{1−z}^{1} ∫_{(n−1)/n−z}^{u_n} ··· ∫_{2/n−z}^{u_3} ∫_{1/n−z}^{u_2} f(u) du, if 0 < z < 1;
    1, if z ≥ 1,

where f(u) is defined as in Theorem 12.2.3.
Note:
It should be obvious that the statistics D_n^+ and D_n^− have the same distribution because of
symmetry.
Theorem 12.2.6:
Let F be a continuous cdf. Then it holds for all z ≥ 0:

  lim_{n→∞} P(D_n^+ ≤ z/√n) = lim_{n→∞} P(D_n^− ≤ z/√n) = L_2(z) = 1 − exp(−2z²).
Corollary 12.2.7:
Let V_n = 4n(D_n^+)². Then it holds that V_n →^d χ²_2, i.e., this transformation of D_n^+ has an
asymptotic χ²_2 distribution.
Proof:
Let x ≥ 0. Then it follows:

  lim_{n→∞} P(V_n ≤ x)
    = lim_{n→∞} P(V_n ≤ 4z²)   | x = 4z²
    = lim_{n→∞} P(4n(D_n^+)² ≤ 4z²)
    = lim_{n→∞} P(√n D_n^+ ≤ z)
    = 1 − exp(−2z²)   | Th. 12.2.6
    = 1 − exp(−x/2)   | 4z² = x

Thus, lim_{n→∞} P(V_n ≤ x) = 1 − exp(−x/2) for x ≥ 0. Note that this is the cdf of a χ²_2
distribution.
Definition 12.2.8:
Let D_{n,α} be the smallest value such that P(D_n > D_{n,α}) ≤ α. Likewise, let D^+_{n,α} be the
smallest value such that P(D_n^+ > D^+_{n,α}) ≤ α.
The Kolmogorov-Smirnov test (K-S test) rejects H_0: F(x) = F_0(x) ∀x at level α if
D_n > D_{n,α}.
It rejects H_0′: F(x) ≥ F_0(x) ∀x at level α if D_n^− > D^+_{n,α}, and it rejects
H_0″: F(x) ≤ F_0(x) ∀x at level α if D_n^+ > D^+_{n,α}.
Note:
Rohatgi, Table 7, page 661, gives values of Dn;� and D+n;� for selected values of � and small
n. Theorems 12.2.4 and 12.2.6 allow the approximation of Dn;� and D+n;� for large n.
Example 12.2.9:
Let X_1, ..., X_n ~ C(1, 0). We want to test whether H_0: X ~ N(0, 1).
The following data have been observed for x_(1), ..., x_(10):

  -1.42, -0.43, -0.19, 0.26, 0.30, 0.45, 0.64, 0.96, 1.97, and 4.68.

The results for the K-S test have been obtained through the following S-Plus session, i.e.,
D_10^+ = 0.02219616, D_10^− = 0.3025681, and D_10 = 0.3025681:
> x _ c(-1.42, -0.43, -0.19, 0.26, 0.30, 0.45, 0.64, 0.96, 1.97, 4.68)
> FX _ pnorm(x)
> FX
[1] 0.07780384 0.33359782 0.42465457 0.60256811 0.61791142 0.67364478
[7] 0.73891370 0.83147239 0.97558081 0.99999857
> Dp _ (1:10)/10 - FX
> Dp
[1] 2.219616e-02 -1.335978e-01 -1.246546e-01 -2.025681e-01 -1.179114e-01
[6] -7.364478e-02 -3.891370e-02 -3.147239e-02 -7.558081e-02 1.434375e-06
> Dm _ FX - (0:9)/10
> Dm
[1] 0.07780384 0.23359782 0.22465457 0.30256811 0.21791142 0.17364478
[7] 0.13891370 0.13147239 0.17558081 0.09999857
> max(Dp)
[1] 0.02219616
> max(Dm)
[1] 0.3025681
> max(max(Dp), max(Dm))
[1] 0.3025681
>
> ks.gof(x, alternative = "two.sided", mean = 0, sd = 1)
One-sample Kolmogorov-Smirnov Test
Hypothesized distribution = normal
data: x
ks = 0.3026, p-value = 0.2617
alternative hypothesis:
True cdf is not the normal distn. with the specified parameters
Using Rohatgi, Table 7, page 661, we have to use D_{10,0.20} = 0.323 for α = 0.20. Since
D_10 = 0.3026 < 0.323 = D_{10,0.20}, it follows that p > 0.20. The K-S test does not reject H_0
at level α = 0.20. As S-Plus shows, the precise p-value is p = 0.2617.
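The S-Plus session above can be reproduced in Python; this is a sketch using scipy's one-sample K-S test against N(0, 1), not part of the original notes:

```python
import numpy as np
from scipy.stats import kstest

# Reproduce the K-S statistic of Example 12.2.9 with scipy.
x = np.array([-1.42, -0.43, -0.19, 0.26, 0.30, 0.45, 0.64, 0.96, 1.97, 4.68])
result = kstest(x, "norm")          # one-sample K-S test vs. N(0, 1)
print(round(result.statistic, 4))   # 0.3026, matching D_10 above
```

The reported p-value is likewise in the vicinity of the 0.2617 given by S-Plus (the exact value may differ slightly depending on scipy's exact-vs-asymptotic p-value mode).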
Note:
Comparison between χ² and K-S goodness-of-fit tests:

- K-S uses all available data; χ² bins the data and loses information

- K-S works for all sample sizes; χ² requires large sample sizes

- it is more difficult to modify K-S for estimated parameters; χ² can be easily adapted
for estimated parameters

- K-S is "conservative" for discrete data, i.e., it tends to accept H_0 for such data

- the order matters for K-S; χ² is better for unordered categorical data
12.3 More on Order Statistics

Definition 12.3.1:
Let F be a continuous cdf. A tolerance interval for F with tolerance coefficient γ is
a random interval such that the probability is γ that this random interval covers at least a
specified percentage 100p% of the distribution.

Theorem 12.3.2:
If order statistics X_(r) < X_(s) are used as the endpoints of a tolerance interval for a continuous
cdf F, it holds that

  γ = Σ_{i=0}^{s−r−1} C(n, i) p^i (1−p)^{n−i}.
Proof:
According to Definition 12.3.1, it holds that
$$\gamma = P_{X_{(r)},X_{(s)}}\left( P_X(X_{(r)} < X < X_{(s)}) \geq p \right).$$
Since $F$ is continuous, it holds that $F_X(X) \sim U(0,1)$. Therefore,
$$P_X(X_{(r)} < X < X_{(s)}) = P(X < X_{(s)}) - P(X \leq X_{(r)}) = F(X_{(s)}) - F(X_{(r)}) = U_{(s)} - U_{(r)},$$
where $U_{(s)}$ and $U_{(r)}$ are the order statistics of a $U(0,1)$ distribution.
Thus,
$$\gamma = P_{X_{(r)},X_{(s)}}\left( P_X(X_{(r)} < X < X_{(s)}) \geq p \right) = P(U_{(s)} - U_{(r)} \geq p).$$
By Theorem 4.4.4, we can determine the joint distribution of order statistics and calculate $\gamma$ as
$$\gamma = \int_p^1 \int_0^{y-p} \frac{n!}{(r-1)!\,(s-r-1)!\,(n-s)!} \, x^{r-1} (y-x)^{s-r-1} (1-y)^{n-s} \, dx \, dy.$$
Rather than solving this integral directly, we make the transformation
$$U = U_{(s)} - U_{(r)}, \qquad V = U_{(s)}.$$
Then the joint pdf of $U$ and $V$ is
$$f_{U,V}(u,v) = \begin{cases} \frac{n!}{(r-1)!\,(s-r-1)!\,(n-s)!} \, (v-u)^{r-1} u^{s-r-1} (1-v)^{n-s}, & \text{if } 0 < u < v < 1 \\ 0, & \text{otherwise} \end{cases}$$
142
and the marginal pdf of $U$ is
$$f_U(u) = \int_0^1 f_{U,V}(u,v) \, dv$$
$$= \frac{n!}{(r-1)!\,(s-r-1)!\,(n-s)!} \, u^{s-r-1} I_{(0,1)}(u) \int_u^1 (v-u)^{r-1} (1-v)^{n-s} \, dv$$
$$\stackrel{(A)}{=} \frac{n!}{(r-1)!\,(s-r-1)!\,(n-s)!} \, u^{s-r-1} (1-u)^{n-s+r} I_{(0,1)}(u) \underbrace{\int_0^1 t^{r-1} (1-t)^{n-s} \, dt}_{B(r,\,n-s+1)}$$
$$= \frac{n!}{(r-1)!\,(s-r-1)!\,(n-s)!} \, u^{s-r-1} (1-u)^{n-s+r} \, \frac{(r-1)!\,(n-s)!}{(n-s+r)!} \, I_{(0,1)}(u)$$
$$= \frac{n!}{(n-s+r)!\,(s-r-1)!} \, u^{s-r-1} (1-u)^{n-s+r} I_{(0,1)}(u)$$
$$= n \binom{n-1}{s-r-1} u^{s-r-1} (1-u)^{n-s+r} I_{(0,1)}(u).$$
(A) is based on the transformation $t = \frac{v-u}{1-u}$, i.e., $v - u = (1-u)t$, $1 - v = 1 - u - (1-u)t = (1-u)(1-t)$, and $dv = (1-u)\,dt$.
It follows that
$$\gamma = P(U_{(s)} - U_{(r)} \geq p) = P(U \geq p) = \int_p^1 n \binom{n-1}{s-r-1} u^{s-r-1} (1-u)^{n-s+r} \, du$$
$$\stackrel{(B)}{=} P(Y < s-r), \quad \text{where } Y \sim Bin(n,p),$$
$$= \sum_{i=0}^{s-r-1} \binom{n}{i} p^i (1-p)^{n-i}.$$
(B) holds due to Rohatgi, Remark 3 after Theorem 5.3.18, page 216, since for $X \sim Bin(n,p)$, it holds that
$$P(X < k) = \int_p^1 n \binom{n-1}{k-1} x^{k-1} (1-x)^{n-k} \, dx.$$
143
Lecture 38:
We 04/18/01

Example 12.3.3:
Let $s = n$ and $r = 1$. Then,
$$\gamma = \sum_{i=0}^{n-2} \binom{n}{i} p^i (1-p)^{n-i} = 1 - p^n - n p^{n-1} (1-p).$$
If $p = 0.8$ and $n = 10$, then
$$\gamma_{10} = 1 - (0.8)^{10} - 10 \cdot (0.8)^9 \cdot (0.2) = 0.624,$$
i.e., $(X_{(1)}, X_{(10)})$ defines a 62.4% tolerance interval for 80% probability.
If $p = 0.8$ and $n = 20$, then
$$\gamma_{20} = 1 - (0.8)^{20} - 20 \cdot (0.8)^{19} \cdot (0.2) = 0.931,$$
and if $p = 0.8$ and $n = 30$, then
$$\gamma_{30} = 1 - (0.8)^{30} - 30 \cdot (0.8)^{29} \cdot (0.2) = 0.989.$$
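Since the tolerance coefficient of Theorem 12.3.2 is just a $Bin(n,p)$ cdf evaluated at $s-r-1$, it is easy to compute numerically. A small Python sketch (my own, not from the notes), checked against Example 12.3.3:

```python
# gamma = sum_{i=0}^{s-r-1} C(n,i) p^i (1-p)^(n-i) = Bin(n,p) cdf at s-r-1.
from scipy.stats import binom

def tolerance_coefficient(n, r, s, p):
    """Probability that (X_(r), X_(s)) covers at least 100p% of a continuous F."""
    return binom.cdf(s - r - 1, n, p)

# Example 12.3.3: r = 1, s = n, p = 0.8, compared against the closed form.
for n in (10, 20, 30):
    g = tolerance_coefficient(n, r=1, s=n, p=0.8)
    closed_form = 1 - 0.8**n - n * 0.8**(n - 1) * 0.2
    print(n, round(g, 3), round(closed_form, 3))
```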
Theorem 12.3.4:
Let $k_p$ be the $p$th quantile of a continuous cdf $F$. Let $X_{(1)}, \ldots, X_{(n)}$ be the order statistics of a sample of size $n$ from $F$. Then it holds that
$$P(X_{(r)} \leq k_p \leq X_{(s)}) = \sum_{i=r}^{s-1} \binom{n}{i} p^i (1-p)^{n-i}.$$
Proof:
It holds that
$$P(X_{(r)} \leq k_p) = P(\text{at least } r \text{ of the } X_i\text{'s are} \leq k_p) = \sum_{i=r}^{n} \binom{n}{i} p^i (1-p)^{n-i}.$$
Therefore,
$$P(X_{(r)} \leq k_p \leq X_{(s)}) = P(X_{(r)} \leq k_p) - P(X_{(s)} < k_p)$$
$$= \sum_{i=r}^{n} \binom{n}{i} p^i (1-p)^{n-i} - \sum_{i=s}^{n} \binom{n}{i} p^i (1-p)^{n-i} = \sum_{i=r}^{s-1} \binom{n}{i} p^i (1-p)^{n-i}.$$
144
Corollary 12.3.5:
$(X_{(r)}, X_{(s)})$ is a level $\sum_{i=r}^{s-1} \binom{n}{i} p^i (1-p)^{n-i}$ confidence interval for $k_p$.

Example 12.3.6:
Let $n = 10$. We want a 95% confidence interval for the median, i.e., $k_p$ where $p = \frac{1}{2}$.
We get the following probabilities $p_{r,s} = \sum_{i=r}^{s-1} \binom{n}{i} p^i (1-p)^{n-i}$ that $(X_{(r)}, X_{(s)})$ covers $k_{0.5}$:

p_{r,s}                              s
         2     3     4     5     6     7     8     9     10
    1  0.01  0.05  0.17  0.38  0.62  0.83  0.94  0.99  0.998
    2        0.04  0.16  0.37  0.61  0.82  0.93  0.98  0.99
    3              0.12  0.32  0.57  0.77  0.89  0.93  0.94
    4                    0.21  0.45  0.66  0.77  0.82  0.83
 r  5                          0.25  0.45  0.57  0.61  0.62
    6                                0.21  0.32  0.37  0.38
    7                                      0.12  0.16  0.17
    8                                            0.04  0.05
    9                                                  0.01

Only the random intervals $(X_{(1)}, X_{(9)})$, $(X_{(1)}, X_{(10)})$, $(X_{(2)}, X_{(9)})$, and $(X_{(2)}, X_{(10)})$ give the desired coverage probability. Therefore, we use the one that comes closest to 95%, i.e., $(X_{(2)}, X_{(9)})$, as the 95% confidence interval for the median.
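The $p_{r,s}$ table can be regenerated as a difference of two binomial cdf's; a short Python sketch (my own, using `scipy.stats.binom` as an assumed helper):

```python
# Coverage probability of (X_(r), X_(s)) for the median k_0.5 with n = 10:
# p_{r,s} = sum_{i=r}^{s-1} C(n,i) p^i (1-p)^(n-i)  (Theorem 12.3.4).
from scipy.stats import binom

n, p = 10, 0.5

def coverage(r, s):
    return binom.cdf(s - 1, n, p) - binom.cdf(r - 1, n, p)

print(round(coverage(2, 9), 4))    # the interval chosen in Example 12.3.6
print(round(coverage(1, 10), 4))
```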
145
13 Some Results from Sampling
13.1 Simple Random Samples
Definition 13.1.1:
Let $\Omega$ be a population of size $N$ with mean $\mu$ and variance $\sigma^2$. A sampling method (of size $n$) is called simple if the set $S$ of possible samples contains all combinations of $n$ elements of $\Omega$ (without repetition) and the probability for each sample $s \in S$ to become selected depends only on $n$, i.e., $p(s) = \frac{1}{\binom{N}{n}} \;\forall s \in S$. Then we call $s \in S$ a simple random sample (SRS) of size $n$.
Theorem 13.1.2:
Let $\Omega$ be a population of size $N$ with mean $\mu$ and variance $\sigma^2$. Let $Y: \Omega \to I\!R$ be a measurable function. Let $n_i$ be the total number of times the value $\tilde{y}_i$ occurs in the population and $p_i = \frac{n_i}{N}$ be the relative frequency with which the value $\tilde{y}_i$ occurs in the population. Let $(y_1, \ldots, y_n)$ be a SRS of size $n$ with respect to $Y$, where $P(Y = \tilde{y}_i) = p_i = \frac{n_i}{N}$.
Then the components $y_i$, $i = 1, \ldots, n$, are identically distributed as $Y$ and it holds for $i \neq j$:
$$P(y_i = \tilde{y}_k, \, y_j = \tilde{y}_l) = \frac{1}{N(N-1)} \, n_{kl}, \quad \text{where } n_{kl} = \begin{cases} n_k n_l, & k \neq l \\ n_k(n_k - 1), & k = l \end{cases}$$
Note:
(i) In sampling, many authors use capital letters to denote properties of the population and small letters to denote properties of the random sample. In particular, the $x_i$'s and $y_i$'s are considered as random variables related to the sample. They are not seen as specific realizations.
(ii) The following equalities hold in the scenario of Theorem 13.1.2:
$$N = \sum_i n_i, \qquad \mu = \frac{1}{N} \sum_i n_i \tilde{y}_i, \qquad \sigma^2 = \frac{1}{N} \sum_i n_i (\tilde{y}_i - \mu)^2 = \frac{1}{N} \sum_i n_i \tilde{y}_i^2 - \mu^2.$$
146
Theorem 13.1.3:
Let the same conditions hold as in Theorem 13.1.2. Let $\bar{y} = \frac{1}{n} \sum_{i=1}^n y_i$ be the sample mean of a SRS of size $n$. Then it holds:
(i) $E(\bar{y}) = \mu$, i.e., the sample mean is unbiased for the population mean $\mu$.
(ii) $Var(\bar{y}) = \frac{1}{n} \frac{N-n}{N-1} \sigma^2 = \frac{1}{n}(1-f) \frac{N}{N-1} \sigma^2$, where $f = \frac{n}{N}$.
Proof:
(i)
$$E(\bar{y}) = \frac{1}{n} \sum_{i=1}^n E(y_i) = \mu, \quad \text{since } E(y_i) = \mu \;\forall i.$$
(ii)
$$Var(\bar{y}) = \frac{1}{n^2} \left( \sum_{i=1}^n Var(y_i) + 2 \sum_{i<j} Cov(y_i, y_j) \right)$$
For $i \neq j$, it holds that
$$Cov(y_i, y_j) = E(y_i \cdot y_j) - E(y_i)E(y_j) = E(y_i \cdot y_j) - \mu^2$$
$$= \sum_{k,l} \tilde{y}_k \tilde{y}_l P(y_i = \tilde{y}_k, \, y_j = \tilde{y}_l) - \mu^2$$
$$\stackrel{Th.\,13.1.2}{=} \frac{1}{N(N-1)} \left( \sum_{k \neq l} \tilde{y}_k \tilde{y}_l n_k n_l + \sum_k \tilde{y}_k^2 n_k (n_k - 1) \right) - \mu^2$$
$$= \frac{1}{N(N-1)} \left( \sum_{k,l} \tilde{y}_k \tilde{y}_l n_k n_l - \sum_k \tilde{y}_k^2 n_k \right) - \mu^2$$
$$= \frac{1}{N(N-1)} \left( \left( \sum_k \tilde{y}_k n_k \right) \left( \sum_l \tilde{y}_l n_l \right) - \sum_k \tilde{y}_k^2 n_k \right) - \mu^2$$
$$\stackrel{Note\,(ii)}{=} \frac{1}{N(N-1)} \left( N^2 \mu^2 - N(\sigma^2 + \mu^2) \right) - \mu^2$$
$$= \frac{1}{N-1} \left( N\mu^2 - \sigma^2 - \mu^2 - \mu^2 (N-1) \right) = -\frac{1}{N-1} \sigma^2.$$
147
Lecture 39:
Fr 04/20/01

$$\Longrightarrow Var(\bar{y}) = \frac{1}{n^2} \left( \sum_{i=1}^n Var(y_i) + 2 \sum_{i<j} Cov(y_i, y_j) \right) = \frac{1}{n^2} \left( n\sigma^2 + n(n-1) \left( -\frac{1}{N-1} \sigma^2 \right) \right)$$
$$= \frac{1}{n} \left( 1 - \frac{n-1}{N-1} \right) \sigma^2 = \frac{1}{n} \, \frac{N-n}{N-1} \, \sigma^2 = \frac{1}{n} \left( 1 - \frac{n}{N} \right) \frac{N}{N-1} \sigma^2 = \frac{1}{n} (1-f) \frac{N}{N-1} \sigma^2$$
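Theorem 13.1.3 (ii) can be checked by brute force: for a small population, enumerate all equally likely SRS's and compare the exact variance of $\bar{y}$ with the formula. A sketch in Python (the toy population is my own choice, not from the notes):

```python
# Exhaustive check of Var(ybar) = (1/n)((N-n)/(N-1)) sigma^2 for an SRS.
from itertools import combinations

pop = [1, 2, 3, 4, 5]
N, n = len(pop), 2
mu = sum(pop) / N
sigma2 = sum((y - mu) ** 2 for y in pop) / N        # population variance

# all C(N,n) samples are equally likely under simple random sampling
means = [sum(s) / n for s in combinations(pop, n)]
var_enum = sum((m - mu) ** 2 for m in means) / len(means)
var_formula = (1 / n) * (N - n) / (N - 1) * sigma2
print(var_enum, var_formula)
```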
Theorem 13.1.4:
Let $\bar{y}_n$ be the sample mean of a SRS of size $n$. Then it holds that
$$\sqrt{\frac{n}{1-f}} \; \frac{\bar{y}_n - \mu}{\sqrt{\frac{N}{N-1}} \, \sigma} \stackrel{d}{\longrightarrow} N(0,1),$$
where $N \to \infty$ and $f = \frac{n}{N}$ is a constant.
In particular, when the $y_i$'s are 0-1-distributed with $E(y_i) = P(y_i = 1) = p \;\forall i$, then it holds that
$$\sqrt{\frac{n}{1-f}} \; \frac{\bar{y}_n - p}{\sqrt{\frac{N}{N-1} \, p(1-p)}} \stackrel{d}{\longrightarrow} N(0,1),$$
where $N \to \infty$ and $f = \frac{n}{N}$ is a constant.
148
13.2 Stratified Random Samples

Definition 13.2.1:
Let $\Omega$ be a population of size $N$ that is split into $m$ disjoint sets $\Omega_j$, called strata, of size $N_j$, $j = 1, \ldots, m$, where $N = \sum_{j=1}^m N_j$. If we independently draw a random sample of size $n_j$ in each stratum, we speak of a stratified random sample.

Note:
(i) The random samples in each stratum are not always SRS's.
(ii) Stratified random samples are used in practice as a means to reduce the sample variance in the case that data within each stratum are homogeneous and data among different strata are heterogeneous.
(iii) Frequently used strata in practice are gender, state (or county), income range, ethnic background, etc.
Definition 13.2.2:
Let $Y: \Omega \to I\!R$ be a measurable function. In the case of a stratified random sample, we use the following notation:
Let $Y_{jk}$, $j = 1, \ldots, m$, $k = 1, \ldots, N_j$, be the elements in $\Omega_j$. Then we define
(i) $Y_j = \sum_{k=1}^{N_j} Y_{jk}$, the total in the $j$th stratum,
(ii) $\mu_j = \frac{1}{N_j} Y_j$, the mean in the $j$th stratum,
(iii) $\mu = \frac{1}{N} \sum_{j=1}^m N_j \mu_j$, the expectation (or grand mean),
(iv) $N\mu = \sum_{j=1}^m Y_j = \sum_{j=1}^m \sum_{k=1}^{N_j} Y_{jk}$, the total,
(v) $\sigma_j^2 = \frac{1}{N_j} \sum_{k=1}^{N_j} (Y_{jk} - \mu_j)^2$, the variance in the $j$th stratum, and
(vi) $\sigma^2 = \frac{1}{N} \sum_{j=1}^m \sum_{k=1}^{N_j} (Y_{jk} - \mu)^2$, the variance.
149
(vii) We denote an (ordered) sample in $\Omega_j$ of size $n_j$ as $(y_{j1}, \ldots, y_{jn_j})$ and $\bar{y}_j = \frac{1}{n_j} \sum_{k=1}^{n_j} y_{jk}$, the sample mean in the $j$th stratum.
Theorem 13.2.3:
Let the same conditions hold as in Definitions 13.2.1 and 13.2.2. Let $\hat{\mu}_j$ be an unbiased estimate of $\mu_j$ and $\widehat{Var}(\hat{\mu}_j)$ be an unbiased estimate of $Var(\hat{\mu}_j)$. Then it holds:
(i) $\hat{\mu} = \frac{1}{N} \sum_{j=1}^m N_j \hat{\mu}_j$ is unbiased for $\mu$, and
$$Var(\hat{\mu}) = \frac{1}{N^2} \sum_{j=1}^m N_j^2 \, Var(\hat{\mu}_j).$$
(ii) $\widehat{Var}(\hat{\mu}) = \frac{1}{N^2} \sum_{j=1}^m N_j^2 \, \widehat{Var}(\hat{\mu}_j)$ is unbiased for $Var(\hat{\mu})$.

Proof:
(i)
$$E(\hat{\mu}) = \frac{1}{N} \sum_{j=1}^m N_j E(\hat{\mu}_j) = \frac{1}{N} \sum_{j=1}^m N_j \mu_j = \mu.$$
By independence of the samples within each stratum,
$$Var(\hat{\mu}) = \frac{1}{N^2} \sum_{j=1}^m N_j^2 \, Var(\hat{\mu}_j).$$
(ii)
$$E(\widehat{Var}(\hat{\mu})) = \frac{1}{N^2} \sum_{j=1}^m N_j^2 \, E(\widehat{Var}(\hat{\mu}_j)) = \frac{1}{N^2} \sum_{j=1}^m N_j^2 \, Var(\hat{\mu}_j) = Var(\hat{\mu}).$$
Theorem 13.2.4:
Let the same conditions hold as in Theorem 13.2.3. If we draw a SRS in each stratum, then it holds:
(i) $\hat{\mu} = \frac{1}{N} \sum_{j=1}^m N_j \bar{y}_j$ is unbiased for $\mu$, where $\bar{y}_j = \frac{1}{n_j} \sum_{k=1}^{n_j} y_{jk}$, $j = 1, \ldots, m$, and
$$Var(\hat{\mu}) = \frac{1}{N^2} \sum_{j=1}^m N_j^2 \, \frac{1}{n_j} (1-f_j) \frac{N_j}{N_j - 1} \sigma_j^2, \quad \text{where } f_j = \frac{n_j}{N_j}.$$
150
(ii) $\widehat{Var}(\hat{\mu}) = \frac{1}{N^2} \sum_{j=1}^m N_j^2 \, \frac{1}{n_j} (1-f_j) \, s_j^2$ is unbiased for $Var(\hat{\mu})$, where
$$s_j^2 = \frac{1}{n_j - 1} \sum_{k=1}^{n_j} (y_{jk} - \bar{y}_j)^2.$$
Proof:
For a SRS in the $j$th stratum, it follows by Theorem 13.1.3:
$$E(\bar{y}_j) = \mu_j, \qquad Var(\bar{y}_j) = \frac{1}{n_j} (1-f_j) \frac{N_j}{N_j - 1} \sigma_j^2.$$
Also, we can show that
$$E(s_j^2) = \frac{N_j}{N_j - 1} \sigma_j^2.$$
Now the proof follows directly from Theorem 13.2.3.
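The variance formula of Theorem 13.2.4 (i) can also be checked by enumeration for a toy stratified population (my own example, not from the notes): list every possible stratified sample, compute $\hat{\mu}$ for each, and compare the exact variance with the formula.

```python
# Exhaustive check of Var(muhat) for stratified SRS:
# strata {1,2,3,4} with n_1 = 2 and {10,12} with n_2 = 1.
from itertools import combinations, product

strata = [[1, 2, 3, 4], [10, 12]]
nj = [2, 1]
Nj = [len(s) for s in strata]
N = sum(Nj)

# formula: (1/N^2) sum_j N_j^2 (1/n_j)(1-f_j) N_j/(N_j-1) sigma_j^2
var_formula = 0.0
for s, n_, N_ in zip(strata, nj, Nj):
    mu_j = sum(s) / N_
    sig2_j = sum((y - mu_j) ** 2 for y in s) / N_
    var_formula += N_ ** 2 * (1 / n_) * (1 - n_ / N_) * N_ / (N_ - 1) * sig2_j
var_formula /= N ** 2

# enumeration: all SRS's within each stratum, drawn independently
estimates = []
for pick in product(*[list(combinations(s, n_)) for s, n_ in zip(strata, nj)]):
    muhat = sum(N_ * (sum(sub) / len(sub)) for N_, sub in zip(Nj, pick)) / N
    estimates.append(muhat)
mu = sum(sum(s) for s in strata) / N
var_enum = sum((e - mu) ** 2 for e in estimates) / len(estimates)
print(var_enum, var_formula)
```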
Definition 13.2.5:
Let the same conditions hold as in Definitions 13.2.1 and 13.2.2. If the sample in each stratum is of size $n_j = n \frac{N_j}{N}$, $j = 1, \ldots, m$, where $n$ is the total sample size, then we speak of proportional selection.

Note:
(i) In the case of proportional selection, it holds that $f_j = \frac{n_j}{N_j} = \frac{n}{N} = f$, $j = 1, \ldots, m$.
(ii) Proportional strata cannot always be obtained for each combination of $m$, $n$, and $N$.

Theorem 13.2.6:
Let the same conditions hold as in Definition 13.2.5. If we draw a SRS in each stratum, then it holds in the case of proportional selection that
$$Var(\hat{\mu}) = \frac{1}{N^2} \, \frac{1-f}{f} \sum_{j=1}^m N_j \tilde{\sigma}_j^2,$$
where $\tilde{\sigma}_j^2 = \frac{N_j}{N_j - 1} \sigma_j^2$.
Proof:
The proof follows directly from Theorem 13.2.4 (i).
151
Theorem 13.2.7:
If we draw (1) a stratified random sample that consists of SRS's of sizes $n_j$ under proportional selection and (2) a SRS of size $n = \sum_{j=1}^m n_j$ from the same population, then it holds that
$$Var(\bar{y}) - Var(\hat{\mu}) = \frac{1}{n} \, \frac{N-n}{N(N-1)} \left( \sum_{j=1}^m N_j (\mu_j - \mu)^2 - \frac{1}{N} \sum_{j=1}^m (N - N_j) \tilde{\sigma}_j^2 \right).$$
Proof:
Homework.
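A numerical sanity check of Theorem 13.2.7 (not a proof) for a toy population of my own choosing: strata $\{1,2,3,4\}$ and $\{10,12\}$ with $n = 3$, so proportional selection gives $n_1 = 2$, $n_2 = 1$, $f = \frac{1}{2}$.

```python
# Compare both sides of Theorem 13.2.7 for a concrete population.
strata = [[1, 2, 3, 4], [10, 12]]
Nj = [len(s) for s in strata]
N, n = sum(Nj), 3
pop = [y for s in strata for y in s]
mu = sum(pop) / N
sigma2 = sum((y - mu) ** 2 for y in pop) / N

# left-hand side: Var(ybar) for a SRS of size n minus Var(muhat) stratified
var_srs = (1 / n) * (N - n) / (N - 1) * sigma2
var_strat = 0.0
for s, N_ in zip(strata, Nj):
    n_ = n * N_ // N                          # proportional allocation
    mu_j = sum(s) / N_
    sig2_j = sum((y - mu_j) ** 2 for y in s) / N_
    var_strat += N_ ** 2 * (1 / n_) * (1 - n_ / N_) * N_ / (N_ - 1) * sig2_j
var_strat /= N ** 2
lhs = var_srs - var_strat

# right-hand side of Theorem 13.2.7
between = sum(N_ * (sum(s) / N_ - mu) ** 2 for s, N_ in zip(strata, Nj))
within = sum((N - N_) * (N_ / (N_ - 1))
             * (sum((y - sum(s) / N_) ** 2 for y in s) / N_)
             for s, N_ in zip(strata, Nj))
rhs = (1 / n) * (N - n) / (N * (N - 1)) * (between - within / N)
print(lhs, rhs)
```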
152
Lecture 41:
We 04/25/01
14 Some Results from Sequential Statistical Inference
14.1 Fundamentals of Sequential Sampling
Example 14.1.1:
A particular machine produces a large number of items every day. Each item can be either "defective" or "non-defective". The unknown proportion of defective items in the production of a particular day is $p$.
Let $(X_1, \ldots, X_m)$ be a sample from the daily production, where $x_i = 1$ when the item is defective and $x_i = 0$ when the item is non-defective. Obviously, $S_m = \sum_{i=1}^m X_i \sim Bin(m, p)$ denotes the total number of defective items in the sample (assuming that $m$ is small compared to the daily production).
We might be interested in testing $H_0: p \leq p_0$ vs. $H_1: p > p_0$ at a given significance level $\alpha$ and use this decision to trash the entire daily production and have the machine fixed if indeed $p > p_0$. A suitable test could be
$$\varphi_1(x_1, \ldots, x_m) = \begin{cases} 1, & \text{if } s_m > c \\ 0, & \text{if } s_m \leq c \end{cases}$$
where $c$ is chosen such that $\varphi_1$ is a level-$\alpha$ test.
However, wouldn't it be more beneficial to sample the items sequentially (e.g., take items # 57, 623, 1005, 1286, 2663, etc.) and stop the machine as soon as it becomes obvious that it produces too many bad items? (Alternatively, we could also terminate the time-consuming and expensive process of determining whether an item is defective or non-defective early if it is impossible to surpass a certain proportion of defectives.) For example, if for some $j < m$ it already holds that $s_j > c$, then we could stop (and immediately call maintenance) and reject $H_0$ after only $j$ observations.
More formally, let us define $T = \min\{j \mid S_j > c\}$ and $T' = \min\{T, m\}$. We can now consider a decision rule that stops the sampling process at random time $T'$ and rejects $H_0$ if $T \leq m$. Thus, if we consider $R_0 = \{(x_1, \ldots, x_m) \mid t \leq m\}$ and $R_1 = \{(x_1, \ldots, x_m) \mid s_m > c\}$ as critical regions of two tests $\varphi_0$ and $\varphi_1$, then these two tests are equivalent.
153
Definition 14.1.2:
Let $\Theta$ be the parameter space and $\mathcal{A}$ the set of actions the statistician can take. We assume that the rv's $X_1, X_2, \ldots$ are observed sequentially and iid with common pdf (or pmf) $f_\theta(x)$. A sequential decision procedure is defined as follows:
(i) A stopping rule specifies whether an element of $\mathcal{A}$ should be chosen without taking any further observation. If at least one observation is taken, this rule specifies for every set of observed values $(x_1, x_2, \ldots, x_n)$, $n \geq 1$, whether to stop sampling and choose an action in $\mathcal{A}$ or to take another observation $x_{n+1}$.
(ii) A decision rule specifies the decision to be taken. If no observation has been taken, then we take action $d_0 \in \mathcal{A}$. If $n \geq 1$ observations have been taken, then we take action $d_n(x_1, \ldots, x_n) \in \mathcal{A}$, where $d_n(x_1, \ldots, x_n)$ specifies the action that has to be taken for the set $(x_1, \ldots, x_n)$ of observed values. Once an action has been taken, the sampling process is stopped.
Note:
In the remainder of this chapter, we assume that the statistician takes at least one observation.
Definition 14.1.3:
Let $R_n \subseteq I\!R^n$, $n = 1, 2, \ldots$, be a sequence of Borel-measurable sets such that the sampling process is stopped after observing $X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n$ if $(x_1, \ldots, x_n) \in R_n$. If $(x_1, \ldots, x_n) \notin R_n$, then another observation $x_{n+1}$ is taken. The sets $R_n$, $n = 1, 2, \ldots$, are called stopping regions.
Definition 14.1.4:
With every sequential stopping rule we associate a stopping random variable $N$ which takes on the values $1, 2, 3, \ldots$. Thus, $N$ is a rv that indicates the total number of observations taken before the sampling is stopped.
Note:
We use the (sloppy) notation $\{N = n\}$ to denote the event that sampling is stopped after observing exactly $n$ values $x_1, \ldots, x_n$ (i.e., sampling is not stopped before taking $n$ samples). Then the following equalities hold:
$$\{N = 1\} = R_1$$
154
$$\{N = n\} = \{(x_1, \ldots, x_n) \in I\!R^n \mid \text{sampling is stopped after } n \text{ observations but not before}\}$$
$$= (R_1 \cup R_2 \cup \ldots \cup R_{n-1})^c \cap R_n = R_1^c \cap R_2^c \cap \ldots \cap R_{n-1}^c \cap R_n$$
$$\{N \leq n\} = \bigcup_{k=1}^n \{N = k\}$$
Here we will only consider closed sequential sampling procedures, i.e., procedures where sampling eventually stops with probability 1, i.e.,
$$P(N < \infty) = 1, \qquad P(N = \infty) = 1 - P(N < \infty) = 0.$$
Theorem 14.1.5: Wald's Equation
Let $X_1, X_2, \ldots$ be iid rv's with $E(|X_1|) < \infty$. Let $N$ be a stopping variable. Let $S_N = \sum_{k=1}^N X_k$. If $E(N) < \infty$, then it holds that
$$E(S_N) = E(X_1) E(N).$$
Proof:
Define a sequence of rv's $Y_i$, $i = 1, 2, \ldots$, where
$$Y_i = \begin{cases} 1, & \text{if no decision is reached up to the } (i-1)\text{th stage, i.e., } N > i-1 \\ 0, & \text{otherwise} \end{cases}$$
Then each $Y_i$ is a function of $X_1, X_2, \ldots, X_{i-1}$ only, and $Y_i$ is independent of $X_i$.
Consider the rv $\sum_{n=1}^\infty X_n Y_n$. Obviously, it holds that
$$S_N = \sum_{n=1}^\infty X_n Y_n.$$
Thus, it follows that
$$E(S_N) = E\left( \sum_{n=1}^\infty X_n Y_n \right). \qquad (*)$$
It holds that
$$\sum_{n=1}^\infty E(|X_n Y_n|) = \sum_{n=1}^\infty E(|X_n|) E(|Y_n|) = E(|X_1|) \sum_{n=1}^\infty P(N \geq n)$$
155
$$= E(|X_1|) \sum_{n=1}^\infty \sum_{k=n}^\infty P(N = k) \stackrel{(A)}{=} E(|X_1|) \sum_{n=1}^\infty n \, P(N = n) = E(|X_1|) E(N) < \infty.$$
(A) holds due to the following rearrangement of indices:

n | k
1 | 1, 2, 3, ...
2 | 2, 3, ...
3 | 3, ...
Lecture 42:
Fr 04/27/01
We may therefore interchange the expectation and summation signs in $(*)$ and get
$$E(S_N) = E\left( \sum_{n=1}^\infty X_n Y_n \right) = \sum_{n=1}^\infty E(X_n Y_n) = \sum_{n=1}^\infty E(X_n) E(Y_n) = E(X_1) \sum_{n=1}^\infty P(N \geq n) = E(X_1) E(N),$$
which completes the proof.
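Wald's equation can be illustrated by simulation; the setup below is my own choice (not from the notes): $X_i \sim Exp(1)$ and the stopping rule "stop when the running sum first reaches 5", which is a valid stopping variable since it depends only on past observations.

```python
# Monte Carlo illustration of Wald's equation E(S_N) = E(X_1) E(N).
import random

random.seed(42)
SN_vals, N_vals = [], []
for _ in range(20000):
    s, k = 0.0, 0
    while s < 5.0:                  # stopping rule uses only observed values
        s += random.expovariate(1.0)
        k += 1
    SN_vals.append(s)
    N_vals.append(k)

mean_SN = sum(SN_vals) / len(SN_vals)
mean_N = sum(N_vals) / len(N_vals)
# E(X_1) = 1, so mean_SN should be close to 1 * mean_N
print(mean_SN, mean_N)
```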
156
14.2 Sequential Probability Ratio Tests
Definition 14.2.1:
Let $X_1, X_2, \ldots$ be a sequence of iid rv's with common pdf (or pmf) $f_\theta(x)$. We want to test a simple hypothesis $H_0: X \sim f_{\theta_0}$ vs. a simple alternative $H_1: X \sim f_{\theta_1}$ when the observations are taken sequentially.
Let $f_{0n}$ and $f_{1n}$ denote the joint pdf's (or pmf's) of $X_1, \ldots, X_n$ under $H_0$ and $H_1$, respectively, i.e.,
$$f_{0n}(x_1, \ldots, x_n) = \prod_{i=1}^n f_{\theta_0}(x_i) \quad \text{and} \quad f_{1n}(x_1, \ldots, x_n) = \prod_{i=1}^n f_{\theta_1}(x_i).$$
Finally, let
$$\lambda_n(x_1, \ldots, x_n) = \frac{f_{1n}(\underline{x})}{f_{0n}(\underline{x})},$$
where $\underline{x} = (x_1, \ldots, x_n)$. Then a sequential probability ratio test (SPRT) for testing $H_0$ vs. $H_1$ is the following decision rule:
(i) If at any stage of the sampling process it holds that $\lambda_n(\underline{x}) \geq A$, then stop and reject $H_0$.
(ii) If at any stage of the sampling process it holds that $\lambda_n(\underline{x}) \leq B$, then stop and accept $H_0$, i.e., reject $H_1$.
(iii) If $B < \lambda_n(\underline{x}) < A$, then continue sampling by taking another observation $x_{n+1}$.

Note:
(i) It is usually convenient to define
$$Z_i = \log \frac{f_{\theta_1}(X_i)}{f_{\theta_0}(X_i)},$$
where $Z_1, Z_2, \ldots$ are iid rv's. Then we work with
$$\log \lambda_n(\underline{x}) = \sum_{i=1}^n z_i = \sum_{i=1}^n \left( \log f_{\theta_1}(x_i) - \log f_{\theta_0}(x_i) \right)$$
instead of using $\lambda_n(\underline{x})$. Obviously, we now have to use constants $b = \log B$ and $a = \log A$ instead of the original constants $B$ and $A$.
157
(ii) $A$ and $B$ (where $A > B$) are constants such that the SPRT will have strength $(\alpha, \beta)$, where
$$\alpha = P(\text{Type I error}) = P(\text{Reject } H_0 \mid H_0)$$
and
$$\beta = P(\text{Type II error}) = P(\text{Accept } H_0 \mid H_1).$$
If $N$ is the stopping rv, then
$$\alpha = P_{\theta_0}(\lambda_N(\underline{X}) \geq A) \quad \text{and} \quad \beta = P_{\theta_1}(\lambda_N(\underline{X}) \leq B).$$
Example 14.2.2:
Let $X_1, X_2, \ldots$ be iid $N(\mu, \sigma^2)$, where $\mu$ is unknown and $\sigma^2 > 0$ is known. We want to test $H_0: \mu = \mu_0$ vs. $H_1: \mu = \mu_1$, where $\mu_0 < \mu_1$.
If our data are sampled sequentially, we can construct a SPRT as follows:
$$\log \lambda_n(\underline{x}) = \sum_{i=1}^n \left( -\frac{1}{2\sigma^2}(x_i - \mu_1)^2 - \left( -\frac{1}{2\sigma^2}(x_i - \mu_0)^2 \right) \right)$$
$$= \frac{1}{2\sigma^2} \sum_{i=1}^n \left( (x_i - \mu_0)^2 - (x_i - \mu_1)^2 \right)$$
$$= \frac{1}{2\sigma^2} \sum_{i=1}^n \left( x_i^2 - 2x_i\mu_0 + \mu_0^2 - x_i^2 + 2x_i\mu_1 - \mu_1^2 \right)$$
$$= \frac{1}{2\sigma^2} \sum_{i=1}^n \left( -2x_i\mu_0 + \mu_0^2 + 2x_i\mu_1 - \mu_1^2 \right)$$
$$= \frac{1}{2\sigma^2} \left( \sum_{i=1}^n 2x_i(\mu_1 - \mu_0) + n(\mu_0^2 - \mu_1^2) \right)$$
$$= \frac{\mu_1 - \mu_0}{\sigma^2} \left( \sum_{i=1}^n x_i - n \, \frac{\mu_0 + \mu_1}{2} \right)$$
We decide for $H_0$ if
$$\log \lambda_n(\underline{x}) \leq b \iff \frac{\mu_1 - \mu_0}{\sigma^2} \left( \sum_{i=1}^n x_i - n \, \frac{\mu_0 + \mu_1}{2} \right) \leq b \iff \sum_{i=1}^n x_i \leq n \, \frac{\mu_0 + \mu_1}{2} + b^*,$$
where $b^* = \frac{\sigma^2}{\mu_1 - \mu_0} \, b$.
158
We decide for $H_1$ if
$$\log \lambda_n(\underline{x}) \geq a \iff \frac{\mu_1 - \mu_0}{\sigma^2} \left( \sum_{i=1}^n x_i - n \, \frac{\mu_0 + \mu_1}{2} \right) \geq a \iff \sum_{i=1}^n x_i \geq n \, \frac{\mu_0 + \mu_1}{2} + a^*,$$
where $a^* = \frac{\sigma^2}{\mu_1 - \mu_0} \, a$.
[Figure (Example 14.2.2): plot of $\sum x_i$ against $n$ with the two parallel decision boundaries $n\frac{\mu_0+\mu_1}{2} + b^*$ and $n\frac{\mu_0+\mu_1}{2} + a^*$; below the lower line we accept $H_0$, above the upper line we accept $H_1$, and between them we continue sampling.]
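The SPRT of Example 14.2.2 is simple to implement; a sketch in Python with illustrative values of my own choosing ($\mu_0 = 0$, $\mu_1 = 1$, $\sigma^2 = 1$, $\alpha = \beta = 0.05$), using the Wald bounds of Theorem 14.2.4 (so $A_0 = 19$ and $B_0 = 1/19$):

```python
# Normal-mean SPRT using the derived form of log lambda_n.
import math
import random

mu0, mu1, sigma2 = 0.0, 1.0, 1.0
alpha, beta = 0.05, 0.05
a = math.log((1 - beta) / alpha)     # a = log A_0 = log 19
b = math.log(beta / (1 - alpha))     # b = log B_0 = log(1/19)

def sprt(xs):
    """Return ('H0' or 'H1', n used), or (None, n) if the data run out."""
    s = 0.0
    for n, x in enumerate(xs, start=1):
        s += x
        log_lambda = (mu1 - mu0) / sigma2 * (s - n * (mu0 + mu1) / 2)
        if log_lambda >= a:
            return "H1", n
        if log_lambda <= b:
            return "H0", n
    return None, len(xs)

random.seed(1)
data = [random.gauss(mu1, 1.0) for _ in range(200)]   # data generated under H1
decision, n_used = sprt(data)
print(decision, n_used)
```

Typically the SPRT decides after far fewer observations than a comparable fixed-sample test.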
Theorem 14.2.3:
For a SPRT with stopping bounds $A$ and $B$, $A > B$, and strength $(\alpha, \beta)$, we have
$$A \leq \frac{1-\beta}{\alpha} \quad \text{and} \quad B \geq \frac{\beta}{1-\alpha},$$
where $0 < \alpha < 1$ and $0 < \beta < 1$.

Theorem 14.2.4:
Assume we select, for given $\alpha, \beta \in (0,1)$ with $\alpha + \beta \leq 1$, the stopping bounds
$$A_0 = \frac{1-\beta}{\alpha} \quad \text{and} \quad B_0 = \frac{\beta}{1-\alpha}.$$
Then it holds that the SPRT with stopping bounds $A_0$ and $B_0$ has strength $(\alpha', \beta')$, where
$$\alpha' \leq \frac{\alpha}{1-\beta}, \quad \beta' \leq \frac{\beta}{1-\alpha}, \quad \text{and} \quad \alpha' + \beta' \leq \alpha + \beta.$$
159
Note:
(i) The approximation $A_0 = \frac{1-\beta}{\alpha}$ and $B_0 = \frac{\beta}{1-\alpha}$ in Theorem 14.2.4 is called the Wald approximation for the optimal stopping bounds of a SPRT.
(ii) $A_0$ and $B_0$ are functions of $\alpha$ and $\beta$ only and do not depend on the pdf's (or pmf's) $f_{\theta_0}$ and $f_{\theta_1}$. Therefore, they can be computed once and for all $f_{\theta_i}$'s, $i = 0, 1$.
THE END !!!
160
Index
α-similar, 88
0-1 Loss, 107
A Posteriori Distribution, 66
A Priori Distribution, 66
Action, 63
Alternative Hypothesis, 71
Asymptotically (Most) Efficient, 53
Bayes Estimate, 67
Bayes Risk, 66
Bayes Rule, 67
Bayesian Confidence Interval, 126
Bayesian Confidence Set, 126
Bias, 38
Borel-Cantelli Lemma, 3
Cauchy Criterion, 5
Centering Constants, 1
Central Limit Theorem, Lindeberg, 15
Central Limit Theorem, Lindeberg-Lévy, 12
Chapman, Robbins, Kiefer Inequality, 51
CI, 113
Closed, 155
Complete, 34, 129
Complete in Relation to P, 129
Composite, 71
Confidence Bound, Lower, 113
Confidence Bound, Upper, 113
Confidence Coefficient, 112
Confidence Interval, 113
Confidence Level, 112
Confidence Sets, 112
Consistent in the rth Mean, 27
Consistent, Mean-Squared-Error, 39
Consistent, Strongly, 27
Consistent, Weakly, 27
Continuity Theorem, 11
Contradiction, Proof by, 41
Convergence-Equivalent, 5
Cramér-Rao Lower Bound, 47, 48
Credible Set, 126
Critical Region, 72
CRK Inequality, 51
CRLB, 47, 48
Decision Function, 63
Decision Rule, 154
Degree, 130
Distribution, A Posteriori, 66
Distribution, Population, 18
Distribution-Free, 129
Domain of Attraction, 14
Efficiency, 53
Efficient, Asymptotically (Most), 53
Efficient, More, 53
Efficient, Most, 53
Empirical CDF, 18
Empirical Cumulative Distribution Function, 18
Equivalence Lemma, 5
Error, Type I, 72
Error, Type II, 72
Estimable, 130
Estimable Function, 38
Estimate, Bayes, 67
Estimate, Maximum Likelihood, 57
Estimate, Method of Moments, 55
Estimate, Minimax, 64
Estimate, Point, 26
Estimator, 26
Estimator, Mann-Whitney, 134
Estimator, Wilcoxon 2-Sample, 134
Exponential Family, One-Parameter, 35
F-Test, 105
Factorization Criterion, 32
Family of CDF's, 26
Family of Confidence Sets, 112
Family of PDF's, 26
Family of PMF's, 26
Family of Random Sets, 112
Formal Invariance, 91
Generalized U-Statistic, 134
Glivenko-Cantelli Theorem, 19
Hypothesis, Alternative, 71
Hypothesis, Null, 71
Independence of X̄ and S², 23
Induced Function, 28
Inequality, Kolmogorov's, 4
Interval, Random, 112
Invariance, Measurement, 91
Invariant, 28, 90
Invariant Test, 91
Invariant, Location, 29
Invariant, Maximal, 92
Invariant, Permutation, 29
Invariant, Scale, 29
K-S Statistic, 135
K-S Test, 139
Kernel, 130
Kernel, Symmetric, 130
Kolmogorov's Inequality, 4
161
Kolmogorov's SLLN, 7
Kolmogorov-Smirnov Statistic, 135
Kolmogorov-Smirnov Test, 139
Kronecker's Lemma, 4
Landau Symbols O and o, 13
Lehmann-Scheffé, 45
Level of Significance, 73
Level-α-Test, 73
Likelihood Function, 57
Likelihood Ratio Test, 95
Likelihood Ratio Test Statistic, 95
Lindeberg Central Limit Theorem, 15
Lindeberg Condition, 15
Lindeberg-Lévy Central Limit Theorem, 12
LMVUE, 40
Locally Minimum Variance Unbiased Estimate, 40
Location Invariant, 29
Logic, 41
Loss Function, 63
Lower Confidence Bound, 113
LRT, 95
Mann-Whitney Estimator, 134
Maximal Invariant, 92
Maximum Likelihood Estimate, 57
Mean Square Error, 39
Mean-Squared-Error Consistent, 39
Measurement Invariance, 91
Method of Moments Estimate, 55
Minimax Estimate, 64
Minimax Principle, 64
MLE, 57
MLR, 81
MOM, 55
Monotone Likelihood Ratio, 81
More Efficient, 53
Most Efficient, 53
Most Powerful Test, 73
MP, 73
MSE-Consistent, 39
Neyman-Pearson Lemma, 76
Nonparametric, 129
Nonrandomized Test, 73
Normal Variance Tests, 100
Norming Constants, 1
NP Lemma, 76
Null Hypothesis, 71
One Sample t-Test, 103
One-Tailed t-Test, 103
Paired t-Test, 104
Parameter Space, 26
Parametric Hypothesis, 71
Permutation Invariant, 29
Pivot, 116
Point Estimate, 26
Point Estimation, 26
Population Distribution, 18
Posterior Distribution, 66
Power, 73
Power Function, 73
Prior Distribution, 66
Probability Integral Transformation, 136
Probability Ratio Test, Sequential, 157
Problem of Fit, 135
Proof by Contradiction, 41
Proportional Selection, 151
Random Interval, 112
Random Sample, 18
Random Sets, 112
Random Variable, Stopping, 154
Randomized Test, 73
Rao-Blackwell, 44
Rao-Blackwellization, 45
Realization, 18
Regularity Conditions, 48
Risk Function, 63
Risk, Bayes, 66
Sample, 18
Sample Central Moment of Order k, 19
Sample Mean, 18
Sample Moment of Order k, 19
Sample Statistic, 18
Sample Variance, 18
Scale Invariant, 29
Selection, Proportional, 151
Sequential Decision Procedure, 154
Sequential Probability Ratio Test, 157
Significance Level, 73
Similar, 88
Similar, α, 88
Simple, 71, 146
Simple Random Sample, 146
Size, 73
SPRT, 157
SRS, 146
Stable, 14
Statistic, 18
Statistic, Kolmogorov-Smirnov, 135
Statistic, Likelihood Ratio Test, 95
Stopping Random Variable, 154
Stopping Regions, 154
Stopping Rule, 154
Strata, 149
Stratified Random Sample, 149
162
Strong Law of Large Numbers, Kolmogorov's, 7
Strongly Consistent, 27
Sufficient, 30, 129
Symmetric Kernel, 130
t-Test, 103
Tail-Equivalent, 5
Taylor Series, 13
Test Function, 73
Test, Invariant, 91
Test, Kolmogorov-Smirnov, 139
Test, Likelihood Ratio, 95
Test, Most Powerful, 73
Test, Nonrandomized, 73
Test, Randomized, 73
Test, Uniformly Most Powerful, 73
Tolerance Coefficient, 142
Tolerance Interval, 142
Two-Sample t-Test, 103
Two-Tailed t-Test, 103
Type I Error, 72
Type II Error, 72
U-Statistic, 131
U-Statistic, Generalized, 134
UMA, 113
UMAU, 123
UMP, 73
UMP α-similar, 89
UMP Invariant, 93
UMP Unbiased, 85
UMPU, 85
UMVUE, 40
Unbiased, 38, 85, 123
Uniformly Minimum Variance Unbiased Estimate, 40
Uniformly Most Accurate, 113
Uniformly Most Accurate Unbiased, 123
Uniformly Most Powerful Test, 73
Unimodal, 117
Upper Confidence Bound, 113
Wald's Equation, 155
Wald Approximation, 160
Weakly Consistent, 27
Wilcoxon 2-Sample Estimator, 134
163