TRANSCRIPT
STAT 6720
Mathematical Statistics II
Spring Semester 2001
Dr. Jürgen Symanzik
Utah State University
Department of Mathematics and Statistics
3900 Old Main Hill
Logan, UT 84322-3900
Tel.: (435) 797-0696
FAX: (435) 797-1822
e-mail: [email protected]
Contents
Acknowledgements 1
6 Limit Theorems (Continued from Stat 6710) 1
6.3 Strong Laws of Large Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
6.4 Central Limit Theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
7 Sample Moments 18
7.1 Random Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
7.2 Sample Moments and the Normal Distribution . . . . . . . . . . . . . . . . . . 21
8 The Theory of Point Estimation 26
8.1 The Problem of Point Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 26
8.2 Properties of Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
8.3 Sufficient Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
8.4 Unbiased Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
8.5 Lower Bounds for the Variance of an Estimate . . . . . . . . . . . . . . . . . . 47
8.6 The Method of Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
8.7 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 57
8.8 Decision Theory - Bayes and Minimax Estimation . . . . . . . . . . . . . . . . 63
9 Hypothesis Testing 71
9.1 Fundamental Notions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
9.2 The Neyman-Pearson Lemma . . . . . . . . . . . . . . . . . . . . . . . . . . 76
9.3 Monotone Likelihood Ratios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
9.4 Unbiased and Invariant Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
10 More on Hypothesis Testing 95
10.1 Likelihood Ratio Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
10.2 Parametric Chi-Squared Tests . . . . . . . . . . . . . . . . . . . . . . . . . 100
10.3 t-Tests and F-Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
10.4 Bayes and Minimax Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
11 Confidence Estimation 112
11.1 Fundamental Notions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
11.2 Shortest-Length Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . 116
11.3 Confidence Intervals and Hypothesis Tests . . . . . . . . . . . . . . . . . . . 121
11.4 Bayes Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
12 Nonparametric Inference 129
12.1 Nonparametric Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
12.2 Single-Sample Hypothesis Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
12.3 More on Order Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
13 Some Results from Sampling 146
13.1 Simple Random Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
13.2 Stratified Random Samples . . . . . . . . . . . . . . . . . . . . . . . . . . 149
14 Some Results from Sequential Statistical Inference 153
14.1 Fundamentals of Sequential Sampling . . . . . . . . . . . . . . . . . . . . . . . 153
14.2 Sequential Probability Ratio Tests . . . . . . . . . . . . . . . . . . . . . . . . . 157
Index 161
Acknowledgements
I would like to thank my students, Hanadi B. Eltahir, Rich Madsen, and Bill Morphet, who
helped during the Fall 1999 and Spring 2000 semesters in typesetting these lecture notes
using LaTeX, and for their suggestions on how to improve some of the material presented in class.
Thanks are also due to my Fall 2000 and Spring 2001 semester students, Weiping Deng, David
B. Neal, Emily Simmons, Shea Watrin, Aonan Zhang, and Qian Zhao, for their comments
that helped to improve these lecture notes.
In addition, I particularly would like to thank Mike Minnotte and Dan Coster, who previously
taught this course at Utah State University, for providing me with their lecture notes and other
materials related to this course. Their lecture notes, combined with additional material from
Rohatgi (1976) and other sources listed below, form the basis of the script presented here.
The textbook required for this class is:
- Rohatgi, V. K. (1976): An Introduction to Probability Theory and Mathematical Statistics, John Wiley and Sons, New York, NY.
A Web page dedicated to this class is accessible at:
http://www.math.usu.edu/~symanzik/teaching/2001_stat6720/stat6720.html
This course closely follows Rohatgi (1976) as described in the syllabus. Additional material
originates from the lectures by Professors Hering, Trenkler, Gather, and Kreienbrock that I
attended while studying at the Universität Dortmund, Germany, from the collection of Masters and
PhD Preliminary Exam questions from Iowa State University, Ames, Iowa, and from the following
textbooks:
- Bandelow, C. (1981): Einführung in die Wahrscheinlichkeitstheorie, Bibliographisches Institut, Mannheim, Germany.
- Büning, H., and Trenkler, G. (1978): Nichtparametrische statistische Methoden, Walter de Gruyter, Berlin, Germany.
- Casella, G., and Berger, R. L. (1990): Statistical Inference, Wadsworth & Brooks/Cole, Pacific Grove, CA.
- Fisz, M. (1989): Wahrscheinlichkeitsrechnung und mathematische Statistik, VEB Deutscher Verlag der Wissenschaften, Berlin, German Democratic Republic.
- Gibbons, J. D., and Chakraborti, S. (1992): Nonparametric Statistical Inference (Third Edition, Revised and Expanded), Dekker, New York, NY.
- Johnson, N. L., Kotz, S., and Balakrishnan, N. (1994): Continuous Univariate Distributions, Volume 1 (Second Edition), Wiley, New York, NY.
- Johnson, N. L., Kotz, S., and Balakrishnan, N. (1995): Continuous Univariate Distributions, Volume 2 (Second Edition), Wiley, New York, NY.
- Kelly, D. G. (1994): Introduction to Probability, Macmillan, New York, NY.
- Lehmann, E. L. (1983): Theory of Point Estimation (1991 Reprint), Wadsworth & Brooks/Cole, Pacific Grove, CA.
- Lehmann, E. L. (1986): Testing Statistical Hypotheses (Second Edition - 1994 Reprint), Chapman & Hall, New York, NY.
- Mood, A. M., Graybill, F. A., and Boes, D. C. (1974): Introduction to the Theory of Statistics (Third Edition), McGraw-Hill, Singapore.
- Parzen, E. (1960): Modern Probability Theory and Its Applications, Wiley, New York, NY.
- Searle, S. R. (1971): Linear Models, Wiley, New York, NY.
- Tamhane, A. C., and Dunlop, D. D. (2000): Statistics and Data Analysis - From Elementary to Intermediate, Prentice Hall, Upper Saddle River, NJ.

Additional definitions, integrals, sums, etc. originate from the following formula collections:

- Bronstein, I. N., and Semendjajew, K. A. (1985): Taschenbuch der Mathematik (22. Auflage), Verlag Harri Deutsch, Thun, Switzerland.
- Bronstein, I. N., and Semendjajew, K. A. (1986): Ergänzende Kapitel zu Taschenbuch der Mathematik (4. Auflage), Verlag Harri Deutsch, Thun, Switzerland.
- Sieber, H. (1980): Mathematische Formeln - Erweiterte Ausgabe E, Ernst Klett, Stuttgart, Germany.
J�urgen Symanzik, May 7, 2001
Lecture 01:
Mo 01/08/01

6 Limit Theorems (Continued from Stat 6710)
6.3 Strong Laws of Large Numbers
Definition 6.3.1:
Let $\{X_i\}_{i=1}^{\infty}$ be a sequence of rv's. Let $T_n = \sum_{i=1}^{n} X_i$. We say that $\{X_i\}$ obeys the SLLN with respect to a sequence of norming constants $\{B_i\}_{i=1}^{\infty}$, $B_i > 0$, $B_i \uparrow \infty$, if there exists a sequence of centering constants $\{A_i\}_{i=1}^{\infty}$ such that
$$B_n^{-1}(T_n - A_n) \stackrel{a.s.}{\longrightarrow} 0.$$

Note:
Unless otherwise specified, we will only use the case $B_n = n$ in this section.
Theorem 6.3.2:
$X_n \stackrel{a.s.}{\longrightarrow} X \iff \lim\limits_{n\to\infty} P(\sup\limits_{m \geq n} |X_m - X| > \varepsilon) = 0 \;\; \forall \varepsilon > 0$.

Proof: (see also Rohatgi, page 249, Theorem 11)
WLOG, we can assume that $X = 0$ since $X_n \stackrel{a.s.}{\longrightarrow} X$ implies $X_n - X \stackrel{a.s.}{\longrightarrow} 0$. Thus, we have to prove:
$$X_n \stackrel{a.s.}{\longrightarrow} 0 \iff \lim_{n\to\infty} P(\sup_{m \geq n} |X_m| > \varepsilon) = 0 \;\; \forall \varepsilon > 0$$
Choose $\varepsilon > 0$ and define
$$A_n(\varepsilon) = \{\sup_{m \geq n} |X_m| > \varepsilon\}, \qquad C = \{\lim_{n\to\infty} X_n = 0\}.$$
"$\Longrightarrow$":
Since $X_n \stackrel{a.s.}{\longrightarrow} 0$, we know that $P(C) = 1$ and therefore $P(C^c) = 0$.
Let $B_n(\varepsilon) = C \cap A_n(\varepsilon)$. Note that $B_{n+1}(\varepsilon) \subseteq B_n(\varepsilon)$ and for the limit set $\bigcap_{n=1}^{\infty} B_n(\varepsilon) = \emptyset$. It follows that
$$\lim_{n\to\infty} P(B_n(\varepsilon)) = P\Big(\bigcap_{n=1}^{\infty} B_n(\varepsilon)\Big) = 0.$$
We also have
$$P(B_n(\varepsilon)) = P(A_n \cap C) = 1 - P(C^c \cup A_n^c) = 1 - \underbrace{P(C^c)}_{=0} - P(A_n^c) + \underbrace{P(C^c \cap A_n^c)}_{=0} = P(A_n)$$
$$\Longrightarrow \lim_{n\to\infty} P(A_n(\varepsilon)) = 0$$
"$\Longleftarrow$":
Assume that $\lim\limits_{n\to\infty} P(A_n(\varepsilon)) = 0 \;\forall \varepsilon > 0$ and define $D(\varepsilon) = \{\limsup\limits_{n\to\infty} |X_n| > \varepsilon\}$.
Since $D(\varepsilon) \subseteq A_n(\varepsilon) \;\forall n \in I\!N$, it follows that $P(D(\varepsilon)) = 0 \;\forall \varepsilon > 0$. Also,
$$C^c = \{\lim_{n\to\infty} X_n \neq 0\} \subseteq \bigcup_{k=1}^{\infty} \Big\{\limsup_{n\to\infty} |X_n| > \frac{1}{k}\Big\}.$$
$$\Longrightarrow 1 - P(C) \leq \sum_{k=1}^{\infty} P\Big(D\Big(\frac{1}{k}\Big)\Big) = 0 \Longrightarrow X_n \stackrel{a.s.}{\longrightarrow} 0$$
Note:
(i) $X_n \stackrel{a.s.}{\longrightarrow} 0$ implies that $\forall \varepsilon > 0 \;\forall \delta > 0 \;\exists n_0 \in I\!N: P(\sup\limits_{n \geq n_0} |X_n| > \varepsilon) < \delta$.
(ii) Recall that for a given sequence of events $\{A_n\}_{n=1}^{\infty}$,
$$A = \limsup_{n\to\infty} A_n = \lim_{n\to\infty} \bigcup_{k=n}^{\infty} A_k = \bigcap_{n=1}^{\infty} \bigcup_{k=n}^{\infty} A_k$$
is the event that infinitely many of the $A_n$ occur. We write $P(A) = P(A_n \; i.o.)$, where $i.o.$ stands for "infinitely often".
(iii) Using the terminology defined in (ii) above, we can rewrite Theorem 6.3.2 as
$$X_n \stackrel{a.s.}{\longrightarrow} 0 \iff P(|X_n| > \varepsilon \; i.o.) = 0 \;\; \forall \varepsilon > 0.$$
Lecture 02:
We 01/10/01

Theorem 6.3.3: Borel-Cantelli Lemma
(i) 1st BC-Lemma:
Let $\{A_n\}_{n=1}^{\infty}$ be a sequence of events such that $\sum_{n=1}^{\infty} P(A_n) < \infty$. Then $P(A) = 0$.
(ii) 2nd BC-Lemma:
Let $\{A_n\}_{n=1}^{\infty}$ be a sequence of independent events such that $\sum_{n=1}^{\infty} P(A_n) = \infty$. Then $P(A) = 1$.
Proof:
(i):
$$P(A) = P\Big(\lim_{n\to\infty} \bigcup_{k=n}^{\infty} A_k\Big) = \lim_{n\to\infty} P\Big(\bigcup_{k=n}^{\infty} A_k\Big) \leq \lim_{n\to\infty} \sum_{k=n}^{\infty} P(A_k) = \lim_{n\to\infty} \Bigg(\sum_{k=1}^{\infty} P(A_k) - \sum_{k=1}^{n-1} P(A_k)\Bigg) \stackrel{(*)}{=} 0$$
$(*)$ holds since $\sum_{n=1}^{\infty} P(A_n) < \infty$.

(ii): We have $A^c = \bigcup_{n=1}^{\infty} \bigcap_{k=n}^{\infty} A_k^c$. Therefore,
$$P(A^c) = P\Big(\lim_{n\to\infty} \bigcap_{k=n}^{\infty} A_k^c\Big) = \lim_{n\to\infty} P\Big(\bigcap_{k=n}^{\infty} A_k^c\Big).$$
If we choose $n_0 > n$, it holds that
$$\bigcap_{k=n}^{\infty} A_k^c \subseteq \bigcap_{k=n}^{n_0} A_k^c.$$
Therefore,
$$P\Big(\bigcap_{k=n}^{\infty} A_k^c\Big) \leq \lim_{n_0\to\infty} P\Big(\bigcap_{k=n}^{n_0} A_k^c\Big) \stackrel{indep.}{=} \lim_{n_0\to\infty} \prod_{k=n}^{n_0} (1 - P(A_k)) \stackrel{(+)}{\leq} \lim_{n_0\to\infty} \exp\Bigg(-\sum_{k=n}^{n_0} P(A_k)\Bigg) = 0$$
$$\Longrightarrow P(A) = 1$$
$(+)$ holds since
$$1 - \exp\Bigg(-\sum_{j=n}^{n_0} \alpha_j\Bigg) \leq 1 - \prod_{j=n}^{n_0} (1 - \alpha_j) \leq \sum_{j=n}^{n_0} \alpha_j \quad \text{for } n_0 > n \text{ and } 0 \leq \alpha_j \leq 1.$$
Example 6.3.4:
Independence is necessary for the 2nd BC-Lemma:
Let $\Omega = (0, 1)$ and $P$ a uniform distribution on $\Omega$.
Let $A_n = (0, \frac{1}{n})$. Therefore,
$$\sum_{n=1}^{\infty} P(A_n) = \sum_{n=1}^{\infty} \frac{1}{n} = \infty.$$
But for any $\omega \in \Omega$, $A_n$ occurs only for $n = 1, 2, \ldots, \lfloor \frac{1}{\omega} \rfloor$, where $\lfloor \frac{1}{\omega} \rfloor$ denotes the largest integer ("floor") that is $\leq \frac{1}{\omega}$. Therefore, $P(A) = P(A_n \; i.o.) = 0$.
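This counterexample can be checked numerically. The following Python sketch (an illustration, not part of the original notes; the single seeded draw and the cutoff of 10,000 events are arbitrary choices) verifies that for one fixed $\omega$ the events $A_n = (0, \frac{1}{n})$ occur only up to $n = \lfloor \frac{1}{\omega} \rfloor$, even though the partial sums of $P(A_n)$ grow without bound:

```python
import random

def occurrences(omega, n_max):
    """Indices n <= n_max for which the event A_n = (0, 1/n) occurs for this omega."""
    return [n for n in range(1, n_max + 1) if 0 < omega < 1 / n]

random.seed(0)
omega = random.random()            # one draw from the uniform distribution on (0, 1)
occ = occurrences(omega, 10_000)

# A_n occurs exactly for n = 1, ..., floor(1/omega): only finitely many events occur.
assert occ == list(range(1, int(1 / omega) + 1))

# Yet the probabilities are not summable: the harmonic partial sums keep growing.
partial = sum(1 / n for n in range(1, 10_001))
print(len(occ), round(partial, 2))
```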
Lemma 6.3.5: Kolmogorov's Inequality
Let $\{X_i\}_{i=1}^{\infty}$ be a sequence of independent rv's with common mean 0 and variances $\sigma_i^2$. Let $T_n = \sum_{i=1}^{n} X_i$. Then it holds:
$$P\Big(\max_{1 \leq k \leq n} |T_k| \geq \varepsilon\Big) \leq \frac{\sum_{i=1}^{n} \sigma_i^2}{\varepsilon^2} \quad \forall \varepsilon > 0$$
Proof:
See Rohatgi, page 268, Lemma 2.
Lemma 6.3.6: Kronecker's Lemma
For any real numbers $x_n$, if $\sum_{n=1}^{\infty} x_n$ converges to $s < \infty$ and $B_n \uparrow \infty$, then it holds:
$$\frac{1}{B_n} \sum_{k=1}^{n} B_k x_k \to 0 \text{ as } n \to \infty$$
Proof:
See Rohatgi, page 269, Lemma 3.
Theorem 6.3.7: Cauchy Criterion
$X_n \stackrel{a.s.}{\longrightarrow} X \iff \lim\limits_{n\to\infty} P(\sup\limits_{m} |X_{n+m} - X_n| < \varepsilon) = 1 \;\; \forall \varepsilon > 0$.
Proof:
See Rohatgi, page 270, Theorem 5.
Theorem 6.3.8:
If $\sum_{n=1}^{\infty} Var(X_n) < \infty$, then $\sum_{n=1}^{\infty} (X_n - E(X_n))$ converges almost surely.
Proof:
See Rohatgi, page 272, Theorem 6.
Corollary 6.3.9:
Let $\{X_i\}_{i=1}^{\infty}$ be a sequence of independent rv's. Let $\{B_i\}_{i=1}^{\infty}$, $B_i > 0$, $B_i \uparrow \infty$, be a sequence of norming constants. Let $T_n = \sum_{i=1}^{n} X_i$.
If $\sum_{i=1}^{\infty} \frac{Var(X_i)}{B_i^2} < \infty$, then it holds:
$$\frac{T_n - E(T_n)}{B_n} \stackrel{a.s.}{\longrightarrow} 0$$
Proof:
This Corollary follows directly from Theorem 6.3.8 and Lemma 6.3.6.
Lemma 6.3.10: Equivalence Lemma
Let $\{X_i\}_{i=1}^{\infty}$ and $\{X_i'\}_{i=1}^{\infty}$ be sequences of rv's. Let $T_n = \sum_{i=1}^{n} X_i$ and $T_n' = \sum_{i=1}^{n} X_i'$.
If the series $\sum_{i=1}^{\infty} P(X_i \neq X_i') < \infty$, then the sequences $\{X_i\}$ and $\{X_i'\}$ are tail-equivalent and $T_n$ and $T_n'$ are convergence-equivalent, i.e., for $B_n \uparrow \infty$ the sequences $\frac{1}{B_n} T_n$ and $\frac{1}{B_n} T_n'$ converge on the same event and to the same limit, except for a null set.
Proof:
See Rohatgi, page 266, Lemma 1.
Lemma 6.3.11:
Let $X$ be a rv with $E(|X|) < \infty$. Then it holds:
$$\sum_{n=1}^{\infty} P(|X| \geq n) \leq E(|X|) \leq 1 + \sum_{n=1}^{\infty} P(|X| \geq n)$$

Proof:
Continuous case only:
Let $X$ have a pdf $f$. Then it holds:
$$E(|X|) = \int_{-\infty}^{\infty} |x| f(x) \, dx = \sum_{k=0}^{\infty} \int_{k \leq |x| < k+1} |x| f(x) \, dx$$
$$\Longrightarrow \sum_{k=0}^{\infty} k \, P(k \leq |X| < k+1) \leq E(|X|) \leq \sum_{k=0}^{\infty} (k+1) \, P(k \leq |X| < k+1)$$
It is
$$\sum_{k=0}^{\infty} k \, P(k \leq |X| < k+1) = \sum_{k=0}^{\infty} \sum_{n=1}^{k} P(k \leq |X| < k+1) = \sum_{n=1}^{\infty} \sum_{k=n}^{\infty} P(k \leq |X| < k+1) = \sum_{n=1}^{\infty} P(|X| \geq n)$$
Similarly,
$$\sum_{k=0}^{\infty} (k+1) \, P(k \leq |X| < k+1) = \sum_{n=1}^{\infty} P(|X| \geq n) + \sum_{k=0}^{\infty} P(k \leq |X| < k+1) = \sum_{n=1}^{\infty} P(|X| \geq n) + 1$$
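As a quick sanity check of the two bounds, one can take $X \sim Exp(1)$, where both sides are available in closed form: $E(|X|) = 1$ and $P(|X| \geq n) = e^{-n}$, so the tail sum is a geometric series. The following Python sketch (an illustration, not part of the original notes) evaluates both sides:

```python
import math

# For X ~ Exp(1): E|X| = 1 and P(|X| >= n) = exp(-n), a geometric series in n.
tail_sum = sum(math.exp(-n) for n in range(1, 200))   # ~ 1 / (e - 1) ~ 0.582
E_abs_X = 1.0

# The two-sided bound of Lemma 6.3.11:
assert tail_sum <= E_abs_X <= 1 + tail_sum
print(round(tail_sum, 4))
```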
Lecture 03:
Fr 01/12/01

Theorem 6.3.12:
Let $\{X_i\}_{i=1}^{\infty}$ be a sequence of independent rv's. Then it holds:
$$X_n \stackrel{a.s.}{\longrightarrow} 0 \iff \sum_{n=1}^{\infty} P(|X_n| > \varepsilon) < \infty \;\; \forall \varepsilon > 0$$
Proof:
See Rohatgi, page 265, Theorem 3.
Theorem 6.3.13: Kolmogorov's SLLN
Let $\{X_i\}_{i=1}^{\infty}$ be a sequence of iid rv's. Let $T_n = \sum_{i=1}^{n} X_i$. Then it holds:
$$\frac{T_n}{n} = \overline{X}_n \stackrel{a.s.}{\longrightarrow} \mu < \infty \iff E(|X|) < \infty \text{ (and then } \mu = E(X))$$

Proof:
"$\Longrightarrow$":
Suppose that $\overline{X}_n \stackrel{a.s.}{\longrightarrow} \mu < \infty$. It is $T_n = \sum_{i=1}^{n} X_i = \sum_{i=1}^{n-1} X_i + X_n = T_{n-1} + X_n$.
$$\Longrightarrow \frac{X_n}{n} = \underbrace{\frac{T_n}{n}}_{\stackrel{a.s.}{\longrightarrow} \mu} - \underbrace{\frac{n-1}{n}}_{\to 1} \cdot \underbrace{\frac{T_{n-1}}{n-1}}_{\stackrel{a.s.}{\longrightarrow} \mu} \stackrel{a.s.}{\longrightarrow} 0$$
By Theorem 6.3.12, we have
$$\sum_{n=1}^{\infty} P\Big(\Big|\frac{X_n}{n}\Big| \geq 1\Big) < \infty, \text{ i.e., } \sum_{n=1}^{\infty} P(|X_n| \geq n) < \infty.$$
$\stackrel{\text{Lemma 6.3.11}}{\Longrightarrow} E(|X|) < \infty$
$\stackrel{\text{Th. 6.2.7 (WLLN)}}{\Longrightarrow} \overline{X}_n \stackrel{p}{\longrightarrow} E(X)$
Since $\overline{X}_n \stackrel{a.s.}{\longrightarrow} \mu$, it follows by Theorem 6.1.26 that $\overline{X}_n \stackrel{p}{\longrightarrow} \mu$. Therefore, it must hold that $\mu = E(X)$.
"$\Longleftarrow$": Let $E(|X|) < \infty$. Define truncated rv's:
$$X_k' = \begin{cases} X_k, & \text{if } |X_k| \leq k \\ 0, & \text{otherwise} \end{cases} \qquad T_n' = \sum_{k=1}^{n} X_k', \qquad \overline{X}_n' = \frac{T_n'}{n}$$
Then it holds:
$$\sum_{k=1}^{\infty} P(X_k \neq X_k') = \sum_{k=1}^{\infty} P(|X_k| > k) \leq \sum_{k=1}^{\infty} P(|X_k| \geq k) \stackrel{iid}{=} \sum_{k=1}^{\infty} P(|X| \geq k) \stackrel{\text{Lemma 6.3.11}}{\leq} E(|X|) < \infty$$
By Lemma 6.3.10, it follows that $T_n$ and $T_n'$ are convergence-equivalent. Thus, it is sufficient to prove that $\overline{X}_n' \stackrel{a.s.}{\longrightarrow} E(X)$.
We now establish the conditions needed in Corollary 6.3.9. It is
$$Var(X_n') \leq E((X_n')^2) = \int_{-n}^{n} x^2 f_X(x) \, dx = \sum_{k=0}^{n-1} \int_{k \leq |x| < k+1} x^2 f_X(x) \, dx \leq \sum_{k=0}^{n-1} (k+1)^2 \int_{k \leq |x| < k+1} f_X(x) \, dx$$
$$= \sum_{k=0}^{n-1} (k+1)^2 P(k \leq |X| < k+1) \leq \sum_{k=0}^{n} (k+1)^2 P(k \leq |X| < k+1)$$
$$\Longrightarrow \sum_{n=1}^{\infty} \frac{1}{n^2} Var(X_n') \leq \sum_{n=1}^{\infty} \sum_{k=0}^{n} \frac{(k+1)^2}{n^2} P(k \leq |X| < k+1)$$
$$= \sum_{n=1}^{\infty} \sum_{k=1}^{n} \frac{(k+1)^2}{n^2} P(k \leq |X| < k+1) + \sum_{n=1}^{\infty} \frac{1}{n^2} P(0 \leq |X| < 1)$$
$$\stackrel{(*)}{\leq} \sum_{k=1}^{\infty} (k+1)^2 P(k \leq |X| < k+1) \Bigg(\sum_{n=k}^{\infty} \frac{1}{n^2}\Bigg) + 2 P(0 \leq |X| < 1) \quad (A)$$
$(*)$ holds since $\sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6} \approx 1.645 < 2$ and the first two sums can be rearranged as follows:

    n | k                   k | n
    1 | 1                   1 | 1, 2, 3, ...
    2 | 1, 2       =>       2 | 2, 3, ...
    3 | 1, 2, 3             3 | 3, ...
    : | :                   : | :
It is
$$\sum_{n=k}^{\infty} \frac{1}{n^2} = \frac{1}{k^2} + \frac{1}{(k+1)^2} + \frac{1}{(k+2)^2} + \ldots \leq \frac{1}{k^2} + \frac{1}{k(k+1)} + \frac{1}{(k+1)(k+2)} + \ldots = \frac{1}{k^2} + \sum_{n=k+1}^{\infty} \frac{1}{n(n-1)}$$
From Bronstein, page 30, # 7, we know that
$$1 = \frac{1}{1 \cdot 2} + \frac{1}{2 \cdot 3} + \frac{1}{3 \cdot 4} + \ldots + \frac{1}{n(n+1)} + \ldots = \frac{1}{1 \cdot 2} + \frac{1}{2 \cdot 3} + \frac{1}{3 \cdot 4} + \ldots + \frac{1}{(k-1) k} + \sum_{n=k+1}^{\infty} \frac{1}{n(n-1)}$$
$$\Longrightarrow \sum_{n=k+1}^{\infty} \frac{1}{n(n-1)} = 1 - \frac{1}{1 \cdot 2} - \frac{1}{2 \cdot 3} - \frac{1}{3 \cdot 4} - \ldots - \frac{1}{(k-1) k} = \frac{1}{2} - \frac{1}{2 \cdot 3} - \frac{1}{3 \cdot 4} - \ldots - \frac{1}{(k-1) k}$$
$$= \frac{1}{3} - \frac{1}{3 \cdot 4} - \ldots - \frac{1}{(k-1) k} = \frac{1}{4} - \ldots - \frac{1}{(k-1) k} = \ldots = \frac{1}{k}$$
$$\Longrightarrow \sum_{n=k}^{\infty} \frac{1}{n^2} \leq \frac{1}{k^2} + \sum_{n=k+1}^{\infty} \frac{1}{n(n-1)} = \frac{1}{k^2} + \frac{1}{k} \leq \frac{2}{k}$$
Using this result in (A), we get
$$\sum_{n=1}^{\infty} \frac{1}{n^2} Var(X_n') \leq 2 \sum_{k=1}^{\infty} \frac{(k+1)^2}{k} P(k \leq |X| < k+1) + 2 P(0 \leq |X| < 1)$$
$$= 2 \sum_{k=0}^{\infty} k \, P(k \leq |X| < k+1) + 4 \sum_{k=1}^{\infty} P(k \leq |X| < k+1) + 2 \sum_{k=1}^{\infty} \frac{1}{k} P(k \leq |X| < k+1) + 2 P(0 \leq |X| < 1)$$
$$\stackrel{(B)}{\leq} 2 E(|X|) + 4 + 2 + 2 < \infty$$
To establish (B), we use a relation from the Proof of Lemma 6.3.11, i.e.,
$$\sum_{k=0}^{\infty} k \, P(k \leq |X| < k+1) \stackrel{\text{Proof}}{=} \sum_{n=1}^{\infty} P(|X| \geq n) \stackrel{\text{Lemma 6.3.11}}{\leq} E(|X|)$$
Thus, the conditions needed in Corollary 6.3.9 are met. With $B_n = n$, it follows that
$$\frac{1}{n} T_n' - \frac{1}{n} E(T_n') \stackrel{a.s.}{\longrightarrow} 0 \quad (C)$$
Since $E(X_n') \to E(X)$ as $n \to \infty$, it follows by Kronecker's Lemma (6.3.6) that $\frac{1}{n} E(T_n') \to E(X)$. Thus, when we replace $\frac{1}{n} E(T_n')$ by $E(X)$ in (C), we get
$$\frac{1}{n} T_n' \stackrel{a.s.}{\longrightarrow} E(X) \stackrel{\text{Lemma 6.3.10}}{\Longrightarrow} \frac{1}{n} T_n \stackrel{a.s.}{\longrightarrow} E(X)$$
since $T_n$ and $T_n'$ are convergence-equivalent.
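Kolmogorov's SLLN can be watched at work by simulation. The following Python sketch (an illustration, not part of the original notes; the Uniform(0,1) population, the seed, and the sample sizes are arbitrary choices) shows the gap $|\overline{X}_n - E(X)|$ shrinking as $n$ grows:

```python
import random

random.seed(42)

def mean_gap(n, mu=0.5):
    # |sample mean - E(X)| for n iid Uniform(0,1) draws; E(|X|) < infinity,
    # so Kolmogorov's SLLN applies with E(X) = 0.5.
    total = sum(random.random() for _ in range(n))
    return abs(total / n - mu)

gaps = [mean_gap(n) for n in (100, 10_000, 1_000_000)]
print([round(g, 4) for g in gaps])
assert gaps[-1] < 0.01     # the sample mean is close to E(X) for large n
```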
Lecture 04:
We 01/17/01

6.4 Central Limit Theorems
Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of rv's with cdf's $\{F_n\}_{n=1}^{\infty}$. Suppose that the mgf $M_n(t)$ of $X_n$ exists.
Questions: Does $M_n(t)$ converge? Does it converge to a mgf $M(t)$? If it does converge, does it hold that $X_n \stackrel{d}{\longrightarrow} X$ for some rv $X$?

Example 6.4.1:
Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of rv's such that $P(X_n = -n) = 1$. Then the mgf is $M_n(t) = E(e^{tX_n}) = e^{-tn}$. So
$$\lim_{n\to\infty} M_n(t) = \begin{cases} 0, & t > 0 \\ 1, & t = 0 \\ \infty, & t < 0 \end{cases}$$
So $M_n(t)$ does not converge to a mgf, and $F_n(x) \to F(x) = 1 \;\forall x$. But $F(x)$ is not a cdf.

Note:
Due to Example 6.4.1, the existence of mgf's $M_n(t)$ that converge to something is not enough to conclude convergence in distribution.
Conversely, suppose that $X_n$ has mgf $M_n(t)$, $X$ has mgf $M(t)$, and $X_n \stackrel{d}{\longrightarrow} X$. Does it hold that $M_n(t) \to M(t)$?
Not necessarily! See Rohatgi, page 277, Example 2, as a counterexample. Thus, convergence in distribution of rv's that all have mgf's does not imply the convergence of mgf's.
However, we can make the following statement in the next Theorem:

Theorem 6.4.2: Continuity Theorem
Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of rv's with cdf's $\{F_n\}_{n=1}^{\infty}$ and mgf's $\{M_n(t)\}_{n=1}^{\infty}$. Suppose that $M_n(t)$ exists for $|t| \leq t_0 \;\forall n$. If there exists a rv $X$ with cdf $F$ and mgf $M(t)$, which exists for $|t| \leq t_1 < t_0$, such that $\lim\limits_{n\to\infty} M_n(t) = M(t) \;\forall t \in [-t_1, t_1]$, then $F_n \stackrel{w}{\longrightarrow} F$, i.e., $X_n \stackrel{d}{\longrightarrow} X$.
Example 6.4.3:
Let $X_n \sim Bin(n, \frac{\lambda}{n})$. Recall (e.g., from Theorem 3.3.12 and related Theorems) that for $X \sim Bin(n, p)$ the mgf is $M_X(t) = (1 - p + p e^t)^n$. Thus,
$$M_n(t) = \Big(1 - \frac{\lambda}{n} + \frac{\lambda}{n} e^t\Big)^n = \Big(1 + \frac{\lambda(e^t - 1)}{n}\Big)^n \stackrel{(*)}{\longrightarrow} e^{\lambda(e^t - 1)} \text{ as } n \to \infty.$$
In $(*)$ we use the fact that $\lim\limits_{n\to\infty} (1 + \frac{x}{n})^n = e^x$. Recall that $e^{\lambda(e^t - 1)}$ is the mgf of a rv $X$ where $X \sim Poisson(\lambda)$. Thus, we have established the well-known result that the Binomial distribution approaches the Poisson distribution, given that $n \to \infty$ in such a way that $np = \lambda > 0$.
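The convergence in Example 6.4.3 can also be observed directly on the pmf's. The following Python sketch (an illustration, not part of the original notes; $\lambda = 3$ and the grid of $n$ values are arbitrary choices) compares the $Bin(n, \frac{\lambda}{n})$ and $Poisson(\lambda)$ pmf's:

```python
import math

def binom_pmf(n, p, j):
    return math.comb(n, j) * p**j * (1 - p)**(n - j)

def poisson_pmf(lam, j):
    return math.exp(-lam) * lam**j / math.factorial(j)

lam = 3.0
for n in (10, 100, 1000):
    # Largest pointwise difference over the first few support points.
    max_diff = max(abs(binom_pmf(n, lam / n, j) - poisson_pmf(lam, j)) for j in range(10))
    print(n, round(max_diff, 5))

# The pointwise difference shrinks as n grows, as Example 6.4.3 predicts.
assert max(abs(binom_pmf(1000, lam / 1000, j) - poisson_pmf(lam, j)) for j in range(10)) < 1e-3
```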
Note:
Recall Theorem 3.3.11: Suppose that $\{X_n\}_{n=1}^{\infty}$ is a sequence of rv's with characteristic functions $\{\phi_n(t)\}_{n=1}^{\infty}$. Suppose that
$$\lim_{n\to\infty} \phi_n(t) = \phi(t) \;\forall t \in (-h, h) \text{ for some } h > 0,$$
and $\phi(t)$ is the characteristic function of a rv $X$. Then $X_n \stackrel{d}{\longrightarrow} X$.
Theorem 6.4.4: Lindeberg-Lévy Central Limit Theorem
Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of iid rv's with $E(X_i) = \mu$ and $0 < Var(X_i) = \sigma^2 < \infty$. Then it holds for $\overline{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$ that
$$\frac{\sqrt{n}(\overline{X}_n - \mu)}{\sigma} \stackrel{d}{\longrightarrow} Z$$
where $Z \sim N(0, 1)$.
Proof:
Let $Z \sim N(0, 1)$. According to Theorem 3.3.12 (v), the characteristic function of $Z$ is $\phi_Z(t) = \exp(-\frac{1}{2} t^2)$.
Let $\phi(t)$ be the characteristic function of $X_i$. We now determine the characteristic function $\phi_n(t)$ of $\frac{\sqrt{n}(\overline{X}_n - \mu)}{\sigma}$:
$$\phi_n(t) = E\Bigg[\exp\Bigg(it \, \frac{\sqrt{n}(\frac{1}{n}\sum_{i=1}^{n} X_i - \mu)}{\sigma}\Bigg)\Bigg] = \int_{-\infty}^{\infty} \ldots \int_{-\infty}^{\infty} \exp\Bigg(it \, \frac{\sqrt{n}(\frac{1}{n}\sum_{i=1}^{n} x_i - \mu)}{\sigma}\Bigg) dF_X$$
$$= \exp\Big(-\frac{it\sqrt{n}\mu}{\sigma}\Big) \int_{-\infty}^{\infty} \exp\Big(\frac{itx_1}{\sqrt{n}\sigma}\Big) dF_{X_1} \ldots \int_{-\infty}^{\infty} \exp\Big(\frac{itx_n}{\sqrt{n}\sigma}\Big) dF_{X_n} = \Bigg[\phi\Big(\frac{t}{\sqrt{n}\sigma}\Big) \exp\Big(-\frac{it\mu}{\sqrt{n}\sigma}\Big)\Bigg]^n$$
Recall from Theorem 3.3.5 that if the $k$th moment exists, then $\phi^{(k)}(0) = i^k E(X^k)$. Also recall the definition of a Taylor series in MacLaurin's form:
$$f(x) = f(0) + \frac{f'(0)}{1!} x + \frac{f''(0)}{2!} x^2 + \frac{f'''(0)}{3!} x^3 + \ldots + \frac{f^{(n)}(0)}{n!} x^n + \ldots,$$
e.g.,
$$f(x) = e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \ldots$$
Thus, if we develop a Taylor series for $\phi(s)$ around $s = 0$ and evaluate it at $s = \frac{t}{\sqrt{n}\sigma}$, we get:
$$\phi\Big(\frac{t}{\sqrt{n}\sigma}\Big) = \phi(0) + \frac{t}{\sqrt{n}\sigma}\phi'(0) + \frac{1}{2}\Big(\frac{t}{\sqrt{n}\sigma}\Big)^2 \phi''(0) + \ldots = 1 + \frac{ti\mu}{\sqrt{n}\sigma} - \frac{1}{2} t^2 \, \frac{\mu^2 + \sigma^2}{n\sigma^2} + o\Bigg(\Big(\frac{t}{\sqrt{n}\sigma}\Big)^2\Bigg)$$
Here we make use of the Landau symbol "o". In general, if we write $u(x) = o(v(x))$ for $x \to L$, this implies $\lim\limits_{x\to L} \frac{u(x)}{v(x)} = 0$, i.e., $u(x)$ goes to 0 faster than $v(x)$, or $v(x)$ goes to $\infty$ faster than $u(x)$. We say that $u(x)$ is of smaller order than $v(x)$ as $x \to L$. Examples are $\frac{1}{x^3} = o(\frac{1}{x^2})$ and $x^2 = o(x^3)$ for $x \to \infty$. See Rohatgi, page 6, for more details on the Landau symbols "O" and "o".
Similarly, if we develop a Taylor series for $\exp(-\frac{it\mu}{\sqrt{n}\sigma})$, we get:
$$\exp\Big(-\frac{it\mu}{\sqrt{n}\sigma}\Big) = 1 - \frac{ti\mu}{\sqrt{n}\sigma} - \frac{1}{2} \frac{t^2\mu^2}{n\sigma^2} + o\Bigg(\Big(\frac{t}{\sqrt{n}\sigma}\Big)^2\Bigg)$$
Combining these results, we get:
$$\phi_n(t) = \Bigg[\Bigg(1 + \frac{ti\mu}{\sqrt{n}\sigma} - \frac{1}{2} t^2 \frac{\mu^2+\sigma^2}{n\sigma^2} + o\Big(\Big(\frac{t}{\sqrt{n}\sigma}\Big)^2\Big)\Bigg)\Bigg(1 - \frac{ti\mu}{\sqrt{n}\sigma} - \frac{1}{2} \frac{t^2\mu^2}{n\sigma^2} + o\Big(\Big(\frac{t}{\sqrt{n}\sigma}\Big)^2\Big)\Bigg)\Bigg]^n$$
$$= \Bigg[1 - \frac{ti\mu}{\sqrt{n}\sigma} - \frac{1}{2}\frac{t^2\mu^2}{n\sigma^2} + \frac{ti\mu}{\sqrt{n}\sigma} + \frac{t^2\mu^2}{n\sigma^2} - \frac{1}{2} t^2 \frac{\mu^2+\sigma^2}{n\sigma^2} + o\Big(\Big(\frac{t}{\sqrt{n}\sigma}\Big)^2\Big)\Bigg]^n$$
$$= \Bigg[1 - \frac{1}{2}\frac{t^2}{n} + o\Big(\Big(\frac{t}{\sqrt{n}\sigma}\Big)^2\Big)\Bigg]^n \stackrel{(*)}{\longrightarrow} \exp\Big(-\frac{t^2}{2}\Big) \text{ as } n \to \infty$$
Thus, $\lim\limits_{n\to\infty} \phi_n(t) = \phi_Z(t) \;\forall t$. For a proof of $(*)$, see Rohatgi, page 278, Lemma 1. According to the Note above, it holds that
$$\frac{\sqrt{n}(\overline{X}_n - \mu)}{\sigma} \stackrel{d}{\longrightarrow} Z.$$
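A small Monte Carlo sketch (an illustration, not part of the original notes; the Uniform(0,1) population, $n = 50$, the seed, and the 20,000 replications are arbitrary choices) shows Theorem 6.4.4 at work: the standardized sample mean puts roughly 95% of its mass in $[-1.96, 1.96]$, as a $N(0,1)$ rv does:

```python
import math
import random

random.seed(1)

def standardized_mean(n, mu=0.5, sigma=math.sqrt(1 / 12)):
    # sqrt(n) * (sample mean - mu) / sigma for n iid Uniform(0,1) draws.
    xbar = sum(random.random() for _ in range(n)) / n
    return math.sqrt(n) * (xbar - mu) / sigma

reps = 20_000
zs = [standardized_mean(50) for _ in range(reps)]
frac_within = sum(abs(z) <= 1.96 for z in zs) / reps
print(round(frac_within, 3))    # close to the N(0,1) value P(|Z| <= 1.96) ~ 0.95
assert abs(frac_within - 0.95) < 0.02
```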
Lecture 05:
Fr 01/19/01

Definition 6.4.5:
Let $X_1, X_2$ be iid non-degenerate rv's with common cdf $F$. Let $a_1, a_2 > 0$. We say that $F$ is stable if there exist constants $A$ and $B$ (depending on $a_1$ and $a_2$) such that $B^{-1}(a_1 X_1 + a_2 X_2 - A)$ also has cdf $F$.

Note:
When generalizing the previous definition to sequences of rv's, we have the following examples of stable distributions:
- $X_i$ iid Cauchy. Then $\frac{1}{n} \sum_{i=1}^{n} X_i \sim$ Cauchy (here $B_n = n$, $A_n = 0$).
- $X_i$ iid $N(0, 1)$. Then $\frac{1}{\sqrt{n}} \sum_{i=1}^{n} X_i \sim N(0, 1)$ (here $B_n = \sqrt{n}$, $A_n = 0$).

Definition 6.4.6:
Let $\{X_i\}_{i=1}^{\infty}$ be a sequence of iid rv's with common cdf $F$. Let $T_n = \sum_{i=1}^{n} X_i$. $F$ belongs to the domain of attraction of a distribution $V$ if there exist norming and centering constants $\{B_n\}_{n=1}^{\infty}$, $B_n > 0$, and $\{A_n\}_{n=1}^{\infty}$ such that
$$P(B_n^{-1}(T_n - A_n) \leq x) = F_{B_n^{-1}(T_n - A_n)}(x) \to V(x) \text{ as } n \to \infty$$
at all continuity points $x$ of $V$.

Note:
A very general Theorem from Loève states that only stable distributions can have domains of attraction. From the practical point of view, a wide class of distributions $F$ belongs to the domain of attraction of the Normal distribution.
Theorem 6.4.7: Lindeberg Central Limit Theorem
Let $\{X_i\}_{i=1}^{\infty}$ be a sequence of independent non-degenerate rv's with cdf's $\{F_i\}_{i=1}^{\infty}$. Assume that $E(X_k) = \mu_k$ and $Var(X_k) = \sigma_k^2 < \infty$. Let $s_n^2 = \sum_{k=1}^{n} \sigma_k^2$.
If the $F_k$ are absolutely continuous with pdf's $f_k = F_k'$, assume that it holds for all $\varepsilon > 0$ that
$$(A) \quad \lim_{n\to\infty} \frac{1}{s_n^2} \sum_{k=1}^{n} \int_{\{|x - \mu_k| > \varepsilon s_n\}} (x - \mu_k)^2 F_k'(x) \, dx = 0.$$
If the $X_k$ are discrete rv's with support $\{x_{kl}\}$ and probabilities $\{p_{kl}\}$, $l = 1, 2, \ldots$, assume that it holds for all $\varepsilon > 0$ that
$$(B) \quad \lim_{n\to\infty} \frac{1}{s_n^2} \sum_{k=1}^{n} \sum_{|x_{kl} - \mu_k| > \varepsilon s_n} (x_{kl} - \mu_k)^2 p_{kl} = 0.$$
The conditions (A) and (B) are called the Lindeberg Condition (LC). If either LC holds, then
$$\frac{\sum_{k=1}^{n} (X_k - \mu_k)}{s_n} \stackrel{d}{\longrightarrow} Z$$
where $Z \sim N(0, 1)$.

Proof:
Similar to the proof of Theorem 6.4.4, we can use characteristic functions again. An alternative proof is given in Rohatgi, pages 282-288.

Note:
Feller shows that the LC is a necessary condition if $\frac{\sigma_n^2}{s_n^2} \to 0$ and $s_n^2 \to \infty$ as $n \to \infty$.
Corollary 6.4.8:
Let $\{X_i\}_{i=1}^{\infty}$ be a sequence of iid rv's such that $\frac{1}{\sqrt{n}} \sum_{i=1}^{n} X_i$ has the same distribution for all $n$. If $E(X_i) = 0$ and $Var(X_i) = 1$, then $X_i \sim N(0, 1)$.

Proof:
Let $F$ be the common cdf of $\frac{1}{\sqrt{n}} \sum_{i=1}^{n} X_i$ for all $n$ (including $n = 1$). By the CLT,
$$\lim_{n\to\infty} P\Bigg(\frac{1}{\sqrt{n}} \sum_{i=1}^{n} X_i \leq x\Bigg) = \Phi(x),$$
where $\Phi(x)$ denotes $P(Z \leq x)$ for $Z \sim N(0, 1)$. Also, $P(\frac{1}{\sqrt{n}} \sum_{i=1}^{n} X_i \leq x) = F(x)$ for each $n$. Therefore, we must have $F(x) = \Phi(x)$.
Note:
In general, if $X_1, X_2, \ldots$ are independent rv's such that there exists a constant $A$ with $P(|X_n| \leq A) = 1 \;\forall n$, then the LC is satisfied if $s_n^2 \to \infty$ as $n \to \infty$. Why?
Suppose that $s_n^2 \to \infty$ as $n \to \infty$. Since the $|X_k|$'s are uniformly bounded (by $A$), so are the rv's $(X_k - E(X_k))$. Thus, for every $\varepsilon > 0$ there exists an $N_\varepsilon$ such that if $n \geq N_\varepsilon$ then
$$P(|X_k - E(X_k)| < \varepsilon s_n, \; k = 1, \ldots, n) = 1.$$
This implies that the LC holds since we would integrate (or sum) over the empty set, i.e., the set $\{|x - \mu_k| > \varepsilon s_n\} = \emptyset$.
The converse also holds. For a sequence of uniformly bounded independent rv's, a necessary and sufficient condition for the CLT to hold is that $s_n^2 \to \infty$ as $n \to \infty$.
Example 6.4.9:
Let $\{X_i\}_{i=1}^{\infty}$ be a sequence of independent rv's such that $E(X_k) = 0$, $\varrho_k = E(|X_k|^{2+\delta}) < \infty$ for some $\delta > 0$, and $\sum_{k=1}^{n} \varrho_k = o(s_n^{2+\delta})$.
Does the LC hold? It is:
$$\frac{1}{s_n^2} \sum_{k=1}^{n} \int_{\{|x| > \varepsilon s_n\}} x^2 f_k(x) \, dx \stackrel{(A)}{\leq} \frac{1}{s_n^2} \sum_{k=1}^{n} \int_{\{|x| > \varepsilon s_n\}} \frac{|x|^{2+\delta}}{\varepsilon^\delta s_n^\delta} f_k(x) \, dx \leq \frac{1}{s_n^2 \varepsilon^\delta s_n^\delta} \sum_{k=1}^{n} \int_{-\infty}^{\infty} |x|^{2+\delta} f_k(x) \, dx$$
$$= \frac{1}{s_n^2 \varepsilon^\delta s_n^\delta} \sum_{k=1}^{n} \varrho_k = \frac{1}{\varepsilon^\delta} \Bigg(\frac{\sum_{k=1}^{n} \varrho_k}{s_n^{2+\delta}}\Bigg) \stackrel{(B)}{\longrightarrow} 0 \text{ as } n \to \infty$$
$(A)$ holds since for $|x| > \varepsilon s_n$, it is $\frac{|x|^\delta}{\varepsilon^\delta s_n^\delta} > 1$. $(B)$ holds since $\sum_{k=1}^{n} \varrho_k = o(s_n^{2+\delta})$.
Thus, the LC is satisfied and the CLT holds.
Note:
(i) In general, if there exists a $\delta > 0$ such that
$$\frac{\sum_{k=1}^{n} E(|X_k - \mu_k|^{2+\delta})}{s_n^{2+\delta}} \longrightarrow 0 \text{ as } n \to \infty,$$
then the LC holds.
(ii) Both the CLT and the WLLN hold for a large class of sequences of rv's $\{X_i\}_{i=1}^{n}$. If the $\{X_i\}$'s are independent uniformly bounded rv's, i.e., if $P(|X_n| \leq M) = 1 \;\forall n$, the WLLN (as formulated in Theorem 6.2.3) holds. The CLT holds provided that $s_n^2 \to \infty$ as $n \to \infty$.
If the rv's $\{X_i\}$ are iid, then the CLT is a stronger result than the WLLN since the CLT provides an estimate of the probability $P(\frac{1}{n} |\sum_{i=1}^{n} X_i - n\mu| \geq \varepsilon) \approx 1 - P(|Z| \leq \frac{\varepsilon}{\sigma} \sqrt{n})$, where $Z \sim N(0, 1)$, and the WLLN follows. However, note that the CLT requires the existence of a 2nd moment while the WLLN does not.
(iii) If the $\{X_i\}$ are independent (but not identically distributed) rv's, the CLT may apply while the WLLN does not.
(iv) See Rohatgi, pages 289-293, for additional details and examples.
7 Sample Moments

7.1 Random Sampling

Definition 7.1.1:
Let $X_1, \ldots, X_n$ be iid rv's with common cdf $F$. We say that $\{X_1, \ldots, X_n\}$ is a (random) sample of size $n$ from the population distribution $F$. The vector of values $\{x_1, \ldots, x_n\}$ is called a realization of the sample. A rv $g(X_1, \ldots, X_n)$ which is a Borel-measurable function of $X_1, \ldots, X_n$ and does not depend on any unknown parameter is called a (sample) statistic.

Definition 7.1.2:
Let $X_1, \ldots, X_n$ be a sample of size $n$ from a population with distribution $F$. Then
$$\overline{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$
is called the sample mean and
$$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \overline{X})^2 = \frac{1}{n-1} \Bigg(\sum_{i=1}^{n} X_i^2 - n \overline{X}^2\Bigg)$$
is called the sample variance.

Definition 7.1.3:
Let $X_1, \ldots, X_n$ be a sample of size $n$ from a population with distribution $F$. The function
$$\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} I_{(-\infty, x]}(X_i)$$
is called the empirical cumulative distribution function (empirical cdf).

Note:
For any fixed $x \in I\!R$, $\hat{F}_n(x)$ is a rv.

Theorem 7.1.4:
The rv $\hat{F}_n(x)$ has pmf
$$P\Big(\hat{F}_n(x) = \frac{j}{n}\Big) = \binom{n}{j} (F(x))^j (1 - F(x))^{n-j}, \quad j \in \{0, 1, \ldots, n\},$$
with $E(\hat{F}_n(x)) = F(x)$ and $Var(\hat{F}_n(x)) = \frac{F(x)(1 - F(x))}{n}$.
Proof:
It is $I_{(-\infty, x]}(X_i) \sim Bin(1, F(x))$. Then $n \hat{F}_n(x) \sim Bin(n, F(x))$.
The results follow immediately.
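Since $n \hat{F}_n(x) \sim Bin(n, F(x))$, the mean and variance in Theorem 7.1.4 can be verified exactly by enumerating the binomial pmf. A Python sketch (an illustration, not part of the original notes; $n = 20$ and $F(x) = 0.3$ are arbitrary choices):

```python
import math

def ecdf_moments(n, F_x):
    """Exact mean and variance of F_hat_n(x), using n * F_hat_n(x) ~ Bin(n, F(x))."""
    pmf = [math.comb(n, j) * F_x**j * (1 - F_x)**(n - j) for j in range(n + 1)]
    mean = sum((j / n) * p for j, p in enumerate(pmf))
    var = sum((j / n - mean) ** 2 * p for j, p in enumerate(pmf))
    return mean, var

n, F_x = 20, 0.3
mean, var = ecdf_moments(n, F_x)
assert abs(mean - F_x) < 1e-12                    # E(F_hat_n(x)) = F(x)
assert abs(var - F_x * (1 - F_x) / n) < 1e-12     # Var(F_hat_n(x)) = F(x)(1 - F(x))/n
print(round(mean, 4), round(var, 6))
```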
Corollary 7.1.5:
By the WLLN, it follows that
$$\hat{F}_n(x) \stackrel{p}{\longrightarrow} F(x).$$

Corollary 7.1.6:
By the CLT, it follows that
$$\frac{\sqrt{n}(\hat{F}_n(x) - F(x))}{\sqrt{F(x)(1 - F(x))}} \stackrel{d}{\longrightarrow} Z,$$
where $Z \sim N(0, 1)$.

Theorem 7.1.7: Glivenko-Cantelli Theorem
$\hat{F}_n(x)$ converges uniformly to $F(x)$, i.e., it holds for all $\varepsilon > 0$ that
$$\lim_{n\to\infty} P\Big(\sup_{-\infty < x < \infty} |\hat{F}_n(x) - F(x)| > \varepsilon\Big) = 0.$$
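The uniform convergence in the Glivenko-Cantelli Theorem can be watched at work for a Uniform(0,1) sample, where $F(x) = x$ and the supremum is attained just before or at the jump points of $\hat{F}_n$. A Python sketch (an illustration, not part of the original notes; the seed and sample sizes are arbitrary choices):

```python
import random

random.seed(7)

def sup_distance_uniform(n):
    # sup_x |F_hat_n(x) - x| for a Uniform(0,1) sample (F(x) = x); for sorted data
    # the supremum occurs just before or at the jump points of F_hat_n.
    xs = sorted(random.random() for _ in range(n))
    return max(max(abs((i + 1) / n - x), abs(i / n - x)) for i, x in enumerate(xs))

print([round(sup_distance_uniform(n), 3) for n in (100, 10_000)])
assert sup_distance_uniform(100_000) < 0.01   # the sup distance shrinks with n
```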
Lecture 06:
Mo 01/22/01

Definition 7.1.8:
Let $X_1, \ldots, X_n$ be a sample of size $n$ from a population with distribution $F$. We call
$$a_k = \frac{1}{n} \sum_{i=1}^{n} X_i^k$$
the sample moment of order $k$ and
$$b_k = \frac{1}{n} \sum_{i=1}^{n} (X_i - a_1)^k = \frac{1}{n} \sum_{i=1}^{n} (X_i - \overline{X})^k$$
the sample central moment of order $k$.

Note:
It is $b_1 = 0$ and $b_2 = \frac{n-1}{n} S^2$.
Theorem 7.1.9:
Let $X_1, \ldots, X_n$ be a sample of size $n$ from a population with distribution $F$. Assume that $E(X) = \mu$, $Var(X) = \sigma^2$, and $E((X - \mu)^k) = \mu_k$ exist. Then it holds:
(i) $E(a_1) = E(\overline{X}) = \mu$
(ii) $Var(a_1) = Var(\overline{X}) = \frac{\sigma^2}{n}$
(iii) $E(b_2) = \frac{n-1}{n} \sigma^2$
(iv) $Var(b_2) = \frac{\mu_4 - \mu_2^2}{n} - \frac{2(\mu_4 - 2\mu_2^2)}{n^2} + \frac{\mu_4 - 3\mu_2^2}{n^3}$
(v) $E(S^2) = \sigma^2$
(vi) $Var(S^2) = \frac{\mu_4}{n} - \frac{n-3}{n(n-1)} \mu_2^2$

Proof:
(i)
$$E(\overline{X}) = \frac{1}{n} \sum_{i=1}^{n} E(X_i) = \frac{n}{n} \mu = \mu$$
(ii)
$$Var(\overline{X}) = \Big(\frac{1}{n}\Big)^2 \sum_{i=1}^{n} Var(X_i) = \frac{\sigma^2}{n}$$
(iii)
$$E(b_2) = E\Bigg(\frac{1}{n} \sum_{i=1}^{n} X_i^2 - \frac{1}{n^2} \Big(\sum_{i=1}^{n} X_i\Big)^2\Bigg) = E(X^2) - \frac{1}{n^2} E\Bigg(\sum_{i=1}^{n} X_i^2 + \sum_{i \neq j} X_i X_j\Bigg)$$
$$\stackrel{(*)}{=} E(X^2) - \frac{1}{n^2} \big(n E(X^2) + n(n-1)\mu^2\big) = \frac{n-1}{n} \big(E(X^2) - \mu^2\big) = \frac{n-1}{n} \sigma^2$$
$(*)$ holds since $X_i$ and $X_j$ are independent and then, due to Theorem 4.5.3, it holds that $E(X_i X_j) = E(X_i) E(X_j)$.
See Rohatgi, pages 303-306, for the proof of parts (iv) through (vi) and results regarding the 3rd and 4th moments and covariances.
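The identities $E(S^2) = \sigma^2$ and $E(b_2) = \frac{n-1}{n}\sigma^2$ can be verified exactly on a tiny discrete population by enumerating all equally likely samples. A Python sketch (an illustration, not part of the original notes; the population $\{0, 1, 2\}$ and $n = 2$ are arbitrary choices):

```python
from itertools import product
from statistics import mean

# Enumerate all equally likely ordered samples of size n = 2 from the
# uniform distribution on {0, 1, 2}, and average S^2 and b_2 over them.
support = [0, 1, 2]
sigma2 = mean((x - mean(support)) ** 2 for x in support)   # population variance = 2/3

n = 2
s2_vals, b2_vals = [], []
for sample in product(support, repeat=n):
    xbar = mean(sample)
    ss = sum((x - xbar) ** 2 for x in sample)
    s2_vals.append(ss / (n - 1))    # sample variance S^2
    b2_vals.append(ss / n)          # sample central moment b_2

assert abs(mean(s2_vals) - sigma2) < 1e-12              # E(S^2) = sigma^2
assert abs(mean(b2_vals) - (n - 1) / n * sigma2) < 1e-12  # E(b_2) = (n-1)/n sigma^2
```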
7.2 Sample Moments and the Normal Distribution

Theorem 7.2.1:
Let $X_1, \ldots, X_n$ be iid $N(\mu, \sigma^2)$ rv's. Then $\overline{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$ and $(X_1 - \overline{X}, \ldots, X_n - \overline{X})$ are independent.

Proof:
By computing the joint mgf of $(\overline{X}, X_1 - \overline{X}, X_2 - \overline{X}, \ldots, X_n - \overline{X})$, we can use Theorem 4.6.3 (iv) to show independence. We will use the following two facts:
(1):
$$M_{\overline{X}}(t) = M_{\frac{1}{n} \sum_{i=1}^{n} X_i}(t) \stackrel{(A)}{=} \prod_{i=1}^{n} M_{X_i}\Big(\frac{t}{n}\Big) \stackrel{(B)}{=} \Bigg[\exp\Big(\frac{t}{n}\mu + \frac{\sigma^2 t^2}{2n^2}\Big)\Bigg]^n = \exp\Bigg(t\mu + \frac{\sigma^2 t^2}{2n}\Bigg)$$
$(A)$ holds by Theorem 4.6.4 (i). $(B)$ follows from Theorem 3.3.12 (vi) since the $X_i$'s are iid.
(2):
$$M_{X_1 - \overline{X}, X_2 - \overline{X}, \ldots, X_n - \overline{X}}(t_1, t_2, \ldots, t_n) \stackrel{\text{Def. 4.6.1}}{=} E\Bigg[\exp\Bigg(\sum_{i=1}^{n} t_i (X_i - \overline{X})\Bigg)\Bigg] = E\Bigg[\exp\Bigg(\sum_{i=1}^{n} t_i X_i - \overline{X} \sum_{i=1}^{n} t_i\Bigg)\Bigg]$$
$$= E\Bigg[\exp\Bigg(\sum_{i=1}^{n} X_i (t_i - \overline{t})\Bigg)\Bigg], \text{ where } \overline{t} = \frac{1}{n} \sum_{i=1}^{n} t_i$$
$$= E\Bigg[\prod_{i=1}^{n} \exp\big(X_i (t_i - \overline{t})\big)\Bigg] \stackrel{(C)}{=} \prod_{i=1}^{n} E\big(\exp(X_i (t_i - \overline{t}))\big) = \prod_{i=1}^{n} M_{X_i}(t_i - \overline{t})$$
$$\stackrel{(D)}{=} \prod_{i=1}^{n} \exp\Bigg(\mu(t_i - \overline{t}) + \frac{\sigma^2 (t_i - \overline{t})^2}{2}\Bigg) = \exp\Bigg(\mu \underbrace{\sum_{i=1}^{n} (t_i - \overline{t})}_{=0} + \frac{\sigma^2}{2} \sum_{i=1}^{n} (t_i - \overline{t})^2\Bigg) = \exp\Bigg(\frac{\sigma^2}{2} \sum_{i=1}^{n} (t_i - \overline{t})^2\Bigg)$$
$(C)$ follows from Theorem 4.5.3 since the $X_i$'s are independent. $(D)$ holds since we evaluate $M_X(h) = \exp(\mu h + \frac{\sigma^2 h^2}{2})$ for $h = t_i - \overline{t}$.
From (1) and (2), it follows:
$$M_{\overline{X}, X_1 - \overline{X}, \ldots, X_n - \overline{X}}(t, t_1, \ldots, t_n) \stackrel{\text{Def. 4.6.1}}{=} E\Big[\exp\big(t\overline{X} + t_1(X_1 - \overline{X}) + \ldots + t_n(X_n - \overline{X})\big)\Big]$$
$$= E\Big[\exp\big(t\overline{X} + t_1 X_1 - t_1 \overline{X} + \ldots + t_n X_n - t_n \overline{X}\big)\Big] = E\Bigg[\exp\Bigg(\sum_{i=1}^{n} X_i t_i - \Big(\sum_{i=1}^{n} t_i - t\Big)\overline{X}\Bigg)\Bigg]$$
$$= E\Bigg[\exp\Bigg(\sum_{i=1}^{n} X_i t_i - (t_1 + \ldots + t_n - t) \, \frac{\sum_{i=1}^{n} X_i}{n}\Bigg)\Bigg] = E\Bigg[\exp\Bigg(\sum_{i=1}^{n} X_i \Big(t_i - \frac{t_1 + \ldots + t_n - t}{n}\Big)\Bigg)\Bigg]$$
$$= E\Bigg[\prod_{i=1}^{n} \exp\Bigg(X_i \, \frac{n t_i - n\overline{t} + t}{n}\Bigg)\Bigg], \text{ where } \overline{t} = \frac{1}{n} \sum_{i=1}^{n} t_i$$
$$\stackrel{(E)}{=} \prod_{i=1}^{n} E\Bigg[\exp\Bigg(X_i \, \frac{t + n(t_i - \overline{t})}{n}\Bigg)\Bigg] = \prod_{i=1}^{n} M_{X_i}\Bigg(\frac{t + n(t_i - \overline{t})}{n}\Bigg)$$
$$\stackrel{(F)}{=} \prod_{i=1}^{n} \exp\Bigg(\frac{\mu[t + n(t_i - \overline{t})]}{n} + \frac{\sigma^2}{2} \frac{1}{n^2} [t + n(t_i - \overline{t})]^2\Bigg)$$
$$= \exp\Bigg(\frac{\mu}{n}\Bigg(nt + n \underbrace{\sum_{i=1}^{n} (t_i - \overline{t})}_{=0}\Bigg) + \frac{\sigma^2}{2n^2} \sum_{i=1}^{n} \big(t + n(t_i - \overline{t})\big)^2\Bigg)$$
$$= \exp(\mu t) \exp\Bigg(\frac{\sigma^2}{2n^2}\Bigg(n t^2 + 2nt \underbrace{\sum_{i=1}^{n} (t_i - \overline{t})}_{=0} + n^2 \sum_{i=1}^{n} (t_i - \overline{t})^2\Bigg)\Bigg)$$
$$= \exp\Bigg(\mu t + \frac{\sigma^2}{2n} t^2\Bigg) \exp\Bigg(\frac{\sigma^2}{2} \sum_{i=1}^{n} (t_i - \overline{t})^2\Bigg) \stackrel{(1)\&(2)}{=} M_{\overline{X}}(t) \, M_{X_1 - \overline{X}, \ldots, X_n - \overline{X}}(t_1, \ldots, t_n)$$
Thus, $\overline{X}$ and $(X_1 - \overline{X}, \ldots, X_n - \overline{X})$ are independent by Theorem 4.6.3 (iv). $(E)$ follows from Theorem 4.5.3 since the $X_i$'s are independent. $(F)$ holds since we evaluate $M_X(h) = \exp(\mu h + \frac{\sigma^2 h^2}{2})$ for $h = \frac{t + n(t_i - \overline{t})}{n}$.
Corollary 7.2.2:
$\overline{X}$ and $S^2$ are independent.

Proof:
This can be seen since $S^2$ is a function of the vector $(X_1 - \overline{X}, \ldots, X_n - \overline{X})$, and $(X_1 - \overline{X}, \ldots, X_n - \overline{X})$ is independent of $\overline{X}$, as previously shown in Theorem 7.2.1. We can use Theorem 4.2.7 to formally complete this proof.
Corollary 7.2.3:
$$\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}.$$

Proof:
Recall the following facts:
- If $Z \sim N(0, 1)$, then $Z^2 \sim \chi^2_1$.
- If $Y_1, \ldots, Y_n \sim$ iid $\chi^2_1$, then $\sum_{i=1}^{n} Y_i \sim \chi^2_n$.
- For $\chi^2_n$, the mgf is $M(t) = (1 - 2t)^{-n/2}$.
- If $X_i \sim N(\mu, \sigma^2)$, then $\frac{X_i - \mu}{\sigma} \sim N(0, 1)$ and $\frac{(X_i - \mu)^2}{\sigma^2} \sim \chi^2_1$.
Therefore, $\sum_{i=1}^{n} \frac{(X_i - \mu)^2}{\sigma^2} \sim \chi^2_n$ and $\frac{(\overline{X} - \mu)^2}{(\sigma/\sqrt{n})^2} = \frac{n(\overline{X} - \mu)^2}{\sigma^2} \sim \chi^2_1$. $\quad (*)$
Now consider
$$\sum_{i=1}^{n} (X_i - \mu)^2 = \sum_{i=1}^{n} \big((X_i - \overline{X}) + (\overline{X} - \mu)\big)^2 = \sum_{i=1}^{n} \big((X_i - \overline{X})^2 + 2(X_i - \overline{X})(\overline{X} - \mu) + (\overline{X} - \mu)^2\big)$$
$$= (n-1)S^2 + 0 + n(\overline{X} - \mu)^2$$
Therefore,
$$\underbrace{\sum_{i=1}^{n} \frac{(X_i - \mu)^2}{\sigma^2}}_{W} = \underbrace{\frac{n(\overline{X} - \mu)^2}{\sigma^2}}_{U} + \underbrace{\frac{(n-1)S^2}{\sigma^2}}_{V}$$
We have an expression of the form $W = U + V$.
Since $U$ and $V$ are functions of $\overline{X}$ and $S^2$, we know by Corollary 7.2.2 that they are independent and also that their mgf's factor by Theorem 4.6.3 (iv). Now we can write:
$$M_W(t) = M_U(t) M_V(t) \Longrightarrow M_V(t) = \frac{M_W(t)}{M_U(t)} \stackrel{(*)}{=} \frac{(1 - 2t)^{-n/2}}{(1 - 2t)^{-1/2}} = (1 - 2t)^{-(n-1)/2}$$
Note that this is the mgf of $\chi^2_{n-1}$ by the uniqueness of mgf's. Thus, $V = \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$.
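Corollary 7.2.3 can be checked by simulation against the known moments of the chi-squared distribution (mean $k$ and variance $2k$ for $k$ degrees of freedom). A Python sketch (an illustration, not part of the original notes; $n = 5$, $\mu = 10$, $\sigma = 2$, the seed, and the replication count are arbitrary choices):

```python
import random
from statistics import mean, variance

random.seed(3)
n, mu, sigma = 5, 10.0, 2.0
reps = 20_000

stats = []
for _ in range(reps):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    # (n-1) S^2 / sigma^2 should be chi-squared with n - 1 = 4 degrees of freedom.
    stats.append((n - 1) * variance(xs) / sigma**2)

# chi^2_4 has mean 4 and variance 8.
print(round(mean(stats), 2), round(variance(stats), 2))
assert abs(mean(stats) - (n - 1)) < 0.1
assert abs(variance(stats) - 2 * (n - 1)) < 0.4
```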
Corollary 7.2.4:
$$\frac{\sqrt{n}(\overline{X} - \mu)}{S} \sim t_{n-1}.$$

Proof:
Recall the following facts:
- If $Z \sim N(0, 1)$, $Y \sim \chi^2_n$, and $Z, Y$ are independent, then $\frac{Z}{\sqrt{Y/n}} \sim t_n$.
- $Z_1 = \frac{\sqrt{n}(\overline{X} - \mu)}{\sigma} \sim N(0, 1)$, $Y_{n-1} = \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$, and $Z_1, Y_{n-1}$ are independent.
Therefore,
$$\frac{\sqrt{n}(\overline{X} - \mu)}{S} = \frac{(\overline{X} - \mu)/(\sigma/\sqrt{n})}{S/\sigma} = \frac{\frac{\overline{X} - \mu}{\sigma/\sqrt{n}}}{\sqrt{\frac{S^2(n-1)}{\sigma^2(n-1)}}} = \frac{Z_1}{\sqrt{\frac{Y_{n-1}}{n-1}}} \sim t_{n-1}.$$
Corollary 7.2.5:
Let $(X_1, \ldots, X_m) \sim$ iid $N(\mu_1, \sigma_1^2)$ and $(Y_1, \ldots, Y_n) \sim$ iid $N(\mu_2, \sigma_2^2)$. Let $X_i, Y_j$ be independent $\forall i, j$. Then it holds:
$$\frac{\overline{X} - \overline{Y} - (\mu_1 - \mu_2)}{\sqrt{[(m-1)S_1^2/\sigma_1^2] + [(n-1)S_2^2/\sigma_2^2]}} \cdot \sqrt{\frac{m + n - 2}{\sigma_1^2/m + \sigma_2^2/n}} \sim t_{m+n-2}$$
In particular, if $\sigma_1 = \sigma_2$, then:
$$\frac{\overline{X} - \overline{Y} - (\mu_1 - \mu_2)}{\sqrt{(m-1)S_1^2 + (n-1)S_2^2}} \cdot \sqrt{\frac{mn(m + n - 2)}{m + n}} \sim t_{m+n-2}$$

Proof:
Homework.
Corollary 7.2.6:
Let $(X_1, \ldots, X_m) \sim$ iid $N(\mu_1, \sigma_1^2)$ and $(Y_1, \ldots, Y_n) \sim$ iid $N(\mu_2, \sigma_2^2)$. Let $X_i, Y_j$ be independent $\forall i, j$. Then it holds:
$$\frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} \sim F_{m-1, n-1}$$
In particular, if $\sigma_1 = \sigma_2$, then $\frac{S_1^2}{S_2^2} \sim F_{m-1, n-1}$.

Proof:
Recall that, if $Y_1 \sim \chi^2_m$ and $Y_2 \sim \chi^2_n$ are independent, then
$$F = \frac{Y_1/m}{Y_2/n} \sim F_{m,n}.$$
Now, $C_1 = \frac{(m-1)S_1^2}{\sigma_1^2} \sim \chi^2_{m-1}$ and $C_2 = \frac{(n-1)S_2^2}{\sigma_2^2} \sim \chi^2_{n-1}$. Therefore,
$$\frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} = \frac{\frac{(m-1)S_1^2}{\sigma_1^2(m-1)}}{\frac{(n-1)S_2^2}{\sigma_2^2(n-1)}} = \frac{C_1/(m-1)}{C_2/(n-1)} \sim F_{m-1, n-1}.$$
If $\sigma_1 = \sigma_2$, then $\frac{S_1^2}{S_2^2} \sim F_{m-1, n-1}$.
Lecture 07:
We 01/24/01
8 The Theory of Point Estimation
8.1 The Problem of Point Estimation
Let X be a rv de�ned on a probability space (; L; P ). Suppose that the cdf F of X depends
on some set of parameters and that the functional form of F is known except for a �nite
number of these parameters.
De�nition 8.1.1:
The set of admissible values of � is called the parameter space �. If F� is the cdf of X
when � is the parameter, the set fF� : � 2 �g is the family of cdf's. Likewise, we speak of
the family of pdf's if X is continuous, and the family of pmf's if X is discrete.
Example 8.1.2:
X � Bin(n; p); p unknown. Then � = p and � = fp : 0 < p < 1g.
X � N(�; �2); (�; �2) unknown. Then � = (�; �2) and � = f(�; �2) : �1 < � <1; �2 > 0g.
De�nition 8.1.3:
Let X be a sample from F�; � 2 � � IR. Let a statistic T (X) map IRn to �. We call T (X)
an estimator of � and T (x) for a realization x of X an (point) estimate of �. In practice,
the term estimate is used for both.
Example 8.1.4:
Let $X_1,\ldots,X_n$ be iid Bin$(1, p)$, $p$ unknown. Estimates of $p$ include:
$$T_1(X) = \bar{X}, \qquad T_2(X) = X_1, \qquad T_3(X) = \frac{1}{2}, \qquad T_4(X) = \frac{X_1+X_2}{3}$$
Obviously, not all estimates are equally good.
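That "not all estimates are equally good" can be made concrete by comparing mean squared errors. The sketch below (my illustration, not part of the notes; $p = 0.3$ and $n = 20$ are arbitrary) estimates the MSE of $T_1$ through $T_4$ by Monte Carlo.

```python
import random
import statistics

def mse_of_estimators(p=0.3, n=20, reps=20000, seed=3):
    """Monte Carlo MSE of four estimates of p for iid Bin(1, p) data."""
    rng = random.Random(seed)
    errs = {"T1": [], "T2": [], "T3": [], "T4": []}
    for _ in range(reps):
        xs = [1 if rng.random() < p else 0 for _ in range(n)]
        errs["T1"].append((sum(xs) / n - p) ** 2)         # sample mean
        errs["T2"].append((xs[0] - p) ** 2)               # first observation only
        errs["T3"].append((0.5 - p) ** 2)                 # constant 1/2
        errs["T4"].append(((xs[0] + xs[1]) / 3 - p) ** 2) # biased two-obs estimate
    return {k: statistics.fmean(v) for k, v in errs.items()}

mse = mse_of_estimators()
```

For $p = 0.3$ the ordering is MSE$(T_1)$ < MSE$(T_3)$ < MSE$(T_4)$ < MSE$(T_2)$; note that $T_3 = \frac{1}{2}$ would win for $p$ close to $\frac{1}{2}$, so which estimate is "better" can depend on the unknown $p$.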
8.2 Properties of Estimates
Definition 8.2.1:
Let $\{X_i\}_{i=1}^\infty$ be a sequence of iid rv's with cdf $F_\theta$, $\theta \in \Theta$. A sequence of point estimates $T_n(X_1,\ldots,X_n) = T_n$ is called

• (weakly) consistent for $\theta$ if $T_n \stackrel{p}{\longrightarrow} \theta$ as $n\to\infty$ for all $\theta\in\Theta$

• strongly consistent for $\theta$ if $T_n \stackrel{a.s.}{\longrightarrow} \theta$ as $n\to\infty$ for all $\theta\in\Theta$

• consistent in the $r$-th mean for $\theta$ if $T_n \stackrel{r}{\longrightarrow} \theta$ as $n\to\infty$ for all $\theta\in\Theta$
Example 8.2.2:
Let $\{X_i\}_{i=1}^\infty$ be a sequence of iid Bin$(1, p)$ rv's. Let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$. Since $E(X_i) = p$, it follows by the WLLN that $\bar{X}_n \stackrel{p}{\longrightarrow} p$, i.e., consistency, and by the SLLN that $\bar{X}_n \stackrel{a.s.}{\longrightarrow} p$, i.e., strong consistency.
However, a consistent estimate need not be unique. We may even have infinitely many consistent estimates, e.g.,
$$\frac{\sum_{i=1}^n X_i + a}{n + b} \;\stackrel{p}{\longrightarrow}\; p \quad \text{for all finite } a, b \in \mathbb{R}.$$
Theorem 8.2.3:
If $T_n$ is a sequence of estimates such that $E(T_n) \to \theta$ and $Var(T_n) \to 0$ as $n\to\infty$, then $T_n$ is consistent for $\theta$.
Proof:
$$P(|T_n - \theta| > \varepsilon) \overset{(A)}{\le} \frac{E((T_n-\theta)^2)}{\varepsilon^2} = \frac{E[((T_n - E(T_n)) + (E(T_n) - \theta))^2]}{\varepsilon^2} = \frac{Var(T_n) + 2E[(T_n - E(T_n))(E(T_n)-\theta)] + (E(T_n)-\theta)^2}{\varepsilon^2}$$
$$= \frac{Var(T_n) + (E(T_n)-\theta)^2}{\varepsilon^2} \;\overset{(B)}{\longrightarrow}\; 0 \text{ as } n\to\infty$$
(A) holds due to Corollary 3.5.2 (Markov's Inequality). The cross term vanishes since $E(T_n) - \theta$ is a constant and $E(T_n - E(T_n)) = 0$. (B) holds since $Var(T_n) \to 0$ as $n\to\infty$ and $E(T_n) \to \theta$ as $n\to\infty$.
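Theorem 8.2.3 in action: for $\bar{X}_n$ with iid Bin$(1,p)$ data, $E(\bar{X}_n) = p$ and $Var(\bar{X}_n) = p(1-p)/n \to 0$, so $P(|\bar{X}_n - p| > \varepsilon)$ must shrink. A small Monte Carlo sketch (my illustration; $p = 0.4$ and $\varepsilon = 0.05$ are arbitrary):

```python
import random

def exceedance_prob(n, p=0.4, eps=0.05, reps=5000, seed=4):
    """Estimate P(|Xbar_n - p| > eps) for iid Bin(1, p) data."""
    rng = random.Random(seed)
    count = 0
    for _ in range(reps):
        xbar = sum(1 if rng.random() < p else 0 for _ in range(n)) / n
        if abs(xbar - p) > eps:
            count += 1
    return count / reps

# The exceedance probability drops toward 0 as n grows.
probs = [exceedance_prob(n) for n in (10, 100, 1000)]
```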
Definition 8.2.4:
Let $G$ be a group of Borel-measurable functions of $\mathbb{R}^n$ onto itself which is closed under composition and inverse. A family of distributions $\{P_\theta : \theta \in \Theta\}$ is invariant under $G$ if for each $g \in G$ and for all $\theta \in \Theta$, there exists a unique $\theta' = \bar{g}(\theta)$ such that the distribution of $g(X)$ is $P_{\theta'}$ whenever the distribution of $X$ is $P_\theta$. We call $\bar{g}$ the induced function on $\Theta$, since
$$P_\theta(g(X) \in A) = P_{\bar{g}(\theta)}(X \in A).$$
Example 8.2.5:
Let $(X_1,\ldots,X_n)$ be iid $N(\mu,\sigma^2)$ with joint pdf
$$f(x_1,\ldots,x_n) = \frac{1}{(\sqrt{2\pi}\,\sigma)^n}\exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2\right).$$
The group of linear transformations $G$ has elements
$$g(x_1,\ldots,x_n) = (ax_1+b,\ldots,ax_n+b), \qquad a > 0,\; -\infty < b < \infty.$$
The pdf of $g(X)$ is
$$f^*(x_1^*,\ldots,x_n^*) = \frac{1}{(\sqrt{2\pi}\,a\sigma)^n}\exp\!\left(-\frac{1}{2a^2\sigma^2}\sum_{i=1}^n (x_i^* - a\mu - b)^2\right), \qquad x_i^* = ax_i + b,\; i = 1,\ldots,n.$$
So $\{f : -\infty < \mu < \infty,\; \sigma^2 > 0\}$ is invariant under this group $G$, with $\bar{g}(\mu,\sigma^2) = (a\mu+b,\, a^2\sigma^2)$, where $-\infty < a\mu+b < \infty$ and $a^2\sigma^2 > 0$.
Definition 8.2.6:
Let $G$ be a group of transformations that leaves $\{F_\theta : \theta \in \Theta\}$ invariant. An estimate $T$ is invariant under $G$ if
$$T(g(X_1),\ldots,g(X_n)) = T(X_1,\ldots,X_n) \quad \forall g \in G.$$
Definition 8.2.7:
An estimate $T$ is:

• location invariant if $T(X_1+a,\ldots,X_n+a) = T(X_1,\ldots,X_n)$ for all $a \in \mathbb{R}$

• scale invariant if $T(cX_1,\ldots,cX_n) = T(X_1,\ldots,X_n)$ for all $c \in \mathbb{R}\setminus\{0\}$

• permutation invariant if $T(X_{i_1},\ldots,X_{i_n}) = T(X_1,\ldots,X_n)$ for all permutations $(i_1,\ldots,i_n)$ of $(1,\ldots,n)$
Example 8.2.8:
Let $F_\theta \equiv N(\mu,\sigma^2)$.
$S^2$ is location invariant.
$\bar{X}$ and $S^2$ are both permutation invariant.
Neither $\bar{X}$ nor $S^2$ is scale invariant.
Note:
Different sources make different use of the term invariant. Mood, Graybill & Boes (1974), for example, define location invariant as $T(X_1+a,\ldots,X_n+a) = T(X_1,\ldots,X_n) + a$ (page 332) and scale invariant as $T(cX_1,\ldots,cX_n) = cT(X_1,\ldots,X_n)$ (page 336). According to their definition, $\bar{X}$ is location invariant and scale invariant.
8.3 Sufficient Statistics
Definition 8.3.1:
Let $X = (X_1,\ldots,X_n)$ be a sample from $\{F_\theta : \theta \in \Theta \subseteq \mathbb{R}^k\}$. A statistic $T = T(X)$ is sufficient for $\theta$ (or for the family of distributions $\{F_\theta : \theta \in \Theta\}$) iff the conditional distribution of $X$ given $T = t$ does not depend on $\theta$ (except possibly on a null set $A$ where $P_\theta(T \in A) = 0$ for all $\theta$).
Note:
(i) The sample $X$ is always sufficient, but this is not particularly interesting and usually is excluded from further considerations.
(ii) Idea: once we have "reduced" from $X$ to $T(X)$, we have captured all the information in $X$ about $\theta$.
(iii) Usually, there are several sufficient statistics for a given family of distributions.
Example 8.3.2:
Let $X = (X_1,\ldots,X_n)$ be iid Bin$(1, p)$ rv's. To estimate $p$, can we ignore the order and simply count the number of "successes"?
Let $T(X) = \sum_{i=1}^n X_i$. It is
$$P\Big(X_1=x_1,\ldots,X_n=x_n \,\Big|\, \sum_{i=1}^n X_i = t\Big) = \frac{P(X_1=x_1,\ldots,X_n=x_n,\, T=t)}{P(T=t)} = \begin{cases} \dfrac{P(X_1=x_1,\ldots,X_n=x_n)}{P(T=t)}, & \sum_{i=1}^n x_i = t\\[6pt] 0, & \text{otherwise}\end{cases}$$
$$= \begin{cases} \dfrac{p^t(1-p)^{n-t}}{\binom{n}{t}\, p^t(1-p)^{n-t}}, & \sum_{i=1}^n x_i = t\\[6pt] 0, & \text{otherwise}\end{cases} \;=\; \begin{cases} \dfrac{1}{\binom{n}{t}}, & \sum_{i=1}^n x_i = t\\[6pt] 0, & \text{otherwise}\end{cases}$$
This does not depend on $p$. Thus, $T = \sum_{i=1}^n X_i$ is sufficient for $p$.
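The computation above says: given $T = t$, every arrangement of $t$ successes among $n$ trials is equally likely, with probability $1/\binom{n}{t}$, whatever $p$ is. This can be verified exactly by enumeration (an added sketch, not part of the notes):

```python
import math
from itertools import product

def conditional_given_sum(n, t, p):
    """Exact conditional pmf of (X_1,...,X_n) iid Bin(1,p) given sum = t."""
    joint = {}
    for xs in product((0, 1), repeat=n):
        if sum(xs) == t:
            joint[xs] = p ** sum(xs) * (1 - p) ** (n - sum(xs))
    total = sum(joint.values())  # = P(T = t)
    return {xs: pr / total for xs, pr in joint.items()}

n, t = 5, 2
cond_a = conditional_given_sum(n, t, 0.2)
cond_b = conditional_given_sum(n, t, 0.7)
# Both conditional pmf's are uniform over the C(5,2) = 10 arrangements,
# and identical for the two different values of p.
```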
Example 8.3.3:
Let $X = (X_1,\ldots,X_n)$ be iid Poisson$(\lambda)$. Is $T = \sum_{i=1}^n X_i$ sufficient for $\lambda$? Note that $T \sim$ Poisson$(n\lambda)$. It is
$$P\Big(X_1=x_1,\ldots,X_n=x_n \,\Big|\, T = t\Big) = \begin{cases} \dfrac{\prod_{i=1}^n \frac{e^{-\lambda}\lambda^{x_i}}{x_i!}}{\frac{e^{-n\lambda}(n\lambda)^t}{t!}}, & \sum_{i=1}^n x_i = t\\[8pt] 0, & \text{otherwise}\end{cases} = \begin{cases} \dfrac{\frac{e^{-n\lambda}\lambda^{\sum x_i}}{\prod x_i!}}{\frac{e^{-n\lambda}(n\lambda)^t}{t!}}, & \sum_{i=1}^n x_i = t\\[8pt] 0, & \text{otherwise}\end{cases} = \begin{cases} \dfrac{t!}{n^t \prod_{i=1}^n x_i!}, & \sum_{i=1}^n x_i = t\\[8pt] 0, & \text{otherwise}\end{cases}$$
This does not depend on $\lambda$. Thus, $T = \sum_{i=1}^n X_i$ is sufficient for $\lambda$.
Example 8.3.4:
Let $X_1, X_2$ be iid Poisson$(\lambda)$. Is $T = X_1 + 2X_2$ sufficient for $\lambda$? It is
$$P(X_1=0, X_2=1 \mid X_1+2X_2 = 2) = \frac{P(X_1=0, X_2=1)}{P(X_1+2X_2 = 2)} = \frac{P(X_1=0, X_2=1)}{P(X_1=0, X_2=1) + P(X_1=2, X_2=0)}$$
$$= \frac{e^{-\lambda}(e^{-\lambda}\lambda)}{e^{-\lambda}(e^{-\lambda}\lambda) + \left(\frac{e^{-\lambda}\lambda^2}{2}\right)e^{-\lambda}} = \frac{1}{1 + \frac{\lambda}{2}}$$
This still depends on $\lambda$. Thus, $T = X_1 + 2X_2$ is not sufficient for $\lambda$.
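The conditional probability $\frac{1}{1+\lambda/2}$ can be recomputed numerically to confirm that it genuinely varies with $\lambda$ (an added sketch, not part of the notes):

```python
import math

def cond_prob(lam):
    """P(X1=0, X2=1 | X1 + 2*X2 = 2) for X1, X2 iid Poisson(lam)."""
    def pois(k):
        return math.exp(-lam) * lam ** k / math.factorial(k)
    p01 = pois(0) * pois(1)   # X1 = 0, X2 = 1
    p20 = pois(2) * pois(0)   # X1 = 2, X2 = 0
    return p01 / (p01 + p20)

probs = {lam: cond_prob(lam) for lam in (0.5, 1.0, 2.0)}
```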
Note:
Definition 8.3.1 can be difficult to check. In addition, it requires a candidate statistic. We need something constructive that helps in finding sufficient statistics without having to check Definition 8.3.1. The next Theorem helps in finding such statistics.
Lecture 08:
Fr 01/26/01

Theorem 8.3.5: Factorization Criterion
Let $X_1,\ldots,X_n$ be rv's with pdf (or pmf) $f(x_1,\ldots,x_n \mid \theta)$, $\theta \in \Theta$. Then $T(X_1,\ldots,X_n)$ is sufficient for $\theta$ iff we can write
$$f(x_1,\ldots,x_n \mid \theta) = h(x_1,\ldots,x_n)\; g(T(x_1,\ldots,x_n) \mid \theta),$$
where $h$ does not depend on $\theta$ and $g$ does not depend on $x_1,\ldots,x_n$ except as a function of $T$.
Proof:
Discrete case only.
"$\Longrightarrow$":
Suppose $T(X)$ is sufficient for $\theta$. Let
$$g(t \mid \theta) = P_\theta(T(X) = t), \qquad h(x) = P(X = x \mid T(X) = t).$$
Then it holds:
$$f(x \mid \theta) = P_\theta(X = x) \overset{(*)}{=} P_\theta(X = x,\, T(X) = T(x) = t) = P_\theta(T(X) = t)\; P(X = x \mid T(X) = t) = g(t \mid \theta)\, h(x)$$
(*) holds since $X = x$ implies that $T(X) = T(x) = t$.
"$\Longleftarrow$": Suppose the factorization holds. For fixed $t_0$, it is
$$P_\theta(T(X) = t_0) = \sum_{\{x :\, T(x)=t_0\}} P_\theta(X = x) = \sum_{\{x :\, T(x)=t_0\}} h(x)\, g(T(x) \mid \theta) = g(t_0 \mid \theta) \sum_{\{x :\, T(x)=t_0\}} h(x) \qquad (A)$$
If $P_\theta(T(X) = t_0) > 0$, it holds:
$$P_\theta(X = x \mid T(X) = t_0) = \frac{P_\theta(X = x,\, T(X) = t_0)}{P_\theta(T(X) = t_0)} = \begin{cases} \dfrac{P_\theta(X = x)}{P_\theta(T(X) = t_0)}, & \text{if } T(x) = t_0\\[6pt] 0, & \text{otherwise}\end{cases}$$
$$\overset{(A)}{=} \begin{cases} \dfrac{g(t_0 \mid \theta)\, h(x)}{g(t_0 \mid \theta) \sum_{\{x : T(x)=t_0\}} h(x)}, & \text{if } T(x) = t_0\\[8pt] 0, & \text{otherwise}\end{cases} \;=\; \begin{cases} \dfrac{h(x)}{\sum_{\{x : T(x)=t_0\}} h(x)}, & \text{if } T(x) = t_0\\[8pt] 0, & \text{otherwise}\end{cases}$$
This last expression does not depend on $\theta$. Thus, $T(X)$ is sufficient for $\theta$.
Note:
(i) In the Theorem above, $\theta$ and $T$ may be vectors.
(ii) If $T$ is sufficient for $\theta$, then any 1-to-1 mapping of $T$ is also sufficient for $\theta$. However, this does not hold for arbitrary functions of $T$.
Example 8.3.6:
Let $X_1,\ldots,X_n$ be iid Bin$(1, p)$. It is
$$P(X_1=x_1,\ldots,X_n=x_n \mid p) = p^{\sum x_i}(1-p)^{n - \sum x_i}.$$
Thus, $h(x_1,\ldots,x_n) = 1$ and $g(\sum x_i \mid p) = p^{\sum x_i}(1-p)^{n-\sum x_i}$.
Hence, $T = \sum_{i=1}^n X_i$ is sufficient for $p$.
Example 8.3.7:
Let $X_1,\ldots,X_n$ be iid Poisson$(\lambda)$. It is
$$P(X_1=x_1,\ldots,X_n=x_n \mid \lambda) = \prod_{i=1}^n \frac{e^{-\lambda}\lambda^{x_i}}{x_i!} = \frac{e^{-n\lambda}\lambda^{\sum x_i}}{\prod x_i!}.$$
Thus, $h(x_1,\ldots,x_n) = \frac{1}{\prod x_i!}$ and $g(\sum x_i \mid \lambda) = e^{-n\lambda}\lambda^{\sum x_i}$.
Hence, $T = \sum_{i=1}^n X_i$ is sufficient for $\lambda$.
Example 8.3.8:
Let $X_1,\ldots,X_n$ be iid $N(\mu,\sigma^2)$ where $\mu \in \mathbb{R}$ and $\sigma^2 > 0$ are both unknown. It is
$$f(x_1,\ldots,x_n \mid \mu,\sigma^2) = \frac{1}{(\sqrt{2\pi}\,\sigma)^n}\exp\!\left(-\frac{\sum(x_i-\mu)^2}{2\sigma^2}\right) = \frac{1}{(\sqrt{2\pi}\,\sigma)^n}\exp\!\left(-\frac{\sum x_i^2}{2\sigma^2} + \frac{\mu\sum x_i}{\sigma^2} - \frac{n\mu^2}{2\sigma^2}\right).$$
Hence, $T = \left(\sum_{i=1}^n X_i,\; \sum_{i=1}^n X_i^2\right)$ is sufficient for $(\mu, \sigma^2)$.
Example 8.3.9:
Let $X_1,\ldots,X_n$ be iid $U(\theta, \theta+1)$ where $-\infty < \theta < \infty$. It is
$$f(x_1,\ldots,x_n \mid \theta) = \begin{cases} 1, & \theta < x_i < \theta+1 \;\; \forall i \in \{1,\ldots,n\}\\ 0, & \text{otherwise}\end{cases} = \prod_{i=1}^n I_{(\theta,\infty)}(x_i)\, I_{(-\infty,\theta+1)}(x_i) = I_{(\theta,\infty)}(\min(x_i))\; I_{(-\infty,\theta+1)}(\max(x_i))$$
Hence, $T = (X_{(1)}, X_{(n)})$ is sufficient for $\theta$.
Definition 8.3.10:
Let $\{f_\theta(x) : \theta \in \Theta\}$ be a family of pdf's (or pmf's). We say the family is complete if
$$E_\theta(g(X)) = 0 \quad \forall\theta\in\Theta$$
implies that
$$P_\theta(g(X) = 0) = 1 \quad \forall\theta\in\Theta.$$
We say a statistic $T(X)$ is complete if the family of distributions of $T$ is complete.
Example 8.3.11:
Let $X_1,\ldots,X_n$ be iid Bin$(1, p)$. We have seen in Example 8.3.6 that $T = \sum_{i=1}^n X_i$ is sufficient for $p$. Is it also complete?
We know that $T \sim$ Bin$(n, p)$. Thus,
$$E_p(g(T)) = \sum_{t=0}^n g(t)\binom{n}{t} p^t(1-p)^{n-t} = 0 \quad \forall p \in (0,1)$$
implies that
$$(1-p)^n \sum_{t=0}^n g(t)\binom{n}{t}\left(\frac{p}{1-p}\right)^t = 0 \quad \forall p \in (0,1).$$
However, $\sum_{t=0}^n g(t)\binom{n}{t}\left(\frac{p}{1-p}\right)^t$ is a polynomial in $\frac{p}{1-p}$ which is equal to 0 for all $p \in (0,1)$ only if all of its coefficients are 0.
Therefore, $g(t) = 0$ for $t = 0, 1, \ldots, n$. Hence, $T$ is complete.
Lecture 09:
Mo 01/29/01

Example 8.3.12:
Let $X_1,\ldots,X_n$ be iid $N(\theta, \theta^2)$. We know from Example 8.3.8 that $T = \left(\sum_{i=1}^n X_i, \sum_{i=1}^n X_i^2\right)$ is sufficient for $\theta$. Is it also complete?
We know that $\sum_{i=1}^n X_i \sim N(n\theta, n\theta^2)$. Therefore,
$$E\!\left(\Big(\sum_{i=1}^n X_i\Big)^2\right) = n\theta^2 + n^2\theta^2 = n(n+1)\theta^2, \qquad E\!\left(\sum_{i=1}^n X_i^2\right) = n(\theta^2 + \theta^2) = 2n\theta^2.$$
It follows that
$$E\!\left(2\Big(\sum_{i=1}^n X_i\Big)^2 - (n+1)\sum_{i=1}^n X_i^2\right) = 0 \quad \forall\theta.$$
But $g(x_1,\ldots,x_n) = 2\left(\sum x_i\right)^2 - (n+1)\sum x_i^2$ is not identically 0.
Therefore, $T$ is not complete.
Note:
Recall from Section 5.2 what it means to say the family of distributions $\{f_\theta : \theta \in \Theta\}$ is a one-parameter (or $k$-parameter) exponential family.
Theorem 8.3.13:
Let $\{f_\theta : \theta \in \Theta\}$ be a $k$-parameter exponential family, written as $f_\theta(x) = \exp\left(\sum_{i=1}^k T_i(x)Q_i(\theta) + D(\theta) + S(x)\right)$ as in Section 5.2, with statistics $T_1,\ldots,T_k$. Then the family of distributions of $(T_1(X),\ldots,T_k(X))$ is also a $k$-parameter exponential family, given by
$$g_\theta(t) = \exp\!\left(\sum_{i=1}^k t_i Q_i(\theta) + D(\theta) + S^*(t)\right)$$
for a suitable $S^*(t)$.
Proof:
The proof follows from our Theorems regarding the transformation of rv's.
Theorem 8.3.14:
Let $\{f_\theta : \theta \in \Theta\}$ be a $k$-parameter exponential family with $k \le n$ and let $T_1,\ldots,T_k$ be statistics as in Theorem 8.3.13. Suppose that the range of $Q = (Q_1,\ldots,Q_k)$ contains an open set in $\mathbb{R}^k$. Then $T = (T_1(X),\ldots,T_k(X))$ is a complete sufficient statistic.
Proof:
Discrete case and $k = 1$ only.
Write $Q(\theta) = \eta$ and let $(a, b)$ be an open interval contained in the range of $Q$.
It follows from the Factorization Criterion (Theorem 8.3.5) that $T$ is sufficient for $\theta$. Thus, we only have to show that $T$ is complete, i.e., that
$$E_\eta(g(T(X))) = \sum_t g(t)\, P_\eta(T(X) = t) \overset{(A)}{=} \sum_t g(t)\exp(\eta t + D(\theta) + S^*(t)) = 0 \quad \forall\eta \qquad (B)$$
implies $g(t) = 0$ for all $t$. Note that in (A) we make use of a result established in Theorem 8.3.13.
We now define functions $g^+$ and $g^-$ as:
$$g^+(t) = \begin{cases} g(t), & \text{if } g(t) \ge 0\\ 0, & \text{otherwise}\end{cases} \qquad\qquad g^-(t) = \begin{cases} -g(t), & \text{if } g(t) < 0\\ 0, & \text{otherwise}\end{cases}$$
It is $g(t) = g^+(t) - g^-(t)$, where both functions, $g^+$ and $g^-$, are non-negative. Using $g^+$ and $g^-$, it turns out that (B) is equivalent to
$$\sum_t g^+(t)\exp(\eta t + S^*(t)) = \sum_t g^-(t)\exp(\eta t + S^*(t)) \quad \forall\eta \qquad (C)$$
where the term $\exp(D(\theta))$ in (A) drops out as a constant on both sides.
If we fix $\eta_0 \in (a, b)$ and define
$$p^+(t) = \frac{g^+(t)\exp(\eta_0 t + S^*(t))}{\sum_t g^+(t)\exp(\eta_0 t + S^*(t))}, \qquad p^-(t) = \frac{g^-(t)\exp(\eta_0 t + S^*(t))}{\sum_t g^-(t)\exp(\eta_0 t + S^*(t))},$$
it is obvious that $p^+(t) \ge 0$ and $p^-(t) \ge 0$ for all $t$, and by construction $\sum_t p^+(t) = 1$ and $\sum_t p^-(t) = 1$. Hence, $p^+$ and $p^-$ are both pmf's.
From (C), it follows for the mgf's $M^+$ and $M^-$ of $p^+$ and $p^-$ that
$$M^+(\eta) = \sum_t e^{\eta t} p^+(t) = \frac{\sum_t g^+(t)\exp((\eta_0+\eta)t + S^*(t))}{\sum_t g^+(t)\exp(\eta_0 t + S^*(t))} \overset{(C)}{=} \frac{\sum_t g^-(t)\exp((\eta_0+\eta)t + S^*(t))}{\sum_t g^-(t)\exp(\eta_0 t + S^*(t))} = \sum_t e^{\eta t} p^-(t) = M^-(\eta)$$
for all $\eta \in (\underbrace{a - \eta_0}_{<0},\; \underbrace{b - \eta_0}_{>0})$.
By the uniqueness of mgf's it follows that $p^+(t) = p^-(t)$ for all $t$.
$\Longrightarrow g^+(t) = g^-(t)$ for all $t$
$\Longrightarrow g(t) = 0$ for all $t$
$\Longrightarrow T$ is complete
8.4 Unbiased Estimation
De�nition 8.4.1:
Let $\{F_\theta : \theta \in \Theta\}$, $\Theta \subseteq \mathbb{R}$, be a nonempty set of cdf's. A Borel-measurable function $T$ from $\mathbb{R}^n$ to $\Theta$ is called unbiased for $\theta$ (or an unbiased estimate of $\theta$) if
$$E_\theta(T) = \theta \quad \forall\theta\in\Theta.$$
Any function $d(\theta)$ for which an unbiased estimate $T$ exists is called an estimable function.
If $T$ is biased,
$$b(\theta, T) = E_\theta(T) - \theta$$
is called the bias of $T$.
Example 8.4.2:
If the $k$-th population moment exists, the $k$-th sample moment is an unbiased estimate of it. If $Var(X) = \sigma^2$, the sample variance $S^2$ is an unbiased estimate of $\sigma^2$.
However, note that for $X_1,\ldots,X_n$ iid $N(\mu,\sigma^2)$, $S$ is not an unbiased estimate of $\sigma$:
$$\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1} = \text{Gamma}\!\left(\frac{n-1}{2},\, 2\right)$$
$$\Longrightarrow\; E\!\left(\sqrt{\frac{(n-1)S^2}{\sigma^2}}\right) = \int_0^\infty \sqrt{x}\;\frac{x^{\frac{n-1}{2}-1} e^{-\frac{x}{2}}}{2^{\frac{n-1}{2}}\,\Gamma(\frac{n-1}{2})}\,dx = \frac{\sqrt{2}\,\Gamma(\frac{n}{2})}{\Gamma(\frac{n-1}{2})} \int_0^\infty \frac{x^{\frac{n}{2}-1} e^{-\frac{x}{2}}}{2^{\frac{n}{2}}\,\Gamma(\frac{n}{2})}\,dx \overset{(*)}{=} \frac{\sqrt{2}\,\Gamma(\frac{n}{2})}{\Gamma(\frac{n-1}{2})}$$
$$\Longrightarrow\; E(S) = \sigma\,\sqrt{\frac{2}{n-1}}\;\frac{\Gamma(\frac{n}{2})}{\Gamma(\frac{n-1}{2})}$$
(*) holds since $\frac{x^{\frac{n}{2}-1} e^{-x/2}}{2^{\frac{n}{2}}\Gamma(\frac{n}{2})}$ is the pdf of a Gamma$(\frac{n}{2}, 2)$ distribution and thus the integral is 1.
So $S$ is biased for $\sigma$ and
$$b(\sigma, S) = \sigma\left(\sqrt{\frac{2}{n-1}}\;\frac{\Gamma(\frac{n}{2})}{\Gamma(\frac{n-1}{2})} - 1\right).$$
Note:
If $T$ is unbiased for $\theta$, $g(T)$ is not necessarily unbiased for $g(\theta)$ (unless $g$ is a linear function).
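The $\Gamma$-function expression for $E(S)$ can be sanity-checked by simulation (an added sketch, not part of the notes; $\sigma = 2$ and $n = 5$ are arbitrary choices):

```python
import math
import random
import statistics

def expected_s_exact(sigma, n):
    """E(S) = sigma * sqrt(2/(n-1)) * Gamma(n/2) / Gamma((n-1)/2)."""
    return sigma * math.sqrt(2 / (n - 1)) * math.gamma(n / 2) / math.gamma((n - 1) / 2)

def expected_s_sim(sigma=2.0, n=5, reps=40000, seed=7):
    """Monte Carlo estimate of E(S) for iid N(0, sigma^2) samples of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        xs = [rng.gauss(0.0, sigma) for _ in range(n)]
        total += statistics.stdev(xs)  # S with divisor n-1
    return total / reps

sigma, n = 2.0, 5
exact = expected_s_exact(sigma, n)
approx = expected_s_sim(sigma, n)
```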
Lecture 10:
We 01/31/01

Example 8.4.3:
Unbiased estimates may not exist (see Rohatgi, page 351, Example 2) or they may be absurd, as in the following case:
Let $X \sim$ Poisson$(\lambda)$ and let $d(\lambda) = e^{-2\lambda}$. Consider $T(X) = (-1)^X$ as an estimate. It is
$$E_\lambda(T(X)) = e^{-\lambda}\sum_{x=0}^\infty (-1)^x \frac{\lambda^x}{x!} = e^{-\lambda}\sum_{x=0}^\infty \frac{(-\lambda)^x}{x!} = e^{-\lambda} e^{-\lambda} = e^{-2\lambda} = d(\lambda)$$
Hence $T$ is unbiased for $d(\lambda)$, but since $T$ alternates between $-1$ and $1$ while $d(\lambda) > 0$, $T$ is not a good estimate.
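A quick numerical check of the series computation above (an added sketch; $\lambda = 1.5$ is arbitrary, and the series is truncated at 50 terms, far past convergence):

```python
import math

def expected_value(lam, terms=50):
    """E((-1)^X) for X ~ Poisson(lam), computed from the truncated series."""
    return sum((-1) ** x * math.exp(-lam) * lam ** x / math.factorial(x)
               for x in range(terms))

lam = 1.5
series = expected_value(lam)  # unbiased for d(lam) = exp(-2*lam) ...
target = math.exp(-2 * lam)   # ... even though the estimate itself is always +1 or -1
```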
Note:
If there exist two unbiased estimates $T_1$ and $T_2$ of $\theta$, then any estimate of the form $\alpha T_1 + (1-\alpha)T_2$ for $0 < \alpha < 1$ will also be an unbiased estimate of $\theta$. Which one should we choose?
De�nition 8.4.4:
The mean square error of an estimate $T$ of $\theta$ is defined as
$$MSE(\theta, T) = E_\theta((T-\theta)^2) = Var_\theta(T) + (b(\theta, T))^2.$$
Let $\{T_i\}_{i=1}^\infty$ be a sequence of estimates of $\theta$. If
$$\lim_{i\to\infty} MSE(\theta, T_i) = 0 \quad \forall\theta\in\Theta,$$
then $\{T_i\}$ is called a mean-squared-error consistent (MSE-consistent) sequence of estimates of $\theta$.
Note:
(i) If we allow all estimates and compare their MSE, it generally depends on $\theta$ which estimate is better. For example, $\hat\theta = 17$ is perfect if $\theta = 17$, but it is lousy otherwise.
(ii) If we restrict ourselves to the class of unbiased estimates, then $MSE(\theta, T) = Var_\theta(T)$.
(iii) MSE-consistency means that both the bias and the variance of $T_i$ approach 0 as $i\to\infty$.
Definition 8.4.5:
Let $\theta_0 \in \Theta$ and let $U(\theta_0)$ be the class of all unbiased estimates $T$ of $\theta_0$ such that $E_{\theta_0}(T^2) < \infty$. Then $T_0 \in U(\theta_0)$ is called a locally minimum variance unbiased estimate (LMVUE) at $\theta_0$ if
$$E_{\theta_0}((T_0 - \theta_0)^2) \le E_{\theta_0}((T - \theta_0)^2) \quad \forall T \in U(\theta_0).$$
Definition 8.4.6:
Let $U$ be the class of all unbiased estimates $T$ of $\theta \in \Theta$ such that $E_\theta(T^2) < \infty$ for all $\theta \in \Theta$. Then $T_0 \in U$ is called a uniformly minimum variance unbiased estimate (UMVUE) of $\theta$ if
$$E_\theta((T_0 - \theta)^2) \le E_\theta((T - \theta)^2) \quad \forall\theta\in\Theta \;\forall T \in U.$$
An Excursion into Logic II
In our first "Excursion into Logic" in Lectures 10 & 11 of Stat 6710 Mathematical Statistics I on 9/20/2000 and 9/22/2000, we established the following results:
A ⇒ B is equivalent to ¬B ⇒ ¬A, which is equivalent to ¬A ∨ B:

A B | A ⇒ B | ¬A | ¬B | ¬B ⇒ ¬A | ¬A ∨ B
1 1 |   1   |  0 |  0 |    1    |   1
1 0 |   0   |  0 |  1 |    0    |   0
0 1 |   1   |  1 |  0 |    1    |   1
0 0 |   1   |  1 |  1 |    1    |   1

When dealing with formal proofs, there exists one more technique to show A ⇒ B. Equivalently, we can show (A ∧ ¬B) ⇒ 0, a technique called Proof by Contradiction. This means, assuming that A and ¬B hold, we show that this implies 0, i.e., something that is always false, i.e., a contradiction. And here is the corresponding truth table:

A B | A ⇒ B | ¬B | A ∧ ¬B | (A ∧ ¬B) ⇒ 0
1 1 |   1   |  0 |   0    |      1
1 0 |   0   |  1 |   1    |      0
0 1 |   1   |  0 |   0    |      1
0 0 |   1   |  1 |   0    |      1
Note:
We make use of this proof technique in the Proof of the next Theorem.
Example:
Let A: $x = 5$ and B: $x^2 = 25$. Obviously A ⇒ B.
But we can also prove this in the following way:
A: $x = 5$ and ¬B: $x^2 \neq 25$
⟹ $x^2 = 25$ ∧ $x^2 \neq 25$
This is impossible, i.e., a contradiction. Thus, A ⇒ B.
Theorem 8.4.7:
Let $U$ be the class of all unbiased estimates $T$ of $\theta \in \Theta$ with $E_\theta(T^2) < \infty$ for all $\theta$, and suppose that $U$ is non-empty. Let $U_0$ be the set of all unbiased estimates of 0, i.e.,
$$U_0 = \{\nu : E_\theta(\nu) = 0,\; E_\theta(\nu^2) < \infty \;\forall\theta\in\Theta\}.$$
Then $T_0 \in U$ is UMVUE iff
$$E_\theta(\nu T_0) = 0 \quad \forall\theta\in\Theta\;\forall\nu\in U_0.$$
Proof:
Note that $E_\theta(\nu T_0)$ always exists, by the Cauchy-Schwarz inequality, since $\nu$ and $T_0$ both have finite second moments.
"$\Longrightarrow$:"
We suppose that $T_0 \in U$ is UMVUE and that $E_{\theta_0}(\nu_0 T_0) \neq 0$ for some $\theta_0 \in \Theta$ and some $\nu_0 \in U_0$.
It holds that
$$E_\theta(T_0 + \lambda\nu_0) = E_\theta(T_0) = \theta \quad \forall\lambda\in\mathbb{R}\;\forall\theta\in\Theta.$$
Therefore, $T_0 + \lambda\nu_0 \in U$ for all $\lambda\in\mathbb{R}$.
Also, $E_{\theta_0}(\nu_0^2) > 0$ (since otherwise $P_{\theta_0}(\nu_0 = 0) = 1$ and then $E_{\theta_0}(\nu_0 T_0) = 0$).
Now let
$$\lambda = -\frac{E_{\theta_0}(T_0\nu_0)}{E_{\theta_0}(\nu_0^2)}.$$
Then,
$$E_{\theta_0}((T_0 + \lambda\nu_0)^2) = E_{\theta_0}(T_0^2 + 2\lambda T_0\nu_0 + \lambda^2\nu_0^2) = E_{\theta_0}(T_0^2) - 2\,\frac{(E_{\theta_0}(T_0\nu_0))^2}{E_{\theta_0}(\nu_0^2)} + \frac{(E_{\theta_0}(T_0\nu_0))^2}{E_{\theta_0}(\nu_0^2)} = E_{\theta_0}(T_0^2) - \frac{(E_{\theta_0}(T_0\nu_0))^2}{E_{\theta_0}(\nu_0^2)} < E_{\theta_0}(T_0^2),$$
and therefore,
$$Var_{\theta_0}(T_0 + \lambda\nu_0) < Var_{\theta_0}(T_0).$$
This means that $T_0$ is not UMVUE, i.e., a contradiction!
"$\Longleftarrow$:"
Suppose $E_\theta(\nu T_0) = 0$ for some $T_0 \in U$, for all $\theta\in\Theta$ and all $\nu \in U_0$.
We choose $T \in U$; then also $T_0 - T \in U_0$ and
$$E_\theta(T_0(T_0 - T)) = 0 \quad \forall\theta\in\Theta.$$
It follows by the Cauchy-Schwarz inequality (Theorem 4.5.7 (ii)) that
$$E_\theta(T_0^2) = E_\theta(T_0 T) \le (E_\theta(T_0^2))^{\frac{1}{2}}\,(E_\theta(T^2))^{\frac{1}{2}}.$$
This implies
$$(E_\theta(T_0^2))^{\frac{1}{2}} \le (E_\theta(T^2))^{\frac{1}{2}}$$
and
$$Var_\theta(T_0) \le Var_\theta(T),$$
where $T$ is an arbitrary unbiased estimate of $\theta$. Thus, $T_0$ is UMVUE.
Lecture 11:
Mo 02/05/01

Theorem 8.4.8:
Let $U$ be the non-empty class of unbiased estimates of $\theta \in \Theta$ as defined in Theorem 8.4.7. Then there exists at most one UMVUE $T \in U$ for $\theta$.
Proof:
Suppose $T_0, T_1 \in U$ are both UMVUE.
Then $T_1 - T_0 \in U_0$, $Var_\theta(T_0) = Var_\theta(T_1)$, and $E_\theta(T_0(T_1 - T_0)) = 0$ for all $\theta\in\Theta$
$\Longrightarrow E_\theta(T_0^2) = E_\theta(T_0 T_1)$
$\Longrightarrow Cov_\theta(T_0, T_1) = E_\theta(T_0 T_1) - E_\theta(T_0)E_\theta(T_1) = E_\theta(T_0^2) - (E_\theta(T_0))^2 = Var_\theta(T_0) = Var_\theta(T_1) \;\forall\theta\in\Theta$
$\Longrightarrow \rho_{T_0 T_1} = 1 \;\forall\theta\in\Theta$
$\Longrightarrow P_\theta(aT_0 + bT_1 = 0) = 1$ for some $a, b$, $\forall\theta\in\Theta$
$\Longrightarrow \theta = E_\theta(T_0) = E_\theta(-\tfrac{b}{a}T_1) = E_\theta(T_1) \;\forall\theta\in\Theta$
$\Longrightarrow -\tfrac{b}{a} = 1$
$\Longrightarrow P_\theta(T_0 = T_1) = 1 \;\forall\theta\in\Theta$
Theorem 8.4.9:
(i) If a UMVUE $T$ exists for a real function $d(\theta)$, then $\alpha T$ is the UMVUE for $\alpha\,d(\theta)$, $\alpha \in \mathbb{R}$.
(ii) If UMVUE's $T_1$ and $T_2$ exist for real functions $d_1(\theta)$ and $d_2(\theta)$, respectively, then $T_1 + T_2$ is the UMVUE for $d_1(\theta) + d_2(\theta)$.
Proof:
Homework.
Theorem 8.4.10:
If a sample consists of $n$ independent observations $X_1,\ldots,X_n$ from the same distribution, the UMVUE, if it exists, is permutation invariant.
Proof:
Homework.
Theorem 8.4.11: Rao-Blackwell
Let $\{F_\theta : \theta\in\Theta\}$ be a family of cdf's, and let $h$ be any statistic in $U$, where $U$ is the non-empty class of all unbiased estimates of $\theta$ with $E_\theta(h^2) < \infty$. Let $T$ be a sufficient statistic for $\{F_\theta : \theta\in\Theta\}$. Then the conditional expectation $E(h \mid T)$ is independent of $\theta$ and it is an unbiased estimate of $\theta$. Additionally,
$$E_\theta((E(h \mid T) - \theta)^2) \le E_\theta((h - \theta)^2) \quad \forall\theta\in\Theta,$$
with equality iff $h = E(h \mid T)$.
Proof:
By Theorem 4.7.3, $E_\theta(E(h \mid T)) = E_\theta(h) = \theta$.
Since the distribution of $X$ given $T$ does not depend on $\theta$ due to sufficiency, neither does $E(h \mid T)$ depend on $\theta$.
Thus, we only have to show that
$$E_\theta((E(h \mid T))^2) \le E_\theta(h^2) = E_\theta(E(h^2 \mid T)),$$
i.e., that
$$(E(h \mid T))^2 \le E(h^2 \mid T).$$
But the Cauchy-Schwarz inequality (Theorem 4.5.7 (ii)) gives us
$$(E(h \mid T))^2 \le E(h^2 \mid T)\,E(1 \mid T) = E(h^2 \mid T).$$
Equality holds iff
$$E_\theta((E(h \mid T))^2) = E_\theta(h^2)$$
$\Longleftrightarrow E_\theta(E(h^2 \mid T) - (E(h \mid T))^2) = 0$
$\Longleftrightarrow E_\theta(Var(h \mid T)) = 0$
$\Longleftrightarrow Var(h \mid T) = 0$
$\Longleftrightarrow E(h^2 \mid T) = (E(h \mid T))^2$
$\Longleftrightarrow h$ is a function of $T$ and $h = E(h \mid T)$.
For the proof of the last step, see Rohatgi, pages 170-171, Theorem 2, Corollary, and Proof of the Corollary.
Theorem 8.4.12: Lehmann-Scheffé
If $T$ is a complete sufficient statistic and if there exists an unbiased estimate $h$ of $\theta$, then $E(h \mid T)$ is the (unique) UMVUE.
Proof:
Suppose that $h_1, h_2 \in U$. Then $E(h_1 \mid T)$ and $E(h_2 \mid T)$ are both unbiased for $\theta$ by Theorem 8.4.11.
Therefore,
$$E_\theta(E(h_1 \mid T) - E(h_2 \mid T)) = 0 \quad \forall\theta\in\Theta.$$
Since $T$ is complete, $E(h_1 \mid T) = E(h_2 \mid T)$.
Therefore, $E(h \mid T)$ must be the same for all $h \in U$, and $E(h \mid T)$ improves on every $h \in U$. Therefore, $E(h \mid T)$ is UMVUE by Theorem 8.4.11.
Note:
We can use Theorem 8.4.12 to find the UMVUE in two ways if we have a complete sufficient statistic $T$:
(i) If we can find an unbiased estimate $h(T)$, it will be the UMVUE since $E(h(T) \mid T) = h(T)$.
(ii) If we have any unbiased estimate $h$ and if we can calculate $E(h \mid T)$, then $E(h \mid T)$ will be the UMVUE. The process of determining the UMVUE this way is often called Rao-Blackwellization.
(iii) Even if a complete sufficient statistic does not exist, the UMVUE may still exist (see Rohatgi, pages 357-358, Example 10).
Example 8.4.13:
Let $X_1,\ldots,X_n$ be iid Bin$(1, p)$. Then $T = \sum_{i=1}^n X_i$ is a complete sufficient statistic, as seen in Examples 8.3.6 and 8.3.11.
Since $E(X_1) = p$, $X_1$ is an unbiased estimate of $p$. However, since $X_1$ is not a function of $T$, $X_1$ is not the UMVUE obtained via part (i) of the Note above.
We can use part (ii) of the Note above to construct the UMVUE. It is
$$P(X_1 = x \mid T) = \begin{cases} \dfrac{T}{n}, & x = 1\\[6pt] \dfrac{n-T}{n}, & x = 0\end{cases}$$
$\Longrightarrow E(X_1 \mid T) = \frac{T}{n} = \bar{X}$
$\Longrightarrow \bar{X}$ is the UMVUE for $p$
If we are interested in the UMVUE for $d(p) = p(1-p) = p - p^2 = Var(X_1)$, we can find it in the following way:
$$E(T) = np$$
$$E(T^2) = E\!\left(\sum_{i=1}^n X_i^2 + \sum_{i=1}^n \sum_{j=1, j\neq i}^n X_i X_j\right) = np + n(n-1)p^2$$
$$\Longrightarrow\; E\!\left(\frac{nT}{n(n-1)}\right) = \frac{np}{n-1}, \qquad E\!\left(\frac{T^2}{n(n-1)}\right) = \frac{p}{n-1} + p^2$$
$$\Longrightarrow\; E\!\left(\frac{nT - T^2}{n(n-1)}\right) = \frac{np}{n-1} - \frac{p}{n-1} - p^2 = \frac{(n-1)p}{n-1} - p^2 = p - p^2 = d(p)$$
Thus, due to part (i) of the Note above, $\frac{nT - T^2}{n(n-1)}$ is the UMVUE for $d(p) = p(1-p)$.
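The unbiasedness claim for $\frac{nT - T^2}{n(n-1)}$ can be verified exactly by summing against the Bin$(n, p)$ pmf over a small grid of $n$ and $p$ (an added sketch, not part of the notes):

```python
import math

def expected_umvue(n, p):
    """E[(nT - T^2)/(n(n-1))] for T ~ Bin(n, p), computed exactly."""
    total = 0.0
    for t in range(n + 1):
        pmf = math.comb(n, t) * p ** t * (1 - p) ** (n - t)
        total += (n * t - t * t) / (n * (n - 1)) * pmf
    return total

# For every (n, p) in the grid, the expectation should equal p*(1-p).
checks = [(n, p, expected_umvue(n, p)) for n in (2, 5, 10) for p in (0.1, 0.5, 0.9)]
```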
Lecture 12:
We 02/07/01

8.5 Lower Bounds for the Variance of an Estimate
Theorem 8.5.1: Cramér-Rao Lower Bound (CRLB)
Let $\Theta$ be an open interval of $\mathbb{R}$. Let $\{f_\theta : \theta\in\Theta\}$ be a family of pdf's or pmf's. Assume that the set $\{x : f_\theta(x) = 0\}$ is independent of $\theta$.
Let $\psi(\theta)$ be defined on $\Theta$ and let it be differentiable for all $\theta\in\Theta$. Let $T$ be an unbiased estimate of $\psi(\theta)$ such that $E_\theta(T^2) < \infty$ for all $\theta\in\Theta$. Suppose that
(i) $\frac{\partial f_\theta(x)}{\partial\theta}$ is defined for all $\theta\in\Theta$,
(ii) for a pdf $f_\theta$
$$\frac{\partial}{\partial\theta}\left(\int f_\theta(x)\,dx\right) = \int \frac{\partial f_\theta(x)}{\partial\theta}\,dx = 0 \quad \forall\theta\in\Theta$$
or for a pmf $f_\theta$
$$\frac{\partial}{\partial\theta}\left(\sum_x f_\theta(x)\right) = \sum_x \frac{\partial f_\theta(x)}{\partial\theta} = 0 \quad \forall\theta\in\Theta,$$
(iii) for a pdf $f_\theta$
$$\frac{\partial}{\partial\theta}\left(\int T(x) f_\theta(x)\,dx\right) = \int T(x)\,\frac{\partial f_\theta(x)}{\partial\theta}\,dx \quad \forall\theta\in\Theta$$
or for a pmf $f_\theta$
$$\frac{\partial}{\partial\theta}\left(\sum_x T(x) f_\theta(x)\right) = \sum_x T(x)\,\frac{\partial f_\theta(x)}{\partial\theta} \quad \forall\theta\in\Theta.$$
Let $\lambda : \Theta \to \mathbb{R}$ be any measurable function. Then it holds that
$$(\psi'(\theta))^2 \le E_\theta\big((T(X) - \lambda(\theta))^2\big)\; E_\theta\!\left(\left(\frac{\partial\log f_\theta(X)}{\partial\theta}\right)^2\right) \quad \forall\theta\in\Theta. \qquad (A)$$
Further, for any $\theta_0 \in \Theta$, either $\psi'(\theta_0) = 0$ and equality holds in (A) for $\theta = \theta_0$, or we have
$$E_{\theta_0}\big((T(X) - \lambda(\theta_0))^2\big) \ge \frac{(\psi'(\theta_0))^2}{E_{\theta_0}\!\left(\left(\frac{\partial\log f_\theta(X)}{\partial\theta}\right)^2\right)}. \qquad (B)$$
Finally, if equality holds in (B), then there exists a real number $K(\theta_0) \neq 0$ such that
$$T(X) - \lambda(\theta_0) = K(\theta_0)\left.\frac{\partial\log f_\theta(X)}{\partial\theta}\right|_{\theta=\theta_0} \qquad (C)$$
with probability 1, provided that $T$ is not a constant.
Note:
(i) Conditions (i), (ii), and (iii) are called regularity conditions. Conditions under which they hold can be found in Rohatgi, pages 11-13, Parts 12 and 13.
(ii) The right-hand side of inequality (B) is called the Cramér-Rao Lower Bound of $\theta_0$, or, in symbols, CRLB$(\theta_0)$.
Proof:
From (ii), we get
$$E_\theta\!\left(\frac{\partial}{\partial\theta}\log f_\theta(X)\right) = \int \left(\frac{\partial}{\partial\theta}\log f_\theta(x)\right) f_\theta(x)\,dx = \int \left(\frac{\partial}{\partial\theta} f_\theta(x)\right)\frac{1}{f_\theta(x)}\, f_\theta(x)\,dx = \int \frac{\partial}{\partial\theta} f_\theta(x)\,dx = 0$$
$$\Longrightarrow\; E_\theta\!\left(\lambda(\theta)\,\frac{\partial}{\partial\theta}\log f_\theta(X)\right) = 0$$
From (iii), we get
$$E_\theta\!\left(T(X)\,\frac{\partial}{\partial\theta}\log f_\theta(X)\right) = \int \left(T(x)\,\frac{\partial}{\partial\theta}\log f_\theta(x)\right) f_\theta(x)\,dx = \int \left(T(x)\,\frac{\partial}{\partial\theta} f_\theta(x)\right)\frac{1}{f_\theta(x)}\, f_\theta(x)\,dx = \int T(x)\,\frac{\partial}{\partial\theta} f_\theta(x)\,dx$$
$$\overset{(iii)}{=} \frac{\partial}{\partial\theta}\left(\int T(x) f_\theta(x)\,dx\right) = \frac{\partial}{\partial\theta} E_\theta(T(X)) = \frac{\partial}{\partial\theta}\psi(\theta) = \psi'(\theta)$$
$$\Longrightarrow\; E_\theta\!\left((T(X) - \lambda(\theta))\,\frac{\partial}{\partial\theta}\log f_\theta(X)\right) = \psi'(\theta)$$
$$\Longrightarrow\; (\psi'(\theta))^2 = \left(E_\theta\!\left((T(X) - \lambda(\theta))\,\frac{\partial}{\partial\theta}\log f_\theta(X)\right)\right)^2 \overset{(*)}{\le} E_\theta\big((T(X) - \lambda(\theta))^2\big)\; E_\theta\!\left(\left(\frac{\partial}{\partial\theta}\log f_\theta(X)\right)^2\right),$$
i.e., (A) holds. (*) follows from the Cauchy-Schwarz inequality (Theorem 4.5.7 (ii)).
If $\psi'(\theta_0) \neq 0$, then the left-hand side of (A) is $> 0$. Therefore, the right-hand side is $> 0$. Thus,
$$E_{\theta_0}\!\left(\left(\frac{\partial}{\partial\theta}\log f_\theta(X)\right)^2\right) > 0,$$
and (B) follows directly from (A).
If $\psi'(\theta_0) = 0$, but equality does not hold in (A), then again
$$E_{\theta_0}\!\left(\left(\frac{\partial}{\partial\theta}\log f_\theta(X)\right)^2\right) > 0,$$
and (B) follows directly from (A).
Finally, if equality holds in (B), then $\psi'(\theta_0) \neq 0$ (because $T$ is not constant). Thus, $E_{\theta_0}((T(X) - \lambda(\theta_0))^2) > 0$. The Cauchy-Schwarz inequality (Theorem 4.5.7 (iii)) gives equality iff there exist constants $(\alpha, \beta) \in \mathbb{R}^2 \setminus \{(0,0)\}$ such that
$$P\!\left(\alpha\,(T(X) - \lambda(\theta_0)) + \beta\left.\frac{\partial}{\partial\theta}\log f_\theta(X)\right|_{\theta=\theta_0} = 0\right) = 1.$$
This implies $K(\theta_0) = -\frac{\beta}{\alpha}$ and (C) holds. Since $T$ is not a constant, it also holds that $K(\theta_0) \neq 0$.
Example 8.5.2:
If we take $\lambda(\theta) = \psi(\theta)$, we get from (B)
$$Var_\theta(T(X)) \ge \frac{(\psi'(\theta))^2}{E_\theta\!\left(\left(\frac{\partial\log f_\theta(X)}{\partial\theta}\right)^2\right)}. \qquad (*)$$
If we have $\psi(\theta) = \theta$, the inequality (*) above reduces to
$$Var_\theta(T(X)) \ge \left(E_\theta\!\left(\left(\frac{\partial\log f_\theta(X)}{\partial\theta}\right)^2\right)\right)^{-1}.$$
Finally, if $X = (X_1,\ldots,X_n)$ iid with identical $f_\theta(x)$, the inequality (*) reduces to
$$Var_\theta(T(X)) \ge \frac{(\psi'(\theta))^2}{n\, E_\theta\!\left(\left(\frac{\partial\log f_\theta(X_1)}{\partial\theta}\right)^2\right)}.$$
Example 8.5.3:
Let $X_1,\ldots,X_n$ be iid Bin$(1, p)$ and let $X = \sum_{i=1}^n X_i \sim$ Bin$(n, p)$, $p \in \Theta = (0,1) \subseteq \mathbb{R}$. Let
$$\psi(p) = E(T(X)) = \sum_{x=0}^n T(x)\binom{n}{x} p^x(1-p)^{n-x}.$$
$\psi(p)$ is differentiable with respect to $p$ under the summation sign since it is a finite polynomial in $p$.
Since $X = \sum_{i=1}^n X_i$ with $f_p(x_1) = p^{x_1}(1-p)^{1-x_1}$, $x_1 \in \{0,1\}$,
$$\Longrightarrow\; \log f_p(x_1) = x_1 \log p + (1-x_1)\log(1-p)$$
$$\Longrightarrow\; \frac{\partial}{\partial p}\log f_p(x_1) = \frac{x_1}{p} - \frac{1-x_1}{1-p} = \frac{x_1(1-p) - p(1-x_1)}{p(1-p)} = \frac{x_1 - p}{p(1-p)}$$
$$\Longrightarrow\; E_p\!\left(\left(\frac{\partial}{\partial p}\log f_p(X_1)\right)^2\right) = \frac{Var(X_1)}{p^2(1-p)^2} = \frac{1}{p(1-p)}$$
So, if $\psi(p) = \lambda(p) = p$ and if $T$ is unbiased for $p$, then
$$Var_p(T(X)) \ge \frac{1}{n\,\frac{1}{p(1-p)}} = \frac{p(1-p)}{n}.$$
Since $Var(\bar{X}) = \frac{p(1-p)}{n}$, $\bar{X}$ attains the CRLB. Therefore, $\bar{X}$ is the UMVUE.
Lecture 13:
Fr 02/09/01

Example 8.5.4:
Let $X_1,\ldots,X_n$ be iid $U(0, \theta)$, $\theta \in \Theta = (0,\infty) \subseteq \mathbb{R}$, and let $X$ denote one such observation.
$$f_\theta(x) = \frac{1}{\theta}\, I_{(0,\theta)}(x) \;\Longrightarrow\; \log f_\theta(x) = -\log\theta \;\Longrightarrow\; \left(\frac{\partial}{\partial\theta}\log f_\theta(x)\right)^2 = \frac{1}{\theta^2} \;\Longrightarrow\; E_\theta\!\left(\left(\frac{\partial}{\partial\theta}\log f_\theta(X)\right)^2\right) = \frac{1}{\theta^2}$$
Thus, the "CRLB" is $\frac{\theta^2}{n}$.
We know that $\frac{n+1}{n}X_{(n)}$ is the UMVUE since it is a function of the complete sufficient statistic $X_{(n)}$ (see Homework) and $E(X_{(n)}) = \frac{n}{n+1}\theta$. It is
$$Var\!\left(\frac{n+1}{n}X_{(n)}\right) = \frac{\theta^2}{n(n+2)} < \frac{\theta^2}{n}\;???$$
How is this possible? Quite simply because one of the required conditions for Theorem 8.5.1 does not hold: the support of $X$ depends on $\theta$.
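The paradox can be reproduced by simulation: the variance of $\frac{n+1}{n}X_{(n)}$ lands near $\frac{\theta^2}{n(n+2)}$, strictly below the formal bound $\frac{\theta^2}{n}$ (an added sketch; $\theta = 3$ and $n = 10$ are arbitrary):

```python
import random
import statistics

def umvue_variance_sim(theta=3.0, n=10, reps=40000, seed=11):
    """Monte Carlo mean and variance of (n+1)/n * X_(n) for iid U(0, theta)."""
    rng = random.Random(seed)
    ests = [(n + 1) / n * max(rng.uniform(0.0, theta) for _ in range(n))
            for _ in range(reps)]
    return statistics.fmean(ests), statistics.pvariance(ests)

theta, n = 3.0, 10
mean_est, var_est = umvue_variance_sim(theta, n)
theory_var = theta ** 2 / (n * (n + 2))  # theta^2 / (n(n+2))
naive_bound = theta ** 2 / n             # the formal "CRLB", not valid here
```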
Theorem 8.5.5: Chapman, Robbins, Kiefer Inequality (CRK Inequality)
Let $\Theta \subseteq \mathbb{R}$. Let $\{f_\theta : \theta\in\Theta\}$ be a family of pdf's or pmf's. Let $\psi(\theta)$ be defined on $\Theta$. Let $T$ be an unbiased estimate of $\psi(\theta)$ such that $E_\theta(T^2) < \infty$ for all $\theta\in\Theta$.
If $\theta \neq \vartheta$, $\theta, \vartheta \in \Theta$, assume that $f_\theta(x)$ and $f_\vartheta(x)$ are different. Also assume that there exists a $\vartheta \in \Theta$ such that $\vartheta \neq \theta$ and
$$S(\vartheta) = \{x : f_\vartheta(x) > 0\} \subseteq S(\theta) = \{x : f_\theta(x) > 0\}.$$
Then it holds that
$$Var_\theta(T(X)) \ge \sup_{\{\vartheta :\, S(\vartheta)\subseteq S(\theta),\, \vartheta\neq\theta\}} \frac{(\psi(\vartheta) - \psi(\theta))^2}{Var_\theta\!\left(\frac{f_\vartheta(X)}{f_\theta(X)}\right)} \quad \forall\theta\in\Theta.$$
Proof:
Since $T$ is unbiased, it follows that
$$E_\vartheta(T(X)) = \psi(\vartheta) \quad \forall\vartheta\in\Theta.$$
For $\vartheta \neq \theta$ and $S(\vartheta) \subseteq S(\theta)$, it follows that
$$\int_{S(\theta)} T(x)\,\frac{f_\vartheta(x) - f_\theta(x)}{f_\theta(x)}\, f_\theta(x)\,dx = E_\vartheta(T(X)) - E_\theta(T(X)) = \psi(\vartheta) - \psi(\theta)$$
and
$$0 = \int_{S(\theta)} \frac{f_\vartheta(x) - f_\theta(x)}{f_\theta(x)}\, f_\theta(x)\,dx = E_\theta\!\left(\frac{f_\vartheta(X)}{f_\theta(X)} - 1\right).$$
Therefore,
$$Cov_\theta\!\left(T(X),\; \frac{f_\vartheta(X)}{f_\theta(X)} - 1\right) = \psi(\vartheta) - \psi(\theta).$$
It follows by the Cauchy-Schwarz inequality (Theorem 4.5.7 (ii)) that
$$(\psi(\vartheta) - \psi(\theta))^2 = \left(Cov_\theta\!\left(T(X),\, \frac{f_\vartheta(X)}{f_\theta(X)} - 1\right)\right)^2 \le Var_\theta(T(X))\; Var_\theta\!\left(\frac{f_\vartheta(X)}{f_\theta(X)} - 1\right) = Var_\theta(T(X))\; Var_\theta\!\left(\frac{f_\vartheta(X)}{f_\theta(X)}\right).$$
Thus,
$$Var_\theta(T(X)) \ge \frac{(\psi(\vartheta) - \psi(\theta))^2}{Var_\theta\!\left(\frac{f_\vartheta(X)}{f_\theta(X)}\right)}.$$
Finally, we take the supremum of the right-hand side with respect to $\{\vartheta : S(\vartheta) \subseteq S(\theta),\; \vartheta \neq \theta\}$, which completes the proof.
Note:
(i) The CRK inequality holds without the previous regularity conditions.
(ii) An alternative form of the CRK inequality is:
Let $\theta, \theta+\delta \in \Theta$, $\delta \neq 0$, be distinct with $S(\theta+\delta) \subseteq S(\theta)$. Let $\psi(\theta) = \theta$. Define
$$J = J(\theta, \delta) = \frac{1}{\delta^2}\left(\left(\frac{f_{\theta+\delta}(X)}{f_\theta(X)}\right)^2 - 1\right).$$
Then the CRK inequality reads as
$$Var_\theta(T(X)) \ge \frac{1}{\inf_\delta E_\theta(J)},$$
with the infimum taken over $\delta \neq 0$ such that $S(\theta+\delta) \subseteq S(\theta)$.
(iii) The CRK inequality works for discrete $\Theta$; the CRLB does not work in such cases.
Example 8.5.6:
Let $X \sim U(0, \theta)$, $\theta > 0$, based on a single observation. The required conditions for the CRLB are not met. Recall from Example 8.5.4 that $\frac{n+1}{n}X_{(n)}$ is UMVUE with $Var(\frac{n+1}{n}X_{(n)}) = \frac{\theta^2}{n(n+2)} < \frac{\theta^2}{n} =$ "CRLB".
Let $\psi(\theta) = \theta$. If $\vartheta < \theta$, then $S(\vartheta) \subseteq S(\theta)$. It is
$$E_\theta\!\left(\left(\frac{f_\vartheta(X)}{f_\theta(X)}\right)^2\right) = \int_0^\vartheta \left(\frac{\theta}{\vartheta}\right)^2 \frac{1}{\theta}\,dx = \frac{\theta}{\vartheta}, \qquad E_\theta\!\left(\frac{f_\vartheta(X)}{f_\theta(X)}\right) = \int_0^\vartheta \frac{\theta}{\vartheta}\cdot\frac{1}{\theta}\,dx = 1$$
$$\Longrightarrow\; Var_\theta(T(X)) \ge \sup_{\{\vartheta :\, 0<\vartheta<\theta\}} \frac{(\vartheta - \theta)^2}{\frac{\theta}{\vartheta} - 1} = \sup_{\{\vartheta :\, 0<\vartheta<\theta\}} \vartheta(\theta - \vartheta) = \frac{\theta^2}{4}$$
Since $X$ is complete and sufficient and $2X$ is unbiased for $\theta$, $T(X) = 2X$ is the UMVUE. It is
$$Var_\theta(2X) = 4\,Var_\theta(X) = 4\cdot\frac{\theta^2}{12} = \frac{\theta^2}{3} > \frac{\theta^2}{4}.$$
Since the CRK lower bound is not attained by the UMVUE, it is not attained by any unbiased estimate of $\theta$.
Definition 8.5.7:
Let $T_1, T_2$ be unbiased estimates of $\theta$ with $E_\theta(T_1^2) < \infty$ and $E_\theta(T_2^2) < \infty$ for all $\theta\in\Theta$. We define the efficiency of $T_1$ relative to $T_2$ by
$$\mathrm{eff}_\theta(T_1, T_2) = \frac{Var_\theta(T_1)}{Var_\theta(T_2)}$$
and say that $T_1$ is more efficient than $T_2$ if $\mathrm{eff}_\theta(T_1, T_2) < 1$.
Definition 8.5.8:
Assume the regularity conditions of Theorem 8.5.1 are satisfied by a family of cdf's $\{F_\theta : \theta\in\Theta\}$, $\Theta \subseteq \mathbb{R}$. An unbiased estimate $T$ for $\theta$ is most efficient for the family $\{F_\theta\}$ if
$$Var_\theta(T) = \left(E_\theta\!\left(\left(\frac{\partial\log f_\theta(X)}{\partial\theta}\right)^2\right)\right)^{-1}.$$
Definition 8.5.9:
Let $T$ be the most efficient estimate for the family of cdf's $\{F_\theta : \theta\in\Theta\}$, $\Theta \subseteq \mathbb{R}$. Then the efficiency of any unbiased estimate $T_1$ of $\theta$ is defined as
$$\mathrm{eff}_\theta(T_1) = \mathrm{eff}_\theta(T_1, T) = \frac{Var_\theta(T_1)}{Var_\theta(T)}.$$
Definition 8.5.10:
$T_1$ is asymptotically (most) efficient if $T_1$ is asymptotically unbiased, i.e., $\lim_{n\to\infty} E_\theta(T_1) = \theta$, and $\lim_{n\to\infty} \mathrm{eff}_\theta(T_1) = 1$, where $n$ is the sample size.
Lecture 14:
Mo 02/12/01

Theorem 8.5.11:
A necessary and sufficient condition for an estimate $T$ of $\theta$ to be most efficient is that $T$ is sufficient and
$$\frac{1}{K(\theta)}\,(T(x) - \theta) = \frac{\partial\log f_\theta(x)}{\partial\theta} \quad \forall\theta\in\Theta, \qquad (*)$$
where $K(\theta)$ is defined as in Theorem 8.5.1 and the regularity conditions for Theorem 8.5.1 hold.
Proof:
"$\Longrightarrow$:"
Theorem 8.5.1 says that if $T$ is most efficient, then (*) holds.
Assume that $\Theta = \mathbb{R}$. We define
$$C(\theta_0) = \int_{-\infty}^{\theta_0} \frac{1}{K(\theta)}\,d\theta, \qquad D(\theta_0) = \int_{-\infty}^{\theta_0} \frac{\theta}{K(\theta)}\,d\theta, \qquad R(x) = \lim_{\theta\to-\infty} \log f_\theta(x) - c(x),$$
where $c(x)$ is the constant of integration arising below.
Integrating (*) with respect to $\theta$ gives
$$\int_{-\infty}^{\theta_0} \frac{1}{K(\theta)}\,T(x)\,d\theta - \int_{-\infty}^{\theta_0} \frac{\theta}{K(\theta)}\,d\theta = \int_{-\infty}^{\theta_0} \frac{\partial\log f_\theta(x)}{\partial\theta}\,d\theta$$
$\Longrightarrow T(x)\,C(\theta_0) - D(\theta_0) = \log f_\theta(x)\big|_{-\infty}^{\theta_0} + c(x)$
$\Longrightarrow T(x)\,C(\theta_0) - D(\theta_0) = \log f_{\theta_0}(x) - \lim_{\theta\to-\infty}\log f_\theta(x) + c(x)$
$\Longrightarrow T(x)\,C(\theta_0) - D(\theta_0) = \log f_{\theta_0}(x) - R(x)$
Therefore,
$$f_{\theta_0}(x) = \exp\big(T(x)\,C(\theta_0) - D(\theta_0) + R(x)\big),$$
which belongs to an exponential family. Thus, $T$ is sufficient.
"$\Longleftarrow$:"
From (*), we get
$$E_\theta\!\left(\left(\frac{\partial\log f_\theta(X)}{\partial\theta}\right)^2\right) = \frac{1}{(K(\theta))^2}\, Var_\theta(T(X)).$$
Additionally, it holds that
$$E_\theta\!\left((T(X) - \theta)\,\frac{\partial\log f_\theta(X)}{\partial\theta}\right) = 1,$$
as shown in the Proof of Theorem 8.5.1 (with $\psi(\theta) = \lambda(\theta) = \theta$).
Using (*) in the line above, we get
$$K(\theta)\, E_\theta\!\left(\left(\frac{\partial\log f_\theta(X)}{\partial\theta}\right)^2\right) = 1, \quad\text{i.e.,}\quad K(\theta) = \left(E_\theta\!\left(\left(\frac{\partial\log f_\theta(X)}{\partial\theta}\right)^2\right)\right)^{-1}.$$
Therefore,
$$Var_\theta(T(X)) = \left(E_\theta\!\left(\left(\frac{\partial\log f_\theta(X)}{\partial\theta}\right)^2\right)\right)^{-1},$$
i.e., $T$ is most efficient for $\theta$.
Note:
Instead of saying "a necessary and sufficient condition for an estimate $T$ of $\theta$ to be most efficient ..." in the previous Theorem, we could say that "an estimate $T$ of $\theta$ is most efficient iff ...", i.e., "necessary and sufficient" means the same as "iff".
8.6 The Method of Moments
Definition 8.6.1:
Let $X_1,\ldots,X_n$ be iid with pdf (or pmf) $f_\theta$, $\theta\in\Theta$. We assume that the first $k$ moments $m_1,\ldots,m_k$ of $f_\theta$ exist. If $\theta$ can be written as
$$\theta = h(m_1,\ldots,m_k),$$
where $h : \mathbb{R}^k \to \mathbb{R}$ is a Borel-measurable function, the method of moments estimate (mom) of $\theta$ is
$$\hat\theta_{mom} = T(X_1,\ldots,X_n) = h\!\left(\frac{1}{n}\sum_{i=1}^n X_i,\; \frac{1}{n}\sum_{i=1}^n X_i^2,\; \ldots,\; \frac{1}{n}\sum_{i=1}^n X_i^k\right).$$
Note:
(i) The Definition above can also be used to estimate joint moments. For example, we use $\frac{1}{n}\sum_{i=1}^n X_iY_i$ to estimate $E(XY)$.
(ii) Since $E\!\left(\frac{1}{n}\sum_{i=1}^n X_i^j\right) = m_j$, method of moments estimates are unbiased for the population moments. The WLLN and the CLT say that these estimates are consistent and asymptotically Normal as well.
(iii) If $\theta$ is not a linear function of the population moments, $\hat\theta_{mom}$ will, in general, not be unbiased. However, it will be consistent and (usually) asymptotically Normal.
(iv) Method of moments estimates do not exist if the related moments do not exist.
(v) Method of moments estimates may not be unique. If there exist multiple choices for the mom, one usually takes the estimate involving the lowest-order sample moment.
(vi) Alternative method of moments estimates can be obtained from central moments (rather than from raw moments) or by using moments other than the first $k$ moments.
Example 8.6.2:
Let $X_1,\ldots,X_n$ be iid $N(\mu,\sigma^2)$.
Since $\mu = m_1$, it is $\hat\mu_{mom} = \bar{X}$.
This is an unbiased, consistent, and asymptotically Normal estimate.
Since $\sigma = \sqrt{m_2 - m_1^2}$, it is $\hat\sigma_{mom} = \sqrt{\frac{1}{n}\sum_{i=1}^n X_i^2 - \bar{X}^2}$.
This is a consistent, asymptotically Normal estimate. However, it is not unbiased.
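The two moment estimates are direct functions of the first two sample moments; a small sketch (added here for illustration; $\mu = 10$ and $\sigma = 3$ are arbitrary choices, and the large sample size makes both estimates land close to the truth):

```python
import math
import random

def mom_normal(xs):
    """Method of moments estimates (mu_hat, sigma_hat) for N(mu, sigma^2)."""
    n = len(xs)
    m1 = sum(xs) / n                 # first sample moment
    m2 = sum(x * x for x in xs) / n  # second sample moment
    return m1, math.sqrt(m2 - m1 * m1)

rng = random.Random(12)
data = [rng.gauss(10.0, 3.0) for _ in range(50000)]
mu_hat, sigma_hat = mom_normal(data)
```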
Example 8.6.3:
Let $X_1,\ldots,X_n$ be iid Poisson$(\lambda)$.
We know that $E(X_1) = Var(X_1) = \lambda$.
Thus, $\bar{X}$ and $\frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2$ are possible choices for the mom of $\lambda$. Due to part (v) of the Note above, one uses $\hat\lambda_{mom} = \bar{X}$.
8.7 Maximum Likelihood Estimation
Definition 8.7.1:
Let $(X_1,\ldots,X_n)$ be an $n$-rv with pdf (or pmf) $f_\theta(x_1,\ldots,x_n)$, $\theta\in\Theta$. We call the function
$$L(\theta; x_1,\ldots,x_n) = f_\theta(x_1,\ldots,x_n)$$
of $\theta$ the likelihood function.
Note:
(i) Often $\theta$ is a vector of parameters.
(ii) If $X_1,\ldots,X_n$ are iid with pdf (or pmf) $f_\theta(x)$, then $L(\theta; x_1,\ldots,x_n) = \prod_{i=1}^n f_\theta(x_i)$.
Definition 8.7.2:
A maximum likelihood estimate (MLE) is a non-constant estimate $\hat\theta_{ML}$ such that
$$L(\hat\theta_{ML}; x_1,\ldots,x_n) = \sup_{\theta\in\Theta} L(\theta; x_1,\ldots,x_n).$$
Note:
It is often convenient to work with $\log L$ when determining the maximum likelihood estimate. Since the log is monotone, the maximum is attained at the same $\theta$.
Example 8.7.3:
Let X₁, …, Xₙ be iid N(μ, σ²), where μ and σ² are unknown.

L(μ, σ²; x₁, …, xₙ) = (1 / (σⁿ (2π)^{n/2})) exp( − Σ_{i=1}^n (xᵢ − μ)² / (2σ²) )

⟹ log L(μ, σ²; x₁, …, xₙ) = −(n/2) log σ² − (n/2) log(2π) − Σ_{i=1}^n (xᵢ − μ)² / (2σ²)

The MLE must satisfy

∂ log L/∂μ = (1/σ²) Σ_{i=1}^n (xᵢ − μ) = 0   (A)

∂ log L/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^n (xᵢ − μ)² = 0   (B)

These are the two likelihood equations. From equation (A) we get μ̂_ML = X̄. Substituting this for μ into equation (B) and solving for σ², we get σ̂²_ML = (1/n) Σ_{i=1}^n (Xᵢ − X̄)². Note that σ̂²_ML is biased for σ².

Formally, we still have to verify that we found a maximum (and not a minimum) and that the likelihood function does not take its absolute maximum at a parameter θ on the edge of the parameter space Θ, which would not be detectable by our approach for local extrema.
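The closed-form solutions of the likelihood equations (A) and (B) can be checked numerically. This sketch (our own illustration; the parameter values 5 and 2 and the sample size are assumptions) computes the MLE's and verifies that nearby parameter values give a strictly smaller log-likelihood.

```python
import math
import random

random.seed(1)
x = [random.gauss(5.0, 2.0) for _ in range(1000)]
n = len(x)

def loglik(mu, sigma2):
    # log L(mu, sigma^2) for an iid N(mu, sigma^2) sample
    return (-n / 2 * math.log(sigma2) - n / 2 * math.log(2 * math.pi)
            - sum((xi - mu) ** 2 for xi in x) / (2 * sigma2))

mu_hat = sum(x) / n                               # solves equation (A)
s2_hat = sum((xi - mu_hat) ** 2 for xi in x) / n  # solves (B); note 1/n, not 1/(n-1)

# The closed-form MLE beats any nearby parameter values.
best = loglik(mu_hat, s2_hat)
for dmu in (-0.1, 0.1):
    for ds in (-0.1, 0.1):
        assert loglik(mu_hat + dmu, s2_hat + ds) < best
print(mu_hat, s2_hat)
```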
Lecture 15:
We 02/14/01

Example 8.7.4:
Let X₁, …, Xₙ be iid U(θ − 1/2, θ + 1/2).

L(θ; x₁, …, xₙ) = 1 if θ − 1/2 ≤ xᵢ ≤ θ + 1/2 ∀ i = 1, …, n; 0 otherwise.

Therefore, any θ̂(X) such that max(X) − 1/2 ≤ θ̂(X) ≤ min(X) + 1/2 is an MLE. Obviously, the MLE is not unique.
Example 8.7.5:
Let X ∼ Bin(1, p), p ∈ [1/4, 3/4].

L(p; x) = p if x = 1; 1 − p if x = 0.

This is maximized by

p̂ = 3/4 if x = 1; 1/4 if x = 0, i.e., p̂ = (2x + 1)/4.

It is

E_p(p̂) = (3/4) p + (1/4)(1 − p) = (1/2) p + 1/4

MSE_p(p̂) = E_p((p̂ − p)²)
= E_p( ((2X + 1)/4 − p)² )
= (1/16) E_p( (2X + 1 − 4p)² )
= (1/16) E_p( 4X² + 4X − 16pX − 8p + 1 + 16p² )
= (1/16) ( 4(p(1 − p) + p²) + 4p − 16p² − 8p + 1 + 16p² )
= 1/16

So p̂ is biased with MSE_p(p̂) = 1/16. If we compare this with p̃ = 1/2 regardless of the data, we have

MSE_p(1/2) = E_p((1/2 − p)²) = (1/2 − p)² ≤ 1/16 ∀ p ∈ [1/4, 3/4].

Thus, in this example the MLE is worse than the trivial estimate when comparing their MSE's.
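The MSE comparison in Example 8.7.5 can be verified by direct enumeration, since X takes only the values 0 and 1. This short check (our own illustration) confirms that the MLE's MSE is constant 1/16 on [1/4, 3/4] and that the trivial estimate is never worse there.

```python
# Exact MSE of p_hat = (2X+1)/4 for X ~ Bin(1, p), by enumeration,
# versus the trivial estimate p_tilde = 1/2.
def mse_phat(p):
    # X=1 w.p. p -> p_hat = 3/4; X=0 w.p. 1-p -> p_hat = 1/4
    return p * (3/4 - p) ** 2 + (1 - p) * (1/4 - p) ** 2

def mse_trivial(p):
    return (1/2 - p) ** 2

for k in range(51):
    p = 0.25 + 0.5 * k / 50                   # grid over [1/4, 3/4]
    assert abs(mse_phat(p) - 1/16) < 1e-12    # MSE of the MLE is constant 1/16
    assert mse_trivial(p) <= 1/16 + 1e-12     # trivial estimate is never worse
print("both MSE claims verified on the grid")
```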
Theorem 8.7.6:
Let T be a sufficient statistic for f_θ(x), θ ∈ Θ. If a unique MLE of θ exists, it is a function of T.

Proof:
Since T is sufficient, we can write

f_θ(x) = h(x) g_θ(T(x))

due to the Factorization Criterion (Theorem 8.3.5). Maximizing the likelihood function with respect to θ treats h(x) as a constant and therefore is equivalent to maximizing g_θ(T(x)) with respect to θ. But g_θ(T(x)) involves x only through T.

Note:
(i) MLE's may not be unique (however, they frequently are).
(ii) MLE's are not necessarily unbiased.
(iii) MLE's may not exist.
(iv) If a unique MLE exists, it is a function of a sufficient statistic.
(v) Often (but not always), the MLE will be a sufficient statistic itself.
Theorem 8.7.7:
Suppose the regularity conditions of Theorem 8.5.1 hold and θ belongs to an open interval in IR. If an estimate θ̂ of θ attains the CRLB, it is the unique MLE.

Proof:
If θ̂ attains the CRLB, it follows by Theorem 8.5.1 that

∂ log f_θ(X)/∂θ = (1/K(θ)) (θ̂(X) − θ)  w.p. 1.

Thus, θ̂ satisfies the likelihood equations.

We define A(θ) = 1/K(θ). Then it follows that

∂² log f_θ(X)/∂θ² = A′(θ)(θ̂(X) − θ) − A(θ).

The Proof of Theorem 8.5.11 gives us

A(θ) = E_θ( (∂ log f_θ(X)/∂θ)² ) > 0.

So

∂² log f_θ(X)/∂θ² |_{θ=θ̂} = −A(θ̂) < 0,

i.e., log f_θ(X) has a maximum at θ̂. Thus, θ̂ is the MLE.

Note:
The previous Theorem does not imply that every MLE is most efficient.
Theorem 8.7.8:
Let {f_θ : θ ∈ Θ} be a family of pdf's (or pmf's) with Θ ⊆ IR^k, k ≥ 1. Let h : Θ → Λ be a mapping of Θ onto Λ ⊆ IR^p, 1 ≤ p ≤ k. If θ̂ is an MLE of θ, then h(θ̂) is an MLE of h(θ).

Proof:
For each λ ∈ Λ, we define

Θ_λ = {θ : θ ∈ Θ, h(θ) = λ}

and

M(λ; x) = sup_{θ∈Θ_λ} L(θ; x),

the likelihood function induced by h.

Let θ̂ be an MLE and a member of Θ_λ̂, where λ̂ = h(θ̂). It holds that

M(λ̂; x) = sup_{θ∈Θ_λ̂} L(θ; x) ≥ L(θ̂; x),

but also

M(λ̂; x) ≤ sup_{λ∈Λ} M(λ; x) = sup_{λ∈Λ} ( sup_{θ∈Θ_λ} L(θ; x) ) = sup_{θ∈Θ} L(θ; x) = L(θ̂; x).

Therefore,

M(λ̂; x) = L(θ̂; x) = sup_{λ∈Λ} M(λ; x).

Thus, λ̂ = h(θ̂) is an MLE.
Example 8.7.9:
Let X₁, …, Xₙ be iid Bin(1, p). Let h(p) = p(1 − p).
Since the MLE of p is p̂ = X̄, the MLE of h(p) is h(p̂) = X̄(1 − X̄).
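The invariance result of Theorem 8.7.8 can be checked numerically for Example 8.7.9: a grid search of the Bernoulli likelihood recovers p̂ = X̄, and the MLE of p(1 − p) is simply that value plugged into h. This is our own illustration; the true p = 0.3 and the sample size are assumptions.

```python
import math
import random

random.seed(2)
x = [1 if random.random() < 0.3 else 0 for _ in range(5000)]
n, s = len(x), sum(x)

def loglik(p):
    # Bernoulli log-likelihood: s successes, n - s failures
    return s * math.log(p) + (n - s) * math.log(1 - p)

# Grid-search the MLE of p; by invariance (Theorem 8.7.8), the MLE of
# h(p) = p(1-p) is h evaluated at the MLE of p.
grid = [k / 10000 for k in range(1, 10000)]
p_hat = max(grid, key=loglik)

assert abs(p_hat - s / n) < 1e-3   # grid argmax agrees with the closed form X-bar
h_hat = p_hat * (1 - p_hat)
print(p_hat, h_hat)
```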
Theorem 8.7.10:
Consider the following conditions a pdf f_θ can fulfill:

(i) ∂ log f_θ/∂θ, ∂² log f_θ/∂θ², and ∂³ log f_θ/∂θ³ exist for all θ ∈ Θ for all x. Also,

∫_{−∞}^{∞} ∂f_θ(x)/∂θ dx = E_θ( ∂ log f_θ(X)/∂θ ) = 0 ∀ θ ∈ Θ.

(ii) ∫_{−∞}^{∞} ∂²f_θ(x)/∂θ² dx = 0 ∀ θ ∈ Θ.

(iii) −∞ < ∫_{−∞}^{∞} (∂² log f_θ(x)/∂θ²) f_θ(x) dx < 0 ∀ θ ∈ Θ.

(iv) There exists a function H(x) such that for all θ ∈ Θ:

| ∂³ log f_θ(x)/∂θ³ | < H(x) and ∫_{−∞}^{∞} H(x) f_θ(x) dx = M(θ) < ∞.

(v) There exists a function g(θ) that is positive and twice differentiable for every θ ∈ Θ, and there exists a function H(x) such that for all θ ∈ Θ:

| ∂²/∂θ² ( g(θ) ∂ log f_θ(x)/∂θ ) | < H(x) and ∫_{−∞}^{∞} H(x) f_θ(x) dx = M(θ) < ∞.

In case multiple of these conditions are fulfilled, we can make the following statements:

(i) (Cramér) Conditions (i), (iii), and (iv) imply that, with probability approaching 1 as n → ∞, the likelihood equation has a consistent solution.

(ii) (Cramér) Conditions (i), (ii), (iii), and (iv) imply that a consistent solution θ̂ₙ of the likelihood equation is asymptotically Normal, i.e.,

(√n/σ)(θ̂ₙ − θ) →_d Z,

where Z ∼ N(0, 1) and σ² = ( E_θ( (∂ log f_θ(X)/∂θ)² ) )^{−1}.

(iii) (Kulldorf) Conditions (i), (iii), and (v) imply that, with probability approaching 1 as n → ∞, the likelihood equation has a consistent solution.

(iv) (Kulldorf) Conditions (i), (ii), (iii), and (v) imply that a consistent solution θ̂ₙ of the likelihood equation is asymptotically Normal.

Note:
In case of a pmf f_θ, we can define similar conditions as in Theorem 8.7.10.
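Statement (ii) of Theorem 8.7.10 can be illustrated by simulation. For Poisson(λ), the MLE is X̄ and E_λ((∂ log f_λ(X)/∂λ)²) = 1/λ, so √(n/λ)(X̄ − λ) should be approximately N(0, 1). This Monte Carlo sketch is our own illustration; λ = 2, n = 400, and the replication count are assumed values.

```python
import math
import random
import statistics

random.seed(3)

def rpois(lam):
    # Knuth-style Poisson sampler (stdlib only, illustrative)
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

# sqrt(n/lam) * (X-bar - lam) should be approximately N(0, 1),
# since the Fisher information of Poisson(lam) is 1/lam.
lam, n, reps = 2.0, 400, 2000
z = []
for _ in range(reps):
    xbar = sum(rpois(lam) for _ in range(n)) / n
    z.append(math.sqrt(n / lam) * (xbar - lam))

print(statistics.fmean(z), statistics.pstdev(z))
```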
Lecture 16:
Fr 02/16/01

8.8 Decision Theory – Bayes and Minimax Estimation

Let {f_θ : θ ∈ Θ} be a family of pdf's (or pmf's). Let X₁, …, Xₙ be a sample from f_θ. Let A be the set of possible actions (or decisions) that are open to the statistician in a given situation, e.g.,

A = {reject H₀, do not reject H₀} (Hypothesis testing, see Chapter 9)
A = {artefact found is of Greek origin, artefact found is of Roman origin} (Classification)
A = Θ (Estimation)

Definition 8.8.1:
A decision function d is a statistic, i.e., a Borel-measurable function, that maps IRⁿ into A. If X = x is observed, the statistician takes action d(x) ∈ A.

Note:
For the remainder of this Section, we restrict ourselves to A = Θ, i.e., we are facing the problem of estimation.

Definition 8.8.2:
A non-negative function L that maps Θ × A into IR is called a loss function. The value L(θ, a) is the loss incurred by the statistician if he/she takes action a when θ is the true parameter value.

Definition 8.8.3:
Let D be a class of decision functions that map IRⁿ into A. Let L be a loss function on Θ × A. The function R that maps Θ × D into IR, defined as

R(θ, d) = E_θ(L(θ, d(X))),

is called the risk function of d at θ.

Example 8.8.4:
Let A = Θ ⊆ IR. Let L(θ, a) = (θ − a)². Then it holds that

R(θ, d) = E_θ(L(θ, d(X))) = E_θ((θ − d(X))²) = E_θ((θ − θ̂)²).

Note that this is just the MSE. If θ̂ is unbiased, this would just be Var(θ̂).
Note:
The basic problem of decision theory is that we would like to find a decision function d ∈ D such that R(θ, d) is minimized for all θ ∈ Θ. Unfortunately, this is usually not possible.

Definition 8.8.5:
The minimax principle is to choose the decision function d* ∈ D such that

max_{θ∈Θ} R(θ, d*) ≤ max_{θ∈Θ} R(θ, d) ∀ d ∈ D.

Note:
If the problem of interest is an estimation problem, we call a d* that satisfies the condition in Definition 8.8.5 a minimax estimate of θ.
Example 8.8.6:
Let X ∼ Bin(1, p), p ∈ Θ = {1/4, 3/4} = A.
We consider the following loss function:

p     a     L(p, a)
1/4   1/4   0
1/4   3/4   2
3/4   1/4   5
3/4   3/4   0

The set of decision functions consists of the following four functions:

d₁(0) = 1/4, d₁(1) = 1/4
d₂(0) = 1/4, d₂(1) = 3/4
d₃(0) = 3/4, d₃(1) = 1/4
d₄(0) = 3/4, d₄(1) = 3/4

First, we evaluate the loss function for these four decision functions:

L(1/4, d₁(0)) = L(1/4, 1/4) = 0    L(1/4, d₁(1)) = L(1/4, 1/4) = 0
L(3/4, d₁(0)) = L(3/4, 1/4) = 5    L(3/4, d₁(1)) = L(3/4, 1/4) = 5
L(1/4, d₂(0)) = L(1/4, 1/4) = 0    L(1/4, d₂(1)) = L(1/4, 3/4) = 2
L(3/4, d₂(0)) = L(3/4, 1/4) = 5    L(3/4, d₂(1)) = L(3/4, 3/4) = 0
L(1/4, d₃(0)) = L(1/4, 3/4) = 2    L(1/4, d₃(1)) = L(1/4, 1/4) = 0
L(3/4, d₃(0)) = L(3/4, 3/4) = 0    L(3/4, d₃(1)) = L(3/4, 1/4) = 5
L(1/4, d₄(0)) = L(1/4, 3/4) = 2    L(1/4, d₄(1)) = L(1/4, 3/4) = 2
L(3/4, d₄(0)) = L(3/4, 3/4) = 0    L(3/4, d₄(1)) = L(3/4, 3/4) = 0

Then, the risk function

R(p, dᵢ) = E_p(L(p, dᵢ(X))) = L(p, dᵢ(0)) · P_p(X = 0) + L(p, dᵢ(1)) · P_p(X = 1)

takes the following values:

i   R(1/4, dᵢ)                 R(3/4, dᵢ)                 max_{p∈{1/4, 3/4}} R(p, dᵢ)
1   0                          5                          5
2   (3/4)·0 + (1/4)·2 = 1/2    (1/4)·5 + (3/4)·0 = 5/4    5/4
3   (3/4)·2 + (1/4)·0 = 3/2    (1/4)·0 + (3/4)·5 = 15/4   15/4
4   2                          0                          2

Hence,

min_{i∈{1,2,3,4}} max_{p∈{1/4, 3/4}} R(p, dᵢ) = 5/4.

Thus, d₂ is the minimax estimate.
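The risk table of Example 8.8.6 is easy to recompute exactly. The following sketch (our own illustration, using exact rational arithmetic) enumerates the four decision rules and confirms that d₂ is minimax with maximum risk 5/4.

```python
from fractions import Fraction as F

# Recompute the risk table of Example 8.8.6 exactly.
# X ~ Bin(1, p); P(X=1) = p, P(X=0) = 1 - p. Loss L(p, a):
loss = {(F(1,4), F(1,4)): 0, (F(1,4), F(3,4)): 2,
        (F(3,4), F(1,4)): 5, (F(3,4), F(3,4)): 0}

# The four decision functions d_i = (d_i(0), d_i(1)):
rules = {1: (F(1,4), F(1,4)), 2: (F(1,4), F(3,4)),
         3: (F(3,4), F(1,4)), 4: (F(3,4), F(3,4))}

def risk(p, d):
    # R(p, d) = L(p, d(0)) P(X=0) + L(p, d(1)) P(X=1)
    return loss[(p, d[0])] * (1 - p) + loss[(p, d[1])] * p

max_risk = {i: max(risk(p, d) for p in (F(1,4), F(3,4)))
            for i, d in rules.items()}
minimax_rule = min(max_risk, key=max_risk.get)
print(minimax_rule, max_risk[minimax_rule])
```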
Note:
Minimax estimation does not require any unusual assumptions. However, it tends to be very conservative.

Definition 8.8.7:
Suppose we consider θ to be a rv with pdf π(θ) on Θ. We call π the a priori distribution (or prior distribution).

Note:
f(x | θ) is the conditional density of x given a fixed θ. The joint density of x and θ is

f(x, θ) = π(θ) f(x | θ),

the marginal density of x is

g(x) = ∫ f(x, θ) dθ,

and the a posteriori distribution (or posterior distribution), which gives the distribution of θ after sampling, has pdf (or pmf)

h(θ | x) = f(x, θ) / g(x).
Definition 8.8.8:
The Bayes risk of a decision function d is defined as

R(π, d) = E_π(R(θ, d)),

where π is the a priori distribution.

Note:
If θ is a continuous rv and X is of continuous type, then

R(π, d) = E_π(R(θ, d))
= ∫ R(θ, d) π(θ) dθ
= ∫ E_θ(L(θ, d(X))) π(θ) dθ
= ∫ ( ∫ L(θ, d(x)) f(x | θ) dx ) π(θ) dθ
= ∫ ∫ L(θ, d(x)) f(x | θ) π(θ) dx dθ
= ∫ ∫ L(θ, d(x)) f(x, θ) dx dθ
= ∫ g(x) ( ∫ L(θ, d(x)) h(θ | x) dθ ) dx

Similar expressions can be written if θ and/or X are discrete.
Lecture 17:
Tu 02/20/01

Definition 8.8.9:
A decision function d* is called a Bayes rule if d* minimizes the Bayes risk, i.e., if

R(π, d*) = inf_{d∈D} R(π, d).

Theorem 8.8.10:
Let A = Θ ⊆ IR. Let L(θ, d(x)) = (θ − d(x))². In this case, a Bayes rule is

d(x) = E(θ | X = x).

Proof:
Minimizing

R(π, d) = ∫ g(x) ( ∫ (θ − d(x))² h(θ | x) dθ ) dx,

where g is the marginal pdf of X and h is the conditional pdf of θ given x, is the same as minimizing

∫ (θ − d(x))² h(θ | x) dθ.

However, this is minimized when d(x) = E(θ | X = x), as shown in Stat 6710, Homework 4 (ii), for the unconditional case.

Note:
Under the conditions of Theorem 8.8.10, d(x) = E(θ | X = x) is called the Bayes estimate.
Example 8.8.11:
Let X ∼ Bin(n, p). Let L(p, d(x)) = (p − d(x))².
Let π(p) = 1 ∀ p ∈ (0, 1), i.e., π ∼ U(0, 1), be the a priori distribution of p.
Then it holds that:

f(x, p) = (n choose x) pˣ (1 − p)^{n−x}

g(x) = ∫ f(x, p) dp = ∫₀¹ (n choose x) pˣ (1 − p)^{n−x} dp

h(p | x) = f(x, p) / g(x)
= (n choose x) pˣ (1 − p)^{n−x} / ∫₀¹ (n choose x) pˣ (1 − p)^{n−x} dp
= pˣ (1 − p)^{n−x} / ∫₀¹ pˣ (1 − p)^{n−x} dp

E(p | x) = ∫₀¹ p h(p | x) dp
= ∫₀¹ p^{x+1} (1 − p)^{n−x} dp / ∫₀¹ pˣ (1 − p)^{n−x} dp
= B(x + 2, n − x + 1) / B(x + 1, n − x + 1)
= [ Γ(x + 2) Γ(n − x + 1) / Γ(n + 3) ] / [ Γ(x + 1) Γ(n − x + 1) / Γ(n + 2) ]
= (x + 1) / (n + 2)

Thus, the Bayes rule is

p̂_Bayes = d*(X) = (X + 1) / (n + 2).
The Bayes risk of d*(X) is

R(π, d*(X)) = E_π(R(p, d*(X)))
= ∫₀¹ π(p) R(p, d*(X)) dp
= ∫₀¹ π(p) E_p(L(p, d*(X))) dp
= ∫₀¹ π(p) E_p((p − d*(X))²) dp
= ∫₀¹ E_p( ((X + 1)/(n + 2) − p)² ) dp
= ∫₀¹ E_p( ((X + 1)/(n + 2))² − 2p (X + 1)/(n + 2) + p² ) dp
= (1/(n + 2)²) ∫₀¹ E_p( (X + 1)² − 2p(n + 2)(X + 1) + p²(n + 2)² ) dp
= (1/(n + 2)²) ∫₀¹ E_p( X² + 2X + 1 − 2p(n + 2)(X + 1) + p²(n + 2)² ) dp
= (1/(n + 2)²) ∫₀¹ ( np(1 − p) + (np)² + 2np + 1 − 2p(n + 2)(np + 1) + p²(n + 2)² ) dp
= (1/(n + 2)²) ∫₀¹ ( np − np² + n²p² + 2np + 1 − 2n²p² − 2np − 4np² − 4p + n²p² + 4np² + 4p² ) dp
= (1/(n + 2)²) ∫₀¹ ( 1 − 4p + np − np² + 4p² ) dp
= (1/(n + 2)²) ∫₀¹ ( 1 + (n − 4)p + (4 − n)p² ) dp
= (1/(n + 2)²) [ p + ((n − 4)/2) p² + ((4 − n)/3) p³ ]₀¹
= (1/(n + 2)²) ( 1 + (n − 4)/2 + (4 − n)/3 )
= (1/(n + 2)²) · (6 + 3n − 12 + 8 − 2n)/6
= (1/(n + 2)²) · (n + 2)/6
= 1/(6(n + 2))

Now we compare the Bayes rule d*(X) with the MLE p̂_ML = X/n. This estimate has Bayes risk

R(π, X/n) = ∫₀¹ E_p( (X/n − p)² ) dp
= ∫₀¹ (1/n²) E_p( (X − np)² ) dp
= ∫₀¹ np(1 − p)/n² dp
= ∫₀¹ p(1 − p)/n dp
= (1/n) [ p²/2 − p³/3 ]₀¹
= 1/(6n),

which is, as expected, larger than the Bayes risk of d*(X).
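Both closed-form Bayes risks can be checked numerically by enumerating X and integrating the risk over the uniform prior. This sketch is our own illustration; n = 10 and the midpoint-rule resolution are assumed values.

```python
import math

# Numerically check the Bayes risks derived above for X ~ Bin(n, p),
# uniform prior on p, squared-error loss:
#   Bayes rule (X+1)/(n+2) -> 1/(6(n+2));  MLE X/n -> 1/(6n).
n = 10

def risk(p, est):
    # R(p, d) = E_p[(d(X) - p)^2], by direct enumeration over x = 0..n
    return sum(math.comb(n, x) * p**x * (1 - p)**(n - x) * (est(x) - p) ** 2
               for x in range(n + 1))

def bayes_risk(est, m=2000):
    # Midpoint rule for the integral of R(p, d) over the U(0, 1) prior
    return sum(risk((k + 0.5) / m, est) for k in range(m)) / m

br_bayes = bayes_risk(lambda x: (x + 1) / (n + 2))
br_mle = bayes_risk(lambda x: x / n)

assert abs(br_bayes - 1 / (6 * (n + 2))) < 1e-5
assert abs(br_mle - 1 / (6 * n)) < 1e-5
print(br_bayes, br_mle)
```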
Theorem 8.8.12:
Let {f_θ : θ ∈ Θ} be a family of pdf's (or pmf's). Suppose that an estimate d* of θ is a Bayes estimate corresponding to some prior distribution π on Θ. If the risk function R(θ, d*) is constant on Θ, then d* is a minimax estimate of θ.

Proof:
Homework.
9 Hypothesis Testing

9.1 Fundamental Notions

We assume that X = (X₁, …, Xₙ) is a random sample from a population distribution F_θ, θ ∈ Θ ⊆ IR^k, where the functional form of F_θ is known, except for the parameter θ. We also assume that Θ contains at least two points.

Definition 9.1.1:
A parametric hypothesis is an assumption about the unknown parameter θ.
The null hypothesis is of the form

H₀ : θ ∈ Θ₀ ⊂ Θ.

The alternative hypothesis is of the form

H₁ : θ ∈ Θ₁ = Θ − Θ₀.

Definition 9.1.2:
If Θ₀ (or Θ₁) contains only one point, we say that H₀ and Θ₀ (or H₁ and Θ₁) are simple. In this case, the distribution of X is completely specified under the null (or alternative) hypothesis.
If Θ₀ (or Θ₁) contains more than one point, we say that H₀ and Θ₀ (or H₁ and Θ₁) are composite.

Example 9.1.3:
Let X₁, …, Xₙ be iid Bin(1, p). Examples for hypotheses are p = 1/2 (simple), p ≤ 1/2 (composite), p ≠ 1/4 (composite), etc.

Note:
The problem of testing a hypothesis can be described as follows: Given a sample point x, find a decision rule that will lead to a decision to accept or reject the null hypothesis. This means we partition the space IRⁿ into two disjoint sets C and Cᶜ such that, if x ∈ C, we reject H₀ : θ ∈ Θ₀ (and we accept H₁). Otherwise, if x ∈ Cᶜ, we accept H₀ that X ∼ F_θ, θ ∈ Θ₀.
Definition 9.1.4:
Let X ∼ F_θ, θ ∈ Θ. Let C be a subset of IRⁿ such that, if x ∈ C, then H₀ is rejected (with probability 1), i.e.,

C = {x ∈ IRⁿ : H₀ is rejected for this x}.

The set C is called the critical region.

Definition 9.1.5:
If we reject H₀ when it is true, we call this a Type I error. If we fail to reject H₀ when it is false, we call this a Type II error. Usually, H₀ and H₁ are chosen such that the Type I error is considered more serious.

Example 9.1.6:
We first consider a non-statistical example, in this case a jury trial. Our hypotheses are that the defendant is innocent or guilty. Our possible decisions are guilty or not guilty. Since it is considered worse to punish the innocent than to let the guilty go free, we make innocence the null hypothesis. Thus, we have

                        Truth (unknown)
Decision (known)        Innocent (H₀)    Guilty (H₁)
Not Guilty (H₀)         Correct          Type II Error
Guilty (H₁)             Type I Error     Correct

The jury tries to make a decision "beyond a reasonable doubt", i.e., it tries to make the probability of a Type I error small.

Definition 9.1.7:
If C is the critical region, then P_θ(C), θ ∈ Θ₀, is a probability of Type I error, and P_θ(Cᶜ), θ ∈ Θ₁, is a probability of Type II error.

Note:
We would like both error probabilities to be 0, but this is usually not possible. We usually settle for fixing the probability of Type I error to be small, e.g., 0.05 or 0.01, and minimizing the probability of Type II error.
Lecture 18:
We 02/21/01

Definition 9.1.8:
Every Borel-measurable mapping φ of IRⁿ → [0, 1] is called a test function. φ(x) is the probability of rejecting H₀ when x is observed.
If φ is the indicator function of a subset C ⊆ IRⁿ, φ is called a nonrandomized test and C is the critical region of this test function.
Otherwise, if φ is not an indicator function of a subset C ⊆ IRⁿ, φ is called a randomized test.

Definition 9.1.9:
Let φ be a test function of the hypothesis H₀ : θ ∈ Θ₀ against the alternative H₁ : θ ∈ Θ₁. We say that φ has a level of significance of α (or φ is a level-α-test or φ is of size α) if

E_θ(φ(X)) = P_θ(reject H₀) ≤ α ∀ θ ∈ Θ₀.

In short, we say that φ is a test for the problem (α, Θ₀, Θ₁).

Definition 9.1.10:
Let φ be a test for the problem (α, Θ₀, Θ₁). For every θ ∈ Θ, we define

β_φ(θ) = E_θ(φ(X)) = P_θ(reject H₀).

We call β_φ(θ) the power function of φ. For any θ ∈ Θ₁, β_φ(θ) is called the power of φ against the alternative θ.

Definition 9.1.11:
Let Φ_α be the class of all tests for (α, Θ₀, Θ₁). A test φ₀ ∈ Φ_α is called a most powerful (MP) test against an alternative θ ∈ Θ₁ if

β_{φ₀}(θ) ≥ β_φ(θ) ∀ φ ∈ Φ_α.

Definition 9.1.12:
Let Φ_α be the class of all tests for (α, Θ₀, Θ₁). A test φ₀ ∈ Φ_α is called a uniformly most powerful (UMP) test if

β_{φ₀}(θ) ≥ β_φ(θ) ∀ φ ∈ Φ_α ∀ θ ∈ Θ₁.
Example 9.1.13:
Let X₁, …, Xₙ be iid N(μ, 1), μ ∈ Θ = {μ₀, μ₁}, μ₀ < μ₁.
Let H₀ : Xᵢ ∼ N(μ₀, 1) vs. H₁ : Xᵢ ∼ N(μ₁, 1).
Intuitively, reject H₀ when X̄ is too large, i.e., if X̄ ≥ k for some k.
Under H₀ it holds that X̄ ∼ N(μ₀, 1/n).
For a given α, we can solve the following equation for k:

P_{μ₀}(X̄ > k) = P( (X̄ − μ₀)/(1/√n) > (k − μ₀)/(1/√n) ) = P(Z > z_α) = α

Here, (X̄ − μ₀)/(1/√n) = Z ∼ N(0, 1), and z_α is defined in such a way that P(Z > z_α) = α, i.e., z_α is the upper α-quantile of the N(0, 1) distribution. It follows that (k − μ₀)/(1/√n) = z_α and therefore k = μ₀ + z_α/√n.

Thus, we obtain the nonrandomized test

φ(x) = 1 if x̄ > μ₀ + z_α/√n; 0 otherwise.

φ has power

β_φ(μ₁) = P_{μ₁}( X̄ > μ₀ + z_α/√n )
= P( (X̄ − μ₁)/(1/√n) > (μ₀ − μ₁)√n + z_α )
= P( Z > z_α − √n(μ₁ − μ₀) )   [where √n(μ₁ − μ₀) > 0]
> α

The probability of a Type II error is

P(Type II error) = 1 − β_φ(μ₁).
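The power formula β_φ(μ₁) = P(Z > z_α − √n(μ₁ − μ₀)) from Example 9.1.13 is easy to evaluate. This sketch uses assumed values μ₀ = 0, μ₁ = 0.5, n = 25, α = 0.05 (our own choices for illustration) and the standard normal cdf via the error function.

```python
import math

def phi_cdf(z):
    # Standard normal cdf via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Assumed values for illustration:
mu0, mu1, n, alpha = 0.0, 0.5, 25, 0.05
z_alpha = 1.6449  # upper 0.05 quantile of N(0, 1)

# Power: beta(mu1) = P(Z > z_alpha - sqrt(n) (mu1 - mu0))
power = 1 - phi_cdf(z_alpha - math.sqrt(n) * (mu1 - mu0))
type2 = 1 - power  # probability of a Type II error
print(power, type2)
```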
Example 9.1.14:
Let X ∼ Bin(6, p), p ∈ Θ = (0, 1).
H₀ : p = 1/2; H₁ : p ≠ 1/2.
Desired level of significance: α = 0.05.
Reasonable plan: Since E_{p=1/2}(X) = 3, reject H₀ when |X − 3| ≥ c for some constant c. But how should we select c?

x      c = |x − 3|    P_{p=1/2}(X = x)    P_{p=1/2}(|X − 3| ≥ c)
0, 6   3              0.015625            0.03125
1, 5   2              0.093750            0.21875
2, 4   1              0.234375            0.68750
3      0              0.312500            1.00000

Thus, there is no nonrandomized test with α = 0.05.
What can we do instead? There are three possibilities:

(i) Reject if |X − 3| = 3, i.e., use a nonrandomized test of size α = 0.03125.
(ii) Reject if |X − 3| ≥ 2, i.e., use a nonrandomized test of size α = 0.21875.
(iii) Reject if |X − 3| = 3, do not reject if |X − 3| ≤ 1, and reject with probability (0.05 − 0.03125)/(2 · 0.093750) = 0.1 if |X − 3| = 2. Thus, we obtain the randomized test

φ(x) = 1 if x = 0, 6; 0.1 if x = 1, 5; 0 if x = 2, 3, 4.

This test has size

α = E_{p=1/2}(φ(X)) = 1 · 0.015625 · 2 + 0.1 · 0.093750 · 2 + 0 · (0.234375 · 2 + 0.3125) = 0.05

as intended. The power of φ can be calculated for any p ≠ 1/2 and it is

β_φ(p) = P_p(X = 0 or X = 6) + 0.1 · P_p(X = 1 or X = 5).
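The size and power function of the randomized test in Example 9.1.14 can be computed by direct enumeration over x = 0, …, 6. The alternative value p = 0.8 used below is our own choice for illustration.

```python
import math

# Randomized test of Example 9.1.14 for X ~ Bin(6, p):
# phi(x) = 1 if x in {0, 6}, 0.1 if x in {1, 5}, 0 otherwise.
def pmf(x, p):
    return math.comb(6, x) * p**x * (1 - p)**(6 - x)

def phi(x):
    return {0: 1.0, 6: 1.0, 1: 0.1, 5: 0.1}.get(x, 0.0)

def beta(p):
    # Power function: beta_phi(p) = E_p(phi(X))
    return sum(phi(x) * pmf(x, p) for x in range(7))

size = beta(0.5)
print(size, beta(0.8))
```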
Lecture 19:
Fr 02/23/01

9.2 The Neyman–Pearson Lemma

Let {f_θ : θ ∈ Θ = {θ₀, θ₁}} be a family of possible distributions of X. f_θ represents the pdf (or pmf) of X. For convenience, we write f₀(x) = f_{θ₀}(x) and f₁(x) = f_{θ₁}(x).

Theorem 9.2.1: Neyman–Pearson Lemma (NP Lemma)
Suppose we wish to test H₀ : X ∼ f₀(x) vs. H₁ : X ∼ f₁(x), where fᵢ is the pdf (or pmf) of X under Hᵢ, i = 0, 1, and both H₀ and H₁ are simple.

(i) Any test of the form

φ(x) = 1 if f₁(x) > k f₀(x); γ(x) if f₁(x) = k f₀(x); 0 if f₁(x) < k f₀(x)   (*)

for some k ≥ 0 and 0 ≤ γ(x) ≤ 1, is most powerful of its significance level for testing H₀ vs. H₁.

If k = ∞, the test

φ(x) = 1 if f₀(x) = 0; 0 if f₀(x) > 0   (**)

is most powerful of size (or significance level) 0 for testing H₀ vs. H₁.

(ii) Given 0 ≤ α ≤ 1, there exists a test of the form (*) or (**) with γ(x) = γ (i.e., a constant) such that

E_{θ₀}(φ(X)) = α.
Proof:
We prove the continuous case only.

(i):
Let φ be a test satisfying (*). Let φ* be any test with E_{θ₀}(φ*(X)) ≤ E_{θ₀}(φ(X)). It holds that

∫ (φ(x) − φ*(x))(f₁(x) − k f₀(x)) dx
= ∫_{f₁>kf₀} (φ(x) − φ*(x))(f₁(x) − k f₀(x)) dx + ∫_{f₁<kf₀} (φ(x) − φ*(x))(f₁(x) − k f₀(x)) dx

since

∫_{f₁=kf₀} (φ(x) − φ*(x))(f₁(x) − k f₀(x)) dx = 0.

It is

∫_{f₁>kf₀} (φ(x) − φ*(x))(f₁(x) − k f₀(x)) dx = ∫_{f₁>kf₀} (1 − φ*(x))(f₁(x) − k f₀(x)) dx ≥ 0,

since there both factors are ≥ 0, and

∫_{f₁<kf₀} (φ(x) − φ*(x))(f₁(x) − k f₀(x)) dx = ∫_{f₁<kf₀} (0 − φ*(x))(f₁(x) − k f₀(x)) dx ≥ 0,

since there both factors are ≤ 0. Therefore,

0 ≤ ∫ (φ(x) − φ*(x))(f₁(x) − k f₀(x)) dx
= E_{θ₁}(φ(X)) − E_{θ₁}(φ*(X)) − k (E_{θ₀}(φ(X)) − E_{θ₀}(φ*(X))).

Since E_{θ₀}(φ(X)) ≥ E_{θ₀}(φ*(X)), it holds that

β_φ(θ₁) − β_{φ*}(θ₁) ≥ k (E_{θ₀}(φ(X)) − E_{θ₀}(φ*(X))) ≥ 0,

i.e., φ is most powerful.

If k = ∞, any test φ* of size 0 must be 0 on the set {x | f₀(x) > 0}. Therefore,

β_φ(θ₁) − β_{φ*}(θ₁) = E_{θ₁}(φ(X)) − E_{θ₁}(φ*(X)) = ∫_{{x | f₀(x)=0}} (1 − φ*(x)) f₁(x) dx ≥ 0,

i.e., φ is most powerful of size 0.
(ii):
If α = 0, then use (**). Otherwise, assume that 0 < α ≤ 1 and γ(x) = γ. It is

E_{θ₀}(φ(X)) = P_{θ₀}(f₁(X) > k f₀(X)) + γ P_{θ₀}(f₁(X) = k f₀(X))
= 1 − P_{θ₀}(f₁(X) ≤ k f₀(X)) + γ P_{θ₀}(f₁(X) = k f₀(X))
= 1 − P_{θ₀}( f₁(X)/f₀(X) ≤ k ) + γ P_{θ₀}( f₁(X)/f₀(X) = k ).

Note that the last step is valid since P_{θ₀}(f₀(X) = 0) = 0.

Therefore, given 0 < α ≤ 1, we want to find k and γ such that E_{θ₀}(φ(X)) = α, i.e.,

P_{θ₀}( f₁(X)/f₀(X) ≤ k ) − γ P_{θ₀}( f₁(X)/f₀(X) = k ) = 1 − α.

Note that f₁(X)/f₀(X) is a rv and, therefore, P_{θ₀}( f₁(X)/f₀(X) ≤ k ) is a cdf as a function of k.

If there exists a k₀ such that

P_{θ₀}( f₁(X)/f₀(X) ≤ k₀ ) = 1 − α,

we choose γ = 0 and k = k₀.

Otherwise, if there exists no such k₀, then there exists a k₁ such that

P_{θ₀}( f₁(X)/f₀(X) < k₁ ) ≤ 1 − α < P_{θ₀}( f₁(X)/f₀(X) ≤ k₁ ),

i.e., the cdf has a jump at k₁. In this case, we choose k = k₁ and

γ = [ P_{θ₀}( f₁(X)/f₀(X) ≤ k₁ ) − (1 − α) ] / P_{θ₀}( f₁(X)/f₀(X) = k₁ ).
[Figure (Theorem 9.2.1): the cdf F(x) of f₁(X)/f₀(X) under H₀, with a jump at k₁; the level 1 − α lies between P(f₁/f₀ < k₁) and P(f₁/f₀ ≤ k₁).]
Let us verify that these values for k₁ and γ meet the necessary conditions:
Obviously,

P_{θ₀}( f₁(X)/f₀(X) ≤ k₁ ) − [ ( P_{θ₀}( f₁(X)/f₀(X) ≤ k₁ ) − (1 − α) ) / P_{θ₀}( f₁(X)/f₀(X) = k₁ ) ] · P_{θ₀}( f₁(X)/f₀(X) = k₁ ) = 1 − α.

Also, since

P_{θ₀}( f₁(X)/f₀(X) ≤ k₁ ) > 1 − α,

it follows that γ ≥ 0, and since

[ P_{θ₀}( f₁(X)/f₀(X) ≤ k₁ ) − (1 − α) ] / P_{θ₀}( f₁(X)/f₀(X) = k₁ )
≤ [ P_{θ₀}( f₁(X)/f₀(X) ≤ k₁ ) − P_{θ₀}( f₁(X)/f₀(X) < k₁ ) ] / P_{θ₀}( f₁(X)/f₀(X) = k₁ )
= P_{θ₀}( f₁(X)/f₀(X) = k₁ ) / P_{θ₀}( f₁(X)/f₀(X) = k₁ ) = 1,

it follows that γ ≤ 1. Overall, 0 ≤ γ ≤ 1 as required.
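The construction of k₁ and γ in part (ii) can be carried out concretely for a discrete family. The following sketch is our own illustration with an assumed setup (X ∼ Bin(6, p), H₀ : p = 0.5 vs. H₁ : p = 0.8, α = 0.05): it walks the cdf of the likelihood ratio under H₀ until it jumps past 1 − α, then sets γ by the formula above.

```python
import math

# Assumed discrete example: X ~ Bin(n, p), H0: p = p0 vs. H1: p = p1.
n, p0, p1, alpha = 6, 0.5, 0.8, 0.05

def pmf(x, p):
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

# Order sample points by the likelihood ratio f1/f0 (increasing in x here).
pts = sorted(range(n + 1), key=lambda x: pmf(x, p1) / pmf(x, p0))

# Walk the cdf of f1/f0 under H0 until it jumps past 1 - alpha at k1,
# then set gamma = (P(ratio <= k1) - (1 - alpha)) / P(ratio = k1).
cum = 0.0
for x in pts:
    cum += pmf(x, p0)
    if cum >= 1 - alpha - 1e-12:
        k1 = pmf(x, p1) / pmf(x, p0)
        gamma = (cum - (1 - alpha)) / pmf(x, p0)
        break

def phi(x):
    r = pmf(x, p1) / pmf(x, p0)
    return 1.0 if r > k1 + 1e-12 else (gamma if abs(r - k1) < 1e-12 else 0.0)

size = sum(phi(x) * pmf(x, p0) for x in range(n + 1))
print(size, k1, gamma)
```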
Lecture 21:
We 02/28/01

Theorem 9.2.2:
If a sufficient statistic T exists for the family {f_θ : θ ∈ Θ = {θ₀, θ₁}}, then the Neyman–Pearson most powerful test is a function of T.

Proof:
Homework.
Example 9.2.3:
We want to test H₀ : X ∼ N(0, 1) vs. H₁ : X ∼ Cauchy(1, 0), based on a single observation. It is

f₁(x)/f₀(x) = [ (1/π) · 1/(1 + x²) ] / [ (1/√(2π)) exp(−x²/2) ] = √(2/π) · exp(x²/2)/(1 + x²).

The MP test is

φ(x) = 1 if √(2/π) · exp(x²/2)/(1 + x²) > k; 0 otherwise,

where k is determined such that E_{H₀}(φ(X)) = α.

If α < 0.113, we reject H₀ if |x| > z_{α/2}, where z_{α/2} is the upper α/2 quantile of a N(0, 1) distribution.

If α > 0.113, we reject H₀ if |x| > k₁ or if |x| < k₂, where k₁ > 0, k₂ > 0, such that

exp(k₁²/2)/(1 + k₁²) = exp(k₂²/2)/(1 + k₂²) and ∫_{k₂}^{k₁} (1/√(2π)) exp(−x²/2) dx = (1 − α)/2.
[Figure (Example 9.2.3a): the likelihood ratio f₁(x)/f₀(x) on (−2, 2); it attains its value at x = 0 again at x = ±1.585, with k₂ < |x| < k₁ marking where the ratio dips below that level.]

[Figure (Example 9.2.3b): the N(0, 1) density f₀(x), with tail areas α/2 = 0.113/2 beyond ±1.585.]
Why is α = 0.113 so interesting?
For x = 0, it is

f₁(x)/f₀(x) = √(2/π) ≈ 0.7979.

Similarly, for x ≈ −1.585 and x ≈ 1.585, it is

f₁(x)/f₀(x) = √(2/π) · exp((±1.585)²/2)/(1 + (±1.585)²) ≈ 0.7979 ≈ f₁(0)/f₀(0).

More importantly, P_{H₀}(|X| > 1.585) = 0.113.
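Both numerical claims of Example 9.2.3 are easy to verify: the likelihood ratio takes (approximately) the same value at x = 0 and x = ±1.585, and the N(0, 1) tail probability beyond ±1.585 is about 0.113. A quick check with the error function:

```python
import math

def ratio(x):
    # f1(x)/f0(x) = sqrt(2/pi) * exp(x^2/2) / (1 + x^2)
    return math.sqrt(2 / math.pi) * math.exp(x * x / 2) / (1 + x * x)

def tail(c):
    # P(|Z| > c) for Z ~ N(0, 1), via the error function
    return 2 * (1 - 0.5 * (1 + math.erf(c / math.sqrt(2))))

print(ratio(0.0), ratio(1.585), tail(1.585))
```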
9.3 Monotone Likelihood Ratios

Suppose we want to test H₀ : θ ≤ θ₀ vs. H₁ : θ > θ₀ for a family of pdf's {f_θ : θ ∈ Θ ⊆ IR}. In general, it is not possible to find a UMP test. However, there exist conditions under which UMP tests exist.

Definition 9.3.1:
Let {f_θ : θ ∈ Θ ⊆ IR} be a family of pdf's (pmf's) on a one-dimensional parameter space. We say the family {f_θ} has a monotone likelihood ratio (MLR) in statistic T(X) if for θ₁ < θ₂, whenever f_{θ₁} and f_{θ₂} are distinct, the ratio f_{θ₂}(x)/f_{θ₁}(x) is a nondecreasing function of T(x) on the set of values x for which at least one of f_{θ₁} and f_{θ₂} is > 0.

Note:
We can also define families of densities with nonincreasing MLR in T(X), but such families can be treated by symmetry.

Example 9.3.2:
Let X₁, …, Xₙ ∼ U[0, θ], θ > 0. Then the joint pdf is

f_θ(x) = 1/θⁿ if 0 ≤ x_max ≤ θ; 0 otherwise, i.e., f_θ(x) = (1/θⁿ) I_{[0,θ]}(x_max),

where x_max = max_{i=1,…,n} xᵢ.

Let θ₂ > θ₁. Then

f_{θ₂}(x)/f_{θ₁}(x) = (θ₁/θ₂)ⁿ · I_{[0,θ₂]}(x_max)/I_{[0,θ₁]}(x_max).

It is

I_{[0,θ₂]}(x_max)/I_{[0,θ₁]}(x_max) = 1 if x_max ∈ [0, θ₁]; ∞ if x_max ∈ (θ₁, θ₂],

since for x_max ∈ [0, θ₁] it holds that x_max ∈ [0, θ₂], but for x_max ∈ (θ₁, θ₂] it is I_{[0,θ₁]}(x_max) = 0.

⟹ as T(X) = X_max increases, the density ratio goes from (θ₁/θ₂)ⁿ to ∞
⟹ f_{θ₂}/f_{θ₁} is a nondecreasing function of T(X) = X_max
⟹ the family of U[0, θ] distributions has a MLR in T(X) = X_max
Theorem 9.3.3:
The one-parameter exponential family f_θ(x) = exp(Q(θ)T(x) + D(θ) + S(x)), where Q(θ) is nondecreasing, has a MLR in T(X).

Proof:
Homework.

Example 9.3.4:
Let X = (X₁, …, Xₙ) be a random sample from the Poisson family with parameter λ > 0. Then the joint pmf is

f_λ(x) = Π_{i=1}^n ( e^{−λ} λ^{xᵢ} / xᵢ! ) = e^{−nλ} λ^{Σxᵢ} Π_{i=1}^n (1/xᵢ!) = exp( −nλ + (Σ_{i=1}^n xᵢ) log(λ) − Σ_{i=1}^n log(xᵢ!) ),

which belongs to the one-parameter exponential family.

Since Q(λ) = log(λ) is a nondecreasing function of λ, it follows by Theorem 9.3.3 that the Poisson family with parameter λ > 0 has a MLR in T(X) = Σ_{i=1}^n Xᵢ.

We can verify this result by Definition 9.3.1:

f_{λ₂}(x)/f_{λ₁}(x) = ( λ₂^{Σxᵢ} / λ₁^{Σxᵢ} ) · ( e^{−nλ₂} / e^{−nλ₁} ) = (λ₂/λ₁)^{Σxᵢ} e^{−n(λ₂−λ₁)}.

If λ₂ > λ₁, then λ₂/λ₁ > 1 and (λ₂/λ₁)^{Σxᵢ} is a nondecreasing function of Σxᵢ. Therefore, f_λ has a MLR in T(X) = Σ_{i=1}^n Xᵢ.
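The MLR computation in Example 9.3.4 can be checked numerically: the ratio (λ₂/λ₁)^t e^{−n(λ₂−λ₁)} depends on the data only through t = Σxᵢ and is nondecreasing in t. The parameter values below are our own choices for illustration.

```python
import math

def joint_ratio(t, n, lam1, lam2):
    # f_{lam2}(x)/f_{lam1}(x) = (lam2/lam1)^t * exp(-n (lam2 - lam1));
    # depends on x only through t = sum(x_i)
    return (lam2 / lam1) ** t * math.exp(-n * (lam2 - lam1))

n, lam1, lam2 = 5, 1.0, 2.5
vals = [joint_ratio(t, n, lam1, lam2) for t in range(30)]

# Nondecreasing in t, as the MLR property requires for lam2 > lam1.
assert all(a <= b for a, b in zip(vals, vals[1:]))
print(vals[0], vals[-1])
```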
Theorem 9.3.5:
Let X ∼ f_θ, θ ∈ Θ ⊆ IR, where the family {f_θ} has a MLR in T(X).
For testing H₀ : θ ≤ θ₀ vs. H₁ : θ > θ₀, θ₀ ∈ Θ, any test of the form

φ(x) = 1 if T(x) > t₀; γ if T(x) = t₀; 0 if T(x) < t₀   (*)

has a nondecreasing power function and is UMP of its size E_{θ₀}(φ(X)) = α, provided the size is not 0.

Also, for every 0 ≤ α ≤ 1 and every θ₀ ∈ Θ, there exist a t₀ and a γ (−∞ ≤ t₀ ≤ ∞, 0 ≤ γ ≤ 1) such that the test of form (*) is the UMP size α test of H₀ vs. H₁.
Proof:
"⟹":
Let θ₁ < θ₂, θ₁, θ₂ ∈ Θ, and suppose E_{θ₁}(φ(X)) > 0, i.e., the size is > 0. Since f_θ has a MLR in T, f_{θ₂}(x)/f_{θ₁}(x) is a nondecreasing function of T. Therefore, any test of form (*) is equivalent to a test

φ(x) = 1 if f_{θ₂}(x)/f_{θ₁}(x) > k; γ if f_{θ₂}(x)/f_{θ₁}(x) = k; 0 if f_{θ₂}(x)/f_{θ₁}(x) < k   (**)

which by the Neyman–Pearson Lemma (Theorem 9.2.1) is MP of its size α for testing H₀ : θ = θ₁ vs. H₁ : θ = θ₂.

Lecture 22:
Fr 03/02/01

Let Φ_α be the class of tests of size α, and let φ_t ∈ Φ_α be the trivial test φ_t(x) = α ∀ x. Then φ_t has size and power α. The power of the MP test φ of form (*) must be at least α, as the MP test φ cannot have power less than the trivial test φ_t, i.e.,

E_{θ₂}(φ(X)) ≥ E_{θ₂}(φ_t(X)) = α = E_{θ₁}(φ(X)).

Thus, for θ₂ > θ₁,

E_{θ₂}(φ(X)) ≥ E_{θ₁}(φ(X)),

i.e., the power function of the test φ of form (*) is a nondecreasing function of θ.

Now let θ₁ = θ₀ and θ₂ > θ₀. We know that a test φ of form (*) is MP for H₀ : θ = θ₀ vs. H₁ : θ = θ₂ > θ₀, provided that its size α₀ = E_{θ₀}(φ(X)) is > 0.

Notice that the test φ of form (*) does not depend on θ₂. It only depends on t₀ and γ. Therefore, the test φ of form (*) is MP for all θ₂ ∈ Θ₁. Thus, this test is UMP for simple H₀ : θ = θ₀ vs. composite H₁ : θ > θ₀ with size E_{θ₀}(φ(X)) = α₀.

Since φ is UMP for the class Φ′′ of tests (φ′′ ∈ Φ′′) satisfying

E_{θ₀}(φ′′(X)) ≤ α₀,

φ must also be UMP for the more restrictive class Φ′′′ of tests (φ′′′ ∈ Φ′′′) satisfying

E_θ(φ′′′(X)) ≤ α₀ ∀ θ ≤ θ₀.

But since the power function of φ is nondecreasing, it holds for φ that

E_θ(φ(X)) ≤ E_{θ₀}(φ(X)) = α₀ ∀ θ ≤ θ₀.

Thus, φ is the UMP size α₀ test of H₀ : θ ≤ θ₀ vs. H₁ : θ > θ₀ if α₀ > 0.

"⟸":
Use the Neyman–Pearson Lemma (Theorem 9.2.1).
Note:
By interchanging inequalities throughout Theorem 9.3.5 and its proof, we see that this Theorem also provides a solution of the dual problem H₀′ : θ ≥ θ₀ vs. H₁′ : θ < θ₀.

Theorem 9.3.6:
For the one-parameter exponential family, there exists a UMP two-sided test of H₀ : θ ≤ θ₁ or θ ≥ θ₂ (where θ₁ < θ₂) vs. H₁ : θ₁ < θ < θ₂ of the form

φ(x) = 1 if c₁ < T(x) < c₂; γᵢ if T(x) = cᵢ, i = 1, 2; 0 if T(x) < c₁ or T(x) > c₂.

Note:
UMP tests for H₀ : θ₁ ≤ θ ≤ θ₂ and H₀′′ : θ = θ₀ do not exist for one-parameter exponential families.
9.4 Unbiased and Invariant Tests

If we look at all size α tests in the class Φ_α, there exists no UMP test for many hypotheses. Can we find UMP tests if we reduce Φ_α by reasonable restrictions?

Definition 9.4.1:
A size α test φ of H₀ : θ ∈ Θ₀ vs. H₁ : θ ∈ Θ₁ is unbiased if

E_θ(φ(X)) ≥ α ∀ θ ∈ Θ₁.

Note:
This condition means that β_φ(θ) ≤ α ∀ θ ∈ Θ₀ and β_φ(θ) ≥ α ∀ θ ∈ Θ₁. In other words, the power of this test is never less than α.

Definition 9.4.2:
Let U_α be the class of all unbiased size α tests of H₀ vs. H₁. If there exists a test φ ∈ U_α that has maximal power for all θ ∈ Θ₁, we call φ a UMP unbiased (UMPU) size α test.

Note:
It holds that U_α ⊆ Φ_α. A UMP test φ* ∈ Φ_α will have β_{φ*}(θ) ≥ α ∀ θ ∈ Θ₁ since we must compare all tests φ* with the trivial test φ(x) = α. Thus, if a UMP test exists in Φ_α, it is also a UMPU test in U_α.

Example 9.4.3:
Let X₁, …, Xₙ be iid N(μ, σ²), where σ² > 0 is known. Consider H₀ : μ = μ₀ vs. H₁ : μ ≠ μ₀.
From the Neyman–Pearson Lemma, we know that for μ₁ > μ₀, the MP test is of the form

φ₁(X) = 1 if X̄ > μ₀ + (σ/√n) z_α; 0 otherwise,

and for μ₂ < μ₀, the MP test is of the form

φ₂(X) = 1 if X̄ < μ₀ − (σ/√n) z_α; 0 otherwise.

If a test is UMP, it must have the same rejection region as φ₁ and φ₂. However, these two rejection regions are different (actually, their intersection is empty). Thus, there exists no UMP test.

We next state a helpful Theorem and then continue with this example and see how we can find a UMPU test.
Theorem 9.4.4:
Let c1; : : : ; cn 2 IR be constants and f1(x); : : : ; fn+1(x) be real{valued functions. Let C be the
class of functions �(x) satisfying 0 � �(x) � 1 andZ 1
�1�(x)fi(x)dx = ci 8i = 1; : : : ; n:
If �� 2 C satis�es
��(x) =
8>>>><>>>>:1; if fn+1(x) >
nXi=1
kifi(x)
0; if fn+1(x) <nXi=1
kifi(x)
for some constants k1; : : : ; kn 2 IR, then �� maximizesZ 1
�1�(x)fn+1(x)dx among all � 2 C.
Proof:
Let φ*(x) be as above. Let φ(x) be any other function in C. Since 0 ≤ φ(x) ≤ 1 ∀x, it is

  (φ*(x) − φ(x)) (f_{n+1}(x) − Σ_{i=1}^n ki fi(x)) ≥ 0  ∀x.

This holds since if φ*(x) = 1, the left factor is ≥ 0 and the right factor is ≥ 0. If φ*(x) = 0,
the left factor is ≤ 0 and the right factor is ≤ 0.
Therefore,

  0 ≤ ∫ (φ*(x) − φ(x)) (f_{n+1}(x) − Σ_{i=1}^n ki fi(x)) dx
    = ∫ φ*(x) f_{n+1}(x) dx − ∫ φ(x) f_{n+1}(x) dx − Σ_{i=1}^n ki (∫ φ*(x) fi(x) dx − ∫ φ(x) fi(x) dx),

where each term in the last sum equals ci − ci = 0.
Thus,

  ∫ φ*(x) f_{n+1}(x) dx ≥ ∫ φ(x) f_{n+1}(x) dx.

Note:
(i) If f_{n+1} is a pdf, then φ* maximizes the power.
(ii) The Theorem above is the Neyman-Pearson Lemma if n = 1, f1 = f_{θ0}, f2 = f_{θ1}, and
c1 = α.
Example 9.4.3: (continued)
So far, we have seen that there exists no UMP test for H0 : θ = θ0 vs H1 : θ ≠ θ0.
We will show that

  φ3(x) = 1, if X̄ < θ0 − (σ/√n) z_{α/2} or if X̄ > θ0 + (σ/√n) z_{α/2};  0, otherwise

is a UMPU size α test.
Due to Theorem 9.2.2, we only have to consider functions of the sufficient statistic T(X) = X̄.
Let τ² = σ²/n.
To be unbiased and of size α, a test φ must have

(i) ∫ φ(t) f_{θ0}(t) dt = α, and

(ii) (∂/∂θ) ∫ φ(t) f_θ(t) dt |_{θ=θ0} = ∫ φ(t) ((∂/∂θ) f_θ(t)) |_{θ=θ0} dt = 0, i.e., we have a minimum at θ0.

We want to maximize ∫ φ(t) f_θ(t) dt, θ ≠ θ0, such that conditions (i) and (ii) hold.

Lecture 23:
Mo 03/05/01

We choose an arbitrary θ1 ≠ θ0 and let

  f1(t) = f_{θ0}(t)
  f2(t) = (∂/∂θ) f_θ(t) |_{θ=θ0}
  f3(t) = f_{θ1}(t)

We now consider how the conditions on φ* in Theorem 9.4.4 can be met:

  f3(t) > k1 f1(t) + k2 f2(t)
  ⟺ (1/(√(2π) τ)) exp(−(x̄ − θ1)²/(2τ²)) > (k1/(√(2π) τ)) exp(−(x̄ − θ0)²/(2τ²)) + (k2/(√(2π) τ)) exp(−(x̄ − θ0)²/(2τ²)) ((x̄ − θ0)/τ²)
  ⟺ exp(−(x̄ − θ1)²/(2τ²)) > k1 exp(−(x̄ − θ0)²/(2τ²)) + k2 exp(−(x̄ − θ0)²/(2τ²)) ((x̄ − θ0)/τ²)
  ⟺ exp(((x̄ − θ0)² − (x̄ − θ1)²)/(2τ²)) > k1 + k2 (x̄ − θ0)/τ²
  ⟺ exp(x̄(θ1 − θ0)/τ² − (θ1² − θ0²)/(2τ²)) > k1 + k2 (x̄ − θ0)/τ²

Note that the left-hand side of this inequality is increasing in x̄ if θ1 > θ0 and decreasing in
x̄ if θ1 < θ0. Either way, we can choose k1 and k2 such that the linear function in x̄ crosses
the exponential function in x̄ at the two points

  θ_L = θ0 − (σ/√n) z_{α/2},   θ_U = θ0 + (σ/√n) z_{α/2}.

Obviously, φ3 satisfies (i). We still need to check that φ3 satisfies (ii) and that β_{φ3}(θ) has a
minimum at θ0, but we omit this part from our proof here.
φ3 is of the form φ* in Theorem 9.4.4 and therefore φ3 is UMP in C. But the trivial test
φ_t(x) ≡ α also satisfies (i) and (ii) above. Therefore, β_{φ3}(θ) ≥ α ∀θ ≠ θ0. This means that
φ3 is unbiased.
Overall, φ3 is a UMPU test of size α.
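As a quick numerical check on this example, the power function of φ3 is β(θ) = Φ(−z_{α/2} − √n(θ − θ0)/σ) + 1 − Φ(z_{α/2} − √n(θ − θ0)/σ). A small sketch (not from the notes; the helper names are my own) verifies that this power attains its minimum value α at θ = θ0, which is exactly the unbiasedness property shown above:

```python
import math

def norm_cdf(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def z_upper(p):
    # upper-p quantile of N(0,1), found by bisection
    lo, hi = -10.0, 10.0
    while hi - lo > 1e-12:
        mid = 0.5 * (lo + hi)
        if norm_cdf(mid) < 1.0 - p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def power_two_sided(theta, theta0=0.0, sigma=1.0, n=25, alpha=0.05):
    # power of phi3: reject when |Xbar - theta0| > (sigma/sqrt(n)) z_{alpha/2}
    z = z_upper(alpha / 2)
    shift = math.sqrt(n) * (theta - theta0) / sigma
    return norm_cdf(-z - shift) + 1.0 - norm_cdf(z - shift)

assert abs(power_two_sided(0.0) - 0.05) < 1e-9   # size alpha at theta0
assert power_two_sided(0.3) > 0.05               # power exceeds alpha away from theta0
assert power_two_sided(-0.3) > 0.05
```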
Definition 9.4.5:
A test φ is said to be α-similar on a subset Θ* of Θ if

  β_φ(θ) = E_θ(φ(X)) = α  ∀θ ∈ Θ*.

A test φ is said to be similar on Θ* ⊆ Θ if it is α-similar on Θ* for some α, 0 ≤ α ≤ 1.

Note:
The trivial test φ(x) ≡ α is α-similar on every Θ* ⊆ Θ.
Theorem 9.4.6:
Let φ be an unbiased test of size α for H0 : θ ∈ Θ0 vs H1 : θ ∈ Θ1 such that β_φ(θ) is a
continuous function in θ. Then φ is α-similar on the boundary Λ = Θ̄0 ∩ Θ̄1, where Θ̄0 and
Θ̄1 are the closures of Θ0 and Θ1, respectively.

Proof:
Let θ ∈ Λ. There exist sequences {θn} and {θ'n} with θn ∈ Θ0 and θ'n ∈ Θ1 such that
lim_{n→∞} θn = θ and lim_{n→∞} θ'n = θ.
By continuity, β_φ(θn) → β_φ(θ) and β_φ(θ'n) → β_φ(θ).
Since β_φ(θn) ≤ α implies β_φ(θ) ≤ α and since β_φ(θ'n) ≥ α implies β_φ(θ) ≥ α, it must hold
that β_φ(θ) = α ∀θ ∈ Λ.
Lecture 24:
We 03/07/01

Definition 9.4.7:
A test φ that is UMP among all α-similar tests on the boundary Λ = Θ̄0 ∩ Θ̄1 is called a
UMP α-similar test.

Theorem 9.4.8:
Suppose β_φ(θ) is continuous in θ for all tests φ of H0 : θ ∈ Θ0 vs H1 : θ ∈ Θ1. If a size α
test of H0 vs H1 is UMP α-similar, then it is UMP unbiased.

Proof:
Let φ0 be UMP α-similar and of size α. This means that E_θ(φ0(X)) ≤ α ∀θ ∈ Θ0.
Since the trivial test φ(x) ≡ α is α-similar, it must hold for φ0 that β_{φ0}(θ) ≥ α ∀θ ∈ Θ1 since
φ0 is UMP α-similar. This implies that φ0 is unbiased.
Since β_φ(θ) is continuous in θ, we see from Theorem 9.4.6 that the class of unbiased tests is
a subclass of the class of α-similar tests. Since φ0 is UMP in the larger class, it is also UMP
in the subclass. Thus, φ0 is UMPU.
Note:
The continuity of the power function β_φ(θ) cannot always be checked easily.
Example 9.4.9:
Let X1, ..., Xn ~ N(θ, 1).
Let H0 : θ ≤ 0 vs H1 : θ > 0.
Since the family of densities has a MLR in Σ_{i=1}^n Xi, we could use Theorem 9.3.5 to find a UMP
test. However, we want to illustrate the use of Theorem 9.4.8 here.
It is Λ = {0} and the power function

  β_φ(θ) = ∫_{ℝ^n} φ(x) (1/√(2π))^n exp(−(1/2) Σ_{i=1}^n (xi − θ)²) dx

of any test φ is continuous in θ. Thus, due to Theorem 9.4.6, any unbiased size α test of H0
is α-similar on Λ.
We need a UMP test of H0' : θ = 0 vs H1 : θ > 0.
By the NP Lemma, a MP test of H0' : θ = 0 vs H1' : θ = θ1, where θ1 > 0, is given by

  φ(x) = 1, if exp((Σxi²)/2 − (Σ(xi − θ1)²)/2) > k0;  0, otherwise

or equivalently, by Theorem 9.2.2,

  φ(x) = 1, if T = Σ_{i=1}^n Xi > k;  0, otherwise

Since under H0', T ~ N(0, n), k is determined by α = P_{θ=0}(T > k) = P(T/√n > k/√n), i.e.,
k = √n zα.
φ is independent of θ1 for every θ1 > 0. So φ is UMP α-similar for H0' vs. H1.
Finally, φ is of size α, since for θ ≤ 0, it holds that

  E_θ(φ(X)) = P_θ(T > √n zα)
            = P((T − nθ)/√n > zα − √n θ)
        (*) ≤ P(Z > zα)
            = α

(*) holds since (T − nθ)/√n ~ N(0, 1) for θ ≤ 0 and zα − √n θ ≥ zα for θ ≤ 0.
Thus all the requirements are met for Theorem 9.4.8, i.e., β_φ is continuous and φ is UMP
α-similar and of size α, and thus φ is UMPU.
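The size claim E_θ(φ(X)) ≤ α for all θ ≤ 0 can be checked numerically: under θ, (T − nθ)/√n ~ N(0, 1), so E_θ(φ(X)) = 1 − Φ(zα − √n θ). A small sketch (function names are mine; zα for α = 0.05 is hard-coded from a Normal table):

```python
import math

def norm_cdf(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def size_one_sided(theta, n=25, z_alpha=1.6448536269514722):
    # P_theta(sum of X_i > sqrt(n) z_alpha) = 1 - Phi(z_alpha - sqrt(n) theta)
    return 1.0 - norm_cdf(z_alpha - math.sqrt(n) * theta)

assert abs(size_one_sided(0.0) - 0.05) < 1e-7       # exactly alpha on the boundary
assert size_one_sided(-0.2) < 0.05                  # below alpha throughout H0
assert size_one_sided(0.2) > 0.05                   # above alpha throughout H1
```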
Note:
Rohatgi, pages 428-430, lists Theorems (without proofs) stating that for Normal data, one-
and two-tailed t-tests, one- and two-tailed χ²-tests, two-sample t-tests, and F-tests are all
UMPU.
Note:
Recall from Definition 8.2.4 that a class of distributions is invariant under a group G of trans-
formations if for each g ∈ G and for each θ ∈ Θ there exists a unique θ' ∈ Θ such that if
X ~ P_θ, then g(X) ~ P_{θ'}.

Definition 9.4.10:
A group G of transformations on X leaves a hypothesis testing problem invariant if G
leaves both {P_θ : θ ∈ Θ0} and {P_θ : θ ∈ Θ1} invariant, i.e., if y = g(x) ~ h_θ(y), then
{f_θ(x) : θ ∈ Θ0} ≡ {h_θ(y) : θ ∈ Θ0} and {f_θ(x) : θ ∈ Θ1} ≡ {h_θ(y) : θ ∈ Θ1}.
Note:
We want two types of invariance for our tests:

Measurement Invariance: If y = g(x) is a 1-to-1 mapping, the decision based on y should
be the same as the decision based on x. If φ(x) is the test based on x and φ*(y) is the
test based on y, then it must hold that φ(x) = φ*(g(x)) = φ*(y).

Formal Invariance: If two tests have the same structure, i.e., the same Θ, the same pdf's (or
pmf's), and the same hypotheses, then we should use the same test in both problems.
So, if the transformed problem in terms of y has the same formal structure as that of
the problem in terms of x, we must have that φ*(y) = φ(x) = φ*(g(x)).

We can combine these two requirements in the following definition:

Definition 9.4.11:
An invariant test with respect to a group G of transformations is any test φ such that

  φ(x) = φ(g(x))  ∀x ∀g ∈ G.
Example 9.4.12:
Let X ~ Bin(n, p). Let H0 : p = 1/2 vs. H1 : p ≠ 1/2.
Let G = {g1, g2}, where g1(x) = n − x and g2(x) = x.
If φ is invariant, then φ(x) = φ(n − x). Is the test problem invariant? For g2, the answer is
obvious.
For g1, we get:

  g1(X) = n − X ~ Bin(n, 1 − p)
  H0 : p = 1/2 :  {f_p(x) : p = 1/2} = {h_p(g1(x)) : p = 1/2} = Bin(n, 1/2)
  H1 : p ≠ 1/2 :  {f_p(x) : p ≠ 1/2} = {h_p(g1(x)) : p ≠ 1/2}, where both sides equal {Bin(n, p) : p ≠ 1/2}

So all the requirements in Definition 9.4.10 are met. If, for example, n = 10, the test

  φ(x) = 1, if x ∈ {0, 1, 2, 8, 9, 10};  0, otherwise

is invariant under G. For example, φ(4) = 0 = φ(10 − 4) = φ(6), and, in general,
φ(x) = φ(10 − x) ∀x ∈ {0, 1, ..., 10}.
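The invariance requirement is easy to verify mechanically. A small sketch (the function name is mine, not from the notes) encodes the n = 10 test above and checks φ(x) = φ(n − x) for every x:

```python
def phi(x, n=10):
    # the invariant test from Example 9.4.12: reject in both extreme tails
    return 1 if x in {0, 1, 2, n - 2, n - 1, n} else 0

# invariance under g1(x) = n - x (g2 is the identity, so nothing to check there)
for x in range(11):
    assert phi(x) == phi(10 - x)
```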
Lecture 25:
Fr 03/09/01

Example 9.4.13:
Let X1, ..., Xn ~ N(μ, σ²) where both μ and σ² > 0 are unknown. It is X̄ ~ N(μ, σ²/n) and
(n−1)S²/σ² ~ χ²_{n−1}, and X̄ and S² are independent.
Let H0 : μ ≤ 0 vs. H1 : μ > 0.
Let G be the group of scale changes:

  G = {g_c(x̄, s²), c > 0 : g_c(x̄, s²) = (c x̄, c² s²)}

The problem is invariant because, when g_c(x̄, s²) = (c x̄, c² s²), then
(i) c X̄ and c² S² are independent.
(ii) c X̄ ~ N(cμ, c²σ²/n) ∈ {N(μ, σ²/n)}.
(iii) (n−1) c² S² / (c² σ²) ~ χ²_{n−1}.
So, this is the same family of distributions, and Definition 9.4.10 holds because μ ≤ 0 implies
that cμ ≤ 0 (for c > 0).
An invariant test satisfies φ(x̄, s²) ≡ φ(c x̄, c² s²), c > 0, s² > 0, x̄ ∈ ℝ.
Let c = 1/s. Then φ(x̄, s²) ≡ φ(x̄/s, 1), so invariant tests depend on (x̄, s²) only through x̄/s.
If x̄1/s1 ≠ x̄2/s2, then there exists no c > 0 such that (x̄2, s2²) = (c x̄1, c² s1²). So invariance places
no restrictions on φ between points with different values of x̄/s. Thus, invariant tests are exactly those that depend
only on x̄/s, which are equivalent to tests that are based only on t = x̄/(s/√n). Since this mapping
is 1-to-1, the invariant test will use T = X̄/(S/√n) ~ t_{n−1} if μ = 0. Note that this test does not
depend on the nuisance parameter σ². Invariance often produces such results.
Definition 9.4.14:
Let G be a group of transformations on the space of X. We say a statistic T(x) is maximal
invariant under G if
(i) T is invariant, i.e., T(x) = T(g(x)) ∀g ∈ G, and
(ii) T is maximal, i.e., T(x1) = T(x2) implies that x1 = g(x2) for some g ∈ G.
Example 9.4.15:
Let x = (x1, ..., xn) and g_c(x) = (x1 + c, ..., xn + c).
Consider T(x) = (xn − x1, xn − x2, ..., xn − x_{n−1}).
It is T(g_c(x)) = (xn − x1, xn − x2, ..., xn − x_{n−1}) = T(x), so T is invariant.
If T(x) = T(x'), then xn − xi = x'n − x'i ∀i = 1, 2, ..., n−1.
This implies that xi − x'i = xn − x'n = c ∀i = 1, 2, ..., n−1.
Thus, g_c(x') = (x'1 + c, ..., x'n + c) = x.
Therefore, T is maximal invariant.
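Both defining properties can be checked numerically for the shift group. A short sketch (helper names are my own; the shifts are chosen to be exactly representable in binary floating point so the tuple comparisons are exact):

```python
def T(x):
    # candidate maximal invariant: differences from the last coordinate
    return tuple(x[-1] - xi for xi in x[:-1])

def shift(x, c):
    # the group action g_c(x) = (x_1 + c, ..., x_n + c)
    return [xi + c for xi in x]

x = [1.0, 4.0, 2.5, 3.0]
# (i) invariance: T is unchanged by any common shift
assert T(x) == T(shift(x, 7.25)) == T(shift(x, -2.5))
# (ii) maximality: equal T values force a single common shift c
y = shift(x, 3.0)
assert T(x) == T(y) and len({yi - xi for xi, yi in zip(x, y)}) == 1
```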
Definition 9.4.16:
Let Iα be the class of all invariant tests of size α of H0 : θ ∈ Θ0 vs. H1 : θ ∈ Θ1. If there
exists a UMP member in Iα, it is called the UMP invariant test of H0 vs H1.

Theorem 9.4.17:
Let T(x) be maximal invariant with respect to G. A test φ is invariant under G iff φ is a
function of T.

Proof:
"⟹":
Let φ be invariant under G. If T(x1) = T(x2), then there exists a g ∈ G such that x1 = g(x2).
Thus, it follows from invariance that φ(x1) = φ(g(x2)) = φ(x2). Since φ is the same whenever
T(x1) = T(x2), φ must be a function of T.
"⟸": Let φ be a function of T, i.e., φ(x) = h(T(x)). It follows that

  φ(g(x)) = h(T(g(x))) (*)= h(T(x)) = φ(x).

(*) holds since T is invariant.
This means that φ is invariant.
Example 9.4.18:
Consider the test problem

  H0 : X ~ f0(x1 − θ, ..., xn − θ) vs. H1 : X ~ f1(x1 − θ, ..., xn − θ),

where θ ∈ ℝ.
Let G be the group of transformations with

  g_c(x) = (x1 + c, ..., xn + c),

where c ∈ ℝ and n ≥ 2.
As shown in Example 9.4.15, a maximal invariant statistic is T(X) = (X1 − Xn, ..., X_{n−1} − Xn)
= (T1, ..., T_{n−1}). Due to Theorem 9.4.17, an invariant test φ depends on X only through T.
Since the transformation

  (T1, ..., T_{n−1}, Z) = (X1 − Xn, ..., X_{n−1} − Xn, Xn)

is 1-to-1, there exist inverses Xn = Z and Xi = Ti + Xn = Ti + Z ∀i = 1, ..., n − 1.
Applying Theorem 4.3.5 and integrating out the last component Z (= Xn) gives us the joint
pdf of T = (T1, ..., T_{n−1}).
Thus, under Hi, i = 0, 1, the joint pdf of T is given by

  ∫_{−∞}^{∞} fi(t1 + z, t2 + z, ..., t_{n−1} + z, z) dz,

which is independent of θ. The problem is thus reduced to testing a simple hypothesis against
a simple alternative. By the NP Lemma (Theorem 9.2.1), the MP test is

  φ(t1, ..., t_{n−1}) = 1, if λ(t) > c;  0, if λ(t) < c

where t = (t1, ..., t_{n−1}) and

  λ(t) = ∫_{−∞}^{∞} f1(t1 + z, ..., t_{n−1} + z, z) dz / ∫_{−∞}^{∞} f0(t1 + z, ..., t_{n−1} + z, z) dz.

In the homework assignment, we use this result to construct a UMP invariant test of

  H0 : X ~ N(θ, 1) vs. H1 : X ~ Cauchy(1, θ),

where a Cauchy(1, θ) distribution has pdf f(x, θ) = (1/π) · 1/(1 + (x − θ)²), where θ ∈ ℝ.
Lecture 26:
Mo 03/19/01

10 More on Hypothesis Testing

10.1 Likelihood Ratio Tests

Definition 10.1.1:
The likelihood ratio test statistic for

  H0 : θ ∈ Θ0 vs. H1 : θ ∈ Θ1 = Θ − Θ0

is

  λ(x) = sup_{θ∈Θ0} f_θ(x) / sup_{θ∈Θ} f_θ(x).

The likelihood ratio test (LRT) is the test function

  φ(x) = I_{[0,c)}(λ(x)),

for some constant c ∈ [0, 1], where c is usually chosen in such a way to make φ a test of size
α.

Note:
(i) We have to select c such that 0 ≤ c ≤ 1 since 0 ≤ λ(x) ≤ 1.
(ii) LRT's are strongly related to MLE's. If θ̂ is the unrestricted MLE of θ over Θ and θ̂0 is
the MLE of θ over Θ0, then λ(x) = f_{θ̂0}(x) / f_{θ̂}(x).
Example 10.1.2:
Let X1, ..., Xn be a sample from N(μ, 1). We want to construct a LRT for

  H0 : μ = μ0 vs. H1 : μ ≠ μ0.

It is μ̂0 = μ0 and μ̂ = X̄. Thus,

  λ(x) = (2π)^{−n/2} exp(−(1/2) Σ(xi − μ0)²) / ((2π)^{−n/2} exp(−(1/2) Σ(xi − x̄)²))
       = exp(−(n/2)(x̄ − μ0)²).

The LRT rejects H0 if λ(x) ≤ c, or equivalently, |x̄ − μ0| ≥ √(−(2 log c)/n). This means the LRT
rejects H0 : μ = μ0 if x̄ is too far from μ0.
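The equivalence between rejecting for small λ(x) and rejecting for large |x̄ − μ0| can be checked directly. A small sketch (function name and the data are mine):

```python
import math

def lrt_lambda(xs, mu0):
    # lambda(x) = exp(-(n/2) * (xbar - mu0)^2) for N(mu, 1) data
    n = len(xs)
    xbar = sum(xs) / n
    return math.exp(-0.5 * n * (xbar - mu0) ** 2)

xs = [0.2, -0.5, 1.1, 0.4]
n, c = len(xs), 0.1
xbar = sum(xs) / n
# |xbar - mu0| threshold that is equivalent to lambda <= c
cutoff = math.sqrt(-2.0 * math.log(c) / n)
for mu0 in (0.0, 2.0):
    assert (lrt_lambda(xs, mu0) <= c) == (abs(xbar - mu0) >= cutoff)
```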
Theorem 10.1.3:
If T(X) is sufficient for θ and λ*(t) and λ(x) are LRT statistics based on T and X respectively,
then

  λ*(T(x)) = λ(x)  ∀x,

i.e., the LRT can be expressed as a function of every sufficient statistic for θ.

Proof:
Since T is sufficient, it follows from Theorem 8.3.5 that its pdf (or pmf) factorizes as f_θ(x) =
g_θ(T) h(x). Therefore we get:

  λ(x) = sup_{θ∈Θ0} f_θ(x) / sup_{θ∈Θ} f_θ(x)
       = sup_{θ∈Θ0} g_θ(T) h(x) / sup_{θ∈Θ} g_θ(T) h(x)
       = sup_{θ∈Θ0} g_θ(T) / sup_{θ∈Θ} g_θ(T)
       = λ*(T(x))

Thus, our simplified expression for λ(x) indeed only depends on a sufficient statistic T.

Theorem 10.1.4:
If for a given α, 0 ≤ α ≤ 1, and for a simple hypothesis H0 and a simple alternative H1 a non-
randomized test based on the NP Lemma and a LRT exist, then these tests are equivalent.

Note:
Usually, LRT's perform well since they are often UMP or UMPU size α tests. However, this
does not always hold. Rohatgi, Example 4, pages 440-441, cites an example where the LRT is
not unbiased and is even worse than the trivial test φ(x) ≡ α.

Theorem 10.1.5:
Under some regularity conditions on f_θ(x), the rv −2 log λ(X) under H0 has asymptotically
a chi-squared distribution with ν degrees of freedom, where ν equals the difference between
the number of independent parameters in Θ and Θ0, i.e.,

  −2 log λ(X) →d χ²_ν under H0.
Note:
The regularity conditions required for Theorem 10.1.5 are basically the same as for Theorem
8.7.10. By "independent" parameters we understand parameters that are unspecified, i.e.,
free to vary.
Example 10.1.6:
Let X1, ..., Xn ~ N(μ, σ²) where μ ∈ ℝ and σ² > 0 are both unknown.
Let H0 : μ = μ0 vs. H1 : μ ≠ μ0.
We have θ = (μ, σ²), Θ = {(μ, σ²) : μ ∈ ℝ, σ² > 0} and Θ0 = {(μ0, σ²) : σ² > 0}.
It is θ̂0 = (μ0, (1/n) Σ_{i=1}^n (xi − μ0)²) and θ̂ = (x̄, (1/n) Σ_{i=1}^n (xi − x̄)²).
Now, the LR test statistic λ(x) can be determined:

  λ(x) = f_{θ̂0}(x) / f_{θ̂}(x)
       = [(n/(2π))^{n/2} (Σ(xi − μ0)²)^{−n/2} exp(−Σ(xi − μ0)² / (2 (1/n) Σ(xi − μ0)²))]
         / [(n/(2π))^{n/2} (Σ(xi − x̄)²)^{−n/2} exp(−Σ(xi − x̄)² / (2 (1/n) Σ(xi − x̄)²))]
       = (Σ(xi − x̄)² / Σ(xi − μ0)²)^{n/2}
       = ((Σxi² − n x̄²) / (Σxi² − 2μ0 Σxi + nμ0² − n x̄² + n x̄²))^{n/2}
       = (1 / (1 + n(x̄ − μ0)² / Σ(xi − x̄)²))^{n/2}

Note that this is a decreasing function of the square of

  t(X) = √n (X̄ − μ0) / √((1/(n−1)) Σ(Xi − X̄)²) = √n (X̄ − μ0) / S  (*)~ t_{n−1}.

(*) holds due to Corollary 7.2.4.
So we reject H0 if |t(x)| is large. Now,

  −2 log λ(x) = −2 (−n/2) log(1 + n(x̄ − μ0)² / Σ(xi − x̄)²)
              = n log(1 + n(x̄ − μ0)² / Σ(xi − x̄)²)
Under H0, √n(X̄ − μ0)/σ ~ N(0, 1) and Σ(Xi − X̄)²/σ² ~ χ²_{n−1}, and both are independent.
Therefore, under H0,

  n(X̄ − μ0)² / ((1/(n−1)) Σ(Xi − X̄)²) ~ F_{1,n−1}.

Thus, the mgf of −2 log λ(X) under H0 is

  Mn(t) = E_{H0}(exp(−2t log λ(X)))
        = E_{H0}(exp(nt log(1 + F/(n−1))))
        = E_{H0}((1 + F/(n−1))^{nt})
        = ∫_0^∞ [Γ(n/2) / (Γ((n−1)/2) Γ(1/2) (n−1)^{1/2})] f^{−1/2} (1 + f/(n−1))^{−n/2} (1 + f/(n−1))^{nt} df

Note that

  f_{1,n−1}(f) = [Γ(n/2) / (Γ((n−1)/2) Γ(1/2) (n−1)^{1/2})] f^{−1/2} (1 + f/(n−1))^{−n/2} I_{[0,∞)}(f)

is the pdf of a F_{1,n−1} distribution.
Let y = (1 + f/(n−1))^{−1}; then f/(n−1) = (1−y)/y and df = −((n−1)/y²) dy.
Thus,

  Mn(t) = [Γ(n/2) / (Γ((n−1)/2) Γ(1/2) (n−1)^{1/2})] ∫_0^∞ f^{−1/2} (1 + f/(n−1))^{nt − n/2} df
        = [Γ(n/2) / (Γ((n−1)/2) Γ(1/2) (n−1)^{1/2})] (n−1)^{1/2} ∫_0^1 y^{(n−3)/2 − nt} (1−y)^{−1/2} dy
    (*) = [Γ(n/2) / (Γ((n−1)/2) Γ(1/2))] B((n−1)/2 − nt, 1/2)
        = [Γ(n/2) / (Γ((n−1)/2) Γ(1/2))] Γ((n−1)/2 − nt) Γ(1/2) / Γ(n/2 − nt)
        = Γ(n/2) Γ((n−1)/2 − nt) / (Γ((n−1)/2) Γ(n/2 − nt)),   t < 1/2 − 1/(2n)

(*) holds since the integral represents the Beta function (see also Example 8.8.11).

As n → ∞, we can apply Stirling's formula, which states that

  Γ(α(n) + 1) = (α(n))! ≈ √(2π) (α(n))^{α(n)+1/2} exp(−α(n)).
Lecture 27:
We 03/21/01

So,

  Mn(t) ≈ [√(2π) ((n−2)/2)^{(n−1)/2} exp(−(n−2)/2) · √(2π) ((n(1−2t)−3)/2)^{(n(1−2t)−2)/2} exp(−(n(1−2t)−3)/2)]
        / [√(2π) ((n−3)/2)^{(n−2)/2} exp(−(n−3)/2) · √(2π) ((n(1−2t)−2)/2)^{(n(1−2t)−1)/2} exp(−(n(1−2t)−2)/2)]

        = ((n−2)/(n−3))^{(n−2)/2} ((n−2)/2)^{1/2} ((n(1−2t)−3)/(n(1−2t)−2))^{(n(1−2t)−2)/2} ((n(1−2t)−2)/2)^{−1/2}

        = (1 + 1/(n−3))^{(n−2)/2} · (1 − 1/(n(1−2t)−2))^{(n(1−2t)−2)/2} · ((n−2)/(n(1−2t)−2))^{1/2},

where, as n → ∞, the first factor → e^{1/2}, the second factor → e^{−1/2}, and the third factor → (1−2t)^{−1/2}.
Thus,

  Mn(t) → 1/(1−2t)^{1/2} as n → ∞.

Note that this is the mgf of a χ²₁ distribution. Therefore, it follows by the Continuity Theorem
(Theorem 6.4.2) that

  −2 log λ(X) →d χ²₁.

Obviously, we could use Theorem 10.1.5 (after checking that the regularity conditions hold)
as well to obtain the same result. Since under H0 one parameter (σ²) is unspecified and under
H1 two parameters (μ, σ²) are unspecified, it is ν = 2 − 1 = 1.
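The χ²₁ limit can also be seen in simulation. The following sketch (sample size, replication count, and seed are arbitrary choices of mine) draws data under H0 and compares −2 log λ(X) = n log(1 + n(x̄ − μ0)²/Σ(xi − x̄)²) to χ²₁ benchmarks:

```python
import math
import random

random.seed(42)
n, reps, mu0 = 50, 4000, 0.0
vals = []
for _ in range(reps):
    xs = [random.gauss(mu0, 1.0) for _ in range(n)]  # H0 is true
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)
    vals.append(n * math.log(1.0 + n * (xbar - mu0) ** 2 / ss))

mean_val = sum(vals) / reps                     # chi^2_1 has mean 1
tail = sum(v > 3.841 for v in vals) / reps      # 3.841 is the upper 5% point of chi^2_1
```

With these settings the sample mean of −2 log λ lands near 1 and roughly 5% of the replicates exceed 3.841, in line with the asymptotic χ²₁ distribution.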
10.2 Parametric Chi-Squared Tests

Definition 10.2.1: Normal Variance Tests
Let X1, ..., Xn be a sample from a N(μ, σ²) distribution where μ may be known or unknown
and σ² > 0 is unknown. The following table summarizes the χ² tests that are typically being
used:

                             Reject H0 at level α if
      H0       H1        μ known                             μ unknown
  I   σ ≥ σ0   σ < σ0    Σ(xi − μ)² ≤ σ0² χ²_{n,1−α}         s² ≤ (σ0²/(n−1)) χ²_{n−1,1−α}
  II  σ ≤ σ0   σ > σ0    Σ(xi − μ)² ≥ σ0² χ²_{n,α}           s² ≥ (σ0²/(n−1)) χ²_{n−1,α}
  III σ = σ0   σ ≠ σ0    Σ(xi − μ)² ≤ σ0² χ²_{n,1−α/2}       s² ≤ (σ0²/(n−1)) χ²_{n−1,1−α/2}
                         or Σ(xi − μ)² ≥ σ0² χ²_{n,α/2}      or s² ≥ (σ0²/(n−1)) χ²_{n−1,α/2}

Note:
(i) In Definition 10.2.1, σ0 is any fixed positive constant.
(ii) Tests I and II are UMPU if μ is unknown and UMP if μ is known.
(iii) In test III, the constants have been chosen in such a way as to give equal probability to
each tail. This is the usual approach. However, this may result in a biased test.
(iv) χ²_{n,1−α} is the (lower) α quantile and χ²_{n,α} is the (upper) 1−α quantile, i.e., for X ~ χ²_n,
it holds that P(X ≤ χ²_{n,1−α}) = α and P(X ≤ χ²_{n,α}) = 1 − α.
(v) We can also use χ² tests to test for equality of binomial probabilities, as shown in the
next few Theorems.
Theorem 10.2.2:
Let X1, ..., Xk be independent rv's with Xi ~ Bin(ni, pi), i = 1, ..., k. Then it holds that

  T = Σ_{i=1}^k ((Xi − ni pi) / √(ni pi (1 − pi)))² →d χ²_k

as n1, ..., nk → ∞.

Proof:
Homework
Corollary 10.2.3:
Let X1, ..., Xk be as in Theorem 10.2.2 above. We want to test the hypothesis H0 : p1 =
p2 = ... = pk = p, where p is a known constant. An approximate level-α test rejects H0 if

  y = Σ_{i=1}^k ((xi − ni p) / √(ni p (1 − p)))² ≥ χ²_{k,α}.

Theorem 10.2.4:
Let X1, ..., Xk be independent rv's with Xi ~ Bin(ni, p), i = 1, ..., k. Then the MLE of p is

  p̂ = Σ_{i=1}^k xi / Σ_{i=1}^k ni.

Proof:
This can be shown by using the joint likelihood function or by the fact that ΣXi ~ Bin(Σni, p)
and for X ~ Bin(n, p), the MLE is p̂ = x/n.
Theorem 10.2.5:
Let X1, ..., Xk be independent rv's with Xi ~ Bin(ni, pi), i = 1, ..., k. An approximate
level-α test of H0 : p1 = p2 = ... = pk = p, where p is unknown, rejects H0 if

  y = Σ_{i=1}^k ((xi − ni p̂) / √(ni p̂ (1 − p̂)))² ≥ χ²_{k−1,α},

where p̂ = Σxi / Σni.

Theorem 10.2.6:
Let (X1, ..., Xk) be a multinomial rv with parameters n, p1, p2, ..., pk where Σ_{i=1}^k pi = 1 and
Σ_{i=1}^k Xi = n. Then it holds that

  Uk = Σ_{i=1}^k (Xi − n pi)² / (n pi) →d χ²_{k−1}

as n → ∞.
An approximate level-α test of H0 : p1 = p01, p2 = p02, ..., pk = p0k rejects H0 if

  Σ_{i=1}^k (xi − n p0i)² / (n p0i) > χ²_{k−1,α}.
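The statistic in Theorem 10.2.6 is simple to compute. A small sketch (the function name and the fair-die counts are mine) for k = 6 cells, so the reference distribution is χ² with k − 1 = 5 degrees of freedom:

```python
def chisq_stat(counts, probs):
    # U_k = sum over cells of (X_i - n p_i)^2 / (n p_i)
    n = sum(counts)
    return sum((x - n * p) ** 2 / (n * p) for x, p in zip(counts, probs))

counts = [18, 22, 17, 25, 20, 18]          # 120 rolls of a die
u = chisq_stat(counts, [1.0 / 6.0] * 6)    # expected count is 20 per face
# here u is about 2.3, well below chi^2_{5,0.05} = 11.07: do not reject fairness
assert abs(u - 2.3) < 1e-6
```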
Proof:
Case k = 2 only:

  U2 = (X1 − np1)²/(np1) + (X2 − np2)²/(np2)
     = (X1 − np1)²/(np1) + (n − X1 − n(1 − p1))²/(n(1 − p1))
     = (X1 − np1)² (1/(np1) + 1/(n(1 − p1)))
     = (X1 − np1)² ((1 − p1) + p1)/(np1(1 − p1))
     = (X1 − np1)²/(np1(1 − p1))

By the CLT, (X1 − np1)/√(np1(1 − p1)) →d N(0, 1). Therefore, U2 →d χ²₁.
Lecture 28:
Fr 03/23/01

Theorem 10.2.7:
Let X1, ..., Xn be a sample from X. Let H0 : X ~ F, where the functional form of F is
known completely. We partition the real line into k disjoint Borel sets A1, ..., Ak and let
P(X ∈ Ai) = pi, where pi > 0 ∀i = 1, ..., k.
Let Yj = # Xi's in Aj = Σ_{i=1}^n I_{Aj}(Xi), ∀j = 1, ..., k.
Then (Y1, ..., Yk) has a multinomial distribution with parameters n, p1, p2, ..., pk.

Theorem 10.2.8:
Let X1, ..., Xn be a sample from X. Let H0 : X ~ F_θ, where θ = (θ1, ..., θr) is unknown.
Let the MLE θ̂ exist. We partition the real line into k disjoint Borel sets A1, ..., Ak and let
P_{θ̂}(X ∈ Ai) = p̂i, where p̂i > 0 ∀i = 1, ..., k.
Let Yj = # Xi's in Aj = Σ_{i=1}^n I_{Aj}(Xi), ∀j = 1, ..., k.
Then it holds that

  Vk = Σ_{i=1}^k (Yi − n p̂i)² / (n p̂i) →d χ²_{k−r−1}.

An approximate level-α test of H0 : X ~ F_θ rejects H0 if

  Σ_{i=1}^k (yi − n p̂i)² / (n p̂i) > χ²_{k−r−1,α},

where r is the number of parameters in θ that have to be estimated.
10.3 t-Tests and F-Tests

Definition 10.3.1: One- and Two-Tailed t-Tests
Let X1, ..., Xn be a sample from a N(μ, σ²) distribution where σ² > 0 may be known or
unknown and μ is unknown. Let X̄ = (1/n) Σ_{i=1}^n Xi and S² = (1/(n−1)) Σ_{i=1}^n (Xi − X̄)².
The following table summarizes the z- and t-tests that are typically being used:

                             Reject H0 at level α if
      H0       H1        σ² known                          σ² unknown
  I   μ ≤ μ0   μ > μ0    X̄ ≥ μ0 + (σ/√n) zα               X̄ ≥ μ0 + (s/√n) t_{n−1,α}
  II  μ ≥ μ0   μ < μ0    X̄ ≤ μ0 + (σ/√n) z_{1−α}          X̄ ≤ μ0 + (s/√n) t_{n−1,1−α}
  III μ = μ0   μ ≠ μ0    |X̄ − μ0| ≥ (σ/√n) z_{α/2}        |X̄ − μ0| ≥ (s/√n) t_{n−1,α/2}

Note:
(i) In Definition 10.3.1, μ0 is any fixed constant.
(ii) These tests are based on just one sample and are often called one-sample t-tests.
(iii) Tests I and II are UMP and test III is UMPU if σ² is known. Tests I, II, and III are
UMPU and UMP invariant if σ² is unknown.
(iv) For large n (≥ 30), we can use z-tables instead of t-tables. Also, for large n we can
drop the Normality assumption due to the CLT. However, for small n, none of these
simplifications is justified.
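The one-sample t statistic underlying these tests can be sketched as follows (the function name is mine, not from the notes):

```python
import math

def one_sample_t(xs, mu0):
    # t = (xbar - mu0) / (s / sqrt(n)), compared against t_{n-1} quantiles
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)  # unbiased sample variance
    return (xbar - mu0) / math.sqrt(s2 / n)

# xbar = 3, s^2 = 2.5, so t = (3 - 2) / sqrt(2.5 / 5) = sqrt(2)
assert abs(one_sample_t([1.0, 2.0, 3.0, 4.0, 5.0], 2.0) - math.sqrt(2.0)) < 1e-12
```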
Definition 10.3.2: Two-Sample t-Tests
Let X1, ..., Xm be a sample from a N(μ1, σ1²) distribution where σ1² > 0 may be known or
unknown and μ1 is unknown. Let Y1, ..., Yn be a sample from a N(μ2, σ2²) distribution where
σ2² > 0 may be known or unknown and μ2 is unknown.
Let X̄ = (1/m) Σ_{i=1}^m Xi and S1² = (1/(m−1)) Σ_{i=1}^m (Xi − X̄)².
Let Ȳ = (1/n) Σ_{i=1}^n Yi and S2² = (1/(n−1)) Σ_{i=1}^n (Yi − Ȳ)².
Let Sp² = ((m−1)S1² + (n−1)S2²)/(m+n−2).
The following table summarizes the z- and t-tests that are typically being used:

                                    Reject H0 at level α if
      H0            H1            σ1², σ2² known                          σ1², σ2² unknown, σ1 = σ2
  I   μ1−μ2 ≤ δ    μ1−μ2 > δ    X̄ − Ȳ ≥ δ + zα √(σ1²/m + σ2²/n)         X̄ − Ȳ ≥ δ + t_{m+n−2,α} Sp √(1/m + 1/n)
  II  μ1−μ2 ≥ δ    μ1−μ2 < δ    X̄ − Ȳ ≤ δ + z_{1−α} √(σ1²/m + σ2²/n)    X̄ − Ȳ ≤ δ + t_{m+n−2,1−α} Sp √(1/m + 1/n)
  III μ1−μ2 = δ    μ1−μ2 ≠ δ    |X̄ − Ȳ − δ| ≥ z_{α/2} √(σ1²/m + σ2²/n)  |X̄ − Ȳ − δ| ≥ t_{m+n−2,α/2} Sp √(1/m + 1/n)

Note:
(i) In Definition 10.3.2, δ is any fixed constant.
(ii) All tests are UMPU and UMP invariant.
(iii) If σ1² = σ2² = σ² (which is unknown), then Sp² is an unbiased estimate of σ². We should
check that σ1² = σ2² with an F-test.
(iv) For large m + n, we can use z-tables instead of t-tables. Also, for large m and large n
we can drop the Normality assumption due to the CLT. However, for small m or small
n, none of these simplifications is justified.
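For the unknown-variance column with δ = 0, the pooled statistic can be sketched as follows (the function name is mine):

```python
import math

def pooled_t(xs, ys, delta=0.0):
    m, n = len(xs), len(ys)
    xbar, ybar = sum(xs) / m, sum(ys) / n
    s1_sq = sum((x - xbar) ** 2 for x in xs) / (m - 1)
    s2_sq = sum((y - ybar) ** 2 for y in ys) / (n - 1)
    # pooled variance, an unbiased estimate of the common sigma^2
    sp_sq = ((m - 1) * s1_sq + (n - 1) * s2_sq) / (m + n - 2)
    return (xbar - ybar - delta) / math.sqrt(sp_sq * (1.0 / m + 1.0 / n))

# equal variances, means 2 vs 3: t = -1 / sqrt(1 * (1/3 + 1/3)) = -sqrt(3/2)
assert abs(pooled_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]) + math.sqrt(1.5)) < 1e-12
```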
Definition 10.3.3: Paired t-Tests
Let (X1, Y1), ..., (Xn, Yn) be a sample from a bivariate N(μ1, μ2, σ1², σ2², ρ) distribution where
all 5 parameters are unknown.
Let Di = Xi − Yi ~ N(μ1 − μ2, σ1² + σ2² − 2ρσ1σ2).
Let D̄ = (1/n) Σ_{i=1}^n Di and Sd² = (1/(n−1)) Σ_{i=1}^n (Di − D̄)².
The following table summarizes the t-tests that are typically being used:

      H0            H1            Reject H0 at level α if
  I   μ1−μ2 ≤ δ    μ1−μ2 > δ    D̄ ≥ δ + (Sd/√n) t_{n−1,α}
  II  μ1−μ2 ≥ δ    μ1−μ2 < δ    D̄ ≤ δ + (Sd/√n) t_{n−1,1−α}
  III μ1−μ2 = δ    μ1−μ2 ≠ δ    |D̄ − δ| ≥ (Sd/√n) t_{n−1,α/2}

Note:
(i) In Definition 10.3.3, δ is any fixed constant.
(ii) These tests are special cases of one-sample tests. All the properties stated in the Note
following Definition 10.3.1 hold.
(iii) We could do a test based on Normality assumptions if σ² = σ1² + σ2² − 2ρσ1σ2 were
known, but that is a very unrealistic assumption.
Definition 10.3.4: F-Tests
Let X1, ..., Xm be a sample from a N(μ1, σ1²) distribution where μ1 may be known or unknown
and σ1² is unknown. Let Y1, ..., Yn be a sample from a N(μ2, σ2²) distribution where μ2 may
be known or unknown and σ2² is unknown.
Recall that

  Σ_{i=1}^m (Xi − X̄)²/σ1² ~ χ²_{m−1},   Σ_{i=1}^n (Yi − Ȳ)²/σ2² ~ χ²_{n−1},

and

  [Σ_{i=1}^m (Xi − X̄)²/((m−1)σ1²)] / [Σ_{i=1}^n (Yi − Ȳ)²/((n−1)σ2²)] = (σ2²/σ1²)(S1²/S2²) ~ F_{m−1,n−1}.

The following table summarizes the F-tests that are typically being used:

                              Reject H0 at level α if
      H0          H1          μ1, μ2 known                                        μ1, μ2 unknown
  I   σ1² ≤ σ2²   σ1² > σ2²   [(1/m)Σ(Xi−μ1)²] / [(1/n)Σ(Yi−μ2)²] ≥ F_{m,n,α}    S1²/S2² ≥ F_{m−1,n−1,α}
  II  σ1² ≥ σ2²   σ1² < σ2²   [(1/n)Σ(Yi−μ2)²] / [(1/m)Σ(Xi−μ1)²] ≥ F_{n,m,α}    S2²/S1² ≥ F_{n−1,m−1,α}
  III σ1² = σ2²   σ1² ≠ σ2²   [(1/m)Σ(Xi−μ1)²] / [(1/n)Σ(Yi−μ2)²] ≥ F_{m,n,α/2}  S1²/S2² ≥ F_{m−1,n−1,α/2}
                              or [(1/n)Σ(Yi−μ2)²] / [(1/m)Σ(Xi−μ1)²] ≥ F_{n,m,α/2}  or S2²/S1² ≥ F_{n−1,m−1,α/2}

Note:
(i) Tests I and II are UMPU and UMP invariant if μ1 and μ2 are unknown.
(ii) Test III uses equal tails and therefore may not be unbiased.
(iii) If an F-test (at level α1) and a t-test (at level α2) are both performed, the combined
test has level α = 1 − (1 − α1)(1 − α2) ≥ max(α1, α2) (≈ α1 + α2 if both are small).
Lecture 29:
Mo 03/26/01

10.4 Bayes and Minimax Tests
Hypothesis testing may be conducted in a decision-theoretic framework. Here our action
space A consists of two options: a0 = fail to reject H0 and a1 = reject H0.
Usually, we assume no loss for a correct decision. Thus, our loss function looks like:

  L(θ, a0) = 0, if θ ∈ Θ0;  a(θ), if θ ∈ Θ1
  L(θ, a1) = b(θ), if θ ∈ Θ0;  0, if θ ∈ Θ1

We consider the following special cases:

0-1 loss: a(θ) = b(θ) = 1, i.e., all errors are equally bad.

Generalized 0-1 loss: a(θ) = cII, b(θ) = cI, i.e., all Type I errors are equally bad and all
Type II errors are equally bad, and Type I errors are worse than Type II errors or vice
versa.

Then the risk function can be written as

  R(θ, d(X)) = L(θ, a0) P_θ(d(X) = a0) + L(θ, a1) P_θ(d(X) = a1)
             = a(θ) P_θ(d(X) = a0), if θ ∈ Θ1;  b(θ) P_θ(d(X) = a1), if θ ∈ Θ0

The minimax rule minimizes

  max_θ {a(θ) P_θ(d(X) = a0), b(θ) P_θ(d(X) = a1)}.

Theorem 10.4.1:
The minimax rule d for testing

  H0 : θ = θ0 vs. H1 : θ = θ1

under the generalized 0-1 loss function rejects H0 if

  f_{θ1}(x) / f_{θ0}(x) ≥ k,

where k is chosen such that

  R(θ0, d(X)) = R(θ1, d(X))
  ⟺ cII P_{θ1}(d(X) = a0) = cI P_{θ0}(d(X) = a1)
  ⟺ cII P_{θ1}(f_{θ1}(X)/f_{θ0}(X) < k) = cI P_{θ0}(f_{θ1}(X)/f_{θ0}(X) ≥ k).
Proof:
Let d* be any other rule.

- If R(θ0, d) < R(θ0, d*), then

    R(θ0, d) = R(θ1, d) < max{R(θ0, d*), R(θ1, d*)}.

  So d* is not minimax.

- If R(θ0, d) ≥ R(θ0, d*), i.e.,

    cI P_{θ0}(d = a1) = R(θ0, d) ≥ R(θ0, d*) = cI P_{θ0}(d* = a1),

  then

    P(reject H0 | H0 true) = P_{θ0}(d = a1) ≥ P_{θ0}(d* = a1).

  By the NP Lemma, the rule d is MP of its size. Thus,

    P_{θ1}(d = a1) ≥ P_{θ1}(d* = a1) ⟺ P_{θ1}(d = a0) ≤ P_{θ1}(d* = a0)
    ⟹ R(θ1, d) ≤ R(θ1, d*)
    ⟹ max{R(θ0, d), R(θ1, d)} = R(θ1, d) ≤ R(θ1, d*) ≤ max{R(θ0, d*), R(θ1, d*)} ∀d*
    ⟹ d is minimax
Example 10.4.2:
Let X1, ..., Xn be iid N(μ, 1). Let H0 : μ = μ0 vs. H1 : μ = μ1 > μ0.
As we have seen before, f_{μ1}(x)/f_{μ0}(x) ≥ k1 is equivalent to x̄ ≥ k2.
Therefore, we choose k2 such that

  cII P_{μ1}(X̄ < k2) = cI P_{μ0}(X̄ ≥ k2)
  ⟺ cII Φ(√n(k2 − μ1)) = cI (1 − Φ(√n(k2 − μ0))),

where Φ(z) = P(Z ≤ z) for Z ~ N(0, 1).
Given cI, cII, μ0, μ1, and n, we can solve (numerically) for k2 using Normal tables.
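The defining equation cII Φ(√n(k2 − μ1)) = cI(1 − Φ(√n(k2 − μ0))) has a left side increasing and a right side decreasing in k2, so a bisection finds the root. A sketch (function names are mine; Φ is built from `math.erf`):

```python
import math

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def minimax_cutoff(mu0, mu1, n, c_i, c_ii, tol=1e-10):
    # root of g(k2) = cII Phi(sqrt(n)(k2 - mu1)) - cI (1 - Phi(sqrt(n)(k2 - mu0)));
    # g is increasing in k2, so bisection applies
    def g(k2):
        return (c_ii * norm_cdf(math.sqrt(n) * (k2 - mu1))
                - c_i * (1.0 - norm_cdf(math.sqrt(n) * (k2 - mu0))))
    lo, hi = mu0 - 10.0, mu1 + 10.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# symmetric costs put the cutoff midway between the two means
assert abs(minimax_cutoff(0.0, 1.0, 25, 1.0, 1.0) - 0.5) < 1e-6
```

Making Type I errors costlier (cI > cII) pushes the cutoff upward, i.e., the rule rejects less readily.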
Note:
Now suppose we have a prior distribution π(θ) on Θ. Then the Bayes risk of a decision rule
d (under the loss function introduced before) is

  R(π, d) = E_π(R(θ, d(X)))
          = ∫_Θ R(θ, d) π(θ) dθ
          = ∫_{Θ0} b(θ) π(θ) P_θ(d(X) = a1) dθ + ∫_{Θ1} a(θ) π(θ) P_θ(d(X) = a0) dθ

if π is a pdf.
The Bayes risk for a pmf π looks similar (see Rohatgi, page 461).
Theorem 10.4.3:
The Bayes rule for testing H0 : θ = θ0 vs. H1 : θ = θ1 under the prior π(θ0) = π0 and
π(θ1) = π1 = 1 − π0 and the generalized 0-1 loss function is to reject H0 if

  f_{θ1}(x) / f_{θ0}(x) ≥ cI π0 / (cII π1).

Proof:
We wish to minimize R(π, d). We know that

  R(π, d) (Def. 8.8.8)= E_π(R(θ, d))
          (Def. 8.8.3)= E_π(E_θ(L(θ, d(X))))
                (Note)= ∫ g(x) (∫ L(θ, d(x)) h(θ | x) dθ) dx,   with g the marginal and h(θ | x) the posterior,
          (Def. 4.7.1)= ∫ g(x) E(L(θ, d(X)) | X = x) dx
                      = E_X(E(L(θ, d(X)) | X)).

Therefore, it is sufficient to minimize E(L(θ, d(X)) | X).
The a posteriori distribution of θ is

  h(θ | x) = π(θ) f_θ(x) / Σ_θ π(θ) f_θ(x)
           = π0 f_{θ0}(x) / (π0 f_{θ0}(x) + π1 f_{θ1}(x)), if θ = θ0;
             π1 f_{θ1}(x) / (π0 f_{θ0}(x) + π1 f_{θ1}(x)), if θ = θ1

Therefore,

  E(L(θ, d(X)) | X = x) = cI h(θ0 | x), if θ = θ0, d(x) = a1
                          cII h(θ1 | x), if θ = θ1, d(x) = a0
                          0, if θ = θ0, d(x) = a0
                          0, if θ = θ1, d(x) = a1

This will be minimized if we reject H0, i.e., d(x) = a1, when cI h(θ0 | x) ≤ cII h(θ1 | x)

  ⟹ cI π0 f_{θ0}(x) ≤ cII π1 f_{θ1}(x)
  ⟹ f_{θ1}(x) / f_{θ0}(x) ≥ cI π0 / (cII π1)
Note:
For minimax rules and Bayes rules, the significance level α is no longer predetermined.

Example 10.4.4:
Let X1, ..., Xn be iid N(μ, 1). Let H0 : μ = μ0 vs. H1 : μ = μ1 > μ0. Let cI = cII.
By Theorem 10.4.3, the Bayes rule d rejects H0 if

  f_{μ1}(x) / f_{μ0}(x) ≥ π0/(1 − π0)
  ⟹ exp(−Σ(xi − μ1)²/2 + Σ(xi − μ0)²/2) ≥ π0/(1 − π0)
  ⟹ exp((μ1 − μ0) Σxi + n(μ0² − μ1²)/2) ≥ π0/(1 − π0)
  ⟹ (μ1 − μ0) Σxi + n(μ0² − μ1²)/2 ≥ ln(π0/(1 − π0))
  ⟹ (1/n) Σxi ≥ (1/n) ln(π0/(1 − π0))/(μ1 − μ0) + (μ0 + μ1)/2

If π0 = 1/2, then we reject H0 if x̄ ≥ (μ0 + μ1)/2.
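For cI = cII the rejection threshold on x̄ has the closed form derived above; a one-line sketch (the function name is mine):

```python
import math

def bayes_cutoff(mu0, mu1, n, pi0):
    # reject H0 when xbar >= this value (generalized 0-1 loss with cI = cII)
    return math.log(pi0 / (1.0 - pi0)) / (n * (mu1 - mu0)) + (mu0 + mu1) / 2.0

assert bayes_cutoff(0.0, 2.0, 10, 0.5) == 1.0   # pi0 = 1/2 gives the midpoint
assert bayes_cutoff(0.0, 2.0, 10, 0.8) > 1.0    # stronger prior on H0 raises the bar
```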
Note:
We can generalize Theorem 10.4.3 to the case of classifying among k options θ1, ..., θk. If we
use the 0-1 loss function

  L(θi, d) = 1, if d(X) = θj for some j ≠ i;  0, if d(X) = θi,

then the Bayes rule is to pick θi if

  πi f_{θi}(x) ≥ πj f_{θj}(x)  ∀j ≠ i.

Lecture 30:
We 03/28/01

Example 10.4.5:
Let X1, ..., Xn be iid N(μ, 1). Let μ1 < μ2 < μ3 and let π1 = π2 = π3.
Choose μ = μi if

  πi exp(−Σ(xk − μi)²/2) ≥ πj exp(−Σ(xk − μj)²/2),  j ≠ i, j = 1, 2, 3.

Similar to Example 10.4.4, these conditions can be transformed as follows:

  x̄(μi − μj) ≥ (μi − μj)(μi + μj)/2,  j ≠ i, j = 1, 2, 3.

In our particular example, we get the following decision rules:
(i) Choose μ1 if x̄ ≤ (μ1 + μ2)/2 (and x̄ ≤ (μ1 + μ3)/2).
(ii) Choose μ2 if x̄ ≥ (μ1 + μ2)/2 and x̄ ≤ (μ2 + μ3)/2.
(iii) Choose μ3 if x̄ ≥ (μ2 + μ3)/2 (and x̄ ≥ (μ1 + μ3)/2).
Note that in (i) and (iii) the condition in parentheses automatically holds when the other
condition holds.
If μ1 = 0, μ2 = 2, and μ3 = 4, we have the decision rules:
(i) Choose μ1 if x̄ ≤ 1.
(ii) Choose μ2 if 1 ≤ x̄ ≤ 3.
(iii) Choose μ3 if x̄ ≥ 3.
We do not have to worry about how to handle the boundary since the probability that the rv will
realize on any of the two boundary points is 0.
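With equal priors, these rules reduce to choosing the mean closest to x̄; a sketch (the function name is mine):

```python
def classify(xbar, mus=(0.0, 2.0, 4.0)):
    # equal priors: the Bayes rule picks the mean nearest to xbar
    return min(mus, key=lambda mu: (xbar - mu) ** 2)

assert classify(0.7) == 0.0   # xbar <= 1
assert classify(2.4) == 2.0   # 1 <= xbar <= 3
assert classify(3.6) == 4.0   # xbar >= 3
```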
11 Confidence Estimation

11.1 Fundamental Notions
Let X be a rv and a, b be fixed positive numbers, a < b. Then

  P(a < X < b) = P(a < X and X < b)
               = P(a < X and X/b < 1)
               = P(a < X and aX/b < a)
               = P(aX/b < a < X)

The interval I(X) = (aX/b, X) is an example of a random interval. I(X) contains the value
a with a certain fixed probability.
For example, if X ~ U(0, 1), a = 1/4, and b = 3/4, then the interval I(X) = (X/3, X) contains 1/4
with probability 1/2.

Definition 11.1.1:
Let P_θ, θ ∈ Θ ⊆ ℝ^k, be a set of probability distributions of a rv X. A family of subsets
S(x) of Θ, where S(x) depends on x but not on θ, is called a family of random sets. In
particular, if θ ∈ Θ ⊆ ℝ and S(x) is an interval (θ̲(x), θ̄(x)) where θ̲(x) and θ̄(x) depend on
x but not on θ, we call S(X) a random interval, with θ̲(X) and θ̄(X) as lower and upper
bounds, respectively. θ̲(X) may be −∞ and θ̄(X) may be +∞.

Note:
Frequently in inference, we are not interested in estimating a parameter or testing a hypoth-
esis about it. Instead, we are interested in establishing a lower or upper bound (or both) for
one or multiple parameters.

Definition 11.1.2:
A family of subsets S(x) of Θ ⊆ ℝ^k is called a family of confidence sets at confidence
level 1 − α if

  P_θ(S(X) ∋ θ) ≥ 1 − α  ∀θ ∈ Θ,

where 0 < α < 1 is usually small.
The quantity

  inf_θ P_θ(S(X) ∋ θ)

is called the confidence coefficient.
Definition 11.1.3:
For k = 1, we use the following names for some of the confidence sets defined in Definition
11.1.2:
(i) If S(x) = (θ̲(x), ∞), then θ̲(x) is called a level 1 − α lower confidence bound.
(ii) If S(x) = (−∞, θ̄(x)), then θ̄(x) is called a level 1 − α upper confidence bound.
(iii) S(x) = (θ̲(x), θ̄(x)) is called a level 1 − α confidence interval (CI).

Definition 11.1.4:
A family of 1 − α level confidence sets {S(x)} is called uniformly most accurate (UMA)
if

  P_{θ'}(S(X) ∋ θ) ≤ P_{θ'}(S'(X) ∋ θ)  ∀θ, θ' ∈ Θ

and for any 1 − α level family of confidence sets S'(X).
Theorem 11.1.5:
Let X_1, ..., X_n ~ F_θ, θ ∈ Θ, where Θ is an interval on IR. Let T(X, θ) be a function on
IR^n × Θ such that for each θ, T(X, θ) is a statistic, and as a function of θ, T is strictly
monotone (either increasing or decreasing) in θ at every value of x ∈ IR^n.
Let Λ ⊆ IR be the range of T, and let the equation λ = T(x, θ) be solvable for θ for every
λ ∈ Λ and every x ∈ IR^n.
If the distribution of T(X, θ) is independent of θ, then we can construct a confidence interval
for θ at any level.
Proof:
Choose α such that 0 < α < 1. Then we can choose λ_1(α) < λ_2(α) (which may not necessarily
be unique) such that

  P_θ(λ_1(α) < T(X, θ) < λ_2(α)) ≥ 1−α  ∀θ.

Since the distribution of T(X, θ) is independent of θ, λ_1(α) and λ_2(α) also do not depend on
θ.
If T(X, θ) is increasing in θ, solve the equation λ_1(α) = T(X, θ) for θ̲(X) and λ_2(α) = T(X, θ)
for θ̄(X).
If T(X, θ) is decreasing in θ, solve the equation λ_1(α) = T(X, θ) for θ̄(X) and λ_2(α) = T(X, θ)
for θ̲(X).
In either case, it holds that

  P_θ(θ̲(X) < θ < θ̄(X)) ≥ 1−α  ∀θ.

Note:

(i) Solvability is guaranteed if T is continuous and strictly monotone as a function of θ.

(ii) If T is not monotone, we can still use this Theorem to get confidence sets that may not
be confidence intervals.
Example 11.1.6:
Let X_1, ..., X_n ~ N(μ, σ²), where μ and σ² > 0 are both unknown. We seek a 1−α level
confidence interval for μ.
Note that

  T(X, μ) = (X̄ − μ) / (S/√n) ~ t_{n−1}

is independent of μ and monotone (decreasing) in μ.
We choose λ_1(α) and λ_2(α) such that

  P(λ_1(α) < T(X, μ) < λ_2(α)) = 1−α

and solve for μ, which yields

  P(X̄ − S λ_2(α)/√n < μ < X̄ − S λ_1(α)/√n) = 1−α.

Thus,

  (X̄ − S λ_2(α)/√n, X̄ − S λ_1(α)/√n)

is a 1−α level CI for μ. We commonly choose λ_2(α) = −λ_1(α) = t_{n−1,α/2}.
Example 11.1.7:
Let X_1, ..., X_n ~ U(0, θ).
We know that θ̂ = max(X_i) = Max_n is the MLE for θ and sufficient for θ.
The pdf of Max_n is given by

  f_n(y) = (n y^{n−1} / θ^n) I_{(0,θ)}(y).

Then the rv T_n = Max_n/θ has the pdf

  h_n(t) = n t^{n−1} I_{(0,1)}(t),

which is independent of θ. T_n is monotone (decreasing) in θ.
We now have to find numbers λ_1(α) and λ_2(α) such that

  P(λ_1(α) < T_n < λ_2(α)) = 1−α
  ⟹ n ∫_{λ_1}^{λ_2} t^{n−1} dt = 1−α
  ⟹ λ_2^n − λ_1^n = 1−α.

If we choose λ_2 = 1 and λ_1 = α^{1/n}, then (Max_n, α^{−1/n} Max_n) is a 1−α level CI for θ. This
holds since

  P(α^{1/n} < Max_n/θ < 1) = P(α^{−1/n} > θ/Max_n > 1)
    = P(α^{−1/n} Max_n > θ > Max_n)
    = 1−α.
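The coverage claim in Example 11.1.7 can be checked by simulation. A minimal Python sketch (Python is not used elsewhere in these notes; θ, n, α, and the replication count are illustrative choices):

```python
import numpy as np

# Monte Carlo check of Example 11.1.7: for X_1,...,X_n ~ U(0, theta),
# the interval (Max_n, alpha**(-1/n) * Max_n) should cover theta with
# probability 1 - alpha.  All numeric settings here are illustrative.
rng = np.random.default_rng(0)
theta, n, alpha, reps = 3.0, 10, 0.10, 100_000

maxn = rng.uniform(0, theta, size=(reps, n)).max(axis=1)
covered = (maxn < theta) & (theta < alpha ** (-1 / n) * maxn)
print(round(covered.mean(), 2))   # close to 1 - alpha = 0.90
```

The empirical coverage agrees with the exact calculation P(α^{1/n} < Max_n/θ < 1) = 1−α.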
Lecture 31:
Fr 03/30/01

11.2 Shortest-Length Confidence Intervals
In practice, we usually want not only an interval with coverage probability 1−α for θ, but if
possible the shortest (most precise) such interval.
Definition 11.2.1:
A rv T(X, θ) whose distribution is independent of θ is called a pivot.
Note:
The methods we will discuss here can provide the shortest interval based on a given pivot.
They will not guarantee that there is no other pivot with a shorter minimal interval.
Example 11.2.2:
Let X_1, ..., X_n ~ N(μ, σ²), where σ² > 0 is known. The obvious pivot for μ is

  T_μ(X) = (X̄ − μ) / (σ/√n) ~ N(0, 1).

Suppose that (a, b) is an interval such that P(a < Z < b) = 1−α, where Z ~ N(0, 1).
A 1−α level CI based on this pivot is found by

  1−α = P( a < (X̄ − μ)/(σ/√n) < b ) = P( X̄ − bσ/√n < μ < X̄ − aσ/√n ).

The length of the interval is L = (b − a) σ/√n.
To minimize L, we must choose a and b such that b − a is minimal while

  Φ(b) − Φ(a) = (1/√(2π)) ∫_a^b e^{−x²/2} dx = 1−α,

where Φ(z) = P(Z ≤ z).
To find a minimum, we can differentiate these expressions with respect to a. However, b is
not a constant but an implicit function of a. Formally, we could write db(a)/da; however, this
is usually shortened to db/da.
Here we get

  d/da (Φ(b) − Φ(a)) = φ(b) (db/da) − φ(a) = d/da (1−α) = 0

and

  dL/da = (σ/√n)(db/da − 1) = (σ/√n)(φ(a)/φ(b) − 1).

The minimum occurs when φ(a) = φ(b), which happens when a = b or a = −b. If we select
a = b, then Φ(b) − Φ(a) = Φ(a) − Φ(a) = 0 ≠ 1−α. Thus, we must have b = −a = z_{α/2}.
Thus, the shortest CI based on T_μ is

  (X̄ − z_{α/2} σ/√n, X̄ + z_{α/2} σ/√n).
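The optimality of the symmetric choice b = −a = z_{α/2} can be confirmed numerically. A hedged Python sketch (α and the grid resolution are illustrative choices): sweep the lower-tail mass γ over (0, α) and observe that the interval length is minimized at γ = α/2.

```python
from scipy.stats import norm

# Numerical check of Example 11.2.2: among all (a, b) with
# Phi(b) - Phi(a) = 1 - alpha, the symmetric split of the tail mass
# minimizes b - a (and hence the CI length (b - a) * sigma / sqrt(n)).
alpha = 0.05
best = min(
    # g = lower-tail mass; b - a is the resulting interval length
    (norm.ppf(1 - (alpha - g)) - norm.ppf(g), g)
    for g in [i * alpha / 1000 for i in range(1, 1000)]
)
print(round(best[1], 4))                   # optimal lower-tail mass: 0.025
print(round(norm.ppf(1 - alpha / 2), 2))  # z_{alpha/2} = 1.96
```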
Definition 11.2.3:
A pdf f(x) is unimodal iff there exists an x* such that f(x) is nondecreasing for x ≤ x* and
f(x) is nonincreasing for x ≥ x*.
Theorem 11.2.4:
Let f(x) be a unimodal pdf. If the interval [a, b] satisfies

(i) ∫_a^b f(x) dx = 1−α,

(ii) f(a) = f(b) > 0, and

(iii) a ≤ x* ≤ b, where x* is a mode of f(x),

then the interval [a, b] is the shortest of all intervals which satisfy condition (i).
Proof:
Let [a′, b′] be any interval with b′ − a′ < b − a. We will show that this implies
∫_{a′}^{b′} f(x) dx < 1−α, i.e., a contradiction.
We assume that a′ ≤ a. The case a < a′ is similar.

• Suppose that b′ ≤ a. Then a′ ≤ b′ ≤ a ≤ x*.
[Figure omitted ("Theorem 11.2.4a"): sketch of the unimodal pdf with a′ < b′ < a < x* < b
marked on the x-axis.]
It follows that

  ∫_{a′}^{b′} f(x) dx ≤ f(b′)(b′ − a′)   | x ≤ b′ ≤ x* ⟹ f(x) ≤ f(b′)
    ≤ f(a)(b′ − a′)   | b′ ≤ a ≤ x* ⟹ f(b′) ≤ f(a)
    < f(a)(b − a)   | b′ − a′ < b − a and f(a) > 0
    ≤ ∫_a^b f(x) dx   | f(x) ≥ f(a) for a ≤ x ≤ b
    = 1−α   | by (i)
• Suppose b′ > a. We can immediately exclude b′ > b, since then b′ − a′ > b − a, i.e.,
[a′, b′] wouldn't be of shorter length than [a, b]. Thus, we have to consider the case
a′ ≤ a < b′ < b.
[Figure omitted ("Theorem 11.2.4b"): sketch of the unimodal pdf with a′ ≤ a < b′ < b and
the mode x* marked on the x-axis.]
It holds that

  ∫_{a′}^{b′} f(x) dx = ∫_a^b f(x) dx + ∫_{a′}^{a} f(x) dx − ∫_{b′}^{b} f(x) dx.

Note that ∫_{a′}^{a} f(x) dx ≤ f(a)(a − a′) and ∫_{b′}^{b} f(x) dx ≥ f(b)(b − b′). Therefore, we get

  ∫_{a′}^{a} f(x) dx − ∫_{b′}^{b} f(x) dx ≤ f(a)(a − a′) − f(b)(b − b′)
    = f(a)((a − a′) − (b − b′))   | since f(a) = f(b)
    = f(a)((b′ − a′) − (b − a))
    < 0.

Thus,

  ∫_{a′}^{b′} f(x) dx < 1−α.
Note:
Example 11.2.2 is a special case of Theorem 11.2.4.
Example 11.2.5:
Let X_1, ..., X_n ~ N(μ, σ²), where μ is known. The obvious pivot for σ² is

  T_{σ²}(X) = Σ(X_i − μ)² / σ² ~ χ²_n.

So

  P( a < Σ(X_i − μ)²/σ² < b ) = 1−α
  ⟺ P( Σ(X_i − μ)²/b < σ² < Σ(X_i − μ)²/a ) = 1−α.

We wish to minimize

  L = (1/a − 1/b) Σ(X_i − μ)²

such that ∫_a^b f_n(t) dt = 1−α, where f_n(t) is the pdf of a χ²_n distribution.
We get

  f_n(b) (db/da) − f_n(a) = 0

and

  dL/da = (−1/a² + (1/b²)(db/da)) Σ(X_i − μ)² = (−1/a² + f_n(a)/(b² f_n(b))) Σ(X_i − μ)².

We obtain a minimum if a² f_n(a) = b² f_n(b).
Note that, in practice, the equal-tail values χ²_{n,α/2} and χ²_{n,1−α/2} are used, which do not
result in shortest-length CIs. The reason for this choice is simple: when these tests were
developed, computers that could solve these equations numerically did not exist, so people
generally had to rely on tabulated values. Manually solving the equation above for each case
obviously wasn't feasible.
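Today the condition a² f_n(a) = b² f_n(b), coupled with the coverage constraint, is easy to solve numerically. A hedged sketch (n and α are illustrative choices; `excess` is a helper name introduced here, not from the notes):

```python
from scipy.optimize import brentq
from scipy.stats import chi2

# Numerical sketch for Example 11.2.5: find a < b with
#   integral_a^b f_n(t) dt = 1 - alpha   and   a^2 f_n(a) = b^2 f_n(b),
# where f_n is the chi^2_n pdf.
n, alpha = 10, 0.05

def excess(a):
    # given a, choose b so the interval has probability 1 - alpha,
    # then return the defect in the shortest-length condition
    b = chi2.ppf(chi2.cdf(a, n) + 1 - alpha, n)
    return a**2 * chi2.pdf(a, n) - b**2 * chi2.pdf(b, n)

a = brentq(excess, 1e-6, chi2.ppf(alpha, n) * 0.999)
b = chi2.ppf(chi2.cdf(a, n) + 1 - alpha, n)
# equal-tail constants for comparison
a_et, b_et = chi2.ppf(alpha / 2, n), chi2.ppf(1 - alpha / 2, n)
print(1/a - 1/b < 1/a_et - 1/b_et)   # shortest-length beats equal tails
```

The interval length is proportional to 1/a − 1/b, so the comparison above is exactly the shortest-length claim in the text.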
Example 11.2.6:
Let X_1, ..., X_n ~ U(0, θ). Let Max_n = max X_i = X_(n). Since T_n = Max_n/θ has pdf
n t^{n−1} I_{(0,1)}(t), which does not depend on θ, T_n can be selected as our pivot. The
density of T_n is strictly increasing for n ≥ 2, so we cannot find constants a and b as in
Example 11.2.5.
If P(a < T_n < b) = 1−α, then P(Max_n/b < θ < Max_n/a) = 1−α.
We wish to minimize

  L = Max_n (1/a − 1/b)

such that ∫_a^b n t^{n−1} dt = b^n − a^n = 1−α.
We get

  n b^{n−1} − n a^{n−1} (da/db) = 0 ⟹ da/db = b^{n−1}/a^{n−1}

and

  dL/db = Max_n (−(1/a²)(da/db) + 1/b²) = Max_n (−b^{n−1}/a^{n+1} + 1/b²)
    = Max_n (a^{n+1} − b^{n+1}) / (b² a^{n+1}) < 0 for 0 ≤ a < b ≤ 1.

Thus, L does not have a local minimum. However, since dL/db < 0, L is strictly decreasing
as a function of b. It is minimized when b = 1, i.e., when b is as large as possible. The
corresponding a is selected as a = α^{1/n}.
The shortest 1−α level CI based on T_n is (Max_n, α^{−1/n} Max_n).
Lecture 32:
Mo 04/02/01

11.3 Confidence Intervals and Hypothesis Tests
Example 11.3.1:
Let X_1, ..., X_n ~ N(μ, σ²), where σ² > 0 is known. In Example 11.2.2 we have shown that
the interval

  (X̄ − z_{α/2} σ/√n, X̄ + z_{α/2} σ/√n)

is a 1−α level CI for μ.
Suppose we define a test φ of H_0: μ = μ_0 vs. H_1: μ ≠ μ_0 that rejects H_0 iff μ_0 does not
fall in this interval. Then,

  P_{μ_0}(Type I error) = P_{μ_0}(X̄ − z_{α/2} σ/√n ≥ μ_0) + P_{μ_0}(X̄ + z_{α/2} σ/√n ≤ μ_0)
    = P_{μ_0}( |X̄ − μ_0| / (σ/√n) ≥ z_{α/2} ) = α,

i.e., φ has size α.
Conversely, if φ(x, μ_0) is a family of size α tests of H_0: μ = μ_0, the set
{μ_0 : φ(x, μ_0) fails to reject H_0} is a level 1−α confidence set for μ_0.
Theorem 11.3.2:
Write H_0(θ_0) for H_0: θ = θ_0, and H_1(θ_0) for the alternative. Let A(θ_0), θ_0 ∈ Θ, denote the
acceptance region of a level-α test of H_0(θ_0). For each possible observation x, define

  S(x) = {θ : x ∈ A(θ), θ ∈ Θ}.

Then S(x) is a family of 1−α level confidence sets for θ.
If, moreover, A(θ_0) is UMP for (α, H_0(θ_0), H_1(θ_0)), then S(x) minimizes P_θ(S(X) ∋ θ_0)
∀θ ∈ H_1(θ_0) among all 1−α level families of confidence sets, i.e., S(x) is UMA.
Proof:
It holds that S(x) ∋ θ iff x ∈ A(θ). Therefore,

  P_θ(S(X) ∋ θ) = P_θ(X ∈ A(θ)) ≥ 1−α.

Let S*(X) be any other family of 1−α level confidence sets. Define A*(θ) = {x : S*(x) ∋ θ}.
Then,

  P_θ(X ∈ A*(θ)) = P_θ(S*(X) ∋ θ) ≥ 1−α.

Since A(θ_0) is UMP, it holds that

  P_θ(X ∈ A*(θ_0)) ≥ P_θ(X ∈ A(θ_0))  ∀θ ∈ H_1(θ_0).

This implies that

  P_θ(S*(X) ∋ θ_0) ≥ P_θ(X ∈ A(θ_0)) = P_θ(S(X) ∋ θ_0)  ∀θ ∈ H_1(θ_0).
Example 11.3.3:
Let X be a rv that belongs to a one-parameter exponential family with pdf
f_θ(x) = exp(Q(θ)T(x) + S'(x) + D(θ)), where Q(θ) is nondecreasing. We consider a test
H_0: θ = θ_0 vs. H_1: θ < θ_0.
The acceptance region of a UMP size α test of H_0 has the form A(θ_0) = {x : T(x) > c(θ_0)}.
For θ ≤ θ_0,

  P_θ(T(X) ≤ c(θ)) = α = P_{θ_0}(T(X) ≤ c(θ_0)) ≤ P_θ(T(X) ≤ c(θ_0))

(since for a UMP test, it holds that power ≥ size), so c(θ) ≤ c(θ_0). Therefore, we can
choose c(θ) nondecreasing.
A level 1−α CI for θ is then

  S(x) = {θ : x ∈ A(θ)} = (−∞, c^{−1}(T(x))),

where c^{−1}(T(x)) = sup_θ {θ : c(θ) < T(x)}.
Example 11.3.4:
Let X ~ Exp(θ) with f_θ(x) = (1/θ) e^{−x/θ} I_{(0,∞)}(x). Then Q(θ) = −1/θ and T(x) = x.
We want to test H_0: θ = θ_0 vs. H_1: θ < θ_0.
The acceptance region of a UMP size α test of H_0 has the form A(θ_0) = {x : x ≥ c(θ_0)}, where

  α = ∫_0^{c(θ_0)} f_{θ_0}(x) dx = ∫_0^{c(θ_0)} (1/θ_0) e^{−x/θ_0} dx = 1 − e^{−c(θ_0)/θ_0}.

Thus,

  e^{−c(θ_0)/θ_0} = 1 − α ⟹ c(θ_0) = θ_0 log(1/(1−α)).

Therefore, the UMA family of 1−α level confidence sets is of the form

  S(x) = {θ : x ∈ A(θ)}
    = {θ : θ ≤ x / log(1/(1−α))}
    = [0, x / log(1/(1−α))].
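The coverage of this UMA upper bound can be verified by simulation. A hedged Python sketch (the true θ, α, and the replication count are illustrative choices):

```python
import math
import numpy as np

# Monte Carlo check of Example 11.3.4: the UMA upper confidence bound
# x / log(1/(1 - alpha)) based on a single observation X ~ Exp(theta)
# should cover the true theta with probability 1 - alpha.
rng = np.random.default_rng(1)
theta, alpha, reps = 2.0, 0.05, 100_000

x = rng.exponential(scale=theta, size=reps)     # one observation per run
upper = x / math.log(1 / (1 - alpha))
covered = theta <= upper
print(round(covered.mean(), 2))   # close to 1 - alpha = 0.95
```

The exact coverage is P(X ≥ θ log(1/(1−α))) = e^{−log(1/(1−α))} = 1−α, which the simulation reproduces.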
Note:
Just as we frequently restrict the class of tests (when UMP tests don't exist), we can make
the same sorts of restrictions on CIs.
Definition 11.3.5:
A family S(x) of confidence sets for a parameter θ is said to be unbiased at level 1−α if

  P_θ(S(X) ∋ θ) ≥ 1−α and P_{θ′}(S(X) ∋ θ) ≤ 1−α  ∀θ, θ′ ∈ Θ with θ′ ≠ θ.

If S(x) is unbiased and minimizes P_{θ′}(S(X) ∋ θ) among all unbiased CIs at level 1−α, it is
called uniformly most accurate unbiased (UMAU).
Theorem 11.3.6:
Let A(θ_0) be the acceptance region of a UMPU size α test of H_0: θ = θ_0 vs. H_1: θ ≠ θ_0
(for all θ_0). Then S(x) = {θ : x ∈ A(θ)} is a UMAU family of confidence sets at level 1−α.
Proof:
Since A(θ) is unbiased, it holds for θ′ ≠ θ that

  P_{θ′}(S(X) ∋ θ) = P_{θ′}(X ∈ A(θ)) ≤ 1−α,

while P_θ(S(X) ∋ θ) = P_θ(X ∈ A(θ)) ≥ 1−α. Thus, S is unbiased.
Let S*(x) be any other unbiased family of level 1−α confidence sets, where

  A*(θ) = {x : S*(x) ∋ θ}.

It holds that

  P_{θ′}(X ∈ A*(θ)) = P_{θ′}(S*(X) ∋ θ) ≤ 1−α.

Therefore, A*(θ) is the acceptance region of an unbiased size α test. Thus,

  P_{θ′}(S*(X) ∋ θ) = P_{θ′}(X ∈ A*(θ))
    ≥ P_{θ′}(X ∈ A(θ))   | A UMPU
    = P_{θ′}(S(X) ∋ θ).
Lecture 33:
We 04/04/01

Theorem 11.3.7:
Let Θ be an interval on IR and f_θ be the pdf of X. Let S(X) be a family of 1−α level CIs,
where S(X) = (θ̲(X), θ̄(X)), with θ̲ and θ̄ increasing functions of X and θ̄(X) − θ̲(X) a finite
rv.
Then it holds that

  E_θ(θ̄(X) − θ̲(X)) = ∫ (θ̄(x) − θ̲(x)) f_θ(x) dx = ∫_{θ′≠θ} P_θ(S(X) ∋ θ′) dθ′  ∀θ ∈ Θ.

Proof:
It holds that θ̄ − θ̲ = ∫_{θ̲}^{θ̄} dθ′. Thus, for all θ ∈ Θ,

  E_θ(θ̄(X) − θ̲(X)) = ∫_{IR^n} (θ̄(x) − θ̲(x)) f_θ(x) dx
    = ∫_{IR^n} ( ∫_{θ̲(x)}^{θ̄(x)} dθ′ ) f_θ(x) dx
    = ∫_{IR} ( ∫_{θ̄^{−1}(θ′)}^{θ̲^{−1}(θ′)} f_θ(x) dx ) dθ′   | interchanging the order of integration; the inner limits lie in IR^n
    = ∫_{IR} P_θ( X ∈ [θ̄^{−1}(θ′), θ̲^{−1}(θ′)] ) dθ′
    = ∫_{IR} P_θ(S(X) ∋ θ′) dθ′
    = ∫_{θ′≠θ} P_θ(S(X) ∋ θ′) dθ′.
Note:
Theorem 11.3.7 says that the expected length of the CI is the probability that S(X) includes
the false θ′, averaged over all false values θ′ ≠ θ.
Corollary 11.3.8:
If S(X) is UMAU, then E_θ(θ̄(X) − θ̲(X)) is minimized among all unbiased families of CIs.
Proof:
In Theorem 11.3.7 we have shown that

  E_θ(θ̄(X) − θ̲(X)) = ∫_{θ′≠θ} P_θ(S(X) ∋ θ′) dθ′.

Since a UMAU CI minimizes this probability for all θ′, the entire integral is minimized.
Example 11.3.9:
Let X_1, ..., X_n ~ N(μ, σ²), where σ² > 0 is known.
By Example 11.2.2, (X̄ − z_{α/2} σ/√n, X̄ + z_{α/2} σ/√n) is the shortest 1−α level CI for μ.
By Example 9.4.3, the equivalent test is UMPU. So by Theorem 11.3.6 this interval is UMAU,
and by Corollary 11.3.8 it has shortest expected length as well.
Example 11.3.10:
Let X_1, ..., X_n ~ N(μ, σ²), where μ and σ² > 0 are both unknown.
Note that

  T(X, σ²) = (n−1)S²/σ² =: T* ~ χ²_{n−1}.

Thus,

  P_{σ²}( λ_1 < (n−1)S²/σ² < λ_2 ) = 1−α ⟺ P_{σ²}( (n−1)S²/λ_2 < σ² < (n−1)S²/λ_1 ) = 1−α.

We now define P(ψ) as

  P(ψ) = P_{σ²}( (n−1)S²/λ_2 < σ′² < (n−1)S²/λ_1 )
    = P( T*/λ_2 < ψ < T*/λ_1 )
    = P( λ_1 ψ < T* < λ_2 ψ ),

where ψ = σ′²/σ².
If our test is unbiased, then it follows from Definition 11.3.5 that

  P(1) = 1−α and P(ψ) < 1−α  ∀ψ ≠ 1.

This implies that we can find λ_1, λ_2 such that P(1) = 1−α and

  dP(ψ)/dψ |_{ψ=1} = λ_2 f_{T*}(λ_2) − λ_1 f_{T*}(λ_1) = 0,

where f_{T*} is the pdf that relates to T*, i.e., the pdf of a χ²_{n−1} distribution. We can solve
for λ_1, λ_2 numerically. Then,

  ( (n−1)S²/λ_2, (n−1)S²/λ_1 )

is an unbiased 1−α level CI for σ².
Rohatgi, Theorem 4(b), pages 428-429, states that the related test is UMPU. Therefore, by
Theorem 11.3.6 and Corollary 11.3.8, our CI is UMAU with shortest expected length among
all unbiased intervals.
Note that this CI is different from the equal-tail CI based on Definition 10.2.1, III, and from
the shortest-length CI obtained in Example 11.2.5.
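The unbiasedness condition λ_2 f(λ_2) = λ_1 f(λ_1) can be solved numerically in the same way as the shortest-length condition in Example 11.2.5. A hedged sketch (n and α are illustrative choices; `defect` is a helper name introduced here):

```python
from scipy.optimize import brentq
from scipy.stats import chi2

# Numerical sketch for Example 11.3.10: choose lambda_1 < lambda_2 with
#   P(lambda_1 < T* < lambda_2) = 1 - alpha   and
#   lambda_2 f(lambda_2) = lambda_1 f(lambda_1),
# where f is the chi^2_{n-1} pdf.
n, alpha = 15, 0.05
df = n - 1

def defect(l1):
    l2 = chi2.ppf(chi2.cdf(l1, df) + 1 - alpha, df)
    return l1 * chi2.pdf(l1, df) - l2 * chi2.pdf(l2, df)

l1 = brentq(defect, 1e-6, chi2.ppf(alpha, df) * 0.999)
l2 = chi2.ppf(chi2.cdf(l1, df) + 1 - alpha, df)
print(round(chi2.cdf(l2, df) - chi2.cdf(l1, df), 4))   # coverage 0.95
```

With the solved λ_1, λ_2, the interval ((n−1)S²/λ_2, (n−1)S²/λ_1) is the unbiased CI described above.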
11.4 Bayes Confidence Intervals

Definition 11.4.1:
Given a posterior distribution h(θ | x), a level 1−α credible set (Bayesian confidence
set) is any set A such that

  P(θ ∈ A | x) = ∫_A h(θ | x) dθ = 1 − α.

Note:
If A is an interval, we speak of a Bayesian confidence interval.
Example 11.4.2:
Let X ~ Bin(n, p) and π(p) ~ U(0, 1).
In Example 8.8.11, we have shown that

  h(p | x) = p^x (1−p)^{n−x} / ∫_0^1 p^x (1−p)^{n−x} dp · I_{(0,1)}(p)
    = B(x+1, n−x+1)^{−1} p^x (1−p)^{n−x} I_{(0,1)}(p)
    = Γ(n+2)/(Γ(x+1)Γ(n−x+1)) p^x (1−p)^{n−x} I_{(0,1)}(p)
    ~ Beta(x+1, n−x+1),

where B(a, b) = Γ(a)Γ(b)/Γ(a+b) is the beta function evaluated at a and b, and
Beta(x+1, n−x+1) represents a Beta distribution with parameters x+1 and n−x+1.
Using the observed value for x and tables of incomplete beta integrals or a numerical
approach, we can find λ_1 and λ_2 such that P_{p|x}(λ_1 < p < λ_2) = 1−α. So (λ_1, λ_2) is a
credible interval for p.
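The "numerical approach" mentioned above is immediate with a modern beta quantile function. A hedged Python sketch of an equal-tail credible interval (n, x, α are illustrative choices, not values from the notes):

```python
from scipy.stats import beta

# Sketch for Example 11.4.2: with X ~ Bin(n, p) and a U(0,1) prior,
# the posterior is Beta(x+1, n-x+1); an equal-tail level 1-alpha
# credible interval uses the posterior quantiles.
n, x, alpha = 20, 7, 0.05
post = beta(x + 1, n - x + 1)
lam1, lam2 = post.ppf(alpha / 2), post.ppf(1 - alpha / 2)
print(round(post.cdf(lam2) - post.cdf(lam1), 2))   # posterior mass 0.95
```

Theorem 11.2.4 could instead be used to shift (λ_1, λ_2) to the shortest such interval, since the Beta posterior here is unimodal.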
Note:

(i) The definitions and interpretations of credible intervals and confidence intervals are quite
different. Therefore, very different intervals may result.

(ii) We can often use Theorem 11.2.4 to find the shortest credible interval.
Lecture 34:
Mo 04/09/01

Example 11.4.3:
Let X_1, ..., X_n be iid N(θ, 1) and π(θ) ~ N(0, 1). We want to construct a Bayesian level
1−α CI for θ.
By Definition 8.8.7, the posterior distribution of θ given x is

  h(θ | x) = π(θ) f(x | θ) / g(x),

where

  g(x) = ∫_{−∞}^{∞} f(x, θ) dθ
    = ∫_{−∞}^{∞} π(θ) f(x | θ) dθ
    = ∫_{−∞}^{∞} (1/√(2π)) exp(−θ²/2) · (1/(√(2π))^n) exp( −(1/2) Σ_{i=1}^n (x_i − θ)² ) dθ
    = ∫_{−∞}^{∞} (1/(2π)^{(n+1)/2}) exp( −(1/2)( Σ x_i² − 2θ Σ x_i + nθ² ) − θ²/2 ) dθ
    = ∫_{−∞}^{∞} (1/(2π)^{(n+1)/2}) exp( −(1/2) Σ x_i² + θ n x̄ − (n/2)θ² − θ²/2 ) dθ
    = (1/(2π)^{(n+1)/2}) exp( −(1/2) Σ x_i² ) ∫_{−∞}^{∞} exp( −((n+1)/2)( θ² − 2θ n x̄/(n+1) ) ) dθ
    = (1/(2π)^{(n+1)/2}) exp( −(1/2) Σ x_i² ) ∫_{−∞}^{∞} exp( −((n+1)/2)( θ − n x̄/(n+1) )² + ((n+1)/2)( n x̄/(n+1) )² ) dθ
    = (1/(2π)^{(n+1)/2}) exp( −(1/2) Σ x_i² + n²x̄²/(2(n+1)) ) √(2π/(n+1)) ·
      ∫_{−∞}^{∞} (1/√(2π/(n+1))) exp( −(1/2) (θ − n x̄/(n+1))² / (1/(n+1)) ) dθ
      | the remaining integral = 1, since the integrand is the pdf of a N(n x̄/(n+1), 1/(n+1)) distribution
    = ( (n+1)^{−1/2} / (2π)^{n/2} ) exp( −(1/2) Σ x_i² + n²x̄²/(2(n+1)) ).

Therefore,

  h(θ | x) = π(θ) f(x | θ) / g(x)
    = [ (1/√(2π)) exp(−θ²/2) (1/(√(2π))^n) exp( −(1/2) Σ (x_i − θ)² ) ]
      / [ ( (n+1)^{−1/2} / (2π)^{n/2} ) exp( −(1/2) Σ x_i² + n²x̄²/(2(n+1)) ) ]
    = ( √(n+1) / √(2π) ) exp( −(1/2) Σ x_i² + θ n x̄ − (n/2)θ² − θ²/2 + (1/2) Σ x_i² − n²x̄²/(2(n+1)) )
    = ( √(n+1) / √(2π) ) exp( −((n+1)/2)( θ² − 2θ n x̄/(n+1) + (2/(n+1)) · n²x̄²/(2(n+1)) ) )
    = ( 1/√(2π/(n+1)) ) exp( −(1/2) (θ − n x̄/(n+1))² / (1/(n+1)) ),

i.e.,

  h(θ | x) ~ N( n x̄/(n+1), 1/(n+1) ).

Therefore, a Bayesian level 1−α CI for θ is

  ( (n/(n+1)) X̄ − z_{α/2}/√(n+1), (n/(n+1)) X̄ + z_{α/2}/√(n+1) ).

The shortest ("classical") level 1−α CI for θ (treating θ as fixed) is

  ( X̄ − z_{α/2}/√n, X̄ + z_{α/2}/√n ),

as seen in Example 11.3.9.
Thus, the Bayesian CI is slightly shorter than the classical CI, since we use additional
information in constructing the Bayesian CI.
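The length comparison at the end of Example 11.4.3 is easy to quantify. A hedged sketch (n and α are illustrative choices):

```python
import math
from scipy.stats import norm

# Sketch for Example 11.4.3: compare the half-lengths of the Bayesian
# level 1-alpha CI, z_{alpha/2}/sqrt(n+1), and the classical CI,
# z_{alpha/2}/sqrt(n).
n, alpha = 10, 0.05
z = norm.ppf(1 - alpha / 2)
bayes_half = z / math.sqrt(n + 1)
classical_half = z / math.sqrt(n)
print(bayes_half < classical_half)   # True: the Bayesian CI is shorter
```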
12 Nonparametric Inference

12.1 Nonparametric Estimation

Definition 12.1.1:
A statistical method which does not rely on assumptions about the distributional form of a
rv (except, perhaps, that it is absolutely continuous, or purely discrete) is called a
nonparametric or distribution-free method.

Note:
Unless otherwise specified, we make the following assumptions for the remainder of this
chapter: Let X_1, ..., X_n be iid ~ F, where F is unknown. Let P be the class of all possible
distributions of X.

Definition 12.1.2:
A statistic T(X) is sufficient for a family of distributions P if the conditional distribution of
X given T = t is the same for all F ∈ P.

Example 12.1.3:
Let X_1, ..., X_n be absolutely continuous. Let T = (X_(1), ..., X_(n)) be the order statistic.
It holds that

  f(x | T = t) = 1/n!,

so T is sufficient for the family of absolutely continuous distributions on IR.

Definition 12.1.4:
A family of distributions P is complete if the only unbiased estimate of 0 is 0 itself, i.e.,

  E_F(h(X)) = 0 ∀F ∈ P ⟹ h(x) = 0 ∀x.

Definition 12.1.5:
A statistic T(X) is complete in relation to P if the class of induced distributions of T is
complete.

Theorem 12.1.6:
The order statistic (X_(1), ..., X_(n)) is a complete sufficient statistic, provided that
X_1, ..., X_n are of either (pure) discrete or (pure) continuous type.
Definition 12.1.7:
A parameter g(F) is called estimable if it has an unbiased estimate, i.e., if there exists a
T(X) such that

  E_F(T(X)) = g(F)  ∀F ∈ P.

Example 12.1.8:
Let P be the class of distributions for which second moments exist. Then X̄ is unbiased for
μ(F) = ∫ x dF(x). Thus, μ(F) is estimable.

Definition 12.1.9:
The degree m of an estimable parameter g(F) is the smallest sample size for which an
unbiased estimate exists for all F ∈ P.
An unbiased estimate based on a sample of size m is called a kernel.

Lemma 12.1.10:
There exists a symmetric kernel for every estimable parameter.
Proof:
Let T(X_1, ..., X_m) be a kernel of g(F). Define

  T_s(X_1, ..., X_m) = (1/m!) Σ T(X_{i_1}, ..., X_{i_m}),

where the summation is over all m! permutations (i_1, ..., i_m) of {1, ..., m}.
Clearly, T_s is symmetric and E(T_s) = g(F).

Example 12.1.11:

(i) E(X_1) = μ(F), so μ(F) has degree 1 with kernel X_1.

(ii) E(I_{(c,∞)}(X_1)) = P_F(X > c), where c is a known constant. So g(F) = P_F(X > c) has
degree 1 with kernel I_{(c,∞)}(X_1).

(iii) There exists no T(X_1) such that E(T(X_1)) = σ²(F) = ∫ (x − μ(F))² dF(x).
But E(T(X_1, X_2)) = E(X_1² − X_1X_2) = σ²(F). So σ²(F) has degree 2 with kernel
X_1² − X_1X_2. Note that X_2² − X_2X_1 is another kernel.

(iv) A symmetric kernel for σ²(F) is

  T_s(X_1, X_2) = (1/2)((X_1² − X_1X_2) + (X_2² − X_1X_2)) = (1/2)(X_1 − X_2)².
Lecture 35:
We 04/11/01

Definition 12.1.12:
Let g(F) be an estimable parameter of degree m. Let X_1, ..., X_n be a sample of size n, n ≥ m.
Given a kernel T(X_{i_1}, ..., X_{i_m}) of g(F), we define a U-statistic by

  U(X_1, ..., X_n) = (1 / C(n, m)) Σ_c T_s(X_{i_1}, ..., X_{i_m}),

where T_s is defined as in Lemma 12.1.10 and the summation c is over all C(n, m) = n!/(m!(n−m)!)
combinations of m integers (i_1, ..., i_m) from {1, ..., n}. U(X_1, ..., X_n) is symmetric in the
X_i's, and E_F(U(X)) = g(F) for all F.
Example 12.1.13:
For estimating μ(F), the degree of μ(F) is m = 1.
Symmetric kernel:

  T_s(X_i) = X_i, i = 1, ..., n.

U-statistic:

  U_μ(X) = (1 / C(n, 1)) Σ_c X_i
    = (1 · (n−1)! / n!) Σ_c X_i
    = (1/n) Σ_{i=1}^n X_i
    = X̄.

For estimating σ²(F), the degree of σ²(F) is m = 2.
Symmetric kernel:

  T_s(X_{i_1}, X_{i_2}) = (1/2)(X_{i_1} − X_{i_2})², i_1, i_2 = 1, ..., n, i_1 ≠ i_2.

U-statistic:

  U_{σ²}(X) = (1 / C(n, 2)) Σ_{i_1<i_2} (1/2)(X_{i_1} − X_{i_2})²
    = (1 / C(n, 2)) (1/4) Σ_{i_1≠i_2} (X_{i_1} − X_{i_2})²
    = ((n−2)! · 2! / n!) (1/4) Σ_{i_1≠i_2} (X_{i_1} − X_{i_2})²
    = (1/(2n(n−1))) Σ_{i_1} Σ_{i_2≠i_1} (X_{i_1}² − 2 X_{i_1} X_{i_2} + X_{i_2}²)
    = (1/(2n(n−1))) [ (n−1) Σ_{i_1=1}^n X_{i_1}² − 2 (Σ_{i_1=1}^n X_{i_1})(Σ_{i_2=1}^n X_{i_2}) + 2 Σ_{i=1}^n X_i² + (n−1) Σ_{i_2=1}^n X_{i_2}² ]
    = (1/(2n(n−1))) [ n Σ X_{i_1}² − Σ X_{i_1}² − 2 (Σ X_{i_1})² + 2 Σ X_i² + n Σ X_{i_2}² − Σ X_{i_2}² ]
    = (1/(n(n−1))) [ n Σ_{i=1}^n X_i² − (Σ_{i=1}^n X_i)² ]
    = (1/(n(n−1))) [ n Σ_{i=1}^n ( X_i − (1/n) Σ_{j=1}^n X_j )² ]
    = (1/(n−1)) Σ_{i=1}^n (X_i − X̄)²
    = S².
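The identity U_{σ²} = S² derived above can be verified directly on data. A hedged sketch (the data vector is an arbitrary illustrative choice):

```python
import numpy as np
from itertools import combinations

# Numerical check of Example 12.1.13: the U-statistic built from the
# symmetric kernel (x_i - x_j)^2 / 2 equals the sample variance S^2.
x = np.array([1.3, -0.2, 2.7, 0.9, 1.1, -1.4])

# average of the kernel over all C(n, 2) pairs i < j
u_stat = np.mean([(xi - xj) ** 2 / 2 for xi, xj in combinations(x, 2)])
s2 = x.var(ddof=1)   # S^2 with the 1/(n-1) divisor
print(bool(np.isclose(u_stat, s2)))   # True
```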
Theorem 12.1.14:
Let P be the class of all absolutely continuous or all purely discrete distribution functions on
IR. Any estimable function g(F), F ∈ P, has a unique estimate that is unbiased and
symmetric in the observations and has uniformly minimum variance among all unbiased
estimates.
Proof:
Let X_1, ..., X_n iid ~ F ∈ P, with T(X_1, ..., X_n) an unbiased estimate of g(F).
We define

  T_i = T_i(X_1, ..., X_n) = T(X_{i_1}, X_{i_2}, ..., X_{i_n}), i = 1, 2, ..., n!,

over all possible permutations (i_1, ..., i_n) of {1, ..., n}. Let T̄ = (1/n!) Σ_{i=1}^{n!} T_i.
Then

  E_F(T̄) = g(F)

and

  Var(T̄) = E(T̄²) − (E(T̄))²
    = E[ ( (1/n!) Σ_{i=1}^{n!} T_i )² ] − [g(F)]²
    = (1/n!)² Σ_{i=1}^{n!} Σ_{j=1}^{n!} E(T_i T_j) − [g(F)]²
    ≤ (1/n!)² Σ_{i=1}^{n!} Σ_{j=1}^{n!} √( E(T_i²) E(T_j²) ) − [g(F)]²   | Cauchy-Schwarz
    = E(T²) − [g(F)]²   | each T_i has the same distribution as T
    = Var(T).

Equality holds iff T_i = T_j ∀i, j = 1, ..., n!
⟹ T̄ is symmetric in (X_1, ..., X_n) and T̄ = T
⟹ by Rohatgi, Problem 4, page 538, T̄ is a function of the order statistic
⟹ by Rohatgi, Theorem 1, page 535, the order statistic is a complete sufficient statistic
⟹ by Note (i) following Theorem 8.4.12, T̄ is the UMVUE.

Corollary 12.1.15:
If T(X_1, ..., X_n) is unbiased for g(F), F ∈ P, the corresponding U-statistic is an essentially
unique UMVUE.
Definition 12.1.16:
Suppose we have independent samples X_1, ..., X_m iid ~ F ∈ P and Y_1, ..., Y_n iid ~ G ∈ P
(G may or may not equal F). Let g(F, G) be an estimable function with unbiased estimator
T(X_1, ..., X_k, Y_1, ..., Y_l).
Define

  T_s(X_1, ..., X_k, Y_1, ..., Y_l) = (1/(k! l!)) Σ_{P_X} Σ_{P_Y} T(X_{i_1}, ..., X_{i_k}, Y_{j_1}, ..., Y_{j_l})

(where P_X and P_Y range over the permutations of the X's and the Y's) and

  U(X, Y) = (1 / (C(m, k) C(n, l))) Σ_{C_X} Σ_{C_Y} T_s(X_{i_1}, ..., X_{i_k}, Y_{j_1}, ..., Y_{j_l})

(where C_X and C_Y range over the combinations of the X's and the Y's).
U is called a generalized U-statistic.
Example 12.1.17:
Let X_1, ..., X_m and Y_1, ..., Y_n be independent random samples from F and G, respectively,
with F, G ∈ P. We wish to estimate

  g(F, G) = P_{F,G}(X ≤ Y).

Let us define

  Z_ij = 1 if X_i ≤ Y_j, and Z_ij = 0 if X_i > Y_j,

for each pair X_i, Y_j, i = 1, 2, ..., m, j = 1, 2, ..., n.
Then Σ_{i=1}^m Z_ij is the number of X's ≤ Y_j, and Σ_{j=1}^n Z_ij is the number of Y's ≥ X_i.
It holds that

  E(I(X_i ≤ Y_j)) = g(F, G) = P_{F,G}(X ≤ Y),

and the degrees are k = l = 1, so we use

  U(X, Y) = (1/(C(m, 1) C(n, 1))) Σ_{C_X} Σ_{C_Y} T_s(X_{i_1}, Y_{j_1})
    = ((m−1)!(n−1)!/(m! n!)) Σ_{C_X} Σ_{C_Y} (1/(1! 1!)) Σ_{P_X} Σ_{P_Y} T(X_{i_1}, Y_{j_1})
    = (1/(mn)) Σ_{i=1}^m Σ_{j=1}^n I(X_i ≤ Y_j).

This Mann-Whitney estimator (or Wilcoxon two-sample estimator) is unbiased and
symmetric in the X's and Y's. It follows by Corollary 12.1.15 that it has minimum variance.
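The double sum (1/(mn)) Σ_i Σ_j I(X_i ≤ Y_j) is a one-liner with broadcasting. A hedged sketch (the two samples are arbitrary illustrative data):

```python
import numpy as np

# Sketch for Example 12.1.17: the Mann-Whitney estimator of P(X <= Y)
# is the fraction of pairs (X_i, Y_j) with X_i <= Y_j.
x = np.array([0.5, 1.2, -0.3, 2.1])
y = np.array([1.0, 0.8, 1.9])

# (1/(mn)) * sum_i sum_j I(X_i <= Y_j), via an m-by-n comparison matrix
u = np.mean(x[:, None] <= y[None, :])
print(round(float(u), 4))   # 7 of 12 pairs satisfy X_i <= Y_j: 0.5833
```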
Lecture 36:
Fr 04/13/01

12.2 Single-Sample Hypothesis Tests
Let X_1, ..., X_n be a sample from a distribution F. The problem of fit is to test the
hypothesis that the sample X_1, ..., X_n is from some specified distribution against the
alternative that it is from some other distribution, i.e., H_0: F = F_0 vs. H_1: F(x) ≠ F_0(x)
for some x.

Definition 12.2.1:
Let X_1, ..., X_n iid ~ F, and let the corresponding empirical cdf be

  F*_n(x) = (1/n) Σ_{i=1}^n I_{(−∞,x]}(X_i).

The statistic

  D_n = sup_x | F*_n(x) − F(x) |

is called the two-sided Kolmogorov-Smirnov statistic (K-S statistic).
The one-sided K-S statistics are

  D_n^+ = sup_x [F*_n(x) − F(x)] and D_n^− = sup_x [F(x) − F*_n(x)].
Theorem 12.2.2:
For any continuous distribution F, the K-S statistics D_n, D_n^−, D_n^+ are distribution free.
Proof:
Let X_(1), ..., X_(n) be the order statistics of X_1, ..., X_n, i.e., X_(1) ≤ X_(2) ≤ ... ≤ X_(n), and
define X_(0) = −∞ and X_(n+1) = +∞.
Then,

  F*_n(x) = i/n for X_(i) ≤ x < X_(i+1), i = 0, ..., n.

Therefore,

  D_n^+ = max_{0≤i≤n} { sup_{X_(i)≤x<X_(i+1)} [ i/n − F(x) ] }
    = max_{0≤i≤n} { i/n − inf_{X_(i)≤x<X_(i+1)} F(x) }
    (*) = max_{0≤i≤n} { i/n − F(X_(i)) }
    = max{ max_{1≤i≤n} ( i/n − F(X_(i)) ), 0 }.

(*) holds since F is nondecreasing on [X_(i), X_(i+1)).
Note that D_n^+ is a function of the F(X_(i)). In order to make some inference about D_n^+, the
distribution of F(X_(i)) must be known. We know from the Probability Integral Transformation
(see Rohatgi, page 203, Theorem 1) that for a rv X with continuous cdf F_X, it holds that
F_X(X) ~ U(0, 1).
Thus, F(X_(i)) is the i-th order statistic of a sample from U(0, 1), independent of F. Therefore,
the distribution of D_n^+ is independent of F.
Similarly, the distribution of

  D_n^− = max{ max_{1≤i≤n} ( F(X_(i)) − (i−1)/n ), 0 }

is independent of F.
Since

  D_n = sup_x | F*_n(x) − F(x) | = max{ D_n^+, D_n^− },

the distribution of D_n is also independent of F.
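The max formulas from the proof give a direct way to compute the K-S statistics. A hedged sketch (F_0 = N(0, 1) and the data are arbitrary illustrative choices):

```python
import numpy as np
from scipy.stats import norm

# Sketch for Definition 12.2.1 / Theorem 12.2.2: compute D_n^+, D_n^-,
# and D_n from the order statistics via the max formulas in the proof.
x = np.sort(np.array([-0.9, 0.1, 0.4, 1.3, 2.2]))
n = len(x)
F = norm.cdf(x)               # hypothesized cdf at the order statistics
i = np.arange(1, n + 1)

d_plus = max(np.max(i / n - F), 0.0)
d_minus = max(np.max(F - (i - 1) / n), 0.0)
d = max(d_plus, d_minus)
print(round(d, 4))   # 0.3398 for this illustrative data set
```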
Theorem 12.2.3:
If F is continuous, then

  P(D_n ≤ ν + 1/(2n)) =
    0, if ν ≤ 0;
    ∫_{1/(2n)−ν}^{1/(2n)+ν} ∫_{3/(2n)−ν}^{3/(2n)+ν} ··· ∫_{(2n−1)/(2n)−ν}^{(2n−1)/(2n)+ν} f(u) du, if 0 < ν < (2n−1)/(2n);
    1, if ν ≥ (2n−1)/(2n),

where

  f(u) = f(u_1, ..., u_n) = n!, if 0 < u_1 < u_2 < ... < u_n < 1, and 0 otherwise,

is the joint pdf of the order statistic of a sample of size n from U(0, 1).
Note:
As Gibbons & Chakraborti (1992), page 108{109, point out, this result must be interpreted
carefully. Consider the case n = 2.
For 0 < � < 34 , it holds that
P (D2 � � +1
4) =
Z �+ 14
14��
0<u1<u2<1
Z �+ 34
34��
2! du2 du1:
Note that the integration limits overlap if
� +1
4� �� + 3
4
() � � 1
4
136
When 0 < ν < 1/4, the constraint 0 < u_1 < u_2 < 1 holds automatically. Thus, for 0 < ν < 1/4,
it holds that

  P(D_2 ≤ ν + 1/4) = ∫_{1/4−ν}^{1/4+ν} ∫_{3/4−ν}^{3/4+ν} 2! du_2 du_1
    = 2! ∫_{1/4−ν}^{1/4+ν} ( u_2 |_{3/4−ν}^{3/4+ν} ) du_1
    = 2! ∫_{1/4−ν}^{1/4+ν} 2ν du_1
    = 2! (2ν) u_1 |_{1/4−ν}^{1/4+ν}
    = 2! (2ν)².
4 , the region of integration is as follows:
u1
u2
1
10
• • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • •
••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••
1/4 + nu
3/4 - nu
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
••
•••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••
3/4 - nu
Area 1 Area 2
Note to Theorem 12.2.3
Thus, for 1/4 ≤ ν < 3/4, it holds that

  P(D_2 ≤ ν + 1/4) = ∫_{1/4−ν}^{1/4+ν} ∫_{3/4−ν}^{3/4+ν} 2! du_2 du_1   (subject to 0 < u_1 < u_2 < 1)
    = ∫_{3/4−ν}^{ν+1/4} ∫_{u_1}^{1} 2! du_2 du_1 + ∫_{0}^{3/4−ν} ∫_{3/4−ν}^{1} 2! du_2 du_1
    = 2 [ ∫_{3/4−ν}^{ν+1/4} ( u_2 |_{u_1}^{1} ) du_1 + ∫_{0}^{3/4−ν} ( u_2 |_{3/4−ν}^{1} ) du_1 ]
    = 2 [ ∫_{3/4−ν}^{ν+1/4} (1 − u_1) du_1 + ∫_{0}^{3/4−ν} (1 − 3/4 + ν) du_1 ]
    = 2 [ ( u_1 − u_1²/2 ) |_{3/4−ν}^{ν+1/4} + ( u_1/4 + ν u_1 ) |_{0}^{3/4−ν} ]
    = 2 [ (ν + 1/4) − (ν + 1/4)²/2 − (−ν + 3/4) + (−ν + 3/4)²/2 + (−ν + 3/4)/4 + ν(−ν + 3/4) ]
    = 2 [ ν + 1/4 − ν²/2 − ν/4 − 1/32 + ν − 3/4 + ν²/2 − (3/4)ν + 9/32 − ν/4 + 3/16 − ν² + (3/4)ν ]
    = 2 [ −ν² + (3/2)ν − 1/16 ]
    = −2ν² + 3ν − 1/8.

Combining these results gives

  P(D_2 ≤ ν + 1/4) =
    0, if ν ≤ 0;
    2! (2ν)², if 0 < ν < 1/4;
    −2ν² + 3ν − 1/8, if 1/4 ≤ ν < 3/4;
    1, if ν ≥ 3/4.
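The algebra in the piecewise cdf above can be sanity-checked at the breakpoints: the two pieces must agree at ν = 1/4, and the second must reach 1 at ν = 3/4.

```python
# Sanity check of the piecewise cdf of D_2 derived above:
#   P(D_2 <= nu + 1/4) = 2*(2*nu)**2          for 0 < nu < 1/4
#                      = -2*nu**2 + 3*nu - 1/8  for 1/4 <= nu < 3/4
low = lambda nu: 2 * (2 * nu) ** 2
high = lambda nu: -2 * nu**2 + 3 * nu - 1 / 8
print(low(0.25), high(0.25), high(0.75))   # 0.5 0.5 1.0
```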
Lecture 37:
Mo 04/16/01

Theorem 12.2.4:
Let F be a continuous cdf. Then it holds for all z ≥ 0:

  lim_{n→∞} P(D_n ≤ z/√n) = L_1(z) = 1 − 2 Σ_{i=1}^{∞} (−1)^{i−1} exp(−2i²z²).
Theorem 12.2.5:
Let F be a continuous cdf. Then it holds:

  P(D_n^+ ≤ z) = P(D_n^− ≤ z) =
    0, if z ≤ 0;
    ∫_{1−z}^{1} ∫_{(n−1)/n−z}^{u_n} ··· ∫_{2/n−z}^{u_3} ∫_{1/n−z}^{u_2} f(u) du, if 0 < z < 1;
    1, if z ≥ 1,

where f(u) is defined as in Theorem 12.2.3.
Note:
It should be obvious that the statistics D_n^+ and D_n^− have the same distribution because of
symmetry.
Theorem 12.2.6:
Let F be a continuous cdf. Then it holds for all z ≥ 0:

  lim_{n→∞} P(D_n^+ ≤ z/√n) = lim_{n→∞} P(D_n^− ≤ z/√n) = L_2(z) = 1 − exp(−2z²).
Corollary 12.2.7:
Let V_n = 4n(D_n^+)². Then it holds that V_n →^d χ²_2, i.e., this transformation of D_n^+ has an
asymptotic χ²_2 distribution.
Proof:
Let x ≥ 0. Then it follows:

  lim_{n→∞} P(V_n ≤ x)
    = lim_{n→∞} P(V_n ≤ 4z²)   | x = 4z²
    = lim_{n→∞} P(4n(D_n^+)² ≤ 4z²)
    = lim_{n→∞} P(√n D_n^+ ≤ z)
    = 1 − exp(−2z²)   | Th. 12.2.6
    = 1 − exp(−x/2)   | 4z² = x

Thus, lim_{n→∞} P(V_n ≤ x) = 1 − exp(−x/2) for x ≥ 0. Note that this is the cdf of a χ²_2
distribution.
Definition 12.2.8:
Let D_{n,α} be the smallest value such that P(D_n > D_{n,α}) ≤ α. Likewise, let D^+_{n,α} be the
smallest value such that P(D_n^+ > D^+_{n,α}) ≤ α.
The Kolmogorov-Smirnov test (K-S test) rejects H_0: F(x) = F_0(x) ∀x at level α if
D_n > D_{n,α}.
It rejects H_0′: F(x) ≥ F_0(x) ∀x at level α if D_n^− > D^+_{n,α}, and it rejects
H_0″: F(x) ≤ F_0(x) ∀x at level α if D_n^+ > D^+_{n,α}.
Note:
Rohatgi, Table 7, page 661, gives values of Dn;� and D+n;� for selected values of � and small
n. Theorems 12.2.4 and 12.2.6 allow the approximation of Dn;� and D+n;� for large n.
Example 12.2.9:
Let X_1, ..., X_n ~ C(1, 0). We want to test whether H_0: X ~ N(0, 1).
The following data have been observed for x_(1), ..., x_(10):

  -1.42, -0.43, -0.19, 0.26, 0.30, 0.45, 0.64, 0.96, 1.97, and 4.68.

The results for the K-S test have been obtained through the following S-Plus session, i.e.,
D_10^+ = 0.02219616, D_10^− = 0.3025681, and D_10 = 0.3025681:
> x _ c(-1.42, -0.43, -0.19, 0.26, 0.30, 0.45, 0.64, 0.96, 1.97, 4.68)
> FX _ pnorm(x)
> FX
[1] 0.07780384 0.33359782 0.42465457 0.60256811 0.61791142 0.67364478
[7] 0.73891370 0.83147239 0.97558081 0.99999857
> Dp _ (1:10)/10 - FX
> Dp
[1] 2.219616e-02 -1.335978e-01 -1.246546e-01 -2.025681e-01 -1.179114e-01
[6] -7.364478e-02 -3.891370e-02 -3.147239e-02 -7.558081e-02 1.434375e-06
> Dm _ FX - (0:9)/10
> Dm
[1] 0.07780384 0.23359782 0.22465457 0.30256811 0.21791142 0.17364478
[7] 0.13891370 0.13147239 0.17558081 0.09999857
> max(Dp)
[1] 0.02219616
> max(Dm)
[1] 0.3025681
> max(max(Dp), max(Dm))
[1] 0.3025681
>
> ks.gof(x, alternative = "two.sided", mean = 0, sd = 1)
One-sample Kolmogorov-Smirnov Test
Hypothesized distribution = normal
data: x
ks = 0.3026, p-value = 0.2617
alternative hypothesis:
True cdf is not the normal distn. with the specified parameters
Using Rohatgi, Table 7, page 661, we have to use D_{10,0.20} = 0.323 for α = 0.20. Since
D_10 = 0.3026 < 0.323 = D_{10,0.20}, it follows that p > 0.20. The K-S test does not reject H_0
at level α = 0.20. As S-Plus shows, the precise p-value is p = 0.2617.
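The S-Plus session above can be reproduced in Python; this is a sketch using scipy's one-sample K-S test against N(0, 1), not part of the original notes:

```python
import numpy as np
from scipy.stats import kstest

# Reproduce the K-S statistic of Example 12.2.9 with scipy.
x = np.array([-1.42, -0.43, -0.19, 0.26, 0.30, 0.45, 0.64, 0.96, 1.97, 4.68])
result = kstest(x, "norm")          # one-sample K-S test vs. N(0, 1)
print(round(result.statistic, 4))   # 0.3026, matching D_10 above
```

The reported p-value is likewise in the vicinity of the 0.2617 given by S-Plus (the exact value may differ slightly depending on scipy's exact-vs-asymptotic p-value mode).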
Note:
Comparison between χ² and K-S goodness-of-fit tests:

- K-S uses all available data; χ² bins the data and loses information

- K-S works for all sample sizes; χ² requires large sample sizes

- it is more difficult to modify K-S for estimated parameters; χ² can be easily adapted
for estimated parameters

- K-S is "conservative" for discrete data, i.e., it tends to accept H_0 for such data

- the order matters for K-S; χ² is better for unordered categorical data
12.3 More on Order Statistics

Definition 12.3.1:
Let F be a continuous cdf. A tolerance interval for F with tolerance coefficient γ is
a random interval such that the probability is γ that this random interval covers at least a
specified percentage 100p% of the distribution.

Theorem 12.3.2:
If order statistics X_(r) < X_(s) are used as the endpoints of a tolerance interval for a continuous
cdf F, it holds that

  γ = Σ_{i=0}^{s−r−1} C(n, i) p^i (1−p)^{n−i}.
Proof:
According to Definition 12.3.1, it holds that
$$\gamma = P_{X_{(r)},X_{(s)}}\left( P_X(X_{(r)} < X < X_{(s)}) \geq p \right).$$
Since $F$ is continuous, it holds that $F_X(X) \sim U(0,1)$. Therefore,
$$P_X(X_{(r)} < X < X_{(s)}) = P(X < X_{(s)}) - P(X \leq X_{(r)}) = F(X_{(s)}) - F(X_{(r)}) = U_{(s)} - U_{(r)},$$
where $U_{(s)}$ and $U_{(r)}$ are the order statistics of a $U(0,1)$ distribution.
Thus,
$$\gamma = P_{X_{(r)},X_{(s)}}\left( P_X(X_{(r)} < X < X_{(s)}) \geq p \right) = P(U_{(s)} - U_{(r)} \geq p).$$
By Theorem 4.4.4, we can determine the joint distribution of order statistics and calculate $\gamma$ as
$$\gamma = \int_p^1 \int_0^{y-p} \frac{n!}{(r-1)!\,(s-r-1)!\,(n-s)!} \, x^{r-1} (y-x)^{s-r-1} (1-y)^{n-s} \, dx \, dy.$$
Rather than solving this integral directly, we make the transformation
$$U = U_{(s)} - U_{(r)}, \qquad V = U_{(s)}.$$
Then the joint pdf of $U$ and $V$ is
$$f_{U,V}(u,v) = \begin{cases} \frac{n!}{(r-1)!\,(s-r-1)!\,(n-s)!} \, (v-u)^{r-1} u^{s-r-1} (1-v)^{n-s}, & \text{if } 0 < u < v < 1 \\ 0, & \text{otherwise} \end{cases}$$
142
and the marginal pdf of $U$ is
$$f_U(u) = \int_0^1 f_{U,V}(u,v) \, dv$$
$$= \frac{n!}{(r-1)!\,(s-r-1)!\,(n-s)!} \, u^{s-r-1} I_{(0,1)}(u) \int_u^1 (v-u)^{r-1} (1-v)^{n-s} \, dv$$
$$\stackrel{(A)}{=} \frac{n!}{(r-1)!\,(s-r-1)!\,(n-s)!} \, u^{s-r-1} (1-u)^{n-s+r} I_{(0,1)}(u) \underbrace{\int_0^1 t^{r-1} (1-t)^{n-s} \, dt}_{B(r,\,n-s+1)}$$
$$= \frac{n!}{(r-1)!\,(s-r-1)!\,(n-s)!} \, u^{s-r-1} (1-u)^{n-s+r} \, \frac{(r-1)!\,(n-s)!}{(n-s+r)!} \, I_{(0,1)}(u)$$
$$= \frac{n!}{(n-s+r)!\,(s-r-1)!} \, u^{s-r-1} (1-u)^{n-s+r} I_{(0,1)}(u)$$
$$= n \binom{n-1}{s-r-1} u^{s-r-1} (1-u)^{n-s+r} I_{(0,1)}(u).$$
(A) is based on the transformation $t = \frac{v-u}{1-u}$, i.e., $v - u = (1-u)t$, $1 - v = 1 - u - (1-u)t = (1-u)(1-t)$, and $dv = (1-u)\,dt$.
It follows that
$$\gamma = P(U_{(s)} - U_{(r)} \geq p) = P(U \geq p) = \int_p^1 n \binom{n-1}{s-r-1} u^{s-r-1} (1-u)^{n-s+r} \, du$$
$$\stackrel{(B)}{=} P(Y < s-r), \quad \text{where } Y \sim Bin(n,p),$$
$$= \sum_{i=0}^{s-r-1} \binom{n}{i} p^i (1-p)^{n-i}.$$
(B) holds due to Rohatgi, Remark 3 after Theorem 5.3.18, page 216, since for $X \sim Bin(n,p)$, it holds that
$$P(X < k) = \int_p^1 n \binom{n-1}{k-1} x^{k-1} (1-x)^{n-k} \, dx.$$
143
Lecture 38:
We 04/18/01

Example 12.3.3:
Let $s = n$ and $r = 1$. Then,
$$\gamma = \sum_{i=0}^{n-2} \binom{n}{i} p^i (1-p)^{n-i} = 1 - p^n - n p^{n-1} (1-p).$$
If $p = 0.8$ and $n = 10$, then
$$\gamma_{10} = 1 - (0.8)^{10} - 10 \cdot (0.8)^9 \cdot (0.2) = 0.624,$$
i.e., $(X_{(1)}, X_{(10)})$ defines a 62.4% tolerance interval for 80% probability.
If $p = 0.8$ and $n = 20$, then
$$\gamma_{20} = 1 - (0.8)^{20} - 20 \cdot (0.8)^{19} \cdot (0.2) = 0.931,$$
and if $p = 0.8$ and $n = 30$, then
$$\gamma_{30} = 1 - (0.8)^{30} - 30 \cdot (0.8)^{29} \cdot (0.2) = 0.989.$$
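Since the tolerance coefficient of Theorem 12.3.2 is just a $Bin(n,p)$ cdf evaluated at $s-r-1$, it is easy to compute numerically. A small Python sketch (my own, not from the notes), checked against Example 12.3.3:

```python
# gamma = sum_{i=0}^{s-r-1} C(n,i) p^i (1-p)^(n-i) = Bin(n,p) cdf at s-r-1.
from scipy.stats import binom

def tolerance_coefficient(n, r, s, p):
    """Probability that (X_(r), X_(s)) covers at least 100p% of a continuous F."""
    return binom.cdf(s - r - 1, n, p)

# Example 12.3.3: r = 1, s = n, p = 0.8, compared against the closed form.
for n in (10, 20, 30):
    g = tolerance_coefficient(n, r=1, s=n, p=0.8)
    closed_form = 1 - 0.8**n - n * 0.8**(n - 1) * 0.2
    print(n, round(g, 3), round(closed_form, 3))
```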
Theorem 12.3.4:
Let $k_p$ be the $p$th quantile of a continuous cdf $F$. Let $X_{(1)}, \ldots, X_{(n)}$ be the order statistics of a sample of size $n$ from $F$. Then it holds that
$$P(X_{(r)} \leq k_p \leq X_{(s)}) = \sum_{i=r}^{s-1} \binom{n}{i} p^i (1-p)^{n-i}.$$
Proof:
It holds that
$$P(X_{(r)} \leq k_p) = P(\text{at least } r \text{ of the } X_i\text{'s are} \leq k_p) = \sum_{i=r}^{n} \binom{n}{i} p^i (1-p)^{n-i}.$$
Therefore,
$$P(X_{(r)} \leq k_p \leq X_{(s)}) = P(X_{(r)} \leq k_p) - P(X_{(s)} < k_p)$$
$$= \sum_{i=r}^{n} \binom{n}{i} p^i (1-p)^{n-i} - \sum_{i=s}^{n} \binom{n}{i} p^i (1-p)^{n-i} = \sum_{i=r}^{s-1} \binom{n}{i} p^i (1-p)^{n-i}.$$
144
Corollary 12.3.5:
$(X_{(r)}, X_{(s)})$ is a level $\sum_{i=r}^{s-1} \binom{n}{i} p^i (1-p)^{n-i}$ confidence interval for $k_p$.

Example 12.3.6:
Let $n = 10$. We want a 95% confidence interval for the median, i.e., $k_p$ where $p = \frac{1}{2}$.
We get the following probabilities $p_{r,s} = \sum_{i=r}^{s-1} \binom{n}{i} p^i (1-p)^{n-i}$ that $(X_{(r)}, X_{(s)})$ covers $k_{0.5}$:

p_{r,s}                              s
         2     3     4     5     6     7     8     9     10
    1  0.01  0.05  0.17  0.38  0.62  0.83  0.94  0.99  0.998
    2        0.04  0.16  0.37  0.61  0.82  0.93  0.98  0.99
    3              0.12  0.32  0.57  0.77  0.89  0.93  0.94
    4                    0.21  0.45  0.66  0.77  0.82  0.83
 r  5                          0.25  0.45  0.57  0.61  0.62
    6                                0.21  0.32  0.37  0.38
    7                                      0.12  0.16  0.17
    8                                            0.04  0.05
    9                                                  0.01

Only the random intervals $(X_{(1)}, X_{(9)})$, $(X_{(1)}, X_{(10)})$, $(X_{(2)}, X_{(9)})$, and $(X_{(2)}, X_{(10)})$ give the desired coverage probability. Therefore, we use the one that comes closest to 95%, i.e., $(X_{(2)}, X_{(9)})$, as the 95% confidence interval for the median.
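The $p_{r,s}$ table can be regenerated as a difference of two binomial cdf's; a short Python sketch (my own, using `scipy.stats.binom` as an assumed helper):

```python
# Coverage probability of (X_(r), X_(s)) for the median k_0.5 with n = 10:
# p_{r,s} = sum_{i=r}^{s-1} C(n,i) p^i (1-p)^(n-i)  (Theorem 12.3.4).
from scipy.stats import binom

n, p = 10, 0.5

def coverage(r, s):
    return binom.cdf(s - 1, n, p) - binom.cdf(r - 1, n, p)

print(round(coverage(2, 9), 4))    # the interval chosen in Example 12.3.6
print(round(coverage(1, 10), 4))
```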
145
13 Some Results from Sampling
13.1 Simple Random Samples
Definition 13.1.1:
Let $\Omega$ be a population of size $N$ with mean $\mu$ and variance $\sigma^2$. A sampling method (of size $n$) is called simple if the set $S$ of possible samples contains all combinations of $n$ elements of $\Omega$ (without repetition) and the probability for each sample $s \in S$ to become selected depends only on $n$, i.e., $p(s) = \frac{1}{\binom{N}{n}} \;\forall s \in S$. Then we call $s \in S$ a simple random sample (SRS) of size $n$.
Theorem 13.1.2:
Let $\Omega$ be a population of size $N$ with mean $\mu$ and variance $\sigma^2$. Let $Y: \Omega \to I\!R$ be a measurable function. Let $n_i$ be the total number of times the value $\tilde{y}_i$ occurs in the population and $p_i = \frac{n_i}{N}$ be the relative frequency with which the value $\tilde{y}_i$ occurs in the population. Let $(y_1, \ldots, y_n)$ be a SRS of size $n$ with respect to $Y$, where $P(Y = \tilde{y}_i) = p_i = \frac{n_i}{N}$.
Then the components $y_i$, $i = 1, \ldots, n$, are identically distributed as $Y$ and it holds for $i \neq j$:
$$P(y_i = \tilde{y}_k, \, y_j = \tilde{y}_l) = \frac{1}{N(N-1)} \, n_{kl}, \quad \text{where } n_{kl} = \begin{cases} n_k n_l, & k \neq l \\ n_k(n_k - 1), & k = l \end{cases}$$
Note:
(i) In sampling, many authors use capital letters to denote properties of the population and small letters to denote properties of the random sample. In particular, the $x_i$'s and $y_i$'s are considered as random variables related to the sample. They are not seen as specific realizations.
(ii) The following equalities hold in the scenario of Theorem 13.1.2:
$$N = \sum_i n_i, \qquad \mu = \frac{1}{N} \sum_i n_i \tilde{y}_i, \qquad \sigma^2 = \frac{1}{N} \sum_i n_i (\tilde{y}_i - \mu)^2 = \frac{1}{N} \sum_i n_i \tilde{y}_i^2 - \mu^2.$$
146
Theorem 13.1.3:
Let the same conditions hold as in Theorem 13.1.2. Let $\bar{y} = \frac{1}{n} \sum_{i=1}^n y_i$ be the sample mean of a SRS of size $n$. Then it holds:
(i) $E(\bar{y}) = \mu$, i.e., the sample mean is unbiased for the population mean $\mu$.
(ii) $Var(\bar{y}) = \frac{1}{n} \frac{N-n}{N-1} \sigma^2 = \frac{1}{n}(1-f) \frac{N}{N-1} \sigma^2$, where $f = \frac{n}{N}$.
Proof:
(i)
$$E(\bar{y}) = \frac{1}{n} \sum_{i=1}^n E(y_i) = \mu, \quad \text{since } E(y_i) = \mu \;\forall i.$$
(ii)
$$Var(\bar{y}) = \frac{1}{n^2} \left( \sum_{i=1}^n Var(y_i) + 2 \sum_{i<j} Cov(y_i, y_j) \right)$$
For $i \neq j$, it holds that
$$Cov(y_i, y_j) = E(y_i \cdot y_j) - E(y_i)E(y_j) = E(y_i \cdot y_j) - \mu^2$$
$$= \sum_{k,l} \tilde{y}_k \tilde{y}_l P(y_i = \tilde{y}_k, \, y_j = \tilde{y}_l) - \mu^2$$
$$\stackrel{Th.\,13.1.2}{=} \frac{1}{N(N-1)} \left( \sum_{k \neq l} \tilde{y}_k \tilde{y}_l n_k n_l + \sum_k \tilde{y}_k^2 n_k (n_k - 1) \right) - \mu^2$$
$$= \frac{1}{N(N-1)} \left( \sum_{k,l} \tilde{y}_k \tilde{y}_l n_k n_l - \sum_k \tilde{y}_k^2 n_k \right) - \mu^2$$
$$= \frac{1}{N(N-1)} \left( \left( \sum_k \tilde{y}_k n_k \right) \left( \sum_l \tilde{y}_l n_l \right) - \sum_k \tilde{y}_k^2 n_k \right) - \mu^2$$
$$\stackrel{Note\,(ii)}{=} \frac{1}{N(N-1)} \left( N^2 \mu^2 - N(\sigma^2 + \mu^2) \right) - \mu^2$$
$$= \frac{1}{N-1} \left( N\mu^2 - \sigma^2 - \mu^2 - \mu^2 (N-1) \right) = -\frac{1}{N-1} \sigma^2.$$
147
Lecture 39:
Fr 04/20/01

$$\Longrightarrow Var(\bar{y}) = \frac{1}{n^2} \left( \sum_{i=1}^n Var(y_i) + 2 \sum_{i<j} Cov(y_i, y_j) \right) = \frac{1}{n^2} \left( n\sigma^2 + n(n-1) \left( -\frac{1}{N-1} \sigma^2 \right) \right)$$
$$= \frac{1}{n} \left( 1 - \frac{n-1}{N-1} \right) \sigma^2 = \frac{1}{n} \, \frac{N-n}{N-1} \, \sigma^2 = \frac{1}{n} \left( 1 - \frac{n}{N} \right) \frac{N}{N-1} \sigma^2 = \frac{1}{n} (1-f) \frac{N}{N-1} \sigma^2$$
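Theorem 13.1.3 (ii) can be checked by brute force: for a small population, enumerate all equally likely SRS's and compare the exact variance of $\bar{y}$ with the formula. A sketch in Python (the toy population is my own choice, not from the notes):

```python
# Exhaustive check of Var(ybar) = (1/n)((N-n)/(N-1)) sigma^2 for an SRS.
from itertools import combinations

pop = [1, 2, 3, 4, 5]
N, n = len(pop), 2
mu = sum(pop) / N
sigma2 = sum((y - mu) ** 2 for y in pop) / N        # population variance

# all C(N,n) samples are equally likely under simple random sampling
means = [sum(s) / n for s in combinations(pop, n)]
var_enum = sum((m - mu) ** 2 for m in means) / len(means)
var_formula = (1 / n) * (N - n) / (N - 1) * sigma2
print(var_enum, var_formula)
```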
Theorem 13.1.4:
Let $\bar{y}_n$ be the sample mean of a SRS of size $n$. Then it holds that
$$\sqrt{\frac{n}{1-f}} \; \frac{\bar{y}_n - \mu}{\sqrt{\frac{N}{N-1}} \, \sigma} \stackrel{d}{\longrightarrow} N(0,1),$$
where $N \to \infty$ and $f = \frac{n}{N}$ is a constant.
In particular, when the $y_i$'s are 0-1-distributed with $E(y_i) = P(y_i = 1) = p \;\forall i$, then it holds that
$$\sqrt{\frac{n}{1-f}} \; \frac{\bar{y}_n - p}{\sqrt{\frac{N}{N-1} \, p(1-p)}} \stackrel{d}{\longrightarrow} N(0,1),$$
where $N \to \infty$ and $f = \frac{n}{N}$ is a constant.
148
13.2 Stratified Random Samples

Definition 13.2.1:
Let $\Omega$ be a population of size $N$ that is split into $m$ disjoint sets $\Omega_j$, called strata, of size $N_j$, $j = 1, \ldots, m$, where $N = \sum_{j=1}^m N_j$. If we independently draw a random sample of size $n_j$ in each stratum, we speak of a stratified random sample.

Note:
(i) The random samples in each stratum are not always SRS's.
(ii) Stratified random samples are used in practice as a means to reduce the sample variance in the case that data within each stratum are homogeneous and data among different strata are heterogeneous.
(iii) Frequently used strata in practice are gender, state (or county), income range, ethnic background, etc.
Definition 13.2.2:
Let $Y: \Omega \to I\!R$ be a measurable function. In the case of a stratified random sample, we use the following notation:
Let $Y_{jk}$, $j = 1, \ldots, m$, $k = 1, \ldots, N_j$, be the elements in $\Omega_j$. Then we define
(i) $Y_j = \sum_{k=1}^{N_j} Y_{jk}$, the total in the $j$th stratum,
(ii) $\mu_j = \frac{1}{N_j} Y_j$, the mean in the $j$th stratum,
(iii) $\mu = \frac{1}{N} \sum_{j=1}^m N_j \mu_j$, the expectation (or grand mean),
(iv) $N\mu = \sum_{j=1}^m Y_j = \sum_{j=1}^m \sum_{k=1}^{N_j} Y_{jk}$, the total,
(v) $\sigma_j^2 = \frac{1}{N_j} \sum_{k=1}^{N_j} (Y_{jk} - \mu_j)^2$, the variance in the $j$th stratum, and
(vi) $\sigma^2 = \frac{1}{N} \sum_{j=1}^m \sum_{k=1}^{N_j} (Y_{jk} - \mu)^2$, the variance.
149
(vii) We denote an (ordered) sample in $\Omega_j$ of size $n_j$ as $(y_{j1}, \ldots, y_{jn_j})$ and $\bar{y}_j = \frac{1}{n_j} \sum_{k=1}^{n_j} y_{jk}$, the sample mean in the $j$th stratum.
Theorem 13.2.3:
Let the same conditions hold as in Definitions 13.2.1 and 13.2.2. Let $\hat{\mu}_j$ be an unbiased estimate of $\mu_j$ and $\widehat{Var}(\hat{\mu}_j)$ be an unbiased estimate of $Var(\hat{\mu}_j)$. Then it holds:
(i) $\hat{\mu} = \frac{1}{N} \sum_{j=1}^m N_j \hat{\mu}_j$ is unbiased for $\mu$, and
$$Var(\hat{\mu}) = \frac{1}{N^2} \sum_{j=1}^m N_j^2 \, Var(\hat{\mu}_j).$$
(ii) $\widehat{Var}(\hat{\mu}) = \frac{1}{N^2} \sum_{j=1}^m N_j^2 \, \widehat{Var}(\hat{\mu}_j)$ is unbiased for $Var(\hat{\mu})$.

Proof:
(i)
$$E(\hat{\mu}) = \frac{1}{N} \sum_{j=1}^m N_j E(\hat{\mu}_j) = \frac{1}{N} \sum_{j=1}^m N_j \mu_j = \mu.$$
By independence of the samples within each stratum,
$$Var(\hat{\mu}) = \frac{1}{N^2} \sum_{j=1}^m N_j^2 \, Var(\hat{\mu}_j).$$
(ii)
$$E(\widehat{Var}(\hat{\mu})) = \frac{1}{N^2} \sum_{j=1}^m N_j^2 \, E(\widehat{Var}(\hat{\mu}_j)) = \frac{1}{N^2} \sum_{j=1}^m N_j^2 \, Var(\hat{\mu}_j) = Var(\hat{\mu}).$$
Theorem 13.2.4:
Let the same conditions hold as in Theorem 13.2.3. If we draw a SRS in each stratum, then it holds:
(i) $\hat{\mu} = \frac{1}{N} \sum_{j=1}^m N_j \bar{y}_j$ is unbiased for $\mu$, where $\bar{y}_j = \frac{1}{n_j} \sum_{k=1}^{n_j} y_{jk}$, $j = 1, \ldots, m$, and
$$Var(\hat{\mu}) = \frac{1}{N^2} \sum_{j=1}^m N_j^2 \, \frac{1}{n_j} (1-f_j) \frac{N_j}{N_j - 1} \sigma_j^2, \quad \text{where } f_j = \frac{n_j}{N_j}.$$
150
(ii) $\widehat{Var}(\hat{\mu}) = \frac{1}{N^2} \sum_{j=1}^m N_j^2 \, \frac{1}{n_j} (1-f_j) \, s_j^2$ is unbiased for $Var(\hat{\mu})$, where
$$s_j^2 = \frac{1}{n_j - 1} \sum_{k=1}^{n_j} (y_{jk} - \bar{y}_j)^2.$$
Proof:
For a SRS in the $j$th stratum, it follows by Theorem 13.1.3:
$$E(\bar{y}_j) = \mu_j, \qquad Var(\bar{y}_j) = \frac{1}{n_j} (1-f_j) \frac{N_j}{N_j - 1} \sigma_j^2.$$
Also, we can show that
$$E(s_j^2) = \frac{N_j}{N_j - 1} \sigma_j^2.$$
Now the proof follows directly from Theorem 13.2.3.
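The variance formula of Theorem 13.2.4 (i) can also be checked by enumeration for a toy stratified population (my own example, not from the notes): list every possible stratified sample, compute $\hat{\mu}$ for each, and compare the exact variance with the formula.

```python
# Exhaustive check of Var(muhat) for stratified SRS:
# strata {1,2,3,4} with n_1 = 2 and {10,12} with n_2 = 1.
from itertools import combinations, product

strata = [[1, 2, 3, 4], [10, 12]]
nj = [2, 1]
Nj = [len(s) for s in strata]
N = sum(Nj)

# formula: (1/N^2) sum_j N_j^2 (1/n_j)(1-f_j) N_j/(N_j-1) sigma_j^2
var_formula = 0.0
for s, n_, N_ in zip(strata, nj, Nj):
    mu_j = sum(s) / N_
    sig2_j = sum((y - mu_j) ** 2 for y in s) / N_
    var_formula += N_ ** 2 * (1 / n_) * (1 - n_ / N_) * N_ / (N_ - 1) * sig2_j
var_formula /= N ** 2

# enumeration: all SRS's within each stratum, drawn independently
estimates = []
for pick in product(*[list(combinations(s, n_)) for s, n_ in zip(strata, nj)]):
    muhat = sum(N_ * (sum(sub) / len(sub)) for N_, sub in zip(Nj, pick)) / N
    estimates.append(muhat)
mu = sum(sum(s) for s in strata) / N
var_enum = sum((e - mu) ** 2 for e in estimates) / len(estimates)
print(var_enum, var_formula)
```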
Definition 13.2.5:
Let the same conditions hold as in Definitions 13.2.1 and 13.2.2. If the sample in each stratum is of size $n_j = n \frac{N_j}{N}$, $j = 1, \ldots, m$, where $n$ is the total sample size, then we speak of proportional selection.

Note:
(i) In the case of proportional selection, it holds that $f_j = \frac{n_j}{N_j} = \frac{n}{N} = f$, $j = 1, \ldots, m$.
(ii) Proportional strata cannot always be obtained for each combination of $m$, $n$, and $N$.

Theorem 13.2.6:
Let the same conditions hold as in Definition 13.2.5. If we draw a SRS in each stratum, then it holds in the case of proportional selection that
$$Var(\hat{\mu}) = \frac{1}{N^2} \, \frac{1-f}{f} \sum_{j=1}^m N_j \tilde{\sigma}_j^2,$$
where $\tilde{\sigma}_j^2 = \frac{N_j}{N_j - 1} \sigma_j^2$.
Proof:
The proof follows directly from Theorem 13.2.4 (i).
151
Theorem 13.2.7:
If we draw (1) a stratified random sample that consists of SRS's of sizes $n_j$ under proportional selection and (2) a SRS of size $n = \sum_{j=1}^m n_j$ from the same population, then it holds that
$$Var(\bar{y}) - Var(\hat{\mu}) = \frac{1}{n} \, \frac{N-n}{N(N-1)} \left( \sum_{j=1}^m N_j (\mu_j - \mu)^2 - \frac{1}{N} \sum_{j=1}^m (N - N_j) \tilde{\sigma}_j^2 \right).$$
Proof:
Homework.
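A numerical sanity check of Theorem 13.2.7 (not a proof) for a toy population of my own choosing: strata $\{1,2,3,4\}$ and $\{10,12\}$ with $n = 3$, so proportional selection gives $n_1 = 2$, $n_2 = 1$, $f = \frac{1}{2}$.

```python
# Compare both sides of Theorem 13.2.7 for a concrete population.
strata = [[1, 2, 3, 4], [10, 12]]
Nj = [len(s) for s in strata]
N, n = sum(Nj), 3
pop = [y for s in strata for y in s]
mu = sum(pop) / N
sigma2 = sum((y - mu) ** 2 for y in pop) / N

# left-hand side: Var(ybar) for a SRS of size n minus Var(muhat) stratified
var_srs = (1 / n) * (N - n) / (N - 1) * sigma2
var_strat = 0.0
for s, N_ in zip(strata, Nj):
    n_ = n * N_ // N                          # proportional allocation
    mu_j = sum(s) / N_
    sig2_j = sum((y - mu_j) ** 2 for y in s) / N_
    var_strat += N_ ** 2 * (1 / n_) * (1 - n_ / N_) * N_ / (N_ - 1) * sig2_j
var_strat /= N ** 2
lhs = var_srs - var_strat

# right-hand side of Theorem 13.2.7
between = sum(N_ * (sum(s) / N_ - mu) ** 2 for s, N_ in zip(strata, Nj))
within = sum((N - N_) * (N_ / (N_ - 1))
             * (sum((y - sum(s) / N_) ** 2 for y in s) / N_)
             for s, N_ in zip(strata, Nj))
rhs = (1 / n) * (N - n) / (N * (N - 1)) * (between - within / N)
print(lhs, rhs)
```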
152
Lecture 41:
We 04/25/01
14 Some Results from Sequential Statistical Inference
14.1 Fundamentals of Sequential Sampling
Example 14.1.1:
A particular machine produces a large number of items every day. Each item can be either "defective" or "non-defective". The unknown proportion of defective items in the production of a particular day is $p$.
Let $(X_1, \ldots, X_m)$ be a sample from the daily production, where $x_i = 1$ when the item is defective and $x_i = 0$ when the item is non-defective. Obviously, $S_m = \sum_{i=1}^m X_i \sim Bin(m, p)$ denotes the total number of defective items in the sample (assuming that $m$ is small compared to the daily production).
We might be interested in testing $H_0: p \leq p_0$ vs. $H_1: p > p_0$ at a given significance level $\alpha$ and use this decision to trash the entire daily production and have the machine fixed if indeed $p > p_0$. A suitable test could be
$$\varphi_1(x_1, \ldots, x_m) = \begin{cases} 1, & \text{if } s_m > c \\ 0, & \text{if } s_m \leq c \end{cases}$$
where $c$ is chosen such that $\varphi_1$ is a level-$\alpha$ test.
However, wouldn't it be more beneficial to sample the items sequentially (e.g., take items # 57, 623, 1005, 1286, 2663, etc.) and stop the machine as soon as it becomes obvious that it produces too many bad items? (Alternatively, we could also terminate the time-consuming and expensive process of determining whether an item is defective or non-defective early if it is impossible to surpass a certain proportion of defectives.) For example, if for some $j < m$ it already holds that $s_j > c$, then we could stop (and immediately call maintenance) and reject $H_0$ after only $j$ observations.
More formally, let us define $T = \min\{j \mid S_j > c\}$ and $T' = \min\{T, m\}$. We can now consider a decision rule that stops the sampling process at random time $T'$ and rejects $H_0$ if $T \leq m$. Thus, if we consider $R_0 = \{(x_1, \ldots, x_m) \mid t \leq m\}$ and $R_1 = \{(x_1, \ldots, x_m) \mid s_m > c\}$ as critical regions of two tests $\varphi_0$ and $\varphi_1$, then these two tests are equivalent.
153
Definition 14.1.2:
Let $\Theta$ be the parameter space and $\mathcal{A}$ the set of actions the statistician can take. We assume that the rv's $X_1, X_2, \ldots$ are observed sequentially and iid with common pdf (or pmf) $f_\theta(x)$. A sequential decision procedure is defined as follows:
(i) A stopping rule specifies whether an element of $\mathcal{A}$ should be chosen without taking any further observation. If at least one observation is taken, this rule specifies for every set of observed values $(x_1, x_2, \ldots, x_n)$, $n \geq 1$, whether to stop sampling and choose an action in $\mathcal{A}$ or to take another observation $x_{n+1}$.
(ii) A decision rule specifies the decision to be taken. If no observation has been taken, then we take action $d_0 \in \mathcal{A}$. If $n \geq 1$ observations have been taken, then we take action $d_n(x_1, \ldots, x_n) \in \mathcal{A}$, where $d_n(x_1, \ldots, x_n)$ specifies the action that has to be taken for the set $(x_1, \ldots, x_n)$ of observed values. Once an action has been taken, the sampling process is stopped.
Note:
In the remainder of this chapter, we assume that the statistician takes at least one observation.
Definition 14.1.3:
Let $R_n \subseteq I\!R^n$, $n = 1, 2, \ldots$, be a sequence of Borel-measurable sets such that the sampling process is stopped after observing $X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n$ if $(x_1, \ldots, x_n) \in R_n$. If $(x_1, \ldots, x_n) \notin R_n$, then another observation $x_{n+1}$ is taken. The sets $R_n$, $n = 1, 2, \ldots$, are called stopping regions.
Definition 14.1.4:
With every sequential stopping rule we associate a stopping random variable $N$ which takes on the values $1, 2, 3, \ldots$. Thus, $N$ is a rv that indicates the total number of observations taken before the sampling is stopped.
Note:
We use the (sloppy) notation $\{N = n\}$ to denote the event that sampling is stopped after observing exactly $n$ values $x_1, \ldots, x_n$ (i.e., sampling is not stopped before taking $n$ samples). Then the following equalities hold:
$$\{N = 1\} = R_1$$
154
$$\{N = n\} = \{(x_1, \ldots, x_n) \in I\!R^n \mid \text{sampling is stopped after } n \text{ observations but not before}\}$$
$$= (R_1 \cup R_2 \cup \ldots \cup R_{n-1})^c \cap R_n = R_1^c \cap R_2^c \cap \ldots \cap R_{n-1}^c \cap R_n$$
$$\{N \leq n\} = \bigcup_{k=1}^n \{N = k\}$$
Here we will only consider closed sequential sampling procedures, i.e., procedures where sampling eventually stops with probability 1, i.e.,
$$P(N < \infty) = 1, \qquad P(N = \infty) = 1 - P(N < \infty) = 0.$$
Theorem 14.1.5: Wald's Equation
Let $X_1, X_2, \ldots$ be iid rv's with $E(|X_1|) < \infty$. Let $N$ be a stopping variable. Let $S_N = \sum_{k=1}^N X_k$. If $E(N) < \infty$, then it holds that
$$E(S_N) = E(X_1) E(N).$$
Proof:
Define a sequence of rv's $Y_i$, $i = 1, 2, \ldots$, where
$$Y_i = \begin{cases} 1, & \text{if no decision is reached up to the } (i-1)\text{th stage, i.e., } N > i-1 \\ 0, & \text{otherwise} \end{cases}$$
Then each $Y_i$ is a function of $X_1, X_2, \ldots, X_{i-1}$ only, and $Y_i$ is independent of $X_i$.
Consider the rv $\sum_{n=1}^\infty X_n Y_n$. Obviously, it holds that
$$S_N = \sum_{n=1}^\infty X_n Y_n.$$
Thus, it follows that
$$E(S_N) = E\left( \sum_{n=1}^\infty X_n Y_n \right). \qquad (*)$$
It holds that
$$\sum_{n=1}^\infty E(|X_n Y_n|) = \sum_{n=1}^\infty E(|X_n|) E(|Y_n|) = E(|X_1|) \sum_{n=1}^\infty P(N \geq n)$$
155
$$= E(|X_1|) \sum_{n=1}^\infty \sum_{k=n}^\infty P(N = k) \stackrel{(A)}{=} E(|X_1|) \sum_{n=1}^\infty n \, P(N = n) = E(|X_1|) E(N) < \infty.$$
(A) holds due to the following rearrangement of indices:

n | k
1 | 1, 2, 3, ...
2 | 2, 3, ...
3 | 3, ...
Lecture 42:
Fr 04/27/01
We may therefore interchange the expectation and summation signs in $(*)$ and get
$$E(S_N) = E\left( \sum_{n=1}^\infty X_n Y_n \right) = \sum_{n=1}^\infty E(X_n Y_n) = \sum_{n=1}^\infty E(X_n) E(Y_n) = E(X_1) \sum_{n=1}^\infty P(N \geq n) = E(X_1) E(N),$$
which completes the proof.
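Wald's equation can be illustrated by simulation; the setup below is my own choice (not from the notes): $X_i \sim Exp(1)$ and the stopping rule "stop when the running sum first reaches 5", which is a valid stopping variable since it depends only on past observations.

```python
# Monte Carlo illustration of Wald's equation E(S_N) = E(X_1) E(N).
import random

random.seed(42)
SN_vals, N_vals = [], []
for _ in range(20000):
    s, k = 0.0, 0
    while s < 5.0:                  # stopping rule uses only observed values
        s += random.expovariate(1.0)
        k += 1
    SN_vals.append(s)
    N_vals.append(k)

mean_SN = sum(SN_vals) / len(SN_vals)
mean_N = sum(N_vals) / len(N_vals)
# E(X_1) = 1, so mean_SN should be close to 1 * mean_N
print(mean_SN, mean_N)
```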
156
14.2 Sequential Probability Ratio Tests
Definition 14.2.1:
Let $X_1, X_2, \ldots$ be a sequence of iid rv's with common pdf (or pmf) $f_\theta(x)$. We want to test a simple hypothesis $H_0: X \sim f_{\theta_0}$ vs. a simple alternative $H_1: X \sim f_{\theta_1}$ when the observations are taken sequentially.
Let $f_{0n}$ and $f_{1n}$ denote the joint pdf's (or pmf's) of $X_1, \ldots, X_n$ under $H_0$ and $H_1$, respectively, i.e.,
$$f_{0n}(x_1, \ldots, x_n) = \prod_{i=1}^n f_{\theta_0}(x_i) \quad \text{and} \quad f_{1n}(x_1, \ldots, x_n) = \prod_{i=1}^n f_{\theta_1}(x_i).$$
Finally, let
$$\lambda_n(x_1, \ldots, x_n) = \frac{f_{1n}(\underline{x})}{f_{0n}(\underline{x})},$$
where $\underline{x} = (x_1, \ldots, x_n)$. Then a sequential probability ratio test (SPRT) for testing $H_0$ vs. $H_1$ is the following decision rule:
(i) If at any stage of the sampling process it holds that $\lambda_n(\underline{x}) \geq A$, then stop and reject $H_0$.
(ii) If at any stage of the sampling process it holds that $\lambda_n(\underline{x}) \leq B$, then stop and accept $H_0$, i.e., reject $H_1$.
(iii) If $B < \lambda_n(\underline{x}) < A$, then continue sampling by taking another observation $x_{n+1}$.

Note:
(i) It is usually convenient to define
$$Z_i = \log \frac{f_{\theta_1}(X_i)}{f_{\theta_0}(X_i)},$$
where $Z_1, Z_2, \ldots$ are iid rv's. Then we work with
$$\log \lambda_n(\underline{x}) = \sum_{i=1}^n z_i = \sum_{i=1}^n \left( \log f_{\theta_1}(x_i) - \log f_{\theta_0}(x_i) \right)$$
instead of using $\lambda_n(\underline{x})$. Obviously, we now have to use constants $b = \log B$ and $a = \log A$ instead of the original constants $B$ and $A$.
157
(ii) $A$ and $B$ (where $A > B$) are constants such that the SPRT will have strength $(\alpha, \beta)$, where
$$\alpha = P(\text{Type I error}) = P(\text{Reject } H_0 \mid H_0)$$
and
$$\beta = P(\text{Type II error}) = P(\text{Accept } H_0 \mid H_1).$$
If $N$ is the stopping rv, then
$$\alpha = P_{\theta_0}(\lambda_N(\underline{X}) \geq A) \quad \text{and} \quad \beta = P_{\theta_1}(\lambda_N(\underline{X}) \leq B).$$
Example 14.2.2:
Let $X_1, X_2, \ldots$ be iid $N(\mu, \sigma^2)$, where $\mu$ is unknown and $\sigma^2 > 0$ is known. We want to test $H_0: \mu = \mu_0$ vs. $H_1: \mu = \mu_1$, where $\mu_0 < \mu_1$.
If our data are sampled sequentially, we can construct a SPRT as follows:
$$\log \lambda_n(\underline{x}) = \sum_{i=1}^n \left( -\frac{1}{2\sigma^2}(x_i - \mu_1)^2 - \left( -\frac{1}{2\sigma^2}(x_i - \mu_0)^2 \right) \right)$$
$$= \frac{1}{2\sigma^2} \sum_{i=1}^n \left( (x_i - \mu_0)^2 - (x_i - \mu_1)^2 \right)$$
$$= \frac{1}{2\sigma^2} \sum_{i=1}^n \left( x_i^2 - 2x_i\mu_0 + \mu_0^2 - x_i^2 + 2x_i\mu_1 - \mu_1^2 \right)$$
$$= \frac{1}{2\sigma^2} \sum_{i=1}^n \left( -2x_i\mu_0 + \mu_0^2 + 2x_i\mu_1 - \mu_1^2 \right)$$
$$= \frac{1}{2\sigma^2} \left( \sum_{i=1}^n 2x_i(\mu_1 - \mu_0) + n(\mu_0^2 - \mu_1^2) \right)$$
$$= \frac{\mu_1 - \mu_0}{\sigma^2} \left( \sum_{i=1}^n x_i - n \, \frac{\mu_0 + \mu_1}{2} \right)$$
We decide for $H_0$ if
$$\log \lambda_n(\underline{x}) \leq b \iff \frac{\mu_1 - \mu_0}{\sigma^2} \left( \sum_{i=1}^n x_i - n \, \frac{\mu_0 + \mu_1}{2} \right) \leq b \iff \sum_{i=1}^n x_i \leq n \, \frac{\mu_0 + \mu_1}{2} + b^*,$$
where $b^* = \frac{\sigma^2}{\mu_1 - \mu_0} \, b$.
158
We decide for $H_1$ if
$$\log \lambda_n(\underline{x}) \geq a \iff \frac{\mu_1 - \mu_0}{\sigma^2} \left( \sum_{i=1}^n x_i - n \, \frac{\mu_0 + \mu_1}{2} \right) \geq a \iff \sum_{i=1}^n x_i \geq n \, \frac{\mu_0 + \mu_1}{2} + a^*,$$
where $a^* = \frac{\sigma^2}{\mu_1 - \mu_0} \, a$.
[Figure (Example 14.2.2): plot of $\sum x_i$ against $n$ with the two parallel decision boundaries $n\frac{\mu_0+\mu_1}{2} + b^*$ and $n\frac{\mu_0+\mu_1}{2} + a^*$; below the lower line we accept $H_0$, above the upper line we accept $H_1$, and between them we continue sampling.]
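The SPRT of Example 14.2.2 is simple to implement; a sketch in Python with illustrative values of my own choosing ($\mu_0 = 0$, $\mu_1 = 1$, $\sigma^2 = 1$, $\alpha = \beta = 0.05$), using the Wald bounds of Theorem 14.2.4 (so $A_0 = 19$ and $B_0 = 1/19$):

```python
# Normal-mean SPRT using the derived form of log lambda_n.
import math
import random

mu0, mu1, sigma2 = 0.0, 1.0, 1.0
alpha, beta = 0.05, 0.05
a = math.log((1 - beta) / alpha)     # a = log A_0 = log 19
b = math.log(beta / (1 - alpha))     # b = log B_0 = log(1/19)

def sprt(xs):
    """Return ('H0' or 'H1', n used), or (None, n) if the data run out."""
    s = 0.0
    for n, x in enumerate(xs, start=1):
        s += x
        log_lambda = (mu1 - mu0) / sigma2 * (s - n * (mu0 + mu1) / 2)
        if log_lambda >= a:
            return "H1", n
        if log_lambda <= b:
            return "H0", n
    return None, len(xs)

random.seed(1)
data = [random.gauss(mu1, 1.0) for _ in range(200)]   # data generated under H1
decision, n_used = sprt(data)
print(decision, n_used)
```

Typically the SPRT decides after far fewer observations than a comparable fixed-sample test.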
Theorem 14.2.3:
For a SPRT with stopping bounds $A$ and $B$, $A > B$, and strength $(\alpha, \beta)$, we have
$$A \leq \frac{1-\beta}{\alpha} \quad \text{and} \quad B \geq \frac{\beta}{1-\alpha},$$
where $0 < \alpha < 1$ and $0 < \beta < 1$.

Theorem 14.2.4:
Assume we select, for given $\alpha, \beta \in (0,1)$ with $\alpha + \beta \leq 1$, the stopping bounds
$$A_0 = \frac{1-\beta}{\alpha} \quad \text{and} \quad B_0 = \frac{\beta}{1-\alpha}.$$
Then it holds that the SPRT with stopping bounds $A_0$ and $B_0$ has strength $(\alpha', \beta')$, where
$$\alpha' \leq \frac{\alpha}{1-\beta}, \quad \beta' \leq \frac{\beta}{1-\alpha}, \quad \text{and} \quad \alpha' + \beta' \leq \alpha + \beta.$$
159
Note:
(i) The approximation $A_0 = \frac{1-\beta}{\alpha}$ and $B_0 = \frac{\beta}{1-\alpha}$ in Theorem 14.2.4 is called the Wald approximation for the optimal stopping bounds of a SPRT.
(ii) $A_0$ and $B_0$ are functions of $\alpha$ and $\beta$ only and do not depend on the pdf's (or pmf's) $f_{\theta_0}$ and $f_{\theta_1}$. Therefore, they can be computed once and for all $f_{\theta_i}$'s, $i = 0, 1$.
THE END !!!
160
Index
α-similar, 88
0-1 Loss, 107
A Posteriori Distribution, 66
A Priori Distribution, 66
Action, 63
Alternative Hypothesis, 71
Asymptotically (Most) Efficient, 53
Bayes Estimate, 67
Bayes Risk, 66
Bayes Rule, 67
Bayesian Confidence Interval, 126
Bayesian Confidence Set, 126
Bias, 38
Borel-Cantelli Lemma, 3
Cauchy Criterion, 5
Centering Constants, 1
Central Limit Theorem, Lindeberg, 15
Central Limit Theorem, Lindeberg-Lévy, 12
Chapman, Robbins, Kiefer Inequality, 51
CI, 113
Closed, 155
Complete, 34, 129
Complete in Relation to P, 129
Composite, 71
Confidence Bound, Lower, 113
Confidence Bound, Upper, 113
Confidence Coefficient, 112
Confidence Interval, 113
Confidence Level, 112
Confidence Sets, 112
Consistent in the rth Mean, 27
Consistent, Mean-Squared-Error, 39
Consistent, Strongly, 27
Consistent, Weakly, 27
Continuity Theorem, 11
Contradiction, Proof by, 41
Convergence-Equivalent, 5
Cramér-Rao Lower Bound, 47, 48
Credible Set, 126
Critical Region, 72
CRK Inequality, 51
CRLB, 47, 48
Decision Function, 63
Decision Rule, 154
Degree, 130
Distribution, A Posteriori, 66
Distribution, Population, 18
Distribution-Free, 129
Domain of Attraction, 14
Efficiency, 53
Efficient, Asymptotically (Most), 53
Efficient, More, 53
Efficient, Most, 53
Empirical CDF, 18
Empirical Cumulative Distribution Function, 18
Equivalence Lemma, 5
Error, Type I, 72
Error, Type II, 72
Estimable, 130
Estimable Function, 38
Estimate, Bayes, 67
Estimate, Maximum Likelihood, 57
Estimate, Method of Moments, 55
Estimate, Minimax, 64
Estimate, Point, 26
Estimator, 26
Estimator, Mann-Whitney, 134
Estimator, Wilcoxon 2-Sample, 134
Exponential Family, One-Parameter, 35
F-Test, 105
Factorization Criterion, 32
Family of CDF's, 26
Family of Confidence Sets, 112
Family of PDF's, 26
Family of PMF's, 26
Family of Random Sets, 112
Formal Invariance, 91
Generalized U-Statistic, 134
Glivenko-Cantelli Theorem, 19
Hypothesis, Alternative, 71
Hypothesis, Null, 71
Independence of X̄ and S², 23
Induced Function, 28
Inequality, Kolmogorov's, 4
Interval, Random, 112
Invariance, Measurement, 91
Invariant, 28, 90
Invariant Test, 91
Invariant, Location, 29
Invariant, Maximal, 92
Invariant, Permutation, 29
Invariant, Scale, 29
K-S Statistic, 135
K-S Test, 139
Kernel, 130
Kernel, Symmetric, 130
Kolmogorov's Inequality, 4
161
Kolmogorov's SLLN, 7
Kolmogorov-Smirnov Statistic, 135
Kolmogorov-Smirnov Test, 139
Kronecker's Lemma, 4
Landau Symbols O and o, 13
Lehmann-Scheffé, 45
Level of Significance, 73
Level-α-Test, 73
Likelihood Function, 57
Likelihood Ratio Test, 95
Likelihood Ratio Test Statistic, 95
Lindeberg Central Limit Theorem, 15
Lindeberg Condition, 15
Lindeberg-Lévy Central Limit Theorem, 12
LMVUE, 40
Locally Minimum Variance Unbiased Estimate, 40
Location Invariant, 29
Logic, 41
Loss Function, 63
Lower Confidence Bound, 113
LRT, 95
Mann-Whitney Estimator, 134
Maximal Invariant, 92
Maximum Likelihood Estimate, 57
Mean Square Error, 39
Mean-Squared-Error Consistent, 39
Measurement Invariance, 91
Method of Moments Estimate, 55
Minimax Estimate, 64
Minimax Principle, 64
MLE, 57
MLR, 81
MOM, 55
Monotone Likelihood Ratio, 81
More Efficient, 53
Most Efficient, 53
Most Powerful Test, 73
MP, 73
MSE-Consistent, 39
Neyman-Pearson Lemma, 76
Nonparametric, 129
Nonrandomized Test, 73
Normal Variance Tests, 100
Norming Constants, 1
NP Lemma, 76
Null Hypothesis, 71
One Sample t-Test, 103
One-Tailed t-Test, 103
Paired t-Test, 104
Parameter Space, 26
Parametric Hypothesis, 71
Permutation Invariant, 29
Pivot, 116
Point Estimate, 26
Point Estimation, 26
Population Distribution, 18
Posterior Distribution, 66
Power, 73
Power Function, 73
Prior Distribution, 66
Probability Integral Transformation, 136
Probability Ratio Test, Sequential, 157
Problem of Fit, 135
Proof by Contradiction, 41
Proportional Selection, 151
Random Interval, 112
Random Sample, 18
Random Sets, 112
Random Variable, Stopping, 154
Randomized Test, 73
Rao-Blackwell, 44
Rao-Blackwellization, 45
Realization, 18
Regularity Conditions, 48
Risk Function, 63
Risk, Bayes, 66
Sample, 18
Sample Central Moment of Order k, 19
Sample Mean, 18
Sample Moment of Order k, 19
Sample Statistic, 18
Sample Variance, 18
Scale Invariant, 29
Selection, Proportional, 151
Sequential Decision Procedure, 154
Sequential Probability Ratio Test, 157
Significance Level, 73
Similar, 88
Similar, α, 88
Simple, 71, 146
Simple Random Sample, 146
Size, 73
SPRT, 157
SRS, 146
Stable, 14
Statistic, 18
Statistic, Kolmogorov-Smirnov, 135
Statistic, Likelihood Ratio Test, 95
Stopping Random Variable, 154
Stopping Regions, 154
Stopping Rule, 154
Strata, 149
Stratified Random Sample, 149
162
Strong Law of Large Numbers, Kolmogorov's, 7
Strongly Consistent, 27
Sufficient, 30, 129
Symmetric Kernel, 130
t-Test, 103
Tail-Equivalent, 5
Taylor Series, 13
Test Function, 73
Test, Invariant, 91
Test, Kolmogorov-Smirnov, 139
Test, Likelihood Ratio, 95
Test, Most Powerful, 73
Test, Nonrandomized, 73
Test, Randomized, 73
Test, Uniformly Most Powerful, 73
Tolerance Coefficient, 142
Tolerance Interval, 142
Two-Sample t-Test, 103
Two-Tailed t-Test, 103
Type I Error, 72
Type II Error, 72
U-Statistic, 131
U-Statistic, Generalized, 134
UMA, 113
UMAU, 123
UMP, 73
UMP α-similar, 89
UMP Invariant, 93
UMP Unbiased, 85
UMPU, 85
UMVUE, 40
Unbiased, 38, 85, 123
Uniformly Minimum Variance Unbiased Estimate, 40
Uniformly Most Accurate, 113
Uniformly Most Accurate Unbiased, 123
Uniformly Most Powerful Test, 73
Unimodal, 117
Upper Confidence Bound, 113
Wald's Equation, 155
Wald Approximation, 160
Weakly Consistent, 27
Wilcoxon 2-Sample Estimator, 134
163