Lecture 7: Introduction to Statistical Decision Theory

I-Hsiang Wang
Department of Electrical Engineering
National Taiwan University
[email protected]

December 20, 2016


Source: homepage.ntu.edu.tw/.../Slides/IT_Lecture_07_v1.pdf


In the rest of this course we switch gears to the interplay between information theory and statistics.

1 In this lecture, we will introduce the basic elements of statistical decision theory:
   how to make decisions from data samples collected under a statistical model;
   how to evaluate decision-making algorithms (decision rules) under a statistical model.

It also serves as an overview of the contents covered in the follow-up lectures.

2 In the follow-up lectures, we will go into the details of several topics, including:
   Hypothesis testing: large-sample asymptotic performance limits
   Point estimation: Bayes vs. Minimax, lower-bounding techniques, high-dimensional problems, etc.

Alongside, we will introduce tools and techniques for investigating the asymptotic performance of several statistical problems, and show their interplay with information theory.

Tools from probability theory: large deviations, concentration inequalities, etc.
Elements from information theory: information measures, lower-bounding techniques, etc.


Overview of this Lecture

In this lecture, the goal is to establish the basics of statistical decision theory.

1 We will begin by setting up the framework of statistical decision theory, including:
   Statistical experiment: parameter space, data samples, statistical model
   Decision rule: deterministic vs. randomized
   Performance evaluation: loss function, risk, minimax vs. Bayes

2 Next, we will introduce two basic statistical decision-making problems:
   Hypothesis testing
   Point estimation


Statistical Model and Decision Making

1 Statistical Model and Decision Making
   Basic Framework
   Examples
   Paradigms

2 Hypothesis Testing
   Basics

3 Estimation
   Mean-Squared Error (MSE) and Cramér-Rao Lower Bound
   Maximum Likelihood Estimator, Consistency, and Efficiency


Statistical Model and Decision Making Basic Framework

1 Statistical Model and Decision Making
   Basic Framework
   Examples
   Paradigms

2 Hypothesis Testing
   Basics

3 Estimation
   Mean-Squared Error (MSE) and Cramér-Rao Lower Bound
   Maximum Likelihood Estimator, Consistency, and Efficiency


Statistical Model and Decision Making Basic Framework

[Figure: a statistical experiment with parameter θ ∈ Θ generates data X ∼ Pθ(·); a decision rule maps X to the inferred result T̂ = τ(X).]


Statistical Model and Decision Making Basic Framework


Statistical Experiment

Statistical Model: a collection of data-generating distributions P ≜ {Pθ | θ ∈ Θ}, where
▶ Θ is called the parameter space; it can be finite, countably infinite, or uncountable.
▶ Pθ(·) is a probability distribution that accounts for the implicit randomness in experiments, sampling, or making observations.

Data (Sample/Outcome/Observation): X is generated by a random draw from Pθ, that is, X ∼ Pθ.
▶ X can be a random variable, vector, matrix, process, etc.


Statistical Model and Decision Making Basic Framework


Inference Task

Objective: T(θ), a function of the parameter θ. From the data X ∼ Pθ, one would like to infer T(θ).

Decision Rule

Decision rule (deterministic): τ(·) is a function of X; T̂ = τ(X) is the inferred result.
Decision rule (randomized): τ(·, ·) is a function of (X, U), where U is external randomness; T̂ = τ(X, U) is the inferred result.


Statistical Model and Decision Making Basic Framework

[Figure: the statistical experiment generates X ∼ Pθ; the decision rule produces T̂ = τ(X), which is compared with T(θ) via the loss function l(·, ·); averaging over X ∼ Pθ yields the risk Lθ(τ).]

Performance Evaluation: how good is a decision rule τ?

Loss function: l(T(θ), τ(X)) measures how bad the decision rule τ is (at a specific data point X). Note: since X is random, l(T(θ), τ(X)) is also random.

Risk: Lθ(τ) ≜ EX∼Pθ[l(T(θ), τ(X))] measures on average how bad the decision rule τ is when the true parameter is θ.
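As a concrete sketch (the Gaussian model, the rules, and all names below are illustrative assumptions, not from the slides), the risk Lθ(τ) can be approximated by Monte Carlo sampling from Pθ:

```python
import random

def risk(theta, rule, loss, n_samples=100_000, seed=0):
    """Monte Carlo estimate of L_theta(rule) = E_{X ~ P_theta}[loss(T(theta), rule(X))].

    Illustrative assumption: P_theta = N(theta, 1) and T(theta) = theta.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(theta, 1.0)      # draw X ~ P_theta
        total += loss(theta, rule(x))  # accumulate the (random) loss
    return total / n_samples

# Squared-error loss with the identity rule tau(x) = x: the risk is Var(X) = 1.
est = risk(2.0, rule=lambda x: x, loss=lambda t, d: (t - d) ** 2)
```

Here `est` should be close to 1 for any θ, since the identity rule's squared-error risk under N(θ, 1) equals the variance.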


Statistical Model and Decision Making Basic Framework


Performance Evaluation: what if the decision rule τ is randomized?

Loss function becomes l(T(θ), τ(X, U)). Risk becomes Lθ(τ) ≜ EU,X∼Pθ[l(T(θ), τ(X, U))].


Statistical Model and Decision Making Examples

1 Statistical Model and Decision Making
   Basic Framework
   Examples
   Paradigms

2 Hypothesis Testing
   Basics

3 Estimation
   Mean-Squared Error (MSE) and Cramér-Rao Lower Bound
   Maximum Likelihood Estimator, Consistency, and Efficiency


Statistical Model and Decision Making Examples

Sometimes we care about the inferred object itself.


Statistical Model and Decision Making Examples

Example: Decoding

Decoding in channel coding over a DMC is one example that we are familiar with.

Parameter is the message: θ ←→ m
Parameter space is the message set: Θ ←→ {1, 2, . . . , 2^{NR}}
Data is the received sequence: X ←→ Y^N
Statistical model is Encoder + Channel: Pθ(x) ←→ ∏_{i=1}^{N} P_{Y|X}(y_i | x_i(m))
Task is to decode the message: T(θ) ←→ m
Decision rule is the decoding algorithm: τ(X) ←→ dec(Y^N)
Loss function is the 0-1 loss: l(T(θ), τ(x)) ←→ 1{m ≠ dec(y^N)}
Risk is the decoding error probability: Lθ(τ) ←→ λ_{m,dec} ≜ P{m ≠ dec(Y^N) | m is sent}


Statistical Model and Decision Making Examples

Example: Hypothesis Testing

Decoding in channel coding belongs to a more general class of problems called hypothesis testing.

Parameter space is a finite set: |Θ| < ∞
Task is to infer the parameter θ: T(θ) = θ
Loss function is the 0-1 loss: l(T(θ), τ(x)) = 1{θ ≠ τ(x)}
Risk is the probability of error: Lθ(τ) = P_{X∼Pθ}{θ ≠ τ(X)}


Statistical Model and Decision Making Examples

Example: Density Estimation

Estimate the probability density function from the collected samples.

Parameter space is a (huge) set of density functions: Θ ←→ F = {f : R → [0, +∞) which is concave/continuous/Lipschitz continuous/etc.}
Data is the observed i.i.d. sequence: X ←→ X^n, with Xi i.i.d. ∼ f(·)
Task is to infer the density function f(·): T(θ) ←→ f
Decision rule is the density estimator: τ(X) ←→ f̂_{X^n}(·)
Loss function is some sort of divergence: l(T(θ), τ(x)) ←→ D(f ∥ f̂_{x^n})
Risk is the expected loss: Lθ(τ) ←→ E_{X^n ∼ f^{⊗n}}[D(f ∥ f̂_{X^n})]


Statistical Model and Decision Making Examples

Sometimes we care about the utility of the inferred object.


Statistical Model and Decision Making Examples

Example: Classification/Prediction

A basic problem in learning is to train a classifier that predicts the category of a new object.

Parameter space is a collection of labelings: Θ ←→ H = {h : X → [1 : K]}
Data is the training data set: X ←→ (X^n, Y^n), with labels Y_i ∈ [1 : K]
Statistical model is the noisy labeling: Pθ(x) ←→ ∏_{i=1}^{n} P_X(x_i) P_{Y|h(X)}(y_i | h(x_i))
Task is to infer the true labeling h ∈ H: T(θ) ←→ h(·), τ(X) ←→ ĥ_{X^n,Y^n}(·)
Loss function is the prediction error probability: l(T, τ(x)) ←→ E_{X∼P_X}[1{h(X) ≠ ĥ(X)}]
(Note: this is still random, as ĥ depends on the randomly drawn training data (X, Y)^n.)
Risk is the averaged loss over training: Lθ(τ) ←→ E_{(X^n,Y^n)∼(P_X P_{Y|h(X)})^{⊗n}}[E_{X∼P_X}[1{h(X) ≠ ĥ(X)}]]


Statistical Model and Decision Making Examples

Example: Regression

Another example that we are very familiar with is regression under mean squared error.

Parameter space is a collection of functions: Θ ←→ F = {f : R^p → R}
Data is the training data: X ←→ (X^n, Y^n)
Statistical model is the noisy observation: Pθ(x) ←→ ∏_{i=1}^{n} P_X(x_i) P_Z(y_i − f(x_i))
(Y = f(X) + Z, where Z is the observation noise)
Task is to infer f ∈ F: T(θ) ←→ f, τ(X) ←→ f̂_{X^n,Y^n}(·)
Loss function is the mean squared error: l(T, τ(x)) ←→ E_{(X,Y)∼P_f}[(Y − f̂(X))^2]
(Note: this is still random, as f̂ depends on the randomly drawn training data (X, Y)^n.)
Risk is the averaged loss over training: Lθ(τ) ←→ E_{(X^n,Y^n)∼P_f^{⊗n}}[E_{(X,Y)∼P_f}[(Y − f̂(X))^2]]


Statistical Model and Decision Making Examples

Example: Linear Regression

For linear regression, the underlying model is linear: f (x) = β⊺x+ γ.

Parameter is (β, γ) ∈ R^p × R: θ ←→ (β, γ), Θ ←→ R^p × R
Data is the training data: X ←→ (X^n, Y^n)
Statistical model is the noisy linear model: Pθ(x) ←→ ∏_{i=1}^{n} P_X(x_i) P_Z(y_i − γ − β⊺x_i)
Task is to infer (β, γ): T(θ) ←→ (β, γ), τ(X) ←→ (β̂, γ̂)_{X^n,Y^n}
Loss function is the mean squared error: l(T, τ(x)) ←→ E_{(X,Y)∼P_{β,γ}}[(Y − β̂⊺X − γ̂)^2]
(Note: this is still random, as (β̂, γ̂) depends on the randomly drawn training data (X, Y)^n.)
Risk is the averaged loss over random training data: Lθ(τ) ←→ E_{(X^n,Y^n)∼P_{β,γ}^{⊗n}}[E_{(X,Y)∼P_{β,γ}}[(Y − β̂⊺X − γ̂)^2]]


Statistical Model and Decision Making Paradigms

1 Statistical Model and Decision Making
   Basic Framework
   Examples
   Paradigms

2 Hypothesis Testing
   Basics

3 Estimation
   Mean-Squared Error (MSE) and Cramér-Rao Lower Bound
   Maximum Likelihood Estimator, Consistency, and Efficiency


Statistical Model and Decision Making Paradigms

How to Determine the Best Estimator?

Recall: the risk of a decision rule τ is Lθ(τ), which is defined according to the underlying statistical model, the task, and the loss function. Note that the risk depends on the hidden parameter θ.

If we can find a decision rule τ∗ such that for every other decision rule τ,

    Lθ(τ∗) ≤ Lθ(τ)  ∀ θ ∈ Θ,

then τ∗ is obviously the best.

Unfortunately, this is unlikely to happen. For example, say the loss function is the ℓ2-norm between θ and θ̂ = τ(x). Then the constant rule τ(x) = θ minimizes the risk Lθ(τ) when the true parameter is θ. This means that no single τ can simultaneously minimize Lθ for all θ ∈ Θ!
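A small numeric sketch of this phenomenon (the Gaussian location model and the two rules are illustrative assumptions): under squared-error loss, the identity rule and the constant rule have risk curves that cross, so neither dominates.

```python
# X ~ N(theta, 1), squared-error loss between theta and tau(x).
def risk_identity(theta):
    # tau1(x) = x: E[(X - theta)^2] = Var(X) = 1 for every theta
    return 1.0

def risk_zero(theta):
    # tau2(x) = 0: loss is (0 - theta)^2 = theta^2, deterministic
    return theta ** 2

# tau2 wins near theta = 0, tau1 wins for |theta| > 1: no uniform winner.
zero_wins_at_origin = risk_zero(0.0) < risk_identity(0.0)
identity_wins_at_two = risk_identity(2.0) < risk_zero(2.0)
```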

There are two main paradigms that resolve this issue: the average-case paradigm (Bayes) and the worst-case paradigm (Minimax).


Statistical Model and Decision Making Paradigms

Bayes Paradigm

In the Bayes paradigm, a prior distribution π(·) over the parameter space Θ is required, and the performance of a decision rule is evaluated by an average-case analysis with respect to π.

Definition 1 (Bayes Risk)

The average risk of a decision rule τ with respect to prior π is defined as Rπ(τ) ≜ EΘ∼π[LΘ(τ)]. The Bayes risk for a prior π is defined as the minimum average risk, that is,

    R∗π ≜ inf_τ Rπ(τ),    (1)

which is attained by a Bayes decision rule τ∗π (which may not be unique).

Note that in information theory, most coding theorems are derived in the Bayes paradigm: we assume random sources and messages. We will give more results later when we talk about hypothesis testing and point estimation.
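For a finite toy problem, Definition 1 can be evaluated exactly (the two distributions below are made-up illustrative numbers): with 0-1 loss and binary Θ, the Bayes rule picks the hypothesis with the larger weighted likelihood at each x, so R∗π = Σx min(π0 P0(x), π1 P1(x)).

```python
# Illustrative toy model: Theta = {0, 1}, X in {0, 1, 2}, 0-1 loss.
P0 = {0: 0.5, 1: 0.4, 2: 0.1}
P1 = {0: 0.1, 1: 0.3, 2: 0.6}

def bayes_risk(pi0):
    # The Bayes (MAP) rule decides, at each x, the hypothesis with the larger
    # weighted likelihood; the minimum average error keeps the smaller term.
    return sum(min(pi0 * P0[x], (1 - pi0) * P1[x]) for x in P0)

r_half = bayes_risk(0.5)  # equals 0.25 for these numbers
```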


Statistical Model and Decision Making Paradigms

Minimax Paradigm

In the Minimax paradigm, the performance of a decision rule is evaluated according to the worst-case risk over the entire parameter space.

Definition 2 (Minimax Risk)

The worst-case risk of a decision rule τ is defined as R(τ) ≜ sup_{θ∈Θ} Lθ(τ). The Minimax risk is defined as the minimum worst-case risk, that is,

    R∗ ≜ inf_τ R(τ) = inf_τ sup_{θ∈Θ} Lθ(τ),    (2)

which is attained by a Minimax decision rule τ∗ (which may not be unique).

A main criticism of the Bayes paradigm is that in many applications there is no prior over the parameter space, because the statistical task is done one-shot. In such cases, the Minimax paradigm provides a conservative but robust evaluation and theoretical guarantees.


Statistical Model and Decision Making Paradigms

Bayes vs. Minimax

A simple yet fundamental relationship between Minimax and Bayes is stated in the theorem below.

Theorem 1 (Minimax Risk ≥ Worst-Case Bayes Risk)

In general, the Minimax risk is not smaller than any Bayes risk, that is,

    R∗ ≥ sup_π R∗π.    (3)

Furthermore, the inequality holds with equality in the following two cases:
1 The parameter space Θ and the data alphabet X are both finite.
2 |Θ| < ∞ and the loss function is bounded from below.

Remark: inequality (3) is useful when deriving lower bounds on the Minimax risk.


Statistical Model and Decision Making Paradigms

pf: Note that R∗ = infτ supθ∈Θ Lθ(τ) and supπ R∗π = supπ infτ EΘ∼π [LΘ(τ)].

For any decision rule τ and prior distribution π, we have

supθ∈Θ Lθ(τ) ≥ EΘ∼π [LΘ(τ)] .

Hence, for any prior distribution π, we have

R∗ = infτ supθ∈Θ Lθ(τ) ≥ infτ EΘ∼π [LΘ(τ)] = R∗π.

Therefore, R∗ ≥ supπ R∗π.

Proving the two sufficient conditions for equality requires some knowledge of convex optimization theory. Essentially, these are two sufficient conditions under which "strong duality" holds.
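On a finite toy problem (the distributions are illustrative assumptions), inequality (3) can be checked by brute force: enumerate all deterministic tests for the minimax side, and sweep a grid of priors for the Bayes side.

```python
from itertools import product

P0 = {0: 0.5, 1: 0.4, 2: 0.1}  # assumed toy distributions, 0-1 loss
P1 = {0: 0.1, 1: 0.3, 2: 0.6}
xs = sorted(P0)

def risks(rule):
    # rule maps x -> decision in {0, 1}; returns (L_0, L_1) under 0-1 loss
    alpha = sum(P0[x] for x in xs if rule[x] == 1)  # error when theta = 0
    beta = sum(P1[x] for x in xs if rule[x] == 0)   # error when theta = 1
    return alpha, beta

# Minimax over deterministic rules: minimize the worse of the two risks.
minimax = min(max(risks(dict(zip(xs, d)))) for d in product([0, 1], repeat=len(xs)))

# Worst-case Bayes risk over a fine grid of priors pi0.
def bayes_risk(pi0):
    return sum(min(pi0 * P0[x], (1 - pi0) * P1[x]) for x in xs)

sup_bayes = max(bayes_risk(i / 1000) for i in range(1001))
gap_ok = minimax >= sup_bayes - 1e-9  # Theorem 1: R* >= sup_pi R*_pi
```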


Statistical Model and Decision Making Paradigms

Deterministic vs. Randomized Decision Rules

Randomization does not always help. In the following we give two such scenarios.

Proposition 1

1 If the loss function τ 7→ l(T, τ) is convex, then randomization does not help.
2 In the Bayes paradigm, there always exists a deterministic decision rule that is Bayes optimal.

pf: First, for any randomized decision rule τ(X, U), by Jensen's inequality we have

    Lθ(τ) ≜ EU,X∼Pθ[l(T(θ), τ(X, U))] ≥ EX∼Pθ[l(T(θ), EU[τ(X, U) | X])] = Lθ(EU[τ(X, U) | X]).

Second, for any randomized decision rule τ(X, U) and prior distribution Θ ∼ π,

    Rπ(τ) = EΘ,U,X[l(T(Θ), τ(X, U))] = EU[Rπ(τ(·, U))] ≥ Rπ(τ(·, u)), for some u.

Hence, the average risk of a randomized decision rule is always lower bounded by that of a deterministic one, namely τ(·, u), where u is chosen so that Rπ(τ(·, u)) is no larger than the average.
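A quick Monte Carlo illustration of the first claim (the Gaussian model and the particular randomized rule are assumptions for demonstration): under the convex squared-error loss, adding independent noise U to a rule can only increase its risk, in line with Jensen's inequality.

```python
import random

rng = random.Random(0)
theta, n = 1.0, 100_000

def sq_loss(t, d):
    # convex loss, so derandomization cannot hurt
    return (t - d) ** 2

# Randomized rule tau(x, u) = x + u with U ~ N(0, 1); its conditional mean
# given X is the deterministic rule tau'(x) = x.
rand_risk = sum(sq_loss(theta, rng.gauss(theta, 1.0) + rng.gauss(0.0, 1.0))
                for _ in range(n)) / n
det_risk = sum(sq_loss(theta, rng.gauss(theta, 1.0)) for _ in range(n)) / n
derandomize_helps = det_risk <= rand_risk
```

Here `rand_risk` concentrates near 2 while `det_risk` concentrates near 1, matching the Jensen bound.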


Statistical Model and Decision Making Paradigms

Parametric vs. Non-Parametric Models

Conventionally, a parametric model refers to the case where the parameter of interest is finite-dimensional, while in a non-parametric model the parameter is infinite-dimensional (e.g., a function).

The parametric framework is useful when one is familiar with certain properties of the data and has a good statistical model for it. The parameter space Θ is then fixed, not scaling with the number of data samples.

In contrast, if such knowledge about the underlying data is not sufficient, the non-parametric framework might be more suitable.

However, this distinction is vague and at times not well-defined.

Parametric: hypothesis testing, mean estimation, covariance matrix estimation, etc.

Non-parametric: density estimation, regression, etc.


Hypothesis Testing

1 Statistical Model and Decision Making
   Basic Framework
   Examples
   Paradigms

2 Hypothesis Testing
   Basics

3 Estimation
   Mean-Squared Error (MSE) and Cramér-Rao Lower Bound
   Maximum Likelihood Estimator, Consistency, and Efficiency


Hypothesis Testing Basics

1 Statistical Model and Decision Making
   Basic Framework
   Examples
   Paradigms

2 Hypothesis Testing
   Basics

3 Estimation
   Mean-Squared Error (MSE) and Cramér-Rao Lower Bound
   Maximum Likelihood Estimator, Consistency, and Efficiency


Hypothesis Testing Basics

Basic Setup

We begin with the simplest setup – binary hypothesis testing:

1 Two hypotheses regarding the observation X, indexed by θ ∈ {0, 1}:

    H0 : X ∼ P0 (null hypothesis, θ = 0)
    H1 : X ∼ P1 (alternative hypothesis, θ = 1)

2 Goal: design a decision-making algorithm ϕ : X → {0, 1}, x 7→ θ̂, to choose one of the two hypotheses based on the observed realization of X, so that a certain risk is minimized.

3 The loss function is the 0-1 loss, which gives rise to two kinds of error probabilities:
    Probability of false alarm (false positive; type I error): αϕ ≡ PFA(ϕ) ≜ P{H1 is chosen | H0}.
    Probability of miss detection (false negative; type II error): βϕ ≡ PMD(ϕ) ≜ P{H0 is chosen | H1}.


Hypothesis Testing Basics

Deterministic Testing Algorithm ≡ Decision Regions

[Figure: the observation space X partitioned into A0(ϕ), the acceptance region of H0, and A1(ϕ), the acceptance region of H1.]

A test ϕ : X → {0, 1} is equivalently characterized by its corresponding acceptance (decision) regions:

    Aθ(ϕ) ≡ ϕ⁻¹(θ) ≜ {x ∈ X : ϕ(x) = θ}, θ = 0, 1.

The two types of error probabilities can be equivalently represented as

    αϕ = ∑_{x∈A1(ϕ)} P0(x) = ∑_{x∈X} ϕ(x) P0(x),
    βϕ = ∑_{x∈A0(ϕ)} P1(x) = ∑_{x∈X} (1 − ϕ(x)) P1(x).

When the context is clear, we often drop the dependency on the test ϕ when dealing with the acceptance regions Aθ.


Hypothesis Testing Basics

Likelihood Ratio Test

Definition 3 (Likelihood Ratio Test)

A (deterministic) likelihood ratio test (LRT) is a test ϕτ, parametrized by a constant τ > 0 (called the threshold), defined as follows:

    ϕτ(x) = 1 if P1(x) > τP0(x),
    ϕτ(x) = 0 if P1(x) ≤ τP0(x).

For x ∈ supp P0, the likelihood ratio is defined as L(x) ≜ P1(x)/P0(x). Hence, an LRT is a thresholding algorithm on the likelihood ratio L(x).

Remark: for computational convenience, one often works with the log-likelihood ratio (LLR) log(L(x)) = log(P1(x)) − log(P0(x)).
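As a sketch (the two Gaussian hypotheses and all parameter values are illustrative assumptions), an LRT between N(0, 1) and N(1, 1) reduces to thresholding x itself, since the LLR is linear in x:

```python
import math

MU0, MU1, SIGMA = 0.0, 1.0, 1.0  # assumed Gaussian hypotheses H0, H1

def llr(x):
    # log P1(x) - log P0(x); the quadratic terms cancel for equal variances
    return ((MU1 - MU0) * x - (MU1 ** 2 - MU0 ** 2) / 2) / SIGMA ** 2

def lrt(x, tau=1.0):
    # decide H1 iff L(x) > tau, i.e. iff LLR(x) > log(tau)
    return 1 if llr(x) > math.log(tau) else 0

# With tau = 1 this is exactly "decide H1 iff x > (MU0 + MU1)/2 = 0.5".
decisions = (lrt(0.2), lrt(0.8))  # -> (0, 1)
```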


Hypothesis Testing Basics

Trade-Off Between α (PFA) and β (PMD)

Theorem 2 (Neyman-Pearson Lemma)

For a likelihood ratio test ϕτ and another deterministic test ϕ, αϕ ≤ αϕτ =⇒ βϕ ≥ βϕτ .

pf: Observe ∀x ∈ X , 0 ≤ (ϕτ (x)− ϕ (x)) (P1 (x)− τP0 (x)) , because

    if P1(x) − τP0(x) > 0 =⇒ ϕτ(x) = 1 =⇒ (ϕτ(x) − ϕ(x)) ≥ 0;
    if P1(x) − τP0(x) ≤ 0 =⇒ ϕτ(x) = 0 =⇒ (ϕτ(x) − ϕ(x)) ≤ 0.

Summing over all x ∈ X , we get

0 ≤ (1− βϕτ )− (1− βϕ)− τ (αϕτ − αϕ) = (βϕ − βϕτ ) + τ (αϕ − αϕτ ) .

Since τ > 0, from above we conclude that αϕ ≤ αϕτ =⇒ βϕ ≥ βϕτ .


Hypothesis Testing Basics

[Figure: trade-off curves between α (PFA) and β (PMD), with both axes ranging over [0, 1].]

Question: What is the optimal trade-off curve? What is the optimal test achieving the curve?


Hypothesis Testing Basics

Randomized Testing Algorithm

Randomized tests include deterministic tests as special cases.

Definition 4 (Randomized Test)

A randomized test decides θ̂ = 1 with probability ϕ(x) and θ̂ = 0 with probability 1 − ϕ(x), where ϕ is a mapping ϕ : X → [0, 1].

Note: A randomized test is characterized by ϕ, as in deterministic tests.

Definition 5 (Randomized LRT)

A randomized likelihood ratio test (LRT) is a test ϕτ,γ, parametrized by constants τ > 0 and γ ∈ (0, 1), defined as follows:

    ϕτ,γ(x) = 1 if P1(x) > τP0(x),
    ϕτ,γ(x) = 0 if P1(x) < τP0(x),
    ϕτ,γ(x) = γ if P1(x) = τP0(x).


Hypothesis Testing Basics

Randomized LRT Achieves the Optimal Trade-Off

Consider the following optimization problem:

Neyman-Pearson Problem

    minimize over ϕ : X → [0, 1]:   βϕ
    subject to   αϕ ≤ α∗

Theorem 3 (Neyman-Pearson)

A randomized LRT ϕτ∗,γ∗ with parameters (τ∗, γ∗) satisfying α∗ = αϕτ∗,γ∗ attains optimality for the Neyman-Pearson problem.


Hypothesis Testing Basics

pf: First, argue that for any α∗ ∈ (0, 1), one can find (τ∗, γ∗) such that

    α∗ = αϕτ∗,γ∗ = ∑_{x∈X} ϕτ∗,γ∗(x) P0(x) = ∑_{x: L(x)>τ∗} P0(x) + ∑_{x: L(x)=τ∗} γ∗ P0(x).

For any test ϕ, by an argument similar to that of Theorem 2, we have

    ∀x ∈ X, (ϕτ∗,γ∗(x) − ϕ(x)) (P1(x) − τ∗P0(x)) ≥ 0.

Summing over all x ∈ X, we similarly get

    (βϕ − βϕτ∗,γ∗) + τ∗ (αϕ − αϕτ∗,γ∗) ≥ 0.

Hence, for any feasible test ϕ with αϕ ≤ α∗ = αϕτ∗,γ∗, its type II error probability satisfies βϕ ≥ βϕτ∗,γ∗.
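The first step of the proof, finding (τ∗, γ∗) for a given α∗, can be sketched for discrete distributions (the distributions and the target level are illustrative assumptions; ties among likelihood ratios are ignored for simplicity):

```python
P0 = {0: 0.5, 1: 0.4, 2: 0.1}  # assumed null distribution
P1 = {0: 0.1, 1: 0.3, 2: 0.6}  # assumed alternative distribution

def np_parameters(alpha_star):
    """Return (tau, gamma) so that the randomized LRT has PFA = alpha_star."""
    # Visit outcomes in decreasing likelihood-ratio order.
    xs = sorted(P0, key=lambda x: P1[x] / P0[x], reverse=True)
    acc = 0.0  # PFA accumulated by outcomes with L(x) strictly above threshold
    for x in xs:
        if acc + P0[x] >= alpha_star:
            tau = P1[x] / P0[x]                 # threshold at this ratio
            gamma = (alpha_star - acc) / P0[x]  # randomize on L(x) = tau
            return tau, gamma
        acc += P0[x]
    return 0.0, 0.0  # alpha_star = 1: always decide H1

tau, gamma = np_parameters(0.05)  # here tau is about 6, gamma about 0.5
```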


Hypothesis Testing Basics

Bayes Risk

Sometimes prior probabilities of the two hypotheses are known:

πθ ≜ P {Hθ is true} , θ = 0, 1, π0 + π1 = 1.

In this case, one can view the index Θ as a (binary) random variable with (prior) distribution P{Θ = θ} = πθ, for θ = 0, 1.

With prior probabilities, it then makes sense to talk about the average probability of error of a test ϕ, or more generally, the average risk:

    Pe(ϕ) ≜ π0 αϕ + π1 βϕ = EΘ,X[1{Θ ≠ ϕ(X)}];    Rπ(ϕ) ≜ EΘ,X[l(Θ, ϕ(X))].

The Bayes hypothesis testing problem is to test the two hypotheses with knowledge of the prior probabilities so that the average probability of error (or, in general, a risk function) is minimized.


Hypothesis Testing Basics

Minimizing Bayes Risk

Consider the following problem of minimizing Bayes risk.

Bayes Problem

    minimize over ϕ : X → [0, 1]:   Rπ(ϕ) ≜ EΘ,X[l(Θ, ϕ(X))]
    with known (π0, π1) and loss function l(θ, θ̂)

Theorem 4 (LRT is an Optimal Bayes Test)

Assume l(0, 0) < l(0, 1) and l(1, 1) < l(1, 0). A deterministic LRT ϕτ∗ with threshold

    τ∗ = ((l(0, 1) − l(0, 0)) π0) / ((l(1, 0) − l(1, 1)) π1)

attains optimality for the Bayes problem.


Hypothesis Testing Basics

pf:

    Rπ(ϕ) = ∑_{x∈X} l(0,0) π0 P0(x) (1 − ϕ(x)) + ∑_{x∈X} l(0,1) π0 P0(x) ϕ(x)
            + ∑_{x∈X} l(1,0) π1 P1(x) (1 − ϕ(x)) + ∑_{x∈X} l(1,1) π1 P1(x) ϕ(x)

          = l(0,0) π0 + ∑_{x∈X} (l(0,1) − l(0,0)) π0 P0(x) ϕ(x)
            + l(1,0) π1 + ∑_{x∈X} (l(1,1) − l(1,0)) π1 P1(x) ϕ(x)

          = ∑_{x∈X} [(l(0,1) − l(0,0)) π0 P0(x) − (l(1,0) − l(1,1)) π1 P1(x)] ϕ(x)   (∗)
            + l(0,0) π0 + l(1,0) π1.

For each x ∈ X, we shall choose ϕ(x) ∈ [0, 1] so that the term (∗) is minimized. It is then clear that we should choose

    ϕ(x) = 1 if (l(0,1) − l(0,0)) π0 P0(x) − (l(1,0) − l(1,1)) π1 P1(x) < 0,
    ϕ(x) = 0 if (l(0,1) − l(0,0)) π0 P0(x) − (l(1,0) − l(1,1)) π1 P1(x) ≥ 0,

which is exactly the deterministic LRT ϕτ∗ with τ∗ = ((l(0,1) − l(0,0)) π0) / ((l(1,0) − l(1,1)) π1).
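The conclusion can be sanity-checked numerically on a finite toy problem (the distributions and prior are illustrative assumptions; 0-1 loss, so l(0,0) = l(1,1) = 0 and l(0,1) = l(1,0) = 1, giving τ∗ = π0/π1):

```python
from itertools import product

P0 = {0: 0.5, 1: 0.4, 2: 0.1}  # assumed distributions
P1 = {0: 0.1, 1: 0.3, 2: 0.6}
pi0, pi1 = 0.4, 0.6            # assumed prior
xs = sorted(P0)

def avg_error(rule):
    # Bayes risk under 0-1 loss: pi0 * alpha + pi1 * beta
    return sum(pi0 * P0[x] * rule[x] + pi1 * P1[x] * (1 - rule[x]) for x in xs)

tau_star = pi0 / pi1  # Theorem 4 specialized to 0-1 loss
lrt_rule = {x: 1 if P1[x] > tau_star * P0[x] else 0 for x in xs}

# Brute force over all deterministic tests on the 3-point alphabet.
best = min(avg_error(dict(zip(xs, d))) for d in product([0, 1], repeat=len(xs)))
lrt_is_optimal = abs(avg_error(lrt_rule) - best) < 1e-12
```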


Hypothesis Testing Basics

Discussions

For binary hypothesis testing problems, the likelihood ratio L(x) ≜ P1(x)/P0(x) is a sufficient statistic.

Moreover, a likelihood ratio test (LRT) is optimal in both the Neyman-Pearson and the Bayes settings.

Extensions include:
    M-ary hypothesis testing
    Minimax risk optimization (with unknown prior)
    Composite hypothesis testing, etc.

Here we do not pursue these directions further at the moment.

We leave some of them to later lectures, including the asymptotic behavior of hypothesis testing, and will discuss its close connection with information theory, in particular, information divergence. Next, we quickly overview the asymptotic behavior of performance in hypothesis testing.


Estimation

1 Statistical Model and Decision Making
   Basic Framework
   Examples
   Paradigms

2 Hypothesis Testing
   Basics

3 Estimation
   Mean-Squared Error (MSE) and Cramér-Rao Lower Bound
   Maximum Likelihood Estimator, Consistency, and Efficiency


Mean-Squared Error (MSE) and Cramér-Rao Lower Bound


Estimator, Bias, Mean Squared Error

Definition 6 (Estimator)

Consider data x randomly generated from X ∼ Pθ, where θ ∈ Θ is an unknown parameter.
An estimator of θ based on the observed x is a mapping ϕ : X → Θ, x ↦ θ̂.
An estimator of a function z(θ) is a mapping ζ : X → z(Θ), x ↦ ẑ.

When Θ ⊆ R or Rⁿ, it is reasonable to consider the following two measures of performance.

Definition 7 (Bias, Mean Squared Error)

For an estimator ϕ (x) of θ,

Biasθ(ϕ) ≜ E_{X∼Pθ}[ϕ(X)] − θ,   MSEθ(ϕ) ≜ E_{X∼Pθ}[|ϕ(X) − θ|²].


Fact 1 (MSE = Variance + (Bias)²)

For an estimator ϕ(x) of θ, MSEθ(ϕ) = VarPθ[ϕ(X)] + (Biasθ(ϕ))².

pf: MSEθ(ϕ) ≜ EPθ[|ϕ(X) − θ|²] = EPθ[(ϕ(X) − EPθ[ϕ(X)] + EPθ[ϕ(X)] − θ)²]
= VarPθ[ϕ(X)] + (Biasθ(ϕ))² + 2 Biasθ(ϕ) EPθ[ϕ(X) − EPθ[ϕ(X)]],
where the last expectation equals 0.

Note: MSE is the risk of an estimator, when the loss function is the squared-error loss.
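Fact 1 can be verified by Monte Carlo; the sketch below (my own example, not from the slides) uses the biased (1/n)-normalized variance estimator of a Gaussian sample, whose bias is −σ²/n.

```python
import random
import statistics

random.seed(0)
mu, sigma, n, trials = 0.0, 2.0, 10, 100_000

# Biased variance estimator: (1/n) * sum (x_i - xbar)^2, with bias -sigma^2/n.
estimates = []
for _ in range(trials):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(xs) / n
    estimates.append(sum((x - xbar) ** 2 for x in xs) / n)

theta = sigma ** 2                                  # parameter being estimated
mean_est = statistics.fmean(estimates)
bias = mean_est - theta
var = statistics.fmean([(e - mean_est) ** 2 for e in estimates])
mse = statistics.fmean([(e - theta) ** 2 for e in estimates])
# The decomposition MSE = Var + Bias^2 holds exactly for the empirical averages too.
```

Note that the identity holds exactly (up to floating-point error) even for the empirical quantities, since it is an algebraic identity for any distribution of the estimates.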

In the following we provide a parameter-dependent lower bound on the MSE of unbiased estimators, namely, the Cramér-Rao inequality.


Lower Bound on MSE of Unbiased Estimators

Below we deal with densities and hence change notation from Pθ to fθ.

Definition 8 (Fisher Information)

The Fisher information of θ is defined as J(θ) ≜ Efθ[(∂/∂θ ln fθ(X))²].

Definition 9 (Unbiased Estimator)

An estimator ϕ is unbiased if Biasθ (ϕ) = 0 for all θ ∈ Θ.

Now we are ready to state the theorem.

Theorem 5 (Cramér-Rao)

For any unbiased estimator ϕ, we have MSEθ(ϕ) ≥ 1/J(θ), ∀ θ ∈ Θ.
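As a concrete sanity check (an illustrative example of mine, not from the slides): for X ∼ N(θ, σ²) with known σ², J(θ) = 1/σ², and the sample mean is unbiased with MSE σ²/n = 1/(nJ(θ)), so it attains the bound with equality. A minimal Monte Carlo sketch:

```python
import random

random.seed(1)
theta, sigma, n, trials = 3.0, 1.5, 20, 100_000

J_single = 1.0 / sigma ** 2     # Fisher information of one N(theta, sigma^2) sample
crlb = 1.0 / (n * J_single)     # Cramér-Rao bound for n i.i.d. samples (J_n = n*J)

# Monte Carlo MSE of the sample mean, an unbiased estimator of theta.
sq_errs = []
for _ in range(trials):
    xbar = sum(random.gauss(theta, sigma) for _ in range(n)) / n
    sq_errs.append((xbar - theta) ** 2)
mse = sum(sq_errs) / trials     # should be close to crlb = sigma^2 / n
```

The sample mean is one of the rare estimators meeting the bound exactly; in general the bound need not be attainable at finite n.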


pf: The proof is essentially an application of the Cauchy-Schwarz inequality.

Let us begin with the observation that J(θ) = Varfθ[sθ(X)], where sθ(X) ≜ ∂/∂θ ln fθ(X) = (1/fθ(X)) ∂/∂θ fθ(X), because

Efθ[sθ(X)] = ∫_{−∞}^{∞} fθ(x) (1/fθ(x)) ∂/∂θ fθ(x) dx = ∫_{−∞}^{∞} ∂/∂θ fθ(x) dx = d/dθ ∫_{−∞}^{∞} fθ(x) dx = 0.

Hence, by the Cauchy-Schwarz inequality, we have

(Covfθ(sθ(X), ϕ(X)))² ≤ Varfθ[sθ(X)] Varfθ[ϕ(X)].

Since Biasθ(ϕ) = 0, we have MSEθ(ϕ) = Varfθ[ϕ(X)], and hence

MSEθ(ϕ) J(θ) ≥ (Covfθ(sθ(X), ϕ(X)))².


It remains to prove that Covfθ(sθ(X), ϕ(X)) = 1:

Covfθ(sθ(X), ϕ(X)) = Efθ[sθ(X) ϕ(X)] − Efθ[sθ(X)] Efθ[ϕ(X)]   (the second term vanishes, since Efθ[sθ(X)] = 0)
= Efθ[sθ(X) ϕ(X)]
= Efθ[(1/fθ(X)) ∂/∂θ fθ(X) · ϕ(X)]
= d/dθ ∫_{−∞}^{∞} fθ(x) ϕ(x) dx = d/dθ Efθ[ϕ(X)] (a)= d/dθ θ = 1,

where (a) holds because ϕ is unbiased. The proof is complete.

Remark: the Cramér-Rao inequality can be extended to vector estimators, biased estimators, estimators of a function of θ, etc.


Extensions of Cramér-Rao Inequality

Below we list some extensions and leave the proofs as exercises.

Exercise 1 (Cramér-Rao Inequality for Unbiased Functional Estimators)

Prove that for any unbiased estimator ζ of z(θ), MSEθ(ζ) ≥ (1/J(θ)) (d z(θ)/dθ)².

Exercise 2 (Cramér-Rao Inequality for Biased Estimators)

Prove that for any estimator ϕ of the parameter θ,

MSEθ(ϕ) ≥ (1/J(θ)) (1 + d Biasθ(ϕ)/dθ)² + (Biasθ(ϕ))².

Exercise 3 (Attainment of Cramér-Rao)

Show that a necessary and sufficient condition for an unbiased estimator ϕ to attain the Cramér-Rao lower bound is that there exists some function g such that for all x,

g(θ) (ϕ(x) − θ) = ∂/∂θ ln fθ(x).


More on Fisher Information

Fisher information plays a key role in the Cramér-Rao lower bound. We make some further remarks about it.

1 J(θ) ≜ Efθ[(sθ(X))²] = Varfθ[sθ(X)], where the score of θ,
  sθ(X) ≜ ∂/∂θ ln fθ(X) = (1/fθ(X)) ∂/∂θ fθ(X), is zero-mean.

2 Suppose Xi i.i.d.∼ fθ. Then for the estimation problem with observation Xⁿ, the Fisher information is Jn(θ) = n J(θ), where J(θ) is the Fisher information when the observation is just X ∼ fθ.

3 For an exponential family {fθ | θ ∈ Θ}, it can be shown that

  J(θ) = −Efθ[∂²/∂θ² ln fθ(X)],

  which makes computation of J(θ) simpler.
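Remarks 1 and 3 are easy to check numerically for a Bernoulli(θ) observation (my own example; Bernoulli is an exponential family, with J(θ) = 1/(θ(1−θ))). The finite-difference helpers `d` and `d2` below are hypothetical names of mine, not lecture notation.

```python
import math

def log_f(theta, x):
    """Log-likelihood of a single Bernoulli(theta) observation x in {0, 1}."""
    return math.log(theta) if x == 1 else math.log(1.0 - theta)

def d(fun, t, h=1e-5):
    """Central finite difference for the first derivative."""
    return (fun(t + h) - fun(t - h)) / (2 * h)

def d2(fun, t, h=1e-4):
    """Central finite difference for the second derivative."""
    return (fun(t + h) - 2 * fun(t) + fun(t - h)) / h ** 2

theta = 0.3
pmf = {1: theta, 0: 1 - theta}

# Form 1: J = E[(d/dtheta ln f)^2], the variance of the zero-mean score.
J_score = sum(pmf[x] * d(lambda t: log_f(t, x), theta) ** 2 for x in (0, 1))

# Form 2: J = -E[d^2/dtheta^2 ln f], valid here since Bernoulli is exponential family.
J_curv = -sum(pmf[x] * d2(lambda t: log_f(t, x), theta) for x in (0, 1))
# Both match the closed form 1 / (theta * (1 - theta)).
```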

Maximum Likelihood Estimator, Consistency, and Efficiency


Maximum Likelihood Estimator

Maximum Likelihood Estimator (MLE) is a widely used estimator.

Definition 10 (Maximum Likelihood Estimator)

The maximum likelihood estimator (MLE) for estimating θ from a randomly drawn X ∼ Pθ is

ϕMLE(x) ≜ arg max_{θ∈Θ} {Pθ(x)}.

Here Pθ (x) is called the likelihood function.

Exercise 4 (MLE of Gaussian with Unknown Mean and Variance)

Consider Xi i.i.d.∼ N(µ, σ²) for i = 1, 2, …, n, where θ ≜ (µ, σ²) denotes the unknown parameter. Let x̄ ≜ (1/n) ∑_{i=1}^n xi. Show that

ϕMLE(xⁿ) = ( x̄ , (1/n) ∑_{i=1}^n (xi − x̄)² ).
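The closed form claimed in Exercise 4 can be sanity-checked numerically on simulated data (a sketch under my own made-up sample, not a proof): the closed-form estimate should dominate the log-likelihood of nearby parameter values.

```python
import math
import random

random.seed(2)
xs = [random.gauss(1.0, 2.0) for _ in range(50)]   # illustrative simulated sample
n = len(xs)

def log_lik(mu, var):
    """Gaussian log-likelihood of the sample xs at parameters (mu, var)."""
    return sum(-0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
               for x in xs)

# Closed-form MLE from Exercise 4: sample mean and (1/n)-normalized sample variance.
mu_hat = sum(xs) / n
var_hat = sum((x - mu_hat) ** 2 for x in xs) / n

# The MLE should be the maximum over a small perturbation grid around it.
best = max(log_lik(mu_hat + dm, var_hat + dv)
           for dm in (-0.1, 0.0, 0.1)
           for dv in (-0.1, 0.0, 0.1))
```

This only checks local optimality on a grid; the exercise asks for the full derivation by setting the partial derivatives of the log-likelihood to zero.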


In the following we consider the observation of n i.i.d. samples Xi i.i.d.∼ Pθ, i = 1, …, n, and give two ways of evaluating the performance of a sequence of estimators {ϕn(xⁿ) | n ∈ N} as n → ∞.

1 Consistency: the estimator output coincides with the true parameter as the sample size n → ∞.

2 Efficiency: the estimator achieves the Cramér-Rao lower bound on the MSE as the sample size n → ∞.

We will see that MLE is not only consistent but also efficient.


Asymptotic Evaluations: Consistency

Definition 11 (Consistency)

A sequence of estimators {ζn(xⁿ) | n ∈ N} is consistent if ∀ ε > 0,

lim_{n→∞} P{|ζn(Xⁿ) − z(θ)| < ε} = 1, ∀ θ ∈ Θ, where Xi i.i.d.∼ Pθ.

In other words, ζn(Xⁿ) → z(θ) in probability, for all θ ∈ Θ.

Theorem 6 (MLE is Consistent)

For a family of densities {fθ | θ ∈ Θ}, under some regularity conditions on fθ(x), the plug-in estimator z(ϕMLE(xⁿ)) is a consistent estimator of z(θ), where z is a continuous function of θ.
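As a quick numerical illustration of consistency (my own example, with a fixed seed): the Bernoulli MLE p̂n, i.e. the empirical frequency of ones, gets closer to the true p as n grows.

```python
import random

random.seed(3)
p = 0.3                               # true parameter

def mle(n):
    """Bernoulli MLE: empirical frequency of ones among n i.i.d. samples."""
    return sum(random.random() < p for _ in range(n)) / n

# Absolute estimation error at increasing sample sizes.
errors = {n: abs(mle(n) - p) for n in (100, 10_000, 1_000_000)}
```

Convergence here is in probability, so a single run is only suggestive; the errors typically scale like 1/√n.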


Asymptotic Evaluations: Efficiency

Definition 12 (Efficiency)

A sequence of estimators {ζn(xⁿ) | n ∈ N} is asymptotically efficient if

√n (ζn(Xⁿ) − z(θ)) → N(0, (1/J(θ)) (d z(θ)/dθ)²) in distribution, as n → ∞.

Theorem 7 (MLE is Asymptotically Efficient)

For a family of densities {fθ | θ ∈ Θ}, under some regularity conditions on fθ(x), the plug-in estimator z(ϕn(xⁿ)), where ϕn is the MLE based on xⁿ, is an asymptotically efficient estimator of z(θ), where z is a continuous function of θ.
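A numerical illustration (my own, with a fixed seed): for Bernoulli(p) with z(θ) = θ, J(p) = 1/(p(1−p)), so efficiency predicts that √n(p̂n − p) has variance about p(1−p); for the Bernoulli MLE (the sample mean) this even holds exactly at every n.

```python
import random
import statistics

random.seed(4)
p, n, trials = 0.3, 200, 20_000

def mle():
    """One draw of the Bernoulli MLE from n i.i.d. samples."""
    return sum(random.random() < p for _ in range(n)) / n

# Empirical distribution of sqrt(n) * (p_hat - p) over many independent runs.
scaled = [(mle() - p) * n ** 0.5 for _ in range(trials)]
emp_var = statistics.pvariance(scaled)   # should be near p*(1-p) = 1/J(p)
```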
