Lecture 7: Introduction to Statistical Decision Theory

I-Hsiang Wang
Department of Electrical Engineering
National Taiwan University
[email protected]

December 20, 2016


Source: homepage.ntu.edu.tw/.../Slides/IT_Lecture_07_v1.pdf


In the rest of this course we switch gears to the interplay between information theory and statistics.

1 In this lecture, we will introduce the basic elements of statistical decision theory:
   how to make decisions from data samples collected under a statistical model;
   how to evaluate decision-making algorithms (decision rules) under a statistical model.

It also serves as an overview of the contents covered in the follow-up lectures.

2 In the follow-up lectures, we will go into the details of several topics, including:
   Hypothesis testing: large-sample asymptotic performance limits
   Point estimation: Bayes vs. Minimax, lower-bounding techniques, high-dimensional problems, etc.

Alongside, we will introduce tools and techniques for investigating the asymptotic performance of several statistical problems, and show their interplay with information theory.

Tools from probability theory: large deviations, concentration inequalities, etc.
Elements from information theory: information measures, lower-bounding techniques, etc.


Overview of this Lecture

In this lecture, the goal is to establish the basics of statistical decision theory.

1 We will begin by setting up the framework of statistical decision theory, including:
   Statistical experiment: parameter space, data samples, statistical model
   Decision rule: deterministic vs. randomized
   Performance evaluation: loss function, risk, minimax vs. Bayes

2 Next, we will introduce two basic statistical decision-making problems:
   Hypothesis testing
   Point estimation


Statistical Model and Decision Making

1 Statistical Model and Decision Making
   Basic Framework
   Examples
   Paradigms

2 Hypothesis Testing
   Basics

3 Estimation
   Mean-Squared Error (MSE) and Cramér-Rao Lower Bound
   Maximum Likelihood Estimator, Consistency, and Efficiency


Statistical Model and Decision Making Basic Framework

1 Statistical Model and Decision Making
   Basic Framework
   Examples
   Paradigms

2 Hypothesis Testing
   Basics

3 Estimation
   Mean-Squared Error (MSE) and Cramér-Rao Lower Bound
   Maximum Likelihood Estimator, Consistency, and Efficiency


Statistical Model and Decision Making Basic Framework

[Figure: a statistical experiment with parameter θ ∈ Θ generates data X ∼ Pθ(·); a decision rule maps X to the inferred result T̂ = τ(X).]


Statistical Model and Decision Making Basic Framework


Statistical Experiment

Statistical Model: a collection of data-generating distributions P ≜ {Pθ | θ ∈ Θ}, where
▶ Θ is called the parameter space; it can be finite, countably infinite, or uncountable.
▶ Pθ(·) is a probability distribution that accounts for the implicit randomness in experiments, sampling, or making observations.

Data (Sample/Outcome/Observation): X is generated by a random draw from Pθ, that is, X ∼ Pθ.
▶ X can be a random variable, vector, matrix, process, etc.


Statistical Model and Decision Making Basic Framework


Inference Task

Objective: T(θ), a function of the parameter θ. From the data X ∼ Pθ, one would like to infer T(θ).

Decision Rule

Decision rule (deterministic): τ(·) is a function of X; T̂ = τ(X) is the inferred result.
Decision rule (randomized): τ(·, ·) is a function of (X, U), where U is external randomness; T̂ = τ(X, U) is the inferred result.


Statistical Model and Decision Making Basic Framework

[Figure: the statistical experiment generates X ∼ Pθ; the decision rule produces T̂ = τ(X), which is compared with T(θ) via the loss function l(·, ·); averaging over X ∼ Pθ yields the risk Lθ(τ).]

Performance Evaluation: how good is a decision rule τ?

Loss function: l(T(θ), τ(X)) measures how bad the decision rule τ is (at a specific data point X). Note: since X is random, l(T(θ), τ(X)) is also random.

Risk: Lθ(τ) ≜ EX∼Pθ[l(T(θ), τ(X))] measures on average how bad the decision rule τ is when the true parameter is θ.
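As a concrete sketch (the Gaussian model, the rules, and all names below are illustrative assumptions, not from the slides), the risk Lθ(τ) can be approximated by Monte Carlo sampling from Pθ:

```python
import random

def risk(theta, rule, loss, n_samples=100_000, seed=0):
    """Monte Carlo estimate of L_theta(rule) = E_{X ~ P_theta}[loss(T(theta), rule(X))].

    Illustrative assumption: P_theta = N(theta, 1) and T(theta) = theta.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(theta, 1.0)      # draw X ~ P_theta
        total += loss(theta, rule(x))  # accumulate the (random) loss
    return total / n_samples

# Squared-error loss with the identity rule tau(x) = x: the risk is Var(X) = 1.
est = risk(2.0, rule=lambda x: x, loss=lambda t, d: (t - d) ** 2)
```

Here `est` should be close to 1 for any θ, since the identity rule's squared-error risk under N(θ, 1) equals the variance.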


Statistical Model and Decision Making Basic Framework


Performance Evaluation: what if the decision rule τ is randomized?

Loss function becomes l(T(θ), τ(X, U)). Risk becomes Lθ(τ) ≜ EU,X∼Pθ[l(T(θ), τ(X, U))].


Statistical Model and Decision Making Examples

1 Statistical Model and Decision Making
   Basic Framework
   Examples
   Paradigms

2 Hypothesis Testing
   Basics

3 Estimation
   Mean-Squared Error (MSE) and Cramér-Rao Lower Bound
   Maximum Likelihood Estimator, Consistency, and Efficiency


Statistical Model and Decision Making Examples

Sometimes we care about the inferred object itself.


Statistical Model and Decision Making Examples

Example: Decoding

Decoding in channel coding over a DMC is one example that we are familiar with.

Parameter is the message: θ ←→ m
Parameter space is the message set: Θ ←→ {1, 2, . . . , 2^{NR}}
Data is the received sequence: X ←→ Y^N
Statistical model is Encoder + Channel: Pθ(x) ←→ ∏_{i=1}^{N} P_{Y|X}(y_i | x_i(m))
Task is to decode the message: T(θ) ←→ m
Decision rule is the decoding algorithm: τ(X) ←→ dec(Y^N)
Loss function is the 0-1 loss: l(T(θ), τ(x)) ←→ 1{m ≠ dec(y^N)}
Risk is the decoding error probability: Lθ(τ) ←→ λ_{m,dec} ≜ P{m ≠ dec(Y^N) | m is sent}


Statistical Model and Decision Making Examples

Example: Hypothesis Testing

Decoding in channel coding belongs to a more general class of problems called hypothesis testing.

Parameter space is a finite set: |Θ| < ∞
Task is to infer the parameter θ: T(θ) = θ
Loss function is the 0-1 loss: l(T(θ), τ(x)) = 1{θ ≠ τ(x)}
Risk is the probability of error: Lθ(τ) = P_{X∼Pθ}{θ ≠ τ(X)}


Statistical Model and Decision Making Examples

Example: Density Estimation

Estimate the probability density function from the collected samples.

Parameter space is a (huge) set of density functions: Θ ←→ F = {f : R → [0, +∞) which is concave/continuous/Lipschitz continuous/etc.}
Data is the observed i.i.d. sequence: X ←→ X^n, with Xi i.i.d. ∼ f(·)
Task is to infer the density function f(·): T(θ) ←→ f
Decision rule is the density estimator: τ(X) ←→ f̂_{X^n}(·)
Loss function is some sort of divergence: l(T(θ), τ(x)) ←→ D(f ∥ f̂_{x^n})
Risk is the expected loss: Lθ(τ) ←→ E_{X^n ∼ f^{⊗n}}[D(f ∥ f̂_{X^n})]


Statistical Model and Decision Making Examples

Sometimes we care about the utility of the inferred object.


Statistical Model and Decision Making Examples

Example: Classification/Prediction

A basic problem in learning is to train a classifier that predicts the category of a new object.

Parameter space is a collection of labelings: Θ ←→ H = {h : X → [1 : K]}
Data is the training data set: X ←→ (X^n, Y^n), with labels Y_i ∈ [1 : K]
Statistical model is the noisy labeling: Pθ(x) ←→ ∏_{i=1}^{n} P_X(x_i) P_{Y|h(X)}(y_i | h(x_i))
Task is to infer the true labeling h ∈ H: T(θ) ←→ h(·), τ(X) ←→ ĥ_{X^n,Y^n}(·)
Loss function is the prediction error probability: l(T, τ(x)) ←→ E_{X∼P_X}[1{h(X) ≠ ĥ(X)}]
(Note: this is still random, as ĥ depends on the randomly drawn training data (X, Y)^n.)
Risk is the averaged loss over training: Lθ(τ) ←→ E_{(X^n,Y^n)∼(P_X P_{Y|h(X)})^{⊗n}}[E_{X∼P_X}[1{h(X) ≠ ĥ(X)}]]


Statistical Model and Decision Making Examples

Example: Regression

Another example that we are very familiar with is regression under mean squared error.

Parameter space is a collection of functions: Θ ←→ F = {f : R^p → R}
Data is the training data: X ←→ (X^n, Y^n)
Statistical model is the noisy observation: Pθ(x) ←→ ∏_{i=1}^{n} P_X(x_i) P_Z(y_i − f(x_i))
(Y = f(X) + Z, where Z is the observation noise)
Task is to infer f ∈ F: T(θ) ←→ f, τ(X) ←→ f̂_{X^n,Y^n}(·)
Loss function is the mean squared error: l(T, τ(x)) ←→ E_{(X,Y)∼P_f}[(Y − f̂(X))^2]
(Note: this is still random, as f̂ depends on the randomly drawn training data (X, Y)^n.)
Risk is the averaged loss over training: Lθ(τ) ←→ E_{(X^n,Y^n)∼P_f^{⊗n}}[E_{(X,Y)∼P_f}[(Y − f̂(X))^2]]


Statistical Model and Decision Making Examples

Example: Linear Regression

For linear regression, the underlying model is linear: f (x) = β⊺x+ γ.

Parameter is (β, γ) ∈ R^p × R: θ ←→ (β, γ), Θ ←→ R^p × R
Data is the training data: X ←→ (X^n, Y^n)
Statistical model is the noisy linear model: Pθ(x) ←→ ∏_{i=1}^{n} P_X(x_i) P_Z(y_i − γ − β⊺x_i)
Task is to infer (β, γ): T(θ) ←→ (β, γ), τ(X) ←→ (β̂, γ̂)_{X^n,Y^n}
Loss function is the mean squared error: l(T, τ(x)) ←→ E_{(X,Y)∼P_{β,γ}}[(Y − β̂⊺X − γ̂)^2]
(Note: this is still random, as (β̂, γ̂) depends on the randomly drawn training data (X, Y)^n.)
Risk is the averaged loss over random training data: Lθ(τ) ←→ E_{(X^n,Y^n)∼P_{β,γ}^{⊗n}}[E_{(X,Y)∼P_{β,γ}}[(Y − β̂⊺X − γ̂)^2]]


Statistical Model and Decision Making Paradigms

1 Statistical Model and Decision Making
   Basic Framework
   Examples
   Paradigms

2 Hypothesis Testing
   Basics

3 Estimation
   Mean-Squared Error (MSE) and Cramér-Rao Lower Bound
   Maximum Likelihood Estimator, Consistency, and Efficiency


Statistical Model and Decision Making Paradigms

How to Determine the Best Estimator?

Recall: the risk of a decision rule τ is Lθ(τ), which is defined according to the underlying statistical model, the task, and the loss function. Note that the risk depends on the hidden parameter θ.

If we can find a decision rule τ∗ such that for every other decision rule τ,

    Lθ(τ∗) ≤ Lθ(τ)  ∀ θ ∈ Θ,

then τ∗ is obviously the best.

Unfortunately, this is unlikely to happen. For example, say the loss function is the ℓ2-norm between θ and θ̂ = τ(x). Then the constant rule τ(x) = θ minimizes the risk Lθ(τ) when the true parameter is θ. This means that no single τ can simultaneously minimize Lθ for all θ ∈ Θ!
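A small numeric sketch of this phenomenon (the Gaussian location model and the two rules are illustrative assumptions): under squared-error loss, the identity rule and the constant rule have risk curves that cross, so neither dominates.

```python
# X ~ N(theta, 1), squared-error loss between theta and tau(x).
def risk_identity(theta):
    # tau1(x) = x: E[(X - theta)^2] = Var(X) = 1 for every theta
    return 1.0

def risk_zero(theta):
    # tau2(x) = 0: loss is (0 - theta)^2 = theta^2, deterministic
    return theta ** 2

# tau2 wins near theta = 0, tau1 wins for |theta| > 1: no uniform winner.
zero_wins_at_origin = risk_zero(0.0) < risk_identity(0.0)
identity_wins_at_two = risk_identity(2.0) < risk_zero(2.0)
```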

There are two main paradigms that resolve this issue: the average-case paradigm (Bayes) and the worst-case paradigm (Minimax).


Statistical Model and Decision Making Paradigms

Bayes Paradigm

In the Bayes paradigm, a prior distribution π(·) over the parameter space Θ is required, and the performance of a decision rule is evaluated by an average-case analysis with respect to π.

Definition 1 (Bayes Risk)

The average risk of a decision rule τ with respect to prior π is defined as Rπ(τ) ≜ EΘ∼π[LΘ(τ)]. The Bayes risk for a prior π is defined as the minimum average risk, that is,

    R∗π ≜ inf_τ Rπ(τ),    (1)

which is attained by a Bayes decision rule τ∗π (which may not be unique).

Note that in information theory, most coding theorems are derived in the Bayes paradigm: we assume random sources and messages. We will give more results later when we talk about hypothesis testing and point estimation.
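For a finite toy problem, Definition 1 can be evaluated exactly (the two distributions below are made-up illustrative numbers): with 0-1 loss and binary Θ, the Bayes rule picks the hypothesis with the larger weighted likelihood at each x, so R∗π = Σx min(π0 P0(x), π1 P1(x)).

```python
# Illustrative toy model: Theta = {0, 1}, X in {0, 1, 2}, 0-1 loss.
P0 = {0: 0.5, 1: 0.4, 2: 0.1}
P1 = {0: 0.1, 1: 0.3, 2: 0.6}

def bayes_risk(pi0):
    # The Bayes (MAP) rule decides, at each x, the hypothesis with the larger
    # weighted likelihood; the minimum average error keeps the smaller term.
    return sum(min(pi0 * P0[x], (1 - pi0) * P1[x]) for x in P0)

r_half = bayes_risk(0.5)  # equals 0.25 for these numbers
```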


Statistical Model and Decision Making Paradigms

Minimax Paradigm

In the Minimax paradigm, the performance of a decision rule is evaluated according to the worst-case risk over the entire parameter space.

Definition 2 (Minimax Risk)

The worst-case risk of a decision rule τ is defined as R(τ) ≜ sup_{θ∈Θ} Lθ(τ). The Minimax risk is defined as the minimum worst-case risk, that is,

    R∗ ≜ inf_τ R(τ) = inf_τ sup_{θ∈Θ} Lθ(τ),    (2)

which is attained by a Minimax decision rule τ∗ (which may not be unique).

A main criticism of the Bayes paradigm is that in many applications there is no prior over the parameter space, because the statistical task is done one-shot. In such cases, the Minimax paradigm provides a conservative but robust evaluation and theoretical guarantees.


Statistical Model and Decision Making Paradigms

Bayes vs. Minimax

A simple yet fundamental relationship between Minimax and Bayes is stated in the theorem below.

Theorem 1 (Minimax Risk ≥ Worst-Case Bayes Risk)

In general, the Minimax risk is not smaller than any Bayes risk, that is,

    R∗ ≥ sup_π R∗π.    (3)

Furthermore, the inequality holds with equality in the following two cases:
1 The parameter space Θ and the data alphabet X are both finite.
2 |Θ| < ∞ and the loss function is bounded from below.

Remark: inequality (3) is useful when deriving lower bounds on the Minimax risk.


Statistical Model and Decision Making Paradigms

pf: Note that R∗ = infτ supθ∈Θ Lθ(τ) and supπ R∗π = supπ infτ EΘ∼π [LΘ(τ)].

For any decision rule τ and prior distribution π, we have

supθ∈Θ Lθ(τ) ≥ EΘ∼π [LΘ(τ)] .

Hence, for any prior distribution π, we have

R∗ = infτ supθ∈Θ Lθ(τ) ≥ infτ EΘ∼π [LΘ(τ)] = R∗π.

Therefore, R∗ ≥ supπ R∗π.

Proving the two sufficient conditions for equality requires some knowledge of convex optimization theory. Essentially, these are two sufficient conditions under which "strong duality" holds.
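On a finite toy problem (the distributions are illustrative assumptions), inequality (3) can be checked by brute force: enumerate all deterministic tests for the minimax side, and sweep a grid of priors for the Bayes side.

```python
from itertools import product

P0 = {0: 0.5, 1: 0.4, 2: 0.1}  # assumed toy distributions, 0-1 loss
P1 = {0: 0.1, 1: 0.3, 2: 0.6}
xs = sorted(P0)

def risks(rule):
    # rule maps x -> decision in {0, 1}; returns (L_0, L_1) under 0-1 loss
    alpha = sum(P0[x] for x in xs if rule[x] == 1)  # error when theta = 0
    beta = sum(P1[x] for x in xs if rule[x] == 0)   # error when theta = 1
    return alpha, beta

# Minimax over deterministic rules: minimize the worse of the two risks.
minimax = min(max(risks(dict(zip(xs, d)))) for d in product([0, 1], repeat=len(xs)))

# Worst-case Bayes risk over a fine grid of priors pi0.
def bayes_risk(pi0):
    return sum(min(pi0 * P0[x], (1 - pi0) * P1[x]) for x in xs)

sup_bayes = max(bayes_risk(i / 1000) for i in range(1001))
gap_ok = minimax >= sup_bayes - 1e-9  # Theorem 1: R* >= sup_pi R*_pi
```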


Statistical Model and Decision Making Paradigms

Deterministic vs. Randomized Decision Rules

Randomization does not always help. In the following we give two such scenarios.

Proposition 1

1 If the loss function τ 7→ l(T, τ) is convex, then randomization does not help.
2 In the Bayes paradigm, there always exists a deterministic decision rule that is Bayes optimal.

pf: First, for any randomized decision rule τ(X, U), by Jensen's inequality we have

    Lθ(τ) ≜ EU,X∼Pθ[l(T(θ), τ(X, U))] ≥ EX∼Pθ[l(T(θ), EU[τ(X, U) | X])] = Lθ(EU[τ(X, U) | X]).

Second, for any randomized decision rule τ(X, U) and prior distribution Θ ∼ π,

    Rπ(τ) = EΘ,U,X[l(T(Θ), τ(X, U))] = EU[Rπ(τ(·, U))] ≥ Rπ(τ(·, u)), for some u.

Hence, the average risk of a randomized decision rule is always lower bounded by that of a deterministic one, namely τ(·, u), where u is chosen so that Rπ(τ(·, u)) is no larger than the average.
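A quick Monte Carlo illustration of the first claim (the Gaussian model and the particular randomized rule are assumptions for demonstration): under the convex squared-error loss, adding independent noise U to a rule can only increase its risk, in line with Jensen's inequality.

```python
import random

rng = random.Random(0)
theta, n = 1.0, 100_000

def sq_loss(t, d):
    # convex loss, so derandomization cannot hurt
    return (t - d) ** 2

# Randomized rule tau(x, u) = x + u with U ~ N(0, 1); its conditional mean
# given X is the deterministic rule tau'(x) = x.
rand_risk = sum(sq_loss(theta, rng.gauss(theta, 1.0) + rng.gauss(0.0, 1.0))
                for _ in range(n)) / n
det_risk = sum(sq_loss(theta, rng.gauss(theta, 1.0)) for _ in range(n)) / n
derandomize_helps = det_risk <= rand_risk
```

Here `rand_risk` concentrates near 2 while `det_risk` concentrates near 1, matching the Jensen bound.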


Statistical Model and Decision Making Paradigms

Parametric vs. Non-Parametric Models

Conventionally, a parametric model refers to the case where the parameter of interest is finite-dimensional, while in a non-parametric model the parameter is infinite-dimensional (e.g., a function).

The parametric framework is useful when one is familiar with certain properties of the data and has a good statistical model for it. The parameter space Θ is then fixed, not scaling with the number of data samples.

In contrast, if such knowledge about the underlying data is not sufficient, the non-parametric framework might be more suitable.

However, this distinction is vague and at times not well-defined.

Parametric: hypothesis testing, mean estimation, covariance matrix estimation, etc.

Non-parametric: density estimation, regression, etc.


Hypothesis Testing

1 Statistical Model and Decision Making
   Basic Framework
   Examples
   Paradigms

2 Hypothesis Testing
   Basics

3 Estimation
   Mean-Squared Error (MSE) and Cramér-Rao Lower Bound
   Maximum Likelihood Estimator, Consistency, and Efficiency


Hypothesis Testing Basics

1 Statistical Model and Decision Making
   Basic Framework
   Examples
   Paradigms

2 Hypothesis Testing
   Basics

3 Estimation
   Mean-Squared Error (MSE) and Cramér-Rao Lower Bound
   Maximum Likelihood Estimator, Consistency, and Efficiency


Hypothesis Testing Basics

Basic Setup

We begin with the simplest setup – binary hypothesis testing:

1 Two hypotheses regarding the observation X, indexed by θ ∈ {0, 1}:

    H0 : X ∼ P0 (null hypothesis, θ = 0)
    H1 : X ∼ P1 (alternative hypothesis, θ = 1)

2 Goal: design a decision-making algorithm ϕ : X → {0, 1}, x 7→ θ̂, to choose one of the two hypotheses based on the observed realization of X, so that a certain risk is minimized.

3 The loss function is the 0-1 loss, which gives rise to two kinds of error probabilities:
    Probability of false alarm (false positive; type I error): αϕ ≡ PFA(ϕ) ≜ P{H1 is chosen | H0}.
    Probability of miss detection (false negative; type II error): βϕ ≡ PMD(ϕ) ≜ P{H0 is chosen | H1}.


Hypothesis Testing Basics

Deterministic Testing Algorithm ≡ Decision Regions

[Figure: the observation space X partitioned into A0(ϕ), the acceptance region of H0, and A1(ϕ), the acceptance region of H1.]

A test ϕ : X → {0, 1} is equivalently characterized by its corresponding acceptance (decision) regions:

    Aθ(ϕ) ≡ ϕ⁻¹(θ) ≜ {x ∈ X : ϕ(x) = θ}, θ = 0, 1.

The two types of error probabilities can be equivalently represented as

    αϕ = ∑_{x∈A1(ϕ)} P0(x) = ∑_{x∈X} ϕ(x) P0(x),
    βϕ = ∑_{x∈A0(ϕ)} P1(x) = ∑_{x∈X} (1 − ϕ(x)) P1(x).

When the context is clear, we often drop the dependency on the test ϕ when dealing with the acceptance regions Aθ.


Hypothesis Testing Basics

Likelihood Ratio Test

Definition 3 (Likelihood Ratio Test)

A (deterministic) likelihood ratio test (LRT) is a test ϕτ, parametrized by a constant τ > 0 (called the threshold), defined as follows:

    ϕτ(x) = 1 if P1(x) > τP0(x),
    ϕτ(x) = 0 if P1(x) ≤ τP0(x).

For x ∈ supp P0, the likelihood ratio is defined as L(x) ≜ P1(x)/P0(x). Hence, an LRT is a thresholding algorithm on the likelihood ratio L(x).

Remark: for computational convenience, one often works with the log-likelihood ratio (LLR) log(L(x)) = log(P1(x)) − log(P0(x)).
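As a sketch (the two Gaussian hypotheses and all parameter values are illustrative assumptions), an LRT between N(0, 1) and N(1, 1) reduces to thresholding x itself, since the LLR is linear in x:

```python
import math

MU0, MU1, SIGMA = 0.0, 1.0, 1.0  # assumed Gaussian hypotheses H0, H1

def llr(x):
    # log P1(x) - log P0(x); the quadratic terms cancel for equal variances
    return ((MU1 - MU0) * x - (MU1 ** 2 - MU0 ** 2) / 2) / SIGMA ** 2

def lrt(x, tau=1.0):
    # decide H1 iff L(x) > tau, i.e. iff LLR(x) > log(tau)
    return 1 if llr(x) > math.log(tau) else 0

# With tau = 1 this is exactly "decide H1 iff x > (MU0 + MU1)/2 = 0.5".
decisions = (lrt(0.2), lrt(0.8))  # -> (0, 1)
```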


Hypothesis Testing Basics

Trade-Off Between α (PFA) and β (PMD)

Theorem 2 (Neyman-Pearson Lemma)

For a likelihood ratio test ϕτ and another deterministic test ϕ, αϕ ≤ αϕτ =⇒ βϕ ≥ βϕτ .

pf: Observe ∀x ∈ X , 0 ≤ (ϕτ (x)− ϕ (x)) (P1 (x)− τP0 (x)) , because

    if P1(x) − τP0(x) > 0 =⇒ ϕτ(x) = 1 =⇒ (ϕτ(x) − ϕ(x)) ≥ 0;
    if P1(x) − τP0(x) ≤ 0 =⇒ ϕτ(x) = 0 =⇒ (ϕτ(x) − ϕ(x)) ≤ 0.

Summing over all x ∈ X , we get

0 ≤ (1− βϕτ )− (1− βϕ)− τ (αϕτ − αϕ) = (βϕ − βϕτ ) + τ (αϕ − αϕτ ) .

Since τ > 0, from above we conclude that αϕ ≤ αϕτ =⇒ βϕ ≥ βϕτ .


Hypothesis Testing Basics

[Figure: trade-off curves between α (PFA) and β (PMD), with both axes ranging over [0, 1].]

Question: What is the optimal trade-off curve? What is the optimal test achieving the curve?


Hypothesis Testing Basics

Randomized Testing Algorithm

Randomized tests include deterministic tests as special cases.

Definition 4 (Randomized Test)

A randomized test decides θ̂ = 1 with probability ϕ(x) and θ̂ = 0 with probability 1 − ϕ(x), where ϕ is a mapping ϕ : X → [0, 1].

Note: A randomized test is characterized by ϕ, as in deterministic tests.

Definition 5 (Randomized LRT)

A randomized likelihood ratio test (LRT) is a test ϕτ,γ, parametrized by constants τ > 0 and γ ∈ (0, 1), defined as follows:

    ϕτ,γ(x) = 1 if P1(x) > τP0(x),
    ϕτ,γ(x) = 0 if P1(x) < τP0(x),
    ϕτ,γ(x) = γ if P1(x) = τP0(x).


Hypothesis Testing Basics

Randomized LRT Achieves the Optimal Trade-Off

Consider the following optimization problem:

Neyman-Pearson Problem

    minimize over ϕ : X → [0, 1]:   βϕ
    subject to   αϕ ≤ α∗

Theorem 3 (Neyman-Pearson)

A randomized LRT ϕτ∗,γ∗ with parameters (τ∗, γ∗) satisfying α∗ = αϕτ∗,γ∗ attains optimality for the Neyman-Pearson problem.


Hypothesis Testing Basics

pf: First, argue that for any α∗ ∈ (0, 1), one can find (τ∗, γ∗) such that

    α∗ = αϕτ∗,γ∗ = ∑_{x∈X} ϕτ∗,γ∗(x) P0(x) = ∑_{x: L(x)>τ∗} P0(x) + ∑_{x: L(x)=τ∗} γ∗ P0(x).

For any test ϕ, by an argument similar to that of Theorem 2, we have

    ∀x ∈ X, (ϕτ∗,γ∗(x) − ϕ(x)) (P1(x) − τ∗P0(x)) ≥ 0.

Summing over all x ∈ X, we similarly get

    (βϕ − βϕτ∗,γ∗) + τ∗ (αϕ − αϕτ∗,γ∗) ≥ 0.

Hence, for any feasible test ϕ with αϕ ≤ α∗ = αϕτ∗,γ∗, its type II error probability satisfies βϕ ≥ βϕτ∗,γ∗.
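The first step of the proof, finding (τ∗, γ∗) for a given α∗, can be sketched for discrete distributions (the distributions and the target level are illustrative assumptions; ties among likelihood ratios are ignored for simplicity):

```python
P0 = {0: 0.5, 1: 0.4, 2: 0.1}  # assumed null distribution
P1 = {0: 0.1, 1: 0.3, 2: 0.6}  # assumed alternative distribution

def np_parameters(alpha_star):
    """Return (tau, gamma) so that the randomized LRT has PFA = alpha_star."""
    # Visit outcomes in decreasing likelihood-ratio order.
    xs = sorted(P0, key=lambda x: P1[x] / P0[x], reverse=True)
    acc = 0.0  # PFA accumulated by outcomes with L(x) strictly above threshold
    for x in xs:
        if acc + P0[x] >= alpha_star:
            tau = P1[x] / P0[x]                 # threshold at this ratio
            gamma = (alpha_star - acc) / P0[x]  # randomize on L(x) = tau
            return tau, gamma
        acc += P0[x]
    return 0.0, 0.0  # alpha_star = 1: always decide H1

tau, gamma = np_parameters(0.05)  # here tau is about 6, gamma about 0.5
```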


Hypothesis Testing Basics

Bayes Risk

Sometimes prior probabilities of the two hypotheses are known:

πθ ≜ P {Hθ is true} , θ = 0, 1, π0 + π1 = 1.

In this case, one can view the index Θ as a (binary) random variable with (prior) distribution P{Θ = θ} = πθ, for θ = 0, 1.

With prior probabilities, it then makes sense to talk about the average probability of error of a test ϕ, or more generally, the average risk:

    Pe(ϕ) ≜ π0 αϕ + π1 βϕ = EΘ,X[1{Θ ≠ ϕ(X)}];    Rπ(ϕ) ≜ EΘ,X[l(Θ, ϕ(X))].

The Bayes hypothesis testing problem is to test the two hypotheses with knowledge of the prior probabilities so that the average probability of error (or, in general, a risk function) is minimized.


Hypothesis Testing Basics

Minimizing Bayes Risk

Consider the following problem of minimizing Bayes risk.

Bayes Problem

    minimize over ϕ : X → [0, 1]:   Rπ(ϕ) ≜ EΘ,X[l(Θ, ϕ(X))]
    with known (π0, π1) and loss function l(θ, θ̂)

Theorem 4 (LRT is an Optimal Bayes Test)

Assume l(0, 0) < l(0, 1) and l(1, 1) < l(1, 0). A deterministic LRT ϕτ∗ with threshold

    τ∗ = ((l(0, 1) − l(0, 0)) π0) / ((l(1, 0) − l(1, 1)) π1)

attains optimality for the Bayes problem.


Hypothesis Testing Basics

pf:

    Rπ(ϕ) = ∑_{x∈X} l(0,0) π0 P0(x) (1 − ϕ(x)) + ∑_{x∈X} l(0,1) π0 P0(x) ϕ(x)
            + ∑_{x∈X} l(1,0) π1 P1(x) (1 − ϕ(x)) + ∑_{x∈X} l(1,1) π1 P1(x) ϕ(x)

          = l(0,0) π0 + ∑_{x∈X} (l(0,1) − l(0,0)) π0 P0(x) ϕ(x)
            + l(1,0) π1 + ∑_{x∈X} (l(1,1) − l(1,0)) π1 P1(x) ϕ(x)

          = ∑_{x∈X} [(l(0,1) − l(0,0)) π0 P0(x) − (l(1,0) − l(1,1)) π1 P1(x)] ϕ(x)   (∗)
            + l(0,0) π0 + l(1,0) π1.

For each x ∈ X, we shall choose ϕ(x) ∈ [0, 1] so that the term (∗) is minimized. It is then clear that we should choose

    ϕ(x) = 1 if (l(0,1) − l(0,0)) π0 P0(x) − (l(1,0) − l(1,1)) π1 P1(x) < 0,
    ϕ(x) = 0 if (l(0,1) − l(0,0)) π0 P0(x) − (l(1,0) − l(1,1)) π1 P1(x) ≥ 0,

which is exactly the deterministic LRT ϕτ∗ with τ∗ = ((l(0,1) − l(0,0)) π0) / ((l(1,0) − l(1,1)) π1).
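The conclusion can be sanity-checked numerically on a finite toy problem (the distributions and prior are illustrative assumptions; 0-1 loss, so l(0,0) = l(1,1) = 0 and l(0,1) = l(1,0) = 1, giving τ∗ = π0/π1):

```python
from itertools import product

P0 = {0: 0.5, 1: 0.4, 2: 0.1}  # assumed distributions
P1 = {0: 0.1, 1: 0.3, 2: 0.6}
pi0, pi1 = 0.4, 0.6            # assumed prior
xs = sorted(P0)

def avg_error(rule):
    # Bayes risk under 0-1 loss: pi0 * alpha + pi1 * beta
    return sum(pi0 * P0[x] * rule[x] + pi1 * P1[x] * (1 - rule[x]) for x in xs)

tau_star = pi0 / pi1  # Theorem 4 specialized to 0-1 loss
lrt_rule = {x: 1 if P1[x] > tau_star * P0[x] else 0 for x in xs}

# Brute force over all deterministic tests on the 3-point alphabet.
best = min(avg_error(dict(zip(xs, d))) for d in product([0, 1], repeat=len(xs)))
lrt_is_optimal = abs(avg_error(lrt_rule) - best) < 1e-12
```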


Hypothesis Testing Basics

Discussions

For binary hypothesis testing problems, the likelihood ratio L(x) ≜ P1(x)/P0(x) is a sufficient statistic.

Moreover, a likelihood ratio test (LRT) is optimal in both the Neyman-Pearson and the Bayes settings.

Extensions include:
    M-ary hypothesis testing
    Minimax risk optimization (with unknown prior)
    Composite hypothesis testing, etc.

Here we do not pursue these directions further at the moment.

We leave some of them to later lectures, including the asymptotic behavior of hypothesis testing, and will discuss its close connection with information theory, in particular, information divergence. Next, we quickly overview the asymptotic behavior of performance in hypothesis testing.


Estimation

1 Statistical Model and Decision Making
   Basic Framework
   Examples
   Paradigms

2 Hypothesis Testing
   Basics

3 Estimation
   Mean-Squared Error (MSE) and Cramér-Rao Lower Bound
   Maximum Likelihood Estimator, Consistency, and Efficiency


Mean-Squared Error (MSE) and Cramér-Rao Lower Bound


Estimator, Bias, Mean Squared Error

Definition 6 (Estimator)

Consider data x randomly generated from X ∼ Pθ, where θ ∈ Θ is an unknown parameter.
An estimator of θ based on the observed x is a mapping ϕ : X → Θ, x ↦ θ̂.
An estimator of a function z(θ) is a mapping ζ : X → z(Θ), x ↦ ẑ.

When Θ ⊆ R or Rⁿ, it is reasonable to consider the following two measures of performance.

Definition 7 (Bias, Mean Squared Error)

For an estimator ϕ (x) of θ,

Biasθ(ϕ) ≜ E_{X∼Pθ}[ϕ(X)] − θ,   MSEθ(ϕ) ≜ E_{X∼Pθ}[|ϕ(X) − θ|²].


Fact 1 (MSE = Variance + (Bias)²)

For an estimator ϕ(x) of θ, MSEθ(ϕ) = VarPθ[ϕ(X)] + (Biasθ(ϕ))².

pf: MSEθ(ϕ) ≜ EPθ[|ϕ(X) − θ|²] = EPθ[(ϕ(X) − EPθ[ϕ(X)] + EPθ[ϕ(X)] − θ)²]
= VarPθ[ϕ(X)] + (Biasθ(ϕ))² + 2 Biasθ(ϕ) EPθ[ϕ(X) − EPθ[ϕ(X)]],
where the last expectation equals 0.

Note: MSE is the risk of an estimator, when the loss function is the squared-error loss.
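Fact 1 can be verified by Monte Carlo; the sketch below (my own example, not from the slides) uses the biased (1/n)-normalized variance estimator of a Gaussian sample, whose bias is −σ²/n.

```python
import random
import statistics

random.seed(0)
mu, sigma, n, trials = 0.0, 2.0, 10, 100_000

# Biased variance estimator: (1/n) * sum (x_i - xbar)^2, with bias -sigma^2/n.
estimates = []
for _ in range(trials):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(xs) / n
    estimates.append(sum((x - xbar) ** 2 for x in xs) / n)

theta = sigma ** 2                                  # parameter being estimated
mean_est = statistics.fmean(estimates)
bias = mean_est - theta
var = statistics.fmean([(e - mean_est) ** 2 for e in estimates])
mse = statistics.fmean([(e - theta) ** 2 for e in estimates])
# The decomposition MSE = Var + Bias^2 holds exactly for the empirical averages too.
```

Note that the identity holds exactly (up to floating-point error) even for the empirical quantities, since it is an algebraic identity for any distribution of the estimates.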

In the following we provide a parameter-dependent lower bound on the MSE of unbiased estimators, namely, the Cramér-Rao inequality.


Lower Bound on MSE of Unbiased Estimators

Below we deal with densities and hence change notation from Pθ to fθ.

Definition 8 (Fisher Information)

The Fisher information of θ is defined as J(θ) ≜ Efθ[(∂/∂θ ln fθ(X))²].

Definition 9 (Unbiased Estimator)

An estimator ϕ is unbiased if Biasθ (ϕ) = 0 for all θ ∈ Θ.

Now we are ready to state the theorem.

Theorem 5 (Cramér-Rao)

For any unbiased estimator ϕ, we have MSEθ(ϕ) ≥ 1/J(θ), ∀ θ ∈ Θ.
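As a concrete sanity check (an illustrative example of mine, not from the slides): for X ∼ N(θ, σ²) with known σ², J(θ) = 1/σ², and the sample mean is unbiased with MSE σ²/n = 1/(nJ(θ)), so it attains the bound with equality. A minimal Monte Carlo sketch:

```python
import random

random.seed(1)
theta, sigma, n, trials = 3.0, 1.5, 20, 100_000

J_single = 1.0 / sigma ** 2     # Fisher information of one N(theta, sigma^2) sample
crlb = 1.0 / (n * J_single)     # Cramér-Rao bound for n i.i.d. samples (J_n = n*J)

# Monte Carlo MSE of the sample mean, an unbiased estimator of theta.
sq_errs = []
for _ in range(trials):
    xbar = sum(random.gauss(theta, sigma) for _ in range(n)) / n
    sq_errs.append((xbar - theta) ** 2)
mse = sum(sq_errs) / trials     # should be close to crlb = sigma^2 / n
```

The sample mean is one of the rare estimators meeting the bound exactly; in general the bound need not be attainable at finite n.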


pf: The proof is essentially an application of the Cauchy-Schwarz inequality.

Let us begin with the observation that J(θ) = Varfθ[sθ(X)], where sθ(X) ≜ ∂/∂θ ln fθ(X) = (1/fθ(X)) ∂/∂θ fθ(X), because

Efθ[sθ(X)] = ∫_{−∞}^{∞} fθ(x) (1/fθ(x)) ∂/∂θ fθ(x) dx = ∫_{−∞}^{∞} ∂/∂θ fθ(x) dx = d/dθ ∫_{−∞}^{∞} fθ(x) dx = 0.

Hence, by the Cauchy-Schwarz inequality, we have

(Covfθ(sθ(X), ϕ(X)))² ≤ Varfθ[sθ(X)] Varfθ[ϕ(X)].

Since Biasθ(ϕ) = 0, we have MSEθ(ϕ) = Varfθ[ϕ(X)], and hence

MSEθ(ϕ) J(θ) ≥ (Covfθ(sθ(X), ϕ(X)))².


It remains to prove that Covfθ(sθ(X), ϕ(X)) = 1:

Covfθ(sθ(X), ϕ(X)) = Efθ[sθ(X) ϕ(X)] − Efθ[sθ(X)] Efθ[ϕ(X)]   (the second term vanishes, since Efθ[sθ(X)] = 0)
= Efθ[sθ(X) ϕ(X)]
= Efθ[(1/fθ(X)) ∂/∂θ fθ(X) · ϕ(X)]
= d/dθ ∫_{−∞}^{∞} fθ(x) ϕ(x) dx = d/dθ Efθ[ϕ(X)] (a)= d/dθ θ = 1,

where (a) holds because ϕ is unbiased. The proof is complete.

Remark: the Cramér-Rao inequality can be extended to vector estimators, biased estimators, estimators of a function of θ, etc.


Extensions of Cramér-Rao Inequality

Below we list some extensions and leave the proofs as exercises.

Exercise 1 (Cramér-Rao Inequality for Unbiased Functional Estimators)

Prove that for any unbiased estimator ζ of z(θ), MSEθ(ζ) ≥ (1/J(θ)) (d z(θ)/dθ)².

Exercise 2 (Cramér-Rao Inequality for Biased Estimators)

Prove that for any estimator ϕ of the parameter θ,

MSEθ(ϕ) ≥ (1/J(θ)) (1 + d Biasθ(ϕ)/dθ)² + (Biasθ(ϕ))².

Exercise 3 (Attainment of Cramér-Rao)

Show that a necessary and sufficient condition for an unbiased estimator ϕ to attain the Cramér-Rao lower bound is that there exists some function g such that for all x,

g(θ) (ϕ(x) − θ) = ∂/∂θ ln fθ(x).


More on Fisher Information

Fisher information plays a key role in the Cramér-Rao lower bound. We make some further remarks about it.

1 J(θ) ≜ Efθ[(sθ(X))²] = Varfθ[sθ(X)], where the score of θ,
  sθ(X) ≜ ∂/∂θ ln fθ(X) = (1/fθ(X)) ∂/∂θ fθ(X), is zero-mean.

2 Suppose Xi i.i.d.∼ fθ. Then for the estimation problem with observation Xⁿ, the Fisher information is Jn(θ) = n J(θ), where J(θ) is the Fisher information when the observation is just X ∼ fθ.

3 For an exponential family {fθ | θ ∈ Θ}, it can be shown that

  J(θ) = −Efθ[∂²/∂θ² ln fθ(X)],

  which makes computation of J(θ) simpler.
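Remarks 1 and 3 are easy to check numerically for a Bernoulli(θ) observation (my own example; Bernoulli is an exponential family, with J(θ) = 1/(θ(1−θ))). The finite-difference helpers `d` and `d2` below are hypothetical names of mine, not lecture notation.

```python
import math

def log_f(theta, x):
    """Log-likelihood of a single Bernoulli(theta) observation x in {0, 1}."""
    return math.log(theta) if x == 1 else math.log(1.0 - theta)

def d(fun, t, h=1e-5):
    """Central finite difference for the first derivative."""
    return (fun(t + h) - fun(t - h)) / (2 * h)

def d2(fun, t, h=1e-4):
    """Central finite difference for the second derivative."""
    return (fun(t + h) - 2 * fun(t) + fun(t - h)) / h ** 2

theta = 0.3
pmf = {1: theta, 0: 1 - theta}

# Form 1: J = E[(d/dtheta ln f)^2], the variance of the zero-mean score.
J_score = sum(pmf[x] * d(lambda t: log_f(t, x), theta) ** 2 for x in (0, 1))

# Form 2: J = -E[d^2/dtheta^2 ln f], valid here since Bernoulli is exponential family.
J_curv = -sum(pmf[x] * d2(lambda t: log_f(t, x), theta) for x in (0, 1))
# Both match the closed form 1 / (theta * (1 - theta)).
```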

Maximum Likelihood Estimator, Consistency, and Efficiency


Maximum Likelihood Estimator

Maximum Likelihood Estimator (MLE) is a widely used estimator.

Definition 10 (Maximum Likelihood Estimator)

The maximum likelihood estimator (MLE) for estimating θ from a randomly drawn X ∼ Pθ is

ϕMLE(x) ≜ arg max_{θ∈Θ} {Pθ(x)}.

Here Pθ (x) is called the likelihood function.

Exercise 4 (MLE of Gaussian with Unknown Mean and Variance)

Consider Xi i.i.d.∼ N(µ, σ²) for i = 1, 2, …, n, where θ ≜ (µ, σ²) denotes the unknown parameter. Let x̄ ≜ (1/n) ∑_{i=1}^n xi. Show that

ϕMLE(xⁿ) = ( x̄ , (1/n) ∑_{i=1}^n (xi − x̄)² ).
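The closed form claimed in Exercise 4 can be sanity-checked numerically on simulated data (a sketch under my own made-up sample, not a proof): the closed-form estimate should dominate the log-likelihood of nearby parameter values.

```python
import math
import random

random.seed(2)
xs = [random.gauss(1.0, 2.0) for _ in range(50)]   # illustrative simulated sample
n = len(xs)

def log_lik(mu, var):
    """Gaussian log-likelihood of the sample xs at parameters (mu, var)."""
    return sum(-0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
               for x in xs)

# Closed-form MLE from Exercise 4: sample mean and (1/n)-normalized sample variance.
mu_hat = sum(xs) / n
var_hat = sum((x - mu_hat) ** 2 for x in xs) / n

# The MLE should be the maximum over a small perturbation grid around it.
best = max(log_lik(mu_hat + dm, var_hat + dv)
           for dm in (-0.1, 0.0, 0.1)
           for dv in (-0.1, 0.0, 0.1))
```

This only checks local optimality on a grid; the exercise asks for the full derivation by setting the partial derivatives of the log-likelihood to zero.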


In the following we consider the observation of n i.i.d. samples Xi i.i.d.∼ Pθ, i = 1, …, n, and give two ways of evaluating the performance of a sequence of estimators {ϕn(xⁿ) | n ∈ N} as n → ∞.

1 Consistency: the estimator output coincides with the true parameter as the sample size n → ∞.

2 Efficiency: the estimator achieves the Cramér-Rao lower bound on the MSE as the sample size n → ∞.

We will see that MLE is not only consistent but also efficient.


Asymptotic Evaluations: Consistency

Definition 11 (Consistency)

A sequence of estimators {ζn(xⁿ) | n ∈ N} is consistent if ∀ ε > 0,

lim_{n→∞} P{|ζn(Xⁿ) − z(θ)| < ε} = 1, ∀ θ ∈ Θ, where Xi i.i.d.∼ Pθ.

In other words, ζn(Xⁿ) → z(θ) in probability, for all θ ∈ Θ.

Theorem 6 (MLE is Consistent)

For a family of densities {fθ | θ ∈ Θ}, under some regularity conditions on fθ(x), the plug-in estimator z(ϕMLE(xⁿ)) is a consistent estimator of z(θ), where z is a continuous function of θ.
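As a quick numerical illustration of consistency (my own example, with a fixed seed): the Bernoulli MLE p̂n, i.e. the empirical frequency of ones, gets closer to the true p as n grows.

```python
import random

random.seed(3)
p = 0.3                               # true parameter

def mle(n):
    """Bernoulli MLE: empirical frequency of ones among n i.i.d. samples."""
    return sum(random.random() < p for _ in range(n)) / n

# Absolute estimation error at increasing sample sizes.
errors = {n: abs(mle(n) - p) for n in (100, 10_000, 1_000_000)}
```

Convergence here is in probability, so a single run is only suggestive; the errors typically scale like 1/√n.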


Asymptotic Evaluations: Efficiency

Definition 12 (Efficiency)

A sequence of estimators {ζn(xⁿ) | n ∈ N} is asymptotically efficient if

√n (ζn(Xⁿ) − z(θ)) → N(0, (1/J(θ)) (d z(θ)/dθ)²) in distribution, as n → ∞.

Theorem 7 (MLE is Asymptotically Efficient)

For a family of densities {fθ | θ ∈ Θ}, under some regularity conditions on fθ(x), the plug-in estimator z(ϕn(xⁿ)), where ϕn is the MLE based on xⁿ, is an asymptotically efficient estimator of z(θ), where z is a continuous function of θ.
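A numerical illustration (my own, with a fixed seed): for Bernoulli(p) with z(θ) = θ, J(p) = 1/(p(1−p)), so efficiency predicts that √n(p̂n − p) has variance about p(1−p); for the Bernoulli MLE (the sample mean) this even holds exactly at every n.

```python
import random
import statistics

random.seed(4)
p, n, trials = 0.3, 200, 20_000

def mle():
    """One draw of the Bernoulli MLE from n i.i.d. samples."""
    return sum(random.random() < p for _ in range(n)) / n

# Empirical distribution of sqrt(n) * (p_hat - p) over many independent runs.
scaled = [(mle() - p) * n ** 0.5 for _ in range(trials)]
emp_var = statistics.pvariance(scaled)   # should be near p*(1-p) = 1/J(p)
```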
