Chapter 8: Least squares (beginning of chapter)



Least Squares

• So far, we have been trying to determine an estimator which was unbiased and had minimum variance.

• Next we'll consider a class of estimators that in many cases have no optimality properties associated with them, but are still applicable in a number of problems due to their ease of use.

• The Least Squares approach originates from 1795, when Gauss invented the method (at the time he was only 18 years old)¹

• However, the method was properly formalized into its current form and published in 1806 by Legendre.

¹ Compare the date with maximum likelihood from 1912 and the CRLB and RBLS from the mid-1900s.


• The method became widely known as Gauss was the only one able to describe the orbit of Ceres, a minor planet in the asteroid belt between Mars and Jupiter that was discovered in 1801.

• In the least squares (LS) approach we attempt to minimize the squared difference between the given data x[n] and the assumed signal model².

• Note that there is no assumption about the noise PDF.

² In Gauss' case, the goal was to minimize the squared difference between the measurements of the location of Ceres and a family of functions describing the orbit.

[Figure: noisy samples x(n) scattered around a fitted line y(n), n = 0, …, 25. LS minimizes the total length of the vertical distances between the samples and the line.]


Introductory Example

• Consider the model

x[n] = A + Bn + w[n],   n = 0, …, N−1

where A and B are unknown deterministic parameters and w[n] ∼ N(0, σ²).


• The LS estimate θ̂_LS is the value of θ that minimizes the LS error criterion

J(θ) = ∑_{n=0}^{N−1} (x[n] − s[n; θ])²,

where s[n; θ] is the deterministic part of our model, parameterized by θ.


• In the previous example, θ = [A, B]ᵀ and the LS criterion is

J(θ) = ∑_{n=0}^{N−1} (x[n] − s[n; θ])² = ∑_{n=0}^{N−1} (x[n] − (A + Bn))²,

and the LS estimator is defined as

θ̂_LS = arg min_θ ∑_{n=0}^{N−1} (x[n] − (A + Bn))²

• Due to the quadratic form, this estimation problem is easily solved, as we will soon see.


Types of LS Problems: Linear and Nonlinear

• LS problems can be divided into two categories: linear and nonlinear.

• Problems where the data depends on the parameters in a linear manner are easily solved, whereas nonlinear problems may be impossible to solve in closed form.


Example of a linear LS problem

• Consider estimating the DC level A of a signal:

x[n] = A + w[n],   n = 0, …, N−1

• According to the LS approach, we can estimate A by minimizing

J(A) = ∑_{n=0}^{N−1} (x[n] − A)²


• Due to the quadratic form, differentiating is easy:

∂J(A)/∂A = ∑_{n=0}^{N−1} 2(x[n] − A) · (−1)

• Note that the quadratic form also guarantees that the zero of the derivative is the unique global minimum.


• Setting the derivative equal to zero gives the minimum of J(A):

∑_{n=0}^{N−1} 2(x[n] − A) · (−1) = 0

∑_{n=0}^{N−1} x[n] − ∑_{n=0}^{N−1} A = 0

∑_{n=0}^{N−1} x[n] = NA

Â = (1/N) ∑_{n=0}^{N−1} x[n]
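The closed form is easy to sanity-check numerically. Below is a minimal Python sketch (the simulated data, seed, and variable names are illustrative, not from the slides) comparing the sample mean against a brute-force evaluation of J(A) on a grid:

```python
import numpy as np

rng = np.random.default_rng(0)
N, A_true, sigma = 100, 2.5, 1.0
x = A_true + sigma * rng.standard_normal(N)      # x[n] = A + w[n]

A_ls = x.mean()                                  # closed-form LS estimate

# Brute force: evaluate J(A) on a grid and take the minimizer.
grid = np.linspace(A_ls - 1.0, A_ls + 1.0, 2001)
J = ((x[:, None] - grid[None, :]) ** 2).sum(axis=0)
print(A_ls, grid[J.argmin()])                    # agree to grid resolution
```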


Example of a nonlinear LS problem

• Consider the signal model

x[n] = cos 2πf₀n + w[n]

where the frequency f₀ is to be estimated. The LSE is defined by the error function

J(f₀) = ∑_{n=0}^{N−1} (x[n] − cos 2πf₀n)²


• Now differentiation gives

∂J(f₀)/∂f₀ = ∑_{n=0}^{N−1} 2(x[n] − cos 2πf₀n)(2πn sin 2πf₀n)

= 4π ∑_{n=0}^{N−1} n (x[n] sin 2πf₀n − cos 2πf₀n sin 2πf₀n)

• We see that it is not possible to solve for the root in closed form.

• The problem could be solved using grid search or iterative minimization.
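A grid search is simple to sketch in Python; everything below (data, grid resolution, variable names) is illustrative rather than from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)
N, f0_true, sigma = 100, 0.28, 0.5
n = np.arange(N)
x = np.cos(2 * np.pi * f0_true * n) + sigma * rng.standard_normal(N)

# Evaluate J(f0) on a dense grid over (0, 0.5) and take the minimizer.
f_grid = np.linspace(0.001, 0.499, 4999)
J = ((x[:, None] - np.cos(2 * np.pi * f_grid[None, :] * n[:, None])) ** 2).sum(axis=0)
print(f_grid[J.argmin()])   # close to the true 0.28
```

The grid must be dense enough to land near the global minimum; the estimate can then be refined with a local iterative step.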


• Non-exhaustive heuristics tend to fail in this kind of problem due to numerous local optima.

• An example of using Matlab's nlinfit is shown below.

• In this example the true frequency was f₀ = 0.28. The estimation fails miserably and results in f̂₀ = 0.02.

• You can also see that the optimizer gets stuck in a local minimum after a few iterations.

• Note: later we will find an approximate estimator using the maximum likelihood principle. This could be used as a starting point for LS as well.


Example of a nonlinear LS problem

[Figure: true signal, noisy samples, and estimated signal for n = 0, …, 100 (amplitudes roughly −4 to 4). The iterative fit locks onto a local minimum, so the estimated cosine does not follow the true signal.]
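A comparable experiment can be run with SciPy's general-purpose least-squares solver; this is a hedged sketch (scipy.optimize.least_squares stands in for nlinfit, and the data and starting points are illustrative):

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(7)
N, f0_true = 100, 0.28
n = np.arange(N)
x = np.cos(2 * np.pi * f0_true * n) + 0.5 * rng.standard_normal(N)

def residuals(f):
    return x - np.cos(2 * np.pi * f * n)

# A poor starting point typically stalls in a nearby local minimum...
print(least_squares(residuals, x0=[0.02]).x)
# ...while a start close to the truth recovers f0 close to 0.28.
print(least_squares(residuals, x0=[0.278]).x)
```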


Linear Least Squares - Vector Parameter

• We assume that

s = Hθ

where H is the observation matrix, a known N × p matrix (N > p) of full rank p. (No noise PDF assumption!)

• The LSE is found by minimizing

J(θ) = (x − Hθ)ᵀ(x − Hθ)
     = xᵀx − xᵀHθ − θᵀHᵀx + θᵀHᵀHθ
     = xᵀx − 2xᵀHθ + θᵀHᵀHθ


• By setting the gradient ∂J(θ)/∂θ = −2Hᵀx + 2HᵀHθ to zero, we obtain the normal equations HᵀHθ = Hᵀx and the estimator

θ̂ = (HᵀH)⁻¹Hᵀx.

• The minimum LS error is

J_min = (x − Hθ̂)ᵀ(x − Hθ̂)
      = (x − H(HᵀH)⁻¹Hᵀx)ᵀ(x − H(HᵀH)⁻¹Hᵀx)
      = xᵀ(I − H(HᵀH)⁻¹Hᵀ)(I − H(HᵀH)⁻¹Hᵀ)x
      = xᵀ(I − H(HᵀH)⁻¹Hᵀ)x
      = xᵀ(x − Hθ̂)
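For the introductory line model s[n] = A + Bn, the normal equations take two lines of numpy. A minimal sketch (the simulated data and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
N, A_true, B_true = 50, 1.0, 0.25
n = np.arange(N)
x = A_true + B_true * n + 0.5 * rng.standard_normal(N)

# Observation matrix for s[n] = A + B*n: an N x 2 full-rank matrix.
H = np.column_stack([np.ones(N), n])

# Solve the normal equations H^T H theta = H^T x.
theta_hat = np.linalg.solve(H.T @ H, H.T @ x)

# Minimum LS error: J_min = x^T (x - H theta_hat).
J_min = x @ (x - H @ theta_hat)
print(theta_hat, J_min)
```

In production code one would prefer np.linalg.lstsq (or a QR factorization) over forming HᵀH explicitly, for numerical stability.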


Idempotence of (I − H(HᵀH)⁻¹Hᵀ)

• In the previous slide we used the property that the matrix (I − H(HᵀH)⁻¹Hᵀ) is idempotent³.

• Note that the matrix is easy to show to be symmetric.

• The idempotence can be shown as follows:

(I − H(HᵀH)⁻¹Hᵀ)(I − H(HᵀH)⁻¹Hᵀ)
= I − H(HᵀH)⁻¹Hᵀ − H(HᵀH)⁻¹Hᵀ + H(HᵀH)⁻¹ (HᵀH)(HᵀH)⁻¹ Hᵀ     [the middle factor (HᵀH)(HᵀH)⁻¹ = I]
= I − 2H(HᵀH)⁻¹Hᵀ + H(HᵀH)⁻¹Hᵀ
= I − H(HᵀH)⁻¹Hᵀ

³ A square matrix A is idempotent (a projection matrix) if AA = A.
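Both properties are quick to check numerically, here with a random full-rank H (illustrative; any N × p matrix of full rank works):

```python
import numpy as np

rng = np.random.default_rng(3)
H = rng.standard_normal((10, 3))   # full rank with probability 1
P = np.eye(10) - H @ np.linalg.inv(H.T @ H) @ H.T

print(np.allclose(P @ P, P))       # True: idempotent
print(np.allclose(P, P.T))         # True: symmetric
```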


Weighted Linear Least Squares

• The least squares approach can be extended by introducing weights for the errors of the individual samples.

• This is done by an N × N symmetric, positive definite weighting matrix W:

J(θ) = (x − Hθ)ᵀW(x − Hθ)

• A derivation similar to the unweighted LS case shows that the weighted LS estimator is given by

θ̂ = (HᵀWH)⁻¹HᵀWx


• Note that setting W = I gives the unweighted case.

• The minimum LS error is given by

J_min = xᵀ(W − WH(HᵀWH)⁻¹HᵀW)x


Weighted LS: example

• Consider the DC level estimation problem

x[n] = A + w[n],   n = 0, 1, …, N−1,

where the noise samples have different variances: w[n] ∼ N(0, σₙ²).


• In this case it would be reasonable to choose the weights such that the samples with smaller variance have higher weight, for example

W = diag( 1/σ₀², 1/σ₁², …, 1/σ²_{N−1} )


• The weighted LS estimator now becomes

Â = (HᵀWH)⁻¹HᵀWx = ⋯ = ( ∑_{n=0}^{N−1} x[n]/σₙ² ) / ( ∑_{n=0}^{N−1} 1/σₙ² )
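The matrix form and the closed-form weighted average can be checked against each other numerically. A sketch (the per-sample variances and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
N, A_true = 200, 1.5
sigma2 = rng.uniform(0.1, 4.0, N)               # per-sample noise variances
x = A_true + np.sqrt(sigma2) * rng.standard_normal(N)

H = np.ones((N, 1))
W = np.diag(1.0 / sigma2)                       # weight = 1 / variance

# Matrix form: (H^T W H)^{-1} H^T W x
A_wls = np.linalg.solve(H.T @ W @ H, H.T @ W @ x).item()

# Closed form: sum(x[n]/sigma_n^2) / sum(1/sigma_n^2)
A_closed = (x / sigma2).sum() / (1.0 / sigma2).sum()
print(A_wls, A_closed)                          # identical up to rounding
```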


Chapter 5: General MVU estimation


General MVU Estimation using Sufficient Statistic

• Previously we saw that it's possible to find the MVU estimator if it is efficient (achieves the CRLB).

• As a special case, the linear model makes this practical.

• In this chapter we'll consider the case where the MVU estimator does not achieve the bound. It is still possible to find the MVU estimator.

• The key to this is the so-called sufficient statistic, which is a function of the data that contains all available information about it.⁴

⁴ The Finnish translation is quite descriptive: tyhjentävä statistiikka, which says that the statistic drains (exhausts, condenses) all the information from the data.


Sufficient statistic

• Let us consider the example of DC level in WGN, where the MVU estimator is

Â = (1/N) ∑_{n=0}^{N−1} x[n].

• In addition to the MVUE, consider the first-sample estimator Ǎ = x[0].

• One way of viewing their difference is that the latter one disregards the majority of the samples.


• The question arises: which set (or function) of the data would be sufficient to contain all the necessary information?

• Possible candidates for a sufficient set of data (sufficient to compute the MVU estimate):

1: S₁ = {x[0], x[1], x[2], …, x[N−1]} (all N samples as such)

2: S₂ = {x[0] + x[1], x[2], …, x[N−1]} (N−2 samples plus a sum of two)

3: S₃ = { ∑_{n=0}^{N−1} x[n] } (the sum of all the samples)


• All of the sets S₁, S₂ and S₃ are sufficient, because the MVUE can be derived from any of them.

• The set S₃ is special in the sense that it compresses the information into a single value.

• In general, among all sufficient sets, the one that contains the least number of elements is called the minimal sufficient statistic. There may be several minimal sufficient statistics: in our example it's easy to come up with other one-element sufficient statistics.

• Next we’ll define sufficiency in mathematical terms.


Definition of Sufficiency

Definition
Consider a set of data x = {x[0], x[1], …, x[N−1]} sampled from a distribution depending on the parameter θ. When estimating θ, a function of the data T(x) is called a sufficient statistic if the conditional PDF (after observing the value of T(x))

p(x | T(x) = T₀; θ)

does not depend on the parameter θ.

• A direct use of the above definition is typically difficult. Instead, it is usually more convenient to use the Neyman-Fisher factorization theorem that we'll describe soon.


• However, before that, let's see an example of the difficult way.


Example: Proof of Sufficiency Using the Definition

• Consider the familiar DC level in WGN example. Let's show that the statistic T(x) = ∑ x[n] is sufficient.

• The PDF of the data is

p(x; A) = 1/(2πσ²)^{N/2} · exp[ −(1/(2σ²)) ∑_{n=0}^{N−1} (x[n] − A)² ]

• We need to show that the conditional PDF p(x | T(x) = T₀; A) does not depend on A.


• The PDF can be evaluated by the following rule from probability theory:

p(E₁ | E₂) = p(E₁, E₂) / p(E₂)

• In our case this becomes

p(x | T(x) = T₀; A) = p(x, T(x) = T₀; A) / p(T(x) = T₀)


• The denominator p(T(x) = T₀) is easy, because a sum of Gaussian random variables (N(A, σ²)) is Gaussian with distribution N(NA, Nσ²), or

p(T(x) = T₀) = p( ∑_{n=0}^{N−1} x[n] = T₀ ) = 1/√(2πNσ²) · exp[ −(1/(2Nσ²)) (T₀ − NA)² ]

• The numerator p(x, T(x) = T₀; A) is the joint PDF of the data and the statistic.


• However, T(x) depends on the data x such that their joint probability takes nonzero values only when T(x) = T₀ holds.⁵

• This condition can be written more compactly as:

p(x, T(x) = T₀; A) = p(x; A) δ(T(x) − T₀),

where δ(·) is the Dirac delta function⁶.

⁵ This is analogous to the case of two random variables x and y with the relationship y ≡ 2x. In this case their joint PDF is p(x, y) = p(x)δ(y − 2x). The delta term ensures that the joint PDF is zero outside the line y = 2x.

⁶ The defining characteristic of δ(·) is ∫_{−∞}^{∞} p(x)δ(x) dx = p(0), i.e., it can be used to pick a single value of a continuous function.


• Thus, we have

p(x | T(x) = T₀; A) = p(x, T(x) = T₀; A) / p(T(x) = T₀)

= p(x; A) δ(T(x) − T₀) / ( 1/√(2πNσ²) · exp[ −(1/(2Nσ²)) (T₀ − NA)² ] )

= ( 1/(2πσ²)^{N/2} · exp[ −(1/(2σ²)) ∑_{n=0}^{N−1} (x[n] − A)² ] ) / ( 1/√(2πNσ²) · exp[ −(1/(2Nσ²)) (T₀ − NA)² ] ) · δ(T(x) − T₀)


• Algebraic manipulation shows that A vanishes from the formula:

p(x | T(x) = T₀; A) = √N / (2πσ²)^{(N−1)/2} · exp[ −(1/(2σ²)) ∑_{n=0}^{N−1} x²[n] ] · exp[ T₀² / (2Nσ²) ] · δ(T(x) − T₀),

or, without the delta function:

p(x | T(x) = T₀; A) = { √N / (2πσ²)^{(N−1)/2} · exp[ −(1/(2σ²)) ∑_{n=0}^{N−1} x²[n] ] · exp[ T₀² / (2Nσ²) ],  if T(x) = T₀
                     { 0,  otherwise.

• Therefore T(x) is sufficient.
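The omitted algebra is short enough to spell out. On the support of δ(T(x) − T₀) we may substitute ∑ x[n] = T₀, and the exponents combine as follows (a sketch in LaTeX):

```latex
% Numerator exponent, expanded using \sum_n x[n] = T_0:
-\frac{1}{2\sigma^2}\sum_{n=0}^{N-1}(x[n]-A)^2
  = -\frac{1}{2\sigma^2}\sum_{n=0}^{N-1} x^2[n]
    + \frac{A T_0}{\sigma^2} - \frac{N A^2}{2\sigma^2}
% Dividing by the denominator adds its exponent with the sign flipped:
+\frac{1}{2N\sigma^2}(T_0 - NA)^2
  = \frac{T_0^2}{2N\sigma^2} - \frac{A T_0}{\sigma^2} + \frac{N A^2}{2\sigma^2}
% The A-dependent terms cancel exactly, leaving
% -\frac{1}{2\sigma^2}\sum_n x^2[n] + \frac{T_0^2}{2N\sigma^2},
% which is the exponent above: A has vanished.
```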


• If nothing else, we learned that using the definition directly is difficult. There's also a more straightforward way, as we'll see next.


Theorem for Finding a Sufficient Statistic

Theorem (Neyman-Fisher Factorization Theorem)
The function T(x) is a sufficient statistic for θ if and only if we can factorize the PDF p(x; θ) as

p(x; θ) = g(T(x), θ) · h(x)

where g is a function depending on x only through T(x) and h is a function depending only on x.


Example: DC Level in WGN

• Let us find the factorization in the DC level in WGN case. As discussed previously, this should give T(x) = ∑ x[n].

• Start with the PDF:

p(x; A) = 1/(2πσ²)^{N/2} · exp[ −(1/(2σ²)) ∑_{n=0}^{N−1} (x[n] − A)² ]


• The exponent can be broken into parts:

∑_{n=0}^{N−1} (x[n] − A)² = ∑_{n=0}^{N−1} (x²[n] − 2Ax[n] + A²)
                         = ∑_{n=0}^{N−1} x²[n] − 2A ∑_{n=0}^{N−1} x[n] + NA²


• Let's separate the PDF into two functions: one with A and T(x) only, and one with x only:

p(x; A) = [ 1/(2πσ²)^{N/2} · exp( −(1/(2σ²)) (NA² − 2A ∑_{n=0}^{N−1} x[n]) ) ] · [ exp( −(1/(2σ²)) ∑_{n=0}^{N−1} x²[n] ) ]

where the first bracketed factor is g(T(x), A) and the second is h(x).

• Thus,

T(x) = ∑_{n=0}^{N−1} x[n]

is a sufficient statistic for A.


• Note that the factorization suggests

T(x) = 2 ∑_{n=0}^{N−1} x[n],

which is also a sufficient statistic for A.

• In fact, any one-to-one mapping of T(x) is a sufficient statistic as well, because we can always return to T(x).


Example: Power of WGN

• Consider estimating the variance σ² of a signal with the following model:

x[n] = 0 + w[n],

where w[n] ∼ N(0, σ²). Find a sufficient statistic.

• Solution: Start with the PDF:

p(x; σ²) = 1/(2πσ²)^{N/2} · exp[ −(1/(2σ²)) ∑_{n=0}^{N−1} x²[n] ].


• This can be factored in a silly way into the required form:

p(x; σ²) = [ 1/(2πσ²)^{N/2} · exp( −(1/(2σ²)) ∑_{n=0}^{N−1} x²[n] ) ] · 1

where the bracketed factor is g(T(x), σ²) and h(x) = 1.

Thus, T(x) = ∑_{n=0}^{N−1} x²[n] is a sufficient statistic for σ².

• Other examples of sufficient statistics follow later, as we're really using the concept to find the MVUE.


Finding the MVUE using a sufficient statistic

• Let T(x) be a complete sufficient statistic for a parameter θ. There are two ways of finding an MVU estimator using T(x):

1. Find any unbiased estimator θ̌ and determine θ̂ = E(θ̌ | T(x)).

2. Find some function g so that θ̂ = g(T(x)) is an unbiased estimator of θ.

• As one could predict, the latter one is easier to use. That will be the one that we'll generally employ.

• However, let us find the MVU estimator in our familiar example case (DC level in WGN) using both methods.


Method 1: The difficult way: θ̂ = E(θ̌ | T(x))

• First we need a sufficient statistic T(x) and an unbiased estimator Ǎ. These can be, for example,

T(x) = ∑_{n=0}^{N−1} x[n]   and   Ǎ = x[0].

• The MVU estimator Â can then be expressed in terms of the two as Â = E(Ǎ | T(x)).

• How to calculate?


• In general, the conditional expectation E(x | y) can be calculated by

E(x | y) = ∫_{−∞}^{∞} x · p(x | y) dx = ∫_{−∞}^{∞} x · ( p(x, y) / p(y) ) dx

• In particular, if x and y are jointly Gaussian, this can be shown to be⁷

E(x | y) = E(x) + ( cov(x, y) / var(y) ) (y − E(y))

⁷ See the proof in Appendix 10A of S. Kay's book or http://en.wikipedia.org/wiki/Multivariate_normal_distribution#Conditional_distributions


• Therefore, in our case

E(Ǎ | T(x)) = ∫_{−∞}^{∞} Ǎ p(Ǎ | T(x)) dǍ
            = E(Ǎ) + ( cov(Ǎ, T(x)) / var(T(x)) ) (T(x) − E(T(x)))
            = A + ( σ² / (Nσ²) ) ( ∑_{n=0}^{N−1} x[n] − NA )
            = (1/N) ∑_{n=0}^{N−1} x[n],

which we know is the MVUE.


Method 2: The easy way: Find a transformation g such that g(T(x)) is unbiased

• We need to find a function g(·) such that g(T(x)) is unbiased, i.e.,

E( g( ∑_{n=0}^{N−1} x[n] ) ) = A


• Because E(∑ x[n]) = NA, the function has to satisfy

E( g( ∑_{n=0}^{N−1} x[n] ) ) / E( ∑_{n=0}^{N−1} x[n] ) = A / (NA) = 1/N

• Thus, a simple choice is g(x) = x/N, giving

Â = (1/N) ∑_{n=0}^{N−1} x[n]


The RBLS Theorem

• We are now ready to state the theorem named after statisticians Rao and Blackwell. It was extended by Lehmann and Scheffé (the "Additionally..." part); hence the lengthy name.

• The theorem uses the concept of completeness of a statistic, which needs to be defined.

• Verification (and even understanding) of completeness can be difficult, so generally we're just happy to believe that all statistics we'll see are complete.


• Definitions of completeness include:

  • A statistic is complete if there is only one function of the statistic that is unbiased.

  • A statistic T(y) is complete if the only real-valued function g defined on the range of T(y) that satisfies

    E_θ(g(T)) = 0, ∀θ

    is the function g(T) = 0.

• You may also check⁸ for the definition and examples of completeness.

⁸ http://en.wikipedia.org/wiki/Completeness_(statistics)


• For example, in our DC level case the statistic T(x) = ∑ x[n] is complete, because the only function for which

E[g(T(x))] = 0   for all A ∈ ℝ

is the zero function g(·) ≡ 0.


Theorem (Rao-Blackwell-Lehmann-Scheffé)
If θ̌ is an unbiased estimator of θ and T(x) is a sufficient statistic for θ, then θ̂ = E(θ̌ | T) is

1. a valid estimator for θ (not dependent on θ),
2. unbiased,
3. of variance at most that of θ̌, for all θ.

Additionally, if the sufficient statistic is complete, then θ̂ is the MVU estimator.


Corollary
If the sufficient statistic T(x) is complete, then there's only one function g(·) that makes it unbiased. Applied to the statistic, g gives the MVUE:

θ̂ = g(T(x)).

Proof: See Appendix 5B and the discussion on pp. 109-110 of Kay's book.


Example: Mean of Uniform Noise

• Suppose we have observed the data

x[n] = w[n],

where w[n] is i.i.d. noise with PDF U(0, β).⁹

• What is the MVU estimator of the mean θ = β/2 from the data x? Is it just the sample mean as usual?

• Earlier we saw that the CRLB regularity conditions do not hold, so that theorem didn't help us.

• Let's first find the sufficient statistic using the factorization theorem.

⁹ The German Tank Problem is an instance of this model: http://en.wikipedia.org/wiki/German_tank_problem


• The PDF of the samples can be written as

p(x[n]; θ) = (1/β) [u(x[n]) − u(x[n] − β)],

where u(x) is the unit step function

u(x) = { 1,  x > 0
       { 0,  x < 0


• The joint PDF of all the samples is

p(x; θ) = (1/β^N) ∏_{n=0}^{N−1} [u(x[n]) − u(x[n] − β)]

• In other words, each value has equal probability density as long as it is in the interval [0, β]; otherwise the density is zero. This can be written as a formula:

p(x; θ) = { 1/β^N,  if 0 < x[n] < β for all n = 0, 1, …, N−1
          { 0,      otherwise

• This form we cannot factorize as required by the theorem: there are N conditions that cannot be written as a single formula.


• However, we can cleverly express the N conditions using only two: the conditions

0 < x[n] < β for all n = 0, 1, . . . ,N− 1

are equivalent to

0 < min(x[n]) and max(x[n]) < β.


• Thus,

p(x; θ) = { 1/β^N,  if 0 < min(x[n]) and max(x[n]) < β
          { 0,      otherwise

or in a single line using the unit step function u(·):

p(x; θ) = (1/β^N) u(β − max(x[n])) · u(min(x[n])),

where the first two factors form g(T(x), θ) and u(min(x[n])) is h(x).


• Hence we have a factorization as required by the Neyman-Fisher factorization theorem, and we know that the sufficient statistic is T(x) = max(x[n]).

• Surprise: all the data needed by the MVU estimator is contained in the largest sample alone.


• If we had unlimited time, we could now prove that T(x) is complete. Well, it is.

• In order to find the function g(·) that makes T(x) unbiased, we could check what E(T(x)) is. We omit this proof as well, and just note that

E(T(x)) = ( 2N / (N+1) ) θ

• Therefore, the MVU estimator results from applying a g that makes the result unbiased. Let's choose

g(x) = ( (N+1) / (2N) ) x
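The omitted expectation is a standard order-statistics computation; a sketch (assuming x[n] i.i.d. U(0, β) with θ = β/2):

```latex
% CDF of T = max(x[n]) for i.i.d. U(0, \beta) samples:
P(T \le t) = \prod_{n=0}^{N-1} P(x[n] \le t) = (t/\beta)^N, \qquad 0 \le t \le \beta
% Differentiate to get the density of T, then integrate:
E(T) = \int_0^\beta t \cdot \frac{N t^{N-1}}{\beta^N}\, dt
     = \frac{N}{N+1}\,\beta = \frac{2N}{N+1}\,\theta
```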


• Result: the MVU estimator of the mean of the uniform distribution is

θ̂ = g(T(x)) = ( (N+1) / (2N) ) max(x[n]).


• The optimality can also be seen in practice. The plot below shows the distribution of the MVU estimator (top) in a test of 100 realizations of uniform noise with β = 3. The other unbiased estimators that one might think of are the sample mean (center) and the sample median (bottom), both of which have significantly worse performance than the MVUE.


[Figure: histograms of the three estimators over the test runs.
MVU: sample variance = 0.00022122, sample mean = 1.4997.
MEAN: sample variance = 0.0090779, sample mean = 1.5099.
MEDIAN: sample variance = 0.024855, sample mean = 1.5179.]
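The comparison is easy to reproduce. A Monte Carlo sketch in Python (the trial count and seed are illustrative; the slide used 100 realizations):

```python
import numpy as np

rng = np.random.default_rng(5)
beta, N, trials = 3.0, 100, 10_000
x = rng.uniform(0.0, beta, size=(trials, N))    # each row: one realization

est_mvu = (N + 1) / (2 * N) * x.max(axis=1)     # (N+1)/(2N) * max(x[n])
est_mean = x.mean(axis=1)                       # sample mean
est_med = np.median(x, axis=1)                  # sample median

for name, est in [("MVU", est_mvu), ("MEAN", est_mean), ("MEDIAN", est_med)]:
    print(f"{name}: mean = {est.mean():.4f}, variance = {est.var():.6f}")
# All three center on the true mean 1.5, but the MVU estimator's
# variance is orders of magnitude smaller than the mean's and median's.
```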


Finding the MVU estimator using a sufficient statistic - vector parameter

• Both theorems extend to the vector case in a natural way.

Theorem (Neyman-Fisher Factorization - vector parameter)
If we can factorize the PDF p(x; θ) as

p(x; θ) = g(T(x), θ) h(x)    (1)

where g is a function depending on x only through T(x) ∈ ℝ^{r×1}, and also on θ, and h is a function depending only on x, then T(x) is a sufficient statistic for θ. Conversely, if T(x) is a sufficient statistic for θ, then the PDF can be factorized as in (1).


Theorem (Rao-Blackwell-Lehmann-Scheffé - vector parameter)
If θ̌ is an unbiased estimator of θ and T(x) is an r × 1 sufficient statistic for θ, then θ̂ = E(θ̌ | T) is

1. a valid estimator for θ (not dependent on θ),
2. unbiased,
3. of lesser or equal variance than that of θ̌, for all θ (each element of θ̂ has lesser or equal variance).

Additionally, if the sufficient statistic is complete, then θ̂ is the MVU estimator.


Corollary
If the sufficient statistic T(x) is complete, then there's only one function g(·) that makes it unbiased. Applied to the statistic, g gives the MVUE:

θ̂ = g(T(x)).


Example: Estimation of Both DC Level and Noise Power

• Consider the DC level in WGN problem,

x[n] = A + w[n],

where w[n] is WGN, and the problem is to estimate A and the variance σ² of w[n] simultaneously.

• The parameter vector is θ = [A, σ²]ᵀ.


• The PDF is

p(x; θ) = 1/(2πσ²)^{N/2} · exp[ −(1/(2σ²)) ∑_{n=0}^{N−1} (x[n] − A)² ]
        = 1/(2πσ²)^{N/2} · exp[ −(1/(2σ²)) ( ∑_{n=0}^{N−1} x²[n] − 2A ∑_{n=0}^{N−1} x[n] + NA² ) ]

• By selecting T(x) = [T₁(x), T₂(x)]ᵀ with T₁(x) = ∑ x[n] and T₂(x) = ∑ x²[n], we arrive at the Neyman-Fisher factorization.


• The factorization is now

p(x; θ) = [ 1/(2πσ²)^{N/2} · exp( −(1/(2σ²)) (T₂(x) − 2A T₁(x) + NA²) ) ] · 1

where the bracketed factor is g(T(x); θ) and h(x) = 1.

• The sufficient statistic is thus

T(x) = [ ∑_{n=0}^{N−1} x[n],  ∑_{n=0}^{N−1} x²[n] ]ᵀ


• The next step is to find a transformation g(·) that makes the statistic T(x) unbiased.

• Let's find the expectations:

E(T(x)) = [ ∑_{n=0}^{N−1} E(x[n]),  ∑_{n=0}^{N−1} E(x²[n]) ]ᵀ = [ NA,  N(σ² + A²) ]ᵀ


• The question is: how to remove the bias from both components?

• How about simply dividing by N:

g(T(x)) = (1/N) [ ∑_{n=0}^{N−1} x[n],  ∑_{n=0}^{N−1} x²[n] ]ᵀ   ⟹   E(g(T(x))) = [ A,  σ² + A² ]ᵀ ≠ [ A,  σ² ]ᵀ

No! The second component won't become unbiased this easily.


• What about trying to cancel the extra A² term by using T₁(x):

g(T(x)) = [ (1/N) T₁(x),  (1/N) T₂(x) − ((1/N) T₁(x))² ]ᵀ = [ x̄,  (1/N) ∑_{n=0}^{N−1} x²[n] − x̄² ]ᵀ

• This way the expectation becomes quite close to the true value:

E(g(T(x))) = [ E(x̄),  E( (1/N) ∑_{n=0}^{N−1} x²[n] − x̄² ) ]ᵀ = [ A,  σ² + A² − E(x̄²) ]ᵀ


• Since x̄ is Gaussian, we know the expectation of its square: E(x̄²) = A² + σ²/N, which yields

E(g(T(x))) = [ A,  σ² + A² − A² − σ²/N ]ᵀ = [ A,  (N−1)σ²/N ]ᵀ

If we multiply the second component by N/(N−1), the result becomes unbiased. Thus, the MVUE is

θ̂ = [ x̄,  (N/(N−1)) ( (1/N) ∑_{n=0}^{N−1} x²[n] − x̄² ) ]ᵀ = ⋯ = [ x̄,  (1/(N−1)) ∑_{n=0}^{N−1} (x[n] − x̄)² ]ᵀ
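Unbiasedness of both components is easy to confirm by simulation. A minimal sketch (parameters and names illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
A_true, sigma2_true, N, trials = 1.0, 2.0, 10, 200_000
x = A_true + np.sqrt(sigma2_true) * rng.standard_normal((trials, N))

A_hat = x.mean(axis=1)              # first component: sample mean
s2_hat = x.var(axis=1, ddof=1)      # second: 1/(N-1) * sum (x[n] - xbar)^2

print(A_hat.mean(), s2_hat.mean())  # ~1.0 and ~2.0: both unbiased
print(x.var(axis=1, ddof=0).mean()) # ~(N-1)/N * 2.0 = 1.8: 1/N version is biased
```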


• The estimator (or rather its second component) is not efficient: it can be shown that the variance of the second component is larger than the CRLB,

C_θ̂ = diag( σ²/N,  2σ⁴/(N−1) )   while   I⁻¹(θ) = diag( σ²/N,  2σ⁴/N )