
ELEG-636: Statistical Signal Processing

Kenneth E. Barner

Department of Electrical and Computer Engineering, University of Delaware

Spring 2009


Course Objectives & Structure

Course Objectives & Structure

Objective: Given a discrete time sequence {x(n)}, develop

Statistical and spectral signal representations

Filtering, prediction, and system identification algorithms

Optimization methods that are

Statistical

Adaptive

Course Structure:

Weekly lectures [notes: www.ece.udel.edu/ barner]

Periodic homework (theory & Matlab implementations) [10%]

Midterm & Final examinations [80%]

Final project [10%]


Course Objectives & Structure

Outline

1 Probability

2 Stationary Process and Models

3 Linear Systems, Spectral Representations, and Eigen Analysis

4 Maximum Likelihood and Bayes Estimation

5 Wiener (MSE) Filtering Theory

6 Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms)

7 Application: Blind Deconvolution


Probability

Signal Characterization

Assumption: Many methods take {x(n)} to be deterministic

Reality: Real world signals are usually statistical in nature

Thus,

. . . , x(−1), x(0), x(1), . . .

can be interpreted as a sequence of random variables. We begin by analyzing each observation x(n) as a R.V. Then, to capture dependencies, we consider random vectors

. . . , x(n), x(n + 1), . . . , x(n + N − 1), x(n + N), . . .

where the block x(n), x(n + 1), . . . , x(n + N − 1) is collected into the vector x(n).


Probability Random Variables

Random Variables

Definition

For a space S, the subsets, or events, of S have associated probabilities.

To every event δ, we assign a number x(δ), which is called a R.V.

The distribution function of x is

Fx(x0) = Pr{x ≤ x0},   −∞ < x0 < ∞

Properties:

1. F(+∞) = 1, F(−∞) = 0
2. F(x) is continuous from the right: F(x+) = F(x)
3. Pr{x1 < x ≤ x2} = F(x2) − F(x1)


Probability Random Variables

Example

Fair toss of two coins: H=heads, T=Tails

Define numerical assignments:

Event (δ)   Prob.   X(δ)    Y(δ)
HH          1/4     1       −100
HT          1/4     2       −100
TH          1/4     3       −100
TT          1/4     4        500

These assignments yield different distribution functions

Fx(2) = Pr{HH, HT} = 1/2

Fy(2) = Pr{HH, HT , TH} = 3/4
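A quick numerical check (a MATLAB/Octave sketch, not part of the original notes; all names are illustrative): simulate many tosses of the two coins, apply the assignments X(δ) and Y(δ), and estimate the distribution functions at 2 empirically.

% Estimate Fx(2) and Fy(2) by simulating fair tosses of two coins
N   = 1e5;
c   = randi(2, N, 2);              % 1 = H, 2 = T for each of the two coins
idx = (c(:,1) - 1)*2 + c(:,2);     % 1 = HH, 2 = HT, 3 = TH, 4 = TT
X   = idx;                         % X(HH)=1, X(HT)=2, X(TH)=3, X(TT)=4
Yv  = [-100 -100 -100 500];
Y   = Yv(idx);                     % Y is -100 except for TT, where it is 500
Fx2 = mean(X <= 2)                 % should be close to 1/2
Fy2 = mean(Y <= 2)                 % should be close to 3/4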

How do we attain an intuitive interpretation of the distribution function?


Probability Random Variables

Distribution Plots

[Plots: the staircase CDFs Fx(x) for x ∈ [0, 5] and Fy(y) for y ∈ [−200, 700] corresponding to the two assignments; Fy jumps to 3/4 at y = −100 and to 1 at y = 500.]

Note properties hold:

1. F(+∞) = 1, F(−∞) = 0
2. F(x) is continuous from the right: F(x+) = F(x)
3. Pr{x1 < x ≤ x2} = F(x2) − F(x1)


Probability Random Variables

Definition

The probability density function is defined as

f(x) = dF(x)/dx    or    F(x) = ∫_{−∞}^{x} f(x) dx

Thus F(∞) = 1 ⇒ ∫_{−∞}^{∞} f(x) dx = 1

Types of distributions:

Continuous: Pr{x = x0} = 0 ∀ x0

Discrete: F(xi) − F(xi−) = Pr{x = xi} = Pi, in which case f(x) = Σi Pi δ(x − xi)

Mixed: discontinuous but not discrete


Probability Random Variables

Distribution examples

Uniform: x ∼ U(a, b), a < b

f(x) = { 1/(b − a)   x ∈ [a, b]
       { 0           else

[Plots: f(x) is constant at 1/(b − a) on [a, b]; F(x) ramps linearly from 0 to 1 over [a, b].]


Probability Random Variables

Gaussian: x ∼ N(μ, σ)

f(x) = [1/(√(2π) σ)] e^{−(x−μ)²/(2σ²)}

[Plots: f(x) is the bell curve centered at μ; F(x) passes through 1/2 at x = μ and approaches 1.]

Historical Note: First introduced by Abraham de Moivre in 1733; rigorously justified by Gauss in 1809; extended by Laplace in 1812.

Why not the de Moivre distribution? Stigler’s law.


Probability Random Variables

Gaussian Distribution Example

Example

Consider the Normal (Gaussian) distribution PDF and CDF for μ = 0, σ² = 0.2, 1.0, 5.0 and μ = −2, σ² = 0.5.

[Plots: F_{μ,σ²}(x) and f_{μ,σ²}(x) on x ∈ [−5, 5] for the four (μ, σ²) pairs; larger σ² gives a flatter, wider curve.]


Probability Random Variables

Binomial: x ∼ B(p, q), p + q = 1

Example

Toss a coin n times. What is the probability of getting k heads?

For p + q = 1, where q is the probability of a tail and p is the probability of a head:

Pr{x = k} = C(n, k) p^k q^{n−k}    [NOTE: C(n, k) = n! / ((n − k)! k!)]

⇒ f(x) = Σ_{k=0}^{n} C(n, k) p^k q^{n−k} δ(x − k)

⇒ F(x) = Σ_{k=0}^{m} C(n, k) p^k q^{n−k},    m ≤ x < m + 1
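These expressions are easy to tabulate directly. A MATLAB/Octave sketch (illustrative only, not from the notes) evaluating the pmf with nchoosek and accumulating the cdf:

% Binomial pmf and cdf for n tosses with head probability p
n = 9;  p = 0.5;  q = 1 - p;
k = 0:n;
pmf = arrayfun(@(kk) nchoosek(n, kk), k) .* p.^k .* q.^(n - k);
cdf = cumsum(pmf);                 % F(x) = cdf(m+1) for m <= x < m+1
disp([k' pmf' cdf'])
sum(pmf)                           % sanity check: the pmf sums to 1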


Probability Random Variables

Binomial Distribution Example I

Example

Toss a coin n times. What is the probability of getting k heads? For n = 9, p = q = 1/2 (fair coin).

[Plots: the binomial pmf f(x) and cdf F(x) for n = 9, p = 1/2, over k = 0, . . . , 9.]


Probability Random Variables

Binomial Distribution Example II

Example

Toss a coin n times. What is the probability of getting k heads? For n = 20, p = 0.5, 0.7 and n = 40, p = 0.5.

[Plots: binomial pmf and cdf for (p = 0.5, n = 20), (p = 0.7, n = 20), and (p = 0.5, n = 40), over k = 0, . . . , 40.]


Probability Conditional Distributions

Conditional Distributions

Definition

The conditional distribution of x given that event M has occurred is

Fx(x0|M) = Pr{x ≤ x0|M} = Pr{x ≤ x0, M} / Pr{M}

Example

Suppose M = {x ≤ a}. Then

Fx(x0|M) = Pr{x ≤ x0, M} / Pr{x ≤ a}

If x0 ≥ a, what happens?


Probability Conditional Distributions

Special Cases

Special Case: x0 ≥ a

Pr{x ≤ x0, x ≤ a} = Pr{x ≤ a}

⇒ Fx(x0|M) = Pr{x ≤ x0, M} / Pr{x ≤ a} = Pr{x ≤ a} / Pr{x ≤ a} = 1

Special Case: x0 ≤ a

⇒ Fx(x0|M) = Pr{x ≤ x0, M} / Pr{x ≤ a} = Pr{x ≤ x0} / Pr{x ≤ a} = Fx(x0) / Fx(a)


Probability Conditional Distributions

Conditional Distribution Example

Example

Suppose Fx(x) is as sketched. [Plot: a CDF F(x) rising to 1, with the point a marked on the x-axis.]

What does Fx(x|M) look like? Note M = {x ≤ a}.

⇒ Fx(x0|M) = { Fx(x0)/Fx(a)   x ≤ a
             { 1              a ≤ x


Probability Conditional Distributions

[Plot: Fx(x) and Fx(x | x ≤ a) versus x; the conditional CDF reaches 1 at x = a.]

Distribution properties hold for conditional cases:

Limiting cases: F(∞|M) = 1 and F(−∞|M) = 0

Probability range: Pr{x0 ≤ x ≤ x1|M} = F(x1|M) − F(x0|M)

Density–distribution relations:

f(x|M) = ∂F(x|M)/∂x    and    F(x0|M) = ∫_{−∞}^{x0} f(x|M) dx


Probability Conditional Distributions

Example (Fair Coin Toss)

Toss a fair coin 4 times. Let x be the number of heads. Determine Pr{x = k}.

Recall

Pr{x = k} = C(n, k) p^k q^{n−k}

In this case

Pr{x = k} = C(4, k) (1/2)^4

Pr{x = 0} = Pr{x = 4} = 1/16

Pr{x = 1} = Pr{x = 3} = 1/4

Pr{x = 2} = 3/8


Probability Conditional Distributions

Density and Distribution Plots for Fair Coin (n = 4) Ex.

[Plots: the pmf f(x) takes values 1/16, 1/4, 3/8, 1/4, 1/16 at x = 0, . . . , 4; the cdf F(x) steps through 1/16, 5/16, 11/16, 15/16, 1.]

What type of distribution is this? Discrete. Thus,

F(xi) − F(xi−) = Pr{x = xi} = Pi

F(x) = ∫_{−∞}^{x} f(x) dx = ∫_{−∞}^{x} Σi Pi δ(x − xi) dx


Probability Conditional Distributions

Conditional Case

Example (Conditional Fair Coin Toss)

Toss a fair coin 4 times. Let x be the number of heads. Suppose M = [at least one flip produces a head]. Determine Pr{x = k|M}.

Recall,

Pr{x = k|M} = Pr{x = k, M} / Pr{M}

Thus first determine Pr{M}:

Pr{M} = 1 − Pr{No heads} = 1 − 1/16 = 15/16


Probability Conditional Distributions

Next determine Pr{x = k|M} for the individual cases, k = 0, 1, 2, 3, 4:

Pr{x = 0|M} = Pr{x = 0, M} / Pr{M} = 0

Pr{x = 1|M} = Pr{x = 1, M} / Pr{M} = Pr{x = 1} / Pr{M} = (1/4) / (15/16) = 4/15

Pr{x = 2|M} = Pr{x = 2} / Pr{M} = (3/8) / (15/16) = 6/15

Pr{x = 3|M} = 4/15

Pr{x = 4|M} = 1/15
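A brute-force check (MATLAB/Octave sketch, illustrative only): enumerate all 16 equally likely outcomes of 4 flips and form Pr{x = k | at least one head} directly from the definition.

% Enumerate all 2^4 outcomes (rows of bits, 1 = head) and count heads
b = dec2bin(0:15) - '0';           % 16 x 4 matrix of 0/1
x = sum(b, 2);                     % number of heads per outcome
M = x >= 1;                        % event: at least one head
PM = mean(M);                      % Pr{M} = 15/16
for k = 0:4
    Pk_given_M = sum(x == k & M) / sum(M);   % Pr{x=k,M}/Pr{M} for equally likely outcomes
    fprintf('Pr{x=%d|M} = %g\n', k, Pk_given_M);
end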


Probability Conditional Distributions

Conditional and Unconditional Density Functions

[Plots: the unconditional pmf f(x) with values 1/16, 1/4, 3/8, 1/4, 1/16 at x = 0, . . . , 4, and the conditional pmf f(x|M) with values 0, 4/15, 6/15, 4/15, 1/15.]

Are they proper density functions?


Probability Total Probability and Bayes’ Theorem

Total Probability and Bayes’ Theorem

Let M1, M2, . . . , Mn form a partition of S, i.e.,

∪i Mi = S   and   Mi ∩ Mj = ∅ for i ≠ j

Then

F(x) = Σi Fx(x|Mi) Pr(Mi)

f(x) = Σi fx(x|Mi) Pr(Mi)

Aside:

Pr{A|B} = Pr{A, B} / Pr{B} = [Pr{B, A}/Pr{A}] · Pr{A} / Pr{B} = Pr{B|A} Pr{A} / Pr{B}


Probability Total Probability and Bayes’ Theorem

From this we get

Pr{M|x ≤ x0} = Pr{x ≤ x0|M} Pr{M} / Pr{x ≤ x0} = F(x0|M) Pr{M} / F(x0)

and

Pr{M|x = x0} = f(x0|M) Pr{M} / f(x0)

By integration,

∫_{−∞}^{∞} Pr{M|x = x0} f(x0) dx0 = ∫_{−∞}^{∞} f(x0|M) Pr{M} dx0
                                  = Pr{M} ∫_{−∞}^{∞} f(x0|M) dx0 = Pr{M}

⇒ Pr{M} = ∫_{−∞}^{∞} Pr{M|x = x0} f(x0) dx0


Probability Total Probability and Bayes’ Theorem

Putting it all Together: Bayes’ Theorem

Bayes’ Theorem:

f(x0|M) = [Pr{M|x = x0} f(x0)] / Pr{M} = [Pr{M|x = x0} f(x0)] / ∫_{−∞}^{∞} Pr{M|x = x0} f(x0) dx0
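As a numerical illustration (a MATLAB/Octave sketch with assumed values, not from the notes), take f(x0) standard Gaussian and M = {x ≤ a}; then Pr{M|x = x0} is an indicator, and Bayes' theorem should reproduce the conditional density fx(x0)/Fx(a) found earlier.

% Numeric check of Bayes' theorem for M = {x <= a}, x ~ N(0,1)
a  = 0.5;
x0 = linspace(-4, 4, 2001);
dx = x0(2) - x0(1);
f      = exp(-x0.^2/2) / sqrt(2*pi);     % f(x0)
PM_x   = double(x0 <= a);                % Pr{M | x = x0} (indicator of x0 <= a)
PM     = sum(PM_x .* f) * dx;            % Pr{M} = integral of Pr{M|x0} f(x0)
f_cond = PM_x .* f / PM;                 % Bayes: f(x0|M)
Fa     = sum(f(x0 <= a)) * dx;           % Fx(a) by the same numerical rule
max(abs(f_cond(x0 <= a) - f(x0 <= a) / Fa))   % should be essentially zero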


Probability Functions of a R.V.

Functions of a R.V.

Problem Statement

Let x and g(x) be RVs such that

y = g(x)

Question: How do we determine the distribution of y?

Note

Fy(y0) = Pr{y ≤ y0} = Pr{g(x) ≤ y0} = Pr{x ∈ Ry0}

where Ry0 = {x : g(x) ≤ y0}

Question: If y = g(x) = x², what is Ry0?


Probability Functions of a R.V.

Example

Let y = g(x) = x². Determine Fy(y0).

[Plot: the parabola y = x², with the interval [−√y0, √y0] on the x-axis mapping to y ≤ y0.]

Note that

Fy(y0) = Pr(y ≤ y0) = Pr(−√y0 ≤ x ≤ √y0) = Fx(√y0) − Fx(−√y0)


Probability Functions of a R.V.

Example

Let x ∼ N(μ, σ) and

y = U(x) = { 1 if x > μ
           { 0 if x ≤ μ

Determine fy(y0) and Fy(y0).

[Plots: fy(y) consists of two impulses of weight 1/2 at y = 0 and y = 1; Fy(y) steps from 0 to 1/2 at y = 0 and from 1/2 to 1 at y = 1.]


Probability Functions of a R.V.

General Function of a Random Variable Case

To determine the density of y = g(x) in terms of fx(x0), look at g(x). For a value y0 attained at roots x1, x2, x3 (with g increasing at x1 and x3, and decreasing at x2),

fy(y0) dy0 = Pr(y0 ≤ y ≤ y0 + dy0)
           = Pr(x1 ≤ x ≤ x1 + dx1) + Pr(x2 + dx2 ≤ x ≤ x2) + Pr(x3 ≤ x ≤ x3 + dx3)


Probability Functions of a R.V.

fy(y0) dy0 = Pr(x1 ≤ x ≤ x1 + dx1) + Pr(x2 + dx2 ≤ x ≤ x2) + Pr(x3 ≤ x ≤ x3 + dx3)
           = fx(x1) dx1 + fx(x2)|dx2| + fx(x3) dx3    (∗)

Note that

dx1 = (dx1/dy0) dy0 = dy0 / (dy0/dx1) = dy0 / g′(x1)

Similarly

dx2 = dy0 / g′(x2)    and    dx3 = dy0 / g′(x3)

Thus (∗) becomes

fy(y0) dy0 = [fx(x1)/g′(x1)] dy0 + [fx(x2)/|g′(x2)|] dy0 + [fx(x3)/g′(x3)] dy0

or

fy(y0) = fx(x1)/g′(x1) + fx(x2)/|g′(x2)| + fx(x3)/g′(x3)


Probability Functions of a R.V.

Function of a R.V. Distribution General Result

Set y = g(x) and let x1, x2, . . . be the roots, i.e.,

y = g(x1) = g(x2) = · · ·

Then

fy(y) = fx(x1)/|g′(x1)| + fx(x2)/|g′(x2)| + · · ·

Example

Suppose x ∼ U(−1, 2) and y = x². Determine fy(y).

[Plots: fx(x) is constant at 1/3 on [−1, 2]; y = x² maps this interval onto [0, 4].]


Probability Functions of a R.V.

Note that g(x) = x² ⇒ g′(x) = 2x.

Consider the special cases separately.

Case 1: 0 ≤ y ≤ 1

y = x² ⇒ x = ±√y

fy(y) = fx(x1)/|g′(x1)| + fx(x2)/|g′(x2)| = (1/3)/|2√y| + (1/3)/|−2√y| = (1/3)/√y

Case 2: 1 ≤ y ≤ 4

y = x² ⇒ x = √y

fy(y) = fx(x1)/|g′(x1)| = (1/3)/(2√y) = (1/6)/√y


Probability Functions of a R.V.

Result: For x ∼ U(−1, 2) and y = x²,

fy(y) = { (1/3)/√y   0 ≤ y ≤ 1
        { (1/6)/√y   1 < y ≤ 4

[Plot: fy(y) on [0, 4]; it decays like 1/√y and drops by half at y = 1.]
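A Monte Carlo sanity check of this density (MATLAB/Octave sketch, illustrative only): draw x ~ U(−1, 2), square it, and compare a local empirical density estimate of y with the piecewise formula at a few points.

% Compare the empirical density of y = x^2, x ~ U(-1,2), with the derived fy(y)
N = 1e6;
x = -1 + 3*rand(N, 1);                  % uniform on [-1, 2]
y = x.^2;
ypts = [0.25 0.81 2.25];                % test points in (0,1) and (1,4)
h    = 0.02;                            % small window half-width
emp  = arrayfun(@(y0) mean(abs(y - y0) < h) / (2*h), ypts);
thy  = (1/3)./sqrt(ypts).*(ypts <= 1) + (1/6)./sqrt(ypts).*(ypts > 1);
disp([ypts; emp; thy])                  % empirical vs. theoretical values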


Probability Functions of a R.V.

Example

Let x ∼ N(μ, σ) and y = e^x. Determine fy(y).

Note g(x) ≥ 0 and g′(x) = e^x.

Also, there is a single root (inverse solution): x = ln(y)

Therefore,

fy(y) = fx(x)/|g′(x)| = fx(x)/e^x

Expressing this in terms of y through substitution yields:

fy(y) = fx(ln(y))/e^{ln(y)} = fx(ln(y))/y


Probability Functions of a R.V.

Note that x is Gaussian:

fx(x) = [1/(√(2π) σ)] e^{−(x−μ)²/(2σ²)}

⇒ fy(y) = [1/(√(2π) y σ)] e^{−(ln(y)−μ)²/(2σ²)},   for y > 0

[Plot: fy(y) for y ∈ [0, 5].]

Log-normal density


Probability Functions of a R.V.

Distribution of Fx (x)

For any RV with continuous distribution Fx(x), the RV y = Fx(x) is uniform on [0, 1].

Proof: Note 0 < y < 1. Since

g(x) = Fx(x),    g′(x) = fx(x)

Thus

fy(y) = fx(x)/g′(x) = fx(x)/fx(x) = 1

[Plots: fy(y) = 1 and Fy(y) = y on [0, 1].]


Probability Functions of a R.V.

Thus the function g(x) = Fx(x) performs the mapping:

x (pdf fx(x)) → Fx(·) → U, uniform on [0, 1]

The converse also holds:

U (uniform pdf) → Fx⁻¹(·) → x, with pdf fx(x)

Combining operations yields Synthesis:

x → Fx(·) → U → Fy⁻¹(·) → y, with pdf fy(y)
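A small MATLAB/Octave sketch (illustrative, not from the notes) of the synthesis idea: map uniform samples through an inverse CDF to generate samples with a prescribed density, here exponential with F⁻¹(u) = −ln(1 − u)/λ.

% Generate exponential samples from uniform ones via U -> F^{-1}(U)
N      = 1e5;
lambda = 2;
U = rand(N, 1);                         % uniform on [0, 1]
x = -log(1 - U) / lambda;               % F^{-1}(u) for F(x) = 1 - exp(-lambda*x)
[mean(x) 1/lambda]                      % sample mean vs. theoretical 1/lambda
[var(x)  1/lambda^2]                    % sample variance vs. theoretical 1/lambda^2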


Probability Mean and Variance

Mean and variance

Definitions

Mean: E{x} = ∫_{−∞}^{∞} x f(x) dx

Conditional Mean: E{x|M} = ∫_{−∞}^{∞} x f(x|M) dx

Example

Suppose M = {x ≥ a}. Then

E{x|M} = ∫_{−∞}^{∞} x f(x|M) dx = ∫_{a}^{∞} x f(x) dx / ∫_{a}^{∞} f(x) dx


Probability Mean and Variance

For a function of a RV, y = g(x),

E{y} = ∫_{−∞}^{∞} y fy(y) dy = ∫_{−∞}^{∞} g(x) fx(x) dx

Example

Suppose g(x) is a step function: g(x) = 1 for x ≤ x0 and 0 otherwise. Determine E{g(x)}.

E{g(x)} = ∫_{−∞}^{∞} g(x) fx(x) dx = ∫_{−∞}^{x0} fx(x) dx = Fx(x0)


Probability Mean and Variance

Definition (Variance)

Variance: σ² = ∫_{−∞}^{∞} (x − η)² f(x) dx

where η = E{x}. Thus,

σ² = E{(x − η)²} = E{x²} − E²{x}

Example

For x ∼ N(η, σ²), determine the variance.

f(x) = [1/(√(2π) σ)] e^{−(x−η)²/(2σ²)}

Note: f(x) is symmetric about x = η ⇒ E{x} = η

Also

∫_{−∞}^{∞} f(x) dx = 1 ⇒ ∫_{−∞}^{∞} e^{−(x−η)²/(2σ²)} dx = √(2π) σ


Probability Mean and Variance

∫_{−∞}^{∞} e^{−(x−η)²/(2σ²)} dx = √(2π) σ

Differentiating w.r.t. σ:

⇒ ∫_{−∞}^{∞} [(x − η)²/σ³] e^{−(x−η)²/(2σ²)} dx = √(2π)

Rearranging yields

∫_{−∞}^{∞} (x − η)² [1/(√(2π) σ)] e^{−(x−η)²/(2σ²)} dx = σ²

or

E{(x − η)²} = σ²


Probability Moments

Definition (Moments)

Moments:

mn = E{x^n} = ∫_{−∞}^{∞} x^n f(x) dx

Central Moments:

μn = E{(x − η)^n} = ∫_{−∞}^{∞} (x − η)^n f(x) dx

From the binomial theorem

μn = E{(x − η)^n} = E{ Σ_{k=0}^{n} C(n, k) x^k (−η)^{n−k} } = Σ_{k=0}^{n} C(n, k) mk (−η)^{n−k}

⇒ μ0 = 1, μ1 = 0, μ2 = σ², μ3 = m3 − 3ηm2 + 2η³


Probability Moments

Example

Let x ∼ N(0, σ²). Prove

E{x^n} = { 0                          n = 2k + 1
         { 1 · 3 · · · (n − 1) σ^n    n = 2k

For n odd,

E{x^n} = ∫_{−∞}^{∞} x^n f(x) dx = 0

since x^n is an odd function and f(x) is an even function.

To prove the second part, use the fact that

∫_{−∞}^{∞} e^{−αx²} dx = √(π/α)


Probability Moments

Differentiate

∫_{−∞}^{∞} e^{−αx²} dx = √(π/α)

with respect to α, k times:

⇒ ∫_{−∞}^{∞} x^{2k} e^{−αx²} dx = [1 · 3 · · · (2k − 1) / 2^k] √(π/α^{2k+1})

Let α = 1/(2σ²). Then

∫_{−∞}^{∞} x^{2k} e^{−x²/(2σ²)} dx = 1 · 3 · · · (2k − 1) σ^{2k+1} √(2π)

Setting n = 2k and rearranging,

∫_{−∞}^{∞} x^n [1/(√(2π) σ)] e^{−x²/(2σ²)} dx = 1 · 3 · · · (n − 1) σ^n    [QED]

Note: Variance is a measure of a RV's concentration around its mean.

Probability Tchebycheff Inequality

Tchebycheff Inequality

For any ε > 0,

Pr(|x − η| ≥ ε) ≤ σ²/ε²

To prove this, note

Pr(|x − η| ≥ ε) = ∫_{−∞}^{η−ε} f(x) dx + ∫_{η+ε}^{∞} f(x) dx = ∫_{|x−η|≥ε} f(x) dx

Also note that

σ² = ∫_{−∞}^{∞} (x − η)² f(x) dx ≥ ∫_{|x−η|≥ε} (x − η)² f(x) dx

[Plot: f(x) with the tail regions x ≤ η − ε and x ≥ η + ε shaded.]


Probability Tchebycheff Inequality

σ² ≥ ∫_{|x−η|≥ε} (x − η)² f(x) dx

Using the fact that |x − η| ≥ ε in the above gives

σ² ≥ ε² ∫_{|x−η|≥ε} f(x) dx = ε² Pr{|x − η| ≥ ε}

Rearranging gives the desired result

⇒ Pr{|x − η| ≥ ε} ≤ (σ/ε)²

QED
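A quick empirical comparison (MATLAB/Octave sketch with assumed values): for Gaussian samples the bound σ²/ε² is typically loose, but the tail probability never exceeds it.

% Compare Pr{|x - eta| >= ep} with the Tchebycheff bound sigma^2/ep^2
N = 1e6;  eta = 1;  sigma = 2;
x = eta + sigma*randn(N, 1);
for ep = [1 2 4]
    p     = mean(abs(x - eta) >= ep);    % empirical tail probability
    bound = sigma^2 / ep^2;
    fprintf('ep=%g: Pr=%.4f  bound=%.4f\n', ep, p, min(bound, 1));
end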


Probability Characteristic & Moment Generating Functions

Definition (Characteristic Function)

The characteristic function of a random variable x with pdf fx(x) is defined by

φx(ω) = E(e^{jωx}) = ∫_{−∞}^{∞} e^{jωx} fx(x) dx

If fx(x) is symmetric about 0 (fx(x) = fx(−x)), then φx(ω) is real.

The magnitude of the characteristic function is bounded by

|φx(ω)| ≤ φx(0) = 1

Theorem (Characteristic Function for the sum of independent RVs)

Let x1, x2, . . . , xN be independent (but not necessarily identically distributed) RVs and set sN = Σ_{i=1}^{N} ai xi, where the ai are constants. Then

φ_{sN}(ω) = Π_{i=1}^{N} φ_{xi}(ai ω)


Probability Characteristic & Moment Generating Functions

The theorem can be proved by a simple extension of the following. Let x and y be independent. Then

φ_{x+y}(ω) = E(e^{jω(x+y)}) = E(e^{jωx} e^{jωy}) = E(e^{jωx}) E(e^{jωy}) = φx(ω) φy(ω)

Example

Determine the characteristic function of the sample mean operating on iid samples.

Note x̄ = (1/N) Σ_{i=1}^{N} xi ⇒ ai = 1/N

⇒ φ_{x̄}(ω) = Π_{i=1}^{N} φ_{xi}(ai ω) = [φ_{xi}(ω/N)]^N


Probability Characteristic & Moment Generating Functions

The moment generating function is realized by making the substitution jω → s in the above.

Definition (Moment Generating Function)

The moment generating function of a random variable x with pdf fx(x) is defined by

Φx(s) = E(e^{sx}) = ∫_{−∞}^{∞} e^{sx} fx(x) dx

Note Φx(jω) = φx(ω)

Theorem (Moment Generation)

Provided that Φx(s) exists in an open interval around s = 0, the following holds:

mn = E(x^n) = Φx^{(n)}(0) = d^n Φx/ds^n evaluated at s = 0

Simply noting that Φx^{(n)}(s) = E(x^n e^{sx}) proves the result.


Probability Characteristic & Moment Generating Functions

Example

Let x be exponentially distributed,

f(x) = λ e^{−λx} U(x)

Determine η = m1, m2, and σ².

Note

Φx(s) = λ ∫_{0}^{∞} e^{sx} e^{−λx} dx = λ ∫_{0}^{∞} e^{−x(λ−s)} dx = λ/(λ − s)

Thus

Φx^{(1)}(0) = 1/λ    and    Φx^{(2)}(0) = 2/λ²

and

E{x} = 1/λ,   E{x²} = 2/λ²   ⇒   σ² = 1/λ²
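The moments obtained from Φx(s) can be checked by simulation (MATLAB/Octave sketch, illustrative only):

% Check E{x} = 1/lambda, E{x^2} = 2/lambda^2, var = 1/lambda^2 by sampling
N      = 1e6;
lambda = 3;
x  = -log(rand(N, 1)) / lambda;          % exponential(lambda) samples
m1 = mean(x);                            % approx 1/lambda
m2 = mean(x.^2);                         % approx 2/lambda^2
s2 = m2 - m1^2;                          % approx 1/lambda^2
[m1 1/lambda; m2 2/lambda^2; s2 1/lambda^2]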


Probability Bivariate Statistics

Bivariate Statistics

Given two RVs, x and y, the bivariate (joint) distribution is given by

F(x0, y0) = Pr{x ≤ x0, y ≤ y0}

[Sketch: the quadrant {x ≤ x0, y ≤ y0} shaded in the (x, y) plane.]

Properties:

F(−∞, y) = F(x, −∞) = 0

F(∞, ∞) = 1

Fx(x) = F(x, ∞),   Fy(y) = F(∞, y)


Probability Bivariate Statistics

Special Cases

Case 1: M = {x1 ≤ x ≤ x2, y ≤ y0}

⇒ Pr{M} = F(x2, y0) − F(x1, y0)

Case 2: M = {x ≤ x0, y1 ≤ y ≤ y2}

⇒ Pr{M} = F(x0, y2) − F(x0, y1)


Probability Bivariate Statistics

Case 3: M = {x1 ≤ x ≤ x2, y1 ≤ y ≤ y2}. Then

Pr{M} = F(x2, y2) − F(x1, y2) − F(x2, y1) + F(x1, y1)

where the last term is added back because that region was subtracted twice.


Probability Joint Statistics

Definition (Joint Statistics)

f(x, y) = ∂²F(x, y) / (∂x ∂y)

and

F(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f(α, β) dα dβ

In general, for some region M, the joint statistics are

Pr{(x, y) ∈ M} = ∫∫_M f(x, y) dx dy

Marginal Statistics: Fx(x) = F(x, ∞) and Fy(y) = F(∞, y)

⇒ fx(x) = ∫_{−∞}^{∞} f(x, y) dy

⇒ fy(y) = ∫_{−∞}^{∞} f(x, y) dx


Probability Independence

Independence

Definition (Independence)

Two RVs x and y are statistically independent if for arbitrary events (regions) x ∈ A and y ∈ B,

Pr{x ∈ A, y ∈ B} = Pr{x ∈ A} Pr{y ∈ B}

Letting A = {x ≤ x0} and B = {y ≤ y0}, we see x and y are independent iff

F_{x,y}(x, y) = Fx(x) Fy(y)

and by differentiation

f_{x,y}(x, y) = fx(x) fy(y)


Probability Independence

If x and y are independent RVs, then

z = q(x) and w = h(y)

are also independent.

Function of two RVs

Given two RVs, let z = g(x, y). Define Dz to be the xy-plane region such that

{z ≤ z0} = {g(x, y) ≤ z0} = {(x, y) ∈ Dz}

Then

Fz(z0) = Pr{z ≤ z0} = Pr{(x, y) ∈ Dz} = ∫∫_{Dz} f(x, y) dx dy


Probability Independence

Example

Let z = x + y. Then z ≤ z0 gives the region x + y ≤ z0, which is delineated by the line x + y = z0.

[Sketch: the half-plane below the line x + y = z0, which crosses the axes at (z0, 0) and (0, z0).]

Thus

Fz(z0) = ∫∫_{Dz} f(x, y) dx dy = ∫_{−∞}^{∞} ∫_{−∞}^{z0−y} f(x, y) dx dy


Probability Independence

We can obtain fz(z) by differentiation:

∂Fz(z)/∂z = ∫_{−∞}^{∞} (∂/∂z) ∫_{−∞}^{z−y} f(x, y) dx dy

fz(z) = ∫_{−∞}^{∞} f(z − y, y) dy    (∗)

Note that if x and y are independent,

f(x, y) = fx(x) fy(y)    (∗∗)

Thus utilizing (∗∗) in (∗),

fz(z) = ∫_{−∞}^{∞} fx(z − y) fy(y) dy = fx(z) ∗ fy(z)    (a convolution)


Probability Independence

Example

Let z = x + y, where x and y are independent with

fx(x) = α e^{−αx} U(x)

fy(y) = α e^{−αy} U(y)

Then

fz(z) = ∫_{−∞}^{∞} fx(z − y) fy(y) dy
      = α² ∫_{0}^{z} e^{−α(z−y)} e^{−αy} dy
      = α² e^{−αz} ∫_{0}^{z} dy
      = α² z e^{−αz} U(z)
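This convolution result can be verified numerically (MATLAB/Octave sketch, illustrative only): the density of the sum of two independent exponential RVs should follow α² z e^{−αz}.

% Density of z = x + y for independent exponential x, y (rate alpha)
N = 1e6;  alpha = 1.5;
x = -log(rand(N,1))/alpha;  y = -log(rand(N,1))/alpha;
z = x + y;
zpts = [0.5 1 2 4];
h    = 0.02;
emp  = arrayfun(@(z0) mean(abs(z - z0) < h)/(2*h), zpts);   % local density estimate
thy  = alpha^2 .* zpts .* exp(-alpha*zpts);
disp([zpts; emp; thy])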


Probability Independence

Example

Let z = max(x, y). Determine Fz(z0) and fz(z0).

Note

Fz(z0) = Pr{z ≤ z0} = Pr{max(x, y) ≤ z0} = Pr{x ≤ z0, y ≤ z0} = F_{xy}(z0, z0)

[Sketch: the quadrant {x ≤ z0, y ≤ z0}.]


Probability Independence

If x and y are independent,

Fz(z0) = Fx(z0) Fy(z0)

and

fz(z0) = ∂Fz(z0)/∂z0 = [∂Fx(z0)/∂z0] Fy(z0) + [∂Fy(z0)/∂z0] Fx(z0) = fx(z0) Fy(z0) + fy(z0) Fx(z0)


Probability Joint Moments

Joint Moments

For RVs x and y and function z = g(x, y),

E{z} = ∫_{−∞}^{∞} z fz(z) dz

E{g(x, y)} = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) f(x, y) dx dy

Definition (Covariance)

For RVs x and y,

Cxy = Cov(x, y) = E[(x − ηx)(y − ηy)]
    = E[xy] − ηx E[y] − ηy E[x] + ηx ηy
    = E[xy] − ηx ηy


Probability Joint Moments

Definition (Correlation Coefficient)

The correlation coefficient is given by

r = Cxy / (σx σy)

Note that

0 ≤ E{[a(x − ηx) + (y − ηy)]²}
  = E{(x − ηx)²} a² + 2 E{(x − ηx)(y − ηy)} a + E{(y − ηy)²}
  = σx² a² + 2 Cxy a + σy²

This is a nonnegative quadratic function of a ⇒ its roots are imaginary (or repeated), so √(4Cxy² − 4σx²σy²) is imaginary and the discriminant is non-positive:

4Cxy² − 4σx²σy² ≤ 0   ⇒   Cxy² ≤ σx²σy²


Probability Joint Moments

Thus,

|Cxy| ≤ σx σy   and   |r| = |Cxy| / (σx σy) ≤ 1

Definition (Uncorrelated)

Two RVs are uncorrelated if their covariance is zero:

Cxy = 0

⇒ r = Cxy / (σx σy) = [E{xy} − E{x}E{y}] / (σx σy) = 0

⇒ E{xy} = E{x}E{y}

Thus

Cxy = 0 ⇔ E{xy} = E{x}E{y}


Probability Joint Moments

Result

If x and y are independent, then

E{xy} = E{x}E{y}

and x and y are uncorrelated

Note: Converse is not true (in general)

Converse only holds for Gaussian RVs

Independence is a stronger condition than uncorrelated

Definition (Orthogonality)

Two RVs are orthogonal if

E{xy} = 0

Note: If x and y are correlated, they are not orthogonal


Probability Joint Moments

Example

Consider the correlation between two RVs, x and y, with samples shown in a scatter plot.
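A MATLAB/Octave sketch (illustrative, not from the notes) of what such a scatter plot quantifies: generate pairs with a chosen correlation and compare the sample correlation coefficient with the target r.

% Generate correlated pairs and estimate the correlation coefficient r
N = 1e5;  r_target = 0.8;
u = randn(N, 1);  v = randn(N, 1);
x = u;
y = r_target*u + sqrt(1 - r_target^2)*v;     % unit-variance pair with correlation r_target
Cxy   = mean(x.*y) - mean(x)*mean(y);        % sample covariance
r_hat = Cxy / (std(x, 1) * std(y, 1));       % sample correlation coefficient
[r_hat r_target]
% scatter(x(1:2000), y(1:2000))              % scatter plot of a subset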


Probability Sequences and Vectors of Random Variables

Sequences and Vectors of Random Variables

Definition (Vector Distribution)

Let {x} be a sequence of RVs. Take N samples to form the random vector

x = [x1, x2, . . . , xN]^T

Then the vector distribution function is

Fx(x0) = Pr{x1 ≤ x0_1, x2 ≤ x0_2, . . . , xN ≤ x0_N} ≜ Pr{x ≤ x0}

Special Case: For complex data,

x = xr + j xi


Probability Sequences and Vectors of Random Variables

[x1, x2, . . . , xN]^T = [xr1, xr2, . . . , xrN]^T + j [xi1, xi2, . . . , xiN]^T

The distribution in the complex case is defined as

Fx(x0) = Pr{xr ≤ x0_r, xi ≤ x0_i} ≜ Pr{x ≤ x0}

The density function is given by

fx(x) = ∂^N Fx(x) / (∂x1 ∂x2 · · · ∂xN)

Fx(x0) = ∫_{−∞}^{x0_1} ∫_{−∞}^{x0_2} · · · ∫_{−∞}^{x0_N} fx(x) dx1 dx2 · · · dxN


Probability Sequences and Vectors of Random Variables

Properties:

Fx([∞, ∞, · · · , ∞]^T) = 1

∫_{−∞}^{∞} ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} fx(x) dx = 1

Fx([x1, x2, · · · , −∞, · · · , xN]^T) = 0

Also

F([∞, x2, x3, · · · , xN]^T) = F([x2, x3, · · · , xN]^T)

∫_{−∞}^{∞} f([x1, x2, x3, · · · , xN]^T) dx1 = f([x2, x3, · · · , xN]^T)

Setting xi = ∞ in the cdf eliminates this sample

Integrating over (−∞, ∞) along xi in the pdf eliminates this sample


Probability Sequences and Vectors of Random Variables

Joint Distribution

Definitions (Joint Distribution and Density)

Given two random vectors x and y, the joint distribution and density are

Fxy(x0, y0) = Pr{x ≤ x0, y ≤ y0}

fxy(x, y) = ∂^N ∂^M Fxy(x, y) / (∂x1 ∂x2 · · · ∂xN ∂y1 ∂y2 · · · ∂yM)

Definition (Vector Independence)

The vectors are independent iff

Fxy(x, y) = Fx(x) Fy(y)

or equivalently

fxy(x, y) = fx(x) fy(y)


Probability Sequences and Vectors of Random Variables

Expectations & Moments

Objective: Obtain a partial description of the process generating x

Solution: Use moments

The first moment, or mean, is

mx = E{x} = [m1, m2, . . . , mN]^T = ∫_{−∞}^{∞} ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} x fx(x) dx

⇒ mk = ∫_{−∞}^{∞} ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} xk fx(x) dx1 dx2 · · · dxN = ∫_{−∞}^{∞} xk f_{xk}(xk) dxk

where f_{xk}(xk) is the marginal distribution of xk


Probability Sequences and Vectors of Random Variables

Definition (Correlation Matrix)

A complete set of second moments is given by the correlation matrix

Rx = E{x x^H} = E{x x^{∗T}}

   = [ E{|x1|²}     E{x1 x2∗}    · · ·   E{x1 xN∗}
       E{x2 x1∗}    E{|x2|²}     · · ·   E{x2 xN∗}
       ...          ...          . . .   ...
       E{xN x1∗}    E{xN x2∗}    · · ·   E{|xN|²}  ]

Result

The correlation matrix is Hermitian symmetric:

(Rx)^H = (E{x x^H})^H = E{(x x^H)^H} = E{x x^H} = Rx


Probability Sequences and Vectors of Random Variables

Definition (Covariance Matrix)

The set of second central moments is given by the covariance

Cx = E{(x − mx)(x − mx)^H}
   = E{x x^H} − mx E{x^H} − E{x} mx^H + mx mx^H
   = Rx − mx mx^H

Result

The covariance is Hermitian symmetric: Cx = Cx^H


Probability Sequences and Vectors of Random Variables

Result

The correlation and covariance matrices are positive semi-definite:

a^H Rx a ≥ 0,   a^H Cx a ≥ 0   (∀ a)

To prove this, note

a^H Rx a = a^H E{x x^H} a = E{a^H x x^H a} = E{(a^H x)(a^H x)^H} = E{|a^H x|²} ≥ 0

In most cases, R and C are positive definite:

a^H Rx a > 0,   a^H Cx a > 0

⇒ no linear dependencies among the elements of x (and Rx, Cx are invertible)
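A MATLAB/Octave sketch (illustrative values) of these properties: estimate Rx from complex data by averaging x x^H over realizations and confirm Hermitian symmetry and nonnegative eigenvalues.

% Estimate Rx = E{x x^H} for complex random vectors and check its properties
N = 3;  K = 1e4;                        % vector length and number of realizations
X = randn(N, K) + 1j*randn(N, K);       % columns are realizations of x
A = [1 0 0; 0.5 1 0; 0 0.3 1];          % mix components to create correlation
X = A*X;
Rx = (X*X') / K;                        % X' is the conjugate transpose in MATLAB
norm(Rx - Rx')                          % approx 0: Hermitian symmetric
eig((Rx + Rx')/2)                       % eigenvalues >= 0: positive semi-definite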


Probability Sequences and Vectors of Random Variables

Definitions (Cross-Correlation and Cross-Covariance)

For random vectors x and y,

Cross-correlation: Rxy ≜ E{x y^H}

Cross-covariance: Cxy ≜ E{(x − mx)(y − my)^H} = Rxy − mx my^H

Definition (Uncorrelated Vectors)

Two vectors x and y are uncorrelated if

Cxy = Rxy − mx my^H = 0

or equivalently

Rxy = E{x y^H} = mx my^H


Probability Sequences and Vectors of Random Variables

Note that as in the scalar case,

independence ⇒ uncorrelated

uncorrelated ⇏ independence

Also, x and y are orthogonal if

Rxy = E{x y^H} = 0

Example

Let x and y have the same dimension. If

z = x + y

find Rz and Cz.


Probability Sequences and Vectors of Random Variables

By definition

Rz = E{(x + y)(x + y)^H}
   = E{x x^H} + E{x y^H} + E{y x^H} + E{y y^H}
   = Rx + Rxy + Ryx + Ry

Similarly,

Cz = Cx + Cxy + Cyx + Cy

Note: If x and y are uncorrelated,

Rz = Rx + mx my^H + my mx^H + Ry

and

Cz = Cx + Cy


Probability Sequences and Vectors of Random Variables

Definition (Multivariate Gaussian Density)

For an N-dimensional random vector x with covariance Cx, the multivariate Gaussian pdf is

fx(x) = [1 / ((2π)^{N/2} |Cx|^{1/2})] e^{−(1/2)(x − mx)^H Cx^{−1}(x − mx)}

Note the similarity to the univariate case:

fx(x) = [1/(√(2π) σ)] e^{−(1/2)(x − m)²/σ²}

Example

Let N = 2 (bivariate case) and x be real. Then

x = [x1, x2]^T   and   mx = E{x} = [m1, m2]^T


Probability Sequences and Vectors of Random Variables

Cx = E{(x − mx)(x − mx)^T} = E{x x^T} − mx mx^T

   = E{ [ x1²      x1x2
          x2x1     x2²  ] } − [ m1²      m1m2
                                m2m1     m2²  ]

   = [ E{x1²} − m1²        E{x1x2} − m1m2
       E{x2x1} − m2m1      E{x2²} − m2²   ]

Recall that

σx² = E{x²} − E²{x}

and

r = [E{x1x2} − m1m2] / (σx1 σx2)


Probability Sequences and Vectors of Random Variables

Rearranging:

Cx = [ σx1²          r σx1 σx2
       r σx1 σx2     σx2²       ]

Also,

Cx^{−1} = [1 / (σx1² σx2² − r² σx1² σx2²)] [ σx2²           −r σx1 σx2
                                             −r σx1 σx2      σx1²       ]

        = [1 / (σx1² σx2² (1 − r²))] [ σx2²           −r σx1 σx2
                                       −r σx1 σx2      σx1²       ]

Substituting into the Gaussian pdf and simplifying,

fx(x) = [1 / (2π |Cx|^{1/2})] e^{−(1/2)(x − mx)^T Cx^{−1}(x − mx)}

      = [1 / (2π σx1 σx2 (1 − r²)^{1/2})] exp{ −[1/(2(1 − r²))] [ (x1 − m1)²/σx1² − 2r(x1 − m1)(x2 − m2)/(σx1 σx2) + (x2 − m2)²/σx2² ] }
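The matrix form and the explicit bivariate expression can be checked against each other at a point (MATLAB/Octave sketch with illustrative values):

% Compare matrix and explicit forms of the bivariate Gaussian pdf at a point
m  = [1; -1];  s1 = 2;  s2 = 1;  r = 0.6;
C  = [s1^2       r*s1*s2;
      r*s1*s2    s2^2   ];
x  = [1.5; 0.2];
d  = x - m;
f_matrix   = exp(-0.5 * d' * (C \ d)) / (2*pi*sqrt(det(C)));
q = (d(1)^2/s1^2 - 2*r*d(1)*d(2)/(s1*s2) + d(2)^2/s2^2) / (1 - r^2);
f_explicit = exp(-0.5*q) / (2*pi*s1*s2*sqrt(1 - r^2));
[f_matrix f_explicit]                   % the two values agree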


Probability Sequences and Vectors of Random Variables

Note: If uncorrelated, r = 0

⇒ fx(x) = [1/(2π σx1 σx2)] e^{−(1/2)[(x1 − m1)²/σx1² + (x2 − m2)²/σx2²]} = f_{x1}(x1) f_{x2}(x2)

Gaussian special case result:

uncorrelated ⇒ independent

Example

Examine the contours defined by

(x − mx)^T Cx^{−1} (x − mx) = constant

Why? For all values on the contour,

fx(x) = constant


Probability Sequences and Vectors of Random Variables

[Contour plots in the (x1, x2) plane, centered at (m1, m2): r = 0 with σx1 = σx2 gives circles; r = 0 with σx1 > σx2 gives axis-aligned ellipses; r > 0 tilts the ellipses (shown for σx1 > σx2 and σx1 < σx2).]


Probability Sequences and Vectors of Random Variables

[Contour plot for r < 0 and σx1 > σx2, with the marginal densities f_{x1}(x1) and f_{x2}(x2) sketched along the axes.]

Integrating over x2 yields f_{x1}(x1)

Integrating over x1 yields f_{x2}(x2)


Probability Sequences and Vectors of Random Variables

Additional Gaussian (surface) examples:

[Surface plots of fx(x) for: r = 0, σx1 = σx2; r = 0, σx1 < σx2; r > 0, σx1 < σx2; r < 0, σx1 < σx2.]


Probability Sequences and Vectors of Random Variables

Transformations of a vector

Let the N functions g1(·), g2(·), . . . , gN(·) map x to z, where

z1 = g1(x1, x2, . . . , xN), z2 = g2(x), . . . , zN = gN(x)    (forward mapping)

Let g1(·), g2(·), . . . , gN(·) be independent and yield a one-to-one transformation such that ∃ a set of functions

x1 = h1(z), x2 = h2(z), . . . , xN = hN(z)    (reverse mapping)

where z = [z1, z2, . . . , zN]^T.

Question: How do we determine the distribution fz(z)?


Probability Sequences and Vectors of Random Variables

Let N = 2 and consider the probability of being in the region Az defined by

[z1, z1 + dz1] and [z2, z2 + dz2]

[Sketch: the rectangle Az in the (z1, z2) plane and the corresponding region Ax in the (x1, x2) plane.]

Identify the equivalent area in the (x1, x2) domain and equate the probabilities:

Pr{(z1, z2) ∈ Az} = Pr{(x1, x2) ∈ Ax}

f_{z1z2}(z1, z2) Area(Az) = f_{x1x2}(x1, x2) Area(Ax)


Probability Sequences and Vectors of Random Variables

Area(Ax) / Area(Az) = abs( J(x1 x2 / z1 z2) ) = 1 / abs( J(z1 z2 / x1 x2) )

The Jacobian is defined as

J(x1 x2 / z1 z2) = | ∂x1/∂z1   ∂x1/∂z2 |
                   | ∂x2/∂z1   ∂x2/∂z2 |

and

J(z1 z2 / x1 x2) = | ∂z1/∂x1   ∂z1/∂x2 |
                   | ∂z2/∂x1   ∂z2/∂x2 |

Note that

∂x1/∂z1 = ∂h1(z)/∂z1   and   ∂z1/∂x1 = ∂g1(x)/∂x1


Probability Sequences and Vectors of Random Variables

Thus

f_{z1z2}(z1, z2) Area(Az) = f_{x1x2}(x1, x2) Area(Ax)

⇒ f_{z1z2}(z1, z2) = f_{x1x2}(x1, x2) / [Area(Az)/Area(Ax)] = f_{x1x2}(x1, x2) / abs( J(z1 z2 / x1 x2) )

General Case Result (Functions of Vectors)

fz(z) = fx(x) / abs( J(z/x) )

where

J(z/x) = | ∂z1/∂x1   · · ·   ∂z1/∂xN |
         |   ...      . . .     ...   |
         | ∂zN/∂x1   · · ·   ∂zN/∂xN |


Probability Sequences and Vectors of Random Variables

Example (Linear Transformation)

Let z and x be linearly related:

z1 = a11 x1 + a12 x2
z2 = a21 x1 + a22 x2

or z = Ax and x = A^{−1}z.

Then

J(z1 z2 / x1 x2) = | ∂z1/∂x1   ∂z1/∂x2 |  =  | a11   a12 |  =  |A|
                   | ∂z2/∂x1   ∂z2/∂x2 |     | a21   a22 |


Probability Sequences and Vectors of Random Variables

Let A^{−1} = [ b11   b12
              b21   b22 ]. Then

[ x1 ]   [ b11   b12 ] [ z1 ]
[ x2 ] = [ b21   b22 ] [ z2 ]

and

f_{z1z2}(z1, z2) = f_{x1x2}(x1, x2) / abs|A| = f_{x1x2}(b11 z1 + b12 z2, b21 z1 + b22 z2) / abs|A|

General Case Result (Linear Transformations)

For the case where z = Ax,

fz(z) = [1 / abs|A|] fx(A^{−1}z)


Probability Sequences and Vectors of Random Variables

Vector Statistics for Linear Transformations

For such linear transformations z = Ax,

E{z} = E{Ax} = A mx

Similarly,

E{z z^H} = E{Ax (Ax)^H} = E{A x x^H A^H}

⇒ Rz = A Rx A^H

By similar arguments it is easy to show

Cz = A Cx A^H

Note: The result is simple linear transformations of the statistics
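A numerical check (MATLAB/Octave sketch with illustrative matrices): form z = Ax for many realizations and compare the estimated Rz with A Rx A^H.

% Verify Rz = A Rx A^H for a linear transformation z = A x
K  = 1e5;  N = 3;
Rx = [2 0.5 0; 0.5 1 0.2; 0 0.2 1.5];   % a chosen positive definite correlation matrix
L  = chol(Rx, 'lower');
X  = L * randn(N, K);                   % zero-mean data with E{x x^H} = Rx
A  = [1 2 0; 0 1 -1];
Z  = A*X;
Rz_est    = (Z*Z') / K;
Rz_theory = A*Rx*A';
norm(Rz_est - Rz_theory)                % small (sampling error only)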


Stationary Process and Models Stationary Process

Stationary Process and Models

A stochastic process describes the time evolution of statistical phenomena

A stochastic process is not a single function of time but an infinite number of possible realizations

A single realization is called a time series

A full joint distribution function of an arbitrary stochastic process is difficult to obtain or estimate

Settle for a partial characterization


Stationary Process and Models Stationary Process

Consider a discrete-time stochastic process

x(n), x(n − 1), . . . , x(n − M)

which may be complex.

Definitions (Mean, Auto-Correlation, and Auto-Covariance)

The mean process is given by

μ(n) = E{x(n)}

The auto–correlation is defined as

r(n, n − k) = E{x(n)x∗(n − k)}

The auto–covariance is given by

c(n, n − k) = E{[x(n) − μ(n)][x(n − k) − μ(n − k)]∗} = r(n, n − k) − μ(n) μ∗(n − k)


Stationary Process and Models Stationary Process

Definition (Wide-Sense Stationary)

A discrete-time stochastic process is wide-sense stationary (WSS) if

μ(n) = μ for all n

r(n, n − k) = r(k)   and   c(n, n − k) = c(k),   k = 0, ±1, ±2, . . .

Let x(n) = [x(n), x(n − 1), . . . , x(n − M + 1)]^T be an M × 1 observation vector. Then for {x(n)} WSS, the correlation matrix is

R = E{x(n) x^H(n)} = [ r(0)         r(1)         · · ·   r(M − 1)
                       r(−1)        r(0)         · · ·   r(M − 2)
                       ...          ...          . . .   ...
                       r(−M + 1)    r(−M + 2)    · · ·   r(0)      ]


Stationary Process and Models Stationary Process

Properties of the correlation matrices

For a stationary discrete-time process: R^H = R (Hermitian)

[ r(0)         r(1)         · · ·   r(M − 1)          [ r(0)          r∗(−1)        · · ·   r∗(−M + 1)
  r(−1)        r(0)         · · ·   r(M − 2)     =      r∗(1)         r(0)          · · ·   r∗(−M + 2)
  ...          ...          . . .   ...                  ...           ...           . . .   ...
  r(−M + 1)    r(−M + 2)    · · ·   r(0)       ]         r∗(M − 1)     r∗(M − 2)     · · ·   r(0)        ]

Consequence: r(−k) = r∗(k)


Stationary Process and Models Stationary Process

The correlation matrix is Toeplitz

R = [ r(0)          r(1)          r(2)          · · ·   r(M − 1)
      r∗(1)         r(0)          r(1)          · · ·   r(M − 2)
      r∗(2)         r∗(1)         r(0)          · · ·   r(M − 3)
      ...           ...           ...           . . .   ...
      r∗(M − 1)     r∗(M − 2)     r∗(M − 3)     · · ·   r(0)      ]

For any non-zero vector a,

a R a^H ≥ 0   (positive semi-definite)

and usually

a R a^H > 0   (positive definite)

Result: R is positive definite if the samples in x are not linearly dependent. In this case R^{−1} exists.

Historical Note: Diagonal-constant matrices are named after the mathematician Otto Toeplitz (1881–1940).


Stationary Process and Models Stochastic Models

Stochastic Models

A model is used to describe the hidden laws governing the generation of the physical data observed

We assume that x(n), x(n − 1), · · · have statistical dependencies that can be modeled as

v(n) → [discrete-time linear filter] → x(n)

where v(n) is a purely random process.

Linear model types:
1. Auto Regressive – no past model input samples used
2. Moving Average – no past model output samples used
3. Auto Regressive Moving Average – both past input and output used


Stationary Process and Models Stochastic Models

General Stochastic Model:

(Model output) + (Linear combination of past outputs) [AR part] = (Linear combination of present & past inputs) [MA part]

Three model possibilities:
1. AR – auto regressive
2. MA – moving average
3. ARMA – mixed AR and MA

Model Input: assumed to be an i.i.d. zero mean Gaussian process:

E{v(n)} = 0 for all n

E{v(n) v∗(k)} = { σv²   k = n
                { 0     otherwise


Stationary Process and Models Auto-Regressive Models

Auto-Regressive Models

Definition (Auto-Regressive)

The time series {x(n)} is said to be generated by an AR model if

x(n) + a1∗ x(n − 1) + · · · + aM∗ x(n − M) = v(n)

or

x(n) = w1∗ x(n − 1) + · · · + wM∗ x(n − M) + v(n)

where wk = −ak.

This is an order M model and v(n) is referred to as the noise term.

Note that we can set a0 = 1 and write

Σ_{k=0}^{M} ak∗ x(n − k) = v(n)

which is a convolution sum.


Stationary Process and Models Auto-Regressive Models

Thus taking Z-transforms,

Z{an∗} = A(z) = Σ_{n=0}^{M} an∗ z^{−n}

Z{x(n)} = X(z) = Σ_{n=0}^{∞} x(n) z^{−n}

Z{v(n)} = V(z) = Σ_{n=0}^{∞} v(n) z^{−n}

and

Σ_{k=0}^{M} ak∗ x(n − k) = v(n)   ⇒   A(z) X(z) = V(z)

If we regard v(n) as the output, then

x(n) → [HA(z)] → v(n)

where HA(z) = V(z)/X(z) = A(z)


Stationary Process and Models Auto-Regressive Models

[Notation note: the figure uses u(n) as the input, i.e., u(n) = v(n)]

This is called the process analyzer

The analyzer is an all-zero system

The impulse response is finite (FIR)

The system is BIBO stable


Stationary Process and Models Auto-Regressive Models

If we view v(n) as the input, then we have the process generator

v(n) → [HG(z)] → x(n)

HG(z) = X(z)/V(z) = 1/A(z)

The process generator is an all-pole system

The impulse response is infinite (IIR)

System stability is an issue


Stationary Process and Models Auto-Regressive Models

Note

HG(z) = 1/A(z) = 1 / Σ_{n=0}^{M} an∗ z^{−n}

Factor the denominator and represent HG(z) in terms of its poles:

HG(z) = 1 / [(1 − p1 z^{−1})(1 − p2 z^{−1}) · · · (1 − pM z^{−1})]

p1, p2, . . . , pM are the poles of HG(z), defined as the roots of the characteristic equation

1 + a1∗ z^{−1} + a2∗ z^{−2} + · · · + aM∗ z^{−M} = 0

HG(z) is all-pole (IIR) and BIBO stable only if all poles are inside the unit circle, i.e.,

|pn| < 1,   n = 1, 2, · · · , M
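A MATLAB/Octave sketch (illustrative coefficients, not from the notes): pick AR coefficients, confirm the poles lie inside the unit circle, and generate x(n) with the process generator HG(z) = 1/A(z) using filter.

% Generate an AR(2) process x(n) with HG(z) = 1/A(z) and check stability
a     = [1 -0.75 0.5];                  % A(z) = 1 - 0.75 z^-1 + 0.5 z^-2 (a0 = 1)
poles = roots(a);
abs(poles)                              % both magnitudes < 1 -> BIBO stable
N  = 1e4;  sigma_v = 1;
v  = sigma_v * randn(N, 1);             % white Gaussian driving noise
x  = filter(1, a, v);                   % all-pole (IIR) process generator
var(x)                                  % sample variance of the generated AR process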


Stationary Process and Models Moving Average Model

Moving Average Model

Definition (Moving Average)

The time series {x(n)} is said to be generated by a Moving Average (MA) model if

x(n) = v(n) + b1∗ v(n − 1) + · · · + bK∗ v(n − K)

where b1, b2, · · · , bK are the parameters of the order K MA model.

v(n) is zero mean white Gaussian noise

The process generation model is all zero (FIR)


Stationary Process and Models Auto-Regressive Moving Average Model

Auto-Regressive Moving Average Model

Definition (Auto-Regressive Moving Average)

In this case, {x(n)} is a mixed process where the output is a function of past outputs and current/past inputs:

x(n) + a1∗ x(n − 1) + · · · + aM∗ x(n − M) = v(n) + b1∗ v(n − 1) + · · · + bK∗ v(n − K)

The order is (M, K).

v(n) is zero mean white Gaussian noise

The process model has zeros and poles (IIR)


Stationary Process and Models Analysis of Stochastic Models

Wold Decomposition (after Herman Wold (1908–92))

Any WSS discrete time stochastic process y(n) can be expressed as

y(n) = x(n) + s(n)

where:
x(n) and s(n) are uncorrelated
x(n) can be expressed by the MA model

x(n) = Σ_{k=0}^{∞} b*_k v(n − k)

b_0 = 1 and Σ_{k=0}^{∞} |b_k| < ∞

v(n) is white noise uncorrelated with s(n)

s(n) is perfectly predictable

Note: If B(z) is minimum phase, then it can be represented by an all pole (AR) system.

AR models are widely used because they are tractable

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 109 / 406

Stationary Process and Models Analysis of Stochastic Models

Asymptotic statistics of AR processes

Recall that {x(n)} is generated by

x(n) + a*_1 x(n − 1) + a*_2 x(n − 2) + · · · + a*_M x(n − M) = v(n)

or

x(n) = w*_1 x(n − 1) + w*_2 x(n − 2) + · · · + w*_M x(n − M) + v(n)

Linear constant coefficient difference equation of order M driven by v(n).

Z-transform representation:

X(z) = V(z) / (1 + Σ_{k=1}^{M} a*_k z^{-k})

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 110 / 406


Stationary Process and Models Analysis of Stochastic Models

Inverse transforming X(z) = V(z) / (1 + Σ_{k=1}^{M} a*_k z^{-k}) yields

x(n) = x_c(n) + x_p(n)

where x_c(n) is the homogeneous solution and x_p(n) is the particular solution

The particular solution is the result of driving HG(z) with v(n)

xp(n) = HG(z)v(n),

where z−1 is taken as the delay operator.

The particular solution has stationary statistics

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 111 / 406

Stationary Process and Models Analysis of Stochastic Models

The homogeneous solution is of the form

x_c(n) = B_1 p_1^n + B_2 p_2^n + · · · + B_M p_M^n

where p_1, p_2, · · · , p_M are the roots of

1 + a*_1 z^{-1} + a*_2 z^{-2} + · · · + a*_M z^{-M} = 0

The B values depend on the initial conditions

The homogeneous solution is not stationary

The process is asymptotically stationary if |pn| < 1

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 112 / 406


Stationary Process and Models Analysis of Stochastic Models

Correlation of a stationary AR process

Recall that an AR process can be written as

Σ_{k=0}^{M} a*_k x(n − k) = v(n)

where a_0 = 1. Multiply both sides by x*(n − l) and take E{·}.

E{ Σ_{k=0}^{M} a*_k x(n − k) x*(n − l) } = E{v(n) x*(n − l)}

Note that

E{x(n − k)x∗(n − l)} = r(l − k)

E{v(n)x∗(n − l)} = 0 for l > 0

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 113 / 406

Stationary Process and Models Analysis of Stochastic Models

Thus

E{ Σ_{k=0}^{M} a*_k x(n − k) x*(n − l) } = E{v(n) x*(n − l)}

⇒ Σ_{k=0}^{M} a*_k r(l − k) = 0   for l > 0

Accordingly, the auto-correlation of the AR process satisfies

r(l) = w*_1 r(l − 1) + w*_2 r(l − 2) + · · · + w*_M r(l − M)

where w_k = −a_k. Note that this also has the solution

r(m) = Σ_{k=1}^{M} c_k p_k^m

where p_k is the k-th root of

1 − w*_1 z^{-1} − w*_2 z^{-2} − · · · − w*_M z^{-M} = 0

Why? Difference equation (no driving function; homogeneous solution only)

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 114 / 406


Stationary Process and Models Analysis of Stochastic Models

Recall that the AR characteristic equation is

1 + a*_1 z^{-1} + a*_2 z^{-2} + · · · + a*_M z^{-M} = 0

This is identical to the auto-correlation characteristic equation

1 − w*_1 z^{-1} − w*_2 z^{-2} − · · · − w*_M z^{-M} = 0

⇒ the roots are equal

Result: A stable AR process ⇒ |p_k| < 1 and

lim_{m→∞} r(m) = lim_{m→∞} Σ_{k=1}^{M} c_k p_k^m = 0

(asymptotically uncorrelated)

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 115 / 406

Stationary Process and Models Yule-Walker Equations

Yule-Walker Equations

An AR model of order M is completely specified by

AR coefficients: a1, a2, . . . , aM

Variance of v(n): σ2v

Proposition: These parameters can be determined by the auto-correlation values: r(0), r(1), . . . , r(M).

Recall

r(l) = w*_1 r(l − 1) + w*_2 r(l − 2) + · · · + w*_M r(l − M)

Case 1: Let l = 1

r(1) = w*_1 r(0) + w*_2 r(−1) + · · · + w*_M r(1 − M)

Using the fact r(−k) = r*(k)

r(1) = w*_1 r(0) + w*_2 r*(1) + · · · + w*_M r*(M − 1)

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 116 / 406


Stationary Process and Models Yule-Walker Equations

Taking the complex conjugate

r*(1) = w_1 r(0) + w_2 r(1) + · · · + w_M r(M − 1)
      = w^T [r(0), r(1), · · · , r(M − 1)]^T

where w^T = [w_1, w_2, · · · , w_M]

Case 2: Now let l = 2

r(2) = w*_1 r(1) + w*_2 r(0) + w*_3 r(−1) + · · · + w*_M r(2 − M)

⇒ r*(2) = w_1 r*(1) + w_2 r(0) + w_3 r(1) + · · · + w_M r(M − 2)
        = w^T [r*(1), r(0), r(1), · · · , r(M − 2)]^T

Case 3: Similarly, for l = 3

r(3) = w*_1 r(2) + w*_2 r(1) + w*_3 r(0) + w*_4 r(−1) + · · · + w*_M r(3 − M)

⇒ r*(3) = w_1 r*(2) + w_2 r*(1) + w_3 r(0) + w_4 r(1) + · · · + w_M r(M − 3)
        = w^T [r*(2), r*(1), r(0), r(1), · · · , r(M − 3)]^T

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 117 / 406

Stationary Process and Models Yule-Walker Equations

Repeating the process & combining results in matrix form

⎡ r(0)        r(1)        · · ·   r(M − 1) ⎤ ⎡ w_1 ⎤   ⎡ r*(1) ⎤
⎢ r*(1)       r(0)        · · ·   r(M − 2) ⎥ ⎢ w_2 ⎥   ⎢ r*(2) ⎥
⎢ r*(2)       r*(1)       · · ·   r(M − 3) ⎥ ⎢ w_3 ⎥ = ⎢ r*(3) ⎥
⎢   ⋮           ⋮          ⋱        ⋮     ⎥ ⎢  ⋮  ⎥   ⎢   ⋮   ⎥
⎣ r*(M − 1)   r*(M − 2)   · · ·   r(0)     ⎦ ⎣ w_M ⎦   ⎣ r*(M) ⎦

or more compactly

Rw = r   where r = [r(1), r(2), r(3), . . . , r(M)]^H

Result: Given the auto-correlation values, the AR coefficients can be uniquely determined

w = R^{-1} r

where a_k = −w_k,   k = 1, 2, · · · , M

Assumption: R is nonsingular

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 118 / 406


Stationary Process and Models Yule-Walker Equations

Still to be determined: variance of the driving sequence v(n)

Recall: E{ Σ_{k=0}^{M} a*_k x(n − k) x*(n − l) } = E{v(n) x*(n − l)}

⇒ Σ_{k=0}^{M} a*_k r(l − k) = E{v(n) x*(n − l)}   (∗)

Note that

x*(n) = w_1 x*(n − 1) + w_2 x*(n − 2) + · · · + w_M x*(n − M) + v*(n)   (∗∗)

Let l = 0 in (∗) and use (∗∗) on the RHS

Σ_{k=0}^{M} a*_k r(−k) = E{v(n) x*(n)}
                       = E{v(n)[w_1 x*(n − 1) + w_2 x*(n − 2) + · · · + w_M x*(n − M) + v*(n)]}

Note that

E{v(n)wk x∗(n − k)} = 0, k = 1, 2, · · · , M

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 119 / 406

Stationary Process and Models Yule-Walker Equations

Thus E{v(n) w_k x*(n − k)} = 0, k = 1, 2, · · · , M gives

Σ_{k=0}^{M} a*_k r(−k) = E{v(n) x*(n)}
                       = E{v(n)[w_1 x*(n − 1) + w_2 x*(n − 2) + · · · + w_M x*(n − M) + v*(n)]}
                       = E{v(n) v*(n)}

or, conjugating,

σ_v² = Σ_{k=0}^{M} a_k r(k)

Final Yule–Walker Result

Rw = r   and   σ_v² = Σ_{k=0}^{M} a_k r(k)

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 120 / 406
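A minimal sketch of solving the Yule–Walker equations numerically (assuming NumPy; the real-valued autocorrelation samples below are made up for illustration): build the Toeplitz matrix R, solve Rw = r, and recover the AR coefficients and the driving-noise variance.

import numpy as np

r = np.array([1.0, 0.5, 0.1])                  # assumed r(0), r(1), r(2) of a real process
M = len(r) - 1
R = np.array([[r[abs(i - j)] for j in range(M)] for i in range(M)])   # Toeplitz autocorrelation matrix
w = np.linalg.solve(R, r[1:])                  # Yule-Walker solution of R w = r
a = -w                                         # AR coefficients a_k = -w_k
sigma2_v = r[0] + np.dot(a, r[1:])             # sigma_v^2 = sum_{k=0}^{M} a_k r(k), with a_0 = 1
print(a, sigma2_v)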


Stationary Process and Models AR Order 2 Example

Example (AR Order-2 Process)

Consider the process defined by

x(n) + a1x(n − 1) + a2x(n − 2) = v(n)

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 121 / 406

Stationary Process and Models AR Order 2 Example

The process

x(n) + a1x(n − 1) + a2x(n − 2) = v(n)

has characteristic equation

1 + a1z−1 + a2z−2 = 0

⇒ p_1, p_2 = (−a_1 ± √(a_1² − 4a_2)) / 2

Stability enforces the constraints

|p_k| < 1  ⇒  −1 ≤ a_2 + a_1,   −1 ≤ a_2 − a_1,   −1 ≤ a_2 ≤ 1

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 122 / 406


Stationary Process and Models AR Order 2 Example

Recall that the auto-correlation can be expressed as

r(m) = Σ_{k=1}^{M} c_k p_k^m = c_1 p_1^m + c_2 p_2^m

p_1, p_2 real, positive ⇒ r(m) is a positive decaying exponential

p_1, p_2 real, negative ⇒ r(m) is an alternating-sign decaying exponential

p_1, p_2 complex conjugates ⇒ r(m) is an exponentially decaying sinusoid

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 123 / 406
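A quick simulation sketch of these behaviors (assuming NumPy/SciPy; the coefficient values are illustrative): generate an AR(2) process with the generator filter and estimate r(m); with complex-conjugate poles the estimate decays as an oscillation.

import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(0)
a1, a2 = -1.2, 0.8                              # illustrative AR(2); complex-conjugate poles, |p| < 1
v = rng.standard_normal(50000)
x = lfilter([1.0], [1.0, a1, a2], v)            # process generator H_G(z) = 1/A(z)
r = np.array([np.mean(x[m:] * x[:len(x) - m]) for m in range(10)])   # biased autocorrelation estimate
print(np.round(r / r[0], 3))                    # exponentially decaying oscillation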

Stationary Process and Models AR Order 2 Example

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 124 / 406


Stationary Process and Models AR Order 2 Example

The characteristics of the AR process vary in a related fashion to the pole placements.

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 125 / 406

Stationary Process and Models AR Order 2 Example

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 126 / 406


Stationary Process and Models Model Order Selection

Model Order Selection

Model Order Selection

A model is typically estimated from a finite set of observation data.

Result: Use Yule-Walker equations to estimate model parameters

Open Question: How do we estimate the model order?

Solution: Use information theoretic criteria.

Akaike's information criterion, developed by Hirotsugu Akaike under the name of "an information criterion" (AIC) in 1971

Take x_i = x(i), i = 1, 2, · · · , N to be N observations of a stationary discrete time process.

Let θ̂_m be the estimated parameters of the order m model (AR/MA/ARMA)

θ̂_m = [θ_{1m}, θ_{2m}, · · · , θ_{Mm}]^T

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 127 / 406

Stationary Process and Models Model Order Selection

Let f_x(x_i | θ̂_m) be the conditional pdf of x_i given the estimated model defined by θ̂_m.

Set L(θ̂_m) = max_{θ_m} Σ_{i=1}^{N} ln f_x(x_i | θ_m).

The likelihood function (log of the conditional pdf evaluated at the maximum likelihood estimates of the model parameters, θ̂_m).

Then the AIC model order is given by the m that minimizes

AIC(m) = −2 L(θ̂_m) + 2m

where the first term is always decreasing in m and the second term is the parameter cost.

The AIC methodology attempts to find the model that best explains the data with a minimum of free parameters.

[Figure: AIC(m) versus m, with a minimum at the optimal order]

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 128 / 406
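A minimal sketch of AIC-based order selection (assuming NumPy and a Gaussian driving noise, so the maximized log-likelihood is, up to an additive constant, −(N/2) ln σ̂_v²; the Yule–Walker fit is the one outlined earlier). This is an illustrative implementation, not the notes' code.

import numpy as np

def ar_fit_yw(x, m):
    # Yule-Walker AR(m) fit from biased sample autocorrelations; returns the noise-variance estimate
    N = len(x)
    r = np.array([np.dot(x[k:], x[:N - k]) / N for k in range(m + 1)])
    R = np.array([[r[abs(i - j)] for j in range(m)] for i in range(m)])
    w = np.linalg.solve(R, r[1:])
    return r[0] - np.dot(w, r[1:])

def aic_order(x, max_order=10):
    N = len(x)
    scores = []
    for m in range(1, max_order + 1):
        s2 = ar_fit_yw(x, m)
        loglik = -0.5 * N * np.log(s2)          # Gaussian log-likelihood up to a constant
        scores.append(-2 * loglik + 2 * m)      # AIC(m) = -2 L(theta_m) + 2m
    return 1 + int(np.argmin(scores)), scores

Applied to the AR(2) sequence simulated above, aic_order typically returns order 2.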


Linear Systems, Spectral Representations, and Eigen Analysis Transformation by a Linear System

Transformation by a Linear System

Consider a LTI system characterized by impulse response h(n)

x(n) → h(n) → y(n)

⇒ y(n) = x(n) ∗ h(n) = Σ_{k=−∞}^{∞} h(n − k) x(k)

Example (Deterministic Case)

Suppose x(n) and h(n) are the short sequences shown below

[Figure: stem plots of x(n) and h(n)]

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 130 / 406

Linear Systems, Spectral Representations, and Eigen Analysis Transformation by a Linear System

[Figure: x(k) and the time-reversed h(−k) used in evaluating the convolution sum]

y(n) = Σ_{k=−∞}^{∞} h(n − k) x(k)

⇒ y(n) = 0 for n < 0,   y(0) = 1,   y(1) = 1/2,   y(2) = 1,   y(3) = 1/2,   y(n) = 0 for n ≥ 4

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 131 / 406
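The slide's stem plots did not survive extraction, so the sketch below assumes one pair of sequences consistent with the listed outputs, x = [1, 0, 1] and h = [1, 1/2]; it only illustrates evaluating the convolution sum with NumPy.

import numpy as np

x = np.array([1.0, 0.0, 1.0])    # assumed x(0), x(1), x(2)
h = np.array([1.0, 0.5])         # assumed h(0), h(1)
print(np.convolve(x, h))         # [1.  0.5 1.  0.5] -> y(0)=1, y(1)=1/2, y(2)=1, y(3)=1/2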


Linear Systems, Spectral Representations, and Eigen Analysis Transformation by a Linear System

[Figure: stem plot of the output sequence y(n)]

Stochastic Case: If x(n) is a stationary random process

y(n) = Σ_{k=−∞}^{∞} h(k) x(n − k)

⇒ E{y(n)} = Σ_{k=−∞}^{∞} h(k) E{x(n − k)}

⇒ m_y = m_x Σ_{k=−∞}^{∞} h(k)

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 132 / 406

Linear Systems, Spectral Representations, and Eigen Analysis Transformation by a Linear System

Next, evaluate the cross-correlation

y(n) = Σ_k h(n − k) x(k)

⇒ E{y(n) x*(n − l)} = Σ_k h(n − k) E{x(k) x*(n − l)}

⇒ r_yx(l) = Σ_k h(n − k) r_x(k − n + l)    [let m = n − k]
          = Σ_m h(m) r_x(l − m)

⇒ r_yx(l) = h(l) ∗ r_x(l)

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 133 / 406


Linear Systems, Spectral Representations, and Eigen Analysis Transformation by a Linear System

Known results:

ryx(l) = h(l) ∗ rx (l) [new result]

rx (l) = r∗x (−l) [prior result]

ryx(l) = r∗xy (−l) [prior result]

Using these results:

r*_xy(−l) = r_yx(l) = h(l) ∗ r_x(l)

⇒ r*_xy(l) = h(−l) ∗ r_x(−l) = h(−l) ∗ r*_x(l)

⇒ r_xy(l) = h*(−l) ∗ r_x(l)

Note: ryx (l) = h(l) ∗ rx (l) while rxy (l) = h∗(−l) ∗ rx (l)

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 134 / 406

Linear Systems, Spectral Representations, and Eigen Analysis Transformation by a Linear System

Lastly, examine r_y(l)

y(n) = Σ_k h(n − k) x(k)

⇒ E{y(n) y*(n − l)} = Σ_k h(n − k) E{x(k) y*(n − l)}

⇒ r_y(l) = Σ_k h(n − k) r_xy(k − n + l)    [let m = n − k]
         = Σ_m h(m) r_xy(l − m)
         = h(l) ∗ r_xy(l)

or, substituting back using r_xy(l) = h*(−l) ∗ r_x(l),

⇒ r_y(l) = h(l) ∗ h*(−l) ∗ r_x(l)

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 135 / 406


Linear Systems, Spectral Representations, and Eigen Analysis Transformation by a Linear System

Summary of Results (Correlations for a LTI System)

ryx(l) = h(l) ∗ rx (l)

rxy (l) = h∗(−l) ∗ rx (l)

r_y(l) = h(l) ∗ r_xy(l) = h(l) ∗ h*(−l) ∗ r_x(l)

Note: Similar results hold for the covariance.

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 136 / 406
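A minimal Monte Carlo sketch of these relations for a white, zero-mean input (so r_x(l) = σ_x² δ(l)); NumPy is assumed and the impulse response is an arbitrary illustrative choice.

import numpy as np

rng = np.random.default_rng(1)
sigma2_x = 1.0
x = np.sqrt(sigma2_x) * rng.standard_normal(200000)    # white, zero mean
h = np.array([1.0, 2.0, 1.0])                          # illustrative impulse response
y = np.convolve(x, h)[:len(x)]

def xcorr(a, b, l):
    # sample estimate of E{a(n) b(n - l)} for l >= 0 (real-valued signals)
    return np.mean(a[l:] * b[:len(b) - l])

print([round(xcorr(y, x, l), 2) for l in range(4)])    # r_yx(l) ~ sigma_x^2 h(l) = [1, 2, 1, 0]
print([round(xcorr(y, y, l), 2) for l in range(3)])    # r_y(l) ~ sigma_x^2 (h conv. h reversed) = [6, 4, 1]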

Linear Systems, Spectral Representations, and Eigen Analysis Transformation by a Linear System

Example

Suppose the impulse response of a LTI system is

[Figure: stem plot of a finite-length impulse response h(n)]

Question: If the system is driven by white zero mean noise, what are m_y, r_yx(l), r_xy(l), and r_y(l)?

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 137 / 406


Linear Systems, Spectral Representations, and Eigen Analysis Transformation by a Linear System

Consider m_y first:

m_y = E{y(n)} = m_x Σ_k h(k) = 0

Also note r_x(l) = σ_x² δ(l). Thus,

r_yx(l) = h(l) ∗ r_x(l) = h(l) ∗ σ_x² δ(l) = σ_x² h(l)

By the symmetry property

r_xy(l) = r*_yx(−l) = σ_x² h*(−l)

Plotting the results:

[Figure: stem plots of r_yx(l) = σ_x² h(l) and r_xy(l) = σ_x² h*(−l)]

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 138 / 406

Linear Systems, Spectral Representations, and Eigen Analysis Transformation by a Linear System

Lastly

r_y(l) = h(l) ∗ h*(−l) ∗ r_x(l) = σ_x² (h(l) ∗ h*(−l))

Generating the result graphically:

[Figure: h(n), h*(−n), the convolution h(n) ∗ h*(−n), and the resulting r_y(l) = σ_x² (h(l) ∗ h*(−l))]

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 139 / 406


Linear Systems, Spectral Representations, and Eigen Analysis Power Spectral Density

Power Spectral Density

Definition (Power Spectral Density)

The autocorrelation function of a wide-sense stationary process andthe power spectral density form a Fourier transform pair

S(ω) = Σ_{l=−∞}^{∞} r(l) e^{−jωl},   −π < ω ≤ π

r(l) = (1/2π) ∫_{−π}^{π} S(ω) e^{jωl} dω,   l = 0, ±1, ±2, · · ·

PSD Properties:

The PSD is periodic, with period 2π

S(ω + 2kπ) = S(ω) k = 0,±1,±2, . . .

⇒ All information contained in the Nyquist interval [−π, π]

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 140 / 406
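A minimal sketch of evaluating the PSD sum on a frequency grid (NumPy assumed). The autocorrelation r(l) = (1/2)^{|l|} is borrowed from a later example only to provide a closed form to check against.

import numpy as np

def psd_from_autocorr(r_fn, L, omegas):
    # truncated DTFT sum S(w) = sum_{l=-L}^{L} r(l) e^{-jwl}
    lags = np.arange(-L, L + 1)
    r = np.array([r_fn(l) for l in lags])
    return np.array([np.real(np.sum(r * np.exp(-1j * w * lags))) for w in omegas])

omegas = np.linspace(-np.pi, np.pi, 7)
S = psd_from_autocorr(lambda l: 0.5 ** abs(l), L=60, omegas=omegas)
print(np.allclose(S, 0.75 / (1.25 - np.cos(omegas)), atol=1e-6))   # True: S(w) = (3/4)/(5/4 - cos w)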

Linear Systems, Spectral Representations, and Eigen Analysis Power Spectral Density

The PSD of a stationary process is real

To prove this, note

S(ω) = Σ_{k=−∞}^{∞} r(k) e^{−jωk}
     = r(0) + Σ_{k=1}^{∞} r(k) e^{−jωk} + Σ_{k=−∞}^{−1} r(k) e^{−jωk}
     = r(0) + Σ_{k=1}^{∞} r(k) e^{−jωk} + Σ_{k=1}^{∞} r(−k) e^{jωk}
     = r(0) + Σ_{k=1}^{∞} [r(k) e^{−jωk} + r*(k) e^{jωk}]
     = r(0) + 2 Σ_{k=1}^{∞} Re[r(k) e^{−jωk}]

QED

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 141 / 406


Linear Systems, Spectral Representations, and Eigen Analysis Power Spectral Density

For real-valued sequences, S(ω) ≥ 0 and has even symmetry (prove yourself)

The PSD is (within a constant) a density describing the concentration of power

Proof: Relate the area under the PSD to the signal power. Thus, let l = 0 in the inverse transform,

r(l) = (1/2π) ∫_{−π}^{π} S(ω) e^{jωl} dω

⇒ r(0) = (1/2π) ∫_{−π}^{π} S(ω) dω

⇒ (1/2π) ∫_{−π}^{π} S(ω) dω = E{x²(n)}   [signal power]

⇒ The area under the PSD is (scaled by the 1/2π constant) equal to the signal power. Moreover, it characterizes the signal power by frequency

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 142 / 406

Linear Systems, Spectral Representations, and Eigen Analysis Power Spectral Density

Example

If x(n) is zero mean i.i.d. with variance σ_x², what is the PSD?

In this case,

r(l) = σ_x² δ(l)

⇒ S(ω) = Σ_l r(l) e^{−jωl} = Σ_l σ_x² δ(l) e^{−jωl} = σ_x²,   −π < ω ≤ π

[Figure: S(ω) = σ_x², flat over −π < ω ≤ π]

Note: Flat spectra are said to be white because they have a uniform distribution of power

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 143 / 406


Linear Systems, Spectral Representations, and Eigen Analysis Transmission Through a Linear System

Transmission Through a Linear System

Let x(n) be a stationary discrete time process that is passed through a LTI system characterized by h(n).

x(n) → h(n) → y(n)

Then by known properties and time/frequency relations

r_yx(l) = h(l) ∗ r_x(l)   ⇒   S_yx(ω) = H(ω) S_x(ω)

where H(ω) = Σ_l h(l) e^{−jωl}

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 144 / 406

Linear Systems, Spectral Representations, and Eigen Analysis Transmission Through a Linear System

Also note that

FT{h*(−l)} = Σ_l h*(−l) e^{−jωl} = Σ_l h*(l) e^{jωl} = [Σ_l h(l) e^{−jωl}]* = H*(ω)

Thus

r_y(l) = h(l) ∗ h*(−l) ∗ r_x(l)

⇒ S_y(ω) = H(ω) H*(ω) S_x(ω) = |H(ω)|² S_x(ω)

Note: Only the PSD magnitude is affected by the system

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 145 / 406


Linear Systems, Spectral Representations, and Eigen Analysis Transmission Through a Linear System

Example

Let x(n) have correlation r_x(l) = (1/2)^{|l|} and be the input to a LTI system with impulse response

h(n) = δ(n) + δ(n − 1).

Find r_y(l) and S_y(ω)

Note r_y(l) = h(l) ∗ h(−l) ∗ r_x(l)

[Figure: graphical convolution of h(l) with h(−l), giving a triangle of height 2 at l = 0 and 1 at l = ±1]

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 146 / 406

Linear Systems, Spectral Representations, and Eigen Analysis Transmission Through a Linear System

Thus

h(l) ∗ h(−l) = δ(l + 1) + 2δ(l) + δ(l − 1)

⇒ r_y(l) = h(l) ∗ h(−l) ∗ r_x(l)
         = [δ(l + 1) + 2δ(l) + δ(l − 1)] ∗ (1/2)^{|l|}
         = 2(1/2)^{|l|} + (1/2)^{|l+1|} + (1/2)^{|l−1|}

The result is a consequence of convolving with shifted delta functions

Next, compute S_x(ω) and |H(ω)|²

Aside: Recall for |p| < 1,

Σ_{k=0}^{∞} p^k = 1/(1 − p)   and   Σ_{k=1}^{∞} p^k = p/(1 − p)

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 147 / 406


Linear Systems, Spectral Representations, and Eigen Analysis Transmission Through a Linear System

Given r_x(k) = (1/2)^{|k|}

S_x(ω) = Σ_k (1/2)^{|k|} e^{−jωk}
       = Σ_{k=0}^{∞} (1/2)^k e^{−jωk} + Σ_{k=1}^{∞} (1/2)^k e^{jωk}
       = 1/(1 − (1/2)e^{−jω}) + ((1/2)e^{jω})/(1 − (1/2)e^{jω})
       = (3/4) / (5/4 − cos(ω))

Also

H(ω) = Σ_k h(k) e^{−jωk} = Σ_k [δ(k) + δ(k − 1)] e^{−jωk} = 1 + e^{−jω}

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 148 / 406

Linear Systems, Spectral Representations, and Eigen Analysis Transmission Through a Linear System

Taking the magnitude,

|H(ω)|² = (1 + e^{−jω})(1 + e^{−jω})*
        = 1 + e^{−jω} + e^{jω} + 1
        = 2 + 2 cos(ω)
        = 2(1 + cos(ω))

Thus the output PSD is

S_y(ω) = |H(ω)|² S_x(ω)
       = 2(1 + cos(ω)) · (3/4)/(5/4 − cos(ω))
       = (3/2)(1 + cos(ω)) / (5/4 − cos(ω))

Note: The system only affects the magnitude and S_y(ω) has even symmetry. Question: Where is the power concentrated?

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 149 / 406
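A quick numerical check of this example (NumPy assumed): the power is concentrated near ω = 0, where the numerator peaks and the denominator is smallest.

import numpy as np

omegas = np.linspace(-np.pi, np.pi, 101)
Sx = 0.75 / (1.25 - np.cos(omegas))            # PSD of r_x(l) = (1/2)^{|l|}
H = 1 + np.exp(-1j * omegas)                   # H(w) for h(n) = delta(n) + delta(n-1)
Sy = np.abs(H) ** 2 * Sx
print(np.allclose(Sy, 1.5 * (1 + np.cos(omegas)) / (1.25 - np.cos(omegas))))   # True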


Linear Systems, Spectral Representations, and Eigen Analysis Spectral Factorization

Spectral Factorization

Suppose a system is described by the difference equation

Σ_{k=0}^{M} a_k x(n − k) = Σ_{l=0}^{K} b_l v(n − l)

Then taking the z-transform

Σ_{k=0}^{M} a_k z^{−k} X(z) = Σ_{l=0}^{K} b_l z^{−l} V(z)

or

X(z)/V(z) ≜ H(z) = (Σ_{l=0}^{K} b_l z^{−l}) / (Σ_{k=0}^{M} a_k z^{−k}) = B(z)/A(z)

Note: H(z) is defined as the transfer function

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 150 / 406

Linear Systems, Spectral Representations, and Eigen Analysis Spectral Factorization

All pass systems

Suppose a stable first-order system has the transfer function

H(z) = (1 − a*z)/(z − a)

Note: The system has a zero at z = 1/a* and a pole at z = a.

Let a = re^{jθ}. Then 1/a* = 1/(re^{−jθ}) = (1/r)e^{jθ}. For a stable (r < 1) system:

[Figure: pole at radius r and zero at radius 1/r, both at angle θ]

Note: Pole and zero are at the same angle

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 151 / 406


Linear Systems, Spectral Representations, and Eigen Analysis Spectral Factorization

Evaluating H(ω) = H(z)|_{z=e^{jω}}

H(ω) = (1 − a* e^{jω})/(e^{jω} − a) = e^{jω}(e^{−jω} − a*)/(e^{jω} − a)

⇒ |H(ω)| = |e^{jω}| · |e^{−jω} − a*| / |e^{jω} − a| = 1   [All pass filter]

General Result – All Pass Systems

Systems of the form

H_ap(z) = Π_{i=1}^{N} (1 − a*_i z)/(z − a_i)

are all pass. For stability, |a_i| < 1.

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 152 / 406
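A quick check that the first-order section has unit magnitude on the unit circle (NumPy assumed; the pole location is an arbitrary illustrative value).

import numpy as np

a = 0.6 * np.exp(1j * 0.8)                     # illustrative pole inside the unit circle
z = np.exp(1j * np.linspace(-np.pi, np.pi, 201))
H = (1 - np.conj(a) * z) / (z - a)             # first-order all-pass section
print(np.allclose(np.abs(H), 1.0))             # True: |H(w)| = 1 for all w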

Linear Systems, Spectral Representations, and Eigen Analysis Spectral Factorization

Minimum Phase Systems

A system is minimum phase if it is causal and has all its poles and zeros inside the unit circle

Minimum phase systems are stable

If H(z) is minimum phase, then H^{-1}(z) is minimum phase and stable

Maximum Phase Systems

A system is maximum phase if it has all its zeros outside the unit circle

The inverse of a maximum phase system is not stable

Mixed Phase Systems

A system is mixed phase if it has zeros both inside and outside the unit circle

The inverse of a mixed phase system is not stable

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 153 / 406


Linear Systems, Spectral Representations, and Eigen Analysis Spectral Factorization

Factorization Result

Any rational H(z) can be factored as

H(z) = H_min(z) H_ap(z)

where H_min(z) is minimum phase and H_ap(z) is all pass.

Consider a non-minimum phase H(z) with a (single) zero outside the unit circle, at z = 1/a* where |a| < 1

H(z) = H_1(z)(1 − a*z)                              [factor out the non-min. phase component]
     = H_1(z)(1 − a*z) · (z − a)/(z − a)
     = [H_1(z)(z − a)] · [(1 − a*z)/(z − a)]
        minimum phase       all pass

Note: The magnitude response is determined by the minimum phase component, while both terms contribute to the phase response

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 154 / 406

Linear Systems, Spectral Representations, and Eigen Analysis Spectral Factorization

Example

Consider a stable system with a single zero outside the unit circle (maximum phase system):

[Figure: pole–zero plots of H(z) (one zero outside the unit circle), H_min(z) (zero reflected inside; minimum phase), and H_ap(z) (original zero plus a reflected, canceling pole; all pass)]

Note: The zero in H_min(z) and the pole in H_ap(z) cancel each other out

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 155 / 406


Linear Systems, Spectral Representations, and Eigen Analysis Eigen Analysis

Eigen Analysis

Objective: Utilize tools from linear algebra to characterize and analyze matrices, especially the correlation matrix

The correlation matrix plays a large role in statistical characterization and processing.

Previous result: R is Hermitian.

Further insight into the correlation matrix is achieved through eigen analysis

Eigenvalues and eigenvectors
Matrix diagonalization
Application: Optimum filtering problems

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 156 / 406

Linear Systems, Spectral Representations, and Eigen Analysis Eigen Analysis

Objective: For a Hermitian matrix R, find a vector q satisfying

Rq = λq

Interpretation: Linear transformation by R changes the scale, but not the direction, of q

Fact: An M × M matrix R has M eigenvectors and eigenvalues

Rq_i = λ_i q_i,   i = 1, 2, 3, · · · , M

To see this, note

(R − λI)q = 0

For this to be true, the rows/columns of (R − λI) must be linearly dependent,

⇒ det(R − λI) = 0

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 157 / 406


Linear Systems, Spectral Representations, and Eigen Analysis Eigen Analysis

Note: det(R − λI) is an Mth order polynomial in λ

The roots of the polynomial are the eigenvalues λ_1, λ_2, · · · , λ_M

Rq_i = λ_i q_i

Each eigenvector q_i is associated with one eigenvalue λ_i

The eigenvectors are not unique:

Rq_i = λ_i q_i   ⇒   R(aq_i) = λ_i (aq_i)

Consequence: eigenvectors are generally normalized, e.g., |q_i| = 1 for i = 1, 2, . . . , M

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 158 / 406

Linear Systems, Spectral Representations, and Eigen Analysis Eigen Analysis

Example (General two dimensional case)

Let M = 2 and

R = ⎡ R_{1,1}  R_{1,2} ⎤
    ⎣ R_{2,1}  R_{2,2} ⎦

Determine the eigenvalues and eigenvectors.

Thus

det(R − λI) = 0

⇒ det ⎡ R_{1,1} − λ    R_{1,2}     ⎤ = 0
      ⎣ R_{2,1}        R_{2,2} − λ ⎦

⇒ λ² − λ(R_{1,1} + R_{2,2}) + (R_{1,1}R_{2,2} − R_{1,2}R_{2,1}) = 0

⇒ λ_{1,2} = (1/2)[ (R_{1,1} + R_{2,2}) ± √( 4R_{1,2}R_{2,1} + (R_{1,1} − R_{2,2})² ) ]

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 159 / 406


Linear Systems, Spectral Representations, and Eigen Analysis Eigen Analysis

Back substitution yields the eigenvectors:

⎡ R_{1,1} − λ    R_{1,2}     ⎤ ⎡ q_1 ⎤   ⎡ 0 ⎤
⎣ R_{2,1}        R_{2,2} − λ ⎦ ⎣ q_2 ⎦ = ⎣ 0 ⎦

In general, this yields a set of linear equations. In the M = 2 case:

(R_{1,1} − λ) q_1 + R_{1,2} q_2 = 0
R_{2,1} q_1 + (R_{2,2} − λ) q_2 = 0

Solving the set of linear equations for a specific eigenvalue λ_i yields the corresponding eigenvector, q_i

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 160 / 406

Linear Systems, Spectral Representations, and Eigen Analysis Eigen Analysis

Example (Two-dimensional white noise)

Let R be the correlation matrix of a two-sample vector of zero mean white noise

R = ⎡ σ²  0  ⎤
    ⎣ 0   σ² ⎦

Determine the eigenvalues and eigenvectors.

Carrying out the analysis yields eigenvalues

λ_{1,2} = (1/2)[ (R_{1,1} + R_{2,2}) ± √( 4R_{1,2}R_{2,1} + (R_{1,1} − R_{2,2})² ) ]
        = (1/2)[ (σ² + σ²) ± √( 0 + (σ² − σ²)² ) ] = σ²

and eigenvectors

q_1 = [1, 0]^T   and   q_2 = [0, 1]^T

Note: The eigenvectors are unit length (and orthogonal)

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 161 / 406
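A minimal sketch verifying the 2 × 2 eigenvalue formula with NumPy for an illustrative symmetric correlation matrix.

import numpy as np

R = np.array([[2.0, 0.5],
              [0.5, 1.0]])                     # illustrative Hermitian (real symmetric) R
lam, Q = np.linalg.eigh(R)                     # eigenvalues (ascending) and orthonormal eigenvectors
disc = np.sqrt(4 * R[0, 1] * R[1, 0] + (R[0, 0] - R[1, 1]) ** 2)
lam_closed = 0.5 * np.array([R[0, 0] + R[1, 1] - disc, R[0, 0] + R[1, 1] + disc])
print(np.allclose(lam, lam_closed), np.allclose(R @ Q, Q @ np.diag(lam)))   # True True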


Linear Systems, Spectral Representations, and Eigen Analysis Eigen Properties

Eigen Properties

Property (eigenvalues of R^k)

If λ_1, λ_2, · · · , λ_M are the eigenvalues of R, then λ_1^k, λ_2^k, · · · , λ_M^k are the eigenvalues of R^k.

Proof: Note Rq_i = λ_i q_i. Multiplying both sides by R (k − 1) times,

R^k q_i = λ_i R^{k−1} q_i = λ_i^k q_i

Property (linear independence of eigenvectors)

The eigenvectors q_1, q_2, · · · , q_M of R are linearly independent, i.e.,

Σ_{i=1}^{M} a_i q_i ≠ 0

for all nonzero scalars a_1, a_2, · · · , a_M.

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 162 / 406

Linear Systems, Spectral Representations, and Eigen Analysis Eigen Properties

Property (Correlation matrix eigenvalues are real & nonnegative)

The eigenvalues of R are real and nonnegative.

Proof:

Rq_i = λ_i q_i

⇒ q_i^H R q_i = λ_i q_i^H q_i   [pre-multiply by q_i^H]

⇒ λ_i = (q_i^H R q_i) / (q_i^H q_i) ≥ 0

This follows from the facts that R is positive semi-definite and q_i^H q_i = |q_i|² > 0

Note: In most cases, R is positive definite and

λ_i > 0,   i = 1, 2, · · · , M

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 163 / 406


Linear Systems, Spectral Representations, and Eigen Analysis Eigen Properties

Property (Unique eigenvalues ⇒ orthogonal eigenvectors)

If λ_1, λ_2, · · · , λ_M are unique eigenvalues of R, then the corresponding eigenvectors, q_1, q_2, · · · , q_M, are orthogonal.

Proof:

Rq_i = λ_i q_i

⇒ q_j^H R q_i = λ_i q_j^H q_i   (∗)

Also, since λ_j is real and R is Hermitian

Rq_j = λ_j q_j

⇒ q_j^H R = λ_j q_j^H

⇒ q_j^H R q_i = λ_j q_j^H q_i

Substituting the LHS from (∗)

⇒ λ_i q_j^H q_i = λ_j q_j^H q_i

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 164 / 406

Linear Systems, Spectral Representations, and Eigen Analysis Eigen Properties

Thus

λ_i q_j^H q_i = λ_j q_j^H q_i

⇒ (λ_i − λ_j) q_j^H q_i = 0

Since λ_1, λ_2, · · · , λ_M are unique

q_j^H q_i = 0,   i ≠ j

⇒ q_1, q_2, · · · , q_M are orthogonal.

QED

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 165 / 406


Linear Systems, Spectral Representations, and Eigen Analysis Eigen Properties

Diagonalization of R

Objective: Find a transformation that transforms the correlation matrix into a diagonal matrix.

Let λ_1, λ_2, · · · , λ_M be unique eigenvalues of R and take q_1, q_2, · · · , q_M to be the M orthonormal eigenvectors

q_i^H q_j = 1 if i = j,   0 if i ≠ j

Define Q = [q_1, q_2, · · · , q_M] and Ω = diag(λ_1, λ_2, · · · , λ_M). Then consider

Q^H R Q = [q_1^H; q_2^H; · · · ; q_M^H] R [q_1, q_2, · · · , q_M]

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 166 / 406

Linear Systems, Spectral Representations, and Eigen Analysis Eigen Properties

Q^H R Q = [q_1^H; q_2^H; · · · ; q_M^H] R [q_1, q_2, · · · , q_M]
        = [q_1^H; q_2^H; · · · ; q_M^H] [λ_1 q_1, λ_2 q_2, · · · , λ_M q_M]
        = diag(λ_1, λ_2, · · · , λ_M)

⇒ Q^H R Q = Ω   (eigenvector diagonalization of R)

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 167 / 406


Linear Systems, Spectral Representations, and Eigen Analysis Eigen Properties

Property (Q is unitary)

Q is unitary, i.e., Q^{-1} = Q^H

Proof: Since the q_i eigenvectors are orthonormal

Q^H Q = [q_1^H; q_2^H; · · · ; q_M^H] [q_1, q_2, · · · , q_M] = I

⇒ Q^{-1} = Q^H

Property (Eigen decomposition of R)

The correlation matrix can be expressed as

R = Σ_{i=1}^{M} λ_i q_i q_i^H

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 168 / 406

Linear Systems, Spectral Representations, and Eigen Analysis Eigen Properties

Proof: The correlation diagonalization result states

Q^H R Q = Ω

Isolating R and expanding,

R = Q Ω Q^H = [q_1, q_2, · · · , q_M] Ω [q_1^H; q_2^H; · · · ; q_M^H]
  = [q_1, q_2, · · · , q_M] [λ_1 q_1^H; λ_2 q_2^H; · · · ; λ_M q_M^H]
  = Σ_{i=1}^{M} λ_i q_i q_i^H

Note: This also gives

R^{-1} = (Q^H)^{-1} Ω^{-1} Q^{-1} = Q Ω^{-1} Q^H

where Ω^{-1} = diag(1/λ_1, 1/λ_2, · · · , 1/λ_M)

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 169 / 406
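A minimal numerical check of the diagonalization and decomposition results (NumPy assumed; R is an illustrative Hermitian matrix).

import numpy as np

R = np.array([[3.0, 1.0, 0.2],
              [1.0, 2.0, 0.5],
              [0.2, 0.5, 1.0]])                # illustrative positive definite Hermitian matrix
lam, Q = np.linalg.eigh(R)
print(np.allclose(Q.conj().T @ R @ Q, np.diag(lam)))                             # Q^H R Q = Omega
print(np.allclose(R, sum(l * np.outer(q, q.conj()) for l, q in zip(lam, Q.T))))  # R = sum lambda_i q_i q_i^H
print(np.allclose(np.linalg.inv(R), Q @ np.diag(1 / lam) @ Q.conj().T))          # R^{-1} = Q Omega^{-1} Q^H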


Linear Systems, Spectral Representations, and Eigen Analysis Eigen Properties

Aside (trace & determinant for matrix products)

Note trace(A) ≜ Σ_i A_{i,i}. Also,

trace(AB) = trace(BA)   and similarly   det(AB) = det(A) det(B)

Property (Determinant–Eigenvalue Relation)

The determinant of the correlation matrix is related to the eigenvalues as follows:

det(R) = Π_{i=1}^{M} λ_i

Proof: Using R = Q Ω Q^H and the above,

det(R) = det(Q Ω Q^H) = det(Q) det(Q^H) det(Ω) = det(Ω) = Π_{i=1}^{M} λ_i

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 170 / 406

Linear Systems, Spectral Representations, and Eigen Analysis Eigen Properties

Property (Trace–Eigenvalue Relation)

The trace of the correlation matrix is related to the eigenvalues as follows:

trace(R) = Σ_{i=1}^{M} λ_i

Proof: Note

trace(R) = trace(Q Ω Q^H) = trace(Q^H Q Ω) = trace(Ω) = Σ_{i=1}^{M} λ_i

QED

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 171 / 406


Linear Systems, Spectral Representations, and Eigen Analysis Eigen Properties

Definition (Normal Matrix)

A complex square matrix A is a normal matrix if

A^H A = A A^H

That is, a matrix is normal if it commutes with its conjugate transpose.

Note

All Hermitian symmetric matrices are normal
Every matrix that can be diagonalized by a unitary transform is normal

Definition (Condition Number)

The condition number reflects how numerically well-conditioned a problem is, i.e., a low condition number ⇒ well-conditioned; a high condition number ⇒ ill-conditioned.

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 172 / 406

Linear Systems, Spectral Representations, and Eigen Analysis Eigen Properties

Definition (Condition Number for Linear Systems)

For a linear system

Ax = b

defined by a normal matrix A, the condition number is

χ(A) = λ_max / λ_min

where λ_max and λ_min are the maximum/minimum eigenvalues of A

Observations:

Large eigenvalue spread ⇒ ill-conditioned
Small eigenvalue spread ⇒ well-conditioned

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 173 / 406


Linear Systems, Spectral Representations, and Eigen Analysis Eigen Properties

Sensitivity Analysis: Suppose a system (filter) is related as:

Rw = d

where w defines the filter parameters; R and d are signal statistic matrices

⇒ Introduce small signal statistic perturbations

Let R and d be perturbed such that ||δR||/||R|| and ||δd||/||d|| are on the order of ε ≪ 1

Result: A bound on the resulting parameter perturbations is given by

||δw||/||w|| ≤ ε χ(R) = ε λ_max/λ_min

Consequence: If R is ill conditioned, small changes in R or d can lead to big changes in w

⇒ System sensitivity is related to eigenvalue spread

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 174 / 406

Linear Systems, Spectral Representations, and Eigen Analysis The discrete Karhunen-Loeve Transform

The discrete Karhunen-Loeve Transform (KLT)

Definition (The discrete Karhunen-Loeve Transform (KLT))

An M-sample vector x(n) from the process {x(n)} can be expressed as

x(n) = Σ_{i=1}^{M} c_i(n) q_i

where q_1, q_2, · · · , q_M are the orthonormal eigenvectors of the process correlation matrix, R, and c_1(n), c_2(n), · · · , c_M(n) are a set of KLT coefficients.

Signal is represented as a weighted sum of eigenvectors

Need to determine the coefficients

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 175 / 406


Linear Systems, Spectral Representations, and Eigen Analysis The discrete Karhunen-Loeve Transform

Determining Coefficients: Write the expression in matrix form

x(n) = Σ_{i=1}^{M} c_i(n) q_i = Qc(n)   (∗)

where

Q = [q_1, q_2, · · · , q_M]

and

c(n) = [c_1(n), c_2(n), · · · , c_M(n)]^T.

Solving (∗) for c(n):

c(n) = Q^{-1} x(n) = Q^H x(n)

or

c_i(n) = q_i^H x(n)

Note: c_i(n) is the projection of x(n) onto q_i

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 176 / 406

Linear Systems, Spectral Representations, and Eigen Analysis The discrete Karhunen-Loeve Transform

Question: How related are the coefficients to each other?

Answer: Consider the correlation between the c_i(n) terms

R_c(n) = E{c(n) c^H(n)}
       = E{(Q^H x(n))(Q^H x(n))^H}
       = E{Q^H x(n) x^H(n) Q}
       = Q^H R_x Q
       = Ω

Result:

E{c*_i(n) c_j(n)} = λ_i if i = j,   0 otherwise

⇒ KLT transform coefficients are uncorrelated

A desirable property – Why?

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 177 / 406


Linear Systems, Spectral Representations, and Eigen Analysis The discrete Karhunen-Loeve Transform

Question: Can we represent x(n) with fewer terms? If so, how do we minimize the representation error?

Approach: Use fewer terms in the KLT expansion

x(n) = Σ_{i=1}^{M} c_i(n) q_i   ⇒   x̂(n) = Σ_{i=1}^{N} c_i(n) q_i,   N < M

Thus

x(n) = x̂(n) + ε(n) = Σ_{i=1}^{N} c_i(n) q_i + Σ_{i=N+1}^{M} c_i(n) q_i

Question: How do we minimize the representation error?

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 178 / 406

Linear Systems, Spectral Representations, and Eigen Analysis The discrete Karhunen-Loeve Transform

Approach: Analyze and minimize the error power

The error power is given by

ε̄ = E{ε^H(n) ε(n)}
  = E{ Σ_{i=N+1}^{M} c*_i(n) q_i^H  Σ_{j=N+1}^{M} c_j(n) q_j }
  = Σ_{i=N+1}^{M} E{c*_i(n) c_i(n)}   [result of orthogonality]
  = Σ_{i=N+1}^{M} λ_i   [from prior result]

Result: To minimize the error, select the q_i eigenvectors associated with the N largest eigenvalues.

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 179 / 406
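A minimal sketch of KLT truncation (NumPy assumed; the correlation matrix and data are synthetic and illustrative): the average squared truncation error matches the sum of the discarded eigenvalues.

import numpy as np

rng = np.random.default_rng(2)
M, N, T = 4, 2, 200000
R = 0.8 ** np.abs(np.subtract.outer(np.arange(M), np.arange(M)))   # illustrative r(l) = 0.8^{|l|}
x = rng.multivariate_normal(np.zeros(M), R, size=T)                # rows are M-sample vectors x(n)

lam, Q = np.linalg.eigh(R)
order = np.argsort(lam)[::-1]                                      # sort eigenvalues descending
lam, Q = lam[order], Q[:, order]

c = x @ Q                                                          # KLT coefficients c_i(n) = q_i^H x(n)
x_hat = c[:, :N] @ Q[:, :N].T                                      # keep the N leading eigenvectors
print(np.mean(np.sum((x - x_hat) ** 2, axis=1)), lam[N:].sum())    # approximately equal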


Linear Systems, Spectral Representations, and Eigen Analysis The Matched Filter

The Matched Filter

Objective: Find the optimal filter coefficients for two cases: (1) deterministic signals and (2) stochastic signals

Signal model:

x(n) = u(n) + v(n)

where u(n) is the signal and v(n) is the noise

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 180 / 406

Linear Systems, Spectral Representations, and Eigen Analysis The Matched Filter

Definitions:

Filter parameters: w = [w1, w2, · · · , wM ]T

Observation vector: x(n) = [x(n), x(n − 1), · · · , x(n − M + 1)]T

Then using x(n) = u(n) + v(n) and linearity,

y(n) = w^T x(n) = w^T u(n) + w^T v(n) = y_s(n) + y_n(n)

where

y_s(n) = w^T u(n)
y_n(n) = w^T v(n)

Consider the deterministic and stochastic cases separately

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 181 / 406


Linear Systems, Spectral Representations, and Eigen Analysis The Matched Filter

Case 1: Matched Filters for Deterministic Signals

Approach:

Analyze the output SNR
Set the filter coefficients to maximize the SNR

The SNR at time n can be defined as

SNR = |y_s(n)|² / E{|y_n(n)|²}

where

y_s(n) = w^T u(n)   [Deterministic]
y_n(n) = w^T v(n)   [Stochastic]

Note: This case assumes u(n) is deterministic

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 182 / 406

Linear Systems, Spectral Representations, and Eigen Analysis The Matched Filter

Using y_s(n) = w^T u(n) and y_n(n) = w^T v(n),

SNR = |y_s(n)|² / E{|y_n(n)|²}
    = (w^T u(n))* (u^T(n) w) / E{(w^T v(n))* (v^T(n) w)}
    = w^H u*(n) u^T(n) w / E{w^H v*(n) v^T(n) w}
    = w^H u*(n) u^T(n) w / (w^H R_v w)

To analyze further, restrict the noise statistics

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 183 / 406


Linear Systems, Spectral Representations, and Eigen Analysis The Matched Filter

Suppose v(n) is zero mean and uncorrelated, R_v = σ_v² I. Then

SNR = w^H u*(n) u^T(n) w / (w^H R_v w) = |w^H u*(n)|² / (σ_v² w^H w)

If we restrict |w|² = 1, then the final result is

SNR = |w^H u*(n)|² / σ_v²

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 184 / 406

Linear Systems, Spectral Representations, and Eigen Analysis The Matched Filter

Definition (Cauchy–Schwarz Inequality (1821 disc.; 1859 cont.))

The Cauchy-Schwarz inequality states (vector case)

|A^H B|² ≤ (A^H A)(B^H B)

with equality only if A = kB, where k is a constant

Thus

SNR = |w^H u*(n)|² / σ_v² ≤ (w^H w)(u^T(n) u*(n)) / σ_v² = u^H(n) u(n) / σ_v² = |u(n)|² / σ_v²

Result: The SNR is maximized when w = k u*(n)

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 185 / 406


Linear Systems, Spectral Representations, and Eigen Analysis The Matched Filter

Example

A communications system sends out one of two signals

[Figure: the two four-sample transmitted pulses; Symbol "0" is the negative of Symbol "1"]

We receive

y(n) = u_i(n) + v(n)

where v(n) is i.i.d. Gaussian.

Objective: Find the optimal w (assume n = 3)

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 186 / 406

Linear Systems, Spectral Representations, and Eigen Analysis The Matched Filter

Set w = k u_1(n) where k is such that |w|² = 1. Thus,

w = u_1(n)/|u_1(n)| = (1/√10)[2, 2, 1, 1]^T

The resulting impulse response of the filter is

[Figure: stem plot of the filter impulse response, with values 2/√10 and 1/√10]

Observation: The optimal filter impulse response is the sent (desired) signal, time-reversed & normalized

Determine y_s(n) for signal "1" sent

y_s(n) = w^T u_1(n) = |u_1(n)|²/|u_1(n)| = √10

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 187 / 406


Linear Systems, Spectral Representations, and Eigen Analysis The Matched Filter

For signal "0"

y_s(n) = w^T u_0(n) = −|u_0(n)|²/|u_1(n)| = −√10

Received signal distributions:

[Figure: the two conditional densities f_y(y) of the filter output, centered at −√10 and +√10]

The SNR for the optimized system is

SNR = |u_1(n)|²/σ_v² = 10/σ_v²

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 188 / 406
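A minimal sketch of this example (NumPy assumed; the noise level is illustrative): correlate noisy observation vectors with w = u_1/|u_1| and check the noise-free outputs ±√10 and the output SNR 10/σ_v².

import numpy as np

u1 = np.array([2.0, 2.0, 1.0, 1.0])            # symbol "1" observation vector; symbol "0" is -u1
w = u1 / np.linalg.norm(u1)                    # matched filter weights, |w| = 1
print(w @ u1, w @ (-u1))                       # +sqrt(10), -sqrt(10)

sigma_v = 0.5                                  # illustrative noise standard deviation
rng = np.random.default_rng(3)
y = w @ (u1[:, None] + sigma_v * rng.standard_normal((4, 100000)))
print(np.mean(y) ** 2 / np.var(y), 10 / sigma_v ** 2)   # both approximately 40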

Linear Systems, Spectral Representations, and Eigen Analysis Matched Filter for Stochastic Signals

Case 2: Matched Filter for Stochastic Signals

As before

y(n) = w^T x(n) = w^T u(n) + w^T v(n) = y_s(n) + y_n(n)

Now define the SNR as

SNR = E{|y_s(n)|²} / E{|y_n(n)|²} = (w^H R_u w) / (w^H R_v w)

As before, if R_v = σ_v² I and |w|² = 1,

SNR = (w^H R_u w) / σ_v²

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 189 / 406


Linear Systems, Spectral Representations, and Eigen Analysis Matched Filter for Stochastic Signals

Since

SNR = (w^H R_u w) / σ_v²,

maximizing the SNR is equivalent to maximizing |w^H R_u w|

Recall from Cauchy–Schwarz,

|A^H B|² ≤ (A^H A)(B^H B)

with equality only if A = kB

Let A^H = w^H R_u and B = w and apply Cauchy–Schwarz

Result: |w^H R_u w| is maximized for A = kB

⇒ R_u w = kw

⇒ k is an eigenvalue and w is an eigenvector!

w = q_i and k = λ_i, where R_u q_i = λ_i q_i

Question: How do we choose which eigenvector?

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 190 / 406

Linear Systems, Spectral Representations, and Eigen Analysis Matched Filter for Stochastic Signals

Using w = q_i, or w^H = q_i^H, and R_u w = R_u q_i = λ_i q_i gives

SNR = (w^H R_u w) / σ_v² = (q_i^H λ_i q_i) / σ_v²   [orthonormal eigenvectors]
    = λ_i / σ_v²

Result: The SNR is maximized when w = q_max, the eigenvector associated with the largest eigenvalue of R_u.

When using the optimal filter weights (w = q_max), the resulting (maximized) SNR is

SNR = λ_max / σ_v²

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 191 / 406
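A minimal sketch of the stochastic case (NumPy assumed; R_u is an illustrative signal correlation matrix): the dominant eigenvector attains SNR = λ_max/σ_v², and random unit-norm weight vectors do not exceed it.

import numpy as np

rng = np.random.default_rng(4)
sigma2_v = 1.0
Ru = 0.9 ** np.abs(np.subtract.outer(np.arange(4), np.arange(4)))   # illustrative R_u
lam, Q = np.linalg.eigh(Ru)
w_opt = Q[:, -1]                                                    # eigenvector of the largest eigenvalue

def snr(w):
    return (w @ Ru @ w) / (sigma2_v * (w @ w))

print(snr(w_opt), lam[-1] / sigma2_v)                               # equal: lambda_max / sigma_v^2
print(max(snr(w / np.linalg.norm(w)) for w in rng.standard_normal((100, 4))))   # never exceeds it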


Maximum Likelihood and Bayes Estimation ML Estimation

Maximum Likelihood and Bayes Estimation

Estimation

Estimation is the inference of unknown quantities. Two cases are considered:

1 Quantity is fixed, but unknown – parameter estimation
2 Quantity is random and unknown – random variable estimation

Parameter Estimation

Consider a set of observations forming a vector

x = [x1, x2, · · · , xN ]T

Assumption: The x_i RVs come from a known density governed by an unknown (but fixed) parameter θ

Objective: Estimate θ. What optimality criteria should be used?

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 193 / 406

Maximum Likelihood and Bayes Estimation ML Estimation

Definition (Maximum Likelihood Estimation)

The maximum likelihood estimate of θ is the value θ_ML(x) which makes the x observations most likely

θ_ML(x) = argmax_θ f_{x|θ}(x|θ)

Example

Let xi ∼ N(μ, σ2). Given N observations, find the ML estimate of μ.

[Figure: candidate Gaussian densities with different means μ against the observed samples x]

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 194 / 406


Maximum Likelihood and Bayes Estimation ML Estimation

For i.i.d. samples

f_{x|μ}(x|μ) = Π_{i=1}^{N} f_{x_i|μ}(x_i|μ)
             = Π_{i=1}^{N} (1/√(2πσ²)) e^{−(x_i−μ)²/(2σ²)}   [Gaussian case]
             ≜ likelihood function

Thus the estimate of the mean is set as

μ̂ = argmax_μ f_{x|μ}(x|μ)

Interpretation: Set the distribution mean to the value that makes obtaining the observed samples most likely.

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 195 / 406

Maximum Likelihood and Bayes Estimation ML Estimation

Note: Maximizing f_{x|μ}(x|μ) is equivalent to maximizing any monotonic function of f_{x|μ}(x|μ). Choosing ln(·),

ln(f_{x|μ}(x|μ)) = ln( Π_{i=1}^{N} (1/√(2πσ²)) e^{−(x_i−μ)²/(2σ²)} )
                 = −N ln(√(2πσ²)) − Σ_{i=1}^{N} (x_i − μ)²/(2σ²)
                 = −N ln(√(2πσ²)) − Σ_{i=1}^{N} x_i²/(2σ²) + μ Σ_{i=1}^{N} x_i/σ² − Σ_{i=1}^{N} μ²/(2σ²)

Taking the derivative and equating to 0,

∂ ln(f_{x|μ}(x|μ))/∂μ = Σ_{i=1}^{N} x_i/σ² − Nμ/σ² = 0

⇒ μ̂ = (1/N) Σ_{i=1}^{N} x_i   ≜ sample mean

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 196 / 406


Maximum Likelihood and Bayes Estimation ML Estimation

General Maximum Likelihood Result

General Statement: The ML estimate of θ is

θ_ML(x) = argmax_θ f_{x|θ}(x|θ)

Solution: The ML estimate of θ is obtained as the solution to

∂/∂θ f_{x|θ}(x|θ) |_{θ=θ_ML} = 0

or

∂/∂θ ln[f_{x|θ}(x|θ)] |_{θ=θ_ML} = 0

f_{x|θ}(x|θ) is the likelihood function of θ.

θ_ML is a RV since it is a function of the RVs x_1, x_2, · · · , x_N

Historical Note: ML estimation was pioneered by geneticist and statistician Sir R. A. Fisher between 1912 and 1922

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 197 / 406

Maximum Likelihood and Bayes Estimation ML Estimation

Example

The time between customer arrivals at a bar is a RV with distribution

f_T(T) = α e^{−αT} U(T)

Objective: Estimate the arrival rate α based on N measured arrival intervals T_1, T_2, · · · , T_N.

Assuming that the arrivals are independent,

f(T_1, T_2, · · · , T_N) = Π_{i=1}^{N} f_T(T_i) = Π_{i=1}^{N} α e^{−αT_i} = α^N e^{−α Σ_{i=1}^{N} T_i}

⇒ ln[f(T_1, T_2, · · · , T_N)] = N ln(α) − α Σ_{i=1}^{N} T_i

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 198 / 406


Maximum Likelihood and Bayes Estimation ML Estimation

Taking the derivative and equating to 0,

∂/∂α ln[f(T_1, T_2, · · · , T_N)] = ∂/∂α [N ln(α) − α Σ_{i=1}^{N} T_i]
                                 = N/α − Σ_{i=1}^{N} T_i = 0

Solving for α gives the ML estimate

⇒ α_ML = 1 / ((1/N) Σ_{i=1}^{N} T_i) = 1/T̄

Result: The ML estimate of the arrival rate for exponentially distributed samples is the reciprocal of the sample mean arrival interval

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 199 / 406

Maximum Likelihood and Bayes Estimation Properties of Estimates

Properties of Estimates

Since θ̂_N is a function of the RVs x_1, x_2, · · · , x_N, estimates are RVs and we can state the following properties:

An estimate θ̂_N is unbiased if

E{θ̂_N} = θ,   where bias ≜ E{θ̂_N} − θ

θ̂_N is consistent (converges in probability) if

lim_{N→∞} Pr{|θ̂_N − θ| < ε} = 1 for arbitrary ε

θ̂_N is efficient in comparison to other estimators if

var(θ̂_N) < var(θ̂_other)

Note: If θ̂_N is unbiased and efficient with respect to θ̂_{N−1} for all N (i.e., var(θ̂_N) converges to 0), then θ̂_N is a consistent estimate

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 200 / 406


Maximum Likelihood and Bayes Estimation Properties of Estimates

To prove the consistent estimate result, note that by the Tchebycheff inequality

Pr{|θ̂_N − θ| > ε} ≤ var(θ̂_N)/ε²

If var(θ̂_N) < var(θ̂_{N−1}) for all N (so that var(θ̂_N) → 0), the above gives

lim_{N→∞} Pr{|θ̂_N − θ| > ε} = 0

or

lim_{N→∞} Pr{|θ̂_N − θ| < ε} = 1

That is, it converges in probability, or is consistent

QED

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 201 / 406

Maximum Likelihood and Bayes Estimation Properties of Estimates

Example

Let {x_i} be WSS with uncorrelated samples. Is the sample mean a consistent estimator for this sequence?

Step 1: Consider the bias

E{μ̂_N} = E{ (1/N) Σ_{i=1}^{N} x_i } = (1/N)(Nμ) = μ

Result: μ̂_N is unbiased

Step 2: Consider the variance

var(μ̂_N) = E{(μ̂ − μ)²}

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 202 / 406


Maximum Likelihood and Bayes Estimation Properties of Estimates

var(μ̂_N) = E{(μ̂ − μ)²}
          = E{ ((1/N) Σ_{i=1}^{N} x_i − μ)² }
          = (1/N²) E{ (Σ_{i=1}^{N} (x_i − μ))² }   [assume uncorrelated]
          = (1/N²) Σ_{i=1}^{N} E{(x_i − μ)²} + (1/N²) E{cross terms}   [cross terms = 0]
          = (1/N²) Σ_{i=1}^{N} E{(x_i − μ)²} = (1/N²)(Nσ²) = σ²/N

Result: μ̂_N is unbiased and var(μ̂_N) < var(μ̂_{N−1}) ⇒ μ̂_N is consistent

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 203 / 406
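A quick Monte Carlo sketch of this result (NumPy assumed): the sample mean is unbiased and its variance shrinks as σ²/N.

import numpy as np

rng = np.random.default_rng(5)
mu, sigma, N, trials = 2.0, 3.0, 50, 200000
means = rng.normal(mu, sigma, size=(trials, N)).mean(axis=1)   # sample mean of each trial
print(means.mean())                        # approximately mu (unbiased)
print(means.var(), sigma ** 2 / N)         # approximately sigma^2 / N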

Maximum Likelihood and Bayes Estimation Cramer-Rao Bound

Theorem (Cramer-Rao Bound (1945, 1946))

If θ̂ is an unbiased estimate of θ, then

var(θ̂) ≥ ( E{ (∂/∂θ ln[f_{x|θ}(x|θ)])² } )^{-1}

or equivalently

var(θ̂) ≥ ( −E{ ∂²/∂θ² ln[f_{x|θ}(x|θ)] } )^{-1}

where it is assumed that ∂/∂θ f_{x|θ}(x|θ) and ∂²/∂θ² f_{x|θ}(x|θ) exist

Note: If any estimate satisfies the bound with equality, it is an efficient (minimum variance) estimate

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 204 / 406


Maximum Likelihood and Bayes Estimation Cramer-Rao Bound

Proof:

Since θ̂ is unbiased

E{θ̂ − θ} = ∫_{−∞}^{∞} (θ̂ − θ) f_{x|θ}(x|θ) dx = 0

Taking the derivative with respect to θ,

∂/∂θ ∫_{−∞}^{∞} (θ̂ − θ) f_{x|θ}(x|θ) dx = 0

⇒ −∫_{−∞}^{∞} f_{x|θ}(x|θ) dx + ∫_{−∞}^{∞} (∂f_{x|θ}(x|θ)/∂θ) (θ̂ − θ) dx = 0   [first term equals −1]

⇒ ∫_{−∞}^{∞} (∂f_{x|θ}(x|θ)/∂θ) (θ̂ − θ) dx = 1   (∗)

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 205 / 406

Maximum Likelihood and Bayes Estimation Cramer-Rao Bound

Note the following equality

(∂ ln[f_{x|θ}(x|θ)]/∂θ) f_{x|θ}(x|θ) = ∂f_{x|θ}(x|θ)/∂θ

Using this in (∗)

∫_{−∞}^{∞} (∂f_{x|θ}(x|θ)/∂θ) (θ̂ − θ) dx = 1

⇒ ∫_{−∞}^{∞} (∂ ln[f_{x|θ}(x|θ)]/∂θ) f_{x|θ}(x|θ) (θ̂ − θ) dx = 1

This can be equivalently expressed as

( ∫_{−∞}^{∞} ( (∂ ln[f_{x|θ}(x|θ)]/∂θ) √f_{x|θ}(x|θ) ) ( √f_{x|θ}(x|θ) (θ̂ − θ) ) dx )² = 1

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 206 / 406


Maximum Likelihood and Bayes Estimation Cramer-Rao Bound

Definition (Cauchy–Schwarz Inequality (1821 discrete; 1859 continuous))
The Cauchy–Schwarz inequality states (for square-integrable complex-valued functions)
| ∫ f(x) g(x) dx |² ≤ ∫ |f(x)|² dx · ∫ |g(x)|² dx
with equality only if f(x) = k · g(x), where k is a constant
Thus
( ∫_{−∞}^{∞} ( (∂ ln[f_{x|θ}(x|θ)]/∂θ) √f_{x|θ}(x|θ) ) ( √f_{x|θ}(x|θ) (θ̂ − θ) ) dx )² = 1
⇒ ( ∫_{−∞}^{∞} (∂ ln[f_{x|θ}(x|θ)]/∂θ)² f_{x|θ}(x|θ) dx ) ( ∫_{−∞}^{∞} (θ̂ − θ)² f_{x|θ}(x|θ) dx ) ≥ 1


Maximum Likelihood and Bayes Estimation Cramer-Rao Bound

Note that
∫_{−∞}^{∞} (θ̂ − θ)² f_{x|θ}(x|θ) dx = var(θ̂)   (∗)
and
∫_{−∞}^{∞} (∂ ln[f_{x|θ}(x|θ)]/∂θ)² f_{x|θ}(x|θ) dx = E{ (∂ ln[f_{x|θ}(x|θ)]/∂θ)² }   (∗∗)
Thus, using (∗) and (∗∗) in the inequality above,
var(θ̂) ≥ [ E{ (∂ ln[f_{x|θ}(x|θ)]/∂θ)² } ]⁻¹
with equality iff ∂/∂θ ln[f_{x|θ}(x|θ)] = k(θ̂ − θ)
QED


Maximum Likelihood and Bayes Estimation Cramer-Rao Bound

Thus the bound is met with equality iff
∂/∂θ ln[f_{x|θ}(x|θ)] = k(θ̂ − θ)
Let θ = θ̂_ML in the above:
∂/∂θ ln[f_{x|θ}(x|θ)] |_{θ=θ̂_ML} = 0   [by the ML criterion]
= k(θ̂ − θ) |_{θ=θ̂_ML}
Therefore the RHS must equal zero, or θ̂ = θ̂_ML
Result: If an efficient estimate (one that satisfies the bound with equality) exists, then it is the ML estimate
Note: If an efficient estimator doesn't exist, then we don't know how good θ̂_ML is


Maximum Likelihood and Bayes Estimation Bayes Estimation

Definition (Bayes Estimation)
Objective: Estimate a random parameter (RV) y from observation samples x_1, x_2, ···, x_n that are statistically related to y by f_{y|x}(·)
Bayes Procedure: Define a nonnegative cost function C(y, ŷ) and set ŷ to minimize the expected cost, or risk,
R = E{C(y, ŷ)}
Since y and ŷ are RVs,
R = ∫_{−∞}^{∞} ∫_{−∞}^{∞} C(y, ŷ) f_{y,x}(y, x) dy dx = ∫_{−∞}^{∞} [ ∫_{−∞}^{∞} C(y, ŷ) f_{y|x}(y|x) dy ] f_x(x) dx
where the bracketed term is denoted I(ŷ)
Note: Minimizing I(ŷ) for each x is equivalent to minimizing R, since f_x(x) ≥ 0


Maximum Likelihood and Bayes Estimation Bayes Estimation

Consider several cost functions
Case 1: Mean squared cost function
C(y, ŷ) = |y − ŷ|²
In this case,
I(ŷ) = ∫_{−∞}^{∞} (y − ŷ)² f_{y|x}(y|x) dy
⇒ ∂I(ŷ)/∂ŷ = −2 ∫_{−∞}^{∞} (y − ŷ) f_{y|x}(y|x) dy = 0


Maximum Likelihood and Bayes Estimation Bayes Estimation

or, rearranging,
∫_{−∞}^{∞} ŷ f_{y|x}(y|x) dy = ∫_{−∞}^{∞} y f_{y|x}(y|x) dy   [ŷ is a constant with respect to y]
⇒ ŷ_MS = ∫_{−∞}^{∞} y f_{y|x}(y|x) dy = E{y|x}

Example

Let x_i = a + μ_i for i = 1, 2, ···, N, where the μ_i ∼ N(0, σ²) are i.i.d. and a ∼ N(0, σ_a²) is independent of the noise. Determine â_MS(x).
Note
f_{x|a}(x|a) = ∏_{i=1}^{N} (1/(√(2π)σ)) e^{−(x_i−a)²/(2σ²)} = (1/(2πσ²))^{N/2} e^{−(1/2) ∑_{i=1}^{N} (x_i−a)²/σ²}   (∗)
f_a(a) = (1/(√(2π)σ_a)) e^{−a²/(2σ_a²)}   (∗∗)


Maximum Likelihood and Bayes Estimation Bayes Estimation

To find â_MS(x) we need
â_MS(x) = E{a|x}
By Bayes' theorem we can write
f_{a|x}(a|x) = f_{x|a}(x|a) f_a(a) / f_x(x)
Substituting in (∗) and (∗∗) and rearranging,
f_{a|x}(a|x) = (1/(2πσ²))^{N/2} (1/(√(2π)σ_a)) e^{−(1/2)( ∑_{i=1}^{N} (x_i−a)²/σ² + a²/σ_a² )} / f_x(x)
This can be compactly written as
f_{a|x}(a|x) = C(x) exp{ −(1/(2σ_p²)) [ a − (σ_a²/(σ_a² + σ²/N)) ((1/N) ∑_{i=1}^{N} x_i) ]² }


Maximum Likelihood and Bayes Estimation Bayes Estimation

Observations on
f_{a|x}(a|x) = C(x) exp{ −(1/(2σ_p²)) [ a − (σ_a²/(σ_a² + σ²/N)) ((1/N) ∑_{i=1}^{N} x_i) ]² } = C(x) exp{ −(a − η)²/(2σ_p²) }
C(x) is a (normalizing) function of x only
The variance term is given by
σ_p² = ( 1/σ_a² + N/σ² )⁻¹ = σ_a² σ² / (N σ_a² + σ²)
Critical Observation: f_{a|x}(a|x) is a Gaussian distribution!
Result:
â_MS = E{a|x} = η = (σ_a²/(σ_a² + σ²/N)) ((1/N) ∑_{i=1}^{N} x_i)


Maximum Likelihood and Bayes Estimation Bayes Estimation

Case 2: Uniform cost function
C(y, ŷ) = 0 if |y − ŷ| < ε, and 1 otherwise
Question: For what types of problems is this cost function effective?
[Figure: C(y, ŷ) versus |y − ŷ| — zero inside a band of width ε, one outside]
In this case,
I(ŷ) = ∫_{−∞}^{∞} C(y, ŷ) f_{y|x}(y|x) dy = ∫_{|y−ŷ|≥ε} f_{y|x}(y|x) dy = 1 − ∫_{|y−ŷ|<ε} f_{y|x}(y|x) dy
How do we minimize I(ŷ)?

Maximum Likelihood and Bayes Estimation Bayes Estimation

Result: I(ŷ) is minimized by maximizing
∫_{|y−ŷ|<ε} f_{y|x}(y|x) dy
[Figure: posterior density f_{y|x}(y|x) with the ε-wide window centered at ŷ]
Note: ε is arbitrarily small
⇒ I(ŷ) is minimized when f_{y|x}(y|x) takes its largest value
ŷ_MAP(x) = argmax_y f_{y|x}(y|x)
ŷ_MAP is referred to as the maximum a posteriori (MAP) estimate because it maximizes the posterior density f_{y|x}(y|x).


Maximum Likelihood and Bayes Estimation Bayes Estimation

Example

Let
f_{x,y}(x, y) = 10y for 0 ≤ y ≤ x², 0 ≤ x ≤ 1, and 0 otherwise
Find the MS and MAP estimates of y, i.e., ŷ_MS(x) and ŷ_MAP(x).
First step: determine the posterior density f_{y|x}(y|x).
Since f_{y|x}(y|x) = f_{x,y}(x, y)/f_x(x), we need
f_x(x) = ∫_{−∞}^{∞} f_{x,y}(x, y) dy = ∫_0^{x²} 10y dy = 5y² |_0^{x²} = 5x⁴,   0 ≤ x ≤ 1


Maximum Likelihood and Bayes Estimation Bayes Estimation

Thus,
f_{y|x}(y|x) = f_{x,y}(x, y)/f_x(x) = 10y/(5x⁴) = 2y/x⁴,   0 ≤ y ≤ x²
[Figure: f_{y|x}(y|x) versus y — a linear ramp from 0 at y = 0 up to 2/x² at y = x²]


Maximum Likelihood and Bayes Estimation Bayes Estimation

MAP estimate:
ŷ_MAP(x) = argmax_y f_{y|x}(y|x) = argmax_y 2y/x⁴ over 0 ≤ y ≤ x² = x²
MS estimate:
ŷ_MS(x) = E{y|x} = ∫_0^{x²} y f_{y|x}(y|x) dy = ∫_0^{x²} 2y²/x⁴ dy = (2/3) y³/x⁴ |_0^{x²} = (2/3) x²

Maximum Likelihood and Bayes Estimation Bayes Estimation

Note that the minimum MSE is

E{(y − yMS)2} =

∫ 1

0

∫ x2

0(y − yMS)

2fx ,y(x , y)dydx

=

∫ 1

0

∫ x2

0(y − 2

3x2)210ydydx =

5162

= 0.0309

The MSE of the MAP estimate is

E{(y − yMAP)2} =

∫ 1

0

∫ x2

0(y − yMAP)

2fx ,y(x , y)dydx

=

∫ 1

0

∫ x2

0(y − x2)210ydydx =

554

= 0.0926

Observation: This result is expected. Why?
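These two MSE values can also be checked by simulation. Here is a short Python/NumPy sketch (an added illustration); the inverse-transform sampling steps follow from F_x(x) = x⁵ and F_{y|x}(y|x) = y²/x⁴ derived above:

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
# Inverse-transform sampling: F_x(x) = x^5  =>  x = U^(1/5)
# F_{y|x}(y|x) = y^2/x^4 on [0, x^2]     =>  y = x^2 * sqrt(V)
x = rng.random(n) ** 0.2
y = x**2 * np.sqrt(rng.random(n))
print(np.mean((y - (2 / 3) * x**2) ** 2))   # ~ 5/162 = 0.0309 (MS estimate)
print(np.mean((y - x**2) ** 2))             # ~ 5/54  = 0.0926 (MAP estimate)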


Maximum Likelihood and Bayes Estimation Bayes Estimation

Observation: MAP estimation can be used as an extension of ML estimation if some variability is assumed
Instead of an unknown constant θ, we have an unknown random parameter with distribution f_θ(θ)
To see this, note
f_{θ|x}(θ|x) = f_{x|θ}(x|θ) f_θ(θ) / f_x(x)
The MAP estimate maximizes the numerator, since f_x(x) is not a function of θ:
θ̂_MAP = argmax_θ f_{x|θ}(x|θ) f_θ(θ)
Question: For what distribution f_θ(θ) does θ̂_MAP = θ̂_ML? That is, when does
θ̂_MAP = argmax_θ f_{x|θ}(x|θ) f_θ(θ) ?= argmax_θ f_{x|θ}(x|θ) = θ̂_ML


Maximum Likelihood and Bayes Estimation Bayes Estimation

Example

Let x(n) = A + μ(n) for n = 1, 2, ···, N, where the μ(n) ∼ N(0, σ_μ²) are i.i.d. and A ∼ N(A_0, σ_A²) is independent of the noise.
Determine the MAP estimate of A.
Need to maximize f_{x|A}(x|A) f_A(A), or
Â_MAP = argmax_A [ ln(f_{x|A}(x|A)) + ln(f_A(A)) ]
Note
ln(f_{x|A}(x|A)) = (N/2) ln(1/(2πσ_μ²)) − ∑_{n=1}^{N} (x(n) − A)²/(2σ_μ²)
and
ln(f_A(A)) = (1/2) ln(1/(2πσ_A²)) − (A − A_0)²/(2σ_A²)


Maximum Likelihood and Bayes Estimation Bayes Estimation

Thus
Â_MAP = argmin_A ( (1/(2σ_μ²)) ∑_{n=1}^{N} (x(n) − A)² + (A − A_0)²/(2σ_A²) )
Differentiating and setting the result to zero at A = Â_MAP:
−(1/σ_μ²) ∑_{n=1}^{N} (x(n) − A) + (A − A_0)/σ_A² |_{A=Â_MAP} = 0
⇒ ∑_{n=1}^{N} x(n)/σ_μ² − N Â_MAP/σ_μ² = Â_MAP/σ_A² − A_0/σ_A²
⇒ Â_MAP ( 1/σ_A² + N/σ_μ² ) = (1/σ_μ²) ∑_{n=1}^{N} x(n) + A_0/σ_A²
⇒ Â_MAP = (1/(1/σ_A² + N/σ_μ²)) ( (1/σ_μ²) ∑_{n=1}^{N} x(n) + A_0/σ_A² )


Maximum Likelihood and Bayes Estimation Bayes Estimation

Â_MAP = (1/(1/σ_A² + N/σ_μ²)) ( (1/σ_μ²) ∑_{n=1}^{N} x(n) + A_0/σ_A² )
Note that if σ_A² → ∞ then there is no a priori information and
lim_{σ_A²→∞} Â_MAP = (1/N) ∑_{n=1}^{N} x(n) = Â_ML
[Figure: two Gaussian priors f_A(A) with different variances σ_{A1}², σ_{A2}² — the flatter (wider) prior has less influence on Â_MAP]
Observation: As f_θ(θ) flattens out, θ̂_MAP → θ̂_ML
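A minimal Python/NumPy sketch of this behavior (an added illustration; A_true, A_0, σ_μ², and N below are arbitrary assumed values, not from the notes):

import numpy as np

# MAP estimate of A under a Gaussian prior and its limit as the prior flattens
rng = np.random.default_rng(1)
A_true, A0, sm2, N = 1.0, 0.0, 1.0, 20
x = A_true + rng.normal(0.0, np.sqrt(sm2), N)

A_ml = x.mean()
for sA2 in (0.01, 0.1, 1.0, 100.0):
    A_map = (x.sum() / sm2 + A0 / sA2) / (1 / sA2 + N / sm2)
    print(sA2, A_map)          # approaches A_ml as sigma_A^2 grows
print("ML:", A_ml)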


Maximum Likelihood and Bayes Estimation Bayes Estimation

Case 3: Absolute cost function
C(y, ŷ) = |y − ŷ|
Question: For what types of problems is this cost function effective?
[Figure: C(y, ŷ) versus |y − ŷ| — a linear (absolute value) cost]
In this case,
I(ŷ) = ∫_{−∞}^{∞} C(y, ŷ) f_{y|x}(y|x) dy
= ∫_{y<ŷ} (ŷ − y) f_{y|x}(y|x) dy + ∫_{y≥ŷ} (y − ŷ) f_{y|x}(y|x) dy
= ∫_{−∞}^{ŷ} (ŷ − y) f_{y|x}(y|x) dy + ∫_{ŷ}^{∞} (y − ŷ) f_{y|x}(y|x) dy


Maximum Likelihood and Bayes Estimation Bayes Estimation

I(ŷ) = ∫_{−∞}^{ŷ} (ŷ − y) f_{y|x}(y|x) dy + ∫_{ŷ}^{∞} (y − ŷ) f_{y|x}(y|x) dy
Note that
∫_{−∞}^{ŷ} (ŷ − y) f_{y|x}(y|x) dy = ŷ F_{y|x}(ŷ|x) − ∫_{−∞}^{ŷ} y f_{y|x}(y|x) dy
and similarly
∫_{ŷ}^{∞} (y − ŷ) f_{y|x}(y|x) dy = ∫_{ŷ}^{∞} y f_{y|x}(y|x) dy − ŷ (1 − F_{y|x}(ŷ|x))
Thus,
I(ŷ) = ŷ F_{y|x}(ŷ|x) − ∫_{−∞}^{ŷ} y f_{y|x}(y|x) dy − ŷ (1 − F_{y|x}(ŷ|x)) + ∫_{ŷ}^{∞} y f_{y|x}(y|x) dy


Maximum Likelihood and Bayes Estimation Bayes Estimation

I(ŷ) = ŷ F_{y|x}(ŷ|x) − ∫_{−∞}^{ŷ} y f_{y|x}(y|x) dy − ŷ (1 − F_{y|x}(ŷ|x)) + ∫_{ŷ}^{∞} y f_{y|x}(y|x) dy
Taking the derivative,
∂I(ŷ)/∂ŷ = F_{y|x}(ŷ|x) + ŷ f_{y|x}(ŷ|x) − ŷ f_{y|x}(ŷ|x) − (1 − F_{y|x}(ŷ|x)) − ŷ f_{y|x}(ŷ|x) + ŷ f_{y|x}(ŷ|x)
= F_{y|x}(ŷ|x) − (1 − F_{y|x}(ŷ|x))
Result: Setting this equal to 0, we see that ŷ_MAE is given by
F_{y|x}(ŷ_MAE|x) = 1 − F_{y|x}(ŷ_MAE|x)
or
∫_{−∞}^{ŷ_MAE} f_{y|x}(y|x) dy = ∫_{ŷ_MAE}^{∞} f_{y|x}(y|x) dy


Maximum Likelihood and Bayes Estimation Bayes Estimation

∫_{−∞}^{ŷ_MAE} f_{y|x}(y|x) dy = ∫_{ŷ_MAE}^{∞} f_{y|x}(y|x) dy
Interpreting this graphically:
[Figure: posterior density f_{y|x}(y|x) split at ŷ_MAE into Area A (left) and Area B (right), with Area A = Area B]
Observation: ŷ_MAE is the median of f_{y|x}(y|x)


Maximum Likelihood and Bayes Estimation Bayes Estimation

Estimator Relations

If f_{y|x}(y|x) is symmetric, then
ŷ_MAE = ŷ_MS
Why? For a symmetric distribution the conditional mean is equal to the (median) symmetry point
If f_{y|x}(y|x) is symmetric and unimodal, then
ŷ_MAE = ŷ_MS = ŷ_MAP
Why? The unimodal constraint implies that the single mode must be at the distribution symmetry point ⇒ the MAP estimate is located at the central point


Maximum Likelihood and Bayes Estimation Bayes Estimation

Example

Determine ŷ_MAE for the previously considered case
f_{x,y}(x, y) = 10y for 0 ≤ y ≤ x², 0 ≤ x ≤ 1, and 0 otherwise
We showed previously that
f_{y|x}(y|x) = 2y/x⁴, 0 ≤ y ≤ x²  ⇒  F_{y|x}(y|x) = y²/x⁴, 0 ≤ y ≤ x²
Thus, determining ŷ_MAE,
F_{y|x}(ŷ_MAE|x) = 1 − F_{y|x}(ŷ_MAE|x)
⇒ ŷ_MAE²/x⁴ = 1 − ŷ_MAE²/x⁴
⇒ ŷ_MAE = x²/√2


Maximum Likelihood and Bayes Estimation Bayes Estimation

MAP estimate: (previous result)
ŷ_MAP(x) = argmax_y f_{y|x}(y|x) = x²
MS estimate: (previous result)
ŷ_MS(x) = E{y|x} = (2/3) x²
MAE estimate:
ŷ_MAE(x) = median of f_{y|x}(y|x) = x²/√2
[Figure: the ramp posterior f_{y|x}(y|x) on [0, x²] with the three estimates marked — ŷ_MS = (2/3)x², ŷ_MAE = x²/√2, ŷ_MAP = x²]


Maximum Likelihood and Bayes Estimation Bayes Estimation

Final ML and MAP Comments

ML estimation was pioneered by geneticist and statistician Sir R. A. Fisher between 1912 and 1922
Under fairly weak regularity conditions the ML estimate is asymptotically optimal:
The ML estimate is asymptotically unbiased, i.e., its bias tends to zero as the number of samples increases to infinity
The ML estimate is asymptotically efficient, i.e., it achieves the Cramér–Rao lower bound as the number of samples tends to infinity; consequently, asymptotically no unbiased estimator has lower mean squared error than the ML estimator
The ML estimate is asymptotically normal, i.e., as the number of samples increases, the distribution of the ML estimate tends to the Gaussian distribution
MAP estimation is a generalization of ML estimation that incorporates the prior distribution of the quantity being estimated


Wiener (MSE) Filtering Theory Wiener Filtering

Problem Statement

Produce an estimate of a desired process statistically related to a set of observations
[Block diagram: input x(n) → filter → output y(n); the output is compared with the desired response d(n) to form the estimation error e(n)]
Historical Notes: The linear filtering problem was solved by
Andrey Kolmogorov for discrete time – his 1938 paper "established the basic theorems for smoothing and predicting stationary stochastic processes"
Norbert Wiener in 1941 for continuous time – not published until the 1949 paper Extrapolation, Interpolation, and Smoothing of Stationary Time Series


Wiener (MSE) Filtering Theory Wiener Filtering

[Block diagram repeated: x(n) → filter → y(n), compared with the desired response d(n) to form the error e(n)]

System restrictions and considerations:

Filter is linear

Filter is discrete time

Filter is finite impulse response (FIR)

The process is WSS

Statistical optimization is employed


Wiener (MSE) Filtering Theory Wiener Filtering

For the discrete time case:
[Block diagram: tapped delay line with input x(n), unit delays z⁻¹, tap weights w_0*, w_1*, ···, w_{M−1}*, output d̂(n), and error e(n) = d(n) − d̂(n)]
The filter impulse response is finite and given by
h_k = w_k* for k = 0, 1, ···, M − 1, and 0 otherwise
The output d̂(n) is an estimate of the desired signal d(n)
x(n) and d(n) are statistically related ⇒ d̂(n) and d(n) are statistically related


Wiener (MSE) Filtering Theory Wiener Filtering

In convolution and vector form,
d̂(n) = ∑_{k=0}^{M−1} w_k* x(n − k) = w^H x(n)
where
w = [w_0, w_1, ···, w_{M−1}]^T   [filter coefficient vector]
x(n) = [x(n), x(n − 1), ···, x(n − M + 1)]^T   [observation vector]
The error can now be written as
e(n) = d(n) − d̂(n) = d(n) − w^H x(n)
Question: Under what criterion should the error be minimized?
Selected criterion: mean squared error (MSE)
J(w) = E{e(n) e*(n)}   (∗)
Result: The w that minimizes J(w) is the optimal (Wiener) filter


Wiener (MSE) Filtering Theory Wiener Filtering

Utilizing e(n) = d(n) − w^H x(n) in (∗) and expanding,
J(w) = E{e(n) e*(n)} = E{(d(n) − w^H x(n))(d*(n) − x^H(n) w)}
= E{|d(n)|² − d(n) x^H(n) w − w^H x(n) d*(n) + w^H x(n) x^H(n) w}
= E{|d(n)|²} − E{d(n) x^H(n)} w − w^H E{x(n) d*(n)} + w^H E{x(n) x^H(n)} w   (∗∗)
Let
R = E{x(n) x^H(n)}   [autocorrelation matrix of x(n)]
p = E{x(n) d*(n)}   [cross-correlation between x(n) and d(n)]
Then (∗∗) can be compactly expressed as
J(w) = σ_d² − p^H w − w^H p + w^H R w
where we have assumed x(n) and d(n) are zero mean and WSS

Wiener (MSE) Filtering Theory Wiener Filtering

The MSE criterion as a function of the filter weight vector w:
J(w) = σ_d² − p^H w − w^H p + w^H R w
Observation: The error is a quadratic function of w
Consequence: The error is an M-dimensional bowl-shaped function of w with a unique minimum
Result: The optimal weight vector, w_0, is determined by differentiating J(w) and setting the result to zero
∇_w J(w) |_{w=w_0} = 0
A closed form solution exists


Wiener (MSE) Filtering Theory Wiener Filtering

Example

Consider a two-dimensional case, i.e., an M = 2 tap filter. Plot the error surface and error contours.
[Figures: error surface (left) and error contours (right)]


Wiener (MSE) Filtering Theory Wiener Filtering

Aside (Matrix Differentiation): For complex data,
w_k = a_k + j b_k,  k = 0, 1, ···, M − 1
the gradient with respect to w_k is
∇_k(J) = ∂J/∂a_k + j ∂J/∂b_k,  k = 0, 1, ···, M − 1
The complete gradient is thus given by
∇_w(J) = [∇_0(J), ∇_1(J), ···, ∇_{M−1}(J)]^T = [∂J/∂a_0 + j ∂J/∂b_0, ∂J/∂a_1 + j ∂J/∂b_1, ···, ∂J/∂a_{M−1} + j ∂J/∂b_{M−1}]^T


Wiener (MSE) Filtering Theory Wiener Filtering

Example

Let c and w be M × 1 complex vectors. For g = c^H w, find ∇_w(g)
Note
g = c^H w = ∑_{k=0}^{M−1} c_k* w_k = ∑_{k=0}^{M−1} c_k* (a_k + j b_k)
Thus
∇_k(g) = ∂g/∂a_k + j ∂g/∂b_k = c_k* + j(j c_k*) = 0,  k = 0, 1, ···, M − 1
Result: For g = c^H w
∇_w(g) = [∇_0(g), ∇_1(g), ···, ∇_{M−1}(g)]^T = [0, 0, ···, 0]^T = 0


Wiener (MSE) Filtering Theory Wiener Filtering

Example

Now suppose g = w^H c. Find ∇_w(g)
In this case,
g = w^H c = ∑_{k=0}^{M−1} w_k* c_k = ∑_{k=0}^{M−1} c_k (a_k − j b_k)
and
∇_k(g) = ∂g/∂a_k + j ∂g/∂b_k = c_k + j(−j c_k) = 2c_k,  k = 0, 1, ···, M − 1
Result: For g = w^H c
∇_w(g) = [∇_0(g), ∇_1(g), ···, ∇_{M−1}(g)]^T = [2c_0, 2c_1, ···, 2c_{M−1}]^T = 2c


Wiener (MSE) Filtering Theory Wiener Filtering

Example

Lastly, suppose g = w^H Q w. Find ∇_w(g)
In this case,
g = ∑_{i=0}^{M−1} ∑_{j=0}^{M−1} w_i* w_j q_{i,j} = ∑_{i=0}^{M−1} ∑_{j=0}^{M−1} (a_i − j b_i)(a_j + j b_j) q_{i,j}
⇒ ∇_k(g) = ∂g/∂a_k + j ∂g/∂b_k = 2 ∑_{j=0}^{M−1} (a_j + j b_j) q_{k,j} + 0 = 2 ∑_{j=0}^{M−1} w_j q_{k,j}


Wiener (MSE) Filtering Theory Wiener Filtering

Result: For g = w^H Q w
∇_w(g) = [∇_0(g), ∇_1(g), ···, ∇_{M−1}(g)]^T = 2 [∑_{i=0}^{M−1} q_{0,i} w_i, ∑_{i=0}^{M−1} q_{1,i} w_i, ···, ∑_{i=0}^{M−1} q_{M−1,i} w_i]^T = 2 Q w
Observation: The differentiation result depends on the ordering, i.e., on which factor carries the conjugate (∇_w(c^H w) = 0 while ∇_w(w^H c) = 2c)
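These gradient identities can be verified numerically with the convention ∇_k = ∂g/∂a_k + j ∂g/∂b_k. The following Python/NumPy sketch (an added illustration; the random Hermitian Q and vector w are arbitrary) checks ∇_w(w^H Q w) = 2Qw by central finite differences:

import numpy as np

rng = np.random.default_rng(0)
M = 4
A = rng.normal(size=(M, M)) + 1j * rng.normal(size=(M, M))
Q = A + A.conj().T                       # Hermitian Q
w = rng.normal(size=M) + 1j * rng.normal(size=M)

def g(v):
    return np.real(v.conj() @ Q @ v)     # w^H Q w is real for Hermitian Q

eps, grad = 1e-6, np.zeros(M, dtype=complex)
for k in range(M):
    e = np.zeros(M); e[k] = 1.0
    d_re = (g(w + eps * e) - g(w - eps * e)) / (2 * eps)
    d_im = (g(w + 1j * eps * e) - g(w - 1j * eps * e)) / (2 * eps)
    grad[k] = d_re + 1j * d_im
print(np.allclose(grad, 2 * Q @ w, atol=1e-4))   # True: matches 2 Q w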


Wiener (MSE) Filtering Theory Wiener Filtering

Returning to the MSE performance criterion
J(w) = σ_d² − p^H w − w^H p + w^H R w
Approach: Minimize the error by differentiating with respect to w and setting the result to 0
∇_w(J) = 0 − 0 − 2p + 2Rw = 0
⇒ R w_0 = p   [normal equation]
Result: The Wiener filter coefficients are defined by
w_0 = R⁻¹ p
Question: Does R⁻¹ always exist? Recall R is positive semi-definite, and usually positive definite
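To make the normal-equation solution concrete, here is a minimal Python/NumPy sketch (an added illustration, not from the notes) that estimates R and p by sample averaging and solves R w = p; the data model, filter length, and noise level are arbitrary assumptions:

import numpy as np

rng = np.random.default_rng(0)
M, n = 4, 50_000
x = rng.normal(size=n)                                   # white input
h_true = np.array([0.6, -0.3, 0.2, 0.1])                 # assumed "desired" system
d = np.convolve(x, h_true, mode="full")[:n] + 0.1 * rng.normal(size=n)

# rows are x(n)^T = [x(n), x(n-1), ..., x(n-M+1)]
X = np.column_stack([np.roll(x, k) for k in range(M)])[M - 1:]
D = d[M - 1:]
R = X.T @ X / len(D)                                     # sample estimate of E{x x^H}
p = X.T @ D / len(D)                                     # sample estimate of E{x d*}
w0 = np.linalg.solve(R, p)                               # Wiener solution R w0 = p
print(w0)                                                # close to h_true here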


Wiener (MSE) Filtering Theory Orthogonality Principle

Orthogonality Principle

Consider again the normal equation that defines the optimal solution
R w_0 = p
⇒ E{x(n) x^H(n)} w_0 = E{x(n) d*(n)}
Rearranging,
E{x(n) d*(n)} − E{x(n) x^H(n)} w_0 = 0
E{x(n) [d*(n) − x^H(n) w_0]} = 0
E{x(n) e_0*(n)} = 0
Note: e_0(n) is the error when the optimal weights are used, i.e.,
e_0*(n) = d*(n) − x^H(n) w_0


Wiener (MSE) Filtering Theory Orthogonality Principle

Thus
E{x(n) e_0*(n)} = E{[x(n) e_0*(n), x(n − 1) e_0*(n), ···, x(n − M + 1) e_0*(n)]^T} = [0, 0, ···, 0]^T
Orthogonality Principle
A necessary and sufficient condition for a filter to be optimal is that the estimation error e(n) be orthogonal to each input sample in x(n)
Interpretation: The observation samples and the error are orthogonal and contain no mutual "information"


Wiener (MSE) Filtering Theory Minimum MSE

Objective: Determine the minimum MSE
Approach: Use the optimal weights w_0 = R⁻¹p in the MSE expression
J(w) = σ_d² − p^H w − w^H p + w^H R w
⇒ J_min = σ_d² − p^H w_0 − w_0^H p + w_0^H R (R⁻¹p) = σ_d² − p^H w_0 − w_0^H p + w_0^H p = σ_d² − p^H w_0
Result:
J_min = σ_d² − p^H R⁻¹ p
where the substitution w_0 = R⁻¹p has been employed


Wiener (MSE) Filtering Theory Excess MSE

Objective: Consider the excess MSE introduced by using a weight vector that is not optimal.
J(w) − J_min = (σ_d² − p^H w − w^H p + w^H R w) − (σ_d² − p^H w_0 − w_0^H p + w_0^H R w_0)
Using the fact that p = R w_0 and p^H = w_0^H R yields
J(w) − J_min = −p^H w − w^H p + w^H R w + p^H w_0 + w_0^H p − w_0^H R w_0
= −w_0^H R w − w^H R w_0 + w^H R w + w_0^H R w_0 + w_0^H R w_0 − w_0^H R w_0
= −w_0^H R w − w^H R w_0 + w^H R w + w_0^H R w_0
= (w − w_0)^H R (w − w_0)
⇒ J(w) = J_min + (w − w_0)^H R (w − w_0)


Wiener (MSE) Filtering Theory Excess MSE

Finally, using the eigenvalue/eigenvector representation R = QΩQ^H,
J(w) = J_min + (w − w_0)^H QΩQ^H (w − w_0)
or, defining the eigenvector-transformed difference
v = Q^H (w − w_0)   (∗)
⇒ J(w) = J_min + v^H Ω v = J_min + ∑_{k=1}^{M} λ_k v_k v_k*
Result:
J(w) = J_min + ∑_{k=1}^{M} λ_k |v_k|²
Note: (∗) shows that v_k is the difference (w − w_0) projected onto eigenvector q_k


Wiener (MSE) Filtering Theory Communications Example

Example

Consider the following system:
[Block diagram: white noise v_1(n) drives an AR system to produce the desired signal d(n); d(n) passes through a communication channel to give x(n); additive noise v_2(n) yields the noisy observation u(n) = x(n) + v_2(n), which is the input to the FIR filter producing d̂(n)]
Objective: Determine the optimal filter for a given system and channel


Wiener (MSE) Filtering Theory Communications Example

Specific Objective
Determine the optimal order-two filter weights, w_0, for
H_1(z) = 1/(1 + 0.8458 z⁻¹)   [AR process]
H_2(z) = 1/(1 − 0.9458 z⁻¹)   [communication channel]
where v_1(n) and v_2(n) are zero mean white noise processes with σ_1² = 0.27 and σ_2² = 0.1
Note: To determine w_0, we need:
R_u — the autocorrelation matrix of the received signal u(n)
p — the cross-correlation between the received signal u(n) and the desired signal d(n)


Wiener (MSE) Filtering Theory Communications Example

[Block diagram repeated: v_1(n) → AR system → d(n) → channel → x(n); u(n) = x(n) + v_2(n) → FIR filter → d̂(n)]
Procedure: Consider R_u first
Since u(n) = x(n) + v_2(n), where v_2(n) is white with σ_2² = 0.1 and is uncorrelated with x(n),
R_u = R_x + R_{v2} = R_x + [0.1, 0; 0, 0.1]


Wiener (MSE) Filtering Theory Communications Example

[Block diagram repeated]
Next, consider the relation between x(n) and v_1(n):
X(z) = H_1(z) H_2(z) V_1(z)
where, for the given systems,
H_1(z) H_2(z) = 1/((1 + 0.8458 z⁻¹)(1 − 0.9458 z⁻¹))
Converting to the time domain, we see x(n) is an order-2 AR process
x(n) − 0.1 x(n − 1) − 0.8 x(n − 2) = v_1(n)
x(n) + a_1 x(n − 1) + a_2 x(n − 2) = v_1(n)   [so a_1 = −0.1, a_2 = −0.8]


Wiener (MSE) Filtering Theory Communications Example

Since x(n) is a real-valued order-two AR process, the Yule-Walker equations are given by
[r(0), r(1); r*(1), r(0)] [−a_1; −a_2] = [r*(1); r*(2)]
[r(0), r(1); r(1), r(0)] [−a_1; −a_2] = [r(1); r(2)]   [real-valued case]
Solving for the coefficients:
−a_1 = r(1)[r(0) − r(2)] / (r²(0) − r²(1))
−a_2 = (r(0) r(2) − r²(1)) / (r²(0) − r²(1))
Question: What are the known and unknown terms in this system?


Wiener (MSE) Filtering Theory Communications Example

Note: We must solve the system to obtain the unknown r(·) values
Noting r(0) = σ_x² and rearranging to solve for r(1) and r(2):
r(1) = (−a_1/(1 + a_2)) σ_x²   (∗)
r(2) = (−a_2 + a_1²/(1 + a_2)) σ_x²   (∗∗)
The Yule-Walker equations also stipulate
σ_{v1}² = r(0) + a_1 r(1) + a_2 r(2)   (∗∗∗)
Next, utilize the determined r(1), r(2) and the given a_1, a_2, σ_{v1}² values to determine r(0) = σ_x²


Wiener (MSE) Filtering Theory Communications Example

Substituting (∗) and (∗∗) into (∗∗∗), using r(0) = σ_x², and rearranging:
σ_{v1}² = r(0) + a_1 r(1) + a_2 r(2)
= σ_x² + a_1 (−a_1/(1 + a_2)) σ_x² + a_2 (−a_2 + a_1²/(1 + a_2)) σ_x²
= (1 − a_1²/(1 + a_2) − a_2² + a_1² a_2/(1 + a_2)) σ_x²
⇒ σ_x² = ((1 + a_2)/(1 − a_2)) σ_{v1}² / ((1 + a_2)² − a_1²)
Note: All correlation terms have been determined since r(0) = σ_x²


Wiener (MSE) Filtering Theory Communications Example

Using a_1 = −0.1, a_2 = −0.8, and σ_{v1}² = 0.27,
σ_x² = ((1 + a_2)/(1 − a_2)) σ_{v1}² / ((1 + a_2)² − a_1²) = 1
Thus r(0) = 1. Similarly,
r(1) = (−a_1/(1 + a_2)) σ_x² = 0.5
and finally, R_x is
R_x = [1, 0.5; 0.5, 1]


Wiener (MSE) Filtering Theory Communications Example

Recall the overall system:
[Block diagram repeated]
Putting the pieces of R_u together,
R_u = R_x + R_{v2} = [1, 0.5; 0.5, 1] + [0.1, 0; 0, 0.1] = [1.1, 0.5; 0.5, 1.1]


Wiener (MSE) Filtering Theory Communications Example

Recall: The Wiener solution is given by w_0 = R⁻¹p
Note: R_u is known, but p is still to be determined
p = E{[u(n) d(n); u(n − 1) d(n)]}
Recall
X(z) = H_2(z) D(z) = D(z)/(1 − 0.9458 z⁻¹)
or, in the time domain,
x(n) − 0.9458 x(n − 1) = d(n)
Lastly, the observation is corrupted by additive noise:
u(n) = x(n) + v_2(n)


Wiener (MSE) Filtering Theory Communications Example

Thus
E{u(n) d(n)} = E{[x(n) + v_2(n)][x(n) − 0.9458 x(n − 1)]}
= E{x²(n)} + E{x(n) v_2(n)} − 0.9458 E{x(n) x(n − 1)} − 0.9458 E{v_2(n) x(n − 1)}
= σ_x² + 0 − 0.9458 r(1) − 0 = 1 − 0.9458 (1/2) = 0.5272
Similarly,
E{u(n − 1) d(n)} = E{[x(n − 1) + v_2(n − 1)][x(n) − 0.9458 x(n − 1)]} = r(1) − 0.9458 r(0) = −0.4458


Wiener (MSE) Filtering Theory Communications Example

Thus
p = [0.5272; −0.4458]
Final Solution:
w_0 = R⁻¹p = [1.1, 0.5; 0.5, 1.1]⁻¹ [0.5272; −0.4458] = [0.8360; −0.7853]
The optimal filter weights are a function of the source signal statistics and the communications channel
Question: How do we optimize a filter when the statistics are not known in closed form or a priori?
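Before turning to adaptive methods, here is a quick numerical cross-check of the closed-form example above (an added Python/NumPy sketch using only the quantities derived in the example):

import numpy as np

a1, a2, s_v1 = -0.1, -0.8, 0.27                          # AR(2) coefficients and noise variance
s_x = ((1 + a2) / (1 - a2)) * s_v1 / ((1 + a2) ** 2 - a1 ** 2)   # = 1.0
r0, r1 = s_x, (-a1 / (1 + a2)) * s_x                     # = 1.0, 0.5

Ru = np.array([[r0, r1], [r1, r0]]) + 0.1 * np.eye(2)    # add R_v2 with sigma_2^2 = 0.1
p = np.array([r0 - 0.9458 * r1, r1 - 0.9458 * r0])       # [0.5271, -0.4458]
w0 = np.linalg.solve(Ru, p)
print(s_x, r1, w0)                                       # w0 ≈ [0.836, -0.785]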


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Motivation

Adaptive Optimization and Filtering Methods

Motivation

Adaptive optimization and filtering methods are appropriate, advantageous, or necessary when:
Signal statistics are not known a priori and must be "learned" from observed or representative samples
Signal statistics evolve over time
Time or computational restrictions dictate that simple, if repetitive, operations be employed rather than solving more complex, closed-form expressions
To be considered are the following algorithms:
Steepest Descent (SD) – deterministic
Least Mean Squares (LMS) – stochastic
Recursive Least Squares (RLS) – deterministic


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Steepest Descent

Definition (Steepest Descent (SD))

Steepest descent, also known as gradient descent, is an iterative technique for finding a local minimum of a function.
Approach: Given an arbitrary starting point, the current location (value) is moved in steps proportional to the negative of the gradient at the current point.
SD is an old, deterministic method that is the basis for stochastic gradient-based methods
SD is a feedback approach to finding a local minimum of an error performance surface
The error surface must be known a priori
In the MSE case, SD converges to the optimal solution, w_0 = R⁻¹p, without inverting a matrix
Question: Why, in the MSE case, does this converge to the global minimum rather than a local minimum?


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Steepest Descent

Example

Consider a well-structured cost function with a single minimum. The optimization proceeds as follows:
[Figure: contour plot showing the evolution of the optimization]


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Steepest Descent

Example

Consider a gradient ascent example in which there are multiple minima/maxima
[Figures: surface plot showing the multiple minima and maxima; contour plot illustrating that the final result depends on the starting value]


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Steepest Descent

To derive the approach, consider the FIR case:
{x(n)} – the WSS input samples
{d(n)} – the WSS desired output
{d̂(n)} – the estimate of the desired signal, given by
d̂(n) = w^H(n) x(n)
where
x(n) = [x(n), x(n − 1), ···, x(n − M + 1)]^T   [observation vector]
w(n) = [w_0(n), w_1(n), ···, w_{M−1}(n)]^T   [time-indexed filter coefficients]


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Steepest Descent

Then, similarly to previously considered cases,
e(n) = d(n) − d̂(n) = d(n) − w^H(n) x(n)
and the MSE at time n is
J(n) = E{|e(n)|²} = σ_d² − w^H(n) p − p^H w(n) + w^H(n) R w(n)
where
σ_d² – variance of the desired signal
p – cross-correlation between x(n) and d(n)
R – correlation matrix of x(n)
Note: The weight vector and cost function are time indexed (functions of time)


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Steepest Descent

When w(n) is set to the (optimal) Wiener solution,
w(n) = w_0 = R⁻¹p
and
J(n) = J_min = σ_d² − p^H w_0
Use the method of steepest descent to iteratively find w_0.
The optimal result is achieved since the cost function is a second-order polynomial with a single unique minimum


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Steepest Descent

Example

Let M = 2. The MSE is a bowl-shaped surface, which is a function of the 2-D weight vector w(n)
[Figures: surface plot of J(w) over (w_1, w_2) with minimum at w_0; contour plot showing the negative gradient −∇J and its components −∂J/∂w_1, −∂J/∂w_2]
Imagine dropping a marble at any point on the bowl-shaped surface. The ball will reach the minimum point by going along the path of steepest descent.


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Steepest Descent

Observation: Set the direction of the filter update as −∇J(n)
Resulting Update:
w(n + 1) = w(n) + (1/2) μ [−∇J(n)]
or, since ∇J(n) = −2p + 2Rw(n),
w(n + 1) = w(n) + μ [p − R w(n)],   n = 0, 1, 2, ···
where w(0) = 0 (or another appropriate value) and μ is the step size
Observation: SD uses feedback, which makes it possible for the system to be unstable
Bounds on the step size guaranteeing stability can be determined with respect to the eigenvalues of R (Widrow, 1970)
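For concreteness, a minimal Python/NumPy sketch of this deterministic recursion (an added illustration; R, p, and μ below reuse the earlier communications example, and μ respects 0 < μ < 2/λ_max):

import numpy as np

R = np.array([[1.1, 0.5], [0.5, 1.1]])
p = np.array([0.5272, -0.4458])
mu, w = 0.5, np.zeros(2)                 # w(0) = 0

for n in range(200):
    w = w + mu * (p - R @ w)             # steepest descent update
print(w, np.linalg.solve(R, p))          # both ≈ [0.836, -0.785]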


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Convergence Analysis

Convergence Analysis

Define the error vector for the tap weights as
c(n) = w(n) − w_0
Then, using p = R w_0 in the update,
w(n + 1) = w(n) + μ[p − R w(n)] = w(n) + μ[R w_0 − R w(n)] = w(n) − μ R c(n)
and subtracting w_0 from both sides,
w(n + 1) − w_0 = w(n) − w_0 − μ R c(n)
⇒ c(n + 1) = c(n) − μ R c(n) = [I − μR] c(n)


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Convergence Analysis

Using the unitary similarity transform
R = QΩQ^H
we have
c(n + 1) = [I − μR] c(n) = [I − μQΩQ^H] c(n)
⇒ Q^H c(n + 1) = [Q^H − μQ^HQΩQ^H] c(n) = [I − μΩ] Q^H c(n)   (∗)
Define the transformed coefficients as
v(n) = Q^H c(n) = Q^H (w(n) − w_0)
Then (∗) becomes
v(n + 1) = [I − μΩ] v(n)


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Convergence Analysis

Consider the initial condition of v(n):
v(0) = Q^H (w(0) − w_0) = −Q^H w_0   [if w(0) = 0]
Consider the k-th term (mode) in
v(n + 1) = [I − μΩ] v(n)
Note that [I − μΩ] is diagonal, thus all modes are independently updated
The update for the k-th term can be written as
v_k(n + 1) = (1 − μλ_k) v_k(n),  k = 1, 2, ···, M
or, using recursion,
v_k(n) = (1 − μλ_k)ⁿ v_k(0)


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Convergence Analysis

Observation: Convergence to the optimal solution requires
lim_{n→∞} w(n) = w_0
⇒ lim_{n→∞} c(n) = lim_{n→∞} (w(n) − w_0) = 0
⇒ lim_{n→∞} v(n) = lim_{n→∞} Q^H c(n) = 0
⇒ lim_{n→∞} v_k(n) = 0,  k = 1, 2, ···, M   (∗)
Result: According to the recursion
v_k(n) = (1 − μλ_k)ⁿ v_k(0)
the limit in (∗) holds if and only if
|1 − μλ_k| < 1 for all k
Thus, since the eigenvalues are nonnegative, 0 < μλ_max < 2, or
0 < μ < 2/λ_max


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Convergence Analysis

Observation: The k-th mode has geometric decay
v_k(n) = (1 − μλ_k)ⁿ v_k(0)
The rate of decay is characterized by the time it takes to decay to e⁻¹ of the initial value
Let τ_k denote this time for the k-th mode:
v_k(τ_k) = (1 − μλ_k)^{τ_k} v_k(0) = e⁻¹ v_k(0)
⇒ e⁻¹ = (1 − μλ_k)^{τ_k}
⇒ τ_k = −1/ln(1 − μλ_k) ≈ 1/(μλ_k)   for μλ_k ≪ 1
Result: The overall rate of decay is bounded by
−1/ln(1 − μλ_max) ≤ τ ≤ −1/ln(1 − μλ_min)


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Convergence Analysis

Example

Consider the typical behavior of a single mode


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Convergence Analysis

Error Analysis
Recall that
J(n) = J_min + (w(n) − w_0)^H R (w(n) − w_0)
= J_min + (w(n) − w_0)^H QΩQ^H (w(n) − w_0)
= J_min + v^H(n) Ω v(n)
= J_min + ∑_{k=1}^{M} λ_k |v_k(n)|²   [substitute v_k(n) = (1 − μλ_k)ⁿ v_k(0)]
= J_min + ∑_{k=1}^{M} λ_k (1 − μλ_k)^{2n} |v_k(0)|²
Result: If 0 < μ < 2/λ_max, then
lim_{n→∞} J(n) = J_min


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: Predictor

Example

Consider a two-tap predictor for real-valued input
Analyze the effects of the following cases:
Varying the eigenvalue spread χ(R) = λ_max/λ_min while keeping μ fixed
Varying μ while keeping the eigenvalue spread χ(R) fixed


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: Predictor

SD loci plots (with J(n) contours shown) as a function of [v_1(n), v_2(n)] for step size μ = 0.3:
[Figure, χ(R) = 1.22: small eigenvalue spread ⇒ modes converge at a similar rate]
[Figure, χ(R) = 3: moderate eigenvalue spread ⇒ modes converge at moderately similar rates]


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: Predictor

SD loci plots (with J(n) contours shown) as a function of [v_1(n), v_2(n)] for step size μ = 0.3:
[Figure, χ(R) = 10: large eigenvalue spread ⇒ modes converge at different rates]
[Figure, χ(R) = 100: very large eigenvalue spread ⇒ modes converge at very different rates; convergence along the principal direction is fastest]


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: Predictor

SD loci plots (with J(n) contours shown) as a function of [w_1(n), w_2(n)] for step size μ = 0.3:
[Figure, χ(R) = 1.22: small eigenvalue spread ⇒ modes converge at a similar rate]
[Figure, χ(R) = 3: moderate eigenvalue spread ⇒ modes converge at moderately similar rates]


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: Predictor

SD loci plots (with J(n) contours shown) as a function of [w_1(n), w_2(n)] for step size μ = 0.3:
[Figure, χ(R) = 10: large eigenvalue spread ⇒ modes converge at different rates]
[Figure, χ(R) = 100: very large eigenvalue spread ⇒ modes converge at very different rates; convergence along the principal direction is fastest]


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: Predictor

[Figure: learning curves of the steepest-descent algorithm with step-size parameter μ = 0.3 and varying eigenvalue spread]


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: Predictor

SD loci plots (with J(n) contours shown) as a function of [v_1(n), v_2(n)] with χ(R) = 10 and varying step sizes:
[Figure, μ = 0.3: over-damped ⇒ slow convergence]
[Figure, μ = 1: under-damped ⇒ fast (erratic) convergence]


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: Predictor

SD loci plots (with J(n) contours shown) as a function of [w_1(n), w_2(n)] with χ(R) = 10 and varying step sizes:
[Figure, μ = 0.3: over-damped ⇒ slow convergence]
[Figure, μ = 1: under-damped ⇒ fast (erratic) convergence]


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: Predictor

Example

Consider a system identification problem:
[Block diagram: input {x(n)} drives both the unknown system, producing d(n), and the adaptive filter w(n), producing d̂(n); the error is e(n) = d(n) − d̂(n)]
Suppose M = 2 and
R_x = [1, 0.8; 0.8, 1],   p = [0.8; 0.5]


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: Predictor

From eigen analysis we have
λ_1 = 1.8,  λ_2 = 0.2  ⇒  μ < 2/1.8
also
q_1 = (1/√2)[1; 1],   q_2 = (1/√2)[1; −1]
and
Q = (1/√2)[1, 1; 1, −1]
Also,
w_0 = R⁻¹p = [1.11; −0.389]


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: Predictor

Thus
v(n) = Q^H [w(n) − w_0]
Noting that
v(0) = −Q^H w_0 = −(1/√2)[1, 1; 1, −1][1.11; −0.389] = [0.51; 1.06]
and
v_1(n) = (1 − μ(1.8))ⁿ 0.51
v_2(n) = (1 − μ(0.2))ⁿ 1.06
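A small Python/NumPy sketch of the steepest-descent iteration for this system identification example (an added illustration using the R, p, and w_0 given above):

import numpy as np

R = np.array([[1.0, 0.8], [0.8, 1.0]])
p = np.array([0.8, 0.5])
w0 = np.linalg.solve(R, p)              # ≈ [1.11, -0.389]
lam, Q = np.linalg.eigh(R)              # eigenvalues ≈ [0.2, 1.8]

mu, w = 0.5, np.zeros(2)
for n in range(100):
    w = w + mu * (p - R @ w)
    v = Q.T @ (w - w0)                  # each mode decays as (1 - mu*lambda_k)^n
print(w, w0)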


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: Predictor

SD convergence properties for two μ values:
[Figure, μ = 0.5: over-damped ⇒ slow convergence]
[Figure, μ = 1: under-damped ⇒ fast (erratic) convergence]


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Least Mean Squares (LMS)

Least Mean Squares (LMS)

Definition (Least Mean Squares (LMS) Algorithm)

Motivation: The error performance surface used by the SD method is not always known a priori
Solution: Use estimated values. We will use the following instantaneous estimates:
R̂(n) = x(n) x^H(n)
p̂(n) = x(n) d*(n)
Result: The estimates are RVs and thus this leads to a stochastic optimization
Historical Note: Invented in 1960 by Stanford University professor Bernard Widrow and his first Ph.D. student, Ted Hoff


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Least Mean Squares (LMS)

Recall the SD update
w(n + 1) = w(n) + (1/2) μ [−∇(J(n))]
where the gradient of the error surface at w(n) was shown to be
∇(J(n)) = −2p + 2Rw(n)
Using the instantaneous estimates,
∇̂(J(n)) = −2 x(n) d*(n) + 2 x(n) x^H(n) w(n)
= −2 x(n) [d*(n) − x^H(n) w(n)]
= −2 x(n) [d*(n) − d̂*(n)]
= −2 x(n) e*(n)
where e*(n) is the complex conjugate of the estimation error.


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Least Mean Squares (LMS)

Utilizing ∇̂(J(n)) = −2 x(n) e*(n) in the update,
w(n + 1) = w(n) + (1/2) μ [−∇̂(J(n))] = w(n) + μ x(n) e*(n)   [LMS update]
The LMS algorithm belongs to the family of stochastic gradient algorithms
The update is extremely simple
Although the instantaneous estimates may have large variance, the LMS algorithm is recursive and effectively averages these estimates
The simplicity and good performance of the LMS algorithm make it the benchmark against which other optimization algorithms are judged
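A minimal Python/NumPy sketch of the LMS recursion (an added illustration; the data model, filter length, and step size below are arbitrary assumptions chosen to satisfy the stability bound discussed next):

import numpy as np

def lms(x, d, M, mu):
    # LMS: w(n+1) = w(n) + mu * x(n) * conj(e(n)); conj is a no-op for real data
    w, e = np.zeros(M), np.zeros(len(x))
    for n in range(M - 1, len(x)):
        xn = x[n - M + 1:n + 1][::-1]      # x(n) = [x(n), x(n-1), ..., x(n-M+1)]
        e[n] = d[n] - w @ xn               # estimation error e(n)
        w = w + mu * e[n] * xn             # LMS update
    return w, e

rng = np.random.default_rng(0)
x = rng.normal(size=20_000)
h_true = np.array([0.7, -0.2, 0.1])        # assumed unknown system
d = np.convolve(x, h_true, mode="full")[:len(x)] + 0.05 * rng.normal(size=len(x))
w, e = lms(x, d, M=3, mu=0.01)
print(w)                                    # approaches h_true (in the mean)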


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Convergence Analysis

Convergence Analysis

Independence Theorem

The following conditions are assumed to hold:
1. The vectors x(1), x(2), ···, x(n) are statistically independent
2. x(n) is independent of d(1), d(2), ···, d(n − 1)
3. d(n) is statistically dependent on x(n), but is independent of d(1), d(2), ···, d(n − 1)
4. x(n) and d(n) are mutually Gaussian
The independence theorem is invoked in the LMS algorithm analysis
The independence theorem is justified in some cases, e.g., beamforming, where we receive independent vector observations
In other cases it is not well justified, but it allows the analysis to proceed (i.e., when all else fails, invoke simplifying assumptions)


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Convergence Analysis

We will invoke the independence theorem to show that w(n) converges to the optimal solution in the mean:
lim_{n→∞} E{w(n)} = w_0
To prove this, evaluate the update
w(n + 1) = w(n) + μ x(n) e*(n)
⇒ w(n + 1) − w_0 = w(n) − w_0 + μ x(n) e*(n)
⇒ c(n + 1) = c(n) + μ x(n) (d*(n) − x^H(n) w(n))
= c(n) + μ x(n) d*(n) − μ x(n) x^H(n) [w(n) − w_0 + w_0]
= c(n) + μ x(n) d*(n) − μ x(n) x^H(n) c(n) − μ x(n) x^H(n) w_0
= [I − μ x(n) x^H(n)] c(n) + μ x(n) [d*(n) − x^H(n) w_0]
= [I − μ x(n) x^H(n)] c(n) + μ x(n) e_0*(n)


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Convergence Analysis

Take the expectation of the update, noting that w(n) is based on past inputs and desired values
⇒ w(n), and consequently c(n), are independent of x(n) (Independence Theorem)
Thus
c(n + 1) = [I − μ x(n) x^H(n)] c(n) + μ x(n) e_0*(n)
⇒ E{c(n + 1)} = (I − μR) E{c(n)} + μ E{x(n) e_0*(n)}   [the last term is 0 — why?]
= (I − μR) E{c(n)}
Using arguments similar to the SD case, we have
lim_{n→∞} E{c(n)} = 0   if 0 < μ < 2/λ_max
or equivalently
lim_{n→∞} E{w(n)} = w_0   if 0 < μ < 2/λ_max


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Convergence Analysis

Noting that ∑_{i=1}^{M} λ_i = trace[R]
⇒ λ_max ≤ trace[R] = M r(0) = M σ_x²
Thus a more conservative bound (and one easier to determine) is
0 < μ < 2/(M σ_x²)
Convergence in the mean,
lim_{n→∞} E{w(n)} = w_0,
is a weak condition that says nothing about the variance, which may even grow
A stronger condition is convergence in the mean square, which says
lim_{n→∞} E{|c(n)|²} = constant


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Convergence Analysis

Proving convergence in the mean square is equivalent to showing that
lim_{n→∞} J(n) = lim_{n→∞} E{|e(n)|²} = constant
To evaluate the limit, write e(n) as
e(n) = d(n) − d̂(n) = d(n) − w^H(n) x(n)
= d(n) − w_0^H x(n) − [w^H(n) − w_0^H] x(n)
= e_0(n) − c^H(n) x(n)
Thus
J(n) = E{|e(n)|²} = E{(e_0(n) − c^H(n) x(n))(e_0*(n) − x^H(n) c(n))}
= J_min + E{c^H(n) x(n) x^H(n) c(n)}   [cross terms → 0 — why?]
= J_min + J_ex(n)
where J_ex(n) = E{c^H(n) x(n) x^H(n) c(n)} is the excess MSE


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Convergence Analysis

Since J_ex(n) is a scalar,
J_ex(n) = E{c^H(n) x(n) x^H(n) c(n)}
= E{trace[c^H(n) x(n) x^H(n) c(n)]}
= E{trace[x(n) x^H(n) c(n) c^H(n)]}
= trace[E{x(n) x^H(n) c(n) c^H(n)}]
Invoking the independence theorem,
J_ex(n) = trace[E{x(n) x^H(n)} E{c(n) c^H(n)}] = trace[R K(n)]
where
K(n) = E{c(n) c^H(n)}


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Convergence Analysis

Thus
J(n) = J_min + J_ex(n) = J_min + trace[R K(n)]
Recall
Q^H R Q = Ω   or   R = QΩQ^H
Set
S(n) ≡ Q^H K(n) Q
where S(n) need not be diagonal. Then
K(n) = QQ^H K(n) QQ^H = Q S(n) Q^H   [since Q⁻¹ = Q^H]


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Convergence Analysis

Utilizing R = QΩQ^H and K(n) = Q S(n) Q^H in the excess error expression,
J_ex(n) = trace[R K(n)]
= trace[QΩQ^H Q S(n) Q^H]
= trace[QΩS(n) Q^H]
= trace[Q^H QΩS(n)]
= trace[ΩS(n)]
Since Ω is diagonal,
J_ex(n) = trace[ΩS(n)] = ∑_{i=1}^{M} λ_i s_i(n)
where s_1(n), s_2(n), ···, s_M(n) are the diagonal elements of S(n).


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Convergence Analysis

The previously derived recursion
E{c(n + 1)} = (I − μR) E{c(n)}
can be modified to yield a recursion on S(n),
S(n + 1) = (I − μΩ) S(n) (I − μΩ) + μ² J_min Ω
which for the diagonal elements is
s_i(n + 1) = (1 − μλ_i)² s_i(n) + μ² J_min λ_i,   i = 1, 2, ···, M
Suppose J_ex(n) converges; then s_i(n + 1) = s_i(n)
⇒ s_i(n) = (1 − μλ_i)² s_i(n) + μ² J_min λ_i
⇒ s_i(n) = μ² J_min λ_i / (1 − (1 − μλ_i)²) = μ² J_min λ_i / (2μλ_i − μ²λ_i²) = μ J_min / (2 − μλ_i),   i = 1, 2, ···, M

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 304 / 406


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Convergence Analysis

Consider again

Jex(n) = trace[Ω S(n)] = Σ_{i=1}^M λ_i s_i(n)

Taking the limit and utilizing s_i(n) = μJmin/(2 − μλ_i),

lim_{n→∞} Jex(n) = Jmin Σ_{i=1}^M μλ_i/(2 − μλ_i)

The LMS misadjustment is defined as

MA = lim_{n→∞} Jex(n) / Jmin = Σ_{i=1}^M μλ_i/(2 − μλ_i)

Note: A misadjustment of 10% or less is generally considered acceptable.

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 305 / 406
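As a quick numerical check of the misadjustment expression above, the following sketch (a minimal Python/NumPy illustration added here; the 2×2 correlation matrix and step size are arbitrary assumed values) evaluates MA = Σ_i μλ_i/(2 − μλ_i) from the eigenvalues of R:

import numpy as np

# Assumed illustrative correlation matrix and step size (not from the lecture data)
R = np.array([[1.0, 0.5],
              [0.5, 1.0]])
mu = 0.05

lam = np.linalg.eigvalsh(R)                  # eigenvalues of R (real, nonnegative)
assert 0 < mu < 2.0 / lam.max()              # convergence-in-the-mean condition
MA = np.sum(mu * lam / (2.0 - mu * lam))     # LMS misadjustment
print("eigenvalues:", lam, "misadjustment:", MA)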

Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: First-Order Predictor

Example

This is a one-tap predictor:

x̂(n) = w(n)x(n − 1)

Take the underlying process to be a real first-order AR process

x(n) = −a x(n − 1) + v(n)

The weight update is

w(n + 1) = w(n) + μx(n − 1)e(n)   [LMS update for observation x(n − 1)]
         = w(n) + μx(n − 1)[x(n) − w(n)x(n − 1)]

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 306 / 406


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: First-Order Predictor

Since

x(n) = −a x(n − 1) + v(n)   [AR model]

and

x̂(n) = w(n)x(n − 1)   [one-tap predictor]

⇒ w_0 = −a

Note that E{x(n − 1)e_0(n)} = E{x(n − 1)v(n)} = 0 proves the optimality of w_0.

Set μ = 0.05 and consider two cases:

a        σ_x²
−0.99    0.93627
 0.99    0.995

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 307 / 406
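A minimal simulation sketch of this one-tap predictor (Python/NumPy, added here for illustration; only a, μ, and σ_x² are taken from the example, the rest is an assumed straightforward implementation):

import numpy as np

rng = np.random.default_rng(0)
a, mu, N = 0.99, 0.05, 5000
sigma_x2 = 0.995                           # input power for the a = 0.99 case
sigma_v2 = sigma_x2 * (1.0 - a**2)         # AR(1): sigma_x^2 = sigma_v^2 / (1 - a^2)

x = np.zeros(N)
v = rng.normal(0.0, np.sqrt(sigma_v2), N)
for n in range(1, N):
    x[n] = -a * x[n - 1] + v[n]            # first-order AR process

w = 0.0                                    # one-tap predictor weight
for n in range(1, N):
    e = x[n] - w * x[n - 1]                # prediction error
    w += mu * x[n - 1] * e                 # LMS update
print("final weight:", w, "(optimum w0 = -a =", -a, ")")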

Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: First-Order Predictor

Figure: Transient behavior of adaptive first-order predictor weight w(n) for μ = 0.05.

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 308 / 406


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: First-Order Predictor

Figure: Transient behavior of adaptive first-order predictor squared error for μ = 0.05.

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 309 / 406

Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: First-Order Predictor

Figure: Mean-squared error learning curves for an adaptive first-order predictor with varying step-size parameter μ.

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 310 / 406


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: First-Order Predictor

Consider the expected trajectory of w(n). Recall

w(n + 1) = w(n) + μx(n − 1)e(n)
         = w(n) + μx(n − 1)[x(n) − w(n)x(n − 1)]
         = [1 − μx(n − 1)x(n − 1)]w(n) + μx(n − 1)x(n)

In this example, x(n) = −a x(n − 1) + v(n). Substituting in:

w(n + 1) = [1 − μx(n − 1)x(n − 1)]w(n) + μx(n − 1)[−a x(n − 1) + v(n)]
         = [1 − μx(n − 1)x(n − 1)]w(n) − μa x(n − 1)x(n − 1) + μx(n − 1)v(n)

Taking the expectation and invoking the independence theorem,

E{w(n + 1)} = (1 − μσ_x²)E{w(n)} − μσ_x² a

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 311 / 406

Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: First-Order Predictor

Figure: Comparison of experimental results with theory, based on w(n).

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 312 / 406


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: First-Order Predictor

Next, derive a theoretical expression for J(n).

Note that the initial value of J(n) is

J(0) = E{(x(0) − w(0)x(−1))²} = E{(x(0))²} = σ_x²

and the final value is

J(∞) = Jmin + Jex
     = E{(x(n) − w_0 x(n − 1))²} + Jex
     = E{(v(n))²} + Jex
     = σ_v² + Jmin μλ_1/(2 − μλ_1)

Note λ_1 = σ_x². Thus,

J(∞) = σ_v² + σ_v² (μσ_x²/(2 − μσ_x²)) = σ_v² (1 + μσ_x²/(2 − μσ_x²))

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 313 / 406

Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: First-Order Predictor

And if μ is small,

J(∞) = σ_v² (1 + μσ_x²/(2 − μσ_x²)) ≈ σ_v² (1 + μσ_x²/2)

Putting all the components together:

J(n) = [σ_x² − σ_v²(1 + (μ/2)σ_x²)] (1 − μσ_x²)^{2n} + σ_v²(1 + (μ/2)σ_x²)

where the bracketed term is J(0) − J(∞), the factor (1 − μσ_x²)^{2n} decays from 1 to 0, and the last term is J(∞).

Also, the time constant is

τ = −1/[2 ln(1 − μλ_1)] = −1/[2 ln(1 − μσ_x²)] ≈ 1/(2μσ_x²)

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 314 / 406
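The closed-form learning curve and time constant above are easy to evaluate numerically; a short sketch (added illustration; σ_x², σ_v², and μ are assumed placeholder values roughly matching the predictor example):

import numpy as np

mu, sigma_x2, sigma_v2 = 0.001, 0.995, 0.02     # assumed illustrative values
n = np.arange(0, 3000)

J_inf = sigma_v2 * (1.0 + 0.5 * mu * sigma_x2)                   # small-mu J(infinity)
J = (sigma_x2 - J_inf) * (1.0 - mu * sigma_x2)**(2 * n) + J_inf  # theoretical J(n)
tau = 1.0 / (2.0 * mu * sigma_x2)                                # approximate time constant
print("J(0) =", J[0], " J(inf) =", J_inf, " tau =", tau)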


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: First-Order Predictor

Figure: Comparison of experimental results with theory for the adaptive predictor, based on the mean-square error for μ = 0.001.

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 315 / 406

Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: Adaptive Equalization

Example (Adaptive Equalization)

Objective: Pass a known signal through an unknown channel to invert the effects the channel and noise have on the signal

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 316 / 406


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: Adaptive Equalization

The signal is a Bernoulli sequence

x_n = +1 with probability 1/2, −1 with probability 1/2

The additive noise is ∼ N(0, 0.001)

The channel has a raised cosine response

h_n = (1/2)[1 + cos((2π/w)(n − 2))]   n = 1, 2, 3
      0                              otherwise

⇒ w controls the eigenvalue spread χ(R)
⇒ h_n is symmetric about n = 2 and thus introduces a delay of 2

We will use an M = 11 tap filter, which is symmetric about n = 5
⇒ Introduces a delay of 5

Thus an overall delay of δ = 5 + 2 = 7 is added to the system

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 317 / 406

Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: Adaptive Equalization

Channel response and Filter response

Figure: (a) Impulse response of channel; (b) impulse response of optimum transversal equalizer.

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 318 / 406


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: Adaptive Equalization

Consider three w values.

Note the step size is bounded by the w = 3.5 case:

μ ≤ 2/(M r(0)) = 2/(11 × 1.3022) = 0.14

Choose μ = 0.075 in all cases.

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 319 / 406
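A compact simulation sketch of this equalization experiment (added Python/NumPy illustration; the raised-cosine parameter is set to the w = 3.5 case, μ = 0.075 and the delay δ = 7 follow the slides, and everything else is an assumed minimal implementation):

import numpy as np

rng = np.random.default_rng(1)
M, mu, delta, N = 11, 0.075, 7, 20000
w_rc = 3.5                                            # raised-cosine parameter (w = 3.5 case)

h = np.zeros(4)
for k in (1, 2, 3):
    h[k] = 0.5 * (1.0 + np.cos(2.0 * np.pi / w_rc * (k - 2)))   # channel impulse response

x = rng.choice([-1.0, 1.0], size=N)                   # Bernoulli (+/-1) data
u = np.convolve(x, h)[:N] + rng.normal(0.0, np.sqrt(0.001), N)  # channel output plus noise

w = np.zeros(M)                                       # equalizer taps
for n in range(M, N):
    u_vec = u[n - M + 1:n + 1][::-1]                  # [u(n), ..., u(n-M+1)]
    e = x[n - delta] - w @ u_vec                      # error against delayed symbol
    w += mu * u_vec * e                               # LMS update
print("equalizer taps:", np.round(w, 3))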

Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: Adaptive Equalization

Figure: Learning curves of the LMS algorithm for an adaptive equalizer with number of taps M = 11, step-size parameter μ = 0.075, and varying eigenvalue spread χ(R).

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 320 / 406


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: Adaptive Equalization

Ensemble-average impulse response of the adaptive equalizer (after 1000 iterations) for each of four different eigenvalue spreads.

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 321 / 406

Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: Adaptive Equalization

Figure: Learning curves of the LMS algorithm for an adaptive equalizer with the number of taps M = 11, fixed eigenvalue spread, and varying step-size parameter μ.

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 322 / 406


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: LMS Directionality

Example

Directionality of the LMS algorithm:

The speed of convergence of the LMS algorithm is faster in certain directions in the weight space.

If the convergence is in the appropriate direction, the convergence can be accelerated by increased eigenvalue spread.

To investigate this phenomenon, consider the deterministic signal

x(n) = A_1 cos(ω_1 n) + A_2 cos(ω_2 n)

Even though it is deterministic, a correlation matrix can be determined:

R = (1/2) [ A_1² + A_2²                          A_1² cos(ω_1) + A_2² cos(ω_2)
            A_1² cos(ω_1) + A_2² cos(ω_2)        A_1² + A_2²                  ]

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 323 / 406

Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: LMS Directionality

Determining the eigenvalues and eigenvectors yields

λ_1 = (1/2)A_1²(1 + cos(ω_1)) + (1/2)A_2²(1 + cos(ω_2))
λ_2 = (1/2)A_1²(1 − cos(ω_1)) + (1/2)A_2²(1 − cos(ω_2))

and

q_1 = [1, 1]^T    q_2 = [−1, 1]^T

Case 1: A_1 = 1, A_2 = 0.5, ω_1 = 1.2, ω_2 = 0.1

x_a(n) = cos(1.2n) + 0.5 cos(0.1n) and χ(R) = 2.9

Case 2: A_1 = 1, A_2 = 0.5, ω_1 = 0.6, ω_2 = 0.23

x_b(n) = cos(0.6n) + 0.5 cos(0.23n) and χ(R) = 12.9

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 324 / 406
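The quoted eigenvalue spreads can be reproduced directly from the 2×2 correlation matrix above; a small added sketch (Python/NumPy, illustration only):

import numpy as np

def two_tone_R(A1, A2, w1, w2):
    # 2x2 correlation matrix of x(n) = A1 cos(w1 n) + A2 cos(w2 n)
    r0 = 0.5 * (A1**2 + A2**2)
    r1 = 0.5 * (A1**2 * np.cos(w1) + A2**2 * np.cos(w2))
    return np.array([[r0, r1], [r1, r0]])

for (w1, w2) in [(1.2, 0.1), (0.6, 0.23)]:
    R = two_tone_R(1.0, 0.5, w1, w2)
    lam, Q = np.linalg.eigh(R)
    print("omega =", (w1, w2), " chi(R) =", round(lam.max() / lam.min(), 1))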


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: LMS Directionality

Since p is undefined, set p = λ_i q_i.

Then, since p = Rw_0, we see (two cases)

p = λ_1 q_1 ⇒ Rw_0 = λ_1 q_1 ⇒ w_0 = q_1 = [1, 1]^T
p = λ_2 q_2 ⇒ Rw_0 = λ_2 q_2 ⇒ w_0 = q_2 = [−1, 1]^T

Utilize 200 iterations of the algorithm.

Consider the minimum eigenfilter first, w_0 = q_2 = [−1, 1]^T.
Consider the maximum eigenfilter second, w_0 = q_1 = [1, 1]^T.

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 325 / 406

Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: LMS Directionality

Convergence of the LMS algorithm, for a deterministic sinusoidal process, along the “slow” eigenvector (i.e., minimum eigenfilter).

For input xa(n) (χ(R) = 2.9) For input xb(n) (χ(R) = 12.9)

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 326 / 406


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: LMS Directionality

Convergence of the LMS algorithm, for a deterministic sinusoidal process, along the “fast” eigenvector (i.e., maximum eigenfilter).

For input xa(n) (χ(R) = 2.9) For input xb(n) (χ(R) = 12.9)

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 327 / 406

Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Normalized LMS Algorithm

Observation: The LMS correction is proportional to μx(n)e*(n),

w(n + 1) = w(n) + μx(n)e*(n)

If x(n) is large, the LMS update suffers from gradient noise amplification.

The normalized LMS algorithm seeks to avoid gradient noise amplification.

The step size is made time varying, μ(n), and optimized to minimize the next-step error:

w(n + 1) = w(n) + (1/2)μ(n)[−∇J(n)]
         = w(n) + μ(n)[p − Rw(n)]

Choose μ(n) such that w(n + 1) produces the minimum MSE,

J(n + 1) = E{|e(n + 1)|²}

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 328 / 406


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Normalized LMS Algorithm

Let ∇(n) ≡ ∇J(n) and note

e(n + 1) = d(n + 1) − w^H(n + 1)x(n + 1)

Objective: Choose μ(n) such that it minimizes J(n + 1).

The optimal step size, μ_0(n), will be a function of R and ∇(n).

⇒ Use instantaneous estimates of these values

To determine μ_0(n), expand J(n + 1):

J(n + 1) = E{e(n + 1)e*(n + 1)}
         = E{(d(n + 1) − w^H(n + 1)x(n + 1))(d*(n + 1) − x^H(n + 1)w(n + 1))}
         = σ_d² − w^H(n + 1)p − p^H w(n + 1) + w^H(n + 1)Rw(n + 1)

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 329 / 406

Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Normalized LMS Algorithm

Now use the fact that w(n + 1) = w(n) − (1/2)μ(n)∇(n):

J(n + 1) = σ_d² − w^H(n + 1)p − p^H w(n + 1) + w^H(n + 1)Rw(n + 1)
         = σ_d² − [w(n) − (1/2)μ(n)∇(n)]^H p − p^H [w(n) − (1/2)μ(n)∇(n)]
           + [w(n) − (1/2)μ(n)∇(n)]^H R [w(n) − (1/2)μ(n)∇(n)]

where the last term expands as

w^H(n)Rw(n) − (1/2)μ(n)w^H(n)R∇(n) − (1/2)μ(n)∇^H(n)Rw(n) + (1/4)μ²(n)∇^H(n)R∇(n)

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 330 / 406


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Normalized LMS Algorithm

J(n + 1) = σ_d² − [w(n) − (1/2)μ(n)∇(n)]^H p − p^H [w(n) − (1/2)μ(n)∇(n)]
           + w^H(n)Rw(n) − (1/2)μ(n)w^H(n)R∇(n)
           − (1/2)μ(n)∇^H(n)Rw(n) + (1/4)μ²(n)∇^H(n)R∇(n)

Differentiating with respect to μ(n),

∂J(n + 1)/∂μ(n) = (1/2)∇^H(n)p + (1/2)p^H∇(n) − (1/2)w^H(n)R∇(n)
                  − (1/2)∇^H(n)Rw(n) + (1/2)μ(n)∇^H(n)R∇(n)   (∗)

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 331 / 406

Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Normalized LMS Algorithm

Setting (∗) equal to 0,

μ_0(n)∇^H(n)R∇(n) = w^H(n)R∇(n) − p^H∇(n) + ∇^H(n)Rw(n) − ∇^H(n)p

⇒ μ_0(n) = [w^H(n)R∇(n) − p^H∇(n) + ∇^H(n)Rw(n) − ∇^H(n)p] / [∇^H(n)R∇(n)]
         = ([w^H(n)R − p^H]∇(n) + ∇^H(n)[Rw(n) − p]) / [∇^H(n)R∇(n)]
         = ([Rw(n) − p]^H ∇(n) + ∇^H(n)[Rw(n) − p]) / [∇^H(n)R∇(n)]
         = ((1/2)∇^H(n)∇(n) + (1/2)∇^H(n)∇(n)) / [∇^H(n)R∇(n)]
         = ∇^H(n)∇(n) / [∇^H(n)R∇(n)]

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 332 / 406


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Normalized LMS Algorithm

Using the instantaneous estimates

R̂ = x(n)x^H(n) and p̂ = x(n)d*(n)

⇒ ∇(n) = 2[R̂w(n) − p̂]
        = 2[x(n)x^H(n)w(n) − x(n)d*(n)]
        = 2[x(n)(d̂*(n) − d*(n))]
        = −2x(n)e*(n)

Thus

μ_0(n) = ∇^H(n)∇(n) / [∇^H(n)R̂∇(n)]
       = 4x^H(n)e(n)x(n)e*(n) / [2x^H(n)e(n) x(n)x^H(n) 2x(n)e*(n)]
       = |e(n)|² x^H(n)x(n) / [|e(n)|² (x^H(n)x(n))²]
       = 1/(x^H(n)x(n)) = 1/||x(n)||²

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 333 / 406

Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Normalized LMS Algorithm

Result: The NLMS update is

w(n + 1) = w(n) + (μ/||x(n)||²) x(n)e*(n)

where μ(n) = μ/||x(n)||² and μ is introduced to scale the update.

To avoid problems when ||x(n)||² ≈ 0, we add an offset:

w(n + 1) = w(n) + (μ/(a + ||x(n)||²)) x(n)e*(n)

where a > 0.

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 334 / 406
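A minimal NLMS update routine corresponding to the result above (added Python/NumPy sketch; real-valued signals are assumed, so conjugates are dropped, and the offset a is the small positive constant introduced above):

import numpy as np

def nlms_update(w, x_vec, d, mu=0.5, a=1e-6):
    """One NLMS step: w <- w + mu/(a + ||x||^2) * x * e."""
    e = d - w @ x_vec                                  # a priori error
    return w + (mu / (a + x_vec @ x_vec)) * x_vec * e, e

# Usage with arbitrary synthetic data and a hypothetical "true" filter
rng = np.random.default_rng(2)
w_true = np.array([1.0, -0.5, 0.25, 0.1])
w = np.zeros(4)
for _ in range(2000):
    x_vec = rng.normal(size=4)
    w, _ = nlms_update(w, x_vec, d=w_true @ x_vec)
print("estimated taps:", np.round(w, 3))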


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) NLMS Convergence

Objective: Analyze the NLMS convergence.

w(n + 1) = w(n) + (μ/||x(n)||²) x(n)e*(n)

Substituting e(n) = d(n) − w^H(n)x(n),

w(n + 1) = w(n) + (μ/||x(n)||²) x(n)[d*(n) − x^H(n)w(n)]
         = [I − μ x(n)x^H(n)/||x(n)||²] w(n) + μ x(n)d*(n)/||x(n)||²

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 335 / 406

Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) NLMS Convergence

Objective: Compare the NLMS and LMS algorithms:

NLMS:

w(n + 1) = [I − μ x(n)x^H(n)/||x(n)||²] w(n) + μ x(n)d*(n)/||x(n)||²

LMS:

w(n + 1) = [I − μ x(n)x^H(n)] w(n) + μ x(n)d*(n)

By observation, we see the following corresponding terms:

LMS                 NLMS
μ                   μ
x(n)x^H(n)          x(n)x^H(n)/||x(n)||²
x(n)d*(n)           x(n)d*(n)/||x(n)||²

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 336 / 406


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) NLMS Convergence

LMS                 NLMS
μ                   μ
x(n)x^H(n)          x(n)x^H(n)/||x(n)||²
x(n)d*(n)           x(n)d*(n)/||x(n)||²

LMS case:

0 < μ < 2/trace[E{x(n)x^H(n)}] = 2/trace[R]

guarantees stability. By analogy,

0 < μ < 2/trace[E{x(n)x^H(n)/||x(n)||²}]

guarantees stability of the NLMS.

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 337 / 406

Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) NLMS Convergence

To analyze the bound, make the following approximation:

E{x(n)x^H(n)/||x(n)||²} ≈ E{x(n)x^H(n)} / E{||x(n)||²}

Then

trace[E{x(n)x^H(n)/||x(n)||²}] = trace[E{x(n)x^H(n)}] / E{||x(n)||²}
                               = E{trace[x(n)x^H(n)]} / E{||x(n)||²}
                               = E{trace[x^H(n)x(n)]} / E{||x(n)||²}
                               = E{||x(n)||²} / E{||x(n)||²} = 1

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 338 / 406


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) NLMS Convergence

Thus

0 < μ < 2/trace[E{x(n)x^H(n)/||x(n)||²}] = 2

Final Result: The NLMS update

w(n + 1) = w(n) + μ (x(n)/||x(n)||²) e*(n)

will converge if 0 < μ < 2.

Note:

The NLMS has a simpler convergence criterion than the LMS.
The NLMS generally converges faster than the LMS algorithm.

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 339 / 406

Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) The Least Square Method

Method of Least Squares (LS)

Definition (Method of Least Squares (LS))

Motivation: Develop a general method for optimally adjusting parameters to model observed data.

Solution: Set the sum of squared residuals (errors) as the performance criterion and restrict the model to be linear.

The LS filtering method is a deterministic method.
Can be applied to linear and nonlinear systems.
LS corresponds to the ML criterion if the errors have a normal distribution.
The method is related to linear regression.
The optimization procedure results in a LS best fit for a filter over the observed (training) samples.

Historical Note: Gauss developed LS in 1795 at the age of 18.

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 340 / 406


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) The Least Square Method

Consider the linear transversal filter

and a fixed number of observed samples: i = 1, 2, · · · , N.

M – the number of taps in the filter

{x(i)} – input sequence

{d(i)} – desired output sequence

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 341 / 406

Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) The Least Square Method

Objective: Set the tap weights to minimize the sum of squared errors

ε(w) = Σ_{i=M}^N |e(i)|²

Let

w = [w_0, w_1, ..., w_{M−1}]^T   [weight vector]
x(i) = [x(i), x(i − 1), ..., x(i − M + 1)]^T,  M ≤ i ≤ N   [observation vector]

The error at time i is

e(i) = d(i) − w^H x(i)

The full set of error values can be compiled into a vector.

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 342 / 406


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) The Least Square Method

Define the (N − M + 1) × 1 vectors:

ε^H = [e(M), e(M + 1), ..., e(N)]   [error vector]
d^H = [d(M), d(M + 1), ..., d(N)]   [desired vector]

Denoting the filter output as d̂(i) and using vector form:

d̂^H = [d̂(M), d̂(M + 1), ..., d̂(N)]
     = [w^H x(M), w^H x(M + 1), ..., w^H x(N)]
     = w^H [x(M), x(M + 1), ..., x(N)]
     = w^H A^H

where A^H = [x(M), x(M + 1), ..., x(N)] is the observation data matrix.

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 343 / 406

Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) The Least Square Method

Expanding the data matrix,

A^H = [x(M), x(M + 1), ..., x(N)]

    = [ x(M)       x(M + 1)   ...   x(N)
        x(M − 1)   x(M)       ...   x(N − 1)
        ...        ...        ...   ...
        x(1)       x(2)       ...   x(N − M + 1) ]

⇒ A^H is an M × (N − M + 1) rectangular Toeplitz matrix.

Combining all the above:

Filter output vector: d̂^H = w^H A^H
Desired output vector: d^H
Error vector: ε^H = d^H − d̂^H = d^H − w^H A^H

Note: All incorporate samples for M ≤ i ≤ N.

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 344 / 406


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) The Least Square Method

The sum of the squared estimation errors can now be written as

ε(w) = Σ_{i=M}^N |e(i)|²
     = ε^H ε
     = (d^H − w^H A^H)(d − Aw)
     = d^H d − d^H Aw − w^H A^H d + w^H A^H Aw

Minimizing with respect to w,

∂ε(w)/∂w = −2A^H d + 2A^H Aw   (∗)

Setting (∗) equal to zero gives the optimal LS weight ŵ:

⇒ A^H A ŵ = A^H d   [deterministic normal equation]

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 345 / 406

Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) The Least Square Method

Note: A is not generally square, and thus not invertible, but AHA issquare and generally invertible

AHAw = AHd

⇒ w = (AHA)−1AHd

The deterministic normal equation can be rearranged as

AHAw − AHd = 0

AH(Aw − d) = 0 [or using εmin = d − Aw]

AHεmin = 0

Observation: The LS orthogonality principle states that the estimateerror εmin is orthogonal to the row vectors of the data matrix AH

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 346 / 406


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) The Least Square Method

Objective: Determine the minimum sum of squared errors (e_min).

e_min = ε_min^H ε_min
      = (d^H − ŵ^H A^H)(d − Aŵ)
      = d^H d − ŵ^H A^H d − d^H Aŵ + ŵ^H A^H Aŵ

Utilizing the normal equation, ŵ^H A^H d = ŵ^H A^H Aŵ:

e_min = d^H d − ŵ^H A^H d − d^H Aŵ + ŵ^H A^H Aŵ
      = d^H d − d^H Aŵ

or, using ŵ = (A^H A)^{-1} A^H d,

e_min = d^H d − d^H A (A^H A)^{-1} A^H d   (∗)

Note that

d^H d = Σ_{i=M}^N |d(i)|²   [energy of desired response]

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 347 / 406

Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) The Least Square Method

Consider again the deterministic normal equation

A^H A ŵ = A^H d

Note that

A^H A = [x(M), x(M + 1), ..., x(N)] ·
        [ x^H(M)
          x^H(M + 1)
          ...
          x^H(N) ]
      = Σ_{i=M}^N x(i)x^H(i)
      = Φ   [time-averaged correlation matrix, size M × M]

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 348 / 406


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) The Least Square Method

From Φ = Σ_{i=M}^N x(i)x^H(i) it can be shown that:

1. Φ is Hermitian
2. Φ is nonnegative definite

To prove this, note that for any a,

a^H Φ a = Σ_{i=M}^N a^H x(i)x^H(i) a
        = Σ_{i=M}^N [a^H x(i)][a^H x(i)]^H
        = Σ_{i=M}^N |a^H x(i)|² ≥ 0

3. From (1) and (2) we can prove that the eigenvalues of Φ are real and nonnegative

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 349 / 406

Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) The Least Square Method

The deterministic normal equation,

A^H A ŵ = A^H d,

also employs

A^H d = [x(M), x(M + 1), ..., x(N)] ·
        [ d*(M)
          d*(M + 1)
          ...
          d*(N) ]
      = Σ_{i=M}^N x(i)d*(i)
      = θ   [time-averaged cross-correlation vector, size M × 1]

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 350 / 406


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) The Least Square Method

Thus the deterministic normal equation, A^H A ŵ = A^H d, reduces to

Φŵ = θ

Φ is usually positive definite (always positive semidefinite) ⇒ the solution is well defined:

ŵ = Φ^{-1} θ   [LS optimal weight vector]

Also, recall from (∗) that e_min can be expressed as

e_min = d^H d − d^H A (A^H A)^{-1} A^H d   [with d^H A = θ^H, (A^H A)^{-1} = Φ^{-1}, A^H d = θ]
      = e_d − θ^H Φ^{-1} θ

where e_d is the energy of the desired signal.

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 351 / 406
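In practice, ŵ = Φ^{-1}θ is obtained by solving the normal equation rather than forming the inverse explicitly; a short sketch with synthetic data (added Python/NumPy illustration, all signal values assumed):

import numpy as np

rng = np.random.default_rng(3)
M, N = 4, 200
w_true = np.array([0.8, -0.4, 0.2, -0.1])             # hypothetical true filter (real data)

x = rng.normal(size=N)
d = np.convolve(x, w_true)[:N] + 0.01 * rng.normal(size=N)

# Rows of A are x(i)^T = [x(i), ..., x(i-M+1)] for the valid sample range
A = np.array([x[i - M + 1:i + 1][::-1] for i in range(M - 1, N)])
dv = d[M - 1:N]

Phi = A.T @ A                                          # time-averaged correlation matrix
theta = A.T @ dv                                       # time-averaged cross-correlation vector
w_hat = np.linalg.solve(Phi, theta)                    # LS weights: Phi w = theta
e_min = dv @ dv - theta @ w_hat                        # e_min = e_d - theta^H Phi^{-1} theta
print("w_hat:", np.round(w_hat, 3), " e_min:", round(e_min, 5))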

Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) The Least Square Method

Consider again the orthogonality principle

A^H ε_min = 0

Recall that d̂ = Aŵ. Thus

A^H ε_min = 0
⇒ ŵ^H A^H ε_min = ŵ^H 0
⇒ d̂^H ε_min = 0

Result: The minimum estimation error vector, ε_min, is orthogonal to the data matrix A^H and to the LS estimate d̂.

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 352 / 406


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Analysis of the LS Solution

Objective: Analyze the Least Squares solution in terms of

Bias – is the LS solution unbiased?
BLUE – is the LS solution the Best Linear Unbiased Estimate?

Assumption: Take the true underlying system to be a linear regression,

d(i) = Σ_{k=0}^{M−1} w*_{0k} x(i − k) + e_0(i)
     = w_0^H x(i) + e_0(i)

e_0(i) is the unobservable measurement error
⇒ e_0(i) is white (uncorrelated) with zero mean and variance σ²

Express the desired signal in vector form:

d = A w_0 + ε_0

where ε_0^H = [e_0(M), e_0(M + 1), ..., e_0(N)]

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 353 / 406

Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Analysis of the LS Solution

Objective: Evaluate the bias of ŵ.

Recall that ŵ = (A^H A)^{-1} A^H d. Using d = A w_0 + ε_0 in the above,

ŵ = (A^H A)^{-1} A^H d
  = (A^H A)^{-1} A^H (A w_0 + ε_0)
  = (A^H A)^{-1} A^H A w_0 + (A^H A)^{-1} A^H ε_0
  = w_0 + (A^H A)^{-1} A^H ε_0   (∗)

Note A is fixed. Thus taking the expectation of (∗) yields

E{ŵ} = w_0 + (A^H A)^{-1} A^H E{ε_0} = w_0

Result: The LS estimate, ŵ, is unbiased.

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 354 / 406


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Analysis of the LS Solution

Objective: Evaluate the covariance of ŵ.

Note that from (∗),

ŵ = w_0 + (A^H A)^{-1} A^H ε_0
⇒ ŵ − w_0 = (A^H A)^{-1} A^H ε_0

Thus

cov[ŵ] = E{(ŵ − w_0)(ŵ − w_0)^H}
       = E{(A^H A)^{-1} A^H ε_0 ε_0^H A (A^H A)^{-1}}
       = Φ^{-1} A^H E{ε_0 ε_0^H} A Φ^{-1}   [E{ε_0 ε_0^H} = σ²I]
       = σ² Φ^{-1} Φ Φ^{-1} = σ² Φ^{-1}   (∗1)

Result: The covariance of ŵ is proportional to (1) the variance of the measurement noise and (2) the inverse of the time-averaged correlation matrix.

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 355 / 406

Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Analysis of the LS Solution

Objective: Show that the LS estimate ŵ is the Best Linear Unbiased Estimate (BLUE).

Consider any linear unbiased estimate w̃. Note that w̃ is a linear function of the observed data and can thus be written as

w̃ = Bd

where B is an M × (N − M + 1) matrix.

Substituting d = A w_0 + ε_0 into the above,

w̃ = B A w_0 + B ε_0   (∗)
⇒ E{w̃} = B A w_0
⇒ B A = I   [since w̃ is unbiased]

Thus B A = I and (∗) ⇒ w̃ = w_0 + B ε_0

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 356 / 406


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Analysis of the LS Solution

Rearranging w̃ = w_0 + B ε_0,

w̃ − w_0 = B ε_0

⇒ cov[w̃] = E{(w̃ − w_0)(w̃ − w_0)^H}
          = E{B ε_0 ε_0^H B^H}
          = σ² B B^H   (∗2)

Now define

Ψ = B − (A^H A)^{-1} A^H

⇒ Ψ Ψ^H = [B − Φ^{-1} A^H][B^H − A Φ^{-1}]
        = B B^H − B A Φ^{-1} − Φ^{-1} A^H B^H + Φ^{-1} A^H A Φ^{-1}   [B A = A^H B^H = I, Φ^{-1}ΦΦ^{-1} = Φ^{-1}]
        = B B^H − Φ^{-1} − Φ^{-1} + Φ^{-1}
        = B B^H − Φ^{-1}
        = B B^H − (A^H A)^{-1}

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 357 / 406

Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Analysis of the LS Solution

Observation: The diagonal elements of Ψ Ψ^H must be ≥ 0.

Thus Ψ Ψ^H = B B^H − (A^H A)^{-1} ⇒

diag[B B^H] ≥ diag[(A^H A)^{-1}]
⇒ diag[σ² B B^H] ≥ diag[σ² (A^H A)^{-1}]   (∗)

But recall from (∗1) and (∗2) that

cov[ŵ] = σ² (A^H A)^{-1} and cov[w̃] = σ² B B^H

Utilizing these results in (∗) ⇒

variance[w̃_i] ≥ variance[ŵ_i],   i = 1, 2, ..., M

Thus the weights in ŵ have lower variance than those of any other linear unbiased estimate.

Result: The LS estimate ŵ is unbiased and has the smallest weight variance ⇒ it is the Best Linear Unbiased Estimate (BLUE).

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 358 / 406


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Recursive Least Squares (RLS)

Definition (Recursive Least Squares (RLS))

Motivation: LS requires solving

ŵ = (A^H A)^{-1} A^H d = Φ^{-1} θ

where

Φ = Σ_{i=M}^N x(i)x^H(i)  and  θ = Σ_{i=M}^N x(i)d*(i)

(A^H A) is M × M, and inversion requires O(M³) multiplications and additions.

Approach: Suppose the LS optimal weights are known at time n, w(n). As time evolves, find the new estimate, w(n + 1), in terms of w(n).

Employ the matrix inversion lemma to reduce the number of computations.

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 359 / 406

Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Recursive Least Squares (RLS)

Let the observation sequence be x(1), x(2), ..., x(n)
⇒ Assume x(l) = 0 for l ≤ 0

Define the error as

ε(n) = Σ_{i=1}^n β(n, i)|e(i)|²

where

e(i) = d(i) − w^H(n)x(i)
x(i) = [x(i), x(i − 1), ..., x(i − M + 1)]^T
w(n) = [w_0(n), w_1(n), ..., w_{M−1}(n)]^T

⇒ β(n, i) ∈ (0, 1] is a forgetting factor used in cases with nonstationary statistics

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 360 / 406


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Recursive Least Squares (RLS)

A commonly used forgetting factor is the exponential forgetting factor

β(n, i) = λ^{n−i},   i = 1, 2, ..., n,   λ ∈ (0, 1]

Thus,

ε(n) = Σ_{i=1}^n λ^{n−i} |e(i)|²

The LS solution is given by the deterministic normal equation

Φ(n)w(n) = θ(n)

where now

Φ(n) = Σ_{i=1}^n λ^{n−i} x(i)x^H(i)
θ(n) = Σ_{i=1}^n λ^{n−i} x(i)d*(i)

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 361 / 406

Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Recursive Least Squares (RLS)

The normal equation terms can be updated recursively:

Φ(n) = Σ_{i=1}^n λ^{n−i} x(i)x^H(i)
     = λ [Σ_{i=1}^{n−1} λ^{(n−1)−i} x(i)x^H(i)] + x(n)x^H(n)
     = λΦ(n − 1) + x(n)x^H(n)

Similarly,

θ(n) = Σ_{i=1}^n λ^{n−i} x(i)d*(i)
     = λ [Σ_{i=1}^{n−1} λ^{(n−1)−i} x(i)d*(i)] + x(n)d*(n)
     = λθ(n − 1) + x(n)d*(n)

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 362 / 406


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Recursive Least Squares (RLS)

Aside (matrix inversion lemma): If

A = B^{-1} + C D^{-1} C^H

where A and B are M × M, C is M × L, D is L × L, and A, B, D are positive definite (nonsingular), then

A^{-1} = B − B C [D + C^H B C]^{-1} C^H B

Apply the lemma to

Φ(n) = λΦ(n − 1) + x(n)x^H(n)

Accordingly, set

A = Φ(n)   [M × M]      B^{-1} = λΦ(n − 1)   [M × M]
C = x(n)   [M × 1]      D = 1   [1 × 1]

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 363 / 406

Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Recursive Least Squares (RLS)

Utilizing

A = Φ(n)    B^{-1} = λΦ(n − 1)    C = x(n)    D = 1

and

A^{-1} = B − B C [D + C^H B C]^{-1} C^H B   (∗)

we get

[D + C^H B C]^{-1} = [1 + λ^{-1} x^H(n) Φ^{-1}(n − 1) x(n)]^{-1}

which is a scalar. Thus evaluating (∗) yields

Φ^{-1}(n) = λ^{-1} Φ^{-1}(n − 1) − [λ^{-2} Φ^{-1}(n − 1) x(n) x^H(n) Φ^{-1}(n − 1)] / [1 + λ^{-1} x^H(n) Φ^{-1}(n − 1) x(n)]

To simplify the result, let P(n) = Φ^{-1}(n) and define the gain vector

k(n) = [λ^{-1} P(n − 1) x(n)] / [1 + λ^{-1} x^H(n) P(n − 1) x(n)]

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 364 / 406


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Recursive Least Squares (RLS)

Utilizing P(n) = Φ^{-1}(n) and k(n) = λ^{-1}P(n − 1)x(n) / [1 + λ^{-1}x^H(n)P(n − 1)x(n)],

Φ^{-1}(n) = λ^{-1}Φ^{-1}(n − 1) − [λ^{-2}Φ^{-1}(n − 1)x(n)x^H(n)Φ^{-1}(n − 1)] / [1 + λ^{-1}x^H(n)Φ^{-1}(n − 1)x(n)]

⇒ P(n) = λ^{-1}P(n − 1) − λ^{-1}k(n)x^H(n)P(n − 1)   (∗)

Also, the gain vector can be simplified as follows [multiply both sides by the denominator]:

k(n) = λ^{-1}P(n − 1)x(n) − λ^{-1}k(n)x^H(n)P(n − 1)x(n)
     = [λ^{-1}P(n − 1) − λ^{-1}k(n)x^H(n)P(n − 1)] x(n)   [bracket = P(n) from (∗)]
     = P(n)x(n) = Φ^{-1}(n)x(n)   (∗∗)

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 365 / 406

Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Recursive Least Squares (RLS)

We must now derive an update for the tap weight vector. Recall

w(n) = Φ^{-1}(n)θ(n) = P(n)θ(n)

Using the recursion θ(n) = λθ(n − 1) + x(n)d*(n) in the above,

w(n) = λP(n)θ(n − 1) + P(n)x(n)d*(n)   (∗∗∗)

Using the update (∗),

P(n) = λ^{-1}P(n − 1) − λ^{-1}k(n)x^H(n)P(n − 1),

in the first P(n) term of (∗∗∗):

w(n) = λP(n)θ(n − 1) + P(n)x(n)d*(n)
     = λ[λ^{-1}P(n − 1) − λ^{-1}k(n)x^H(n)P(n − 1)]θ(n − 1) + P(n)x(n)d*(n)

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 366 / 406


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Recursive Least Squares (RLS)

Continuing,

w(n) = λ[λ^{-1}P(n − 1) − λ^{-1}k(n)x^H(n)P(n − 1)]θ(n − 1) + P(n)x(n)d*(n)
     = P(n − 1)θ(n − 1) − k(n)x^H(n)P(n − 1)θ(n − 1) + P(n)x(n)d*(n)   [P(n − 1)θ(n − 1) = w(n − 1)]
     = w(n − 1) − k(n)x^H(n)w(n − 1) + P(n)x(n)d*(n)   [P(n)x(n) = k(n) from (∗∗)]
     = w(n − 1) − k(n)[x^H(n)w(n − 1) − d*(n)]
     = w(n − 1) + k(n)α*(n)

where α(n) = d(n) − w^H(n − 1)x(n).

Observation: Difference between e(n) and α(n):

e(n) = d(n) − w^H(n)x(n)   ⇒ a posteriori error
α(n) = d(n) − w^H(n − 1)x(n)   ⇒ a priori error

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 367 / 406

Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Recursive Least Squares (RLS)

RLS Algorithm Summary

1. Given a new sample x(n), update the gain vector
   k(n) = λ^{-1}P(n − 1)x(n) / [1 + λ^{-1}x^H(n)P(n − 1)x(n)]
2. Update the innovation: α(n) = d(n) − w^H(n − 1)x(n)
3. Update the tap weight vector: w(n) = w(n − 1) + k(n)α*(n)
4. Update the inverse correlation matrix:
   P(n) = λ^{-1}P(n − 1) − λ^{-1}k(n)x^H(n)P(n − 1)

Initial Conditions: w(0) = 0 and Φ(0) = δI, where δ is a small positive constant, δ ≈ 0.01σ_x².

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 368 / 406
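A direct transcription of the four-step summary (added Python/NumPy sketch; real-valued data assumed, initialization as stated above with Φ(0) = δI):

import numpy as np

def rls(x, d, M, lam=0.99, delta=0.01):
    """Recursive Least Squares for a length-M transversal filter (real-valued data)."""
    w = np.zeros(M)
    P = np.eye(M) / delta                    # P(0) = Phi(0)^{-1} with Phi(0) = delta*I
    for n in range(M, len(x)):
        x_vec = x[n - M + 1:n + 1][::-1]     # [x(n), x(n-1), ..., x(n-M+1)]
        Px = P @ x_vec
        k = (Px / lam) / (1.0 + x_vec @ Px / lam)      # 1) gain vector
        alpha = d[n] - w @ x_vec                       # 2) a priori error (innovation)
        w = w + k * alpha                              # 3) weight update
        P = (P - np.outer(k, x_vec) @ P) / lam         # 4) inverse correlation update
    return w

# Usage: identify a hypothetical FIR system from input/output data
rng = np.random.default_rng(4)
h_true = np.array([1.0, 0.5, -0.25])
x = rng.normal(size=2000)
d = np.convolve(x, h_true)[:2000] + 0.01 * rng.normal(size=2000)
print("RLS estimate:", np.round(rls(x, d, M=3), 3))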


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Recursive Least Squares (RLS)

Algorithm Comparison: RLS and LMS algorithm terms:

Entity                    RLS                                                           LMS
Error                     α(n) = d(n) − ŵ^H(n − 1)x(n)   (a priori error)              e(n) = d(n) − ŵ^H(n)x(n)   (a posteriori error)
Weight update             ŵ(n) = ŵ(n − 1) + k(n)α*(n)                                  ŵ(n + 1) = ŵ(n) + μx(n)e*(n)
Gain of error update      k(n) = λ^{-1}P(n − 1)x(n) / [1 + λ^{-1}x^H(n)P(n − 1)x(n)]   μx(n)

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 369 / 406

Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Complexity Analysis

Objective: Compare the complexities (number of additions and multiplications) for the LMS, LS, and RLS algorithms.

Assume the data is real and the filter is of size M.

Case 1 – The LMS algorithm. Algorithm stages:

1. d̂(n) = w^T(n)x(n)
2. e(n) = d(n) − d̂(n)
3. w(n + 1) = w(n) + μx(n)e(n)

Complexity:

Stage   O×        O+
(1)     M         M − 1
(2)     0         1
(3)     M + 1     M

Total complexity per iteration: O×(2M + 1), O+(2M)

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 370 / 406


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Complexity Analysis

Case 2 – The LS algorithm. The algorithm solves

w(n) = Φ^{-1}(n)θ(n)

and has stages:

1. Φ(n + 1) = Φ(n) + x(n + 1)x^H(n + 1)
2. θ(n + 1) = θ(n) + x(n + 1)d(n + 1)
3. w(n + 1) = Φ^{-1}(n + 1)θ(n + 1)

Complexity:

Stage   O×           O+
(1)     M²           M²
(2)     M            M
(3)     M³ + M²      M³ + M(M − 1)

Total complexity per iteration: O×(M³ + 2M² + M), O+(M³ + 2M²)

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 371 / 406

Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Complexity Analysis

Case 3 – The RLS algorithm. The algorithm has stages (assuming λ = 1):

1. k(n) = λ^{-1}P(n − 1)x(n) / [1 + x^T(n)P(n − 1)x(n)]
2. α(n) = d(n) − w^T(n − 1)x(n)
3. w(n) = w(n − 1) + k(n)α(n)
4. P(n) = P(n − 1) − k(n)x^T(n)P(n − 1)

Note: The operation x^T(n)P(n − 1) appears in both stage (1) and stage (4) but is only performed once; the repeated count is excluded from the totals below.

Complexity:

Stage              O×          O+
(1) numerator      M²          M(M − 1)
(1) denominator    M² + M      M(M − 1) + M
(1) division       M           –
(2)                M           M
(3)                M           M
(4)                M² + M²     M(M − 1) + M²

Total complexity per iteration: O×(3M² + 4M), O+(3M² + M)

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 372 / 406


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Performance Analysis

Objective: Analyze the RLS algorithm in terms of

Bias
Convergence in the mean; convergence in the mean square
Learning curve decay rate

Assumptions:

1. The desired signal is formed by the regression model
   d(n) = w_0^H x(n) + e_0(n)
   where e_0(n) is white with variance σ².
2. λ = 1 and n ≥ M.

Then

w(n) = Φ^{-1}(n)θ(n)

where

Φ(n) = Σ_{i=1}^n x(i)x^H(i)  and  θ(n) = Σ_{i=1}^n x(i)d*(i)

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 373 / 406

Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Performance Analysis

Substituting d*(i) = x^H(i)w_0 + e_0*(i) into θ(n),

θ(n) = Σ_{i=1}^n x(i)[x^H(i)w_0 + e_0*(i)]
     = Σ_{i=1}^n x(i)x^H(i)w_0 + Σ_{i=1}^n x(i)e_0*(i)
     = Φ(n)w_0 + Σ_{i=1}^n x(i)e_0*(i)

Thus

w(n) = Φ^{-1}(n)θ(n)
     = Φ^{-1}(n)[Φ(n)w_0 + Σ_{i=1}^n x(i)e_0*(i)]
     = w_0 + Φ^{-1}(n) Σ_{i=1}^n x(i)e_0*(i)   (∗)

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 374 / 406


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Performance Analysis

Note that E{A} = E{E{A|B}}. Thus

w(n) = w_0 + Φ^{-1}(n) Σ_{i=1}^n x(i)e_0*(i)

⇒ E{w(n)} = w_0 + E{ E{ Φ^{-1}(n) Σ_{i=1}^n x(i)e_0*(i) | x(i), i = 1, 2, ..., n } }
          = w_0 + E{ Φ^{-1}(n) Σ_{i=1}^n x(i) E{e_0*(i)} } = w_0

The above follows from the fact that Φ(n) and e_0*(i) are independent.

Why? e_0(i) is independent of all observations, and the x(i) terms are given, uniquely defining Φ(n) ⇒ independence of Φ(n) and e_0*(i).

Result: The RLS algorithm is unbiased and convergent in the mean for n ≥ M.

Question: How does this compare to the LMS algorithm?

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 375 / 406

Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Performance Analysis

Next, consider the convergence in the mean square. Recall (∗),

w(n) = w_0 + Φ^{-1}(n) Σ_{i=1}^n x(i)e_0*(i)

which gives

ε(n) = w(n) − w_0 = Φ^{-1}(n) Σ_{i=1}^n x(i)e_0*(i)

Thus the weight error correlation matrix is

K(n) = E{ε(n)ε^H(n)}
     = E{ Φ^{-1}(n) ( Σ_{i=1}^n Σ_{j=1}^n x(i)e_0*(i)e_0(j)x^H(j) ) Φ^{-1}(n) }

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 376 / 406


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Performance Analysis

Again using E{A} = E{E{A|B}} yields

K(n) = E{ Φ^{-1}(n) ( Σ_{i=1}^n Σ_{j=1}^n x(i) E{e_0*(i)e_0(j)} x^H(j) ) Φ^{-1}(n) }   [E{e_0*(i)e_0(j)} = σ²δ(i − j)]
     = σ² E{ Φ^{-1}(n) ( Σ_{i=1}^n x(i)x^H(i) ) Φ^{-1}(n) }
     = σ² E{Φ^{-1}(n)Φ(n)Φ^{-1}(n)}
     = σ² E{Φ^{-1}(n)}

Note: Φ(n) has a Wishart distribution, and the expectation of its inverse is

E{Φ^{-1}(n)} = R^{-1}/(n − M − 1),   n > M + 1

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 377 / 406

Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Performance Analysis

Using K(n) = σ²R^{-1}/(n − M − 1) and the trace,

E{||ε(n)||²} = E{ε^H(n)ε(n)}
             = E{trace[ε^H(n)ε(n)]}
             = E{trace[ε(n)ε^H(n)]}
             = trace E{ε(n)ε^H(n)}
             = trace[K(n)]
             = σ² trace[R^{-1}]/(n − M − 1)
             = σ²/(n − M − 1) Σ_{i=1}^M 1/λ_i,   n > M + 1

Results:

The weight vector MSE is initially proportional to Σ_{i=1}^M 1/λ_i
The weight vector converges linearly in the mean-square sense

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 378 / 406


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Performance Analysis

Objective: Evaluate the RLS (error) learning curve.

Recall the a priori estimation error

α(n) = d(n) − w^H(n − 1)x(n)
     = d(n) − d_0(n) + d_0(n) − w^H(n − 1)x(n)
     = e_0(n) + w_0^H x(n) − w^H(n − 1)x(n)
     = e_0(n) − ε^H(n − 1)x(n)

Now consider the MSE of α(n):

J_α(n) = E{|α(n)|²}
       = E{[e_0*(n) − x^H(n)ε(n − 1)][e_0(n) − ε^H(n − 1)x(n)]}
       = E{|e_0(n)|²} − E{x^H(n)ε(n − 1)e_0(n)}
         − E{ε^H(n − 1)x(n)e_0*(n)} + E{x^H(n)ε(n − 1)ε^H(n − 1)x(n)}

To analyze J_α(n), consider each term individually.

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 379 / 406

Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Performance Analysis

J_α(n) = E{|e_0(n)|²} − E{x^H(n)ε(n − 1)e_0(n)}
         − E{ε^H(n − 1)x(n)e_0*(n)} + E{x^H(n)ε(n − 1)ε^H(n − 1)x(n)}

Term E{|e_0(n)|²}: Clearly,

E{|e_0(n)|²} = σ²

Term E{ε^H(n − 1)x(n)e_0*(n)}: By the independence theorem, ε(n − 1) is independent of x(n) and e_0(n). Thus,

E{ε^H(n − 1)x(n)e_0*(n)} = E{ε^H(n − 1)} E{x(n)e_0*(n)} = 0

where the final result is due to the orthogonality principle.

Term E{x^H(n)ε(n − 1)e_0(n)}: → 0 by similar arguments.

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 380 / 406


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Performance Analysis

J_α(n) = E{|e_0(n)|²} − E{x^H(n)ε(n − 1)e_0(n)}
         − E{ε^H(n − 1)x(n)e_0*(n)} + E{x^H(n)ε(n − 1)ε^H(n − 1)x(n)}

Term E{x^H(n)ε(n − 1)ε^H(n − 1)x(n)}:

E{x^H(n)ε(n − 1)ε^H(n − 1)x(n)} = E{trace[x^H(n)ε(n − 1)ε^H(n − 1)x(n)]}
                                 = E{trace[x(n)x^H(n)ε(n − 1)ε^H(n − 1)]}

Invoking the independence theorem,

E{x^H(n)ε(n − 1)ε^H(n − 1)x(n)} = trace[E{x(n)x^H(n)} E{ε(n − 1)ε^H(n − 1)}] = trace[R K(n − 1)]

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 381 / 406

Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Performance Analysis

Utilizing K(n − 1) = σ²R^{-1}/(n − M − 2) and substituting back each of the components,

J_α(n) = σ² + trace[R K(n − 1)]
       = σ² + Mσ²/(n − M − 2),   n > M + 1

Results:

The ensemble-average learning curve of the RLS converges in about 2M iterations, which is typically an order of magnitude faster than the LMS.
lim_{n→∞} J_α(n) = σ², thus there is no excess MSE.
Convergence of the RLS algorithm is independent of the eigenvalues of Φ(n).

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 382 / 406


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: Channel Equalization

Example

Consider again the channel equalization problem, where

h_n = (1/2)[1 + cos((2π/W)(n − 1))]   n = 1, 2, 3
      0                              otherwise

As before, an 11-tap filter is used.

The SNR is 30 dB and W is varied to control the eigenvalue spread.

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 383 / 406

Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: Channel Equalization

Observations:

The RLS algorithm converges in about 20 iterations (twice the number of filter taps).

The convergence (rate) is insensitive to the eigenvalue spread.

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 384 / 406


Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: Channel Equalization

Observations:

The RLS algorithm converges faster than the LMS algorithm.

The RLS algorithm has lower steady-state error than the LMS algorithm.

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 385 / 406

Application: Blind Deconvolution Motivation

Blind Deconvolution

Motivation:

Adaptive equalizers typically require a training period during which they operate on known signals/statistics.

This known-signal training is not always appropriate, e.g., in mobile communications:

Cost is too high (time/bandwidth)
Multipathing or other time-varying interference

In such cases, we must use blind equalization.

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 387 / 406


Application: Blind Deconvolution System Model

System components and assumptions:

Channel introduces distortion (dominant)
System has additive noise (not dominant)
Assume a baseband model of communications
Multilevel pulse amplitude modulation (M-ary PAM)

Received signal:

u(n) = Σ_k h_k x(n − k)

Dominating interference is due to intersymbol interference (ISI) from channel distortion
⇒ The noise is ignored

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 388 / 406

Application: Blind Deconvolution System Model

Also assume that:

h_n ≠ 0 for n < 0 (noncausal)
Σ_k h_k² = 1 (to keep the variance of the output constant)

Example (4-ary PAM modulation)

A 4-ary PAM modulation scheme uses 4 signals (amplitude levels S1, S2, S3, S4 on the real line).

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 389 / 406


Application: Blind Deconvolution System Model

To solve the equalization problem, we need a statistical model of the data. Assume:

1. The data is white:
   E{x(n)} = 0
   E{x(n)x(k)} = 1 if k = n, 0 otherwise

2. The pdf of x(n) is symmetric and uniform
   [Figure: f_x(x) uniform on [−√3, √3] with height 1/(2√3), consistent with zero mean and unit variance]

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 390 / 406

Application: Blind Deconvolution System Model

Deconvolution Objective: If {w_i} are the coefficients of the ideal inverse filter, then

Σ_i w_i h_{l−i} = δ_l = 1 if l = 0, 0 else

If this is the case, the output of the equalizer is

y(n) = Σ_i w_i u(n − i)
     = Σ_i Σ_k w_i h_k x(n − i − k)   [let k = l − i]
     = Σ_l x(n − l) Σ_i w_i h_{l−i}
     = Σ_l x(n − l) δ_l
     = x(n)

Problem: h_n is not known ⇒ the exact inverse cannot be used.

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 391 / 406


Application: Blind Deconvolution System Model

Solution: Use an iterative procedure to find the filter.

Let the output at iteration n be given by

y(n) = Σ_{i=−L}^L w_i(n) u(n − i)

where a 2L + 1 tap filter is used.

Setting w_i(n) = 0 for |i| > L, we can write

y(n) = Σ_i w_i(n) u(n − i)
     = Σ_i w_i u(n − i) + Σ_i [w_i(n) − w_i] u(n − i)
     = x(n) + v(n)   [since x(n) = Σ_i w_i u(n − i)]

Note v(n) = Σ_i [w_i(n) − w_i] u(n − i) is the residual ISI.

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 392 / 406

Application: Blind Deconvolution System Model

Interpretation: v(n) = Σ_i [w_i(n) − w_i] u(n − i) is the convolution noise (residual ISI) resulting from the fact that the ideal filter was not used.

Estimation Approach: Apply the output y(n) = x(n) + v(n) to a zero-memory nonlinear estimator,

x̂(n) = g[y(n)]

Complete System:

The nonlinear estimate g[y(n)] can be used to update the equalizer to produce a better estimate at time n + 1.


Application: Blind Deconvolution Optimization

Define the equalizer error to be

e(n) = x̂(n) − y(n)

Use e(n) in the LMS algorithm to update the equalizer weights:

w_i(n + 1) = w_i(n) + μ u(n − i) e(n),   i = 0, ±1, ±2, ..., ±L

Note that in this case, the cost function is

J(n) = E{e²(n)} = E{(x̂(n) − y(n))²} = E{(g[y(n)] − y(n))²}

Observation: J(n) is a nonconvex function of the filter weights (g[·] is nonlinear).

The cost function can have numerous local minima.

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 394 / 406
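The resulting blind (Bussgang-type) iteration has the simple structure sketched below (added Python/NumPy illustration; the nonlinearity g is passed in as a placeholder since its optimal form is derived in the slides that follow):

import numpy as np

def blind_lms_step(w, u_window, mu, g):
    """One blind-equalization step for a (2L+1)-tap filter.
    u_window = [u(n+L), ..., u(n), ..., u(n-L)] matches weights w_{-L}, ..., w_L."""
    y = w @ u_window               # equalizer output y(n)
    x_hat = g(y)                   # zero-memory nonlinear estimate of the symbol
    e = x_hat - y                  # equalizer error e(n) = x_hat(n) - y(n)
    return w + mu * u_window * e, y

# Example nonlinearity: hard decision for binary symbols (decision-directed mode)
L = 5
w = np.zeros(2 * L + 1)
w[L] = 1.0                         # center-spike initialization (a common heuristic)
g_hard = np.sign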

Application: Blind Deconvolution Optimization

To solve the optimization, first evaluate the convolution noise. Since

u(n) = Σ_k h_k x(n − k)
⇒ u(n − i) = Σ_k h_k x(n − i − k)

Next, use this in

y(n) = Σ_i w_i u(n − i) + Σ_i [w_i(n) − w_i] u(n − i) = x(n) + v(n)

where, recall,

w_i ≡ perfect equalizer weights
w_i(n) ≡ finite approximate equalizer with w_i(n) = 0 for |i| > L

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 395 / 406


Application: Blind Deconvolution Optimization

Then, using u(n − i) = Σ_k h_k x(n − i − k),

v(n) = Σ_i [w_i(n) − w_i] u(n − i)
     = Σ_i Σ_k h_k [w_i(n) − w_i] x(n − i − k)
     = Σ_l x(l) ∇(n − l)

where the last line follows from letting n − i − k = l and defining

∇(n) = Σ_k h_k [w_{n−k}(n) − w_{n−k}]

∇(n) is the residual impulse response of the channel due to imperfect equalization.
∇(n) is small in value, but long and oscillatory.

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 396 / 406

Application: Blind Deconvolution Optimization

Since the convolution noise is given by

v(n) = Σ_l x(l) ∇(n − l)

⇒ E{v(n)} = Σ_l E{x(l)} ∇(n − l) = 0

Also,

E{v(n)v(n − j)} = E{ Σ_l x(l)∇(n − l) Σ_m x(m)∇(n − m − j) }
                = Σ_l Σ_m ∇(n − l)∇(n − m − j) E{x(l)x(m)}
                = Σ_l ∇(n − l)∇(n − l − j)

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 397 / 406


Application: Blind Deconvolution Optimization

Since ∇(n) is long and oscillatory

E{v(n)v(n − j)} =∑

l

∇(n − l)∇(n − l − j)

tends to average to 0 for j �= 0. Thus

E{v(n)v(n − j)} =

{σ2 j = 00 j �= 0

whereσ2 = σ2(n) =

∑l

∇2(n − l)

Observation: v(n) =∑

lx(l)∇(n − l) is a weighted sum of i.i.d. RVs ⇒

v(n) is Gaussian (central limit theorem)

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 398 / 406

Application: Blind Deconvolution Optimization

Lastly, consider the cross-correlation of x(n) and v(n):

E{x(n)v(n − j)} = E{ x(n) Σ_l x(l)∇(n − l − j) }
                = Σ_l ∇(n − l − j) E{x(n)x(l)}
                = ∇(−j)
                ≪ Σ_l ∇²(l) = σ²

Observation: Thus we can say x(n) and v(n) are essentially independent.

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 399 / 406


Application: Blind Deconvolution Optimization

Finally, consider the nonlinear estimation of x(n)

Component statistics:

x(n) is uniformly distributed with zero mean and unit variance

v(n) is white Gaussian noise with zero mean and variance σ2(n)

x(n) and v(n) are independent

Estimation Approach: Utilize Bayes estimation, which exploits knowledge of the distributions.

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 400 / 406

Application: Blind Deconvolution Optimization

The estimation risk is defined by

R = E{c(x, x̂(n))} = ∫_{−∞}^{∞} ∫_{−∞}^{∞} c(x, x̂(n)) f_{xy}(x, y) dy dx

Recall that if c(x, x̂(n)) = (x − x̂(n))², then the risk is minimized by

x̂ = ∫_{−∞}^{∞} x f_x(x|y) dx = E{x|y}   [conditional expectation]

where we can use the fact that

f_x(x|y) = f_y(y|x) f_x(x) / f_y(y)

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 401 / 406


Application: Blind Deconvolution Optimization

Employ the current model,

y = c_0 x + v

where c_0 < 1 is a scaling factor included to ensure E{y²} = 1.

Thus, since x and v are independent, f_y(y|x) = f_v(y − c_0 x) and

x̂ = ∫_{−∞}^{∞} x f_x(x|y) dx
  = (1/f_y(y)) ∫_{−∞}^{∞} x f_y(y|x) f_x(x) dx
  = (1/f_y(y)) ∫_{−∞}^{∞} x f_v(y − c_0 x) f_x(x) dx

where f_x(x) is the uniform density on [−√3, √3] and f_v(v) is a zero-mean Gaussian density with variance σ².

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 402 / 406

Application: Blind Deconvolution Optimization

Evaluating this yields (Bellini, 1988)

x̂ = (1/c_0) y − (σ/c_0) [Z(y_1) − Z(y_2)] / [Q(y_1) − Q(y_2)]

where

y_1 = (1/σ)(y + √3 c_0)
y_2 = (1/σ)(y − √3 c_0)

and

Z(y) = (1/√(2π)) e^{−y²/2}   [normalized Gaussian pdf]
Q(y) = (1/√(2π)) ∫_y^∞ e^{−u²/2} du   [Gaussian tail probability (complementary cdf)]

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 403 / 406
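A numerical sketch of this conditional-mean nonlinearity (added Python illustration; Q is implemented via the identity Q(y) = ½ erfc(y/√2), and c_0, σ are assumed placeholder values):

import numpy as np
from math import erfc

def Z(y):
    return np.exp(-0.5 * y**2) / np.sqrt(2.0 * np.pi)   # normalized Gaussian pdf

def Q(y):
    return 0.5 * erfc(y / np.sqrt(2.0))                  # Gaussian tail probability

def g(y, c0=0.9, sigma=0.1):
    """Bayes estimate x_hat = g(y) for uniform data in Gaussian convolution noise."""
    y1 = (y + np.sqrt(3.0) * c0) / sigma
    y2 = (y - np.sqrt(3.0) * c0) / sigma
    return y / c0 - (sigma / c0) * (Z(y1) - Z(y2)) / (Q(y1) - Q(y2))

for y in (-1.5, -0.5, 0.0, 0.5, 1.5):
    print(y, "->", round(float(g(y)), 4))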


Application: Blind Deconvolution Optimization

Results for an 8-level PAM system:

The estimator is plotted for three different noise to (signal + noise) ratios.

Note that it tends to a step function as the noise goes to zero.

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 404 / 406

Application: Blind Deconvolution Optimization

Implementation Point: When the blind equalizer has converged, the algorithm is switched to decision-directed mode.

In the case of binary symbols,

x(n) = +1 for symbol 1, −1 for symbol 2

which results in the sample estimate

x̂(n) = sgn(y(n)) = +1 if y(n) ≥ 0, −1 else

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 405 / 406


Application: Blind Deconvolution Optimization

Final System:

Final Observation: This system works well as long as what?

K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 406 / 406