eleg-636: statistical signal processingbarner/courses/eleg636/notes/eleg636-2up.pdf · eleg-636:...
TRANSCRIPT
ELEG-636: Statistical Signal Processing
Kenneth E. Barner
Department of Electrical and Computer EngineeringUniversity of Delaware
Spring 2009
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 1 / 406
Course Objectives & Structure
Course Objectives & Structure
Objective: Given a discrete time sequence {x(n)}, develop
Statistical and spectral signal representations
Filtering, prediction, and system identification algorithmsOptimization methods that are
StatisticalAdaptive
Course Structure:
Weekly lectures [notes: www.ece.udel.edu/ barner]
Periodic homework (theory & Matlab implementations) [10%]
Midterm & Final examinations [80%]
Final project [10%]
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 2 / 406
Course Objectives & Structure
Outline
1 Probability
2 Stationary Process and Models
3 Linear Systems, Spectral Representations, and Eigen Analysis
4 Maximum Likelihood and Bayes Estimation
5 Wiener (MSE) Filtering Theory
6 Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms)
7 Application: Blind Deconvolution
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 3 / 406
Probability
Signal Characterization
Assumption: Many methods take {x(n)} to be deterministicReality: Real world signals are usually statistical in nature
Thus,. . . x(−1), x(0), x(1), . . .
can be interpreted as a sequence of random variables.We begin by analyzing each observation x(n) as a R.V.Then, to capture dependencies, we consider random vectors
. . . x(n), x(n + 1), . . . , x(n + N − 1)︸ ︷︷ ︸x(n)
, x(n + N), . . .
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 5 / 406
Probability Random Variables
Random Variables
Definition
For a space S, the subsets, or events of S, have associatedprobabilities.
To every event δ, we assign a number x(δ), which is called a R.V.
The distribution function of x is
Pr{x ≤ x0} = Fx (x0) −∞ < x0 < ∞Properties:
1 F (+∞) = 1, F (−∞) = 02 F (x) is continuous from the right
F (x+) = F (x)
3 Pr{x1 < x ≤ x2} = F (x2) − F (x1)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 6 / 406
Probability Random Variables
Example
Fair toss of two coins: H=heads, T=Tails
Define numerical assignments:
Events(δ) Prob. X(δ) Y(δ)HH 1/4 1 -100HT 1/4 2 -100TH 1/4 3 -100TT 1/4 4 500
This assignments yield different distribution functions
Fx(2) = Pr{HH, HT} = 1/2
Fy(2) = Pr{HH, HT , TH} = 3/4
How do we attain an intuitive interpretation of the distribution function?
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 7 / 406
Probability Random Variables
Distribution Plots
0
0.5
1
0 1 2 3 4 5
)(xFx
x 0
0.5
1
-200 -100 0 100 200 300 400 500 600 700
)(yFy
y
43
Note properties hold:1 F (+∞) = 1, F (−∞) = 02 F (x) is continuous from the right
F (x+) = F (x)
3 Pr{x1 < x ≤ x2} = F (x2) − F (x1)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 8 / 406
Probability Random Variables
Definition
The probability density function is defined as,
f (x) =dF (x)
dx
or F (x) =
∫ x
−∞f (x)dx
Thus F (∞) = 1 ⇒∫ ∞
−∞f (x)dx = 1
Types of distributions:
Continuous: Pr{x = x0} = 0 ∀x0
Discrete: F (xi) − F (x−i ) = Pr{x = xi} = Pi
In which case f (x) =∑
i Piδ(x − xi)
Mixed: discontinuous but not discrete
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 9 / 406
Probability Random Variables
Distribution examples
Uniform: x ∼ U(a, b) a < b
f (x) =
{ 1b−a x ∈ [a, b]
0 else
ab −1
a b
f(x)
a b
F(x)1
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 10 / 406
Probability Random Variables
Gaussian: x ∼ N(μ, σ)
f (x) =1√2πσ
e− (x−μ)2
2σ2
μ
f(x)
μ
F(x)
1/2
1
Historical Note: First introduced by Abraham de Moivre in 1733;Rigorously justified by Gauss in 1809; Extended by Laplace in 1812;
Why not the de Moivre distribution? Stigler’s law.
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 11 / 406
Probability Random Variables
Gaussian Distribution Example
Example
Consider the Normal (Gaussian) distribution PDF and CDF forμ = 0, σ2 = 0.2, 1.0, 5.0 and μ = −2, σ2 = 0.5
F,
2(x
)
0.8
0.6
0.4
0.2
0.0
5 3 1 3 5
1.0
1 0 2 424
=0,=0, = 2,
=0, =0.22
=1.02
=5.02
=0.52
f,
2(x
)
0.8
0.6
0.4
0.2
0.0
5 3 1 3 5
1.0
1 0 2 424
=0,=0, = 2,
=0, =0.22
=1.02
=5.02
=0.52
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 12 / 406
Probability Random Variables
Binomial: x ∼ B(p, q) p + q = 1
Example
Toss a coin n times. What is the probability of getting k heads?
For p + q = 1, where q is probability of a tail, and p is the probabilityof a head:
Pr{x = k} =
(nk
)pkqn−k
[NOTE:
(nk
)=
n!
(n − k)!k !
]
⇒ f (x) =n∑
k=0
(nk
)pkqn−kδ(x − k)
⇒ F (x) =m∑
k=0
(nk
)pkqn−k m ≤ x < m + 1
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 13 / 406
Probability Random Variables
Binomial Distribution Example I
Example
Toss a coin n times. What is the probability of getting k heads? Forn = 9, p = q = 1
2 (fair coin)
0
0.1
0.2
0.3
0 1 2 3 4 5 6 7 8 9 10
f(x)
0
0.5
1
0 1 2 3 4 5 6 7 8 9 10 11
F(x)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 14 / 406
Probability Random Variables
Binomial Distribution Example II
Example
Toss a coin n times. What is the probability of getting k heads? Forn = 20, p = 0.5, 0.7 and n = 40, p = 0.5.
0 10 20 30 40
0.00
0.05
0.10
0.15
0.20
0.25
p=0.5 and n=20p=0.7 and n=20p=0.5 and n=40
0 10 20 30 40
0.0
0.2
0.4
0.6
0.8
1.0
p=0.5 and n=20p=0.7 and n=20p=0.5 and n=40
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 15 / 406
Probability Conditional Distributions
Conditional Distributions
Definition
The conditional distribution of x given event “M” has occurred is
Fx(x0|M) = Pr{x ≤ x0|M}=
Pr{x ≤ x0, M}Pr{M}
Example
Suppose M = {x ≤ a}, then
Fx(x0|M) =Pr{x ≤ x0, M}
Pr{x ≤ a}If x0 ≥ a, what happens?
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 16 / 406
Probability Conditional Distributions
Special Cases
Special Case: x0 ≥ a
Pr{x ≤ x0, x ≤ a} = Pr{x ≤ a}
⇒ Fx(x0|M) =Pr{x ≤ x0, M}
Pr{x ≤ a} =Pr{x ≤ a}Pr{x ≤ a} = 1
Special Case: x0 ≤ a
⇒ Fx(x0|M) =Pr{x ≤ x0, M}
Pr{x ≤ a} =Pr{x ≤ x0}Pr{x ≤ a}
=Fx(x0)
Fx(a)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 17 / 406
Probability Conditional Distributions
Conditional Distribution Example
Example
Suppose
a
F(x)1
What does Fx(x |M) look like? Note M = {x ≤ a}.
⇒ Fx(x0|M) =
{Fx(x0)Fx(a) x ≤ a1 a ≤ x
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 18 / 406
Probability Conditional Distributions
)(xFx
)( axxFx ≤
a
F(x)1
Distribution properties hold for conditional cases:Limiting cases: F (∞|M) = 1 and F (−∞|M) = 0Probability range: Pr{x0 ≤ x ≤ x1|M} = F (x1|M) − F (x0|M)Density–distribution relations:
f (x |M) =∂F (x |M)
∂x
F (x0|M) =
∫ x0
−∞f (x |M)dx
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 19 / 406
Probability Conditional Distributions
Example (Fair Coin Toss)
Toss a fair coin 4 times. Let x be the number of heads. DeterminePr{x = k}.
Recall
Pr{x = k} =
(nk
)pkqn−k
In this case
Pr{x = k} =
(4k
)(12
)4
Pr{x = 0} = Pr{x = 4} =116
Pr{x = 1} = Pr{x = 3} =14
Pr{x = 2} =38
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 20 / 406
Probability Conditional Distributions
Density and Distribution Plots for Fair Coin (n = 4) Ex.
0
0.1
0.2
0.3
0.4
0.5
0 1 2 3 4 5
f(x)
161
41
83
161
41
0
0.5
1
0 1 2 3 4 5 6 7
F(x)
161
165
1611
1615 1
What type of distribution is this? Discrete. Thus,
F (xi) − F (x−i ) = Pr{x = xi} = Pi
F (x) =
∫ x
−∞f (x)dx =
∫ x
−∞
∑i
Piδ(x − xi)dx
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 21 / 406
Probability Conditional Distributions
Conditional Case
Example (Conditional Fair Coin Toss)
Toss a fair coin 4 times. Let x be the number of heads. SupposeM = [at least one flip produces a head]. Determine Pr{x = k |M}.
Recall,
Pr{x = k |M} =Pr{x = k , M}
Pr{M}Thus first determine Pr{M}
Pr{M} = 1 − Pr{No heads}= 1 − 1
16
=1516
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 22 / 406
Probability Conditional Distributions
Next determine Pr{x = k |M} for the individual cases, k = 0, 1, 2, 3, 4
Pr{x = 0|M} =Pr{x = 0, M}
Pr{M} = 0
Pr{x = 1|M} =Pr{x = 1, M}
Pr{M}=
Pr{x = 1}Pr{M} =
1/415/16
=4
15
Pr{x = 2|M} =Pr{x = 2}
Pr{M} =3/8
15/16=
615
Pr{x = 3|M} =415
Pr{x = 4|M} =115
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 23 / 406
Probability Conditional Distributions
Conditional and Unconditional Density Functions
0
0.1
0.2
0.3
0.4
0.5
0 1 2 3 4 5
f(x)
161
41
83
161
41
0
0.1
0.2
0.3
0.4
0.5
0 1 2 3 4 5
f(x\M)
154
156
154
151
Are they proper density functions?
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 24 / 406
Probability Total Probability and Bayes’ Theorem
Total Probability and Bayes’ Theorem
Let M1, M2, . . . , Mn forms a partition of S, i.e.⋃i
Mi = S and Mi
⋂i �=j
Mj = φ
Then
F (x) =∑
i
Fx(x |Mi)Pr(Mi)
f (x) =∑
i
fx(x |Mi)Pr(Mi)
Aside
Pr{A|B} =Pr{A, B}
Pr{B} =Pr{B, A}Pr{A}Pr{B}Pr{A} =
Pr{B|A}Pr{A}Pr{B}
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 25 / 406
Probability Total Probability and Bayes’ Theorem
From this we get
Pr{M|x ≤ x0} =Pr{x ≤ x0|M}Pr{M}
Pr{x ≤ x0}=
F (x0|M)Pr{M}F (x0)
and
Pr{M|x = x0} =f (x0|M)Pr{M}
f (x0)
By integration∫ ∞
−∞Pr{M|x = x0}f (x0)dx0 =
∫ ∞
−∞f (x0|M)Pr{M}dx0
= Pr{M}∫ ∞
−∞f (x0|M)dx0 = Pr{M}
⇒ Pr{M} =
∫ ∞
−∞Pr{M|x = x0}f (x0)dx0
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 26 / 406
Probability Total Probability and Bayes’ Theorem
Putting it all Together: Bayes’ Theorem
Bayes’ Theorem:
f (x0|M) =Pr{M|x = x0}f (x0)
Pr{M}=
Pr{M|x = x0}f (x0)∫∞−∞ Pr{M|x = x0}f (x0)dx0
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 27 / 406
Probability Functions of a R.V.
Functions of a R.V.
Problem Statement
Let x and g(x) be RVs such that
y = g(x)
Question: How do we determine the distribution of y?
NoteFy (y0) = Pr{y ≤ y0}
= Pr{g(x) ≤ y0}= Pr{x ∈ Ry0}
whereRy0 = {x : g(x) ≤ y0}
Question: If y = g(x) = x2, what is Ry0 ?
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 28 / 406
Probability Functions of a R.V.
Example
Let y = g(x) = x2. Determine Fy (y0).
0y
0y− 0y x
2xy =
Note thatFy (y0) = Pr(y ≤ y0)
= Pr(−√y0 ≤ x ≤ √
y0)= Fx(
√y0) − Fx(−√
y0)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 29 / 406
Probability Functions of a R.V.
Example
Let x ∼ N(μ, σ) and
y = U(x) =
{1 if x > μ0 if x ≤ μ
Determine fy (y0) and Fy (y0).
21
)( yf y
0 1
)( yFy
21
0 1
1
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 30 / 406
Probability Functions of a R.V.
General Function of a Random Variable Case
To determine the density of y = g(x) in terms of fx (x0), look at g(x)
fy (y0)dy0 = Pr(y0 ≤ y ≤ y0 + dy0)
= Pr(x1 ≤ x ≤ x1 + dx1) + Pr(x2 + dx2 ≤ x ≤ x2)
+Pr(x3 ≤ x ≤ x3 + dx3)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 31 / 406
Probability Functions of a R.V.
fy (y0)dy0 = Pr(x1 ≤ x ≤ x1 + dx1) + Pr(x2 + dx2 ≤ x ≤ x2)
+Pr(x3 ≤ x ≤ x3 + dx3)
= fx (x1)dx1 + fx (x2)|dx2| + fx (x3)dx3 (∗)Note that
dx1 =dx1
dy0dy0 =
dy0
dy0/dx1=
dy0
g′(x1)
Similarly
dx2 =dy0
g′(x2)and dx3 =
dy0
g′(x3)
Thus (∗) becomes
fy (y0)dy0 =fx (x1)
g′(x1)dy0 +
fx (x2)
|g′(x2)|dy0 +fx (x3)
g′(x3)dy0
or
fy (y0) =fx (x1)
g′(x1)+
fx (x2)
|g′(x2)|+
fx (x3)
g′(x3)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 32 / 406
Probability Functions of a R.V.
Function of a R.V. Distribution General Result
Set y = g(x) and let x1, x2, . . . be the roots, i.e.,
y = g(x1) = g(x2) = . . .
Then
fy (y) =fx (x1)
|g′(x1)| +fx (x2)
|g′(x2)| + . . .
Example
Suppose x ∼ U(−1, 2) and y = x 2. Determine fy (y).
31
-1 2
fx(x)
1 -1 1
y
0 2
2
1
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 33 / 406
Probability Functions of a R.V.
Note thatg(x) = x2 ⇒ g′(x) = 2x
Consider special cases separately:Case 1: 0 ≤ y ≤ 1
y = x2 ⇒ x = ±√y
fy (y) =fx (x1)
|g′(x1)| +fx (x2)
|g′(x2)|=
1/3|2√y | +
1/3| − 2
√y | =
1/3√y
Case 2: 1 ≤ y ≤ 4y = x2 ⇒ x =
√y
fy (y) =fx (x1)
|g′(x1)| =1/32√
y=
1/6√y
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 34 / 406
Probability Functions of a R.V.
Result: For x ∼ U(−1, 2) and y = x 2
31
-1 2
fx(x)
1 -1 1
y
0 2
2
1
fy (y) =
{ 1/3√y 0 ≤ y ≤ 1
1/6√y 1 < y ≤ 4
0
0.5
1
0 1 2 3 4 5
fy(y)
31
y
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 35 / 406
Probability Functions of a R.V.
Example
Let x ∼ N(μ, σ) and y = ex . Determine fy (y).
Note g(x) ≥ 0 and g ′(x) = ex
Also, there is a single root (inverse solution):
x = ln(y)
Therefore,
fy (y) =fx (x)
|g′(x)| =fx (x)
ex
Expressing this in terms of y through substitution yields:
fy (y) =fx(ln(y))
eln(y)=
fx (ln(y))
y
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 36 / 406
Probability Functions of a R.V.
Note that x is Gaussian:
fx (x) =1√2πσ
e− (x−μ)2
2σ2
⇒ fy (y) =1√
2πyσe− (ln(y)−μ)2
2σ2 , for y > 0
0
0.5
1
0 1 2 3 4 5
fy(y)
y
Log normal density
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 37 / 406
Probability Functions of a R.V.
Distribution of Fx (x)
For any RV with continuous distribution Fx(x), the RV y = Fx (x) isuniform on [0, 1].
Proof: Note 0 < y < 1. Since
g(x) = Fx(x)
g′(x) = fx (x)
Thus
fy (y) =fx (x)
g′(x)=
fx (x)
fx (x)= 1
0 1
fy(y)
1
0 1
1
Fy(y)
y
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 38 / 406
Probability Functions of a R.V.
Thus the function
g(x) = Fx(x)
performs the mapping:
The converse also holds:
Combining operationsyields Synthesis:
)(xFxx U – uniform [0,1]
pdf fx(x)
)(1 xFx−U x – pdf fx(x)
Uniform pdf
)(xFxx )(1 xFx− xU
)(1 xFy− y
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 39 / 406
Probability Mean and Variance
Mean and variance
Definitions
Mean E{x} =
∫ ∞
−∞xf (x)dx
Conditional Mean E{x |M} =
∫ ∞
−∞xf (x |M)dx
Example
Suppose M = {x ≥ a}. Then
E{x |M} =
∫ ∞
−∞xf (x |M)dx
=
∫∞a xf (x)dx∫∞a f (x)dx
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 40 / 406
Probability Mean and Variance
For a function of a RV, y = g(x),
E{y} =
∫ ∞
−∞yfy (y)dy =
∫ ∞
−∞g(x)fx (x)dx
Example
Suppose g(x) is a step function:
0 x0
g(x)
1
Determine E{g(x)}.
E{g(x)} =
∫ ∞
−∞g(x)fx (x)dx =
∫ x0
−∞fx (x)dx = Fx(x0)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 41 / 406
Probability Mean and Variance
Definition (Variance)
Variance σ2 =
∫ ∞
−∞(x − η)2f (x)dx
where η = E{x}. Thus,
σ2 = E{(x − η)2} = E{x2} − E2{x}
Example
For x ∼ N(η, σ2), determine the variance.
f (x) =1√2πσ
e− (x−η)2
2σ2
Note: f (x) is symmetric about x = η ⇒ E{x} = η
Also ∫ ∞
−∞f (x)dx = 1 ⇒
∫ ∞
−∞e− (x−η)2
2σ2 dx =√
2πσ
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 42 / 406
Probability Mean and Variance
∫ ∞
−∞e− (x−η)2
2σ2 dx =√
2πσ
Differentiating w.r.t. σ:
⇒∫ ∞
−∞
(x − η)2
σ3 e− (x−η)2
2σ2 dx =√
2π
Rearranging yields∫ ∞
−∞(x − η)2 1√
2πσe− (x−η)2
2σ2 dx = σ2
orE{(x − η)2} = σ2
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 43 / 406
Probability Moments
Definition (Moments)
Moments
mn = E{xn} =
∫ ∞
−∞xnf (x)dx
Central Moments
μn = E{(x − η)n} =
∫ ∞
−∞(x − η)nf (x)dx
From the binomial theorem
μn = E{(x − η)n} = E
{n∑
k=0
(nk
)xk (−η)n−k
}
=n∑
k=0
(nk
)mk (−η)n−k
⇒ μ0 = 1, μ1 = 0, μ2 = σ2, μ3 = m3 − 3ηm2 + 2η3
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 44 / 406
Probability Moments
Example
Let x ∼ N(0, σ2). Prove
E{xn} =
{0 n = 2k + 11 · 3 · · · · (n − 1)σn n = 2k
For n odd
E{xn} =
∫ ∞
−∞xnf (x) = 0
since xn is an odd function and f (x) is an even function.
To prove the second part, use the fact that∫ ∞
−∞e−αx2
dx =
√π
α
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 45 / 406
Probability Moments
Differentiate ∫ ∞
−∞e−αx2
dx =
√π
α
with respect to α, k times
⇒∫ ∞
−∞x2ke−αx2
dx =1 · 3 · · · · (2k − 1)
2k
√π
α2k+1
Let α = 12σ2 , then∫ ∞
−∞x2ke− x2
2σ2 dx = 1 · 3 · · · · (2k − 1)σ2k+1√
2π
Setting n = 2k and rearranging∫ ∞
−∞xn 1√
2πσe− x2
2σ2 dx = 1 · 3 · · · · (n − 1)σn [QED]
Note: Variance is a measure of a RV’s concentration around its meanK. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 46 / 406
Probability Tchebycheff Inequality
Tchebycheff Inequality
For any ε > 0,
Pr(|x − η| ≥ ε) ≤ σ2
ε2
To prove this, note
Pr(|x − η| ≥ ε) =
∫ η−ε
−∞f (x)dx +
∫ ∞
η+εf (x)dx
=
∫|x−η|≥ε
f (x)dx
Also note that
σ2 =
∫ ∞
−∞(x − η)2f (x)dx
≥∫
|x−η|≥ε
(x − η)2f (x)dx
f(x)
η η+εη-ε
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 47 / 406
Probability Tchebycheff Inequality
σ2 ≥∫
|x−η|≥ε
(x − η)2f (x)dx
Using the fact that |x − η| ≥ ε in the above gives
σ2 ≥ ε2∫
|x−η|≥ε
f (x)dx
= ε2Pr{|x − η| ≥ ε}
Rearranging gives the desired result
⇒ Pr{|x − η| ≥ ε} ≤(σ
ε
)2
QED
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 48 / 406
Probability Characteristic & Moment Generating Functions
Definition (Characteristic Function)
The characteristic function of a random variable x with pdf fx (x) isdefined by
φx (ω) = E(
ejωx)
=
∫ ∞
−∞ejωx fx (x)dx
If fx(x) is symmetric about 0 (fx (x) = fx (−x)), then φx (x) is realThe magnitude of the characteristic function is bound by
|φx (ω)| ≤ φx (0) = 1
Theorem (Characteristic Function for the sum of independent RVs)
Let x1, x2, . . . , xN be independent (but not necessarily identicallydistributed) RVs and set sN =
∑ni=1 aixi where ai are constants. Then
φsN (ω) =N∏
i=1
φxi (aiω)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 49 / 406
Probability Characteristic & Moment Generating Functions
The theorem can be proved by a simple extension of the following: Letx and y be independent. Then
φx+y (ω) = E(
ejω(x+y))
= E(
ejωxejωy)
= E(
ejωx)
E(
ejωy)
= φx (ω)φy (ω)
Example
Determine the characteristic function of the sample mean operating oniid samples.
Note x = 1N
∑Ni=1 xi ⇒ ai = 1
N
⇒ φx (ω) =N∏
i=1
φxi (aiω) =(φxi
(ω
N
))N
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 50 / 406
Probability Characteristic & Moment Generating Functions
The Moment Generating function is realized by making the substitutionjω → s in the above
Definition (Moment Generating Function)
The moment generating function of a random variable x with pdf f x (x)is defined by
Φx(s) = E (esx) =
∫ ∞
−∞esx fx (x)dx
Note Φx(jω) = φx (ω)
Theorem (Moment Generation)
Provided that Φx (s) exists in an open interval around s = 0, thefollowing hold
mn = E (xn) = Φ(n)x (0) =
dnΦx
dsn (0)
Simply noting that Φ(n)x (s) = E (xnesx) proves the result
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 51 / 406
Probability Characteristic & Moment Generating Functions
Example
Let x be exponentially distributed,
f (x) = λe−λxU(x)
Determine η = m1, m2, and σ2
Note
Φx(s) = λ
∫ ∞
0esxe−λxdx = λ
∫ ∞
0e−x(λ−s)dx
=λ
λ − sThus
Φ(1)x (0) =
1λ
and Φ(2)x (0) =
2λ2
and
E{x} =1λ
, E{
x2}
=2λ2 ⇒ σ2 =
1λ2
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 52 / 406
Probability Bivariate Statistics
Bivariate Statistics
Given two RVs, x and y , the bivariate (joint) distribution is given by
F (x0, y0) = Pr{x ≤ x0, y ≤ y0}
y
x
y0
x0
Properties:
F (−∞, y) = F (x ,−∞) = 0
F (∞,∞) = 1
Fx(x) = F (x ,∞), Fy (y) = F (∞, y)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 53 / 406
Probability Bivariate Statistics
Special Cases
Case 1: M = {x1 ≤ x ≤ x2, y ≤ y0} x
y
y0
x1 x2
⇒ Pr{M} = F (x2, y0) − F (x1, y0)
Case 2: M = {x ≤ x0, y1 ≤ y ≤ y2}
y
x0
y1
y2
x
⇒ Pr{M} = F (x0, y2) − F (x0, y1)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 54 / 406
Probability Bivariate Statistics
Case 3: M = {x1 ≤ x ≤ x2, y1 ≤ y ≤ y2} Theny
x1
y1
y2
xx2
andPr{M} = F (x2, y2) − F (x1, y2) − F (x2, y1) + F (x1, y1)︸ ︷︷ ︸
↓
Added back because this region was subtracted twice
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 55 / 406
Probability Joint Statistics
Definition (Joint Statistics)
f (x , y) =∂2F (x , y)
∂x∂y
and
F (x , y) =
∫ x
−∞
∫ y
−∞f (α, β)dαdβ
In general, for some region M, the joint statistics are
Pr{(x , y) ∈ M} =
∫ ∫M
f (x , y)dxdy
Marginal Statistics: Fx(x) = F (x ,∞) and Fy (y) = F (∞, y)
⇒ fx (x) =
∫ ∞
−∞f (x , y)dy
⇒ fy (y) =
∫ ∞
−∞f (x , y)dx
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 56 / 406
Probability Independence
Independence
Definition (Independence)
Two RVs x and y are statistically independent if for arbitrary events(regions) x ∈ A and y ∈ B,
Pr{x ∈ A, y ∈ B} = Pr{x ∈ A}Pr{y ∈ B}
Letting A = {x ≤ x0} and B = {y ≤ y0}, we see x and y areindependent iff
Fx ,y(x , y) = Fx(x)Fy (y)
and by differentiation
fx ,y(x , y) = fx (x)fy (y)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 57 / 406
Probability Independence
If x and y are independent RVs, then
z = q(x) and w = h(y)
are also independent.
Function of two RVs
Given two RVs, let z = g(x , y). Define Dz to be the xy plane region
{z ≤ z0} = {g(x , y) ≤ z0} = {(x , y) ∈ Dz}
Then
Fz(z0) = Pr{z ≤ z0}= Pr{(x , y) ∈ Dz}=
∫ ∫Dz
f (x , y)dxdy
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 58 / 406
Probability Independence
Example
Let z = x + y . Then, z ≤ z0 gives the region x + y ≤ z0 which isdelineated by the line x + y = z0
y
z0
x
z0
Thus
Fz(z0) =
∫ ∫Dz
f (x , y)dxdy
=
∫ ∞
−∞
∫ z0−y
−∞f (x , y)dxdy
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 59 / 406
Probability Independence
We can obtain fz(z) by differentiation
∂Fz(z)
∂z=
∫ ∞
−∞
∂
∂z
∫ z−y
−∞f (x , y)dxdy
fz(z) =
∫ ∞
−∞f (z − y , y)dy (∗)
Note that if x and y are independent,
f (x , y) = fx (x)fy (y) (∗∗)
Thus utilizing (∗∗) in (∗)
fz(z) =
∫ ∞
−∞fx (z − y)fy (y)dy︸ ︷︷ ︸
Convolution
= fx (z) ∗ fy (z)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 60 / 406
Probability Independence
Example
Let z = x + y where x and y are independent with
fx(x) = αe−αxU(x)
fy (y) = αe−αyU(y)
Then
fz(z) =
∫ ∞
−∞fx (z − y)fy (y)dy
= α2∫ z
0e−α(z−y)e−αydy
= α2e−αz∫ z
0dy
= α2ze−αzU(z)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 61 / 406
Probability Independence
Example
Let z = max(x , y). Determine Fz(z0) and fz(z0).
NoteFz(z0) = Pr{z ≤ z0}
= Pr{max(x , y) ≤ z0}= Fxy(z0, z0)
y
x
z0
z0
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 62 / 406
Probability Independence
If x and y are independent,
Fz(z0) = Fx(z0)Fy (z0)
and
fz(z0) =∂Fz(z0)
∂z0
=∂Fx (z0)
∂z0Fy (z0) +
∂Fy (z0)
∂z0Fx(z0)
= fx (z0)Fy (z0) + fy (z0)Fx(z0)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 63 / 406
Probability Joint Moments
Joint Moments
For RVs x and y and function z = g(x , y)
E{z} =
∫ ∞
−∞zfz(z)dz
E{g(x , y)} =
∫ ∞
−∞
∫ ∞
−∞g(x , y)f (x , y)dxdy
Definition (Covariance)
For RVs x and y ,
Cxy = Cov(x , y)= E [(x − ηx )(y − ηy )]= E [xy ] − ηxE [y ] − ηyE [x ] + ηxηy
= E [xy ] − ηxηy
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 64 / 406
Probability Joint Moments
Definition (Correlation Coefficient)
The correlation coefficient is given by
r =Cxy
σxσy
Note that
0 ≤ E{[a(x − ηx ) + (y − ηy )]2}= E{(x − ηx )2}a2 + 2E{(x − ηx )(y − ηy )}a + E{(y − ηy )2}= σ2
x a2 + 2Cxya + σ2y
This is a positive quadratic function of a⇒ Roots are imaginary and discriminant is non-positive√
4C2xy − 4σ2
xσ2y → imaginary
⇒ 4C2xy − 4σ2
xσ2y ≤ 0
⇒ C2xy ≤ σ2
xσ2y
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 65 / 406
Probability Joint Moments
Thus,
|Cxy | ≤ σxσy and |r | =|Cxy |σxσy
≤ 1
Definition (Uncorrelated)
Two RVs are uncorrelated if their covariance is zero
Cxy = 0
⇒ r =Cxy
σxσy= 0
=E{xy} − E{x}E{y}
σxσy= 0
⇒ E{xy} = E{x}E{y}Thus
Cxy = 0 ⇔ E{xy} = E{x}E{y}
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 66 / 406
Probability Joint Moments
Result
If x and y are independent, then
E{xy} = E{x}E{y}
and x and y are uncorrelated
Note: Converse is not true (in general)
Converse only holds for Gaussian RVs
Independence is a stronger condition than uncorrelated
Definition (Orthogonality)
Two RVs are orthogonal if
E{xy} = 0
Note: If x and y are correlated, they are not orthogonal
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 67 / 406
Probability Joint Moments
Example
Consider the correlation between two RV’s, x and y , with samplesshown in a scatter plot
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 68 / 406
Probability Sequences and Vectors of Random Variables
Sequences and Vectors of Random Variables
Definition (Vector Distribution)
Let {x} be a sequence of RVs. Take N samples to form the randomvector
x = [x1, x2, . . . , xN ]T
Then the vector distribution function is
Fx(x0) = Pr{x1 ≤ x01 , x2 ≤ x0
2 , . . . , xN ≤ x0N}
�= Pr{x ≤ x0}
Special Case: For complex data
x = xr + jxi
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 69 / 406
Probability Sequences and Vectors of Random Variables
⎡⎢⎢⎢⎣
x1
x2...xN
⎤⎥⎥⎥⎦ =
⎡⎢⎢⎢⎣
xr1
xr2...xrN
⎤⎥⎥⎥⎦+ j
⎡⎢⎢⎢⎣
xi1
xi2...xiN
⎤⎥⎥⎥⎦
The distribution in the complex case is defined as
Fx(x0) = Pr{xr ≤ x0r , xi ≤ x0
i }�= Pr{x ≤ x0}
The density function is given by
fx(x) =∂NFx(x)
∂x1∂x2 . . . ∂xN
Fx(x0) =
∫ x01
−∞
∫ x02
−∞. . .
∫ x0N
−∞fx(x)dx1dx2 . . . dxN
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 70 / 406
Probability Sequences and Vectors of Random Variables
Properties:
Fx([∞,∞, · · · ,∞]T ) = 1∫ ∞
−∞
∫ ∞
−∞· · ·∫ ∞
−∞fx(x)dx = 1
Fx([x1, x2, · · · ,−∞, · · · , xN ]T ) = 0
Also
F ([∞, x2, x3, · · · , xN ]T ) = F ([x2, x3, · · · , xN ]T )∫ ∞
−∞f ([x1, x2, x3, · · · , xN ]T )dx1 = f ([x2, x3, · · · , xN ]T )
Setting xi = ∞ in the cdf eliminates this sample
Integrating over (−∞,∞) along xi in the pdf eliminates this sample
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 71 / 406
Probability Sequences and Vectors of Random Variables
Joint Distribution
Definitions (Joint Distribution and Density)
Given two random vectors x and y, the joint distribution and density are
Fxy(x0, y0) = Pr{x ≤ x0, y ≤ y0}
fxy(x, y) =∂N∂MFxy(x, y)
∂x1∂x2 · · · ∂xN∂y1∂y2 · · · ∂yM
Definition (Vector Independence)
The vectors are independent iff
Fxy(x, y) = Fx(x)Fy(y)
or equivalentlyfxy(x, y) = fx(x)fy(y)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 72 / 406
Probability Sequences and Vectors of Random Variables
Expectations & Moments
Objective: Obtain partial description of process generating x
Solution: Use moments
The first moment, or mean, is
mx = E{x} = [m1, m2, . . . , mN ]T
=
∫ ∞
−∞
∫ ∞
−∞· · ·∫ ∞
−∞xfx(x)dx
⇒ mk =
∫ ∞
−∞
∫ ∞
−∞· · ·∫ ∞
−∞xk fx(x)dx1dx2 · · · dxN
=
∫ ∞
−∞xk fxk (xk )︸ ︷︷ ︸dxk
⇑ marginal distribution of xk
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 73 / 406
Probability Sequences and Vectors of Random Variables
Definition (Correlation Matrix)
A complete set of second moments is given by the correlation matrix
Rx = E{xxH} = E{xx∗T }
=
⎡⎢⎢⎢⎣
E{|x1|2} E{x1x∗2} · · · E{x1x∗
N}E{x2x∗
1} E{|x2|2} · · · E{x2x∗N}
......
. . ....
E{xNx∗1} E{xNx∗
2} · · · E{|xN |2}
⎤⎥⎥⎥⎦
Result
The correlation matrix is Hermitian symmetric
(Rx)H = (E{xxH})H
= E{(xxH)H}= E{xxH} = Rx
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 74 / 406
Probability Sequences and Vectors of Random Variables
Definition (Covariance Matrix)
The set of second central moments is given by the covariance
Cx = E{(x − mx)(x − mx)H}
= E{xxH} − mxE{xH} − E{x}mxH + mxmx
H
= Rx − mxmxH
ResultThe covariance is Hermitian symmetric
Cx = CxH
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 75 / 406
Probability Sequences and Vectors of Random Variables
Result
The correlation and covariance matrices are positive semi-definite
aHRxa ≥ 0 aHCxa ≥ 0 (∀a)
To prove this, note
aHRxa = aHE{xxH}a
= E{aHxxHa}= E{(aHx)(aHx)H}= E{|aHx|2} ≥ 0
For most cases, R and C are positive define
aHRxa > 0 aHCxa > 0
⇒ no linear dependencies in Rx or Cx
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 76 / 406
Probability Sequences and Vectors of Random Variables
Definitions (Cross-Correlation and Cross-Covariance)
For random vectors x and y,
Cross-correlation�= Rxy = E{xyH}
Cross-covariance�= Cxy = E{(x − mx)(y − my)
H}= Rxy − mxmy
H
Definition (Uncorrelated Vectors)
Two vectors x and y are uncorrelated if
Cxy = Rxy − mxmyH = 0
or equivalentlyRxy = E{xyH} = mxmy
H
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 77 / 406
Probability Sequences and Vectors of Random Variables
Note that as in the scalar case
independence ⇒ uncorrelated
uncorrelated � independence
Also, x and y are orthogonal if
Rxy = E{xyH} = 0
Example
Let x and y be the same dimension. If
z = x + y
find Rz and Cz
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 78 / 406
Probability Sequences and Vectors of Random Variables
By definition
Rz = E{(x + y)(x + y)H}= E{xxH} + E{xyH} + E{yxH} + E{yyH}= Rx + Rxy + Ryx + Ry
SimilarlyCz = Cx + Cxy + Cyx + Cy
Note: If x and y are uncorrelated,
Rz = Rx + mxmyH + mymx
H + Ry
andCz = Cx + Cy
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 79 / 406
Probability Sequences and Vectors of Random Variables
Definition (Multivariate Gaussian Density)
For a N dimensional random vector x with covariance Cx, themultivariate Gaussian pdf is
fx(x) =1
(2π)N2 |Cx| 1
2
e− 12 (x−mx)HCx
−1(x−mx)
Note the similarity to the univariate case
fx (x) =1√2πσ
e− 12
(x−m)2
σ2
Example
Let N = 2 (bivariate case) and x be real. Then
x =
[x1
x2
]and mx = E{x} =
[m1
m2
]K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 80 / 406
Probability Sequences and Vectors of Random Variables
Cx = E{(x − mx)(x − mx)T }
= E{xxT} − mxmxT
= E{[
x21 x1x2
x2x1 x22
]}−[
m21 m1m2
m2m1 m22
]
=
[E{x2
1} − m21 E{x1x2} − m1m2
E{x2x1} − m2m1 E{x22} − m2
2
]
Recall thatσ2
x = E{x2} − E2{x}and
r =E{x1x2} − m1m2
σx1σx2
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 81 / 406
Probability Sequences and Vectors of Random Variables
Rearranging: Cx =
[σ2
x1rσx1σx2
rσx1σx2 σ2x2
]Also,
Cx−1 =
1σ2
x1σ2
x2− r2σ2
x1σ2
x2
[σ2
x2−rσx1σx2
−rσx1σx2 σ2x1
]
=1
σ2x1
σ2x2
(1 − r2)
[σ2
x2−rσx1σx2
−rσx1σx2 σ2x1
]
Substituting into the Gaussian pdf and simplifying
fx(x) =1
2π|Cx| 12
e− 12 (x−mx)T Cx
−1(x−mx)
=1
2πσx1σx2(1 − r2)12
e− 1
2(1−r 2)
[(x1−m1)2
σ2x1
−2r(x1−m1)(x2−m2)
σx1 σx2+
(x2−m2)2
σ2x2
]
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 82 / 406
Probability Sequences and Vectors of Random Variables
Note: If uncorrelated, r = 0
⇒ fx(x) = 12πσx1σx2
e− 1
2 [(x1−m1)2
σ2x1
+(x2−m2)2
σ2x2
]
= fx1(x1)fx2(x2)
Gaussian special case result:
uncorrelated ⇒ independent
Example
Examine the contours defined by
(x − mx)T Cx
−1(x − mx) = constant
Why? For all values on the contour
fx(x) = constant
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 83 / 406
Probability Sequences and Vectors of Random Variables
r = 0 σx1 = σx2
x2
x1m1
m2
r = 0 σx1 > σx2
x2
x1m1
m2
r > 0 σx1 > σx2
x2
x1m1
m2
r > 0 σx1 < σx2
x2
x1m1
m2
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 84 / 406
Probability Sequences and Vectors of Random Variables
r < 0 and σx1 > σx2
x2
x1m1
m2
x 2
)( 22xf x
x1
)( 11xf x
Integrating over x2 yields fx1(x1)
Integrating over x1 yields fx2(x2)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 85 / 406
Probability Sequences and Vectors of Random Variables
Additional Gaussian (surface) examples:
r = 0 σx1 = σx2 r = 0 σx1 < σx2
r > 0 σx1 < σx2 r < 0 σx1 < σx2
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 86 / 406
Probability Sequences and Vectors of Random Variables
Transformations of a vector
Let the N functions g1(·), g2(·), . . . , gN(·) map x to z, where
z1 = g1(x1, x2, . . . , xN)z2 = g2(x)...
zN = gN(x)
Forward mapping
Let g1(·), g2(·), . . . , gN(·) be independent and yield a one-to-onetransformation such that ∃ a set of functions
x1 = h1(z), x2 = h2(z), . . . , xN = hN(z)
where z = [z1, z2, . . . , zN ]T .Reverse mapping
Question: How do we determine the distribution fz(z)?
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 87 / 406
Probability Sequences and Vectors of Random Variables
Let N = 2 and consider the probability of being in the region defined by
[z1, z1 + dz1] and [z2, z2 + dz2]
z1
z2
z2+d z2
z2
z1 z1+d z1
Az
),Pr{(}), xxAzzx1
x2z2+d z2
z2z1
z1+d z1
Ax
Identify an equivalent area in the x1, x2 domain and equate theprobabilities
Pr{(z1, z2) ∈ Az} = Pr{(x1, x2) ∈ Ax}fz1z2(z1, z2)Area(Az) = fx1x2(x1, x2)Area(Ax )
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 88 / 406
Probability Sequences and Vectors of Random Variables
Area(Ax)
Area(Az)= abs
(J(
x1 x2
z1 z2
))
=1
abs
(J(
z1 z2
x1 x2
))The Jacobian is defined as
J(
x1 x2
z1 z2
)=
∣∣∣∣∣∂x1∂z1
∂x1∂z2
∂x2∂z1
∂x2∂z2
∣∣∣∣∣and
J(
z1 z2
x1 x2
)=
∣∣∣∣∣∂z1∂x1
∂z1∂x2
∂z2∂x1
∂z2∂x2
∣∣∣∣∣Note that
∂x1
∂z1=
∂h1(z)
∂z1and
∂z1
∂x1=
∂g1(x)
∂x1
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 89 / 406
Probability Sequences and Vectors of Random Variables
Thusfz1z2(z1, z2)Area(Az) = fx1x2(x1, x2)Area(Ax )
⇒ fz1z2(z1, z2) =fx1x2(x1, x2)
Area(Az)/Area(Ax)=
fx1x2(x1, x2)
abs
(J(
z1 z2
x1 x2
))
General Case Result (Functions of Vectors)
fz(z) =fx(x)
abs
(J(
zx
))where
J(
zx
)=
∣∣∣∣∣∣∣∂z1∂x1
· · · ∂z1∂xN
.... . .
...∂zN∂x1
· · · ∂zN∂xN
∣∣∣∣∣∣∣K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 90 / 406
Probability Sequences and Vectors of Random Variables
Example (Linear Transformation)
Let z and x be linearly related
z1 = a11x1 + a12x2
z2 = a21x1 + a22x2
or z = Ax and x = A−1z
Then J(
z1 z2
x1 x2
)=
∣∣∣∣∣∂z1∂x1
∂z1∂x2
∂z2∂x1
∂z2∂x2
∣∣∣∣∣=
∣∣∣∣ a11 a12
a21 a22
∣∣∣∣ = |A|
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 91 / 406
Probability Sequences and Vectors of Random Variables
Let A−1 =
[b11 b12
b21 b22
]. Then
[x1
x2
]=
[b11 b12
b21 b22
] [z1
z2
]
and
fz1z2(z1, z2) =fx1x2(x1, x2)
abs|A|=
fx1x2(b11z1 + b12z2, b21z1 + b22z2)
abs|A|
General Case Result (Linear Transformations)
For case where z = Ax
fz(z) =1
abs|A| fx(A−1z)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 92 / 406
Probability Sequences and Vectors of Random Variables
Vector Statistics for Linear Transformations
For such linear transformations z = Ax
E{z} = E{Ax} = Amx
SimilarlyE{zzH} = E{Ax(Ax)H}
= E{AxxHAH}
⇒ Rz = ARxAH
By similar arguments it is easy to show
Cz = ACxAH
Note: Results in simple linear transformations of statistics
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 93 / 406
Stationary Process and Models Stationary Process
Stationary Process and Models
Statistic process describes the time evolution of statisticalphenomena
A stochastic process is not a single function of time but an infinitenumber of possible realizations
A single realization is called a time series
A full joint distribution function of an arbitrary stochastic process isdifficult to obtain or estimate
Settle for a partial characterization
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 95 / 406
Stationary Process and Models Stationary Process
Consider a discrete-time stochastic process
x(n), x(n − 1), . . . , x(n − M)
which may be complex.
Definitions (Mean, Auto-Correlation, and Auto-Covariance)
The mean process is given by
μ(n) = E{x(n)}
The auto–correlation is defined as
r(n, n − k) = E{x(n)x∗(n − k)}
The auto–covariance is given by
c(n, n − k) = E{[x(n) − μ(n)][x(n − k) − μ(n − k)]∗}= r(n, n − k) − μ(n)μ∗(n − k)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 96 / 406
Stationary Process and Models Stationary Process
Definition (Wide-Sense Stationary)
A discrete-time stochastic process is wide–sense stationary (WSS) if
μ(n) = μ for all n
r(n, n − k) = r(k) and
c(n, n − k) = c(k) k = 0,±1,±2, . . .
Let x(n) = [x(n), x(n − 1), . . . , x(n − M + 1)]T be a M × 1 observationvector. Then for {x(n)} WSS, the correlation matrix is
R = E{x(n)xH(n)} =
⎡⎢⎢⎢⎣
r(0) r(1) · · · r(M − 1)r(−1) r(0) · · · r(M − 2)
......
. . ....
r(−M + 1) r(−M + 2) · · · r(0)
⎤⎥⎥⎥⎦
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 97 / 406
Stationary Process and Models Stationary Process
Properties of the correlation matrices
For a stationary discrete time process: RH = R (Hermetian)⎡⎢⎢⎢⎣
r(0) r(1) · · · r(M − 1)r(−1) r(0) · · · r(M − 2)
......
. . ....
r(−M + 1) r(−M + 2) · · · r(0)
⎤⎥⎥⎥⎦ =
⎡⎢⎢⎢⎣
r(0) r∗(−1) · · · r∗(−M + 1)r∗(1) r(0) · · · r∗(−M + 2)
......
. . ....
r∗(M − 1) r∗(M − 2) · · · r(0)
⎤⎥⎥⎥⎦
Consequence: ⇒ r(−k) = r ∗(k)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 98 / 406
Stationary Process and Models Stationary Process
The correlation matrix is Toeplitz
R =
⎡⎢⎢⎢⎢⎢⎣
r(0) r(1) r(2) · · · r(M − 1)r∗(1) r(0) r(1) · · · r(M − 2)r∗(2) r∗(1) r(0) · · · r(M − 3)
......
......
...r∗(M − 1) r∗(M − 2) r∗(M − 3) · · · r(0)
⎤⎥⎥⎥⎥⎥⎦
For any non-zero vector a
aRaH ≥ 0 (positive semi-definite)
and usuallyaRaH > 0 (positive definite)
Result: R is positive definite if the samples in x are not linearlydependent. In this case R−1 exists.
Historical Note: Diagonal–constant matrices are named after themathematician Otto Toeplitz (1881–1940)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 99 / 406
Stationary Process and Models Stochastic Models
Stochastic Models
A model is used to describe the hidden laws governing thegeneration of physical data observed
We assume that x(n), x(n − 1), · · · have statistical dependenciesthat can be modeled as
Discrete timelinear fitler
v(n) x(n)
processwhere v(n) is a purely random processLinear model types:
1 Auto Regressive – no past model input samples used2 Moving Average – no past model output samples used3 Auto Regressive Moving Average – both past input and output used
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 100 / 406
Stationary Process and Models Stochastic Models
General Stochastic Model:(Modeloutput
)+
(Linear combination
of past outputs
)︸ ︷︷ ︸
AR part
=
(Linear combination ofpresent & past inputs
)︸ ︷︷ ︸
MA part
Three model possibilities:1 AR – auto regressive2 MA – moving average3 ARMA – mixed AR and MA
Model Input: assumed to be an i.i.d. zero mean Gaussian process:
E{v(n)} = 0 for all n
E{v(n)v∗(k)} =
{σ2
v k = n0 otherwise
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 101 / 406
Stationary Process and Models Auto-Regressive Models
Auto-Regressive Models
Definition (Auto-Regressive)
The time series {x(n)} is said to be generated by an AR model if
x(n) + a∗1x(n − 1) + · · · + a∗
Mx(n − M) = v(n)
orx(n) = w∗
1 x(n − 1) + · · · + w∗Mx(n − M) + v(n)
where wk = −ak .
This is an order M model and v(n) is referred to as the noise termNote that we can set a0 = 1 and write
M∑k=0
a∗kx(n − k) = v(n)
which is a convolution sumK. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 102 / 406
Stationary Process and Models Auto-Regressive Models
Thus taking Z-transforms
Z{a∗n} = A(z) =
M∑n=0
a∗nz−n
Z{x(n)} = X (z) =∞∑
n=0
x(n)z−n
Z{v(n)} = V (z) =∞∑
n=0
v(n)z−n
andM∑
k=0
a∗kx(n − k) = v(n) ⇒ A(z)X (z) = V (z)
If we regard v(n) as the output,then HA(z)x(n) v(n)
where HA(z) = V (z)X(z) = A(z)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 103 / 406
Stationary Process and Models Auto-Regressive Models
[Notation note: figure uses u(n) as input, i.e., u(n) = v(n)]
This is called the process analyzerAnalyzer is an all zero system
Impulse response is finite (FIR)System is BIBO stable
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 104 / 406
Stationary Process and Models Auto-Regressive Models
If we view v(n) as the input, then wehave the process generator
HG(z)v(n) x(n)
HG(z) =X (z)
V (z)=
1A(z)
The process generator is an allpole system
Impulse response is infinite (IIR)System stability is an issue
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 105 / 406
Stationary Process and Models Auto-Regressive Models
Note
HG(z) =1
A(z)=
1∑Mn=0 a∗
nz−n
Factor the denominator and represent HG(z) in terms of its poles
HG(z) =1
(1 − p1z−1)(1 − p2z−1) · · · (1 − pMz−1)
p1, p2, . . . , pM are the poles of HG(z) defined as the roots of thecharacteristic equation
1 + a∗1z−1 + a∗
2z−2 + · · · + a∗Mz−M = 0
HG(z) is all pole (IIR) and BIBO stable only if all poles are in theunit circle, i.e.,
|pn| < 1 n = 1, 2, · · · , M
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 106 / 406
Stationary Process and Models Moving Average Model
Moving Average Model
Definition (Moving Average)
The time series {x(n)} is said to be generated by a Moving Average(MA) model if
x(n) = v(n) + b∗1v(n − 1) + · · · + b∗
K v(n − K )
where b1, b2, · · · , bk are the parameters of the order K MA model
v(n) is zero mean white Gaussian noiseThe process generation model is all zero (FIR)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 107 / 406
Stationary Process and Models Auto-Regressive Moving Average Model
Auto-Regressive Moving Average Model
Definition (Auto-Regressive MovingAverage)
In this case, {x(n)} is a mixed processwhere the output is a function of pastoutputs and current/past inputs
x(n) + a∗1x(n − 1) + · · · + a∗
M(n − M)
= v(n) + b∗1v(n − 1) + · · · + b∗
K v(n − K )
The order is (M, K ).
v(n) is zero mean white Gaussiannoise
The process model has zeros andpoles (IIR)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 108 / 406
Stationary Process and Models Analysis of Stochastic Models
Wold Decomposition (after Herman Wold (1908–92))
Any WSS discrete time stochastic process y(n) can be expressed as
y(n) = x(n) + s(n)
where:x(n) and s(n) are uncorrelatedx(n) can be expressed by the MA model
x(n) =∞∑
k=0
b∗kv(n − k)
b0 = 1 and∑∞
k=0 |bk | < ∞v(n) is white noise uncorrelated with s(n)
s(n) is perfectly predictableNote: If B(z) is minimum phase, then it can be represented by an allpole (AR) system.
AR models are widely used because they are tractableK. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 109 / 406
Stationary Process and Models Analysis of Stochastic Models
Asymptotic statistics of AR processes
Recall that {x(n)} is generated by
x(n) + a∗1x(n − 1) + a∗
2x(n − 2) + · · · + a∗Mx(n − M) = v(n)
or
x(n) = w∗1 x(n − 1) + w∗
2 x(n − 2) + · · · + w∗Mx(n − M) + v(n)
Linear constant coefficient difference equation of order M drivenby v(n).
Z-transform representation:
X (z) =V (z)
1 +∑M
k=1 a∗kz−k
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 110 / 406
Stationary Process and Models Analysis of Stochastic Models
Inverse transforming X (z) = V (z)
1+∑M
k=1 a∗k z−k
yields
x(n) = xc(n)︸ ︷︷ ︸Homogeneous Solution
+ xp(n)︸ ︷︷ ︸Particular Solution
The particular solution is the result of driving HG(z) with v(n)
xp(n) = HG(z)v(n),
where z−1 is taken as the delay operator.
The particular solution has stationary statistics
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 111 / 406
Stationary Process and Models Analysis of Stochastic Models
The homogeneous solution is of the form
xc(n) = B1pn1 + B2pn
2 + · · · + BMpnM
where p1, p2, · · · , pM are the roots of
1 + a∗1z−1 + a∗
2z−2 + · · · + a∗Mz−M = 0
The B values depend on the initial conditions
The homogeneous solution is not stationary
The process is asymptotically stationary if |pn| < 1
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 112 / 406
Stationary Process and Models Analysis of Stochastic Models
Correlation of a stationary AR process
Recall that an AR process can be written as
M∑k=0
a∗kx(n − k) = v(n)
where a0 = 1.Multiply both sides by x∗(n − l) and take E{ }.
E
{M∑
k=0
a∗kx(n − k)x∗(n − l)
}= E{v(n)x∗(n − l)}
Note that
E{x(n − k)x∗(n − l)} = r(l − k)
E{v(n)x∗(n − l)} = 0 for l > 0
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 113 / 406
Stationary Process and Models Analysis of Stochastic Models
Thus
E
{M∑
k=0
a∗kx(n − k)x∗(n − l)
}= E{v(n)x∗(n − l)}
⇒M∑
k=0
a∗k r(l − k) = 0 for l > 0
Accordingly, the auto-correlation of the AR process satisfies
r(l) = w∗1 r(l − 1) + w∗
2 r(l − 2) + · · · + w∗Mr(l − M)
where wk = −ak . Note that this also has the solution
r(m) =M∑
k=1
ckpmk
where pk is the k th root of
1 − w∗1 z−1 − w∗
2 z−2 − · · · − w∗Mz−M = 0
Why? Diff. equation (no driving function; homogeneous solution only)K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 114 / 406
Stationary Process and Models Analysis of Stochastic Models
Recall that the AR characteristic equation is
1 + a∗1z−1 + a∗
2z−2 + · · · + a∗Mz−M = 0
This is identical to the auto-correlation characteristic equation
1 − w∗1 z−1 − w∗
2 z−2 − · · · − w∗Mz−M = 0
⇒ the roots are equalResult: A stable AR process ⇒ |pk | < 1 and
limm→∞ r(m) = lim
m→∞
M∑k=1
ckpmk = 0
(asymptotically uncorrelated)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 115 / 406
Stationary Process and Models Yule-Walker Equations
Yule-Walker Equations
An AR model of order M is completely specified by
AR coefficients: a1, a2, . . . , aM
Variance of v(n): σ2v
Proposition: These parameters can be determined by theauto-correlation values: r(0), r(1), . . . , r(M).
Recall
r(l) = w∗1 r(l − 1) + w∗
2 r(l − 2) + · · · + w∗Mr(l − M)
Case 1: Let l = 1
r(1) = w∗1 r(0) + w∗
2 r(−1) + · · · + w∗Mr(1 − M)
Using the fact r(−k) = r∗(k)
r(1) = w∗1 r(0) + w∗
2 r∗(1) + · · · + w∗Mr∗(M − 1)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 116 / 406
Stationary Process and Models Yule-Walker Equations
Taking the complex conjugate
r∗(1) = w1r(0) + w2r(1) + · · · + wMr(M − 1)
= wT [r(0), r(1), · · · , r(M − 1)]T
where wT = [w1, w2, · · · , wM ]
Case 2: Now let l = 2
r(2) = w∗1 r(1) + w∗
2 r(0) + w∗3 r(−1) + · · · + w∗
Mr(2 − M)
⇒ r∗(2) = w1r∗(1) + w2r(0) + w3r(1) · · · + wMr(M − 2)
= wT [r∗(1), r(0), r(1), · · · , r(M − 2)]T
Case 3: Similarly, for l = 3
r(3) = w∗1 r(2) + w∗
2 r(1) + w∗3 r(0) + w∗
4 r(−1) · · · + w∗Mr(3 − M)
⇒ r∗(3) = w1r∗(2) + w2r∗(1) + w3r(0) + w4r(1) · · · + wMr(M − 3)
= wT [r∗(2), r∗(1), r(0), r(1), · · · , r(M − 3)]T
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 117 / 406
Stationary Process and Models Yule-Walker Equations
Repeating the process & combining results in matrix form⎡⎢⎢⎢⎢⎢⎣
r(0) r(1) · · · r(M − 1)r∗(1) r(0) · · · r(M − 2)r∗(2) r∗(1) · · · r(M − 3)
......
. . ....
r∗(M − 1) r∗(M − 2) · · · r(0)
⎤⎥⎥⎥⎥⎥⎦
⎡⎢⎢⎢⎢⎢⎣
w1
w2
w3...
wM
⎤⎥⎥⎥⎥⎥⎦ =
⎡⎢⎢⎢⎢⎢⎣
r∗(1)r∗(2)r∗(3)...r∗(M)
⎤⎥⎥⎥⎥⎥⎦
or more compactly
Rw = r where r = [r(1), r(2), r(3), . . . , r(M)]H
Result: Given the auto-correlation values, the AR coefficients we canbe uniquely determine
w = R−1r
whereak = −wk k = 1, 2, · · · , M
Assumption: R is nonsingularK. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 118 / 406
Stationary Process and Models Yule-Walker Equations
Still to be determined: variance of the driving sequence v(n)
Recall: E{∑M
k=0 a∗kx(n − k)x∗(n − l)
}= E{v(n)x∗(n − l)}
⇒M∑
k=0
a∗k r(l − k) = E{v(n)x∗(n − l)} (∗)
Note that
x∗(n) = [w1x∗(n−1)+w2x∗(n−2)+ · · ·+wMx∗(n−M)+v∗(n)] (∗∗)Let l = 0 in (∗) and use (∗∗) on the RHS
M∑k=0
a∗k r(−k) = E{v(n)x∗(n)}
= E{v(n)[w1x∗(n − 1) + w2x∗(n − 2) + · · ·+wMx∗(n − M) + v∗(n)]}
Note that
E{v(n)wk x∗(n − k)} = 0, k = 1, 2, · · · , M
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 119 / 406
Stationary Process and Models Yule-Walker Equations
Thus E{v(n)wk x∗(n − k)} = 0, k = 1, 2, · · · , M gives
M∑k=0
a∗k r(−k) = E{v(n)x∗(n)}
= E{v(n)[w1x∗(n − 1) + w2x∗(n − 2) + · · ·+wMx∗(n − M) + v∗(n)]}
= E{v(n)v∗(n)}or conjugating
σ2v =
M∑k=0
akr(k)
Final Yule–Walker Result
Rw = r and σ2v =
M∑k=0
akr(k)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 120 / 406
Stationary Process and Models AR Order 2 Example
Example (AR Order-2 Process)
Consider the process defined by
x(n) + a1x(n − 1) + a2x(n − 2) = v(n)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 121 / 406
Stationary Process and Models AR Order 2 Example
The process
x(n) + a1x(n − 1) + a2x(n − 2) = v(n)
has characteristic equation
1 + a1z−1 + a2z−2 = 0
⇒ p1, p2 =−a1 ±
√a2
1 − 4a2
2Stability enforces the constraints
|pk | < 1 ⇒⎧⎨⎩
−1 ≤ a2 + a1
−1 ≤ a2 − a1
−1 ≤ a2 ≤ 1
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 122 / 406
Stationary Process and Models AR Order 2 Example
Recall that the auto-correlation can be expressed as
r(m) =M∑
k=1
ckpmk = c1pm
1 + c2pm2
p1, p2 real, positive ⇒ r(m) positivedecaying exponential
p1, p2 real, negative ⇒ r(m) alternatesign decaying exponential
p1, p2 complex conjugate ⇒ r(m)exponentially decaying sinusoid
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 123 / 406
Stationary Process and Models AR Order 2 Example
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 124 / 406
Stationary Process and Models AR Order 2 Example
The characteristics of the AR process vary in a related fashion to thepole placements.
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 125 / 406
Stationary Process and Models AR Order 2 Example
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 126 / 406
Stationary Process and Models Model Order Selection
Model Order Selection
Model Order Selection
A model is typically estimated from a finite set of observation data.
Result: Use Yule-Walker equations to estimate model parameters
Open Question: How do we estimate the model order?Solution: Use information theoretic criteria.
Akaike’s information criterion, developed by Hirotsugu Akaike underthe name of “an information criterion” (AIC) in 1971
Take xi = x(i), i = 1, 2, · · · , N to be N observations of a stationarydiscrete time process.
Let θ be the estimated model (AR/MA/ARMA) order m parameters
θm = [θ1m, θ2m, · · · , θMm]T
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 127 / 406
Stationary Process and Models Model Order Selection
Let fx (xi |θm) be the conditional pdf of xi given the estimated modeldefined by θm.Set L(θm) = max
θm
∑Ni=1 ln fx (xi |θm).
The likelihood function (log of conditional pdf evaluated at themaximum likelihood estimates of the model parameters, θm).
Then the AIC model order is given by m that minimizes
AIC(m) = −2L(θm)︸ ︷︷ ︸Always decreasing
+ 2m︸︷︷︸Parameter cost function
The AIC methodology attempts to find the model that bestexplains the data with a minimum of free parameters.
m optimal
AIC(m)
m
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 128 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Transformation by a Linear System
Transformation by a Linear System
Consider a LTI system characterized by impulse response h(n)
h(n)x(n) y(n)
⇒ y(n) = x(n) ∗ h(n) =∞∑
k=−∞h(n − k)x(k)
Example (Deterministic Case)
Suppose x(n) and h(n) are
0 1
x(n)
2
1
n0 1
h(n)
2
1
n
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 130 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Transformation by a Linear System
0 1
x(k)
2
1
k
-1 -2 -1
h(-k)
0
1
k
1
y(n) =∞∑
k=−∞h(n − k)x(k)
⇒ y(n) = 0 n < 0y(0) = 1y(1) = 1/2y(2) = 1y(3) = 1/2y(n) = 0 n ≥ 4
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 131 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Transformation by a Linear System
n
-1
y(n)
2
1
3
1/21
1/2
10 4
Stochastic Case: If x(n) is a stationary RV
y(n) =∞∑
k=−∞h(k)x(n − k)
⇒ E{y(n)} =∞∑
k=−∞h(k)E{x(n − k)}
⇒ my = mx
∞∑k=−∞
h(k)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 132 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Transformation by a Linear System
Next, evaluate the cross-correlation
y(n) =∑
k
h(n − k)x(k)
⇒ E{y(n)x∗(n − l)} =∑
k
h(n − k)E{x(k)x∗(n − l)}
⇒ ryx(l) =∑
k
h(n − k)rx (k − n + l) [let m = n − k ]
=∑
m
h(m)rx (l − m)
⇒ ryx(l) = h(l) ∗ rx (l)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 133 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Transformation by a Linear System
Known results:
ryx(l) = h(l) ∗ rx (l) [new result]
rx (l) = r∗x (−l) [prior result]
ryx(l) = r∗xy (−l) [prior result]
Using these results:
r∗xy (−l) = ryx(l)
= h(l) ∗ rx (l)
⇒ r∗xy(l) = h(−l) ∗ rx (−l)
= h(−l) ∗ r∗x (l)
⇒ rxy(l) = h∗(−l) ∗ rx (l)
Note: ryx (l) = h(l) ∗ rx (l) while rxy (l) = h∗(−l) ∗ rx (l)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 134 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Transformation by a Linear System
Lastly, examine ry (l)
y(n) =∑
k
h(n − k)x(k)
⇒ E{y(n)y∗(n − l)} =∑
k
h(n − k)E{x(k)y∗(n − l)}
⇒ ry (l) =∑
k
h(n − k)rxy (k − n + l) [let m = n − k ]
=∑
m
h(m)rxy (l − m)
= h(l) ∗ rxy(l)
or, substituting back using rxy(l) = h∗(−l) ∗ rx (l)
⇒ ry (l) = h(l) ∗ h∗(−l) ∗ rx (l)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 135 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Transformation by a Linear System
Summary of Results (Correlations for a LTI System)
ryx(l) = h(l) ∗ rx (l)
rxy (l) = h∗(−l) ∗ rx (l)
ry (l) = h(l) ∗ rxy(l)
= h(l) ∗ h∗(−l) ∗ rx (l)
Note: Similar results hold for the covariance.
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 136 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Transformation by a Linear System
Example
Suppose the impulse response of a LTI is
-1
h(n)
2
1
310 4
Question: If the system is driven by white zero mean noise, what aremy , ryx(l), rxy(l), and ry (l)?
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 137 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Transformation by a Linear System
Consider my first:
my = E{y(n)} = mx
∑k
h(k) = 0
Also note rx (l) = σ2xδ(l). Thus,
ryx(l) = h(l) ∗ rx (l)
= h(l) ∗ σ2xδ(l)
= σ2x h(l)
By the symmetry property
rxy(l) = r∗yx(−l) = σ2xh∗(−l)
Plotting the results:
0 1
ryx(l)
2
l
-1
2xσ
3 4 -3 -2
rxy(l)
-1
l
-4
2xσ
0 1K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 138 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Transformation by a Linear System
Lastly
ry (l) = h(l) ∗ h∗(−l) ∗ rx (l)
= σ2x (h(l) ∗ h∗(−l))
Generating the result graphically:
0 1
h(n)
2-1 3 4
1
-3 -2
h*(-n)
-1-4 0 1
1
0
3
2-4 3 4
12
3
1
4
21
-2 -1-3
h(n)*h*(-n)
0
3σ2
2-4 3 41
4σ2
2σ2
1σ2
-2 -1-3
ry(l)
⇒ ⇒
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 139 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Power Spectral Density
Power Spectral Density
Definition (Power Spectral Density)
The autocorrelation function of a wide-sense stationary process andthe power spectral density form a Fourier transform pair
S(ω) =∞∑
l=−∞r(l)e−jωl − π < ω ≤ π
r(l) =1
2π
∫ π
−πS(ω)ejωldω l = 0,±1,±2, · · ·
PSD Properties:
The PSD is periodic, with period 2π
S(ω + 2kπ) = S(ω) k = 0,±1,±2, . . .
⇒ All information contained in the Nyquist interval [−π, π]
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 140 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Power Spectral Density
The PSD of a stationary process is real
To prove this, note
S(ω) =∞∑
k=−∞r(k)e−jωk
= r(0) +∞∑
k=1
r(k)e−jωk +−1∑
k=−∞r(k)e−jωk
= r(0) +∞∑
k=1
r(k)e−jωk +∞∑
k=1
r(−k)ejωk
= r(0) +∞∑
k=1
r(k)e−jωk + r∗(k)ejωk
= r(0) + 2∞∑
k=1
Re[r(k)e−jωk ]
QEDK. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 141 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Power Spectral Density
For real-valued sequences, S(ω) ≥ 0 and has even symmetry(prove yourself)
The PSD is (within a constant) a density describing theconcentration of power
Proof: Relate area under the PSD to signal power. Thus, let l = 0in the inverse transform,
r(l) =1
2π
∫ π
−πS(ω)ejωldω
⇒ r(0) =1
2π
∫ π
−πS(ω)dω
⇒ 12π
∫ π
−πS(ω)dω = E{x2(n)} [signal power]
⇒ Area under the PSD is (scaled by the 1/2π constant) equal tothe signal power. Moreover, it characterizes the signal power byfrequency
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 142 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Power Spectral Density
Example
If x(n) is zero mean i.i.d. with variance σ2x , what is the PSD?
In the case,
r(l) = σ2xδ(l)
⇒ S(ω) =∑
l
r(l)e−jωl
=∑
l
σ2xδ(l)e−jωl = σ2
x − π < ω ≤ π
2xσ
S(ω)
-π π
Note: Flat spectrums are said to be whitebecause they have a uniform distributionof power
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 143 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Transmission Through a Linear System
Transmission Through a Linear System
Let x(n) be a stationary discrete time process that is passed through aLTI system characterized by h(n).
h(n)x(n) y(n)
Then by known properties and time/frequency relations
ryx(l) = h(l) ∗ rx (l)
⇒ Syx(ω) = H(ω)Sx(ω)
whereH(ω) =
∑l
h(l)e−jωl
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 144 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Transmission Through a Linear System
Also note that
FT{h∗(−l)} =∑
l
h∗(−l)e−jωl
=∑
l
h∗(l)ejωl
=
[∑l
h(l)e−jωl
]∗= H∗(ω)
Thus
ry (l) = h(l) ∗ h∗(−l) ∗ rx (l)
⇒ Sy(ω) = H(ω)H∗(ω)Sx (ω)
= |H(ω)|2Sx (ω)
Note: Only the PSD magnitude is affected by the system
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 145 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Transmission Through a Linear System
Example
Let x(n) have correlation rx(l) = (12)|l | and be the into to a LTI system
with impulse response
h(n) = δ(n) + δ(n − 1).
Find ry (l) and Sy (ω)
Note ry (l) = h(l) ∗ h(−l) ∗ rx (l)
0 1
1
0 1
1
* =
0 1
1
-1
2
h(l)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 146 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Transmission Through a Linear System
Thus
h(l) ∗ h(−l) = δ(l + 1) + 2δ(l) + δ(l − 1)
⇒ ry (l) = h(l) ∗ h(−l) ∗ rx (l)
= [δ(l + 1) + 2δ(l) + δ(l − 1)] ∗(
12
)|l |
= 2(
12
)|l |+
(12
)|l+1|+
(12
)|l−1|
The result is a consequence of convolving with shifted delta functions
Next, compute Sx(ω) and |H(ω)|2Aside: Recall for |p| < 1,
∞∑k=0
pk =1
1 − pand
∞∑k=1
pk =p
1 − p
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 147 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Transmission Through a Linear System
Given rx (k) = (12)|k |
Sx(ω) =∑
k
(12
)|k |e−jωk
=∞∑
k=0
(12
)k
e−jωk +∞∑
k=1
(12
)k
ejωk
=1
1 − 12e−jω
+12ejω
1 − 12ejω
=34
54 − cos(ω)
Also
H(ω) =∑
k
h(k)e−jωk
=∑
k
[δ(k) + δ(k − 1)]e−jωk
= 1 + e−jω
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 148 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Transmission Through a Linear System
Taking the magnitude,
|H(ω)|2 = (1 + e−jω)(1 + e−jω)∗
= 1 + e−jω + ejω + 1
= 2 + 2 cos(ω)
= 2(1 + cos(ω))
Thus the output PSD is
Sy (ω) = |H(ω)|2Sx(ω)
= 2(1 + cos(ω))
(34
54 − cos(ω)
)
=32(1 + cos(ω))
54 − cos(ω)
Note: The system only affects the magnitude and Sy (ω) has evensymmetry. Question: Where is the power concentrated?
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 149 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Spectral Factorization
Spectral Factorization
Suppose a system is described by the difference equation
M∑k=0
akx(n − k) =K∑
l=0
blv(n − l)
Then taking the z-transform
M∑k=0
akz−k X (z) =K∑
l=0
blz−l V (z)
orX (z)
V (z)
�= H(z) =
∑Kl=0 blz−l∑M
k=0 akz−k=
B(z)
A(z)
Note: H(z) is defined as the transfer function
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 150 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Spectral Factorization
All pass systems
Suppose a stable first-order system has a transfer function
H(z) =1 − a∗zz − a
Note: System has a zero at z = 1/a∗ and a pole at z = a.
Let a = rejθ. Then 1a∗ = 1
re−jθ = 1r ejθ. For a stable (r < 1) system:
ra
Note: Pole and zero are at the same angle
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 151 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Spectral Factorization
Evaluating H(ω) = H(z)|z=ejω
H(ω) =1 − a∗ejω
ejω − a=
ejω(e−jω − a∗)ejω − a
⇒ |H(ω)| = |ejω| |e−jω − a∗||ejω − a| = 1 [All pass filter]
General Result – All Pass Systems
Systems of the form
Hap(z) =N∏
i=1
(1 − a∗i z)
(z − ai)
are all pass. For stability, |ai | < 1.
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 152 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Spectral Factorization
Minimum Phase Systems
A system is minimum phase if it is causal and has all its poles andzeros inside the unit circle
Minimum phase systems are stable
If H(z) is minimum phase, then H−1(z) is minimum phase andstable
Maximum Phase Systems
A system is maximum phase if it has all its zeros outside the unit circle
The inverse of a maximum phase systems is not stable
Mixed Phase Systems
A system is mixed phase if it has zeros inside and outside the unitcircle
The inverse of a mixed phase systems is not stable
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 153 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Spectral Factorization
Factorization Result
Any rational H(z) can be factored as
H(z) = Hmin(z)︸ ︷︷ ︸Min. phase
Hap(z)︸ ︷︷ ︸All pass
Consider a non–minimum phase H(z) with a (single) zero outside theunit circle, at z = 1/a∗ where |a| < 1
H(z) = H1(z)(1 − a∗z) [factor out non–min. phase comp.]
= H1(z)(1 − a∗z)
(z − az − a
)
= H1(z)(z − a)︸ ︷︷ ︸Minimum phase
(1 − a∗zz − a
)︸ ︷︷ ︸
All pass
Note: Magnitude response is determined by the minimum phasecomponent, while both terms contribute to phase response
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 154 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Spectral Factorization
Example
Consider a stable system with a single zero outside the unit circle(maximum phase system):
H(z) Hmin(z) Hap(z)System with 1 System with reflected System with original zero
zero outside UC zero; min phase & reflected/cancelingpole; all pass
Note: The zero in Hmin(z) and the pole in Hap(z) cancel each other out
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 155 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Eigen Analysis
Eigen Analysis
Objective: Utilize tools from linear algebra to characterize and analyzematrices, especially the correlation matrix
The correlation matrix plays a large role in statisticalcharacterization and processing.
Previously result: R is Hermitian.
Further insight into the correlation matrix is achieved througheigen analysis
Eigenvalues and vectorsMatrix diagonalizationApplication: Optimum filtering problems
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 156 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Eigen Analysis
Objective: For a Hermitian matrix R, find a vector q satisfying
Rq = λq
Interpretation: Linear transformation by R changes the scale, butnot the direction of q
Fact: A M × M matrix R has M eigenvectors and eigenvalues
Rqi = λiqi i = 1, 2, 3, · · · , M
To see this, note(R − λI)q = 0
For this to be true, the row/columns of (R − λI) must be linearlydependent,
⇒ det(R − λI) = 0
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 157 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Eigen Analysis
Note: det(R − λI) is a Mth order polynomial in λ
The roots of the polynomial are the eigenvalues λ1, λ2, · · · , λM
Rqi = λiqi
Each eigenvector qi is associated with one eigenvalue λi
The eigenvectors are not unique
Rqi = λiqi
⇒ R(aqi) = λi(aqi)
Consequence: eigenvectors are generally normalized, e.g.,|qi | = 1 for i = 1, 2, . . . , M
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 158 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Eigen Analysis
Example (General two dimensional case)
Let M = 2 and
R =
[R1,1 R1,2
R2,1 R2,2
]Determine the eigenvalues and eigenvectors.
Thus
det(R − λI) = 0
⇒∣∣∣∣ R1,1 − λ R1,2
R2,1 R2,2 − λ
∣∣∣∣ = 0
⇒ λ2 − λ(R1,1 + R2,2) + (R1,1R2,2 − R1,2R2,1) = 0
⇒ λ1,2 =12
[(R1,1 + R2,2) ±
√4R1,2R2,1 + (R1,1 − R2,2)
]
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 159 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Eigen Analysis
Back substitution yields the eigenvectors:[R1,1 − λ R1,2
R2,1 R2,2 − λ
] [q1
q2
]=
[00
]
In general, this yields a set of linear equations. In the M = 2 case:
(R1,1 − λ)q1 + R1,2q2 =0
R2,1q1 + (R2,2 − λ)q2 =0
Solving the set of linear equations for a specific eigenvalue λ i
yields the corresponding eigenvector, qi
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 160 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Eigen Analysis
Example (Two–dimensional white noise)
Let R be the correlation matrix of a two–sample vector of zero meanwhite noise
R =
[σ2 00 σ2
]Determine the eigenvalues and eigenvectors.
Carrying out the analysis yields eigenvalues
λ1,2 =12
[(R1,1 + R2,2) ±
√4R1,2R2,1 + (R1,1 − R2,2)
]
=12
[(σ2 + σ2) ±
√0 + (σ2 − σ2)
]= σ2
and eigenvectors
q1 =
[10
]and q2 =
[01
]Note: The eigenvectors are unit length (and orthogonal)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 161 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Eigen Properties
Eigen Properties
Property (eigenvalues of Rk )
If λ1, λ2, · · · , λM are the eigenvalues of R, then λk1, λk
2, · · · , λkM are the
eigenvalues of Rk .
Proof: Note Rqi = λiqi . Multiplying both sides by R k − 1 times,
Rkqi = λiRk−1qi = λki qi
Property (linear independence of eigenvectors)
The eigenvectors q1, q2, · · · , qM , of R are linearly independent, i.e.,
M∑i=1
aiqi �= 0
for all nonzero scalars a1, a2, · · · , aM .
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 162 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Eigen Properties
Property (Correlation matrix eigenvalues are real & nonnegative)
The eigenvalues of R are real and nonnegative.
Proof:
Rqi = λiqi
⇒ qHi Rqi = λiqH
i qi [pre–multiply by qHi ]
⇒ λi =qH
i Rqi
qHi qi
≥ 0
Follows from the facts: R is positive semi-definite and qHi qi = |qi|2 > 0
Note: In most cases, R is positive definite and
λi > 0, i = 1, 2, · · · , M
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 163 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Eigen Properties
Property (Unique eigenvalues ⇒ orthogonal eigenvectors)
If λ1, λ2, · · · , λM are unique eigenvalues of R, then the correspondingeigenvectors, q1, q2, · · · , qM , are orthogonal.
Proof:
Rqi = λiqi
⇒ qHj Rqi = λiqH
j qi (∗)
Also, since λj is real and R is Hermitian
Rqj = λjqj
⇒ qHj R = λjqH
j
⇒ qHj Rqi = λjqH
j qi
Substituting the LHS from (∗)⇒ λiqH
j qi = λjqHj qi
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 164 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Eigen Properties
Thus
λiqHj qi = λjqH
j qi
⇒ (λi − λj)qHj qi = 0
Since λ1, λ2, · · · , λM are unique
qHj qi = 0 i �= j
⇒ q1, q2, · · · , qM are orthogonal.
QED
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 165 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Eigen Properties
Diagonalization of R
Objective: Find a transformation that transforms the correlation matrixinto a diagonal matrix.
Let λ1, λ2, · · · , λM be unique eigenvectors of R and take q1, q2, · · · , qM
to be the M orthonormal eigenvectors
qHi qj =
{1 i = j0 i �= j
Define Q = [q1, q2, · · · , qM ] and Ω = diag(λ1, λ2, · · · , λM). Thenconsider
QHRQ =
⎡⎢⎢⎢⎣
qH1
qH2...
qHM
⎤⎥⎥⎥⎦R[q1, q2, · · · , qM ]
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 166 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Eigen Properties
QHRQ =
⎡⎢⎢⎢⎣
qH1
qH2...
qHM
⎤⎥⎥⎥⎦R[q1, q2, · · · , qM ]
=
⎡⎢⎢⎢⎣
qH1
qH2...
qHM
⎤⎥⎥⎥⎦ [λ1q1, λ2q2, · · · , λNqM ]
=
⎡⎢⎢⎢⎣
λ1 0 · · · 00 λ2 · · · 0...
.... . .
...0 0 · · · λM
⎤⎥⎥⎥⎦
⇒ QHRQ = Ω (eigenvector diagonalization of R)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 167 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Eigen Properties
Property (Q is unitary)
Q is unitary, i.e., Q−1 = QH
Proof: Since the qi eigenvectors are orthonormal
QHQ =
⎡⎢⎢⎢⎣
qH1
qH2...
qHM
⎤⎥⎥⎥⎦ [q1, q2, · · · , qM ] = I
⇒ Q−1 = QH
Property (Eigen decomposition of R)
The correlation matrix can be expressed as
R =M∑
i=1
λiqiqHi
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 168 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Eigen Properties
Proof: The correlation diagonalization result states
QHRQ = Ω
Isolating R and expanding,
R = QΩQH = [q1, q2, · · · , qM ]Ω
⎡⎢⎢⎢⎣
qH1
qH2...
qHM
⎤⎥⎥⎥⎦
= [q1, q2, · · · , qM ]
⎡⎢⎢⎢⎣
λ1qH1
λ2qH2
...λMqH
M
⎤⎥⎥⎥⎦ =
M∑i=1
λiqiqHi
Note: This also gives
R−1 = (QH)−1Ω−1Q−1 = QΩ−1QH
where Ω−1 = diag(1/λ1, 1/λ2, · · · , 1/λM)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 169 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Eigen Properties
Aside (trace & determinant for matrix products)
Note trace(A)�=∑
i Ai ,i . Also,
trace(AB) = trace(BA) similarly det(AB) = det(A)det(B)
Property (Determinant–Eigenvalue Relation)
The determinant of the correlation matrix is related to the eigenvaluesas follows:
det(R) =M∏
i=1
λi
Proof: Using R = QΩQH and the above,
det(R) = det(QΩQH)
= det(Q)det(QH)det(Ω) = det(Ω) =M∏
i=1
λi
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 170 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Eigen Properties
Property (Trace–Eigenvalue Relation)
The trace of the correlation matrix is related to the eigenvalues asfollows:
trace(R) =M∑
i=1
λi
Proof: Note
trace(R) = trace(QΩQH)
= trace(QHQΩ)
= trace(Ω)
=M∑
i=1
λi
QED
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 171 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Eigen Properties
Definition (Normal Matrix)
A complex square matrix A is a normal matrix if
AHA = AAH
That is, a matrix is normal if it commutes with its conjugate transpose.
Note
All Hermitian symmetric matrices are normal
Every matrix that can be diagonalized by the unitary transform isnormal
Definition (Condition Number)
The condition number reflects how numerically well–conditioned aproblem is, i.e, a low condition number ⇒ well–conditioned; a highcondition number ⇒ ill–conditioned.
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 172 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Eigen Properties
Definition (Condition Number for Linear Systems)
For a linear systemAx = b
defined by a normal matrix A, the condition number is
χ(A) =λmax
λmin
where λmax and λmin are the maximum/minimum eigenvalues of A
Observations:
Large eigenvalue spread ⇒ ill–conditioned
Small eigenvalue spread ⇒ well–conditioned
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 173 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Eigen Properties
Sensitivity Analysis: Suppose a system (filter) is related as:
Rw = d
where w defines the filter parameters; R and d are signal statisticmatrices
⇒ Introduce small signal statistic perturbations
Let R and d are perturbed such that ||δR||/||R|| and ||δd||/||d|| are onthe order ε � 1
Result: A bound on the resulting parameter perturbations is given by
||δw||||w|| ≤ εχ(R) = ε
λmax
λmin
Consequence: If R is ill conditioned, small changes in R or d can leadto big changes in w
⇒ System sensitivity is related to eigenvalue spread
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 174 / 406
Linear Systems, Spectral Representations, and Eigen Analysis The discrete Karhmen-Loeve Transform
The discrete Karhmen-Loeve Transform (KLT)
Definition (The discrete Karhmen-Loeve Transform (KLT))
A M sample vector x(n) from the process {x(n)} can be expressed as
x(n) =M∑
i=1
ci(n)qi
where q1, q2, · · · , qM are the orthonormal eigenvectors of the processcorrelation matrix, R, and c1(n), c2(n), · · · , cM(n) are a set of KLTcoefficients.
Signal is represented as a weighted sum of eigenvectors
Need to determine the coefficients
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 175 / 406
Linear Systems, Spectral Representations, and Eigen Analysis The discrete Karhmen-Loeve Transform
Determining Coefficients: Write the expression in matrix form
x(n) =M∑
i=1
ci(n)qi
= Qc(n) (∗)
whereQ = [q1, q2, · · · , qM ]
andc(n) = [c1(n), c2(n), · · · , cM(n)]T .
Solving (∗) for c(n):
c(n) = Q−1x(n) = QHx(n)
orci(n) = qH
i x(n)
Note: ci(n) is the projection of x(n) onto qi
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 176 / 406
Linear Systems, Spectral Representations, and Eigen Analysis The discrete Karhmen-Loeve Transform
Question: How related are the coefficients to reach other?
Answer: Consider the correlation between ci(n) terms
Rc(n) = E{c(n)cH(n)}= E{(QHx(n))(QHx(n))H}= E{(QHx(n)xH(n)Q}= QHRxQ
= Ω
Result:
E{c∗i (n)cj (n)} =
{λi i = j0 otherwise
⇒ KLT transform coefficients are uncorrelatedA desirable property – Why?
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 177 / 406
Linear Systems, Spectral Representations, and Eigen Analysis The discrete Karhmen-Loeve Transform
Question: Can we represent x(n) with fewer terms? If so, how do weminimize the representation error?
Approach: Use fewer terms in the KLT transform
x(n) =M∑
i=1
ci(n)qi
⇒ x(n) =N∑
i=1
ci(n)qi N < M
Thus
x(n) = x(n) + ε(n)
=N∑
i=1
ci(n)qi +M∑
i=N+1
ci(n)qi
Question: How do we minimize the representation error?K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 178 / 406
Linear Systems, Spectral Representations, and Eigen Analysis The discrete Karhmen-Loeve Transform
Approach: Analyzed and minimize the error power
The error power is given by
ε = E{εH(n)ε(n)}
= E
⎧⎨⎩
M∑i=N+1
c∗i (n)qH
i
M∑j=N+1
cj(n)qj
⎫⎬⎭
=M∑
i=N+1
E{c∗i (n)ci (n)} [result of orthogonality]
=M∑
i=N+1
λi [from prior result]
Result: To minimize the error select the qi eigenvectors associatedwith M largest eigenvalues.
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 179 / 406
Linear Systems, Spectral Representations, and Eigen Analysis The Matched Filter
The Matched Filter
Objective: Find the optimal filter coefficients for two cases:(1) deterministic signals and (2) stochastic signals
Signal model:x(n) = u(n)︸︷︷︸
signal
+ v(n)︸︷︷︸noise
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 180 / 406
Linear Systems, Spectral Representations, and Eigen Analysis The Matched Filter
Definitions:
Filter parameters: w = [w1, w2, · · · , wM ]T
Observation vector: x(n) = [x(n), x(n − 1), · · · , x(n − M + 1)]T
Then using x(n) = u(n) + v(n) and linearity,
y(n) = wT x(n)
= wT u(n) + wT v(n)
= ys(n) + yn(n)
where
ys(n) = wT u(n)
yn(n) = wT v(n)
Consider deterministic and stochastic cases separately
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 181 / 406
Linear Systems, Spectral Representations, and Eigen Analysis The Matched Filter
Case 1: Matched Filters for Deterministic Signals
Approach:
Analyzed output SNR
Set filter coefficients to maximize SNR
The SNR at time n can be defined as
SNR =|ys(n)|2
E{|yn(n)|2}where
ys(n) = wT u(n) [Deterministic]
yn(n) = wT v(n) [Stochastic]
Note: Case assumes u(n) is deterministic
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 182 / 406
Linear Systems, Spectral Representations, and Eigen Analysis The Matched Filter
Using ys(n) = wT u(n) and yn(n) = wT v(n),
SNR =|ys(n)|2
E{|yn(n)|2}
=(wT u(n))∗(uT (n)w)
E{(wT v(n))∗(vT (n)w)}
=wHu∗(n)uT (n)w
E{wHv∗(n)vT (n)w}
=wHu∗(n)uT (n)w
wHRvw
To analyze further, restrict the noise statistics
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 183 / 406
Linear Systems, Spectral Representations, and Eigen Analysis The Matched Filter
Suppose v(n) is zero mean and uncorrelated, Rv = σ2v I. Then
SNR =wHu∗(n)uT (n)w
wHRvw
=|wHu∗(n)|2
σ2vwHw
If we restrict |w|2 = 1, then the final result is
SNR =|wHu∗(n)|2
σ2v
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 184 / 406
Linear Systems, Spectral Representations, and Eigen Analysis The Matched Filter
Definition (Cauchy–Schwarz Inequality (1821 disc.; 1859 cont.))
Cauchy-Schwarz’s inequality states (matrix case)
|AHB|2 ≤ (AHA)(BHB)
with equality only if A = kB, where k is a constant
Thus
SNR =|wHu∗(n)|2
σ2v
≤ (wHw)(uT (n)u∗(n))
σ2v
=uH(n)u(n)
σ2v
=|u(n)|2
σ2v
Result: The SNR is maximized when w = ku∗(n)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 185 / 406
Linear Systems, Spectral Representations, and Eigen Analysis The Matched Filter
Example
A communications system sends out one of two signals
Symbol “1”
Symbol “0”
0 1 2
n
3 4
1
2
0 1 2 n3 4
-2
-1
We receivey(n) = ui(n) + v(n)
where v(n) is i.i.d. Gaussian.
Objective: Find the optimal w (assume n = 3)K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 186 / 406
Linear Systems, Spectral Representations, and Eigen Analysis The Matched Filter
Set w = ku1(n) where k is such that |w|2 = 1. Thus,
w = u1(n)/|u1(n)| =1√10
[2, 2, 1, 1]T
The resulting impulse response of the filter is
0 1 2n
3
102
101
Observation: Optimal filter impulse response is sent (desired) signaltime–reversed & normalized
Determine ys(n) for signal “1” sent
ys(n) = wT u1(n) =|u1(n)|2|u1(n)| =
√10
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 187 / 406
Linear Systems, Spectral Representations, and Eigen Analysis The Matched Filter
For signal “0”
ys(n) = wT u0(n) =−|u0(n)|2|u1(n)| = −
√10
Received signal distributions:
10− 100
fy(y)
The SNR for the optimized system is
SNR =|u1(n)|2
σ2v
=10σ2
v
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 188 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Matched Filter for Stochastic Signals
Case 2: Matched Filter for Stochastic Signals
As before
y(n) = wT x(n)
= wT u(n) + wT v(n)
= ys(n) + yn(n)
Now define the SNR as
SNR =E{|ys(n)|2}E{|yn(n)|2} =
wHRuwwHRvw
As before, if Rv = σ2v I and |w|2 = 1
SNR =wHRuw
σ2v
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 189 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Matched Filter for Stochastic Signals
Since
SNR =wHRuw
σ2v
,
maximizing the SNR is equivalent to maximizing |wHRuw|Recall from Cauchy–Schwarz,
|AHB|2 ≤ (AHA)(BHB)
with equality only if A = kB
Let AH = wHRu and B = w and apply Cauchy–SchwarzResult: |wHRuw| is maximized for A = kB
⇒ Ruw = kw
⇒ k is an eigenvalue and w is an eigenvector!w = qi and k = λi , where Ruqi = λiqi
Question: How do we choose which eigenvector?
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 190 / 406
Linear Systems, Spectral Representations, and Eigen Analysis Matched Filter for Stochastic Signals
Using w = qi , or wH = qHi , and Ruw = Ruqi = λiqi gives
SNR =wHRuw
σ2v
=qH
i λiqi
σ2v
[orthonormal eigenvectors]
=λi
σ2v
Result: The SNR is maximized when w = qmax, the eigenvectorassociated with the largest eigenvalue of Ru.
When using the optimal filter weights (w = qmax), the resulting(maximized) SNR is
SNR =λmax
σ2v
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 191 / 406
Maximum Likelihood and Bayes Estimation ML Estimation
Maximum Likelihood and Bayes Estimation
Estimation
Estimation is the inference of unknown quantities. Two cases areconsidered:
1 Quantity is fixed, but unknown – parameter estimation2 Quantity is random and unknown – random variable estimator
Parameter Estimation
Consider a set of observations forming a vector
x = [x1, x2, · · · , xN ]T
Assumption: The xi RVs come from a known density governed byunknown (but fixed) parameter θ
Objective: Estimate θ. What optimality criteria should be used?
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 193 / 406
Maximum Likelihood and Bayes Estimation ML Estimation
Definition (Maximum Likelihood Estimation)
The maximum likelihood estimate of θ is the value θML(x) which makesthe x observations most likely
θML(x) = argmaxθ
fx|θ(x|θ)
Example
Let xi ∼ N(μ, σ2). Given N observations, find the ML estimate of μ.
μ μx
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 194 / 406
Maximum Likelihood and Bayes Estimation ML Estimation
For i.i.d. samples
fx|μ(x|μ) =N∏
i=1
fxi |μ(xi |μ)
=N∏
i=1
1√2πσ2
e− (xi−μ)2
2σ2 [Gaussian case]
�= likelihood function
Thus the estimate of the mean it is set as
μ = argmaxμ
fx|μ(x|μ)
Interpretation: Set the distribution mean to the value that makesobtaining the observed samples most likely.
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 195 / 406
Maximum Likelihood and Bayes Estimation ML Estimation
Note: Maximizing fx|μ(x|μ) is equivalent to maximizing any monotonicfunction of fx|μ(x|μ). Choosing ln(·)
ln(fx|μ(x|μ)) = ln
(N∏
i=1
1√2πσ2
e− (xi−μ)2
2σ2
)
= −N ln(√
2πσ2) −N∑
i=1
(xi − μ)2
2σ2
= −N ln(√
2πσ2) −N∑
i=1
x2i
2σ2 + μN∑
i=1
xi
σ2 −N∑
i=1
μ2
2σ2
Taking the derivative and equating to 0,
∂ ln(fx|μ(x|μ))
∂μ=
N∑i=1
xi
σ2 − Nμ
σ2 = 0
⇒ μ =1N
N∑i=1
xi�= sample mean
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 196 / 406
Maximum Likelihood and Bayes Estimation ML Estimation
General Maximum Likelihood Result
General Statement: The ML estimate of θ is
θML(x) = argmaxθ
fx|θ(x|θ)
Solution: The ML estimate of θ is obtained as the solution to
∂
∂θfx|θ(x|θ)
∣∣∣θ=θML
= 0
or∂
∂θln[fx|θ(x|θ)]
∣∣∣θ=θML
= 0
fx|θ(x|θ) is the likelihood function of θ.
θML is a RV since it is a function of the RVs x1, x2, · · · , xN
Historical Note: ML estimation was pioneered by geneticist andstatistician Sir R. A. Fisher between 1912 and 1922
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 197 / 406
Maximum Likelihood and Bayes Estimation ML Estimation
Example
The time between customer arrivals at a bar is a RV with distribution
fT (T ) = αe−αT U(T )
Objective: Estimate the arrival rate α based on N measured arrivalintervals T1, T2, · · · , TN .
Assuming that the arrivals are independent,
f (T1, T2, · · · , TN) =N∏
i=1
fT (Ti)
=N∏
i=1
αe−αTi = αNe−α
N∑i=1
Ti
⇒ ln[f (T1, T2, · · · , TN)] = [N ln(α) − α
N∑i=1
Ti ]
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 198 / 406
Maximum Likelihood and Bayes Estimation ML Estimation
Taking the derivative and equating to 0,
∂
∂αln[f (T1, T2, · · · , TN)] =
∂
∂α[N ln(α) − α
N∑i=1
Ti ]
=Nα
−N∑
i=1
Ti = 0
Solving for α gives the ML estimate
⇒ αML =1
1N
∑Ni=1 Ti
=1
T
Result: The ML estimate of arival rate for exponentially distributedsamples is the reciprocal of the sample mean arrival
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 199 / 406
Maximum Likelihood and Bayes Estimation Properties of Estimates
Properties of Estimates
Since θN is a function of RVs x1, x2, · · · , xN , estimates areRVs and wecan state the following properties:
An estimate θN is unbiased if
E{θN} = θ bias�= E{θN} − θ
θN is consistent (converges in probability) if
limN→∞
Pr{|θN − θ| < ε} = 1 for arbitrary ε
θN is efficient in comparison to other estimators if
var(θN) < var(θother)
Note: If θN is unbiased and efficient with respect to θN−1 for all N (i.e.,var(θN) converges to 0), then θN is a consistent estimate
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 200 / 406
Maximum Likelihood and Bayes Estimation Properties of Estimates
To prove the consistent estimate result, note that by the Tchebycheffinequality
Pr{|θN − θ| > ε} ≤ var(θN)
ε2
If var(θN) < var(θN−1), the above gives
limN→∞
Pr{|θN − θ| > ε} = 0
orlim
N→∞Pr{|θN − θ| < ε} = 1
That is, it converges in probability, or is consistent
QED
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 201 / 406
Maximum Likelihood and Bayes Estimation Properties of Estimates
Example
Let {xi} be WSS with uncorrelated samples. Is the sample mean aconsistent estimator for this sequence?
Step 1: Consider the bias
E{μN} = E
{1N
N∑i=1
xi
}
=1N
(Nμ) = μ
Result: μN is unbiased
Step 2: Consider the variance
var(μN) = E{
(μ − μ)2}
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 202 / 406
Maximum Likelihood and Bayes Estimation Properties of Estimates
var(μN) = E{(μ − μ)2
}
= E
⎧⎨⎩((
1N
N∑i=1
xi
)− μ
)2⎫⎬⎭
=1
N2 E
⎧⎨⎩(
N∑i=1
(xi − μ)
)2⎫⎬⎭ [assume uncorrelated]
=1
N2
N∑i=1
E{(xi − μ)2} +1
N2 E(cross terms)︸ ︷︷ ︸=0
=1
N2
N∑i=1
E{(xi − μ)2} =1
N2 (Nσ2) =σ2
N
Result: μN is unbiased and var(μN) < var(θN−1) ⇒ μN is consistent
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 203 / 406
Maximum Likelihood and Bayes Estimation Cramer-Rao Bound
Theorem (Cramer-Rao Bound (1945, 1946))
If θ is an unbiased estimate of θ, then
var(θ) ≥(
E
{(∂
∂θln[fx|θ(x|θ)]
)2})−1
or equivalently
var(θ) ≥(−E
{∂2
∂θ2 ln[fx|θ(x|θ)]
})−1
where it is assumed
∂
∂θfx|θ(x|θ) and
∂2
∂θ2 fx|θ(x|θ) exist
Note: If any estimate satisfies the bound with equality, it is an efficient(minimum variance) estimate
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 204 / 406
Maximum Likelihood and Bayes Estimation Cramer-Rao Bound
Proof:
Since θ is unbiased
E{θ − θ} =
∫ ∞
−∞(θ − θ)fx|θ(x|θ)dx = 0
Taking the derivative
∂
∂θ
∫ ∞
−∞(θ − θ)fx|θ(x|θ)dx = 0
⇒ −∫ ∞
−∞fx|θ(x|θ)dx︸ ︷︷ ︸−1
+
∫ ∞
−∞
∂fx|θ(x|θ)
∂θ(θ − θ)dx = 0
⇒∫ ∞
−∞
∂fx|θ(x|θ)
∂θ(θ − θ)dx = 1 (∗)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 205 / 406
Maximum Likelihood and Bayes Estimation Cramer-Rao Bound
Note the following equality
∂ ln[fx|θ(x|θ)]
∂θfx|θ(x|θ) =
∂fx|θ(x|θ)
∂θ
Using this in (∗) ∫ ∞
−∞
∂fx|θ(x|θ)
∂θ(θ − θ)dx = 1
⇒∫ ∞
−∞
∂ ln[fx|θ(x|θ)]
∂θfx|θ(x|θ)(θ − θ)dx = 1
This can be equivalently expressed as
(∫ ∞
−∞
(∂ ln[fx|θ(x|θ)]
∂θ
√fx|θ(x|θ)
)(√fx|θ(x|θ)(θ − θ)
)dx)2
= 1
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 206 / 406
Maximum Likelihood and Bayes Estimation Cramer-Rao Bound
Definition (Cauchy–Schwarz Inequality (1821 disc.; 1859 cont.))
Cauchy-Schwarz’s inequality states (for square–integrablecomplex–valued functions),∣∣∣∣
∫f (x)g(x) dx
∣∣∣∣2 ≤∫
|f (x)|2 dx ·∫
|g(x)|2 dx
with equality only if f (x) = k · g(x), where k is a constant
Thus(∫ ∞
−∞
(∂ ln[fx|θ(x|θ)]
∂θ
√fx|θ(x|θ)
)(√fx|θ(x|θ)(θ − θ)
)dx)2
= 1
⇒(∫ ∞
−∞
(∂ ln[fx|θ(x|θ)]
∂θ
)2
fx|θ(x|θ)dx
)(∫ ∞
−∞(θ − θ)2fx|θ(x|θ)dx
)≥ 1
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 207 / 406
Maximum Likelihood and Bayes Estimation Cramer-Rao Bound
Note ∫ ∞
−∞(θ − θ)2fx|θ(x|θ)dx = var(θ) (∗)
and∫ ∞
−∞
(∂ ln[fx|θ(x|θ)]
∂θ
)2
fx|θ(x|θ)dx = E
{(∂ ln(fx|θ(x|θ))
∂θ
)2}
(∗∗)
Thus using (∗) and (∗∗) in(∫ ∞
−∞
(∂ ln[fx|θ(x|θ)]
∂θ
)2
fx|θ(x|θ)dx
)(∫ ∞
−∞(θ − θ)2fx|θ(x|θ)dx
)≥ 1
⇒ var(θ) ≥[
E
{(∂ ln(fx|θ(x|θ))
∂θ
)2}]−1
with equality iff∂
∂θln(fx|θ(x|θ)) = k(θ − θ)
QEDK. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 208 / 406
Maximum Likelihood and Bayes Estimation Cramer-Rao Bound
Thus the bound in met iff
∂
∂θln(fx|θ(x|θ)) = k(θ − θ)
Let θ = θML in the above
∂
∂θln(fx|θ(x|θ))
∣∣∣θ=θML︸ ︷︷ ︸
= 0 by ML criteria
= k(θ − θ)∣∣∣θ=θML
Therefore, the RHS must equal zero, or
θ = θML
Result: If an efficient estimate (one that satisfies the bound withequality) exists, then it is the ML estimate
Note: If an efficient estimator doesn’t exist, then we don’t know howgood θML is
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 209 / 406
Maximum Likelihood and Bayes Estimation Bayes Estimation
Definition (Bayes Estimation)
Objective: Estimate a random parameter (RV) from observationssamples x1, x2, · · · , xn that are statistically related to y by fy |x(·)Bayes Procedure: Define a nonnegative cost function C(y , y ) and sety to minimize the expected cost, or risk
R︸︷︷︸risk
= E{C(y , y )}
Since y and y are RVs
R =
∫ ∞
−∞
∫ ∞
−∞C(y , y )fy ,x(y , x)dydx
=
∫ ∞
−∞
[∫ ∞
−∞C(y , y )fy |x(y |x)dy
]︸ ︷︷ ︸
I(y)
fx(x)dx
Note: Minimizing I(y) it is equivalent to minimize R since fx(x) ≥ 0K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 210 / 406
Maximum Likelihood and Bayes Estimation Bayes Estimation
Consider several cost functions
Case 1: Mean Squared cost function
C(y , y ) = |y − y |2
|ˆ| yy −In this case,
I(y) =
∫ ∞
−∞(y − y)2fy |x(y |x)dy
⇒ ∂I(y)
∂y= −2
∫ ∞
−∞(y − y)fy |x(y |x)dy = 0
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 211 / 406
Maximum Likelihood and Bayes Estimation Bayes Estimation
or rearranging∫ ∞
−∞y fy |x(y |x)dy =
∫ ∞
−∞yfy |x(y |x)dy [y is a constant]
⇒ yMS =
∫ ∞
−∞yfy |x(y |x)dy = E{y |x}
Example
Let xi = a + μi for i = 1, 2, · · · , N, where μi ∼ N(0, σ2) anda ∼ N(0, σ2
a) are i.i.d. Determine aMS(x).
Note
fx|a(x|a) =N∏
i=1
1√2πσ
e− (xi−a)2
2σ2 =
(1
2πσ2
)N2
e− 1
2
(N∑
i=1
(xi−a)2
σ2
)(∗)
fa(a) =1√
2πσae− a2
2σ2a (∗∗)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 212 / 406
Maximum Likelihood and Bayes Estimation Bayes Estimation
To find aMS(x) we need
aMS(x) = E{a|x}By Bayes’s theorem we can write
fa|x(a|x) =fx|a(x|a)fa(a)
fx(x)
Substituting in (∗) and (∗∗), and rearranging
fa|x(a|x) =
( 12πσ2
)N2(
1√2πσa
)e− 1
2
(N∑
i=1
(xi−a)2
σ2 + a2
σ2a
)
fx(x)
This can be compactly written as
fa|x(a|x) = C(x) exp
⎧⎨⎩− 1
2σ2p
[a − σ2
a
σ2a + σ2/N
(1N
N∑i=1
xi
)]2⎫⎬⎭
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 213 / 406
Maximum Likelihood and Bayes Estimation Bayes Estimation
Observations on
fa|x(a|x) = C(x) exp
⎧⎨⎩− 1
2σ2p
[a − σ2
a
σ2a + σ2/N
(1N
N∑i=1
xi
)]2⎫⎬⎭
= C(x) exp
{−(a − η)2
2σ2p
}
C(x) is a (normalizing) function of x onlyThe variance term is given by
σ2p =
(1σ2
a+
Nσ2
)−1
=σ2
aσ2
Nσ2a + σ2
Critical Observation: fa|x(a|x) is a Gaussian distribution!Result:
aMS = E{a|x} = η =σ2
a
σ2a + σ2/N
(1N
N∑i=1
xi
)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 214 / 406
Maximum Likelihood and Bayes Estimation Bayes Estimation
Case 2: Uniform cost function
C(y , y ) =
{0 |y − y | < ε1 else
Question: For what types of problems isthis cost function effective?
|ˆ| yy −
)ˆ,( yyC
0
1
ε
In this case,
I(y) =
∫ ∞
−∞C(y , y )fy |x(y |x)dy
=
∫|y−y |≥ε
fy |x(y |x)dy
= 1 −∫
|y−y |<ε
fy |x(y |x)dy
How do we minimize I(y)?K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 215 / 406
Maximum Likelihood and Bayes Estimation Bayes Estimation
Result: I(y) is minimized bymaximizing∫
|y−y |<ε
fy |x(y |x)dyy
∫<− ε|ˆ|
| )|(yy
y dyyf xx
)|(| xx yf y
Note: ε is arbitrarily small
⇒ I(y) is minimized when fy |x(y |x) takes its largest value
yMAP(x) = argmaxy
fy |x(y |x)
yMAP is referred to as the maximum a posteriori (MAP) estimatebecause it maximized the posterior density fy |x(y |x).
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 216 / 406
Maximum Likelihood and Bayes Estimation Bayes Estimation
Example
Let
fx ,y(x , y) =
{10y 0 ≤ y ≤ x2, 0 ≤ x ≤ 10 otherwie
Find the MS and MAP estimates of y , i.e., yMS(x) and yMAP(x).
First step: determine the posterior density fy |x(y |x).
Since fy |x(y |x) =fx,y (x ,y)
fx (x) , we need
fx (x) =
∫ ∞
−∞fx ,y(x , y)dy
=
∫ x2
010ydy
= 5y2∣∣∣x2
0= 5x4 0 ≤ x ≤ 1
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 217 / 406
Maximum Likelihood and Bayes Estimation Bayes Estimation
Thus,
fy |x(y |x) =fx ,y(x , y)
fx (x)
=10y5x4 =
2yx4 0 ≤ y ≤ x2
y
)|(| xyf xy
0
2
2
x
2x
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 218 / 406
Maximum Likelihood and Bayes Estimation Bayes Estimation
MAP estimate:
yMAP(x) = argmaxy
fy |x(y |x)
= argmaxy
2yx4 0 ≤ y ≤ x2
= x2
MS estimate:
yMS(x) = E{y |x}
=
∫ x2
0yfy |x(y |x)dy
=
∫ x2
0
2y2
x4 dy
=23
y3
x4
∣∣∣x2
0=
23
x2
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 219 / 406
Maximum Likelihood and Bayes Estimation Bayes Estimation
Note that the minimum MSE is
E{(y − yMS)2} =
∫ 1
0
∫ x2
0(y − yMS)
2fx ,y(x , y)dydx
=
∫ 1
0
∫ x2
0(y − 2
3x2)210ydydx =
5162
= 0.0309
The MSE of the MAP estimate is
E{(y − yMAP)2} =
∫ 1
0
∫ x2
0(y − yMAP)
2fx ,y(x , y)dydx
=
∫ 1
0
∫ x2
0(y − x2)210ydydx =
554
= 0.0926
Observation: This result is expected. Why?
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 220 / 406
Maximum Likelihood and Bayes Estimation Bayes Estimation
Observation: MAP estimation can be used as an extension of MLestimation if some variability is assumed
Instead of an unknown constant θ, we have an unknown randomparameter with distribution fθ(θ)
To see this, note
fθ|x(θ|x) =fx|θ(x|θ)fθ(θ)
fx(x)
The MAP estimate maximizes the numerator since fx(x) is not afunction of θ,
θMAP = argmaxθ
fx|θ(x|θ)fθ(θ)
Question: For what distribution fθ(θ) does θMAP = θML? That is
θMAP = argmaxθ
fx|θ(x|θ)fθ(θ)?= argmax
θfx|θ(x|θ) = θML
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 221 / 406
Maximum Likelihood and Bayes Estimation Bayes Estimation
Example
Let x(n) = A + μ(n) for n = 1, 2, · · · , N, where μ(n) ∼ N(0, σ2μ) and
A ∼ N(A0, σ2A) are i.i.d.
Determine the MAP estimate of A.
Need to maximize fx|A(x|A)fA(A), or
AMAP = argmaxA
[ln(fx|A(x|A)) + ln(fA(A))
]Note
ln(fx|A(x|A)) =N2
ln
(1
2πσ2μ
)−
N∑n=0
(x(n) − A)2
2σ2μ
and
ln(fA(A)) =12
ln
(1
2πσ2A
)− (A − A0)
2
2σ2A
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 222 / 406
Maximum Likelihood and Bayes Estimation Bayes Estimation
Thus
AMAP = argminA
(1
2σ2μ
N∑n=1
(x(n) − A)2 +(A − A0)
2
2σ2A
)
Differentiating we get
− 1σ2
μ
N∑n=1
(x(n) − A) +(A − A0)
σ2A
∣∣∣∣∣A=AMAP
= 0
⇒N∑
n=1
x(n)
σ2μ
− NAMAP
σ2μ
=AMAP
σ2A
− A0
σ2A
⇒ AMAP
(1σ2
A
+Nσ2
μ
)=
1σ2
μ
N∑n=1
x(n) +A0
σ2A
⇒ AMAP =1
1σ2
A+ N
σ2μ
(1σ2
μ
N∑n=1
x(n) +A0
σ2A
)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 223 / 406
Maximum Likelihood and Bayes Estimation Bayes Estimation
AMAP =1
1σ2
A+ N
σ2μ
(1σ2
μ
N∑n=1
x(n) +A0
σ2A
)
Note that if σ2A → ∞ then there is no a priori information and
limσ2
A→∞AMAP =
1N
N∑n=1
x(n) = AML
21Aσ
22Aσ
2212 AA σσ > Observation: As fθ(θ) flattens out
θMAP → θML
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 224 / 406
Maximum Likelihood and Bayes Estimation Bayes Estimation
Case 3: The absolute cost function
C(y , y ) = |y − y |Question: For what types of problems is this costfunction effective?
|ˆ| yy −
)ˆ,( yyC
0In this case
I(y) =
∫ ∞
−∞C(y , y )fy |x(y |x)dy
=
∫y<y
(y − y)fy |x(y |x)dy +
∫y≥y
(y − y)fy |x(y |x)dy
=
∫ y
−∞(y − y)fy |x(y |x)dy +
∫ ∞
y(y − y)fy |x(y |x)dy
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 225 / 406
Maximum Likelihood and Bayes Estimation Bayes Estimation
I(y) =
∫ y
−∞(y − y)fy |x(y |x)dy +
∫ ∞
y(y − y)fy |x(y |x)dy
Note that∫ y
−∞(y − y)fy |x(y |x)dy = yFy |x(y |x) −
∫ y
−∞yfy |x(y |x)dy
and similarly∫ ∞
y(y − y)fy |x(y |x)dy =
∫ ∞
yyfy |x(y |x)dy − y(1 − Fy |x(y |x))
Thus,
I(y) = yFy |x(y |x) −∫ y
−∞yfy |x(y |x)dy
−y(1 − Fy |x(y |x)) +
∫ ∞
yyfy |x(y |x)dy
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 226 / 406
Maximum Likelihood and Bayes Estimation Bayes Estimation
I(y) = yFy |x(y |x) −∫ y
−∞yfy |x(y |x)dy
−y(1 − Fy |x(y |x)) +
∫ ∞
yyfy |x(y |x)dy
Taking the derivative
∂I(y)
∂y= Fy |x(y |x) + y fy |x(y |x) − y fy |x(y |x)
−(1 − Fy |x(y |x)) − y fy |x(y |x) + y fy |x(y |x)
= Fy |x(y |x) − (1 − Fy |x(y |x))
Result: Setting equal to 0, we see the yMAE is given by
Fy |x(yMAE|x) = 1 − Fy |x(yMAE|x)
or ∫ yMAE
−∞fy |x(y |x)dy =
∫ ∞
yMAE
fy |x(y |x)dy
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 227 / 406
Maximum Likelihood and Bayes Estimation Bayes Estimation
∫ yMAE
−∞fy |x(y |x)dy =
∫ ∞
yMAE
fy |x(y |x)dy
Interpreting this graphically
)|(| xx yf y
Area A Area BMAEy
Area A = Area B
Observation:yMAE = median of fy |x(y |x)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 228 / 406
Maximum Likelihood and Bayes Estimation Bayes Estimation
Estimator Relations
If fy |x(y |x) is symmetric, then
yMAE = yMS
Why? For a symmetric distribution the conditional mean is equalto the (median) symmetry point
If fy |x(y |x) is symmetric and unimodal, then
yMAE = yMS = yMAP
Why? The unimodal constraint implies that the single mode mustbe at the distribution symmetry point ⇒ the MAP estimate islocated at the central point
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 229 / 406
Maximum Likelihood and Bayes Estimation Bayes Estimation
Example
Determine yMAE for the previously considered case
fx ,y(x , y) =
{10y 0 ≤ y ≤ x2, 0 ≤ x ≤ 10 otherwise
We showed previously that
fy |x(y |x) =2yx4 0 ≤ y ≤ x2 ⇒ Fy |x(y |x) =
y2
x4 0 ≤ y ≤ x2
Thus determining yMAE
Fy |x(yMAE|x) = 1 − Fy |x(yMAE|x)
⇒ y2MAE
x4 = 1 − y2MAE
x4
⇒ yMAE =x2√
2
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 230 / 406
Maximum Likelihood and Bayes Estimation Bayes Estimation
MAP estimate: (previous result)
yMAP(x) = argmaxy
fy |x(y |x) = x2
MS estimate: (previous result)
yMS(x) = E{y |x} =23
x2
MAE estimate:
yMAE(x) = median of fy |x(y |x) =x2√
2
y
)|(| xyf xy
0
2
2
x
2x
2
2x
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 231 / 406
Maximum Likelihood and Bayes Estimation Bayes Estimation
Final ML and MAP Comments
ML estimation was pioneered by geneticist and statistician Sir R.A. Fisher between 1912 and 1922Under fairly weak regularity conditions the ML estimate isasymptotically optimal
The ML estimate is asymptotically unbiased, i.e., its bias tends tozero as the number of samples increases to infinityThe ML estimate is asymptotically efficient, i.e., it achieves theCramér-Rao lower bound when the number of samples tends toinfinityConsequence: No unbiased estimator has lower mean squarederror than the ML estimatorThe ML estimate is asymptotically normal, i.e., as the number ofsamples increases, the distribution of the ML estimate tends to theGaussian distribution
MAP estimation is a generalization of ML estimation thatincorporates the prior distribution of the quantity being estimated
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 232 / 406
Wiener (MSE) Filtering Theory Wiener Filtering
Problem Statement
Produce an estimate of a desired process statistically related to a setof observations
filterx(n) d(n)input y(n)
output
e(n) estimateerror
desiredresponse
+_
Historical Notes: The linear filtering problem was solved byAndrey Kolmogorov for discrete time – his 1938 paper“established the basic theorems for smoothing and predictingstationary stochastic processes”Norbert Wiener in 1941 for continuous time – not published untilthe 1949 paper Extrapolation, Interpolation, and Smoothing ofStationary Time Series
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 234 / 406
Wiener (MSE) Filtering Theory Wiener Filtering
filterx(n) d(n)input y(n)
output
e(n) estimateerror
desiredresponse
+_
System restrictions and considerations:
Filter is linear
Filter is discrete time
Filter is finite impulse response (FIR)
The process is WSS
Statistical optimization is employed
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 235 / 406
Wiener (MSE) Filtering Theory Wiener Filtering
For the discrete time case
z-1x(n) …
e(n)
z-1 z-1
d(n)
∗0w ∗
1w ∗−1Mw
…… +-)(ˆ nd
The filter impulse response is finite and given by
hk =
{w∗
k for k = 0, 1, · · · , M − 10 otherwise
The output d(n) is an estimate of the desired signal d(n)x(n) and d(n) are statistically related ⇒ d(n) and d(n) arestatistically related
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 236 / 406
Wiener (MSE) Filtering Theory Wiener Filtering
In convolution and vector form
d(n) =M−1∑k=0
w∗k x(n − k) = wHx(n)
where
w = [w0, w1, · · · , wM−1]T [filter coefficient vector]
x = [x(n), x(n − 1), · · · , x(n − M + 1)]T [observation vector]
The error can now be written as
e(n) = d(n) − d(n) = d(n) − wHx(n)
Question: Under what criteria should the error be minimized?
Selected Criteria: Mean squared-error (MSE)
J(w) = E{e(n)e∗(n)} (∗)Result: The w that minimizes J(w) is the optimal (Wiener) filter
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 237 / 406
Wiener (MSE) Filtering Theory Wiener Filtering
Utilizing e(n) = d(n) − wHx(n) in (∗) and expanding,
J(w) = E{e(n)e∗(n)}= E{(d(n) − wHx(n))(d∗(n) − xH(n)w)}= E{|d(n)|2 − d(n)xH(n)w − wHx(n)d∗(n)
+wHx(n)xH(n)w}= E{|d(n)|2} − E{d(n)xH(n)}w − wHE{x(n)d ∗(n)}
+wHE{x(n)xH(n)}w (∗∗)Let
R = E{x(n)xH(n)} [autocorrelation of x(n)]
p = E{x(n)d ∗(n)} [cross correlation between x(n) and d(n)]
Then (∗∗) can be compactly expressed as
J(w) = σ2d − pHw − wHp + wHRw
where we have assumed x(n) & d(n) are zero mean, WSSK. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 238 / 406
Wiener (MSE) Filtering Theory Wiener Filtering
The MSE criteria as a function of the filter weight vector w
J(w) = σ2d − pHw − wHp + wHRw
Observation: The error is a quadratic function of w
Consequences: The error is an M–dimensional bowl–shaped functionof w with a unique minimum
Result: The optimal weight vector, w0, is determined by differentiatingJ(w) and setting the result to zero
∇wJ(w)|w=w0 = 0
A closed form solution exists
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 239 / 406
Wiener (MSE) Filtering Theory Wiener Filtering
Example
Consider a two dimensional case, i.e., a M = 2 tap filter. Plot the errorsurface and error contours.
Error Surface Error Contours
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 240 / 406
Wiener (MSE) Filtering Theory Wiener Filtering
Aside (Matrix Differentiation): For complex data,
wk = ak + jbk , k = 0, 1, · · · , M − 1
the gradient, with respect to wk , is
∇k (J) =∂J∂ak
+ j∂J∂bk
, k = 0, 1, · · · , M − 1
The complete gradient is thus given by
∇w(J) =
⎡⎢⎢⎢⎣
∇0(J)∇1(J)
...∇M−1(J)
⎤⎥⎥⎥⎦ =
⎡⎢⎢⎢⎢⎣
∂J∂a0
+ j ∂J∂b0
∂J∂a1
+ j ∂J∂b1
...∂J
∂aM−1+ j ∂J
∂bM−1
⎤⎥⎥⎥⎥⎦
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 241 / 406
Wiener (MSE) Filtering Theory Wiener Filtering
Example
Let c and w be M × 1 complex vectors. For g = cHw, find ∇w(g)
Note
g = cHw =M−1∑k=0
c∗k wk =
M−1∑k=0
c∗k (ak + jbk )
Thus
∇k (g) =∂g∂ak
+ j∂g∂bk
= c∗k + j(jc∗
k ) = 0, k = 0, 1, · · · , M − 1
Result: For g = cHw
∇w(g) =
⎡⎢⎢⎢⎣
∇0(g)∇1(g)
...∇M-1(g)
⎤⎥⎥⎥⎦ =
⎡⎢⎢⎢⎣
00...0
⎤⎥⎥⎥⎦ = 0
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 242 / 406
Wiener (MSE) Filtering Theory Wiener Filtering
Example
Now suppose g = wHc. Find ∇w(g)
In this case,
g = wHc =M−1∑k=0
w∗k ck =
M−1∑k=0
ck (ak − jbk )
and
∇k (g) =∂g∂ak
+ j∂g∂bk
= ck + j(−jck ) = 2ck , k = 0, 1, · · · , M − 1
Result: For g = wHc
∇w(g) =
⎡⎢⎢⎢⎣
∇0(g)∇1(g)
...∇M-1(g)
⎤⎥⎥⎥⎦ =
⎡⎢⎢⎢⎣
2c0
2c1...
2cM−1
⎤⎥⎥⎥⎦ = 2c
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 243 / 406
Wiener (MSE) Filtering Theory Wiener Filtering
Example
Lastly, suppose g = wHQw. Find ∇w(g)
In this case,
g =M−1∑i=0
M−1∑j=0
w∗i wjqi ,j
=M−1∑i=0
M−1∑j=0
(ai − jbi)(aj + jbj)qi ,j
⇒ ∇k (g) =∂g∂ak
+ j∂g∂bk
= 2M−1∑j=0
(aj + jbj)qk ,j + 0
= 2M−1∑j=0
wjqk ,j
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 244 / 406
Wiener (MSE) Filtering Theory Wiener Filtering
Result: For g = wHQw
∇w(g) =
⎡⎢⎢⎢⎣
∇0(g)∇1(g)
...∇M-1(g)
⎤⎥⎥⎥⎦ = 2
⎡⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎣
M−1∑i=0
q0,iwi
M−1∑i=0
q1,iwi
...M-1∑i=0
qM−1,iwi
⎤⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎦
= 2Qw
Observation: Differentiation result depends on matrix ordering
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 245 / 406
Wiener (MSE) Filtering Theory Wiener Filtering
Returning to the MSE performance criteria
J(w) = σ2d − pHw − wHp + wHRw
Approach: Minimize error by differentiating with respect to w and setresult to 0
∇w(J) = 0 − 0 − 2p + 2Rw
= 0
⇒ Rw0 = p [normal equation]
Result: The Wiener filter coefficients are defined by
w0 = R−1p
Question: Does R−1 always exist? Recall R is positive semi-definite,and usually positive definite
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 246 / 406
Wiener (MSE) Filtering Theory Orthogonality Principle
Orthogonality Principle
Consider again the normal equation that defines the optimal solution
Rw0 = p
⇒ E{x(n)xH(n)}w0 = E{x(n)d ∗(n)}
Rearranging
E{x(n)d ∗(n)} − E{x(n)xH(n)}w0 = 0
E{x(n)[d ∗(n) − xH(n)w0]} = 0
E{x(n)e∗0(n)} = 0
Note: e∗0(n) is the error when the optimal weights are used, i.e.,
e∗0(n) = d∗(n) − xH(n)w0
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 247 / 406
Wiener (MSE) Filtering Theory Orthogonality Principle
Thus
E{x(n)e∗0(n)} = E
⎡⎢⎢⎢⎣
x(n)e∗0(n)
x(n − 1)e∗0(n)
...x(n − M + 1)e∗
0(n)
⎤⎥⎥⎥⎦ =
⎡⎢⎢⎢⎣
00...0
⎤⎥⎥⎥⎦
Orthogonality Principle
A necessary and sufficient condition for a filter to be optimal is that theestimate error, e∗(n), be orthogonal to each input sample in x(n)
Interpretation: The observations samples and error are orthogonal andcontain no mutual “information”
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 248 / 406
Wiener (MSE) Filtering Theory Minimum MSE
Objective: Determine the minimum MSE
Approach: Use the optimal weights w0 = R−1p in the MSE expression
J(w) = σ2d − pHw − wHp + wHRw
⇒ Jmin = σ2d − pHw0 − wH
0 p + wH0 R(R−1p)
= σ2d − pHw0 − wH
0 p + wH0 p
= σ2d − pHw0
Result:Jmin = σ2
d − pHR−1p
where the substitution w0 = R−1p has been employed
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 249 / 406
Wiener (MSE) Filtering Theory Excess MSE
Objective: Consider the excess MSE introduced by using a weightedvector that is not optimal.
J(w)−Jmin = (σ2d −pHw−wHp+wHRw)−(σ2
d −pHw0−wH0 p+wH
0 Rw0)
Using the fact that
p = Rw0 and pH = wH0 R
yields
J(w) − Jmin = −pHw − wHp + wHRw + pHw0 + wH0 p − wH
0 Rw0
= −wH0 Rw − wHRw0 + wHRw + wH
0 Rw0
+ wH0 Rw0 − wH
0 Rw0
= −wH0 Rw − wHRw0 + wHRw + wH
0 Rw0
= (w − w0)HR(w − w0)
⇒ J(w) = Jmin + (w − w0)HR(w − w0)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 250 / 406
Wiener (MSE) Filtering Theory Excess MSE
Finally, using the eigenvalue and vector representation R = QΩQH
J(w) = Jmin + (w − w0)HQΩQH(w − w0)
or defining the eigenvector transformed difference
v = QH(w − w0) (∗)⇒ J(w) = Jmin + vHΩv
= Jmin +M∑
k=1
λkvkv∗k
Result:
J(w) = Jmin +M∑
k=1
λk |vk |2
Note: (∗) shows that vk is the difference (w − w0) projected ontoeigenvector qk
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 251 / 406
Wiener (MSE) Filtering Theory Communications Example
Example
Consider the following system
ARsystemv1(n) u(n)
d(n) x(n)
Noisy transformedobservation
(noise)
Communicationchanel
v2(n)
Desiredsignal
FIRfilteru(n) )(ˆ nd
Objective: Determine the optimal filter for a given system and channel
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 252 / 406
Wiener (MSE) Filtering Theory Communications Example
Specific Objective
Determine the optimal order two filter weights, w0, for
H1(z) =1
1 + 0.8458z−1 [AR process]
H2(z) =1
1 − 0.9458z−1 [communication channel]
where v1(n) and v2(n) zero mean white noise processes withσ2
1 = 0.27 and σ22 = 0.1
Note: To determine w0, we need:
Ru — auto–correlation of the received signal
p — the cross correlation between received signal u(n) and thedesired signal d(n)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 253 / 406
Wiener (MSE) Filtering Theory Communications Example
ARsystemv1(n) u(n)
d(n) x(n)
Noisy transformedobservation
(noise)
Communicationchanel
v2(n)
Desiredsignal
FIRfilteru(n) )(ˆ nd
Procedure: Consider Ru first
Since u(n) = x(n) + v2(n), where v2(n) is white with σ22 = 0.1 and is
uncorrelated with x(n)
Ru = Rx + Rv2 = Rx +
[0.1 00 0.1
]
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 254 / 406
Wiener (MSE) Filtering Theory Communications Example
ARsystemv1(n) u(n)
d(n) x(n)
Noisy transformedobservation
(noise)
Communicationchanel
v2(n)
Desiredsignal
FIRfilteru(n) )(ˆ nd
Next, consider the relation between x(n) and v1(n)
X (z) = H1(z)H2(z)V1(z)
where for the give systems
H1(z)H2(z) =1
(1 + 0.8458z−1)(1 − 0.9458z−1)
Converting to the time domain, we see x(n) is an order 2 AR process
x(n) − 0.1x(n − 1) − 0.8x(n − 2) = v1(n)
x(n) + a1x(n − 1) + a2x(n − 2) = v1(n)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 255 / 406
Wiener (MSE) Filtering Theory Communications Example
Since x(n) is a real valued order two AR process, the Yule-Walkerequations are given by[
r(0) r(1)r∗(1) r(0)
] [ −a1
−a2
]=
[r∗(1)r∗(2)
][
r(0) r(1)r(1) r(0)
] [ −a1
−a2
]=
[r(1)r(2)
]Solving for the coefficients:
−a1 =r(1)[r(0) − r(2)]
r2(0) − r2(1)
− a2 =r(0)r(2) − r2(1)
r2(0) − r2(1)
Question: What are the known and unknown terms in this system?
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 256 / 406
Wiener (MSE) Filtering Theory Communications Example
Note: We must solve the system to obtain the unknown r(·) values
Noting r(0) = σ2x and rearranging to solve for r(1) and r(2)
r(1) =−a1
1 + a2σ2
x (∗)
r(2) =
(−a2 +
a21
1 + a2
)σ2
x (∗∗)
The Yule-Walker equations also stipulate
σ2v1
= r(0) + a1r(1) + a2r(2) (∗ ∗ ∗)
Next, utilize the determined r(1), r(2) and given a1, a2, σ2v1
values todetermine r(0) = σ2
x
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 257 / 406
Wiener (MSE) Filtering Theory Communications Example
Substituting (∗) and (∗∗) into (∗ ∗ ∗), utilize r(0) = σ2x , and rearranging
σ2v1
= r(0) + a1r(1) + a2r(2)
= σ2x + a1
( −a1
1 + a2
)σ2
x + a2
(−a2 +
a21
1 + a2
)σ2
x
=
(1 +
−a21
1 + a2− a2
2 +a2
1a2
1 + a2
)σ2
x
⇒ σ2x =
(1 + a2
1 − a2
)σ2
v1
(1 + a2)2 − a21
Note: All correlation terms have been determined since r(0) = σ2x
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 258 / 406
Wiener (MSE) Filtering Theory Communications Example
Using a1 = −0.1, a2 = −0.8, and σ2v1
= 0.27,
σ2x =
(1 + a2
1 − a2
)σ2
v1
(1 + a2)2 − a21
= 1 =
Thus r(0) = 1. Similarly
r(1) =−a1
1 + a2σ2
x = 0.5
and finally, Rx is
Rx =
[1 0.5
0.5 1
]
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 259 / 406
Wiener (MSE) Filtering Theory Communications Example
Recall the overall system
ARsystemv1(n) u(n)
d(n) x(n)
Noisy transformedobservation
(noise)
Communicationchanel
v2(n)
Desiredsignal
FIRfilteru(n) )(ˆ nd
Putting the pieces of Ru together
Ru = Rx + Rv2
=
[1 0.5
0.5 1
]+
[0.1 00 0.1
]
=
[1.1 0.50.5 1.1
]
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 260 / 406
Wiener (MSE) Filtering Theory Communications Example
Recall: The Wiener solution is given by w0 = R−1p
Note: R is known, but p is still to be determined
p = E{[
d(n)x(n)d(n)u(n − 1)
]}
Recall
X (z) = H2(z)D(z) =D(z)
1 − 0.9458z−1
or in the time domain
x(n) − 0.9458x(n − 1) = d(n)
Lastly, the observation is corrupted by additive noise
u(n) = x(n) + v2(n)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 261 / 406
Wiener (MSE) Filtering Theory Communications Example
Thus
E{u(n)d(n)} = E{[x(n) + v2(n)][x(n) − 0.9458x(n − 1)]}= E{x2(n)} + E{x(n)v2(n)} − 0.9458E{x(n)x(n − 1)}
−0.9458E{v2(n)x(n − 1)}= σ2
x + 0 − 0.9458r(1) − 0
= 1 − 0.9458(
12
)= 0.5272
Similarly,
E{u(n − 1)d(n)} = E{[x(n − 1) + v2(n − 1)][x(n) − 0.9458x(n − 1)]}= r(1) − 0.9458r(0)
= −0.4458
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 262 / 406
Wiener (MSE) Filtering Theory Communications Example
Thus
p =
[0.5272−0.4458
]Final Solution:
w0 = R−1p
=
[1.1 0.50.5 1.1
]−1 [0.5272−0.4458
]
=
[0.8360−0.7853
]
The optimal filter weights are a function of the source signalstatistics and the communications channel
Question: How do we optimized a filter of when the statistics arenot known in closed form or a priori?
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 263 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Motivation
Adaptive Optimization and Filtering Methods
Motivation
Adaptive optimization and filtering methods are appropriate,advantageous, or necessary when:
Signal statistics are not known a priori and must be “learned” fromobserved or representative samples
Signal statistics evolve over time
Time or computational restrictions dictate that simple, if repetitive,operations be employed rather than solving more complex, closedform expressionsTo be considered are the following algorithms:
Steepest Descent (SD) – deterministicLeast Means Squared (LMS) – stochasticRecursive Least Squares (RLS) – deterministic
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 265 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Steepest Descent
Definition (Steepest Descent (SD))
Steepest descent, also known as gradient descent, it is an iterativetechnique for finding the local minimum of a function.
Approach: Given an arbitrary starting point, the current location (value)is moved in steps proportional to the negatives of the gradient at thecurrent point.
SD is an old, deterministic method, that is the basis for stochasticgradient based methodsSD is a feedback approach to finding local minimum of an errorperformance surfaceThe error surface must be known a prioriIn the MSE case, SD converges converges to the optimal solution,w0 = R−1p, without inverting a matrix
Question: Why in the MSE case does this converge to the globalminimum rather than a local minimum?
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 266 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Steepest Descent
Example
Consider a well structured cost function with a single minimum. Theoptimization proceeds as follows:
Contour plot showing that evolution of the optimization
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 267 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Steepest Descent
Example
Consider a gradient ascent example in which there are multipleminima/maxima
Surface plot showing the multipleminima and maxima Contour plot illustrating that the final
result depends on starting value
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 268 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Steepest Descent
To derive the approach, consider the FIR case:
{x(n)} – the WSS input samples{d(n)} – the WSS desired output{d(n)} – the estimate of the desired signal given by
d(n) = wH(n)x(n)
where
x(n) = [x(n), x(n − 1), · · · , x(n − M + 1)]T [obs. vector]
w(n) = [w0(n), w1(n), · · · , wM-1(n)]T [time indexed filter coefs.]
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 269 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Steepest Descent
Then similarly to previously considered cases
e(n) = d(n) − d(n) = d(n) − wH(n)x(n)
and the MSE at time n is
J(n) = E{|e(n)|2}= σ2
d − wH(n)p − pHw(n) + wH(n)Rw(n)
where
σ2d – variance of desired signal
p – cross-correlation between x(n) and d(n)
R – correlation matrix of x(n)
Note: The weight vector and cost function are time indexed (functionsof time)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 270 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Steepest Descent
When w(n) is set to the (optimal) Wiener solution,
w(n) = w0 = R−1p
andJ(n) = Jmin = σ2
d − pHw0
Use the method of steepest descent to iteratively find w0.
The optimal result is achieved since the cost function is a secondorder polynomial with a single unique minimum
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 271 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Steepest Descent
Example
Let M = 2. The MSE is a bowl–shaped surface, which is a function ofthe 2-D space weight vector w(n)
w0
w
J(w)
J(w)
w1
w2
Surface Plot
w0
w2
w1
)(J∇−
2w
J
∂∂−
1w
J
∂∂−
Contour Plot
Imagine dropping a marble at any point on the bowl-shaped surface.The ball will reach the minimum point by going through the path ofsteepest descent.
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 272 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Steepest Descent
Observation: Set the direction of filter update as: −∇J(n)
Resulting Update:
w(n + 1) = w(n) +12
μ[−∇J(n)]
or, since ∇J(n) = −2p + 2Rw(n)
w(n + 1) = w(n) + μ[p − Rw(n)] n = 0, 1, 2, · · ·
where w(0) = 0 (or other appropriate value) and μ is the step size
Observation: SD uses feedback, which makes it possible for thesystem to be unstable
Bounds on the step size guaranteeing stability can be determinedwith respect to the eigenvalues of R (Widrow, 1970)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 273 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Convergence Analysis
Convergence Analysis
Define the error vector for the tap weights as
c(n) = w(n) − w0
Then using p = Rw0 in the update,
w(n + 1) = w(n) + μ[p − Rw(n)]
= w(n) + μ[Rw0 − Rw(n)]
= w(n) − μRc(n)
and subtracting w0 from both sides
w(n + 1) − w0 = w(n) − w0 − μRc(n)
⇒ c(n + 1) = c(n) − μRc(n)
= [I − μR]c(n)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 274 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Convergence Analysis
Using the Unitary Similarity Transform
R = QΩQH
we have
c(n + 1) = [I − μR]c(n)
= [I − μQΩQH ]c(n)
⇒ QHc(n + 1) = [QH − μQHQΩQH ]c(n)
= [I − μΩ]QHc(n) (∗)Define the transformed coefficients as
v(n) = QHc(n)
= QH(w(n) − w0)
Then (∗) becomesv(n + 1) = [I − μΩ]v(n)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 275 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Convergence Analysis
Consider the initial condition of v(n)
v(0) = QH(w(0) − w0)
= −QHw0 [if w(0) = 0]
Consider the k th term (mode) in
v(n + 1) = [I − μΩ]v(n)
Note [I − μΩ] is diagonalThus all modes are independently updatedThe update for the k th term can be written as
vk (n + 1) = (1 − μλk )vk (n) k = 1, 2, · · · , M
or using recursion
vk (n) = (1 − μλk )nvk (0)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 276 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Convergence Analysis
Observation: Conversion to the optimal solution requires
limn→∞w(n) = w0
⇒ limn→∞ c(n) = lim
n→∞w(n) − w0 = 0
⇒ limn→∞ v(n) = lim
n→∞QHc(n) = 0
⇒ limn→∞ vk (n) = 0 k = 1, 2, · · · , M (∗)
Result: According to the recursion
vk (n) = (1 − μλk )nvk (0)
the limit in (∗) holds if and only if
|1 − μλk | < 1 for all k
Thus since the eigenvalues are nonnegative, 0 < μλmax < 2, or
0 < μ <2
λmax
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 277 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Convergence Analysis
Observation: The k th mode has geometric decay
vk (n) = (1 − μλk )nvk (0)
The rate of decay it is characterized by the time it takes to decayto e−1 of the initial value
Let τk denote this time for the k th mode
vk (τk ) = (1 − μλk )τk vk (0) = e−1vk (0)
⇒ e−1 = (1 − μλk )τk
⇒ τk =−1
ln(1 − μλk )≈ 1
μλkfor μ � 1
Result: The overall rate of decay is
−1ln(1 − μλmax)
≤ τ ≤ −1ln(1 − μλmin)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 278 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Convergence Analysis
Example
Consider the typical behavior of a single mode
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 279 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Convergence Analysis
Error Analysis
Recall that
J(n) = Jmin + (w(n) − w0)HR(w(n) − w0)
= Jmin + (w(n) − w0)HQΩQH(w(n) − w0)
= Jmin + v(n)HΩv(n)
= Jmin +M∑
k=1
λk |vk (n)|2 [sub in vk (n) = (1 − μλk )nvk (0)]
= Jmin +M∑
k=1
λk (1 − μλk )2n|vk (0)|2
Result: If 0 < μ < 2λmax
, then
limn→∞ J(n) = Jmin
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 280 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: Predictor
Example
Consider a two–tap predictor for real–valued input
Analyzed the effects of the following cases:
Varying the eigenvalue spread χ(R) = λmaxλmin
while keeping μ fixed
Varying μ and keeping the eigenvalue spread χ(R) fixed
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 281 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: Predictor
SD loci plots (with shown J(n) contours) as a function of [v1(n), v2(n)]for step-size μ = 0.3
Eigenvalue spread: χ(R) = 1.22
Small eigenvalue spread ⇒modes converge at a similar rate
Eigenvalue spread: χ(R) = 3
Moderate eigenvalue spread ⇒modes converge at moderatelysimilar rates
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 282 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: Predictor
SD loci plots (with shown J(n) contours) as a function of [v1(n), v2(n)]for step-size μ = 0.3
Eigenvalue spread: χ(R) = 10
Large eigenvalue spread ⇒modes converge at different rates
Eigenvalue spread: χ(R) = 100
Very large eigenvalue spread ⇒modes converge at very differentrates
Principle direction convergence isfastest
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 283 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: Predictor
SD loci plots (with shown J(n) contours) as a function of [w1(n), w2(n)]for step-size μ = 0.3
Eigenvalue spread: χ(R) = 1.22
Small eigenvalue spread ⇒modes converge at a similar rate
Eigenvalue spread: χ(R) = 3
Moderate eigenvalue spread ⇒modes converge at moderatelysimilar rates
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 284 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: Predictor
SD loci plots (with shown J(n) contours) as a function of [w1(n), w2(n)]for step-size μ = 0.3
Eigenvalue spread: χ(R) = 10
Large eigenvalue spread ⇒modes converge at different rates
Eigenvalue spread: χ(R) = 100
Very large eigenvalue spread ⇒modes converge at very differentrates
Principle direction convergence isfastest
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 285 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: Predictor
Learning curves of steepest-descent algorithm with step-sizeparameter μ = 0.3 and varying eigenvalue spread.
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 286 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: Predictor
SD loci plots (with shown J(n) contours) as a function of [v1(n), v2(n)]with χ(R) = 10 and varying step–sizes
Step–sizes: μ = 0.3
This is over–damped ⇒ slowconvergence
Step–sizes: μ = 1
This is under–damped ⇒ fast(erratic) convergence
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 287 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: Predictor
SD loci plots (with shown J(n) contours) as a function of [w1(n), w2(n)]with χ(R) = 10 and varying step–sizes
Step–sizes: μ = 0.3
This is over–damped ⇒ slowconvergence
Step–sizes: μ = 1
This is under–damped ⇒ fast(erratic) convergence
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 288 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: Predictor
Example
Consider a system identification problem
w(n) system
{x(n)}
)(ˆ nd d(n)+_
e(n)
Suppose M = 2 and
Rx =
[1 0.8
0.8 1
]P =
[0.80.5
]
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 289 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: Predictor
From eigen analysis we have
λ1 = 1.8, λ2 = 0.2 ⇒ μ <2
1.8
also
q1 =1√2
[11
]q2 =
1√2
[1−1
]and
Q =1√2
[1 11 −1
]Also,
w0 = R−1p =
[1.11
−0.389
]
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 290 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: Predictor
Thusv(n) = QH [w(n) − w0]
Noting that
v(0) = −QHw0 = − 1√2
[1 11 −1
] [1.11
−0.389
]=
[0.511.06
]
andv1(n) = (1 − μ(1.8))n0.51
v2(n) = (1 − μ(0.2))n1.06
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 291 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: Predictor
SD convergence properties for two μ values
Step–sizes: μ = 0.5
This is over–damped ⇒ slowconvergence
Step–sizes: μ = 1
This is under–damped ⇒ fast(erratic) convergence
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 292 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Least Mean Squares (LMS)
Least Mean Squares (LMS)
Definition (Least Mean Squares (LMS) Algorithm)
Motivation: The error performance surface used by the SD method isnot always known a priori
Solution: Use estimated values. We will use the followinginstantaneous estimates
R(n) = x(n)xH(n)
p(n) = x(n)d ∗(n)
Result: The estimates are RVs and thus this leads to a stochasticoptimization
Historical Note: Invented in 1960 by Stanford University professorBernard Widrow and his first Ph.D. student, Ted Hoff
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 293 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Least Mean Squares (LMS)
Recall the SD update
w(n + 1) = w(n) +12μ[−∇(J(n))]
where the gradient of the error surface at w(n) was shown to be
∇(J(n)) = −2p + 2Rw(n)
Using the instantaneous estimates,
∇(J(n)) = −2x(n)d ∗(n) + 2x(n)xH(n)w(n)
= −2x(n)[d ∗(n) − xH(n)w(n)]
= −2x(n)[d ∗(n) − d∗(n)]
= −2x(n)e∗(n)
where e∗(n) is the complex conjugate of the estimate error.
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 294 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Least Mean Squares (LMS)
Utilizing ∇(J(n)) = −2x(n)e∗(n) in the update
w(n + 1) = w(n) +12μ[−∇(J(n))]
= w(n) + μx(n)e∗(n) [LMS Update]
The LMS algorithm belongs to the family of stochastic gradientalgorithms
The update is extremely simple
Although the instantaneous estimates may have large variance,the LMS algorithm is recursive and effectively averages theseestimates
The simplicity and good performance of the LMS algorithm makeit the benchmark against which other optimization algorithms arejudged
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 295 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Convergence Analysis
Convergence Analysis
Independence Theorem
The following conditions hold:1 The vectors x(1), x(2), · · · , x(n) are statistically independent2 x(n) is independent of d(1), d(2), · · · , d(n − 1)
3 d(n) is statistically dependent on x(n), but is independent ofd(1), d(2), · · · , d(n − 1)
4 x(n) and d(n) are mutually Gaussian
The independence theorem is invoked in the LMS algorithmanalysisThe independence theorem is justified in some cases, e.g.,beamforming where we receive independent vector observationsIn other cases it is not well justified, but allows the analysis toproceeds (i.e., when all else fails, invoke simplifying assumptions)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 296 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Convergence Analysis
We will invoke the independence theorem to show that w(n) convergesto the optimal solution in the mean
limn→∞ E{w(n)} = w0
To prove this, evaluate the update
w(n + 1) = w(n) + μx(n)e∗(n)
⇒ w(n + 1) − w0 = w(n) − w0 + μx(n)e∗(n)
⇒ c(n + 1) = c(n) + μx(n)(d ∗(n) − xH(n)w(n))
= c(n) + μx(n)d ∗(n) − μx(n)xH(n)[w(n) − w0 + w0]
= c(n) + μx(n)d ∗(n) − μx(n)xH(n)c(n)
−μx(n)xH(n)w0
= [I − μx(n)xH(n)]c(n) + μx(n)[d∗(n) − xH(n)w0]
= [I − μx(n)xH(n)]c(n) + μx(n)e∗0(n)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 297 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Convergence Analysis
Take the expectation of the update noting thatw(n) is based on past inputs and desired values
⇒ w(n), and consequently c(n)), are independent of x(n)(Independence Theorem)
Thus
c(n + 1) = [I − μx(n)xH(n)]c(n) + μx(n)e∗0(n)
⇒ E{c(n + 1)} = (I − μR)E{c(n)} + μ E{x(n)e∗0(n)}︸ ︷︷ ︸
=0 why?
= (I − μR)E{c(n)}Using arguments similar to the SD case we have
limn→∞E{c(n)} = 0 if 0 < μ <
2λmax
or equivalently
limn→∞E{w(n)} = w0 if 0 < μ <
2λmax
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 298 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Convergence Analysis
Noting that∑M
i=1 λi = trace[R]
⇒ λmax ≤ trace[R] = Mr(0) = Mσ2x
Thus a more conservative bound (and one easier to determine) is
0 < μ <2
Mσ2x
Convergence in the mean
limn→∞ E{w(n)} = w0
is a weak condition that says nothing about the variance, whichmay even grow
A stronger condition is convergence in the mean square, whichsays
limn→∞E{|c(n)|2} = constant
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 299 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Convergence Analysis
Proving convergence in the mean square is equivalent to showing that
limn→∞ J(n) = lim
n→∞ E{|e(n)|2} = constant
To evaluate the limit, write e(n) as
e(n) = d(n) − d(n) = d(n) − wH(n)x(n)
= d(n) − wH0 x(n) − [wH(n) − wH
0 ]x(n)
= e0(n) − cH(n)x(n)
Thus
J(n) = E{|e(n)|2}= E
{(e0(n) − cH(n)x(n)
) (e∗
0(n) − xH(n)c(n))}
= Jmin + E{cH(n)x(n)xH(n)c(n)}︸ ︷︷ ︸Jex(n)
[Cross terms → 0, why?]
= Jmin + Jex(n)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 300 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Convergence Analysis
Since Jex(n) is a scalar
Jex(n) = E{cH(n)x(n)xH(n)c(n)}= E{trace[cH(n)x(n)xH(n)c(n)]}= E{trace[x(n)xH(n)c(n)cH(n)]}= trace[E{x(n)xH(n)c(n)cH(n)}]
Invoking the independence theorem
Jex(n) = trace[E{x(n)xH(n)}E{c(n)cH(n)}]= trace[RK(n)]
whereK(n) = E{c(n)cH(n)}
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 301 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Convergence Analysis
Thus
J(n) = Jmin + Jex(n)
= Jmin + trace[RK(n)]
RecallQHRQ = Ω or R = QΩQH
SetS(n) ≡ QHK(n)Q
where S(n) need not be diagonal. Then
K(n) = QQHK(n)QQH [since Q−1 = QH ]
= QS(n)QH
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 302 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Convergence Analysis
Utilizing R = QΩQH and K(n) = QS(n)QH in the excess errorexpression
Jex(n) = trace[RK(n)]
= trace[QΩQHQS(n)QH ]
= trace[QΩS(n)QH ]
= trace[QHQΩS(n)]
= trace[ΩS(n)]
Since Ω is diagonal
Jex(n) = trace[ΩS(n)] =M∑
i=1
λi si(n)
where s1(n), s2(n), · · · , sM (n) are the diagonal elements of S(n).
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 303 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Convergence Analysis
The previously derived recursion
E{c(n + 1)} = (I − μR)E{c(n)}can be modified to yield a recursion on S(n),
S(n + 1) = (I − μΩ)S(n)(I − μΩ) + μ2JminΩ
which for the diagonal elements is
si(n + 1) = (1 − μλi)2si (n) + μ2Jminλi i = 1, 2, · · · , M
Suppose Jex(n) converges, then si(n + 1) = si(n)
⇒ si (n) = (1 − μλi)2si(n) + μ2Jminλi
⇒ si (n) =μ2Jminλi
1 − (1 − μλi)2 =μ2Jminλi
2μλi − μ2λ2i
=μJmin
2 − μλii = 1, 2, · · · , M
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 304 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Convergence Analysis
Consider again
Jex(n) = trace[ΩS(n)] =M∑
i=1
λi si(n)
Taking the limit and utilizing si(n) = μJmin2−μλi
,
limn→∞ Jex(n) = Jmin
M∑i=1
μλi
2 − μλi
The LMS misadjustment is defined as
MA =limn→∞ Jex(n)
Jmin=
M∑i=1
μλi
2 − μλi
Note: A misadjustment at 10% or less is generally consideredacceptable.
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 305 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: First-Order Predictor
Example
This is a one tap predictor
x(n) = w(n)x(n − 1)
Take the underlying process to be areal order one AR process
x(n) = −ax(n − 1) + v(n)
The weight update is
w(n + 1) = w(n) + μx(n − 1)e(n) [LMS update for obs. x(n − 1)]
= w(n) + μx(n − 1)[x(n) − w(n)x(n − 1)]
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 306 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: First-Order Predictor
Sincex(n) = −ax(n − 1) + v(n) [AR model]
and
x(n) = w(n)x(n − 1) [one tap predictor]
⇒ w0 = −a
Note thatE{x(n − 1)eo(n)} = E{x(n − 1)v(n)} = 0
proves the optimality
Set μ = 0.05 and consider two cases
a σ2x
-0.99 0.936270.99 0.995
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 307 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: First-Order Predictor
Figure: Transient behavior of adaptive first-order predictor weight w(n) forμ = 0.05.
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 308 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: First-Order Predictor
Figure: Transient behavior of adaptive first-order predictor squared error forμ = 0.05.
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 309 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: First-Order Predictor
Figure: Mean-squared error learning curves for an adaptive first-orderpredictor with varying step-size parameter μ.
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 310 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: First-Order Predictor
Consider the expected trajectory of w(n). Recall
w(n + 1) = w(n) + μx(n − 1)e(n)
= w(n) + μx(n − 1)[x(n) − w(n)x(n − 1)]
= [1 − μx(n − 1)x(n − 1)]w(n) + μx(n − 1)x(n)
In this example, x(n) = −ax(n − 1) + v(n). Substituting in:
w(n + 1) = [1 − μx(n − 1)x(n − 1)]w(n) + μx(n − 1)[−ax(n − 1)
+v(n)]
= [1 − μx(n − 1)x(n − 1)]w(n) − μax(n − 1)x(n − 1)
+μx(n − 1)v(n)
Taking the expectation and invoking the dependence theorem
E{w(n + 1)} = (1 − μσ2x )E{w(n)} − μσ2
xa
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 311 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: First-Order Predictor
Figure: Comparison of experimental results with theory, based on w(n).
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 312 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: First-Order Predictor
Next, derive a theoretical expression for J(n).
Note that the initial value of J(n) is
J(0) = E{(x(0) − w(0)x(−1))2} = E{(x(0))2} = σ2x
and the final value is
J(∞) = Jmin + Jex
= E{(x(n) − w(n)x(n − 1))2} + Jex
= E{(v(n))2} + Jex
= σ2v + Jmin
μλ1
2 − μλ1
Note λ1 = σ2x . Thus,
J(∞) = σ2v + σ2
v
(μσ2
x
2 − μσ2x
)
= σ2v
(1 +
μσ2x
2 − μσ2x
)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 313 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: First-Order Predictor
And if μ is small
J(∞) = σ2v
(1 +
μσ2x
2 − μσ2x
)
≈ σ2v
(1 +
μσ2x
2
)
Putting all the components together:
J(n) = [σ2x − σ2
v (1 +μ
2σ2
x )]︸ ︷︷ ︸J(0)−J(∞)
(1 − μσ2x )2n︸ ︷︷ ︸
1→0
+ σ2v (1 +
μ
2σ2
x )︸ ︷︷ ︸J(∞)
Also, the time constant is
τ = − 12 ln(1 − μλ1)
= − 12 ln(1 − μσ2
x )≈ 1
2μσ2x
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 314 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: First-Order Predictor
Figure: Comparison of experimental results with theory for the adaptivepredictor, based on the mean-square error for μ = 0.001.
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 315 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: Adaptive Equalization
Example (Adaptive Equalization)
Objective: Pass a known signal through an unknown channel to invertthe effects the channel and noise have on the signal
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 316 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: Adaptive Equalization
The signal is a Bernoulli sequence
xn =
{+1 with probability 1/2−1 with probability 1/2
The additive noise is ∼ N(0, 0.001)
The channel has a raised cosine response
hn =
{ 12
[1 + cos
(2πw (n − 2)
)]n = 1, 2, 3
0 otherwise
⇒ w controls the eigenvalue spread χ(R)⇒ hn is symmetric about n = 2 and thus introduces a delay of 2
We will use an M = 11 tap filter, which is symmetric about n = 5⇒ Introduce a delay of 5
Thus an overall delay of δ = 5 + 2 = 7 is added to the system
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 317 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: Adaptive Equalization
Channel response and Filter response
Figure: (a) Impulse response of channel; (b) impulse response of optimumtransversal equalizer.
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 318 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: Adaptive Equalization
Consider three w values
Note the step size is bound by the w = 3.5 case
μ ≤ 2Mr(0)
=2
11(1.3022)= 0.14
Choose μ = 0.075 in all cases.
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 319 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: Adaptive Equalization
Figure: Learning curves of the LMS algorithm for an adaptive equalizer withnumber of taps M = 11, step-size parameter μ = 0.075, and varyingeigenvalue spread χ(R).
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 320 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: Adaptive Equalization
Ensemble-average impulse response of the adaptive equalizer (after1000 iterations) for each of four different eigenvalue spreads.
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 321 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: Adaptive Equalization
Figure: Learning curves of the LMS algorithm for an adaptive equalizer withthe number of taps M = 11, fixed eigenvalue spread, and varying step-sizeparameter μ.
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 322 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: LMS Directionality
Example
Directionality of the LMS algorithm
The speed of convergence of the LMS algorithm is faster incertain directions in the weight space
If the convergence is in the appropriate direction, the convergencecan be accelerated by increased eigenvalue spread
To investigate this phenomenon, consider the deterministic signal
x(n) = A1 cos(ω1n) + A2 cos(ω2n)
Even though it is deterministic, a correlation matrix can be determined:
R =12
[A2
1 + A22 A2
1 cos(ω1) + A22 cos(ω2)
A21 cos(ω1) + A2
2 cos(ω2) A21 + A2
2
]
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 323 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: LMS Directionality
Determining the eigenvalues and eigenvectors yields
λ1 =12
A21(1 + cos(ω1)) +
12
A22(1 + cos(ω2))
λ2 =12
A21(1 − cos(ω1)) +
12
A22(1 − cos(ω2))
and
q1 =
[11
]q2 =
[ −11
]
Case 1: A1 = 1, A2 = 0.5, ω1 = 1.2, ω2 = 0.1
xa(n) = cos(1.2n) + 0.5 cos(0.1n) and χ(R) = 2.9
Case 2: A1 = 1, A2 = 0.5, ω1 = 0.6, ω2 = 0.23
xb(n) = cos(0.6n) + 0.5 cos(0.23n) and χ(R) = 12.9
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 324 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: LMS Directionality
Since p undefined, set p = λiqi
Then since p = Rw0, we see (two cases)
p = λ1q1 ⇒ Rw0 = λ1q1 ⇒ w0 = q1 =
[11
]
p = λ2q2 ⇒ Rw0 = λ2q2 ⇒ w0 = q2 =
[ −11
]
Utilize 200 iterations of the algorithm.
Consider the minimum eigenfilter first, w0 = q2 =
[ −11
]
Consider the maximum eigenfilter second, w0 = q1 =
[11
]
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 325 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: LMS Directionality
Convergence of the LMS algorithm, for a deterministic sinusoidalprocess, along “slow” eigenvector (i.e., minimum eigenfilter).
For input xa(n) (χ(R) = 2.9) For input xb(n) (χ(R) = 12.9)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 326 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: LMS Directionality
Convergence of the LMS algorithm, for a deterministic sinusoidalprocess, along “fast” eigenvector (i.e., maximum eigenfilter).
For input xa(n) (χ(R) = 2.9) For input xb(n) (χ(R) = 12.9)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 327 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Normalized LMS Algorithm
Observation: The LMS correction is proportional to μx(n)e∗(n)
w(n + 1) = w(n) + μx(n)e∗(n)
If x(n) is large, the LMS update suffers from gradient noiseamplificationThe normalized LMS algorithm seeks to avoid gradient noiseamplification
The step size is made time varying, μ(n), and optimized tominimize the next step error
w(n + 1) = w(n) +12
μ(n)[−∇J(n)]
= w(n) + μ(n)[p − Rw(n)]
Choose μ(n), such that w(n + 1) produces the minimum MSE,
J(n + 1) = E{|e(n + 1)|2}
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 328 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Normalized LMS Algorithm
Let ∇(n) ≡ ∇J(n) and note
e(n + 1) = d(n + 1) − wH(n + 1)x(n + 1)
Objective: Choose μ(n) such that it minimizes J(n + 1)
The optimal step size, μ0(n), will be a function of R and ∇(n).
⇒ Use instantaneous estimates of these values
To determine μ0(n), expand J(n + 1)
J(n + 1) = E{e(n + 1)e∗(n + 1)}= E{(d(n + 1) − wH(n + 1)x(n + 1))
(d∗(n + 1) − xH(n + 1)w(n + 1))}= σ2
d − wH(n + 1)p − pHw(n + 1)
+wH(n + 1)Rw(n + 1)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 329 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Normalized LMS Algorithm
Now use the fact that w(n + 1) = w(n) − 12μ(n)∇(n)
J(n + 1) = σ2d − wH(n + 1)p − pHw(n + 1)
+wH(n + 1)Rw(n + 1)
= σ2d −
[w(n) − 1
2μ(n)∇(n)
]H
p
−pH[w(n) − 1
2μ(n)∇(n)
]
+
[w(n) − 1
2μ(n)∇(n)
]H
R[w(n) − 1
2μ(n)∇(n)
]︸ ︷︷ ︸= wH(n)Rw(n) − 1
2μ(n)wH(n)R∇(n)
−12
μ(n)∇H(n)Rw(n) +14
μ2(n)∇H(n)R∇(n)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 330 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Normalized LMS Algorithm
J(n + 1) = σ2d −
[w(n) − 1
2μ(n)∇(n)
]H
p
−pH[w(n) − 1
2μ(n)∇(n)
]+wH(n)Rw(n) − 1
2μ(n)wH(n)R∇(n)
−12μ(n)∇H(n)Rw(n) +
14μ2(n)∇H(n)R∇(n)
Differentiating with respect to μ(n),
∂J(n + 1)
∂μ(n)=
12∇H(n)p +
12
pH∇(n) − 12
wHR∇(n)
−12∇H(n)Rw(n) +
12
μ(n)∇H(n)R∇(n) (∗)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 331 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Normalized LMS Algorithm
Setting (∗) equal to 0
μ0(n)∇H(n)R∇(n) = wH(n)R∇(n) − pH∇(n)
+∇H(n)Rw(n) −∇H(n)p
⇒ μ0(n) =wH(n)R∇(n) − pH∇(n) + ∇H(n)Rw(n) −∇H(n)p
∇H(n)R∇(n)
=[wH(n)R − pH ]∇(n) + ∇H(n)[Rw(n) − p]
∇H(n)R∇(n)
=[Rw(n) − p]H∇(n) + ∇H(n)[Rw(n) − p]
∇H(n)R∇(n)
=12∇H(n)∇(n) + 1
2∇H(n)∇(n)
∇H(n)R∇(n)
=∇H(n)∇(n)
∇H(n)R∇(n)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 332 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Normalized LMS Algorithm
Using instantaneous estimates
R = x(n)xH(n) and p = x(n)d ∗(n)
⇒ ∇(n) = 2[Rw(n) − p]
= 2[x(n)xH(n)w(n) − x(n)d ∗(n)]
= 2[x(n)(d∗(n) − d∗(n))]
= −2x(n)e∗(n)
Thus
μ0(n) =∇H(n)∇(n)
∇H(n)R∇(n)=
4xH(n)e(n)x(n)e∗(n)
2xH(n)e(n)x(n)xH(n)2x(n)e∗(n)
=|e(n)|2xH(n)x(n)
|e(n)|2(xH(n)x(n))2
=1
xH(n)x(n)=
1||x(n)||2
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 333 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Normalized LMS Algorithm
Result: The NLMS update is
w(n + 1) = w(n) +μ
||x(n)||2︸ ︷︷ ︸μ(n)
x(n)e∗(n)
μ is introduced to scale the update
To avoid problems when ||x(n)||2 ≈ 0 we add an offset
w(n + 1) = w(n) +μ
a + ||x(n)||2 x(n)e∗(n)
where a > 0
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 334 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) NLMS Convergnce
Objective: Analyze the NLMS convergence
w(n + 1) = w(n) +μ
||x(n)||2 x(n)e∗(n)
Substituting e(n) = d(n) − wH(n)x(n)
w(n + 1) = w(n) +μ
||x(n)||2 x(n)[d∗(n) − xH(n)w(n)]
=
[I − μ
x(n)xH(n)
||x(n)||2]
w(n) + μx(n)d∗(n)
||x(n)||2
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 335 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) NLMS Convergnce
Objective: Compare the NLMS and LMS algorithms:
NLMS:
w(n + 1) =
[I − μ
x(n)xH(n)
||x(n)||2]
w(n) + μx(n)d∗(n)
||x(n)||2
LMS:
w(n + 1) = [I − μx(n)xH(n)]w(n) + μx(n)d ∗(n)
By observation, we see the following corresponding terms
LMS NLMS
μ μ
x(n)xH(n)x(n)xH(n)
||x(n)||2x(n)d∗(n)
x(n)d∗(n)
||x(n)||2
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 336 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) NLMS Convergnce
LMS NLMS
μ μ
x(n)xH(n)x(n)xH(n)
||x(n)||2x(n)d∗(n)
x(n)d∗(n)
||x(n)||2
LMS case:
0 < μ <2
trace[E{x(n)xH(n)}] =2
trace[R]
guarantees stabilityBy analogy,
0 < μ <2
trace[E{
x(n)xH (n)||x(n)||2
}]guarantees stability of the NLMS
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 337 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) NLMS Convergnce
To analyze the bound, make the following approximation
E{
x(n)xH(n)
||x(n)||2}
≈ E{x(n)xH(n)}E{||x(n)||2}
Then
trace[E{
x(n)xH(n)
||x(n)||2}]
=trace[E{x(n)xH(n)}]
E{||x(n)||2}
=E{trace[x(n)xH(n)]}
E{||x(n)||2}=
E{trace[xH(n)x(n)]}E{||x(n)||2}
=E{trace[||x(n)||2]}
E{||x(n)||2}= 1
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 338 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) NLMS Convergnce
Thus
0 < μ <2
trace[E{
x(n)xH (n)||x(n)||2
}] = 2
Final Result: The NLMS update
w(n + 1) = w(n) + μx(n)
||x(n)||2 e∗(n)
will converge if 0 < μ < 2
Note:
The NLMS has a simpler convergence criterion than the LMS
The NLMS generally converges faster than the LMS algorithm
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 339 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) The Least Square Method
Method of Least Squares (LS)
Definition (Method of Least Squares (LS))
Motivation: Develop a general method for optimallyadjusting parameters to model observed data
Solution: Set the sum of squared residuals (errors) asthe performance criteria and restrict the model to belinear
The LS filtering method is a deterministic method
Can be applied to linear and nonlinear systems
LS corresponds to the ML criterion if the errors havea normal distribution
The method is related to linear regression
Optimization procedure results in a LS best fit for afilter over the observed (training) samples
HistoricalNote: Gaussdeveloped LSin 1795 at theage of 18
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 340 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) The Least Square Method
Consider the linear transversal filter
and a fixed number of observed samples: i = 1, 2, · · · , N.
M – the number of taps in the filter
{x(i)} – input sequence
{d(i)} – desired output sequence
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 341 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) The Least Square Method
Objective: Set the tap weights to minimize the sum of squared errors
ε(w) =N∑
i=M
|e(i)|2
Let
w = [w0, w1, · · · , wM−1]T [weight vector]
x(i) = [x(i), x(i − 1), · · · , x(i − M + 1)]T , M ≤ i ≤ N [obs. vect.]
The error at time i ise(i) = d(i) − wHx(i)
The full set of error values can be compiled into a vector
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 342 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) The Least Square Method
Define the (N − M + 1) × 1 vectors:
εH = [e(M), e(M + 1), · · · , e(N)] [error vector]
dH = [d(M), d(M + 1), · · · , d(N)] [desired vector]
Denoting the filter output as d(i) and using vector form:
dH = [d(M), d(M + 1), · · · , d(N)]
= [wHx(M), wHx(M + 1), · · · , wHx(N)]
= wH [x(M), x(M + 1), · · · , x(N)]
= wHAH
whereAH = [x(M), x(M + 1), · · · , x(N)]
is the observation data matrix
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 343 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) The Least Square Method
Expanding the data matrix
AH = [x(M), x(M + 1), · · · , x(N)]
=
⎡⎢⎢⎢⎣
x(M) x(M + 1) · · · x(N)x(M − 1) x(M) · · · x(N − 1)
......
. . ....
x(1) x(2) · · · x(N − M + 1)
⎤⎥⎥⎥⎦
⇒ AH is a M × (N − M + 1) rectangular toplitz matrix.
Combining all the above:
Filter output vector: dH = wHAH
Desired output vector: dH
Error vector: εH = dH − dH = dH − wHAH
Note: All incorporate samples for M ≤ i ≤ N
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 344 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) The Least Square Method
The sum of the squared estimate errors can now be written as
ε(w) =N∑
i=M
|e(i)|2
= εHε
= (dH − wHAH)(d − Aw)
= dHd − dHAw − wHAHd + wHAHAw
Minimizing with respect to w,
∂ε(w)
∂w= −2AHd + 2AHAw (∗)
Setting (∗) equal to zero gives the optimal LS weight w
⇒ AHAw = AHd [Deterministic normal equation]
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 345 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) The Least Square Method
Note: A is not generally square, and thus not invertible, but AHA issquare and generally invertible
AHAw = AHd
⇒ w = (AHA)−1AHd
The deterministic normal equation can be rearranged as
AHAw − AHd = 0
AH(Aw − d) = 0 [or using εmin = d − Aw]
AHεmin = 0
Observation: The LS orthogonality principle states that the estimateerror εmin is orthogonal to the row vectors of the data matrix AH
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 346 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) The Least Square Method
Objective: Determine the minimum sum of squared errors (emin)
emin = εHminεmin
= (dH − wHAH)(d − Aw)
= dHd − wHAHd − dHAw + wHAHAw
Utilizing the normal equations wHAHd = wHAHAw
emin = dHd − wHAHd︸ ︷︷ ︸wHAHAw
− dHAw + wHAHAw
= dHd − dHAw
or using w = (AHA)−1AHd
emin = dHd − dHA(AHA)−1AHd (∗)Note that
dHd =N∑
i=M
|d(i)|2 [energy of desired response]
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 347 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) The Least Square Method
Consider again the deterministic normal equation
AHAw = AHd
Note that
AHA = [x(M), x(M + 1), · · · , x(N)]
⎡⎢⎢⎢⎣
xH(M)
xH(M + 1)...
xH(N)
⎤⎥⎥⎥⎦
=N∑
i=M
x(i)xH(i)
= Φ [time averaged correlation matrix, size M × M ]
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 348 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) The Least Square Method
From Φ =∑N
i=M x(i)xH(i) it can be shown that:1 Φ is Hermitian2 Φ is nonnegative definite
To prove this, note that for any a
aHΦa =N∑
i=M
aHx(i)xH(i)a
=N∑
i=M
[aHx(i)][aHx(i)]H
=N∑
i=M
|aHx(i)|2 ≥ 0
3 From (1) and (2) we can prove that the eigenvalues of Φ are realand nonnegative
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 349 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) The Least Square Method
The deterministic normal equation,
AHAw = AHd
also employs
AHd = [x(M), x(M + 1), · · · , x(N)]
⎡⎢⎢⎢⎣
d∗(M)d∗(M + 1)
...d∗(N)
⎤⎥⎥⎥⎦
=N∑
i=M
x(i)d∗(i)
= θ [Time averaged cross-correlation vector, size M × 1]
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 350 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) The Least Square Method
Thus the deterministic normal equation, AHAw = AHd, reduces to
Φw = θ
Φ is usually positive definite (always positive semi-definite) ⇒ thesolution is well defined
w = Φ−1θ [LS optimal weight vector]
Also, recall from (∗) that emin can be expressed as
emin = dHd − dHA︸︷︷︸θH
(AHA)−1︸ ︷︷ ︸Φ−1
AHd︸︷︷︸θ
= ed − θHΦ−1θ
where ed is the energy of desired signal
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 351 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) The Least Square Method
Consider again the orthogonality principle
AHεmin = 0
Recall that d = Aw. Thus
AHεmin = 0
⇒ wHAHεmin = wH0
⇒ dHεmin = 0
Result: The minimum estimation error vector, εmin, is orthogonal to thedata matrix AH and the LS estimate d
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 352 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Analysis of the LS Solution
Objective: Analyze the Least Squares solution in terms of
Bias – it is the LS solution unbiased?
BLUE – is the LS solution the Best Linear Unbiased Estimate?
Assumption: Take the true underlying system to be a linear
d(i) =M−1∑k=0
w∗0kx(i − k) + e0(i)
= wH0 x(i) + e0(i)
e0(i) is the unobservable measurement error⇒ e0(i) is white (uncorrelated) with zero mean and variance σ2
Express the desired signal in vector form
d = Aw0 + ε0
where εH0 = [e0(M), e0(M + 1), · · · , e0(N)]
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 353 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Analysis of the LS Solution
Objective: Evaluate the bias of w
Recall thatw = (AHA)−1AHd
Using d = Aw0 + ε0 in the above
w = (AHA)−1AHd
= (AHA)−1AH(Aw0 + ε0)
= (AHA)−1AHAw0 + (AHA)−1AHε0
= w0 + (AHA)−1AHε0 (∗)
Note A is fixed. Thus taking the expectation of (∗) yields
E{w} = w0 + (AHA)−1AHE{ε0}= w0
Result: The LS estimate, w, is unbiased
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 354 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Analysis of the LS Solution
Objective: Evaluate the covariance of w
Note that from (∗)
w = w0 + (AHA)−1AHε0
⇒ w − w0 = (AHA)−1AHε0
Thus
cov[w] = E{(w − w0)(w − w0)H}
= E{(AHA)−1AHε0εH0 A(AHA)−1}
= Φ−1AH E{ε0εH0 }︸ ︷︷ ︸
σ2I
AΦ−1
= σ2Φ−1ΦΦ−1 = σ2Φ−1 (�1)
Result: The covariance of w is proportional to: (1) the variance of themeasurement noise and (2) the inverse of the time average correlationmatrix
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 355 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Analysis of the LS Solution
Objective: Show that the LS estimate w is the Best Linear UnbiasedEstimate (BLUE)
Consider any linear unbiased estimate wNote that w is a linear function of the observed date and can thusbe written as
w = Bd
where B is a M × (N − M + 1) matrix
Substituting d = Aw0 + ε0 into the above,
w = BAw0 + Bε0 (∗)⇒ E{w} = BAw0
⇒ BA = I [since w unbiased]
Thus BA = I and (∗) ⇒w = w0 + Bε0
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 356 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Analysis of the LS Solution
Rearranging w = w0 + Bε0,
w − w0 = Bε0
⇒ cov[w] = E{(w − w0)(w − w0)H}
= E{Bε0εH0 BH}
= σ2BBH (�2)
Now define
Ψ = B − (AHA)−1AH
⇒ ΨΨH = [B − Φ−1AH ][BH − AΦ−1]
= BBH − BA︸︷︷︸I
Φ−1 − Φ−1 AHBH︸ ︷︷ ︸I
+ Φ−1AHAΦ−1︸ ︷︷ ︸Φ−1ΦΦ−1
= BBH − Φ−1 − Φ−1 + Φ−1
= BBH − Φ−1
= BBH − (AHA)−1
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 357 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Analysis of the LS Solution
Observation: The diagonal elements at ΨΨH must be ≥ 0
Thus ΨΨH = BBH − (AHA)−1 ⇒
diag[BBH ] ≥ diag[(AHA)−1]
⇒ diag[σ2BBH ] ≥ diag[σ2(AHA)−1] (∗)
But recall from (�1) and (�2) that
cov[w] = σ2(AHA)−1 and cov[w] = σ2BBH
Utilizing these results in (∗) ⇒
variance[wi ] ≥ variance[wi ] i = 1, 2, · · · , M
Thus the weights in w have lower variance than any other linearestimates
Result: The LS estimate w is unbiased and has the smallest weightvariance ⇒ it is the Best Linear Unbiased Estimate (BLUE)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 358 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Recursive Least Squares (RLS)
Definition (Recursive Least Squares (RLS))
Motivation: LS requires solving
w = (AHA)−1AHd
= Φ−1θ
where
Φ =N∑
i=M
x(i)xH(i) and θ =N∑
i=M
x(i)d∗(i)
(AHA) is M × M and inversion requires O(M3) multiplications andadditions
Approach: Suppose the LS optimal weights are known at time n, w(n).As time evolves, find the new estimate, w(n + 1), in terms of w(n).
Employ the matrix inversion lemma to reduce the number ofcomputations
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 359 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Recursive Least Squares (RLS)
Let the observation sequence be x(1), x(2), · · · , x(n)
⇒ Assume x(l) = 0 for l ≤ 0
Define the error as
ε(n) =n∑
i=1
β(n, i)|e(i)|2
where
e(i) = d(i) − wH(n)x(i)
x(i) = [x(i), x(i − 1), · · · , x(i − M + 1)]T
w(n) = [w0(n), w1(n), · · · , wM−1(n)]T
⇒ β(n, i) ∈ (0, 1] is a forgetting factor used in non–stationarystatistics cases
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 360 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Recursive Least Squares (RLS)
A commonly used forgetting factor is the exponential forgetting factor
β(n, i) = λn−i i = 1, 2, · · · , n, λ ∈ (0, 1]
Thus,
ε(n) =n∑
i=1
λn−i |e(i)|2
The LS solution is given by the deterministic normal equation
Φ(n)w(n) = θ(n)
where now
Φ(n) =n∑
i=1
λn−ix(i)xH(i)
θ(n) =n∑
i=1
λn−ix(i)d∗(i)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 361 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Recursive Least Squares (RLS)
The normal equation terms can be updated recursively,
Φ(n) =n∑
i=1
λn−ix(i)xH(i)
= λ
[n−1∑i=1
λ(n−1)−ix(i)xH(i)
]︸ ︷︷ ︸
Φ(n−1)
+ x(n)xH(n)
= λΦ(n − 1) + x(n)xH(n)
Similarly
θ(n) =n∑
i=1
λn−ix(i)d∗(i)
= λ
[n−1∑i=1
λ(n−1)−ix(i)d∗(i)
]+ x(n)d∗(n)
= λθ(n − 1) + x(n)d ∗(n)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 362 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Recursive Least Squares (RLS)
Aside: Matrix inversion lemma: If
A︸︷︷︸M×M
= B−1︸︷︷︸M×M
+ C︸︷︷︸M×L
D−1︸︷︷︸L×L
CH︸︷︷︸L×M
where A, B, D are positive definite (non-singular), then
A−1 = B − BC[D + CHBC]−1CHB
Apply the lemma to
Φ(n) = λΦ(n − 1) + x(n)xH(n)
Accordingly, set
A = Φ(n) [M × M] B−1 = λΦ(n − 1) [M × M]C = x(n) [M × 1] D = 1 [1 × 1]
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 363 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Recursive Least Squares (RLS)
UtilizingA = Φ(n) B−1 = λΦ(n − 1)C = x(n) D = 1
andA−1 = B − BC[D + CHBC]−1CHB (∗)
we get
[D + CHBC]−1 = [1 + λ−1xH(n)Φ−1(n − 1)x(n)]−1
which is a scalar. Thus evaluating (∗) yields
Φ−1(n) = λ−1Φ−1(n − 1) − λ−2Φ−1(n − 1)x(n)xH(n)Φ−1(n − 1)
1 + λ−1xH(n)Φ−1(n − 1)x(n)
To simplify the result, let P(n) = Φ−1(n) and
k(n)︸︷︷︸Gain vector
=λ−1P(n − 1)x(n)
1 + λ−1xH(n)P(n − 1)x(n)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 364 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Recursive Least Squares (RLS)
Utilizing P(n) = Φ−1(n) and k(n) = λ−1P(n−1)x(n)1+λ−1xH(n)P(n−1)x(n)
Φ−1(n) = λ−1Φ−1(n − 1) − λ−2Φ−1(n − 1)x(n)xH(n)Φ−1(n − 1)
1 + λ−1xH(n)Φ−1(n − 1)x(n)
⇒ P(n) = λ−1P(n − 1) − λ−1k(n)xH(n)P(n − 1) (∗)
Also, the gain vector can be simplified as
k(n) =λ−1P(n − 1)x(n)
1 + λ−1xH(n)P(n − 1)x(n)[multiply by denom.]
⇒ k(n) = λ−1P(n − 1)x(n) − λ−1k(n)xH(n)P(n − 1)x(n)
= [λ−1P(n − 1) − λ−1k(n)xH(n)P(n − 1)]︸ ︷︷ ︸=P(n) from (∗)
x(n)
= P(n)x(n) = Φ−1(n)x(n) (∗∗)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 365 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Recursive Least Squares (RLS)
We must now derive an update for the tap weight vector. Recall,
w(n) = Φ−1(n)θ(n) = P(n)θ(n)
Using the recursion θ(n) = λθ(n − 1) + x(n)d ∗(n) in the above
w(n) = λP(n)θ(n − 1) + P(n)x(n)d ∗(n) (∗ ∗ ∗)Using the update (∗)
P(n) = λ−1P(n − 1) − λ−1k(n)xH(n)P(n − 1)
in the first P(n) term of (∗ ∗ ∗)w(n) = λP(n)θ(n − 1) + P(n)x(n)d ∗(n)
= λ[λ−1P(n − 1) − λ−1k(n)xH(n)P(n − 1)]θ(n − 1)
+ P(n)x(n)d ∗(n)
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 366 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Recursive Least Squares (RLS)
*
w(n) = λ[λ−1P(n − 1) − λ−1k(n)xH(n)P(n − 1)]θ(n − 1)
+P(n)x(n)d ∗(n)
= P(n − 1)θ(n − 1)︸ ︷︷ ︸w(n−1)
− k(n)xH(n) P(n − 1)θ(n − 1)︸ ︷︷ ︸w(n−1)
+P(n)x(n)d ∗(n)
= w(n − 1) − k(n)xH(n)w(n − 1) + P(n)x(n)︸ ︷︷ ︸=k(n) from (∗∗)
d∗(n)
= w(n − 1) − k(n)[xH(n)w(n − 1) − d ∗(n)]
= w(n − 1) + k(n)α∗(n)
where α(n) = d(n) − wH(n − 1)x(n)
Observation: Difference between e(n) and α(n):
e(n) = d(n) − wH(n)x(n) ⇒ a posteriori error
α(n) = d(n) − wH(n − 1)x(n) ⇒ a priori error
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 367 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Recursive Least Squares (RLS)
RLS Algorithm Summary
1 Given a new sample x(n), update the gain vector
k(n) =λ−1P(n − 1)x(n)
1 + λ−1xH(n)P(n − 1)x(n)
2 Update the innovation: α(n) = d(n) − wH(n − 1)x(n)
3 Update the tap weight vector: w(n) = w(n − 1) + k(n)α∗(n)
4 Update inverse correlation matrix
P(n) = λ−1P(n − 1) − λ−1k(n)xH(n)P(n − 1)
Initial Conditions: w(0) = 0 and Φ(0) = δI, where δ is a small positiveconstant, δ ≈ 0.01σ2
x .
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 368 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Recursive Least Squares (RLS)
Algorithm Comparison: RLS and LMS algorithm terms:
Entity RLS LMSError
error) priori a(
)()1(ˆ)()( nnndn H xw −−=αerror) posteriari a(
)()(ˆ)()( nnndne H xw−=
WeightUpdate
)()()1(ˆ)(ˆ nnnn ∗+−= αkww )()()()1( nennn ∗+=+ xww μ
Gain oferror update
)()()1()(1
)1(1
1
nnnn
nH
xxPx
P⎟⎟⎠
⎞⎜⎜⎝
⎛−+
−−
−
λλ )()( nxμ
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 369 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Complexity Analysis
Objective: Compare the complexities (number of additions andmultiplies) for the LMS, LS, and RLS algorithms.
Assume the data is real and the filter is of size M
Case 1 – The LMS algorithm: Algorithm stages:1 d(n) = wT (n)x(n)
2 e(n) = d(n) − d(n)
3 w(n + 1) = w(n) + μx(n)e(n)
ComplexityStage O× O+
(1) M M − 1(2) 0 1(3) M + 1 M
Total complexityO×(2M + 1) O+(2M)
per iteration
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 370 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Complexity Analysis
Case 2 – The LS algorithm: Algorithm solves
w(n) = Φ−1(n)θ(n)
and has stages:1 Φ(n + 1) = Φ(n) + x(n + 1)xH(n + 1)
2 θ(n + 1) = θ(n) + x(n + 1)d(n + 1)
3 w(n + 1) = Φ−1(n + 1)θ(n + 1)
ComplexityStage O× O+
(1) M2 M2
(2) M M(3) M3 + M2 M3 + M(M − 1)
Total complexityO×(M3 + 2M2 + M) O+(M3 + 2M2)
per iteration
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 371 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Complexity Analysis
Case 3 – The RLS algorithm: Algorithm has stages (assuming λ = 1):1 k(n) = λ−1P(n−1)x(n)
1+xT (n)P(n−1)x(n)
2 α(n) = d(n) − wT (n − 1)x(n)3 w(n) = w(n − 1) + k(n)α(n)4 P(n) = P(n − 1) − k(n)xT (n)P(n − 1)
Note: The operation xT (n)P(n − 1) is repeated (but only performedonce). Corresponding steps are underlined in the chart.
ComplexityStage O× O+
(1) numerator M2 M(M − 1)(1) denominator M2 + M M(M − 1) + M
(1) division M(2) M M(3) M M(4) M2 + M2 M(M − 1) + M2
Total complexityO×(3M2 + 4M) O+(3M2 + M)
per iteration
K. E. Barner (ECE, Univ. of Delaware) ELEG-636: Statistical Signal Processing Spring 2009 372 / 406
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Performance Analysis
Objective: Analyze the RLS algorithm in terms of
Bias
Convergence in the mean; Convergence in the mean square
Learning curve decay rate
Assumptions:
1 The desired signal is formed by the regression model

      d(n) = w0^H x(n) + e0(n)

  where e0(n) is white with variance σ².
2 λ = 1 and n ≥ M.

Then
      w(n) = Φ^{−1}(n)θ(n)
where
      Φ(n) = Σ_{i=1}^{n} x(i)x^H(i)   and   θ(n) = Σ_{i=1}^{n} x(i)d^∗(i)
Substituting d^∗(n) = x^H(n)w0 + e0^∗(n) into θ(n),

θ(n) = Σ_{i=1}^{n} x(i)[x^H(i)w0 + e0^∗(i)]
     = Σ_{i=1}^{n} x(i)x^H(i)w0 + Σ_{i=1}^{n} x(i)e0^∗(i)
     = Φ(n)w0 + Σ_{i=1}^{n} x(i)e0^∗(i)

Thus

w(n) = Φ^{−1}(n)θ(n)
     = Φ^{−1}(n)[Φ(n)w0 + Σ_{i=1}^{n} x(i)e0^∗(i)]
     = w0 + Φ^{−1}(n) Σ_{i=1}^{n} x(i)e0^∗(i)    (∗)
Note that E{A} = E{E{A|B}}. Thus

w(n) = w0 + Φ^{−1}(n) Σ_{i=1}^{n} x(i)e0^∗(i)

⇒ E{w(n)} = w0 + E{ E{ Φ^{−1}(n) Σ_{i=1}^{n} x(i)e0^∗(i) | x(i), i = 1, 2, · · · , n } }
          = w0 + E{ Φ^{−1}(n) Σ_{i=1}^{n} x(i)E{e0^∗(i)} } = w0

The above follows from the fact that Φ(n) and e0^∗(i) are independent.
Why? e0(i) is independent of all observations, and the x(i) terms are given, uniquely defining Φ(n) ⇒ independence of Φ(n) and e0^∗(i).
Result: The RLS algorithm is unbiased and convergent in the mean for n ≥ M.
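The unbiasedness claim is easy to check numerically. A Monte Carlo sketch under the regression-model assumptions (all sizes and values illustrative):

    M = 4; n = 50; runs = 500;
    w0 = randn(M,1); sigma = 0.1;            % true weights and noise std
    wbar = zeros(M,1);
    for r = 1:runs
        X = randn(M,n);                      % columns are x(1),...,x(n)
        d = w0'*X + sigma*randn(1,n);        % d(n) = w0'*x(n) + e0(n), real data
        Phi = X*X'; th = X*d';               % Phi(n) and theta(n) with lambda = 1
        wbar = wbar + (Phi\th)/runs;         % accumulate w(n) = inv(Phi)*theta(n)
    end
    disp(norm(wbar - w0))                    % near 0, consistent with E{w(n)} = w0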
Question: How does this compare to the LMS algorithm?
Next, consider the convergence in the mean square. Recall (∗)
w(n) = w0 + Φ^{−1}(n) Σ_{i=1}^{n} x(i)e0^∗(i)

which gives

ε(n) = w(n) − w0 = Φ^{−1}(n) Σ_{i=1}^{n} x(i)e0^∗(i)

Thus the weight error correlation matrix is

K(n) = E{ε(n)ε^H(n)}
     = E{ Φ^{−1}(n) [ Σ_{i=1}^{n} Σ_{j=1}^{n} x(i)e0^∗(i)e0(j)x^H(j) ] Φ^{−1}(n) }
Again using E{A} = E{E{A|B}} yields

K(n) = E{ Φ^{−1}(n) [ Σ_{i=1}^{n} Σ_{j=1}^{n} x(i) E{e0^∗(i)e0(j)} x^H(j) ] Φ^{−1}(n) }    with E{e0^∗(i)e0(j)} = σ²δ(i − j)
     = σ² E{ Φ^{−1}(n) [ Σ_{i=1}^{n} x(i)x^H(i) ] Φ^{−1}(n) }
     = σ² E{Φ^{−1}(n)Φ(n)Φ^{−1}(n)}
     = σ² E{Φ^{−1}(n)}

Note: Φ^{−1}(n) has an inverse Wishart distribution, the expectation of which is

      E{Φ^{−1}(n)} = 1/(n − M − 1) R^{−1},   n > M + 1
Using K(n) = σ²/(n − M − 1) R^{−1} and the trace,

E{||ε(n)||²} = E{ε^H(n)ε(n)}
             = E{trace[ε^H(n)ε(n)]}
             = E{trace[ε(n)ε^H(n)]}
             = trace E{ε(n)ε^H(n)}
             = trace[K(n)]
             = σ²/(n − M − 1) trace[R^{−1}]
             = σ²/(n − M − 1) Σ_{i=1}^{M} 1/λ_i,   n > M + 1

Results:
The weight vector MSE is initially proportional to Σ_{i=1}^{M} 1/λ_i
The weight vector converges linearly in the mean-squared sense
Objective: Evaluate the RLS (error) learning curve
Recall the a priori estimation error

α(n) = d(n) − w^H(n − 1)x(n)
     = d(n) − d0(n) + d0(n) − w^H(n − 1)x(n)    [d0(n) ≡ w0^H x(n)]
     = e0(n) + w0^H x(n) − w^H(n − 1)x(n)
     = e0(n) − ε^H(n − 1)x(n)

Now consider the MSE of α(n):

Jα(n) = E{|α(n)|²}
      = E{[e0^∗(n) − x^H(n)ε(n − 1)][e0(n) − ε^H(n − 1)x(n)]}
      = E{|e0(n)|²} − E{x^H(n)ε(n − 1)e0(n)}
        − E{ε^H(n − 1)x(n)e0^∗(n)} + E{x^H(n)ε(n − 1)ε^H(n − 1)x(n)}
To analyze Jα(n), consider each term individually
Jα(n) = E{|e0(n)|²} − E{x^H(n)ε(n − 1)e0(n)} − E{ε^H(n − 1)x(n)e0^∗(n)} + E{x^H(n)ε(n − 1)ε^H(n − 1)x(n)}

Term: E{|e0(n)|²}. Clearly,

      E{|e0(n)|²} = σ²

Term: E{ε^H(n − 1)x(n)e0^∗(n)}. By the independence theorem, ε(n − 1) is independent of x(n) and e0(n). Thus,

      E{ε^H(n − 1)x(n)e0^∗(n)} = E{ε^H(n − 1)}E{x(n)e0^∗(n)} = 0
where the final result is due to the orthogonality principle.
Term: E{xH(n)ε(n − 1)e0(n)} → 0 by similar arguments
Jα(n) = E{|e0(n)|²} − E{x^H(n)ε(n − 1)e0(n)} − E{ε^H(n − 1)x(n)e0^∗(n)} + E{x^H(n)ε(n − 1)ε^H(n − 1)x(n)}

Term: E{x^H(n)ε(n − 1)ε^H(n − 1)x(n)}:

E{x^H(n)ε(n − 1)ε^H(n − 1)x(n)} = E{trace[x^H(n)ε(n − 1)ε^H(n − 1)x(n)]}
                                = E{trace[x(n)x^H(n)ε(n − 1)ε^H(n − 1)]}

Invoking the independence theorem,

E{x^H(n)ε(n − 1)ε^H(n − 1)x(n)} = trace[E{x(n)x^H(n)}E{ε(n − 1)ε^H(n − 1)}]
                                = trace[R K(n − 1)]
Utilizing K(n − 1) = σ²/(n − M − 2) R^{−1} and substituting back each of the components,

Jα(n) = σ² + trace[R K(n − 1)]
      = σ² + Mσ²/(n − M − 2),   n > M + 2
Results:
The ensemble-average learning curve of the RLS converges in about 2M iterations, which is typically an order of magnitude faster than the LMS
lim_{n→∞} Jα(n) = σ², thus there is no excess MSE
Convergence of the RLS algorithm is independent of the eigenvalues of Φ(n)
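Evaluating the theoretical curve shows the fast decay toward σ²; a short Matlab sketch (M and σ² illustrative; the expression requires n > M + 2):

    M = 11; sigma2 = 1e-3;
    n = (M+3):100;
    J = sigma2 + M*sigma2./(n - M - 2);      % theoretical J_alpha(n)
    semilogy(n, J), xlabel('n'), ylabel('J_\alpha(n)')   % settles at sigma2: no excess MSE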
Adaptive Optimization (Steepest Descent, LMS, RLS Algorithms) Example: Channel Equalization
Example
Consider again the channel equalization problem
where
h_n = { (1/2)[1 + cos((2π/W)(n − 1))],   n = 1, 2, 3
      { 0,                               otherwise
As before an 11-tap filter is used
The SNR is 30 dB and W is varied to control the eigenvalue spread
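A sketch of the channel setup in Matlab (the value of W is illustrative; varying it changes the eigenvalue spread of the received-signal correlation matrix):

    W = 3.1;                                 % illustrative value of W
    n = 1:3;
    h = 0.5*(1 + cos(2*pi*(n-1)/W));         % 3-tap raised-cosine channel
    N = 1000;
    x = sign(randn(1,N));                    % random +/-1 training symbols
    u = filter(h, 1, x);                     % channel output with ISI
    u = u + sqrt(var(u)/10^3)*randn(1,N);    % additive noise at 30 dB SNR

The 11-tap equalizer is then adapted on u against a delayed copy of x, e.g., with the RLS sketch given earlier.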
Observations:
The RLS algorithm converges in about 20 iterations (twice the number of filter taps)
The convergence (rate) is insensitive to the eigenvalue spread
Observations:
The RLS algorithm converges faster than the LMS algorithm
The RLS algorithm has lower steady-state error than the LMS algorithm
Application: Blind Deconvolution Motivation
Blind Deconvolution
Motivation:
Adaptive equalizers typically require a training period during which they operate on known signals/statistics. This known-signal training is not always appropriate, e.g., in mobile communications:
Cost is too high (time/bandwidth)
Multipath or other time-varying interference
In such cases, we must use blind equalization
Application: Blind Deconvolution System Model
System components and assumptions:
Channel introduces distortion (dominant)
System has additive noise (not dominant)
Assume a baseband model of communications
Multilevel pulse amplitude modulation (M-ary PAM)
Received signal:

      u(n) = Σ_k h_k x(n − k)

Dominating interference is due to intersymbol interference (ISI) from channel distortion ⇒ the noise is ignored
Also assume that:
h_n ≠ 0 for n < 0 (noncausal)
Σ_k h_k² = 1 (to keep the variance of the output constant)
Example (4-ary PAM modulation)
A 4–ary PAM modulation scheme uses 4 signals
[Figure: the four PAM signal waveforms S1–S4 and their corresponding amplitude levels S4, S2, S1, S3 on the x axis]
To solve the equalization problem, we need a statistical model of the data. Assume:
1 The data is white:

      E{x(n)} = 0
      E{x(n)x(k)} = { 1, k = n
                    { 0, otherwise

2 The pdf of x(n) is symmetric and uniform:

      f_x(x) = 1/(2√3),   −√3 ≤ x ≤ √3   (zero mean, unit variance)
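Such a source is straightforward to generate for simulation; a one-line Matlab sketch (N illustrative):

    N = 1e4;
    x = sqrt(12)*(rand(1,N) - 0.5);    % uniform on [-sqrt(3), sqrt(3)]
    [mean(x), var(x)]                  % approximately [0, 1]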
Deconvolution Objective: If {w_i} are the coefficients of the ideal inverse filter, then

      Σ_i w_i h_{l−i} = δ_l = { 1, l = 0
                              { 0, else
If this is the case, the output of the equalizer is

y(n) = Σ_i w_i u(n − i)
     = Σ_i Σ_k w_i h_k x(n − i − k)    [let k = l − i]
     = Σ_l x(n − l) Σ_i w_i h_{l−i}
     = Σ_l x(n − l) δ_l
     = x(n)
Problem: h_n is not known ⇒ the exact inverse cannot be used
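The ideal-inverse identity can be sanity-checked numerically; a Matlab sketch with a hypothetical 2-tap channel (not the course example) whose exact inverse is IIR and is truncated here:

    h = [1 0.5];                 % hypothetical channel
    w = (-0.5).^(0:30);          % truncated ideal inverse filter coefficients
    c = conv(w, h);              % sum_i w_i h_{l-i}: ~[1 0 0 ...] plus a tiny tail
    disp(max(abs(c(2:end))))     % residual ISI from truncation, about 0.5^31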
Solution: Use an iterative procedure to find the filter.
Let the output at iteration n be given by
y(n) = Σ_{i=−L}^{L} w_i(n) u(n − i)

where a (2L + 1)-tap filter is used
Setting w_i(n) = 0 for |i| > L, we can write

y(n) = Σ_i w_i(n) u(n − i)
     = Σ_i w_i u(n − i) + Σ_i [w_i(n) − w_i] u(n − i)
     = x(n) + v(n)    [since x(n) = Σ_i w_i u(n − i)]
Note: v(n) = Σ_i [w_i(n) − w_i] u(n − i) is the residual ISI
Interpretation: v(n) = Σ_i [w_i(n) − w_i] u(n − i) is the convolution noise (residual ISI) resulting from the fact that the ideal filter was not used
Estimation Approach: Apply the output y(n) = x(n) + v(n) to a zero-memory nonlinear estimator

      x̂(n) = g[y(n)]
Complete System:
The nonlinear estimate g[y(n)] can be used to update the equalizer to produce a better estimate at time n + 1
Application: Blind Deconvolution Optimization
Define the equalizer error to be

      e(n) = x̂(n) − y(n)

Use e(n) in the LMS algorithm to update the equalizer weights

      w_i(n + 1) = w_i(n) + μ u(n − i) e(n),   i = 0, ±1, ±2, · · · , ±L

Note that in this case, the cost function is

      J(n) = E{e²(n)} = E{(x̂(n) − y(n))²} = E{(g[y(n)] − y(n))²}
Observation: J(n) is a nonconvex function of the filter weights (g[·] is nonlinear)
The cost function can have numerous local minima
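A compact Matlab sketch of this blind update; the channel, step size, and the choice g[y] = sgn(y) are illustrative assumptions:

    N = 5000; L = 5; mu = 1e-3;
    x = sign(randn(1,N));                 % unknown binary source
    u = filter([0.3 1 0.3], 1, x);        % hypothetical ISI channel
    g = @(y) sign(y);                     % zero-memory nonlinearity g[.]
    w = zeros(2*L+1,1); w(L+1) = 1;       % center-spike initialization
    for n = 2*L+1:N
        r = u(n:-1:n-2*L).';              % regressor of 2L+1 received samples
        y = w'*r;                         % equalizer output y(n)
        e = g(y) - y;                     % e(n) = xhat(n) - y(n)
        w = w + mu*r*e;                   % LMS update of all 2L+1 taps
    end

Because J(n) is nonconvex, the initialization (here a center spike) matters: different starting points can converge to different local minima.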
To solve the optimization, first evaluate the convolution noise,
u(n) = Σ_k h_k x(n − k)   ⇒   u(n − i) = Σ_k h_k x(n − i − k)
Next, use this in

y(n) = Σ_i w_i u(n − i) + Σ_i [w_i(n) − w_i] u(n − i) = x(n) + v(n)
where recall
w_i ≡ perfect equalizer weights
w_i(n) ≡ finite approximate equalizer weights, with w_i(n) = 0 for |i| > L
Then using u(n − i) = Σ_k h_k x(n − i − k),

v(n) = Σ_i [w_i(n) − w_i] u(n − i)
     = Σ_i Σ_k h_k [w_i(n) − w_i] x(n − i − k)
     = Σ_l x(l) ∇(n − l)

where the last line follows from letting n − i − k = l and defining

∇(n) = Σ_k h_k [w_{n−k}(n) − w_{n−k}]
∇(n) is the residual impulse response of the channel due to imperfect equalization
∇(n) is small in value, but long and oscillatory
Since the convolution noise is given by

v(n) = Σ_l x(l) ∇(n − l)

⇒ E{v(n)} = Σ_l E{x(l)} ∇(n − l) = 0

Also,

E{v(n)v(n − j)} = E{ Σ_l x(l)∇(n − l) · Σ_m x(m)∇(n − m − j) }
                = Σ_l Σ_m ∇(n − l)∇(n − m − j) E{x(l)x(m)}
                = Σ_l ∇(n − l)∇(n − l − j)
Since ∇(n) is long and oscillatory,

E{v(n)v(n − j)} = Σ_l ∇(n − l)∇(n − l − j)

tends to average to 0 for j ≠ 0. Thus

E{v(n)v(n − j)} = { σ²,  j = 0
                  { 0,   j ≠ 0
where
      σ² = σ²(n) = Σ_l ∇²(n − l)

Observation: v(n) = Σ_l x(l)∇(n − l) is a weighted sum of i.i.d. RVs ⇒ v(n) is Gaussian (central limit theorem)
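The CLT argument is easy to visualize; a Matlab sketch using small, oscillatory stand-in weights for ∇(n − l):

    x = sqrt(12)*(rand(1e5,20) - 0.5);    % i.i.d. uniform, zero mean, unit variance
    g = (0.5.^(0:19)).*cos(1:20);         % small, oscillatory stand-in weights
    v = x*g.';                            % v(n) = sum_l x(l)*grad(n-l)
    hist(v, 100)                          % bell-shaped: approximately Gaussian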
Lastly, consider the cross correlation of x(n) and v(n):

E{x(n)v(n − j)} = E{ x(n) Σ_l x(l)∇(n − l − j) }
                = Σ_l ∇(n − l − j) E{x(n)x(l)}
                = ∇(−j)
                ≪ Σ_l ∇²(l) = σ²

Observation: Thus we can say x(n) and v(n) are essentially independent
Finally, consider the nonlinear estimation of x(n)
Component statistics:
x(n) is uniformly distributed with zero mean and unit variance
v(n) is white Gaussian noise with zero mean and variance σ2(n)
x(n) and v(n) are independent
Estimation Approach: Utilize Bayes estimation, which exploits knowledge of the distributions
The estimation risk is defined by

R = E{c(x, x̂(n))} = ∫_{−∞}^{∞} ∫_{−∞}^{∞} c(x, x̂(n)) f_{xy}(x, y) dy dx

Recall that if c(x, x̂(n)) = (x − x̂(n))², then the risk is minimized by

x̂ = ∫_{−∞}^{∞} x f_x(x|y) dx = E{x|y}    [conditional expectation]

where we can use the fact that

f_x(x|y) = f_y(y|x) f_x(x) / f_y(y)
Employ the current model,

      y = c0 x + v

where c0 < 1 is a scaling factor included to ensure E{y²} = 1.

Thus, since x and v are independent, f_y(y|x) = f_v(y − c0 x) and

x̂ = ∫_{−∞}^{∞} x f_x(x|y) dx
  = (1/f_y(y)) ∫_{−∞}^{∞} x f_y(y|x) f_x(x) dx
  = (1/f_y(y)) ∫_{−∞}^{∞} x f_v(y − c0 x) f_x(x) dx

where f_x(x) is uniform on [−√3, √3] (height 1/(2√3)) and f_v(v) is zero-mean Gaussian with variance σ²
Evaluating this yields (Bellini, 1988)

x̂ = (1/c0) y − (σ/c0) · [Z(y1) − Z(y2)] / [Q(y1) − Q(y2)]

where

y1 = (1/σ)(y + √3 c0)
y2 = (1/σ)(y − √3 c0)

and

Z(y) = (1/√(2π)) e^{−y²/2}              [normalized Gaussian pdf]
Q(y) = (1/√(2π)) ∫_y^∞ e^{−u²/2} du     [normalized Gaussian tail probability (Q-function)]
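A direct Matlab transcription of the estimator, using erfc for the Q-function (c0 and σ are assumed values):

    Z = @(t) exp(-t.^2/2)/sqrt(2*pi);      % standard Gaussian pdf
    Q = @(t) 0.5*erfc(t/sqrt(2));          % Gaussian tail probability
    c0 = 0.9; sigma = 0.2;                 % assumed model parameters
    y = -1.5:0.01:1.5;                     % range of equalizer outputs
    y1 = (y + sqrt(3)*c0)/sigma;
    y2 = (y - sqrt(3)*c0)/sigma;
    xhat = y/c0 - (sigma/c0)*(Z(y1) - Z(y2))./(Q(y1) - Q(y2));
    plot(y, xhat)                          % approaches a hard limiter as sigma -> 0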
Results for an 8-level PAM system:
Plotted for three different noise to (signal + noise) ratios
Note that this tends to a step function as the noise goes to zero
Implementation Point: When the blind equalizer has converged, the algorithm is switched to decision-directed mode.
In the case of binary symbols,
x(n) = { +1 for symbol 1
       { −1 for symbol 2

which results in the sample estimate

⇒ x̂(n) = sgn(y(n)) = { +1, if y(n) ≥ 0
                       { −1, else
Final System:
Final Observation: This system works well as long as what?