
2 FUNDAMENTALS OF ADAPTIVE FILTERING

2.1 INTRODUCTION

This chapter includes a brief review of deterministic and random signal representations. Due to the extent of those subjects, our review is limited to the concepts that are directly relevant to adaptive filtering. The properties of the correlation matrix of the input signal vector are investigated in some detail, since they play a key role in the statistical analysis of the adaptive-filtering algorithms.

The Wiener solution that represents the minimum mean-square error (MSE) solution of discrete-time filters realized through a linear combiner is also introduced. This solution depends on the input signal correlation matrix as well as on the cross-correlation between the elements of the input signal vector and the reference signal. The values of these correlations form the parameters of the MSE surface, which is a quadratic function of the adaptive-filter coefficients. The linearly constrained Wiener filter, a technique commonly used in antenna array processing applications, is also presented. The transformation of the constrained minimization problem into an unconstrained one is also discussed. Motivated by the importance of the properties of the MSE surface, we analyze them using some results related to the input signal correlation matrix.

In practice the parameters that determine the MSE surface shape are not available. What is left is to directly or indirectly estimate these parameters using the available data and to develop adaptive algorithms that use these estimates to search the MSE surface, such that the adaptive-filter coefficients converge to the Wiener solution in some sense. The starting point to obtain an estimation procedure is to investigate the convenience of applying the classical search methods of optimization theory [1]-[3] to adaptive filtering. The Newton and steepest-descent algorithms are investigated as possible search methods for adaptive filtering. Although neither method is directly applicable to practical adaptive filtering, clever insights inspired by them led to practical algorithms such as the least-mean-square (LMS) [4]-[5] and Newton-based algorithms. The Newton and steepest-descent algorithms are introduced in this chapter, whereas the LMS algorithm is treated in the next chapter.

Also, in the present chapter, the main applications of adaptive filters are revisited and discussed in greater detail.

P.S.R. Diniz, Adaptive Filtering, DOI: 10.1007/978-0-387-68606-6_2, © Springer Science+Business Media, LLC 2008


2.2 SIGNAL REPRESENTATION

In this section, we briefly review some concepts related to deterministic and random discrete-time signals. Only specific results essential to the understanding of adaptive filtering are reviewed. For further details on signals and digital signal processing we refer to [6]-[13].

2.2.1 Deterministic Signals

A deterministic discrete-time signal is characterized by a defined mathematical function of the time index k (the index k can also denote space in some applications), with k = 0, ±1, ±2, ±3, . . .. An example of a deterministic signal (or sequence) is

x(k) = e^{-\alpha k}\cos(\omega k) + u(k)    (2.1)

where u(k) is the unit step sequence.

The response of a linear time-invariant filter to an input x(k) is given by the convolution summation, as follows [7]:

y(k) = x(k) * h(k) = \sum_{n=-\infty}^{\infty} x(n)h(k-n) = \sum_{n=-\infty}^{\infty} h(n)x(k-n) = h(k) * x(k)    (2.2)

where h(k) is the impulse response of the filter.

The Z-transform of a given sequence x(k) is defined as

\mathcal{Z}\{x(k)\} = X(z) = \sum_{k=-\infty}^{\infty} x(k)z^{-k}    (2.3)

for regions in the Z-plane such that this summation converges. If the Z-transform is defined for a given region of the Z-plane, in other words the above summation converges in that region, the convolution operation can be replaced by a product of the Z-transforms as follows [7]:

Y(z) = H(z)X(z)    (2.4)

where Y(z), X(z), and H(z) are the Z-transforms of y(k), x(k), and h(k), respectively. Considering only waveforms that start at an instant k ≥ 0 and have finite power, their Z-transforms will always be defined outside the unit circle.

An alternative and more accurate notation for the convolution summation would be (x * h)(k) instead of x(k) * h(k), since in the latter the index k appears twice whereas the resulting convolution is simply a function of k. We will keep the former notation since it is more widely used.
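As a quick numerical sketch of equations (2.2)-(2.4), the snippet below convolves two short, arbitrary finite-length sequences (illustrative values, not from the text) and checks that the same result follows from multiplying the coefficients of their Z-transforms, viewed as polynomials in z^{-1}:

```python
import numpy as np

# For finite-length sequences, time-domain convolution (2.2) equals polynomial
# multiplication of the Z-transform coefficients (2.3)-(2.4).
x = np.array([1.0, 0.5, 0.25, 0.125])   # hypothetical input samples x(0..3)
h = np.array([1.0, -0.4, 0.2])          # hypothetical impulse response h(0..2)

y_time = np.convolve(x, h)              # y(k) = x(k) * h(k), equation (2.2)
y_z = np.polymul(x, h)                  # product of Z-transform coefficients, equation (2.4)

print(np.allclose(y_time, y_z))         # True: both give the same sequence y(k)
```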


For finite-energy waveforms, it is convenient to use the discrete-time Fourier transform defined as

\mathcal{F}\{x(k)\} = X(e^{j\omega}) = \sum_{k=-\infty}^{\infty} x(k)e^{-j\omega k}    (2.5)

Although the discrete-time Fourier transform does not exist for a signal with infinite energy, if the signal has finite power, a generalized discrete-time Fourier transform exists and is largely used for deterministic signals [16].

2.2.2 Random Signals

A random variable X is a function that assigns a number to every outcome, denoted by ϱ, of a given experiment. A stochastic process is a rule to describe the time evolution of the random variable depending on ϱ; therefore it is a function of two variables, X(k, ϱ). The set of all experimental outcomes, i.e., the ensemble, is the domain of ϱ. We denote by x(k) a sample of the given process with ϱ fixed, where in this case, if k is also fixed, x(k) is a number. When any statistical operator is applied to x(k) it is implied that x(k) is a random variable, k is fixed, and ϱ is variable. In this book, x(k) represents a random signal.

Random signals do not have a precise description of their waveforms. What is possible is to characterize them via measured statistics or through a probabilistic model. For random signals, the first- and second-order statistics are most of the time sufficient for characterization of the stochastic process. The first- and second-order statistics are also convenient for measurements. In addition, the effect on these statistics caused by linear filtering can be easily accounted for, as shown below.

Let's consider for the time being that the random signals are real. We start to introduce some tools to deal with random signals by defining the distribution function of a random variable as

P_{x(k)}(y) \triangleq probability of x(k) being smaller or equal to y

or

P_{x(k)}(y) = \int_{-\infty}^{y} p_{x(k)}(z)\, dz    (2.6)

The derivative of the distribution function is the probability density function (pdf)

p_{x(k)}(y) = \frac{dP_{x(k)}(y)}{dy}    (2.7)

The expected value, or mean value, of the process is defined by

m_x(k) = E[x(k)]    (2.8)

The definition of the expected value is expressed as

E[x(k)] = \int_{-\infty}^{\infty} y\, p_{x(k)}(y)\, dy    (2.9)


where p_{x(k)}(y) is the pdf of x(k) at the point y.

The autocorrelation function of the process x(k) is defined by

r_x(k,l) = E[x(k)x(l)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} yz\, p_{x(k),x(l)}(y,z)\, dy\, dz    (2.10)

where p_{x(k),x(l)}(y,z) is the joint probability density of the random variables x(k) and x(l) defined as

p_{x(k),x(l)}(y,z) = \frac{\partial^2 P_{x(k),x(l)}(y,z)}{\partial y\,\partial z}    (2.11)

where

P_{x(k),x(l)}(y,z) \triangleq probability of \{x(k) \le y and x(l) \le z\}

The autocovariance function is defined as

\sigma_x^2(k,l) = E\{[x(k)-m_x(k)][x(l)-m_x(l)]\} = r_x(k,l) - m_x(k)m_x(l)    (2.12)

where the second equality follows from the definitions of mean value and autocorrelation. For k = l, \sigma_x^2(k,l) = \sigma_x^2(k), which is the variance of x(k).

The most important specific example of a probability density function is the Gaussian density function, also known as the normal density function [14]-[15]. The Gaussian pdf is defined by

p_{x(k)}(y) = \frac{1}{\sqrt{2\pi\sigma_x^2(k)}}\, e^{-\frac{(y-m_x(k))^2}{2\sigma_x^2(k)}}    (2.13)

where m_x(k) and \sigma_x^2(k) are the mean and variance of x(k), respectively.

One justification for the importance of the Gaussian distribution is the central limit theorem. Given a random variable x composed of the sum of n independent random variables x_i as follows:

x = \sum_{i=1}^{n} x_i    (2.14)

the central limit theorem states that under certain general conditions, the probability density function of x approaches a Gaussian density function for large n. The mean and variance of x are given, respectively, by

m_x = \sum_{i=1}^{n} m_{x_i}    (2.15)

\sigma_x^2 = \sum_{i=1}^{n} \sigma_{x_i}^2    (2.16)

Considering that the values of the mean and variance of x can grow, define

x' = \frac{x - m_x}{\sigma_x}    (2.17)


In this case, for n →∞ it follows that

p_{x'}(y) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{y^2}{2}}    (2.18)
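A brief numerical sketch of the central limit theorem behind equations (2.14)-(2.18) is given below; the choice of uniform x_i and of the values of n and the number of trials are illustrative assumptions only:

```python
import numpy as np

# Sum n independent uniform random variables, normalize as in (2.17), and check
# that the sample statistics approach those of the standard Gaussian (2.18).
rng = np.random.default_rng(0)
n, trials = 50, 100000

xi = rng.uniform(-1.0, 1.0, size=(trials, n))   # independent x_i, mean 0, variance 1/3
x = xi.sum(axis=1)                              # x = sum of x_i, equation (2.14)

m_x = n * 0.0                                   # sum of means, equation (2.15)
var_x = n * (1.0 / 3.0)                         # sum of variances, equation (2.16)
x_norm = (x - m_x) / np.sqrt(var_x)             # normalization of equation (2.17)

print(x_norm.mean(), x_norm.var())              # approximately 0 and 1
```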

In a number of situations we require the calculation of conditional distributions, where the probability that a certain event occurs is calculated assuming that another event B has occurred. In this case, we define

P_{x(k)}(y|B) = \frac{P(\{x(k) \le y\} \cap B)}{P(B)}    (2.19)
\triangleq probability of x(k) \le y assuming B has occurred

This joint event consists of all outcomes ϱ ∈ B such that x(k) = x(k, ϱ) ≤ y. The definition of the conditional mean is given by

m_{x|B}(k) = E[x(k)|B] = \int_{-\infty}^{\infty} y\, p_{x(k)}(y|B)\, dy    (2.20)

where p_{x(k)}(y|B) is the pdf of x(k) conditioned on B.

The conditional variance is defined as

\sigma_{x|B}^2(k) = E\{[x(k)-m_{x|B}(k)]^2\,|\,B\} = \int_{-\infty}^{\infty} [y-m_{x|B}(k)]^2 p_{x(k)}(y|B)\, dy    (2.21)

There are processes for which the mean and autocorrelation functions are shift (or time) invariant, i.e.,

m_x(k-i) = m_x(k) = E[x(k)] = m_x    (2.22)

r_x(k,i) = E[x(k-j)x(i-j)] = r_x(k-i) = r_x(l)    (2.23)

and as a consequence

\sigma_x^2(l) = r_x(l) - m_x^2    (2.24)

These processes are said to be wide-sense stationary (WSS). If the nth-order statistics of a process are shift invariant, the process is said to be nth-order stationary. Also, if the process is nth-order stationary for any value of n, the process is stationary in the strict sense.

Two processes are considered jointly WSS if and only if any linear combination of them is also WSS. This is equivalent to stating that

y(k) = k_1 x_1(k) + k_2 x_2(k)    (2.25)

must be WSS, for any constants k_1 and k_2, if x_1(k) and x_2(k) are jointly WSS. This property implies that both x_1(k) and x_2(k) have shift-invariant means and autocorrelations, and that their cross-correlation is also shift invariant.

Or, equivalently, all outcomes such that X(ϱ) ≤ y.


For complex signals where x(k) = x_r(k) + jx_i(k), y = y_r + jy_i, and z = z_r + jz_i, we have the following definition of the expected value

E[x(k)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} y\, p_{x_r(k),x_i(k)}(y_r,y_i)\, dy_r\, dy_i    (2.26)

where p_{x_r(k),x_i(k)}(y_r,y_i) is the joint probability density function (pdf) of x_r(k) and x_i(k).

The autocorrelation function of the complex random signal x(k) is defined by

r_x(k,l) = E[x(k)x^*(l)]
= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} yz^*\, p_{x_r(k),x_i(k),x_r(l),x_i(l)}(y_r,y_i,z_r,z_i)\, dy_r\, dy_i\, dz_r\, dz_i    (2.27)

where * denotes complex conjugate, since we assume for now that we are dealing with complex signals, and p_{x_r(k),x_i(k),x_r(l),x_i(l)}(y_r,y_i,z_r,z_i) is the joint probability density function of the random variables x(k) and x(l).

For complex signals the autocovariance function is defined as

\sigma_x^2(k,l) = E\{[x(k)-m_x(k)][x(l)-m_x(l)]^*\} = r_x(k,l) - m_x(k)m_x^*(l)    (2.28)

2.2.2.1 Autoregressive Moving Average Process

The process resulting from the output of a system described by a general linear difference equation given by

y(k) = \sum_{j=0}^{M} b_j x(k-j) + \sum_{i=1}^{N} a_i y(k-i)    (2.29)

where x(k) is a white noise, is called an autoregressive moving average (ARMA) process. The coefficients a_i and b_j are the parameters of the ARMA process. The output signal y(k) is also said to be a colored noise since the autocorrelation function of y(k) is nonzero for a lag different from zero, i.e., r(l) ≠ 0 for some l ≠ 0.

For the special case where b_j = 0 for j = 1, 2, . . . , M, the resulting process is called an autoregressive (AR) process. The terminology means that the process depends on the present value of the input signal and on a linear combination of past samples of the process. This indicates the presence of feedback of the output signal.

For the special case where a_i = 0 for i = 1, 2, . . . , N, the process is identified as a moving average (MA) process. This terminology indicates that the process depends on a linear combination of the present and past samples of the input signal. In summary, an ARMA process can be generated by applying a white noise to the input of a digital filter with poles and zeros, whereas for the AR and MA cases the digital filters are all-pole and all-zero filters, respectively.
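As a small illustrative sketch of equation (2.29), the snippet below generates ARMA, AR, and MA processes by filtering white noise with pole-zero, all-pole, and all-zero filters; the coefficient values are arbitrary assumptions:

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(1)
x = rng.standard_normal(10000)      # white-noise input x(k)

b = [1.0, 0.5, 0.2]                 # b_j coefficients (zeros): MA part
a = [1.0, -0.8]                     # scipy convention: pass [1, -a_1, ...] so the
                                    # recursion adds +a_i y(k-i) as in (2.29)

y_arma = lfilter(b, a, x)           # ARMA: filter with poles and zeros
y_ar = lfilter([1.0], a, x)         # AR: all-pole filter (b_j = 0 for j >= 1)
y_ma = lfilter(b, [1.0], x)         # MA: all-zero filter (a_i = 0)
```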


2.2.2.2 Markov Process

A stochastic process is called a Markov process if its past has no influence on the future when the present is specified [14], [16]. In other words, the present behavior of the process depends only on the most recent past, i.e., all behavior previous to the most recent past is not required. A first-order AR process is a first-order Markov process, whereas an Nth-order AR process is considered an Nth-order Markov process. Take as an example the sequence

y(k) = ay(k-1) + n(k)    (2.30)

where n(k) is a white noise process. The process represented by y(k) is determined by y(k-1) and n(k), and no information before the instant k-1 is required. We conclude that y(k) represents a Markov process. In the previous example, if a = 1 and y(-1) = 0, the signal y(k), for k ≥ 0, is a sum of white noise samples, usually called a random walk sequence.

Formally, an mth-order Markov process satisfies the following condition: for all k ≥ 0, and for a fixed m, it follows that

P_{x(k)}(y\,|\,x(k-1), x(k-2), \ldots, x(0)) = P_{x(k)}(y\,|\,x(k-1), x(k-2), \ldots, x(k-m))    (2.31)

2.2.2.3 Wold Decomposition

Another important result related to any wide-sense stationary process x(k) is the Wold decomposition, which states that x(k) can be decomposed as

x(k) = x_r(k) + x_p(k)    (2.32)

where x_r(k) is a regular process that is equivalent to the response of a stable, linear, time-invariant, and causal filter to a white noise [16], and x_p(k) is a perfectly predictable (deterministic or singular) process. Also, x_p(k) and x_r(k) are orthogonal processes, i.e., E[x_r(k)x_p(k)] = 0. The key factor here is that the regular process can be modeled through a stable autoregressive model [24] with a stable and causal inverse. The importance of the Wold decomposition lies in the observation that a WSS process can in part be represented by an AR process of adequate order, with the remaining part consisting of a perfectly predictable process. Obviously, the perfectly predictable part of x(k) also admits an AR model with zero excitation.

2.2.2.4 Power Spectral Density

Stochastic signals that are wide-sense stationary are persistent and therefore are not finite-energy signals. On the other hand, they have finite power, so that the generalized discrete-time Fourier transform can be applied to them. When the generalized discrete-time Fourier transform is applied to a WSS process it leads to a random function of the frequency [16]. On the other hand, the autocorrelation functions of most practical stationary processes have a discrete-time Fourier transform. Therefore, the discrete-time Fourier transform of the autocorrelation function of a stationary random process can be very useful in many situations. This transform, called the power spectral density, is defined as

R_x(e^{j\omega}) = \sum_{l=-\infty}^{\infty} r_x(l)e^{-j\omega l} = \mathcal{F}[r_x(l)]    (2.33)


where r_x(l) is the autocorrelation of the process represented by x(k). The inverse discrete-time Fourier transform allows us to recover r_x(l) from R_x(e^{j\omega}), through the relation

r_x(l) = \frac{1}{2\pi}\int_{-\pi}^{\pi} R_x(e^{j\omega})e^{j\omega l}\, d\omega = \mathcal{F}^{-1}[R_x(e^{j\omega})]    (2.34)

It should be mentioned that R_x(e^{j\omega}) is a deterministic function of ω, and can be interpreted as the power density of the random process at a given frequency in the ensemble, i.e., considering the average outcome of all possible realizations of the process. In particular, the mean squared value of the process represented by x(k) is given by

r_x(0) = \frac{1}{2\pi}\int_{-\pi}^{\pi} R_x(e^{j\omega})\, d\omega    (2.35)

If the random signal representing any single realization of a stationary process is applied as input to a linear and time-invariant filter, with impulse response h(k), the following equalities are valid and can be easily verified:

y(k) = \sum_{n=-\infty}^{\infty} x(n)h(k-n) = x(k) * h(k)    (2.36)

r_y(l) = r_x(l) * r_h(l)    (2.37)

R_y(e^{j\omega}) = R_x(e^{j\omega})\,|H(e^{j\omega})|^2    (2.38)

r_{yx}(l) = r_x(l) * h(l) = E[x^*(l)y(l)]    (2.39)

R_{yx}(e^{j\omega}) = R_x(e^{j\omega})\,H(e^{j\omega})    (2.40)

where r_h(l) = h(l) * h(-l), R_y(e^{j\omega}) is the power spectral density of the output signal, r_{yx}(k) is the cross-correlation of x(k) and y(k), and R_{yx}(e^{j\omega}) is the corresponding cross-power spectral density.

The main feature of the spectral density function is to allow a simple analysis of the correlation behavior of WSS random signals processed with linear time-invariant systems. As an illustration, suppose a white noise is applied as input to a lowpass filter with impulse response h(k) and sharp cutoff at a given frequency ω_l. The autocorrelation function of the output signal y(k) will not be a single impulse; it will be h(k) * h(-k). Therefore, the signal y(k) will look like a band-limited random signal, in this case a slowly varying noise. Some properties of the function R_x(e^{j\omega}) of a discrete-time and stationary stochastic process are worth mentioning. The power spectral density is a periodic function of ω, with period 2π, as can be verified from its definition. Also, since for a stationary and complex random process we have r_x(-l) = r_x^*(l), R_x(e^{j\omega}) is real. Despite the usefulness of the power spectral density function in dealing with WSS processes, it will not be widely used in this book since the filters considered here are usually time varying. However, its important role in areas such as spectrum estimation [25]-[26] should be noted.

The average signal power in a sufficiently small frequency range Δω around a center frequency ω_0 is approximately given by \frac{\Delta\omega}{2\pi}R_x(e^{j\omega_0}).
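A minimal numerical sketch of equation (2.38) follows: white noise is filtered by an arbitrary illustrative filter and the estimated output PSD is compared with R_x(e^{jω})|H(e^{jω})|^2, with both PSDs estimated the same way so that scaling conventions cancel:

```python
import numpy as np
from scipy.signal import lfilter, welch, freqz

rng = np.random.default_rng(2)
x = rng.standard_normal(200000)                 # white noise realization
b, a = [1.0, 0.9], [1.0, -0.5]                  # hypothetical filter h(k)
y = lfilter(b, a, x)

f, Rx_est = welch(x, nperseg=1024)              # PSD estimate of the input
f, Ry_est = welch(y, nperseg=1024)              # PSD estimate of the output
_, H = freqz(b, a, worN=2 * np.pi * f)          # H(e^{jw}) on the same frequency grid

ratio = Ry_est / (Rx_est * np.abs(H) ** 2)      # should be close to 1 by (2.38)
print(np.median(ratio))
```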


If the Z-transforms of the autocorrelation and cross-correlation functions exist, we can generalize the definition of power spectral density. In particular, the definition of equation (2.33) corresponds to the following relation

\mathcal{Z}[r_x(k)] = R_x(z) = \sum_{k=-\infty}^{\infty} r_x(k)z^{-k}    (2.41)

If the random signal representing any single realization of a stationary process is applied as input to a linear and time-invariant filter with impulse response h(k), the following equalities are valid:

R_y(z) = R_x(z)H(z)H(z^{-1})    (2.42)

and

R_{yx}(z) = R_x(z)H(z)    (2.43)

where H(z) = \mathcal{Z}[h(l)]. If we wish to calculate the cross-correlation of y(k) and x(k), namely r_{yx}(0), we can use the inverse Z-transform formula as follows:

E[y(k)x^*(k)] = \frac{1}{2\pi j}\oint R_{yx}(z)\,\frac{dz}{z} = \frac{1}{2\pi j}\oint H(z)R_x(z)\,\frac{dz}{z}    (2.44)

where the integration path is a counterclockwise closed contour in the region of convergence of R_{yx}(z). The above contour integral is usually solved through Cauchy's residue theorem [8].

2.2.3 Ergodicity

In the probabilistic approach, the statistical parameters of the real data are obtained through ensemble averages (or expected values). The estimate of any parameter of the stochastic process can be obtained by averaging a large number of realizations of the given process, at each instant of time. However, in many applications only a few or even a single sample of the process is available. In these situations, we need to find out in which cases the statistical parameters of the process can be estimated by using the time average of a single sample (or ensemble member) of the process. This is obviously not possible if the desired parameter is time varying. The equivalence between the ensemble average and the time average is called ergodicity [14], [16].

The time average of a given stationary process represented by x(k) is calculated by

m_{x_N} = \frac{1}{2N+1}\sum_{k=-N}^{N} x(k)    (2.45)

If

\sigma_{m_{x_N}}^2 = \lim_{N\to\infty} E\{|m_{x_N} - m_x|^2\} = 0


the process is said to be mean-ergodic in the mean-square sense. Therefore, the mean-ergodic process has a time average that approximates the ensemble average as N → ∞. Obviously, m_{x_N} is an unbiased estimate of m_x since

E[m_{x_N}] = \frac{1}{2N+1}\sum_{k=-N}^{N} E[x(k)] = m_x    (2.46)

Therefore, the process will be considered ergodic if the variance of m_{x_N} tends to zero (\sigma_{m_{x_N}}^2 \to 0) when N → ∞. The variance \sigma_{m_{x_N}}^2 can be expressed after some manipulations as

\sigma_{m_{x_N}}^2 = \frac{1}{2N+1}\sum_{l=-2N}^{2N} \sigma_x^2(k+l,k)\left(1 - \frac{|l|}{2N+1}\right)    (2.47)

where \sigma_x^2(k+l,k) is the autocovariance of the stochastic process x(k). The variance of m_{x_N} tends to zero if and only if

\lim_{N\to\infty}\frac{1}{N}\sum_{l=0}^{N}\sigma_x^2(k+l,k) \to 0

The above condition is necessary and sufficient to guarantee that the process is mean-ergodic.

The ergodicity concept can be extended to higher order statistics. In particular, for second-order statistics we can define the process

x_l(k) = x(k+l)x^*(k)    (2.48)

where the mean of this process corresponds to the autocorrelation of x(k), i.e., r_x(l). Mean-ergodicity of x_l(k) implies mean-square ergodicity of the autocorrelation of x(k).

The time average of x_l(k) is given by

m_{x_l,N} = \frac{1}{2N+1}\sum_{k=-N}^{N} x_l(k)    (2.49)

which is an unbiased estimate of r_x(l). If the variance of m_{x_l,N} tends to zero as N tends to infinity, the process x(k) is said to be mean-square ergodic of the autocorrelation, i.e.,

\lim_{N\to\infty} E\{|m_{x_l,N} - r_x(l)|^2\} = 0    (2.50)

The above condition is satisfied if and only if

\lim_{N\to\infty}\frac{1}{N}\sum_{i=0}^{N} E\{x(k+l)x^*(k)x(k+l+i)x^*(k+i)\} - r_x^2(l) = 0    (2.51)

where it is assumed that x(n) has stationary fourth-order moments. The concept of ergodicity can be extended to nonstationary processes [16]; however, that is beyond the scope of this book.
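A short numerical sketch of equations (2.45)-(2.49) follows: for a long realization of an (assumed ergodic) AR(1) process, the time averages of x(k) and of x_l(k) approximate the ensemble mean and autocorrelation. The parameters are illustrative, and the closed-form autocorrelation used for comparison is the AR(1) expression derived later in Example 2.1(d):

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(3)
a1, sigma_n = 0.5, 1.0
n = sigma_n * rng.standard_normal(100000)
x = lfilter([1.0], [1.0, a1], n)                 # AR(1): x(k) = -a1 x(k-1) + n(k)

m_hat = x.mean()                                 # time-average estimate of m_x (true value 0)

def r_hat(x, l):
    # time-average estimate of r_x(l), the mean of x_l(k) = x(k+l) x*(k), eq. (2.48)
    return np.mean(x[l:] * np.conj(x[:len(x) - l]))

r1_theory = sigma_n**2 * (-a1) / (1 - a1**2)     # closed-form r_x(1) for this AR(1) model
print(m_hat, r_hat(x, 1), r1_theory)
```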


2.3 THE CORRELATION MATRIX

Usually, adaptive filters utilize the available input signals at instant k in their updating equations. These inputs are the elements of the input signal vector denoted by

x(k) = [x_0(k)\ x_1(k)\ \ldots\ x_N(k)]^T

The correlation matrix is defined as R = E[x(k)x^H(k)], where x^H(k) is the Hermitian transposition of x(k), which means transposition followed by complex conjugation or vice versa. As will be noted, the characteristics of the correlation matrix play a key role in the understanding of the properties of most adaptive-filtering algorithms. As a consequence, it is important to examine the main properties of the matrix R. Some properties of the correlation matrix come from the statistical nature of the adaptive-filtering problem, whereas other properties derive from linear algebra theory.

For a given input vector, the correlation matrix is given by

R = E[x(k)x^H(k)] =
\begin{bmatrix}
E[|x_0(k)|^2] & E[x_0(k)x_1^*(k)] & \cdots & E[x_0(k)x_N^*(k)] \\
E[x_1(k)x_0^*(k)] & E[|x_1(k)|^2] & \cdots & E[x_1(k)x_N^*(k)] \\
\vdots & \vdots & \ddots & \vdots \\
E[x_N(k)x_0^*(k)] & E[x_N(k)x_1^*(k)] & \cdots & E[|x_N(k)|^2]
\end{bmatrix}    (2.52)
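As a practical sketch of equation (2.52), the expectation can be approximated by averaging outer products of input snapshot vectors; the synthetic complex snapshots below are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(4)
M, K = 4, 50000                                  # vector length N+1 = 4 and number of snapshots

# rows of X are the snapshots x(k)^T
X = (rng.standard_normal((K, M)) + 1j * rng.standard_normal((K, M))) / np.sqrt(2)

R_hat = (X.T @ X.conj()) / K                     # sample average of x(k) x^H(k)

print(np.allclose(R_hat, R_hat.conj().T))        # Hermitian
print(np.linalg.eigvalsh(R_hat).min())           # eigenvalues nonnegative up to rounding
```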

The main properties of the R matrix are listed below:

1. The matrix R is positive semidefinite.

Proof:

Given an arbitrary complex weight vector w, we can form a signal given by

y(k) = w^H x(k)

The magnitude squared of y(k) is

y(k)y^*(k) = |y(k)|^2 = w^H x(k)x^H(k) w \ge 0

The mean-square (MS) value of y(k) is then given by

MS[y(k)] = E[|y(k)|^2] = w^H E[x(k)x^H(k)] w = w^H R w \ge 0

Therefore, the matrix R is positive semidefinite.

Usually, the matrix R is positive definite, unless the signals that compose the input vector are linearly dependent. Linearly dependent signals are rarely found in practice.


2. The matrix R is Hermitian, i.e.,

R = R^H    (2.53)

Proof:

R^H = E\{[x(k)x^H(k)]^H\} = E[x(k)x^H(k)] = R

3. A matrix is Toeplitz if the elements of the main diagonal and of any secondary diagonal are equal. When the input signal vector is composed of delayed versions of the same signal (that is, x_i(k) = x_0(k-i), for i = 1, 2, . . . , N) taken from a WSS process, matrix R is Toeplitz.

Proof:

For the delayed signal input vector, with x(k) WSS, matrix R has the following form

R = \begin{bmatrix}
r_x(0) & r_x(1) & \cdots & r_x(N) \\
r_x(-1) & r_x(0) & \cdots & r_x(N-1) \\
\vdots & \vdots & \ddots & \vdots \\
r_x(-N) & r_x(-N+1) & \cdots & r_x(0)
\end{bmatrix}    (2.54)

By examining the right-hand side of the above equation, we can easily conclude that R is Toeplitz.

Note that r_x^*(i) = r_x(-i), which also follows from the fact that the matrix R is Hermitian.

If matrix R given by equation (2.54) is nonsingular for a given N, the input signal is said to be persistently exciting of order N + 1. This means that the power spectral density R_x(e^{j\omega}) is different from zero at least at N + 1 points in the interval 0 < ω ≤ 2π. It also means that a nontrivial Nth-order FIR filter (with at least one nonzero coefficient) cannot filter x(k) to zero. Note that a nontrivial filter, with x(k) as input, would require at least N + 1 zeros in order to generate an output with all samples equal to zero. The absence of persistence of excitation implies the misbehavior of some adaptive algorithms [17], [18]. The definition of persistence of excitation is not unique, and it is algorithm dependent (see the book by Johnson [17] for further details).
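A brief sketch of equation (2.54): for a delay-line input the matrix is fully determined by the autocorrelation values r_x(0), ..., r_x(N). The autocorrelation used below is the AR(1) closed form worked later in Example 2.1(d), with an arbitrary illustrative coefficient:

```python
import numpy as np
from scipy.linalg import toeplitz

a1, N = -0.6, 3
r = (-a1) ** np.arange(N + 1) / (1 - a1 ** 2)    # r_x(l) for an AR(1) process, unit noise variance

R = toeplitz(r)                                   # real WSS case: r_x(-l) = r_x(l)

print(np.allclose(R, R.T))                        # symmetric (Hermitian for real data)
print(np.linalg.eigvalsh(R).min() > 0)            # positive definite: persistently exciting input
```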

From now on in this section, we discuss some properties of the correlation matrix related to its eigenvalues and eigenvectors. A number λ is an eigenvalue of the matrix R, with a corresponding eigenvector q, if and only if

Rq = \lambda q    (2.55)

or equivalently

\det(R - \lambda I) = 0    (2.56)

where I is the (N+1) by (N+1) identity matrix. Equation (2.56) is called the characteristic equation of R, and has (N+1) solutions for λ. We denote the (N+1) eigenvalues of R by \lambda_0, \lambda_1, \ldots, \lambda_N.


Note also that for every value of λ, the vector q = 0 satisfies equation (2.55); however, we consider only those particular values of λ that are linked to a nonzero eigenvector q.

Some important properties related to the eigenvalues and eigenvectors of R, that will be useful in the following chapters, are listed below.

1. The eigenvalues of R^m are \lambda_i^m, for i = 0, 1, 2, \ldots, N.

Proof:

By premultiplying equation (2.55) by R^{m-1}, we obtain

R^{m-1}Rq_i = R^{m-1}\lambda_i q_i = \lambda_i R^{m-2}Rq_i = \lambda_i R^{m-2}\lambda_i q_i = \lambda_i^2 R^{m-3}Rq_i = \cdots = \lambda_i^m q_i    (2.57)

2. Suppose R has N + 1 linearly independent eigenvectors q_i; then if we form a matrix Q with columns consisting of the q_i's, it follows that

Q^{-1}RQ = \begin{bmatrix}
\lambda_0 & 0 & \cdots & 0 \\
0 & \lambda_1 & & \vdots \\
\vdots & & \ddots & \\
0 & 0 & \cdots & \lambda_N
\end{bmatrix} = \Lambda    (2.58)

Proof:

RQ = R[q_0\ q_1\ \cdots\ q_N] = [\lambda_0 q_0\ \lambda_1 q_1\ \cdots\ \lambda_N q_N] = Q\Lambda

Therefore, since Q is invertible because the q_i's are linearly independent, we can show that

Q^{-1}RQ = \Lambda


3. The nonzero eigenvectors q_0, q_1, \ldots, q_N that correspond to different eigenvalues are linearly independent.

Proof:

Suppose we form a linear combination of the eigenvectors such that

a_0 q_0 + a_1 q_1 + \cdots + a_N q_N = 0    (2.59)

By multiplying the above equation by R we have

a_0 Rq_0 + a_1 Rq_1 + \cdots + a_N Rq_N = a_0\lambda_0 q_0 + a_1\lambda_1 q_1 + \cdots + a_N\lambda_N q_N = 0    (2.60)

Now by multiplying equation (2.59) by \lambda_N and subtracting the result from equation (2.60), we obtain

a_0(\lambda_0 - \lambda_N)q_0 + a_1(\lambda_1 - \lambda_N)q_1 + \cdots + a_{N-1}(\lambda_{N-1} - \lambda_N)q_{N-1} = 0

By repeating the above steps, i.e., multiplying the above equation by R in one instance and by \lambda_{N-1} in the other instance, and subtracting the results, we obtain

a_0(\lambda_0 - \lambda_N)(\lambda_0 - \lambda_{N-1})q_0 + a_1(\lambda_1 - \lambda_N)(\lambda_1 - \lambda_{N-1})q_1 + \cdots + a_{N-2}(\lambda_{N-2} - \lambda_N)(\lambda_{N-2} - \lambda_{N-1})q_{N-2} = 0

By repeating these steps several times, we end up with

a_0(\lambda_0 - \lambda_N)(\lambda_0 - \lambda_{N-1})\cdots(\lambda_0 - \lambda_1)q_0 = 0

Since we assumed \lambda_0 \ne \lambda_1, \lambda_0 \ne \lambda_2, \ldots, \lambda_0 \ne \lambda_N, and q_0 was assumed nonzero, then a_0 = 0.

The same line of thought can be used to show that a_0 = a_1 = a_2 = \cdots = a_N = 0 is the only solution for equation (2.59). Therefore, the eigenvectors corresponding to different eigenvalues are linearly independent.

Not all matrices are diagonalizable. A matrix of order (N+1) is diagonalizable if it possesses (N+1) linearly independent eigenvectors. A matrix with repeated eigenvalues can be diagonalized or not, depending on the linear dependency of the eigenvectors. A nondiagonalizable matrix is called defective [19].

4. Since the correlation matrix R is Hermitian, i.e., R^H = R, its eigenvalues are real. These eigenvalues are equal to or greater than zero given that R is positive semidefinite.

Proof:

First note that given an arbitrary complex vector w,

(w^H R w)^H = w^H R^H (w^H)^H = w^H R w

Therefore, w^H R w is a real number. Assume now that \lambda_i is an eigenvalue of R corresponding to the eigenvector q_i, i.e., Rq_i = \lambda_i q_i. By premultiplying this equation by q_i^H, it follows that

q_i^H R q_i = \lambda_i q_i^H q_i = \lambda_i\|q_i\|^2

where the operation \|a\|^2 = |a_0|^2 + |a_1|^2 + \cdots + |a_N|^2 is the Euclidean norm squared of the vector a, which is always real. Since the term on the left-hand side is also real, \|q_i\|^2 \ne 0, and R is positive semidefinite, we can conclude that \lambda_i is real and nonnegative.


Note that Q is not unique since each q_i can be multiplied by an arbitrary nonzero constant, and the resulting vector continues to be an eigenvector. For practical reasons, we consider only normalized eigenvectors having length one, that is

q_i^H q_i = 1 for i = 0, 1, \ldots, N    (2.61)

5. If R is a Hermitian matrix with different eigenvalues, the eigenvectors are orthogonal to each other. As a consequence, there is a diagonalizing matrix Q that is unitary, i.e., Q^H Q = I.

Proof:

Given two eigenvalues \lambda_i and \lambda_j, it follows that

Rq_i = \lambda_i q_i

and

Rq_j = \lambda_j q_j    (2.62)

Using the fact that R is Hermitian and that \lambda_i and \lambda_j are real, then

q_i^H R = \lambda_i q_i^H

and by multiplying this equation on the right by q_j, we get

q_i^H R q_j = \lambda_i q_i^H q_j

Now by premultiplying equation (2.62) by q_i^H, it follows that

q_i^H R q_j = \lambda_j q_i^H q_j

Therefore,

\lambda_i q_i^H q_j = \lambda_j q_i^H q_j

Since \lambda_i \ne \lambda_j, it can be concluded that

q_i^H q_j = 0 for i \ne j

If we form matrix Q with normalized eigenvectors, matrix Q is a unitary matrix.

We can also change the order in which the q_i's compose matrix Q, but this fact is not relevant for the present discussion.


An important result is that any Hermitian matrix R can be diagonalized by a suitable unitary matrix Q, even if the eigenvalues of R are not distinct. The proof is omitted here and can be found in [19]. Therefore, for Hermitian matrices with repeated eigenvalues it is always possible to find a complete set of orthonormal eigenvectors.

A useful form to decompose a Hermitian matrix, which results from the last property, is

R = Q\Lambda Q^H = \sum_{i=0}^{N} \lambda_i q_i q_i^H    (2.63)

known as the spectral decomposition. From this decomposition, one can easily derive the following relation

w^H R w = \sum_{i=0}^{N} \lambda_i w^H q_i q_i^H w = \sum_{i=0}^{N} \lambda_i |w^H q_i|^2    (2.64)

In addition, since q_i = \lambda_i R^{-1} q_i, the eigenvectors of a matrix and of its inverse coincide, whereas the eigenvalues are reciprocals of each other. As a consequence,

R^{-1} = \sum_{i=0}^{N} \frac{1}{\lambda_i} q_i q_i^H    (2.65)

Another consequence of the unitary property of Q for Hermitian matrices is that any Hermitian matrix can be written in the form

R = \begin{bmatrix}\sqrt{\lambda_0}q_0 & \sqrt{\lambda_1}q_1 & \cdots & \sqrt{\lambda_N}q_N\end{bmatrix}
\begin{bmatrix}
\sqrt{\lambda_0}q_0^H \\
\sqrt{\lambda_1}q_1^H \\
\vdots \\
\sqrt{\lambda_N}q_N^H
\end{bmatrix} = LL^H    (2.66)
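A small numerical sketch of equations (2.63)-(2.65) is given below; the matrix is an arbitrary symmetric positive definite example built for illustration:

```python
import numpy as np

A = np.array([[2.0, 0.5, 0.1],
              [0.3, 1.5, 0.2],
              [0.4, 0.1, 1.0]])
R = A @ A.T + np.eye(3)                    # symmetric positive definite test matrix

lam, Q = np.linalg.eigh(R)                 # real eigenvalues, orthonormal eigenvectors (Q unitary)

R_rebuilt = sum(lam[i] * np.outer(Q[:, i], Q[:, i]) for i in range(3))          # eq. (2.63)
Rinv_rebuilt = sum((1 / lam[i]) * np.outer(Q[:, i], Q[:, i]) for i in range(3)) # eq. (2.65)

print(np.allclose(R, R_rebuilt), np.allclose(np.linalg.inv(R), Rinv_rebuilt))
print(np.isclose(lam.sum(), np.trace(R)), np.isclose(lam.prod(), np.linalg.det(R)))
```

The last line also checks the trace and determinant relations discussed in the next property.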

6. The sum of the eigenvalues of R is equal to the trace of R, and the product of the eigenvalues of R is equal to the determinant of R.

Proof:

tr[Q^{-1}RQ] = tr[\Lambda]

where tr[A] = \sum_{i=0}^{N} a_{ii}. Since tr[A'A] = tr[AA'], we have

tr[Q^{-1}RQ] = tr[RQQ^{-1}] = tr[RI] = tr[R] = \sum_{i=0}^{N}\lambda_i

Also, \det[Q^{-1}RQ] = \det[R]\det[Q]\det[Q^{-1}] = \det[R] = \det[\Lambda] = \prod_{i=0}^{N}\lambda_i

This property is valid for any square matrix, but for more general matrices the proof differs from the one presented here.


7. The Rayleigh quotient, defined as

\mathcal{R} = \frac{w^H R w}{w^H w}    (2.67)

of a Hermitian matrix is bounded by the minimum and maximum eigenvalues, i.e.,

\lambda_{min} \le \mathcal{R} \le \lambda_{max}    (2.68)

where the minimum and maximum values are reached when the vector w is chosen to be the eigenvector corresponding to the minimum and maximum eigenvalues, respectively.

Proof:

Suppose w = Qw', where Q is the matrix that diagonalizes R; then

\mathcal{R} = \frac{w'^H Q^H R Q w'}{w'^H Q^H Q w'} = \frac{w'^H \Lambda w'}{w'^H w'} = \frac{\sum_{i=0}^{N}\lambda_i w_i'^2}{\sum_{i=0}^{N} w_i'^2}    (2.69)

It is then possible to show (see Problem 14) that the minimum value of the above expression occurs when w_i' = 0 for i \ne j, where \lambda_j is the smallest eigenvalue. Likewise, the maximum value of \mathcal{R} occurs when w_i' = 0 for i \ne l, where \lambda_l is the largest eigenvalue.

There are several ways to define the norm of a matrix. In this book the norm of a matrix R, denoted by \|R\|, is defined by

\|R\|^2 = \max_{w \ne 0}\frac{\|Rw\|^2}{\|w\|^2} = \max_{w \ne 0}\frac{w^H R^H R w}{w^H w}    (2.70)

Note that the norm of R is a measure of how a vector w grows in magnitude when it is multiplied by R.

When the matrix R is Hermitian, the norm of R is easily obtained by using the results of equations (2.57) and (2.68). The result is

\|R\| = \lambda_{max}    (2.71)

where \lambda_{max} is the maximum eigenvalue of R.

A common problem that we encounter in adaptive filtering is the solution of a system of linear equations such as

Rw = p (2.72)


In case there is an error in the vector p, originating from quantization or estimation, how does it affect the solution of the system of linear equations? For a positive definite Hermitian matrix R, it can be shown [19] that the relative error in the solution of the above linear system of equations is bounded by

\frac{\|\Delta w\|}{\|w\|} \le \frac{\lambda_{max}}{\lambda_{min}}\frac{\|\Delta p\|}{\|p\|}    (2.73)

where \lambda_{max} and \lambda_{min} are the maximum and minimum values of the eigenvalues of R, respectively. The ratio \lambda_{max}/\lambda_{min} is called the condition number of the matrix, that is,

C = \frac{\lambda_{max}}{\lambda_{min}} = \|R\|\,\|R^{-1}\|    (2.74)

The value of C influences the convergence behavior of a number of adaptive-filtering algorithms, as will be seen in the following chapters. A large value of C indicates that the matrix R is ill-conditioned, and that errors introduced by the manipulation of R may be greatly amplified. When C = 1, the matrix is perfectly conditioned. In case R represents the correlation matrix of the input signal of an adaptive filter, with the input vector composed of uncorrelated elements of a delay line (see Fig. 2.1.b, and the discussions around it), then C = 1.
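A quick sketch of equations (2.73)-(2.74), with an arbitrary strongly correlated 2 × 2 example, computes the condition number and checks that a small perturbation in p respects the bound:

```python
import numpy as np

R = np.array([[1.0, 0.9],
              [0.9, 1.0]])                      # strongly correlated inputs: large eigenvalue spread
p = np.array([1.0, 0.5])

lam = np.linalg.eigvalsh(R)
C = lam.max() / lam.min()                       # condition number, equation (2.74)
print(C, np.linalg.cond(R))                     # both give lambda_max / lambda_min here

w = np.linalg.solve(R, p)
dp = 1e-3 * np.array([1.0, -1.0])               # small perturbation in p
dw = np.linalg.solve(R, p + dp) - w

rel_err = np.linalg.norm(dw) / np.linalg.norm(w)
bound = C * np.linalg.norm(dp) / np.linalg.norm(p)
print(rel_err <= bound)                         # consistent with the bound (2.73)
```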

Example 2.1

Suppose the input signal vector is composed of the elements of a delay line with a single input signal, i.e.,

x(k) = [x(k)\ x(k-1)\ \ldots\ x(k-N)]^T

Given the following input signals:

(a) x(k) = n(k)

(b) x(k) = a\cos(\omega_0 k) + n(k)

(c) x(k) = \sum_{i=0}^{M} b_i n(k-i)

(d) x(k) = -a_1 x(k-1) + n(k)

(e) x(k) = a e^{j(\omega_0 k + n(k))}

where n(k) is a white noise with zero mean and variance \sigma_n^2; in case (e), n(k) is uniformly distributed in the range -\pi to \pi.


Calculate the autocorrelation matrix R for N = 3.

Solution:

(a) In this case, we have that E[x(k)x(k-l)] = \sigma_n^2\delta(l), where \delta(l) denotes an impulse sequence. Therefore,

R = E[x(k)x^T(k)] = \sigma_n^2\begin{bmatrix}
1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1
\end{bmatrix}    (2.75)

(b) In this example, n(k) is zero mean and uncorrelated with the deterministic cosine. The autocorrelation function can then be expressed as

r(k, k-l) = E[a^2\cos(\omega_0 k)\cos(\omega_0 k - \omega_0 l) + n(k)n(k-l)]
= a^2 E[\cos(\omega_0 k)\cos(\omega_0 k - \omega_0 l)] + \sigma_n^2\delta(l)
= \frac{a^2}{2}[\cos(\omega_0 l) + \cos(2\omega_0 k - \omega_0 l)] + \sigma_n^2\delta(l)    (2.76)

where \delta(l) again denotes an impulse sequence. Since part of the input signal is deterministic and nonstationary, the autocorrelation is time dependent.

For the 3 × 3 case the input signal correlation matrix R(k) becomes

R(k) = \frac{a^2}{2}\begin{bmatrix}
1 + \cos 2\omega_0 k + \frac{2}{a^2}\sigma_n^2 & \cos\omega_0 + \cos\omega_0(2k-1) & \cos 2\omega_0 + \cos 2\omega_0(k-1) \\
\cos\omega_0 + \cos\omega_0(2k-1) & 1 + \cos 2\omega_0(k-1) + \frac{2}{a^2}\sigma_n^2 & \cos\omega_0 + \cos\omega_0(2(k-1)-1) \\
\cos 2\omega_0 + \cos 2\omega_0(k-1) & \cos\omega_0 + \cos\omega_0(2(k-1)-1) & 1 + \cos 2\omega_0(k-2) + \frac{2}{a^2}\sigma_n^2
\end{bmatrix}

(c) By exploiting the fact that n(k) is a white noise, we can perform the following simplifications:

r(l) = E[x(k)x(k-l)] = E\left[\sum_{j=0}^{M-l}\sum_{i=0}^{M} b_i b_j n(k-i)n(k-l-j)\right]
= \sum_{j=0}^{M-l} b_j b_{l+j} E[n^2(k-l-j)] = \sigma_n^2\sum_{j=0}^{M} b_j b_{l+j}, \qquad 0 \le l+j \le M    (2.77)

where from the third to the fourth relation we used the fact that E[n(k-i)n(k-l-j)] = 0 for i \ne l+j. For M = 3, the correlation matrix has the following form

R = \sigma_n^2\begin{bmatrix}
\sum_{i=0}^{3} b_i^2 & \sum_{i=0}^{2} b_i b_{i+1} & \sum_{i=0}^{1} b_i b_{i+2} & b_0 b_3 \\
\sum_{i=0}^{2} b_i b_{i+1} & \sum_{i=0}^{3} b_i^2 & \sum_{i=0}^{2} b_i b_{i+1} & \sum_{i=0}^{1} b_i b_{i+2} \\
\sum_{i=0}^{1} b_i b_{i+2} & \sum_{i=0}^{2} b_i b_{i+1} & \sum_{i=0}^{3} b_i^2 & \sum_{i=0}^{2} b_i b_{i+1} \\
b_0 b_3 & \sum_{i=0}^{1} b_i b_{i+2} & \sum_{i=0}^{2} b_i b_{i+1} & \sum_{i=0}^{3} b_i^2
\end{bmatrix}    (2.78)


(d) By solving the difference equation, we can obtain the correlation between x(k) and x(k-l), that is

x(k) = (-a_1)^l x(k-l) + \sum_{j=0}^{l-1}(-a_1)^j n(k-j)    (2.79)

Multiplying both sides of the above equation by x(k-l) and taking the expected value of the result, we obtain

E[x(k)x(k-l)] = (-a_1)^l E[x^2(k-l)]    (2.80)

since x(k-l) is independent of n(k-j) for j ≤ l-1.

For l = 0, just calculate x^2(k) and apply the expectation operator to the result. The partial result is

E[x^2(k)] = a_1^2 E[x^2(k-1)] + E[n^2(k)]    (2.81)

therefore,

E[x^2(k)] = \frac{\sigma_n^2}{1 - a_1^2}    (2.82)

assuming x(k) is WSS.

The elements of R are then given by

r(l) = \frac{(-a_1)^{|l|}}{1 - a_1^2}\,\sigma_n^2    (2.83)

and the 3 × 3 autocorrelation matrix becomes

R = \frac{\sigma_n^2}{1 - a_1^2}\begin{bmatrix}
1 & -a_1 & a_1^2 \\
-a_1 & 1 & -a_1 \\
a_1^2 & -a_1 & 1
\end{bmatrix}
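A short numerical check of case (d) is sketched below: a long AR(1) realization is simulated (with arbitrary illustrative parameters) and the sample 3 × 3 autocorrelation matrix is compared with the closed form of equation (2.83):

```python
import numpy as np
from scipy.signal import lfilter
from scipy.linalg import toeplitz

rng = np.random.default_rng(5)
a1, sigma_n, N = 0.4, 1.0, 2                     # 3 x 3 case as displayed in the text

n = sigma_n * rng.standard_normal(500000)
x = lfilter([1.0], [1.0, a1], n)                 # x(k) = -a1 x(k-1) + n(k)

r = sigma_n**2 * (-a1) ** np.arange(N + 1) / (1 - a1**2)
R_theory = toeplitz(r)                           # closed-form matrix from equation (2.83)

r_hat = np.array([np.mean(x[l:] * x[:len(x) - l]) for l in range(N + 1)])
R_est = toeplitz(r_hat)                          # estimate from a single realization

print(np.max(np.abs(R_est - R_theory)))          # small for a long realization
```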

(e) In this case, we are interested in calculating the autocorrelation of a complex sequence, that is

r(l) = E[x(k)x^*(k-l)] = a^2 E[e^{-j(-\omega_0 l - n(k) + n(k-l))}]    (2.84)


By recalling the definition of expected value in equation (2.9), for l \ne 0,

r(l) = a^2 e^{j\omega_0 l}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-j(-n_0+n_1)}\, p_{n(k),n(k-l)}(n_0,n_1)\, dn_0\, dn_1
= a^2 e^{j\omega_0 l}\int_{-\pi}^{\pi}\int_{-\pi}^{\pi} e^{-j(-n_0+n_1)}\, p_{n(k)}(n_0)p_{n(k-l)}(n_1)\, dn_0\, dn_1
= a^2 e^{j\omega_0 l}\int_{-\pi}^{\pi}\int_{-\pi}^{\pi} e^{-j(-n_0+n_1)}\,\frac{1}{2\pi}\,\frac{1}{2\pi}\, dn_0\, dn_1
= a^2 e^{j\omega_0 l}\frac{1}{4\pi^2}\int_{-\pi}^{\pi}\int_{-\pi}^{\pi} e^{-j(-n_0+n_1)}\, dn_0\, dn_1
= a^2 e^{j\omega_0 l}\frac{1}{4\pi^2}\left[\int_{-\pi}^{\pi} e^{jn_0}\, dn_0\right]\left[\int_{-\pi}^{\pi} e^{-jn_1}\, dn_1\right]
= a^2 e^{j\omega_0 l}\frac{1}{4\pi^2}\left[\frac{e^{j\pi}-e^{-j\pi}}{j}\right]\left[\frac{-e^{-j\pi}+e^{j\pi}}{j}\right]
= a^2 e^{j\omega_0 l}\frac{1}{\pi^2}(\sin\pi)(\sin\pi) = 0    (2.85)

where in the fifth equality we used the fact that n(k) and n(k-l), for l \ne 0, are independent.

For l = 0

r(0) = E[x(k)x^*(k)] = a^2 e^{j(\omega_0\cdot 0)} = a^2

Therefore,

r(l) = E[x(k)x^*(k-l)] = a^2 e^{j\omega_0 l}\delta(l)

where in the 3 × 3 case

R = \begin{bmatrix}
a^2 & 0 & 0 \\
0 & a^2 & 0 \\
0 & 0 & a^2
\end{bmatrix}

In passing, we have verified that when two complex exponentials have uniformly distributed white noise in the range of -k\pi to k\pi as exponents, where k is a positive integer, these exponentials are nonorthogonal only if l = 0.

In the remaining part of this chapter and in the following chapters, we will treat the algorithms for real and complex signals separately. The derivations of the adaptive-filtering algorithms for complex signals are usually straightforward extensions of the real-signal cases, and some of them are left as exercises.


2.4 WIENER FILTER

One of the most widely used objective functions in adaptive filtering is the mean-square error (MSE), defined as

F[e(k)] = \xi(k) = E[e^2(k)] = E[d^2(k) - 2d(k)y(k) + y^2(k)]    (2.86)

where d(k) is the reference signal as illustrated in Fig. 1.1.

Suppose the adaptive filter consists of a linear combiner, i.e., the output signal is composed of a linear combination of signals coming from an array as depicted in Fig. 2.1.a. In this case,

y(k) = \sum_{i=0}^{N} w_i(k)x_i(k) = w^T(k)x(k)    (2.87)

where x(k) = [x_0(k)\ x_1(k)\ \ldots\ x_N(k)]^T and w(k) = [w_0(k)\ w_1(k)\ \ldots\ w_N(k)]^T are the input signal and the adaptive-filter coefficient vectors, respectively.

In many applications, each element of the input signal vector consists of a delayed version of the same signal, that is: x_0(k) = x(k), x_1(k) = x(k-1), . . . , x_N(k) = x(k-N). Note that in this case the signal y(k) is the result of applying an FIR filter to the input signal x(k).

Since most of the analyses and algorithms presented in this book apply equally to the linear combiner and the FIR filter cases, we will mostly consider the latter case throughout the rest of the book. The main reason for this decision is that the fast algorithms for the recursive least-squares solution, to be discussed in the forthcoming chapters, explore the fact that the input signal vector consists of the output of a delay line with a single input signal, and, as a consequence, are not applicable to the linear combiner case.

The most straightforward realization for the adaptive filter is through the direct-form FIR structure as illustrated in Fig. 2.1.b, with the output given by

y(k) = \sum_{i=0}^{N} w_i(k)x(k-i) = w^T(k)x(k)    (2.88)

where x(k) = [x(k)\ x(k-1)\ \ldots\ x(k-N)]^T is the input vector representing a tapped-delay line, and w(k) = [w_0(k)\ w_1(k)\ \ldots\ w_N(k)]^T is the tap-weight vector.

In both the linear combiner and FIR filter cases, the objective function can be rewritten as

E[e^2(k)] = \xi(k) = E[d^2(k) - 2d(k)w^T(k)x(k) + w^T(k)x(k)x^T(k)w(k)]
= E[d^2(k)] - 2E[d(k)w^T(k)x(k)] + E[w^T(k)x(k)x^T(k)w(k)]    (2.89)

For a filter with fixed coefficients, the MSE function in a stationary environment is given by

\xi = E[d^2(k)] - 2w^T E[d(k)x(k)] + w^T E[x(k)x^T(k)]w
= E[d^2(k)] - 2w^T p + w^T R w    (2.90)


Figure 2.1 (a) Linear combiner; (b) Adaptive FIR filter.


where p = E[d(k)x(k)] is the cross-correlation vector between the desired and input signals, and R = E[x(k)x^T(k)] is the input signal correlation matrix. As can be noted, the objective function \xi is a quadratic function of the tap-weight coefficients, which would allow a straightforward solution for w that minimizes \xi, if vector p and matrix R are known. Note that matrix R corresponds to the Hessian matrix of the objective function defined in the previous chapter.

If the adaptive filter is implemented through an IIR filter, the objective function is a nonquadratic function of the filter parameters, turning the minimization problem into a much more difficult one. Local minima are likely to exist, rendering some solutions obtained by gradient-based algorithms unacceptable. Despite these disadvantages, adaptive IIR filters are needed in a number of applications where the order of a suitable FIR filter is too high. Typical applications include data equalization in communication channels and cancellation of acoustic echo, see Chapter 10.

The gradient vector of the MSE function related to the filter tap-weight coefficients is given by

g_w = \frac{\partial\xi}{\partial w} = \left[\frac{\partial\xi}{\partial w_0}\ \ \frac{\partial\xi}{\partial w_1}\ \ \ldots\ \ \frac{\partial\xi}{\partial w_N}\right]^T = -2p + 2Rw    (2.91)

By equating the gradient vector to zero and assuming R is nonsingular, the optimal values for the tap-weight coefficients that minimize the objective function can be evaluated as follows:

w_o = R^{-1}p    (2.92)

This solution is called the Wiener solution. Unfortunately, in practice, precise estimates of R and p are not available. When the input and the desired signals are ergodic, one is able to use time averages to estimate R and p, which is implicitly performed by most adaptive algorithms.

If we replace the optimal solution for w in the MSE expression, we can calculate the minimum MSE provided by the Wiener solution:

\xi_{min} = E[d^2(k)] - 2w_o^T p + w_o^T R R^{-1} p
= E[d^2(k)] - w_o^T p    (2.93)

The above equation indicates that the optimal set of parameters removes part of the power of the desired signal through the cross-correlation between x(k) and d(k), assuming both signals stationary. If the reference signal and the input signal are orthogonal, the optimal coefficients are equal to zero and the minimum MSE is E[d^2(k)]. This result is expected since nothing can be done with the parameters in order to minimize the MSE if the input signal carries no information about the desired signal. In this case, if any of the taps is nonzero, it would only increase the MSE.
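A minimal sketch of equations (2.90)-(2.93) for a hypothetical system identification setup is given below; the plant coefficients, data lengths, and noise level are illustrative assumptions, and R and p are estimated by time averages:

```python
import numpy as np

rng = np.random.default_rng(6)
N, K = 3, 200000
w_true = np.array([0.5, -0.3, 0.2, 0.1])         # hypothetical unknown plant of order N = 3

x = rng.standard_normal(K)
X = np.column_stack([np.roll(x, i) for i in range(N + 1)])   # rows approximate [x(k) ... x(k-N)]
d = X @ w_true + 0.01 * rng.standard_normal(K)   # reference signal with small measurement noise

R = (X.T @ X) / K                                # time-average estimate of E[x(k) x^T(k)]
p = (X.T @ d) / K                                # time-average estimate of E[d(k) x(k)]

w_o = np.linalg.solve(R, p)                      # Wiener solution, equation (2.92)
xi_min = np.mean(d**2) - w_o @ p                 # minimum MSE, equation (2.93)
print(w_o, xi_min)
```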

An important property of the Wiener filter can be deduced if we analyze the gradient of the error surface at the optimal solution. The gradient vector can be expressed as follows:

g_w = \frac{\partial E[e^2(k)]}{\partial w} = E\left[2e(k)\frac{\partial e(k)}{\partial w}\right] = -E[2e(k)x(k)]    (2.94)

Some books define g_w as \left[\frac{\partial\xi}{\partial w}\right]^T; here we follow the notation more widely used in the subject matter.


With the coefficients set at their optimal values, i.e., at the Wiener solution, the gradient vector is equal to zero, implying that

E[e(k)x(k)] = 0    (2.95)

or

E[e(k)x(k-i)] = 0    (2.96)

for i = 0, 1, . . . , N. This means that the error signal is orthogonal to the elements of the input signal vector. In case either the error or the input signal has zero mean, the orthogonality property implies that e(k) and x(k) are uncorrelated.

The orthogonality principle also applies to the correlation between the output signal y(k) and the error e(k), when the tap weights are given by w = w_o. By premultiplying equation (2.95) by w_o^T, the desired result follows:

E[e(k)w_o^T x(k)] = E[e(k)y(k)] = 0    (2.97)

The gradient with respect to a complex parameter has not been defined. For our purposes the complex gradient vector can be defined as [25]

g_{w(k)}\{F(e(k))\} = \frac{1}{2}\left\{\frac{\partial F[e(k)]}{\partial\mathrm{re}[w(k)]} - j\frac{\partial F[e(k)]}{\partial\mathrm{im}[w(k)]}\right\}

where re[·] and im[·] indicate the real and imaginary parts of [·], respectively. Note that the partial derivatives are calculated for each element of w(k).

For the complex case, the error signal and the MSE are described, respectively, by (see Appendix A for details)

e(k) = d(k) - w^H(k)x(k)    (2.98)

and

\xi = E[|e(k)|^2] = E[|d(k)|^2] - 2\,\mathrm{re}\{w^H E[d^*(k)x(k)]\} + w^H E[x(k)x^H(k)]w
= E[|d(k)|^2] - 2\,\mathrm{re}[w^H p] + w^H R w    (2.99)

where p = E[d^*(k)x(k)] is the cross-correlation vector between the desired and input signals, and R = E[x(k)x^H(k)] is the input signal correlation matrix. The Wiener solution in this case is also given by equation (2.92).

Example 2.2

The input signal of a first-order adaptive filter is described by

x(k) = \alpha_1 x_1(k) + \alpha_2 x_2(k)


where x_1(k) and x_2(k) are first-order AR processes that are mutually uncorrelated, both having unit variance. These signals are generated by applying distinct white noises to first-order filters whose poles are placed at -s_1 and -s_2, respectively.

(a) Calculate the autocorrelation matrix of the input signal.

(b) If the desired signal consists of x_2(k), calculate the Wiener solution.

Solution:

(a)

The models for the signals involved are described by

x_i(k) = -s_i x_i(k-1) + \kappa_i n_i(k)

for i = 1, 2. According to equation (2.83), the autocorrelation of either x_i(k) is given by

E[x_i(k)x_i(k-l)] = \kappa_i^2\,\frac{(-s_i)^{|l|}}{1 - s_i^2}\,\sigma_{n,i}^2    (2.100)

where \sigma_{n,i}^2 is the variance of n_i(k). Since each signal x_i(k) has unit variance, then by applying l = 0 to the above equation,

\kappa_i^2 = \frac{1 - s_i^2}{\sigma_{n,i}^2}    (2.101)

Now by utilizing the fact that x_1(k) and x_2(k) are uncorrelated, the autocorrelation matrix of the input signal and the cross-correlation vector between the desired and input signals are

R = \begin{bmatrix}
\alpha_1^2 + \alpha_2^2 & -\alpha_1^2 s_1 - \alpha_2^2 s_2 \\
-\alpha_1^2 s_1 - \alpha_2^2 s_2 & \alpha_1^2 + \alpha_2^2
\end{bmatrix}

p = \begin{bmatrix}
\alpha_2 \\
-\alpha_2 s_2
\end{bmatrix}

(b)

The Wiener solution can then be expressed as

w_o = R^{-1}p
= \frac{1}{(\alpha_1^2+\alpha_2^2)^2 - (\alpha_1^2 s_1 + \alpha_2^2 s_2)^2}
\begin{bmatrix}
\alpha_1^2+\alpha_2^2 & \alpha_1^2 s_1 + \alpha_2^2 s_2 \\
\alpha_1^2 s_1 + \alpha_2^2 s_2 & \alpha_1^2+\alpha_2^2
\end{bmatrix}
\begin{bmatrix}
\alpha_2 \\
-\alpha_2 s_2
\end{bmatrix}

= \frac{1}{\left(1+\frac{\alpha_2^2}{\alpha_1^2}\right)^2 - \left(s_1 + \frac{\alpha_2^2}{\alpha_1^2}s_2\right)^2}
\begin{bmatrix}
1+\frac{\alpha_2^2}{\alpha_1^2} & s_1 + \frac{\alpha_2^2}{\alpha_1^2}s_2 \\
s_1 + \frac{\alpha_2^2}{\alpha_1^2}s_2 & 1+\frac{\alpha_2^2}{\alpha_1^2}
\end{bmatrix}
\begin{bmatrix}
\frac{\alpha_2}{\alpha_1^2} \\
-\frac{\alpha_2}{\alpha_1^2}s_2
\end{bmatrix}

= \alpha_2
\begin{bmatrix}
\frac{1}{\alpha_1^2+\alpha_2^2 - s_1\alpha_1^2 - s_2\alpha_2^2} & 0 \\
0 & \frac{1}{\alpha_1^2+\alpha_2^2 + s_1\alpha_1^2 + s_2\alpha_2^2}
\end{bmatrix}
\begin{bmatrix}
\frac{1-s_2}{2} \\
-\frac{1+s_2}{2}
\end{bmatrix}


Let's assume that in this example our task was to detect the presence of x_2(k) in the input signal. For a fixed input-signal power, from this solution it is possible to observe that a lower signal-to-interference ratio at the input, that is, a lower \frac{\alpha_2^2}{\alpha_1^2}, leads to a Wiener solution vector with lower norm. This result reflects the fact that the Wiener solution tries to detect the desired signal while at the same time avoiding enhancement of the undesired signal, i.e., the interference x_1(k).
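A numerical check of Example 2.2 can be sketched as follows (the parameter values are illustrative assumptions): the two unit-variance AR(1) processes are simulated, R and p are estimated by time averages, and the resulting Wiener solution is compared with w_o = R^{-1}p built from the closed-form R and p above:

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(7)
alpha1, alpha2, s1, s2, K = 1.0, 0.5, 0.3, -0.2, 500000

def unit_var_ar1(s, size):
    kappa = np.sqrt(1 - s**2)                     # kappa_i from equation (2.101), sigma_{n,i} = 1
    return lfilter([kappa], [1.0, s], rng.standard_normal(size))

x1, x2 = unit_var_ar1(s1, K), unit_var_ar1(s2, K)
x = alpha1 * x1 + alpha2 * x2
d = x2                                            # desired signal

X = np.column_stack([x, np.r_[0.0, x[:-1]]])      # first-order filter input: [x(k), x(k-1)]
R_hat, p_hat = (X.T @ X) / K, (X.T @ d) / K

R = np.array([[alpha1**2 + alpha2**2, -alpha1**2*s1 - alpha2**2*s2],
              [-alpha1**2*s1 - alpha2**2*s2, alpha1**2 + alpha2**2]])
p = np.array([alpha2, -alpha2*s2])

print(np.linalg.solve(R_hat, p_hat), np.linalg.solve(R, p))   # the two should be close
```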

2.5 LINEARLY CONSTRAINED WIENER FILTER

In a number of applications, it is required to impose some linear constraints on the filter coefficients such that the optimal solution is the one that achieves the minimum MSE, provided the constraints are met. Typical constraints are: unity norm of the parameter vector; linear phase of the adaptive filter; prescribed gains at given frequencies.

In the particular case of an array of antennas, the measured signals can be linearly combined to form a directional beam, where the signal impinging on the array in the desired direction will have higher gain. This application is called beamforming, where we specify gains at certain directions of arrival. It is clear that the array is introducing another dimension to the received data, namely spatial information. The weights in the antennas can be made adaptive, leading to the so-called adaptive antenna arrays. This is the principle behind the concept of smart antennas, where a set of adaptive array processors filter the signals coming from the array and direct the beam to several different directions where a potential communication is required. For example, in a wireless communication system we are able to form a beam for each subscriber according to its position, ultimately leading to minimization of noise from the environment and interference from other subscribers.

In order to develop the theory of linearly constrained optimal filters, let us consider the particular application of a narrowband beamformer required to pass without distortion all signals arriving at 90° with respect to the array of antennas. All other sources of signals shall be treated as interferers and must be attenuated as much as possible. Fig. 2.2 illustrates the application. Note that in case the signal of interest does not impinge on the array at 90° with respect to the array, a steering operation in the constraint vector c (to be defined) has to be performed [22].

The optimal filter that satisfies the linear constraints is called the linearly-constrained minimum-variance (LCMV) filter.

If the desired signal source is sufficiently far from the array of antennas, then we may assume that the wavefronts are planar at the array. Therefore, the wavefront from the desired source will reach all antennas at the same instant, whereas the wavefront from the interferer will reach each antenna at different time instants. Taking the antenna with input signal x_0 as a time reference t_0, the wavefront will reach the ith antenna at [22]

t_i = t_0 + \frac{i\, d \cos\theta}{c}


Figure 2.2 Narrowband beamformer.

where θ is the angle between the antenna array and the interferer's direction of arrival, d is the distance between neighboring antennas, and c is the speed of propagation of the wave (3 × 10^8 m/s).

For this particular case, the LCMV filter is the one that minimizes the array output signal energy

\xi = E[y^2(k)] = E[w^T x(k) x^T(k) w]

subject to

\sum_{j=0}^{N} c_j w_j = f    (2.102)

where

w = [w_0 \;\; w_1 \;\; \ldots \;\; w_N]^T
x(k) = [x_0(k) \;\; x_1(k) \;\; \ldots \;\; x_N(k)]^T


and

c = [1 1 . . . 1]T

is the constraint vector, since θ = 90◦. The desired gain is usually f = 1.

In the case the desired signal impinges on the array at an angle θ with respect to the array, the incoming signal reaches the ith antenna delayed by i\, d\cos\theta / c with respect to the 0th antenna [23]. Let us consider the case of a narrowband array such that all antennas detect the impinging signal with the same amplitude when measured taking into consideration their relative delays, which are multiples of d\cos\theta / c. In such a case the optimal receiver coefficients would be

w_i = \frac{e^{j\omega\tau_i}}{N+1}    (2.103)

for i = 0, 1, \ldots, N, in order to add coherently the delays of the desired incoming signal at a given direction θ. The impinging signal appears at the ith antenna multiplied by e^{-j\omega\tau_i}, considering the particular array configuration of Fig. 2.2. In this uniform linear array, the antenna locations are

p_i = i\, d

for i = 0, 1, \ldots, N. Using the 0th antenna as reference, the signal will reach the array according to the following pattern

c = e^{j\omega t}\,[1 \;\; e^{-j\omega \frac{d\cos\theta}{c}} \;\; e^{-j\omega \frac{2d\cos\theta}{c}} \;\; \ldots \;\; e^{-j\omega \frac{Nd\cos\theta}{c}}]^T
  = e^{j\omega t}\,[1 \;\; e^{-j\frac{2\pi}{\lambda} d\cos\theta} \;\; e^{-j\frac{2\pi}{\lambda} 2d\cos\theta} \;\; \ldots \;\; e^{-j\frac{2\pi}{\lambda} Nd\cos\theta}]^T    (2.104)

where the equality \omega/c = 2\pi/\lambda was employed, with λ being the wavelength corresponding to the frequency ω.

By defining the variable \psi(\omega,\theta) = \frac{2\pi}{\lambda} d\cos\theta, we can describe the output signal of the beamformer as

y = e^{j\omega t} \sum_{i=0}^{N} w_i\, e^{-j\psi(\omega,\theta)\, i}
  = e^{j\omega t} H(\omega,\theta)    (2.105)

where H(ω, θ) modifies the amplitude and phase of the transmitted signal at a given frequency ω. Note that the shaping function H(ω, θ) depends on the impinging angle.

For the sake of illustration, if the antenna separation is d = \lambda/2, \theta = 60°, and N is odd, then the constraint vector would be

c = [1 \;\; e^{-j\pi/2} \;\; e^{-j\pi} \;\; \ldots \;\; e^{-jN\pi/2}]^T
  = [1 \;\; -j \;\; -1 \;\; \ldots \;\; e^{-jN\pi/2}]^T    (2.106)
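As an aside, the constraint vector of equation (2.104) is easy to evaluate numerically. The sketch below is illustrative only: it drops the common factor e^{j\omega t}, uses a hypothetical array size, and reproduces the all-ones vector for broadside arrival and the entries of equation (2.106) for d = \lambda/2 and \theta = 60°.

import numpy as np

def constraint_vector(theta_deg, N, d_over_lambda):
    # psi(w, theta) = (2*pi/lambda) * d * cos(theta); common factor e^{jwt} dropped
    psi = 2 * np.pi * d_over_lambda * np.cos(np.deg2rad(theta_deg))
    return np.exp(-1j * psi * np.arange(N + 1))

print(constraint_vector(90, 5, 0.5))   # broadside: all entries equal to one
print(constraint_vector(60, 5, 0.5))   # [1, -j, -1, ...], as in equation (2.106)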


Using the method of Lagrange multipliers, we can rewrite the constrained minimization problem described in equation (2.102) as

\xi_c = E[w^T x(k) x^T(k) w] + \lambda(c^T w - f)    (2.107)

The gradient of ξc with respect to w is equal to

gw = 2Rw + λc (2.108)

where R = E[x(k) x^T(k)]. For a positive definite matrix R, the value of w that satisfies g_w = 0 is unique and minimizes ξ_c. Denoting w_o as the optimal solution, we have

2 R w_o + \lambda c = 0
2 c^T w_o + \lambda c^T R^{-1} c = 0
2 f + \lambda c^T R^{-1} c = 0

where in order to obtain the second equality we premultiply the first equation by c^T R^{-1}. Therefore,

\lambda = -2 (c^T R^{-1} c)^{-1} f

and the LCMV filter is

wo = R−1c(cT R−1c)−1f (2.109)

If more constraints need to be satisfied by the filter, these can be easily incorporated in a constraint matrix and in a gain vector, such that

CT w = f (2.110)

In this case, the LCMV filter is given by

wo = R−1C(CT R−1C)−1f (2.111)

If there is a desired signal, the natural objective is the minimization of the MSE, not the output energy as in the narrowband beamformer. In this case, it is straightforward to modify equation (2.107) and obtain the optimal solution

wo = R−1p + R−1C(CT R−1C)−1(f− CT R−1p) (2.112)

where p = E[d(k) x(k)] (see Problem 20).

In the case of complex input signals and constraints the optimal solution is given by

wo = R−1p + R−1C(CHR−1C)−1(f− CHR−1p) (2.113)

where CHw = f.
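For real-valued signals and constraints, equations (2.109)-(2.111) can be evaluated directly. The sketch below uses an arbitrary positive definite R and an arbitrary constraint set, chosen only to exercise the formulas; the numbers carry no physical meaning.

import numpy as np

rng = np.random.default_rng(0)

N = 4                                        # filter order: N + 1 coefficients
A = rng.standard_normal((N + 1, N + 1))
R = A @ A.T + (N + 1) * np.eye(N + 1)        # an arbitrary positive definite R

# Single constraint c^T w = f, equation (2.109)
c, f = np.ones(N + 1), 1.0
Rinv_c = np.linalg.solve(R, c)
w_o = Rinv_c * f / (c @ Rinv_c)
print(c @ w_o)                               # equals f

# Several constraints C^T w = f, equation (2.111)
C = np.stack([np.ones(N + 1), np.arange(N + 1.0)], axis=1)
f_vec = np.array([1.0, 0.0])
Rinv_C = np.linalg.solve(R, C)
w_o = Rinv_C @ np.linalg.solve(C.T @ Rinv_C, f_vec)
print(C.T @ w_o)                             # equals f_vec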


Figure 2.3 The generalized sidelobe canceller.

2.5.1 The Generalized Sidelobe Canceller

An alternative implementation to the direct-form constrained adaptive filter shown above is called the generalized sidelobe canceller (GSC) (see Fig. 2.3) [20].

For this structure the input signal vector is transformed by a matrix

T = [C B] (2.114)

where C is the constraint matrix and B is a blocking matrix that spans the null space of C, i.e., matrix B satisfies

BT C = 0 (2.115)

The output signal y(k) shown in Fig. 2.3 is formed as

y(k) = w_u^T C^T x(k) + w_l^T B^T x(k)
     = (C w_u + B w_l)^T x(k)
     = (T \bar{w})^T x(k)
     = w^T x(k)    (2.116)

where \bar{w} = [w_u^T \;\; w_l^T]^T and w = T\bar{w}.

The linear constraints are satisfied if C^T w = f. But as C^T B = 0, the condition to be satisfied becomes

CT w = CT Cwu = f (2.117)

Therefore, for the GSC structure shown in Fig. 2.3 there is a necessary condition that the upper part of the coefficient vector, w_u, should be initialized as

wu = (CT C)−1f (2.118)

Minimization of the output energy is achieved with a proper choice of w_l. In fact, we transformed a constrained optimization problem into an unconstrained one, which in turn can be solved with the


classical linear Wiener filter, i.e.,

\min_{w_l} E[y^2(k)] = \min_{w_l} E\{[y_u(k) + w_l^T x_l(k)]^2\}, whose minimizer is w_{l,o} = -R_l^{-1} p_l    (2.119)

where

R_l = E[x_l(k) x_l^T(k)]
    = E[B^T x(k) x^T(k) B]
    = B^T E[x(k) x^T(k)] B
    = B^T R B    (2.120)

and

p_l = E[y_u(k)\, x_l(k)] = E[x_l(k)\, y_u(k)]
    = E[B^T x(k)\, w_u^T C^T x(k)]
    = E[B^T x(k)\, x^T(k) C w_u]
    = B^T E[x(k)\, x^T(k)] C w_u
    = B^T R C w_u
    = B^T R C (C^T C)^{-1} f    (2.121)

where in the above derivations we utilized the results and definitions from equations (2.116) and (2.118).

Using equations (2.120), (2.121), and (2.118), it is possible to show that

w_{l,o} = -(B^T R B)^{-1} B^T R C (C^T C)^{-1} f    (2.122)

Given that w_{l,o} is the solution to an unconstrained minimization problem of transformed quantities, any unconstrained adaptive filter can be used to estimate recursively this optimal solution. The drawback in the implementation of the GSC structure comes from the transformation of the input signal vector via a constraint matrix and a blocking matrix. Although in theory any matrix with linearly independent columns that spans the null space of C can be employed, in many cases the computational complexity resulting from the multiplication of B by x(k) can be prohibitive. Furthermore, if the transformation matrix T is not orthogonal, finite-precision effects may yield an overall unstable system. A simple solution that guarantees orthogonality in the transformation and low computational complexity can be obtained with a Householder transformation [21].
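A small numerical sketch, with an arbitrary positive definite R and constraint set (the values carry no physical meaning), can be used to confirm that the GSC solution above coincides with the direct LCMV filter of equation (2.111) whenever B spans the null space of C^T:

import numpy as np

rng = np.random.default_rng(1)

N = 5
A = rng.standard_normal((N + 1, N + 1))
R = A @ A.T + (N + 1) * np.eye(N + 1)          # arbitrary positive definite R
C = np.stack([np.ones(N + 1), np.arange(N + 1.0)], axis=1)
f = np.array([1.0, 0.0])

# Direct LCMV solution, equation (2.111)
Rinv_C = np.linalg.solve(R, C)
w_direct = Rinv_C @ np.linalg.solve(C.T @ Rinv_C, f)

# GSC solution: B spans the null space of C^T (so B^T C = 0)
U, _, _ = np.linalg.svd(C)                     # left singular vectors of C
B = U[:, C.shape[1]:]                          # orthogonal complement of range(C)
w_u = np.linalg.solve(C.T @ C, f)              # equation (2.118)
R_l = B.T @ R @ B                              # equation (2.120)
p_l = B.T @ R @ C @ w_u                        # equation (2.121)
w_l = -np.linalg.solve(R_l, p_l)               # equation (2.122)
w_gsc = C @ w_u + B @ w_l                      # w = C w_u + B w_l

print(np.allclose(w_direct, w_gsc))            # True: both give the same filter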

2.6 MEAN-SQUARE ERROR SURFACE

The mean-square error is a quadratic function of the parameters w. Assuming a given fixed w, the MSE is not a function of time and can be expressed as

ξ = σ2d − 2wT p + wT Rw (2.123)


where \sigma_d^2 is the variance of d(k), assumed zero mean. The MSE is a quadratic function of the tap weights, forming a hyperparaboloid surface. The MSE surface is convex and has only positive values. For two weights, the surface is a paraboloid. Fig. 2.4 illustrates the MSE surface for a numerical example where w has two coefficients. If the MSE surface is intersected by a plane parallel to the w plane, placed at a level above ξ_min, the intersection consists of an ellipse representing equal-MSE contours, as depicted in Fig. 2.5. Note that this figure shows three distinct ellipses, corresponding to different levels of MSE. The ellipses of constant MSE are all concentric.


Figure 2.4 Mean-square error surface.

In order to understand the properties of the MSE surface, it is convenient to define a translated coefficient vector as follows:

Δw = w− wo (2.124)

The MSE can be expressed as a function of Δw as follows:

\xi = \sigma_d^2 - w_o^T p + w_o^T p - 2 w^T p + w^T R w
    = \xi_{min} - \Delta w^T p - w^T R w_o + w^T R w
    = \xi_{min} - \Delta w^T p + w^T R \Delta w
    = \xi_{min} - w_o^T R \Delta w + w^T R \Delta w
    = \xi_{min} + \Delta w^T R \Delta w    (2.125)

where we used the results of equations (2.92) and (2.93). The corresponding error surface contours are depicted in Fig. 2.6.



Figure 2.5 Contours of the MSE surface.


Figure 2.6 Translated contours of the MSE surface.


By employing the diagonalized form of R, the last equation can be rewritten as follows:

\xi = \xi_{min} + \Delta w^T Q \Lambda Q^T \Delta w
    = \xi_{min} + v^T \Lambda v
    = \xi_{min} + \sum_{i=0}^{N} \lambda_i v_i^2    (2.126)

where v = QT Δw are the rotated parameters.

The above form for representing the MSE surface is an uncoupled form, in the sense that each component of the gradient vector of the MSE with respect to the rotated parameters is a function of a single parameter, that is

g_v[\xi] = [2\lambda_0 v_0 \;\; 2\lambda_1 v_1 \;\; \ldots \;\; 2\lambda_N v_N]^T

This property means that if all v_i's are zero except one, the gradient direction coincides with the nonzero parameter axis. In other words, the rotated parameters represent the principal axes of the hyperellipse of constant MSE, as illustrated in Fig. 2.7. Note that since the rotated parameters are the result of the projection of the original parameter vector Δw on the eigenvector directions q_i, it is straightforward to conclude that the eigenvectors represent the principal axes of the constant-MSE hyperellipses.

The matrix of second derivatives of ξ with respect to the rotated parameters is Λ. We can note that the gradient will be steeper along the principal axes corresponding to larger eigenvalues. In the two-axes case, this is the direction in which the ellipse is narrow.
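The decoupling provided by the rotated parameters is easy to check numerically. The sketch below uses an arbitrary 2 x 2 correlation matrix and an arbitrary value of ξ_min, chosen only for illustration:

import numpy as np

R = np.array([[1.0, 0.4], [0.4, 1.0]])    # arbitrary symmetric positive definite R
xi_min = 0.1                              # arbitrary minimum MSE
dw = np.array([0.7, -1.2])                # some displacement Delta w from w_o

lam, Q = np.linalg.eigh(R)                # R = Q diag(lam) Q^T
v = Q.T @ dw                              # rotated parameters

xi_direct = xi_min + dw @ R @ dw          # equation (2.125)
xi_rotated = xi_min + np.sum(lam * v**2)  # equation (2.126)
print(np.isclose(xi_direct, xi_rotated))  # True

print(2 * lam * v)                        # decoupled gradient with respect to v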

2.7 BIAS AND CONSISTENCY

The correct interpretation of the results obtained by the adaptive-filtering algorithm requires the definitions of bias and consistency. An estimate is considered unbiased if the following condition is satisfied

E[w(k)] = wo (2.127)

The difference E[w(k)]− wo is called the bias in the parameter estimate.

An estimate is considered consistent if

w(k) → wo as k →∞ (2.128)

Note that since w(k) is a random variable, it is necessary to define in which sense the limit is taken. Usually, the limit with probability one is employed. In the case of identification, a system is considered identifiable if the given parameter estimates are consistent. For a more formal treatment on this subject refer to [18].



Figure 2.7 Rotated contours of the MSE surface.

2.8 NEWTON ALGORITHM

In the context of the MSE minimization discussed in the previous section, see equation (2.123), the coefficient-vector updating using the Newton method is performed as follows:

w(k + 1) = w(k)− μR−1gw(k) (2.129)

where its derivation originates from equation (1.4). Assuming the true gradient and the matrix R are available, the coefficient-vector updating can be expressed as

w(k + 1) = w(k)− μR−1[−2p + 2Rw(k)] = (I− 2μI)w(k) + 2μwo (2.130)

where if μ = 1/2, the Wiener solution is reached in one step.

The Wiener solution can be approached using a Newton-like search algorithm, by updating the adaptive-filter coefficients as follows:

w(k+1) = w(k) - \mu \hat{R}^{-1}(k)\, \hat{g}_w(k)    (2.131)

where \hat{R}^{-1}(k) is an estimate of R^{-1} and \hat{g}_w(k) is an estimate of g_w, both at instant k. The parameter μ is the convergence factor that regulates the convergence rate. Newton-based algorithms present, in general, fast convergence. However, the estimate of R^{-1} is computationally intensive and can become numerically unstable if special care is not taken. These factors made the steepest-descent-based algorithms more popular in adaptive-filtering applications.
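A minimal sketch of the Newton recursion, assuming (unrealistically) that the exact R and p are available, is given below; the numerical values of R and p are arbitrary and serve only to show that μ = 1/2 reaches the Wiener solution in one step.

import numpy as np

def newton_search(R, p, mu, n_iter, w0):
    # Newton recursion of equation (2.129) using the true gradient -2p + 2Rw
    w = np.asarray(w0, dtype=float)
    Rinv = np.linalg.inv(R)
    for _ in range(n_iter):
        g = -2 * p + 2 * R @ w
        w = w - mu * Rinv @ g
    return w

R = np.array([[1.0, 0.4], [0.4, 1.0]])     # arbitrary illustrative values
p = np.array([0.3, -0.1])
w_newton = newton_search(R, p, mu=0.5, n_iter=1, w0=[0.0, 0.0])
print(w_newton, np.linalg.solve(R, p))     # identical after a single step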


2.9 STEEPEST-DESCENT ALGORITHM

In order to get a practical feeling of a problem that is being solved using the steepest-descent algorithm, we assume that the optimal coefficient vector, i.e., the Wiener solution, is w_o, and that the reference signal is not corrupted by measurement noise^8.

The main objective of the present section is to study the rate of convergence, the stability, and the steady-state behavior of an adaptive filter whose coefficients are updated through the steepest-descent algorithm. It is worth mentioning that the steepest-descent method can be considered an efficient gradient-type algorithm, in the sense that it works with the true gradient vector, and not with an estimate of it. Therefore, the performance of other gradient-type algorithms can at most be close to the performance of the steepest-descent algorithm. When the objective function is the MSE, the difficult task of obtaining the matrix R and the vector p prevents the steepest-descent algorithm from being useful in adaptive-filtering applications. Its performance, however, serves as a benchmark for gradient-based algorithms.

The steepest-descent algorithm updates the coefficients in the following general form

w(k + 1) = w(k)− μgw(k) (2.132)

where the above expression is equivalent to equation (1.6). It is worth noting that several alternative gradient-based algorithms replace g_w(k) by an estimate \hat{g}_w(k), and they differ in the way the gradient vector is estimated. The true gradient expression is given in equation (2.91) and, as can be noted, it depends on the vector p and the matrix R, which are usually not available.

Substituting equation (2.91) in equation (2.132), we get

w(k + 1) = w(k)− 2μRw(k) + 2μp (2.133)

Now, some of the main properties related to the convergence behavior of the steepest-descent algorithm in a stationary environment are described. First, an analysis is required to determine the influence of the convergence factor μ on the convergence behavior of the steepest-descent algorithm.

The error in the adaptive-filter coefficients when compared to the Wiener solution is defined as

Δw(k) = w(k)− wo (2.134)

The steepest-descent algorithm can then be described in an alternative way, that is:

\Delta w(k+1) = \Delta w(k) - 2\mu[R w(k) - R w_o]
             = \Delta w(k) - 2\mu R \Delta w(k)
             = (I - 2\mu R)\, \Delta w(k)    (2.135)

where the relation p = R w_o (see equation (2.92)) was employed. It can be shown from the above equation that

\Delta w(k+1) = (I - 2\mu R)^{k+1}\, \Delta w(0)    (2.136)

8 Noise added to the reference signal, originating from the environment and/or thermal noise.


or

w(k+1) = w_o + (I - 2\mu R)^{k+1}[w(0) - w_o]    (2.137)

Premultiplying equation (2.135) by Q^T, where Q is the unitary matrix that diagonalizes R through a similarity transformation, yields

Q^T \Delta w(k+1) = (I - 2\mu Q^T R Q)\, Q^T \Delta w(k)

v(k+1) = (I - 2\mu\Lambda)\, v(k)
       = \begin{bmatrix} 1-2\mu\lambda_0 & 0 & \cdots & 0 \\ 0 & 1-2\mu\lambda_1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1-2\mu\lambda_N \end{bmatrix} v(k)    (2.138)

In the above equation, v(k+1) = Q^T \Delta w(k+1) is the rotated coefficient-vector error. Using induction, equation (2.138) can be rewritten as

v(k+1) = (I - 2\mu\Lambda)^{k+1} v(0)
       = \begin{bmatrix} (1-2\mu\lambda_0)^{k+1} & 0 & \cdots & 0 \\ 0 & (1-2\mu\lambda_1)^{k+1} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & (1-2\mu\lambda_N)^{k+1} \end{bmatrix} v(0)    (2.139)

This equation shows that in order to guarantee the convergence of the coefficients, each element 1 - 2\mu\lambda_i must have an absolute value less than one. As a consequence, the convergence factor of the steepest-descent algorithm must be chosen in the range

0 < \mu < \frac{1}{\lambda_{max}}    (2.140)

where \lambda_{max} is the largest eigenvalue of R. In this case, all the elements of the diagonal matrix in equation (2.139) tend to zero as k \to \infty, resulting in v(k+1) \to 0 for large k.

A μ value in the above range guarantees that the coefficient vector approaches the optimum coefficient vector w_o. It should be mentioned that if matrix R has a large eigenvalue spread, the convergence speed of the coefficients will be primarily dependent on the value of the smallest eigenvalue. Note that the slowest decaying element in equation (2.139) is given by (1 - 2\mu\lambda_{min})^{k+1}.

The MSE presents a transient behavior during the adaptation process, which can be analyzed in a straightforward way if we employ the diagonalized version of R. Recalling from equation (2.125) that

\xi(k) = \xi_{min} + \Delta w^T(k) R \Delta w(k)    (2.141)


the MSE can then be simplified as follows:

\xi(k) = \xi_{min} + \Delta w^T(k) Q \Lambda Q^T \Delta w(k)
       = \xi_{min} + v^T(k) \Lambda v(k)
       = \xi_{min} + \sum_{i=0}^{N} \lambda_i v_i^2(k)    (2.142)

If we apply the result of equation (2.139) in equation (2.142), it can be shown that the following relation results

\xi(k) = \xi_{min} + v^T(k-1)(I - 2\mu\Lambda)\Lambda(I - 2\mu\Lambda)\, v(k-1)
       = \xi_{min} + \sum_{i=0}^{N} \lambda_i (1 - 2\mu\lambda_i)^{2k} v_i^2(0)    (2.143)

The analyses presented in this section show that before the steepest-descent algorithm reaches the steady-state behavior, there is a transient period where the error is usually high and the coefficients are far from the Wiener solution. As can be seen from equation (2.139), in the case of the adaptive-filter coefficients, the convergence will follow (N+1) geometrically decaying curves with ratios r_{wi} = (1 - 2\mu\lambda_i). Each of these curves can be approximated by an exponential envelope with time constant \tau_{wi} as follows [5]:

r_{wi} = e^{-\frac{1}{\tau_{wi}}} = 1 - \frac{1}{\tau_{wi}} + \frac{1}{2!\,\tau_{wi}^2} - \cdots    (2.144)

In general, r_{wi} is slightly smaller than one, especially in the case of slowly decreasing modes that correspond to small values of \lambda_i and \mu. Therefore,

r_{wi} = (1 - 2\mu\lambda_i) \approx 1 - \frac{1}{\tau_{wi}}    (2.145)

and then

\tau_{wi} \approx \frac{1}{2\mu\lambda_i}

for i = 0, 1, \ldots, N.

For the convergence of the MSE, the range of values of μ is the same that guarantees the convergence of the coefficients. In this case, due to the exponent 2k in equation (2.143), the geometrically decaying curves have ratios given by r_{ei} = (1 - 4\mu\lambda_i), which can be approximated by exponential envelopes with time constants given by

\tau_{ei} \approx \frac{1}{4\mu\lambda_i}    (2.146)

for i = 0, 1, \ldots, N, where it was considered that 4\mu^2\lambda_i^2 \ll 1. In the convergence of both the error and the coefficients, the time required for convergence depends on the ratio between the largest and smallest eigenvalues of the input signal correlation matrix. Further discussions on convergence properties that apply to gradient-type algorithms can be found in Chapter 3.
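These convergence properties are simple to reproduce numerically. The sketch below, with arbitrary illustrative values of R and p, runs the steepest-descent recursion of equation (2.133), checks the stability bound of equation (2.140), and prints the slowest time constant predicted by \tau_{wi} \approx 1/(2\mu\lambda_i):

import numpy as np

def steepest_descent(R, p, mu, n_iter, w0):
    # Recursion of equation (2.133): w(k+1) = w(k) - 2*mu*(R w(k) - p)
    w = np.asarray(w0, dtype=float)
    for _ in range(n_iter):
        w = w - 2 * mu * (R @ w - p)
    return w

R = np.array([[1.0, 0.4], [0.4, 1.0]])          # arbitrary illustrative values
p = np.array([0.3, -0.1])
w_o = np.linalg.solve(R, p)
lam = np.linalg.eigvalsh(R)

mu = 0.05                                       # inside 0 < mu < 1/lambda_max
w = steepest_descent(R, p, mu, 500, [-1.0, -2.0])
print(np.allclose(w, w_o, atol=1e-8))           # True: the recursion converged

print(1.0 / (2 * mu * lam.min()))               # slowest coefficient time constant

mu_bad = 1.05 / lam.max()                       # violates the bound of (2.140)
print(np.abs(1 - 2 * mu_bad * lam))             # one mode has ratio magnitude > 1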


Example 2.3

The matrix R and the vector p are known for a given experimental environment:

R = \begin{bmatrix} 1 & 0.4045 \\ 0.4045 & 1 \end{bmatrix}

p = [0 \;\; 0.2939]^T

E[d^2(k)] = 0.5

(a) Deduce the equation for the MSE.

(b) Choose a small value for μ, and, starting the parameters at [-1 \;\; -2]^T, plot the convergence path of the steepest-descent algorithm on the MSE surface.

(c) Repeat the previous item for the Newton algorithm, starting at [0 \;\; -2]^T.

Solution:

(a) The MSE function is given by

\xi = E[d^2(k)] - 2 w^T p + w^T R w
    = \sigma_d^2 - 2[w_1 \; w_2]\begin{bmatrix} 0 \\ 0.2939 \end{bmatrix} + [w_1 \; w_2]\begin{bmatrix} 1 & 0.4045 \\ 0.4045 & 1 \end{bmatrix}\begin{bmatrix} w_1 \\ w_2 \end{bmatrix}

After performing the algebraic calculations, we obtain the following result

\xi = 0.5 + w_1^2 + w_2^2 + 0.8090\, w_1 w_2 - 0.5878\, w_2

(b) The steepest-descent algorithm was applied to minimize the MSE using a convergence factor μ = 0.1/\lambda_{max}, where \lambda_{max} = 1.4045. The convergence path of the algorithm on the MSE surface is depicted in Fig. 2.8. As can be noted, the path followed by the algorithm first approaches the main axis (eigenvector) corresponding to the smaller eigenvalue, and then follows toward the minimum in a direction increasingly aligned with this main axis.

(c) The Newton algorithm was also applied to minimize the MSE using a convergence factor μ = 0.1/\lambda_{max}. The convergence path of the Newton algorithm on the MSE surface is depicted in Fig. 2.9. The Newton algorithm follows a straight path to the minimum.



Figure 2.8 Convergence path of the steepest-descent algorithm.


Figure 2.9 Convergence path of the Newton algorithm.


2.10 APPLICATIONS REVISITED

In this section, we give a brief introduction to the typical applications where the adaptive-filtering algorithms are required, including a discussion of where in the real world these applications are found. The main objective of this section is to illustrate how the adaptive-filtering algorithms, in general, and the ones presented in the book, in particular, are applied to solve practical problems. It should be noted that the detailed analysis of any particular application is beyond the scope of this book. Nevertheless, a number of specific references are given for the interested reader. The distinctive feature of each application is the way the adaptive-filter input signal and the desired signal are chosen. Once these signals are determined, any known properties of them can be used to understand the expected behavior of the adaptive filter when attempting to minimize the chosen objective function (for example, the MSE, ξ).

2.10.1 System Identification

The typical setup of the system identification application is depicted in Fig. 2.10. A common input signal is applied to the unknown system and to the adaptive filter. Usually, the input signal is a wideband signal, in order to allow the adaptive filter to converge to a good model of the unknown system.


Figure 2.10 System identification.

Assume the unknown system has an impulse response given by h(k), for k = 0, 1, 2, 3, \ldots, \infty, and zero for k < 0. The error signal is then given by

e(k) = d(k) - y(k)
     = \sum_{l=0}^{\infty} h(l)\, x(k-l) - \sum_{i=0}^{N} w_i(k)\, x(k-i)    (2.147)

where wi(k) are the coefficients of the adaptive filter.


Assuming that x(k) is a white noise, the MSE for a fixed w is given by

\xi = E\{[h^T x_\infty(k) - w^T x_{N+1}(k)]^2\}
    = E[h^T x_\infty(k) x_\infty^T(k) h - 2 h^T x_\infty(k) x_{N+1}^T(k) w + w^T x_{N+1}(k) x_{N+1}^T(k) w]
    = \sigma_x^2 \sum_{i=0}^{\infty} h^2(i) - 2\sigma_x^2 h^T \begin{bmatrix} I_{N+1} \\ 0_{\infty\times(N+1)} \end{bmatrix} w + w^T R_{N+1} w    (2.148)

where x_\infty(k) and x_{N+1}(k) are the input signal vectors with infinite and finite lengths, respectively.

By calculating the derivative of ξ with respect to the coefficients of the adaptive filter, it follows that

w_o = h_{N+1}    (2.149)

where

h_{N+1}^T = h^T \begin{bmatrix} I_{N+1} \\ 0_{\infty\times(N+1)} \end{bmatrix}    (2.150)

If the input signal is a white noise, the best model for the unknown system is a system whose impulse response coincides with the first N+1 samples of the unknown-system impulse response. In the cases where the impulse response of the unknown system is of finite length and the adaptive filter is of sufficient order (i.e., it has a sufficient number of parameters), the MSE becomes zero if there is no measurement noise (or channel noise). In practical applications the measurement noise is unavoidable, and if it is uncorrelated with the input signal, the expected value of the adaptive-filter coefficients will coincide with the unknown-system impulse response samples. The output error will of course be the measurement noise. We can observe that the measurement noise introduces a variance in the estimates of the unknown system parameters.
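This conclusion can be verified with a simple Monte-Carlo sketch: a hypothetical long impulse response plays the role of the unknown system, the input is white noise, and the Wiener solution is computed from sample estimates of R and p. The setup and numbers below are illustrative only.

import numpy as np

rng = np.random.default_rng(2)

h = 0.8 ** np.arange(50)          # hypothetical unknown-system impulse response
N = 3                             # adaptive filter with N + 1 = 4 coefficients
K = 200000

x = rng.standard_normal(K)        # white input with unit variance
d = np.convolve(x, h)[:K]         # unknown-system output (no measurement noise)

# Sample estimates of R and p for the length-(N+1) input vector
X = np.stack([np.concatenate([np.zeros(i), x[:K - i]]) for i in range(N + 1)])
R_hat = X @ X.T / K
p_hat = X @ d / K

w_o = np.linalg.solve(R_hat, p_hat)
print(w_o)           # close to h(0), ..., h(N), as equation (2.149) predicts
print(h[:N + 1])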

Some real-world applications of the system identification scheme include modeling of multipath communication channels [36], control systems [28], seismic exploration [37], and cancellation of echo caused by hybrids in some communication systems [38]-[42], just to mention a few.

2.10.2 Signal Enhancement

In the signal enhancement application, the reference signal consists of a desired signal x(k) that is corrupted by an additive noise n_1(k). The input signal of the adaptive filter is a noise signal n_2(k) that is correlated with the interference signal n_1(k), but uncorrelated with x(k). Fig. 2.11 illustrates the configuration of the signal enhancement application. In practice, this configuration is found in acoustic echo cancellation for auditoriums [45], hearing aids, noise cancellation in hydrophones [44], cancellation of power line interference in electrocardiography [28], and in other applications. The cancellation of echo caused by the hybrid in some communication systems can also be considered a signal enhancement problem [28].

In this application, the error signal is given by

e(k) = x(k) + n_1(k) - \sum_{l=0}^{N} w_l\, n_2(k-l) = x(k) + n_1(k) - y(k)    (2.151)



Figure 2.11 Signal enhancement (n1(k) and n2(k) are noise signals correlated to each other).

The resulting MSE is then given by

E[e2(k)] = E[x2(k)] + E{[n1(k)− y(k)]2} (2.152)

where it was assumed that x(k) is uncorrelated with n_1(k) and n_2(k). The above equation shows that if the adaptive filter, having n_2(k) as the input signal, is able to perfectly predict the signal n_1(k), the minimum MSE is given by

\xi_{min} = E[x^2(k)]    (2.153)

where the error signal, in this situation, is the desired signal x(k).

The effectiveness of the signal enhancement scheme depends on the high correlation between n_1(k) and n_2(k). In some applications, it is useful to include a delay of L samples in the reference signal or in the input signal, such that their relative delay yields a maximum cross-correlation between y(k) and n_1(k), reducing the MSE. This delay provides a kind of synchronization between the signals involved. An example exploring this issue will be presented in the following chapters.
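The role of the synchronizing delay can be illustrated with a toy experiment, sketched below under assumed (hypothetical) signal models: n_1(k) is simply a scaled and delayed copy of n_2(k), so the cross-correlation between the reference signal and delayed versions of the filter input peaks at the true lag, indicating the delay that should be inserted.

import numpy as np

rng = np.random.default_rng(3)
K = 50000

n2 = rng.standard_normal(K)                            # adaptive-filter input
n1 = 0.9 * np.concatenate([np.zeros(5), n2[:-5]])      # interference: n2 delayed by 5
x = np.cos(0.1 * np.pi * np.arange(K))                 # desired signal
ref = x + n1                                           # reference signal x(k) + n1(k)

for L in range(8):
    n2_delayed = np.concatenate([np.zeros(L), n2[:K - L]])
    print(L, round(float(np.mean(ref * n2_delayed)), 3))
# The cross-correlation peaks at L = 5, the relative delay to be compensated.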

2.10.3 Signal Prediction

In the signal prediction application, the adaptive-filter input consists of a delayed version of the desired signal, as illustrated in Fig. 2.12. The MSE is given by

ξ = E{[x(k)− wT x(k − L)]2} (2.154)


Figure 2.12 Signal prediction.


The minimization of the MSE leads to an FIR filter whose coefficients are the elements of w. This filter is able to predict the present sample of the input signal using as information old samples such as x(k-L), x(k-L-1), \ldots, x(k-L-N). The resulting FIR filter can then be considered a model for the signal x(k) when the MSE is small. The minimum MSE is given by

\xi_{min} = r(0) - w_o^T \begin{bmatrix} r(L) \\ r(L+1) \\ \vdots \\ r(L+N) \end{bmatrix}    (2.155)

where w_o is the optimum predictor coefficient vector and r(l) = E[x(k)x(k-l)] for a stationary process.

A typical application of the predictor is in linear prediction coding of speech signals [43], where the predictor's task is to estimate the speech parameters. These parameters w are part of the coding information that is transmitted or stored along with other information inherent to the speech characteristics, such as the pitch period, among others.

The adaptive signal predictor is also used for adaptive line enhancement (ALE), where the input signal is a narrowband signal (predictable) added to a wideband signal. After convergence, the predictor output will be an enhanced version of the narrowband signal.

Yet another application of the signal predictor is the suppression of narrowband interference in a wideband signal. The input signal, in this case, has the same general characteristics as the ALE input. However, we are now interested in removing the narrowband interferer. For such an application, the output signal of interest is the error signal [45].

2.10.4 Channel Equalization

As can be seen from Fig. 2.13, channel equalization or inverse filtering consists of estimating a transfer function to compensate for the linear distortion caused by the channel. From another point of view, the objective is to force a prescribed dynamic behavior for the cascade of the channel (unknown system) and the adaptive filter, determined by the input signal. The first interpretation is more appropriate in communications, where the information is transmitted through dispersive channels [35], [41]. The second interpretation is appropriate for control applications, where the inverse filtering scheme generates control signals to be used in the unknown system [28].

In the ideal situation, where n(k) = 0 and the equalizer has sufficient order, the error signal is zero if

W (z)H(z) = z−L (2.156)


Figure 2.13 Channel equalization.

where W(z) and H(z) are the equalizer and unknown system transfer functions, respectively. Therefore, the ideal equalizer has the following transfer function

W(z) = \frac{z^{-L}}{H(z)}    (2.157)

From the above equation, we can conclude that if H(z) is an IIR transfer function with nontrivial numerator and denominator polynomials, W(z) will also be IIR. If H(z) is an all-pole model, W(z) is FIR. If H(z) is an all-zero model, W(z) is an all-pole transfer function.
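As a small illustration of the all-pole case, consider the hypothetical channel H(z) = 1/(1 + a z^{-1}); its exact inverse is the FIR filter W(z) = 1 + a z^{-1} (with L = 0), and cascading the two restores the transmitted impulse:

import numpy as np

a, K = 0.5, 20
x = np.zeros(K); x[0] = 1.0            # unit impulse into the channel

# Channel output: y(k) = -a*y(k-1) + x(k)  (all-pole channel)
y = np.zeros(K)
for k in range(K):
    y[k] = x[k] - (a * y[k - 1] if k > 0 else 0.0)

# FIR equalizer W(z) = 1 + a z^-1 applied to y(k)
z = y + a * np.concatenate([[0.0], y[:-1]])
print(np.allclose(z, x))               # True: the impulse is recovered exactly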

By applying the inverse Z-transform to equation (2.156), we can conclude that the optimal equalizer impulse response convolved with the channel impulse response produces an impulse as a result. This means that for zero additional error in the channel, the output signal y(k) restores x(k-L) and, therefore, one can conclude that a deconvolution process took place.

The delay in the reference signal plays an important role in the equalization process. Without the delay, the desired signal is x(k), whereas the signal y(k) will be mainly influenced by old samples of the input signal, since the unknown system is usually causal. As a consequence, the equalizer should also perform the task of predicting x(k) simultaneously with the main task of equalizing the channel. The introduction of a delay alleviates the prediction task, leaving the equalizer free to invert the channel response. A rule of thumb for choosing the delay was proposed and analyzed in [28], where it was conjectured that the best delay should be close to half the time span of the equalizer. In practice, the reader should try different delays.

In the case the unknown system is not of minimum phase, i.e., its transfer function has zeros outside the unit circle of the Z plane, the optimum equalizer is either stable and noncausal, or unstable and causal. Both solutions are unacceptable. The noncausal stable solution could be better approximated by a causal FIR filter when the delay is included in the desired signal. The delay forces a time shift in the ideal impulse response of the equalizer, allowing the time span where most of the energy is concentrated to be in the causal region.

If a channel noise signal is present and is uncorrelated with the channel's input signal, the error signal and y(k) will be accordingly noisier. However, it should be noticed that the adaptive equalizer, in the process of reducing the MSE, disturbs the optimal solution by trying to reduce the effects of n(k). Therefore, in a noisy environment the equalizer transfer function is not exactly the inverse of H(z).


In practice, the noblest use of the adaptive equalizer is to compensate for the distortion caused by the transmission channel in a communication system. The main distortions caused by the channels are high attenuation and intersymbol interference (ISI). ISI is generated when different frequency components of the transmitted signals arrive at different times at the receiver, a phenomenon caused by the nonlinear group delay of the channel [35]. For example, in a digital communication system, the time-dispersive channel extends a transmitted symbol beyond the time interval allotted to it, interfering with past and future symbols. Under severe ISI, when short symbol spacing is used, the number of symbols causing ISI is large.

The channel impulse response is a time-spread sequence described by h(k), with the received signal being given by

r_e(k+J) = x(k)\, h(J) + \sum_{l=-\infty,\; l\neq k}^{k+J} x(l)\, h(k+J-l) + n(k+J)    (2.158)

where J denotes the channel time delay (including the sampler phase). The first term of the above equation corresponds to the desired information, the second term is the interference of the symbols sent before and after x(k), and the third term accounts for channel noise. Obviously only the neighboring symbols have significant influence on the second term of the above equation. The elements of the second term involving x(l), for l > k, are called pre-cursor ISI, since they are caused by components of the data signal that reach the receiver before their cursor. On the other hand, the elements involving x(l), for l < k, are called post-cursor ISI.

In many situations, the ISI is reduced by employing an equalizer consisting of an adaptive FIR filter of appropriate length. The adaptive equalizer attempts to cancel the ISI in the presence of noise. In digital communication, a decision device is placed after the equalizer in order to identify the symbol at a given instant. The equalizer coefficients are updated in two distinct circumstances by employing different reference signals. During the equalizer training period, a previously chosen training signal is transmitted through the channel and a properly delayed version of this signal, prestored in the receiver end, is used as the reference signal. The training signal is usually a pseudo-noise sequence long enough to allow the equalizer to compensate for the channel distortions. After convergence, the error between the adaptive-filter output and the decision device output is utilized to update the coefficients. The resulting scheme is the decision-directed adaptive equalizer. It should be mentioned that in some applications no training period is available. Usually, in this case, the decision-directed error is used all the time.

A more general equalizer scheme is the decision-feedback equalizer (DFE), illustrated in Fig. 2.14. The DFE is widely used in situations where the channel distortion is severe [35], [46]. The basic idea is to feed back, via a second FIR filter, the decisions made by the decision device that is applied to the equalized signal. The second FIR filter is preceded by a delay, otherwise there is a delay-free loop around the decision device. Assuming the decisions were correct, we are actually feeding back the symbols x(l), for l < k, of equation (2.158). The DFE is able to cancel the post-cursor ISI for a number of past symbols (depending on the order of the FIR feedback filter), leaving more freedom for the feedforward section to take care of the remaining terms of the ISI. Some known characteristics of the DFE are [35]:


– The signals that are fed back are symbols, being noise free and allowing computational savings.

– The noise enhancement is reduced, if compared with the feedforward-only equalizer.

– Short-time recovery when incorrect decisions are made.

– Reduced sensitivity to sampling phase.

Figure 2.14 Decision-feedback equalizer.

The DFE operation starts with a training period where a known sequence is transmitted through the channel, and the same sequence is used at the receiver as the desired signal. The delay introduced in the training signal is meant to compensate for the delay the transmitted signal faces when passing through the channel. During the training period the error signal, which consists of the difference between the delayed training signal and the signal y(k), is minimized by adapting the coefficients of the forward and feedback filters. After this period, there is no training signal and the desired signal will consist of the decision device output signal. Assuming the decisions are correct, this blind way of performing the adaptation is the best solution to keep track of small changes in the channel behavior.

Example 2.4

In this example we will verify the effectiveness of the Wiener solution in environments related to the applications of noise cancellation, prediction, equalization, and identification.

(a) In a noise cancellation environment a sinusoid is corrupted by noise as follows

d(k) = cos ω0k + n1(k)


with

n1(k) = −an1(k − 1) + n(k)

with |a| < 1, and n(k) is a zero-mean white noise with variance \sigma_n^2 = 1. The input signal of the Wiener filter is described by

filter is described by

n2(k) = −bn2(k − 1) + n(k)

where |b| < 1.

(b) In a prediction case the input signal is modeled as

x(k) = −ax(k − 1) + n(k)

with n(k) being a white noise with unit variance and |a| < 1.

(c) In an equalization problem, a zero-mean white noise signal s(k) with variance c is transmitted through a channel with an AR model described by

x(k) = -a x(k-1) + s(k)

with |a| < 1, and the signal available at the receiver (the input to the Wiener filter) is x(k) + n(k),

where n(k) is a zero-mean white noise with variance d, uncorrelated with s(k).

(d) In a system identification problem, a zero-mean white noise signal x(k) with variance c is employed as the input signal to identify an AR system whose model is described by

v(k) = −av(k − 1) + x(k)

where |a| < 1 and the desired signal is given by

d(k) = v(k) + n(k)

Repeat the problem if the system to be identified is an MA whose model is described by

v(k) = −ax(k − 1) + x(k)

For all these cases describe the Wiener solution with two coefficients and comment on the results.

Solution:

Some results used in the examples are briefly reviewed. A 2 × 2 matrix inversion is performed as

R^{-1} = \frac{1}{r_{11} r_{22} - r_{12} r_{21}} \begin{bmatrix} r_{22} & -r_{12} \\ -r_{21} & r_{11} \end{bmatrix}


where r_{ij} is the element of row i and column j of the matrix R. For two first-order AR signals x(k) and v(k), whose poles are placed at -a and -b, respectively, and which are driven by the same unit-variance white noise, the cross-correlations^9

E[x(k) v(k-l)] = \frac{(-a)^{l}}{1-ab}    for l \geq 0

and

E[x(k) v(k-l)] = \frac{(-b)^{-l}}{1-ab}    for l < 0

are frequently required in the following solutions.

(a)

The input signal in this case is given by n_2(k), whereas the desired signal is given by d(k). The elements of the correlation matrix are computed as

E[n_2(k) n_2(k-l)] = \frac{(-b)^{|l|}}{1-b^2}

The expression for the cross-correlation vector is given by

p = \begin{bmatrix} E[(\cos(\omega_0 k) + n_1(k))\, n_2(k)] \\ E[(\cos(\omega_0 k) + n_1(k))\, n_2(k-1)] \end{bmatrix}
  = \begin{bmatrix} E[n_1(k) n_2(k)] \\ E[n_1(k) n_2(k-1)] \end{bmatrix}
  = \begin{bmatrix} \frac{1}{1-ab}\sigma_n^2 \\ -\frac{a}{1-ab}\sigma_n^2 \end{bmatrix}
  = \begin{bmatrix} \frac{1}{1-ab} \\ -\frac{a}{1-ab} \end{bmatrix}

where in the last expression we substituted \sigma_n^2 = 1.

The coefficients corresponding to the Wiener solution are given by

w_o = R^{-1} p = \begin{bmatrix} 1 & b \\ b & 1 \end{bmatrix}\begin{bmatrix} \frac{1}{1-ab} \\ -\frac{a}{1-ab} \end{bmatrix} = \begin{bmatrix} 1 \\ \frac{b-a}{1-ab} \end{bmatrix}

The special case where a = 0 provides a quite illustrative solution. In this case

w_o = \begin{bmatrix} 1 \\ b \end{bmatrix}

such that the error signal is given by

e(k) = d(k) - y(k) = \cos(\omega_0 k) + n(k) - w_o^T \begin{bmatrix} n_2(k) \\ n_2(k-1) \end{bmatrix}
     = \cos(\omega_0 k) + n(k) - n_2(k) - b\, n_2(k-1)
     = \cos(\omega_0 k) + n(k) + b\, n_2(k-1) - n(k) - b\, n_2(k-1) = \cos(\omega_0 k)

9 Assuming x(k) and v(k) are jointly WSS.


As can be observed, the cosine signal is fully recovered, since the Wiener filter was able to restore n(k) and remove it from the desired signal.
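A numerical sanity check of this result is sketched below, with hypothetical values for b and \omega_0 and with a = 0 as in the special case above; the error signal produced by the Wiener filter reduces to the sinusoid.

import numpy as np

rng = np.random.default_rng(4)
K, b, w0 = 100000, 0.7, 0.2 * np.pi        # illustrative values, a = 0
n = rng.standard_normal(K)                 # common white noise, unit variance

n2 = np.zeros(K)
for k in range(1, K):
    n2[k] = -b * n2[k - 1] + n[k]          # filter input (a = 0 makes n1 = n)

d = np.cos(w0 * np.arange(K)) + n          # reference signal
w_o = np.array([1.0, b])                   # Wiener solution for a = 0

y = w_o[0] * n2 + w_o[1] * np.concatenate([[0.0], n2[:-1]])
e = d - y
print(np.max(np.abs(e[1:] - np.cos(w0 * np.arange(K))[1:])))   # ~0: only the cosine remains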

(b)

In the prediction case the input signal is x(k) and the desired signal is x(k + L). Since

E[x(k) x(k-L)] = \frac{(-a)^{|L|}}{1-a^2}

the input signal correlation matrix is

R = \begin{bmatrix} E[x^2(k)] & E[x(k)x(k-1)] \\ E[x(k)x(k-1)] & E[x^2(k-1)] \end{bmatrix} = \begin{bmatrix} \frac{1}{1-a^2} & -\frac{a}{1-a^2} \\ -\frac{a}{1-a^2} & \frac{1}{1-a^2} \end{bmatrix}

Vector p is described by

p = \begin{bmatrix} E[x(k+L)x(k)] \\ E[x(k+L)x(k-1)] \end{bmatrix} = \begin{bmatrix} \frac{(-a)^{|L|}}{1-a^2} \\ \frac{(-a)^{|L+1|}}{1-a^2} \end{bmatrix}

The expression for the optimal coefficient vector is easily derived:

w_o = R^{-1} p
    = (1-a^2)\begin{bmatrix} \frac{1}{1-a^2} & \frac{a}{1-a^2} \\ \frac{a}{1-a^2} & \frac{1}{1-a^2} \end{bmatrix}\begin{bmatrix} \frac{(-a)^{L}}{1-a^2} \\ \frac{(-a)^{L+1}}{1-a^2} \end{bmatrix}
    = \begin{bmatrix} (-a)^{L} \\ 0 \end{bmatrix}

where in the above equation the value of L is considered positive. The predictor result tells us that an estimate \hat{x}(k+L) of x(k+L) can be obtained as

\hat{x}(k+L) = (-a)^{L} x(k)

According to our model for the signal x(k), the actual value of x(k+L) is

x(k+L) = (-a)^{L} x(k) + \sum_{i=0}^{L-1} (-a)^{i} n(k+L-i)

The results show that if x(k) is the observed data at a given instant of time, the best estimate of x(k+L) in terms of x(k) is to average out the noise as follows:

\hat{x}(k+L) = (-a)^{L} x(k) + E\left[\sum_{i=0}^{L-1} (-a)^{i} n(k+L-i)\right] = (-a)^{L} x(k)


since E[n(k+L-i)] = 0.

(c)

In this equalization problem, the input of the Wiener filter is the received signal x(k) + n(k), so that matrix R is given by

R = \begin{bmatrix} \frac{c}{1-a^2}+d & -\frac{ac}{1-a^2} \\ -\frac{ac}{1-a^2} & \frac{c}{1-a^2}+d \end{bmatrix}

By utilizing s(k-L) as the desired signal, and recalling that it is a white noise uncorrelated with the other signals involved in the experiment, the cross-correlation vector between the input and desired signals has the following expression

p = \begin{bmatrix} E[x(k)\, s(k-L)] \\ E[x(k-1)\, s(k-L)] \end{bmatrix} = \begin{bmatrix} (-1)^{L} a^{L} c \\ (-1)^{L-1} a^{L-1} c \end{bmatrix}

The coefficients of the underlying Wiener solution are given by

w_o = R^{-1} p
    = \frac{1}{\frac{c^2}{1-a^2} + \frac{2cd}{1-a^2} + d^2}\begin{bmatrix} \frac{c}{1-a^2}+d & \frac{ac}{1-a^2} \\ \frac{ac}{1-a^2} & \frac{c}{1-a^2}+d \end{bmatrix}\begin{bmatrix} (-1)^{L} a^{L} c \\ (-1)^{L-1} a^{L-1} c \end{bmatrix}
    = \frac{(-1)^{L} a^{L} c}{\frac{c^2}{1-a^2} + \frac{2cd}{1-a^2} + d^2}\begin{bmatrix} \frac{c}{1-a^2} + d - \frac{c}{1-a^2} \\ \frac{ac}{1-a^2} - a^{-1} d - \frac{a^{-1} c}{1-a^2} \end{bmatrix}
    = \frac{(-1)^{L} a^{L} c}{\frac{c^2}{1-a^2} + \frac{2cd}{1-a^2} + d^2}\begin{bmatrix} d \\ -a^{-1} d - a^{-1} c \end{bmatrix}

If there is no additional noise, i.e., d = 0, the above result becomes

w_o = \begin{bmatrix} 0 \\ (-1)^{L-1} a^{L-1} (1-a^2) \end{bmatrix}

That is, the Wiener solution is just correcting the gain of the previously received component of the input signal, namely x(k-1), while not using its most recent component x(k). This happens because the desired signal s(k-L) at instant k has a defined correlation with any previously received symbol. On the other hand, if the signal s(k) is a colored noise, the Wiener filter would have a nonzero first coefficient in a noiseless environment. In case there is environmental noise, the solution tries to find a perfect balance between the desired signal modeling and the noise amplification.

(d)

In the system identification example the input signal correlation matrix is given by

R = \begin{bmatrix} c & 0 \\ 0 & c \end{bmatrix}

With the desired signal d(k), the cross-correlation vector is described as

p = \begin{bmatrix} E[x(k)\, d(k)] \\ E[x(k-1)\, d(k)] \end{bmatrix} = \begin{bmatrix} c \\ -ca \end{bmatrix}


The coefficients of the underlying Wiener solution are given by

w_o = R^{-1} p = \begin{bmatrix} \frac{1}{c} & 0 \\ 0 & \frac{1}{c} \end{bmatrix}\begin{bmatrix} c \\ -ca \end{bmatrix} = \begin{bmatrix} 1 \\ -a \end{bmatrix}

Note that this solution represents the best way a first-order FIR model can approximate an IIR model, since

W_o(z) = 1 - a z^{-1}

and

\frac{1}{1 + a z^{-1}} = 1 - a z^{-1} + a^2 z^{-2} - \cdots

On the other hand, if the unknown system is the FIR model described by v(k) = -a x(k-1) + x(k), the Wiener solution remains the same and corresponds exactly to the unknown system model.
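The identification result can likewise be confirmed with sample estimates, as sketched below with illustrative values of a and c (and a small amount of measurement noise); the estimated Wiener coefficients approach [1 -a]^T.

import numpy as np

rng = np.random.default_rng(5)
K, a, c = 200000, 0.4, 1.0

x = np.sqrt(c) * rng.standard_normal(K)        # white input with variance c
v = np.zeros(K)
for k in range(1, K):
    v[k] = -a * v[k - 1] + x[k]                # AR(1) unknown system
d = v + 0.1 * rng.standard_normal(K)           # desired signal with measurement noise

X = np.stack([x, np.concatenate([[0.0], x[:-1]])])   # rows: x(k) and x(k-1)
R_hat = X @ X.T / K
p_hat = X @ d / K
print(np.linalg.solve(R_hat, p_hat))           # close to [1, -a]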

In all these examples, the environmental signals are considered wide-sense stationary and their statistics are assumed known. In a practical situation, not only might the statistics be unknown but the environments are usually nonstationary as well. In these situations, the adaptive filters come into play, since their coefficients vary with time according to measured signals from the environment.

2.10.5 Digital Communication System

For illustration, a general digital communication scheme over a channel consisting of a subscriber line (telephone line, for example) is shown in Fig. 2.15. At either end, the input signal is first coded and conditioned by a transmit filter. The filter shapes the pulse and limits in band the signal that is actually transmitted. The signal then crosses the hybrid to travel through a dual duplex channel. The hybrid is an impedance bridge used to transfer the transmit signal into the channel with minimal leakage to the near-end receiver. The imperfections of the hybrid cause echo that should be properly canceled.

In the channel, the signal is corrupted by white noise and crosstalk (leakage of signals being transmitted by other subscribers). After crossing the channel and the far-end hybrid, the signal is filtered by the receive filter that attenuates high-frequency noise and also acts as an antialiasing filter. Subsequently, we have a joint DFE and echo canceller, where the forward filter and echo canceller outputs are subtracted. The result after subtracting the decision feedback output is applied to the decision device. After passing through the decision device, the symbol is decoded.

Other schemes for data transmission over subscriber lines exist [41]. The one shown here is for illustration purposes, having as a special feature the joint equalizer and echo canceller strategy. The digital subscriber line (DSL) structure shown here has been used in integrated services digital network (ISDN) basic access, which allows a data rate of 144 kbits/s [41]. Also, a similar scheme is employed


Figure 2.15 General digital communication transceiver.


in the high bit rate digital subscriber line (HDSL) [40], [47] that operates over short and conditioned loops [48], [49]. The latter system belongs to a broad class of digital subscriber lines collectively known as xDSL.

2.11 CONCLUDING REMARKS

In this chapter, we described some of the concepts underlying adaptive-filtering theory. The material presented here forms the basis for understanding the behavior of most adaptive-filtering algorithms in a practical implementation. The basic concept of the MSE surface searching algorithms was briefly reviewed, serving as a starting point for the development of a number of practical adaptive-filtering algorithms to be presented in the following chapters. We illustrated through several examples the expected Wiener solutions in a number of distinct situations. In addition, we presented the basic concepts of the linearly constrained Wiener filter required in array signal processing. The theory and practice of adaptive signal processing is also the main subject of some excellent books such as [27]-[34].


2.12 REFERENCES

1. D. G. Luenberger, Introduction to Linear and Nonlinear Programming, Addison-Wesley, Reading, MA, 2nd edition, 1984.

2. R. Fletcher, Practical Methods of Optimization, John Wiley & Sons, New York, NY, 2nd edition, 1990.

3. A. Antoniou and W.-S. Lu, Practical Optimization: Algorithms and Engineering Applications, Springer, New York, NY, 2007.

4. B. Widrow and M. E. Hoff, "Adaptive switching circuits," WESCON Conv. Rec., pt. 4, pp. 96-140, 1960.

5. B. Widrow, J. M. McCool, M. G. Larimore, and C. R. Johnson, Jr., "Stationary and nonstationary learning characteristics of the LMS adaptive filters," Proceedings of the IEEE, vol. 64, pp. 1151-1162, Aug. 1976.

6. A. Papoulis, Signal Analysis, McGraw Hill, New York, NY, 1977.

7. A. V. Oppenheim, A. S. Willsky, and S. H. Nawab, Signals and Systems, Prentice Hall, Englewood Cliffs, NJ, 2nd edition, 1997.

8. P. S. R. Diniz, E. A. B. da Silva, and S. L. Netto, Digital Signal Processing: System Analysis and Design, Cambridge University Press, Cambridge, UK, 2002.

9. A. Antoniou, Digital Signal Processing: Signals, Systems, and Filters, McGraw Hill, New York, NY, 2005.

10. L. B. Jackson, Digital Filters and Signal Processing, Kluwer Academic Publishers, Norwell, MA, 3rd edition, 1996.

11. R. A. Roberts and C. T. Mullis, Digital Signal Processing, Addison-Wesley, Reading, MA, 1987.

12. J. G. Proakis and D. G. Manolakis, Digital Signal Processing, Prentice Hall, Englewood Cliffs, NJ, 4th edition, 2007.

13. T. Bose, Digital Signal and Image Processing, John Wiley & Sons, New York, NY, 2004.

14. A. Papoulis, Probability, Random Variables, and Stochastic Processes, McGraw Hill, New York, NY, 3rd edition, 1991.

15. P. Z. Peebles, Jr., Probability, Random Variables, and Random Signal Principles, McGraw Hill, New York, NY, 3rd edition, 1993.

16. W. A. Gardner, Introduction to Random Processes, McGraw Hill, New York, NY, 2nd edition, 1990.

17. C. R. Johnson, Jr., Lectures on Adaptive Parameter Estimation, Prentice Hall, Englewood Cliffs, NJ, 1988.

18. T. Soderstrom and P. Stoica, System Identification, Prentice Hall International, Hemel Hempstead, Hertfordshire, 1989.


19. G. Strang, Linear Algebra and Its Applications, Academic Press, New York, NY, 2nd edition, 1980.

20. L. J. Griffiths and C. W. Jim, "An alternative approach to linearly constrained adaptive beamforming," IEEE Trans. on Antennas and Propagation, vol. AP-30, pp. 27-34, Jan. 1982.

21. M. L. R. de Campos, S. Werner, and J. A. Apolinario, Jr., "Constrained adaptation algorithms employing Householder transformation," IEEE Trans. on Signal Processing, vol. 50, pp. 2187-2195, Sept. 2002.

22. D. H. Johnson and D. E. Dudgeon, Array Signal Processing, Prentice Hall, Englewood Cliffs, NJ, 1993.

23. H. L. Van Trees, Optimum Array Processing: Part IV of Detection, Estimation and Modulation Theory, John Wiley & Sons, New York, NY, 2002.

24. A. Papoulis, "Predictable processes and Wold's decomposition: A review," IEEE Trans. on Acoust., Speech, and Signal Processing, vol. ASSP-33, pp. 933-938, Aug. 1985.

25. S. M. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory, Prentice Hall, Englewood Cliffs, NJ, 1993.

26. S. L. Marple, Jr., Digital Spectral Analysis, Prentice Hall, Englewood Cliffs, NJ, 1987.

27. M. L. Honig and D. G. Messerschmitt, Adaptive Filters: Structures, Algorithms, and Applications, Kluwer Academic Publishers, Boston, MA, 1984.

28. B. Widrow and S. D. Stearns, Adaptive Signal Processing, Prentice Hall, Englewood Cliffs, NJ, 1985.

29. S. T. Alexander, Adaptive Signal Processing, Springer Verlag, New York, NY, 1986.

30. J. R. Treichler, C. R. Johnson, Jr., and M. G. Larimore, Theory and Design of Adaptive Filters, John Wiley & Sons, New York, NY, 1987.

31. M. Bellanger, Adaptive Digital Filters and Signal Analysis, Marcel Dekker, Inc., New York, NY, 2nd edition, 2001.

32. P. Strobach, Linear Prediction Theory, Springer Verlag, New York, NY, 1990.

33. S. Haykin, Adaptive Filter Theory, Prentice Hall, Englewood Cliffs, NJ, 4th edition, 2002.

34. A. H. Sayed, Fundamentals of Adaptive Filtering, John Wiley & Sons, Hoboken, NJ, 2003.

35. S. U. Qureshi, "Adaptive equalization," Proceedings of the IEEE, vol. 73, pp. 1349-1387, Sept. 1985.

36. J. G. Proakis, Digital Communication, McGraw Hill, New York, NY, 4th edition, 2001.

37. L. C. Wood and S. Treitel, "Seismic signal processing," Proceedings of the IEEE, vol. 63, pp. 649-661, Dec. 1975.


38. D. G. Messerschmitt, “Echo cancellation in speech and data transmission,” IEEE Journal onSelected Areas in Communications, vol. SAC-2, pp. 283-296, March 1984.

39. M. L. Honig, “Echo cancellation of voiceband data signals using recursive least squares andstochastic gradient algorithms,” IEEE Trans. on Communications, vol. COM-33, pp. 65-73,Jan. 1985.

40. S. Subramanian, D. J. Shpak, P. S. R. Diniz, and A. Antoniou, “The performance of adaptivefiltering algorithms in a simulated HDSL environment,” Proc. IEEE Canadian Conf. Electricaland Computer Engineering, Toronto, Canada, pp. TA 2.19.1-TA 2.19.5, Sept. 1992.

41. D. W. Lin, “Minimum mean-squared error echo cancellation and equalization for digital subscriber line transmission: Part I - theory and computation,” IEEE Trans. on Communications, vol. 38, pp. 31-38, Jan. 1990.

42. D. W. Lin, “Minimum mean-squared error echo cancellation and equalization for digital subscriber line transmission: Part II - a simulation study,” IEEE Trans. on Communications, vol. 38, pp. 39-45, Jan. 1990.

43. L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Prentice Hall, Englewood Cliffs, NJ, 1978.

44. B. D. Van Veen and K. M. Buckley, “Beamforming: a versatile approach to spatial filtering,” IEEE Acoust., Speech, Signal Processing Magazine, vol. 37, pp. 4-24, April 1988.

45. B. Widrow, J. R. Glover, Jr., J. M. McCool, J. Kaunitz, C. S. Williams, R. H. Hearn, J. R. Zeidler, E. Dong, Jr., and R. C. Goodlin, “Adaptive noise cancelling: Principles and applications,” Proceedings of the IEEE, vol. 63, pp. 1692-1716, Dec. 1975.

46. M. Abdulrahman and D. D. Falconer, “Cyclostationary crosstalk suppression by decision feedback equalization on digital subscriber line,” IEEE Journal on Selected Areas in Communications, vol. 10, pp. 640-649, April 1992.

47. H. Samueli, B. Daneshrad, R. B. Joshi, B. C. Wong, and H. T. Nicholas, III, “A 64-tap CMOS echo canceller/decision feedback equalizer for 2B1Q HDSL transceiver,” IEEE Journal on Selected Areas in Communications, vol. 9, pp. 839-847, Aug. 1991.

48. J.-J. Werner, “The HDSL environment,” IEEE Journal on Selected Areas in Communications,vol. 9, pp. 785-800, Aug. 1991.

49. J. W. Lechleider, “High bit rate digital subscriber lines: A review of HDSL progress,” IEEE Journal on Selected Areas in Communications, vol. 9, pp. 769-784, Aug. 1991.

2.13 PROBLEMS

1. Suppose the input signal vector is composed of a delay line with a single input signal; compute the correlation matrix for the following input signals (a simulation sketch for case (a) appears after this problem list):

(a) x(k) = sin(πk/6) + cos(πk/4) + n(k)

(b) x(k) = a n1(k) cos(ω0k) + n2(k)

(c) x(k) = a n1(k) sin(ω0k + n2(k))

(d) x(k) = −a1 x(k − 1) − a2 x(k − 2) + n(k)

(e) x(k) = Σ_{i=0}^{4} 0.25 n(k − i)

(f) x(k) = a n(k) e^(jω0k)

In all cases, n(k), n1(k), and n2(k) are white-noise processes with zero mean and variances σn², σn1², and σn2², respectively. These random signals are considered independent.

2. Consider two complex random processes represented by x(k) and y(k).

(a) Derive σxy²(k, l) = E[(x(k) − mx(k))(y(l) − my(l))] as a function of rxy(k, l), mx(k), and my(l).

(b) Repeat (a) if x(k) and y(k) are jointly wide-sense stationary (WSS).

(c) Given that x(k) and y(k) are orthogonal, under which conditions are they also uncorrelated?

3. For the correlation matrices given below, calculate their eigenvalues, eigenvectors, and condition numbers (a numerical sketch for case (a) appears after this problem list).

(a)

R = (1/4) ×
    [ 4  3  2  1 ]
    [ 3  4  3  2 ]
    [ 2  3  4  3 ]
    [ 1  2  3  4 ]

(b)

R = [ 1         0.95      0.9025    0.857375 ]
    [ 0.95      1         0.95      0.9025   ]
    [ 0.9025    0.95      1         0.95     ]
    [ 0.857375  0.9025    0.95      1        ]

(c)

R = 50 σn² ×
    [ 1       0.9899  0.98    0.970  ]
    [ 0.9899  1       0.9899  0.98   ]
    [ 0.98    0.9899  1       0.9899 ]
    [ 0.970   0.98    0.9899  1      ]

(d)

R = [ 1      0.5    0.25   0.125 ]
    [ 0.5    1      0.5    0.25  ]
    [ 0.25   0.5    1      0.5   ]
    [ 0.125  0.25   0.5    1     ]

4. For the correlation matrix given below, calculate its eigenvalues and eigenvectors, and form the matrix Q.

R = (1/4) [ a1  a2 ]
          [ a2  a1 ]

5. The input signal of a second-order adaptive filter is described by

x(k) = α1x1(k) + α2x2(k)

where x1(k) and x2(k) are first-order AR processes, uncorrelated with each other, both having unit variance. These signals are generated by applying distinct white noises to first-order filters whose poles are placed at a and −b, respectively.

(a) Calculate the autocorrelation matrix of the input signal.

(b) If the desired signal consists of γx2(k), calculate the Wiener solution.

6. The input signal of a first-order adaptive filter is described by

x(k) = √2 x1(k) + x2(k) + 2x3(k)

where x1(k) and x2(k) are first-order AR processes, uncorrelated with each other, both having unit variance. These signals are generated by applying distinct white noises to first-order filters whose poles are placed at −0.5 and √2/2, respectively. The signal x3(k) is a white noise with unit variance and uncorrelated with x1(k) and x2(k).

(a) Calculate the autocorrelation matrix of the input signal.

(b) If the desired signal consists of (1/2)x3(k), calculate the Wiener solution.

7. Repeat the previous problem if the signal x3(k) is exactly the white noise that generated x2(k).

8. In a prediction case a sinusoid is corrupted by noise as follows

x(k) = cos ω0k + n1(k)

with

n1(k) = −an1(k − 1) + n(k)

where |a| < 1. For this case, describe the Wiener solution with two coefficients and comment on the results.

9. Generate the ARMA processes x(k) described below. Calculate the variance of the output signal and the autocorrelation for lags 1 and 2. In all cases, n(k) is zero-mean Gaussian white noise with variance 0.1.

(a)

x(k) = 1.9368x(k − 1)− 0.9519x(k − 2) + n(k)− 1.8894n(k − 1) + n(k − 2)

(b)

x(k) = −1.9368x(k − 1)− 0.9519x(k − 2) + n(k) + 1.8894n(k − 1) + n(k − 2)

Hint: For white-noise generation consult, for example, [14] and [15]. A simulation sketch for case (a) also appears after this problem list.

10. Generate the AR processes x(k) described below. Calculate the variance of the output signal and the autocorrelation for lags 1 and 2. In all cases, n(k) is zero-mean Gaussian white noise with variance 0.05.

(a) x(k) = −0.8987x(k − 1) − 0.9018x(k − 2) + n(k)

(b) x(k) = 0.057x(k − 1) + 0.889x(k − 2) + n(k)

11. Generate the MA processes x(k) described below. Calculate the variance of the output signal and the autocovariance matrix. In all cases, n(k) is zero-mean Gaussian white noise with variance 1.

(a)

x(k) = 0.0935n(k) + 0.3027n(k − 1) + 0.4n(k − 2) + 0.3027n(k − 4) + 0.0935n(k − 5)

(b) x(k) = n(k) − n(k − 1) + n(k − 2) − n(k − 4) + n(k − 5)

(c) x(k) = n(k) + 2n(k − 1) + 3n(k − 2) + 2n(k − 4) + n(k − 5)

12. Show that a process generated by adding two AR processes is in general an ARMA process.

13. Determine if the following processes are mean ergodic:

(a) x(k) = a n1(k) cos(ω0k) + n2(k)

(b) x(k) = a n1(k) sin(ω0k + n2(k))

(c) x(k) = a n(k) e^(2jω0k)

In all cases, n(k), n1(k), and n2(k) are white-noise processes, with zero mean and with variances σn², σn1², and σn2², respectively. These random signals are considered independent.

14. Show that the minimum (maximum) value of equation (2.69) occurs when wi = 0 for i ≠ j and λj is the smallest (largest) eigenvalue, respectively.

15. Suppose the matrix R and the vector p are known for a given experimental environment. Compute the Wiener solution for the following cases:

(a)

R = (1/4) ×
    [ 4  3  2  1 ]
    [ 3  4  3  2 ]
    [ 2  3  4  3 ]
    [ 1  2  3  4 ]

p = [ 1/2  3/8  2/8  1/8 ]^T

(b)

R = [ 1      0.8    0.64   0.512 ]
    [ 0.8    1      0.8    0.64  ]
    [ 0.64   0.8    1      0.8   ]
    [ 0.512  0.64   0.8    1     ]

p = (1/4) [ 0.4096  0.512  0.64  0.8 ]^T

(c)

R = (1/3) ×
    [  3  −2   1 ]
    [ −2   3  −2 ]
    [  1  −2   3 ]

p = [ −2  1  −1/2 ]^T

16. For the environments described in the previous problem, derive the updating formula for the steepest-descent method. Considering that the adaptive-filter coefficients are initially zero, calculate their values for the first ten iterations (a numerical sketch covering problems 15 to 17 appears after this problem list).

17. Repeat the previous problem using the Newton method.

18. Calculate the spectral decomposition for the matrices R of problem 15.

19. Calculate the minimum MSE for the examples of problem 15, considering that the variance of the reference signal is given by σd².

20. Derive equation (2.112).

21. Derive the constraint matrix C and the gain vector f that impose the condition of linear phase onto the linearly constrained Wiener filter.

22. Show that the optimal solutions of the LCMV filter and the GSC filter with minimum norm are equivalent and related according to wLCMV = T wGSC, where T = [C  B] is a full-rank transformation matrix with C^T B = 0 and

wLCMV = R^(-1) C (C^T R^(-1) C)^(-1) f

and

wGSC = [ (C^T C)^(-1) f                          ]
       [ −(B^T R B)^(-1) B^T R C (C^T C)^(-1) f  ]

(A sketch that checks this equivalence numerically appears after this problem list.)

23. Calculate the time constants of the MSE and of the coefficients for the examples of problem 15, considering that the steepest-descent algorithm was employed.

24. For the examples of problem 15, describe the equations for the MSE surface.

25. Using the spectral decomposition of a Hermitian matrix, show that

R^(1/N) = Q Λ^(1/N) Q^H = Σ_{i=0}^{N} λ_i^(1/N) q_i q_i^H

26. Derive the complex steepest-descent algorithm.

27. Derive the Newton algorithm for complex signals.

28. In a signal enhancement application, assume that n1(k) = n2(k) ∗ h(k), where h(k) represents the impulse response of an unknown system. Also, assume that some small leakage of the signal x(k), given by h′(k) ∗ x(k), is added to the adaptive-filter input. Analyze the consequences of this phenomenon.

29. In the equalizer application, calculate the optimal equalizer transfer function when the channel noise is present.
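The sketches below are the ones referred to in problems 1, 3, 9, 16, and 22. They are minimal numerical illustrations only, written in Python with NumPy and SciPy; the variable names, record lengths, and step sizes are arbitrary choices made here and are not part of the problem statements. For problem 1(a), assuming a 4-tap delay line and unit-variance white noise n(k), the correlation matrix can be estimated by time averaging:

import numpy as np

rng = np.random.default_rng(0)
K, taps = 100_000, 4                          # record length and number of taps (assumed)
k = np.arange(K)
x = np.sin(np.pi * k / 6) + np.cos(np.pi * k / 4) + rng.standard_normal(K)

# Stack the delay-line vectors x(k) = [x(k), x(k-1), x(k-2), x(k-3)]^T as rows and
# average their outer products to estimate R = E[x(k) x^T(k)].
X = np.column_stack([x[taps - 1 - i: K - i] for i in range(taps)])
R_est = (X.T @ X) / X.shape[0]
print(np.round(R_est, 3))

The estimate should approach a symmetric Toeplitz matrix whose entries follow r(l) = 0.5 cos(πl/6) + 0.5 cos(πl/4) + σn² δ(l), which is what one expects analytically for this case.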
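For problem 3, the eigenvalues, eigenvectors, and condition number of case (a) can be obtained as follows; the remaining matrices are handled the same way. The only assumption is that R is symmetric, which is what makes numpy.linalg.eigh applicable.

import numpy as np

R = 0.25 * np.array([[4., 3., 2., 1.],
                     [3., 4., 3., 2.],
                     [2., 3., 4., 3.],
                     [1., 2., 3., 4.]])

# For a symmetric (Hermitian) matrix, eigh returns real eigenvalues in ascending
# order and orthonormal eigenvectors as the columns of Q.
eigvals, Q = np.linalg.eigh(R)
cond_number = eigvals[-1] / eigvals[0]        # lambda_max / lambda_min
print("eigenvalues     :", np.round(eigvals, 4))
print("condition number:", round(cond_number, 2))

The same factorization, R = Q diag(eigvals) Q^T, also gives the spectral decompositions asked for in problem 18.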
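For problems 9 to 11, the processes can be generated by filtering white noise; the sketch below does this for problem 9(a) and estimates the variance and the autocorrelation at lags 1 and 2 by time averaging. The record length and the discarded transient are arbitrary choices.

import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(0)
K = 200_000
n = np.sqrt(0.1) * rng.standard_normal(K)     # zero-mean white Gaussian noise, variance 0.1

# x(k) = 1.9368 x(k-1) - 0.9519 x(k-2) + n(k) - 1.8894 n(k-1) + n(k-2)
b = [1.0, -1.8894, 1.0]                       # numerator (MA) coefficients
a = [1.0, -1.9368, 0.9519]                    # denominator (AR) coefficients
x = lfilter(b, a, n)[10_000:]                 # discard the start-up transient

print("variance        :", np.mean(x * x))
print("autocorr, lag 1 :", np.mean(x[1:] * x[:-1]))
print("autocorr, lag 2 :", np.mean(x[2:] * x[:-2]))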
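For problems 15 to 17, case (a), the Wiener solution is w_o = R^(-1) p, and the steepest-descent and Newton recursions can be iterated directly on R and p. The gradient of the MSE surface is g_w = 2(Rw − p); the step size mu = 0.1 is an arbitrary value chosen within the stability bound mu < 1/λ_max for this R.

import numpy as np

R = 0.25 * np.array([[4., 3., 2., 1.],
                     [3., 4., 3., 2.],
                     [2., 3., 4., 3.],
                     [1., 2., 3., 4.]])
p = np.array([1/2, 3/8, 2/8, 1/8])

w_opt = np.linalg.solve(R, p)                 # Wiener solution w_o = R^(-1) p

mu = 0.1                                      # assumed step size
w_sd = np.zeros(4)                            # steepest descent, w(0) = 0
w_nt = np.zeros(4)                            # Newton, w(0) = 0
for _ in range(10):
    w_sd = w_sd - mu * 2 * (R @ w_sd - p)                      # w(k+1) = w(k) - mu g_w(k)
    w_nt = w_nt - mu * np.linalg.solve(R, 2 * (R @ w_nt - p))  # w(k+1) = w(k) - mu R^(-1) g_w(k)

print("Wiener          :", np.round(w_opt, 4))
print("steepest descent:", np.round(w_sd, 4))
print("Newton          :", np.round(w_nt, 4))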
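Finally, for problem 22, the equivalence wLCMV = T wGSC can be checked numerically. Here R, C, and f are generated at random purely for illustration, and B is taken as an orthonormal basis of the null space of C^T, so that C^T B = 0 and T = [C  B] has full rank.

import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(0)
N, J = 6, 2                                   # filter length and number of constraints (assumed)
A = rng.standard_normal((N, N))
R = A @ A.T + N * np.eye(N)                   # an arbitrary positive-definite matrix
C = rng.standard_normal((N, J))
f = rng.standard_normal(J)
B = null_space(C.T)                           # N x (N - J) matrix with C^T B = 0

RinvC = np.linalg.solve(R, C)
w_lcmv = RinvC @ np.linalg.solve(C.T @ RinvC, f)              # R^(-1) C (C^T R^(-1) C)^(-1) f

w_upper = np.linalg.solve(C.T @ C, f)                         # (C^T C)^(-1) f
w_lower = -np.linalg.solve(B.T @ R @ B, B.T @ R @ C @ w_upper)
w_gsc = np.concatenate([w_upper, w_lower])

T = np.hstack([C, B])
print(np.allclose(w_lcmv, T @ w_gsc))         # expected to print True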