
Page 1 (source: hosting.astro.cornell.edu/~cordes/A6523/A6523_lecture_15..., 2013-04-04)

A6523 Signal Modeling, Statistical Inference and Data Mining in Astrophysics, Spring 2013

Lecture 15
- Last comments on MESE and AR modeling
- Bayesian approach
- High-resolution approach

Reading: Chapter 13: Bayesian Revolution in Spectral Analysis (already assigned)

Page 2

The remaining lectures will include these topics:
• Markov processes and stochastic resonance
• Model fitting (frequentist and Bayesian)
  - Model definition
  - Linear and non-linear least squares, maximum likelihood
  - Parameter space exploration (grid, Metropolis-Hastings, SA, GA)
  - Parameter errors (credible intervals, Fisher matrix)
• Cholesky decomposition
• Principal component analysis (PCA)
• Localization/matched filtering
  - 1D, 2D problems (time, frequency/wavelength, image)
• Phase retrieval and Hilbert transforms
• Radon transform
• Extreme value statistics

Page 3

Entropy in terms of the spectrum:

To get a maximum entropy estimate of a spectrum, we need an expression for the entropy in terms of the spectrum. There is no general relation between the spectrum and the entropy. For Gaussian processes, however, there is a relation. This is appropriate since a Gaussian process is the one with maximum entropy out of all processes with the same variance; the spectrum is the variance per unit frequency, so this conceptual step is important. A relation exists¹ between the determinant of the covariance matrix and the spectrum, which is assumed to be bandlimited in (-f_N, f_N):

\lim_{N\to\infty} (\det C_x)^{1/(N+1)} = 2 f_N \exp\left[ \frac{1}{2 f_N} \int_{-f_N}^{f_N} df \, \ln S_x(f) \right].

The theorem depends on C_x being a Toeplitz matrix [matrix element C_{ij} depends only on (i - j)], i.e. that the process be WSS.

¹ An arcane proof is given in "Prediction-Error Filtering and Maximum-Entropy Spectral Estimation" in Nonlinear Methods of Spectral Analysis, Haykin ed., Springer-Verlag 1979, Appendix A, pp. 62-67. It is also given by Smylie et al. 1973, Meth. Comp. Phys. 13, 391.

Page 4

Thus

h = \lim_{N\to\infty} \frac{1}{2} \ln (\det C_x)^{1/(N+1)}
  = \frac{1}{2} \ln \left[ \lim_{N\to\infty} (\det C_x)^{1/(N+1)} \right]
  = \frac{1}{2} \ln 2 f_N + \frac{1}{4 f_N} \int_{-f_N}^{f_N} df \, \ln S_x(f).

Ignoring the first, constant term, we have

h = \frac{1}{4 f_N} \int_{-f_N}^{f_N} df \, \ln S_x(f).
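The determinant limit above can be checked numerically. The sketch below is illustrative, assuming an AR(1) process sampled at unit intervals (so f_N = 1/2) with known autocovariance C_k = \sigma_x^2 \rho^{|k|}; for that process the right-hand side 2 f_N \exp[(1/2f_N)\int df \ln S_x(f)] works out to the innovation variance \sigma_a^2, so (\det C_x)^{1/(N+1)} should approach \sigma_a^2:

```python
import numpy as np

# AR(1) sampled at unit intervals: f_N = 1/2, C_k = sig_x2 * rho^|k|.
# The theorem's right-hand side reduces to sig_a2 for this process.
rho, sig_a2 = 0.7, 2.0
sig_x2 = sig_a2 / (1.0 - rho**2)           # process variance

n = 400                                     # matrix size N + 1
idx = np.arange(n)
C = sig_x2 * rho ** np.abs(np.subtract.outer(idx, idx))  # Toeplitz covariance

sign, logdet = np.linalg.slogdet(C)         # numerically stable log-determinant
lhs = np.exp(logdet / n)                    # (det C_x)^{1/(N+1)}
print(lhs, sig_a2)                          # lhs approaches sig_a2 as N grows
```

Using `slogdet` rather than `det` avoids underflow: the determinant itself is astronomically small for large N even though its (N+1)-th root is order unity.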

Page 5

Heuristic "derivation" of the entropy rate expression:

Another way of viewing this is as follows. In calculating a power spectrum we are concerned with a second-order moment, by definition. Consequently, we can assume that the random process under consideration is Gaussian because:

1. we are maximizing the entropy (subject to constraints), and
2. given the second moment, the process with largest entropy is a Gaussian random process.

Note that while this assumption is satisfactory for estimating the power spectrum (a second moment), it is not necessarily accurate when we consider the estimation errors of the spectral estimate, which depend on fourth-order statistics. If the central limit theorem can be invoked then, of course, the Gaussian assumption becomes a good one once again.

Imagine that the process under study is constructed by passing white noise through a linear filter whose system function is \sqrt{S_x(f)}:

n(t) \longrightarrow [\sqrt{S_x(f)}] \longrightarrow x(t)

Consequently, the Fourier transforms are related as

X(f) = N(f) \sqrt{S_x(f)}.

Now N(f) itself is a Gaussian random variable because it is the sum of GRVs. Therefore,

Page 6

X(f) is a GRV and, viewing it as a one-dimensional random variable, its entropy is

H(f) = \frac{1}{2} \ln [2\pi e \, \sigma^2(f)],

but

\sigma^2(f) \equiv \langle |N(f)\sqrt{S(f)}|^2 \rangle = S(f)\, S_N.

Letting the white noise spectrum be S_N = 1, we have

\sigma^2(f) = S(f) \quad\text{and}\quad H(f) = \frac{1}{2} \ln [2\pi e \, S(f)].

Recall that white noise is uncorrelated between different frequencies:

\langle N(f)\, N^*(f') \rangle = S_N \, \delta(f - f').

Consequently, the information in different frequencies adds because of statistical independence and, therefore, to get the total entropy we simply integrate (add variances):

H = \int df \, H(f) = \frac{1}{2} \int df \, \ln[2\pi e \, S(f)].

Page 7

Again we ignore additive constants, and we also consider the case where the signal is bandlimited in |f| \le f_N and sampled over a time [0, T]. Therefore, the number of degrees of freedom is 2 f_N T.

The entropy per degree of freedom is

h = \frac{H}{2 f_N} = \frac{1}{4 f_N} \int_{-f_N}^{f_N} df \, \ln S(f).

The derivation of the ME spectrum follows the logical flow:

entropy in terms of the power spectrum S(f)
⇓
maximize H subject to constraints: the known values of C, where C and S(f) are a Fourier-transform pair
⇓
ME spectral estimator

van den Bos shows that extrapolating the covariance function to larger lags while maximizing entropy yields a matrix equation that is identical to that obtained by fitting an autoregressive model to the data. This implies that the two procedures are mathematically identical.

Page 8

Maximum Entropy Spectral Estimator

By maximizing the entropy rate expressed in terms of the spectrum, the spectral estimate can be expressed as (e.g. Edward & Fitelson 1973)

S(f) = \frac{\alpha_0^2}{f_N \, |\epsilon^T C^{-1} \delta|^2},

where C is the Toeplitz covariance matrix, which applies to WSS processes, and

\epsilon = \begin{pmatrix} 1 \\ e^{2\pi i f \Delta\tau} \\ \vdots \\ e^{2\pi i f M \Delta\tau} \end{pmatrix}, \qquad
C = \begin{pmatrix} C_{00} & C_{01} & \cdots & C_{0M} \\ \vdots & & & \vdots \\ C_{M0} & \cdots & \cdots & C_{MM} \end{pmatrix}, \qquad
\delta = \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}.

Toeplitz \Rightarrow C \equiv \begin{pmatrix} C_0 & C_1 & \cdots & C_M \\ \vdots & C_0 & & \vdots \\ C_M & \cdots & \cdots & C_0 \end{pmatrix}

Let

\gamma \equiv C^{-1} = \begin{pmatrix} \gamma_{00} & \cdots & \gamma_{0M} \\ \vdots & & \vdots \\ \gamma_{M0} & \cdots & \gamma_{MM} \end{pmatrix}.

Page 9

Then

\epsilon^T C^{-1} \delta = \sum_{j=0}^{M} \gamma_{j0} \, e^{2\pi i f j \Delta\tau}

and

S(f) = \frac{\alpha_0^2}{f_N \left| \sum_{j=0}^{M} \gamma_{j0} \, e^{2\pi i f j \Delta\tau} \right|^2}.

By rewriting the sum and redefining the constants this can be written

S(f) = \frac{\alpha_0^2}{f_N \left| \gamma_{00} + \sum_{j=1}^{M} \gamma_{j0} \, e^{2\pi i f j \Delta\tau} \right|^2}
     = \frac{\alpha_0^2}{f_N \, \gamma_{00}^2 \left| 1 + \sum_{j=1}^{M} \frac{\gamma_{j0}}{\gamma_{00}} \, e^{2\pi i f j \Delta\tau} \right|^2}.
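The matrix form can be exercised directly. A minimal sketch, assuming an analytic covariance sequence for a unit-amplitude sinusoid at an illustrative frequency f0 in white noise (\Delta\tau = 1, normalizing constants dropped): build the (M+1)x(M+1) Toeplitz matrix, take the first column of its inverse (that is what multiplying by \delta selects), and evaluate the MEM shape 1/|\sum_j \gamma_{j0} e^{2\pi i f j}|^2:

```python
import numpy as np

# Covariance of a unit-amplitude sinusoid at f0 plus white noise (Delta tau = 1):
# C_k = 0.5*cos(2*pi*f0*k) + sig2*delta_{k0}.  f0, sig2, M are illustrative.
f0, sig2, M = 0.2, 0.1, 20
k = np.arange(M + 1)
c = 0.5 * np.cos(2 * np.pi * f0 * k)
c[0] += sig2

C = c[np.abs(np.subtract.outer(k, k))]      # Toeplitz covariance matrix
gamma_col0 = np.linalg.inv(C)[:, 0]         # first column of gamma = C^{-1}

freqs = np.linspace(0.0, 0.5, 501)
E = np.exp(2j * np.pi * np.outer(freqs, k)) # rows are epsilon^T at each frequency
S = 1.0 / np.abs(E @ gamma_col0) ** 2       # MEM shape, constants dropped
print(freqs[np.argmax(S)])                  # sharp peak near f0 = 0.2
```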

Page 10

Thus, the ME spectral estimate can be put into the form

S(f) = \frac{P_M}{\left| 1 + \sum_{j=1}^{M} \alpha_j \, e^{2\pi i f j \Delta\tau} \right|^2},

where P_M is a constant that properly normalizes the spectrum.

This is the same spectrum as for an Mth-order AR process that can be fitted to the data, where the coefficients are determined by least squares.

Spectrum of an AR Process:

Consider the following Mth-order AR process:

x_t = \underbrace{a_t}_{\text{white noise}} + \underbrace{\sum_{j=1}^{M} \alpha_j \, x_{t-j}}_{\text{autoregressive part}}

A zeroth-order process would be x_t = a_t (i.e. white noise). Scargle would term the above definition a causal AR process. An acausal or two-sided process would allow negative values of j in the sum on the RHS.

The coefficients \alpha_j, j = 1, \ldots, M are the AR coefficients. In fitting an AR model to the data, one must determine the order M as well as the M coefficients.

Page 11

Define the DFTs

X(f) \equiv \sum_{t=0}^{N-1} x_t \, e^{-2\pi i f t/N}, \qquad
A(f) \equiv \sum_{t=0}^{N-1} a_t \, e^{-2\pi i f t/N}.

Substituting the definition of the AR process, we have

X(f) = \sum_{j=1}^{M} \alpha_j \, X(f) \, e^{-2\pi i f j/N} + A(f),

and, solving for X(f),

X(f) = \frac{A(f)}{1 - \sum_{j=1}^{M} \alpha_j \, e^{-2\pi i f j/N}}.

Page 12

The spectrum is then

S(f) = \frac{|A(f)|^2}{\left| 1 - \sum_{j=1}^{M} \alpha_j \, e^{-2\pi i f j/N} \right|^2}
\;\propto\; \frac{1}{\left| 1 - \sum_{j=1}^{M} \alpha_j \, e^{-2\pi i f j/N} \right|^2}.

As is obvious, the AR spectrum has the same form as the maximum-entropy spectrum.
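To see this form in action, a short sketch (with illustrative AR(2) coefficients \alpha_1 = 1.3, \alpha_2 = -0.8 and unit-variance white noise) simulates the process and compares the analytic AR spectrum against a periodogram of the realization:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([1.3, -0.8])               # x_t = a_t + 1.3 x_{t-1} - 0.8 x_{t-2}
N = 4096
a = rng.standard_normal(N)
x = np.zeros(N)
for t in range(2, N):
    x[t] = a[t] + alpha[0] * x[t - 1] + alpha[1] * x[t - 2]

freqs = np.fft.rfftfreq(N)                  # cycles per sample
z = np.exp(-2j * np.pi * freqs)
S_ar = 1.0 / np.abs(1 - alpha[0] * z - alpha[1] * z**2) ** 2  # sigma_a^2 = 1
P = np.abs(np.fft.rfft(x)) ** 2 / N         # periodogram of the realization
print(freqs[np.argmax(S_ar)])               # spectral peak near f ~ 0.12
```

The pole pair of this AR(2) sits at radius \sqrt{0.8} \approx 0.89 and angle corresponding to f \approx 0.12, so the analytic spectrum peaks there; the noisy periodogram scatters about the same smooth curve.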

Page 13

Matrix Form of the Fourier-Transform-Based Spectral Estimate:

It is instructive to compare the matrix form of the maximum entropy spectrum with the power spectral estimate defined as the Fourier transform of the autocorrelation function. This is identical to the spectrum found by taking the squared magnitude of the Fourier transform of the time series, and is sometimes called the Bartlett estimate, because the Bartlett lag window is a triangle function.

Let C_{n-n'} be the n, n' element of the covariance matrix; then the Bartlett estimate is

S_B(f) = N^{-1} \sum_{\ell=-N+1}^{N-1} C_\ell \left( 1 - \frac{|\ell|}{N} \right) e^{-2\pi i f \ell \Delta\tau},

which can also be written as

S_B(f) = N^{-2} \sum_{n} \sum_{n'} e^{2\pi i f \Delta\tau\, n} \, C_{n-n'} \, e^{-2\pi i f \Delta\tau\, n'},

and hence as a matrix product

S_B(f) = N^{-2} \, \epsilon^T C \, \epsilon^*.
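The equivalence of the lag-window sum and the matrix product is easy to verify numerically. A sketch, using arbitrary illustrative lag values C_\ell and \Delta\tau = 1:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 64
c = rng.standard_normal(N)                  # illustrative lags C_0 ... C_{N-1}
C = c[np.abs(np.subtract.outer(np.arange(N), np.arange(N)))]  # Toeplitz matrix

f = 0.13                                    # any test frequency (Delta tau = 1)
eps = np.exp(2j * np.pi * f * np.arange(N))

S_matrix = (eps @ C @ np.conj(eps)).real / N**2           # N^{-2} eps^T C eps*

ells = np.arange(-(N - 1), N)
S_lag = np.sum(c[np.abs(ells)] * (1 - np.abs(ells) / N)
               * np.exp(-2j * np.pi * f * ells)).real / N  # lag-window form
print(S_matrix, S_lag)                      # the two forms agree
```

The double sum over (n, n') contains each lag \ell = n - n' exactly N - |\ell| times, which is where the triangular (Bartlett) window comes from.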

Page 14

This can be compared with the ME and high-resolution ("ML") estimates:

S_{ME}(f) = \frac{\text{constant}}{|\epsilon^T C^{-1} \delta|^2}, \qquad
S_{ML}(f) = \frac{1}{\epsilon^\dagger C^{-1} \epsilon}.

Page 15

On the Applicability of the Maximum Entropy Spectral Estimate:

The spectral estimator:

Makes use of the 2M+1 values of the covariance function that are known or estimated. There is no choice in M. If \rho_M \to 0 then the spectral estimate will reflect that (Jaynes says that the corresponding Lagrange multiplier will be zero).

The AR approach:

M appears to be a parameter that must be chosen according to some criterion.

Reconciliation:

Jaynes is correct so long as the expression used for the entropy is correct. It may not be in some cases. If it is, then simply use the 2M+1 known values. If the entropy expression is not applicable, then one must view the situation as one where an AR model is being fitted to the data, with M an unknown parameter. The problem reduces to finding (i) the order of the AR process and (ii) the coefficients.

Page 16

Example: Suppose one knows the AR process is zeroth order; then \alpha_j = 0 for j \ge 1.

The MEM spectral estimate is

S(f) = P_A = \sigma_a^2 / N = \text{constant}.

Contrast this with a Fourier transform estimate [figure not reproduced], which would become smooth only if one smoothed over all spectral values.

The question (again) is: how does one know or determine the proper AR order to use?

Page 17

Estimates for AR coefficients:

For all the nitty-gritty details of the calculation of AR coefficients, see Ulrych and Bishop, "Maximum Entropy Spectral Analysis and Autoregressive Decomposition," 1975, Rev. Geophys. Space Phys., 13, 183. There are Fortran listings for the Yule-Walker and Burg algorithms for estimating coefficients. See also Numerical Recipes.

Two problems remain:

1. How does one calculate the order of the AR model?
2. What are the estimation errors of the spectral estimate?

The order of the AR model can be estimated by looking at the prediction error as a function of order M.

With N = number of data points and M = order of the AR process (or of a closely related prediction-error filter), evaluate the quantity (the "final prediction error")

(FPE)_M = \underbrace{\frac{N + (M+1)}{N - (M+1)}}_{\text{increases as } M \text{ increases}} \;\; \underbrace{\sum_t (x_t - \hat{x}_t)^2}_{\text{decreases}}

The order M is chosen as the one that minimizes the FPE (the Akaike criterion).
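The FPE criterion is straightforward to exercise. A sketch, assuming a plain least-squares AR fit (rather than the Burg or Yule-Walker routines cited above): simulate a true AR(2) process and scan candidate orders M = 1..10; the first factor penalizes order while the mean-square prediction error rewards it:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 2000
a = rng.standard_normal(N)
x = np.zeros(N)
for t in range(2, N):                       # true order is M = 2
    x[t] = a[t] + 0.75 * x[t - 1] - 0.5 * x[t - 2]

def fpe(x, M):
    """Least-squares AR(M) fit, then the final prediction error."""
    X = np.column_stack([x[M - 1 - j : len(x) - 1 - j] for j in range(M)])
    y = x[M:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    err = np.mean((y - X @ coef) ** 2)      # mean-square prediction error
    n = len(y)
    return (n + M + 1) / (n - M - 1) * err

vals = [fpe(x, M) for M in range(1, 11)]
best = 1 + int(np.argmin(vals))
print(best, vals[:3])                       # FPE drops sharply from M=1 to M=2
```

Beyond the true order the prediction error barely decreases, so the (N + M + 1)/(N - M - 1) penalty dominates and the FPE turns back up.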

Page 18

Final Prediction Error [figure not reproduced]

Page 19

FPE Examples for Red Processes with Power-Law Spectra

[Two figure sets, each showing three panels: Final Prediction Error vs. order of the AR process (orders 0-20); the time series x(t); and the spectrum vs. frequency bin, with black = generated spectrum, blue = periodogram, red = AR spectrum. Top set: S(f) \propto f^{-0.0}; bottom set: S(f) \propto f^{-2.0}.]

Page 20

Application of MEM
• Sinusoids + noise
• Noise only
• \Delta t = 0.01 yr
• Nyquist frequency f_N = 50 cycles yr^{-1}
• Points:
  - MEM can give much better performance than the FFT-based power spectrum
  - Using the wrong AR order, however, can give spurious results

Page 21

FFT Power Spectrum [figure not reproduced]

Pages 22-24

MEM Spectrum, AR order determined empirically [figures not reproduced]

Pages 25-29

MEM Spectrum, AR order forced to the value indicated [figures not reproduced]

Page 30

MEM Spectrum of Timing Residuals of a Millisecond Pulsar [figure not reproduced]

Page 31

A Bayesian Approach to Spectral Analysis

Chirped Signals

Chirped signals are oscillating signals with time-variable frequencies, usually with a linear variation of frequency with time, e.g.

f(t) = A \cos(\omega t + \alpha t^2 + \theta).

Examples:

• plasma wave diagnostic signals
• signals propagated through dispersive media (seismic cases, plasmas)
• gravitational waves from inspiraling binary stars
• Doppler-shifted signals over fractions of an orbit (e.g. acceleration of a pulsar in its orbit)

Jaynes' Approach to Spectral Analysis:

cf. Jaynes, "Bayesian Spectrum and Chirp Analysis," in Maximum Entropy and Bayesian Spectral Analysis and Estimation Problems.

Result: optimal processing is a nonlinear operation on the data without recourse to smoothing. However, the DFT-based spectrum (the "periodogram") plays a key role in the estimation.

Page 32

Start with Bayes' theorem:

\underbrace{p(H|DI)}_{\text{posterior prob.}} = \underbrace{p(H|I)}_{\text{prior prob.}} \; \frac{\overbrace{p(D|HI)}^{\text{new data}}}{p(D|I)}

In this context, probabilities represent a simple mapping of degrees of belief onto real numbers.

Recall:

p(D|HI) vs. D for fixed H = "sampling distribution"
p(D|HI) vs. H for fixed D = "likelihood function"

Read H as a statement that a parameter vector lies in a region of parameter space.

Measured quantity:

y(t) = f(t) + e(t)
f(t) = A \cos(\omega t + \alpha t^2 + \theta)
e(t) = white Gaussian noise, \langle e \rangle = 0, \langle e^2 \rangle = \sigma^2

Data set: D = \{ y(t), \; |t| \le T \}, N = 2T + 1 data points.

Data model: 4 parameters, A, \omega, \alpha, \theta.

Page 33

Data probability: the probability of obtaining a data set of N samples is

P(D|HI) = \prod_t P[y(t)] = \prod_{t=-T}^{T} (2\pi\sigma^2)^{-1/2} \exp\left\{ -\frac{1}{2\sigma^2} [y(t) - f(t)]^2 \right\}, \tag{1}

which we can rewrite as a likelihood function once we acquire a data set and evaluate the probability for a specific H. Writing out the parameters explicitly, the likelihood function is

L(A, \omega, \alpha, \theta) \propto \exp\left\{ -\frac{1}{2\sigma^2} \sum_{t=-T}^{T} [y(t) - A\cos(\omega t + \alpha t^2 + \theta)]^2 \right\}.

For simplicity, assume that \omega T \gg 1 so that many cycles of oscillation are summed over. Then

\sum_t \cos^2(\omega t + \alpha t^2 + \theta) = \sum_t \frac{1}{2}\left[ 1 + \cos 2(\omega t + \alpha t^2 + \theta) \right] \approx \frac{2T+1}{2} \equiv \frac{N}{2}.
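The \sum\cos^2 \approx N/2 step is easy to check numerically; a sketch with illustrative parameter values chosen so that \omega T \gg 1:

```python
import numpy as np

T = 500
t = np.arange(-T, T + 1)                    # N = 2T + 1 samples
w, alpha, theta = 0.9, 1e-5, 0.3            # illustrative values, w*T >> 1
s = np.sum(np.cos(w * t + alpha * t**2 + theta) ** 2)
N = len(t)
print(s, N / 2)                             # s is close to N/2 = 500.5
```

The residual is the rapidly oscillating \frac{1}{2}\sum\cos 2(\,\cdot\,) term, which stays of order unity rather than growing with N.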

Page 34

Expanding the argument of the exponential in the likelihood function, we have

[y(t) - A\cos(\omega t + \alpha t^2 + \theta)]^2 = y^2(t) + A^2 \cos^2(\,\cdot\,) - 2A\, y(t) \cos(\,\cdot\,).

We care only about terms that are functions of the parameters, so we drop the y^2(t) term to get

-\frac{1}{2\sigma^2} \sum_{t=-T}^{T} [y(t) - A\cos(\,\cdot\,)]^2
\;\longrightarrow\; -\frac{1}{2\sigma^2} \sum_t \left[ A^2\cos^2(\,\cdot\,) - 2A\, y(t)\cos(\,\cdot\,) \right]
\;\longrightarrow\; \frac{A}{\sigma^2} \sum_t y(t)\cos(\,\cdot\,) - \frac{N A^2}{4\sigma^2}.

The likelihood function becomes

L(A, \omega, \alpha, \theta) \propto \exp\left\{ \frac{A}{\sigma^2} \sum_t y(t)\cos(\omega t + \alpha t^2 + \theta) - \frac{N A^2}{4\sigma^2} \right\}.

Integrating out the phase:

In calculating a power spectrum [in this case, a chirped power spectrum (a "chirpogram")], we do not care about the phase of any sinusoid in the data. In Bayesian estimation, such a parameter is called a nuisance parameter.

Since we do not know anything about \theta, we integrate over its prior distribution, a pdf that is

Page 35

uniform over [0, 2\pi] (a uniform pdf has maximum uncertainty):

f_\theta(\theta) = \begin{cases} \frac{1}{2\pi} & 0 \le \theta \le 2\pi \\ 0 & \text{otherwise.} \end{cases}

The marginalized likelihood function becomes

L(A, \omega, \alpha) \propto \int_0^{2\pi} d\theta \, L(A, \omega, \alpha, \theta)
= \int_0^{2\pi} d\theta \, \exp\left\{ \frac{A}{\sigma^2} \sum_t y(t)\cos(\omega t + \alpha t^2 + \theta) - \frac{N A^2}{4\sigma^2} \right\}
= e^{-N A^2/4\sigma^2} \int_0^{2\pi} d\theta \, \exp\left\{ \frac{A}{\sigma^2} \sum_t y(t)\cos(\omega t + \alpha t^2 + \theta) \right\}.

Using the identity

\cos(\omega t + \alpha t^2 + \theta) = \cos(\omega t + \alpha t^2)\cos\theta - \sin(\omega t + \alpha t^2)\sin\theta,

we have

\sum_t y(t)\cos(\omega t + \alpha t^2 + \theta)
= \cos\theta \underbrace{\sum_t y(t)\cos(\omega t + \alpha t^2)}_{P} \; - \; \sin\theta \underbrace{\sum_t y(t)\sin(\omega t + \alpha t^2)}_{Q}

Page 36

\sum_t y(t)\cos(\omega t + \alpha t^2 + \theta) \equiv P\cos\theta - Q\sin\theta
= \sqrt{P^2 + Q^2} \, \cos[\theta + \tan^{-1}(Q/P)].

This result may be used to evaluate the integral over \theta in the marginalized likelihood function:

\int_0^{2\pi} d\theta \, \exp\left\{ \frac{A}{\sigma^2} \sqrt{P^2 + Q^2} \, \cos[\theta + \underbrace{\tan^{-1}(Q/P)}_{\text{irrelevant phase shift}}] \right\}.

To evaluate the integral we use the identity

I_0(x) \equiv \frac{1}{2\pi} \int_0^{2\pi} d\theta \, e^{x\cos\theta} = \text{modified Bessel function}.

This yields (dropping the constant 2\pi)

I_0\!\left( \frac{A}{\sigma^2} \sqrt{P^2 + Q^2} \right).

We now simplify P^2 + Q^2:

P^2 + Q^2 = \left[ \sum_t y(t)\cos(\omega t + \alpha t^2) \right]^2 + \left[ \sum_t y(t)\sin(\omega t + \alpha t^2) \right]^2
= \sum_t \sum_{t'} y(t)\, y(t') \underbrace{\left[ \cos(\omega t + \alpha t^2)\cos(\omega t' + \alpha t'^2) + \sin(\omega t + \alpha t^2)\sin(\omega t' + \alpha t'^2) \right]}_{\cos[\omega(t-t') + \alpha(t^2 - t'^2)]}

Page 37

P^2 + Q^2 = \sum_t \sum_{t'} y(t)\, y(t') \cos[\omega(t-t') + \alpha(t^2 - t'^2)].

Define

C(\omega, \alpha) \equiv N^{-1}(P^2 + Q^2) = N^{-1} \sum_t \sum_{t'} y(t)\, y(t') \cos[\omega(t-t') + \alpha(t^2 - t'^2)].

Then the integral over \theta gives

I_0\!\left( \frac{A\sqrt{N\, C(\omega,\alpha)}}{\sigma^2} \right)

and the marginalized likelihood is

L(A, \omega, \alpha) = e^{-N A^2/4\sigma^2} \, I_0\!\left( \frac{A\sqrt{N\, C(\omega,\alpha)}}{\sigma^2} \right).

Page 38

Notes:

(1) The data appear only in C(\omega, \alpha).

(2) C is a sufficient statistic, meaning that it contains all the information from the data that is relevant to inference using the likelihood function.

(3) How do we read L(A, \omega, \alpha)? As the probability distribution of the parameters A, \omega, \alpha in terms of the data-dependent quantity C(\omega, \alpha). (Note that L is not normalized as a PDF.) As such, L is a quite different quantity from the Fourier-based power spectrum.

(4) What is C(\omega, \alpha) \equiv N^{-1} \sum_t \sum_{t'} y(t)\, y(t') \cos[\omega(t-t') + \alpha(t^2 - t'^2)]? For a given data set, \omega and \alpha are variables. If we plot C(\omega, \alpha), we expect to get a large value when \omega = \omega_{\rm signal} and \alpha = \alpha_{\rm signal}.

(5) For a non-chirped but oscillatory signal (\omega \neq 0, \alpha = 0), the quantity C(\omega, 0) is nothing other than the periodogram (the squared magnitude of the Fourier transform of the data). We then see that, for this case, the likelihood function is a nonlinear function of the Fourier estimate of the power spectrum.
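Since P - iQ = \sum_t y(t)\, e^{-i(\omega t + \alpha t^2)}, the statistic can be evaluated as C(\omega,\alpha) = |\sum_t y(t)\, e^{-i(\omega t + \alpha t^2)}|^2 / N without forming the double sum. A sketch with illustrative chirp parameters and unit-variance noise, recovering \omega and \alpha by a grid search:

```python
import numpy as np

rng = np.random.default_rng(3)
T = 250
t = np.arange(-T, T + 1)                    # N = 2T + 1 samples
N = len(t)
w_true, a_true = 0.5, 2e-4                  # illustrative frequency and chirp rate
y = np.cos(w_true * t + a_true * t**2 + 0.7) + rng.standard_normal(N)

def C(w, alpha):
    """Chirp statistic C = |sum_t y(t) exp(-i(w t + alpha t^2))|^2 / N."""
    return np.abs(np.sum(y * np.exp(-1j * (w * t + alpha * t**2)))) ** 2 / N

ws = np.linspace(0.3, 0.7, 81)
als = np.linspace(0.0, 4e-4, 41)
grid = np.array([[C(w, al) for al in als] for w in ws])
iw, ia = np.unravel_index(np.argmax(grid), grid.shape)
print(ws[iw], als[ia])                      # near (0.5, 2e-4)
```

At \alpha = 0 the inner sum is just the DFT of the data, so the \alpha = 0 slice of this grid is the ordinary periodogram, consistent with note (5).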

Page 39

A Limiting Form:

For argument x \gg 1, the Bessel function

I_0(x) \sim \frac{e^x}{\sqrt{2\pi x}}.

In this case the marginalized likelihood is

L(A, \omega, \alpha) \propto e^{-N A^2/4\sigma^2} \, I_0\!\left( \frac{A\sqrt{N\, C(\omega,\alpha)}}{\sigma^2} \right)
\propto e^{-N A^2/4\sigma^2} \times \frac{ \exp\left[ \frac{A\sqrt{N\, C(\omega,\alpha)}}{\sigma^2} \right] }{ \left[ 2\pi \frac{A\sqrt{N\, C(\omega,\alpha)}}{\sigma^2} \right]^{1/2} }.

Since C(\omega, \alpha) is large when \omega and \alpha match those of any true signal, we see that it is exponentiated, as compared to appearing linearly in the periodogram.

Page 40 [content not reproduced]

Page 41

Interpretation of the Bayesian and Fourier Approaches

We found the marginalized likelihood for the frequency and chirp rate to be

L(A, \omega, \alpha) = e^{-N A^2/4\sigma^2} \, I_0\!\left( \frac{A\sqrt{N\, C(\omega,\alpha)}}{\sigma^2} \right),

and the limiting form of the Bessel function for argument x \gg 1 is

I_0(x) \sim \frac{e^x}{\sqrt{2\pi x}}.

In this case the marginalized likelihood is

L(A, \omega, \alpha) \propto e^{-N A^2/4\sigma^2} \times \frac{ \exp\left[ \frac{A\sqrt{N\, C(\omega,\alpha)}}{\sigma^2} \right] }{ \left[ 2\pi \frac{A\sqrt{N\, C(\omega,\alpha)}}{\sigma^2} \right]^{1/2} }.

Since C(\omega, \alpha) is large when \omega and \alpha match those of any true signal, we see that it is exponentiated, as compared to appearing linearly in the periodogram.
Page 42

Now let's consider the case with no chirp rate, \alpha = 0. Examples in the literature show that the width of the Bayesian PDF is much narrower than the periodogram, C(\omega, 0). Does this mean that the uncertainty principle has been avoided?

The answer is no!

Uncertainty principle in the periodogram:

For a data set of length T, the frequency resolution implied by the spectral window function is

\delta\omega \sim 2\pi\,\delta f \sim \frac{2\pi}{T}.

Width of the Bayesian PDF:

When the argument of the Bessel function is "large," the exponentiation causes the PDF to be much narrower than the spectral window of the periodogram.

Page 43

Interpretation:

The periodogram is the distribution of power (or variance) with frequency for the particular realization of data used to form the periodogram. The spectral window also depicts the distribution of variance for a pure sinusoid in the data (with infinite signal-to-noise ratio).

The Bayesian posterior is the PDF for the frequency of a sinusoid; it therefore represents a very different quantity from the periodogram, and the two are thus not directly comparable.

Page 44

1. The Bayesian method addresses the question, "What is the PDF for the frequency of the sinusoid that is in the data?"

2. The periodogram is the distribution of variance in frequency.

3. If we use the periodogram to estimate the sinusoid's frequency, we get a result that is more comparable:

(a) First note that the width of the posterior PDF involves the signal-to-noise ratio \sqrt{N} A/\sigma (in the square root of the periodogram), while the width of the periodogram's spectral window is independent of the SNR.

(b) General result: if a spectral line has width \Delta\omega, its centroid can be determined to an accuracy

\delta\omega \sim \frac{\Delta\omega}{\rm SNR}.

This result follows from matched filtering, which we will discuss later on.

(c) Quantitatively, the periodogram yields the same information about the location of the spectral line as does the posterior PDF.

4. Problem: derive an estimate for the width of the posterior PDF that can be compared with the estimate for the periodogram.
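A numerical version of this comparison is a useful sanity check. A sketch, assuming A and \sigma are known and fixed, using numpy's I_0 to evaluate the marginalized likelihood at \alpha = 0 and measuring the half-power widths of both curves:

```python
import numpy as np

rng = np.random.default_rng(4)
N, A, sig, w_true = 500, 1.0, 1.0, 0.8
t = np.arange(N)
y = A * np.cos(w_true * t + 0.3) + sig * rng.standard_normal(N)

ws = np.linspace(0.75, 0.85, 2001)
Cw = np.array([np.abs(np.sum(y * np.exp(-1j * w * t))) ** 2 / N for w in ws])

# marginalized likelihood L ~ exp(-N A^2/4 sig^2) I0(A sqrt(N C)/sig^2);
# work in log space to avoid overflow before normalizing
logL = np.log(np.i0((A / sig**2) * np.sqrt(N * Cw))) - N * A**2 / (4 * sig**2)
post = np.exp(logL - logL.max())            # peak normalized to 1
Cn = Cw / Cw.max()

def fwhm(vals):
    sel = ws[vals > 0.5]                    # half-power width on the grid
    return sel.max() - sel.min()

print(fwhm(post), fwhm(Cn))                 # posterior is far narrower
```

The periodogram's half-power width is fixed at roughly 2\pi/N, while the posterior narrows as the SNR grows, which is the point of item 3(a).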

Page 45

Figure 1: Left: time series of sinusoid + white noise with A/\sigma = 1, sampled N = 500 times over an interval of length T = 500. Right: the periodogram (red) and the Bayesian PDF of the time series.

Page 46

Figure 2: Left: time series of sinusoid + white noise with A/\sigma = 1/4, sampled N = 500 times over an interval of length T = 500. Right: the periodogram (red) and the Bayesian PDF of the time series.

Page 47

Maximum Likelihood Spectral Estimation (MLSE)

MLSE is a misnomer; a better name is the High Resolution Method, because the method is derived by explicitly maximizing the sensitivity to a given frequency while minimizing the effects (i.e. leakage, a.k.a. bias) from other frequencies.

MLSE was developed in the 1960s by Capon to analyze data from arrays of sensors, to maximize the response to one particular direction and minimize the response to others; e.g. LASA = Large Aperture Seismic Array (to discriminate earthquakes from underground nuclear tests).

There is a close relationship to beam forming in acoustic arrays and beam forming in radio interferometric arrays.

In the original development of the method discussed by Capon¹, the spectral estimator is very closely related to a filter that gives the ML estimate of a signal when it is corrupted by Gaussian noise:

S + N \longrightarrow [A] \longrightarrow \hat{S}

¹ See "Nonlinear Methods of Spectral Analysis," Haykin, ed., pp. 154-179.

Page 48

This system involves:

a) a filter that gives the ML estimate of the signal when corrupted by Gaussian noise, which is also . . .
b) the filter that generally gives the minimum-variance, unbiased estimate of the signal for arbitrary noise, and . . .
c) has coefficients that yield an unbiased, high resolution spectral estimate for any signal.

The way the filter coefficients are derived (i.e. the constraints applied to the maximization problem) implies that the spectral estimate minimizes leakage.

The HRM is sometimes described as a positivity-constrained reconstruction method which minimizes leakage.

Thus, the intent of the MLSE technique is much different from that of the MESE technique:

MLSE minimizes variance and bias (recall how spectral bias was related to resolution)
MESE in effect (via its relationship to prediction filters) tries to maximize resolution

Page 49

We will derive the ML spectral estimate following the derivation of Lacoss.

Method: construct a linear filter that

1. yields an unbiased estimate of a sinusoidal signal, and
2. minimizes the variance of the output with respect to corrupting white noise.

Pass a signal y_n through a linear filter:

y_n \longrightarrow [a_k] \longrightarrow x_n, \qquad
x_n = \sum_{k=1}^{N} a_k \, y_{n-k+1} \quad \text{(causal)},

where the input is a deterministic sinusoid added to zero-mean noise having an arbitrary spectrum:

y_n = A e^{i\Omega n} + n_n.

We will determine the coefficients a_k by invoking the above two criteria.

Page 50: Signal Modeling, Statistical Inference and Data Mining in …hosting.astro.cornell.edu/~cordes/A6523/A6523_lecture_15... · 2013. 4. 4. · Thus, the ME spectral estimate can be put

Goal: We want the filter to pass A e^{iΩn} undistorted but to reject the noise as much as possible. Thus, we require

1. no bias (in the mean):

⟨xn⟩ ≡ Σ_{k=1}^{N} ak ⟨y_{n−k+1}⟩
     = Σ_{k=1}^{N} ak ⟨A e^{iΩ(n−k+1)} + n_{n−k+1}⟩
     = Σ_{k=1}^{N} ak A e^{iΩ(n−k+1)}
     ≡ A e^{iΩn}    (if no bias)

⇒ Σ_{k=1}^{N} ak e^{iΩ(1−k)} = 1    (constraint equation)


This can be written in matrix form using ‘†’ to denote transpose conjugate:

ε†a = 1,    ε ≡ (1, e^{iΩ}, e^{i2Ω}, . . . , e^{i(N−1)Ω})^t,    a = (a1, a2, . . . , aN)^t.


2. Minimum variance of the filter output:

σ² ≡ ⟨|xn − ⟨xn⟩|²⟩
   = ⟨|Σ_k ak y_{n−k+1} − A e^{iΩn}|²⟩
   = ⟨|Σ_k ak (A e^{iΩ(n−k+1)} + n_{n−k+1}) − A e^{iΩn}|²⟩    (the signal terms cancel by constraint 1)
   = ⟨|Σ_k ak n_{n−k+1}|²⟩
   = Σ_k Σ_{k′} ak ⟨n_{n−k+1} n*_{n−k′+1}⟩ a*_{k′}
   = a† C a,

where C is the covariance matrix of the noise, n.


3. Minimize σ² with respect to a, subject to the constraint ε†a = 1.

By minimizing σ² subject to the constraint, we get the smallest error and no bias. Therefore we minimize L with respect to a:

L = σ² + λ(ε†a − 1) = a† C a + λ(ε†a − 1).

We can take ∂L/∂Re(aj) and ∂L/∂Im(aj) separately to derive equations for a, then recombine these equations to get

a† C + λ ε† = 0.

This is the same as we get by taking

∇_a L ≡ ∂L/∂a = (∂/∂a)(a† C a) + λ (∂/∂a)(ε†a) = a† C + λ ε† = 0    for a = a0.


The solution for a0 is

a0† C = −λ ε†
⇒ C† a0 = −λ* ε
a0 = −λ* (C†)^{−1} ε.

Now substitute back into the constraint equation ε†a0 = 1 (the no-bias relation) to get

ε†a0 = −λ* ε† (C†)^{−1} ε = 1,    or    −λ* = 1 / (ε† (C†)^{−1} ε)

⇒ a0 = (C†)^{−1} ε / (ε† (C†)^{−1} ε).

Note the denominator is real (a quadratic form): [ε† (C†)^{−1} ε]† = ε† C^{−1} ε.
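As a quick numerical sanity check (not from the lecture notes), the optimal coefficients can be verified to satisfy the no-bias constraint and to minimize the output variance; the values of N, Ω, and the AR(1)-style noise covariance below are arbitrary illustrative choices.

```python
import numpy as np

# Hypothetical check of a0 = (C†)^{-1} ε / (ε† (C†)^{-1} ε).
# N, Omega, and the AR(1)-style covariance are arbitrary illustrative choices.
N, Omega, rho = 8, 0.7, 0.6
k = np.arange(N)
C = rho ** np.abs(np.subtract.outer(k, k))   # Hermitian (here real) noise covariance
eps = np.exp(1j * Omega * k)                 # ε = (1, e^{iΩ}, ..., e^{i(N-1)Ω})^t

Cinv_eps = np.linalg.solve(C, eps)
a0 = Cinv_eps / (eps.conj() @ Cinv_eps)      # optimal filter coefficients

bias = eps.conj() @ a0                       # should equal 1 (no bias)
sigma2_min = np.real(a0.conj() @ C @ a0)     # should equal 1/(ε† C^{-1} ε)

# Any other filter b obeying ε† b = 1 has output variance >= sigma2_min:
rng = np.random.default_rng(0)
r = rng.normal(size=N) + 1j * rng.normal(size=N)
b = a0 + (r - eps * (eps.conj() @ r) / N)    # perturbation orthogonal to ε
sigma2_b = np.real(b.conj() @ C @ b)
```

The perturbation is projected orthogonal to ε (using ε†ε = N), so b remains unbiased while its variance can only grow.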


4. Minimum variance: Substitute a0 back into the expression for σ² to find the minimum variance:

σ²_min ≡ a0† C a0
       = [(C†)^{−1} ε / (ε† (C†)^{−1} ε)]† C [(C†)^{−1} ε / (ε† (C†)^{−1} ε)]
       = (ε† C^{−1} ε) / [(ε† C^{−1} ε)(ε† (C†)^{−1} ε)]

σ²_min = 1 / (ε† C^{−1} ε).

This is the power in the noise components with the same frequency as the signal, Ω. (Note we have used the Hermitian relation C† ≡ C.)


Interpretation:

1. σ²_min = the portion of the noise that leaks through the filter, which is attempting to estimate a sinusoid corrupted by the noise.

2. Note that the filter coefficients and σ²_min are functions of Ω and of the noise covariance matrix. But they do not depend on the amplitude of the sinusoid.

3. The trick: now take away the signal but keep the noise. We allow Ω to vary across a range of frequencies we are interested in. Then σ²_min(Ω) is a spectral estimate for the noise spectrum (which was left arbitrary).

4. ⇒ maximum likelihood spectral estimator

S_ML(f) = 1 / (ε† C^{−1} ε)    with Ω = 2πf∆τ.

Further comments:

1. As used, the covariance matrix C is an ensemble-average quantity. Applications to actual data require use of some estimate of the covariance matrix.

2. The derivation is for equally spaced data.

3. The spectral estimate should work well on processes with steep power-law spectra because the estimator is derived explicitly to minimize bias.
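A minimal sketch of the estimator in action, assuming a single complex sinusoid in white noise with a known ensemble covariance (the values of N, P, σ², and f0 are illustrative, not from the lecture):

```python
import numpy as np

# S_ML(f) = 1/(ε† C^{-1} ε) for C = P ε0 ε0† + σ² I (one sinusoid + white noise).
# N, P, sigma2, and f0 are arbitrary illustrative choices.
N, P, sigma2 = 16, 4.0, 1.0
f0 = 0.2                                   # true frequency, cycles per sample
n = np.arange(N)
eps0 = np.exp(2j * np.pi * f0 * n)
C = P * np.outer(eps0, eps0.conj()) + sigma2 * np.eye(N)

def S_ML(f):
    eps = np.exp(2j * np.pi * f * n)       # Ω = 2π f Δτ with Δτ = 1
    return 1.0 / np.real(eps.conj() @ np.linalg.solve(C, eps))

freqs = np.linspace(0.0, 0.5, 501)
S = np.array([S_ML(f) for f in freqs])
f_peak = freqs[np.argmax(S)]               # the spectrum peaks at the sinusoid
```

Note that only the noise-plus-signal covariance enters: the estimator has no explicit model of the sinusoid, yet its maximum lands on the sinusoid frequency.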


Data-adaptive aspect of the MLSE spectral estimator:

Recall that the Fourier-transform-based estimator has a fixed window. The MLSE has a data-adaptive spectral window, as we will show.

The filter coefficients are a function of the frequency of the sinusoid, Ω:

a0(Ω) = C^{−1} ε(Ω) / (ε†(Ω) C^{−1} ε(Ω)).

As Ω is varied, the coefficients a0 vary, but subject to the normalization constraint ε† a0 = 1.

For a given Ω, which labels the frequency component we are attempting to estimate, what is the response to other frequencies, ω?

Define the window function

W(ω, Ω) = a0†(Ω) ε(ω)

as the response to frequency ω of a filter designed to pass the frequency Ω.

The window function satisfies the normalization

W(Ω, Ω) ≡ 1.

The equivalent quantity for a Fourier transform estimator would be

W(ω, Ω) = sin[(ω − Ω)T/2] / [(ω − Ω)T/2].
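The window normalization W(Ω, Ω) = 1 can be checked directly; the AR(1)-style covariance and the frequencies below are arbitrary illustrative choices.

```python
import numpy as np

# The data-adaptive window W(ω,Ω) = a0†(Ω) ε(ω), for an illustrative
# AR(1)-style noise covariance (N and rho are arbitrary choices).
N, rho = 12, 0.5
n = np.arange(N)
C = rho ** np.abs(np.subtract.outer(n, n))

def steering(w):
    return np.exp(1j * w * n)

def W(w, W0):
    eps0 = steering(W0)
    Cinv_eps0 = np.linalg.solve(C, eps0)
    a0 = Cinv_eps0 / (eps0.conj() @ Cinv_eps0)   # filter tuned to Ω = W0
    return a0.conj() @ steering(w)               # response at frequency ω = w

Omega = 1.0
W_center = W(Omega, Omega)        # normalization: W(Ω,Ω) = 1
W_off = W(Omega + 1.5, Omega)     # suppressed response away from Ω
```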


Simulating the HRM

Generate a process with a specified noise + signal spectrum, or just noise with an arbitrary spectrum, by passing white noise through a linear filter:

white noise −→ h(t) −→ x(t)

From one or more realizations of x(t), estimate the autocovariance and put it in the form of a covariance matrix, C.

For each frequency of interest (Ω), calculate the MLM/HRM filter coefficients

a0 = C^{−1} ε / (ε† C^{−1} ε).

Calculate the power-spectrum estimate as

S(Ω) = 1 / (ε† C^{−1} ε).

The window function can be calculated as

W(ω, Ω) = a0†(Ω) ε(ω).
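The recipe above can be sketched as follows; the FIR filter, sample sizes, and number of lags are arbitrary illustrative choices, and the biased ACV estimate is used so that the Toeplitz matrix is positive semi-definite.

```python
import numpy as np

# Simulating the HRM: white noise -> linear filter -> x(t), estimate the ACV,
# form a Toeplitz covariance matrix, evaluate the MLM/HRM spectrum.
# The filter h and the sizes Nsamp, M are arbitrary illustrative choices.
rng = np.random.default_rng(1)
Nsamp, M = 4096, 16
white = rng.normal(size=Nsamp)
h = np.array([1.0, 0.8, 0.4])              # a short low-pass FIR filter
x = np.convolve(white, h, mode="valid")

# Biased autocovariance estimate for lags 0..M-1 (guarantees C is PSD)
acv = np.array([np.sum(x[:len(x) - k] * x[k:]) for k in range(M)]) / len(x)
C = acv[np.abs(np.subtract.outer(np.arange(M), np.arange(M)))]  # Toeplitz

def S_hrm(Omega):
    eps = np.exp(1j * Omega * np.arange(M))
    return 1.0 / np.real(eps.conj() @ np.linalg.solve(C, eps))

S0, Spi = S_hrm(0.0), S_hrm(np.pi)   # low-pass filter: more power at Ω = 0
```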


Comparison of Spectral Estimators

              Bartlett              MLM                       MEM
Estimator     N^{−2} ε† C ε         (ε† C^{−1} ε)^{−1}        ∝ |ε† C^{−1} δ|^{−2}
Errors        100% error
Resolution    ∆f = 1/(N∆τ) = 1/T    ≈ same or better          better resolution
                                    (up to ×2 of Bartlett)
Sidelobes     large sidelobes       lower sidelobes

Note all estimators are real: the quadratic form ε† C ε is real for C Toeplitz (Hermitian) and, in the MEM case, the estimator is manifestly real.
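The three formulas in the table can be exercised on the simplest possible case, white noise with C = I, where each estimator should be flat in frequency (normalizations follow the table up to constant factors; M is an arbitrary choice):

```python
import numpy as np

# Bartlett, MLM, and MEM estimators for white noise, C = I: all are
# frequency-independent, a minimal consistency check of the table's formulas.
M = 8
n = np.arange(M)
C = np.eye(M)
Cinv = np.linalg.inv(C)
delta = np.zeros(M); delta[0] = 1.0

def estimators(Omega):
    eps = np.exp(1j * Omega * n)
    bartlett = np.real(eps.conj() @ C @ eps) / M**2
    mlm = 1.0 / np.real(eps.conj() @ Cinv @ eps)
    mem = 1.0 / np.abs(eps.conj() @ Cinv @ delta) ** 2   # up to normalization
    return bartlett, mlm, mem

vals = np.array([estimators(W) for W in np.linspace(0, np.pi, 50)])
```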


Appendices


Derivation of the Maximum Entropy Spectrum

This section follows Edward and Fitelson, IEEE Trans. on Information Theory, 19, 232-234, 1973, and McDonough in Nonlinear Methods of Spectral Analysis, ed. Haykin, pp. 227-228.

Using the expression for the entropy rate in terms of the power spectrum (for a Gaussian process),

h = (1/4fN) ∫_{−fN}^{fN} df ln S(f),

we will derive a spectral estimate given that we know a finite number of values of the ACV. That is, suppose we know

C(n) ≡ ⟨(Xk − ⟨x⟩)(X_{k+n} − ⟨x⟩)*⟩,    n = −M, −M+1, . . . , 0, . . . , M,    ∆τ = sample interval.

For now, assume that we actually know C(n) rather than some estimates Ĉ(n). Letting S(f) carry the integration limits, we therefore have the constraint equations

C(n) = ∫ df e^{2πifn∆τ} S(f),

which we incorporate into the maximization problem by using Lagrange multipliers λn. Therefore, we maximize

L = h − Σ_{n=−M}^{M} λn C(n)    (minus sign for convenience)


which can be written as

L = ∫ df [ (1/4fN) ln S(f) − Σ_{n=−M}^{M} λn e^{2πifn∆τ} S(f) ].

Now we vary S(f) to find δL:

δL = ∫ df δS(f) [ (1/4fN) (1/S(f)) − Σ_n λn e^{2πifn∆τ} ] = 0.

This holds for any δS(f) when S(f) equals the function that extremizes L. Thus,

S(f) = (1/4fN) [ Σ_n λn e^{2πifn∆τ} ]^{−1}.

Now substitute back into the constraint equations to get equations for the λn:

C(n) = ∫ df S(f) e^{2πifn∆τ} = (1/4fN) ∫ df e^{2πifn∆τ} / Σ_m λm e^{2πifm∆τ},    n = −M, . . . , M.

This is a system of nonlinear equations for the λn.


Following Edward and Fitelson, note that the spectral estimate can be put into the form

S(f) = (1/4fN) (1/|A(f)|²)    where    A(f) ≡ Σ_{l=0}^{M} αl e^{2πifl∆τ},

which follows from the positive semi-definiteness of S(f) and is easy to see by analogy with the Wiener-Khinchin theorem: ACF ⇔ S(f) ∝ |FT|².

The coefficients αl are related to the Lagrange multipliers (λ is like a correlation function, αl a time series):

|A(f)|² = |Σ_{l=0}^{M} αl e^{2πifl∆τ}|²
        = Σ_l Σ_{l′} αl α*_{l′} e^{2πif∆τ(l−l′)}
        = Σ_{q=−M}^{M} e^{2πif∆τq} Σ_{l=0}^{M−|q|} α_{l+|q|} α*_l    (for q ≥ 0; conjugate for q < 0)
        ≡ Σ_{n=−M}^{M} λn e^{2πifn∆τ}.

Thus, both sides are equal if

λq = Σ_{l=0}^{M−|q|} α_{l+|q|} α*_l    (q ≥ 0),    λ_{−q} = λ*_q.
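The relation between the λq and the αl can be checked numerically; the α values and the test frequency below are arbitrary illustrative choices.

```python
import numpy as np

# Check |A(f)|² = Σ_q λ_q e^{2πif q Δτ}, with λ_q built from the α_l as a
# lag-|q| correlation. alpha and f are arbitrary illustrative choices.
alpha = np.array([1.0, -0.6, 0.25])
M = len(alpha) - 1
dtau, f = 1.0, 0.3

def lam(q):
    q = abs(q)                      # real alpha here, so λ_{-q} = λ_q
    return np.sum(alpha[q:] * np.conj(alpha[:M + 1 - q]))

A = np.sum(alpha * np.exp(2j * np.pi * f * dtau * np.arange(M + 1)))
lhs = np.abs(A) ** 2
rhs = np.real(sum(lam(q) * np.exp(2j * np.pi * f * dtau * q)
                  for q in range(-M, M + 1)))
```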


Now we can find a solution to the constraint equations. Start with:

S(f) = (1/4fN) (1/|A(f)|²).

Multiply S(f) by A*(f) e^{2πifn∆τ} and integrate:

∫_{−fN}^{fN} df S(f) A*(f) e^{2πifn∆τ} = (1/4fN) ∫_{−fN}^{fN} df [A*(f)/|A(f)|²] e^{2πifn∆τ}.

The left-hand side becomes

LHS = ∫_{−fN}^{fN} df S(f) Σ_{l=0}^{M} α*_l e^{−2πifl∆τ} e^{2πifn∆τ}
    = Σ_{l=0}^{M} α*_l ∫_{−fN}^{fN} df S(f) e^{2πif∆τ(n−l)}    [the integral ≡ C(n − l)]
    = Σ_{l=0}^{M} α*_l C(n − l).

The right-hand side is (using A*/|A|² = 1/A)

RHS = (1/4fN) ∫_{−fN}^{fN} df e^{2πifn∆τ}/A(f).


So we have

Σ_{l=0}^{M} α*_l C(n − l) = (1/4fN) ∫_{−fN}^{fN} df e^{2πifn∆τ}/A(f).

To further reduce the RHS we perform a contour integral in the complex plane for f. Let f = fr + i fi and constrain [A(f)]^{−1} to be analytic² in the upper half-plane. Choose the contour ζ:

[Figure: a closed rectangular contour with segment (1) along the real axis from −fN to +fN, segment (2) upward at fr = +fN, segment (3) leftward at fi = fim, and segment (4) downward at fr = −fN.]

By Cauchy’s Integral Theorem³, the integral around the closed contour vanishes:

∮_ζ df e^{2πifn∆τ}/A(f) = 0 = ∫_{(1)} df [·] + ∫_{(2)} df [·] + ∫_{(3)} df [·] + ∫_{(4)} df [·].

²Analytic functions (Sokolnikoff and Redheffer, p. 540): A function f(z) that has a derivative f′(z) at a given point z = z0 and at every point in the neighborhood of z0 is analytic at the point z = z0. The points where f(z) is not analytic are singular points. In order that f(z) = u(x, y) + iv(x, y) be analytic at z0 = x0 + iy0, it is necessary that u and v and their partial derivatives be continuous and that the Cauchy-Riemann equations

∂u/∂x = ∂v/∂y,    ∂v/∂x = −∂u/∂y

be satisfied throughout the neighborhood of x0, y0.

³Cauchy’s Integral Theorem: If f(z) is continuous in a closed, simply connected region R + C and analytic within the simple closed curve C, then ∮_C dz f(z) = 0.


Consider first the (2) and (4) integrals: fr = ±fN, fi ∈ [0, fim]:

K₂₄ ≡ ∫_{(2)} df [·] + ∫_{(4)} df [·]

    = e^{2πifN n∆τ} ∫_0^{fim} dfi e^{−2πfi n∆τ}/A(fN + ifi) + e^{−2πifN n∆τ} ∫_{fim}^{0} dfi e^{−2πfi n∆τ}/A(−fN + ifi)

(a common factor of i from df = i dfi on the vertical segments is omitted; it does not affect the conclusion below).

To proceed we specify that

fN = Nyquist frequency = 1/(2∆τ),

or 2∆τfN = 1. Then e^{±2πifN n∆τ} = e^{±iπn} = cos nπ. Also A(f) is periodic, so that

A(+fN) = A(−fN),    A(+fN + ifi) = A(−fN + ifi).

This can be seen from

A(f) = Σ_{l=0}^{M} αl e^{2πifl∆τ} −→ (2∆τfN = 1) Σ_{l=0}^{M} αl e^{πil(f/fN)},


which implies that

A(±fN + ifi) = Σ_{l=0}^{M} αl e^{−πl(fi/fN)} e^{±πil} = Σ_{l=0}^{M} αl e^{−πl(fi/fN)} (−1)^l.

Therefore

K₂₄ = cos nπ [ ∫_0^{fim} dfi e^{−2πfi n∆τ}/A(fN + ifi) + ∫_{fim}^{0} dfi e^{−2πfi n∆τ}/A(fN + ifi) ]    (equal and opposite terms)

⇒ K₂₄ = 0.


These results combined with the Cauchy Integral Theorem imply

∫_{(1)} df [·] = −∫_{(3)} df [·],

or

∫_{−fN}^{fN} df e^{2πifn∆τ}/A(f)    [the desired integral]
= −∫_{fN}^{−fN} dfr e^{2πin∆τ(fr + i fim)}/A(fr + i fim)
= +∫_{−fN}^{fN} dfr e^{2πifr n∆τ} e^{−2πn∆τ fim} / Σ_{l=0}^{M} αl e^{2πil∆τ fr} e^{−2πl∆τ fim}.

Taking the limit as fim → ∞, we get contributions only for n = 0 and l = 0:

lim_{fim→∞} ∫_{−fN}^{fN} dfr e^{2πifr n∆τ} e^{−2πn∆τ fim} / α0 = (δ_{n,0}/α0) ∫_{−fN}^{fN} dfr.

Thus,

∫_{−fN}^{fN} df e^{2πifn∆τ}/A(f) = 2fN δ_{n,0}/α0.
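This result can be checked numerically for a minimum-phase choice of A(f) (zero outside the unit circle, so that 1/A is analytic in the upper half f-plane); α = (1, 0.5) and ∆τ = 1 are arbitrary illustrative choices.

```python
import numpy as np

# Numerical check of ∫_{-fN}^{fN} df e^{2πifnΔτ}/A(f) = 2 fN δ_{n,0}/α0.
# alpha = (1, 0.5) is minimum phase (zero of 1 + 0.5 z at z = -2).
alpha = np.array([1.0, 0.5])
dtau = 1.0
fN = 1.0 / (2 * dtau)
K = 20000
f = -fN + 2 * fN * np.arange(K) / K          # uniform grid over one period

def A(f):
    return sum(a * np.exp(2j * np.pi * f * l * dtau) for l, a in enumerate(alpha))

# Rectangle rule, which is spectrally accurate for smooth periodic integrands
integrals = [np.mean(np.exp(2j * np.pi * f * n * dtau) / A(f)) * 2 * fN
             for n in range(3)]
```

For n = 0 the integral reduces to 2fN/α0 = 1 here; for n = 1, 2 it vanishes, as the residue argument predicts.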


The constraint equations are satisfied if

Σ_{l=0}^{M} α*_l C(n − l) = (1/4fN) ∫_{−fN}^{fN} df e^{2πifn∆τ}/A(f) = (1/2α0) δ_{n,0}.

Now put in matrix form by defining

α = (α0, α1, . . . , αM)^t,    δ = (1, 0, 0, . . . , 0)^t,    ε = (1, e^{2πif∆τ}, e^{2πif2∆τ}, . . . , e^{2πifM∆τ})^t.

The constraint equations become

[ C0     C1      . . .  CM     ] [ α*0 ]             [ 1 ]
[ C−1    C0      . . .  CM−1   ] [ α*1 ]  = (1/2α0)  [ 0 ]
[ ...                          ] [ ... ]             [...]
[ C−M    C1−M    . . .  C0     ] [ α*M ]             [ 0 ]

or

C α* = (1/2α0) δ

(and since we’ve assumed the process is real, α is also real, so α* = α),


which has the solution

α = (1/2α0) C^{−1} δ    (real data).


Now we can solve for the spectral estimate:

S(f) = (1/4fN) (1/|A(f)|²)
     = (1/4fN) 1/|Σ_{l=0}^{M} αl e^{2πifl∆τ}|²
     = (1/4fN) 1/|ε^t α|²
     = (1/4fN) 4α0²/|ε^t C^{−1} δ|²    (real data)

S(f) = (1/fN) α0² / |ε^t C^{−1} δ|².


Note that the solution α = (1/2α0) C^{−1} δ implies

α0 = (1/2α0) (C^{−1})00    ⇒    α0² = (1/2) (C^{−1})00.
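Putting the pieces together, the MEM spectrum can be evaluated from a set of known autocovariances; the AR(1)-style ACV with ρ = 0.6 and M = 3 below is an arbitrary illustrative choice, and the n = 0 constraint equation provides a check.

```python
import numpy as np

# MEM spectrum S(f) = (α0²/fN) |ε^t C^{-1} δ|^{-2} with α0² = (C^{-1})00 / 2,
# from known autocovariances C(n) = ρ^{|n|} (rho, M, Δτ arbitrary choices).
rho, M, dtau = 0.6, 3, 1.0
fN = 1.0 / (2 * dtau)
lags = np.arange(M + 1)
C = rho ** np.abs(np.subtract.outer(lags, lags))   # (M+1) x (M+1) Toeplitz
Cinv = np.linalg.inv(C)
delta = np.zeros(M + 1); delta[0] = 1.0
alpha0_sq = 0.5 * Cinv[0, 0]

def S_mem(f):
    eps = np.exp(2j * np.pi * f * dtau * lags)
    return (alpha0_sq / fN) / np.abs(eps @ Cinv @ delta) ** 2

# n = 0 constraint: integrating S(f) over the band should recover C(0) = 1
K = 4000
f = -fN + 2 * fN * np.arange(K) / K
integral = np.mean([S_mem(x) for x in f]) * 2 * fN
```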


Data Vectors

•  Vector of random variables:    X = (X1, X2, . . . , XN)^t

•  Mean:    ⟨X⟩ = (⟨X1⟩, ⟨X2⟩, . . . , ⟨XN⟩)^t

•  Dot product:    X · X = X^t X = Σ_{j=1}^{N} Xj²

•  Complex case:    X · X = X† X = Σ_{j=1}^{N} |Xj|²

•  Covariance matrix (zero-mean case; complex Hermitian):

C = ⟨X X†⟩ =
[ ⟨|X1|²⟩     ⟨X1 X*2⟩   · · ·  ⟨X1 X*N⟩ ]
[ ⟨X2 X*1⟩   ⟨|X2|²⟩     · · ·  ⟨X2 X*N⟩ ]
[   ...                    ⋱       ...    ]
[ ⟨XN X*1⟩   ⟨XN X*2⟩   · · ·  ⟨|XN|²⟩  ]


Consider vectors A, B and matrix C with sizes N × 1, N × 1, and N × N, respectively. We have

(a) ∇_A (A · B) = B

(b) ∇_A |A|² = 2A

(c) ∇_A (A^t C A) = (C^t + C) A    for real A

(d) ∇_A (A† C A) = C^t A* + C A    for complex A

(e) (CA)^t = A^t C^t

(f) If A is a zero-mean stochastic process (e.g. a vector of N measurements of a noiselike signal), its covariance matrix can be written as C = ⟨A A†⟩.

Here the notation is: * = conjugate; t = transpose; † = transpose conjugate.
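Identity (c) can be verified against a finite-difference gradient (random C and A; sizes are arbitrary illustrative choices):

```python
import numpy as np

# Finite-difference check of identity (c): ∇_A AᵗCA = (Cᵗ + C)A for real A.
rng = np.random.default_rng(2)
N = 5
C = rng.normal(size=(N, N))
A = rng.normal(size=N)

grad_analytic = (C.T + C) @ A

# Central differences; exact (up to rounding) since AᵗCA is quadratic in A
h = 1e-6
grad_fd = np.empty(N)
for j in range(N):
    Ap, Am = A.copy(), A.copy()
    Ap[j] += h
    Am[j] -= h
    grad_fd[j] = (Ap @ C @ Ap - Am @ C @ Am) / (2 * h)
```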