penalized whittle likelihood estimation of spectral density functions

PENALIZED WHITTLE LIKELIHOOD ESTIMATION

OF SPECTRAL DENSITY FUNCTIONS

by

Yudi Pawitan and Finbarr O'Sullivan

TECHNICAL REPORT No. 238

August 1992

Department of Statistics, GN-22

University of Washington

Seattle, Washington 98195 USA

Penalized Whittle Likelihood Estimation of

Spectral Density Functions

Yudi Pawitan

Department of Statistics

University College, Dublin 4.

and

Finbarr O'Sullivan

Department of Statistics

University of Washington

Seattle, WA 98195.

Abstract

The penalized likelihood approach has not yet been developed for time series problems, even though it has been applied in a number of nonparametric function estimation problems. We define a new class of nonparametric estimates of the spectral density of a stationary time series as the maximizer of the Whittle likelihood with a roughness penalty. Implementation using an iterative least squares procedure, with the log periodogram as starting value, works very well in practice. We derive an unbiased estimate of the integrated squared error and use it to choose a data dependent smoothing parameter. The procedure is illustrated by simulated and real data examples. In larger simulations the new estimate is shown to be more efficient than the smoothed log periodogram estimate. We indicate heuristically why this is the case. Assuming some smoothness condition on the true spectrum, we also show that the estimate achieves the same asymptotic error rate as that in nonparametric regression and density estimation.

Keywords: stationary time series, iterative least squares, regularization, bandwidth selection.

1 Introduction

The penalized likelihood approach is not developed for time series problems, even though it has been applied successfully in a number of nonparametric function estimation problems (see, for example, Good & Gaskin, 1971; Silverman, 1982 and 1985; O'Sullivan, 1988). In time series analysis, the Whittle likelihood function (Whittle, 1962) has been investigated extensively for parametric inference (e.g., Hannan, 1973; Davies, 1973) and has been shown to possess some optimality property (Kulperger, 1985). The purpose of this paper is to show the application of the penalized Whittle likelihood to the problem of spectral density estimation. [...] a similar approach and proved some [...], but here we show asymptotic rate results of the integrated squared error and describe the computational issues in detail, especially with regard to the data dependent choice of the smoothing parameter. We also compare the new method with the standard periodogram and log periodogram smoothing.

The likelihood approach puts a number of spectral estimation methods into a coherent framework. For example, the raw periodogram is the (unpenalized) maximum likelihood estimate of the spectrum, while the usual window estimate is a local maximum likelihood estimate in the sense of Tibshirani (1986); and the smoothed log periodogram (SLP) (Cogburn & Davis, 1974; Wahba, 1980) will be shown to be the first iterate of the approximated penalized maximum likelihood estimate (PMLE) of the log spectrum.

We describe the [...] in Section 2, including finding a data dependent smoothing [...] based on an [...] risk [...] with data [...]. [...] Section 3 show [...] more efficient than [...]. In Section 4 we [...] same asymptotic error rate as [...] integrated squared error is [...].

2 Methods

Let $X_t$ be a stationary time series with mean $\mu$ observed at time $t = 0, \ldots, T-1$. The second order properties of $X_t$ are equivalently described by the covariance function

$$c_{xx}(k) = \mathrm{cov}(X_t, X_{t+k}), \qquad k = 0, \pm 1, \pm 2, \ldots$$

and by the spectral density function

$$f(x) = \sum_k c_{xx}(k) \exp(-2\pi i x k).$$

The problem is to estimate $f(x)$ from a finite sample $X_0, \ldots, X_{T-1}$. An estimate of $f$ is useful for a [...] of [...] that [...] the data or, more importantly, it may be needed as an input for other estimation problems, such as smoothing or filtering (see, for example, Brillinger, 1981 or Shumway, 1988).
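As a small numerical illustration of the definition of the spectral density, the sketch below evaluates the sum over covariances for a series with finitely many non-zero lags. The function name `spectral_density`, the MA(1) example and its coefficient `b = 0.6` are assumptions made for illustration; they are not taken from the report.

```python
import numpy as np

def spectral_density(cov, x):
    """Evaluate f(x) = sum_k c(k) exp(-2*pi*i*x*k) for a real series.

    cov[k] holds c(k) for k = 0..K; we assume c(-k) = c(k) and
    c(k) = 0 for |k| > K (finitely many non-zero lags).
    """
    cov = np.asarray(cov, dtype=float)
    x = np.atleast_1d(np.asarray(x, dtype=float))
    lags = np.arange(1, len(cov))
    # Pairing the +k and -k terms turns the complex sum into a cosine series.
    cosines = np.cos(2.0 * np.pi * np.outer(x, lags))
    return cov[0] + 2.0 * cosines @ cov[1:]

# Hypothetical MA(1) example: X_t = e_t + b * e_{t-1}, Var(e_t) = 1.
b = 0.6
cov_ma1 = [1.0 + b**2, b]
freqs = np.linspace(-0.5, 0.5, 11)
f_series = spectral_density(cov_ma1, freqs)
# Closed form for the same process, used only as a cross-check.
f_closed = np.abs(1.0 + b * np.exp(-2j * np.pi * freqs)) ** 2
```

The two arrays agree, which is a quick way to confirm the sign and scaling conventions of the Fourier sum before working with estimates of $f$.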

The Whittle likelihood for stationary processes can be justified by the following heuristic argument. One can also start with a Gaussian likelihood in the time domain (Whittle, 1962; Chow & Grenander, 1985), but it is not necessary to assume Gaussianity of the process itself (Hannan, 1973), in which case we are only exploiting the second order structure of the process. Define the discrete Fourier transform (DFT) at frequency $x$ as

$$d(x) = \sum_t X_t \exp(-2\pi i x t), \qquad (1)$$

and the periodogram $I(x) = |d(x)|^2 / T$. Under Brillinger's mixing condition (Brillinger 1981, p. 26) we have

$$T^{-1/2} d(x) \to CN(0, f(x)),$$

as $T \to \infty$, where $CN$ denotes the complex normal distribution, and the $T^{-1/2} d(x)$ are asymptotically [...] for distinct [...] to be at Fourier frequency $x = k/T$. [...] convenience we [...] motivation, we note [...] $f_0$ is [...] is minimized at $f = f_0$, so [...] is an appropriate [...] of $f$ is the minimizer of

$$L_{T\lambda}(f) = L_T(f) + \lambda J(f) \qquad (2)$$

where $L_T(f)$ serves as a data fit criterion, $J(f)$ is a penalty function and $\lambda$ is a parameter that controls the degree of smoothing: $\lambda = 0$ corresponds to the raw periodogram estimate and $\lambda = \infty$ corresponds to the smoothest model implied by the choice of $J(f)$.
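The role of the data fit term can be checked numerically. The sketch below computes the periodogram at the Fourier frequencies and evaluates a data fit term of the form used in the appendix, the average of $\theta_k + I_k e^{-\theta_k}$ with $\theta = \log f$. The function names, the white-noise data and the shift of 0.3 are illustrative assumptions, and the data are deliberately not mean-centered so that every periodogram ordinate is positive.

```python
import numpy as np

def periodogram(x):
    """I_k = |d(k/T)|^2 / T at the Fourier frequencies k/T, k = 0..T-1."""
    x = np.asarray(x, dtype=float)
    d = np.fft.fft(x)            # d(k/T) = sum_t x_t exp(-2*pi*i*k*t/T)
    return np.abs(d) ** 2 / len(x)

def whittle_fit(theta, I):
    """Average of theta_k + I_k * exp(-theta_k) over the Fourier frequencies."""
    theta = np.asarray(theta, dtype=float)
    return float(np.mean(theta + I * np.exp(-theta)))

rng = np.random.default_rng(0)
x = rng.standard_normal(64)      # illustrative data, not from the report
I = periodogram(x)
theta_raw = np.log(I)            # log of the raw periodogram
fit_raw = whittle_fit(theta_raw, I)
fit_shifted = whittle_fit(theta_raw + 0.3, I)
```

Because each summand $\theta_k + I_k e^{-\theta_k}$ is minimized at $\theta_k = \log I_k$, `fit_raw` is never larger than `fit_shifted`, which matches the statement that $\lambda = 0$ returns the raw periodogram.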

Consider the penalty function

$$J(f) = \int \{\theta^{(m)}(x)\}^2 \, dx,$$

where $\theta(x) \equiv \log f(x)$. Note here that the log transform overcomes the positivity constraint $f(x) > 0$. The smoothest model using this function is a log polynomial of order $(m - 1)$. Denote $x_k = k/T$, $\theta_k = \theta(x_k)$ and let $\boldsymbol{\theta}$ be the $T \times 1$ vector of $\theta_k$'s. Throughout we will use bold notation to represent $T \times 1$ vectors.

Iterative Least Squares for fixed $\lambda$

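Minimization of the objective function

[...] (3)

is accomplished by an iterative least squares method as follows (McCullagh & Nelder, p. 40). [...] The method uses a working variable $z$ which, at the current estimate $\theta$, has components $z_k = \theta_k + I_k \exp(-\theta_k) - 1$ (compare (10)). [...] The penalty function is

$$J(f) = \sum_k k^{2m} \epsilon_k^2,$$

where $\epsilon_k$ is the $k$'th Fourier coefficient of $\theta(x)$, also known as the cepstrum. So, approximately,

$$L_{T\lambda} = \sum_{k=[-T/2]+1}^{[T/2]} (Z_k - \epsilon_k)^2 + \lambda \sum_{k=[-T/2]+1}^{[T/2]} k^{2m} \epsilon_k^2 \qquad (7)$$

where $Z_k$ is the $k$'th discrete Fourier transform of $z$. Minimization of (7) is now trivial. Given an initial estimate $\theta^0$, $L_{T\lambda}$ is minimized by

$$\theta^1_j = \frac{1}{T} \sum_k (1 + \lambda k^{2m})^{-1} Z_k \exp(2\pi i jk/T), \qquad j = 0, \pm 1, \ldots \qquad (8)$$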

The iteration continues by alternately recomputing (5) and (8) until convergence. Denote by $\hat\theta$ the final estimate of the log spectrum.

Equation (8) can also be written as $\theta^1 = S_{\lambda m} z$, where $S_{\lambda m}$ represents a smoothing matrix associated with $\lambda$ and $m$. It is worth noting that with an initial estimate $\theta^0_k = \log I_k + 1.3064313$, so that $z_k = \log I_k + 0.577216$, the first iteration gives exactly the smoothed log periodogram estimate of the log spectrum as given by Wahba (1980). This means that the first iterate is already a good estimate, and the procedure typically converges in fewer than four iterations, so for a given value of $\lambda$ the computational cost is $O(T \log T)$.
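The iteration described above can be sketched with the FFT. This is a minimal sketch under stated assumptions, not the authors' Fortran implementation: the working variable is taken to be $z = \theta + I \exp(-\theta) - 1$ as in (10), the smoother follows the Fourier form of (8) with NumPy's DFT convention (so the numerical scale of $\lambda$ is tied to that convention), and the names `smooth`, `pmle_log_spectrum`, `START_SHIFT`, the tolerance and the iteration cap are all hypothetical.

```python
import numpy as np

START_SHIFT = 1.3064313   # starting offset quoted in the text

def smooth(z, lam, m):
    """Damp the k'th Fourier coefficient of z by 1 / (1 + lam * k^(2m))."""
    T = len(z)
    k = np.rint(np.fft.fftfreq(T) * T)        # integer frequency indices
    weights = 1.0 / (1.0 + lam * k ** (2 * m))
    # The weights are even in k, so the inverse transform of real z is real.
    return np.fft.ifft(weights * np.fft.fft(z)).real

def pmle_log_spectrum(I, lam, m=2, max_iter=50, tol=1e-8):
    """Alternate z = theta + I*exp(-theta) - 1 and theta = smooth(z)."""
    theta = np.log(I) + START_SHIFT           # log periodogram start
    for n_iter in range(1, max_iter + 1):
        z = theta + I * np.exp(-theta) - 1.0  # working variable
        new_theta = smooth(z, lam, m)
        done = np.max(np.abs(new_theta - theta)) < tol
        theta = new_theta
        if done:
            break
    return theta, n_iter

rng = np.random.default_rng(1)
x = rng.standard_normal(256)                  # illustrative white noise
I = np.abs(np.fft.fft(x)) ** 2 / len(x)       # periodogram, data not centered
theta_hat, n_iter = pmle_log_spectrum(I, lam=1e-4)
z_first = np.log(I) + START_SHIFT + I * np.exp(-(np.log(I) + START_SHIFT)) - 1.0
```

With the quoted starting value, `z_first` equals $\log I_k + 0.577216$ up to rounding of the constants, so the first pass through `smooth` is a smoothed log periodogram. Each pass costs two FFTs, consistent with the $O(T \log T)$ cost per value of $\lambda$.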

Choice of the smoothing parameter $\lambda$

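The cross-validation (leave-one-out) bandwidth selection in spectral estimation [...] and more general cases [...] smoothing parameter as an un[...] squared error

$$\|\hat\theta - \theta\|^2 = \|\hat\theta\|^2 - 2 \sum_k \hat\theta_k \theta_k + \|\theta\|^2.$$

[...] can be readily computed, [...] term does not involve $\lambda$, so it suffices [...] second term. [...] mean [...] second term can be unbiasedly estimated as follows. [...] the iterative [...] satisfies $\hat\theta = S_{\lambda m} z$, [...] $\lambda$ near the optimal as $T$ gets large

$$z = \hat\theta + I \exp(-\hat\theta) - 1 \approx \theta + I \exp(-\theta) - 1, \qquad (10)$$

where $\theta$ is the true log spectrum. The approximation (10) is in line with the linearization lemmas proved in the appendix. So, we have

$$E \sum_k \hat\theta_k z_k = \sum_k E\hat\theta_k \, E z_k + \mathrm{trace}\{\mathrm{Cov}(\hat\theta, z)\} \approx \sum_k \theta_k E\hat\theta_k + \mathrm{trace}\, S_{\lambda m}. \qquad (11)$$

The trace can be shown to be equal to

$$\sum_{k=[-T/2]+1}^{[T/2]} (1 + \lambda k^{2m})^{-1}.$$

Then, up to a constant term, the ISE can be unbiasedly estimated by

$$\mathrm{URE}(\lambda, m) = \|\hat\theta - z\|^2 + 2 \Big\{ \sum_{k=[-T/2]+1}^{[T/2]} (1 + \lambda k^{2m})^{-1} \Big\}. \qquad (12)$$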

Since $\theta$ is not available the following fully automatic two-stage procedure is used: in the first stage we choose the $\lambda$ that minimizes URE using $z = \hat\theta + I \exp(-\hat\theta) - 1$, and call the optimal estimate $\theta_p$ a pilot estimate; in the second stage we repeat the procedure using $z = \theta_p + I \exp(-\theta_p) - 1$. [...] minimization is needed to find $\theta_p$.
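One possible reading of the two-stage choice of $\lambda$ is sketched below. It is an interpretation, not the authors' code: the URE is taken to be $\|\hat\theta - z\|^2 + 2\,\mathrm{trace}\,S_{\lambda m}$, the second stage is assumed to hold $z$ fixed at the pilot and apply the linear smoother once per grid value, a fixed count of ten iterations stands in for a convergence check, and the lower grid limit $e^{-20}$ is hypothetical because the lower end of the range used in the report is not legible. All function names are made up for this example.

```python
import numpy as np

def freq_index(T):
    """Integer frequency indices matching NumPy's FFT ordering."""
    return np.rint(np.fft.fftfreq(T) * T)

def smooth(z, lam, m):
    w = 1.0 / (1.0 + lam * freq_index(len(z)) ** (2 * m))
    return np.fft.ifft(w * np.fft.fft(z)).real

def ure(theta, z, lam, m):
    """||theta - z||^2 + 2 * trace(S), with trace(S) = sum_k 1/(1 + lam*k^(2m))."""
    trace = np.sum(1.0 / (1.0 + lam * freq_index(len(z)) ** (2 * m)))
    return float(np.sum((theta - z) ** 2) + 2.0 * trace)

def iterate(I, lam, m, n_iter=10):
    """Fixed number of iterative least squares passes (simplification)."""
    theta = np.log(I) + 1.3064313
    for _ in range(n_iter):
        theta = smooth(theta + I * np.exp(-theta) - 1.0, lam, m)
    return theta

def two_stage(I, m=2, coarse=None, fine=None):
    coarse = np.exp(np.linspace(-20.0, -1.0, 5)) if coarse is None else coarse
    fine = np.exp(np.linspace(-20.0, -1.0, 50)) if fine is None else fine
    # Stage 1: coarse grid, z recomputed from each fitted theta.
    best = None
    for lam in coarse:
        theta = iterate(I, lam, m)
        z = theta + I * np.exp(-theta) - 1.0
        score = ure(theta, z, lam, m)
        if best is None or score < best[0]:
            best = (score, theta)
    theta_p = best[1]                          # pilot estimate
    # Stage 2: z fixed at the pilot, finer grid.
    z_p = theta_p + I * np.exp(-theta_p) - 1.0
    scores = np.array([ure(smooth(z_p, lam, m), z_p, lam, m) for lam in fine])
    lam_hat = fine[int(np.argmin(scores))]
    return lam_hat, smooth(z_p, lam_hat, m), scores

rng = np.random.default_rng(2)
x = rng.standard_normal(128)                   # illustrative data
I = np.abs(np.fft.fft(x)) ** 2 / len(x)
lam_hat, theta_hat, scores = two_stage(I)
```

The grid sizes of 5 and 50 mirror the counts reported for the two stages in Section 3. The closed-form trace used in `ure` can be cross-checked against the explicit smoother matrix for small $T$.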

3 Examples and Comparisons

In this section we illustrate [...] processes as [...] innovation process $e_t$ [...] deviates are generated [...] are set to 0 [...] 100 [...] 4 [...] recursive equation above. [...] to normality [...] subsequent values are computed [...] of $X_t$ are then [...]. We use $m = 2$ and consider $\lambda$ between [...] and $e^{-1}$. A simple grid search is used to minimize URE over the range of $\lambda$. Only 5 values of $\lambda$ are used for the first stage minimization of URE and 50 values are used in the second stage. The procedures are implemented in double precision in Fortran and validated in S-plus. [...] $T =$ [...] takes about 2 [...] or less to compute a fully automatic [...] a DEC 3100 workstation.

Figure 1(a) shows that the URE of the PMLE follows the true ISE quite closely up to a constant shift shown by the dashed line. We call the best estimate the one that minimizes the ISE for the data at hand, and the automatic estimate the one that minimizes the URE. In this case the automatic estimate is practically the same as the best estimate. The URE and ISE of the SLP are shown for comparison; see Wahba (1980) for details. It is worth noting that the ISE of PMLE is smaller than the ISE of SLP for near optimal smoothing, so, in particular, the best PMLE has lower ISE than the best SLP. Figure 1(b) shows the log spectrum of the AR(3) process with the automatic PMLE and SLP estimates. Figure 1(c) for the MA(4) process shows similar patterns to those of 1(a). For [...] plots we have expressed the smoothing parameter in terms of [...] bandwidth given by

[...] at $\lambda =$ [...] periodogram at [...] two criteria coincide [...] occurs as we cannot [...] zero frequency. Setting it to zero through centering [...] overcome [...] same [...]. For comparison we plot [...] computed as the [...] same bandwidth ($L = 3$)

$$\frac{1}{2L+1} \sum_{j=-L}^{L} I_{k+j}.$$

We see that the [...] is closer to [...] this case.

To compare the performance of PMLE with SLP and LSP we replicate the simulation 500 times, each time recording the best PMLE, SLP and LSP. The LSP estimate of the log spectrum is computed as the log of the simple average periodogram above. The best LSP for a given [...] $L$ that [...] the ISE in the [...] spectral domain. This is performed for $T = 128, 256, 512$ and $1024$.
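The LSP estimate can be sketched as a circular moving average of the periodogram followed by a log. This is an illustrative sketch only: the wrap-around treatment of the end frequencies, the function names and the use of a mean over frequencies as the discrete ISE are assumptions, since the corresponding display is not fully legible.

```python
import numpy as np

def log_smoothed_periodogram(I, L=3):
    """Log of the (2L+1)-point circular moving average of the periodogram."""
    I = np.asarray(I, dtype=float)
    # np.roll wraps around, matching the periodicity of I_k in k.
    avg = sum(np.roll(I, j) for j in range(-L, L + 1)) / (2 * L + 1)
    return np.log(avg)

def ise(estimate, truth):
    """Squared error between two log spectra, averaged over frequencies."""
    diff = np.asarray(estimate, dtype=float) - np.asarray(truth, dtype=float)
    return float(np.mean(diff ** 2))

# Tiny worked example with made-up periodogram values.
I_toy = np.array([1.0, 2.0, 4.0, 8.0])
lsp_toy = log_smoothed_periodogram(I_toy, L=1)
```

In the toy example the first entry averages the wrapped neighbours 8, 1 and 2, so `lsp_toy[0]` is $\log(11/3)$. A best LSP in the sense of the text would then be found by evaluating `ise` against the true log spectrum for several values of `L`.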

Figure 3(a) shows that PMLE is better than SLP and LSP for the AR(3) process, and similar results are shown in Figure 3(c) for the MA(4) process. Moreover the estimates are seen to be consistent, as their ISE tends to zero as we increase the number of observations. The advantage of the PMLE over SLP is retained when we compare the corresponding automatic procedures, as shown in Figure 3(b) and 3(d) for the AR(3) and MA(4) processes respectively.

For each simulated realization we also computed the relative efficiency of PMLE against SLP as the ratio [...]. Table 1 shows a median efficiency of around 1.4 to 1.5 in [...] both for the best and automatic estimates. [...] asymptotically we [...] optimal $\theta_{PMLE} = S_{\lambda m} z$ and the variance of $z_k$ [...] this [...] to be [...] $\pi^2/6 \approx$ [...] a relative efficiency [...]. Intuitively, we [...] expect [...] as we average [...] squared error [...] spectral estimate over frequency. [...] 1 are [...]

                              T=128   T=256   T=512   T=1024
AR(3)
  Best PMLE vs Best SLP        1.54    1.40    1.45    1.57
  Auto PMLE vs Auto SLP        1.53    1.48    1.46    1.49
  Auto PMLE vs Best PMLE       0.82    0.83    0.84    0.87
MA(4)
  Best PMLE vs Best SLP        1.34    1.39    1.41    1.47
  Auto PMLE vs Auto SLP        1.30    1.44    1.45    1.48
  Auto PMLE vs Best PMLE       0.89    0.94    0.95    0.96

Table 1: PMLE is more efficient than SLP for best and automatic procedures, and the unbiased risk estimation shows high efficiency. For each realization the relative efficiency of A versus B is the ratio of the ISE of B to that of A. Each entry of the table is the median efficiency computed over 500 simulations.

[...] cases, with a positive [...] as the sample size increases (see Table 1).
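The efficiency summary reported in Table 1 is a median of per-realization ISE ratios. A minimal sketch of that computation follows; the function name and the three ISE values per method are made up for illustration and are not results from the report.

```python
import numpy as np

def median_relative_efficiency(ise_a, ise_b):
    """Median over replications of ISE(B) / ISE(A): efficiency of A versus B."""
    ise_a = np.asarray(ise_a, dtype=float)
    ise_b = np.asarray(ise_b, dtype=float)
    return float(np.median(ise_b / ise_a))

# Made-up ISE values for three replications (not results from the report).
ise_pmle = [0.10, 0.20, 0.25]
ise_slp = [0.15, 0.28, 0.50]
eff = median_relative_efficiency(ise_pmle, ise_slp)
```

For this toy input the per-replication ratios are about 1.5, 1.4 and 2.0, so the median efficiency of PMLE versus SLP is about 1.5. Values above 1 favour method A, and values below 1, as in the automatic versus best rows, measure the loss from estimating the smoothing parameter.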

4 Asymptotic results

[...] we apply [...], hereafter abbreviated by CO, [...] estimator. It is convenient to develop [...] for [...] to [...]

[...] is a zero mean second [...] stationary process [...] is [...] is [...] a covariance summability condition of [...]

$$[...] \le M \sum_k (1 + |k|^2)^{l+a} |c_{xx}(k)|^2 = M \| f_0 \|^2_{[...]}$$

where $M$ is a constant and $\| f_0 \|$ [...] Sobolev norm of $f_0$, which is in turn bounded [...] $l + 1/2$. So, the smoothness condition [...] by [...] $O(k^{-l-1})$ [...]. Conversely if [...] $k \to \infty$ and $\| f_0 \|^2$ is bounded [...] the covariance [...] in C.3 is equivalent to a covariance summability of order $l > 1$.

Theorem 1 Under C.1-C.3, if there is an $a$ satisfying

$$1/(2m) < a < (s/m - 1/(2m))/2 \qquad (13)$$

and a sequence of $\lambda_T \to 0$ satisfying

$$T^{-1} \lambda_T^{-(2a + 1/m)} \to 0 \qquad (14)$$

then, with probability tending to one, $\theta_{T\lambda}$ is well defined and for all $b \in [0, a]$

[...] (15)

where $M$ is a constant independent of $\theta_0$, $b$, $\lambda_T$ and [...].

[...] theorem may be [...] to convergence characteristics of the spectral density estimate. [...] letting $\lambda_T = O(T^{-2m/(2s+1)})$ then we can show that

$$\| f_{T\lambda} - f_0 \|^2 \le [...] \, \| \theta_{T\lambda} - \theta_0 \|_0^2 = O_p(T^{-2s/(2s+1)}). \qquad (17)$$

So, for example, if $s = 2$ the asymptotic upper bound is $O_p(T^{-4/5})$, which is the standard rate associated with nonparametric regression and density estimation.
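The arithmetic behind the quoted rate can be checked with exact fractions. The sketch below assumes that condition (13) reads $1/(2m) < a < (s/m - 1/(2m))/2$ and takes condition (14) to be $T^{-1}\lambda_T^{-(2a+1/m)} \to 0$, which is an assumed reading of that display; the function names are hypothetical.

```python
from fractions import Fraction

def admissible_a_range(s, m):
    """Open interval for a in (13): 1/(2m) < a < (s/m - 1/(2m)) / 2."""
    lower = Fraction(1, 2 * m)
    upper = (Fraction(s, m) - Fraction(1, 2 * m)) / 2
    return lower, upper

def rate_exponents(s, m):
    """Exponents r and q with lambda_T = T^(-r) and error rate T^(-q)."""
    return Fraction(2 * m, 2 * s + 1), Fraction(2 * s, 2 * s + 1)

def condition_14_exponent(s, m, a):
    """Power of T in T^(-1) * lambda_T^(-(2a + 1/m)); (14) needs it negative."""
    r, _ = rate_exponents(s, m)
    return -1 + r * (2 * a + Fraction(1, m))

lower, upper = admissible_a_range(2, 2)        # s = 2, m = 2
r, q = rate_exponents(2, 2)
mid_power = condition_14_exponent(2, 2, Fraction(5, 16))
edge_power = condition_14_exponent(2, 2, upper)
```

For $s = 2$ and $m = 2$ the admissible interval is $(1/4, 3/8)$ and the rate exponent is $4/5$. Under the assumed reading of (14), the power of $T$ is negative for $a$ inside the interval and reaches zero exactly at the upper end, so the choice $\lambda_T = T^{-2m/(2s+1)}$ is compatible with (13).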

APPENDIX: Proof of the theorem

The proof will follow from Theorems 3.1 and 3.2 of CO after verifying assumptions A.1-A.6 of that paper. Refer to CO for a detailed development of the normed spaces. Let $F_T$ be the discrete uniform measure on the points $x_k = k/T$, for $k = -[T/2] + 1, \ldots, [T/2]$. Also let $F$ be the uniform measure on $[-0.5, 0.5]$. Now we can write

$$L_{T\lambda}(\theta) = \int \{\theta(x) + I(x) e^{-\theta(x)}\} \, dF_T(x) + \lambda \langle \theta, W\theta \rangle,$$

$$L_{\lambda}(\theta) = \int \{\theta(x) + f_0(x) e^{-\theta(x)}\} \, dF(x) + \lambda \langle \theta, W\theta \rangle,$$

where $I(x_k) = I_k$. The score vectors corresponding to $L_{T\lambda}$ and $L_\lambda$ are given by $Z_{T\lambda}$ and $Z_\lambda$, where

(18)

The corresponding equation for $Z_\lambda$ is obtained by substituting $f_0$ for $I$ and $F$ for $F_T$ in (18). Similarly, the second [...] $L(\theta)$ are [...] by

[...] = [...]

[...] are verified [...] similar arguments as in CO.

Lemma 1 [...] continuous functions $g$ on $[-0.5, 0.5]$, there is a constant [...] independent of $g$ such that

(i) $|\int g \, d(F_T - F)|$ [...], $1/(2m) < a < 1/m$,

(ii) $E\{\int (I - f_0) g \, dF_T\}^2 \le$ [...]

(iii) $E\{\int (I - f_0) g \, dF_T\}^2 \le M T^{-1} \{ \| g \|_0^2 + T^{-2ma} \| g \|_a^2 \}$

Proof: (i) is standard. For (ii) and (iii), with $x_k = k/T$,

$$E\Big\{\int (I - f_0) g \, dF_T\Big\}^2 = \frac{1}{T^2} \sum_k \sum_{k'} E\{I_k - f_0(x_k)\}\{I_{k'} - f_0(x_{k'})\} \, g(x_k) g(x_{k'}).$$

Since $X_t$ is stationary and, by C.3, satisfies a first order covariance summability condition, by Theorems 5.2.2 and 5.2.4 of Brillinger (1981, pages 123 and 125) we have, for some $0 < c < \infty$,

$$E I_k = f_0(x_k) + O(T^{-1})$$

$$\mathrm{Cov}(I_k, I_{k'}) = c \, \delta_{kk'} f_0(x_k)^2 + O(T^{-1})$$

uniformly in $k$ and $k'$. Then it follows [...] The latter inequality [...] $> 1/2$. (ii) is then immediate and for part [...]

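Lemma 2 [...] $w$ [...] $\| w \|^2_{[...]}$.

Proof: For part (i), let $g(x) = e^{-\theta(x)} u(x) v(x)$; then

$$\{[D^2 L_T(\theta) - D^2 L(\theta)] uv\}^2 \le \Big\{\int g \, d(F - F_T)\Big\}^2 + \Big\{\frac{1}{T} \sum_k (I_k - f_0(x_k)) g(x_k)\Big\}^2$$

$$\le M T^{-1} \| g \|_a^2 + \Big\{\frac{1}{T} \sum_k (I_k - f_0(x_k)) g(x_k)\Big\}^2,$$

where we have used part (i) of Lemma 1. To analyze the second term, expand $g$ in terms of its Fourier series with coefficients $g_\nu = \langle g, \phi_\nu \rangle$, where $\phi_\nu$ are trigonometric polynomials, and apply the Cauchy-Schwarz inequality to obtain

$$\Big\{\frac{1}{T} \sum_k (I_k - f_0(x_k)) g(x_k)\Big\}^2 \le M \Big\{\sum_\nu (1 + \gamma_\nu^a) g_\nu^2\Big\} \times \Big\{\sum_\nu (1 + \gamma_\nu^a)^{-1} \Big[\frac{1}{T} \sum_k (I_k - f_0(x_k)) \phi_\nu(x_k)\Big]^2\Big\}$$

$$\le \frac{M}{T} \| g \|_a^2 \Big\{\sum_\nu (1 + \gamma_\nu^a)^{-1} \Big[\frac{1}{\sqrt{T}} \sum_k (I_k - f_0(x_k)) \phi_\nu(x_k)\Big]^2\Big\},$$

where we have used $\sup_x \phi(x) \le 1$. So, using part (ii) of Lemma 1, the random variable in the braces, call it $A_T$, is positive and

$$E(A_T) \le M \sum_\nu (1 + \gamma_\nu^a)^{-1} < \infty$$

since $\gamma_\nu^a \approx \nu^{2am}$ and $a > 1/(2m)$. Hence [...]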

[...] $= O_p(1)$ such that for $b \le$ [...] a constant $M$ [...] $\lambda > 0$,

(i) $K_3(\lambda, b)^2 \le M \lambda^{-(b + 1/(2m))}$

(ii) $K_{2T}(\lambda, b)^2 \le A_T M T^{-1} \lambda^{-(a + b + 1/(2m))}$

(iii) $K_{3T}(\lambda, b)^2 \le A_T M \lambda^{-(b + 1/(2m))} (1 + T^{-1} \lambda^{-a})$.

Then we have the following

Lemma [...]

Proof: The results follow from Lemma 2.1 of CO and Lemma 3 above.

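Lemma 4 If $\theta_\lambda \in$ [...] there is $M$ such that, with $1/(2m) < a^* <$ [...] $1/m$, [...] $0 \le b$ [...] a finite $K$, then

[...] $\le$ [...] $\{ T^{-1} \| g \|$ [...] $\| g \|_{a^*}^2 \lambda^{(s/m - a^*)} +$ [...] $\}$

where [...] is independent of $g$ [...] term is [...] second term substitute [...] to obtain

[...] $\le \| g \|^2_{1/m} + M \| g \|$ [...] $\| \theta - \theta_0 \|_{a^*}^2$

[...] $\le M T^{-1} \{ T^{-1} \| g \|^2_{1/m} + \lambda^{(s/m - a^*)} \| g \|_{a^*}^2 \}$.

Note that $T^{-1} \ge T^{-2a^* m}$ since $a^* > 1/(2m)$. The last term

$$M T^{-1} \Big\{\frac{1}{T} \sum_k g(x_k)^2 e^{-2\theta_\lambda(x_k)}\Big\} \le M T^{-1} \Big\{\frac{1}{T} \sum_k g(x_k)^2\Big\} \le M \{ \| g \|_0^2 + T^{-2ma^*} \| g \|_{a^*}^2 \}$$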

Now define the following linear approximations for $\theta$ in the neighborhood $N_{\theta_0}$:

$$\bar\theta_\lambda - \theta_0 = -G_\lambda(\theta_0)^{-1} Z_\lambda(\theta_0),$$

$$\bar\theta_{T\lambda} - \theta_\lambda = -G_\lambda(\theta_\lambda)^{-1} Z_{T\lambda}(\theta_\lambda),$$

where $\theta_\lambda$ is in the neighborhood $N_{\theta_0}$. The following result can be proved using the same arguments leading to Theorems 4.1 and 4.2 of CO. The assumptions A.5 and A.6 of CO are then verified in view of Lemma 4.

Lemma 5 For $0 \le b \le$ [...] and some constant $M$ we have

[...] $\| \theta_0 \|^2_{[...]}$. (21)

If $\| \theta_\lambda - \theta_0 \|$ [...] is a constant, for $\lambda = \lambda_T$ as in the theorem [...]

References

[1] Brillinger, D. R. (1981). Time Series: Data Analysis and Theory. San Francisco: Holden-Day.

[2] Beltrao, K. I. and Bloomfield, P. (1987). Determining the bandwidth of a kernel spectrum estimate. J. Time Ser. Anal., 8: 21-38.

[3] Chow, Y. and Grenander, U. (1985). A sieve method for spectral density. Ann. Statist., 13: 998-1010.

[4] Cogburn, R. and Davis, H. T. (1974). Periodic splines and spectral estimation. Ann. Statist., 2: 1108-1126.

[5] Cox, D. D. and O'Sullivan, F. (1990). Asymptotic analysis of penalized likelihood-type estimators. Ann. Statist., 18: 1676-1695.

[6] Cox, D. D. and O'Sullivan, F. (1989). Generalized nonparametric regression via penalized likelihood. Tech. Report No. 170, Dep. Statist., U. Washington, Seattle.

[7] Davies, R. B. (1973). Asymptotic inference in stationary Gaussian time series. Adv. Appl. Prob., 5: 469-497.

[8] Good, I. J. and Gaskin, R. A. (1971). Nonparametric roughness penalty for probability densities. Biometrika, [...]

[9] Hannan, E. J. (1973). [...] J. Appl. Prob., [...]

[10] Tibshirani, R. J. [...]

[...] a spectrum estimate: [...] cross-validation methods. J. [...]

[...] McCullagh, P. and Nelder, J. A. [...] Generalized Linear Models. London: Chapman and Hall.

[14] O'Sullivan, F. (1988). Fast computation of fully automated log-density and log-hazard estimators. SIAM J. Sci. Statist. Comput., 9: 363-379.

[15] Shumway, R. H. (1988). Applied Statistical Time Series Analysis. Englewood Cliffs: Prentice Hall.

[16] Silverman, B. W. (1982). On the estimation of a probability density function by the maximum penalized likelihood method. Ann. Statist., 10: 795-810.

[17] Silverman, B. W. (1985). Some aspects of the spline smoothing approach to nonparametric regression curve fitting. J. Roy. Statist. Soc., Ser. B, 47: 1-50.

[18] Wahba, G. (1980). Automatic smoothing of the log periodogram. J. Amer. Statist. Assoc., 75: 122-132.

[19] Whittle, P. (1962). Gaussian estimation in stationary time series. Bull. Inst. Internat. Statist., 39: 105-129.

[20] Wichmann, B. A. and Hill, I. D. (1982). An efficient and portable pseudo-random number generator. Appl. Statist., 31: 188-190.

LIST OF FIGURES

Figure 1: [...] simulated AR(3) [...] are estimates [...] MA(4) [...] true integrated squared error [...] a realization is [...] constant is expected to [...] true spectrum [...] process [...] are similar to [...] process.

Figure 2: [...] $T$ [...] The scattered points [...] Log of simple average periodogram is shown by the wiggly dotted [...] of log spectrum [...] same bandwidth as [...] are $I_k$ [...]

Figure 3: The PMLE is better than the SLP or log of smoothed periodogram (LSP). Note: P=PMLE, SL=SLP and LS=LSP. (a) Each boxplot summarizes 500 Monte Carlo replications for the AR(3) process. For each replication we record the true ISE of the best PMLE, SLP and LSP given the data; (b) The same as (a), except here we summarize the true ISE of the automatic PMLE and SLP. We do not develop an automatic procedure for LSP; (c) The same as (a) for the MA(4) process; (d) The same as (b) for the MA(4) process.

[Figure 1: panels (a) and (b) for the AR(3) process, panels (c) and (d) for the MA(4) process. Horizontal axes: effective bandwidth and frequency. Curve labels: True, PMLE, SLP.]

[Figure 2: panels (a) and (b). Horizontal axes: effective bandwidth and frequency. Curve labels: PMLE, SLP, LSP.]

[Figure 3: panels (a), (b), (c) and (d). Boxplots of ISE for P, SL and LS, grouped by sample size T.]
