

Discrimination of Locally Stationary Time Series Based on the Excess Mass Functional

Gabriel CHANDLER and Wolfgang POLONIK

Discrimination of time series is an important practical problem with applications in various scientific fields. We propose and study a novel approach to this problem. Our approach is applicable to cases where time series in different categories have a different "shape." Although based on the idea of feature extraction, our method is not distance-based, and as such does not require aligning the time series. Instead, features are measured for each time series, and discrimination is based on these individual measures. An AR process with a time-varying variance is used as an underlying model. Our method then uses shape measures or, better, measures of concentration of the variance function, as a criterion for discrimination. It is this concentration aspect or shape aspect that makes the approach intuitively appealing. We provide some mathematical justification for our proposed methodology, as well as a simulation study and an application to the problem of discriminating earthquakes and explosions.

KEY WORDS: Modulated AR process; Nonparametric estimation; Pool-adjacent-violators algorithm; Shape restrictions.

1. INTRODUCTION

There exists a large amount of literature dealing with the problem of discriminating nonstationary time series. One class of proposed discrimination methods finds certain features in time series, such as structural breaks of certain kinds, and then bases the discrimination on the presence or absence of those features. Usually, however, time series under consideration must be aligned in a certain way (e.g., using time warping) for the procedures to work. Moreover, the resulting methodologies, although having an intuitive appeal, also often have an ad hoc flavor because the mathematical underpinning is missing. This kind of work can be found in the area of data mining (e.g., time series data mining archive at University of California, Riverside). The statistical literature includes several approaches for discriminating nonstationary time series that are based on nonparametric spectral density estimation. These methods are distance-based, and hence also require alignment of the time series.

The novel methodology proposed and studied in this article is also based on feature extraction. However, in contrast to many other methods, it is not distance-based, and hence no alignment of time series is necessary. Moreover, the method is computationally feasible, and we provide a more rigorous mathematical underpinning of our procedure. Our approach is semiparametric in nature by using a time-varying AR model. More specifically, given a training set of time series, such as recordings of a special event by a seismograph at a given specific geographic location, say, we construct a discrimination rule that automatically assigns a new time series (the event) into one of k groups, $\pi_i$, $i = 1, \ldots, k$. In a time-varying AR model, $X_t$ has mean $\nu_t$, and the centered time series, $X^c_t = X_t - \nu_t$, satisfies the equations $X^c_t = \sum_{k=1}^{p} \psi_k(t) X^c_{t-k} + \varepsilon_t \tau(t)$, where the errors $\varepsilon_t$ are assumed to be iid with mean 0. Here, however, we consider a rescaled version of this model. As introduced by Dahlhaus (1997), we rescale time to the unit interval; that is, instead of time running from 1 to T, it now runs from 1/T to 1. In this rescaled time, we denote the mean function by $\mu(u)$, the AR coefficient functions by $\phi_k(u)$, $k = 1, \ldots, p$, and the variance function by $\sigma^2(u)$, $u \in [0, 1]$. With $X^c_{t,T} = X_t - \mu(t/T)$, our model is

$$X^c_{t,T} = \sum_{k=1}^{p} \phi_k(t/T)\, X^c_{t-k,T} + \varepsilon_t\, \sigma(t/T). \qquad (1)$$

Gabriel Chandler is Assistant Professor, Department of Mathematics, Connecticut College, New London, CT 06320 (E-mail: gabriel.chandler@conncoll.edu). Wolfgang Polonik is Associate Professor, Department of Statistics, University of California, Davis, CA 95616 (E-mail: [email protected]). Most of this work was completed while Chandler was a graduate student at University of California, Davis. This work has been supported by National Science Foundation grants 0103606 and 0406431. The authors would like to thank three referees and the associate editor for constructive criticisms that led to a significant improvement of the original manuscript. The authors also thank Robert Shumway for several valuable discussions, and for his willingness to share computer code and datasets.

Our procedures are based on $X_{1,T}, \ldots, X_{T,T}$. The rescaling will enable us to derive asymptotic results for our procedures. In this article we consider the class of problems in which all of the discrimination information is contained in $\sigma^2(\cdot)$. Thus $\mu(\cdot)$ and $\phi_k(\cdot)$, $k = 1, \ldots, p$, are treated as nuisance parameters. The order p is assumed to be known. Our method for discriminating between classes of time series is then based on characteristics of the variance function $\sigma^2(\cdot)$ only. The motivation for this approach comes from the problem of discriminating earthquakes and mining explosions. We would like to point out, however, that the idea underlying our approach is more general. Other parameter functions can, of course, be included in the discrimination procedure. What is needed is some information on the shape of these functions to construct an appropriate criterion. Our theoretical results allow the nuisance parameters to be estimated as functions, and open the door to deriving joint distributions of our criterion with additional criteria based on other parameter functions.

In our motivating example of discriminating earthquakes and mining explosions, we actually assume that the AR parameters are constant over time, that is, $\phi_k(u) = \phi_k$, $k = 1, \ldots, p$, where the $\phi_k$ are real constants. We also assume that the mean is constant at 0, a standard assumption in this context. Intuitively, the purpose of the (now) stationary AR part in our model is to remove the underlying "noise," leaving us with the "signal," which is represented by the variance function. Seismic events of the type under consideration actually consist of two different signals, the p-wave and the s-wave (see Fig. 1). It is apparent from Figure 1 that the underlying variance function of both the p-wave and the s-wave has a unimodal shape; this applies to earthquakes as well as to explosions. The concentration

© 2006 American Statistical Association
Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Theory and Methods
DOI 10.1198/016214505000000899


Figure 1. A Typical Earthquake and Explosion. The two parts in each series (indicated by a small gap) are called the p-wave (left) and the s-wave (right).

of this (unimodal) variance function then serves as a characteristic used for discrimination of the earthquake/explosion data. Using the concentration of the variance function for discrimination is motivated by the known fact that the decay of volatility is different for earthquakes and explosions. In other words, the concentration of the variance function is the essential feature that allows us to discriminate between different classes. Different measures of concentration on which to base the discrimination will be proposed, all relating to the excess mass functional introduced independently by Müller and Sawitzki (1987) and Hartigan (1987).

Other potential applications of our method include the discrimination of types of animals based on geophone recordings of their footfalls. In one particular study (J. Wood, personal communication, 2004), the focus was on discriminating elephant footfalls from those of other animals. Yet another possible application is the analysis of electroencephalography recordings of seizures. In all of these applications, the signals of interest show different shapes compared with either the background or other signals not of main interest. This is the situation where the basic idea of our method applies.

The discrimination of earthquakes and explosions used as a guiding example is an interesting problem. It can be used to monitor nuclear proliferation, where we would want to discriminate between earthquakes, mining explosions, and small nuclear tests. This problem has been approached in several different ways. Early procedures dealt with considering relative amplitudes of the two main phases present in each seismographic reading or considering ratios of power components in particular frequency bands of the two phases (e.g., Bennett and Murphy 1986). A more recent approach was to consider the discrepancy between the spectral structure of multivariate processes (Kakizawa, Shumway, and Taniguchi 1998) while treating such processes as stationary, and later considering dynamic or time-varying spectral densities (Sakiyama and Taniguchi 2001). Kakizawa et al. considered different discrepancy measures that formed quasi-distances, allowing for classification. Sakiyama and Taniguchi essentially considered a ratio of Gaussian log-likelihoods to perform discriminant analysis. Yet another approach, based on the so-called SLEX model, was considered by Huang, Ombao, and Stoffer (2004). There the time-varying spectral density is estimated using a SLEX basis, and again a certain generalized likelihood ratio statistic is used for classification. Whereas these approaches all deal with the spectrum itself or an amplitude that is proportional to the integrated spectrum, we propose a technique that bases discrimination on feature extraction in the time domain.

Our model (1) is a special case of a nonstationary time series whose distribution varies (smoothly) in time. Under appropriate assumptions, the model is a special case of a locally stationary time series as defined by Dahlhaus (1997). This definition was extended by Dahlhaus and Polonik (2004), who required only bounded variation of all the parameter functions. In particular, this includes the possibility of jumps in the variance function, our target. Note also that although this model has a parametric form, the variance is a function of rescaled time, and we use a nonparametric method to estimate this variance function. In fact, as motivated earlier, we estimate the variance function under the assumption of unimodality.

We would like to point out, however, that the purpose of this article is not only to study the specific method considered here, but also to promote the way of thinking underlying our approach.

The article is organized as follows. Section 2, assuming that we have access to an estimate of the variance function, introduces two discrimination measures based on the excess mass functional of the variance function. Section 3 deals with estimation of both the finite- and infinite-dimensional parameters. Section 4 presents theorems dealing with the asymptotic behavior of the concentration measures. They serve as justification of our discrimination methods introduced earlier. Section 5 contains a numeric study consisting of earthquakes and explosions recorded in Scandinavia, as well as a comparison of our methods with some of the spectral-based methods mentioned earlier. We defer all of the proofs to Section 6.

2. THE DISCRIMINATION METHOD

2.1 Measures of Concentration

Our discrimination method is based on our model (1) and on measures of concentration of the variance function $\sigma^2(\cdot)$. As indicated earlier, this is motivated by the well-known fact that the variation in earthquakes and explosions decays differently. Two different types of measures of concentration are considered. As stated earlier, both are based on the excess mass idea. The excess mass functional of a distribution F with pdf f is defined as

$$E_F(\lambda) = \sup_C \Big\{ \int_C dF(x) - \lambda |C| \Big\} = \int_{-\infty}^{\infty} \big( f(x) - \lambda \big)_+ \, dx, \qquad (2)$$

where $a_+ = \max(a, 0)$, the sup is extended over all (measurable) sets C, and $|C|$ denotes the Lebesgue measure of C. In


our application, the role of the density f is taken over by the (normalized) variance function.

The excess mass measures the concentration of the underlying distribution. In fact, the excess mass functional is a convex function (as a supremum of linear functions) that is linear for a uniform distribution, and a higher degree of convexity indicates a higher concentration. Hence different rates of decay can be expected to lead to different behavior of the excess mass, or of functionals thereof.

The excess mass approach has been used, for instance, to investigate the modality of a distribution and to estimate level sets of densities and regression functions (Müller and Sawitzki 1987; Hartigan 1987; Nolan 1991; Polonik 1995; Cavalier 1997; Cheng and Hall 1998a,b; Polonik and Wang 2005).

For our purposes, we propose to consider the excess mass functional of the normalized variance function

$$\bar\sigma^2(u) = \frac{\sigma^2(u)}{\int_0^1 \sigma^2(v)\, dv}. \qquad (3)$$

$\bar\sigma^2(\cdot)$ does not depend on the magnitude of the series. This is important, because in the case of discrimination between earthquakes and explosions, the magnitude of the signals can vary greatly. The excess mass functional of the normalized variance function is then

$$E(\lambda) = E_{\bar\sigma^2}(\lambda) = \int_0^1 \big( \bar\sigma^2(u) - \lambda \big)_+ \, du, \qquad \lambda \in \mathbb{R}. \qquad (4)$$
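To make (3) and (4) concrete, here is a small numerical sketch; the particular variance function, the grid, and the helper names are our own illustrative choices, not from the paper:

```python
import numpy as np

# Illustrative unimodal variance function on a grid over [0, 1]
u = np.linspace(0, 1, 2001)
du = u[1] - u[0]
sigma2 = np.exp(-50 * (u - 0.3) ** 2)

# Normalized variance function, as in (3): divide by its integral over [0, 1]
sigma2_bar = sigma2 / (sigma2.sum() * du)

def excess_mass(lam):
    """Excess mass functional (4): integral of (sigma2_bar(u) - lam)_+ over [0, 1]."""
    return np.clip(sigma2_bar - lam, 0.0, None).sum() * du

E0 = excess_mass(0.0)   # equals the integral of the normalized variance, i.e. 1
```

By construction, $E(0) = 1$, $E(\lambda)$ decreases in $\lambda$, and $E(\lambda) = 0$ once $\lambda$ exceeds $\max_u \bar\sigma^2(u)$.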

This excess mass functional is used as a basis to define two different types of measures of concentration that eventually will be used for discrimination.

Integrated Excess Mass. Our first discrimination measure is based on an integral functional of the excess mass. Less concentrated functions will have an excess mass with a lower degree of convexity compared to concentrated functions. Hence we might suspect that the tail behavior of the excess mass functional contains a lot of the discriminatory power. Because the tail would have little weight if we only considered the integral of the excess mass, we include a weight to allow for a greater contribution from the tail to our measure. Our discrimination measure then is

$$IE(\beta) = \int_0^{\infty} \lambda^{\beta} E(\lambda)\, d\lambda, \qquad \beta > 0. \qquad (5)$$

Larger values of $\beta$ will result in a larger emphasis on the tail of the excess mass functional, which corresponds to the peak of the variance function. For applications we propose to choose $\beta$ from the data. For the theory presented here, however, we consider a fixed $\beta$.

We would like to make additional comments on $IE(\beta)$. Using $IE(\beta)$ is equivalent to using nonlinear functionals of $\bar\sigma^2(u)$. In fact, using Fubini's theorem, it is straightforward to see that $\int_0^{\infty} \lambda^{\beta} E(\lambda)\, d\lambda = \frac{1}{(\beta+1)(\beta+2)} \int_0^1 \big(\bar\sigma^2(u)\big)^{\beta+2}\, du$. Hence, using $IE(0)$ for discrimination is equivalent to basing discrimination on the L2-norm of $\bar\sigma^2(\cdot)$.
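The Fubini identity above is easy to verify numerically. The sketch below does so for an illustrative variance function; the example function, the grids, and the choice of exponent are our own:

```python
import numpy as np

# Numerical check of the identity
#   int_0^inf lam^beta E(lam) dlam
#     = (1 / ((beta+1)(beta+2))) * int_0^1 (sigma2_bar(u))^(beta+2) du
u = np.linspace(0, 1, 2001)
du = u[1] - u[0]
sigma2 = 1.0 + np.sin(np.pi * u)               # unimodal on [0, 1]
sigma2_bar = sigma2 / (sigma2.sum() * du)      # normalized as in (3)

def E(lam):
    return np.clip(sigma2_bar - lam, 0.0, None).sum() * du

beta = 1.5
lams = np.linspace(0.0, sigma2_bar.max(), 4001)
dlam = lams[1] - lams[0]
IE_direct = sum(lam ** beta * E(lam) for lam in lams) * dlam
IE_fubini = (sigma2_bar ** (beta + 2)).sum() * du / ((beta + 1) * (beta + 2))
```

On a grid this fine, the two sides agree to well within one percent.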

"Quantile" of the Excess Mass Functional. The second type of discrimination measure is a "quantile" of the excess mass functional. In other words, we define

$$\lambda(q) = E^{-1}(q) = \sup\{ \lambda : E(\lambda) \ge q \}, \qquad 0 < q < 1. \qquad (6)$$

More concentrated variance functions have larger excess mass quantiles than less concentrated variance functions. Similar to the tuning parameter $\beta$, the parameter q will be considered fixed for the theory that we provide later in this article. However, for the applications herein we propose a method to automatically select a value for q.
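A grid-based sketch of (6) illustrates this monotonicity. For a uniform (flat) normalized variance function, $E(\lambda) = (1 - \lambda)_+$, so the quantile is approximately $1 - q$; a concentrated function yields a larger quantile. All names and example functions below are ours:

```python
import numpy as np

u = np.linspace(0, 1, 2001)
du = u[1] - u[0]

def normalize(s2):
    return s2 / (s2.sum() * du)

def E(lam, s2bar):
    return np.clip(s2bar - lam, 0.0, None).sum() * du

def excess_mass_quantile(q, s2bar, n_grid=4001):
    # (6): sup over a lam-grid of the points with E(lam) >= q
    lams = np.linspace(0.0, s2bar.max(), n_grid)
    ok = lams[np.array([E(lam, s2bar) >= q for lam in lams])]
    return ok.max() if ok.size else 0.0

flat = normalize(np.ones_like(u))                    # uniform: E(lam) = (1 - lam)_+
peaked = normalize(np.exp(-100 * (u - 0.5) ** 2))    # concentrated in the middle
q = 0.25
lam_flat = excess_mass_quantile(q, flat)      # close to 1 - q = 0.75
lam_peaked = excess_mass_quantile(q, peaked)  # larger, since more concentrated
```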

2.2 Empirical Versions of the Discrimination Measures

Empirical versions, or estimates, of the two discrimination measures are constructed via a plug-in method by using an estimator $\hat\sigma^2_s(\cdot)$ of $\sigma^2(\cdot)$, which we define in Section 3. Using this estimate, we define

$$\hat E(\lambda) = E_{\hat\sigma^2_s}(\lambda) = \int_0^1 \big( \hat\sigma^2_s(u) - \lambda \big)_+ \, du, \qquad \lambda \in \mathbb{R}, \qquad (7)$$

and define the empirical measures of concentration $\widehat{IE}(\beta)$ and $\hat\lambda(q)$ as

$$\widehat{IE}(\beta) = \int_0^{\infty} \lambda^{\beta} \hat E(\lambda)\, d\lambda, \qquad \beta > 0, \qquad (8)$$

and

$$\hat\lambda(q) = \hat E^{-1}(q) = \sup\{ \lambda : \hat E(\lambda) \ge q \}, \qquad 0 < q < 1. \qquad (9)$$

3. ESTIMATION OF THE VARIANCE FUNCTION

In this section we describe the construction of our estimator $\hat\sigma^2(\cdot)$, and we also introduce a smoothed version $\hat\sigma^2_s(\cdot)$. The reason for considering a smoothed version is that smoothing helps in some special cases. This will become clear later.

Our estimator $\hat\sigma^2(u)$ can be considered a minimum distance estimator that minimizes a criterion function $W_T$ over the parameter space. Motivated by the application of our method to seismic data, we define the parameter space of our model as the class of all unimodal functions on [0, 1]. To introduce some notation, let $\mathcal{U}(m)$ denote the class of all positive unimodal functions on [0, 1] with mode at m, that is, positive functions that are increasing to the left of m and decreasing to the right. Then $\mathcal{U} = \bigcup_{m \in [0,1]} \mathcal{U}(m)$ denotes the class of all unimodal functions on [0, 1].

As a criterion function, we consider the negative conditional Gaussian likelihood with estimated nuisance parameters. Let

$$W_T(\sigma^2) = \sum_{t=1}^{T} \left[ \log \sigma^2\!\Big(\frac{t}{T}\Big) + \frac{\big( \hat X^c_{t,T} - \hat\phi_1(\frac{t}{T})\, \hat X^c_{t-1,T} - \cdots - \hat\phi_p(\frac{t}{T})\, \hat X^c_{t-p,T} \big)^2}{\sigma^2(\frac{t}{T})} \right],$$

where $\hat\phi = (\hat\phi_1, \ldots, \hat\phi_p)$ denotes an estimator of $\phi = (\phi_1, \ldots, \phi_p)$, $\hat X^c_{t,T} = X_{t,T} - \hat\mu(t/T)$, and $\hat\mu(\cdot)$ is an estimator of $\mu(\cdot)$. We define

$$\hat\sigma^2 = \operatorname*{argmin}_{\sigma^2 \in \mathcal{U}} W_T(\sigma^2). \qquad (10)$$
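As a sketch of how the criterion $W_T$ scores candidate variance functions, the following simulates a time-varying AR(1) with zero mean and a constant AR parameter (the setting of the paper's seismic application) and compares the true unimodal variance function against a flat one. The data-generating choices and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 500
u = np.arange(1, T + 1) / T
sigma2_true = 0.5 + 4 * np.exp(-20 * (u - 0.4) ** 2)   # unimodal variance
phi = 0.5
X = np.zeros(T)
for t in range(1, T):
    X[t] = phi * X[t - 1] + rng.standard_normal() * np.sqrt(sigma2_true[t])

def W_T(sigma2, X, phi_hat):
    # Negative conditional Gaussian log-likelihood (up to constants):
    # sum over t of log sigma^2(t/T) + e_t^2 / sigma^2(t/T),
    # with e_t the AR residual.
    e = X[1:] - phi_hat * X[:-1]
    s2 = sigma2[1:]
    return np.sum(np.log(s2) + e ** 2 / s2)

# The true variance function should score better (lower W_T) than a flat one.
w_true = W_T(sigma2_true, X, phi)
w_flat = W_T(np.full(T, sigma2_true.mean()), X, phi)
```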


To speed up the calculation, the mode might be estimated in a preliminary step. Let $\hat m$ denote an estimate of the unknown mode m of the true variance function. Then we define

$$\hat\sigma^2_{\hat m} = \operatorname*{argmin}_{\sigma^2 \in \mathcal{U}(\hat m)} W_T(\sigma^2). \qquad (11)$$

As our final estimate of $\sigma^2(\cdot)$, we propose using "smoothed" versions of the estimates just defined.

Note that a good approximation of the negative Gaussian log-likelihood is given by the Whittle likelihood (Whittle 1962). Thus the estimation approach can be considered a maximum Whittle likelihood approach. However, whereas the Whittle likelihood is considered a function in the frequency domain, our approach considers the time domain.

Finding the estimator $\hat\sigma^2$ or $\hat\sigma^2_{\hat m}$ means solving a constrained optimization problem.

Basic Algorithm for Finding the Minimizers in (10) and (11). First, we discuss the case where the mode is estimated. Note that the solution $\hat\sigma^2_{\hat m}(\cdot)$ is monotonically increasing to the left of the mode $\hat m$ and decreasing to the right, and estimation of both parts can be done separately, using essentially the same techniques. The key observation is that the foregoing likelihood $W_T(\cdot)$ is of exactly the form needed to apply the theory of generalized isotonic regression (e.g., Robertson, Wright, and Dykstra 1988). It follows that both the decreasing and the increasing parts of $\sigma^2(\cdot)$ can be found by a generalized isotonic regression on the squared residuals based on given values for the AR parameters. For both parts, the pool-adjacent-violators algorithm can be applied. Let $e^2_t$ denote the squared residual, that is,

$$e^2_t = \Big( \hat X^c_{t,T} - \hat\phi_1\!\Big(\frac{t}{T}\Big) \hat X^c_{t-1,T} - \cdots - \hat\phi_p\!\Big(\frac{t}{T}\Big) \hat X^c_{t-p,T} \Big)^2. \qquad (12)$$

To the left of the estimated mode $\hat m$, the solution is the (right-continuous) slope of the greatest convex minorant to the cumulative sum diagram given by the points $\big(k/T,\; (1/T) \sum_{t=1}^{k} e^2_t\big)$, $k = 1, \ldots, T$, and to the right of $\hat m$ it is the (left-continuous) slope of the least concave majorant to the same cumulative sum diagram (cf. Robertson et al. 1988, thm. 1.2.1).
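The fixed-mode step can be sketched as follows: a least-squares pool-adjacent-violators fit (which corresponds to the slope-of-minorant characterization above) applied to the squared residuals, increasing left of a given mode index and decreasing right of it. The helper names and the toy data are our own:

```python
import numpy as np

def pava_increasing(y):
    """Pool-adjacent-violators: least-squares increasing fit to y."""
    # Maintain blocks of (mean, weight); merge while means violate monotonicity.
    means, weights = [], []
    for v in y:
        means.append(float(v)); weights.append(1.0)
        while len(means) > 1 and means[-2] > means[-1]:
            m2, w2 = means.pop(), weights.pop()
            m1, w1 = means.pop(), weights.pop()
            means.append((m1 * w1 + m2 * w2) / (w1 + w2))
            weights.append(w1 + w2)
    out = []
    for m, w in zip(means, weights):
        out.extend([m] * int(w))
    return np.array(out)

def unimodal_fit(e2, mode_idx):
    """Increasing PAVA left of the mode, decreasing PAVA right of it."""
    left = pava_increasing(e2[: mode_idx + 1])
    right = pava_increasing(e2[mode_idx:][::-1])[::-1]   # decreasing fit
    return np.concatenate([left, right[1:]])

e2 = np.array([0.2, 0.5, 0.4, 1.5, 3.0, 2.0, 2.5, 0.8, 0.3])
fit = unimodal_fit(e2, mode_idx=4)
```

The fitted values rise monotonically up to the mode index and fall after it.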

In the event the mode is not estimated, the algorithm just described must be performed T times, because any time point t/T, $t = 1, \ldots, T$, is a potential mode. (The estimator is known to have jumps only at t/T for some $t = 1, \ldots, T$.) Among the resulting T different estimators, the one with the overall smallest value of $W_T$ is chosen.

There is, however, a small problem with the solution of the algorithm just described. It is well known in density estimation that the maximum likelihood estimator of a unimodal density does not behave well at the mode (i.e., it is not consistent at the mode, and behaves irregularly close to the mode). This is sometimes called the spiking problem. The spiking problem also applies when estimating the variance function under the unimodality restriction. In some cases, however, estimation of the maximum value of $\sigma^2(\cdot)$ is of importance for our method, because it may be needed to estimate the (asymptotic) variance of one of our discrimination measures (cf. Thm. 1). The spiking problem can be avoided by introducing some smoothing.

Smoothing to Avoid Irregular Behavior at the Mode. Intuitively, the spiking problem can be understood by observing that the isotonic regression is given by the slope of the least concave majorant to the cumulative sum diagram based on the squared residuals, and hence the estimate at the mode depends on only a few observations, regardless of sample size. One large squared residual in the neighborhood of the mode will therefore cause problems.

Sun and Woodroofe (1993) proposed a penalized least squares approach to make the estimate of a density consistent at the mode. Although this does not readily generalize beyond the density case, another approach to stabilize the estimate is to first smooth the squared residuals, via a kernel smoother, and then perform isotonic regression on the smoothed data. This is the approach that we use. That is, we form the smoothed residuals

$$\tilde e^2_t = \frac{\sum_{s=-\infty}^{\infty} K\big(\frac{s-t}{[bT]}\big)\, e^2_s}{\sum_{s=-\infty}^{\infty} K\big(\frac{s-t}{[bT]}\big)}, \qquad (13)$$
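A sketch of this pre-smoothing step is below. The Epanechnikov kernel, the bandwidth, and the simulated residuals are illustrative choices of ours:

```python
import numpy as np

def smooth_squared_residuals(e2, b):
    """Kernel smoother of the squared residuals, in the spirit of (13)."""
    T = len(e2)
    h = max(int(b * T), 1)                                   # plays the role of [bT]
    K = lambda x: np.clip(0.75 * (1 - x ** 2), 0.0, None)    # Epanechnikov, support [-1, 1]
    t = np.arange(T)
    out = np.empty(T)
    for i in t:
        w = K((t - i) / h)
        out[i] = np.sum(w * e2) / np.sum(w)
    return out

rng = np.random.default_rng(1)
e2 = rng.standard_normal(200) ** 2 + 1.0     # noisy squared residuals
e2_smooth = smooth_squared_residuals(e2, b=0.05)
```

The smoothed residuals stay positive and fluctuate much less than the raw squared residuals, which is what tames the spiking at the mode.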

and find

$$\hat\sigma^2_s(\cdot) = \operatorname*{argmin}_{\sigma^2 \in \mathcal{U}} \sum_{t=1}^{T} \left[ \log \sigma^2\!\Big(\frac{t}{T}\Big) + \frac{\tilde e^2_t}{\sigma^2(\frac{t}{T})} \right] \qquad (14)$$

or

$$\hat\sigma^2_{s,\hat m}(\cdot) = \operatorname*{argmin}_{\sigma^2 \in \mathcal{U}(\hat m)} \sum_{t=1}^{T} \left[ \log \sigma^2\!\Big(\frac{t}{T}\Big) + \frac{\tilde e^2_t}{\sigma^2(\frac{t}{T})} \right]. \qquad (15)$$

Using these estimates, we define

$$\hat E(\lambda) = \int_0^1 \big( \hat\sigma^2_s(u) - \lambda \big)_+ \, du \qquad (16)$$

and similarly $\hat E_{\hat m}(\lambda)$, where $\hat\sigma^2_s$ is replaced by $\hat\sigma^2_{s,\hat m}$. These two empirical excess mass functions have corresponding empirical discrimination measures $\widehat{IE}(\beta)$ and $\hat\lambda(q)$, as defined in (8) and (9), as well as the corresponding quantities $\widehat{IE}_{\hat m}(\beta)$ and $\hat\lambda_{\hat m}(q)$, which are defined using $\hat E_{\hat m}(\lambda)$ instead of $\hat E(\lambda)$. We study the large-sample behavior of all these quantities in the next section.

We discuss later that using $\tilde e^2_t$ instead of the regular squared residuals $e^2_t$ makes no difference asymptotically. We also mention that isotonizing the variance function followed by a smoothing step would also alleviate the spiking problem. Mammen (1991) showed that the resulting estimator is asymptotically first-order equivalent to the estimator we use.

4. ASYMPTOTIC NORMALITY OF THE EMPIRICAL CONCENTRATION MEASURES

Results on asymptotic normality of the empirical concentration measures formulated in this section motivate the use of our discrimination procedures, which are known to be optimal in the normal case. We need the following assumptions.

Assumption 1. (a) $\varepsilon_1, \varepsilon_2, \ldots$ are iid with $E\varepsilon^2_t = 1$ and $E(\varepsilon^4_1 \log|\varepsilon_1|) < \infty$.

(b) $\sup_{1 \le t \le T} E X^4_{t,T} < \infty$.

(c) $\sigma^2(\cdot) \in \mathcal{U}(m)$, $\sup_{u \in [0,1]} \sigma^2(u) < M < \infty$, and $0 < \int_0^1 \sigma^2(u)\, du$.


Further, we assume that $\sigma^2(\cdot)$ has no "flat parts," that is,

$$\sup_{\lambda > 0} \big| \{ u \in [0,1] : |\sigma^2(u) - \lambda| < \delta \} \big| \to 0 \quad \text{as } \delta \to 0. \qquad (17)$$

Recall that for a (measurable) set C, we denote its Lebesgue measure by $|C|$. Assumption 1(b) follows from Assumption 1(a) if, for instance, we assume $X_{t,T}$ to be a "locally stationary" process in the sense discussed by Dahlhaus and Polonik (2004). Note that this definition of local stationarity does not require that the parameter functions and the variance functions be continuous; only finite bounded variation is required. In particular, our variance function is allowed to have jumps. We would also like to point out that assumption (17) in particular implies that the excess mass functional is continuously differentiable for unimodal $\sigma^2(\cdot)$. This fact is needed in the proofs. It is important to note, however, that unimodality of $\sigma^2(\cdot)$ is not a necessary assumption for our discrimination procedure. It could be dropped at the cost of additional, more complex assumptions.

Our next assumption uses bracketing numbers of a class of functions $\mathcal{G}$. In this context, a set $[g_*, g^*] := \{ h \in \mathcal{G} : g_* \le h \le g^* \}$ is called a "bracket." Given a metric $\rho$ on $\mathcal{G}$ and $\delta > 0$, the bracketing number $N(\delta, \mathcal{G}, \rho)$ is the smallest number of brackets $[g_*, g^*]$ with $\rho(g_*, g^*) \le \delta$ needed to cover $\mathcal{G}$. If there is no such class of brackets, then $N(\delta, \mathcal{G}, \rho) = \infty$.

Assumption 2. (a) The kernel function K is symmetric around 0, is bounded, and has support [-1, 1]. The smoothing parameter b satisfies $bT \to \infty$ and $b\sqrt{T} \to 0$ as $T \to \infty$.

(b) The estimators $\hat\phi_j$ of $\phi_j$, $j = 1, \ldots, p$, satisfy the following conditions:

1. $\hat\phi_j \in \mathcal{G}$, where $\mathcal{G}$ is a class of functions with $\|\hat\phi\|_\infty < a$ for some $0 < a < \infty$, and $\int_0^1 \sqrt{\log N(\delta, \mathcal{G}, \|\cdot\|_\infty)}\, d\delta < \infty$, where $\|\cdot\|_\infty$ denotes the sup-norm.
2. $\|\hat\mu - \mu\|_\infty = o_P(T^{-1/4})$ and $\|\hat\phi_j - \phi_j\|_\infty = o_P(T^{-1/4})$ for all $j = 1, \ldots, p$.

Examples of function classes satisfying the finite integral assumption in (b.1) include Hölder smoothness classes (with appropriate smoothness parameters), the class of monotonic functions, and of course constant functions. (For more details and more examples, as well as for a definition of the bracketing covering numbers, see van der Vaart and Wellner 1996.) In our application presented later we use a model with constant AR parameters; thus the rate of convergence of these estimators is $\sqrt{T}$. Note, however, that we do not require a $\sqrt{T}$-consistent estimator in (b.2). [For estimators satisfying (b.2), see Dahlhaus and Neumann 2001.]

When using a global estimate of the mode $\hat m$, we also require the following assumption.

Assumption 3.
\[
\int_0^1 \bigl( \sigma^2(u) - \sigma^2(\hat m) \bigr)_+ \, du = o_P(1/\sqrt{T}).
\]

Assumption 3 is satisfied if, for instance, $T^{1/6}(\hat m - m) = o_P(1)$ and $\sigma^2(\cdot)$ behaves like a quadratic around the mode $m$.

Our concentration measures are smooth functionals of the empirical excess mass $\hat E$, which itself is asymptotically normal; this explains their asymptotic normality. Asymptotic normality of the excess mass is also of independent interest.

Lemma 1 (Asymptotic normality of the excess mass). Under model (1) and Assumptions 1 and 2, the excess mass process $\sqrt{T}(\hat E - E)(\lambda)$ converges in distribution to a mean-0 Gaussian process in $C[0, \sigma^2(m)]$ with covariance function $c(\lambda_1, \lambda_2)$ given by
\[
(\mu_4 - 1)\biggl[ F(\Gamma(0)) \int_{\Gamma(\lambda_1)\cap\Gamma(\lambda_2)} \sigma^4(u)\,du + F(\Gamma(\lambda_1))\,F(\Gamma(\lambda_2)) \int_0^1 \sigma^4(u)\,du - F(\Gamma(\lambda_1))\,F(\Gamma(0)) \int_{\Gamma(\lambda_2)} \sigma^4(u)\,du - F(\Gamma(\lambda_2))\,F(\Gamma(0)) \int_{\Gamma(\lambda_1)} \sigma^4(u)\,du \biggr],
\]
where $\mu_4 = E\varepsilon_1^4$, $\Gamma(\lambda) = \{u \in [0,1] : \sigma^2(u) \ge \lambda\}$, and $F(C) = \int_C \sigma^2(u)\,du$. If in addition Assumption 3 is assumed, then the same result holds for $\sqrt{T}(\hat E_{\hat m} - E)(\lambda)$.

Theorem 1 (Asymptotic normality of $\widehat{IE}(\kappa)$). Under the corresponding assumptions of Lemma 1, both $\sqrt{T}(\widehat{IE}(\kappa) - IE(\kappa))$ and $\sqrt{T}(\widehat{IE}_{\hat m}(\kappa) - IE(\kappa))$ are asymptotically normal with mean 0 and variance
\[
\int_0^{\sigma^2(m)} \int_0^{\sigma^2(m)} \lambda_1^{\kappa}\, \lambda_2^{\kappa}\, c(\lambda_1, \lambda_2)\, d\lambda_1\, d\lambda_2,
\]
where $c(\lambda_1, \lambda_2)$ is the covariance function of the excess mass process given in Lemma 1.

Theorem 2 (Asymptotic normality of $\hat\lambda(q)$). Let $q \in (0,1]$ be fixed. Under the corresponding assumptions of Lemma 1, both $\sqrt{T}(\hat\lambda(q) - \lambda(q))$ and $\sqrt{T}(\hat\lambda_{\hat m}(q) - \lambda(q))$ are asymptotically normal with mean 0 and variance
\[
\biggl( \frac{1}{|\Gamma(\lambda(q))|} \biggr)^2 c\bigl( \lambda(q), \lambda(q) \bigr), \tag{18}
\]
where $c(\cdot,\cdot)$ is the covariance function of the empirical excess mass given in Lemma 1.

An estimate of the variance can be found by plugging in theempirical estimates for the corresponding theoretical values.
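For illustration, the empirical excess mass and its quantile can be computed from a variance estimate evaluated on a grid. The sketch below is ours, not from the paper; the function names are hypothetical, and we assume the variance function has been standardized to integrate to 1, so that $E(0) = 1$ and hence $\lambda(1) = 0$.

```python
import numpy as np

def excess_mass(sigma2, lam):
    """Empirical excess mass E(lam) = integral of (sigma2(u) - lam)_+ du,
    with sigma2 given on an equispaced grid over [0, 1]."""
    return np.mean(np.clip(sigma2 - lam, 0.0, None))

def excess_mass_quantile(sigma2, q, n_grid=2000):
    """lam(q) = E^{-1}(q): the smallest level lam with E(lam) <= q.
    E is decreasing in lam, so a simple grid search suffices."""
    levels = np.linspace(0.0, sigma2.max(), n_grid)
    for lam in levels:
        if excess_mass(sigma2, lam) <= q:
            return lam
    return sigma2.max()

# toy unimodal variance function, standardized so that its integral is 1
u = np.linspace(0.0, 1.0, 501)
sigma2 = 1.0 - (u - 0.5) ** 2
sigma2 /= np.mean(sigma2)          # now E(0) = 1, hence lam(1) = 0

lam_half = excess_mass_quantile(sigma2, 0.5)   # level holding half the mass
```

The grid search exploits the monotonicity of $E(\lambda)$; a bisection would be equally valid.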

5. THE DISCRIMINATION PROCEDURE

In this section we describe our discrimination procedure in some detail. The common setup assumes the availability of a training set with known types or classes. These training data are used to estimate unknown quantities and to derive the actual classification rule. A new, unknown event is then assigned to a category based on the value of the classification rule. Our discrimination rule is based on the (empirical) discrimination measures described earlier.

The actual form of the rule is motivated by the asymptotic normality of the empirical concentration measures. Under the assumption that the observations within each class are iid Gaussian random variables, optimal discrimination techniques are well studied. Motivated by the asymptotic normality of our discrimination measures, we use such optimal normality-based discrimination rules.


Chandler and Polonik: Discrimination of Time Series 245

A Two-Dimensional Variant. In the application study presented herein, where we apply our procedure to the discrimination of seismic time series, we follow a well-known practice (see, e.g., Kakizawa et al. 1998) by treating each seismic time series as bivariate, considering the p-wave and the s-wave individually, and we also assume that the two waves are independent. Figure 1 shows two examples of seismic time series (one earthquake and one explosion) split into p-waves and s-waves.

Applying our discrimination measures to the p-waves and s-waves separately, we end up with a two-dimensional measure of discrimination. We then apply a normality-based quadratic classification rule to assign future observations to one of the two classes, $\pi_1$ = "earthquake" or $\pi_2$ = "explosion." That is, if we let $T(X)$ denote one of the two discrimination measures $\widehat{IE}(\kappa)$ or $\hat\lambda(q)$, and we let $X = (X_p, X_s)$ denote the decomposition of the seismic time series into p-wave $X_p$ and s-wave $X_s$, then $T(X) = (T(X_p), T(X_s))$ denotes the two-dimensional discrimination measure. We then allocate a newly observed time series $X$ to $\pi_1$ based on $T(X)$ if

\[
-\frac{1}{2}\, T(X)' \bigl( S_1^{-1} - S_2^{-1} \bigr) T(X) + \bigl( \bar T_1' S_1^{-1} - \bar T_2' S_2^{-1} \bigr) T(X) - k \ \ge\ \ln\biggl[ \biggl( \frac{c(1|2)}{c(2|1)} \biggr) \biggl( \frac{p_2}{p_1} \biggr) \biggr], \tag{19}
\]
where
\[
k = \frac{1}{2} \ln\biggl( \frac{|S_1|}{|S_2|} \biggr) + \frac{1}{2} \bigl( \bar T_1' S_1^{-1} \bar T_1 - \bar T_2' S_2^{-1} \bar T_2 \bigr), \tag{20}
\]

$\bar T_i$ denotes the sample average of the classification measure for group $i$, $S_i$ is the estimated covariance matrix of the empirical concentration measures for group $i$, $|S_i|$ is the determinant of $S_i$, $c(i|j)$ is the cost associated with allocating an observation from group $j$ to group $i$, and $p_i$ is the proportion of observations from group $i$ in the population (see, e.g., Johnson and Wichern 1998). Because of the assumed independence of the p-wave and s-wave, the $S_i$, $i = 1, 2$, are diagonal matrices. Next, we study this discrimination procedure numerically.
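Rule (19)-(20) is the standard normality-based quadratic classification rule. A minimal sketch (our function names; equal costs and priors by default) is:

```python
import numpy as np

def quadratic_score(x, mean1, mean2, S1, S2):
    """Left side of the quadratic classification rule (19):
    -0.5 x'(S1^-1 - S2^-1)x + (mean1'S1^-1 - mean2'S2^-1)x - k."""
    S1i, S2i = np.linalg.inv(S1), np.linalg.inv(S2)
    k = 0.5 * np.log(np.linalg.det(S1) / np.linalg.det(S2)) \
        + 0.5 * (mean1 @ S1i @ mean1 - mean2 @ S2i @ mean2)
    return -0.5 * x @ (S1i - S2i) @ x + (mean1 @ S1i - mean2 @ S2i) @ x - k

def classify(x, mean1, mean2, S1, S2, c12=1.0, c21=1.0, p1=0.5, p2=0.5):
    """Allocate x to group 1 iff the score meets the threshold in (19)."""
    threshold = np.log((c12 / c21) * (p2 / p1))
    return 1 if quadratic_score(x, mean1, mean2, S1, S2) >= threshold else 2

# two well-separated toy groups with diagonal (independent-wave) covariances
mean1, mean2 = np.array([0.0, 0.0]), np.array([4.0, 4.0])
S1 = S2 = np.eye(2)
```

In practice the group means and covariances would be the training-set quantities $\bar T_i$ and $S_i$ described above.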

5.1 Classifying Seismic Time Series

The dataset that we use here was first published by Kakizawa et al. (1998), who also provided a description of the events. (For a detailed discussion of the problem, see also Shumway and Stoffer 2000.) The dataset consists of eight known earthquakes and eight mining explosions, all measured by stations in Scandinavia at regional distances. It also contains an event of unknown origin. One could envision a scenario in which this procedure would be used to monitor nuclear treaties, where classifying a nuclear test as an earthquake would carry a different cost than the opposite type of error. Nevertheless, for what is done in the following, we set the costs associated with allocating a series to the wrong class equal for both cases. Likewise, we assume that the proportions of each type are equal, that is, $p_1 = p_2$ in (19). For this example, we use a second-order autoregressive process, because it seems to be a good fit for the data available.

In this application we use the special case of our general methodology described earlier, where we assume mean 0, constant AR parameters, and $p = 2$. Further evidence that this assumption is justified for classification purposes is provided by

Figure 2. Discrimination Scatterplot for the Quantile Method (earthquakes, explosions, and the unknown event are plotted with distinct symbols).

the fact that our classification results presented herein did not change when we applied the methodology with the mean function estimated (nonparametrically, via a kernel estimator).

We use the excess mass quantile as our discrimination measure, and we first discuss the selection of the parameter $q$ determining the value $\lambda(q) = E^{-1}(q)$ on which to base our discrimination. Clearly, selection of this quantity is crucial: $q = 1$ would provide no discriminatory power, because $\lambda(1) = 0$ for all series. Because we want to discriminate between series belonging to $\pi_1$ and $\pi_2$, we search for a quantile $q$ that makes observations within each $\pi_i$ as homogeneous as possible, while simultaneously making series in different categories as different as possible. Therefore, we search for the value of $q$ that maximizes the ratio of the between-group sums of squares to the within-group sums of squares. Assuming the two waves to be independent, we choose a different $q$ for each type of wave (see Fig. 3). As may be apparent from Figure 2, much of the discriminatory power lies in the p-waves. The data select a value of $q = .01$ for the p-wave. The scatterplot of the discrimination measures is shown in Figure 2. It is of interest to note that the unknown event is clearly classified as an earthquake. Whereas the discrimination measure results in complete separation of the two classes, the discrimination rule results in two misclassifications, apparently because the identical-distribution assumption is violated. Inherent in using the quadratic discrimination rule is the assumption that the discrimination measures for each explosion or earthquake are identically distributed with respect to the others in the same class. It is not clear from Figure 2 whether this assumption is justified; nevertheless, we present the example.
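The selection of $q$ by maximizing the ratio of between-group to within-group sums of squares can be sketched as follows (function names are ours; `measure` stands for the per-series excess mass quantile):

```python
import numpy as np

def between_within_ratio(values, labels):
    """Ratio of between-group to within-group sums of squares for a
    one-dimensional discrimination measure evaluated on a training set."""
    values, labels = np.asarray(values, dtype=float), np.asarray(labels)
    grand = values.mean()
    between = within = 0.0
    for g in np.unique(labels):
        v = values[labels == g]
        between += len(v) * (v.mean() - grand) ** 2
        within += ((v - v.mean()) ** 2).sum()
    return between / within

def select_q(measure, series_list, labels, q_grid):
    """Pick the q maximizing the between/within ratio, where measure(x, q)
    returns the discrimination measure of one training series at level q."""
    ratios = [between_within_ratio([measure(x, q) for x in series_list], labels)
              for q in q_grid]
    return q_grid[int(np.argmax(ratios))]
```

A separate call per wave type would reproduce the per-wave choice of $q$ described above.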

In the foregoing analysis we estimate the AR parameters and the variance function simultaneously by minimizing the Whittle likelihood. More precisely, we define our parameter space as
\[
\Theta \times U = \bigl\{ \theta = (\phi_1, \dots, \phi_p) \in \Theta,\ \sigma^2(\cdot) \text{ unimodal on } [0,1] \bigr\},
\]
where $\Theta$ denotes the set
\[
\Theta = \Bigl\{ \theta = (\phi_1, \phi_2) \in \mathbb{R}^2 : 1 - \sum_{i=1}^{2} \phi_i z^i \ne 0 \ \forall\, z \in \mathbb{C},\ 0 < |z| \le 1 \Bigr\}.
\]
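The causality requirement defining $\Theta$ can be checked numerically: the AR polynomial $1 - \phi_1 z - \cdots - \phi_p z^p$ must have no roots in the closed unit disk. A small sketch (ours, not from the paper):

```python
import numpy as np

def is_causal(phi):
    """Check 1 - phi_1 z - ... - phi_p z^p != 0 for all complex |z| <= 1,
    i.e. all roots of the AR polynomial lie outside the unit circle."""
    # numpy expects coefficients from highest degree down:
    # -phi_p z^p - ... - phi_1 z + 1
    coeffs = np.concatenate((-np.asarray(phi, dtype=float)[::-1], [1.0]))
    roots = np.roots(coeffs)
    return bool(np.all(np.abs(roots) > 1.0))
```

For the simulation parameters used later, $\phi = (1.58, -.64)$, the two complex roots have modulus $1.25 > 1$, so the process is causal.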


246 Journal of the American Statistical Association, March 2006

Figure 3. The Ratio of Between Sums of Squares to Within Sums of Squares as a Function of q for Simulated Data Using the Quantile Method.

Note that if in addition $\sigma^2(\cdot)$ is bounded (which we assume; see Assumption 1), then this choice of $\Theta$ corresponds to our model being locally stationary in the sense of Dahlhaus and Polonik (2004). This in turn implies that the estimator $\hat\theta_L$, defined in the following, is $\sqrt{T}$-consistent, and hence our assumptions apply to this case. Let

\[
W_T(\theta, \sigma^2) = \sum_{t=1}^{T} \biggl[ \log \sigma^2\Bigl(\frac{t}{T}\Bigr) + \frac{(x_{t,T} - \phi_1 x_{t-1,T} - \cdots - \phi_p x_{t-p,T})^2}{\sigma^2(\tfrac{t}{T})} \biggr],
\]
and define
\[
\bigl( \hat\theta_{L,\hat m}, \hat\sigma^2_{L,\hat m} \bigr) = \operatorname*{argmin}_{(\theta, \sigma^2) \in \Theta \times U(\hat m)} W_T(\theta, \sigma^2). \tag{21}
\]

Note that this can also be viewed as a profile likelihood method. That is, we have $\hat\sigma^2_{L,\hat m} = \operatorname{argmin}_{\sigma^2 \in U(\hat m)} W_T(\sigma^2, \hat\theta(\sigma^2))$, where $\hat\theta(\sigma^2) = \operatorname{argmin}_{\theta \in \Theta} W_T(\theta, \sigma^2)$. Thus the foregoing fits into the framework described in Section 3. The estimator $\hat\theta$ has the desired convergence rate; in fact, as shown by Dahlhaus and Polonik (2005), it converges at the rate $1/\sqrt{T}$.

The final estimate of $\sigma^2(\cdot)$ that we use in the numerical work is the "smoothed" version of $\hat\sigma^2_{L,\hat m}$ described earlier. Here the estimate $\hat m$ is the global mode of a kernel smooth of the squared observations, based on a Gaussian kernel with a bandwidth of 50 time points. This bandwidth was chosen based on a visual inspection of the data. This (rough) estimate was chosen for computational convenience; other estimates that we tried worked similarly well.
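The mode estimate just described can be sketched directly. The function name below is ours, and the $O(T^2)$ loop is written for clarity, not efficiency:

```python
import numpy as np

def estimate_mode(x, bandwidth=50):
    """Rough global mode estimate: location (rescaled to [0, 1]) of the
    maximum of a Gaussian-kernel smooth of the squared observations."""
    T = len(x)
    t = np.arange(T)
    smoothed = np.empty(T)
    for i in range(T):
        w = np.exp(-0.5 * ((t - i) / bandwidth) ** 2)
        smoothed[i] = np.sum(w * x ** 2) / np.sum(w)
    return np.argmax(smoothed) / T

# toy series whose innovation variance peaks near the middle
rng = np.random.default_rng(0)
T = 500
u = np.arange(T) / T
sigma = 1.0 + 4.0 * np.exp(-((u - 0.5) / 0.1) ** 2)
x = sigma * rng.standard_normal(T)
m_hat = estimate_mode(x)
```

Since the smoothed squared observations track the variance function, the argmax is a crude but serviceable estimate of its mode.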

The Algorithm. To find the minimizers in (21), we use an iterative procedure described by Dahlhaus and Polonik (2005). Note that finding the AR parameters for a given $\sigma^2(t)$ is just a weighted least squares problem, and given the values of the AR parameters, we use the algorithm for estimating the variance function described earlier. These two steps are iterated to find the global minimizers.
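The two-step iteration can be sketched as follows. This is our illustrative stand-in: the variance step uses an unconstrained Gaussian-kernel smooth of the squared residuals in place of the paper's shape-restricted variance estimator, and all names are ours.

```python
import numpy as np

def fit_tvar(x, p=2, n_iter=5, bandwidth=50):
    """Alternating estimation for an AR model with time-varying variance:
    (i) weighted least squares for constant AR parameters given sigma^2,
    (ii) kernel smooth of squared residuals to update sigma^2
    (a stand-in for the shape-restricted variance estimator)."""
    T = len(x)
    sigma2 = np.full(T, x.var())
    grid = np.arange(T)
    phi = np.zeros(p)
    for _ in range(n_iter):
        # (i) WLS step: regress x_t on its lags with weights 1 / sigma2
        y = x[p:]
        X = np.column_stack([x[p - j:T - j] for j in range(1, p + 1)])
        w = 1.0 / sigma2[p:]
        phi = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
        # (ii) variance step: smooth the squared residuals
        resid = y - X @ phi
        resid2 = np.concatenate([np.full(p, resid[0] ** 2), resid ** 2])
        for i in range(T):
            k = np.exp(-0.5 * ((grid - i) / bandwidth) ** 2)
            sigma2[i] = np.sum(k * resid2) / np.sum(k)
    return phi, sigma2

# sanity check on a constant-parameter AR(2) with unit innovation variance
rng = np.random.default_rng(1)
T = 2000
x = np.zeros(T)
eps = rng.standard_normal(T)
for t in range(2, T):
    x[t] = 1.58 * x[t - 1] - 0.64 * x[t - 2] + eps[t]
phi_hat, sigma2_hat = fit_tvar(x, p=2, n_iter=5)
```

With a constant true variance, the first WLS pass already reduces to ordinary least squares, so the iteration converges quickly here.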

5.2 Simulations

In this section we present simulation results comparing our method with the spectral, distance-based method of Sakiyama and Taniguchi (2001). This spectral-based method essentially assigns an observation to the category whose spectrum is closest to an estimate of the spectral density of the observed process, with respect to an approximation of the Gaussian likelihood. The simulated time series detailed herein were generated so that alignment of the observations in time is not necessary, thus aiding the distance-based procedure.

We emulate the original dataset by simulating observations from each of two categories (meaning two different AR(2) models). We then try to reclassify each observation using the remainder as a training set. The goal of this simulation study is to show how the two proposed measures compare with the spectral-based method under the proposed model. We mimic the data of the numerical study in that we generate eight observations from each of the two categories, and we show that when the assumption of identically distributed random variables is satisfied, the quadratic discrimination rule performs quite well, in many cases outperforming the spectral-based classification rule.

For an observation from group $i$, we simulate from a second-order AR process of the form
\[
x_{t,T} = 1.58\, x_{t-1,T} - .64\, x_{t-2,T} + \varepsilon_t\, \sigma_i(t/T). \tag{22}
\]
The values of the AR parameters are similar to the estimated parameters for the data in the numerical example and can be shown to satisfy the requirement for causality.
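Model (22) with a variance function of form (23) can be simulated directly; the function names and the series length below are our choices for illustration:

```python
import numpy as np

def simulate_tvar(T, sigma, phi=(1.58, -0.64), rng=None):
    """Simulate x_{t,T} = 1.58 x_{t-1,T} - .64 x_{t-2,T} + eps_t sigma(t/T),
    the second-order process in (22), with iid standard normal errors."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.zeros(T)
    eps = rng.standard_normal(T)
    for t in range(2, T):
        x[t] = phi[0] * x[t - 1] + phi[1] * x[t - 2] + eps[t] * sigma(t / T)
    return x

def sigma_23(u, a=2.6, b=2.6):
    """Variance function of form (23), peaked near u = .5."""
    return 300.0 * (u ** a * (u < 0.5) + (1 - u) ** b * (u > 0.5))

x = simulate_tvar(1024, sigma_23, rng=np.random.default_rng(2))
```

With exponents around 2.6 the innovation scale is concentrated near the middle of the series, so the sample variability there dwarfs that near the endpoints.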

The two variance functions that we use for the two categories in the following tables are of the form
\[
\sigma_1(u) = 300 \bigl[ u^a\, 1(u < .5) + (1-u)^b\, 1(u > .5) \bigr],
\]
\[
\sigma_2(u) = 300 \bigl[ u^c\, 1(u < .5) + (1-u)^d\, 1(u > .5) \bigr], \tag{23}
\]
where $a$ and $b$ are adjusted to illustrate how the discrimination measures behave at different levels of similarity between the two categories. Let $X_i$ denote the number of misclassifications in one run of the simulation; thus $X_i \in \{0, 1, \dots, 16\}$. In Tables 1 and 2, $\bar X$ denotes the misclassification rate and $s_X$ estimates the standard deviation of the number of misclassifications. "Spec" denotes the spectral-based method of Sakiyama

Table 1. Misclassification Rate, $\bar X$, and Estimated Standard Deviation, $s_X$, of the Number of Misclassifications Based on 100 Simulation Runs Under Model (23)

              $\widehat{IE}$        $\hat\lambda$          Spec
  a         $\bar X$   $s_X$    $\bar X$   $s_X$    $\bar X$   $s_X$
  2.6         .07       .26       .01       .11       .59       .77
  2.5         .2        .42       .03       .17       .7        .67
  2.4         .49       .70       .06       .24       .84       .84
  2.25       2.52      1.61       .4        .62      1.45      1.14
  2.2        3.2       1.22       .86       .98      1.3       1.25


Table 2. Misclassification Rate, $\bar X$, and Estimated Standard Deviation, $s_X$, of the Number of Misclassifications Based on 100 Simulation Runs Under Model (24)

                    $\widehat{IE}$        $\hat\lambda$          Spec
  a      b        $\bar X$   $s_X$    $\bar X$   $s_X$    $\bar X$   $s_X$
  3.6    3.0        .07       .25       .08       .31      1.44      1.12
  3.5    2.9        .29       .55       .14       .34      1.72      1.16
  3.4    2.75       .62       .84       .38       .60      1.84      1.24
  3.3    2.6       1.50      1.23       .91      1.00      1.94      1.27

and Taniguchi (2001). The tables are based on 100 simulation runs. In all cases examined, the quantile method outperforms the spectral-based method, and in several cases the integrated excess mass approach is better as well. Table 1 examines the case in which the variance function is continuous, setting $c = d = 2$ and $a = b$. Table 2 considers a variance function that contains jumps, thus not satisfying local stationarity in the sense of Dahlhaus (1997) but still covered by the definition given by Dahlhaus and Polonik (2004). This is accomplished by choosing the exponents 2 and 3, so that the jump occurs at $u = .5$ and is as large as half of the maximum of the variance function:
\[
\sigma_1(u) = 300 \bigl[ u^3\, 1(u < .5) + (1-u)^2\, 1(u > .5) \bigr],
\]
\[
\sigma_2(u) = 300 \bigl[ u^a\, 1(u < .5) + (1-u)^b\, 1(u > .5) \bigr]. \tag{24}
\]
It is interesting to note that even when both methods make misclassifications for a particular dataset, the misclassified observations often are not the same across methods; discrimination seems to be based on different information in the data. Thus it seems possible to define a discrimination rule, combining the spectral-based and time-based methods, that would outperform each method individually.

6. PROOFS

To ease notation, we write $X_t$ instead of $X_{t,T}$ throughout this section. For the proofs, we assume that $X_{-[bT]+1} = X_{-[bT]+2} = \cdots = X_0 = 0 = X_{T+1} = X_{T+2} = \cdots = X_{T+[bT]}$. Let $\hat F(\alpha)$ and $\hat F^*(\alpha)$ denote the partial sum processes for the ordinary and the smoothed residuals, that is,
\[
\hat F(\alpha) = \frac{1}{T}\sum_{t=1}^{[\alpha T]} e_t^2 \qquad\text{and}\qquad \hat F^*(\alpha) = \frac{1}{T}\sum_{t=1}^{[\alpha T]} e_t^{*2}. \tag{25}
\]
Recall that $e_t^{*2} = \sum_{s=t-[bT]}^{t+[bT]} \bigl( K\bigl(\frac{t-s}{bT}\bigr)/A_s \bigr)\, e_s^2$, where $A_s = \sum_{t=s-[bT]}^{s+[bT]} K\bigl(\frac{s-t}{bT}\bigr)$. Hence, by rearranging sums, we can write
\[
\hat F^*(\alpha) = \frac{1}{T}\sum_{t=1}^{[\alpha T]+[bT]} w_t(\alpha)\, e_t^2, \tag{26}
\]
where the nonrandom weights $w_t(\alpha)$ are normalized sums of kernel weights. The specific form of the weights plays no role in our proofs, and hence no precise formula is given. What is important, however, is that most of the weights equal 1: if $\alpha > 2b$, then $w_t(\alpha) = 1$ for all $[bT] \le t \le [\alpha T] - [bT]$; the remaining weights satisfy $0 < w_t(\alpha) < 1$.

First, we prove a lemma that allows us to disregard estimation of the nuisance parameters.

Lemma 2. Suppose that Assumptions 1 and 2 hold. Then
\[
\sqrt{T} \sup_{0\le\alpha\le 1} \biggl| \hat F(\alpha) - \frac{1}{T} \sum_{t=1}^{[\alpha T]} \sigma^2(t/T)\, \varepsilon_t^2 \biggr| = o_P(1)
\]
and
\[
\sqrt{T} \sup_{0\le\alpha\le 1} \biggl| \hat F^*(\alpha) - \frac{1}{T} \sum_{t=1}^{[\alpha T]+[bT]} w_t(\alpha)\, \sigma^2(t/T)\, \varepsilon_t^2 \biggr| = o_P(1).
\]

Proof. For ease of notation, we present the proof for the case $p = 2$; the case of a general order $p \ge 1$ follows similarly. To simplify the proof, we also assume that the parameter functions $\phi_1, \phi_2$, and $\sigma$ are defined on the whole real line, with $\phi_1(u) = \phi_2(u) = \sigma(u) = 0$ for $u \notin [0,1]$. We first prove the assertion for the ordinary residual partial sum process $\hat F(\alpha) = \frac{1}{T}\sum_{t=1}^{[\alpha T]} e_t^2$.

First, we consider the effect of estimating $\mu(\cdot)$. With $\tilde e_t = X^c_t - \hat\phi_1(t/T) X^c_{t-1} - \hat\phi_2(t/T) X^c_{t-2}$ denoting the residuals based on the true mean, and $\Delta_T(t) = \mu(t/T) - \hat\mu(t/T)$, we can write the difference $\frac{1}{T}\sum_{t=1}^{[\alpha T]} e_t^2 - \frac{1}{T}\sum_{t=1}^{[\alpha T]} \tilde e_t^2$ as
\[
\frac{2}{T} \sum_{t=1}^{[\alpha T]} \tilde e_t \Bigl( \Delta_T(t) - \hat\phi_1\Bigl(\frac{t}{T}\Bigr) \Delta_T(t-1) - \hat\phi_2\Bigl(\frac{t}{T}\Bigr) \Delta_T(t-2) \Bigr) \tag{27}
\]
\[
+\ \frac{1}{T} \sum_{t=1}^{[\alpha T]} \Bigl( \Delta_T(t) - \hat\phi_1\Bigl(\frac{t}{T}\Bigr) \Delta_T(t-1) - \hat\phi_2\Bigl(\frac{t}{T}\Bigr) \Delta_T(t-2) \Bigr)^2. \tag{28}
\]

Using the Cauchy–Schwarz inequality, (27) can be bounded from above by
\[
\begin{aligned}
2\alpha &\Biggl[ \frac{1}{\alpha T} \sum_{t=1}^{[\alpha T]} \tilde e_t^2 \Biggr]^{1/2} \Biggl[ \frac{1}{\alpha T} \sum_{t=1}^{[\alpha T]} \Bigl( \Delta_T(t) - \hat\phi_1\Bigl(\frac{t}{T}\Bigr)\Delta_T(t-1) - \hat\phi_2\Bigl(\frac{t}{T}\Bigr)\Delta_T(t-2) \Bigr)^2 \Biggr]^{1/2} \\
&\le 2 \Bigl[ 1 + \sup_t \Bigl| \hat\phi_1\Bigl(\frac{t}{T}\Bigr) \Bigr| + \sup_t \Bigl| \hat\phi_2\Bigl(\frac{t}{T}\Bigr) \Bigr| \Bigr]^2 \sup_t |\Delta_T(t)|^2\, \frac{1}{T} \sum_{t=1}^{[\alpha T]} \tilde e_t^2 \\
&= o_P(1/\sqrt{T})\, \frac{1}{T} \sum_{t=1}^{[\alpha T]} \tilde e_t^2. \tag{29}
\end{aligned}
\]

The last equality follows from the fact that, by assumption, the $\hat\phi_i$ are uniformly consistent estimators of $\phi_i$, the $\phi_i$ are bounded, and $\sup_t |\Delta_T(t)|^2 = o_P(1/\sqrt{T})$. Later we show that $\sup_{0\le\alpha\le 1} \frac{1}{T}\sum_{t=1}^{[\alpha T]} \tilde e_t^2 \le \frac{1}{T}\sum_{t=1}^{T} \tilde e_t^2 = O_P(1)$. This, together with the foregoing, implies that $\sup_{0<\alpha\le 1} | \frac{1}{T}\sum_{t=1}^{[\alpha T]} e_t^2 - \frac{1}{T}\sum_{t=1}^{[\alpha T]} \tilde e_t^2 | = o_P(1/\sqrt{T})$. The fact that (28) is also $o_P(1/\sqrt{T})$ follows similarly, but is somewhat easier; details are omitted.


We now show that $\sup_{0<\alpha\le 1} | \frac{1}{T}\sum_{t=1}^{[\alpha T]} \tilde e_t^2 - \frac{1}{T}\sum_{t=1}^{[\alpha T]} \sigma^2(\frac{t}{T})\varepsilon_t^2 | = o_P(1/\sqrt{T})$. This then completes the proof, because $\frac{1}{T}\sum_{t=1}^{[\alpha T]} \sigma^2(\frac{t}{T})\varepsilon_t^2 \le \frac{1}{T}\sum_{t=1}^{T} \sigma^2(\frac{t}{T})\varepsilon_t^2 = O_P(1)$. We can write
\[
\begin{aligned}
\frac{1}{T}&\sum_{t=1}^{[\alpha T]} \tilde e_t^2 - \frac{1}{T}\sum_{t=1}^{[\alpha T]} \sigma^2\Bigl(\frac{t}{T}\Bigr)\varepsilon_t^2 \\
&= \frac{1}{T}\sum_{t=1}^{[\alpha T]} \Bigl[ -2\Bigl( X^c_t - \phi_1\Bigl(\frac{t}{T}\Bigr)X^c_{t-1} - \phi_2\Bigl(\frac{t}{T}\Bigr)X^c_{t-2} \Bigr) X^c_{t-1} \Bigr] \Bigl( \hat\phi_1\Bigl(\frac{t}{T}\Bigr) - \phi_1\Bigl(\frac{t}{T}\Bigr) \Bigr) \\
&\quad + \frac{1}{T}\sum_{t=1}^{[\alpha T]} \Bigl[ -2\Bigl( X^c_t - \phi_1\Bigl(\frac{t}{T}\Bigr)X^c_{t-1} - \phi_2\Bigl(\frac{t}{T}\Bigr)X^c_{t-2} \Bigr) X^c_{t-2} \Bigr] \Bigl( \hat\phi_2\Bigl(\frac{t}{T}\Bigr) - \phi_2\Bigl(\frac{t}{T}\Bigr) \Bigr) \\
&\quad + \frac{1}{T}\sum_{t=1}^{[\alpha T]} \Bigl( \hat\theta\Bigl(\frac{t}{T}\Bigr) - \theta\Bigl(\frac{t}{T}\Bigr) \Bigr)' C_t \Bigl( \hat\theta\Bigl(\frac{t}{T}\Bigr) - \theta\Bigl(\frac{t}{T}\Bigr) \Bigr) \\
&= -\frac{2}{T}\sum_{j=1}^{2} \sum_{t=1}^{[\alpha T]} \sigma\Bigl(\frac{t}{T}\Bigr)\varepsilon_t\, X^c_{t-j} \Bigl( \hat\phi_j\Bigl(\frac{t}{T}\Bigr) - \phi_j\Bigl(\frac{t}{T}\Bigr) \Bigr) \\
&\quad + \frac{1}{T}\sum_{t=1}^{[\alpha T]} \Bigl( \hat\theta\Bigl(\frac{t}{T}\Bigr) - \theta\Bigl(\frac{t}{T}\Bigr) \Bigr)' C_t \Bigl( \hat\theta\Bigl(\frac{t}{T}\Bigr) - \theta\Bigl(\frac{t}{T}\Bigr) \Bigr),
\end{aligned} \tag{30}
\]
where $C_t = \bigl( (X^c_{t-i}X^c_{t-j})_{i,j} \bigr)$, $i, j = 1, 2$, and $\theta(t/T) = (\phi_1(t/T), \phi_2(t/T))'$. We now show that both sums in (30) are $o_P(1/\sqrt{T})$ uniformly in $\alpha \in [0,1]$. As for the second of these two sums, note that

\[
P\Biggl( \sup_{\alpha\in[0,1]} \Biggl| \frac{1}{T} \sum_{t=1}^{[\alpha T]} X^c_{t-i}X^c_{t-j} \Biggr| > C \Biggr) \le P\Biggl( \frac{1}{T} \sum_{t=1}^{T} |X^c_{t-i}X^c_{t-j}| > C \Biggr) \le \frac{\sup_t \operatorname{var}(X_t)}{C} = O\Bigl( \frac{1}{C} \Bigr).
\]
Hence $\sup_{0<\alpha\le 1} \frac{1}{T}\sum_{t=1}^{[\alpha T]} |X^c_{t-i}X^c_{t-j}| = O_P(1)$, $i, j = 1, 2$, and because, by assumption, $\|\hat\phi_j - \phi_j\|_\infty = o_P(T^{-1/4})$ for $j = 1, 2$, the second sum in (30) is of order $o_P(1/\sqrt{T})$ (uniformly in $\alpha$).

Now we concentrate on the first sum in (30). For a function $h$ on $[0,1]$, let
\[
M_{T,j}(h) = \frac{1}{\sqrt{T}} \sum_{t=1}^{T} \sigma\Bigl(\frac{t}{T}\Bigr)\, \varepsilon_t\, X^c_{t-j}\, h\Bigl(\frac{t}{T}\Bigr), \qquad j = 1, 2. \tag{31}
\]
Note that the first sum in (30) can be written as $-\frac{2}{\sqrt{T}} \sum_{j=1}^{2} M_{T,j}\bigl( (\hat\phi_j - \phi_j) 1_{[0,\alpha]} \bigr)$. Hence, defining $H_{T,j} = \{ h : h = (g - \phi_j) 1_{[0,\alpha]};\ g \in G,\ \alpha \in [0,1],\ \|g - \phi_j\|_\infty \le T^{-1/4} \}$, and recalling that, by assumption, $\hat\phi_j \in G$, $j = 1, 2$, and $\|\hat\phi_j - \phi_j\|_\infty = o_P(T^{-1/4})$, we see that $(\hat\phi_j - \phi_j) 1_{[0,\alpha]} \in H_{T,j}$ with probability tending to 1 as $T \to \infty$. Hence the first sum in (30) being $o_P(1/\sqrt{T})$ uniformly in $\alpha$ follows from
\[
\sup_{h \in H_{T,j}} |M_{T,j}(h)| = o_P(1) \qquad \text{for } j = 1, 2. \tag{32}
\]

For ease of notation, let $H = H_{T,j}$. Suppose that for each $\delta > 0$ there exists a finite partition of $H$ into sets $H_k$, $k = 1, \dots, N(\delta)$ (an appropriate partition is constructed later). Then we have, for $\eta, \delta > 0$,
\[
P\Bigl( \sup_{h\in H} |M_{T,j}(h)| > \eta \Bigr) \le P\Bigl( \max_{k=1,\dots,N(\delta)} |M_{T,j}(h_k)| > \eta/2 \Bigr) + P\Bigl( \max_{k=1,\dots,N(\delta)} \sup_{g,h\in H_k} |M_{T,j}(g - h)| > \eta/2 \Bigr), \tag{33}
\]
where $h_k$ denotes a representative element of $H_k$.

Note that $M_{T,j}(h)$ for each fixed $h$ is a sum of martingale differences with respect to the filtration $\{\mathcal{F}_t\}$, where $\mathcal{F}_t$ is the $\sigma$-algebra generated by $\varepsilon_t, \varepsilon_{t-1}, \dots$. Our assumptions imply that $\operatorname{var}(M_{T,j}(h)) = O(\|h\|_\infty) = o(1)$ for any $h \in H$. This implies that the first term on the right side of (33) tends to 0 as $T \to \infty$. That the second term on the right side of (33) also becomes small follows from lemma 3.3 of Nishiyama (2000), applied to a martingale process in discrete time as considered here (cf. Nishiyama 2000, sec. 4). To apply this lemma, we need to verify two conditions, called [PE$'$] and [L1$'$] in Nishiyama's article. We do this next.

First, we construct the needed (nested) partition of $H$. Let $\delta > 0$. Recall that, by assumption, $G$ has a finite bracketing integral. Without loss of generality, the corresponding partitions can be assumed to be nested (see, e.g., van der Vaart and Wellner 1996; Nishiyama 1996). Let $G_k$, $1 \le k \le N_B(\delta)$, be the corresponding brackets, that is, $G_k = \{ g \in G : g_{k,*} \le g \le g^*_k \}$ with functions $g^*_k$ and $g_{k,*}$ satisfying $\rho(g^*_k, g_{k,*}) < \delta$. Divide $[0,1]$ into small intervals $[b_{k-1}, b_k]$ with $0 = b_0 < b_1 < \cdots < b_{M(\delta^2)-1} < b_{M(\delta^2)} = 1$ and $|b_k - b_{k-1}| < \delta^2$ for $1 \le k \le M(\delta^2)$, where $M(\delta^2) = O(1/\delta^2)$. Then let the partition of $H$ consist of the sets $H_{k,l} = \{ h = (g - \phi_j) 1_{[0,\alpha]} \in H : g \in G_k,\ \alpha \in [b_{l-1}, b_l] \}$, $k = 1, \dots, N_B(\delta)$, $l = 1, \dots, M(\delta^2)$. By construction,
\[
\int_0^1 \sqrt{\log N(\eta)}\, d\eta \le \int_0^1 \sqrt{\log N_B(\eta)}\, d\eta + \int_0^1 \sqrt{\log M(\eta^2)}\, d\eta < \infty,
\]
and the partition can without loss of generality be chosen to be decreasing in $\delta$.

We need some more notation. Let $E_t$ denote conditional expectation given $\mathcal{F}_t$. Write $M_T(h) = \sum_{t=1}^{T} \xi_t(h)$, where $\xi_t(h) = \frac{1}{\sqrt{T}}\, \sigma(t/T)\, \varepsilon_t\, X^c_{t-j}\, h(t/T)$. With $\xi_t = \sup_{h\in H} |\xi_t(h)|$, let $V_T(\eta) = \sum_{t=1}^{T} E_{t-1}\bigl( \xi_t I(\xi_t > \eta) \bigr)$. In this notation, Nishiyama's condition [L1$'$] reads $V_T(\eta) = o_P(1)$ for every $\eta > 0$. Note that there exists a constant $C > 0$ such that $\xi_t \le C \frac{1}{\sqrt{T}} |\varepsilon_t X^c_{t-j}|$, and hence
\[
V_T(\eta) \le \frac{C^2}{T} \sum_{t=1}^{T} (X^c_{t-j})^2\, P\bigl( |\varepsilon_t X^c_{t-j}| > \eta\sqrt{T}/C \,\big|\, X^c_{t-j} \bigr).
\]
Now note that on the set $A_T = \{ \max_{t=1,\dots,T} |X^c_t| < \sqrt{T}/\log T \}$, we have $P( |\varepsilon_t X^c_{t-j}| > \eta\sqrt{T}/C \,|\, X^c_{t-j} ) \le P( |\varepsilon_t| > \eta \log T / C ) \to 0$ for each $\eta > 0$. Hence on the set $A_T$ we have $V_T(\eta) = o_P(1)$ for every $\eta > 0$. Because, by assumption, $X_t$, $t = 1, \dots, T$, have uniformly bounded fourth moments, we also have $P(A_T) \to 1$ as $T \to \infty$.

Next, we turn our attention to Nishiyama's condition [PE$'$]. For this, we need an estimate of $E_{t-1}(H_{k,l}) := E_{t-1}\bigl[ \sup_{\chi,\psi\in H_{k,l}} (\xi_t(\chi) - \xi_t(\psi))^2 \bigr]$. Note that for $\chi, \psi \in H_{k,l}$, we have $\chi = (h - \phi_j) 1_{[0,\alpha]}$ and $\psi = (g - \phi_j) 1_{[0,\beta]}$, with $\|g - h\|_\infty \le \delta$ and $|\alpha - \beta| \le \delta^2$. Hence, writing $\chi - \psi = (h - g) 1_{[0,\alpha]} + (g - \phi_j)(1_{[0,\alpha]} - 1_{[0,\beta]})$ and using the fact that, by assumption, $\sigma^2(\cdot) < M < \infty$, we get
\[
E_{t-1}(H_{k,l}) \le \frac{2M}{T} \Bigl[ \delta^2 \operatorname{var}(\varepsilon_t) (X^c_{t-1})^2 + \frac{1}{\sqrt{T}} \operatorname{var}(\varepsilon_t) (X^c_{t-1})^2\, 1_{[\alpha,\beta]}\Bigl(\frac{t}{T}\Bigr) \sqrt{T} \Bigr],
\]
and hence
\[
\sup_{\delta\in(0,1)} \max_{1\le k\le N_B(\delta),\, 1\le l\le M(\delta^2)} \frac{\sqrt{\sum_{t=1}^{T} E_{t-1}(H_{k,l})}}{\delta} \le M \sqrt{2\operatorname{var}(\varepsilon_t)}\, \sqrt{\frac{1}{T} \sum_{t=1}^{T} (X^c_{t-1})^2} + M\, \frac{1}{\delta} \sqrt{ \frac{1}{\sqrt{T}}\, \operatorname{var}(\varepsilon_t)\, \frac{1}{T} \sum_{t=1}^{T} (X^c_t)^2\, 1_{[\alpha,\beta]}\Bigl(\frac{t}{T}\Bigr) }.
\]
Condition [PE$'$] requires that the expression on the left side of the last inequality be stochastically bounded. We have $|\alpha - \beta| \le \delta^2$, $\frac{1}{T}\sum_{t=1}^{T} (X^c_{t-1})^2 = O_P(1)$, and $\max_{1\le t\le T} (X^c_t)^2 = O_P(\sqrt{T})$, where for the latter we use Assumption 1(b). Hence we see that the right side is stochastically bounded, and [PE$'$] follows.

This completes the proof for $\hat F(\alpha)$, the partial sum process based on the ordinary residuals. The proof for $\hat F^*(\alpha)$ is similar, using the representation (26) and observing that $|w_t(\alpha)| \le 1$.

We now show that $\hat F(\alpha)$ and $\hat F^*(\alpha)$ are close uniformly in $\alpha$. This implies that the empirical excess mass functionals based on the smoothed and the ordinary squared residuals have the same limiting behavior.

Lemma 3. Under Assumptions 1 and 2, we have $\sqrt{T} \sup_{\alpha\in[0,1]} |\hat F(\alpha) - \hat F^*(\alpha)| = o_P(1)$.

Proof. Using Lemma 2, we can write
\[
\hat F(\alpha) = \frac{1}{T} \sum_{t=1}^{[\alpha T]} \sigma^2(t/T)\, \varepsilon_t^2 + o_P(1/\sqrt{T}), \tag{34}
\]
and similarly, for the smoothed version,
\[
\hat F^*(\alpha) = \frac{1}{T} \sum_{t=1}^{[\alpha T]+[bT]} \sigma^2(t/T)\, \varepsilon_t^2\, w_t(\alpha) + o_P(1/\sqrt{T}), \tag{35}
\]
where the $o_P(1/\sqrt{T})$ terms are uniform in $\alpha \in [0,1]$. Because $w_t(\alpha) \ne 1$ only for values of $t$ close to 1 and to $[\alpha T]$, we can write
\[
\sqrt{T} \bigl( \hat F(\alpha) - \hat F^*(\alpha) \bigr) = \begin{cases} Q_T(\alpha) + B_T(\alpha) + C_T(0) + o_P(1) & \text{for } \alpha > 2b, \\ C_T(\alpha) + o_P(1) & \text{for } \alpha \le 2b, \end{cases} \tag{36}
\]
where
\[
Q_T(\alpha) = \frac{1}{\sqrt{T}} \sum_{t=[\alpha T]-[bT]+1}^{[\alpha T]+[bT]} \sigma^2(t/T) \bigl( 1\{1 \le t \le [\alpha T]\} - w_t(\alpha) \bigr) (\varepsilon_t^2 - 1),
\]
\[
B_T(\alpha) = \frac{1}{\sqrt{T}} \sum_{t=[\alpha T]-[bT]+1}^{[\alpha T]+[bT]} \sigma^2(t/T) \bigl( 1\{1 \le t \le [\alpha T]\} - w_t(\alpha) \bigr),
\]
and
\[
C_T(\alpha) = \frac{1}{\sqrt{T}} \sum_{t=1}^{[\alpha T]+[bT]} \sigma^2(t/T) \bigl( 1\{1 \le t \le [\alpha T]\} - w_t(\alpha) \bigr) \varepsilon_t^2.
\]

The $o_P(1)$ terms in (36) (which are uniform in $\alpha \in [0,1]$) are the sum of the two $o_P$ terms from (34) and (35). First, we consider $C_T(\alpha)$. With $W_t = \sigma^2(\frac{t}{T})( 1\{1 \le t \le [\alpha T]\} - w_t(\alpha) ) \varepsilon_t^2$, we have
\[
\sup_{[\alpha T]\le 2[bT]} |C_T(\alpha)| \le \frac{1}{\sqrt{T}} \sum_{t=1}^{3[bT]} |W_t| = \sqrt{T}\, b \cdot \frac{1}{bT} \sum_{t=1}^{3[bT]} |W_t| = o(1)\, O_P(1).
\]
The last equality follows because, by assumption, $\sqrt{T}\, b = o(1)$, and $\frac{1}{bT}\sum_{t=1}^{3[bT]} |W_t| = O_P(1)$ because the random variables $|W_t|$ have (uniformly) finite expected values.

Next, we show $\sup_{\alpha\in[0,1]} |Q_T(\alpha)| = o_P(1)$, where, to simplify notation, we extend the definition of $Q_T(\alpha)$ to all $\alpha \in [0,1]$ by
\[
Q_T(\alpha) = \frac{1}{\sqrt{T}} \sum_{t=\max([\alpha T]-[bT]+1,\,1)}^{[\alpha T]+[bT]} v_t(\alpha)\, Z_t,
\]
where $v_t(\alpha) = \sigma^2(t/T)( 1\{1 \le t \le [\alpha T]\} - w_t(\alpha) )$ and $Z_t = \varepsilon_t^2 - 1$. Note that $Q_T(\alpha)$ is a sum of at most $2[bT]$ independent random variables that are not necessarily bounded; therefore we use a truncation argument. Because, by assumption, $E(Z_t^2 \log |Z_t|) < \infty$, we can truncate the variables $|Z_t|$ at $\sqrt{T/(K \log T)}$ for some appropriate $K > 0$, to be determined later. For any $K > 0$ the difference is negligible, as the following argument shows:
\[
\begin{aligned}
D_T(\alpha) &:= \Biggl| \frac{1}{\sqrt{T}} \sum_{t=\max([\alpha T]-[bT]+1,1)}^{[\alpha T]+[bT]} v_t(\alpha)\, Z_t - \frac{1}{\sqrt{T}} \sum_{t=\max([\alpha T]-[bT]+1,1)}^{[\alpha T]+[bT]} v_t(\alpha)\, Z_t\, 1\Bigl( |Z_t| \le \sqrt{\tfrac{T}{K\log T}} \Bigr) \Biggr| \\
&\le \frac{1}{\sqrt{T}} \sum_{t=\max([\alpha T]-[bT]+1,1)}^{[\alpha T]+[bT]} M |Z_t|\, 1\Bigl( |Z_t| > \sqrt{\tfrac{T}{K\log T}} \Bigr) \le \frac{1}{\sqrt{T}} \sum_{t=1}^{T+[bT]} M |Z_t|\, 1\Bigl( |Z_t| > \sqrt{\tfrac{T}{K\log T}} \Bigr),
\end{aligned}
\]

and hence, for any $\eta > 0$,
\[
P\Bigl( \sup_{\alpha\in[0,1]} D_T(\alpha) > \eta \Bigr) \le P\Bigl( \exists\, t \in \{1, \dots, T + [bT]\} : |Z_t| > \sqrt{\tfrac{T}{K\log T}} \Bigr) \le (T + [bT]) \cdot P\Bigl( |Z_1| > \sqrt{\tfrac{T}{K\log T}} \Bigr) \to 0 \quad \text{as } T \to \infty,
\]


where the convergence to 0 follows from the moment assumption on $\varepsilon_t$. Hence we need only show that the sum of the truncated random variables tends to 0 uniformly over $\alpha \in [0,1]$. This is done by exploiting Bernstein's inequality. Toward this end, we center the truncated variables. Let $\mu_T = E Z_t\, 1( |Z_t| \le \sqrt{T/(K\log T)} )$. First, we show that the centering is negligible. Because $E Z_t = 0$, we have $\mu_T \to 0$ as $T \to \infty$. Because, by assumption, $b\sqrt{T} \to 0$, it follows that
\[
\sup_{\alpha\in[0,1]} \Biggl| \frac{1}{\sqrt{T}} \sum_{t=[\alpha T]-[bT]+1}^{[\alpha T]+[bT]} v_t(\alpha)\, \mu_T \Biggr| \le M\, \frac{2[bT]+1}{\sqrt{T}}\, \mu_T = o(1).
\]
Further, with $\mu_4 - 1 = \operatorname{var}(Z_t)$, we have $\operatorname{var}\bigl( v_t(\alpha)\, Z_t\, 1( |Z_t| \le \sqrt{T/(K\log T)} ) \bigr) \le M^2 (\mu_4 - 1)$. Hence, for every fixed $\eta > 0$,
\[
\begin{aligned}
P\Biggl( \sup_{\alpha\in[0,1]} \Biggl| \frac{1}{\sqrt{T}} &\sum_{t=[\alpha T]-[bT]+1}^{[\alpha T]+[bT]} v_t(\alpha) \Bigl( Z_t\, 1\Bigl( |Z_t| \le \sqrt{\tfrac{T}{K\log T}} \Bigr) - \mu_T \Bigr) \Biggr| > \eta \Biggr) \\
&\le 2T \cdot \max_{\alpha} P\Biggl( \Biggl| \frac{1}{\sqrt{bT}} \sum_{t=[\alpha T]-[bT]+1}^{[\alpha T]+[bT]} v_t(\alpha) \Bigl( Z_t\, 1\Bigl( |Z_t| \le \sqrt{\tfrac{T}{K\log T}} \Bigr) - \mu_T \Bigr) \Biggr| > \frac{\eta}{\sqrt{b}} \Biggr) \\
&\le 2T \cdot \exp\Biggl\{ -\frac{1}{2}\, \frac{\eta^2/b}{2M^2(\mu_4-1) + \frac{1}{3}\, \frac{\eta}{\sqrt{b}} \cdot 2M \sqrt{\tfrac{T}{K\log T}} \big/ (2[bT])} \Biggr\} \\
&\le 2T \cdot \exp\{ -c_1\, \eta\, K \log T \},
\end{aligned}
\]
for an appropriate constant $c_1 > 0$ and $T$ large enough. For $K > 0$ large enough, the last expression in the foregoing series of inequalities tends to 0 as $T \to \infty$.

It remains to show that $\sup_{\alpha\in[2b,1]} |B_T(\alpha)| = o(1)$. Writing
\[
B_T(\alpha) = \sqrt{T}\, b \Biggl\{ \frac{1}{bT} \sum_{t=[\alpha T]-[bT]+1}^{[\alpha T]+[bT]} \sigma^2\Bigl(\frac{t}{T}\Bigr) \bigl( 1\{1 \le t \le [\alpha T]\} - w_t(\alpha) \bigr) \Biggr\},
\]
we observe that the summands are bounded. Because there are at most $2[bT] + 1$ such summands, the term in braces is bounded uniformly in $\alpha$. Because $\sqrt{T}\, b \to 0$ by assumption, it follows that $\sup_{\alpha\in[2b,1]} |B_T(\alpha)| = o(1)$.

Proof of Lemma 1

We mention that some elements of the proof are similar to those of Polonik (1995). We consider only the case $\hat E_{\hat m}$; the case of the regular empirical excess mass $\hat E$ follows similarly but is simpler, and hence we omit its proof.

Our target here is the empirical excess mass of the (standardized) variance estimator $\bar\sigma^2_{s,\hat m}(\cdot)$, based on the smoothed residuals and the estimated mode. We present the proof for the excess mass process based on the regular residuals. However, a closer inspection of the proof reveals that everything depends only on properties of the process $\{ \sqrt{T}(\hat F - F)(\alpha),\ \alpha \in [0,1] \}$, where $F(\alpha) = \int_0^\alpha \sigma^2(u)\,du$. In the case of the smoothed residuals, this process is replaced by $\{ \sqrt{T}(\hat F^* - F)(\alpha),\ \alpha \in [0,1] \}$, and Lemma 3 shows that these two processes have the same limiting behavior.

We first prove the assertion for the excess mass of $\hat\sigma^2_{\hat m}(\cdot)$ instead of the standardized function $\bar\sigma^2_{\hat m}(\cdot)$; the behavior of the latter follows from the former by an easy argument.

Recall the characterization of $\hat\sigma^2_{\hat m}(\cdot)$ as consisting of two isotonic regressions. To the left of the estimated mode $\hat m$, it is the (right-continuous) slope of the greatest convex minorant of the cumulative sum diagram given by the points $(k/T, \sum_{t=1}^{k} e_t^2)$, $k = 1, \dots, T$ (cf. Robertson et al. 1988), and to the right of $\hat m$, it is the (left-continuous) slope of the least concave majorant of the same cumulative sum diagram. It follows that $\hat\sigma^2_{\hat m}(\cdot)$ is a piecewise constant function whose level sets $\{ \alpha \in [0,1] : \hat\sigma^2_{\hat m}(\alpha) \ge \lambda \} = [\hat a_\lambda, \hat b_\lambda]$ are intervals, with $\int_0^{\hat b_\lambda} \hat\sigma^2_{\hat m}(\alpha)\, d\alpha = \frac{1}{T} \sum_{t=1}^{[\hat b_\lambda T]} e_t^2$. Hence, letting
\[
\hat F(a,b) = \frac{1}{T} \sum_{t=[aT]+1}^{[bT]} e_t^2,
\]
we have for the excess mass $E_{\hat\sigma^2,\hat m}(\lambda)$ of $\hat\sigma^2_{\hat m}(\cdot)$ that
\[
E_{\hat\sigma^2,\hat m}(\lambda) = \int_0^1 \bigl( \hat\sigma^2_{\hat m}(u) - \lambda \bigr)_+\, du = \int_{\hat a_\lambda}^{\hat b_\lambda} \bigl( \hat\sigma^2_{\hat m}(u) - \lambda \bigr)\, du =: \hat H_\lambda(\hat a_\lambda, \hat b_\lambda),
\]
where $\hat H_\lambda(a,b) = \hat F(a,b) - \lambda(b - a)$. We also have, by definition of $\hat H_\lambda$, that $\hat H_\lambda(\hat a_\lambda, \hat b_\lambda) = \sup_{0\le a<b\le 1} \hat H_\lambda(a,b)$. We can expect this empirical excess mass to be close to
\[
E_{\sigma^2,\hat m}(\lambda) = \sup_{0\le a\le \hat m\le b\le 1} \bigl( F(a,b) - \lambda(b - a) \bigr) = \sup_{0\le a\le \hat m\le b\le 1} H_\lambda(a,b) = H_\lambda(a_\lambda, b_\lambda), \tag{37}
\]
where $a_\lambda$ and $b_\lambda$ are defined through the last equality, $F(a,b) = \int_a^b \sigma^2(u)\,du$, and $H_\lambda(a,b) = F(a,b) - \lambda(b - a)$. It is straightforward to see that $E_{\sigma^2,\hat m}(\lambda) = F(a_\lambda, b_\lambda) - \lambda(b_\lambda - a_\lambda)$, where for $\lambda \le \sigma^2(\hat m)$ we have $[a_\lambda, b_\lambda] = \mathrm{cl}\{ u \in [0,1] : \sigma^2(u) \ge \lambda \}$. [Here $\mathrm{cl}(A)$ denotes the closure of a set $A$.] For $\lambda > \sigma^2(\hat m)$, the mode $\hat m$ becomes one of the endpoints of the interval $[a_\lambda, b_\lambda]$ from (37). If $\hat m \le m$, then $a_\lambda = \hat m$, and without loss of generality we assume this to be the case. It is also not difficult to see that $b_\lambda$ stays unchanged as long as $E_{\sigma^2,\hat m}(\lambda) > 0$, and that there is a level $\lambda_{\max} < \sigma^2(m)$ with $E_{\sigma^2,\hat m}(\lambda) = 0$ for all $\lambda > \lambda_{\max}$. Note also that the supremum in the definition of $E_{\sigma^2,\hat m}$ is taken over a smaller set than in the definition of $E_{\sigma^2}$. Hence we have $E_{\sigma^2,\hat m}(\lambda) \le E_{\sigma^2}(\lambda)$, with equality for $\lambda \le \sigma^2(\hat m)$. Because excess mass functionals are nonnegative and monotonically decreasing, we obtain for the "approximation error" that $\sup_{\lambda\ge 0} | E_{\sigma^2,\hat m}(\lambda) - E_{\sigma^2}(\lambda) | \le E_{\sigma^2}(\sigma^2(\hat m))$. The latter is in fact the quantity from Assumption 3, and hence is of order $o_P(1/\sqrt{T})$.
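The two-sided isotonic characterization above can be sketched with the pool-adjacent-violators algorithm (one of the paper's keywords): an increasing isotonic fit to the left of the mode and a decreasing fit to the right. The helper names below are ours, and for simplicity the fitted value at the mode is taken from the left-hand fit.

```python
import numpy as np

def pava_increasing(y):
    """Pool-adjacent-violators: least squares fit of a nondecreasing
    sequence to y (equal weights)."""
    out = []                                  # blocks of [mean, size]
    for v in y:
        out.append([float(v), 1])
        # pool while the last two blocks violate monotonicity
        while len(out) > 1 and out[-2][0] > out[-1][0]:
            m2, s2 = out.pop()
            m1, s1 = out.pop()
            out.append([(m1 * s1 + m2 * s2) / (s1 + s2), s1 + s2])
    return np.concatenate([np.full(s, m) for m, s in out])

def unimodal_fit(y, mode_index):
    """Increasing isotonic fit left of the mode, decreasing fit to the
    right, mirroring the two-sided characterization of the estimator."""
    left = pava_increasing(y[:mode_index + 1])
    right = pava_increasing(y[mode_index:][::-1])[::-1]
    return np.concatenate([left, right[1:]])
```

Applied to (smoothed) squared residuals with `mode_index` taken from the mode estimate $\hat m$, this produces a piecewise constant unimodal fit whose level sets are intervals, as used in the argument above.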


We obtain
\[
\begin{aligned}
\sqrt{T} \bigl( E_{\hat\sigma^2,\hat m}(\lambda) - E_{\sigma^2}(\lambda) \bigr) &= \sqrt{T} \bigl( E_{\hat\sigma^2,\hat m}(\lambda) - E_{\sigma^2,\hat m}(\lambda) \bigr) + \sqrt{T} \bigl( E_{\sigma^2,\hat m}(\lambda) - E_{\sigma^2}(\lambda) \bigr) \\
&= \sqrt{T} \bigl( E_{\hat\sigma^2,\hat m}(\lambda) - E_{\sigma^2,\hat m}(\lambda) \bigr) + o_P(1).
\end{aligned}
\]
Hence it remains to consider the process $\sqrt{T}( E_{\hat\sigma^2,\hat m}(\lambda) - E_{\sigma^2,\hat m}(\lambda) )$. We write
\[
\begin{aligned}
\sqrt{T} \bigl( E_{\hat\sigma^2,\hat m}(\lambda) - E_{\sigma^2,\hat m}(\lambda) \bigr) &= \sqrt{T} \bigl( \hat H_\lambda(\hat a_\lambda, \hat b_\lambda) - H_\lambda(a_\lambda, b_\lambda) \bigr) \\
&= \sqrt{T} (\hat H_\lambda - H_\lambda)(\hat a_\lambda, \hat b_\lambda) + R_1(\lambda) \\
&= \sqrt{T} (\hat H_\lambda - H_\lambda)(a_\lambda, b_\lambda) + R_1(\lambda) + R_2(\lambda), \tag{38}
\end{aligned}
\]
where $R_1(\lambda) = \sqrt{T} [ H_\lambda(\hat a_\lambda, \hat b_\lambda) - H_\lambda(a_\lambda, b_\lambda) ]$ and $R_2(\lambda) = \sqrt{T} [ (\hat H_\lambda - H_\lambda)(\hat a_\lambda, \hat b_\lambda) - (\hat H_\lambda - H_\lambda)(a_\lambda, b_\lambda) ]$.

Observe that, using the definitions of $a_\lambda, b_\lambda$ and $\hat a_\lambda, \hat b_\lambda$ as maximizers of $H_\lambda$ and $\hat H_\lambda$, we have
\[
\begin{aligned}
0 &\le H_\lambda(a_\lambda, b_\lambda) - H_\lambda(\hat a_\lambda, \hat b_\lambda) \\
&= (H_\lambda - \hat H_\lambda)(a_\lambda, b_\lambda) - (H_\lambda - \hat H_\lambda)(\hat a_\lambda, \hat b_\lambda) + \bigl( \hat H_\lambda(a_\lambda, b_\lambda) - \hat H_\lambda(\hat a_\lambda, \hat b_\lambda) \bigr) \\
&\le (H_\lambda - \hat H_\lambda)(a_\lambda, b_\lambda) - (H_\lambda - \hat H_\lambda)(\hat a_\lambda, \hat b_\lambda). \tag{39}
\end{aligned}
\]
Hence we obtain
\[
|R_1(\lambda) + R_2(\lambda)| \le 2 \bigl| \sqrt{T} \bigl[ (\hat H_\lambda - H_\lambda)(\hat a_\lambda, \hat b_\lambda) - (\hat H_\lambda - H_\lambda)(a_\lambda, b_\lambda) \bigr] \bigr|. \tag{40}
\]
The key observation now is that (40), and also the main term in (38), can be controlled if we have a handle on the process $\sqrt{T}(\hat H_\lambda - H_\lambda)(a,b)$. We show that this process converges weakly to a tight Gaussian process. This, together with (39), implies that $\hat a_\lambda$ and $\hat b_\lambda$ are consistent estimators of $a_\lambda$ and $b_\lambda$ (see Lemma 4). A combination of these two results immediately implies [because of (40)] that $|R_1(\lambda) + R_2(\lambda)| = o_P(1)$, and this in turn entails asymptotic normality of our target quantity $\sqrt{T}( E_{\hat\sigma^2,\hat m}(\lambda) - E_{\sigma^2,\hat m}(\lambda) )$ [cf. (38)].

Lemma 4. If $\sup_{0\le a\le b\le 1} |(\hat H_\lambda - H_\lambda)(a,b)| = o_P(1)$, then $\sup_{\lambda>0} |\hat a_\lambda - a_\lambda| = o_P(1)$ and $\sup_{\lambda>0} |\hat b_\lambda - b_\lambda| = o_P(1)$.

The proof of this lemma follows the proof of theorem 3.5 of Polonik (1995); details are omitted here. Note also that the results that follow imply the assumed uniform consistency of $\hat H_\lambda$.

It remains to prove weak convergence of the process $\sqrt{T}(\hat H_\lambda - H_\lambda)(a,b)$, $a, b \in [0,1]$, to a tight limit. Observe that because $\sigma^2(\cdot)$ is assumed to be totally bounded, we can write $F(a,b) = \frac{1}{T} \sum_{t=[aT]+1}^{[bT]} \sigma^2(t/T) + O(1/T)$, and hence, using Lemma 2, we have (uniformly in $a$ and $b$)
\[
\sqrt{T} (\hat H_\lambda - H_\lambda)(a,b) = \sqrt{T} \Biggl\{ \frac{1}{T} \sum_{t=[aT]+1}^{[bT]} \sigma^2(t/T)\, (\varepsilon_t^2 - 1) \Biggr\} + o_P(1) = Z_T(b) - Z_T(a) + o_P(1),
\]
where $Z_T(\alpha) = \frac{1}{\sqrt{T}} \sum_{t=1}^{[\alpha T]} \sigma^2(t/T)\, (\varepsilon_t^2 - 1)$. We now prove that the partial sum process $\{Z_T(\alpha), \alpha \in [0,1]\}$, as a process in $D[0,1]$, converges in distribution to a Gaussian process. This then completes the proof, because it entails asymptotic stochastic equicontinuity as well as asymptotic normality of $Z_T(b) - Z_T(a)$; these are the two properties that we need. The asserted covariance function of the excess mass process then follows straightforwardly from Lemma 5.

then completes the proof, because it entails asymptotic stochas-tic equicontinuity, as well as asymptotically normality of theprocess ZT(b) ! ZT(a). These are the two properties that weneed to prove. The asserted covariance function of the excessmass process then follows straightforwardly from Lemma 5.

Lemma 5. Under the assumptions of Lemma 1, we have, as $T\to\infty$,
\[
Z_T(\theta)\to G(\theta)\quad\text{in distribution in }D[0,1],
\]
where $G(\theta)$ is a mean-0 Gaussian process with $\mathrm{cov}(G(\theta),G(\eta))=(\mu_4-1)\int_0^{\min(\theta,\eta)}\sigma^4(u)\,du$.

Proof. We first prove convergence of the finite-dimensional distributions. Let $Y_{t,T}=\sigma^2(t/T)\,(\varepsilon_t^2-1)$, and define
\[
B_T^2(\theta)=\mathrm{var}\bigl(\sqrt{T}\,Z_T(\theta)\bigr)=(\mu_4-1)\sum_{t=1}^{[\theta T]}\sigma^4(t/T),\qquad(41)
\]
where $\mu_4$ denotes the fourth moment of $\varepsilon_t$. We use the Lindeberg–Feller central limit theorem. Hence we need to show that
\[
\frac{1}{B_T^2(\theta)}\sum_{t=1}^{[\theta T]}E\Bigl[Y_{t,T}^2\,I\bigl(|Y_{t,T}|\ge\varepsilon B_T(\theta)\bigr)\Bigr]\to 0\qquad(42)
\]
as $T\to\infty$. We have
\[
\begin{aligned}
\frac{1}{B_T^2(\theta)}\sum_{t=1}^{[\theta T]}E\Bigl[Y_{t,T}^2\,I\bigl(|Y_{t,T}|\ge\varepsilon B_T(\theta)\bigr)\Bigr]
&=\frac{1}{B_T^2(\theta)}\sum_{t=1}^{[\theta T]}E\biggl[\sigma^4\Bigl(\frac{t}{T}\Bigr)(\varepsilon_t^2-1)^2\,
   I\Bigl(|\varepsilon_t^2-1|\ge\frac{\varepsilon B_T(\theta)}{\sigma^2(t/T)}\Bigr)\biggr]\\
&\le\frac{1}{B_T^2(\theta)}\sum_{t=1}^{[\theta T]}\sigma^4\Bigl(\frac{t}{T}\Bigr)
   E\biggl[(\varepsilon_t^2-1)^2\,I\Bigl(|\varepsilon_t^2-1|\ge\frac{\varepsilon B_T(\theta)}{M}\Bigr)\biggr]
   \qquad[\text{since }\sigma^2(u)\le M]\\
&=\frac{1}{\mu_4-1}\,E\biggl[(\varepsilon_t^2-1)^2\,I\Bigl(|\varepsilon_t^2-1|\ge\frac{\varepsilon B_T(\theta)}{M}\Bigr)\biggr]
   \qquad(\text{because the }\varepsilon_t\text{ are iid}),
\end{aligned}
\]
which converges to 0 because $B_T(\theta)\to\infty$. Hence we have that, for every $\theta\in(0,1)$, $\sqrt{T}\,Z_T(\theta)/B_T(\theta)\to_d N(0,1)$, so that by the Cramér–Wold device and the univariate central limit theorem, for every $(\theta_1,\ldots,\theta_m)\in[0,1]^m$, $(Z_T(\theta_1),\ldots,Z_T(\theta_m))\to_d N(0,\Sigma)$, where $\Sigma$ is the $m\times m$ variance–covariance matrix $(\Sigma(i,j))$ with
\[
\Sigma(i,j)=(\mu_4-1)\int_0^{\min(\theta_i,\theta_j)}\sigma^4(u)\,du.\qquad(43)
\]

It remains to show asymptotic equicontinuity of $Z_T(\theta)$; that is, we have to show that for every $\varepsilon>0$,
\[
\lim_{\delta\downarrow 0}\limsup_{T\to\infty}
P\Bigl(\sup_{|\theta-\eta|<\delta}|Z_T(\theta)-Z_T(\eta)|>\varepsilon\Bigr)=0.\qquad(44)
\]
To prove this, we show that
\[
E\bigl[|Z_T(s)-Z_T(r)|^2\,|Z_T(t)-Z_T(s)|^2\bigr]\le[F(t)-F(r)]^2\qquad(45)
\]


252 Journal of the American Statistical Association, March 2006

for $r\le s\le t$, where $F$ is a nondecreasing continuous function on $[0,1]$. Asymptotic equicontinuity then follows (see Billingsley 1999). We have
\[
\begin{aligned}
E\bigl[|Z_T(s)-Z_T(r)|^2\,|Z_T(t)-Z_T(s)|^2\bigr]
&=E\Biggl[\Biggl(\sqrt{T}\sum_{j=[rT]+1}^{[sT]}\frac{\sigma^2(j/T)}{T}(\varepsilon_j^2-1)\Biggr)^2\Biggr]
 \times E\Biggl[\Biggl(\sqrt{T}\sum_{j=[sT]+1}^{[tT]}\frac{\sigma^2(j/T)}{T}(\varepsilon_j^2-1)\Biggr)^2\Biggr]\\
&=\Biggl(\frac{1}{T}\sum_{j=[rT]+1}^{[sT]}\sigma^4\Bigl(\frac{j}{T}\Bigr)\mathrm{var}(\varepsilon_j^2)\Biggr)
 \times\Biggl(\frac{1}{T}\sum_{j=[sT]+1}^{[tT]}\sigma^4\Bigl(\frac{j}{T}\Bigr)\mathrm{var}(\varepsilon_j^2)\Biggr)\\
&=(\mu_4-1)^2\Biggl(\frac{1}{T}\sum_{j=[rT]+1}^{[sT]}\sigma^4\Bigl(\frac{j}{T}\Bigr)\Biggr)
 \Biggl(\frac{1}{T}\sum_{j=[sT]+1}^{[tT]}\sigma^4\Bigl(\frac{j}{T}\Bigr)\Biggr)\\
&\le\Biggl((\mu_4-1)\int_r^t\sigma^4(u)\,du\Biggr)^2+O(1/T)\\
&=C\bigl[F(t)-F(r)\bigr]^2,
\end{aligned}
\]
where $F(\theta)=(\mu_4-1)\int_0^\theta\sigma^4(u)\,du$, and the first equality holds because the two sums involve disjoint, hence independent, blocks of the $\varepsilon_j$. It is obvious that the last equality holds (for an appropriate $C$) for all $|t-r|>1/T$. It also holds for $|t-r|\le 1/T$, because in this case the left side equals 0.

Proof of Lemma 1, Continued. We have studied the excess mass of the estimated variance function without standardization. What we are really interested in is the behavior of $\hat E(\lambda)=E_{\bar{\hat\sigma}^2}(\lambda)$, the excess mass functional of the standardized variance function. Using the foregoing notation, we can write the normalizing factor $\int_0^1\hat\sigma^2(u)\,du=\hat\Gamma(0,1)$ and, similarly, $\int_0^1\sigma^2(u)\,du=\Gamma(0,1)$. It follows that
\[
\hat E(\lambda)=\sup_{0\le a\le b\le 1}\biggl(\frac{\hat\Gamma(a,b)}{\hat\Gamma(0,1)}-\lambda(b-a)\biggr).
\]
Compared with $E_{\hat\sigma^2}$, which was treated earlier, the quantity $\hat\Gamma(a,b)$ is now replaced by $\hat\Gamma(a,b)/\hat\Gamma(0,1)$. We have seen that $\sqrt{T}(E_{\hat\sigma^2}-E_{\sigma^2})(\lambda)\approx\sqrt{T}(\hat\Gamma-\Gamma)(a_\lambda,b_\lambda)$, where $a_\lambda$ and $b_\lambda$ are the maximizers of $\Gamma(a,b)-\lambda(b-a)$. Very similar arguments show that, uniformly in $\lambda>0$,
\[
\biggl|\sqrt{T}(\hat E-E)(\lambda)-\sqrt{T}\biggl(\frac{\hat\Gamma(a^*_\lambda,b^*_\lambda)}{\hat\Gamma(0,1)}-\frac{\Gamma(a^*_\lambda,b^*_\lambda)}{\Gamma(0,1)}\biggr)\biggr|=o_P(1),\qquad(46)
\]
where $E$ denotes the excess mass functional of the standardized true variance function $\sigma^2(\cdot)/\Gamma(0,1)$, and $a^*_\lambda$ and $b^*_\lambda$ are the maximizers of the standardized excess mass $\Gamma(a,b)/\Gamma(0,1)-\lambda(b-a)$, which means that $(a^*_\lambda,b^*_\lambda)=(a_{\lambda\Gamma(0,1)},b_{\lambda\Gamma(0,1)})$. All of this holds provided that the process
\[
\sqrt{T}\biggl(\frac{\hat\Gamma(a,b)}{\hat\Gamma(0,1)}-\frac{\Gamma(a,b)}{\Gamma(0,1)}\biggr)\qquad(47)
\]
converges to a tight Gaussian limit process. But this is an immediate consequence of the weak convergence of the process $\sqrt{T}(\hat\Gamma(a,b)-\Gamma(a,b))$, as can be seen from rewriting (47) as
\[
\begin{aligned}
\sqrt{T}\biggl(\frac{1}{\hat\Gamma(0,1)}\bigl(\hat\Gamma(a,b)-\Gamma(a,b)\bigr)
-\frac{\Gamma(a,b)}{\hat\Gamma(0,1)\,\Gamma(0,1)}\bigl(\hat\Gamma(0,1)-\Gamma(0,1)\bigr)\biggr)\hspace{-6em}&\\
&=\frac{1}{\hat\Gamma(0,1)}\bigl(Z_T(b)-Z_T(a)\bigr)
-\frac{\Gamma(a,b)}{\hat\Gamma(0,1)\,\Gamma(0,1)}\bigl(Z_T(1)-Z_T(0)\bigr)+o_P(1),
\end{aligned}
\]
where the last equality follows using Lemma 2. [Note that the $o_P(1)$ term is uniform in $\lambda$.] The covariance function of the limit follows through straightforward computation, using the fact that for any two intervals $(a_1,b_1),(a_2,b_2)\subset[0,1]$, we have
\[
\mathrm{cov}\bigl(Z_T(b_1)-Z_T(a_1),\,Z_T(b_2)-Z_T(a_2)\bigr)=(\mu_4-1)\int_{(a_1,b_1)\cap(a_2,b_2)}\sigma^4(u)\,du,
\]
which follows from Lemma 5.
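The rewriting of (47) used above is an exact algebraic identity, not an approximation; a quick numerical check makes this transparent (the four numbers are arbitrary stand-ins for $\hat\Gamma(a,b)$, $\hat\Gamma(0,1)$, $\Gamma(a,b)$, $\Gamma(0,1)$, not quantities from the paper):

```python
# Check: Gh(a,b)/Gh(0,1) - G(a,b)/G(0,1)
#        = (Gh - G)(a,b)/Gh(0,1)
#          - [G(a,b)/(Gh(0,1)*G(0,1))] * (Gh(0,1) - G(0,1)).
Gh_ab, Gh_01 = 0.37, 1.05   # stand-ins for hat-Gamma(a,b), hat-Gamma(0,1)
G_ab, G_01 = 0.40, 1.00     # stand-ins for Gamma(a,b), Gamma(0,1)

lhs = Gh_ab / Gh_01 - G_ab / G_01
rhs = (Gh_ab - G_ab) / Gh_01 - G_ab / (Gh_01 * G_01) * (Gh_01 - G_01)
print(lhs, rhs)             # identical up to floating-point rounding
```

This is the usual delta-method decomposition of a ratio: the normalization contributes the extra $Z_T(1)-Z_T(0)$ term in the limit.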

Proof of Theorem 1

Lemma 1, in conjunction with the continuous mapping theorem, yields the assertion.

Proof of Theorem 2

It suffices to present only the proof for $\hat\lambda(q)$; the proof for $\hat\lambda_{\hat m}(q)$ follows similarly. We first prove consistency of $\hat\lambda(q)$. In fact, we prove a stronger result; namely, we show that for each $\varepsilon>0$, we have
\[
\sqrt{T}\sup_{q\in[\varepsilon,1]}|\hat\lambda(q)-\lambda(q)|=O_P(1).\qquad(48)
\]
Let $q_T^-=\max\bigl(0,\,q-\sup_{\lambda>0}|(\hat E-E)(\lambda)|\bigr)$. Then we can write
\[
\begin{aligned}
\hat\lambda(q)&=\inf\{\lambda\ge 0:\hat E(\lambda)\le q\}
=\inf\bigl\{\lambda\ge 0:E(\lambda)\le q+[E(\lambda)-\hat E(\lambda)]\bigr\}\\
&\le\inf\Bigl\{\lambda\ge 0:E(\lambda)\le\max\Bigl(0,\,q-\sup_{\lambda>0}|(\hat E-E)(\lambda)|\Bigr)\Bigr\}
=\lambda(q_T^-).
\end{aligned}
\]
Similarly, we have, with $q_T^+=\min\bigl(1,\,q+\sup_{\lambda>0}|(\hat E-E)(\lambda)|\bigr)$, that $\hat\lambda(q)\ge\lambda(q_T^+)$, such that
\[
\lambda(q_T^+)\le\hat\lambda(q)\le\lambda(q_T^-).\qquad(49)
\]
Our assumptions ensure that $\lambda(\cdot)$ is differentiable, and it is straightforward to see that $\lambda'(q)=-1/(b_{\lambda(q)}-a_{\lambda(q)})$. Because in addition $\sigma^2(\cdot)$ is unimodal and has no flat parts (Assumption 1), it follows that $\lambda'(q)$ is continuous. Lemma 1 together with (46) shows that $\sqrt{T}\sup_{\lambda>0}|\hat E(\lambda)-E(\lambda)|=O_P(1)$. Hence a one-term Taylor expansion applied to $\lambda(q_T^+)-\lambda(q)$ and to $\lambda(q_T^-)-\lambda(q)$ implies (48).
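The derivative formula $\lambda'(q)=-1/(b_{\lambda(q)}-a_{\lambda(q)})$ used in the Taylor expansion can be confirmed numerically. The sketch below is illustrative only: the standardized unimodal variance function, the grids, and the finite-difference check are our own choices.

```python
import numpy as np

# Numerically invert the excess mass functional to obtain the quantile function
# lam(q) = inf{lam >= 0 : E(lam) <= q}, then compare a finite-difference slope
# with -1/(length of the level set {s >= lam(q)}).
u = np.linspace(0.0, 1.0, 2001)
s = np.exp(-((u - 0.4) / 0.15) ** 2)
s = s / s.mean()                                   # standardized: integral = 1

def E(lam):                                        # excess mass of unimodal s:
    return float(np.mean(np.clip(s - lam, 0.0, None)))  # integral of (s-lam)^+

lam_grid = np.linspace(0.0, float(s.max()), 5001)
E_vals = np.array([E(l) for l in lam_grid])        # decreasing in lam

def lam_of_q(q):                                   # inf{lam : E(lam) <= q}
    return float(lam_grid[np.searchsorted(-E_vals, -q)])

q = 0.30
interval_len = float(np.mean(s >= lam_of_q(q)))    # b_{lam(q)} - a_{lam(q)}
slope = (lam_of_q(q + 0.01) - lam_of_q(q - 0.01)) / 0.02
print(slope, -1.0 / interval_len)                  # approximately equal
```

The negative slope reflects that a larger excess mass level $q$ is reached at a lower cut level $\lambda$, and its magnitude is governed by the width of the level set, exactly as in the proof.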


Asymptotic normality follows by a refinement of the foregoing arguments. Let $q_0\in[0,1)$ be fixed. We show that $\sqrt{T}(\hat\lambda(q_0)-\lambda(q_0))$ has the asserted asymptotic normal distribution. Further, let $\delta_T=o(1)$ be such that $\sqrt{T}\,\delta_T\to\infty$, and define $A_T=\{\sup_{q\in[q_0/2,1]}|\hat\lambda(q)-\lambda(q)|<\delta_T\}$. The foregoing shows that $P(A_T)\to 1$ as $T\to\infty$. On $A_T$, we have, with
\[
q^\pm_{T,*}=q_0+(E-\hat E)(\lambda(q_0))\pm\sup_{|\lambda-\mu|<\delta_T}|(\hat E-E)(\lambda)-(\hat E-E)(\mu)|,
\]
that for $T$ large enough,
\[
\begin{aligned}
\hat\lambda(q_0)&=\inf\{\lambda:\hat E(\lambda)\le q_0\}
=\inf\bigl\{\lambda:E(\lambda)\le q_0+[E(\lambda)-\hat E(\lambda)]\bigr\}\\
&\ge\inf\Bigl\{\lambda:E(\lambda)\le\max\Bigl(0,\,q_0+(E-\hat E)(\lambda(q_0))
+\sup_{|\lambda-\mu|<\delta_T}|(\hat E-E)(\lambda)-(\hat E-E)(\mu)|\Bigr)\Bigr\}\\
&=\lambda(q^+_{T,*}).
\end{aligned}
\]
Thus, on $A_T$, we can write
\[
\begin{aligned}
\sqrt{T}\bigl(\hat\lambda(q_0)-\lambda(q_0)\bigr)
&\ge\sqrt{T}\bigl(\lambda(q^+_{T,*})-\lambda(q_0)\bigr)
=\sqrt{T}\,\lambda'(\xi^+_T)\bigl(q^+_{T,*}-q_0\bigr)\\
&=\lambda'(\xi^+_T)\,\sqrt{T}\,(E-\hat E)(\lambda(q_0))
+\sqrt{T}\,\lambda'(\xi^+_T)\sup_{|\lambda-\mu|<\delta_T}|(\hat E-E)(\lambda)-(\hat E-E)(\mu)|\\
&=\lambda'(\xi^+_T)\,\sqrt{T}\,(E-\hat E)(\lambda(q_0))+o_P(1),\qquad(50)
\end{aligned}
\]
where the last equality follows from the asymptotic stochastic equicontinuity of the process $\sqrt{T}(\hat E-E)(\lambda)$ (see Lemma 1). A similar argument using $\lambda(q^-_{T,*})$ instead of $\lambda(q^+_{T,*})$ shows that also
\[
\sqrt{T}\bigl(\hat\lambda(q_0)-\lambda(q_0)\bigr)\le\lambda'(\xi^-_T)\,\sqrt{T}\,(E-\hat E)(\lambda(q_0))+o_P(1).\qquad(51)
\]
Because the $\xi^\pm_T$ converge stochastically to $q_0$, and $\lambda'$ is continuous, the asserted asymptotic normality follows from the asymptotic normality of $\sqrt{T}(E-\hat E)(\lambda(q_0))$, together with (50) and (51).

[Received May 2004. Revised June 2005.]

REFERENCES

Bennett, T. J., and Murphy, J. R. (1986), “Analysis of Seismic Discrimination Capabilities Using Regional Data From Western U.S. Events,” Bulletin of the Seismological Society of America, 76, 1069–1086.

Billingsley, P. (1999), Convergence of Probability Measures (2nd ed.), New York: Wiley.

Cavalier, L. (1997), “Nonparametric Estimation of Regression Level Sets,” Statistics, 29, 131–160.

Cheng, M.-Y., and Hall, P. (1998a), “On Mode Testing and Empirical Approximations to Distributions,” Statistics and Probability Letters, 39, 245–254.

(1998b), “Calibrating the Excess Mass and Dip Tests of Modality,” Journal of the Royal Statistical Society, Ser. B, 60, 579–589.

Dahlhaus, R. (1997), “Fitting Time Series Models to Nonstationary Processes,” The Annals of Statistics, 25, 1–37.

Dahlhaus, R., and Neumann, M. (2001), “Locally Adaptive Fitting of Semiparametric Models to Nonstationary Time Series,” Stochastic Processes and Their Applications, 91, 277–308.

Dahlhaus, R., and Polonik, W. (2004), “Nonparametric Quasi-Maximum Likelihood Estimation for Gaussian Locally Stationary Processes,” unpublished manuscript.

(2005), “Inference Under Shape Restrictions for Time-Varying Autoregressive Models,” unpublished manuscript.

Hartigan, J. A. (1987), “Estimation of a Convex Density Contour in Two Dimensions,” Journal of the American Statistical Association, 82, 267–270.

Huang, H.-Y., Ombao, H., and Stoffer, D. S. (2004), “Discrimination and Classification of Nonstationary Time Series Using the SLEX Model,” Journal of the American Statistical Association, 99, 763–774.

Johnson, R. A., and Wichern, D. W. (1998), Applied Multivariate Statistical Analysis (4th ed.), Englewood Cliffs, NJ: Prentice-Hall.

Kakizawa, Y., Shumway, R. H., and Taniguchi, M. (1998), “Discrimination and Clustering for Multivariate Time Series,” Journal of the American Statistical Association, 93, 328–340.

Mammen, E. (1991), “Estimating a Smooth Monotone Regression Function,” The Annals of Statistics, 19, 724–740.

Müller, D. W., and Sawitzki, G. (1987), “Using Excess Mass Estimates to Investigate the Modality of a Distribution,” Preprint 398, SFB 123, Universität Heidelberg.

(1991), “Excess Mass Estimates and Tests for Multimodality,” Journal of the American Statistical Association, 86, 738–746.

Nishiyama, Y. (1996), “A Central Limit Theorem for ℓ∞-Valued Martingale Difference Arrays and Its Application,” Preprint 971, Utrecht University, Dept. of Mathematics.

(2000), “Weak Convergence of Some Classes of Martingales With Jumps,” The Annals of Probability, 28, 685–712.

Nolan, D. (1991), “The Excess Mass Ellipsoid,” Journal of Multivariate Analysis, 39, 348–371.

Polonik, W. (1995), “Measuring Mass Concentration and Estimating Density Contour Clusters: An Excess Mass Approach,” The Annals of Statistics, 23, 855–881.

Polonik, W., and Wang, Z. (2005), “Estimation of Regression Contour Clusters: An Application of the Excess Mass Approach to Regression,” Journal of Multivariate Analysis, 94, 227–249.

Robertson, T., Wright, F. T., and Dykstra, R. L. (1988), Order-Restricted Statistical Inference, New York: Wiley.

Sakiyama, K., and Taniguchi, M. (2001), “Discriminant Analysis for Locally Stationary Processes,” Report S-58, Research Reports in Statistics, Osaka University.

Shumway, R. H., and Stoffer, D. S. (2000), Time Series Analysis and Its Applications, New York: Springer-Verlag.

Sun, J., and Woodroofe, M. (1993), “A Penalized Likelihood Estimate of f(0+) When f Is Nonincreasing,” Statistica Sinica, 3, 501–515.

Whittle, P. (1962), “Gaussian Estimation in Stationary Time Series,” Bulletin of the International Statistical Institute, 39, 105–129.

Wood, J. (2004), personal communication.

van der Vaart, A. W., and Wellner, J. A. (1996), Weak Convergence and Empirical Processes: With Applications to Statistics, New York: Springer-Verlag.