


Signal Processing

Signal Processing 116 (2015) 13–28


journal homepage: www.elsevier.com/locate/sigpro

Multivariate time-series analysis and diffusion maps

Wenzhao Lian a,⁎, Ronen Talmon b, Hitten Zaveri c, Lawrence Carin a, Ronald Coifman d

a Department of Electrical & Computer Engineering, Duke University, United States
b Department of Electrical Engineering, Technion – Israel Institute of Technology, Israel
c School of Medicine, Yale University, United States
d Department of Mathematics, Yale University, United States

ARTICLE INFO

Article history:
Received 12 January 2015
Received in revised form 2 April 2015
Accepted 3 April 2015
Available online 13 April 2015

Keywords:
Diffusion maps
Bayesian generative models
Dimension reduction
Time series
Statistical manifold

http://dx.doi.org/10.1016/j.sigpro.2015.04.003
0165-1684/© 2015 Elsevier B.V. All rights reserved.

⁎ Corresponding author.

ABSTRACT

Dimensionality reduction in multivariate time series analysis has broad applications, ranging from financial data analysis to biomedical research. However, high levels of ambient noise and various interferences result in nonstationary signals, which may lead to inefficient performance of conventional methods. In this paper, we propose a nonlinear dimensionality reduction framework using diffusion maps on a learned statistical manifold, which gives rise to the construction of a low-dimensional representation of the high-dimensional nonstationary time series. We show that diffusion maps, with affinity kernels based on the Kullback–Leibler divergence between the local statistics of samples, allow for efficient approximation of pairwise geodesic distances. To construct the statistical manifold, we estimate time-evolving parametric distributions by designing a family of Bayesian generative models. The proposed framework can be applied to problems in which the time-evolving distributions (of temporally localized data), rather than the samples themselves, are driven by a low-dimensional underlying process. We provide efficient parameter estimation and dimensionality reduction methodologies, and apply them to two applications: music analysis and epileptic-seizure prediction.

© 2015 Elsevier B.V. All rights reserved.

1. Introduction

In the study of high-dimensional data, it is often of interest to embed the high-dimensional observations in a low-dimensional space, where hidden parameters may be discovered, noise suppressed, and interesting and significant structures revealed. Due to high dimensionality and nonlinearity in many real-world applications, nonlinear dimensionality reduction techniques have become increasingly popular [1–3]. These manifold-learning algorithms build data-driven models, organizing data samples according to local affinities on a low-dimensional manifold. Such methods have broad applications to, for example, analysis of financial data, computer vision, hyperspectral imaging, and biomedical engineering [4–6].

The notion of dimensionality reduction is useful in multivariate time series analysis. In the corresponding low-dimensional space, hidden states may be revealed, change points detected, and temporal trajectories visualized [7–10]. Recently, various nonlinear dimensionality reduction techniques have been extended to time series, including spatio-temporal Isomap [11] and the temporal Laplacian eigenmap [12]. In these methods, besides local affinities in the space of the data, available temporal covariate information is incorporated, leading to significant improvements in discovering the latent states of the series.

The basic assumption in dimensionality reduction is that the observed data samples do not fill the ambient space uniformly, but rather lie on a low-dimensional manifold.



Such an assumption does not hold for many types of signals, for example, data with high levels of noise [4,13–15]. In [14,15], the authors consider a different, relaxed dimensionality reduction problem on the domain of the underlying probability distributions. The main idea is that the varying distributions, rather than the samples themselves, are driven by a few underlying controlling processes, yielding a low-dimensional smooth manifold in the domain of the distribution parameters. An information-geometric dimensionality reduction (IGDR) approach is then applied to obtain an embedding of high-dimensional data using Isomap [1], thereby preserving the geodesic distances on the manifold of distributions.

Two practical problems arise in these methods, limiting their applicability to time series analysis. First, in [14,15] multiple datasets were assumed to be available, with the data in each set drawn from the same distributional form with fixed distribution parameters. The embedding was then inferred in the space of the distribution parameters. By taking into account the time dependency in the evolution of the distribution parameters from a single time series, we may substantially reduce the number of required datasets. A second limitation of previous work concerns how geodesic distances were computed. In [14,15] the approximation of the geodesic distance between all pairs of samples was computed using a step-by-step walk on the manifold, requiring O(N³) operations, which may be intractable for large N.

In this paper, we present a dimensionality-reduction approach using diffusion maps for nonstationary high-dimensional time series, which addresses the above shortcomings. Diffusion maps constitute an effective data-driven method to uncover the low-dimensional manifold, and provide a parametrization of the underlying process [16]. The main idea in diffusion maps resides in aggregating local connections between samples into a global parameterization via a kernel. Many kernels implicitly induce a mixture of local statistical models in the domain of the measurements. In particular, it has been shown that using distributional information outperforms using sample information when the distributions are available [14]. We exploit this assumption and posit that the observed multivariate time series X_t ∈ R^N, t = 1, …, T, is generated from a smoothly varying parametric distribution p(X_t | β_t), where β_t is a local parameterization of the time-evolving distribution. We propose to construct a Bayesian generative model with constraints on β_t, and use Markov Chain Monte Carlo (MCMC) to estimate β_t. Diffusion maps are then applied to reveal the statistical manifold (of the estimated distributions) using a kernel with the Kullback–Leibler (KL) divergence as the distance measure. Noting that the parametric form of the distributions significantly affects the structure of the mapped data, the Bayesian generative model should avoid using a strong informative prior without substantial evidence.

Diffusion maps rely on the construction of a Laplace operator, whose eigenvectors approximate the eigenfunctions of the backward Fokker–Planck operator. These eigenfunctions describe the dynamics of the system [17]. Hence, the trajectories embedded in the coordinate system formulated by the principal eigenvectors of the Laplace operator can be regarded as a representation of the underlying controlling process θ_t of the time series X_t.

One of the main benefits of embedding the time series samples into a low-dimensional domain is the ability to define meaningful distances. In particular, diffusion maps embody the property that the Euclidean distance between the samples in the embedding domain corresponds to a diffusion distance in the distribution domain. Diffusion distance measures the similarity between two samples according to their connectivity on the low-dimensional manifold [3] and has a close connection to the geodesic distance. Thus, diffusion maps circumvent the step-by-step walk on the manifold [14], computing an approximation to the geodesic distance in a single low-cost operation. Another practical advantage of the proposed method is that we may first reveal the low-dimensional coordinate system based on reference data, and then, in an online manner, extend the model to newly acquired data at a low computational cost. This is demonstrated further when considering applications in Section 4.

The proposed framework is applied to two applications in which the data are best characterized by temporally evolving local statistics, rather than by measures applied directly to the data itself: music analysis and epileptic seizure prediction based on intracranial electroencephalography (icEEG) recordings. In the first application, we show that, using the proposed approach, we can uncover the key underlying processes: human voice and instrumental sounds. In particular, we exploit the efficient computation of diffusion distances to obtain intra-piece similarity measures on well-known music, which are compared with state-of-the-art techniques.

In the second application, one goal is to map the recordings to the unknown underlying “brain activity states”. This is especially crucial in epileptic seizure prediction, where preseizure (dangerous) states can be distinguished from interictal (safe) states, so that patients can be warned prior to seizures [18]. In this application, the observed time series is the icEEG recordings and the underlying process is the brain state, e.g., preseizure or interictal. IcEEG recordings tend to be noisy, and hence the mapping between the state of the patient's brain and the available measurements is not deterministic, and the measurements do not lie on a smooth manifold. Thus, the intermediate step of mapping the observations to a time-evolving parametric family of distributions is essential to overcome this challenge. We use the proposed approach to infer a parameterization of the signal, viewed as a model summarizing the signal's distributional information. Based on the inferred parameterization, we show that preseizure state intervals can be distinguished from interictal state intervals. In particular, we show the possibility of predicting seizures by visualization and simple detection algorithms, tested on an anonymous patient.

This paper makes three principal contributions. First, we present a data-driven method to fit flexible statistical models adapted to time series. In particular, we propose a class of Bayesian models with various prior specifications to learn the time-evolving statistics from a single trajectory realization, accurately modeling the local distributional parameters β. Second, we complement the analysis with diffusion maps based on the distributional information embodied in time-series dynamics. By relying on a kernel,



which enables the comparison of segments from the entire time series, our nonlinear method allows for the association of similar patterns and, in turn, gives rise to the recovery of an underlying process that consists of the global controlling factors θ. In addition, diffusion maps enable the computation of meaningful distances (i.e., diffusion distances [3]), which approximate the geodesic distances between the time series samples on the statistical manifold. Finally, we apply the proposed framework to two applications: music analysis and the analysis of icEEG recordings.

The remainder of the paper is organized as follows. In Section 2, we review the diffusion-maps technique, propose an extended construction, and examine its theoretical and practical properties. In Section 3, we propose multiple approaches to model multivariate time series with time-evolving distributions. In Section 4, results on two real-world applications are discussed. Conclusions and future work are outlined in Section 5.

2. Diffusion maps using the Kullback–Leibler divergence

2.1. Underlying parametric model

Let X_t ∈ R^N be the raw data or extracted features at time t. The key concept is that the high-dimensional representation of X_t exhibits a characteristic geometric structure. This structure is assumed to be governed by an underlying process on a low-dimensional manifold, denoted by θ_t, that propagates over time as a diffusion process according to the following stochastic differential equation (SDE)¹:

dθ_t^i = a^i(θ_t^i) dt + dw_t^i,   (1)

where θ_t^i is the i-th component of θ_t, the a^i are (possibly nonlinear) drift functions, and w_t is a Brownian motion. In particular, we assume that the underlying process induces a parametric family of probability distributions in the measurable domain, i.e., p(X_t | θ_t). In other words, the underlying process controls the statistics of the measured signals.

Note that θ_t controls the time evolution of the underlying distribution of the data, rather than directly controlling the data itself. We do not assume a priori knowledge of the form of the distribution p(· | θ_t).

2.2. Local models and the Kullback–Leibler divergence

We use empirical densities to represent the local statistics of the signal. In particular, as an intermediate step, we assume a parametric family of local distributions. Let p(X_t | β_t) denote the local density of X_t. We emphasize that the assumed parameterization of the local distribution, β, is considerably different from θ: θ is the fundamental parametrization of the low-dimensional manifold that represents the intrinsic state governing the signal, whereas β is merely used within a chosen local distribution, employed as an intermediate step, in an attempt to infer θ.

Note that because the data is assumed noisy, we do not assume that the data itself lies on a low-dimensional manifold; rather, we assume that there is an underlying and unknown distribution p(X_t | θ_t), conditioned on global low-dimensional distribution parameters θ_t that evolve with time. To uncover the time evolution of θ (and to infer the dimension of the parameter vector θ), we assume an intermediate form of a different conditional distribution p(X_t | β_t), typically selected to balance accuracy with computational complexity, where β_t applies only locally and does not exhibit the low dimensionality or compactness of θ_t. To relate the local structures/conditional distributions, we then compute distances between data at times t and t′, based on p(X_t | β_t) and p(X_{t′} | β_{t′}), respectively, using an appropriate kernel. From the resulting distance matrix we seek to uncover θ_t for all t. In the end we still do not explicitly know the responsible distribution p(X_t | θ_t), but we uncover how the (typically low-dimensional) parameters θ_t evolve with time, manifesting a useful embedding. Note that this is different from the approach of transforming the data from the space of X_t to the β_t space and inferring the embedding θ_t using β_t; instead, we use representations in the β_t space implicitly to compute pairwise distances between data points, which are needed for learning the embedding θ_t.

¹ In this paper, superscripts index elements of vectors, i.e., x^i is the i-th element of the vector x.

The idea of using group statistics instead of individual feature vectors for data representation has been utilized in classification problems [19], where distributions p(X_s | β_s) are independently estimated for multiple datasets (X_s representing all the data in a dataset s ∈ {1, …, S}), and further classification tasks are performed based on the distributional information. In more recent work on dimension reduction [14,15], the authors also estimated the distribution p(X_s | β_s) for each dataset, and inferred the associated {θ_s} via multidimensional scaling methods. In contrast, by leveraging time, we may better infer the S local datasets, characterized by time-evolving distributions, and then integrate them through a kernel-based embedding.

Various divergences measure the distance between distributions, including the KL divergence, the Hellinger distance, and the Bhattacharyya divergence; we refer to the metric learning literature for a detailed comparison of different divergence choices [20,21]. In this work, we propose to use the KL divergence as a distance measure between the parametric probability density functions (pdfs). For any pair of measurements X_t and X_{t′}, the KL divergence between the corresponding parametric pdfs is defined by

D( p(X|β_t) ‖ p(X|β_{t′}) ) = ∫_X ln[ p(X|β_t) / p(X|β_{t′}) ] p(X|β_t) dX.   (2)

Let β_{t′} and β_t = β_{t′} + δβ_t be two close samples in the intermediate parametric domain. It can be shown [22] that the KL divergence is locally approximated by the Fisher information metric, i.e.,

D( p(X|β_t) ‖ p(X|β_{t′}) ) ≈ δβ_t^T I(β_{t′}) δβ_t,   (3)

where I(β_{t′}) is the Fisher information matrix. This local approximation will be used in the sequel to define local affinities.
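As a quick numerical check of this approximation (a sketch, not part of the paper's pipeline): for the scalar zero-mean Gaussian family N(0, β), parameterized by the variance β, the Fisher information is I(β) = 1/(2β²), and the symmetrized divergence used in the kernel of Section 2.2 agrees with the quadratic form δβ^T I(β) δβ to second order.

```python
import math

def kl_gauss_1d(b1, b2):
    """KL( N(0, b1) || N(0, b2) ) for variances b1, b2 (scalar case)."""
    return 0.5 * (b1 / b2 - 1.0 + math.log(b2 / b1))

beta, delta = 1.0, 1e-3
# symmetrized divergence between N(0, beta + delta) and N(0, beta)
sym = kl_gauss_1d(beta + delta, beta) + kl_gauss_1d(beta, beta + delta)
# quadratic form with the Fisher information I(beta) = 1/(2 beta^2)
quad = delta ** 2 / (2.0 * beta ** 2)
assert abs(sym - quad) / quad < 1e-2  # agreement to second order
```

The one-sided divergence carries an extra factor of 1/2 relative to the symmetrized one; in practice any such constant is absorbed into the kernel scale ε.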



We then define the Riemannian manifold (M, g), where the Fisher metric in (3) is associated with the inner product g on the tangent plane of the manifold between the local distributions:

g_ij(β_t) = ∫ [∂ log p(X|β_t) / ∂β_t^i] [∂ log p(X|β_t) / ∂β_t^j] p(X|β_t) dX.   (4)

The points residing on M are parametric probability density functions p(X|β_t). Thus, for p(X|β_{t′} + δβ_t) in a local neighborhood of p(X|β_{t′}), the Fisher metric between these two points can be approximated by D( p(X|β_t) ‖ p(X|β_{t′}) ). Therefore, we use the KL divergence to construct the affinity kernel and build the graph for diffusion maps. Because only the distances between pairs of points within a local neighborhood affect the constructed diffusion maps (the diffusion probability between points with larger distances is negligible), the obtained diffusion distances preserve the Riemannian metric on the manifold of local distributions. This will be addressed in detail in Section 2.3.

For the signal types reported in this paper (music and icEEG), we have empirically found that a simple local Gaussian model (Gaussian mixture model) with zero mean and time-evolving covariance matrices can effectively describe the local empirical densities of the selected feature sets of the signals. However, we note that the (global) distribution of the data may be unknown and/or without a closed-form expression for the KL divergence. Thus, the intermediate parameterization β_t is a local covariance matrix Σ_t, and the local distribution of X_t is approximated by N(0, Σ_t). In this case, the (asymmetric) KL divergence can be written explicitly as

D( p(X|β_t) ‖ p(X|β_{t′}) ) = (1/2) [ tr(Σ_{t′}^{−1} Σ_t) − N + log( |Σ_{t′}| / |Σ_t| ) + (μ_t − μ_{t′})^T Σ_{t′}^{−1} (μ_t − μ_{t′}) ].   (5)

Because of the zero-mean assumption for the Gaussian distributions, the last term in (5) can be dropped. Based on the KL divergence, we define a symmetric pairwise affinity function using a Gaussian kernel:

k(X_t, X_{t′}) = exp{ −[ D( p(X|β_t) ‖ p(X|β_{t′}) ) + D( p(X|β_{t′}) ‖ p(X|β_t) ) ] / ε }.   (6)

The decay rate of the exponential kernel implies that only pdfs p(X|β_t) within an ε-neighborhood of p(X|β_{t′}) have non-negligible affinity. Thus, the exponential kernel implicitly restricts the comparisons to small neighborhoods and truncates the symmetric KL divergence. As a result, within the context of this kernel, the approximation of the Fisher metric in (3) holds, and we have

k(X_t, X_{t′}) ≈ exp{ −(β_t − β_{t′})^T ( I(β_t) + I(β_{t′}) ) (β_t − β_{t′}) / ε }.   (7)

We note that such a kernel was shown to be positivesemi-definite, e.g., in [21].
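To make the construction concrete, the following sketch (hypothetical helper names; assumes zero-mean Gaussian local models, so the last term of (5) drops) computes the symmetric-KL affinity matrix of (6) from a sequence of estimated covariance matrices, with ε set to the median pairwise divergence, one common variant of the median rule mentioned in Section 2.3.

```python
import numpy as np

def kl_gauss(S1, S2):
    """KL( N(0, S1) || N(0, S2) ), the zero-mean case of Eq. (5)."""
    N = S1.shape[0]
    _, logdet1 = np.linalg.slogdet(S1)
    _, logdet2 = np.linalg.slogdet(S2)
    return 0.5 * (np.trace(np.linalg.solve(S2, S1)) - N + logdet2 - logdet1)

def affinity_matrix(covs):
    """Symmetric-KL Gaussian kernel, Eq. (6); eps = median pairwise divergence."""
    T = len(covs)
    D = np.zeros((T, T))
    for i in range(T):
        for j in range(i + 1, T):
            d = kl_gauss(covs[i], covs[j]) + kl_gauss(covs[j], covs[i])
            D[i, j] = D[j, i] = d
    eps = np.median(D[D > 0])
    return np.exp(-D / eps)

# toy covariance track: variance along the first axis drifts over time
covs = [np.diag([1.0 + 0.1 * t, 2.0]) for t in range(20)]
W = affinity_matrix(covs)
```

Since the divergence of a distribution with itself is zero, the kernel diagonal is 1, and nearby times receive higher affinity than distant ones.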

2.3. Laplace operator and diffusion maps

Let W be a pairwise affinity matrix between the set of measurements X_t, whose (t, t′)-th element is given by

W_{t,t′} = k(X_t, X_{t′}).   (8)

Based on the kernel, we form a weighted graph, where the measurements X_t are the graph nodes and the weight of the edge connecting node X_t to node X_{t′} is W_{t,t′}. The particular choice of the Gaussian kernel in (7) exhibits a notion of locality by defining a neighborhood of radius ε around each measurement X_t; i.e., measurements X_{t′} such that (β_t − β_{t′})^T ( I(β_t) + I(β_{t′}) ) (β_t − β_{t′}) > ε are weakly connected to X_t. In practice, we set ε to be the median of the elements of the kernel matrix. According to the graph interpretation, such a scale results in a well-connected graph, because each measurement is effectively connected to half of the other measurements. For more details, see [23,24].

Using the KL divergence (the Fisher information metric) as an affinity measure has the effect of an adaptive scale. Taking the parametric family of normal distributions as an example, we consider in particular the case where the process X_t is one dimensional and is given by

X_t = Y_t + V_t,   (9)

where Y_t ~ N(0, θ_t) and V_t is an additive white Gaussian noise with zero mean and fixed variance σ_v². According to the parametric model (Section 2.1), θ_t follows a diffusion-process propagation model, which results in time-varying distributions. Consequently, the parametric pdf of the measurements is given by

p(X|β_t) = N(0, β_t),   (10)

where β_t = θ_t + σ_v², and the corresponding Fisher information matrix is

I(β_t) = 1 / (2β_t²).   (11)

In this case, the kernel based on the KL divergence is given by

k(X_t, X_{t′}) = exp{ −‖θ_t − θ_{t′}‖² / ε(β_t, β_{t′}) },   (12)

where

ε(β_t, β_{t′}) = 2ε [ 1/(θ_t + σ_v²)² + 1/(θ_{t′} + σ_v²)² ]^{−1}   (13)

is a locally adapted kernel scale with the following interpretation: when the noise level σ_v² increases, a larger scale (neighborhood) is used in order to see “beyond the noise” and capture the geometry and variability of the underlying parameter θ. We remark that in this specific case, the adaptive scale is the local covariance of the measurements. Thus, this metric is equal to the Mahalanobis distance between the measurements [9].

Let D be a diagonal matrix whose elements are the sums of the rows of W, and let W_norm = D^{−1/2} W D^{−1/2} be a normalized kernel that shares its eigenvectors with the normalized graph Laplacian I − W_norm [25]. It was shown



[3] that W_norm converges to a diffusion operator that reveals the low-dimensional manifold, and a subset of its eigenvectors gives a parameterization of the underlying process. We assume that these eigenvectors are the principal eigenvectors associated with the largest eigenvalues, although there is no guarantee. Thus, the eigenvectors of W_norm, denoted by φ̃_j, reveal the underlying structure of the data. Specifically, the t-th coordinate of the j-th eigenvector can be associated with the j-th coordinate of the underlying process θ_t controlling the measurement X_t. The eigenvectors are ordered such that λ_0 ≥ λ_1 ≥ ⋯ ≥ λ_{T−1} > 0, where λ_j is the eigenvalue associated with eigenvector φ̃_j. Because W_norm is similar to P = D^{−1}W, and because D^{−1}W is row-stochastic, λ_0 = 1 and φ̃_0 is the diagonal of D^{1/2}. In addition, W_norm is positive semidefinite, and hence the eigenvalues are positive. The matrix P may be interpreted as the transition matrix of a Markov chain on the graph nodes. Specifically, the states of the Markov chain are the graph nodes, and P_{t,t′} represents the probability of transition in a single Markov step from node X_t to node X_{t′}. Propagating the Markov chain n steps forward corresponds to raising P to the power of n. We also denote the probability function from node X_t to node X_{t′} in n steps by p_n(X_t, X_{t′}).

The eigenvectors are used to obtain a new data-driven description of the measurements X_t via a family of mappings that are called diffusion maps [3]. Let Ψ_{ℓ,n}(X_t) be the diffusion mapping of the measurements into the Euclidean space R^ℓ, defined as

Ψ_{ℓ,n}(X_t) = ( λ_1^n φ̃_1^t, λ_2^n φ̃_2^t, …, λ_ℓ^n φ̃_ℓ^t )^T,   (14)

where ℓ is the new space dimensionality, ranging between 1 and T−1. Diffusion maps therefore have two parameters: n and ℓ. The parameter n corresponds to the number of steps of the Markov process on the graph, since the transition matrices P and P^n share the same eigenvectors, and the eigenvalues of P^n are the eigenvalues of P raised to the power of n. The parameter ℓ indicates the intrinsic dimensionality of the data. The dimension may be set heuristically according to the decay rate of the eigenvalues, as the coordinates in (14) become negligible for large ℓ. In practice, we expect to see a distinct “spectral gap” in the decay of the eigenvalues. Such a gap is often a good indicator of the intrinsic dimensionality of the data, and its use is common practice in spectral clustering methods. The mapping of the data X_t into the low-dimensional space using (14) provides a parameterization of the underlying manifold, and its coordinates represent the underlying processes θ_t. We note that as n increases, the decay rate of the eigenvalues also increases (they are confined to the interval [0, 1]). As a result, we may set ℓ to be smaller, enabling us to capture the underlying structure of the measurements in fewer dimensions. Thus, we may claim that a larger number of steps usually brings the measurements closer in the sense of the affinity implied by P^n, and therefore a more “global” structure of the signal is revealed.
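The construction above can be sketched end-to-end (a minimal illustration under the conventions of Section 2.3, not the paper's exact pipeline): normalize the affinity matrix symmetrically, take the eigendecomposition, and scale the leading nontrivial eigenvectors by λ^n as in (14).

```python
import numpy as np

def diffusion_map(W, n_steps=1, ell=2):
    """Diffusion-map embedding from an affinity matrix W, following (14)."""
    d = W.sum(axis=1)
    Dinv_sqrt = np.diag(1.0 / np.sqrt(d))
    Wnorm = Dinv_sqrt @ W @ Dinv_sqrt          # shares eigenvectors with I - Wnorm
    lam, phi = np.linalg.eigh(Wnorm)           # returned in ascending order
    order = np.argsort(lam)[::-1]              # reorder: lambda_0 >= lambda_1 >= ...
    lam, phi = lam[order], phi[:, order]
    # drop the trivial eigenvector phi_0; embed with lambda_j^n * phi_j, j = 1..ell
    return phi[:, 1:ell + 1] * lam[1:ell + 1] ** n_steps, lam

# toy data: points on a line with a plain Gaussian affinity (for illustration only)
x = np.linspace(0, 1, 10)
W = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.1)
emb, lam = diffusion_map(W, n_steps=2, ell=2)
```

Because W_norm is similar to the row-stochastic D^{−1}W, the leading eigenvalue comes out as 1, matching the ordering λ_0 = 1 ≥ λ_1 ≥ ⋯ described above.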

The Markov process aggregates information from the entire set into the affinity metric p_n(X_t, X_{t′}), defining the probability of “walking” from node X_t to X_{t′} in n steps. For any n, the metric

D_n²(X_t, X_{t′}) = ∫ [ p_n(X_t, X_s) − p_n(X_{t′}, X_s) ]² w(X_s) dX_s,   (15)

with w(X_s) = 1/φ̃_0(X_s), is called the diffusion distance. It describes the relationship between pairs of measurements in terms of their graph connectivity, and as a consequence, local structures and rules of transitions are integrated into a global metric. If the integral is evaluated on the points of the observed data, it can be shown that the diffusion distance (15) is equal to the Euclidean distance in the diffusion-maps space when using all ℓ = T−1 eigenvectors [3], i.e.,

D_n(X_t, X_{t′}) = ‖Ψ_{ℓ,n}(X_t) − Ψ_{ℓ,n}(X_{t′})‖₂.   (16)

Thus, comparison between the diffusion mappings using the Euclidean distance conveys the advantages of the diffusion distance stated above. In addition, since the eigenvalues tend to decay fast, for large enough n the diffusion distance can be well approximated by only the first few eigenvectors, setting ℓ ≪ T−1.

2.4. Sequential implementation

The construction of the diffusion-maps embedding is computationally expensive due to the application of the eigenvector decomposition (EVD). In practice, the measurements are not always available in advance; the computationally demanding procedure would therefore have to be applied repeatedly whenever a new set of measurements becomes available. In this section, we describe a sequential procedure for extending the embedding, which circumvents repeated EVD applications and may be suitable for supervised techniques [26–29].

Let X_t be a sequence of T reference measurements that are assumed to be available in advance. The availability of these measurements enables one to estimate the local densities and the corresponding kernel based on the KL divergence. Then, the embedding of the reference measurements can be computed.

Let X_s be a new sequence of S measurements, which are assumed to become available sequentially. As proposed in [27,28], we define a nonsymmetric pairwise metric between any new measurement X_s and any reference measurement X_t, similar to (7), as

a(X_s, X_t) = exp{ −(β_s − β_t)^T I(β_t) (β_s − β_t) / ε },   (17)

where β_t and β_s are the parametrizations of the local densities of the measurements at t and s, respectively, and a corresponding nonsymmetric kernel

A_{s,t} = a(X_s, X_t).   (18)

The construction of the nonsymmetric kernel requires the feature vectors of the measurements and the Fisher information matrix of only the reference measurements.

Let Ã = D_A^{-1} A Q^{-1}, where D_A is a diagonal matrix whose diagonal elements are the sums of the rows of A, and Q is a diagonal matrix whose diagonal elements are the sums of the


W. Lian et al. / Signal Processing 116 (2015) 13–28

columns of D_A^{-1} A. It was shown in [26,27] that

W = \tilde{A}^T \tilde{A} \qquad (19)

where W is the pairwise affinity matrix on the T reference measurements X_t (8), as defined in Section 2.3.

We now define the dual extended S × S kernel between the new samples by W_ext = Ã Ã^T. It is shown in [28] that the elements of the extended kernel are proportional to a Gaussian defined similarly to (7) between a pair of new measurements. Combining the relationship between the kernels W and W_ext yields the following: (1) the kernels share the same eigenvalues λ_j; (2) the eigenvectors φ_j of W are the right singular vectors of Ã; (3) the eigenvectors ψ_j of W_ext are the left singular vectors of Ã. As discussed in Section 2.3, the right singular vectors represent the underlying process of the reference measurements, and by [26,27], the left singular vectors naturally extend this representation to the new measurements. In addition, the relationship between the eigenvectors of the two kernels is given by the singular value decomposition (SVD) of Ã and is explicitly expressed by

\psi_j = \frac{1}{\sqrt{\lambda_j}} \tilde{A} \varphi_j. \qquad (20)

Now, the extended eigenvectors ψ_j can be used instead of φ̃_j to construct the embedding of the new measurements Ψ_{ℓ,n}(X_s) in (14).
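The SVD-based extension can be sketched as follows. The function name and the toy kernel are our own illustrative choices; the code relies only on the generic SVD identities behind (19)–(20), with Q taken as the column sums of D_A^{-1} A (the row sums of a row-stochastic matrix are trivially one).

```python
import numpy as np

def extend_embedding(A, ell=2):
    """Out-of-sample extension via the SVD of A_tilde (a sketch).

    A is the S x T nonsymmetric kernel (18) between S new and T reference
    measurements.  Returns the shared eigenvalues, the reference
    eigenvectors phi_j (right singular vectors of A_tilde), and the
    extended eigenvectors psi_j = A_tilde phi_j / sqrt(lambda_j), eq. (20).
    """
    d = A.sum(axis=1)                    # row sums of A
    A1 = A / d[:, None]                  # D_A^{-1} A (row-stochastic)
    q = A1.sum(axis=0)                   # column sums of D_A^{-1} A
    A_tilde = A1 / q[None, :]            # D_A^{-1} A Q^{-1}
    U, s, Vt = np.linalg.svd(A_tilde, full_matrices=False)
    lam = s[:ell] ** 2                   # eigenvalues of W = A_tilde^T A_tilde
    phi = Vt[:ell].T                     # reference-side eigenvectors
    psi = A_tilde @ phi / np.sqrt(lam)   # eq. (20); coincides with U[:, :ell]
    return lam, phi, psi

rng = np.random.default_rng(0)
A = rng.random((8, 12)) + 0.1            # a positive toy kernel, S=8, T=12
lam, phi, psi = extend_embedding(A, ell=3)
```

The key point is that no EVD of the reference kernel is rerun when new samples arrive: one SVD of the thin matrix Ã yields both the reference embedding and its extension.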

3. Modeling time evolving covariance matrices

To calculate the KL divergence, we need to estimate the local/intermediate parametric distribution p(X_t | β_t) at each time. The amount of data in each time window is limited, and therefore, assumptions have to be made to constrain the parametric space. In this paper, we assume that the data sample at each time, X_t, is drawn from a multivariate Gaussian distribution with time-evolving parameters. For simplicity, we focus on zero-mean Gaussian distributions. We assume that the time-evolving covariance matrices characterize the dynamics of the time series. Such a time-evolving covariance model can be applied to many multivariate time series, including volatility analysis in finance [4] and EEG activity in neurology [5]. Popular approaches for estimating smoothly varying covariance matrices include the exponentially weighted moving average (EWMA) model [30] and multivariate generalized autoregressive conditional heteroscedasticity (GARCH) models [31]. The former captures the smoothly varying trends, but fails to handle missing data and requires long series to achieve high estimation accuracy [32]. The latter handles missing data at the expense of over-restricting the flexibility of the dynamics of the covariance matrices [4]. Most moving-average-type approaches can be simplified as Σ̂_t = π_t Σ̂_{t−1} + (1 − π_t) Σ_t, where Σ_t is the sample covariance of the data samples in a sliding window at t and π_t is a smoothing parameter. Σ̂_t can be considered as the posterior mean estimate for Σ_t using an inverse-Wishart prior with mean proportional to Σ̂_{t−1}. Given their broad use in practice, moving-average-type approaches are also tested in our experiments.
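A minimal sketch of such a moving-average recursion follows; the function name, window half-width, and smoothing parameter are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def ewma_covariances(X, pi=0.9, M=5):
    """Smoothed covariance track Sigma_hat_t = pi * Sigma_hat_{t-1}
    + (1 - pi) * Sigma_t, where Sigma_t is the sample covariance of the
    zero-mean samples in a length-(2M+1) sliding window centered at t.
    """
    T, N = X.shape
    covs = np.empty((T, N, N))
    prev = X[:2 * M + 1].T @ X[:2 * M + 1] / (2 * M + 1)   # initialization
    for t in range(T):
        lo, hi = max(0, t - M), min(T, t + M + 1)
        window_cov = X[lo:hi].T @ X[lo:hi] / (hi - lo)      # sliding-window Sigma_t
        prev = pi * prev + (1 - pi) * window_cov            # EWMA blend
        covs[t] = prev
    return covs

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 4))        # T = 200 zero-mean samples in R^4
covs = ewma_covariances(X)
```

Since each estimate is a positively weighted sum of sample covariances, every Σ̂_t stays symmetric positive semidefinite by construction.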

In this paper, we present a latent variable model to infer the time-evolving covariance matrices, inspired by the idea proposed in [13]. Although this generative model cannot reconstruct the observed data samples, it allows for handling missing data and aims to capture the dynamics of the evolving structure of the covariance matrices. We assume that at each time t, t = 1,…,T, a factor analyzer (FA) model fits the observed data sample X_t ∈ R^N:

X_t \sim \mathcal{N}(F_t S_t, \alpha^{-1} I_N) \qquad (21)

S_t \sim \mathcal{N}(0, I_K) \qquad (22)

\alpha \sim \mathrm{Ga}(e_0, f_0) \qquad (23)

where F_t ∈ R^{N×K} is a time-evolving factor loading matrix, with a prior that will be described next. S_t is a K-dimensional variable in the latent space, and α models the noise level in the observation space. This model constrains the high-dimensional data to locally reside in a K-dimensional space, but does not assume local stationarity, due to the time-evolving factor loading matrix. Thus, at time t, the marginal distribution of X_t is

X_t \sim \mathcal{N}(0, \Sigma_t) \qquad (24)

\Sigma_t = F_t F_t^T + \alpha^{-1} I_N \qquad (25)

Therefore, we can use posterior estimates of F_t and α to estimate Σ_t. Because we are only interested in F_t F_t^T, rather than F_t or S_t, this latent variable model circumvents the identifiability problem encountered in common FA models [33]. To impose smoothness and facilitate parameter estimation, two types of priors for F_t are proposed: a Gaussian process (GP) [34] and a non-stationary autoregressive (AR) process [7].
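The marginalization behind (24)–(25) is easy to check numerically: integrating out S_t in (21)–(22) leaves cov(X_t) = F_t F_t^T + α^{-1} I_N. A Monte Carlo sketch, with arbitrary dimensions and an arbitrary α of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(2)
N, K, alpha, n_draws = 4, 2, 10.0, 200_000

F = rng.standard_normal((N, K))                       # a fixed loading matrix
S = rng.standard_normal((K, n_draws))                 # S_t ~ N(0, I_K)
noise = rng.standard_normal((N, n_draws)) / np.sqrt(alpha)
X = F @ S + noise                                     # X_t ~ N(F S_t, alpha^{-1} I)

empirical = X @ X.T / n_draws                         # sample covariance of the draws
implied = F @ F.T + np.eye(N) / alpha                 # eq. (25)
max_err = np.max(np.abs(empirical - implied))         # shrinks as n_draws grows
```

The entrywise error decays like O(1/√n_draws), so the two matrices agree to a few hundredths at this sample size.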

In the GP model, the elements of F_t are constructed as

F_{ij} \sim \mathrm{GP}(0, \Omega_{ij}) \qquad (26)

\Omega_{ij}(t_1, t_2) = \sigma^{-1}\left( k(t_1, t_2) + \sigma_n \delta(t_1 - t_2) \right) \qquad (27)

\sigma \sim \mathrm{Ga}(c_0, d_0) \qquad (28)

where σ^{-1} represents the variance of the factor loadings over time. To infer σ^{-1} from the data, a broad gamma distribution prior is imposed. The hyperparameter σ_n represents the noise variance for the factor loadings, which is fixed at a small value (10^{-3} in our case). Each time-varying factor loading F_{ij}^t, t = 1,…,T, is constructed from a GP. Various kernel functions k(t_1, t_2) can be used for the GP, including the radial basis function (RBF) k(t_1, t_2) = exp(−(t_1 − t_2)^2 / (2τ^2)). We have tested different kernel functions, and the RBF is chosen to allow for simple interpretation in our experiments. τ is the length-scale parameter, which globally determines the shape of the autocorrelation function [13]. It induces strong correlation between F_{ij}^{t_1} and F_{ij}^{t_2} if |t_1 − t_2| < τ and inhibits the correlation otherwise. τ can be estimated from the data by putting a discrete uniform prior over a library of candidate atoms [35]. However, in practice, the correlation length might be available a priori and used as the appropriate value, which often works effectively. Standard MCMC sampling is applied with fully conjugate updates. Sampling


F_{ij} from the GP has O(T^3) complexity, which can be alleviated using the random projection method in [35].
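The GP prior (26)–(27) with an RBF kernel can be illustrated by forward sampling; the hyperparameter values below are our own illustrative choices, and we draw from the prior directly rather than reproduce the paper's MCMC inference.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_gp_loadings(T, N, K, tau=5.0, sigma=1.0, sigma_n=1e-3, alpha=100.0):
    """Draw each loading trajectory F_ij^{1:T} from GP(0, Omega) with
    Omega(t1, t2) = sigma^{-1} (k(t1, t2) + sigma_n * delta(t1 - t2)) and an
    RBF kernel k, then form Sigma_t = F_t F_t^T + alpha^{-1} I, eq. (25).
    """
    t = np.arange(T, dtype=float)
    k_rbf = np.exp(-(t[:, None] - t[None, :]) ** 2 / (2.0 * tau ** 2))
    Omega = (k_rbf + sigma_n * np.eye(T)) / sigma      # eq. (27)
    L = np.linalg.cholesky(Omega + 1e-9 * np.eye(T))   # jitter for stability
    # each of the N*K loading trajectories is an independent GP draw
    F = (L @ rng.standard_normal((T, N * K))).reshape(T, N, K)
    Sigmas = np.einsum('tik,tjk->tij', F, F) + np.eye(N) / alpha
    return F, Sigmas

F, Sigmas = sample_gp_loadings(T=100, N=5, K=3)
```

The draws make the role of τ tangible: loadings a few frames apart stay close, while loadings separated by many length-scales decorrelate.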

In the non-stationary AR process prior model, the elements of F_t are constructed as

F_{ij}^t = F_{ij}^{t-1} + \xi_{ij}^t \qquad (29)

\xi_{ij}^t \sim \mathcal{N}(0, \eta_{ij}^{-1}) \qquad (30)

\eta_{ij} \sim \mathrm{Ga}(g_0, h_0) \qquad (31)

The time-varying factor loading matrix F_t, t = 1,…,T, is a random walk whose smoothness is determined by ξ_t. A shrinkage prior [36] favoring ξ_t to be sparse is imposed to encourage smoothness. Other kinds of sparseness-promoting priors, including the spike-slab [37] and generalized double Pareto [38] priors, can also be considered. The zero-mean distribution for ξ_{ij}^t models the difference between consecutive covariance matrices as a stationary process, which captures the drift of the factor loadings over time. In this model, the trend of each factor loading is assumed independent. However, correlated trends of F_t and group sparsity of ξ_t can be considered for more sophisticated data, as a future direction. A forward filtering backward sampling (FFBS) [39] method is used to sample F_t.
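The random-walk prior (29)–(30) is equally easy to simulate forward. This sketch only illustrates the prior (the paper infers F_t from data via FFBS), and the increment variance is an arbitrary choice of ours:

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_ar_loadings(T, N, K, eta_inv=1e-2):
    """Forward simulation of the random-walk prior (29)-(30):
    F^t = F^{t-1} + xi^t with xi_ij^t ~ N(0, eta_inv)."""
    F = np.empty((T, N, K))
    F[0] = rng.standard_normal((N, K))               # initial loadings
    steps = np.sqrt(eta_inv) * rng.standard_normal((T - 1, N, K))
    F[1:] = F[0] + np.cumsum(steps, axis=0)          # accumulate the increments
    return F

F = sample_ar_loadings(T=300, N=4, K=2)
```

With a shrinkage prior on the increments, most ξ_{ij}^t would be pulled toward zero, producing even smoother trajectories than this unshrunk simulation.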

Choosing which estimation method to use is data specific and depends on the available prior knowledge. If the covariance structures are highly correlated for nearby samples, while they are highly uncorrelated for faraway samples, the model with the GP prior should be adopted. If the difference between consecutive covariance matrices is approximately a stationary process, the model with the non-stationary AR process prior should be considered. If the covariance matrices evolve smoothly over time, local stationarity approximately holds, and a large number of data samples are provided, moving-average-type approaches can be easily implemented. This will be further discussed in Section 4. The generalized framework can be summarized in Algorithm 1.

Algorithm 1. Diffusion maps using time evolving statistics.

Input: Observations X_t ∈ R^N, t = 1,…,T; diffusion step n; neighborhood size ε; embedding dimension ℓ.
Output: Embeddings {Ψ_{ℓ,n}(X_t), t = 1,…,T}.
1: At each time t, estimate a distribution p(X_t | β_t) (using moving average (MA), a Bayesian generative model with AR process prior (BAR), or Gaussian process prior (BGP) to infer β_t).
2: Compute the T × T matrix W, where
   W_{t_1,t_2} = \exp\left( -\frac{ D(p(X|\beta_{t_1}) \,\|\, p(X|\beta_{t_2})) + D(p(X|\beta_{t_2}) \,\|\, p(X|\beta_{t_1})) }{\varepsilon} \right)
3: Set D = \mathrm{diag}\left( \sum_{\tau=1}^{T} W_{t,\tau} \right) and compute the normalized kernel W_norm = D^{-1/2} W D^{-1/2}.
4: Keep the top ℓ non-trivial eigenvalues λ_j and eigenvectors φ̃_j ∈ R^T of W_norm, j = 1,…,ℓ, and construct the corresponding diffusion maps embedding Ψ_{ℓ,n}(X_t) = (λ_1^n φ̃_1^t, λ_2^n φ̃_2^t, …, λ_ℓ^n φ̃_ℓ^t)^T.
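For the zero-mean Gaussian case of Section 3, the symmetrized KL divergence in step 2 has a closed form in which the log-determinant terms of the two directed divergences cancel. A sketch of steps 1–2 under our own naming, with random SPD matrices standing in for the estimated Σ_t:

```python
import numpy as np

def symmetric_kl_gaussian(S1, S2):
    """Symmetrized KL divergence between N(0, S1) and N(0, S2).
    The log-determinant terms of the two directed divergences cancel."""
    N = S1.shape[0]
    return 0.5 * (np.trace(np.linalg.solve(S2, S1))
                  + np.trace(np.linalg.solve(S1, S2))) - N

def kl_affinity_kernel(Sigmas, eps):
    """Step 2 of Algorithm 1:
    W_{t1,t2} = exp(-(D(p_t1 || p_t2) + D(p_t2 || p_t1)) / eps)."""
    T = len(Sigmas)
    W = np.empty((T, T))
    for i in range(T):
        for j in range(i, T):
            W[i, j] = W[j, i] = np.exp(
                -symmetric_kl_gaussian(Sigmas[i], Sigmas[j]) / eps)
    return W

rng = np.random.default_rng(5)
Sigmas = []
for _ in range(6):                       # a few random SPD covariances
    A = rng.standard_normal((3, 3))
    Sigmas.append(A @ A.T + np.eye(3))
W = kl_affinity_kernel(Sigmas, eps=1.0)
```

The resulting W is symmetric with unit diagonal, and feeding it to steps 3–4 yields the embedding of Algorithm 1.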

4. Applications

The proposed framework is applied to a toy example and two real-world applications. In the synthetic toy example, we show that the estimated diffusion distance between data points recovers the geodesic distance on the statistical manifold, where IGDR [14] is used as a baseline method. In the first real-world application, we analyze a well-known music piece by estimating the diffusion distance between time points to discover the intra-piece similarities as a function of time. In the second application, the proposed framework is used to discover the different brain states of an epileptic patient based on icEEG recordings.

4.1. Synthetic experiment

We simulate data X_t from a zero-mean multivariate Gaussian distribution with time-evolving covariance matrix Σ_t, constructed from a GP prior as defined in (24)–(27). We consider observation length T = 500, dimension N = 5, and local latent dimension K = 3. The goal is to recover the geodesic distance between data points on the statistical manifold, defined by their corresponding covariance matrices. Because the pairwise symmetric KL distance matrix is needed in both the proposed framework and IGDR, we let both methods access the true Σ_t in order to focus on the comparison between the two dimensionality-reduction schemes. In other words, in this experiment we do not estimate Σ_t from the data, but simply assume it is known. The purpose of this test is to compare the diffusion-distance method with IGDR on the same time-evolving density function.

We apply diffusion maps with the pairwise affinity kernel defined in (6). We consider n = 200 and obtain the low-dimensional embeddings of X_t as defined in (14), with ℓ = 3 (the true dimension of the underlying process). The pairwise geodesic distance between data points can be approximated by the Euclidean distances between the corresponding embeddings, as shown in Fig. 1(a). On the other hand, using IGDR, a shortest-path algorithm is executed to find approximate pairwise geodesic distances, followed by classical multidimensional scaling (MDS) methods (i.e., Laplacian eigenmaps [2]) for dimensionality reduction. The approximated pairwise geodesic distances are presented in Fig. 1(b) and used as a comparison. As illustrated in Fig. 1, both methods yield similar distances. However, the running time of the proposed method is a couple of seconds, whereas the running time of IGDR is approximately 400 s, because of the O(T^3) complexity required to compute the pairwise shortest-path distances. These computations were performed on a computer with a 2.2 GHz CPU and 8 GB RAM, with all software written in Matlab.

4.2. Music analysis

In music information retrieval, automatic genre classification and composer identification are of increasing interest. One task is to compute similarities between different segments of a music piece [40]. For comparison with a previous method, we test our framework on the same music piece used in [40], “A Day in the Life” from the Beatles' album Sgt. Pepper's Lonely Hearts Club Band. The song is 5 min 33 s long and sampled at 22.05 kHz. The music piece is divided into 500 ms contiguous frames to obtain Mel Frequency Cepstral Coefficients (MFCCs), which leads to 667 available frames overall. 40 normalized (i.e.,



Fig. 1. (a) Normalized diffusion distance between time points. (b) Normalized approximated geodesic distance between time points.


with zero mean) MFCCs are used as features, yielding X_t ∈ R^N, t = 1,…,T, where N = 40 and T = 667.

In this application we compare intra-piece similarity obtained using distances based directly on MFCC feature vectors, and based upon the statistics of MFCC features in a sliding window. The motivation for the latter approach is that the local statistics of MFCC features within sliding windows constitute a better representation of the similarities and differences in the music than distances computed directly from the MFCC features. Therefore, computing distances between time-evolving distributions is necessary, which calls for the methods developed in Section 2.

In Fig. 2(a) we plot the frequency content of the music frames, as captured via a spectrogram; the spectrogram frequency content is closely related to the MFCC features. By modeling the evolution in the statistics of multiple contiguous frames of frequency content, the hope is to capture more specific aspects of the music, compared to the spectral content of single frames. Specifically, the frequency spectrum at a single time frame may miss statistical aspects of the music captured by frequency content at neighboring frames.

In the proposed model, we assume that the frames have a time-evolving distribution parametrized by Σ_t, as in (24). The diagonal elements of Σ_t represent the energy content in each frequency band at time t, and the off-diagonal elements contain the correlation pattern between different frequency bands. In music, we observe that nearby frames in time usually tend to have similar Σ_t, whereas temporally distant frames tend to have different Σ_t. Thus, the time evolution of Σ_t is smooth and modeled with the GP prior. As described in (21)–(23) and (26)–(28), the GP prior explicitly encodes this belief and is utilized to estimate Σ_t. The length-scale parameter is set to τ = 5, the number of local factors is set to K = 5, and the other hyperparameters are set to c_0 = d_0 = e_0 = f_0 = σ_n = 10^{-3}. Empirically, we find that this model fits the data well and the performance is relatively insensitive to the parameter settings (many different parameter settings yielded similar results). One interesting parameter is the length-scale τ for the GP, which indicates the correlation length in time. In this example, we set it to 2.5 s, since this corresponds to typical lengths of musical components. In general, this parameter can be set empirically depending on experience or prior domain knowledge, or it can be inferred via maximum likelihood approaches, as indicated in the GP literature [41]. A total of 4000 MCMC iterations are performed, with the first 2000 samples discarded as burn-in. The covariance matrix Σ_t is calculated by averaging across the collected samples. Then the pairwise affinity kernel is computed according to (5)–(6), the diffusion-map algorithm is applied, and the diffusion distances are calculated according to (14)–(16).

To demonstrate the advantage of organizing the music intervals based on local statistics, we compare the obtained results to diffusion maps based on two competing affinity kernels. First, we examine a pairwise affinity kernel constructed with Euclidean distances between the MFCC features directly. This comparison is aimed to illustrate the advantage of taking the local statistics and dynamics into account. Second, we examine a pairwise affinity kernel constructed with Euclidean distances between linear prediction coefficients (LPC). LPC are a standard representation of AR processes, and hence, they implicitly encode the local statistics of the signal. This comparison is therefore aimed to illustrate the advantage of the proposed generative procedure in estimating the statistics of the signal. Additionally, we compare our method to results from the kernel beta process (KBP) [40]. The KBP represents the MFCCs at each time frame in terms of a weight vector over a learned dictionary, yielding a low-dimensional representation. The statistical relationships between pieces of music are then defined by the similarity of dictionary usage. The KBP approach models each MFCC frame while imposing that nearby frames in time are likely to have similar dictionary usage; in contrast, the proposed method explicitly utilizes a time-evolving covariance matrix encouraging smoothness. The subspace defined by that covariance matrix is related to the subspace defined by the KBP dictionary-based method,


whereas the GP model does not explicitly impose dictionary structure (the covariance is allowed to vary more freely).

In Fig. 2, we illustrate the intra-piece relationships as a function of time, based on the proposed and three competing methods. The results in Fig. 2(b)–(d) were computed via the diffusion-based embedding, where (b) is based on the proposed method, (c) is based on the Euclidean distance between MFCC feature vectors, and (d) is based on the Euclidean distance between the LPC. The results in Fig. 2(e) were computed by KBP.

In Fig. 2(b)–(d), the relationship between data at times t and t' is represented as f(d_{tt'}) = exp(−d_{tt'}/δ), where d_{tt'} is the diffusion distance and δ = 10^{-4}. This maps the relationships between the data onto the scale [0,1]. In Fig. 2(e), the correlation between the probabilities of dictionary usage at times t and t' is shown, as done in [40].

The examined music is available online at http://youtu.be/XhDz0npyHmg, where one may listen to the music and examine how it is mapped onto the segmentations and relationships in Fig. 2(b)–(e).

In Fig. 2(c), we observe that diffusion maps based on the Euclidean distance between the MFCC features are highly fragmented and do not capture the temporal relational information of the signal, since, indeed, the local statistics of the signal are not taken into account. On the other hand, both diffusion maps based on the Euclidean distance between the LPC and the KBP approach, in Fig. 2(d) and (e), respectively, seem to be effective in inferring large-scale temporal relationships. However, they fail at distinguishing localized, fine-scale temporal differences. It appears in Fig. 2(b) that the proposed method captures both fine-scale temporal differences and the signal dynamics simultaneously, and the intra-piece relationships are in good agreement with the music. These results further demonstrate the importance of the feature selection and of an appropriate distance metric. In particular, in this paper, the features encode local statistics to capture dynamical behavior, and the metric is designed to be sensitive to and bring out fine details in the signal.

For example, the interval [10,100] s consists of a solo, dominated by the human voice, in which the subinterval [30,42] s contains a unique rhythm, different from the other parts of the piece. Both the LPC-based diffusion maps and the KBP analysis, in Fig. 2(d) and (e), fail to capture the latter detailed structure, whereas it is clearly evident in the results of the proposed method, in Fig. 2(b).

To further illustrate this point, we focus on the relationships obtained by the four methods in the short interval [65,75] s in Fig. 3. As shown in Fig. 3(a), in the interval [68.5,71.5] s, the human voice harmonics disappear, indicating a break in the singing. Comparing the corresponding distances in Fig. 3(b)–(e), we find that diffusion maps using time-evolving statistics capture this break, whereas diffusion maps based on LPC and the KBP fail to capture it. On the other hand, diffusion maps based on the Euclidean distances between the MFCC features capture the break; however, other false breaks are captured as well. Similar patterns and performance can also be found in the intervals [97,99] s and [223,225] s.

The diffusion mapping formulates an embedding space that discovers the low-dimensional manifold of the signal. Fig. 4 depicts two coordinates of the mapping (setting ℓ = 2 in (14)) corresponding to the eigenvectors associated with the largest two non-trivial eigenvalues. For evaluation, we annotated the song with markers indicating the human voice. As depicted in the figure, we find that the second coordinate correlates with the human voice. For instance, we observe that the second coordinate is significantly large during the interval [10,100] s, which consists of the solo. In addition, the first coordinate correlates with the overall background sounds: it takes small values for the first 312 s of the song, and then exhibits a sudden increase when peaky humming appears. Such information can be utilized to interpret the similarity between frames and may be used for other music-analysis tasks. This suggests that the coordinates of the embedding, i.e., the eigenvectors of the graph, indeed represent the underlying factors controlling the music. See http://youtu.be/4uPaLgbMSQw for an audio-visual display of these results.

4.3. Epileptic seizure prediction

We now consider epileptic-seizure prediction based on icEEG recordings. It is important and desirable to predict seizures so that patients can be warned a few minutes prior to onset. Many studies have been conducted in order to devise reliable methods that can distinguish interictal and preseizure states [18,42–47]. However, because icEEG recordings tend to be very noisy, and because the underlying brain activity states and their relationship to the icEEG activities are unknown, it is considered a difficult problem without existing solutions [42,43].

In the present work, we consider icEEG data collected from a patient at the Yale-New Haven Hospital. Multiple electrode contacts were used during the recording; in this work, we focus only on the contacts located near the seizure onset area in the right occipital lobe. Discovering the relationships between different areas will be a subject of future research. We study four 60-min long icEEG recordings at a sampling rate of 256 Hz, each containing a seizure. A detailed description of the collected dataset can be found in [47].

Fig. 5 presents a 60-min icEEG recording and its short-time spectrogram. As shown in Fig. 5(a), the seizure occurs after approximately 56 min in this recording and is clearly visible, exhibiting high signal amplitude. However, it is difficult by observation to notice differences between recording parts that immediately precede the seizure and parts that are located well before the seizure. Our goal is, first, to infer a low-dimensional representation of the recordings, which discovers the intrinsic states (i.e., the brain activities) of the signal. By relying on such a representation, we detect anomalous patterns prior to the seizure onsets (preseizure state) and distinguish them from samples recorded during resting periods (interictal state), thereby enabling prediction of seizures.

The short-time Fourier transform (STFT) with a 1024-sample window and 512-sample overlap is applied to obtain features in the frequency domain. The amplitudes of the frequency components in the Delta (0.5–4 Hz) and Theta (4–8 Hz) bands, with 0.25 Hz spacing, are collected for each time frame and used as the feature vectors X_t ∈ R^32. The Beta (13–25 Hz) and Gamma (25–100 Hz) bands were



Fig. 2. Analysis results of the song “A Day in the Life”. (a) Spectrogram of the song. (b)–(e) Comparison of the pairwise similarity matrices obtained by (b) diffusion maps using time evolving statistics, (c) diffusion maps using Euclidean distances between features, (d) diffusion maps using Euclidean distances between LPC, and (e) the kernel beta process [40].


Fig. 3. Analysis results in the subinterval [65,75] s of the song “A Day in the Life”. (a) Spectrogram of the song. (b)–(e) Comparison of the pairwise similarity matrices obtained by (b) diffusion maps using time evolving statistics, (c) diffusion maps using Euclidean distances between features, (d) diffusion maps using Euclidean distances between LPC, and (e) the kernel beta process [40].



Fig. 4. The 2 principal eigenvectors as a function of time compared to the human voice activity indicator.

Fig. 5. (a) A sample recording. The red line marks the seizure's onset, at approximately 56 min. (b) The STFT features of the recording. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)


also included but empirically showed no significant contribution. In this experiment, two 5-min intervals from each recording are analyzed: one is the 5-min interval immediately preceding the seizure onset (preseizure state), and the other is a 5-min interval located 40 min before the seizure (interictal state). Therefore, for each 5-min interval, we have a set of vectors X_t ∈ R^N, t = 1,…,T, where N = 32 and T = 149. The obtained features are centered, and hence, each X_t is considered as a sample from a zero-mean multivariate Gaussian distribution N(0, Σ_t).
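A feature-extraction sketch in the spirit of this setup, using SciPy's STFT. The exact number of bins retained depends on how the band endpoints are counted, so the band edges and the function name here are illustrative assumptions rather than the authors' code:

```python
import numpy as np
from scipy.signal import stft

def band_features(x, fs=256, f_lo=0.5, f_hi=8.0):
    """Per-frame amplitudes of the 0.5-8 Hz (Delta + Theta) components.
    A 1024-sample window with 512-sample overlap at fs = 256 Hz gives
    0.25 Hz frequency spacing, as in the text."""
    f, t, Z = stft(x, fs=fs, nperseg=1024, noverlap=512)
    band = (f >= f_lo) & (f <= f_hi)
    return f[band], np.abs(Z[band]).T    # (band frequencies, frames x bins)

# toy usage: a 6 Hz sinusoid should peak inside the Theta band
fs = 256
time = np.arange(60 * fs) / fs           # one minute of "signal"
f_band, feats = band_features(np.sin(2 * np.pi * 6.0 * time), fs=fs)
```

Each row of `feats` plays the role of one feature vector X_t; centering the rows then matches the zero-mean Gaussian assumption above.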

In this application, in contrast to the prior knowledge we have in the music analysis case, we do not have a reliable notion of the way the covariance matrices are correlated at different times. Thus, the only belief we can incorporate into the generative model is that Σ_t varies smoothly. This information is encoded in two approaches used to estimate Σ_t: a Bayesian approach with a latent variable model using a non-stationary AR process prior (BAR) and an empirical approach based on moving averaging (MA). The former assumes stationarity of the differences between consecutive covariance matrices. The latter assumes local stationarity, i.e., that feature vectors within a small window are generated from the same Gaussian distribution.

In our experiments, both models (BAR and MA) are used for covariance matrix estimation in each 5-min interval. The BAR model is implemented according to (21)–(23) and (29)–(31), where the number of local factors is set to K = 5 and the hyperparameters are set to e_0 = f_0 = g_0 = h_0 = 10^{-3}. In this application as well, we have found that the performance is not sensitive to the choice of the hyperparameters, and we set them to small values imposing non-informative priors. 4000 MCMC iterations are performed, with the first 2000 discarded as burn-in. The collected samples are averaged to estimate Σ_t. Under the MA model, we simply estimate the covariance directly based on the features in a local neighborhood in time, i.e.,

\hat{\Sigma}^M_t = \frac{1}{2M+1} \sum_{s=t-M}^{t+M} X_s X_s^T

where 2M+1 is the length of the window in which local stationarity is assumed to hold. We choose M large enough to provide an accurate estimate of Σ_t, while small enough to avoid smoothing out the locally varying statistics. In practice, we set M = 32 as the smallest value that yields sufficient smoothness, formulated by the admittedly ad hoc criterion

\frac{\sum_{t=1}^{T} \| \hat{\Sigma}^M_t - \hat{\Sigma}^{M+1}_t \|^2}{\sum_{t=1}^{T} \| \hat{\Sigma}^M_t \|^2} \le 0.05. \qquad (32)

Circumventing the need for such criteria is one reason the BAR approach is considered.
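The search implied by criterion (32) can be sketched as follows; the function names and the white-noise toy data are our own, so the M selected here says nothing about the icEEG data:

```python
import numpy as np

def ma_cov(X, M):
    """Sliding-window sample covariances Sigma_hat^M_t for zero-mean X."""
    T, N = X.shape
    out = np.empty((T, N, N))
    for t in range(T):
        lo, hi = max(0, t - M), min(T, t + M + 1)
        out[t] = X[lo:hi].T @ X[lo:hi] / (hi - lo)
    return out

def smallest_smooth_M(X, tol=0.05, M_max=64):
    """Smallest half-width M satisfying criterion (32):
    sum_t ||S^M_t - S^{M+1}_t||^2 / sum_t ||S^M_t||^2 <= tol."""
    for M in range(1, M_max):
        S_M, S_next = ma_cov(X, M), ma_cov(X, M + 1)
        ratio = np.sum((S_M - S_next) ** 2) / np.sum(S_M ** 2)
        if ratio <= tol:
            return M
    return M_max

rng = np.random.default_rng(6)
X = rng.standard_normal((149, 8))        # T = 149 frames, N = 8 toy features
M_star = smallest_smooth_M(X)
```

For stationary white noise the criterion is met at a small M; smoothly varying statistics push the selected M upward, which is exactly the trade-off described in the text.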

In Fig. 6, we present the obtained embedding (by applying Algorithm 1) of multiple intervals of an icEEG recording (indexed as recording 1) using diffusion maps based on the time-evolving statistics. We embed three 5-min intervals: one preseizure interval and two interictal intervals (located 30 and 40 min prior to the seizure onset, respectively). The presented scatter plot shows the embedded samples in the space spanned by the 3 principal eigenvectors (setting ℓ = 3 in (14)), i.e., each 3-dimensional point corresponds to the diffusion map of a single feature vector X_t.

We observe that under both models (BAR and MA) the low-dimensional representation separates samples recorded in the preseizure state (colored red) from samples recorded in interictal states (colored blue and green). In both Fig. 6(a) and (b), the embedded samples of the two interictal intervals are located approximately in the same region, with the embedded samples of one of the interictal intervals (colored green) tending slightly towards the embedded samples of the preseizure interval. This result exemplifies the ability of the proposed method to discover the underlying states. Indeed, without prior knowledge of the labeling of the different intervals (preseizure or interictal) and based merely on the recorded signal, the proposed method organizes the signal according to the intrinsic state of the patient.



Fig. 6. Scatter plots of the embedded samples of three 5-min intervals of icEEG recording 1. Each point is the diffusion map of the features of each time frame in the recording, setting ℓ = 3. The colors indicate the different intervals. (a) The embedded samples under the MA modeling. (b) The embedded samples under the BAR modeling. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)


We remark that the embeddings obtained under the MA and BAR models, in Fig. 6(a) and (b), respectively, are different. Under the BAR modeling, the method tends to recover trajectories with similar evolving patterns due to the strong temporal correlation assumption. On the other hand, under the MA modeling, the method tends to form point clouds resulting from the local Gaussian distribution assumption.

We now test the consistency and extensibility of the obtained low-dimensional representation. In practice, it is desirable to learn the mapping from reference recordings, and then, when new recordings become available, to embed them into the low-dimensional space in an online fashion in order to identify and predict seizures.

Fig. 7 depicts the embedded samples obtained by applying Algorithm 2 using recording 1 as the reference set and recording 2 as the new incoming set. As observed, the new incoming samples are embedded into the same regions as the reference samples from corresponding interval types. For example, we observe in Fig. 7(a) that the new samples of the interictal state interval are embedded close to the reference samples of the interictal state interval. In addition, we observe that the samples of the interictal state intervals are embedded around the origin, whereas the samples of the preseizure intervals are embedded further away from the origin, suggesting that the preseizure state intervals are "anomalies" that tend to stick out from the learned "normal state" region. These properties allow for a simple identification of preseizure samples: preseizure labels can be assigned to new samples when their corresponding embedded points are near the region of preseizure reference samples and far from the origin. As shown in Fig. 7(b), under the BAR modeling, the embedded points have a different representation. In this case, the embedded samples from different intervals form different trajectories. Nevertheless, we can assign labels to new incoming points in a similar manner – based on their distance to the reference trajectories of each state.

Algorithm 2. Sequential implementation of diffusion maps based on time-evolving statistics.

Input: Reference observations $\{X_t\}$, $t = 1, \ldots, T$; new incoming observations $\{X_s\}$, $s = 1, \ldots, S$; diffusion step $n$; neighborhood size $\epsilon$; dimension $\ell$.
Output: Embeddings $\{\Psi_{\ell,n}(X_t)\}$, $t = 1, \ldots, T$, and $\{\Psi_{\ell,n}(X_s)\}$, $s = 1, \ldots, S$.

1: Estimate the distribution $p(X_t | \beta_t)$ for all time points (using the moving average (MA), the Bayesian model with AR process prior (BAR), or the Bayesian model with Gaussian process prior (BGP) to infer $\beta_t$).
2: Compute the $T \times T$ matrix $W$, where
$$W_{t_1,t_2} = \exp\left(-\frac{D\big(p(X|\beta_{t_1}) \,\|\, p(X|\beta_{t_2})\big) + D\big(p(X|\beta_{t_2}) \,\|\, p(X|\beta_{t_1})\big)}{\epsilon}\right)$$
3: Apply eigenvalue decomposition to $W$ and keep the $\ell$ principal eigenvalues $\lambda_j$ and eigenvectors $\varphi_j$.
4: For new incoming data $X_s$, $s = 1, \ldots, S$, estimate the distribution $p(X_s | \beta_s)$ (using MA, BAR, or BGP).
5: Compute the $S \times T$ nonsymmetric kernel matrix $A$, where
$$A_{s,t} = \exp\left(-\frac{D\big(p(X|\beta_s) \,\|\, p(X|\beta_t)\big)}{\epsilon}\right)$$
6: Construct $R = D_A^{-1} A$, where $D_A = \mathrm{diag}\big\{\sum_{t=1}^{T} A_{s,t}\big\}$.
7: Construct $\tilde{A} = R Q^{-1}$, where $Q = \mathrm{diag}\big\{\sum_{s=1}^{S} R_{s,t}\big\}$.
8: Calculate the diffusion maps embeddings of the new incoming samples $\Psi_{\ell,n}(X_s) = \big(\psi_1^s, \psi_2^s, \ldots, \psi_\ell^s\big)^T$, where $\psi_j = \frac{1}{\sqrt{\lambda_j^n}} \tilde{A} \varphi_j$.
9: Calculate the diffusion maps embeddings of the reference samples $\Psi_{\ell,n}(X_t) = \big(\lambda_1^n \varphi_1^t, \lambda_2^n \varphi_2^t, \ldots, \lambda_\ell^n \varphi_\ell^t\big)^T$.
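Steps 5–9 of the algorithm above can be sketched compactly; this is a minimal illustration, not the authors' code. The column normalization in step 7 is interpreted here as dividing each column of $R$ by its sum (so that $\tilde{A} = R Q^{-1}$ is dimensionally consistent), and all variable names are illustrative.

```python
import numpy as np

def extend_embedding(kl_new_ref, eigvals, eigvecs, eps, n=1):
    """Sketch of Algorithm 2, steps 5-9: embed S new samples given the
    KL divergences from each new sample to the T reference samples.

    kl_new_ref : (S, T) array of D(p(X|beta_s) || p(X|beta_t))
    eigvals, eigvecs : the ell principal eigenpairs of the reference kernel W
    """
    A = np.exp(-kl_new_ref / eps)                # S x T nonsymmetric kernel
    R = A / A.sum(axis=1, keepdims=True)         # row normalization: R = D_A^{-1} A
    A_tilde = R / R.sum(axis=0, keepdims=True)   # column normalization: A~ = R Q^{-1}
    # Nystrom-style extension of the new samples into the reference embedding
    psi_new = (A_tilde @ eigvecs) / np.sqrt(eigvals ** n)
    # embeddings of the reference samples themselves
    psi_ref = (eigvals ** n) * eigvecs
    return psi_ref, psi_new
```

Because `psi_new` interpolates the reference embeddings through `A_tilde`, its coordinates tend to have smaller amplitude than `psi_ref`, which is the shrinking effect discussed later in the text.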

To further validate these results with the available known physiological information, we performed two sanity-check experiments by extending the model to two additional intervals. One is an interval of 5 min recorded 24 h before a seizure (on a day when the patient did not suffer a seizure). The other is an interval of 5 min obtained during a preseizure state but from an icEEG electrode that was located remotely from the identified seizure onset area. In both cases, the resulting embeddings are similar to the embedding of the interictal state samples.

From the presented results in Figs. 6 and 7, we conclude that the proposed method indeed discovers the brain activity and enables, merely by visualization in 3


Fig. 7. Scatter plots of the embedded samples of four 5-min intervals of icEEG recordings: two reference intervals (preseizure and interictal) from recording 1 and two new incoming intervals (preseizure and interictal) from recording 2. The representation is obtained using the reference samples and then extended to the new incoming samples according to Algorithm 2. The dimension of the embedding is set to $\ell = 3$ for visualization. (a) The obtained embedding under the MA modeling. (b) The obtained embedding under the BAR modeling.

Table 1. Detection rate and false alarm rate (%) of the EEG samples from all the recordings.

                     MA                        BAR
Recording   Detection   False alarm   Detection   False alarm
1           94.3        26.2          100.0       29.5
2           41.8        20.5          56.0        24.0
3           62.3        19.7          62.3        24.6
4           48.4        19.7          82.8        23.8


dimensions, to distinguish preseizure states from interictal states. Furthermore, the results imply that the coordinates of the diffusion embedding, i.e., the eigenvectors of the constructed kernel, have a real "meaning" in this application. Similar to the music application, where the coordinates indicate, for example, the human voice, here they capture different aspects of the intrinsic state of the patient. In the present work, we exploit this representation and devise a simple classification procedure to identify preseizure states, which enables us to predict seizures. We remark that a simple algorithm is sufficient since the embedding already encodes the required information and enables us to distinguish the different states.

We repeated the experiment described above, and extended the model to three unseen recordings according to Algorithm 2. As before, recording 1 is used as a reference set to learn the model, which in turn is extended to the other three recordings (2–4). In each recording, there is a preseizure state interval whose T samples are labeled as "preseizure" and an interictal interval whose T samples are labeled as "interictal". Using the labels of the reference samples as training data, we train a Gaussian naive Bayes classifier in the low-dimensional embedding space to distinguish samples recorded in preseizure states from samples in interictal states. In our algorithm, the classification boundary is the hyperplane lying midway between the two empirical means of the embedded reference samples of each state. We also used support vector machine classifiers with both linear and RBF kernels, which yielded comparable results. In addition, we trained classifiers (including naive Bayes and support vector machine classifiers) directly on the raw frequency features from the two states (following the same procedure as above), but the results show that the data samples are not separable in this original space. Thus, we focus on the results of the proposed method.
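The midpoint-hyperplane decision rule described above can be sketched as follows; the function name and interface are illustrative.

```python
import numpy as np

def midpoint_classifier(emb_pre, emb_inter):
    """Decision rule sketch: the separating hyperplane lies midway
    between the empirical means of the two embedded classes."""
    mu_pre = emb_pre.mean(axis=0)
    mu_inter = emb_inter.mean(axis=0)
    w = mu_pre - mu_inter                  # normal to the hyperplane
    b = w @ (mu_pre + mu_inter) / 2.0      # hyperplane passes through the midpoint
    def predict(x):
        # returns 1 for preseizure, 0 for interictal
        return int(x @ w - b > 0)
    return predict
```

This is equivalent to a nearest-centroid rule in the embedding space, which is why the simple boundary suffices when the embedding already separates the states.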

Table 1 summarizes the detection rate and false alarm rate of the samples from all recordings. As can be seen, both implementations, relying on the MA and BAR modeling, perform well in classifying samples into preseizure and interictal states. In general, a large portion of data samples in preseizure states is correctly identified, while only a small fraction of data samples in interictal states is identified as preseizure state. Further, we find that the BAR modeling has an overall higher detection rate, while the MA modeling has a lower false alarm rate. One reason is that under the MA model, we assume a local Gaussian distribution of data samples and smooth variation of the controlling parameters, which inhibit sudden changes, thus reducing the probability of detecting anomalous samples. Another reason is that the Nyström method used in Algorithm 2 (Step 8) to embed new incoming data samples has the effect of shrinking the amplitude of the coordinates, since it interpolates the embeddings of the reference samples. Under the MA model, this results in the identification of a larger number of new incoming samples as interictal state because, in the reference set, the embeddings of interictal state samples tend to lie close to the origin (see Fig. 7). In contrast, under the BAR model, we assume similar evolving patterns of data samples, organizing samples from the two state intervals into two trajectories. Therefore, identifying the states of new incoming data samples is not affected by the shrinking effect of the Nyström method.

An interesting result is that a large portion of embedded samples from a preseizure interval actually resides in the interictal state region. Such a result was also reported in other experiments on data collected from human subjects. It implies that during the preseizure interval, the subject's brain tries to maintain the normal state and resist the seizure. As a result, the states of the samples alternate rapidly between normal and anomalous states. Thus, to effectively predict seizures, relying on single samples/time frames is insufficient, and an online algorithm that aggregates a consecutive group of samples is required. For example, it is evident from the results in Table 1 that tracking the percentage of anomalous samples within a 5-min interval may be adequate: if the percentage of anomalies is greater than a predefined threshold, for example, 0.35, a seizure prediction is reported. Designing more sophisticated classification algorithms and testing them on larger datasets with multiple subjects will be addressed in future work.
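The aggregation rule just described amounts to a simple threshold on the anomaly fraction within an interval; the sketch below uses the 0.35 value quoted in the text as its default, and the function name is illustrative.

```python
def interval_alarm(labels, threshold=0.35):
    """Aggregate per-frame labels (1 = preseizure/anomalous, 0 = interictal)
    over a 5-min interval; raise a seizure warning when the anomaly
    fraction exceeds the threshold."""
    frac = sum(labels) / len(labels)
    return frac > threshold
```

For example, an interval in which 40% of the frames are flagged as anomalous triggers a warning, while one with 30% does not.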

5. Conclusions

A dimensionality-reduction method for high-dimensional time series is presented. The method comprises two key components. First, multiple approaches to estimate time-evolving covariance matrices are presented and compared. Second, using the Kullback–Leibler divergence as a distance metric, diffusion maps are applied to the probability distributions estimated from samples, instead of to the samples themselves, to obtain a low-dimensional embedding of the high-dimensional time series. Theoretical and experimental results show that the embedding inferred by this method discovers the underlying factors that govern the observations and preserves the geodesic distances between samples on the corresponding statistical manifold. Encouraging results are obtained in two real-world applications: music analysis and epileptic seizure prediction. For the latter application in particular, an online seizure identification system is developed, showing the possibility of predicting epileptic seizures based on the time-evolving statistics of intracranial EEG recordings. In future work, we will propose models capturing higher-order time-evolving statistics beyond the covariance matrices.
