
Page 1: Sliding Window - math.nyu.edu

StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time*

Yunyue Zhu, Dennis Shasha

Courant Institute of Mathematical Sciences
Department of Computer Science
New York University
{yunyue,shasha}@cs.nyu.edu

Abstract

Consider the problem of monitoring tens of thousands of time series data streams in an online fashion and making decisions based on them. In addition to single stream statistics such as average and standard deviation, we also want to find high correlations among all pairs of streams. A stock market trader might use such a tool to spot arbitrage opportunities. This paper proposes efficient methods for solving this problem based on Discrete Fourier Transforms and a three level time interval hierarchy. Extensive experiments on synthetic data and real world financial trading data show that our algorithm beats the direct computation approach by several orders of magnitude. It also improves on previous Fourier Transform approaches by allowing the efficient computation of time-delayed correlation over any size sliding window and any time delay. Correlation also lends itself to an efficient grid-based data structure. The result is the first algorithm that we know of to compute correlations over thousands of data streams in real time. The algorithm is incremental, has fixed response time, and can monitor the pairwise correlations of 10,000 streams on a single PC. The algorithm is embarrassingly parallelizable.

1 Introduction

Many applications consist of multiple data streams. For example,

*Work supported in part by U.S. NSF grants IIS-9988345 and N2010:0115586.

NYU Computer Science Department technical report TR2002-827, June 2002. An abbreviated version of this report will appear in Proceedings of the 28th VLDB Conference, Hong Kong, China, 2002.

- In mission operations for NASA's Space Shuttle, approximately 20,000 sensors are telemetered once per second to Mission Control at Johnson Space Center, Houston [16].

- There are about 50,000 securities trading in the United States, and every second up to 100,000 quotes and trades (ticks) are generated.

Unfortunately it is difficult to process such data in set-oriented data management systems, although object-relational time series extensions have begun to fill the gap [21]. For the performance to be sufficiently good, however, "Data Stream Management Systems" (DSMSs) [3], whatever their logical model, should exploit the following characteristics of the application:

- Updates are through insertions of new elements (with relatively rare corrections of older data).

- Queries (moving averages, standard deviations, and correlation) treat the data as sequences, not sets.

- Since a full stream is never materialized, queries treat the data as a never-ending data stream.

- One pass algorithms are desirable because the data is vast.

- Interpretation is mostly qualitative, so sacrificing accuracy for speed is acceptable.

This paper presents the algorithms and architecture of StatStream, a data stream monitoring system. The system computes a variety of single and multiple stream statistics in one pass with constant time (per input) and bounded memory. To show its use for one practical application, we include most of the statistics that a securities trader might be interested in. The algorithms, however, are applicable to other disciplines,


such as sensor data processing and medicine. The algorithms themselves are a synthesis of existing techniques and a few ideas of our own. We divide our contributions into functional and algorithmic. Our functional contributions are:

1. We compute multi-stream statistics such as synchronous as well as time-delayed correlation and vector inner-product in a continuous online fashion. This means that if a statistic holds at time t, that statistic will be reported at time t + v, where v is a constant independent of the size and duration of the stream.

2. For any pair of streams, each pair-wise statistic is computed in an incremental fashion and requires constant time per update. This is done using a Discrete Fourier Transform approximation.

3. The approximation has a small error under natural assumptions.

4. Even when we monitor the data streams over sliding windows, no revisiting of the expiring data streams is needed.

5. The net result is that on a Pentium 4 PC, we can handle 10,000 streams each producing data every second with a delay window v of only 2 minutes.

Our algorithmic contributions mainly have to do with correlation statistics. First, we distinguish three time periods:

- timepoint: the smallest unit of time over which the system collects data, e.g., a second.

- basic window: a consecutive subsequence of timepoints over which the system maintains a digest incrementally, e.g., a few minutes.

- sliding window: a user-defined consecutive subsequence of basic windows over which the user wants statistics, e.g., an hour. The user might ask, "which pairs of stocks were correlated with a value of over 0.9 for the last hour?"

The use of the intermediate time interval that we call the basic window yields three advantages:

1. Results of user queries need not be delayed more than the basic window time. In our example, the user will be told about correlations between 2 PM and 3 PM by 3:02 PM and correlations between 2:02 PM and 3:02 PM by 3:04 PM.

2. Maintaining stream digests based on the basic window allows the computation of correlations over windows of arbitrary size, not necessarily a multiple of basic window size, as well as time-delayed correlations with high accuracy.

3. A short basic window gives fast response time, but fewer time series can then be handled.

A second algorithmic contribution is the grid structure, each of whose cells stores the hash function value of a stream. The structure itself is unoriginal, but the high efficiency we obtain from it is due to the fact that we are measuring correlation and have done the time decomposition mentioned above.

The remainder of this paper is organized as follows. The data we consider and the statistics we produce are presented in Section 2. Section 3 presents our algorithms for monitoring high speed time series data streams. Section 4 discusses the StatStream system. Section 5 presents our experimental results. Section 6 puts our work in the context of related work.

2 Data And Queries

2.1 Time Series Data Streams

We consider data entering as a time ordered series of triples (streamID, timepoint, value). Each stream consists of all those triples having the same streamID. (In finance, a streamID may be a stock, for example.) The streams are synchronized.

Each stream has a new value available at every periodic time interval, e.g., every second. We call the interval index the timepoint. For example, if the periodic time interval is a second and the current timepoint for all the streams is i, after one second all the streams will have a new value with timepoint i+1. (Note that if a stream has no value at a timepoint, a value will be assigned to that timepoint based on interpolation. If there are several values during a timepoint, then a summary value will be assigned to that timepoint.)

Let s_i or s[i] denote the value of stream s at timepoint i, and let s[i..j] denote the subsequence of stream s from timepoints i through j inclusive. s^i denotes the stream with streamID i. We also use t to denote the latest timepoint, i.e., now. The statistics we will monitor will be denoted stat(s^{i_1}_j, s^{i_2}_j, ..., s^{i_k}_j), j ∈ [p, q],

2.2 Temporal Spans

In the spirit of the work in [10, 11], we generalize the three kinds of temporal spans for which the statistics of time series are calculated.

1. Landmark windows: In this temporal span, statistics are computed based on the values between a specific timepoint, called the landmark, and the present. stat(s, landmark(k)) will be computed on the subsequence of time series s[i], i ≥ k. An unrestricted window is the special case k = 1; for an unrestricted window the statistics are based on all the available data.


2. Sliding windows: In financial applications, at least, a sliding window model is more appropriate for data streams. Given the length of the sliding window w and the current timepoint t, stat(s, sliding(w)) will be computed on the subsequence s[t−w+1..t].

3. Damped window model: In this model, recent sliding windows are more important than previous ones. For example, in the computation of a moving average, a sliding window model will compute the average as

    avg = (Σ_{i=t−w+1}^{t} s_i) / w

By contrast, in a damped window model the weights of data decrease exponentially into the past. For example, a moving average in a damped window model can be computed as follows:

    avg_new = avg_old · p + s_t · (1 − p),  0 < p < 1

Other statistics in a damped window model can be defined similarly.

In this paper, we will focus on the sliding window model, because it is the one used most often and is the most general.

2.3 Statistics To Monitor

Consider the stream s_i, i = 1, ..., w. The statistics we will monitor are

1. Single stream statistics, such as average, standard deviation, and best fit slope. These are straightforward.

2. Correlation coefficients:

    corr(s, r) = ( (1/w) Σ_{i=1}^{w} s_i r_i − s̄ r̄ ) / ( √((1/w) Σ_{i=1}^{w} (s_i − s̄)²) · √((1/w) Σ_{i=1}^{w} (r_i − r̄)²) )

3. Autocorrelation: the correlation of the series with itself at an earlier time.

4. Beta: the sensitivity of the values of a stream s to the values of another stream r (or weighted collection of streams). For example, in financial applications the beta measures the risk of a stock. A stock with a beta of 1.5 to the market index experiences 50 percent more movement in price than the market.

    beta(s, r) = ( (1/w) Σ_{i=1}^{w} s_i r_i − s̄ r̄ ) / ( (1/w) Σ_{i=1}^{w} (r_i − r̄)² )
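Both statistics depend only on a handful of running sums over the window (Σs_i, Σr_i, Σs_i r_i, Σs_i², Σr_i²), which is what makes the digest-based maintenance described later possible. A sketch of both definitions (the helper name is ours):

```python
import math

def corr_and_beta(s, r):
    """corr(s, r) and beta(s, r) over a window of size w, from running sums only."""
    w = len(s)
    mean_s, mean_r = sum(s) / w, sum(r) / w
    cov = sum(a * b for a, b in zip(s, r)) / w - mean_s * mean_r
    var_s = sum(a * a for a in s) / w - mean_s ** 2
    var_r = sum(b * b for b in r) / w - mean_r ** 2
    return cov / math.sqrt(var_s * var_r), cov / var_r

# r moves twice as much as s and in lockstep: corr = 1, beta of s against r = 0.5.
corr, beta = corr_and_beta([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```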

3 Statistics Over Sliding Windows

To compute the statistics over a sliding window, we will maintain a synopsis data structure for the stream to compute the statistics rapidly. To start, our framework subdivides the sliding windows equally into

[Figure 1 depicts a sliding window subdivided into basic windows S[0], ..., S[k−1], each carrying its digests, followed by the new basic window S[k].]

Figure 1: Sliding windows and basic windows

Table 1: Symbols

  w    the size of a sliding window
  b    the size of a basic window
  k    the number of basic windows within a sliding window
  n    the number of DFT coefficients used as digests
  N_s  the number of data streams

shorter windows, which we call basic windows, in order to facilitate the efficient elimination of old data and the incorporation of new data. We keep digests for both basic windows and sliding windows. For example, the running sum of the time series values within a basic window and the running sum within an entire sliding window belong to the two kinds of digests respectively. Figure 1 shows the relation between sliding windows and basic windows.

Let the data within a sliding window be s[t−w+1..t]. Suppose w = kb, where b is the length of a basic window and k is the number of basic windows within a sliding window. Let S[0], S[1], ..., S[k−1] denote the sequence of basic windows, where S[i] = s[(t−w)+ib+1..(t−w)+(i+1)b]. S[k] will be the new basic window and S[0] is the expiring basic window. The j-th value in the basic window S[i] is S[i, j]. Table 1 defines the symbols that are used in the rest of the paper.

The size of the basic window is important because it must be possible to report all statistics for basic window i to the user before basic window i+1 completes (at which point it will be necessary to begin computing the statistics for window i+1).

3.1 Single Stream Statistics

The single stream statistics such as moving average and moving standard deviation are computationally inexpensive. In this section, we discuss moving averages just to demonstrate the concept of maintaining digest information based on basic windows. Obviously, the information to be maintained for the moving average is Σ(s[t−w+1..t]). For each basic window S[i], we maintain the digest Σ(S[i]) = Σ_{j=1}^{b} S[i, j]. After b new data points from the stream become available, we compute the sum over the new basic window S[k]. The sum over the sliding window is updated as follows:

    Σ_new(s) = Σ_old(s) + Σ(S[k]) − Σ(S[0])
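The update rule above can be sketched with one sum digest per basic window kept in a queue (class and variable names are ours, not the paper's):

```python
from collections import deque

class SlidingSum:
    """Maintain the sum over s[t-w+1..t] with per-basic-window digests, w = k * b."""

    def __init__(self, k, b):
        self.k, self.b = k, b
        self.window_sums = deque()  # digest: the sum of each basic window S[i]
        self.current = []           # the basic window currently being filled
        self.total = 0.0            # sum over the whole sliding window

    def append(self, value):
        self.current.append(value)
        if len(self.current) == self.b:      # new basic window S[k] is complete
            digest = sum(self.current)
            self.current = []
            self.total += digest             # add the sum of S[k] ...
            self.window_sums.append(digest)
            if len(self.window_sums) > self.k:
                self.total -= self.window_sums.popleft()  # ... expire S[0]

sliding = SlidingSum(k=3, b=4)
for v in range(1, 17):
    sliding.append(float(v))
# The sliding window now covers values 5..16, so sliding.total == 126.0.
```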


3.2 Correlation Statistics

Correlation statistics are important in many applications. For example, Pairs Trading, also known as the correlation trading strategy, is widely employed by major Wall Street firms. This strategy focuses on trading pairs of equities that are correlated. The correlation between two streams (stocks) is affected by some factors that are not known a priori. Any pair of streams could be correlated at some time. Much effort has been made to find such correlations in order to enjoy arbitrage profits. These applications imply that the ability to spot correlations among a large number of streams in real time will provide competitive advantages. To make our task even more challenging, such correlations change over time and we have to update the moving correlations frequently.

The efficient identification of highly correlated streams potentially requires the computation of all pairwise correlations and could be proportional to the total number of timepoints in each sliding window times all pairs of streams. We make this computation more efficient by (1) using a Discrete Fourier Transform of basic windows to compute the correlation of stream pairs approximately; (2) using a grid data structure to avoid the approximate computation for most pairs.

We start with a quick review of the DFT. We will explain how to compute the vector inner-product when the two series are aligned in basic windows. This approach is then extended to the general case when the basic windows are not aligned. Finally, we will show our approach for reporting the highly correlated stream pairs in an online fashion.

3.3 Review of the Discrete Fourier Transform

We will follow the convention in [2]. The Discrete Fourier Transform of a time sequence x = x_0, x_1, ..., x_{w−1} is a sequence X = X_0, X_1, ..., X_{w−1} = DFT(x) of complex numbers given by

    X_F = (1/√w) Σ_{i=0}^{w−1} x_i e^{−j2πFi/w},  F = 0, 1, ..., w−1

where j = √−1. The inverse Fourier transform of X is given by

    x_i = (1/√w) Σ_{F=0}^{w−1} X_F e^{j2πFi/w},  i = 0, 1, ..., w−1

The following properties of the DFT can be found in any textbook on the subject.

- The DFT preserves the Euclidean distance between two sequences x and y:

    d(x, y) = d(X, Y)

- (Symmetry Property) If x is a real sequence, then

    X_i = X*_{w−i},  i = 1, 2, ..., w−1

where X* is the complex conjugate of X.

Since for most real time series the first few DFT coefficients contain most of the energy (Σ x_i²), we would then expect those coefficients to capture the raw shape of the time series [2, 9]. For example, the energy spectrum for the random walk series, which models stock movements, declines with a power of 2 with increasing coefficients. And for black noise, which successfully models series like the water level of a river as it varies over time, the energy spectrum declines even faster with increasing number of coefficients.
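This energy concentration is easy to observe numerically. A small sketch (assuming numpy is available; sizes and the seed are ours) measuring the fraction of a mean-removed random walk's energy captured by the first few coefficients and their mirror images:

```python
import numpy as np

w, n = 256, 8
rng = np.random.default_rng(0)
x = np.cumsum(rng.uniform(-0.5, 0.5, w))   # random walk, as in stock models
x = x - x.mean()                           # drop the DC term

X = np.fft.fft(x, norm="ortho")            # the 1/sqrt(w) DFT convention
energy = np.abs(X) ** 2
# Coefficients 1..n plus their conjugate mirrors w-n..w-1 (symmetry property).
low_energy = energy[1:n + 1].sum() + energy[w - n:].sum()
fraction = low_energy / energy.sum()
# For a random walk the spectrum decays roughly as 1/F^2, so `fraction` is near 1.
```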

3.4 Inner-product With Aligned Windows

The correlation and beta can be computed from the vector inner-product. The vector inner-product for two series x = (x_1, ..., x_w), y = (y_1, ..., y_w), denoted as (x, y), is just the sum of the products of their corresponding elements:

    (x, y) = Σ_{i=1}^{w} x_i y_i

Given two series s^x and s^y, when the two series are aligned,

    (s^x, s^y) = Σ_{i=1}^{k} (S^x[i], S^y[i])

So, we must explain how to compute the inner-product of two basic windows: x_1, x_2, ..., x_b and y_1, y_2, ..., y_b.

Let f: f_1(x), f_2(x), ... be a family of continuous functions. We approximate the time series in each basic window, S^x[i] = x_1, x_2, ..., x_b and S^y[i] = y_1, y_2, ..., y_b, with the function family f. (We will give specific examples later.)

    x_i ≈ Σ_{m=0}^{n−1} c^x_m f_m(i),  y_i ≈ Σ_{m=0}^{n−1} c^y_m f_m(i),  i = 1, ..., b

Here c^x_m, c^y_m, m = 0, ..., n−1, are the n coefficients used to approximate the time series with the function family f. The inner-product of the two basic windows is therefore

    Σ_{i=1}^{b} x_i y_i ≈ Σ_{i=1}^{b} ( Σ_{m=0}^{n−1} c^x_m f_m(i) ) ( Σ_{p=0}^{n−1} c^y_p f_p(i) )

    = Σ_{m=0}^{n−1} Σ_{p=0}^{n−1} c^x_m c^y_p ( Σ_{i=1}^{b} f_m(i) f_p(i) )


    = Σ_{m=0}^{n−1} Σ_{p=0}^{n−1} c^x_m c^y_p W(m, p)

where W(m, p) = Σ_{i=1}^{b} f_m(i) f_p(i) can be precomputed. If the function family f is orthogonal, we have

    W(m, p) = 0 if m ≠ p;  W(m, p) = V(m) ≠ 0 if m = p

Thus,

    Σ_{i=1}^{b} x_i y_i ≈ Σ_{m=0}^{n−1} c^x_m c^y_m V(m)

With this curve fitting technique, we reduce the space and time required to compute inner-products from b to n. It should also be noted that besides data compression, the curve fitting approach can be used to fill in missing data. This works naturally for computing correlations among streams with missing data at some timepoints.

The sine-cosine function families have the right properties. We perform the Discrete Fourier Transform for the time series over the basic windows, enabling a constant time computation of coefficients for each basic window. Following the observations in [2], we can obtain a good approximation for the series with only the first n DFT coefficients:

    x_i ≈ (1/√b) Σ_{F=0}^{n−1} X_F e^{j2πFi/b},  i = 1, 2, ..., b
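For the DFT family, orthogonality (Parseval's relation) turns the basic-window inner-product into a sum over coefficient products, and truncating to the first n coefficients gives the approximation. A sketch, assuming numpy, on a pair of windows whose energy sits at low frequencies (the test signal is ours):

```python
import numpy as np

b = 64
i = np.arange(b)
x = np.cos(2 * np.pi * i / b)                          # one slow cycle
y = np.cos(2 * np.pi * i / b) + 0.1 * np.cos(2 * np.pi * 20 * i / b)

X = np.fft.fft(x, norm="ortho")                        # matches the 1/sqrt(b) convention
Y = np.fft.fft(y, norm="ortho")

def approx_inner(X, Y, n):
    """Approximate the inner product from coefficients 0..n-1, folding in the
    mirror coefficients via the symmetry property X_{b-F} = X*_F for real series."""
    total = (X[0] * np.conj(Y[0])).real
    total += 2.0 * (X[1:n] * np.conj(Y[1:n])).real.sum()
    return float(total)

exact = float(np.dot(x, y))          # 32.0 for this pair
approx = approx_inner(X, Y, n=4)     # the first few coefficients already capture it
```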

3.5 Inner-product With Unaligned Windows

A much harder problem is to compute correlations with time lags. The time series will not necessarily be aligned at their basic windows. However, the digests we keep are enough to compute such correlations.

Without loss of generality, we will show the computation of the n-approximate lagged inner-product of two streams with time lags less than the size of a basic window. Given two such series, s^x = s^x_1, ..., s^x_w = S^x[1], ..., S^x[k] and s^y = s^y_{a+1}, ..., s^y_{a+w} = S^y[0], S^y[1], ..., S^y[k−1], S^y[k], where for the basic windows S^y[0] and S^y[k], only the last a values in S^y[0] and the first b−a values in S^y[k] are included (a < b). We have

    (s^x, s^y) = Σ_{i=1}^{w} s^x_i s^y_{a+i}

    = Σ_{j=1}^{k} [ Φ(S^x[j], S^y[j−1], a) + Φ(S^y[j], S^x[j], b−a) ]

where Φ(S1[p], S2[q], d) = Σ_{i=1}^{d} S1[p, i] S2[q, b−d+i]. For S1[p] = x_1, ..., x_b and S2[q] = y_1, ..., y_b, Φ(S1[p], S2[q], d) is the inner-product of the first d values of S1[p] with the last d values of S2[q]. This can be

approximated using only their first n DFT coefficients:

    Σ_{i=1}^{d} x_i y_{b−d+i} ≈ Σ_{i=1}^{d} ( Σ_{m=0}^{n−1} c^x_m f_m(i) ) ( Σ_{p=0}^{n−1} c^y_p f_p(b−d+i) )

    = Σ_{m=0}^{n−1} Σ_{p=0}^{n−1} c^x_m c^y_p ( Σ_{i=1}^{d} f_m(i) f_p(b−d+i) )

    = Σ_{m=0}^{n−1} Σ_{p=0}^{n−1} c^x_m c^y_p W(m, p, d)

This implies that if we precompute the table of

    W(m, p, d) = Σ_{i=1}^{d} f_m(i) f_p(b−d+i),  m, p = 0, ..., n−1;  d = 1, ..., ⌊b/2⌋

we can compute the inner-product using only the DFT coefficients, without requiring the alignment of basic windows, in time O(n²) for each basic window. Precomputation time for any specific displacement d is O(n²d).
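With the DFT family f_m(i) = e^{j2πmi/b}/√b, both the W(m, p, d) table and the lagged inner-product reconstruction can be sketched directly (0-based indices and variable names are ours). Keeping all b coefficients makes the reconstruction exact; the paper's approximation simply truncates the double sum to the first n coefficients:

```python
import numpy as np

b = 32
idx = np.arange(b)

def basis(m):
    """DFT basis function f_m(i) = e^{j2πmi/b} / sqrt(b)."""
    return np.exp(2j * np.pi * m * idx / b) / np.sqrt(b)

def W(m, p, d):
    """W(m, p, d): sum over i < d of f_m(i) f_p(b-d+i); data-independent, so precomputable."""
    return (basis(m)[:d] * basis(p)[b - d:]).sum()

rng = np.random.default_rng(1)
x = np.cumsum(rng.uniform(-0.5, 0.5, b))
y = np.cumsum(rng.uniform(-0.5, 0.5, b))
cx = np.fft.fft(x, norm="ortho")      # coefficients c^x_m
cy = np.fft.fft(y, norm="ortho")      # coefficients c^y_m

d = 5
direct = float((x[:d] * y[b - d:]).sum())   # first d of x against last d of y
recon = sum(cx[m] * cy[p] * W(m, p, d)
            for m in range(b) for p in range(b)).real
# With all b coefficients, recon equals direct to machine precision.
```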

Theorem 1 The n-approximate lagged inner-product of two time series using their first n DFT coefficients can be computed in time O(kn) if the two series have aligned basic windows; otherwise it takes time O(kn²), where k is the number of basic windows.

It is not hard to show that this approach can be extended to compute the inner-product of two time series over sliding windows of any size.

Corollary 1 The inner-product of two time series over sliding windows of any size with any time delay can be approximated using only the basic window digests of the data streams.

3.6 I/O Performance

It might be desirable to store the summary data for future analysis. Since the summary data we keep are sufficient to compute all the statistics we are interested in, there is no need to store the raw data streams. The summary data will be stored on disk sequentially in the order of basic windows.

Let N_s be the number of streams and n be the number of DFT coefficients we use (2n real numbers). The I/O cost to access the data for all streams within a specific period will be

    N_s · k · Sizeof(float) · (2 + 2n) / PageSize

while the I/O cost for the exact computation is

    N_s · k · Sizeof(float) · b / PageSize


The improvement is a ratio of b/(2+2n). The first 2 corresponds to two non-DFT elements of summary data: the sum and the sum of the squares of the time series in each basic window. Also, the I/O costs above assume sequential disk access. This is a reasonable assumption given the time-ordered nature of data streams.

3.7 Monitoring Correlations Between Data Streams

The above curve fitting technique can be used for computing inner-products and correlations over sliding windows of any size, as well as with any time delay across streams. A frequent goal is to discover streams with high correlations. To do online monitoring of synchronized streams over a fixed size of sliding window with correlation above a specific threshold, we use an approach based on the DFT and a hash technique that will report such stream pairs quickly.

First, we introduce the normalization of a series over a sliding window of size w, x_1, x_2, ..., x_w, as follows:

    x̂_i = (x_i − x̄) / σ_x,  i = 1, 2, ..., w

where

    σ_x = √( Σ_{i=1}^{w} (x_i − x̄)² )

As the following lemma suggests, the correlation coefficient of two time series can be reduced to the Euclidean distance between their normalized series [23].

Lemma 1 The correlation coefficient of two time series x_1, ..., x_w and y_1, ..., y_w is

    corr(x, y) = 1 − (1/2) d²(x̂, ŷ)

where d(x̂, ŷ) is the Euclidean distance between the normalized series x̂ and ŷ.

Proof First, we notice that

    Σ_{i=1}^{w} x̂_i² = Σ_{i=1}^{w} ŷ_i² = 1

    corr(x, y) = ( Σ_{i=1}^{w} (x_i − x̄)(y_i − ȳ) ) / ( √(Σ_{i=1}^{w} (x_i − x̄)²) · √(Σ_{i=1}^{w} (y_i − ȳ)²) ) = Σ_{i=1}^{w} x̂_i ŷ_i

    d²(x̂, ŷ) = Σ_{i=1}^{w} (x̂_i − ŷ_i)² = Σ_{i=1}^{w} x̂_i² + Σ_{i=1}^{w} ŷ_i² − 2 Σ_{i=1}^{w} x̂_i ŷ_i = 2 − 2 corr(x, y)
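Lemma 1 can be verified directly: normalize both windows as defined above and compare 1 − d²/2 with the textbook correlation coefficient (a sketch, assuming numpy; the example values are ours):

```python
import numpy as np

def normalize(x):
    """x̂_i = (x_i - x̄) / sqrt(Σ (x_i - x̄)²), so that the x̂_i² sum to 1."""
    centered = x - x.mean()
    return centered / np.sqrt((centered ** 2).sum())

x = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
y = np.array([2.0, 2.5, 1.0, 6.0, 5.0])

d2 = ((normalize(x) - normalize(y)) ** 2).sum()   # squared Euclidean distance
corr_from_distance = 1.0 - 0.5 * d2
corr_direct = float(np.corrcoef(x, y)[0, 1])
# The two values agree, as the lemma states.
```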

By reducing the correlation coefficient to Euclidean distance, we can apply the techniques in [2] to report sequences with correlation coefficients higher than a specific threshold.

Lemma 2 Let the DFT of the normalized forms of two time series x and y be X̂ and Ŷ respectively. Then

    corr(x, y) ≥ 1 − ε²  ⇒  d_n(X̂, Ŷ) ≤ ε

where d_n(X̂, Ŷ) is the Euclidean distance between the series X̂_1, X̂_2, ..., X̂_n and Ŷ_1, Ŷ_2, ..., Ŷ_n.

Proof As the DFT preserves Euclidean distance, we have

    d(X̂, Ŷ) = d(x̂, ŷ)

Using only the first n and the last n (n << w) DFT coefficients [2, 25], from the symmetry property of the DFT we have

    Σ_{i=1}^{n} (X̂_i − Ŷ_i)² + Σ_{i=1}^{n} (X̂_{w−i} − Ŷ_{w−i})² = 2 Σ_{i=1}^{n} (X̂_i − Ŷ_i)² = 2 d_n²(X̂, Ŷ) ≤ d²(X̂, Ŷ)

    corr(x, y) ≥ 1 − ε²  ⇒  d²(x̂, ŷ) ≤ 2ε²

    ⇒  d²(X̂, Ŷ) ≤ 2ε²  ⇒  d_n(X̂, Ŷ) ≤ ε

From the above lemma, we can examine the correlations of only those stream pairs for which d_n(X̂, Ŷ) ≤ ε holds. We will get a superset of the highly correlated pairs and there will be no false negatives. The false positives can be further filtered out, as we explain later. HierarchyScan [19] also uses correlation coefficients as a similarity measure for time series. They use F^{-1}{X*_i Y_i} as an approximation for the correlation coefficient. Because such approximations will not always be above the true values, some stream pairs could be discarded based on their approximations, even if their true correlations are above the threshold. Though HierarchyScan proposes an empirical method to select level-dependent thresholds for multi-resolution scans of sequences, it cannot guarantee the absence of false negatives.

We can extend the techniques above to report stream pairs with high-value negative correlations.

Lemma 3 Let the DFT of the normalized forms of two time series x and y be X̂ and Ŷ. Then

    corr(x, y) ≤ −1 + ε²  ⇒  d_n(−X̂, Ŷ) ≤ ε

Proof We have

    DFT(−x) = −DFT(x)

    corr(x, y) ≤ −1 + ε²  ⇒  Σ_{i=1}^{w} x̂_i ŷ_i ≤ −1 + ε²

    ⇒  Σ_{i=1}^{w} (−x̂_i) ŷ_i ≥ 1 − ε²  ⇒  d_n(−X̂, Ŷ) ≤ ε


Now we discuss how to compute the DFT coefficients X̂ incrementally. The DFT coefficients X̂ of the normalized sequence can be computed from the DFT coefficients X of the original sequence.

Lemma 4 Let X = DFT(x) and X̂ = DFT(x̂). We have

    X̂_0 = 0;  X̂_i = X_i / σ_x,  i ≠ 0

We can maintain DFT coefficients over sliding windows incrementally [13].

Lemma 5 Let X^old_m be the m-th DFT coefficient of the series in sliding window x_0, x_1, ..., x_{w−1}, and let X^new_m be that coefficient of the series x_1, x_2, ..., x_w. Then

    X^new_m = e^{j2πm/w} ( X^old_m + (x_w − x_0)/√w ),  m = 1, ..., n
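Lemma 5 is easy to check against a recomputed DFT (a sketch, assuming numpy; we update all w coefficients at once, where the system would keep only the first n):

```python
import numpy as np

w = 128
rng = np.random.default_rng(2)
series = rng.uniform(-0.5, 0.5, w + 1)           # x_0, ..., x_w

X_old = np.fft.fft(series[:w], norm="ortho")     # DFT of x_0, ..., x_{w-1}
m = np.arange(w)
# Lemma 5: X^new_m = e^{j2πm/w} (X^old_m + (x_w - x_0) / sqrt(w))
X_new = np.exp(2j * np.pi * m / w) * (X_old + (series[w] - series[0]) / np.sqrt(w))

X_direct = np.fft.fft(series[1:], norm="ortho")  # DFT of x_1, ..., x_w, recomputed
# X_new matches X_direct to machine precision.
```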

This can be extended to a batch update based on the basic windows.

Lemma 6 Let X^old_m be the m-th DFT coefficient of the series in sliding window x_0, x_1, ..., x_{w−1}, and let X^new_m be that coefficient of the series x_b, x_{b+1}, ..., x_{w+b−1}. Then

    X^new_m = e^{j2πmb/w} X^old_m + (1/√w) ( Σ_{i=0}^{b−1} e^{j2πm(b−i)/w} x_{w+i} − Σ_{i=0}^{b−1} e^{j2πm(b−i)/w} x_i ),  m = 1, ..., n

This lemma suggests that to update the DFT coefficients incrementally, we should keep the following n digests for the basic windows:

    δ_m = Σ_{i=0}^{b−1} e^{j2πm(b−i)/w} x_i,  m = 1, ..., n

By using the DFT on normalized sequences, we also map the original sequences into a bounded feature space.

Lemma 7 Let X̂_0, X̂_1, ..., X̂_{w−1} be the DFT of a normalized sequence x̂_1, x̂_2, ..., x̂_w. We have

    |X̂_i| ≤ √2/2,  i = 1, ..., n,  n < w/2

Proof

    Σ_{i=1}^{w−1} |X̂_i|² = Σ_{i=1}^{w} x̂_i² = 1

    ⇒  2 Σ_{i=1}^{n} |X̂_i|² = Σ_{i=1}^{n} ( |X̂_i|² + |X̂_{w−i}|² ) ≤ 1

    ⇒  |X̂_i| ≤ √2/2,  i = 1, ..., n

From the above lemma, each DFT coefficient ranges from −√2/2 to √2/2; therefore the DFT feature space R^{2n} is a cube of diameter √2. We can use a grid structure [4] to report near neighbors efficiently.

We will use the first n′, n′ ≤ 2n, dimensions of the DFT feature space for indexing. We superimpose an n′-dimensional orthogonal regular grid on the DFT feature space and partition the cube of diameter √2 into cells with the same size and shape. There are (2⌈√2/(2ε)⌉)^{n′} cells of cubes of diameter ε. Each stream is mapped to a cell based on its first n′ normalized DFT coefficients. Suppose a stream x is hashed to cell (c_1, c_2, ..., c_{n′}). To report the streams whose correlation coefficients with x are above the threshold 1 − ε², only streams hashed to cells adjacent to cell (c_1, c_2, ..., c_{n′}) need to be examined. Similarly, streams whose correlation coefficients with x are less than the threshold −1 + ε² must be hashed to cells adjacent to cell (−c_1, −c_2, ..., −c_{n′}). After hashing the streams to cells, the number of stream pairs to be examined is greatly reduced. We can then compute their Euclidean distance, as well as correlation, based on the first n DFT coefficients.

The grid structure can be maintained as follows. Without loss of generality, we discuss the detection of only positive high-value correlations. There are two kinds of correlations the user might be interested in.

- Synchronized correlation: If we are interested only in synchronized correlation, the grid structure is cleared at the end of every basic window. At the end of each basic window, all the data within the current basic window are available and the digests are computed. Suppose that stream x is hashed to cell c; then x will be compared to any stream that has been hashed to the neighborhood of c.

- Lagged correlation: If we also want to report lagged correlation, including autocorrelation, the maintenance of the grid structure will be a little more complicated. Let T_M be a user-defined parameter specifying the largest lag that is of interest. Each cell in the grid, as well as each stream hashed to the grid, will have a timestamp. The timestamp T_x associated with the hash value of the stream x is the time when x is hashed to the grid. The timestamp T_c of a cell c is the latest time when the cell c was updated. The grid is updated every basic window time but never globally cleared. Let S_c be the set of cells that are adjacent to c, including c. Here is the pseudo-code to update the grid when stream x hashes to cell c:


FOR ALL cells c_i ∈ S_c
    IF T_x − T_{c_i} > T_M            // T_{c_i} is the timestamp of c_i
        clear(c_i)                    // all the streams in c_i are out of date
    ELSE IF 0 < T_x − T_{c_i} ≤ T_M   // some streams in c_i are out of date
        FOR ALL streams y ∈ c_i
            IF T_x − T_y > T_M
                delete(y)
            ELSE
                examine correlation(x, y)
    ELSE                              // T_x = T_{c_i}
        FOR ALL streams y ∈ c_i
            examine correlation(x, y)
    T_{c_i} = T_x                     // indicate that c_i was just updated
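The pseudo-code above can be made concrete with a dictionary keyed by cell coordinates (a sketch with our own data structures and names; the paper's grid is higher-dimensional, and examining a correlation here just records the candidate pair). The T_x = T_c case is folded into the general expiry branch, which behaves identically:

```python
TM = 3                       # largest lag of interest, in basic-window units

grid = {}                    # cell -> (cell timestamp, {stream: stream timestamp})
candidates = []              # pairs whose correlation would be examined

def neighborhood(cell):
    """S_c: the cells adjacent to `cell`, including `cell` itself (2-D here)."""
    return [(cell[0] + dx, cell[1] + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]

def update(stream, cell, now):
    """Hash `stream` to `cell` at time `now` and scan its neighborhood."""
    for c in neighborhood(cell):
        t_c, members = grid.get(c, (None, {}))
        if t_c is not None and now - t_c > TM:
            members.clear()                      # every stream in c is out of date
        else:
            for y in list(members):
                if now - members[y] > TM:
                    del members[y]               # y's hash value has expired
                else:
                    candidates.append((stream, y))
        grid[c] = (now, members)                 # T_c = T_x
    grid[cell][1][stream] = now                  # record x's own hash value

update("A", (0, 0), now=0)
update("B", (0, 1), now=1)   # neighbor of A within the lag bound: pair recorded
update("C", (5, 5), now=2)   # far away in the grid: nothing to examine
update("D", (0, 0), now=9)   # A and B are older than TM by now, so no pair
```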

Theorem 2 Given a collection of time series streams, using only the digests of the streams, our algorithms can find those stream pairs whose correlations (whether synchronized or lagged) are above a threshold, without false negatives.

Using the techniques in this section, we can search for high-value lagged correlations among data streams very fast. The time lag must be a multiple of the size of a basic window in this case. Theorem 1 states that we can approximate the correlation for any time lag efficiently. The approximation methods are used as a post-processing step after the hashing methods spot the promising stream pairs.

3.8 Parallel Implementation

Our framework facilitates a parallel implementation by using a straightforward decomposition. Consider a network of K servers to monitor N_s streams. We assume these servers have similar computing resources.

The work to monitor the streams has two stages.

1. Compute the digests and single stream statistics for the data streams. The N_s streams are equally divided into K groups. Server i (i = 1, ..., K) will read those streams in the i-th group and compute their digests, single stream statistics, and hash values.

2. Report highly correlated stream pairs based on the grid structure. The grid structure is also geometrically and evenly partitioned into K parts. A server X will read in its part, a set of cells S_X. Server X will also read a set of cells S′_X including cells adjacent to the boundary cells in S_X. Server X will report those stream pairs that are highly correlated within cells in S_X. Note that only the first n normalized DFT coefficients need to be communicated between servers, thus reducing the overhead for communication.

4 StatStream System

StatStream runs in a high performance interpreted environment called K [1]. Our system makes use of this language's powerful array-based computation to achieve high speed in the streaming data environment. The system follows the algorithmic ideas above and makes use of the following parameters:

- Correlation Threshold: Only stream pairs whose absolute correlation coefficient is larger than a specified threshold will be reported. The higher this threshold, the finer the grid structure, and the fewer the streams whose exact correlations must be computed.

- Sliding Window Size: This is the time interval over which statistics are reported to the user. If the sliding window size is 1 hour, then the reported correlations are those over the past hour.

- Duration over Threshold: Some users might be interested in only those pairs with correlation coefficients above the threshold for a pre-defined period. For example, a trader might ask "Has the one hour correlation between two stocks been over 0.95 during the last 10 minutes?" This parameter provides such users with an option to specify a minimum duration. A longer duration of highly correlated streams indicates a stronger relationship between the streams, while a shorter one might indicate an accidental correlation. For example, a longer duration might give a stock market trader more confidence when taking advantage of such potential opportunities. A longer duration also gives better performance, because we can update the correlations less frequently.

- Range of Lagged Correlations: In addition to synchronized correlations, StatStream can also detect high-value lagged correlations. This parameter, i.e., T_M in Section 3.7, specifies the range of the lagged correlations. For example, if the range is 10 minutes and the basic window is 2 minutes, the system will examine cross-correlations and autocorrelations for streams with lags of 2, 4, 6, 8, and 10 minutes.

5 Empirical Study

Our empirical studies attempt to answer the followingquestions.

� How great are the time savings when using theDFT approximate algorithms as compared withexact algorithms? How many streams can theyhandle in real time?

� What's the approximation error when using DFTwithin each basic window to estimate correlation?How does it change according to the basic andsliding window sizes?


- What is the pruning power of the grid structure in detecting highly correlated pairs? What is its precision?

We perform the empirical study on the following two datasets on a 1.5GHz Pentium 4 PC with 128 MB of main memory.

• Synthetic Data The time series streams are generated using the random walk model. For each stream,

    s_i = 100 + sum_{j=1}^{i} (u_j - 0.5),  i = 1, 2, ...

where the u_j are uniform random real numbers in [0, 1].

• Stock Exchange Data The New York Stock Exchange (NYSE) Trade and Quote (TAQ) database provides intraday trade and quote data for all stocks listed on NYSE, AMEX, NASDAQ, and SmallCap issues. The database grows at the rate of 10GB per month. The historical data since 1993 have accumulated to 500GB. The data we use in our experiment are the tick data of the major stocks in a trading day. The 300 stocks in this dataset are heavily traded on NYSE. During the peak hours, there are several trades per second for each stock. We use the volume-weighted price as the price of that stock at that second. In a second when there is no trading activity for a particular stock, we use the last trading price as its price. In this way, all the stock streams are updated every second, each second corresponding to a timepoint. The sliding window varies from half an hour to two hours (1,800 to 7,200 timepoints). In practice the actual choice of the sliding window is up to the user. The lengths of the basic windows are half a minute to several minutes, depending on the number of streams to be monitored and the computing capacity.
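Both dataset constructions above are straightforward to sketch. The tick tuple format, stream count, and lengths below are illustrative assumptions, not the experiment's actual setup:

```python
import random

def random_walk_stream(length, seed=None):
    """Synthetic stream: s_i = 100 + sum_{j=1}^{i} (u_j - 0.5), u_j uniform in [0, 1)."""
    rng = random.Random(seed)
    s, values = 100.0, []
    for _ in range(length):
        s += rng.random() - 0.5
        values.append(s)
    return values

def per_second_prices(ticks, start, end):
    """ticks: (second, price, volume) tuples for one stock.  Returns one price
    per second in [start, end): volume-weighted within each second, with the
    last trading price carried forward through seconds that have no trades."""
    by_second = {}
    for sec, price, vol in ticks:
        pv, v = by_second.get(sec, (0.0, 0.0))
        by_second[sec] = (pv + price * vol, v + vol)
    prices, last = [], None
    for sec in range(start, end):
        if sec in by_second:
            pv, v = by_second[sec]
            last = pv / v
        prices.append(last)  # stays None until the first trade is seen
    return prices

streams = [random_walk_stream(3600, seed=k) for k in range(5)]
```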

5.1 Speed Measurement

Suppose that the streams have new data every second. The user of a time series stream system asks himself the following questions:

1. How many streams can I track at once in an online fashion? (Online means that even if the data come in forever, I can compute the statistics of the data with a fixed delay from their occurrence.)

2. How long is the delay between a change in correlation and the time when I see the change?

Our system will compute correlations at the end of each basic window. As noted above, the computation for basic window i must finish by the end of basic window i + 1 in order for our system to be considered online. Otherwise, it would lag farther and farther behind over time. Therefore, some of the correlations may be computed towards the end of basic window i + 1. The user perceives the size of the basic window as the maximum delay between the time that a change in correlation takes place and the time it is computed.

Figure 2: Comparison of the number of streams that the DFT and Exact method can handle

The net result is that the answers to questions (1) and (2) are related. We can increase the number of streams at the cost of increasing the delay in reporting correlation.

Figure 2 shows the number of streams vs. the minimum size of the basic window for a uniprocessor, with different algorithms. In the DFT method, we choose the number of coefficients to be 16.

Using the exact method, given the basic window size b, the time to compute the correlations among Ns streams with b new timepoints is T = k0 * b * Ns^2. Because the algorithm must finish this in b seconds, we have

    k0 * b * Ns^2 = b  =>  Ns = sqrt(1 / k0).

With the DFT-grid method, the work to monitor correlations has two parts: (1) updating digests takes time T1 = k1 * b * Ns; (2) detecting correlation based on the grid takes time T2 = k2 * Ns^2. To finish these two computations before the basic window ends, we need

    T1 + T2 = b.

Since T2 is the dominating term, we have

    Ns ≈ sqrt(b / k2).

Note that because of the grid structure, k2 << k0. Also, the computation with data digests is much more I/O efficient than the exact method on the raw streams. From the equation above, we can increase the number of streams monitored by increasing the basic window size, i.e., the delay time. This tradeoff between response time and throughput is confirmed in the figure. The number of streams handled by our system increases with the size of the basic window, while there is no perceivable change for the exact algorithm.
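The throughput model above is easy to explore numerically. The constants k0 and k2 below are made-up stand-ins for the machine-dependent per-pair costs, not measured values:

```python
import math

def max_streams_exact(k0):
    # k0 * b * Ns^2 = b  =>  Ns = sqrt(1 / k0), independent of b
    return math.sqrt(1.0 / k0)

def max_streams_dft_grid(b, k2):
    # T2 = k2 * Ns^2 dominates and must fit within b seconds => Ns ~ sqrt(b / k2)
    return math.sqrt(b / k2)

# Illustrative constants only; the grid makes detection far cheaper per pair (k2 << k0).
k0, k2 = 1e-5, 1e-7
for b in (25, 150):
    print(b, int(max_streams_exact(k0)), int(max_streams_dft_grid(b, k2)))
```

The exact method's capacity is a constant regardless of b, while the DFT-grid capacity grows with the basic window size, matching the tradeoff in the text.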

The wall clock time to find correlations using the DFT-grid method is much shorter than with the exact method (Figure 3a). The time is divided into two parts: detecting correlation and updating the digest (Figure 3b).

Figure 3: Comparison of the wall clock time

The experiments on the processing time for the DFT-grid method also provide a guideline on how to choose the size of the basic window. Given a specific computing capacity, Figure 4 shows the processing time for different basic window sizes, when the numbers of streams are 5,000 (Figure 4a) and 10,000 (Figure 4b). Note that the wall clock time to detect correlation does not change with the size of the basic window, while the wall clock time to update the digest is proportional to the size of the basic window. Therefore the processing time is linear in the size of the basic window. Because the processing time must be shorter than the basic window time for our system to be online, we can determine the minimum basic window size. From Figure 4, to monitor 5,000 streams the minimum basic window size is 25 seconds. Similarly, the basic window time should be no less than 150 seconds for 10,000 streams.

5.2 Precision Measurement

Because the approximate correlations are based on DFT curve fitting in each basic window, the precision of the computation depends on the size of the basic window. Our experiments on the two datasets (Figure 5) show that errors increase with larger basic window size and decrease with larger sliding window size, but remain small throughout. This is particularly noteworthy, because we used only the first 2 DFT coefficients in each basic window.
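To make the curve-fitting idea concrete, here is a hedged sketch of approximating a correlation from only the leading DFT coefficients of a single normalized window, via Parseval's theorem. This is a simplification for illustration, not the paper's incremental per-basic-window digest:

```python
import cmath
import math

def normalize(x):
    """Zero-mean, unit-norm version of x, so that dot product = correlation."""
    n = len(x)
    m = sum(x) / n
    d = [v - m for v in x]
    norm = math.sqrt(sum(v * v for v in d))
    return [v / norm for v in d]

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def exact_corr(x, y):
    return sum(a * b for a, b in zip(normalize(x), normalize(y)))

def approx_corr(x, y, m):
    """Correlation estimate keeping DFT coefficients 0..m-1 and their conjugate
    mirrors; by Parseval, dot(x, y) = (1/n) * sum_k X_k * conj(Y_k)."""
    n = len(x)
    X, Y = dft(normalize(x)), dft(normalize(y))
    kept = set(range(m)) | {n - k for k in range(1, m)}
    return sum((X[k] * Y[k].conjugate()).real for k in kept) / n
```

For random-walk-like streams, most spectral energy sits in the low frequencies, which is why a handful of coefficients per window suffices.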

Figure 4: Comparison of the wall clock time for different basic window sizes

Figure 5: Average approximation errors for correlation coefficients with different basic/sliding window sizes for synthetic (above) and real (below) datasets

We also performed experiments to test the effectiveness of the grid structure. The grid structure on the DFT feature space prunes out most of the low-correlated pairs of streams. The pruning power [17] is the number of pairs reported by the grid, divided by the number of all potential pairs. Since our system guarantees no false negatives, the reported pairs include all the highly correlated pairs and some false positives. We also measure the quality of the system by precision, which is the ratio of the number of pairs whose correlations are above the threshold and reported by StatStream, to the number of pairs reported by StatStream. Figure 6 shows the precision and pruning power using different numbers of coefficients and values of thresholds for different datasets. R0.85 indicates the real dataset with a threshold of 0.85, S0.9 indicates the synthetic dataset with a threshold of 0.9, etc. The length of the sliding window is one hour.

Figure 6: The precision and pruning power using different numbers of coefficients, thresholds and datasets

Table 2: Precision after post processing

    Dataset     R0.85   R0.85   R0.9    R0.9    S0.85
    Tolerance   0.001   0.0005  0.001   0.0005  0.0005
    Precision   0.9933  0.9947  0.9765  0.9865  0.9931
    Recall      1.0     0.9995  1.0     0.9987  1.0
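The grid-based pruning can be sketched as follows. This is a simplified, non-incremental version of the idea (streams whose feature vectors fall in the same or adjacent cells are the only candidate pairs); the cell size and neighbor policy here are assumptions, not StatStream's actual data structure:

```python
from collections import defaultdict
from itertools import combinations, product

def grid_candidates(points, cell):
    """points: {stream_id: feature vector of leading DFT coefficients}.
    Streams are hashed into cubic cells of side `cell`; any pair closer than
    `cell` in feature space must lie in the same or an adjacent cell, so all
    other pairs are pruned without computing their correlations."""
    grid = defaultdict(list)
    for sid, p in points.items():
        grid[tuple(int(c // cell) for c in p)].append(sid)
    cand = set()
    for key, ids in grid.items():
        # pairs within the same cell
        cand.update(frozenset(pair) for pair in combinations(ids, 2))
        # pairs spanning adjacent cells (frozensets deduplicate both directions)
        for off in product((-1, 0, 1), repeat=len(key)):
            if off == (0,) * len(key):
                continue
            nb = tuple(k + o for k, o in zip(key, off))
            for a in ids:
                for b in grid.get(nb, ()):
                    cand.add(frozenset((a, b)))
    return cand
```

The pruning power is then just the number of candidate pairs divided by the number of all possible pairs.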

Table 2 shows that recall can be traded off against precision in post processing. The user may specify a tolerance t such that the system will report those pairs with approximate correlation coefficients larger than threshold - t.
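The post-processing tradeoff can be sketched directly; the pair labels and correlation values below are invented for illustration:

```python
def report_with_tolerance(approx, exact, threshold, t):
    """approx/exact: {pair: correlation}.  Reporting pairs whose approximate
    correlation exceeds threshold - t trades precision for recall: a larger t
    catches more truly correlated pairs (higher recall) at the cost of more
    false positives (lower precision)."""
    reported = {p for p, c in approx.items() if c > threshold - t}
    truth = {p for p, c in exact.items() if c > threshold}
    precision = len(reported & truth) / len(reported) if reported else 1.0
    recall = len(reported & truth) / len(truth) if truth else 1.0
    return reported, precision, recall
```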

6 Related Work

There is increasing interest in data streams. In the theoretical literature, Datar et al. [7] study the problem of maintaining data stream statistics over sliding windows. Their focus is single-stream statistics. They propose an online data structure, the exponential histogram, that can be adapted to report statistics over sliding windows at every timepoint. They achieve this with limited memory at a tradeoff in accuracy. Our online basic window synopsis structure can report the precise single-stream statistics for those timepoints that fall at basic window boundaries, with a delay of the size of a basic window, but our multi-stream statistics also trade accuracy against memory and time.

Gehrke et al. [11] also study the problem of monitoring statistics over multiple data streams. The statistics they are interested in are different from ours. They compute correlated aggregates when the number of streams to be monitored is small. A typical query in phone call record streams is the percentage of international phone calls that are longer than the average duration of a domestic phone call. They use histograms as summary data structures for the approximate computation of correlated aggregates.

Recently, the data mining community has turned its attention to data streams. A domain-specific language, Hancock [6], has been designed at AT&T to extract signatures from massive transaction streams. Algorithms for constructing decision trees [8] and clustering [15] for data streams have been proposed. Recent work of Manku et al. [20] and Greenwald et al. [14] has focused on the problem of approximate quantile computation for individual data streams. Our work is complementary to the data mining research because our techniques for finding correlations can be used as inputs to similarity-based clustering algorithms.

The work by Yi et al. [28] on the online data mining of co-evolving time sequences is also complementary to our work. Our approximation algorithm can spot correlations among a large number of co-evolving time sequences quickly. Their method, MUSCLES, can then be applied to those highly correlated streams for linear regression-based forecasting of new values.

Time series problems in the database community have focused on discovering the similarity between an online sequence and an indexed database of previously obtained sequence information. Traditionally, the Euclidean similarity measure is used. The original work by Agrawal et al. [2] utilizes the DFT to transform data from the time domain into the frequency domain and uses a multidimensional index structure to index the first few DFT coefficients. In their work, the focus is on whole sequence matching. This was generalized to allow subsequence matching [9]. Rafiei and Mendelzon [24] improve this technique by allowing transformations, including shifting, scaling and moving average, on the time series before similarity queries. The distances between sequences are measured by the Euclidean distance plus the costs associated with these transformations. Our work differs from theirs in that (1) in [2, 9, 24, 19], the time series are all finite data sets and the focus is on similarity queries against a sequence database with a preconstructed index, whereas our work focuses on similarity detection in multiple online streams in real time; (2) we use the correlation coefficient as a distance measure, as does [19]. The correlation measure is invariant under shifting and scaling transformations. Correlation makes it possible to construct an efficient grid structure in a bounded DFT feature space. This is what enables us to give real-time results. [19] allows false negatives, whereas our method does not.

Other techniques such as the Discrete Wavelet Transform (DWT) [5, 26, 22, 12], Singular Value Decomposition (SVD) [18] and Piecewise Constant Approximation (PCA) [27, 17] have also been proposed for similarity search. Keogh et al. [17] compare these techniques for time series similarity queries. The performance of these techniques varies depending on the characteristics of the datasets, because no single transform can be optimal on all datasets. These techniques based on curve fitting are alternative ways of computing digests and could be used in our sliding window/basic window framework.

7 Conclusion

Maintaining multi-stream and time-delayed statistics in a continuous online fashion is a significant challenge in data management. Our paper solves this problem in a scalable way that gives a guaranteed response time with high accuracy.

The Discrete Fourier Transform technique reduces the enormous raw data streams to a manageable synoptic data structure and gives good I/O performance. For any pair of streams, the pair-wise statistic is computed in an incremental fashion and requires constant time per update using a DFT approximation. A sliding/basic window framework is introduced to facilitate the efficient management of streaming data digests. We reduce the correlation coefficient similarity measure to a Euclidean measure and make use of a grid structure to detect correlations among thousands of high-speed data streams in real time. Experiments conducted using synthetic and real data show that StatStream detects correlations efficiently and precisely.

References

[1] http://www.kx.com.

[2] R. Agrawal, C. Faloutsos, and A. N. Swami. Efficient similarity search in sequence databases. In Proceedings of the 4th International Conference on Foundations of Data Organization and Algorithms (FODO), pages 69-84, 1993.

[3] S. Babu and J. Widom. Continuous queries over data streams. SIGMOD Record, 30(3):109-120, 2001.

[4] J. L. Bentley, B. W. Weide, and A. C. Yao. Optimal expected-time algorithms for closest point problems. ACM Transactions on Mathematical Software (TOMS), 6(4):563-580, 1980.

[5] K.-P. Chan and A. W.-C. Fu. Efficient time series matching by wavelets. In Proceedings of the 15th International Conference on Data Engineering, Sydney, Australia, pages 126-133, 1999.

[6] C. Cortes, K. Fisher, D. Pregibon, and A. Rogers. Hancock: a language for extracting signatures from data streams. In ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, pages 9-17, 2000.

[7] M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining stream statistics over sliding windows. In SODA, to appear, 2002.

[8] P. Domingos and G. Hulten. Mining high-speed data streams. In ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, pages 71-80, 2000.

[9] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence matching in time-series databases. In Proc. ACM SIGMOD International Conf. on Management of Data, pages 419-429, 1994.

[10] V. Ganti, J. Gehrke, and R. Ramakrishnan. Demon: Data evolution and monitoring. In Proceedings of the 16th International Conference on Data Engineering, San Diego, California, 2000.

[11] J. Gehrke, F. Korn, and D. Srivastava. On computing correlated aggregates over continual data streams. In Proc. ACM SIGMOD International Conf. on Management of Data, 2001.

[12] A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In VLDB, pages 79-88, 2001.

[13] D. Q. Goldin and P. C. Kanellakis. On similarity queries for time-series data: Constraint specification and implementation. In Proceedings of the 1st International Conference on Principles and Practice of Constraint Programming (CP'95). Springer Verlag, 1995.

[14] M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In Proc. ACM SIGMOD International Conf. on Management of Data, pages 58-66, 2001.

[15] S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. In the Annual Symposium on Foundations of Computer Science, IEEE, 2000.

[16] E. Keogh and P. Smyth. A probabilistic approach to fast pattern matching in time series databases. In the Third Conference on Knowledge Discovery in Databases and Data Mining, 1997.

[17] E. J. Keogh, K. Chakrabarti, S. Mehrotra, and M. J. Pazzani. Locally adaptive dimensionality reduction for indexing large time series databases. In Proc. ACM SIGMOD International Conf. on Management of Data, 2001.

[18] F. Korn, H. V. Jagadish, and C. Faloutsos. Efficiently supporting ad hoc queries in large datasets of time sequences. In Proceedings ACM SIGMOD International Conference on Management of Data, pages 289-300, 1997.

[19] C.-S. Li, P. S. Yu, and V. Castelli. HierarchyScan: A hierarchical similarity search algorithm for databases of long sequences. In ICDE, pages 546-553, 1996.

[20] G. S. Manku, S. Rajagopalan, and B. G. Lindsay. Random sampling techniques for space efficient online computation of order statistics of large datasets. In Proc. ACM SIGMOD International Conf. on Management of Data, pages 251-262, 1999.

[21] L. Molesky and M. Caruso. Managing financial time series data: Object-relational and object database systems. In VLDB'98, Proceedings of the 24th International Conference on Very Large Data Bases, 1998.

[22] I. Popivanov and R. J. Miller. Similarity search over time series data using wavelets. In ICDE, 2002.

[23] D. Rafiei. On similarity-based queries for time series data. In ICDE, pages 410-417, 1999.

[24] D. Rafiei and A. Mendelzon. Similarity-based queries for time series data. In Proc. ACM SIGMOD International Conf. on Management of Data, pages 13-25, 1997.

[25] D. Rafiei and A. Mendelzon. Efficient retrieval of similar time sequences using DFT. In Proc. FODO Conference, Kobe, Japan, 1998.

[26] Y.-L. Wu, D. Agrawal, and A. E. Abbadi. A comparison of DFT and DWT based similarity search in time-series databases. In Proceedings of the 9th International Conference on Information and Knowledge Management, 2000.

[27] B.-K. Yi and C. Faloutsos. Fast time sequence indexing for arbitrary Lp norms. In Proceedings of the 26th International Conference on Very Large Data Bases, pages 385-394, 2000.

[28] B.-K. Yi, N. Sidiropoulos, T. Johnson, H. V. Jagadish, C. Faloutsos, and A. Biliris. Online data mining for co-evolving time sequences. In Proceedings of the 16th International Conference on Data Engineering, pages 13-22, 2000.