1
High Performance Correlation Techniques For Time Series
Xiaojian Zhao
Department of Computer Science
Courant Institute of Mathematical Sciences
New York University
25 Oct. 2004
2
Roadmap
Section 1: Introduction (Motivation; Problem Statement)
Section 2: Background (GEMINI framework; Random Projection; Grid Structure; Some Definitions; Naive Method and Yunyue's Approach)
Section 3: Sketch-based StatStream (Efficient Sketch Computation; Sketch Technique as a Filter; Parameter Selection; Grid Structure; System Integration)
Section 4: Empirical Study
Section 5: Future Work
Section 6: Conclusion
3
Section 1: Introduction
4
Motivation: stock price streams
The New York Stock Exchange (NYSE): 50,000 securities (streams); 100,000 ticks (trade and quote)
Pairs Trading, a.k.a. Correlation Trading
Query: "Which pairs of stocks were correlated with a value of over 0.9 for the last three hours?"
XYZ and ABC have been correlated with a correlation of 0.95 for the last three hours. Now XYZ and ABC become less correlated as XYZ goes up and ABC goes down. They should converge back later. I will sell XYZ and buy ABC ...
5
Online Detection of High Correlation
6
Why speed is important
As processors speed up, algorithmic efficiency no longer matters ... one might think.
True if problem sizes stayed the same, but they don't. As processors speed up, sensors improve -- satellites spew out a terabyte a day, magnetic resonance imagers produce higher-resolution images, and so on.
7
Problem Statement
Detect and report correlations rapidly and accurately
Expand the algorithm into a general engine
Apply it in many practical application domains
8
Big Picture
time series 1, time series 2, time series 3, ..., time series n
→ Random Projection →
sketch 1, sketch 2, ..., sketch n
→ Grid structure →
Correlated pairs
9
Section 2: Background
10
GEMINI framework*
* Faloutsos, C., Ranganathan, M. & Manolopoulos, Y. (1994). Fast subsequence matching in time-series databases. In Proceedings of the ACM SIGMOD Int'l Conference on Management of Data, Minneapolis, MN, May 25-27, pp. 419-429.
DFT, DWT, etc
11
Goals of the GEMINI framework
High performance: operations on synopses, such as distance computation, save time.
Guarantee of no false negatives: the feature space shrinks the original distances of the raw data space.
12
Random Projection: Intuition
You are walking in a sparse forest and you are lost. You have an outdated cell phone without a GPS. You want to know whether you are close to your friend. You identify yourself as 100 meters from the pointy rock, 200 meters from the giant oak, etc. If your friend is at similar distances from several of these landmarks, you might be close to one another. The sketches are the set of distances to the landmarks.
13
How to make a Random Projection*
Sketch pool: a list of random vectors drawn from a stable distribution (like the landmarks)
Project the time series onto the space spanned by these random vectors
The Euclidean distance (correlation) between time series is approximated by the distance between their sketches, with a probabilistic guarantee.
* W. B. Johnson and J. Lindenstrauss. "Extensions of Lipschitz mappings into a Hilbert space". Contemp. Math., 26:189-206, 1984.
14
Random Projection
Random vectors: R1 = (r11, r12, r13, ..., r1w), R2 = (r21, r22, r23, ..., r2w), R3 = (r31, r32, r33, ..., r3w), R4 = (r41, r42, r43, ..., r4w)
Raw time series: x = (x1, x2, x3, ..., xw), y = (y1, y2, y3, ..., yw)
Sketches (inner products with the random vectors): xsk = (xsk1, xsk2, xsk3, xsk4), ysk = (ysk1, ysk2, ysk3, ysk4)
In the forest analogy: x and y are the current positions, the random vectors are the landmarks (rocks, buildings, ...), and the sketches are the relative distances to those landmarks.
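To make the landmark analogy concrete, here is a small illustrative Python sketch (my own, not the talk's code; the sizes, seed, and Gaussian random vectors are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
w, k = 256, 30                    # window length and sketch size, as on later slides
R = rng.normal(size=(k, w))       # k random "landmark" vectors

x = rng.normal(size=w)
y = x + 0.1 * rng.normal(size=w)  # a series close to x

xsk = R @ x                       # sketch of x: inner product with each random vector
ysk = R @ y

d_raw = np.linalg.norm(x - y)                   # distance in the raw space
d_sk = np.linalg.norm(xsk - ysk) / np.sqrt(k)   # rescaled distance between sketches
```

With high probability, `d_sk` is close to `d_raw`, even though each sketch has only 30 coordinates instead of 256.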
15
Sketch Guarantees
Note: Sketches do not provide approximations of individual time series window but help make comparisons.
Johnson-Lindenstrauss Lemma: For any 0 < ε < 1 and any integer n, let k be a positive integer such that
k >= 4 (ε²/2 − ε³/3)⁻¹ ln n
Then for any set V of n points in R^d, there is a map f : R^d → R^k such that for all u, v in V:
(1 − ε) ||u − v||² <= ||f(u) − f(v)||² <= (1 + ε) ||u − v||²
Further, this map can be found in randomized polynomial time.
16
Sketches: Random Projection
Why do we use sketches (random projections)? To reduce the dimensionality!
For example: if the original time series x has length 256, we may represent it with a sketch vector of length 30.
This is the first step toward removing "the curse of dimensionality".
17
Achlioptas's lemma
Dimitris Achlioptas proved that:
Let P be an arbitrary set of n points in R^d, represented as an n × d matrix A. Given ε, β > 0, let
k0 = (4 + 2β) (ε²/2 − ε³/3)⁻¹ log n
For integer k >= k0, let R be a d × k random matrix with R(i, j) = r_ij, where {r_ij} are independent random variables drawn from either one of the two probability distributions shown on the next slide.
* Idea from Dimitris Achlioptas, "Database-friendly Random Projections", Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems.
18
Achlioptas's lemma
Let E = (1/√k) A R and let f map the i-th row of A to the i-th row of E. Then with probability at least 1 − n^(−β), for all u, v in P:
(1 − ε) ||u − v||² <= ||f(u) − f(v)||² <= (1 + ε) ||u − v||²
The two admissible distributions:
r_ij = +1 with probability 1/2, −1 with probability 1/2
or
r_ij = √3 × (+1 with probability 1/6, 0 with probability 2/3, −1 with probability 1/6)
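A minimal Python sketch of the two database-friendly distributions and the map E = (1/√k) A R (illustrative only; the sizes and seed are my choices):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, n = 256, 60, 20

# Distribution 1: r_ij = +1 or -1, each with probability 1/2
R1 = rng.choice([1.0, -1.0], size=(d, k))
# Distribution 2: r_ij = sqrt(3) * (+1 w.p. 1/6, 0 w.p. 2/3, -1 w.p. 1/6)
R2 = np.sqrt(3) * rng.choice([1.0, 0.0, -1.0], size=(d, k), p=[1/6, 2/3, 1/6])

A = rng.normal(size=(n, d))     # n points in R^d
E1 = A @ R1 / np.sqrt(k)        # E = (1/sqrt(k)) A R, as in the lemma
E2 = A @ R2 / np.sqrt(k)

# Distortion of one pairwise squared distance under each map
base = np.sum((A[0] - A[1]) ** 2)
ratio1 = np.sum((E1[0] - E1[1]) ** 2) / base
ratio2 = np.sum((E2[0] - E2[1]) ** 2) / base
```

Both ratios should be close to 1, reflecting the (1 ± ε) guarantee.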
19
Definition: Sketch Distance
dsk = ||xsk − ysk|| = sqrt( ((xsk1 − ysk1)² + (xsk2 − ysk2)² + ... + (xskk − yskk)²) / k )
Note: DFT and DWT distances are analogous. For those measures, the difference between the original vectors is approximated by the difference between the first Fourier/Wavelet coefficients of those vectors.
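The definition above translates directly into code; a tiny illustrative helper (mine, not the talk's implementation):

```python
import math

def sketch_distance(xsk, ysk):
    """dsk = sqrt( sum_i (xsk_i - ysk_i)^2 / k ), the definition on this slide."""
    k = len(xsk)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xsk, ysk)) / k)

print(sketch_distance([0.0, 0.0], [3.0, 4.0]))  # sqrt(25 / 2)
```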
20
Empirical Study : Sketch Approximation
[Chart: Return Data. X-axis: data points (1-976); Y-axis: distance; series: sketch and dist.]
21
Empirical Study: sketch distance/real distance
[Charts: distribution of the factor (real distance / sketch distance) as a percentage of distances, for sketch sizes 30, 80, and 1000. The factor ranges over roughly 0.88-1.25 for sizes 30 and 80, and concentrates in roughly 0.93-1.19 for size 1000.]
22
Grid Structure
x = (x1, x2, ..., xk)
23
Correlation and Distance
There is a relationship between Euclidean distance and Pearson correlation after normalization:
dist² = 2 (1 − correlation)
where each series is normalized within the sliding window:
X̂_i = (X_i − avg(X_sw)) / sqrt(var(X_sw))
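The identity dist² = 2(1 − correlation) can be checked numerically. A small Python sketch (mine, not the talk's code), normalizing each window to zero mean and unit Euclidean norm:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 0.8 * x + 0.2 * rng.normal(size=100)

def normalize(v):
    # subtract the window average, then scale to unit Euclidean norm
    v = v - v.mean()
    return v / np.linalg.norm(v)

xn, yn = normalize(x), normalize(y)
corr = np.corrcoef(x, y)[0, 1]     # Pearson correlation
dist2 = np.sum((xn - yn) ** 2)     # squared distance of the normalized series
```

With this normalization, `dist2` equals `2 * (1 - corr)` up to floating point error.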
24
How to compute the correlation efficiently?
Goal: To find the most highly correlated stream pairs over sliding windows
Naive method; StatStream method; our method
25
Naïve Approach
Space and time cost: space O(N), time O(N² · sw)
N: number of streams; sw: size of the sliding window
Let's see the StatStream approach
26
Definitions: Sliding window and Basic window
[Diagram: streams Stock 1, Stock 2, Stock 3, ..., Stock n along the time axis. Sliding window size = 8, basic window size = 2; each sliding window consists of basic windows, and each basic window of time points.]
27
StatStream Idea
Use the Discrete Fourier Transform (DFT) to approximate correlation, as in the GEMINI approach discussed earlier.
Every two minutes (the "basic window size"), update the DFT for each time series over the last hour (the "sliding window size").
Use a grid structure to filter out unlikely pairs.
28
StatStream: Stream synoptic data structure
[Diagram: a sliding window divided into basic windows; each basic window keeps digests (sum, DFT coefficients) that are updated as time points arrive.]
29
Section 3: Sketch based StatStream
30
Problem not yet solved
DFT approximates price-like data very well, but gives a poor approximation for returns: (today's price − yesterday's price) / yesterday's price.
Returns are more like white noise, which contains all frequency components.
DFT uses the first n (e.g., 10) coefficients to approximate the data, which is insufficient in the case of white noise.
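The contrast is easy to reproduce; an illustrative Python sketch (my own construction, not the talk's experiment) comparing how much energy the leading DFT coefficients capture for a random walk (price-like) versus white noise (return-like):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 256
white = rng.normal(size=n)   # return-like series: white noise
walk = white.cumsum()        # price-like series: random walk

def energy_ratio(x, m):
    """Fraction of one-sided spectral energy in the first m DFT coefficients."""
    power = np.abs(np.fft.rfft(x - x.mean())) ** 2
    return power[:m].sum() / power.sum()

walk_ratio = energy_ratio(walk, 10)    # close to 1: a few coefficients suffice
white_ratio = energy_ratio(white, 10)  # small: energy spread over all frequencies
```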
31
[Charts: Random Walk vs. White Noise. X-axis: the number of DFT coefficients (1-101); Y-axis: ratio over total energy. The random walk's ratio approaches 1 after only a few coefficients; the white noise's ratio grows roughly linearly with the number of coefficients.]
32
Big Picture Revisited
time series 1, time series 2, time series 3, ..., time series n
→ Random Projection →
sketch 1, sketch 2, ..., sketch n
→ Grid structure →
Correlated pairs
Random Projection: inner product between each data vector and the random vectors
33
How to compute the sketch efficiently
We will not compute the inner product at each data point, because that computation is expensive. Instead, a new strategy, developed in joint work with Richard Cole, is used to compute the sketch. Here the random variables are drawn from:
r_ij = +1 with probability 1/2, −1 with probability 1/2
34
How to construct the random vector: given time series X = (x1, x2, ...), compute its sketch for a window of size sw = 12.
Partition the sliding window into smaller basic windows of size bw = 4.
The random vector within a basic window is R = (r1, r2, r3, r4) with r_i = 1 or −1, and a control vector b = (b1, b2, b3) is used to determine whether each basic window is multiplied by −1 or 1 (why? wait ...).
A final complete random vector may look like:
(1 1 -1 1; -1 -1 1 -1; 1 1 -1 1), where bw = (1 1 -1 1) and b = (1 -1 1)
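The construction above can be sketched in a few lines of Python (a hypothetical helper of my own, not the authors' code):

```python
def build_random_vector(bw_pattern, b):
    """Repeat the basic-window pattern once per basic window,
    flipping the sign of each copy according to the control vector b."""
    return [sign * r for sign in b for r in bw_pattern]

full = build_random_vector([1, 1, -1, 1], [1, -1, 1])
print(full)  # [1, 1, -1, 1, -1, -1, 1, -1, 1, 1, -1, 1], the slide's example
```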
35
Naive algorithm and hope for improvement
There is redundancy in the second dot product given the first one. We will eliminate the repeated computation to save time.
Dot product:
r = (1 1 -1 1; -1 -1 1 -1; 1 1 -1 1), x = (x1 x2 x3 x4; x5 x6 x7 x8; x9 x10 x11 x12)
xsk = r · x = x1+x2-x3+x4 -x5-x6+x7-x8 +x9+x10-x11+x12
As new data arrive, such operations are done all over again:
r = (1 1 -1 1; -1 -1 1 -1; 1 1 -1 1), x' = (x5 x6 x7 x8; x9 x10 x11 x12; x13 x14 x15 x16)
xsk' = r · x' = x5+x6-x7+x8 -x9-x10+x11-x12 +x13+x14-x15+x16
36
Our algorithm (pointwise version)
Convolve each basic window X_bw with the corresponding R_bw after padding with |bw| zeros.
conv1: (1 1 -1 1 0 0 0 0) * (x1, x2, x3, x4)
conv2: (1 1 -1 1 0 0 0 0) * (x5, x6, x7, x8)
conv3: (1 1 -1 1 0 0 0 0) * (x9, x10, x11, x12)
[Animation: sliding the pattern over (x1, x2, x3, x4) produces the partial sums x4; x4+x3; x2+x3-x4; x1+x2-x3+x4; x1-x2+x3; x2-x1; x1.]
37
Our algorithm: example
First convolution: x4; x4+x3; x2+x3-x4; x1+x2-x3+x4; x1-x2+x3; x2-x1; x1
Second convolution: x8; x8+x7; x6+x7-x8; x5+x6-x7+x8; x5-x6+x7; x6-x5; x5
Third convolution: x12; x12+x11; x10+x11-x12; x9+x10-x11+x12; x9-x10+x11; x10-x9; x9
The sketch of each sliding window is assembled by adding the appropriate entries of consecutive convolutions.
38
Our algorithm: example
First sliding window: sk1 = (x1+x2-x3+x4), sk5 = (x5+x6-x7+x8), sk9 = (x9+x10-x11+x12); with b = (1, -1, 1): xsk1 = (x1+x2-x3+x4) - (x5+x6-x7+x8) + (x9+x10-x11+x12)
Second sliding window: sk2 = (x2+x3-x4) + (x5), sk6 = (x6+x7-x8) + (x9), sk10 = (x10+x11-x12) + (x13); summing up with b = (1, -1, 1): xsk2 = (x2+x3-x4+x5) - (x6+x7-x8+x9) + (x10+x11-x12+x13)
In general the sketch is (sk1, sk5, sk9) · (b1, b2, b3), where · is the inner product.
39
Our algorithm
The projection of a sliding window is decomposed into operations over basic windows
Each basic window is convolved with each random vector only once
We may provide the sketches incrementally starting from each data point.
There is no redundancy.
40
Jump by a basic window (basic window version)
If time series are highly correlated between two consecutive data points, we may compute the sketch only once per basic window.
That is, we update the sketch for each time series only when the data of a complete basic window arrive.
Pattern (1 1 -1 1) applied to each basic window of x1 ... x12 gives:
x1+x2-x3+x4, x5+x6-x7+x8, x9+x10-x11+x12
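The basic window version can be checked against the naive dot product in a few lines of Python (an illustrative sketch of my own, not the talk's implementation):

```python
import numpy as np

rng = np.random.default_rng(4)
bw_pat = np.array([1, 1, -1, 1])   # random pattern within a basic window
b = np.array([1, -1, 1])           # control vector over basic windows
x = rng.normal(size=12)            # one sliding window: sw = 12, bw = 4

# Naive: dot product with the full sign-extended random vector
r = np.concatenate([sign * bw_pat for sign in b])
naive = r @ x

# Basic window version: dot each basic window with the pattern once,
# then combine the three partial results using the control vector b
per_bw = np.array([bw_pat @ x[i * 4:(i + 1) * 4] for i in range(3)])
fast = b @ per_bw
```

The two values are identical; the saving comes from reusing `per_bw` across sliding windows instead of recomputing the full dot product.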
41
Online Version
We take the basic window version as an example. Review: to have the same baseline, we normalize the time series within each sliding window. Challenge: the normalization of the time series changes with each basic window.
42
Online Version
Its incremental nature requires an update of the average and variance whenever a new basic window enters.
Do we have to recompute the normalization, and thus the sketch, whenever a new basic window enters?
Of course not; otherwise our algorithm would degrade into the trivial computation.
43
Online Version
Then how? After some mathematical manipulation, we claim that we only need to store and maintain the following quantities:
Sum over the whole sliding window: Σ_{i=0}^{sw−1} X_i
Sum of the square of each datum in the sliding window: Σ_{i=0}^{sw−1} X_i²
Sum over the whole basic window: Σ_{i=0}^{bw−1} X_i
Sum of the square of each datum in the basic window: Σ_{i=0}^{bw−1} X_i²
Dot product of the random vector with each basic window: X_bw · R
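The first four quantities support an incremental normalization; a minimal Python sketch (my own illustration of the idea, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(5)
sw, bw = 12, 4
stream = rng.normal(size=40)

# Running sums over the first sliding window
s = stream[:sw].sum()
s2 = (stream[:sw] ** 2).sum()

# Slide forward one basic window at a time, updating the sums incrementally
for start in range(bw, len(stream) - sw + 1, bw):
    old = stream[start - bw:start]             # basic window leaving
    new = stream[start + sw - bw:start + sw]   # basic window entering
    s += new.sum() - old.sum()
    s2 += (new ** 2).sum() - (old ** 2).sum()

mean = s / sw
var = s2 / sw - mean ** 2   # enough to renormalize the current window on the fly
```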
44
Performance comparison
Naive algorithm: for each datum and random vector, O(|sw|) integer additions.
Pointwise version: asymptotically, for each datum and random vector, (1) O(|sw|/|bw|) integer additions, (2) O(log |bw|) floating point operations (using the FFT to compute convolutions).
Basic window version: asymptotically, for each basic window and random vector, (1) O(|sw|/|bw|) integer additions, (2) O(|bw|) floating point operations.
45
Sketch distance filter quality
We may use the sketch distance to filter out unlikely data pairs.
How accurate is it? How does it compare to the DFT and DWT distances in terms of approximation ability?
46
Empirical Study: sketch distance compared to DFT and DWT distances
Data length = 256. DFT: the first 14 DFT coefficients are used in the distance computation. DWT: the db2 wavelet is used with coefficient size = 16. Sketch: the number of random vectors is 64.
47
Empirical Comparison: DFT, DWT and Sketch
[Chart: Price Data. X-axis: data points (1-981); Y-axis: distance; series: sketch dist, dwt, dft.]
48
Empirical Comparison : DFT, DWT and Sketch
[Chart: Return Data. X-axis: data points (1-911); Y-axis: distance; series: dft, dwt, sketch dist.]
49
Use the sketch distance as a filter
We may compute the sketch distance
dsk = ||xsk − ysk|| = sqrt( ((xsk1 − ysk1)² + ... + (xskk − yskk)²) / k )
and accept a pair when |xski − yski| <= c · dist.
c could be 1.2 or larger to reduce the number of false negatives.
Finally, any surviving candidate pair will be double-checked with the raw data.
50
Use the sketch distance as a filter
But we will not use it. Why? It is expensive: we still have to do the pairwise comparison between each pair of stocks, which is O(n²k), where k is the size of the sketches.
51
Sketch unit distance
Given sketches xsk = (xsk1, xsk2, ..., xsk8) and ysk = (ysk1, ysk2, ..., ysk8), consider the coordinate-wise distances |xsk1 − ysk1|, |xsk2 − ysk2|, ..., |xsk8 − ysk8|.
If f of these distance chunks have |xski − yski| <= c · dist, we may say ||x − y|| <= dist, where f: 30%, 40%, 50%, 60%, ... and c: 0.8, 0.9, 1.1, ...
52
Further: sketch groups
We may partition the sketches xsk = (xsk1, ..., xsk8, ...) and ysk = (ysk1, ..., ysk8, ...) into groups g1, g2, g3, ... and compute a sketch distance per group:
dsk_gi = ||xsk_gi − ysk_gi||
For example, with groups of size 4:
dsk_g1 = sqrt( ((xsk1 − ysk1)² + (xsk2 − ysk2)² + (xsk3 − ysk3)² + (xsk4 − ysk4)²) / 4 )
If f sketch groups have |dsk_gi| <= c · dist, we may say ||x − y|| <= dist, where f: 30%, 40%, 50%, 60% and c: 0.8, 0.9, 1.1, ...
Grid Structure
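A small Python illustration of this group-based filtering criterion (an illustrative sketch with hypothetical parameter values, not the authors' code):

```python
import math

def group_filter(xsk, ysk, g, c, f, dist):
    """Report a pair as potentially close if at least a fraction f of
    sketch groups (each of size g) have group distance within c * dist."""
    k = len(xsk)
    ok = 0
    for start in range(0, k, g):
        d2 = sum((xsk[i] - ysk[i]) ** 2 for i in range(start, start + g))
        if math.sqrt(d2 / g) <= c * dist:
            ok += 1
    return ok >= f * (k // g)

# Two nearly identical sketches pass; two far-apart sketches do not.
print(group_filter([1.0] * 8, [1.2] * 8, g=2, c=1.1, f=0.5, dist=0.2))  # True
print(group_filter([0.0] * 8, [5.0] * 8, g=2, c=1.1, f=0.5, dist=0.2))  # False
```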
53
Optimization in Parameter Space
Next, how to choose the parameters g, c, f, N?
N: total number of sketches; g: group size; c: the factor of distance; f: the fraction of groups required to claim that two time series are close enough.
54
Optimization in Parameter Space
Essentially, we prepare several groups of good parameter candidates and choose the best one to apply to the practical data.
But how to select the good candidates? Combinatorial Design (CD) and Bootstrapping.
55
Combinatorial Design
The pairwise combinations of all the parameters. Informally: each parameter value appears together with each value of every other parameter in some parameter group.
P: P1, P2, P3
Q: Q1, Q2, Q3, Q4
R: R1, R2
All combinations: #P × #Q × #R = 24 groups
Combinatorial Design: 12 groups*
* http://www.cs.nyu.edu/cs/faculty/shasha/papers/comb.html
56
Combinatorial Design
Much smaller test space compared to that of all parameter combinations
We will further reduce the test space by taking advantage of continuity of recall and precision in parameter space.
[Charts: precision and recall surfaces over different parameter groups, with f ranging over 0.1-0.6 and c over 0.1-1.3; both vary smoothly across the parameter space.]
57
Combinatorial Design
We will employ a coarse-to-fine strategy:
N: 30, 36, 40, 60; g: 1, 2, 3; c: 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 1.1, 1.2, 1.3; f: 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1
When good parameters are located, their local neighbors are searched further for better solutions.
58
Bootstrapping
Choose parameters with stable performance on both sample data and real data.
A sample set with 2,000,000 pairs. From it, choose with replacement 20,000 samples, 100 times. Compute the recall and precision each time.
59
Bootstrapping
From the 100 recalls and precisions, compute the mean and standard deviation. Criterion for good parameters:
Mean(recall) − std(recall) > Threshold(recall)
Mean(precision) − std(precision) > Threshold(precision)
If no such parameters exist, enlarge the replacement sample size.
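The bootstrap criterion can be sketched in Python (an illustrative stand-in; the synthetic outcomes, seed, and 0.95 recall rate are my assumptions, not the talk's data):

```python
import numpy as np

rng = np.random.default_rng(6)
# Stand-in sample: 1 = correlated pair correctly reported, 0 = missed
# (a hypothetical proxy for the 2,000,000-pair sample on the slide)
outcomes = (rng.random(20000) < 0.95).astype(float)

# Choose with replacement 20,000 samples, 100 times; compute recall each time
recalls = [rng.choice(outcomes, size=outcomes.size, replace=True).mean()
           for _ in range(100)]

m, s = np.mean(recalls), np.std(recalls)
stable = (m - s) > 0.9   # criterion: mean(recall) - std(recall) > threshold
```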
60
Parameter Selection
N    f    c     mean_rec  std_rec  mean_prec  std_prec
60   0.4  0.45  1         0        0.042      0.0074
60   0.4  0.46  1         0        0.038      0.0069
60   0.4  0.47  0.998     0.0054   0.035      0.0065
60   0.5  0.55  1         0        0.056      0.0093
60   0.5  0.56  1         0        0.052      0.0088
61
Preferred data distributions
The distribution of the data affects the performance of our algorithm (recall the price vs. return data).
The ideal data distribution: the number of points within distance d of a query point stays small, bounded by a small constant C, so that pairwise distances are spread out rather than concentrated.
Generally, the less human intervention, the better; the "green" data give much better results.
62
Empirical Study: Various data types
Cstr: Continuous stirred tank reactor
Foetal_ecg: Cutaneous potential recordings of a pregnant woman
Steamgen: Model of a steam generator at Abbott Power Plant in Champaign IL
Winding: Data from a test setup of an industrial winding process
Evaporator: Data from an industrial evaporator
Wind: Daily average wind speeds for 1961-1978 at 12 synoptic meteorological stations in the Republic of Ireland
Spot_exrates: The spot foreign currency exchange rates
EEG: Electroencephalogram
63
Empirical Study: Data distribution
[Charts: histograms of pairwise distances. Price data: distances spread over roughly 1-31. Return data: distances concentrated around 18-25. cstr data: distances spread over roughly 4-30.]
64
Grid Structure
Critical: the largest sketch value, useful for the normalization needed to fit sketches into the grid structure.
Our small lemma: Max(sketch unit) <= sqrt(Size of Sliding Window)
65
Grid Structure
High correlation => closeness in the vector space.
To avoid checking all pairs, we can use a grid structure and look in the neighborhood; this returns a superset of the highly correlated pairs.
The pairs labeled as "potential" are double-checked using the raw data vectors.
The pruning power: the percentage of pairs filtered out as impossible to be close.
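A minimal Python sketch of the neighborhood lookup (my own illustration, assuming two-dimensional sketch coordinates and a hypothetical cell size; not the talk's implementation):

```python
from collections import defaultdict
from itertools import product

def grid_candidates(points, cell):
    """Map each sketch vector to a grid cell of side `cell`; candidate pairs
    are those in the same or adjacent cells, a superset of all close pairs."""
    grid = defaultdict(list)
    for name, p in points.items():
        grid[tuple(int(v // cell) for v in p)].append(name)
    cands = set()
    for key, members in grid.items():
        for off in product((-1, 0, 1), repeat=len(key)):
            nb = tuple(k + o for k, o in zip(key, off))
            for a in members:
                for b in grid.get(nb, []):
                    if a < b:
                        cands.add((a, b))
    return cands

pts = {"X": (0.1, 0.2), "Y": (0.3, 0.1), "Z": (5.0, 5.0)}
print(grid_candidates(pts, cell=1.0))  # {('X', 'Y')}: Z is pruned away
```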
66
Inner product with random vectors r1,r2,r3,r4,r5,r6
X: xsk = (xsk1, xsk2, xsk3, xsk4, xsk5, xsk6)
Y: ysk = (ysk1, ysk2, ysk3, ysk4, ysk5, ysk6)
Z: zsk = (zsk1, zsk2, zsk3, zsk4, zsk5, zsk6)
67
Each sketch is split into pairs (xsk1, xsk2), (xsk3, xsk4), (xsk5, xsk6), and likewise (ysk1, ysk2), ..., (zsk5, zsk6); each pair is indexed into a two-dimensional grid structure.
68
System Integration
By combining the sketch scheme with the grid structure, we can:
Reduce dimensionality
Eliminate unnecessary pair comparisons
The performance improves substantially.
69
Empirical Study: Speed
[Chart: comparison of processing time, wall clock seconds (0-600) vs. number of streams (200-2200), for sketch_random, sketch_randomwalk, and exact. Sliding window = 3616, basic window = 32, sketch size = 60.]
70
Empirical Study: Breakdown
[Chart: processing time of random walk data, wall clock seconds (up to 160) vs. number of streams (500-5500), split into Detecting Correlation and Updating Sketches.]
71
Empirical Study: Breakdown
[Chart: processing time of random data, wall clock seconds (up to 35) vs. number of streams (500-5500), split into Detecting Correlation and Updating Sketches.]
72
The Pruning Power of the Grid Structure
[Chart: processing time, wall clock seconds (0-3000), by data type and size, comparing grid2, grid3, dft, and scan.]
73
Visualization
74
Other applications
Cointegration Test Matching Pursuit Anomaly Detection
75
Cointegration Test
A linear combination of several non-stationary time series can be stationary.
Models the long-run relationship, as opposed to the correlation.
StatStream may be applied to test the stationarity condition of cointegration.
76
Matching Pursuit
Decompose a signal into a group of non-orthogonal sub-components.
Test the correlation among atoms in a dictionary.
Expedite the component selection.
77
Anomaly Detection
Measure the relative distance of each point from its nearest neighbors.
StatStream may serve as a monitor by reporting those points far from any normal points.
78
Conclusion
Introduction; GEMINI Framework; Random Projection; StatStream Review; Efficient Sketch Computation; Parameter Selection; Grid Structure; System Integration; Empirical Study; Future Work
79
Thanks a lot!
80
Recall and Precision
Recall = C/A, Precision = C/B
A: query ball; B: returned result; C: intersection
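The two measures translate directly to code; a tiny illustrative helper (with hypothetical point names):

```python
def recall_precision(query_ball, returned):
    """Recall = |C|/|A|, Precision = |C|/|B|, where A is the true query ball,
    B the returned result, and C their intersection."""
    A, B = set(query_ball), set(returned)
    C = A & B
    return len(C) / len(A), len(C) / len(B)

rec, prec = recall_precision({"p1", "p2", "p3", "p4"}, {"p1", "p2", "p5"})
print(rec, prec)  # 0.5 and 2/3
```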