1
High Performance Correlation Techniques For Time Series
Xiaojian Zhao
Department of Computer Science
Courant Institute of Mathematical Sciences
New York University
25 Oct. 2004
2
Roadmap
Section 1: Introduction (Motivation; Problem Statement)
Section 2: Background (GEMINI framework; Random Projection; Grid Structure; Some Definitions; Naive Method and Yunyue's Approach)
Section 3: Sketch-based StatStream (Efficient Sketch Computation; Sketch Technique as a Filter; Parameter Selection; Grid Structure; System Integration)
Section 4: Empirical Study
Section 5: Future Work
Section 6: Conclusion
3
Section 1: Introduction
4
Motivation: stock price streams
The New York Stock Exchange (NYSE): 50,000 securities (streams); 100,000 ticks (trade and quote)
Pairs Trading, a.k.a. Correlation Trading
Query: "Which pairs of stocks were correlated with a value of over 0.9 for the last three hours?"
XYZ and ABC have been correlated with a correlation of 0.95 for the last three hours. Now XYZ and ABC become less correlated as XYZ goes up and ABC goes down. They should converge back later. I will sell XYZ and buy ABC ...
5
Online Detection of High Correlation
6
Why speed is important
As processors speed up, algorithmic efficiency no longer matters ... one might think.
True if problem sizes stayed the same, but they don't. As processors speed up, sensors improve -- satellites spew out a terabyte a day, magnetic resonance imagers produce higher-resolution images, and so on.
7
Problem Statement
Detect and report correlations rapidly and accurately
Expand the algorithm into a general engine
Apply it in many practical application domains
8
Big Picture
time series 1, time series 2, time series 3, ..., time series n
→ Random Projection →
sketch 1, sketch 2, ..., sketch n
→ Grid structure →
Correlated pairs
9
Section 2: Background
10
GEMINI framework*
* Faloutsos, C., Ranganathan, M. & Manolopoulos, Y. (1994). Fast subsequence matching in time-series databases. In Proceedings of the ACM SIGMOD Int'l Conference on Management of Data, Minneapolis, MN, May 25-27, pp. 419-429.
DFT, DWT, etc
11
Goals of the GEMINI framework
High performance: operations on synopses, such as distance computation, save time.
Guarantee of no false negatives: the feature space shrinks the original distances of the raw data space.
12
Random Projection: Intuition
You are walking in a sparse forest and you are lost. You have an outdated cell phone without a GPS. You want to know whether you are close to your friend. You identify yourself as 100 meters from the pointy rock, 200 meters from the giant oak, etc. If your friend is at similar distances from several of these landmarks, you might be close to one another. The sketches are the set of distances to the landmarks.
13
How to make a Random Projection*
Sketch pool: a list of random vectors drawn from a stable distribution (like the landmarks)
Project the time series onto the space spanned by these random vectors
The Euclidean distance (correlation) between time series is approximated by the distance between their sketches, with a probabilistic guarantee.
* W. B. Johnson and J. Lindenstrauss. "Extensions of Lipschitz mappings into a Hilbert space". Contemp. Math., 26:189-206, 1984.
14
Random Projection
Random vectors: R1 = (r11, r12, r13, ..., r1w), R2 = (r21, r22, r23, ..., r2w), R3 = (r31, r32, r33, ..., r3w), R4 = (r41, r42, r43, ..., r4w)
Raw time series: x = (x1, x2, x3, ..., xw), y = (y1, y2, y3, ..., yw)
Sketches (inner products with the random vectors): xsk = (xsk1, xsk2, xsk3, xsk4), ysk = (ysk1, ysk2, ysk3, ysk4)
In the forest analogy: x and y are the current positions, the random vectors are the landmarks (rocks, buildings, ...), and the sketches are the relative distances to those landmarks.
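To make the landmark analogy concrete, here is a small illustrative Python sketch (my own, not the talk's code; the sizes, seed, and Gaussian random vectors are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
w, k = 256, 30                    # window length and sketch size, as on later slides
R = rng.normal(size=(k, w))       # k random "landmark" vectors

x = rng.normal(size=w)
y = x + 0.1 * rng.normal(size=w)  # a series close to x

xsk = R @ x                       # sketch of x: inner product with each random vector
ysk = R @ y

d_raw = np.linalg.norm(x - y)                   # distance in the raw space
d_sk = np.linalg.norm(xsk - ysk) / np.sqrt(k)   # rescaled distance between sketches
```

With high probability, `d_sk` is close to `d_raw`, even though each sketch has only 30 coordinates instead of 256.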
15
Sketch Guarantees
Note: Sketches do not provide approximations of individual time series window but help make comparisons.
Johnson-Lindenstrauss Lemma: For any 0 < ε < 1 and any integer n, let k be a positive integer such that
k >= 4 (ε²/2 − ε³/3)⁻¹ ln n
Then for any set V of n points in R^d, there is a map f : R^d → R^k such that for all u, v in V:
(1 − ε) ||u − v||² <= ||f(u) − f(v)||² <= (1 + ε) ||u − v||²
Further, this map can be found in randomized polynomial time.
16
Sketches: Random Projection
Why do we use sketches (random projections)? To reduce the dimensionality!
For example: if the original time series x has length 256, we may represent it with a sketch vector of length 30.
This is the first step toward removing "the curse of dimensionality".
17
Achlioptas's lemma
Dimitris Achlioptas proved that:
Let P be an arbitrary set of n points in R^d, represented as an n × d matrix A. Given ε, β > 0, let
k0 = (4 + 2β) (ε²/2 − ε³/3)⁻¹ log n
For integer k >= k0, let R be a d × k random matrix with R(i, j) = r_ij, where {r_ij} are independent random variables drawn from either one of the two probability distributions shown on the next slide.
* Idea from Dimitris Achlioptas, "Database-friendly Random Projections", Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems.
18
Achlioptas's lemma
Let E = (1/√k) A R and let f map the i-th row of A to the i-th row of E. Then with probability at least 1 − n^(−β), for all u, v in P:
(1 − ε) ||u − v||² <= ||f(u) − f(v)||² <= (1 + ε) ||u − v||²
The two admissible distributions:
r_ij = +1 with probability 1/2, −1 with probability 1/2
or
r_ij = √3 × (+1 with probability 1/6, 0 with probability 2/3, −1 with probability 1/6)
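A minimal Python sketch of the two database-friendly distributions and the map E = (1/√k) A R (illustrative only; the sizes and seed are my choices):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, n = 256, 60, 20

# Distribution 1: r_ij = +1 or -1, each with probability 1/2
R1 = rng.choice([1.0, -1.0], size=(d, k))
# Distribution 2: r_ij = sqrt(3) * (+1 w.p. 1/6, 0 w.p. 2/3, -1 w.p. 1/6)
R2 = np.sqrt(3) * rng.choice([1.0, 0.0, -1.0], size=(d, k), p=[1/6, 2/3, 1/6])

A = rng.normal(size=(n, d))     # n points in R^d
E1 = A @ R1 / np.sqrt(k)        # E = (1/sqrt(k)) A R, as in the lemma
E2 = A @ R2 / np.sqrt(k)

# Distortion of one pairwise squared distance under each map
base = np.sum((A[0] - A[1]) ** 2)
ratio1 = np.sum((E1[0] - E1[1]) ** 2) / base
ratio2 = np.sum((E2[0] - E2[1]) ** 2) / base
```

Both ratios should be close to 1, reflecting the (1 ± ε) guarantee.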
19
Definition: Sketch Distance
dsk = ||xsk − ysk|| = sqrt( ((xsk1 − ysk1)² + (xsk2 − ysk2)² + ... + (xskk − yskk)²) / k )
Note: DFT and DWT distances are analogous. For those measures, the difference between the original vectors is approximated by the difference between the first Fourier/Wavelet coefficients of those vectors.
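The definition above translates directly into code; a tiny illustrative helper (mine, not the talk's implementation):

```python
import math

def sketch_distance(xsk, ysk):
    """dsk = sqrt( sum_i (xsk_i - ysk_i)^2 / k ), the definition on this slide."""
    k = len(xsk)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xsk, ysk)) / k)

print(sketch_distance([0.0, 0.0], [3.0, 4.0]))  # sqrt(25 / 2)
```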
20
Empirical Study : Sketch Approximation
[Chart: Return Data. X-axis: data points (1-976); Y-axis: distance; series: sketch and dist.]
21
Empirical Study: sketch distance/real distance
[Charts: distribution of the factor (real distance / sketch distance) as a percentage of distances, for sketch sizes 30, 80, and 1000. The factor ranges over roughly 0.88-1.25 for sizes 30 and 80, and concentrates in roughly 0.93-1.19 for size 1000.]
22
Grid Structure
x = (x1, x2, ..., xk)
23
Correlation and Distance
There is a relationship between Euclidean distance and Pearson correlation after normalization:
dist² = 2 (1 − correlation)
where each series is normalized within the sliding window:
X̂_i = (X_i − avg(X_sw)) / sqrt(var(X_sw))
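The identity dist² = 2(1 − correlation) can be checked numerically. A small Python sketch (mine, not the talk's code), normalizing each window to zero mean and unit Euclidean norm:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 0.8 * x + 0.2 * rng.normal(size=100)

def normalize(v):
    # subtract the window average, then scale to unit Euclidean norm
    v = v - v.mean()
    return v / np.linalg.norm(v)

xn, yn = normalize(x), normalize(y)
corr = np.corrcoef(x, y)[0, 1]     # Pearson correlation
dist2 = np.sum((xn - yn) ** 2)     # squared distance of the normalized series
```

With this normalization, `dist2` equals `2 * (1 - corr)` up to floating point error.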
24
How to compute the correlation efficiently?
Goal: To find the most highly correlated stream pairs over sliding windows
Naive method; StatStream method; our method
25
Naïve Approach
Space and time cost: space O(N), time O(N² · sw)
N: number of streams; sw: size of the sliding window
Let's see the StatStream approach
26
Definitions: Sliding window and Basic window
[Diagram: streams Stock 1, Stock 2, Stock 3, ..., Stock n along the time axis. Sliding window size = 8, basic window size = 2; each sliding window consists of basic windows, and each basic window of time points.]
27
StatStream Idea
Use the Discrete Fourier Transform (DFT) to approximate correlation, as in the GEMINI approach discussed earlier.
Every two minutes (the "basic window size"), update the DFT for each time series over the last hour (the "sliding window size").
Use a grid structure to filter out unlikely pairs.
28
StatStream: Stream synoptic data structure
[Diagram: a sliding window divided into basic windows; each basic window keeps digests (sum, DFT coefficients) that are updated as time points arrive.]
29
Section 3: Sketch based StatStream
30
Problem not yet solved
DFT approximates price-like data very well, but gives a poor approximation for returns: (today's price − yesterday's price) / yesterday's price.
Returns are more like white noise, which contains all frequency components.
DFT uses the first n (e.g., 10) coefficients to approximate the data, which is insufficient in the case of white noise.
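The contrast is easy to reproduce; an illustrative Python sketch (my own construction, not the talk's experiment) comparing how much energy the leading DFT coefficients capture for a random walk (price-like) versus white noise (return-like):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 256
white = rng.normal(size=n)   # return-like series: white noise
walk = white.cumsum()        # price-like series: random walk

def energy_ratio(x, m):
    """Fraction of one-sided spectral energy in the first m DFT coefficients."""
    power = np.abs(np.fft.rfft(x - x.mean())) ** 2
    return power[:m].sum() / power.sum()

walk_ratio = energy_ratio(walk, 10)    # close to 1: a few coefficients suffice
white_ratio = energy_ratio(white, 10)  # small: energy spread over all frequencies
```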
31
[Charts: Random Walk vs. White Noise. X-axis: the number of DFT coefficients (1-101); Y-axis: ratio over total energy. The random walk's ratio approaches 1 after only a few coefficients; the white noise's ratio grows roughly linearly with the number of coefficients.]
32
Big Picture Revisited
time series 1, time series 2, time series 3, ..., time series n
→ Random Projection →
sketch 1, sketch 2, ..., sketch n
→ Grid structure →
Correlated pairs
Random Projection: inner product between each data vector and the random vectors
33
How to compute the sketch efficiently
We will not compute the inner product at each data point, because that computation is expensive. Instead, a new strategy, developed in joint work with Richard Cole, is used to compute the sketch. Here the random variables are drawn from:
r_ij = +1 with probability 1/2, −1 with probability 1/2
34
How to construct the random vector: given time series X = (x1, x2, ...), compute its sketch for a window of size sw = 12.
Partition the sliding window into smaller basic windows of size bw = 4.
The random vector within a basic window is R = (r1, r2, r3, r4) with r_i = 1 or −1, and a control vector b = (b1, b2, b3) is used to determine whether each basic window is multiplied by −1 or 1 (why? wait ...).
A final complete random vector may look like:
(1 1 -1 1; -1 -1 1 -1; 1 1 -1 1), where bw = (1 1 -1 1) and b = (1 -1 1)
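The construction above can be sketched in a few lines of Python (a hypothetical helper of my own, not the authors' code):

```python
def build_random_vector(bw_pattern, b):
    """Repeat the basic-window pattern once per basic window,
    flipping the sign of each copy according to the control vector b."""
    return [sign * r for sign in b for r in bw_pattern]

full = build_random_vector([1, 1, -1, 1], [1, -1, 1])
print(full)  # [1, 1, -1, 1, -1, -1, 1, -1, 1, 1, -1, 1], the slide's example
```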
35
Naive algorithm and hope for improvement
There is redundancy in the second dot product given the first one. We will eliminate the repeated computation to save time.
Dot product:
r = (1 1 -1 1; -1 -1 1 -1; 1 1 -1 1), x = (x1 x2 x3 x4; x5 x6 x7 x8; x9 x10 x11 x12)
xsk = r · x = x1+x2-x3+x4 -x5-x6+x7-x8 +x9+x10-x11+x12
As new data arrive, such operations are done all over again:
r = (1 1 -1 1; -1 -1 1 -1; 1 1 -1 1), x' = (x5 x6 x7 x8; x9 x10 x11 x12; x13 x14 x15 x16)
xsk' = r · x' = x5+x6-x7+x8 -x9-x10+x11-x12 +x13+x14-x15+x16
36
Our algorithm (pointwise version)
Convolve each basic window X_bw with the corresponding R_bw after padding with |bw| zeros.
conv1: (1 1 -1 1 0 0 0 0) * (x1, x2, x3, x4)
conv2: (1 1 -1 1 0 0 0 0) * (x5, x6, x7, x8)
conv3: (1 1 -1 1 0 0 0 0) * (x9, x10, x11, x12)
[Animation: sliding the pattern over (x1, x2, x3, x4) produces the partial sums x4; x4+x3; x2+x3-x4; x1+x2-x3+x4; x1-x2+x3; x2-x1; x1.]
37
Our algorithm: example
First convolution: x4; x4+x3; x2+x3-x4; x1+x2-x3+x4; x1-x2+x3; x2-x1; x1
Second convolution: x8; x8+x7; x6+x7-x8; x5+x6-x7+x8; x5-x6+x7; x6-x5; x5
Third convolution: x12; x12+x11; x10+x11-x12; x9+x10-x11+x12; x9-x10+x11; x10-x9; x9
The sketch of each sliding window is assembled by adding the appropriate entries of consecutive convolutions.
38
Our algorithm: example
First sliding window: sk1 = (x1+x2-x3+x4), sk5 = (x5+x6-x7+x8), sk9 = (x9+x10-x11+x12); with b = (1, -1, 1): xsk1 = (x1+x2-x3+x4) - (x5+x6-x7+x8) + (x9+x10-x11+x12)
Second sliding window: sk2 = (x2+x3-x4) + (x5), sk6 = (x6+x7-x8) + (x9), sk10 = (x10+x11-x12) + (x13); summing up with b = (1, -1, 1): xsk2 = (x2+x3-x4+x5) - (x6+x7-x8+x9) + (x10+x11-x12+x13)
In general the sketch is (sk1, sk5, sk9) · (b1, b2, b3), where · is the inner product.
39
Our algorithm
The projection of a sliding window is decomposed into operations over basic windows
Each basic window is convolved with each random vector only once
We may provide the sketches incrementally starting from each data point.
There is no redundancy.
40
Jump by a basic window (basic window version)
If time series are highly correlated between two consecutive data points, we may compute the sketch only once per basic window.
That is, we update the sketch for each time series only when the data of a complete basic window arrive.
Pattern (1 1 -1 1) applied to each basic window of x1 ... x12 gives:
x1+x2-x3+x4, x5+x6-x7+x8, x9+x10-x11+x12
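The basic window version can be checked against the naive dot product in a few lines of Python (an illustrative sketch of my own, not the talk's implementation):

```python
import numpy as np

rng = np.random.default_rng(4)
bw_pat = np.array([1, 1, -1, 1])   # random pattern within a basic window
b = np.array([1, -1, 1])           # control vector over basic windows
x = rng.normal(size=12)            # one sliding window: sw = 12, bw = 4

# Naive: dot product with the full sign-extended random vector
r = np.concatenate([sign * bw_pat for sign in b])
naive = r @ x

# Basic window version: dot each basic window with the pattern once,
# then combine the three partial results using the control vector b
per_bw = np.array([bw_pat @ x[i * 4:(i + 1) * 4] for i in range(3)])
fast = b @ per_bw
```

The two values are identical; the saving comes from reusing `per_bw` across sliding windows instead of recomputing the full dot product.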
41
Online Version
We take the basic window version as an example. Review: to have the same baseline, we normalize the time series within each sliding window. Challenge: the normalization of the time series changes with each basic window.
42
Online Version
Its incremental nature requires an update of the average and variance whenever a new basic window enters.
Do we have to recompute the normalization, and thus the sketch, whenever a new basic window enters?
Of course not; otherwise our algorithm would degrade into the trivial computation.
43
Online Version
Then how? After some mathematical manipulation, we claim that we only need to store and maintain the following quantities:
Sum over the whole sliding window: Σ_{i=0}^{sw−1} X_i
Sum of the square of each datum in the sliding window: Σ_{i=0}^{sw−1} X_i²
Sum over the whole basic window: Σ_{i=0}^{bw−1} X_i
Sum of the square of each datum in the basic window: Σ_{i=0}^{bw−1} X_i²
Dot product of the random vector with each basic window: X_bw · R
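The first four quantities support an incremental normalization; a minimal Python sketch (my own illustration of the idea, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(5)
sw, bw = 12, 4
stream = rng.normal(size=40)

# Running sums over the first sliding window
s = stream[:sw].sum()
s2 = (stream[:sw] ** 2).sum()

# Slide forward one basic window at a time, updating the sums incrementally
for start in range(bw, len(stream) - sw + 1, bw):
    old = stream[start - bw:start]             # basic window leaving
    new = stream[start + sw - bw:start + sw]   # basic window entering
    s += new.sum() - old.sum()
    s2 += (new ** 2).sum() - (old ** 2).sum()

mean = s / sw
var = s2 / sw - mean ** 2   # enough to renormalize the current window on the fly
```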
44
Performance comparison
Naive algorithm: for each datum and random vector, O(|sw|) integer additions.
Pointwise version: asymptotically, for each datum and random vector, (1) O(|sw|/|bw|) integer additions, (2) O(log |bw|) floating point operations (using the FFT to compute convolutions).
Basic window version: asymptotically, for each basic window and random vector, (1) O(|sw|/|bw|) integer additions, (2) O(|bw|) floating point operations.
45
Sketch distance filter quality
We may use the sketch distance to filter out unlikely data pairs.
How accurate is it? How does it compare to the DFT and DWT distances in terms of approximation ability?
46
Empirical Study: sketch distance compared to DFT and DWT distances
Data length = 256. DFT: the first 14 DFT coefficients are used in the distance computation. DWT: the db2 wavelet is used with coefficient size = 16. Sketch: the number of random vectors is 64.
47
Empirical Comparison: DFT, DWT and Sketch
[Chart: Price Data. X-axis: data points (1-981); Y-axis: distance; series: sketch dist, dwt, dft.]
48
Empirical Comparison : DFT, DWT and Sketch
[Chart: Return Data. X-axis: data points (1-911); Y-axis: distance; series: dft, dwt, sketch dist.]
49
Use the sketch distance as a filter
We may compute the sketch distance
dsk = ||xsk − ysk|| = sqrt( ((xsk1 − ysk1)² + ... + (xskk − yskk)²) / k )
and accept a pair when |xski − yski| <= c · dist.
c could be 1.2 or larger to reduce the number of false negatives.
Finally, any surviving candidate pair will be double-checked with the raw data.
50
Use the sketch distance as a filter
But we will not use it. Why? It is expensive: we still have to do the pairwise comparison between each pair of stocks, which is O(n²k), where k is the size of the sketches.
51
Sketch unit distance
Given sketches xsk = (xsk1, xsk2, ..., xsk8) and ysk = (ysk1, ysk2, ..., ysk8), consider the coordinate-wise distances |xsk1 − ysk1|, |xsk2 − ysk2|, ..., |xsk8 − ysk8|.
If f of these distance chunks have |xski − yski| <= c · dist, we may say ||x − y|| <= dist, where f: 30%, 40%, 50%, 60%, ... and c: 0.8, 0.9, 1.1, ...
52
Further: sketch groups
We may partition the sketches xsk = (xsk1, ..., xsk8, ...) and ysk = (ysk1, ..., ysk8, ...) into groups g1, g2, g3, ... and compute a sketch distance per group:
dsk_gi = ||xsk_gi − ysk_gi||
For example, with groups of size 4:
dsk_g1 = sqrt( ((xsk1 − ysk1)² + (xsk2 − ysk2)² + (xsk3 − ysk3)² + (xsk4 − ysk4)²) / 4 )
If f sketch groups have |dsk_gi| <= c · dist, we may say ||x − y|| <= dist, where f: 30%, 40%, 50%, 60% and c: 0.8, 0.9, 1.1, ...
Grid Structure
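A small Python illustration of this group-based filtering criterion (an illustrative sketch with hypothetical parameter values, not the authors' code):

```python
import math

def group_filter(xsk, ysk, g, c, f, dist):
    """Report a pair as potentially close if at least a fraction f of
    sketch groups (each of size g) have group distance within c * dist."""
    k = len(xsk)
    ok = 0
    for start in range(0, k, g):
        d2 = sum((xsk[i] - ysk[i]) ** 2 for i in range(start, start + g))
        if math.sqrt(d2 / g) <= c * dist:
            ok += 1
    return ok >= f * (k // g)

# Two nearly identical sketches pass; two far-apart sketches do not.
print(group_filter([1.0] * 8, [1.2] * 8, g=2, c=1.1, f=0.5, dist=0.2))  # True
print(group_filter([0.0] * 8, [5.0] * 8, g=2, c=1.1, f=0.5, dist=0.2))  # False
```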
53
Optimization in Parameter Space
Next, how to choose the parameters g, c, f, N?
N: total number of sketches; g: group size; c: the factor of distance; f: the fraction of groups required to claim that two time series are close enough.
54
Optimization in Parameter Space
Essentially, we prepare several groups of good parameter candidates and choose the best one to apply to the practical data.
But how to select the good candidates? Combinatorial Design (CD) and Bootstrapping.
55
Combinatorial Design
The pairwise combinations of all the parameters. Informally: each parameter value appears together with each value of every other parameter in some parameter group.
P: P1, P2, P3
Q: Q1, Q2, Q3, Q4
R: R1, R2
All combinations: #P × #Q × #R = 24 groups
Combinatorial Design: 12 groups*
* http://www.cs.nyu.edu/cs/faculty/shasha/papers/comb.html
56
Combinatorial Design
Much smaller test space compared to that of all parameter combinations
We will further reduce the test space by taking advantage of continuity of recall and precision in parameter space.
[Charts: precision and recall surfaces over different parameter groups, with f ranging over 0.1-0.6 and c over 0.1-1.3; both vary smoothly across the parameter space.]
57
Combinatorial Design
We will employ a coarse-to-fine strategy:
N: 30, 36, 40, 60; g: 1, 2, 3; c: 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 1.1, 1.2, 1.3; f: 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1
When good parameters are located, their local neighbors are searched further for better solutions.
58
Bootstrapping
Choose parameters with stable performance on both sample data and real data.
A sample set with 2,000,000 pairs. From it, choose with replacement 20,000 samples, 100 times. Compute the recall and precision each time.
59
Bootstrapping
From the 100 recalls and precisions, compute the mean and standard deviation. Criterion for good parameters:
Mean(recall) − std(recall) > Threshold(recall)
Mean(precision) − std(precision) > Threshold(precision)
If no such parameters exist, enlarge the replacement sample size.
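The bootstrap criterion can be sketched in Python (an illustrative stand-in; the synthetic outcomes, seed, and 0.95 recall rate are my assumptions, not the talk's data):

```python
import numpy as np

rng = np.random.default_rng(6)
# Stand-in sample: 1 = correlated pair correctly reported, 0 = missed
# (a hypothetical proxy for the 2,000,000-pair sample on the slide)
outcomes = (rng.random(20000) < 0.95).astype(float)

# Choose with replacement 20,000 samples, 100 times; compute recall each time
recalls = [rng.choice(outcomes, size=outcomes.size, replace=True).mean()
           for _ in range(100)]

m, s = np.mean(recalls), np.std(recalls)
stable = (m - s) > 0.9   # criterion: mean(recall) - std(recall) > threshold
```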
60
Parameter Selection
N    f    c     mean_rec  std_rec  mean_prec  std_prec
60   0.4  0.45  1         0        0.042      0.0074
60   0.4  0.46  1         0        0.038      0.0069
60   0.4  0.47  0.998     0.0054   0.035      0.0065
60   0.5  0.55  1         0        0.056      0.0093
60   0.5  0.56  1         0        0.052      0.0088
61
Preferred data distributions
The distribution of the data affects the performance of our algorithm (recall the price vs. return data).
The ideal data distribution: the number of points within distance d of a query point stays small, bounded by a small constant C, so that pairwise distances are spread out rather than concentrated.
Generally, the less human intervention, the better; the "green" data give much better results.
62
Empirical Study: Various data types
Cstr: Continuous stirred tank reactor
Foetal_ecg: Cutaneous potential recordings of a pregnant woman
Steamgen: Model of a steam generator at Abbott Power Plant in Champaign IL
Winding: Data from a test setup of an industrial winding process
Evaporator: Data from an industrial evaporator
Wind: Daily average wind speeds for 1961-1978 at 12 synoptic meteorological stations in the Republic of Ireland
Spot_exrates: The spot foreign currency exchange rates
EEG: Electroencephalogram
63
Empirical Study: Data distribution
[Charts: histograms of pairwise distances. Price data: distances spread over roughly 1-31. Return data: distances concentrated around 18-25. cstr data: distances spread over roughly 4-30.]
64
Grid Structure
Critical: the largest sketch value, useful for the normalization needed to fit sketches into the grid structure.
Our small lemma: Max(sketch unit) <= sqrt(Size of Sliding Window)
65
Grid Structure
High correlation => closeness in the vector space.
To avoid checking all pairs, we can use a grid structure and look in the neighborhood; this returns a superset of the highly correlated pairs.
The pairs labeled as "potential" are double-checked using the raw data vectors.
The pruning power: the percentage of pairs filtered out as impossible to be close.
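A minimal Python sketch of the neighborhood lookup (my own illustration, assuming two-dimensional sketch coordinates and a hypothetical cell size; not the talk's implementation):

```python
from collections import defaultdict
from itertools import product

def grid_candidates(points, cell):
    """Map each sketch vector to a grid cell of side `cell`; candidate pairs
    are those in the same or adjacent cells, a superset of all close pairs."""
    grid = defaultdict(list)
    for name, p in points.items():
        grid[tuple(int(v // cell) for v in p)].append(name)
    cands = set()
    for key, members in grid.items():
        for off in product((-1, 0, 1), repeat=len(key)):
            nb = tuple(k + o for k, o in zip(key, off))
            for a in members:
                for b in grid.get(nb, []):
                    if a < b:
                        cands.add((a, b))
    return cands

pts = {"X": (0.1, 0.2), "Y": (0.3, 0.1), "Z": (5.0, 5.0)}
print(grid_candidates(pts, cell=1.0))  # {('X', 'Y')}: Z is pruned away
```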
66
Inner product with random vectors r1,r2,r3,r4,r5,r6
X: xsk = (xsk1, xsk2, xsk3, xsk4, xsk5, xsk6)
Y: ysk = (ysk1, ysk2, ysk3, ysk4, ysk5, ysk6)
Z: zsk = (zsk1, zsk2, zsk3, zsk4, zsk5, zsk6)
67
Each sketch is split into pairs (xsk1, xsk2), (xsk3, xsk4), (xsk5, xsk6), and likewise (ysk1, ysk2), ..., (zsk5, zsk6); each pair is indexed into a two-dimensional grid structure.
68
System Integration
By combining the sketch scheme with the grid structure, we can:
Reduce dimensionality
Eliminate unnecessary pair comparisons
The performance improves substantially.
69
Empirical Study: Speed
[Chart: comparison of processing time, wall clock seconds (0-600) vs. number of streams (200-2200), for sketch_random, sketch_randomwalk, and exact. Sliding window = 3616, basic window = 32, sketch size = 60.]
70
Empirical Study: Breakdown
[Chart: processing time of random walk data, wall clock seconds (up to 160) vs. number of streams (500-5500), split into Detecting Correlation and Updating Sketches.]
71
Empirical Study: Breakdown
[Chart: processing time of random data, wall clock seconds (up to 35) vs. number of streams (500-5500), split into Detecting Correlation and Updating Sketches.]
72
The Pruning Power of the Grid Structure
[Chart: processing time, wall clock seconds (0-3000), by data type and size, comparing grid2, grid3, dft, and scan.]
73
Visualization
74
Other applications
Cointegration Test Matching Pursuit Anomaly Detection
75
Cointegration Test
A linear combination of several non-stationary time series can be stationary.
Models the long-run relationship, as opposed to the correlation.
StatStream may be applied to test the stationarity condition of cointegration.
76
Matching Pursuit
Decompose a signal into a group of non-orthogonal sub-components.
Test the correlation among atoms in a dictionary.
Expedite the component selection.
77
Anomaly Detection
Measure the relative distance of each point from its nearest neighbors.
StatStream may serve as a monitor by reporting those points far from any normal points.
78
Conclusion
Introduction; GEMINI Framework; Random Projection; StatStream Review; Efficient Sketch Computation; Parameter Selection; Grid Structure; System Integration; Empirical Study; Future Work
79
Thanks a lot!
80
Recall and Precision
Recall = C/A, Precision = C/B
A: query ball; B: returned result; C: intersection
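The two measures translate directly to code; a tiny illustrative helper (with hypothetical point names):

```python
def recall_precision(query_ball, returned):
    """Recall = |C|/|A|, Precision = |C|/|B|, where A is the true query ball,
    B the returned result, and C their intersection."""
    A, B = set(query_ball), set(returned)
    C = A & B
    return len(C) / len(A), len(C) / len(B)

rec, prec = recall_precision({"p1", "p2", "p3", "p4"}, {"p1", "p2", "p5"})
print(rec, prec)  # 0.5 and 2/3
```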