
Page 1: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Fast Algorithms for Time Series with applications to Finance, Physics, Music and

other Suspects

Dennis Shasha

Joint work with Yunyue Zhu, Xiaojian Zhao, Zhihua Wang, and Alberto

Lerner

{shasha,yunyue, xiaojian, zhihua, lerner}@cs.nyu.edu

Courant Institute, New York University

Page 2: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Goal of this work

• Time series are important in so many applications: biology, medicine, finance, music, physics, …

• A few fundamental operations occur all the time: burst detection, correlation, pattern matching.

• Do them fast to make data exploration faster, real time, and more fun.

Page 3: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Sample Needs

• Pairs Trading in Finance: find two stocks that track one another closely. When they go out of correlation, buy one and sell the other.

• Match a person’s humming against a database of songs to help him/her buy a song.

• Find bursts of activity even when you don’t know the window size over which to measure.

• Query and manipulate ordered data.

Page 4: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Why Speed Is Important

• As processors speed up, algorithmic efficiency no longer matters … one might think.

• That would be true if problem sizes stayed the same, but they don’t.

• As processors speed up, sensors improve: satellites spew out a terabyte a day, magnetic resonance imagers give higher-resolution images, etc.

• Desire for real time response to queries.

Page 5: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Surprise, surprise

• More data, real-time response, increasing importance of correlation

IMPLIES

Efficient algorithms and data management more important than ever!

Page 6: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Corollary

• Important area, lots of new problems.

• Small advertisement: High Performance Discovery in Time Series (Springer 2004). At this conference.

Page 7: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Outline

• Correlation across thousands of time series

• Query by humming: correlation + shifting

• Burst detection: when you don’t know the window size

• Aquery: a query language for all of this

Page 8: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Real-time Correlation Across Thousands of Time Series

Page 9: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Scalable Methods for Correlation

• Compress streaming data into moving synopses.

• Update the synopses in constant time.

• Compare synopses in real time.

• Use transforms + simple data structures.

(Avoid curse of dimensionality.)

Page 10: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

GEMINI framework*

* Faloutsos, C., Ranganathan, M. & Manolopoulos, Y. (1994). Fast subsequence matching in time-series databases. In proceedings of the ACM SIGMOD Int'l Conference on Management of Data. Minneapolis, MN, May 25-27. pp 419-429.

Page 11: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

StatStream (VLDB,2002): Example

• Stock price streams

– The New York Stock Exchange (NYSE)

– 50,000 securities (streams); 100,000 ticks (trade and quote)

• Pairs Trading, a.k.a. Correlation Trading

• Query: “Which pairs of stocks were correlated with a value of over 0.9 for the last three hours?”

XYZ and ABC have been correlated with a correlation of 0.95 for the last three hours. Now XYZ and ABC become less correlated as XYZ goes up and ABC goes down. They should converge back later. I will sell XYZ and buy ABC …

Page 12: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Online Detection of High Correlation

• Given tens of thousands of high-speed time series data streams, detect high-value correlation, including synchronized and time-lagged, over sliding windows in real time.

• Real time

– high update frequency of the data stream

– fixed response time, online

Correlated!


Page 15: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

StatStream: Naïve Approach

• Goal: find most highly correlated stream pairs over sliding windows

– N : number of streams

– w : size of sliding window

– space O(N) and time O(N^2 w).

• Suppose that the streams are updated every second.

– With a Pentium 4 PC, the exact computing method can monitor only 700 streams, where each calculation takes place with a separation of two minutes.

– “Punctuated result model”: not continuous, but online.

Page 16: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

StatStream: Our Approach

– Use the Discrete Fourier Transform to approximate correlation, as in the GEMINI approach.

– Every two minutes (“basic window size”), update the DFT for each time series over the last hour (“window size”)

– Use grid structure to filter out unlikely pairs

– Our approach can report highly correlated pairs among 10,000 streams for the last hour with a delay of 2 minutes. So, at 2:02, find highly correlated pairs between 1 PM and 2 PM. At 2:04, find highly correlated pairs between 1:02 and 2:02 PM etc.

Page 17: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

StatStream: Stream synoptic data structure

• Three level time interval hierarchy

– Time point, Basic window, Sliding window

• Basic window (the key to our technique)

– The computation for basic window i must finish by the end of the basic window i+1

– The basic window time is the system response time.

• Digests

– Basic window digests: sum, DFT coefs


Page 19: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

StatStream: Stream synoptic data structure

• Three level time interval hierarchy

– Time point, Basic window, Sliding window

• Basic window (the key to our technique)

– The computation for basic window i must finish by the end of the basic window i+1

– The basic window time is the system response time.

• Digests

– Basic window digests: sum, DFT coefs

– Sliding window digests: sum, DFT coefs


Page 22: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

How general technique is applied

• Compress streaming data into moving synopses: Discrete Fourier Transform.

• Update the synopses in time proportional to number of coefficients: basic window idea.

• Compare synopses in real time: compare DFTs.

• Use transforms + simple data structures: grid structure.

Page 23: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Synchronized Correlation Uses Basic Windows

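$$\mathrm{corr}(s, r) = \frac{\frac{1}{w}\sum_{i=1}^{w} s_i r_i - \bar{s}\,\bar{r}}{\sqrt{\frac{1}{w}\sum_{i=1}^{w}(s_i - \bar{s})^2}\;\sqrt{\frac{1}{w}\sum_{i=1}^{w}(r_i - \bar{r})^2}}$$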

• Inner-product of aligned basic windows

[Figure: Stream x and Stream y, each with a sliding window divided into aligned basic windows.]

• Inner-product within a sliding window is the sum of the inner-products in all the basic windows in the sliding window.
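The decomposition above can be sketched in a few lines. This is a minimal illustration, assuming NumPy; the function names `basic_window_products` and `sliding_inner_products` and the parameters `b` (basic window size) and `k` (basic windows per sliding window) are hypothetical, not taken from StatStream itself.

```python
import numpy as np

def basic_window_products(x, y, b):
    """Inner product of x and y inside each aligned basic window of size b."""
    n = len(x) // b                      # number of complete basic windows
    xb = np.asarray(x[: n * b], dtype=float).reshape(n, b)
    yb = np.asarray(y[: n * b], dtype=float).reshape(n, b)
    return (xb * yb).sum(axis=1)

def sliding_inner_products(x, y, b, k):
    """Inner product over every sliding window made of k basic windows.

    Advancing the sliding window by one basic window adds the newest
    basic-window product and drops the oldest, so each update is O(1).
    """
    digests = basic_window_products(x, y, b)
    total = digests[:k].sum()
    out = [total]
    for i in range(k, len(digests)):
        total += digests[i] - digests[i - k]
        out.append(total)
    return np.array(out)
```

The sliding window here advances one basic window at a time, which matches the punctuated reporting described earlier (one result per basic window).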

Page 24: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Approximate Synchronized Correlation

• Approximate with an orthogonal function family (e.g. DFT)

Time series: x = (x1, x2, x3, x4, x5, x6, x7, x8)

Basis functions: f1(1..8), f2(1..8), f3(1..8)

Coefficients:

$$c^x_j = \sum_{i=1}^{8} f_j(i)\, x_i, \qquad j = 1, 2, 3$$


Page 27: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Approximate Synchronized Correlation

• Approximate with an orthogonal function family (e.g. DFT)

x1 x2 x3 x4 x5 x6 x7 x8 → c1^x, c2^x, c3^x

y1 y2 y3 y4 y5 y6 y7 y8 → c1^y, c2^y, c3^y

• Inner product of the time series ≈ inner product of the digests

$$\sum_i f_m(i)\, f_p(i) = \begin{cases} V(m), & m = p \\ 0, & m \neq p \end{cases}$$

• The time and space complexity is reduced from O(b) to O(n).

– b : size of basic window

– n : size of the digests (n << b)

• e.g. 120 time points reduce to 4 digests
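As a rough illustration of approximating correlation from a handful of DFT coefficients, here is a minimal sketch assuming NumPy. The names `normalize`, `dft_digest` and `approx_corr` are hypothetical. It relies on two standard facts: for zero-mean, unit-norm series the correlation equals 1 − d²/2, where d is the Euclidean distance, and an orthonormal DFT preserves that distance. Keeping only the first n coefficients underestimates d, so the estimate never falls below the true correlation.

```python
import numpy as np

def normalize(x):
    """Shift to zero mean and scale to unit Euclidean norm.
    Assumes x is not constant (otherwise the norm is zero)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return x / np.linalg.norm(x)

def dft_digest(x, n):
    """First n non-trivial coefficients of the orthonormal DFT of normalized x.
    Coefficient 0 is skipped because it is zero for a zero-mean series."""
    coefs = np.fft.fft(normalize(x), norm="ortho")
    return coefs[1 : n + 1]

def approx_corr(dx, dy):
    """Correlation estimate from two digests.
    The factor 2 accounts for the conjugate-symmetric half of the spectrum
    of a real series (valid when n is below half the window length)."""
    d2 = 2.0 * np.sum(np.abs(dx - dy) ** 2)
    return 1.0 - d2 / 2.0
```

Because the estimate is an upper bound on the true correlation, thresholding it yields a superset of the truly correlated pairs, which can then be verified exactly.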

Page 28: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Approximate lagged Correlation

• Inner-product with unaligned windows

• The time complexity is reduced from O(b) to O(n^2), as opposed to O(n) for synchronized correlation. Reason: terms for different frequencies are non-zero in the lagged case.

[Figure: two sliding windows offset from one another.]

Page 29: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Grid Structure (to avoid checking all pairs)

• The DFT coefficients yield a vector.

• High correlation => closeness in the vector space

– We can use a grid structure and look in the neighborhood. This returns a superset of the highly correlated pairs.
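A grid lookup of this kind might be sketched as follows. This is a minimal illustration, assuming NumPy; `candidate_pairs` and the `cell` parameter (the grid cell width, chosen at least as large as the distance threshold) are hypothetical names. Two vectors closer than `cell` differ by less than `cell` in every coordinate, so they land in the same or adjacent cells.

```python
from collections import defaultdict
from itertools import product

import numpy as np

def candidate_pairs(vectors, cell):
    """Return index pairs (i, j), i < j, whose vectors fall in the same or
    adjacent grid cells. Every pair closer than `cell` in Euclidean distance
    is included; some farther pairs may be included too (a superset)."""
    vectors = np.asarray(vectors, dtype=float)
    keys = [tuple(k) for k in np.floor(vectors / cell).astype(int).tolist()]
    grid = defaultdict(list)
    for i, key in enumerate(keys):
        grid[key].append(i)
    # All neighbor offsets in {-1, 0, 1}^d; practical only for small d.
    offsets = list(product((-1, 0, 1), repeat=vectors.shape[1]))
    pairs = set()
    for i, key in enumerate(keys):
        for off in offsets:
            neighbor = tuple(k + o for k, o in zip(key, off))
            for j in grid.get(neighbor, ()):
                if i < j:
                    pairs.add((i, j))
    return pairs
```

The number of neighboring cells grows as 3^d, which is one reason to keep the digest dimension small.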

Page 30: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Empirical Study : Speed

[Figure: Comparison of processing time. x-axis: Number of Streams (200 to 1600); y-axis: Wall Clock Time (seconds, 0 to 800); series: Exact, DFT.]

Our algorithm is parallelizable.

Page 31: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Empirical Study: Accuracy

• Approximation errors

– Larger size of digests, larger size of sliding window and smaller size of basic window give better approximation

– The approximation errors (mistake in correlation coef) are small.

[Figure: Average Approximation Error (0 to 0.005) plotted against Sliding windows (Hours) and Basic windows (Minutes).]

Page 32: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Sketches : Random Projection*

• Correlation between time series of the returns of stocks

– Since most stock price time series are close to random walks, their return time series are close to white noise.

– DFT/DWT can’t capture approximate white noise series because there is no clear trend (too many frequency components).

• Solution: Sketches (a form of random landmark)

– Sketch pool: list of random vectors drawn from a stable distribution

– Sketch: the list of distances from a data vector to the sketch pool.

– The Euclidean distance (correlation) between time series is approximated by the distance between their sketches with a probabilistic guarantee.

* W. B. Johnson and J. Lindenstrauss. “Extensions of Lipschitz mappings into a Hilbert space”. Contemp. Math., 26:189-206, 1984.

* D. Achlioptas. “Database-friendly random projections”. In Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, ACM Press, 2001.

Page 33: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Sketches : Intuition

• You are walking in a sparse forest and you are lost.

• You have an old-time cell phone without GPS.

• You want to know whether you are close to your friend.

• You identify yourself as 100 meters from the pointy rock, 200 meters from the giant oak etc.

• If your friend is at similar distances from several of these landmarks, you might be close to one another.

• The sketch is just the set of distances.

Page 34: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Sketches : Random Projection

raw time series:

x = (x1, x2, x3, ..., xw)

y = (y1, y2, y3, ..., yw)

random vectors:

R1 = (r1_1, r1_2, r1_3, ..., r1_w)

R2 = (r2_1, r2_2, r2_3, ..., r2_w)

R3 = (r3_1, r3_2, r3_3, ..., r3_w)

R4 = (r4_1, r4_2, r4_3, ..., r4_w)

sketches (inner product of each raw time series with each random vector):

xsk = (xsk1, xsk2, xsk3, xsk4)

ysk = (ysk1, ysk2, ysk3, ysk4)
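A random-projection sketch of this kind can be illustrated as below. This is a minimal example, assuming NumPy and assuming random vectors with entries +1/-1 (one of the choices in Achlioptas's paper); the names `sketch_pool` and `sketch`, and the scaling by 1/sqrt(sketch size), are illustrative choices rather than the authors' implementation.

```python
import numpy as np

def sketch_pool(window, size, seed=0):
    """`size` random vectors of length `window` with entries +1 or -1."""
    rng = np.random.default_rng(seed)
    return rng.choice([-1.0, 1.0], size=(size, window))

def sketch(x, pool):
    """Inner product of x with every random vector in the pool, scaled so
    that distances between sketches estimate distances between series."""
    return pool @ np.asarray(x, dtype=float) / np.sqrt(pool.shape[0])
```

Because the sketch is linear, the distance between two sketches equals the norm of the sketch of the difference, which is what the probabilistic guarantee is stated for.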

Page 35: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

The ratio of sketch distance/real distance (Sliding window size=256 and sketch size=80)

[Figure: histogram of the ratio of sketch distance to real distance. x-axis: ratio of distance (0.7 to 1.22); y-axis: ratio over total (0 to 0.07).]

Page 36: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Empirical Study: Sketch on Price and Return Data

• DFT and DWT work well for prices (today’s price is a good predictor of tomorrow’s)

• But badly for returns: (todayprice – yesterdayprice)/todayprice.

• Data length = 256. The first 14 DFT coefficients are used in the distance computation; the db2 wavelet is used with coefficient size = 16; the sketch size is 64.

Page 37: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Empirical Comparison: DFT, DWT and Sketch

[Figure: Price Data. x-axis: Data Points (1 to 971); y-axis: Distance (0 to 50); series: sketch, dist, dwt, dft.]

Page 38: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Empirical Comparison : DFT, DWT and Sketch

[Figure: Return Data. x-axis: Data Points (1 to 911); y-axis: Distance (0 to 30); series: dft, dwt, sketch, dist.]

Page 39: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Sketch Guarantees

• Note: Sketches do not provide approximations of individual time series windows but help make comparisons.

Johnson-Lindenstrauss Lemma:

• For any $0 < \epsilon < 1$ and any integer $n$, let $k$ be a positive integer such that

$$k \ge 4\left(\epsilon^2/2 - \epsilon^3/3\right)^{-1} \ln n$$

• Then for any set $V$ of $n$ points in $\mathbb{R}^d$, there is a map $f : \mathbb{R}^d \to \mathbb{R}^k$ such that for all $u, v \in V$

$$(1-\epsilon)\,\|u-v\|^2 \le \|f(u)-f(v)\|^2 \le (1+\epsilon)\,\|u-v\|^2$$

• Further, this map can be found in randomized polynomial time.
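To get a feel for the bound, the smallest k allowed by the lemma can be evaluated directly. This is only an arithmetic illustration; the function name `jl_min_dim` is hypothetical.

```python
import math

def jl_min_dim(n, eps):
    """Smallest integer k with k >= 4 * ln(n) / (eps^2/2 - eps^3/3)."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    return math.ceil(4.0 * math.log(n) / (eps ** 2 / 2.0 - eps ** 3 / 3.0))
```

For example, 1000 points with eps = 0.5 need k = 332 dimensions under this bound, independent of the original dimension d. The bound is a worst-case guarantee, so smaller sketches may still behave acceptably on real data.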

Page 40: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Overcoming curse of dimensionality*

• May need many random projections.

• Can partition sketches into disjoint pairs or triplets and perform comparisons on those.

• Each such small group is placed into an index.

• Algorithm must adapt to give the best results.

*Idea from P.Indyk,N.Koudas, and S.Muthukrishnan. “Identifying representative trends in massive time series data sets using sketches”. VLDB 2000.

Page 41: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Inner product with random vectors r1, r2, r3, r4, r5, r6

X → (xsk1, xsk2, xsk3, xsk4, xsk5, xsk6)

Y → (ysk1, ysk2, ysk3, ysk4, ysk5, ysk6)

Z → (zsk1, zsk2, zsk3, zsk4, zsk5, zsk6)

Page 42: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Grid structure

Group 1: (xsk1, xsk2), (ysk1, ysk2), (zsk1, zsk2)

Group 2: (xsk3, xsk4), (ysk3, ysk4), (zsk3, zsk4)

Group 3: (xsk5, xsk6), (ysk5, ysk6), (zsk5, zsk6)

Page 43: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Further Performance Improvements

• Suppose we have R random projections of window size WS.

• It might seem that we have to do R*WS work for each timepoint for each time series.

• In ongoing work with colleague Richard Cole, we show that we can cut this down by use of convolution and an oxymoronic notion of “structured random vectors”.

*Idea from Dimitris Achlioptas, “Database-friendly Random Projections”, Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems

Page 44: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

How to compute the sketch efficiently

We will not compute the inner product at each data point, which is expensive. Given that the random vectors are 1/-1*, we explain our algorithm through an example: ...

*Idea from Dimitris Achlioptas, “Database-friendly Random Projections”, Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems

Page 45: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Our algorithm

Given a time series X = (x1, x2, …), compute its sketch for a sliding window of size sw = 12.

Partition the sliding window into smaller basic windows of size bw = 4.

The random vector within a basic window is R = (r1, r2, r3, r4), where each r_i = 1 or -1.

At the cost of reducing randomization, we have another control vector b = (b1, b2, b3), where each b_i = 1 or -1 with probability ½ each. b determines whether each basic window is multiplied by -1 or 1.

A final complete random vector may look like:

(1 1 -1 1; -1 -1 1 -1; 1 1 -1 1)

Here R = (1 1 -1 1) and b = (1 -1 1).

Page 46: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Our algorithm continued

Then convolve each basic window of the data, X_bw, with the corresponding random vector R_bw after padding with |bw| zeros. Here, to show the example, we take all r = 1.

conv1: (1 1 1 1 0 0 0 0) convolved with (x1, x2, x3, x4)

conv2: (1 1 1 1 0 0 0 0) convolved with (x5, x6, x7, x8)

conv3: (1 1 1 1 0 0 0 0) convolved with (x9, x10, x11, x12)

Animation shows convolution in action. Sliding (x1, x2, x3, x4) across (1 1 1 1 0 0 0 0) produces, in order:

x4

x4+x3

x4+x3+x2

x4+x3+x2+x1

x3+x2+x1

x2+x1

x1

Page 47: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

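Our algorithm continued

Convolution outputs for the first two basic windows:

bw1 | bw2

x4 | x8

x3+x4 | x7+x8

x2+x3+x4 | x6+x7+x8

x1+x2+x3+x4 | x5+x6+x7+x8

x1+x2+x3 | x5+x6+x7

x1+x2 | x5+x6

x1 | x5

[Figure labels: Sk4, Sk3, Sk2, Sk1, each formed by adding (+) an entry of the bw1 column to an entry of the bw2 column.]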

Page 48: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Our algorithm continued

Summing up the corresponding items gives us the sought sketches as follows:

Sk1 = (x1+x2+x3+x4)

Sk2 = (x2+x3+x4) + (x5)

Sk3 = (x3+x4) + (x5+x6)

Sk4 = (x4) + (x5+x6+x7)

After 3 such convolutions (one per basic window) and an inner product with b, we obtain the sketch for the first sliding window.

That is: (Sk1 Sk5 Sk9)*(b1 b2 b3). Here * is inner product.
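The idea can be checked numerically with a small sketch, assuming NumPy. The helper names `full_vector` and `structured_sketches` are hypothetical, and the code uses a direct convolution (`np.convolve`) for clarity; an FFT-based convolution could be substituted for long basic windows. The full random vector is built from one basic-window vector `r` and one control vector `b`, as in the example above.

```python
import numpy as np

def full_vector(r, b):
    """Concatenate b[j] * r for each basic window j."""
    return np.kron(np.asarray(b), np.asarray(r))

def structured_sketches(x, r, b):
    """Sketch of every sliding window of x against full_vector(r, b).

    partial[t] is the inner product of r with x[t : t + bw], obtained for
    all t at once by convolving x with r reversed. The sketch of the window
    starting at t is the b-weighted sum of partial[t + j * bw].
    """
    x = np.asarray(x, dtype=float)
    r = np.asarray(r, dtype=float)
    b = np.asarray(b, dtype=float)
    bw, m = len(r), len(b)
    partial = np.convolve(x, r[::-1], mode="valid")
    n_windows = len(x) - bw * m + 1
    out = np.zeros(n_windows)
    for j in range(m):
        out += b[j] * partial[j * bw : j * bw + n_windows]
    return out
```

With `r = (1, 1, -1, 1)` and `b = (1, -1, 1)` this reproduces the 12-element random vector shown on the earlier slide, and each sliding-window sketch costs |sw|/|bw| additions once the partial products are available.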

Page 49: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Performance

• Naïve algorithm: for each datum and random vector, O(|sw|) integer additions

• New algorithm: asymptotically, for each datum and random vector,

(1) O(|sw|/|bw|) integer additions

(2) O(log |bw|) floating point operations (use FFT in computing convolutions)

Page 50: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Query by humming:Correlation + Shifting

Page 51: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Query By Humming

• You have a song in your head.

• You want to get it but don’t know its title.

• If you’re not too shy, you hum it to your friends or to a salesperson and you find it.

• They may grimace, but you get your CD

Page 52: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

With a Little Help From My Warped Correlation

• Karen’s humming Match:

• Dennis’s humming Match:

• “What would you do if I sang out of tune?”

• Yunyue’s humming Match:

Page 53: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Related Work in Query by Humming

• Traditional method: String Matching [Ghias et al. 95, McNab et al. 97, Uitdenbogerd and Zobel 99]

– Music represented by string of pitch directions: U, D, S (degenerated interval)

– Hum query is segmented into discrete notes, then converted to a string of pitch directions

– Edit Distance between hum query and music score

• Problem

– Very hard to segment the hum query

– Partial solution: users are asked to hum articulately

• New Method : matching directly from audio [Mazzoni and Dannenberg 00]

• We use both.
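The traditional string-matching method can be sketched as follows. This is a generic illustration, not any cited system's code: `pitch_contour` maps a note sequence to U/D/S (up, down, same), and `edit_distance` is the standard Levenshtein distance; both names are hypothetical.

```python
def pitch_contour(pitches):
    """String of pitch directions between consecutive notes: U, D or S."""
    out = []
    for prev, cur in zip(pitches, pitches[1:]):
        out.append("U" if cur > prev else "D" if cur < prev else "S")
    return "".join(out)

def edit_distance(a, b):
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]
```

The contour discards interval sizes, which makes it tolerant of inaccurate pitch but dependent on correct note segmentation, the weak point noted above.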

Page 54: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Time Series Representation of Query

• An example hum query

• Note segmentation is hard!

[Figure: pitch values (0 to 70) of the hum query over time (0 to 11 seconds), annotated “Segment this!”]

Page 55: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

How to deal with poor hummers?

• No absolute pitch

– Solution: the average pitch is subtracted

• Inaccurate pitch intervals

– Solution: return the k-nearest neighbors

• Incorrect overall tempo

– Solution: Uniform Time Warping

• Local timing variations

– Solution: Dynamic Time Warping

• Bottom line: timing variations take us beyond Euclidean distance.
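The first and third fixes are simple preprocessing steps and can be sketched as below, assuming NumPy. The names `remove_pitch_offset` and `uniform_warp` are hypothetical, and resampling by linear interpolation is one possible way to realize uniform time warping, chosen here for brevity.

```python
import numpy as np

def remove_pitch_offset(x):
    """Subtract the average pitch so transposed hums compare equal."""
    x = np.asarray(x, dtype=float)
    return x - x.mean()

def uniform_warp(x, length):
    """Stretch or squeeze x uniformly to `length` samples by linear
    interpolation, so series hummed at different tempos can be compared
    point by point."""
    x = np.asarray(x, dtype=float)
    old = np.linspace(0.0, 1.0, len(x))
    new = np.linspace(0.0, 1.0, length)
    return np.interp(new, old, x)
```

After both steps, two series of equal length can be compared with Euclidean distance; local timing variations still need Dynamic Time Warping.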

Page 56: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Dynamic Time Warping

• Euclidean distance: sum of point-by-point distance

• DTW distance: allowing stretching or squeezing the time axis locally

Time Series 1

Time Series 2
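For reference, the textbook dynamic-programming form of DTW looks like this. It is a minimal sketch assuming NumPy, with squared point differences and an optional warping band; the function name `dtw_distance` and the `band` parameter are illustrative.

```python
import numpy as np

def dtw_distance(a, b, band=None):
    """DTW distance between sequences a and b.

    cost[i, j] is the cheapest alignment of a[:i] with b[:j]. `band` limits
    how far the alignment may stray from the diagonal; band=0 on
    equal-length inputs reduces to Euclidean distance.
    """
    n, m = len(a), len(b)
    if band is None:
        band = max(n, m)
    band = max(band, abs(n - m))   # the end cell must stay reachable
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - band), min(m, i + band) + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j],      # stretch b
                                 cost[i, j - 1],      # stretch a
                                 cost[i - 1, j - 1])  # advance both
    return float(np.sqrt(cost[n, m]))
```

The running time is O(n m) without a band, which is why the lower-bounding filters discussed next matter for large databases.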

Page 57: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Envelope Transform using Piecewise Aggregate Approximation(PAA) [Keogh VLDB 02]

[Figure: original time series with its upper and lower envelopes and their PAA approximations U_Keogh and L_Keogh.]

Page 58: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Envelope Transform using Piecewise Aggregate Approximation(PAA)

[Figure: two panels. Left: original time series, upper envelope, lower envelope, with U_Keogh and L_Keogh. Right: the same series and envelopes with the tighter U_new and L_new.]

• Advantage of tighter envelopes – Still no false negatives, and fewer false positives
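To make the envelope idea concrete, here is a minimal sketch of the basic (untransformed) envelope lower bound in the style of Keogh's method, assuming NumPy and equal-length series. It does not implement the PAA transform or the tighter envelopes above; the names `envelope` and `lb_keogh` and the warping width `k` are illustrative.

```python
import numpy as np

def envelope(y, k):
    """Lower and upper envelope of y: min and max over a window of
    half-width k around each position."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    upper = np.array([y[max(0, i - k) : i + k + 1].max() for i in range(n)])
    lower = np.array([y[max(0, i - k) : i + k + 1].min() for i in range(n)])
    return lower, upper

def lb_keogh(x, y, k):
    """Distance from x to the envelope of y. Points of x inside the
    envelope contribute nothing, so the result never exceeds the DTW
    distance constrained to a band of width k."""
    x = np.asarray(x, dtype=float)
    lower, upper = envelope(y, k)
    above = np.clip(x - upper, 0.0, None)
    below = np.clip(lower - x, 0.0, None)
    return float(np.sqrt(np.sum(above ** 2 + below ** 2)))
```

A candidate whose lower bound already exceeds the current best DTW distance can be discarded without running DTW, which is where the "no false negatives" property comes from.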

Page 59: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Container Invariant Envelope Transform

• Container-invariant: a transformation T for envelope such that

• Theorem: if a transformation is container-invariant and lower-bounding, then the distance between the transformed time series x and the transformed envelope of y lower-bounds their DTW distance.

Feature Space

Page 60: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Framework to Scale Up for Large Database

[Figure: system flowchart with the following components.

Input: Humming with ‘ta’ → segment notes → note/duration sequence.

Query criteria: keywords; statistics-based features; melody (note); rhythm (duration).

Filters, each backed by the Database: Keyword filter; Boosted feature filter (boosting); Nearest-N search on DTW distance with transformed envelope filter; Rhythm alignment verifier.

Outputs: Top N match; Top N’ match.]

Page 61: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Improvement by Introducing Humming with ‘ta’ *

• Solve the problem of note segmentation

• Compare humming with ‘la’ and ‘ta’

* Idea from N. Kosugi et al. “A practical query-by-humming system for a large music database” ACM Multimedia 2000

Page 62: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Improvement by Introducing Humming with ‘ta’(2)

• Still use DTW distance to tolerate poor humming

• Decrease the size of time series by orders of magnitude.

• Thus reduce the computation of DTW distance

Page 63: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Statistics-Based Filters *

• Low-dimensional statistical feature comparison

– Low computation cost compared to DTW distance

– Quickly filter out true negatives

• Example

– Filter out candidates whose note length is much larger/smaller than the query’s note length

• More

– Standard deviation of note value

– Zero crossing rate of note value

– Number of local minima of note value

– Number of local maxima of note value

* Intuition from Erling Wold et al “Content-based classification, search and retrieval of audio” IEEE Multimedia 1996 http://www.musclefish.com
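The features listed above are cheap to compute. A minimal sketch, assuming NumPy, is shown below; the names `note_features` and `passes_filter`, the definition of zero crossings relative to the mean note value, and the length-ratio threshold of 2.0 are all assumptions made for illustration.

```python
import numpy as np

def note_features(notes):
    """Low-dimensional statistics of a note-value sequence."""
    x = np.asarray(notes, dtype=float)
    centered = x - x.mean()
    signs = np.sign(centered)
    signs = signs[signs != 0]          # ignore values exactly at the mean
    crossings = np.count_nonzero(signs[1:] != signs[:-1])
    inner = x[1:-1]
    return {
        "length": len(x),
        "std": float(x.std()),
        "zero_crossing_rate": crossings / max(len(x) - 1, 1),
        "local_maxima": int(np.count_nonzero((inner > x[:-2]) & (inner > x[2:]))),
        "local_minima": int(np.count_nonzero((inner < x[:-2]) & (inner < x[2:]))),
    }

def passes_filter(query, candidate, max_length_ratio=2.0):
    """Keep a candidate only if its note count is within a factor of
    max_length_ratio of the query's (threshold is an assumed value)."""
    q, c = len(query), len(candidate)
    return max(q, c) <= max_length_ratio * min(q, c)
```

Each feature alone is a weak classifier, which is what makes them candidates for the boosting step described next.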

Page 64: Fast Algorithms for Time Series with applications to Finance, Physics, Music and other Suspects Dennis Shasha Joint work with Yunyue Zhu, Xiaojian Zhao,

Boosting Statistics-Based Filters

• Characteristics of statistics-based filters

– Quick but weak classifier

– Does not guarantee the absence of false negatives

– Ideal candidates for boosting

• Boosting *

– “An algorithm for constructing a ‘strong’ classifier using only a training set and a set of ‘weak’ classification algorithm”

– “A particular linear combination of these weak classifiers is used as the final classifier which has a much smaller probability of misclassification”

* Cynthia Rudin et al “On the Dynamics of Boosting” In Advances in Neural Information Processing Systems 2004
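As one concrete instance of boosting, here is a small AdaBoost sketch. It assumes each weak classifier is a threshold test on a single statistic (a decision stump) and that labels are +1 (keep candidate) or -1 (reject). AdaBoost is a standard boosting algorithm chosen for illustration; the slides do not say which variant was used, and the names `train_adaboost` and `predict` are hypothetical.

```python
import math


def train_adaboost(features, labels, rounds=3):
    """Return a list of (alpha, feature index, threshold, polarity) stumps."""
    n = len(labels)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        best = None
        # Exhaustively pick the stump with the lowest weighted error.
        for j in range(len(features[0])):
            for thr in sorted({row[j] for row in features}):
                for pol in (1, -1):
                    preds = [pol if row[j] <= thr else -pol for row in features]
                    err = sum(w for w, p, y in zip(weights, preds, labels) if p != y)
                    if best is None or err < best[0]:
                        best = (err, j, thr, pol, preds)
        err, j, thr, pol, preds = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, j, thr, pol))
        # Increase the weight of misclassified examples, then renormalize.
        weights = [w * math.exp(-alpha * y * p)
                   for w, y, p in zip(weights, labels, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, row):
    """Sign of the weighted vote (the linear combination of weak classifiers)."""
    score = sum(a * (pol if row[j] <= thr else -pol) for a, j, thr, pol in ensemble)
    return 1 if score >= 0 else -1


# No single threshold separates these labels, but three stumps together do.
X, y = [[1], [2], [3]], [1, -1, 1]
strong = train_adaboost(X, y, rounds=3)
```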

Page 65:

Verify Rhythm Alignment in the Query Result

• Nearest-N search only used melody information

• Will A. Arentz et al * suggest combining rhythm and melody

– Results are generally better than using only melody information

– Not appropriate when the sum of several notes’ durations in the query may correspond to the duration of one note in the candidate

• Our method:

– First use melody information to compute the DTW distance

– Merge durations appropriately based on the note alignment

– Reject candidates that have bad rhythm alignment

* Will Archer Arentz “Methods for retrieving musical information based on rhythm and pitch correlation” CSGSC 2003

Page 66:

Query by Humming Demo

• 1039 songs (73051 note/duration sequences)

Page 67:

Burst detection: when window size is unknown

Page 68:

Burst Detection: Applications

• Discovering intervals with unusually large numbers of events.

– In astrophysics, the sky is constantly observed for high-energy particles. When a particular astrophysical event happens, a shower of high-energy particles arrives in addition to the background noise. Might last milliseconds or days…

– In telecommunications, if the number of packets lost within a certain time period exceeds some threshold, it might indicate some network anomaly. Exact duration is unknown.

– In finance, stocks with unusually high trading volumes should attract the notice of traders (or perhaps regulators).

Page 69:

Bursts across different window sizes in Gamma Rays

• Challenge : to discover not only the time of the burst, but also the duration of the burst.

Page 70:

Burst Detection: Challenge

• Single stream problem.

• What makes it hard is we are looking at multiple window sizes at the same time.

• Naïve approach is to do this one window size at a time.

Page 71:

Elastic Burst Detection: Problem Statement

• Problem: Given a time series of positive numbers x_1, x_2, ..., x_n, and a threshold function f(w), w = 1, 2, ..., n, find the subsequences of any size such that their sums are at or above the thresholds:

– all 0<w<n, 0<m<n-w, such that x_m + x_{m+1} + … + x_{m+w-1} ≥ f(w)

• Brute force search : O(n^2) time

• Our shifted binary tree (SBT): O(n+k) time.

– k is the size of the output, i.e. the number of windows with bursts
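The brute-force baseline can be sketched as below: it checks every window of every monitored size, which is the quadratic cost when all window sizes are of interest. The name `brute_force_bursts`, the dictionary form of f(w), and 0-based window starts are assumptions for illustration.

```python
from itertools import accumulate


def brute_force_bursts(x, f):
    """Return {(start, w)} for every window whose sum is >= f[w].

    x is the time series, f maps each monitored window size w to its
    threshold, and start is a 0-based index.
    """
    prefix = [0, *accumulate(x)]  # prefix[i] = x[0] + ... + x[i-1]
    n = len(x)
    return {
        (m, w)
        for w in f
        for m in range(n - w + 1)
        if prefix[m + w] - prefix[m] >= f[w]
    }


series = [4, 5, 1, 2, 20, 3, 6, 4, 1, 0, 9, 1, 2, 1, 3, 5]
thresholds = {2: 24, 3: 26, 4: 47, 5: 50}
bursts = brute_force_bursts(series, thresholds)
```

On this series the only burst is the size-3 window starting at index 4 (20 + 3 + 6 = 29 ≥ 26).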

Page 72:

Burst Detection: Data Structure and Algorithm

– Define threshold for node of size 2^k to be threshold for window of size 1 + 2^(k-1)

Page 73:

Burst Detection: Example

Time series: 4 5 1 2 20 3 6 4 1 0 9 1 2 1 3 5

[Figure: shifted binary tree built over the time series; each node stores the sum of the window it covers]

Window Size   2   3   4   5
Threshold    24  26  47  50

Page 74:

Burst Detection: Example

Time series: 4 5 1 2 20 3 6 4 1 0 9 1 2 1 3 5

[Figure: the same shifted binary tree, with the nodes that raise alarms marked as True Alarm or False Alarm]

Window Size   2   3   4   5
Threshold    24  26  47  50

Page 75:

Burst Detection: Algorithm

• In linear time, determine whether any node in SBT indicates an alarm.

• If so, do a detailed search to confirm (true alarm) or deny (false alarm) a real burst.

• In on-line version of the algorithm, need keep only most recent node at each level.
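The filter-then-verify idea can be sketched offline as follows. This is a hedged illustration, not the authors' code: it assumes non-negative values (so a node's sum bounds the sum of any window inside it), places level-k nodes of size 2^k at shifts of 2^(k-1), assigns each monitored window size to the lowest level that contains it, and uses the smallest assigned threshold as the node's alarm threshold. The name `sbt_bursts` and 0-based starts are assumptions.

```python
from itertools import accumulate


def sbt_bursts(x, f):
    """Return (bursts, alarms) for series x and thresholds f[w].

    bursts: set of (start, w) windows whose sum is >= f[w].
    alarms: list of (level, node_start, is_true_alarm).
    """
    n = len(x)
    prefix = [0, *accumulate(x)]
    # Any window of size w <= 1 + 2^(k-1) lies inside some level-k node.
    by_level = {}
    for w in f:
        level = max(w - 2, 0).bit_length() + 1
        by_level.setdefault(level, []).append(w)

    bursts, alarms = set(), []
    for level, sizes in by_level.items():
        size, shift = 2 ** level, 2 ** (level - 1)
        node_threshold = min(f[w] for w in sizes)
        for s in range(0, n, shift):
            e = min(s + size, n)  # the last nodes may be truncated
            if prefix[e] - prefix[s] < node_threshold:
                continue  # no window inside this node can be a burst
            # Alarm: detailed search over the windows inside the node.
            found = {
                (m, w)
                for w in sizes
                for m in range(s, e - w + 1)
                if prefix[m + w] - prefix[m] >= f[w]
            }
            alarms.append((level, s, bool(found)))
            bursts |= found
    return bursts, alarms


series = [4, 5, 1, 2, 20, 3, 6, 4, 1, 0, 9, 1, 2, 1, 3, 5]
thresholds = {2: 24, 3: 26, 4: 47, 5: 50}
bursts, alarms = sbt_bursts(series, thresholds)
```

On the example series, the size-4 node starting at index 2 (sum 26) raises a false alarm, and the node starting at index 4 (sum 33) raises a true alarm for the size-3 window 20 + 3 + 6 = 29.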

Page 76:

False Alarms (requires work, but no errors)

[Figure: false alarm rate (0 to 0.06) versus T = W/w (1 to 2), for p = 0.000001]

Page 77:

Empirical Study : Gamma Ray Burst

[Figure: processing time (ms, 0 to 80000) versus number of windows (0 to 50), comparing the SWT algorithm with the direct algorithm]

Page 78:

Case Study: Burst Detection(1)

Background:

Motivation: In astrophysics, the sky is constantly observed for high-energy particles. When a particular astrophysical event happens, a shower of high-energy particles arrives in addition to the background noise. An unusual event burst may signal an event interesting to physicists.

Technical Overview:

1. The sky is partitioned into 1800*900 buckets.

2. 14 sliding window lengths are monitored, from 0.1m to 39.81m.

3. The original code implements the naive algorithm.

[Figure: the sky as a grid of 1800 by 900 buckets; annotation O(n^2)]

Page 79:

Case Study: Burst Detection(2)

The challenges:

1. Vast amount of data: with 1800*900 time series, any trivial overhead can accumulate into a nontrivial expense.

2. Unavoidable overheads of data transformations:

– Data pre-processing such as fetching and storage requires much work.

– SBT trees have to be built no matter how many sliding windows are to be investigated.

– Thresholds are maintained over time because the background noise varies.

– A hit on one bucket will affect its neighbours, as shown in the previous figure.

Page 80:

Case Study: Burst Detection(3)

Our solutions:

1. Combine nearby buckets into one to save space and processing time. If any alarm is reported for this large bucket, go down to examine each of its small components (two-level detailed search).

2. Special implementation of the SBT tree:

– Build the SBT tree including only those levels that cover the sliding windows.

– Maintain a threshold tree for the sliding windows and update it over time.

Fringe benefits:

1. Adding window sizes is easy.

2. Monitoring more sliding windows also benefits physicists.

Page 81:

Case Study: Burst Detection(4)

Experimental results:

1. Benefits improve with more sliding windows.

2. Results consistent across different data files.

3. SBT algorithm runs 7 times faster than current algorithm.

4. More improvement possible if memory limitations are removed.

[Figure: “Burst Detection”: running time (0 to 1000) versus number of sliding windows (14, 28, 42) for SBT1, old1, SBT2 and old2]

Page 82:

Extension to other aggregates

• SBT can be used for any aggregate that is monotonic

– SUM, COUNT and MAX are monotonically increasing

• the alarm threshold is aggregate>threshold

– MIN is monotonically decreasing

• the alarm threshold is aggregate<threshold

– Spread =MAX-MIN

• Application in Finance

– Stock with burst of trading or quote(bid/ask) volume (Hammer!)

– Stock prices with high spread
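A sketch of the same filter-then-verify scheme with a pluggable aggregate, shown here for Spread = MAX − MIN. The spread of a node is at least the spread of any window inside it, which is the monotonicity the filter relies on. The names `spread` and `aggregate_bursts` are hypothetical, the node layout (size 2^k, shift 2^(k-1)) is an assumption, and aggregates are recomputed by slicing for brevity rather than speed.

```python
def spread(values):
    """Spread of a window: MAX - MIN."""
    return max(values) - min(values)


def aggregate_bursts(x, f, aggregate=spread):
    """Return {(start, w)} where aggregate(window) >= f[w].

    Requires a monotonically increasing aggregate: a node's value must be
    at least the value of every window it contains.
    """
    n = len(x)
    by_level = {}
    for w in f:
        by_level.setdefault(max(w - 2, 0).bit_length() + 1, []).append(w)

    bursts = set()
    for level, sizes in by_level.items():
        size, shift = 2 ** level, 2 ** (level - 1)
        node_threshold = min(f[w] for w in sizes)
        for s in range(0, n, shift):
            e = min(s + size, n)
            if aggregate(x[s:e]) < node_threshold:
                continue  # filtered out: no contained window can alarm
            for w in sizes:
                for m in range(s, e - w + 1):
                    if aggregate(x[m:m + w]) >= f[w]:
                        bursts.add((m, w))
    return bursts


prices = [10, 10, 11, 10, 15, 10, 10, 10]  # made-up prices
spread_thresholds = {2: 4, 3: 5}           # made-up thresholds
spread_bursts = aggregate_bursts(prices, spread_thresholds)
```

For a decreasing aggregate such as MIN, the comparisons would flip to `>` for the filter and `<=` for the detailed check.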

Page 83:

Empirical Study : Stock Price Spread Burst

[Figure: processing time (ms, log scale from 1 to 1000000) versus number of windows (0 to 50), comparing the SWT algorithm with the direct algorithm]

Page 84:

Extension to high dimensions

Page 85:

Elastic Burst in two dimensions

• Population Distribution in the US

Page 86:

Can discover numeric thresholds from probability threshold.

• Suppose that the moving sum of a time series is a random variable from a normal distribution.

• Let the number of bursts in the time series within sliding window size w be So(w) and its expectation be Se(w).

– Se(w) can be computed from the historical data.

• Given a threshold probability p, we set the threshold of burst f(w) for window size w such that Pr[So(w) ≥ f(w)] ≤p.

Page 87:

Find threshold for Elastic Bursts

• Φ(x) is the normal cdf, so symmetric around 0:

Pr[X ≤ Φ^{-1}(p)] = p

Pr[X ≥ −Φ^{-1}(p)] = p

Pr[(So(w) − Se(w)) / Se(w) ≥ −Φ^{-1}(p)] ≤ p

• Therefore

f(w) = Se(w) (1 − Φ^{-1}(p))

[Figure: plot of Φ(x) against x, marking p and Φ^{-1}(p)]
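The threshold formula f(w) = Se(w) (1 − Φ^{-1}(p)) can be evaluated with the standard library's normal distribution. This is a small sketch: the name `burst_threshold` and the expected sums in the usage lines are made up for illustration.

```python
from statistics import NormalDist


def burst_threshold(expected_sum, p):
    """f(w) = Se(w) * (1 - inverse normal cdf at p), for 0 < p < 1."""
    return expected_sum * (1.0 - NormalDist().inv_cdf(p))


# Hypothetical expected moving sums Se(w) per window size w.
expected = {2: 10.0, 4: 20.0, 8: 40.0}
thresholds = {w: burst_threshold(se, 1e-6) for w, se in expected.items()}
```

For small p the inverse cdf is negative, so the threshold sits above the expectation; at p = 0.000001 it is roughly 5.75 times Se(w).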

Page 88:

Summary of Burst Detection

• Able to detect bursts on many different window sizes in essentially linear time.

• Can be used both for time series and for spatial searching.

• Can specify thresholds either with absolute numbers or with probability of hit.

• Algorithm is simple to implement and has low constants (code is available).

• Ok, it’s embarrassingly simple.

Page 89:

AQuery A Database System for Order