
Online Event-driven Subsequence Matching over Financial Data Streams

Huanmei Wu, Betty Salzberg, Donghui Zhang

Northeastern University, College of Computer & Information Science

Presented by: Evangelos Kanoulas

SIGMOD 2004, NU CCIS

Motivation (1)

Given an incoming stream of stock market data, analyze it to perform:

  Trend prediction
  Pattern recognition
  Dynamic clustering of multiple data streams
  Rule discovery

Subsequence matching is the main component


Motivation (2)

Subsequence similarity over financial data streams has unique properties:

  Zigzag shape of the piecewise linear representation (PLR)
  The relative position of the end points is important
  Price change (amplitude) is more important than the time interval

[Figure: price vs. time plots of sequences S1 (end points 1 to 5) and S2 (end points 1' to 5'), and a plot of sequences S1, S2, and S3.]


Outline

1. Motivation

2. Data Stream Processing

3. Subsequence Matching

4. Trend Prediction

5. Performance

6. Conclusion


Data Stream Processing (1): Aggregation and Smoothing

Incoming data arrives at any time

Piecewise Linear Representation requires a unique value for each time interval

Aggregation of the raw data

Smoothing of the aggregated values using the moving average
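
The two preprocessing steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: the slides do not say how raw values are aggregated, so averaging per interval is an assumption, and the names `aggregate`, `moving_average`, `interval`, and `window` are hypothetical.

```python
from collections import defaultdict

def aggregate(ticks, interval):
    """Collapse raw (timestamp, price) ticks into one value per time interval.

    Assumption: the aggregated value is the mean of the ticks in the interval.
    Intervals without any tick are simply absent from the output.
    """
    buckets = defaultdict(list)
    for t, price in ticks:
        buckets[int(t // interval)].append(price)
    return [(b * interval, sum(p) / len(p)) for b, p in sorted(buckets.items())]

def moving_average(values, window):
    """Simple moving average; early outputs average the values seen so far."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        out.append(running / min(i + 1, window))
    return out

ticks = [(0, 10.0), (0.5, 12.0), (1.2, 14.0), (2.1, 16.0), (2.9, 18.0)]
aggregated = aggregate(ticks, interval=1)
smoothed = moving_average([v for _, v in aggregated], window=2)
```

Averaging is only one possible choice; a last-value or volume-weighted aggregate would fit the same interface.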


Data Stream Processing (2): Segmentation

The PLR may not be in a zigzag shape

The end points of the PLR should be points at which the trend changes dramatically

All other points are considered noise and should be eliminated

[Figure: aggregated data stream.]


Data Stream Processing (3): The %b data stream as the base for linear segmentation

Why use %b (Bollinger Band Percent)?

1. %b is a widely used financial indicator

2. %b has a smoothed moving trend similar to the aggregated data stream

3. %b is a normalized value; most values fall between -1 and 2

Result: uniform segmentation criteria

[Figure: the aggregated data stream and the corresponding %b data stream.]
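
For reference, %b is conventionally computed from the Bollinger Bands as (price - lower band) / (upper band - lower band), where the bands sit k standard deviations around a moving average. The sketch below assumes the common defaults of a 20-point window and k = 2; the slides do not state which settings were used, and the function name `percent_b` is hypothetical.

```python
from statistics import mean, pstdev

def percent_b(prices, window=20, k=2.0):
    """Return %b for every position that has a full window of history.

    Assumptions: simple moving average as the middle band and the
    population standard deviation for the band width.
    """
    out = []
    for i in range(window - 1, len(prices)):
        w = prices[i - window + 1 : i + 1]
        mid = mean(w)
        sd = pstdev(w)
        upper, lower = mid + k * sd, mid - k * sd
        if upper == lower:
            out.append(0.5)  # flat window: price sits on the middle band
        else:
            out.append((prices[i] - lower) / (upper - lower))
    return out

values = percent_b([1.0, 2.0, 3.0], window=3)
```

A value of 1 means the price is on the upper band and 0 means it is on the lower band, which is why thresholds defined on %b can be reused across stocks with very different price levels.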


Data Stream Processing (4): Segmentation over %b

[Figure: Price (x) vs. t, showing points 1 to 13 and the current sliding window.]

In the current sliding window, where Pj(Xj, tj) is the current point, Pi(Xi, ti) is an upper end point if:

1. Xi = max(X values of the current sliding window)

2. Xi > Xj + δ, where δ is the given error threshold

3. Pi(Xi, ti) is the last point satisfying the above two conditions
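
The upper end point test can be sketched as below. This is an illustrative reading of the three conditions, not the authors' code: the function name `find_upper_end_point` and the parameter name `delta` (the error threshold) are hypothetical, and "the last one" is interpreted as the latest point in time that attains the window maximum. A lower end point test would presumably mirror it with the minimum.

```python
def find_upper_end_point(window, delta):
    """Return the upper end point (x, t) of the sliding window, or None.

    window: list of (x, t) points in time order; the last entry is the
            current point Pj.
    delta:  error threshold the maximum must exceed Xj by.
    """
    if len(window) < 2:
        return None
    x_j = window[-1][0]
    x_max = max(x for x, _ in window)
    if x_max <= x_j + delta:
        return None  # the drop from the maximum is not large enough yet
    # Walk backwards so that ties resolve to the most recent maximum.
    for x, t in reversed(window):
        if x == x_max:
            return (x, t)
    return None

window = [(0.2, 1), (0.9, 2), (0.9, 3), (0.5, 4)]
end_point = find_upper_end_point(window, delta=0.3)
```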



Data Stream Processing (5): Two-Step Pruning

a. Filter step on %b streams

b. Refine step on the raw sequence stream to eliminate false positives

[Figure: the aggregated stream (price vs. t) and the %b stream over time points t0 to t5, annotated with δpd and δpb.]


Outline

1. Motivation

2. Data Stream Processing

3. Subsequence Matching

4. Trend Prediction

5. Performance

6. Conclusion


Subsequence Similarity (1): Event-driven subsequence matching

Identifying a new potential end point triggers a subsequence matching search

The search algorithm finds subsequences in the historical data that are similar to a query subsequence

The query subsequence consists of the n most recent end points
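
The event-driven trigger can be sketched as a small class that keeps the n most recent end points and runs a search whenever a new one is identified. The class name `EventDrivenMatcher` and the `search` callback are hypothetical; the actual search over historical data is not shown in the slides and is left as a pluggable function here.

```python
from collections import deque

class EventDrivenMatcher:
    """Runs a similarity search each time a new end point is identified."""

    def __init__(self, n, search):
        self.recent = deque(maxlen=n)  # keeps only the n latest end points
        self.search = search           # callback: query subsequence -> result

    def on_end_point(self, point):
        """Register a new end point; search once n end points are available."""
        self.recent.append(point)
        if len(self.recent) < self.recent.maxlen:
            return None  # not enough end points for a full query yet
        return self.search(list(self.recent))

queries = []
matcher = EventDrivenMatcher(3, lambda q: queries.append(q) or len(q))
results = [matcher.on_end_point((x, t)) for t, x in enumerate([10, 14, 11, 15])]
```

No search runs between end points, which is the difference from re-running the match at fixed time intervals.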

[Figure: price vs. t over time points t5 to t40, with end points 1 to 4 marked.]


Subsequence Similarity (2): New similarity measure

S = {(X1, t1), (X2, t2), …, (Xn, tn)}

S' = {(X1', t1'), (X2', t2'), …, (Xn', tn')}

S and S' are similar if they satisfy the following two conditions:

1. The relative position of the end points of S and S' is the same

2. d(S, S') < ε, where

d(S, S') = Σ (i = 1 to n-1) ( α * | |Xi+1 - Xi| - |Xi+1' - Xi'| | + β * | (ti+1 - ti) - (ti+1' - ti')| )

and ε, α, β ≥ 0 are user-defined parameters
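
A minimal sketch of the distance follows. It assumes the distance is summed over consecutive end-point pairs and that the two weights and the threshold are plain non-negative numbers; the names `alpha`, `beta`, `eps`, `distance`, and `within_threshold` are chosen here for illustration. The relative-position condition is a separate check and is not covered by this function.

```python
def distance(s, s_prime, alpha=1.0, beta=0.0):
    """Weighted amplitude/time distance between two end-point sequences.

    s, s_prime: lists of (x, t) end points of equal length.
    alpha weights the difference in price change (amplitude),
    beta weights the difference in time interval.
    """
    if len(s) != len(s_prime):
        raise ValueError("subsequences must have the same number of end points")
    total = 0.0
    pairs = zip(s, s[1:], s_prime, s_prime[1:])
    for (x0, t0), (x1, t1), (y0, u0), (y1, u1) in pairs:
        amplitude_diff = abs(abs(x1 - x0) - abs(y1 - y0))
        interval_diff = abs((t1 - t0) - (u1 - u0))
        total += alpha * amplitude_diff + beta * interval_diff
    return total

def within_threshold(s, s_prime, eps, alpha=1.0, beta=0.0):
    """Second similarity condition only: d(S, S') < eps."""
    return distance(s, s_prime, alpha, beta) < eps

s = [(10, 0), (14, 2), (11, 5)]
s_prime = [(20, 0), (23, 3), (21, 4)]
d = distance(s, s_prime, alpha=1.0, beta=0.5)
```

Setting `beta` well below `alpha` reflects the earlier point that price change matters more than the time interval.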


Subsequence Similarity (3): Subsequence Permutation

S = {(X1, t1), (X2, t2), …, (Xn, tn)}

Separate upper and lower points:

S' = { [(X1, t1), (X3, t3), …, (Xn-1, tn-1)],

[(X2, t2), (X4, t4), …, (Xn, tn)] }

Sort each group separately based on X values:

S'' = { [(Xi1, ti1), (Xi3, ti3), …, (Xi(n-1), ti(n-1))],

[(Xi2, ti2), (Xi4, ti4), …, (Xin, tin)] }

Get the subsequence permutation:

{i1, i3, …, i(n-1), i2, i4, …, in}
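
The permutation steps can be sketched as follows. Ascending sort order and 1-based positions are assumptions, and the function name `subsequence_permutation` is hypothetical. Presumably two subsequences with equal permutations are the ones whose end points are in the same relative position.

```python
def subsequence_permutation(s):
    """Return the permutation of 1-based end-point positions.

    s: list of (x, t) end points that alternate between upper and lower
       points, so odd and even positions form the two groups.
    Each group is sorted by X value (ascending, ties keep time order),
    and the sorted odd positions are followed by the sorted even ones.
    """
    odd = list(range(1, len(s) + 1, 2))
    even = list(range(2, len(s) + 1, 2))

    def x_value(position):
        return s[position - 1][0]

    return sorted(odd, key=x_value) + sorted(even, key=x_value)

s = [(10, 0), (15, 1), (12, 2), (18, 3), (9, 4), (16, 5)]
perm = subsequence_permutation(s)

# A rescaled copy keeps every relative position, so its permutation matches.
scaled = [(2 * x + 1, t) for x, t in s]
same_shape = subsequence_permutation(scaled) == perm
```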


Outline

1. Motivation

2. Data Stream Processing

3. Subsequence Matching

4. Trend Prediction

5. Performance

6. Conclusion


Trend Prediction: Subsequence matching application

Trend-K at a point p measures the change in price over the next k points

Three trends: UP, DOWN, NOTREND
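
One way to read Trend-K is sketched below. The slides do not define how the change is measured or where the boundary between a trend and NOTREND lies, so comparing the price k points later against the price at p, and the `threshold` parameter, are both assumptions; the function name `trend_k` is hypothetical.

```python
def trend_k(prices, p, k, threshold=0.0):
    """Classify the price change from position p to position p + k.

    Returns "UP", "DOWN", or "NOTREND", or None when fewer than k points
    follow position p.
    """
    if p + k >= len(prices):
        return None
    change = prices[p + k] - prices[p]
    if change > threshold:
        return "UP"
    if change < -threshold:
        return "DOWN"
    return "NOTREND"

prices = [10, 11, 12, 11.5, 9]
labels = [trend_k(prices, p, k=2, threshold=0.6) for p in range(len(prices))]
```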

[Figure: price vs. t over time points t5 to t40.]


Outline

1. Motivation

2. Data Stream Processing

3. Subsequence Matching

4. Trend Prediction

5. Performance

6. Conclusion


Performance (1): Similarity measure

[Figure: bar chart of Correctness % (axis from 30 to 70) for the similarity measures Perm+Amp, Amp Only, Perm Only, Perm+Euc, and Euc Only.]


Performance (2): Event-driven vs. fixed time periods

[Figure: bar chart of Correctness % (axis from 30 to 70) for Event-driven, FT1, FT5, FT10, FT15, FT20, FT25, and FT30.]

[Figure: bar chart of Relative CPU cost (axis from 0% to 100%) for Event-driven, FT1, FT5, FT10, FT15, FT20, FT25, and FT30.]


Outline

1. Motivation

2. Data Stream Processing

3. Subsequence Similarity

4. Trend Prediction

5. Performance

6. Conclusion


Conclusion

Proposed an online segmentation and pruning algorithm

Defined an alternative subsequence similarity measure

Introduced an event-driven online similarity matching algorithm

Achieved 70% correct predictions using real-world data
