ftw: fast similarity search under the time warping distance
TRANSCRIPT
FTW: Fast Similarity Search under the Time Warping Distance
Yasushi Sakurai (NTT Cyber Space Labs)Masatoshi Yoshikawa (Nagoya Univ.)Christos Faloutsos (Carnegie Mellon Univ.)
PODS 2005 Y. Sakurai et al 2
Motivation
n Time-series dataq many applications
n computational biology, astrophysics, geology, meteorology, multimedia, economics
n Similarity searchq Euclidean distanceq DTW (Dynamic Time Warping)
n Useful for different sequence lengthsn Different sampling ratesn scaling along the time axis
PODS 2005 Y. Sakurai et al 3
Mini-introduction to DTWn DTW allows sequences to be stretched along the
time axisq Minimize the distance of sequencesq Insert ‘stutters’ into a sequenceq THEN compute the (Euclidean) distance
‘stutters’:original
PODS 2005 Y. Sakurai et al 4
Mini-introduction to DTWn DTW is computed by dynamic programming
q Warping path: set of grid cells in the time warping matrix
data sequence P of length N
query sequence Q of length M
pN
qM
pi
qjq1
p1
P
Q
p1 pi pNq1
qj
qM
p-stutters
q-stutters
Optimum warping path(the best alignment)
PODS 2005 Y. Sakurai et al 5
Mini-introduction to DTW
ïî
ïí
ì
----
+-=
=
)1,1(),1()1,(
min),(
),(),(
jifjif
jifqpjif
MNfQPD
ji
dtw
q-stutterno stutter
p-stutter
n DTW is computed by dynamic programming
p1, p2, …, pi,; q1, q2, …, qj
PODS 2005 Y. Sakurai et al 6
Mini-introduction to DTWn Global constraints limit the warping scope
q Warping scope: area that the warping path is allowed to visit
P
Q
p1 pi pN
q1
qj
qM
P
Q
p1 pi pN
q1
qj
qM
Itakura ParallelogramSakoe-Chiba Band
PODS 2005 Y. Sakurai et al 7
Mini-introduction to DTWn Width of the warping scope W is user-defined
P
Q
p1 pi pN
q1
qj
qM
Sakoe-Chiba Band
W1
P
Q
p1 pi pN
q1
qj
qM
W2
PODS 2005 Y. Sakurai et al 8
Motivation
n Similarity search for time-series dataq DTW (Dynamic Time Warping)
n scaling along the time axisBut…n High search cost O(NM)n prohibitive for long sequences
PODS 2005 Y. Sakurai et al 9
Our Solution, FTW
n Requirements: 1. Fast2. No false dismissals3. No restriction on the sequence length
n It should handle data sequences of different lengths4. Support for any, as well as for no restriction on
“warping scope”
PODS 2005 Y. Sakurai et al 10
Problem Definition
n Givenq S time-series data sequences of unequal lengths
{P1, P2, …, PS}, q a query sequence Q, q an integer k, q (optionally) a warping scope W,
n Find the k-nearest neighbors of Q from the data sequence set by using DTW with W
PODS 2005 Y. Sakurai et al 11
Overview
n Introductionn Related workn Main ideasn Experimental resultsn Conclusions
PODS 2005 Y. Sakurai et al 12
Related Work
n Sequence indexingq Agrawal et al. (FODO 1998)q Keogh et al. (SIGMOD 2001)q …
n Subsequence matchingq Faloutsos et al. (SIGMOD 1994)q Moon et al. (SIGMOD 2002)q …
PODS 2005 Y. Sakurai et al 13
Related Work
n Fast sequence matching for DTWq Yi et al. (ICDE 1998)q Kim et al. (ICDE 2001)q Chu et al. (SDM 2002)q Keogh (VLDB 2002)q Zhu et al. (SIGMOD 2003)q …
n None of the existing methods for DTW fulfills all the requirements
PODS 2005 Y. Sakurai et al 14
Overview
n Introductionn Related workn Main ideasn Experimental resultsn Conclusions
PODS 2005 Y. Sakurai et al 15
Main Idea (1) - LBS
n LBS (Lower Bounding distance measure with Segmentation)
n PA : Approximate sequencesq : segment rangeq : upper valueq : lower value
q t: length of time intervals*
):( Ui
Li
Ri ppp =
Uip
Rip
Lip AP
Rp1Rp4
Rp3
t t t t
Rp2
PODS 2005 Y. Sakurai et al 16
Main Idea (1) - LBS
Rjq
Rip
n Compute lower bounding distanceq Distance of the two ranges and :
distance of their two closest points
Rjq
Rip
Time
Value Lower bound
Time
Value Lower bound=0
PODS 2005 Y. Sakurai et al 17
Main Idea (1) - LBS
n Compute lower bounding distanceq Distance of the two ranges and :
distance of their two closest points
Rjq
ïïî
ïïí
ì
>-
>-
=)(0)()(
),(otherwise
pqpqqpqp
qpD Ui
Lj
Ui
Lj
Uj
Li
Uj
Li
Rj
Riseg
Rip
details
PODS 2005 Y. Sakurai et al 18
Main Idea (1) - LBS
P
Q
P
Q
n Exact DTW distance
PODS 2005 Y. Sakurai et al 19
Main Idea (1) - LBS
n Compute lower bounding distance from PA and QA
n Use a dynamic programming approach
AP
AQ
AP
AQ
),(),( QPDQPD dtwAA
lbs £
PODS 2005 Y. Sakurai et al 20
Main Idea (1) - LBS
n Compute lower bounding distance from PA and QA
n Use a dynamic programming approach
AP
AQ
),(),( QPDQPD dtwAA
lbs £
P
Q
PODS 2005 Y. Sakurai et al 21
Main Idea (2) - EarlyStopping
n Exploit the fact that we have found k-near neighbors at distance dcbq dcb: k-nearest neighbor distance (the Current Best)
the exact distance of the best k candidates so far
PODS 2005 Y. Sakurai et al 22
Main Idea (2) - EarlyStoppingn Exclude useless warping paths by using
q Omit g(1,3) ifq Omit g(4,1) if
AP
AQ
g(1,2)
g(3,1)
AP
AQ
cbdg >)2,1( cbd
cbdg >)1,3(
PODS 2005 Y. Sakurai et al 23
Main Idea (3) - Refinement
n Q: How to choose t (length of time intervals)?
AP
AQ
g(1,2)
g(3,1)
AP
AQ
t
t
PODS 2005 Y. Sakurai et al 24
Main Idea (3) - Refinement
n Q: How to choose t (length of intervals)?n A: Use multiple granularities, as follows:
AP
AQ
g(1,2)
g(3,1)
AP
AQ
t
t
PODS 2005 Y. Sakurai et al 25
Main Idea (3) - Refinement
n Compute the lower bounding distance from the coarsest sequences as the first refinement step
n Ignore P if , otherwise:
AP
AQ
g(1,2)
g(3,1)
AP
AQ
cbAA
lbs dQPD >),(
PODS 2005 Y. Sakurai et al 26
Main Idea (3) - Refinement
n … compute the distance from more accurate sequences as the second refinement step
n … repeat
AP
AQ
AQ
AP
PODS 2005 Y. Sakurai et al 27
Main Idea (3) - Refinement
n … until the finest granularityn Update the list of k-nearest neighbors if
P
Q
P
Q
cbdtw dQPD £),(
PODS 2005 Y. Sakurai et al 28
Overview
n Introductionn Related workn Main ideasn Experimental resultsn Conclusions
PODS 2005 Y. Sakurai et al 29
Experimental results
n Setupq Intel Xeon 2.8GHz, 1GB memory, Linuxq Datasets:
Temperature, Fintime, RandomWalkq Four different time intervals (for n=2048)
t1=2, t2=8, t3=32, t4=128
n Evaluationq Compared FTW with LB_PAA (the best so far)q Mainly computation time
PODS 2005 Y. Sakurai et al 30
Outline of experiments
n Speed vs db sizen Speed vs warping scope Wn Effect of filteringn Effect of varying-length data sequences
PODS 2005 Y. Sakurai et al 31
Search Performance
n Itakura Parallelogram
P
Q
p1 pi pN
q1
qj
qM
PODS 2005 Y. Sakurai et al 32
Search Performance
n Wall clock time as a function of data set sizen Temperature FTW is up
to 50 times faster!
PODS 2005 Y. Sakurai et al 33
Search Performance
n Wall clock time as a function of data set sizen Fintime FTW is up
to 40 times faster!
PODS 2005 Y. Sakurai et al 34
Search Performance
n Wall clock time as a function of data set sizen RandomWalk FTW is up
to 40 times faster!
More effective as the size
grows
PODS 2005 Y. Sakurai et al 35
Outline of experiments
n Speed vs db sizen Speed vs warping scope Wn Effect of filteringn Effect of varying-length data sequences
PODS 2005 Y. Sakurai et al 36
Search Performance
n Sakoe-Chiba Band
P
Q
p1 pi pN
q1
qj
qM
W1
P
Q
p1 pi pN
q1
qj
qM
W2
PODS 2005 Y. Sakurai et al 37
Search Performance
n Wall clock time as a function of warping scopen Temperature FTW is up
to 220 times faster!
PODS 2005 Y. Sakurai et al 38
Search Performance
n Wall clock time as a function of warping scopen Fintime FTW is up
to 70 times faster!
PODS 2005 Y. Sakurai et al 39
Search Performance
n Wall clock time as a function of warping scopen RandomWalk FTW is up
to 100 times faster!
PODS 2005 Y. Sakurai et al 40
Outline of experiments
n Speed vs db sizen Speed vs warping scope Wn Effect of filteringn Effect of varying-length data sequences
PODS 2005 Y. Sakurai et al 41
Effect of filtering
n Most of data sequences are excluded by coarser approximations (t4=128 and t3=32)q Using multiple granularities has significant advantages
Frequency of approximation use
PODS 2005 Y. Sakurai et al 42
Outline of experiments
n Speed vs db sizen Speed vs warping scope Wn Effect of filteringn Effect of varying-length sequences
PODS 2005 Y. Sakurai et al 43
Difference in Sequence Lengthsn 5 sequence data sets
Random(2048,0): length 2048 +/- 0Random(2048,32): length 2048 +/- 16Random(2048,64), Random(2048,128), Random(2048,256)
Outperform by 2+ orders of magnitude
LB_PAA can not handle
PODS 2005 Y. Sakurai et al 44
Overview
n Introductionn Related workn Main ideasn Experimental resultsn Conclusions
PODS 2005 Y. Sakurai et al 45
Conclusions
n Design goals: 1. Fast2. No false dismissals3. No restriction on the sequence length4. Support for any, as well as for no
restriction on “warping scope”
PODS 2005 Y. Sakurai et al 46
Conclusions
n Design goals: 1. Fast (up to 220 times faster)2. No false dismissals3. No restriction on the sequence length4. Support for any, as well as for no
restriction on “warping scope”
PODS 2005 Y. Sakurai et al 47
Page Accessesn Sequential scan of feature data should boost
performance (speed-up factors SF=5, SF=10)PAds: page accesses for data sequences
PAfd: page accesses for feature datadsfd
SF PASFPA
PA +=
details