fast subsequence matching in time-series databases christos faloutsos m. ranganathan yannis...

Fast Subsequence Matching in Time-Series Databases

Christos Faloutsos, M. Ranganathan, Yannis Manolopoulos

Department of Computer Science and ISR
University of Maryland at College Park

Presented by Rui Li

Abstract

Goal: To find an efficient indexing method to locate time series in a database

Main Idea:
– Map each time series into a small set of multidimensional rectangles in feature space
– Rectangles can be readily indexed using traditional spatial access methods, e.g., R*-tree

Introduction

Hot Problem: Searching for similar patterns in time-series databases

Applications:
– financial, marketing and production time series, e.g. stock prices
– scientific databases, e.g. weather, geological, environmental data

Introduction (cont.)

Similarity Queries:
– Whole Matching
– Subsequence Matching
    partial matching
    report time series along with offset

Introduction (cont.)

Whole Matching (Previous Work)
– Use a distance-preserving transform (e.g., DFT) to extract f features from each time series (e.g., the first f DFT coefficients), and then map each series to a point in the f-dimensional feature space
– A spatial access method (e.g., an R*-tree) can then be used to answer approximate-match queries

Introduction (cont.)

Subsequence Matching (Goal)
– Map time series into rectangles in feature space
– Spatial access methods as the eventual indexing mechanism

Background

To guarantee no false dismissals for range queries, the feature extraction function F() should satisfy the following formula:

$$D_{feature}(F(O_1), F(O_2)) \le D_{object}(O_1, O_2)$$

Parseval Theorem:
– The DFT preserves the Euclidean distance between two time series x and y, whose DFTs are X and Y:

$$D(x, y) = D(X, Y)$$
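The two conditions above can be checked numerically. The following is a minimal sketch, assuming a unitary DFT (scaled by 1/sqrt(n)) and Euclidean distance; the helper names `dft`, `dist` and `features` and the sample values are illustrative and do not come from the slides.

```python
import cmath
import math


def dft(x):
    """Unitary DFT (scaled by 1/sqrt(n)) so that distances are preserved."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)) / math.sqrt(n)
        for k in range(n)
    ]


def dist(a, b):
    """Euclidean distance; works for real or complex sequences."""
    return math.sqrt(sum(abs(u - v) ** 2 for u, v in zip(a, b)))


def features(x, f):
    """Hypothetical F(): keep only the first f DFT coefficients."""
    return dft(x)[:f]


# Two made-up series of equal length.
x = [1.0, 2.0, 4.0, 3.0, 5.0, 4.0, 6.0, 7.0]
y = [1.5, 2.5, 3.0, 3.5, 4.0, 5.0, 6.5, 6.0]

d_object = dist(x, y)                             # distance in the time domain
d_full = dist(dft(x), dft(y))                     # distance over all coefficients
d_feature = dist(features(x, 2), features(y, 2))  # distance over 2 coefficients
```

Under these assumptions `d_full` equals `d_object` up to rounding (Parseval), and `d_feature` can only be smaller because it drops non-negative terms from the sum, which is the lower-bounding condition needed for no false dismissals.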

Proposed Method

Mapping each time series to a trail in feature space
– Use a sliding window of size w and place it at every possible offset
– For each such placement of the window, extract the features of the subsequence inside the window
– A time series of length L is mapped to a trail in feature space, consisting of L-w+1 points: one point for each offset
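The sliding-window mapping can be sketched as follows. This is an illustration only: it assumes the features are the first f unitary-DFT coefficients, flattened into real and imaginary parts, and the names `window_features` and `trail` are hypothetical.

```python
import cmath
import math


def window_features(window, f):
    """First f unitary-DFT coefficients of one window, as (re, im) pairs."""
    n = len(window)
    point = []
    for k in range(f):
        c = sum(
            window[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        ) / math.sqrt(n)
        point.extend([c.real, c.imag])
    return tuple(point)


def trail(series, w, f):
    """One feature point per window offset: L - w + 1 points in total."""
    return [
        window_features(series[off:off + w], f)
        for off in range(len(series) - w + 1)
    ]


# Made-up series of length L = 10, window w = 4, f = 2 coefficients.
series = [10.0, 10.5, 11.0, 10.8, 11.2, 11.9, 12.1, 11.7, 12.4, 12.0]
points = trail(series, w=4, f=2)
```

With L = 10 and w = 4 the trail has 7 points, each with 2f = 4 coordinates. Recomputing the DFT from scratch at every offset is done here for clarity only.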

Example 1

Example 2

(a) a sample stock-price time series
(b) its trail in the feature space of the 0-th and 1-st DFT coefficients
(c) its trail in the feature space of the 1-st and 2-nd DFT coefficients

Proposed Method (cont.)

Indexing the trails
– Simply storing the individual points of the trail in an R*-tree is inefficient
– Exploit the fact that successive points of the trail tend to be similar, i.e., the contents of the sliding window in nearby offsets tend to be similar
– Divide the trail into sub-trails and represent each of them with its minimum bounding (hyper)-rectangle (MBR)
– Store only a few MBRs

Proposed Method (cont.)

Indexing the trails (cont.)
– Can guarantee 'no false dismissals': when a query arrives, all the MBRs that intersect the query region are retrieved, i.e., all the qualifying sub-trails are retrieved, plus some false alarms

Return to Example 1 (figure; ε labels the query tolerance)


Indexing the trails (cont.)
– Map a time series into a set of rectangles in feature space
– Each MBR corresponds to a sub-trail

For each MBR we have to store
– $t_{start}, t_{end}$, which are the offsets of the first and last window positionings in the sub-trail
– A unique identifier for each time series
– The extent of the MBR in each dimension, i.e., $(F1_{low}, F1_{high}, F2_{low}, F2_{high}, \ldots)$

Store the MBRs in an R*-tree
– Recursively group the MBRs into parent MBRs, grandparent MBRs, etc.
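The per-MBR record listed above could be represented as in the sketch below. The class `LeafEntry`, the function `make_entry` and the field layout are assumptions made for illustration; the slides do not prescribe a concrete layout.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class LeafEntry:
    """Hypothetical record for one sub-trail stored in a leaf node."""
    series_id: int            # unique identifier of the time series
    t_start: int              # offset of the first window in the sub-trail
    t_end: int                # offset of the last window in the sub-trail
    low: Tuple[float, ...]    # (F1_low, F2_low, ...)
    high: Tuple[float, ...]   # (F1_high, F2_high, ...)


def make_entry(series_id: int, t_start: int,
               points: Sequence[Tuple[float, ...]]) -> LeafEntry:
    """Build the minimum bounding rectangle of consecutive trail points."""
    dims = range(len(points[0]))
    low = tuple(min(p[d] for p in points) for d in dims)
    high = tuple(max(p[d] for p in points) for d in dims)
    return LeafEntry(series_id, t_start, t_start + len(points) - 1, low, high)


# Three consecutive 2-d trail points of a made-up series with id 7,
# starting at window offset 20.
sub_trail = [(1.0, 4.0), (1.5, 3.5), (2.0, 3.8)]
entry = make_entry(series_id=7, t_start=20, points=sub_trail)
```

Here the low and high bounds are stored as two tuples rather than interleaved per dimension; both carry the same information.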

Example 1 (cont.)
– assuming a fan-out of 4

The structure of a leaf node and a non-leaf node


Two questions
– Insertions: when a new time series is inserted, what is a good way to divide its trail into sub-trails?
– Queries: how to handle queries, especially the ones that are longer than the sliding window?


Insertion – Dividing trails into sub-trails
– Goal: Optimal division so that the number of disk accesses is minimized

Example 3 (figure panels: fixed heuristic, adaptive heuristic)


Insertion (cont.)
– Group trail-points into sub-trails by means of an adaptive heuristic
– Based on a greedy algorithm, using a cost function to estimate the number of disk accesses for each of the options


Insertion (cont.)
– The cost function:

$$DA(L) = \prod_{i=1}^{n} (L_i + 0.5)$$

where $L = (L_1, L_2, \ldots, L_n)$ are the sides of the n-dimensional MBR of a node in an R-tree
– The marginal cost of each point:

$$mc = DA(L) / k$$

where k is the number of points in this MBR
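A small worked instance of the two formulas, as a sketch. The function names `disk_accesses` and `marginal_cost` and the side lengths are made up for illustration.

```python
import math


def disk_accesses(sides):
    """DA(L): product of (L_i + 0.5) over the sides of the MBR."""
    return math.prod(side + 0.5 for side in sides)


def marginal_cost(sides, k):
    """mc: estimated disk accesses divided by the k points in the MBR."""
    return disk_accesses(sides) / k


# A 2-d MBR with sides 0.1 and 0.3 that covers k = 4 trail points.
sides = (0.1, 0.3)
da = disk_accesses(sides)        # (0.1 + 0.5) * (0.3 + 0.5)
mc = marginal_cost(sides, k=4)
```

For these values DA(L) = 0.6 × 0.8 = 0.48 and mc = 0.48 / 4 = 0.12.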


Insertion (cont.)
– Algorithm:

Assign the first point of the trail to a sub-trail (which would be a predefined small MBR)
FOR each successive point
    IF it increases the marginal cost of the current sub-trail
        THEN start a new sub-trail
        ELSE include it into the current sub-trail
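The greedy loop above can be sketched in a few lines. This is one possible reading, not the authors' implementation: it assumes the marginal cost mc = DA(L)/k with DA(L) the product of (L_i + 0.5), treats a single point as an MBR with zero-length sides, and uses made-up trail coordinates. The names `marginal_cost` and `divide_trail` are hypothetical.

```python
import math


def marginal_cost(points):
    """mc = DA(L) / k for the MBR that encloses the given points."""
    dims = range(len(points[0]))
    sides = [max(p[d] for p in points) - min(p[d] for p in points) for d in dims]
    return math.prod(s + 0.5 for s in sides) / len(points)


def divide_trail(trail):
    """Greedy split: open a new sub-trail when a point would raise the cost."""
    sub_trails = [[trail[0]]]
    for point in trail[1:]:
        current = sub_trails[-1]
        if marginal_cost(current + [point]) > marginal_cost(current):
            sub_trails.append([point])   # THEN start a new sub-trail
        else:
            current.append(point)        # ELSE include it in the current one
    return sub_trails


# Made-up 2-d trail: three nearby points, a jump, then two nearby points.
trail = [(0.10, 0.10), (0.12, 0.11), (0.13, 0.13), (0.90, 0.85), (0.91, 0.88)]
groups = divide_trail(trail)
```

On this input the jump to (0.90, 0.85) raises the marginal cost of the first sub-trail, so the trail is split into groups of 3 and 2 points.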


Insertion (cont.)
– The algorithm may not work well under certain circumstances
– The algorithm's goal is to minimize the size of each MBR, so why not use clustering techniques?


Searching – Queries longer than w
– If Len(Q)=w, the searching algorithm proceeds as follows:
    Map Q to a point q in the feature space; the query corresponds to a sphere with center q and radius ε
    Retrieve the sub-trails whose MBRs intersect the query region
    Examine the corresponding time series, and discard the false alarms
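The retrieval step amounts to a sphere-versus-rectangle intersection test. The sketch below scans a plain list of MBRs instead of traversing an R*-tree, which is a simplification; the names `min_dist_to_mbr` and `candidate_mbrs` and the sample rectangles are hypothetical.

```python
import math


def min_dist_to_mbr(q, low, high):
    """Smallest Euclidean distance from point q to the rectangle [low, high]."""
    total = 0.0
    for x, lo, hi in zip(q, low, high):
        if x < lo:
            total += (lo - x) ** 2
        elif x > hi:
            total += (x - hi) ** 2
    return math.sqrt(total)


def candidate_mbrs(q, eps, mbrs):
    """Names of MBRs that intersect the sphere with center q and radius eps."""
    return [name for name, low, high in mbrs
            if min_dist_to_mbr(q, low, high) <= eps]


# Made-up 2-d MBRs given as (name, low corner, high corner).
mbrs = [
    ("A", (0.0, 0.0), (1.0, 1.0)),
    ("B", (3.0, 3.0), (4.0, 4.0)),
    ("C", (1.2, 0.5), (2.0, 0.9)),
]
hits = candidate_mbrs(q=(1.1, 0.7), eps=0.15, mbrs=mbrs)
```

MBRs A and C lie within 0.15 of the query point and are returned as candidates; B is pruned. The candidates still have to be checked against the actual time series to discard false alarms.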

Proposed Method (cont.)

Searching (cont.)
– If Len(Q)>w, consider the following Lemma:
    Consider two sequences Q and S of the same length Len(Q)=Len(S)=p*w
    Consider their p disjoint subsequences $q_i = Q[i \cdot w + 1 : (i+1) \cdot w]$ and $s_i = S[i \cdot w + 1 : (i+1) \cdot w]$, where $0 \le i \le p-1$
    If Q and S agree within tolerance ε, then at least one of the pairs $(s_i, q_i)$ of corresponding subsequences agrees within tolerance $\epsilon / \sqrt{p}$
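A short argument for why the lemma holds, sketched under the assumption that D is the Euclidean distance (as in the Parseval slide); the derivation is added for illustration and is not quoted from the slides.

```latex
% Assumption: D is the Euclidean distance, so squared distances
% add up over the p disjoint pieces.
\begin{align*}
D^2(Q, S) &= \sum_{i=0}^{p-1} D^2(s_i, q_i). \\
\text{Suppose } D(s_i, q_i) &> \epsilon / \sqrt{p} \quad \text{for every } i,\ 0 \le i \le p-1. \\
\text{Then } D^2(Q, S) &> p \cdot \frac{\epsilon^2}{p} = \epsilon^2 .
\end{align*}
% This contradicts D(Q, S) <= \epsilon, so at least one pair must
% satisfy D(s_i, q_i) <= \epsilon / \sqrt{p}.
```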


Searching (cont.)
– If Len(Q)>w, the searching algorithm proceeds as follows:
    The query time series Q is broken into p sub-queries, which correspond to p spheres in the feature space with radius $\epsilon / \sqrt{p}$
    Retrieve the sub-trails whose MBRs intersect at least one of the sub-query regions
    Examine the corresponding subsequences of the time series, and discard the false alarms
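The multi-piece search can be illustrated with a brute-force sketch. To stay short it compares raw windows directly instead of feature points and scans every offset instead of using an R*-tree, and it assumes Len(Q) is an exact multiple of w; the names `split_query` and `multipiece_search` and the data are made up.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def split_query(query, w):
    """Break the query into p = Len(Q) / w disjoint pieces of length w."""
    p = len(query) // w
    return [query[i * w:(i + 1) * w] for i in range(p)]


def multipiece_search(series, query, w, eps):
    pieces = split_query(query, w)
    radius = eps / math.sqrt(len(pieces))   # per-piece tolerance eps / sqrt(p)
    last_start = len(series) - len(query)
    candidates = set()
    for i, piece in enumerate(pieces):
        for t in range(len(series) - w + 1):
            start = t - i * w   # piece i found at t implies a match at t - i*w
            if 0 <= start <= last_start and dist(series[t:t + w], piece) <= radius:
                candidates.add(start)
    # Post-processing: check the full query and discard false alarms.
    return sorted(s for s in candidates
                  if dist(series[s:s + len(query)], query) <= eps)


series = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0, 1.0, 2.0, 3.0]
query = [2.0, 3.0, 2.1, 1.0]
matches = multipiece_search(series, query, w=2, eps=0.2)
```

With w = 2 the query splits into p = 2 pieces, each searched with radius 0.2/√2; both pieces point to offset 2, which survives the final check and is the only match.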

Experiments

Experiments were run on a stock prices database of 329,000 points

Only the first 3 frequencies of the DFT are used; thus the feature space has 6 dimensions (real and imaginary parts of each retained DFT coefficient)

Sliding window size w=512

Experiments (cont.)

Query time series were generated by taking random offsets into the time series and obtaining subsequences of length Len(Q) from those offsets

Experiments (cont.)

Four groups of experiments were carried out
– Comparison of the proposed method against the method that has sub-trails with only one point each
– Experiments to compare the response time
– Experiments with queries longer than w
– Experiments with larger databases

Related Works (citations)
– Continuous queries over data streams
– Similarity indexing with M-tree/SS-tree, etc.
– Efficient time series matching by wavelets
– Fast similarity search in the presence of noise, scaling, and translation in time-series databases

Thank you!
