Fast Subsequence Matching in Time-Series Databases
Christos Faloutsos, M. Ranganathan, Yannis Manolopoulos
Department of Computer Science and ISR
University of Maryland at College Park
Presented by Rui Li
Abstract
Goal: To find an efficient indexing method to locate time series in a database
Main Idea:
– Map each time series into a small set of multidimensional rectangles in feature space
– Rectangles can be readily indexed using traditional spatial access methods, e.g., R*-tree
Introduction
Hot Problem: Searching similar patterns in time-series databases
Applications:
– financial, marketing and production time series, e.g., stock prices
– scientific databases, e.g., weather, geological, environmental data
Introduction (cont.)
Similarity Queries:
– Whole Matching
– Subsequence Matching
    partial matching
    report time series along with offset
Introduction (cont.)
Whole Matching (Previous Work)
– Use a distance-preserving transform (e.g., DFT) to extract f features from time series (e.g., the first f DFT coefficients), and then map them into points in the f-dimensional feature space
– Spatial access method (e.g., R*-trees) can be used to search for approximate queries
Introduction (cont.)
Subsequence Matching (Goal)
– Map time series into rectangles in feature space
– Spatial access methods as the eventual indexing mechanism
Background
To guarantee no false dismissals for range queries, the feature extraction function F() should satisfy the following formula:
    $D_{feature}(F(O_1), F(O_2)) \le D_{object}(O_1, O_2)$
Parseval Theorem:
– The DFT preserves the Euclidean distance between two time series
    $D(\vec{x}, \vec{y}) = D(\vec{X}, \vec{Y})$
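The two formulas above can be checked numerically. The following is a minimal sketch, assuming an orthonormal DFT (scaled by 1/sqrt(n)) and Euclidean distance; the helper names `dft`, `features` and `dist` and the sample values are illustrative and do not come from the slides.

```python
import cmath
import math

def dft(x):
    """Orthonormal DFT (1/sqrt(n) scaling), so Euclidean distance is preserved."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)) / math.sqrt(n)
        for k in range(n)
    ]

def features(x, f):
    """Feature extraction F(): keep only the first f DFT coefficients."""
    return dft(x)[:f]

def dist(a, b):
    """Euclidean distance; works for real and complex sequences."""
    return math.sqrt(sum(abs(u - v) ** 2 for u, v in zip(a, b)))

# Two made-up time series of equal length
x = [1.0, 2.0, 4.0, 3.0, 5.0, 4.0, 6.0, 7.0]
y = [2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 5.5, 8.0]

d_time = dist(x, y)                            # distance in the time domain
d_freq = dist(dft(x), dft(y))                  # Parseval: same distance in the frequency domain
d_feat = dist(features(x, 2), features(y, 2))  # truncation can only shrink the distance
```

Because dropping coefficients only removes non-negative terms from the sum, `d_feat` never exceeds `d_time`, which is exactly the no-false-dismissal condition.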
Proposed Method
Mapping each time series to a trail in feature space
– Use a sliding window of size w and place it at every possible offset
– For each such placement of the window, extract the features of the subsequence inside the window
– A time series of length L is mapped to a trail in feature space, consisting of L-w+1 points: one point for each offset
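The sliding-window mapping can be sketched as follows. This is an illustrative sketch, not the authors' code: it assumes the features are the first f DFT coefficients of each window, stored as (real, imaginary) pairs, and the names `first_coeffs` and `trail` are hypothetical.

```python
import cmath
import math

def first_coeffs(window, f):
    """First f coefficients of the orthonormal DFT of one window."""
    n = len(window)
    return [
        sum(window[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)) / math.sqrt(n)
        for k in range(f)
    ]

def trail(series, w, f):
    """Map a series of length L to L-w+1 feature points, one per window offset."""
    points = []
    for offset in range(len(series) - w + 1):
        point = []
        for c in first_coeffs(series[offset:offset + w], f):
            point.extend((c.real, c.imag))  # each coefficient gives two dimensions
        points.append(tuple(point))
    return points

# Made-up series of length 20, window size 8, 2 coefficients kept
series = [float((3 * t) % 7) for t in range(20)]
pts = trail(series, w=8, f=2)
```

Here `pts` holds 20 - 8 + 1 = 13 points, each with 4 coordinates.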
Example 1
Example 2
(a) a sample stock-price time series
(b) its trail in the feature space of the 0-th and 1-st DFT coefficients
(c) its trail of the 1-st and 2-nd DFT coefficients
Proposed Method (cont.)
Indexing the trails
– Simply storing the individual points of the trail in an R*-tree is inefficient
– Exploit the fact that successive points of the trail tend to be similar, i.e., the contents of the sliding window in nearby offsets tend to be similar
– Divide the trail into sub-trails and represent each of them with its minimum bounding (hyper)-rectangle (MBR)
– Store only a few MBRs
Proposed Method (cont.)
Indexing the trails (cont.)
– Can guarantee ‘no false dismissals’: when a query arrives, all the MBRs that intersect the query region are retrieved, i.e., all the qualifying sub-trails are retrieved, plus some false alarms
Return to Example 1 (figure: query region of radius ε)
Proposed Method (cont.)
Indexing the trails (cont.)
– Map a time series into a set of rectangles in feature space
– Each MBR corresponds to a sub-trail
Proposed Method (cont.)
For each MBR we have to store
– $t_{start}$, $t_{end}$, which are the offsets of the first and last window positionings covered by the sub-trail
– A unique identifier for each time series
– The extent of the MBR in each dimension, i.e., $(F1_{low}, F1_{high}, F2_{low}, F2_{high}, \ldots)$
Store the MBRs in an R*-tree
– Recursively group the MBRs into parent MBRs, grandparent MBRs, etc.
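One possible in-memory layout for the fields listed above is sketched here. The class name `SubTrailMBR`, its field names and the helper `make_mbr` are assumptions made for illustration; the slides do not specify a concrete record format.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

@dataclass
class SubTrailMBR:
    series_id: int            # unique identifier of the time series
    t_start: int              # offset of the first window in the sub-trail
    t_end: int                # offset of the last window in the sub-trail
    low: Tuple[float, ...]    # (F1_low, F2_low, ...)
    high: Tuple[float, ...]   # (F1_high, F2_high, ...)

def make_mbr(series_id: int, t_start: int,
             points: Sequence[Tuple[float, ...]]) -> SubTrailMBR:
    """Bound consecutive trail points, starting at offset t_start, by their MBR."""
    dims = range(len(points[0]))
    low = tuple(min(p[d] for p in points) for d in dims)
    high = tuple(max(p[d] for p in points) for d in dims)
    return SubTrailMBR(series_id, t_start, t_start + len(points) - 1, low, high)

# Three made-up 2-d trail points of series 7, starting at offset 10
entry = make_mbr(7, 10, [(1.0, 4.0), (2.0, 3.0), (1.5, 5.0)])
```

The sketch keeps low and high corners as separate tuples rather than interleaving them; either layout carries the same information.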
Example 1 (cont.)
– assuming a fan-out of 4
Proposed Method (cont.)
The structure of a leaf node and a non-leaf node
Proposed Method (cont.)
Two questions
– Insertions: when a new time series is inserted, what is a good way to divide its trail into sub-trails
– Queries: how to handle queries, especially the ones that are longer than the sliding window
Proposed Method (cont.)
Insertion – Dividing trails into sub-trails
– Goal: Optimal division so that the number of disk accesses is minimized
Example 3
(figure panels: fixed heuristic, adaptive heuristic)
Proposed Method (cont.)
Insertion (cont.)
– Group trail-points into sub-trails by means of an adaptive heuristic
– Based on a greedy algorithm, using a cost function to estimate the number of disk accesses for each of the options
Proposed Method (cont.)
Insertion (cont.)
– The cost function:
    $DA(\vec{L}) = \prod_{i=1}^{n} (L_i + 0.5)$
  where $\vec{L} = (L_1, L_2, \ldots, L_n)$ are the sides of the n-dimensional MBR of a node in an R-tree
– The marginal cost of each point:
    $mc = DA(\vec{L}) / k$
  where k is the number of points in this MBR
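The two formulas translate directly into code. This is a small sketch with hypothetical function names; it takes the side lengths as given and makes no assumption about how the feature space is scaled.

```python
def disk_accesses(sides):
    """DA(L): product of (L_i + 0.5) over the sides of the MBR."""
    cost = 1.0
    for side in sides:
        cost *= side + 0.5
    return cost

def marginal_cost(sides, k):
    """mc: estimated disk accesses per point for an MBR holding k points."""
    return disk_accesses(sides) / k

# A 2-d MBR with sides 0.5 and 1.5 that bounds 4 trail points
da = disk_accesses([0.5, 1.5])     # (0.5 + 0.5) * (1.5 + 0.5) = 2.0
mc = marginal_cost([0.5, 1.5], 4)  # 2.0 / 4 = 0.5
```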
Proposed Method (cont.)
Insertion (cont.)
– Algorithm:
    Assign the first point of the trail to a sub-trail (would be a predefined small MBR)
    FOR each successive point
        IF it increases the marginal cost of the current sub-trail
            THEN start a new sub-trail
            ELSE include it into the current sub-trail
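The greedy loop above can be sketched as runnable code. This sketch makes two assumptions that the slides do not state: a single-point sub-trail is treated as an MBR with zero-length sides (rather than a "predefined small MBR"), and the feature values are already on a scale where the 0.5 term is meaningful. The names `marginal_cost` and `split_trail` are illustrative.

```python
def marginal_cost(low, high, k):
    """mc = DA(L) / k, with DA(L) the product of (side + 0.5)."""
    cost = 1.0
    for lo, hi in zip(low, high):
        cost *= (hi - lo) + 0.5
    return cost / k

def split_trail(points):
    """Greedily divide a trail (list of equal-length tuples) into sub-trails."""
    subtrails = []
    current = [points[0]]
    low, high = list(points[0]), list(points[0])
    for p in points[1:]:
        # MBR of the current sub-trail if p were added to it
        new_low = [min(a, b) for a, b in zip(low, p)]
        new_high = [max(a, b) for a, b in zip(high, p)]
        old_mc = marginal_cost(low, high, len(current))
        new_mc = marginal_cost(new_low, new_high, len(current) + 1)
        if new_mc > old_mc:
            # p increases the marginal cost: close this sub-trail, start a new one
            subtrails.append(current)
            current = [p]
            low, high = list(p), list(p)
        else:
            current.append(p)
            low, high = new_low, new_high
    subtrails.append(current)
    return subtrails

# Made-up trail: three nearby points, then a jump to two nearby points
demo = [(0.0, 0.0), (0.01, 0.0), (0.02, 0.01), (0.9, 0.9), (0.91, 0.9)]
parts = split_trail(demo)
```

On this trail the jump to (0.9, 0.9) raises the marginal cost, so the points split into one sub-trail of three and one of two.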
Proposed Method (cont.)
Insertion (cont.)
– The algorithm may not work well under certain circumstances
– The algorithm’s goal is to minimize the size of each MBR, so why not use clustering techniques?
Proposed Method (cont.)
Searching – Queries longer than w
– If Len(Q)=w, the searching algorithm goes like:
    Map Q to a point q in the feature space; the query corresponds to a sphere with center q and radius ε
    Retrieve the sub-trails whose MBRs intersect the query region
    Examine the corresponding time series, and discard the false alarms
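The retrieval step above reduces to a sphere-rectangle intersection test. The sketch below uses a linear scan over a list of MBRs as a stand-in for the R*-tree search; the tuple layout `(name, low, high)` and the function names are assumptions made for illustration. The final step, checking the raw time series to discard false alarms, is not shown.

```python
import math

def min_dist_to_mbr(q, low, high):
    """Smallest Euclidean distance from point q to the rectangle [low, high]."""
    total = 0.0
    for qi, lo, hi in zip(q, low, high):
        if qi < lo:
            total += (lo - qi) ** 2
        elif qi > hi:
            total += (qi - hi) ** 2
        # inside the extent along this dimension: contributes nothing
    return math.sqrt(total)

def candidates(q, eps, mbrs):
    """Names of the MBRs that intersect the sphere of radius eps around q."""
    return [name for name, low, high in mbrs
            if min_dist_to_mbr(q, low, high) <= eps]

# Three made-up 2-d MBRs
mbrs = [
    ("A", (0.0, 0.0), (1.0, 1.0)),
    ("B", (3.0, 3.0), (4.0, 4.0)),
    ("C", (1.5, 0.0), (2.0, 1.0)),
]
hits = candidates((1.2, 0.5), 0.4, mbrs)
```

With this query point, A and C lie within 0.4 of q and B does not, so `hits` contains A and C.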
Proposed Method (cont.)
Searching (cont.)
– If Len(Q)>w, consider the following Lemma:
    Consider two sequences Q and S of the same length Len(Q)=Len(S)=p*w
    Consider their p disjoint subsequences
        $q_i = Q[i \cdot w + 1 : (i+1) \cdot w]$ and $s_i = S[i \cdot w + 1 : (i+1) \cdot w]$, where $0 \le i \le p-1$
    If Q and S agree within tolerance ε, then at least one of the pairs $(s_i, q_i)$ of corresponding subsequences agree within tolerance $\epsilon / \sqrt{p}$
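The Lemma follows from a short contradiction argument. This sketch assumes that D is the Euclidean distance, as in the Parseval slide; the slides themselves do not give the proof.

```latex
% The p subsequences are disjoint and cover Q and S, so squared distances add up:
D^2(Q, S) = \sum_{i=0}^{p-1} D^2(q_i, s_i) \le \epsilon^2
% Suppose every pair disagreed, i.e. D(q_i, s_i) > \epsilon / \sqrt{p} for all i. Then
\sum_{i=0}^{p-1} D^2(q_i, s_i) > p \cdot \frac{\epsilon^2}{p} = \epsilon^2
% This contradicts the first line, so D(q_i, s_i) \le \epsilon / \sqrt{p} for some i.
```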
Proposed Method (cont.)
Searching (cont.)
– If Len(Q)>w, the searching algorithm goes like:
    The query time series Q is broken into p sub-queries which correspond to p spheres in the feature space with radius $\epsilon / \sqrt{p}$
    Retrieve the sub-trails whose MBRs intersect at least one of the sub-query regions
    Examine the corresponding subsequences of the time series, and discard the false alarms
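The first step, breaking the query into sub-queries, can be sketched as follows. The sketch assumes Len(Q) is an exact multiple of w, as in the Lemma, and rejects other lengths; the function name `sub_queries` is hypothetical.

```python
import math

def sub_queries(query, w, eps):
    """Split a query of length p*w into p pieces of length w.

    Returns the pieces and the radius eps / sqrt(p) to use for each piece.
    """
    if len(query) == 0 or len(query) % w != 0:
        raise ValueError("this sketch assumes Len(Q) is a positive multiple of w")
    p = len(query) // w
    pieces = [query[i * w:(i + 1) * w] for i in range(p)]
    return pieces, eps / math.sqrt(p)

# Made-up query of length 12 with window size 4, so p = 3
pieces, radius = sub_queries(list(range(12)), 4, 0.6)
```

Each piece is then mapped to feature space and searched like a query of length w, using the reduced radius.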
Experiments
Experiments were run on a stock-price database of 329,000 points
Only the first 3 frequencies of the DFT are used; thus the feature space has 6 dimensions (real and imaginary parts of each retained DFT coefficient)
Sliding window size w=512
Experiments (cont.)
Query time series were generated by taking random offsets into the time series and obtaining subsequences of length Len(Q) from those offsets
Experiments (cont.)
Four groups of experiments were carried out
– Comparison of the proposed method against the method that has sub-trails with only one point each
– Experiments to compare the response time
– Experiments with queries longer than w
– Experiments with larger databases
Related Works (citations)
Continuous queries over data streams
Similarity indexing with M-tree/SS-tree, etc.
Efficient time series matching by wavelets
Fast similarity search in the presence of noise, scaling, and translation in time-series databases
Thank you!