Indexing & Similarity Search in Time Series Databases
Reza Akbarinia Inria & Lirmm
Outline
● Context
● Representations
● Similarity functions
● Motif and anomaly discovery
● Matrix profile
● Indexing
● Conclusion
Time Series
● A series of data points listed in time order, e.g., T = <t1, t2, …, tn>
● Many applications
– Finance
– Traffic monitoring
– Earthquake prediction
– Aircraft engine analysis
– Agronomy
● Problem
– Similarity search
Time Series Representations
● To efficiently deal with time series, we need
– techniques for representing time series and reducing their dimensionality
● Representation examples
– PAA
– SAX
– DWT
PAA Representation
PAA (Piecewise Aggregate Approximation)
● Divides the time axis into a set of (equal-size) segments
● For each segment
– Calculates the mean of the segment’s values
– Uses the mean as the segment’s value
PAA Example
T1 (PAA_8)= <4, 5.5, 7.5, 3.5, 3, 6, 4.5, 2>
T2 (PAA_2)= <4.5, 4>
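The PAA computation can be sketched in a few lines of Python (a minimal sketch; it assumes the series length is divisible by the number of segments, and the function name `paa` is ours):

```python
def paa(series, w):
    # Piecewise Aggregate Approximation: the mean of each of w
    # equal-size segments becomes the segment's value.
    # Assumes len(series) is divisible by w.
    seg = len(series) // w
    return [sum(series[i * seg:(i + 1) * seg]) / seg for i in range(w)]

# A series of 8 points reduced to 2 segments:
print(paa([2, 4, 6, 8, 1, 3, 5, 7], 2))  # [5.0, 4.0]
```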
iSAX Representation
● indexable Symbolic Aggregate approXimation (iSAX)
– Represents a time series T of length n in w-dimensional space
● First converts the time series to its PAA representation
● Then converts the PAA coefficients to iSAX symbols
● Example (w = 4): iSAX: 11 11 01 00
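Converting PAA coefficients to SAX symbols quantizes each value against breakpoints; for z-normalized data the breakpoints are chosen so that the regions under N(0,1) are equiprobable. A minimal sketch with a cardinality-4 (2-bit) alphabet; the bit convention here (00 = lowest region) is one common choice, and the function name is ours:

```python
def sax_symbols(paa_values, breakpoints=(-0.6745, 0.0, 0.6745)):
    # Map each PAA coefficient to the index of the region it falls in,
    # encoded as a 2-bit string (00 = lowest region, 11 = highest).
    # The default breakpoints split N(0,1) into four equiprobable regions.
    out = []
    for v in paa_values:
        region = sum(1 for b in breakpoints if v > b)
        out.append(format(region, "02b"))
    return out

print(sax_symbols([-1.0, -0.3, 0.3, 1.0]))  # ['00', '01', '10', '11']
```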
Discrete Wavelet Transform (DWT)
● Captures both frequency and time information
● Many applications in science and engineering, e.g., denoising or compressing time series (signals)
● Haar wavelet
– Input: n = 2^m values
– Algorithm: in log2(n) iterations do
● Calculate the mean of each pair of values
● Compute the difference between the first value of the pair and the mean
● Save the mean and the difference instead of the original pair of values
Example
Original time series: 20 10 60 30 40 80 40 20
[Figure: plot of the time series values over time]
Iteration 1 (pairwise means, then pairwise differences):
15 45 60 30 5 15 -20 10
Iteration 2 (applied to the four means):
30 45 -15 15 5 15 -20 10
Iteration 3 (applied to the two remaining means):
37.5 -7.5 -15 15 5 15 -20 10
Transformation result:
37.5 -7.5 -15 15 5 15 -20 10
Inverting the Discrete Wavelet Transform
The discrete wavelet transform is invertible:
● Start from the transformation result
● At each level, add and subtract the difference values to/from the means to recover the previous level
37.5 -7.5 -15 15 5 15 -20 10
30 45 -15 15 5 15 -20 10
15 45 60 30 5 15 -20 10
20 10 60 30 40 80 40 20
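The forward and inverse Haar transforms above can be written directly in Python (a minimal sketch for input lengths that are powers of two; the coefficient layout [overall mean, coarsest difference, …, finest differences] matches the rows of the example, and the function names are ours):

```python
def haar_dwt(values):
    # Forward Haar transform; assumes len(values) is a power of two.
    # At each level, the prefix of means is replaced by its own
    # means and (first value - mean) differences.
    coeffs = list(values)
    n = len(coeffs)
    while n > 1:
        means = [(coeffs[2 * i] + coeffs[2 * i + 1]) / 2 for i in range(n // 2)]
        diffs = [(coeffs[2 * i] - coeffs[2 * i + 1]) / 2 for i in range(n // 2)]
        coeffs[:n] = means + diffs
        n //= 2
    return coeffs

def haar_idwt(coeffs):
    # Inverse: add and subtract the differences to/from the means,
    # doubling the number of recovered values at each level.
    values = list(coeffs)
    n = 1
    while n < len(values):
        means, diffs = values[:n], values[n:2 * n]
        rebuilt = []
        for mean, diff in zip(means, diffs):
            rebuilt += [mean + diff, mean - diff]
        values[:2 * n] = rebuilt
        n *= 2
    return values

print(haar_dwt([20, 10, 60, 30, 40, 80, 40, 20]))
# [37.5, -7.5, -15.0, 15.0, 5.0, 15.0, -20.0, 10.0]
```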
Euclidean Distance
● Consider two subsequences of size m in a time series T
– Ti,m = <ti, ti+1, …, ti+m-1>
– Tj,m = <tj, tj+1, …, tj+m-1>
● Definition of Euclidean distance:
D(Ti,m, Tj,m) = sqrt( Σk=0..m-1 (ti+k − tj+k)² )
● Good for detecting subsequences that are similar in value and shape
● But not shapes with different scales
Dynamic Time Warping (DTW)
● Can detect similar shapes, even with different scales
– E.g., similarity in speech or walking at different speeds
● DTW calculates an optimal match between two given sequences (possibly of different sizes) while respecting the following rules:
– The first index of the first sequence must be matched with the first index of the other sequence
– The last index of the first sequence must be matched with the last index of the other sequence
– Every index of the first sequence must be matched with one or more indices of the other sequence, and vice versa
– The mapping of indices from the first sequence to indices of the other sequence must be monotonically increasing
● i.e., if j > i are indices of the first sequence, then there must not be two indices l > k in the other sequence such that index i is matched with index l and index j is matched with index k
● Objective:
– Find the optimal match, i.e., the match that satisfies all the rules and has the minimal cost, where the cost is the sum, over all matched pairs of indices, of the absolute differences between their values
Image credit: Wikipedia
DTW vs Euclidean
[Figure: point-to-point Euclidean alignment vs the warped DTW alignment of two series]
Image credit: Elena Tsiporkova
DTW – Pseudo Code
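The pseudocode itself did not survive in this transcript; the standard dynamic-programming recurrence that DTW uses can be sketched in Python as follows (cost of a matched pair = absolute difference of the values, as in the rules above; the function name is ours):

```python
import math

def dtw(a, b):
    # Cumulative cost matrix: D[i][j] = cost of the best alignment of
    # a[:i] with b[:j]; the cost of a matched pair is |a_i - b_j|.
    n, m = len(a), len(b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # a_i matched to an extra b
                                 D[i][j - 1],      # b_j matched to an extra a
                                 D[i - 1][j - 1])  # both indices advance
    return D[n][m]

print(dtw([2, 4, 5, 3, 2, 4], [1, 6, 4, 3]))  # 6.0
```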
DTW Algorithm Example
Consider the two following time series:
● A = <2, 4, 5, 3, 2, 4>
● B = <1, 6, 4, 3>
We first set all entries of the cumulative distance matrix to infinity, then fill it cell by cell starting from the bottom-left corner. The completed distance matrix (columns indexed by A, rows by B):

     2   4   5   3   2   4
 3   8   4   5   4   5   6
 4   7   3   4   5   7   7
 6   5   3   4   7  11  13
 1   1   4   8  10  11  14
● We can find the optimal match by starting from the top-right cell of the distance matrix and repeatedly moving left/down/diagonally to the minimum-value cell
● Optimal match in our example:
– {(1,1), (2,2), (3,3), (4,4), (5,4), (6,4)}
● Minimal cost: 6
– i.e., |2−1| + |4−6| + |5−4| + |3−3| + |2−3| + |4−3| = 1 + 2 + 1 + 0 + 1 + 1 = 6, the value in the top-right cell of the matrix
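The traceback described above can be implemented by rebuilding the cumulative matrix and walking from the final cell to the cheapest predecessor (a sketch; indices are 1-based so the output matches the slide's match pairs, and the function name is ours):

```python
import math

def dtw_path(a, b):
    n, m = len(a), len(b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = abs(a[i - 1] - b[j - 1]) + min(
                D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    # Walk back from the final cell, always moving to the cheapest
    # valid predecessor (diagonal, up, or left).
    i, j, path = n, m, []
    while (i, j) != (1, 1):
        path.append((i, j))
        _, i, j = min((D[pi][pj], pi, pj)
                      for pi, pj in ((i - 1, j - 1), (i - 1, j), (i, j - 1))
                      if pi >= 1 and pj >= 1)
    path.append((1, 1))
    return D[n][m], path[::-1]

cost, match = dtw_path([2, 4, 5, 3, 2, 4], [1, 6, 4, 3])
print(cost, match)  # 6.0 [(1, 1), (2, 2), (3, 3), (4, 4), (5, 4), (6, 4)]
```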
Motif and Anomaly Discovery
Time Series Motif Discovery (finding repeated patterns)
[Figure: Winding dataset (the angular speed of reel 2), with three occurrences of a repeated pattern highlighted]
Slide credit: E. Keogh
Motivation for Motifs
Many applications, e.g.:
● Mining association rules in time series requires the discovery of motifs. These are referred to as primitive shapes and frequent patterns
● Many time series anomaly/interestingness detection algorithms essentially consist of modeling normal behavior with a set of typical shapes (motifs), and detecting future patterns that are dissimilar to all typical shapes
● In robotics: allows an autonomous agent to generalize from a set of qualitatively different experiences gleaned from sensors
● In medical data mining: used for characterizing a physiotherapy patient’s recovery based on the discovery of similar patterns
Definition 1. Match: Given a positive real number R (called threshold or range) and a time series T containing a subsequence C beginning at position p and a subsequence M beginning at q, if D(C, M) ≤ R, then M is called a matching subsequence of C.
Definition 2. Trivial Match: Given a time series T, containing a subsequence C beginning at position p and a matching subsequence M beginning at q, we say that M is a trivial match to C if either p = q or there does not exist a subsequence M’ beginning at q’ such that D(C, M’) > R, and either q < q’< p or p < q’< q.
Definition 3. K-Motif(n,R): Given a time series T, a subsequence length n and a range R, the most significant motif in T (hereafter called the 1-Motif(n,R)) is the subsequence C1 that has the highest count of non-trivial matches (ties are broken by choosing the motif whose matches have the lower variance). The Kth most significant motif in T (hereafter called the K-Motif(n,R)) is the subsequence CK that has the highest count of non-trivial matches, and satisfies D(CK, Ci) > 2R, for all 1 ≤ i < K.
[Figure: Space Shuttle STS-57 telemetry (inertial sensor) time series T, with a subsequence C and its trivial matches highlighted]
Slide credit: E. Keogh
Discord
● A discord (or anomaly) is a subsequence whose nearest neighbor is far from it
● Most significant discord: given a time series T and a similarity function D, a subsequence C is the most significant discord if the distance to its nearest neighbor is the largest
● Discord detection is important for many applications, e.g., anomaly detection in aircraft engines
Matrix Profile: An Efficient Technique for Motif and Discord Discovery
● Given a time series T, and subsequence size m
● The matrix profile MP is itself a time series
– such that MP[i] gives the distance from the subsequence of length m starting at position i of T to its nearest-neighbor subsequence in T
[Figure: a time series T (top) and its matrix profile MP (bottom)]
Image credit: E. Keogh
How to “read” a Matrix Profile
Where you see relatively low values, you know that the subsequence in the original time series must have (at least one) relatively similar subsequence elsewhere in the data (such regions are “motifs” or reoccurring patterns)
Where you see relatively high values, you know that the subsequence in the original time series must be unique in its shape (such areas are “discords” or anomalies).
Must be an anomaly in the original data, in this region.
We call these Time Series Discords
Must be conserved shapes (motifs) in the original data, in these three regions.
Slide credit: E. Keogh
Distance Profile
Given a time series T, a subsequence length m, and a subsequence C:
– The distance profile of C is an array that gives the distance of C to every subsequence of size m in T

MASS: Mueen’s Algorithm for Similarity Search
● Uses a convolution-based method to compute the distance profile efficiently for the z-normalized Euclidean distance
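A naive O(n·m)-per-query distance profile for the z-normalized Euclidean distance can be sketched as follows; MASS computes the same array much faster via FFT-based convolution, so this direct version (with our function names) is only for illustration:

```python
import math

def znorm(x):
    # z-normalize: zero mean, unit (population) standard deviation.
    mu = sum(x) / len(x)
    sd = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))
    return [(v - mu) / sd for v in x] if sd > 0 else [0.0] * len(x)

def distance_profile(T, C):
    # Distance of C to every window of size len(C) in T,
    # under the z-normalized Euclidean distance.
    m = len(C)
    q = znorm(C)
    return [math.sqrt(sum((a - b) ** 2 for a, b in zip(q, znorm(T[i:i + m]))))
            for i in range(len(T) - m + 1)]

# The window at position 5 is a scaled-and-shifted copy of C,
# so its z-normalized distance is 0:
print(distance_profile([0, 1, 2, 1, 0, 10, 12, 14, 12, 10], [0, 1, 2, 1, 0]))
```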
Matrix Profile Algorithms
There are several algorithms for computing the matrix profile:
● STAMP
● STOMP
● AAMP
Problem Definition
● Let
– A: time series of size n
– m: size of subsequences
– A[i]: subsequence of size m starting at position i in A
● Goal:
– Calculate the matrix profile MP, such that MP[i] returns the distance of A[i] to its most similar subsequence in A
STAMP: Scalable Time Series Anytime Matrix Profile
● Idea: for each subsequence, compute its distance profile using the MASS algorithm, and take the minimum of the distance profile
● Distance function: z-normalized Euclidean distance
● The order in which the distance profiles are computed can be random
● The random ordering provides diminishing returns, which allows anytime interrupt-resume operation
MP(1:n-m+1) = inf;
MPI(1:n-m+1) = -1;
for i = 1:n-m+1 in a random order
    d = MASS(T, T(i:i+m-1));
    d(max(i-m/4,1) : min(i+m/4-1,n-m+1)) = NaN;  % exclusion zone around i (trivial matches)
    [MP, ind] = min([MP ; d]);
    MPI(ind==2) = i;
end
Complexity: O(n^2 log n)
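The MATLAB-style pseudocode above can be rendered as a naive Python sketch; the real STAMP gets its speed from MASS, whereas here each distance profile is computed directly, with the i ± m/4 exclusion zone suppressing trivial matches (function names are ours):

```python
import math

def znorm(x):
    mu = sum(x) / len(x)
    sd = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))
    return [(v - mu) / sd for v in x] if sd > 0 else [0.0] * len(x)

def dist_profile(T, i, m):
    # Naive z-normalized Euclidean distance profile of the query T[i:i+m].
    q = znorm(T[i:i + m])
    return [math.sqrt(sum((a - b) ** 2 for a, b in zip(q, znorm(T[j:j + m]))))
            for j in range(len(T) - m + 1)]

def stamp(T, m):
    l = len(T) - m + 1
    MP, MPI = [math.inf] * l, [-1] * l
    for i in range(l):  # the real STAMP visits i in random order
        d = dist_profile(T, i, m)
        for j in range(max(i - m // 4, 0), min(i + m // 4, l - 1) + 1):
            d[j] = math.inf  # exclusion zone: ignore trivial matches around i
        for j in range(l):  # elementwise MP = min(MP, d), as in the pseudocode
            if d[j] < MP[j]:
                MP[j], MPI[j] = d[j], i
    return MP, MPI
```

With an exactly repeated pattern (e.g., windows 0 and 7 identical), the matrix profile drops to 0 at both positions and the index array points each at the other.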
The distance profiles Di form the rows of an (n-m+1) × (n-m+1) distance matrix, and the matrix profile is obtained by taking the minimum of each row:

d(1,1)      d(1,2)      …  d(1,n-m+1)      → min(D1) = P1
d(2,1)      d(2,2)      …  d(2,n-m+1)      → min(D2) = P2
…
d(i,1)      d(i,2)      …  d(i,n-m+1)      → min(Di) = Pi
…
d(n-m+1,1)  d(n-m+1,2)  …  d(n-m+1,n-m+1)  → min(Dn-m+1) = Pn-m+1
Slide credit: E. Keogh
Yan Zhu et al. Matrix Profile II: Exploiting a Novel Algorithm and GPUs to break the one Hundred Million Barrier for Time Series Motifs and Joins. IEEE ICDM Conf., 2016.
* Download page: https://www.cs.ucr.edu/~eamonn/MatrixProfile.html
STOMP: Scalable Time series Ordered Matrix Profile
● STOMP has O(n^2) time and O(n) space complexity
● Distance function: z-normalized Euclidean distance
● Working formula: the z-normalized distance between windows i and j is derived from QT(i,j), the dot product of the ith window and the jth window, which is maintained incrementally along each diagonal:
QT(i,j) = QT(i-1,j-1) − t(i-1)·t(j-1) + t(i+m-1)·t(j+m-1)
Slide credit: E. Keogh
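The dot-product recurrence can be checked numerically in a few lines (a sketch; QT(i, j) denotes the dot product of the length-m windows starting at positions i and j, 0-based here, and the helper name is ours):

```python
def window_dot(T, i, j, m):
    # Direct dot product of the windows starting at i and j.
    return sum(T[i + k] * T[j + k] for k in range(m))

T = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
m = 3
qt = window_dot(T, 0, 2, m)                       # QT(0, 2), computed directly
qt_next = qt - T[0] * T[2] + T[0 + m] * T[2 + m]  # O(1) update to QT(1, 3)
print(qt_next == window_dot(T, 1, 3, m))  # True
```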
AAMP Algorithm *
● Distance function: Euclidean distance
● Main idea: compute the subsequence distances incrementally
– Each distance: in O(1) instead of O(m)
– Thus, an amortized complexity of O(n) to find the nearest neighbor of A[i]
● For this, we sweep the time series in n-m iterations
– In each iteration, the distances are computed incrementally, and the minimum distances updated
– In iteration k, we compute the distance of A[i] and A[i+k], i.e., subsequences of A whose positions differ by k
* Reza Akbarinia, Bertrand Cloez. Efficient Matrix Profile Computation Using Different Distance Functions. CoRR abs/1901.05708 (2019)
Incremental Distance Computation
● Di,j: square of the Euclidean distance between A[i] and A[j]
● A[i] = <ai, …, ai+m-1>
● A[j] = <aj, …, aj+m-1>
Thus, we have:
Di,j = Σk=0..m-1 (ai+k − aj+k)²
Di-1,j-1 = Σk=0..m-1 (ai+k-1 − aj+k-1)²
Di,j = Di-1,j-1 − (ai-1 − aj-1)² + (ai+m-1 − aj+m-1)²
Algorithm
For i = 1 to n-m+1
    MP[i] := ∞                                       // initialize minimum distances
For k = 1 to n-m                                     // do n-m iterations
    Compute D1,1+k                                   // using the Euclidean function
    For i = 2 to n-m-k+1                             // subsequences that are k positions apart
        Incrementally compute Di,i+k using Di-1,i+k-1    // O(1)
        If MP[i] > Di,i+k then
            MP[i] := Di,i+k
            MP_Index[i] := i+k
    For i = n-m-1 down to k                          // inverse scan
        Incrementally compute Di,i-k using Di+1,i-k+1    // O(1)
        If MP[i] > Di,i-k then
            MP[i] := Di,i-k
            MP_Index[i] := i-k
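A compact Python sketch of AAMP with the (non-normalized) Euclidean distance; instead of the separate forward and inverse scans of the pseudocode, this rendering updates both MP[i] and MP[i+k] while sweeping each diagonal, which covers the same pairs because the distance is symmetric (function name is ours):

```python
import math

def aamp(T, m):
    # Matrix profile under plain Euclidean distance, one diagonal at a time.
    n = len(T)
    l = n - m + 1
    MP = [math.inf] * l   # squared distances while sweeping
    MPI = [-1] * l
    for k in range(1, l):
        # First distance of the diagonal, computed directly in O(m).
        d = sum((T[t] - T[t + k]) ** 2 for t in range(m))
        if d < MP[0]:
            MP[0], MPI[0] = d, k
        if d < MP[k]:
            MP[k], MPI[k] = d, 0
        for i in range(1, l - k):
            # O(1) incremental update: D(i, i+k) from D(i-1, i+k-1).
            d += (T[i + m - 1] - T[i + k + m - 1]) ** 2 \
                 - (T[i - 1] - T[i + k - 1]) ** 2
            if d < MP[i]:
                MP[i], MPI[i] = d, i + k
            if d < MP[i + k]:
                MP[i + k], MPI[i + k] = d, i
    return [math.sqrt(v) for v in MP], MPI
```

Each diagonal costs O(m) for its first cell plus O(1) per remaining cell, giving the O(n^2) total time and O(n) space stated in the analysis.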
Example
[Figure: a time series t1 … t8 swept in four iterations; in iteration k, a sliding window swk computes incrementally the distances between subsequences that are k positions apart]
Analysis of AAMP
● An exact algorithm for computing the matrix profile
● Time complexity: O(n2)
● Space complexity: O(n)
● Simpler and faster than the SCRIMP++ algorithm (an improved version of STOMP)
– No need for Fourier transforms
Code of AAMP can be downloaded from: https://github.com/rakbarinia/AAMP-ACAMP
[Figure: execution time vs. time series length]
Parallel Time Series Indexing and Similarity Search
● DPiSAX: a parallel iSAX index for big time series databases (terabytes)
● ParCorr and RadiusSketch: parallel indexing of time series based on sketches, stored in distributed grids
[IEEE TKDE 2019, PKDD 2019, DMKD 2018, CIKM 2018, ICDM 2017]
iSAX Index
● Simple non-balanced tree
– Internal nodes point to internal or leaf nodes
– Leaf nodes point to time series files on disk
– A leaf node is divided (split) when its file size exceeds a threshold
● Efficient for kNN query processing
– But its construction may take too much time for big DBs (e.g., several days for 1 billion time series)
Distributed iSAX
DPiSAX: Distributed Partitioned iSAX
● Main idea: construct a distributed iSAX index in parallel
● Parallel index construction
– Partition the time series into uniform groups (for load balancing), using parallel sampling
– Each group forms a sub-tree of the index
– Construct the sub-indices (sub-trees) in parallel
● Parallel query processing
– Send each query directly to the node that is supposed to answer it
D. Yagoubi, R. Akbarinia, F. Masseglia, T. Palpanas. DPiSAX: Massively Distributed Partitioned iSAX. ICDM, 2017.
Distributed Partitioned iSAX
Experiments
● Experimental framework
– Spark
– Nef platform, a cluster of 32 machines
● Compared algorithms
– Centralized iSAX2+
– PLS: Parallel Linear Scan
– DiSAX
– DPiSAX
– DbasicPiSAX: the basic version of DPiSAX (with simple partitioning)
● Datasets
– Synthetic: 4 billion time series (6 TB), generated by a random walk generator
– Real DBs:
● IRIS seismic data: 40 million time series, for a total size of 150 GB
● TexMex corpus: 1 billion time series (SIFT feature vectors) of 128 points each, derived from 1 billion images
Results
Performance gain of our parallel approaches compared to iSAX2+ and parallel linear scan (PLS), for 10-NN queries (batches of 10k queries), over the Seismic, Random Walk (RW) and TexMex datasets, using 32 nodes
Parallel Time Series Indexing using Sketches
Sketches in a nutshell
[Figure: sketches of the time series are computed and stored in grids distributed over the nodes (Node 1, Node 2)]
Online Demonstration
● http://imitates.gforge.inria.fr/
Thanks!
References
Djamel Edine Yagoubi, Reza Akbarinia, Florent Masseglia, and Themis Palpanas. Massively Distributed Time Series Indexing and Querying. IEEE TKDE (Transactions on Knowledge and Data Engineering), 2019.
Chin-Chia Michael Yeh, Yan Zhu, Liudmila Ulanova, Nurjahan Begum, Yifei Ding, Hoang Anh Dau, Diego Furtado Silva, Abdullah Mueen, Eamonn Keogh. Matrix Profile I: All Pairs Similarity Joins for Time Series: A Unifying View that Includes Motifs, Discords and Shapelets. International Conference on Data Mining (ICDM), 2016.
Yan Zhu, Zachary Zimmerman, Nader Shakibay Senobari, Chin-Chia Michael Yeh, Gareth Funning, Abdullah Mueen, Philip Brisk and Eamonn Keogh. Matrix Profile II: Exploiting a Novel Algorithm and GPUs to Break the One Hundred Million Barrier for Time Series Motifs and Joins. International Conference on Data Mining (ICDM), 2016.
Djamel Edine Yagoubi, Reza Akbarinia, Boyan Kolev, Oleksandra Levchenko, Florent Masseglia, Patrick Valduriez, Dennis Shasha. ParCorr: efficient parallel methods to identify similar time series pairs across sliding windows. Data Mining and Knowledge Discovery, Springer, 2018, 32 (5), pp.1481-1507.
Djamel-Edine Yagoubi, Reza Akbarinia, Florent Masseglia, Themis Palpanas. DPiSAX: Massively Distributed Partitioned iSAX. International Conference on Data Mining (ICDM), pp.1-6, 2017