Indexing & Similarity Search in Time Series Databases
Reza Akbarinia Inria & Lirmm
Outline
● Context
● Representations
● Similarity functions
● Motif and anomaly discovery
● Matrix profile
● Indexing
● Conclusion
Time Series
● A series of data points listed in time order, e.g., T = <t1, t2, …, tn>
● Many applications
– Finance
– Traffic monitoring
– Earthquake prediction
– Aircraft engine analysis
– Agronomy
● Problem
– Similarity search
Time Series Representations
● To efficiently deal with time series, we need
– techniques for representing time series and reducing their dimensionality
● Representation examples
– PAA
– SAX
– DWT
PAA Representation
PAA (Piecewise Aggregate Approximation)
● Divides the time axis into a set of (equal-size) segments
● For each segment
– Calculates the mean of the segment’s values
– Uses the mean as the segment’s value
PAA Example
T1 (PAA_8)= <4, 5.5, 7.5, 3.5, 3, 6, 4.5, 2>
T2 (PAA_2)= <4.5, 4>
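The PAA computation can be sketched in a few lines of Python (a minimal sketch; it assumes the series length is divisible by the number of segments, and the function name `paa` is ours):

```python
def paa(series, w):
    # Piecewise Aggregate Approximation: the mean of each of w
    # equal-size segments becomes the segment's value.
    # Assumes len(series) is divisible by w.
    seg = len(series) // w
    return [sum(series[i * seg:(i + 1) * seg]) / seg for i in range(w)]

# A series of 8 points reduced to 2 segments:
print(paa([2, 4, 6, 8, 1, 3, 5, 7], 2))  # [5.0, 4.0]
```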
iSAX Representation
● indexable Symbolic Aggregate approXimation (iSAX)
– Represents a time series T of length n in w-dimensional space
● First converts the time series to its PAA representation
● Then converts the PAA coefficients to iSAX symbols
● Example (w = 4): iSAX: 11 11 01 00
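Converting PAA coefficients to SAX symbols quantizes each value against breakpoints; for z-normalized data the breakpoints are chosen so that the regions under N(0,1) are equiprobable. A minimal sketch with a cardinality-4 (2-bit) alphabet; the bit convention here (00 = lowest region) is one common choice, and the function name is ours:

```python
def sax_symbols(paa_values, breakpoints=(-0.6745, 0.0, 0.6745)):
    # Map each PAA coefficient to the index of the region it falls in,
    # encoded as a 2-bit string (00 = lowest region, 11 = highest).
    # The default breakpoints split N(0,1) into four equiprobable regions.
    out = []
    for v in paa_values:
        region = sum(1 for b in breakpoints if v > b)
        out.append(format(region, "02b"))
    return out

print(sax_symbols([-1.0, -0.3, 0.3, 1.0]))  # ['00', '01', '10', '11']
```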
Discrete Wavelet Transform (DWT)
● Captures both frequency and time information
● Many applications in science and engineering, e.g., denoising or compressing time series (signals)
● Haar wavelet
– Input: n = 2^m values
– Algorithm: in log2(n) iterations do
● Calculate the mean of each pair of values
● Compute the difference between the first value of the pair and the mean
● Save the mean and the difference instead of the original pair of values
Example
Original time series: 20 10 60 30 40 80 40 20
[Figure: plot of the time series values over time]
Iteration 1 (pairwise means, then pairwise differences):
15 45 60 30 5 15 -20 10
Iteration 2 (applied to the four means):
30 45 -15 15 5 15 -20 10
Iteration 3 (applied to the two remaining means):
37.5 -7.5 -15 15 5 15 -20 10
Transformation result:
37.5 -7.5 -15 15 5 15 -20 10
Inverting the Discrete Wavelet Transform
The discrete wavelet transform is invertible:
● Start from the transformation result
● At each level, add and subtract the difference values to/from the means to recover the previous level
37.5 -7.5 -15 15 5 15 -20 10
30 45 -15 15 5 15 -20 10
15 45 60 30 5 15 -20 10
20 10 60 30 40 80 40 20
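The forward and inverse Haar transforms above can be written directly in Python (a minimal sketch for input lengths that are powers of two; the coefficient layout [overall mean, coarsest difference, …, finest differences] matches the rows of the example, and the function names are ours):

```python
def haar_dwt(values):
    # Forward Haar transform; assumes len(values) is a power of two.
    # At each level, the prefix of means is replaced by its own
    # means and (first value - mean) differences.
    coeffs = list(values)
    n = len(coeffs)
    while n > 1:
        means = [(coeffs[2 * i] + coeffs[2 * i + 1]) / 2 for i in range(n // 2)]
        diffs = [(coeffs[2 * i] - coeffs[2 * i + 1]) / 2 for i in range(n // 2)]
        coeffs[:n] = means + diffs
        n //= 2
    return coeffs

def haar_idwt(coeffs):
    # Inverse: add and subtract the differences to/from the means,
    # doubling the number of recovered values at each level.
    values = list(coeffs)
    n = 1
    while n < len(values):
        means, diffs = values[:n], values[n:2 * n]
        rebuilt = []
        for mean, diff in zip(means, diffs):
            rebuilt += [mean + diff, mean - diff]
        values[:2 * n] = rebuilt
        n *= 2
    return values

print(haar_dwt([20, 10, 60, 30, 40, 80, 40, 20]))
# [37.5, -7.5, -15.0, 15.0, 5.0, 15.0, -20.0, 10.0]
```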
Euclidean Distance
● Consider two subsequences of size m in a time series T
– Ti,m = <ti, ti+1, …, ti+m-1>
– Tj,m = <tj, tj+1, …, tj+m-1>
● Definition of Euclidean distance:
D(Ti,m, Tj,m) = sqrt( Σk=0..m-1 (ti+k − tj+k)² )
● Good for detecting subsequences that are similar in value and shape
● But not shapes with different scales
Dynamic Time Warping (DTW)
● Can detect similar shapes, even with different scales
– E.g., similarity in speech or walking at different speeds
● DTW calculates an optimal match between two given sequences (possibly of different sizes) while respecting the following rules:
– The first index of the first sequence must be matched with the first index of the other sequence
– The last index of the first sequence must be matched with the last index of the other sequence
– Every index of the first sequence must be matched with one or more indices of the other sequence, and vice versa
– The mapping of indices from the first sequence to indices of the other sequence must be monotonically increasing
● i.e., if j > i are indices of the first sequence, then there must not be two indices l > k in the other sequence such that index i is matched with index l and index j is matched with index k
● Objective:
– Find the optimal match, i.e., the match that satisfies all the rules and has the minimal cost, where the cost is the sum, over all matched pairs of indices, of the absolute differences between their values
Image credit: Wikipedia
DTW vs Euclidean
[Figure: point-to-point Euclidean alignment vs the warped DTW alignment of two series]
Image credit: Elena Tsiporkova
DTW – Pseudo Code
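The pseudocode itself did not survive in this transcript; the standard dynamic-programming recurrence that DTW uses can be sketched in Python as follows (cost of a matched pair = absolute difference of the values, as in the rules above; the function name is ours):

```python
import math

def dtw(a, b):
    # Cumulative cost matrix: D[i][j] = cost of the best alignment of
    # a[:i] with b[:j]; the cost of a matched pair is |a_i - b_j|.
    n, m = len(a), len(b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # a_i matched to an extra b
                                 D[i][j - 1],      # b_j matched to an extra a
                                 D[i - 1][j - 1])  # both indices advance
    return D[n][m]

print(dtw([2, 4, 5, 3, 2, 4], [1, 6, 4, 3]))  # 6.0
```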
DTW Algorithm Example
Consider the two following time series:
● A = <2, 4, 5, 3, 2, 4>
● B = <1, 6, 4, 3>
We first set all entries of the cumulative distance matrix to infinity, then fill it cell by cell starting from the bottom-left corner. The completed distance matrix (columns indexed by A, rows by B):

     2   4   5   3   2   4
 3   8   4   5   4   5   6
 4   7   3   4   5   7   7
 6   5   3   4   7  11  13
 1   1   4   8  10  11  14
● We can find the optimal match by starting from the top-right cell of the distance matrix and repeatedly moving left/down/diagonally to the minimum-value cell
● Optimal match in our example:
– {(1,1), (2,2), (3,3), (4,4), (5,4), (6,4)}
● Minimal cost: 6
– i.e., |2−1| + |4−6| + |5−4| + |3−3| + |2−3| + |4−3| = 1 + 2 + 1 + 0 + 1 + 1 = 6, the value in the top-right cell of the matrix
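The traceback described above can be implemented by rebuilding the cumulative matrix and walking from the final cell to the cheapest predecessor (a sketch; indices are 1-based so the output matches the slide's match pairs, and the function name is ours):

```python
import math

def dtw_path(a, b):
    n, m = len(a), len(b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = abs(a[i - 1] - b[j - 1]) + min(
                D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    # Walk back from the final cell, always moving to the cheapest
    # valid predecessor (diagonal, up, or left).
    i, j, path = n, m, []
    while (i, j) != (1, 1):
        path.append((i, j))
        _, i, j = min((D[pi][pj], pi, pj)
                      for pi, pj in ((i - 1, j - 1), (i - 1, j), (i, j - 1))
                      if pi >= 1 and pj >= 1)
    path.append((1, 1))
    return D[n][m], path[::-1]

cost, match = dtw_path([2, 4, 5, 3, 2, 4], [1, 6, 4, 3])
print(cost, match)  # 6.0 [(1, 1), (2, 2), (3, 3), (4, 4), (5, 4), (6, 4)]
```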
Motif and Anomaly Discovery
Time Series Motif Discovery (finding repeated patterns)
[Figure: Winding dataset (the angular speed of reel 2), with three occurrences of a repeated pattern highlighted]
Slide credit: E. Keogh
Motivation for Motifs
Many applications, e.g.:
● Mining association rules in time series requires the discovery of motifs. These are referred to as primitive shapes and frequent patterns
● Many time series anomaly/interestingness detection algorithms essentially consist of modeling normal behavior with a set of typical shapes (motifs), and detecting future patterns that are dissimilar to all typical shapes
● In robotics: allows an autonomous agent to generalize from a set of qualitatively different experiences gleaned from sensors
● In medical data mining: used for characterizing a physiotherapy patient’s recovery based on the discovery of similar patterns
Definition 1. Match: Given a positive real number R (called threshold or range) and a time series T containing a subsequence C beginning at position p and a subsequence M beginning at q, if D(C, M) ≤ R, then M is called a matching subsequence of C.
Definition 2. Trivial Match: Given a time series T, containing a subsequence C beginning at position p and a matching subsequence M beginning at q, we say that M is a trivial match to C if either p = q or there does not exist a subsequence M’ beginning at q’ such that D(C, M’) > R, and either q < q’< p or p < q’< q.
Definition 3. K-Motif(n,R): Given a time series T, a subsequence length n and a range R, the most significant motif in T (hereafter called the 1-Motif(n,R)) is the subsequence C1 that has the highest count of non-trivial matches (ties are broken by choosing the motif whose matches have the lower variance). The Kth most significant motif in T (hereafter called the K-Motif(n,R)) is the subsequence CK that has the highest count of non-trivial matches, and satisfies D(CK, Ci) > 2R, for all 1 ≤ i < K.
[Figure: Space Shuttle STS-57 telemetry (inertial sensor) time series T, with a subsequence C and its trivial matches highlighted]
Slide credit: E. Keogh
Discord
● A discord (or anomaly) is a subsequence whose nearest neighbor is far from it
● Most significant discord: given a time series T and a similarity function D, a subsequence C is the most significant discord if the distance to its nearest neighbor is the largest
● Discord detection is important for many applications, e.g., anomaly detection in aircraft engines
Matrix Profile: An Efficient Technique for Motif and Discord Discovery
● Given a time series T, and subsequence size m
● The matrix profile MP is itself a time series
– such that MP[i] gives the distance from the subsequence of length m starting at position i of T to its nearest-neighbor subsequence in T
[Figure: a time series T (top) and its matrix profile MP (bottom)]
Image credit: E. Keogh
How to “read” a Matrix Profile
Where you see relatively low values, you know that the subsequence in the original time series must have (at least one) relatively similar subsequence elsewhere in the data (such regions are “motifs” or reoccurring patterns)
Where you see relatively high values, you know that the subsequence in the original time series must be unique in its shape (such areas are “discords” or anomalies).
Must be an anomaly in the original data, in this region.
We call these Time Series Discords
Must be conserved shapes (motifs) in the original data, in these three regions.
Slide credit: E. Keogh
Distance Profile
Given a time series T, a subsequence length m, and a subsequence C:
– The distance profile of C is an array that gives the distance of C to every subsequence of size m in T

MASS: Mueen’s Algorithm for Similarity Search
● Uses a convolution-based method to compute the distance profile efficiently for the z-normalized Euclidean distance
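A naive O(n·m)-per-query distance profile for the z-normalized Euclidean distance can be sketched as follows; MASS computes the same array much faster via FFT-based convolution, so this direct version (with our function names) is only for illustration:

```python
import math

def znorm(x):
    # z-normalize: zero mean, unit (population) standard deviation.
    mu = sum(x) / len(x)
    sd = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))
    return [(v - mu) / sd for v in x] if sd > 0 else [0.0] * len(x)

def distance_profile(T, C):
    # Distance of C to every window of size len(C) in T,
    # under the z-normalized Euclidean distance.
    m = len(C)
    q = znorm(C)
    return [math.sqrt(sum((a - b) ** 2 for a, b in zip(q, znorm(T[i:i + m]))))
            for i in range(len(T) - m + 1)]

# The window at position 5 is a scaled-and-shifted copy of C,
# so its z-normalized distance is 0:
print(distance_profile([0, 1, 2, 1, 0, 10, 12, 14, 12, 10], [0, 1, 2, 1, 0]))
```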
Matrix Profile Algorithms
There are several algorithms for computing the matrix profile:
● STAMP
● STOMP
● AAMP
Problem Definition
● Let
– A: time series of size n
– m: size of subsequences
– A[i]: subsequence of size m starting at position i in A
● Goal:
– Calculate the matrix profile MP, such that MP[i] returns the distance of A[i] to its most similar subsequence in A
STAMP: Scalable Time Series Anytime Matrix Profile
● Idea: for each subsequence, compute its distance profile using the MASS algorithm, and take the minimum of the distance profile
● Distance function: z-normalized Euclidean distance
● The order in which the distance profiles are computed can be random
● The random ordering provides diminishing returns, which allows anytime interrupt-resume operation
MP(1:n-m+1) = inf;
MPI(1:n-m+1) = -1;
for i = 1:n-m+1 in a random order
    d = MASS(T, T(i:i+m-1));
    d(max(i-m/4,1) : min(i+m/4-1,n-m+1)) = NaN;  % exclusion zone around i (trivial matches)
    [MP, ind] = min([MP ; d]);
    MPI(ind==2) = i;
end
Complexity: O(n^2 log n)
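The MATLAB-style pseudocode above can be rendered as a naive Python sketch; the real STAMP gets its speed from MASS, whereas here each distance profile is computed directly, with the i ± m/4 exclusion zone suppressing trivial matches (function names are ours):

```python
import math

def znorm(x):
    mu = sum(x) / len(x)
    sd = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))
    return [(v - mu) / sd for v in x] if sd > 0 else [0.0] * len(x)

def dist_profile(T, i, m):
    # Naive z-normalized Euclidean distance profile of the query T[i:i+m].
    q = znorm(T[i:i + m])
    return [math.sqrt(sum((a - b) ** 2 for a, b in zip(q, znorm(T[j:j + m]))))
            for j in range(len(T) - m + 1)]

def stamp(T, m):
    l = len(T) - m + 1
    MP, MPI = [math.inf] * l, [-1] * l
    for i in range(l):  # the real STAMP visits i in random order
        d = dist_profile(T, i, m)
        for j in range(max(i - m // 4, 0), min(i + m // 4, l - 1) + 1):
            d[j] = math.inf  # exclusion zone: ignore trivial matches around i
        for j in range(l):  # elementwise MP = min(MP, d), as in the pseudocode
            if d[j] < MP[j]:
                MP[j], MPI[j] = d[j], i
    return MP, MPI
```

With an exactly repeated pattern (e.g., windows 0 and 7 identical), the matrix profile drops to 0 at both positions and the index array points each at the other.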
The distance profiles Di form the rows of an (n-m+1) × (n-m+1) distance matrix, and the matrix profile is obtained by taking the minimum of each row:

d(1,1)      d(1,2)      …  d(1,n-m+1)      → min(D1) = P1
d(2,1)      d(2,2)      …  d(2,n-m+1)      → min(D2) = P2
…
d(i,1)      d(i,2)      …  d(i,n-m+1)      → min(Di) = Pi
…
d(n-m+1,1)  d(n-m+1,2)  …  d(n-m+1,n-m+1)  → min(Dn-m+1) = Pn-m+1
Slide credit: E. Keogh
Yan Zhu et al. Matrix Profile II: Exploiting a Novel Algorithm and GPUs to break the one Hundred Million Barrier for Time Series Motifs and Joins. IEEE ICDM Conf., 2016.
* Download page: https://www.cs.ucr.edu/~eamonn/MatrixProfile.html
STOMP: Scalable Time series Ordered Matrix Profile
● STOMP has O(n^2) time and O(n) space complexity
● Distance function: z-normalized Euclidean distance
● Working formula: the z-normalized distance between windows i and j is derived from QT(i,j), the dot product of the ith window and the jth window, which is maintained incrementally along each diagonal:
QT(i,j) = QT(i-1,j-1) − t(i-1)·t(j-1) + t(i+m-1)·t(j+m-1)
Slide credit: E. Keogh
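The dot-product recurrence can be checked numerically in a few lines (a sketch; QT(i, j) denotes the dot product of the length-m windows starting at positions i and j, 0-based here, and the helper name is ours):

```python
def window_dot(T, i, j, m):
    # Direct dot product of the windows starting at i and j.
    return sum(T[i + k] * T[j + k] for k in range(m))

T = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
m = 3
qt = window_dot(T, 0, 2, m)                       # QT(0, 2), computed directly
qt_next = qt - T[0] * T[2] + T[0 + m] * T[2 + m]  # O(1) update to QT(1, 3)
print(qt_next == window_dot(T, 1, 3, m))  # True
```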
AAMP Algorithm *
● Distance function: Euclidean distance
● Main idea: compute the subsequence distances incrementally
– Each distance: in O(1) instead of O(m)
– Thus, an amortized complexity of O(n) to find the nearest neighbor of A[i]
● For this, we sweep the time series in n-m iterations
– In each iteration, the distances are computed incrementally, and the minimum distances updated
– In iteration k, we compute the distance of A[i] and A[i+k], i.e., subsequences of A whose positions differ by k
* Reza Akbarinia, Bertrand Cloez. Efficient Matrix Profile Computation Using Different Distance Functions. CoRR abs/1901.05708 (2019)
Incremental Distance Computation
● Di,j: square of the Euclidean distance between A[i] and A[j]
● A[i] = <ai, …, ai+m-1>
● A[j] = <aj, …, aj+m-1>
Thus, we have:
Di,j = Σk=0..m-1 (ai+k − aj+k)²
Di-1,j-1 = Σk=0..m-1 (ai+k-1 − aj+k-1)²
Di,j = Di-1,j-1 − (ai-1 − aj-1)² + (ai+m-1 − aj+m-1)²
Algorithm
For i = 1 to n-m+1
    MP[i] := ∞                                       // initialize minimum distances
For k = 1 to n-m                                     // do n-m iterations
    Compute D1,1+k                                   // using the Euclidean function
    For i = 2 to n-m-k+1                             // subsequences that are k positions apart
        Incrementally compute Di,i+k using Di-1,i+k-1    // O(1)
        If MP[i] > Di,i+k then
            MP[i] := Di,i+k
            MP_Index[i] := i+k
    For i = n-m-1 down to k                          // inverse scan
        Incrementally compute Di,i-k using Di+1,i-k+1    // O(1)
        If MP[i] > Di,i-k then
            MP[i] := Di,i-k
            MP_Index[i] := i-k
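A compact Python sketch of AAMP with the (non-normalized) Euclidean distance; instead of the separate forward and inverse scans of the pseudocode, this rendering updates both MP[i] and MP[i+k] while sweeping each diagonal, which covers the same pairs because the distance is symmetric (function name is ours):

```python
import math

def aamp(T, m):
    # Matrix profile under plain Euclidean distance, one diagonal at a time.
    n = len(T)
    l = n - m + 1
    MP = [math.inf] * l   # squared distances while sweeping
    MPI = [-1] * l
    for k in range(1, l):
        # First distance of the diagonal, computed directly in O(m).
        d = sum((T[t] - T[t + k]) ** 2 for t in range(m))
        if d < MP[0]:
            MP[0], MPI[0] = d, k
        if d < MP[k]:
            MP[k], MPI[k] = d, 0
        for i in range(1, l - k):
            # O(1) incremental update: D(i, i+k) from D(i-1, i+k-1).
            d += (T[i + m - 1] - T[i + k + m - 1]) ** 2 \
                 - (T[i - 1] - T[i + k - 1]) ** 2
            if d < MP[i]:
                MP[i], MPI[i] = d, i + k
            if d < MP[i + k]:
                MP[i + k], MPI[i + k] = d, i
    return [math.sqrt(v) for v in MP], MPI
```

Each diagonal costs O(m) for its first cell plus O(1) per remaining cell, giving the O(n^2) total time and O(n) space stated in the analysis.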
Example
[Figure: a time series t1 … t8 swept in four iterations; in iteration k, a sliding window swk computes incrementally the distances between subsequences that are k positions apart]
Analysis of AAMP
● An exact algorithm for computing the matrix profile
● Time complexity: O(n2)
● Space complexity: O(n)
● Simpler and faster than the SCRIMP++ algorithm (an improved version of STOMP)
– No need for Fourier transforms
Code of AAMP can be downloaded from: https://github.com/rakbarinia/AAMP-ACAMP
[Figure: execution time vs. time series length]
Parallel Time Series Indexing and Similarity Search
● DPiSAX: a parallel iSAX index for big time series databases (terabytes)
● ParCorr and RadiusSketch: parallel indexing of time series based on sketches, stored in distributed grids
[IEEE TKDE 2019, PKDD 2019, DMKD 2018, CIKM 2018, ICDM 2017]
iSAX Index
● Simple non-balanced tree
– Internal nodes point to internal or leaf nodes
– Leaf nodes point to time series files on disk
– A leaf node is divided (split) when its file size exceeds a threshold
● Efficient for kNN query processing
– But its construction may take too much time for big DBs (e.g., several days for 1 billion time series)
Distributed iSAX
DPiSAX: Distributed Partitioned iSAX
● Main idea: construct a distributed iSAX index in parallel
● Parallel index construction
– Partition the time series into uniform groups (for load balancing), using parallel sampling
– Each group forms a sub-tree of the index
– Construct the sub-indices (sub-trees) in parallel
● Parallel query processing
– Send each query directly to the node that is supposed to answer it
D. Yagoubi, R. Akbarinia, F. Masseglia, T. Palpanas. DPiSAX: Massively Distributed Partitioned iSAX. ICDM, 2017.
Distributed Partitioned iSAX
Experiments
● Experimental framework
– Spark
– Nef platform, a cluster of 32 machines
● Compared algorithms
– Centralized iSAX2+
– PLS: Parallel Linear Scan
– DiSAX
– DPiSAX
– DbasicPiSAX: the basic version of DPiSAX (with simple partitioning)
● Datasets
– Synthetic: 4 billion time series (6 TB), generated by a random walk generator
– Real DBs:
● IRIS seismic data: 40 million time series, for a total size of 150 GB
● TexMex corpus: 1 billion time series (SIFT feature vectors) of 128 points each, derived from 1 billion images
Results
Performance gain of our parallel approaches compared to iSAX2+ and parallel linear scan (PLS), for 10-NN queries (batches of 10k queries), over the Seismic, Random Walk (RW) and TexMex datasets, using 32 nodes
Parallel Time Series Indexing using Sketches
Sketches in a nutshell
[Figure: sketches of the time series are computed and stored in grids distributed over the nodes (Node 1, Node 2)]
Online Demonstration
● http://imitates.gforge.inria.fr/
Thanks!
References
Djamel Edine Yagoubi, Reza Akbarinia, Florent Masseglia, and Themis Palpanas. Massively Distributed Time Series Indexing and Querying. IEEE TKDE (Transactions on Knowledge and Data Engineering), 2019.
Chin-Chia Michael Yeh, Yan Zhu, Liudmila Ulanova, Nurjahan Begum, Yifei Ding, Hoang Anh Dau, Diego Furtado Silva, Abdullah Mueen, Eamonn Keogh. Matrix Profile I: All Pairs Similarity Joins for Time Series: A Unifying View that Includes Motifs, Discords and Shapelets. International Conference on Data Mining (ICDM), 2016.
Yan Zhu, Zachary Zimmerman, Nader Shakibay Senobari, Chin-Chia Michael Yeh, Gareth Funning, Abdullah Mueen, Philip Brisk and Eamonn Keogh. Matrix Profile II: Exploiting a Novel Algorithm and GPUs to Break the One Hundred Million Barrier for Time Series Motifs and Joins. International Conference on Data Mining (ICDM), 2016.
Djamel Edine Yagoubi, Reza Akbarinia, Boyan Kolev, Oleksandra Levchenko, Florent Masseglia, Patrick Valduriez, Dennis Shasha. ParCorr: efficient parallel methods to identify similar time series pairs across sliding windows. Data Mining and Knowledge Discovery, Springer, 2018, 32 (5), pp.1481-1507.
Djamel-Edine Yagoubi, Reza Akbarinia, Florent Masseglia, Themis Palpanas. DPiSAX: Massively Distributed Partitioned iSAX. International Conference on Data Mining (ICDM), pp.1-6, 2017