duong tuan anh faculty of computer science and technology ho chi minh city university of technology...

67
Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

Upload: marshall-mills

Post on 11-Jan-2016

214 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

Duong Tuan AnhFaculty of Computer Science and Technology

Ho Chi Minh City University of Technology

1Tutorial MIWAI December 2012

Page 2: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

OUTLINEIntroductionDefinitions of time series motifsApplications of time series motifsSome well-known algorithms of finding

motifsOur proposed methodConclusions

2

Page 3: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

3

Page 4: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

What are time series?A time series is a collection of

observations made sequentially in time

0 50 100 150 200 250 300 350 400 450 50023

24

25

26

27

28

29

25.1750 25.2250 25.2500 25.2500 25.2750 25.3250 25.3500 25.3500 25.4000 25.4000 25.3250 25.2250 25.2000 25.1750 .. .. 24.6250 24.6750 24.6750 24.6250 24.6250 24.6250 24.6750 24.7500

Examples: Financial time series, scientific time series

Page 5: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

Time series data miningQ. Yang & X. Wu, “10 Challenging Problems

in Data Mining Research”, Int. Journal on Information Technology and Decision Making, Vol. 5, No. 4 (2006), 597-604

3.Mining sequence data and time series data

Page 6: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

Time series data miningTime series data mining is a field of data

mining to deal with the challenges from the characteristics of time series data.

Time series data have the following characteristics:Very large datasets (terabyte-sized)Subjectivity (The definition of similarity

depends on the user)Different sampling ratesNoise, missing data, etc.

Page 7: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

What do we want to do with the time series data?What do we want to do with the time series data?

ClusteringClustering ClassificationClassification

Query by Content

Rule Discovery

10

s = 0.5c = 0.3

Motif DiscoveryMotif Discovery

Novelty DetectionNovelty DetectionVisualizationVisualization

Page 8: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

Time series motifsMotif: the most frequently occurring

pattern in a longer time series

8

Page 9: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

Motif DiscoveryProblem Description

Unsupervised detection andmodeling of previously unknownrecurring patterns in real-valuedtime series

Discovery due to unknownsNumber of motifs + occurrencesLocation and length of occurrences“Shape” of each motif

9

Page 10: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

10

J. Lin, E. Keogh, Patel, P. and Lonardi, S., Finding Motifs in Time Series, The 2nd Workshop on Temporal Data Mining, at the 8th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada, 2002.

Page 11: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

DefinitionsDefinition 1 Time Series:

A time series T = t1,…,tm is an ordered set of m real-valued variables.

Definition 2 Subsequences: Given a time series T of length m, a subsequence C of

T is a sampling of length n ≤ m of contiguous position from T, that is, C = tp,…,tp+n-1 for 1≤ p ≤ m – n + 1.

Definition 3. Match: Given a positive real number R (called range) and a

time series T containing a subsequence C beginning at position p and a subsequence M beginning at q, if D(C, M) ≤ R, then M is called a matching subsequence of C.

11

Page 12: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

DefinitionsDefinition 4 Trivial Match:

Given a time series T, containing a subsequence C beginning at position p and a matching subsequence M beginning at q, we say that M is a trivial match to C if either p = q or there does not exist a subsequence M’ beginning at q’ such that D(C, M’) > R, and either q < q’< p or p < q’< q.

12

Page 13: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

DefinitionsDefinition 5 1-Motif:

Given a time series T, a subsequence of length n and a range R, the most significant motif in T (called 1-Motif) is the subsequence C1 that has the highest count of non-trivial matches.

Definition 6 K-Motifs: The K-th most significant motif in T (called

thereafter K-Motif) is the subsequence CK that has the highest count of non-trivial matches, and satisfies D(CK, Ci) > 2R, for all 1 ≤ i < K .

13

Page 14: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

Definitions

14

K-Motifs:

If the motifs are only required to be R distance apart as in A, then the two motifs may share the majority of their elements. In contrast, B illustrates that requiring the centers to be at least 2R apart insures that the motifs are unique.

Page 15: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

Algorithm Find-1-Motif-Brute-Force(T, n, R)

15

The algorithm requires O(m2) calls to the distance function.

best_motif_count_so_far = 0

best_motif_location_so_far = null;

for i = 1 to length(T) – n + 1

count = 0; pointers = null;

for j = 1 to length(T) – n + 1

if Non_Trivial_Match (C[i: i + n – 1], C[j: j + n – 1], R ) then

count = count + 1;

pointers = append (pointers, j);

end end if count > best_motif_count_so_far then best_motif_count_so_far = count;

best_motif_location_so_far = i;

motif_matches = pointers;

end end This procedure calls

distance function

Page 16: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

16

Page 17: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

Motif-based classification of time series(Buza et al.,2009)Motifs can be used for time series classification.

This can be done in two steps:(i) Motifs of all time series are extracted(ii) Each time series is represented as an

attribute vector using motifs so that a classifier like SVM, Naïve Bayes, etc. can be applied.

17

Buza, K. and Thieme, L. S.: Motif-based Classification of Time Series with Bayesian Networks and SVMs. In: A. Fink et al. (eds.) Advances in Data Analysis, Data Handling and Business Intelligences, Studies in Classification, Data Analysis, Knowledge Organization. Springer-Verlag, pp. 105-114 (2010).

Page 18: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

Motif-based clustering of time series(Phu & Anh, 2011)Motif information are used to initialization k-means

clustering of time series:Step 1: We find 1-motifs for all time series in the

database.Step 2: We apply k-Means clustering on the 1-motifs

of all time series to obtain the clusters of motifs. From the centers of the motif clusters, we derive the associated time series and choose these time series as initial centers for the k-Means clustering .

18

Phu, L. and Anh, D. T., Motif-based Method for Initialization k-Means Clustering of Time Series Data, Proc. of 24th Australasian Joint Conference (AI 2011), Perth, Australia, Dec. 5-8. Dianhui Wang, Mark Reynolds (Eds.), LNAI 7106, Springer-Verlag, 2011, pp. 11-20.

Page 19: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

Motif-based time series predictionStock Temporal Prediction based on Time Series Motifs (Jiang et al., 2009)

For a certain length n, we can find a motif. A motif is a set of subsequences that are non-trivial matches with each other. Each subsequence in this set is called an instance of the motif

For different lengths of subsequences, we can find different motifs from a time series.

The idea: If the subsequence in the current sliding window matches with the prefix of a particular motif, we can predict that it will go like the suffix of the motif.

19

Page 20: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

Motif-based time series prediction (cont.)But the subsequence in the current window may be

fit for a number of motifs and it makes many possibilities of the suffix.

Rule: If a subsequence is similar with every instance in a motif, then we can conclude that it belongs to the motif and we can use the motif for prediction.

This method is applied for short-term stock prediction

20

Jiang, Y., Li, C., Han, J.: Stock temporal prediction based on time series motifs. In: Proc. of 8th Int. Conf. on Machine Learning and Cybernetics, Baoding, China, July 12-15 (2009).

Page 21: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

Signature verification using time series motifs (Gruber et al., 2006)

The process consists of 4 steps:Step 1: Signatures are converted to time seriesStep 2: Time series motifs are extracted using

EP_C algorithm (using important Extreme points and Clustering)

Step 3: Motifs are used to train a dynamic radian basis function network (DRBF) that can classify time series

Step 4: Time series classification is applied to online signature verification

21

Gruber C., Coduro, M., Sick, B.: Signature Verification with Dynamic RBF Networks and Time Series Motifs. In : Proc of 10th Int. Workshop on Frontiers in Handwriting Recognition (2006).

Page 22: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

Finding repeated images in database of shapes (Xi et al., 2007)

Convert 2-dimensional shapes into time seriesFind repeated images or “image motifs” in these

time seriesDefine a new form of Euclidean distance

( “Rotation invariant Euclidean distance”)Use a modified variant of Random projection “Image motifs” can be applied in anthropology,

palaeography (study of old texts) and zoology.

22

Xi, X., Keogh, E., Wei, L., Mafra-Neto, A., Finding Motifs in a Database of Shapes, Proc. of SIAM 2007, pp. 249-270.

Page 23: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

-Random Projection-Mueen-Keogh Algorithm

23

TWO WELL-KNOWN ALGORITHMS OF FINDING TIME SERIES MOTIF

Page 24: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

1. Random projection (B. Chiu, 2003)

Algorithm adapting PROJECTION method for detecting motifs in biological sequences to detecting time series motifs.

It’s based on locality-preserving hashing. Algorithm requires some pre-processing:

First, apply PAA , a method for dimensionality reduction

Discretize the transformed time series into symbolic strings (apply SAX)

24

Page 25: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

PAA for dimensionality reductionTime series databases are often extremely large.

Searching directly on these data will be very complex and inefficient.

To overcome this problem, we should use some of transformation methods to reduce the magnitude of time series.

These transformation methods are called dimensionality reduction techniques.

Some popular dimensionality reductions: DFT (Discrete Fourier Transform), DWT (Discrete Wavelet Transform), PAA (Piecewise aggregate Approximation), etc.

Page 26: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

PAATo reduce the time series from n dimensions to w

dimensions, the data is divided into w equal-sized segments. The mean value of the data within a segment is calculated

and a vector of these values becomes the data-reduced representation.

Page 27: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

DISCRETIZATION WITH SAXDiscretization of a time series is tranforming it into a

symbolic string. The main benefit of this discretization: there is an

enormous wealth of existing algorithms and data structures that allow the efficient manipulations of symbolic representations.

Lin and Keogh et al. (2003) proposed a method called Symbolic Aggregate Approximation (SAX), which allows the descretization of original time series into symbolic strings.

This discretization method is based on PAA representation and assumes normality of the resulting aggregated values.

SAX is a process which maps the PAA representation of the time series into a sequence of discrete symbols.

Page 28: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

SAXLet a be the size of the alphabet that is used to discretize

the time series. To symbolize the time series we have to find the values:

where B = 1,…,a-1 are called breakpoints (0 and a are defined

as - and +).We notice that real financial time series often have a

Gaussian distribution. To expect the equal likelihood (1/a) for each symbol, we have to pick the values basing on a statistical table for Gaussian distribution.

Definition: Breakpoints are a sorted list of number B = 1,…,a-1 such that the area under a Gausian curve from i to i+1 = 1/a.

Page 29: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

Breakpoints and symbolsUsing the breakpoints, the time series

wttT ...,,1

will be discretized into the symbolic string C = c1c2….cw. Each segment will be coded as a symbol ci using the formula:

kik

ai

i

i

tk

ta

t

c

1

1

11

where k indicates the k-th symbol in the alphabet, 1 the 1st symbol in the alphabet and a the a-th symbol in the alphabet.

Page 30: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012
Page 31: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

For example, Table 1 gives the breakpoints for the values of a from 3 to 10.

Assume the size of the alphabet is a = 3, we divide the range of time series values into three segments in such a way that the accumulative probability distribution of each segment is equal (1/3).

Based on the standard normal distribution, 1 corresponds to value when P( > x) = 1/3; 2 corresponds to value when P( > x) = 1/3 + 1/3 = 2/3. Therefore, we get 1 = -0.43, 2 = 0.43 and these two breakpoints correspond to P(1 > x) = 0.33 and P(2 > x) = 0.66, respectively.

Page 32: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

Table 1: A lookup table that contains the breakpoints that divide a Gaussian distribution in an arbitrary number (from 3 to 10) of equiprobable regions.

a=3 a=4 a=5 a=6 a=7 a=8 a=9 a=10

1 -0.43 -0.67 -0.84 -0.97 -1.07 -1.15 -1.22 -1.28

2 0.43 0 -0.25 -0.43 -0.57 -0.67 -0.76 -0.84

3 0.67 0.25 0 -0.18 -0.32 -0.43 -0.52

4 0.84 0.43 0.18 0 -0.14 -0.25

5 0.97 0.57 0.32 0.14 0

6 1.07 0.67 0.43 0.25

7 1.15 0.76 0.52

8 1.22 0.84

9 1.28

Page 33: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

Note we made two parameter Note we made two parameter choiceschoices

0

-

-

0 20 40 60 80 100 120

bbb

a

cc

c

a

0 20 40 60 80 100 120

C

C

1 2 3 4 5 6 7

1

8

The word size (w), in this case 8.

The alphabet size (cardinality, a), in this case 3.

3

21

Page 34: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

Random projection algorithmIt uses a collision matrix whose rows and columns

are SAX representation of each time series subsequence.

At each iteration, it selects certain positions of each words (as a “mask”) and traverses the word list.

For each match, the collision matrix entry is incremented.

At the end, the large entries in the collision matrix are selected as motif candidates. (greater than a threshold s)

Finally, each candidate is checked for validity in the original data.

34

Page 35: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

Random projection (cont.)

35

Page 36: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

A working example

36

Page 37: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

A working example

37

Page 38: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

Remarks on Random ProjectionIt’s the most popular algorithm for detecting time

series motifs.It can find motifs in linear time and is robust to noise.However, it still has three drawbacks.

(i) it has several parameters that need to be tuned by user.

(ii) if the distribution of the projections is not sufficiently wide, it becomes quadratic in time and space.

(iii) it is based on locality-preserving hashing that is effective for a relative small number of projected dimensions (10 to 20).

And it’s quite slow for large time series.

38

Page 39: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

2. Mueen-Keogh Algorithm(Mueen and Keogh, 2009)

Based on the Brute-force algorithmMK works directly on raw time series dataThree techniques to speed up the algorithm:

Exploiting the Symmetry of Euclidean DistanceExploiting Triangular Inequality and Reference

PointApplying Early Abandoning

39

Page 40: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

40

Exploiting the Symmetry of Euclidean Distance

Basing on D(A, B) = D(B, A), we can prune off a half of the distance computations by storing D(A, B) and reusing the value when we need to find D(B, A).

Page 41: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

41

Exploiting Triangular Inequality and Reference Point

Given two subsequences Ca and Cb. By triangular inequality, we have

D(Q, Ca) D(Q, Cb) + D(Ca, Cb). From that, we derive: D(Ca, Cb) D(Q, Ca) – D(Q, Cb). If we want to check whether D(Ca, Cb) R , we only need

to look at D(Q, Ca) – D(Q, Cb). If D(Q, Ca) – D(Q, Cb) R, we can conclude that D(Ca, Cb) R.

Given a reference subsequence Q, we have to compute the distances from Q to all the subsequences in time series Ti. That means we have to compute D(Q, ti) for each subsequence ti in the time series Ti.

Page 42: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

42

Applying Early AbandoningIn the case the triangular inequality can not help,

we have to compute the Euclidean distance D(Ca, Cb), then we can apply early abandoning technique.

The idea: we can abandon the Euclidean distance computation as soon as the cumulative sum during distance computation goes beyond the range R.

22

1

( , ) ( )m

ii iD X Y X Y

Page 43: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

Experiments of MK Algorithm

Limitation: The use of Euclidean distance directly on raw time series data gives rise to robustness problem when dealing with noisy data.

43

Motif

length

Brute_Force MK algorithm Magnitude of

improvement

8 1034046 6639 155.75

16 1017072 5583 182.17

32 985056 4602 214.05

64 922560 3590 256.98

Table 1: Experiments on the number of distance function calls (Stock dataset)

Page 44: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

44

Our proposed method: EP-C algorithm

Page 45: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

Significant Extreme Points & Clustering (Gruber et al., 2006)We can compress a time series by selecting some of its

minima and maxima, called important points and dropping the other points.

45

Page 46: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

Important extreme points Important minimum:

am T= {a1,…, an} is an important minimum if there are i, j where i < m < j, such that: am is the minimum among ai, …, aj, and

ai/am ≥ R and aj/am ≥ R (R is the compression rate) Important maximum:

am T= {a1,…, an} is an important maximum if there are i, j where i < m < j, such that: am is the maximum among ai, …, aj, and

am/ai ≥ R and am/aj ≥ R (R is the compression rate)

46

Page 47: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

Finding Time Series Motifs(i) Compute all important extreme points(ii) Extract candidate motifs(iii) Clustering of candidate motifs(iv) Select the motifs from the result of the clustering

47

K. B. Pratt and E. Fink, “Search for patterns in compressed time series”, International Journal of Image and Graphics, vol. 2, no. 1, pp. 89-106, 2002.

Page 48: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

Extracting Motif Candidates

48

Function getMotifCandidateSequence(T)

N = length(T);

EP = findSignificantExtremePoints(T, R);

maxLength = MAX_MOTIF_LENGTH;

for i = 1 to (length(EP)-2) do motifCandidate = getSubsequence(T, epi, epi+2)

if length(motifCandidate) > maxLength

then addMotifCandidate(resample(motifCandidate,

maxLength))

else addMotifCandidate(motifCandidate)

end ifend forend

Spline Interpolation or homothety

Page 49: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

Homothetic transformation

M

P

M’

N’

P’

O N

Homothety is a transformation in affine space. Given a point O and a value k ≠ 0. A homothety with center O and ratio k transforms M to M’ such that .

OMkOM '

The Figure shows a homothety with center O and ratio k = ½ which transforms the triangle MNP to the triangle M’N’P’.

Page 50: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

Homothety (cont.)

The algorithm that performs homothety to transform a motif candidate T with length N (T = {Y1,…,YN}) to motif candidate of length N’ is given as follows.

1. Let Y_Max = Max{Y1,…,YN}; Y_Min = Min {Y1,…,YN}

2. Find a center I of the homothety with the coordinate: X_Center = N/2, Y_Center = (Y_Max + Y_Min)/2

3. Perform the homothety with center I and ratio k = N’/N.

Page 51: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

Minimum Euclidean DistanceBesides, we note that two ‘similar’ motif candidates will not

be recognized where a vertical difference exists between them.

To make the clustering step in the EP-C algorithm able to handle not only uniform scaling but also shifting transformation along vertical axis, we modify our Euclidean distance by using Minimum Euclidean Distance as a method of negating the differences caused through vertical axis offsets.

Given two motif candidates: T’ = {T’1, T’2,…,T’N} and Q’ = {Q’1, Q’2,…,Q’N}, the Euclidean distance between them is normally given by the following formula:

'

1

2)''()','(N

i

ii QTQTD

Page 52: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

a) Uniform scaling

b) Shifting along the vertical axis

Page 53: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

Minimum Euclidean Distance

In this work we use the Minimum Euclidean Distance which is defined by the following equation:

'

1

2)''()','(N

i

ii bQTMinQTD where b (2)

From Eq. (2), we can derive a suitable value for the shifting parameter b such that we can find the best match between the two motif candidates Q’ and T’ as follows:

'

1

)''('

1 N

i

ii TQN

b

Page 54: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

Clustering of Motif Candidates

54

function getHierarchicalClustering(MCS, u, d)

C = getIntitialClustering(MCS)while size(C)>u do [Ci, Cj] = getMostSimilarClusters()

addCluster(C, mergeClusters(Ci, Cj )

removeCluster(C, Ci);

removeCluster(C, Cj );

endwhilereturn Cend

Page 55: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

Comparing the EP-C algorithm to Random

Projection For EP-C, we implement three variants of EP-C in

order to compare EP-C using spline interpolation (for resampling a longer motif candidate to a shorter one) to EP-C using homothety and compare EP-C using HAC clustering to EP-C using K-means clustering. We denote these variants as follows: HAC|SI: the EP-C algorithm using spline

interpolation and HAC clustering. HAC|HT: the EP-C algorithm using homothety and

HAC clustering. K-means|HT: the EP-C algorithm using homothety

and K-means clustering.

Page 56: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

The test datasets

In this experiment, we tested the algorithms on four publicly available datasets: ECG (7900 data points), Memory (6800 data points), Power (35000 data points) and ECG (140000 data points).

All these four datasets are obtained from the UCR Time Series Data Mining Archive (Keogh and Folias, 2002).

Page 57: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

Parameter settingsFor the RP algorithm, we use the parameter setting:

the length of each PAA segment l = 16, the alphabet size a = 5, the number of iterations in RP i = 10, the number of errors allowed by RP d = 1. The SAX word length w depends on the dataset, for example, w = 15 for ECG dataset and w = 26 for Memory dataset.

For the EP-C algorithm, we use the parameter setting: the compression ratio for extracting significant extreme points R = 1.2, the minimum length of motif candidates l_min = 50 and r = the number of clusters / the number of extreme points = 0.2. When HAC clustering is used, the distance between two clusters is computed by the minimum distance, i.e. the distance between two nearest objects each belonging to one of the two clusters.

Page 58: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

Experimental results Dataset Length Algorithm Number of motif instances Runtime (sec)

ECG 7900 RP 24 9

HAC|SI 13 0.5

HAC|HT 15 0.13

K-means|HT 17 0.2

Memory 6800 RP 15 8

HAC|SI 9 3

HAC|HT 7 1

K-means|HT 8 1

Power 35000 RP N/A N/A

HAC|SI 21 21

HAC|HT 36 2

K-means|HT 44 3

ECG 140000 RP N/A N/A

HAC|SI 65 50

HAC|HT 85 3

K-means|HT 85 8

Page 59: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

From the experimental results, we can see that: 1. The three variants of EP-C algorithm are much

more effective than RP in terms of time efficiency and accuracy. The number of motif instances in the case of RP is more than that of EP-C. The reason of this fact is that RP uses sliding window to extract subsequences for motif discovery while EP-C uses the concept of significant extreme points to extract motif candidates. However, the accuracy of EP-C is still better than that of RP.

2. With large datasets such as Power (35000 data points) and ECG (140000 data points), RP can not work since it can not tackle large datasets, while HAC|HT can find the motif in a very short time (2 or 3 seconds).

3. The EP-C with homothety (HAC|HT) is better than EP-C with spline interpolation (HAC|SI) in terms of time efficiency and accuracy.

4. HAC clustering is more suitable than K-means for clustering motif candidates in the EP-C algorithm.

Page 60: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

Left) ECG dataset. Right) A motif was discovered in the dataset.

Left) Power dataset. Right) A motif was discovered in the dataset.

Page 61: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

Efficiency of EP-CWe can evaluate the efficiency of the EP-C algorithm by

simply considering the ratio of how many times the Euclidean distance function must be called by EP-C over the number of times it must be called by the brute-force algorithm given in (Lin et al., 2002).

Note: the numbers of Euclidean distance function calls in the EP-C are the same for all three different versions of EP-C (HAC|SI, HAC|HT and k-Means|HT).

Table 2: The efficiency of the EP-C algorithm on various datasets

Dataset ECG Memory Power ECG 7900 6800 35000 140000----------------------------------------------------------------------------Efficiency 0.0000638 0.0000920 0.0001304

0.0000635

Page 62: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

Remarks on EP_C Algorithm

62

This motif discovery method (EP_C) is very efficient, especially for large time series. It is much more efficient than Random Projection.

For example, experiment on Koski_ECG dataset from UCR Archive: http://www.cs.ucr.edu/~eamonn/SAX/koski_ecg.dat

This time series has 144002 points, run time for detection motif: 3secs

Page 63: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

ConclusionsTime series motif discovery has several

practical applications.Through experiments, we see that EP_C

(Extreme points and Clustering) method are much more efficient than Random Projection.

63

Page 64: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

Future worksDeveloping methods to detect motifs in

streaming time series.Exploiting motif information in time series

prediction:Banking dataEnvironmental data

64

Page 65: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

65

Page 66: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

REFERENCES1. Buza, K. and Thieme, L. S.: Motif-based Classification of

Time Series with Bayesian Networks and SVMs. In: A. Fink et al. (eds.) Advances in Data Analysis, Data Handling and Business Intelligences, Studies in Classification, Data Analysis, Knowledge Organization. Springer-Verlag, pp. 105-114 (2010).

2. B. Chiu, E. Keogh, and S. Lonardi, Probabilistic Discovery of Time Series Motifs, Proceedings of the 9th International Conference on Knowledge Discovery and Data mining, Washington, D.C., 2003.

3. Gruber C., Coduro, M., Sick, B.: Signature Verification with Dynamic RBF Networks and Time Series Motifs. In : Proc of 10th Int. Workshop on Frontiers in Handwriting Recognition (2006).

4. J. Lin, E. Keogh, Patel, P. and Lonardi, S., Finding Motifs in Time Series, The 2nd Workshop on Temporal Data Mining, at the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada, 2002.

5. Mueen, A., Keogh, E., Zhu, Q., Cash, S., and Westover, B.: Exact Discovery of Time Series Motifs. In: Proc. of SDM (2009).

66

Page 67: Duong Tuan Anh Faculty of Computer Science and Technology Ho Chi Minh City University of Technology 1 Tutorial MIWAI December 2012

6. Jiang, Y., Li, C., Han, J.: Stock temporal prediction based on time series motifs. In: Proc. of 8th Int. Conf. on Machine Learning and Cybernetics, Baoding, China, July 12-15 (2009).

7. Li, Q., Lopez, I.F.V. and Moon, B.: Skyline Index for time series data, IEEE Trans. on Knowledge and Data Engineering, Vol. 16, No. 4 (2004)

8. Phu, L. and Anh, D. T., Motif-based Method for Initialization k-Means Clustering of Time Series Data, Proc. of 24th Australasian Joint Conference (AI 2011), Perth, Australia, Dec. 5-8. Dianhui Wang, Mark Reynolds (Eds.), LNAI 7106, Springer-Verlag, 2011, pp. 11-20.

9. Son, N.T., Anh, D.T., Discovering Approximate Time Series Motif based on MP_C Method with the Support of Skyline Index, Proc. of 4th Int. Conf. on Knowledge and Systems Engineering (KSE’2012), Aug. 17-19, Da Nang, Vietnam (to appear).

10. Nguyen Thanh Son, Duong Tuan Anh, Time Series Similarity Search based on Middle Points and Clipping, Proc. of 3rd Conference on Data Mining and Optimization (DMO), 27-29 June, 2011, Putrajaya, Malaysia, pp.13-19.

11. Xi, X., Keogh, E., Wei, L., Mafra-Neto, A., Finding Motifs in a Database of Shapes, Proc. of SIAM 2007, pp. 249-270.

67