
Page 1: Mining Time Series Data

Mining Time Series Data CS240B Notes by

Carlo Zaniolo

UCLA CS Dept

A Tutorial on Indexing and Mining Time Series Data, ICDM '01

The 2001 IEEE International Conference on Data Mining, November 29, San Jose

Dr Eamonn Keogh
Computer Science & Engineering Department
University of California - Riverside
Riverside, CA 92521
[email protected]

With Slides from:

Page 2: Mining Time Series Data

Outline

• Introduction, Motivation
• Similarity Measures
  • Properties of distance measures
  • Preprocessing the data
  • Time warped measures
• Indexing Time Series
• Dimensionality reduction
  • Discrete Fourier Transform
  • Discrete Wavelet Transform
  • Singular Value Decomposition
  • Piecewise Linear Approximation
  • Symbolic Approximation
  • Piecewise Aggregate Approximation
  • Adaptive Piecewise Constant Approximation
• Summary, Conclusions

Page 3: Mining Time Series Data

What are Time Series?

[Figure: a time series of about 500 points plotted with values between 23 and 29; the raw values are also shown as a column of numbers: 25.1750, 25.2250, 25.2500, 25.2500, 25.2750, 25.3250, ..., 24.6250, 24.6750, 24.7500]

A time series is a collection of observations made sequentially in time.

Note that virtually all similarity measurements, indexing and dimensionality reduction techniques discussed in this tutorial can be used with other data types.

Page 4: Mining Time Series Data

Time Series are Ubiquitous! I

People measure things...
• The president's approval rating.
• Their blood pressure.
• The annual rainfall in Riverside.
• The value of their Yahoo stock.
• The number of web hits per second.

… and things change over time.

Thus time series occur in virtually every medical, scientific and business domain.

Page 5: Mining Time Series Data

Time Series are Ubiquitous! II

A random sample of 4,000 graphics from 15 of the world’s newspapers published from 1974 to 1989 found that more than 75% of all graphics were time series (Tufte, 1983).

Page 6: Mining Time Series Data

Defining the similarity between two time series is at the heart of most time series data mining applications/tasks.

Thus time series similarity will be the primary focus of this tutorial.

[Figure: a query Q (template) matched against a database C of 10 time series]

Time Series Similarity

Classification

Clustering

Rule Discovery

Query by Content

Page 7: Mining Time Series Data

Why is Working With Time Series so Difficult? Part I

1 Hour of EKG data: 1 Gigabyte.

Typical Weblog: 5 Gigabytes per week.

Space Shuttle Database: 158 Gigabytes and growing.

Macho Database: 2 Terabytes, updated with 3 gigabytes per day.

Answer: How do we work with very large databases?

Since most of the data lives on disk (or tape), we need a representation of the data we can efficiently manipulate.

Page 8: Mining Time Series Data

Why is Working With Time Series so Difficult? Part II

Answer: We are dealing with subjective notions of similarity.

The definition of similarity depends on the user, the domain and the task at hand. We need to be able to handle this subjectivity.

Page 9: Mining Time Series Data

Why is Working With Time Series so Difficult? Part III

Answer: Miscellaneous data handling problems.

• Differing data formats.
• Differing sampling rates.
• Noise, missing values, etc.

Page 10: Mining Time Series Data

Similarity Matching Problem: Flavor 1

1: Whole Matching

Given a Query Q, a reference database C and a distance measure, find the Ci that best matches Q.

[Figure: the query Q (template) compared against the 10 series of database C; C6 is the best match]

Page 11: Mining Time Series Data

Similarity Matching Problem: Flavor 2

2: Subsequence Matching

Given a Query Q, a reference database C and a distance measure, find the location that best matches Q.

[Figure: a long series C, a query Q (template), and the best matching subsection]

Note that we can always convert subsequence matching to whole matching by sliding a window across the long sequence, and copying the window contents.

Page 12: Mining Time Series Data

After all that background we might have forgotten what we are doing and why we care!

So here is a simple motivator and review…

You go to the doctor because of chest pains. Your ECG looks strange…

Your doctor wants to search a database to find similar ECGs, in the hope that they will offer clues about your condition...

Two questions:
• How do we define similar?
• How do we search quickly?

Page 13: Mining Time Series Data

Similarity is always subjective (i.e., it depends on the application).

All models are wrong, but some are useful…

This slide was taken from: "A practical Time-Series Tutorial with MATLAB", presented at ECML/PKDD 2005 by Michalis Vlachos.

Page 14: Mining Time Series Data

Distance functions

Metric
• Euclidean Distance
• Correlation

Triangle Inequality: d(x,z) ≤ d(x,y) + d(y,z)

• Assume: d(Q,bestMatch) = 20 and d(Q,B) = 150
• Then, since d(A,B) = 20:
  d(Q,A) ≥ d(Q,B) − d(B,A), so d(Q,A) ≥ 150 − 20 = 130
• We do not need to get A from disk
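To make the pruning concrete, here is a minimal sketch (not from the tutorial; all names are illustrative) of a linear scan that uses the triangle inequality and precomputed pairwise distances to skip objects that cannot beat the best match so far:

```python
def prune_scan(query, database, pairwise, dist):
    """Linear scan over `database` (a dict: name -> series). `pairwise`
    holds precomputed distances between database objects; `dist` is the
    metric. Objects proven too far away by the triangle inequality,
    d(Q,A) >= d(Q,B) - d(B,A), are never fetched or compared."""
    best, best_d = None, float('inf')
    seen = {}  # distances d(Q, B) computed so far
    for name, series in database.items():
        lower_bound = max((d_qb - pairwise[b][name]
                           for b, d_qb in seen.items()), default=0.0)
        if lower_bound >= best_d:
            continue  # we do not need to get this object from disk
        d = dist(query, series)
        seen[name] = d
        if d < best_d:
            best, best_d = name, d
    return best, best_d
```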

Non-Metric

• Time Warping

• LCSS: longest common subsequence

Page 15: Mining Time Series Data

Preprocessing the data before distance calculations

If we naively try to measure the distance between two “raw” time series, we may get very unintuitive results.

This is because Euclidean distance is very sensitive to some distortions in the data. For most problems these distortions are not meaningful, and thus we can and should remove them.

In the next 4 slides I will discuss the 4 most common distortions, and how to remove them.

• Offset Translation
• Amplitude Scaling
• Linear Trend
• Noise

Page 16: Mining Time Series Data

Transformation I: Offset Translation

[Figure: D(Q,C) computed on the raw series, and again after offset translation]

Q = Q - mean(Q)
C = C - mean(C)

Page 17: Mining Time Series Data

Transformation II: Amplitude Scaling

[Figure: D(Q,C) computed before and after amplitude scaling]

Q = (Q - mean(Q)) / std(Q)
C = (C - mean(C)) / std(C)
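As a minimal sketch (assuming numpy; the toy series are made up), offset translation followed by amplitude scaling is just z-normalization:

```python
import numpy as np

def znorm(x):
    """Offset translation (subtract the mean) then amplitude scaling
    (divide by the standard deviation)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

# Two series with the same shape but different offset and scale:
Q = znorm([1.0, 2.0, 3.0, 2.0, 1.0])
C = znorm([10.0, 14.0, 18.0, 14.0, 10.0])
print(np.linalg.norm(Q - C))  # D(Q,C) is ~0: same shape after normalization
```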

Page 18: Mining Time Series Data

Transformation III: Linear Trend

[Figure: a series shown after offset translation and amplitude scaling, and again after removing the linear trend]

The intuition behind removing linear trend is this: fit the best fitting straight line to the time series, then subtract that line from the time series.
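A minimal sketch of this (assuming numpy; np.polyfit does the least-squares line fit):

```python
import numpy as np

def remove_linear_trend(x):
    """Fit the best straight line to the series, then subtract it."""
    x = np.asarray(x, dtype=float)
    t = np.arange(len(x))
    slope, intercept = np.polyfit(t, x, 1)  # least-squares line
    return x - (slope * t + intercept)

detrended = remove_linear_trend([1.0, 2.2, 2.9, 4.1, 5.0])
```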

Page 19: Mining Time Series Data

Transformation IV: Noise

[Figure: two noisy series, before and after smoothing]

Q = smooth(Q)
C = smooth(C)

The intuition behind removing noise is this: average each datapoint's value with its neighbors.
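A minimal sketch of such a smoother (assuming numpy; the window width is an arbitrary choice):

```python
import numpy as np

def smooth(x, w=3):
    """Moving average: replace each point by the mean of a window of
    w points centered on it. Note that mode='same' zero-pads, so the
    first and last points are damped toward zero."""
    kernel = np.ones(w) / w
    return np.convolve(np.asarray(x, dtype=float), kernel, mode='same')

Q = smooth([2.0, 1.0, 8.0, 2.0, 3.0])
```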

Page 20: Mining Time Series Data

A Quick Experiment to Demonstrate the Utility of Preprocessing the Data

[Figure: two dendrograms of nine series. One is clustered using Euclidean distance on the raw data; the other is clustered using Euclidean distance on the raw data after removing noise, linear trend, offset translation and amplitude scaling.]

Page 21: Mining Time Series Data

Summary of Preprocessing

The "raw" time series may have distortions which we should remove before clustering, classification etc.

Of course, sometimes the distortions are the most interesting thing about the data, the above is only a general rule.

We should keep in mind these problems as we consider the high level representations of time series which we will encounter later (Fourier transforms, wavelets etc.), since these representations often allow us to handle distortions in elegant ways.

Page 22: Mining Time Series Data

Dynamic Time Warping

Fixed Time Axis: sequences are aligned "one to one".
"Warped" Time Axis: nonlinear alignments are possible.

Note: We will first see the utility of DTW, then see how it is calculated.

Page 23: Mining Time Series Data

Utility of Dynamic Time Warping: Example II, Data Mining

Power-Demand Time Series. Each sequence corresponds to a week's demand for power in a Dutch research facility in 1997 [van Selow 1999].

[Figure: the daily power-demand curves, Monday through Sunday]

Wednesday was a national holiday

Page 24: Mining Time Series Data

Hierarchical clustering with Euclidean Distance.
<Group Average Linkage>

[Figure: dendrogram of the seven weekly sequences]

The two 5-day weeks are correctly grouped. Note however, that the three 4-day weeks are not clustered together. Also, the two 3-day weeks are not clustered together.

Page 25: Mining Time Series Data

Hierarchical clustering with Dynamic Time Warping.

<Group Average Linkage>

[Figure: dendrogram of the seven weekly sequences]

The two 5-day weeks are correctly grouped. The three 4-day weeks are clustered together. The two 3-day weeks are also clustered together.

Page 26: Mining Time Series Data

Dynamic Time Warping (how does it work?)

The intuition is that we copy an element multiple times so as to achieve a better matching.

Euclidean distance: d = 1
T1 = [1, 1, 2, 2]
      |  |  |  |
T2 = [1, 2, 2, 2]

Warping distance: d = 0
T1 = [1, 1, 2, 2]
      |  |  |
T2 = [1, 2, 2, 2]

Page 27: Mining Time Series Data

Computing the Dynamic Time Warp Distance I

[Figure: sequences Q (length n) and C (length p) aligned in an n × p search matrix; a warping path w1, ..., wk runs through cells (i, j)]

Note that the input sequences can be of different lengths.

Page 28: Mining Time Series Data

Computing the Dynamic Time Warp Distance II

DTW(Q, C) = min{ √(Σ_{k=1}^{K} w_k) / K }

where w_k is the cost of the k-th matrix element on a warping path of length K.

Every possible mapping from Q to C can be represented as a warping path in the search matrix.

We simply want to find the cheapest one…

Although there are exponentially many such paths, we can find one in only quadratic time using dynamic programming.
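A minimal sketch of that dynamic program (assuming numpy; this version returns the unnormalized path cost, omitting the division by K in the formula above):

```python
import numpy as np

def dtw(q, c):
    """Fill the |Q| x |C| search matrix: D[i, j] is the cost of the
    cheapest warping path aligning the first i points of Q with the
    first j points of C. O(np) time and space."""
    n, p = len(q), len(c)
    D = np.full((n + 1, p + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, p + 1):
            cost = (q[i - 1] - c[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j],      # repeat a point of C
                                 D[i, j - 1],      # repeat a point of Q
                                 D[i - 1, j - 1])  # advance both
    return np.sqrt(D[n, p])

print(dtw([1, 1, 2, 2], [1, 2, 2, 2]))  # 0.0, the warped example above
```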


Page 29: Mining Time Series Data

Complexity of Time Warping

Time taken to create hierarchical clustering of power-demand time series:

• Time to create dendrogram using Euclidean Distance: 1.2 seconds
• Time to create dendrogram using Dynamic Time Warping: 3.40 hours

How to speed it up.

• Approach 1: Complexity is O(n²). We can reduce it to O(n) simply by restricting the warping path.

• Approach 2: Approximate the time series with some compressed or downsampled representation, and do DTW on the new representation.


Page 30: Mining Time Series Data

Fast Approximations to Dynamic Time Warp Distance II

There is strong visual evidence to suggest it works well. The utility of the approach on clustering, classification and query-by-content problems has also been demonstrated experimentally.

[Figure: timing comparison, 1.3 sec vs 22.7 sec]

Page 31: Mining Time Series Data

Weighted Distance Measures I

Weighting features is a well known technique in the machine learning community to improve classification and the quality of clustering.

Intuition: For some queries different parts of the sequence are more important.

Page 32: Mining Time Series Data

Note: In this example we are using a piecewise linear approximation of the data. We will learn more about this representation later.

Relevance Feedback for Time Series

[Figure: the original query, and the weight vector; initially all weights are the same]

Page 33: Mining Time Series Data

The initial query is executed, and the five best matches are shown (in the dendrogram). One by one the 5 best matching sequences will appear, and the user will rank them from very bad (-3) to very good (+3).

Page 34: Mining Time Series Data

Based on the user feedback, both the shape and the weight vector of the query are changed.

The new query can be executed. The hope is that the query shape and weights will converge to the optimal query.

Two papers consider relevance feedback for time series.

L. Wu, C. Faloutsos, K. Sycara & T. Payne: FALCON: Feedback Adaptive Loop for Content-Based Retrieval. VLDB 2000: 297-306.

Page 35: Mining Time Series Data

Motivating Example Revisited...

You go to the doctor because of chest pains. Your ECG looks strange…

Your doctor wants to search a database to find similar ECGs, in the hope that they will offer clues about your condition...

Two questions:
• How do we define similar?
• How do we search quickly?

Page 36: Mining Time Series Data

Indexing Time Series

• We have seen techniques for assessing the similarity of two time series.

• However we have not addressed the problem of finding the best match to a query in a large database...

• We need some way to index the data...

• A topic extensively discussed in the literature that we will not discuss here for lack of time; also, it might not be applicable to data streams.

[Figure: a query Q ("find shapes like this") posed against the 10 series of database C]

Page 37: Mining Time Series Data

Compression – Dimensionality Reduction

• Project all sequences into a new space, and search this space instead.

Page 38: Mining Time Series Data

An Example of a Dimensionality Reduction Technique

[Figure: a time series C of n = 128 points]

Raw Data: 0.4995 0.5264 0.5523 0.5761 0.5973 0.6153 0.6301 0.6420 0.6515 0.6596 0.6672 0.6751 0.6843 0.6954 0.7086 0.7240 0.7412 0.7595 0.7780 0.7956 0.8115 0.8247 0.8345 0.8407 0.8431 0.8423 0.8387 …

The graphic shows a time series with 128 points.

The raw data used to produce the graphic is also reproduced as a column of numbers (just the first 30 or so points are shown).

n = 128

Page 39: Mining Time Series Data

Dimensionality Reduction (cont.)

Raw Data: 0.4995 0.5264 0.5523 0.5761 0.5973 0.6153 0.6301 0.6420 0.6515 0.6596 0.6672 0.6751 0.6843 0.6954 0.7086 0.7240 0.7412 0.7595 0.7780 0.7956 0.8115 0.8247 0.8345 0.8407 0.8431 0.8423 0.8387 …

Fourier Coefficients: 1.5698 1.0485 0.7160 0.8406 0.3709 0.4670 0.2667 0.1928 0.1635 0.1602 0.0992 0.1282 0.1438 0.1416 0.1400 0.1412 0.1530 0.0795 0.1013 0.1150 0.1801 0.1082 0.0812 0.0347 0.0052 0.0017 0.0002 ...

We can decompose the data into 64 pure sine waves using the Discrete Fourier Transform (just the first few sine waves are shown).

The Fourier Coefficients are reproduced as a column of numbers (just the first 30 or so coefficients are shown).

Note that at this stage we have not done dimensionality reduction, we have merely changed the representation...

Page 40: Mining Time Series Data

An Example of a Dimensionality Reduction Technique III

Fourier Coefficients: 1.5698 1.0485 0.7160 0.8406 0.3709 0.4670 0.2667 0.1928 0.1635 0.1602 0.0992 0.1282 0.1438 0.1416 0.1400 0.1412 0.1530 0.0795 0.1013 0.1150 0.1801 0.1082 0.0812 0.0347 0.0052 0.0017 0.0002 ...

Truncated Fourier Coefficients (C'): 1.5698 1.0485 0.7160 0.8406 0.3709 0.4670 0.2667 0.1928

We have discarded 15/16 of the data.

… however, note that the first few sine waves tend to be the largest (equivalently, the magnitude of the Fourier coefficients tends to decrease as you move down the column).

We can therefore truncate most of the small coefficients with little effect.

n = 128, N = 8, Cratio = 1/16
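A minimal sketch of this truncation (assuming numpy; the toy series is made up, and the real-FFT bookkeeping means the coefficient count does not exactly match the slide's N, so treat the numbers as illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.cumsum(rng.standard_normal(128))  # a toy series, n = 128

F = np.fft.rfft(x)            # Fourier coefficients of x
N = 8                         # keep only the first 8 coefficients
F_trunc = np.zeros_like(F)
F_trunc[:N] = F[:N]           # discard the small, high-frequency terms
x_approx = np.fft.irfft(F_trunc, n=len(x))  # smooth reconstruction
```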

Page 41: Mining Time Series Data

An Example of a Dimensionality Reduction Technique IV

Fourier Coefficients: 1.5698 1.0485 0.7160 0.8406 0.3709 0.1670 0.4667 0.1928 0.1635 0.1302 0.0992 0.1282 0.2438 0.2316 0.1400 0.1412 0.1530 0.0795 0.1013 0.1150 0.1801 0.1082 0.0812 0.0347 0.0052 0.0017 0.0002 ...

Sorted Truncated Fourier Coefficients (C'): 1.5698 1.0485 0.7160 0.8406 0.2667 0.1928 0.1438 0.1416

Instead of taking the first few coefficients, we could take the best coefficients.

This can help greatly in terms of approximation quality, but makes indexing hard (impossible?).

Note that this also applies to wavelets.

Page 42: Mining Time Series Data

Compressed Representations

[Figure: the same time series approximated by six representations: DFT, DWT, SVD, APCA, PAA, PLA]

• DFT: Agrawal, Faloutsos & Swami. FODO 1993; Faloutsos, Ranganathan & Manolopoulos. SIGMOD 1994
• DWT: Chan & Fu. ICDE 1999
• SVD: Korn, Jagadish & Faloutsos. SIGMOD 1997
• APCA: Keogh, Chakrabarti, Pazzani & Mehrotra. SIGMOD 2001
• PAA: Keogh, Chakrabarti, Pazzani & Mehrotra. KAIS 2000; Yi & Faloutsos. VLDB 2000
• PLA: Morinaka, Yoshikawa, Amagasa & Uemura. PAKDD 2001

Page 43: Mining Time Series Data

Discrete Fourier Transform I

Jean Fourier, 1768-1830

[Figure: a time series X, its DFT approximation X', and the first few sine-wave basis functions, labeled 0 through 9]

Excellent free Fourier Primer

Hagit Shatkay, "The Fourier Transform - a Primer", Technical Report CS-95-37, Department of Computer Science, Brown University, 1995.

http://www.ncbi.nlm.nih.gov/CBBresearch/Postdocs/Shatkay/

Basic Idea: Represent the time series as a linear combination of sines and cosines, but keep only the first n/2 coefficients.

Why n/2 coefficients? Because each sine wave requires 2 numbers, for the phase (w) and amplitude (A,B).

C(t) = Σ_{k=1}^{n} ( A_k cos(2π w_k t) + B_k sin(2π w_k t) )

Page 44: Mining Time Series Data

Discrete Fourier Transform II

Pros and Cons of DFT as a time series representation:
• Good ability to compress most natural signals.
• Fast, off the shelf DFT algorithms exist. O(nlog(n)).
• (Weakly) able to support time warped queries.

• Difficult to deal with sequences of different lengths.
• Cannot support weighted distance measures.


Note: The related transform DCT uses only cosine basis functions. It does not seem to offer any particular advantages over DFT.

Page 45: Mining Time Series Data

Discrete Wavelet Transform I

Alfred Haar, 1885-1933

[Figure: a time series X, its DWT approximation X', and the Haar basis functions Haar 0 through Haar 7]

Excellent free Wavelets Primer

Stollnitz, E., DeRose, T., & Salesin, D. (1995). Wavelets for Computer Graphics: A Primer. IEEE Computer Graphics and Applications.

Basic Idea: Represent the time series as a linear combination of Wavelet basis functions, but keep only the first N coefficients.

Although there are many different types of wavelets, researchers in time series mining/indexing generally use Haar wavelets.

Haar wavelets seem to be as powerful as the other wavelets for most problems and are very easy to code.

Page 46: Mining Time Series Data

X = {8, 4, 1, 3}

h1 = 4 = mean(8,4,1,3)
h2 = 2 = mean(8,4) - h1
h3 = 2 = (8-4)/2
h4 = -1 = (1-3)/2

I have converted a raw time series X = {8, 4, 1, 3} into the Haar Wavelet representation H = [4, 2, 2, -1]. We can convert the Haar representation back to the raw signal with no loss of information...
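A minimal sketch of the averaging-and-differencing scheme just shown (assuming numpy; series length must be a power of two):

```python
import numpy as np

def haar(x):
    """Full Haar decomposition by repeated pairwise averaging and
    half-differencing; returns [overall mean, coarse..fine details]."""
    x = np.asarray(x, dtype=float)
    details = []
    while len(x) > 1:
        avg = (x[0::2] + x[1::2]) / 2.0   # pairwise means
        det = (x[0::2] - x[1::2]) / 2.0   # pairwise half-differences
        details = list(det) + details     # prepend, so coarser details come first
        x = avg
    return [x[0]] + details

print(haar([8, 4, 1, 3]))  # [4.0, 2.0, 2.0, -1.0], matching H above
```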

Page 47: Mining Time Series Data


Discrete Wavelet Transform III

Pros and Cons of Wavelets as a time series representation:
• Good ability to compress stationary signals.
• Fast linear time algorithms for DWT exist.
• Able to support some interesting non-Euclidean similarity measures.

• Works best if N = 2^(some integer). Otherwise wavelets approximate the left side of the signal at the expense of the right side.
• Cannot support weighted distance measures.

Open Question: We have only considered one type of wavelet; there are many others. Are the other wavelets better for indexing?

YES: I. Popivanov, R. Miller. Similarity Search Over Time Series Data Using Wavelets. ICDE 2002.

NO: K. Chan and A. Fu. Efficient Time Series Matching by Wavelets. ICDE 1999.

Obviously, this question is still open...

Page 48: Mining Time Series Data

Singular Value Decomposition

Eugenio Beltrami 1835-1899; Camille Jordan 1838-1921; James Joseph Sylvester 1814-1897

[Figure: a time series X, its SVD approximation X', and eigenwaves 0 through 7]

Basic Idea: Represent the time series as a linear combination of eigenwaves but keep only the first N coefficients.

SVD is similar to the Fourier and Wavelet approaches in that we represent the data in terms of a linear combination of shapes (in this case eigenwaves).

SVD differs in that the eigenwaves are data dependent.

SVD has been successfully used in the text processing community (where it is known as Latent Semantic Indexing) for many years, but it is computationally expensive.

Good free SVD Primer

Singular Value Decomposition - A Primer. Sonia Leach

Page 49: Mining Time Series Data

Singular Value Decomposition (cont.)

How do we create the eigenwaves?

We have previously seen that we can regard time series as points in high dimensional space.

We can rotate the axes such that axis 1 is aligned with the direction of maximum variance, axis 2 is aligned with the direction of maximum variance orthogonal to axis 1 etc.

Since the first few eigenwaves contain most of the variance of the signal, the rest can be truncated with little loss.
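A minimal sketch of this (assuming numpy and a made-up collection of series stored as the rows of a matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.cumsum(rng.standard_normal((200, 128)), axis=1)  # 200 toy series
A = A - A.mean(axis=0)          # center, so axes align with variance

U, s, Vt = np.linalg.svd(A, full_matrices=False)
N = 8
eigenwaves = Vt[:N]             # the N data-dependent basis shapes
coeffs = A @ eigenwaves.T       # each series reduced to N numbers
A_approx = coeffs @ eigenwaves  # reconstruction from the N coefficients
```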

Page 50: Mining Time Series Data

Piecewise Linear Approximation I

[Figure: a time series X and its piecewise linear approximation X']

Basic Idea: Represent the time series as a sequence of straight lines.

Lines could be connected, in which case we are allowed N/2 lines. If lines are disconnected, we are allowed only N/3 lines.

Personal experience on dozens of datasets suggests disconnected is better. Also, only disconnected allows a lower bounding Euclidean approximation.

Each connected line segment has: length and left_height (the right_height can be inferred by looking at the next segment). Each disconnected line segment has: length, left_height and right_height.

Karl Friedrich Gauss, 1777-1855

Page 51: Mining Time Series Data

Piecewise Linear Approximation II

How do we obtain the Piecewise Linear Approximation?

Optimal Solution is O(n²N), which is too slow for data mining.

A vast body of work on faster heuristic solutions to the problem can be classified into the following classes (CRatio denotes the compression ratio):

• Top-Down: O(n²N)
• Bottom-Up: O(n/CRatio)
• Sliding Window: O(n/CRatio)
• Other (genetic algorithms, randomized algorithms, B-spline wavelets, MDL etc.)

Recent extensive empirical evaluation of all approaches suggests that Bottom-Up is the best approach overall.
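A minimal sketch of the Bottom-Up idea (not the tutorial's exact algorithm; assumes numpy, disconnected segments, and a least-squares line fit per segment):

```python
import numpy as np

def fit_cost(t, x):
    """Squared residual of the least-squares line through (t, x)."""
    a, b = np.polyfit(t, x, 1)
    return float(np.sum((x - (a * t + b)) ** 2))

def bottom_up_pla(x, num_segments):
    """Start from tiny segments and greedily merge the adjacent pair
    whose merged segment has the lowest fitting cost."""
    x = np.asarray(x, dtype=float)
    t = np.arange(len(x), dtype=float)
    segs = [(i, min(i + 2, len(x))) for i in range(0, len(x), 2)]
    while len(segs) > num_segments:
        merge_costs = [fit_cost(t[a:d], x[a:d])
                       for (a, b), (c, d) in zip(segs, segs[1:])]
        i = int(np.argmin(merge_costs))
        segs[i:i + 2] = [(segs[i][0], segs[i + 1][1])]
    return segs  # list of (start, end) index pairs

segments = bottom_up_pla(np.cumsum(np.random.randn(128)), 8)
```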

Page 52: Mining Time Series Data

Piecewise Linear Approximation III

Pros and Cons of PLA as a time series representation:

• Good ability to compress natural signals.
• Fast linear time algorithms for PLA exist.
• Able to support some interesting non-Euclidean similarity measures, including weighted measures, relevance feedback, fuzzy queries…
• Already widely accepted in some communities (e.g., biomedical).

• Not (currently) indexable by any data structure (but does allow fast sequential scanning).

Page 53: Mining Time Series Data

Symbolic Approximation

[Figure: a time series X discretized into the symbol string C U U C D C U D]

Key: C = Constant, U = Up, D = Down

Basic Idea: Convert the time series into an alphabet of discrete symbols. Use string indexing techniques to manage the data.

Potentially an interesting idea, but all the papers thus far are very ad hoc.

Pros and Cons of Symbolic Approximation as a time series representation:

• Potentially, we could take advantage of a wealth of techniques from the very mature field of string processing.

• There is no known technique to allow the support of Euclidean queries.
• It is not clear how we should discretize the time series (discretize the values, the slope, shapes? How big an alphabet? etc.)
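As an illustration of how ad hoc the choices are, here is one minimal made-up scheme (assuming numpy) that discretizes the slope between successive points:

```python
import numpy as np

def udc_string(x, eps=0.05):
    """Emit U (up), D (down) or C (roughly constant) for each step;
    the threshold eps is an arbitrary choice."""
    diffs = np.diff(np.asarray(x, dtype=float))
    return ''.join('C' if abs(d) <= eps else ('U' if d > 0 else 'D')
                   for d in diffs)

print(udc_string([1.0, 1.01, 1.5, 1.2, 1.21]))  # 'CUDC'
```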

Page 54: Mining Time Series Data

Piecewise Aggregate Approximation I

[Figure: a time series X of length n reduced to X', a sequence of N = 8 box basis functions x̄1 through x̄8]

Basic Idea: Represent the time series as a sequence of box basis functions. Note that each box is the same length. The i-th coefficient x̄_i is the mean of the i-th frame:

x̄_i = (N/n) · Σ_{j=(n/N)(i−1)+1}^{(n/N)i} x_j

Given the reduced dimensionality representation we can calculate the approximate Euclidean distance as:

DR(X, Y) = √(n/N) · √( Σ_{i=1}^{N} (x̄_i − ȳ_i)² )

Independently introduced by two sets of authors: Keogh, Chakrabarti, Pazzani & Mehrotra, KAIS (2000); Byoung-Kee Yi & Christos Faloutsos, VLDB (2000).
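A minimal sketch of both formulas (assuming numpy and that n is divisible by N; the test series are made up):

```python
import numpy as np

def paa(x, N):
    """The i-th PAA coefficient is the mean of the i-th frame."""
    return np.asarray(x, dtype=float).reshape(N, -1).mean(axis=1)

def dr(xbar, ybar, n, N):
    """Approximate Euclidean distance between the reduced series."""
    return np.sqrt(n / N) * np.sqrt(np.sum((xbar - ybar) ** 2))

n, N = 128, 8
rng = np.random.default_rng(2)
X, Y = rng.standard_normal(n), rng.standard_normal(n)
print(dr(paa(X, N), paa(Y, N), n, N))  # lower-bounds the true D(X, Y)
```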

Page 55: Mining Time Series Data

Piecewise Aggregate Approximation II

Pros and Cons of PAA as a time series representation:
• Extremely fast to calculate
• As efficient as other approaches (empirically)
• Supports queries of arbitrary lengths
• Can support any Minkowski metric
• Supports non-Euclidean measures
• Supports weighted Euclidean distance
• Simple! Intuitive!

• If visualized directly, looks aesthetically unpleasing.

Page 56: Mining Time Series Data

Adaptive Piecewise Constant Approximation I

[Figure: a time series X approximated by four variable-length constant segments <cv1,cr1>, <cv2,cr2>, <cv3,cr3>, <cv4,cr4>]

Basic Idea: Generalize PAA to allow the piecewise constant segments to have arbitrary lengths. Note that we now need 2 coefficients to represent each segment, its value and its length.

[Figure: raw data (an electrocardiogram) reconstructed three ways: Adaptive Representation (APCA), reconstruction error 2.61; Haar Wavelet, reconstruction error 3.27; DFT, reconstruction error 3.11]

The intuition is this: many signals have little detail in some places, and high detail in other places. APCA can adaptively fit itself to the data, achieving better approximation.

Page 57: Mining Time Series Data

Adaptive Piecewise Constant Approximation II

The high quality of APCA has been noted by many researchers. However, it was believed that the representation could not be indexed because some coefficients represent values, and some represent lengths.

However an indexing method was discovered! (SIGMOD 2001 best paper award)

Unfortunately, it is non-trivial to understand and implement…

Page 58: Mining Time Series Data

Adaptive Piecewise Constant Approximation III

Pros and Cons of APCA as a time series representation:

• Fast to calculate: O(n).
• More efficient than other approaches (on some datasets).
• Supports queries of arbitrary lengths.
• Supports non-Euclidean measures.
• Supports weighted Euclidean distance.
• Supports fast exact queries, and even faster approximate queries on the same data structure.

• Somewhat complex implementation.
• If visualized directly, looks aesthetically unpleasing.

Page 59: Mining Time Series Data

Conclusion

This is just an introduction, with many unavoidable omissions:
• There are dozens of papers that offer new distance measures.
• Hidden Markov models do have a sound basis, but don't scale well.
• Time series analysis remains a hot area of research, and the most recent papers have not been discussed here.