SAX: a Novel Symbolic Representation of Time Series Authors Jessica Lin Eamonn Keogh Li Wei Stefano Lonardi Presenter Arif Bin Hossain Slides incorporate materials kindly provided by Prof. Eamonn Keogh


Post on 11-Dec-2015


SAX: a Novel Symbolic Representation of Time Series

Authors: Jessica Lin, Eamonn Keogh, Li Wei, Stefano Lonardi

Presenter: Arif Bin Hossain

Slides incorporate materials kindly provided by Prof. Eamonn Keogh


Time Series

 A time series is a sequence of data points, measured typically at successive times spaced at uniform time intervals. [Wiki]

Examples: economic, sales, and stock market forecasting; EEG, ECG, and BCI analysis


Problems

Join: Given two data collections, link items occurring in each

Annotation: Obtain additional information from given data

Query by content: Given a large data collection, find the k most similar objects to an object of interest.

Clustering: Given an unlabeled dataset, arrange its items into groups by their mutual similarity


Problems (Cont.)

Classification: Given a labeled training set, classify future unlabeled examples

Anomaly Detection: Given a large collection of objects, find the one that is most different from all the rest.

Motif Finding: Given a large collection of objects, find the pair that is most similar.


Data Mining Constraints

For example, suppose you have one gig of main memory and want to do K-means clustering…

Clustering ¼ gig of data, 100 sec
Clustering ½ gig of data, 200 sec
Clustering 1 gig of data, 400 sec
Clustering 1.1 gigs of data, few hours

Bradley, Fayyad, & Reina: Scaling Clustering Algorithms to Large Databases. KDD 1998: 9-15


Generic Data Mining

• Create an approximation of the data, which will fit in main memory, yet retains the essential features of interest

• Approximately solve the problem at hand in main memory

• Make (hopefully very few) accesses to the original data on disk to confirm the solution
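The three steps above can be sketched as a filter-and-refine nearest-neighbor search. This is only an illustration under stated assumptions: the function name `filter_and_refine`, the `lower_bounds` list and the `load_original` callback are hypothetical, and the sketch assumes the in-memory approximation yields a distance that never exceeds the true Euclidean distance.

```python
import numpy as np

def filter_and_refine(query, lower_bounds, load_original):
    """Find the true nearest neighbor while reading as few originals as possible.

    lower_bounds[i] is an in-memory estimate that must not exceed the true
    Euclidean distance between the query and object i (assumption).
    load_original(i) stands in for a disk access returning the raw object i.
    """
    best_idx, best_dist, reads = -1, np.inf, 0
    # Visit candidates from the most to the least promising estimate.
    for idx in np.argsort(lower_bounds):
        if lower_bounds[idx] >= best_dist:
            break  # every remaining object is at least this far away
        candidate = load_original(idx)  # confirm on the original data
        reads += 1
        dist = float(np.linalg.norm(np.asarray(query) - np.asarray(candidate)))
        if dist < best_dist:
            best_idx, best_dist = int(idx), dist
    return best_idx, best_dist, reads
```

The early `break` is what keeps the number of accesses to the original data small: once the best confirmed distance is below every remaining estimate, nothing else needs to be read.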


Some Common Approximations


Why Symbolic Representation?

• Reduce dimension
• Numerosity reduction
• Hashing
• Suffix Trees
• Markov Models
• Stealing ideas from text processing/bioinformatics community


Symbolic Aggregate ApproXimation (SAX)

• Lower bounding of Euclidean distance
• Lower bounding of the DTW distance
• Dimensionality Reduction
• Numerosity Reduction

Example SAX word: baabccbc


SAX

Allows a time series of arbitrary length n to be reduced to a string of arbitrary length w (w<<n)

Notations

C: a time series, C = c1, …, cn
Ć: a Piecewise Aggregate Approximation of a time series, Ć = ć1, …, ćw
Ĉ: a symbolic representation of a time series, Ĉ = ĉ1, …, ĉw
w: the number of PAA segments representing C
a: alphabet size


How to obtain SAX?

Step 1: Reduce dimension by PAA

A time series C of length n can be represented in a w-dimensional space by a vector Ć = ć1, …, ćw

The ith element is calculated by

$$\acute{c}_i = \frac{w}{n} \sum_{j=\frac{n}{w}(i-1)+1}^{\frac{n}{w} i} c_j$$

Reduce dimension from 20 to 5. The 2nd element will be

$$\acute{c}_2 = \frac{5}{20} \sum_{j=5}^{8} c_j$$


How to obtain SAX?

The data is divided into w equal-sized frames.

The mean value of the data falling within each frame is calculated.

The vector of these values becomes the PAA representation.

[Figure: time series C and its PAA Ć]
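A minimal sketch of the PAA step, assuming for simplicity that w divides n exactly. The function name `paa` is illustrative and not taken from the slides.

```python
import numpy as np

def paa(series, w):
    """Piecewise Aggregate Approximation: the mean of each of w equal-sized frames."""
    series = np.asarray(series, dtype=float)
    n = len(series)
    if n % w != 0:
        raise ValueError("this sketch assumes w divides n")
    # Each row of the reshaped array is one frame of n // w points.
    return series.reshape(w, n // w).mean(axis=1)

# Reduce dimension from 20 to 5 for the toy series c_j = j, j = 1..20.
c = np.arange(1, 21)
print(paa(c, 5))  # the 2nd element is the mean of c_5..c_8, i.e. 6.5
```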


How to obtain SAX?

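Step 2: Discretization

Normalize the data to zero mean and unit standard deviation, so that its values can be treated as approximately Gaussian.

Determine breakpoints that produce a equal-sized areas under the Gaussian curve, where a is the alphabet size.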

[Figure: PAA segments mapped to the symbols a, b and c, giving the word baabccbc]

Word length: 8
Alphabet size: 3
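The discretization step can be sketched as follows. This is a minimal illustration, assuming that w divides the series length, that the series is not constant, and that a value equal to a breakpoint goes to the higher symbol. The names `breakpoints`, `znorm` and `sax_word` are illustrative.

```python
from bisect import bisect_right
from statistics import NormalDist

import numpy as np

def breakpoints(a):
    """a - 1 cut points splitting the standard Gaussian into a equal-area regions."""
    return [NormalDist().inv_cdf(i / a) for i in range(1, a)]

def znorm(series):
    """Rescale to zero mean and unit standard deviation (series must not be constant)."""
    series = np.asarray(series, dtype=float)
    return (series - series.mean()) / series.std()

def sax_word(series, w, a):
    """Normalize, reduce to w frame means, then map each mean to a letter."""
    frames = znorm(series).reshape(w, -1).mean(axis=1)
    cuts = breakpoints(a)
    # bisect_right counts the breakpoints at or below the value: 0 -> 'a', 1 -> 'b', ...
    return "".join(chr(ord("a") + bisect_right(cuts, v)) for v in frames)

print(breakpoints(3))                         # roughly [-0.43, 0.43]
print(sax_word(np.arange(8.0), w=4, a=2))     # a rising ramp: low half, then high half
```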


Distance Measure

Given two time series Q and C:

• Euclidean distance between the original series
• Distance after transforming the subsequences to PAA
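The formulas for these two distances did not survive in this transcript. As a hedged reconstruction, the standard definitions used with PAA are given here in the notation of the earlier slides (n is the series length, w the number of PAA segments):

```latex
D(Q, C) = \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}

DR(\acute{Q}, \acute{C}) = \sqrt{\frac{n}{w}} \sqrt{\sum_{i=1}^{w} (\acute{q}_i - \acute{c}_i)^2}
```

The factor \sqrt{n/w} compensates for each PAA value standing in for n/w original points, and DR never exceeds D.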


Distance Measure

Define MINDIST after transforming to symbolic representation

MINDIST lower bounds the true distance between the original time series
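A sketch of MINDIST under the usual SAX definition, which is assumed here because the slide's formula is not in the transcript. Adjacent or equal symbols contribute zero; otherwise the contribution is the gap between the breakpoints separating the two regions. The names `symbol_dist` and `mindist` are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def breakpoints(a):
    """a - 1 cut points splitting the standard Gaussian into a equal-area regions."""
    return [NormalDist().inv_cdf(i / a) for i in range(1, a)]

def symbol_dist(r, c, cuts):
    """Distance between two symbols given as 0-based indices (0 = 'a')."""
    if abs(r - c) <= 1:
        return 0.0  # equal or neighboring regions can be arbitrarily close
    # Gap between the top of the lower region and the bottom of the higher one.
    return cuts[max(r, c) - 1] - cuts[min(r, c)]

def mindist(word_q, word_c, n, a):
    """Lower bound on the Euclidean distance of two normalized series of length n."""
    if len(word_q) != len(word_c):
        raise ValueError("words must have the same length")
    cuts = breakpoints(a)
    w = len(word_q)
    total = sum(
        symbol_dist(ord(x) - ord("a"), ord(y) - ord("a"), cuts) ** 2
        for x, y in zip(word_q, word_c)
    )
    return sqrt(n / w) * sqrt(total)

print(mindist("baabccbc", "babcacca", n=128, a=3))
```

Because only breakpoints are needed, the symbol distances can be precomputed once per alphabet size and stored in a small lookup table.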


Numerosity Reduction

Subsequences are extracted by a sliding window

Consecutive subsequences are mostly repetitive:

• Sliding window finds aabbcc
• If the next sequence is also aabbcc, just store the position

This optimization depends on the data, but typically yields a reduction factor of 2 or 3

[Figure: Space shuttle telemetry with subsequence length 32]
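One way to sketch this idea is to keep a word only when it differs from the word of the previous window, together with the position where the run starts. This is an assumed reading of the slide; the function names are illustrative.

```python
def numerosity_reduce(words):
    """Keep (position, word) only where the word differs from the previous window's."""
    kept = []
    previous = None
    for position, word in enumerate(words):
        if word != previous:
            kept.append((position, word))
            previous = word
    return kept

def expand(kept, total):
    """Rebuild the full list of words, showing that nothing was lost."""
    words = []
    for i, (position, word) in enumerate(kept):
        end = kept[i + 1][0] if i + 1 < len(kept) else total
        words.extend([word] * (end - position))
    return words

windows = ["aabbcc", "aabbcc", "aabbcc", "aabbcd", "aabbcc", "aabbcc"]
reduced = numerosity_reduce(windows)
print(reduced)  # [(0, 'aabbcc'), (3, 'aabbcd'), (4, 'aabbcc')]
```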


Experimental Validation

Clustering: hierarchical, partitional

Classification: nearest neighbor, decision tree

Motif discovery


Hierarchical Clustering

The sample dataset consists of 3 time series from each of the decreasing trend, upward shift and normal classes


Partitional Clustering (k-means)

Assign each point to the one of k clusters whose center is nearest

Each iteration tries to minimize the sum of squared intra-cluster error
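The two statements above correspond to the assignment and update steps of plain k-means. The following is a generic sketch on raw numeric points, not the SAX-based experiment from the slides. The function name, the fixed iteration count and the random initialization are assumptions.

```python
import numpy as np

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means: alternate nearest-center assignment and center update."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Start from k distinct data points chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)]

    def assign(current_centers):
        dists = np.linalg.norm(points[:, None, :] - current_centers[None, :, :], axis=2)
        return dists.argmin(axis=1)

    for _ in range(iterations):
        labels = assign(centers)
        for j in range(k):
            members = points[labels == j]
            if len(members):  # leave an empty cluster's center where it is
                centers[j] = members.mean(axis=0)

    labels = assign(centers)
    # Sum of squared intra-cluster error for the final assignment.
    sse = float(((points - centers[labels]) ** 2).sum())
    return labels, centers, sse

labels, centers, sse = kmeans([[0, 0], [0, 1], [10, 10], [10, 11]], k=2)
print(labels, sse)
```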


Nearest Neighbor Classification

SAX beats Euclidean distance due to the smoothing effect of dimensionality reduction


Decision Tree Classification

Since decision trees are expensive to use with high-dimensional datasets, the Regression Tree [Geurts.2001] is a better approach for data mining on time series


Motif Discovery

Implemented the random projection algorithm of Tompa and Buhler [ICMB2001]

It hashes subsequences into buckets, using a random subset of their features as a key
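The hashing idea can be sketched on SAX words as follows. This is a simplified illustration, not the published algorithm. The function name, the parameters `mask_size` and `iterations`, and the use of a collision counter are assumptions.

```python
import random
from collections import Counter
from itertools import combinations

def random_projection_collisions(words, mask_size, iterations, seed=0):
    """Count how often each pair of equal-length words lands in the same bucket."""
    rng = random.Random(seed)
    w = len(words[0])
    collisions = Counter()
    for _ in range(iterations):
        # The key is built from a random subset of symbol positions.
        positions = sorted(rng.sample(range(w), mask_size))
        buckets = {}
        for idx, word in enumerate(words):
            key = "".join(word[p] for p in positions)
            buckets.setdefault(key, []).append(idx)
        for members in buckets.values():
            for pair in combinations(members, 2):
                collisions[pair] += 1
    return collisions

words = ["abcabc", "abcabc", "abcabb", "cbacba"]
counts = random_projection_collisions(words, mask_size=3, iterations=10)
print(counts.most_common(3))
```

Pairs that collide in many iterations are candidates for a motif and would then be checked against the original subsequences.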


New Version: iSAX

Use binary numbers for labeling the words

Different alphabet sizes (cardinalities) within a word

Comparison of words with different cardinalities
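One reason binary labels make mixed cardinalities workable can be sketched as follows. With equal-area Gaussian breakpoints and cardinalities that are powers of two, the breakpoints of a lower cardinality are a subset of those of a higher one, so a symbol can be brought down to a lower cardinality by dropping its trailing bits. This is only an illustration of that property, not the comparison procedure from iSAX itself. The function names and the convention that 0 labels the lowest region are assumptions.

```python
from bisect import bisect_right
from statistics import NormalDist

def symbol(value, bits):
    """Region index (0 = lowest) of a normalized value at cardinality 2**bits."""
    a = 2 ** bits
    cuts = [NormalDist().inv_cdf(i / a) for i in range(1, a)]
    return bisect_right(cuts, value)

def demote(sym, from_bits, to_bits):
    """Express a symbol at a lower cardinality by dropping its trailing bits."""
    return sym >> (from_bits - to_bits)

def label(sym, bits):
    """Binary label of a symbol, zero-padded to the given number of bits."""
    return format(sym, f"0{bits}b")

value = 0.3
fine = symbol(value, 3)     # cardinality 8
coarse = symbol(value, 1)   # cardinality 2
print(label(fine, 3), label(coarse, 1), demote(fine, 3, 1) == coarse)
```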


Thank you

Questions?