

NOTICE: Proprietary and Confidential

This material is proprietary to A. Teredesai and GCCIS, RIT.

A Comprehensive Look at Mining Time-Series and Sequential Patterns

Ankur Teredesai
Department of Computer Science, Rochester Institute of Technology

Definition of Time-Series

A time series is a collection of observations made sequentially in time.

[Figure: plot of a sample time series; x-axis from 0 to 500, y-axis from 23 to 29]

Sample values: 25.1750, 25.2250, 25.2500, 25.2500, 25.2750, 25.3250, 25.3500, 25.3500, 25.4000, 25.4000, 25.3250, 25.2250, 25.2000, 25.1750, ..., 24.6250, 24.6750, 24.6750, 24.6250, 24.6250, 24.6250, 24.6750, 24.7500

Sample Example for Time-Series (cont.)

People measure things...
• The president's approval rating
• Their blood pressure
• The annual rainfall in Los Angeles
• The value of their Yahoo stock
• The number of web hits per second

... and things change over time.

Thus time series occur in virtually every medical, scientific, and business domain.

What Can We Do with Time-Series?

• Clustering
• Classification
• Query by Content
• Rule Discovery (example rule annotated with s = 0.5, c = 0.3)
• Motif Discovery
• Novelty Detection

Sample Model for Information Streams Mining

• Information streams vs. time-series:
• In many emerging science and business applications, data takes the form of streams rather than static datasets.
• We define an information stream as continuously incoming, dynamic data, in contrast with static time-series data.

*MIESIS (MIning from Earth Science Information Streams)

Information Streams Segmentation

We need to do this for
• Symbolization
• Dimensionality reduction

Using a fixed-length sliding window

[Figure: a time series over time 0 to 120, cut into fixed-length windows labeled with the symbols a, b, and c]
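Fixed-length window segmentation followed by symbolization can be sketched as below. This is only an illustration: the slope feature, the meaning of the symbols (a = rising, b = flat, c = falling), the `flat` threshold, and the function names are assumptions, since the slide does not say how its a/b/c labels were assigned.

```python
def window_slope(values):
    """Slope of the least-squares line through (0, v0), (1, v1), ..."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def symbolize(series, window=4, step=None, flat=0.05):
    """Slide a fixed-length window over the series and emit one symbol per window.

    step defaults to the window length (non-overlapping windows);
    a smaller step gives overlapping windows.
    """
    step = window if step is None else step
    symbols = []
    for start in range(0, len(series) - window + 1, step):
        slope = window_slope(series[start:start + window])
        if slope > flat:
            symbols.append("a")   # rising (assumed meaning)
        elif slope < -flat:
            symbols.append("c")   # falling (assumed meaning)
        else:
            symbols.append("b")   # roughly flat (assumed meaning)
    return "".join(symbols)


# Hypothetical data: a rise, a plateau, then a fall.
example = [0, 1, 2, 3, 3, 3, 3, 3, 3, 2, 1, 0]
print(symbolize(example, window=4))
```

Twelve raw values become three symbols here, which is the dimensionality reduction the slide refers to.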

Information Streams Segmentation (cont.)

Using turning points

[Figure: information stream data from a sensor; x-axis: Time (0 to 70), y-axis: Value (-0.2 to 0.15)]
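One simple reading of segmentation by turning points is to cut the stream wherever it changes direction. The sketch below assumes that definition (a strict local maximum or minimum); the function names are hypothetical, and flat plateaus are deliberately not treated as turning points.

```python
def turning_points(series):
    """Indices where the series switches between rising and falling."""
    points = []
    for i in range(1, len(series) - 1):
        before = series[i] - series[i - 1]
        after = series[i + 1] - series[i]
        if before * after < 0:   # sign change: local peak or valley
            points.append(i)
    return points


def segment_at_turning_points(series):
    """Split the series into monotone pieces that share their end points."""
    cuts = [0] + turning_points(series) + [len(series) - 1]
    return [series[a:b + 1] for a, b in zip(cuts, cuts[1:])]


# Hypothetical sensor readings.
readings = [0, 1, 2, 1, 0, 1]
print(turning_points(readings))
print(segment_at_turning_points(readings))
```

Unlike a fixed-length window, the segments produced this way vary in length with the data.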

Clustering

Feature extraction
• For dimensionality reduction, we need to extract features from raw information streams
• DFT (Discrete Fourier Transform), DWT (Discrete Wavelet Transform), PAA (Piecewise Aggregate Approximation), etc.

Similarity measure
• Defines the similarity between two raw information streams or two feature vectors
• Euclidean distance metric, Pearson's correlation coefficient, etc.
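As a small illustration of one feature extractor and the two similarity measures named above, the sketch below implements PAA (the mean of each equal-length segment), Euclidean distance, and Pearson's correlation coefficient in plain Python. The function names and sample data are assumptions, and PAA is restricted here to series whose length is a multiple of the segment count.

```python
import math


def paa(series, segments):
    """Piecewise Aggregate Approximation: mean of each equal-length segment."""
    if len(series) % segments != 0:
        raise ValueError("series length must be a multiple of segments")
    size = len(series) // segments
    return [sum(series[i * size:(i + 1) * size]) / size for i in range(segments)]


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length vectors."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical stream of 8 values reduced to 4 features.
stream = [1, 3, 5, 7, 2, 2, 4, 4]
print(paa(stream, 4))
print(euclidean([0, 0], [3, 4]))
```

Either measure can be applied to the raw streams or to the reduced feature vectors; clustering on the reduced vectors is cheaper because they are shorter.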

Clustering (cont.)

• Hierarchical clustering
• Partitional clustering (e.g., K-means)

Symbolic Representation

Example 1
[Figure: regions labeled R1 through R9]

Example 2
[Figure: a time series over time 0 to 120, cut into windows labeled with the symbols a, b, and c]

Symbolic Representation (cont.)

Express the information stream as a sequence of symbols, e.g.:

aaabaabcbabccb

• Now we can work in a lower-dimensional space than the raw information stream data
• We can also use well-known string-processing structures and models such as an inverted index, an HMM, or a suffix tree

Possible Mining Operations

Novelty detection
• Can be used to identify potentially anomalous events
• Also referred to as the detection of “Aberrant Behavior”, “Anomalies”, “Faults”, “Surprises”, “Deviants”, “Temporal Change”, and “Outliers”
• As these terms suggest, we detect patterns or sequences in the incoming information stream that were not seen in the training dataset

Prediction
• The utility of a prediction model lies in detecting events rather than predicting numerical values. An event is a meaningful object to which we can assign some semantics, e.g., an earthquake or a flood.

Finding correlation between clusters
• We can detect spatial/temporal correlation between clusters or information streams


Mining Time-Series and Sequence Data

Time-series database

• Consists of sequences of values or events changing with time

Applications

• Financial: stock price, inflation

• Biomedical: blood pressure

• Meteorological: precipitation


Mining Time-Series and Sequence Data: Trend Analysis

Categories of time-series movements
• Long-term or trend movements (trend curve)
• Cyclic movements or cycle variations, e.g., business cycles
• Seasonal movements or seasonal variations
  – i.e., almost identical patterns that a time series appears to follow during corresponding months of successive years
• Irregular or random movements

Estimation of Trend Curve

The freehand method
• Fit the curve by looking at the graph
• Costly and barely reliable for large-scale data mining

The least-squares method
• Find the curve minimizing the sum of the squares of the deviations of points on the curve from the corresponding data points

The moving-average method
• Eliminates cyclic, seasonal, and irregular patterns
• Loss of end data
• Sensitive to outliers
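The two automatic methods above can be sketched in a few lines. The example assumes a straight line as the least-squares curve and a simple unweighted moving average; the function names and the sample series are made up for illustration.

```python
def moving_average(series, order):
    """Unweighted moving average of the given order."""
    return [sum(series[i:i + order]) / order
            for i in range(len(series) - order + 1)]


def least_squares_line(series):
    """Slope and intercept of the line minimizing the sum of squared deviations.

    Time is taken to be 0, 1, 2, ... (an assumption of this sketch).
    """
    n = len(series)
    mean_t = (n - 1) / 2
    mean_y = sum(series) / n
    num = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(series))
    den = sum((t - mean_t) ** 2 for t in range(n))
    slope = num / den
    return slope, mean_y - slope * mean_t


data = [3, 7, 2, 0, 4, 5, 9, 7, 2]          # hypothetical observations
smoothed = moving_average(data, 3)
print(smoothed)                              # 7 values from 9: the end data is lost
print(least_squares_line([1, 3, 5, 7]))      # points on the line y = 2t + 1
```

The output of `moving_average` is shorter than its input by `order - 1` values, which is the "loss of end data" noted in the list, and a single extreme value shifts every average whose window contains it.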

Discovery of Trend in Time-Series (1)

Estimation of seasonal variations
• Seasonal index
  – Set of numbers showing the relative values of a variable during the months of the year
  – E.g., if the sales during October, November, and December are 80%, 120%, and 140% of the average monthly sales for the whole year, respectively, then 80, 120, and 140 are the seasonal index numbers for these months
• Deseasonalized data
  – Data adjusted for seasonal variations
  – E.g., divide the original monthly data by the seasonal index numbers for the corresponding months
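A worked version of the seasonal index example, as a sketch. The twelve monthly sales figures are invented so that the yearly average is exactly 100 and October to December reproduce the 80 / 120 / 140 figures quoted above; the function names are assumptions.

```python
def seasonal_index(monthly):
    """Each month's value as a percentage of the average monthly value."""
    average = sum(monthly) / len(monthly)
    return [100 * value / average for value in monthly]


def deseasonalize(monthly, index):
    """Divide each month's value by its seasonal index (expressed in percent)."""
    return [100 * value / idx for value, idx in zip(monthly, index)]


# Hypothetical sales for January..December; the average is 100.
sales = [100, 100, 100, 100, 100, 100, 100, 80, 80, 80, 120, 140]
index = seasonal_index(sales)
print(index[9:])                      # October, November, December
print(deseasonalize(sales, index))
```

Because the index in this toy case is derived from the same single year it is applied to, the deseasonalized series is flat; in practice the index would be estimated from several years of data.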

Similarity Search in Time-Series Analysis

A normal database query finds exact matches

Similarity search finds data sequences that differ only slightly from the given query sequence

Two categories of similarity queries
• Whole matching: find sequences that are similar, as a whole, to the query sequence
• Subsequence matching: find sequences that contain subsequences similar to the query sequence

Typical applications
• Financial market
• Market basket data analysis
• Scientific databases
• Medical diagnosis

Data Transformation

Many techniques for signal analysis require the data to be in the frequency domain

Usually data-independent transformations are used
• The transformation matrix is determined a priori
  – E.g., discrete Fourier transform (DFT), discrete wavelet transform (DWT)
• The Euclidean distance between two signals in the time domain is the same as their Euclidean distance in the frequency domain
• DFT does a good job of concentrating energy in the first few coefficients
• If we keep only the first few DFT coefficients, we can compute a lower bound of the actual distance
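The distance-preservation and lower-bound properties can be checked numerically. The sketch below uses a naive O(n²) DFT with 1/√n scaling, a normalization chosen here so that the transform preserves Euclidean distance; the function names and the two sample signals are assumptions.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform with orthonormal (1/sqrt(n)) scaling."""
    n = len(x)
    scale = 1 / math.sqrt(n)
    return [scale * sum(v * cmath.exp(-2j * cmath.pi * f * t / n)
                        for t, v in enumerate(x))
            for f in range(n)]


def distance(a, b):
    """Euclidean distance; works for real and complex vectors."""
    return math.sqrt(sum(abs(p - q) ** 2 for p, q in zip(a, b)))


# Two hypothetical signals of equal length.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 2, 4, 4, 6, 5, 8, 9]

time_domain = distance(x, y)
freq_domain = distance(dft(x), dft(y))
truncated = distance(dft(x)[:2], dft(y)[:2])   # keep only the first 2 coefficients

print(time_domain, freq_domain, truncated)
```

Dropping coefficients removes non-negative terms from the sum of squares, so the truncated distance can never exceed the true one. This is what makes it safe to use as a filter: it may admit false matches but never dismisses a true one.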

Finding Surprising Patterns in a Time Series Database in Linear Time and Space

Paper by: Eamonn Keogh, Stefano Lonardi, Bill Chiu
Presented at ACM SIGKDD 2002

Main Purpose

Novelty detection
• The authors stress that this problem should not be confused with the relatively simple problem of outlier detection.
• They focus on finding surprising patterns, not on finding individually surprising data points.
• [Figure] The blue time series at the top is a normal, healthy human heartbeat with an artificial “flatline” added. The red sequence at the bottom indicates how surprising the local subsections of the time series are found to be.

Basic Ideas

A pattern is surprising if its frequency of occurrence differs greatly from what we expected.

Their notion of the surprisingness of a pattern is not tied exclusively to its shape. Instead, it depends on the difference between the shape's expected frequency and its observed frequency.

Example: consider the head-and-shoulders pattern [Figure: head-and-shoulders pattern]
• In a stock market time series, this pattern occurs an average of three times a year.
• If it occurred ten times this year: surprising.
• If its frequency of occurrence is less than expected: also a surprising pattern.

Approach

Formal definition of a surprising pattern
• A time series pattern P, extracted from database X, is surprising relative to a database R if the probability of its occurrence is greatly different from that expected by chance, assuming that R and X are created by the same underlying process.

Example
If x = principalskinner
• Σ is {a, c, e, i, k, l, n, p, r, s}
• |x| is 16
• skin is a substring of x
• prin is a prefix of x
• ner is a suffix of x
• If y = in, then f_x(y) = 2
• If y = pal, then f_x(y) = 1

How about “y = eik”?
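The string notation in this example can be checked directly. In the sketch below, `occurrences` is a hypothetical helper that computes f_x(y) by counting (possibly overlapping) occurrences of y in x.

```python
def occurrences(x, y):
    """f_x(y): number of positions in x at which y occurs (overlaps allowed)."""
    return sum(1 for i in range(len(x) - len(y) + 1) if x[i:i + len(y)] == y)


x = "principalskinner"

print(sorted(set(x)))            # the alphabet of x
print(len(x))                    # |x|
print(occurrences(x, "in"))
print(occurrences(x, "pal"))
print(occurrences(x, "eik"))
```

For y = eik the count is 0, because that string never occurs in x. A pure frequency count therefore says nothing useful about patterns that have not been seen, which is why a probability model is needed for them.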

Approach (cont.)

Steps (TARZAN algorithm)
• Discretize the time series into symbolic strings
  – Fixed-size sliding window
  – Slope of the best-fitting line
• Calculate the probability of any pattern, including ones never seen before, using Markov models
• To maintain the linear time and space property, they use the suffix tree data structure
• Compute scores by comparing the trees of the reference data and the incoming information stream

aaabaabcbabccb
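The idea of scoring a pattern by the gap between its observed and expected frequency can be illustrated with a deliberately simplified sketch. It is not the TARZAN algorithm itself: it assumes a first-order Markov chain estimated from the reference string with plain counts, uses no suffix tree and no smoothing, and all function names are hypothetical.

```python
from collections import Counter


def markov_probability(reference, pattern):
    """Probability of a non-empty pattern under a first-order Markov chain
    estimated from the reference string (simplifying assumption)."""
    singles = Counter(reference)
    pairs = Counter(zip(reference, reference[1:]))
    prob = singles[pattern[0]] / len(reference)
    for a, b in zip(pattern, pattern[1:]):
        # How often symbol a is followed by anything at all in the reference.
        starts = singles[a] - (1 if reference[-1] == a else 0)
        if starts == 0:
            return 0.0
        prob *= pairs[(a, b)] / starts
    return prob


def surprise(reference, stream, pattern):
    """Observed count in the stream minus the count expected from the reference."""
    k = len(pattern)
    positions = len(stream) - k + 1
    observed = sum(1 for i in range(positions) if stream[i:i + k] == pattern)
    expected = positions * markov_probability(reference, pattern)
    return observed - expected


reference = "abababab"                       # hypothetical "normal" behaviour
print(surprise(reference, "abababab", "ab"))
print(surprise(reference, "aaaa", "aa"))
```

A score far from zero in either direction marks the pattern as surprising, which matches the earlier point that occurring less often than expected is as notable as occurring more often.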

Experimental Evaluation

Two features
• Sensitivity (high true positive rate)
  – The algorithm can find truly surprising patterns in a time series
  – This corresponds to recall
• Selectivity (low false positive rate)
  – The algorithm will not find spurious “surprising” patterns in a time series
  – This is related to precision

The goal is to maintain high precision and recall
• They achieved high sensitivity
• But what about selectivity?

Experimental Evaluation (cont.)

Shock ECG

[Figure: three stacked plots over time 0 to 1600: training data, test data (subset), and Tarzan's level of surprise]

Experimental Evaluation (cont.)

Power demand
• They consider a dataset that contains the power demand of a Dutch research facility for the entire year of 1997. The data is sampled as 15-minute averages and thus contains 35,040 points (365 days × 96 samples per day).
• [Figure: the first 3 weeks of the power demand dataset; x-axis 0 to 2000, y-axis 500 to 2500] Note the repeating pattern of a strong peak for each of the five weekdays, followed by relatively quiet weekends.

Experimental Evaluation (Power Demand cont.)

They used Monday, January 6th to Sunday, March 23rd as reference data. This time period contains no national holidays. They tested on the remainder of the year.

They showed the 3 most surprising subsequences found by each algorithm. For each of the 3 approaches, they showed the entire week (beginning Monday) in which the 3 largest values of surprise fell.

Both TSA-tree and IMM returned sequences that appear to be normal workweeks.

Tarzan returned 3 sequences that correspond to the weeks that contain national holidays.

[Figure: the most surprising weeks found by Tarzan, TSA-tree, and IMM]

Experimental Evaluation (cont.)

• The previous experiments demonstrate the ability of Tarzan to find surprising patterns (sensitivity)
• However, they also need to consider Tarzan's selectivity.
  – To reduce false alarms, they attempted to scale to massive datasets.
• If Tarzan is trained on a “short random walk” dataset,
  – The chance that patterns similar to those in the test data exist in the short training database is very small
    > Many false alarms
  – Solution for this problem
    > As the size of the training data increases, the surprisingness of the test data should decrease
    > The more training on “huge random walk data”, the fewer surprising patterns are detected

Possible Future Research Opportunity

Mentioned in the paper
• Incorporating user feedback and domain-based constraints
• Applying different feature extraction techniques

Information streams + ontology
• Finding a method to combine information streams mining with ontology
• Intuitively, if we can extract general/abnormal patterns in information streams and generate clusters, we can give semantics to the patterns or clusters.
• For example, we can define a relationship between news stream data about “War at Iraq” and the stock price change of an oil company using a “News-Stock Ontology Model”: we can detect a rapid increase in the amount of news regarding “War” and “Iraq” at time t, and then see a rapid increase/decrease of the oil company's stock price at time t+α.

Suffix Tree Data Structure

Multidimensional Indexing

Multidimensional index
• Constructed for efficient access using the first few Fourier coefficients

Use the index to retrieve the sequences that are at most a certain small distance away from the query sequence

Perform postprocessing by computing the actual distance between sequences in the time domain, and discard any false matches
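The retrieve-then-postprocess scheme above is often called filter-and-refine. The sketch below shows only the logic: a linear scan over precomputable feature vectors stands in for the multidimensional index, the DFT uses 1/√n scaling so that feature distance never exceeds the true distance, and the function names, the tiny database, and the threshold are all assumptions.

```python
import cmath
import math


def dft_features(x, k):
    """First k coefficients of an orthonormally scaled DFT of x."""
    n = len(x)
    scale = 1 / math.sqrt(n)
    return [scale * sum(v * cmath.exp(-2j * cmath.pi * f * t / n)
                        for t, v in enumerate(x))
            for f in range(k)]


def distance(a, b):
    """Euclidean distance; works for real and complex vectors."""
    return math.sqrt(sum(abs(p - q) ** 2 for p, q in zip(a, b)))


def range_query(database, query, eps, k=2):
    """Return (answers, candidates) for all sequences within eps of the query."""
    query_features = dft_features(query, k)
    # Filter step: cheap test in feature space (stand-in for the index lookup).
    candidates = [s for s in database
                  if distance(dft_features(s, k), query_features) <= eps]
    # Refine step: actual time-domain distance discards false matches.
    answers = [s for s in candidates if distance(s, query) <= eps]
    return answers, candidates


database = [[1, 2, 3, 4], [1, 2, 3, 5], [9, 9, 9, 9], [4, 3, 2, 1]]
answers, candidates = range_query(database, [1, 2, 3, 4], eps=1.5)
print(answers)
print(len(candidates), "candidate(s) reached the refine step")
```

Since the feature distance is a lower bound on the true distance, the filter can let false matches through but cannot drop a true answer, so the refine step only ever has to remove candidates.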

B-Trees

B-Trees

Generalizes multilevel index
• Number of levels varies with the size of the data file, but is quite often 3
• Height balanced
  – Equal-length access paths to different records
• Adapts well to insertions and deletions

DBMSs typically use a variant called a B+ tree
• All nodes have the same format: n keys and n+1 pointers, of which at least half are in use

Useful for primary and secondary indexes, primary keys, and non-keys

B+Tree Example (n = 3)

[Figure: a three-level B+tree]
• Root: [100]
• Non-leaf nodes: [30] and [120, 150, 180]
• Leaves: [3, 5, 11], [30, 35], [100, 101, 110], [120, 130], [150, 156, 179], [180, 200]

Sample Non-Leaf Node

Keys: 120, 150, 180

• Pointer 1: to keys k < 120
• Pointer 2: to keys 120 ≤ k < 150
• Pointer 3: to keys 150 ≤ k < 180
• Pointer 4: to keys k ≥ 180
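Choosing which pointer to follow in such a node is a binary search over its keys. The sketch below uses Python's `bisect_right`; the function name `child_index` and the 0-based pointer numbering are assumptions of the example.

```python
from bisect import bisect_right


def child_index(keys, k):
    """0-based index of the child pointer to follow for search key k.

    Pointer i covers keys[i-1] <= k < keys[i], with open ends on both sides.
    """
    return bisect_right(keys, k)


node_keys = [120, 150, 180]
for k in (119, 120, 149, 150, 180, 500):
    print(k, "-> pointer", child_index(node_keys, k) + 1)
```

A key equal to a separator goes to the right of it, which matches the "≤" on the lower side of each range.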

Sample Leaf Node

Keys: 120, 130, (unused)

• Reached by a pointer from a non-leaf node
• Pointer 1: to the record with key 120
• Pointer 2: to the record with key 130
• Pointer 3: unused
• Last pointer: to the next leaf in sequence

Nodes Must Not Be Too Empty

Number of pointers in use
• At internal nodes, at least ⌈(n+1)/2⌉
  – To child nodes
• At leaves, at least ⌊(n+1)/2⌋
  – To data records/blocks

Node Bounds (n = 3)

            Full node        Minimum node
Non-leaf    120, 150, 180    30
Leaf        3, 5, 11         30, 35

B+tree Rules

All leaves at the same lowest level
• Balanced tree

Pointers in leaves point to records
• Except for the “sequence pointer”

Number of pointers/keys for a B+tree

                      Max ptrs   Max keys   Min ptrs→data   Min keys
Non-leaf (non-root)   n+1        n          ⌈(n+1)/2⌉       ⌈(n+1)/2⌉ − 1
Leaf (non-root)       n+1        n          ⌊(n+1)/2⌋       ⌊(n+1)/2⌋
Root                  n+1        n          2♠              1

♠ Can be 1 if only one record in the file
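The occupancy bounds for non-root nodes can be tabulated for any n. The sketch below transcribes only the ceiling and floor formulas for non-leaf and leaf nodes; the function name and the dictionary layout are assumptions.

```python
import math


def node_bounds(n):
    """Pointer/key bounds for non-root B+tree nodes holding up to n keys."""
    half_up = math.ceil((n + 1) / 2)     # ceiling of (n+1)/2
    half_down = (n + 1) // 2             # floor of (n+1)/2
    return {
        "non_leaf": {"max_ptrs": n + 1, "max_keys": n,
                     "min_ptrs": half_up, "min_keys": half_up - 1},
        "leaf": {"max_ptrs": n + 1, "max_keys": n,
                 "min_ptrs": half_down, "min_keys": half_down},
    }


for n in (3, 4):
    print(n, node_bounds(n))
```

For n = 3 this gives a minimum of one key in a non-leaf node and two keys in a leaf; for n = 4 both minimums are two keys.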

B+tree Insertions

Search for the key being inserted

Four cases
• Leaf has space
  – Just insert (key, pointer-to-record)
• Leaf overflow
• Non-leaf overflow
• New root

Leaf Has Space

Insert key 32 (n = 3)

[Figure: root key 100 above a non-leaf node with key 30; leaves [3, 5, 11] and [30, 31]. Key 32 goes into the free slot of the second leaf, giving [30, 31, 32].]

Leaf Overflow

Insert key 7 (n = 3)

[Figure: root key 100 above a non-leaf node with key 30; leaves [3, 5, 11] and [30, 31]. The full leaf [3, 5, 11] splits into [3, 5] and [7, 11], and key 7 is added to the parent.]
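The first two insertion cases (leaf has space, leaf overflow) can be sketched on a bare list of keys. This is a simplification: record pointers, the sequence pointer, and propagation of the split into the parent are left out, and the function name and return convention are assumptions.

```python
from bisect import insort


def insert_into_leaf(keys, key, n):
    """Insert key into a sorted leaf that holds at most n keys.

    Returns (left, right, separator). right and separator are None when the
    leaf had space; otherwise the leaf is split and separator is the key the
    parent must add to point at the new right leaf.
    """
    keys = list(keys)
    insort(keys, key)
    if len(keys) <= n:                      # case 1: leaf has space
        return keys, None, None
    split = (len(keys) + 1) // 2            # left keeps ceil((n+1)/2) keys
    left, right = keys[:split], keys[split:]
    return left, right, right[0]            # case 2: leaf overflow


print(insert_into_leaf([30, 31], 32, n=3))
print(insert_into_leaf([3, 5, 11], 7, n=3))
```

After a split, the right leaf holds ⌊(n+1)/2⌋ keys, so both halves satisfy the minimum occupancy rule for leaves.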

Non-Leaf Overflow

Insert key 160 (n = 3)

[Figure: root key 100 above the non-leaf node [120, 150, 180]; leaves [150, 156, 179] and [180, 200]. Inserting 160 splits the leaf [150, 156, 179] into [150, 156] and [160, 179]. The parent [120, 150, 180] has no room for the new key, so it splits as well, and key 160 moves up next to 100.]

New Root

Insert 45 (n = 3)

[Figure: root [10, 20, 30] over leaves [1, 2, 3], [10, 12], [20, 25], [30, 32, 40]. Inserting 45 splits the last leaf into [30, 32] and [40, 45]; the old root then overflows and splits, and a new root with key 30 is created.]

Tree grows at root and maintains balance

B+tree Deletions

Search for the key being deleted
• If found, delete

Three broad cases
• Leaf does not underflow
• Borrow keys from an adjacent sibling if that does not also cause underflow
• Coalesce with a sibling node
  – Repeat if needed

It is sometimes acceptable to let a B-tree leaf become sub-minimum (no mergers), in violation of the B-tree definition

Leaf Does Not Underflow

Delete key 35 (n = 4)

Min number of keys in a leaf = ⌊5/2⌋ = 2

[Figure: non-leaf node [10, 40, 100]; leaves [10, 20, 30, 35] and [40, 50]. Removing 35 leaves [10, 20, 30], which still meets the minimum.]

Borrow Keys

Delete key 50 (n = 4)

Min number of keys in a leaf = ⌊5/2⌋ = 2

[Figure: non-leaf node [10, 40, 100]; leaves [10, 20, 30, 35] and [40, 50]. Deleting 50 leaves [40] below the minimum, so key 35 is borrowed from the left sibling, giving [10, 20, 30] and [35, 40]; the separator 40 in the parent becomes 35.]
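The leaf-level deletion cases can be sketched for a leaf and its left sibling. The sketch is a simplification under stated assumptions: only the left sibling is considered, record pointers are omitted, the parent is represented just by the separator value it should hold afterwards, and the function name is hypothetical.

```python
def delete_from_leaf(left, leaf, key, n):
    """Delete key from leaf, borrowing from or merging with the left sibling.

    Returns (left, leaf, separator). separator is the smallest key of the
    right-hand leaf, or None if the two leaves were coalesced into left.
    """
    min_keys = (n + 1) // 2                      # floor((n+1)/2)
    left = list(left)
    leaf = [k for k in leaf if k != key]
    if len(leaf) >= min_keys:                    # case 1: no underflow
        return left, leaf, leaf[0]
    if len(left) > min_keys:                     # case 2: sibling can spare a key
        leaf.insert(0, left.pop())
        return left, leaf, leaf[0]
    return left + leaf, [], None                 # case 3: coalesce with sibling


print(delete_from_leaf([10, 20, 30, 35], [40, 50], 50, n=4))   # borrow 35
print(delete_from_leaf([20, 30], [40, 50], 50, n=4))           # coalesce
```

When the leaves are coalesced the parent loses a key and a pointer, so the same check has to be repeated one level up.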

Coalesce with Sibling

Delete key 50 (n = 4)

[Figure: non-leaf node [20, 40, 100]; leaves [20, 30] and [40, 50]. Deleting 50 leaves [40] below the minimum and the sibling [20, 30] has no key to spare, so the two leaves merge into [20, 30, 40] and key 40 is removed from the parent.]

Coalesce Non-Leaf

Delete 37 (n = 4)

Min number of keys in a non-leaf = ⌈(n+1)/2⌉ − 1 = 3 − 1 = 2

[Figure: root [25] over non-leaf nodes [10, 20] and [30, 40]; leaves [1, 3], [10, 14], [20, 22], [25, 26], [30, 37], [40, 45]. Deleting 37 makes the leaf [30] underflow, so it merges with its sibling into [30, 40, 45]. The parent is left with the single key 30 and underflows, so it coalesces with its sibling, pulling key 25 down to form the new root [10, 20, 25, 30].]

Tree shrinks at root

B+tree Deletions in Practice

Coalescing is often not implemented
• Too hard and usually not worth it!
• Subsequent insertions may return the node to the required minimum size
• Compromise
  – Try redistributing keys with a sibling
  – If not possible, leave it there
• If all accesses to records are through the B-tree
  – Place a "tombstone" for the deleted record at the leaf

Traditional B-Trees

B-tree is similar to B+tree
• Each search key appears only once
  – No redundant storage of search keys
• Additional pointer field for each search key in a non-leaf node
  – Points to the record directly

B+tree non-leaf node:  P1 K1 P2 … Pn-1 Kn-1 Pn

versus

B-tree non-leaf node:  P1 R1 K1 P2 R2 K2 … Pn-1 Rn-1 Kn-1 Pn

B-Tree Advantages and Disadvantages

Advantages
• Fewer nodes than the corresponding B+tree
• Possible to find a key before hitting a leaf node

Disadvantages
• Only a small fraction of all keys are found early
• Non-leaf nodes are larger, so fan-out is reduced
  – B-tree often deeper than the corresponding B+tree
• More complex than B+trees
  – Insertion/deletion and overall implementation
• B+trees usually better than B-trees

B+ Trees in Practice

Typical order: 100
• Typical fill factor around 67%
• Average fanout around 133

Typical capacities:
• Height 4: 133^4 = 312,900,721 records
• Height 3: 133^3 = 2,352,637 records

Can often hold top levels in buffer pool
• Level 1 = 1 page = 8 KBytes
• Level 2 = 133 pages ≈ 1 MByte
• Level 3 = 17,689 pages ≈ 133 MBytes
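The capacity figures follow from powers of the fanout and can be recomputed directly. The sketch assumes the 8 KByte page size and the fanout of 133 quoted above; the function names are hypothetical.

```python
def capacity(fanout, height):
    """Number of records reachable through a tree of the given height."""
    return fanout ** height


def pages_at_level(fanout, level):
    """Number of pages at a level, where level 1 is the root."""
    return fanout ** (level - 1)


FANOUT = 133
PAGE_KBYTES = 8

print(capacity(FANOUT, 3))                 # height 3
print(capacity(FANOUT, 4))                 # height 4
for level in (1, 2, 3):
    pages = pages_at_level(FANOUT, level)
    print(level, pages, "pages,", pages * PAGE_KBYTES, "KBytes")
```

Level 3 works out to 17,689 × 8 = 141,512 KBytes, which is about 138 MBytes at 1,024 KBytes per MByte. The rounder figure of 133 MBytes results from treating each group of 133 pages as 1 MByte.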

Tree-Structured Indexes

Ideal for range searches and equality searches

ISAM: static structure
• Only leaf pages modified
• Overflow pages degrade performance

B+tree: dynamic structure
• Inserts/deletes leave the tree height-balanced, and it offers graceful growth and shrinking
  – High fanout (F) ⇒ depth rarely > 3 or 4
  – 67% occupancy on average
• Preferable to ISAM, modulo locking
• Widely used DBMS index structure and one of the most optimized DBMS components

Multidimensional Data

Geographic & multidimensional data applications
• Sale (store, day, item, color, size, etc.)
  – Each sale is a point in 5-dimensional space
• Customer: (age, salary, zip, married, …)

Typical queries
• Range queries
  – Find employees in the Toy department who make at least 25K Geoffrey dollars
• Nearest neighbor
  – I am here: where's the nearest MacGregors?
• Is this expressible in SQL?

Big Impediment

For these queries, there is no clean way to eliminate lots of records that don't meet the WHERE condition

Approaches
• Index on one attribute
  – Get data for 1 attribute and remove others
• Index on attributes independently
  – Intersect pointers in main memory to save disk I/O
  – Does this help with nearest neighbor?
• Multiple key index
  – Index on one attribute provides pointer to an index on the other

2-Level Indexing

[Figure: an index on the first attribute (I1) whose entries point to indexes on the second attribute (I2, I3)]

Example

[Figure: a Dept index whose entries point to Salary indexes]
• Dept index: Art, Sales, Toy
• Salary indexes: (10k, 15k, 17k, 21k) and (12k, 15k, 15k, 19k)
• Sample employee: Name = Joe, Dept = Sales, Salary = 15k

Page 59: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Some Queries

Question

• For what kinds of conditions about dept and salary will a multiple-key index (dept first) significantly reduce number of disk I/O's?

How about finding records where …

• Dept = “Sales” and Salary = 20k

• Dept = “Sales” and Salary > 20k

• Dept = “Sales”

• Salary = 20k
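
A minimal sketch of the question above, assuming a multiple-key index built as a dictionary on dept whose entries are salary-sorted lists. The employee records and all function names are hypothetical, not from the slides. The first three query shapes need a single probe of the outer index; the salary-only query has to visit every inner index.

```python
from bisect import bisect_left, bisect_right

# Hypothetical records: (name, dept, salary in thousands).
EMPLOYEES = [
    ("Joe", "Sales", 15), ("Ann", "Sales", 21), ("Raj", "Sales", 20),
    ("Mia", "Toy", 12), ("Lee", "Toy", 20), ("Kim", "Art", 10),
]

def build_index(records):
    """Outer index on dept; each entry is an inner list sorted by salary."""
    index = {}
    for name, dept, salary in records:
        index.setdefault(dept, []).append((salary, name))
    for inner in index.values():
        inner.sort()
    return index

def dept_salary_eq(index, dept, salary):
    """Dept = d and Salary = s: one outer probe, one inner probe."""
    inner = index.get(dept, [])
    salaries = [s for s, _ in inner]
    lo, hi = bisect_left(salaries, salary), bisect_right(salaries, salary)
    return [name for _, name in inner[lo:hi]]

def dept_salary_gt(index, dept, salary):
    """Dept = d and Salary > s: one outer probe, then a range scan."""
    inner = index.get(dept, [])
    salaries = [s for s, _ in inner]
    return [name for _, name in inner[bisect_right(salaries, salary):]]

def dept_only(index, dept):
    """Dept = d: one outer probe, then read the whole inner index."""
    return [name for _, name in index.get(dept, [])]

def salary_only(index, salary):
    """Salary = s alone: the outer index does not help; probe every inner index."""
    hits = []
    for dept in index:
        hits += dept_salary_eq(index, dept, salary)
    return sorted(hits)

INDEX = build_index(EMPLOYEES)
```

With dept first, the index is selective only when the dept condition is present; a condition on salary alone degenerates to one lookup per department.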


Page 60: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Interesting Application

Geographic Data

[Figure: points plotted on x and y axes.]

Data

<X1,Y1, Attributes>
<X2,Y2, Attributes>
...

Queries

• What city is at <Xi,Yi>?

• What is within 5 miles from <Xi,Yi>?

• Which is closest point to <Xi,Yi>?


Page 61: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

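Example

[Figure: points a through o plotted in the plane (axis ticks at 10, 20, 30, 40), next to an index tree whose internal nodes split on the values 5, 10, 15, 20, 25, 35 and whose leaves hold the points h, i, a, b, c, d, e, f, g and n, o, m, l, j, k.]

• Search points near f
• Search points near b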

Page 62: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Queries

• Find points with Yi > 20
• Find points with Xi < 5
• Find points “close” to i = <12,38>
• Find points “close” to b = <7,24>


Page 63: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Other Structures

Other geographic index structures

• Quad Trees
• R Trees

More Multikey Indexes

• Grid
• Partitioned hash


Page 64: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

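Grid Index

[Figure: a two-dimensional grid with Key 1 values V1, V2, …, Vn along one side and Key 2 values X1, X2, …, Xn along the other; each cell points to the matching records, e.g. to records with key1=V3, key2=X2.]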


Page 65: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Claim

Can quickly find records with

• key 1 = Vi and Key 2 = Xj

• key 1 = Vi

• key 2 = Xj

And also ranges …

• E.g., key 1 ≥ Vi and key 2 < Xj


Page 66: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Storing Grid Indexes

Catch with Grid Indexes!

• How is the grid index stored on disk?

Problem

• Need regularity to compute position of <Vi,Xj> entry

Like an array...

V1: X1 X2 X3 X4 | V2: X1 X2 X3 X4 | V3: X1 X2 X3 X4


Page 67: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Solution: Use Indirection

[Figure: a grid with columns X1, X2, X3 and rows V1, V2, V3, V4; each cell holds a pointer to a bucket of records.]

Grid only contains pointers to buckets

Page 68: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Indexing Grid on Value Ranges

[Figure: a grid with Salary ranges on one axis and departments on the other.]

Salary ranges: 0-20K → 1, 20K-50K → 2, 50K-∞ → 3

Linear scale: Toy → 1, Sales → 2, Personnel → 3

Grid can be regular without wasting space

We do pay the price of indirection
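
The grid with value ranges and bucket indirection can be sketched as follows. This is an illustration under stated assumptions: the class and method names are hypothetical, salaries are in thousands, and a salary that falls exactly on a range boundary is assigned to the upper range (the slide does not say which side it belongs to).

```python
from bisect import bisect_right

# Assumed scales, following the slide: salary range boundaries in thousands
# and a linear scale over departments (0-based here).
SALARY_BOUNDS = [20, 50]
DEPT_SCALE = {"Toy": 0, "Sales": 1, "Personnel": 2}

class GridIndex:
    def __init__(self):
        rows, cols = len(SALARY_BOUNDS) + 1, len(DEPT_SCALE)
        # Cells hold only a bucket number, so the grid stays regular.
        self.grid = [[None] * cols for _ in range(rows)]
        self.buckets = []  # the records themselves live in buckets

    def _row(self, salary_k):
        return bisect_right(SALARY_BOUNDS, salary_k)

    def insert(self, name, salary_k, dept):
        r, c = self._row(salary_k), DEPT_SCALE[dept]
        if self.grid[r][c] is None:
            self.grid[r][c] = len(self.buckets)
            self.buckets.append([])
        self.buckets[self.grid[r][c]].append((name, salary_k))

    def cell(self, salary_k, dept):
        """Names stored in the cell that covers (salary_k, dept)."""
        b = self.grid[self._row(salary_k)][DEPT_SCALE[dept]]
        return [] if b is None else [n for n, _ in self.buckets[b]]

    def salary_at_least(self, low_k, dept):
        """Range query: visit only the rows whose range can reach low_k."""
        c, hits = DEPT_SCALE[dept], []
        for r in range(self._row(low_k), len(self.grid)):
            b = self.grid[r][c]
            if b is not None:
                hits += [n for n, s in self.buckets[b] if s >= low_k]
        return hits

g = GridIndex()
for rec in [("Joe", 15, "Sales"), ("Ann", 35, "Toy"), ("Cy", 45, "Toy"), ("Bob", 60, "Toy")]:
    g.insert(*rec)
```

Each lookup costs one extra hop from the cell to its bucket, which is the price of indirection the slide mentions.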

Page 69: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Partitioned Hashing

Hash function

• Combines several attributes
• Great when attributes have values specified

Partitioned hash function devotes some bits to each attribute independently

[Figure: a hash value split into two bit fields, 010110 computed by h1 from Key1 and 1110010 computed by h2 from Key2.]


Page 70: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Example (1)

h1(toy) = 0, h1(sales) = 1, h1(art) = 1, ..
h2(10k) = 01, h2(20k) = 11, h2(30k) = 01, h2(40k) = 00, ..

Insert

<Fred, toy, 10K>
<Joe, sales, 10K>
<Sally, art, 30K>

[Figure: buckets 000 through 111; <Fred> lands in one bucket, <Joe> and <Sally> share another.]

Page 71: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Example (2)

h1(toy) = 0, h1(sales) = 1, h1(art) = 1, ..
h2(10k) = 01, h2(20k) = 11, h2(30k) = 01, h2(40k) = 00, ..

[Figure: buckets 000 through 111 holding <Fred>, <Joe>, <Jan>, <Mary>, <Sally>, <Tom>, <Bill>, <Andy>.]

Find Emp. with Dept. = Sales and Sal = 40k


Page 72: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Example (3)

h1(toy) = 0, h1(sales) = 1, h1(art) = 1, ..
h2(10k) = 01, h2(20k) = 11, h2(30k) = 01, h2(40k) = 00, ..

[Figure: buckets 000 through 111 holding <Fred>, <Joe>, <Jan>, <Mary>, <Sally>, <Tom>, <Bill>, <Andy>; “look here” marks the buckets to inspect.]

Find Emp. with Sal = 30k


Page 73: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Example (4)

h1(toy) = 0, h1(sales) = 1, h1(art) = 1, ..
h2(10k) = 01, h2(20k) = 11, h2(30k) = 01, h2(40k) = 00, ..

[Figure: buckets 000 through 111 holding <Fred>, <Joe>, <Jan>, <Mary>, <Sally>, <Tom>, <Bill>, <Andy>; “look here” marks the buckets to inspect.]

Find Emp. with Dept. = Sales
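
The partitioned-hash examples can be sketched as below, using the h1 and h2 values given on the slides. It assumes the bucket number is the h1 bit followed by the two h2 bits, which is how the earlier figure lays out Key1 and Key2; the function names are hypothetical.

```python
# Hash values taken from the slides: 1 bit for Dept, 2 bits for Salary.
H1 = {"toy": "0", "sales": "1", "art": "1"}
H2 = {"10k": "01", "20k": "11", "30k": "01", "40k": "00"}

def bucket_of(dept, salary):
    """Bucket number = h1 bits followed by h2 bits (assumed order)."""
    return H1[dept] + H2[salary]

def insert(table, name, dept, salary):
    table.setdefault(bucket_of(dept, salary), []).append(name)

def candidate_buckets(dept=None, salary=None):
    """Buckets to inspect; an unspecified attribute leaves its bits free."""
    first = [H1[dept]] if dept is not None else ["0", "1"]
    second = [H2[salary]] if salary is not None else ["00", "01", "10", "11"]
    return [a + b for a in first for b in second]

table = {}
insert(table, "Fred", "toy", "10k")
insert(table, "Joe", "sales", "10k")
insert(table, "Sally", "art", "30k")
```

A fully specified query touches one bucket, a salary-only query touches two of the eight, and a dept-only query touches four, so every partial match narrows the search to some degree.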


Page 74: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of


R Trees

A Dynamic Index Structure for Spatial Representation

Page 75: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Why R Trees?

• Multi-dimensional spaces not well represented by point locations
• Need to be able to perform range searches
• One dimensional indexes not suitable for multi-dimensional spaces
• Ex: Find all the counties within 20 mi radius of Georgia Tech


Page 76: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Main Concepts

• Height balanced tree similar to a B-tree
• Index records in leaf nodes point to data objects
• Index is dynamic and no periodic reorganization is required
• Index records (at leaf nodes) are of the form:

(I, tuple-identifier)

where I is an n-dimensional bounding rectangle, i.e. I = (I0, I1, …, In-1), where n = number of dimensions and each Ii = [a, b] is a closed bounded interval giving the extent of the object along dimension i


Page 77: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

More Concepts…

Non-leaf nodes are of the form:

(I, child-pointer)

where child-pointer is the address of a lower node and I is the smallest rectangle covering all the rectangles in the lower node’s entries

M = maximum number of entries in one node


Page 78: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

More Concepts…

• m = parameter specifying the minimum number of entries in a node. m can be tuned at run time and is <= M/2
• An R tree containing N index records has a height of at most ⌈log_m N⌉ − 1
• Worst case space utilization (for all nodes except the root): m/M
• Maximum number of nodes: ⌈N/m⌉ + ⌈N/m^2⌉ + … + 1
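
As a worked illustration of the two bounds above, the sketch below evaluates them with integer arithmetic. The function name is hypothetical, and it assumes m >= 2 and N >= m.

```python
def rtree_bounds(n_records, m):
    """Return (max_height, max_nodes) for N index records and minimum fill m."""
    # levels becomes ceil(log_m N), computed without floating point.
    levels, capacity = 0, 1
    while capacity < n_records:
        capacity *= m
        levels += 1
    max_height = levels - 1

    # ceil(N/m) + ceil(N/m^2) + ... down to the single root node.
    max_nodes, power = 0, m
    while True:
        nodes = -(-n_records // power)  # integer ceiling division
        max_nodes += nodes
        if nodes == 1:
            break
        power *= m
    return max_height, max_nodes
```

For N = 1000 and m = 10 this gives a height bound of 2 and at most 100 + 10 + 1 = 111 nodes.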


Page 79: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Searching

Denote the rectangle part of an index entry E by E.I and the child-pointer part by E.p

Algorithm Search: Given an R Tree with root node T, find all index records whose rectangles overlap a search rectangle S

Step 1) [Search subtrees.] If T is not a leaf, check each entry E to determine if E.I overlaps S. For all overlapping entries, invoke Search on the tree whose root node is pointed to by E.p

Step 2) [Search leaf node.] If T is a leaf, check all entries E to determine whether E.I overlaps S. If so, E is a qualifying record. Return E.
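
A minimal sketch of Algorithm Search, assuming a rectangle is a tuple of (low, high) pairs, one per dimension. The `Entry` and `Node` classes are hypothetical containers that only mirror the E.I and E.p notation of the slide.

```python
from dataclasses import dataclass, field

def overlaps(r, s):
    """Closed rectangles overlap when their intervals intersect in every dimension."""
    return all(lo1 <= hi2 and lo2 <= hi1 for (lo1, hi1), (lo2, hi2) in zip(r, s))

@dataclass
class Entry:
    I: tuple   # bounding rectangle
    p: object  # child Node in a non-leaf node, tuple identifier in a leaf

@dataclass
class Node:
    leaf: bool
    entries: list = field(default_factory=list)

def search(T, S):
    """Return the tuple identifiers of all leaf entries whose rectangle overlaps S."""
    hits = []
    for E in T.entries:
        if overlaps(E.I, S):
            if T.leaf:
                hits.append(E.p)             # Step 2: qualifying record
            else:
                hits.extend(search(E.p, S))  # Step 1: descend into the subtree
    return hits

leaf1 = Node(True, [Entry(((0, 1), (0, 1)), "a"), Entry(((2, 3), (2, 3)), "b")])
leaf2 = Node(True, [Entry(((10, 11), (10, 11)), "c")])
root = Node(False, [Entry(((0, 3), (0, 3)), leaf1), Entry(((10, 11), (10, 11)), leaf2)])
```

Unlike a B-tree lookup, more than one subtree may have to be visited, because bounding rectangles of sibling entries can overlap.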


Page 80: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Insertion into R-Tree

• Similar to B-Tree insertion
• New index records added to leaves
• Overflowing nodes are split
• Splits propagate up the tree


Page 81: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Algorithm Insert - Details

1. Invoke ChooseLeaf to select a leaf node L in which to place E
2. If L has room for another entry, install E; else invoke SplitNode on L, obtaining L and LL
3. Invoke AdjustTree on L (and LL if a split was performed)
4. If the root is split, create a new root with two children (those obtained by splitting the old root)


Page 82: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Algorithm ChooseLeaf

1. Set N to be the root
2. If N is a leaf, return N
3. If N is not a leaf, let F be the entry in N whose rectangle needs least enlargement to include E.I
4. Set N to be the child node pointed to by F.p and repeat from Step 2
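
ChooseLeaf can be sketched as follows. The node layout (a dict with `leaf` and `entries` keys) and the helper names are assumptions made for illustration. Ties in enlargement are broken by smallest area, a rule the slide does not state and which is added here as an assumption.

```python
def area(r):
    """Area (volume) of a rectangle given as (low, high) pairs per dimension."""
    a = 1
    for lo, hi in r:
        a *= hi - lo
    return a

def union(r, s):
    """Smallest rectangle covering both r and s."""
    return tuple((min(a, c), max(b, d)) for (a, b), (c, d) in zip(r, s))

def enlargement(r, s):
    """Extra area r needs in order to include s."""
    return area(union(r, s)) - area(r)

def choose_leaf(root, rect):
    """Nodes are dicts: {'leaf': bool, 'entries': [(rectangle, child), ...]}."""
    N = root
    while not N["leaf"]:
        # Least enlargement first; smallest area as the assumed tie-break.
        F = min(N["entries"], key=lambda e: (enlargement(e[0], rect), area(e[0])))
        N = F[1]
    return N

leaf_a = {"leaf": True, "entries": []}
leaf_b = {"leaf": True, "entries": []}
root = {"leaf": False, "entries": [(((0, 10), (0, 10)), leaf_a),
                                   (((20, 30), (20, 30)), leaf_b)]}
```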


Page 83: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Algorithm AdjustTree

1. Set N = L; if L was split, set NN = LL
2. If N is the root, STOP
3. Let P be N’s parent and let EN be N’s entry in P. Adjust EN.I so that it “tightly” encloses all entries in N
4. If NN exists, create a new entry ENN, with ENN.p pointing to NN and ENN.I enclosing all rectangles in NN. Add ENN to P if there is room. Otherwise invoke SplitNode to produce P and PP
5. Move up to the next level (set N = P, and NN = PP if a split occurred) and repeat from Step 2


Page 84: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Node Splitting

• A “full” node has to be split when a new entry needs to be added
• Must ensure that on any subsequent search, with high probability, only one of the two nodes needs to be explored
• Total area of the two covering rectangles is to be minimized
• Exhaustive search: exponential complexity


Page 85: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Quadratic Cost Algorithm

1. Use PickSeeds to choose two entries to be the first elements of the two groups
2. Repeat Step 3 until all entries have been assigned to one of the groups
3. Invoke algorithm PickNext to choose the next entry to assign. Add it to the group whose covering rectangle needs to be expanded the least


Page 86: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Algorithm PickSeeds

1. For each pair of entries E1 and E2, let J be the rectangle including E1.I and E2.I. Calculate d = area(J) – area(E1.I) – area(E2.I)
2. Choose the pair with the largest d value
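
A short sketch of PickSeeds over a list of rectangles, each given as (low, high) pairs per dimension. The helper names are hypothetical; the function returns the indexes of the chosen pair.

```python
from itertools import combinations

def area(r):
    a = 1
    for lo, hi in r:
        a *= hi - lo
    return a

def union(r, s):
    return tuple((min(a, c), max(b, d)) for (a, b), (c, d) in zip(r, s))

def wasted_area(r, s):
    """d = area(J) - area(E1.I) - area(E2.I), with J covering both rectangles."""
    return area(union(r, s)) - area(r) - area(s)

def pick_seeds(rects):
    """Indexes of the pair that would waste the most area if grouped together."""
    return max(combinations(range(len(rects)), 2),
               key=lambda ij: wasted_area(rects[ij[0]], rects[ij[1]]))

RECTS = [((0, 1), (0, 1)), ((1, 2), (1, 2)), ((9, 10), (9, 10))]
```

Examining every pair is what makes this step quadratic in the number of entries.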


Page 87: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Algorithm PickNext

1. For each entry E not yet in any group, calculate d1 = the area increase required in the covering rectangle of Group 1 to include E.I. Calculate d2 similarly for Group 2
2. Choose any entry with maximum difference between d1 and d2
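
PickNext can be sketched in the same style. The function name and the return shape (index, d1, d2) are choices made for this illustration; the caller would then add the entry to the group with the smaller of d1 and d2.

```python
def area(r):
    a = 1
    for lo, hi in r:
        a *= hi - lo
    return a

def union(r, s):
    return tuple((min(a, c), max(b, d)) for (a, b), (c, d) in zip(r, s))

def pick_next(group1_rect, group2_rect, remaining):
    """Return (index, d1, d2) for the unassigned rectangle with the largest |d1 - d2|."""
    best = None
    for i, r in enumerate(remaining):
        d1 = area(union(group1_rect, r)) - area(group1_rect)
        d2 = area(union(group2_rect, r)) - area(group2_rect)
        if best is None or abs(d1 - d2) > abs(best[1] - best[2]):
            best = (i, d1, d2)
    return best

G1 = ((0, 1), (0, 1))
G2 = ((9, 10), (9, 10))
REMAINING = [((4, 6), (4, 6)), ((1, 2), (1, 2))]
```

The entry with the strongest preference for one group is placed first, while the choice is still cheap; entries that are indifferent (d1 close to d2) are left for later.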


Page 88: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Algorithm LinearPickSeeds

1. Along each dimension, find the entry whose rectangle has the highest low side, and the one with the lowest high side. Record the separation between them
2. Normalize the separations by dividing by the width of the entire set along the corresponding dimension
3. Choose the pair with the greatest normalized separation along any dimension
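
A sketch of LinearPickSeeds under the same rectangle representation. The function name is hypothetical, and as a simplification a dimension is skipped when both extremes are the same entry or the set has zero width along it.

```python
def linear_pick_seeds(rects):
    """Return the index pair with the greatest normalized separation, or None."""
    best = None
    for d in range(len(rects[0])):
        highest_low = max(range(len(rects)), key=lambda i: rects[i][d][0])
        lowest_high = min(range(len(rects)), key=lambda i: rects[i][d][1])
        width = max(r[d][1] for r in rects) - min(r[d][0] for r in rects)
        if highest_low == lowest_high or width == 0:
            continue  # simplification: no usable pair along this dimension
        separation = (rects[highest_low][d][0] - rects[lowest_high][d][1]) / width
        if best is None or separation > best[0]:
            best = (separation, lowest_high, highest_low)
    return None if best is None else (best[1], best[2])

RECTS = [((0, 1), (0, 5)), ((8, 10), (1, 4)), ((3, 4), (6, 9))]
```

Each dimension is scanned once, so the cost grows linearly with the number of entries, in contrast to the pairwise PickSeeds.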


Page 89: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Algorithm Delete

1. Invoke FindLeaf to locate the leaf node L containing E. Remove E from L
2. Invoke CondenseTree on L
3. If the root node has only one child, make the child the new root


Page 90: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Algorithm FindLeaf

1. Set T to be the root of the tree
2. If T is not a leaf, check each entry F in T to determine if F.I overlaps E.I. For each such entry, invoke FindLeaf on the tree pointed to by F.p
3. If T is a leaf, check each entry to see if it matches E. If E is found, return T


Page 91: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Algorithm CondenseTree

1. Set N = L. Set Q, the set of eliminated nodes, to be empty
2. If N is the root, go to Step 6; else let P be the parent of N, and let EN be N’s entry in P
3. If N has fewer than m entries, delete EN from P and add N to Q


Page 92: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Algorithm CondenseTree (contd.)

4. If N has not been eliminated, adjust EN.I to tightly contain all entries in N
5. Set N = P and repeat from Step 2
6. Reinsert all entries of the nodes in Q. Entries from eliminated leaf nodes are inserted as in algorithm Insert. Entries from higher-level nodes are to be inserted higher in the tree.


Page 93: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of


Multi-dimensional Sequential Pattern Mining

Page 94: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Outline

• Why multidimensional sequential pattern mining?
• Problem definition
• Algorithms
• Experimental results
• Conclusions


Page 95: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Why Sequential Pattern Mining?

Sequential pattern mining: Finding time-related frequent patterns (frequent subsequences)

Many data and applications are time-related

• Customer shopping patterns, telephone calling patterns
  – E.g., first buy computer, then CD-ROMs, software, within 3 mos.

• Natural disasters (e.g., earthquake, hurricane)

• Disease and treatment

• Stock market fluctuation

• Weblog click stream analysis

• DNA sequence analysis


Page 96: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Sequential Pattern Mining

Mining of frequently occurring patterns related to time or other sequences

Examples

• Renting “Star Wars”, then “Empire Strikes Back”, then “Return of the Jedi” in that order
• Collection of ordered events within an interval

Applications

• Targeted marketing

• Customer retention

• Weather prediction


Page 97: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Motivating Example

Sequential patterns are useful

• “free internet access → buy package 1 → upgrade to package 2”
• Marketing, product design & development

Problems: lack of focus

• Various groups of customers may have different patterns

MD-sequential pattern mining: integrate multi-dimensional analysis and sequential pattern mining


Page 98: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Sequences and Patterns

Given a set of sequences, find the complete set of frequent subsequences

A sequence: < (ef) (ab) (df) c b >

Elements: items within an element are listed alphabetically

A sequence database

SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>

<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>

Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
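
The subsequence and support claims above can be checked with a short sketch. It assumes every item is a single letter, so a sequence string can be parsed character by character; the function names are hypothetical.

```python
def parse(seq):
    """Turn '<a(abc)(ac)d(cf)>' into a list of item sets, one per element."""
    body, elements, i = seq.strip("<>"), [], 0
    while i < len(body):
        if body[i] == "(":
            j = body.index(")", i)
            elements.append(set(body[i + 1:j]))
            i = j + 1
        else:
            elements.append({body[i]})
            i += 1
    return elements

def is_subsequence(pattern, sequence):
    """True if each pattern element is contained in a distinct element of sequence, in order."""
    pos = 0
    for element in sequence:
        if pos < len(pattern) and pattern[pos] <= element:
            pos += 1  # greedy earliest match is sufficient here
    return pos == len(pattern)

def support(pattern, database):
    """Number of database sequences that contain the pattern."""
    p = parse(pattern)
    return sum(is_subsequence(p, parse(s)) for s in database.values())

SDB = {
    10: "<a(abc)(ac)d(cf)>",
    20: "<(ad)c(bc)(ae)>",
    30: "<(ef)(ab)(df)cb>",
    40: "<eg(af)cbc>",
}
```

Only sequences 10 and 30 contain an element with both a and b followed later by c, which gives <(ab)c> a support of 2.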

Page 99: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Sequential Pattern: Basics

A sequence: <(bd) c b (ac)>

Each parenthesized group or single item is an element

A sequence database

Seq. ID  Sequence
10       <(bd)cb(ac)>
20       <(bf)(ce)b(fg)>
30       <(ah)(bf)abf>
40       <(be)(ce)d>
50       <a(bd)bcb(ade)>

<ad(ae)> is a subsequence of <a(bd)bcb(ade)>

Given support threshold min_sup = 2, <(bd)cb> is a sequential pattern


Page 100: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Enhanced similarity search methods

• Allow for gaps within a sequence or differences in offsets or amplitudes
• Normalize sequences with amplitude scaling and offset translation
• Two subsequences are considered similar if one lies within an envelope of ε width around the other, ignoring outliers
• Two sequences are said to be similar if they have enough non-overlapping time-ordered pairs of similar subsequences
• Parameters specified by a user or expert: sliding window size, width of an envelope for similarity, maximum gap, and matching fraction


Page 101: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Subsequence Matching

• Break each sequence into pieces using a sliding window of length w
• Extract the features of the subsequence inside the window
• Map each sequence to a “trail” in the feature space
• Divide the trail of each sequence into “subtrails” and represent each of them with a minimum bounding rectangle
• Use a multipiece assembly algorithm to search for longer sequence matches


Page 102: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Sequential pattern mining: Cases and Parameters

Duration of a time sequence T

• Sequential pattern mining can then be confined to the data within a specified duration

• Ex. Subsequence corresponding to the year of 1999

• Ex. Partitioned sequences, such as every year, or every week after stock crashes, or every two weeks before and after a volcano eruption

Event folding window w

• If w = T, time-insensitive frequent patterns are found

• If w = 0 (no event sequence folding), sequential patterns are found where each event occurs at a distinct time instant

• If 0 < w < T, sequences occurring within the same period w are folded in the analysis


Page 103: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Sequential pattern mining: Cases and Parameters (2)

Time interval, int, between events in the discovered pattern

• int = 0: no interval gap is allowed, i.e., only strictly consecutive sequences are found
  – Ex. “Find frequent patterns occurring in consecutive weeks”
• min_int ≤ int ≤ max_int: find patterns that are separated by at least min_int but at most max_int
  – Ex. “If a person rents movie A, it is likely she will rent movie B within 30 days” (int ≤ 30)
• int = c ≠ 0: find patterns carrying an exact interval
  – Ex. “Every time when Dow Jones drops more than 5%, what will happen exactly two days later?” (int = 2)


Page 104: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Episodes and Sequential Pattern Mining Methods

Other methods for specifying the kinds of patterns

• Serial episodes: A → B
• Parallel episodes: A & B
• Regular expressions: (A | B)C*(D → E)

Methods for sequential pattern mining

• Variations of Apriori-like algorithms


Page 105: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Click Streams

Client click-stream analysis is a click-by-click view of a visitor's journey (or journeys) through a web site. By viewing a click-stream report, you can follow the exact pathway a visitor took through a web site, even down to the length of time they spent looking at each particular page.


Page 106: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Click Streams…Continued

The people most interested in this report would typically be involved in marketing, web design or web development. The information presented provides a click-by-click view of how visitors are interacting and navigating through their web site.


Page 107: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Periodicity Analysis

Periodicity is everywhere: tides, seasons, daily power consumption, etc.

Full periodicity

• Every point in time contributes (precisely or approximately) to the periodicity

Partial periodicity: A more general notion

• Only some segments contribute to the periodicity
  – Jim reads NY Times 7:00-7:30 am every week day

Cyclic association rules

• Associations which form cycles

Methods

• Full periodicity: FFT, other statistical analysis methods

• Partial and cyclic periodicity: Variations of Apriori-like mining methods


Page 108: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

MD Sequence Database

• P = (*,Chicago,*,<bf>) matches tuples 20 and 30
• If the support threshold is 2, P is an MD sequential pattern

cid Cust_grp City Age_grp sequence

10 Business Boston Middle <(bd)cba>

20 Professional Chicago Young <(bf)(ce)(fg)>

30 Business Chicago Middle <(ah)abf>

40 Education New York Retired <(be)(ce)>
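
A minimal sketch of matching an MD pattern against the table above. It assumes each sequence is stored as a list of elements, each element a string of single-letter items, and that "*" is a wildcard for a dimension; the function names are hypothetical.

```python
# The MD sequence database from the slide: dimensions plus the sequence,
# with each element written as a string of single-letter items.
MD_DB = {
    10: ("Business", "Boston", "Middle", ["bd", "c", "b", "a"]),
    20: ("Professional", "Chicago", "Young", ["bf", "ce", "fg"]),
    30: ("Business", "Chicago", "Middle", ["ah", "a", "b", "f"]),
    40: ("Education", "New York", "Retired", ["be", "ce"]),
}

def contains(sequence, pattern):
    """True if pattern is a subsequence of sequence (element-wise set containment)."""
    pos = 0
    for element in sequence:
        if pos < len(pattern) and set(pattern[pos]) <= set(element):
            pos += 1
    return pos == len(pattern)

def matching_tuples(db, dims, pattern):
    """cids whose dimension values match dims ('*' is a wildcard) and whose sequence contains pattern."""
    return [cid for cid, (*values, seq) in db.items()
            if all(d == "*" or d == v for d, v in zip(dims, values))
            and contains(seq, pattern)]
```

The number of matching tuples is the support of the MD pattern, so (*,Chicago,*,<bf>) reaches a threshold of 2.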

Page 109: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Mining of MD Seq. Pat.

Embedding MD information into sequences

• Using a uniform seq. pat. mining method

Integration of seq. pat. mining and MD analysis method


Page 110: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

UNISEQ

Embed MD information into sequences

cid Cust_grp City Age_grp sequence

10 Business Boston Middle <(bd)cba>

20 Professional Chicago Young <(bf)(ce)(fg)>

30 Business Chicago Middle <(ah)abf>

40 Education New York Retired <(be)(ce)>

Mine the extended sequence database using sequential pattern mining methods

cid MD-extension of sequences

10 <(Business,Boston,Middle)(bd)cba>

20 <(Professional,Chicago,Young)(bf)(ce)(fg)>

30 <(Business,Chicago,Middle)(ah)abf>

40 <(Education,New York,Retired)(be)(ce)>
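
The embedding step can be sketched as a one-line string transformation, assuming sequences are kept in the angle-bracket notation of the table. The function name is hypothetical; the mining step itself is left to whichever sequential pattern miner is used.

```python
def md_extension(dims, sequence):
    """Prepend the dimension values as one extra element at the front of the sequence."""
    return "<(" + ",".join(dims) + ")" + sequence[1:]

ROWS = {
    10: (("Business", "Boston", "Middle"), "<(bd)cba>"),
    20: (("Professional", "Chicago", "Young"), "<(bf)(ce)(fg)>"),
    30: (("Business", "Chicago", "Middle"), "<(ah)abf>"),
    40: (("Education", "New York", "Retired"), "<(be)(ce)>"),
}

EXTENDED = {cid: md_extension(dims, seq) for cid, (dims, seq) in ROWS.items()}
```

After this step the dimension values are ordinary items, so an unmodified sequential pattern miner can find patterns that mix dimensions and events.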

Page 111: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Mine Sequential Patterns by Prefix Projections

Step 1: find length-1 sequential patterns

• <a>, <b>, <c>, <d>, <e>, <f>

Step 2: divide search space. The complete set of seq. pat. can be partitioned into 6 subsets:

• The ones having prefix <a>;

• The ones having prefix <b>;

• …

• The ones having prefix <f>

SID sequence

10 <a(abc)(ac)d(cf)>

20 <(ad)c(bc)(ae)>

30 <(ef)(ab)(df)cb>

40 <eg(af)cbc>


Page 112: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Find Seq. Patterns with Prefix <a>

Only need to consider projections w.r.t. <a>

• <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>

Find all the length-2 seq. pat. having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>

• Further partition into 6 subsets
  – Having prefix <aa>;
  – …
  – Having prefix <af>

SID sequence

10 <a(abc)(ac)d(cf)>

20 <(ad)c(bc)(ae)>

30 <(ef)(ab)(df)cb>

40 <eg(af)cbc>
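
The first two steps of prefix projection can be sketched as follows, assuming each sequence is a list of elements and each element is a string of alphabetically ordered single-letter items. The function names are hypothetical, and only projection on a single-item prefix is shown.

```python
SDB = {
    10: ["a", "abc", "ac", "d", "cf"],
    20: ["ad", "c", "bc", "ae"],
    30: ["ef", "ab", "df", "c", "b"],
    40: ["e", "g", "af", "c", "b", "c"],
}

def frequent_items(db, min_sup):
    """Step 1: length-1 sequential patterns, counting each sequence at most once per item."""
    counts = {}
    for seq in db.values():
        for item in set("".join(seq)):
            counts[item] = counts.get(item, 0) + 1
    return sorted(item for item, c in counts.items() if c >= min_sup)

def project(sequence, item):
    """Suffix of sequence after the first occurrence of item, or None if item is absent.

    Items left over in the matched element are kept as a '_'-prefixed element.
    """
    for k, element in enumerate(sequence):
        if item in element:
            rest = element[element.index(item) + 1:]
            head = ["_" + rest] if rest else []
            return head + sequence[k + 1:]
    return None

A_PROJECTED = {sid: project(seq, "a") for sid, seq in SDB.items()}
```

The projected database contains only suffixes, which is why it keeps shrinking as the prefix grows.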


Page 113: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Completeness of PrefixSpan

SID sequence

10 <a(abc)(ac)d(cf)>

20 <(ad)c(bc)(ae)>

30 <(ef)(ab)(df)cb>

40 <eg(af)cbc>

SDB

Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>

• Having prefix <a>
  – <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
  – Length-2 sequential patterns: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
    – Having prefix <aa>: <aa>-proj. db, …
    – …
    – Having prefix <af>: <af>-proj. db, …
• Having prefix <b>
  – <b>-projected database, …
• Having prefix <c>, …, <f>
  – …

Page 114: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Efficiency of PrefixSpan

No candidate sequence needs to be generated

Projected databases keep shrinking

Major cost of PrefixSpan: constructing projected databases

• Can be improved by bi-level projections


Page 115: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Mining MD-Patterns

[Figure: lattice of dimension combinations]

All
(cust-grp,*,*)  (*,city,*)  (*,*,age-grp)
(cust-grp,city,*)  (cust-grp,*,age-grp)
(cust-grp,city,age-grp)

cid Cust_grp City Age_grp sequence

10 Business Boston Middle <(bd)cba>

20 Professional Chicago Young <(bf)(ce)(fg)>

30 Business Chicago Middle <(ah)abf>

40 Education New York Retired <(be)(ce)>

MD pattern (*,Chicago,*)

BUC processing

Page 116: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Dim-Seq

First find MD-patterns

• E.g. (*,Chicago,*)

Form projected sequence database

• <(bf)(ce)(fg)> and <(ah)abf> for (*,Chicago,*)

Find seq. pat. in projected database

• E.g. (*,Chicago,*,<bf>)

cid Cust_grp City Age_grp sequence

10 Business Boston Middle <(bd)cba>

20 Professional Chicago Young <(bf)(ce)(fg)>

30 Business Chicago Middle <(ah)abf>

40 Education New York Retired <(be)(ce)>

Page 117: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Seq-Dim

Find sequential patterns

• E.g. <bf>

Form projected MD-database

• E.g. (Professional,Chicago,Young) and (Business,Chicago,Middle) for <bf>

Mine MD-patterns

• E.g. (*,Chicago,*,<bf>)

cid Cust_grp City Age_grp sequence

10 Business Boston Middle <(bd)cba>

20 Professional Chicago Young <(bf)(ce)(fg)>

30 Business Chicago Middle <(ah)abf>

40 Education New York Retired <(be)(ce)>

Page 118: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Scalability Over Dimensionality


Page 119: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Scalability Over Cardinality


Page 120: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Scalability Over Support Threshold


Page 121: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Scalability Over Database Size


Page 122: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Pros & Cons of Algorithms

Seq-Dim is efficient and scalable

• Fastest in most cases

UniSeq is also efficient and scalable

• Fastest with low dimensionality

Dim-Seq has poor scalability


Page 123: A Comprehensive Look at Mining Time-Series and …amt/CompTimeSeriesSeqPattMining.pdfA Comprehensive Look at Mining Time-Series and Sequential Patterns Ankur Teredesai Department of

Conclusions

MD seq. pat. mining is interesting and useful

Mining MD seq. pat. efficiently

• Uniseq, Dim-Seq, and Seq-Dim

Future work

• Applications of sequential pattern mining
