Searching and mining sequential data - KTH
![Page 1: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/1.jpg)
Searching and mining sequential data
Panagiotis Papapetrou
Professor, Stockholm University
Adjunct Professor, Aalto University
Disclaimer: some of the images in slides 62-69 have been taken from UCR and Prof. Eamonn Keogh.
![Page 2: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/2.jpg)
Outline
• Sequences
▫ Examples and analysis tasks
• Searching large sequences
▫ EBSM: Embedding-based time series subsequence matching
▫ DRESS: Dimensionality reduction for event subsequence matching
• Learning from time series
▫ Shapelets and shapelet trees
▫ Random shapelet forests
• Current challenges
![Page 3: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/3.jpg)
Sequences
• Distinct events: e.g., text, DNA
• Variables over time: e.g., time series
• Interval-based events: e.g., sign language
Example event sequence: TCTAGGGCA
[Figure: example time series; x-axis: time, y-axis: value]
![Page 4: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/4.jpg)
Analysis Tasks
• Similarity Search
• Classification
• Clustering
• Outlier Detection
• Frequent Pattern Mining
![Page 5: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/5.jpg)
• Time series matching / stream monitoring
• DNA sequence alignment
Similarity search
TCTAGGGCA
ischemia
possibility of cancer
ECG of a patient
human genome
![Page 6: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/6.jpg)
• Query-by-humming: music piece → sequence of notes
Similarity search
![Page 7: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/7.jpg)
• Time series classification
• Gene classification
ischemia
TCTAGGGCA
possibility of cancer
ECG of a patient
Sequence classification
![Page 8: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/8.jpg)
Two fundamental questions
• How to define an appropriate distance measure for the task at hand?
• How to speed up the search under that distance measure?
![Page 9: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/9.jpg)
• Given two time series
X = (x1, x2, …, xn) Y = (y1, y2, …, yn)
• Define and compute D (X, Y)
• Or better…
Time Series Similarity
![Page 10: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/10.jpg)
Time Series Similarity
• Given a time series database and a query X
• Find the best match of X in the database
• Why is that useful?
[Figure: query X compared with each database sequence using D(X, Y); 1-NN result]
![Page 11: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/11.jpg)
Examples
• Find companies with similar stock prices over a time interval
• Find products with similar sell cycles
• Find songs with similar music patterns
• Cluster users with similar credit card utilization
• Find similar subsequences in DNA sequences
• Find scenes in video streams
![Page 12: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/12.jpg)
Elastic distance measures (images taken from Eamonn Keogh)
• Euclidean
▫ rigid
▫ $D(X, Y) \equiv \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
• Dynamic Time Warping (DTW)
▫ allows local scaling
• Longest Common SubSequence (LCSS)
▫ allows local scaling
▫ ignores outliers
![Page 13: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/13.jpg)
Properties of DTW
• Warping path W:
▫ set of grid cells in the time warping matrix
• DTW finds an optimal warping path W (the best alignment):
▫ the path with the smallest matching score
Properties of a DTW legal path
I. Boundary conditions: W_1 = (1,1) and W_K = (n,m)
II. Continuity: given W_k = (a,b), then W_{k-1} = (c,d), where a-c ≤ 1, b-d ≤ 1
III. Monotonicity: given W_k = (a,b), then W_{k-1} = (c,d), where a-c ≥ 0, b-d ≥ 0
[Figure: time warping matrix of X against Y with the optimum warping path W]
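The three path properties translate directly into the standard dynamic-programming recurrence for DTW. The following is a minimal sketch, assuming a squared-difference local cost (the slides do not fix the local cost); the function name `dtw` is illustrative only.

```python
import math


def dtw(x, y):
    """Return the DTW cost between numeric sequences x and y.

    Assumption: the local cost of matching x[i] with y[j] is (x[i] - y[j])**2.
    """
    n, m = len(x), len(y)
    # cost[i][j] = cheapest legal path aligning x[:i] with y[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0  # boundary condition: the path starts at cell (1, 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            # continuity and monotonicity: the previous cell is at most
            # one step back in i, in j, or in both
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]  # boundary condition: the path ends at cell (n, m)
```

For instance, `dtw([1, 2, 3], [1, 2, 2, 3])` is zero because the repeated 2 is absorbed by local scaling, whereas a rigid Euclidean comparison is not even defined for sequences of different lengths.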
![Page 14: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/14.jpg)
Global Constraints
• Slightly speed up the calculations and prevent pathological warpings
• A global constraint limits the indices of the warping path: w_k = (i, j)_k such that j - r ≤ i ≤ j + r
• where r is a term defining the allowed range of warping for a given point in a sequence
[Figure: Sakoe-Chiba Band and Itakura Parallelogram; r marks the width of the allowed region]
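A Sakoe-Chiba band of half-width r can be sketched by restricting the inner loop of the DTW recurrence to cells with j - r ≤ i ≤ j + r. This is an illustrative sketch, again assuming a squared-difference local cost; `dtw_band` is a hypothetical name.

```python
import math


def dtw_band(x, y, r):
    """DTW cost restricted to a Sakoe-Chiba band of half-width r.

    Assumption: squared-difference local cost. Returns infinity when no
    legal path fits inside the band (for example when |len(x) - len(y)| > r).
    """
    n, m = len(x), len(y)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        # only cells with j - r <= i <= j + r are evaluated
        for j in range(max(1, i - r), min(m, i + r) + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]
```

With r = 0 and equal lengths the path is forced onto the diagonal, so the result equals the squared Euclidean distance; increasing r can only lower the cost, at the price of evaluating more cells.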
![Page 15: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/15.jpg)
Speeding up search under DTW … using embeddings
• EBSM: subsequence matching for time series, SIGMOD 2008
• RBSM: alignment of event sequences, VLDB 2009
• EBSM++: subsequence and full sequence matching for time series, ACM TODS 2011
• EBESM: full matching of temporal interval sequences, DAMI 2017
![Page 16: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/16.jpg)
Strategy: identify candidate matches
[Figure: database X]
![Page 17: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/17.jpg)
Strategy: identify candidate matches
[Figure: database X with an indexing structure]
![Page 18: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/18.jpg)
Strategy: identify candidate matches
[Figure: query Q, indexing structure, database X]
![Page 19: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/19.jpg)
Strategy: identify candidate matches
• candidate: possible match
• Main principle: use costly distance computation only to evaluate the candidates
[Figure: query Q, indexing structure, and the candidates it selects in database X]
![Page 20: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/20.jpg)
Vector embedding
[Figure: database X = (X1, X2, …, X15)]
![Page 21: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/21.jpg)
Vector embedding
[Figure: database X = (X1, X2, …, X15) mapped to a vector set]
![Page 22: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/22.jpg)
Vector embedding
[Figure: database X = (X1, X2, …, X15) mapped to a vector set; query Q = (Q1, …, Q5)]
![Page 23: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/23.jpg)
Vector embedding
[Figure: database X = (X1, X2, …, X15) mapped to a vector set; query Q = (Q1, …, Q5) mapped to a query vector]
![Page 24: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/24.jpg)
Vector embedding
• Embedding should be such that:
– Query vector is similar to match vector
[Figure: database X = (X1, X2, …, X15) mapped to a vector set; query Q = (Q1, …, Q5) mapped to a query vector; candidate match marked]
![Page 25: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/25.jpg)
Vector embedding
• Using vectors we identify candidates
– Much faster than brute-force search
[Figure: database X = (X1, X2, …, X15) mapped to a vector set; query Q = (Q1, …, Q5) mapped to a query vector]
![Page 26: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/26.jpg)
Using reference sequences
• Define a set of reference sequences
• Apply your favorite distance measure (e.g., DTW) to compute the cost of the best match of R and X
• Define FR(X) to be that cost
• FR is a 1D embedding
• Each FR(X) → single real number
[Figure: reference sequence R matched against database sequence X; labels: ref. R, row |R|]
Using reference sequences
• Apply the same matching computation against the query:
▫ Cost of the best match of R with Q
• Define FR(Q) to be that cost
[Figure: reference sequence R matched against database sequence X and against query Q]
![Page 28: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/28.jpg)
Intuition about this embedding
• Suppose Q appears exactly in X
• Then:
▫ Warping paths are the same
▫ FR(Q) = FR(X)
• Suppose Q appears inexactly in X
• Then:
▫ We expect FR(Q) to be similar to FR(X)
▫ Why? Little tweaks should affect FR(X) little
▫ No proof, but intuitive, and lots of empirical evidence
![Page 29: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/29.jpg)
Multi-dimensional embedding
• one reference sequence → 1D embedding
• two reference sequences → 2D embedding
[Figure: reference sequences R1 and R2, each matched against database sequence X and query Q]
![Page 30: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/30.jpg)
Multi-dimensional embedding
• d reference sequences → d-dimensional embedding F
• Basic principle:
▫ If Xj is the best match of Q
▫ F(Q) should (for most Q) be more similar to F(X, j) than to most F(X, t)
[Figure: reference sequences R1 and R2, each matched against database sequence X and query Q]
![Page 31: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/31.jpg)
Filter-and-refine retrieval
Offline step:
• Compute FR(X) for each sequence in X
Online steps - given a query Q:
• Embedding step:
▫ Compute FR(Q)
• Filter step:
▫ Compare FR(Q) to all FR(X)
▫ Select p best matches → p candidate endpoints
• Refine step:
▫ Use dynamic programming to evaluate each candidate
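The offline, embedding, filter, and refine steps can be sketched as follows. This is a simplified illustration, not the EBSM algorithm itself: it assumes whole-sequence matching over a list of database sequences (EBSM works with subsequence endpoints), a squared-difference DTW cost, and Euclidean comparison of embedding vectors. The names `embed` and `filter_and_refine` are hypothetical.

```python
import math


def dtw(x, y):
    """Plain DTW with a squared-difference local cost (an assumption)."""
    n, m = len(x), len(y)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]


def embed(seq, refs):
    """d reference sequences give a d-dimensional vector of matching costs."""
    return [dtw(r, seq) for r in refs]


def filter_and_refine(query, database, refs, p):
    """Return the index of the best match found among p candidates."""
    # Offline step (recomputed here only to keep the sketch self-contained)
    db_vectors = [embed(x, refs) for x in database]
    # Embedding step
    q_vec = embed(query, refs)
    # Filter step: cheap vector comparisons select p candidates
    order = sorted(range(len(database)),
                   key=lambda i: math.dist(q_vec, db_vectors[i]))
    candidates = order[:p]
    # Refine step: the costly distance is computed for the candidates only
    return min(candidates, key=lambda i: dtw(query, database[i]))
```

The result is approximate: if the true best match is not among the p candidates, the refine step cannot recover it, which is the accuracy versus efficiency trade-off controlled by p.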
![Page 32: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/32.jpg)
Filter-and-refine performance
• Accuracy:
▫ correct match must be among p candidates, for most queries
• Trade-off:
▫ larger p → higher accuracy, lower efficiency
[Figure: candidates within database X]
![Page 33: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/33.jpg)
Embedding Optimization
• How many reference sequences to choose?
• Which ones to choose?
• Solution:
▫ extract the set of reference sequences with the highest distance variance in the dataset
▫ choose those top K that can guarantee the best trade-off between accuracy and p
▫ cross-validation (one-time pre-processing step)
▫ sample of queries (input to the algorithm)
▫ large pool of reference sequences
![Page 34: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/34.jpg)
Performance
• 20 datasets: UCR Time Series Data Mining Repository
• Queries: 5,397
• Query lengths between 60 and 637
• 50% were used for embedding optimization
• EBSM can achieve over an order of magnitude speedup compared to brute-force search for ALL datasets (2% to 5% of the database is examined), with accuracy of 99%
• And twice as fast as LB_PAA+LB_Keogh / UCR_Suite
![Page 35: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/35.jpg)
Application: Query-by-humming
• Competitive performance on real queries
• Demo available: implementation in Matlab
![Page 36: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/36.jpg)
Similarity search: event sequences
• Similarity search in large sequence databases → a frequently occurring problem
• Plethora of string matching methods
• Still a high demand for new robust and scalable similarity search methods that can handle
▫ large query lengths
▫ large similarity ranges
• Our focus: Whole sequence matching
Example sequence: TCTAGGGCA
![Page 37: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/37.jpg)
Motivation: large query lengths
• Expressed Sequence Tag (EST) databases:
▫ common for representing large genomes
▫ portions of genes expressed as mRNA
▫ sequence length at least 500 – 800
• Large scale searches need to be performed against other genomic databases to determine locations of genes
• Searches can also target whole chromosomes, where the goal is to find chromosome similarities across different organisms
![Page 38: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/38.jpg)
Motivation: large query lengths
• High evolutionary divergence → identifying distantly related gene or protein domains by sequence search techniques is not always trivial
• Mutation process: intermediate sequences may possess features of many proteins and facilitate detection of remotely related proteins
• Large range queries, aka remote homology search, in bioinformatics can be highly beneficial for searching proteins and genomes
![Page 39: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/39.jpg)
Problem setting
• Given
▫ a database S and a query Q
▫ a similarity range r = δ|Q|
• Task:
▫ retrieve all database strings X such that ED(Q, X) ≤ r
![Page 40: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/40.jpg)
Existing work
• Global alignment [ED, Q-grams, RBE]
▫ full-sequence matching approach for global optimization
▫ identify the minimum number of changes that should be performed on one sequence so as to convert it to the other
▫ forcing the alignment to span the whole sequence
• Local alignment [SW, BLAST, RBSA]
▫ identify local regions within the compared sequences that are highly similar to each other, while they could be globally divergent
▫ may allow for gaps in the alignment
• Short-read sequencing [SOAP, MAQ, WHAM]
![Page 41: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/41.jpg)
The Edit distance
• Measures how dissimilar two strings are
• Computes the minimum cost of edit operations (insertion, deletion, substitution) to convert one string to the other
• Auxiliary matrix α
▫ Cins: insertion cost
▫ Cdel: deletion cost
▫ Csub: substitution cost
![Page 42: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/42.jpg)
The Edit distance
• Example: A = ATC and B = ACTG
A = A – T C
B = A C T G
ED(A, B) = 2
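The example can be checked with the usual dynamic-programming formulation over the auxiliary matrix α. The sketch below assumes unit costs by default; `edit_distance` and `range_query` are illustrative names, and `range_query` is a brute-force reading of the problem setting (retrieve all X with ED(Q, X) ≤ δ|Q|), not the DRESS method.

```python
def edit_distance(a, b, c_ins=1, c_del=1, c_sub=1):
    """Minimum cost of insertions, deletions and substitutions turning a into b."""
    n, m = len(a), len(b)
    # alpha[i][j] = edit distance between a[:i] and b[:j]
    alpha = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        alpha[i][0] = i * c_del          # delete every symbol of a[:i]
    for j in range(1, m + 1):
        alpha[0][j] = j * c_ins          # insert every symbol of b[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else c_sub
            alpha[i][j] = min(alpha[i - 1][j] + c_del,
                              alpha[i][j - 1] + c_ins,
                              alpha[i - 1][j - 1] + sub)
    return alpha[n][m]


def range_query(database, query, delta):
    """Brute-force range search: all strings X with ED(query, X) <= delta * |query|."""
    r = delta * len(query)
    return [x for x in database if edit_distance(query, x) <= r]


print(edit_distance("ATC", "ACTG"))  # 2: insert C after A, substitute C by G
```

Each comparison costs O(|Q|·|X|) time, which is why brute-force range search becomes expensive for long queries and large databases.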
![Page 43: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/43.jpg)
Our contributions
• DRESS: a novel filter-and-refine approximate method for speeding up similarity search under edit distance:
▫ no index training required
▫ can handle large query lengths and similarity ranges
• Efficient string representation in a new space, based on a set of query-specific codewords → distance computation between strings is significantly faster than in the original space
• Extensive experimental evaluation against state-of-the-art on three large protein and two DNA sequence datasets
▫ DRESS outperforms competitors with very competitive accuracy and retrieval runtime
![Page 44: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/44.jpg)
The DRESS framework
• A three-step framework
![Page 45: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/45.jpg)
The mapping step: alphabet reduction
• This is an online query-sensitive step
• For each query Q, identify a set of codewords E = {E1, …, Et}
▫ E contains the t most frequent substrings of length l
▫ IMPORTANT: no pair in E has an overlapping prefix-suffix
▫ t and l are identified after appropriate training
![Page 46: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/46.jpg)
The mapping step: alphabet reduction
![Page 47: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/47.jpg)
The mapping step: example
• Let X = (babfcde) and E = {ab, cd}
• Sequences are traversed from left to right!
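The worked figure for this example is not preserved here, so the following is only a sketch of one plausible mapping rule, stated as an assumption: scan the sequence from left to right, emit a codeword's identifier whenever a codeword starts at the current position, and drop symbols that belong to no codeword. The name `e_mapping` is hypothetical.

```python
def e_mapping(seq, codewords):
    """Map seq to the sequence of codeword ids found in a left-to-right scan.

    Assumptions (not confirmed by the slides): all codewords have the same
    length l, and symbols outside any codeword are discarded.
    """
    l = len(codewords[0])
    index = {w: k for k, w in enumerate(codewords)}
    mapped, pos = [], 0
    while pos + l <= len(seq):
        window = seq[pos:pos + l]
        if window in index:
            mapped.append(index[window])  # emit the codeword id
            pos += l                      # continue after the codeword
        else:
            pos += 1                      # skip a symbol outside any codeword
    return mapped
```

Under these assumptions, `e_mapping("babfcde", ["ab", "cd"])` returns `[0, 1]`, i.e., the seven-symbol string shrinks to two codeword symbols, which is what makes distance computation in the new space cheaper.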
![Page 48: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/48.jpg)
The mapping step: faster than linear
• The process described above is linear in the database size
• Inverted index: precomputed for the database, where we store for each possible codeword of length l:
▫ all the sequence indices and positions where it occurs
▫ IND[e][i]: list of all positions where codeword e occurs in the ith database sequence
• NOTE: The index is built in linear time
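An inverted index of the form IND[e][i] can be sketched with nested dictionaries. This is an illustrative layout only; `build_index` is a hypothetical name and positions are 0-based here.

```python
from collections import defaultdict


def build_index(database, l):
    """Return ind such that ind[e][i] lists the start positions of the
    length-l substring e in the i-th database sequence (0-based)."""
    ind = defaultdict(lambda: defaultdict(list))
    for i, seq in enumerate(database):
        # one pass over each sequence: linear in the database size for fixed l
        for pos in range(len(seq) - l + 1):
            ind[seq[pos:pos + l]][i].append(pos)
    return ind


ind = build_index(["babfcde", "abab"], 2)
print(ind["ab"][1])  # [0, 2]
```

Once the codewords for a query are known, only their posting lists need to be read, so the original sequences do not have to be scanned again at query time.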
![Page 49: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/49.jpg)
The mapping step: in detail
• Let Xi be a database sequence and xi denote its E-mapping
![Page 50: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/50.jpg)
The mapping step: complexity
• Complexity of the mapping step for a single DB sequence:
• Good news:
▫ Typically the length of xi is much smaller than the length of Xi
▫ Hence, faster than scanning Xi
![Page 51: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/51.jpg)
The filter-and-refine steps
• Query is mapped to the E-space
• Each DB sequence is mapped to the E-space
• Brute-force search under Edit Distance on the E-space
• Identify a set of candidates that are within a factor of δ’ of the query length
• Refine the candidates in the original space under Edit Distance
![Page 52: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/52.jpg)
The matching range in the E-space
• The matching range δ’ in the E-space depends on the original δ and is set to be δ’ = f δ
• Clearly, δ’ should be set higher than δ to account for the cases where the distance in the E-space is a higher percentage of the mapped query length
• The latter can easily happen because the mapped strings are drastically shorter than the original strings
• Hence: while we expect similar strings to have similar mappings, the loss incurred by the mapping can lead to higher distances as a percentage of the query length
![Page 53: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/53.jpg)
Experimental setup
• Two real datasets
• Three methods
• Variable query sizes
• Several parameters
![Page 54: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/54.jpg)
Experimental setup: Datasets
• UniProt [Protein sequences]: http://www.ebi.ac.uk/uniprot/
▫ 530,264 strings
▫ 25-letter alphabet
• Human Chromosome 1 [DNA sequences]: ftp://ftp.ncbi.nlm.nih.go
▫ 249,250,621 bases available
▫ 4-letter alphabet
![Page 55: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/55.jpg)
Experimental setup: Datasets
![Page 56: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/56.jpg)
Experimental setup: Methods
• DRESS: our method
• RBE: an embedding-based method for sequence search
• Q-grams: uses inverted indices to find candidate matches
• Note that DRESS is an approximate method, while RBE and Q-grams are exact
![Page 57: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/57.jpg)
Experimental results: retrieval cost
• Retrieval cost on UniProt [401,800]
• DRESS achieves up to 29 times lower retrieval cost than RBE and up to 75 times lower than Q-grams
![Page 58: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/58.jpg)
Experimental results: retrieval cost
• Retrieval cost on UniProt [801,max]
• DRESS achieves up to 75 times lower retrieval cost than RBE and up to 461 times lower than Q-grams
![Page 59: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/59.jpg)
Experimental results: runtime
• Protein datasets:
• DNA datasets:
![Page 60: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/60.jpg)
Conclusions
• DRESS: novel method for whole sequence matching
• Supports large query lengths and matching ranges
• Experiments demonstrate that for higher values of search range DRESS produces significantly lower costs and runtimes than the competitors
• Loss of guarantee of 100% recall
• This price can be an acceptable trade-off in several domains given the significant runtime savings
![Page 61: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/61.jpg)
Future work
• Further improve the performance of DRESS by, e.g., implementing multiple filter steps
• Study the performance of DRESS on sequences from other domains, such as text
• Adapt DRESS for local alignment and short-read sequencing
• Explore alternative ways of producing codewords, e.g., learn synthetic words
![Page 62: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/62.jpg)
Time series classification
• Application: Finance, Medicine, Music
• 1-Nearest Neighbor
▫ Pros: accurate, robust, simple
▫ Cons: time and space complexity (lazy learning); results are not interpretable
[Figure: example time series; x-axis from 0 to 1200]
![Page 63: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/63.jpg)
Solution
• Shapelets:
▫ time series subsequence
▫ representative of a class
▫ discriminative from other classes
• Idea:
▫ Use shapelets as “attributes” or “features” for splitting a node in the decision tree
[Figure: time series with a highlighted shapelet]
![Page 64: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/64.jpg)
The Shapelet Classifier
Two steps:
• Learn a set of discriminative shapelets (typically a tree-like structure)
• Predict a class label for a previously unseen data series using the minimum distance to shapelets (following a path in the tree-structure model)
[Figure: Random Shapelet Tree; slide annotation: "Forgot \bar{s}"]
![Page 65: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/65.jpg)
The Shapelet Classifier
[Figure: Shapelet Dictionary (shapelets I to VI) and Shapelet Tree (nodes I to VI, leaves 0 to 6)]
Method            Accuracy  Time
Shapelet          0.720     0.86
Nearest Neighbor  0.543     0.65
![Page 66: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/66.jpg)
The utility of a Shapelet
• Collect all candidate shapelets in a pool
• For each candidate: arrange the time series objects based on their distance from the candidate shapelet
• Find the optimal split point (maximal information gain)
• Pick the candidate achieving the best utility as the shapelet
[Figure: time series objects in the database ordered by distance (from 0) to the candidate, with the split point marked]
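The split-point search for one candidate can be sketched as follows: sort the time series by their distance to the candidate, try a threshold between every pair of neighbouring distances, and keep the one with the highest information gain. Names such as `best_split` are illustrative, and placing the threshold at the midpoint between neighbours is an assumption.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def best_split(distances, labels):
    """Return (information gain, threshold) of the best split point.

    distances[i] is the distance of time series i to the candidate shapelet.
    """
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    dist = [distances[i] for i in order]
    labs = [labels[i] for i in order]
    n = len(labs)
    base = entropy(labs)
    best_gain, best_threshold = 0.0, None
    for k in range(1, n):
        if dist[k] == dist[k - 1]:
            continue  # no threshold separates equal distances
        left, right = labs[:k], labs[k:]
        gain = base - (k / n) * entropy(left) - ((n - k) / n) * entropy(right)
        if gain > best_gain:
            best_gain = gain
            best_threshold = (dist[k - 1] + dist[k]) / 2  # midpoint (assumption)
    return best_gain, best_threshold
```

The candidate whose best split yields the highest gain over the whole pool would then be chosen as the shapelet for the node.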
![Page 67: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/67.jpg)
Extracting all Shapelets
Candidates pool:
$\sum_{l=\mathrm{MINLEN}}^{\mathrm{MAXLEN}} \sum_{T_i \in D} (|T_i| - l + 1)$
• DB: 200
• Length: 275
• Candidates: 7,480,200
• Computation: 3 days
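The size of the candidate pool can be reproduced by summing, over every shapelet length l and every series T_i, the number of length-l subsequences |T_i| - l + 1. The slide does not state MINLEN and MAXLEN; the values 3 and 275 used below are assumptions that happen to reproduce the reported count. `count_candidates` is a hypothetical helper.

```python
def count_candidates(series_lengths, min_len, max_len):
    """Number of subsequences of length min_len..max_len over all series."""
    return sum(t - l + 1
               for l in range(min_len, max_len + 1)
               for t in series_lengths
               if t >= l)  # a series shorter than l contributes nothing


# 200 series of length 275, with MINLEN = 3 and MAXLEN = 275 (assumed values)
print(count_candidates([275] * 200, 3, 275))  # 7480200
```

The count grows quadratically with the series length, which is why exhaustive candidate generation dominates the running time.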
![Page 68: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/68.jpg)
Speeding up candidate generation
• Main bottleneck: candidate generation
• Reduce the time in two ways
▫ Distance Early Abandon: reduce the Euclidean distance computation time between two time series
▫ Admissible Entropy Pruning: after calculating some training samples, use an upper bound of information gain to check against the best candidate shapelet
[Figure: time series objects in the database ordered by distance (from 0) to the candidate, with the split point marked]
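Distance early abandoning can be sketched as follows: while accumulating the distance for one window, stop as soon as the partial sum reaches the best distance found so far. The sketch uses the squared Euclidean distance, which preserves the ordering of the true Euclidean distance; `subsequence_distance` is an illustrative name.

```python
import math


def subsequence_distance(series, shapelet):
    """Minimum squared Euclidean distance between the shapelet and any
    window of the same length in the series, with early abandoning."""
    l = len(shapelet)
    best = math.inf
    for start in range(len(series) - l + 1):
        total = 0.0
        for k in range(l):
            total += (series[start + k] - shapelet[k]) ** 2
            if total >= best:
                break          # early abandon: this window cannot win
        else:
            best = total       # loop finished: new best window
    return best
```

Abandoning never changes the result, because a window is discarded only after its partial sum already matches or exceeds the current minimum.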
![Page 69: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/69.jpg)
![Page 70: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/70.jpg)
Alternative approaches
• Transformations + k-NN
▫ Improved subsequence searching and matching, using online normalization, early abandoning, and re-ordering
▫ Dimensionality reduction using symbolic aggregate approximation
• Synthetic shapelet generation
▫ Initialize using K-means clustering
▫ Learn synthetic shapelets by following a gradient ascent approach
![Page 71: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/71.jpg)
Alternative approaches
• Feature-based
▫ Select the top k most informative shapelets
▫ Generate a new dataset D’ of k columns and |D| rows, where each element (i,j) is the (minimum) distance from shapelet sj to data series di
▫ Learn any suitable classifier (e.g., SVM, Random Forest) using the transformed dataset
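The transformed dataset D’ can be sketched as a |D| × k matrix of minimum subsequence distances. The sketch assumes the k shapelets have already been selected and uses the squared Euclidean distance; `min_distance` and `shapelet_transform` are hypothetical names.

```python
def min_distance(series, shapelet):
    """Minimum squared Euclidean distance between the shapelet and any
    same-length window of the series (infinity if the series is too short)."""
    l = len(shapelet)
    windows = (series[s:s + l] for s in range(len(series) - l + 1))
    return min((sum((a - b) ** 2 for a, b in zip(w, shapelet)) for w in windows),
               default=float("inf"))


def shapelet_transform(dataset, shapelets):
    """Row i, column j holds the distance from shapelet j to data series i."""
    return [[min_distance(d, s) for s in shapelets] for d in dataset]


features = shapelet_transform([[0, 1, 2, 3], [3, 2, 1, 0]], [[1, 2], [2, 1]])
print(features)  # [[0, 2], [2, 0]]
```

Each row is an ordinary fixed-length feature vector, so any off-the-shelf classifier can be trained on the result.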
![Page 72: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/72.jpg)
Random Shapelet Forest
• A tree-structured ensemble based on the classic Random Forest and the Shapelet Tree classifier
• Learn:
▫ Build T random shapelet trees
▫ Each tree is built from a random sample (with replacement) of time series in the database D
▫ Inspect r random shapelets at each node
• Predict:
▫ Let each tree t1, . . . , tT vote for a class label
![Page 73: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/73.jpg)
Experimental results
• Parameters:
▫ number of shapelets (r) = 100
▫ number of trees (T) = 500
▫ min shapelet length (l) = 2
▫ max shapelet length (u) = m (max time series length)
![Page 74: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/74.jpg)
• Friedman’s test: the observed differences in accuracy of RSF against the rest deviate significantly, i.e., p-value = 10^-11
![Page 75: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/75.jpg)
Multi-variate time series
![Page 76: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/76.jpg)
Future work
• Feature tweaking for shapelets:
▫ What is the minimum amount of changes we need to apply to a shapelet feature so that the classification result flips?
• Interval sequence features:
▫ Apply random shapelet forests for complex interval sequences
![Page 77: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/77.jpg)
Thank you for your attention!
![Page 78: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/78.jpg)
References: Subsequence matching
• "Embedding-based Subsequence Matching: a query-by-humming application". In the Very Large Databases Journal (VLDBJ), 24(4): 519-536, 2015
• "Hum-a-song: A Subsequence Matching with Gaps-Range-Tolerances Query-By-Humming System". In Proceedings of the Very Large Databases Endowment (PVLDB), Volume 4, Issue 11, pages 1930-1933, 2012
• "Embedding-based subsequence matching in time-series databases". In ACM Transactions on Database Systems (TODS), Vol. 36, Issue 3, Article 17, 2011
![Page 79: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/79.jpg)
References: Time series classification
• "Generalized Random Shapelet Forests". In the Data Mining and Knowledge Discovery Journal (DAMI), 2016
• "Early-classification of Time Series". In Proceedings of the International Conference on Discovery Science (DS), Best Paper Award, 2016
• "Query-sensitive Distance Measure Selection for Time Series Nearest Neighbor Classification". Intelligent Data Analysis Journal (IDAJ), 20(1): 5-27, 2016
![Page 80: Searching and mining sequential data - KTH](https://reader031.vdocument.in/reader031/viewer/2022021223/62072eb104a39206955a3cd0/html5/thumbnails/80.jpg)
References: Sequences of temporal intervals
• "Indexing Sequences of Temporal Intervals". In the Data Mining and Knowledge Discovery Journal (DAMI), Accepted 2016
• "Finding the Longest Common Sub-Pattern in Sequences of Temporal Intervals". In the Data Mining and Knowledge Discovery Journal (DAMI), 29(5): 1178-1210, 2015
• "Mining Frequent Arrangements of Temporal Intervals". In Knowledge and Information Systems (KAIS), Vol. 21, Issue 2, pages 133–171, 2010