similarity searches in sequence databases

Similarity Searches in Sequence Databases Sang-Hyun Park KMeD Research Group Computer Science Department University of California, Los Angeles

Upload: goldy

Post on 02-Feb-2016


Page 1: Similarity Searches in Sequence Databases

Similarity Searches in Sequence Databases

Sang-Hyun Park

KMeD Research Group, Computer Science Department

University of California, Los Angeles

Page 2: Similarity Searches in Sequence Databases

Contents

Introduction Whole Sequence Searches Subsequence Searches Segment-Based Subsequence Searches Multi-Dimensional Subsequence Searches Conclusion

Page 3: Similarity Searches in Sequence Databases

What is Sequence?

A sequence is an ordered list of elements.

S = 14.3, 18.2, 22.0, 22.4, 19.5, 17.1, 15.8, 15.1

Sequences are a principal data format in many applications.

[Figure: temperature (°C) versus time of day, 8AM to 10PM; vertical axis from 5 to 25]

Page 4: Similarity Searches in Sequence Databases

What is Similarity Search?

Similarity search finds sequences whose changing patterns are similar to that of a query sequence.

Examples:
Detect stocks with similar growth patterns.
Find persons with similar voice clips.
Find patients whose brain tumors have similar evolution patterns.

Similarity search helps in clustering, data mining, and rule discovery.

Page 5: Similarity Searches in Sequence Databases

Classification of Similarity Search

Similarity searches are classified as whole sequence searches or subsequence searches.

Example: S = ⟨1,2,3⟩
Subsequences(S) = { ⟨1⟩, ⟨2⟩, ⟨3⟩, ⟨1,2⟩, ⟨2,3⟩, ⟨1,2,3⟩ }

In whole sequence searches, the sequence S itself is compared with a query sequence Q.

In subsequence searches, every possible subsequence of S can be compared with a query sequence q.
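As a small illustration of the classification above, the sketch below enumerates every contiguous subsequence of a sequence. The function name `contiguous_subsequences` is hypothetical and not part of the original slides.

```python
def contiguous_subsequences(seq):
    """Return every contiguous subsequence of seq, shortest first."""
    n = len(seq)
    return [seq[start:start + length]
            for length in range(1, n + 1)
            for start in range(n - length + 1)]


# For S = 1,2,3 this lists the six subsequences from the example.
print(contiguous_subsequences([1, 2, 3]))
```

A sequence of length n has n(n+1)/2 such subsequences, which is why subsequence searching is so much more expensive than whole sequence searching.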

Page 6: Similarity Searches in Sequence Databases

Similarity Measure

Lp Distance Metric

L1: Manhattan distance or city-block distance
L2: Euclidean distance
L∞: maximum distance over all element pairs

The Lp metric requires that the two sequences have the same length n:

$$L_p(S, Q) = \left( \sum_{i=1}^{n} |S[i] - Q[i]|^p \right)^{1/p}$$
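The Lp metric can be sketched in a few lines. The function name `lp_distance` and the use of `math.inf` to select L∞ are illustrative assumptions, not something the slides prescribe.

```python
import math


def lp_distance(s, q, p):
    """Lp distance between two equal-length numeric sequences.

    Pass p=math.inf for the maximum (L-infinity) distance.
    """
    if len(s) != len(q):
        raise ValueError("Lp distance requires sequences of equal length")
    diffs = [abs(a - b) for a, b in zip(s, q)]
    if math.isinf(p):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


print(lp_distance([0, 0], [3, 4], 1))         # Manhattan
print(lp_distance([0, 0], [3, 4], 2))         # Euclidean
print(lp_distance([0, 0], [3, 4], math.inf))  # maximum
```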

Page 7: Similarity Searches in Sequence Databases

Similarity Measure (2)

Time Warping Distance
Originally introduced in the area of speech recognition.
Allows sequences to be stretched along the time axis: 3,5,6 → 3,3,5,6 → 3,3,3,5,6 → 3,3,3,5,5,6 → …
Each element of a sequence can be mapped to one or more neighboring elements of another sequence.
Useful in applications where sequences may be of different lengths or different sampling rates.

Q = 10, 15, 20
S = 10, 15, 16, 20

Page 8: Similarity Searches in Sequence Databases

Similarity Measure (3)

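Time Warping Distance (2)
Defined recursively.
Computed by a dynamic programming technique in O(|S||Q|).

$$D_{TW}(S, Q) = D_{BASE}(S[1], Q[1]) + \min \begin{cases} D_{TW}(S, Q[2{:}-]) \\ D_{TW}(S[2{:}-], Q) \\ D_{TW}(S[2{:}-], Q[2{:}-]) \end{cases}$$

$$D_{BASE}(S[1], Q[1]) = |S[1] - Q[1]|^p$$

Here S[1] is the first element of S and S[2:-] is the rest of S; Q[1] and Q[2:-] are defined likewise.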

Page 9: Similarity Searches in Sequence Databases

Similarity Measure (4)

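Time Warping Distance (3)
S = 4,5,6,7,6,6 and Q = 3,4,3
When using L1 as D_BASE, D_TW(S, Q) = 12.

Cumulative distance table (one column per S[i], one row per Q[j]):

S[i]: 4, 5, 6, 7, 6, 6
Q[1] = 3: 1, 3, 6, 10, 13, 16
Q[2] = 4: 1, 2, 4, 7, 9, 11
Q[3] = 3: 2, 3, 5, 8, 10, 12

Each cell holds |S[i] − Q[j]| + min(V1, V2, V3), where V1, V2, and V3 are the three neighboring cells that were already computed.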
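The table-filling rule above can be sketched as follows. This is a minimal bottom-up version under the assumption that the base distance is |S[i] − Q[j]|^p with p = 1 by default; the function name `dtw` is illustrative.

```python
def dtw(s, q, p=1):
    """Time warping distance via dynamic programming, O(|s||q|)."""
    inf = float("inf")
    # table[j][i] = distance between the first j elements of q
    # and the first i elements of s; row 0 and column 0 are borders.
    table = [[inf] * (len(s) + 1) for _ in range(len(q) + 1)]
    table[0][0] = 0.0
    for j in range(1, len(q) + 1):
        for i in range(1, len(s) + 1):
            base = abs(s[i - 1] - q[j - 1]) ** p
            table[j][i] = base + min(table[j - 1][i],      # cell above
                                     table[j][i - 1],      # cell to the left
                                     table[j - 1][i - 1])  # diagonal cell
    return table[len(q)][len(s)]


# The example from the slide: S = 4,5,6,7,6,6 and Q = 3,4,3 with L1.
print(dtw([4, 5, 6, 7, 6, 6], [3, 4, 3]))
```

Filling the table forward from the first elements gives the same value as the recursive definition, which peels elements off the front.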

Page 10: Similarity Searches in Sequence Databases

False Alarm and False Dismissal

False Alarm: a candidate that is not similar to the query. Minimize false alarms for efficiency.

False Dismissal: a similar sequence that is not retrieved by the index search. Avoid false dismissals for correctness.

[Figure: two diagrams of data sequences, candidates, and similar sequences, one marking the false alarms and the other marking the false dismissals]

Page 11: Similarity Searches in Sequence Databases

Contents

Introduction Whole Sequence Searches Subsequence Searches Segment-Based Subsequence Searches Multi-Dimensional Subsequence Searches Conclusion

Page 12: Similarity Searches in Sequence Databases

Problem Definition

Input: a set of data sequences {S}, a query sequence Q, and a distance tolerance ε

Output: the set of data sequences whose distances to Q are within ε

Similarity Measure: time warping distance function D_TW, with L∞ as the distance function for each element pair. If the distance of every element pair is within ε, then D_TW(S, Q) ≤ ε.

Page 13: Similarity Searches in Sequence Databases

Previous Approaches

Naïve Scan [Ber96]
Read every data sequence from the database.
Apply the dynamic programming technique.
For m data sequences with average length L, the cost is O(mL|Q|).

FastMap-Based Technique [Yi98]
Use the FastMap technique for feature extraction.
Map features into multi-dimensional points.
Use Euclidean distance in index space for filtering.
Could not guarantee "no false dismissal".

Page 14: Similarity Searches in Sequence Databases

Previous Approaches (2)

LB-Scan [Yi98]
Read every data sequence from the database.
Apply the lower-bound distance function D_lb, which satisfies the following lower-bound theorem: D_lb(S, Q) ≤ D_TW(S, Q).
Faster than the original time warping distance function (O(|S|+|Q|) vs. O(|S||Q|)).
Guarantees no false dismissal.
Based on sequential scanning.

Page 15: Similarity Searches in Sequence Databases

Proposed Approach

Goal No false dismissal High query processing performance

Sketch Extract a time-warping invariant feature vector Build a multi-dimensional index Use a lower-bound distance function for filtering

Page 16: Similarity Searches in Sequence Databases

Proposed Approach (2)

Feature Extraction
F(S) = ⟨First(S), Last(S), Max(S), Min(S)⟩
F(S) is invariant to time warping transformation.

Distance Function for Feature Vectors

$$D_{FT}(F(S), F(Q)) = \max \begin{cases} |\mathrm{First}(S) - \mathrm{First}(Q)| \\ |\mathrm{Last}(S) - \mathrm{Last}(Q)| \\ |\mathrm{Max}(S) - \mathrm{Max}(Q)| \\ |\mathrm{Min}(S) - \mathrm{Min}(Q)| \end{cases}$$

Page 17: Similarity Searches in Sequence Databases

Proposed Approach (3)

Distance Function for Feature Vectors (2)
Satisfies the lower-bounding theorem:

D_FT(F(S), F(Q)) ≤ D_TW(S, Q)

More accurate than D_lb proposed in LB-Scan.
Faster than D_lb (O(1) vs. O(|S|+|Q|)).
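The four-element feature vector and its distance function can be sketched as below. The names `features` and `d_ft` are illustrative; the O(1) cost quoted on the slide applies to `d_ft` once the feature vectors are already extracted.

```python
def features(seq):
    """F(S) = (First, Last, Max, Min) of a non-empty sequence."""
    return (seq[0], seq[-1], max(seq), min(seq))


def d_ft(fs, fq):
    """Largest absolute difference between corresponding features."""
    return max(abs(a - b) for a, b in zip(fs, fq))


s = [4, 5, 6, 7, 6, 6]
q = [3, 4, 3]
print(features(s), features(q), d_ft(features(s), features(q)))

# Stretching S along the time axis leaves its feature vector unchanged.
stretched = [4, 4, 5, 6, 7, 7, 6, 6]
print(features(stretched) == features(s))
```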

Page 18: Similarity Searches in Sequence Databases

Proposed Approach (4)

Indexing
Build a multi-dimensional index from a set of feature vectors.
Index entry: ⟨First(S), Last(S), Max(S), Min(S), Identifier(S)⟩

Query Processing
Extract a feature vector F(Q).
Perform range queries in index space to find data points included in the following query rectangle:
[First(Q) − ε, First(Q) + ε], [Last(Q) − ε, Last(Q) + ε], [Max(Q) − ε, Max(Q) + ε], [Min(Q) − ε, Min(Q) + ε]
Perform post-processing to discard false alarms.
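The filter-and-refine flow above can be sketched as follows. Two things here are assumptions rather than the authors' implementation: a plain linear scan over feature vectors stands in for the R-tree range query, and `dtw_linf` takes the maximum element-pair distance along the best warping path, which is one reading of "L∞ as a distance function for each element pair". All function names are hypothetical.

```python
def features(seq):
    """F(S) = (First, Last, Max, Min)."""
    return (seq[0], seq[-1], max(seq), min(seq))


def dtw_linf(s, q):
    """Time warping distance where a path costs its largest pair distance."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(q)
    for x in s:
        cur = [inf]
        for j, y in enumerate(q, start=1):
            best = min(prev[j], cur[j - 1], prev[j - 1])
            cur.append(max(abs(x - y), best))
        prev = cur
    return prev[len(q)]


def range_search(database, query, eps):
    """Return (candidates, answers) for a dict of id -> sequence."""
    fq = features(query)
    # Filtering: keep sequences whose feature point lies in the query rectangle.
    candidates = [sid for sid, seq in database.items()
                  if all(abs(a - b) <= eps for a, b in zip(features(seq), fq))]
    # Post-processing: discard false alarms with the real distance.
    answers = [sid for sid in candidates
               if dtw_linf(database[sid], query) <= eps]
    return candidates, answers


db = {"a": [4, 5, 6, 7, 6, 6], "b": [3, 3, 4, 3], "c": [3, 4, 3, 4, 3]}
print(range_search(db, [3, 4, 3], 0.5))
```

In this toy database, sequence "c" passes the rectangle test but is rejected in post-processing, so it is a false alarm; no similar sequence is lost in the filtering step.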

Page 19: Similarity Searches in Sequence Databases

Performance Evaluation

Implementation Implemented with C++ on UNIX operating system R-tree is used as a multi-dimensional index.

Experimental Setup S&P 500 stock data set (m=545, L=232) Random walk synthetic data set SunSparc Ultra-5

Page 20: Similarity Searches in Sequence Databases

Performance Evaluation (2)

Filtering Ratio: better than LB-Scan.

[Figure: filtering ratio (%) versus distance tolerance (2, 4, 6) for Naïve-Scan, LB-Scan, and ours; vertical axis from 0.00 to 7.00]

Page 21: Similarity Searches in Sequence Databases

Performance Evaluation (3)

Query Processing Time: faster than LB-Scan and Naïve-Scan.

[Figure: elapsed time (sec) versus distance tolerance (2, 4, 6) for Naïve-Scan, LB-Scan, and ours; vertical axis from 0.00 to 1.00]

Page 22: Similarity Searches in Sequence Databases

Contents

Introduction Whole Sequence Searches Subsequence Searches Segment-Based Subsequence Searches Multi-Dimensional Subsequence Searches Conclusion

Page 23: Similarity Searches in Sequence Databases

Problem Definition

Input: a set of data sequences {S}, a query sequence q, and a distance tolerance ε

Output: the set of subsequences whose distances to q are within ε

Similarity Measure: time warping distance function D_TW, with any Lp metric as the distance function for element pairs

Page 24: Similarity Searches in Sequence Databases

Previous Approaches

Naïve-Scan [Ber96]
Read every data subsequence from the database.
Apply the dynamic programming technique.
For m data sequences with average length L, the cost is O(mL²|q|).

Page 25: Similarity Searches in Sequence Databases

Previous Approaches (2)

ST-Index [Fal94]
Assume that the minimum query length (w) is known in advance.
Place a sliding window of size w at every possible location.
Extract a feature vector inside the window.
Map each feature vector into a point and group trails into MBRs (Minimum Bounding Rectangles).
Use Euclidean distance in index space for filtering.
Could not guarantee "no false dismissal".

Page 26: Similarity Searches in Sequence Databases

Proposed Approach

Goal
No false dismissal
High performance
Support for diverse similarity measures

Sketch
Convert sequences into sequences of discrete symbols.
Build a sparse suffix tree.
Use a lower-bound distance function for filtering.
Apply branch-pruning to reduce the search space.

Page 27: Similarity Searches in Sequence Databases

Proposed Approach (2)

Conversion
Generate categories from the distribution of element values, using one of:
Maximum-entropy method
Equal-interval method
DISC method

Convert each element to the symbol of the corresponding category.

Example: A = [0, 1.0], B = [1.1, 2.0], C = [2.1, 3.0], D = [3.1, 4.0]

S = 1.3, 1.6, 2.9, 3.3, 1.5, 0.1
SC = B, B, C, D, B, A
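The conversion step can be sketched with the four categories from the example. The `CATEGORIES` table and the function name `to_symbols` are illustrative; how the category boundaries are generated (maximum-entropy, equal-interval, or DISC) is outside this sketch.

```python
# Category ranges copied from the example: (symbol, min, max).
CATEGORIES = [("A", 0.0, 1.0), ("B", 1.1, 2.0), ("C", 2.1, 3.0), ("D", 3.1, 4.0)]


def to_symbols(seq, categories=CATEGORIES):
    """Replace each numeric element with the symbol of its category."""
    symbols = []
    for v in seq:
        for name, lo, hi in categories:
            if lo <= v <= hi:
                symbols.append(name)
                break
        else:
            raise ValueError(f"value {v} falls outside every category")
    return symbols


print(to_symbols([1.3, 1.6, 2.9, 3.3, 1.5, 0.1]))
```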

Page 28: Similarity Searches in Sequence Databases

Proposed Approach (3)

Indexing
Extract suffixes from sequences of discrete symbols.

Example: from S1C = A, B, B, A, we extract four suffixes: ABBA, BBA, BA, A

Page 29: Similarity Searches in Sequence Databases

Proposed Approach (4)

Indexing (2)
Build a suffix tree.

The suffix tree was originally proposed to retrieve substrings that exactly match a query string.

A suffix tree consists of nodes and edges.
Each suffix is represented by the path from the root node to a leaf node.
The labels on the path from the root to an internal node Ni represent the longest common prefix of the suffixes under Ni.

A suffix tree is built with computation and space complexity O(mL).

Page 30: Similarity Searches in Sequence Databases

Proposed Approach (4)

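Indexing (3)
Example: suffix tree built from S1C = A, B, B, A and S2C = A, B

[Figure: suffix tree with edges labeled A, B, BA, and the terminator $; its six leaves correspond to the suffixes S1C[1:-], S1C[2:-], S1C[3:-], S1C[4:-], S2C[1:-], and S2C[2:-]]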
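To make the index structure concrete, the sketch below builds an uncompressed suffix trie with nested dictionaries. This is a simplification swapped in for illustration: a real suffix tree merges single-child chains into one edge (such as the edge BA in the figure) and is built in linear time, which this quadratic sketch does not attempt. The function name and the leaf format are hypothetical.

```python
def build_suffix_trie(sequences):
    """Insert every suffix of every symbol sequence into a nested-dict trie.

    A "$" key marks the end of a suffix and stores (sequence id, start position),
    with positions counted from 1 as on the slides.
    """
    root = {}
    for name, seq in sequences.items():
        for start in range(len(seq)):
            node = root
            for sym in seq[start:]:
                node = node.setdefault(sym, {})
            node.setdefault("$", []).append((name, start + 1))
    return root


trie = build_suffix_trie({"S1": "ABBA", "S2": "AB"})
print(trie["A"]["B"]["$"])   # the suffix S2[1:-] ends here
print(trie["B"]["A"]["$"])   # the suffix S1[3:-] ends here
```

Suffixes that share a prefix share a path from the root, which is what later allows one distance table to be reused for many subsequences.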

Page 31: Similarity Searches in Sequence Databases

Proposed Approach (5)

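Query Processing

[Figure: the query (q, ε) enters Index Searching, which uses the suffix tree and outputs candidates; Post Processing checks the candidates against the data sequences and outputs the answers]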

Page 32: Similarity Searches in Sequence Databases

Proposed Approach (6)

Index Searching
Visit each node of the suffix tree by depth-first traversal.
Build a lower-bound distance table for q and the edge labels.
Inspect the last columns of newly added rows to find candidates.
Apply branch-pruning to reduce the search space.

Branch-pruning theorem: if all columns of the last row of the distance table have values larger than the distance tolerance ε, adding more rows to this table does not yield new values less than or equal to ε.

Page 33: Similarity Searches in Sequence Databases

Proposed Approach (7)

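Index Searching (2)
Example: q = 2, 2, 1 and ε = 1.5

Tree fragment: the edge from N1 to N2 is labeled A; the edges from N2 to N3 and from N2 to N4 are labeled B and D.

Distance table at N2 (one column per element of q = 2, 2, 1):
A: 1, 2, 2

Distance table at N3:
A: 1, 2, 2
B: 1, 1, 1.1

Distance table at N4:
A: 1, 2, 2
D: 2.1, 2.1, 4.1

At N3 the last column holds 1.1 ≤ ε, so the path A, B yields a candidate. At N4 every column of the last row exceeds ε, so the branch below N4 is pruned.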
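The traversal with branch-pruning can be sketched as below. Several things are assumptions made for illustration: the tree is a nested-dict trie with one symbol per edge, the category ranges are those of the earlier conversion example, and the base distance uses p = 1. All names are hypothetical.

```python
# Category ranges assumed from the conversion example: symbol -> (min, max).
RANGES = {"A": (0.0, 1.0), "B": (1.1, 2.0), "C": (2.1, 3.0), "D": (3.1, 4.0)}


def base_lb(symbol, v):
    """Smallest possible |x - v| for any x inside the symbol's range."""
    lo, hi = RANGES[symbol]
    if v < lo:
        return lo - v
    if v > hi:
        return v - hi
    return 0.0


def add_row(prev_row, symbol, q):
    """Append one row (one more symbol) to the lower-bound distance table."""
    row = []
    for j, v in enumerate(q):
        if prev_row is None:                      # first row of the table
            best = 0.0 if j == 0 else row[j - 1]
        elif j == 0:
            best = prev_row[0]
        else:
            best = min(prev_row[j], row[j - 1], prev_row[j - 1])
        row.append(base_lb(symbol, v) + best)
    return row


def search(trie, q, eps, prev_row=None, path=""):
    """Depth-first traversal; returns the symbol paths that are candidates."""
    hits = []
    for symbol, child in trie.items():
        row = add_row(prev_row, symbol, q)
        label = path + symbol
        if row[-1] <= eps:            # last column: whole q matched this path
            hits.append(label)
        if min(row) <= eps:           # otherwise the branch is pruned
            hits.extend(search(child, q, eps, row, label))
    return hits


trie = {"A": {"B": {}, "D": {"C": {}}}}
print(search(trie, [2, 2, 1], 1.5))
```

With q = 2, 2, 1 and ε = 1.5 the rows for A, then B, then D reproduce the tables of the example, and the subtree under D is never visited.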

Page 34: Similarity Searches in Sequence Databases

Proposed Approach (8)

Lower-Bound Distance Function DTW-LB

$$D_{BASE\text{-}LB}(A, v) = \begin{cases} 0 & \text{if } v \text{ is within the range of } A \\ (A.min - v)^p & \text{if } v \text{ is smaller than } A.min \\ (v - A.max)^p & \text{if } v \text{ is larger than } A.max \end{cases}$$

[Figure: three cases showing v inside, below, and above the range [A.min, A.max], with possible minimum distance 0, (A.min − v)^p, and (v − A.max)^p respectively]

Page 35: Similarity Searches in Sequence Databases

Proposed Approach (9)

Lower-Bound Distance Function DTW-LB (2)

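D_TW-LB satisfies the lower-bounding theorem: D_TW-LB(sC, q) ≤ D_TW(s, q).

Its computation complexity is O(|sC||q|).

$$D_{TW\text{-}LB}(s^C, q) = D_{BASE\text{-}LB}(s^C[1], q[1]) + \min \begin{cases} D_{TW\text{-}LB}(s^C, q[2{:}-]) \\ D_{TW\text{-}LB}(s^C[2{:}-], q) \\ D_{TW\text{-}LB}(s^C[2{:}-], q[2{:}-]) \end{cases}$$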
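The lower-bounding property can be checked numerically with a sketch that follows the recursive definition directly. The memoized helper `warp` is a hypothetical generalization that accepts any base distance, so the same recursion serves both D_TW and D_TW-LB; the category ranges are assumed from the conversion example and p defaults to 1.

```python
from functools import lru_cache

# Category ranges assumed from the conversion example: symbol -> (min, max).
RANGES = {"A": (0.0, 1.0), "B": (1.1, 2.0), "C": (2.1, 3.0), "D": (3.1, 4.0)}


def base_lb(symbol, v, p=1):
    """Lower-bound base distance between a category and a value."""
    lo, hi = RANGES[symbol]
    if v < lo:
        return (lo - v) ** p
    if v > hi:
        return (v - hi) ** p
    return 0.0


def warp(xs, q, base):
    """Recursive time warping distance with a pluggable base distance."""
    @lru_cache(maxsize=None)
    def rec(i, j):
        d = base(xs[i], q[j])
        last_i, last_j = i == len(xs) - 1, j == len(q) - 1
        if last_i and last_j:
            return d
        options = []
        if not last_j:
            options.append(rec(i, j + 1))      # advance q only
        if not last_i:
            options.append(rec(i + 1, j))      # advance xs only
        if not last_i and not last_j:
            options.append(rec(i + 1, j + 1))  # advance both
        return d + min(options)
    return rec(0, 0)


s = [1.3, 1.6, 2.9, 3.3, 1.5, 0.1]
s_c = "BBCDBA"                      # symbols of s from the conversion example
q = [2, 2, 1]
lower = warp(s_c, q, base_lb)
exact = warp(s, q, lambda a, b: abs(a - b))
print(lower, exact, lower <= exact)
```

Because every element of s lies inside the range of its symbol, each base term of the lower bound is no larger than the exact base term, so the inequality holds for every warping path.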

Page 36: Similarity Searches in Sequence Databases

Proposed Approach (10)

Computation Complexity

$$O\left( \frac{m L^2 |q|}{R_D R_P} + n L |q| \right)$$

m is the number of data sequences.
L is the average length of data sequences.
The left term is for index searching.
The right term is for post-processing.
R_P (≥ 1) is the reduction factor by branch-pruning.
R_D (≥ 1) is the reduction factor by sharing distance tables.
n is the number of subsequences requiring post-processing.

Page 37: Similarity Searches in Sequence Databases

Proposed Approach (11)

Sparse Indexing
The index size is linear in the number of suffixes stored.
To reduce the index size, we build a sparse suffix tree (SST). That is, we store the suffix SC[i:-] only if SC[i] ≠ SC[i–1].

Compaction Ratio

$$C = \frac{\text{total number of suffixes}}{\text{number of suffixes stored}}$$

Example: SC = A, A, A, A, C, B, B
Store only three suffixes (SC[1:-], SC[5:-], and SC[6:-]).
Compaction ratio C = 7/3
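The storage rule for the sparse suffix tree can be sketched as follows. The function names are hypothetical; positions are counted from 1 to match the slides.

```python
def stored_suffix_positions(symbols):
    """Start positions (from 1) of suffixes kept in the sparse suffix tree.

    A suffix is stored only if its first symbol differs from the symbol
    just before it; the first suffix is always stored.
    """
    return [i + 1 for i, sym in enumerate(symbols)
            if i == 0 or sym != symbols[i - 1]]


def compaction_ratio(symbols):
    """Total number of suffixes divided by the number of suffixes stored."""
    return len(symbols) / len(stored_suffix_positions(symbols))


print(stored_suffix_positions("AAAACBB"))
print(compaction_ratio("AAAACBB"))
```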

Page 38: Similarity Searches in Sequence Databases

Proposed Approach (12)

Sparse Indexing (2)
When traversing the suffix tree, we need to find non-stored suffixes and compute their distances to q.
Assume that the first k elements of sC have the same value. Then sC[1:-] is stored but sC[i:-] (i = 2, 3, …, k) is not stored.

For non-stored suffixes, we introduce another lower-bound distance function:

$$D_{TW\text{-}LB2}(s^C[i{:}-], q) = D_{TW\text{-}LB}(s^C, q) - (i - 1) \cdot D_{BASE\text{-}LB}(s^C[1], q[1])$$

D_TW-LB2 satisfies the lower-bounding theorem.
D_TW-LB2 is O(1) when D_TW-LB(sC, q) is given.

Page 39: Similarity Searches in Sequence Databases

Proposed Approach (13)

Sparse Indexing (3) With sparse indexing, the complexity becomes:

m is the number of data sequences.
L is the average length of data sequences.
C is the compaction ratio.
n is the number of subsequences requiring post-processing.
R_P (≥ 1) is the reduction factor by branch-pruning.
R_D (≥ 1) is the reduction factor by sharing distance tables.

[Equation garbled in the transcript: an O(·) bound whose recoverable terms are mL²|q|, C, R_D, R_P, and nL|q|]

Page 40: Similarity Searches in Sequence Databases

Performance Evaluation

Implementation Implemented with C++ on UNIX operating system

Experimental Setup S&P 500 stock data set (m=545, L=232) Random walk synthetic data set Maximum-Entropy (ME) categorization Disk-based suffix tree construction algorithm SunSparc Ultra-5

Page 41: Similarity Searches in Sequence Databases

Performance Evaluation (2)

Comparison with Naïve-Scan: increasing distance tolerances; S&P 500 stock data set, |q| = 20

[Figure: query processing time (sec) versus distance tolerance (5, 10, 20, 30, 40, 50) for Naïve-Scan and SST; vertical axis from 0 to 250]

Page 42: Similarity Searches in Sequence Databases

Performance Evaluation (3)

Scalability Test: increasing average length of data sequences; random-walk data set, |q| = 20, m = 200

Query processing time (sec) by average length of data sequences:

Average length: 200, 400, 600, 800, 1000
Naïve-Scan: 52.84, 215.08, 486.05, 864.08, 1349.92
SST: 2.49, 10.17, 23.98, 42.27, 82.89

Page 43: Similarity Searches in Sequence Databases

Performance Evaluation (4)

Scalability Test (2): increasing total number of data sequences; random-walk data set, |q| = 20, L = 200

Query processing time (sec) by total number of data sequences:

Number of sequences: 100, 3000, 6000, 10000
Naïve-Scan: 266, 798.71, 1596.36, 2679.9
SST: 21, 60.35, 124.49, 215.92

Page 44: Similarity Searches in Sequence Databases

Contents

Introduction Whole Sequence Searches Subsequence Searches Segment-Based Subsequence Searches Multi-Dimensional Subsequence Searches Conclusion

Page 45: Similarity Searches in Sequence Databases

Introduction

We extend the proposed subsequence searching method to large sequence databases.

In the retrieval of similar subsequences with the time warping distance function, sequential scanning is O(mL²|q|) and the proposed method is O(mL²|q| / R) (R ≥ 1).

The quadratic dependence on L makes both search algorithms suffer severe performance degradation when L is very large.

For a database with long sequences, we need a new searching scheme whose cost is linear in L.

Page 46: Similarity Searches in Sequence Databases

SBASS

We propose a new searching scheme: Segment-Based Subsequence Searching scheme (SBASS)

Sequences are divided into a series of piece-wise segments.

When a query sequence q with k segments is submitted, q is compared with those subsequences which consist of k consecutive data segments.

The lengths of segments may be different.
SS represents the segmented sequence of S.

S = 4,5,8,9,11,8,4,3, |S| = 8
SS = ⟨4,5,8,9,11⟩, ⟨8,4,3⟩, |SS| = 2

Page 47: Similarity Searches in Sequence Databases

SBASS (2)

Only four subsequences of SS are compared with qS:

⟨SS[1],SS[2]⟩, ⟨SS[2],SS[3]⟩, ⟨SS[3],SS[4]⟩, ⟨SS[4],SS[5]⟩

[Figure: a data sequence S divided into five segments SS[1] to SS[5], and a query sequence divided into two segments qS[1] and qS[2]]
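The candidate subsequences under SBASS are simply the windows of k consecutive segments, which can be sketched as below. The function name `segment_windows` and the placeholder segment labels are illustrative.

```python
def segment_windows(segments, k):
    """All runs of k consecutive segments from a segmented sequence."""
    return [segments[i:i + k] for i in range(len(segments) - k + 1)]


data_segments = ["SS[1]", "SS[2]", "SS[3]", "SS[4]", "SS[5]"]
for window in segment_windows(data_segments, 2):
    print(window)
```

A sequence with |SS| segments yields |SS| − k + 1 windows, so the number of comparisons grows linearly with the number of segments rather than quadratically with the number of elements.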

Page 48: Similarity Searches in Sequence Databases

SBASS (3)

For the SBASS scheme, we define the piece-wise time warping distance function (where k = |qS| = |sS|):

$$D_{PTW}(s^S, q^S) = \left( \sum_{i=1}^{k} \left( D_{TW}(s^S[i], q^S[i]) \right)^p \right)^{1/p}$$

Sequential scanning for the SBASS scheme is O(mL|q|).

We introduce an indexing technique with O(mL|q|/R) (R ≥ 1).
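The piece-wise distance can be sketched by applying the ordinary time warping distance to each pair of corresponding segments and combining the results. The helper `dtw`, the function name `d_ptw`, and the query segmentation used in the demonstration are illustrative assumptions.

```python
def dtw(s, q, p=1):
    """Time warping distance with base distance |a - b|**p."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(q)
    for x in s:
        cur = [inf]
        for j, y in enumerate(q, start=1):
            cur.append(abs(x - y) ** p + min(prev[j], cur[j - 1], prev[j - 1]))
        prev = cur
    return prev[len(q)]


def d_ptw(s_segments, q_segments, p=1):
    """Piece-wise time warping distance over k aligned segment pairs."""
    if len(s_segments) != len(q_segments):
        raise ValueError("both segmented sequences need the same number of segments")
    total = sum(dtw(s, q, p) ** p for s, q in zip(s_segments, q_segments))
    return total ** (1.0 / p)


s_seg = [[4, 5, 8, 9, 11], [8, 4, 3]]   # segmented S from the SBASS example
q_seg = [[4, 8, 11], [8, 3]]            # hypothetical segmented query
print(d_ptw(s_seg, q_seg))
```

Warping is confined to each segment pair, so the cost is the sum of small per-segment tables instead of one table over the whole subsequence.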

Page 49: Similarity Searches in Sequence Databases

Sketch of Proposed Approach

Indexing Convert sequences to segmented sequences. Extract a feature vector from each segment. Categorize feature vectors. Convert segmented sequences to sequences of

symbols. Construct suffix tree from sequences of symbols.

Query Processing Traverse the suffix tree to find candidates. Discard false alarms in post processing.

Page 50: Similarity Searches in Sequence Databases

Segmentation

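Approach
Divide at peak points.
Divide further if the maximum deviation from the interpolation line is too large.
Eliminate noise.

Compaction Ratio (C) = |S| / |SS|

[Figure: a sequence with segment boundaries, one segment marked as having too large a deviation and some points marked as noise]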
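Only the first step, dividing at peak points, is sketched here; the further division by maximum deviation and the noise elimination are not specified in enough detail on the slides to reproduce. The function name and the treatment of flat runs (they do not count as a change of direction) are assumptions.

```python
def segment_at_peaks(seq):
    """Split a non-empty sequence after each local maximum or minimum."""
    segments, current, direction = [], [seq[0]], 0
    for prev, cur in zip(seq, seq[1:]):
        step = (cur > prev) - (cur < prev)        # +1 rising, -1 falling, 0 flat
        if step != 0 and direction != 0 and step != direction:
            segments.append(current)              # direction flipped: close segment
            current = []
        if step != 0:
            direction = step
        current.append(cur)
    segments.append(current)
    return segments


s = [4, 5, 8, 9, 11, 8, 4, 3]
parts = segment_at_peaks(s)
print(parts)
print(len(s) / len(parts))   # compaction ratio |S| / |SS|
```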

Page 51: Similarity Searches in Sequence Databases

Feature Extraction

From each subsequence segment, extract a feature vector:

(V1, VL, L, +, –)

[Figure: a segment annotated with V1, VL, L, and +]
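The symbols of the last two features are lost in the transcript. The sketch below assumes they are the largest deviations above and below the straight line joining the first and last values of the segment; this reading is inferred from the worked example on the categorization slide, SF = (4,11,8,2,1), (8,3,3,0,1.5), and is not confirmed by the source. The function name is hypothetical.

```python
def segment_features(segment):
    """(first value, last value, length, max deviation above, max deviation below).

    Deviations are measured from the line interpolating the first and last
    values; this interpretation is an assumption.
    """
    first, last, length = segment[0], segment[-1], len(segment)
    above = below = 0.0
    for i, value in enumerate(segment):
        # Point on the interpolation line at position i.
        line = first if length == 1 else first + (last - first) * i / (length - 1)
        above = max(above, value - line)
        below = max(below, line - value)
    return (first, last, length, above, below)


print(segment_features([4, 5, 8, 8, 8, 8, 9, 11]))
print(segment_features([8, 4, 3]))
```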

Page 52: Similarity Searches in Sequence Databases

Categorization and Index Construction

Categorization
Group similar feature vectors together using multi-dimensional categorization methods such as the Multi-attribute Type Abstraction Hierarchy (MTAH).
Assign a unique symbol to each category.
Convert segmented sequences to sequences of symbols.

S = 4,5,8,8,8,8,9,11,8,4,3
SS = ⟨4,5,8,8,8,8,9,11⟩, ⟨8,4,3⟩
SF = (4,11,8,2,1), (8,3,3,0,1.5)
SC = A, B

From the sequences of symbols, construct the suffix tree.

Page 53: Similarity Searches in Sequence Databases

Query Processing

For query processing, we calculate lower-bound distances between symbols and keep them in a table.

Given the query sequence q and the distance tolerance ε:

Convert q to qS and then to qC.
Search the suffix tree to find those subsequences whose lower-bound distances to qC are within ε.
Discard false alarms in post-processing.

Page 54: Similarity Searches in Sequence Databases

Query Processing (2)

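[Figure: q and ε enter; q is converted to qS and then to qC; Index Searching uses the suffix tree and outputs candidates; Post Processing uses the data sequences and outputs the answers]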

Page 55: Similarity Searches in Sequence Databases

Computation Complexity

Sequential scanning is O(mL|q|). Complexity of the proposed search algorithm is :

n is the number of subsequences contained in candidates.

C is the compaction ratio or the average number of elements in segments.

R_D (≥ 1) is the reduction factor by sharing edges of the suffix tree.

[Equation garbled in the transcript: an O(·) bound whose recoverable terms are mL|q|, C, R_D, and a squared factor]

Page 56: Similarity Searches in Sequence Databases

Performance Evaluation

Test set: pseudo-periodic synthetic sequences, m = 100, L = 10,000

Achieved up to 6.5 times speed-up compared to sequential scanning.

[Figure: time (sec) versus distance tolerance (0.2 to 1.0) for SeqScan and our approach; vertical axis from 10 to 60]

Page 57: Similarity Searches in Sequence Databases

Contents

Introduction Whole Sequence Searches Subsequence Searches Segment-Based Subsequence Searches Multi-Dimensional Subsequence Searches Conclusion

Page 58: Similarity Searches in Sequence Databases

Introduction

So far, we assumed that elements have single-dimensional numeric values.

Now, we consider multi-dimensional sequences.

Image Sequences Video Streams

Medical Image Sequence

Page 59: Similarity Searches in Sequence Databases

Introduction (2)

In multi-dimensional sequences, elements are represented by feature vectors. S = S[1], …, S[N], S[i] = (S[i][1], …, S[i][F])

Our proposed subsequence searching techniques are extended to the retrieval of similar multi-dimensional subsequences.

Page 60: Similarity Searches in Sequence Databases

Introduction (3)

Multi-Dimensional Time Warping Distance

$$D_{MTW}(S, Q) = D_{MBASE}(S[1], Q[1]) + \min \begin{cases} D_{MTW}(S, Q[2{:}-]) \\ D_{MTW}(S[2{:}-], Q) \\ D_{MTW}(S[2{:}-], Q[2{:}-]) \end{cases}$$

$$D_{MBASE}(S[1], Q[1]) = \sum_{i=1}^{F} W_i \, |S[1][i] - Q[1][i]|$$

F is the number of features in each element.
W_i is the weight of the i-th dimension.
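The multi-dimensional variant only changes the base distance, as the sketch below shows. The function names and the sample data are illustrative assumptions; each element is a tuple of F feature values and `weights` holds W_1 to W_F.

```python
def d_mbase(x, y, weights):
    """Weighted sum of per-feature absolute differences between two elements."""
    return sum(w * abs(a - b) for w, a, b in zip(weights, x, y))


def d_mtw(s, q, weights):
    """Multi-dimensional time warping distance via dynamic programming."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(q)
    for x in s:
        cur = [inf]
        for j, y in enumerate(q, start=1):
            best = min(prev[j], cur[j - 1], prev[j - 1])
            cur.append(d_mbase(x, y, weights) + best)
        prev = cur
    return prev[len(q)]


s = [(1, 2), (3, 4)]
q = [(1, 2), (3, 5)]
print(d_mtw(s, q, (0.5, 0.5)))
```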

Page 61: Similarity Searches in Sequence Databases

Sketch of Our Approach

Indexing
Categorize multi-dimensional element values using MTAH.
Assign unique symbols to categories.
Convert multi-dimensional sequences into sequences of symbols.
Construct a suffix tree from the set of sequences of symbols.

Query Processing
Traverse the suffix tree.
Find candidates whose lower-bound distances to q are within ε.
Do post-processing to discard false alarms.

Page 62: Similarity Searches in Sequence Databases

Application to KMeD

In the environment of KMeD, the proposed technique is applied to the retrieval of medical image sequences having similar spatio-temporal characteristics to those of the query sequence.

KMeD [CCT:95] has the following features:
Query by both image and alphanumeric contents
Model temporal, spatial, and evolutionary nature of objects
Formulate queries using conceptual and imprecise terms
Support cooperative processing

Page 63: Similarity Searches in Sequence Databases

Application to KMeD (2)

Query Medical Image Sequence Attribute names and their relative weights Distance tolerance

Size(0.3)

Circularity (0.1)

DistFromLV (0.6)

Page 64: Similarity Searches in Sequence Databases

Application to KMeD (3)

[Figure: KMeD query-processing diagram with components Query, Query Analysis, User Model, Contour Extraction, Feature Extraction, Distance Function, Similarity Searches, Index Structure, and Visual Presentation; labeled flows: medical image seq., matching seq., feedback]

Page 65: Similarity Searches in Sequence Databases

Contents

Introduction Whole Sequence Searches Subsequence Searches Segment-Based Subsequence Searches Multi-Dimensional Subsequence Searches Conclusion

Page 66: Similarity Searches in Sequence Databases

Summary

A sequence is an ordered list of elements.
Similarity search helps in clustering and data mining.
For sequences of different lengths or different sampling rates, time warping distance is useful.
We proposed the whole sequence searching method with a spatial access method and a lower-bound distance function.
We proposed the subsequence searching method with a suffix tree and lower-bound distance functions.
We proposed the segment-based subsequence searching method for large sequence databases.
We extended the subsequence searching method to the retrieval of similar multi-dimensional subsequences.

Page 67: Similarity Searches in Sequence Databases

Contribution

We proposed a tighter and faster lower-bound distance function for efficient whole sequence searches without false dismissal.

We demonstrated the feasibility of using time warping similarity measure on a suffix tree.

We introduced the branch pruning theorem and the fast lower-bound distance function for efficient subsequence searches without false dismissal.

We applied categorization and sparse indexing for scalability.

We applied the proposed technique to a real application (KMeD).