an efficient index structure for string databases


Post on 20-Mar-2016


Page 1: An Efficient Index Structure for String Databases

An Efficient Index Structure for String Databases

Tamer Kahveci, Ambuj K. Singh

Presented by Atul Ugalmugale / Nikita Rasam

Issue

• Quickly find the substrings in a large database that are similar to a given query string, using a small index structure.

• Some applications store, search and analyze long sequences of discrete characters, which we call “strings”.

• There is a frequent need to find similarities between genetic data, web data and event sequences.

Applications

• Information Retrieval: a typical application of information retrieval is text searching; given a large collection of documents and some text keywords, we want to find the documents that contain these keywords.

• Searching keywords on the web: a query for “mtallica” usually means “metallica”.

• Computational Biology: the problem is similar in computational biology; here we have a long DNA sequence and want to find subsequences in it that approximately match a query sequence.

…ATGCATACGATCGATT…
…TGCAATGGCTTAGCTA…

Animal species from the same family are bound to have more similar DNA.

• Video data can be viewed as an event sequence if a pre-specified set of events is detected and stored as a sequence. Searching for similar event subsequences can be used to find related video segments.

• String search algorithms proposed so far are in-memory algorithms.

• They scan the whole database for each query.

• The size of string databases grows faster than the available memory capacity, and extensive memory requirements make the search techniques impractical.

• They suffer from disk I/Os when the database is too large.

• Their performance deteriorates for long query patterns.

Similarity Metrics

• The difference between two strings s1 and s2 is generally defined as the minimum number of edit operations needed to transform s1 into s2, called the “edit distance” (ED).

• Edit operations:
– Insert
– Delete
– Replace

Suppose we have two strings x and y, e.g. x = kitten, y = sitting, and we want to transform x into y. A closer look:

k i t t e n
s i t t i n g

1st step: kitten → sitten (Replace)
2nd step: sitten → sittin (Replace)
3rd step: sittin → sitting (Insert)

• What is the edit distance between “survey” and “surgery”?
• s u r v e y ---> s u r g e y replace (+1)
  ---> s u r g e r y insert (+1)
• Edit distance = 2

• In the general version of edit distance, different operations may have different costs, or the costs may depend on the characters involved.

• For example, replacement could be more expensive than insertion, or replacing “a” with “o” could be less expensive than replacing “a” with “k”.

• This is called the weighted edit distance.

Global Alignment

• The global alignment (or similarity) of s1 and s2 is defined as the maximum-valued alignment of s1 and s2.
– Given two strings s1 and s2, a global alignment is obtained by inserting spaces into s1 and s2 (including at the ends) so that they have the same length, and then writing one against the other.

• Example
– qacdbd & qawdb

qac_dbd
qa_wdb_

• Edits and alignments are dual.
– A sequence of edits can be converted into a global alignment.
– An alignment can be converted into a sequence of edits.

Local Alignment

Given two strings X and Y, find two substrings x and y of X and Y, respectively, such that their alignment score (in the global sense) is maximum over all pairs of such substrings (empty substrings are allowed).

S(x,y) = +2 if x = y
S(x,y) = -2 if x != y
S(x,y) = -1 if x = ‘_’ or y = ‘_’

X = pqraxabcstvq
Y = yxaxbacsll

x = axabcs
y = axbacs

a x a b _ c s
a x _ b a c s
+2 +2 -1 +2 -1 +2 +2 = +8
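The local alignment score can be computed with a Smith-Waterman style recurrence. The sketch below is illustrative and not taken from the slides: the function name `local_alignment_score` and its parameter names are assumptions, and the default scores simply mirror the +2 / -2 / -1 scheme of this example.

```python
def local_alignment_score(X, Y, match=2, mismatch=-2, gap=-1):
    """Best local alignment score of X and Y (Smith-Waterman style).

    Illustrative sketch; names and defaults are assumptions that follow
    the scoring used in the slide example.
    """
    best = 0
    prev = [0] * (len(Y) + 1)          # scores for the previous row of X
    for i in range(1, len(X) + 1):
        curr = [0] * (len(Y) + 1)
        for j in range(1, len(Y) + 1):
            s = match if X[i - 1] == Y[j - 1] else mismatch
            curr[j] = max(
                0,                      # start a fresh alignment here
                prev[j - 1] + s,        # align X[i-1] with Y[j-1]
                prev[j] + gap,          # X[i-1] against a space
                curr[j - 1] + gap,      # Y[j-1] against a space
            )
            best = max(best, curr[j])
        prev = curr
    return best


print(local_alignment_score("pqraxabcstvq", "yxaxbacsll"))
```

Only two rows of the table are kept, since the score alone does not require a traceback.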

String Matching Problem

• Whole Matching: find the edit distance ED(q,s) between a data string s and a query string q.

• Substring Matching: consider all substrings s[i:j] of s that are close to the query string.

• Two Types of Queries:
– Range search seeks all the substrings of s that are within an edit distance r of a given query q (r = query range).
– K-nearest neighbor search seeks the K substrings of s that are closest to q.

Challenges in solving the substring matching problem

• Finding the edit distance is very costly in terms of both time and space.
• The strings in the database may be very long.
• The database size for most applications grows exponentially.

New approach to overcome the challenges

• Define a lower bound distance for substring searching.
• Improve this lower bound by using the idea of wavelet transformation.
• Use the MRS index structure based on the aforementioned distance formulations.

A dynamic programming algorithm for computing the edit distance

• Problem: find the edit distance between strings x and y.
• Create a (|x|+1)×(|y|+1) matrix C, where Ci,j represents the minimum number of operations needed to match x1..i with y1..j. The matrix is constructed as follows.

• Ci,0 = i
• C0,j = j
• Ci,j = min{ Ci-1,j-1 + cost (replace), Ci,j-1 + 1 (insert), Ci-1,j + 1 (delete) }

cost = 0 if xi = yj, else 1
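The recurrence above translates almost line by line into code. This is a minimal sketch with unit costs; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(x, y):
    """Unit-cost edit distance between strings x and y (illustrative sketch)."""
    m, n = len(x), len(y)
    # C[i][j] = minimum number of operations to turn x[:i] into y[:j]
    C = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        C[i][0] = i                      # delete all i characters of x[:i]
    for j in range(n + 1):
        C[0][j] = j                      # insert all j characters of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            C[i][j] = min(
                C[i - 1][j - 1] + cost,  # replace (or match when cost is 0)
                C[i][j - 1] + 1,         # insert
                C[i - 1][j] + 1,         # delete
            )
    return C[m][n]


print(edit_distance("kitten", "sitting"))
print(edit_distance("survey", "surgery"))
```

The table has (|x|+1)×(|y|+1) entries, which is why both time and space grow with the product of the string lengths.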

How do we perform substring search?

• The same dynamic programming algorithm can be used to find the substrings of a text that are most similar to a query string q.

• The difference is that we set C0,j = 0 for all j, since any text position could be the start of a match.

• If the similarity distance bound is k, we report all positions j where Cm,j ≤ k (m is the last row, m = |q|).
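A small sketch of this variant follows, assuming the rows of the table are indexed by the query and the columns by the text. The function name `approximate_matches` is an assumption, and the reported positions are 1-based end positions in the text.

```python
def approximate_matches(q, text, k):
    """End positions j (1-based) in text where some substring ending at j
    is within edit distance k of q. Illustrative sketch."""
    m = len(q)
    prev = list(range(m + 1))        # column j = 0: C[i][0] = i
    ends = []
    for j, ch in enumerate(text, start=1):
        curr = [0] * (m + 1)         # C[0][j] = 0: a match may start anywhere
        for i in range(1, m + 1):
            cost = 0 if q[i - 1] == ch else 1
            curr[i] = min(
                prev[i - 1] + cost,  # replace or match
                curr[i - 1] + 1,     # query character with no text character
                prev[i] + 1,         # text character with no query character
            )
        if curr[m] <= k:             # last row: whole query has been matched
            ends.append(j)
        prev = curr
    return ends


print(approximate_matches("abc", "xxabcxx", 1))
```

Only one column is kept at a time, so memory is proportional to |q|, but every text position is still visited for every query.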

Frequency Vector

• Let s be a string from the alphabet Σ = {α1, ..., ασ}. Let ni be the number of occurrences of the character αi in s for 1 ≤ i ≤ σ; then the frequency vector is f(s) = [n1, ..., nσ].
• Example:
– s = AATGATAG
– f(s) = [nA, nC, nG, nT] = [4, 0, 2, 2]

• Let f(s) = [v1, ..., vσ] be the frequency vector of s; then $\sum_{i=1}^{\sigma} v_i = |s|$.

• An edit operation on s has one of the following effects on f(s), for 1 ≤ i, j ≤ σ and i != j:
– vi := vi + 1
– vi := vi - 1
– vi := vi + 1 and vj := vj - 1

Effect of Edit Operations on Frequency Vector

• Delete: decreases an entry by 1.
• Insert: increases an entry by 1.
• Replace: Insert + Delete.
• Example:
– s = AATGATAG => f(s) = [4, 0, 2, 2]
– (del. G), s = AAT.ATAG => f(s) = [4, 0, 1, 2]
– (ins. C), s = AACTATAG => f(s) = [4, 1, 1, 2]
– (A→C), s = ACCTATAG => f(s) = [3, 2, 1, 2]
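The frequency vectors in this example can be reproduced with a few lines of code. This is a sketch only: the function name `frequency_vector` and the default alphabet order "ACGT" are assumptions chosen to match the [nA, nC, nG, nT] layout used on the slides.

```python
from collections import Counter


def frequency_vector(s, alphabet="ACGT"):
    """Number of occurrences of each alphabet character in s (sketch)."""
    counts = Counter(s)
    return [counts[ch] for ch in alphabet]


# The sequence of edits from the example, one operation at a time.
for label, s in [
    ("original", "AATGATAG"),
    ("delete G", "AATATAG"),
    ("insert C", "AACTATAG"),
    ("A -> C", "ACCTATAG"),
]:
    print(label, frequency_vector(s))
```

Each step changes at most two entries of the vector, and each by exactly 1, which is what makes the vector useful for bounding the edit distance.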

Frequency Distance

• Let u and v be integer points in σ-dimensional space. The frequency distance FD1(u,v) between u and v is defined as the minimum number of steps needed to go from u to v (or equivalently from v to u) by moving to a neighbor point at each step.

frequency vector: f(s) = [n1, ..., nσ].

• Let s1 and s2 be two strings from the alphabet Σ = {α1, ..., ασ}; then
– FD1(f(s1), f(s2)) ≤ ED(s1, s2)

An Approximation to ED: Frequency Distance (FD1)

• s = AATGATAG => f(s) = [4, 0, 2, 2]
• q = ACTTAGC => f(q) = [2, 2, 1, 2]
– pos = (4-2) + (2-1) = 3
– neg = (2-0) = 2
– FD1(f(s),f(q)) = 3
– ED(q,s) = 4

• FD1(f(s1),f(s2)) = max{pos, neg}.

• FD1(f(s1),f(s2)) ≤ ED(s1,s2).

[Figure: the points f(q) and f(s), with FD1(f(q),f(s)) between them.]

Frequency Distance Calculation

/* u and v are σ-dimensional integer points */

Algorithm: FD1(u,v)
posDist := negDist := 0
For i := 1 to σ, accumulate the positive and negative differences, so that

$$\text{posDist} = \sum_{i:\,u_i > v_i} (u_i - v_i)$$

$$\text{negDist} = \sum_{i:\,u_i < v_i} (v_i - u_i)$$

FD1(u, v) = max { posDist, negDist }
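The FD1 algorithm is short enough to write out directly. The sketch below follows the posDist / negDist description; the function name `fd1` is an assumption, and the sample vectors are the f(s) and f(q) of the AATGATAG / ACTTAGC example.

```python
def fd1(u, v):
    """Frequency distance between two integer vectors of equal length (sketch)."""
    pos_dist = sum(a - b for a, b in zip(u, v) if a > b)  # entries where u exceeds v
    neg_dist = sum(b - a for a, b in zip(u, v) if a < b)  # entries where v exceeds u
    return max(pos_dist, neg_dist)


f_s = [4, 0, 2, 2]   # f("AATGATAG")
f_q = [2, 2, 1, 2]   # f("ACTTAGC")
print(fd1(f_s, f_q))
```

The cost is linear in the alphabet size σ and independent of the string lengths, in contrast to the quadratic edit distance table.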

Wavelet Vector Computation

Let s = c0c1…cn-1 be a string from the alphabet Σ = {α1, ..., ασ}. The k-th level wavelet transformation ψk(s), 0 ≤ k ≤ log2 n, of s is defined as ψk(s) = [vk,0, ..., vk,n/2^k-1], where vk,i = [Ak,i, Bk,i],

$$A_{k,i} = \begin{cases} f(c_i) & k = 0 \\ A_{k-1,2i} + A_{k-1,2i+1} & 0 < k \le \log_2 n \end{cases}$$

$$B_{k,i} = \begin{cases} 0 & k = 0 \\ A_{k-1,2i} - A_{k-1,2i+1} & 0 < k \le \log_2 n \end{cases}$$

for 0 ≤ i ≤ n/2^k - 1.

Using Local Information: Wavelet Decomposition of Strings

• s = AATGATAC => f(s) = [4, 1, 1, 2]

• s = AATG + ATAC = s1 + s2

• f(s1) = [2, 0, 1, 1]

• f(s2) = [2, 1, 0, 1]

• ψ1(s) = f(s1) + f(s2) = [4, 1, 1, 2]

• ψ2(s) = f(s1) - f(s2) = [0, -1, 1, 0]

Wavelet Decomposition of a String: General Idea

• Ai,j = f(s(j·2^i : (j+1)·2^i - 1))

• Bi,j = Ai-1,2j - Ai-1,2j+1

[Figure: ψ(s), with its first and second wavelet coefficients labeled.]

Wavelet Transformation: Example

s = TCAC, n = |s| = 4 (frequency vectors are ordered [A, C, T])

ψ0(s) = [v0,0, v0,1, v0,2, v0,3]
= [(A0,0, B0,0), (A0,1, B0,1), (A0,2, B0,2), (A0,3, B0,3)]
= [(f(T), 0), (f(C), 0), (f(A), 0), (f(C), 0)]
= [([0,0,1], 0), ([0,1,0], 0), ([1,0,0], 0), ([0,1,0], 0)]

ψ1(s) = [([0,1,1], [0,-1,1]), ([1,1,0], [1,-1,0])]

ψ2(s) = [([1,2,1], [-1,0,1])]

In each pair, the first entry is the first wavelet coefficient and the second entry is the second wavelet coefficient.
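The TCAC example can be reproduced by building the levels bottom-up from the A and B recurrences. This is a sketch under a few assumptions: the function name `wavelet_levels` is made up here, the string length is required to be a power of two, and the level-0 second coefficient is stored as a zero vector rather than the scalar 0 shown on the slide.

```python
def wavelet_levels(s, alphabet):
    """All wavelet levels of s as lists of (A, B) pairs (illustrative sketch).

    Assumes len(s) is a power of two; level 0 uses a zero vector for B.
    """
    n = len(s)
    if n == 0 or n & (n - 1):
        raise ValueError("length of s must be a power of two")
    index = {ch: i for i, ch in enumerate(alphabet)}
    sigma = len(alphabet)

    level = []
    for ch in s:
        a = [0] * sigma
        a[index[ch]] = 1                 # A[0][i] = f(c_i)
        level.append((a, [0] * sigma))   # B[0][i] = 0
    levels = [level]

    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            left, right = level[i][0], level[i + 1][0]
            a = [l + r for l, r in zip(left, right)]   # A[k][i] = sum of halves
            b = [l - r for l, r in zip(left, right)]   # B[k][i] = difference
            nxt.append((a, b))
        level = nxt
        levels.append(level)
    return levels


for k, lvl in enumerate(wavelet_levels("TCAC", "ACT")):
    print(k, lvl)
```

The top level always holds a single pair whose first entry is the frequency vector of the whole string.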

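Wavelet Distance Calculation

The pos and neg terms of the wavelet distance are built from the following sums over the coefficients a and b of the two strings (the operator that joins the two sums in each pair did not survive in this transcript):

$$\text{pos}:\quad \sum_{i:\,a_{1,i} > a_{2,i}} (a_{1,i} - a_{2,i}), \qquad \sum_{i:\,b_{1,i} > b_{2,i}} (b_{1,i} - b_{2,i})$$

$$\text{neg}:\quad \sum_{i:\,a_{1,i} < a_{2,i}} (a_{2,i} - a_{1,i}), \qquad \sum_{i:\,b_{1,i} < b_{2,i}} (b_{2,i} - b_{1,i})$$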


Maximum Frequency Distance Calculation

FD(s1,s2) = max { FD1(f (s1), f (s2)), FD2(ψ(s1),ψ(s2)) }

FD1 is the Frequency Distance

FD2 is the Wavelet Distance

MRS-Index Structure Creation

[Figure sequence: a window of size w = 2^a is placed on string s1 and its contents are transformed; the window slides c times, where c = box capacity; the resulting structure for s1 at w = 2^a is labeled Ta,1.]

Using Different Resolutions

[Figure: the same construction on s1 at w = 2^a (Ta,1) and at w = 2^(a+1) (Ta+1,1).]

MRS-Index Structure
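The construction figures are not recoverable from this transcript, so the following is only a rough sketch of the idea as the labels suggest it: slide a window of size w over a string, map each window to a point, and enclose every c consecutive points in one minimum bounding rectangle (MBR). It assumes the point is the plain frequency vector of the window; the names `window_vectors` and `build_mbrs` and the (start, low, high) box layout are hypothetical and not taken from the presentation.

```python
from collections import Counter


def window_vectors(s, w, alphabet="ACGT"):
    """Frequency vector of every length-w window of s (assumed transform)."""
    vectors = []
    for i in range(len(s) - w + 1):
        counts = Counter(s[i:i + w])
        vectors.append([counts[ch] for ch in alphabet])
    return vectors


def build_mbrs(s, w, c, alphabet="ACGT"):
    """Group every c consecutive window vectors into one bounding box.

    Returns (start_position, low_corner, high_corner) triples.
    Hypothetical layout for illustration only.
    """
    points = window_vectors(s, w, alphabet)
    boxes = []
    for start in range(0, len(points), c):
        group = points[start:start + c]
        low = [min(col) for col in zip(*group)]    # per-dimension minimum
        high = [max(col) for col in zip(*group)]   # per-dimension maximum
        boxes.append((start, low, high))
    return boxes


for box in build_mbrs("AATGATAG", w=4, c=2):
    print(box)
```

Consecutive windows differ by one character, so their vectors are close together; this is one way to read the later remark that the MBRs are highly clustered.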

MRS-index properties

• Relative MBR volume (precision) decreases when
– c increases.
– w decreases.

• MBRs are highly clustered.

[Figure: box volume versus box capacity.]

Frequency Distance to an MBR

Let q be a query string of length 2^i, where a ≤ i ≤ a + l - 1. Given an MBR B, we define FD(q,B) = min over all s in B of FD(q,s).

[Figure: FD(f(q),f(s)) between the points f(q) and f(s), and FD(f(q),B) between f(q) and the box B.]

Range Search Algorithm

Range Queries

[Figure: a query q of length 208 is split into subqueries q1, q2, q3 of lengths 16, 64 and 128, which are searched in the index rows for w = 2^4, 2^5, 2^6, 2^7 built over the strings s1, s2, ..., sd; the search radius is refined from ε1 to ε2 to ε3.]

1. Partition the query string into subqueries at the various resolutions available in the index.

2. Perform a partial range query for each subquery on the corresponding row of the index structure, and refine ε.

3. Read the disk pages corresponding to the last result set, and postprocess them to eliminate false retrievals.
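Step 1 can be illustrated with the query of length 208 from the figure, which splits into pieces of 16, 64 and 128 characters. The sketch below assumes a greedy split into powers of two between 2^min_exp and 2^max_exp, returned in ascending order as in the figure; the function names and the handling of any leftover tail shorter than the smallest window are assumptions, not details from the slides.

```python
def partition_lengths(n, min_exp, max_exp):
    """Split length n into window sizes 2**e with min_exp <= e <= max_exp.

    Returns (sizes in ascending order, leftover length that fits no window).
    Greedy sketch; the presentation does not spell out the exact rule.
    """
    sizes = []
    remaining = n
    for e in range(max_exp, min_exp - 1, -1):
        w = 1 << e
        while remaining >= w:        # the largest window may repeat
            sizes.append(w)
            remaining -= w
    sizes.reverse()
    return sizes, remaining


def partition_query(q, min_exp, max_exp):
    """Cut q into consecutive subqueries of the sizes chosen above."""
    sizes, _ = partition_lengths(len(q), min_exp, max_exp)
    pieces, start = [], 0
    for w in sizes:
        pieces.append(q[start:start + w])
        start += w
    return pieces


print(partition_lengths(208, 4, 7))
```

Each piece would then be searched on the index row with the matching window size, which is what steps 2 and 3 describe.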

K-Nearest Neighbor Algorithm

k-Nearest Neighbor Query [KSF+96, SK98]

[Figure sequence: a k-nearest neighbor query with k = 3; r = edit distance to the 3rd closest substring.]

Experimental Settings

• w = {128, 256, 512, 1024}.
• Human chromosomes from www.ncbi.nlm.nih.gov
– chr02, chr18, chr21, chr22
– Plotted results are from the chr18 dataset.

• Queries are selected randomly from the data set, with 512 ≤ |q| ≤ 10000.

• An NFA-based technique [BYN99] is implemented for comparison.

Experimental Results 1: Effect of Box Capacity (10-NN)

• The cost of the MRS-index increases as the box capacity increases.
• The cost of the MRS-index is much lower than that of the NFA technique for all these box capacities.
• Although using two wavelet coefficients slightly improves the performance for the same box capacity, the size of the index structure is doubled. For the same amount of memory, the single-coefficient version performs better.

Experimental Results 2: Effect of Window Size (10-NN)

• The MRS-index structure outperforms the NFA technique for all the window sizes.
• The performance of the MRS-index structure itself improves as the window size increases.

Experimental Results 3: k-NN queries

• The performance of the MRS-index structure drops for large values of k, but it still performs better than the NFA technique.

• Speedups of up to 45 were achieved for 10 nearest neighbors. The speedup for 200 nearest neighbors is 3.

• As the number of nearest neighbors increases, the performance of the MRS-index structure approaches that of the NFA technique.

Experimental Results 4: Range Queries

• The MRS-index structure performed up to 12 times faster than the NFA technique. Its performance improved when the queries were selected from different data strings, because the DNA strings have a high self-similarity.

• The performance of the MRS-index structure deteriorates as the error rate increases, because the size of the candidate set increases with the error rate.

Discussion

• In-memory (index size is 1-2% of the database size).
• Lossless search.
• 3 to 45 times faster than the NFA technique for k-NN queries.
• 2 to 12 times faster than the NFA technique for range queries.
• Can be used to speed up any previously defined technique.


THANK YOU