An Efficient Index Structure for String Databases

Tamer Kahveci, Ambuj K. Singh

Presented by Atul Ugalmugale / Nikita Rasam

Issue

• Quickly find the substrings in a large database that are similar to a given query string, using a small index structure.

• In some applications we store, search and analyze long sequences of discrete characters, which we call “strings”

• There is a frequent need to find similarities between genetic data, web data and event sequences.

Applications

• Information Retrieval: A typical application of information retrieval is text searching; given a large collection of documents and some text keywords, we want to find the documents which contain these keywords.

• Searching keywords on the web: a misspelled query such as “mtallica” usually means “metallica”.

• Computational Biology : The problem is similar in computational biology; here we have a long DNA sequence and we want to find subsequences in it that match approximately a query sequence.

…ATGCATACGATCGATT…
…TGCAATGGCTTAGCTA…

Animal species from the same family are bound to have more similar DNAs


• Video data can be viewed as an event sequence if some pre-specified set of events are detected and stored as a sequence. Searching similar event subsequences can be used to find related video segments.


• String search algorithms proposed so far are in-memory algorithms.

• They scan the whole database for each query.

• The size of the string database grows faster than the available memory capacity, and extensive memory requirements make the search techniques impractical.

• They suffer from disk I/Os when the database is too large.

• Performance deteriorates for long query patterns.

Similarity Metrics

• The difference between two strings s1 and s2 is generally defined as the minimum number of edit operations needed to transform s1 into s2, called the “edit distance” (ED).

• Edit operations:
– Insert
– Delete
– Replace

Suppose we have two strings x and y, e.g. x = kitten, y = sitting, and we want to transform x into y. A closer look:

k i t t e n
s i t t i n g

1st step: kitten → sitten (Replace)
2nd step: sitten → sittin (Replace)
3rd step: sittin → sitting (Insert)

• What is the edit distance between “survey” and “surgery”?
s u r v e y ---> s u r g e y replace (+1)
            ---> s u r g e r y insert (+1)

• Edit distance = 2

• In the general version of edit distance, different operations may have different costs, or the costs depend on the characters involved.

• For example replacement could be more expensive than insertion, or replacing “a” with “o” could be less expensive than replacing “a” with “k”.

• This is called the weighted edit distance.

Global Alignment

• The global alignment (or similarity) of s1 and s2 is defined as the maximum-valued alignment of s1 and s2.
– Given two strings s1 and s2, a global alignment is obtained by inserting spaces into s1 or s2, or at their ends, so that they are of the same length, and then writing them one against the other.

• Example
– qacdbd & qawdb

qac_dbd
qa_wdb_

• Edits and alignments are dual.
– A sequence of edits can be converted into a global alignment.
– An alignment can be converted into a sequence of edits.

Local Alignment

• Given two strings X and Y, find two substrings x and y of X and Y, respectively, such that their alignment score (in the global sense) is maximum over all pairs of such substrings (empty substrings are allowed).

S(x,y) = +2 if x = y
         -2 if x != y
         -1 if x = ‘_’ or y = ‘_’

X = pqraxabcstvq
Y = yxaxbacsll

x = axabcs
y = axbacs

a x a b _ c s
a x _ b a c s
+2 +2 -1 +2 -1 +2 +2 = +8
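The column-by-column scoring in the alignment examples can be checked with a short sketch. The function name `alignment_score` and its keyword parameters are illustrative choices, not part of the slides; the default values are the +2 / -2 / -1 scores given above, and `_` marks an inserted space.

```python
def alignment_score(top: str, bottom: str, match: int = 2,
                    mismatch: int = -2, gap: int = -1) -> int:
    """Score two already-aligned rows of equal length ('_' is a space)."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "_" or b == "_":
            score += gap        # a character aligned with a space
        elif a == b:
            score += match      # identical characters
        else:
            score += mismatch   # two different characters
    return score


print(alignment_score("axab_cs", "ax_bacs"))  # local alignment example: 8
```

This only scores a given alignment; finding the best-scoring pair of substrings would need a dynamic program over X and Y.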

String Matching Problem

• Whole Matching: find the edit distance ED(q,s) between a data string s and a query string q.

• Substring Matching: consider all substrings s[i:j] of s which are close to the query string.

• Two Types of Queries:
– Range search seeks all the substrings of S which are within an edit distance of r to a given query q (r is the query range).
– K-nearest neighbor search seeks the K closest substrings of S to q.

Challenges in solving the substring matching problem

• Finding the edit distance is very costly in terms of both time and space.

• The strings in the database may be very long.

• The database size for most applications grows exponentially.

New approach to overcome challenges

• Define a lower bound distance for substring searching.

• Improve this lower bound by using the idea of wavelet transformation.

• Use the MRS index structure based on the aforementioned distance formulations.

A dynamic programming algorithm for computing the edit distance

• Problem: find the edit distance between strings x and y.

• Create a (|x|+1)×(|y|+1) matrix C, where $C_{i,j}$ represents the minimum number of operations to match $x_{1..i}$ with $y_{1..j}$. The matrix is constructed as follows.

• $C_{i,0} = i$

• $C_{0,j} = j$

• $C_{i,j} = \min\{\, C_{i-1,j-1} + \text{cost} \ (\text{replace}),\ C_{i,j-1} + 1 \ (\text{insert}),\ C_{i-1,j} + 1 \ (\text{delete}) \,\}$

• cost = 0 if $x_i = y_j$, else 1
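A minimal sketch of this recurrence, assuming unit costs for all three operations. The function name `edit_distance` is an illustrative choice; the matrix `C` follows the definition above, with Python's 0-based string indexing shifted by one.

```python
def edit_distance(x: str, y: str) -> int:
    """Unit-cost edit distance via the (|x|+1) x (|y|+1) matrix C."""
    rows, cols = len(x) + 1, len(y) + 1
    C = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        C[i][0] = i                      # delete the first i characters of x
    for j in range(cols):
        C[0][j] = j                      # insert the first j characters of y
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            C[i][j] = min(C[i - 1][j - 1] + cost,   # replace (or match)
                          C[i][j - 1] + 1,          # insert
                          C[i - 1][j] + 1)          # delete
    return C[-1][-1]


print(edit_distance("kitten", "sitting"))  # 3
print(edit_distance("survey", "surgery"))  # 2
```

The sketch uses O(|x|·|y|) time and space, which is the cost the slides refer to when they call the edit distance expensive.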

How do we perform substring search?

• The same dynamic programming algorithm can be used to find the substrings most similar to a query string q.

• The difference is that we set $C_{0,j} = 0$ for all j, since any text position could be the potential start of a match.

• If the similarity distance bound is k, we report all positions j where $C_{m,j} \le k$ (m is the last row, m = |q|).
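The substring variant can be sketched as follows, again assuming unit costs. The name `substring_match_ends` is hypothetical, and reporting 1-based end positions is a choice made here for illustration. Only the previous column of the matrix is kept, which is enough to read off the last row.

```python
def substring_match_ends(q: str, s: str, k: int) -> list[int]:
    """Return 1-based end positions j in s where some substring of s
    ending at j is within edit distance k of q."""
    m = len(q)
    col = list(range(m + 1))             # column j = 0: C[i][0] = i
    ends = []
    for j in range(1, len(s) + 1):
        new = [0] * (m + 1)              # C[0][j] = 0: a match may start anywhere
        for i in range(1, m + 1):
            cost = 0 if q[i - 1] == s[j - 1] else 1
            new[i] = min(col[i - 1] + cost,   # C[i-1][j-1] + cost
                         new[i - 1] + 1,      # C[i-1][j] + 1
                         col[i] + 1)          # C[i][j-1] + 1
        col = new
        if col[m] <= k:                  # last row within the bound
            ends.append(j)
    return ends


print(substring_match_ends("abc", "xxabcxx", 0))  # [5]
print(substring_match_ends("abc", "xxabcxx", 1))  # [4, 5, 6]
```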

Frequency Vector

• Let s be a string from the alphabet $\Sigma = \{\alpha_1, ..., \alpha_\sigma\}$. Let $n_i$ be the number of occurrences of the character $\alpha_i$ in s for $1 \le i \le \sigma$; then the frequency vector is $f(s) = [n_1, ..., n_\sigma]$.

• Example:
– s = AATGATAG
– f(s) = [nA, nC, nG, nT] = [4, 0, 2, 2]

• Let $f(s) = [v_1, ..., v_\sigma]$ be the frequency vector of s; then $\sum_{i=1}^{\sigma} v_i = |s|$.

• An edit operation on s has one of the following effects on f(s), for $1 \le i, j \le \sigma$ and $i \ne j$:
– $v_i := v_i + 1$
– $v_i := v_i - 1$
– $v_i := v_i + 1$ and $v_j := v_j - 1$

Effect of Edit Operations on Frequency Vector

• Delete: decreases an entry by 1.
• Insert: increases an entry by 1.
• Replace: Insert + Delete
• Example:
– s = AATGATAG => f(s) = [4, 0, 2, 2]
– (del. G), s = AAT.ATAG => f(s) = [4, 0, 1, 2]
– (ins. C), s = AACTATAG => f(s) = [4, 1, 1, 2]
– (A→C), s = ACCTATAG => f(s) = [3, 2, 1, 2]
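The frequency vectors in this example can be reproduced with a small sketch. The helper name `frequency_vector` and the default alphabet order "ACGT" are assumptions chosen to match the [nA, nC, nG, nT] layout used in the slides.

```python
from collections import Counter


def frequency_vector(s: str, alphabet: str = "ACGT") -> list[int]:
    """Count the occurrences of each alphabet character in s."""
    counts = Counter(s)
    return [counts[ch] for ch in alphabet]


print(frequency_vector("AATGATAG"))   # original string:  [4, 0, 2, 2]
print(frequency_vector("AATATAG"))    # after deleting G: [4, 0, 1, 2]
print(frequency_vector("AACTATAG"))   # after inserting C: [4, 1, 1, 2]
print(frequency_vector("ACCTATAG"))   # after replacing A with C: [3, 2, 1, 2]
```

Each edit changes at most two entries by one, which is what makes the frequency vector usable for a cheap lower bound on the edit distance.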

Frequency Distance

• Let u and v be integer points in σ-dimensional space. The frequency distance $FD_1(u,v)$ between u and v is defined as the minimum number of steps needed to go from u to v (or equivalently from v to u) by moving to a neighbor point at each step.

• Recall the frequency vector: $f(s) = [n_1, ..., n_\sigma]$.

• Let $s_1$ and $s_2$ be two strings from the alphabet $\Sigma = \{\alpha_1, ..., \alpha_\sigma\}$; then
– $FD_1(f(s_1), f(s_2)) \le ED(s_1, s_2)$

An Approximation to ED: Frequency Distance (FD1)

• s = AATGATAG => f(s) = [4, 0, 2, 2]
• q = ACTTAGC => f(q) = [2, 2, 1, 2]
– pos = (4-2) + (2-1) = 3
– neg = (2-0) = 2
– FD1(f(s),f(q)) = 3
– ED(q,s) = 4

• FD1(f(s1),f(s2)) = max{pos, neg}.

• FD1(f(s1),f(s2)) ≤ ED(s1,s2).

[Figure: the points f(q) and f(s), with FD1(f(q),f(s)) between them]

Frequency Distance Calculation

/* u and v are σ-dimensional integer points */

Algorithm: $FD_1(u,v)$

posDist := negDist := 0

For i := 1 to σ, accumulate

$$\text{posDist} := \sum_{i:\, u_i > v_i} (u_i - v_i)$$

$$\text{negDist} := \sum_{i:\, u_i < v_i} (v_i - u_i)$$

$$FD_1(u, v) = \max\{\text{posDist}, \text{negDist}\}$$
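A minimal sketch of this calculation; the function name `frequency_distance` is an illustrative choice. It takes two frequency vectors of the same dimension and returns max{posDist, negDist}.

```python
def frequency_distance(u: list[int], v: list[int]) -> int:
    """FD1(u, v) = max(posDist, negDist) for integer points u and v."""
    if len(u) != len(v):
        raise ValueError("u and v must have the same dimension")
    # Total surplus of u over v, and of v over u, across all dimensions.
    pos_dist = sum(a - b for a, b in zip(u, v) if a > b)
    neg_dist = sum(b - a for a, b in zip(u, v) if a < b)
    return max(pos_dist, neg_dist)


# f(s) for s = AATGATAG and f(q) for q = ACTTAGC, as in the example slide.
print(frequency_distance([4, 0, 2, 2], [2, 2, 1, 2]))  # 3
```

The computation is linear in the alphabet size, so it is far cheaper than the quadratic edit distance it bounds from below.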

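Wavelet Vector Computation

Let $s = c_1 c_2 \ldots c_n$ be a string from the alphabet $\Sigma = \{\alpha_1, ..., \alpha_\sigma\}$; then the k-th level wavelet transformation $\psi_k(s)$, $0 \le k \le \log_2 n$, of s is defined as $\psi_k(s) = [v_{k,0}, ..., v_{k,n/2^k - 1}]$, where $v_{k,i} = [A_{k,i}, B_{k,i}]$ and

$$A_{k,i} = \begin{cases} f(c_{i+1}) & k = 0 \\ A_{k-1,2i} + A_{k-1,2i+1} & 0 < k \le \log_2 n \end{cases}$$

$$B_{k,i} = \begin{cases} 0 & k = 0 \\ A_{k-1,2i} - A_{k-1,2i+1} & 0 < k \le \log_2 n \end{cases}$$

for $0 \le i \le (n/2^k) - 1$.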

Using Local Information: Wavelet Decomposition of Strings

• s = AATGATAC => f(s) = [4, 1, 1, 2]

• s = AATG + ATAC = s1 + s2

• f(s1) = [2, 0, 1, 1]

• f(s2) = [2, 1, 0, 1]

• ψ1(s) = f(s1) + f(s2) = [4, 1, 1, 2]

• ψ2(s) = f(s1) - f(s2) = [0, -1, 1, 0]

Wavelet Decomposition of a String: General Idea

• $A_{i,j} = f(s(j 2^i : (j+1) 2^i - 1))$

• $B_{i,j} = A_{i-1,2j} - A_{i-1,2j+1}$

• $\psi(s) = [A, B]$: A is the first wavelet coefficient and B is the second wavelet coefficient.

Wavelet Transformation: Example

s = TCAC, n = |s| = 4

ψ0(s) = [v0,0 , v0,1 , v0,2 , v0,3]
= [ (A0,0, B0,0), (A0,1, B0,1), (A0,2, B0,2), (A0,3, B0,3) ]
= [ (f(t), 0), (f(c), 0), (f(a), 0), (f(c), 0) ]
= [ ([0,0,1], 0), ([0,1,0], 0), ([1,0,0], 0), ([0,1,0], 0) ]

ψ1(s) = [ ([0,1,1], [0,-1,1]), ([1,1,0], [1,-1,0]) ]

ψ2(s) = [ ([1,2,1], [-1,0,1]) ]

In ψ2(s), [1,2,1] is the first wavelet coefficient and [-1,0,1] is the second wavelet coefficient.
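The level-by-level construction in this example can be sketched as below. The function name `wavelet_levels` is hypothetical, the alphabet order "ACT" is inferred from the example vectors, and the sketch assumes |s| is a power of two. At level 0 the B part is written as a zero vector rather than the scalar 0 used on the slide.

```python
def wavelet_levels(s: str, alphabet: str) -> list[list[tuple[list[int], list[int]]]]:
    """Return [psi_0(s), psi_1(s), ...]; each level is a list of (A, B) pairs."""
    n = len(s)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes |s| is a power of two")
    sigma = len(alphabet)
    index = {ch: i for i, ch in enumerate(alphabet)}

    # Level 0: A is the frequency vector of a single character, B is zero.
    level = []
    for ch in s:
        a = [0] * sigma
        a[index[ch]] = 1
        level.append((a, [0] * sigma))
    levels = [level]

    # Level k: sum and difference of adjacent A vectors from level k-1.
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            left, right = level[i][0], level[i + 1][0]
            A = [l + r for l, r in zip(left, right)]
            B = [l - r for l, r in zip(left, right)]
            nxt.append((A, B))
        levels.append(nxt)
        level = nxt
    return levels


levels = wavelet_levels("TCAC", "ACT")
print(levels[1])  # [([0, 1, 1], [0, -1, 1]), ([1, 1, 0], [1, -1, 0])]
print(levels[2])  # [([1, 2, 1], [-1, 0, 1])]
```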

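Wavelet Distance Calculation

$$pos = \sum_{i:\, a_{1,i} > a_{2,i}} (a_{1,i} - a_{2,i}) + \sum_{i:\, b_{1,i} > b_{2,i}} (b_{1,i} - b_{2,i})$$

$$neg = \sum_{i:\, a_{1,i} < a_{2,i}} (a_{2,i} - a_{1,i}) + \sum_{i:\, b_{1,i} < b_{2,i}} (b_{2,i} - b_{1,i})$$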

Maximum Frequency Distance Calculation

FD(s1,s2) = max { FD1(f (s1), f (s2)), FD2(ψ(s1),ψ(s2)) }

FD1 is the Frequency Distance

FD2 is the Wavelet Distance

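MRS-Index Structure Creation

[Figure sequence: a window of length w = 2^a slides over the string s1, and each window is transformed into a point. After the window has slid c times (c = box capacity), the resulting points are grouped into one box. Repeating this over s1 gives the row of boxes T_{a,1}.]

Using Different Resolutions

[Figure: the same construction on s1 with w = 2^a gives T_{a,1}, and with w = 2^(a+1) gives T_{a+1,1}.]

MRS-Index Structure

[Figure: the MRS-index structure]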

MRS-index properties

• Relative MBR volume (Precision) decreases when
– c increases.
– w decreases.

• MBRs are highly clustered.

[Figure: box volume versus box capacity]

Frequency Distance to an MBR

Let q be the query string of length $2^i$, where $a \le i \le a + l - 1$. Given an MBR B, we define $FD(q,B) = \min_{s \in B} FD(q,s)$.

[Figure: FD(f(q),f(s)) between the points f(q) and f(s), and FD(f(q),B) between f(q) and the box B]
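As a rough illustration of a point-to-box distance, the sketch below computes the minimum of FD1 between a query frequency vector and any integer point inside an axis-aligned box. This is an assumption-laden simplification, not the slides' algorithm: it uses only FD1 (no wavelet coefficient), the name `fd1_to_box` and the `low`/`high` corner representation are hypothetical, and it ignores the fact that real frequency vectors in a box all sum to the window length. Because it minimizes over a superset of points, the value is a lower bound on the distance to any string in the box.

```python
def fd1_to_box(u: list[int], low: list[int], high: list[int]) -> int:
    """Minimum FD1 from point u to any integer point in the box [low, high]."""
    if not (len(u) == len(low) == len(high)):
        raise ValueError("u, low and high must have the same dimension")
    pos = neg = 0
    for x, lo, hi in zip(u, low, high):
        if x > hi:
            pos += x - hi     # u exceeds the box in this dimension
        elif x < lo:
            neg += lo - x     # u falls short of the box in this dimension
        # otherwise u lies inside the box's extent here and contributes 0
    return max(pos, neg)


# f(q) = [2, 2, 1, 2] against a box with corners [3, 0, 1, 1] and [4, 1, 2, 2].
print(fd1_to_box([2, 2, 1, 2], [3, 0, 1, 1], [4, 1, 2, 2]))  # 1
```

Clamping each coordinate of u into the box minimizes both the positive and the negative sums at once, which is why a single pass suffices.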

Range Search Algorithm

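Range Queries

[Figure: a query q of length 208 is partitioned into subqueries q1, q2, q3 of lengths 16, 64 and 128, which are searched on the index rows for w = 2^4, 2^5, 2^6, 2^7 over the data strings s1, s2, ..., sd, refining the range ε at each step.]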

1. Partition the query string into subqueries at various resolutions available in our index.

2. Perform a partial range query for each subquery on the corresponding row of the index structure, and refine ε.

3. Disk pages corresponding to the last result set are read, and postprocessing is done to eliminate false retrievals.
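Step 1 can be sketched as a greedy split of the query length into the window sizes available in the index. This is only one plausible reading of the partitioning step: the function name `partition_query`, the parameters `a` (smallest exponent) and `levels` (number of resolutions), the greedy largest-first strategy, and the ascending output order are all assumptions made for illustration.

```python
def partition_query(length: int, a: int, levels: int) -> tuple[list[int], int]:
    """Split `length` into window sizes 2**a .. 2**(a + levels - 1).

    Returns the chosen sizes in ascending order and the leftover length
    that is shorter than the smallest window.
    """
    if length < 0 or a < 0 or levels < 1:
        raise ValueError("length and a must be >= 0, levels must be >= 1")
    pieces = []
    remaining = length
    # Greedy: take the largest available window as long as it fits.
    for exp in range(a + levels - 1, a - 1, -1):
        w = 1 << exp
        while remaining >= w:
            pieces.append(w)
            remaining -= w
    return sorted(pieces), remaining


# Index rows for w = 2^4 .. 2^7 and a query of length 208, as in the figure.
print(partition_query(208, 4, 4))  # ([16, 64, 128], 0)
```

Any leftover shorter than the smallest window cannot be looked up in the index and would have to be handled during postprocessing.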

K-Nearest Neighbor Algorithm

[Figure sequence: k-nearest neighbor query [KSF+96, SK98] with k = 3; r = edit distance to the 3rd closest substring.]

Experimental Settings

• w = {128, 256, 512, 1024}.

• Human chromosomes from http://www.ncbi.nlm.nih.gov/
– chr02, chr18, chr21, chr22
– Plotted results are from the chr18 dataset.

• Queries are selected from the data set randomly for 512 ≤ |q| ≤ 10000.

• An NFA based technique [BYN99] is implemented for comparison.

Experimental Results 1: Effect of Box Capacity (10-NN)

• The cost of the MRS-index increases as the box capacity increases.

• The cost of the MRS-index is much lower than that of the NFA technique for all these box capacities.

• Although using two wavelet coefficients slightly improves the performance for the same box capacity, the size of the index structure is doubled. For the same amount of memory, the single coefficient version performs better.

Experimental Results 2: Effect of Window Size (10-NN)

• The MRS-index structure outperforms the NFA technique for all the window sizes.

• The performance of the MRS-index structure itself improves as the window size increases.

Experimental Results 3: k-NN queries

• The performance of the MRS-index structure drops for large values of k, but it still performs better than the NFA technique.

• Achieved speedups of up to 45 for 10 nearest neighbors. The speedup for 200 nearest neighbors is 3.

• As the number of nearest neighbors increases, the performance of the MRS-index structure approaches that of the NFA technique.

Experimental Results 4: Range Queries

• The MRS-index structure performed up to 12 times faster than the NFA technique. The performance of the MRS-index structure improved when the queries were selected from different data strings. This is because the DNA strings have a high self-similarity.

• The performance of the MRS-index structure deteriorates as the error rate increases. This is because the size of the candidate set increases as the error rate increases.

Discussion

• In-memory (index size is 1-2% of the database size).

• Lossless search.

• 3 to 45 times faster than the NFA technique for k-NN queries.

• 2 to 12 times faster than the NFA technique for range queries.

• Can be used to speed up any previously defined technique.

THANK YOU
