Page 1: An Efficient Index Structure for String Databases

1

An Efficient Index Structure for String Databases

Tamer Kahveci, Ambuj K. Singh

Department of Computer Science, University of California, Santa Barbara

http://www.cs.ucsb.edu/~tamer

Page 2: An Efficient Index Structure for String Databases

2

Whole/Substring Matching Problem

Quickly find the substrings in a database that are similar to a given query string, using a small index structure (1-2% of the database size).

[Figure: a query string compared against a database string]

Page 3: An Efficient Index Structure for String Databases

3

String Similarity

Motivation: applications

Genetic sequence databases (e.g. NCBI).

Text databases, spell checkers, web search.

Video databases (e.g. VIRAGE, MEDIA360).

Database size is too large: most of the available techniques are in-memory.

The space requirement of current indexes is too large.

[Figure: base pairs (millions) by year]

Page 4: An Efficient Index Structure for String Databases

4

Outline

Motivation & background
Our contribution
  Frequency vector, frequency distance & wavelet transform
  Multi-resolution index structure
  k-NN & range queries
Experimental results
Conclusion

Page 5: An Efficient Index Structure for String Databases

5

Notation

q: query string.
m, n: lengths of strings.
r: range query radius.
$\epsilon = r/|q|$: error rate.

Page 6: An Efficient Index Structure for String Databases

6

String Similarity: an example

A C T - - T A G C
  R   I I       D
A A T G A T A G -

(R: replace, I: insert, D: delete)

Page 7: An Efficient Index Structure for String Databases

7

Background

Edit operations: insert, delete, replace.

Edit distance (ED) between s1 and s2 = minimum number of edit operations needed to transform s1 into s2.

Finding the edit distance is costly: O(mn) time and space with dynamic programming [NW70, SW81], where m and n are the lengths of s1 and s2.
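The O(mn) cost can be seen in a textbook dynamic-programming table. The following is a minimal sketch assuming unit cost for every insert, delete, and replace; the function name `edit_distance` is illustrative and not taken from the slides.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Unit-cost edit distance via the classic O(mn) table."""
    m, n = len(s1), len(s2)
    # dist[i][j] = edit distance between s1[:i] and s2[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i  # delete all i characters
    for j in range(n + 1):
        dist[0][j] = j  # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete from s1
                dist[i][j - 1] + 1,         # insert into s1
                dist[i - 1][j - 1] + cost,  # match or replace
            )
    return dist[m][n]


print(edit_distance("ACTTAGC", "AATGATAG"))
```

Both the table and the double loop have (m+1)(n+1) cells, which is where the quadratic time and space come from.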

Page 8: An Efficient Index Structure for String Databases

8

Related Work

Lossless search
  Online
    [Mye86] (Myers) reduces the space requirement to O(rn), where r is the query radius.
    [WM92] (Wu, Manber) binary masks, O(rn).
    [BYN99] (Baeza-Yates, Navarro) NFA.
  Offline (index based)
    [Mye94] (Myers) condensed r-neighborhood.
    [BYN97] (Baeza-Yates, Navarro) dictionary.
Lossy search
  [AG90] (Altschul, Gish) BLAST.
  FASTA, SENSEI, MegaBLAST, WU-BLAST, PHI-BLAST, FLASH, QUASAR, REPuter, MUMmer.
  [GWWV00] (Giladi, Walker, Wang, Volkmuth) SST-Tree.

Page 9: An Efficient Index Structure for String Databases

9

Outline

Motivation & background
Our contribution
  Frequency vector, frequency distance & wavelet transform
  Multi-resolution index structure
  k-NN & range queries
Experimental results
Conclusion

Page 10: An Efficient Index Structure for String Databases

10

Frequency Vector

Let s be a string from the alphabet $\Sigma = \{\alpha_1, ..., \alpha_\sigma\}$. Let $n_i$ be the number of occurrences of the character $\alpha_i$ in s for $1 \le i \le \sigma$. Then the frequency vector is $f(s) = [n_1, ..., n_\sigma]$.

Example: s = AATGATAG, f(s) = [nA, nC, nG, nT] = [4, 0, 2, 2].
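As a small illustration of the definition, the sketch below counts characters in a fixed alphabet order. The name `frequency_vector` and the default order "ACGT" are assumptions chosen to match the [nA, nC, nG, nT] example.

```python
from collections import Counter


def frequency_vector(s: str, alphabet: str = "ACGT") -> list[int]:
    """Return [n_1, ..., n_sigma]: occurrences of each alphabet character in s."""
    counts = Counter(s)
    # Counter returns 0 for characters that never occur in s.
    return [counts[ch] for ch in alphabet]


print(frequency_vector("AATGATAG"))
```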

Page 11: An Efficient Index Structure for String Databases

11

Effect of Edit Operations on Frequency Vector

Delete: decreases an entry by 1.
Insert: increases an entry by 1.
Replace: insert + delete.

Example:
s = AATGATAG => f(s) = [4, 0, 2, 2]
(del. G) s = AAT.ATAG => f(s) = [4, 0, 1, 2]
(ins. C) s = AACTATAG => f(s) = [4, 1, 1, 2]
(A→C) s = ACCTATAG => f(s) = [3, 2, 1, 2]
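The three edits in this example can be replayed in a few lines. This is only a sketch; the helper `frequency_vector` and the variable names are illustrative, and the slice positions are picked to reproduce the strings shown on the slide.

```python
from collections import Counter


def frequency_vector(s: str, alphabet: str = "ACGT") -> list[int]:
    counts = Counter(s)
    return [counts[ch] for ch in alphabet]


s = "AATGATAG"
deleted = s[:3] + s[4:]                       # delete the G at index 3 -> AATATAG
inserted = deleted[:2] + "C" + deleted[2:]    # insert a C at index 2 -> AACTATAG
replaced = inserted[:1] + "C" + inserted[2:]  # replace the A at index 1 by C -> ACCTATAG

for label, t in [("start", s), ("delete", deleted),
                 ("insert", inserted), ("replace", replaced)]:
    print(label, t, frequency_vector(t))
```

Each step changes at most two entries of the vector by 1, which is the observation the frequency distance builds on.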

Page 12: An Efficient Index Structure for String Databases

12

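An Approximation to ED: Frequency Distance (FD1)

s = AATGATAG => f(s) = [4, 0, 2, 2]
q = ACTTAGC => f(q) = [2, 2, 1, 2]
pos = (4-2) + (2-1) = 3
neg = (2-0) = 2
FD1(f(s), f(q)) = 3
ED(q, s) = 4

Here pos sums the entries in which the first vector exceeds the second, and neg sums the entries in which it falls short.

$FD_1(f(s_1), f(s_2)) = \max\{pos, neg\}$.

$FD_1(f(s_1), f(s_2)) \le ED(s_1, s_2)$.

[Figure: FD1(f(q), f(s)) shown as the distance between the points f(q) and f(s)]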
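The pos/neg computation from this slide fits in a short function. The sketch below follows the worked example; the name `frequency_distance` is illustrative.

```python
def frequency_distance(u: list[int], v: list[int]) -> int:
    """FD1 between two frequency vectors: max of total surplus and total deficit."""
    pos = sum(a - b for a, b in zip(u, v) if a > b)  # entries where u exceeds v
    neg = sum(b - a for a, b in zip(u, v) if a < b)  # entries where u falls short
    return max(pos, neg)


# f(s) for AATGATAG and f(q) for ACTTAGC, as on the slide
print(frequency_distance([4, 0, 2, 2], [2, 2, 1, 2]))
```

Because a replace can fix one surplus and one deficit at the same time, the larger of the two totals is enough, which is why FD1 never exceeds the edit distance.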

Page 13: An Efficient Index Structure for String Databases

13

An Illustration of Frequency Distance & Edit Distance

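[Figure: set of strings 1 and set of strings 2 map to frequency vectors v1 and v2; the frequency distance between v1 and v2 is shown next to the edit distance between the strings]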

Page 14: An Efficient Index Structure for String Databases

14

Using Local Information: Wavelet Decomposition of Strings

s = AATGATAC => f(s)=[4, 1, 1, 2]

s = AATG + ATAC = s1 + s2

f(s1) = [2, 0, 1, 1]

f(s2) = [2, 1, 0, 1]

$\psi_1(s) = f(s_1) + f(s_2) = [4, 1, 1, 2]$

$\psi_2(s) = f(s_1) - f(s_2) = [0, -1, 1, 0]$
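The sum and difference of the two half-string frequency vectors can be checked directly. A minimal sketch, assuming an even-length string and the alphabet order "ACGT"; the function names are illustrative.

```python
from collections import Counter


def frequency_vector(s: str, alphabet: str = "ACGT") -> list[int]:
    counts = Counter(s)
    return [counts[ch] for ch in alphabet]


def first_level_wavelet(s: str) -> tuple[list[int], list[int]]:
    """Split s in half and return (f(s1) + f(s2), f(s1) - f(s2))."""
    half = len(s) // 2
    f1 = frequency_vector(s[:half])
    f2 = frequency_vector(s[half:])
    total = [a + b for a, b in zip(f1, f2)]       # global composition of s
    difference = [a - b for a, b in zip(f1, f2)]  # how the halves differ
    return total, difference


print(first_level_wavelet("AATGATAC"))
```

The first vector equals f(s) itself; the second carries the local information that the plain frequency vector loses.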

Page 15: An Efficient Index Structure for String Databases

15

Wavelet Decomposition of a String: General Idea

$A_{i,j} = f(s(j 2^i : (j+1) 2^i - 1))$

$B_{i,j} = A_{i-1,2j} - A_{i-1,2j+1}$

$\psi(s) = [\psi_1(s), \psi_2(s)]$: first wavelet coefficient, second wavelet coefficient
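The A and B coefficients can be built bottom-up, since the frequency vector of a block is the sum of the vectors of its two halves. The sketch below assumes the string length is a power of two and uses 0-based positions; `wavelet_levels` is an illustrative name.

```python
def wavelet_levels(s: str, alphabet: str = "ACGT"):
    """Return (A, B) where A[i][j] is the frequency vector of the j-th block of
    length 2**i and B[i][j] is the difference of its two half-blocks."""
    n = len(s)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes len(s) is a power of two")
    # Level 0: each block is a single character.
    A = [[[1 if ch == c else 0 for c in alphabet] for ch in s]]
    B = [[]]  # no differences exist at level 0
    while len(A[-1]) > 1:
        prev = A[-1]
        pairs = [(prev[2 * j], prev[2 * j + 1]) for j in range(len(prev) // 2)]
        A.append([[x + y for x, y in zip(left, right)] for left, right in pairs])
        B.append([[x - y for x, y in zip(left, right)] for left, right in pairs])
    return A, B


A, B = wavelet_levels("AATGATAC")
print(A[3][0], B[3][0])  # top level: whole string
```

For the string AATGATAC the top level reproduces the sum and difference of the halves AATG and ATAC.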

Page 16: An Efficient Index Structure for String Databases

16

Wavelet Decomposition & ED

Define FD(s1,s2)=max{FD1, FD2}.

Page 17: An Efficient Index Structure for String Databases

17

Outline

Motivation & background
Our contribution
  Frequency vector, frequency distance & wavelet transform
  Multi-resolution index structure
  k-NN and range queries
Experimental results
Conclusion

Page 18: An Efficient Index Structure for String Databases

18

MRS-Index Structure Creation

[Figure: a window of size $w = 2^a$ on string $s_1$, with its transform]

Page 19: An Efficient Index Structure for String Databases

19

MRS-Index Structure Creation

s1

Page 20: An Efficient Index Structure for String Databases

20

MRS-Index Structure Creation

s1

Page 21: An Efficient Index Structure for String Databases

21

MRS-Index Structure Creation

[Figure: the window slides c times along $s_1$; c = box capacity]

Page 22: An Efficient Index Structure for String Databases

22

MRS-Index Structure Creation

s1

...

Page 23: An Efficient Index Structure for String Databases

23

MRS-Index Structure Creation

[Figure: row $T_{a,1}$ of boxes for string $s_1$ at window size $w = 2^a$]
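Read together, the creation slides suggest that a window of length w slides along the string, each window is transformed, and every c consecutive transforms are enclosed in one box (MBR). The sketch below is one hedged reading of that process: it uses only the plain frequency vector as the transform (the slides apply the wavelet transform), and the name `build_mbr_row` and its parameters are illustrative.

```python
def build_mbr_row(s: str, w: int, c: int, alphabet: str = "ACGT"):
    """Slide a window of length w over s and box every c consecutive
    frequency vectors into a (low corner, high corner) pair."""
    index = {ch: k for k, ch in enumerate(alphabet)}
    if len(s) < w:
        return []
    freq = [0] * len(alphabet)
    for ch in s[:w]:
        freq[index[ch]] += 1
    vectors = [list(freq)]
    for start in range(1, len(s) - w + 1):
        freq[index[s[start - 1]]] -= 1      # character leaving the window
        freq[index[s[start + w - 1]]] += 1  # character entering the window
        vectors.append(list(freq))
    boxes = []
    for k in range(0, len(vectors), c):
        group = vectors[k:k + c]
        lo = [min(col) for col in zip(*group)]
        hi = [max(col) for col in zip(*group)]
        boxes.append((lo, hi))
    return boxes


print(build_mbr_row("AATGATAG", w=4, c=2))
```

Consecutive windows differ in at most two entries, so the vectors inside one box stay close together and the boxes remain small.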

Page 24: An Efficient Index Structure for String Databases

24

Using Different Resolutions

[Figure: rows $T_{a,1}$ (window size $w = 2^a$) and $T_{a+1,1}$ (window size $w = 2^{a+1}$) for string $s_1$]

Page 25: An Efficient Index Structure for String Databases

25

MRS-Index Structure

Page 26: An Efficient Index Structure for String Databases

26

MRS-index properties

Relative MBR volume (precision) decreases when
  c increases.
  w decreases.
MBRs are highly clustered.

[Figure: box volume vs. box capacity]

Page 27: An Efficient Index Structure for String Databases

27

Outline

Motivation & background
Our contribution
  Frequency vector, frequency distance & wavelet transform
  Multi-resolution index structure
  k-NN & range queries
Experimental results
Conclusion

Page 28: An Efficient Index Structure for String Databases

28

Range Queries [KS01]

[Figure: a query of length 208 = 16 + 64 + 128 against index rows for $w = 2^4$, $2^5$, $2^6$, $2^7$ over database strings $s_1, s_2, ..., s_d$]
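The figure shows a query of length 208 handled as pieces of length 16, 64, and 128, which are window sizes present in the index. The sketch below assumes a greedy split into the available powers of two, largest first; the function name, the reuse of the largest window for long queries, and the returned remainder are assumptions, since the slides do not show how leftover characters are treated.

```python
def split_query_length(length: int, min_exp: int, max_exp: int):
    """Greedily split a query length into window sizes 2**min_exp .. 2**max_exp.

    Returns (pieces, remainder), where remainder is the part shorter than
    the smallest window.
    """
    pieces = []
    remaining = length
    for exp in range(max_exp, min_exp - 1, -1):
        w = 1 << exp
        while remaining >= w:  # only the largest size can repeat
            pieces.append(w)
            remaining -= w
    return pieces, remaining


print(split_query_length(208, min_exp=4, max_exp=7))
```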

Page 29: An Efficient Index Structure for String Databases

29

k-Nearest Neighbor Query [KSF+96, SK98]

k = 3

Page 30: An Efficient Index Structure for String Databases

30

k-Nearest Neighbor Query

k = 3

r = Edit distance to 3rd closest substring

Page 31: An Efficient Index Structure for String Databases

31

k-Nearest Neighbor Query

k = 3

r

Page 32: An Efficient Index Structure for String Databases

32

k-Nearest Neighbor Query

k = 3
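The k-NN slides follow a two-phase pattern: pick k candidates, set r to the edit distance of the k-th closest one, then run a range query with radius r. The sketch below illustrates that pattern on whole strings with the frequency distance as the lower bound; it scans a plain list instead of the MRS-index and does not handle substrings, and all names are illustrative.

```python
def edit_distance(s1: str, s2: str) -> int:
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, 1):
        cur = [i]
        for j, b in enumerate(s2, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]


def freq(s: str, alphabet: str = "ACGT") -> list[int]:
    return [s.count(ch) for ch in alphabet]


def frequency_distance(u, v) -> int:
    pos = sum(a - b for a, b in zip(u, v) if a > b)
    neg = sum(b - a for a, b in zip(u, v) if a < b)
    return max(pos, neg)


def knn(query: str, strings: list[str], k: int):
    fq = freq(query)
    # Phase 1: k candidates with the smallest lower bound give the radius r.
    by_bound = sorted(strings, key=lambda s: frequency_distance(fq, freq(s)))
    r = max(edit_distance(query, s) for s in by_bound[:k])
    # Phase 2: range query. FD <= ED, so anything with FD > r is pruned safely.
    survivors = [s for s in strings if frequency_distance(fq, freq(s)) <= r]
    ranked = sorted(survivors, key=lambda s: edit_distance(query, s))
    return ranked[:k], r


db = ["AATGATAG", "ACTTAGC", "ACTTAGG", "TTTTTTTT", "ACGTACGT", "CCCCGGGG"]
print(knn("ACTTAGC", db, k=3))
```

The expensive edit distance is computed only for the phase 1 candidates and the strings that survive the cheap frequency-distance filter.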

Page 33: An Efficient Index Structure for String Databases

33

Outline

Motivation & background
Our contribution
Experimental results
Conclusion

Page 34: An Efficient Index Structure for String Databases

34

Experimental Settings

w = {128, 256, 512, 1024}.

Human chromosomes from www.ncbi.nlm.nih.gov: chr02, chr18, chr21, chr22.

Plotted results are from the chr18 dataset.

Queries are selected randomly from the data set, with 512 ≤ |q| ≤ 10000.

An NFA-based technique [BYN99] is implemented for comparison.

Page 35: An Efficient Index Structure for String Databases

35

Experimental Results 1: Effect of Box Capacity (10-NN)

Page 36: An Efficient Index Structure for String Databases

36

Experimental Results 2: Effect of Window Size (10-NN)

Page 37: An Efficient Index Structure for String Databases

37

Experimental Results 3: k-NN queries

Page 38: An Efficient Index Structure for String Databases

38

Experimental Results 4: Range Queries

Page 39: An Efficient Index Structure for String Databases

39

Outline

Motivation & background
Our contribution
Experimental results
Discussion & conclusion

Page 40: An Efficient Index Structure for String Databases

40

Discussion

In-memory (index size is 1-2% of the database size).
Lossless search.
3 to 45 times faster than the NFA technique for k-NN queries.
2 to 12 times faster than the NFA technique for range queries.
Can be used to speed up any previously defined technique.

Page 41: An Efficient Index Structure for String Databases

41

Future Work

Extend to weighted edit distance and affine gaps.
Extend to local similarity (substring/substring) search.
Compare the quality of answers and the speed to BLAST (lossy search).
Use as a preprocessing step for BLAST.
Apply the MRS-index structure to larger alphabet sizes (e.g. protein sequences).

Page 42: An Efficient Index Structure for String Databases

42

Related Work

Lossless search
  Online
    [Mye86] (Myers) reduces the space requirement to O(rn), where r is the query radius.
    [WM92] (Wu, Manber) binary masks, O(rn).
    [BYN99] (Baeza-Yates, Navarro) NFA.
  Offline (index based)
    [Mye94] (Myers) condensed r-neighborhood.
    [BYN97] (Baeza-Yates, Navarro) dictionary.
Lossy search
  [AG90] (Altschul, Gish) BLAST.
  FASTA, SENSEI, MegaBLAST, WU-BLAST, PHI-BLAST, FLASH, QUASAR, REPuter, MUMmer.
  [GWWV00] (Giladi, Walker, Wang, Volkmuth) SST-Tree.

Page 43: An Efficient Index Structure for String Databases

43

Related Work (Similar problems)

[BYP92] (Baeza-Yates, Perleberg) only replace is allowed.
[Gus97] (Gusfield) exact matching, suffix trees.
[JKS00] (Jagadish, Koudas, Srivastava) exact matching with wild-cards for multidimensional strings, elided trees and R-tree.

Page 44: An Efficient Index Structure for String Databases

44

THANK YOU

Page 45: An Efficient Index Structure for String Databases

45

Frequency Distance to an MBR

[Figure: left, FD(f(q), f(s)) between the points f(q) and f(s); right, FD(f(q), B) between the point f(q) and the box B]
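A distance from a query vector to a box can be bounded from below by counting only the part of the query that lies outside the box in each dimension. The sketch below is a simplified bound under that assumption; it ignores the fact that all vectors of equal-length windows have the same entry sum, so it is not necessarily the computation used by the authors. The name `frequency_distance_to_mbr` and the (lo, hi) corner representation are illustrative.

```python
def frequency_distance_to_mbr(q: list[int], lo: list[int], hi: list[int]) -> int:
    """Lower bound on the frequency distance from q to any vector inside
    the box with low corner lo and high corner hi."""
    pos = sum(x - h for x, h in zip(q, hi) if x > h)  # q above the box
    neg = sum(l - x for x, l in zip(q, lo) if x < l)  # q below the box
    return max(pos, neg)


# Query vector [2, 2, 1, 2] against the box [3, 0, 1, 1] .. [4, 1, 2, 2]
print(frequency_distance_to_mbr([2, 2, 1, 2], [3, 0, 1, 1], [4, 1, 2, 2]))
```

When the box shrinks to a single point, the bound coincides with the frequency distance between two vectors, and a query inside the box gets distance 0.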