Full-Text Indexing via Burrows-Wheeler Transform Wing-Kai Hon Oct 18, 2006

Outline

- The Text Searching Problem
- What is Full-Text Indexing?
- Burrows-Wheeler Transform (BWT)
- BWT as a Full-Text Index
- Related work

Text Searching

Text: acacaaccagtcacactagac……
Pattern: acac

Where does the pattern occur in the text?

How fast can we search?

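Let n be the length of the text and m be the length of the pattern

We can find all positions where the pattern appears in O(n + m) time
- Knuth-Morris-Pratt
- Boyer-Moore (linear in the worst case with Galil's rule)

Is O(n + m) time good?
- Yes, because it is optimal: without preprocessing, the whole input may have to be read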
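
To make the O(n + m) bound concrete, here is a minimal sketch of Knuth-Morris-Pratt in Python. The function name `kmp_search` and the choice to report 1-based positions (to match the later slides) are assumptions of this sketch, not part of the talk.

```python
def kmp_search(text, pattern):
    """Return the 1-based start positions of pattern in text in O(n + m) time."""
    m = len(pattern)
    if m == 0:
        return []
    # fail[i] = length of the longest proper prefix of pattern[:i + 1]
    # that is also a suffix of pattern[:i + 1].
    fail = [0] * m
    k = 0
    for i in range(1, m):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # Scan the text once; k is the number of pattern characters matched so far.
    positions = []
    k = 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == m:
            positions.append(i - m + 2)  # convert 0-based start to 1-based
            k = fail[k - 1]
    return positions


print(kmp_search("acacaaccagtcacactagac", "acac"))  # [1, 13]
```

The text is scanned once from left to right and the failure table is built once from the pattern, which is where the n and the m in the bound come from.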

Text Searching (take 2)

Pattern: acac

Where does the pattern occur in the text?

Text: acacaaccagtcacactagac……

This time, we know the text in advance and can preprocess it

Can we do better?

Yes, there is a data structure for the text, and by creating that, pattern search only takes O(m + occ) time, where occ = number of times the pattern appears in the text
- Such a data structure is called an index

Is O(m + occ) time useful?
- Yes, if the text is very long and it is searched many times for different patterns

Full-Text Index

Full-Text Index
- Deals with creating an index for a text
- Also, each position in the text corresponds to an appearance of at least one pattern (full)

Word-Level Index
- Text is a sequence of words
- The positions within a word do not correspond to an appearance of any pattern
- E.g., Text: Was it a cat I saw? (Pattern: “at” does not have an appearance)
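
The difference can be illustrated with a small Python sketch. The names `word_index`, `word_level_search` and `full_text_search` are hypothetical, and the full-text side is a plain scan standing in for a real index; positions are 1-based character positions.

```python
import re

text = "Was it a cat I saw?"

# Word-level index (a sketch): map each lowercased word to the 1-based
# character positions where that word starts.
word_index = {}
for match in re.finditer(r"[A-Za-z]+", text):
    word_index.setdefault(match.group().lower(), []).append(match.start() + 1)


def word_level_search(pattern):
    """Only whole words can be found."""
    return word_index.get(pattern.lower(), [])


def full_text_search(pattern):
    """Every text position is a candidate start (plain scan, not an index)."""
    return [i + 1 for i in range(len(text)) if text.startswith(pattern, i)]


print(word_level_search("cat"))  # [10]
print(word_level_search("at"))   # []
print(full_text_search("at"))    # [11]
```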

Suffix Tree: An Optimal Full-Text Index

As mentioned, we can create an index for the text such that pattern searching can be done in O(m + occ) time, where occ = number of times the pattern appears in the text
- This time is optimal

One such index is the Suffix Tree
- Introduced by P. Weiner in 1973; E. McCreight gave a more space-efficient construction in 1976

Suffix and Suffix Tree

Given a string S, a substring of S that ends at the last position is called a suffix of S

If S consists of n chars, S has exactly n suffixes

Theorem: If a pattern P appears at position j in S, P appears at the beginning of the suffix of S that starts at position j

E.g., S: acacaac#

Suffixes of S:
acacaac# (start at pos 1)
cacaac# (start at pos 2)
acaac# (start at pos 3)
caac# (start at pos 4)
aac# (start at pos 5)
ac# (start at pos 6)
c# (start at pos 7)
# (start at pos 8)

Suppose P = ac is a pattern. Then, P appears at pos 1, pos 3 and pos 6 in S.

Each appearance of P is the beginning of a suffix of S:
acacaac# (pos 1)
acaac# (pos 3)
ac# (pos 6)
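
The theorem can be checked directly with a short Python sketch. The helper names `suffixes` and `occurrences` are made up for this illustration, and the approach is deliberately naive (it materialises all suffixes), so it only shows the idea, not an efficient index.

```python
def suffixes(s):
    """Return (1-based start position, suffix) pairs for every suffix of s."""
    return [(j + 1, s[j:]) for j in range(len(s))]


def occurrences(s, p):
    """Positions j such that p is a prefix of the suffix of s starting at j."""
    return [j for j, suffix in suffixes(s) if suffix.startswith(p)]


print(len(suffixes("acacaac#")))      # 8
print(occurrences("acacaac#", "ac"))  # [1, 3, 6]
```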

Suffix and Suffix Tree (2)

The suffix tree is an edge-labeled compact tree (no internal node has only one child) with n leaves such that each leaf corresponds to a suffix
- Concatenating the edge labels along the path from the root to a leaf gives the corresponding suffix

The edge labels to the children of a node start with different characters

Example (next slide)

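The Suffix Tree of acacaac#

(indentation = depth; each line is an edge label; leaves show the start position of their suffix)

root
  #        (leaf 8)
  a
    ac#    (leaf 5)
    c
      #    (leaf 6)
      a
        ac#    (leaf 3)
        caac#  (leaf 1)
  c
    #      (leaf 7)
    a
      ac#    (leaf 4)
      caac#  (leaf 2)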

Searching with Suffix Tree

To search P, we match P starting from the root
- If we can match P successfully in the tree, the leaves under the stop point are exactly the suffixes that correspond to an appearance of P in the text

Then, we traverse the tree under the stop point to report where P appears

So, searching is done in O(m + occ) time, where occ = number of times P appears in the text
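
The search procedure can be sketched in Python on a simplified structure. Note the swap: this uses an uncompacted suffix trie (one character per edge, built naively in quadratic time and space) rather than a true compact suffix tree, so the traversal below the stop point is not bounded by occ as it would be in a suffix tree. The names `build_suffix_trie` and `search` are assumptions of this sketch, and the text is assumed to end with a unique terminator such as `#`.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a nested-dict trie.

    Each node is a dict from character to child node. The key None marks
    a leaf and stores the 1-based start position of its suffix.
    """
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node[None] = start + 1
    return root


def search(root, pattern):
    """Match pattern from the root, then collect all leaves below the stop point."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []          # pattern does not appear in the text
        node = node[ch]
    positions = []
    stack = [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key is None:
                positions.append(child)   # leaf: child is a start position
            else:
                stack.append(child)
    return sorted(positions)


trie = build_suffix_trie("acacaac#")
print(search(trie, "ac"))   # [1, 3, 6]
print(search(trie, "aca"))  # [1, 3]
```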

Is Suffix Tree good?

Yes, because of optimal search time
No, because of space requirement…
- The space can be much larger than the text

E.g., Text = DNA of Human
- To store the text, we need 0.8 Gbyte
- To store the suffix tree, we need 64 Gbyte!

Something Wrong??

Both the suffix tree and the text have n things, so they both need O(n) space…

How come there is a big difference??
- Let us have a better analysis

Let A be the alphabet (i.e., the set of distinct characters) of a text T
- E.g., in DNA, A = {a,c,g,t}

Something Wrong?? (2)

To store T, we need only n log |A| bits

But to store the suffix tree, we will need n log n bits (each of its O(n) pointers takes log n bits)
- When n is very large compared to |A|, there is a huge difference
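
A quick back-of-the-envelope check in Python relates these formulas to the human DNA figures quoted earlier. The text length of 3.2 billion characters and the decimal definition of a Gbyte (10^9 bytes) are assumptions of this sketch; the 0.8 and 64 Gbyte figures are the ones from the slides.

```python
import math

n = 3_200_000_000      # assumed text length: about 3.2 billion DNA characters
alphabet_size = 4      # A = {a, c, g, t}
GBYTE = 1e9            # assumed: 1 Gbyte = 10^9 bytes

# Text: n log |A| bits.
text_gbyte = n * math.log2(alphabet_size) / 8 / GBYTE

# One pointer into the text needs ceil(log n) bits.
pointer_bits = math.ceil(math.log2(n))

# A single array of n such pointers (n log n bits).
pointer_array_gbyte = n * pointer_bits / 8 / GBYTE

# How many pointer-sized fields per character a 64 Gbyte structure amounts to.
fields_per_char = 64 * GBYTE * 8 / (n * pointer_bits)

print(text_gbyte, pointer_bits, pointer_array_gbyte, fields_per_char)
```

Under these assumptions the text takes about 0.8 Gbyte, one pointer takes 32 bits, a single array of n pointers already takes about 12.8 Gbyte, and 64 Gbyte corresponds to roughly five pointer-sized fields per text character.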

Question: Is there an index that supports fast searching, but occupies O( n log |A| ) bits only??

Burrows-Wheeler Transform

Arrange the suffixes of the text in sorted order; the Burrows-Wheeler Transform is the array storing, for each suffix in that order, its preceding character (the whole text is preceded by its last character)

Example (next slide)

Text = acacaac#

BWT   Suffix in sorted order
c     #
c     aac#
a     ac#
c     acaac#
#     acacaac#
a     c#
a     caac#
a     cacaac#
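
The table can be reproduced with a few lines of Python. This is a naive sketch that sorts the suffixes directly (not a space- or time-efficient construction); the function name `bwt` is an assumption, and the text is assumed to end with a unique terminator (`#` here) that sorts before every other character.

```python
def bwt(text):
    """Burrows-Wheeler Transform of text by explicit suffix sorting.

    Assumes text ends with a unique terminator that is smaller than
    all other characters (e.g. '#').
    """
    # Start positions (0-based) of the suffixes in sorted order.
    order = sorted(range(len(text)), key=lambda i: text[i:])
    # text[i - 1] is the character preceding suffix i; for i == 0 the
    # index -1 wraps around to the last character of the text.
    return "".join(text[i - 1] for i in order)


print(bwt("acacaac#"))  # ccac#aaa
```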

BWT is useful

The BWT has been shown to compress more easily than the original text

Also, given the position in the BWT array where the last character appears, we can get back the original text

How?

Text = acacaac#

BWT   Sorted BWT   Suffix in sorted order
c     #            #
c     a            aac#
a     a            ac#
c     a            acaac#
#     a            acacaac#
a     c            c#
a     c            caac#
a     c            cacaac#
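
One way to answer the "How?" is sketched below in Python. It relies on the standard observation that the k-th occurrence of a character in the BWT and the k-th occurrence of the same character in the sorted BWT are the same text character, so we can walk backwards through the text one character at a time. The function name `inverse_bwt` and the 0-based `end_row` parameter are assumptions of this sketch.

```python
def inverse_bwt(bwt, end_row):
    """Recover the original text from its BWT.

    end_row is the 0-based position in bwt that holds the last character
    of the text (row 4 in the example, where the BWT character is '#').
    """
    # first_row[c] = first row of the sorted BWT that holds character c.
    first_row = {}
    for row, ch in enumerate(sorted(bwt)):
        first_row.setdefault(ch, row)
    # rank[i] = number of occurrences of bwt[i] in bwt[:i].
    seen = {}
    rank = []
    for ch in bwt:
        rank.append(seen.get(ch, 0))
        seen[ch] = seen.get(ch, 0) + 1
    # Emit the text from its last character to its first.
    out = []
    row = end_row
    for _ in range(len(bwt)):
        ch = bwt[row]
        out.append(ch)
        # Jump to the row whose suffix starts with this occurrence of ch.
        row = first_row[ch] + rank[row]
    return "".join(reversed(out))


print(inverse_bwt("ccac#aaa", 4))  # acacaac#
```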

BWT Index

Ferragina and Manzini (2000) observed that we can use the BWT to support pattern searching by storing some additional O(n)-bit arrays

Precisely, let B[1..n] be the BWT. With the additional arrays, for any x, we can count the number of occurrences of any char in B[1..x] in constant time

Then, we can count the number of times that a pattern appears in the text in O(m) time (How?)

Text = acacaac#, Pattern = aca

BWT   Sorted BWT   Suffix in sorted order
c     #            #
c     a            aac#
a     a            ac#
c     a            acaac#
#     a            acacaac#
a     c            c#
a     c            caac#
a     c            cacaac#

The suffixes beginning with aca are the 4th and 5th rows (2 appearances)
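
The counting step is usually done by "backward search": process the pattern from its last character to its first, keeping the range of sorted suffixes that begin with the part of the pattern read so far. Below is a minimal Python sketch. The names `build_counts` and `count_occurrences` are assumptions, and the `occ` table here is a plain prefix-count table (far larger than the compact O(n)-bit arrays of the actual index), used only to make each lookup constant time.

```python
def build_counts(bwt):
    """Return (C, occ) for a BWT string.

    C[c]      = number of characters in bwt smaller than c
    occ[c][x] = number of occurrences of c in bwt[:x]
    """
    alphabet = sorted(set(bwt))
    C, total = {}, 0
    for ch in alphabet:
        C[ch] = total
        total += bwt.count(ch)
    occ = {ch: [0] for ch in alphabet}
    for b in bwt:
        for ch in alphabet:
            occ[ch].append(occ[ch][-1] + (b == ch))
    return C, occ


def count_occurrences(bwt, pattern):
    """Count appearances of pattern using one table lookup pair per character."""
    C, occ = build_counts(bwt)
    lo, hi = 0, len(bwt)   # half-open range of rows (0-based) still matching
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


print(count_occurrences("ccac#aaa", "aca"))  # 2
print(count_occurrences("ccac#aaa", "ac"))   # 3
```

For Pattern = aca the range narrows from all rows, to the rows starting with a, to those starting with ca, and finally to the two rows starting with aca.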

BWT Index

They also show that, by storing another O(n)-bit array, we can report each position where the pattern appears in O(log n) time

So, searching is done in O(m + occ log n) time, where occ = number of times the pattern appears in the text

What is the space? O(n log |A|) bits
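
A common way to realise the reporting step is to store the text position for only a sample of the sorted suffixes and to walk backwards through the text from an unsampled row until a sampled one is reached. The Python sketch below illustrates that idea under stated assumptions: the names `build_index` and `locate`, the `sample_rate` parameter and the plain prefix-count table are all choices of this sketch rather than the structures of the FM-index paper, and the text must end with a unique smallest terminator.

```python
def build_index(text, sample_rate=4):
    """Build a toy BWT index with a sampled suffix array (naive construction)."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])   # 0-based suffix starts
    bwt = [text[i - 1] for i in sa]                 # index -1 wraps around
    alphabet = sorted(set(text))
    C, total = {}, 0
    for ch in alphabet:
        C[ch] = total
        total += bwt.count(ch)
    occ = {ch: [0] * (n + 1) for ch in alphabet}
    for k, b in enumerate(bwt):
        for ch in alphabet:
            occ[ch][k + 1] = occ[ch][k] + (b == ch)
    # Keep the text position only for suffixes starting at multiples of sample_rate.
    samples = {row: pos for row, pos in enumerate(sa) if pos % sample_rate == 0}
    return bwt, C, occ, samples


def locate(index, pattern):
    """Return the sorted 1-based positions where pattern appears."""
    bwt, C, occ, samples = index
    lo, hi = 0, len(bwt)
    for ch in reversed(pattern):                    # backward search
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    positions = []
    for row in range(lo, hi):
        steps = 0
        while row not in samples:
            ch = bwt[row]
            row = C[ch] + occ[ch][row]              # suffix starting one position earlier
            steps += 1
        positions.append(samples[row] + steps + 1)  # back to 1-based
    return sorted(positions)


index = build_index("acacaac#")
print(locate(index, "ac"))   # [1, 3, 6]
print(locate(index, "aca"))  # [1, 3]
```

Each reported position costs at most `sample_rate - 1` backward steps, so sampling every (log n)-th text position gives about n / log n stored positions of log n bits each, which matches the O(n)-bit budget and the O(log n) time per reported position.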

Related Work

- Further compress the index
  - Space is now measured in terms of the entropy (or the randomness) of a text
- Support text with large alphabet
- Efficient construction
  - Challenge is in minimizing working space
- More complex queries and operations
  - Library problem, Dictionary problem

Pointers for Further Study

The Pizza & Chili website: http://pizzachili.di.unipi.it

The FM-index paper by P. Ferragina and G. Manzini, FOCS 2000

The CSA paper by R. Grossi and J.S. Vitter, STOC 2000

Discuss with me ^_^ (email: wkhon@)