the miracle of self-indexing › event › 2009 › psc09p01_presentation.pdf · bit sequences...

61
The Miracle of Self-Indexing Combining Text Compression and String Matching Gonzalo Navarro www.dcc.uchile.cl/gnavarro [email protected] Department of Computer Science University of Chile

Upload: others

Post on 07-Jul-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

The Miracle ofSelf-Indexing

Combining Text Compression and String Matching

Gonzalo Navarrowww.dcc.uchile.cl/[email protected]

Department of Computer ScienceUniversity of Chile

Page 2: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesRankCompressed Bitmaps

Sequences of Symbols and Wavelet Trees

Text IndexesThe Burrows-Wheeler TransformThe FM-Index

Page 3: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

Compact Data Structures

I They are data structures modified to use little space.I Not just compression: they must retain their functionality

and direct access.I Motivation: Improve performance thanks to memory

hierarchy.I Especially if we can operate in RAM a structure that

otherwise would need the disk.I In this talk I will focus on compact data structures for text,

where the success has been notorious.

G. Navarro The Miracle of Self-Indexing

Page 4: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

Compressed Full-Text Self-Indexes

I Consider a text T [1, n] over an alphabet of size σ.I It requires n log σ bits of space in plain form.

(we will use base-2 logarithms by default).

I Searching for patterns P[1, m] in T is a fundamentalproblem.

I It can be done in O(n) time, but this can be unacceptable ifT is large.

I We can use an index over T .I If T is made of natural language, under some restrictions,

a relatively small inverted index would do.I But there are many sequences that cannot be handled this

way: DNA, proteins, Oriental languages, music sequences,numeric signals, ...

G. Navarro The Miracle of Self-Indexing

Page 5: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

Compressed Full-Text Self-Indexes

I An index that works for any string is the suffix array.I But it requires n log n bits, way larger than n log σ.I For relatively large texts, the suffix array will not fit in main

memory.I And it will be rather slow to search on disk.I This has been a dilemma for decades, with researchers

trying different tricks to reduce the constant of the space.

G. Navarro The Miracle of Self-Indexing

Page 6: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

Compressed Full-Text Self-Indexes

I Suddenly, a miracle happened in 2000.I It was discovered that the suffix array was compressible...I ... up to the space the compressed text would need.I Within that space, not only it could offer indexed

searching...I ... but also extracting any text substring.I Thus the text was unnecessary and could be discarded.I Hence the compressed index replaced the text, and was

called a self-index:I A compressor with indexing plus, orI An index able of reproducing the text.

G. Navarro The Miracle of Self-Indexing

Page 7: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

Compressed Full-Text Self-Indexes

I There are many alternative self-indexes today.I Despite they have opened a whole new area of research,

and proved competitive in practice...I ... they are still surrounded by a veil of mystery.I They are deemed too complicated and impractical by many

people, despite the open-source implementations beingout there: http://pizzachili.dcc.uchile.cl.

I This may come from the first proposals, which were indeedcomplex.

I Not anymore the case.

G. Navarro The Miracle of Self-Indexing

Page 8: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

Compressed Full-Text Self-Indexes

I In this talk, I’ll do my best to unveil the mystery.I I will choose one of the most elegant self-indexes: the

FM-index.I I will start from the very basics, explaining only the

minimum necessary to understand how a very competitiveimplementation works.

I I hope to expose the simplicity, elegance, and practicalityof this data structure...

I ... and encourage you to go for more yourselves.

G. Navarro The Miracle of Self-Indexing

Page 9: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

RankCompressed Bitmaps

Bit Sequences

I Consider a bit sequence B[1, n].I We are interested in the following operation on B:

I rankb(B, i): how many times does b occur in B[1, i]?I Note rank0(B, i) = i − rank1(B, i).I By default let us assume b = 1.

G. Navarro The Miracle of Self-Indexing

Page 10: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

RankCompressed Bitmaps

Bit Sequences

ResultI rank can be answered in constant time.I This is easy by storing all the answers in O(n log n) bits,

but one only needs

n + O(

n · log log nlog n

)= n + o(n)

bits of space (the n bits for B[1, n] plus a sublinear extra).I The solutions are practical.

G. Navarro The Miracle of Self-Indexing

Page 11: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

RankCompressed Bitmaps

Rank

Rank in Constant TimeI We cut B into blocks of b = (log n)/2 bits.I We store and array R[1, n/b] with the rank values at block

beginnings.I As we need log n bits to store a rank value, the total of bits

used by R is

nb· log n =

n(log n)/2

· log n = 2n

I Let us accept that space for now (it is more than promised).

G. Navarro The Miracle of Self-Indexing

Page 12: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

RankCompressed Bitmaps

Rank

I ¿How to compute rank(B, i) ?I Decompose i = q · b + r , 0 ≤ r < b.I The number of 1’s up to qb is R[q].I We must count the 1’s in B[qb + 1, qb + r ].

I Assume we have a table T [0, 2b − 1][0, b − 1], such that

T [x , r ] = total of 1’s in x [1, r ]

where we see x as a stream of b bits.

G. Navarro The Miracle of Self-Indexing

Page 13: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

RankCompressed Bitmaps

Rank

I Since each cell of T can take values in [0, b − 1], T needs

2b · b · log b = 2log n

2 · log n2

· (log log n − 1)

≤ 12√

n log n log log n = o(n) bits.

For example, if n = 232, T just needs 512 KB.I Then, the final answer is obtained in constant time as:

R[q] + T [B[qb + 1..qb + b], r ]

I Good, but we spent 3n + o(n) bits...

G. Navarro The Miracle of Self-Indexing

Page 14: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

RankCompressed Bitmaps

Rank

rank(B,13) = 7

R

1 0 0 1 1 1 0 1 0 1 0 0 0 0 0 00 1 1 0 0 1 0 0 1 1 1 1 0100

000001010011100101110111

0 1 20 0

0 0

0

0

0

1

1

1

1

0

0

0

0

0

0

0

1

1

1

1

2

2

0 14131095 6 7 8721

B

G. Navarro The Miracle of Self-Indexing

Page 15: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

RankCompressed Bitmaps

RankAchieving o(n) extra bits

I We also cut B into superblocks of

s =(log n)2

2= b · log n bits.

I We store an array S[1, n/s] with the rank at the beginningof superblocks.

I Now the values of R are stored adding just from thebeginning of the corresponding superblock:

R[q] = rank(B, qb) − S[bq/ log nc]= rank(B, qb) − rank(B, bq/ log nc · log n)

G. Navarro The Miracle of Self-Indexing

Page 16: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

RankCompressed Bitmaps

RankI As we need log n bits to store a rank value in S, the total

number of bits used by S is

ns· log n =

n(log n)2/2

· log n =2n

log n= o(n)

I As the R values are stored relative to the superblock, theycan be only as large as s = O(log n)2, and thus only need2 log log n bits. Thus R now occupies

nb

· 2 log log n =n

(log n)/2· 2 log log n

=4n · log log n

log n= o(n) bits.

G. Navarro The Miracle of Self-Indexing

Page 17: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

RankCompressed Bitmaps

Rank

I ¿How we compute rank(B, i) ?I Decompose i = q · b + r , 0 ≤ r < b.I Decompose i = q′ · s + r ′, 0 ≤ r ′ < s.I Add the contents of three tables:

rank(B, i) = S[q′] + R[q] + T [B[qb + 1..qb + b], r ]

G. Navarro The Miracle of Self-Indexing

Page 18: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

RankCompressed Bitmaps

Rank

T

2

2

R

1 0 0 1 1 1 0 1 0 1 0 0 0 0 0 00 1 1 0 0 1 0 0 1 1 1 1 0100B

S 0

0 1 2 0 1 2 0 1 2 0 3 4

5 7 10

rank(B,13) = 7

000001010011100101110111

0 1 20 0

0 0

0

0

0

1

1

1

1

0

0

0

0

0

0

0

1

1

1

1

G. Navarro The Miracle of Self-Indexing

Page 19: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

RankCompressed Bitmaps

Rank on Compressed Sequences

I In many applications, the sequence B[1, n] contains few ormany 1’s.

I Let us focus on the case where there are m << n 1’s.I Binary entropy:

H0(B) =mn

lognm

+n − m

nlog

nn − m

I We can have constant-time rank using just nH0(B) + o(n)bits.

G. Navarro The Miracle of Self-Indexing

Page 20: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

RankCompressed Bitmaps

Compressed Representation

I Divide the sequence into blocks of b = (log n)/2 bits.I Let ci be the number of 1’s in a block Bi .I The number of distinct blocks of length b with ci 1’s is(

bci

)and hence we will try to represent Bi using⌈

log(

bci

)⌉bits.

G. Navarro The Miracle of Self-Indexing

Page 21: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

RankCompressed Bitmaps

Compressed Representation

I The representation of Bi will be (ci , oi), whereI The class, ci , needs dlog(b + 1)e bits.I The offset, oi , needs dlog

(bci

)e bits.

I The ci ’s are of fixed width, and take in total

nb· dlog(b + 1)e = O

(n · log log n

log n

)

I The oi ’s are of variable width. We will concatenate them all.I Later we see how to decode them.

G. Navarro The Miracle of Self-Indexing

Page 22: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

RankCompressed Bitmaps

Compressed RepresentationI The concatenation of the oi ’s takes

n/b∑i=1

⌈log

(bci

)⌉≤

n/b∑i=1

log(

bci

)+ n/b

= logn/b∏i=1

(bci

)+ O(n/ log n)

≤ log(

(n/b) · b∑n/bi=1 ci

)+ O(n/ log n)

= log(

nm

)+ O(n/ log n)

≤ nH0(B) + O(n/ log n) bits.

G. Navarro The Miracle of Self-Indexing

Page 23: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

RankCompressed Bitmaps

Compressed Representation

I We have represented B using nH0(B) + o(n) bits!I But how can we extract a block Bi in constant time?I First problem: where is oi?

I We can store the positions in P[1, n/b]...I ... but this requires (n/b) log n = 2n extra bits.I We define superblocks of s = b · log n bits...I ... and store P[1, n/s] only for superblocks.I We will have P ′[1, n/b] with pointers relative to the

superblock (like in rank ).I Since |oi | = O(log n), each P ′[i] needs O(log log n) bits.I Total: O(n log log n/ log n) bits.

G. Navarro The Miracle of Self-Indexing

Page 24: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

RankCompressed Bitmaps

Compressed Representation

I Second problem: which block does (ci , oi) represent?I We will have a table W [0, 2b − 1] where all bitmaps of b bits

are stored grouped by class.I The positions where each class begins in W will be

precomputed in an array

C[c] = 1 +c−1∑i=0

(bi

)

I W is sorted so that oi is the corresponding index within thezone of class ci .

I Hence (ci , oi) represents W [C[ci ] + oi ].I These tables occupy O(

√n log n) bits.

G. Navarro The Miracle of Self-Indexing

Page 25: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

RankCompressed Bitmaps

Compressed Representation

23

0

1 0 01

+10

1 0 0 1 1 1 0 1 0 1 0 0 0 0 0 00 1 1 0 0 1 0 0 1 1 1 1 0100B

0 0 0

5

1 1 3 1 1 0 1 1 1 3 1

1 9 15

c

P

P’ 2 4 2 4 2 4

0 0 1 0

0

o 0 1 1 0 0 0 01 1 0

000001010100011101110111

W

C

G. Navarro The Miracle of Self-Indexing

Page 26: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

RankCompressed Bitmaps

Rank

I The structures for rank occupy o(n) bits on top of B.I They need constant-time access to blocks of B.I This is what we have obtained with the compressed

representation!I Hence we have obtained rank in constant time over the

compressed representation (to H0).I More precisely, we use nH0(B) + O(n log log n/ log n) bits.

G. Navarro The Miracle of Self-Indexing

Page 27: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

Symbol Sequences

I Assume now we handle a sequence S = s1s2 . . . sn overan alphabet Σ of size σ.

I A plain representation of S needs n log σ bits.I We wish to solve rank over this sequence:

I rankc(S, i) = number of occurrences of c in S[1, i].I How can we extend our results on bits?

G. Navarro The Miracle of Self-Indexing

Page 28: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

The Wavelet Tree

I Is an elegant solution that lets us store S[1, n]:I Using n log σ + o(n log σ) bits.I Solving rank in time O(log σ).I Obtaining S[i] in time O(log σ).

I Has many other applications (geometry, binary relations,etc.)

G. Navarro The Miracle of Self-Indexing

Page 29: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

The Wavelet Tree

I Assume we split the alphabet into two subsets of the samesize (or almost).

I We create a bitmap indicating to which subset eachsymbol of S belongs to.

I We store that bitmap in the root of the wavelet tree.I For the left/right subtree, we choose the symbols of S from

each subset.I We continue recursively until each subset has just one

symbol.I It is easy to see that all the bitmaps add up to n log σ bits.

G. Navarro The Miracle of Self-Indexing

Page 30: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

The Wavelet Tree

l

1l l r drl

0 1 0 0

0111l l dl

a a ab a r a a b a d alll ra0 0 0 0 0 0 0 0 0 0 01 1 0 1 1 1 10 0

_ab dlr

0 0 0 0 0 0 0 0 0 0 0 01 1

_a b

a a b a a a a a b a a

bb1 1 1 1 1 111 10 0 0

_ a

a a a a a a a a a

a a a a a a a a a

dl r

r r

d l

d l l

0

G. Navarro The Miracle of Self-Indexing

Page 31: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

The Wavelet Tree

I How we retrieve symbol S[i]?I Look at B[i] in the root bitmap.I If B[i] = 0,

I Go to the left subtree.I The new position is i ′ = rank0(B, i).

I If B[i] = 1,I Go to the right subtree.I The new position is i ′ = rank1(B, i).

I When we arrive at a leaf, the corresponding symbol is S[i].I Time: log σ evaluations of rank .I Need to preprocess the bitmaps for rank queries.

G. Navarro The Miracle of Self-Indexing

Page 32: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

The Wavelet Tree

0 00r

l l d011

l1

a a ab a r a a b a d alll ra0 0 0 0 0 0 0 0 0 0 01 1 0 1 1 1 10 0

_ab dlr

0 0 0 0 0 0 0 0 0 0 0 01 1

_a b

a a b a a a a a b a a

bb1 1 1 1 1 111 10 0 0

_ a

a a a a a a a a a

a a a a a a a a a

dl r

r r

d l

d l l

rank0

0

l

1 1l l r dl

G. Navarro The Miracle of Self-Indexing

Page 33: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

The Wavelet Tree

I How we compute rankc(S, i)?I Let B be the root bitmap.I If c belongs to the left subtree,

I Move to the left subtree.I The new position is i ′ = rank0(B, i).

I If c belongs to the right subtree,I Move to the right subtree.I The new position is i ′ = rank1(B, i).

I When we arrive at a leaf, the answer is i .I Time: log σ evaluations of rank .

G. Navarro The Miracle of Self-Indexing

Page 34: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

The Wavelet Tree

a a ab a r a a b a d alll ra0 0 0 0 0 0 0 0 0 0 01 1 0 1 1 1 10 0

_ab dlr

0 0 0 0 0 0 0 0 0 0 0 01 1

_a b

a a b a a a a a b a a

bb1 1 1 1 1 111 10 0 0

_ a

a a a a a a a a a

a a a a a a a a a

dl r

r r

d l

d l l

rank

rank (S,11)’l’

1

l

01 1l l r d

0 00rl

l l d011

l1

G. Navarro The Miracle of Self-Indexing

Page 35: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

The Wavelet TreeI If the bitmaps are uncompressed, the space is

n log σ + o(n log σ).I If we compress the bitmaps using the technique shown

before...I Zero order entropy of a sequence: let there be nc

occurrences of symbol c in S[1, n], then

H0(S) =∑c∈Σ

nc

nlog

nnc

I Lower bound to any encoding of S that always assigns thesame code to the same symbol (e.g. Huffman).

I We will see that, by compressing the bitmaps of thewavelet tree to zero-order, we compress the sequence tozero-order as well.

G. Navarro The Miracle of Self-Indexing

Page 36: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

The Wavelet Tree

I We will consider the successive levels of the wavelet tree.I Let B[1, n] be the highest bits of the symbols.I Say B has n0 0s and n1 1s.I It is compressed to

nH0(B) = n0 lognn0

+ n1 lognn1

.

G. Navarro The Miracle of Self-Indexing

Page 37: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

The Wavelet TreeI Consider now the subsequence S0[1, n0] of the left child

and B0 the bitmap of its next highest bit.I Let n00 and n01 be its number of 0s and 1s, respectively.I B0 is compressed to

n0H0(B0) = n00 logn0

n00+ n01 log

n0

n01

.I Symmetrically with S1. Adding up the space for the first

two levels we get

n00 logn

n00+ n01 log

nn01

+ n10 logn

n10+ n11 log

nn11

I If continuing up to level log σ, we get nH0(S).

G. Navarro The Miracle of Self-Indexing

Page 38: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

The Burrows-Wheeler TransformThe FM-Index

Texts

I Given an alphabet Σ of size σ...I ... a text T [1, n] is a sequence over Σ.I This is just a sequence, but we are interested in other

operations.I Given a pattern P[1, m]:

I Count the number of occurrences of P in T (occ).I Locate those occ occurrences in T .

G. Navarro The Miracle of Self-Indexing

Page 39: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

The Burrows-Wheeler TransformThe FM-Index

Suffix Arrays

I Unlike inverted indexes, make no assumption on the text.I Can retrieve any text substring.I General model: consider the n suffixes T [i , n] of T .I Every substring of T is the prefix of a suffix of T .I This structures indexes the set of suffixes of T and allows

one to find all those sharing a given prefix.

G. Navarro The Miracle of Self-Indexing

Page 40: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

The Burrows-Wheeler TransformThe FM-Index

Suffix Arrays

I It is an array with the n text suffixes in lexicographic order.I Finding the substrings equal to P is the same as finding

the suffixes that start with P.I These form a lexicographical interval in the suffix array.I This interval can be binary searched in O(m log n) time.

G. Navarro The Miracle of Self-Indexing

Page 41: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

The Burrows-Wheeler TransformThe FM-Index

21 7 12 9 20 11 8 3 15 1 13 5 17 4 16 19 10 2 14 6 18

bar

a a a a a a a a al b r l l b r d $

G. Navarro The Miracle of Self-Indexing

Page 42: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

The Burrows-Wheeler TransformThe FM-Index

Suffix Arrays

I Yet, the suffix array yields space problems.I Uses about 4n bytes (12GB for the human genome).I Compressing it is of interest.I But... ¿can we compress a permutation?.

G. Navarro The Miracle of Self-Indexing

Page 43: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

The Burrows-Wheeler TransformThe FM-Index

The Burrows-Wheeler Transform (BWT)I Is a reversible permutation of T .I It is used as a previous step to compression algorithms like

bzip2.I Puts together symbols having the same context.I It is enough to compress those symbols “put together” to

zero-order to achieve high-order entropy.I Entropy of order k : if SA is the sequence of the symbols

that precede the occurrences of A in S, then

Hk (S) =1n

∑A∈Σk

|SA|H0(SA)

I Lower bound to encodings that consider the k symbolspreceding the one to encode (e.g. PPM).

G. Navarro The Miracle of Self-Indexing

Page 44: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

The Burrows-Wheeler TransformThe FM-Index

The Burrows-Wheeler Transform (BWT)I Take all cyclic shifts of T .I That is, ti ti+1 . . . tn−1$t1t2 . . . ti−1.I The result is a matrix M of n × n symbols.I Sort the rows lexicographically.I M is essentially the list of suffixes of T in this order:

M[i] = T [A[i] . . . n] · T [1 . . . A[i] − 1]

I The first column, F , holds the first characters of thesuffixes: F [i] = S[rank(D, i)].

I The last column, L, is the BWT of T , T bwt = L.I It is the sequence of symbols that precede the suffixes

T [A[i]..].L[i] = T bwt [i] = T [A[i] − 1]

(except if A[i] = 1, where T bwt = T [n] = $).

G. Navarro The Miracle of Self-Indexing

Page 45: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

The Burrows-Wheeler TransformThe FM-Index

alabar a la alabarda$labar a la alabarda$aabar a la alabarda$albar a la alabarda$alaar a la alabarda$alabr a la alabarda$alaba a la alabarda$alabara la alabarda$alabar la alabarda$alabar ala alabarda$alabar a a alabarda$alabar a l alabarda$alabar a laalabarda$alabar a la labarda$alabar a la aabarda$alabar a la albarda$alabar a la alaarda$alabar a la alabrda$alabar a la alabada$alabar a la alabara$alabar a la alabard$alabar a la alabarda

21

71

29

20

11

83

15

11

35

17

41

61

91

02

14

61

8

$alabar a la alabarda a la alabarda$alabar alabarda$alabar a la la alabarda$alabar aa$alabar a la alabarda alabarda$alabar a la la alabarda$alabar abar a la alabarda$alabarda$alabar a la alalabar a la alabarda$alabarda$alabar a la ar a la alabarda$alabarda$alabar a la alabbar a la alabarda$alabarda$alabar a la alada$alabar a la alabarla alabarda$alabar a labar a la alabarda$alabarda$alabar a la ar a la alabarda$alabarda$alabar a la alaba

Tbwt

T

araadl ll$ bbaar aaaa

alabar a la alabarda$

A

1era "a"

1era "d"

2nda "r"

9na "a"

G. Navarro The Miracle of Self-Indexing

Page 46: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

The Burrows-Wheeler TransformThe FM-Index

The Burrows-Wheeler Transform (BWT)

I How to revert the BWT?I For all i , T = . . . L[i]F [i] . . ..I We know L[1] = T [n − 1], since F [1] = $ = T [n].I Where is c = L[1] in F?I All occurrences of c appear in the same order in F and L:

I Is the order given by the suffix that follows c in T .I That c is at F [C[c] + rankc(L, i)], where

C[c] = number of occurrences of symbols < c in T

I This is called LF-mapping, since it maps from L to F :

LF (i) = C(L[i]) + rankL[i](L, i)

I Then T [n − 2] = L[LF (1)], T [n − 3] = L[LF (LF (1))], etc.

G. Navarro The Miracle of Self-Indexing

Page 47: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

The Burrows-Wheeler TransformThe FM-Index

The FM-Index

I Uses the connection between the suffix array and the BWT.I Replaces the binary search by the more efficient backward

search.I To search for P[1, m] we start with pm.I The segment of A corresponding to it is

A[C(pm) + 1, C(pm + 1)].I In general, we will know that the occurrences of P[i + 1, m]

start at A[spi+1, epi+1]...I ... and will use something similar to the LF-mapping to

obtain the zone A[spi , epi ] corresponding to P[i , m].I We finish with sp1 and ep1.

G. Navarro The Miracle of Self-Indexing

Page 48: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

The Burrows-Wheeler TransformThe FM-Index

The FM-Index

I Assume we start with the segment A[spi+1, epi+1] of theoccurrences of P[i + 1, m].

I How we update it to the occurrences of P[i , m]?I For a spi+1 ≤ j ≤ epi+1, M[j][1..m − i] = P[i + 1, m].I The occurrences of P[i , m] in T appear as L[j] = pi in this

area, since T = . . . L[j] · M[j][1..m − i] . . .I Hence the answer is the range of the LF (j) where L[j] = pi .

I This is computed simply as

spi = C(pi) + rankpi (L, spi+1 − 1) + 1epi = C(pi) + rankpi (L, epi+1)

G. Navarro The Miracle of Self-Indexing

Page 49: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

The Burrows-Wheeler TransformThe FM-Index

21

71

29

20

11

83

15

11

35

17

41

61

91

02

14

61

8

C($)=0

C(_)=1

C(r)=19

C(l)=16

C(d)=15

C(a)=4

C(b)=13

araadl_ll$_bbaar_aaaa

r

a

b

$

aaaaaaaaabbdlllrr

araadl ll$ bbaar aaaa

alabar a la alabarda$T

Tbwt

___

F L

A

G. Navarro The Miracle of Self-Indexing

Page 50: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

The Burrows-Wheeler TransformThe FM-Index

The FM-Index

I After m iterations, we get the interval for P[1, m].I The cost is 2m calls to rankc .I Using a wavelet tree on T bwt = L,

I We get space nH0(T ) + o(n log σ).I We count in time O(m log σ).

G. Navarro The Miracle of Self-Indexing

Page 51: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

The Burrows-Wheeler TransformThe FM-Index

_ __$1011

_ _ _

_ _

$_

$ _

1110 00 11

ab

1101111

$_ab dlr

aaa $ bb aa aaaa

bb aaaaa aaaa000 00000011

a b

01001101100 00100 00 0ar aad l _ ll $ bbaar aaaa

00

a = 0, rank0(16) = 10

a = 1, rank1(10) = 7

a = 0, rank0(7) = 5

10r d l l l

dl r

r1

l l ld

d l

000

0111

rank = 5

G. Navarro The Miracle of Self-Indexing

Page 52: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

The Burrows-Wheeler TransformThe FM-Index

The FM-Index

I But, how to obtain A[i] and T [l , r ]?I We will sample and store some cells of A.I We take those A[i] pointing to positions of T which are

multiples of some b.I We store them in increasing order of i in contiguous form,

in an array A′[1, n/b].I ... and we have a bitmap B[1, n] marking the sampled is.I The positions and the bitmap will occupy O(n

b log n) bits.

G. Navarro The Miracle of Self-Indexing

Page 53: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

The Burrows-Wheeler TransformThe FM-Index

The FM-Index

I Assume we want to find out A[i] (but have not A).I If B[i] = 1, then A[i] = A′[rank(B, i)].I Else, if B[LF (i)] = 1, then A[LF (i)] = A′[rank(B, LF (i))],

A[i] = A[LF (i)] + 1.I Else, ...I If B[LF t(i)] = 1, then A[i] = A′[rank(B, LF t(i))] + t .

I This ends in at most b steps.I Hence, we can know A[i] in time O(b log σ).I For example, in time O(log1+ε n), spending o(n log σ) extra

bits.I Now we have replaced A!

G. Navarro The Miracle of Self-Indexing

Page 54: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

The Burrows-Wheeler TransformThe FM-Index

The FM-IndexI Now we store the same sampling in another way.I We store the i values by increasing position in T , in

E [1, n/b].I Say we want to extract T [l , r ] (but have not T ).

I The first sampled position after it is r ′ = 1 + br/bc · b.I And is pointed from A[i] = A[E [1 + br/bc]].I We know T [r ′ − 1] = L[i].I We know T [r ′ − 2] = L[LF (i)].I ...I We know T [l] = L[LF r ′−l−1(i))].

I The total cost is O((b + r − l) log σ).I For example, in time O((r − l) log σ + log1+ε n), spending

o(n log σ) extra bits.I Now we have replaced the text!

G. Navarro The Miracle of Self-Indexing

Page 55: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

The Burrows-Wheeler TransformThe FM-Index

The FM-Index

I The O(log n) time can be improved to O( log σlog log n ) in theory.

I The space is that of the wavelet tree.I n log σ + o(n log σ) (like T ), without compression.I nH0(T ) + o(n log σ) by compressing the bits.I Many efforts have been carried out to reduce this to

nHk (T ) + o(n log σ), finally succeeding.I Yet, it turned out that no effort was necessary!

G. Navarro The Miracle of Self-Indexing

Page 56: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

The Burrows-Wheeler TransformThe FM-Index

The FM-IndexI Consider the t = σk partitions of the BWT by context of

length kT bwt = S = S1S2 . . . St

I By the Hk formula,

nHk (T ) =t∑

i=1

|Si |H0(Si)

I Now, recall that the zero-order compression we achievedfor the wavelet tree was the sum of the zero-orderentropies of small blocks:

nH0(S) ≤ bH0(S1) + bH0(S2) + . . . + bH0(Sn/b)

where the Sj ’s are the blocks of length b.

G. Navarro The Miracle of Self-Indexing

Page 57: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

The Burrows-Wheeler TransformThe FM-Index

The FM-Index

I Thus the wavelet tree compresses S1S2 . . . St toH0(S1) + H0(S2) + . . . + H0(St)...

I ... plus some worst-case measure for the limits amongstrings.

I The total size of the compressed wavelet tree isnHk (T ) + O(σk log n).

I The second term is o(n log σ) for moderate k ≤ α logσ n,for any constant 0 < α < 1.

G. Navarro The Miracle of Self-Indexing

Page 58: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

The Burrows-Wheeler TransformThe FM-Index

Epilogue

I This FM-index, and many other self-indexes, areimplemented at mirrors http://pizzachili.dcc.uchile.clhttp://pizzachili.di.unipi.it.

I The implementations are freely accessible for any purpose,and their practicality can be tested.

I We have published a practical survey considering most ofthe currently available indexes:Compressed Text Indexes: From Theory to Practice. ACMJournal of Experimental Algorithmics (JEA) 13:article 12,2009.

I And also a more conceptual survey:Compressed Full-Text Indexes. ACM Computing Surveys39(1), article 2, 61 pages, 2007.

G. Navarro The Miracle of Self-Indexing

Page 59: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

The Burrows-Wheeler TransformThe FM-Index

Epilogue

I This is, of course, not the end of the story.I There are several problems, some totally, some partially,

and some not at all solved.I Some examples of almost totally solved extensions:

I How to build the index in little space?I How to handle a dynamic text collection? (slow)I How to support more functionality, as in suffix trees?

G. Navarro The Miracle of Self-Indexing

Page 60: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

The Burrows-Wheeler TransformThe FM-Index

Epilogue

I Some examples of only partially solved extensions:I How to support document retrieval? (theoretical)I How to compress highly repetitive collections? (some

progress)I How to handle the index in secondary memory? (little

progress).I And of course, we have all the other data structures!

G. Navarro The Miracle of Self-Indexing

Page 61: The Miracle of Self-Indexing › event › 2009 › psc09p01_presentation.pdf · Bit Sequences Sequences of Symbols and Wavelet Trees Text Indexes Compressed Full-Text Self-Indexes

Bit SequencesSequences of Symbols and Wavelet Trees

Text Indexes

The Burrows-Wheeler TransformThe FM-Index

Epilogue

I A brave new world of algorithmic challenges is ahead:I For those who love classical algorithmics, it is plenty of

opportunities to redesign classical data structures.I For those who like compression and information theory, this

gives the new challenge of designing schemes that not onlycan recover the data but also retain functionality and directaccess to it.

I For those trying to walk the thin path between sterile theoryand shallow practice, this is a marvellous opportunity tomake them work together and achieve miracles!

G. Navarro The Miracle of Self-Indexing