suffix arrays. suffix array we loose some of the functionality but we save space. let s = abab sort...

72
Suffix arrays

Post on 20-Dec-2015

226 views

Category:

Documents


0 download

TRANSCRIPT

Suffix arrays

Suffix array

• We loose some of the functionality but we save space.

Let s = ababSort the suffixes lexicographically: ab, abab, b, bab

The suffix array gives the indices of the suffixes in sorted order

2 0 3 1

How do we build it ?

• Build a suffix tree• Traverse the tree in DFS,

lexicographically picking edges outgoing from each node and fill the suffix array.

• O(n) time

How do we search for a pattern ?

• If P occurs in T then all its occurrences are consecutive in the suffix array.

• Do a binary search on the suffix array

• Takes O(mlogn) time

ExampleLet S = mississippi

iippiissippiississippimississippipi

7

4

1

0

9

8

6

3

10

5

2

ppisippisisippissippississippi

L

R

Let P = issa

M

How do we accelerate the search ?

L R

Maintain = LCP(P,L)Maintain r = LCP(P,R)

Assume ≥ r

M

r

L RM

r

If = r then start comparing M to P at + 1

L RM

r

> r

L RM

r

Someone whispers LCP(L,M)

LCP(L,M) >

L RM

r

Continue in the right half

LCP(L,M) >

L RM

r

LCP(L,M) <

L RM

r

LCP(L,M) <

Continue in the left half

L RM

r

LCP(L,M) =

start comparing M to P at + 1

Analysis

If we do more than a single comparison in an iteration then max(, r ) grows by 1 for each comparison O(m + logn) time

Construct the suffix array without the suffix tree

Linear time construction

Recursively ?

Say we want to sort only suffixes that start at even positions ?

Change the alphabet

You in fact sort suffixes of a string shorter by a factor of 2 !

Every pair of characters is now a character

Change the alphabeta$ 0

aa 1

ab 2

b$ 3

ba 4

bb 5

$ a b a a a b

2 1 2

But we do not gain anything…

Divide into triples

$ y a b b a d a b b a d o

abb

ada bba

do$

Divide into triples

$ y a b b a d a b b a d o

abb

ada bba

do$

$ y a b b a d a b b a d o

bba

dab

bad

o$$

Sort recursively 2/3 of the suffixes

$ y a b b a d a b b a d o

abb

ada bba

do$ bba

dab

bad

o$$

1 2 4 6 4 5 3 7

0 1 6 4 2 5 3 7

$ y a b b a d a b b a d o

1 4 2 6 5 3 7 8

1 4 8 2 7 5 10

11

1 2 3 4 5 6 7 8 9 10 11 120

0 1 2 3 4 5 6 7

Sort the remaining third

$ y a b b a d a b b a d o

1 4 2 6 5 3 7 8

(b, 2)

(a, 5)

(a, 7)

(y, 1)

(b, 2)

(a, 5)

(a, 7)

(y, 1)

36 9 0

1 2 3 4 5 6 7 8 9 10 11 120

1 4 8 2 7 5 10

11

Merge

$ y a b b a d a b b a d o

1 4 2 6 5 3 7 8

1 2 3 4 5 6 7 8 9 10 11 12

1

0

36 9 0

1 4 8 2 7 5 10

11

Merge

$ y a b b a d a b b a d o

1 4 2 6 5 3 7 8

1 2 3 4 5 6 7 8 9 10 11 12

1

0

36 9 0

4 8 2 7 5 10

11

6

Merge

$ y a b b a d a b b a d o

1 4 2 6 5 3 7 8

1 2 3 4 5 6 7 8 9 10 11 12

1

0

39 0

4 8 2 7 5 10

11

6 4

Merge

$ y a b b a d a b b a d o

1 4 2 6 5 3 7 8

1 2 3 4 5 6 7 8 9 10 11 12

1

0

39 0

8 2 7 5 10

11

6 4 9

Merge

$ y a b b a d a b b a d o

1 4 2 6 5 3 7 8

1 2 3 4 5 6 7 8 9 10 11 12

1

0

3 0

8 2 7 5 10

11

6 4 9 3

Merge

$ y a b b a d a b b a d o

1 4 2 6 5 3 7 8

1 2 3 4 5 6 7 8 9 10 11 12

1

0

0

8 2 7 5 10

11

6 4 9 3 8

Merge

$ y a b b a d a b b a d o

1 4 2 6 5 3 7 8

1 2 3 4 5 6 7 8 9 10 11 12

1

0

0

2 7 5 10

11

6 4 9 3 8 2

Merge

$ y a b b a d a b b a d o

1 4 2 6 5 3 7 8

1 2 3 4 5 6 7 8 9 10 11 12

1

0

0

7 5 10

11

6 4 9 3 8 2 7

Merge

$ y a b b a d a b b a d o

1 4 2 6 5 3 7 8

1 2 3 4 5 6 7 8 9 10 11 12

1

0

0

5 10

11

6 4 9 3 8 2 7 5

Merge

$ y a b b a d a b b a d o

1 4 2 6 5 3 7 8

1 2 3 4 5 6 7 8 9 10 11 12

1

0

0

10

11

6 4 9 3 8 2 7 5

Merge

$ y a b b a d a b b a d o

1 4 2 6 5 3 7 8

1 2 3 4 5 6 7 8 9 10 11 12

1

0

6 4 9 3 8 2 7 5 10

11

0

summary

$ y a b b a d a b b a d o

1 4 2 6 5 3 7 8

1 2 3 4 5 6 7 8 9 10 11 12

1

0

6 4 9 3 8 2 7 5 10

11

0

When comparing to a suffix with index 1 (mod 3) we compare the char and break ties by the ranks of the following suffixes

When comparing to a suffix with index 2 (mod 3) we compare the char, the next char if there is a tie, and finally the ranks of the following suffixes

Compute LCP’s

$ y a b b a d a b b a d o

1 2 3 4 5 6 7 8 9 10 11 12

1

0

6 4 9 3 8 2 7 5 10

11

0

abbado$abbadabbado$

adabbado$ado$badabbado$bado$bbadabbado$bbado$dabbado$do$o$yabbadabbado$

1

4

827510110

6

39

Crucial observation

$ y a b b a d a b b a d o

1 2 3 4 5 6 7 8 9 10 11 12

1

0

6 4 9 3 8 2 7 5 10

11

0

abbado$abbadabbado$

adabbado$ado$badabbado$bado$bbadabbado$bbado$dabbado$do$o$yabbadabbado$

LCP(i,j) = min {LCP(i,i+1),LCP(i+1,i+2),….,LCP(j-1,j)}

1

4

827510110

6

39

$ y a b b a d a b b a d o1 2 3 4 5 6 7 8 9 10 11 12

1

0

6 4 9 3 8 2 7 5 10

11

0

abbado$abbadabbado$

adabbado$ado$badabbado$bado$bbadabbado$bbado$dabbado$do$o$yabbadabbado$

LCP(11,0)

16493827510110

0

Find LCP’s of consecutive suffixes

$ y a b b a d a b b a d o

1 2 3 4 5 6 7 8 9 10 11 12

1

0

6 4 9 3 8 2 7 5 10

11

0

abbado$abbadabbado$

adabbado$ado$badabbado$bado$bbadabbado$bbado$dabbado$do$o$yabbadabbado$

LCP(8,2)

16493827510110

01

$ y a b b a d a b b a d o

1 2 3 4 5 6 7 8 9 10 11 12

1

0

6 4 9 3 8 2 7 5 10

11

0

abbado$abbadabbado$

adabbado$ado$badabbado$bado$bbadabbado$bbado$dabbado$do$o$yabbadabbado$

LCP(9,3)

16493827510110

010

$ y a b b a d a b b a d o

1 2 3 4 5 6 7 8 9 10 11 12

1

0

6 4 9 3 8 2 7 5 10

11

0

abbado$abbadabbado$

adabbado$ado$badabbado$bado$bbadabbado$bbado$dabbado$do$o$yabbadabbado$

LCP(6,4)

16493827510110

1 010

$ y a b b a d a b b a d o

1 2 3 4 5 6 7 8 9 10 11 12

1

0

6 4 9 3 8 2 7 5 10

11

0

abbado$abbadabbado$

adabbado$ado$badabbado$bado$bbadabbado$bbado$dabbado$do$o$yabbadabbado$

LCP(7,5)

16493827510110

01 010

$ y a b b a d a b b a d o

1 2 3 4 5 6 7 8 9 10 11 12

1

0

6 4 9 3 8 2 7 5 10

11

0

abbado$abbadabbado$

adabbado$ado$badabbado$bado$bbadabbado$bbado$dabbado$do$o$yabbadabbado$

LCP(1,6)

16493827510110

5 01 010

$ y a b b a d a b b a d o

1 2 3 4 5 6 7 8 9 10 11 12

1

0

6 4 9 3 8 2 7 5 10

11

0

abbado$abbadabbado$

adabbado$ado$badabbado$bado$bbadabbado$bbado$dabbado$do$o$yabbadabbado$

LCP(2,7)

16493827510110

45 01 010

$ y a b b a d a b b a d o

1 2 3 4 5 6 7 8 9 10 11 12

1

0

6 4 9 3 8 2 7 5 10

11

0

abbado$abbadabbado$

adabbado$ado$badabbado$bado$bbadabbado$bbado$dabbado$do$o$yabbadabbado$

16493827510110

45 01 010

LCP(3,8)

3

$ y a b b a d a b b a d o

1 2 3 4 5 6 7 8 9 10 11 12

1

0

6 4 9 3 8 2 7 5 10

11

0

abbado$abbadabbado$

adabbado$ado$badabbado$bado$bbadabbado$bbado$dabbado$do$o$yabbadabbado$

16493827510110

45 01 010

LCP(4,9)

32

$ y a b b a d a b b a d o

1 2 3 4 5 6 7 8 9 10 11 12

1

0

6 4 9 3 8 2 7 5 10

11

0

abbado$abbadabbado$

adabbado$ado$badabbado$bado$bbadabbado$bbado$dabbado$do$o$yabbadabbado$

16493827510110

45 01 010

LCP(5,10)

32 1

$ y a b b a d a b b a d o

1 2 3 4 5 6 7 8 9 10 11 12

1

0

6 4 9 3 8 2 7 5 10

11

0

abbado$abbadabbado$

adabbado$ado$badabbado$bado$bbadabbado$bbado$dabbado$do$o$yabbadabbado$

16493827510110

45 01 010

LCP(10,11)

32 1 0

$ y a b b a d a b b a d o

1 2 3 4 5 6 7 8 9 10 11 12

1

0

6 4 9 3 8 2 7 5 10

11

045 01 010 32 1 0

We need more LCPs for search

Linearly many, calculate the all bottom up

a b c a b b c a $

2 3 4 5 6 7 8

4

1

1 8 5 2 6 3 7 9

$ca$cabbca$bca$bcabbca$bbca$a$abcabbca$abbca$

973625814

Another example9

2 1 00 31 02

Burrows –Wheeler (bzip2)

Currently best algorithm for textBasic Idea:• Sort the characters by their full context

(typically done in blocks). This is called the block sorting transform.

• Use move-to-front encoding to encode the sorted characters.

The ingenious observation is that the decoder only needs the sorted characters and a pointer to the first character of the original sequence.

S = abracaשבנויה מכל ההזזות Mיצירת מטריצה : Iשלב

: Sהציקליות של

M =

a b r a c a # b r a c a # ar a c a # a ba c a # a b rc a # a b r aa # a b r a c# a b r a c a

מיון השורות בסדר לקסיקוגרפי: : IIשלב

a b r a c a #

b r a c a # aa c a # a b r

c a # a b r a

a # a b r a c# a b r a c a

r a c a # a b

LF

L is the Burrows Wheeler Transform

a b r a c a # b r a c a # ar a c a # a ba c a # a b rc a # a b r aa # a b r a c# a b r a c a

a b r a c a # b r a c a # aa c a # a b r

c a # a b r a

a # a b r a c# a b r a c a

r a c a # a b

Claim: Every column contains all chars.

LF

You can obtain F from L by sorting

a b r a c a #

b r a c a # aa c a # a b r

c a # a b r a

a # a b r a c# a b r a c a

r a c a # a b

LF

The “a’s” are in the same order in L and in F,

Similarly for every other char.

From L you can reconstruct the string

L

#

ar

a

ca

b

F

#

a

r

a

c

a

b

What is the first char of S ?

From L you can reconstruct the string

L

#

ar

a

ca

b

F

#

a

r

a

c

a

b

What is the first char of S ? a

From L you can reconstruct the string

L

#

ar

a

ca

b

F

#

a

r

a

c

a

b

ab

From L you can reconstruct the string

L

#

ar

a

ca

b

F

#

a

r

a

c

a

b

abr

Compression ?L

#

ar

a

ca

b

Compress the transform to a string of integers using move to front

0 2 3 2 0 3

Then use Huffman to code the integers

a b r a c a # b r a c a # ar a c a # a ba c a # a b rc a # a b r aa # a b r a c# a b r a c a

a b r a c a # b r a c a # aa c a # a b r

c a # a b r a

a # a b r a c# a b r a c a

r a c a # a b

Why is it good ?

LF

Characters with the same (right) context appear together

a b r a c a # b r a c a # ar a c a # a ba c a # a b rc a # a b r aa # a b r a c# a b r a c a

a b r a c a # b r a c a # aa c a # a b r

c a # a b r a

a # a b r a c# a b r a c a

r a c a # a b

Sorting is equivalent to computing the suffix array.

LF

Can encode and decode in linear time

frocc=2[lr-fr+1]

Substring search using the BWT (Count the pattern occurrences)

#mississippi#mississipippi#missisissippi#misississippi#mississippipi#mississippi#mississsippi#missisissippi#missippi#missssissippi#m

ipssm#pissii

L

mississippi

# 1i 2m 7p 8s 10

C

Availa

ble in

foP = si

First step

fr

lr Inductive step: Given fr,lr for P[j+1,p] Take

c=P[j]

P[ j ]

Find the first c in L[fr, lr]

Find the last c in L[fr, lr]

L-to-F mapping of these charslr

rows prefixedby char “i” s

s

slide stolen from Paolo Ferragina@

frocc=2[lr-fr+1]

Substring search using the BWT (Count the pattern occurrences)

#mississippi#mississipippi#missisissippi#misississippi#mississippipi#mississippi#mississsippi#missisissippi#missippi#missssissippi#m

ipssm#pissii

L

mississippi

# 1i 2m 7p 8s 10

C

Availa

ble in

foP = si

First step

fr

lr

lr

rows prefixedby char “i” s

s

slide stolen from Paolo Ferragina@

What if someone whispers how many “s” we have up to index 2 and up to index 5:

occ(s,2), occ(s,5) ?fr = C[s] + occ(s,2) + 1lr = C[s] + occ(s,5)

occ( a , j )

ipssm#pissii

L

occ(s,4) = 2

Make a bit vector for each character

ipssm#pissii

L

0 0 1 1 0 0 0 0 1 1 0 0

occ(s,4) = rank(4)

rank(i) = how many ones are there before position i ?

How do you answer rank queries ?

0 0 1 1 0 0 0 0 1 1 0 0

rank(i) = how many ones are there before position i ?

We can prepare a vector with all answers

0 0 1 2 2 2 2 2 3 4 4 4

Lets do it with O(n) bits per character

0 0 1 1 0 0 1 0 1 1 0 0

Partition in 2n/log(n) blocks of size log(n)/2

0 0 1 1 0 0 0 0 1 1 0 0logn/2

2 5 7

Keep the answer for each prefix of the blocksThere are n “kinds” of blocks, prepare a table with all answers for each block

0 0 1 1 0 0 1 0 1 1 0 0

In our solution the bit vector takes Θ(n) bits and also the “additionals” take Θ(n) bits

0 0 1 1 0 0 0 0 1 1 0 0logn/2

2 5 7

Can we do it with smaller overhead : so additionals

would take o(n) ?

0 0 1 1 0 0 1 0 1 1 0 0

superblocks of size log2(n)

0 0 1 1 0 0 0 0 1 1 0 0

log2n7 13

Each block keeps the number of one in previous blocks that are in the same superblock

0 0 1 1 0 0 0 0 1 1 0 0

0 0 1 1 0 0 0 0 1 1 0 02 4

Analysis

0 0 1 1 0 0 1 0 1 1 0 0

The superblock table is of size n/log (n)

0 0 1 1 0 0 0 0 1 1 0 0

log2n7 13

The block table is of size (loglog(n)) * n/log (n)

0 0 1 1 0 0 0 0 1 1 0 0

0 0 1 1 0 0 0 0 1 1 0 02 4

The tables for the blocks √n log(n)loglog(n)

So the additionals take o(n) space

Next step

Do it without keeping the bit vectors themselves

Instead keep only the compressed version of the text

Saves a lot of space for compressible strings