suffix arrays. suffix array we loose some of the functionality but we save space. let s = abab sort...
Post on 20-Dec-2015
226 views
TRANSCRIPT
Suffix array
• We loose some of the functionality but we save space.
Let s = ababSort the suffixes lexicographically: ab, abab, b, bab
The suffix array gives the indices of the suffixes in sorted order
2 0 3 1
How do we build it ?
• Build a suffix tree• Traverse the tree in DFS,
lexicographically picking edges outgoing from each node and fill the suffix array.
• O(n) time
How do we search for a pattern ?
• If P occurs in T then all its occurrences are consecutive in the suffix array.
• Do a binary search on the suffix array
• Takes O(mlogn) time
ExampleLet S = mississippi
iippiissippiississippimississippipi
7
4
1
0
9
8
6
3
10
5
2
ppisippisisippissippississippi
L
R
Let P = issa
M
Analysis
If we do more than a single comparison in an iteration then max(, r ) grows by 1 for each comparison O(m + logn) time
Linear time construction
Recursively ?
Say we want to sort only suffixes that start at even positions ?
Change the alphabet
You in fact sort suffixes of a string shorter by a factor of 2 !
Every pair of characters is now a character
Divide into triples
$ y a b b a d a b b a d o
abb
ada bba
do$
$ y a b b a d a b b a d o
bba
dab
bad
o$$
Sort recursively 2/3 of the suffixes
$ y a b b a d a b b a d o
abb
ada bba
do$ bba
dab
bad
o$$
1 2 4 6 4 5 3 7
0 1 6 4 2 5 3 7
$ y a b b a d a b b a d o
1 4 2 6 5 3 7 8
1 4 8 2 7 5 10
11
1 2 3 4 5 6 7 8 9 10 11 120
0 1 2 3 4 5 6 7
Sort the remaining third
$ y a b b a d a b b a d o
1 4 2 6 5 3 7 8
(b, 2)
(a, 5)
(a, 7)
(y, 1)
(b, 2)
(a, 5)
(a, 7)
(y, 1)
36 9 0
1 2 3 4 5 6 7 8 9 10 11 120
1 4 8 2 7 5 10
11
Merge
$ y a b b a d a b b a d o
1 4 2 6 5 3 7 8
1 2 3 4 5 6 7 8 9 10 11 12
1
0
36 9 0
1 4 8 2 7 5 10
11
Merge
$ y a b b a d a b b a d o
1 4 2 6 5 3 7 8
1 2 3 4 5 6 7 8 9 10 11 12
1
0
36 9 0
4 8 2 7 5 10
11
6
Merge
$ y a b b a d a b b a d o
1 4 2 6 5 3 7 8
1 2 3 4 5 6 7 8 9 10 11 12
1
0
39 0
4 8 2 7 5 10
11
6 4
Merge
$ y a b b a d a b b a d o
1 4 2 6 5 3 7 8
1 2 3 4 5 6 7 8 9 10 11 12
1
0
39 0
8 2 7 5 10
11
6 4 9
Merge
$ y a b b a d a b b a d o
1 4 2 6 5 3 7 8
1 2 3 4 5 6 7 8 9 10 11 12
1
0
3 0
8 2 7 5 10
11
6 4 9 3
Merge
$ y a b b a d a b b a d o
1 4 2 6 5 3 7 8
1 2 3 4 5 6 7 8 9 10 11 12
1
0
0
8 2 7 5 10
11
6 4 9 3 8
Merge
$ y a b b a d a b b a d o
1 4 2 6 5 3 7 8
1 2 3 4 5 6 7 8 9 10 11 12
1
0
0
2 7 5 10
11
6 4 9 3 8 2
Merge
$ y a b b a d a b b a d o
1 4 2 6 5 3 7 8
1 2 3 4 5 6 7 8 9 10 11 12
1
0
0
7 5 10
11
6 4 9 3 8 2 7
Merge
$ y a b b a d a b b a d o
1 4 2 6 5 3 7 8
1 2 3 4 5 6 7 8 9 10 11 12
1
0
0
5 10
11
6 4 9 3 8 2 7 5
Merge
$ y a b b a d a b b a d o
1 4 2 6 5 3 7 8
1 2 3 4 5 6 7 8 9 10 11 12
1
0
0
10
11
6 4 9 3 8 2 7 5
Merge
$ y a b b a d a b b a d o
1 4 2 6 5 3 7 8
1 2 3 4 5 6 7 8 9 10 11 12
1
0
6 4 9 3 8 2 7 5 10
11
0
summary
$ y a b b a d a b b a d o
1 4 2 6 5 3 7 8
1 2 3 4 5 6 7 8 9 10 11 12
1
0
6 4 9 3 8 2 7 5 10
11
0
When comparing to a suffix with index 1 (mod 3) we compare the char and break ties by the ranks of the following suffixes
When comparing to a suffix with index 2 (mod 3) we compare the char, the next char if there is a tie, and finally the ranks of the following suffixes
Compute LCP’s
$ y a b b a d a b b a d o
1 2 3 4 5 6 7 8 9 10 11 12
1
0
6 4 9 3 8 2 7 5 10
11
0
abbado$abbadabbado$
adabbado$ado$badabbado$bado$bbadabbado$bbado$dabbado$do$o$yabbadabbado$
1
4
827510110
6
39
Crucial observation
$ y a b b a d a b b a d o
1 2 3 4 5 6 7 8 9 10 11 12
1
0
6 4 9 3 8 2 7 5 10
11
0
abbado$abbadabbado$
adabbado$ado$badabbado$bado$bbadabbado$bbado$dabbado$do$o$yabbadabbado$
LCP(i,j) = min {LCP(i,i+1),LCP(i+1,i+2),….,LCP(j-1,j)}
1
4
827510110
6
39
$ y a b b a d a b b a d o1 2 3 4 5 6 7 8 9 10 11 12
1
0
6 4 9 3 8 2 7 5 10
11
0
abbado$abbadabbado$
adabbado$ado$badabbado$bado$bbadabbado$bbado$dabbado$do$o$yabbadabbado$
LCP(11,0)
16493827510110
0
Find LCP’s of consecutive suffixes
$ y a b b a d a b b a d o
1 2 3 4 5 6 7 8 9 10 11 12
1
0
6 4 9 3 8 2 7 5 10
11
0
abbado$abbadabbado$
adabbado$ado$badabbado$bado$bbadabbado$bbado$dabbado$do$o$yabbadabbado$
LCP(8,2)
16493827510110
01
$ y a b b a d a b b a d o
1 2 3 4 5 6 7 8 9 10 11 12
1
0
6 4 9 3 8 2 7 5 10
11
0
abbado$abbadabbado$
adabbado$ado$badabbado$bado$bbadabbado$bbado$dabbado$do$o$yabbadabbado$
LCP(9,3)
16493827510110
010
$ y a b b a d a b b a d o
1 2 3 4 5 6 7 8 9 10 11 12
1
0
6 4 9 3 8 2 7 5 10
11
0
abbado$abbadabbado$
adabbado$ado$badabbado$bado$bbadabbado$bbado$dabbado$do$o$yabbadabbado$
LCP(6,4)
16493827510110
1 010
$ y a b b a d a b b a d o
1 2 3 4 5 6 7 8 9 10 11 12
1
0
6 4 9 3 8 2 7 5 10
11
0
abbado$abbadabbado$
adabbado$ado$badabbado$bado$bbadabbado$bbado$dabbado$do$o$yabbadabbado$
LCP(7,5)
16493827510110
01 010
$ y a b b a d a b b a d o
1 2 3 4 5 6 7 8 9 10 11 12
1
0
6 4 9 3 8 2 7 5 10
11
0
abbado$abbadabbado$
adabbado$ado$badabbado$bado$bbadabbado$bbado$dabbado$do$o$yabbadabbado$
LCP(1,6)
16493827510110
5 01 010
$ y a b b a d a b b a d o
1 2 3 4 5 6 7 8 9 10 11 12
1
0
6 4 9 3 8 2 7 5 10
11
0
abbado$abbadabbado$
adabbado$ado$badabbado$bado$bbadabbado$bbado$dabbado$do$o$yabbadabbado$
LCP(2,7)
16493827510110
45 01 010
$ y a b b a d a b b a d o
1 2 3 4 5 6 7 8 9 10 11 12
1
0
6 4 9 3 8 2 7 5 10
11
0
abbado$abbadabbado$
adabbado$ado$badabbado$bado$bbadabbado$bbado$dabbado$do$o$yabbadabbado$
16493827510110
45 01 010
LCP(3,8)
3
$ y a b b a d a b b a d o
1 2 3 4 5 6 7 8 9 10 11 12
1
0
6 4 9 3 8 2 7 5 10
11
0
abbado$abbadabbado$
adabbado$ado$badabbado$bado$bbadabbado$bbado$dabbado$do$o$yabbadabbado$
16493827510110
45 01 010
LCP(4,9)
32
$ y a b b a d a b b a d o
1 2 3 4 5 6 7 8 9 10 11 12
1
0
6 4 9 3 8 2 7 5 10
11
0
abbado$abbadabbado$
adabbado$ado$badabbado$bado$bbadabbado$bbado$dabbado$do$o$yabbadabbado$
16493827510110
45 01 010
LCP(5,10)
32 1
$ y a b b a d a b b a d o
1 2 3 4 5 6 7 8 9 10 11 12
1
0
6 4 9 3 8 2 7 5 10
11
0
abbado$abbadabbado$
adabbado$ado$badabbado$bado$bbadabbado$bbado$dabbado$do$o$yabbadabbado$
16493827510110
45 01 010
LCP(10,11)
32 1 0
$ y a b b a d a b b a d o
1 2 3 4 5 6 7 8 9 10 11 12
1
0
6 4 9 3 8 2 7 5 10
11
045 01 010 32 1 0
We need more LCPs for search
Linearly many, calculate the all bottom up
a b c a b b c a $
2 3 4 5 6 7 8
4
1
1 8 5 2 6 3 7 9
$ca$cabbca$bca$bcabbca$bbca$a$abcabbca$abbca$
973625814
Another example9
2 1 00 31 02
Burrows –Wheeler (bzip2)
Currently best algorithm for textBasic Idea:• Sort the characters by their full context
(typically done in blocks). This is called the block sorting transform.
• Use move-to-front encoding to encode the sorted characters.
The ingenious observation is that the decoder only needs the sorted characters and a pointer to the first character of the original sequence.
S = abracaשבנויה מכל ההזזות Mיצירת מטריצה : Iשלב
: Sהציקליות של
M =
a b r a c a # b r a c a # ar a c a # a ba c a # a b rc a # a b r aa # a b r a c# a b r a c a
מיון השורות בסדר לקסיקוגרפי: : IIשלב
a b r a c a #
b r a c a # aa c a # a b r
c a # a b r a
a # a b r a c# a b r a c a
r a c a # a b
LF
L is the Burrows Wheeler Transform
a b r a c a # b r a c a # ar a c a # a ba c a # a b rc a # a b r aa # a b r a c# a b r a c a
a b r a c a # b r a c a # aa c a # a b r
c a # a b r a
a # a b r a c# a b r a c a
r a c a # a b
Claim: Every column contains all chars.
LF
You can obtain F from L by sorting
a b r a c a #
b r a c a # aa c a # a b r
c a # a b r a
a # a b r a c# a b r a c a
r a c a # a b
LF
The “a’s” are in the same order in L and in F,
Similarly for every other char.
Compression ?L
#
ar
a
ca
b
Compress the transform to a string of integers using move to front
0 2 3 2 0 3
Then use Huffman to code the integers
a b r a c a # b r a c a # ar a c a # a ba c a # a b rc a # a b r aa # a b r a c# a b r a c a
a b r a c a # b r a c a # aa c a # a b r
c a # a b r a
a # a b r a c# a b r a c a
r a c a # a b
Why is it good ?
LF
Characters with the same (right) context appear together
a b r a c a # b r a c a # ar a c a # a ba c a # a b rc a # a b r aa # a b r a c# a b r a c a
a b r a c a # b r a c a # aa c a # a b r
c a # a b r a
a # a b r a c# a b r a c a
r a c a # a b
Sorting is equivalent to computing the suffix array.
LF
Can encode and decode in linear time
frocc=2[lr-fr+1]
Substring search using the BWT (Count the pattern occurrences)
#mississippi#mississipippi#missisissippi#misississippi#mississippipi#mississippi#mississsippi#missisissippi#missippi#missssissippi#m
ipssm#pissii
L
mississippi
# 1i 2m 7p 8s 10
C
Availa
ble in
foP = si
First step
fr
lr Inductive step: Given fr,lr for P[j+1,p] Take
c=P[j]
P[ j ]
Find the first c in L[fr, lr]
Find the last c in L[fr, lr]
L-to-F mapping of these charslr
rows prefixedby char “i” s
s
slide stolen from Paolo Ferragina@
frocc=2[lr-fr+1]
Substring search using the BWT (Count the pattern occurrences)
#mississippi#mississipippi#missisissippi#misississippi#mississippipi#mississippi#mississsippi#missisissippi#missippi#missssissippi#m
ipssm#pissii
L
mississippi
# 1i 2m 7p 8s 10
C
Availa
ble in
foP = si
First step
fr
lr
lr
rows prefixedby char “i” s
s
slide stolen from Paolo Ferragina@
What if someone whispers how many “s” we have up to index 2 and up to index 5:
occ(s,2), occ(s,5) ?fr = C[s] + occ(s,2) + 1lr = C[s] + occ(s,5)
Make a bit vector for each character
ipssm#pissii
L
0 0 1 1 0 0 0 0 1 1 0 0
occ(s,4) = rank(4)
rank(i) = how many ones are there before position i ?
How do you answer rank queries ?
0 0 1 1 0 0 0 0 1 1 0 0
rank(i) = how many ones are there before position i ?
We can prepare a vector with all answers
0 0 1 2 2 2 2 2 3 4 4 4
Lets do it with O(n) bits per character
0 0 1 1 0 0 1 0 1 1 0 0
Partition in 2n/log(n) blocks of size log(n)/2
0 0 1 1 0 0 0 0 1 1 0 0logn/2
2 5 7
Keep the answer for each prefix of the blocksThere are n “kinds” of blocks, prepare a table with all answers for each block
0 0 1 1 0 0 1 0 1 1 0 0
In our solution the bit vector takes Θ(n) bits and also the “additionals” take Θ(n) bits
0 0 1 1 0 0 0 0 1 1 0 0logn/2
2 5 7
Can we do it with smaller overhead : so additionals
would take o(n) ?
0 0 1 1 0 0 1 0 1 1 0 0
superblocks of size log2(n)
0 0 1 1 0 0 0 0 1 1 0 0
log2n7 13
Each block keeps the number of one in previous blocks that are in the same superblock
0 0 1 1 0 0 0 0 1 1 0 0
0 0 1 1 0 0 0 0 1 1 0 02 4
Analysis
0 0 1 1 0 0 1 0 1 1 0 0
The superblock table is of size n/log (n)
0 0 1 1 0 0 0 0 1 1 0 0
log2n7 13
The block table is of size (loglog(n)) * n/log (n)
0 0 1 1 0 0 0 0 1 1 0 0
0 0 1 1 0 0 0 0 1 1 0 02 4
The tables for the blocks √n log(n)loglog(n)
So the additionals take o(n) space