ETRI
Linear-Time Search in Suffix Arrays
July 14, 2003
Jeong Seop Sim, Dong Kyue Kim
Heejin Park, Kunsoo Park
Suffix arrays
Suffix array of text T: the lexicographically sorted list of all suffixes of T
Suffix arrays
Example for T = abbabaababbb#
The suffixes of T,
abbabaababbb# (1), bbabaababbb# (2), abaababbb# (3), …, b# (12), # (13),
are stored in lexicographical order.
1 #
2 a a b a b b b #
3 a b a a b a b b b #
4 a b a b b b #
5 a b b a b a a b a b b b #
6 a b b b #
7 b #
8 b a a b a b b b #
9 b a b a a b a b b b #
10 b a b b b #
11 b b #
12 b b a b a a b a b b b #
13 b b b #
# is a special character that is lexicographically smaller than every other character.
Suffix arrays
Example for T = abbabaababbb#
The suffixes of T are abbabaababbb# (1), bbabaababbb# (2), abaababbb# (3), …, b# (12), # (13).
An actual suffix array stores only the starting positions of the suffixes in T; for convenience, we assume that the suffixes themselves are stored.
Columns below: index in the suffix array, starting position in T, suffix.
1 13 #
2 6 a a b a b b b #
3 4 a b a a b a b b b #
4 7 a b a b b b #
5 1 a b b a b a a b a b b b #
6 9 a b b b #
7 12 b #
8 5 b a a b a b b b #
9 3 b a b a a b a b b b #
10 8 b a b b b #
11 11 b b #
12 2 b b a b a a b a b b b #
13 10 b b b #
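The table above can be reproduced with a minimal sketch. It sorts the suffixes directly, which is far slower than a real construction algorithm and is used here only for illustration. The function name `build_suffix_array` is hypothetical, and the sketch assumes that `#` sorts before the letters, which holds for ASCII.

```python
def build_suffix_array(text):
    """Return the 1-based starting positions of the suffixes of text, in sorted order."""
    # Naive construction: compare whole suffixes. Illustration only.
    return sorted(range(1, len(text) + 1), key=lambda pos: text[pos - 1:])


T = "abbabaababbb#"
SA = build_suffix_array(T)

# Print index, starting position, and suffix, as in the table above.
for index, pos in enumerate(SA, start=1):
    print(index, pos, T[pos - 1:])
```

The second column printed by the loop is the array that an actual suffix array stores.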
Suffix arrays
Definition: s-suffixes
Suffixes starting with string s (e.g., a-suffixes, ba-suffixes, …)
1 #
2 a a b a b b b #
3 a b a a b a b b b #
4 a b a b b b #
5 a b b a b a a b a b b b #
6 a b b b #
7 b #
8 b a a b a b b b #
9 b a b a a b a b b b #
10 b a b b b #
11 b b #
12 b b a b a a b a b b b #
13 b b b #
Suffix arrays vs. Suffix trees
Construction time: Suffix Array = Suffix Tree
Space: Suffix Array = Suffix Tree
In practice, suffix arrays are more space efficient than suffix trees.
Search time (p = |P|, n = |T|)
Suffix Array: O(p + log n)
Suffix Tree: O(p log |Σ|), O(p |Σ|)
Contribution
Construction time: Suffix Array = Suffix Tree
Space: Suffix Array = Suffix Tree
In practice, suffix arrays are more space efficient than suffix trees.
Search time (p = |P|, n = |T|)
Suffix Array: O(p + log n), O(p log |Σ|)
Suffix Tree: O(p log |Σ|), O(p |Σ|)
The meaning of our contribution
Search time: SA ≤ ST
Suffix arrays are more powerful than suffix trees.
Our search algorithm
Search in a suffix array
Definition: Search in a suffix array
Input: a pattern P and a suffix array of T
Output: all P-suffixes of T
Search in a suffix array
All ab-suffixes are neighbors.
1 #
2 a a b a b b b #
3 a b a a b a b b b #
4 a b a b b b #
5 a b b a b a a b a b b b #
6 a b b b #
7 b #
8 b a a b a b b b #
9 b a b a a b a b b b #
10 b a b b b #
11 b b #
12 b b a b a a b a b b b #
13 b b b #
P = ab
T = abbabaababbb#
Find all ab-suffixes.
A search example
Search in a suffix array
We only need to find the first and the last ab-suffixes,
because all other ab-suffixes are stored between them.
1 #
2 a a b a b b b #
3 a b a a b a b b b #
4 a b a b b b #
5 a b b a b a a b a b b b #
6 a b b b #
7 b #
8 b a a b a b b b #
9 b a b a a b a b b b #
10 b a b b b #
11 b b #
12 b b a b a a b a b b b #
13 b b b #
P = ab
T = abbabaababbb#
A search example
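As a point of comparison, the first and last ab-suffixes can be located with plain binary search over the sorted suffixes. This is a baseline sketch, not the backward search developed in this talk. The function name is hypothetical, and the sketch materializes truncated suffixes for simplicity, which a practical implementation would avoid.

```python
from bisect import bisect_left, bisect_right


def find_range_by_binary_search(text, pattern):
    """Return (first, last) 1-based suffix array indices of the pattern-suffixes, or None."""
    sa = sorted(range(1, len(text) + 1), key=lambda pos: text[pos - 1:])
    # Truncating sorted suffixes to len(pattern) characters keeps them in sorted order.
    prefixes = [text[pos - 1:pos - 1 + len(pattern)] for pos in sa]
    lo = bisect_left(prefixes, pattern)
    hi = bisect_right(prefixes, pattern)
    if lo == hi:
        return None  # no suffix starts with pattern
    return lo + 1, hi


print(find_range_by_binary_search("abbabaababbb#", "ab"))  # (3, 6)
```

Rows 3 to 6 of the table are exactly the ab-suffixes, so reporting the two endpoints is enough.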
Related work
In developing our search algorithm, we adopt an idea suggested by Ferragina and Manzini (FOCS 2001):
Search for a pattern in a file compressed by the Burrows-Wheeler compression algorithm.
Search P from the last character to the first character of P.
P = ababaaabbabaaabb
We adopt this backward pattern searching idea.
Algorithm outline
1 #
2 a a b a b b b #
3 a b a a b a b b b #
4 a b a b b b #
5 a b b a b a a b a b b b #
6 a b b b #
7 b #
8 b a a b a b b b #
9 b a b a a b a b b b #
10 b a b b b #
11 b b #
12 b b a b a a b a b b b #
13 b b b #
P = aba
T = abbabaababbb#
Outline of our search algorithm
We find all aba-suffixes
by searching P backward.
Our algorithm has p stages
(In this case, there are 3 stages.)
Algorithm outline
P = aba
T = abbabaababbb#
Outline of our search algorithm
We find all aba-suffixes
by searching P backward.
stage 1: find all a-suffixes.
stage 2: find all ba-suffixes.
stage 3: find all aba-suffixes.
1 #
2 a a b a b b b #
3 a b a a b a b b b #
4 a b a b b b #
5 a b b a b a a b a b b b #
6 a b b b #
7 b #
8 b a a b a b b b #
9 b a b a a b a b b b #
10 b a b b b #
11 b b #
12 b b a b a a b a b b b #
13 b b b #
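The ranges that the three stages should produce can be checked with a deliberately naive sketch. It filters the sorted suffixes directly, so it shows what each stage finds, not how the algorithm finds it efficiently. The function name `stage_ranges` is hypothetical.

```python
def stage_ranges(text, pattern):
    """For each stage k, return (s, first, last) where s is the last k characters of pattern."""
    suffixes = sorted(text[i:] for i in range(len(text)))
    ranges = []
    for k in range(1, len(pattern) + 1):
        s = pattern[-k:]  # stage k handles the s-suffixes
        hits = [i + 1 for i, suffix in enumerate(suffixes) if suffix.startswith(s)]
        ranges.append((s, hits[0], hits[-1]) if hits else (s, None, None))
    return ranges


print(stage_ranges("abbabaababbb#", "aba"))
# [('a', 2, 6), ('ba', 8, 10), ('aba', 3, 4)]
```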
Elaborate stage 2
1 #
2 a a b a b b b #
3 a b a a b a b b b #
4 a b a b b b #
5 a b b a b a a b a b b b #
6 a b b b #
7 b #
8 b a a b a b b b #
9 b a b a a b a b b b #
10 b a b b b #
11 b b #
12 b b a b a a b a b b b #
13 b b b #
P = aba
A stage in detail: stage 2
We find all ba-suffixes
using the a-suffixes found in stage 1.
We find the first ba-suffix from the
first a-suffix and the last ba-suffix
from the last a-suffix.
Elaborate stage 2
1 #
2 a a b a b b b #
3 a b a a b a b b b #
4 a b a b b b #
5 a b b a b a a b a b b b #
6 a b b b #
7 b #
8 b a a b a b b b #
9 b a b a a b a b b b #
10 b a b b b #
11 b b #
12 b b a b a a b a b b b #
13 b b b #
P = aba
A stage in detail: stage 2
We explain only how to find the first
ba-suffix from the first a-suffix.
Finding the last ba-suffix is similar.
Elaborate stage 2
1 #
2 a a b a b b b #
3 a b a a b a b b b #
4 a b a b b b #
5 a b b a b a a b a b b b #
6 a b b b #
7 b #
8 b a a b a b b b #
9 b a b a a b a b b b #
10 b a b b b #
11 b b #
12 b b a b a a b a b b b #
13 b b b #
To find the first ba-suffix, we count the number of suffixes that precede ba-suffixes in this suffix array.
P = aba
1 #
2 a a b a b b b #
3 a b a a b a b b b #
4 a b a b b b #
5 a b b a b a a b a b b b #
6 a b b b #
7 b #
8 b a a b a b b b #
9 b a b a a b a b b b #
10 b a b b b #
11 b b #
12 b b a b a a b a b b b #
13 b b b #
Elaborate stage 2
Suffixes preceding ba-suffixes are divided into two categories.
- A-type: suffixes starting with a character lexicographically smaller than b (#-suffixes and a-suffixes).
- B-type: suffixes starting with the same character b and preceding ba-suffixes.
We count A-type and B-type suffixes in different ways.
Count the number of A-type suffixes
1 #
2 a a b a b b b #
3 a b a a b a b b b #
4 a b a b b b #
5 a b b a b a a b a b b b #
6 a b b b #
7 b #
8 b a a b a b b b #
9 b a b a a b a b b b #
10 b a b b b #
11 b b #
12 b b a b a a b a b b b #
13 b b b #
The number of A-type suffixes = the number of #-suffixes and a-suffixes = the position of the last a-suffix (6 in this example).
Count the number of A-type suffixes
We generate an array that stores the positions of the last #-suffix, the last a-suffix, and the last b-suffix.
With this array, we can count A-type suffixes in O(1) time.
1 #
2 a a b a b b b #
3 a b a a b a b b b #
4 a b a b b b #
5 a b b a b a a b a b b b #
6 a b b b #
7 b #
8 b a a b a b b b #
9 b a b a a b a b b b #
10 b a b b b #
11 b b #
12 b b a b a a b a b b b #
13 b b b #
# 1
a 6
b 13
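A minimal sketch of how this array might be filled in one scan of the suffix array. The array is called M here, following the name used in the summary, and the function name `last_positions` is hypothetical. A Python dict stands in for an array indexed by character.

```python
def last_positions(text, sa):
    """M[c] = 1-based suffix array index of the last c-suffix."""
    M = {}
    for index, pos in enumerate(sa, start=1):
        # Suffixes are sorted, so the latest index seen for a first character is its last.
        M[text[pos - 1]] = index
    return M


T = "abbabaababbb#"
SA = sorted(range(1, len(T) + 1), key=lambda pos: T[pos - 1:])  # naive construction
M = last_positions(T, SA)
print(M)

# A-type suffixes for stage "ba" start with a character smaller than b,
# so their number is the position of the last a-suffix.
print(M["a"])  # 6
```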
Count the number of A-type suffixes
1 #
2 a a b a b b b #
3 a b a a b a b b b #
4 a b a b b b #
5 a b b a b a a b a b b b #
6 a b b b #
7 b #
8 b a a b a b b b #
9 b a b a a b a b b b #
10 b a b b b #
11 b b #
12 b b a b a a b a b b b #
13 b b b #
Array
Space: O(|Σ|)
Time: O(n) (one scan)
# 1
a 6
b 13
Count the number of B-type suffixes
1 #
2 a a b a b b b #
3 a b a a b a b b b #
4 a b a b b b #
5 a b b a b a a b a b b b #
6 a b b b #
7 b #
8 b a a b a b b b #
9 b a b a a b a b b b #
10 b a b b b #
11 b b #
12 b b a b a a b a b b b #
13 b b b #
Count B-type suffixes: b-suffixes preceding ba-suffixes.
Count the number of B-type suffixes
1 #
2 a a b a b b b #
3 a b a a b a b b b #
4 a b a b b b #
5 a b b a b a a b a b b b #
6 a b b b #
7 b #
8 b a a b a b b b #
9 b a b a a b a b b b #
10 b a b b b #
11 b b #
12 b b a b a a b a b b b #
13 b b b #
B-type suffixes: b-suffixes preceding ba-suffixes.
The suffix obtained by removing the leading b from a B-type suffix appears in the suffix subarray that precedes the a-suffixes found in stage 1.
1 #
2 a a b a b b b #
3 a b a a b a b b b #
4 a b a b b b #
5 a b b a b a a b a b b b #
6 a b b b #
7 b #
8 b a a b a b b b #
9 b a b a a b a b b b #
10 b a b b b #
11 b b #
12 b b a b a a b a b b b #
13 b b b #
Count the number of B-type suffixes
The number of B-type suffixes is the number of suffixes, in the suffix subarray that precedes the a-suffixes, whose previous character is b.
We count this with array N.
Let U be the conceptual array of previous characters of suffixes (listed in suffix array order):
U = b b b a # b b a b a b a a
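U can be derived from the text and the suffix array, as in the sketch below. The function name is hypothetical. The wrap-around for the suffix starting at position 1, whose previous character is taken to be the final #, is inferred from the # entry in the U column of this example.

```python
def previous_characters(text, sa):
    """U[i] = character that precedes the i-th suffix in text (wrapping around at position 1)."""
    # For pos == 1, text[pos - 2] is text[-1], the final '#'.
    return [text[pos - 2] for pos in sa]


T = "abbabaababbb#"
SA = sorted(range(1, len(T) + 1), key=lambda pos: T[pos - 1:])  # naive construction
U = previous_characters(T, SA)
print(" ".join(U))  # b b b a # b b a b a b a a
```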
1 #
2 a a b a b b b #
3 a b a a b a b b b #
4 a b a b b b #
5 a b b a b a a b a b b b #
6 a b b b #
7 b #
8 b a a b a b b b #
9 b a b a a b a b b b #
10 b a b b b #
11 b b #
12 b b a b a a b a b b b #
13 b b b #
Count the number of B-type suffixes
Array N (n|Σ| entries)
N[i,b] stores the number of suffixes in the suffix subarray SA[1,i] whose previous character is b.
i U[i] # a b
1 b 0 0 1
2 b 0 0 2
3 b 0 0 3
4 a 0 1 3
5 # 1 1 3
6 b 1 1 4
7 b 1 1 5
8 a 1 2 5
9 b 1 2 6
10 a 1 3 6
11 b 1 3 7
12 a 1 4 7
13 a 1 5 7
Example: N[7,b] = 5
1 #
2 a a b a b b b #
3 a b a a b a b b b #
4 a b a b b b #
5 a b b a b a a b a b b b #
6 a b b b #
7 b #
8 b a a b a b b b #
9 b a b a a b a b b b #
10 b a b b b #
11 b b #
12 b b a b a a b a b b b #
13 b b b #
Count the number of B-type suffixes
We can count B-type
suffixes in O(1) time
by accessing an entry of N.
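Putting the two counts together for stage 2 can be sketched as follows. U and M are hard-coded from the example. `build_N` is a hypothetical helper, and the extra all-zero row N[0] is an assumption added so that an empty preceding subarray needs no special case. The formula for the last ba-suffix is the symmetric counterpart that the slides describe only as "similar".

```python
def build_N(U, alphabet):
    """N[i][c] = number of occurrences of c in U[1..i]; N[0] is an all-zero row."""
    N = [dict.fromkeys(alphabet, 0)]
    for ch in U:
        row = dict(N[-1])
        row[ch] += 1
        N.append(row)
    return N


U = "bbba#bbababaa"                   # previous characters, from the example
M = {"#": 1, "a": 6, "b": 13}         # positions of the last #-, a-, and b-suffix
N = build_N(U, "#ab")

first_a, last_a = 2, 6                # a-suffixes found in stage 1

a_type = M["a"]                       # suffixes starting with a character smaller than b
b_type = N[first_a - 1]["b"]          # b's in U before the a-suffixes
first_ba = a_type + b_type + 1
last_ba = M["a"] + N[last_a]["b"]     # assumed symmetric formula for the last ba-suffix

print(N[7]["b"], first_ba, last_ba)   # 5 8 10
```

Storing every row of N is what costs n|Σ| entries; the rest of the talk reduces that space.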
Array N
Space: O(n|Σ|)
An alternative way
Space: O(n), with O(log |Σ|) time for counting B-type suffixes.
Query for N[i,b]: counting B-type suffixes
O(log n) time
O(log |Σ|) time
1 #
2 a a b a b b b #
3 a b a a b a b b b #
4 a b a b b b #
5 a b b a b a a b a b b b #
6 a b b b #
7 b #
8 b a a b a b b b #
9 b a b a a b a b b b #
10 b a b b b #
11 b b #
12 b b a b a a b a b b b #
13 b b b #
Query for N[i,b]: O(log n) time
U = b b b a # b b a b a b a a
In the O(log n)-time algorithm,
we generate an array
whose i-th entry stores
the location of the i-th b in U.
1 1
2 2
3 3
4 6
5 7
6 9
7 11
Query for N[i,b]: O(log n) time
U = b b b a # b b a b a b a a
1 1
2 2
3 3
4 6
5 7
6 9
7 11
Counting the suffixes in SA[1,8] whose previous character is b
= counting the b's in U[1,8].
1 #
2 a a b a b b b #
3 a b a a b a b b b #
4 a b a b b b #
5 a b b a b a a b a b b b #
6 a b b b #
7 b #
8 b a a b a b b b #
9 b a b a a b a b b b #
10 b a b b b #
11 b b #
12 b b a b a a b a b b b #
13 b b b #
Query for N[i,b]: O(log n) time
U = b b b a # b b a b a b a a
1 1
2 2
3 3
4 6
5 7
6 9
7 11
Find the largest value not
exceeding 8 in this array.
1 #
2 a a b a b b b #
3 a b a a b a b b b #
4 a b a b b b #
5 a b b a b a a b a b b b #
6 a b b b #
7 b #
8 b a a b a b b b #
9 b a b a a b a b b b #
10 b a b b b #
11 b b #
12 b b a b a a b a b b b #
13 b b b #
Query for N[i,b]: O(log n) time
U = b b b a # b b a b a b a a
1 1
2 2
3 3
4 6
5 7
6 9
7 11
1 #
2 a a b a b b b #
3 a b a a b a b b b #
4 a b a b b b #
5 a b b a b a a b a b b b #
6 a b b b #
7 b #
8 b a a b a b b b #
9 b a b a a b a b b b #
10 b a b b b #
11 b b #
12 b b a b a a b a b b b #
13 b b b #
To find 7 in this array,
we perform binary search.
O(log n) time.
Query for N[i,b]: O(log n) time
U = b b b a # b b a b a b a a
1 1
2 2
3 3
4 6
5 7
6 9
7 11
The index of 7 in this array (5) is
the number of b's in U[1,8].
1 #
2 a a b a b b b #
3 a b a a b a b b b #
4 a b a b b b #
5 a b b a b a a b a b b b #
6 a b b b #
7 b #
8 b a a b a b b b #
9 b a b a a b a b b b #
10 b a b b b #
11 b b #
12 b b a b a a b a b b b #
13 b b b #
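The query just walked through can be sketched with Python's `bisect` module. The names `occurrences` and `rank` are hypothetical, and U is hard-coded from the example.

```python
from bisect import bisect_right

U = "bbba#bbababaa"  # previous characters, from the example

# For every character, the sorted list of its 1-based locations in U.
occurrences = {}
for location, ch in enumerate(U, start=1):
    occurrences.setdefault(ch, []).append(location)


def rank(ch, i):
    """Number of occurrences of ch in U[1..i], found by binary search."""
    # bisect_right returns how many stored locations are <= i.
    return bisect_right(occurrences.get(ch, []), i)


print(occurrences["b"])  # [1, 2, 3, 6, 7, 9, 11]
print(rank("b", 8))      # 5
```

The largest location not exceeding 8 is 7, stored at index 5, so the call returns 5. All lists together hold n entries.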
Query for N[i,b]: O(log n) time
U = b b b a # b b a b a b a a
Locations of b in U:
1 1
2 2
3 3
4 6
5 7
6 9
7 11
Locations of a in U:
1 4
2 8
3 10
4 12
5 13
1 #
2 a a b a b b b #
3 a b a a b a b b b #
4 a b a b b b #
5 a b b a b a a b a b b b #
6 a b b b #
7 b #
8 b a a b a b b b #
9 b a b a a b a b b b #
10 b a b b b #
11 b b #
12 b b a b a a b a b b b #
13 b b b #
Locations of # in U:
1 5
Generally, we require such an array for every character (#, a, b).
O(n) space in total.
Query for N[i,b]
O(log n) time
O(log |Σ|) time
Query for N[i,b]: O(log |Σ|) time
Divide U into |Σ|-sized blocks.
For the last character of each block, we compute the entries of N.
U = b b b | a # b | b a b | a b a | a
i # a b
3 0 0 3
6 1 1 4
9 1 2 6
12 1 4 7
Query for N[i,b]: O(log |Σ|) time
For the other entries in each block, we generate a data structure similar to the one used in the O(log n)-time algorithm.
Binary search within a block takes O(log |Σ|) time.
Still O(n) space in total.
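One way the block scheme might look in code is sketched below. The layout is an assumption: one row of N stored per block boundary, plus per-block sorted offset lists searched with `bisect`. The names `build_blocks` and `rank` are hypothetical, and dicts stand in for the arrays.

```python
from bisect import bisect_right


def build_blocks(U, alphabet):
    """Split U into blocks of size |alphabet| and precompute per-block information."""
    B = len(alphabet)
    checkpoints = [dict.fromkeys(alphabet, 0)]  # row k = counts in the first k blocks
    in_block = []                               # per block: char -> sorted 1-based offsets
    for start in range(0, len(U), B):
        block = U[start:start + B]
        offsets = {}
        row = dict(checkpoints[-1])
        for offset, ch in enumerate(block, start=1):
            offsets.setdefault(ch, []).append(offset)
            row[ch] += 1
        in_block.append(offsets)
        checkpoints.append(row)
    return B, checkpoints, in_block


def rank(structure, ch, i):
    """Number of occurrences of ch in U[1..i] (1-based, inclusive)."""
    B, checkpoints, in_block = structure
    k, r = divmod(i, B)              # k complete blocks, then r characters of block k
    count = checkpoints[k][ch]
    if r:
        # Binary search over at most |alphabet| offsets inside one block.
        count += bisect_right(in_block[k].get(ch, []), r)
    return count


U = "bbba#bbababaa"
structure = build_blocks(U, "#ab")
print(rank(structure, "b", 8))  # 5
```

For the query N[8,b], the stored row at i = 6 contributes 4 and the search inside the third block contributes 1.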
Summary
p stages
Each stage
- Count A-type suffixes. Time: O(1). Space: O(n) for array M.
- Count B-type suffixes. Time: O(log |Σ|). Space: O(n) for computing the value of an entry of N.
In total, O(p log |Σ|) time with O(n) space.
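The whole search can be sketched end to end as below. This is an illustrative reading of the stages, not the authors' implementation. For brevity it answers N queries by binary search over per-character location lists (the O(log n) variant) and builds the suffix array naively. All names are hypothetical, and `smaller[c]` plays the role of the A-type count for character c.

```python
from bisect import bisect_right


def build_index(text):
    sa = sorted(range(1, len(text) + 1), key=lambda pos: text[pos - 1:])  # naive construction
    U = [text[pos - 2] for pos in sa]          # previous characters, wrapping at position 1
    smaller, total = {}, 0
    for c in sorted(set(text)):
        smaller[c] = total                     # suffixes starting with a character < c
        total += text.count(c)
    locations = {}
    for i, ch in enumerate(U, start=1):
        locations.setdefault(ch, []).append(i)
    return sa, smaller, locations


def backward_search(index, pattern):
    """Return (first, last) 1-based suffix array indices of the pattern-suffixes, or None."""
    sa, smaller, locations = index
    first, last = 1, len(sa)                   # before stage 1: every suffix qualifies
    for c in reversed(pattern):                # one stage per character, last to first
        if c not in smaller:
            return None
        locs = locations.get(c, [])
        first = smaller[c] + bisect_right(locs, first - 1) + 1   # A-type + B-type + 1
        last = smaller[c] + bisect_right(locs, last)
        if first > last:
            return None
    return first, last


index = build_index("abbabaababbb#")
print(backward_search(index, "aba"))  # (3, 4)
```

For P = aba the intermediate ranges are (2, 6), then (8, 10), then (3, 4), matching stages 1 to 3 of the example.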
Conclusion
In a suffix array, one can choose an O(p log |Σ|)-time or an O(p + log n)-time search algorithm depending on the alphabet size.
Suffix arrays are more powerful than suffix trees.