Suffix trees
Trie
• A tree representing a set of strings.
Example set: { aeef, ad, bbfe, bbfg, c }
[Figure: trie for the set above, one letter per edge]
Trie (Cont)
• Assume no string is a prefix of another.
[Figure: trie for { aeef, ad, bbfe, bbfg, c }]
1) Each edge is labeled by a letter,
2) No two edges outgoing from the same node are labeled the same.
3) Each string corresponds to a leaf.
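The three properties above can be illustrated with a minimal sketch. It assumes a trie stored as nested Python dictionaries (one key per edge letter); the names build_trie and count_leaves are hypothetical and not from the slides.

```python
def build_trie(words):
    """Build a trie as nested dicts: each key is the letter on an outgoing edge."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            # setdefault guarantees at most one edge per letter at each node
            node = node.setdefault(ch, {})
    return root


def count_leaves(node):
    """A node with no outgoing edges is a leaf."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


trie = build_trie(["aeef", "ad", "bbfe", "bbfg", "c"])
print(sorted(trie))        # letters on the edges leaving the root
print(count_leaves(trie))  # one leaf per string when the set is prefix-free
```

Because no string in the example set is a prefix of another, the number of leaves equals the number of strings.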
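Compressed Trie
• Compress unary nodes, label edges by strings.
[Figure: the trie before and after compression; for example the path b, b, f becomes a single edge bbf and the path e, e, f becomes a single edge eef]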
Suffix tree
Given a string s, a suffix tree of s is a compressed trie of all suffixes of s.
To make these suffixes prefix-free we add a special character, say $, at the end of s.
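Suffix tree (Example)
Let s = abab. A suffix tree of s is a compressed trie of all suffixes of abab$:
{ $, b$, ab$, bab$, abab$ }
[Figure: suffix tree of abab$]
Note that a suffix tree has O(n) nodes, where n = |s| (why? Hint: it has n+1 leaves, and every internal node has at least two children.)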
Trivial algorithm to build a suffix tree
Put the longest suffix (abab$) in.
Put the suffix bab$ in.
Put the suffix ab$ in.
Put the suffix b$ in.
Put the suffix $ in.
[Figures: the tree after each insertion; inserting ab$ splits the edge abab$ into ab and ab$, and inserting b$ splits the edge bab$ into b and ab$]
We will also label each leaf with the starting point of the corresponding suffix.
[Figure: suffix tree of abab$ with its leaves labeled 1 to 5]
Analysis
The trivial algorithm takes O(n^2) time to build the tree.
It is possible to construct it in O(n) time.
But how come? Does the tree even take O(n) space?
Linear space?
• Consider the string aaaaaabbbbbb
[Figure: suffix tree of aaaaaabbbbbb$ with explicit edge labels such as a, b, b$, bbbbbb$ and abbbbbb$; the labels alone have total length quadratic in n]
To use only O(n) space encode the edge-labels
• Replace each label by a pair of positions (i,j) in the string: bbbbbb$ becomes (7,13), abbbbbb$ becomes (6,13), a becomes (1,1), b becomes (7,7), b$ becomes (12,13) and $ becomes (13,13).
[Figure: the same tree with every edge label replaced by its pair of positions]
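The trivial insertion algorithm and the index-pair encoding of edge labels can be sketched together. This is a minimal illustration under assumptions, not the linear-time construction: the class Node and the functions build_suffix_tree and edge_pairs are hypothetical names, edges are stored as 0-based half-open ranges into the text, and edge_pairs converts them to the 1-based inclusive pairs used on the slides.

```python
class Node:
    def __init__(self, start, end, leaf=None):
        self.start = start      # the edge into this node is labeled text[start:end]
        self.end = end
        self.leaf = leaf        # 1-based starting position of the suffix, for leaves
        self.children = {}      # first letter of the edge label -> child node


def build_suffix_tree(s):
    """Insert the suffixes of s + '$' one by one (quadratic time in the worst case)."""
    text = s + "$"
    n = len(text)
    root = Node(0, 0)
    for i in range(n):
        node, pos = root, i
        while True:
            child = node.children.get(text[pos])
            if child is None:
                # no edge starts with this letter: hang a new leaf here
                node.children[text[pos]] = Node(pos, n, leaf=i + 1)
                break
            # walk along the edge while the letters agree
            k = 0
            while k < child.end - child.start and text[child.start + k] == text[pos + k]:
                k += 1
            if k == child.end - child.start:
                node, pos = child, pos + k   # the whole edge matched: descend
                continue
            # mismatch inside the edge: split it and add a new leaf
            mid = Node(child.start, child.start + k)
            node.children[text[pos]] = mid
            child.start += k
            mid.children[text[child.start]] = child
            mid.children[text[pos + k]] = Node(pos + k, n, leaf=i + 1)
            break
    return root, text


def edge_pairs(node):
    """Yield every edge label as a 1-based inclusive pair of positions."""
    for child in node.children.values():
        yield (child.start + 1, child.end)
        yield from edge_pairs(child)


root, text = build_suffix_tree("aaaaaabbbbbb")
print(sorted(set(edge_pairs(root))))
```

Each node stores two integers instead of a string, so the space is proportional to the number of nodes.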
What can we do with it?
Exact string matching: Given a text T, |T| = n, preprocess it such that when a pattern P, |P| = m, arrives you can quickly decide if it occurs in T.
We may also want to find all occurrences of P in T.
Exact string matching
In preprocessing we just build a suffix tree in O(n) time.
[Figure: suffix tree of abab$ with leaves labeled 1 to 5]
Given a pattern P = ab we traverse the tree according to the pattern.
[Figure: the same tree with the path spelling ab highlighted]
If we did not get stuck traversing the pattern then the pattern occurs in the text.
Each leaf in the subtree below the node we reach corresponds to an occurrence.
By traversing this subtree we get all k occurrences in O(m+k) time.
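The traversal can be sketched as follows. To keep the example short it assumes an uncompressed suffix trie built in quadratic time rather than a linear-size suffix tree, so it shows the matching logic only and not the O(m+k) bound; build_suffix_trie and occurrences are hypothetical names.

```python
def build_suffix_trie(text):
    """Uncompressed trie of all suffixes of text + '$' (simple, but quadratic size)."""
    text += "$"
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1   # 1-based starting position of this suffix
    return root


def occurrences(root, pattern):
    """Walk down along the pattern, then collect every leaf below the node reached."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []              # got stuck: the pattern does not occur
        node = node[ch]
    found, stack = [], [node]
    while stack:
        cur = stack.pop()
        for key, child in cur.items():
            if key == "leaf":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


trie = build_suffix_trie("abab")
print(occurrences(trie, "ab"))
print(occurrences(trie, "aa"))
```

The key "leaf" cannot collide with an edge letter because edge keys are always single characters.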
Generalized suffix tree
Given a set of strings S, a generalized suffix tree of S is a compressed trie of all suffixes of every s ∈ S.
To associate each suffix with a unique string in S, add a different special character to the end of each s.
Generalized suffix tree (Example)
Let s1 = abab and s2 = aab. Here is a generalized suffix tree for s1 and s2:
{ $, #, b$, b#, ab$, ab#, bab$, aab#, abab$ }
[Figure: generalized suffix tree of abab$ and aab#; leaves of s1 are labeled 1 to 5 and leaves of s2 are labeled 1 to 4]
So what can we do with it?
Matching a pattern against a database of strings
Longest common substring (of two strings)
Every node with a leaf descendant from string s1 and a leaf descendant from string s2 represents a common substring.
A maximal common substring corresponds to such a node.
[Figure: generalized suffix tree of abab$ and aab#]
Find such a node with the largest “string depth”.
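The idea can be sketched with a small example. It assumes an uncompressed trie of all suffixes of both strings, which takes quadratic time and space, instead of a generalized suffix tree; each node carries a bit mask recording which strings reach it. The function name longest_common_substring is hypothetical.

```python
def longest_common_substring(s1, s2):
    """Deepest trie node reached by a suffix of s1 and by a suffix of s2."""
    root = {"kids": {}, "mask": 0}
    best = ""
    for bit, s in ((1, s1), (2, s2)):
        for start in range(len(s)):
            node = root
            for depth, ch in enumerate(s[start:], start=1):
                node = node["kids"].setdefault(ch, {"kids": {}, "mask": 0})
                node["mask"] |= bit          # this node has a descendant from string `bit`
                if node["mask"] == 3 and depth > len(best):
                    best = s[start:start + depth]
    return best


print(longest_common_substring("abab", "aab"))
```

In a trie the depth of a node equals its string depth, so the deepest node with mask 3 plays the role of the node with the largest string depth.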
Lowest common ancestors
A lot more can be gained from the suffix tree if we preprocess it so that we can answer LCA queries on it.
Why? The LCA of two leaves represents the longest common prefix (LCP) of these 2 suffixes.
[Figure: generalized suffix tree of abab$ and aab#]
Lowest common ancestors
Write an Euler tour of the tree, recording the node id and the level of every node visited.
[Figure: example tree with its Euler tour; one row lists the node ids along the tour and one row lists their levels]
LCA(1,5) = 3: the shallowest node (minimum level) in the part of the tour between an occurrence of 1 and an occurrence of 5.
Range minimum
[Figure: the Euler tour with node ids and levels; the LCA is found at the minimum level in the queried range]
Preprocess an array, such that given i,j you can find the minimum in [i,j] fast.
The reduction from LCA to range minimum takes linear time.
Trivial algorithms for RMQ
• O(n) space, O(n) query time
• O(n^2) space, O(1) query time
Less trivial algorithms for RMQ
• Try to use O(n log n) space to do a query in O(1) time.
• Simpler: O(n + sqrt(n)) space to do a query in O(sqrt(n)) time
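One standard way to get O(n log n) space and O(1) query time is a sparse table, sketched below. The slides do not say which structure they have in mind, so treat this as one possible answer; build_sparse_table and range_min_index are hypothetical names.

```python
def build_sparse_table(a):
    """table[k][i] is the index of the minimum of a[i : i + 2**k]."""
    n = len(a)
    table = [list(range(n))]
    k = 1
    while (1 << k) <= n:
        prev, half = table[-1], 1 << (k - 1)
        row = []
        for i in range(n - (1 << k) + 1):
            left, right = prev[i], prev[i + half]
            row.append(left if a[left] <= a[right] else right)
        table.append(row)
        k += 1
    return table


def range_min_index(a, table, i, j):
    """Index of a minimum of a[i..j] (inclusive), from two overlapping blocks."""
    k = (j - i + 1).bit_length() - 1      # largest power of two that fits in the range
    left, right = table[k][i], table[k][j - (1 << k) + 1]
    return left if a[left] <= a[right] else right


levels = [2, 1, 2, 1, 0, 1, 2, 1, 0]      # made-up level array for illustration
table = build_sparse_table(levels)
print(range_min_index(levels, table, 0, 3))
```

Applied to the level array of an Euler tour, the index returned identifies the LCA of the two queried nodes.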
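Finding maximal palindromes
• A palindrome: caabaac, cbaabc
• Want to find all maximal palindromes in a string s
Let s = cbaaba, n = |s|, and let s^r be s reversed.
The maximal palindrome with center between i-1 and i is given by the LCP of the suffix at position i of s and the suffix at position n-i+2 of s^r: the LCP is the right half of the palindrome, so the palindrome is twice as long as the LCP.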
Maximal palindromes algorithm
Let s = cbaaba$, then s^r = abaabc#.
Prepare a generalized suffix tree for s = cbaaba$ and s^r = abaabc#.
For every i, find the LCA of suffix i of s and suffix n-i+2 of s^r.
[Figure: generalized suffix tree of cbaaba$ and abaabc#, with the LCA of suffix 4 of s and suffix 4 of s^r marked]
Analysis
O(n) time to identify all palindromes
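The index arithmetic can be checked with a short sketch. It computes each LCP by direct character comparison, which is quadratic in the worst case, instead of using constant-time LCA queries on a generalized suffix tree, and it covers only palindromes centered between two positions (even length). The names lcp_length and maximal_even_palindromes are hypothetical.

```python
def lcp_length(x, y):
    """Length of the longest common prefix of x and y."""
    k = 0
    while k < len(x) and k < len(y) and x[k] == y[k]:
        k += 1
    return k


def maximal_even_palindromes(s):
    """Map each center i (between 1-based positions i-1 and i) to its maximal palindrome."""
    n = len(s)
    sr = s[::-1]
    result = {}
    for i in range(2, n + 1):
        # suffix i of s is s[i-1:]; suffix n-i+2 of s^r is sr[n-i+1:] (0-based slices)
        half = lcp_length(s[i - 1:], sr[n - i + 1:])
        result[i] = s[i - 1 - half:i - 1 + half]
    return result


print(maximal_even_palindromes("cbaaba"))
```

For s = cbaaba the only non-empty result is at i = 4, where the suffixes aba and abc share the prefix ab and the palindrome is baab.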
Suffix trees for compression
S[1..n] = aabcbbabbsbabcbbbbabbabcbbabbabbsb…
Compression: aabcbbabbsb (2,5) bbabbabcbbabbabbsb…
Let Li be the length of the longest prefix of S[i..n] that is a substring of S[1..i-1].
Let si be the starting position of a previous occurrence of this prefix.
Let priori = S[i..i+Li-1] = S[si..si+Li-1] be this prefix.
Example: for i = 12, prior12 = abcbb is encoded as (s12,L12) = (2,5).
Compression
For (i=1; i<=n; ) {
  Compute (si,Li)
  if Li>1 { output(si,Li); i=i+Li }
  else { output(S[i]); i=i+1 }
}
“Ziv-Lempel” compression
S = aabcbbabbsbabbbbbabbabcbbabbabbsb…
Output: a a b c b b (2,2) b s (6,4) …
How do we compute (si,Li)?
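The compression loop can be sketched as follows. It computes (si, Li) by brute force instead of with a suffix tree, and it assumes, as in the definition of Li, that the earlier copy must lie entirely inside S[1..i-1]. The names longest_previous_match, compress and decompress are hypothetical.

```python
def longest_previous_match(s, i):
    """Return (si, Li) for 1-based position i; the copy must end before position i."""
    best_start, best_len = 0, 0
    for start in range(1, i):
        length = 0
        while (i + length <= len(s) and start + length < i
               and s[start - 1 + length] == s[i - 1 + length]):
            length += 1
        if length > best_len:            # strict '>' keeps the leftmost copy
            best_start, best_len = start, length
    return best_start, best_len


def compress(s):
    out, i = [], 1
    while i <= len(s):
        si, li = longest_previous_match(s, i)
        if li > 1:
            out.append((si, li))         # pointer to an earlier copy
            i += li
        else:
            out.append(s[i - 1])         # literal character
            i += 1
    return out


def decompress(tokens):
    out = []
    for tok in tokens:
        if isinstance(tok, tuple):
            si, li = tok
            out.extend(out[si - 1:si - 1 + li])
        else:
            out.append(tok)
    return "".join(out)


print(compress("abababab"))
```

The brute-force search makes this sketch quadratic or worse; a suffix tree replaces longest_previous_match with a walk down from the root.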
Implementation using suffix tree
Before compression:
• Build a suffix tree T for S.
• For each node v, compute cv: the smallest leaf index in v’s subtree, which is the starting position of the leftmost copy of the substring that labels the path from the root to v.
• O(n) time.
Computing (si,Li):
Follow the path from the root to leaf i (the path that spells S[i..n]).
Let v be the “first” node on this path such that string-len(v) + cv ≥ i.
set si ← cv
set Li ← i - cv
[Figure: the path from the root to leaf i, with candidate nodes v1 and v2 and the quantities cv and string-len(v) marked against position i]
Example
S = abababab (positions 1 2 3 4 5 6 7 8)
[Figure: suffix tree of abababab$ with leaves labeled 1 to 8 and the value cv (1 or 2) written at each internal node]
i=1 Li=0 output a
i=2 Li=0 output b
i=3 Li=2 si=1 output (1,2)
i=5 Li=4 si=1 output (1,4)
Suffix array
• We lose some of the functionality but we save space.
Let s = abab
Sort the suffixes lexicographically: ab, abab, b, bab
The suffix array gives the indices (counting from 0) of the suffixes in sorted order:
2 0 3 1
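The definition translates directly into a few lines of code. This is a naive sketch that sorts the suffixes by comparing them as strings, so it is far from linear time; the function name suffix_array is hypothetical.

```python
def suffix_array(s):
    """0-based starting positions of the suffixes of s in lexicographic order."""
    # Sorting with the suffix itself as the key is simple but can cost
    # O(n^2 log n) character comparisons in the worst case.
    return sorted(range(len(s)), key=lambda i: s[i:])


print(suffix_array("abab"))
```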
How do we build it?
• Build a suffix tree
• Traverse the tree depth-first, picking the edges outgoing from each node in lexicographic order, and fill the suffix array with the leaf labels as they are reached.
• O(n) time
• Can also build it directly
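Example
Let S = mississippi
Suffix array entry / suffix:
10  i
7   ippi
4   issippi
1   ississippi
0   mississippi
9   pi
8   ppi
6   sippi
3   sissippi
5   ssippi
2   ssissippi
Let P = issa
[Figure: binary search over the suffix array with pointers L, M and R]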
How do we search for a pattern?
• If P occurs in T then all its occurrences are consecutive in the suffix array.
• Do a binary search on the suffix array
• Takes O(m log n) time
• Can also do it in O(m + log n) with an additional array
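The O(m log n) search can be sketched with two binary searches that locate the block of suffixes starting with P. The suffix array is built naively here to keep the example self-contained; find_occurrences is a hypothetical name.

```python
def find_occurrences(text, sa, pattern):
    """Return the sorted 0-based positions where pattern occurs in text."""
    m = len(pattern)

    def prefix(k):
        # first m characters of the k-th smallest suffix
        return text[sa[k]:sa[k] + m]

    lo, hi = 0, len(sa)
    while lo < hi:                      # first suffix whose prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:                      # first suffix whose prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


text = "mississippi"
sa = sorted(range(len(text)), key=lambda i: text[i:])   # naive construction
print(find_occurrences(text, sa, "issi"))
print(find_occurrences(text, sa, "issa"))
```

Each comparison looks at up to m characters and there are O(log n) comparisons, which gives the O(m log n) bound.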
Computing the suffix array
• Can we do it in linear time without constructing the suffix tree?
Divide into triples
y a b b a d a b b a d o $ (positions 0 to 12)
Triples starting at positions 1 mod 3: abb ada bba do$
Triples starting at positions 2 mod 3: bba dab bad o$$
Radix sort triplets
abb ada bba do$ bba dab bad o$$
Sorted distinct triplets: abb, ada, bad, bba, dab, do$, o$$
Change the alphabet
Replace each triplet by its rank: abb=1, ada=2, bad=3, bba=4, dab=5, do$=6, o$$=7
abb ada bba do$ becomes 1 2 4 6
bba dab bad o$$ becomes 4 5 3 7
Sort the suffixes
Recursively (problem size decreases to 2/3 the original)
New string: 1 2 4 6 4 5 3 7 (indices 0 to 7)
Suffixes by index: 0 - 12464537, 1 - 2464537, 2 - 464537, 3 - 64537, 4 - 4537, 5 - 537, 6 - 37, 7 - 7
Suffixes in sorted order: 0 - 12464537, 1 - 2464537, 6 - 37, 4 - 4537, 2 - 464537, 5 - 537, 3 - 64537, 7 - 7
Go back to original problem
Index in the new string / position in the original string: 0/1 1/4 2/7 3/10 4/2 5/5 6/8 7/11
Sorted order 0 1 6 4 2 5 3 7 of the new string gives the original positions 1 4 8 2 7 5 10 11.
Ranks of the suffixes at positions 1 2 4 5 7 8 10 11: 1 4 2 6 5 3 7 8
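Sort the remaining third
Represent each suffix starting at a position 0 mod 3 by the pair (first char, rank of the following suffix):
position 0: (y, 1), position 3: (b, 2), position 6: (a, 5), position 9: (a, 7)
Sorted pairs: (a, 5) (a, 7) (b, 2) (y, 1), which gives the positions 6 9 3 0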
Merge
Merge the two sorted lists 1 4 8 2 7 5 10 11 and 6 9 3 0, repeatedly moving the smaller of the two front suffixes to the output:
output 1, remaining 4 8 2 7 5 10 11 and 6 9 3 0
output 1 6, remaining 4 8 2 7 5 10 11 and 9 3 0
output 1 6 4, remaining 8 2 7 5 10 11 and 9 3 0
output 1 6 4 9, remaining 8 2 7 5 10 11 and 3 0
output 1 6 4 9 3, remaining 8 2 7 5 10 11 and 0
output 1 6 4 9 3 8, remaining 2 7 5 10 11 and 0
output 1 6 4 9 3 8 2, remaining 7 5 10 11 and 0
output 1 6 4 9 3 8 2 7, remaining 5 10 11 and 0
output 1 6 4 9 3 8 2 7 5, remaining 10 11 and 0
Final suffix array: 1 6 4 9 3 8 2 7 5 10 11 0
Summary
Result for y a b b a d a b b a d o: 1 6 4 9 3 8 2 7 5 10 11 0
When comparing to a suffix with index 1 (mod 3) we compare the first char and break ties by the ranks of the suffixes that start one position later.
When comparing to a suffix with index 2 (mod 3) we compare the first char, then the next char if there is a tie, and finally the ranks of the suffixes that start two positions later.
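The two comparison rules can be sketched in isolation. The code below replaces the recursive step with a naive sort of the sampled suffixes (positions 1 and 2 mod 3), so it is not linear time; it only illustrates the last two stages, sorting the remaining third and merging. The name merge_step_suffix_array is hypothetical, and the terminator $ is left implicit.

```python
def merge_step_suffix_array(t):
    n = len(t)
    # Stand-in for the recursion: sort the sampled suffixes naively.
    sample = sorted((i for i in range(n) if i % 3 != 0), key=lambda i: t[i:])
    rank = [0] * (n + 3)                 # rank 0 means "past the end of the string"
    for r, i in enumerate(sample, start=1):
        rank[i] = r

    def ch(i):
        return t[i] if i < n else ""     # "" sorts before every real character

    # Remaining third: sort by (first char, rank of the following suffix).
    rest = sorted(range(0, n, 3), key=lambda i: (ch(i), rank[i + 1]))

    def sample_is_smaller(i, j):
        """i is a sampled position, j is a position 0 mod 3."""
        if i % 3 == 1:
            return (ch(i), rank[i + 1]) < (ch(j), rank[j + 1])
        return (ch(i), ch(i + 1), rank[i + 2]) < (ch(j), ch(j + 1), rank[j + 2])

    out, a, b = [], 0, 0
    while a < len(sample) and b < len(rest):
        if sample_is_smaller(sample[a], rest[b]):
            out.append(sample[a])
            a += 1
        else:
            out.append(rest[b])
            b += 1
    return out + sample[a:] + rest[b:]


print(merge_step_suffix_array("yabbadabbado"))
```

Every rank looked up during a comparison belongs to a sampled position, which is why each comparison takes constant time.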