tries and suffix tries
TRANSCRIPT
-
8/10/2019 Tries and Suffix Tries
1/26
-
8/10/2019 Tries and Suffix Tries
2/26
Tries
A trie (pronounced try) is a tree representing a collection of strings withone node per common prefix
Each key is spelled out along some path starting at the root
Each edge is labeled with a character c!
A node has at most one outgoing edge labeled c, for c!
Smallest tree such that:
Natural way to represent either a setor a mapwhere keys are strings
http://xlinux.nist.gov/dads/HTML/prefix.htmlhttp://xlinux.nist.gov/dads/HTML/prefix.htmlhttp://xlinux.nist.gov/dads/HTML/node.htmlhttp://xlinux.nist.gov/dads/HTML/node.html -
8/10/2019 Tries and Suffix Tries
3/26
-
8/10/2019 Tries and Suffix Tries
4/26
Tries: example
Checking for presence of a key P,where n= | P |, is ??? time
If total length of all keys is N, trie
has ??? nodes
O(n)
O
(N
)
What about | | ?
Depends how we represent outgoing
edges. If we dont assume |
| is asmall constant, it shows up in one orboth bounds.
i
n
s
t
a
n
t
t
e
r
n
a
l
e
t1
2 3
-
8/10/2019 Tries and Suffix Tries
5/26
Tries: another example
We can index T with a trie. The trie mapssubstrings to offsets where they occur
ac 4
ag 8
at 14
cc 12
cc 2
ct 6
gt 18
gt 0
ta 10tt 16
ac
g
t
cg
t
c
t
t
a
t
4
8
14
12, 2
6
18, 0
10
16
root:
-
8/10/2019 Tries and Suffix Tries
6/26
Tries: implementation
!"#$$&'()*#+,-./)!012
333 &'() (4+")4)50#0(-5 -6 # 4#+7 8$$-!(#0(59 :);$ ,$0'(59$ -' -0) +#(' 333
!>' E$)"67'--0 6-'! (5:2 H 6-' )#!< !'L!M EFG H (6 5-0 0'L!M
!>'LN@#">)NM E@ H #0 0'2
')0>'5P-5)H :); ?#$5N0 (5 0'L!M
H 9)0 @#">)D -' P-5) (6 0'5!>'79)0,N@#">)N1
Python example:
http://nbviewer.ipython.org/6603619
http://nbviewer.ipython.org/6603619http://nbviewer.ipython.org/6603619http://nbviewer.ipython.org/6603619http://nbviewer.ipython.org/6603619 -
8/10/2019 Tries and Suffix Tries
7/26
Tries: alternatives
Tries arent the only tree structure that can encode sets or maps with string
keys. E.g. binary or ternary search trees.
i
Example from: Bentley, Jon L., and Robert Sedgewick. "Fast algorithms for sorting and searching
strings." Proceedings of the eighth annual ACM-SIAM symposium on Discrete algorithms. Society for
Industrial and Applied Mathematics, 1997
b s o
a e h n t n t
Ternary search tree for as, at, be, by, he, in, is, it, of, on, or, to
s y e f r o
t
-
8/10/2019 Tries and Suffix Tries
8/26
Indexing with suffixes
Until now, our indexes have been based on extracting substrings from T
A very different approach is to extract suffixesfrom T. This will lead us tosome interesting and practical index data structures:
6
5
3
1
0
4
2
$
A$
ANA$
ANANA$
BANANA$
NA$
NANA$
$B A N A N A
A$ B A N A N
A N A $ B A N
A N A N A $B
BA N A N A$
N A $ B A N A
N A N A $ BA
Suffix Tree Suffix Array FM IndexSuffix Trie
-
8/10/2019 Tries and Suffix Tries
9/26
Suffix trie
Build a triecontaining all suffixesof a text T
G T T A T A G C T G A T C G C G G C G T A G C G G $G T T A T A G C T G A T C G C G G C G T A G C G G $ T T A T A G C T G A T C G C G G C G T A G C G G $ T A T A G C T G A T C G C G G C G T A G C G G $ A T A G C T G A T C G C G G C G T A G C G G $ T A G C T G A T C G C G G C G T A G C G G $ A G C T G A T C G C G G C G T A G C G G $ G C T G A T C G C G G C G T A G C G G $ C T G A T C G C G G C G T A G C G G $ T G A T C G C G G C G T A G C G G $ G A T C G C G G C G T A G C G G $ A T C G C G G C G T A G C G G $ T C G C G G C G T A G C G G $ C G C G G C G T A G C G G $ G C G G C G T A G C G G $ C G G C G T A G C G G $ G G C G T A G C G G $ G C G T A G C G G $ C G T A G C G G $ G T A G C G G $ T A G C G G $ A G C G G $ G C G G $ C G G $ G G $ G $ $
T:
m(m+1)/2
chars
-
8/10/2019 Tries and Suffix Tries
10/26
-
8/10/2019 Tries and Suffix Tries
11/26
-
8/10/2019 Tries and Suffix Tries
12/26
Suffix trie
Each path from root to leaf represents asuffix; each suffix is represented by somepath from root to leaf
! " #
! " #
"
!
#
!
! #
"
!
#
!
! #
"
!
#
Shortest
(non-empty)suffix
Longest suffix
T: abaaba abaaba$T$:
Would this still be the case if we hadntadded $?
-
8/10/2019 Tries and Suffix Tries
13/26
Suffix trie
T: abaaba
Would this still be the case if we hadntadded $? No
! "
! "
"
!
!
!
"
!
!
!
"
!
Each path from root to leaf represents asuffix; each suffix is represented by somepath from root to leaf
-
8/10/2019 Tries and Suffix Tries
14/26
Suffix trie
We can think of nodes as having labels,where the label spells out characters on thepath from the root to the node
! " #
! " #
"
!
#
!
! #
"
!
#
!
! #
"
!
#
baa
-
8/10/2019 Tries and Suffix Tries
15/26
Suffix trie
How do we check whether a string Sis asubstring of T?
! " #
! " #
"
!
#
!
! #
"
!
#
!
! #
"
!
#
Note: Each of Tssubstrings is spelled outalong a path from the root. I.e., everysubstringis aprefix of some suffixof T.
Start at the root and follow the edgeslabeled with the characters of S
If we fall off the trie -- i.e. there is no
outgoing edge for next character of S, thenSis not a substring of T
If we exhaust Swithout falling off, Sis asubstring of T
S = baaYes, its a substring
-
8/10/2019 Tries and Suffix Tries
16/26
Suffix trie
How do we check whether a string Sis asubstring of T?
! " #
! " #
"
!
#
!
! #
"
!
#
!
! #
"
!
#
Note: Each of Tssubstrings is spelled outalong a path from the root. I.e., everysubstringis aprefix of some suffixof T.
Start at the root and follow the edgeslabeled with the characters of S
If we fall off the trie -- i.e. there is no
outgoing edge for next character of S, thenSis not a substring of T
If we exhaust Swithout falling off, Sis asubstring of T
-
8/10/2019 Tries and Suffix Tries
17/26
-
8/10/2019 Tries and Suffix Tries
18/26
Suffix trie
How do we check whether a string Sis asuffixof T?
! " #
! " #
"
!
#
!
! #
"
!
#
!
! #
"
!
#
Same procedure as for substring, butadditionally check whether the final node inthe walk has an outgoing edge labeled $
S = baaNota suffix
-
8/10/2019 Tries and Suffix Tries
19/26
Suffix trie
How do we check whether a string Sis asuffixof T?
! " #
! " #
"
!
#
!
! #
"
!
#
!
! #
"
!
#
Same procedure as for substring, butadditionally check whether the final node inthe walk has an outgoing edge labeled $
-
8/10/2019 Tries and Suffix Tries
20/26
Suffix trie
How do we count the number of timesa string Soccurs as a substring of T?
! " #
! " #
"
!
#
!
! #
"
!
#
!
! #
"
!
#
Follow path corresponding to S.Either we fall off, in which case
answer is 0, or we end up at node nand the answer = # of leaf nodes inthe subtree rooted at n.
Leaves can be counted with depth-firsttraversal.
n
-
8/10/2019 Tries and Suffix Tries
21/26
Suffix trie
How do we find the longest repeatedsubstringof T?
! " #
! " #
"
!
#
!
! #
"
!
#
!
! #
"
!
#
Find the deepest node with morethan one child
-
8/10/2019 Tries and Suffix Tries
22/26
Suffix trie: implementation!"#$$Q>66(R&'(),-./)!012
B)6CC(5(0CC,$)"6D 012
333 *#:) $>66(R 0'() 6'-4 0 333 0 SENTNH $+)!(#" 0)'4(5#0-' $;4.-" $)"67'--0 EFG 6-'( (5R'#59),")5,0112 H 6-' )#!< $>66(R !>' E$)"67'--0 6-'! (50L(2M2 H 6-' )#!< !66(R (6! 5-0(5!>'2 !>'L!M EFG H #BB ->09-(59 )B9) (6 5)!)$$#'; !>' E!>'L!M
B)66-""-?U#0'2 ')0>'5P-5) !>' E!>'L!M
')0>'5!>'
B)6.$0'(59,$)"6D $12 333 V)0>'5 0'>) (66 $ #++)#'$ #$ # $>.$0'(59 -6 0 333 ')0>'5$)"676-""-?U#0
-
8/10/2019 Tries and Suffix Tries
23/26
Suffix trie
How many nodes does the suffix trie have?
Is there a class of string where the numberof suffix trie nodes grows linearly with m?
Yes: e.g. a string of mas in a row (am)
! "
! "
! "
! "
"
T= aaaa
1 Root
mnodes withincoming aedge
m + 1 nodes withincoming $edge
2m+ 2 nodes
-
8/10/2019 Tries and Suffix Tries
24/26
Suffix trie
Is there a class of string where the numberof suffix trie nodes grows with m2?
Yes: anbn
1 root
nnodes along b chain, right nnodes along a chain, middle
nchains of nb nodes hanging offeacha chain node
2n + 1 $leaves (not shown)
n2+ 4n+ 2 nodes, where m= 2n
Figure & exampleby Carl Kingsford
-
8/10/2019 Tries and Suffix Tries
25/26
Suffix trie: upper bound on size
Suffix trie
Root
Deepest leaf
Max # nodes from top to bottom= length of longest suffix + 1= m+ 1
Max # nodes from left to right= max # distinct substrings of any length
m
O(m2) is worst case
Could worst-case # nodes be worse than O(m2)?
-
8/10/2019 Tries and Suffix Tries
26/26
Suffix trie: actual growth
0 100 200 300 400 500
0
50
000
100000
150000
200000
250000
Length prefix over which suffix trie was built
#s
uffix
trien
odes
m^2actual
mBuilt suffix tries for the first500 prefixes of the lambdaphage virus genome
Black curve shows how #nodes increases with prefixlength