On-line Construction of Suffix Trees
Chairman: Prof. R.C.T. Lee
Speaker: C. S. Wu (吳展碩)
June 10, 2004
Dept. of CSIE, National Chi Nan University
2
Source
E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14:249--260, 1995.
3
Outline
Introduction
Suffix tries and suffix trees
Constructing suffix tries
    Quadratic time
On-line construction of suffix trees
    Linear time
4
Notations
$T = t_1 t_2 \cdots t_n$ is a string over an alphabet $\Sigma$.
$T^i$ denotes the prefix $t_1 \cdots t_i$ of $T$ for $0 \le i \le n$.
$T_i$ denotes the suffix $t_i \cdots t_n$ of $T$ for $1 \le i \le n + 1$.
Example: for $T$ = abcde, $T^3$ = abc and $T_3$ = cde.
5
Notations (cont.)
$T_{n+1} = \varepsilon$ is the empty suffix.
The set of all suffixes of $T$ is denoted $\sigma(T)$.
Example: for $T$ = abcde, $\sigma(T)$ = {abcde, bcde, cde, de, e, $\varepsilon$}.
6
Suffix Tries & Suffix Trees
[Figure: the suffix trie (left) and the suffix tree (right) of the example string ababc. The states shown correspond to the substrings a, ab, abc, ababc, b, bc, babc, and c.]
7
Suffix Tries
The suffix trie of $T$ is a trie representing $\sigma(T)$.
We write $STrie(T) = (Q \cup \{\bot\}, root, F, g, f)$ and define such a trie as an augmented deterministic finite-state automaton.
8
Suffix Tries (cont.)
$STrie(T) = (Q \cup \{\bot\}, root, F, g, f)$.
$Q$ is the set of states of $STrie(T)$, in one-to-one correspondence with the substrings of $T$.
$\bar{x}$ is the state that corresponds to a substring $x$.
$\bot$ is an auxiliary state.
root is the initial state; it corresponds to the empty string $\varepsilon$.
$F$ is the set of final states; they correspond to $\sigma(T)$.
9
Suffix Tries (cont.)
$g$ is the transition function: $g(\bar{x}, a) = \bar{y}$ for all $\bar{x}, \bar{y}$ in $Q$ such that $y = xa$, where $a \in \Sigma$.
$f$ is the suffix function: let $\bar{x} \ne root$. Then $x = ay$ for some $a \in \Sigma$, and we set $f(\bar{x}) = \bar{y}$.
$f(root) = \bot$.
We call $f(r)$ the suffix link of state $r$.
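The definitions of Q, F, g, and f can be made concrete with a small sketch. It is only an illustration built directly from the definitions, not the construction algorithm of the talk: each state is named by the substring it represents, and the names `suffix_trie_by_definition` and `BOTTOM` are hypothetical.

```python
BOTTOM = object()  # stand-in for the auxiliary state (an assumption of this sketch)


def suffix_trie_by_definition(text):
    """Return (Q, F, g, f), naming each state by the substring it represents."""
    n = len(text)
    # Q: one state per distinct substring; the empty string is the root.
    Q = {text[i:j] for i in range(n + 1) for j in range(i, n + 1)}
    # F: the final states correspond to the suffixes of text.
    F = {text[i:] for i in range(n + 1)}
    # g(x, a) = xa whenever xa is also a substring of text.
    g = {(x, a): x + a for x in Q for a in set(text) if x + a in Q}
    # f(x) = y where x = ay; the root links to the auxiliary state.
    f = {x: x[1:] for x in Q if x}
    f[""] = BOTTOM
    return Q, F, g, f


Q, F, g, f = suffix_trie_by_definition("abcabd")
print(g[("ab", "c")], f["abd"])
```

Building the trie this way enumerates every substring, so it is far slower than the on-line algorithm; it is useful only for checking small examples by hand.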
10
Suffix Tries (cont.)
[Figure: $STrie(T)$ for $T$ = abcabd, with its states, transitions, and suffix links.]
Note: Only the last layer of suffix links is shown explicitly.
$STrie(T) = (Q \cup \{\bot\}, root, F, g, f)$
11
Boundary Path
We call the path that starts from the deepest state $t_1 \cdots t_{i-1}$ of $STrie(T^{i-1})$ and ends at $\bot$ the boundary path.
The boundary path consists of the last layer of suffix links.
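12
Constructing Suffix Tries
Observation: $\sigma(T^i) = \sigma(T^{i-1})\,t_i \cup \{\varepsilon\}$
Example with $T^{i-1}$ = abcd and $t_i$ = e:
$\sigma(T^{i-1})$: abcd, bcd, cd, d, $\varepsilon$
$\sigma(T^i)$: abcde, bcde, cde, de, e, $\varepsilon$
The states for $\sigma(T^{i-1})$ lie on the boundary path.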
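The observation can be checked on the example with a few lines of Python. This is a minimal sketch; the helper names `suffixes` and `extend` are introduced here for illustration only.

```python
def suffixes(s):
    """sigma(s): all suffixes of s, including the empty suffix."""
    return {s[i:] for i in range(len(s) + 1)}


def extend(prev_suffixes, ch):
    """Append ch to every suffix of T^(i-1), then add the empty suffix."""
    return {x + ch for x in prev_suffixes} | {""}


print(extend(suffixes("abcd"), "e") == suffixes("abcde"))
```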
13
Constructing Suffix Tries (cont.)
Algorithm 1.
    r ← top;
    while g(r, t_i) is undefined do
        create new state r' and new transition g(r, t_i) = r';
        if r ≠ top then create new suffix link f(oldr') = r';
        oldr' ← r';
        r ← f(r);
    create new suffix link f(oldr') = g(r, t_i);
    top ← g(top, t_i).
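Algorithm 1 can be sketched in Python as follows. This is one possible rendering under stated assumptions, not the speaker's code: the class `State`, the helper `state_names`, and the choice to model the auxiliary state as a node whose transition on every character leads to the root are all choices of this sketch.

```python
class State:
    def __init__(self):
        self.g = {}    # transition function: character -> State
        self.f = None  # suffix link


def build_suffix_trie(text):
    bottom, root = State(), State()
    root.f = bottom
    top = root  # deepest state, i.e. the start of the boundary path

    def g(r, ch):
        # Assumption: the auxiliary state has a transition to root on every character.
        return root if r is bottom else r.g.get(ch)

    for ch in text:  # ch plays the role of t_i
        r, old_new = top, None
        # Follow the boundary path until a state with a ch-transition is found.
        # The loop body runs at least once, because top never has a ch-transition.
        while g(r, ch) is None:
            new = State()
            r.g[ch] = new
            if r is not top:
                old_new.f = new
            old_new = new
            r = r.f
        old_new.f = g(r, ch)
        top = g(top, ch)
    return root


def state_names(root):
    """Map each substring spelled from the root to its State."""
    names, stack = {}, [("", root)]
    while stack:
        x, state = stack.pop()
        names[x] = state
        stack.extend((x + a, child) for a, child in state.g.items())
    return names


print(sorted(state_names(build_suffix_trie("abcabd")), key=len)[-1])
```

The helper `state_names` is only there to inspect the result: it recovers which substring each state stands for, so the suffix links can be compared with the definition of f.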
14
Constructing Suffix Tries (cont.)
[Figure: the suffix trie after reading T = a. The labels top and r mark the states visited by Algorithm 1. The boundary path is colored orange.]
15
Constructing Suffix Tries (cont.)
[Figure: the suffix trie after reading T = ab. The boundary path is colored orange.]
16
Constructing Suffix Tries (cont.)
[Figure: the suffix trie after reading T = abc. The boundary path is colored orange.]
17
Constructing Suffix Tries (cont.)
[Figure: the suffix trie after reading T = abca. The boundary path is colored orange.]
18
Constructing Suffix Tries (cont.)
[Figure: the suffix trie after reading T = abcab. The boundary path is colored orange.]
19
Constructing Suffix Tries (cont.)
[Figure: the suffix trie after reading T = abcabd, with the positions of top and r along the boundary path.]
20
Constructing Suffix Tries (cont.)
[Figure: the finished suffix trie for T = abcabd.]
21
Constructing Suffix Tries (cont.)
Theorem 1
The suffix trie $STrie(T)$ can be constructed in time proportional to the size of $STrie(T)$, which, in the worst case, is $O(|T|^2)$.
Note: The number of nodes in $STrie(T)$ is the number of distinct substrings of $T$. A string of length $n$ has $O(n^2)$ substrings, so the size of $STrie(T)$ is $O(n^2)$.
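To see that the quadratic bound is actually reached, one can count distinct substrings for strings of the form a^n b^n. This family is chosen here for illustration and does not appear on the slide. Every substring has the form a^i b^j with 0 ≤ i, j ≤ n, so there are (n + 1)^2 of them when the empty string is counted, while the string itself has length 2n.

```python
def distinct_substrings(text):
    """All distinct substrings of text, including the empty string."""
    n = len(text)
    return {text[i:j] for i in range(n + 1) for j in range(i, n + 1)}


for n in (1, 2, 4, 8):
    text = "a" * n + "b" * n
    # Columns: string length, number of trie states, the closed form (n + 1)^2.
    print(len(text), len(distinct_substrings(text)), (n + 1) ** 2)
```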
22
Suffix Trees
The suffix tree $STree(T)$ represents $STrie(T)$ in space linear in the length $|T|$.
It represents only a subset $Q' \cup \{\bot\}$ of the states of $STrie(T)$.
$Q'$ consists of all branching states and all leaves of $STrie(T)$.
The states in $Q' \cup \{\bot\}$ are called the explicit states.
The other states of $STrie(T)$ are called implicit states of $STree(T)$.
Implicit states are not explicitly present in $STree(T)$.
23
Suffix Trees (cont.)
[Figure: the suffix trie (left) and the suffix tree (right) of the example string ababc, with the explicit states and the implicit states marked.]
24
Suffix Trees (cont.)
The string $w = t_k \cdots t_p$ between two explicit states $s$ and $r$ is represented in $STree(T)$ as a generalized transition $g'(s, w) = r$.
To save space, the string $w = t_k \cdots t_p$ is actually represented as a pair $(k, p)$ of pointers to $T$.
A transition $g'(s, (k, p)) = r$ is called an $a$-transition if $t_k = a$.
Each $s$ can have at most one $a$-transition for each $a \in \Sigma$.
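As a small illustration of the pointer-pair representation, the sketch below recovers the label of a transition from (k, p) using the slides' 1-based, inclusive indices. The function name `label` and the dictionary layout are assumptions of this sketch. The pairs used for the root are those of the suffix tree of ababc.

```python
def label(text, k, p):
    """Return t_k ... t_p using 1-based, inclusive indices."""
    return text[k - 1:p]


T = "ababc"
# Generalized transitions leaving the root of STree(ababc), keyed by their
# first character so that a state holds at most one a-transition per a.
root_transitions = {T[k - 1]: (k, p) for (k, p) in [(1, 2), (2, 2), (5, 5)]}

for first, (k, p) in sorted(root_transitions.items()):
    print(first, (k, p), label(T, k, p))
```

Each transition costs two integers regardless of the length of its label, which is what keeps the whole tree linear in |T|.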
25
Suffix Trees (cont.)
Suffix function: defined only for branching states $\bar{x} \ne root$ as $f'(\bar{x}) = \bar{y}$, where $\bar{y}$ is a branching state such that $x = ay$ for some $a \in \Sigma$; in addition, $f'(root) = \bot$.
If $\bar{x}$ is a branching state, then $f'(\bar{x})$ is also a branching state. These suffix links are explicitly represented.
The suffix tree of $T$ is denoted as
$STree(T) = (Q' \cup \{\bot\}, root, g', f')$
26
Size of Suffix Trees
[Figure: $STree(T)$ for $T$ = ababc with each transition labeled by its pointer pair. From the root: ab = (1,2), b = (2,2), c = (5,5). From state ab: abc = (3,5), c = (5,5). From state b: abc = (3,5), c = (5,5). The legend distinguishes a-transitions, b-transitions, and c-transitions.]
27
Size of Suffix Trees (cont.)
The size of $STree(T)$ is linear in $|T|$.
$Q'$ has at most $|T|$ leaves, and therefore $Q'$ contains at most $|T| - 1$ branching states.
There can be at most $2|T| - 2$ transitions between the states in $Q'$.
28
Reference to a State
We refer to a state $r$ of a suffix tree by a reference pair $(s, w)$.
$s$ is some explicit state that is an ancestor of $r$.
$w$ is the string spelled out by the transitions from $s$ to $r$ in the corresponding suffix trie.
A reference pair is canonical if $s$ is the closest ancestor of $r$.
The pair $(s, \varepsilon)$ is represented as $(s, (p + 1, p))$.
29
States on the Boundary Path
Let $s_1 = t_1 \cdots t_{i-1}, s_2, s_3, \ldots, s_i = root, s_{i+1} = \bot$ be the states of $STrie(T^{i-1})$ on the boundary path.
Let $j$ be the smallest index such that $s_j$ is not a leaf.
Let $j'$ be the smallest index such that $s_{j'}$ has a $t_i$-transition.
We call state $s_j$ the active point and $s_{j'}$ the end point of $STrie(T^{i-1})$.
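The indices j and j' can be found by brute force straight from these definitions. The sketch below is meant only for checking small examples. The function name `boundary_path_points` is hypothetical, and the auxiliary state is treated as index i + 1, which is never a leaf and has a transition on every character.

```python
def boundary_path_points(prefix, ch):
    """Return (j, j_prime) for STrie(prefix) and the next character ch.

    prefix plays the role of t_1 ... t_{i-1}; indices are 1-based.
    """
    i = len(prefix) + 1
    alphabet = set(prefix)

    def state(h):
        # s_h = t_h ... t_{i-1}; s_i is the root (the empty string).
        return prefix[h - 1:]

    def is_leaf(h):
        # A state is a leaf when no character extends it inside prefix.
        return not any(state(h) + a in prefix for a in alphabet)

    # Fall back to i + 1 (the auxiliary state) if no earlier state qualifies.
    j = next((h for h in range(1, i + 1) if not is_leaf(h)), i + 1)
    j_prime = next((h for h in range(1, i + 1) if state(h) + ch in prefix), i + 1)
    return j, j_prime


print(boundary_path_points("abcab", "d"))
```

For the prefix abcab and next character d, the active point is s_4 = ab, and no state on the boundary path has a d-transition until the auxiliary state is reached.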
30
States on the Boundary Path
Lemma 1
Algorithm 1 adds to $STrie(T^{i-1})$ a $t_i$-transition for each of the states $s_h$, $1 \le h < j'$.
For $1 \le h < j$, the new transition expands an old branch of the trie that ends at leaf $s_h$.
For $j \le h < j'$, the new transition initiates a new branch from $s_h$.
Algorithm 1 does not create any other transitions.
31
States on the Boundary Path
Algorithm 1 inserts two different groups of $t_i$-transitions into $STrie(T^{i-1})$:
First group: the states on the boundary path before the active point $s_j$ get a transition.
Second group: the states from the active point $s_j$ to the end point $s_{j'}$, the end point excluded, get a new transition.
32
States on the Boundary Path
[Figure: $STrie(T^{i-1})$ for $T^{i-1}$ = abcab and $t_i$ = d. The last layer of suffix links (the boundary path) is drawn, with the active point, the end point, the first group, and the second group marked.]
33
States on the Boundary Path
[Figure: $STrie(T^i)$ obtained from $T^{i-1}$ = abcab by adding $t_i$ = d. The new transitions and new nodes are colored green. The active point, the end point, the first group, and the second group are marked.]
34
Adding Transitions to $STree(T^{i-1})$
The first group requires no change to $STree(T^{i-1})$.
A transition of $STree(T^{i-1})$ leading to a leaf is called an open transition.
Such a transition is of the form $g'(s, (k, i-1)) = r$.
Instead, open transitions are represented as $g'(s, (k, \infty))$.
$\infty$ indicates that this transition is 'open to grow'.
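A short sketch shows why open transitions need no update when a new character arrives. Here infinity is modeled with `float("inf")` and the label is read off the text seen so far. Both the name `edge_label` and this encoding are assumptions of the sketch.

```python
INF = float("inf")  # stands for the open right pointer


def edge_label(text_so_far, k, p):
    """Label of transition (k, p), 1-based and inclusive; p may be INF."""
    end = len(text_so_far) if p == INF else p
    return text_so_far[k - 1:end]


# The same stored pair (3, INF) spells a longer label once d has been read.
print(edge_label("abcab", 3, INF), edge_label("abcabd", 3, INF))
```

The stored pair never changes; only the text grows, so every leaf is extended implicitly at no cost.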
35
Open Transitions
[Figure: $STree(T^{i-1})$ for $T^{i-1}$ = abcab and $t_i$ = d. The transitions shown include ab = (1,2), b = (2,2), and open transitions (3,∞) leading to the leaves abcab, bcab, and cab. The active point, the end point, the first group, and the second group are marked.]
36
Open Transitions
[Figure: $STree(T^i)$ after adding $t_i$ = d. The open transitions (3,∞) now lead to the leaves abcabd, bcabd, and cabd without being changed. The new transitions and new nodes are colored green.]
37
Adding Transitions to $STree(T^{i-1})$ (cont.)
Create new branches for the second group.
The states of the second group may be explicit or implicit.
They will be found along the boundary path using reference pairs and suffix links.
Let $(s, w)$ be the canonical reference pair for $s_h$, $j \le h < j'$. Then $(s, w) = (s, (k, i-1))$ for some $k \le i$.
If $(s, (k, i-1))$ already refers to the end point $s_{j'}$, we are done. Otherwise a new branch has to be created.
If $(s, (k, i-1))$ refers to an implicit state, a new explicit state is created by splitting the transition. Then a $t_i$-transition is created.
38
On-Line Construction of Suffix Trees (cont.)
Lemma 2
Let $(s, (k, i-1))$ be a reference pair of the end point $s_{j'}$ of $STree(T^{i-1})$. Then $(s, (k, i))$ is a reference pair of the active point of $STree(T^i)$.
Proof.
$s_j$ is the active point of $STree(T^{i-1})$ if and only if $s_j$ is the longest suffix of $T^{i-1}$ that occurs at least twice in $T^{i-1}$.
$s_{j'}$ is the end point of $STree(T^{i-1})$ if and only if $s_{j'}$ is the longest suffix of $T^{i-1}$ such that $t_{j'} \cdots t_{i-1} t_i$ is a substring of $T^{i-1}$.
If $s_{j'}$ is the end point of $STree(T^{i-1})$, then $t_{j'} \cdots t_{i-1} t_i$ is the longest suffix of $T^i$ that occurs at least twice in $T^i$; that is, state $g(s_{j'}, t_i)$ is the active point of $STree(T^i)$.
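The two characterizations used in the proof make Lemma 2 easy to test by brute force on short strings. In this sketch the function names are hypothetical, `None` stands for the auxiliary state, and the transition from the auxiliary state on any character is taken to lead to the root (the empty string).

```python
def occurrences(text, x):
    """Number of (possibly overlapping) occurrences of x in text."""
    return sum(1 for i in range(len(text) - len(x) + 1) if text.startswith(x, i))


def active_point(text):
    """Longest suffix of text that occurs at least twice in text."""
    for h in range(len(text) + 1):
        if occurrences(text, text[h:]) >= 2:
            return text[h:]
    return ""  # only reached for the empty text


def end_point(text, ch):
    """Longest suffix s of text such that s + ch is a substring of text."""
    for h in range(len(text) + 1):
        if text[h:] + ch in text:
            return text[h:]
    return None  # the auxiliary state


def check_lemma_2(text):
    for i in range(1, len(text) + 1):
        prev, ch = text[:i - 1], text[i - 1]
        end = end_point(prev, ch)
        expected = "" if end is None else end + ch
        if active_point(text[:i]) != expected:
            return False
    return True


print(check_lemma_2("abcabd"))
```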
39
Constructing Suffix Tries (cont.)
[Figure: step-by-step run of the on-line construction on T = abcabd, one panel per prefix (a, ab, abc, abca, abcab, abcabd) for i = 0, ..., 6. The panels show the values of s, k, and i, the open transitions such as (1,∞), (2,∞), and (3,∞), the split transitions (1,2) and (2,2) in the final tree, and the active point and end point at each step.]
40
On-Line Construction of Suffix Trees
Algorithm 2
Construction of STree(T) for string T = t_1 t_2 ... # in alphabet Σ = {t_{-1}, ..., t_{-m}}; # is the end marker.
    create states root and ⊥;
    for j ← 1, ..., m do
        create transition g'(⊥, (-j, -j)) = root;
    create suffix link f'(root) = ⊥;
    s ← root; k ← 1; i ← 0;
    while t_{i+1} ≠ # do
        i ← i + 1;
        (s, k) ← update(s, (k, i));
        (s, k) ← canonize(s, (k, i)).
41
On-Line Construction of Suffix Trees (cont.)
procedure update(s, (k, i)):
    (s, (k, i - 1)) is the canonical reference pair for the active point;
    oldr ← root; (end-point, r) ← test-and-split(s, (k, i - 1), t_i);
    while not (end-point) do
        create new transition g'(r, (i, ∞)) = r' where r' is a new state;
        if oldr ≠ root then create new suffix link f'(oldr) = r;
        oldr ← r;
        (s, k) ← canonize(f'(s), (k, i - 1));
        (end-point, r) ← test-and-split(s, (k, i - 1), t_i);
    if oldr ≠ root then create new suffix link f'(oldr) = s;
    return (s, k).
42
On-Line Construction of Suffix Trees (cont.)
procedure test-and-split(s, (k, p), t):
    if k ≤ p then
        let g'(s, (k', p')) = s' be the t_k-transition from s;
        if t = t_{k'+p-k+1} then return (true, s)
        else
            replace the t_k-transition above by transitions
                g'(s, (k', k' + p - k)) = r and g'(r, (k' + p - k + 1, p')) = s'
            where r is a new state;
            return (false, r)
    else
        if there is no t-transition from s then return (false, s)
        else return (true, s).
43
On-Line Construction of Suffix Trees (cont.)
procedure canonize(s, (k, p)):
    if p < k then return (s, k)
    else
        find the t_k-transition g'(s, (k', p')) = s' from s;
        while p' - k' ≤ p - k do
            k ← k + p' - k' + 1;
            s ← s';
            if k ≤ p then
                find the t_k-transition g'(s, (k', p')) = s' from s;
        return (s, k).
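Algorithm 2 and its three procedures can be put together in a compact Python sketch. It follows the pseudocode step by step, but several details are assumptions of this sketch rather than part of the talk: the class `Node`, the padding of the text so that indices are 1-based, the use of `float("inf")` for the open pointer, and the helper `find`, which simulates the transitions leaving the auxiliary state instead of storing one per alphabet symbol. The loop simply runs over all n characters instead of testing for the end marker. The function `contains` is an extra helper for trying the result out.

```python
INF = float("inf")


class Node:
    def __init__(self):
        self.edges = {}   # first character -> (k, p, child)
        self.link = None  # suffix link f'


def build_suffix_tree(text):
    t = " " + text  # pad so that t[1] is the first character
    root, bottom = Node(), Node()
    root.link = bottom

    def find(s, ch):
        # The auxiliary state has a one-character transition to root for every ch.
        return (0, 0, root) if s is bottom else s.edges[ch]

    def canonize(s, k, p):
        if p < k:
            return s, k
        k2, p2, s2 = find(s, t[k])
        while p2 - k2 <= p - k:
            k += p2 - k2 + 1
            s = s2
            if k <= p:
                k2, p2, s2 = find(s, t[k])
        return s, k

    def test_and_split(s, k, p, ch):
        if k <= p:
            k2, p2, s2 = find(s, t[k])
            if ch == t[k2 + p - k + 1]:
                return True, s
            r = Node()  # make the implicit state explicit
            s.edges[t[k2]] = (k2, k2 + p - k, r)
            r.edges[t[k2 + p - k + 1]] = (k2 + p - k + 1, p2, s2)
            return False, r
        if s is bottom or ch in s.edges:
            return True, s
        return False, s

    def update(s, k, i):
        oldr = root
        end_point, r = test_and_split(s, k, i - 1, t[i])
        while not end_point:
            r.edges[t[i]] = (i, INF, Node())  # open transition to a new leaf
            if oldr is not root:
                oldr.link = r
            oldr = r
            s, k = canonize(s.link, k, i - 1)
            end_point, r = test_and_split(s, k, i - 1, t[i])
        if oldr is not root:
            oldr.link = s
        return s, k

    s, k = root, 1
    for i in range(1, len(text) + 1):
        s, k = update(s, k, i)
        s, k = canonize(s, k, i)
    return root, t


def contains(root, t, pattern):
    """Return True if pattern is a substring of the text stored in the tree."""
    node, pos = root, 0
    while pos < len(pattern):
        if pattern[pos] not in node.edges:
            return False
        k, p, child = node.edges[pattern[pos]]
        label = t[k:min(p, len(t) - 1) + 1]
        chunk = pattern[pos:pos + len(label)]
        if not label.startswith(chunk):
            return False
        pos += len(chunk)
        node = child
    return True


root, t = build_suffix_tree("abcabd")
print(contains(root, t, "cab"), contains(root, t, "ba"))
```

Keying each node's transitions by their first character gives constant expected time per lookup with a hash map; a bounded alphabet with arrays would serve the same purpose.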
44
Time Complexity
Theorem 2
Algorithm 2 constructs the suffix tree $STree(T)$ for a string $T = t_1 \cdots t_n$ on-line in time $O(n)$.
Proof.
The procedure update is called $n$ times. Its total running time is proportional to the total number of visited states.
45
Time Complexity Analysis
[Figure: the prefixes a, ab, abc, abca, abcab, abcabd stacked one per row, annotated with height = n and width n + 1.]
46
Time Complexity Analysis
[Figure: the boundary path, with the active point $s_j$ and the end point $s_{j'}$ marked.]
Let $r_{i-1}$ be the string corresponding to the active point of $STree(T^{i-1})$.
By Lemma 2, the string corresponding to the end point is $r_i$ with its last character $t_i$ removed, written $(r_i)^{i-1}$.
Note: $r_i = (r_i)^{i-1} t_i$.
So the number of visited states in loop $i$ is
$\mathrm{length}(r_{i-1}) - (\mathrm{length}(r_i) - 1) + 1$.
The total number of visited states is
$\sum_{i=1}^{n} \bigl(\mathrm{length}(r_{i-1}) - \mathrm{length}(r_i) + 2\bigr) = \mathrm{length}(r_0) - \mathrm{length}(r_n) + 2n \le 2n$.
47
Conclusion
A suffix tree can be constructed in linear time by employing
suffix links,
open transitions for leaf nodes,
implicit nodes, which rely on active points and end points.
48
Suffix trees have many applications:
string searching
finding repeated substrings
Many applications appear in Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, by Dan Gusfield, Cambridge, 1997.
49
Any Questions?
50
Thank You