On-line Linear-time Construction of Word Suffix Trees

Shunsuke Inenaga (Japan Society for the Promotion of Science & Kyushu University)
Masayuki Takeda (Kyushu University & JST)

Pattern Searching Problem
Given: text T in Σ* and pattern P in Σ*
Find: an occurrence of P in T
Σ: alphabet; Σ*: set of strings over Σ
Using an indexing structure for T, we can solve the above problem in O(|P|) time.

Suffix Trie
A trie representing all suffixes of T
T = aacb$
Suffixes of T: aacb$, acb$, cb$, b$, $
[Figure: suffix trie of T = aacb$]
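
A minimal sketch of the idea on this slide, assuming a naive construction that inserts every suffix one by one (quadratic time, not the on-line algorithm discussed later). The function names `build_suffix_trie` and `contains` are illustrative only. Once the trie exists, a pattern P is looked up by following at most |P| edges.

```python
def build_suffix_trie(text):
    """Naive suffix trie: nested dicts, one dict per node."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            # Follow the edge labeled ch, creating the child if it is missing.
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """True if pattern occurs in the indexed text; follows |pattern| edges."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("aacb$")
print(contains(trie, "acb"))  # substring of aacb$
print(contains(trie, "ab"))   # not a substring
```

Nested dictionaries keep the sketch short; a real index would use a more compact node representation.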

Unwanted Matching
pattern: pace runner
text: space runner
(the pattern matches inside the word "space", which is not the occurrence we want)

Introducing Word Separator #
#: word separator, a special symbol not in Σ
D: dictionary of words, each a string over Σ terminated by #
Text T: an element of D+
(T is a sequence T1T2…Tk of k words in D)
e.g., T = This#is#a#pen#
Σ = {A,…,z}, D = {…, This#, …, a#, …, is#, …, pen#, …}

Word-level Pattern Searching Problem
Given: text T in D+ and pattern P in D+
Find: an occurrence of P in T which immediately follows #
e.g.
The#space#runner#is#not#your#good#pace#runner#
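
To make the difference concrete, here is a small sketch that filters character-level matches down to word-level ones. It assumes the pattern for the example text is pace#runner# (taken from the "Unwanted Matching" slide) and that an occurrence at the very start of T also counts; both function names are illustrative.

```python
def char_level_occurrences(text, pattern):
    """All start positions of pattern in text, ignoring word boundaries."""
    hits = []
    pos = text.find(pattern)
    while pos != -1:
        hits.append(pos)
        pos = text.find(pattern, pos + 1)
    return hits


def word_level_occurrences(text, pattern, sep="#"):
    """Keep only occurrences that start at position 0 or right after sep."""
    return [p for p in char_level_occurrences(text, pattern)
            if p == 0 or text[p - 1] == sep]


T = "The#space#runner#is#not#your#good#pace#runner#"
P = "pace#runner#"  # assumed pattern for this example
print(char_level_occurrences(T, P))  # includes the match inside "space"
print(word_level_occurrences(T, P))  # only the match after "good#"
```

This scan takes time proportional to the text; the index structures in the following slides are meant to avoid that.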

Word Suffix Trie
A trie representing the suffixes of T that immediately follow # (and T itself).
T = aa#b#
Suffixes of T: aa#b#, a#b#, #b#, b#, #; only aa#b# and b# are represented.
[Figure: word suffix trie of T = aa#b#]
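
A naive sketch of the definition above, assuming T ends with the separator: only T itself and the suffixes that start right after a # are inserted. This is not the on-line construction of the talk, and the names are illustrative.

```python
def build_word_suffix_trie(text, sep="#"):
    """Naive word suffix trie as nested dicts (assumes text ends with sep)."""
    # Word suffixes start at position 0 and right after every separator
    # except the final one.
    starts = [0] + [i + 1 for i, ch in enumerate(text[:-1]) if ch == sep]
    root = {}
    for s in starts:
        node = root
        for ch in text[s:]:
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """True if pattern labels a path from the root."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


def count_nodes(trie):
    """Number of nodes, including the root."""
    return 1 + sum(count_nodes(child) for child in trie.values())


wst = build_word_suffix_trie("aa#b#")
print(contains(wst, "b#"))  # word-level occurrence
print(contains(wst, "a#"))  # occurs in T only inside the word "aa"
print(count_nodes(wst))
```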

Comparison
T = aa#b#
[Figure: suffix trie (left) and word suffix trie (right) of T]

Construction
Suffix Trie: Ukkonen's on-line algorithm (1995)
Word Suffix Trie: We modify Ukkonen's algorithm by:
using the minimum DFA accepting dictionary D
redefining suffix links

Minimum DFA
The minimum DFA accepting D (words over Σ terminated by #) clearly requires constant space (for fixed Σ).
We replace the root node of the suffix trie with the final state of the DFA.
[Figure: the DFA, with a transition labeled #]

Suffix Links
T = aa#b#
[Figure: word suffix trie of T with its suffix links; the DFA transitions are labeled a,b and #]

Suffix Links [cont.]
T = aa#b#
[Figure: suffix links in the suffix trie (left) and the word suffix trie (right) of T]

On-line Construction
T = aa#b#
[Figures: step-by-step on-line construction of the suffix trie (left) and the word suffix trie (right) of T, one step per input character]

Pseudo Code
Just change here!!
[Figure: pseudo code with the modified part highlighted, next to the suffix trie and the word suffix trie of T = aa#b#]
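
The pseudo code itself did not survive in this transcript. The sketch below is a reconstruction under stated assumptions, not the authors' code: it follows the textbook on-line suffix trie algorithm (an auxiliary state below the root, suffix links followed from the deepest node) and switches to the word-level variant by changing only the transitions of the auxiliary state, which here plays the role of the DFA start state (it loops on ordinary characters and moves to the root on #). The names `Node`, `build_trie_online`, `step`, and `word_level` are illustrative.

```python
class Node:
    def __init__(self):
        self.children = {}  # character -> Node
        self.link = None    # suffix link


def build_trie_online(text, sep="#", word_level=True):
    """Read text left to right, extending the trie one character at a time."""
    root = Node()
    bottom = Node()  # auxiliary state; in word-level mode, the DFA start state
    root.link = bottom

    def step(node, ch):
        if node is bottom:
            if word_level:
                # The only change: loop on ordinary characters, go to the
                # root (the DFA's final state) on the separator.
                return root if ch == sep else bottom
            return root  # plain suffix trie: every character leads to root
        return node.children.get(ch)

    top = root  # node spelling the longest suffix inserted so far
    for ch in text:
        r, prev = top, None
        while step(r, ch) is None:
            new = Node()
            r.children[ch] = new
            if prev is not None:
                prev.link = new
            prev = new
            r = r.link
        if prev is not None:
            prev.link = step(r, ch)
        top = step(top, ch)
    return root


def as_dict(node):
    """Nested-dict view of the trie, convenient for inspection."""
    return {ch: as_dict(child) for ch, child in node.children.items()}


def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node.children.values())


print(as_dict(build_trie_online("aa#b#")))
print(count_nodes(build_trie_online("aa#b#", word_level=False)))
```

With `word_level=False` the same loop builds the ordinary suffix trie, which is the sense in which only one place changes.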

Like Dress-up Doll

Like Dress-up Doll [cont.]
Hair is different
Body is the same!!
Different looking!!!!
[Figure: suffix trie and word suffix trie of T = aa#b#]

Drawback of Word Suffix Trie
Word suffix tries require O(k|T|) space.
Andersson et al. introduced word suffix trees, which can be implemented in O(k) space.
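
A quick way to see the O(k|T|) growth, as a sketch: the number of trie nodes equals the number of distinct prefixes of the k word suffixes. The helper name and the test text (k one-letter words with distinct letters) are made up for illustration.

```python
def word_suffix_trie_size(text, sep="#"):
    """Node count of the word suffix trie, root included.

    Each node corresponds to one distinct prefix of some word suffix.
    Assumes text ends with sep.
    """
    starts = [0] + [i + 1 for i, ch in enumerate(text[:-1]) if ch == sep]
    prefixes = {""}  # the root
    for s in starts:
        for e in range(s + 1, len(text) + 1):
            prefixes.add(text[s:e])
    return len(prefixes)


for k in (4, 8, 16):
    # k distinct one-letter words: a#b#c#...
    text = "".join(chr(ord("a") + i) + "#" for i in range(k))
    print(k, len(text), word_suffix_trie_size(text))
```

For this family of texts no two word suffixes share a first character, so the size is 1 + k(k+1) while |T| = 2k.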

Construction of Word Suffix Trees
Algorithm by Andersson et al. (1996): for text T = T1T2…Tk, constructs the word suffix tree in O(|T|) expected time with O(k) space.
Our algorithm: simulates the on-line word suffix trie algorithm on word suffix trees, and runs in O(|T|) worst-case time with O(k) space.

Normal and Word Suffix Trees
T = aa#b#
[Figure: suffix tree (left) and word suffix tree (right) of T]
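
The space saving comes from path compression: each edge stores a pair of indices into T instead of a string of characters. Below is a naive sketch of that representation, inserting the k word suffixes one by one. It runs in O(k|T|) time, so it is neither the linear-time algorithm of this talk nor the algorithm of Andersson et al., and all names are illustrative. A word suffix that ends in the middle of an edge is left implicit.

```python
class Node:
    def __init__(self, start=0, end=0):
        self.start, self.end = start, end  # incoming edge label: text[start:end]
        self.children = {}                 # first character of edge -> Node


def build_word_suffix_tree(text, sep="#"):
    """Path-compressed trie of the word suffixes (assumes text ends with sep)."""
    root = Node()
    starts = [0] + [i + 1 for i, ch in enumerate(text[:-1]) if ch == sep]
    n = len(text)
    for s in starts:
        node, i = root, s
        while i < n:
            child = node.children.get(text[i])
            if child is None:
                node.children[text[i]] = Node(i, n)  # new leaf
                break
            j = child.start
            while j < child.end and i < n and text[j] == text[i]:
                i += 1
                j += 1
            if j == child.end:
                node = child  # whole edge matched, continue below it
                continue
            if i == n:
                break         # suffix ends inside the edge: keep it implicit
            # Mismatch inside the edge: split it and hang a new leaf.
            mid = Node(child.start, j)
            node.children[text[mid.start]] = mid
            child.start = j
            mid.children[text[j]] = child
            mid.children[text[i]] = Node(i, n)
            break
    return root


def count_nodes(node):
    return 1 + sum(count_nodes(c) for c in node.children.values())


T = "aa#b#"
tree = build_word_suffix_tree(T)
print(sorted(T[c.start:c.end] for c in tree.children.values()))
print(count_nodes(tree))
```

Every internal node other than the root branches, so the tree has at most 2k nodes regardless of |T|.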

Construction Algorithm
Just change here!!

Conclusions
We first proposed an on-line word suffix trie construction algorithm.
The keys to the algorithm are the minimum DFA accepting D and the redefined suffix links.
Further, we introduced an on-line algorithm to build word suffix trees that runs in O(|T|) time and O(k) space in the worst case.

Further Work
"Sparse Directed Acyclic Word Graphs" by Shunsuke Inenaga and Masayuki Takeda. Accepted to SPIRE'06.
"Sparse Compact Directed Acyclic Word Graphs" by Shunsuke Inenaga and Masayuki Takeda. Accepted to PSC'06.