ukk's algorithm of suffix tree


Page 1: Ukk's Algorithm of Suffix Tree

Algorithm of Suffix Tree
by [email protected]

Jiachen Yang (1)

(1) Research Student in Osaka University

November 12, 2011

jc-yang (Osaka-U) SuffixTree November 12, 2011 1 / 25

Page 2: Ukk's Algorithm of Suffix Tree

Part Outline

1. What is Suffix Tree
2. History & Naïve Algorithm
3. Optimization of Naïve Algorithm
4. Examples & Analysis

Page 3: Ukk's Algorithm of Suffix Tree

What can we do with a suffix tree?

Linear algorithms for exact string matching, like KMP. In the items below, D is the set of indexed strings S1, ..., SK, with total length n.

Search for strings
1. Check if a string P of length m is a substring in O(m) time.
2. Find all z occurrences of the patterns P1, ..., Pq of total length m as substrings in O(m + z) time.
3. Search for a regular expression P in time expected sublinear in n.
4. Find, for each suffix of a pattern P, the length of the longest match between a prefix of P[i...m] and a substring in D, in Θ(m) time.
5. ...

Find properties of the strings
1. Find the longest common substrings of the strings Si and Sj in Θ(ni + nj) time.
2. Find all maximal pairs, maximal repeats or supermaximal repeats in Θ(n + z) time, if there are z such repeats.
3. Find the Lempel-Ziv decomposition in Θ(n) time.
4. Find the longest repeated substrings in Θ(n) time.
5. Find the most frequently occurring substrings of a minimum length in Θ(n) time.
6. Find the shortest strings from Σ that do not occur in D, in O(n + z) time, if there are z such strings.
7. Find the shortest substrings occurring only once in Θ(n) time.
8. Find, for each i, the shortest substrings of Si not occurring elsewhere in D in Θ(n) time.
9. ...
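To make the O(m) substring check concrete, here is a minimal sketch. It assumes an uncompressed suffix trie stored as nested dicts, which is simpler than a suffix tree but needs quadratic space; the names `build_suffix_trie` and `is_substring` are illustrative and do not come from the slides.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a trie of nested dicts (O(n^2) nodes)."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def is_substring(trie, pattern):
    """Follow one trie edge per pattern character: m dictionary lookups."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("mississippi")
print(is_substring(trie, "ssip"))  # True
print(is_substring(trie, "ssia"))  # False
```

The query cost depends only on the pattern length, not on the text length; a real suffix tree gives the same query bound with linear space.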

Page 4: Ukk's Algorithm of Suffix Tree

Trie, Radix Tree, Suffix Trie & Suffix Tree

A trie (1) (AKA prefix tree) is a dictionary tree.
  - It stores a set of words.
  - Each node represents a character, except that the root is the empty string.
  - Words with a common prefix share the same ancestor nodes.
  - It can be seen as a tree-shaped deterministic finite automaton that accepts exactly the stored words.

A radix tree (AKA patricia trie or radix trie) is a trie with compressed chains of nodes.
  - Each internal node has at least 2 children.

A suffix trie is a trie which stores all suffixes of a given string.

A suffix tree is a suffix radix tree.
  - It enables linear-time construction and fast algorithms for other problems on a string.

(1) Pronounced /tri:/ "tree" by its inventor, as in the word "retrieval", but pronounced /traI/ "try" by other authors.
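The step from trie to radix tree can be sketched as follows. This is only an illustration under assumptions: tries are nested dicts, the key `END` is a hypothetical marker for "a word ends here", and the sample words are made up for the example.

```python
END = "$end"  # hypothetical marker key: a stored word ends at this node


def build_trie(words):
    """Plain trie: one node per character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def compress(node):
    """Return a radix-tree copy: chains of single-child nodes become one edge."""
    out = {}
    for label, child in node.items():
        if label == END:
            out[END] = True
            continue
        # Merge while the child has exactly one outgoing edge and ends no word.
        while len(child) == 1 and END not in child:
            next_label, grandchild = next(iter(child.items()))
            label += next_label
            child = grandchild
        out[label] = compress(child)
    return out


def count_nodes(node):
    """Count nodes, ignoring the END markers."""
    return 1 + sum(count_nodes(c) for k, c in node.items() if k != END)


words = ["test", "team", "toast"]
trie = build_trie(words)
radix = compress(trie)
print(count_nodes(trie), count_nodes(radix))  # 11 6
```

After compression every internal node that does not end a word has at least 2 children, which is the radix tree property stated above.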

Page 5: Ukk's Algorithm of Suffix Tree

Trie & Radix Tree

[Figure: trie of the words "A", "to", "tea", "ted", "ten", "i", "in", "inn"]

[Figure: radix tree example]

Page 6: Ukk's Algorithm of Suffix Tree

Suffix Trie & Suffix Tree of “banana”

[Figure: suffix trie of "banana$" with one node per character, next to the suffix tree of "banana$" with edge labels banana$, a, na, na$ and $. The suffix tree has 11 nodes, numbered 0 to 10.]

Page 7: Ukk's Algorithm of Suffix Tree

Suffix Tree of “mississippi”

[Figure: suffix tree of "mississippi$", with edges labelled by (start, end) index pairs such as (1,2) i, (8,9) p, (2,3) s, (2,5) ssi, (3,5) si and (4,5) i, and open leaf edges such as (8,...) ppi$ and (5,...) ssippi$. The tree has 19 nodes, numbered 0 to 18.]

Page 9: Ukk's Algorithm of Suffix Tree

History of Suffix Tree Algorithms

- The first linear-time algorithm was introduced by Weiner (1973), under the name position tree. Donald Knuth called it the "Algorithm of the year 1973".
- It was greatly simplified by McCreight (1976).
  Both of these algorithms process the string backward.
- The first online construction was given by Ukkonen (1995); it is also easier to understand.
  All of the algorithms above assume that the alphabet size is a fixed constant.
- This limitation was removed by Farach (1997), whose algorithm is optimal for all alphabets.

Further studies aim to scale construction to scenarios where the whole suffix tree, or even the input string, cannot fit into memory.

References

M. Farach. Optimal suffix tree construction with large alphabets. In FOCS, page 137. IEEE Computer Society, 1997.

E. M. McCreight. A space-economical suffix tree construction algorithm. J. ACM, 23:262–272, April 1976. ISSN 0004-5411. doi: 10.1145/321941.321946. URL http://doi.acm.org/10.1145/321941.321946.

E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14:249–260, 1995. ISSN 0178-4617. doi: 10.1007/BF01206331. URL http://dx.doi.org/10.1007/BF01206331.

P. Weiner. Linear pattern matching algorithms. In Switching and Automata Theory, 1973. SWAT '08. IEEE Conference Record of 14th Annual Symposium on, pages 1–11, Oct. 1973. doi: 10.1109/SWAT.1973.13.

Page 10: Ukk's Algorithm of Suffix Tree

Backward Construction of Suffix Tree

[Figure: steps 1 to 7 of building the suffix tree of "banana$" by inserting one suffix per step: banana$, anana$, nana$, ana$, na$, a$ and $. Edges are split where a new suffix shares a prefix with an existing one, giving the internal nodes ana, na and a.]

Page 11: Ukk's Algorithm of Suffix Tree

Construct Suffix Tree by Sorting Suffix

Suffix:

  mississippi
  ississippi
  ssissippi
  sissippi
  issippi
  ssippi
  sippi
  ippi
  ppi
  pi
  i

Sorted suffix:

  i
  ippi
  issippi
  ississippi
  mississippi
  pi
  ppi
  sippi
  sissippi
  ssippi
  ssissippi

Tree of sorted suffix:

  |-i->|-
  |    |-ppi
  |    |-ssi ->|-ppi
  |            |-ssippi
  |-mississippi
  |-p->|-i
  |    |-pi
  |-s->|-i-->|-ppi
       |     |-ssippi
       |-si ->|-ppi
              |-ssippi

Time complexity will be O(N^2 log N).
Space complexity will be O(N^2).
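The sorting step can be reproduced in a few lines. This sketch only lists and sorts the suffixes; each comparison may scan O(N) characters, which is where the O(N^2 log N) bound comes from. The helper `lcp` is an addition for illustration: the common prefix length of neighbouring suffixes shows how deep two branches of the tree stay together.

```python
text = "mississippi"

# All suffixes, stored explicitly: O(N^2) characters in total.
suffixes = [text[i:] for i in range(len(text))]
sorted_suffixes = sorted(suffixes)


def lcp(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n


# Neighbours in sorted order share a path of this length in the tree.
lcps = [lcp(a, b) for a, b in zip(sorted_suffixes, sorted_suffixes[1:])]

print(sorted_suffixes)
print(lcps)  # [1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
```

For example, the value 4 belongs to the pair issippi / ississippi, which share the path "issi" before they branch.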

Page 12: Ukk's Algorithm of Suffix Tree

Naïve Algorithm

SUFFIXTREE(string)
  for i ← 1 to length(string)
      do UPDATE(tree_i)                              ▷ Phase i

UPDATE(tree_i)
  for j ← 1 to i
      do node ← tree_i.FIND(suffix[j to i−1])
         EXTEND(node, string[i])                     ▷ Extension j

Time complexity will be O(N^3).
Space complexity will be O(N^2).
The challenge is to make sure tree_i is updated to tree_{i+1} efficiently.
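A runnable version of the two loops above might look like this. It is a sketch under assumptions: the tree is an uncompressed suffix trie of nested dicts, indices are 0-based, and `find` and `naive_suffix_trie` are illustrative names standing in for FIND, EXTEND and SUFFIXTREE.

```python
def find(root, path):
    """Walk from the root along path; costs O(len(path)) steps."""
    node = root
    for ch in path:
        node = node[ch]
    return node


def naive_suffix_trie(text):
    """Phase i appends text[i] to the end of every suffix of text[:i]."""
    root = {}
    for i in range(len(text)):              # phase i
        for j in range(i + 1):              # extension j (j == i is the empty suffix)
            node = find(root, text[j:i])    # locate the end of suffix text[j:i]
            node.setdefault(text[i], {})    # extend it by the new character
    return root


trie = naive_suffix_trie("banana$")
print(sorted(trie))  # ['$', 'a', 'b', 'n']
```

The three nested levels (phase, extension, walk from the root) give the cubic running time; the rest of the slides is about removing the inner two.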

Page 13: Ukk's Algorithm of Suffix Tree

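Suffix Extend Cases of Naïve Algorithm

Case I    If path S[j...i] ends at a leaf, append the char S[i+1] to the end of the edge into that leaf.
          Example: S[j...i] = ...na, S[j...i+1] = ...nan. The leaf edge "na" becomes "nan".

Case II   If path S[j...i] ends in the middle of an edge, and the next char S[i+1] is not equal to the next char on the edge, split that edge, create an internal node, and add a new edge to a new leaf.
          Example: S[j...i] = ...na, S[j...i+1] = ...nay. The edge "nan" is split after "na"; the new internal node keeps "n" and gets a new leaf edge "y".
          The same case applies when the path ends at an existing node that has no edge starting with S[i+1]: no split is needed, only the new leaf.

Case III  If path S[j...i] ends in the middle of an edge, and the next char S[i+1] is equal to the next char on the edge, do nothing; the extension is done.
          Example: S[j...i] = ...na, S[j...i+1] = ...nan. The edge "nan" already contains it.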
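The three cases can also be told apart on the string alone, without building any tree. The sketch below assumes 0-based Python indexing, so phase i appends `text[i]` to the path `text[j:i]` (the slides write S[j...i] and S[i+1]); `extension_case` is an illustrative name.

```python
def extension_case(text, j, i):
    """Classify extension j of the phase that appends text[i]."""
    prefix, beta, c = text[:i], text[j:i], text[i]
    if beta + c in prefix:
        return "III"   # beta followed by c is already in the tree
    if beta and prefix.find(beta) == j:
        return "I"     # beta occurs only as a suffix, so its path ends at a leaf
    return "II"        # beta continues elsewhere, but never with c: new leaf


text = "banana$"
print([extension_case(text, j, 5) for j in range(6)])
# ['I', 'I', 'I', 'III', 'III', 'III']
print([extension_case(text, j, 6) for j in range(7)])
# ['I', 'I', 'I', 'II', 'II', 'II', 'II']
```

Within a phase the cases appear in the order I, then II, then III, which is the observation the later optimizations build on.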

Page 14: Ukk's Algorithm of Suffix Tree

Naïve Online Construction of Suffix Tree

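[Figure: steps 1 to 7 of the naïve online construction for "banana$". Steps 1 to 6 show the implicit trees for the prefixes b, ba, ban, bana, banan and banana, whose edges are b, a, n and their extensions up to banana, anana and nana. Step 7 appends $ and yields the final tree with internal nodes a, na and ana.]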

Page 15: Ukk's Algorithm of Suffix Tree

Properties of Suffix Tree

1. Each suffix adds exactly 1 leaf node (once a unique end marker such as $ is appended).
     nr_leaf = N
2. A suffix tree is a full tree.
     Each internal node has at least 2 children.
     nr_internal < N
     nr_node < 2N
3. Worst case: Fibonacci word, e.g. abaababaabaab
4. A suffix is either explicit or implicit.
     Explicit when it ends at a node.
     Implicit when it ends in the middle of an edge.
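The node counts can be checked on small inputs. The sketch below takes a shortcut that is an assumption of this example, not the slides' method: it builds an uncompressed suffix trie and counts only the nodes that survive path compression, namely leaves (no children) and branching nodes (2 or more children).

```python
def suffix_tree_node_counts(text):
    """Return (leaves, internal nodes excluding the root) of the suffix tree of text."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})

    leaves = internal = 0
    stack = list(root.values())
    while stack:
        node = stack.pop()
        if not node:
            leaves += 1          # end of a suffix
        elif len(node) >= 2:
            internal += 1        # branching node, kept by path compression
        stack.extend(node.values())
    return leaves, internal


for text in ("banana$", "mississippi$"):
    leaves, internal = suffix_tree_node_counts(text)
    total = leaves + internal + 1    # + 1 for the root
    print(text, leaves, internal, total)
# banana$ 7 3 11
# mississippi$ 12 6 19
```

In both cases nr_leaf equals N, and the total stays below 2N (11 < 14 and 19 < 24).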

[Figure: suffix tree of "banana$" with edge labels banana$, a, na, na$ and $; 11 nodes, numbered 0 to 10.]

Page 17: Ukk's Algorithm of Suffix Tree

Optimization of Naïve Algorithm

1. Substrings can be represented as a (start, end) pair of indices into the original string.
     This reduces the space complexity to O(N) if the alphabet size is a fixed constant.
2. Once a leaf, always a leaf.
     Represent an edge that leads to a leaf as (start, ...).
     Leaf nodes are then extended for free: we do not need Extend Case I.
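Both ideas fit in one small class. This is a minimal sketch: the class name `Edge`, the use of `None` for the open end "(start, ...)" and the parameter `current_end` are assumptions made for illustration.

```python
class Edge:
    """Edge label stored as indices into the text instead of a copied string."""

    def __init__(self, start, end=None):
        self.start = start
        self.end = end  # None marks an open leaf edge "(start, ...)"

    def label(self, text, current_end):
        """Resolve the label; open edges reach up to the current end of the text."""
        end = current_end if self.end is None else self.end
        return text[self.start:end]


text = "banana$"
internal = Edge(2, 4)   # the edge (2,4) "na"
leaf = Edge(4)          # the edge (4,...) "na$..."

# As more of the text is read, the leaf edge grows without being touched.
print([leaf.label(text, n) for n in (5, 6, 7)])  # ['n', 'na', 'na$']
print(internal.label(text, 7))                   # na
```

Each edge needs two integers regardless of label length, and no Case I work is done: reading one more character implicitly extends every leaf.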

[Figure: the suffix tree of "banana$" before and after the change. The string labels banana$, a, na, na$ and $ become the index pairs (0,...), (1,2), (2,4), (4,...) and (6,...).]

Page 18: Ukk's Algorithm of Suffix Tree

Active Point

During a phase, if we meet Extend Case III, that is, if we find that S[i+1] already follows suffix[j...i] in the tree, then S[i+1] also follows every shorter suffix[k...i], k ∈ j...i.
Thus Case III is a sign that the update of this phase is finished.
If during phase i we stopped at suffix[k...i] by Case III, then in the next phase we can start from suffix[k...i+1], because all suffixes starting at 1...k−1 will end in Case I.
We call this point (current internal node, current position k in the string) the Active Point.
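The position k where a phase stops can be computed directly from the string, which makes it easy to see that it never moves backward. This is a brute-force sketch with 0-based indices; `active_start` is an illustrative name, and a real implementation tracks the same information with the active point instead of searching.

```python
def active_start(text, i):
    """Smallest j such that text[j:i+1] already occurs in text[:i].

    Extension j of the phase that appends text[i] is then Case III.
    Returns i + 1 if even the single new character is unseen.
    """
    prefix = text[:i]
    for j in range(i + 1):
        if text[j:i + 1] in prefix:
            return j
    return i + 1


starts = [active_start("banana$", i) for i in range(7)]
print(starts)  # [1, 2, 3, 3, 3, 3, 7]

# The start position is non-decreasing, here checked on the Fibonacci word.
word = "abaababaabaab"
fib_starts = [active_start(word, i) for i in range(len(word))]
print(fib_starts == sorted(fib_starts))  # True
```

Extensions before the previous stop position are leaf extensions (Case I), so each phase only has real work between the previous stop and the new one.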

[Figure: implicit suffix trees of the Fibonacci word "abaababaabaab" after each phase, for the prefixes a, ab, aba, abaa, abaab, abaaba, abaabab, abaababa, abaababaa, abaababaab, abaababaaba, abaababaabaa and abaababaabaab. Edges are labelled with (start, end) index pairs. Internal nodes appear at abaa, with (0,1) a; at abaabab, with (1,3) ba; and at abaababaabaa, with (3,6) aba.]

Page 19: Ukk's Algorithm of Suffix Tree

Ukk’s Update using Active Point

UPDATE(tree_i)
  current_suffix ← active_point
  next_char ← string[i]
  while True
      do if there exists an edge starting with next_char
            then break                               ▷ Case III
            else
                split current edge if implicit
                create new leaf with new edge labelled next_char
         if current_suffix is empty
            then break
            else current_suffix ← next shorter suffix
  active_point ← current_suffix

Page 20: Ukk's Algorithm of Suffix Tree

Suffix Link to find next shorter suffix

Suffix link
  The internal node for the string Xα (X a single character) has a link to the node for α.
  If α is empty, the suffix link points to the root.

How to create suffix links
  Link together the internal nodes that are created by splitting in the same phase, each one pointing to the next one created.

[Figure: two copies of the suffix tree of "banana$", with edges (0,...) banana$, (1,2) a, (2,4) na, (4,...) na$ and (6,...) $, illustrating the suffix links between internal nodes.]

Page 21: Ukk's Algorithm of Suffix Tree

Fast jump using Suffix Link

Assume we are at suffix Xαβ, whose parent internal node represents Xα.

1. Go back to the parent internal node.
2. Follow that node's suffix link to the node representing suffix α.
3. Go down to suffix αβ.

In step 3 we can even jump from node to node without comparing characters, because we already know the length of β (skip/count trick).
Combining all these tricks, we can extend a character in amortized O(1) time.
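Putting the pieces together (index-pair edges, open leaf edges, active point, suffix links and skip/count) gives a compact construction. The sketch below follows the widely used active-node / active-edge / active-length formulation rather than the slides' exact pseudocode; all names (`Node`, `build_suffix_tree`, `remainder`, `pending`, `leaf_count`, `contains`) are choices made for this example, and the input is assumed to end in a unique terminator.

```python
class Node:
    """A tree node; the edge entering it is labelled text[start:end]."""

    def __init__(self, start, end=None):
        self.start = start
        self.end = end        # None marks an open leaf edge "(start, ...)"
        self.children = {}    # first character of the edge label -> child
        self.link = None      # suffix link, set for internal nodes


def build_suffix_tree(text):
    root = Node(0, 0)
    node, edge, length = root, 0, 0   # active point
    remainder = 0                     # suffixes still waiting to be inserted
    pending = None                    # node that still needs a suffix link

    def add_link(target):
        nonlocal pending
        if pending is not None:
            pending.link = target
        pending = target

    for pos, ch in enumerate(text):   # one phase per character
        remainder += 1
        pending = None
        while remainder > 0:
            if length == 0:
                edge = pos
            child = node.children.get(text[edge])
            if child is None:
                node.children[text[edge]] = Node(pos)    # new leaf at a node
                add_link(node)
            else:
                span = (pos + 1 if child.end is None else child.end) - child.start
                if length >= span:                       # skip/count: hop over the edge
                    node, edge, length = child, edge + span, length - span
                    continue
                if text[child.start + length] == ch:     # Case III: phase is finished
                    length += 1
                    add_link(node)
                    break
                split = Node(child.start, child.start + length)   # Case II: split
                node.children[text[edge]] = split
                split.children[ch] = Node(pos)
                child.start += length
                split.children[text[child.start]] = child
                add_link(split)
            remainder -= 1
            if node is root and length > 0:
                length -= 1
                edge = pos - remainder + 1
            elif node is not root:
                node = node.link or root                 # jump to the next shorter suffix
    return root


def leaf_count(node):
    if not node.children:
        return 1
    return sum(leaf_count(c) for c in node.children.values())


def contains(root, text, pattern):
    """True if pattern labels a path from the root, i.e. is a substring of text."""
    node, i = root, 0
    while i < len(pattern):
        child = node.children.get(pattern[i])
        if child is None:
            return False
        end = len(text) if child.end is None else child.end
        label = text[child.start:end]
        chunk = pattern[i:i + len(label)]
        if not label.startswith(chunk):
            return False
        i += len(chunk)
        node = child
    return True


text = "mississippi$"
tree = build_suffix_tree(text)
print(leaf_count(tree))                 # 12
print(contains(tree, text, "issip"))    # True
print(contains(tree, text, "issia"))    # False
```

The helpers `leaf_count` and `contains` are only there to inspect the result: one leaf per suffix, and substring queries that cost time proportional to the pattern length.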

Page 23: Ukk's Algorithm of Suffix Tree

Experiment – mississippi

[Figure sequence: the suffix tree after each phase, for the prefixes m, mi, mis, miss, missi, missis, mississ, mississi, mississip, mississipp, mississippi and mississippi$. Edges are labelled with (start, end) index pairs. The first internal node, (2,3) s, appears at "missi"; the phase that adds the first p creates the internal nodes (1,2) i, (2,5) ssi, (3,5) si and (4,5) i; the node (8,9) p appears at "mississippi"; the final tree has 19 nodes, numbered 0 to 18.]

Page 35: Ukk's Algorithm of Suffix Tree

Time Complexity Analysis

[Figure: diagram over the characters of "mississippi$" (m i s s i s s i p p i $), shown once as a column and once as a row.]

Time complexity is 2N = O(N).
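One way to see where the bound 2N comes from is to count the extensions that are performed explicitly. The sketch below assumes that leaf extensions (Case I) are free thanks to open leaf edges, and that moving between extensions costs amortized O(1) thanks to suffix links and skip/count.

```latex
\begin{aligned}
\#\,\text{explicit extensions}
  &= \underbrace{\#\,\text{Case II extensions}}_{\text{each creates one leaf, and there are } N \text{ leaves}}
   + \underbrace{\#\,\text{Case III extensions}}_{\text{at most one per phase, and there are } N \text{ phases}} \\
  &\le N + N = 2N = O(N).
\end{aligned}
```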

Page 36: Ukk's Algorithm of Suffix Tree

Experiment – English text

[Figure: plot of construction time (sec.), axis from 0 to 16, against file size (byte), axis from 0 to 140000.]
