Crochemore’s algorithm for repetitions revisited - computing runs

F. Franek, M. Jiang

Computing and Software

McMaster University

Hamilton, Ontario

Israel Stringology Conference

Bar-Ilan University, Tel-Aviv

March-April 2009

Israel Stringology Conference, Bar-Ilan slide 1/24

• Why we are interested in Crochemore’s repetition algorithm

• A brief description of our implementation of Crochemore’s algorithm

• A simple modification of Crochemore’s algorithm to compute runs (worsening the complexity to O(n log²(n)))

• A modification of Crochemore’s algorithm to compute runs while preserving the complexity O(n log(n))

• Conclusion

Israel Stringology Conference, Bar-Ilan slide 2/24

Israel Stringology Conference, Bar-Ilan slide 3/24

Why we are interested in Crochemore’s repetition algorithm

A run captures the notion of a maximal non-extendible repetition in a string x

(s,p,e,t)

s: starting position (leftmost)

p: period

e: power, exponent

t: tail (rightmost)

The generator (the repeated substring of length p) is irreducible.

Israel Stringology Conference, Bar-Ilan slide 4/24

Alternative: (s,p,end), with e = (end - s + 1) / p (integer division) and t = (end - s + 1) % p
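
The alternative encoding can be converted back to (s,p,e,t) with integer division and remainder. A minimal Python sketch follows; the function name `run_from_end` is made up for this illustration and is not taken from the authors' implementation.

```python
def run_from_end(s, p, end):
    """Convert a run given as (s, p, end) into (s, p, e, t)."""
    length = end - s + 1   # number of positions covered by the run
    e = length // p        # power: number of complete copies of the generator
    t = length % p         # tail: length of the trailing incomplete copy
    return (s, p, e, t)


print(run_from_end(3, 2, 7))
```

In the string used on slide 9, positions 3 to 7 spell ababa, so this call gives power 2 and tail 1.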

Israel Stringology Conference, Bar-Ilan slide 5/24

Computing runs in linear time

Main (1989) introduced runs and gave the following algorithm to compute the leftmost occurrence of every run of a string x:

(1) Compute a suffix tree for x (linear, using Farach’s algorithm)

(2) Using the suffix tree, compute the Lempel-Ziv factorization of x (linear, Lempel-Ziv)

(3) Using the Lempel-Ziv factorization, compute the leftmost runs (linear, Main)
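
To make step (2) concrete, here is a naive quadratic sketch of one common variant of the Lempel-Ziv factorization: each factor is the longest prefix of the remaining text that also starts at an earlier position (overlap allowed), or a single new letter. This is an illustration only. It does not use a suffix tree, it is not linear, and the function name `lz_factorization` is an assumption of this sketch.

```python
def lz_factorization(x):
    """Naive quadratic Lempel-Ziv factorization (illustration only)."""
    factors = []
    i, n = 0, len(x)
    while i < n:
        longest = 0
        for j in range(i):                 # try every earlier starting position
            k = 0
            # the earlier occurrence is allowed to run past position i
            while i + k < n and x[j + k] == x[i + k]:
                k += 1
            longest = max(longest, k)
        length = max(longest, 1)           # a letter not seen before is its own factor
        factors.append(x[i:i + length])
        i += length
    return factors


print(lz_factorization("abaababa"))
```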

Israel Stringology Conference, Bar-Ilan slide 6/24

Lempel-Ziv factorization can be computed in linear time using a suffix array (Abouelhoda, Kurtz, & Ohlebusch 2004)

A suffix array can be computed in linear time (Kärkkäinen, Sanders 2003; Ko, Aluru 2003)

Chen, Puglisi, & Smyth 2007 use the suffix array and the lcp array (the lcp array can be computed from the suffix array in linear time, Kasai et al. 2001) to compute the Lempel-Ziv factorization in linear time, using Ukkonen’s on-line approach.
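
The lcp computation attributed above to Kasai et al. can be sketched in a few lines of Python. The suffix array here is built by plain sorting, which is not linear and only stands in for the linear-time constructions cited; the function names are assumptions of this sketch.

```python
def suffix_array(x):
    """Suffix array by plain sorting (simple, but not linear time)."""
    return sorted(range(len(x)), key=lambda i: x[i:])


def lcp_array(x, sa):
    """lcp[r] = length of the longest common prefix of the suffixes
    ranked r-1 and r in sa, computed in the style of Kasai et al."""
    n = len(x)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp = [0] * n
    h = 0
    for i in range(n):                      # visit suffixes in text order
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]                 # suffix just before i in sorted order
        while i + h < n and j + h < n and x[i + h] == x[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1                          # the next suffix shares at least h-1 letters
    return lcp


sa = suffix_array("banana")
print(sa, lcp_array("banana", sa))
```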

Israel Stringology Conference, Bar-Ilan slide 7/24

All these approaches are complicated and elaborate, and implementations are not readily available.

Also, they do not lend themselves well to parallelization.

We have a good and “space efficient” implementation of Crochemore’s algorithm that naturally lends itself to parallelization (see slide 9: the refinement of the classes can be done in parallel, as the refinement of one class is independent of the refinement of another class).

Israel Stringology Conference, Bar-Ilan slide 8/24

A brief description of our implementation of Crochemore’s algorithm

Israel Stringology Conference, Bar-Ilan slide 9/24

a b a a b a b a a b a  a  b  a  b  $
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Classes of positions at each level (each class is followed by the substring that starts at its positions):

level 1: {0,2,3,5,7,8,10,11,13} a, {1,4,6,9,12,14} b, {15} $

level 2: {2,7,10} aa, {1,4,6,9,12} ba, {0,3,5,8,11,13} ab, {14} b$

level 3: {2,7,10} aab, {1,6,9} baa, {0,3,5,8,11} aba, {4,12} bab, {13} ab$

level 4: {2,7,10} aaba, {1,6,9} baab, {0,5,8} abaa, {4} baba, {3,11} abab, {12} bab$

level 5: {7} aabaa, {1,6,9} baaba, {0,5,8} abaab, {3} ababa, {2,10} aabab, {11} abab$

level 6: {0,5,8} abaaba, {2} aababa, {10} aabab$, {6} baabaa, {1,9} baabab

level 7: {0,8} abaabab, {5} abaabaa, {1} baababa, {9} baabab$

level 8: {0} abaababa, {8} abaabab$
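
The step from one level to the next can be sketched as follows. This is a naive refinement that splits every class by the letter following its current substring. It does not include the device that gives Crochemore's algorithm its O(n log(n)) bound, and the function names are assumptions of this sketch.

```python
def first_level(x):
    """Level 1: group positions by the letter found there."""
    groups = {}
    for i, letter in enumerate(x):
        groups.setdefault(letter, []).append(i)
    return list(groups.values())


def refine(x, classes, p):
    """Split the level-p classes into level-(p+1) classes."""
    refined = []
    for positions in classes:          # each class is refined independently of the others
        if len(positions) < 2:
            continue                   # a singleton cannot be split any further
        buckets = {}
        for i in positions:
            if i + p < len(x):         # the substring of length p+1 must fit in x
                buckets.setdefault(x[i + p], []).append(i)
        refined.extend(buckets.values())
    return refined


x = "abaababaabaabab$"
level1 = first_level(x)
level2 = refine(x, level1, 1)
level3 = refine(x, level2, 2)
print(sorted(level3))
```

Because each pass of the outer loop reads and writes only one class, the passes could in principle be handed to separate workers.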

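Figure: the arrays CNext[ ], CPrev[ ], CEnd[ ], CStart[ ], CSize[ ] and CMember[ ], each of size N (indexes 0 to 6 shown), illustrated with the class c1 = {2,4,5}: the members are linked through CNext[ ] and CPrev[ ], CStart[ ] holds the first member 2, CEnd[ ] holds the last member 5, CSize[ ] holds the size 3, and CMember[ ] holds the class number 1 for each member.

Total this slide: 6*N, subtotal: 6*N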
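
One way to read the figure is as doubly linked lists kept in flat arrays. The Python sketch below follows that reading; the class name, the method names and the use of -1 for the empty link are assumptions of this illustration, not the authors' code.

```python
NIL = -1  # stands for the empty link drawn in the figure


class ClassLists:
    """Classes of positions stored as doubly linked lists in arrays of size n."""

    def __init__(self, n):
        self.CNext = [NIL] * n     # next member of the same class
        self.CPrev = [NIL] * n     # previous member of the same class
        self.CMember = [NIL] * n   # class number of each position
        self.CStart = [NIL] * n    # first member of each class
        self.CEnd = [NIL] * n      # last member of each class
        self.CSize = [0] * n       # number of members of each class

    def append(self, c, i):
        """Add position i at the end of class c."""
        last = self.CEnd[c]
        self.CPrev[i], self.CNext[i] = last, NIL
        if last == NIL:
            self.CStart[c] = i
        else:
            self.CNext[last] = i
        self.CEnd[c] = i
        self.CMember[i] = c
        self.CSize[c] += 1

    def remove(self, i):
        """Unlink position i from its class in constant time."""
        c = self.CMember[i]
        prev, nxt = self.CPrev[i], self.CNext[i]
        if prev == NIL:
            self.CStart[c] = nxt
        else:
            self.CNext[prev] = nxt
        if nxt == NIL:
            self.CEnd[c] = prev
        else:
            self.CPrev[nxt] = prev
        self.CPrev[i] = self.CNext[i] = self.CMember[i] = NIL
        self.CSize[c] -= 1

    def members(self, c):
        """List the members of class c by following CNext."""
        out, i = [], self.CStart[c]
        while i != NIL:
            out.append(i)
            i = self.CNext[i]
        return out


classes = ClassLists(7)
for position in (2, 4, 5):
    classes.append(1, position)
print(classes.members(1), classes.CSize[1])
```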

Israel Stringology Conference, Bar-Ilan slide 10/24

Figure: the auxiliary structures CEmptyStack, SelQueue, ScQueue, RefStack and Refine[ ], each of size N (indexes 0 to 6 shown).

Total this slide: 5*N, subtotal: 11*N

Israel Stringology Conference, Bar-Ilan slide 11/24

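Figure: the arrays FNext[ ], FPrev[ ], FStart[ ] and FMember[ ], each of size N (indexes 0 to 6 shown), illustrated with f2 = {3,5}: the members are linked through FNext[ ] and FPrev[ ], FStart[ ] holds the first member 3, and FMember[ ] holds the number 2 for each member.

Total this slide: 4*N, overall total: 15*N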

Israel Stringology Conference, Bar-Ilan slide 12/24

Figure: the same class structures (CNext[ ], CPrev[ ], CEnd[ ], CStart[ ], CSize[ ], CMember[ ]) for c1 = {2,4,5}, now stored using memory virtualization.

Total this slide: 4*N, subtotal: 4*N

Israel Stringology Conference, Bar-Ilan slide 13/24

Figure: CEmptyStack, SelQueue, ScQueue, RefStack and Refine[ ] (indexes 0 to 6 shown), now stored using memory multiplexing.

Refine[] is virtualized over FNext[], FPrev[], and FStart[]

Total this slide: 2*N, subtotal: 6*N

Israel Stringology Conference, Bar-Ilan slide 14/24

Figure: the arrays FNext[ ], FPrev[ ], FStart[ ] and FMember[ ] (indexes 0 to 6 shown), illustrated with f2 = {3,5}.

Memory virtualization: Refine[] is virtualized over FNext[], FPrev[], and FStart[]

Total this slide: 4*N, overall total: 10*N

Israel Stringology Conference, Bar-Ilan slide 15/24

Figure: the arrays Gap[ ], GapList[ ], GNext[ ] and GPrev[ ], each of size N (indexes 0 to 6 shown).

Total this slide: 4*N, overall total: 14*N

Israel Stringology Conference, Bar-Ilan slide 16/24

Israel Stringology Conference, Bar-Ilan slide 17/24

Though the repetitions are reported level by level, they are not reported in any appreciable order (caused by the manipulations of GapList)

a b a a b a b a a b a  a  b  a  b  $
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Repetitions in the order reported, each as (start, period, power):

(10,1,2), (7,1,2), (2,1,2), (11,2,2), (3,2,2), (4,2,2), (6,3,2), (5,3,3), (0,3,2), (7,3,2), (0,5,2), (1,5,2)

Figure: the string is shown once per repetition, with that repetition marked; three of the marks are labelled “run”.
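
The list of repetitions for this string can be checked with a brute-force Python sketch that rebuilds the classes at every level and looks for members exactly p apart. It is only a cross-check under a naive reading of the method, far slower than Crochemore's algorithm, and the function name is an assumption of this sketch.

```python
def repetitions(x):
    """Brute-force maximal repetitions (start, period, power), primitive generators only."""
    n = len(x)
    found = []
    for p in range(1, n // 2 + 1):
        classes = {}
        for i in range(n - p + 1):
            classes.setdefault(x[i:i + p], []).append(i)
        for generator, positions in classes.items():
            if (generator + generator).find(generator, 1) != p:
                continue                      # generator is itself a power, skip it
            members = set(positions)
            for s in positions:
                if s - p in members:
                    continue                  # s continues a chain that started further left
                e = 1
                while s + e * p in members:   # follow members that are exactly p apart
                    e += 1
                if e >= 2:
                    found.append((s, p, e))
    return found


print(sorted(repetitions("abaababaabaabab$")))
```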

Israel Stringology Conference, Bar-Ilan slide 18/24

A simple modification of Crochemore’s algorithm to compute runs (worsening the complexity to O(n log²(n)))

Israel Stringology Conference, Bar-Ilan slide 19/24

We have to collect repetitions and “join” them into runs.

Collecting, “joining”, and reporting level by level, basically in a binary search tree:

RunLeft[ ] ( reuse FNext[ ] )

RunRight[ ] ( reuse FPrev[ ] )

Run_s[ ] ( reuse FMember[ ] )

Run_end[ ] ( reuse FStart[ ] )

Complexity: placing each repetition in the tree takes O(log(n)); with O(n log(n)) repetitions, that is O(n log²(n)) overall
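
The joining of one level's repetitions can be sketched in Python with a sorted list searched by `bisect`, standing in for the binary search tree. Two assumptions are made here: repetitions of the same period are joined when they overlap in at least p positions, and the input is a list of (start, power) pairs for a single period p. Insertion into and deletion from a Python list are not O(log(n)), so this sketch shows the logic only, not the stated complexity.

```python
import bisect


def join_level(reps, p):
    """Join repetitions (start, power) of one period p into runs (start, end)."""
    starts, ends = [], []                   # runs so far, kept sorted by start
    for s, e in reps:                       # repetitions may arrive in any order
        end = s + e * p - 1
        k = bisect.bisect_left(starts, s)
        # join with the run to the left if the overlap is at least p
        if k > 0 and ends[k - 1] - s + 1 >= p:
            k -= 1
            s, end = starts[k], max(end, ends[k])
            del starts[k], ends[k]
        # absorb runs to the right that overlap by at least p
        while k < len(starts) and end - starts[k] + 1 >= p:
            end = max(end, ends[k])
            del starts[k], ends[k]
        starts.insert(k, s)
        ends.insert(k, end)
    return list(zip(starts, ends))


# period-3 repetitions of slide 17, in the order they were reported
print(join_level([(6, 2), (5, 3), (0, 2), (7, 2)], 3))
```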

Israel Stringology Conference, Bar-Ilan slide 20/24

Collecting and “joining” in a binary search tree, reporting at the end: the same complexity O(n log²(n)), memory requirement increased by 5*N

RunLeft[ ]

RunRight[ ]

Run_s[ ]

Run_end[ ]

Run_p[ ]: Run_p[p] points to the “root” of the search tree for runs of period p.

Total this slide: 5*N, overall total: 19*N

Israel Stringology Conference, Bar-Ilan slide 21/24

A modification of Crochemore’s algorithm to compute runs while preserving the complexity O(n log(n))

Israel Stringology Conference, Bar-Ilan slide 22/24

Collecting into buckets, “joining” and reporting at the end.

Run_s[ ]: Run_s[s] heads the linked list of repetitions starting at s, each stored as a pair (p1, end1), (p2, end2), ...

Run_Last[ ] ( reuse FNext[ ] ): Run_Last[p2] points to the last run with period p2, so we know what to join the incoming repetition with (if at all), as we sweep from left to right.

Memory: ? O(n log(n))
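
The bucket scheme can be sketched in Python as below. Plain lists and a dictionary stand in for Run_s[ ] and Run_Last[ ], and the joining test (same period, overlap of at least p positions) is an assumption of this illustration rather than a rule quoted from the slides.

```python
def join_into_runs(reps, n):
    """Join repetitions (s, p, e) of a string of length n into runs (s, p, end)."""
    buckets = [[] for _ in range(n)]        # buckets[s]: repetitions starting at s
    for s, p, e in reps:
        buckets[s].append((p, s + e * p - 1))
    last = {}                               # last[p]: index of the latest run of period p
    runs = []
    for s in range(n):                      # sweep from left to right
        for p, end in buckets[s]:
            k = last.get(p)
            if k is not None and runs[k][2] - s + 1 >= p:
                runs[k][2] = max(runs[k][2], end)   # same run, possibly extended
            else:
                last[p] = len(runs)                 # a new run of period p starts here
                runs.append([s, p, end])
    return [tuple(run) for run in runs]


reps = [(10, 1, 2), (7, 1, 2), (2, 1, 2), (11, 2, 2), (3, 2, 2), (4, 2, 2),
        (6, 3, 2), (5, 3, 3), (0, 3, 2), (7, 3, 2), (0, 5, 2), (1, 5, 2)]
for run in join_into_runs(reps, 16):
    print(run)
```

With the twelve repetitions of slide 17 as input, the sweep leaves eight runs; each bucket is visited once, so no search structure is needed.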

Israel Stringology Conference, Bar-Ilan slide 23/24

Complexity: O(n log(n))

Memory: 15*N + O(n log(n))

To avoid dynamic allocation of memory, we are using the technique of allocation from an arena.
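
Allocation from an arena can be illustrated with preallocated arrays and a counter: each new list node takes the next free slot, and links are array indexes rather than separately allocated objects. The Python sketch below applies this to the linked lists of repetitions per starting position. The class name, its fields and the fixed capacity are assumptions of this illustration; the authors' arena is not described in the slides.

```python
NIL = -1


class RepetitionArena:
    """Linked lists of (period, end) per start position, carved from one arena."""

    def __init__(self, n, capacity):
        self.head = [NIL] * n            # head[s]: first node of the list for start s
        self.period = [0] * capacity     # node fields live in preallocated arrays
        self.end = [0] * capacity
        self.next = [NIL] * capacity
        self.used = 0                    # slots handed out so far

    def push(self, s, p, end):
        """Store a repetition starting at s in the next free slot."""
        if self.used == len(self.period):
            raise MemoryError("arena exhausted")
        k = self.used
        self.used += 1
        self.period[k], self.end[k] = p, end
        self.next[k] = self.head[s]      # link the node in front of the list for s
        self.head[s] = k
        return k

    def starting_at(self, s):
        """Yield the (period, end) pairs stored for start position s."""
        k = self.head[s]
        while k != NIL:
            yield self.period[k], self.end[k]
            k = self.next[k]


arena = RepetitionArena(n=16, capacity=12)
arena.push(0, 3, 5)
arena.push(0, 5, 9)
print(list(arena.starting_at(0)))
```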

Israel Stringology Conference, Bar-Ilan slide 24/24

Conclusion

• Crochemore’s algorithm is fast, though memory demanding

• Our implementation is as memory efficient as possible

• Great potential for parallel implementation

• Preliminary tests are very positive

• Further research: (1) to compare performance with linear time algorithms (problem: lack of code), (2) to implement a parallel version with little communication overhead
