Suffix Sorting & Related Algorithmics, Martin Farach-Colton, Rutgers University, USA

Post on 18-Dec-2015


Suffix Sorting & Related Algorithmics

Martin Farach-Colton

Rutgers University, USA

What is Suffix Sorting?

✢ Given a string, give the sorted order of its suffixes.

✢ Why would we want that?

✤ More on this later.

Example: Mississippi$

[Figure: the twelve suffixes of Mississippi$ (Mississippi$, ississippi$, ssissippi$, sissippi$, issippi$, ssippi$, sippi$, ippi$, ppi$, pi$, i$, $), drawn with the position labels 8 2 5 11 1 9 10 6 3 7 4 12.]
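The definition can be tried out directly with a naive sort. This is only a sketch: the function name `naive_suffix_order` is made up here, positions are 1-based as in the slides, and it assumes Python's default character order ('$' < 'M' < 'i' < 'p' < 's'), which need not be the order used in the figures.

```python
def naive_suffix_order(s):
    """Return the 1-based start positions of the suffixes of s in sorted order."""
    # Direct comparison of suffixes: simple, but far from linear time.
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


text = "Mississippi$"
order = naive_suffix_order(text)
for i in order:
    print(i, text[i - 1:])
```

Each comparison may scan O(n) characters, so this baseline is the kind of bound the rest of the talk sets out to beat.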

How fast can we sort?

✢ Sorting suffixes of a string can be no faster than sorting the characters of a string.

✢ Can we match the sorting lower bound?

Suffix Sorting

✢ We are sorting strings, so Radix Sort is natural.

✤ This yields an algorithm with time O(n^2) to O(n^2 log n), depending on assumptions on sortability of characters.

✥ O(n^2 log n) in comparison model.

✥ O(n^2) for small integers. In between in general word model.

✢ We can do better by combining Merge Sort with Radix Sort.

Building Blocks

✢ Range Reduction

✢ Radix Step

✢ Chunking

Range Reduction

✢ Observation: If we apply a monotone function to the characters, the sorted order doesn’t change.

Example: Mississippi$

[Figure: each suffix of Mississippi$ shown next to its range-reduced form under the character ranks below, so that Mississippi$ becomes 214414413315. The position labels 8 2 5 11 1 9 10 6 3 7 4 12 are the same for both versions.]

i 1
M 2
p 3
s 4
$ 5

Range Reduction

✢ Our only range reduction operation will be:

✤ Replace every character by rank in sorted order of characters.

✤ After RR, length n string will be in [n]^n

i 1
M 2
p 3
s 4
$ 5
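Range reduction as described above can be sketched in a few lines. The helper name `range_reduce` and its optional `key` argument are illustrative assumptions; the character order "iMps$" is read off the rank table in the slides.

```python
def range_reduce(chars, key=None):
    """Replace every character by its 1-based rank among the distinct characters."""
    # Sorting the distinct characters is the O(n log n) part of range reduction.
    ranks = {c: r for r, c in enumerate(sorted(set(chars), key=key), start=1)}
    return [ranks[c] for c in chars]


SLIDE_ORDER = "iMps$"  # character order taken from the rank table in the slides
reduced = range_reduce("Mississippi$", key=SLIDE_ORDER.index)
print("".join(map(str, reduced)))  # 214414413315
```

Because the rank map is monotone in the chosen character order, the suffixes of the reduced string sort the same way as the suffixes of the original.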

RR helps running time

✢ Radix Sort on raw input might take O(n^2 log n) time.

✢ RR takes at most O(n log n) time.

✢ Radix Sort on small-integer inputs takes O(n^2) time.

✢ Total time is O(n^2).

Radix Step

✢ Recall that Radix Sort proceeds in steps:

✤ Lexicographically sort the last i characters of each string.

✤ Stably sort by preceding character. Now strings are lexicographically sorted by last i+1 characters.

[Figure: two panels, each with a region of the strings labeled "sorted".]

Using Radix Step

✢ If we have recursively sorted some subset of suffixes of a string:

✤ 1 step of radix will sort the preceding suffixes.

✢ Ex: If you have sorted suffixes at odd positions, then you get suffixes at even positions.

Example: 214414413315

Odd Suffixes

[Figure: the suffixes at odd positions 1, 3, 5, 7, 9, 11 (214414413315, 4414413315, 14413315, 413315, 3315, 15).]

Even Suffixes

[Figure: the suffixes at even positions 2, 4, 6, 8, 10, 12 (14414413315, 414413315, 4413315, 13315, 315, 5), each one character followed by an odd suffix.]

Radix Step

✢ We normally think of Radix Sort as sorting one set of strings.

✢ But with suffixes (which are interrelated), we can use a Radix Step to get sorted orders of one set from another.
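A Radix Step from the odd suffixes to the even suffixes might look like the sketch below. The name `radix_step` is hypothetical, positions are 1-based, and the dictionary of buckets plus `sorted` stands in for the counting sort one would use on an integer alphabet to keep the step linear. The sorted odd order is produced by brute force here only to have an input.

```python
def radix_step(s, sorted_odd):
    """Given the sorted order of the odd-position suffixes of s (1-based),
    return the sorted order of the even-position suffixes."""
    n = len(s)
    # An even suffix is one character followed by the next odd suffix, so
    # listing even positions in the order of that following odd suffix sorts
    # them by everything except their first character.
    evens = [n] if n % 2 == 0 else []  # followed by the empty suffix, which sorts first
    evens += [p - 1 for p in sorted_odd if p > 1]
    # One stable bucket pass on the first character finishes the job.
    buckets = {}
    for p in evens:
        buckets.setdefault(s[p - 1], []).append(p)
    return [p for c in sorted(buckets) for p in buckets[c]]


s = [2, 1, 4, 4, 1, 4, 4, 1, 3, 3, 1, 5]
sorted_odd = sorted(range(1, len(s) + 1, 2), key=lambda p: s[p - 1:])  # stand-in for the recursion
sorted_even = radix_step(s, sorted_odd)
print(sorted_even)  # [8, 2, 10, 4, 6, 12]
```

The stability of the bucket pass is what carries the order of the odd suffixes over to the even ones.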

Where are we now?

✢ Step 1: Recursively sort odd suffixes.

✤ How? And how is it recursive? A recursive step must sort every suffix! We’ll get to that.

✢ Step 2: Get even suffixes in linear time.

✤ By Radix Step.

✢ Step 3: Merge!

Merging is tricky...

✢ Farach ’97 gave the first linear-time solution.

✢ This yields an optimal suffix sorting routine.

✢ It’s a fun algorithm, but highly unintuitive.

✤ Or so I’ve been told!

Chunking

✢ Let’s solve the recursion problem first.

✢ Given two integers i and j, let 〈i,j〉 be their bit concatenation.

✤ If i,j ∈ [n], then 〈i,j〉 ∈ [n^2].

✢ Given a string S = (s_1,s_2,...,s_n)

✢ Let S’ = (〈s_1,s_2〉, 〈s_3,s_4〉, ..., 〈s_{n-1},s_n〉)

Chunking + Recursion: I

✢ Observation: The order of the odd suffixes of S = (s_1,s_2,...,s_n) is the same as the order of all suffixes of S’ = (〈s_1,s_2〉, 〈s_3,s_4〉, ..., 〈s_{n-1},s_n〉)

✤ Since bit concatenation preserves lexicographic ordering.
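The observation can be checked on the running example. In this sketch `chunk` is a made-up helper that assumes an even-length string whose characters each fit in `bits` bits; with 3 bits per character the chunks are the two-digit base-8 numbers of the example.

```python
def chunk(s, bits):
    """Pair up characters as bit concatenations: <s1,s2>, <s3,s4>, ...
    Assumes len(s) is even and every character fits in `bits` bits."""
    return [(s[k] << bits) | s[k + 1] for k in range(0, len(s), 2)]


s = [2, 1, 4, 4, 1, 4, 4, 1, 3, 3, 1, 5]
chunked = chunk(s, bits=3)
print(chunked)  # [17, 36, 12, 33, 27, 13]

# Order of the odd suffixes of s, by brute force (1-based positions).
odd_order = sorted(range(1, len(s) + 1, 2), key=lambda p: s[p - 1:])
# Order of all suffixes of the chunked string; chunk k starts at position 2k-1 of s.
chunk_order = sorted(range(1, len(chunked) + 1), key=lambda k: chunked[k - 1:])
print(odd_order == [2 * k - 1 for k in chunk_order])  # True
```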

Example: 214414413315

[Figure: the odd suffixes of 214414413315 next to the suffixes of the chunked string 〈21〉〈44〉〈14〉〈41〉〈33〉〈15〉. Reading each chunk as a two-digit base-8 number gives 17 36 12 33 27 13.]

Chunking + Recursion: II

✢ Chunking+Range Reduction = Recursion

✤ Input is in [n]^n.

✤ Chunked Input is in [n^2]^(n/2).

✤ Range Reduced Chunking is in [n/2]^(n/2).

✤ So now problem instance is half the size and we can recurse.
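Putting the two operations together gives the half-size instance. The function name `chunk_and_reduce` is an assumption for illustration; it uses Python tuples in place of bit concatenation (they compare the same way) and `sorted` in place of the two-pass radix sort that would keep this step linear.

```python
def chunk_and_reduce(s):
    """Chunk an even-length string into pairs, then replace each pair by its rank."""
    pairs = [(s[k], s[k + 1]) for k in range(0, len(s), 2)]  # tuples order like bit concatenations
    ranks = {p: r for r, p in enumerate(sorted(set(pairs)), start=1)}
    return [ranks[p] for p in pairs]


reduced = chunk_and_reduce([2, 1, 4, 4, 1, 4, 4, 1, 3, 3, 1, 5])
print(reduced)  # [3, 6, 1, 5, 4, 2]
```

The result has length n/2 and its characters are ranks in 1..n/2, so it is a valid input for the same procedure.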

Example: 214414413315

[Figure: the chunked string 〈21〉〈44〉〈14〉〈41〉〈33〉〈15〉 range-reduces to 361542, whose suffixes are 361542, 61542, 1542, 542, 42, 2.]

Recall Basic Operations

✢ Range Reduction

✢ Radix Step

✢ Chunking

✢ Now we are ready for the whole algorithm.

Suffix Sorting

✢ Step 1: Chunk + Range Reduction.

✤ Recurse on new string.

✤ Get sorted order of odd suffixes.

✢ Step 2: Radix Step.

✤ Get sorted order of even suffixes.

✢ Step 3: Merge!

✤ We still don’t know how to do this.

The Trouble with Merging

✢ Let’s start merging. We need to see which is smaller: the smallest odd suffix s_{2i-1} or the smallest even suffix s_{2j}.

✤ We can compare the first character.

✤ If the first characters tie, we must then compare s_{2i} with s_{2j+1}

✥ But this is another odd/even comparison.

The difference between 3 and 2

✢ It’s possible to merge the lists.

✢ But Kärkkäinen & Sanders showed the cute way to merge.

✢ They modified the recursion to make the merge easy.

Mod 3 Recursion

✢ Given a string S = (s_1,s_2,...,s_n)

✢ Let S1 = (〈s_1,s_2,s_3〉, 〈s_4,s_5,s_6〉, ..., 〈s_{n-2},s_{n-1},s_n〉)

✢ Let S2 = (〈s_2,s_3,s_4〉, 〈s_5,s_6,s_7〉, ..., 〈s_{n-4},s_{n-3},s_{n-2}〉)

✢ Let O12 be the order of the suffixes at positions congruent to 1 and 2 mod 3.

✤ You get this recursively from sorting the suffixes of the concatenation S1S2

Radix Step x 2

✢ We have O12 from the recursion.

✢ One Radix Step gives us O01

✢ Another Radix Step gives us O02

✢ Each pair of suffixes now appears together in at least one list.

✢ Each suffix appears in two lists.

Merging... at last!

✢ To merge O12, O01, and O02 note that:

✤ The smallest suffix is the first on two lists.

✤ Pop the smallest.

✤ Now the next smallest is the first on two lists.

✤ Etc.
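The merge rule above can be sketched as follows. The names `merge_three` and `order_of` are hypothetical, positions are 1-based, and the three input lists are produced by brute-force sorting purely as a stand-in for the recursion and the two Radix Steps.

```python
def merge_three(o12, o01, o02):
    """Merge three pairwise orders. The smallest remaining suffix is always
    at the head of exactly two of the lists."""
    lists, pos, out = [o12, o01, o02], [0, 0, 0], []
    total = (len(o12) + len(o01) + len(o02)) // 2  # every suffix is on two lists
    while len(out) < total:
        heads = [lst[p] if p < len(lst) else None for lst, p in zip(lists, pos)]
        for a, b in ((0, 1), (0, 2), (1, 2)):
            if heads[a] is not None and heads[a] == heads[b]:
                out.append(heads[a])  # pop it from both lists
                pos[a] += 1
                pos[b] += 1
                break
        else:
            raise ValueError("inputs are not consistent pairwise orders")
    return out


def order_of(s, residues):
    """Brute-force order of the suffixes whose 1-based position mod 3 is in residues."""
    return sorted((p for p in range(1, len(s) + 1) if p % 3 in residues),
                  key=lambda p: s[p - 1:])


s = [2, 1, 4, 4, 1, 4, 4, 1, 3, 3, 1, 5]
merged = merge_three(order_of(s, {1, 2}), order_of(s, {0, 1}), order_of(s, {0, 2}))
print(merged)  # [8, 5, 2, 11, 1, 10, 9, 7, 4, 6, 3, 12]
```

Each round does constant work and emits one suffix, so the merge is linear in the total length of the lists.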

Total time

✢ T(n) to sort the suffixes of a string in [n]^n

✢ T(n) = recursion + 2*radix + merging

✢ T(n) = T(2n/3) + O(n) + O(n)

✢ T(n) = O(n)

✢ So the initial Range Reduction step into the integer alphabet is the bottleneck.

✤ So this algorithm is optimal for any alphabet.
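For readers who want the recurrence unrolled, one way to see the linear bound is as a geometric series. The constant c below is an assumption standing for the combined cost per character of the Radix Steps and the merge.

```latex
T(n) \;=\; T\!\left(\tfrac{2n}{3}\right) + cn
     \;\le\; cn \sum_{k \ge 0} \left(\tfrac{2}{3}\right)^{k}
     \;=\; 3cn \;=\; O(n)
```

The same unrolling works for any recursion that shrinks the input by a constant factor with linear work per level.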

Why did we want to sort suffixes anyway?

✢ It’s easy to go from Suffix Sorting to Suffix Arrays...

Suffix Arrays

✢ A suffix array is:

✤ The sorted order of suffixes

✤ Their pairwise adjacent lcp’s.

✢ They are handy as space-efficient indexes.

✢ How do we compute the lcp’s from the suffix order?

Example: Mississippi$

[Figure sequence: the leaf labels 8 2 5 11 1 9 10 6 3 7 4 12 of Mississippi$ with the LCP of each adjacent pair initially unknown (?). The values are filled in one text position at a time: first a 0, then a 4, then a 3. Characters already known to match from the previous step are marked ☺ and the first mismatch is marked ☹.]
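One way to compute the adjacent LCPs from the suffix order in linear time is the technique usually attributed to Kasai et al.: visit suffixes in text order and reuse all but one of the characters matched in the previous step. The sketch below is an illustration under assumptions, not necessarily the exact procedure on the slides: `lcp_from_order` is a made-up name, positions are 1-based, each suffix is compared with its successor in sorted order, and the input order is the one Python's default character order gives for Mississippi$.

```python
def lcp_from_order(s, order):
    """Adjacent LCPs of the sorted suffixes of s.
    order: 1-based start positions in sorted order.
    Returns lcp where lcp[k] is the LCP of suffixes order[k] and order[k+1]."""
    n = len(s)
    rank = [0] * (n + 1)
    for k, p in enumerate(order):
        rank[p] = k
    lcp = [0] * (n - 1)
    h = 0  # number of characters already known to match
    for p in range(1, n + 1):  # text order, not sorted order
        k = rank[p]
        if k == n - 1:  # last in sorted order: no successor to compare with
            h = 0
            continue
        q = order[k + 1]
        while p + h <= n and q + h <= n and s[p + h - 1] == s[q + h - 1]:
            h += 1
        lcp[k] = h
        if h > 0:
            h -= 1  # the next suffix drops one matched character
    return lcp


text = "Mississippi$"
order = [12, 1, 11, 8, 5, 2, 10, 9, 7, 4, 6, 3]  # sorted with '$' < 'M' < 'i' < 'p' < 's'
lcps = lcp_from_order(text, order)
print(lcps)  # [0, 0, 1, 1, 4, 0, 1, 0, 2, 1, 3]
```

The counter h drops by at most one per step and never exceeds n, which bounds the total number of character comparisons by O(n).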

What’s a suffix tree?

✢ Compacted trie of all suffixes of a string.

✢ What’s it good for?

✤ Too many things to enumerate...

✤ See the stringology literature of the last 30 years!

Example: Mississippi$


Brute-force Algorithm

✢ We insert one suffix at a time.

✢ Each suffix insertion takes O(n) time.

✢ Total time is O(n^2)
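The one-suffix-at-a-time idea is easiest to see on an uncompacted trie, which is a simplification of the compacted suffix tree. The sketch below makes that simplification explicitly; the names `brute_force_suffix_trie` and `count_leaves` are made up for illustration, and nodes are plain nested dictionaries.

```python
def brute_force_suffix_trie(s):
    """Insert the suffixes of s one at a time into an uncompacted trie of nested dicts."""
    root = {}
    for i in range(len(s)):
        node = root
        for c in s[i:]:  # O(n) work per suffix, O(n^2) overall
            node = node.setdefault(c, {})
    return root


def count_leaves(node):
    """Number of leaves below (and including) node."""
    return 1 if not node else sum(count_leaves(child) for child in node.values())


trie = brute_force_suffix_trie("Mississippi$")
print(count_leaves(trie))  # 12, one leaf per suffix thanks to the final $
```

Compacting the unary paths of this trie into single edges gives the suffix tree.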

How low can we go?

✢ There is a leaf for each suffix

✤ That’s why we put a $ at end.

✢ Each internal node is branching

✤ Because we have a compacted trie.

✢ Each edge has a constant-size label.

✤ Just a pointer to string + length.

✢ So suffix tree has size O(n).

Weiner’s Algorithm

✢ Weiner showed a suffix-tree construction in O(n) time for binary alphabet.

✢ Still adds suffixes one at a time.

✤ But gets a speed-up on insertions by exploiting Suffix Links.

Defs: LCP

✢ Let l_i be the leaf for suffix S[i,n].

✢ Let LCP(i,j) be the length of the longest common prefix of S[i,n] and S[j,n].

✤ Ex: LCP(2,5) = 4 for Mississippi.
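A direct computation of LCP(i,j) is sketched below, taking i and j to be the 1-based start positions of the two suffixes, which is the reading under which the example LCP(2,5) = 4 holds. The function name `lcp` is chosen here for illustration.

```python
def lcp(s, i, j):
    """Length of the longest common prefix of the suffixes of s that start
    at 1-based positions i and j."""
    k = 0
    while i + k <= len(s) and j + k <= len(s) and s[i + k - 1] == s[j + k - 1]:
        k += 1
    return k


print(lcp("Mississippi$", 2, 5))  # 4, the shared prefix "issi"
```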

Suffix Links

✢ Claim: If some node in a suffix tree has string aα, for a a character and α a string, then some node in the suffix tree has string α.
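The claim can be checked by brute force for internal nodes, assuming the string ends in a unique terminator: the internal nodes of the suffix tree then spell exactly the substrings that are followed by at least two distinct characters. The helper name `branching_strings` is an assumption of this sketch.

```python
def branching_strings(s):
    """Strings spelled by the internal nodes of the suffix tree of s:
    substrings followed by at least two distinct characters."""
    followers = {}
    for i in range(len(s)):
        for j in range(i, len(s)):
            # s[i:j] (possibly empty) is followed by the character s[j]
            followers.setdefault(s[i:j], set()).add(s[j])
    return {w for w, f in followers.items() if len(f) >= 2}


nodes = branching_strings("Mississippi$")
print(sorted(nodes))  # ['', 'i', 'issi', 'p', 's', 'si', 'ssi']
# Dropping the first character of any node string gives another node string.
print(all(w[1:] in nodes for w in nodes if w))  # True
```

Following the links from issi gives ssi, si, i and then the root, so a node's string length equals the number of links needed to reach the root.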

Example: Mississippi$

[Figure sequence: the suffix tree of Mississippi$ with the string issi marked.]

Suffix Links: Proof

✢ How do we know a suffix link always exists?

✤ Maybe the link points into the middle of an edge...

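Suffix Links: Proof Cartoon

[Figure: two occurrences of aα in the string, one starting at position i and one at position j, so that α starts at positions i+1 and j+1.]

✤ If aα is a branching node, suffixes i and j both begin with aα and then differ. Suffixes i+1 and j+1 then both begin with α and differ right after it, so α is a branching node too.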

Adding Suffix Links

✢ Adding each suffix link is just a least common ancestor (LCA) computation

✤ This takes O(n) preprocessing + O(1) per LCA.

✤ Adding all suffix links is O(n) time.

Suffix Links Uses

✢ The first use of suffix links was to speed up suffix tree construction.

✤ If you keep suffix links on the partially constructed tree, you don’t have to start every insertion from the root.

✤ Yields a linear time construction!

So we’re done!

✢ The data structure has linear size.

✤ So linear lower bound.

✢ The algorithm runs in linear time.

✤ So linear upper bound.

✢ Declare victory and go home!

Alphabets everywhere.

✢ This algorithm runs in linear time for binary alphabet.

✢ For an alphabet of size s, it runs in O(n log s).

Lower Bounds

✢ Element uniqueness reduces to suffix tree construction, so in the algebraic decision tree model, the lower bound is Ω(n log n).

✢ If we require edges to be sorted, then sorting is lower bound.

Open Problem

✢ Is there an interesting lower bound for suffix trees where each node may have an arbitrary order of children?

✢ Such a tree is no good as an index, but just fine for LCA applications of suffix trees.

Back to Suffix Links

✢ Fun fact: The suffix links form a tree, the sl-tree. The length of the string at a node is its depth in the sl-tree.

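Example: Mississippi$

[Figure: the suffix tree of Mississippi$ with its internal nodes labeled by sl-tree depth: 1, 1, 1, 2, 3, 4.]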

Stripping a Suffix Tree: I

✢ If we are given a suffix tree with no edge labels, we can reconstruct the labels in linear time:

✤ Compute Suffix Links (by LCAs).

✤ Compute Depth in SL-tree.

✤ Each node selects a leaf and computes offset within its descendant suffix.

Example: Mississippi$

[Figure sequence: the suffix tree of Mississippi$ without edge labels, then with the sl-tree depths 1, 1, 1, 2, 3, 4 at its internal nodes.]

Suffix Arrays

✢ A suffix array is:

✤ The sorted order of suffixes

✤ Their pairwise adjacent lcp’s.

Example: Mississippi$

[Figure: the leaf labels 8 2 5 11 1 9 10 6 3 7 4 12 with the LCP of each adjacent pair written between them.]

Suffix Arrays

✢ They are much more space efficient than suffix trees.

✢ You can still use them as indexes.

✢ You can build them as fast as suffix trees...

✤ By DFS of the suffix tree. How about without suffix trees?

Building suffix trees from suffix arrays

✢ You can build the suffix tree left to right by keeping a stack of the right-most path.

✢ This gives you the shape of the tree.

✢ Then add link labels.

✢ Total construction is linear for any alphabet.
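The stack-based construction can be sketched as below. It assumes a unique terminator (so no suffix is a prefix of another), 1-based positions, and nodes stored as plain dictionaries; the names `tree_from_suffix_array` and `internal_depths` are made up, and the input arrays are the suffix order and adjacent LCPs of Mississippi$ under Python's default character order.

```python
def tree_from_suffix_array(order, lcp, n):
    """Build the shape of the suffix tree from the suffix order and adjacent LCPs.
    order: 1-based positions in sorted order; lcp[k]: LCP of order[k] and order[k+1]."""
    root = {"depth": 0, "children": []}
    stack = [root]  # the right-most path, root at the bottom
    for k, p in enumerate(order):
        d = lcp[k - 1] if k > 0 else 0  # string depth at which the new leaf branches off
        last = None
        while stack[-1]["depth"] > d:  # climb the right-most path
            last = stack.pop()
        if stack[-1]["depth"] < d:  # branch point is inside an edge: split it
            mid = {"depth": d, "children": [last]}
            stack[-1]["children"][-1] = mid
            stack.append(mid)
        leaf = {"depth": n - p + 1, "leaf": p, "children": []}
        stack[-1]["children"].append(leaf)
        stack.append(leaf)
    return root


def internal_depths(node):
    """String depths of all internal nodes, in preorder."""
    if "leaf" in node:
        return []
    return [node["depth"]] + [d for c in node["children"] for d in internal_depths(c)]


order = [12, 1, 11, 8, 5, 2, 10, 9, 7, 4, 6, 3]
lcp = [0, 0, 1, 1, 4, 0, 1, 0, 2, 1, 3]
tree = tree_from_suffix_array(order, lcp, n=12)
print(sorted(internal_depths(tree)))  # [0, 1, 1, 1, 2, 3, 4]
```

Every node is pushed and popped at most once, so the shape is built in linear time. An edge label can then be read off as the characters between the parent's and the child's string depth in any suffix below the child.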

Example: Mississippi$

[Figure sequence: the suffix tree of Mississippi$ built left to right from the leaf labels 8 2 5 11 1 9 10 6 3 7 4 12 and their adjacent LCP values. A pointer (☟) marks the current position as the first internal nodes, at string depths 1 and 4, are created, and so on.]

Even Less Information!

✢ You don’t even need LCP information to build the suffix tree.

✢ You can compute LCP array from suffix order array in linear time.

✢ Big Conclusion: Given sorted order of suffixes, you can build the suffix array in linear time for any alphabet.

How do we sort suffixes?

✢ Mergesort!

✤ Careful choice of recursion.

✤ Careful merging.

How do we sort suffixes?

✢ Key Operations:

✤ Range Reduction.

✤ Radix Sorting.

✤ Chunking.
