Post on 01-Jan-2016


Parallel Suffix Array Construction by Accelerated Sampling

Matthew Felice Pace University of Warwick

Joint work with Alexander Tiskin, University of Warwick

Outline

• Introduction
• Difference Covers
• Sequential Suffix Array Construction
• Bulk-Synchronous Parallel (BSP) Model
• Suffix Array Construction in BSP
• Conclusion

Introduction

• What is a suffix array?
• A data structure, denoted by $SA_x$, that holds the lexicographic order of all the suffixes of a given string $x$ of size $n$.

• Suffix array construction is related to sorting.
• The naïve solution is to radix sort all the suffixes in $O(n^2)$.

• We assume that a given string of size $n$ is over or an indexed alphabet.

Example (the suffix array is filled in one entry at a time; the final state is shown):

index:        0 1 2 3 4 5 6 7 8 9
string:       c a b d a b b d a $
suffix array: 8 4 1 5 6 2 0 7 3
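The example above can be reproduced with a few lines of Python. This is a minimal sketch of the naïve approach, with one substitution: it uses Python's built-in comparison sort on the suffixes rather than the radix sort mentioned on the slide. The function name `naive_suffix_array` is illustrative, and `$` is assumed to sort before every letter (true for ASCII).

```python
def naive_suffix_array(x):
    """Return the start indices of the suffixes of x in lexicographic order.

    Naive method: materialise every suffix and sort them directly.
    """
    return sorted(range(len(x)), key=lambda i: x[i:])


sa = naive_suffix_array("cabdabbda$")
# The first entry is the sentinel suffix "$" at index 9; the slide's
# example lists the remaining nine entries.
print(sa[1:])
```

Each suffix comparison can inspect up to $n$ characters, which is why this approach is far from linear time.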

Introduction

• Manber and Myers [1990] presented the first suffix array construction algorithm (SACA), running in $O(n \log n)$.

• Kärkkäinen and Sanders [2003], Kim et al. [2003], and Ko and Aluru [2003] all developed SACAs running in $O(n)$.

• Kärkkäinen et al. [2006] extend their algorithm to run on a $p$-processor BSP machine with optimal $O(n/p)$ local computation and communication costs, requiring $O(\log^2 p)$ supersteps.

• We reduce the number of supersteps required to $O(\log \log p)$ while preserving the optimal computation and communication costs.

Introduction

• The idea behind the SACAs with linear worst-case running time is to use recursion:
1. Divide the indices of the input string $x$ into two nonempty disjoint sets.
2. Form strings $x'$ and $x''$ from the characters indexed by each set.
3. Recursively construct $SA_{x'}$.
4. Use $SA_{x'}$ to construct $SA_{x''}$.
5. Merge $SA_{x'}$ and $SA_{x''}$ to obtain $SA_x$.

index:                 0 1 2 3 4 5 6 7 8 9
string:                c a b d a b b d a $
indices 0, 3, 6, 9:    c d b $
remaining indices:     a b a b d a $

Difference Covers

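• Given a positive integer $v$, let $Z_v$ denote the set of integers $\{0, 1, \dots, v-1\}$.

• Then $D \subseteq Z_v$ can be defined such that for any $z \in Z_v$, there exist $a, b \in D$ such that $z \equiv a - b \pmod{v}$.

• $D$ is known as a difference cover of $Z_v$.

• Let $v = 4$, i.e. $Z_4 = \{0, 1, 2, 3\}$, then we can have e.g. $D = \{1, 2, 3\}$, but not $D = \{0, 1\}$.

D = {1, 2, 3}            D = {0, 1}
0 ≡ 1 – 1 (mod 4)        0 ≡ 1 – 1 (mod 4)
1 ≡ 3 – 2 (mod 4)        1 ≡ 1 – 0 (mod 4)
2 ≡ 3 – 1 (mod 4)        3 ≡ 0 – 1 (mod 4)
3 ≡ 1 – 2 (mod 4)        2 ??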
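The definition can be checked by brute force. The sketch below is illustrative only (the function name `is_difference_cover` is not from the slides): it collects every difference of two elements of the candidate set modulo $v$ and tests whether all residues are covered.

```python
def is_difference_cover(D, v):
    """Return True if every residue modulo v is a difference of two elements of D."""
    differences = {(a - b) % v for a in D for b in D}
    return differences == set(range(v))


print(is_difference_cover({1, 2, 3}, 4))  # every residue 0..3 is covered
print(is_difference_cover({0, 1}, 4))     # residue 2 cannot be formed
```

The check takes $O(|D|^2)$ time, which is fine for verifying small examples by hand.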

Difference Covers

• Colbourn and Ling [2000] give a method for computing a difference cover $D$ of $Z_v$, for any positive integer $v$, in time $O(\sqrt{v})$.
• $|D| \le \sqrt{1.5v} + 6$

• Lemma 1 [Kärkkäinen and Sanders 2003]
• If $D$ is a difference cover of $Z_v$, and $i$ and $j$ are integers, then there exists $l \in [0, v)$ such that $(i + l) \bmod v$ and $(j + l) \bmod v$ are both in $D$.

• Let $D = \{1, 2, 3\}$ and $v = 4$, then

i    j    l    (i + l) mod 4         (j + l) mod 4
30   35   3    (30 + 3) mod 4 = 1    (35 + 3) mod 4 = 2
20   35   2    (20 + 2) mod 4 = 2    (35 + 2) mod 4 = 1
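Lemma 1 can be illustrated by enumerating every admissible shift. The sketch below assumes $v = 4$ (as in the table rows) and takes $D = \{1, 2, 3\}$ as an illustrative difference cover of $Z_4$; the helper name `valid_shifts` is hypothetical. The lemma only promises that at least one shift exists, so the value of $l$ in the table is one valid choice, not necessarily the only one.

```python
def valid_shifts(i, j, D, v):
    """List every l in [0, v) with (i + l) mod v and (j + l) mod v both in D."""
    return [l for l in range(v) if (i + l) % v in D and (j + l) % v in D]


D, v = {1, 2, 3}, 4
print(valid_shifts(30, 35, D, v))  # contains l = 3
print(valid_shifts(20, 35, D, v))  # contains l = 2
```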

Sequential Suffix Array Construction

• Given string $x$ of size $n$, and a positive integer $v$, we construct the suffix array as follows:

• Construct difference cover $D$ of $Z_v$ (e.g. for $v = 3$, $D = \{1, 2\}$).

• Partition the set of indices into $v$ sets $B_k = \{ i : i \bmod v = k \}$, $k \in Z_v$.

• Denote every character $x[i]$ such that $i \bmod v \in D$ as a sample character, and for each such character define a super-character corresponding to $x[i : i+v-1]$.

x[0] x[1] x[2] x[3] x[4] x[5] … x[n-1]

[Figure: sample characters and their super-characters, padded with -1 past the end of the string]
x[2] x[3] x[5] x[6] … -1
x[3] x[4] x[6] x[7] … -1

Sequential Suffix Array Construction

• Construct string $x'$ of super-characters, of size $|D| \cdot n / v$.

• Construct $\bar{x}'$, identical to $x'$ with each super-character replaced by its rank in the sorted list of super-characters.

• Recursively call the algorithm on string $\bar{x}'$, with parameter $v$.

• When the algorithm returns with $SA_{\bar{x}'}$, fill array $rank$ with the rank of each sample suffix of $x$.

x[1:3] x[4:6] … x[n-2:n] x[2:4] x[5:7] … x[n-1:n+1]

4 8 3 3 … 2
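The construction of the ranked string of super-characters can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: characters are mapped to integer codes with `ord`, positions past the end of the string are padded with -1 (as the figure suggests), ranks start at 0, a comparison sort stands in for the integer sort, and the function name `ranked_sample_string` is hypothetical.

```python
def ranked_sample_string(x, v, D):
    """Return the sample positions of x and the rank of each super-character.

    Positions are listed one residue class at a time (all i with i mod v = k
    for the first k in D, then the next k, and so on).
    """
    codes = [ord(c) for c in x]
    n = len(codes)
    padded = codes + [-1] * v  # -1 pads super-characters that run off the end
    positions = [i for k in sorted(D) for i in range(k, n, v)]
    supers = [tuple(padded[i:i + v]) for i in positions]
    # Equal super-characters receive equal ranks.
    rank_of = {s: r for r, s in enumerate(sorted(set(supers)))}
    return positions, [rank_of[s] for s in supers]


positions, ranks = ranked_sample_string("cabdabbda$", 3, {1, 2})
print(positions)  # [1, 4, 7, 2, 5, 8]
print(ranks)      # [2, 1, 5, 4, 3, 0]
```

In this small example all super-characters are distinct, so their ranks already give the order of the sample suffixes; the recursive call matters when some super-characters repeat.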

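Sequential Suffix Array Construction

• For each $k \in Z_v \setminus D$, find an $l$ such that $(k + l) \bmod v \in D$ (e.g. for $v = 3$, $D = \{1, 2\}$ and $k = 0$, then $l = 1$).

• Then for each index $i$ with $i \bmod v = k$, $k \in Z_v \setminus D$, define the tuple $(x[i], x[i+1], \dots, x[i+l-1], rank[i+l])$, and sort the tuples separately for each $k$.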

x[0] x[1] x[2] x[3] x[4] x[5] … x[n-1]

rank[1] rank[4] …
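The tuple sort for one non-sample residue class can be sketched as below. The sketch assumes `rank` is a dictionary from sample positions to the ranks of their suffixes, treats a missing position past the end of the string as rank -1, and uses a comparison sort for brevity; the function name `sort_nonsample_class` is hypothetical.

```python
def sort_nonsample_class(x, k, v, D, rank):
    """Order the suffixes starting at positions i with i mod v = k (k not in D)."""
    # Smallest shift that moves residue class k onto a sample class.
    l = next(s for s in range(1, v) if (k + s) % v in D)
    # Tuple: the first l characters, then the rank of the sample suffix at i + l.
    tuples = [(x[i:i + l], rank.get(i + l, -1), i) for i in range(k, len(x), v)]
    return [t[2] for t in sorted(tuples)]


x, D = "cabdabbda$", {1, 2}
# For the demonstration only, sample ranks are taken from a naive sort.
order = sorted(range(len(x)), key=lambda i: x[i:])
rank = {i: r for r, i in enumerate(order) if i % 3 in D}
print(sort_nonsample_class(x, 0, 3, D, rank))  # [9, 6, 0, 3]
```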

Sequential Suffix Array Construction

• Sort all the suffixes of $x$ by their first $v$ characters to get sets of suffixes having an identical prefix.

• Each set of suffixes with an identical prefix can be divided into subsets of suffixes whose order within the subset has already been found.

• Merge the subsets of each set of suffixes with identical prefixes, using Lemma 1.

• Suffix array is obtained in time $O(vn)$.

aaa aab …

x[0:2] x[10:12] x[5:7] x[12:14] x[1:3]

⁞ ⁞ ⁞ ⁞ ⁞
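The comparison that drives the merge can be sketched as follows: by Lemma 1, any two suffixes can be compared by looking at fewer than $v$ characters and then at the already known ranks of two sample suffixes. The sketch assumes the string ends in a unique sentinel that sorts before every other character, and that `rank` maps sample positions to the ranks of their suffixes; the function name `suffix_less` is hypothetical.

```python
def suffix_less(x, i, j, rank, v, D):
    """Return True if the suffix of x starting at i sorts before the one at j (i != j)."""
    # Lemma 1: some shift l < v makes both i + l and j + l sample positions.
    l = next(s for s in range(v) if (i + s) % v in D and (j + s) % v in D)
    # Compare the first l characters directly ...
    a, b = x[i:i + l], x[j:j + l]
    if a != b:
        return a < b
    # ... and break the tie with the ranks of the two sample suffixes.
    return rank[i + l] < rank[j + l]


x, D = "cabdabbda$", {1, 2}
# For the demonstration only, sample ranks are taken from a naive sort.
order = sorted(range(len(x)), key=lambda i: x[i:])
rank = {i: r for r, i in enumerate(order) if i % 3 in D}
print(suffix_less(x, 6, 2, rank, 3, D))  # "bda$" sorts before "bdabbda$"
```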

Sequential Suffix Array Construction

• The size of the string decreases by a factor of $v / |D|$ in each level of recursion.

[Figure: string size shrinking from $n$ down to 1 over the levels of recursion]

• This requires $O(\log n)$ levels of recursion.
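The logarithmic recursion depth follows from a one-line calculation, sketched here under the assumption that the string shrinks by exactly the same factor $v / |D|$ at every level and that recursion stops at constant size; $t$ denotes the number of levels.

```latex
n \left( \frac{|D|}{v} \right)^{t} \le 1
\iff
t \ge \frac{\log n}{\log \left( v / |D| \right)},
\qquad \text{so } t = O(\log n) \text{ for constant } v .
```

For $v = 3$ and $|D| = 2$ this gives $t \ge \log_{1.5} n$.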

BSP model

• Model developed to allow rigorous parallel algorithm design over diverse physical systems
• $p$ processors, each with local memory
• Global communication environment
• Barrier synchronisation

[Figure: processors (P), each with its own memory (M), connected through the communication environment]

BSP model

• A BSP machine is defined by 3 parameters
• p – number of processors
• g – inverse bandwidth of the network
• l – network latency

• Algorithms run in supersteps, each of which is measured by
• comp – maximum computation over all processors
• comm – maximum communication over all processors

• Total cost of an algorithm having S supersteps is $comp + comm \cdot g + S \cdot l$, where $comp$ and $comm$ are summed over all supersteps.
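The cost formula of the BSP model can be written out as a small helper. This is a sketch of the standard formula (total computation, plus total communication times g, plus one latency term l per superstep); the function name `bsp_cost` and the numbers in the example are illustrative.

```python
def bsp_cost(supersteps, g, l):
    """Total BSP cost of a run.

    supersteps: list of (comp, comm) pairs, one per superstep, where each
    value is already the maximum over all processors for that superstep.
    """
    comp = sum(c for c, _ in supersteps)
    comm = sum(h for _, h in supersteps)
    return comp + comm * g + len(supersteps) * l


# Two supersteps: 150 units of computation, 15 of communication.
print(bsp_cost([(100, 10), (50, 5)], g=2, l=20))  # 150 + 15*2 + 2*20 = 220
```

The last term shows why the number of supersteps matters: it is paid in full latency regardless of how little work each superstep does.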

Suffix Array Construction in BSP

• Sequential algorithm divided into four steps
• Three integer sorting steps
• Final merging step

• Integer sorting in BSP requires $O(1)$ supersteps with $O(n/p)$ comp and comm, using a technique called regular sampling [Chan and Dehne 1999].

• We can perform the final merging step using the same technique.

• Therefore, we can perform each level of recursion in $O(1)$ supersteps.
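Sorting by regular sampling can be sketched as a sequential simulation in which lists stand in for the $p$ processors. This is an illustrative outline of the general technique, not the exact algorithm of Chan and Dehne: the choice of sample positions and splitters below is one simple variant, local sorting uses a comparison sort, and the function name `regular_sampling_sort` is hypothetical.

```python
import bisect
from heapq import merge


def regular_sampling_sort(data, p):
    """Simulate sorting by regular sampling on p virtual processors."""
    n = len(data)
    # Superstep 1: split the input into p blocks and sort each block locally.
    block = -(-n // p)  # ceil(n / p)
    local = [sorted(data[k * block:(k + 1) * block]) for k in range(p)]
    # Each processor picks up to p regularly spaced samples from its block.
    samples = []
    for blk in local:
        step = max(1, len(blk) // p)
        samples.extend(blk[::step][:p])
    # The gathered samples are sorted and at most p - 1 splitters are chosen.
    samples.sort()
    step = max(1, len(samples) // p)
    splitters = samples[step::step][:p - 1]
    # Superstep 2: every processor cuts its block at the splitters and
    # sends piece k to processor k.
    inbox = [[] for _ in range(p)]
    for blk in local:
        cuts = [0] + [bisect.bisect_right(blk, s) for s in splitters] + [len(blk)]
        for k in range(len(cuts) - 1):
            inbox[k].append(blk[cuts[k]:cuts[k + 1]])
    # Superstep 3: each processor merges the sorted pieces it received.
    result = []
    for pieces in inbox:
        result.extend(merge(*pieces))
    return result


print(regular_sampling_sort([5, 3, 9, 1, 7, 2, 8, 6, 4, 0], 3))
```

The number of communication rounds is fixed and does not grow with the input size, which is the property the slide relies on.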

Suffix Array Construction in BSP

• The size of the string decreases by a factor of $v / |D|$ in each level of recursion.

[Figure: string size $n$ shrinking level by level]

• This requires $O(\log p)$ levels, i.e. $O(\log p)$ supersteps.

Suffix Array Construction in BSP

• However, by decreasing the sampling frequency at each level of recursion, we can accelerate the rate at which the size of the input string decreases over successive levels of recursion.

• By increasing the parameter $v$ from one level of recursion to the next, the size of the input string shrinks super-exponentially.


• Therefore, we only require $O(\log \log p)$ supersteps to construct the suffix array of a given string.
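The effect of acceleration on the number of recursion levels can be illustrated with a toy model. Everything numeric here is an assumption made for illustration and is not taken from the slides: recursion is taken to stop once the string has shrunk by a total factor of $p$, the per-level shrink factor starts at 1.5, and in the accelerated variant the factor is raised to the power 1.25 after every level. The function names are hypothetical.

```python
def levels_fixed(p, factor):
    """Levels needed to shrink the string by a total factor of p when every
    level shrinks it by the same constant factor."""
    shrink, levels = 1.0, 0
    while shrink < p:
        shrink *= factor
        levels += 1
    return levels


def levels_accelerated(p, factor, exponent):
    """Same count when the per-level factor itself grows after each level
    (hypothetical rule: factor <- factor ** exponent)."""
    shrink, levels = 1.0, 0
    while shrink < p:
        shrink *= factor
        factor **= exponent
        levels += 1
    return levels


p = 2 ** 64
print(levels_fixed(p, 1.5))              # grows like log p
print(levels_accelerated(p, 1.5, 1.25))  # grows like log log p
```

With a constant factor the level count is proportional to $\log p$; when the factor grows geometrically in the exponent, the cumulative shrinkage is doubly exponential in the number of levels, so the count is proportional to $\log \log p$.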

Conclusion

• Presented an algorithm for constructing suffix arrays in parallel on a $p$-processor machine.

• Algorithm requires optimal $O(n/p)$ local computation and communication costs.

• Reduced the number of supersteps required to a near-optimal $O(\log \log p)$.

• Open questions
• Can we construct suffix arrays in $O(1)$ supersteps?
• Can we apply the accelerated sampling technique to other algorithms?

Thank you!