

Deterministic Distribution Sort

in Shared and Distributed Memory Multiprocessors

(extended abstract)

Mark H. Nodine*

Motorola Cambridge Res. Ctr.

One Kendall Square, Bldg. 200

Cambridge, MA 02139

Abstract

We present an elegant deterministic load balancing strategy for distribution sort that is applicable to a wide variety of parallel disks and parallel memory hierarchies with both single and parallel processors. The simplest application of the strategy is an optimal deterministic algorithm for external sorting with multiple disks and parallel processors. In each input/output (I/O) operation, each of the D ≥ 1 disks can simultaneously transfer a block of B contiguous records. Our two measures of performance are the number of I/Os and the amount of work done by the CPU(s); our algorithm is simultaneously optimal for both measures. We also show how to sort deterministically in parallel memory hierarchies. When the processors are interconnected by any sort of PRAM, our algorithms are optimal for all parallel memory hierarchies; when the interconnection network is a hypercube, our algorithms are either optimal or best-known.

*Part of this research was done while the author was at Brown University, supported in part by an IBM Graduate Fellowship, by NSF research grants CCR-9007851 and IRI-9116451, and by Army Research Office grant DAAL03-91-G-0035. Email: [email protected].

†Part of this research was done while the author was at Brown University. Support was provided in part by Presidential Young Investigator Award CCR-9047466 with matching funds from IBM, by NSF research grant CCR-9007851, and by Army Research Office grant DAAL03-91-G-0035. Email: jsv@cs.duke.edu.

Permission to copy without fee all or part of this material is

granted provided that the copies are not made or distributed for

direct commercial advantage, the ACM copyright notice and the

title of the publication and its date appear, and notice is given

that copying is by permission of the Association for Computing

Machinery. To copy otherwise, or to republish, requires a fee

and/or specific permission.

ACM-SPAA '93-6/93/Velen, Germany. © 1993 ACM 0-89791-599-2/93/0006/0120...$1.50

Jeffrey Scott Vitter†

Dept. of Computer Science

Duke University, Box 90129

Durham, NC 27708-0129

1 Introduction

Input/Output communication (I/O) between primary

and secondary memory is a major bottleneck in many

important computations, and it is especially prevalent

when parallel processors are used. In this paper we

consider the important application of external sorting,

in which the records to be sorted are too numerous to

fit in internal memory and instead reside in secondary

storage, typically made up of one or more magnetic

disks. Data are usually transferred in units of blocks,

which may consist of several kilobytes. This blocking

takes advantage of the fact that the seek time is usually

much longer than the time needed to transfer a record

of data once the disk read/write head is in place. An

increasingly popular way to get further speedup is to

use many disk drives working in parallel [GHK, GiS,

Jil, Mag, PGK, Uni].

Aggarwal and Vitter did initial work in the use of

parallel block transfer for sorting [AgV], generalizing

the sequential work of Floyd [Flo]. Let us consider the

parameters

N = # records in the file
M = # records that can fit in internal memory
P = # CPUs (internal processors)
B = # records per block
D = # blocks transferred per I/O

where M < N, 1 ≤ P ≤ M, and 1 ≤ DB ≤ M/2.

In the Aggarwal-Vitter model, there is only one CPU (P = 1), and in each I/O, D blocks of B records can be transferred simultaneously, as illustrated in Figure 1. Their measure of performance is the number of parallel I/Os required; they ignore internal computation time. Aggarwal and Vitter proved that the average-case and worst-case number of I/Os required for sorting is¹

\Theta\left( \frac{N}{DB} \cdot \frac{\log(N/B)}{\log(M/B)} \right)     (1)

¹We use the notation log x to denote the quantity max{1, log₂ x}. All logarithms in this paper are base 2.


Figure 1: A simple D-parallel two-level memory model.

Their lower bound is based solely on routing arguments,

except for the pathological case in which M and B are

extremely small, in which case the comparison model is

used. They gave two algorithms, a modified merge sort

and a distribution sort, that each achieved the optimal

I/O bounds.

Vitter and Shriver [ViSa] considered the more real-

istic D-disk model, in which the secondary storage is

partitioned into D physically distinct disk drives, as in

Figure 2a. (Note that each head of a multi-head drive

can count as a distinct disk in this definition, as long

as each can operate independently of the other heads

on the drive.) In each I/O operation, each of the D disks can simultaneously transfer one block of B records. Thus, D blocks can be transferred per I/O, as in the

[AgV] model, but only if no two blocks access the same

disk. This assumption is reasonable in view of the way

real systems are constructed.

Vitter and Shriver presented a randomized version of

distribution sort in the D-disk model using two com-

plementary partitioning techniques. Their algorithm

meets the I/O lower bound (1) for the more lenient

model of [AgV], and thus it is optimal. The difficulty

in implementing distribution sort on a set of D paral-

lel disks is making sure that each bucket can be read

efficiently in parallel. The randomization was used to

distribute each of the buckets evenly over the D disks

so they could be read efficiently with parallel I/O. They

posed as an open problem whether there is an optimal

deterministic algorithm. An affirmative answer was pro-

vided by Nodine and Vitter using an algorithm based

on merge sort called Greed Sort [NoV]. Unfortunately,

the Greed Sort technique does not seem to yield optimal

sorting bounds on memory hierarchies.2

Disk striping is a commonly used technique in which the D disks are synchronized, so that the D blocks accessed during an I/O are at the same relative position on each disk. This technique effectively transforms the disks into a single disk with larger block size B' = DB. Merge sort combined with disk striping is deterministic, but the number of I/Os used can be much larger than optimal, by a multiplicative factor of log(M/B).

²An erroneous result in that regard was reported by another author in SPDP '92.
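To make the suboptimality concrete, here is a minimal Python sketch (not from the paper; the parameter values are hypothetical and all constant factors are dropped) comparing bound (1) with the same bound applied to the striped configuration, i.e., one disk with block size DB:

```python
import math

def log2p(x):
    """log x in the paper's sense: max{1, log2 x}."""
    return max(1.0, math.log2(x))

def optimal_ios(N, M, B, D):
    # Bound (1): (N / (D*B)) * log(N/B) / log(M/B), constants dropped.
    return (N / (D * B)) * log2p(N / B) / log2p(M / B)

def striped_ios(N, M, B, D):
    # Same formula with the D striped disks treated as one disk of block size D*B.
    return (N / (D * B)) * log2p(N / (D * B)) / log2p(M / (D * B))

# Hypothetical parameters: 10^9 records, 10^7 in memory, blocks of 10^3 records, 100 disks.
N, M, B, D = 10**9, 10**7, 10**3, 100
print(optimal_ios(N, M, B, D), striped_ios(N, M, B, D))
```

For N much larger than DB, the ratio of the two quantities is about log(M/B)/log(M/(DB)), which approaches the log(M/B) factor mentioned above as DB approaches M/2.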

In this paper we describe Balance Sort, the first

known optimal and deterministic sorting algorithm

based on distribution sort. Balance Sort is optimal for

sorting on multiple disks and CPUs, both in terms of the number of I/O steps and in terms of the amount

of internal processing work. We also use it for optimal

sorting on parallel memory hierarchies.

Section 2 describes the memory models considered in

this paper, and our main results are listed in Section 3.

In Section 4, we give an algorithm that is optimal for

all the parallel multi-level hierarchies. In Section 5, we

show how to alter the algorithm to deal with parallelism

of CPUs in the parallel disk model. Conclusions are

given in Section 6.

2 Memory Models

2.1 Parallel disk models

Conceptually, the simplest large-scale memory is the

two-level memory, known as the disk model. Figure 2a

shows the uniprocessor (P = 1) multiple disk model

with D > 1 disks. The more general model, in which

the internal processing is done on P interconnected

processors, is shown in Figure 2b for the special

case P = D. The interconnections we consider are

the hypercube and the Exclusive-Read/Exclusive-Write

(EREW) PRAM.

In a single I/O, each of the D disks can simultaneously

transfer a block of B records. Our main measure of

performance is the number of I/Os, but at the same

time we also consider the amount of internal processing

done. The difficulty in designing optimal algorithms is

dealing with the partitioning of secondary storage into

separate disks.

2.2 Parallel multilevel hierarchies

The first multilevel hierarchy memory model that we

consider is the Hierarchical Memory Model (HMM) pro-

posed by Aggarwal et al. [AAC], depicted in Figure 3a.

In the HMM_{f(x)} model, access to memory location x takes f(x) time. Figure 3a suggests the HMM_{log x} model, where each layer in the hierarchy is twice as large as the previous layer. Accesses to records in the first layer take one time unit; in general each record in the nth layer takes n time units to access. Figure 3a can actually be taken as representative of the so-called "well-behaved" cost functions f(x), such as f(x) = x^α, α > 0.

An elaboration of HMM is the Block Transfer (BT) model of Aggarwal et al. [ACSa], depicted schematically in Figure 3b. Like HMM, it has a cost function f(x), but additionally it simulates the effect of block transfer by allowing the t + 1 locations x, x − 1, ..., x − t to be accessed at cost f(x) + t. An alternative block-oriented memory hierarchy is the Uniform Memory Hierarchy (UMH) of Alpern et al. [ACF], depicted in Figure 3c.

Figure 2: (a) The parallel disk model. Each of the D disks can simultaneously transfer B records to and from internal memory in a single I/O. The internal memory can store M ≥ DB records. (b) Multiprocessor generalization of the I/O model in (a), in which each of the P = D internal processors controls one disk and has an internal memory of size M/P. The P processors are connected by some topology such as a hypercube or an EREW PRAM and their memories collectively have size M.

Figure 3: Multilevel hierarchy models. (a) The HMM model. (b) The BT model. (c) The UMH model.
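As a rough, sequential illustration of the HMM and BT access costs just described (this is not part of either model's formal definition, just a toy cost calculator), consider:

```python
import math

def hmm_cost(locations, f=lambda x: max(1.0, math.log2(x))):
    """HMM: each access to location x costs f(x); a scan pays the sum."""
    return sum(f(x) for x in locations)

def bt_cost(x, length, f=lambda x: max(1.0, math.log2(x))):
    """BT: one transfer of locations x, x-1, ..., x-(length-1) costs f(x) + (length - 1)."""
    return f(x) + (length - 1)

n = 1 << 20
print(hmm_cost(range(1, n + 1)))   # ~ n log n: touching n records one by one
print(bt_cost(n, n))               # ~ n: the same records moved as a single block
```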

As with two-level hierarchies, multilevel hierarchies

can be parallelized, as shown in Figure 4. The base memory levels of the H hierarchies are attached to H interconnected processors. We assume that the hierarchies are all of the same kind. We denote the parallel hierarchical models for HMM, BT, and UMH as P-HMM, P-BT, and P-UMH.

3 Main Results

In this section, we present our main results. The

Balance Sort approach we describe in the next section

gives us optimal deterministic algorithms for all the

models we consider. In particular we get deterministic

(as well as more practical) versions of the optimal

randomized algorithms of [ViSa], [ViSb] and [ViN].

Figure 4: Parallel multilevel memory hierarchies. The H hierarchies (of any of the types listed in Figure 3) have their base levels connected by H interconnected processors.

We also improve upon the deterministic Greed Sort

algorithm in [NoV], which is known to be optimal only

for the parallel disk models and not for hierarchical

memories. The lower bounds are proved in [AgV]

(Theorem 1) and [ViSb] (Theorems 2 and 3).

Theorem 1 The number of I/Os needed for sorting N records in the parallel disk model is

\Theta\left( \frac{N}{DB} \cdot \frac{\log(N/B)}{\log(M/B)} \right).

The upper bound is given by a deterministic algorithm based on Balance Sort, which also achieves simultaneously optimal Θ((N/P) log N) internal processing time with a PRAM interconnection, assuming for technical reasons that either P ≤ M log min{M/B, log M}/log M or log M = O(log(M/B)). When the interconnection is a hypercube, the internal processing time is the number of I/Os times the time to partition DB elements among the sorted partition elements on a P-processor hypercube. The lower bounds apply to both the average case and the worst case. The I/O lower bound does not require the use of the comparison model of computation, except for the case when M and B are extremely small with respect to N, namely, when B log(M/B) = o(log(N/B)). The internal processing lower bound uses the comparison model.

Theorem 2 In the P-HMM model with an EREW PRAM interconnection, the time for sorting is

\Theta\left( \frac{N}{H}\left( \log\frac{N}{H}\,\log\log\frac{N}{H} + \log N \right) \right)  if f(x) = \log x;

\Theta\left( \left(\frac{N}{H}\right)^{\alpha+1} + \frac{N}{H}\log N \right)  if f(x) = x^{\alpha}, \alpha > 0.

On a hypercube interconnection, the P-HMM time for sorting is

O\left( \frac{N}{H}\left( \log\frac{N}{H}\,\log\log\frac{N}{H} + \frac{\log N}{\log H}\,T(H) \right) \right)  if f(x) = \log x;

\Theta\left( \left(\frac{N}{H}\right)^{\alpha+1} + \frac{N\log N}{H\log H}\,T(H) \right)  if f(x) = x^{\alpha}, \alpha > 0,

where T(H) = O(log H (log log H)²) is the time needed to sort H items on an H-processor hypercube. The upper bounds are given by a deterministic algorithm based on Balance Sort. The lower bounds for the PRAM interconnection hold for any type of PRAM. The lower bounds for the f(x) = x^α case require the comparison model of computation.

The term involving T(H) in the hypercube expression for f(x) = log x is possibly nonoptimal by an O(min{(log N)/(log H), (log log H)²}) factor; however, the algorithm is optimal for large and small values of N.

Theorem 3 In the P-BT model with an EREW PRAM, the time for sorting is

\Theta\left( \frac{N}{H}\log N \right)  if f(x) = \log x;

\Theta\left( \frac{N}{H}\log N \right)  if f(x) = x^{\alpha}, 0 < \alpha < 1;

\Theta\left( \frac{N}{H}\left( \log^2\frac{N}{H} + \log N \right) \right)  if f(x) = x^{\alpha}, \alpha = 1;

\Theta\left( \left(\frac{N}{H}\right)^{\alpha} + \frac{N}{H}\log N \right)  if f(x) = x^{\alpha}, \alpha > 1.

The corresponding bounds for a hypercube interconnection of the P-BT memory hierarchies are

O\left( \frac{N\log N}{H\log H}\,T(H) \right)  if f(x) = \log x;

O\left( \frac{N\log N}{H\log H}\,T(H) \right)  if f(x) = x^{\alpha}, 0 < \alpha < 1;

\Theta\left( \frac{N}{H}\left( \log^2\frac{N}{H} + \frac{\log N}{\log H}\,T(H) \right) \right)  if f(x) = x^{\alpha}, \alpha = 1;

\Theta\left( \left(\frac{N}{H}\right)^{\alpha} + \frac{N\log N}{H\log H}\,T(H) \right)  if f(x) = x^{\alpha}, \alpha > 1,

where T(H) = O(log H (log log H)²) is the time needed to sort H items on an H-processor hypercube. The upper bounds are given by a deterministic algorithm based on Balance Sort. The lower bounds for the PRAM interconnection hold for any type of PRAM. The (N/H) log N terms and the terms involving T(H) in the lower bounds require the comparison model of computation.

The terms involving T(H) in the hypercube expressions for f(x) = log x and f(x) = x^α, α < 1, are possibly nonoptimal by a factor of O(min{(log N)/(log H), (log log H)²}), based on the comparison model of computation; these terms are negligible unless N is superpolynomial in H and H grows without bound.

Algorithm 1 [Sort(N, T)]
    if N ≤ 3H
        n := ⌈N/H⌉
        for m := 1 to n
(1)         Read H locations to the base level (last read may be partial)
            Sort internally
(2)         Write back out again
(3)     Do binary merge sort of the ≤ 3 sorted lists
    else
(4)     Initialize S
(5)     E := ComputePartitionElements(S)
(6)     Balance(T)
        for b := 1 to S
(7)         T := Read bth row of L   {set in Balance}
            N_b := number of elements in bucket b
(8)         Sort(N_b, T)
(9)         Append sorted bucket to output area

Our techniques can also be used to transform the ran-

domized P-UMH algorithms of [ViN] into determinis-

tic ones with our PRAM interconnection. In this pa-

per, however, we concentrate on the P-HMM and P-BT

models.

4 Parallel Memory Hierarchies

This section describes the deterministic algorithm that

serves as the basis for the upper bounds in Theorems 2

and 3. Subsection 4.1 gives the overall sorting algo-

rithm. Subsection 4.2 describes the deterministic sub-

routine for matching that we use. In Subsection 4.3, we

analyze the algorithm for the P-HMM model. Subsec-

tion 4.4 covers the P-BT model.

4.1 The sorting algorithm

For simplicity we assume that the N keys are distinct;

this assumption is easily realizable by appending to

each key the record’s initial location. Algorithm 1

is the top-level description of our sorting algorithm

for parallel memory hierarchies. The algorithm is a

version of distribution sort (sometimes called bucket

sort). It sorts a set of elements by choosing S − 1 partitioning elements of approximately evenly-spaced rank in the set and using them to partition the data into S disjoint ranges, or buckets. The individual buckets

are then sorted recursively and concatenated to form a

completely sorted list.
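For readers who want the skeleton in executable form, here is a minimal, purely in-memory Python sketch of this recursive structure (it assumes distinct keys, as the paper does, and its naive pivot choice merely stands in for Algorithm 2; none of the parallel or I/O machinery appears here):

```python
import bisect

def distribution_sort(records, S=4, base_size=8):
    """Recursive distribution (bucket) sort sketch with S buckets per level."""
    if len(records) <= base_size:
        return sorted(records)                                # base case
    # Naive pivot choice: S - 1 evenly spaced elements of a sorted sample.
    sample = sorted(records[::max(1, len(records) // (4 * S))])
    pivots = [sample[(i * len(sample)) // S] for i in range(1, S)]
    buckets = [[] for _ in range(S)]
    for r in records:
        buckets[bisect.bisect_right(pivots, r)].append(r)     # route r to its bucket
    out = []
    for b in buckets:
        out.extend(distribution_sort(b, S, base_size))        # sort buckets recursively
    return out                                                # concatenation is sorted

print(distribution_sort([5, 3, 9, 1, 7, 2, 8, 6, 4, 0, 11, 10]))
```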


Algorithm 2 [ComputePartitionElements(S)]
    Partition into G groups of ⌈N/G⌉ elements, G_1, ..., G_G
    for g := 1 to G
(1)     Sort G_g recursively
(2)     Set aside every ⌊log N⌋th element into C
(3) Sort C using binary merge sort with hierarchy striping
(4) e_j := the ⌊jN/((S − 1) log N)⌋th smallest element of C

The difficult part of the algorithm is the load bal-

ancing done by the routine Balance, which makes sure

that the buckets are (approximately) evenly distributed

among the hierarchies during partitioning. In order for the balancing to occur in optimal deterministic time, it is necessary to do partial striping of the hierarchies, so that we will have only H' virtual hierarchies with logical (or virtual) blocks of size B = H/H'. We use H' = H^{1/3}.

A number of parameters in each level of the algorithm

merit explanation:

T = array of H' elements pointing to the starting block on each virtual hierarchy
S = # buckets = (# of partition elements) + 1
E = array of S − 1 partition elements
X = S × H' histogram matrix (described later)
A = S × H' auxiliary matrix (described later)
L = S × H' location matrix (described later)

The correctness of Algorithm 1 is easy to establish,

since the bottom level of recursion by definition pro-

duces a sorted list and each level thereafter concate-

nates sorted lists in the right order. To get optimal

performance, we determine the number S of buckets

differently depending upon which hierarchical model we

are using.

Algorithm 2 gives the routine for computing the S − 1 partition elements, based on [AAC, ViSb]. It works by recursively sorting sets of size N/G and choosing every ⌊log N⌋th element. The approximate partition elements are selected from this subset. The specific value of G used in the algorithm is dependent upon which hierarchical memory model is being used. We can show that if we choose every log Nth element and we choose G such that G log N ≤ N/S, then we get 0 < N_b < 2N/S for any bucket b, where N_b is the size of bucket b.
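The sampling idea can be sketched in Python as follows (sequential and simplified: the groups are sorted with the built-in sort, and the final ranks are simply spread evenly over the sample rather than computed exactly as in step (4)):

```python
import math

def compute_partition_elements(records, S, G):
    """Approximate S - 1 partition elements via Algorithm 2's sampling idea."""
    N = len(records)
    step = max(1, int(math.log2(max(2, N))))        # keep every floor(log N)-th element
    group_size = -(-N // G)                         # ceil(N / G)
    C = []
    for g in range(G):
        group = sorted(records[g * group_size:(g + 1) * group_size])  # "sort recursively"
        C.extend(group[step - 1::step])             # set aside the sample elements
    C.sort()                                        # sort the sample C
    # Pick S - 1 pivots at (roughly) evenly spaced ranks of C.
    return [C[(j * len(C)) // S] for j in range(1, S)]

print(compute_partition_elements(list(range(1000, 0, -1)), S=4, G=8))
```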

Following the call to ComputePartitionElements, the

original set of N records is left as G sorted subsets of

approximately N/G records each. This fact is crucially

Algorithm 3 [Balance(T)]
    v := H'
    while there are unprocessed elements
(1)     Read next v virtual blocks
(2)     Partition the records into buckets (in parallel); collect the buckets into virtual blocks of size H/H' (all elements of a block from the same bucket)
(3)     Update the histogram matrix X based on the placements of the virtual blocks
(4)     A := ComputeAux(X)
(5)     𝓗 := {virtual hierarchies with no 2s in A}; write out the virtual blocks corresponding to 𝓗
        { Rebalance reassigns and writes some virtual blocks }
(6)     v := |𝓗| + Rebalance(A, X)
(7)     Update the histogram matrix X to compensate for unprocessed blocks
(8)     Update the internal pointers of the virtual blocks and the location matrix L
(9)     Collect unprocessed virtual blocks to allow room for the next v virtual blocks in the next iteration

important in allowing for partial hierarchy striping, as

described in the next subsections.

Algorithm 3 gives the Balance routine for balancing the buckets among the virtual hierarchies. Balance works as follows, successively track by track: A parallel read is done from the current subset of the G sorted subsets in order to get a full track of records. (Some records may have been left from the previous iteration.) These records are partitioned into buckets by merging them with the partition elements, and the contents of the buckets from the track are formed into virtual blocks. Each virtual block resides on some virtual hierarchy. The virtual blocks that do not overly unbalance their respective buckets, meaning that they do not introduce a 2 into the auxiliary matrix (described below), are written to higher levels of their respective virtual hierarchies. Those virtual blocks that do unbalance their buckets are sent to the Rebalance subroutine. As the records are partitioned and distributed, the histogram matrix X = {x_bh} records how the buckets are distributed among the hierarchies; in particular, x_bh is the number of virtual blocks of bucket b on virtual hierarchy h. Updating the histogram matrix on line (3) simply means that if virtual hierarchy h is assigned a virtual block from bucket b, then we increment x_bh by 1. The location matrix L = {l_bh} tells what location was written last on each virtual hierarchy for each bucket.

The auxiliary matrix A = {a_bh} determines if the placement becomes too badly skewed. Specifically, if m_b is the median number of virtual blocks that bucket b has on all the virtual hierarchies (i.e., m_b is the median of x_b1, ..., x_bH'),³ we define in the ComputeAux routine (Algorithm 4)

a_bh := max{0, x_bh − m_b}.

This definition forces the important invariant:

Invariant 1 At least ⌈H'/2⌉ entries of every row of the auxiliary matrix A are 0s.
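A minimal sequential Python sketch of ComputeAux (plain lists stand in for the distributed matrices; the example histogram is hypothetical) makes the definition and Invariant 1 easy to check:

```python
def compute_aux(X):
    """A[b][h] = max(0, X[b][h] - m_b), where m_b is the ceil(H'/2)-th smallest
    entry of row b of X (the paper's convention for the median)."""
    A = []
    for row in X:
        Hp = len(row)
        m_b = sorted(row)[(Hp + 1) // 2 - 1]        # ceil(Hp/2)-th smallest entry
        A.append([max(0, x - m_b) for x in row])
    return A

# Hypothetical 3-bucket, 4-virtual-hierarchy histogram.
X = [[2, 1, 1, 0],
     [3, 3, 1, 1],
     [0, 1, 2, 3]]
for row in compute_aux(X):
    print(row)                # every row has at least ceil(4/2) = 2 zeros (Invariant 1)
```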

The Balance routine (and its subroutine Rebalance) maintains good balance by guaranteeing that there are at most ⌊H'/2⌋ 2s in each row of the auxiliary matrix A, and that the remaining entries must be 0s and 1s. Any 2s that remain correspond to "unprocessed" virtual blocks and are conceptually written back (without the need for an actual write operation) to the input in line (7) of Algorithm 3. The net result is that the auxiliary matrix A effectively contains only 0s and 1s:

Invariant 2 After each track is processed conceptually, the auxiliary matrix A is binary; that is, all of its entries are either 0 or 1. Hence x_bh ≤ m_b + 1 for all 1 ≤ h ≤ H', where m_b is the median entry on row b of X.

Invariant 2, coupled with the definition of median,

proves that the buckets are balanced:

Theorem 4 Any bucket b will take no more than a

factor of about 2 above the optimal number of tracks

to read.

Recently, an alternative definition of auxiliary matrix

was proposed that has a similar effect of making each

bucket balanced within a factor of 2; the term a_bh is

defined to be 1 when the number of blocks per bucket

is more than twice the desired evenly-balanced number

[Arg].

After the rebalancing, Step (9) of Algorithm 3 routes any unprocessed virtual blocks into a contiguous region that does not overlap with any of the next v virtual hierarchies to be read. This operation takes time O(log H) by monotone routing [Lei, Section 3.4.3].

Algorithm 5 gives the Rebalance subroutine. At most H' 2s can be introduced into the auxiliary matrix A by the virtual blocks being processed. This fact follows since only H' values in the histogram matrix X are incremented, one for each virtual hierarchy, and only values of the histogram matrix that are incremented can become 2 in the auxiliary matrix. We call the subroutine Rearrange to remove introduced 2s at least ⌈H'/4⌉ at a time until we have at most ⌊H'/2⌋ 2s left. The loop will thus execute at most twice.

³We use the convention that the median is always the ⌈D/2⌉th smallest element, rather than the convention in statistics that it is the average of the two middle elements if D is even.

Algorithm 4 [ComputeAux(X)]
    for b := 1 to S
        m_b := median (⌈H'/2⌉th smallest) element of x_b1, ..., x_bH'
        for h := 1 to H' in parallel
            a_bh := max{0, x_bh − m_b}

Algorithm 5 [Rebalance(A, X) returns v]
    v := 0
(1) while there are at least ⌊H'/2⌋ virtual hierarchies with 2s in A
        U := {virtual hierarchies with the next ⌊H'/2⌋ 2s}
(2)     v := v + Rearrange(U)
    return(v)

Algorithm 6 shows the subroutine Rearrange, which is able to remove up to ⌈H'/4⌉ 2s from the auxiliary matrix A in a single parallel memory reference. The Rearrange subroutine is based on the simple observation that if we read a virtual block of bucket b from a virtual hierarchy h for which a_bh = 2 and we write it to another virtual hierarchy h' for which a_bh' = 0, then we have removed the 2. By "removing a 2" from the auxiliary matrix A, we mean that if the auxiliary matrix is immediately recomputed after the operation, an entry that was a 2 will become at most 1, and no 2s will be introduced. We set up a matching so that we could accommodate removing all of the (at most) ⌊H'/2⌋ 2s simultaneously in a single parallel memory reference, although we only guarantee that we remove ⌈H'/4⌉ of them. Hence it follows that if the auxiliary matrix A is computed immediately after the call to Rebalance, there will be at most ⌊H'/2⌋ 2s in it. As mentioned before, the virtual blocks corresponding to those 2s are considered conceptually to be part of the next track, and thus all 2s are effectively removed from A.
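The effect of one such pass, with the matching supplied as an argument rather than computed by Fast-Partial-Match, can be sketched in Python as follows (a sequential toy model operating directly on the histogram):

```python
def apply_rearrange(X, A, matching, bucket_of):
    """For each matched pair (h, h2), move one virtual block of bucket_of[h]
    from virtual hierarchy h (where A[b][h] == 2) to hierarchy h2 (where
    A[b][h2] == 0), updating the histogram X in place."""
    for h, h2 in matching:
        b = bucket_of[h]
        assert A[b][h] == 2 and A[b][h2] == 0    # the swap is only legal in this case
        X[b][h] -= 1                             # the block leaves hierarchy h ...
        X[b][h2] += 1                            # ... and is written to hierarchy h2
    return len(matching)                         # number of 2s removed

X = [[3, 1, 1, 1]]               # one bucket spread over 4 virtual hierarchies
A = [[2, 0, 0, 0]]               # its auxiliary row: hierarchy 0 is two above the median
print(apply_rearrange(X, A, [(0, 1)], bucket_of={0: 0}), X)   # X becomes [[2, 2, 1, 1]]
```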

Line (5) of Algorithm 6 is done with a parallel

memory reference. Line (4) is accomplished by sorting

according to destination address and doing monotone

routing [Lei, Section 3.4.3].

In Section 4.2, we show that Fast-Partial-Match in line (2) of Algorithm 6 always matches at least ⌈H'/4⌉ of the 2s. Rearrange is able to process the same number of virtual blocks in a single parallel memory reference as Fast-Partial-Match is able to match. To achieve optimal time for memory hierarchies with a logarithmic access cost, we need to do the matching in time O(log H) on an EREW PRAM. It is for this reason that we use the special routine Fast-Partial-Match, since even


Algorithm 6 [Rearrange(U) returns v]
    { Create a bipartite matching problem }
    { U = {u_1, ..., u_|U|} is the set of virtual hierarchies with a 2 in A for some bucket }
    V := {1, ..., H'}
    E := ∅
    for i := 1 to |U|
        h := u_i
        b[h] := the (unique) bucket b such that a_bh = 2
        for j := 1 to |V| in parallel
            h' := v_j
(1)         if a_{b[h],h'} = 0
                E := E ∪ {(u_i, v_j)}
    { We swap a pair of virtual blocks for every edge in the following match }
(2) v := Fast-Partial-Match(U, V, E)
    { The array R will tell us what buckets to read from each hierarchy }
    { The array W will tell us on what virtual hierarchy to write the virtual block }
    for h := 1 to H' in parallel
        r[h] := 0
        w[h] := 0
    for each match (u_i, v_j) in parallel
        h := u_i
        r[h] := b[h]
        w[h] := v_j
(3) Update X to reflect the swap
    for h := 1 to H' in parallel
        if r[h] ≠ 0
(4)         Route the reassigned virtual block of bucket r[h] from virtual hierarchy h to w[h]
(5)         Write out the virtual block onto virtual hierarchy w[h]

the fastest known deterministic parallel algorithm for maximal matching (the simplest alternative) with n items is O(log² n) with a quartic number of processors for dense graphs (which we have) [Luba]. Since n = H', this means that the fastest known algorithm is Θ(log² H). Unfortunately, this algorithm is not fast enough.

4.2 How to do fast deterministic matching

In this subsection, we discuss how to do fast deterministic partial matching as part of the rebalancing technique. This algorithm uses time O(T(H)), which is logarithmic on a PRAM and O(log H (log log H)²) on a hypercube. (By partial striping, we can reduce the matching time for the hypercube to O(log H), but this doesn't affect the overall running time by more than a constant factor.)

Algorithm 7 [Fast-Partial-Match(U, V, E) returns v]
    for each vertex u ∈ U
(1)     while u has not picked an edge-adjacent vertex in V
            u picks a random vertex in V ({1, ..., H'})
(2) if a vertex in V is picked by more than one vertex in U, the smallest-numbered vertex in U wins
    Add the picked pairs to the matching
    return(the number of matched pairs)

In the matching problem, we have two vertex sets U and V. There are k = ⌊H'/2⌋ vertices in U, each of which has edges to at least ⌈H'/2⌉ of the H' vertices in V, by Invariant 1. Each edge represents a possible swap between a 2 and a 0 on a row of the auxiliary matrix A, which if put in the matching will remove the 2.

We start by giving a randomized version of Fast-Partial-Match, shown in Algorithm 7, for doing partial bipartite matching. To prove that this matching can be done in O(T(H)) time with H' processors, we can show that Loop (1) of Algorithm 7 will be executed only a constant number of times, on the average, and that Step (2) takes time O(T(H)). Notice that we can implement Loop (1) by assigning one processor to each vertex in U and V, using only O(H') processors. Step (1) of Algorithm 7 therefore takes constant time, on the average, with O(H') processors. We can show that Step (2), a concurrent write operation, can be done in O(T(H)) time. We have the first ⌊H'/2⌋ processors trying to send messages to some subset of the first H' processors. We sort the messages according to their destination. This sorting can be done in time O(T(H')) using H' processors. Once we have the messages sorted, we can do a segmented prefix operation for each unique key to compute how many destinations were selected, eliminate all but the first message in each segment, and finally route the messages to their destinations using monotone routing. The total expected time for the partial matching is thus O(T(H)) for any interconnection.

Finally, we can show the following lemma.

Lemma 1 The expected number of vertices matched in

Algorithm 7 is at least H’/4.
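Before the derandomization discussed next, the randomized procedure itself is easy to simulate; the following Python sketch (a sequential stand-in for Algorithm 7, with a hypothetical instance) illustrates the pick-and-resolve idea:

```python
import random

def fast_partial_match(U, V, adj, rng=random.Random(0)):
    """Randomized partial bipartite matching in the style of Algorithm 7.
    adj[u] is the set of vertices of V adjacent to u; each u keeps picking a
    random vertex of V until it hits a neighbor, and ties on a vertex of V are
    won by the smallest-numbered u."""
    picks = {}
    for u in U:
        v = rng.choice(V)
        while v not in adj[u]:                   # Loop (1): retry until edge-adjacent
            v = rng.choice(V)
        picks[u] = v
    matching, taken = {}, set()
    for u in sorted(U):                          # Step (2): smallest-numbered u wins
        if picks[u] not in taken:
            matching[u] = picks[u]
            taken.add(picks[u])
    return matching                              # Lemma 1: expected size >= H'/4

# Hypothetical instance with H' = 8: U = hierarchies holding a 2, V = all hierarchies.
V = list(range(8))
U = [0, 2, 5, 7]
adj = {u: set(v for v in V if v != u) for u in U}    # each u adjacent to >= H'/2 of V
print(fast_partial_match(U, V, adj))
```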

This algorithm can be derandomized in an efficient

way using the techniques of Luby [Luba, Lubb]. First,

notice that we have H = (H')³ processors available, so we can run up to (H')² copies of the Fast-Partial-Match

algorithm simultaneously. The above randomized

matching algorithm uses only pairwise independence

in the analysis of the running time, so we construct

a special probability space to take advantage of the

pairwise independence. We make sure that the running

time of the algorithm will not be changed in the new


probability space. Analysis of the running times shows

that there must be some point in the probability space

that matches at least ⌈H'/4⌉ vertices in O(log H') steps, and that point can be found exhaustively in parallel.

The matrix of random variables from which we are

sampling occupies only O(H) space, so that it can all

be fit at the base memory level of the hierarchies, and

we therefore need not consider the cost of accessing the

random variables.

To summarize, we have shown the following theorem.

Theorem 5 The Fast-Partial-Match routine matches at least ⌈H'/4⌉ vertices deterministically in O(T(H)) time using H processors.

4.3 Analysis of P-HMM

In the P-HMM model, we choose the number of groups G and the number of buckets S as functions of N and H, using different values for the cases N > H² and N ≤ H².

It is relatively straightforward to show that with these

values of G and S, the virtual blocking can be done and

the buckets are approximately the same size.

After a lot of mathematical manipulation and analysis, the overall recurrence becomes

T(N) = G\,T\left(\frac{N}{G}\right) + \sum_b T(N_b) + O\left(\frac{N}{H}\left(f\left(\frac{N}{H}\right) + T(H)\right)\right)

for the case N > 3H, where T(H) is the time needed to sort H items on H processors. When N ≤ 3H, we have T(N) = O(T(H)).

Lemma 2 When f(x) = log x, the algorithm given

sorts in deterministic time

On a hypercube, the best known value of T(H) is

O(log H log log H) if precomputation is allowed and

O(log H (log log H)²) with no precomputation. On a

PRAM, we have T(H) = O(log H).

Lemma 3 When f(x) = x^α, the algorithm given sorts in deterministic time

O\left( \left(\frac{N}{H}\right)^{\alpha+1} + \frac{N\log N}{H\log H}\,T(H) \right).

In fact, we can go even farther for the P-HMM

model and show that the algorithm is uniformly optimal

for any "well-behaved" cost function f(x) on any

interconnection that has T(H) = O(log H). The

proof of this fact is essentially that of the corresponding

theorem in [ViSb] for their randomized algorithm.

4.4 Analysis of P-BT

Almost the same algorithm will work for the P-BT

model as we used in the P-HMM model. The only

difference is that in Algorithm 1, we need to add another

step right after Step (6) to reposition all the buckets into

consecutive locations on each virtual memory hierarchy.

This repositioning is done on a virtual-hierarchy-by-

virtual-hierarchy basis, using the generalized matrix

transposition algorithm given in [ACSa].

We concentrate in this section on the cost function f(x) = x^α, where 0 < α < 1. We choose G and S as functions of N and H, again distinguishing the cases N > H² and N ≤ H².

As with P-HMM, the virtual blocking can be done and

the buckets are approximately the same size.

We need to make one more change to the algorithm

for BT hierarchies, but one that is hard to write

explicitly. Aggarwal et al. gave an algorithm called

the “touch” algorithm [ACSa]. This algorithm takes

an array of n consecutive records stored at the lowest

possible level and passes them through the base memory

level in order, using time O(n log log n) for 0 < α < 1.

As it turns out, all the data structures are processed

in order throughout the algorithm owing to the sorted

runs of size N/G created by finding the partitioning

elements. The effect of this change is that we get the

same recurrence as for the P-HMM model, using an

effective cost function f(z) = log log Z* = O(log log z).

The limiting step in the algorithm for P-BT is the need

for repositioning the buckets, which can be done using

the cited algorithm in time O((N/H)(log log(N/H))⁴).

So the overall recurrence is

T(N) = G\,T\left(\frac{N}{G}\right) + \sum_b T(N_b) + O\left(\frac{N}{H}\left(\left(\log\log\frac{N}{H}\right)^{4} + T(H)\right) + \frac{N}{H}\log\log\frac{N}{H}\right),     (2)

for the case N > 3H, where T(H) is the time needed to

sort H items on H processors. The case N ≤ 3H is the

same as for P-HMM.


Lemma 4 The given algorithm sorts in the P-BT model, with f(x) = x^α, for 0 < α < 1, using time T(N) = O((N/H) log N).

5 Parallel Disks with Parallel

Processors

In this section, we describe a version of Balance

Sort for the parallel disk model. The algorithm is

optimal in terms of the number of parallel disk I/Os

and also in terms of the internal processing time,

assuming that the P processors are interconnected as

any type of PRAM. If log(M/B) = o(log M), we require

a concurrent read/concurrent write (CRCW) PRAM

interconnection.

The algorithm for the parallel disk model is similar

to that used for the P-HMM model, with the following

changes: In Algorithm 1, we use N ≤ M rather than N ≤ 3H as the termination condition for the recursion. The Balance algorithm is modified so that it reads memoryloads at a time, though it still processes virtual blocks the same way. A "memoryload" is the collection of O(M) records that fit into memory. Note that the parameter we call D (the number of disks) is similar to the parameter we called H for parallel memory hierarchies; however, the number P of CPUs may be different from the number D of disks. We likewise use partial striping. We also use a different method for computing the partitioning elements, described in [ViSa]. Finally, we let S = (M/B)^{1/4}.

The I/O bound is easy to show for the new algorithm. The A, X, L, and E arrays all reside in the internal memory, so there is no I/O cost to access them. For the number of I/Os, we get the recurrence

T(N) = \begin{cases} S\,T(N/S) + O(N/(DB)) & \text{if } N > M \\ 0 & \text{if } N \le M, \end{cases}

which has the solution

T(N) = O\left( \frac{N}{DB}\,\log_S\frac{N}{B} \right) = O\left( \frac{N}{DB}\,\frac{\log(N/B)}{\log(M/B)} \right),

since S = (M/B)^{1/4} and each of the O(log_S(N/B)) levels of recursion contributes O(N/(DB)) I/Os. This is the same bound as was shown to be optimal for the parallel disk model [AgV].
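As a quick numerical sanity check (hypothetical parameters; all hidden constants set to 1, so only the order of growth is meaningful), the recurrence can be unrolled directly:

```python
import math

def io_recurrence(N, M, B, D, S):
    """Unroll T(N) = S*T(N/S) + N/(D*B), with T(N) = 0 for N <= M."""
    if N <= M:
        return 0.0
    return S * io_recurrence(N / S, M, B, D, S) + N / (D * B)

N, M, B, D = 10**9, 10**6, 10**3, 10          # hypothetical parameters
S = (M / B) ** 0.25                           # S = (M/B)^(1/4)
closed_form = (N / (D * B)) * math.log(N / B) / math.log(M / B)
print(io_recurrence(N, M, B, D, S), closed_form)   # same order of growth
```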

The tricky part is showing that the internal processors can be used efficiently for any number of PRAM processors P ≤ M log min{M/B, log M}/log M (or up to P = M if log M = O(log(M/B))). To bound the amount of time spent processing each memoryload, we use a variety of techniques including using an algorithm of Rajasekaran and Reif [RaR] as part of a radix sort, Cole's EREW PRAM parallel merge sort [Col], incremental updating, and even/odd partitioning.

6 Conclusions

In this paper, we have described the first known

deterministic algorithm for sorting optimally using

parallel hierarchical memories. This algorithm improves

upon the randomized algorithms of Vitter and Shriver

[ViSa, ViSb] and the deterministic disk algorithm of

Nodine and Vitter [NoV]. The algorithm applies to P-

HMM, P-BT, and the parallel variants of the UMH

models. In the parallel disk model with parallel CPUs, our algorithm is optimal simultaneously in terms of both the number of I/Os and the internal processing time.

The algorithms can operate without need of non-striped

write operations, a useful feature for error checking and

correcting protocols.

A promising approach to balancing that the authors first considered is to do a greedy balance via min-cost matching on the placement matrix. We conjecture that such an approach results in globally balanced buckets with perhaps an even faster implementation.

It is conceivable that the Sharesort algorithm of

Cypher and Plaxton [CyP] may be applicable to

parallel disks and parallel memory hierarchies to get

an algorithm with performance similar to ours in the

big-oh sense. Alternatively, our balancing results may

be applicable to their parallel sorting model.

Our algorithms are both theoretically efficient and very practical in terms of constant factors, and we ex-

pect our balance technique to be quite useful as large-

scale parallel memories are built, not only for sorting

but also for other load-balancing applications on paral-

lel disks and parallel memory hierarchies. Although we

have presented a deterministic algorithm, the random-

ized algorithm resulting from the randomized matching

is even simpler to implement in practice.

7 References

[AAC] Alok Aggarwal, Bowen Alpern, Ashok K. Chandra, and Marc Snir, "A Model for Hierarchical Memory," Proceedings of the 19th Annual ACM Symposium on Theory of Computing (May 1987), 305-314.

[ACSa] Alok Aggarwal, Ashok K. Chandra, and Marc Snir, "Hierarchical Memory with Block Transfer," Proceedings of the 28th Annual IEEE Symposium on Foundations of Computer Science (October 1987), 204-216.

[AgV] Alok Aggarwal and Jeffrey Scott Vitter, "The Input/Output Complexity of Sorting and Related Problems," Communications of the ACM 31 (September 1988), 1116-1127.

[ACF] Bowen Alpern, Larry Carter, and Ephraim Feig, "Uniform Memory Hierarchies," Proceedings of the 31st Annual IEEE Symposium on Foundations of Computer Science (October 1990), 600-608.

[Arg] Lars Arge, January 1993, private communication.

[BFP] Manuel Blum, Robert W. Floyd, Vaughan Pratt, Ronald L. Rivest, and Robert E. Tarjan, "Time Bounds for Selection," J. Computer and System Sciences 7 (1973), 448-461.

[Col] Richard Cole, "Parallel Merge Sort," SIAM J. Computing 17 (August 1988), 770-785.

[CyP] Robert Cypher and C. Greg Plaxton, "Deterministic Sorting in Nearly Logarithmic Time on the Hypercube and Related Computers," Journal of Computer and System Sciences (to appear), also appears in Proceedings of the 22nd Annual ACM Symposium on Theory of Computing (May 1990), 193-203.

[Flo] Robert W. Floyd, "Permuting Information in Idealized Two-Level Storage," in Complexity of Computer Computations, R. Miller and J. Thatcher, ed., Plenum, 1972, 105-109.

[GHK] Garth Gibson, Lisa Hellerstein, Richard M. Karp, Randy H. Katz, and David A. Patterson, "Coding Techniques for Handling Failures in Large Disk Arrays," U. C. Berkeley, UCB/CSD 88/477, December 1988.

[GiS] David Gifford and Alfred Spector, "The TWA Reservation System," Communications of the ACM 27 (July 1984), 650-665.

[GoS] Mark Goldberg and Thomas Spencer, "Constructing a Maximal Independent Set in Parallel," SIAM J. Discrete Math 2, 322-328.

[HoK] Jia-Wei Hong and H. T. Kung, "I/O Complexity: The Red-Blue Pebble Game," Proc. of the 13th Annual ACM Symposium on the Theory of Computing (May 1981), 326-333.

[IsS] Amos Israeli and Y. Shiloach, "An Improved Parallel Algorithm for Maximal Matching," Information Processing Letters 22 (1986), 57-60.

[Jil] W. Jilke, "Disk Array Mass Storage Systems: The New Opportunity," Amperif Corporation, September 1986.

[Lei] F. Thomson Leighton, Introduction to Parallel Algorithms and Architectures, Morgan Kaufmann Publishers, San Mateo, CA, 1992.

[Luba] Michael Luby, "A Simple Parallel Algorithm for the Maximal Independent Set Problem," SIAM J. Computing 15 (1986), 1036-1053.

[Lubb] Michael G. Luby, "Removing Randomness in a Parallel Computation Without a Processor Penalty," International Computer Science Institute, TR-89-044, July 1989, also appears in Proceedings of the 29th Annual IEEE Symposium on the Foundations of Computer Science (October 1988), 162-173.

[Mag] Ninamary Buba Maginnis, "Store More, Spend Less: Mid-Range Options Abound," Computerworld (November 16, 1987), 71-82.

[NoV] Mark H. Nodine and Jeffrey Scott Vitter, "Greed Sort: An Optimal External Sorting Algorithm for Multiple Disks," Brown University, CS-91-20, August 1991, also appears in shortened form in "Large-Scale Sorting in Parallel Memories," Proc. 3rd Annual ACM Symposium on Parallel Algorithms and Architectures, Hilton Head, SC (July 1991), 29-39.

[PGK] David A. Patterson, Garth Gibson, and Randy H. Katz, "A Case for Redundant Arrays of Inexpensive Disks (RAID)," Proceedings ACM SIGMOD Conference (June 1988), 109-116.

[RaR] Sanguthevar Rajasekaran and John H. Reif, "Optimal and Sublogarithmic Time Randomized Parallel Sorting Algorithms," SIAM J. Computing 18 (1989), 594-607.

[Uni] University of California at Berkeley, "Massive Information Storage, Management, and Use (NSF Institutional Infrastructure Proposal)," Technical Report No. UCB/CSD 89/493, January 1989.

[ViN] Jeffrey Scott Vitter and Mark H. Nodine, "Large-Scale Sorting in Uniform Memory Hierarchies," Journal of Parallel and Distributed Computing (January 1993), also appears in shortened form in "Large-Scale Sorting in Parallel Memories," Proc. 3rd Annual ACM Symposium on Parallel Algorithms and Architectures, Hilton Head, SC (July 1991), 29-39.

[ViSa] Jeffrey Scott Vitter and Elizabeth A. M. Shriver, "Algorithms for Parallel Memory I: Two-Level Memories," Algorithmica, to appear.

[ViSb] Jeffrey Scott Vitter and Elizabeth A. M. Shriver, "Algorithms for Parallel Memory II: Hierarchical Multilevel Memories," Algorithmica, to appear.