Proceedings of the Fifth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '93), Velen, Germany, June 30 - July 2, 1993.
Deterministic Distribution Sort
in Shared and Distributed Memory Multiprocessors
(extended abstract)
Mark H. Nodine*
Motorola Cambridge Res. Ctr.
One Kendall Square, Bldg. 200
Cambridge, MA 02139
Abstract
We present an elegant deterministic load balancing strategy
for distribution sort that is applicable to a wide variety of
parallel disks and parallel memory hierarchies with both
single and parallel processors. The simplest application
of the strategy is an optimal deterministic algorithm for
external sorting with multiple disks and parallel processors.
In each input/output (I/O) operation, each of the D ≥ 1
disks can simultaneously transfer a block of B contiguous
records. Our two measures of performance are the number
of I/Os and the amount of work done by the CPU(s); our
algorithm is simultaneously optimal for both measures. We
also show how to sort deterministically in parallel memory
hierarchies. When the processors are interconnected by any
sort of a PRAM, our algorithms are optimal for all parallel
memory hierarchies; when the interconnection network is a
hypercube, our algorithms are either optimal or best-known.
*Part of this research was done while the author was at Brown University, supported in part by an IBM Graduate Fellowship, by NSF research grants CCR-9007851 and IRI-9116451, and by Army Research Office grant DAAL03-91-G-0035. Email: [email protected].
†Part of this research was done while the author was at Brown University. Support was provided in part by Presidential Young Investigator Award CCR-9047466 with matching funds from IBM, by NSF research grant CCR-9007851, and by Army Research Office grant DAAL03-91-G-0035. Email: [email protected].
Jeffrey Scott Vitter†
Dept. of Computer Science
Duke University, Box 90129
Durham, NC 27708-0129
1 Introduction
Input/output (I/O) communication between primary
and secondary memory is a major bottleneck in many
important computations, and it is especially prevalent
when parallel processors are used. In this paper we
consider the important application of external sorting,
in which the records to be sorted are too numerous to
fit in internal memory and instead reside in secondary
storage, typically made up of one or more magnetic
disks. Data are usually transferred in units of blocks,
which may consist of several kilobytes. This blocking
takes advantage of the fact that the seek time is usually
much longer than the time needed to transfer a record
of data once the disk read/write head is in place. An
increasingly popular way to get further speedup is to
use many disk drives working in parallel [GHK, GiS,
Jil, Mag, PGK, Uni].
Aggarwal and Vitter did initial work in the use of
parallel block transfer for sorting [AgV], generalizing
the sequential work of Floyd [Flo]. Let us consider the
parameters
N = # records in the file
M = # records that can fit in internal memory
P = # CPUs (internal processors)
B = # records per block
D = # blocks transferred per I/O

where M < N, 1 ≤ P ≤ M, and 1 ≤ DB ≤ M/2.
In the Aggarwal-Vitter model, there is only one CPU
(P = 1), and in each I/O, D blocks of B records can be
transferred simultaneously, as illustrated in Figure 1.
Their measure of performance is the number of parallel
I/Os required; they ignore internal computation time.
Aggarwal and Vitter proved that the average-case and
worst-case number of I/Os required for sorting is¹

$\Theta\!\left(\frac{N}{DB}\,\frac{\log(N/B)}{\log(M/B)}\right)$   (1)

¹We use the notation log x to denote the quantity max{1, log₂ x}. All logarithms in this paper are base 2.
Figure 1: A simple D-parallel two-level memory model.
Their lower bound is based solely on routing arguments,
except for the pathological case in which M and B are
extremely small, in which case the comparison model is
used. They gave two algorithms, a modified merge sort
and a distribution sort, that each achieved the optimal
I/O bounds.
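To get a feel for bound (1), it helps to plug in numbers. The following Python snippet is ours, not from the paper; the parameter values are hypothetical, and the result is correct only up to the constant hidden in the Θ:

    from math import log2

    def sort_io_bound(N, M, B, D):
        """Evaluate Theta((N/(D*B)) * log(N/B)/log(M/B)) up to its constant,
        using log x = max(1, log2 x) as in the paper's footnote."""
        log_ = lambda x: max(1.0, log2(x))
        return (N / (D * B)) * log_(N / B) / log_(M / B)

    # Hypothetical configuration: 10^9 records, 10^7 fit in memory,
    # blocks of 10^3 records, 10 independent disks.
    print(f"{sort_io_bound(N=10**9, M=10**7, B=10**3, D=10):,.0f}")  # about 150,000 parallel I/Os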
Vitter and Shriver [ViSa] considered the more real-
istic D-disk model, in which the secondary storage is
partitioned into D physically distinct disk drives, as in
Figure 2a. (Note that each head of a multi-head drive
can count as a distinct disk in this definition, as long
as each can operate independently of the other heads
on the drive.) In each I/O operation, each of the D
disks can simultaneously transfer one block of B records.
Thus, D blocks can be transferred per I/O, as in the
[AgV] model, but only if no two blocks access the same
disk. This assumption is reasonable in view of the way
real systems are constructed.
Vitter and Shriver presented a randomized version of
distribution sort in the D-disk model using two com-
plementary partitioning techniques. Their algorithm
meets the I/O lower bound (1) for the more lenient
model of [AgV], and thus it is optimal. The difficulty
in implementing distribution sort on a set of D paral-
lel disks is making sure that each bucket can be read
efficiently in parallel. The randomization was used to
distribute each of the buckets evenly over the D disks
so they could be read efficiently with parallel I/O. They
posed as an open problem whether there is an optimal
deterministic algorithm. An affirmative answer was pro-
vided by Nodine and Vitter using an algorithm based
on merge sort called Greed Sort [NoV]. Unfortunately,
the Greed Sort technique does not seem to yield optimal
sorting bounds on memory hierarchies.²
Disk striping is a commonly-used technique in which
the D disks are synchronized, so that the D blocks
accessed during an 1/0 are at the same relative position
on each disk. This technique effectively transforms the
disks into a single disk with larger block size B’ = DB.Merge sort combined with disk striping is deterministic,
but the number of I/Os used can be much larger than
optimal, by a multiplicative factor of log(M/B).

²An erroneous result in that regard was reported by another author in SPDP '92.
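To make the striping transformation concrete, here is a small sketch of our own (the function name is illustrative, not from the paper): with the disks synchronized, logical block i of size B' = DB occupies the same relative position on every disk, so one parallel I/O moves the whole logical block.

    def striped_block(i, D):
        """Disk striping: logical block i (of B' = D*B records) maps to
        D physical blocks, one at the same position on each disk."""
        return [(disk, i) for disk in range(D)]

    # Logical block 7 on D = 4 disks: position 7 of disks 0..3, fetched in one I/O.
    print(striped_block(7, 4))  # [(0, 7), (1, 7), (2, 7), (3, 7)]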
In this paper we describe Balance Sort, the first
known optimal and deterministic sorting algorithm
based on distribution sort. Balance Sort is optimal for
sorting on multiple disks and CPUs, both in terms of
the number of I/O steps and in terms of the amount
of internal processing work. We also use it for optimal
sorting on parallel memory hierarchies.
Section 2 describes the memory models considered in
this paper, and our main results are listed in Section 3.
In Section 4, we give an algorithm that is optimal for
all the parallel multi-level hierarchies. In Section 5, we
show how to alter the algorithm to deal with parallelism
of CPUS in the parallel disk model. Conclusions are
given in Section 6.
2 Memory Models
2.1 Parallel disk models
Conceptually, the simplest large-scale memory is the
two-level memory, known as the disk model. Figure 2a
shows the uniprocessor (P = 1) multiple disk model
with D > 1 disks. The more general model, in which
the internal processing is done on P interconnected
processors, is shown in Figure 2b for the special
case P = D. The interconnections we consider are
the hypercube and the Exclusive-Read/Exclusive-Write
(EREW) PRAM.
In a single I/O, each of the D disks can simultaneously
transfer a block of B records. Our main measure of
performance is the number of I/Os, but at the same
time we also consider the amount of internal processing
done. The difficulty in designing optimal algorithms is
dealing with the partitioning of secondary storage into
separate disks.
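As a sketch of this constraint (our own illustration, with hypothetical names), a single parallel I/O can be modeled as a set of (disk, block) requests in which no disk appears twice:

    def is_valid_parallel_io(requests, D):
        """One parallel I/O: at most D (disk, block) requests, no two on the same disk."""
        disks = [d for d, _ in requests]
        return len(requests) <= D and len(set(disks)) == len(disks)

    print(is_valid_parallel_io([(0, 5), (1, 9), (3, 2)], D=4))  # True
    print(is_valid_parallel_io([(0, 5), (0, 6)], D=4))          # False: disk 0 used twice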
2.2 Parallel multilevel hierarchies
The first multilevel hierarchy memory model that we
consider is the Hierarchical Memory Model (HMM) pro-
posed by Aggarwal et al. [AAC], depicted in Figure 3a.
In the HMM_{f(x)} model, access to memory location x
takes f(x) time. Figure 3a suggests the HMM_{log x}
model, where each layer in the hierarchy is twice as
large as the previous layer. Accesses to records in the
first layer take one time unit; in general, each record in
the nth layer takes n time units to access. Figure 3a
can actually be taken as representative of the so-called
"well-behaved" cost functions f(x), such as f(x) = x^α, α > 0.
An elaboration of HMM is the Block Transfer (BT)
model of Aggarwal et al. [ACSa], depicted schematically
in Figure 3b. Like HMM, it has a cost function f(x),
but additionally it simulates the effect of block transfer
by allowing the t + 1 locations x, x − 1, ..., x − t to be
accessed at cost f(x) + t. An alternative block-oriented
memory hierarchy is the Uniform Memory Hierarchy
(UMH) of Alpern et al. [ACF], depicted in Figure 3c.
Figure 2: (a) The parallel disk model. Each of the
D disks can simultaneously transfer B records to and
from internal memory in a single I/O. The internal
memory can store M ≥ DB records. (b) Multiprocessor
generalization of the I/O model in (a), in which each of
the P = D internal processors controls one disk and
has an internal memory of size M/P. The P processors
are connected by some topology such as a hypercube or
an EREW PRAM, and their memories collectively have
size M.
m’Figure 3: Multilevel hierarchy models. (a) The HMM
model. (b) The BT model. (c) The UMH model.
memory hierarchy is the Uniform Memory Hierarchy
(UMH) of Alpern et al. [ACF], depicted in Figure 3c.
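The cost rules of the HMM and BT models can be contrasted with a short sketch (ours; the function names are illustrative). In HMM_{f(x)}, every access to location x costs f(x); in BT, the t + 1 locations x, x − 1, ..., x − t cost f(x) + t in total, which is what makes blocked access pay off:

    from math import log2

    f_log = lambda x: max(1.0, log2(x))    # a "well-behaved" cost function

    def hmm_scan_cost(lo, hi, f):
        """HMM: touch locations lo..hi one at a time, paying f(x) per access."""
        return sum(f(x) for x in range(lo, hi + 1))

    def bt_block_cost(x, t, f):
        """BT: one block transfer of the t+1 locations x, x-1, ..., x-t costs f(x) + t."""
        return f(x) + t

    # Touching locations 991..1024 individually vs. as one block transfer:
    print(hmm_scan_cost(991, 1024, f_log))   # roughly 340
    print(bt_block_cost(1024, 33, f_log))    # 10 + 33 = 43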
As with two-level hierarchies, multilevel hierarchies
can be parallelized, as shown in Figure 4. The base
memory levels of the H hierarchies are attached to
H interconnected processors. We assume that the
hierarchies are all of the same kind. We denote the
parallel hierarchical models for HMM, BT, and UMH
as P-HMM, P-BT, and P-UMH.
3 Main Results
In this section, we present our main results. The
Balance Sort approach we describe in the next section
gives us optimal deterministic algorithms for all the
models we consider. In particular we get deterministic
(as well as more practical) versions of the optimal
randomized algorithms of [ViSa], [ViSb] and [ViN].
Figure 4: Parallel multilevel memory hierarchies. The
H hierarchies (of any of the types listed in Figure 3)
have their base levels connected by H interconnected
processors.
We also improve upon the deterministic Greed Sort
algorithm in [NoV], which is known to be optimal only
for the parallel disk models and not for hierarchical
memories. The lower bounds are proved in [AgV]
(Theorem 1) and [ViSb] (Theorems 2 and 3).
Theorem 1 The number of I/Os needed for sorting
N records in the parallel disk model is

$\Theta\!\left(\frac{N}{DB}\,\frac{\log(N/B)}{\log(M/B)}\right).$

The upper bound is given by a deterministic algorithm
based on Balance Sort, which also achieves simultaneously
optimal Θ((N/P) log N) internal processing time
with a PRAM interconnection, assuming for technical
reasons that either P ≤ M log min{M/B, log M}/log M
or log M = O(log(M/B)). When the interconnection is
a hypercube, the internal processing time is the number
of I/Os times the time to partition DB elements among
$\sqrt{M/B}$ sorted partition elements on a P-processor hypercube.
The lower bounds apply to both the average
case and the worst case. The I/O lower bound does
not require the use of the comparison model of computation,
except for the case when M and B are extremely
small with respect to N, namely, when B log(M/B) = o(log(N/B)).
The internal processing lower bound uses the comparison model.
Theorem 2 In the P-HMM model with an EREW
PRAM interconnection, the time for sorting is

$\Theta\!\left(\frac{N}{H}\,\log N\,\log\log\frac{N}{H}\right)$  if f(x) = log x;

$\Theta\!\left(\left(\frac{N}{H}\right)^{\alpha+1} + \frac{N}{H}\log N\right)$  if f(x) = x^α, α > 0.

On a hypercube interconnection, the P-HMM time for sorting is

$O\!\left(\frac{N}{H}\left(\log N\,\log\log\frac{N}{H} + \frac{\log N}{\log H}\,T(H)\right)\right)$  if f(x) = log x;

$\Theta\!\left(\left(\frac{N}{H}\right)^{\alpha+1} + \frac{N\log N}{H\log H}\,T(H)\right)$  if f(x) = x^α, α > 0,

where T(H) = O(log H (log log H)²) is the time needed
to sort H items on an H-processor hypercube. The
upper bounds are given by a deterministic algorithm
based on Balance Sort. The lower bounds for the PRAM
interconnection hold for any type of PRAM. The lower
bounds for the f(x) = x^α case require the comparison
model of computation.
The term involving T(H) in the hypercube expression
for f(x) = log x is possibly nonoptimal by an
O(min{(log N)/(log H), (log log H)²}) factor; however,
the algorithm is optimal for large and small values of N.
Theorem 3 In the P-BT model with an EREW
PRAM, the time for sorting is

$\Theta\!\left(\frac{N}{H}\log N\right)$  if f(x) = log x;

$\Theta\!\left(\frac{N}{H}\log N\right)$  if f(x) = x^α, 0 < α < 1;

$\Theta\!\left(\frac{N}{H}\left(\log^2\frac{N}{H} + \log N\right)\right)$  if f(x) = x^α, α = 1;

$\Theta\!\left(\left(\frac{N}{H}\right)^{\alpha} + \frac{N}{H}\log N\right)$  if f(x) = x^α, α > 1.

The corresponding bounds for a hypercube interconnection
of the P-BT memory hierarchies are

$O\!\left(\frac{N\log N}{H\log H}\,T(H)\right)$  if f(x) = log x;

$O\!\left(\frac{N\log N}{H\log H}\,T(H)\right)$  if f(x) = x^α, 0 < α < 1;

$\Theta\!\left(\frac{N}{H}\left(\log^2\frac{N}{H} + \frac{\log N}{\log H}\,T(H)\right)\right)$  if f(x) = x^α, α = 1;

$\Theta\!\left(\left(\frac{N}{H}\right)^{\alpha} + \frac{N\log N}{H\log H}\,T(H)\right)$  if f(x) = x^α, α > 1,

where T(H) = O(log H (log log H)²) is the time needed
to sort H items on an H-processor hypercube. The
upper bounds are given by a deterministic algorithm
based on Balance Sort. The lower bounds for the
PRAM interconnection hold for any type of PRAM.
The (N/H) log N terms and the terms involving T(H)
in the lower bounds require the comparison model of
computation.
The terms involving T(H) in the hypercube expressions
for f(x) = log x and f(x) = x^α, α ≤ 1, are possibly
nonoptimal by a factor of O(min{(log N)/(log H), (log log H)²}),
based on the comparison model of computation; these
terms are negligible unless N is superpolynomial in H
and H grows without bound.

Algorithm 1 [Sort(N, T)]
if N ≤ 3H
    n := ⌈N/H⌉
    for m := 1 to n
        (1) Read H locations to the base level (last read may be partial)
            Sort internally
        (2) Write back out again
    (3) Do binary merge sort of the ≤ 3 sorted lists
else
    (4) Initialize S
    (5) E := ComputePartitionElements(S)
    (6) Balance(T)
    for b := 1 to S
        (7) T := Read bth row of L {set in Balance}
            N_b := number of elements in bucket b
        (8) Sort(N_b, T)
        (9) Append sorted bucket to output area
Our techniques can also be used to transform the ran-
domized P-UMH algorithms of [ViN] into determinis-
tic ones with our PRAM interconnection. In this pa-
per, however, we concentrate on the P-HMM and P-BT
models.
4 Parallel Memory Hierarchies
This section describes the deterministic algorithm that
serves as the basis for the upper bounds in Theorems 2
and 3. Subsection 4.1 gives the overall sorting algo-
rithm. Subsection 4.2 describes the deterministic sub-
routine for matching that we use. In Subsection 4.3, we
analyze the algorithm for the P-HMM model. Subsec-
tion 4.4 covers the P-BT model.
4.1 The sorting algorithm
For simplicity we assume that the N keys are distinct;
this assumption is easily realizable by appending to
each key the record’s initial location. Algorithm 1
is the top-level description of our sorting algorithm
for parallel memory hierarchies. The algorithm is a
version of distribution sort (sometimes called bucket
sort). It sorts a set of elements by choosing S − 1
partitioning elements of approximately evenly spaced
rank in the set and using them to partition the data into
S disjoint ranges, or buckets. The individual buckets
are then sorted recursively and concatenated to form a
completely sorted list.
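The control flow of this distribution sort can be seen in a runnable miniature (our in-memory sketch, not the paper's algorithm: partition-element selection and the Balance step are collapsed into naive stand-ins, and we assume distinct keys and base_size ≥ S so the recursion always shrinks):

    import bisect

    def distribution_sort(records, S, base_size):
        """Distribution sort skeleton: pick S-1 partition elements,
        split into S buckets, recurse, concatenate.
        Assumes distinct keys and base_size >= S."""
        if len(records) <= base_size:
            return sorted(records)                 # base case
        sample = sorted(records[::S])              # naive stand-in for Algorithm 2
        step = max(1, len(sample) // (S - 1))
        elems = sample[step::step][:S - 1]         # approximate partition elements
        buckets = [[] for _ in range(S)]
        for r in records:                          # Balance would spread these evenly
            buckets[bisect.bisect_right(elems, r)].append(r)
        return [x for b in buckets for x in distribution_sort(b, S, base_size)]

    print(distribution_sort([5, 3, 9, 1, 7, 2, 8, 6, 4, 0], S=3, base_size=3))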
Algorithm 2 [ComputePartitionElements(S)]
Partition into G groups of ⌈N/G⌉ elements, G_1, ..., G_G
for g := 1 to G
    (1) Sort G_g recursively
    (2) Set aside every ⌊log N⌋th element into C
(3) Sort C using binary merge sort with hierarchy striping
(4) e_j := the ⌊jN/((S − 1) log N)⌋th smallest element of C
The difficult part of the algorithm is the load balancing
done by the routine Balance, which makes sure
that the buckets are (approximately) evenly distributed
among the hierarchies during partitioning. In order for
the balancing to occur in optimal deterministic time,
it is necessary to do partial striping of the hierarchies,
so that we have only H' virtual hierarchies with
logical (or virtual) blocks of size B = H/H'. We use H' = H^{1/3}.
A number of parameters in each level of the algorithm
merit explanation:

T = array of H' elements pointing to the starting block on each virtual hierarchy
S = # buckets = (# of partition elements) + 1
E = array of S − 1 partition elements
X = S × H' histogram matrix (described later)
A = S × H' auxiliary matrix (described later)
L = S × H' location matrix (described later)
The correctness of Algorithm 1 is easy to establish,
since the bottom level of recursion by definition pro-
duces a sorted list and each level thereafter concate-
nates sorted lists in the right order. To get optimal
performance, we determine the number S of buckets
differently depending upon which hierarchical model we
are using.
Algorithm 2 gives the routine for computing the S − 1
partition elements, based on [AAC, ViSb]. It works
by recursively sorting sets of size N/G and choosing
every ⌊log N⌋th element. The approximate partition
elements are selected from this subset. The specific
value of G used in the algorithm depends upon
which hierarchical memory model is being used. We
can show that if we choose every ⌊log N⌋th element and
we choose G such that G log N ≤ N/S, then we get
0 ≤ N_b < 2N/S for any bucket b, where N_b is the size
of bucket b.
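A compact rendition of this selection rule (our sketch; the groups are sorted with Python's built-in sort rather than recursively, and the parameter names simply mirror the text):

    def compute_partition_elements(records, S, G, log_n):
        """Algorithm 2 in miniature: sort G groups, keep every log_n-th
        element in C, sort C, then take S-1 evenly spaced elements of C."""
        groups = [sorted(records[g::G]) for g in range(G)]
        C = sorted(x for grp in groups for x in grp[log_n - 1::log_n])
        step = max(1, len(records) // ((S - 1) * log_n))
        return [C[min(j * step, len(C) - 1)] for j in range(1, S)]

    import random
    data = random.sample(range(10_000), 1_000)
    print(compute_partition_elements(data, S=4, G=10, log_n=10))
    # The S = 4 buckets these induce each have size below 2N/S = 500,
    # in line with the guarantee above.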
Following the call to ComputePartitionElements, the
original set of N records is left as G sorted subsets of
approximately N/G records each. This fact is crucially
Algorithm 3 [Balance(T)]
v := H'
while there are unprocessed elements
    (1) Read next v virtual blocks
    (2) Partition the records into buckets (in parallel)
        Collect buckets into virtual blocks of size H/H'
        (all elements of a block from the same bucket)
    (3) Update the histogram matrix X based on the placements of the virtual blocks
    (4) A := ComputeAux(X)
    (5) H := {virtual hierarchies with no 2s in A}
    (6) Write out the virtual blocks corresponding to H
        { Rebalance reassigns and writes some virtual blocks }
    (7) v := |H| + Rebalance(A, X)
    (8) Update the histogram matrix X to compensate for unprocessed blocks
        Update the internal pointers of the virtual blocks and the location matrix L
    (9) Collect unprocessed virtual blocks to allow room for the next v virtual blocks in the next iteration
important in allowing for partial hierarchy striping, as
described in the next subsections.
Algorithm 3 gives the Balance routine for balancing
the buckets among the virtual hierarchies. Balance
works as follows, successively track by track: A
parallel read is done from the current subset of the G
sorted subsets in order to get a full track of records.
(Some records may have been left over from the previous
iteration.) These records are partitioned into buckets
by merging them with the partition elements, and the
contents of the buckets from the track are formed
into virtual blocks. Each virtual block resides on
some virtual hierarchy. The virtual blocks that do
not overly unbalance their respective buckets, meaning
that they do not introduce a 2 into the auxiliary
matrix (described below), are written to higher levels
of their respective virtual hierarchies. Those virtual
blocks that do unbalance their buckets are sent to the
Rebalance subroutine. As the records are partitioned
and distributed, the histogram matrix X = {x_bh}
records how the buckets are distributed among the
hierarchies; in particular, x_bh is the number of virtual
blocks of bucket b on virtual hierarchy h. Updating
the histogram matrix on line (3) simply means that
if virtual hierarchy h is assigned a virtual block from
bucket b, then we increment x_bh by 1. The location
matrix L = {l_bh} tells what location was written last on
each virtual hierarchy for each bucket.
The auxiliary matrix A = {a_bh} determines if the
placement becomes too badly skewed. Specifically, if m_b
is the median number of virtual blocks that bucket b has
on all the virtual hierarchies (i.e., m_b is the median of
x_b1, ..., x_bH'),³ we define in the ComputeAux routine
(Algorithm 4)

a_bh := max{0, x_bh − m_b}.
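Since ComputeAux is only a few operations, it can be stated directly (a plain Python sketch of ours, using the paper's median convention; the tiny histogram is a made-up example):

    def compute_aux(X, H_prime):
        """a_bh = max(0, x_bh - m_b), with m_b the ceil(H'/2)-th smallest
        entry of row b of the histogram matrix X."""
        A = []
        for row in X:
            m_b = sorted(row)[(H_prime + 1) // 2 - 1]   # ceil(H'/2)-th smallest
            A.append([max(0, x - m_b) for x in row])
        return A

    X = [[3, 1, 2, 2],      # histogram: 2 buckets spread over H' = 4 hierarchies
         [0, 3, 1, 1]]
    print(compute_aux(X, 4))  # [[1, 0, 0, 0], [0, 2, 0, 0]]
    # Each row has at least ceil(H'/2) = 2 zeros (Invariant 1); the 2 in
    # row 2 marks the second hierarchy as overloaded for the second bucket.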
This definition forces the important invariant:
Invariant 1 At least ⌈H'/2⌉ entries of every row of the
auxiliary matrix A are 0s.
The Balance routine (and its subroutine Rebalance)
maintains good balance by guaranteeing that there are
at most li?’/2] 2s in each row of the auxiliary matrix A,
and that the remaining entries must be 0s and 1s.
Any 2s that remain correspond to “unprocessed” virtual
blocks and are conceptually written back (without the
need for an actual write operation) to the input in
line (7) of Algorithm 3. The net result is that the
auxiliary matrix A effectively contains only 0s and 1s:
Invariant 2 After each track is processed conceptually,
the auxiliay matriz A is binay; that is, all of its
entn”es are either O or 1. Hence ~bh ~ mb + 1 for all
1 ~ h ~ H’, where mb is the median entry on row b
of A.
Invariant 2, coupled with the definition of median,
proves that the buckets are balanced:
Theorem 4 Any bucket b will take no more than a
factor of about 2 above the optimal number of tracks
to read.
Recently, an alternative definition of auxiliary matrix
was proposed that has a similar effect of making each
bucket balanced within a factor of 2; the term a_bh is
defined to be 1 when the number of blocks per bucket
is more than twice the desired evenly-balanced number
[Arg].
After the rebalancing, Step (9) of Algorithm 3 routes
any unprocessed virtual blocks into a contiguous region
that does not overlap with any of the next v virtual
blocks to be read. This operation takes time
O(log H) by monotone routing [Lei, Section 3.4.3].
Algorithm 5 gives the Rebalance subroutine. At most
H' 2s can be introduced into the auxiliary matrix A by
the virtual blocks being processed. This fact follows
since only H' values in the histogram matrix X are
incremented, one for each virtual hierarchy, and only
values of the histogram matrix that are incremented can
become 2 in the auxiliary matrix. We call the subroutine
Rearrange to remove introduced 2s at least ⌈H'/4⌉ at
a time until we have at most ⌊H'/2⌋ 2s left. The loop
will thus execute at most twice.
³We use the convention that the median of D elements is always the ⌈D/2⌉th smallest element, rather than the convention in statistics that it is the average of the two middle elements if D is even.
Algorithm 4 [ComputeAux(X)]
for b := 1 to S
    m_b := median (⌈H'/2⌉th smallest) element of x_b1, ..., x_bH'
    for h := 1 to H' in parallel
        a_bh := max{0, x_bh − m_b}
Algorithm 5 [Rebalance(A, X) returns v]
v := 0
(1) while there are at least ⌊H'/2⌋ virtual hierarchies with 2s in A
        U := {virtual hierarchies with the next ⌊H'/2⌋ 2s}
    (2) v := v + Rearrange(U)
return(v)
Algorithm 6 shows the subroutine Rearrange, which
is able to remove up to ⌈H'/4⌉ 2s from the auxiliary
matrix A in a single parallel memory reference. The
Rearrange subroutine is based on the simple observation
that if we read a virtual block of bucket b from a
virtual hierarchy h for which a_bh = 2 and we write it
to another virtual hierarchy h' for which a_bh' = 0, then
we have removed the 2. By "removing a 2" from the
auxiliary matrix A, we mean that if the auxiliary matrix
is immediately recomputed after the operation, an entry
that was a 2 will become at most 1, and no 2s will be
introduced. We set up a matching so that we could
accommodate removing all of the (at most) ⌊H'/2⌋ 2s
simultaneously in a single parallel memory reference,
although we only guarantee that we remove ⌈H'/4⌉ of
them. Hence it follows that if the auxiliary matrix A
is computed immediately after the call to Rebalance,
there will be at most ⌊H'/2⌋ 2s in it. As mentioned
before, the virtual blocks corresponding to those 2s are
considered conceptually to be part of the next track,
and thus all 2s are effectively removed from A.
Line (5) of Algorithm 6 is done with a parallel
memory reference. Line (4) is accomplished by sorting
according to destination address and doing monotone
routing [Lei, Section 3.4.3].
In Section 4.2, we show that Fast-Partial-Match in
line (2) of Algorithm 6 always matches at least ⌈H'/4⌉
of the 2s. Rearrange is able to process the same number
of virtual blocks in a single parallel memory reference as
Fast-Partial-Match is able to match. To achieve optimal
time for memory hierarchies with a logarithmic access
cost, we need to do the matching in time O(log H)
on an EREW PRAM. It is for this reason that we
use the special routine Fast-Partial-Match, since even
Algorithm 6 [Rearrange(U) returns v]
{ Create a bipartite matching problem }
{ U = {U_1, ..., U_|U|} is the set of virtual hierarchies with a 2 in A for some bucket }
V := {1, ..., H'}
E := ∅
for i := 1 to |U|
    h := U_i
    b[h] := (unique) bucket such that a_{b[h],h} = 2
    for j := 1 to |V| in parallel
        h' := j
        (1) if a_{b[h],h'} = 0
                E := E ∪ {(U_i, j)}
{ We swap a pair of virtual blocks for every edge in the following match }
(2) v := Fast-Partial-Match(U, V, E)
{ The array R will tell us what buckets to read from each hierarchy }
{ The array W will tell us on what virtual hierarchy to write the virtual block }
for h := 1 to H' in parallel
    r[h] := 0
    w[h] := 0
for each match (U_i, j) in parallel
    h := U_i
    r[h] := b[h]
    w[h] := j
(3) Update X to reflect the swap
for h := 1 to H' in parallel
    if r[h] ≠ 0
        (4) Route the reassigned virtual block of bucket r[h] from virtual hierarchy h to w[h]
        (5) Write out the virtual block onto virtual hierarchy w[h]
the fastest known deterministic parallel algorithm for
maximal matching (the simplest alternative) with n
items is O(log² n) with a quartic number of processors
for dense graphs (which we have) [Luba]. Since n =
H', this means that the fastest known algorithm is
Θ(log² H). Unfortunately, this algorithm is not fast
enough.
4.2 How to do fast deterministic matching
In this subsection, we discuss how to do fast deterministic
partial matching as part of the rebalancing
technique. This algorithm uses time O(T(H)), which
is logarithmic on a PRAM and O(log H (log log H)²) on
a hypercube. (By partial striping, we can reduce the
matching time for the hypercube to O(log H), but this
doesn't affect the overall running time by more than
a constant factor.) In the matching problem, we have
two vertex sets U and V. There are k = ⌊H'/2⌋ vertices
in U, each of which has edges to at least ⌈H'/2⌉ of the
H' vertices in V, by Invariant 1. Each edge represents
a possible swap between a 2 and a 0 on a row of the
auxiliary matrix A, which if put in the matching will
remove the 2.
Algorithm 7 [Fast-Partial-Match(U, V, E) returns v]
for each vertex u ∈ U
    (1) while u has not picked an edge-adjacent vertex in V
            u picks a random vertex in V (= {1, ..., H'})
(2) if a vertex in V is picked by more than one vertex in U, the smallest-numbered vertex in U wins
Add the picked pairs to the matching
return(the number of matched pairs)
We start by giving a randomized version of
Fast-Partial-Match, shown in Algorithm 7, for doing
partial bipartite matching. To prove that this matching
can be done in O(T(H)) time with H' processors,
we can show that Loop (1) of Algorithm 7 will be executed
only a constant number of times, on the average,
and that Step (2) takes time O(T(H)). Notice that
we can implement Loop (1) by assigning one processor
to each vertex in U and V, using only O(H') processors.
Step (1) of Algorithm 7 therefore takes constant
time, on the average, with O(H') processors. We can
show that Step (2), a concurrent write operation, can be
done in O(T(H)) time. We have the first ⌊H'/2⌋ processors
trying to send messages to some subset of the first
H' processors. We sort the messages according to their
destination. This sorting can be done in time O(T(H'))
using H' processors. Once we have the messages sorted,
we can do a segmented prefix operation for each unique
key to compute how many destinations were selected,
eliminate all but the first message in each segment, and
finally route the messages to their destinations using
monotone routing. The total expected time for the partial
matching is thus O(T(H)) for any interconnection.
Finally, we can show the following lemma.
Finally, we can show the following lemma.
Lemma 1 The expected number of vertices matched in
Algorithm 7 is at least H’/4.
This algorithm can be derandomized in an efficient
way using the techniques of Luby [Luba, Lubb]. First,
notice that we have H = (H')³ processors available, so
we can run up to (H')² copies of the Fast-Partial-Match
algorithm simultaneously. The above randomized
matching algorithm uses only pairwise independence
in the analysis of the running time, so we construct
a special probability space to take advantage of the
pairwise independence. We make sure that the running
time of the algorithm will not be changed in the new
probability space. Analysis of the running times shows
that there must be some point in the probability space
that matches at least ⌈H'/4⌉ vertices in O(log H') steps,
and that point can be found exhaustively in parallel.
The matrix of random variables from which we are
sampling occupies only O(H) space, so it can all
fit at the base memory level of the hierarchies, and
we therefore need not consider the cost of accessing the
random variables.
To summarize, we have shown the following theorem.
Theorem 5 Fast-Partial-Match matches at least
⌈H'/4⌉ vertices deterministically in O(T(H)) time using
H processors.
4.3 Analysis of P-HMM
In the P-HMM model, the choices of G and S depend on
whether N > H² or N ≤ H²: for N > H² we take
G = N/(2H^{1/3}), and in each case S is the minimum of two
quantities chosen so that G log N ≤ N/S.
It is relatively straightforward to show that with these
values of G and S, the virtual blocking can be done and
the buckets are approximately the same size.
After a lot of mathematical manipulation and analysis,
the overall recurrence becomes

$T(N) = G\,T\!\left(\frac{N}{G}\right) + \sum_{b} T(N_b) + O\!\left(\frac{N}{H}\left(f\!\left(\frac{N}{H}\right) + T(H)\right)\right)$

for the case N > 3H, where T(H) is the time needed to
sort H items on H processors. When N ≤ 3H, we have
T(N) = O(T(H)).
Lemma 2 When f(x) = log x, the algorithm given
sorts in deterministic time

$O\!\left(\frac{N}{H}\,\log N\,\log\log\frac{N}{H} + \frac{N\log N}{H\log H}\,T(H)\right).$

On a hypercube, the best known value of T(H) is
O(log H log log H) if precomputation is allowed and
O(log H (log log H)²) with no precomputation. On a
PRAM, we have T(H) = O(log H).
Lemma 3 When f(x) = x^α, the algorithm given sorts
in deterministic time

$\Theta\!\left(\left(\frac{N}{H}\right)^{\alpha+1} + \frac{N\log N}{H\log H}\,T(H)\right).$
In fact, we can go even further for the P-HMM
model and show that the algorithm is uniformly optimal
for any "well-behaved" cost function f(x) on any
interconnection that has T(H) = O(log H). The
proof of this fact is essentially that of the corresponding
theorem in [ViSb] for their randomized algorithm.
4.4 Analysis of P-BT
Almost the same algorithm will work for the P-BT
model as we used in the P-HMM model. The only
difference is that in Algorithm 1, we need to add another
step right after Step (6) to reposition all the buckets into
consecutive locations on each virtual memory hierarchy.
This repositioning is done on a virtual-hierarchy-by-
virtual-hierarchy basis, using the generalized matrix
transposition algorithm given in [ACSa].
We concentrate in this section on the cost function
f(x) = x^α, where 0 < α < 1. We choose G and S much
as in the P-HMM model, again according to whether
N > H² or N ≤ H². As with P-HMM, the virtual
blocking can be done and the buckets are approximately
the same size.
We need to make one more change to the algorithm
for BT hierarchies, but one that is hard to write
explicitly. Aggarwal et al. gave an algorithm called
the "touch" algorithm [ACSa]. This algorithm takes
an array of n consecutive records stored at the lowest
possible level and passes them through the base memory
level in order, using time O(n log log n) for 0 < α < 1.
As it turns out, all the data structures are processed
in order throughout the algorithm, owing to the sorted
runs of size N/G created by finding the partitioning
elements. The effect of this change is that we get the
same recurrence as for the P-HMM model, using an
effective cost function f(x) = O(log log x).
The limiting step in the algorithm for P-BT is the need
for repositioning the buckets, which can be done using
the cited algorithm in time O((N/H)(log log(N/H))⁴).
So the overall recurrence is

$T(N) = G\,T\!\left(\frac{N}{G}\right) + \sum_b T(N_b) + O\!\left(\frac{N}{H}\left(\left(\log\log\frac{N}{H}\right)^{4} + T(H)\right) + \frac{N}{H}\log\log\frac{N}{H}\right)$   (2)

for the case N > 3H, where T(H) is the time needed to
sort H items on H processors. The case N ≤ 3H is the
same as for P-HMM.
Lemma 4 The given algorithm sorts in the P-BT
model, with f(x) = x^α, for 0 < α < 1, in time
T(N) = Θ((N/H) log N).
5 Parallel Disks with Parallel
Processors
In this section, we describe a version of Balance
Sort for the parallel disk model. The algorithm is
optimal in terms of the number of parallel disk I/Os
and also in terms of the internal processing time,
assuming that the P processors are interconnected as
any type of PRAM. If log(M/B) = o(log M), we require
a concurrent-read/concurrent-write (CRCW) PRAM
interconnection.
The algorithm for the parallel disk model is similar
to that used for the P-HMM model, with the following
changes: In Algorithm 1, we use N ≤ M rather than
N ≤ 3H as the termination condition for the recursion.
The Balance algorithm is modified so that it reads
memoryloads at a time, though it still processes virtual
blocks the same way. A "memoryload" is the collection
of O(M) records that fit into memory. Note that the
parameter we call D (the number of disks) is similar
to the parameter we called H for parallel memory
hierarchies; however, the number P of CPUs may be
different from the number D of disks. We likewise
use partial striping. We also use a different method
for computing the partitioning elements, described in
[ViSa]. Finally, we let S = (M/B)^{1/4}.
The I/O bound is easy to show for the new algorithm.
The A, X, L, and E arrays all reside in the internal
memory, so there is no I/O cost to access them. For
the number of I/Os, we get the recurrence

$T(N) = S\,T\!\left(\frac{N}{S}\right) + O\!\left(\frac{N}{DB}\right)$ if N > M, and $T(N) = O\!\left(\frac{N}{DB}\right)$ if N ≤ M,

which has solution

$T(N) = O\!\left(\frac{N}{DB}\log_S\frac{N}{B}\right) = O\!\left(\frac{N}{DB}\,\frac{\log(N/B)}{\log(M/B)}\right).$

This is the same bound as was shown to be optimal for
the parallel disk model [AgV].
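The recurrence and its closed form can be checked numerically with a few lines (our sketch; the hidden constants are arbitrarily set to 1, so only the order of magnitude is meaningful):

    from math import log2

    def io_recurrence(N, M, B, D):
        """T(N) = S*T(N/S) + N/(D*B) if N > M, else N/(D*B), with S = (M/B)**0.25."""
        S = (M / B) ** 0.25
        if N <= M:
            return N / (D * B)
        return S * io_recurrence(N / S, M, B, D) + N / (D * B)

    N, M, B, D = 10**9, 10**7, 10**3, 10
    closed_form = (N / (D * B)) * log2(N / B) / log2(M / B)
    print(io_recurrence(N, M, B, D), closed_form)   # roughly 300000 vs. 150000: same order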
The tricky part is showing that the internal processors
can be used efficiently for any number of PRAM
processors P ≤ M log min{M/B, log M}/log M (or up
to P = M if log M = O(log(M/B))). To bound the
amount of time spent processing each memoryload, we
use a variety of techniques, including an algorithm
of Rajasekaran and Reif [RaR] as part of a radix
sort, Cole's EREW PRAM parallel merge sort [Col],
incremental updating, and even/odd partitioning.
6 Conclusions
In this paper, we have described the first known
deterministic algorithm for sorting optimally using
parallel hierarchical memories. This algorithm improves
upon the randomized algorithms of Vitter and Shriver
[ViSa, ViSb] and the deterministic disk algorithm of
Nodine and Vitter [NoV]. The algorithm applies to P-
HMM, P-BT, and the parallel variants of the UMH
models. In the parallel disk model with parallel CPUs,
our algorithm is optimal simultaneously in terms of both
the number of I/Os and the internal processing time.
The algorithms can operate without need of non-striped
write operations, a useful feature for error checking and
correcting protocols.
A promising approach to balancing that the authors
first considered is to do a greedy balance via min-cost
matching on the placement matrix. We conjecture that
such an approach results in globally balanced buckets
with perhaps an even faster implementation.
It is conceivable that the Sharesort algorithm of
Cypher and Plaxton [CyP] may be applicable to
parallel disks and parallel memory hierarchies to get
an algorithm with performance similar to ours in the
big-oh sense. Alternatively, our balancing results may
be applicable to their parallel sorting model.
Our algorithms are both theoretically efficient and very practical in terms of constant factors, and we ex-
pect our balance technique to be quite useful as large-
scale parallel memories are built, not only for sorting
but also for other load-balancing applications on paral-
lel disks and parallel memory hierarchies. Although we
have presented a deterministic algorithm, the random-
ized algorithm resulting from the randomized matching
is even simpler to implement in practice.
7 References
[AAC] Alok Aggarwal, Bowen Alpern, Ashok K. Chandra, and Marc Snir, "A Model for Hierarchical Memory," Proceedings of the 19th Annual ACM Symposium on Theory of Computing (May 1987), 305-314.

[ACSa] Alok Aggarwal, Ashok K. Chandra, and Marc Snir, "Hierarchical Memory with Block Transfer," Proceedings of the 28th Annual IEEE Symposium on Foundations of Computer Science (October 1987), 204-216.

[AgV] Alok Aggarwal and Jeffrey Scott Vitter, "The Input/Output Complexity of Sorting and Related Problems," Communications of the ACM 31 (September 1988), 1116-1127.

[ACF] Bowen Alpern, Larry Carter, and Ephraim Feig, "Uniform Memory Hierarchies," Proceedings of the 31st Annual IEEE Symposium on Foundations of Computer Science (October 1990), 600-608.
[Arg] Lars Arge, January 1993, private communica-
tion.
[BFP] Manuel Blum, Robert W. Floyd, Vaughan
Pratt, Ronald L. Rivest, and Robert E. Tarjan,
“Time Bounds for Selection,” J. Computer and
System Sciences 7 (1973), 448-461.
[Col] Richard Cole, "Parallel Merge Sort," SIAM J. Computing 17 (August 1988), 770-785.
[CyP] Robert Cypher and C. Greg Plaxton, "Deterministic Sorting in Nearly Logarithmic Time on the Hypercube and Related Computers," Journal of Computer and System Sciences (to appear), also appears in Proceedings of the 22nd Annual ACM Symposium on Theory of Computing (May 1990), 193-203.
[Flo] Robert W. Floyd, "Permuting Information in Idealized Two-Level Storage," in Complexity of Computer Computations, R. Miller and J. Thatcher, ed., Plenum, 1972, 105-109.
[GHK] Garth Gibson, Lisa Hellerstein, Richard M. Karp, Randy H. Katz, and David A. Patterson, "Coding Techniques for Handling Failures in Large Disk Arrays," U.C. Berkeley, UCB/CSD 88/477, December 1988.
[GiS] David Gifford and Alfred Spector, “The TWA
Reservation System,” Communications of the
ACM 27 (July 1984), 650-665.
[GoS] Mark Goldberg and Thomas Spencer, “Con-
structing a Maximal Independent Set in Paral-
lel,” SIAM J. Discrete Math 2, 322–328.
[HoK] Jia-Wei Hong and H. T. Kung, "I/O Complexity: The Red-Blue Pebble Game," Proc. of the 13th Annual ACM Symposium on the Theory of Computing (May 1981), 326-333.
[IsS] Amos Israeli and Y. Shiloach, "An Improved
Parallel Algorithm for Maximal Matching,”
Information Processing Letters 22 (1986), 57-
60.
[Jil] W. Jilke, "Disk Array Mass Storage Systems: The New Opportunity," Amperif Corporation, September 1986.
[Lei] F. Thomson Leighton, Introduction to Parallel Algorithms and Architectures, Morgan Kaufmann Publishers, San Mateo, CA, 1992.
[Luba] Michael Luby, "A Simple Parallel Algorithm for the Maximal Independent Set Problem," SIAM J. Computing 15 (1986), 1036-1053.

[Lubb] Michael G. Luby, "Removing Randomness in a Parallel Computation Without a Processor Penalty," International Computer Science Institute, TR-89-044, July 1989, also appears in Proceedings of the 29th Annual IEEE Symposium on the Foundations of Computer Science (October 1988), 162-173.
[Mag] Ninamary Buba Maginnis, "Store More, Spend Less: Mid-Range Options Abound," Computerworld (November 16, 1987), 71-82.
[NoV] Mark H. Nodine and Jeffrey Scott Vitter, "Greed Sort: An Optimal External Sorting Algorithm for Multiple Disks," Brown University, CS-91-20, August 1991, also appears in shortened form in "Large-Scale Sorting in Parallel Memories," Proc. 3rd Annual ACM Symposium on Parallel Algorithms and Architectures, Hilton Head, SC (July 1991), 29-39.
[PGK] David A. Patterson, Garth Gibson, and Randy
H. Katz, “A Case for Redundant Arrays of
Inexpensive Disks (RAID),” Proceedings ACM
SIGMOD Conference (June 1988), 109-116.
[RaR] Sanguthevar Rajasekaran and John H. Reif, "Optimal and Sublogarithmic Time Randomized Parallel Sorting Algorithms," SIAM J. Computing 18 (1989), 594-607.
[Uni] University of California at Berkeley, “Massive
Information Storage, Management, and Use
(NSF Institutional Infrastructure Proposal),”
Technical Report No. UCB/CSD 89/493, Jan-
uary 1989.
[ViN] Jeffrey Scott Vitter and Mark H. Nodine, "Large-Scale Sorting in Uniform Memory Hierarchies," Journal of Parallel and Distributed Computing (January 1993), also appears in shortened form in "Large-Scale Sorting in Parallel Memories," Proc. 3rd Annual ACM Symposium on Parallel Algorithms and Architectures, Hilton Head, SC (July 1991), 29-39.
[ViSa] Jeffrey Scott Vitter and Elizabeth A. M. Shriver, "Algorithms for Parallel Memory I: Two-Level Memories," Algorithmica, to appear.

[ViSb] Jeffrey Scott Vitter and Elizabeth A. M. Shriver, "Algorithms for Parallel Memory II: Hierarchical Multilevel Memories," Algorithmica, to appear.