
A Scalable Framework for Heterogeneous GPU-Based Clusters ∗

(Regular Submission)

Fengguang Song
Innovative Computing Laboratory
University of Tennessee
Knoxville, TN 37996-3450
[email protected]

Jack Dongarra
University of Tennessee
Oak Ridge National Laboratory
University of Manchester
[email protected]

Abstract

GPU-based heterogeneous clusters continue to draw attention from vendors and HPC users due to their high energy efficiency and much improved single-node computational performance; however, there is little parallel software that can efficiently utilize all the CPU cores and all the GPUs of such heterogeneous systems. On a heterogeneous cluster, the performance of a GPU (or a compute node) increases at a much faster rate than the performance of the PCI-Express connection (or the interconnection network), so that communication eventually becomes the bottleneck of the entire system. To overcome this bottleneck, we developed a multi-level partitioning and distribution method that guarantees a near-optimal communication volume. We have also extended heterogeneous tile algorithms to work on distributed-memory GPU clusters. Our main idea is to execute a serial program that generates hybrid-size tasks, and to follow a dataflow programming model to fire the tasks on different compute nodes. We devised a distributed dynamic scheduling runtime system to schedule tasks and transfer data between hybrid CPU-GPU compute nodes transparently. The runtime system employs a novel distributed task-assignment protocol to solve data dependencies between tasks without coordination between processing units. The runtime system on each node consists of a number of CPU compute threads, a number of GPU compute threads, a task generation thread, an MPI communication thread, and a CUDA communication thread. By overlapping computation and communication through dynamic scheduling, we are able to attain a high performance of 75 TFlops for Cholesky factorization on the heterogeneous Keeneland system [23] using 100 nodes, each with twelve CPU cores and three GPUs. Moreover, our framework can also deliver high performance on distributed-memory clusters without GPUs, and on shared-system multiGPUs.

keywords: Dense linear algebra, distributed runtime systems, hybrid CPU-GPU, heterogeneous clusters.

∗This material is based upon work supported by the NSF grants CCF-0811642, OCI-0910735, by the DOE grant DE-FC02-06ER25761, by Nvidia, and by Microsoft Research.


1 Introduction

Based on the November 2011 Green500 list [20], twenty-three out of the top thirty greenest supercomputers are GPU-based. However, there is little software that can take advantage of such large-scale heterogeneous systems efficiently, especially software that utilizes all CPU cores and all GPUs. Since many operations of scientific computing applications are carried out through numerical linear algebra libraries, we focus on providing fundamental linear algebra operations on the new heterogeneous architectures.

A great amount of effort has gone into the implementation of linear algebra libraries. LAPACK [5], Intel MKL, AMD ACML, and PLASMA [4] are mainly designed for shared-memory multicore machines. ScaLAPACK [10] is intended for distributed-memory CPU-based machines. CUBLAS [18], MAGMA [22], and CULA [15] provide a subset of the LAPACK subroutines but work on a single GPU. So far these libraries do not support computations using multiple CPU cores and multiple GPUs on a single node, not to mention distributed GPU-based clusters. Moreover, with an increasing number of cores on the host, whose performance continues to keep up with GPU speed, new parallel software should not ignore either GPUs or CPUs.

Our work aims to provide a unified framework to solve linear algebra problems on any number of CPU cores, any number of GPUs, and for either shared-memory or distributed-memory systems. Our solution consists of three essential components: (1) a static multi-level data distribution method, (2) heterogeneous tile algorithms, and (3) a distributed dynamic scheduling runtime system. The solution works as follows. Given a matrix input, we first split it into tiles of hybrid sizes. Then we distribute the tiles to host main memories and GPU device memories on a cluster with a static method. Each compute node runs a runtime system (launched as an MPI process) that schedules tasks within the node dynamically. Different nodes communicate with each other by means of MPI messages. Our runtime system follows the data-flow programming model and builds a partial directed acyclic graph (DAG) dynamically, where a completed task will trigger a set of new tasks in the DAG.

We use a static multi-level distribution method to allocate data to the different hosts and GPUs. Each compute node is heterogeneous since it has both CPUs and GPUs, but different nodes have the same performance. Therefore, we design a multi-level distribution method. On the top (i.e., inter-node) level, we use a 2-D block cyclic distribution method. On the second (i.e., intra-node between different GPUs) level, we allocate a node's local blocks only to the GPUs with a 1-D or 2-D block cyclic method. On the third (i.e., intra-node between CPUs and GPUs) level, we cut a slice from each GPU's local block and put it on the host. The output of the multi-level distribution method is that each matrix block is uniquely assigned to the host or to a GPU on a specific node.

We also use heterogeneous tile algorithms to handle the difference between CPUs and GPUs [21]. In these algorithms, there are at any time a great number of small tasks for CPU cores and a great number of large tasks for GPUs to compute concurrently. While a heterogeneous tile algorithm is based on tiles, it has two types of tiles: small ones for CPU cores, and large ones for GPUs. Our work combines the heterogeneous tile algorithms and the multi-level distribution method so that the algorithms are applicable to heterogeneous clusters with hybrid CPUs and GPUs.

We design a distributed scheduling runtime system for heterogeneous clusters. Each compute node executes a runtime system that can solve data dependencies dynamically and transfer data from a parent task to its children transparently. All runtime systems (one per node) proceed in parallel and execute a task-assignment protocol to build subsets (or partitions) of a DAG dynamically. No communication is required when building the DAG. The protocol guarantees that all runtime systems make a unanimous decision without coordinating with each other, such that every task is computed by one and only one processing unit (on a host or a GPU).

Our experiments with double-precision Cholesky and QR factorizations, on the heterogeneous Keeneland system [23] at the Oak Ridge National Laboratory, demonstrate great scalability from one to 100 nodes using all CPUs and GPUs. In addition, we apply our framework to two other possible environments: clusters without GPUs, and a shared system with CPUs and multiple GPUs. Compared with vendor-optimized and open source libraries (i.e., Intel MKL and StarPU [6]), our framework provides up to 66% better performance than Intel MKL on clusters without GPUs, and up to 250% better performance than StarPU on shared-system multiGPUs.

2 Background

We place greater emphasis on communications in our framework design. On a host that is attached to multiple GPUs through PCI-Express connections, the ratio of computation to communication on the GPUs keeps increasing. Eventually the communication time on a PCI-Express connection will become the bottleneck of the entire system.

Researchers most often use dynamic scheduling methods to support heterogeneous systems, where each CPU core or GPU picks up a ready task from task queues independently whenever it becomes idle. At the beginning of our design, we also implemented dynamic scheduling runtime systems. In our dynamic runtime system, all the CPU cores and GPUs share a global ready task queue, and each GPU owns a software cache in its device memory. Whenever a GPU reads a block of data from the host, it stores the data in its software cache. All the data in the GPUs' software caches are also backed up by the main memory on the host. We have used two cache writing policies: write-through and write-back. With the write-through policy, every modification to the software cache must be updated to the host main memory immediately. With the write-back policy, a modification to the software cache is updated to the host main memory only when a different device wants to access the modified data. To achieve the best performance, the software cache size on each GPU is configured to be as large as the input matrix size to eliminate capacity cache misses.
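To make the two policies concrete, the following sketch (ours, not the framework's actual code) models a single software-cache entry backed by host memory; host-device transfers are abstracted as plain memcpy calls rather than real CUDA copies.

#include <string.h>

/* Hypothetical model of one GPU software-cache entry backed by host memory. */
typedef struct {
    void  *host_copy;   /* backup of the tile in host main memory    */
    void  *dev_copy;    /* cached copy in GPU device memory          */
    size_t bytes;
    int    dirty;       /* set when the GPU modifies its cached copy */
} cache_entry;

/* Write-through: every modification is pushed back to the host immediately. */
void write_through_update(cache_entry *e) {
    memcpy(e->host_copy, e->dev_copy, e->bytes);  /* stands in for a device-to-host copy */
    e->dirty = 0;
}

/* Write-back: only mark the entry dirty ... */
void write_back_update(cache_entry *e) {
    e->dirty = 1;
}

/* ... and flush it when another device wants to access the modified data. */
void flush_on_remote_access(cache_entry *e) {
    if (e->dirty) {
        memcpy(e->host_copy, e->dev_copy, e->bytes);
        e->dirty = 0;
    }
}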

Figure 1 shows our experiments with the double-precision Cholesky factorization on a single node of the Keeneland system using 12 cores and 3 Fermi GPUs. In the figure, we compare our software-cache based dynamic scheduling runtime system, the generic dynamic scheduling runtime system of StarPU [6], and our distributed-GPUs framework that builds upon a static data distribution method. By changing from the write-through policy to the write-back policy, we can improve the program performance significantly because of the reduced communication.

Figure 1: A comparison between dynamic scheduling systems and our distributed-GPU framework (Gflops versus matrix size on one Keeneland node; curves: Distri-GPUs framework, write-back cache, write-through cache, and StarPU 0.9.1).

In contrast, StarPU includes profiling, performance modeling, and various scheduling policies to achieve load balancing and reduce data transfers. However, since our static data distribution method can guarantee a near lower-bound communication cost and has less scheduling overhead, it is faster than StarPU by up to 200% for small to medium matrix sizes. This has inspired us to use a static data distribution method on GPU-based clusters. Here we emphasize that despite its better performance, our framework is intended for solving linear algebra problems, while StarPU is more generic and can support other applications.

3 Heterogeneous Tile Algorithms

Our previous work designed a class of heterogeneous tile algorithms and applied them to shared-system multiGPUs [21]. Here we describe them briefly for completeness. In these algorithms, every task takes several individual tiles as input and output. On a heterogeneous system with CPUs and GPUs, we create two different tile sizes.

Figure 2 shows a matrix that is stored in a tile data layout with two different tile sizes. All the tasks that modify small tiles are to be executed by CPU cores, and those that modify large tiles are to be executed by GPUs. Given a matrix with 3 tile rows and 6 tile columns, Cholesky factorization can be computed recursively as follows. At the first iteration, (1) we solve the Cholesky factorization of the first tile column, i.e., tiles A11, A21, and A31: L11 ← Cholesky(A11), and Li1 ← Ai1(L11^T)^−1 for i = 2, 3; (2) then we update the trailing submatrix located between the second and the sixth (last) tile column: Aij ← Aij − Li1 · Lj1^T. The Cholesky factorization can be completed by recursively applying the above two steps to the trailing submatrix that starts from the j-th tile column. If a matrix has n tile columns, the factorization takes n iterations to complete. The same idea can be applied to other matrix factorizations (e.g., QR and LU). For more details, please refer to our report [21] for the exact algorithms of the Cholesky and QR factorizations.

Figure 2: An example of the heterogeneous tile algorithm for Cholesky factorization (small tiles are assigned to CPU cores and large tiles to GPUs).
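To make the recursion concrete, here is a minimal sketch of the tile Cholesky loop; the kernel functions (potrf_tile, trsm_tile, gemm_tile) are hypothetical stand-ins for the real CPU/GPU kernels, and the sketch uses uniform tiles, showing only the task structure of the two steps above rather than the hybrid tile sizes.

/* Hypothetical tile kernels; in the real framework these map to LAPACK,
 * CUBLAS, or MAGMA calls dispatched to a CPU core or a GPU. */
void potrf_tile(double *Akk);                                      /* L_kk = Cholesky(A_kk)     */
void trsm_tile(const double *Lkk, double *Aik);                    /* L_ik = A_ik (L_kk^T)^{-1} */
void gemm_tile(const double *Lik, const double *Ljk, double *Aij); /* A_ij -= L_ik L_jk^T       */

/* Right-looking tile Cholesky over an nt x nt grid of tiles;
 * tile[i*nt + j] points to tile (i, j) of the lower-triangular part. */
void tile_cholesky(int nt, double *tile[])
{
    for (int k = 0; k < nt; k++) {
        potrf_tile(tile[k*nt + k]);                     /* step (1): factor the panel head */
        for (int i = k + 1; i < nt; i++)
            trsm_tile(tile[k*nt + k], tile[i*nt + k]);  /* step (1): triangular solves     */
        for (int j = k + 1; j < nt; j++)                /* step (2): trailing update       */
            for (int i = j; i < nt; i++)
                gemm_tile(tile[i*nt + k], tile[j*nt + k], tile[i*nt + j]);
    }
}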

3.1 Static Multi-Level Block Cyclic Data Distribution

This section presents a new multi-level partitioning scheme to create small and large tiles on distributed-memory heterogeneous clusters. (1) At the top level, we divide a matrix into p × p large tiles, each of which is of size B × B. (2) We distribute the p × p large tiles to a Pr × Pc process grid using a 2-D block cyclic method. There is one process per node. (3) At the bottom level (i.e., within each node), we vertically cut every large tile of size B × B on each node into (s − 1) small tiles of size B × b and one remaining large tile of size B × (B − (s − 1) · b). We allocate the small tiles to the entire set of CPU cores on the host, and allocate the remaining tiles to the GPUs using a 1-D or 2-D block cyclic method. So far we use a 1-D method because of the small number of GPUs (i.e., ≤ 4) on each compute node.

After the multi-level block cyclic data distribution, each node is assigned (p/Pr) × (p·s/Pc) rectangular tiles. Given a tile indexed by [I, J], we first map it to node Ni and then to device Dj, where D0 denotes the host and Dj≥1 denotes the j-th GPU located on node Ni. Assuming each node has G GPUs, we compute node Ni (0 ≤ i ≤ PrPc − 1) and device Dj as follows:

Ni = (I mod Pr) · Pc + (⌊J/s⌋ mod Pc)

Dj = 0,                          if (J mod s) < s − 1
Dj = (⌊J/(s · Pc)⌋ mod G) + 1,   if (J mod s) = s − 1

In other words, Step (2) distributes the large tiles across the Pr × Pc nodes (determining Ni). Next, within each node, the tile columns whose index J satisfies (J mod s) = s − 1 are mapped to the node's G GPUs in a 1-D cyclic way, and the rest of the columns are mapped to all CPUs on the host (determining Dj). Although small tasks are assigned to all the CPUs collectively, each CPU core can pick up any small task independently (i.e., not in a fork-join manner). We also tune the tile sizes B and b to achieve the highest performance. Appendix A.1 describes how we choose the best tile sizes.
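The two formulas translate directly into a small mapping routine; the sketch below is our transcription (function and variable names are ours) and returns the node index and the device index for a tile [I, J].

/* Transcription of the mapping rules above. Returns, via out-parameters,
 * the node index Ni in [0, Pr*Pc - 1] and the device index Dj, where
 * Dj == 0 means the host and Dj >= 1 means the (Dj-1)-th GPU of that node. */
void map_tile(int I, int J, int Pr, int Pc, int s, int G, int *Ni, int *Dj)
{
    *Ni = (I % Pr) * Pc + ((J / s) % Pc);
    if (J % s < s - 1)
        *Dj = 0;                           /* small-tile column -> host CPU cores     */
    else
        *Dj = (J / (s * Pc)) % G + 1;      /* last column of each group of s -> a GPU */
}

For example, with the (hypothetical) parameters Pr = Pc = 2, s = 4, and G = 3, tile [5, 7] maps to node (5 mod 2)·2 + (⌊7/4⌋ mod 2) = 3, and since 7 mod 4 = 3 = s − 1, it lands on GPU (⌊7/8⌋ mod 3) + 1 = 1, i.e., the first GPU of node 3.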

4 Our Framework Overview

We design a distributed dynamic scheduling runtime system for heterogeneous GPU-based clusters. Given a cluster with P nodes, we launch P MPI processes (one process per node), each of which executes an instance of the runtime system. We also assume a matrix is stored in the hybrid tile data layout that uses two different tile sizes.

Not only do we distribute data to the hosts and GPUs of different nodes statically, but we also distribute tasks to hosts and GPUs statically. We require that the location of a task be the same as the location of the task's output. Although a task's allocation is static, we schedule tasks dynamically within a host or GPU in order to reduce synchronization points and to overlap computation with communication.

Our runtime system follows a dataflow programming model and is essentially data-availability driven. Whenever a parent task completes, it triggers its child tasks immediately. The runtime system can identify data dependencies between tasks and unroll a directed acyclic graph (DAG) dynamically. Note that the DAG is never created and stored explicitly in our runtime system. A parallel program starts with the entry task and finishes with the exit task of the DAG. The runtime system is also responsible for transferring data from a parent task to its children transparently.

Each runtime system instance is multi-threaded. It creates five types of threads: a task-generation thread, a set of CPU compute threads for CPU cores, a set of GPU management threads for GPUs, an inter-node MPI communication thread, and an intra-node CUDA communication thread. If a machine is not distributed, the MPI communication thread is not created. Similarly, if there is no GPU, the CUDA communication thread is not created.

A task-generation thread creates tasks (similar to a CPU issuing instructions) and drives the execution of a parallel program. There are in fact P task-generation threads running on the P nodes. All the task-generation threads execute the same sequential code independently and create task instances for the program without communication. They execute a distributed task-assignment protocol: based on the common knowledge of the static multi-level distribution, each task-generation thread can decide by itself which tasks it should compute and where a task's children are located, without exchanging any messages.

5 The Framework Implementation

As shown in Fig. 3, our runtime system consists of seven components: (1) Task window: a fixed-size task queue that stores all the generated but not yet finished tasks. (2) Ready task queues: lists of ready tasks. (3) Task-generation thread: a single thread that executes a serial program and generates new tasks. (4) CPU compute threads: one compute thread running on every CPU core. (5) GPU management (or compute) threads: one GPU management thread for each GPU. (6) MPI communication thread: a single thread that transfers data between different nodes. (7) CUDA communication thread: a single thread that transfers data among the host and the multiple GPUs within the same node using cudaMemcpyAsync.
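A minimal sketch of how such a per-node runtime might spawn its threads with POSIX threads is shown below; the runtime_t type, the thread-body functions, and the core-count bookkeeping are our assumptions, and error handling is omitted.

#include <pthread.h>

/* Hypothetical per-node runtime descriptor and thread bodies. */
typedef struct runtime runtime_t;
void *task_generation_thread(void *rt);
void *cpu_compute_thread(void *rt);
void *gpu_compute_thread(void *rt);
void *mpi_comm_thread(void *rt);
void *cuda_comm_thread(void *rt);

/* Spawn one task-generation thread, one compute thread per remaining CPU core,
 * one management thread per GPU, and the MPI/CUDA communication threads
 * only when they are needed (distributed run / GPUs present). */
void start_runtime(runtime_t *rt, int ncores, int ngpus, int nnodes)
{
    pthread_t tid;
    int ncompute = ncores - ngpus - (nnodes > 1 ? 1 : 0) - (ngpus > 0 ? 1 : 0);

    pthread_create(&tid, NULL, task_generation_thread, rt);
    for (int i = 0; i < ncompute; i++)
        pthread_create(&tid, NULL, cpu_compute_thread, rt);
    for (int g = 0; g < ngpus; g++)
        pthread_create(&tid, NULL, gpu_compute_thread, rt);
    if (nnodes > 1)
        pthread_create(&tid, NULL, mpi_comm_thread, rt);
    if (ngpus > 0)
        pthread_create(&tid, NULL, cuda_comm_thread, rt);
}

With 12 cores and 3 GPUs, this bookkeeping yields 7 CPU compute threads on a distributed run (one core each reserved for MPI and CUDA communication) and 8 on a single node, matching the counts discussed in Sections 5.3 and 7.3.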

Figure 3: Architecture of the distributed-GPU runtime system on each node (components shown: task-generation thread, task window, per-device ready task queues and mboxes for the multicore host and GPU1 ... GPUn, CPU compute threads, the MPI thread connected to the HPC network, the CUDA thread, and GPUDirect v2.0).

5.1 Different Task Queues

A task window stores generated tasks in a singly linked list, preserving the original sequential order between tasks. Each task records the information of its input and output. Based on this information, when a task modifies its output, the runtime system scans the list to search for the tasks that are waiting for that output. However, a global task list is often too long to search and can result in severe contention between threads.

To make accesses to the task window faster, we use 2-D task lists to implement the task window. As illustrated in Fig. 4, each tile in a matrix has its own task list. If a task's input or output is tile [I, J], the runtime system adds an instance of the task to [I, J]'s task list. When a matrix is distributed to different compute nodes, we partition the 2-D task lists across the nodes according to the location of the tiles. That is, if tile [I, J] is allocated to node Pk, tile [I, J]'s task list is also assigned to node Pk.

A ready task queue stores "ready-to-go" tasks whose inputs are all available. If a task modifies a tile that belongs to the host or to a GPU, it is added to that host's or GPU's ready task queue correspondingly. We do not allow the host and GPUs to steal tasks from each other, in order to avoid unnecessary data transfers and to increase data reuse. In addition, a ready task in our implementation is simply a pointer to a task stored in the task window.


5.2 Solving Data Dependencies

A tile's task list maintains the serial semantic order between the tasks that either read or write the tile. Whenever two tasks access the same tile and at least one of them is a write, the runtime system detects a data dependency and stalls the successor until the predecessor is finished. Here we only consider the true dependency RAW (read after write), and use renaming to avoid the WAR (write after read) and WAW (write after write) dependencies. Figure 5 shows an example of the task list used for tile A[i, j], where Tasks 1–3 are waiting for Task 0's output.

There are only two operations that access a task list: FIRE and APPEND. After a task completes and modifies its output tile [I, J], the FIRE operation searches [I, J]'s task list for the tasks that want to read [I, J]. It scans the list from the just-completed task toward the end to find which tasks are waiting for [I, J]. The scan exits when it encounters a task that wants to write to [I, J]. If a task is visited before the scan exits, the FIRE operation marks that task's input as "ready". When all the inputs of a task become ready, the task becomes a ready task.

The APPEND operation is invoked by the task-generation thread. After generating a new task, the generation thread inspects every input and output of the task. Given an input that reads tile [I, J], before appending the input task instance, APPEND scans [I, J]'s task list from the head to check whether there exists a task that writes to tile [I, J]. If no task writes to [I, J], the input task instance is marked as "ready"; otherwise, it is "unready". By contrast, given an output [I, J], APPEND puts an output task instance at the end of [I, J]'s task list immediately.
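The sketch below shows the core of both operations on one tile's task list using simplified structures of our own; each list entry records whether its task reads or writes the tile, FIRE walks from the completed writer toward the tail, and APPEND checks whether any earlier writer exists.

#include <stddef.h>

/* Simplified task-list entry for one tile (field and type names are ours). */
typedef struct entry {
    struct entry *next;
    int           is_write;  /* 1 if the task writes the tile, 0 if it reads it */
    int           ready;     /* set when this input becomes available           */
} entry_t;

/* FIRE: after the writer 'done' finishes, mark the waiting readers ready and
 * stop at the next writer, which must wait for all of those readers. */
void fire(entry_t *done)
{
    for (entry_t *e = done->next; e != NULL; e = e->next) {
        if (e->is_write)
            break;
        e->ready = 1;        /* RAW dependency on the tile is now satisfied */
    }
}

/* APPEND (input side): a newly appended reader is ready only if no task
 * earlier in the list writes the tile. */
int append_is_ready(const entry_t *head, const entry_t *new_reader)
{
    for (const entry_t *e = head; e != NULL && e != new_reader; e = e->next)
        if (e->is_write)
            return 0;
    return 1;
}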

Figure 4: The 2-D task window implementation. Each tile has its own list of tasks that either read or write the tile.

5.3 Computation Components

Each CPU core runs a CPU compute thread. Whenever a CPU compute thread becomes idle, it picks up a ready task from the host's shared ready task queue and computes it. After finishing the task, the thread invokes the FIRE operation to determine which tasks are the children of the finished task, and moves them to a ready task queue if possible.

Similarly, each GPU has a GPU compute thread. A GPU compute thread is essentially a GPU management thread, which runs on the host but can launch GPU kernels quickly. For convenience, we think of the GPU management thread as a powerful compute thread. If a node has g GPUs and n CPU cores, our runtime system launches g GPU compute threads to represent (or manage) the g GPUs, and (n − g − 2) CPU compute threads to represent the remaining CPU cores. The remaining number of CPU cores is not (n − g) because we use one core for MPI communication and another core for CUDA memory copies.

5.4 Communication Components

There are two types of communication on a GPU-based cluster: communication between nodes, and communication within a node. On each node, we create one thread that performs MPI operations to transfer data between nodes, and another thread that copies memory among the host and the different GPUs on the same node.

GPUDirect v2.0 supports direct memory copies between GPUs on the same node. It can also send or receive GPU buffers on different nodes directly if the MPI library supports it. To make our framework more portable, we choose to move data from the GPU to the host on the source node first, and then send it to the destination node.

Figure 5: Solving data dependencies for a set of tasks that read or write tile A[i, j]. In the list, Task 0 writes A[i, j], Tasks 1–3 read A[i, j], and Task 4 writes A[i, j] again; the completion of Task 0 fires Tasks 1–3.


After the destination host receives the data, it copies the data from the host to one or more of its GPUs.

An MPI communication thread runs on a dedicated CPU core. It calls nonblocking MPI point-to-point operations to send and receive messages. At the beginning, the thread posts an MPI_Irecv operation and an MPI_Isend operation. Next, it checks whether the pending receive or send operation has finished by busy-polling. When an operation finishes, the thread posts a new operation to replace the finished one, so that there are always two operations (one receive and one send) ongoing at the same time. Figure 9 in Appendix A shows the pseudocode that implements the MPI communication thread. In the code, wait4send and wait4recv indicate whether there is a pending send or receive operation.

A CUDA communication thread also uses a dedicated CPU core on the host. Each GPU has two mailboxes: an out_mbox and an in_mbox. The messages in an out_mbox go from the GPU to other devices, and the messages in an in_mbox go from other devices to the GPU itself. We create two streams for each GPU: one for outgoing traffic and the other for incoming traffic. Similar to the MPI communication thread, the CUDA communication thread tries to keep one incoming memory copy and one outgoing memory copy in flight for each GPU simultaneously. If there are g GPUs, there can be up to 2g cudaMemcpyAsync operations happening concurrently, two per GPU. To implement the CUDA communication thread, wait4send and wait4recv become bitsets, where the i-th bit denotes the status of the i-th GPU. We have also implemented select_GPU_streams as a substitute for MPI_Test, so that we can test in which GPU streams the asynchronous cudaMemcpy operations have finished.
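A condensed sketch of one polling step of this thread for a single GPU is given below; the mailbox message type and the two helper functions are hypothetical, while cudaSetDevice, cudaMemcpyAsync, and cudaStreamQuery are standard CUDA runtime calls (cudaMemcpyDefault assumes unified virtual addressing, available since CUDA 4.0).

#include <cuda_runtime.h>

/* Hypothetical mailbox message: copy 'bytes' from 'src' to 'dst'. */
typedef struct { void *dst; const void *src; size_t bytes; } msg_t;
int  try_get_msg(int gpu, int outgoing, msg_t *m);  /* hypothetical: poll in_mbox/out_mbox  */
void transfer_finished(int gpu, int outgoing);      /* hypothetical: fire the waiting tasks */

/* Keep one incoming and one outgoing cudaMemcpyAsync in flight for GPU 'g',
 * using its two dedicated streams; in_busy/out_busy play the role of the
 * wait4recv/wait4send bits described above. */
void poll_gpu(int g, cudaStream_t in_stream, cudaStream_t out_stream,
              int *in_busy, int *out_busy)
{
    msg_t m;
    cudaSetDevice(g);
    if (!*out_busy && try_get_msg(g, 1, &m)) {      /* device -> host (or peer GPU) */
        cudaMemcpyAsync(m.dst, m.src, m.bytes, cudaMemcpyDefault, out_stream);
        *out_busy = 1;
    }
    if (!*in_busy && try_get_msg(g, 0, &m)) {       /* host -> device               */
        cudaMemcpyAsync(m.dst, m.src, m.bytes, cudaMemcpyDefault, in_stream);
        *in_busy = 1;
    }
    if (*out_busy && cudaStreamQuery(out_stream) == cudaSuccess) {
        *out_busy = 0;
        transfer_finished(g, 1);
    }
    if (*in_busy && cudaStreamQuery(in_stream) == cudaSuccess) {
        *in_busy = 0;
        transfer_finished(g, 0);
    }
}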

5.5 Data Storage and Management

The host and each GPU use an indirect data structure to store matrices. Given a matrix with p × q tiles, the indirect data structure consists of p × q pointers, each pointing to a tile. A pointer is null if the corresponding tile is not stored on that host or GPU. We store a GPU's indirect data structure in host memory, but the pointers in the GPU's indirect structure actually point to GPU device memory. By using the indirect data structure, a GPU compute thread can simply look it up and pass the correct arguments (i.e., GPU device pointers) to launch a GPU kernel.

Our runtime system transfers data from a parent task to its children transparently; however, it does not know how long the data should persist on the destination device. We provide programmers with a special function, Release_Tile(), to free data. Release_Tile does not free any memory, but sets up a marker in the task window. The marker tells the runtime system that the tile will not be needed by any task generated after the marker, and that it is safe to free the tile whenever possible. When a programmer writes a sequential program, he or she can add Release_Tile() to the program just like calling the ANSI C function free. The task-generation thread keeps track of the expected number of visits to each tile, while the compute threads count the actual number of visits to each tile. The runtime system will free a tile if and only if: i) Release_Tile has been called to mark the tile, and ii) the actual number of visits equals the expected number of visits to the tile. In essence, this is an asynchronous deallocation method with which a dynamic runtime system can decide when it is safe to free data.
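The deallocation rule can be sketched as a small reference-count check; the structure and helper below are ours, not the framework's API, and a real runtime would update the counters atomically.

#include <stdlib.h>

/* Hypothetical per-tile bookkeeping for asynchronous deallocation. */
typedef struct {
    void *data;
    int   expected_visits;  /* counted by the task-generation thread            */
    int   actual_visits;    /* incremented by compute threads after each access */
    int   release_marked;   /* set when the Release_Tile() marker is reached    */
} tile_meta;

/* Called after each access and when the marker is reached: free the tile only
 * when both conditions described above hold. */
void maybe_free_tile(tile_meta *t)
{
    if (t->release_marked && t->actual_visits == t->expected_visits) {
        free(t->data);
        t->data = NULL;
    }
}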

6 Distributed Task Assignment Protocol

The runtime systems (one per node) execute the same code and generate the same set of tasks, so a task may be duplicated on every node. We design a protocol to guarantee that each task is computed by one and only one processing unit (CPU or GPU), and that each input is sent to a waiting task only once.

Given a task with k1 inputs, all the runtime systems across the cluster will generate k1 input task instances in total. It is exactly k1 instances because each input belongs to exactly one node and only that node will claim ownership of the input. We also define the first output of a task to be its main output, and the remaining outputs to be minor outputs.

Our runtime system generates eight types of task instances using a set of rules. The rationale behind the rules is that when all runtime systems look at the same input or output, they should make a unanimous decision based merely on a predefined distribution (i.e., the static multi-level block cyclic distribution) without any communication. Note that Rule 1, Rules 2–4, and Rules 5–8 below apply to a task's main output, inputs, and minor outputs, respectively.

1. Owner. Each runtime system looks at a newly generated task's main output. If the main output is assigned to a certain host or GPU (e.g., on nodei) by the static data distribution, only nodei's runtime system creates an owner task instance. An owner task instance stores the complete information of the task (i.e., inputs, outputs, and the ready status of each input).

2. Native input. Each input of a new task is checked by every runtime system. If the input and the task's main output are assigned to the same host or GPU (e.g., on nodei), only nodei's runtime system creates a native-input task instance. The native-input instance also stores a pointer to the task's owner instance.

3. Intra-node alien input. Unlike Rule 2, if the input and the task's main output belong to the same node (e.g., nodei) but to different devices, the runtime system on nodei creates an intra-node alien-input task instance. The intra-node alien-input instance also stores a pointer to the task's owner instance.

4. Inter-node alien input. Unlike Rule 3, if the input and the task's main output belong to different nodes, and the input belongs to nodei, the runtime system on nodei creates an inter-node alien-input task instance. The inter-node alien-input instance stores the location of the task's main output.

5. Native minor-output. All runtime systems look at each minor output of a newly generated task. If the minor output and the task's main output belong to the same host or GPU (e.g., on nodei), the runtime system on nodei creates a native minor-output task instance. The task's owner instance stores a pointer to the native minor-output instance.

6. Sink minor-output. Unlike Rule 5, if the minor output and the main output belong to different devices (regardless of nodes), and the minor output is assigned to nodej, the runtime system on nodej creates a sink minor-output task instance. The sink instance expects its corresponding source instance to send data to it.

7. Intra-node source minor-output. Unlike Rule 5, if the minor output and the main output belong to different devices on the same node (e.g., nodei), the runtime system on nodei creates an intra-node source minor-output task instance. The intra-node source minor-output instance stores a pointer to its corresponding sink instance on the same node.

8. Inter-node source minor-output. If the minor output and the main output belong to different nodes, and the main output is assigned to nodei, the runtime system on nodei creates an inter-node source minor-output task instance. The inter-node source minor-output instance stores the location of its corresponding sink instance on the remote node.

Since we require that the location of an owner task instance be where the computation occurs, our runtime systems are designed to link a task's input instances, minor-output instances, and its owner instance together so that the availability of an input triggers the owner. In our runtime system, the linking information is either a direct pointer or the location of the owner task instance. Also, by distinguishing intra-node from inter-node, the runtime system can decide whether it needs to copy data to a different device or send an MPI message to a different node in order to fire a child task.

A distinctive feature of the protocol is that all the runtime systems can follow the same rules to generate tasks and solve data dependencies in an embarrassingly parallel manner without any communication (except for the actual data transfers). An example of how a runtime instance classifies an input under these rules is sketched below.
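The sketch classifies one (input, main-output) pair exactly as Rules 2–4 prescribe, using only the static placement of the two tiles; the place_t struct and the enum names are ours. Because every runtime instance evaluates the same pure function on the same arguments, they all reach the same decision without exchanging messages.

/* Hypothetical placement produced by the static multi-level distribution:
 * the node that owns a tile and the device on it (0 = host, >= 1 = GPU). */
typedef struct { int node; int dev; } place_t;

typedef enum {
    NOT_MINE,                /* some other node records this input       */
    NATIVE_INPUT,            /* Rule 2: same host/GPU as the main output */
    INTRA_NODE_ALIEN_INPUT,  /* Rule 3: same node, different device      */
    INTER_NODE_ALIEN_INPUT   /* Rule 4: different node                   */
} input_kind;

input_kind classify_input(place_t input, place_t main_output, int my_node)
{
    if (input.node != my_node)
        return NOT_MINE;                  /* only the input's owner creates an instance */
    if (main_output.node == my_node && main_output.dev == input.dev)
        return NATIVE_INPUT;
    if (main_output.node == my_node)
        return INTRA_NODE_ALIEN_INPUT;
    return INTER_NODE_ALIEN_INPUT;
}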

7 Performance Evaluation

We conducted experiments with the Cholesky and QR factorizations in double precision on the heterogeneous Keeneland system [23] at the Oak Ridge National Laboratory. The Keeneland system has 120 nodes connected by a QLogic QDR InfiniBand network. Each node runs CentOS 5.5 and has two Intel Xeon 2.8 GHz 6-core processors and three Nvidia Fermi 1.15 GHz M2070 GPUs. The host on each node has 24 GB of memory, and each GPU has 6 GB of device memory. Every GPU on the system has full PCI-Express bandwidth. All nodes have CUDA 4.0, Intel Compilers 12.0, Intel MKL 10.3.5, and OpenMPI 1.5.1 installed.

Figure 6: Weak scalability on distributed GPUs. (a) and (b) show the overall performance (TFlops), while (c) and (d) show the performance per node, for the Cholesky and QR factorizations (in double precision), respectively, on 1 to 100 nodes. Every experiment uses all 12 CPU cores and 3 GPUs on each node; the curves compare the kernel upper bounds (DGEMM UB, DSSRFB UB), our distributed-GPUs framework, and MKL ScaLAPACK 10.3.

In the following experiments, we perform weak scalability tests to measure the capability of our programs to solve potentially larger problems when there are more computing resources. While we focus on clusters with distributed GPUs, our framework is also able to achieve high performance in two other environments: clusters without GPUs, and shared-system multiGPUs (i.e., a single node with both CPUs and GPUs).

7.1 Distributed GPUs

We did experiments on Keeneland using all twelve CPU cores and all three GPUs on each node. Figure 6 shows how the performance of our distributed-GPU framework scales as we increase the number of nodes and the matrix size simultaneously. Although there are 120 nodes on Keeneland, its batch scheduler only allows a job to use at most 110 nodes. Considering that one or two nodes are sometimes unavailable, we increase the number of nodes from one to 100. As the number of nodes increases by a factor of k, we increase the matrix size by a factor of √k. The single-node experiment takes an input matrix of size 34,560.

In our experiments, we measure the total number of TeraFlops attained by the Cholesky factorization and the QR factorization, as shown in Figure 6 (a) and (b), respectively. To show the possible maximum performance (i.e., upper bound) of our programs, we also depict the curves of DGEMM and DSSRFB, which are the dominant computational kernels of the Cholesky factorization and the QR factorization, respectively. We calculate an upper bound by the following formula: kernel_UB = serial_cpu_kernel_perf × #cores + gpu_kernel_perf × #gpus. To show the benefit of using GPUs, we also present the performance of the Intel MKL 10.3.5 ScaLAPACK library, which uses CPUs only. In (a), the overall performance of our distributed-GPU Cholesky factorization reaches 75 TFlops on 100 nodes, while MKL ScaLAPACK reaches 6.3 TFlops. In (b), the overall performance of our distributed-GPU QR factorization reaches 40 TFlops on 100 nodes, while MKL ScaLAPACK reaches 9.2 TFlops.

Figure 6 (c) and (d) show another view (i.e., performance per node) of the same experiments displayed in (a) and (b). That is, TFlops-per-node = (overall TFlops) / (number of nodes) on a given number of nodes. Ideally, the performance per node is a constant in a weak scalability experiment. From (c), we can see that our distributed-GPU Cholesky factorization does not lose any performance from one node to 100 nodes. In (d), our distributed-GPU QR factorization again scales well from four nodes to 100 nodes. The performance per node drops from 0.44 TFlops to 0.41 TFlops at four nodes because the four-node experiment uses a 2 × 2 process grid and has a larger communication overhead than a process grid with Pr = 1 (Appendix A.2 analyzes the related communication cost).


Figure 7: Weak scalability on clusters with CPUs only. (a) and (b) show the overall performance (TFlops), and (c) and (d) the performance per node, for the Cholesky and QR factorizations respectively, on 1 to 100 nodes, comparing our distributed-GPUs framework with MKL ScaLAPACK. Every experiment uses 12 CPU cores on each node.

7.2 Clusters without GPUs

We ran another set of experiments to test whether our framework is capable of delivering high performance when the system is a conventional cluster without GPUs. We only use the 12 CPU cores on each node in these experiments, and compare our Cholesky and QR factorization implementations with the Intel MKL 10.3.5 ScaLAPACK library.

We performed weak scalability experiments again, where the input size increases by a factor of √2 whenever we double the number of nodes. In Fig. 7 (a), the overall performance of the ScaLAPACK Cholesky factorization is lower than that of our Cholesky factorization by 43% on 100 nodes. In (b), our QR factorization program and the ScaLAPACK QR factorization have comparable overall performance. Figure 7 (c) and (d) show the performance per node. In (c), our CPU-only Cholesky factorization scales well from 2 to 100 nodes. Its curve has a dip from one to two nodes because, when there is more than one node, the runtime system on each node uses a dedicated core for MPI communication (i.e., 1/12 less computing power). Similar to the Cholesky factorization, in (d), our QR factorization scales well from 4 to 100 nodes. Because of its good scalability, our QR program eventually outperforms the Intel MKL ScaLAPACK QR factorization by 5% when the number of nodes is greater than 32. Note that we use 11 out of 12 cores on each node for the actual computation, while ScaLAPACK uses all 12 cores; nevertheless, we are still 5% faster.

7.3 Shared-System MultiGPUs

To evaluate our framework on a shared system with multicore CPUs and multiple GPUs, we compare our Cholesky factorization to StarPU 0.9.1 [6] on a single node of the Keeneland system.

Figure 8: Cholesky factorization (in double precision) on a shared system with 12 cores and 3 GPUs, showing overall Gflops as a function of the matrix size N (from 960 to 29,760) for our distributed-GPUs framework and StarPU 0.9.1.

StarPU uses a dynamic scheduling runtime system to assign tasks to CPUs and GPUs, aiming to maintain load balance and reduce data transfers. The StarPU implementation of the Cholesky factorization uses the same computational kernels as ours, which call subroutines from the Intel MKL 10.3.5, CUBLAS 4.0, and MAGMA 1.0 libraries. With help from one of the StarPU developers, we ported the StarPU Cholesky factorization to Keeneland and also tuned its performance thoroughly. Porting the StarPU QR factorization to Nvidia Fermi GPUs has not been successful so far due to numerical errors in the result.

Figure 8 shows the overall performance of our framework and of StarPU 0.9.1 when solving double-precision Cholesky factorizations. All the StarPU experiments use 9 CPU cores and 3 GPUs for the actual computation, and use the remaining three cores to manage the three GPUs. By contrast, our implementation uses 8 CPU cores and 3 GPUs for the actual computation, since we also use an additional core for CUDA communications. The performance data show that our framework reaches high performance more quickly than the StarPU program. When the input size is not very large, our framework is faster than StarPU (i.e., 250% faster when N ≤ 7680, and 100% faster when N ≤ 12480). When the input size is sufficiently large (i.e., N ≥ 26,880), StarPU starts to approach our framework.

8 Related Work

A number of runtime systems have been developed to support multiple GPU devices on a shared-memory system. StarPU provides a dynamic scheduling runtime system that executes a sequential code on the host CPUs and GPUs in parallel [6], and has been applied to the Cholesky, QR, and LU factorizations [3, 2, 1]. SuperMatrix is another runtime system that supports shared-memory systems with multiple GPUs [19]. It uses several software-cache schemes to maintain coherence between the host RAM and the GPU memories. While SuperMatrix requires that GPUs perform most of the computation, our framework utilizes all CPU cores and all GPUs on both shared-memory and distributed-memory systems.

StarSs is a programming model that uses directives to annotate a sequential source code so that it can execute on various architectures such as SMP, CUDA, and Cell [7]. A programmer is responsible for specifying which pieces of code should be executed on a GPU. Its runtime then executes the annotated code in parallel on the host and GPUs. It is also possible to use the hybrid MPI/SMPSs approach to support clusters with multicore CPUs [17].

There is also research work that supports scientific computations on distributed GPUs. Fatica uses CUDA to accelerate the LINPACK benchmark [13] on heterogeneous clusters by modifying the original source code slightly. The revised code intercepts every DTRSM or DGEMM call and splits it into two calls that execute on the CPUs and the GPUs, respectively. The calls to CPUs rely on setting OMP_NUM_THREADS to utilize all CPU cores on the host. In contrast, our distributed-GPU framework allows every CPU core to compute tasks independently. On the other hand, we use one MPI process per node, instead of one MPI process per GPU.

Fogue et al. ported the PLAPACK library to GPU-accelerated clusters [14]. They require that the CPUs compute the diagonal block factorizations while the GPUs compute all the remaining operations. They also store all data in GPU memories to reduce communication. In our method, we distribute a matrix across the host and the GPUs, and we can utilize all CPU cores and all GPUs. Note that the computational power of a host may be greater than that of a GPU, in which case the host needs to compute most of the work.

Many researchers have studied static data distribution strategies on heterogeneous distributed-memory systems. Dongarra et al. designed an algorithm to map a set of uniform tiles to a 1-D collection of heterogeneous processors [11]. Robert et al. proposed a heuristic 2-D block data allocation to extend ScaLAPACK to heterogeneous clusters [9]. Lastovetsky et al. developed a static data distribution strategy that takes into account both processor heterogeneity and memory heterogeneity for dense matrix factorizations [16]. Our work targets clusters of nodes that consist of multicore CPUs and multiple GPUs, and uses a novel static multi-level block cyclic strategy.

9 Conclusion and Future Work

As the trend of adding more GPUs to each node to deliver higher performance continues, it is important to start designing new parallel software for these heterogeneous architectures. We present a new framework to solve dense linear algebra problems on large-scale GPU-based clusters. To attain high performance, we focus our design on minimizing communication, maximizing the degree of task parallelism, accommodating processor heterogeneity, hiding communication, and keeping load balance. The framework essentially consists of a static multi-level data distribution method, a class of heterogeneous tile algorithms, and a distributed scheduling runtime system.

Our experiments with the Cholesky and QR factorizations on the heterogeneous Keeneland system demonstrate good scalability in various environments: clusters with or without GPUs, and shared systems with multiple GPUs. Our future work along this line is to apply the approach to two-sided factorizations and sparse matrices. Another direction is to add NUMA support to our runtime system to improve performance on nodes that have hundreds or even thousands of CPU cores as well as a great number of NUMA memory nodes.


References

[1] E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, J. Langou, H. Ltaief, and S. Tomov. LU factorization for accelerator-based systems. ICL Technical Report ICL-UT-10-05, Innovative Computing Laboratory, University of Tennessee, 2010.

[2] Emmanuel Agullo, Cedric Augonnet, Jack Dongarra, Mathieu Faverge, Hatem Ltaief, Samuel Thibault, and Stanimire Tomov. QR factorization on a multicore node enhanced with multiple GPU accelerators. In IPDPS 2011, Alaska, USA, 2011.

[3] Emmanuel Agullo, Cedric Augonnet, Jack Dongarra, Hatem Ltaief, Raymond Namyst, Jean Roman, Samuel Thibault, and Stanimire Tomov. Dynamically scheduled Cholesky factorization on multicore architectures with GPU accelerators. In Symposium on Application Accelerators in High Performance Computing (SAAHPC), Knoxville, USA, 2010.

[4] Emmanuel Agullo, Jack Dongarra, Bilel Hadri, Jakub Kurzak, Julien Langou, Julie Langou, Hatem Ltaief, Piotr Luszczek, and Asim YarKhan. PLASMA Users' Guide. Technical report, ICL, UTK, 2011.

[5] E. Anderson, Z. Bai, C. Bischof, L. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users' Guide. SIAM, 1992.

[6] Cedric Augonnet, Samuel Thibault, Raymond Namyst, and Pierre-Andre Wacrenier. StarPU: A unified platform for task scheduling on heterogeneous multicore architectures. Concurr. Comput.: Pract. Exper., Special Issue: Euro-Par 2009, 23:187–198, February 2011.

[7] Eduard Ayguade, Rosa M. Badia, Francisco D. Igual, Jesus Labarta, Rafael Mayo, and Enrique S. Quintana-Ortí. An extension of the StarSs programming model for platforms with multiple GPUs. In Proceedings of the 15th International Euro-Par Conference on Parallel Processing, Euro-Par '09, pages 851–862, Berlin, Heidelberg, 2009. Springer-Verlag.

[8] Grey Ballard, James Demmel, Olga Holtz, and Oded Schwartz. Communication-optimal parallel and sequential Cholesky decomposition. In Proceedings of the Twenty-First Annual Symposium on Parallelism in Algorithms and Architectures, SPAA '09, pages 245–252, New York, NY, USA, 2009. ACM.

[9] Olivier Beaumont, Vincent Boudet, Antoine Petitet, Fabrice Rastello, and Yves Robert. A proposal for a heterogeneous cluster ScaLAPACK (dense linear solvers). IEEE Transactions on Computers, 50:1052–1070, 2001.

[10] L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. Whaley. ScaLAPACK Users' Guide. SIAM, 1997.

[11] Pierre Boulet, Jack Dongarra, Yves Robert, and Frédéric Vivien. Static tiling for heterogeneous computing platforms. Parallel Computing, 25(5):547–568, 1999.

[12] James W. Demmel, Laura Grigori, Mark Frederick Hoemmen, and Julien Langou. Communication-optimal parallel and sequential QR and LU factorizations. LAPACK Working Note 204, UTK, August 2008.

[13] Jack J. Dongarra, Piotr Luszczek, and Antoine Petitet. The LINPACK Benchmark: past, present, and future. Concurrency and Computation: Practice and Experience, 15:803–820, 2003.

[14] Manuel Fogue, Francisco D. Igual, Enrique S. Quintana-Ortí, and Robert van de Geijn. Retargeting PLAPACK to clusters with hardware accelerators. FLAME Working Note 42, 2010.

[15] J. R. Humphrey, D. K. Price, K. E. Spagnoli, A. L. Paolini, and E. J. Kelmelis. CULA: Hybrid GPU accelerated linear algebra routines. In SPIE Defense and Security Symposium (DSS), April 2010.

[16] Alexey Lastovetsky and Ravi Reddy. Data distribution for dense factorization on computers with memory heterogeneity. Parallel Comput., 33:757–779, December 2007.

[17] Vladimir Marjanovic, Jesus Labarta, Eduard Ayguade, and Mateo Valero. Overlapping communication and computation by using a hybrid MPI/SMPSs approach. In Proceedings of the 24th ACM International Conference on Supercomputing, ICS '10, pages 5–16, New York, NY, USA, 2010. ACM.

[18] NVIDIA. CUDA Toolkit 4.0 CUBLAS Library, 2011.

[19] Gregorio Quintana-Ortí, Francisco D. Igual, Enrique S. Quintana-Ortí, and Robert A. van de Geijn. Solving dense linear systems on platforms with multiple hardware accelerators. In Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '09, pages 121–130, New York, NY, USA, 2009. ACM.

[20] Sushant Sharma, Chung-Hsing Hsu, and Wu-chun Feng. Making a case for a Green500 list. In IEEE International Parallel and Distributed Processing Symposium (IPDPS 2006) / Workshop on High Performance – Power Aware Computing, 2006.

[21] Fengguang Song, Stanimire Tomov, and Jack Dongarra. Efficient support for matrix computations on heterogeneous multi-core and multi-GPU architectures. LAPACK Working Note 250, UTK, June 2011.

[22] S. Tomov, R. Nath, P. Du, and J. Dongarra. MAGMA Users' Guide. Technical report, ICL, UTK, 2011.

[23] J. S. Vetter, R. Glassbrook, J. Dongarra, K. Schwan, B. Loftis, S. McNally, J. Meredith, J. Rogers, P. Roth, K. Spafford, and S. Yalamanchili. Keeneland: Bringing heterogeneous GPU computing to the computational science community. Computing in Science & Engineering, 13(5):90–95, Sept.–Oct. 2011.

wait4send = wait4recv = 0;
while (!is_done || wait4send) {
    if (!is_done && !wait4recv) {
        call MPI_Irecv(recv_buf, MPI_ANY_SOURCE, &recv_req);  wait4recv = 1;
    }
    if (!wait4send) {
        msg = get_msg(host's out_mbox);
        call MPI_Isend(msg->data, msg->dst_pid, &send_req);   wait4send = 1;
    }
    if (wait4send) {
        call MPI_Test(&send_req);
        if (success)  wait4send = 0;
    }
    if (wait4recv) {
        call MPI_Test(&recv_req);
        if (success) { store recv_buf to the host's local matrix;  wait4recv = 0; }
    }
}

Figure 9: Pseudocode of the MPI communication thread in our distributed runtime system.

A Appendix

A.1 Tile Size Auto-Tuning

We adjust the ratio of the small tiles on the host to the large tiles on the GPUs to keep the load balanced between CPUs and GPUs within each node. Load balancing between nodes is attained automatically by the 2-D block cyclic distribution method.

Given a tile of size B × B, we partition it into two parts. Let Bh = b(s − 1) and Bg = B − Bh. The left partition of size B × Bh is allocated to the host, and the right partition of size B × Bg is allocated to a GPU. We let T(m × n) denote the number of floating-point operations needed to compute a matrix of size m × n, and fcore and fgpu denote the speed (i.e., flop/s) of a CPU core and of a GPU, respectively.

In a block cyclic data distribution (1-D or 2-D), G tiles are allocated to G GPUs such that each GPU has one tile, which takes T(B × Bg)/fgpu time to compute. In contrast, the CPU cores on the host receive the G small partitions (each of size B × Bh) corresponding to those G tiles, which take G × T(B × Bh)/(fcore × NumCores) time to compute.

To achieve load balance, we determine Bh by the following formula:

T(B × Bg) / fgpu = G × T(B × Bh) / (fcore × NumCores)
⇒ T(B × Bh) / T(B × Bg) = (fcore × NumCores) / (fgpu × G)
⇒ Bh = B × (fcore × NumCores) / (fcore × NumCores + fgpu × G)

In practice, fcore or fgpu is the maximum performance of the dominant computational kernel in an algorithm. In addition, we fine-tune the value of Bh automatically. We start from the estimated size Bh and search for an optimal B*h near Bh. We wrote a script that executes a matrix factorization with an input of size N = c0 · B · G, where we set c0 = 3 to reduce the tuning time. The script adjusts the value of Bh to search for the minimum difference between the CPU and the GPU computation time. Note that Bh depends only on the number of CPU cores and the number of GPUs, assuming the system and the computational kernels do not change.

The large tile size B is critical for GPU performance. To determine the best B, we search for the minimal matrix size that provides the best performance for the dominant GPU kernel in an algorithm (e.g., GEMM for the Cholesky factorization). Our search ranges from 256 to 2048, and is performed only once for every new computational library and every new GPU architecture.
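The closed form for Bh translates into a one-line helper; the rates in the comment are made-up numbers for illustration, not measured Keeneland kernel rates.

/* Width (in columns) of the host partition of a B x B tile so that G GPUs and
 * NumCores CPU cores finish their shares at the same time; fcore and fgpu are
 * the per-core and per-GPU kernel rates in flop/s. */
int host_partition_width(int B, int num_cores, int G, double fcore, double fgpu)
{
    double cpu_rate = fcore * num_cores;
    double Bh = B * cpu_rate / (cpu_rate + fgpu * G);
    return (int)(Bh + 0.5);   /* rounded estimate; the tuning script searches near it */
}

/* Example with made-up rates: B = 1024, 12 cores at 10 Gflop/s each, and
 * 3 GPUs at 300 Gflop/s each give Bh ~= 1024 * 120 / (120 + 900) ~= 120. */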

A.2 Communication Cost Analysis

We count the number of messages and the number of words communicated by the process that has the most communication among all processes. Assume there are P processes (one process per node), and that a broadcast between processes is implemented with a tree topology. We use Pr and Pc to denote a Pr × Pc process grid, where P = Pr · Pc. In the multi-level block cyclic data distribution, we use n, B, and s to denote the matrix size, the top-level tile size, and the number of partitions of each top-level tile, respectively.

A.2.1 Distributed Heterogeneous Tile Cholesky Factorization

At each iteration, we first broadcast the diagonal block down the panel (i.e., log Pr messages); next, each process that owns data on the panel broadcasts to the Pc + Pr processes that are waiting for the panel data (i.e., (#rows / (B·Pr)) · log(Pc + Pr) messages, where #rows = n − ⌊j/s⌋·B at the j-th iteration). The number of messages is therefore:

msg_chol = Σ_{j=0}^{(n/B)s − 1} [ log Pr + ((n − ⌊j/s⌋B) / (B·Pr)) · log(Pc + Pr) ]
         = (ns/B) · log Pr + (n²s / (2B²Pr)) · log(Pc + Pr)
         = (ns/(2B)) · log P + (n²s / (4B²√P)) · log P + n²s / (2B²√P)

And the number of words is:

word_chol = (nB/4) · log P + (n² / (4√P)) · log P + n² / (2√P)
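For a quick numerical check, the closed forms above can be evaluated directly; the helper below is ours and assumes Pr = Pc = √P and base-2 logarithms, as in the derivation.

#include <math.h>

/* Evaluate the message and word counts of the heterogeneous tile Cholesky
 * factorization for a given matrix size n, tile size B, split factor s,
 * and process count P. */
void chol_comm_cost(double n, double B, double s, double P,
                    double *msgs, double *words)
{
    double lg = log2(P), rp = sqrt(P);
    *msgs  = n * s / (2.0 * B) * lg
           + n * n * s / (4.0 * B * B * rp) * lg
           + n * n * s / (2.0 * B * B * rp);
    *words = n * B / 4.0 * lg
           + n * n / (4.0 * rp) * lg
           + n * n / (2.0 * rp);
}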

A.2.2 Distributed Heterogeneous Tile QR Factorization

In the tile QR factorization, we can stack v adjacent tiles to form a virtual tile, which is always allocated to the same host or GPU. At the j-th iteration, each process has ((n − ⌊j/s⌋B)/(vB·Pr)) × ((n − ⌊j/s⌋B)/(B·Pc)) virtual tiles of size (vB) × B. Since one B × B tile out of every virtual tile is sent down to the process below it as a message (there is no message if Pr = 1), and every tile on the panel is broadcast to the Pc processes to its right, the number of messages is:

msg_qr = Σ_{j=0}^{(n/B)s − 1} [ ((n − ⌊j/s⌋B) / (vB·Pr)) · ((n − ⌊j/s⌋B) / (B·Pc)) + ((n − ⌊j/s⌋B) / (B·Pr)) · log Pc ]
       = Σ_{j=0}^{(n/B)s − 1} [ (n − ⌊j/s⌋B)² / (vB²P) + (n − ⌊j/s⌋B) · log Pc / (B·Pr) ]
       = (1 / (vB²P)) · (n³s / (3B)) + (n²s / (2B²Pr)) · log Pc
       = n³s / (3vB³P) + (n²s / (4B²√P)) · log P

And the number of words is:

word_qr = n³ / (3vBP) + (n² / (4√P)) · log P
        = ( n / (3vB√P · log P) + 1/4 ) · (n² / √P) · log P

If we set the virtual tile size v to n/(B·Pr), then vB√P is equal to n. Therefore, msg_qr = n²s / (3B²√P) + (n²s / (4B²√P)) · log P, and word_qr = ( 1/(3 log P) + 1/4 ) · (n² / √P) · log P.

A.2.3 Comparison with ScaLAPACK

Table 1 compares the heterogeneous tile algorithms with the communication lower bounds (LB) and with the ScaLAPACK subroutines in terms of the number of words (i.e., communication volume) and the number of messages. We can see that the heterogeneous tile Cholesky factorization attains the communication volume lower bound to within a logarithmic factor. The communication volume of the heterogeneous tile QR factorization is greater than the lower bound by a factor of ( n/(3vB√P) + (1/4) log P ). Note that we can increase the value of v to reduce QR's communication volume so that it reaches the lower bound to within a factor of ( 1/3 + (1/4) log P ).


Table 1: Communication cost of the heterogeneous tile algorithms. We assume square matrices, and Pr = Pc = √P.

                      #words                                        #messages
Cholesky LB [8]       Ω(n²/√P)                                      Ω(√P)
PDPOTRF [8]           (nb/4 + n²/√P) · log P                        (3n/(2b)) · log P
Hetero. Cholesky      ((1/4) log P + 1/2) · n²/√P                   (n²s/(4B²√P)) · (log P + 2)
QR LB [12]            Ω(n²/√P)                                      Ω(√P)
PDGEQRF [12]          ((3/4) n²/√P + (3/4) nb) · log P              (3/2 + 5/(2b)) · n · log P
Hetero. QR            (n/(3vB√P log P) + 1/4) · (n²/√P) · log P     n³s/(3vB³P) + (n²s/(4B²√P)) · log P
