

A Scalable Framework for Heterogeneous GPU-Based Clusters ∗

(Regular Submission)

Fengguang Song
Innovative Computing Laboratory
University of Tennessee
Knoxville, TN 37996-3450
[email protected]

Jack Dongarra
University of Tennessee
Oak Ridge National Laboratory
University of Manchester
[email protected]

Abstract

GPU-based heterogeneous clusters continue to draw attention from vendors and HPC users due to their high energy efficiency and much improved single-node computational performance; however, there is little parallel software available that can utilize all CPU cores and all GPUs on such heterogeneous systems efficiently. On a heterogeneous cluster, the performance of a GPU (or a compute node) increases at a much faster rate than the performance of the PCI-Express connection (or the interconnection network), so that communication eventually becomes the bottleneck of the entire system. To overcome the bottleneck, we developed a multi-level partitioning and distribution method that guarantees a near-optimal communication volume. We have also extended heterogeneous tile algorithms to work on distributed-memory GPU clusters. Our main idea is to execute a serial program and generate hybrid-size tasks, and to follow a dataflow programming model to fire the tasks on different compute nodes. We then devised a distributed dynamic scheduling runtime system to schedule tasks and transfer data between hybrid CPU-GPU compute nodes transparently. The runtime system employs a novel distributed task-assignment protocol to solve data dependencies between tasks without coordination between processing units. The runtime system on each node consists of a number of CPU compute threads, a number of GPU compute threads, a task generation thread, an MPI communication thread, and a CUDA communication thread. By overlapping computation and communication through dynamic scheduling, we are able to attain a high performance of 75 TFlops for Cholesky factorization on the heterogeneous Keeneland system [24] using 100 nodes, each with twelve CPU cores and three GPUs. Moreover, our framework can also attain high performance on distributed-memory clusters without GPUs and on shared systems with multiple GPUs.

keywords: Dense linear algebra, distributed runtime systems, hybrid CPU-GPU, heterogeneous clusters.

∗This material is based upon work supported by the NSF grants CCF-0811642, OCI-0910735, by the DOE grant DE-FC02-06ER25761, by Nvidia, and by Microsoft Research.


1 Introduction

Based on the November 2011 Green500 list [21], twenty-three of the top thirty greenest supercomputers are GPU-based. However, there is little software that can take advantage of large-scale heterogeneous systems efficiently, especially software that utilizes all CPU cores and all GPUs. Considering that many operations of scientific computing applications are carried out through numerical linear algebra libraries, we focus on providing fundamental linear algebra operations on the new heterogeneous architectures.

A great amount of effort has gone into the implementation of linear algebra libraries. LAPACK [5], Intel MKL, AMD ACML, and PLASMA [4] are mainly designed for shared-memory multicore machines. ScaLAPACK [10] is intended for distributed-memory CPU-based machines. CUBLAS [19], MAGMA [23], and CULA [16] provide a subset of the LAPACK subroutines but work on a single GPU. So far these libraries do not support computations using multiple CPU cores and multiple GPUs on a single node, not to mention distributed GPU-based clusters. Moreover, with an increasing number of cores on the host, whose performance continues to keep up with GPU performance, new parallel software should not ignore either GPUs or CPUs.

Our work aims to provide a unified framework to solve linear algebra problems on any number of CPU cores, any number of GPUs, and for either shared-memory or distributed-memory systems. Our solution consists of three essential components: (1) a static multi-level data distribution method, (2) heterogeneous tile algorithms, and (3) a distributed dynamic scheduling runtime system. The solution works as follows. Given a matrix input, we first split it into tiles of hybrid sizes. Then we distribute the tiles to host main memories and GPU device memories on a cluster with a static method. Each compute node runs a runtime system (launched as an MPI process) that schedules tasks within the node dynamically. Different nodes communicate with each other by means of MPI messages. Our runtime system follows the data-flow programming model and builds a partial directed acyclic graph (DAG) dynamically, where a completed task will trigger a set of new tasks in the DAG.

We use a static multi-level distribution method to allocate data to the different hosts and GPUs. Each compute node is heterogeneous since it has both CPUs and GPUs, but different nodes have the same performance. Therefore, we design a multi-level distribution method. At the top (i.e., inter-node) level, we use a 2-D block cyclic distribution method. At the second (i.e., intra-node between different GPUs) level, we allocate a node's local blocks only to GPUs with a 1-D or 2-D block cyclic method. At the third (i.e., intra-node between CPUs and GPUs) level, we cut a slice from each GPU's local block and assign it to the host. The output of the multi-level distribution method is that each matrix block is uniquely assigned to the host or to a GPU of a specific node.

We also use heterogeneous tile algorithms to handle the difference between CPUs and GPUs. In these algorithms, there are a great number of small tasks for CPU cores and a great number of large tasks for GPUs to compute concurrently at any time. While a heterogeneous tile algorithm is based on tiles, it has two types of tiles: small ones for CPU cores and large ones for GPUs. Our work combines the heterogeneous tile algorithms and the multi-level distribution method so that the algorithms are applicable to heterogeneous clusters with hybrid CPUs and GPUs.

We design a distributed scheduling runtime system for heterogeneous clusters. Each compute node executes a runtime system that can solve data dependencies dynamically and transfer data from a parent task to its children transparently. All runtime systems (one per node) proceed in parallel and execute a task-assignment protocol to build subsets (or partitions) of a DAG dynamically. No communication is required when building the DAG. The protocol guarantees that all runtime systems make a unanimous decision without coordinating with each other, such that every task is computed by one and only one processing unit (on a host or a GPU).

Our experiments with double-precision Cholesky and QR factorizations on the heterogeneous Keeneland system [24] at the Oak Ridge National Laboratory demonstrate great scalability from one to 100 nodes using all CPUs and GPUs. In addition, we apply our framework to two other possible environments: clusters without GPUs, and shared systems with multiple CPUs and multiple GPUs. Compared with vendor-optimized and open source



libraries (i.e., Intel MKL 10.3.5 and StarPU 0.9.1 [6]), our framework is able to provide up to 75% better performance than Intel MKL on clusters without GPUs, and up to 250% better performance than StarPU on shared-memory multi-GPU systems.

2 Background

We place great emphasis on communication in our framework design. On a host that is attached to multiple GPUs through PCI-Express connections, the ratio of computation to communication on the GPUs keeps increasing. Eventually the communication time on a PCI-Express connection will become the bottleneck of the entire system.

Researchers often use dynamic scheduling methods to support heterogeneous systems, where each CPU core or GPU picks up a ready task from task queues independently whenever the processing unit becomes idle. At the beginning of our design, we also implemented a dynamic scheduling runtime system. In our dynamic runtime system, all the CPU cores and GPUs share a global ready task queue, and each GPU owns a software cache in its device memory. Whenever a GPU reads a block of data from the host, it stores the data in its software cache. All the data in the GPUs' software caches are also backed up by the main memory on the host. We have used two cache writing policies: write-through and write-back. With the write-through policy, every modification to the software cache must be updated to the host main memory immediately. With the write-back policy, a modification to the software cache is updated to the host main memory only when a different device wants to access the modified data. To achieve the best performance, the software cache size on each GPU is configured to be as large as the input matrix size, to eliminate capacity cache misses.

Figure 1 shows our experiments with the double-precision Cholesky factorization on a single node of the Keeneland system using 12 cores and 3 Nvidia Fermi GPUs. In the figure, we compare our software-cache based dynamic scheduling runtime system, the generic dynamic scheduling runtime system of StarPU [6], and our distributed-GPU framework that builds upon a static data distribution method. By changing from the write-through policy to the write-back policy, we can improve the program performance greatly due to reduced communication.

Figure 1: A comparison between dynamic scheduling systems and our distributed-GPU framework (Gflops versus matrix size for the Distri-GPUs framework, the write-back cache, the write-through cache, and StarPU 0.9.1).

In contrast, StarPU relies on profiling, performance modeling, and sophisticated scheduling policies to achieve load balancing and to reduce data transfers. However, since our static data distribution method can guarantee a near lower-bound communication cost and has less scheduling overhead, our framework is faster than StarPU by up to 250% for small to medium matrix sizes. This has inspired us to employ a static data distribution method on GPU-based clusters. Here we emphasize that despite its better performance, our framework is intended for solving dense linear algebra problems, while StarPU is more generic and can support other applications.

3 Heterogeneous Tile Algorithms

Our previous work designed a class of heterogeneous tile algorithms and applied them to shared-memory multi-GPU systems [22]. Here we describe them briefly for completeness. In the algorithms, every task takes several individual tiles as input and output. On a heterogeneous system with CPUs and GPUs, we create tiles of two different sizes.

Figure 2 shows a matrix that is stored in a tile data layout with two different tile sizes. All the tasks that modify small tiles are to be executed by CPU cores, and those that modify large tiles are to be executed by GPUs. Given a matrix with three tile rows and six tile columns, Cholesky factorization can be computed recursively as follows. At the first iteration, (1) we solve Cholesky factorizations on tiles A11, A21, and A31 in the first column, that is, L11 ← Cholesky(A11) and Li1 ← Ai1(L11^T)^-1; (2) we then update the trailing submatrix located between the second and the sixth (last) tile column, that is, Aij ← Aij − Li1 Lj1^T. The Cholesky factorization can be completed by recursively applying the above two steps to the trailing submatrix that starts from the j-th tile column. If a matrix has n tile columns, the factorization takes n iterations to complete. The same idea can be applied to other matrix factorizations (e.g., the QR and LU factorizations). For more details, please refer to our paper [22] for the exact algorithms of the Cholesky and QR factorizations.

Figure 2: An example of the heterogeneous tile algorithm for Cholesky factorization.
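
To make the loop structure of the tile algorithm concrete, the following sketch shows a right-looking tiled Cholesky factorization over an nt × nt grid of tiles, ignoring the small/large tile split used for CPUs and GPUs. The tile kernels potrf_tile, trsm_tile, syrk_tile, and gemm_tile are hypothetical placeholders for the corresponding LAPACK/BLAS tile operations, not the exact routines used in our framework.

/* Hypothetical tile kernels (e.g., wrappers around LAPACK/BLAS routines). */
void potrf_tile(double *Akk);
void trsm_tile(const double *Lkk, double *Aik);
void syrk_tile(const double *Ljk, double *Ajj);
void gemm_tile(const double *Lik, const double *Ljk, double *Aij);

/* Tiled Cholesky factorization (lower triangular); A holds nt*nt tile pointers
 * in row-major order, so A[i*nt + j] is tile [i, j]. */
void tile_cholesky(double **A, int nt)
{
    for (int k = 0; k < nt; k++) {
        potrf_tile(A[k*nt + k]);                   /* L[k][k] <- Cholesky(A[k][k])           */
        for (int i = k + 1; i < nt; i++)
            trsm_tile(A[k*nt + k], A[i*nt + k]);   /* L[i][k] <- A[i][k] * (L[k][k]^T)^-1    */
        for (int j = k + 1; j < nt; j++) {
            syrk_tile(A[j*nt + k], A[j*nt + j]);   /* A[j][j] <- A[j][j] - L[j][k]*L[j][k]^T */
            for (int i = j + 1; i < nt; i++)
                gemm_tile(A[i*nt + k], A[j*nt + k], A[i*nt + j]);
                                                   /* A[i][j] <- A[i][j] - L[i][k]*L[j][k]^T */
        }
    }
}

In the heterogeneous tile algorithm, every call in this loop nest becomes a task, and the tile it writes (small or large) determines whether the task runs on a CPU core or on a GPU.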

3.1 Static Multi-Level Block Cyclic Data Distribution

This section presents a new multi-level partitioning scheme to create small and large tiles on distributed-memory heterogeneous clusters. (1) At the top level, we divide a matrix into p × p large tiles, each of which is of size B × B. (2) We distribute the p × p large tiles to a Pr × Pc process grid using a 2-D block cyclic method. There is one process per node. (3) At the bottom level (i.e., within each node), we vertically cut every large tile of size B × B on each node into (s − 1) small tiles of size B × b and one remaining large tile of size B × (B − (s − 1) · b). We allocate the small tiles to the entire set of CPU cores on the host, while allocating the remaining large tiles to GPUs using a 1-D or 2-D block cyclic method. So far we use a 1-D method because of the small number of GPUs (i.e., ≤ 4) on each compute node.

After the multi-level block cyclic data distribution, each node is assigned (p/Pr) × (p·s/Pc) rectangular tiles. Given a tile indexed by [I, J], we first map it to node Ni and then to device Dj, where D0 denotes the host and Dj≥1 denotes the j-th GPU located on node Ni. Assuming each node has G GPUs, we can calculate node Ni (0 ≤ i ≤ PrPc − 1) and device Dj as follows:

  Ni = (I mod Pr) · Pc + (⌊J/s⌋ mod Pc)

  Dj = 0                          if (J mod s) < s − 1
  Dj = (⌊J/(s·Pc)⌋ mod G) + 1     if (J mod s) = s − 1

In other words, Step (2) distributes the large tiles across the Pr × Pc nodes (determining Ni). Next, within each node, the tile columns whose indices J satisfy (J mod s) = s − 1 are mapped to the node's G GPUs in a 1-D cyclic way, and the rest of the columns are mapped to all the CPUs on the host (determining Dj). Although small tasks are assigned to all the CPUs, each CPU core can pick up any small task independently (i.e., not in a fork-join manner). We also tune the tile sizes B and b to achieve the highest performance. Appendix A.1 describes how we choose the best tile sizes.
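
A minimal sketch of this mapping, assuming the reconstructed formulas for Ni and Dj above are taken as given; the struct and function names are illustrative rather than the ones used in our implementation.

/* Map tile [I, J] to a node index and a device index (0 = host, 1..G = GPUs),
 * following the multi-level block cyclic distribution described above. */
typedef struct { int node; int device; } placement_t;

placement_t map_tile(int I, int J, int Pr, int Pc, int s, int G)
{
    placement_t p;
    p.node = (I % Pr) * Pc + ((J / s) % Pc);       /* 2-D block cyclic over the Pr x Pc grid */
    if (J % s < s - 1)
        p.device = 0;                              /* a small tile column -> host CPU cores  */
    else
        p.device = ((J / s) / Pc) % G + 1;         /* a large tile column -> 1-D cyclic GPUs */
    return p;
}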

4 Our Framework Overview

We design a distributed dynamic scheduling runtime system for heterogeneous GPU-based clusters. Given a cluster with P nodes, we launch P MPI processes (one process per node), each of which executes an instance of the runtime system. We also assume a matrix is stored in the hybrid tile data layout that uses two different tile sizes.

Not only do we distribute data to hosts and GPUs on different nodes statically, but we also distribute tasks to hosts and GPUs statically. We require that the location of a task be the same as the location of the task's output. Although a task's allocation is static, we schedule tasks dynamically within a host or GPU in order to reduce synchronization points and to overlap computation with communication.

Our runtime system follows a dataflow programming model and is essentially data-availability driven. Whenever a parent task completes, it triggers its child tasks immediately. The runtime system can identify data dependencies between tasks and unroll



a directed acyclic graph (DAG) dynamically. Note that the DAG is never created and stored in our runtime system explicitly. A parallel program starts with the entry task and finishes with the exit task of the DAG. The runtime system is also responsible for transferring data from a parent task to its children transparently.

Each runtime system instance is multi-threaded. It creates five types of threads: a task-generation thread, a set of CPU compute threads for CPU cores, a set of GPU management threads for GPUs, an inter-node MPI communication thread, and an intra-node CUDA communication thread. If a machine is not distributed, the MPI communication thread is not created. Similarly, if there is no GPU, the CUDA communication thread is not created.

The task-generation thread creates tasks (similar to a CPU issuing instructions) and drives the execution of a user program. There are actually P task-generation threads running on the P compute nodes. All the task-generation threads execute the same serial program independently and create task instances for the program without communication. They execute a distributed task-assignment protocol. Based on the common knowledge of the static multi-level distribution, each task-generation thread is able to decide by itself which tasks it should compute and where each task's children are located, without exchanging any messages.

5 The Framework Implementation

As shown in Fig. 3, our runtime system consists of seven components, listed as follows. (1) Task window: a fixed-size task queue that stores all the generated but not yet finished tasks. (2) Ready task queues: a number of ready task lists. (3) Task-generation thread: a single thread that executes a serial program and generates new tasks. (4) CPU compute threads: there is a compute thread running on each CPU core. (5) GPU management (or compute) threads: there is a GPU management thread for each GPU. (6) MPI communication thread: a single thread that transfers data between different nodes. (7) CUDA communication thread: a single thread that transfers data among the host and multiple GPUs within the same node using cudaMemcpyAsync.

Figure 3: The runtime system architecture on each hybrid CPU-GPU compute node (the task-generation thread, the task window, the per-device ready task queues and mailboxes, the CPU compute threads on the multicore host, the GPU management threads, and the MPI and CUDA communication threads).

5.1 Different Task Queues

The task window stores generated tasks in a singly-linked list. It also keeps the original sequential order between tasks. Each task stores the information of the task's input and output. Based on the input and output information, when a task modifies its output, the runtime system can scan the list to search for the tasks that are waiting for that output. However, a global task list is often too long to search and can result in severe contention between threads.

To make accesses to the task window faster, we use 2-D task lists to implement the task window. As illustrated in Fig. 4, each tile in a matrix has its own task list. If a task's input or output is tile [I, J], the runtime system will add an instance of the task to [I, J]'s task list. When a matrix is distributed to different compute nodes, we partition the 2-D task lists across the nodes according to the location of the tiles. That is, if tile [I, J] is allocated to node Pk, tile [I, J]'s task list is also assigned to node Pk.

A ready task queue stores "ready-to-go" tasks whose inputs are all available. If a task writes to a tile that belongs to the host or a GPU, it is added to that host's or GPU's ready task queue correspondingly. Task stealing has not been implemented, to avoid unnecessary data transfers and to increase data reuse. In addition, a ready task in our implementation is simply a pointer that points to a task stored in the task window.
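
The following declarations sketch one plausible layout of the task window and ready task queues described above; the field and type names are illustrative assumptions, not the exact data structures of our runtime system.

#define MAX_DEPS 4                     /* assumed upper bound on inputs per task  */

typedef struct task {
    void  (*kernel)(void **args);      /* the compute kernel to execute           */
    int    num_inputs;
    int    num_ready;                  /* inputs that have become available       */
    int    in[MAX_DEPS][2];            /* tile indices [I, J] read by the task    */
    int    out[2];                     /* tile index [I, J] written by the task   */
    struct task *next;                 /* next task in the per-tile list          */
} task_t;

typedef struct {                       /* one list per tile: the 2-D task window  */
    task_t *head, *tail;
} tile_task_list_t;

typedef struct {                       /* one queue per host and per GPU; entries
                                          are pointers into the task window       */
    task_t **entries;
    int head, tail, capacity;
} ready_queue_t;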



5.2 Solving Data Dependencies

A tile's task list keeps the serial semantic order between the tasks that either read or write the tile. Whenever two tasks access the same tile and at least one of them is a write, the runtime system detects a data dependency and stalls the successor until the predecessor is finished. Here we only consider the true dependency RAW (read after write), and use renaming to eliminate the WAR (write after read) and WAW (write after write) dependencies. Figure 5 shows an example of the task list used for tile A[i, j], where Tasks 1-3 are waiting for Task 0's output.

There are two operations that access a task list: FIRE and APPEND. After a task completes and modifies its output tile [I, J], the FIRE operation searches [I, J]'s task list for the tasks that want to read [I, J]. It scans the list from the just-completed task to find which tasks are waiting for [I, J]. The scan exits when it encounters a task that wants to write to [I, J]. If a task is visited before the scan exits, the FIRE operation marks the task's corresponding input as "ready". When all the inputs of a task become ready, the task becomes a ready task.

The APPEND operation is invoked only by the task-generation thread. After generating a new task, the generation thread inspects every input and output of the task. Given an input that reads tile [I, J], before appending the input task instance, APPEND scans [I, J]'s task list from the list head to find whether there exists a task that writes to tile [I, J]. If none of them writes to [I, J], the input task instance is marked as "ready"; otherwise, it is marked as "unready". By contrast, given an output [I, J], APPEND puts an output task instance at the end of [I, J]'s task list immediately.

Figure 4: The 2-D task window implementation. Each tile has its own list of tasks that either read or write the tile.
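
A sketch of the FIRE operation under the data structures assumed in the previous sketch: it scans tile [I, J]'s list starting right after the finished task, marks readers of [I, J] as having one more ready input, and stops at the next writer. The helpers task_reads, task_writes, owner_queue_of, and enqueue_ready are assumptions, not actual functions of our runtime system.

int  task_reads(const task_t *t, int I, int J);   /* does t read tile [I, J]?        */
int  task_writes(const task_t *t, int I, int J);  /* does t write tile [I, J]?       */
ready_queue_t *owner_queue_of(const task_t *t);   /* queue of t's assigned device    */
void enqueue_ready(ready_queue_t *q, task_t *t);

void fire(task_t *finished, int I, int J)
{
    for (task_t *t = finished->next; t != NULL; t = t->next) {
        if (task_writes(t, I, J))
            break;                                /* a later writer ends the scan    */
        if (task_reads(t, I, J)) {
            t->num_ready++;                       /* this input of t is now available */
            if (t->num_ready == t->num_inputs)
                enqueue_ready(owner_queue_of(t), t);
        }
    }
}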

5.3 Computation Component

Each CPU core has a CPU compute thread. Whenever a CPU compute thread becomes idle, it picks up a ready task from the host's shared ready task queue and computes it on its own. After finishing the task, it invokes the FIRE operation to determine which tasks are the children of the finished task, and moves them to a ready task queue if possible.

Similarly, each GPU has a GPU compute thread. A GPU compute thread is essentially a GPU management thread, which runs on the host but can launch GPU kernels quickly. For convenience, we think of the GPU management thread as a powerful compute thread. If a node has g GPUs and n CPU cores, our runtime system will launch g GPU compute threads to represent (or manage) the g GPUs, and (n − g − 2) CPU compute threads to represent the remaining CPU cores. The remaining number of CPU cores is not equal to (n − g) because we use one CPU core for MPI communication and another for CUDA memory copies.

5.4 Communication Component

There are two types of communication on a GPU-based cluster: communication between nodes, and communication within a node. On each node, we launch one thread to perform MPI operations to transfer data between different nodes, and another thread to copy memory among the host and the different GPUs on the same node.

The technique of CUDADirect V2.0 supports direct memory copies between GPUs on the same node. It may also send or receive GPU buffers on different nodes directly if the MPI library supports CUDADirect. To make our framework more portable, we choose to move data from the GPU to the host on the source node first, and then send it to the destination node.

Figure 5: Solving data dependencies for a set of tasks that read or write tile A[i, j]. Task 0 writes A[i, j], Tasks 1-3 read A[i, j], and Task 4 writes A[i, j] again. The completion of Task 0 will fire Tasks 1-3, which want to read A[i, j].



After the destination host receives the data, it copies the data from its host memory to one or more of its GPUs.

The MPI communication thread runs on a dedicated CPU core. It calls nonblocking MPI point-to-point operations to send and receive messages. At the beginning, the thread posts an MPI_Irecv operation and an MPI_Isend operation. Next, it checks whether the pending receive or send operation has finished by busy-polling. Whenever an operation finishes, it posts a new operation to replace the finished one so that there are always two operations (one receive and one send) ongoing at the same time. Figure 9 in Appendix A shows our pseudocode for the MPI communication thread. In the code, wait4send and wait4recv indicate whether there exists a pending send or receive operation.

The CUDA communication thread also uses a dedicated CPU core on the host. Each GPU has two mailboxes: an out_mbox and an in_mbox. The messages in an out_mbox go from the GPU to other devices, and the messages in an in_mbox go from other devices to the GPU itself. We create two CUDA streams for each GPU: one for outgoing traffic and the other for incoming traffic. Similar to the MPI communication thread, the CUDA communication thread tries to keep one incoming memory copy and one outgoing memory copy in flight for each GPU simultaneously. If there are g GPUs, there will be 2g cudaMemcpyAsync operations happening concurrently, two per GPU. To implement the CUDA communication thread, wait4send and wait4recv are changed to bitsets, where the i-th bit denotes the status of the i-th GPU. We have also implemented select_GPU_streams as a substitute for MPI_Test so that we can test in which GPU streams the asynchronous cudaMemcpy operations have finished.
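
The sketch below shows the outgoing-traffic half of the CUDA communication thread for one GPU; the incoming half is symmetric. out_pending plays the role of the wait4send bitset mentioned above, and msg_t, get_msg, and gpu_out_mbox are assumed helpers rather than the exact ones in our runtime system.

#include <stddef.h>
#include <cuda_runtime.h>

typedef struct { void *host_buf; void *dev_buf; size_t size; } msg_t;

extern msg_t *get_msg(void *mbox);        /* assumed: returns NULL if the mailbox is empty */
extern void  *gpu_out_mbox[];             /* assumed: one out_mbox per GPU                 */

void poll_outgoing(int g, cudaStream_t out_stream[], unsigned *out_pending)
{
    if (!(*out_pending & (1u << g))) {                 /* no copy in flight for GPU g      */
        msg_t *m = get_msg(gpu_out_mbox[g]);
        if (m != NULL) {
            cudaSetDevice(g);
            cudaMemcpyAsync(m->host_buf, m->dev_buf, m->size,
                            cudaMemcpyDeviceToHost, out_stream[g]);
            *out_pending |= (1u << g);
        }
    } else if (cudaStreamQuery(out_stream[g]) == cudaSuccess) {
        *out_pending &= ~(1u << g);                    /* the asynchronous copy finished   */
        /* hand the data to the MPI thread or to another GPU's in_mbox here */
    }
}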

5.5 Data Storage and Management

The host and each GPU have an indirect data structure to store matrices. Given a matrix with p × q tiles, the indirect structure consists of p × q pointers, each pointing to a tile. A pointer is null if the corresponding tile is not stored on the host or GPU. We store a GPU's indirect data structure in host memory, but the pointers in the GPU's indirect structure actually point to GPU device memory. By using the indirect data structure, a GPU compute thread can simply look up the structure and pass correct arguments (i.e., GPU device pointers) when launching GPU kernels.
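
A minimal sketch of the indirect structure described above; the names are illustrative. For a GPU's copy of the structure, the pointer array itself lives in host memory while each non-null entry refers to GPU device memory.

typedef struct {
    int      p, q;         /* number of tile rows and tile columns          */
    double **tiles;        /* p*q entries; NULL means the tile is not local */
} tile_map_t;

static inline double *tile_addr(const tile_map_t *m, int I, int J)
{
    return m->tiles[I * m->q + J];   /* a device pointer when m describes a GPU */
}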

Our runtime system can transfer data from a parent task to its children automatically; however, it does not know how long the data should persist in the destination device. We provide programmers with a special function, Release_Tile(), to free data. Release_Tile does not free any memory, but sets a marker in the task window. The marker tells the runtime system that the tile will not be needed by any future tasks (i.e., tasks after the marker) and that it is safe to free the tile whenever possible. When a programmer writes a sequential program, he or she can add Release_Tile() to the program just like calling the ANSI C function free. The task-generation thread keeps track of the expected number of visits to each tile. Meanwhile, the compute threads count the actual number of visits to each tile. The runtime system will free a tile if and only if: (i) Release_Tile has been called to mark the tile, and (ii) the actual number of visits is equal to the expected number of visits to the tile. In essence, this is an asynchronous deallocation method with which a dynamic runtime system can decide when it is safe to free data.
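
The deallocation rule can be sketched as a per-tile reference count, assuming hypothetical counter names; a tile is freed only after Release_Tile has marked it and the actual number of visits has caught up with the expected number recorded by the task-generation thread.

#include <stdlib.h>

typedef struct {
    int     released;          /* set when Release_Tile() marks the tile          */
    int     expected_visits;   /* counted by the task-generation thread           */
    int     actual_visits;     /* counted by the compute threads                  */
    double *data;
} tile_meta_t;

void visit_and_maybe_free(tile_meta_t *t)
{
    t->actual_visits++;
    if (t->released && t->actual_visits == t->expected_visits) {
        free(t->data);         /* or cudaFree() for a tile kept in device memory  */
        t->data = NULL;
    }
}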

6 Distributed Task Assignment Protocol

Numerous runtime systems (one per node) execute the same code and generate the same set of tasks, so a task may be duplicated on every node. We design a protocol to guarantee that each task is computed by one and only one processing unit (CPU or GPU), and that an input is sent to a waiting task only once.

Given a task with k1 inputs, all the runtime systems across the cluster will generate k1 input task instances in total. It is exactly k1 instances because each input belongs to exactly one node and only that node will claim ownership of the input. We also define the first output of a task as the main output, and the remaining outputs as minor outputs.

Our runtime system generates eight types of task instances using a set of rules. The rationale behind the rules is that when all runtime systems look at the same input or output, they should make a unanimous decision merely based on a predefined



distribution (i.e., the multi-level block cyclic distribution), without any communication. Note that the following Rules 1, 2-4, and 5-8 are used to generate a task's main output, inputs, and minor outputs, respectively.

1. Owner. Each runtime system looks at a newly generated task's main output. If the main output is assigned to the host or a GPU on node_i based upon the static data distribution, only node_i's runtime system will create an owner task instance. An owner task instance stores the complete information of the task (i.e., its inputs, outputs, and the ready status of each input).

2. Native input. Each input of a new task is checked by every runtime system. If the input and the task's main output are assigned to the same host or GPU (e.g., on node_i), only node_i's runtime system will create a native-input task instance. The native-input instance stores a pointer pointing to the task's owner instance.

3. Intra-node alien input. Unlike Rule 2, if the input and the task's main output belong to the same node (e.g., node_i) but to different devices, the runtime system on node_i will create an intra-node alien-input task instance. The intra-node alien-input instance also stores a pointer pointing to the task's owner instance.

4. Inter-node alien input. Unlike Rule 3, if the input and the task's main output belong to different nodes, and the input belongs to node_i, the runtime system on node_i will create an inter-node alien-input task instance. The inter-node alien-input instance stores the location of the task's main output.

5. Native minor-output. Every runtime system looks at each minor output of a newly generated task. If the minor output and the task's main output belong to the same host or GPU (e.g., on node_i), the runtime system on node_i will create a native minor-output task instance. The task's owner instance stores a pointer pointing to the native minor-output instance.

6. Sink minor-output. Unlike Rule 5, if the minor output and the main output belong to different devices (regardless of nodes), and the minor output is assigned to node_j, the runtime system on node_j will create a sink minor-output task instance. The sink instance expects its corresponding source instance to send data to it.

7. Intra-node source minor-output. If the minor output and the main output belong to different devices on the same node (e.g., node_i), the runtime system on node_i will create an intra-node source minor-output task instance. The intra-node source minor-output instance stores a pointer pointing to its corresponding sink instance (generated by Rule 6) on the same node.

8. Inter-node source minor-output. If the minor output and the main output belong to different nodes, and the main output is assigned to node_i, the runtime system on node_i will create an inter-node source minor-output task instance. The inter-node source minor-output instance stores the location of its corresponding sink instance on the remote node.

Since we require that the location of an owner task instance be where the task computation occurs, our runtime system is designed to link a task's input instances, minor-output instances, and owner instance together so that the availability of an input triggers the owner. In our runtime system, the linking information is either a pointer or the location of the owner task instance. Also, by distinguishing intra-node from inter-node instances, the runtime system can decide whether it needs to copy data to a different device, or even send an MPI message to a different node, in order to fire a child task.

A distinctive feature of the protocol is that all the runtime systems can follow the same rules to generate tasks and solve data dependencies in an embarrassingly parallel manner without any communication (except for the actual data transfers).
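
As an illustration of Rules 2-4, the sketch below shows how a runtime system could classify one input of a newly generated task using only the static mapping from the earlier map_tile sketch; the enum and function names are illustrative, and Rules 1 and 5-8 follow the same pattern for the main output and the minor outputs.

typedef enum {
    NO_INSTANCE,                 /* another node owns this input                 */
    NATIVE_INPUT,                /* Rule 2: input and main output on same device */
    INTRA_NODE_ALIEN_INPUT,      /* Rule 3: same node, different devices         */
    INTER_NODE_ALIEN_INPUT       /* Rule 4: main output on a different node      */
} input_kind_t;

input_kind_t classify_input(placement_t in, placement_t main_out, int my_node)
{
    if (in.node != my_node)
        return NO_INSTANCE;
    if (main_out.node == my_node && main_out.device == in.device)
        return NATIVE_INPUT;
    if (main_out.node == my_node)
        return INTRA_NODE_ALIEN_INPUT;
    return INTER_NODE_ALIEN_INPUT;
}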

7 Performance Evaluation

We conducted experiments with the Cholesky and QR factorizations in double precision on the heterogeneous Keeneland system [24] at the Oak Ridge National Laboratory. The Keeneland system has 120 nodes and is connected by a Qlogic QDR InfiniBand network.



Figure 6: Weak scalability on distributed GPUs. (a) and (b) show the overall performance (Tflops), and (c) and (d) show the performance per node, for the Cholesky and QR factorizations (in double precision), respectively, compared against the DGEMM and DSSRFB upper bounds and MKL ScaLAPACK 10.3. Every experiment uses all 12 CPU cores and 3 GPUs on each node.

Each node of the system runs CentOS 5.5 and has two Intel Xeon 2.8 GHz 6-core processors and three Nvidia Fermi 1.15 GHz M2070 GPUs. The host on each node has 24 GB of memory, and each GPU has 6 GB of device memory. There is full PCI-Express bandwidth to every GPU on the system. All the nodes have CUDA 4.0, Intel Compilers 12.0, Intel MKL 10.3.5, and OpenMPI 1.5.1 installed.

In the following, we perform weak scalability experiments to measure the capability of our programs to solve potentially larger problems when more computing resources are available. While we focus on clusters with distributed GPUs, our framework is also able to achieve high performance in two other environments: multicore clusters without GPUs, and shared-memory multi-GPU systems (i.e., a single node with both CPUs and GPUs).

7.1 Distributed GPUs

We did experiments on Keeneland using all twelve CPU cores and all three GPUs on each node. Figure 6 shows how our distributed-GPU framework scales as we increase the number of nodes and the matrix size simultaneously. Although there are 120 nodes on Keeneland, its batch scheduler only allows a job to use at most 110 nodes. Considering that one or two nodes are sometimes unavailable, we use between one and 100 nodes. As the number of nodes increases by a factor of k, we increase the matrix size by a factor of √k. The single-node experiment takes an input matrix of size 34,560.

In our experiments, we measure the total number of TeraFlops attained by the Cholesky factorization and the QR factorization, shown in Figure 6 (a) and (b), respectively. To display the possible maximum performance (i.e., upper bound) of our programs, we also depict the curves of DGEMM and DSSRFB, which are the dominant computational kernels of the Cholesky and QR factorizations, respectively. We calculate an upper bound by the following formula: kernel_UB = serial_cpu_kernel_perf × #cores + gpu_kernel_perf × #gpus. To show the benefits of using GPUs, we also present the performance of the Intel MKL 10.3.5 ScaLAPACK library, which uses CPUs only. In (a), the overall performance of our distributed-GPU Cholesky factorization reaches 75 TFlops on 100 nodes, while MKL ScaLAPACK reaches 6.3 TFlops. In (b), the overall performance of our distributed-GPU QR factorization reaches 40 TFlops on 100 nodes, while MKL ScaLAPACK reaches 9.2 TFlops.

Figure 6 (c) and (d) show another view (i.e., performance per node) of the same experiments displayed in (a) and (b). That is, TFlops-per-node = overall TFlops / number of nodes for a given number of nodes. Ideally, the performance per node is constant in a weak scalability experiment. From (c), we can see that our distributed-GPU Cholesky factorization does not lose any performance from one node to 100 nodes. In (d), our distributed-GPU QR factorization again scales well from four nodes to 100 nodes. The performance per node drops from 0.44 TFlops to 0.41 TFlops at four nodes because the four-node experiment uses a 2 × 2 process grid and has a larger communication overhead than a process grid with Pr = 1 (Appendix A.2 analyzes the related communication cost).



Figure 7: Weak scalability on clusters with CPUs only (overall Tflops and Tflops per node for the Cholesky and QR factorizations, compared with MKL ScaLAPACK). Every experiment uses 12 CPU cores on each node.

7.2 Clusters without GPUs

We ran another set of experiments to test whether our framework is capable of delivering high performance when the system is a conventional cluster without GPUs. We use only the 12 CPU cores on each Keeneland node for these experiments, and compare our Cholesky and QR factorization implementations with the Intel MKL 10.3.5 ScaLAPACK library.

We performed weak scalability experiments again, where the input size increases by a factor of √2 whenever we double the number of nodes. In Fig. 7 (a), the overall performance of our Cholesky factorization is 75% faster than the ScaLAPACK Cholesky factorization on 100 nodes. In (b), our QR factorization and the ScaLAPACK QR factorization have comparable overall performance. Figure 7 (c) and (d) show the performance per node. In (c), our CPU-only Cholesky factorization scales well from 2 to 100 nodes. Its curve has a dip from one to two nodes since our runtime system on each node uses a dedicated core for MPI communication (i.e., 1/12 less computing power) whenever there is more than one node. Similar to the Cholesky factorization, in (d), our QR factorization scales well from 4 to 100 nodes. Because of its good scalability, our QR program eventually outperforms the Intel MKL ScaLAPACK QR factorization by 5% when the number of nodes is greater than 32. Note that we use 11 of the 12 cores on each node to do the real computation, while ScaLAPACK uses all 12 cores; even so, we are still 5% faster.

7.3 Shared-System MultiGPUs

To evaluate our framework on a shared system with multicore CPUs and multiple GPUs, we compare our Cholesky factorization to StarPU 0.9.1 [6] on a single node of the Keeneland system.

Figure 8: Cholesky factorization (in double precision) on a shared system with 12 cores and 3 GPUs (overall Gflops versus matrix size for our framework and StarPU 0.9.1).

StarPU uses a dynamic scheduling runtime system to assign tasks to CPUs and GPUs in order to keep load balance and reduce data transfers. The StarPU implementation of Cholesky factorization uses the same computational kernels as ours, which call subroutines from the Intel MKL 10.3.5, CUBLAS 4.0, and MAGMA 1.0 libraries. With help from one of the StarPU developers, we ported the StarPU Cholesky factorization to Keeneland and tuned its performance thoroughly. Porting the StarPU QR factorization to Nvidia Fermi GPUs has not been successful so far due to numerical errors in the result.

Figure 8 shows the overall performance of our framework and StarPU 0.9.1 when solving double-precision Cholesky factorizations. All the StarPU experiments use 9 CPU cores and 3 GPUs to do the real computation, and use the remaining three cores to manage the three GPUs. By contrast, our implementation uses 8 CPU cores and 3 GPUs to do the real computation, since we also use an additional core for CUDA communication. The performance data shows that our framework can rise to high performance more quickly than the StarPU program. When the input size is not too large, our framework is faster than StarPU (i.e., 250% faster when N ≤ 7,680, and



100% faster when N ≤ 12,480). When the input size is sufficiently large (i.e., N ≥ 26,880), StarPU starts to approach our framework's performance.

8 Related Work

A number of runtime systems have been developed to support multiple GPU devices on a shared-memory system. StarPU provides a dynamic scheduling runtime system to execute a sequential code on the host CPUs and GPUs in parallel [6], and has been applied to the Cholesky, QR, and LU factorizations [3, 2, 1]. SuperMatrix is another runtime system that supports shared-memory systems with multiple GPUs [20]. It uses several software-cache schemes to maintain coherence between the host RAM and the GPU memories. While SuperMatrix requires that GPUs take most of the computations, our framework utilizes all CPU cores and all GPUs on both shared-memory and distributed-memory systems.

StarSs is a programming model that uses directives to annotate a sequential source code so that it can execute on various architectures such as SMP, CUDA, and Cell [7]. A programmer is responsible for specifying which pieces of code should be executed on a GPU. Its runtime then executes the annotated code in parallel on the host and GPUs. It is also possible to use the hybrid MPI/SMPSs approach to support clusters with multicore CPUs [18].

There is also research work that supports scientific computations on distributed GPUs. Fatica [14] uses CUDA to accelerate the LINPACK benchmark [13] on heterogeneous clusters by modifying the original source code slightly. The revised code intercepts every DTRSM or DGEMM call and splits it into two calls that execute on the CPUs and the GPUs, respectively. The calls to the CPUs rely on setting OMP_NUM_THREADS to utilize all CPU cores on the host. In contrast, our distributed-GPU framework allows every CPU core to compute tasks independently. On the other hand, we use one MPI process per node, instead of one MPI process per GPU.

Fogue et al. ported the PLAPACK library to GPU-accelerated clusters [15]. They require that the CPUs compute the diagonal block factorizations while the GPUs compute all the remaining operations. They also store all data in GPU memories to reduce communication. In our method, we distribute a matrix across the host and the GPUs, and we can utilize all CPU cores and all GPUs. Note that it is possible that the computational power of a host is greater than that of a GPU, such that the host needs to compute most of the work.

Many researchers have studied static data distribution strategies on heterogeneous distributed-memory systems. Dongarra et al. designed an algorithm to map a set of uniform tiles to a 1-D collection of heterogeneous processors [11]. Robert et al. proposed a heuristic 2-D block data allocation to extend ScaLAPACK to work on heterogeneous clusters [9]. Lastovetsky et al. developed a static data distribution strategy that takes into account both processor heterogeneity and memory heterogeneity for dense matrix factorizations [17]. Our work targets clusters of nodes that consist of multicore CPUs and multiple GPUs, and uses a novel static multi-level block cyclic strategy.

9 Conclusion and Future Work

As the trend of adding more GPUs to each node to deliver higher performance continues, it is important to start designing new parallel software for these heterogeneous architectures. We present a new framework to solve dense linear algebra problems on large-scale GPU-based clusters. To attain scalable performance, we focus our design on minimizing communication, maximizing the degree of task parallelism, accommodating processor heterogeneity, hiding communication, and maintaining load balance. The framework essentially consists of a static multi-level partitioning and distribution method, heterogeneous tile algorithms, and a distributed scheduling runtime system.

Our experiments with the Cholesky and QR factorizations on the heterogeneous Keeneland system demonstrate great scalability in various environments: clusters with or without GPUs, and shared systems with multiple GPUs. Our future work along this line is to apply the approach to two-sided factorizations and sparse matrices. Another future direction is to add NUMA support to our runtime system to improve performance on nodes that have hundreds or even thousands of CPU cores as well as a great number of NUMA memory nodes.



References

[1] E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, J. Langou, H. Ltaief, and S. Tomov. LU factorization for accelerator-based systems. ICL Technical Report ICL-UT-10-05, Innovative Computing Laboratory, University of Tennessee, 2010.

[2] Emmanuel Agullo, Cedric Augonnet, Jack Dongarra, Mathieu Faverge, Hatem Ltaief, Samuel Thibault, and Stanimire Tomov. QR factorization on a multicore node enhanced with multiple GPU accelerators. In IPDPS 2011, Alaska, USA, 2011.

[3] Emmanuel Agullo, Cedric Augonnet, Jack Dongarra, Hatem Ltaief, Raymond Namyst, Jean Roman, Samuel Thibault, and Stanimire Tomov. Dynamically scheduled Cholesky factorization on multicore architectures with GPU accelerators. In Symposium on Application Accelerators in High Performance Computing (SAAHPC), Knoxville, USA, 2010.

[4] Emmanuel Agullo, Jack Dongarra, Bilel Hadri, Jakub Kurzak, Julien Langou, Julie Langou, Hatem Ltaief, Piotr Luszczek, and Asim YarKhan. PLASMA Users' Guide. Technical report, ICL, UTK, 2011.

[5] E. Anderson, Z. Bai, C. Bischof, L. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users' Guide. SIAM, 1992.

[6] Cedric Augonnet, Samuel Thibault, Raymond Namyst, and Pierre-Andre Wacrenier. StarPU: A unified platform for task scheduling on heterogeneous multicore architectures. Concurr. Comput.: Pract. Exper., Special Issue: Euro-Par 2009, 23:187–198, February 2011.

[7] Eduard Ayguade, Rosa M. Badia, Francisco D. Igual, Jesus Labarta, Rafael Mayo, and Enrique S. Quintana-Ortí. An extension of the StarSs programming model for platforms with multiple GPUs. In Proceedings of the 15th International Euro-Par Conference on Parallel Processing, Euro-Par '09, pages 851–862, Berlin, Heidelberg, 2009. Springer-Verlag.

[8] Grey Ballard, James Demmel, Olga Holtz, and Oded Schwartz. Communication-optimal parallel and sequential Cholesky decomposition. In Proceedings of the Twenty-First Annual Symposium on Parallelism in Algorithms and Architectures, SPAA '09, pages 245–252, New York, NY, USA, 2009. ACM.

[9] Olivier Beaumont, Vincent Boudet, Antoine Petitet, Fabrice Rastello, and Yves Robert. A proposal for a heterogeneous cluster ScaLAPACK (dense linear solvers). IEEE Transactions on Computers, 50:1052–1070, 2001.

[10] L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. Whaley. ScaLAPACK Users' Guide. SIAM, 1997.

[11] Pierre Boulet, Jack Dongarra, Yves Robert, and Frédéric Vivien. Static tiling for heterogeneous computing platforms. Parallel Computing, 25(5):547–568, 1999.

[12] James W. Demmel, Laura Grigori, Mark Frederick Hoemmen, and Julien Langou. Communication-optimal parallel and sequential QR and LU factorizations. LAPACK Working Note 204, UTK, August 2008.

[13] Jack J. Dongarra, Piotr Luszczek, and Antoine Petitet. The LINPACK Benchmark: past, present, and future. Concurrency and Computation: Practice and Experience, 15:803–820, 2003.

[14] Massimiliano Fatica. Accelerating Linpack with CUDA on heterogeneous clusters. In Proceedings of the 2nd Workshop on General Purpose Processing on Graphics Processing Units, GPGPU-2, pages 46–51, New York, NY, USA, 2009. ACM.

[15] Manuel Fogue, Francisco D. Igual, Enrique S. Quintana-Ortí, and Robert van de Geijn. Retargeting PLAPACK to clusters with hardware accelerators. FLAME Working Note 42, 2010.

[16] J. R. Humphrey, D. K. Price, K. E. Spagnoli, A. L. Paolini, and E. J. Kelmelis. CULA: Hybrid GPU accelerated linear algebra routines. In SPIE Defense and Security Symposium (DSS), April 2010.

[17] Alexey Lastovetsky and Ravi Reddy. Data distribution for dense factorization on computers with memory heterogeneity. Parallel Comput., 33:757–779, December 2007.

[18] Vladimir Marjanovic, Jesus Labarta, Eduard Ayguade, and Mateo Valero. Overlapping communication and computation by using a hybrid MPI/SMPSs approach. In Proceedings of the 24th ACM International Conference on Supercomputing, ICS '10, pages 5–16, New York, NY, USA, 2010. ACM.

[19] NVIDIA. CUDA Toolkit 4.0 CUBLAS Library, 2011.

[20] Gregorio Quintana-Ortí, Francisco D. Igual, Enrique S. Quintana-Ortí, and Robert A. van de Geijn. Solving dense linear systems on platforms with multiple hardware accelerators. In Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '09, pages 121–130, New York, NY, USA, 2009. ACM.

[21] Sushant Sharma, Chung-Hsing Hsu, and Wu-chun Feng. Making a case for a Green500 list. In IEEE International Parallel and Distributed Processing Symposium (IPDPS 2006) / Workshop on High Performance - Power Aware Computing, 2006.

[22] Fengguang Song, Stanimire Tomov, and Jack Dongarra. Efficient support for matrix computations on heterogeneous multi-core and multi-GPU architectures. LAPACK Working Note 250, UTK, June 2011.

[23] S. Tomov, R. Nath, P. Du, and J. Dongarra. MAGMA Users' Guide. Technical report, ICL, UTK, 2011.

[24] J. S. Vetter, R. Glassbrook, J. Dongarra, K. Schwan, B. Loftis, S. McNally, J. Meredith, J. Rogers, P. Roth, K. Spafford, and S. Yalamanchili. Keeneland: Bringing heterogeneous GPU computing to the computational science community. Computing in Science & Engineering, 13(5):90–95, Sept.-Oct. 2011.



wait4send = wait4recv = 0;
while (!is_done || wait4send) {
    if (!is_done && !wait4recv) {            /* post a new nonblocking receive */
        MPI_Irecv(recv_buf, ..., MPI_ANY_SOURCE, ..., &recv_req);
        wait4recv = 1;
    }
    if (!wait4send) {                        /* post a new nonblocking send    */
        msg = get_msg(out_mbox);             /* the host's out_mbox; assumed to
                                                return NULL when it is empty   */
        if (msg != NULL) {
            MPI_Isend(msg->data, ..., msg->dst_pid, ..., &send_req);
            wait4send = 1;
        }
    }
    if (wait4send) {                         /* poll the pending send          */
        MPI_Test(&send_req, &done, ...);
        if (done) wait4send = 0;
    }
    if (wait4recv) {                         /* poll the pending receive       */
        MPI_Test(&recv_req, &done, ...);
        if (done) {
            /* store recv_buf to the host memory */
            wait4recv = 0;
        }
    }
}

Figure 9: Pseudocode of the MPI communication thread in our distributed runtime system.

A Appendix

A.1 Tile Size Auto-Tuning

We adjust the ratio of the small tiles on the host to the large tiles on the GPUs to keep load balance between the CPUs and GPUs within each node. Meanwhile, load balance between nodes is attained automatically by the 2-D block cyclic distribution method.

Given a tile of size B × B, we partition it into two parts: B × Bh and B × Bg, where Bh + Bg = B. We also suppose Bh = b(s − 1), where b is a sub-tile size. The left partition of size B × Bh is allocated to the host, and the right partition of size B × Bg is allocated to a GPU. We let T(m × n) denote the number of floating-point operations needed to compute a matrix of size m × n, and fcore and fgpu denote the speed (i.e., flop/s) of a CPU core and of a GPU, respectively.

In a block cyclic data distribution (1-D or 2-D), G tiles are allocated to G GPUs such that each GPU has one tile, which takes T(B × Bg)/fgpu time to compute. The CPU cores on the host, on the other hand, receive G small partitions (each of size B × Bh) from the G tiles, which take G × T(B × Bh)/(fcore × NumCores) time to compute.

To achieve load balance, we determine Bh by the following formula:

  T(B × Bg) / fgpu = G × T(B × Bh) / (fcore × NumCores)
  ⇒ T(B × Bh) / T(B × Bg) = (fcore × NumCores) / (fgpu × G)
  ⇒ Bh = B × (fcore × NumCores) / (fcore × NumCores + fgpu × G)

In practice, fcore and fgpu denote the maximum performance of the dominant computational kernel in an algorithm. In addition, we fine-tune the value of Bh automatically. We start from the estimated size Bh and search for an optimal B*h near Bh. We wrote a script that executes a matrix factorization with an input of size N = c0 · B · G, where we set c0 = 3 to reduce the tuning time. The script adjusts the value of Bh to search for the minimum difference between the CPU and the GPU computation time. Note that Bh depends only on the number of CPU cores and the number of GPUs, assuming the machine and the computational kernels have not changed.
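
A small sketch of the load-balancing estimate derived above; fcore and fgpu would be the measured kernel speeds in flop/s, and the returned value is only the starting point of the fine-tuning search described in the text.

/* Estimate the host slice Bh of a B x B tile, following
 * Bh = B * (fcore*NumCores) / (fcore*NumCores + fgpu*G). */
int estimate_Bh(int B, double fcore, int num_cores, double fgpu, int G)
{
    double cpu_power = fcore * num_cores;
    double gpu_power = fgpu * G;
    return (int)(B * cpu_power / (cpu_power + gpu_power));
}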

The large tile size B is also critical for GPU performance. To determine the best B, we search for the minimal matrix size that provides the best performance for the dominant GPU kernel in an algorithm (e.g., GEMM for Cholesky factorization). Our search ranges from 256 to 2048, and is performed only once for every new computational library and every new GPU architecture.

A.2 Communication Cost Analysis

We count the number of messages and the number of words communicated by the process that has the most communication among all processes. Assume there are P processes (one process per node), and that broadcasts between processes are implemented with a tree topology. We use Pr and Pc to denote a Pr × Pc process grid, where P = Pr · Pc. In the multi-level block cyclic data distribution, we use n, B, and s to denote the matrix size, the top-level tile size, and the number of partitions of each top-level tile, respectively.

A.2.1 Distributed Heterogeneous Tile Cholesky Factorization

At each iteration, we first broadcast the diagonal block down the panel (i.e., log Pr messages); next, each process that owns data on the panel broadcasts it to the Pc + Pr processes that are waiting for the panel data (i.e., (#rows/(B·Pr)) · log(Pc + Pr) messages, where #rows = n − ⌊j/s⌋·B at the j-th iteration). The number of messages is expressed as follows:

  msg_chol = Σ_{j=0}^{(n/B)s − 1} [ log Pr + ((n − ⌊j/s⌋B)/(B·Pr)) · log(Pc + Pr) ]
           = (ns/B) · log Pr + (n²s/(2B²Pr)) · log(Pc + Pr)
           = (ns/(2B)) · log P + (n²s/(4B²√P)) · log P + n²s/(2B²√P)

And the number of words is:

  word_chol = (nB/4) · log P + (n²/(4√P)) · log P + n²/(2√P)

A.2.2 Distributed Heterogeneous Tile QR Factorization

In the tile QR factorization, we can stack v adjacent tiles to form a virtual tile, which is always allocated to the same host or GPU. At the j-th iteration, each process has ((n − ⌊j/s⌋B)/(vB·Pr)) × ((n − ⌊j/s⌋B)/(B·Pc)) virtual tiles of size (vB) × B. Since one B × B tile out of every virtual tile will be sent down to the process below as a message (there is no message if Pr = 1), and every tile on the panel will be broadcast rightward to Pc processes, the number of messages is expressed as follows:

  msg_qr = Σ_{j=0}^{(n/B)s − 1} [ ((n − ⌊j/s⌋B)/(vB·Pr)) · ((n − ⌊j/s⌋B)/(B·Pc)) + ((n − ⌊j/s⌋B)/(B·Pr)) · log Pc ]
         = Σ_{j=0}^{(n/B)s − 1} [ (n − ⌊j/s⌋B)²/(vB²P) + (n − ⌊j/s⌋B) · log Pc/(B·Pr) ]
         = (1/(vB²P)) · (n³s/(3B)) + (n²s/(2B²Pr)) · log Pc
         = n³s/(3vB³P) + (n²s/(4B²√P)) · log P

And the number of words is:

  word_qr = n³/(3vBP) + (n²/(4√P)) · log P
          = ( n/(3vB√P · log P) + 1/4 ) · (n²/√P) · log P

If we set the virtual tile size v to n/(B·Pr), then vB√P is equal to n. Therefore, msg_qr = n²s/(3B²√P) + (n²s/(4B²√P)) · log P, and word_qr = (1/(3 log P) + 1/4) · (n²/√P) · log P.

A.2.3 Comparison with ScaLAPACK

Table 1 compares our distributed heterogeneous tile algorithms with the communication lower bounds (LB) and with the ScaLAPACK subroutines, regarding the number of words (i.e., communication volume) and the number of messages. From the table, we can see that the heterogeneous tile Cholesky factorization attains the communication-volume lower bound to within a logarithmic factor. The communication volume of the heterogeneous tile QR factorization is greater than its lower bound by a factor of (n/(3vB√P) + (1/4)·log P). Note that we could increase v to minimize the heterogeneous tile QR factorization's communication volume and reach its lower bound to within a factor of (1/3 + (1/4)·log P).

Table 1: Communication cost of the distributed heterogeneous tile algorithms. We assume square matrices and Pr = Pc = √P.

                    #words                                       #messages
Cholesky LB [8]     Ω(n²/√P)                                     Ω(√P)
PDPOTRF [8]         (nb/4 + n²/√P) · log P                       (3n/(2b)) · log P
Hetero. Cholesky    ((1/4)·log P + 1/2) · n²/√P                  (n²s/(4B²√P)) · (log P + 2)
QR LB [12]          Ω(n²/√P)                                     Ω(√P)
PDGEQRF [12]        ((3/4)·n²/√P + (3/4)·nb) · log P             (3/2 + 5/(2b)) · n · log P
Hetero. QR          (n/(3vB√P·log P) + 1/4) · (n²/√P) · log P    n³s/(3vB³P) + (n²s/(4B²√P)) · log P
