
Communication Optimization on GPU: A Case Study of Sequence Alignment Algorithms

Jie Wang, University of California, Los Angeles, Los Angeles, USA

Xinfeng Xie, Peking University, Beijing, China

Jason Cong, University of California, Los Angeles, Los Angeles, USA

Abstract—Data movement is increasingly becoming the bottleneck of both performance and energy efficiency in modern computation. Until recently, there was limited freedom for communication optimization on GPUs, as conventional GPUs only provide two types of methods for inter-thread communication: using shared memory or global memory. However, a new warp shuffle instruction has been introduced since the Kepler architecture on Nvidia GPUs, which enables threads within the same warp to directly exchange data in registers. This brought new performance optimization opportunities for algorithms with intensive inter-thread communication. In this work, we deploy register shuffle in the application domain of sequence alignment (or, similarly, string matching), and conduct a quantitative analysis of the opportunities and limitations of using register shuffle. We select two sequence alignment algorithms, Smith-Waterman (SW) and Pairwise-Hidden-Markov-Model (PairHMM), from the widely used Genome Analysis Toolkit (GATK) as case studies. Compared to implementations using shared memory, we obtain significant speed-ups of 1.2× and 2.1× by using shuffle instructions for SW and PairHMM. Furthermore, we develop a performance model for analyzing the kernel performance based on the shuffle latency measured with a suite of microbenchmarks. Our model provides valuable insights for CUDA programmers into how to best use shuffle instructions for performance optimization.

I. INTRODUCTION

The graphics processing unit (GPU) is a widely used heterogeneous platform that is equipped with a massive number of threads, offering high performance for many applications. It has become an integral part of today's computing systems.

Communication is an important factor for both performance and energy efficiency on GPUs. Table I shows the computation throughput, shared memory bandwidth, and global memory bandwidth (BW) of two Nvidia GPUs. As we can see, there is a huge gap between the computation and memory systems, making it usually unrealistic to fully exploit the rich computation resources on GPUs. This has spawned an active research community for communication optimization on GPUs [1]–[3].

Communication among threads (inter-thread communication) takes up a large portion of the total communication on the GPU, considering that the GPU uses a large number of threads to exploit the parallelism in applications. For applications with intensive inter-thread communication, the performance will be limited by the efficiency of the inter-thread communication methods.

Table I: Overview of the gap between computation and memory systems on modern GPUs.

                            Nvidia K1200    Nvidia Titan X
GFLOPs                      1,057           6,611
shared memory BW (GB/s)     550             3,302
global memory BW (GB/s)     80              336

There are two conventional types of methods for GPU threads to communicate with each other: shared memory and global memory. Threads within the same thread block can communicate through shared memory, which is basically a scratchpad memory that offers high bandwidth. Threads in different thread blocks can communicate via global memory. For both methods, we need to set explicit synchronization barriers to avoid potential data hazards.
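As a toy illustration of this conventional path (the kernel below is our own example, not code from the paper), each thread writes its value into shared memory and reads a neighbor's value only after an explicit barrier:

__global__ void rotate_right(const int* in, int* out) {
    __shared__ int buf[256];                 // assumes blockDim.x <= 256
    buf[threadIdx.x] = in[threadIdx.x];      // each thread publishes its element
    __syncthreads();                         // barrier guards the read-after-write hazard
    out[threadIdx.x] = buf[(threadIdx.x + 1) % blockDim.x];  // read a neighbor's element
}

Without the __syncthreads() call, a thread could read its neighbor's slot before that neighbor has written it.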

Starting from the Kepler architecture, threads within the same warp can communicate with each other using a new instruction called SHFL, or "shuffle" [4]. Shared memory can be saved by using shuffle instructions, which may help improve the occupancy, and the synchronization overheads are eliminated because shuffle is used by threads within one single warp with implicit synchronization.

This new instruction provides a unique opportunity for optimizing communication in applications with intensive inter-thread communication. However, shuffle has its limitations: it can be used only for threads within the same warp, and it will increase the register usage, which may become the new limiter for occupancy. The trade-offs between the benefits and limitations of shuffle need to be considered before deploying shuffle in kernels. Previous works [5]–[7] that used shuffle instructions to obtain performance gains did not investigate such trade-offs, and there is a lack of quantitative and systematic approaches for analyzing the impact of shuffle instructions on kernel performance. This motivates us to conduct a detailed analysis of shuffle instructions.

In this work we select two algorithms, Smith-Waterman (SW) and Pairwise-Hidden-Markov-Model (PairHMM), from a widely used genomic application (GATK) [8] as case studies for analyzing the effectiveness of shuffle instructions for communication optimization. For each algorithm, we implement two designs using either shared memory or shuffle for inter-thread communication. Furthermore, we develop a suite of microbenchmarks to evaluate the latency of shuffle and other instructions in detail. A performance model for analyzing the kernel performance is built based on the latencies measured with these microbenchmarks. We conduct a detailed analysis of shuffle instructions in the two kernels with the help of the performance model and microbenchmarks.

We summarize the contributions of our work as follows.

• We conduct a quantitative and systematic analysis of the impact of shuffle instructions on communication optimization. A suite of microbenchmarks is developed for measuring the latency of shuffle instructions. Furthermore, a performance model for analyzing the performance of sequence alignment algorithms is developed. This model helps validate the results from the microbenchmarks and estimates the performance gains of using shuffle instructions, taking all trade-offs into consideration.

• We use shuffle instructions to optimize the inter-thread communication for two sequence alignment algorithms, SW and PairHMM. Compared to designs using shared memory, we achieve speedups of 1.2× and 2.1× for SW and PairHMM, respectively, using shuffle instructions.

• We distill the trade-offs of using shuffle instructions from the case studies of the two sequence alignment algorithms. This work provides valuable insights for CUDA programmers on using shuffle instructions for communication optimization in a wider class of applications.

The remainder of this paper is organized as follows. In Section II we present the details of shuffle instructions and discuss the microbenchmarks for testing the shuffle latency. Section III describes the SW and PairHMM algorithms. In Section IV we discuss the general design methodology of using shared memory or shuffle instructions for these algorithms, the optimization techniques to further improve the performance, and the performance model for analyzing and estimating the performance of these designs. Experimental results are presented in Section V, where we discuss the overall performance of our designs and conduct a detailed analysis of the trade-offs of using shuffle instructions. Section VI summarizes prior research. Finally, we conclude our work in Section VII.

II. UNDERSTANDING SHUFFLE INSTRUCTIONS

In this section we first introduce the shuffle instructions. Then we present the microbenchmarks for testing the latency of shuffle and several related instructions.

A. Shuffle Instruction

Shuffle allows threads within a warp to directly share data in registers. It can be used only for threads within a single warp, and all the threads involved in the shuffle instruction need to be active during the execution. No explicit synchronization is needed for shuffle, as it is executed within a single warp.

Figure 1: Variants of shuffle instructions: __shfl (any-to-any), __shfl_up/__shfl_down (shift to neighbor), and __shfl_xor (butterfly exchange).

Figure 2: An example of using shuffle instructions for reduction: eight values are summed in three steps, v += __shfl_down(v, 4); v += __shfl_down(v, 2); v += __shfl_down(v, 1).


There are four variants of shuffle instructions, as depicted in Figure 1. __shfl directly copies data from any indexed lane. __shfl_up and __shfl_down copy data from a lane with either a lower or a higher ID relative to the caller. The last variant, __shfl_xor, copies data from a lane based on a bitwise XOR of the caller's own lane ID.
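As a minimal illustration of these semantics (using the pre-CUDA 9, non-_sync intrinsic forms targeted in this paper; the lane offsets below are arbitrary examples of ours):

__device__ void shuffle_variants_demo(float v) {
    float from_lane7 = __shfl(v, 7);       // any-to-any: read v from lane 7
    float from_below = __shfl_up(v, 1);    // read v from the lane whose ID is one lower
    float from_above = __shfl_down(v, 2);  // read v from the lane whose ID is two higher
    float butterfly  = __shfl_xor(v, 4);   // read v from lane (laneId XOR 4)
    // values are unused; this only illustrates the four access patterns
}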

Figure 2 depicts an example of using shuffle for reduction. Without shuffle, we would need to store the intermediate data at each reduction stage back to either shared or global memory. The benefits of using shuffle are summarized below [9].

• Save shared memory usage. Using register shuffle can free up shared memory. The freed shared memory can either be used for other data or help increase the occupancy.

• Reduce instruction count. For a read-after-write access pattern (e.g., Figure 2), three instructions (write, synchronize, and read) are needed when using shared memory. Shuffle can finish the same work with one single instruction (see the warp-sum sketch after this list).

• Eliminate synchronization. Explicit synchronization is not required, since threads within the same warp are implicitly synchronized under the single-instruction, multiple-thread (SIMT) execution model on the GPU.
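For concreteness, here is a minimal warp-level sum in the spirit of Figure 2, generalized from 8 lanes to a full 32-lane warp (a sketch assuming all 32 lanes are active, using the pre-CUDA 9 __shfl_down form used throughout this paper):

__device__ float warp_sum(float v) {
    // each step folds the upper half of the remaining lanes onto the lower half
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down(v, offset);
    return v;   // lane 0 ends up holding the sum of all 32 values
}

No shared memory access and no __syncthreads() appear in the loop; every step is a single instruction per lane.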

Nevertheless, we find that, with the exception of the third point above, the benefits are not obvious performance-wise. Although using register shuffle can save shared memory, the register usage will increase, which may become the new limiting factor for occupancy and thus may affect performance. As for the second point on the reduced instruction count, the latency of shuffle instructions is not disclosed publicly, which makes it difficult to estimate the performance gains of using shuffle. These observations motivate us to study the shuffle instructions in a quantitative and systematic manner, using sequence alignment algorithms as case studies.


__global__ void reg(float* in, float* out) {
    float a = in[threadIdx.x];
    for (int i = 0; i < ITERATIONS; i++)
        a *= a;
    out[threadIdx.x] = a;
}

__global__ void shuffle(float* in, float* out) {
    float a = in[threadIdx.x];
    for (int i = 0; i < ITERATIONS; i++)
        a *= __shfl(a, src_thread);
    out[threadIdx.x] = a;
}

__global__ void sharedMem(float* in, float* out) {
    __shared__ float buf[32];
    for (int i = 0; i < 32; i++)
        buf[i] = in[i];
    int ind = buf[0];
    float a = 1.0;
    for (int i = 0; i < ITERATIONS; i++) {
        ind = buf[ind];
        a *= ind;
    }
    out[0] = a;
}

__global__ void sharedMemSync(float* in, float* out) {
    __shared__ float buf[32];
    for (int i = 0; i < 32; i++)
        buf[i] = in[i];
    int ind = buf[0];
    float a = 1.0;
    for (int i = 0; i < ITERATIONS; i++) {
        ind = buf[ind];
        a *= ind;
        __syncthreads();
    }
    out[0] = a;
}

Listing 1: CUDA code for testing the instruction latency.

B. Microbenchmark for Testing the Shuffle Latency

The shuffle instructions were first introduced on the Kepler architecture. However, the latency of these instructions is not disclosed and therefore remains unclear to CUDA programmers. Such information is critical for performance estimation. Therefore, we develop a suite of microbenchmarks to estimate the latency of shuffle and several other related instructions. The benchmark is conducted on several GPUs with different architectures to evaluate the behavior across GPU generations.

The microbenchmark covers all variants of shuffle instructions, as shown in Figure 1 in Section II. To provide a performance comparison, the microbenchmark also covers shared memory access and synchronization using __syncthreads().

The major code used in the microbenchmark is shown in Listing 1. For the kernel shuffle(), each thread in the warp loads one item from global memory and updates it using data from other threads for several iterations. Considering the RAW dependency of variable a across iterations, the elapsed time of the kernel can be calculated as below.

Figure 3: Test results of the microbenchmarks: measured latency (in cycles) on K40, K1200, and Titan X.

t_{shuffle} = \#iteration \times (latency_{shuffle} + \alpha) + \beta \qquad (1)

The factor α refers to the sum of the latencies of all the remaining instructions (e.g., the multiplication). The factor β covers the overheads outside the loop.

The kernel reg() uses only register accesses. Similarly, the elapsed time of this kernel is calculated by the equation below.

t_{reg} = \#iteration \times (latency_{reg} + \alpha) + \beta \qquad (2)

Therefore, by conducting multiple runs with different numbers of iterations, we can use a linear regression model to obtain the slope factors k_{shuffle} = latency_{shuffle} + α and k_{reg} = latency_{reg} + α. The latency of the __shfl instruction is therefore derived as latency_{shuffle} = latency_{reg} + k_{shuffle} − k_{reg}.
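A host-side sketch of this fitting step (our own helper, not code from the paper; it assumes iters[i] holds the ITERATIONS value of the i-th run and cycles[i] the measured kernel time of that run converted to cycles):

#include <vector>

// Ordinary least-squares slope of t = k * iterations + beta, i.e., the factor k
// in Equations (1) and (2); the shuffle latency then follows as 1 + k_shuffle - k_reg.
double fit_slope(const std::vector<double>& iters, const std::vector<double>& cycles) {
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    const double n = static_cast<double>(iters.size());
    for (size_t i = 0; i < iters.size(); i++) {
        sx  += iters[i];
        sy  += cycles[i];
        sxx += iters[i] * iters[i];
        sxy += iters[i] * cycles[i];
    }
    return (n * sxy - sx * sy) / (n * sxx - sx * sx);
}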

To avoid the effects of warp scheduling among different thread blocks, we launch only one block with 32 threads. We apply this approach to test all shuffle instructions.

The kernel sharedMem() tests the shared memory access latency. The goal of this test is to evaluate latency rather than throughput; therefore, we launch only one thread block with one thread inside, which performs pointer chasing in the shared memory. The kernel sharedMemSync() is used for evaluating the latency of __syncthreads(); we add __syncthreads() after each iteration in the code. With the same methodology, we can derive the latency of shared memory access and synchronization as shown below.

latency_{sharedMem} = latency_{reg} + k_{sharedMem} - k_{reg} \qquad (3)

latency_{sync} = latency_{reg} + k_{sync} - k_{reg} - latency_{sharedMem} \qquad (4)

On GPUs, a register access takes one cycle to finish; therefore, latency_{reg} = 1 in all the equations. In the test, we conduct ten runs with different values of ITERATIONS. The results are shown in Figure 3. We test __shfl_up/__shfl_down with different strides and __shfl with randomly generated lane IDs.

As we can see, on average the latency of shuffle is in-between that of shared memory access and register access.


Besides, the latency of the different types of shuffle instructions varies. For example, on K1200, __shfl_xor has a longer latency than any other shuffle instruction. Such results indicate that the underlying mechanisms might be different for different shuffle instructions.

The latency pattern of shuffle is consistent on GPUs with the same architecture (K1200 and Titan X, both using the Maxwell architecture). However, the latency varies across architectures: on K40 (Kepler architecture), __shfl_xor is the shuffle instruction with the lowest latency, while it has the highest latency on the Maxwell architecture. This suggests that the hardware implementation of shuffle has also been modified across GPU generations.

The results from this microbenchmark help clear up previous confusion regarding shuffle instructions. Clearly, the shuffle instruction is not as fast as direct register access, but it is still faster than shared memory access. The latency also varies across different types of instructions and GPU architectures.

III. OVERVIEW OF SEQUENCE ALIGNMENT ALGORITHMS

In this section we first introduce the general idea of sequence alignment algorithms, and then describe the details of SW and PairHMM.

A. Algorithm Overview

Sequence alignment algorithms, e.g., Smith-Waterman, Needleman-Wunsch, and PairHMM, have been employed in many application domains such as bioinformatics, finance, and language processing. The principle of these algorithms is to traverse possible alignments between two sequences and select the alignment with the best score, according to certain criteria, using a dynamic programming approach. This is done by updating a matrix of size M × N, where M and N are the lengths of the two sequences. The computational complexity of these algorithms is O(MN). Each entry in the matrix depends on its neighbors from certain directions. Figure 4 shows the common dependency graph of these algorithms: each entry depends on its left, up, and left-up neighbors.

To accelerate the application, programmers can explore two kinds of parallelism: 1) inter-task parallelism and 2) intra-task parallelism. Inter-task parallelism refers to the parallelism among different alignment tasks, which are independent of each other, while intra-task parallelism refers to the parallelism among different cells on the same anti-diagonal. Figure 4 shows one example of intra-task parallelism. Many previous works [10]–[13] use either one or both of these two kinds of parallelism to accelerate the computation.

Exploiting the intra-task parallelism in these algorithms introduces intensive inter-thread communication. Therefore, we choose these algorithms as our application drivers to justify the effectiveness of shuffle instructions.

Figure 4: Dependence graph for sequence alignment algorithms.

Figure 5: PairHMM model (states M, I, and D with transition probabilities α, β, γ, δ, ε, ζ, and µ).

B. Algorithm Details

The SW and PairHMM algorithms in this paper are extracted from the HaplotypeCaller [14] of GATK. These two algorithms are used to align DNA sequences for discovering variants in human genes.

SW [15] is a well-known algorithm for sequence alignment. It is used to identify the optimal local alignment between two sequences by means of dynamic programming. Given two sequences s1 and s2 of lengths M and N, it computes the score matrix H as follows.

H_{i,j} = \max \begin{cases} 0 \\ H_{i-1,j-1} + s(a_i, b_j) \\ \max_{k \ge 1}\{H_{i-k,j} + W_k\} \\ \max_{l \ge 1}\{H_{i,j-l} + W_l\} \end{cases} \qquad (5)

where 1 ≤ i ≤ M and 1 ≤ j ≤ N, and a_i and b_j are the i-th and j-th characters of sequences s1 and s2, respectively. s(a, b) is a similarity function for two characters, and W_k and W_l form the gap-scoring scheme. For the latter two cases in Equation 5, we maintain two buffers holding the local maximum along each direction, so that each time only the left and up neighbors need to be accessed. Meanwhile, a back-tracing matrix btrack, recording the path chosen for each cell, is updated. After the computation, we locate the cell with the maximal value in the last row and column of the score matrix H, and retrieve the optimal alignment of the two sequences from this position using the matrix btrack. Note that the conventional SW finds the maximum throughout the whole score matrix; the algorithm has been modified here to adapt to the needs of HaplotypeCaller.
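As a concrete illustration of one cell update, the sketch below follows Equation 5 under the additional assumption of an affine gap scheme (W_k = GAP_OPEN + (k−1)·GAP_EXTEND), which is what makes the two running-maximum buffers sufficient; the similarity scores and gap costs are invented placeholders, not the values used in GATK:

#define GAP_OPEN   (-4.0f)
#define GAP_EXTEND (-1.0f)

__device__ inline float similarity(char a, char b) { return (a == b) ? 5.0f : -3.0f; }

// E and F are the per-direction running maxima described above; H_diag, H_left,
// and H_up are the left-up, left, and up entries of the score matrix H.
__device__ float sw_cell(float H_diag, float H_left, float H_up,
                         float& E, float& F, char a_i, char b_j) {
    E = fmaxf(E + GAP_EXTEND, H_left + GAP_OPEN);  // best score ending in a horizontal gap
    F = fmaxf(F + GAP_EXTEND, H_up + GAP_OPEN);    // best score ending in a vertical gap
    float H = fmaxf(0.0f, H_diag + similarity(a_i, b_j));
    return fmaxf(H, fmaxf(E, F));
}

In the full kernel, the case selected by the maximum would also be recorded in btrack for the back-tracing phase.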

PairHMM [16] is a variant of SW. However, there are several fundamental differences between the two algorithms. PairHMM uses a hidden Markov model to align two sequences and generates a probability score measuring the similarity of the two sequences. Figure 5 shows the HMM used for this algorithm.


Figure 6: GPU designs for sequence alignment algorithms with different types of inter-thread communication: a) design A, using shared memory; b) design B, using shuffle with registers reg1, reg2, and reg3 per thread.

There are three states in total: match, insertion, and deletion. α, β, γ, δ, ε, ζ, and µ are the transition probabilities among the states.

Different from SW, in PairHMM we compute three score matrices (match (M), insertion (I), and deletion (D)) using the following equations.

M_{i,j} = P_{i,j}(\alpha M_{i-1,j-1} + \beta I_{i-1,j-1} + \gamma D_{i-1,j-1})
I_{i,j} = \delta M_{i-1,j} + \epsilon I_{i-1,j}
D_{i,j} = \zeta M_{i,j-1} + \mu D_{i,j-1} \qquad (6)

where P_{i,j} is the prior probability of emitting the two characters (a_i, b_j) of the two sequences s1 and s2. The sum of all the cells in the last row of matrices I and D is the probability that measures the similarity of the two sequences.
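A direct transcription of Equation 6 into a per-cell device function (a sketch for illustration; the parameter names and the way the neighboring entries are passed in are ours, not the paper's kernel interface):

// One PairHMM cell update following Equation 6. The *_diag, *_up, and *_left
// arguments are the (i-1,j-1), (i-1,j), and (i,j-1) entries of M, I, and D;
// P is the prior emission probability P_{i,j}.
__device__ void pairhmm_cell(float& M, float& I, float& D,
                             float M_diag, float I_diag, float D_diag,
                             float M_up,   float I_up,
                             float M_left, float D_left,
                             float P,
                             float alpha, float beta, float gamma,
                             float delta, float epsilon, float zeta, float mu) {
    M = P * (alpha * M_diag + beta * I_diag + gamma * D_diag);
    I = delta * M_up + epsilon * I_up;
    D = zeta * M_left + mu * D_left;
}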

In conclusion, compared to SW, there are three matrices (M, I, and D) to update instead of only one score matrix H. Also, there is no back-tracing phase in PairHMM, and the output of PairHMM is a single number measuring the similarity between the two sequences.

IV. IMPLEMENTATION DETAILS

In this section we first discuss the general methodology of implementing the sequence alignment algorithms of Section III. Then, optimization techniques for both kernels are described. In the end, we present the performance model for analyzing the performance of our designs.

A. Design Methodology

In this section we discuss the general design methodology for algorithms with the dependence graph shown in Figure 4, exploiting the intra-task parallelism. Shared memory and shuffle are used as two alternatives for inter-thread communication.

Respecting the dependence order between two adjacent anti-diagonals, we iterate through all the anti-diagonals one by one. Each thread is assigned one cell on the anti-diagonal.

The inter-thread communication occurs when loading and writing data among threads. This can be implemented using either shared memory or shuffle. Figure 6 shows the designs using shared memory and shuffle (denoted design A and design B). Listing 2 shows the major CUDA code of the two designs.

Figure 7: Two-level tiling scheme for SW: a) coarse-grained tiling, b) fine-grained tiling. In a), cells on the boundaries that are surrounded by dashed boxes need to be stored back to global memory. The data on the horizontal boundaries will be retrieved and consumed by the next block, and the data on the vertical boundaries are used for finding the maximal value in the last column of the score matrix.

In design A, we use shared memory to store the data on the anti-diagonals. Data on the same anti-diagonal are stored in one line buffer. This enables coalesced shared memory accesses from the different threads on the same anti-diagonal. From the dependency graph, three line buffers are sufficient for these algorithms, as shown in Figure 6a). After the computation on each anti-diagonal has finished, we rotate the three line buffers and synchronize all the threads before the next iteration.

In design B, data on the anti-diagonals are stored in local registers, and the three shared memory line buffers of design A are freed up. Each thread holds three registers, denoted reg1, reg2, and reg3, which store the three cells calculated (or to be calculated) by the current thread. As shown in Figure 6b), in order to calculate a cell, a thread loads its left neighbor from its own reg2, and its up and left-up neighbors from reg2 and reg3 of the adjacent thread, respectively, using shuffle instructions. There is no explicit synchronization at the end of each iteration, because the threads within the warp are implicitly synchronized.

The characteristics of the algorithms and of the instructions (shared memory access vs. shuffle) bring unique design challenges and opportunities for each kernel. In the following sections, we describe the techniques we employ to tackle these obstacles and further improve the performance.

B. Design Optimization of SW

1) Shared Memory: The major challenge for SW is that we need to store the whole back-tracing matrix btrack for retrieving the optimal alignment later. Storing the whole btrack matrix in shared memory is impossible due to the limited shared memory resources. Therefore, we apply a two-level tiling method to mitigate the problem, which is depicted in Figure 7. We first tile the matrix along the row dimension: the whole matrix is tiled into blocks of size BSIZE×N. Cells lying on the boundaries of a block are stored back to global memory, as the horizontal boundary will be used by the next block, and the vertical boundary will be used for searching for the maximal score later.


a) Design A (shared memory):

__shared__ data_t buf1[BUF_SIZE];
__shared__ data_t buf2[BUF_SIZE];
__shared__ data_t buf3[BUF_SIZE];
for (int diag = 0; diag < DIAG_NUM; diag++) {
    // LOAD
    data_t left = buf2[threadIdx.x];
    data_t up = buf2[threadIdx.x - 1];
    data_t leftup = buf3[threadIdx.x - 1];
    data_t cur = compute(left, up, leftup);  // COMPUTE
    buf1[threadIdx.x] = cur;                 // WRITE
    rotate(buf1, buf2, buf3);                // ROTATE
    __syncthreads();                         // SYNC
}

b) Design B (shuffle):

data_t reg1, reg2, reg3;
for (int diag = 0; diag < DIAG_NUM; diag++) {
    // LOAD
    data_t left = reg2;
    data_t up = __shfl(reg2, threadIdx.x - 1);
    data_t leftup = __shfl(reg3, threadIdx.x - 1);
    // COMPUTE
    data_t cur = compute(left, up, leftup);
    // WRITE
    reg1 = cur;
    // ROTATE
    reg3 = reg2; reg2 = reg1;
}

Listing 2: CUDA code for the implementations using shared memory (design A) and shuffle (design B).

Inside each block, we further tile it into smaller tiles with the shape of parallelograms, so that each tile covers BSIZE anti-diagonals. We choose parallelograms instead of normal rectangles because rectangular tiles would result in low warp efficiency, as most threads would be wasted at the upper-left and lower-right corners of the tiles.

In the end, we have three line buffers of length BSIZE, and a matrix of size BSIZE×BSIZE to store the data of btrack. We assign BSIZE threads to the task. Each thread works on one complete row of the matrix, and data along the row can be reused locally inside each thread. The execution order of the tiles is depicted in Figure 7.

2) Shuffle: The two-level tiling scheme is applied to the shuffle design as well. Data on the anti-diagonals are stored in local registers, and the previous three line buffers in shared memory are freed up, as depicted in Figure 6b).

C. Design Optimization of PairHMM

As discussed in Section III, there are several differences between PairHMM and SW, which affect the design choices when implementing the kernels. Several architectural modifications are made to adapt to these features.

1) Shared Memory: The implementation for PairHMM is nearly the same as the one depicted in Figure 6a). The last thread, working on the last row, has the additional job of accumulating the results of the cells in the match and insertion matrices. Tiling is no longer needed here, because the shared memory is mostly used by the line buffers holding the intermediate results on the anti-diagonals, whose size fits on-chip in our experiments. However, based on the observation that sequence lengths vary widely in our datasets, setting the same line buffer length for all tasks is not efficient. We optimize the performance by duplicating the kernel into several copies, each with a different line buffer size. Tasks with different sequence lengths fall into different kernels at launch time to use the shared memory efficiently.
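One way to realize this kernel duplication is to make the line buffer length a template parameter, instantiate the kernel for a few sizes, and select the smallest instantiation that fits the batch's longest sequence on the host. The sketch below uses our own names and sizes; it is not the paper's actual kernel interface:

// Line buffers sized at compile time; one kernel instantiation per supported length.
template <int BUF_SIZE>
__global__ void pairhmm_design_a(const float* in, float* out, int seqLen) {
    __shared__ float lineBuf[3][BUF_SIZE];
    // ... anti-diagonal sweep as in Listing 2a), bounded by seqLen ...
}

// Host side: route each batch to the smallest buffer that fits its sequences.
void launch_pairhmm(int maxSeqLen, dim3 grid, dim3 block, const float* in, float* out) {
    if (maxSeqLen <= 64)
        pairhmm_design_a<64><<<grid, block>>>(in, out, maxSeqLen);
    else if (maxSeqLen <= 96)
        pairhmm_design_a<96><<<grid, block>>>(in, out, maxSeqLen);
    else
        pairhmm_design_a<128><<<grid, block>>>(in, out, maxSeqLen);
}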

Figure 8: Shuffle implementation for PairHMM. In this example, each thread holds six registers for two cells in total (reg1–reg3 and reg4–reg6), and computes the cells one by one on each anti-diagonal. Inter-thread communication only happens between boundary cells.

Similar to SW, each thread works on one complete row of the matrix, which offers the opportunity for data reuse along the horizontal axis. PairHMM benefits more from this feature, as there are more metadata (e.g., transition probabilities) of the sequences used in the calculation, and the movement of these data can be saved by data reuse.

2) Shuffle: The implementation that uses the shuffle instruction faces the limitation that this instruction can only be used within one warp. Therefore, if more than one warp works on the anti-diagonal, threads in different warps need to communicate with each other using either shared memory or global memory. This results in branch divergence. Besides, the shared/global memory accesses unfortunately cancel the benefits of using shuffle instructions, because every shuffle instruction within a warp is then accompanied by one shared/global memory access across the warps. Based on these considerations and experiments, we make the compromise of using 32 threads (one warp) to calculate the whole sequence alignment task. This solution still delivers remarkable performance, as shown in Section V.

Similar to what we have done for SW, we create three registers (reg1, reg2, reg3) for each thread. Threads look up data from neighbors via shuffle instructions. However, using only 32 threads brings problems for tasks whose sequence lengths are longer than 32. We solve this by assigning multiple cells along the anti-diagonal to each thread, and each thread computes these cells one by one. We cannot use a fixed number of cells per thread because this would cause inefficiency across tasks with different sequence lengths. Based on a heuristic similar to the one adopted in the shared memory implementation, we create subfunctions with different numbers of cells to calculate per thread, and assign tasks to different subfunctions at runtime.

Figure 8 depicts one example in which each thread needs to compute two cells on the anti-diagonal. This scheme brings additional performance benefits: inter-thread communication only takes place between boundary cells of different threads, and for the rest of the cells the communication is done through direct register access, which has the lowest access latency among all data access methods on the GPU.
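The sketch below illustrates this boundary-only communication pattern for a generic score value, assuming each thread owns CELLS consecutive cells of the anti-diagonal (CELLS = 2 in Figure 8) and following the same lane-indexing convention as Listing 2b); compute() is left abstract exactly as in Listing 2, and the boundary handling for lane 0 is likewise omitted:

#define CELLS 2

// cur, prev, and prev2 hold this thread's cells on the current and the two previous
// anti-diagonals; the compile-time indices keep them in registers.
float cur[CELLS], prev[CELLS], prev2[CELLS];

// Only the thread's first cell needs data owned by the neighboring lane.
float up_boundary     = __shfl(prev[CELLS - 1],  threadIdx.x - 1);
float leftup_boundary = __shfl(prev2[CELLS - 1], threadIdx.x - 1);

for (int c = 0; c < CELLS; c++) {
    float left   = prev[c];                                    // own register
    float up     = (c == 0) ? up_boundary     : prev[c - 1];   // shuffle only at the boundary
    float leftup = (c == 0) ? leftup_boundary : prev2[c - 1];
    cur[c] = compute(left, up, leftup);
}

With this layout, each anti-diagonal costs two shuffle instructions per thread regardless of CELLS, while all remaining accesses stay in registers.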

D. Performance Model

In this section we present the performance model, which provides a quantitative perspective for analyzing the performance of different kernels accelerating sequence alignment algorithms.

CUPs (cell updates per second) is the widely used metric for measuring the performance of sequence alignment algorithms. It measures the number of cells of the matrix that can be computed per second. We adopt this metric as the measurement of kernel performance in this work. Note that for PairHMM, we count the three updates of the three matrices (M, I, and D) as one cell update for simplicity.

The performance model is shown below.

performance(CUPs) = \frac{parallelism \times frequency}{latency} \qquad (7)

parallelism is defined as the number of cells updated in parallel. If each thread is assigned to update one cell of the matrix, this factor equals the number of active threads across all the SMs of the GPU. It can be calculated using the equation below.

parallelism = \#SM \times \min\left\{ \frac{\#reg/SM}{\#reg/thread},\ \frac{\#sharedMem/SM}{\#sharedMem/threadBlock} \times \frac{thread}{threadBlock} \right\} \qquad (8)

where #SM, #reg/SM, and #sharedMem/SM are platform-dependent values, while #reg/thread, #sharedMem/threadBlock, and thread/threadBlock are kernel characteristics, which can be obtained with the help of the Nvidia nvcc compiler by setting specific compilation flags. Note that this factor is proportional to the occupancy as well.

The factor frequency is gathered from the hardware specification. latency refers to the average time interval to finish one cell. More specifically, when every cell on the anti-diagonal is assigned to one thread, it refers to the latency to finish the entire anti-diagonal. The latency is dominated by the critical path of each iteration, which includes 1) loading data from the neighbor cells, 2) computing the current cell, and 3) writing back the current cell.
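As a purely hypothetical illustration of Equations (7) and (8) (the numbers below are invented for the example and are not taken from Table II): consider a GPU with 24 SMs, 65,536 registers and 96 KB of shared memory per SM, running at 1 GHz, and a kernel that uses 64 registers per thread and 4 KB of shared memory per 128-thread block, with a 1,000-cycle critical path per iteration.

parallelism = 24 \times \min\left\{ \frac{65536}{64},\ \frac{98304}{4096} \times 128 \right\} = 24 \times \min\{1024,\ 3072\} = 24576

performance = \frac{24576 \times 10^{9}}{1000} \approx 24.6\ \text{GCUPs}

In this hypothetical case the register budget, not the shared memory, caps the parallelism, so increasing the register usage per thread would be the first thing to hurt occupancy.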

This performance model is intuitively simple and can be quite handy for estimating and analyzing the performance of kernels. In our work, this model serves two purposes. First, it helps justify the validity of our microbenchmarks for testing the latency of different instructions: after gathering the kernel performance and kernel characteristics, we can calculate the latency and compare it to the latency estimated using results from our microbenchmarks. Second, it helps programmers estimate the performance of different communication methods. With parallelism estimated using Equation 8 and latency from our microbenchmarks, CUDA programmers can easily evaluate the trade-offs in advance, before taking the effort to implement a new kernel using shuffle instructions.

V. EXPERIMENTAL RESULTS

In this section we first introduce the experimental setup and present the overall performance of our designs on different platforms. Then we use the performance of the designs to validate the microbenchmarks with the help of the performance model. We discuss the trade-offs of using shuffle instructions at the end.

A. Experiment Setup

The kernel performance is evaluated on two Nvidia GPUs: Quadro K1200 and Titan X. Both K1200 and Titan X use the Maxwell architecture; K1200 is a low-end GPU with high energy efficiency, while Titan X is a high-end GPU with high computational capability.

The algorithms of SW and PairHMM are extracted from GATK 3.6. We use the genome sample of a human with breast cancer (HCC1954) as the input of HaplotypeCaller and dump out the data as the input datasets for the two kernels.

All of the GPU kernels are written in CUDA 7.5. The Nvidia nvcc compiler is used to compile the code and obtain the kernel characteristics (e.g., register and shared memory usage). For convenience of illustration, in the following sections we denote the SW implementations using shared memory and shuffle as SW1 and SW2, and the corresponding PairHMM implementations as PH1 and PH2.

B. Performance Overview

In HaplotypeCaller, the DNA sequence is broken into regions to be analyzed in order. For each region, HaplotypeCaller has two intermediate stages that generate several pairs of sequences to align, using SW and PairHMM, respectively. Each pair of sequences is denoted as a task. We dump out the tasks for SW and group them together as batches for each region. The input datasets for PairHMM are generated in the same fashion.


Figure 9: Performance overview of the GPU implementations (average and peak GCUPs on K1200 and Titan X): a) SW, b) PairHMM.

Figure 10: Impact of re-batching on the performance of the SW kernels (GCUPs vs. batch size): a) K1200, b) Titan X.

For each kernel, the number of tasks in each batch varies depending on the DNA clips being analyzed. In our datasets, the average number of tasks per batch for SW is four, whereas it is 189 for PairHMM. The insufficient number of tasks for SW limits its performance, as discussed later.

We measure the GCUPs performance for each batch and take the average. For the GPU implementations, each task is assigned to one thread block, and all the tasks within the same batch are launched together as one compute kernel. We set BSIZE to 32 for both SW1 and SW2, which offers the best performance in our experiments. We set 128 threads/threadBlock for PH1, because the maximal sequence length is less than 128, and 32 threads/threadBlock for PH2. Note that the GPU performance reported below includes the data transfer time between the host and the device.

Figure 9 shows the average and maximal performance of all the kernels. Note that the performance for SW is relatively low because the GPU performance is severely impacted by the batch size, as mentioned above. Therefore, we break the boundaries of regions and re-batch tasks from different regions together to evaluate the SW kernels. The results are shown in Figure 10.

As we can see, our SW kernels deliver significant performance. On Titan X, design SW2 achieves a peak performance of 19.6 GCUPs and an average performance of 18.5 GCUPs when re-batching 3,200 tasks together.

Table II: Detailed information of kernels.

                            SW1     SW2     PH1     PH2
GCUPs                       3.6     4.3     3.2     6.8
occupancy (%)               32.2    49.7    56.2    29.1
#reg/thread                 54      56      56      94
#sharedMem/threadBlock      1868    1472    6764    2644
latency (cycles)            1203    1014    1528    379
latency reduction (cycles)          189             1149

Table III: Instruction breakdown and analysis for the latency reduction.

                                        #instructions
operation   instruction          SW1      SW2       PH1      PH2
LOAD        SMEM / (shfl, reg)   3        (2, 1)    32       (6, 25)
WRITE       SMEM / reg           1        1         12       12
ROTATE      SMEM / reg           2        2         24       24
SYNC        __syncthreads()      1        0         1        0
latency (cycles)                 183      22        1485     115
est. reduction (cycles)                   161 (-14.8%)       1370 (19.2%)

As for PairHMM, on Titan X design PH2 achieves a peak performance of 34.8 GCUPs and an average performance of 6.0 GCUPs. For all the implementations, using shuffle instructions delivers better performance than using shared memory.

C. Model Validation

In this section we use our performance model to validate the results of the microbenchmarks. We select the biggest batch among the original datasets (without re-batching) as the input and K1200 as the target platform. This guarantees that the GPU is fully occupied by tasks, so that our analysis is not affected by factors other than the computation itself. We conduct several repeated runs for each kernel and take the average as the kernel performance. This number does not include the data transfer time, because our focus is the computation part of the kernel.

Table II presents the kernel performance and statistics. Using our performance model, we can compute the average latency of each iteration for the four kernels.

For SW, using shuffle cuts down the iteration latency by 189 cycles. The reduction comes from 1) the data access latency and 2) the synchronization. Table III shows the detailed breakdown of the latency.

In the shared memory implementation shown in Listing 2a), there are three shared memory loads and one write in each iteration. Two more accesses are required for rotating the three line buffers, and one __syncthreads() is issued at the end of each iteration. Based on our microbenchmarks, a shared memory access takes around 21 cycles, and __syncthreads() takes 57 cycles. We can estimate the latency as 3 × 21 + 1 × 21 + 2 × 21 + 1 × 57 = 183 cycles. In SW2, the three loads from neighbor cells are replaced by two shuffles and one direct register access, as shown in Listing 2b). All the other instructions can be replaced by direct register accesses, and the synchronization is eliminated. The estimated latency is 2 × 9 + 4 × 1 = 22 cycles. Therefore, the estimated latency reduction is 183 − 22 = 161 cycles. The relative error of our analysis is -14.8%.

For PairHMM, in each iteration there are three matrices to update (eight loads in total: three in match, three in insertion, and two in deletion). Additionally, 128 threads (4 warps) are used to update one anti-diagonal, which issues 8 × 4 = 32 shared memory instructions each time. As for design PH2, each thread computes 4 cells in total. Inter-thread communication happens between boundary cells, which brings two shuffle instructions and one register access for each matrix (only two shuffles for deletion). For all the remaining cells inside, direct register access suffices. There are six shuffle instructions and 25 register accesses in total. The analysis for the other operations is similar to SW. Based on our estimation, using shuffle instructions reduces the latency by 1370 cycles. The relative error is 19.2%.

The relative errors of the predictions for both designs are low, which validates the results of the microbenchmarks. This analysis also shows a typical flow for CUDA programmers to estimate the performance gains of using shuffle instructions: first, estimate the new register and shared memory usage when using shuffle instructions and compute the factor parallelism; then, calculate the latency based on the computation breakdown and the instruction latencies from the microbenchmarks; finally, use our model to compute the performance of the shuffle designs.

D. Trade-Off Analysis

Using shuffle instructions helps improve performance over designs using shared memory. As shown in Table II, using shuffle instructions provides performance gains of 1.2× and 2.1× for SW and PairHMM, respectively.

For SW, using shuffle instructions helps save shared memory, and therefore increases the occupancy (parallelism). Meanwhile, the latency is reduced. Both factors contribute to the performance improvement of SW2 over SW1.

As for PairHMM, we observe a drop in occupancy from PH1 to PH2. The reason is that we assign more cells to each thread to calculate, which significantly increases the register usage per thread. The register usage becomes the limiter of occupancy and drags it down from 56.2% to 29.1%. Meanwhile, inter-thread communication in PairHMM is more intensive than in SW, as we access more data (three matrices vs. one matrix) with more threads (128 threads/threadBlock vs. 32 threads/threadBlock). The reduction in latency from using shuffle instructions is larger than for SW, which offsets the decrease in parallelism and eventually improves the performance.

Based on the analysis above, we conclude that the trade-offs of using shuffle instructions are as follows:

• Using register shuffle can free up shared memory. However, the impact on occupancy varies among applications. It is possible that the increased number of registers becomes the new limiter of occupancy, which hurts the parallelism we can obtain.

• In terms of performance, both parallelism and latency matter. For applications with intensive inter-thread communication, the reduction in latency from using shuffle instructions plays an important role in the overall performance, and can even offset the negative impact of using more registers, bringing performance gains in the end.

With the help of the performance model and the microbenchmarks for shuffle instructions, CUDA programmers can easily handle such trade-offs and make design decisions in advance. Note that the root cause of the first point lies in the limitation of shuffle instructions: they can only be used for threads within the same warp. Therefore, for applications like PairHMM, we need to place more cells to calculate in each thread, with more registers used per thread. This significantly increases the register usage and eventually affects the occupancy.

VI. RELATED WORK

The shuffle instruction offers a new alternative for inter-thread communication. Previous works [5]–[7] leverage shuffle instructions as a replacement for shared memory operations and have seen performance gains from such optimization. However, there is no detailed analysis of the root causes of such benefits. Our work extends the shuffle instructions to a new application domain, sequence alignment, and is the first work to conduct a systematic and quantitative analysis of shuffle instructions.

In this paper we pick SW and PairHMM as two application drivers that present near-neighbor communication dependency patterns. SW is a well-known sequence alignment algorithm, and there are many GPU implementations of different variants of SW. Manavski et al. [11] utilize the inter-task parallelism by assigning each GPU thread one entire alignment task. The sequences need to be sorted in advance so that the tasks for threads within the same block can be as similar as possible. Liu et al. [10] propose a combined solution that couples CPU and GPU together to accelerate SW protein search. They use the same programming model as Manavski et al. and further utilize SIMD instructions to improve parallelism. The SW application we adopt in this paper is different from these works in terms of both the algorithm and the datasets.

PairHMM is a variant of SW which integrates an HMM into the algorithm. There have been several previous works [12], [17] on CPUs and FPGAs that employ the intra-task parallelism along the anti-diagonals. Intel released the Genomics Kernel Library (GKL) [13], which uses AVX intrinsics to accelerate the algorithm. Ito et al. [12] propose a systolic array architecture on FPGA that uses the IBM CAPI interface and delivers a performance of 1.7 GCUPs on the same genome sample that we use in this paper. Our implementation in this work outperforms all the previous works on PairHMM.


VII. CONCLUSION

Data movement is one of the critical limiting factors for performance and energy efficiency. In this work we look into the communication optimization methodology for applications with intensive inter-thread communication. The shuffle instruction offers a new alternative to the conventional methods of using shared memory and global memory, but brings trade-offs that need to be considered at the same time. In this work we conduct a quantitative analysis of shuffle instructions, using two sequence alignment algorithms (SW and PairHMM) as case studies.

A suite of microbenchmarks is developed for measuring the latency of shuffle and several other instructions. We find that the latency of shuffle is in-between that of register and shared memory access, and that it varies across different types of shuffle instructions and architectures (Kepler vs. Maxwell).

We implement the two algorithms using either shared memory or register shuffle for inter-thread communication. Using shuffle instructions instead of shared memory brings significant performance gains of 1.2× and 2.1× for SW and PairHMM, respectively. This demonstrates the optimization opportunities of deploying shuffle instructions for applications with intensive inter-thread communication.

We propose a performance model that takes kernel characteristics and instruction latencies into consideration, and helps analyze and estimate the design performance for such algorithms. With the help of this performance model and the microbenchmarks, we conduct a detailed analysis of the performance impacts of shuffle instructions.

This work provides valuable insights for CUDA programmers making trade-offs when using shuffle instructions in a wider class of applications.

VIII. ACKNOWLEDGEMENT

The authors would like to thank Muhuan Huang and Janice Martin-Wheeler for editing the paper. This work was supported in part by C-FAR, one of the six SRC STARnet Centers, sponsored by MARCO and DARPA, an NSF/Intel Innovation Transition Grant (CCF-1436827) awarded to the Center for Domain-Specific Computing, and contributions from Fujitsu Laboratories.

REFERENCES

[1] W.-m. Hwu, "What is ahead for parallel computing," Journal of Parallel and Distributed Computing, vol. 74, no. 7, pp. 2574–2581, 2014.

[2] S. Xiao and W.-c. Feng, "Inter-block GPU communication via fast barrier synchronization," in Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium on, April 2010, pp. 1–12.

[3] T. B. Jablin, P. Prabhu, J. A. Jablin, N. P. Johnson, S. R. Beard, and D. I. August, "Automatic CPU-GPU communication management and optimization," in Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI '11. New York, NY, USA: ACM, 2011, pp. 142–151.

[4] Nvidia. (2016) CUDA C programming guide. [Online]. Available: http://docs.nvidia.com/cuda/cuda-c-programming-guide/#axzz4NVazmUbb

[5] Y. Hanada, S. Kitaoka, and Y. Xinhua, "Optimizing particle simulation for Kepler GPU," Procedia Engineering, vol. 61, pp. 376–380, 2013.

[6] D. Bakunas-Milanowski, V. Rego, J. Sang, and C. Yu, "A fast parallel selection algorithm on GPUs," in 2015 International Conference on Computational Science and Computational Intelligence (CSCI). IEEE, 2015, pp. 609–614.

[7] H. Jiang and N. Ganesan, "Fine-grained acceleration of HMMER 3.0 via architecture-aware optimization on massively parallel processors," in Parallel and Distributed Processing Symposium Workshop (IPDPSW), 2015 IEEE International. IEEE, 2015, pp. 375–383.

[8] Broad Institute. (2016) Genome Analysis Toolkit. [Online]. Available: https://software.broadinstitute.org/gatk/

[9] M. Harris, "CUDA Pro Tip: Do the Kepler shuffle," 2014. [Online]. Available: https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-kepler-shuffle/

[10] G. Lu and J. Ni, "Highlighting computations in bioscience and bioinformatics: review of the symposium of computations in bioinformatics and bioscience (SCBB07)," BMC Bioinformatics, vol. 9, no. 6, p. S1, 2008.

[11] S. A. Manavski and G. Valle, "CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment," BMC Bioinformatics, vol. 9, no. 2, p. S10, 2008.

[12] M. Ito and M. Ohara, "A power-efficient FPGA accelerator: systolic array with cache-coherent interface for pair-HMM algorithm," in 2016 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS XIX). IEEE, 2016, pp. 1–3.

[13] Intel. (2016) Genomics Kernel Library (GKL). [Online]. Available: https://github.com/Intel-HLS/GKL

[14] Broad Institute, "HaplotypeCaller," 2016. [Online]. Available: https://software.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php

[15] T. F. Smith and M. S. Waterman, "Identification of common molecular subsequences," Journal of Molecular Biology, vol. 147, no. 1, pp. 195–197, 1981.

[16] R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison, Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, 1998.

[17] S. Ren, V. M. Sima, and Z. Al-Ars, "FPGA acceleration of the pair-HMMs forward algorithm for DNA sequence analysis," in Bioinformatics and Biomedicine (BIBM), 2015 IEEE International Conference on, Nov 2015, pp. 1465–1470.