To GPU Synchronize or Not GPU Synchronize? Wu-chun Feng and Shucai Xiao, Department of Computer Science and Department of Electrical and Computer Engineering, Virginia Tech. ISCAS 2010.

Page 1:

To GPU Synchronize or Not GPU Synchronize?
Wu-chun Feng and Shucai Xiao
Department of Computer Science and Department of Electrical and Computer Engineering, Virginia Tech
ISCAS 2010

Page 2: Outline

• Introduction
• Preliminaries
• Related Work
• Proposed GPU-based Synchronization
• Problems, Experiments, and Analysis
• Conclusions

Page 3: Introduction

• The multi-core (many-core) era has arrived.
• General-purpose GPUs (GPGPUs) allow massively parallel computation at low cost.
• GPUs typically map well only to data-parallel or task-parallel applications.
  – This is due to the lack of support for communication between streaming multiprocessors (SMs).

Page 4: Introduction (cont.)

• Communication can be done via global memory.
  – This requires barrier synchronization.
• CPU barrier synchronization
  – Implements the barrier (inefficiently) via the host CPU (see the sketch below).
  – Slow.
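
A minimal sketch of CPU barrier synchronization, assuming a hypothetical kernel one_step() that performs one iteration of the computation. Each kernel launch, followed by a wait on the host, acts as an implicit global barrier across all thread blocks (the CUDA 2.2-era cudaThreadSynchronize() is used for the wait):

    // Host code: one kernel launch per iteration; returning control to the
    // host between launches serves as the (slow) global barrier.
    for (int i = 0; i < numIterations; ++i) {
        one_step<<<numBlocks, numThreads>>>(d_data);
        cudaThreadSynchronize();  // host waits for all blocks to finish
    }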

Page 5: Introduction (cont.)

• GPU barrier synchronization
  – Improves performance by 10–40%.
  – Theoretically runs the risk that the barrier may release early.
• CUDA 2.2 supports a new function, __threadfence(), to solve this problem.
  – With it, correctness can be guaranteed (see the sketch below).
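
A minimal sketch of what __threadfence() guarantees, using the hypothetical names d_result and d_flag: before one block publishes a "ready" flag for other blocks, it fences so that its earlier global-memory write is visible device-wide first:

    __global__ void producer(float *d_result, volatile int *d_flag)
    {
        if (blockIdx.x == 0 && threadIdx.x == 0) {
            d_result[0] = 42.0f;  // write the result to global memory
            __threadfence();      // make the write visible to all blocks
            *d_flag = 1;          // only then signal that the result is ready
        }
    }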

Page 6: Introduction (cont.)

• Unfortunately, __threadfence() incurs substantial overhead in the proposed GPU barrier synchronization.
  – That is, CPU barrier synchronization performs as well as, or better than, GPU barrier synchronization in many cases.
• Hence the question: "To GPU synchronize or not GPU synchronize?"

Page 7: Preliminaries: CUDA

• Compute Unified Device Architecture, developed by NVIDIA.
• The CPU code does the sequential part.
• The highly parallel part is usually implemented in GPU code, called a kernel.
• Calling a GPU function from CPU code is called a kernel launch.
• In a kernel, threads are grouped as a grid of thread blocks, and each thread block contains a number of threads.
  – Multiple blocks can execute on the same SM, but one block cannot execute across different SMs.

Page 8: Preliminaries: GPU architecture

(Figure: GPU architecture diagram.)

Page 9: Preliminaries: synchronization

• Synchronization in parallel programming: making sure that each thread gets the right data for computation.
• CUDA provides a data communication mechanism for threads within a single block via the barrier function __syncthreads().
  – Intra-SM communication (see the sketch below).
• However, there is no explicit software or hardware support for data communication among threads across different blocks.
  – Inter-SM communication.
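
A minimal sketch of intra-block communication with __syncthreads(), assuming blockDim.x == 256: threads stage data in shared memory, hit the barrier, and then safely read values written by other threads of the same block:

    __global__ void reverse_in_block(float *d_out, const float *d_in)
    {
        __shared__ float tile[256];
        int t = threadIdx.x;
        int base = blockIdx.x * blockDim.x;
        tile[t] = d_in[base + t];
        __syncthreads();  // all shared-memory writes in the block are now visible
        d_out[base + t] = tile[blockDim.x - 1 - t];  // read another thread's value
    }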

Page 10: Related Work

• When multiple GPU thread blocks are scheduled to execute on a single SM simultaneously, deadlock might occur.
  – In a multi-core environment, a process can yield its execution to other processes, but CUDA blocks do not.
• [17] assigns only one block per SM to address this problem.

Page 11: Related Work (cont.)

• When barrier synchronization is needed across different blocks, programmers traditionally use a kernel launch as an implicit barrier [4], [7].
• [14] proposes a protocol for data communication across multiple GPUs.
  – Data must be transferred to host memory first and then copied back to device memory; hence it performs poorly between SMs on a single GPU.

Page 12: Proposed GPU-based Synchronization

• Lock-based synchronization
  – A single mutex variable is shared by all thread blocks.
  – Once a block finishes its computation on an SM, it atomically increments the mutex variable (see the sketch below).
• Lock-free synchronization
  – One distinct variable controls each block, eliminating the need for different blocks to contend for the single mutex variable.
  – The need for atomic addition is removed.
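
A minimal sketch of the lock-based barrier, assuming at most one block per SM; g_mutex, gpu_sync(), and goalVal are hypothetical names (goalVal is the count the mutex must reach, e.g., the number of blocks, scaled per iteration when the barrier is called in a loop):

    __device__ volatile int g_mutex = 0;  // single counter shared by all blocks

    __device__ void gpu_sync(int goalVal)
    {
        __threadfence();  // flush this block's global writes (the CUDA 2.2 fix)
        if (threadIdx.x == 0) {
            atomicAdd((int *)&g_mutex, 1);  // announce this block's arrival
            while (g_mutex != goalVal) { }  // spin until all blocks arrive
        }
        __syncthreads();  // release the remaining threads of the block
    }

The lock-free variant replaces the shared counter with one flag per block, so the atomicAdd() and the contention on g_mutex disappear.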

Page 13: Experiments

• Environment
  – GeForce GTX 280: 30 SMs, 8 cores each, running at 1.3 GHz (shader clock).
  – CUDA 2.2 SDK.
  – Details are omitted.
• Two experiments
  – Dynamic programming (DP) for genomic sequence alignment (specifically the Smith-Waterman algorithm).
  – Bitonic sort (BS).

Page 14: Performance comparisons

(Figure: performance comparison of CPU vs. GPU barrier synchronization.)

Page 15: Problems, Experiments, and Analysis

• To eliminate the infinitesimal risk that the barrier may release early when the proposed synchronization runs, __threadfence() is used, and this incurs the overhead.
• The same experiments were rerun with the barrier modified to use __threadfence().

Page 16: Performance comparisons

(Figure: performance comparison with the barrier modified to use __threadfence().)

Page 17: Problems, Experiments, and Analysis (cont.)

• GPU lock-based synchronization is analyzed as an example; its operation set is a superset of the one used by lock-free synchronization.
• Synchronization overhead components:
  – ta: the overhead of the atomic add.
  – tc: the mutex-variable checking time.
  – ts: the time consumed by __syncthreads().
  – tf: the __threadfence() execution time.

Page 18: Problems, Experiments, and Analysis (cont.)

• Unfortunately, the execution times of these component operations cannot be measured directly on the GPU.
• An indirect approach is used instead.
  – A kernel's execution time can be expressed as the computation time plus the synchronization overhead, i.e., T = t_compute + (ta + tc + ts + tf).
  – Measure the kernel execution time both with and without a specific operation, and take the difference as the overhead of that operation.

Page 19: Execution time profiling

• A microbenchmark is used for the measurements.
  – It calculates the average of two floats over 10,000 iterations.
• CPU synchronization
  – Each kernel launch calculates the average once, and the kernel is launched 10,000 times.
• GPU synchronization
  – The kernel is launched only once; a 10,000-iteration for loop inside the kernel calls the GPU barrier function in each iteration (see the sketch below).
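
A sketch of the two microbenchmark variants, using assumed names and the gpu_sync() sketch from Page 12; both average two floats 10,000 times, and only the barrier mechanism differs:

    // CPU synchronization: one averaging step per launch, 10,000 launches.
    __global__ void avg_once(const float *d_a, const float *d_b, float *d_out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        d_out[i] = (d_a[i] + d_b[i]) * 0.5f;
    }
    // Host side:
    //   for (int it = 0; it < 10000; ++it) {
    //       avg_once<<<numBlocks, numThreads>>>(d_a, d_b, d_out);
    //       cudaThreadSynchronize();
    //   }

    // GPU synchronization: a single launch, with the barrier inside the loop.
    __global__ void avg_loop(const float *d_a, const float *d_b, float *d_out,
                             int numBlocks)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        for (int it = 1; it <= 10000; ++it) {
            d_out[i] = (d_a[i] + d_b[i]) * 0.5f;
            gpu_sync(it * numBlocks);  // goal value advances by numBlocks per pass
        }
    }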

Page 20: Results

• Over 10,000 executions: ts = 0.541, ta = 2.300 × n, tc = 5.564, and tf = 0.333 × n + 7.267, where n is the number of blocks in the kernel and the units are milliseconds.

• The components are derived from the measured kernel times t1 through t5 shown in the timing figure:
  – ts = t3 − t1
  – ta = t2
  – tc = t4 − t3 − t2
  – tf = t5 − t4
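
As a quick worked check of these fitted costs, assuming the components simply add: for n = 30 blocks (one per SM on the GTX 280), 10,000 barriers cost ts + ta + tc + tf = 0.541 + 69.0 + 5.564 + 17.257 ≈ 92.4 ms, i.e., about 9.2 µs per barrier, of which the __threadfence() term contributes roughly 1.7 µs.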

Page 21: Conclusions

• Demonstrated the efficiency of inter-SM communication using GPU-based barrier synchronization.
• To eliminate the risk of an early barrier release, __threadfence() is used, though it incurs high overhead.
• Grudgingly conclude that one should GPU synchronize (with or without __threadfence()) on the current generation of GPUs, with a more definitive "yes" for the next generation.