
  • synergy.cs.vt.edu

    To GPU Synchronize or Not GPU Synchronize?

    Prof. Wu FENG, Departments of Computer Science and Electrical & Computer Engineering

    NVIDIA Booth, SC|09, November 2009

  • synergy.cs.vt.edu

    What are the Takeaways?

    • At the systems software level …
      – How to support efficient communication between SMs via barrier synchronization on the GPU → GPU Synchronization

    • At the application level …
      – How to integrate the GPU synchronization capability into real applications: FFT, dynamic programming, and bitonic sort

    • From a performance and correctness perspective …
      – How to “improve” the performance of existing GPU-optimized applications
      – How to guarantee correctness and its associated cost

  • synergy.cs.vt.edu

    Motivation

    • Only task- or data-parallel algorithms map well to the GPU.
      – No explicit support for communication between SMs, i.e., inter-block data communication

    • Consecutive kernel launches from the CPU serve as an implicit barrier synchronization for inter-block communication.
      – How expensive is CPU implicit barrier synchronization?
      – Would barrier synchronization on the GPU be better in support of “more general-purpose computation”?

    • To GPU synchronize or not GPU synchronize?

  • synergy.cs.vt.edu

    Outline

    • Motivation

    • Background

    – GTX 280 and CUDA Programming Model

    • GPU Synchronization

    – GPU Lock-Based Synchronization

    – GPU Lock-Free Synchronization

    • Experimental Results

    • Conclusion & Future Work

  • synergy.cs.vt.edu

    GTX 280 Architecture

    • On-chip memory: small sizes, fast access

    • Off-chip memory (local and global memory): large size, high access latency

  • synergy.cs.vt.edu

    CUDA Programming Model

    • CUDA: An extension of the C programming language

    • Kernel: A global function called from the host and executed on the device
      – Consists of multiple blocks, with each block consisting of multiple threads
      – Intra-block sync is implemented with __syncthreads()
      – Inter-block sync is implemented via kernel launches

    [Figure: the host launches Kernel 1 and Kernel 2 on the device, each as a grid of blocks (Block #1 … Block #N) made up of threads (Thread #1 … Thread #M); __syncthreads() is the barrier within a block, and the boundary between the two kernel launches is the barrier between blocks]
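    A minimal sketch (not taken from the slides) of the two mechanisms above: __syncthreads() for intra-block synchronization, and consecutive kernel launches as the implicit inter-block barrier. The kernel names, launch configuration, and data layout are illustrative assumptions.

        // Intra-block sync: threads of the SAME block wait for each other.
        __global__ void stage1(float *data)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            data[i] += 1.0f;
            __syncthreads();   // after this, threads of this block see each other's writes
        }

        // Consumes stage1's results; it can do so only because the kernel-launch
        // boundary below acted as a device-wide (inter-block) barrier.
        __global__ void stage2(float *data)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            data[i] *= 2.0f;
        }

        void run(float *d_data, int nBlocks, int nThreads)
        {
            stage1<<<nBlocks, nThreads>>>(d_data);   // all blocks of stage1 finish ...
            stage2<<<nBlocks, nThreads>>>(d_data);   // ... before any block of stage2 starts
            cudaThreadSynchronize();                 // host-side wait (CUDA 2.3-era API)
        }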

  • synergy.cs.vt.edu

    Outline

    • Motivation

    • Background

    – GTX 280 and CUDA Programming Model

    • GPU Synchronization

    – GPU Lock-Based Synchronization

    – GPU Lock-Free Synchronization

    • Experimental Results

    • Conclusion & Future Work

  • synergy.cs.vt.edu

    Types of Barrier Synchronization

    CPU Synchronization vs. GPU Synchronization

    [Figure: two host/device timelines. With CPU synchronization, every computation step is a separate kernel launch and return, and the launch boundary is the barrier. With GPU synchronization, a single kernel is launched and the barrier is performed on the device by __gpu_sync().]

    CPU (implicit) synchronization (host side):

        for (...) {
            __kernel_func();
        }

    GPU synchronization (device side):

        __global__ void __kernel_func()
        {
            for (...) {
                __device_func();
                __gpu_sync();
            }
        }

  • synergy.cs.vt.edu

    GPU Lock-Based Synchronization

    • Time Profile (see the sketch below)
      1. atomicAdd(1) on g_mutex: executed sequentially by the N blocks (N·t_a)
      2. g_mutex checking: each block spins until g_mutex == N, executed in parallel (t_c)
      3. Intra-block sync with __syncthreads() (t_s)

    • Total synchronization time: N·t_a + t_c + t_s
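    A hedged sketch of the lock-based __gpu_sync() whose three steps are profiled above. Only the g_mutex counter, the atomicAdd(1), the check against N (passed here as goalVal), and the final __syncthreads() come from the slide; the volatile qualifier and the exact signature are assumptions.

        __device__ volatile int g_mutex = 0;

        __device__ void __gpu_sync(int goalVal)
        {
            // Step 1: one thread per block announces arrival (the atomicAdds are
            // serialized across blocks, hence the N * t_a term above).
            if (threadIdx.x == 0) {
                atomicAdd((int *)&g_mutex, 1);
                // Step 2: spin until all N blocks have arrived (g_mutex == goalVal).
                while (g_mutex != goalVal) { }
            }
            // Step 3: release the remaining threads of this block.
            __syncthreads();
        }

    When the barrier is used repeatedly inside one kernel, goalVal is typically advanced by gridDim.x on every call so that g_mutex never has to be reset. Note that any such barrier can only work if the kernel is launched with no more blocks than the GPU can keep resident at once; otherwise blocks that have not yet been scheduled can never reach the barrier and the kernel deadlocks.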

  • synergy.cs.vt.edu

    GPU Lock-Free Synchronization

    • Time Profile (see the sketch below)
      1. Set Ain: block i sets Ain[i] = 1 (t_SI)
      2. Check Ain: thread i of Block #1 spins until Ain[i] == 1 (t_CI)
      3. Intra-block sync with __syncthreads() (t_S)
      4. Set Aout: thread i of Block #1 sets Aout[i] = 1 (t_SO)
      5. Check Aout: block i spins until Aout[i] == 1 (t_CO)
      6. Intra-block sync with __syncthreads() (t_S)

    • Note: each step can be executed in parallel by different blocks

    • Total synchronization time: t_SI + t_CI + 2·t_S + t_SO + t_CO
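    A hedged sketch of the lock-free barrier whose six steps are profiled above. The array names (Arrayin/Arrayout for Ain/Aout), the use of goalVal instead of the constant 1, and letting block 0 play the role of “Block #1” are assumptions; the sketch also assumes at least as many threads per block as there are blocks.

        __device__ void __gpu_sync_lockfree(int goalVal,
                                            volatile int *Arrayin,
                                            volatile int *Arrayout)
        {
            int tid = threadIdx.x;
            int bid = blockIdx.x;
            int nBlockNum = gridDim.x;

            // Step 1 (set Ain): thread 0 of each block announces its arrival.
            if (tid == 0)
                Arrayin[bid] = goalVal;

            if (bid == 0) {
                // Step 2 (check Ain): thread i of block 0 waits for block i.
                if (tid < nBlockNum)
                    while (Arrayin[tid] != goalVal) { }
                // Step 3: intra-block sync inside block 0.
                __syncthreads();
                // Step 4 (set Aout): thread i of block 0 releases block i.
                if (tid < nBlockNum)
                    Arrayout[tid] = goalVal;
            }

            // Step 5 (check Aout): thread 0 of each block waits to be released.
            if (tid == 0)
                while (Arrayout[bid] != goalVal) { }
            // Step 6: intra-block sync in every block.
            __syncthreads();
        }

    No atomic operations are needed, which is what removes the serialized N·t_a term of the lock-based scheme.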

  • synergy.cs.vt.edu

    Guaranteeing Correctness

    [Figure: the lock-based and lock-free time profiles from the previous slides, now with a __threadfence() inserted in each block before it updates the synchronization variables (before the atomicAdd(1) on g_mutex in the lock-based scheme, and before setting Ain/Aout in the lock-free scheme), so that each block's earlier global-memory writes are guaranteed to be visible before any block leaves the barrier (see the sketch below)]
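    As a hedged illustration of where the fence goes, here is the lock-based barrier from the earlier sketch with __threadfence() added (reusing that sketch's g_mutex). The extra __syncthreads() before the handshake is my addition, so that every thread of the block has fenced its writes before the block announces its arrival; the lock-free variant is analogous, fencing before Ain is set.

        __device__ void __gpu_sync_fenced(int goalVal)
        {
            __threadfence();    // commit this thread's earlier global-memory writes
            __syncthreads();    // whole block has fenced before we announce arrival
            if (threadIdx.x == 0) {
                atomicAdd((int *)&g_mutex, 1);
                while (g_mutex != goalVal) { }
            }
            __syncthreads();
        }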

  • synergy.cs.vt.edu

    Outline

    • Motivation

    • Background

    – GTX 280 and CUDA Programming Model

    • GPU Synchronization

    – GPU Lock-Based Synchronization

    – GPU Lock-Free Synchronization

    • Experimental Results

    • Conclusion & Future Work

  • synergy.cs.vt.edu

    Experimental Set-Up

    • Hardware
      – Host
        • 2.2-GHz Intel Core 2 Duo CPU
        • 2 x 2 GB of DDR2 SDRAM
      – Device
        • GTX 280 video card
        • 1024 MB device memory

    • Software
      – 64-bit Ubuntu GNU/Linux 8.04
      – NVIDIA CUDA 2.3 SDK toolkit

  • synergy.cs.vt.edu

    Measurements

    • Execution time without __threadfence()

    • Execution time with __threadfence()

    • Synchronization time percentage (without __threadfence())

    Applications: Fast Fourier Transform (FFT), Smith-Waterman, Bitonic sort
    Synchronization approaches: CPU synchronization, GPU lock-based synchronization, GPU lock-free synchronization

  • synergy.cs.vt.edu

    Execution Time without __threadfence()

    [Figure: kernel execution time vs. number of blocks in the kernel for FFT, Smith-Waterman, and bitonic sort; the baseline is CPU synchronization, and the relative time decreases with GPU sync]

    • Less time is needed with more blocks in the kernel

    • The time difference between sync approaches is smaller for FFT than for the other two applications

    • GPU sync has better performance than CPU implicit sync, BUT …

  • synergy.cs.vt.edu

    Performance & Correctness: Execution Time w/o __threadfence()

    • Performance improvement (relative to existing GPU-optimized versions)
      – FFT: 10% | Dynamic programming: 26% | Bitonic sort: 40%

    • Overall performance improvement (relative to CPU serial)
      – FFT: 70x | Dynamic programming: 13x | Bitonic sort: 24x

    • But …
      – Our GPU barrier synchronizations run the risk that writes performed before our __gpu_sync() barrier are not completed by the time the GPU is released from the barrier (see the sketch below).
      – __syncthreads() can only guarantee that writes to shared and global memory are visible to threads of the same block; it cannot do so for threads across different blocks.
      – In practice, it is highly unlikely that the above will ever happen, given the amount of time spent spinning at the barrier, but it is still possible. So, …
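    A minimal, hypothetical illustration of that risk (the data array, the value 42, and the block indices are made up; __gpu_sync() is the lock-based sketch from earlier):

        __global__ void producer_consumer(int *data)
        {
            if (blockIdx.x == 0 && threadIdx.x == 0)
                data[0] = 42;          // write performed before the barrier

            __gpu_sync(gridDim.x);     // every block is released from the barrier ...

            if (blockIdx.x == 1 && threadIdx.x == 0) {
                // ... but without a memory fence this read is not guaranteed to
                // observe 42, even though the spin time makes a stale read unlikely.
                data[1] = data[0];
            }
        }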


  • synergy.cs.vt.edu

    Execution Time with __threadfence()

    [Figure: kernel execution time vs. number of blocks in the kernel for FFT, Smith-Waterman, and bitonic sort; the baseline is CPU synchronization, and the relative time increases with GPU sync]

    • The time difference between sync approaches is smaller for FFT than for the other two applications

    • With __threadfence() called, the performance of GPU sync is worse than that of CPU implicit sync

  • synergy.cs.vt.edu

    Percentage of Time Spent Synchronizing

    Synchronization Time Percentages (without __threadfence())

    • % of time spent synchronizing in FFT is lower than in the other two algorithms

    • Sync time percentages of SWat and bitonic sort are more than 50% with CPU sync

    • % of time spent in GPU sync is lower than that of CPU implicit sync

  • synergy.cs.vt.edu

    Profiling the Synchronization Time

    • Micro-benchmark: compute the average of two floats 10,000 times (see the sketch below)

    [Figure: an array is partitioned among Blocks #1 … #N; after each round of computation a barrier is needed before the next round starts. With CPU sync the barrier is a new kernel launch per round; with GPU sync it is a __gpu_sync() call inside a single kernel]
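    A minimal sketch of such a micro-benchmark (the ping-pong buffers, bounds check, and indexing are assumptions; __gpu_sync() is the lock-based sketch from earlier):

        #define NUM_ROUNDS 10000

        // GPU-sync version: one kernel launch, with __gpu_sync() as the per-round barrier.
        __global__ void avg_gpu_sync(float *in, float *out, int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            float *src = in, *dst = out;
            for (int r = 1; r <= NUM_ROUNDS; ++r) {
                if (i < n - 1)
                    dst[i] = (src[i] + src[i + 1]) * 0.5f;   // average of two floats
                __gpu_sync(r * gridDim.x);                   // barrier before the next round
                float *tmp = src; src = dst; dst = tmp;      // ping-pong the buffers
            }
        }

        // CPU-sync version: one kernel launch per round; the launch boundary is the barrier.
        __global__ void avg_one_round(float *src, float *dst, int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n - 1)
                dst[i] = (src[i] + src[i + 1]) * 0.5f;
        }

        void run_cpu_sync(float *d_in, float *d_out, int n, int nBlocks, int nThreads)
        {
            for (int r = 0; r < NUM_ROUNDS; ++r) {
                avg_one_round<<<nBlocks, nThreads>>>(d_in, d_out, n);
                float *tmp = d_in; d_in = d_out; d_out = tmp;
            }
            cudaThreadSynchronize();
        }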

  • synergy.cs.vt.edu

    Decomposing Synchronization Time

    • Ways to compute each time component
      – Only the kernel execution time can be recorded, so an indirect method is used
      – Use GPU lock-based synchronization as the example
      – Synchronization time can be represented as N·t_a + t_c + t_s + t_f
      – Times that can be recorded directly:
        • N·t_a : kernel consisting of only atomicAdd()
        • t_com : kernel consisting of only computation
        • t_com + N·t_a + t_c + t_s : kernel with GPU lock-based synchronization
        • t_com + t_s : kernel with computation and __syncthreads()
        • t_com + N·t_a + t_c + t_s + t_f : kernel with GPU lock-based synchronization and __threadfence()
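    From those directly recorded kernel times, each component follows by subtraction, e.g. (a worked rearrangement of the quantities above, in my notation):

        \begin{aligned}
        t_s &= (t_{com} + t_s) - t_{com}\\
        t_c &= (t_{com} + N t_a + t_c + t_s) - (t_{com} + t_s) - N t_a\\
        t_f &= (t_{com} + N t_a + t_c + t_s + t_f) - (t_{com} + N t_a + t_c + t_s)
        \end{aligned}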

  • synergy.cs.vt.edu

    Profile of Synchronization Time

    • Results (for 10,000 executions)

    [Figure: measured kernel times for t_com (black line), t_com + t_s (red line), t_com + N·t_a + t_c + t_s, t_com + N·t_a + t_c + t_s + t_f, and t_com + CPU sync]

      – t_com = 683.5
      – N·t_a = 300.2
      – t_c = 564.5
      – t_s = 540.0
      – t_f = 267.7 / 333.0
      – t_cpu_sync = 404.54

  • synergy.cs.vt.edu

    Outline

    • Motivation

    • Background

    – GTX 280 and CUDA Programming Model

    • GPU Synchronization

    – GPU Lock-Based Synchronization

    – GPU Lock-Free Synchronization

    • Experimental Results

    • Conclusion & Future Work

  • synergy.cs.vt.edu

    Conclusion

    • At the systems software level …
      – How to support efficient communication between SMs via barrier synchronization on the GPU → GPU Synchronization

    • At the application level …
      – How to integrate the GPU synchronization capability into real applications: FFT, dynamic programming, and bitonic sort

    • From a performance and correctness perspective …
      – How to “improve” the performance of existing GPU-optimized applications
      – How to guarantee correctness and its associated cost

    • __threadfence() guarantees correctness but needs to be optimized to support GPU sync. Is Fermi the answer?!


  • synergy.cs.vt.edu

    Conclusion

    • To GPU synchronize or not GPU synchronize?
      – GTX 280 / Tesla C1060 / Tesla S1070: Do NOT GPU synchronize.
      – Fermi? Likely GPU synchronize?

    • Next steps?
      – Efficient inter-block synchronization via NVIDIA Fermi
      – Efficient inter-block synchronization in OpenCL
      – Automated tool to transform CPU sync to GPU sync

    • For more information
      – “On the Robust Mapping of Dynamic Programming onto a Graphics Processing Unit,” 15th Int'l Conf. on Parallel & Distributed Systems, 12/2009.
      – “Inter-Block GPU Communication via Fast Barrier Synchronization,” Technical Report TR-09-19, Computer Science, Virginia Tech, 10/2009.

  • synergy.cs.vt.edu

    Yawn …

    • Where are the massive speed-ups and cool pictures?


  • synergy.cs.vt.edu

    Electrostatic Potential for Molecular Dynamics

    Processor + Optimization                Execution Time (seconds)   Speed-Up
    CPU Serial                              36,360                      -
    GPU + Kernel Split + Multi-Level HCP    0.37                        98,270x

    [Figure: viral capsid]

    • Visit the Supermicro booth, i.e., behind you, or go to http://www.youtube.com/watch?v=zPBFenYg2Zk

    • Contact: Prof. Alexey Onufriev for info on the science!

  • Wu FENG, [email protected]

    http://sss.cs.vt.edu/

    http://synergy.cs.vt.edu/


    http://www.green500.org/

    http://www.mpiblast.org/