Lect11-12: CUDA Threads


CUDA Threads

    David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
    ECE498AL, University of Illinois, Urbana-Champaign

CUDA Thread Block

    • All threads in a block execute the same kernel program (SPMD).
    • The programmer declares the block:
      Block size: 1 to 512 concurrent threads
      Block shape: 1D, 2D, or 3D
      Block dimensions in threads
    • Threads have thread ID numbers within the block; the thread program uses the thread ID to select work and address shared data.
    • Threads in the same block share data and synchronize while doing their share of the work.
    • Threads in different blocks cannot cooperate; each block can execute in any order relative to other blocks!

    [Figure: a thread block with Thread IDs 0, 1, 2, 3, …, m, all running the same thread program. Courtesy: John Nickolls, NVIDIA]

CUDA Thread Organization: Implementation

    • blockIdx & threadIdx: unique coordinates of threads, used to distinguish threads and to identify for each thread the appropriate portion of the data to process.
      Assigned to threads by the CUDA runtime system.
      Appear as built-in variables that are initialized by the runtime system and accessed within the kernel functions.
      References to the blockIdx and threadIdx variables return the appropriate values that form the coordinates of the thread.
    • gridDim & blockDim: built-in variables which provide the dimensions of the grid and block.

Example Thread Organization

    • M = 8 threads/block (1D organization): threadIdx.x = 0, 1, 2, …, 7
    • N thread blocks (1D organization): blockIdx.x = 0, 1, …, N-1
    • 8*N threads/grid: blockDim.x = 8; gridDim.x = N
    • In the code of the figure, threadID = blockIdx.x * blockDim.x + threadIdx.x
    • For Thread 3 of Block 5, threadID = 5*8 + 3 = 43

    [Figure: Thread Block 0, Thread Block 1, …, Thread Block N-1, each with threads 0 through 7 running the same body:
    float x = input[threadID]; float y = func(x); output[threadID] = y;]
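
    For reference, the figure's three-line body can be completed into a compilable kernel. This is a minimal sketch: "func" is not defined by the slides, so the doubling below is an arbitrary placeholder computation.

    // Hedged sketch; func's body is a placeholder, not from the slides.
    __device__ float func(float x) { return 2.0f * x; }

    __global__ void ExampleKernel(const float* input, float* output)
    {
        // Global thread ID, exactly as computed on this slide
        int threadID = blockIdx.x * blockDim.x + threadIdx.x;
        float x = input[threadID];
        float y = func(x);
        output[threadID] = y;
    }

    // Launched with M = 8 threads per block and N blocks, matching the example:
    // ExampleKernel<<<N, 8>>>(d_input, d_output);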

General Thread Organization

    • Grid: 2D array of blocks; Block: 3D array of threads.
    • The execution configuration determines the exact organization.
    • In the example of the previous slide, if N = 128 and M = 32, the following execution configuration is used:

    dim3 dimGrid(128, 1, 1);  // dimGrid.x, dimGrid.y take values 1 to 65,535
    dim3 dimBlock(32, 1, 1);  // size of block limited to 512 threads
    KernelFunction<<<dimGrid, dimBlock>>>(...);

    • blockIdx.x ranges between 0 and dimGrid.x - 1.

Another Example

    dim3 dimGrid(2, 2, 1);
    dim3 dimBlock(4, 2, 2);
    KernelFunction<<<dimGrid, dimBlock>>>(...);

    • Block(1,0) has blockIdx.x = 1 & blockIdx.y = 0.
    • Thread(2,1,0) has threadIdx.x = 2, threadIdx.y = 1, threadIdx.z = 0.

    [Figure: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2 of the device. Grid 1 contains Block(0,0), Block(1,0), Block(0,1), Block(1,1); an expanded view of Block(1,1) shows Threads (0,0,0) through (3,1,0), with the z = 1 layer (0,0,1) through (3,0,1) behind them.]
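
    When a single linear thread number within such a 3D block is needed, the standard CUDA flattening (x fastest, then y, then z) applies; this formula is common practice rather than something stated on the slide.

    // Hedged sketch: per-block linear thread ID for a 3D block shape.
    __global__ void LinearIdExample()
    {
        int tidInBlock = threadIdx.z * blockDim.y * blockDim.x
                       + threadIdx.y * blockDim.x
                       + threadIdx.x;
        // For dimBlock(4,2,2), Thread(2,1,0) gets 0*(2*4) + 1*4 + 2 = 6.
        (void)tidInBlock;  // a real kernel would use this index
    }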

Matrix Multiplication Using Multiple Blocks

    • Break up Pd into tiles.
    • Each thread block calculates one tile; block size equals tile size.
    • Each thread calculates one Pd element, identified using:
      blockIdx.x & blockIdx.y to identify the tile;
      threadIdx.x & threadIdx.y to identify the thread within the tile.

    [Figure: Md and Nd (each WIDTH x WIDTH) multiplied into Pd; one TILE_WIDTH x TILE_WIDTH sub-matrix Pdsub is selected by bx = blockIdx.x and by = blockIdx.y, with tx = threadIdx.x and ty = threadIdx.y ranging over 0 … TILE_WIDTH-1.]

Revised Matrix Multiplication Using Multiple Blocks (cont.)

    • The y index of the Pd element computed by a thread is y = by*TILE_WIDTH + ty.
    • The x index of the Pd element computed by a thread is x = bx*TILE_WIDTH + tx.
    • Thread (tx, ty) in block (bx, by) uses row y of Md and column x of Nd to update element Pd[y*Width + x].

    [Figure: the same tiling diagram as on the previous slide.]
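
    As a quick worked check of these formulas (with TILE_WIDTH = 2, as in the small example that follows): thread (tx=1, ty=1) in block (bx=1, by=0) computes x = 1*2 + 1 = 3 and y = 0*2 + 1 = 1, so it updates Pd[1*Width + 3], i.e. element Pd3,1.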

A Small Example: P(4,4)

    • TILE_WIDTH = 2, so the 4x4 Pd is covered by four 2x2 tiles:
      For Block(0,0): blockIdx.x = 0 & blockIdx.y = 0 (elements P0,0 P1,0 P0,1 P1,1)
      For Block(1,0): blockIdx.x = 1 & blockIdx.y = 0 (elements P2,0 P3,0 P2,1 P3,1)
      For Block(0,1): blockIdx.x = 0 & blockIdx.y = 1 (elements P0,2 P1,2 P0,3 P1,3)
      For Block(1,1): blockIdx.x = 1 & blockIdx.y = 1 (elements P2,2 P3,2 P2,3 P3,3)

A Small Example: Multiplication

    [Figure: the elements Md0,0 … Md3,1 of Md, Nd0,0 … Nd1,3 of Nd, and the 4x4 Pd, showing which row of Md and column of Nd feed each Pd element.]

    • thread(0,0) of block(0,0) calculates Pd0,0
    • thread(0,0) of block(0,1) calculates Pd0,2
    • thread(1,1) of block(0,0) calculates Pd1,1
    • thread(1,1) of block(0,1) calculates Pd1,3

Revised Host Code for Launching the Revised Kernel

    // Set up the execution configuration
    dim3 dimGrid(Width/TILE_WIDTH, Width/TILE_WIDTH, 1);
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH, 1);
    // Launch the device computation threads
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

    • The kernel can handle arrays of dimensions up to 1,048,560 x 1,048,560 (16 x 65,535 = 1,048,560), using 65,535 x 65,535 = 4,294,836,225 blocks, each with 256 threads.
    • Total number of parallel threads = 1,099,478,073,600 > 1 Tera (10^12) threads.
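
    The slide shows only the configuration and launch; a fuller host-side sketch, with the allocation and copy steps the slide omits, might look as follows. The host array names h_M, h_N, h_P are illustrative assumptions.

    // Hedged sketch of the surrounding host flow; h_M, h_N, h_P are
    // illustrative names, not from the slides.
    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;
    cudaMalloc((void**)&Md, size);
    cudaMalloc((void**)&Nd, size);
    cudaMalloc((void**)&Pd, size);
    cudaMemcpy(Md, h_M, size, cudaMemcpyHostToDevice);
    cudaMemcpy(Nd, h_N, size, cudaMemcpyHostToDevice);

    dim3 dimGrid(Width/TILE_WIDTH, Width/TILE_WIDTH, 1);
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH, 1);
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

    cudaMemcpy(h_P, Pd, size, cudaMemcpyDeviceToHost);
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);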

Revised Matrix Multiplication Kernel Using Multiple Blocks

    __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
    {
        // Calculate the row index of the Pd element and M
        int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
        // Calculate the column index of Pd and N
        int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

        float Pvalue = 0;
        // Each thread computes one element of the block sub-matrix
        for (int k = 0; k < Width; ++k)
            Pvalue += Md[Row*Width+k] * Nd[k*Width+Col];
        Pd[Row*Width+Col] = Pvalue;
    }

Transparent Scalability

    • The CUDA runtime system can execute blocks in any order.
    • A kernel scales across any number of execution resources.
    • Transparent scalability: the ability to execute the same application on hardware with different numbers of execution resources.
    • Each block can execute in any order relative to other blocks.

    [Figure: a kernel grid of Blocks 0-7. A device with few resources executes two blocks at a time (0-1, 2-3, 4-5, 6-7); a device with more resources executes four at a time (0-3, then 4-7).]

Thread Block Assignment to Streaming Multiprocessors

    • Upon kernel launch, threads are assigned to Streaming Multiprocessors (SMs) in block granularity.
    • Up to 8 blocks are assigned to each SM, as resources allow.
    • An SM in the GT200 can take up to 1024 threads:
      could be 256 (threads/block) * 4 blocks,
      or 128 (threads/block) * 8 blocks, etc.

    [Figure: blocks being assigned to SM 0 and SM 1; each SM holds resident threads t0, t1, t2, …, tm and contains SPs, an MT issue unit, and shared memory.]

Thread Block Assignment to Streaming Multiprocessors (cont.)

    • The GT200 has 30 SMs.
    • Up to 240 (30*8) blocks execute simultaneously.
    • Up to 30,720 (30*1024) concurrent threads can be resident in SMs for execution.
    • The SM maintains thread/block ID numbers.
    • The SM manages/schedules thread execution.
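
    Rather than hard-coding the GT200 figures, the per-device limits behind these numbers can be queried at run time. A minimal sketch using the standard cudaDeviceProp fields:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        // Query the limits of device 0 instead of assuming GT200's values.
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        printf("SMs:                 %d\n", prop.multiProcessorCount);
        printf("max threads per SM:  %d\n", prop.maxThreadsPerMultiProcessor);
        printf("max threads / block: %d\n", prop.maxThreadsPerBlock);
        printf("warp size:           %d\n", prop.warpSize);
        return 0;
    }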

GT200 Example: Thread Scheduling

    • Each block is divided into 32-thread warps for scheduling purposes.
      The size of a warp is implementation dependent and not part of the CUDA programming model.
      The warp is the unit of thread scheduling in an SM.
    • If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM?
      Each block is divided into 256/32 = 8 warps.
      There are 8 * 3 = 24 warps in each SM.

    [Figure: Block 1 and Block 2 warps (threads t0 t1 t2 … t31) feeding a Streaming Multiprocessor with an instruction L1 cache, instruction fetch/dispatch, SPs, SFUs, and shared memory.]

GT200 Example: Thread Scheduling (cont.)

    • The SM implements zero-overhead warp scheduling:
      At any time, only one of the warps is executed by the SM.
      Warps whose next instruction has its operands ready for consumption are eligible for execution.
      Eligible warps are selected for execution on a prioritized scheduling policy.
      All threads in a warp execute the same instruction when selected (SIMD execution).

    [Figure: a scheduling timeline (TB = Thread Block, W = Warp). TB1 W1 issues instructions until it stalls; the SM switches to TB2 W1, then TB3 W1, then TB3 W2, eventually returning to TB1 W1's next instructions, so every stall is hidden by another eligible warp.]

GT200 Block Granularity Considerations

    • For matrix multiplication using multiple blocks, should I use 8x8, 16x16, or 32x32 blocks?
    • For 8x8 blocks, we have 64 threads per block. Since each SM can take up to 1024 threads, that allows 16 blocks (1024/64). However, each SM can only take up to 8 blocks, so only 512 (64*8) threads will go into each SM!
      SM execution resources are underutilized; there are fewer warps to schedule around long-latency operations.
    • For 16x16 blocks, we have 256 threads per block. Since each SM can take up to 1024 threads, it can take up to 4 blocks (1024/256).
      Full thread capacity in each SM.
      Maximal number of warps for scheduling around long-latency operations (1024/32 = 32 warps).
    • For 32x32 blocks, we have 1024 threads per block. Not even one can fit into an SM! (512 threads/block limitation)
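
    The slide's arithmetic generalizes to a tiny helper. This is a sketch that bakes in the GT200-era limits quoted above (512 threads per block, 1024 resident threads and 8 resident blocks per SM) as assumptions; they differ on other hardware.

    // Hedged sketch of the slide's resident-thread arithmetic.
    int residentThreadsPerSM(int threadsPerBlock)
    {
        const int maxThreadsPerBlock = 512;   // per-block limit assumed here
        const int maxThreadsPerSM    = 1024;  // GT200 resident-thread limit
        const int maxBlocksPerSM     = 8;     // GT200 resident-block limit
        if (threadsPerBlock > maxThreadsPerBlock) return 0;   // does not fit
        int blocks = maxThreadsPerSM / threadsPerBlock;       // thread capacity
        if (blocks > maxBlocksPerSM) blocks = maxBlocksPerSM; // block limit
        return blocks * threadsPerBlock;
    }
    // residentThreadsPerSM(64)   == 512   (8x8 blocks:   under-utilized)
    // residentThreadsPerSM(256)  == 1024  (16x16 blocks: full capacity)
    // residentThreadsPerSM(1024) == 0     (32x32 blocks: does not fit)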

Some Additional API Features

Application Programming Interface

    • The API is an extension to the C programming language.
    • It consists of:
      Language extensions, to target portions of the code for execution on the device.
      A runtime library split into:
        1. A host component to control and access one or more devices from the host.
        2. A device component providing device-specific functions.
        3. A common component providing built-in vector types and a subset of the C runtime library in both host and device codes.

Language Extensions: Built-in Variables

    dim3 gridDim;
      Dimensions of the grid in blocks (gridDim.z unused)
    dim3 blockDim;
      Dimensions of the block in threads
    uint3 blockIdx;
      Block index within the grid
    uint3 threadIdx;
      Thread index within the block
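
    All four built-ins often appear together in the "grid-stride" idiom, where a grid of any size covers an array of any length; this pattern is standard CUDA practice rather than something from the slide.

    // Hedged sketch: a grid-stride loop using all four built-in variables.
    __global__ void Scale(float* data, int n, float s)
    {
        int start  = blockIdx.x * blockDim.x + threadIdx.x; // this thread's ID
        int stride = gridDim.x * blockDim.x;                // threads in the grid
        for (int i = start; i < n; i += stride)
            data[i] *= s;
    }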

Common Runtime Component: Mathematical Functions

    • pow, sqrt, cbrt, hypot
    • exp, exp2, expm1
    • log, log2, log10, log1p
    • sin, cos, tan, asin, acos, atan, atan2
    • sinh, cosh, tanh, asinh, acosh, atanh
    • ceil, floor, trunc, round
    • etc.
    • When executed on the host, a given function uses the C runtime implementation if available.
    • These functions are only supported for scalar types, not vector types.

Device Runtime Component: Mathematical Functions

    • Some mathematical functions (e.g. sinf(x)) have a less accurate but faster device-only version (e.g. __sinf(x)):
      __powf
      __logf, __log2f, __log10f
      __expf
      __sinf, __cosf, __tanf
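
    A small kernel makes the trade-off concrete; this sketch simply evaluates both versions side by side (the buffer names are illustrative).

    // Hedged sketch: accurate sinf vs. the faster __sinf intrinsic.
    __global__ void SinBoth(const float* x, float* accurate, float* fast, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            accurate[i] = sinf(x[i]);   // full-accuracy single precision
            fast[i]     = __sinf(x[i]); // device-only, faster, less accurate
        }
    }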

Host Runtime Component

    • Provides functions to deal with:
      Device management (including multi-device systems)
      Memory management
      Error handling
    • Initializes the first time a runtime function is called.
    • A host thread can invoke device code on only one device.
      Multiple host threads are required to run on multiple devices.
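
    For the error-handling part, the runtime reports errors through cudaError_t values; a common pattern wraps calls in a checking macro like the sketch below (the macro name is illustrative).

    #include <cstdio>
    #include <cuda_runtime.h>

    // Hedged sketch: report any runtime error with file and line.
    #define CHECK(call)                                                  \
        do {                                                             \
            cudaError_t err = (call);                                    \
            if (err != cudaSuccess)                                      \
                fprintf(stderr, "CUDA error %s at %s:%d\n",              \
                        cudaGetErrorString(err), __FILE__, __LINE__);    \
        } while (0)

    // Usage (d_buf and bytes are illustrative):
    // CHECK(cudaMalloc((void**)&d_buf, bytes));
    // SomeKernel<<<grid, block>>>(d_buf);
    // CHECK(cudaGetLastError());  // catches bad launch configurations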

Device Runtime Component: Synchronization Function

    void __syncthreads();

    • Synchronizes all threads in a block.
    • Once all threads have reached this point, execution resumes normally.
    • Used to avoid RAW / WAR / WAW hazards when accessing shared or global memory.
    • Allowed in conditional constructs only if the conditional is uniform across the entire thread block.
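
    To make the RAW case concrete, here is a minimal sketch (the kernel and its block size are illustrative): each block loads a chunk into shared memory, and __syncthreads() guarantees all writes complete before any thread reads another thread's element.

    #define BLOCK 256  // assumed block size for this sketch

    // Reverse each BLOCK-sized chunk of d in place, one chunk per block.
    __global__ void ReverseEachBlock(float* d)
    {
        __shared__ float s[BLOCK];
        int i = blockIdx.x * BLOCK + threadIdx.x;
        s[threadIdx.x] = d[i];              // write phase (shared memory)
        __syncthreads();                    // all writes done before any read
        d[i] = s[BLOCK - 1 - threadIdx.x];  // read phase
    }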