General Purpose Graphics Processing Units (GPGPUs) Lecture notes from MKP, J. Wang, and S. Yalamanchili

Post on 23-Feb-2016

Page 1: General Purpose Graphics Processing Units (GPGPUs)

General Purpose Graphics Processing Units (GPGPUs)

Lecture notes from MKP, J. Wang, and S. Yalamanchili

Page 2: General Purpose Graphics Processing Units (GPGPUs)

(2)

What is a GPGPU?

• Graphics Processing Unit (GPU) (NVIDIA/AMD/Intel):
  - Many-core architecture
  - Massively data-parallel processor (compared with a CPU)
  - Highly multi-threaded
• GPGPU: General-Purpose GPU
  - Used for high-performance computing
  - Became popular with the CUDA and OpenCL programming languages

Page 3: General Purpose Graphics Processing Units (GPGPUs)

(3)

Motivation

• High throughput and memory bandwidth

Page 4: General Purpose Graphics Processing Units (GPGPUs)

(4)

Discrete GPUs in the System

Page 5: General Purpose Graphics Processing Units (GPGPUs)

(5)

Fused GPUs: AMD & Intel

On-Chip and sharing the cache

Not as powerful as the discrete GPUs

Page 6: General Purpose Graphics Processing Units (GPGPUs)

(6)

Core Count: NVIDIA

1536 cores at 1GHz

• Not all cores are created equal
• Need to understand the programming model

Page 7: General Purpose Graphics Processing Units (GPGPUs)

(7)

GPU Architectures (NVIDIA Tesla)

[Figure: a streaming multiprocessor containing 8 × streaming processors.]

Page 8: General Purpose Graphics Processing Units (GPGPUs)

(8)

NVIDIA GK110 Architectures

Page 9: General Purpose Graphics Processing Units (GPGPUs)

(9)

CUDA Programming Model

• NVIDIA
• Compute Unified Device Architecture (CUDA)
• Kernel: C-like function executed on the GPU
• SIMD or SIMT: Single Instruction, Multiple Data / Multiple Threads
  - All threads execute the same instruction
  - Each thread operates on its own data
  - Threads run in lock step

[Figure: threads 0 to 7 each apply Inst 0, then Inst 1, to their own data element.]
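The lock-step idea can be sketched on the host in plain C. This is an emulation under stated assumptions, not CUDA code: the function names `simt_step_add` and `simt_step_mul` are hypothetical, and the inner loop stands in for the eight hardware threads that would run simultaneously on a GPU.

```c
#define NUM_THREADS 8  /* assumption: matches the 8 threads drawn on the slide */

/* "Inst 0": every thread executes the same add, each on its own data element. */
void simt_step_add(float data[NUM_THREADS], float operand) {
    for (int tid = 0; tid < NUM_THREADS; ++tid) {
        data[tid] += operand;  /* same instruction, per-thread data */
    }
}

/* "Inst 1": issued only after all threads have finished the previous step. */
void simt_step_mul(float data[NUM_THREADS], float operand) {
    for (int tid = 0; tid < NUM_THREADS; ++tid) {
        data[tid] *= operand;
    }
}
```

Calling `simt_step_add` and then `simt_step_mul` on the same array mirrors the slide: no thread starts the second instruction before every thread has completed the first.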

Page 10: General Purpose Graphics Processing Units (GPGPUs)

(10)

CUDA Thread Hierarchy

• Each thread uses IDs to decide what data to work on
• Three-level hierarchy: Thread, Block, Grid

[Figure: threads within a block indexed in one dimension (0 to 3), two dimensions (0,0 to 3,3), or three dimensions (0,0,0 onward). Each grid contains Block(0,0,0), Block(0,0,1), Block(0,1,0), and Block(0,1,1). Kernel 0, Kernel 1, and Kernel 2 each have their own grid.]

Page 11: General Purpose Graphics Processing Units (GPGPUs)

(11)

Vector Addition

• Let's assume N = 16 and blockDim = 4, which gives 4 blocks

    blockIdx.x = 0, blockDim.x = 4, threadIdx.x = 0,1,2,3  →  Idx = 0,1,2,3
    blockIdx.x = 1, blockDim.x = 4, threadIdx.x = 0,1,2,3  →  Idx = 4,5,6,7
    blockIdx.x = 2, blockDim.x = 4, threadIdx.x = 0,1,2,3  →  Idx = 8,9,10,11
    blockIdx.x = 3, blockDim.x = 4, threadIdx.x = 0,1,2,3  →  Idx = 12,13,14,15

[Figure: each thread performs one addition.]

Sequential equivalent:

    for (int index = 0; index < N; ++index) {
        c[index] = a[index] + b[index];
    }
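The index mapping above can be checked with a small host-side emulation in C. This is only a sketch: `global_index` and `vector_add_emulated` are hypothetical helper names, and the two nested loops stand in for the blocks and threads that a GPU would run in parallel.

```c
/* Global element index of one thread: blockIdx.x * blockDim.x + threadIdx.x. */
int global_index(int block_idx, int block_dim, int thread_idx) {
    return block_idx * block_dim + thread_idx;
}

/* Visits every (block, thread) pair once, as the grid would. */
void vector_add_emulated(const float *a, const float *b, float *c,
                         int n, int grid_dim, int block_dim) {
    for (int block = 0; block < grid_dim; ++block) {
        for (int thread = 0; thread < block_dim; ++thread) {
            int index = global_index(block, block_dim, thread);
            if (index < n) {  /* guard for grids larger than the data */
                c[index] = a[index] + b[index];
            }
        }
    }
}
```

With `n = 16`, `grid_dim = 4`, and `block_dim = 4`, the emulation touches indices 0 to 15 exactly once, which is why it produces the same result as the sequential loop.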

Page 12: General Purpose Graphics Processing Units (GPGPUs)

(12)

Vector Addition

CPU Program

    void vector_add(float *a, float *b, float *c, int N) {
        for (int index = 0; index < N; ++index)
            c[index] = a[index] + b[index];
    }

    int main() {
        vector_add(a, b, c, N);
    }

GPU Program

    // Kernel: each thread handles one element
    __global__ void vector_add(float *a, float *b, float *c, int N) {
        int index = blockIdx.x * blockDim.x + threadIdx.x;
        if (index < N)
            c[index] = a[index] + b[index];
    }

    int main() {
        // a, b, c are device pointers; blocksize is the number of threads per block
        dim3 dimBlock(blocksize);
        dim3 dimGrid((N + dimBlock.x - 1) / dimBlock.x);  // round up to cover all N elements
        vector_add<<<dimGrid, dimBlock>>>(a, b, c, N);
    }

Page 13: General Purpose Graphics Processing Units (GPGPUs)

(13)

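GPU Architecture Basics

[Figure: a GPU with multiple SMs, a cache, a memory controller, memory, and a PCI interface. Inside an SM, one fetch unit and one decoder feed several cores ("the SI in SIMT"). Each CUDA core is an in-order core with an FP unit and an INT unit, followed by EX, MEM, and WB stages.]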

Page 14: General Purpose Graphics Processing Units (GPGPUs)

(14)

Execution of a CUDA Program

• Blocks are scheduled and executed independently on SMs

• All blocks share memory

Page 15: General Purpose Graphics Processing Units (GPGPUs)

(15)

Executing a Block of Threads

• Execution unit: the warp, a group of threads (32 for NVIDIA GPUs)
• Blocks are partitioned into warps of threads with consecutive thread IDs

[Figure: an SM running Block 0 (128 threads) and Block 1 (128 threads), each partitioned into Warp 0, Warp 1, Warp 2, and Warp 3.]
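The partitioning rule can be written down as simple integer arithmetic. The C sketch below assumes a one-dimensional block and the warp size of 32 quoted above; the function names are hypothetical.

```c
#define WARP_SIZE 32  /* warp size stated for NVIDIA GPUs */

/* Number of warps needed for a block (rounds up for partial warps). */
int warps_per_block(int threads_per_block) {
    return (threads_per_block + WARP_SIZE - 1) / WARP_SIZE;
}

/* Warp that a thread belongs to: consecutive thread IDs share a warp. */
int warp_of_thread(int thread_id) {
    return thread_id / WARP_SIZE;
}

/* Position of the thread inside its warp. */
int lane_of_thread(int thread_id) {
    return thread_id % WARP_SIZE;
}
```

For the 128-thread blocks in the figure, `warps_per_block(128)` gives 4, with threads 0 to 31 in warp 0 and threads 96 to 127 in warp 3.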

Page 16: General Purpose Graphics Processing Units (GPGPUs)

(16)

Warp Execution

• A warp executes one common instruction at a time
• Threads in a warp are mapped to CUDA cores
• Warps are switched and executed on an SM

[Figure: an SM with one PC and several cores; Inst 1, Inst 2, and Inst 3 are each issued for one warp of threads (T).]

Page 17: General Purpose Graphics Processing Units (GPGPUs)

(17)

Handling Branches

• CUDA code:

    if (…)  …    (true for some threads)
    else    …    (true for others)

• What if threads take different branches?
• Branch divergence!

[Figure: four threads (T) of a warp, some with the branch taken and some with it not taken.]

Page 18: General Purpose Graphics Processing Units (GPGPUs)

(18)

Branch Divergence

• Occurs within a warp
• All branch paths are serialized and executed
  - Performance issue: low warp utilization

[Figure: an if { … } else { … } executed by one warp, with idle threads on each path.]
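The cost of serialization can be estimated with a short C sketch. It rests on two assumptions that are not in the slides: both paths of the branch take the same time, and a warp has 32 lanes. The names `passes_for_branch` and `utilization` are hypothetical.

```c
#include <stdbool.h>

#define WARP_SIZE 32  /* assumption: 32 threads per warp */

/* Serialized passes a warp needs for one if/else:
   1 when all threads agree, 2 when the warp diverges. */
int passes_for_branch(const bool taken[WARP_SIZE]) {
    int taken_count = 0;
    for (int lane = 0; lane < WARP_SIZE; ++lane) {
        if (taken[lane]) {
            ++taken_count;
        }
    }
    bool any_taken = taken_count > 0;
    bool any_not_taken = taken_count < WARP_SIZE;
    return (any_taken ? 1 : 0) + (any_not_taken ? 1 : 0);
}

/* Fraction of lane slots doing useful work: each thread is active in
   exactly one pass and idle in the others. */
double utilization(const bool taken[WARP_SIZE]) {
    return 1.0 / passes_for_branch(taken);
}
```

Under these assumptions a warp whose threads all agree keeps full utilization, while any split between the two paths halves it, regardless of how many threads are on each side.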

Page 19: General Purpose Graphics Processing Units (GPGPUs)

(19)

Vector Addition

• N = 60
• 64 threads, 1 block
• Q: Is there any branch divergence? In which warp?

    __global__ void vector_add(float *a, float *b, float *c, int N) {
        int index = blockIdx.x * blockDim.x + threadIdx.x;
        if (index < N)
            c[index] = a[index] + b[index];
    }
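One way to reason about the question is to test, per warp, whether the guard `index < N` splits the warp's threads. The C sketch below assumes a single block (so `index` equals `threadIdx.x`) and a warp size of 32; `warp_diverges` is a hypothetical helper name.

```c
#define WARP_SIZE 32  /* assumption: 32 threads per warp */

/* Returns 1 if the warp contains threads on both sides of (index < n),
   and 0 if all of its threads agree. */
int warp_diverges(int warp_id, int n) {
    int first = warp_id * WARP_SIZE;    /* lowest index in this warp  */
    int last = first + WARP_SIZE - 1;   /* highest index in this warp */
    return (first < n) && (last >= n);
}
```

With N = 60 and 64 threads, warp 0 covers indices 0 to 31 and never diverges, while warp 1 covers 32 to 63, so threads 60 to 63 fail the test and sit idle while the other 28 do the addition.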

Page 20: General Purpose Graphics Processing Units (GPGPUs)

(20)

Example: VectorAdd on GPU

CUDA:

    __global__ void vector_add(float *a, float *b, float *c, int N) {
        int index = blockIdx.x * blockDim.x + threadIdx.x;
        if (index < N)
            c[index] = a[index] + b[index];
    }

PTX (Assembly):

        setp.lt.s32 %p, %r5, %rd4;    // r5 = index, rd4 = N
        @p bra L1;
        bra L2;
    L1: ld.global.f32 %f1, [%r6];     // r6 = &a[index]
        ld.global.f32 %f2, [%r7];     // r7 = &b[index]
        add.f32 %f3, %f1, %f2;
        st.global.f32 [%r8], %f3;     // r8 = &c[index]
    L2: ret;

Page 21: General Purpose Graphics Processing Units (GPGPUs)

(21)

Example: VectorAdd on GPU

• N = 8, 8 threads, 1 block, warp size = 4
• 1 SM, 4 cores
• Pipeline:
  - Fetch:
      o One instruction from each warp
      o Round-robin through all warps
  - Execution:
      o In-order execution within warps
      o With proper data forwarding
      o 1 cycle per stage
• How many warps?
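The setup above can be turned into a few lines of arithmetic. The C sketch below is a simplified model, not the deck's simulator: it assumes one fetch per cycle, no stalls or squashed fetches, and a five-stage pipeline (fetch, decode, execute, memory, write-back). All function names are hypothetical.

```c
/* Warps needed to cover the threads (rounds up for a partial warp). */
int num_warps(int threads, int warp_size) {
    return (threads + warp_size - 1) / warp_size;
}

/* Round-robin fetch: which warp is fetched in a given cycle (cycles start at 0). */
int warp_fetched_at(int cycle, int warps) {
    return cycle % warps;
}

/* Ideal cycle count: one fetch slot per warp instruction, plus the cycles
   the last instruction needs to drain through the remaining stages. */
int ideal_cycles(int warps, int instrs_per_warp, int stages) {
    return warps * instrs_per_warp + (stages - 1);
}
```

For 8 threads and a warp size of 4, `num_warps(8, 4)` is 2, and the fetch stage alternates Warp0, Warp1, Warp0, and so on. If each warp executes the 7 instructions on the taken path of the PTX listing, the ideal model gives 2 × 7 + 4 = 18 cycles; any fetch slot spent on an instruction that is later discarded adds to that figure.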

Page 22: General Purpose Graphics Processing Units (GPGPUs)

(22)

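Execution Sequence

        setp.lt.s32 %p, %r5, %rd4;
        @p bra L1;
        bra L2;
    L1: ld.global.f32 %f1, [%r6];
        ld.global.f32 %f2, [%r7];
        add.f32 %f3, %f1, %f2;
        st.global.f32 [%r8], %f3;
    L2: ret;

[Figure: pipeline with shared FE and DE stages feeding four cores, each with EXE, MEM, and WB stages. Two warps, Warp0 and Warp1, run the listing above. All stages are empty.]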

Page 23: General Purpose Graphics Processing Units (GPGPUs)

(23)

Execution Sequence (cont.)

[Pipeline snapshot: FE: setp W0; all other stages empty.]

Page 24: General Purpose Graphics Processing Units (GPGPUs)

(24)

Execution Sequence (cont.)

[Pipeline snapshot: FE: setp W1; DE: setp W0; EXE, MEM, WB: empty.]

Page 25: General Purpose Graphics Processing Units (GPGPUs)

(25)

Execution Sequence (cont.)

[Pipeline snapshot: FE: bra W0; DE: setp W1; EXE: setp W0 on all 4 cores; MEM, WB: empty.]

Page 26: General Purpose Graphics Processing Units (GPGPUs)

(26)

Execution Sequence (cont.)

[Pipeline snapshot: FE: @p bra W1; DE: @p bra W0; EXE: setp W1 on all 4 cores; MEM: setp W0 on all 4 cores; WB: empty.]

Page 27: General Purpose Graphics Processing Units (GPGPUs)

(27)

Execution Sequence (cont.)

[Pipeline snapshot: FE: bra L2; DE: @p bra W1; EXE: bra W0 on all 4 cores; MEM: setp W1 on all 4 cores; WB: setp W0 on all 4 cores.]

Page 28: General Purpose Graphics Processing Units (GPGPUs)

(28)

Execution Sequence (cont.)

[Pipeline snapshot: FE/DE: bra L2; EXE: bra W1 on all 4 cores; MEM: bra W0 on all 4 cores; WB: setp W1 on all 4 cores.]

Page 29: General Purpose Graphics Processing Units (GPGPUs)

(29)

Execution Sequence (cont.)

[Pipeline snapshot: FE: ld W0; DE, EXE: empty; MEM: bra W1 on all 4 cores; WB: bra W0 on all 4 cores.]

Page 30: General Purpose Graphics Processing Units (GPGPUs)

(30)

Execution Sequence (cont.)

[Pipeline snapshot: FE: ld W1; DE: ld W0; EXE, MEM: empty; WB: bra W1 on all 4 cores.]

Page 31: General Purpose Graphics Processing Units (GPGPUs)

(31)

Execution Sequence (cont.)

[Pipeline snapshot: FE: ld W0; DE: ld W1; EXE: ld W0 on all 4 cores; MEM, WB: empty.]

Page 32: General Purpose Graphics Processing Units (GPGPUs)

(32)

Execution Sequence (cont.)

[Pipeline snapshot: FE: ld W1; DE: ld W0; EXE: ld W1 on all 4 cores; MEM: ld W0 on all 4 cores; WB: empty.]

Page 33: General Purpose Graphics Processing Units (GPGPUs)

(33)

Execution Sequence (cont.)

[Pipeline snapshot: FE: add W0; DE: ld W1; EXE: ld W0 on all 4 cores; MEM: ld W1 on all 4 cores; WB: ld W0 on all 4 cores.]

Page 34: General Purpose Graphics Processing Units (GPGPUs)

(34)

Execution Sequence (cont.)

[Pipeline snapshot: FE: add W1; DE: add W0; EXE: ld W1 on all 4 cores; MEM: ld W0 on all 4 cores; WB: ld W1 on all 4 cores.]

Page 35: General Purpose Graphics Processing Units (GPGPUs)

(35)

Execution Sequence (cont.)

[Pipeline snapshot: FE: st W0; DE: add W1; EXE: add W0 on all 4 cores; MEM: ld W1 on all 4 cores; WB: ld W0 on all 4 cores.]

Page 36: General Purpose Graphics Processing Units (GPGPUs)

(36)

Execution Sequence (cont.)

[Pipeline snapshot: FE: st W1; DE: st W0; EXE: add W1 on all 4 cores; MEM: add W0 on all 4 cores; WB: ld W1 on all 4 cores.]

Page 37: General Purpose Graphics Processing Units (GPGPUs)

(37)

Execution Sequence (cont.)

[Pipeline snapshot: FE: ret; DE: st W1; EXE: st W0 on all 4 cores; MEM: add W1 on all 4 cores; WB: add W0 on all 4 cores.]

Page 38: General Purpose Graphics Processing Units (GPGPUs)

(38)

Execution Sequence (cont.)

[Pipeline snapshot: FE: ret; DE: ret; EXE: st W1 on all 4 cores; MEM: st W0 on all 4 cores; WB: add W1 on all 4 cores.]

Page 39: General Purpose Graphics Processing Units (GPGPUs)

(39)

Execution Sequence (cont.)

[Pipeline snapshot: FE/DE: ret; EXE: ret on all 4 cores; MEM: st W1 on all 4 cores; WB: st W0 on all 4 cores.]

Page 40: General Purpose Graphics Processing Units (GPGPUs)

(40)

Execution Sequence (cont.)

[Pipeline snapshot: FE, DE: empty; EXE: ret on all 4 cores; MEM: ret on all 4 cores; WB: st W1 on all 4 cores.]

Page 41: General Purpose Graphics Processing Units (GPGPUs)

(41)

Execution Sequence (cont.)

[Pipeline snapshot: FE, DE, EXE: empty; MEM: ret on all 4 cores; WB: ret on all 4 cores.]

Page 42: General Purpose Graphics Processing Units (GPGPUs)

(42)

Execution Sequence (cont.)

[Pipeline snapshot: FE, DE, EXE, MEM: empty; WB: ret on all 4 cores.]

Page 43: General Purpose Graphics Processing Units (GPGPUs)

(43)

Execution Sequence (cont.)

[Pipeline snapshot: all stages empty; both warps have finished.]

Page 44: General Purpose Graphics Processing Units (GPGPUs)

(44)

Study Guide

• Be able to define the terms thread block, warp, and SIMT with examples
• Understand the Vector Addition example in enough detail to:
  - Know what operations are in each core at any cycle
  - Given the number of pipeline stages in each core, know how many warps are required to fill the pipelines
  - Know how many instructions are executed in total
• Know the key differences between fused and discrete GPUs

Page 45: General Purpose Graphics Processing Units (GPGPUs)

(45)

Glossary

• CUDA
• Branch divergence
• Kernel
• OpenCL
• Streaming multiprocessor
• Thread block
• Warp