GPU Computing with CUDA Lecture 1 - Introduction Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile 1


Page 1:

GPU Computing with CUDA
Lecture 1 - Introduction

Christopher Cooper
Boston University

August, 2011
UTFSM, Valparaíso, Chile

1

Page 2:

General Ideas

‣Objectives

- Learn CUDA

- Recognize CUDA friendly algorithms and practices

‣ Requirements

- C/C++ (hopefully)

‣ Lab oriented course!

2

Page 3:

Resources

‣Material available at http://www.bu.edu/pasi/materials/post-pasi-training/

‣Machines to be used

- Cluster USM

- BUNGEE/BUDGE at Boston University

‣Main references

- Kirk, D. and Hwu, W. Programming Massively Parallel Processors

- Sanders, J. and Kandrot, E. CUDA by Example

- CUDA C Programming Guide

- CUDA C Best Practices Guide

3

Page 4:

Outline of course

‣Week 1

- Basic CUDA

‣Week 2

- Optimization

‣Week 3

- CUDA Libraries

‣Week 4

- Applications

4

Page 5:

Outline

‣Understanding the need for multicore architectures

‣Overview of the GPU hardware

‣ Introduction to CUDA

- Main features

- Thread hierarchy

- Simple example

- Concepts behind a CUDA friendly code

5

Page 6:

Moore’s Law

‣ The transistor count of integrated circuits doubles roughly every two years

6

Page 7:

Getting performance

‣We have more and more transistors; what do we do with them?

‣How can we get more performance?

- Increase processor speed

‣ Clock rates have gone from MHz to GHz over the last 30 years

‣ Power scales as frequency³!

‣ Memory needs to catch up

- Parallel execution

‣ Concurrency, multithreading

7


Page 10:

Getting performance

8

Page 11:

Moore’s Law

‣ Serial performance scaling has reached its limit

‣ Processors are not getting faster, but wider

‣Challenge: parallel thinking

9

Page 12:

Graphic Processing Units (GPU)

10

Page 13:

Graphic Processing Units (GPU)

‣The GPU is hardware specially designed for highly parallel applications (graphics)

11

[Figure 1-1 of the CUDA C Programming Guide: Floating-Point Operations per Second and Memory Bandwidth for the CPU and GPU]

Page 14:

Graphic Processing Units (GPU)

‣ Fast processing must come with high bandwidth!

12

[Figure 1-1 of the CUDA C Programming Guide: Floating-Point Operations per Second and Memory Bandwidth for the CPU and GPU]

Page 15:

Graphic Processing Units (GPU)

‣Graphics market is hungry for better and faster rendering

‣Development is pushed by this HUGE industry

- High quality product

- Cheap!

‣ Proven to be a real alternative for scientific applications

13

Top 500 list. June 2011

Page 16:

GPU chip design

‣CPU vs GPU

14

[Figure 1-2 of the CUDA C Programming Guide: The GPU Devotes More Transistors to Data Processing; schematic CPU vs GPU chip layouts built from Control, ALU, Cache, and DRAM blocks]

GPU devotes more transistors to data processing

Page 17:

GPU chip design

15

Page 18:

GPU chip design

‣A glance at the GeForce 8800

16

[Figure: GeForce 8800 block diagram; the Host feeds an Input Assembler and a Thread Execution Manager, which drive an array of streaming multiprocessors, each with a Parallel Data Cache and Texture unit; Load/Store units connect to Global Memory]

Streaming multiprocessor

Streaming processor

Page 19:

GPU chip design

‣Cores?

17

SM

- 3 billion transistors
- 512 CUDA cores (16 SMs × 32 cores)
- 64 kB of on-chip RAM
- High bandwidth

SP

Page 20:

GPU chip design

‣ The GPU core is the streaming processor (SP)

‣ Streaming processors are grouped into Streaming Multiprocessors (SMs)

- An SM is basically a SIMD processor (Single Instruction, Multiple Data)

18

Page 21:

GPU chip design

‣Core ideas

- GPUs consist of many “simple” cores

‣ Designed for many simpler tasks: high throughput

‣ Fewer control units: higher latency per thread

19

GPU: maximize throughput of all threads

CPU: minimize latency of a single thread

Page 22:

Parallel thinking

‣ Latency? Throughput?

- Latency: time to complete a single task

- Throughput: number of tasks completed in a fixed time

‣Decisions!?!

- Depends on the problem

20

Execution time

Bandwidth

“If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens?”

Seymour Cray

Page 23:

Parallel thinking

‣ 1024 chickens? We had better have a strategy!

- Rethink our algorithms to be more parallel friendly

- Massively parallel:

‣ Data parallelism

‣ Load balancing

‣ Regular computations

‣ Data access

‣ Avoid conflicts

‣ and so on...

21

Page 24:

Compute Unified Device Architecture (CUDA)

‣ Parallel computer architecture developed by NVIDIA

‣General purpose programming model

‣ Specially designed for general-purpose GPU computing (GPGPU):

- Offers a compute-oriented API

- Explicit GPU memory management

22

Page 25:

CUDA enabled GPUs

‣What is the difference between GPUs?

- CUDA-enabled GPUs

- GPUs are classified according to compute capability

23

CUDA Programming Guide, Appendix A
CUDA Programming Guide, Appendix F

Page 26:

CUDA - Main features

‣C/C++ with extensions

‣Heterogeneous programming model:

- Operates in CPU (host) and GPU (device)

24

Host CPU

Device GPU

Page 27:

CUDA - Threads

‣ In CUDA, a kernel is executed by many threads

- A thread is an independent sequence of execution

- Multi-thread: many threads will be running at the same time

25

__global__ void vec_add(float *A, float *B, float *C, int N)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;

    if (i >= N) { return; }
    C[i] = A[i] + B[i];
}

void vec_add(float *A, float *B, float *C, int N)
{
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}

Page 28:

CUDA - Threads

‣ Threads are grouped into thread blocks

- Programming abstraction

- All threads within a thread block run on the same SM

- Threads of the same block can communicate

‣ Thread blocks form a grid

26

Page 29:

CUDA - Threads

‣CUDA virtualizes the physical hardware

- A thread is a virtualized scalar processor

- A thread block is a virtualized multiprocessor

‣ Thread blocks need to be independent

- They run to completion

- There is no guarantee on the order in which they run

27

Page 30:

CUDA - Threads

‣ Provides automatic scalability across GPUs

- Thread hierarchy

- Shared memories

- Barrier synchronization

28

Page 31:

CUDA - Threads

‣ Threads have a unique combination of block ID and thread ID

- So each thread can operate on a different part of the data

- SIMT: Single Instruction, Multiple Threads

- Threads (within a block): 1D, 2D or 3D

- Blocks (within the grid): 1D or 2D

29

Page 32:

CUDA - Extensions to C

30

‣Minimal effort to move from C to CUDA

‣ Function declarations

‣Variable qualifiers

‣ Execution configuration

‣Others

__global__ void kernel()
__device__ float function()

__shared__   float array[5]
__constant__ float array[5]

dim3 dim_grid(100, 100);   // 10000 blocks
dim3 dim_block(16, 16);    // 256 threads per block, in 2D
kernel<<<dim_grid, dim_block>>>();   // Launch kernel

threadIdx.x   blockIdx.x

Page 33:

CUDA - Program execution

31

Allocate and initialize data on CPU

Allocate data on GPU

Transfer data from CPU to GPU

Run kernel

Transfer data from GPU to CPU

Page 34:

CUDA - Vector add example

32

__global__ void vec_add(float *A, float *B, float *C, int N)
{
    // Using 1D thread ID and block ID
    int i = threadIdx.x + blockDim.x * blockIdx.x;

    if (i >= N) { return; }
    C[i] = A[i] + B[i];
}

int main()
{
    ....

    // Launch kernel with N/256 blocks of 256 threads
    int blocks = int(N - 0.5)/256 + 1;   // ceil(N/256)
    vec_add<<<blocks, 256>>>(A_d, B_d, C_d, N);
}


Page 37:

CUDA - Vector add example

33

int N = 200;

// Allocate and initialize on the host (CPU)
float *A_h = new float[N];
float *B_h = new float[N];
float *C_h = new float[N];

for (int i = 0; i < N; i++)
{
    A_h[i] = 1.3f;
    B_h[i] = 2.0f;
}

// Allocate memory on the GPU
float *A_d, *B_d, *C_d;

cudaMalloc((void**)&A_d, N*sizeof(float));
cudaMalloc((void**)&B_d, N*sizeof(float));
cudaMalloc((void**)&C_d, N*sizeof(float));

// Copy host memory to device
cudaMemcpy(A_d, A_h, N*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(B_d, B_h, N*sizeof(float), cudaMemcpyHostToDevice);


Page 41:

CUDA - Vector add example

34

// Copy host memory to device
cudaMemcpy(A_d, A_h, N*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(B_d, B_h, N*sizeof(float), cudaMemcpyHostToDevice);

// Run grid of N/256 blocks of 256 threads each
int blocks = int(N - 0.5)/256 + 1;
vec_add<<<blocks, 256>>>(A_d, B_d, C_d, N);

// Copy back from device to host
cudaMemcpy(C_h, C_d, N*sizeof(float), cudaMemcpyDeviceToHost);

// Free device memory
cudaFree(A_d);
cudaFree(B_d);
cudaFree(C_d);


Page 46:

CUDA - Vector add example

35

__global__ void vec_add(float *A, float *B, float *C, int N)
{
    // Using 1D thread ID and block ID
    int i = threadIdx.x + blockDim.x * blockIdx.x;

    if (i >= N) { return; }
    C[i] = A[i] + B[i];
}

Thread 0:    Fetch A[0], B[0]     → Operate → Write C[0]
...
Thread i:    Fetch A[i], B[i]     → Operate → Write C[i]
Thread i+1:  Fetch A[i+1], B[i+1] → Operate → Write C[i+1]
...
Thread N-1:  Fetch A[N-1], B[N-1] → Operate → Write C[N-1]

Page 47:

CUDA - To keep in mind

‣ Keep device code simple

- Leave the hard, irregular work to the CPU

‣ Performance:

- Load balance

- Data parallelism

- Avoid conflicts

- Keep the GPU busy!

‣How fast can we go?

36

Page 48:

CUDA - To keep in mind

‣ Load balance

- All threads should do the same amount of work

‣Data Parallelism (Massive)

- Arrange your algorithm so it is data-parallel friendly

- Look for regularity

‣Avoid conflicts

- Data access conflicts will serialize your application or give you wrong answers!

37

Page 49:

CUDA - To keep in mind

‣ Keep the GPU busy!

- Peak compute throughput is high compared to memory bandwidth

- Fermi:

‣ 1 TFLOP/s peak throughput, 144 GB/s peak off-chip memory bandwidth (36 Gfloats per second)

‣ Feeding 1 TFLOP/s with one operand per operation would require 4 × 1000 = 4000 GB/s!

‣ 1000/36 ≈ 28 operations per fetched value

- Need to hide latency!

38

Page 50:

CUDA - To keep in mind

‣How fast can we go?

- Keep Amdahl’s law in mind

- Know how much parallelization can be done in your application

- Measure the independent parts of your algorithm before moving to the GPU

39

P: parallel portion
S: speedup of the parallel portion

speedup = 1 / ((1 - P) + P/S)

Page 51:

CUDA - Applications

‣Ultrasound imaging
‣Molecular dynamics
‣Seismic migration
‣Astrophysics simulations
‣Graphics
‣....

40

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009. ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign