Compute Unified Device Architecture

A Highly Scalable Parallel Programming Framework

Submitted in partial fulfillment of the requirements for the Maryland high school diploma

Andrew “Shirley” Das Sarma (Calico Cannonballs McMullins), Blair Computational Methods 2009

Background: Why CUDA? Scientific Computing

• A large computer market

• Arithmetic-intensive

• Huge datasets

• Distributed

• Parallel

Background: Why CUDA? Moore’s Law

• Transistor counts double roughly every 24 months

• Slowing down?

• New tricks
  – Multicore
  – Multi-node

• Metrics
  – Transistors per circuit
  – Performance per unit cost

Background: Why CUDA? CPU vs. GPU

• CPUs optimized for general workload
  – More instructions per second
  – Pipelining, lookahead branch prediction, etc.

• GPUs optimized for parallel calculations
  – 1 pixel shader = 1 thread
  – Lots of pixel shaders
  – Lots of arithmetic
  – On-card DRAM


In terms of raw computing power, GPUs surpass CPUs.

What is CUDA?

• GPGPU (not just graphics, or no graphics)

• Runs on CPU and GPU

• High-level language
  – Extension of C
  – FORTRAN coming soon

• One compiler

• Only NVIDIA so far
  – Tesla
  – Larrabee (Intel's planned competitor, not a CUDA device)

• Unfathomably cool

How it works

• C language extension
  – Language constructs
  – Keywords

• Low-overhead threads

• Independent blocks

• CPU or GPU: choose one
  – CPU good for sequential or non-numerical tasks
  – GPU good for highly parallel calculations

GPU block diagram

[Figure: GPU block diagram. Labeled components: Host, Input Assembler, Thread Execution Manager, Texture units, Parallel Data Caches, Load/store units, Global Memory.]

CUDA: A C extension

• Declspecs: __host__, __global__, __device__

• Keywords: blockIdx, threadIdx, etc.

• Intrinsics: __syncthreads()

• Runtime API
  – cudaMalloc()
  – cudaMemcpy()
  – etc.

• Kernel launch: kernel<<<blocks,threads>>>()

CUDA: A C extension

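[Figure: nvcc compilation flow. Integrated source (foo.cu) enters the nvcc/cudacc EDG C/C++ frontend, which separates CPU host code (foo.cpp), compiled by gcc / cl, from device code. Device code goes through the Open64 Global Optimizer to GPU assembly (foo.s), then through OCG to G80 SASS (foo.sass).]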

Background: Pointers

• Pointer: a variable that holds the address of some other data in memory

• malloc(size_t sz) returns a pointer to sz bytes of available memory (or NULL if none is available)

• To allocate a 20-element int array: int * A = (int *) malloc(20*sizeof(int));
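The allocation above can be wrapped in a small function. This is a minimal sketch in plain C; the helper name make_sequence is made up for illustration and does not come from the slides.

```c
#include <stdlib.h>

/* Allocate an n-element int array and fill it with 0, 1, ..., n-1.
   Returns NULL if malloc fails; the caller must free() the result.
   make_sequence is a hypothetical helper name, not from the slides. */
int *make_sequence(size_t n)
{
    int *A = (int *) malloc(n * sizeof(int));
    if (A == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++)
        A[i] = (int) i;   /* A[i] is shorthand for *(A + i) */
    return A;
}
```

Calling make_sequence(20) gives the 20-element array from the bullet; the same pointer-plus-size pattern is what cudaMalloc() and cudaMemcpy() work with on the device side.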

Background: Threads

• Sequence of instructions

• One thread at a time per core
  – Multicore: several at once

• Desktop computer has thousands of threads
  – Usually fewer than 4 cores

• GPU comfortably runs millions of threads
  – Hundreds of cores

CUDA execution model

• Arrays of parallel threads

• Each thread executes the same code

• Work determined by threadIdx, blockIdx, blockDim, gridDim

• Blocks: collections of threads
  – Threads in a block can cooperate and share fast shared memory
  – No inter-block cooperation

• 1D, 2D, or 3D block/thread numbering
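How a thread finds its work from blockIdx, blockDim, and threadIdx can be sketched in plain C. The idx1 struct below is only a stand-in for the x component of CUDA's built-in variables, and the function names are assumptions made for this illustration; nothing here runs on a GPU.

```c
#include <stdlib.h>

/* Plain-C stand-in for the .x part of CUDA's built-in index variables. */
typedef struct { int x; } idx1;

/* The same arithmetic a 1D kernel uses to pick its element. */
int global_index(idx1 blockIdx, idx1 blockDim, idx1 threadIdx)
{
    return blockDim.x * blockIdx.x + threadIdx.x;
}

/* Visit every (block, thread) pair of a 1D grid sequentially and count
   how often each element is touched. Returns 1 if every element is hit
   exactly once, 0 otherwise (or if the allocation fails). */
int covers_once(int gridDimX, int blockDimX)
{
    int n = gridDimX * blockDimX;
    int *hits = (int *) calloc((size_t) n, sizeof(int));
    if (hits == NULL)
        return 0;
    idx1 blockDim = { blockDimX };
    for (int b = 0; b < gridDimX; b++)
        for (int t = 0; t < blockDimX; t++) {
            idx1 blockIdx = { b }, threadIdx = { t };
            hits[global_index(blockIdx, blockDim, threadIdx)]++;
        }
    int ok = 1;
    for (int i = 0; i < n; i++)
        if (hits[i] != 1)
            ok = 0;
    free(hits);
    return ok;
}
```

For example, thread 5 of block 2 with 256 threads per block maps to element 2*256 + 5 = 517, and covers_once(4, 256) confirms that 4 blocks of 256 threads touch each of 1024 elements exactly once.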


[Figure 3.2: An example of CUDA thread organization. The host launches Kernel 1 and Kernel 2; on the device, Kernel 1 runs as Grid 1 and Kernel 2 as Grid 2. A grid is made of blocks, Block(0, 0) through Block(1, 1). Block (1, 1) is expanded to show its threads, Thread(0,0,0) through Thread(3,1,0), with a second layer (0,0,1) through (3,0,1). Courtesy: NVIDIA.]


• Every function is __host__ (the default when unqualified), __global__, or __device__

• Host: Runs on CPU, called from CPU

• Global: Runs on GPU, called from CPU

• Device: Runs on GPU, called from GPU

CUDA memory model

• Global memory
  – Faster than CPU memory
  – Slower than cache
  – Accessible by all threads

• Block shared memory
  – Small-ish, fast, shared by threads in a block

• Thread memory
  – Small, fast, local

• Texture memory
  – Small, fast, global

Example: SAXPY (Y = aX + Y)

__host__ void SAXPYCPU(float * X, float * Y, float a, int N){
    for(int i=0; i<N; i++)
        Y[i] = a*X[i] + Y[i];
}

__global__ void SAXPYGPU(float * X, float * Y, float a){
    //one thread per element; assumes the grid has exactly N threads
    int i = blockDim.x*blockIdx.x+threadIdx.x;
    Y[i] = a*X[i] + Y[i];
}

(continued)

Example: SAXPY

__host__ int main() {
    int N = 1073741824; //2^30 ≈ 1 billion
    size_t sz = N * sizeof(float); //bytes we need
    float * h_X = (float *) malloc(sz); //allocate the
    float * h_Y = (float *) malloc(sz); //host memory
    /*some code to fill up h_X and h_Y*/
    float * d_X, * d_Y;
    cudaMalloc((void **)&d_X, sz); //allocate the
    cudaMalloc((void **)&d_Y, sz); //device memory
    //move the data onto the GPGPU
    cudaMemcpy(d_X, h_X, sz, cudaMemcpyHostToDevice);
    cudaMemcpy(d_Y, h_Y, sz, cudaMemcpyHostToDevice);

(continued)

Example: SAXPY

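    //data is on the device; time to do some SAXPY
    int threadsPerBlock = 256;
    int blocks = N / threadsPerBlock; //exact because N is a multiple of 256
    SAXPYGPU<<<blocks, threadsPerBlock>>>(d_X, d_Y, 2); //pass the device pointers
    cudaThreadSynchronize(); //wait until done (cudaDeviceSynchronize() in later CUDA releases)
    cudaMemcpy(h_Y, d_Y, sz, cudaMemcpyDeviceToHost);
    cudaFree(d_X);
    cudaFree(d_Y); //we no longer need the device memory
    free(h_X);
    free(h_Y); //nor the host memory
}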

Example: SAXPY

That was easy.
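The SAXPY launch divides N evenly by the block size. When N is not a multiple of the block size, a common approach is to round the block count up and have each thread skip indices past the end. The sketch below emulates that in plain C by running the "threads" one after another; blocks_for, saxpy_thread, and saxpy_emulated are hypothetical names for this illustration and are not part of the original example or of the CUDA API.

```c
/* Number of blocks needed to cover n elements (integer division, rounded up). */
int blocks_for(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

/* Body of one emulated GPU thread; the guard skips indices >= n. */
static void saxpy_thread(int blockIdx_x, int blockDim_x, int threadIdx_x,
                         const float *X, float *Y, float a, int n)
{
    int i = blockDim_x * blockIdx_x + threadIdx_x;
    if (i < n)
        Y[i] = a * X[i] + Y[i];
}

/* Run every thread of every block sequentially, as a stand-in for a
   kernel launch with blocks_for(n, threadsPerBlock) blocks. */
void saxpy_emulated(const float *X, float *Y, float a, int n,
                    int threadsPerBlock)
{
    int blocks = blocks_for(n, threadsPerBlock);
    for (int b = 0; b < blocks; b++)
        for (int t = 0; t < threadsPerBlock; t++)
            saxpy_thread(b, threadsPerBlock, t, X, Y, a, n);
}
```

With n = 1000 and 256 threads per block this yields 4 blocks (1024 threads), and the last 24 threads do nothing because of the guard.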

Example: 2D integration

Simpson 2D coefficient matrix:

Our function: $f(x,y) = e^{xy}\,(x+y+\pi)^{-1/2}\,\sin(\log(x-y+\pi))$

Want $\iint f(x,y)\,dA$ over $|x|, |y| \le 1$
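The Simpson 2D coefficient matrix is the outer product of the 1D composite Simpson coefficients (1, 4, 2, 4, ..., 4, 1) with themselves, which gives the 1/2/4/8/16 weights used in the code on the following slides. The sketch below is a plain-C illustration of that rule, not the deck's implementation; simpson_w, simpson2d, and the test integrand x2y2 are names assumed for this example.

```c
/* 1D composite Simpson coefficient for node i of n intervals (n even):
   1 at both ends, 4 at odd nodes, 2 at even interior nodes. */
int simpson_w(int i, int n)
{
    if (i == 0 || i == n)
        return 1;
    return (i % 2 == 1) ? 4 : 2;
}

/* Integrate f over [x0,xf] x [y0,yf] on an (n+1) x (n+1) grid, n even.
   The 2D weight of node (i,j) is simpson_w(i) * simpson_w(j); for n = 4
   the weight matrix is
        1  4  2  4  1
        4 16  8 16  4
        2  8  4  8  2
        4 16  8 16  4
        1  4  2  4  1
   and the sum is scaled by hx*hy/9. */
double simpson2d(double (*f)(double, double),
                 double x0, double xf, double y0, double yf, int n)
{
    double hx = (xf - x0) / n, hy = (yf - y0) / n, sum = 0.0;
    for (int i = 0; i <= n; i++)
        for (int j = 0; j <= n; j++)
            sum += simpson_w(i, n) * simpson_w(j, n)
                 * f(x0 + i * hx, y0 + j * hy);
    return sum * hx * hy / 9.0;
}

/* Test integrand with a known integral: x^2 y^2 over [-1,1]^2 is 4/9. */
double x2y2(double x, double y)
{
    return x * x * y * y;
}
```

Simpson's rule is exact for polynomials up to degree three in each variable, so simpson2d(x2y2, -1, 1, -1, 1, 4) should reproduce 4/9 up to rounding, which makes it a convenient sanity check before trying the harder integrand.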

Example: 2D integration

__host__ int main(){
    int B = N/T; //(N+1)^2=points, T=threads, B=blocks
    size_t sz = B*N*sizeof(dtyp); //dtyp is typedef'd
    dtyp * d, *h = (dtyp *) malloc(sz);
    cudaMalloc((void **)&d, sz);
    dim3 Threads(T);
    dim3 Grid(B, N); //W=bound of integration
    S2DGPU<<<Grid, Threads>>>(-W, W, -W, W, d); //INVOKE
    cudaThreadSynchronize(); //wait for it to finish
    cudaMemcpy(h, d, sz, cudaMemcpyDeviceToHost);
    cudaFree(d);
    dtyp u=0;
    for(int i=0; i<B*N; i++)
        u += h[i]; //sigma the different results
    u += f2(W, W); //algorithm misses last point
    u *= (dtyp)4*W*W/((dtyp)9*N*N); //normalize; cast keeps 9*N*N from overflowing int at N=16384
}

Example: 2D integration

__host__ void S2DCPU(dtyp x0, dtyp xf, dtyp y0, dtyp yf, dtyp* a){
    *a=0;
    dtyp x=x0, y;
    for(int i=0; i<=N; i++){
        y = y0;
        for(int j=0; j<=N; j++){
            bool c1 = i==0||i==N, c2 = j==0||j==N;
            *a+=(c1?(c2?1:(j%2==0?2:4)):
                (i%2==0?(c2?2:(j%2==0?4:8)):(c2?4:(j%2==0?8:16))))*f2(x,y);
            y += (yf-y0)/N;
        }
        x += (xf-x0)/N;
    }
}

Example: 2D integration

__global__ void S2DGPU(dtyp x0, dtyp xf, dtyp y0, dtyp yf, dtyp * a){
    int X = blockIdx.x*blockDim.x+threadIdx.x;
    int Y = blockIdx.y;
    dtyp x = x0+(xf-x0)*X/(gridDim.x*blockDim.x);
    dtyp y = y0+(yf-y0)*Y/gridDim.y;
    __shared__ dtyp u[T];
    bool evx = (X&1)==0, evy = (Y&1)==0;
    u[threadIdx.x] = (X==0?(Y==0?1:(evy?2:4)):(evx?(Y==0?2:(evy?4:8)):
        (Y==0?4:(evy?8:16))))*F(x,y);
    if(threadIdx.x==0){
        if(blockIdx.x==0)
            u[threadIdx.x]+=(blockIdx.y==0?1:((blockIdx.y&1)==0?2:4))*F(xf,y);
        else if(blockIdx.x==1)
            u[threadIdx.x]+=(blockIdx.y==0?1:((blockIdx.y&1)==0?2:4))
                *F(x0+(xf-x0)*blockIdx.y/gridDim.y, yf);
    }
    __syncthreads();
    if(threadIdx.x==0){
        for(int i=1; i<T; i++)
            u[0]+=u[i];
        a[blockIdx.x*gridDim.y+Y]=u[0];
    }
}

Next-gen GPGPUs

NVIDIA Tesla S1070
  – 960 cores @ 1.44 GHz
  – 16 GB DRAM (no more, no less)
  – 506 GB/s memory bandwidth
  – 4000 GFLOPS
  – 800 W (0.2 W/GFLOPS)
  – $4,000 ($1/GFLOPS)

Intel Xeon 5500
  – 4 cores @ 3.2 GHz
  – Up to 192 GB DRAM (memory not included)
  – 64 GB/s memory bandwidth
  – ~50 GFLOPS
  – 130 W (2.6 W/GFLOPS)
  – $2,300 ($46/GFLOPS)

Runtime data: 2D integration

dtyp    N      GPU 16  GPU 256  GPU 512  CPU 16  CPU 256  CPU 2048
double  1024   8       9        -        114     208      -
double  16384  1685    2047     -        25996   15192    17056
float   1024   8       9        10       113     210      -
float   16384  353     255      361      29949   23907    23009

Note: All times are in milliseconds.
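One way to read the runtime data is as a speedup: the best CPU time in a row divided by the best GPU time in the same row. The plain-C sketch below does that arithmetic. The function names are made up for this illustration, and marking "-" cells with a negative number is an assumption of the sketch, not something the slides specify.

```c
/* Smallest valid timing in a row; negative entries stand for "-" cells.
   Returns -1.0 if the row has no valid entry. */
double best_time(const double *t, int n)
{
    double best = -1.0;
    for (int i = 0; i < n; i++)
        if (t[i] >= 0.0 && (best < 0.0 || t[i] < best))
            best = t[i];
    return best;
}

/* Ratio of the best CPU time to the best GPU time for one table row.
   Returns -1.0 if either side has no usable timing. */
double speedup(const double *gpu, int ng, const double *cpu, int nc)
{
    double g = best_time(gpu, ng), c = best_time(cpu, nc);
    if (g <= 0.0 || c < 0.0)
        return -1.0;
    return c / g;
}
```

For the float, N = 16384 row this compares 23009 ms against 255 ms, a speedup of roughly 90; for the double, N = 16384 row it compares 15192 ms against 1685 ms, roughly 9.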
