
Page 1:

CUDA

Page 2:

Assignment
Subject: DES using CUDA
Deliverables: des.c, des.cu, report
Due: 12/14, nai0315@snu.ac.kr

Page 3:

Index
• What is a GPU?
• Programming model and a simple example
• The environment for CUDA programming
• What is DES?

Page 4:

What's in a GPU?
A GPU is a heterogeneous chip multiprocessor (highly tuned for graphics)

Page 5:

Slimming down
Idea #1: Remove components that help a single instruction stream run fast

Page 6:

Parallel execution
Two cores, four cores, sixteen cores: 16 simultaneous instruction streams
Next step: be able to share an instruction stream

Page 7:

SIMD processing
Idea #2: Amortize the cost/complexity of managing an instruction stream across many ALUs
16 cores × 8 ALUs each = 128 ALUs

Page 8:

What about branches?
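
Under SIMD, many ALUs share one instruction stream, so when fragments take different sides of a branch the core executes both sides in turn, masking off the inactive ALUs; divergent branches therefore cost throughput. A hypothetical CUDA kernel that diverges this way:

__global__ void Diverge(float *x) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (x[i] > 0.0f)
    x[i] *= 2.0f;  // lanes where the condition holds run this path...
  else
    x[i] = -x[i];  // ...then the remaining lanes run this one, serially
}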

Page 9:

Throughput!

Idea #3: Interleave processing of many fragments on a single core to avoid stalls caused by high-latency operations

Page 10:

Summary: three key ideas of the GPU
1. Use many "slimmed-down" cores that run in parallel
2. Pack cores full of ALUs (by sharing an instruction stream across groups of fragments)
3. Avoid latency stalls by interleaving execution of many groups of fragments: when one group stalls, work on another group

Page 11:

Programming model
• The GPU is viewed as a compute device operating as a coprocessor to the main CPU (host)
• Data-parallel, compute-intensive functions should be off-loaded to the device
• Functions that are executed many times, but independently on different data, are prime candidates, i.e. the bodies of for-loops
• A function compiled for the device is called a kernel
• The kernel is executed on the device as many different threads
• Both host (CPU) and device (GPU) manage their own memory: host memory and device memory
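
As a sketch of the for-loop idea above (a hypothetical vector-add example, not part of the assignment), each loop iteration becomes one thread:

// C version: iterations are independent, a prime candidate for off-loading
for (i = 0; i < n; i++) c[i] = a[i] + b[i];

// CUDA version: one thread per iteration
__global__ void AddKernel(int n, const float *a, const float *b, float *c) {
  int i = blockIdx.x * blockDim.x + threadIdx.x; // unique index for this thread
  if (i < n)                                     // guard: the grid may have extra threads
    c[i] = a[i] + b[i];
}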

Page 12:

Block and thread allocation
Blocks are assigned to SMs (Streaming Multiprocessors); threads are assigned to PEs (Processing Elements)
• Each thread executes the kernel
• Each block has a unique block ID
• Each thread has a unique thread ID within the block
• Warp: max 32 threads
• GTX 280: 30 SMs
• 1 SM: 8 SPs
• 1 SM: 32 warps = 1,024 threads
• Total threads: 30 × 1,024 = 30,720
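
A minimal sketch of how these IDs appear inside a kernel (blockIdx, threadIdx, and blockDim are CUDA's built-in variables; the kernel itself is hypothetical):

__global__ void RecordIds(int *blockIds, int *threadIds) {
  int g = blockIdx.x * blockDim.x + threadIdx.x; // globally unique thread index
  blockIds[g]  = blockIdx.x;  // unique block ID
  threadIds[g] = threadIdx.x; // unique thread ID within the block
}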

Page 13:

Memory model
Memory types:
• Registers (r/w per thread)
• Local mem (r/w per thread)
• Shared mem (r/w per block)
• Global mem (r/w per kernel)
• Constant mem (r per kernel)
Device memory is separate from CPU memory; the CPU can access global and constant mem via the PCIe bus
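
A minimal sketch of how these spaces are declared in CUDA C (a hypothetical kernel; coeff and tile are illustrative names, and the kernel assumes blockDim.x <= 128):

__constant__ float coeff[16];           // constant mem: read-only in the kernel,
                                        // written by the host via cudaMemcpyToSymbol

__global__ void MemDemo(float *gdata) { // gdata points into global mem
  __shared__ float tile[128];           // shared mem: r/w, one copy per block
  float r = gdata[threadIdx.x];         // r is held in a register (per thread)
  tile[threadIdx.x] = r * coeff[0];
  __syncthreads();                      // make the block's shared-mem writes visible
  gdata[threadIdx.x] = tile[threadIdx.x];
}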

Page 14:

Simple example (C to CUDA conversion)

__global__ void ForceCalcKernel(int nbodies, struct Body *body, ...) {}
__global__ void AdvancingKernel(int nbodies, struct Body *body, ...) {}

int main(...) {
  Body *body, *body1;
  ...
  cudaMalloc((void**)&body1, sizeof(Body)*nbodies);
  cudaMemcpy(body1, body, sizeof(Body)*nbodies, cudaMemcpyHostToDevice);
  for (timestep = ...) {
    ForceCalcKernel<<<1, 1>>>(nbodies, body1, ...);
    AdvancingKernel<<<1, 1>>>(nbodies, body1, ...);
  }
  cudaMemcpy(body, body1, sizeof(Body)*nbodies, cudaMemcpyDeviceToHost);
  cudaFree(body1);
  ...
}

Slide annotations:
• __global__ indicates a GPU kernel that the CPU can call
• Separate address spaces, so two pointers are needed
• cudaMalloc allocates memory on the GPU
• The first cudaMemcpy copies CPU data to the GPU
• The kernels are called with 1 block and 1 thread per block
• The second cudaMemcpy copies GPU data back to the CPU
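
The launch configuration <<<1, 1>>> means one block of one thread, so the kernels run serially on the device. A sketch of a more realistic launch (the block size of 256 is an arbitrary assumption, not from the slides):

int threadsPerBlock = 256;                                      // assumed typical block size
int blocks = (nbodies + threadsPerBlock - 1) / threadsPerBlock; // round up to cover all bodies
ForceCalcKernel<<<blocks, threadsPerBlock>>>(nbodies, body1, ...);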

Page 15:

Environment
The NVCC compiler:
• CUDA kernels are typically stored in files ending with .cu
• NVCC uses the host compiler (CL/G++) to compile CPU code
• NVCC automatically handles #includes and linking
You can download the CUDA toolkit from: http://developer.nvidia.com/cuda-downloads
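
For example (the flags and output names are assumptions, not from the slides), the two deliverables might be built as:

nvcc -O2 -o des_gpu des.cu    # NVCC splits device/host code and links automatically
gcc  -O2 -o des_cpu des.c     # the plain C reference version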

Page 16:

What is DES?
The archetypal block cipher: an algorithm that takes a fixed-length string of plaintext bits and transforms it, through a series of complicated operations, into a ciphertext bitstring of the same length.
The block size is 64 bits.
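
As a sketch of the structure behind that definition (DES is a 16-round Feistel network over the 64-bit block, split into 32-bit halves; the round function f, the key schedule, and the initial/final permutations are elided, and all names are illustrative):

#include <stdint.h>

/* Hypothetical prototype: the real DES f-function (expansion, S-boxes,
   P permutation) and the key schedule would be implemented for the assignment. */
uint32_t f(uint32_t half, uint64_t roundKey);

void des_encrypt_block(uint32_t *L, uint32_t *R, const uint64_t roundKey[16]) {
  for (int i = 0; i < 16; i++) {    /* 16 Feistel rounds */
    uint32_t tmp = *R;
    *R = *L ^ f(*R, roundKey[i]);   /* R_i = L_{i-1} XOR f(R_{i-1}, K_i) */
    *L = tmp;                       /* L_i = R_{i-1} */
  }
  /* (initial permutation, final permutation, and the last-round half-swap omitted) */
}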