CUDA
Assignment
• Subject: DES using CUDA
• Deliverables: des.c, des.cu, report
• Due: 12/14, [email protected]
Index
• What is a GPU?
• Programming model and Simple Example
• The Environment for CUDA programming
• What is DES?
What’s in a GPU?
A GPU is a heterogeneous chip multiprocessor (highly tuned for graphics)
Slimming down
Idea #1: Remove components that help a single instruction stream run fast
Parallel execution
Two cores Four cores
Sixteen cores: 16 simultaneous instruction streams
Be able to share an instruction stream
SIMD processing
Idea #2: Amortize cost/complexity of managing an instruction stream across many ALUs
16 cores = 128 ALUs
What about branches?
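The branch question can be pictured with a small CUDA sketch. It assumes CUDA's 32-thread warp is the group that shares one instruction stream; the kernel name `divergent` and the host wrapper `run_divergent` are hypothetical, chosen only for illustration. When threads of one warp disagree on a branch, the hardware typically runs both paths one after the other with the inactive lanes masked off, so the results stay correct but part of the ALUs idle during each path.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: even and odd threads of the same warp take different paths.
__global__ void divergent(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0) {
        out[i] = i * 2;     // even lanes execute this path while odd lanes are masked off
    } else {
        out[i] = i + 100;   // odd lanes execute this path while even lanes are masked off
    }
}

// Hypothetical host wrapper: fills host_out[0..n-1]; returns 0 on success, -1 on error.
int run_divergent(int *host_out, int n) {
    int *dev_out = NULL;
    if (cudaMalloc((void **)&dev_out, n * sizeof(int)) != cudaSuccess) return -1;
    int threads = 64;                           // two warps per block
    int blocks = (n + threads - 1) / threads;   // round up so every element is covered
    divergent<<<blocks, threads>>>(dev_out, n);
    // cudaMemcpy waits for the kernel to finish and reports earlier errors.
    cudaError_t err = cudaMemcpy(host_out, dev_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev_out);
    return err == cudaSuccess ? 0 : -1;
}
```

Branching on something that is uniform across a warp (for example `blockIdx.x`) avoids this serialization, because all lanes then take the same path.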
Throughput!
Idea #3: Interleave processing of many fragments on a single core to avoid stalls caused by high-latency operations
Summary: three key ideas of GPU
1. Use many “slimmed down cores” to run in parallel
2. Pack cores full of ALUs (by sharing an instruction stream across groups of fragments)
3. Avoid latency stalls by interleaving execution of many groups of fragments
   When one group stalls, work on another group
Programming Model
• The GPU is viewed as a compute device operating as a coprocessor to the main CPU (host)
• Data-parallel, compute-intensive functions should be off-loaded to the device
• Functions that are executed many times, but independently on different data, are prime candidates, i.e. the bodies of for-loops
• A function compiled for the device is called a kernel
• The kernel is executed on the device as many different threads
• Host (CPU) and device (GPU) each manage their own memory: host memory and device memory
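As a minimal sketch of the "body of a for-loop becomes a kernel" idea, the following turns an element-wise addition loop into a kernel in which each thread handles one iteration. The names `add_cpu`, `add_kernel`, and `add_gpu` are hypothetical, and the choice of 256 threads per block is an assumption rather than a recommendation from the slides.

```cuda
#include <cuda_runtime.h>

// CPU version: the loop body runs n times, each iteration independent of the others.
void add_cpu(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; ++i) c[i] = a[i] + b[i];
}

// GPU version: the loop body becomes the kernel; the loop index becomes the thread index.
__global__ void add_kernel(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];   // guard: the last block may have spare threads
}

// Hypothetical host wrapper: returns 0 on success, -1 on any CUDA error.
int add_gpu(const float *a, const float *b, float *c, int n) {
    size_t bytes = n * sizeof(float);
    float *da = NULL, *db = NULL, *dc = NULL;   // device copies: separate address space
    cudaError_t err = cudaMalloc((void **)&da, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void **)&db, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void **)&dc, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        int threads = 256;                          // assumed block size
        int blocks = (n + threads - 1) / threads;   // round up to cover all n elements
        add_kernel<<<blocks, threads>>>(da, db, dc, n);
        err = cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(da); cudaFree(db); cudaFree(dc);       // freeing a null pointer is a no-op
    return err == cudaSuccess ? 0 : -1;
}
```

Calling `add_gpu(a, b, c, n)` on host arrays should give the same result as `add_cpu(a, b, c, n)`; the host code never dereferences `da`, `db`, or `dc` directly because they point into device memory.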
Block and Thread Allocation
• Blocks are assigned to SMs (Streaming Multiprocessors)
• Threads are assigned to PEs (Processing Elements)
• Each thread executes the kernel
• Each block has a unique block ID
• Each thread has a unique thread ID within the block
• Warp: max 32 threads
• GTX 280: 30 SMs
• 1 SM: 8 SPs
• 1 SM: 32 warps = 1024 threads
• Total threads: 30 * 1024 = 30,720
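The block and thread IDs above map onto the built-in variables `blockIdx` and `threadIdx`. The sketch below records both for every thread and also reproduces the thread-count arithmetic from the slide; `max_resident_threads`, `record_ids`, and `run_record_ids` are hypothetical names, and a one-dimensional launch is assumed.

```cuda
#include <cuda_runtime.h>

// Slide arithmetic: SMs * warps per SM * threads per warp.
int max_resident_threads(int sms, int warps_per_sm, int threads_per_warp) {
    return sms * warps_per_sm * threads_per_warp;
}

// Each thread stores its own block ID and its thread ID within that block.
__global__ void record_ids(int *block_id, int *thread_id) {
    int global_id = blockIdx.x * blockDim.x + threadIdx.x;   // unique across the launch
    block_id[global_id] = blockIdx.x;
    thread_id[global_id] = threadIdx.x;
}

// Hypothetical host wrapper: both output arrays must hold blocks * threads ints.
int run_record_ids(int blocks, int threads, int *block_id, int *thread_id) {
    size_t bytes = (size_t)blocks * threads * sizeof(int);
    int *d_block = NULL, *d_thread = NULL;
    cudaError_t err = cudaMalloc((void **)&d_block, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void **)&d_thread, bytes);
    if (err == cudaSuccess) {
        record_ids<<<blocks, threads>>>(d_block, d_thread);
        err = cudaMemcpy(block_id, d_block, bytes, cudaMemcpyDeviceToHost);
    }
    if (err == cudaSuccess) err = cudaMemcpy(thread_id, d_thread, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_block); cudaFree(d_thread);
    return err == cudaSuccess ? 0 : -1;
}
```

With 2 blocks of 4 threads, the element at global index 5 belongs to block 1, thread 1, and `max_resident_threads(30, 32, 32)` gives the 30,720 figure quoted for the GTX 280.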
Memory model
Memory types:
• Registers (r/w per thread)
• Local mem (r/w per thread)
• Shared mem (r/w per block)
• Global mem (r/w per kernel)
• Constant mem (r per kernel)
Device memory is separate from CPU memory; the CPU can access global and constant mem via the PCIe bus
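To make the memory types concrete, here is a rough sketch that touches most of them in one kernel: a per-thread value in a register, a per-block `__shared__` array, global input and output arrays, and a `__constant__` scale factor written by the host. The names `scale`, `scaled_block_sum`, and `run_scaled_sum` are hypothetical, and the block size of 128 is an assumption (it only needs to be a power of two for the halving loop). Local memory is not shown because the compiler decides when per-thread data spills there.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

#define THREADS 128   // assumed threads per block; must be a power of two here

__constant__ int scale;   // constant memory: read-only in kernels, written by the host

__global__ void scaled_block_sum(const int *in, int *out, int n) {
    __shared__ int partial[THREADS];            // shared memory: one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int v = (i < n) ? in[i] * scale : 0;        // v is private to this thread
    partial[threadIdx.x] = v;
    __syncthreads();                            // all writes visible before reading
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = partial[0];   // result goes to global memory
}

// Hypothetical host wrapper: stores the sum of in[i] * factor in *total.
int run_scaled_sum(const int *in, int n, int factor, int *total) {
    int blocks = (n + THREADS - 1) / THREADS;
    int *d_in = NULL, *d_out = NULL;
    int *sums = (int *)malloc(blocks * sizeof(int));
    if (sums == NULL) return -1;
    cudaError_t err = cudaMemcpyToSymbol(scale, &factor, sizeof(int));
    if (err == cudaSuccess) err = cudaMalloc((void **)&d_in, n * sizeof(int));
    if (err == cudaSuccess) err = cudaMalloc((void **)&d_out, blocks * sizeof(int));
    if (err == cudaSuccess) err = cudaMemcpy(d_in, in, n * sizeof(int), cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        scaled_block_sum<<<blocks, THREADS>>>(d_in, d_out, n);
        err = cudaMemcpy(sums, d_out, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    }
    *total = 0;
    if (err == cudaSuccess) {
        for (int b = 0; b < blocks; ++b) *total += sums[b];   // combine per-block sums
    }
    cudaFree(d_in); cudaFree(d_out); free(sums);
    return err == cudaSuccess ? 0 : -1;
}
```

The host reaches global memory through `cudaMemcpy` and constant memory through `cudaMemcpyToSymbol`; it has no direct access to registers or shared memory, which is why the per-block sums are written to a global array before being copied back.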
Simple Example (C to CUDA conversion)
__global__ void ForceCalcKernel(int nbodies, struct Body *body, …) {}
__global__ void AdvancingKernel(int nbodies, struct Body *body, …) {}

int main(…) {
    Body *body, *body1;
    …
    cudaMalloc((void**)&body1, sizeof(Body)*nbodies);
    cudaMemcpy(body1, body, sizeof(Body)*nbodies, cudaMemcpyHostToDevice);
    for (timestep = …) {
        ForceCalcKernel<<<1, 1>>>(nbodies, body1, …);
        AdvancingKernel<<<1, 1>>>(nbodies, body1, …);
    }
    cudaMemcpy(body, body1, sizeof(Body)*nbodies, cudaMemcpyDeviceToHost);
    cudaFree(body1);
    …
}
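Notes on the example:
• __global__: indicates a GPU kernel that the CPU can call
• body and body1: separate address spaces, so two pointers are needed
• cudaMalloc: allocate memory on the GPU
• cudaMemcpy (host to device): copy CPU data to the GPU
• <<<1, 1>>>: call the GPU kernel with 1 block and 1 thread per block
• cudaMemcpy (device to host): copy GPU data back to the CPU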
Environment
The NVCC compiler
• CUDA kernels are typically stored in files ending with .cu
• NVCC uses the host compiler (CL/G++) to compile CPU code
• NVCC automatically handles #include’s and linking
• You can download the CUDA toolkit from: http://developer.nvidia.com/cuda-downloads
What is DES?
• The archetypal block cipher
• An algorithm that takes a fixed-length string of plaintext bits and transforms it through a series of complicated operations into a ciphertext bitstring of the same length
• The block size is 64 bits
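One possible way to organize the data for des.cu, sketched under stated assumptions: the input is packed into 64-bit blocks and each GPU thread transforms one block independently, which assumes the blocks do not depend on each other. The per-block function below is only a placeholder (an XOR with the key) and is NOT DES; the names `pack_block`, `transform_block`, `transform_blocks`, and `run_transform` are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <stdint.h>

// Pack 8 bytes, most significant byte first, into one 64-bit block.
__host__ __device__ uint64_t pack_block(const uint8_t *bytes) {
    uint64_t block = 0;
    for (int i = 0; i < 8; ++i) block = (block << 8) | bytes[i];
    return block;
}

// PLACEHOLDER, not DES: stands in for the real 64-bit-in, 64-bit-out block transform.
__device__ uint64_t transform_block(uint64_t block, uint64_t key) {
    return block ^ key;
}

// One thread per 64-bit block; the output has the same length as the input.
__global__ void transform_blocks(const uint64_t *in, uint64_t *out, int nblocks, uint64_t key) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nblocks) out[i] = transform_block(in[i], key);
}

// Hypothetical host wrapper: returns 0 on success, -1 on any CUDA error.
int run_transform(const uint64_t *in, uint64_t *out, int nblocks, uint64_t key) {
    size_t bytes = (size_t)nblocks * sizeof(uint64_t);
    uint64_t *d_in = NULL, *d_out = NULL;
    cudaError_t err = cudaMalloc((void **)&d_in, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void **)&d_out, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(d_in, in, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        int threads = 128;                                // assumed block size
        int grid = (nblocks + threads - 1) / threads;     // round up to cover all blocks
        transform_blocks<<<grid, threads>>>(d_in, d_out, nblocks, key);
        err = cudaMemcpy(out, d_out, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(d_in); cudaFree(d_out);
    return err == cudaSuccess ? 0 : -1;
}
```

Replacing `transform_block` with a real DES implementation would leave the surrounding structure unchanged, since both take a 64-bit block and return a 64-bit block.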