GPU Architecture and Programming


Page 1: GPU Architecture and Programming. GPU vs CPU

GPU Architecture and Programming

Page 2: GPU Architecture and Programming. GPU vs CPU

GPU vs CPU
https://www.youtube.com/watch?v=fKK933KK6Gg

Page 3: GPU Architecture and Programming. GPU vs CPU

GPU Architecture

• GPUs (Graphics Processing Units) were originally designed as graphics accelerators, used for real-time graphics rendering.

• Starting in the late 1990s, the hardware became increasingly programmable, culminating in NVIDIA's GeForce 256 in 1999, which was marketed as the first GPU.

Page 4: GPU Architecture and Programming. GPU vs CPU

• CPU + GPU is a powerful combination:
– CPUs consist of a few cores optimized for serial processing.
– GPUs consist of thousands of smaller, more efficient cores designed for parallel performance.
– Serial portions of the code run on the CPU, while parallel portions run on the GPU.

Page 5: GPU Architecture and Programming. GPU vs CPU

Architecture of GPU

Image copied from http://www.pgroup.com/lit/articles/insider/v2n1a5.htm
Image copied from http://people.maths.ox.ac.uk/~gilesm/hpc/NVIDIA/NVIDIA_CUDA_Tutorial_No_NDA_Apr08.pdf

Page 6: GPU Architecture and Programming. GPU vs CPU

CUDA Programming

• CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA for its GPUs.

• By using CUDA, you can write programs that directly access the GPU.

• The CUDA platform is accessible to programmers via CUDA libraries and extensions to programming languages like C, C++, and Fortran.
– C/C++ programmers use "CUDA C/C++", compiled with the nvcc compiler.
– Fortran programmers can use CUDA Fortran, compiled with the PGI CUDA Fortran compiler.

Page 7: GPU Architecture and Programming. GPU vs CPU

• Terminology:
– Host: the CPU and its memory (host memory)
– Device: the GPU and its memory (device memory)

Page 8: GPU Architecture and Programming. GPU vs CPU

Programming Paradigm

Copy from http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf

Each parallel function of the application is executed as a kernel.

Page 9: GPU Architecture and Programming. GPU vs CPU

Programming Flow

1. Copy input data from CPU memory to GPU memory.
2. Load the GPU program and execute it.
3. Copy results from GPU memory back to CPU memory.
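These three steps map directly onto CUDA runtime calls. A minimal sketch in CUDA C, assuming an illustrative float buffer of N elements and a placeholder kernel name someKernel (neither is from the slides):

#include <cuda_runtime.h>
#include <stdlib.h>

int main(void) {
    const int N = 1024;                                  // illustrative size
    size_t size = N * sizeof(float);
    float *h_data = (float *)calloc(N, sizeof(float));   // host (CPU) buffer
    float *d_data;
    cudaMalloc(&d_data, size);                           // device (GPU) buffer

    // 1. Copy input data from CPU memory to GPU memory
    cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);

    // 2. Load the GPU program (kernel) and execute it
    // someKernel<<<numBlocks, threadsPerBlock>>>(d_data);

    // 3. Copy results from GPU memory back to CPU memory
    cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    free(h_data);
    return 0;
}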

Page 10: GPU Architecture and Programming. GPU vs CPU

• Each parallel function of the application is executed as a kernel.

• That means GPUs are programmed as a sequence of kernels; typically, each kernel completes execution before the next kernel begins.

• The Fermi architecture has some support for executing multiple, independent kernels simultaneously, but most kernels are large enough to fill the entire machine.

Page 11: GPU Architecture and Programming. GPU vs CPU

Image copied from http://people.maths.ox.ac.uk/~gilesm/hpc/NVIDIA/NVIDIA_CUDA_Tutorial_No_NDA_Apr08.pdf

Page 12: GPU Architecture and Programming. GPU vs CPU

Hello World! Example

Copy from http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf

__global__ is a CUDA C/C++ keyword meaning:
• mykernel() will be executed on the device
• mykernel() will be called from the host
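The code image for this slide did not survive the transcript; what follows is a minimal sketch of the Hello World example it refers to, reusing the mykernel() name from the bullets above:

#include <stdio.h>

__global__ void mykernel(void) {
    // executed on the device; does nothing in this minimal example
}

int main(void) {
    mykernel<<<1, 1>>>();        // kernel launch from the host
    cudaDeviceSynchronize();     // wait for the device to finish
    printf("Hello World!\n");
    return 0;
}

// Compile with the nvcc compiler, e.g.: nvcc hello.cu -o hello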

Page 13: GPU Architecture and Programming. GPU vs CPU

Addition Example

• Since add() runs on the device, the pointers a, b, and c must point to device memory.

Copy from http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf
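The kernel image is missing from the transcript; a sketch of the single-element addition kernel the bullet describes, following the cited presentation:

__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;    // a, b, c are device pointers; one thread does the addition
}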

Page 14: GPU Architecture and Programming. GPU vs CPU

Copy from http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf
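The host-side code image for this page is also missing; a sketch of what it likely contains, following the standard pattern (allocate device memory, copy the inputs over, launch add() with one block and one thread, copy the result back). The input values are illustrative:

int main(void) {
    int a = 2, b = 7, c;                 // host copies
    int *d_a, *d_b, *d_c;                // device pointers
    int size = sizeof(int);

    cudaMalloc(&d_a, size);
    cudaMalloc(&d_b, size);
    cudaMalloc(&d_c, size);

    cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

    add<<<1, 1>>>(d_a, d_b, d_c);        // one block, one thread

    cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}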

Page 15: GPU Architecture and Programming. GPU vs CPU

Vector Addition Example

Kernel Function:

Copy from http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf
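The kernel image did not survive the transcript; a sketch of the block-indexed vector-addition kernel this page introduces (one block per element, so blockIdx.x selects the element):

__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];   // each block handles one element
}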

Page 16: GPU Architecture and Programming. GPU vs CPU

main:

Copy from http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf
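Again the image is missing; a sketch of the host code, assuming an illustrative vector length N of 512 and launching one thread per block:

#define N 512

int main(void) {
    int *a, *b, *c;                  // host copies
    int *d_a, *d_b, *d_c;            // device copies
    int size = N * sizeof(int);

    cudaMalloc(&d_a, size);
    cudaMalloc(&d_b, size);
    cudaMalloc(&d_c, size);

    a = (int *)malloc(size);
    b = (int *)malloc(size);
    c = (int *)malloc(size);
    // ... fill a and b with input values ...

    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    add<<<N, 1>>>(d_a, d_b, d_c);    // N blocks, 1 thread per block

    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}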

Page 17: GPU Architecture and Programming. GPU vs CPU

Alternative 1:

Copy from http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf
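The code image is missing; in the cited presentation, Alternative 1 indexes by thread instead of block, launching a single block of N threads. A sketch:

__global__ void add(int *a, int *b, int *c) {
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];   // each thread handles one element
}

// launched from main as a single block of N threads:
// add<<<1, N>>>(d_a, d_b, d_c);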

Page 18: GPU Architecture and Programming. GPU vs CPU

Alternative 2:

int globalThreadId = threadIdx.x + blockIdx.x * M;           // M is the number of threads in a block

int globalThreadId = threadIdx.x + blockIdx.x * blockDim.x;  // blockDim.x gives the number of threads per block

Copy from http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf

Page 19: GPU Architecture and Programming. GPU vs CPU

• So the kernel becomes

Copy from http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf
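The image is missing; combining blocks and threads with the index formula above, the kernel would look like:

__global__ void add(int *a, int *b, int *c) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;   // global thread ID
    c[index] = a[index] + b[index];
}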

Page 20: GPU Architecture and Programming. GPU vs CPU

• The main() function becomes

Copy from http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf
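The image is missing; the only change to main() is the launch configuration, splitting the N elements across blocks of THREADS_PER_BLOCK threads each (the value 512 is illustrative):

#define THREADS_PER_BLOCK 512
// ... allocation and host-to-device copies as before ...
add<<<N / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c);
// ... copy the result back and free as before ...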

Page 21: GPU Architecture and Programming. GPU vs CPU

Handling Arbitrary Vector Sizes

Copy from http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf
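The code image is missing here as well; handling an arbitrary vector size n means the grid may contain more threads than elements, so the kernel takes n as a parameter and guards the access, and the launch rounds the block count up. A sketch:

__global__ void add(int *a, int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)                                   // skip threads past the end of the vectors
        c[index] = a[index] + b[index];
}

// launch with enough blocks to cover all N elements, rounding up:
// add<<<(N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c, N);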