Newbie’s guide to the GPGPU universe

Ofer Rosenberg

Posted 10-May-2015

DESCRIPTION

A light introduction to the world of GPGPU, designed as roughly a one-hour presentation.

TRANSCRIPT

Page 1: Newbie’s guide to the GPGPU universe

Newbie’s guide to the GPGPU universe

Ofer Rosenberg

Page 2

Agenda

• GPU History

• Anatomy of a Modern GPU

• Typical GPGPU Models

• The GPGPU universe

Page 3

GPU History

A GPGPU perspective


Page 4

From Shaders to Compute (1)

In the beginning, GPU hardware was fixed-function and optimized for graphics…

Slide from “GPU Architecture: Implications & Trends”, David Luebke, NVIDIA Research, SIGGRAPH 2008.

Page 5

From Shaders to Compute (2)

• GPUs evolved to become programmable

(which made gaming companies very happy…)

Shader: a small program that runs on a graphics processing unit and describes the attributes of either a vertex or a pixel.

Page 6

The birth of GPGPU (1)

• Interest from the academic world

– Pixel shader = run the same program for (1024 × 768 × 60) pixels per second

– = a highly efficient SPMD (Single Program, Multiple Data) machine

• Used a fictitious graphics pipeline to solve problems

– Advanced graphics problems

– General computational problems


Page 7

The birth of GPGPU (2)

• In 2002, Mark Harris from NVIDIA coined the term GPGPU: “General-Purpose computation on Graphics Processing Units”

• Used a graphics language for general computation

• Highly effective, but:

– The developer needed to learn another (unintuitive) language

– The developer was limited by the graphics language

Page 8

From Shaders to Compute (3)

• GPUs needed one more evolutionary step: Unified Shaders


Page 9

Rise of modern GPGPU

• Unified Architecture paved the way for modern GPGPU languages

– ATI X1900 (R580) released Jan 2006

– CTM released Nov 2006

– GeForce 8800 GTX (G80) released Nov 2006

– CUDA 0.8 (first official beta) released Feb 2007

Page 10

Evolution of Compute APIs (GPGPU)

• CUDA & CTM led to two compute standards: Direct Compute & OpenCL

• DirectCompute is a Microsoft standard

– Released as part of Win7/DX11, a.k.a. Compute Shaders

– Runs only on Windows

– Microsoft C++ AMP maps to DirectCompute

• OpenCL is a cross-OS / cross-Vendor standard

– Managed by a working group in Khronos

– Apple is the spec editor & conformance owner

– Work can be scheduled on both GPUs and CPUs

Release timeline:

– CTM SDK: Nov 2006

– CUDA 1.0: June 2007

– CUDA 2.0: Aug 2008

– OpenCL 1.0: Dec 2008

– DirectX 11: Oct 2009

– CUDA 3.0: Mar 2010

– OpenCL 1.1: June 2010

– CUDA 4.0: May 2011

– OpenCL 1.2: Nov 2011

– CUDA 4.1: Jan 2012

– CUDA 4.2: April 2012

– C++ AMP 1.0: Aug 2012

– CUDA 5.0: Oct 2012

– CUDA 5.5: July 2013

– OpenCL 2.0 (provisional): July 2013

Page 11

GPGPU Evolution

2004 – Stanford University: Brook for GPUs

2006 – AMD releases CTM; NVIDIA releases CUDA

2008 – OpenCL 1.0 released

(G80 – 346 GFLOPS; R580 – 375 GFLOPS)

Page 12

GPGPU Evolution

Nov 2009 – First hybrid SC in the Top 10: Chinese Tianhe-1

– 1,024 Intel Xeon E5450 CPUs

– 5,120 Radeon 4870 X2 GPUs

Nov 2010 – First hybrid SC reaches #1 on the Top500 list: Tianhe-1A

– 14,336 Xeon X5670 CPUs

– 7,168 NVIDIA Tesla M2050 GPUs

Source: http://www.top500.org/lists/

Page 13

GPGPU Evolution

2013 – OpenCL on the Nexus 4 (Qualcomm Adreno 320) and Nexus 10 (ARM Mali T604); Android 4.2 adds GPU support for RenderScript

2014 – NVIDIA Tegra 5 will support CUDA

2013 – The GPGPU continuum becomes a reality

Page 14

The GPGPU Continuum

• Apple A6 GPU – 25 GFLOPS, < 2W

• AMD G-T16R – 46 GFLOPS*, 4.5W

• Intel i7-3770 – 511 GFLOPS*, 77W

• NVIDIA GTX Titan – 4500 GFLOPS, 250W

• ORNL TITAN SC – 27 PFLOPS, 8200 kW

* GFLOPS of CPU+GPU

Page 15

Anatomy of a Modern GPU

A GPGPU perspective


Page 16

Massive Parallelism

From a GPGPU perspective, the GPU is a highly multi-threaded, wide-vector machine


Page 17

Parallelism detailed

• Multi (Many) Cores

• Wide Vector Unit

• Multi-threaded (latency/stalls hiding)


Many cores:

• NVIDIA K20 – 14 SMXs

• AMD HD7970 – 32 Compute Units

• Intel Xeon Phi 5110P – 60 cores

Wide vector unit:

• NVIDIA K20 – Warp = 32 floats, 6 warps per SMX

• AMD HD7970 – Wavefront = 64 floats, 4 wavefronts per CU

• Intel Xeon Phi 5110P – VPU = 16 floats, 1 VPU per core

Multi-threading (in-flight contexts):

• NVIDIA K20 – 64 warps per SMX

• AMD HD7970 – 40 wavefronts per CU

[Image: NVIDIA GK110 SMX block diagram]

Page 18

Typical GPU Caveats

• Wide vectors = SIMD (SIMT) execution

– Conditional code has to be executed “vector wide”

– Mitigation: predication (execute all code, using masks on parts)

– Performance hit on mixed execution, up to 1/N efficiency (where N is the vector width)

• Many cores & small caches = high percentage of stalls

– Mitigation:

• Hold multiple in-flight contexts (a.k.a. warps/wavefronts) per core

• Stall = fast context switch between an in-flight context and the active context

• Requires a huge register bank (NV & AMD: 256KB per SMX/CU)

– Latency hiding depends on having enough in-flight contexts

A must read (the images to the right are taken from this talk): “From Shader Code to a Teraflop: How GPU Shader Cores Work”, by Kayvon Fatahalian, Stanford University, and Mike Houston, Fellow, AMD

Page 19

Typical GPGPU Models

This section describes some general GPGPU models that apply to a wide range of languages.


Page 20

Simplified System Model

• The Host runs the OS, application, drivers, etc.

• The GPU is connected to the Host through PCIe, shared memory, etc.

• The application code contains API calls*, which use a runtime environment, which provides GPU access.

• The application code also contains “kernels”: short programs/functions that are loaded and executed on the GPU.

* In some languages the API calls are abstracted through special syntax or directives.

[Diagram: Host (Application → Runtime) connected to a GPU executing kernels]

Page 21

GPGPU Execution Model (1)

• A “kernel” is executed over a grid (1D/2D/3D)

• Each point in the grid executes one instance of the kernel, independently of the others*

• Per-instance reads/writes are accomplished by using the instance’s index

* There are sync primitives at the group/block level (or across the whole device)


OpenCL / CUDA example (CUDA shown):

// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    // Kernel invocation
    dim3 dimBlock(16, 16);
    dim3 dimGrid((N + dimBlock.x - 1) / dimBlock.x,
                 (N + dimBlock.y - 1) / dimBlock.y);
    MatAdd<<<dimGrid, dimBlock>>>(A, B, C);
}

Page 22

GPGPU Execution Model (2)

• The GPU execution model is asynchronous

– Commands are sent down the stack

– Kernels execute based on GPU load & status (the GPU serves a few apps)

– Application code may wait on completion

• Queueing model

– Explicit (OpenCL)

– Default is implicit, advanced usage is explicit (CUDA)

• SPMD → MPMD

– GPUs used to execute one kernel at a time

– Modern languages support multiple simultaneous kernels

Page 23

GPGPU Memory Model

Basically, a distributed memory system:

• Separate Host memory / Device memory

– Create a buffer/image on the host

– Create a buffer/image on the device

• Opaque handle (OpenCL) or device-side pointer (CUDA)

• Sync operations between the memories:

– Read / Write

– Map / Unmap (marshalling)

• Pinned memory for faster sync

• The GPU can access host-mapped memory (CUDA)

[Diagram: Host application creates a buffer via the runtime and writes it to a buffer on the GPU]

Page 24

GPU Memory Model

• A few memory types, driven by GPU architecture

• Affects performance – use the right type

• Watch out for coherency issues

– Not your typical MESI architecture…


Page 25

Compilation Model

• Most GPGPU languages use dynamic compilation

– A common practice in the world of GPUs

– Different GPU architectures: no common ISA

– The ISA varies even between generations from the same vendor

• A front-end converts the high-level language to IR (Intermediate Representation)

– Assembly of a virtual machine

– LLVM is very common in this world*

– In some languages, this happens at application compile time

• Back-end(s) convert from IR to binary

– Some vendors use additional intermediate-to-intermediate stages

• Most languages enable storing of the IR & IL

– Some do it implicitly (CUDA)

[Diagram: OpenCL C / C for CUDA / Fortran / OpenACC → LLVM* IR → PTX IL → GK110 binary, GF104 binary]

* NVIDIA has “NVVM”, which is LLVM with a set of restrictions

Page 26


Page 27

GPGPU usages

CUDA usages:

• Advanced Graphics

• Game Physics

• Computer Vision

• Cluster / HPC

• Finance

• Scientific

• Media Processing

CUDA Community Showcase: ~900 applications from academia – http://www.nvidia.com/object/cuda-apps-flash-new.html#

Examples: Johannes Gutenberg University Mainz, Imperial College London, UC Davis, California, TU Darmstadt

Page 28

GPGPU Languages

• Welcome to the jungle…


Page 29


Page 32

Vendor overview: Intel

Xeon Phi:

• Accelerator Card

• 5110P

CPU:

• CPU+GPU on the same die

• Haswell Core i7-4xxx

Page 33

Leading Mobile GPU Vendors

Vivante CG4000

• Unified Shaders

• 4 Cores, SIMD4 each

• Supports OpenCL 1.2

• 48 GFLOPS

NVIDIA Tegra 4

• 6 X 4-wide Vertex shaders

• 4 X 4-wide Pixel Shaders

• No GPGPU support

• 74 GFLOPS

ARM Mali T604

• 4 Cores

• Multiple “pipes” per core

• Supports OpenCL 1.1

• 68 GFLOPS

Imagination PowerVR 5xx

• Used by Apple, Samsung, Motorola, Intel

• Unified Shaders

• Supports OpenCL 1.1 EP (543)

• 38 GFLOPS (Apple’s MP4 version)

Qualcomm Adreno 320

• Part of Snapdragon S4

• Unified Shader

• Supports OpenCL 1.1 EP

• 50 GFLOPS