CS 380 - GPU and GPGPU Programming
Lecture 8+9: GPU Architecture 7+8
Markus Hadwiger, KAUST
Reading Assignment #5 (until March 12)
Read (required):
• Programming Massively Parallel Processors book, Chapter 3 (Introduction to CUDA)
• Programming Massively Parallel Processors book, Chapter 4 (CUDA Threads) until (including) 4.3
Read (optional):
• NVIDIA Fermi graphics (GF100) and compute white papers:
http://www.nvidia.com/object/IO_86775.html
http://www.nvidia.com/object/IO_86776.html
• NVIDIA Kepler (GK110) white papers:
http://www.geforce.com/Active/en_US/en_US/pdf/GeForce-GTX-680-Whitepaper-FINAL.pdf
http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
• NVIDIA Maxwell (GM107) white paper:
http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce-GTX-750-Ti-Whitepaper.pdf
SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
From Shader Code to a Teraflop: How Shader Cores Work
Kayvon Fatahalian, Stanford University
My chip!
16 cores
8 mul-add ALUs per core (128 total)
16 simultaneous instruction streams
64 concurrent (but interleaved) instruction streams
512 concurrent fragments
= 256 GFLOPs (@ 1GHz)
My “enthusiast” chip!
32 cores, 16 ALUs per core (512 total) = 1 TFLOP (@ 1 GHz)
NVIDIA G80/GT200 Architecture
• Streaming Processor (SP)
• Streaming Multiprocessor (SM)
• Texture/Processing Cluster (TPC)
Courtesy AnandTech
NVIDIA G80/GT200 Architecture
• G80/G92: 8 TPCs * ( 2 * 8 SPs ) = 128 SPs
• GT200: 10 TPCs * ( 3 * 8 SPs ) = 240 SPs
• Arithmetic intensity has increased (ALUs vs. texture units)
G80 / G92 vs. GT200 (Courtesy AnandTech)
NVIDIA GT200 GPGPU Hardware
NVIDIA Tesla 10-series
• Based on GT200 architecture
• 1 Teraflop / device
• 4GB RAM / device
• Multiple devices per node / machine
Tesla C1060
Tesla S1070
NVIDIA Fermi / GF100 Hardware
Geforce GTX 580
• 512 CUDA cores (16 SMs)
• 1.5 GB memory
Tesla 20-series
• Cards: M2070/C2070, ...
• Blades: S2050/S2070
• 3GB or 6GB / GPU, ECC memory
NVIDIA Fermi / GF100 Features
Names
• Compute: Fermi; product: Tesla-20 series
• Graphics: GF100 (product: Geforce GTX 480, 580, ...)
Compute capability 2.1 / 2.0; PTX ISA 3.0 / 2.x
• http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/ptx_isa_3.0.pdf
• http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/ptx_isa_2.0.pdf
L1 and L2 caches
More CUDA cores (up to 512)
Faster double-precision floating-point performance, faster atomics, floating-point atomics
DirectX 11 and OpenGL 4 functionality
• New shader types, scatter writes to images, ...
NVIDIA Fermi / GF100 Stats
Streaming Multiprocessor
Streaming processors are now CUDA cores
32 CUDA cores per Fermi streaming multiprocessor (SM)
16 SMs = 512 CUDA cores
CPU-like cache hierarchy
• L1 cache / shared memory
• L2 cache
Texture units and caches now in SM (instead of with TPC = multiple SMs in GT200)
Dual Warp Schedulers
Graphics Processor Clusters (GPC)
(instead of TPC on GT200)
4 Streaming Multiprocessors (SMs)
32 CUDA cores / SM
4 SMs / GPC = 128 cores / GPC
Decentralized rasterization and geometry
• 4 raster engines
• 16 "PolyMorph" engines
NVIDIA Fermi / GF100 Structure
Full size
• 4 GPCs
• 4 SMs each
• 6 × 64-bit memory controllers (= 384 bit)
NVIDIA Fermi / GF100 Die
Full size
• 4 GPCs
• 4 SMs each
Compute Capability 2.0
• 1024 threads / block
• More threads / SM
• 32K registers / SM
• New synchronization functions
L1 Cache vs. Shared Memory
Two different configs
• 64KB total
• 16KB shared, 48KB L1 cache
• 48KB shared, 16KB L1 cache
• Set per kernel
Global Memory Access
Cached on Fermi
L1 cache per SM
Global L2 cache
Compile-time flag can choose:
• Caching in both L1 and L2
• Caching only in L2
Cache line size (L1, L2):
• 128 bytes
NVIDIA Kepler Architecture
Two different versions
• GK104, compute capability 3.0
– Geforce GTX 680, …
– Quadro K5000
– Tesla K10 series
• GK110, compute capability 3.5
– Geforce GTX Titan (just released!)
– Tesla K20 series
GF100 Graphics Pipeline
• Input Assembler
• Vertex Shader
• Hull Shader
• Tessellator
• Domain Shader
• Geometry Shader (Stream Output)
• Rasterizer
• Pixel Shader
• Output Merger
NVIDIA Kepler / GK104 Structure
Full size
• 4 GPCs
• 2 SMXs each
= 8 SMXs, 1536 CUDA cores
GK104 SMX
• 192 CUDA cores
• 32 LD/ST units
• 16 SFUs
• 16 texture units
NVIDIA Kepler / GK110 Structure
Full size
• 15 SMXs
• 2880 CUDA cores
GK110 SMX
• 192 CUDA cores
• 64 DP units
• 32 LD/ST units
• 16 SFUs
• 16 texture units
New read-only data cache (48KB)
Compute Capabilities 2.0 – 3.5
Maxwell vs. Kepler Architecture
GM107
Maxwell vs. Kepler Architecture
GK107
vs.
GM107
Thank you.