CS 380 - GPU and GPGPU Programming
Lecture 6: GPU Architecture 5

Markus Hadwiger, KAUST

Reading Assignment #3 (until Feb. 16)

Read (required):
• Programming Massively Parallel Processors book, Chapter 1 (Introduction)

• Programming Massively Parallel Processors book, Appendix B (GPU Compute Capabilities)

• OpenGL 4.0 Shading Language Cookbook, Chapter 2

Read (optional):
• OpenGL 4.0 Shading Language Cookbook, Chapter 1

• GLSL book, Chapter 7 (OpenGL Shading Language API)

NVIDIA G80/GT200 Architecture

• Streaming Processor (SP)

• Streaming Multiprocessor (SM)

• Texture/Processing Cluster (TPC)

Courtesy AnandTech

NVIDIA G80/GT200 Architecture

• G80/G92: 8 TPCs * ( 2 * 8 SPs ) = 128 SPs

• GT200: 10 TPCs * ( 3 * 8 SPs ) = 240 SPs
• Arithmetic intensity has increased (ALUs vs. texture units)

G80 / G92 vs. GT200 (courtesy AnandTech)

Example: GeForce 8

NVIDIA Fermi / GF100 Features

Names

• Compute: Fermi; product: Tesla-20 series

• Graphics: GF100 (product: Geforce GTX 480, 580, ...)

Compute capability 2.1 / 2.0; PTX ISA 3.0 / 2.x
• http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/ptx_isa_3.0.pdf

• http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/ptx_isa_2.0.pdf

L1 and L2 caches

More CUDA cores (up to 512)

Faster double-precision floating-point performance, faster atomics, floating-point atomics

DirectX 11 and OpenGL 4 functionality

• New shader types, scatter writes to images, ...

NVIDIA Fermi / GF100 Stats

Streaming Multiprocessor

Streaming processors are now CUDA cores

32 CUDA cores per Fermi streaming multiprocessor (SM)

16 SMs = 512 CUDA cores

CPU-like cache hierarchy
• L1 cache / shared memory

• L2 cache

Texture units and caches now in SM (instead of with TPC = multiple SMs in GT200)

Dual Warp Schedulers

Graphics Processor Clusters (GPC)

(instead of TPC on GT200)

4 Streaming Multiprocessors (SMs)

32 CUDA cores / SM

4 SMs / GPC = 128 cores / GPC

Decentralized rasterization and geometry

• 4 raster engines

• 16 "PolyMorph" engines

NVIDIA Fermi / GF100 Structure

Full size

• 4 GPCs

• 4 SMs each

• 6 × 64-bit memory controllers (= 384-bit)

NVIDIA Fermi / GF100 Die

Full size

• 4 GPCs

• 4 SMs each

GF100 Graphics Pipeline

• DirectX 11 pipeline stages: Input Assembler → Vertex Shader → Hull Shader → Tessellator → Domain Shader → Geometry Shader (with Stream Output) → Rasterizer → Pixel Shader → Output Merger

Compute Capab. 2.0

• 1024 threads / block

• More threads / SM

• 32K registers / SM

• New synchronization functions
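
One of these new synchronization functions is __syncthreads_count(); a minimal CUDA sketch (kernel and variable names are illustrative only) that also uses the new 1024-thread block size:

    __global__ void countPositive(const float* data, int* blockCounts)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;

        // New in compute capability 2.0: a barrier that also returns how many
        // threads in the block evaluated the predicate as non-zero.
        int numPositive = __syncthreads_count(data[tid] > 0.0f);

        if (threadIdx.x == 0)
            blockCounts[blockIdx.x] = numPositive;
    }

    // Launched with the new 1024-threads-per-block limit (512 before Fermi):
    // countPositive<<<numBlocks, 1024>>>(d_data, d_blockCounts);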

L1 Cache vs. Shared Memory

Two different configs (on Fermi and Kepler; NOT on Maxwell!)
• 64KB total

• 16KB shared, 48KB L1 cache

• 48KB shared, 16KB L1 cache

• Set per kernel
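
The per-kernel setting is made from the host through the CUDA runtime; a minimal sketch, assuming a hypothetical kernel that relies heavily on shared memory:

    #include <cuda_runtime.h>

    __global__ void myKernel(float* data) { /* uses shared memory heavily */ }

    int main()
    {
        // Request the 48KB shared / 16KB L1 split for this kernel.
        // (This is a preference; the driver may still override it.)
        cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);

        // Alternatively, request the 16KB shared / 48KB L1 split:
        // cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);

        myKernel<<<1, 256>>>(nullptr);
        cudaDeviceSynchronize();
        return 0;
    }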

Global Memory Access

Cached on Fermi

L1 cache per SM

Global L2 cache

Compile-time flag can choose (see the example below):
• Caching in both L1 and L2

• Caching only in L2

Cache line size (L1, L2):
• 128 bytes
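
The compile-time choice above is made by passing an option through nvcc to the PTX assembler; a sketch of the usual Fermi-era invocation (the source file name is a placeholder):

    # Cache global memory loads in both L1 and L2 (the default on Fermi):
    nvcc -Xptxas -dlcm=ca kernel.cu -o kernel

    # Cache global memory loads in L2 only (bypass L1):
    nvcc -Xptxas -dlcm=cg kernel.cu -o kernel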

Global Memory Access

CUDA 6.5

Global Memory Access

CUDA 7.0

NVIDIA Kepler Architecture

Three different versions (see the query sketch after this list):
• Compute capability 3.0 (GK104)

– Geforce GTX 680, …
– Quadro K5000
– Tesla K10

• Compute capability 3.5 (GK110)
– Geforce GTX 780 / Titan / Titan Black
– Quadro K6000
– Tesla K20, Tesla K40

• Compute capability 3.7 (GK210)
– Tesla K80
– Very new (~end of 2014)
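
A small host-side sketch using the standard CUDA runtime query to check which of these compute capabilities the installed GPU reports:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // query device 0

        // prop.major / prop.minor give the compute capability,
        // e.g. 3.0 (GK104), 3.5 (GK110), 3.7 (GK210).
        printf("%s: compute capability %d.%d, %d multiprocessors\n",
               prop.name, prop.major, prop.minor, prop.multiProcessorCount);
        return 0;
    }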

GK104 SMX

• 192 CUDA cores

• 32 LD/ST units

• 32 SFUs

• 16 texture units

GK110 SMX

• 192 CUDA cores

• 64 DP units

• 32 LD/ST units

• 32 SFUs

• 16 texture units

New read-only data cache (48KB)
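
A minimal sketch of using this cache from CUDA on compute capability 3.5 and up (kernel and parameter names are illustrative): declaring the input pointer const __restrict__ lets the compiler route loads through the read-only data cache, and __ldg() requests this explicitly.

    __global__ void scaleArray(float* __restrict__ out,
                               const float* __restrict__ in,
                               float s, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = s * __ldg(&in[i]);  // load via the 48KB read-only cache
    }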

NVIDIA Kepler / GK104 Structure

Full size

• 4 GPCs

• 2 SMXs each

= 8 SMXs, 1536 CUDA cores

NVIDIA Kepler / GK110 Structure (1)

Full size

• 15 SMXs (Titan Black; Titan: 14)

• 2880 CUDA cores (Titan Black; Titan: 2688)

• 5 GPCs of 3 SMXs each

NVIDIA Kepler / GK110 Structure (2)

Titan (not Black)

• 14 SMXs

• 2688 CUDA cores

• 5 GPCs with 3 SMXs or 2 SMXs each

Compute Capabilities 2.0 – 3.5

Maxwell (GM) Architecture

Multiprocessor: SMM

4 partitions inside the SMM
• 32 CUDA cores each

• 128 CUDA cores in total

• Each has its own warp scheduler, dispatch units, register file

Shared memory and L1 cache now separate!
• L1 cache is combined with the texture cache

• Shared memory is its own space

Maxwell (GM) Architecture

First gen.: GM107 (GTX 750 Ti), 5 SMMs (640 CUDA cores in total)

Maxwell (GM) Architecture

Second gen.: GM204 (GTX 980), 16 SMMs (2048 CUDA cores in total), 4 GPCs of 4 SMMs

Maxwell (GM) vs. Kepler (GK) Architecture

GK107 vs. GM107

Maxwell (GM) vs. Kepler (GK) Architecture

GK107 vs. GM204

Compute Capab. 5.x (Part 1)

Maxwell
• GM107: 5.0

• GM204: 5.2

Compute Capab. 5.x (Part 2)

Maxwell
• GM107: 5.0

• GM204: 5.2

Thank you.
