CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8 Markus Hadwiger, KAUST

Post on 25-Nov-2020


Page 1: CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture

CS 380 - GPU and GPGPU Programming
Lecture 8+9: GPU Architecture 7+8

Markus Hadwiger, KAUST


Reading Assignment #5 (until March 12)

Read (required):

• Programming Massively Parallel Processors book, Chapter 3 (Introduction to CUDA)

• Programming Massively Parallel Processors book, Chapter 4 (CUDA Threads), up to and including Section 4.3

Read (optional):

• NVIDIA Fermi graphics (GF100) and compute white papers:

http://www.nvidia.com/object/IO_86775.html

http://www.nvidia.com/object/IO_86776.html

• NVIDIA Kepler (GK110) white papers:

http://www.geforce.com/Active/en_US/en_US/pdf/GeForce-GTX-680-Whitepaper-FINAL.pdf

http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf

• NVIDIA Maxwell (GM107) white paper:

http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce-GTX-750-Ti-Whitepaper.pdf


SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

From Shader Code to a Teraflop: How Shader Cores Work

Kayvon Fatahalian, Stanford University


My chip!

16 cores

8 mul-add ALUs per core (128 total)

16 simultaneous instruction streams

64 concurrent (but interleaved) instruction streams

512 concurrent fragments

= 256 GFLOPs (@ 1 GHz)


My “enthusiast” chip!

32 cores, 16 ALUs per core (512 total) = 1 TFLOP (@ 1 GHz)
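
The peak figures quoted for the two hypothetical chips follow from one product. The Python sketch below reproduces them, assuming that each mul-add counts as two floating-point operations per ALU per clock; the function name `peak_gflops` and its parameters are made up for this illustration and do not come from the slides.

```python
def peak_gflops(cores, alus_per_core, clock_ghz, flops_per_alu_per_clock=2):
    """Peak throughput in GFLOPS.

    Assumption: every ALU retires one mul-add (counted as 2 flops) per clock.
    """
    return cores * alus_per_core * flops_per_alu_per_clock * clock_ghz


# "My chip": 16 cores x 8 ALUs at 1 GHz
print(peak_gflops(16, 8, 1.0))    # 256.0

# "Enthusiast" chip: 32 cores x 16 ALUs at 1 GHz (about 1 TFLOP)
print(peak_gflops(32, 16, 1.0))   # 1024.0
```

These are upper bounds only: they are reached when every ALU issues a mul-add on every cycle, which real shader code rarely sustains.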


KAUST King Abdullah University of Science and Technology


NVIDIA G80/GT200 Architecture

• Streaming Processor (SP)

• Streaming Multiprocessor (SM)

• Texture/Processing Cluster (TPC)

Courtesy AnandTech


NVIDIA G80/GT200 Architecture

• G80/G92: 8 TPCs * ( 2 * 8 SPs ) = 128 SPs

• GT200: 10 TPCs * ( 3 * 8 SPs ) = 240 SPs

• Arithmetic intensity has increased (ALUs vs. texture units)

Figure: G80 / G92 vs. GT200 (courtesy AnandTech)
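
The SP totals above can be checked with a short Python sketch. It reads "( 2 * 8 SPs )" as 2 SMs per TPC with 8 SPs per SM, which is an interpretation of the slide; the helper name `total_sps` is hypothetical.

```python
def total_sps(tpcs, sms_per_tpc, sps_per_sm=8):
    """Total streaming processors: TPCs x SMs per TPC x SPs per SM."""
    return tpcs * sms_per_tpc * sps_per_sm


g80 = total_sps(tpcs=8, sms_per_tpc=2)
gt200 = total_sps(tpcs=10, sms_per_tpc=3)
print(g80, gt200)          # 128 240

# SPs per TPC grow from 16 to 24, i.e. more ALUs per cluster.
print(g80 // 8, gt200 // 10)   # 16 24
```

The second line of output is one way to read the "arithmetic intensity" bullet: if the texture hardware per TPC stays roughly the same (an assumption here, not a figure from the slide), then the ratio of ALUs to texture units rises by the same factor of 1.5.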


NVIDIA GT200 GPGPU Hardware

NVIDIA Tesla 10-series

• Based on GT200 architecture

• 1 Teraflop / device

• 4GB RAM / device

• Multiple devices per node / machine

Tesla C1060

Tesla S1070


NVIDIA Fermi / GF100 Hardware

Geforce GTX 580

• 512 CUDA cores (16 SMs)

• 1.5 GB memory

Tesla 20-series

• Cards: M2070/C2070, ...

• Blades: S2050/S2070

• 3GB or 6GB / GPU, ECC memory


NVIDIA Fermi / GF100 Features

Names

• Compute: Fermi; product: Tesla-20 series

• Graphics: GF100 (product: Geforce GTX 480, 580, ...)

Compute capability 2.1 / 2.0; PTX ISA 3.0 / 2.x

• http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/ptx_isa_3.0.pdf

• http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/ptx_isa_2.0.pdf

L1 and L2 caches

More CUDA cores (up to 512)

Faster double precision float performance, faster atomics, float atomics

DirectX 11 and OpenGL 4 functionality

• New shader types, scatter writes to images, ...


NVIDIA Fermi / GF100 Stats


Streaming Multiprocessor

Streaming processors are now CUDA cores

32 CUDA cores per Fermi streaming multiprocessor (SM)

16 SMs = 512 CUDA cores

CPU-like cache hierarchy

• L1 cache / shared memory

• L2 cache

Texture units and caches now in SM (instead of with TPC = multiple SMs in GT200)


Dual Warp Schedulers


Graphics Processor Clusters (GPC)

(instead of TPC on GT200)

4 Streaming Multiprocessors (SMs) per GPC

32 CUDA cores / SM

4 SMs / GPC = 128 cores / GPC

Decentralized rasterization and geometry

• 4 raster engines

• 16 “PolyMorph” engines


NVIDIA Fermi / GF100 Structure

Full size

• 4 GPCs

• 4 SMs each

• 6 64-bit memory controllers (= 384 bit)
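
The full-size GF100 numbers compose as a simple hierarchy. The Python sketch below multiplies them out and, as an extra step that is not on the slide, turns the 384-bit bus into a bandwidth figure for an assumed memory data rate; `bandwidth_gb_s` and the 4.0 GT/s value are illustrative placeholders, not specifications quoted in the lecture.

```python
GPCS = 4                  # graphics processor clusters (full-size GF100)
SMS_PER_GPC = 4           # streaming multiprocessors per GPC
CORES_PER_SM = 32         # CUDA cores per Fermi SM
MEM_CONTROLLERS = 6       # memory controllers
BITS_PER_CONTROLLER = 64  # width of each controller in bits

sms = GPCS * SMS_PER_GPC
cores = sms * CORES_PER_SM
bus_width_bits = MEM_CONTROLLERS * BITS_PER_CONTROLLER
print(sms, cores, bus_width_bits)   # 16 512 384


def bandwidth_gb_s(bus_bits, data_rate_gtps):
    """Peak memory bandwidth in GB/s for a given effective data rate.

    data_rate_gtps is an assumed value (giga-transfers per second per pin).
    """
    return bus_bits / 8 * data_rate_gtps


print(bandwidth_gb_s(bus_width_bits, 4.0))   # 192.0
```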


NVIDIA Fermi / GF100 Die

Full size

• 4 GPCs

• 4 SMs each


Compute Capability 2.0

• 1024 threads / block

• More threads / SM

• 32K registers / SM

• New synchronization functions
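
The register file and thread limits interact: the more threads are resident on an SM, the fewer registers each one can use. The Python sketch below shows that trade-off under simplifying assumptions. The value of 1536 resident threads per SM is the compute capability 2.0 limit from the CUDA documentation rather than from the slide, and the helper `max_regs_per_thread` ignores allocation granularity and per-thread caps.

```python
REGS_PER_SM = 32 * 1024   # "32K registers / SM" from the slide


def max_regs_per_thread(regs_per_sm, resident_threads):
    """Registers available per thread if the register file is split evenly.

    Simplification: ignores allocation granularity and per-thread limits.
    """
    return regs_per_sm // resident_threads


# One full block of 1024 threads resident on the SM
print(max_regs_per_thread(REGS_PER_SM, 1024))   # 32

# SM filled to 1536 resident threads (assumed CC 2.0 limit)
print(max_regs_per_thread(REGS_PER_SM, 1536))   # 21
```

A kernel that needs more registers per thread than this budget cannot keep the SM fully populated, which reduces the number of warps available to hide latency.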


L1 Cache vs. Shared Memory

Two different configs

• 64KB total

• 16KB shared, 48KB L1 cache

• 48KB shared, 16KB L1 cache

• Set per kernel
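
One way to see why the split matters is to count how many thread blocks fit on an SM when shared memory is the limiting resource. The Python sketch below does this for a hypothetical kernel that uses 12 KB of shared memory per block; the kernel, the 12 KB figure, and the helper name are assumptions for illustration only.

```python
def blocks_limited_by_shared(shared_kb_per_sm, shared_kb_per_block):
    """Blocks per SM when only shared memory capacity is considered.

    Other limits (registers, threads, maximum blocks per SM) are ignored.
    """
    return shared_kb_per_sm // shared_kb_per_block


KERNEL_SHARED_KB = 12   # hypothetical per-block shared memory usage

for shared_kb in (16, 48):
    l1_kb = 64 - shared_kb
    blocks = blocks_limited_by_shared(shared_kb, KERNEL_SHARED_KB)
    print(f"{shared_kb} KB shared / {l1_kb} KB L1 -> {blocks} block(s) per SM")
# 16 KB shared / 48 KB L1 -> 1 block(s) per SM
# 48 KB shared / 16 KB L1 -> 4 block(s) per SM
```

A kernel that uses little shared memory but has irregular global memory accesses would likely prefer the opposite split. In CUDA the preference is requested per kernel through the runtime call `cudaFuncSetCacheConfig`.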


Global Memory Access

Cached on Fermi

L1 cache per SM

Global L2 cache

Compile time flag can choose:

• Caching in both L1 and L2

• Caching only in L2

Cache line size (L1, L2):

• 128 bytes
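
The 128-byte line size determines how many memory transactions a group of threads generates. The Python sketch below counts the distinct cache lines touched by 32 threads (one warp, a CUDA convention assumed here) that each load one 4-byte element with a given stride. The function `cache_lines_touched` and its parameters are hypothetical and model only address arithmetic, not the actual hardware.

```python
def cache_lines_touched(stride_elems, elem_bytes=4, threads=32,
                        line_bytes=128, base=0):
    """Count distinct cache lines hit when thread t reads
    byte address base + t * stride_elems * elem_bytes."""
    lines = {
        (base + t * stride_elems * elem_bytes) // line_bytes
        for t in range(threads)
    }
    return len(lines)


print(cache_lines_touched(1))           # 1  (contiguous, aligned)
print(cache_lines_touched(2))           # 2  (every other element)
print(cache_lines_touched(32))          # 32 (one line per thread)
print(cache_lines_touched(1, base=4))   # 2  (contiguous but misaligned)
```

Contiguous, aligned accesses fill exactly one line, while a stride of 32 elements fetches 32 lines to use 4 bytes from each.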


NVIDIA Kepler Architecture

Two different versions

• GK104, compute capability 3.0

– Geforce GTX 680, …

– Quadro K5000

– Tesla K10 series

• GK110, compute capability 3.5

– Geforce GTX Titan (just released!)

– Tesla K20 series


GF100 Graphics Pipeline

Pipeline stages (Direct3D 11 order):

Input Assembler

Vertex Shader

Hull Shader

Tessellator

Domain Shader

Geometry Shader

Stream Output

Rasterizer

Pixel Shader

Output Merger


NVIDIA Kepler / GK104 Structure

Full size

• 4 GPCs

• 2 SMXs each

= 8 SMXs, 1536 CUDA cores


GK104 SMX

• 192 CUDA cores

• 32 LD/ST units

• 16 SFUs

• 16 texture units


NVIDIA Kepler / GK110 Structure

Full size

• 15 SMXs

• 2880 CUDA cores


GK110 SMX

• 192 CUDA cores

• 64 DP units

• 32 LD/ST units

• 16 SFUs

• 16 texture units

New read-only data cache (48KB)
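
The Kepler core totals follow directly from the per-SMX unit counts. The Python sketch below multiplies them out for GK104 (8 SMXs) and GK110 (15 SMXs) and derives the ratio of double-precision units to CUDA cores on a GK110 SMX; the constant and variable names are chosen for this illustration.

```python
from fractions import Fraction

CORES_PER_SMX = 192      # CUDA cores per SMX (GK104 and GK110)
DP_UNITS_PER_SMX = 64    # double-precision units per GK110 SMX

gk104_cores = 8 * CORES_PER_SMX
gk110_cores = 15 * CORES_PER_SMX
gk110_dp_units = 15 * DP_UNITS_PER_SMX
print(gk104_cores, gk110_cores, gk110_dp_units)   # 1536 2880 960

# Ratio of DP units to single-precision CUDA cores per GK110 SMX
dp_ratio = Fraction(DP_UNITS_PER_SMX, CORES_PER_SMX)
print(dp_ratio)   # 1/3
```

If each unit completes one operation per clock (an assumption), the 1/3 ratio suggests that peak double-precision throughput on GK110 is about one third of its single-precision peak.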


Compute Capabilities 2.0 – 3.5


Maxwell vs. Kepler Architecture

GM107


Maxwell vs. Kepler Architecture

GK107

vs.

GM107


Thank you.