![Page 1: CS 380 - GPU and GPGPU Programming Lecture 6: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · Multiprocessor: SMM 4 partitions inside the SMM • 32](https://reader034.vdocument.in/reader034/viewer/2022051812/602eff0fc2a6144dfe1c7a03/html5/thumbnails/1.jpg)
CS 380 - GPU and GPGPU Programming
Lecture 6: GPU Architecture 5
Markus Hadwiger, KAUST
![Page 2: CS 380 - GPU and GPGPU Programming Lecture 6: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · Multiprocessor: SMM 4 partitions inside the SMM • 32](https://reader034.vdocument.in/reader034/viewer/2022051812/602eff0fc2a6144dfe1c7a03/html5/thumbnails/2.jpg)
Reading Assignment #3 (until Feb. 16)
Read (required):
• Programming Massively Parallel Processors book, Chapter 1 (Introduction)
• Programming Massively Parallel Processors book, Appendix B (GPU Compute Capabilities)
• OpenGL 4.0 Shading Language Cookbook, Chapter 2
Read (optional):
• OpenGL 4.0 Shading Language Cookbook, Chapter 1
• GLSL book, Chapter 7 (OpenGL Shading Language API)
![Page 3: CS 380 - GPU and GPGPU Programming Lecture 6: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · Multiprocessor: SMM 4 partitions inside the SMM • 32](https://reader034.vdocument.in/reader034/viewer/2022051812/602eff0fc2a6144dfe1c7a03/html5/thumbnails/3.jpg)
NVIDIA G80/GT200 Architecture
• Streaming Processor (SP)
• Streaming Multiprocessor (SM)
• Texture/Processing Cluster (TPC)
Courtesy AnandTech
![Page 4: CS 380 - GPU and GPGPU Programming Lecture 6: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · Multiprocessor: SMM 4 partitions inside the SMM • 32](https://reader034.vdocument.in/reader034/viewer/2022051812/602eff0fc2a6144dfe1c7a03/html5/thumbnails/4.jpg)
NVIDIA G80/GT200 Architecture
• G80/G92: 8 TPCs * ( 2 * 8 SPs ) = 128 SPs
• GT200: 10 TPCs * ( 3 * 8 SPs ) = 240 SPs
• Arithmetic intensity has increased: more ALUs per texture unit (3 SMs per TPC instead of 2)
G80 / G92 and GT200. Courtesy AnandTech
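The SP totals on this slide follow directly from the hierarchy (TPCs, SMs per TPC, SPs per SM). A minimal Python sketch of that arithmetic; the function name `total_sps` is illustrative and not part of any NVIDIA tool:

```python
def total_sps(tpcs: int, sms_per_tpc: int, sps_per_sm: int = 8) -> int:
    """Total streaming processors of a G80/GT200-style chip."""
    return tpcs * sms_per_tpc * sps_per_sm


g80 = total_sps(tpcs=8, sms_per_tpc=2)     # G80 / G92
gt200 = total_sps(tpcs=10, sms_per_tpc=3)  # GT200
print(g80, gt200)  # 128 240
```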
![Page 5: CS 380 - GPU and GPGPU Programming Lecture 6: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · Multiprocessor: SMM 4 partitions inside the SMM • 32](https://reader034.vdocument.in/reader034/viewer/2022051812/602eff0fc2a6144dfe1c7a03/html5/thumbnails/5.jpg)
Example: GeForce 8
![Page 6: CS 380 - GPU and GPGPU Programming Lecture 6: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · Multiprocessor: SMM 4 partitions inside the SMM • 32](https://reader034.vdocument.in/reader034/viewer/2022051812/602eff0fc2a6144dfe1c7a03/html5/thumbnails/6.jpg)
NVIDIA Fermi / GF100 Features
Names
• Compute: Fermi; product: Tesla-20 series
• Graphics: GF100 (product: Geforce GTX 480, 580, ...)
Compute capability 2.1 / 2.0; PTX ISA 3.0 / 2.x
• http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/ptx_isa_3.0.pdf
• http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/ptx_isa_2.0.pdf
L1 and L2 caches
More CUDA cores (up to 512)
Faster double precision float performance, faster atomics, float atomics
DirectX 11 and OpenGL 4 functionality
• New shader types, scatter writes to images, ...
![Page 7: CS 380 - GPU and GPGPU Programming Lecture 6: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · Multiprocessor: SMM 4 partitions inside the SMM • 32](https://reader034.vdocument.in/reader034/viewer/2022051812/602eff0fc2a6144dfe1c7a03/html5/thumbnails/7.jpg)
NVIDIA Fermi / GF100 Stats
![Page 8: CS 380 - GPU and GPGPU Programming Lecture 6: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · Multiprocessor: SMM 4 partitions inside the SMM • 32](https://reader034.vdocument.in/reader034/viewer/2022051812/602eff0fc2a6144dfe1c7a03/html5/thumbnails/8.jpg)
Streaming Multiprocessor
Streaming processors are now called CUDA cores
32 CUDA cores per Fermi streaming multiprocessor (SM)
16 SMs = 512 CUDA cores
CPU-like cache hierarchy
• L1 cache / shared memory
• L2 cache
Texture units and caches are now in the SM (instead of in the TPC, which spans multiple SMs on GT200)
![Page 9: CS 380 - GPU and GPGPU Programming Lecture 6: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · Multiprocessor: SMM 4 partitions inside the SMM • 32](https://reader034.vdocument.in/reader034/viewer/2022051812/602eff0fc2a6144dfe1c7a03/html5/thumbnails/9.jpg)
Dual Warp Schedulers
![Page 10: CS 380 - GPU and GPGPU Programming Lecture 6: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · Multiprocessor: SMM 4 partitions inside the SMM • 32](https://reader034.vdocument.in/reader034/viewer/2022051812/602eff0fc2a6144dfe1c7a03/html5/thumbnails/10.jpg)
Graphics Processor Clusters (GPC)
(instead of TPC on GT200)
4 streaming multiprocessors (SMs) per GPC
32 CUDA cores / SM
4 SMs / GPC = 128 cores / GPC
Decentralized rasterization and geometry
• 4 raster engines
• 16 "PolyMorph" engines
![Page 11: CS 380 - GPU and GPGPU Programming Lecture 6: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · Multiprocessor: SMM 4 partitions inside the SMM • 32](https://reader034.vdocument.in/reader034/viewer/2022051812/602eff0fc2a6144dfe1c7a03/html5/thumbnails/11.jpg)
NVIDIA Fermi / GF100 Structure
Full size
• 4 GPCs
• 4 SMs each
• 6 64-bit memory controllers (= 384 bit)
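The full-size GF100 numbers can be cross-checked with a few multiplications. The sketch below uses only values from the slides (4 GPCs, 4 SMs per GPC, 32 CUDA cores per SM, six 64-bit memory controllers); the constant names are illustrative:

```python
GPCS = 4
SMS_PER_GPC = 4
CORES_PER_SM = 32
MEM_CONTROLLERS = 6
CONTROLLER_WIDTH_BITS = 64

sms = GPCS * SMS_PER_GPC                                  # SMs on the full chip
cuda_cores = sms * CORES_PER_SM                           # CUDA cores on the full chip
bus_width_bits = MEM_CONTROLLERS * CONTROLLER_WIDTH_BITS  # aggregate memory interface
bus_width_bytes = bus_width_bits // 8                     # bytes moved per transfer

print(sms, cuda_cores, bus_width_bits, bus_width_bytes)  # 16 512 384 48
```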
![Page 12: CS 380 - GPU and GPGPU Programming Lecture 6: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · Multiprocessor: SMM 4 partitions inside the SMM • 32](https://reader034.vdocument.in/reader034/viewer/2022051812/602eff0fc2a6144dfe1c7a03/html5/thumbnails/12.jpg)
NVIDIA Fermi / GF100 Die
Full size
• 4 GPCs
• 4 SMs each
![Page 13: CS 380 - GPU and GPGPU Programming Lecture 6: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · Multiprocessor: SMM 4 partitions inside the SMM • 32](https://reader034.vdocument.in/reader034/viewer/2022051812/602eff0fc2a6144dfe1c7a03/html5/thumbnails/13.jpg)
GF100 Graphics Pipeline
Pipeline stages (Direct3D 11 order):
• Input Assembler
• Vertex Shader
• Hull Shader
• Tessellator
• Domain Shader
• Geometry Shader
• Stream Output
• Rasterizer
• Pixel Shader
• Output Merger
![Page 14: CS 380 - GPU and GPGPU Programming Lecture 6: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · Multiprocessor: SMM 4 partitions inside the SMM • 32](https://reader034.vdocument.in/reader034/viewer/2022051812/602eff0fc2a6144dfe1c7a03/html5/thumbnails/14.jpg)
Compute Capab. 2.0
• 1024 threads / block
• More threads / SM
• 32K registers / SM
• New synchronization functions
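The register file size and the block-size limit together bound how many thread blocks can be resident on one SM. The following is a rough sketch under simplifying assumptions: it considers only the two limits on this slide (32K registers per SM, 1024 threads per block) and ignores register allocation granularity as well as the other per-SM limits (resident threads, resident blocks, shared memory). The function name is hypothetical.

```python
REGISTERS_PER_SM = 32 * 1024   # from the slide: 32K registers / SM
MAX_THREADS_PER_BLOCK = 1024   # from the slide: 1024 threads / block


def blocks_per_sm_by_registers(regs_per_thread: int, threads_per_block: int) -> int:
    """Upper bound on resident blocks per SM, looking at registers only."""
    if not 1 <= threads_per_block <= MAX_THREADS_PER_BLOCK:
        raise ValueError("block size outside the compute capability 2.0 limit")
    regs_per_block = regs_per_thread * threads_per_block
    return REGISTERS_PER_SM // regs_per_block


print(blocks_per_sm_by_registers(32, 256))   # 4
print(blocks_per_sm_by_registers(20, 1024))  # 1
print(blocks_per_sm_by_registers(63, 1024))  # 0: such a block does not fit at all
```

A result of 0 means that a single block would already need more registers than the SM provides, so either the block size or the per-thread register use has to shrink.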
![Page 15: CS 380 - GPU and GPGPU Programming Lecture 6: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · Multiprocessor: SMM 4 partitions inside the SMM • 32](https://reader034.vdocument.in/reader034/viewer/2022051812/602eff0fc2a6144dfe1c7a03/html5/thumbnails/15.jpg)
L1 Cache vs. Shared Memory
Two different configs (on Fermi and Kepler; NOT on Maxwell!)
• 64KB total
• 16KB shared, 48KB L1 cache
• 48KB shared, 16KB L1 cache
• Set per kernel
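In CUDA the split is requested per kernel through the runtime call `cudaFuncSetCacheConfig`, with preferences such as `cudaFuncCachePreferShared` and `cudaFuncCachePreferL1`. The Python sketch below only models the decision between the two configurations listed above. The selection policy (prefer L1 whenever 16KB of shared memory is enough) is an assumption made for illustration, not a rule from the lecture, and the names `CONFIGS` and `pick_config` are hypothetical.

```python
# The two 64KB splits listed on the slide (sizes in KB).
CONFIGS = {
    "prefer_shared": {"shared_kb": 48, "l1_kb": 16},
    "prefer_l1": {"shared_kb": 16, "l1_kb": 48},
}


def pick_config(shared_kb_needed: float) -> str:
    """Pick a split for a kernel that needs `shared_kb_needed` KB of shared memory."""
    if shared_kb_needed > CONFIGS["prefer_shared"]["shared_kb"]:
        raise ValueError("kernel needs more shared memory than either split offers")
    if shared_kb_needed > CONFIGS["prefer_l1"]["shared_kb"]:
        return "prefer_shared"
    # Assumed policy: when 16KB of shared memory suffices, give the rest to L1.
    return "prefer_l1"


print(pick_config(8))   # prefer_l1
print(pick_config(40))  # prefer_shared
```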
![Page 16: CS 380 - GPU and GPGPU Programming Lecture 6: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · Multiprocessor: SMM 4 partitions inside the SMM • 32](https://reader034.vdocument.in/reader034/viewer/2022051812/602eff0fc2a6144dfe1c7a03/html5/thumbnails/16.jpg)
Global Memory Access
Cached on Fermi
L1 cache per SM
Global L2 cache
Compile-time flag can choose:
• Caching in both L1 and L2
• Caching only in L2
Cache line size (L1, L2):
• 128 bytes
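With 128-byte cache lines, the cost of a warp-wide load depends on how many distinct lines the 32 threads touch. The sketch below counts those lines for a simple strided access pattern. The warp size of 32 is not stated on this slide, and the function is a simplified model for illustration, not a description of the actual memory pipeline.

```python
CACHE_LINE_BYTES = 128  # from the slide
WARP_SIZE = 32          # threads per warp (not stated on this slide)


def cache_lines_touched(base_addr: int, stride_bytes: int, access_bytes: int = 4) -> int:
    """Number of distinct 128-byte lines touched when every thread of a warp
    reads `access_bytes` bytes at base_addr + lane * stride_bytes."""
    lines = set()
    for lane in range(WARP_SIZE):
        first = base_addr + lane * stride_bytes
        last = first + access_bytes - 1
        lines.update(range(first // CACHE_LINE_BYTES, last // CACHE_LINE_BYTES + 1))
    return len(lines)


print(cache_lines_touched(0, 4))    # 1: aligned, consecutive 4-byte words
print(cache_lines_touched(64, 4))   # 2: same pattern, misaligned by 64 bytes
print(cache_lines_touched(0, 128))  # 32: one line per thread
```

The aligned, unit-stride case fits exactly into one line (32 threads * 4 bytes = 128 bytes), which is why that access pattern is the one to aim for.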
![Page 17: CS 380 - GPU and GPGPU Programming Lecture 6: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · Multiprocessor: SMM 4 partitions inside the SMM • 32](https://reader034.vdocument.in/reader034/viewer/2022051812/602eff0fc2a6144dfe1c7a03/html5/thumbnails/17.jpg)
Global Memory Access
CUDA 6.5
![Page 18: CS 380 - GPU and GPGPU Programming Lecture 6: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · Multiprocessor: SMM 4 partitions inside the SMM • 32](https://reader034.vdocument.in/reader034/viewer/2022051812/602eff0fc2a6144dfe1c7a03/html5/thumbnails/18.jpg)
Global Memory Access
CUDA 7.0
![Page 19: CS 380 - GPU and GPGPU Programming Lecture 6: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · Multiprocessor: SMM 4 partitions inside the SMM • 32](https://reader034.vdocument.in/reader034/viewer/2022051812/602eff0fc2a6144dfe1c7a03/html5/thumbnails/19.jpg)
NVIDIA Kepler Architecture
Three different versions
• Compute capability 3.0 (GK104)
  – Geforce GTX 680, …
  – Quadro K5000
  – Tesla K10
• Compute capability 3.5 (GK110)
  – Geforce GTX 780 / Titan / Titan Black
  – Quadro K6000
  – Tesla K20, Tesla K40
• Compute capability 3.7 (GK210)
  – Tesla K80
  – Very new (~end of 2014)
![Page 20: CS 380 - GPU and GPGPU Programming Lecture 6: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · Multiprocessor: SMM 4 partitions inside the SMM • 32](https://reader034.vdocument.in/reader034/viewer/2022051812/602eff0fc2a6144dfe1c7a03/html5/thumbnails/20.jpg)
GK104 SMX
• 192 CUDA cores
• 32 LD/ST units
• 16 SFUs
• 16 texture units
![Page 21: CS 380 - GPU and GPGPU Programming Lecture 6: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · Multiprocessor: SMM 4 partitions inside the SMM • 32](https://reader034.vdocument.in/reader034/viewer/2022051812/602eff0fc2a6144dfe1c7a03/html5/thumbnails/21.jpg)
KAUST King Abdullah University of Science and Technology
![Page 22: CS 380 - GPU and GPGPU Programming Lecture 6: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · Multiprocessor: SMM 4 partitions inside the SMM • 32](https://reader034.vdocument.in/reader034/viewer/2022051812/602eff0fc2a6144dfe1c7a03/html5/thumbnails/22.jpg)
GK110 SMX
• 192 CUDA cores
• 64 DP units
• 32 LD/ST units
• 16 SFUs
• 16 texture units
New read-only data cache (48KB)
![Page 23: CS 380 - GPU and GPGPU Programming Lecture 6: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · Multiprocessor: SMM 4 partitions inside the SMM • 32](https://reader034.vdocument.in/reader034/viewer/2022051812/602eff0fc2a6144dfe1c7a03/html5/thumbnails/23.jpg)
NVIDIA Kepler / GK104 Structure
Full size
• 4 GPCs
• 2 SMXs each
= 8 SMXs, 1536 CUDA cores
![Page 24: CS 380 - GPU and GPGPU Programming Lecture 6: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · Multiprocessor: SMM 4 partitions inside the SMM • 32](https://reader034.vdocument.in/reader034/viewer/2022051812/602eff0fc2a6144dfe1c7a03/html5/thumbnails/24.jpg)
NVIDIA Kepler / GK110 Structure (1)
Full size
• 15 SMXs (Titan Black; Titan: 14)
• 2880 CUDA cores (Titan Black; Titan: 2688)
• 5 GPCs of 3 SMXs each
![Page 25: CS 380 - GPU and GPGPU Programming Lecture 6: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · Multiprocessor: SMM 4 partitions inside the SMM • 32](https://reader034.vdocument.in/reader034/viewer/2022051812/602eff0fc2a6144dfe1c7a03/html5/thumbnails/25.jpg)
NVIDIA Kepler / GK110 Structure (2)
Titan (not Black)
• 14 SMXs
• 2688 CUDA cores
• 5 GPCs with 3 SMXs or 2 SMXs each
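The Kepler core counts follow from the number of SMXs per GPC and the 192 CUDA cores per SMX. A short sketch of that bookkeeping: the per-GPC lists are illustrative. For the Titan, 14 SMXs in 5 GPCs of 3 or 2 SMXs implies four GPCs with 3 and one with 2, but which GPC is the reduced one is not specified here.

```python
CORES_PER_SMX = 192  # from the SMX slides


def cuda_cores(smx_per_gpc: list[int]) -> int:
    """Total CUDA cores for a Kepler chip, given the SMX count of each GPC."""
    return sum(smx_per_gpc) * CORES_PER_SMX


gk104 = cuda_cores([2, 2, 2, 2])        # 4 GPCs of 2 SMXs
titan_black = cuda_cores([3] * 5)       # full GK110: 5 GPCs of 3 SMXs
titan = cuda_cores([3, 3, 3, 3, 2])     # one SMX disabled (position assumed)

print(gk104, titan_black, titan)  # 1536 2880 2688
```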
![Page 26: CS 380 - GPU and GPGPU Programming Lecture 6: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · Multiprocessor: SMM 4 partitions inside the SMM • 32](https://reader034.vdocument.in/reader034/viewer/2022051812/602eff0fc2a6144dfe1c7a03/html5/thumbnails/26.jpg)
Compute Capabilities 2.0 – 3.5
![Page 27: CS 380 - GPU and GPGPU Programming Lecture 6: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · Multiprocessor: SMM 4 partitions inside the SMM • 32](https://reader034.vdocument.in/reader034/viewer/2022051812/602eff0fc2a6144dfe1c7a03/html5/thumbnails/27.jpg)
Maxwell (GM) Architecture
Multiprocessor: SMM
4 partitions inside the SMM
• 32 CUDA cores each
• 128 CUDA cores in total
• Each has its own warp scheduler, dispatch units, register file
Shared memory and L1 cache are now separate!
• L1 cache is shared with the texture cache
• Shared memory is its own space
![Page 28: CS 380 - GPU and GPGPU Programming Lecture 6: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · Multiprocessor: SMM 4 partitions inside the SMM • 32](https://reader034.vdocument.in/reader034/viewer/2022051812/602eff0fc2a6144dfe1c7a03/html5/thumbnails/28.jpg)
Maxwell (GM) Architecture
First gen.
GM107
(GTX 750Ti)
5 SMMs
(640 CUDA cores in total)
![Page 29: CS 380 - GPU and GPGPU Programming Lecture 6: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · Multiprocessor: SMM 4 partitions inside the SMM • 32](https://reader034.vdocument.in/reader034/viewer/2022051812/602eff0fc2a6144dfe1c7a03/html5/thumbnails/29.jpg)
Maxwell (GM) Architecture
Second gen.
GM204
(GTX 980)
16 SMMs
(2048 CUDA cores in total)
4 GPCs of 4 SMMs
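Because each SMM partition has its own warp scheduler, the partition count also gives the number of warp schedulers per chip. The sketch below derives cores and schedulers for the two Maxwell chips from the slide values (4 partitions per SMM, 32 CUDA cores per partition); the function name `maxwell_totals` is illustrative.

```python
PARTITIONS_PER_SMM = 4
CORES_PER_PARTITION = 32


def maxwell_totals(smms: int) -> tuple[int, int]:
    """Return (CUDA cores, warp schedulers) for a Maxwell chip with `smms` SMMs."""
    partitions = smms * PARTITIONS_PER_SMM   # one warp scheduler per partition
    return partitions * CORES_PER_PARTITION, partitions


print(maxwell_totals(5))      # GM107: (640, 20)
print(maxwell_totals(4 * 4))  # GM204, 4 GPCs of 4 SMMs: (2048, 64)
```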
![Page 30: CS 380 - GPU and GPGPU Programming Lecture 6: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · Multiprocessor: SMM 4 partitions inside the SMM • 32](https://reader034.vdocument.in/reader034/viewer/2022051812/602eff0fc2a6144dfe1c7a03/html5/thumbnails/30.jpg)
Maxwell (GM) vs. Kepler (GK) Architecture
GK107 vs. GM107
![Page 31: CS 380 - GPU and GPGPU Programming Lecture 6: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · Multiprocessor: SMM 4 partitions inside the SMM • 32](https://reader034.vdocument.in/reader034/viewer/2022051812/602eff0fc2a6144dfe1c7a03/html5/thumbnails/31.jpg)
Maxwell (GM) vs. Kepler (GK) Architecture
GK107 vs. GM204
![Page 32: CS 380 - GPU and GPGPU Programming Lecture 6: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · Multiprocessor: SMM 4 partitions inside the SMM • 32](https://reader034.vdocument.in/reader034/viewer/2022051812/602eff0fc2a6144dfe1c7a03/html5/thumbnails/32.jpg)
Compute Capab. 5.x (Part 1)
Maxwell
• GM107: 5.0
• GM204: 5.2
![Page 33: CS 380 - GPU and GPGPU Programming Lecture 6: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · Multiprocessor: SMM 4 partitions inside the SMM • 32](https://reader034.vdocument.in/reader034/viewer/2022051812/602eff0fc2a6144dfe1c7a03/html5/thumbnails/33.jpg)
Compute Capab. 5.x (Part 2)
Maxwell
• GM107: 5.0
• GM204: 5.2
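When code has to check whether a chip meets a minimum compute capability, comparing (major, minor) tuples avoids the pitfalls of treating "3.5" as a floating-point number. A small sketch using only the Kepler and Maxwell chip-to-capability pairs from these slides; the table and the helper `supports` are hypothetical and not a CUDA API.

```python
# Chip -> compute capability as (major, minor), taken from the slides.
COMPUTE_CAPABILITY = {
    "GK104": (3, 0),
    "GK110": (3, 5),
    "GK210": (3, 7),
    "GM107": (5, 0),
    "GM204": (5, 2),
}


def supports(chip: str, required: tuple[int, int]) -> bool:
    """True if `chip` has at least the required compute capability."""
    return COMPUTE_CAPABILITY[chip] >= required


print(supports("GK104", (3, 5)))  # False
print(supports("GM204", (3, 5)))  # True
```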
![Page 34: CS 380 - GPU and GPGPU Programming Lecture 6: GPU ...faculty.kaust.edu.sa/sites/markushadwiger/Documents/CS380_spring… · Multiprocessor: SMM 4 partitions inside the SMM • 32](https://reader034.vdocument.in/reader034/viewer/2022051812/602eff0fc2a6144dfe1c7a03/html5/thumbnails/34.jpg)
Thank you.