CS 380 - GPU and GPGPU Programming
Lecture 6: GPU Architecture 5
Markus Hadwiger, KAUST
Reading Assignment #3 (until Feb. 16)
Read (required):
• Programming Massively Parallel Processors book, Chapter 1 (Introduction)
• Programming Massively Parallel Processors book, Appendix B (GPU Compute Capabilities)
• OpenGL 4.0 Shading Language Cookbook, Chapter 2
Read (optional):
• OpenGL 4.0 Shading Language Cookbook, Chapter 1
• GLSL book, Chapter 7 (OpenGL Shading Language API)
NVIDIA G80/GT200 Architecture
• Streaming Processor (SP)
• Streaming Multiprocessor (SM)
• Texture/Processing Cluster (TPC)
Courtesy AnandTech
NVIDIA G80/GT200 Architecture
• G80/G92: 8 TPCs * ( 2 * 8 SPs ) = 128 SPs
• GT200: 10 TPCs * ( 3 * 8 SPs ) = 240 SPs
• Arithmetic intensity has increased (ALUs vs. texture units)
G80 / G92 vs. GT200 (Courtesy AnandTech)
Example: GeForce 8
NVIDIA Fermi / GF100 Features
Names
• Compute: Fermi; product: Tesla-20 series
• Graphics: GF100 (product: GeForce GTX 480, 580, ...)
Compute capability 2.1 / 2.0; PTX ISA 3.0 / 2.x
• http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/ptx_isa_3.0.pdf
• http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/ptx_isa_2.0.pdf
L1 and L2 caches
More CUDA cores (up to 512)
Faster double-precision floating-point performance, faster atomics, floating-point atomics
DirectX 11 and OpenGL 4 functionality
• New shader types, scatter writes to images, ...
NVIDIA Fermi / GF100 Stats
Streaming Multiprocessor
Streaming processors are now called CUDA cores
32 CUDA cores per Fermi streaming multiprocessor (SM)
16 SMs = 512 CUDA cores
CPU-like cache hierarchy
• L1 cache / shared memory
• L2 cache
Texture units and caches are now in the SM (instead of in the TPC = multiple SMs, as on GT200)
Dual Warp Schedulers
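These per-architecture figures (SM count, compute capability) can be read out at runtime. A minimal host-side sketch using the standard CUDA runtime call cudaGetDeviceProperties (the printed values depend on the installed GPU):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "no CUDA device found\n");
        return 1;
    }
    // On a full Fermi GF100, this reports 16 SMs (16 * 32 = 512 CUDA cores)
    // and compute capability 2.0.
    printf("%s: %d SMs, compute capability %d.%d\n",
           prop.name, prop.multiProcessorCount, prop.major, prop.minor);
    return 0;
}
```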
Graphics Processor Clusters (GPC)
(instead of TPC on GT200)
4 streaming multiprocessors (SMs)
• 32 CUDA cores / SM
• 4 SMs / GPC = 128 cores / GPC
Decentralized rasterization and geometry
• 4 raster engines
• 16 "PolyMorph" engines
NVIDIA Fermi / GF100 Structure
Full size
• 4 GPCs
• 4 SMs each
• 6 64-bit memory controllers (= 384-bit)
NVIDIA Fermi / GF100 Die
Full size
• 4 GPCs
• 4 SMs each
GF100 Graphics Pipeline
• Input Assembler
• Vertex Shader
• Hull Shader
• Tessellator
• Domain Shader
• Geometry Shader (Stream Output)
• Rasterizer
• Pixel Shader
• Output Merger
Compute Capab. 2.0
• 1024 threads / block
• More threads / SM
• 32K registers / SM
• New synchronization functions
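A sketch of two of these features together, assuming a hypothetical kernel name: the larger 1024-thread blocks, and the predicate-counting barrier __syncthreads_count() introduced with compute capability 2.0:

```cuda
// Counts, per block, how many input elements are positive.
__global__ void countPositive(const float *in, int *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Barrier that additionally returns, to every thread in the block,
    // the number of threads whose predicate was non-zero (CC >= 2.0).
    int n = __syncthreads_count(in[i] > 0.0f);
    if (threadIdx.x == 0)
        out[blockIdx.x] = n;
}

// Launch with the new maximum block size of 1024 threads:
// countPositive<<<numBlocks, 1024>>>(d_in, d_out);
```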
L1 Cache vs. Shared Memory
Two different configs (on Fermi and Kepler; NOT on Maxwell!)
• 64KB total
• 16KB shared, 48KB L1 cache
• 48KB shared, 16KB L1 cache
• Set per kernel
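The per-kernel split is selected with the CUDA runtime call cudaFuncSetCacheConfig; a short sketch with a hypothetical kernel:

```cuda
__global__ void myKernel(float *data) { /* ... uses shared memory ... */ }

void launch(float *d_data) {
    // Prefer 48KB shared memory / 16KB L1 cache for this kernel ...
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
    // ... or prefer 48KB L1 / 16KB shared memory instead:
    // cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);
    myKernel<<<128, 256>>>(d_data);
}
```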
Global Memory Access
Cached on Fermi
L1 cache per SM
Global L2 cache
Compile-time flag can choose:
• Caching in both L1 and L2
• Caching only in L2
Cache line size (L1, L2):
• 128 bytes
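The choice between the two caching modes is made with the ptxas load-cache-modifier flag; a sketch of the two nvcc invocations (kernel.cu is a placeholder file name):

```shell
# Cache global loads in both L1 and L2 (default on Fermi):
nvcc -Xptxas -dlcm=ca kernel.cu -o kernel_l1

# Cache global loads in L2 only (memory transactions drop
# from 128-byte to 32-byte segments):
nvcc -Xptxas -dlcm=cg kernel.cu -o kernel_l2
```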
Global Memory Access
CUDA 6.5
Global Memory Access
CUDA 7.0
NVIDIA Kepler Architecture
Three different versions
• Compute capability 3.0 (GK104)
  – GeForce GTX 680, …
  – Quadro K5000
  – Tesla K10
• Compute capability 3.5 (GK110)
  – GeForce GTX 780 / Titan / Titan Black
  – Quadro K6000
  – Tesla K20, Tesla K40
• Compute capability 3.7 (GK210)
  – Tesla K80
  – Very new (~end of 2014)
GK104 SMX
• 192 CUDA cores
• 32 LD/ST units
• 16 SFUs
• 16 texture units
GK110 SMX
• 192 CUDA cores
• 64 DP units
• 32 LD/ST units
• 16 SFUs
• 16 texture units
New read-only data cache (48KB)
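The read-only data cache can be used explicitly via the __ldg() intrinsic (available from compute capability 3.5); a sketch with a hypothetical kernel, guarded with __CUDA_ARCH__ so the same source also compiles for GK104 (CC 3.0):

```cuda
__global__ void scaleByTwo(const float *in, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
#if __CUDA_ARCH__ >= 350
    // GK110/GK210: load through the 48KB read-only data cache.
    out[i] = __ldg(&in[i]) * 2.0f;
#else
    // GK104 (CC 3.0) fallback: plain global load.
    out[i] = in[i] * 2.0f;
#endif
}

// Compile for multiple targets, e.g.:
// nvcc -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 ...
```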
NVIDIA Kepler / GK104 Structure
Full size
• 4 GPCs
• 2 SMXs each
= 8 SMXs, 1536 CUDA cores
NVIDIA Kepler / GK110 Structure (1)
Full size
• 15 SMXs (Titan Black; Titan: 14)
• 2880 CUDA cores (Titan Black; Titan: 2688)
• 5 GPCs of 3 SMXs each
NVIDIA Kepler / GK110 Structure (2)
Titan (not Black)
• 14 SMXs
• 2688 CUDAcores
• 5 GPCs with 3 SMXs or 2 SMXs each
Compute Capabilities 2.0 – 3.5
Maxwell (GM) Architecture
Multiprocessor: SMM
4 partitions inside the SMM
• 32 CUDA cores each
• 128 CUDA cores in total
• Each has its own warp scheduler, dispatch units, register file
Shared memory and L1 cache are now separate!
• L1 cache is shared with the texture cache
• Shared memory is its own space
Maxwell (GM) Architecture
First gen.: GM107 (GTX 750 Ti)
• 5 SMMs (640 CUDA cores in total)
Maxwell (GM) Architecture
Second gen.: GM204 (GTX 980)
• 16 SMMs (2048 CUDA cores in total)
• 4 GPCs of 4 SMMs
Maxwell (GM) vs. Kepler (GK) Architecture
GK107 vs. GM107
Maxwell (GM) vs. Kepler (GK) Architecture
GK107 vs. GM204
Compute Capab. 5.x (Part 1)
Maxwell
• GM107: 5.0
• GM204: 5.2
Compute Capab. 5.x (Part 2)
Maxwell
• GM107: 5.0
• GM204: 5.2
Thank you.