CS 380 - GPU and GPGPU Programming
Lecture 8+9: GPU Architecture 7+8
Markus Hadwiger, KAUST
Reading Assignment #5 (until March 12)
Read (required):
• Programming Massively Parallel Processors book, Chapter 3 (Introduction to CUDA)
• Programming Massively Parallel Processors book, Chapter 4 (CUDA Threads) until (including) 4.3
Read (optional):
• NVIDIA Fermi graphics (GF100) and compute white papers:
http://www.nvidia.com/object/IO_86775.html
http://www.nvidia.com/object/IO_86776.html
• NVIDIA Kepler (GK110) white papers:
http://www.geforce.com/Active/en_US/en_US/pdf/GeForce-GTX-680-Whitepaper-FINAL.pdf
http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
• NVIDIA Maxwell (GM107) white paper:
http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce-GTX-750-Ti-Whitepaper.pdf
SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/
From Shader Code to a Teraflop: How Shader Cores Work
Kayvon Fatahalian, Stanford University
My chip!
16 cores
8 mul-add ALUs per core (128 total)
16 simultaneous instruction streams
64 concurrent (but interleaved) instruction streams
512 concurrent fragments
= 256 GFLOPs (@ 1GHz)
My “enthusiast” chip!
32 cores, 16 ALUs per core (512 total) = 1 TFLOP (@ 1 GHz)
NVIDIA G80/GT200 Architecture
• Streaming Processor (SP)
• Streaming Multiprocessor (SM)
• Texture/Processing Cluster (TPC)
Courtesy AnandTech
NVIDIA G80/GT200 Architecture
• G80/G92: 8 TPCs * ( 2 * 8 SPs ) = 128 SPs
• GT200: 10 TPCs * ( 3 * 8 SPs ) = 240 SPs
• Arithmetic intensity has increased (ALUs vs. texture units)
G80 / G92 vs. GT200 (Courtesy AnandTech)
NVIDIA GT200 GPGPU Hardware
NVIDIA Tesla 10-series
• Based on GT200 architecture
• 1 Teraflop / device
• 4GB RAM / device
• Multiple devices per node / machine
Tesla C1060
Tesla S1070
NVIDIA Fermi / GF100 Hardware
Geforce GTX 580
• 512 CUDA cores (16 SMs)
• 1.5 GB memory
Tesla 20-series
• Cards: M2070/C2070, ...
• Blades: S2050/S2070
• 3GB or 6GB / GPU, ECC memory
NVIDIA Fermi / GF100 Features
Names
• Compute: Fermi; product: Tesla-20 series
• Graphics: GF100 (product: Geforce GTX 480, 580, ...)
Compute capability 2.1 / 2.0; PTX ISA 3.0 / 2.x
• http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/ptx_isa_3.0.pdf
• http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/ptx_isa_2.0.pdf
L1 and L2 caches
More CUDA cores (up to 512)
Faster double-precision floating-point performance, faster atomics, floating-point atomics
DirectX 11 and OpenGL 4 functionality
• New shader types, scatter writes to images, ...
NVIDIA Fermi / GF100 Stats
Streaming Multiprocessor
Streaming processors are now CUDA cores
32 CUDA cores per Fermi streaming multiprocessor (SM)
16 SMs = 512 CUDA cores
CPU-like cache hierarchy
• L1 cache / shared memory
• L2 cache
Texture units and caches now in SM (instead of with TPC = multiple SMs in GT200)
Dual Warp Schedulers
Graphics Processor Clusters (GPC)
(instead of TPC on GT200)
4 Streaming Multiprocessors (SMs)
32 CUDA cores / SM
4 SMs / GPC = 128 cores / GPC
Decentralized rasterization and geometry
• 4 raster engines
• 16 "PolyMorph" engines
NVIDIA Fermi / GF100 Structure
Full size
• 4 GPCs
• 4 SMs each
• 6 × 64-bit memory controllers (= 384 bit)
NVIDIA Fermi / GF100 Die
Full size
• 4 GPCs
• 4 SMs each
Compute Capability 2.0
• 1024 threads / block
• More threads / SM
• 32K registers / SM
• New synchronization functions
L1 Cache vs. Shared Memory
Two different configs
• 64KB total
• 16KB shared, 48KB L1 cache
• 48KB shared, 16KB L1 cache
• Set per kernel
Global Memory Access
Cached on Fermi
L1 cache per SM
Global L2 cache
Compile-time flag can choose:
• Caching in both L1 and L2
• Caching only in L2
Cache line size (L1, L2):
• 128 bytes
NVIDIA Kepler Architecture
Two different versions
• GK104, compute capability 3.0
– Geforce GTX 680, …
– Quadro K5000
– Tesla K10 series
• GK110, compute capability 3.5
– Geforce GTX Titan (just released!)
– Tesla K20 series
GF100 Graphics Pipeline
• Input Assembler
• Vertex Shader
• Hull Shader
• Tessellator
• Domain Shader
• Geometry Shader (Stream Output)
• Rasterizer
• Pixel Shader
• Output Merger
NVIDIA Kepler / GK104 Structure
Full size
• 4 GPCs
• 2 SMXs each
= 8 SMXs, 1536 CUDA cores
GK104 SMX
• 192 CUDA cores
• 32 LD/ST units
• 16 SFUs
• 16 texture units
NVIDIA Kepler / GK110 Structure
Full size
• 15 SMXs
• 2880 CUDA cores
GK110 SMX
• 192 CUDA cores
• 64 DP units
• 32 LD/ST units
• 16 SFUs
• 16 texture units
New read-only data cache (48KB)
Compute Capabilities 2.0 – 3.5
Maxwell vs. Kepler Architecture
GM107
Maxwell vs. Kepler Architecture
GK107
vs.
GM107
Thank you.