CS 380 - GPU and GPGPU Programming
Lecture 6: GPU Architecture 5

Markus Hadwiger, KAUST

Reading Assignment #3 (until Feb. 16)

Read (required):
• Programming Massively Parallel Processors book, Chapter 1 (Introduction)

• Programming Massively Parallel Processors book, Appendix B (GPU Compute Capabilities)

• OpenGL 4.0 Shading Language Cookbook, Chapter 2

Read (optional):
• OpenGL 4.0 Shading Language Cookbook, Chapter 1

• GLSL book, Chapter 7 (OpenGL Shading Language API)

NVIDIA G80/GT200 Architecture

• Streaming Processor (SP)

• Streaming Multiprocessor (SM)

• Texture/Processing Cluster (TPC)

Courtesy AnandTech

NVIDIA G80/GT200 Architecture

• G80/G92: 8 TPCs * ( 2 * 8 SPs ) = 128 SPs

• GT200: 10 TPCs * ( 3 * 8 SPs ) = 240 SPs
• Arithmetic intensity has increased (ALUs vs. texture units)

G80 / G92 vs. GT200 (courtesy AnandTech)

Example: GeForce 8

NVIDIA Fermi / GF100 Features

Names

• Compute: Fermi; product: Tesla-20 series

• Graphics: GF100 (product: Geforce GTX 480, 580, ...)

Compute capability 2.1 / 2.0; PTX ISA 3.0 / 2.x
• http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/ptx_isa_3.0.pdf

• http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/ptx_isa_2.0.pdf

L1 and L2 caches

More CUDA cores (up to 512)

Faster double-precision floating-point performance, faster atomics, floating-point atomics

DirectX 11 and OpenGL 4 functionality

• New shader types, scatter writes to images, ...

NVIDIA Fermi / GF100 Stats

Streaming Multiprocessor

Streaming processors are now CUDA cores

32 CUDA cores per Fermi streaming multiprocessor (SM)

16 SMs = 512 CUDA cores

CPU-like cache hierarchy
• L1 cache / shared memory

• L2 cache

Texture units and caches now in SM (instead of with TPC = multiple SMs in GT200)

Dual Warp Schedulers

Graphics Processor Clusters (GPC)

(instead of TPC on GT200)

4 Streaming Multiprocessors (SMs)

32 CUDA cores / SM

4 SMs / GPC = 128 cores / GPC

Decentralized rasterization and geometry

• 4 raster engines

• 16 "PolyMorph" engines

NVIDIA Fermi / GF100 Structure

Full size

• 4 GPCs

• 4 SMs each

• 6 × 64-bit memory controllers (= 384-bit)

NVIDIA Fermi / GF100 Die

Full size

• 4 GPCs

• 4 SMs each

GF100 Graphics Pipeline

• DirectX 11 pipeline stages: Input Assembler → Vertex Shader → Hull Shader → Tessellator → Domain Shader → Geometry Shader (with Stream Output) → Rasterizer → Pixel Shader → Output Merger

Compute Capab. 2.0

• 1024 threads / block

• More threads / SM

• 32K registers / SM

• New synchronization functions
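
One of these new synchronization functions is __syncthreads_count(); a minimal CUDA sketch (kernel and variable names are illustrative only) that also uses the new 1024-thread block size:

    __global__ void countPositive(const float* data, int* blockCounts)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;

        // New in compute capability 2.0: a barrier that also returns how many
        // threads in the block evaluated the predicate as non-zero.
        int numPositive = __syncthreads_count(data[tid] > 0.0f);

        if (threadIdx.x == 0)
            blockCounts[blockIdx.x] = numPositive;
    }

    // Launched with the new 1024-threads-per-block limit (512 before Fermi):
    // countPositive<<<numBlocks, 1024>>>(d_data, d_blockCounts);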

L1 Cache vs. Shared Memory

Two different configs (on Fermi and Kepler; NOT on Maxwell!)
• 64KB total

• 16KB shared, 48KB L1 cache

• 48KB shared, 16KB L1 cache

• Set per kernel
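
The per-kernel setting is made from the host through the CUDA runtime; a minimal sketch, assuming a hypothetical kernel that relies heavily on shared memory:

    #include <cuda_runtime.h>

    __global__ void myKernel(float* data) { /* uses shared memory heavily */ }

    int main()
    {
        // Request the 48KB shared / 16KB L1 split for this kernel.
        // (This is a preference; the driver may still override it.)
        cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);

        // Alternatively, request the 16KB shared / 48KB L1 split:
        // cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);

        myKernel<<<1, 256>>>(nullptr);
        cudaDeviceSynchronize();
        return 0;
    }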

Global Memory Access

Cached on Fermi

L1 cache per SM

Global L2 cache

Compile-time flag can choose (see the example below):
• Caching in both L1 and L2

• Caching only in L2

Cache line size (L1, L2):
• 128 bytes
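
The compile-time choice above is made by passing an option through nvcc to the PTX assembler; a sketch of the usual Fermi-era invocation (the source file name is a placeholder):

    # Cache global memory loads in both L1 and L2 (the default on Fermi):
    nvcc -Xptxas -dlcm=ca kernel.cu -o kernel

    # Cache global memory loads in L2 only (bypass L1):
    nvcc -Xptxas -dlcm=cg kernel.cu -o kernel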

Global Memory Access

CUDA 6.5

Global Memory Access

CUDA 7.0

NVIDIA Kepler Architecture

Three different versions (see the query sketch after this list):
• Compute capability 3.0 (GK104)

– Geforce GTX 680, …
– Quadro K5000
– Tesla K10

• Compute capability 3.5 (GK110)
– Geforce GTX 780 / Titan / Titan Black
– Quadro K6000
– Tesla K20, Tesla K40

• Compute capability 3.7 (GK210)
– Tesla K80
– Very new (~end of 2014)
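
A small host-side sketch using the standard CUDA runtime query to check which of these compute capabilities the installed GPU reports:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // query device 0

        // prop.major / prop.minor give the compute capability,
        // e.g. 3.0 (GK104), 3.5 (GK110), 3.7 (GK210).
        printf("%s: compute capability %d.%d, %d multiprocessors\n",
               prop.name, prop.major, prop.minor, prop.multiProcessorCount);
        return 0;
    }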

GK104 SMX

• 192 CUDA cores

• 32 LD/ST units

• 32 SFUs

• 16 texture units

GK110 SMX

• 192 CUDA cores

• 64 DP units

• 32 LD/ST units

• 32 SFUs

• 16 texture units

New read-only data cache (48KB)
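
A minimal sketch of using this cache from CUDA on compute capability 3.5 and up (kernel and parameter names are illustrative): declaring the input pointer const __restrict__ lets the compiler route loads through the read-only data cache, and __ldg() requests this explicitly.

    __global__ void scaleArray(float* __restrict__ out,
                               const float* __restrict__ in,
                               float s, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = s * __ldg(&in[i]);  // load via the 48KB read-only cache
    }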

NVIDIA Kepler / GK104 Structure

Full size

• 4 GPCs

• 2 SMXs each

= 8 SMXs, 1536 CUDA cores

NVIDIA Kepler / GK110 Structure (1)

Full size

• 15 SMXs (Titan Black; Titan: 14)

• 2880 CUDA cores (Titan Black; Titan: 2688)

• 5 GPCs of 3 SMXs each

NVIDIA Kepler / GK110 Structure (2)

Titan (not Black)

• 14 SMXs

• 2688 CUDA cores

• 5 GPCs with 3 SMXs or 2 SMXs each

Compute Capabilities 2.0 – 3.5

Maxwell (GM) Architecture

Multiprocessor: SMM

4 partitions inside the SMM
• 32 CUDA cores each

• 128 CUDA cores in total

• Each has its own warp scheduler, dispatch units, register file

Shared memory and L1 cache now separate!
• L1 cache is combined with the texture cache

• Shared memory is its own space

Maxwell (GM) Architecture

First gen.: GM107 (GTX 750 Ti), 5 SMMs (640 CUDA cores in total)

Maxwell (GM) Architecture

Second gen.: GM204 (GTX 980), 16 SMMs (2048 CUDA cores in total), 4 GPCs of 4 SMMs

Maxwell (GM) vs. Kepler (GK) Architecture

GK107 vs. GM107

Maxwell (GM) vs. Kepler (GK) Architecture

GK107 vs. GM204

Compute Capab. 5.x (Part 1)

Maxwell
• GM107: 5.0

• GM204: 5.2

Compute Capab. 5.x (Part 2)

Maxwell
• GM107: 5.0

• GM204: 5.2

Thank you.
