cs 380 - gpu and gpgpu programming lecture 6+7: gpu ... · cs 380 - gpu and gpgpu programming...

28
CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU Architecture 5+6 Markus Hadwiger, KAUST

Upload: others

Post on 27-Jul-2020

10 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU ... · CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU Architecture 5+6 Markus Hadwiger, KAUST

CS 380 - GPU and GPGPU ProgrammingLecture 6+7: GPU Architecture 5+6

Markus Hadwiger, KAUST

Page 2: CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU ... · CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU Architecture 5+6 Markus Hadwiger, KAUST

2

Reading Assignment #4 (until March 5)

Read (required):• GLSL book, Chapter 7 (OpenGL Shading Language API)

• OpenGL Shading Language 4.3 specification: Chapter 2http://www.opengl.org/registry/doc/GLSLangSpec.4.30.6.pdf

• Programming Massively Parallel Processors book,Chapter 2 (History of GPU Computing)

Read (optional):

• OpenGL 4.0 Shading Language Cookbook, Chapter 3

• Download OpenGL 4.3 specificationhttp://www.opengl.org/registry/doc/glspec43.core.20120806.pdf

See more at: http://www.opengl.org/documentation/specs/

Page 3: CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU ... · CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU Architecture 5+6 Markus Hadwiger, KAUST

3

Quiz #1: Mar. 5

Organization• First 30 min of lecture

• No material (book, notes, ...) allowed

Content of questions• Lectures (both actual lectures and slides)

• Reading assigments

• Programming assignments (algorithms, methods)

• Solve short practical examples

Page 4: CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU ... · CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU Architecture 5+6 Markus Hadwiger, KAUST

SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

From Shader Code to a Teraflop:How Shader Cores Work

Kayvon FatahalianStanford University

Page 5: CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU ... · CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU Architecture 5+6 Markus Hadwiger, KAUST

SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

Part 1: throughput processing

• Three key concepts behind how modern GPU processing cores run code

• Knowing these concepts will help you:1. Understand space of GPU core (and

throughput CPU processing core) designs2. Optimize shaders/compute kernels3. Establish intuition: what workloads might

benefit from the design of these architectures?5

Page 6: CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU ... · CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU Architecture 5+6 Markus Hadwiger, KAUST

SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

Slimming down

ALU(Execute)

Fetch/Decode

ExecutionContext

Idea #1:

Remove components thathelp a single instructionstream run fast

6

Page 7: CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU ... · CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU Architecture 5+6 Markus Hadwiger, KAUST

SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

Add ALUs

Fetch/Decode

Idea #2:

Amortize cost/complexity ofmanaging an instructionstream across many ALUs

ALU 1 ALU 2 ALU 3 ALU 4

ALU 5 ALU 6 ALU 7 ALU 8

SIMD processing

(or SIMT, SPMD)

Ctx Ctx Ctx Ctx

Ctx Ctx Ctx Ctx

Shared Ctx Data

7

Page 8: CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU ... · CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU Architecture 5+6 Markus Hadwiger, KAUST

SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

But what about branches?

ALU 1 ALU 2 . . . ALU 8. . . Time

(clocks)

2 ... 1 ... 8

if (x > 0) {

} else {

}

<unconditional shader code>

<resume unconditional shader code>

y = pow(x, exp);

y *= Ks;

refl = y + Ka;  

x = 0; 

refl = Ka;  

8

Page 9: CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU ... · CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU Architecture 5+6 Markus Hadwiger, KAUST

SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

But what about branches?

ALU 1 ALU 2 . . . ALU 8. . . Time

(clocks)

2 ... 1 ... 8

if (x > 0) {

} else {

}

<unconditional shader code>

<resume unconditional shader code>

y = pow(x, exp);

y *= Ks;

refl = y + Ka;  

x = 0; 

refl = Ka;  

TT TT TT FF FFFF FF FF

9

Page 10: CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU ... · CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU Architecture 5+6 Markus Hadwiger, KAUST

SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

But what about branches?

ALU 1 ALU 2 . . . ALU 8. . . Time

(clocks)

2 ... 1 ... 8

if (x > 0) {

} else {

}

<unconditional shader code>

<resume unconditional shader code>

y = pow(x, exp);

y *= Ks;

refl = y + Ka;  

x = 0; 

refl = Ka;  

TT TT TT FF FFFF FF FF

Not all ALUs do useful work! Worst case: 1/8 performance

10

Page 11: CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU ... · CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU Architecture 5+6 Markus Hadwiger, KAUST

SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

But what about branches?

ALU 1 ALU 2 . . . ALU 8. . . Time

(clocks)

2 ... 1 ... 8

if (x > 0) {

} else {

}

<unconditional shader code>

<resume unconditional shader code>

y = pow(x, exp);

y *= Ks;

refl = y + Ka;  

x = 0; 

refl = Ka;  

TT TT TT FF FFFF FF FF

11

Page 12: CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU ... · CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU Architecture 5+6 Markus Hadwiger, KAUST

SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

Stalls!

Texture access latency = 100’s to 1000’s of cycles

We’ve removed the fancy caches and logic that helps avoid stalls.

Stalls occur when a core cannot run the next instruction because of a dependency on a previous

operation.

12

Page 13: CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU ... · CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU Architecture 5+6 Markus Hadwiger, KAUST

SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

But we have LOTS of independent fragments.

Idea #3:Interleave processing of many fragments on a single core

to avoid stalls caused by high latency operations.

13

Page 14: CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU ... · CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU Architecture 5+6 Markus Hadwiger, KAUST

SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

Hiding shader stallsTime

(clocks)Frag 1 … 8

Fetch/Decode

Ctx Ctx Ctx Ctx

Ctx Ctx Ctx Ctx

Shared Ctx Data

ALU ALU ALU ALU

ALU ALU ALU ALU

14

Page 15: CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU ... · CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU Architecture 5+6 Markus Hadwiger, KAUST

SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

Hiding shader stallsTime

(clocks)

Fetch/Decode

ALU ALU ALU ALU

ALU ALU ALU ALU

1 2

3 4

1 2 3 4

Frag 1 … 8 Frag 9… 16 Frag 17 … 24 Frag 25 … 32

15

Page 16: CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU ... · CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU Architecture 5+6 Markus Hadwiger, KAUST

SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

Hiding shader stallsTime

(clocks)

Stall

Runnable

1 2 3 4

Frag 1 … 8 Frag 9… 16 Frag 17 … 24 Frag 25 … 32

16

Page 17: CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU ... · CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU Architecture 5+6 Markus Hadwiger, KAUST

SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

Hiding shader stallsTime

(clocks)

Stall

Runnable

1 2 3 4

Frag 1 … 8 Frag 9… 16 Frag 17 … 24 Frag 25 … 32

17

Page 18: CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU ... · CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU Architecture 5+6 Markus Hadwiger, KAUST

SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

Hiding shader stallsTime

(clocks)

1 2 3 4

Stall

Stall

Stall

Stall

Runnable

Runnable

Runnable

Frag 1 … 8 Frag 9… 16 Frag 17 … 24 Frag 25 … 32

18

Page 19: CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU ... · CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU Architecture 5+6 Markus Hadwiger, KAUST

SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

Throughput!Time

(clocks)

Stall

Runnable

2 3 4

Frag 1 … 8 Frag 9… 16 Frag 17 … 24 Frag 25 … 32

Done!

Stall

Runnable

Done!

Stall

Runnable

Done!

Stall

Runnable

Done!

1

Increase run time of one groupTo maximum throughput of many groups

Start

Start

Start

19

Page 20: CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU ... · CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU Architecture 5+6 Markus Hadwiger, KAUST

SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

Storing contexts

Fetch/Decode

ALU ALU ALU ALU

ALU ALU ALU ALU

Pool of context storage

64 KB

20

Page 21: CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU ... · CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU Architecture 5+6 Markus Hadwiger, KAUST

SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

Twenty small contexts

Fetch/Decode

ALU ALU ALU ALU

ALU ALU ALU ALU

1 2 3 4 5

6 7 8 9 10

11 1512 13 14

16 2017 18 19

(maximal latency hiding ability)

21

Page 22: CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU ... · CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU Architecture 5+6 Markus Hadwiger, KAUST

SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

Twelve medium contexts

Fetch/Decode

ALU ALU ALU ALU

ALU ALU ALU ALU

1 2 3 4

5 6 7 8

9 10 11 12

22

Page 23: CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU ... · CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU Architecture 5+6 Markus Hadwiger, KAUST

SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

Four large contexts

Fetch/Decode

ALU ALU ALU ALU

ALU ALU ALU ALU

43

1 2

(low latency hiding ability)

23

Page 24: CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU ... · CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU Architecture 5+6 Markus Hadwiger, KAUST

SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

Clarification

• NVIDIA / AMD Radeon GPUs– HW schedules / manages all contexts (lots of them)– Special on-chip storage holds fragment state

• Intel MIC/Larrabee– HW manages four x86 (big) contexts at fine granularity– SW scheduling interleaves many groups of fragments on

each HW context– L1-L2 cache holds fragment state (as determined by SW)

Interleaving between contexts can be managed by HW or SW (or both!)

24

Page 25: CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU ... · CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU Architecture 5+6 Markus Hadwiger, KAUST

SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

My chip!

16 cores

8 mul-add ALUs per core(128 total)

16 simultaneousinstruction streams

64 concurrent (but interleaved)instruction streams

512 concurrent fragments

= 256 GFLOPs (@ 1GHz)

25

Page 26: CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU ... · CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU Architecture 5+6 Markus Hadwiger, KAUST

SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

My “enthusiast” chip!

32 cores, 16 ALUs per core (512 total) = 1 TFLOP (@ 1 GHz)26

Page 27: CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU ... · CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU Architecture 5+6 Markus Hadwiger, KAUST
Page 28: CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU ... · CS 380 - GPU and GPGPU Programming Lecture 6+7: GPU Architecture 5+6 Markus Hadwiger, KAUST

Thank you.