Why GPUs?

Robert Strzodka


Page 1: Why GPUs?

Why GPUs?

Robert Strzodka

Page 2: Why GPUs?

Overview

• Computation / Bandwidth / Power

• CPU – GPU Comparison

• GPU Characteristics

Page 3: Why GPUs?

Data Processing in General

[Diagram: data flows from memory (IN) through the Processor to memory (OUT); the two bottlenecks are the "memory wall" on the memory side and the "lack of parallelism" on the processor side.]

Page 4: Why GPUs?

Old and New Wisdom in Computer Architecture

• Old: Power is free, transistors are expensive
• New: "Power wall" – power is expensive, transistors are free
  (Can put more transistors on a chip than one can afford to turn on)

• Old: Multiplies are slow, memory access is fast
• New: "Memory wall" – multiplies are fast, memory is slow
  (200 clocks to DRAM memory, 4 clocks for an FP multiply)

• Old: Increasing instruction-level parallelism via compilers and innovation (out-of-order execution, speculation, VLIW, …)
• New: "ILP wall" – diminishing returns on more ILP hardware
  (Explicit thread and data parallelism must be exploited)

• New: Power wall + memory wall + ILP wall = brick wall

slide courtesy of Christos Kozyrakis

Page 5: Why GPUs?

Uniprocessor Performance (SPECint)

[Chart: integer performance relative to the VAX-11/780, 1978–2006, on a log scale from 1 to 10000; growth trend lines at 25%/year, then 52%/year, then ??%/year, with a 3X gap to the old trend. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006.]

Sea change in chip design: multiple "cores" or processors per chip

slide courtesy of Christos Kozyrakis

Page 6: Why GPUs?

Instruction-Stream-Based Processing

[Diagram: a Processor consumes an instruction stream and moves individual data items between memory and memory through a cache, one datum per instruction.]

Page 7: Why GPUs?

Instruction- and Data-Streams

Addition of 2D arrays: C = A + B

Instruction stream processing data:

for (y = 0; y < HEIGHT; y++)
  for (x = 0; x < WIDTH; x++) {
    C[y][x] = A[y][x] + B[y][x];
  }

Data streams undergoing a kernel operation:

inputStreams(A, B);
outputStream(C);
kernelProgram(OP_ADD);
processStreams();

Page 8: Why GPUs?

Data-Stream-Based Processing

[Diagram: a configuration sets up several pipelines once; data then streams from memory through the pipelines back to memory without per-element instructions.]

Page 9: Why GPUs?

Architectures: Data – Processor Locality

• Field Programmable Gate Array (FPGA)
  – Compute by configuring Boolean functions and local memory

• Processor Array / Multi-core Processor
  – Assemble many (simple) processors and memories on one chip

• Processor-in-Memory (PIM)
  – Insert processing elements directly into RAM chips

• Stream Processor
  – Create data locality through a hierarchy of memories

Page 10: Why GPUs?

Overview

• Computation / Bandwidth / Power

• CPU – GPU Comparison

• GPU Characteristics

Page 11: Why GPUs?

The GPU is a Fast, Parallel Array Processor

Input arrays: 1D, 2D (typical), 3D
  ↓
Vertex Processor (VP) – kernel changes index regions of output arrays
  ↓
Rasterizer – creates data streams from index regions
  ↓
Stream of array elements, order unknown
  ↓
Fragment Processor (FP) – kernel changes each datum independently, reads more input arrays
  ↓
Output arrays: 1D, 2D (typical), 3D (slice)

Page 12: Why GPUs?

Index Regions in Output Arrays

• Quads and triangles – fastest option

• Line segments – slower; try to pair lines into 2×h or w×2 quads

• Point clouds – slowest; try to gather points into larger forms

Page 13: Why GPUs?

High Level Graphics Language for the Kernels

• Float data types:
  – half 16-bit (s10e5), float 32-bit (s23e8)

• Vectors, structs and arrays:
  – float4, float vec[6], float3x4, float arr[5][3], struct {}

• Arithmetic and logic operators:
  – +, -, *, /; &&, ||, !

• Trigonometric and exponential functions:
  – sin, asin, exp, log, pow, …

• User-defined functions:
  – max3(float a, float b, float c) { return max(a, max(b, c)); }

• Conditional statements, loops:
  – if, for, while; dynamic branching in PS3

• Streaming and random data access

Page 14: Why GPUs?

Input and Output Arrays

CPU:
• Input and output arrays may overlap

GPU:
• Input and output arrays must not overlap

Page 15: Why GPUs?

Native Memory Layout – Data Locality

CPU:
• 1D input, 1D output
• Higher dimensions with offsets

GPU:
• 1D, 2D, 3D input; 2D output
• Other dimensions with offsets

[Figure color-codes locality in the arrays: red (near), blue (far).]

Page 16: Why GPUs?

Data-Flow: Gather and Scatter

CPU:
• Arbitrary gather
• Arbitrary scatter

GPU:
• Arbitrary gather
• Restricted scatter

Page 17: Why GPUs?

Overview

• Computation / Bandwidth / Power

• CPU – GPU Comparison

• GPU Characteristics

Page 18: Why GPUs?

1) Computational Performance

[Chart: GFLOPS over time for CPUs and GPUs, including the ATI R520; chart courtesy of John Owens.]

Note: Sustained performance is usually much lower and depends heavily on the memory system!

Page 19: Why GPUs?

2) Memory Performance

CPU (Pentium 4):
• Large cache
• Few processing elements
• Optimized for spatial and temporal data reuse

GPU (GeForce 7800 GTX):
• Small cache
• Many processing elements
• Optimized for sequential (streaming) data access

[Chart: bandwidth for cached, sequential, and random memory access on each chip; chart courtesy of Ian Buck.]

Page 20: Why GPUs?

3) Configuration Overhead

[Chart: throughput vs. amount of work per configuration; small workloads are configuration-limited, large workloads are computation-limited; chart courtesy of Ian Buck.]

Page 21: Why GPUs?

Conclusions

• Parallelism is now indispensable for further performance increases

• Both memory-dominated and processing-element-dominated designs have pros and cons

• Mapping algorithms to the appropriate architecture allows enormous speedups

• Many of the GPU's restrictions are crucial for parallel efficiency (eat the cake or have it)