Why GPUs?

Robert Strzodka
Overview
• Computation / Bandwidth / Power
• CPU – GPU Comparison
• GPU Characteristics
Data Processing in General

[Figure: IN → memory → Processor → memory → OUT; the two bottlenecks marked are the memory wall and the lack of parallelism]
Old and New Wisdom in Computer Architecture

• Old: Power is free, transistors are expensive
• New: “Power wall”, power is expensive, transistors are free (we can put more transistors on a chip than we can afford to turn on)
• Old: Multiplies are slow, memory access is fast
• New: “Memory wall”, multiplies are fast, memory is slow (200 clocks to DRAM memory, 4 clocks for an FP multiply, so one memory access costs as much as roughly 50 multiplies)
• Old: Increase instruction-level parallelism via compilers and innovation (out-of-order execution, speculation, VLIW, …)
• New: “ILP wall”, diminishing returns on more ILP hardware (explicit thread and data parallelism must be exploited)
• New: Power wall + memory wall + ILP wall = brick wall
slide courtesy of Christos Kozyrakis
Uniprocessor Performance (SPECint)

[Chart: performance relative to the VAX-11/780, log scale from 1 to 10,000, over 1978–2006; growth runs at 25%/year, then 52%/year, then ??%/year, with a “3X” gap annotation. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006]

Sea change in chip design: multiple “cores” or processors per chip
slide courtesy of Christos Kozyrakis
Instruction-Stream-Based Processing

[Figure: memory → Processor → memory; the processor fetches an instruction stream through a cache and moves individual data items between the memories]
Instruction- and Data-Streams

Addition of 2D arrays: C = A + B
Instruction stream processing data:

    for (y = 0; y < HEIGHT; y++)
      for (x = 0; x < WIDTH; x++) {
        C[y][x] = A[y][x] + B[y][x];
      }

Data streams undergoing a kernel operation:

    inputStreams(A, B);
    outputStream(C);
    kernelProgram(OP_ADD);
    processStreams();
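The stream calls above (inputStreams, outputStream, kernelProgram, processStreams) are the slide's pseudo-API, not a real library. A minimal C sketch of the idea, with hypothetical names, might look as follows: the kernel is configured once, then swept uniformly over every stream element, with no per-element instruction sequencing chosen by the programmer.

```c
#include <stddef.h>

/* Hypothetical stream framework: the kernel is fixed up front. */
typedef float (*kernel_fn)(float a, float b);

static float op_add(float a, float b) { return a + b; }

/* The "data-stream" view of C = A + B: one kernel applied to all
 * elements.  The loop is an implementation detail; conceptually the
 * elements are independent and could be processed in parallel. */
static void processStreams(kernel_fn kernel, const float *A,
                           const float *B, float *C, size_t n)
{
    for (size_t i = 0; i < n; i++)
        C[i] = kernel(A[i], B[i]);
}
```

A call such as processStreams(op_add, A, B, C, WIDTH * HEIGHT) then stands in for the four pseudo-API lines above.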
Data-Stream-Based Processing

[Figure: memory → Processor → memory; inside the processor the data streams pass through several parallel pipelines, and the kernel is supplied once as a configuration instead of being fetched as an instruction stream]
Architectures: Data – Processor Locality

• Field Programmable Gate Array (FPGA)
  – Compute by configuring Boolean functions and local memory
• Processor Array / Multi-core Processor
  – Assemble many (simple) processors and memories on one chip
• Processor-in-Memory (PIM)
  – Insert processing elements directly into RAM chips
• Stream Processor
  – Create data locality through a hierarchy of memories
Overview
• Computation / Bandwidth / Power
• CPU – GPU Comparison
• GPU Characteristics
The GPU is a Fast, Parallel Array Processor

Input arrays: 1D, 2D (typical), 3D
  ↓
Vertex Processor (VP): kernel changes index regions of output arrays
  ↓
Rasterizer: creates data streams from index regions
  ↓ (stream of array elements, order unknown)
Fragment Processor (FP): kernel changes each datum independently, reads more input arrays
  ↓
Output arrays: 1D, 2D (typical), 3D (slice)
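As a rough CPU-side analogy of the last two stages (a sketch with illustrative names, not actual GPU code): the rasterizer enumerates every index inside the output region, and the FP kernel runs once per element, free to gather from the input arrays but unaware of the iteration order.

```c
/* FP-kernel analogy: compute one output datum from gathered inputs. */
typedef float (*fp_kernel)(const float *in, int width, int x, int y);

/* Rasterizer analogy: stream all indices of a rectangular output
 * region through the kernel.  On the GPU the elements are processed
 * in parallel and in an order unknown to the kernel. */
static void run_region(fp_kernel k, const float *in, float *out,
                       int width, int x0, int y0, int x1, int y1)
{
    for (int y = y0; y < y1; y++)
        for (int x = x0; x < x1; x++)
            out[y * width + x] = k(in, width, x, y);
}

/* Example kernel: average of the four neighbors.  The caller must
 * keep the region one element away from the array border. */
static float avg_neighbors(const float *in, int width, int x, int y)
{
    return 0.25f * (in[y * width + x - 1] + in[y * width + x + 1] +
                    in[(y - 1) * width + x] + in[(y + 1) * width + x]);
}
```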
Index Regions in Output Arrays

• Quads and triangles: the fastest option
• Line segments: slower; try to pair lines into 2×h or w×2 quads
• Point clouds: slowest; try to gather points into larger forms
High-Level Graphics Language for the Kernels

• Float data types: half 16-bit (s10e5), float 32-bit (s23e8)
• Vectors, structs, and arrays: float4, float vec[6], float3x4, float arr[5][3], struct {…}
• Arithmetic and logic operators: +, -, *, /; &&, ||, !
• Trigonometric and exponential functions: sin, asin, exp, log, pow, …
• User-defined functions: max3(float a, float b, float c) { return max(a, max(b, c)); }
• Conditional statements and loops: if, for, while; dynamic branching in PS3 (Pixel Shader 3.0)
• Streaming and random data access
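The language sketched above is a Cg/HLSL-style shading language, so the snippets are not plain C (float4 and max are built in there). Purely as an illustration, a C approximation of such a kernel, with the built-ins written out by hand, could look like this:

```c
#include <math.h>

/* C stand-ins for shading-language built-ins. */
typedef struct { float x, y, z, w; } float4;

static float max2(float a, float b) { return a > b ? a : b; }

/* The user-defined function from the slide, spelled out in C. */
static float max3(float a, float b, float c)
{
    return max2(a, max2(b, c));
}

/* A kernel-style function combining the listed features: vector
 * data, arithmetic, trig/exponential calls, and a conditional. */
static float4 example_kernel(float4 v)
{
    float m = max3(v.x, v.y, v.z);
    if (m > 0.0f)                /* conditional statement */
        m = logf(1.0f + m);      /* exponential-family function */
    float4 r = { sinf(v.x), expf(v.y), powf(v.z, 2.0f), m };
    return r;
}
```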
Input and Output Arrays

CPU:
• Input and output arrays may overlap

GPU:
• Input and output arrays must not overlap

[Figure: on the CPU the input and output regions may share memory; on the GPU they are disjoint arrays]
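One practical consequence, shown here as my own illustration rather than the slide's: an update that could be done in place on the CPU must use two separate arrays on the GPU, whose roles are swapped between passes (commonly called double buffering, or ping-pong).

```c
#include <stddef.h>

/* One GPU-style pass: src is read-only, dst is write-only, and the
 * two must not overlap.  Requires n >= 2. */
static void smooth_pass(const float *src, float *dst, size_t n)
{
    dst[0] = src[0];             /* carry the boundary values over */
    dst[n - 1] = src[n - 1];
    for (size_t i = 1; i + 1 < n; i++)
        dst[i] = 0.5f * src[i] + 0.25f * (src[i - 1] + src[i + 1]);
}

/* Double buffering: instead of writing into the array being read,
 * swap the two buffers after every pass. */
static void smooth(float *a, float *b, size_t n, int passes)
{
    for (int p = 0; p < passes; p++) {
        smooth_pass(a, b, n);
        float *tmp = a; a = b; b = tmp;  /* ping-pong */
    }
}
```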
Native Memory Layout – Data Locality

CPU:
• 1D input, 1D output
• Higher dimensions with offsets

GPU:
• 1D, 2D, 3D input
• 2D output
• Other dimensions with offsets

[Figure: input and output arrays with locality color-coded from red (near) to blue (far)]
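"Higher dimensions with offsets" is plain index arithmetic on a flat array. As an illustration (not from the slide), here is the usual CPU 1D flattening next to one common GPU-style layout: tiling the z-slices of a 3D volume side by side in a single wide 2D array.

```c
#include <stddef.h>

/* CPU style: a 3D volume stored flat in 1D.  Element (x, y, z)
 * lives at offset x + nx*(y + ny*z). */
static size_t idx_1d(size_t x, size_t y, size_t z,
                     size_t nx, size_t ny)
{
    return x + nx * (y + ny * z);
}

/* GPU style: the same volume as a 2D array with the nz slices laid
 * side by side.  The 2D array is (nx * nz) wide and ny high;
 * element (x, y, z) sits in row y, column z*nx + x. */
static size_t idx_2d_tiled(size_t x, size_t y, size_t z,
                           size_t nx, size_t nz)
{
    size_t width = nx * nz;
    return y * width + (z * nx + x);
}
```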
Data-Flow: Gather and Scatter

CPU:
• Arbitrary gather
• Arbitrary scatter

GPU:
• Arbitrary gather
• Restricted scatter
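In code, the distinction (a minimal sketch) is simply which side of the assignment carries the indirection: gather has it on the read, scatter on the write, and the write side is where the GPU above is restricted.

```c
#include <stddef.h>

/* Gather: the indirection is on the read side.  Each output element
 * chooses where it reads from. */
static void gather(float *out, const float *in,
                   const size_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[idx[i]];
}

/* Scatter: the indirection is on the write side.  Each input element
 * chooses where it writes to; this is the restricted operation on
 * the GPU described above. */
static void scatter(float *out, const float *in,
                    const size_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[idx[i]] = in[i];
}
```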
Overview
• Computation / Bandwidth / Power
• CPU – GPU Comparison
• GPU Characteristics
1) Computational Performance

[Chart: peak GFLOPS over time for CPUs and GPUs, up to the ATI R520; chart courtesy of John Owens]

Note: sustained performance is usually much lower and depends heavily on the memory system!
2) Memory Performance

CPU:
• Large cache
• Few processing elements
• Optimized for spatial and temporal data reuse

GPU:
• Small cache
• Many processing elements
• Optimized for sequential (streaming) data access

[Chart: memory bandwidth of the GeForce 7800 GTX vs. the Pentium 4 for cached, sequential, and random access; chart courtesy of Ian Buck]
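The sequential/random split in the chart can be reproduced in miniature with a sketch like the following (my illustration, not the chart's actual benchmark): both functions do the same additions over the same data, but the random-order version defeats prefetching and cache lines, so timing the two side by side exposes the gap the slide is pointing at.

```c
#include <stddef.h>

/* Sequential access: the streaming pattern the GPU's memory system
 * is built around, and the one CPU prefetchers handle well too. */
static float sum_sequential(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Random access: identical work, but the order comes from 'perm',
 * a permutation of 0..n-1, so prefetching and cache lines help far
 * less. */
static float sum_random(const float *a, const size_t *perm, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[perm[i]];
    return s;
}
```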
3) Configuration Overhead

[Chart: for small streams the runtime is configuration-limited, for large streams it is computation-limited; chart courtesy of Ian Buck]
Conclusions
• Parallelism is now indispensable for further performance increases
• Both memory-dominated and processing-element-dominated designs have pros and cons
• Mapping algorithms to the appropriate architecture allows enormous speedups
• Many of the GPU's restrictions are crucial for its parallel efficiency (you cannot have your cake and eat it, too)