Why GPUs?

Robert Strzodka
Overview
• Computation / Bandwidth / Power
• CPU – GPU Comparison
• GPU Characteristics
Data Processing in General

[Figure: IN → memory → Processor → memory → OUT; the two bottlenecks marked are the memory wall and the lack of parallelism]
Old and New Wisdom in Computer Architecture

• Old: Power is free, transistors are expensive
• New: “Power wall”, power is expensive, transistors are free (we can put more transistors on a chip than we can afford to turn on)
• Old: Multiplies are slow, memory access is fast
• New: “Memory wall”, multiplies are fast, memory is slow (200 clocks to DRAM memory, 4 clocks for an FP multiply, so one memory access costs as much as roughly 50 multiplies)
• Old: Increase instruction-level parallelism via compilers and innovation (out-of-order execution, speculation, VLIW, …)
• New: “ILP wall”, diminishing returns on more ILP hardware (explicit thread and data parallelism must be exploited)
• New: Power wall + memory wall + ILP wall = brick wall
slide courtesy of Christos Kozyrakis
Uniprocessor Performance (SPECint)

[Chart: performance relative to the VAX-11/780, log scale from 1 to 10,000, over 1978–2006; growth runs at 25%/year, then 52%/year, then ??%/year, with a “3X” gap annotation. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006]

Sea change in chip design: multiple “cores” or processors per chip
slide courtesy of Christos Kozyrakis
Instruction-Stream-Based Processing

[Figure: memory → Processor → memory; the processor fetches an instruction stream through a cache and moves individual data items between the memories]
Instruction- and Data-Streams

Addition of 2D arrays: C = A + B
Instruction stream processing data:

    for (y = 0; y < HEIGHT; y++)
      for (x = 0; x < WIDTH; x++) {
        C[y][x] = A[y][x] + B[y][x];
      }

Data streams undergoing a kernel operation:

    inputStreams(A, B);
    outputStream(C);
    kernelProgram(OP_ADD);
    processStreams();
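The stream calls above (inputStreams, outputStream, kernelProgram, processStreams) are the slide's pseudo-API, not a real library. A minimal C sketch of the idea, with hypothetical names, might look as follows: the kernel is configured once, then swept uniformly over every stream element, with no per-element instruction sequencing chosen by the programmer.

```c
#include <stddef.h>

/* Hypothetical stream framework: the kernel is fixed up front. */
typedef float (*kernel_fn)(float a, float b);

static float op_add(float a, float b) { return a + b; }

/* The "data-stream" view of C = A + B: one kernel applied to all
 * elements.  The loop is an implementation detail; conceptually the
 * elements are independent and could be processed in parallel. */
static void processStreams(kernel_fn kernel, const float *A,
                           const float *B, float *C, size_t n)
{
    for (size_t i = 0; i < n; i++)
        C[i] = kernel(A[i], B[i]);
}
```

A call such as processStreams(op_add, A, B, C, WIDTH * HEIGHT) then stands in for the four pseudo-API lines above.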
Data-Stream-Based Processing

[Figure: memory → Processor → memory; inside the processor the data streams pass through several parallel pipelines, and the kernel is supplied once as a configuration instead of being fetched as an instruction stream]
Architectures: Data – Processor Locality

• Field Programmable Gate Array (FPGA)
  – Compute by configuring Boolean functions and local memory
• Processor Array / Multi-core Processor
  – Assemble many (simple) processors and memories on one chip
• Processor-in-Memory (PIM)
  – Insert processing elements directly into RAM chips
• Stream Processor
  – Create data locality through a hierarchy of memories
Overview
• Computation / Bandwidth / Power
• CPU – GPU Comparison
• GPU Characteristics
The GPU is a Fast, Parallel Array Processor

Input arrays: 1D, 2D (typical), 3D
  ↓
Vertex Processor (VP): kernel changes index regions of output arrays
  ↓
Rasterizer: creates data streams from index regions
  ↓ (stream of array elements, order unknown)
Fragment Processor (FP): kernel changes each datum independently, reads more input arrays
  ↓
Output arrays: 1D, 2D (typical), 3D (slice)
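As a rough CPU-side analogy of the last two stages (a sketch with illustrative names, not actual GPU code): the rasterizer enumerates every index inside the output region, and the FP kernel runs once per element, free to gather from the input arrays but unaware of the iteration order.

```c
/* FP-kernel analogy: compute one output datum from gathered inputs. */
typedef float (*fp_kernel)(const float *in, int width, int x, int y);

/* Rasterizer analogy: stream all indices of a rectangular output
 * region through the kernel.  On the GPU the elements are processed
 * in parallel and in an order unknown to the kernel. */
static void run_region(fp_kernel k, const float *in, float *out,
                       int width, int x0, int y0, int x1, int y1)
{
    for (int y = y0; y < y1; y++)
        for (int x = x0; x < x1; x++)
            out[y * width + x] = k(in, width, x, y);
}

/* Example kernel: average of the four neighbors.  The caller must
 * keep the region one element away from the array border. */
static float avg_neighbors(const float *in, int width, int x, int y)
{
    return 0.25f * (in[y * width + x - 1] + in[y * width + x + 1] +
                    in[(y - 1) * width + x] + in[(y + 1) * width + x]);
}
```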
Index Regions in Output Arrays

• Quads and triangles: the fastest option
• Line segments: slower; try to pair lines into 2×h or w×2 quads
• Point clouds: slowest; try to gather points into larger forms
High-Level Graphics Language for the Kernels

• Float data types: half 16-bit (s10e5), float 32-bit (s23e8)
• Vectors, structs, and arrays: float4, float vec[6], float3x4, float arr[5][3], struct {…}
• Arithmetic and logic operators: +, -, *, /; &&, ||, !
• Trigonometric and exponential functions: sin, asin, exp, log, pow, …
• User-defined functions: max3(float a, float b, float c) { return max(a, max(b, c)); }
• Conditional statements and loops: if, for, while; dynamic branching in PS3 (Pixel Shader 3.0)
• Streaming and random data access
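The language sketched above is a Cg/HLSL-style shading language, so the snippets are not plain C (float4 and max are built in there). Purely as an illustration, a C approximation of such a kernel, with the built-ins written out by hand, could look like this:

```c
#include <math.h>

/* C stand-ins for shading-language built-ins. */
typedef struct { float x, y, z, w; } float4;

static float max2(float a, float b) { return a > b ? a : b; }

/* The user-defined function from the slide, spelled out in C. */
static float max3(float a, float b, float c)
{
    return max2(a, max2(b, c));
}

/* A kernel-style function combining the listed features: vector
 * data, arithmetic, trig/exponential calls, and a conditional. */
static float4 example_kernel(float4 v)
{
    float m = max3(v.x, v.y, v.z);
    if (m > 0.0f)                /* conditional statement */
        m = logf(1.0f + m);      /* exponential-family function */
    float4 r = { sinf(v.x), expf(v.y), powf(v.z, 2.0f), m };
    return r;
}
```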
Input and Output Arrays

CPU:
• Input and output arrays may overlap

GPU:
• Input and output arrays must not overlap

[Figure: on the CPU the input and output regions may share memory; on the GPU they are disjoint arrays]
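One practical consequence, shown here as my own illustration rather than the slide's: an update that could be done in place on the CPU must use two separate arrays on the GPU, whose roles are swapped between passes (commonly called double buffering, or ping-pong).

```c
#include <stddef.h>

/* One GPU-style pass: src is read-only, dst is write-only, and the
 * two must not overlap.  Requires n >= 2. */
static void smooth_pass(const float *src, float *dst, size_t n)
{
    dst[0] = src[0];             /* carry the boundary values over */
    dst[n - 1] = src[n - 1];
    for (size_t i = 1; i + 1 < n; i++)
        dst[i] = 0.5f * src[i] + 0.25f * (src[i - 1] + src[i + 1]);
}

/* Double buffering: instead of writing into the array being read,
 * swap the two buffers after every pass. */
static void smooth(float *a, float *b, size_t n, int passes)
{
    for (int p = 0; p < passes; p++) {
        smooth_pass(a, b, n);
        float *tmp = a; a = b; b = tmp;  /* ping-pong */
    }
}
```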
Native Memory Layout – Data Locality

CPU:
• 1D input, 1D output
• Higher dimensions with offsets

GPU:
• 1D, 2D, 3D input
• 2D output
• Other dimensions with offsets

[Figure: input and output arrays with locality color-coded from red (near) to blue (far)]
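"Higher dimensions with offsets" is plain index arithmetic on a flat array. As an illustration (not from the slide), here is the usual CPU 1D flattening next to one common GPU-style layout: tiling the z-slices of a 3D volume side by side in a single wide 2D array.

```c
#include <stddef.h>

/* CPU style: a 3D volume stored flat in 1D.  Element (x, y, z)
 * lives at offset x + nx*(y + ny*z). */
static size_t idx_1d(size_t x, size_t y, size_t z,
                     size_t nx, size_t ny)
{
    return x + nx * (y + ny * z);
}

/* GPU style: the same volume as a 2D array with the nz slices laid
 * side by side.  The 2D array is (nx * nz) wide and ny high;
 * element (x, y, z) sits in row y, column z*nx + x. */
static size_t idx_2d_tiled(size_t x, size_t y, size_t z,
                           size_t nx, size_t nz)
{
    size_t width = nx * nz;
    return y * width + (z * nx + x);
}
```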
Data-Flow: Gather and Scatter

CPU:
• Arbitrary gather
• Arbitrary scatter

GPU:
• Arbitrary gather
• Restricted scatter
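In code, the distinction (a minimal sketch) is simply which side of the assignment carries the indirection: gather has it on the read, scatter on the write, and the write side is where the GPU above is restricted.

```c
#include <stddef.h>

/* Gather: the indirection is on the read side.  Each output element
 * chooses where it reads from. */
static void gather(float *out, const float *in,
                   const size_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[idx[i]];
}

/* Scatter: the indirection is on the write side.  Each input element
 * chooses where it writes to; this is the restricted operation on
 * the GPU described above. */
static void scatter(float *out, const float *in,
                    const size_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[idx[i]] = in[i];
}
```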
Overview
• Computation / Bandwidth / Power
• CPU – GPU Comparison
• GPU Characteristics
1) Computational Performance

[Chart: peak GFLOPS over time for CPUs and GPUs, up to the ATI R520; chart courtesy of John Owens]

Note: sustained performance is usually much lower and depends heavily on the memory system!
2) Memory Performance

CPU:
• Large cache
• Few processing elements
• Optimized for spatial and temporal data reuse

GPU:
• Small cache
• Many processing elements
• Optimized for sequential (streaming) data access

[Chart: memory bandwidth of the GeForce 7800 GTX vs. the Pentium 4 for cached, sequential, and random access; chart courtesy of Ian Buck]
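The sequential/random split in the chart can be reproduced in miniature with a sketch like the following (my illustration, not the chart's actual benchmark): both functions do the same additions over the same data, but the random-order version defeats prefetching and cache lines, so timing the two side by side exposes the gap the slide is pointing at.

```c
#include <stddef.h>

/* Sequential access: the streaming pattern the GPU's memory system
 * is built around, and the one CPU prefetchers handle well too. */
static float sum_sequential(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Random access: identical work, but the order comes from 'perm',
 * a permutation of 0..n-1, so prefetching and cache lines help far
 * less. */
static float sum_random(const float *a, const size_t *perm, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[perm[i]];
    return s;
}
```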
3) Configuration Overhead

[Chart: for small streams the runtime is configuration-limited, for large streams it is computation-limited; chart courtesy of Ian Buck]
Conclusions
• Parallelism is now indispensable for further performance increases
• Both memory-dominated and processing-element-dominated designs have pros and cons
• Mapping algorithms to the appropriate architecture allows enormous speedups
• Many of the GPU's restrictions are crucial for its parallel efficiency (you cannot have your cake and eat it, too)