GPU and the Brick Wall
Graphics Processing Unit (GPU) Architecture and Programming
TU/e 5kk73
Zhenyu Ye
Bart Mesman
Henk Corporaal
2010-11-08
Today's Topics
• GPU architecture
• GPU programming
• GPU micro-architecture
• Performance optimization and model
• Trends
System Architecture
GPU Architecture
NVIDIA Fermi, 512 Processing Elements (PEs)
What Can It Do?
Render triangles.
NVIDIA GTX480 can render 1.6 billion triangles per second!
General-Purpose Computing
ref: http://www.nvidia.com/object/tesla_computing_solutions.html
The Vision of NVIDIA
"Within the next few years, there will be single-chip graphics devices more powerful and versatile than any graphics system that has ever been built, at any price."
-- David Kirk, NVIDIA, 1998
ref: http://www.llnl.gov/str/JanFeb05/Seager.html
Single-Chip GPU vs. Fastest Supercomputers
Top500 Supercomputers in June 2010
GPU Will Top the List in Nov 2010
The Gap Between CPU and GPU
ref: Tesla GPU Computing Brochure
GPU Has 10x Computational Density
Given the same chip area, the achievable performance of a GPU is 10x higher than that of a CPU.
Evolution of Intel Pentium
Pentium I, Pentium II, Pentium III, Pentium IV
Chip area breakdown
Q: What can you observe? Why?
Extrapolation of Single-Core CPU
If we extrapolate the trend, in a few generations the Pentium would look like:
Of course, we know it did not happen.
Q: What happened instead? Why?
Evolution of Multi-core CPUs
Penryn, Bloomfield, Gulftown, Beckton
Chip area breakdown
Q: What can you observe? Why?
Let's Take a Closer Look
Less than 10% of the total chip area is used for actual execution.
Q: Why?
The Memory Hierarchy
Notes on energy at 45nm:
A 64-bit integer ADD takes about 1 pJ.
A 64-bit FP FMA takes about 200 pJ.
It seems we cannot further increase the computational density.
The Brick Wall -- UC Berkeley's View
Power Wall: power is expensive, transistors are free.
Memory Wall: memory is slow, multiplies are fast.
ILP Wall: diminishing returns on more ILP hardware.
David Patterson, "Computer Architecture is Back - The Berkeley View of the Parallel Computing Research Landscape", Stanford EE Computer Systems Colloquium, Jan 2007, link
Power Wall + Memory Wall + ILP Wall = Brick Wall
How to Break the Brick Wall?
Hint: how to exploit the parallelism inside the application?
Step 1: Trade Latency with Throughput
Hide the memory latency through fine-grained interleaved threading.
Interleaved Multi-threading
Interleaved Multi-threading
The granularity of interleaved multi-threading:
• 100 cycles: hide off-chip memory latency
• 10 cycles: + hide cache latency
• 1 cycle: + hide branch latency, instruction dependency
Fine-grained interleaved multi-threading:
Pros: removes the branch predictor, OOO scheduler, and large cache.
Cons: register pressure, etc.
Fine-Grained Interleaved Threading
Pros: reduced cache size, no branch predictor, no OOO scheduler.
Cons: register pressure, thread scheduling, requires huge parallelism.
Without and with fine-grained interleaved threading
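The effect of fine-grained interleaving can be sketched with a toy scheduler in Python. The latency figure, instruction counts, and round-robin policy below are illustrative assumptions, not NVIDIA's actual design:

```python
from collections import deque

MEM_LATENCY = 100          # assumed off-chip load latency in cycles

def utilization(num_threads, insts_per_thread=100):
    """Fraction of cycles in which a single scalar pipeline issues.

    Each thread issues one instruction, then waits MEM_LATENCY cycles
    for its load before it may issue again. Threads are interleaved
    round-robin with zero switch overhead.
    """
    queue = deque(range(num_threads))       # round-robin thread order
    ready_at = [0] * num_threads            # cycle a thread may issue next
    remaining = [insts_per_thread] * num_threads
    issued = cycle = 0
    while any(r > 0 for r in remaining):
        for _ in range(len(queue)):         # scan for the next ready thread
            t = queue[0]
            queue.rotate(-1)                # move t to the back of the queue
            if remaining[t] > 0 and ready_at[t] <= cycle:
                remaining[t] -= 1
                ready_at[t] = cycle + 1 + MEM_LATENCY
                issued += 1
                break
        cycle += 1
    return issued / cycle

print(utilization(1))     # 0.01: one thread stalls ~100 of every 101 cycles
print(utilization(128))   # ~1.0: enough threads to hide the latency entirely
```

With one thread the pipeline idles during every load; with more threads than the latency window, the stalls vanish, which is exactly the latency-for-throughput trade of the previous slides.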
HW Support
The register file supports zero-overhead context switches between interleaved threads.
Can We Make Further Improvement?
Reducing the large cache gives 2x computational density.
Q: Can we make further improvements?
Hint:We have only utilized thread level parallelism (TLP) so far.
Step 2: Single Instruction Multiple Data
SSE has 4 data lanes; a GPU has 8/16/24/... data lanes.
GPU uses wide SIMD: 8/16/24/... processing elements (PEs).
CPU uses short SIMD: usually a vector width of 4.
Hardware Support
Supporting interleaved threading + SIMD execution
Single Instruction Multiple Thread (SIMT)
Hide vector width using scalar threads.
Example of SIMT Execution
Assume 32 threads are grouped into one warp.
Step 3: Simple Core
The Streaming Multiprocessor (SM) is a lightweight core compared to an IA core.
Lightweight PE: Fused Multiply-Add (FMA)
SFU:Special Function Unit
NVIDIA's Motivation of Simple Core
"This [multiple IA-core] approach is analogous to trying to build an airplane by putting wings on a train."
--Bill Dally, NVIDIA
Review: How Did We Get Here?
NVIDIA Fermi, 512 Processing Elements (PEs)
Throughput Oriented Architectures
1. Fine-grained interleaved threading (~2x comp density)
2. SIMD/SIMT (>10x comp density)
3. Simple core (~2x comp density)
Key architectural features of throughput-oriented processors.
ref: Michael Garland and David B. Kirk, "Understanding throughput-oriented architectures", CACM 2010. (link)
Today's Topics
• GPU architecture
• GPU programming
• GPU micro-architecture
• Performance optimization and model
• Trends
CUDA Programming
A massive number (>10,000) of lightweight threads.
Express Data Parallelism in Threads
Compare the thread program with the vector program.
Vector Program
Scalar program:
float A[4][8];
do-all(i=0;i<4;i++){
  do-all(j=0;j<8;j++){
    A[i][j]++;
  }
}
Vector program (vector width of 8):
float A[4][8];
do-all(i=0;i<4;i++){
  movups xmm0, [ &A[i][0] ]
  incps xmm0
  movups [ &A[i][0] ], xmm0
}
Vector width is exposed to programmers.
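The contrast can be emulated in Python. `vec_inc8` below is a hypothetical stand-in for the movups/incps/movups sequence (the slide's pseudo-SSE uses an 8-wide register for illustration): it works only on chunks of exactly 8 floats, so the width is hard-coded by the programmer, just as real vector code hard-codes the register width:

```python
def scalar_inc(A):                 # do-all(i) do-all(j): A[i][j]++
    for i in range(4):
        for j in range(8):
            A[i][j] += 1.0

def vec_inc8(row, base):           # one 8-wide load / increment / store
    row[base:base + 8] = [x + 1.0 for x in row[base:base + 8]]

def vector_inc(A):                 # do-all(i): one vector op per row
    for i in range(4):
        vec_inc8(A[i], 0)

A = [[0.0] * 8 for _ in range(4)]
B = [[0.0] * 8 for _ in range(4)]
scalar_inc(A)
vector_inc(B)
print(A == B)   # True: same result, but the vector code bakes in width 8
```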
CUDA Program
Scalar program:
float A[4][8];
do-all(i=0;i<4;i++){
  do-all(j=0;j<8;j++){
    A[i][j]++;
  }
}
CUDA program:
float A[4][8];
kernelF<<<(4,1),(8,1)>>>(A);
__global__ kernelF(A){
  i = blockIdx.x;
  j = threadIdx.x;
  A[i][j]++;
}
• A CUDA program expresses data-level parallelism (DLP) in terms of thread-level parallelism (TLP).
• Hardware converts TLP into DLP at run time.
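A minimal Python emulation (not real CUDA) of the kernelF<<<(4,1),(8,1)>>>(A) launch shows the idea: every (blockIdx, threadIdx) pair runs the same scalar body, and the hardware is free to execute any group of these threads in lock-step as one SIMD operation. `launch` is a made-up helper:

```python
def launch(kernel, grid_x, block_x, *args):
    for bx in range(grid_x):            # thread blocks, any order
        for tx in range(block_x):       # threads within a block, any order
            kernel(bx, tx, *args)

def kernelF(blockIdx_x, threadIdx_x, A):
    i = blockIdx_x                      # one scalar thread per element
    j = threadIdx_x
    A[i][j] += 1

A = [[0] * 8 for _ in range(4)]
launch(kernelF, 4, 8, A)
print(all(v == 1 for row in A for v in row))   # True: every element touched once
```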
Two Levels of Thread Hierarchy
kernelF<<<(4,1),(8,1)>>>(A);
__global__ kernelF(A){
  i = blockIdx.x;
  j = threadIdx.x;
  A[i][j]++;
}
Multi-dimension Thread and Block ID
kernelF<<<(2,2),(4,2)>>>(A);
__global__ kernelF(A){
  i = gridDim.x * blockIdx.y + blockIdx.x;
  j = blockDim.x * threadIdx.y + threadIdx.x;
  A[i][j]++;
}
Both the grid and the thread block can have a two-dimensional index.
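The 2-D index math can be checked exhaustively in plain Python: using gridDim.x and blockDim.x as the scaling factors (i = gridDim.x*blockIdx.y + blockIdx.x, j = blockDim.x*threadIdx.y + threadIdx.x), the <<<(2,2),(4,2)>>> launch touches every element of A[4][8] exactly once:

```python
gridDim  = (2, 2)   # 2x2 grid of blocks
blockDim = (4, 2)   # 4x2 threads per block

hits = [[0] * 8 for _ in range(4)]
for by in range(gridDim[1]):
    for bx in range(gridDim[0]):
        for ty in range(blockDim[1]):
            for tx in range(blockDim[0]):
                i = gridDim[0] * by + bx      # linear block index -> row
                j = blockDim[0] * ty + tx     # linear thread index -> column
                hits[i][j] += 1
print(all(h == 1 for row in hits for h in row))  # True: a perfect 1-to-1 cover
```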
Scheduling Thread Blocks on SMs
Example: scheduling 4 thread blocks on 3 SMs.
Executing Thread Block on SM
Executed on machine with width of 4:
Executed on machine with width of 8:
Note: the number of Processing Elements (PEs) is transparent to the programmer.
kernelF<<<(2,2),(4,2)>>>(A);
__global__ kernelF(A){
  i = gridDim.x * blockIdx.y + blockIdx.x;
  j = blockDim.x * threadIdx.y + threadIdx.x;
  A[i][j]++;
}
Multiple Levels of Memory Hierarchy

| Name | Cached? | Latency (cycles) | Access |
|---|---|---|---|
| Global | L1/L2 | 200~400 (cache miss) | R/W |
| Shared | No | 1~3 | R/W |
| Constant | Yes | 1~3 | Read-only |
| Texture | Yes | ~100 | Read-only |
| Local | L1/L2 | 200~400 (cache miss) | R/W |
Explicit Management of Shared Mem
Shared memory is frequently used to exploit locality.
Shared Memory and Synchronization
kernelF<<<(1,1),(16,16)>>>(A);
__global__ kernelF(A){
  __shared__ float smem[16][16]; // allocate smem
  i = threadIdx.y;
  j = threadIdx.x;
  smem[i][j] = A[i][j];
  __syncthreads();
  A[i][j] = ( smem[i-1][j-1] + smem[i-1][j]
              ... + smem[i+1][j+1] ) / 9;
}
Example: average filter with 3x3 window
3x3 window on image
Image data in DRAM
Shared Memory and Synchronization
kernelF<<<(1,1),(16,16)>>>(A);
__global__ kernelF(A){
  __shared__ float smem[16][16];
  i = threadIdx.y;
  j = threadIdx.x;
  smem[i][j] = A[i][j]; // load to smem
  __syncthreads();      // threads wait at the barrier
  A[i][j] = ( smem[i-1][j-1] + smem[i-1][j]
              ... + smem[i+1][j+1] ) / 9;
}
Example: average filter over 3x3 window
3x3 window on image
Stage data in shared mem
Shared Memory and Synchronization
kernelF<<<(1,1),(16,16)>>>(A);
__global__ kernelF(A){
  __shared__ float smem[16][16];
  i = threadIdx.y;
  j = threadIdx.x;
  smem[i][j] = A[i][j];
  __syncthreads(); // every thread is ready
  A[i][j] = ( smem[i-1][j-1] + smem[i-1][j]
              ... + smem[i+1][j+1] ) / 9;
}
Example: average filter over 3x3 window
3x3 window on image
all threads finish the load
Shared Memory and Synchronization
kernelF<<<(1,1),(16,16)>>>(A);
__global__ kernelF(A){
  __shared__ float smem[16][16];
  i = threadIdx.y;
  j = threadIdx.x;
  smem[i][j] = A[i][j];
  __syncthreads();
  A[i][j] = ( smem[i-1][j-1] + smem[i-1][j]
              ... + smem[i+1][j+1] ) / 9;
}
Example: average filter over 3x3 window
3x3 window on image
Start computation
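A serial Python reference of the same 3x3 average filter makes the two phases explicit; the barrier is implicit here because phase one fully completes before phase two starts. Border pixels are clamped to the edge, a detail the slide's 16x16 kernel leaves out:

```python
N = 16   # tile size, matching the (16,16) thread block

def average_filter(A):
    smem = [row[:] for row in A]            # phase 1: stage tile into "smem"
    out = [[0.0] * N for _ in range(N)]
    for i in range(N):                      # phase 2: every "thread" computes
        for j in range(N):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii = min(max(i + di, 0), N - 1)   # clamp at the borders
                    jj = min(max(j + dj, 0), N - 1)
                    acc += smem[ii][jj]
            out[i][j] = acc / 9
    return out

flat = average_filter([[5.0] * N for _ in range(N)])
print(flat[8][8])   # 5.0: a constant image is unchanged by averaging
```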
Programmers Think in Threads
Q: Why go through this hassle?
Why Use Threads instead of Vectors?
Thread pros:
• Portability: the machine width is transparent in the ISA.
• Productivity: programmers do not need to care about the vector width of the machine.

Thread cons:
• Manual synchronization: gives up lock-step execution within a vector.
• Thread scheduling can be inefficient.
• Debugging: "threads considered harmful"; thread programs are notoriously hard to debug.
Features of CUDA
• Programmers explicitly express DLP in terms of TLP.
• Programmers explicitly manage the memory hierarchy.
• etc.
Today's Topics
• GPU architecture
• GPU programming
• GPU micro-architecture
• Performance optimization and model
• Trends
Micro-architecture
GF100 micro-architecture
HW Groups Threads Into Warps
Example: 32 threads per warp
Example of Implementation
Note: NVIDIA may use a more complicated implementation.
Example

Program:
0x0004: add r0, r1, r2
0x0008: sub r3, r4, r5
Assume warp 0 and warp 1 are scheduled for execution.
Read Src Op
Read source operands:
r1 for warp 0
r4 for warp 1
Buffer Src Op
Push operands to the operand collector:
r1 for warp 0
r4 for warp 1
Read Src Op
Read source operands:
r2 for warp 0
r5 for warp 1
Buffer Src Op
Push operands to the operand collector:
r2 for warp 0
r5 for warp 1
Execute
Compute the first 16 threads in the warp.
Execute
Compute the last 16 threads in the warp.
Write Back
Write back:
r0 for warp 0
r3 for warp 1
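The walk-through above can be sketched in Python: a 32-thread warp executes one instruction on 16 processing elements, so the hardware issues the first 16 threads in one pass and the last 16 in a second pass. Register names and values are made up for illustration:

```python
WARP_SIZE, NUM_PES = 32, 16

def execute_warp(regfile, dst, src1, src2, op):
    """Run one instruction for a whole warp on NUM_PES lanes."""
    passes = []
    for base in range(0, WARP_SIZE, NUM_PES):   # 32 threads / 16 PEs = 2 passes
        lanes = range(base, base + NUM_PES)
        for t in lanes:                         # one PE per thread in this pass
            regfile[dst][t] = op(regfile[src1][t], regfile[src2][t])
        passes.append(list(lanes))
    return passes

# per-thread register values (illustrative): r1[t] = r2[t] = t
regs = {r: list(range(WARP_SIZE)) for r in ("r0", "r1", "r2")}
passes = execute_warp(regs, "r0", "r1", "r2", lambda a, b: a + b)
print(len(passes), regs["r0"][5])   # 2 passes; r0[5] = 5 + 5 = 10
```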
Other High-Performance GPUs
• ATI Radeon 5000 series.
ATI Radeon 5000 Series Architecture
Radeon SIMD Engine
• 16 Stream Cores (SCs)
• Local Data Share
VLIW Stream Core (SC)
Local Data Share (LDS)
Today's Topics
• GPU architecture
• GPU programming
• GPU micro-architecture
• Performance optimization and model
• Trends
Performance Optimization

Optimizations on memory latency tolerance:
• Reduce register pressure
• Reduce shared memory pressure

Optimizations on memory bandwidth:
• Global memory coalescing
• Avoid shared memory bank conflicts
• Group byte accesses
• Avoid partition camping

Optimizations on computation efficiency:
• Mul/Add balancing
• Increase floating-point proportion

Optimizations on operational intensity:
• Use tiled algorithms
• Tune thread granularity
Shared Mem Contains Multiple Banks
Compute Capability
Architecture information is needed to perform optimizations.
ref: NVIDIA, "CUDA C Programming Guide", (link)
Shared Memory (compute capability 2.x)
without bank conflict:
with bank conflict:
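On compute capability 2.x, shared memory has 32 banks of 4-byte words: word address w lives in bank w % 32, and threads of a warp touching different words in the same bank serialize (touching the same word is a broadcast, not a conflict). A quick conflict counter, as a sketch:

```python
NUM_BANKS = 32

def max_conflict(word_addrs):
    """Worst-case serialization factor for one warp's word addresses."""
    per_bank = {}
    for w in word_addrs:
        # distinct words per bank; the same word broadcast is free
        per_bank.setdefault(w % NUM_BANKS, set()).add(w)
    return max(len(words) for words in per_bank.values())

print(max_conflict([t for t in range(32)]))       # 1: stride 1, conflict-free
print(max_conflict([2 * t for t in range(32)]))   # 2: stride 2, 2-way conflict
print(max_conflict([32 * t for t in range(32)]))  # 32: stride 32, fully serialized
```

A stride-32 column walk through a 32-wide shared array is the classic worst case; padding the array to 33 columns restores stride co-primality with the bank count.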
Performance Optimization

Optimizations on memory latency tolerance:
• Reduce register pressure
• Reduce shared memory pressure

Optimizations on memory bandwidth:
• Global memory alignment and coalescing
• Avoid shared memory bank conflicts
• Group byte accesses
• Avoid partition camping

Optimizations on computation efficiency:
• Mul/Add balancing
• Increase floating-point proportion

Optimizations on operational intensity:
• Use tiled algorithms
• Tune thread granularity
Global Memory in Off-Chip DRAM
The address space is interleaved among multiple channels.
Global Memory
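The cost of a global-memory access pattern can be approximated by counting how many memory segments one warp touches. The 128-byte segment size below is an assumption (it matches Fermi's L1 line size), and `num_transactions` is a hypothetical helper, not a CUDA API:

```python
SEGMENT = 128  # bytes per memory transaction (assumed)

def num_transactions(byte_addrs):
    """Distinct 128-byte segments touched by one warp's accesses."""
    return len({a // SEGMENT for a in byte_addrs})

coalesced = [4 * t for t in range(32)]     # 32 consecutive 4-byte floats
strided   = [128 * t for t in range(32)]   # one float per segment
print(num_transactions(coalesced))  # 1: the whole warp served in one transaction
print(num_transactions(strided))    # 32: every thread pays a full transaction
```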
Roofline Model
Identify the performance bottleneck: computation bound vs. bandwidth bound.
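The Roofline model fits in one line: attainable throughput is capped either by peak compute or by operational intensity times peak bandwidth. The peak numbers below are illustrative, not a specific GPU's:

```python
PEAK_GFLOPS = 1000.0   # compute roof (assumed)
PEAK_GBS    = 150.0    # memory bandwidth roof in GB/s (assumed)

def attainable_gflops(oi):
    """Roofline: oi is operational intensity in flops per byte moved."""
    return min(PEAK_GFLOPS, oi * PEAK_GBS)

ridge = PEAK_GFLOPS / PEAK_GBS       # intensity where the two roofs meet
print(attainable_gflops(1.0))        # 150.0: bandwidth bound
print(attainable_gflops(20.0))       # 1000.0: compute bound
```

Kernels left of the ridge point benefit from bandwidth optimizations (coalescing, avoiding bank conflicts); kernels right of it benefit from computation optimizations (Mul/Add balancing).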
Optimization Is Key for Attainable GFLOP/s
Computation, Bandwidth, Latency
Illustrating the three bottlenecks in the Roofline model.
Today's Topics
• GPU architecture
• GPU programming
• GPU micro-architecture
• Performance optimization and model
• Trends
Trends
Coming architectures:
• Intel's Larrabee successor: Many Integrated Core (MIC)
• CPU/GPU fusion: Intel Sandy Bridge, AMD Llano
Intel Many Integrated Core (MIC)
32-core version of MIC:
Intel Sandy Bridge
Highlights:
• Reconfigurable L3 shared between CPU and GPU
• Ring bus
Sandy Bridge's New CPU-GPU interface
ref: "Intel's Sandy Bridge Architecture Exposed", from Anandtech, (link)
AMD Llano Fusion APU (expected Q3 2011)
Notes:
• CPU and GPU do not share a cache?
• Unknown interface between CPU and GPU
GPU Research in ES Group
GPU research in the Electronic Systems group:
http://www.es.ele.tue.nl/~gpuattue/