![Page 1: A Case for Core-Assisted Bottleneck Acceleration in GPUs · A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar](https://reader033.vdocument.in/reader033/viewer/2022041918/5e6aae47e3e68a6e7716cc30/html5/thumbnails/1.jpg)
A Case for Core-Assisted Bottleneck Acceleration in GPUs
Enabling Flexible Data Compression with Assist Warps
Nandita Vijaykumar
Gennady Pekhimenko, Adwait Jog, Abhishek Bhowmick,
Rachata Ausavarangnirun, Chita Das, Mahmut Kandemir,
Todd C. Mowry, Onur Mutlu
![Page 2: A Case for Core-Assisted Bottleneck Acceleration in GPUs · A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar](https://reader033.vdocument.in/reader033/viewer/2022041918/5e6aae47e3e68a6e7716cc30/html5/thumbnails/2.jpg)
Executive Summary
Observation: Imbalances in execution leave GPU resources underutilized
Our Goal: Employ underutilized GPU resources to do something useful – accelerate bottlenecks using helper threads
Challenge: How do you efficiently manage and use helper threads in a throughput-oriented architecture?
Our Solution: CABA (Core-Assisted Bottleneck Acceleration)
A new framework to enable helper threading in GPUs
Enables flexible data compression to alleviate the memory bandwidth bottleneck
A wide set of use cases (e.g., prefetching, memoization)
Key Results: Using CABA to implement data compression in memory improves performance by 41.7%
![Page 3: A Case for Core-Assisted Bottleneck Acceleration in GPUs · A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar](https://reader033.vdocument.in/reader033/viewer/2022041918/5e6aae47e3e68a6e7716cc30/html5/thumbnails/3.jpg)
GPUs today are used for a wide range of applications …
Computer Vision
Data Analytics
Scientific Simulation
Medical Imaging
![Page 4: A Case for Core-Assisted Bottleneck Acceleration in GPUs · A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar](https://reader033.vdocument.in/reader033/viewer/2022041918/5e6aae47e3e68a6e7716cc30/html5/thumbnails/4.jpg)
Challenges in GPU Efficiency
[Figure: a GPU streaming multiprocessor (threads 0 to 4, register file, cores) attached to the memory hierarchy, with components labeled “Full!” and “Idle!”]
Thread limits lead to an underutilized register file
The memory bandwidth bottleneck leads to idle cores
![Page 5: A Case for Core-Assisted Bottleneck Acceleration in GPUs · A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar](https://reader033.vdocument.in/reader033/viewer/2022041918/5e6aae47e3e68a6e7716cc30/html5/thumbnails/5.jpg)
Motivation: Unutilized On-chip Memory
24% of the register file is unallocated on average
Similar trends for on-chip scratchpad memory
[Figure: % unallocated registers per application; y-axis from 0% to 100%]
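As a rough illustration of how a thread limit can leave registers unallocated, the sketch below computes the unallocated fraction of one SM's register file. The 32768 registers and the limit of 48 warps of 32 threads are taken from the Methodology slide; the per-thread register counts are made-up inputs, and the model is an assumption that ignores block-level allocation granularity.

```python
def unallocated_fraction(regs_per_thread, total_regs=32768,
                         max_threads=48 * 32):
    """Fraction of the register file left unallocated on one SM.

    Simplified model (an assumption, not the paper's accounting):
    the SM runs as many threads as fit, capped by the thread limit,
    and every thread gets exactly regs_per_thread registers.
    """
    threads_by_regs = total_regs // regs_per_thread
    resident_threads = min(max_threads, threads_by_regs)
    allocated = resident_threads * regs_per_thread
    return 1 - allocated / total_regs


for regs in (16, 20, 24):  # hypothetical kernels
    print(regs, f"{unallocated_fraction(regs):.1%}")
```

In this model, a kernel that needs few registers per thread hits the thread limit first, so part of the register file stays unallocated; a register-hungry kernel fills the file before reaching the limit.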
![Page 6: A Case for Core-Assisted Bottleneck Acceleration in GPUs · A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar](https://reader033.vdocument.in/reader033/viewer/2022041918/5e6aae47e3e68a6e7716cc30/html5/thumbnails/6.jpg)
Motivation: Idle Pipelines
[Figure: % of cycles split into Active and Stalls, per application]
Memory Bound (CONS, JPEG, LPS, MUM, RAY, SCP, PVC, PVR, bfs, Avg.): 67% of cycles idle
Compute Bound (NN, STO, bp, hs, dmr, NQU, SLA, lc, pt, mc): 35% of cycles idle
![Page 7: A Case for Core-Assisted Bottleneck Acceleration in GPUs · A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar](https://reader033.vdocument.in/reader033/viewer/2022041918/5e6aae47e3e68a6e7716cc30/html5/thumbnails/7.jpg)
Motivation: Summary
Heterogeneous application requirements lead to:
Bottlenecks in execution
Idle resources
![Page 8: A Case for Core-Assisted Bottleneck Acceleration in GPUs · A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar](https://reader033.vdocument.in/reader033/viewer/2022041918/5e6aae47e3e68a6e7716cc30/html5/thumbnails/8.jpg)
Our Goal
[Figure: helper threads placed on the cores and register file, next to the memory hierarchy]
Use idle resources to do something useful: accelerate bottlenecks using helper threads
A flexible framework to enable helper threading in GPUs: Core-Assisted Bottleneck Acceleration (CABA)
![Page 9: A Case for Core-Assisted Bottleneck Acceleration in GPUs · A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar](https://reader033.vdocument.in/reader033/viewer/2022041918/5e6aae47e3e68a6e7716cc30/html5/thumbnails/9.jpg)
Helper threads in GPUs
Large body of work in CPUs …
[Chappell+ ISCA ’99, MICRO ’02], [Yang+ USC TR ’98],
[Dubois+ CF ’04], [Zilles+ ISCA ’01], [Collins+ ISCA ’01,
MICRO ’01], [Aamodt+ HPCA ’04], [Lu+ MICRO ’05],
[Luk+ ISCA ’01], [Moshovos+ ICS ’01], [Kamruzzaman+
ASPLOS ’11], etc.
However, there are new challenges with GPUs…
![Page 10: A Case for Core-Assisted Bottleneck Acceleration in GPUs · A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar](https://reader033.vdocument.in/reader033/viewer/2022041918/5e6aae47e3e68a6e7716cc30/html5/thumbnails/10.jpg)
Challenge
How do you efficiently
manage and use helper threads
in a throughput-oriented architecture?
![Page 11: A Case for Core-Assisted Bottleneck Acceleration in GPUs · A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar](https://reader033.vdocument.in/reader033/viewer/2022041918/5e6aae47e3e68a6e7716cc30/html5/thumbnails/11.jpg)
Managing Helper Threads in GPUs
Thread
Warp
Block Software
Hardware
Where do we add helper threads?
![Page 12: A Case for Core-Assisted Bottleneck Acceleration in GPUs · A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar](https://reader033.vdocument.in/reader033/viewer/2022041918/5e6aae47e3e68a6e7716cc30/html5/thumbnails/12.jpg)
Approach #1: Software-only
Regular threads
Helper threads
No hardware changes
Coarse grained
Not aware of runtime program behavior
Synchronization is difficult
![Page 13: A Case for Core-Assisted Bottleneck Acceleration in GPUs · A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar](https://reader033.vdocument.in/reader033/viewer/2022041918/5e6aae47e3e68a6e7716cc30/html5/thumbnails/13.jpg)
Where Do We Add Helper Threads?
Thread
Warp
Block Software
Hardware
![Page 14: A Case for Core-Assisted Bottleneck Acceleration in GPUs · A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar](https://reader033.vdocument.in/reader033/viewer/2022041918/5e6aae47e3e68a6e7716cc30/html5/thumbnails/14.jpg)
Approach #2: Hardware-only
Fine-grained control
– Synchronization
– Enforcing priorities
Providing contexts efficiently is difficult
[Figure: CPU and GPU diagrams with the labels Core 0, Core 1, Reg File 0, Reg File 1, Cores, Register File, and Warps]
![Page 15: A Case for Core-Assisted Bottleneck Acceleration in GPUs · A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar](https://reader033.vdocument.in/reader033/viewer/2022041918/5e6aae47e3e68a6e7716cc30/html5/thumbnails/15.jpg)
CABA: An Overview
SW: “Tight coupling” of helper threads and regular threads
– Efficient context management
– Simpler data communication
HW: “Decoupled management” of helper threads and regular threads
– Dynamic management of threads
– Fine-grained synchronization
![Page 16: A Case for Core-Assisted Bottleneck Acceleration in GPUs · A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar](https://reader033.vdocument.in/reader033/viewer/2022041918/5e6aae47e3e68a6e7716cc30/html5/thumbnails/16.jpg)
CABA: 1. In Software
Helper threads:
Tightly coupled to regular threads
Simply instructions injected into the GPU pipelines
Share the same context as the regular threads
[Figure: regular threads and helper threads in a block, with their registers (Regs)]
Efficient context management
Simpler data communication
![Page 17: A Case for Core-Assisted Bottleneck Acceleration in GPUs · A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar](https://reader033.vdocument.in/reader033/viewer/2022041918/5e6aae47e3e68a6e7716cc30/html5/thumbnails/17.jpg)
CABA: 2. In Hardware
Helper threads:
Decoupled from regular threads
Tracked at the granularity of a warp – Assist Warp
Each regular (parent) warp can have different assist warps
[Figure: Parent Warp X with Assist Warp A and Assist Warp B]
Dynamic management of threads
Fine-grained synchronization
![Page 18: A Case for Core-Assisted Bottleneck Acceleration in GPUs · A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar](https://reader033.vdocument.in/reader033/viewer/2022041918/5e6aae47e3e68a6e7716cc30/html5/thumbnails/18.jpg)
Key Functionalities
Triggering and squashing assist warps
– Associating events with assist warps
Deploying active assist warps
– Scheduling instructions for execution
Enforcing priorities
– Between assist warps and parent warps
– Between different assist warps
![Page 19: A Case for Core-Assisted Bottleneck Acceleration in GPUs · A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar](https://reader033.vdocument.in/reader033/viewer/2022041918/5e6aae47e3e68a6e7716cc30/html5/thumbnails/19.jpg)
CABA: Mechanism
[Figure: GPU pipeline (Fetch, I-Cache, Decode, Instruction Buffer, Scoreboard, Issue, Scheduler, ALU, Mem, Writeback) together with the Assist Warp Store, Assist Warp Controller, and Assist Warp Buffer; arrows labeled Trigger and Deploy]
Assist Warp Store
Holds instructions for different assist warp routines
Assist Warp Controller
Central point of control for:
o Triggering assist warps
o Squashing them
Tracks progress for active assist warps
Assist Warp Buffer
Stages instructions from triggered assist warps for execution
Helps enforce priorities
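The division of labor among the three structures can be sketched as a toy software model. This is only an illustration under simplifying assumptions: the class and method names, the `decompress` routine, and its instruction names are hypothetical, and priorities and real pipeline timing are not modeled.

```python
from collections import deque


class AssistWarpStore:
    """Holds the instruction sequence for each assist warp routine."""
    def __init__(self, routines):
        self.routines = routines  # routine name -> list of instructions


class AssistWarpController:
    """Triggers and squashes assist warps and tracks their progress."""
    def __init__(self, store):
        self.store = store
        self.active = {}  # (parent warp, routine) -> next instruction index

    def trigger(self, parent_warp, routine):
        self.active[(parent_warp, routine)] = 0

    def squash(self, parent_warp, routine):
        self.active.pop((parent_warp, routine), None)

    def deploy(self, buffer):
        """Stage one instruction from every active assist warp."""
        for key in list(self.active):
            parent_warp, routine = key
            instrs = self.store.routines[routine]
            pc = self.active[key]
            buffer.stage((parent_warp, routine, instrs[pc]))
            if pc + 1 == len(instrs):
                del self.active[key]  # routine finished
            else:
                self.active[key] = pc + 1


class AssistWarpBuffer:
    """Stages assist warp instructions until the scheduler issues them."""
    def __init__(self):
        self.staged = deque()

    def stage(self, entry):
        self.staged.append(entry)

    def issue(self):
        return self.staged.popleft() if self.staged else None


# Hypothetical routine: three made-up instruction names.
store = AssistWarpStore({"decompress": ["load", "add_base", "store"]})
controller = AssistWarpController(store)
buffer = AssistWarpBuffer()

controller.trigger(parent_warp=0, routine="decompress")
issued = []
while controller.active or buffer.staged:
    controller.deploy(buffer)
    issued.append(buffer.issue())
```

Keeping the routine code in the store and only a progress index in the controller mirrors the slide's split between holding instructions and tracking active assist warps.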
![Page 20: A Case for Core-Assisted Bottleneck Acceleration in GPUs · A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar](https://reader033.vdocument.in/reader033/viewer/2022041918/5e6aae47e3e68a6e7716cc30/html5/thumbnails/20.jpg)
Other functionality
In the paper:
More details on the hardware structures
Data communication and synchronization
Enforcing priorities
![Page 21: A Case for Core-Assisted Bottleneck Acceleration in GPUs · A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar](https://reader033.vdocument.in/reader033/viewer/2022041918/5e6aae47e3e68a6e7716cc30/html5/thumbnails/21.jpg)
CABA: Applications
Data compression
Memoization
Prefetching
…
![Page 22: A Case for Core-Assisted Bottleneck Acceleration in GPUs · A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar](https://reader033.vdocument.in/reader033/viewer/2022041918/5e6aae47e3e68a6e7716cc30/html5/thumbnails/22.jpg)
A Case for CABA: Data Compression
Data compression can help alleviate the memory bandwidth bottleneck: it transmits data in a more condensed form
[Figure: memory hierarchy and idle cores, with data labeled Compressed and Uncompressed]
CABA employs idle compute pipelines to perform compression
![Page 23: A Case for Core-Assisted Bottleneck Acceleration in GPUs · A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar](https://reader033.vdocument.in/reader033/viewer/2022041918/5e6aae47e3e68a6e7716cc30/html5/thumbnails/23.jpg)
Data Compression with CABA
Use assist warps to:
Compress cache blocks before writing to memory
Decompress cache blocks before placing into the cache
CABA flexibly enables various compression algorithms
Example: BDI Compression [Pekhimenko+ PACT ’12]
Parallelizable across SIMT width
Low latency
Others: FPC [Alameldeen+ TR ’04], C-Pack [Chen+ VLSI ’10]
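To give a feel for why a base-delta scheme parallelizes well, here is a deliberately simplified sketch: one base (the first value) and one fixed delta width. Full BDI also tries several base and delta sizes and an implicit zero base, none of which is modeled here; the function names and sample data are made up for illustration.

```python
def bdi_compress(values, delta_bytes=1):
    """Encode a block as (base, deltas) if every delta fits in delta_bytes.

    Returns None when the block is not compressible with this encoding.
    """
    base = values[0]
    limit = 1 << (8 * delta_bytes - 1)  # signed range of one delta
    deltas = [v - base for v in values]
    if all(-limit <= d < limit for d in deltas):
        return base, deltas
    return None


def bdi_decompress(base, deltas):
    """Rebuild the block; each element is independent of the others."""
    return [base + d for d in deltas]


block = [0x8000_1000, 0x8000_1008, 0x8000_1010, 0x8000_1018]  # made-up data
packed = bdi_compress(block)
restored = bdi_decompress(*packed)
```

Decompression is one addition per element with no dependence between elements, which is the property that lets each SIMT lane handle its own value.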
![Page 24: A Case for Core-Assisted Bottleneck Acceleration in GPUs · A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar](https://reader033.vdocument.in/reader033/viewer/2022041918/5e6aae47e3e68a6e7716cc30/html5/thumbnails/24.jpg)
Walkthrough of Decompression
[Figure: animation over the Scheduler, Cores, L1D, L2 + Memory, Assist Warp Store, and Assist Warp Controller, with the labels “Hit!”, “Miss!”, and “Trigger”]
![Page 25: A Case for Core-Assisted Bottleneck Acceleration in GPUs · A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar](https://reader033.vdocument.in/reader033/viewer/2022041918/5e6aae47e3e68a6e7716cc30/html5/thumbnails/25.jpg)
Walkthrough of Compression
[Figure: animation over the Scheduler, Cores, L1D, L2 + Memory, Assist Warp Store, and Assist Warp Controller, with the label “Trigger”]
![Page 26: A Case for Core-Assisted Bottleneck Acceleration in GPUs · A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar](https://reader033.vdocument.in/reader033/viewer/2022041918/5e6aae47e3e68a6e7716cc30/html5/thumbnails/26.jpg)
Evaluation
![Page 27: A Case for Core-Assisted Bottleneck Acceleration in GPUs · A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar](https://reader033.vdocument.in/reader033/viewer/2022041918/5e6aae47e3e68a6e7716cc30/html5/thumbnails/27.jpg)
Methodology
Simulator: GPGPUSim, GPUWattch
Workloads: Lonestar, Rodinia, MapReduce, CUDA SDK
System Parameters:
– 15 SMs, 32 threads/warp
– 48 warps/SM, 32768 registers, 32KB Shared Memory
– Core: 1.4GHz, GTO scheduler, 2 schedulers/SM
– Memory: 177.4GB/s BW, 6 GDDR5 Memory Controllers, FR-FCFS scheduling
– Cache: L1 - 16KB, 4-way associative; L2 - 768KB, 16-way associative
Metrics:
– Performance: Instructions per Cycle (IPC)
– Bandwidth Consumption: Fraction of cycles the DRAM data bus is busy
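The two metrics reduce to simple ratios, sketched below. The counter values are invented purely to show the arithmetic and are not results from the evaluation; the function names are likewise illustrative.

```python
def ipc(instructions, cycles):
    """Performance metric: instructions per cycle."""
    return instructions / cycles


def bandwidth_consumption(busy_cycles, total_cycles):
    """Fraction of cycles the DRAM data bus is busy."""
    return busy_cycles / total_cycles


def normalized_performance(ipc_new, ipc_baseline):
    """Speedup over the baseline; 1.0 means no change."""
    return ipc_new / ipc_baseline


# Invented counters for a hypothetical run of the same kernel.
baseline_ipc = ipc(instructions=1_000_000, cycles=2_000_000)
compressed_ipc = ipc(instructions=1_000_000, cycles=1_600_000)
speedup = normalized_performance(compressed_ipc, baseline_ipc)
bus_busy = bandwidth_consumption(busy_cycles=900_000, total_cycles=2_000_000)
```

On this scale, a 41.7% performance improvement corresponds to a normalized performance of 1.417.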
![Page 28: A Case for Core-Assisted Bottleneck Acceleration in GPUs · A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar](https://reader033.vdocument.in/reader033/viewer/2022041918/5e6aae47e3e68a6e7716cc30/html5/thumbnails/28.jpg)
Effect on Performance
[Figure: normalized performance per application for CABA-BDI and No-Overhead-BDI; the 41.7% improvement is marked]
CABA provides a 41.7% performance improvement
CABA achieves performance close to that of designs with no overhead for compression
![Page 29: A Case for Core-Assisted Bottleneck Acceleration in GPUs · A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar](https://reader033.vdocument.in/reader033/viewer/2022041918/5e6aae47e3e68a6e7716cc30/html5/thumbnails/29.jpg)
Effect on Bandwidth Consumption
[Figure: memory bandwidth consumption per application, Baseline vs. CABA-BDI]
Data compression with CABA alleviates the memory bandwidth bottleneck
![Page 30: A Case for Core-Assisted Bottleneck Acceleration in GPUs · A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar](https://reader033.vdocument.in/reader033/viewer/2022041918/5e6aae47e3e68a6e7716cc30/html5/thumbnails/30.jpg)
Different Compression Algorithms
[Figure: normalized performance per application for CABA-FPC, CABA-BDI, CABA-CPack, and CABA-BestOfAll]
CABA is flexible: Improves performance with different compression algorithms
![Page 31: A Case for Core-Assisted Bottleneck Acceleration in GPUs · A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar](https://reader033.vdocument.in/reader033/viewer/2022041918/5e6aae47e3e68a6e7716cc30/html5/thumbnails/31.jpg)
Other Results
CABA’s performance is similar to pure-hardware-based BDI compression
CABA reduces the overall system energy by 22% by decreasing the off-chip memory traffic
Other evaluations:
Compression ratios
Sensitivity to memory bandwidth
Capacity compression
Compression at different levels of the hierarchy
![Page 32: A Case for Core-Assisted Bottleneck Acceleration in GPUs · A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar](https://reader033.vdocument.in/reader033/viewer/2022041918/5e6aae47e3e68a6e7716cc30/html5/thumbnails/32.jpg)
Conclusion
Observation: Imbalances in execution leave GPU resources underutilized
Our Goal: Employ underutilized GPU resources to do something useful – accelerate bottlenecks using helper threads
Challenge: How do you efficiently manage and use helper threads in a throughput-oriented architecture?
Our Solution: CABA (Core-Assisted Bottleneck Acceleration)
A new framework to enable helper threading in GPUs
Enables flexible data compression to alleviate the memory bandwidth bottleneck
A wide set of use cases (e.g., prefetching, memoization)
Key Results: Using CABA to implement data compression in memory improves performance by 41.7%
![Page 34: A Case for Core-Assisted Bottleneck Acceleration in GPUs · A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar](https://reader033.vdocument.in/reader033/viewer/2022041918/5e6aae47e3e68a6e7716cc30/html5/thumbnails/34.jpg)
Backup Slides
![Page 35: A Case for Core-Assisted Bottleneck Acceleration in GPUs · A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar](https://reader033.vdocument.in/reader033/viewer/2022041918/5e6aae47e3e68a6e7716cc30/html5/thumbnails/35.jpg)
Effect on Energy
[Figure: normalized energy per application for CABA-BDI, Ideal-BDI, HW-BDI-Mem, and HW-BDI]
CABA reduces the overall system energy by decreasing the off-chip memory traffic
![Page 36: A Case for Core-Assisted Bottleneck Acceleration in GPUs · A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar](https://reader033.vdocument.in/reader033/viewer/2022041918/5e6aae47e3e68a6e7716cc30/html5/thumbnails/36.jpg)
Effect on Compression Ratio
![Page 37: A Case for Core-Assisted Bottleneck Acceleration in GPUs · A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar](https://reader033.vdocument.in/reader033/viewer/2022041918/5e6aae47e3e68a6e7716cc30/html5/thumbnails/37.jpg)
Other Uses of CABA
Hardware Memoization
Goal: avoid redundant computation by reusing previous results over the same/similar inputs
Idea:
hash the inputs at predefined points
use load/store pipelines to save inputs in shared memory
eliminate redundant computation by loading stored results
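The three steps of the idea can be sketched in software as a small direct-mapped table. This is only an analogy under stated assumptions: the `MemoTable` class is hypothetical, a Python list stands in for shared memory, and only exact input matches are reused (the "similar inputs" case is not modeled).

```python
class MemoTable:
    """Fixed-size, direct-mapped table standing in for shared memory."""
    def __init__(self, slots=64):
        self.slots = [None] * slots

    def lookup_or_compute(self, inputs, compute):
        index = hash(inputs) % len(self.slots)  # hash the inputs
        entry = self.slots[index]
        if entry is not None and entry[0] == inputs:
            return entry[1], True               # reuse the stored result
        result = compute(*inputs)
        self.slots[index] = (inputs, result)    # save inputs and result
        return result, False


def squared_norm(a, b):
    return a * a + b * b


table = MemoTable()
first = table.lookup_or_compute((3, 4), squared_norm)
second = table.lookup_or_compute((3, 4), squared_norm)
```

The first call computes and stores the result; the second finds the same inputs in the table and skips the computation. A colliding entry is simply overwritten, which keeps the table fixed in size.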
Prefetching
Similar to CPU