TRANSCRIPT
MASK: Redesigning the GPU Memory Hierarchy to Support Multi-Application Concurrency
Rachata Ausavarungnirun Vance Miller Joshua Landgraf Saugata Ghose
Jayneel Gandhi Adwait Jog Christopher J. Rossbach Onur Mutlu
Executive Summary

Problem: Address translation overheads limit the latency hiding capability of a GPU

Key Idea: Prioritize address translation requests over data requests

MASK: a GPU memory hierarchy that
A. Reduces shared TLB contention
B. Improves L2 cache utilization
C. Reduces page walk latency

MASK improves system throughput by 57.8% on average
over state-of-the-art address translation mechanisms

Large performance loss vs. no translation
High contention at the shared TLB
Low L2 cache utilization
Outline

• Executive Summary
• Background, Key Challenges, and Our Goal
• MASK: A Translation-aware Memory Hierarchy
• Evaluation
• Conclusion
Why Share Discrete GPUs?

• Enables multiple GPGPU applications to run concurrently
• Better resource utilization
  - An application often cannot utilize an entire GPU
  - Different compute and bandwidth demands
• Enables GPU sharing in the cloud
  - Multiple users spatially share each GPU

Key requirement: fine-grained memory protection
State-of-the-art Address Translation in GPUs

[Diagram: Multiple GPU cores, each with a private per-core TLB; App 1 and App 2 share a Shared TLB, backed by Page Table Walkers and the Page Table in main memory]
A TLB Miss Stalls Multiple Warps

[Diagram: Same hierarchy as before; cores are stalled waiting on a TLB miss]

Data in a page is shared by many threads
All threads access the same page
Multiple Page Walks Happen Together

[Diagram: Same hierarchy; multiple cores are stalled on concurrent page walks]

GPU's parallelism creates parallel page walks
Data in a page is shared by many threads
All threads access the same page
Effect of Translation on Performance

[Chart: Performance of the SharedTLB and PWCache designs, normalized to Ideal (no address translation overhead); 37.4% annotation]

What causes the large performance loss?
Problem 1: Contention at the Shared TLB

• Multiple GPU applications contend for the TLB

[Chart: L2 TLB miss rate (lower is better) for App 1 and App 2 of the 3DS_HISTO, CONS_LPS, MUM_HISTO, and RED_RAY workloads, running alone vs. shared]

Contention at the shared TLB leads to lower performance
Problem 2: Thrashing at the L2 Cache

• L2 cache can be used to reduce page walk latency
  - Partial translation data can be cached
• Thrashing Source 1: Parallel page walks
  - Different address translation data evict each other
• Thrashing Source 2: GPU memory intensity
  - Demand-fetched data evicts address translation data

L2 cache is ineffective at reducing page walk latency
Observation: Address Translation Is Latency Sensitive

• Multiple warps share data from a single page

[Chart: Warps stalled per TLB miss across GPGPU benchmarks and on average]

A single TLB miss causes 8 warps to stall on average
Observation: Address Translation Is Latency Sensitive

• Multiple warps share data from a single page
• GPU's parallelism causes multiple concurrent page walks

[Chart: Average number of concurrent page walks across GPGPU benchmarks]

High address translation latency → More stalled warps
Our Goals

• Reduce shared TLB contention
• Improve L2 cache utilization
• Lower page walk latency
Outline

• Executive Summary
• Background, Key Challenges, and Our Goal
• MASK: A Translation-aware Memory Hierarchy
• Evaluation
• Conclusion
MASK: A Translation-aware Memory Hierarchy

• Reduce shared TLB contention → A. TLB-fill Tokens
• Improve L2 cache utilization → B. Translation-aware L2 Bypass
• Lower page walk latency → C. Address-space-aware Memory Scheduler
A: TLB-fill Tokens

• Goal: Limit the number of warps that can fill the TLB
  - A warp with a token fills the shared TLB
  - A warp with no token fills a very small bypass cache
• Number of tokens changes based on TLB miss rate
  - Updated every epoch
• Tokens are assigned based on warp ID

Benefit: Limits contention at the shared TLB
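As a concrete illustration, the token mechanism above can be sketched in software. This is a minimal model, not the MASK hardware: the class name, the FIFO bypass-cache eviction, the miss-rate thresholds, and the step size for adjusting the token count are all assumptions made for the sketch.

```python
class TokenTLB:
    """Software sketch of TLB-fill Tokens (illustrative, not the hardware)."""

    def __init__(self, num_tokens, bypass_cache_size=32):
        self.num_tokens = num_tokens        # warps allowed to fill the shared TLB
        self.shared_tlb = {}                # vpn -> ppn, the large shared structure
        self.bypass_cache = {}              # very small cache for token-less warps
        self.bypass_cache_size = bypass_cache_size
        self.hits = 0
        self.misses = 0

    def has_token(self, warp_id):
        # Tokens are assigned by warp ID: the lowest-numbered warps hold them.
        return warp_id < self.num_tokens

    def fill(self, warp_id, vpn, ppn):
        # A warp with a token fills the shared TLB;
        # a warp without one fills only the small bypass cache.
        if self.has_token(warp_id):
            self.shared_tlb[vpn] = ppn
        else:
            if len(self.bypass_cache) >= self.bypass_cache_size:
                # FIFO eviction (dicts preserve insertion order in Python 3.7+).
                self.bypass_cache.pop(next(iter(self.bypass_cache)))
            self.bypass_cache[vpn] = ppn

    def lookup(self, vpn):
        hit = vpn in self.shared_tlb or vpn in self.bypass_cache
        self.hits += hit
        self.misses += not hit
        return hit

    def end_epoch(self, low=0.3, high=0.7, step=4, max_tokens=64):
        # Every epoch, adapt the token count to the observed miss rate:
        # high miss rate -> fewer warps may fill; low miss rate -> more may.
        total = self.hits + self.misses
        miss_rate = self.misses / total if total else 0.0
        if miss_rate > high:
            self.num_tokens = max(1, self.num_tokens - step)
        elif miss_rate < low:
            self.num_tokens = min(max_tokens, self.num_tokens + step)
        self.hits = self.misses = 0
        return self.num_tokens
```

The key property of the sketch is that fills from token-less warps can never evict entries that token-holding warps placed in the shared TLB, which is how contention is limited.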
MASK: A Translation-aware Memory Hierarchy

• Reduce shared TLB contention → A. TLB-fill Tokens
• Improve L2 cache utilization → B. Translation-aware L2 Bypass
• Lower page walk latency → C. Address-space-aware Memory Scheduler
B: Translation-aware L2 Bypass

• L2 hit rate decreases for deep page walk levels

[Chart: L2 cache hit rate for Page Table Levels 1 through 4; hit rate drops at each deeper level]

Some address translation data does not benefit from caching
Only cache address translation data with high hit rate
B: Translation-aware L2 Bypass

• Goal: Cache address translation data with high hit rate

[Chart: Per-level L2 hit rates compared against the average L2 cache hit rate; levels above the average are cached, levels below it are bypassed]

Benefit 1: Better L2 cache utilization for translation data
Benefit 2: Bypassed requests → No L2 queuing delay
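The bypass decision above reduces to comparing each page-table level's hit rate against the average L2 hit rate. A minimal sketch, assuming simple per-level counters; the class name, counter-update scheme, and the optimistic default for unsampled levels are illustrative assumptions, not the hardware implementation.

```python
class L2BypassPolicy:
    """Software sketch of Translation-aware L2 Bypass (illustrative)."""

    def __init__(self, num_levels=4):
        self.hits = [0] * num_levels      # per page-table-level L2 hits
        self.accesses = [0] * num_levels  # per page-table-level L2 accesses

    def record(self, level, hit):
        # Called on each L2 access made on behalf of a page walk at `level`.
        self.accesses[level] += 1
        self.hits[level] += hit

    def level_hit_rate(self, level):
        a = self.accesses[level]
        # Assume cacheable until a level has been sampled (optimistic default).
        return self.hits[level] / a if a else 1.0

    def avg_hit_rate(self):
        total = sum(self.accesses)
        return sum(self.hits) / total if total else 0.0

    def should_cache(self, level):
        # Cache translation data for a level only if its hit rate is at least
        # the average L2 hit rate; otherwise bypass the L2, which also lets
        # the request skip the L2 queuing delay.
        return self.level_hit_rate(level) >= self.avg_hit_rate()
```

In this sketch, a shallow level with high reuse (e.g. the first page-table level) keeps getting cached, while a deep level whose entries rarely hit is bypassed, matching the per-level hit-rate trend shown in the chart.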
MASK: A Translation-aware Memory Hierarchy

• Reduce shared TLB contention → A. TLB-fill Tokens
• Improve L2 cache utilization → B. Translation-aware L2 Bypass
• Lower page walk latency → C. Address-space-aware Memory Scheduler
C: Address-space-aware Memory Scheduler

• Cause: Address translation requests are treated similarly to data demand requests
• Idea: Lower address translation request latency

[Charts: DRAM latency (cycles) and DRAM bandwidth of Address Translation Requests vs. Data Demand Requests]
C: Address-space-aware Memory Scheduler

• Idea 1: Prioritize address translation requests over data demand requests

[Diagram: Memory Scheduler with a high-priority Golden Queue (address translation requests) and a low-priority Normal Queue (data demand requests), both draining to DRAM]
C: Address-space-aware Memory Scheduler

• Idea 1: Prioritize address translation requests over data demand requests
• Idea 2: Improve quality-of-service using the Silver Queue

[Diagram: Golden Queue (address translation requests, high priority), Silver Queue (data demand requests, applications take turns), and Normal Queue (data demand requests, low priority), all draining to DRAM]

Each application takes a turn injecting into the Silver Queue
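The three-queue priority scheme can be sketched as follows, assuming simple FIFO queues and a round-robin turn policy for the Silver Queue; the class name and the turn-advance method are hypothetical, and real schedulers would also consider row-buffer locality and bank state.

```python
from collections import deque
from itertools import cycle

class AddressSpaceAwareScheduler:
    """Software sketch of the Golden/Silver/Normal queue idea (illustrative)."""

    def __init__(self, app_ids):
        self.golden = deque()        # address translation requests (high priority)
        self.silver = deque()        # data requests of the app whose turn it is
        self.normal = deque()        # remaining data demand requests (low priority)
        self._turn = cycle(app_ids)  # applications take turns feeding silver
        self.current_app = next(self._turn)

    def enqueue(self, app_id, is_translation, req):
        if is_translation:
            self.golden.append(req)           # translations always go golden
        elif app_id == self.current_app:
            self.silver.append(req)           # QoS for the app whose turn it is
        else:
            self.normal.append(req)

    def next_turn(self):
        # Advance the round-robin turn for Silver Queue injection.
        self.current_app = next(self._turn)

    def issue(self):
        # Strict priority: golden > silver > normal.
        for q in (self.golden, self.silver, self.normal):
            if q:
                return q.popleft()
        return None
```

The strict golden-first order is what lowers page walk latency, while rotating Silver Queue access keeps one memory-intensive application's data requests from starving the others.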
Outline

• Executive Summary
• Background, Key Challenges, and Our Goal
• MASK: A Translation-aware Memory Hierarchy
• Evaluation
• Conclusion
Methodology

• Mosaic simulation platform [MICRO '17]
  - Based on GPGPU-Sim and MAFIA [Jog et al., MEMSYS '15]
  - Models page walks and virtual-to-physical mapping
  - Available at https://github.com/CMU-SAFARI/Mosaic
• NVIDIA GTX 750 Ti configuration
• Two GPGPU applications execute concurrently
• CUDA-SDK, Rodinia, Parboil, LULESH, and SHOC suites
  - 3 workload categories based on TLB miss rate
Comparison Points

• State-of-the-art CPU–GPU memory management [Power et al., HPCA '14]
  - PWCache: Page Walk Cache GPU MMU design
  - SharedTLB: Shared TLB GPU MMU design
• Ideal: Every TLB access is an L1 TLB hit
Performance

[Chart: Normalized performance of PWCache, SharedTLB, MASK, and Ideal for the 0-HMR, 1-HMR, and 2-HMR workload categories and on average; annotated improvements: 57.8%, 52.0%, 61.2%, 58.7%]

MASK outperforms the state-of-the-art designs for every workload
Other Results in the Paper

• MASK reduces unfairness
• Effectiveness of each individual component
  - All three MASK components are effective
• Sensitivity analysis over multiple GPU architectures
  - MASK improves performance on all evaluated architectures, including CPU–GPU heterogeneous systems
• Sensitivity analysis to different TLB sizes
  - MASK improves performance for all evaluated sizes
• Performance improvement over different memory scheduling policies
  - MASK improves performance over other state-of-the-art memory schedulers
Conclusion

Problem: Address translation overheads limit the latency hiding capability of a GPU

Key Idea: Prioritize address translation requests over data requests

MASK: a translation-aware GPU memory hierarchy
A. TLB-fill Tokens reduces shared TLB contention
B. Translation-aware L2 Bypass improves L2 cache utilization
C. Address-space-aware Memory Scheduler reduces page walk latency

MASK improves system throughput by 57.8% on average
over state-of-the-art address translation mechanisms

Large performance loss vs. no translation
High contention at the shared TLB
Low L2 cache utilization
Backup Slides
Other Ways to Manage TLB Contention

• Prefetching
  - A stream prefetcher is ineffective for multiple workloads
  - GPU's parallelism makes it hard to predict which translation data to prefetch
  - GPU's parallelism causes thrashing of the prefetched data
• Reuse-based techniques
  - Lower the TLB hit rate
  - Most pages have similar TLB hit rates
Other Ways to Manage L2 Thrashing

• Cache Partitioning
  - Performs ~3% worse on average than Translation-aware L2 Bypass
  - Multiple address translation requests still thrash each other
  - Can lead to underutilization
  - Lowers the hit rate of data requests
• Cache Insertion Policy
  - Does not yield a better hit rate for lower page table levels
  - Does not benefit from lower queuing latency
Utilizing Large Pages?

• A single large page size
  - High demand paging latency
  - > 90% performance overhead with demand paging
  - All threads stall during the large-page PCIe transfer
• Mosaic [Ausavarungnirun et al., MICRO '17]
  - Supports multiple page sizes
  - Demand paging happens at small-page granularity
  - Data from the same application is allocated at large-page granularity
  - Opportunistically coalesces small pages to reduce TLB contention

MASK + Mosaic performs within 5% of the Ideal TLB
Area and Storage Overhead

• Area overhead
  - < 1% of the area of its original components (Shared TLB, L2 cache, Memory Scheduler)
• Storage overhead
  - TLB-fill Tokens: 3.8% extra storage in the Shared TLB
  - Translation-aware L2 Bypass: 0.1% extra storage in the L2 cache
  - Address-space-aware Memory Scheduler: 6% extra memory request buffer storage
Unfairness

[Chart: Unfairness (lower is better) of PWCache, SharedTLB, and MASK for the 0-HMR, 1-HMR, and 2-HMR categories and on average; annotations: 20.1%, 25.0%, 21.8%, 22.4%]

MASK is effective at improving fairness
Performance vs. Other Baselines

[Chart: Weighted speedup of Static, PWCache, SharedTLB, MASK-TLB, MASK-Cache, MASK-DRAM, MASK, and Ideal for the 0-HMR, 1-HMR, and 2-HMR categories and on average]
Unfairness

[Chart: Unfairness of Static, PWCache, SharedTLB, and MASK for the 0-HMR, 1-HMR, and 2-HMR categories and on average; annotations: 22.4%, 21.8%, 25.0%, 20.1%]
DRAM Utilization Breakdowns

[Chart: Normalized DRAM bandwidth utilization of Address Translation Requests vs. Data Demand Requests for each two-application workload (3DS_BP, 3DS_HISTO, BLK_LPS, CFD_MM, CONS_LPS, and others) and on average]
DRAM Latency Breakdowns

[Chart: DRAM latency in cycles of Address Translation Requests vs. Data Demand Requests for each two-application workload and on average]
0-HMR Performance

[Chart: Weighted speedup of Static, PWCache, SharedTLB, MASK-TLB, MASK-Cache, MASK-DRAM, and MASK for the 0-HMR workloads HISTO_GUP, HISTO_LPS, NW_HS, NW_LPS, RAY_GUP, RAY_HS, SCP_GUP, and SCP_HS]
1-HMR Performance

[Chart: Weighted speedup of Static, PWCache, SharedTLB, MASK-TLB, MASK-Cache, MASK-DRAM, and MASK for the 1-HMR workloads]
2-HMR Performance

[Chart: Weighted speedup of Static, PWCache, SharedTLB, MASK-TLB, MASK-Cache, MASK-DRAM, and MASK for the 2-HMR workloads]
Additional Baseline Performance

[Chart: Normalized performance of PWCache and SharedTLB vs. Ideal; 45.6% annotation]