Staged Memory Scheduling - GitHub Pages · 2020-04-28
TRANSCRIPT
Staged Memory Scheduling
Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian,
Gabriel H. Loh*, Onur Mutlu
Carnegie Mellon University, *AMD Research
June 12th 2012
Executive Summary
Observation: Heterogeneous CPU-GPU systems require
memory schedulers with large request buffers
Problem: Existing monolithic application-aware memory
scheduler designs are hard to scale to large request buffer sizes
Solution: Staged Memory Scheduling (SMS)
decomposes the memory controller into three simple stages:
1) Batch formation: maintains row buffer locality
2) Batch scheduler: reduces interference between applications
3) DRAM command scheduler: issues requests to DRAM
Compared to state-of-the-art memory schedulers:
SMS is significantly simpler and more scalable
SMS provides higher performance and fairness
Outline
Background
Motivation
Our Goal
Observations
Staged Memory Scheduling
1) Batch Formation
2) Batch Scheduler
3) DRAM Command Scheduler
Results
Conclusion
Main Memory is a Bottleneck
Memory Scheduler
Core 1 Core 2 Core 3 Core 4
To DRAM
Memory Request Buffer
All cores contend for limited off-chip bandwidth
Inter-application interference degrades system performance
The memory scheduler can help mitigate the problem
How does the memory scheduler deliver good performance and fairness?
Three Principles of Memory Scheduling
Prioritize row-buffer-hit requests [Rixner+, ISCA’00]
To maximize memory bandwidth
Currently open row: B
Req 1: Row A   (oldest)
Req 2: Row B
Req 3: Row C
Req 4: Row A
Req 5: Row B   (newest)
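For the request list above, a row-hit-first policy serves the hits to open row B before the older misses. A minimal FR-FCFS-style sketch; all names are illustrative, not the actual hardware logic:

```python
# Row-buffer-hit-first selection: among queued requests, serve the
# oldest hit to the currently open row; fall back to the oldest request.
def pick_next(requests, open_row):
    """requests is a list of (req_id, row) pairs held oldest-first."""
    for req_id, row in requests:
        if row == open_row:
            return req_id                      # oldest row-buffer hit wins
    return requests[0][0] if requests else None  # no hit: oldest request

queue = [(1, "A"), (2, "B"), (3, "C"), (4, "A"), (5, "B")]
print(pick_next(queue, "B"))  # Req 2 is served first: it hits open row B
```

With no hit to the open row, the policy degenerates to first-come-first-served.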
Prioritize latency-sensitive applications [Kim+, HPCA’10]
To maximize system throughput

Application   Memory Intensity (MPKI)
1             5
2             1
3             2
4             10
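Latency-sensitive (low-MPKI) applications can be ranked directly from a table like the one above; an illustrative one-liner (real schedulers track MPKI in hardware counters, not a Python dict):

```python
# Rank applications by memory intensity (MPKI, misses per
# kilo-instruction); lower MPKI = more latency-sensitive = higher priority.
mpki = {1: 5, 2: 1, 3: 2, 4: 10}
priority_order = sorted(mpki, key=mpki.get)  # least intensive first
print(priority_order)  # [2, 3, 1, 4]
```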
Ensure that no application is starved [Mutlu and Moscibroda, MICRO’07]
To minimize unfairness
Memory Scheduling for CPU-GPU Systems
Current and future systems integrate a GPU along with multiple cores
GPU shares the main memory with the CPU cores
GPU is much more (4x-20x) memory-intensive than CPU
How should memory scheduling be done when GPU is integrated on-chip?
Introducing the GPU into the System
GPU occupies a significant portion of the request buffers
Limits the memory controller’s visibility of the CPU applications’ differing memory behavior, which can lead to poor scheduling decisions
Memory Scheduler
Core 1 Core 2 Core 3 Core 4 GPU
To DRAM
[Diagram: request buffer largely filled with GPU requests]
Naïve Solution: Large Monolithic Buffer
Memory Scheduler
Core 1 Core 2 Core 3 Core 4 GPU
To DRAM
Problems with Large Monolithic Buffer
A large buffer requires more complicated logic to:
Analyze memory requests (e.g., determine row buffer hits)
Analyze application characteristics
Assign and enforce priorities
This leads to high complexity, high power, and large die area
More Complex Memory Scheduler
Our Goal
Design a new memory scheduler that is:
Scalable to accommodate a large number of requests
Easy to implement
Application-aware
Able to provide high performance and fairness, especially in heterogeneous CPU-GPU systems
Key Functions of a Memory Controller
Memory controller must consider three different things concurrently when choosing the next request:
1) Maximize row buffer hits
Maximize memory bandwidth
2) Manage contention between applications
Maximize system throughput and fairness
3) Satisfy DRAM timing constraints
Current systems use a centralized memory controller design to accomplish these functions
Complex, especially with large request buffers
Key Idea: Decouple Tasks into Stages
Idea: Decouple the functional tasks of the memory controller
Partition tasks across several simpler HW structures (stages)
1) Maximize row buffer hits
Stage 1: Batch formation
Within each application, groups requests to the same row into batches
2) Manage contention between applications
Stage 2: Batch scheduler
Schedules batches from different applications
3) Satisfy DRAM timing constraints
Stage 3: DRAM command scheduler
Issues requests from the already-scheduled order to each bank
SMS: Staged Memory Scheduling
Memory Scheduler
Core 1 Core 2 Core 3 Core 4 GPU
To DRAM
[Diagram: a single monolithic scheduler with a large, full request buffer]
SMS: Staged Memory Scheduling
Core 1 Core 2 Core 3 Core 4 GPU
Stage 1: Batch Formation
Stage 2: Batch Scheduler
Stage 3: DRAM Command Scheduler
Bank 1 Bank 2 Bank 3 Bank 4
To DRAM
Stage 1: Batch Formation
Goal: Maximize row buffer hits
At each core, we want to batch requests that access the same row within a limited time window
A batch is ready to be scheduled under two conditions
1) When the next request accesses a different row
2) When the time window for batch formation expires
Keep this stage simple by using per-core FIFOs
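The per-core FIFO batching described above can be sketched as follows; the class structure and the cycle-based timeout are illustrative assumptions, not the paper’s exact hardware:

```python
# Sketch of Stage 1 batch formation for one core: group consecutive
# same-row requests into a batch; close the batch when the row changes
# or the time window expires.
from collections import deque

class BatchFormation:
    def __init__(self, window):
        self.window = window   # max cycles a batch may stay open
        self.fifo = deque()    # current (open) batch for this core
        self.age = 0
        self.ready = []        # completed batches, handed to Stage 2

    def enqueue(self, row):
        # Condition 1: the next request accesses a different row
        if self.fifo and self.fifo[-1] != row:
            self._close()
        self.fifo.append(row)

    def tick(self):
        # Condition 2: the time window for batch formation expires
        self.age += 1
        if self.fifo and self.age >= self.window:
            self._close()

    def _close(self):
        self.ready.append(list(self.fifo))
        self.fifo.clear()
        self.age = 0

bf = BatchFormation(window=100)
for row in ["A", "A", "B"]:
    bf.enqueue(row)
print(bf.ready)  # [['A', 'A']] -- batch closed when the row changed to B
```

Note that every request within a closed batch targets the same row, which is what lets later stages preserve row buffer locality without re-inspecting addresses.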
Stage 1: Batch Formation Example
Core 1 Core 2 Core 3 Core 4
[Animation: requests (Row A, Row B, Row C, …) arrive at each core’s FIFO; a batch boundary is drawn, and the completed batch is sent to Stage 2 (Batch Scheduling), when the next request goes to a different row or the time window expires]
Stage 2: Batch Scheduler
Goal: Minimize interference between applications
Stage 1 forms batches within each application
Stage 2 schedules batches from different applications
Schedules the oldest batch from each application
Question: Which application’s batch should be scheduled next?
Goal: Maximize system performance and fairness
To achieve this goal, the batch scheduler chooses between two different policies
Stage 2: Two Batch Scheduling Algorithms
Shortest Job First (SJF)
Prioritize the applications with the fewest outstanding memory requests because they make fast forward progress
Pro: Good system performance and fairness
Con: GPU and memory-intensive applications get deprioritized
Round-Robin (RR)
Prioritize the applications in a round-robin manner to ensure that memory-intensive applications can make progress
Pro: GPU and memory-intensive applications are treated fairly
Con: GPU and memory-intensive applications significantly slow down others
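A minimal sketch of the two selection policies, assuming each source (CPU core or GPU) reports a count of its outstanding memory requests; all names here are illustrative:

```python
# Two candidate batch-scheduling policies over request sources.
def pick_sjf(outstanding):
    """Shortest Job First: the source with the fewest outstanding requests."""
    return min(outstanding, key=outstanding.get)

def pick_rr(sources, last_served):
    """Round-Robin: the next source after the one served last."""
    return sources[(sources.index(last_served) + 1) % len(sources)]

outstanding = {"core1": 3, "core2": 1, "gpu": 40}
print(pick_sjf(outstanding))                      # core2 (lightest source)
print(pick_rr(["core1", "core2", "gpu"], "gpu"))  # core1 (wraps around)
```

Under SJF the GPU, with its large outstanding-request count, essentially never wins; under RR it gets a turn as often as any core, which is exactly the trade-off the slide describes.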
Stage 2: Batch Scheduling Policy
The importance of the GPU varies between systems and over time, so the scheduling policy needs to adapt to this
Solution: Hybrid Policy
At every cycle:
With probability p: Shortest Job First, which benefits the CPU
With probability 1-p: Round-Robin, which benefits the GPU
System software can configure p based on the importance/weight of the GPU
Higher GPU importance means a lower p value
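The hybrid policy amounts to a biased coin flip per scheduling cycle; a sketch (the comparison against p is an assumed implementation detail, not the paper’s hardware mechanism):

```python
# Hybrid batch-scheduling policy: SJF with probability p, RR otherwise.
import random

def choose_policy(p, rng=random.random):
    return "SJF" if rng() < p else "RR"

# With p = 0.9 (a CPU-focused configuration), roughly 90% of cycles use SJF:
random.seed(0)
picks = [choose_policy(0.9) for _ in range(1000)]
print(picks.count("SJF"))  # close to 900
```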
Stage 3: DRAM Command Scheduler
High-level policy decisions have already been made by:
Stage 1: Maintains row buffer locality
Stage 2: Minimizes inter-application interference
Stage 3: No need for further scheduling
Only goal: service requests while satisfying DRAM timing constraints
Implemented as simple per-bank FIFO queues
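Because no reordering happens in Stage 3, the per-bank queues can be plain FIFOs; a sketch with a stand-in timing check (a real controller would model DDR3 timing constraints such as tRCD and tRP):

```python
# Sketch of Stage 3: requests arrive already ordered by the batch
# scheduler and are only buffered per bank; the command scheduler
# issues the head of each bank FIFO as soon as DRAM timing allows.
from collections import deque

bank_fifos = {b: deque() for b in range(4)}

def enqueue(req_row, bank):
    bank_fifos[bank].append(req_row)   # in-order: no reordering in Stage 3

def issue(bank, timing_ok):
    """Issue the oldest request for this bank if timing constraints allow."""
    if bank_fifos[bank] and timing_ok:
        return bank_fifos[bank].popleft()
    return None

enqueue("A", 0)
enqueue("B", 0)
print(issue(0, timing_ok=True))  # A -- strict FIFO order within each bank
```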
Putting Everything Together
Core 1 Core 2 Core 3 Core 4 GPU
Stage 1: Batch Formation
Stage 2: Batch Scheduler
Stage 3: DRAM Command Scheduler
Bank 1 Bank 2 Bank 3 Bank 4
[Animation: batches drain from the per-source FIFOs through the batch scheduler, whose current batch scheduling policy alternates between SJF and RR, into the per-bank FIFOs of the DRAM command scheduler]
Complexity
Compared to a row-hit-first scheduler, SMS consumes*
66% less area
46% less static power
Reduction comes from:
Replacing the monolithic scheduler with stages of simpler schedulers
Each stage has a simpler scheduler (considers fewer properties at a time to make the scheduling decision)
Each stage has simpler buffers (FIFO instead of out-of-order)
Each stage has a portion of the total buffer size (buffering is distributed across stages)
* Based on a Verilog model using a 180nm library
Methodology
Simulation parameters
16 OoO CPU cores, 1 GPU modeling an AMD Radeon™ 5870
DDR3-1600 DRAM: 4 channels, 1 rank/channel, 8 banks/channel
Workloads
CPU: SPEC CPU2006
GPU: Recent games and GPU benchmarks
7 workload categories based on the memory intensity of CPU applications
Low memory intensity (L)
Medium memory intensity (M)
High memory intensity (H)
Comparison to Previous Scheduling Algorithms
FR-FCFS [Rixner+, ISCA’00]
Prioritizes row buffer hits
Maximizes DRAM throughput
Low multi-core performance: application-unaware
ATLAS [Kim+, HPCA’10]
Prioritizes latency-sensitive applications
Good multi-core performance
Low fairness: deprioritizes memory-intensive applications
TCM [Kim+, MICRO’10]
Clusters low- and high-intensity applications and treats each cluster separately
Good multi-core performance and fairness
Not robust: misclassifies latency-sensitive applications
Evaluation Metrics
CPU performance metric: Weighted speedup
GPU performance metric: Frame rate speedup
CPU-GPU system performance: CPU-GPU weighted speedup
59
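The exact metric definitions are in the paper; as a hedged sketch, the formulas below are the standard ones these names usually denote (the combination rule for the CPU-GPU system metric and the `gpu_weight` parameter are stated here as assumptions):

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """CPU weighted speedup: sum over applications of IPC when
    sharing the system divided by IPC when running alone."""
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))

def cpu_gpu_weighted_speedup(ws_cpu, gpu_frame_rate_speedup, gpu_weight):
    """Assumed system metric: CPU weighted speedup plus the GPU's
    frame-rate speedup scaled by the GPU's importance weight."""
    return ws_cpu + gpu_weight * gpu_frame_rate_speedup
```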
![Page 61: Staged Memory Scheduling - GitHub Pages · 2020-04-28 · Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie](https://reader034.vdocument.in/reader034/viewer/2022050413/5f89da5bca9fa3376376bf15/html5/thumbnails/61.jpg)
Evaluated System Scenarios
CPU-focused system
GPU-focused system
61
![Page 62: Staged Memory Scheduling - GitHub Pages · 2020-04-28 · Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie](https://reader034.vdocument.in/reader034/viewer/2022050413/5f89da5bca9fa3376376bf15/html5/thumbnails/62.jpg)
Evaluated System Scenario: CPU Focused
GPU has low weight (weight = 1)
Configure SMS such that p, the SJF probability, is set to 0.9
Mostly uses SJF batch scheduling, which prioritizes latency-sensitive applications (mainly the CPU)
62
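The batch-scheduler policy described above can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the class name, queue representation, and tie-breaking are assumptions; only the policy itself (SJF with probability p, round-robin otherwise) comes from the slides.

```python
import random

class BatchScheduler:
    """Sketch of the SMS batch-scheduler stage: with probability p pick
    the source with the shortest queue (SJF), otherwise rotate
    round-robin among sources with pending batches."""

    def __init__(self, p):
        self.p = p          # SJF probability (0.9 = CPU-focused, 0 = GPU-focused)
        self.rr_index = 0   # round-robin position

    def pick(self, batch_queues):
        nonempty = [src for src, q in sorted(batch_queues.items()) if q]
        if not nonempty:
            return None
        if random.random() < self.p:
            # SJF: the source with the fewest buffered requests is likely
            # latency-sensitive (e.g., a CPU core), so serve it first
            return min(nonempty, key=lambda src: len(batch_queues[src]))
        # Round-robin: every source, including the memory-intensive GPU,
        # gets a regular turn
        src = nonempty[self.rr_index % len(nonempty)]
        self.rr_index += 1
        return src
```

With p = 0.9 the CPU queues (short) win almost every pick; with p = 0 the GPU is serviced as often as any CPU core.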
![Page 66: Staged Memory Scheduling - GitHub Pages · 2020-04-28 · Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie](https://reader034.vdocument.in/reader034/viewer/2022050413/5f89da5bca9fa3376376bf15/html5/thumbnails/66.jpg)
Performance: CPU-Focused System (p = 0.9)
SJF batch scheduling policy allows latency-sensitive applications to get serviced as fast as possible
[Bar chart: CPU-GPU weighted speedup (0–12) for FR-FCFS, ATLAS, TCM, and SMS across workload categories L, ML, M, HL, HML, HM, H, and Avg: +17.2% over ATLAS]
SMS is much less complex than previous schedulers
66
![Page 67: Staged Memory Scheduling - GitHub Pages · 2020-04-28 · Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie](https://reader034.vdocument.in/reader034/viewer/2022050413/5f89da5bca9fa3376376bf15/html5/thumbnails/67.jpg)
Evaluated System Scenario: GPU Focused
GPU has high weight (weight = 1000)
Configure SMS such that p, the SJF probability, is set to 0
Always uses round-robin batch scheduling, which prioritizes memory-intensive applications (the GPU)
67
![Page 71: Staged Memory Scheduling - GitHub Pages · 2020-04-28 · Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie](https://reader034.vdocument.in/reader034/viewer/2022050413/5f89da5bca9fa3376376bf15/html5/thumbnails/71.jpg)
Performance: GPU-Focused System (p = 0)
Round-robin batch scheduling policy schedules GPU requests more frequently
[Bar chart: CPU-GPU weighted speedup (0–1000) for FR-FCFS, ATLAS, TCM, and SMS across workload categories L, ML, M, HL, HML, HM, H, and Avg: +1.6% over FR-FCFS]
SMS is much less complex than previous schedulers
71
![Page 75: Staged Memory Scheduling - GitHub Pages · 2020-04-28 · Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie](https://reader034.vdocument.in/reader034/viewer/2022050413/5f89da5bca9fa3376376bf15/html5/thumbnails/75.jpg)
Performance at Different GPU Weights
At every GPU weight, SMS outperforms the best previous scheduling algorithm for that weight
[Line plot: system performance (0–1) vs. GPU weight (0.001–1000), comparing SMS against the best previous scheduler at each weight (FR-FCFS, ATLAS, or TCM)]
75
![Page 76: Staged Memory Scheduling - GitHub Pages · 2020-04-28 · Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie](https://reader034.vdocument.in/reader034/viewer/2022050413/5f89da5bca9fa3376376bf15/html5/thumbnails/76.jpg)
Additional Results in the Paper
Fairness evaluation
47.6% improvement over the best previous algorithms
Individual CPU and GPU performance breakdowns
CPU-only scenarios
Competitive performance with previous algorithms
Scalability results
SMS’s performance and fairness scale better than previous algorithms’ as the number of cores and memory channels increases
Analysis of SMS design parameters
76
![Page 77: Staged Memory Scheduling - GitHub Pages · 2020-04-28 · Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie](https://reader034.vdocument.in/reader034/viewer/2022050413/5f89da5bca9fa3376376bf15/html5/thumbnails/77.jpg)
Outline
Background
Motivation: CPU-GPU Systems
Our Goal
Observations
Staged Memory Scheduling
1) Batch Formation
2) Batch Scheduler
3) DRAM Command Scheduler
Results
Conclusion
77
![Page 78: Staged Memory Scheduling - GitHub Pages · 2020-04-28 · Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie](https://reader034.vdocument.in/reader034/viewer/2022050413/5f89da5bca9fa3376376bf15/html5/thumbnails/78.jpg)
Conclusion
Observation: Heterogeneous CPU-GPU systems require
memory schedulers with large request buffers
Problem: Existing monolithic application-aware memory
scheduler designs are hard to scale to large request buffer sizes
Solution: Staged Memory Scheduling (SMS)
decomposes the memory controller into three simple stages:
1) Batch formation: maintains row buffer locality
2) Batch scheduler: reduces interference between applications
3) DRAM command scheduler: issues requests to DRAM
Compared to state-of-the-art memory schedulers:
SMS is significantly simpler and more scalable
SMS provides higher performance and fairness
78
![Page 79: Staged Memory Scheduling - GitHub Pages · 2020-04-28 · Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie](https://reader034.vdocument.in/reader034/viewer/2022050413/5f89da5bca9fa3376376bf15/html5/thumbnails/79.jpg)
Staged Memory Scheduling
Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian,
Gabriel H. Loh*, Onur Mutlu
Carnegie Mellon University, *AMD Research
June 12th 2012
![Page 80: Staged Memory Scheduling - GitHub Pages · 2020-04-28 · Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie](https://reader034.vdocument.in/reader034/viewer/2022050413/5f89da5bca9fa3376376bf15/html5/thumbnails/80.jpg)
Backup Slides
80
![Page 81: Staged Memory Scheduling - GitHub Pages · 2020-04-28 · Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie](https://reader034.vdocument.in/reader034/viewer/2022050413/5f89da5bca9fa3376376bf15/html5/thumbnails/81.jpg)
Row Buffer Locality on Batch Formation
OoO batch formation improves the performance of the system by:
~3% when the batch scheduler uses SJF policy most of the time
~7% when the batch scheduler uses RR most of the time
However, OoO batch formation is more complex:
OoO buffering instead of FIFO queues
Need to fine-tune the batch-formation time window based on application characteristics (only a 3%-5% performance gain without fine-tuning)
81
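For contrast with the OoO variant discussed above, the baseline in-order (FIFO) batch formation can be sketched as below. This is a minimal illustration under assumptions: requests are (source, row) pairs, a batch is a run of consecutive same-row requests from one source, and the timeout-based batch-ending mentioned in the talk is omitted.

```python
from collections import defaultdict, deque

def form_batches(requests):
    """FIFO batch formation: buffer requests per source in arrival
    order, then cut a new batch whenever the targeted DRAM row
    changes, preserving row-buffer locality within each batch."""
    fifos = defaultdict(deque)
    for src, row in requests:
        fifos[src].append(row)

    batches = []
    for src, rows in fifos.items():
        batch = []
        for row in rows:
            if batch and row != batch[-1]:
                # Row changed: close the current batch
                batches.append((src, batch))
                batch = []
            batch.append(row)
        if batch:
            batches.append((src, batch))
    return batches
```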
![Page 82: Staged Memory Scheduling - GitHub Pages · 2020-04-28 · Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie](https://reader034.vdocument.in/reader034/viewer/2022050413/5f89da5bca9fa3376376bf15/html5/thumbnails/82.jpg)
Row Buffer Locality on Batch Formation
[Charts: CPU weighted speedup (0–6) and GPU frame rate (0–90)]
82
![Page 83: Staged Memory Scheduling - GitHub Pages · 2020-04-28 · Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie](https://reader034.vdocument.in/reader034/viewer/2022050413/5f89da5bca9fa3376376bf15/html5/thumbnails/83.jpg)
Key Differences Between CPU and GPU
[Bar chart: memory intensity (L2 MPKI, 0–400) of graphics applications vs. CPU applications: ~4x difference]
83
![Page 84: Staged Memory Scheduling - GitHub Pages · 2020-04-28 · Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie](https://reader034.vdocument.in/reader034/viewer/2022050413/5f89da5bca9fa3376376bf15/html5/thumbnails/84.jpg)
MLP and RBL
Key differences between a CPU application and a GPU application
[Charts: row buffer locality (0–1) and memory-level parallelism (0–6)]
84
![Page 85: Staged Memory Scheduling - GitHub Pages · 2020-04-28 · Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie](https://reader034.vdocument.in/reader034/viewer/2022050413/5f89da5bca9fa3376376bf15/html5/thumbnails/85.jpg)
CPU-GPU Performance Tradeoff
[Charts: GPU frame rate (0–90) and CPU weighted speedup (0–6) as the SJF probability varies from 1 to 0]
85
![Page 86: Staged Memory Scheduling - GitHub Pages · 2020-04-28 · Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie](https://reader034.vdocument.in/reader034/viewer/2022050413/5f89da5bca9fa3376376bf15/html5/thumbnails/86.jpg)
Dealing with Multi-Threaded Applications
Batch formation: Groups requests from each application in a per-thread FIFO
Batch scheduler: Detects critical threads and prioritizes them over non-critical threads
Previous works have shown how to detect and schedule critical threads
1) Bottleneck Identification and Scheduling in MT Applications [Joao+, ASPLOS’12]
2) Parallel Application Memory Scheduling [Ebrahimi+, MICRO’11]
DRAM command scheduler: Stays the same
86
![Page 87: Staged Memory Scheduling - GitHub Pages · 2020-04-28 · Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie](https://reader034.vdocument.in/reader034/viewer/2022050413/5f89da5bca9fa3376376bf15/html5/thumbnails/87.jpg)
Dealing with Prefetch Requests
Previous works have proposed several solutions:
Prefetch-Aware Shared-Resource Management for Multi-Core Systems [Ebrahimi+, ISCA’11]
Coordinated Control of Multiple Prefetchers in Multi-Core Systems [Ebrahimi+, MICRO’09]
Prefetch-aware DRAM Controller [Lee+, MICRO’08]
Handling Prefetch Requests in SMS:
SMS can handle prefetch requests before they enter the memory controller (e.g., source throttling based on prefetch accuracy)
SMS can handle prefetch requests by prioritizing/deprioritizing prefetch batch at the batch scheduler (based on prefetch accuracy)
87
![Page 88: Staged Memory Scheduling - GitHub Pages · 2020-04-28 · Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie](https://reader034.vdocument.in/reader034/viewer/2022050413/5f89da5bca9fa3376376bf15/html5/thumbnails/88.jpg)
Fairness Evaluation
[Chart: unfairness (lower is better)]
88
![Page 89: Staged Memory Scheduling - GitHub Pages · 2020-04-28 · Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie](https://reader034.vdocument.in/reader034/viewer/2022050413/5f89da5bca9fa3376376bf15/html5/thumbnails/89.jpg)
Performance at Different Buffer Sizes
89
![Page 90: Staged Memory Scheduling - GitHub Pages · 2020-04-28 · Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie](https://reader034.vdocument.in/reader034/viewer/2022050413/5f89da5bca9fa3376376bf15/html5/thumbnails/90.jpg)
CPU and GPU Performance Breakdowns
90
CPU WS Frame Rate
![Page 91: Staged Memory Scheduling - GitHub Pages · 2020-04-28 · Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie](https://reader034.vdocument.in/reader034/viewer/2022050413/5f89da5bca9fa3376376bf15/html5/thumbnails/91.jpg)
CPU-Only Results
91
![Page 92: Staged Memory Scheduling - GitHub Pages · 2020-04-28 · Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie](https://reader034.vdocument.in/reader034/viewer/2022050413/5f89da5bca9fa3376376bf15/html5/thumbnails/92.jpg)
Scalability to Number of Cores
92
![Page 93: Staged Memory Scheduling - GitHub Pages · 2020-04-28 · Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie](https://reader034.vdocument.in/reader034/viewer/2022050413/5f89da5bca9fa3376376bf15/html5/thumbnails/93.jpg)
Scalability to Number of Memory Controllers
93
![Page 94: Staged Memory Scheduling - GitHub Pages · 2020-04-28 · Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie](https://reader034.vdocument.in/reader034/viewer/2022050413/5f89da5bca9fa3376376bf15/html5/thumbnails/94.jpg)
Detailed Simulation Methodology

| Parameter | Value |
| --- | --- |
| Number of cores | 16 |
| Number of GPUs | 1 |
| CPU reorder buffer (ROB) size | 128 entries |
| L1 (private) cache size | 32KB, 4 ways |
| L2 (shared) cache size | 8MB, 16 ways |
| GPU max throughput | 1600 ops/cycle |
| GPU Texture/Z/Color units | 80/128/32 |
| DRAM bus | 64 bits/channel |
| DRAM row buffer size | 2KB |
| MC request buffer size | 300 entries |

94
![Page 95: Staged Memory Scheduling - GitHub Pages · 2020-04-28 · Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie](https://reader034.vdocument.in/reader034/viewer/2022050413/5f89da5bca9fa3376376bf15/html5/thumbnails/95.jpg)
Analysis to Different SMS Parameters
95
![Page 96: Staged Memory Scheduling - GitHub Pages · 2020-04-28 · Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie](https://reader034.vdocument.in/reader034/viewer/2022050413/5f89da5bca9fa3376376bf15/html5/thumbnails/96.jpg)
Global Bypass
What if the system is lightly loaded?
Batching will increase the latency of requests
Global Bypass
Disable batch formation when the total number of outstanding requests is below a threshold
96
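The global-bypass decision above amounts to a single check on the request path. A minimal sketch, assuming a list-like DRAM scheduler queue; the function name and the threshold value of 8 are illustrative, not from the slides:

```python
def handle_request(req, total_in_flight, batch_former, dram_scheduler,
                   bypass_threshold=8):
    """Global bypass: under light load, skip batch formation and send
    the request straight to the DRAM command scheduler, so it is not
    delayed waiting for a batch to fill."""
    if total_in_flight < bypass_threshold:
        dram_scheduler.append(req)   # bypass path: no batching latency
    else:
        batch_former.append(req)     # normal staged path
```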