Jiang Lin1, Qingda Lu2, Xiaoning Ding2, Zhao Zhang1, Xiaodong Zhang2, and P. Sadayappan2
Gaining Insights into Multi-Core Cache Partitioning:
Bridging the Gap between Simulation and Real Systems
1 Department of ECE
Iowa State University
2 Department of CSE
The Ohio State University
Shared Caches Can be a Critical Bottleneck in Multi-Core Processors
• L2/L3 caches are shared by multiple cores
  • Intel Xeon 51xx (2core/L2)
  • AMD Barcelona (4core/L3)
  • Sun T2, ... (8core/L2)
• Effective cache partitioning is critical to address the bottleneck caused by the conflicting accesses in shared caches.
• Several hardware cache partitioning methods have been proposed with different optimization objectives
  • Performance: [HPCA’02], [HPCA’04], [Micro’06]
  • Fairness: [PACT’04], [ICS’07], [SIGMETRICS’07]
  • QoS: [ICS’04], [ISCA’07]
[Figure: multiple cores (Core, Core, ……, Core) sharing an L2/L3 cache]
Limitations of Simulation-Based Studies
• Excessive simulation time
  • Whole programs cannot be evaluated. It would take several weeks/months to complete a single SPEC CPU2006 benchmark
  • As the number of cores continues to increase, simulation ability becomes even more limited
• Absence of long-term OS activities
  • Interactions between processor/OS affect performance significantly
• Proneness to simulation inaccuracy
  • Bugs in simulator
  • Impossible to model many dynamics and details of the system
Our Approach to Address the Issues
• Design and implement OS-based cache partitioning
  • Embedding cache partitioning mechanism in OS
    • By enhancing page coloring technique
    • To support both static and dynamic cache partitioning
• Evaluate cache partitioning policies on commodity processors
  • Execution- and measurement-based
  • Run applications to completion
  • Measure performance with hardware counters
Four Questions to Answer
Can we confirm the conclusions made by the simulation-based studies?
Can we provide new insights and findings that simulation is not able to?
Can we make a case for our OS-based approach as an effective option to evaluate multicore cache partitioning designs?
What are the advantages and disadvantages of OS-based cache partitioning?
Outline
• Introduction
• Design and implementation of OS-based cache partitioning mechanisms
• Evaluation environment and workload construction
• Cache partitioning policies and their results
• Conclusion
OS-Based Cache Partitioning Mechanisms
• Static cache partitioning
  • Predetermines the number of cache blocks allocated to each program at the beginning of its execution
  • Page coloring enhancement
    • Divides the shared cache into multiple regions and partitions the cache regions through OS page address mapping
• Dynamic cache partitioning
  • Adjusts cache quota among processes dynamically
  • Page re-coloring
    • Dynamically changes processes’ cache usage through OS page address re-mapping
Page Coloring
[Figure: address translation maps a virtual address (virtual page number, page offset) to a physical address (physical page number, page offset). The physically indexed cache splits the address into cache tag, set index, and block offset. The page color bits are the bits that belong to both the physical page number and the set index, and they are under OS control.]
• Physically indexed caches are divided into multiple regions (colors).
• All cache lines in a physical page are cached in one of those regions (colors).
• OS can control the page color of a virtual page through address mapping (by selecting a physical page with a specific value in its page color bits).
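As a rough illustration of the mapping described above, the following sketch computes how many colors a physically indexed cache offers and which color a given physical address falls into. The cache parameters, the 4 KB page size, and the function names are hypothetical choices for the example, and the sketch assumes the color bits are the lowest bits of the physical page number.

```python
PAGE_SIZE = 4096  # assumed 4 KB pages


def num_colors(cache_bytes, ways, page_size=PAGE_SIZE):
    """Number of page colors: bytes covered by one cache way divided by the page size."""
    return cache_bytes // (ways * page_size)


def page_color(phys_addr, colors, page_size=PAGE_SIZE):
    """Color of the physical page containing phys_addr (low bits of the page number)."""
    return (phys_addr // page_size) % colors


# Hypothetical cache used only for this example: 1 MB, 16-way set associative.
colors = num_colors(1 << 20, 16)
color_of_page = page_color(0x12345000, colors)
```

Every address inside one page yields the same color, which is why the OS can steer a virtual page into a cache region simply by choosing which physical page backs it.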
Enhancement for Static Cache Partitioning
[Figure: physical pages are grouped into page bins according to their page color (1, 2, 3, 4, …, i, i+1, i+2, …). Through OS address mapping, Process 1 and Process 2 each receive pages from their own bins, which correspond to separate regions of the physically indexed cache.]
• Shared cache is partitioned between two processes through address mapping.
• Cost: Main memory space needs to be partitioned too (co-partitioning).
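The page-bin idea can be sketched as a toy allocator that keeps one free list per color and only hands a process pages from the colors assigned to it. This is an illustrative model, not the kernel implementation; the class name, method names, and the round-robin choice among colors are assumptions made for the example.

```python
from collections import defaultdict, deque


class ColoredAllocator:
    """Toy physical-page allocator with one free list (bin) per page color."""

    def __init__(self, num_pages, num_colors):
        self.num_colors = num_colors
        self.bins = defaultdict(deque)
        for ppn in range(num_pages):
            # Assumption: the color is given by the low bits of the page number.
            self.bins[ppn % num_colors].append(ppn)
        self.allowed = {}   # pid -> list of colors the process may use
        self.next_idx = {}  # pid -> round-robin position in that list

    def assign_colors(self, pid, colors):
        self.allowed[pid] = list(colors)
        self.next_idx[pid] = 0

    def alloc_page(self, pid):
        """Return a free physical page whose color belongs to pid's partition."""
        colors = self.allowed[pid]
        for _ in range(len(colors)):
            color = colors[self.next_idx[pid] % len(colors)]
            self.next_idx[pid] += 1
            if self.bins[color]:
                return self.bins[color].popleft()
        raise MemoryError("no free page left in this process's colors")


alloc = ColoredAllocator(num_pages=64, num_colors=16)
alloc.assign_colors(1, range(0, 4))    # process 1: 4 of 16 colors
alloc.assign_colors(2, range(4, 16))   # process 2: the remaining 12 colors
first_pages = [alloc.alloc_page(1) for _ in range(4)]
```

The `MemoryError` path shows the co-partitioning cost in miniature: a process restricted to 4 of 16 colors can use at most a quarter of physical memory, even if other bins still hold free pages.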
Dynamic Cache Partitioning
• Why?
  • Programs have dynamic behaviors
  • Most proposed schemes are dynamic
• How?
  • Page re-coloring
• How to handle overhead?
  • Measure overhead by performance counter
  • Remove overhead in result (emulating hardware schemes)
Dynamic Cache Partitioning through Page Re-Coloring
[Figure: page links table with entries 0, 1, 2, 3, ……, N - 1, one linked list of pages per color; the allocated colors are marked before and after re-coloring.]
• Page re-coloring:
  • Allocate page in new color
  • Copy memory contents
  • Free old page
• Pages of a process are organized into linked lists by their colors.
• Memory allocation guarantees that pages are evenly distributed into all the lists (colors) to avoid hot spots.
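The three re-coloring steps can be sketched over plain dictionaries standing in for the per-color page lists. The function name, the data layout, and the round-robin spreading over the remaining colors are assumptions for illustration; a real kernel would also update page tables and flush TLB entries.

```python
def recolor(pages_by_color, memory, free_by_color, new_colors):
    """Move every page whose color is no longer allowed into an allowed color.

    pages_by_color: dict color -> list of physical pages owned by the process
    memory:         dict physical page -> page contents
    free_by_color:  dict color -> list of free physical pages
    new_colors:     colors the process may use after repartitioning
    Returns a dict old_page -> new_page (what a page-table update would need).
    """
    targets = sorted(new_colors)
    remap = {}
    i = 0
    for color in list(pages_by_color):
        if color in new_colors:
            continue
        for old in pages_by_color.pop(color):
            dst = targets[i % len(targets)]      # spread pages evenly over colors
            i += 1
            new = free_by_color[dst].pop()       # 1. allocate page in new color
            memory[new] = memory.pop(old)        # 2. copy memory contents
            free_by_color.setdefault(color, []).append(old)  # 3. free old page
            pages_by_color.setdefault(dst, []).append(new)
            remap[old] = new
    return remap


pages = {0: [0], 1: [1], 2: [2], 3: [3]}
mem = {0: "a", 1: "b", 2: "c", 3: "d"}
free = {0: [4, 8], 1: [5, 9], 2: [], 3: []}
moved = recolor(pages, mem, free, new_colors={0, 1})
```

The sketch raises `IndexError` if a target color has no free page left, which a real allocator would have to handle.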
Control the Page Migration Overhead
• Control the frequency of page migration
  • Frequent enough to capture application phase changes
  • Not so often that it introduces large page migration overhead
• Lazy migration: avoid unnecessary page migration
  • Observation: Not all pages are accessed between their two migrations.
  • Optimization: do not migrate a page until it is accessed
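A minimal sketch of the lazy idea: repartitioning only marks the affected pages, and a page is moved the first time it is touched afterwards. The class and its methods are hypothetical, and how an access is detected (for example through a page-fault hook) is an assumption that the slides do not specify.

```python
class LazyRecolorer:
    """Defers page migration until a page with a stale color is accessed."""

    def __init__(self, page_color, allowed):
        self.page_color = dict(page_color)  # virtual page -> current color
        self.allowed = sorted(allowed)
        self.pending = set()                # pages whose color is no longer allowed
        self.migrations = 0
        self._next = 0

    def repartition(self, allowed):
        """Change the color set; only mark affected pages, move nothing yet."""
        self.allowed = sorted(allowed)
        self.pending = {vp for vp, c in self.page_color.items()
                        if c not in self.allowed}

    def access(self, vpage):
        """Migrate vpage on first access after a repartition, then return its color."""
        if vpage in self.pending:
            self.page_color[vpage] = self.allowed[self._next % len(self.allowed)]
            self._next += 1
            self.pending.discard(vpage)
            self.migrations += 1
        return self.page_color[vpage]


r = LazyRecolorer({0: 0, 1: 1, 2: 2, 3: 3}, allowed={0, 1, 2, 3})
r.repartition({0, 1})        # pages 2 and 3 now have stale colors
r.access(2)                  # page 2 is touched, so it is migrated
r.repartition({0, 1, 2, 3})  # the partition grows back before page 3 is touched
```

In this run page 3 is never migrated, which is exactly the case the observation above describes.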
Lazy Page Migration
[Figure: process page links with entries 0, 1, 2, 3, ……, N - 1 and the allocated colors before and after repartitioning. Callout: "Avoid unnecessary page migration for these pages!"]
• After the optimization
  • On average, 2% page migration overhead
  • Up to 7%.
Experimental Environment
• Dell PowerEdge 1950
  • Two-way SMP, Intel dual-core Xeon 5160
  • Shared 4MB L2 cache, 16-way
  • 8GB Fully Buffered DIMM
• Red Hat Enterprise Linux 4.0
  • 2.6.20.3 kernel
  • Performance counter tools from HP (Pfmon)
  • Divide L2 cache into 16 colors
Benchmark Classification
• Is it sensitive to L2 cache capacity?
  • Red group: IPC(1M L2 cache)/IPC(4M L2 cache) < 80%
    • Give red benchmarks more cache: big performance gain
  • Yellow group: 80% < IPC(1M L2 cache)/IPC(4M L2 cache) < 95%
    • Give yellow benchmarks more cache: moderate performance gain
• Else: Does it extensively access L2 cache?
  • Green group: >= 14 accesses / 1K cycles
    • Give it small cache
  • Black group: < 14 accesses / 1K cycles
    • Cache insensitive
• 29 benchmarks from SPEC CPU2006 (6 red, 9 yellow, 6 green, 8 black)
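The classification rules above can be written as a small decision function. The function name and argument names are hypothetical, and the handling of the exact boundary values (a ratio of exactly 80% or 95%) is an assumption, since the slide leaves those cases open.

```python
def classify(ipc_1m, ipc_4m, l2_accesses_per_kcycle):
    """Classify a benchmark into the red/yellow/green/black groups.

    ipc_1m, ipc_4m: IPC measured with 1 MB and 4 MB of L2 cache
    l2_accesses_per_kcycle: L2 accesses per 1K cycles
    """
    ratio = ipc_1m / ipc_4m
    if ratio < 0.80:
        return "red"      # large gain from more cache
    if ratio < 0.95:
        return "yellow"   # moderate gain (exactly 80% is treated as yellow here)
    if l2_accesses_per_kcycle >= 14:
        return "green"    # accesses L2 heavily but gains little: give it a small cache
    return "black"        # cache insensitive


examples = [
    classify(0.5, 1.0, 30),
    classify(0.9, 1.0, 30),
    classify(0.99, 1.0, 14),
    classify(0.99, 1.0, 3),
]
```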
Workload Construction
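[Figure: 2-core workloads are built by pairing benchmarks from the groups of 6, 9, and 6 benchmarks (red, yellow, green).]
• RR (3 pairs)
• RY (6 pairs)
• RG (6 pairs)
• YY (3 pairs)
• YG (6 pairs)
• GG (3 pairs)
• 27 workloads: representative benchmark combinations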
Outline
• Introduction
• OS-based cache partitioning mechanism
• Evaluation environment and workload construction
• Cache partitioning policies and their results
  • Performance
  • Fairness
• Conclusion
Performance – Metrics
• Divide metrics into evaluation metrics and policy metrics [PACT’06]
• Evaluation metrics:
  • Optimization objectives, not always available during run-time
• Policy metrics:
  • Used to drive dynamic partitioning policies: available during run-time
  • Sum of IPC, Combined cache miss rate, Combined cache misses
Static Partitioning
• Total number of cache colors: 16
• Give at least two colors to each program
  • Make sure that each program gets 1GB memory to avoid swapping (because of co-partitioning)
• Try all possible partitionings for all workloads
  • (2:14), (3:13), (4:12), ……, (8:8), ……, (13:3), (14:2)
  • Get value of evaluation metrics
  • Compare performance of all partitionings with performance of shared cache
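The exhaustive search over static partitionings is small enough to write down directly. In this sketch the function names are hypothetical, and `measure` stands for whatever evaluation metric is obtained by actually running the workload under a given partitioning; it is assumed that higher values are better.

```python
def partitionings(total_colors=16, min_colors=2):
    """All (colors for program 0, colors for program 1) splits with a per-program minimum."""
    return [(a, total_colors - a)
            for a in range(min_colors, total_colors - min_colors + 1)]


def best_static(measure, total_colors=16, min_colors=2):
    """Return the partitioning with the highest value of measure(partition)."""
    return max(partitionings(total_colors, min_colors), key=measure)


all_splits = partitionings()
# Stand-in metric for demonstration only: pretends the workload peaks at (11:5).
best = best_static(lambda p: -abs(p[0] - 11))
```

With 16 colors and a minimum of two per program there are 13 candidate partitionings. If the 8GB of memory is split evenly across the 16 colors, each color covers 512MB, so two colors correspond to the 1GB minimum mentioned above.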
Performance – Optimal Static Partitioning
[Figure: Performance gain of optimal static partitioning (y-axis from 1.00 to 1.25) for workload types RR, RY, RG, YY, YG, GG. Series: Throughput, Average Weighted Speedup, Normalized SMT Speedup, Fair Speedup.]
• Confirms that cache partitioning has significant performance impact
• Different evaluation metrics show different performance gains
• RG-type workloads have the largest performance gains (up to 47%)
• Other types of workloads also have performance gains (2% to 10%)
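The four evaluation metrics named in the chart are not defined on the slides. The sketch below uses definitions that are common in the cache-partitioning literature, so the formulas, the function names, and the choice of baseline IPCs are all assumptions rather than the authors' stated definitions.

```python
def throughput(ipc):
    """Sum of the IPCs of all co-running programs."""
    return sum(ipc)


def average_weighted_speedup(ipc, ipc_base):
    """Arithmetic mean of the per-program speedups over the baseline."""
    return sum(s / b for s, b in zip(ipc, ipc_base)) / len(ipc)


def smt_speedup(ipc, ipc_base):
    """Sum of the per-program speedups over the baseline."""
    return sum(s / b for s, b in zip(ipc, ipc_base))


def fair_speedup(ipc, ipc_base):
    """Harmonic mean of the per-program speedups; penalizes uneven slowdowns."""
    return len(ipc) / sum(b / s for s, b in zip(ipc, ipc_base))


# Made-up numbers for a two-program workload, for illustration only.
ipc = [1.0, 2.0]
ipc_base = [2.0, 2.0]
values = (throughput(ipc), average_weighted_speedup(ipc, ipc_base),
          smt_speedup(ipc, ipc_base), fair_speedup(ipc, ipc_base))
```

Because the harmonic mean weights the slower program more heavily than the arithmetic mean does, the same partitioning can rank differently under different metrics.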
A New Finding
• Workload RG1: 401.bzip2 (Red) + 410.bwaves (Green)
• Intuitively, giving more cache space to 401.bzip2 (Red)
  • Increases the performance of 401.bzip2 largely (Red)
  • Decreases the performance of 410.bwaves slightly (Green)
• However, we observe that
Insight into Our Finding
[Figure: Memory Bandwidth Utilization in GB/s (y-axis from 2.70 to 3.05) for partitionings 2:14 through 14:2.]
[Figure: Average Memory Access Latency in ns (y-axis from 140 to 156) for partitionings 2:14 through 14:2.]
Insight into Our Finding
• We have the same observation in RG4, RG5 and YG5
• This is not observed by simulation
  • Did not model main memory sub-system in detail
  • Assumed fixed memory access latency
• Shows the advantages of our execution- and measurement-based study
Performance - Dynamic Partition Policy
[Flowchart: dynamic partitioning loop]
• Init: Partition the cache as (8:8)
• Run current partition (P0:P1) for one epoch
• Finished? Yes: exit; No: continue
• Try one epoch for each of the two neighboring partitions: (P0 – 1 : P1 + 1) and (P0 + 1 : P1 – 1)
• Choose next partitioning with best policy metrics measurement
• A simple greedy policy.
• Emulate policy of [HPCA’02]
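The greedy loop can be sketched as follows. Here `measure` is a hypothetical callback that would run one epoch with the given number of colors for program 0 and return a policy metric where lower is better (for example a combined miss rate). The fixed number of rounds and the per-program minimum of two colors are assumptions made so that the example terminates and stays within bounds.

```python
def greedy_partition(measure, total_colors=16, min_colors=2, rounds=10):
    """Greedy search over neighboring partitions.

    measure(p0) -> policy metric for the partition (p0 : total_colors - p0),
    lower is better. Returns the sequence of chosen p0 values.
    """
    p0 = total_colors // 2            # Init: start from the even split (8:8)
    history = [p0]
    for _ in range(rounds):
        candidates = [p0]             # current partition
        for cand in (p0 - 1, p0 + 1): # its two neighbors
            if min_colors <= cand <= total_colors - min_colors:
                candidates.append(cand)
        # One epoch is measured per candidate; keep the best one.
        p0 = min(candidates, key=measure)
        history.append(p0)
    return history


# Stand-in metric for demonstration only: minimal when program 0 has 11 colors.
trace = greedy_partition(lambda p: (p - 11) ** 2, rounds=5)
```

Since the policy moves by at most one color per round, it needs several epochs to reach a distant optimum and can settle on a local one, which is the usual trade-off of a greedy search.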
Performance – Static & Dynamic
• Use combined miss rates as policy metrics
• For RG-type, and some RY-type:
  • Static partitioning outperforms dynamic partitioning
• For RR- and RY-type, and some RY-type
  • Dynamic partitioning outperforms static partitioning
Fairness – Metrics and Policy [PACT’04]
• Metrics
  • Evaluation metrics: FM0, difference in slowdown, smaller is better
  • Policy metrics
• Policy
  • Repartitioning and rollback
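The slide describes FM0 only as a "difference in slowdown". One way to read that, assumed here, is the sum of absolute pairwise differences between the slowdowns of the co-running programs, where slowdown is taken relative to running alone. The function names and the use of IPC to express slowdown are illustrative choices.

```python
from itertools import combinations


def slowdown(ipc_alone, ipc_shared):
    """How many times slower a program runs when sharing the cache."""
    return ipc_alone / ipc_shared


def fm0(slowdowns):
    """Sum of absolute pairwise slowdown differences; 0 means perfectly fair."""
    return sum(abs(a - b) for a, b in combinations(slowdowns, 2))


# Made-up numbers: program A slows down 2x, program B only 1.25x.
unfair = fm0([slowdown(2.0, 1.0), slowdown(1.0, 0.8)])
fair = fm0([1.5, 1.5])
```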
Fairness - Result
• Dynamic partitioning can achieve better fairness
  • If we use FM0 as both evaluation metrics and policy metrics
• None of policy metrics (FM1 to FM5) is good enough to drive the partitioning policy to get comparable fairness with static partitioning
  • Strong correlation was reported in simulation-based study – [PACT’04]
  • None of policy metrics has consistently strong correlation with FM0
• Differences in setup
  • This study: SPEC CPU2006 (ref input); complete trillions of instructions; 4MB L2 cache
  • [PACT’04]: SPEC CPU2000 (test input); less than one billion instructions; 512KB L2 cache
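Whether a policy metric tracks FM0 can be checked with a correlation coefficient computed across partitionings. The sketch below uses the Pearson coefficient, which is an assumption about how the correlation is measured, and the sample values are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up FM0 values for four partitionings and two hypothetical policy metrics.
fm0_values = [0.10, 0.20, 0.30, 0.40]
tracks_fm0 = pearson([1.0, 2.0, 3.0, 4.0], fm0_values)
opposes_fm0 = pearson([4.0, 3.0, 2.0, 1.0], fm0_values)
```

A policy metric is only useful for driving a fairness policy if its coefficient stays close to 1 across workloads; values near 0 or below mean that optimizing it does not reliably improve FM0.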
Conclusion
• Confirmed some conclusions made by simulations
• Provided new insights and findings
  • Giving cache space from one program to another can increase the performance of both
  • Poor correlation between evaluation and policy metrics for fairness
• Made a case for our OS-based approach as an effective option for evaluation of multicore cache partitioning
• Advantages of OS-based cache partitioning
  • Working on commodity processors for an execution- and measurement-based study
• Disadvantages of OS-based cache partitioning
  • Co-partitioning (may underutilize memory), migration overhead
Ongoing Work
• Reduce migration overhead on commodity processors
• Cache partitioning at the compiler level
  • Partition cache at object level
• Hybrid cache partitioning method
  • Remove the cost of co-partitioning
  • Avoid page migration overhead
Thanks!
Backup Slides
Fairness - Correlation between Evaluation Metrics and Policy Metrics (Reported by [PACT’04])
[Figure: correlation (y-axis from -1 to 1) between each policy metric and M0, shown as Corr(M1,M0), Corr(M2,M0), Corr(M3,M0), Corr(M4,M0), Corr(M5,M0), for apsi+equake, gzip+apsi, swim+gzip, tree+mcf, and AVG18.]
Strong correlation was reported in simulation study – [PACT’04]
Fairness - Correlation between Evaluation Metrics and Policy Metrics (Our result)
• None of policy metrics has consistently strong correlation with FM0
• Differences in setup
  • This study: SPEC CPU2006 (ref input); complete trillions of instructions; 4MB L2 cache
  • [PACT’04]: SPEC CPU2000 (test input); less than one billion instructions; 512KB L2 cache
[Figure: correlation (y-axis from -1 to 1) between policy metrics and FM0 for workloads YY1, YY2, YY3, YG1, YG2, YG3, YG4, YG5, YG6, GG1, GG2, GG3. Series: FM1, FM3, FM4, FM5.]