SOS: A Software-Oriented Distributed Shared Cache Management Approach for
Chip Multiprocessors
Lei Jin and Sangyeun Cho
Dept. of Computer Science, University of Pittsburgh
University of Pittsburgh
PACT 2009
Chip Multiprocessor Development
The end of uniprocessor performance scaling has turned researchers to chip multiprocessor architectures
The number of cores is increasing at a fast pace
[Figure: core count (0 to 8) over a timeline from 1998 to 2010 for Pentium 4, Power5, Pentium D, Core 2 Duo, Athlon X2, Power6, Phenom X3, Core i7, Phenom X4, and Opteron. Source: Wikipedia]
The CMP Cache
A CMP = N cores + one (coherent) cache system
[Figure: 16 cores in a 4x4 grid attached to one cache]
How can one cache system sustain the growth of N cores?
[Figure: one tile containing a core, L1 I/D cache, L2 cache slice, directory, and router]
Non-Uniform Cache Architecture (NUCA)
Shared cache scheme vs. private cache scheme
Hybrid Cache Schemes
Victim Replication [Zhang and Asanovic ISCA '05]
Adaptive Selective Replication [Beckmann et al. MICRO '06]
CMP-NuRAPID [Chishti et al. ISCA '05]
Cooperative Caching [Chang and Sohi ISCA '06]
R-NUCA [Hardavellas et al. ISCA '09]
Problems with hardware-based schemes:
• Hardware complexity
• Limited scalability
The Challenge
CMPs make the core count scalable
A cache system with scalable performance is critical in CMPs
Existing hardware-based schemes fail to provide it
We propose a Software-Oriented Shared (SOS) cache management approach:
• Minimum hardware support
• Good scalability
Our Contributions
We studied access patterns in multithreaded workloads and found that they can be exploited to improve locality
We proposed the SOS scheme, which offloads work from hardware to software analysis
We evaluated our scheme and showed that it is a promising approach
Outline
Motivation
Observation in access patterns
SOS scheme
Evaluation results
Conclusions
Observation
L2 cache access distribution of Cholesky
[Figure: cumulative percentage of accesses (0% to 100%) vs. sharer count (0 to 16), with two curves: Total Sharer# and Concurrent Sharer#]
• Total Sharer# at sharer count 15: number of accesses to blocks shared by 15 threads or fewer during the whole execution
• Concurrent Sharer# at sharer count 15: number of accesses to blocks shared by 15 threads or fewer simultaneously
Observation
L2 cache accesses are skewed at the two extremes
[Figure: cumulative percentage of accesses (0% to 100%) vs. sharer count (0 to 16), with two curves: Total Sharer# and Concurrent Sharer#]
• ~50% highly shared access
• ~30% private data access
Access Patterns
Static data vs. dynamic data
• Static data: location and size are known prior to execution (e.g., global data)
• Dynamic data: location and size vary among executions, but patterns may persist (e.g., data allocated by malloc(), stack data)
• Dynamic data is more important than static data
Common access patterns for dynamic data are:
• Even partition
• Scattered
• Dominant owner
• Shared
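Even Partition Pattern
A contiguous memory space is partitioned evenly among threads

Main thread:
    Array = malloc(sizeof(int) * NumProc * N);   /* one N-element slice per thread */

Thread [ProcNo]:
    for (i = 0; i < N; i++)
        Array[ProcNo * N + i] = x;               /* each thread touches only its own slice */

[Figure: one contiguous region split into four equal parts owned by T0, T1, T2, T3]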
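The even partition above implies a simple page-to-owner mapping. The following is a minimal sketch of that mapping, assuming a 4 KB page size and a hypothetical helper named `even_partition_owner`; neither comes from the slides.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096u  /* assumed page size, not stated in the slides */

/* Return the thread that owns the page at index page_idx when an area of
 * area_bytes is split evenly among num_threads threads. A page that
 * straddles a boundary goes to the thread owning its first byte. */
unsigned even_partition_owner(size_t area_bytes, unsigned num_threads,
                              size_t page_idx)
{
    assert(num_threads > 0 && area_bytes >= num_threads);
    size_t bytes_per_thread = area_bytes / num_threads;
    size_t first_byte = page_idx * PAGE_SIZE;
    assert(first_byte < area_bytes);
    size_t owner = first_byte / bytes_per_thread;
    /* Leftover bytes from the integer division go to the last thread. */
    return owner < num_threads ? (unsigned)owner : num_threads - 1;
}
```

For an 8-page area shared by 4 threads, this gives pages 0 and 1 to thread 0, pages 2 and 3 to thread 1, and so on.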
Scattered Pattern
Memory spaces are not contiguous, but each is owned by one thread

Main thread:
    ArrayPtr = malloc(sizeof(int *) * NumProc);      /* one pointer per thread */
    for (i = 0; i < NumProc; i++)
        ArrayPtr[i] = malloc(sizeof(int) * Size[i]);

Thread [ProcNo]:
    for (i = 0; i < Size[ProcNo]; i++)               /* bound is this thread's own size */
        ArrayPtr[ProcNo][i] = i;

[Figure: regions owned by T0, T1, T2, T3, with gaps between some of them]
Other Patterns
Dominant owner: data are accessed by multiple threads, but one thread contributes significantly more accesses than the others
Shared: data are widely shared
SOS Scheme
The SOS scheme consists of 3 components:
• L2 cache access profiling (one-time offline analysis)
• Page clustering & pattern recognition (one-time offline analysis)
• Page coloring and replication (run-time)
Page Clustering
We take a machine-learning based approach:
[Figure: per-thread L2 cache access traces (T0 to T3) over a dynamic area are summarized into a per-page histogram, which feeds K-means clustering]
K-means centroids:
C0 (1, 0, 0, 0)
C1 (0, 1, 0, 0)
C2 (0, 0, 1, 0)
C3 (0, 0, 0, 1)
C4 (1, 1, 1, 1)
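As an illustration of the clustering input and the assignment step of K-means, here is a minimal sketch for 4 threads. The normalization (scaling each page's histogram so its largest entry is 1.0) and the function names are assumptions; the slides do not say how histograms are scaled, and the centroid update step of K-means is not shown.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_THREADS 4
#define NUM_CENTROIDS 5

/* Centroids listed on the slide: C0..C3 are one-thread pages, C4 is shared. */
const double initial_centroids[NUM_CENTROIDS][NUM_THREADS] = {
    {1, 0, 0, 0}, {0, 1, 0, 0}, {0, 0, 1, 0}, {0, 0, 0, 1}, {1, 1, 1, 1},
};

/* Scale a page's per-thread access counts so the largest entry is 1.0
 * (an assumed normalization that makes counts comparable to the centroids). */
void normalize_histogram(const unsigned long counts[NUM_THREADS],
                         double out[NUM_THREADS])
{
    unsigned long max = 0;
    for (size_t t = 0; t < NUM_THREADS; t++)
        if (counts[t] > max)
            max = counts[t];
    for (size_t t = 0; t < NUM_THREADS; t++)
        out[t] = max ? (double)counts[t] / (double)max : 0.0;
}

/* K-means assignment step: index of the centroid with the smallest
 * squared Euclidean distance to the point. */
size_t nearest_centroid(const double point[NUM_THREADS],
                        const double centroids[][NUM_THREADS], size_t k)
{
    assert(k > 0);
    size_t best = 0;
    double best_dist = -1.0;
    for (size_t c = 0; c < k; c++) {
        double dist = 0.0;
        for (size_t t = 0; t < NUM_THREADS; t++) {
            double d = point[t] - centroids[c][t];
            dist += d * d;
        }
        if (best_dist < 0.0 || dist < best_dist) {
            best_dist = dist;
            best = c;
        }
    }
    return best;
}
```

A page touched almost only by thread 0 lands in C0, while a page touched about equally by all four threads lands in C4.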
Pattern Recognition
Assume a dynamic area consists of 8 pages (P0 to P7)
Initial centroids for K-means clustering:
C0 (1, 0, 0, 0): pages accessed mostly by thread 0
C1 (0, 1, 0, 0)
C2 (0, 0, 1, 0)
C3 (0, 0, 0, 1): pages accessed mostly by thread 3
C4 (1, 1, 1, 1): highly shared pages
[Figure: pages P0 to P7 grouped around centroids C0 to C4]
Compare the clustering result with the ideal partition (an even split of the 8 pages among 4 threads):
P0 – P1
P2 – P3
P4 – P5
P6 – P7
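The comparison step can be sketched as counting how many pages fall in the cluster that an ideal even partition predicts. This is only an illustration: the function names and the percentage threshold are hypothetical, and the slides do not state how close a match must be before a pattern is accepted.

```c
#include <assert.h>
#include <stddef.h>

/* Count the pages whose cluster index equals the owner that an ideal even
 * partition would give them. cluster[p] is the cluster of page p; clusters
 * 0..num_threads-1 are assumed to mean "mostly accessed by that thread". */
size_t even_partition_matches(const size_t *cluster, size_t num_pages,
                              size_t num_threads)
{
    assert(num_threads > 0 && num_pages % num_threads == 0);
    size_t pages_per_thread = num_pages / num_threads;
    size_t matches = 0;
    for (size_t p = 0; p < num_pages; p++)
        if (cluster[p] == p / pages_per_thread)
            matches++;
    return matches;
}

/* Accept the even partition pattern when enough pages match. The
 * percentage threshold is a hypothetical tuning knob. */
int looks_like_even_partition(const size_t *cluster, size_t num_pages,
                              size_t num_threads, unsigned min_match_percent)
{
    size_t m = even_partition_matches(cluster, num_pages, num_threads);
    return m * 100 >= (size_t)min_match_percent * num_pages;
}
```

If seven of eight pages land in the expected cluster, an 80% threshold would accept the even partition pattern and a 90% threshold would reject it.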
Hints Representation & Utilization
For dynamic data, a pattern type is associated with every dynamic allocation system call:
[FileName, Line#, Pattern Type]
For static data, page location is explicitly given:
[Virtual Page Num, Tile ID]
SOS data management policy:
• Pattern type is translated into an actual partition when the dynamic area location and size are known by the OS
• Page location is assigned on demand if the partition information (hint) is available
• Data without corresponding hints are treated as highly shared and distributed at block level
• Data replication is enabled for shared data
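The two hint formats and the "no hint means highly shared" fallback can be sketched as plain C records with lookup helpers. The struct layouts, the enum encoding, the `NO_HINT` marker, and the function names are illustrative assumptions, not the authors' data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Pattern names follow the slides; the enum itself is an illustrative encoding. */
enum pattern_type {
    PAT_EVEN_PARTITION, PAT_SCATTERED, PAT_DOMINANT_OWNER, PAT_SHARED
};

/* Hint for dynamic data: [FileName, Line#, Pattern Type]. */
struct dynamic_hint {
    const char *file_name;
    unsigned line;
    enum pattern_type pattern;
};

/* Hint for static data: [Virtual Page Num, Tile ID]. */
struct static_hint {
    unsigned long virtual_page_num;
    unsigned tile_id;
};

#define NO_HINT (-1)  /* hypothetical marker: treat as highly shared */

/* Find the hint recorded for an allocation site, or NULL if there is none. */
const struct dynamic_hint *find_dynamic_hint(const struct dynamic_hint *hints,
                                             size_t n, const char *file,
                                             unsigned line)
{
    assert(file != NULL);
    for (size_t i = 0; i < n; i++)
        if (hints[i].line == line && strcmp(hints[i].file_name, file) == 0)
            return &hints[i];
    return NULL;
}

/* Look up the tile for a static page; NO_HINT means the page falls back
 * to block-level distribution. */
int static_page_tile(const struct static_hint *hints, size_t n,
                     unsigned long vpn)
{
    for (size_t i = 0; i < n; i++)
        if (hints[i].virtual_page_num == vpn)
            return (int)hints[i].tile_id;
    return NO_HINT;
}
```

Keying dynamic hints by allocation site rather than by address matches the slide's point that locations vary among executions while patterns persist.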
Architectural Support
To allow flexible data placement in the L2 cache, we add two fields to each page table entry and TLB entry [Jin and Cho CMP-MSI '07, Cho and Jin MICRO '06]
A TLB entry: Virtual Page Number | Physical Page Number | P | TID | BIN
• Physical Page Number: used to form the physical address for main memory access
• TID, BIN: used to locate the page in the L2 cache
The OS is responsible for providing TID and BIN
• Main memory access is the same as before, with the translated physical page address
• L2 cache addressing mode depends on the value of TID and BIN
Experiment Setup
We use a Simics-based memory simulator, modeling a 16-tile CMP with a 4x4 2D mesh on-chip network
Each core has a 2-issue in-order pipeline with private L1 I/D caches and an L2 cache slice
Programs from the SPLASH-2 and PARSEC suites are selected as benchmarks, with 3 different input sizes
The small input set is used to profile and generate hints, while the medium and large input sets are used to evaluate SOS performance
For brevity, we only present results for 4 representative programs (barnes, lu, cholesky, swaptions) and the overall average of 14 programs
Hint Accuracy
Accuracy is measured by the percentage of pages that are placed in the tile with the most accesses
[Figure: hint accuracy (0% to 100%) for barnes, lu, cholesky, swaptions, and the average of all apps, with small and medium inputs]
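The accuracy metric as defined above can be written down directly. This sketch assumes per-page, per-tile access counts are available from the simulator; the array layout, the tie-breaking rule, and the function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_TILES 16

/* Tile that issued the most accesses to a page (lowest index wins ties). */
size_t busiest_tile(const unsigned long accesses[NUM_TILES])
{
    size_t best = 0;
    for (size_t t = 1; t < NUM_TILES; t++)
        if (accesses[t] > accesses[best])
            best = t;
    return best;
}

/* Percentage of pages whose assigned tile is the tile with the most accesses. */
double hint_accuracy(const unsigned long accesses[][NUM_TILES],
                     const size_t *placed_tile, size_t num_pages)
{
    assert(num_pages > 0);
    size_t correct = 0;
    for (size_t p = 0; p < num_pages; p++)
        if (placed_tile[p] == busiest_tile(accesses[p]))
            correct++;
    return 100.0 * (double)correct / (double)num_pages;
}
```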
Breakdown of L2 Cache Accesses
[Figure: breakdown of L2 cache accesses (0% to 100%) by pattern: Static, Dominant Owner, Private Scatter, Ordered Scatter, Even Partition, Shared]
Patterns vary among different programs
A large percentage of L2 accesses can be tackled by page placement
The shared data are evenly distributed and handled by replication
Remote Access Comparison
[Figure: remote accesses (0% to 100%) under Shared, VR, Hints Only, and SOS for barnes, lu, cholesky, swaptions, and the average of all apps]
Hint-guided data placement significantly reduces the number of remote cache accesses
Our SOS scheme removes nearly 87% of remote accesses!
Execution Time
[Figure: execution time (0% to 100%) under Shared, Private, VR, Hints Only, and SOS for barnes, lu, cholesky, swaptions, and the average of all apps]
Hint-guided data placement tracks private cache performance closely
SOS performs nearly 20% better than the shared cache scheme
Related Work
Lu et al. PACT '09
• Analyzing array accesses and performing data layout transformation to improve data affinity
Marathe and Mueller PPoPP '06
• Profiling truncated program before every run
• Deriving optimal page location based on the sampled access trace
• Optimizing data locality for cc-NUMA
Hardavellas et al. ISCA '09
• Dynamic identification of private and shared pages
• Private mapping for private pages and fine-grained broadcast-mapping of shared pages
• Focuses on server workloads
Conclusions
We propose a software-oriented approach for shared cache management: controlling data placement and replication
This is the first work on a software-managed distributed shared cache scheme for CMPs
We show that multithreaded programs exhibit data access patterns that can be exploited to improve data affinity
We demonstrate through experiments that software-oriented shared cache management is a promising approach
• 19% performance improvement over the shared cache scheme
Thank you and Questions?
Future Work
Further study of more complex access patterns can show more benefits of our software-oriented cache management scheme.
Extend the current scheme to server workloads, which exhibit totally different cache behaviors from scientific workloads.
Hint Coverage
Hint coverage measures the percentage of L2 cache accesses to the pages guided by SOS.
[Figure: hint coverage (0% to 100%) for barnes, lu, cholesky, swaptions, and the average of all apps, with small and medium inputs]
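Hint coverage, as defined above, is an access-weighted ratio rather than a page count. A minimal sketch, assuming per-page L2 access counts and a per-page flag saying whether a hint guided the page (both array layouts and the function name are illustrative):

```c
#include <assert.h>
#include <stddef.h>

/* Percentage of L2 accesses that go to pages placed by a hint.
 * accesses[p] is the L2 access count of page p; hinted[p] is nonzero
 * when page p was guided by a hint. */
double hint_coverage(const unsigned long *accesses, const int *hinted,
                     size_t num_pages)
{
    unsigned long total = 0, guided = 0;
    for (size_t p = 0; p < num_pages; p++) {
        total += accesses[p];
        if (hinted[p])
            guided += accesses[p];
    }
    assert(total > 0);
    return 100.0 * (double)guided / (double)total;
}
```

Because it is weighted by accesses, a few hot hinted pages can yield high coverage even when most pages have no hint.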