Alternatives to Eager Hardware Cache Coherence on
Large-Scale CMPs
Marcelo Cintra
University of Edinburgh
http://www.homepages.inf.ed.ac.uk/mc
Acknowledgments

Research Team
– Dr. Tom Ashby (now at IMEC)
– Christian Fensch
– Pedro Jimenez
– Constantino Ribeiro
– Marius Kuhn

Funding
– UK: EPSRC
– EC: HiPEAC (NoE)
– EC: SARC (IP)
Why Cache Coherence?

When is it necessary?
– Shared-memory parallel applications (including the OS)
– Thread migration under multiprogramming
– Implementing synchronization (e.g., spin-locks and Dekker's algorithm; see the sketch below)

When is it not so important?
– Sequential applications with no migration
– Applications with little locality or reuse of data
– Message-passing parallel applications
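As a concrete example of the synchronization case, here is a minimal C11 spin-lock sketch (illustrative, not from the talk) whose correctness depends on coherence propagating the release store to the spinning cores:

#include <stdatomic.h>

atomic_int lock = 0;

void acquire(void) {
    /* Spin until we atomically change 0 -> 1.  Without coherence, a core
     * could spin forever on a stale cached copy of 'lock'. */
    int expected;
    do {
        expected = 0;
    } while (!atomic_compare_exchange_weak(&lock, &expected, 1));
}

void release(void) {
    /* This store must reach (or invalidate) every other cached copy,
     * which is exactly what hardware coherence provides. */
    atomic_store(&lock, 0);
}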
Hardware Cache Coherence?

What is good about it?
– Portability and transparency: no need to compile applications for different hierarchies
– High performance: high selectivity of invalidations, low traffic, low-latency operations

What is not so good about it?
– Extra hardware: for coherence state, for controllers, and for suitable interconnects
– Overkill: when coherence is not necessary (see previous slide), or when software can do a reasonable, or even better, job
Hardware Cache Coherence?

Extra hardware:
– Shared bus fabrics and crossbars will take up a large amount of chip real estate (easily >50% overhead for 32 processors or more, by some accounts)
– Protocols and controllers for distributed directories and multi-segment buses are complex and difficult to verify

Overkill:
– Communication in applications does not occur at all times, but the hardware assumes it does
– The hardware is always there, even when not needed
Current CMPs

Example: IBM Power4
– 2 8-issue cores with private 32KB L1
– Crossbar to 3 L2 banks, 1.5MB total
– 174M transistors in 0.22 micron on 412mm2

and Power5
– 2 8-issue 2-way SMT cores with private 32KB L1
– Crossbar to 3 L2 banks, 1.9MB total
– 276M transistors in 0.13 micron on 389mm2
Current CMPs

Example: Sun T1
– 8 1-issue 4-way MT cores with private 8KB L1
– Crossbar to 4 L2 banks, 3MB total
– ?M transistors in 0.09 micron on ?mm2

and T2
– 8 1-issue 8-way MT cores
– Crossbar to 8 L2 banks, 4MB total
– ?M transistors in 0.065 micron on ?mm2
Where are we heading?

Moore's Law applied to the number of cores? Not yet.
– Multithreading delays the need for more cores, but is unlikely to go beyond 8-way
– An ever-larger shared on-chip cache (i.e., with a directory) is the easy solution; but how large can/should we go?

But what if we want it to be so?
– Must find scalable and area-efficient solutions to the cache coherence problem
– Must find ways to make the most of the off-chip data bandwidth
– Must (ideally) make an architecture that is good for various workload types

Is it time to re-think traditional hardware cache coherence?
Our Tiled Architecture Approach

Simple single- or dual-issue, possibly multithreaded, cores
Point-to-point interconnect
Cache coherence mostly handled by the OS and compiler/programmer, with minimal hardware support
Cache coherence can be turned off, with little wasted hardware, for certain applications
Reconfigurable NUCA caches maximize the usefulness of the on-chip storage
Outline

Motivation
Page-level placement and remote cache access
– Design
– Evaluation
– Related Work and Conclusions
Software self-invalidations and writebacks
– Design
– Evaluation
– Related Work and Conclusions
Current and Future Directions
Baseline Tiled Architecture

Single-issue RISC processors
Point-to-point interconnects
Caches are only for local data storage, and there is no hardware cache coherence
Shown to be good for ILP- and DLP-based applications (e.g., MIT's Raw)
But: no support for shared-memory parallel applications
Key Ideas

There is only one copy of each piece of data across all L1 caches (thus, no coherence problem)
– Later we will relax this constraint a little bit

Data is allocated to L1 caches at page granularity
– The OS controls placement
– New hardware tables ensure that the location of data can be easily determined
– Placement can change at quiescent points (e.g., barriers)

Hardware supports remote cache accesses

Should work well if most data is placed locally and the on-chip distances/latencies are not too high
Mechanism Overview

Data placement:
– There is a spectrum of options
– Dynamic per-line placement is what previous proposals for NUCA-like and adaptive shared/private cache schemes do
– Static per-page placement is very restrictive, and the compiler/OS has to get the mapping absolutely right
– Our proposal: dynamic, OS-controlled mapping per page
[Diagram: a spectrum from static per-page (lowest hardware complexity, least mapping flexibility) to dynamic per-line (highest hardware complexity, most mapping flexibility), with our proposal in between.]
Data Placement

A new level of indirection allows the OS to map any given page to any L1 cache

OS extensions
– The page table is extended to keep an owner tile ID alongside the (per-page) virtual-to-physical address translation (note that this exposes the chip structure to the OS)
– The default mapping policy is first-touch (see the sketch below)

Hardware extensions
– A TLB-like table that caches the most recent translations and is used to locate the line upon loads and stores (MAP)
– Instead of a directory coherence controller, a device to handle remote cache access requests (RAC)
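A minimal C sketch of these OS-side structures; the field names, widths, and NO_OWNER sentinel are assumptions for illustration:

#include <stdint.h>

#define NO_OWNER 0xFF   /* assumed sentinel: page not yet placed */

/* Page-table entry extended with the owner tile ID, per page. */
struct pte {
    uint64_t phys_page;   /* usual virtual-to-physical translation */
    uint8_t  owner_tile;  /* tile whose L1 holds this page's lines */
    /* ... usual protection/status bits ... */
};

/* First-touch policy: the first tile to touch the page becomes its
 * owner; later accesses from other tiles go to that tile's RAC. */
void place_on_first_touch(struct pte *pte, uint8_t touching_tile) {
    if (pte->owner_tile == NO_OWNER)
        pte->owner_tile = touching_tile;
}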
Data Placement

Why not simply divide the physical address space across tiles and use the virtual-to-physical mapping to place pages?
– More restrictive
– Fragmentation in main memory
– More difficult to change the mapping
[Diagram: two schemes compared, each showing four L1 caches; with physical-address partitioning, the virtual-to-physical translation (V addr. to P addr.) fixes each page to one tile's L1, while the extra level of indirection lets any virtual page map to any L1.]
Data Placement

Difficulty:
– L1 caches are often virtually indexed to speed up local processor access
– Remote cache requests with physical addresses would require some inverse translation at the remote tile

Solution:
– Ship virtual addresses over the on-chip interconnect
– TLBs only keep virtual-to-physical translations for the data allocated locally
Load/Store Mechanism

1. The virtual address (v-addr) is used to index the L1, TLB, and MAP
2. If the MAP entry points to the local L1, proceed with the cache and TLB lookups as usual
3. If the MAP entry points to a remote L1, abort the L1 lookup and ignore the TLB result (including a miss)
4. The v-addr is sent over the network to the home tile's RAC
5. There, the v-addr is used to index both the L1 and the TLB
6. Remote cache and TLB misses/hits are handled as usual (the full path is sketched below)
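Putting the six steps together, a hedged C sketch of the access path; map_lookup, l1_access, rac_send, and MY_TILE_ID are hypothetical names for the structures described above:

#include <stdint.h>

extern int      map_lookup(uint64_t vaddr);          /* MAP: v-addr -> home tile */
extern uint64_t l1_access(uint64_t vaddr);           /* normal local L1/TLB path */
extern uint64_t rac_send(int tile, uint64_t vaddr);  /* remote access via RAC    */
extern const int MY_TILE_ID;

uint64_t do_load(uint64_t vaddr) {
    int home = map_lookup(vaddr);      /* step 1: index MAP with the v-addr */
    if (home == MY_TILE_ID)
        return l1_access(vaddr);       /* step 2: local L1 and TLB as usual */
    /* step 3: abort the local L1 lookup and ignore the local TLB result    */
    /* step 4: ship the *virtual* address to the home tile's RAC            */
    /* steps 5-6: the home tile indexes its own L1 and TLB with the v-addr, */
    /* handles hits and misses as usual, and returns the data               */
    return rac_send(home, vaddr);
}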
Allowing Migration of Data

To migrate pages:
– Invalidate MAP entries
– Write back dirty cache lines
– Invalidate cache contents
– The following access will generate a (possibly new) mapping

Must be done at a quiescent state, e.g., a barrier
Instructions for cache writeback and invalidation already exist in most ISAs; all we need is to invalidate the MAP as well (see the sketch below)
Note that it is not necessary to invalidate the TLBs
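A sketch of the quiescent-point sequence, assuming the ISA's writeback and invalidate instructions are exposed as intrinsics; all function names are illustrative:

/* Assumed intrinsics for the existing ISA cache instructions, plus the
 * one new operation (MAP invalidation) and a barrier primitive. */
extern void l1_writeback_dirty(void);
extern void l1_invalidate_all(void);
extern void map_invalidate_all(void);
extern void barrier_wait(void);

void barrier_with_remap(void) {
    l1_writeback_dirty();   /* write back dirty cache lines           */
    l1_invalidate_all();    /* invalidate cache contents              */
    map_invalidate_all();   /* new: the next access to each page will
                               fault and pick a (possibly new) owner  */
    /* note: TLB entries can be kept                                  */
    barrier_wait();         /* remapping only at this quiescent point */
}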
Allowing Migration of Data

[Animation: three tiles P0-P2, each with an L1, a TLB, and a MAP, plus memory; the MAP entries shown hold pa→P0.]
1. P0 ld 0xA0 (pa→P0)
2. P1 ld 0xA1 (remote)
3. P0 st 0xA2
4. Barrier (write-back + inv. L1, inv. MAP)
Allowing Migration of Data

[Animation continued; after the barrier the page remaps, and the MAP entries shown hold pa→P1.]
1. P0 ld 0xA0 (pa→P0)
2. P1 ld 0xA1 (remote)
3. P0 st 0xA2
4. Barrier (write-back + inv. L1, inv. MAP)
5. P1 ld 0xA1 (pa→P1)
6. P0 ld 0xA0 (remote)
Allowing Replication of Read-Only Data

Controlled replication: multiple readers but a single, fixed writer
Caveat: assumes release-consistent (RC) programs

Operation
– Processors reading data in a page get a local mapping in the MAP
– The first write traps to the OS, and that tile becomes the home/owner (see the sketch below)
– Subsequent reads with old local mappings can continue
– Subsequent writes, and reads without a mapping, are directed to the home tile
– At locks, MAP entries must also be invalidated
– At barriers, the same procedure as for migration applies

Key features
– The OS does not have to keep track of sharers
– Ownership of modified data is not transferred
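A hypothetical sketch of the first-write trap handler, reusing the extended page-table entry sketched earlier; make_page_writable_for is an assumed helper:

#include <stdint.h>

struct pte { uint64_t phys_page; uint8_t owner_tile; };  /* as sketched earlier */

extern void make_page_writable_for(struct pte *pte, uint8_t tile);  /* assumed */

/* The first write to a read-shared page traps to the OS, and the writing
 * tile becomes the page's fixed home/owner.  Note what is NOT done: the
 * OS does not track or chase other tiles' stale read mappings; they stay
 * usable until each tile invalidates its own MAP at a lock acquire or a
 * barrier, which is safe for release-consistent programs. */
void on_first_write_fault(struct pte *pte, uint8_t writer_tile) {
    pte->owner_tile = writer_tile;            /* ownership is fixed here on */
    make_page_writable_for(pte, writer_tile);
}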
Allowing Replication of Data

[Animation: three tiles P0-P2, each with an L1, a TLB, and a MAP, plus memory; the MAP entries shown hold pa→P0 and pa→P1.]
1. P0 ld 0xA0 (pa→P0)
2. P1 ld 0xA1 (pa→P1)
3. P0 st 0xA2 (owner)
4. Barrier (write-back + inv. L1, inv. MAP)
Allowing Replication of Data

[Animation continued; the MAP entries shown hold pa→P0 and pa→P2.]
1. P0 ld 0xA0 (pa→P0)
2. P1 ld 0xA1 (pa→P1)
3. P0 st 0xA2 (owner)
4. Barrier (write-back + inv. L1, inv. MAP)
5. P0 ld 0xA2 (pa→P0)
6. P2 ld 0xA1 (pa→P2)
7. P2 st 0xA1 (owner)
8. P0 acq. lock (invalidate MAP)
Allowing Replication of Data

[Animation continued; the MAP entries shown hold pa→P2.]
1. P0 ld 0xA0 (pa→P0)
2. P1 ld 0xA1 (pa→P1)
3. P0 st 0xA2 (owner)
4. Barrier (write-back + inv. L1, inv. MAP)
5. P0 ld 0xA2 (pa→P0)
6. P2 ld 0xA1 (pa→P2)
7. P2 st 0xA1 (owner)
8. P0 acq. lock (invalidate MAP)
9. P0 ld 0xA1 (pa→P2)
Evaluation Environment

Simulator:
– PowerPC ISA
– Pipelined single-issue cores and network modeled with Liberty
– Caches and memory borrowed from SimpleScalar
– All latencies modeled in detail; contention modeled in detail, except at internal nodes of the network

Compiler: gcc 3.4.4
Applications: Splash-2

Systems simulated:
– Proposed scheme with and without migration+sharing (NUCA-Dist+MS and NUCA-Dist, respectively)
– Idealized distributed-directory coherent system (Dir-Coh)
Overall Performance

NUCA-Dist+MS performs close to the idealized directory system (within 42% in the worst case, and 19% on average)
[Chart: speedups of Dir-Coh and NUCA-Dist+MS for cholesky, FFT, LU, radix, barnes, fmm, ocean, radiosity, raytrace, volrend, water-nsq, and water-spt; the gap ranges from none, through 3%, up to 42%.]
Memory Access Breakdown

[Charts: memory accesses broken down into local hits, remote hits, local misses, and remote misses for cholesky, FFT, and LU, comparing Seq (1 core) with NUCA-Dist, NUCA-Dist+MS, and Dir-Coh at 16 and 32 cores.]
Off-chip accesses very rare for all systems
Memory Access Breakdown

[Charts: the same breakdown for radix, barnes, and fmm.]
Migration and replication significantly reduce the amount of remote cache accesses
Memory Access Breakdown

[Charts: the same breakdown for ocean, radiosity, and raytrace.]
Memory Access Breakdown

[Charts: the same breakdown for volrend, water-nsq, and water-spt.]
NUCA-Dist+MS has on avg. 90% of local accesses
Ovhd. of Writebacks and Invalidations

Writebacks and invalidations do not seem to be a major bottleneck

Recall:
– At barriers: write back dirty cache lines; invalidate cache and MAP
– At lock acquires: write back dirty cache lines; invalidate MAP

Overhead (%):

Benchmark    Barriers   Locks
cholesky     <0.01      2.93
FFT          0.1        <0.01
LU           4.03       <0.01
radix        2.75       4.23
barnes       0.65       1.8
fmm          0.7        0.9
ocean        22.38      0.43
radiosity    <0.01      17.4
raytrace     0.38       8.07
volrend      <0.01      <0.01
water-nsq    0.1        1.08
water-spt    3.83       0.5
Related Work

Tiled architectures:
– Raw (MIT); TRIPS (U. Texas); Cyclops (IBM); Wavescalar (U. Washington); Vector-Thread (MIT)

Shared/private on-chip caches:
– NUCA (U. Texas); CMP-NUCA (U. Wisconsin); NuRAPID (Purdue); Victim Replication (MIT); Cooperative Caching (U. Wisconsin)

Alternatives to snooping and directory coherence:
– Token Coherence and Ring Coherence (U. Wisconsin)

Software DSM
Conclusions

An alternative to full hardware distributed-directory cache coherence that divides the work between the OS and the hardware

Mostly acceptable performance degradation:
– Within 50% for all applications, and within 15% for 8 out of 12 applications, w.r.t. an ideal hardware-coherent system
– A small number of remote cache accesses (10% on average, 38% maximum)

Likely much less complex than distributed directory coherence:
– Only request-response transactions; no forwarding or multiple invalidations necessary
– Less storage required compared to directory state
Outline

Motivation
Page-level placement and remote cache access
– Design
– Evaluation
– Related Work and Conclusions
Software self-invalidations and writebacks
– Design
– Evaluation
– Related Work and Conclusions
Current and Future Directions
Hardware vs. Software Cache Coherence

Hardware
+ Fast communication and possibly more accurate invalidations
– Eager write propagation at the time of the store
– Unnecessary invalidations on false sharing
– No exploitation of knowledge of the memory consistency model, synchronization points, and communication patterns
+ Has been done commercially for many years and proven to work well
Hardware vs. Software Cache Coherence

Software
+ The compiler/programmer may know exactly what needs to be communicated, and when
– Exact sharing information is not usually easy to identify statically
– Hard to deal with hardware artifacts such as false sharing, speculative execution, and prefetching
– Never really left the niche of research prototypes and DSPs
– Only worked for loop-based data-parallel scientific applications
+ Less hardware required
Software Coh. for Release Consistency

Our approach: look at the communication requirements of the memory consistency model (RC in our case)
– Synchronization points are explicitly defined
– No communication between threads happens in between synchronization points
– Lock acquires: data modified by the previous lock owner must be visible (transitively)
– Lock releases: data modified must be made visible to future lock owners (transitively)
– Barriers: data modified by all threads before the barrier must be visible to all threads after the barrier
Software Coh. for Release Consistency

To enforce cache coherence on RC (see the sketch below):
– Lock acquires: write back dirty lines; invalidate cache contents
– Lock releases: write back dirty lines
– Barriers: write back dirty lines; invalidate cache contents
– Use per-word dirty bits to perform selective writeback at memory, to handle false sharing
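A minimal sketch of these rules as acquire/release/barrier wrappers, assuming flush/invalidate primitives that map onto the ISA's cache instructions (all names are illustrative):

/* Assumed primitives: flush/invalidate map onto the ISA's writeback and
 * invalidate instructions; the lock and barrier come from the usual
 * synchronization library. */
typedef struct lock lock_t;
extern void flush_dirty_lines(void);   /* write back dirty L1 lines */
extern void invalidate_cache(void);    /* invalidate L1 contents    */
extern void lock_acquire(lock_t *l);
extern void lock_release(lock_t *l);
extern void barrier_wait(void);

void coh_acquire(lock_t *l) {
    lock_acquire(l);
    flush_dirty_lines();   /* write back our own dirty lines            */
    invalidate_cache();    /* drop stale copies so subsequent reads see
                              the previous owner's updates              */
}

void coh_release(lock_t *l) {
    flush_dirty_lines();   /* make our modifications visible to the
                              next lock owner                           */
    lock_release(l);
}

void coh_barrier(void) {
    flush_dirty_lines();   /* publish all pre-barrier writes            */
    invalidate_cache();    /* drop stale copies of others' data         */
    barrier_wait();
}

With per-word dirty bits, flush_dirty_lines can write back only the words actually modified, which is how false sharing is handled.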
Evaluation Environment

Simulator:
– SPARC ISA
– Non-pipelined single-issue cores, caches, and network modeled with Simics
– All latencies modeled in detail; contention modeled in detail at the bus

Compiler: Sun C compiler
Applications: Splash-2 (scientific & engineering) and ALPBench (media)

Systems simulated:
– Proposed scheme with self-invalidations and writebacks (flush)
– Hardware-coherent system (MSI)
Overall Performance

For a surprising number of applications, software coherence performs close (within 10%) to MSI
Overall Performance

Somewhat worse results for larger caches (256KB vs. 64KB), but still fairly close to MSI
Related Work

Software cache coherence:
– Pfister et al., ICPP'85; Cheong and Veidenbaum, ISCA'88; Cytron et al., ICPP'88
– Compiler-directed; only successful with regular data-parallel applications

Most recent/relevant work: shared regions, Sandhu et al., PPoPP'93
– Also based on explicit synchronization
– Required user help to identify critical sections and the associated data

Software DSM
Conclusions

Revisited software-only cache coherence mechanisms

Pure software-based cache coherence for release-consistent applications performs surprisingly well
– Within 50% for all applications, and within 10% for 9 out of 13 applications, w.r.t. a hardware-coherent system

Some applications still suffer due to unnecessary cache invalidations and writebacks
Hardware support is needed to perform more selective cache invalidation
Outline

Motivation
Page-level placement and remote cache access
– Design
– Evaluation
– Related Work and Conclusions
Software self-invalidations and writebacks
– Design
– Evaluation
– Related Work and Conclusions
Current and Future Directions
Current and Future Directions

Tiled architectures
– Hybrid hardware-software cache coherence schemes
– Mechanisms to support streaming applications
– Reconfigurable on-chip cache hierarchies

Speculative parallelization
– Tuning of common microarchitectural features
– Automatic compilation and optimization with gcc
– Software-only speculative parallelization

Performance prediction and design space exploration
– With machine-learning techniques

Database systems
– JIT code generation and optimization for relational queries
Alternatives to Eager Hardware Cache Coherence on Large-Scale CMPs

Marcelo Cintra
University of Edinburgh
http://www.homepages.inf.ed.ac.uk/mc
Implementing Locks

Straightforward to implement Compare&Swap-style primitives (there is only one copy of the lock variable)

Load-link/Store-conditional primitives are not as easy to implement without cache coherence
– Coherence guarantees that all caches will be notified of any conflicting store
– In the absence of coherence, a conflicting store cannot be detected locally
Implementing Locks

Solution (sketched below):
– Place a RESERVE register in the tile that is home to the lock
– The RESERVE register is shared by all locks in pages assigned to that tile
– Must not allow new LL requests to overwrite the RESERVE register (to avoid livelock)
– Must also add a timeout mechanism to prevent deadlock when there is no matching SC
– Must add an ID tag to the RESERVE register to identify the processor with the pending LL (to avoid simultaneous acquisition of a lock)
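A hedged C sketch of how the home tile's RAC might handle LL and SC requests under this scheme; all names and return codes are assumptions:

#include <stdint.h>

enum { LL_OK, LL_RETRY, SC_OK, SC_FAIL };   /* assumed return codes */
extern uint64_t now(void);
extern int      timed_out(uint64_t stamp);  /* deadlock-timeout check */
extern uint64_t load_word(uint64_t addr);
extern void     store_word(uint64_t addr, uint64_t val);

/* One RESERVE register per tile, shared by all locks homed there. */
struct reserve_reg {
    int      valid;
    int      owner_cpu;   /* ID tag: processor with the pending LL   */
    uint64_t addr;        /* reserved (lock) address                 */
    uint64_t stamp;       /* set at LL time, used for the timeout    */
};

int rac_handle_ll(struct reserve_reg *r, int cpu, uint64_t addr,
                  uint64_t *data) {
    if (r->valid && !timed_out(r->stamp))
        return LL_RETRY;   /* never overwrite a live reservation:
                              avoids livelock between contenders     */
    r->valid = 1; r->owner_cpu = cpu; r->addr = addr; r->stamp = now();
    *data = load_word(addr);
    return LL_OK;
}

int rac_handle_sc(struct reserve_reg *r, int cpu, uint64_t addr,
                  uint64_t val) {
    if (!r->valid || r->owner_cpu != cpu || r->addr != addr)
        return SC_FAIL;    /* reservation timed out, replaced, or
                              never held by this processor           */
    store_word(addr, val);
    r->valid = 0;          /* reservation consumed */
    return SC_OK;
}

Because the single RESERVE register serializes reservations at the home tile, this needs only the request-response transactions already present in the design.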
Evaluation Environment

Applications: Splash-2 benchmarks with reference input sets (except for radiosity)
Memory Latencies

NUCA-Dist+MS performs very close (within x%) to the ideal coherent system
[Chart residue: fraction of loops (%) falling in speedup buckets 0.5<S≤1, 1<S≤2, 2<S≤3, and 3<S≤4 for mesa, art, equake, ammp, vpr, mcf, crafty, vortex, bzip2, and average.]
Evaluation Environment

Applications: Splash-2 applications and ALPBench benchmarks, both with reference input sets (except MPGDEC/ENC)
Sources of largest errors (top 10%)

Source of error                      Number   Error (%)
Incorrect IR workload estimation     4        54~116
Unknown iteration count (i<P)        3        54~61
Unknown inner-loop iteration count   2        98~161
Biased conditional                   1        136
Application Characteristics

Application   Loops                        % of Seq. Time   Spec data size (KB)
TREE          accel_10                     94               < 1
MDG           interf_1000                  86               < 1
WUPWISE       muldeo_200', muldoe_200'     41               12,000
AP3M          Shgravll_700                 78               3,000
LUCAS         mers_mod_square (line 444)   20               4,000