Coarse-Grained Coherence
Mikko H. Lipasti, Associate Professor
Electrical and Computer Engineering, University of Wisconsin – Madison
Joint work with: Jason Cantin, IBM (Ph.D. ’06); Natalie Enright Jerger; Prof. Jim Smith; Prof. Li-Shiuan Peh (Princeton)
http://www.ece.wisc.edu/~pharm
Motivation
Multiprocessors are commonplace
  Historically, glass-house servers; now laptops, soon cell phones
Most common multiprocessor: symmetric processors with coherent caches
  Logical extension of time-shared uniprocessors
  Easy to program and reason about
  Not so easy to build
Aug 30, 2007 Mikko Lipasti-University of Wisconsin
Coherence Granularity
Track each individual word: too much overhead
Track larger blocks (32B – 128B common): less overhead, exploits spatial locality, but large blocks cause false sharing
[Figure: a large block shared across processors P0–P7, illustrating false sharing]
Solution: use multiple granularities
  Small blocks: manage local read/write permissions
  Large blocks: track global behavior
Coarse-Grained Coherence
Initially:
  Identify non-shared regions
  Decouple obtaining coherence permission from data transfer
  Filter snoops to reduce broadcast bandwidth
Later:
  Enable aggressive prefetching
  Optimize DRAM accesses
  Customize protocol and interconnect to match
Coarse-Grained Coherence
Optimizations lead to:
  Reduced memory miss latency
  Reduced cache-to-cache miss latency
  Reduced snoop bandwidth
  Fewer exposed cache misses
  Elimination of unnecessary DRAM reads
  Power savings on bus, interconnect, caches, and in DRAM
  World peace and an end to global warming
Coarse-Grained Coherence Tracking
Memory is divided into coarse-grained regions: aligned, power-of-two multiples of the cache line size, ranging from two lines to a physical page
A cache-like structure, the Region Coherence Array (RCA), is added to each processor to monitor coherence at region granularity
Region Coherence Arrays
Each entry has an address tag, a state, and a count of lines cached by the processor
The region state indicates whether this processor and/or other processors are sharing or modifying lines in the region
Policy, protocol, and interconnect can be customized to exploit the region state
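As a concrete illustration of the structure just described, here is a minimal Python sketch of an RCA entry and lookup. All names are illustrative (not from the talk), the sketch is fully associative where a real RCA is set-associative, and the 512-byte-region geometry is an assumed example:

```python
# Hypothetical sketch of a Region Coherence Array (RCA) entry and lookup,
# assuming 512-byte regions. Names and structure are illustrative.

REGION_BITS = 9  # log2(512B region)

class RCAEntry:
    def __init__(self, tag, state="I", line_count=0):
        self.tag = tag                # region address tag
        self.state = state            # summarized region coherence state
        self.line_count = line_count  # lines from this region held in cache

class RegionCoherenceArray:
    def __init__(self):
        self.entries = {}  # region tag -> RCAEntry (fully associative sketch)

    def region_tag(self, addr):
        return addr >> REGION_BITS

    def lookup(self, addr):
        return self.entries.get(self.region_tag(addr))

    def touch(self, addr, state):
        # Record that a line from this region is now cached.
        tag = self.region_tag(addr)
        e = self.entries.setdefault(tag, RCAEntry(tag))
        e.state = state
        e.line_count += 1
        return e

rca = RegionCoherenceArray()
rca.touch(0x10000, "DI")  # first line cached from region
rca.touch(0x10040, "DI")  # second line, same 512B region
print(rca.lookup(0x10000).line_count)  # -> 2
```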
Talk Outline
  Motivation
  Overview of Coarse-Grained Coherence Techniques
    Broadcast Snoop Reduction [ISCA 2005]
    Stealth Prefetching [ASPLOS 2006]
    Power-Efficient DRAM Speculation
    Hybrid Circuit Switching
    Virtual Proximity
    Circuit-Switched Snooping
  Research Group Overview
Unnecessary Broadcasts
[Chart: fraction of requests (0–100%) for the Scientific, Multiprogrammed, Commercial, and Overall means, broken down into write-backs, writes, reads, I-fetches, and DCB requests]
Broadcast Snoop Reduction
Identify requests that don’t need a broadcast
Send data requests directly to memory without broadcasting, reducing broadcast traffic and memory latency
Avoid sending non-data requests externally
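A hedged sketch of the routing decision just described, assuming the region-state names used later in this talk (CI and DI mean no other processor caches lines from the region); the function name and return values are illustrative:

```python
# Illustrative sketch: route a cache read miss either directly to memory
# or over the broadcast network, based on the region state in the RCA.

# Region states where no other processor caches lines from the region,
# so a read miss needs no broadcast (per the region protocol in this talk):
NO_BROADCAST_READ = {"CI", "DI"}

def route_read_miss(region_state):
    """Return where a read miss is sent: 'memory' (direct) or 'broadcast'."""
    if region_state in NO_BROADCAST_READ:
        return "memory"    # skip the snoop: lower latency, less traffic
    return "broadcast"     # region may be shared elsewhere

print(route_read_miss("DI"))  # -> memory
print(route_read_miss("DD"))  # -> broadcast
```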
Simulator Evaluation
PHARMsim: near-RTL, but written in C
Execution-driven simulator built on top of SimOS-PPC
Four 4-way superscalar out-of-order processors
Two-level hierarchy with split L1, unified 1MB L2 caches, and 64B lines
Separate address/data networks, similar to Sun Fireplane
Workloads
  Scientific: Ocean, Raytrace, Barnes
  Multiprogrammed: SPECint2000_rate, SPECint95_rate
  Commercial (database, web): TPC-W, TPC-B, TPC-H, SPECweb99, SPECjbb2000
Broadcasts Avoided
[Chart: fraction of requests (0–100%) for the Scientific, Multiprogrammed, Commercial, and Overall means, comparing the unnecessary-broadcast baseline against 128B–4KB region sizes; requests broken down into write-backs, I-fetches, writes, reads, and DCB]
Execution Time
[Chart: normalized execution time (0–100%) for the Scientific, Multiprogrammed, Commercial, and Overall means; baseline vs. 128B–4KB region sizes]
Summary
Eliminates nearly all unnecessary broadcasts
Reduces snoop activity by 65% (fewer broadcasts, fewer lookups)
Provides modest speedup
Talk Outline
  Motivation
  Overview of Coarse-Grained Coherence Techniques
    Broadcast Snoop Reduction [ISCA 2005]
    Stealth Prefetching [ASPLOS 2006]
    Power-Efficient DRAM Speculation
    Hybrid Circuit Switching
    Virtual Proximity
    Circuit-Switched Snooping
  Research Group Overview
Prefetching in Multiprocessors
Prefetching: anticipate future references and fetch them into the cache
Many prefetching heuristics possible
  Current systems: next-block, stride
  Proposed: skip pointer, content-based
Some (or many) prefetched blocks are not used
Multiprocessor complications:
  Premature or unnecessary prefetches
  Permission thrashing if blocks are shared
  Separate study [ISPASS 2006]
Stealth Prefetching
Lines from non-shared regions can be prefetched stealthily and efficiently:
  Without disturbing other processors
  Without downgrades or invalidations
  Without preventing others from obtaining exclusive copies
  Without broadcasting prefetch requests
Fetched from DRAM with low overhead
Stealth Prefetching
After a threshold number of L2 misses to a region (2), the rest of the region’s lines are prefetched
These lines are buffered close to the processor for later use (Stealth Data Prefetch Buffer)
After accessing the RCA, requests may obtain data from the buffer as they would from memory
To access data, the region must be in a valid state and a broadcast unnecessary for coherent access
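A minimal sketch of the prefetch trigger described above, assuming 8 lines per region and the 2-miss threshold from the talk (function and variable names are illustrative):

```python
# Sketch of the stealth-prefetch trigger: after a threshold number of
# L2 misses to a region (2 in the talk), prefetch the remaining lines.
# The 8-lines-per-region geometry is an assumed example.

LINES_PER_REGION = 8
THRESHOLD = 2

def on_l2_miss(miss_counts, fetched, region, line):
    """Record a miss; return the list of lines to stealth-prefetch (if any)."""
    miss_counts[region] = miss_counts.get(region, 0) + 1
    fetched.setdefault(region, set()).add(line)
    if miss_counts[region] == THRESHOLD:
        # Prefetch the rest of the region's lines into the prefetch buffer.
        return [l for l in range(LINES_PER_REGION) if l not in fetched[region]]
    return []

misses, fetched = {}, {}
assert on_l2_miss(misses, fetched, region=5, line=0) == []   # 1st miss: wait
print(on_l2_miss(misses, fetched, region=5, line=3))         # 2nd miss: fire
```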
L2 Misses Prefetched
[Chart: fraction of L2 misses prefetched (0–100%) for Scientific, Multiprogrammed, Commercial, and the arithmetic mean; SP-512B, SP-1KB, SP-2KB, and a perfect prefetcher]
Speedup
[Chart: speedup (0–36%) for Scientific, Multiprogrammed, Commercial, and the arithmetic mean; CGCT-512B region vs. SP-512B, SP-1KB, SP-2KB]
Summary
Stealth Prefetching can prefetch data:
  Stealthily: only non-shared data is prefetched; prefetch requests are not broadcast
  Aggressively: large regions are prefetched at once, 80–90% timely
  Efficiently: piggybacked onto a demand request; fetched from DRAM in open-page mode
Talk Outline
  Motivation
  Overview of Coarse-Grained Coherence Techniques
    Broadcast Snoop Reduction [ISCA 2005]
    Stealth Prefetching [ASPLOS 2006]
    Power-Efficient DRAM Speculation
    Hybrid Circuit Switching
    Virtual Proximity
    Circuit-Switched Snooping
  Research Group Overview
Power-Efficient DRAM Speculation
Modern systems overlap the DRAM access with the snoop, speculatively accessing DRAM before the snoop response arrives
  Trading DRAM bandwidth for latency
  Wasting power
Approximately 25% of DRAM requests are reads that speculatively access DRAM unnecessarily
[Timeline: broadcast request → snoop tags → send response, overlapped with DRAM read → transmit block]
DRAM Operations
[Chart: DRAM requests (0–100%) for the Scientific, Multiprogrammed, Commercial, and Overall means, broken down into writes, useful reads, and misspeculated reads]
Power-Efficient DRAM Speculation
Direct memory requests are non-speculative
Lines from externally-dirty regions are likely to be sourced from another processor’s cache
  The region state can serve as a prediction: no need to access DRAM speculatively
Initial requests to a region (state unknown) have a lower but significant probability of obtaining data from other processors’ caches
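The prediction described above can be sketched as a small policy function. This is a hedged illustration: the state names and the flag for the conservative variant are my own labels for the talk’s policy options:

```python
# Sketch of region-state-based DRAM speculation: decide whether to read
# DRAM before the snoop response, using the region state as a predictor.

def speculate_dram_read(region_state, no_spec_unknown=False):
    """Return True if DRAM should be read before the snoop response."""
    if region_state == "externally-dirty":
        return False   # likely a cache-to-cache transfer: skip the DRAM read
    if region_state == "unknown" and no_spec_unknown:
        return False   # conservative variant: wait for the snoop response
    return True        # externally-clean (or unknown): read DRAM early

print(speculate_dram_read("externally-dirty"))  # -> False
print(speculate_dram_read("unknown"))           # -> True
```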
Useless DRAM Reads
[Chart: DRAM reads (0–100%) for the Scientific, Multiprogrammed, Commercial, and Overall means, broken down by region state: externally clean, unknown, and externally dirty]
Useful DRAM Reads
[Chart: DRAM reads (0–100%) for the Scientific, Multiprogrammed, Commercial, and Overall means, split by region state (externally dirty, externally clean, unknown), including false positives]
DRAM Reads Performed/Delayed
[Chart: DRAM reads performed and reads delayed, normalized to baseline, for six configurations: Baseline; CGCT, Speculate All; CGCT, Oracle Speculation; No-speculate Dirty Regions; No-speculate Dirty or Unknown Regions; and No-speculate]
Summary
Power-Efficient DRAM Speculation:
  Can reduce DRAM reads by 20% with less than 1% performance degradation (vs. a 7% slowdown with fully nonspeculative DRAM)
  Nearly doubles the interval between DRAM requests, allowing modules to stay in low-power modes longer
Talk Outline
  Motivation
  Overview of Coarse-Grained Coherence Techniques
    Broadcast Snoop Reduction [ISCA 2005]
    Stealth Prefetching [ASPLOS 2006]
    Power-Efficient DRAM Speculation
    Hybrid Circuit Switching
    Virtual Proximity
    Circuit-Switched Snooping
  Research Group Overview
Chip Multiprocessor Interconnect
Options:
  Buses: don’t scale
  Crossbars: too expensive
  Rings: too slow
  Packet-switched mesh: attractive for all the same 1990s DSM reasons (scalable, low latency, high link utilization)
CMP Interconnection Networks
But... cables/traces are now on-chip wires: fast, cheap, plentiful, and short (1 cycle per hop)
Router latency adds up: 3–4 cycles per hop, store-and-forward, lots of activity/power
Is this the right answer?
Circuit-Switched Interconnects
Communication patterns: spatial locality to memory, pairwise communication
Circuit-switched links: avoid switching/routing, reduce latency, save power?
Poor utilization! Maybe OK
Router Design
Switches consist of a configurable crossbar and configuration memory
The 4-stage router pipeline exposes only 1 cycle if circuit-switched
Can also act as a packet-switched network
Design details in [CA Letters ’07]
Hybrid Circuit Switching (1)
Protocol optimization:
  The initial 3-hop miss establishes a circuit-switched path
  Subsequent miss requests are sent directly on the circuit-switched path to the predicted owner, and in parallel to the home node
  The predicted owner sources the data early; the directory acknowledges the update to the sharing list
Benefits: reduced 3-hop latency; less activity, less power
Hybrid Circuit Switching improves performance by up to 7%
Hybrid Circuit Switching (2)
Positive interaction in a co-designed interconnect and protocol
More circuit reuse means greater latency benefit
Summary
Hybrid Circuit Switching:
  Routing overhead eliminated, while still enabling high bandwidth when needed
  Co-designed protocol optimizes cache-to-cache transfers
  Substantial performance benefits
  To do: power analysis
Talk Outline
  Motivation
  Overview of Coarse-Grained Coherence Techniques
    Broadcast Snoop Reduction [ISCA 2005]
    Stealth Prefetching [ASPLOS 2006]
    Power-Efficient DRAM Speculation
    Hybrid Circuit Switching
    Virtual Proximity
    Circuit-Switched Snooping
  Research Group Overview
Server Consolidation on CMPs
CMP as a consolidation platform: simplifies system administration; saves power, cost, and physical infrastructure
Study combinations of individual workloads in a full-system environment
A micro-coded hypervisor schedules the VMs
See “An Evaluation of Server Consolidation Workloads for Multi-Core Designs” (IISWC 2007) for additional details
Nugget: a shared LLC is a big win
Virtual Proximity
Interactions between VM scheduling, placement, and interconnect
Goal: placement-agnostic scheduling for the best workload balance
Evaluate 3 scheduling policies: gang, affinity, and load-balanced
HCS provides virtual proximity
Scheduling Algorithms
  Gang scheduling: co-schedules all threads of a VM; no idle-cycle stealing
  Affinity scheduling: VMs assigned to neighboring cores; can steal idle cycles across VMs sharing a core
  Load-balanced scheduling: ready threads assigned to any core; any/all VMs can steal idle cycles; over time, a VM fragments across the chip
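The contrast between the policies above can be sketched with two toy placement functions. This is purely illustrative (names, the 4-cores-per-VM block size, and the load metric are all assumptions, not the hypervisor’s actual algorithm):

```python
# Toy sketch contrasting affinity vs. load-balanced VM scheduling:
# affinity pins each VM's threads to a contiguous block of neighboring
# cores; load balancing places each ready thread on the least-loaded core.

def affinity_schedule(vms, cores_per_vm=4):
    # VM i gets the contiguous core block [i*cores_per_vm, (i+1)*cores_per_vm).
    return {vm: list(range(i * cores_per_vm, (i + 1) * cores_per_vm))
            for i, vm in enumerate(vms)}

def load_balanced_schedule(threads, n_cores):
    # threads: list of (thread_name, work_estimate) pairs.
    load = [0] * n_cores
    placement = {}
    for t, work in threads:
        c = load.index(min(load))  # least-loaded core, regardless of VM
        placement[t] = c
        load[c] += work
    return placement

print(affinity_schedule(["vm0", "vm1"]))
# -> {'vm0': [0, 1, 2, 3], 'vm1': [4, 5, 6, 7]}
```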
Load balancing wins with a fast interconnect
Affinity scheduling wins with a slow interconnect
HCS creates virtual proximity
Virtual Proximity Performance
HCS is able to provide virtual proximity
As physical distance (hop count) increases, HCS provides significantly lower latency
Summary
Virtual Proximity [in submission]:
  Enables a placement-agnostic hypervisor scheduler
  Results:
    Up to 17% better than affinity scheduling
    Idle-cycle reduction: 84% over gang and 41% over affinity scheduling
    The low-latency interconnect mitigates the increase in L2 cache conflicts from load balancing: L2 misses up by 10%, but execution time reduced by 11%
    A flexible, distributed address mapping combined with HCS outperforms a localized affinity-based memory mapping by an average of 7%
Talk Outline
  Motivation
  Overview of Coarse-Grained Coherence Techniques
    Broadcast Snoop Reduction [ISCA 2005]
    Stealth Prefetching [ASPLOS 2006]
    Power-Efficient DRAM Speculation
    Hybrid Circuit Switching
    Virtual Proximity
    Circuit-Switched Snooping
  Research Group Overview
Circuit-Switched Snooping (1)
Scalable, efficient broadcasting on an unordered network
Removes the latency overhead of directory indirection
Extends point-to-point circuit-switched links to trees: low-latency multicast via a circuit-switched tree
Helps provide performance isolation, as requests do not share the same communication medium
Circuit-Switched Snooping (2)
Extends Coarse-Grained Coherence Tracking (CGCT):
  Removes unnecessary broadcasts
  Converts broadcasts to multicasts
Effective in server-consolidation workloads: very few coherence requests go to globally shared data
Snooping Interconnect
Switches consist of a configurable crossbar and configuration memory
Circuits span two or more nodes, based on the RCA
Snooping occurs across circuits; all sharers in a region join the circuit
Each link can physically accommodate multiple circuits
Circuit-Switched Snooping
Use the RCA to identify subsets of nodes that share data, and create shared circuits among those nodes
Design challenges: multi-drop, bidirectional circuits; memory ordering
Results: very much in progress
Talk Outline
  Motivation
  Overview of Coarse-Grained Coherence Techniques
    Broadcast Snoop Reduction [ISCA 2005]
    Stealth Prefetching [ASPLOS 2006]
    Power-Efficient DRAM Speculation
    Hybrid Circuit Switching
    Virtual Proximity
    Circuit-Switched Snooping
  Research Group Overview
Research Group Overview
Faculty: Mikko Lipasti, since 1999
Current MS/PhD students: Gordie Bell (also IBM), Natalie Enright Jerger, Erika Gunadi, Atif Hashmi, Eric Hill, Lixin Su, Dana Vantrease
Graduates, current employment:
  Intel: Ilhyun Kim, Morris Marden, Craig Saldanha, Madhu Seshadri
  IBM: Trey Cain, Jason Cantin, Brian Mestan
  AMD: Kevin Lepak
  Sun Microsystems: Matt Ramsay, Razvan Cheveresan, Pranay Koka
Current Focus Areas
  Multiprocessors: coherence protocol optimization; interconnection network design; fairness issues in hierarchical systems
  Microprocessor design: complexity-effective microarchitecture; scalable dynamic scheduling hardware; speculation reduction for power savings; transparent clock gating; domain-specific ISA extensions
  Software: Java Virtual Machine run-time optimization; workload development and characterization
Funding
  National Science Foundation
  Intel Research Council
  IBM Faculty Partnership Awards
  IBM Shared University Research equipment
  Schneider ECE Faculty Fellowship
  UW Graduate School
Questions?
http://www.ece.wisc.edu/~pharm
Backup Slides
Region Coherence Arrays
The regions are kept coherent with a protocol that summarizes the local and global state of lines in the region:

  State               | Processor                  | Other Processors           | Broadcast Needed?
  Invalid (I)         | No cached copies           | Unknown                    | Yes
  Clean-Invalid (CI)  | Unmodified copies only     | No cached copies           | No
  Clean-Clean (CC)    | Unmodified copies only     | Unmodified copies only     | For modifiable copy
  Clean-Dirty (CD)    | Unmodified copies only     | Modified/unmodified copies | Yes
  Dirty-Invalid (DI)  | Modified/unmodified copies | No cached copies           | No
  Dirty-Clean (DC)    | Modified/unmodified copies | Unmodified copies only     | For modifiable copy
  Dirty-Dirty (DD)    | Modified/unmodified copies | Modified/unmodified copies | Yes
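The broadcast-needed column of this region protocol can be encoded directly as a lookup; the following sketch is illustrative (the function name and the `want_modifiable` flag are my own):

```python
# Encode the region-state table: does a miss need a broadcast?

BROADCAST = {
    "I":  "yes",                 # other processors' copies unknown
    "CI": "no",
    "CC": "for modifiable copy",
    "CD": "yes",
    "DI": "no",
    "DC": "for modifiable copy",
    "DD": "yes",
}

def needs_broadcast(state, want_modifiable=False):
    """True if a request in this region state must be broadcast."""
    rule = BROADCAST[state]
    if rule == "for modifiable copy":
        return want_modifiable   # only write/upgrade requests broadcast
    return rule == "yes"

print(needs_broadcast("DI"))                        # -> False
print(needs_broadcast("CC", want_modifiable=True))  # -> True
```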
Region Coherence Arrays
On cache misses, the region state is read to determine whether a broadcast is necessary
On external snoops, the region state is read to provide a region snoop response
  Piggybacked onto the conventional snoop response
  Used to update other processors’ region state
Coarse-Grain Coherence Tracking
[Animated example: Region Coherence Array added, two lines per region. P1 stores to binary address 10000 and misses; an RFO is broadcast and the snoop hits in P0’s cache, so the region is no longer exclusive. P0’s region state is downgraded to DD, its line is invalidated, and the data is transferred; P1 installs the line Modified with the region Owned.]
Overhead
Storage for the RCA
Two bits in the snoop response for the region snoop response (region externally clean/dirty)

2-way set-associative RCA, 48-bit addresses:
  Entries     | Bits/Set | Total KB | Tag Overhead | Cache Overhead
  2K entries  | 74       | 9.3      | 5.0%         | 0.8%
  4K entries  | 72       | 18.0     | 9.7%         | 1.5%
  8K entries  | 70       | 35.0     | 48.6%        | 2.8%
  16K entries | 68       | 68.0     | 88.3%        | 5.5%
Overhead
The RCA maintains inclusion over the caches:
  The RCA must respond correctly to external requests if lines are cached
  When regions are evicted from the RCA, their lines are evicted from the cache
  The replacement algorithm uses the line count to favor regions with no lines cached
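The replacement heuristic just mentioned can be sketched as follows; this is an illustrative simplification (plain LRU augmented with the line count), not the exact algorithm from the talk:

```python
# Sketch of the RCA replacement heuristic: prefer victim regions whose
# lines are no longer cached, to avoid inclusion-driven cache evictions.

def pick_victim(region_set):
    """region_set: list of (tag, line_count) pairs, ordered LRU-first."""
    for tag, count in region_set:  # scan in LRU order
        if count == 0:
            return tag             # free eviction: no cached lines to purge
    return region_set[0][0]        # fall back to the plain LRU victim

print(pick_victim([(0xA, 3), (0xB, 0), (0xC, 1)]))  # -> 11 (i.e. tag 0xB)
```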
Snoop Traffic – Peak
[Chart: peak broadcasts per 1000 CPU cycles (0–16) for the Scientific, Multiprogrammed, Commercial, and Overall means; baseline vs. 128B–4KB region sizes]
Snoop Traffic – Average
[Chart: average broadcasts per 1000 CPU cycles (0–16) for the Scientific, Multiprogrammed, Commercial, and Overall means; baseline vs. 128B–4KB region sizes]
Snoop Traffic
Peak snoop traffic is halved
Average snoop traffic is reduced by nearly two-thirds
The system is more scalable and may effectively support more processors
Tag Lookups Filtered
Coarse-Grain Coherence Tracking can be used to filter external snoops:
  Send external requests to the RCA first
  If the region is valid and the line count is nonzero, send the external request on to the cache
Reduces power consumption in the cache tag arrays; increases broadcast snoop latency
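The filtering rule above can be sketched in a few lines. This assumes the same illustrative 512-byte regions as earlier sketches, with the RCA modeled as a plain dictionary:

```python
# Sketch of snoop filtering with the RCA: an external request probes the
# cache tag arrays only if the RCA shows lines from that region cached.

def filter_snoop(rca, addr):
    """Return True if the external snoop must probe the cache tag arrays."""
    entry = rca.get(addr >> 9)  # 512B regions assumed (tag = addr / 512)
    return entry is not None and entry["line_count"] > 0

rca = {0x80: {"line_count": 2}}
print(filter_snoop(rca, 0x10000))  # -> True  (region cached: probe tags)
print(filter_snoop(rca, 0x20000))  # -> False (filtered: save tag power)
```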
Tag Lookups Filtered
[Chart: external requests (0–100%) for the Scientific, Multiprogrammed, Commercial, and Overall means, oracle vs. 128B–4KB region sizes; broken down into tag lookups filtered, tag lookups for broadcasts avoided, and write-back tag lookups]
Line Evictions for Inclusion
[Chart: regions evicted (0–100%) for the Scientific, Multiprogrammed, Commercial, and Overall means across 128B–4KB region sizes, broken down by the number of lines evicted with the region (0 through 8)]
L2 Miss Ratio Increase
[Chart: L2 miss ratio (0–110%) relative to baseline for the Scientific, Multiprogrammed, Commercial, and Overall means; baseline vs. 128B–4KB region sizes]
Stealth Prefetching
Lines from a region may be prefetched again after a threshold number of L2 misses (currently 2)
A bit mask of the lines cached since the last prefetch is used to avoid prefetching useless data
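The bit mask mentioned above can be sketched as one integer per region: bit i set means line i was cached since the last prefetch, so the next prefetch skips it (an illustrative model, assuming 8 lines per region):

```python
# Sketch of the per-region bit mask that avoids re-prefetching useless
# lines: only lines whose mask bit is clear are fetched again.

def lines_to_prefetch(mask, lines_per_region=8):
    """Return line indices whose bit is clear in the cached-lines mask."""
    return [i for i in range(lines_per_region) if not (mask >> i) & 1]

print(lines_to_prefetch(0b00001001))  # -> [1, 2, 4, 5, 6, 7]
```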
Stealth Prefetching
Prefetched lines are managed by a simple protocol
[State diagram: Invalid → Pending Data (line prefetch initiated) → Valid (data arrives); Pending Data → Pending Requested on a processor miss request, then data is sent to the cache; an invalidation returns a line to Invalid]
Prefetch Timeliness
[Chart: timely prefetches (0–100%) for Scientific, Multiprogrammed, Commercial, and the arithmetic mean; SP-512B, SP-1KB, SP-2KB]
Data Traffic
[Chart: data traffic (0–150%) relative to baseline for Scientific, Multiprogrammed, Commercial, and the arithmetic mean; baseline, CGCT-512B, SP-512B, SP-1KB, SP-2KB]
Period Between DRAM Requests
[Chart: processor cycles between DRAM requests (0–1000) for the Scientific, Multiprogrammed, Commercial, and Overall means; Baseline; CGCT, Speculate All; No-speculate Dirty Regions; No-speculate Dirty or Unknown Regions; No-speculate]
Switch Design
[Figure: switch design]
Value-Aware Techniques
  Coherence misses in multiprocessors: store value locality [Lepak ’03]
  Ensuring consistency: value-based checks [Cain ’04]
  Reducing speculation: operand significance; create a (nearly) nonspeculative execution schedule
  Java Virtual Machine runtime optimization [Su]: speculative optimizations [VEE ’07]
Complexity-Effective Techniques
  Scalable dynamic scheduling hardware: half-price architecture [Kim ’03]; macro-op scheduling [Kim ’03]; operand significance [Gunadi]
  Scalable snoop-based coherence: coarse-grained coherence [Cantin ’06]; circuit-switched coherence [Enright]
Power-Efficient Techniques
  Reduced speculation [Gunadi]
  Clock gating [E. Hill]: transparent pipelines need fine-grained stalls; redistribute coarse-grained stall cycles
  Circuit-switched coherence [Enright]: reduce the overhead of CMP cache coherence; improve latency and power
Cache Coherence Problem
[Figure: P0 and P1 each load A (value 0) from memory into their caches; P1 then stores 1 to A, leaving P0 with a stale copy.]
Cache Coherence Problem
[Figure continued: after the store, a load of A must observe the new value 1, not the stale cached 0.]
Snoopy Cache Coherence
All cache misses are broadcast on a shared bus; processors and memory snoop and respond
Cache block permissions are enforced:
  Multiple readers allowed (shared state)
  Only a single writer (exclusive state)
A block must be upgraded before writing to it; other copies are invalidated
Read/write-shared blocks bounce from cache to cache (migratory sharing)
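The write-upgrade rule above can be illustrated with a minimal MSI-style sketch (illustrative only, not the talk’s full protocol): writing requires exclusive permission, and all other copies are invalidated.

```python
# Toy MSI-style upgrade: the writer gains Modified (M) state and every
# other cached copy of the block is invalidated (I).

def store(caches, writer, block):
    for p in caches:
        if p != writer and caches[p].get(block) in ("S", "M"):
            caches[p][block] = "I"  # invalidate other copies
    caches[writer][block] = "M"     # writer now holds the modified copy

caches = {0: {"A": "S"}, 1: {"A": "S"}}  # both processors share block A
store(caches, writer=1, block="A")
print(caches[0]["A"], caches[1]["A"])  # -> I M
```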
Example: Conventional Snooping
[Animated example: P0 loads binary address 10000 and misses; the read is broadcast, and P1 and memory snoop. No cached copies exist, so memory responds; the data is transferred and P0 installs the line Exclusive.]
Coarse-Grain Coherence Tracking
[Animated example: Region Coherence Array added, two lines per region. P0 loads binary address 10000 and misses; the broadcast snoop finds the region not shared anywhere, so P0’s RCA records the region as DI and the line is installed Exclusive: P0 now has exclusive access to the entire region.]
Coarse-Grain Coherence Tracking
[Animated example continued: P0 loads binary address 11000, missing in the cache but hitting in the RCA with the region in the exclusive DI state, so a broadcast is unnecessary; the request is sent directly to memory and the data is transferred.]
Impact on Execution Time
[Chart: execution time (0–100%) for the Scientific, Multiprogrammed, Commercial, and Overall means; Baseline; CGCT, Speculate All; No-speculate Dirty Regions; No-speculate Dirty or Unknown Regions; No-speculate]
Stealth Prefetching
[Animated example, assuming 8-byte lines, 32-byte regions, and a 2-line threshold: P0 loads 0x28, missing in the cache but hitting in the RCA with the region exclusive; the request goes directly to memory with a prefetch of the region’s remaining lines piggybacked onto it, and the prefetched lines are installed Valid in the Stealth Data Prefetch Buffer (SDPB).]
Stealth Prefetching
[Animated example continued, same assumptions: P0 loads 0x30, missing in the cache but hitting in the SDPB; the buffered line is returned to the cache without any external request.]
Communication Latencies

                                 | CC-NUMA                         | CMP
  Local cache access             | 12                              | 12
  Remote cache-to-cache transfer | 12 + 21 * H * 3 (H = hop count) | 12 + 4 * H * 3
  Local memory access            | 150                             | 150
  Remote memory access           | 150 + 21 * H * 2                | 150 + 4 * H * 2

Remote cache access is 2–5x faster in CMPs than in NUMA machines
Lower communication latencies allow more flexible thread placement
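A worked example of the latency model above: the per-hop cost is 21 cycles off-chip (CC-NUMA) vs. 4 cycles on-chip, a cache-to-cache miss traverses the network three times, and a memory miss twice. The hop count H = 4 is an arbitrary example:

```python
# Evaluate the communication-latency formulas from the table above.

def cache_to_cache(hop_cost, hops):
    # 12-cycle cache access plus three network traversals.
    return 12 + hop_cost * hops * 3

def remote_memory(hop_cost, hops):
    # 150-cycle memory access plus two network traversals.
    return 150 + hop_cost * hops * 2

H = 4  # example hop count
print(cache_to_cache(21, H), cache_to_cache(4, H))  # -> 264 60
```

At H = 4 the CMP cache-to-cache transfer (60 cycles) is about 4.4x faster than the CC-NUMA one (264 cycles), consistent with the 2–5x range quoted above.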
Simulation Parameters
  Cores: 16 single-threaded, light-weight, in-order
  Interconnect: 2-D packet-switched mesh, 3-cycle router pipeline (baseline); hybrid circuit-switched mesh, 4 circuits
  L1 cache: split I/D, 16KB each (2 cycles)
  L2 cache: private, 128KB (6 cycles)
  L3 cache: shared, 16MB (16 1MB banks), 12 cycles
  Memory latency: 150 cycles

Workload Mixes
  Mix 1: TPC-W (4) + TPC-H (4)
  Mix 2: TPC-W (4) + SPECjbb (4)
  Mix 3: TPC-H (4) + SPECjbb (4)
Effect of Memory Placement
Load balancing with HCS outperforms local placement
Virtual proximity to the memory home node