![Page 1: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/1.jpg)
Coarse-Grained Coherence
Mikko H. Lipasti, Associate Professor
Electrical and Computer Engineering, University of Wisconsin – Madison
Joint work with:
• Jason Cantin, IBM (Ph.D. ’06)
• Natalie Enright Jerger
• Prof. Jim Smith
• Prof. Li-Shiuan Peh (Princeton)
http://www.ece.wisc.edu/~pharm
![Page 2: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/2.jpg)
Motivation
• Multiprocessors are commonplace
  • Historically, glass house servers
  • Now laptops, soon cell phones
• Most common multiprocessor: symmetric processors w/coherent caches
  • Logical extension of time-shared uniprocessors
  • Easy to program, reason about
  • Not so easy to build
Aug 30, 2007 Mikko Lipasti-University of Wisconsin
![Page 3: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/3.jpg)
Coherence Granularity
• Track each individual word
  • Too much overhead
• Track larger blocks
  • 32B – 128B common
  • Less overhead, exploit spatial locality
  • Large blocks cause false sharing
• Solution: use multiple granularities
  • Small blocks: manage local read/write permissions
  • Large blocks: track global behavior
[Figure: processors P0 to P7]
![Page 4: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/4.jpg)
Coarse-Grained Coherence
• Initially
  • Identify non-shared regions
  • Decouple obtaining coherence permission from data transfer
  • Filter snoops to reduce broadcast bandwidth
• Later
  • Enable aggressive prefetching
  • Optimize DRAM accesses
  • Customize protocol, interconnect to match
![Page 5: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/5.jpg)
Coarse-Grained Coherence
• Optimizations lead to
  • Reduced memory miss latency
  • Reduced cache-to-cache miss latency
  • Reduced snoop bandwidth
  • Fewer exposed cache misses
  • Elimination of unnecessary DRAM reads
  • Power savings on bus, interconnect, caches, and in DRAM
  • World peace and end to global warming
![Page 6: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/6.jpg)
Coarse-Grained Coherence Tracking
• Memory is divided into coarse-grained regions
  • Aligned, power-of-two multiple of cache line size
  • Can range from two lines to a physical page
• A cache-like structure is added to each processor for monitoring coherence at the granularity of regions
  • Region Coherence Array (RCA)
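As a rough illustration of the region definition above, the sketch below maps an address to its region and to a line within that region. The 64B line size matches the simulated system described later in the talk; the 512B region size and all function names are assumptions chosen only for this example.

```python
LINE_BYTES = 64      # cache line size (64B lines, as in the evaluated system)
REGION_BYTES = 512   # assumed region size for this example


def is_power_of_two(n: int) -> bool:
    return n > 0 and (n & (n - 1)) == 0


# A region is an aligned, power-of-two multiple of the line size,
# spanning at least two lines.
assert is_power_of_two(REGION_BYTES // LINE_BYTES)
assert REGION_BYTES >= 2 * LINE_BYTES


def region_base(addr: int) -> int:
    """Address of the first byte of the region containing addr."""
    return addr & ~(REGION_BYTES - 1)


def line_index_in_region(addr: int) -> int:
    """Index of the line within its region that contains addr."""
    return (addr & (REGION_BYTES - 1)) // LINE_BYTES


def lines_per_region() -> int:
    return REGION_BYTES // LINE_BYTES
```

Because regions are aligned powers of two, both lookups reduce to bit masks, which is what makes a tag-indexed, cache-like tracking structure practical.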
![Page 7: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/7.jpg)
Region Coherence Arrays
• Each entry has an address tag, state, and count of lines cached by the processor
• The region state indicates if the processor and / or other processors are sharing / modifying lines in the region
• Customize policy/protocol/interconnect to exploit region state
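A minimal sketch of what an RCA entry might hold, based only on the three fields named on this slide (tag, state, line count). The class names, the string state labels, the 512B region size, and the unbounded fully associative organization are all assumptions for illustration; a real RCA would be a fixed-size set-associative structure.

```python
from dataclasses import dataclass
from typing import Dict, Optional

REGION_BYTES = 512  # assumed region size


@dataclass
class RegionEntry:
    tag: int             # region-aligned address used as the tag
    state: str           # region state label (encoding is a placeholder)
    line_count: int = 0  # lines of this region currently cached locally


class RegionCoherenceArray:
    """Toy RCA: fully associative, no capacity limit (assumption)."""

    def __init__(self) -> None:
        self.entries: Dict[int, RegionEntry] = {}

    def lookup(self, addr: int) -> Optional[RegionEntry]:
        return self.entries.get(addr & ~(REGION_BYTES - 1))

    def line_allocated(self, addr: int, state: str) -> RegionEntry:
        """Record that a line from addr's region entered the local cache."""
        tag = addr & ~(REGION_BYTES - 1)
        entry = self.entries.setdefault(tag, RegionEntry(tag, state))
        entry.state = state
        entry.line_count += 1
        return entry

    def line_evicted(self, addr: int) -> None:
        """Record that a line from addr's region left the local cache."""
        entry = self.lookup(addr)
        if entry is not None and entry.line_count > 0:
            entry.line_count -= 1
```

One plausible use of the count is to know when a region holds no locally cached lines, so that its entry can be given up without consulting the cache.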
![Page 8: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/8.jpg)
Talk Outline
• Motivation
• Overview of Coarse-Grained Coherence
• Techniques
  • Broadcast Snoop Reduction [ISCA 2005]
  • Stealth Prefetching [ASPLOS 2006]
  • Power-Efficient DRAM Speculation
  • Hybrid Circuit Switching
  • Virtual Proximity
  • Circuit-switched snooping
• Research Group Overview
![Page 9: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/9.jpg)
Unnecessary Broadcasts
[Figure: percentage of requests (0% to 100%) for Scientific Mean, Multiprogrammed Mean, Commercial Mean, and Overall Mean, broken down by request type: Write-back, Writes, Read, I-Fetch, DCB]
![Page 10: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/10.jpg)
Broadcast Snoop Reduction
• Identify requests that don’t need a broadcast
• Send data requests directly to memory w/o broadcasting
  • Reducing broadcast traffic
  • Reducing memory latency
• Avoid sending non-data requests externally
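The three cases on this slide can be sketched as a small routing decision. This is an illustration under simplifying assumptions: the region state is collapsed into a single `broadcast_needed` flag supplied by the caller, and the `Route` names and `route_request` function are hypothetical.

```python
from enum import Enum


class Route(Enum):
    BROADCAST = "broadcast"            # snoop the other processors
    DIRECT_TO_MEMORY = "direct"        # fetch data from memory, no snoop
    NO_EXTERNAL_REQUEST = "no-request" # nothing leaves the processor


def route_request(needs_data: bool, broadcast_needed: bool) -> Route:
    """Pick where a request goes, given what the region state implies."""
    if broadcast_needed:
        return Route.BROADCAST
    if needs_data:
        # Data request that needs no snoop: go straight to memory.
        return Route.DIRECT_TO_MEMORY
    # Non-data request (e.g. a permission upgrade) that needs no snoop.
    return Route.NO_EXTERNAL_REQUEST
```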
![Page 11: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/11.jpg)
Simulator Evaluation
• PHARMsim: near-RTL but written in C
• Execution-driven simulator built on top of SimOS-PPC
• Four 4-way superscalar out-of-order processors
• Two-level hierarchy with split L1, unified 1MB L2 caches, and 64B lines
• Separate address / data networks, similar to Sun Fireplane
![Page 12: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/12.jpg)
Workloads
• Scientific
  • Ocean, Raytrace, Barnes
• Multiprogrammed
  • SPECint2000_rate, SPECint95_rate
• Commercial (database, web)
  • TPC-W, TPC-B, TPC-H
  • SPECweb99, SPECjbb2000
![Page 13: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/13.jpg)
Broadcasts Avoided
[Figure: percentage of requests (0% to 100%) for Scientific Mean, Multiprogrammed Mean, Commercial Mean, and Overall Mean. Bars in each group: Unnecessary, then region sizes 128B, 256B, 512B, 1KB, 2KB, 4KB. Request types: Write-backs, I-Fetches, Writes, Reads, DCB]
![Page 14: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/14.jpg)
Execution Time
[Figure: execution time (0% to 100%) for Scientific Mean, Multiprogrammed Mean, Commercial Mean, and Overall Mean. Bars in each group: Baseline, 128B, 256B, 512B, 1KB, 2KB, 4KB]
![Page 15: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/15.jpg)
Summary
• Eliminates nearly all unnecessary broadcasts
• Reduces snoop activity by 65%
  • Fewer broadcasts
  • Fewer lookups
• Provides modest speedup
![Page 16: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/16.jpg)
![Page 17: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/17.jpg)
Prefetching in Multiprocessors
• Prefetching
  • Anticipate future reference, fetch into cache
• Many prefetching heuristics possible
  • Current systems: next-block, stride
  • Proposed: skip pointer, content-based
  • Some/many prefetched blocks are not used
• Multiprocessor complications
  • Premature or unnecessary prefetches
  • Permission thrashing if blocks are shared
  • Separate study [ISPASS 2006]
![Page 18: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/18.jpg)
Stealth Prefetching
• Lines from non-shared regions can be prefetched stealthily and efficiently
  • Without disturbing other processors
    • Without downgrades, invalidations
    • Without preventing them from obtaining exclusive copies
  • Without broadcasting prefetch requests
  • Fetched from DRAM with low overhead
![Page 19: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/19.jpg)
Stealth Prefetching
• After a threshold number of L2 misses (2), the rest of the lines from a region are prefetched
• These lines are buffered close to the processor for later use (Stealth Data Prefetch Buffer)
• After accessing the RCA, requests may obtain data from the buffer as they would from memory
  • To access data, the region must be in a valid state and a broadcast must be unnecessary for coherent access
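The trigger described above can be sketched as follows. This is a simplified illustration, not the mechanism as implemented: the class and method names are hypothetical, the 1KB region size is one of the sizes the talk evaluates, the "broadcast unnecessary" condition is passed in as a flag, and "the rest of the lines" is approximated as every line in the region that has not itself missed.

```python
from collections import defaultdict
from typing import Dict, List, Set

LINE_BYTES = 64
REGION_BYTES = 1024   # assumed; the talk evaluates 512B, 1KB and 2KB
MISS_THRESHOLD = 2    # threshold number of L2 misses from the slide


class StealthPrefetcher:
    def __init__(self) -> None:
        # region base address -> distinct line addresses that missed in L2
        self.missed: Dict[int, Set[int]] = defaultdict(set)

    def on_l2_miss(self, addr: int, broadcast_unneeded: bool) -> List[int]:
        """Return the line addresses to prefetch in response to this miss."""
        base = addr & ~(REGION_BYTES - 1)
        line = addr & ~(LINE_BYTES - 1)
        self.missed[base].add(line)
        # Prefetch only on the miss that reaches the threshold, and only
        # if the region can be accessed coherently without a broadcast.
        if not broadcast_unneeded or len(self.missed[base]) != MISS_THRESHOLD:
            return []
        all_lines = range(base, base + REGION_BYTES, LINE_BYTES)
        return [a for a in all_lines if a not in self.missed[base]]
```

With these parameters, the second distinct miss to a qualifying region returns the 14 remaining lines of its 16-line region in one batch.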
![Page 20: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/20.jpg)
L2 Misses Prefetched
[Figure: percentage of L2 misses (0% to 100%) for Scientific, Multiprogrammed, Commercial, and Arithmetic Mean. Series: SP-512B, SP-1KB, SP-2KB, Perfect]
![Page 21: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/21.jpg)
Speedup
[Figure: speedup (0% to 36%) for Scientific, Multiprogrammed, Commercial, and Arithmetic Mean. Series: CGCT-512B Region, SP-512B, SP-1KB, SP-2KB]
![Page 22: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/22.jpg)
Summary
Stealth Prefetching can prefetch data:
• Stealthily:
  • Only non-shared data prefetched
  • Prefetch requests not broadcast
• Aggressively:
  • Large regions prefetched at once, 80-90% timely
• Efficiently:
  • Piggybacked onto a demand request
  • Fetched from DRAM in open-page mode
![Page 23: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/23.jpg)
![Page 24: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/24.jpg)
Power-Efficient DRAM Speculation
• Modern systems overlap the DRAM access with the snoop, speculatively accessing DRAM before snoop response
  • Trading DRAM bandwidth for latency
  • Wasting power
• Approximately 25% of DRAM requests are reads that speculatively access DRAM unnecessarily
[Figure: timeline showing Broadcast Req, Snoop Tags, Send Resp alongside DRAM Read, Xmit Block]
![Page 25: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/25.jpg)
DRAM Operations
[Figure: percentage of DRAM requests (0% to 100%) for Scientific Mean, Multiprogrammed Mean, Commercial Mean, and Overall Mean. Series: Writes, Useful Reads, Misspeculated Reads]
![Page 26: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/26.jpg)
Power-Efficient DRAM Speculation
• Direct memory requests are non-speculative
• Lines from externally-dirty regions likely to be sourced from another processor’s cache
  • Region state can serve as a prediction
  • Need not access DRAM speculatively
• Initial requests to a region (state unknown) have a lower but significant probability of obtaining data from other processors’ caches
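Using region state as a predictor can be sketched as a small policy function. The policy names mirror the configurations compared in the talk's results chart (Speculate All, No-speculate Dirty Regions, No-speculate Dirty or Unknown Regions, No-speculate); the enum names, the function, and the exact semantics attached to each policy are an assumed reading, not taken from the source.

```python
from enum import Enum


class ExternalState(Enum):
    CLEAN = "externally-clean"
    DIRTY = "externally-dirty"
    UNKNOWN = "unknown"


class Policy(Enum):
    SPECULATE_ALL = 1
    NO_SPECULATE_DIRTY = 2
    NO_SPECULATE_DIRTY_OR_UNKNOWN = 3
    NO_SPECULATE = 4


def speculate_dram_read(state: ExternalState, policy: Policy) -> bool:
    """True if a broadcast read should start DRAM access before the snoop response."""
    if policy is Policy.SPECULATE_ALL:
        return True
    if policy is Policy.NO_SPECULATE:
        return False
    if state is ExternalState.DIRTY:
        # Data is likely to come from another cache: skip the DRAM read.
        return False
    if state is ExternalState.UNKNOWN:
        # Only the stricter policy also holds back on unknown regions.
        return policy is Policy.NO_SPECULATE_DIRTY
    return True  # externally-clean: memory is the likely source
```

A read that is not issued speculatively is delayed until the snoop response shows that memory must supply the data, which is the latency cost traded for the DRAM power saved.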
![Page 27: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/27.jpg)
Useless DRAM Reads
[Figure: percentage of DRAM reads (0% to 100%) for Scientific Mean, Multiprogrammed Mean, Commercial Mean, and Overall Mean. Series: Externally-Clean Region, Unknown Region State, Externally-Dirty Region]
![Page 28: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/28.jpg)
Useful DRAM Reads
[Figure: percentage of DRAM reads (0% to 100%) for Scientific Mean, Multiprogrammed Mean, Commercial Mean, and Overall Mean. Bars in each group: Ext-Dirty, Ext-Clean, Ext-Unknown. Legend: False Positives]
![Page 29: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/29.jpg)
DRAM Reads Performed/Delayed
[Figure: DRAM reads (0% to 120%), shown as Reads Performed and Reads Delayed for each configuration: Baseline; CGCT, Speculate All; CGCT, Oracle Speculation; No-speculate Dirty Regions; No-speculate Dirty or Unknown Regions; No-speculate. Data labels in extraction order (bar assignment not recoverable): 0.0%, 101.6%, 0.0%, 0.0%, 81.5%, 6.9%, 78.2%, 77.2%, 13.3%, 12.5%, 100.0%, 71.4%]
![Page 30: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/30.jpg)
Summary
Power-Efficient DRAM Speculation:
• Can reduce DRAM reads 20%, with less than 1% degradation in performance
  • 7% slowdown with nonspeculative DRAM
• Nearly doubles interval between DRAM requests, allowing modules to stay in low-power modes longer
![Page 31: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/31.jpg)
![Page 32: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/32.jpg)
Chip Multiprocessor Interconnect
• Options
  • Buses: don’t scale
  • Crossbars: too expensive
  • Rings: too slow
  • Packet-switched mesh
• Attractive for all the same 1990s DSM reasons
  • Scalable
  • Low latency
  • High link utilization
![Page 33: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/33.jpg)
CMP Interconnection Networks
• But…
  • Cables/traces are now on-chip wires
    • Fast, cheap, plentiful
    • Short: 1 cycle per hop
  • Router latency adds up
    • 3-4 cycles per hop
  • Store-and-forward
    • Lots of activity/power
• Is this the right answer?
![Page 34: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/34.jpg)
Circuit-Switched Interconnects
• Communication patterns
  • Spatial locality to memory
  • Pairwise communication
• Circuit-switched links
  • Avoid switching/routing
  • Reduce latency
  • Save power?
• Poor utilization! Maybe OK
![Page 35: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/35.jpg)
Router Design
• Switches consist of
  • Configurable crossbar
  • Configuration memory
• 4-stage router pipeline exposes only 1 cycle if circuit-switched (CS)
• Can also act as packet-switched network
• Design details in [CA Letters ’07]
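A back-of-the-envelope comparison using the per-hop figures quoted in the talk (1 cycle of wire per hop, 3-4 router cycles per hop when packet-switched, 1 exposed router cycle when circuit-switched). The additive per-hop model, the absence of contention, the assumption that the circuit is already set up, and the choice to count the wire cycle separately from the exposed router cycle are all assumptions of this sketch, not claims from the source.

```python
def packet_switched_latency(hops: int, router_cycles: int = 4,
                            wire_cycles: int = 1) -> int:
    """Cycles to traverse `hops` hops through a full router pipeline."""
    return hops * (router_cycles + wire_cycles)


def circuit_switched_latency(hops: int, exposed_router_cycles: int = 1,
                             wire_cycles: int = 1) -> int:
    """Cycles to traverse `hops` hops on an already-established circuit."""
    return hops * (exposed_router_cycles + wire_cycles)


if __name__ == "__main__":
    for hops in (1, 2, 4):
        print(hops, packet_switched_latency(hops), circuit_switched_latency(hops))
```

Under these assumptions a 4-hop traversal costs 20 cycles packet-switched and 8 cycles circuit-switched, and the gap grows linearly with hop count.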
![Page 36: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/36.jpg)
Protocol Optimization
• Initial 3-hop miss establishes CS path
• Subsequent miss requests
  • Sent directly on CS path to predicted owner
  • Also in parallel to home node
  • Predicted owner sources data early
  • Directory acks update to sharing list
• Benefits
  • Reduced 3-hop latency
  • Less activity, less power
![Page 37: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/37.jpg)
Hybrid Circuit Switching (1)
• Hybrid Circuit Switching improves performance by up to 7%
![Page 38: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/38.jpg)
Hybrid Circuit Switching (2)
• Positive interaction in co-designed interconnect & protocol
• More circuit reuse => greater latency benefit
![Page 39: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/39.jpg)
Summary
Hybrid Circuit Switching:
• Routing overhead eliminated
• Still enable high bandwidth when needed
• Co-designed protocol
  • Optimize cache-to-cache transfers
• Substantial performance benefits
• To do: power analysis
![Page 40: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/40.jpg)
![Page 41: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/41.jpg)
Server Consolidation on CMPs
• CMP as consolidation platform
  • Simplify system administration
  • Save power, cost and physical infrastructure
• Study combinations of individual workloads in full system environment
  • Micro-coded hypervisor schedules VMs
• See "An Evaluation of Server Consolidation Workloads for Multi-Core Designs" in IISWC 2007 for additional details
  • Nugget: shared LLC a big win
![Page 42: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/42.jpg)
Virtual Proximity
• Interactions between VM scheduling, placement, and interconnect
  • Goal: placement agnostic scheduling
  • Best workload balance
• Evaluate 3 scheduling policies
  • Gang, Affinity and Load Balanced
• HCS provides virtual proximity
![Page 43: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/43.jpg)
Scheduling Algorithms
• Gang Scheduling
  • Co-schedules all threads of a VM
  • No idle-cycle stealing
• Affinity Scheduling
  • VMs assigned to neighboring cores
  • Can steal idle cycles across VMs sharing core
• Load Balanced Scheduling
  • Ready threads assigned to any core
  • Any/all VMs can steal idle cycles
  • Over time, VM fragments across chip
![Page 44: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/44.jpg)
• Load balancing wins with fast interconnect
• Affinity scheduling wins with slow interconnect
• HCS creates virtual proximity
![Page 45: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/45.jpg)
Virtual Proximity Performance
• HCS able to provide virtual proximity
![Page 46: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/46.jpg)
• As physical distance (hop count) increases, HCS provides significantly lower latency
![Page 47: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/47.jpg)
Summary
Virtual Proximity [in submission]
• Enables placement agnostic hypervisor scheduler
• Results:
  • Up to 17% better than affinity scheduling
  • Idle cycle reduction: 84% over gang and 41% over affinity
  • Low-latency interconnect mitigates increase in L2 cache conflicts from load balancing
    • L2 misses up by 10% but execution time reduced by 11%
• A flexible, distributed address mapping combined with HCS out-performs a localized affinity-based memory mapping by an average of 7%
![Page 48: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/48.jpg)
![Page 49: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/49.jpg)
Circuit-Switched Snooping (1)
• Scalable, efficient broadcasting on unordered network
  • Remove latency overhead of directory indirection
• Extend point-to-point circuit-switched links to trees
  • Low latency multicast via circuit-switched tree
• Help provide performance isolation as requests do not share same communication medium
![Page 50: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/50.jpg)
Circuit-Switched Snooping (2)
• Extend Coarse Grain Coherence Tracking (CGCT)
  • Remove unnecessary broadcasts
  • Convert broadcasts to multicasts
• Effective in Server Consolidation Workloads
  • Very few coherence requests to globally shared data
![Page 51: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/51.jpg)
Snooping Interconnect
• Switches consist of
  • Configurable crossbar
  • Configuration memory
• Circuits span two or more nodes, based on RCA
• Snooping occurs across circuits
  • All sharers in region join circuit
• Each link can physically accommodate multiple circuits
![Page 52: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/52.jpg)
Circuit-Switched Snooping
• Use RCA to identify subsets of nodes that share data
• Create shared circuits among these nodes
• Design challenges
  • Multi-drop, bidirectional circuits
  • Memory ordering
• Results: very much in progress
![Page 53: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/53.jpg)
![Page 54: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/54.jpg)
Research Group Overview
• Faculty: Mikko Lipasti, since 1999
• Current MS/PhD students
  • Gordie Bell (also IBM), Natalie Enright Jerger, Erika Gunadi, Atif Hashmi, Eric Hill, Lixin Su, Dana Vantrease
• Graduates, current employment:
  • Intel: Ilhyun Kim, Morris Marden, Craig Saldanha, Madhu Seshadri
  • IBM: Trey Cain, Jason Cantin, Brian Mestan
  • AMD: Kevin Lepak
  • Sun Microsystems: Matt Ramsay, Razvan Cheveresan, Pranay Koka
![Page 55: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/55.jpg)
Current Focus Areas
• Multiprocessors
  • Coherence protocol optimization
  • Interconnection network design
  • Fairness issues in hierarchical systems
• Microprocessor design
  • Complexity-effective microarchitecture
  • Scalable dynamic scheduling hardware
  • Speculation reduction for power savings
  • Transparent clock gating
  • Domain-specific ISA extensions
• Software
  • Java Virtual Machine run-time optimization
  • Workload development and characterization
![Page 56: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/56.jpg)
Funding
• National Science Foundation
• Intel Research Council
• IBM Faculty Partnership Awards
• IBM Shared University Research equipment
• Schneider ECE Faculty Fellowship
• UW Graduate School
![Page 57: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/57.jpg)
Questions?
http://www.ece.wisc.edu/~pharm
![Page 58: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/58.jpg)
Backup Slides
![Page 59: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/59.jpg)
Region Coherence Arrays
The regions are kept coherent with a protocol, which summarizes the local and global state of lines in the region

| Region State | Processor | Other Processors | Broadcast Needed? |
| --- | --- | --- | --- |
| Invalid (I) | No Cached Copies | Unknown | Yes |
| Clean-Invalid (CI) | Unmodified Copies Only | No Cached Copies | No |
| Clean-Clean (CC) | Unmodified Copies Only | Unmodified Copies Only | For Modifiable Copy |
| Clean-Dirty (CD) | Unmodified Copies Only | Modified/Unmodified Copies | Yes |
| Dirty-Invalid (DI) | Modified/Unmodified Copies | No Cached Copies | No |
| Dirty-Clean (DC) | Modified/Unmodified Copies | Unmodified Copies Only | For Modifiable Copy |
| Dirty-Dirty (DD) | Modified/Unmodified Copies | Modified/Unmodified Copies | Yes |
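The "Broadcast Needed?" column of the region state table can be read as a small decision function. The sketch below is only an illustration of that column; the names `RegionState` and `broadcast_needed` are hypothetical and do not come from the slides.

```python
from enum import Enum


class RegionState(Enum):
    """The seven region states listed in the table (names are from the slide)."""
    I = "Invalid"
    CI = "Clean-Invalid"
    CC = "Clean-Clean"
    CD = "Clean-Dirty"
    DI = "Dirty-Invalid"
    DC = "Dirty-Clean"
    DD = "Dirty-Dirty"


# Other processors hold no cached copies: no broadcast is needed.
NO_BROADCAST = {RegionState.CI, RegionState.DI}
# Other processors hold unmodified copies only: broadcast only for a modifiable copy.
BROADCAST_FOR_MODIFIABLE = {RegionState.CC, RegionState.DC}


def broadcast_needed(state: RegionState, wants_modifiable: bool) -> bool:
    """Return True if a miss in a region with this state must be broadcast."""
    if state in NO_BROADCAST:
        return False
    if state in BROADCAST_FOR_MODIFIABLE:
        return wants_modifiable
    # I (unknown), CD and DD: others may hold modified copies, so always broadcast.
    return True
```

For example, `broadcast_needed(RegionState.CC, wants_modifiable=False)` is `False`, since a read of a region that others hold only unmodified copies of can go directly to memory.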
![Page 60: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/60.jpg)
Region Coherence Arrays
• On cache misses, the region state is read to determine if a broadcast is necessary
• On external snoops, the region state is read to provide a region snoop response
  - Piggybacked onto the conventional response
  - Used to update other processors’ region state
• The regions are kept coherent with a protocol, which summarizes the local and global state of lines in the region
![Page 61: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/61.jpg)
Coarse-Grain Coherence Tracking
Region Coherence Array added; two lines per region
[Diagram: processors P0 and P1, each with a cache ($0, $1) and an RCA, connected by a network to memories M0 and M1. Initially P0 holds lines 0010 and 0011 Exclusive with region 001 in state DI; P1 holds nothing.]
• P1 stores 10000₂: MISS (line 0010 and region 001 go to Pending in P1)
• Snoop performed (RFO: P1, 10000₂): hits in P0 cache
• Response sent (Owned, Region Owned): region not exclusive anymore, region 001 becomes DD
• Data transfer: line 0010 is Invalid in P0 and Modified in P1
![Page 62: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/62.jpg)
Overhead
• Storage for RCA
• Two bits in snoop response for region snoop response
  - Region Externally Clean/Dirty

2-way set-assoc. RCA, 48-bit addresses

| RCA Size | Bits / Set | Total Kilobytes | Tag Overhead | Cache Overhead |
| --- | --- | --- | --- | --- |
| 2K-Entries | 74 | 9.3 | 5.0% | 0.8% |
| 4K-Entries | 72 | 18.0 | 9.7% | 1.5% |
| 8K-Entries | 70 | 35.0 | 48.6% | 2.8% |
| 16K-Entries | 68 | 68.0 | 88.3% | 5.5% |
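The "Total Kilobytes" column follows from the "Bits / Set" column and the 2-way organization. The sketch below reproduces that arithmetic, assuming "K" means 1024 entries and a kilobyte is 1024 bytes; the function name `rca_storage_kb` is hypothetical.

```python
def rca_storage_kb(entries: int, ways: int, bits_per_set: int) -> float:
    """Total RCA storage in kilobytes (assumes 1 KB = 1024 bytes)."""
    sets = entries // ways            # a 2-way array has entries / 2 sets
    total_bits = sets * bits_per_set
    return total_bits / 8 / 1024      # bits -> bytes -> kilobytes


# (entries, bits per set) pairs taken from the table.
for entries, bits in [(2048, 74), (4096, 72), (8192, 70), (16384, 68)]:
    kb = rca_storage_kb(entries, ways=2, bits_per_set=bits)
    print(f"{entries // 1024}K entries: {kb:.2f} KB")
```

Under these assumptions the four rows work out to 9.25, 18.00, 35.00 and 68.00 KB, which agrees with the table after rounding. Bits per set drop by two with each doubling because each of the two entries in a set needs one fewer tag bit.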
![Page 63: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/63.jpg)
Overhead
• RCA maintains inclusion over caches
  - RCA must respond correctly to external requests if lines cached
  - When regions evicted from RCA, their lines are evicted from the cache
• Replacement algorithm uses line count to favor regions with no lines cached
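One way to picture the line-count-aware replacement policy is sketched below. The slide only says that the line count is used to favor regions with no lines cached; the LRU tie-break, the `RegionEntry` fields and the function `choose_victim` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RegionEntry:
    tag: int
    line_count: int   # lines of this region currently held in the cache
    last_used: int    # hypothetical timestamp, used here only for LRU tie-breaking


def choose_victim(ways: Sequence[RegionEntry]) -> RegionEntry:
    """Pick the RCA entry to evict from one set.

    Entries with no cached lines are preferred, because evicting them
    forces no cache-line evictions to preserve inclusion. Among equally
    preferred entries, the least recently used one is chosen (assumption).
    """
    return min(ways, key=lambda e: (e.line_count > 0, e.last_used))
```

With this ordering, a recently used region with zero cached lines is still evicted before an older region that has lines in the cache, which keeps inclusion-driven evictions low.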
![Page 64: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/64.jpg)
Snoop Traffic – Peak
[Bar chart: peak broadcasts per 1000 CPU cycles (y-axis, 0 to 16) for Baseline and region sizes 128B, 256B, 512B, 1KB, 2KB, 4KB. Groups: Scientific Mean, Multiprogrammed Mean, Commercial Mean, Overall Mean.]
![Page 65: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/65.jpg)
Snoop Traffic – Average
[Bar chart: average broadcasts per 1000 CPU cycles (y-axis, 0 to 16) for Baseline and region sizes 128B, 256B, 512B, 1KB, 2KB, 4KB. Groups: Scientific Mean, Multiprogrammed Mean, Commercial Mean, Overall Mean.]
![Page 66: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/66.jpg)
Snoop Traffic
Peak snoop traffic is halved
Average snoop traffic reduced by nearly two thirds
The system is more scalable, and may effectively support more processors
![Page 67: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/67.jpg)
Tag Lookups Filtered
• Coarse-Grain Coherence Tracking can be used to filter external snoops
  - Send external requests to RCA first
  - If region valid and line-count nonzero, send external request to cache
• Reduces power consumption in the cache tag arrays
• Increases broadcast snoop latency
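The filtering rule above (consult the RCA first, and forward to the cache tags only if the region is valid with a nonzero line count) could be sketched as follows. The class `SnoopFilter`, its fields and the 512-byte default region size are illustrative assumptions, not details from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class SnoopFilter:
    region_bytes: int = 512                           # assumed region size
    line_counts: dict = field(default_factory=dict)   # region number -> cached line count
    tag_lookups: int = 0                              # external requests forwarded to cache tags
    filtered: int = 0                                 # external requests answered by the RCA alone

    def _region(self, addr: int) -> int:
        return addr // self.region_bytes

    def line_cached(self, addr: int) -> None:
        region = self._region(addr)
        self.line_counts[region] = self.line_counts.get(region, 0) + 1

    def line_evicted(self, addr: int) -> None:
        self.line_counts[self._region(addr)] -= 1

    def external_request(self, addr: int) -> bool:
        """Return True if the cache tag array must be looked up."""
        # Region present in the RCA and at least one of its lines is cached.
        if self.line_counts.get(self._region(addr), 0) > 0:
            self.tag_lookups += 1
            return True
        self.filtered += 1
        return False
```

Every request that returns `False` skips a cache tag lookup, which is where the power saving comes from; the extra RCA access in front of the tags is the source of the added snoop latency.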
![Page 68: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/68.jpg)
Tag Lookups Filtered
[Stacked bar chart: percentage of external requests (y-axis, 0% to 100%) for Oracle and region sizes 128B, 256B, 512B, 1KB, 2KB, 4KB. Groups: Scientific Mean, Multiprogrammed Mean, Commercial Mean, Overall Mean. Series: Tag Lookups Filtered, Tag Lookups for Broadcasts Avoided, Write-back Tag Lookups.]
![Page 69: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/69.jpg)
Line Evictions for Inclusion
[Stacked bar chart: percentage of regions evicted (y-axis, 0% to 100%) for region sizes 128B, 256B, 512B, 1KB, 2KB, 4KB. Groups: Scientific Mean, Multiprogrammed Mean, Commercial Mean, Overall Mean. Series: 0 lines evicted through 8 lines evicted.]
![Page 70: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/70.jpg)
L2 Miss Ratio Increase
[Bar chart: L2 miss ratio (y-axis, 0% to 110%) for Baseline and region sizes 128B, 256B, 512B, 1KB, 2KB, 4KB. Groups: Scientific Mean, Multiprogrammed Mean, Commercial Mean, Overall Mean.]
![Page 71: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/71.jpg)
Stealth Prefetching
• Lines from a region may be prefetched again after a threshold number of L2 misses (currently 2)
• A bit mask of the lines cached since the last prefetch is used to avoid prefetching useless data
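The threshold and bit mask can be illustrated with a small per-region tracker. This is one possible reading, chosen because it reproduces the 1100₂ prefetch mask in the deck's 4-line worked example (lines 0 and 1 already cached, lines 2 and 3 prefetched). The class `RegionPrefetchState`, and the choice to clear the mask after each prefetch, are assumptions.

```python
class RegionPrefetchState:
    """Per-region prefetch bookkeeping (illustrative sketch only)."""

    def __init__(self, lines_per_region: int, threshold: int = 2):
        self.lines_per_region = lines_per_region
        self.threshold = threshold   # L2 misses required before prefetching again
        self.cached_mask = 0         # bit i set: line i cached since the last prefetch
        self.misses = 0              # L2 misses to this region since the last prefetch

    def on_l2_miss(self, line_index: int) -> int:
        """Record a demand miss; return the bit mask of lines to prefetch (0 if none)."""
        self.cached_mask |= 1 << line_index
        self.misses += 1
        if self.misses < self.threshold:
            return 0
        all_lines = (1 << self.lines_per_region) - 1
        # Skip lines already brought in by demand misses: fetching them again is useless.
        prefetch_mask = all_lines & ~self.cached_mask
        # Start a new interval "since the last prefetch" (assumption).
        self.misses = 0
        self.cached_mask = 0
        return prefetch_mask
```

With four lines per region and the threshold of 2, a miss on line 0 returns 0 and a following miss on line 1 returns `0b1100`.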
![Page 72: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/72.jpg)
Stealth Prefetching
Prefetched lines are managed by a simple protocol
[State diagram: states Invalid, Pending Data, Pending Requested Data, Valid. Transition labels: Line Prefetch Initiated; Processor Miss Request; Data; Data, Send to Cache; Invalidate.]
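The protocol for prefetched lines can be written as a transition table. Only the state names and transition labels survive from the slide; which arc connects which pair of states is a guess here, so the table below should be treated as a plausible sketch and not as the protocol from the talk. All identifiers are hypothetical.

```python
from enum import Enum, auto


class State(Enum):
    INVALID = auto()
    PENDING_DATA = auto()             # prefetch issued, data not yet returned
    PENDING_REQUESTED_DATA = auto()   # processor missed on the line while data is in flight
    VALID = auto()                    # prefetched data held in the buffer


class Event(Enum):
    PREFETCH_INITIATED = auto()
    PROCESSOR_MISS = auto()
    DATA = auto()
    INVALIDATE = auto()


# Assumed arcs: the labels are from the slide, the endpoints are not.
TRANSITIONS = {
    (State.INVALID, Event.PREFETCH_INITIATED): State.PENDING_DATA,
    (State.PENDING_DATA, Event.PROCESSOR_MISS): State.PENDING_REQUESTED_DATA,
    (State.PENDING_DATA, Event.DATA): State.VALID,
    (State.PENDING_REQUESTED_DATA, Event.DATA): State.INVALID,  # "Data, Send to Cache"
    (State.VALID, Event.PROCESSOR_MISS): State.INVALID,         # line handed to the cache
    (State.VALID, Event.INVALIDATE): State.INVALID,
    (State.PENDING_DATA, Event.INVALIDATE): State.INVALID,
}


def step(state: State, event: Event) -> State:
    """Apply one event; events with no listed arc leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```

Under this reading a prefetched line never stays in the buffer once the processor asks for it: the data moves to the cache and the buffer entry returns to Invalid.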
![Page 73: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/73.jpg)
Prefetch Timeliness
[Bar chart: timely prefetches (y-axis, 0% to 100%) for SP-512B, SP-1KB, SP-2KB. Groups: Scientific, Multiprogrammed, Commercial, Arithmetic Mean.]
![Page 74: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/74.jpg)
Data Traffic
[Bar chart: data traffic (y-axis, 0% to 150%) for Baseline, CGCT-512B, SP-512B, SP-1KB, SP-2KB. Groups: Scientific, Multiprogrammed, Commercial, Arithmetic Mean.]
![Page 75: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/75.jpg)
Period Between DRAM Requests
[Bar chart: processor cycles (y-axis, 0 to 1000). Groups: Scientific Mean, Multiprogrammed Mean, Commercial Mean, Overall Mean. Series: Baseline; CGCT, Speculate All; No-speculate Dirty Regions; No-speculate Dirty or Unknown Regions; No-speculate.]
![Page 76: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/76.jpg)
Switch design
![Page 77: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/77.jpg)
Value-Aware Techniques
• Coherence misses in multiprocessors
  - Store Value Locality [Lepak ’03]
• Ensuring consistency
  - Value-based checks [Cain ’04]
• Reducing speculation
  - Operand significance
  - Create (nearly) nonspeculative execution schedule
• Java Virtual Machine runtime optimization [Su]
  - Speculative optimizations [VEE ’07]
![Page 78: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/78.jpg)
Complexity-Effective Techniques
• Scalable dynamic scheduling hardware
  - Half-price architecture [Kim ’03]
  - Macro-op scheduling [Kim ’03]
  - Operand significance [Gunadi]
• Scalable snoop-based coherence
  - Coarse-grained coherence [Cantin ’06]
  - Circuit-switched coherence [Enright]
![Page 79: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/79.jpg)
Power-Efficient Techniques
• Reduced speculation [Gunadi]
• Clock gating [E. Hill]
  - Transparent pipelines need fine-grained stalls
  - Redistribute coarse-grained stall cycles
• Circuit-switched coherence [Enright]
  - Reduce overhead of CMP cache coherence
  - Improve latency, power
![Page 80: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/80.jpg)
Cache Coherence Problem
[Diagram: processors P0 and P1 above a shared memory. Sequence shown: Load A (cached copy A = 0), Load A (cached copy A = 0), Store A <= 1 (copy becomes 1), then another Load A.]
![Page 81: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/81.jpg)
Cache Coherence Problem
[Diagram: same sequence as the previous slide. After Store A <= 1, memory holds 1 and the final Load A returns A = 1.]
![Page 82: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/82.jpg)
Snoopy Cache Coherence
• All cache misses broadcast on shared bus
  - Processors and memory snoop and respond
• Cache block permissions enforced
  - Multiple readers allowed (shared state)
  - Only a single writer (exclusive state)
• Must upgrade block before writing to it
  - Other copies invalidated
• Read/write-shared blocks bounce from cache to cache
  - Migratory sharing
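The multiple-readers, single-writer rule can be shown with a toy invalidation-based model. This is a deliberately simplified sketch with two states ("S" shared, "M" modified), not the protocol of any real machine; the class `SnoopyBus` and its methods are hypothetical.

```python
class SnoopyBus:
    """Toy snooping model: many readers or one writer per block."""

    def __init__(self, n_caches: int):
        # One dict per cache: block -> (state, value), state is "S" or "M".
        self.caches = [dict() for _ in range(n_caches)]
        self.memory = {}

    def load(self, cpu: int, block: str) -> int:
        line = self.caches[cpu].get(block)
        if line is not None:          # hit: no bus traffic
            return line[1]
        # Miss: broadcast. A modified owner writes back and downgrades to shared.
        for other in self.caches:
            if block in other and other[block][0] == "M":
                self.memory[block] = other[block][1]
                other[block] = ("S", other[block][1])
        value = self.memory.get(block, 0)
        self.caches[cpu][block] = ("S", value)
        return value

    def store(self, cpu: int, block: str, value: int) -> None:
        # Upgrade: broadcast so that every other copy is invalidated before writing.
        for i, other in enumerate(self.caches):
            if i != cpu and block in other:
                if other[block][0] == "M":
                    self.memory[block] = other[block][1]
                del other[block]
        self.caches[cpu][block] = ("M", value)
```

Replaying the Load A, Load A, Store A <= 1, Load A sequence from the Cache Coherence Problem slides on this model, the final load misses (its copy was invalidated by the store) and returns 1 instead of the stale 0.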
![Page 83: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/83.jpg)
Example: Conventional Snooping
[Diagram: processors P0 and P1 with caches $0 and $1 (Tag and State columns, all lines initially Invalid), connected by a network to memories M0 and M1.]
• P0 loads 10000₂: MISS (line 0010 goes to Pending in $0)
• Snoop performed (Read: P0, 10000₂ is broadcast)
• Response sent (Invalid, Invalid)
• Data transfer (line 0010 becomes Exclusive in $0)
![Page 84: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/84.jpg)
Coarse-Grain Coherence Tracking
Region Coherence Array added; two lines per region
[Diagram: processors P0 and P1, each with a cache ($0, $1) and an RCA (Tag and State columns, all entries initially Invalid), connected by a network to memories M0 and M1.]
• P0 loads 10000₂: MISS (line 0010 and region 001 go to Pending)
• Snoop performed (Read: P0, 10000₂)
• Response sent (Invalid, Region Not Shared)
• Data transfer (line 0010 becomes Exclusive, region 001 becomes DI)
P0 has exclusive access to region
![Page 85: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/85.jpg)
Coarse-Grain Coherence Tracking
Region Coherence Array added; two lines per region
[Diagram: same system as the previous slide. P0 holds line 0010 Exclusive and region 001 in state DI; all other entries are Invalid.]
• P0 loads 11000₂: MISS, Region Hit
• Direct request sent (Read: P0, 11000₂)
• Data transfer (line 0011 goes from Pending to Exclusive)
Exclusive region state, broadcast unnecessary
![Page 86: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/86.jpg)
Impact on Execution Time
[Bar chart: execution time (y-axis, 0% to 100%). Groups: Scientific Mean, Multiprogrammed Mean, Commercial Mean, Overall Mean. Series: Baseline; CGCT, Speculate All; No-speculate Dirty Regions; No-speculate Dirty or Unknown Regions; No-speculate.]
![Page 87: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/87.jpg)
Stealth Prefetching
Assume 8-byte lines, 32-byte regions, 2-line threshold
[Diagram: processors P0 and P1, each with a cache ($0, $1), an RCA and an SDPB, connected by a network to memories M0 and M1. P0 holds line 0100 Exclusive and region 001 in state DI; all other entries are Invalid.]
• P0 loads 0x28: MISS, RCA Hit
• Direct request sent (Read: P0, 0x28 with Prefetch: 1100₂)
• Data transfer (line 0101 goes from Pending to Exclusive)
• Prefetch data (lines 0110 and 0111 go from Pending to Valid in the SDPB)
![Page 88: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/88.jpg)
Stealth Prefetching
Assume 8-byte lines, 32-byte regions, 2-line threshold
[Diagram: same system as the previous slide. P0 holds lines 0100 and 0101 Exclusive and region 001 in state DI; its SDPB holds lines 0110 and 0111 Valid.]
• P0 loads 0x30: MISS, SDPB Hit
• Data transfer: the SDPB returns the data (line 0110 goes from Pending to Exclusive in $0 and becomes Invalid in the SDPB)
![Page 89: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/89.jpg)
Communication Latencies

| | CC-NUMA | CMP |
| --- | --- | --- |
| Local Cache Access | 12 | 12 |
| Remote Cache-to-Cache Transfer | 12 + 21 * H * 3 | 12 + 4 * H * 3 |
| Local Memory Access | 150 | 150 |
| Remote Memory Access | 150 + 21 * H * 2 | 150 + 4 * H * 2 |

H = hop count

• Remote cache access is 2-5x faster in CMPs than NUMA machines
• Lower communication latencies allow for more flexible thread placement
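The latency formulas in the table are easy to evaluate for a given hop count. The sketch below simply encodes them; the unit is assumed to be cycles (the slide gives no unit), and the function and constant names are hypothetical.

```python
LOCAL_CACHE = 12                         # local cache access, both systems
LOCAL_MEMORY = 150                       # local memory access, both systems
PER_HOP = {"CC-NUMA": 21, "CMP": 4}      # per-hop cost from the table


def remote_cache_to_cache(system: str, hops: int) -> int:
    """Remote cache-to-cache transfer: the hop cost is paid three times."""
    return LOCAL_CACHE + PER_HOP[system] * hops * 3


def remote_memory(system: str, hops: int) -> int:
    """Remote memory access: the hop cost is paid twice."""
    return LOCAL_MEMORY + PER_HOP[system] * hops * 2


for h in (1, 2, 4):
    numa = remote_cache_to_cache("CC-NUMA", h)
    cmp_ = remote_cache_to_cache("CMP", h)
    print(f"H={h}: CC-NUMA {numa}, CMP {cmp_}, ratio {numa / cmp_:.2f}")
```

At one hop the cache-to-cache transfer costs 75 in CC-NUMA against 24 in a CMP, and the gap widens with hop count because the fixed 12-unit local access matters less.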
![Page 90: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/90.jpg)
Configuration

Simulation Parameters

| Parameter | Value |
| --- | --- |
| Cores | 16 single-threaded light-weight, in-order |
| Interconnect | 2-D Packet-Switched Mesh, 3-cycle router pipeline (baseline); Hybrid Circuit-Switched Mesh, 4 Circuits |
| L1 Cache | Split I/D, 16KB each (2 cycles) |
| L2 Cache | Private, 128 KB (6 cycles) |
| L3 Cache | Shared, 16 MB (16 1MB banks), 12 cycles |
| Memory Latency | 150 cycles |

Workload Mixes

| Mix | Workloads |
| --- | --- |
| Mix 1 | TPC-W (4) + TPC-H (4) |
| Mix 2 | TPC-W (4) + SPECjbb (4) |
| Mix 3 | TPC-H (4) + SPECjbb (4) |
![Page 91: Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason](https://reader035.vdocument.in/reader035/viewer/2022062517/56649e7a5503460f94b7a759/html5/thumbnails/91.jpg)
Effect of Memory Placement
• Load Balancing with HCS outperforms local placement
• Virtual proximity to memory home node