![Page 1: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/1.jpg)
Snoop Filtering and Coarse-Grain Memory Tracking
Andreas Moshovos
Univ. of Toronto/ECE
Short Course at the University of Zaragoza, July 2009
Some slides by J. Zebchuk or the original paper authors
![Page 2: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/2.jpg)
JETTY Snoop-Filtering for Reduced Power in SMP Servers
Andreas Moshovos
Babak Falsafi, ECE, Carnegie Mellon
Gokhan Memik, ECE, Northwestern
Alok Choudhary, ECE, Northwestern
Int’l Conference on High-Performance Architecture, 2001
![Page 3: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/3.jpg)
Power is Becoming Important
• Architecture is a science of tradeoffs
• Thus far: Performance vs. Cost vs. Complexity
• Today: vs. Power
• Where?
– Mobile Devices
– Desktops/Servers (Our Focus)
![Page 4: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/4.jpg)
Power-Aware Servers
• Revisit the design of SMP servers
– 2 or more CPUs per machine
– Snoop coherence-based
• Why?
– File, web, databases, your typical desktop
– Cost effective too
• This work - a first step: Power-Aware Snoopy-Coherence
![Page 5: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/5.jpg)
Power-Aware Snoop-Coherence
• Conventional
– All L2 caches snoop all memory traffic
– Power expended by all on any memory access
• Jetty-Enhanced
– Tiny structure on L2-backside
– Filters most “would-be-misses”
– Less power expended on most snoop misses
– No changes to protocol necessary
– No performance loss
![Page 6: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/6.jpg)
Roadmap
• Why Power is a Concern for Servers?
• Snoopy-Coherence Basics
• An Opportunity for Reducing Power
• JETTY
• Results
• Summary
![Page 7: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/7.jpg)
Why is Power Important?
Power Could Ultimately Limit Performance
• Power Demands have been increasing
• Deliver Energy to and on chip
• Dissipate Heat
• Limit:
– Amount of resources & frequency
– Feasibility
• Cooling a solution: Cost & Integration?
Reducing Power Demands is much more convenient
![Page 8: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/8.jpg)
What can be done?
• Redesign Circuits
• Clock Gating and Frequency Scaling
– A lot has been done thus far
– Still active
• Rethink Architectural Decisions
– Orthogonal to others
Reduce Power Under Performance Constraints
![Page 9: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/9.jpg)
The “Silver Bullet” Solution
• Good if there was one
• However, till one is found...
• Look at all structures
• Rethink Design
• Propose Power-Optimized versions
• This is what we’re doing for performance
![Page 10: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/10.jpg)
Snoopy Cache Coherence
All L2 tags see all bus accesses
Intervene when necessary
[Figure: CPU cores with L1 and L2 caches and Main Memory; a snoop results in a Hit]
![Page 11: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/11.jpg)
How About Power?
All L2 tags see all bus accesses
Perf. & Complexity: Have L2 tags, why not use them
Power: All L2 tags consume power on all accesses
[Figure: three CPU cores with L1 and L2 caches and Main Memory; the snoops miss in the remote L2s]
![Page 12: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/12.jpg)
JETTY: A Would-be Snoop-Miss Filter
Imprecise: May filter a would-be miss; Never filters snoop-hits
[Figure: Would-be Snoop-Miss: addr is sent to the JETTY at CPU n, which answers "Not here!". Would-be Snoop-Hit: the JETTY answers "Don’t Know"]
Detect most misses using fewer resources
![Page 13: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/13.jpg)
Potential for Savings Exists
• Most Snoops miss
– 91% AVG
• Many L2 accesses are due to Snoop Misses
– 55% AVG
• Sizeable Potential Power Savings:
– 20% - 50% of total L2 power
![Page 14: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/14.jpg)
Exclude-Jetty
• Subset of what is not cached
[Figure: address space split into "cached" and "not cached"; the Exclude-JETTY covers part of the "not cached" area]
How? Cache recent snoop-misses locally
![Page 15: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/15.jpg)
Exclude-Jetty
• Subset of what you don’t have
Works well for producer-consumer
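A minimal sketch of the Exclude-Jetty idea, assuming a small direct-mapped table of block addresses that recently missed on a snoop. The class name, method names and table size are hypothetical and not taken from the slides; the one rule that must hold is that an entry is dropped as soon as the local L2 allocates that block.

```python
class ExcludeJetty:
    """Tiny table of block addresses known NOT to be in the local L2 (sketch)."""

    def __init__(self, num_entries=32):
        # Assumed organization: direct-mapped, one block address per entry.
        self.num_entries = num_entries
        self.table = [None] * num_entries

    def _index(self, block_addr):
        return block_addr % self.num_entries

    def filters(self, block_addr):
        """True means 'not here': the L2 tag lookup can be skipped."""
        return self.table[self._index(block_addr)] == block_addr

    def record_snoop_miss(self, block_addr):
        """Called after a snoop probed the L2 tags and missed."""
        self.table[self._index(block_addr)] = block_addr

    def on_local_fill(self, block_addr):
        """The local L2 now caches the block, so the entry must be dropped."""
        i = self._index(block_addr)
        if self.table[i] == block_addr:
            self.table[i] = None


ej = ExcludeJetty()
ej.record_snoop_miss(0x40)   # first snoop for 0x40 missed in the L2 tags
print(ej.filters(0x40))      # a repeated snoop for 0x40 is now filtered
```

Repeated snoops to the same blocks, as in producer-consumer sharing, are the case where such a table pays off.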
![Page 16: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/16.jpg)
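Include-Jetty
• Superset of what is cached
[Figure: address space split into "cached" and "not cached"; the Include-JETTY covers all of the "cached" area plus part of the rest]
How? Well...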
![Page 17: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/17.jpg)
Include-Jetty
[Figure: the address goes through hash functions f( ), g( ) and h( ), which index three bit vectors (bit vector 0, 1, 2)]
• Not-Cached: Any zero bit
• May be Cached: All bits set
Later I was told this is a Bloom filter…
![Page 18: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/18.jpg)
Include-Jetty
• Superset of what you have
This is a counting Bloom filter:
L-CBF: A Low Power, Fast Counting Bloom Filter Implementation
Elham Safi, Andreas Moshovos and Andreas Veneris
In Proc. Annual International Symposium on Low Power Electronics and Design (ISLPED), Oct. 2006.
Partially overlapping indexes worked better
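The Include-Jetty lookup can be sketched as a counting Bloom filter. This is an illustration under assumptions, not the published design: the number of tables, the index width and the way overlapping address slices are chosen (`stride` smaller than `index_bits`) are all hypothetical parameters.

```python
class IncludeJetty:
    """Counting Bloom filter tracking a superset of the cached blocks (sketch)."""

    def __init__(self, num_tables=3, index_bits=8, stride=4):
        self.index_bits = index_bits
        # stride < index_bits makes neighbouring address slices partially overlap.
        self.stride = stride
        self.counts = [[0] * (1 << index_bits) for _ in range(num_tables)]

    def _indexes(self, block_addr):
        mask = (1 << self.index_bits) - 1
        return [(block_addr >> (t * self.stride)) & mask
                for t in range(len(self.counts))]

    def on_fill(self, block_addr):
        """Block allocated in the local L2: increment one counter per table."""
        for t, i in enumerate(self._indexes(block_addr)):
            self.counts[t][i] += 1

    def on_evict(self, block_addr):
        """Block evicted from the local L2: decrement the same counters."""
        for t, i in enumerate(self._indexes(block_addr)):
            self.counts[t][i] -= 1

    def may_be_cached(self, block_addr):
        """Any zero counter proves 'not cached'; all non-zero means 'maybe'."""
        return all(self.counts[t][i] > 0
                   for t, i in enumerate(self._indexes(block_addr)))


ij = IncludeJetty()
ij.on_fill(0x1234)
print(ij.may_be_cached(0x1234), ij.may_be_cached(0x5678))
```

A snoop is filtered only when `may_be_cached` returns `False`, so false positives cost a tag lookup but never break coherence.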
![Page 19: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/19.jpg)
Hybrid-Jetty
• In some cases Exclude-J works well
• In others Include-J is better
• Combine
– Access in parallel on snoop
– Allocation
  • IJ always
  • If IJ fails to filter then to EJ
  • EJ coverage increases
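The allocation rule can be sketched as follows. To stay short, both halves are reduced to stand-ins (a single counter array for the Include part, a direct-mapped table for the Exclude part), and `l2_tags` is a hypothetical Python set standing in for the real tag array; none of these names come from the slides.

```python
class HybridJetty:
    """Include part always updated; Exclude part allocated only when IJ fails."""

    def __init__(self, ij_entries=256, ej_entries=32):
        self.ij_counts = [0] * ij_entries      # Include-Jetty stand-in
        self.ej_table = [None] * ej_entries    # Exclude-Jetty stand-in

    def on_fill(self, block):
        self.ij_counts[block % len(self.ij_counts)] += 1   # IJ always
        slot = block % len(self.ej_table)
        if self.ej_table[slot] == block:                   # block is now cached
            self.ej_table[slot] = None

    def on_evict(self, block):
        self.ij_counts[block % len(self.ij_counts)] -= 1

    def snoop(self, block, l2_tags):
        """Return (tag_lookup_performed, hit)."""
        # Both structures are consulted together on every snoop.
        ij_filters = self.ij_counts[block % len(self.ij_counts)] == 0
        ej_filters = self.ej_table[block % len(self.ej_table)] == block
        if ij_filters or ej_filters:
            return (False, False)
        hit = block in l2_tags
        if not hit:
            # IJ failed to filter this would-be miss, so remember it in EJ.
            self.ej_table[block % len(self.ej_table)] = block
        return (True, hit)
```

Because the Exclude part only stores misses that the Include part could not catch, its few entries are not wasted on addresses that are already covered.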
![Page 20: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/20.jpg)
Latency?
• Jetty may increase snoop-response time
• Can only be determined on a design by design basis
• Largest Jetty:
– Five 32x32 bit register files
![Page 21: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/21.jpg)
Results
• Used SPLASH-II
– Scientific applications
– “Large” Datasets
  • e.g., 4-80Megs of main memory allocated
  • Access Counts: 60M-1.7B
– 4-way SMP, MOESI
– 1M direct-mapped L2, 64b blocks, 32b subblocks
– 32k direct-mapped L1, 32b blocks
• Coverage & Power (analytical model)
![Page 22: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/22.jpg)
Coverage: Hybrid-Jetty
• Can capture 74% of all snoop-misses
[Chart: coverage (0% to 100%, with a "better" arrow) per benchmark (ba, ch, em, ff, fm, lu, oc, ra, rt, un, AVG) for three configurations: 10x4x7 + 32x4, 9x4x7 + 32x4, 8x4x7 + 32x4]
![Page 23: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/23.jpg)
Power-Savings
• 28% of overall L2 power
[Chart: power savings (0% to 50%, with a "better" arrow) per benchmark (ba, ch, em, ff, fm, oc, ra, rt, un, AVG)]
![Page 24: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/24.jpg)
Summary
• Power is becoming important
– Performance, Reliability and Feasibility
• Unique Opportunities Exist for Servers
• JETTY: Filter Snoops that would miss
– 74% of all snoop-misses
– 28% of L2 power saved
– No protocol changes
– No performance loss
![Page 25: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/25.jpg)
Power efficient cache coherence
C. Saldanha, M. Lipasti
Workshop on Memory Performance Issues (in conjunction with ISCA), June 2001.
![Page 26: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/26.jpg)
Serial Snooping
• Avoids Speculative transmission of Snoop packets.
• Check the nearest neighbor
• Data supplied with minimum latency and power
[Figure: nodes and MEMORY]
![Page 27: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/27.jpg)
TLB and Snoop Energy-Reduction using Virtual Caches in Low-Power Chip-Multiprocessors
Magnus Ekman, *Fredrik Dahlgren, and Per Stenström
Chalmers University of Technology
*Ericsson Mobile Platforms
Int’l Symposium on Low Power Electronics and Design (ISLPED), Aug. 2002
![Page 28: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/28.jpg)
Page Sharing Tables
• On a snoop, the requesting node gets a page-level sharing vector
Paper by the same authors demonstrates that Jetty is not beneficial for small-scale CMPs
If a PST entry is evicted, the whole page must be evicted
![Page 29: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/29.jpg)
RegionScout: Exploiting Coarse Grain Sharing in Snoop Coherence
Andreas Moshovos
Int’l Symposium on Computer Architecture (ISCA), 2005
![Page 30: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/30.jpg)
Improving Snoop Coherence
[Figure: three CPUs, each with I$ and D$, connected through an interconnect to Main Memory]
Conventional Considerations: Complexity and Correctness, NOT Power/Bandwidth
Can we: (1) Reduce Power/bandwidth (2) Leverage snoop coherence?
Remains Attractive: Simple / Design Re-use
Yes: Exploit Program Behavior to Dynamically Identify Requests that do not Need Snooping
![Page 31: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/31.jpg)
RegionScout: Avoid Some Snoops
[Figure: three CPUs, each with I$ and D$, connected through an interconnect to Main Memory]
• Frequent case: non-sharing even at a coarse level/Region
• RegionScout: Dynamically Identify Non-Shared Regions
– First Request to a Region Identifies it as not Shared
– Subsequent Requests do not need to be broadcast
• Uses Imprecise Information
– Small structures
– Layer on top of conventional coherence
– No additional constraints
![Page 32: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/32.jpg)
Roadmap
• Conventional Coherence:
– The need for power-aware designs
• Potential: Program Behavior
• RegionScout: What and How
• Implementation
• Evaluation
• Summary
![Page 33: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/33.jpg)
Coherence Basics
• Given request for memory block X (address)
• Detect where its current value resides
[Figure: three CPUs and Main Memory; a request for X is snooped by the other two CPUs and hits in one of them]
![Page 34: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/34.jpg)
Conventional Coherence not Power-Aware/Bandwidth-Effective
All L2 tags see all accesses
Perf. & Complexity: Have L2 tags, why not use them
Power: All L2 tags consume power on all accesses
Bandwidth: broadcast all coherent requests
[Figure: three CPUs with L2 caches and Main Memory; the snoops miss in the remote L2s]
![Page 35: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/35.jpg)
RegionScout Motivation: Sharing is Coarse
• Region: large contiguous memory area, power of 2 size
• CPU X asks for data block in region R
1. No one else has X
2. No one else has any block in R
RegionScout Exploits this Behavior
Layered Extension over Snoop Coherence
[Figure: Typical Memory Space Snapshot: addresses colored by owner(s)]
![Page 36: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/36.jpg)
Optimization Opportunities
• Power and Bandwidth
– Originating node: avoid asking others
– Remote node: avoid tag lookup
[Figure: three CPUs, each with I$ and D$, connected through a SWITCH to Memory]
![Page 37: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/37.jpg)
Potential: Region Miss Frequency
[Chart: Global Region Misses (% of all requests, 0% to 100%, with a "better" arrow) vs. Region Size (256, 512, 1K, 2K, 4K, 8K, 16K) for p4.512K, p4.1M, p8.512K, p8.1M]
Even with a 16K Region, ~45% of requests miss in all remote nodes
![Page 38: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/38.jpg)
RegionScout at Work: Non-Shared Region Discovery
First request detects a non-shared region
[Figure: three CPUs and Main Memory, steps numbered 1, 2, 2, 3: both remote CPUs report a Region Miss, which yields a Global Region Miss]
Record: Non-Shared Regions
Record: Locally Cached Regions
![Page 39: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/39.jpg)
RegionScout at Work: Avoiding Snoops
Subsequent request avoids snoops
[Figure: three CPUs and Main Memory, steps numbered 1 and 2; the region is recorded as a Global Region Miss]
Record: Non-Shared Regions
Record: Locally Cached Regions
![Page 40: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/40.jpg)
RegionScout is Self-Correcting
Request from another node invalidates non-shared record
[Figure: three CPUs and Main Memory, steps numbered 1, 2, 2]
Record: Non-Shared Regions
Record: Locally Cached Regions
![Page 41: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/41.jpg)
Implementation: Requirements
• Requesting Node provides address:
• At Originating Node – from CPU:
– Have I discovered that this region is not shared?
• At Remote Nodes – from Interconnect:
– Do I have a block in the region?
[Figure: the CPU address is split into a Region Tag and an offset; the offset is lg(Region Size) bits wide]
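The address split can be written out as a short sketch. The function name is hypothetical; the only assumption beyond the slide is that the region size is a power of two, which the region definition already requires.

```python
def split_address(addr, region_size):
    """Split an address into (region_tag, offset) for a power-of-two region size."""
    assert region_size > 0 and region_size & (region_size - 1) == 0
    offset_bits = region_size.bit_length() - 1    # lg(region size)
    return addr >> offset_bits, addr & (region_size - 1)


# With 16K regions the low 14 bits are the offset.
print(split_address(0x12345, 16 * 1024))   # (4, 9029), i.e. tag 4, offset 0x2345
```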
![Page 42: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/42.jpg)
Remembering Non-Shared Regions
• Records non-shared regions
• Lookup by Region portion prior to issuing a request
• Snoop requests and invalidate
[Figure: Non-Shared Region Table, looked up with the Region Tag of the address; each entry has a valid bit]
Few entries: 16x4 in most experiments
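A minimal sketch of such a table, assuming "16x4" means 16 sets of 4 ways. The class and method names are hypothetical, and the oldest-first replacement is an assumption made only to keep the example complete; the slides do not state a policy.

```python
class NonSharedRegionTable:
    """Small set-associative table of regions known to be non-shared (sketch)."""

    def __init__(self, num_sets=16, ways=4, region_size=16 * 1024):
        self.num_sets, self.ways = num_sets, ways
        self.offset_bits = region_size.bit_length() - 1
        self.sets = [[] for _ in range(num_sets)]   # region tags, oldest first

    def _locate(self, addr):
        region = addr >> self.offset_bits
        return region, self.sets[region % self.num_sets]

    def is_non_shared(self, addr):
        """Checked before issuing a request: True means no broadcast is needed."""
        region, entries = self._locate(addr)
        return region in entries

    def record_non_shared(self, addr):
        """Called when a broadcast came back as a global region miss."""
        region, entries = self._locate(addr)
        if region in entries:
            entries.remove(region)
        elif len(entries) == self.ways:
            entries.pop(0)          # assumed policy: evict the oldest entry
        entries.append(region)

    def on_remote_request(self, addr):
        """Another node touched the region, so the record is invalidated."""
        region, entries = self._locate(addr)
        if region in entries:
            entries.remove(region)
```

Losing an entry is always safe: the next request to that region is simply broadcast again.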
![Page 43: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/43.jpg)
What Regions are Locally Cached?
• If we had as many counters as regions:
– Block Allocation: counter[region]++
– Block Eviction: counter[region]--
– Region cached only if counter[region] non-zero
• Not Practical:
– E.g., 16K Regions and 4G Memory → 256K counters
[Figure: the Region Tag of the address selects a counter]
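The 256K figure is just memory size divided by region size, as this short calculation shows (the constant names are hypothetical; the sizes are the ones on the slide):

```python
# Sizes from the slide: 4 GB of memory, 16 KB regions.
MEMORY_BYTES = 4 * 2**30
REGION_BYTES = 16 * 2**10

# One counter per region means one counter per REGION_BYTES of memory.
num_counters = MEMORY_BYTES // REGION_BYTES
print(num_counters, num_counters // 1024)   # 262144 counters, i.e. 256K
```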
![Page 44: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/44.jpg)
Moshovos ©
What Regions are Locally Cached?
• Use few Counters → Imprecise:
– Records a superset of locally cached Regions
– False positives: lost opportunity, correctness preserved
[Figure: Cached Region Hash: the Region Tag of the address is hashed to select a counter and its p bit]
“Counter”: + on block allocation, - on block eviction
Few entries, e.g., 256
P-bit: 1 if counter non-zero; used for lookups
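A minimal sketch of the Cached Region Hash. The hash used here (region number modulo the table size) is an assumption, as are the class and method names; the slides only say that a hash of the region tag selects a counter.

```python
class CachedRegionHash:
    """Few counters, indexed by a hash of the region tag (sketch)."""

    def __init__(self, num_entries=256, region_size=16 * 1024):
        self.offset_bits = region_size.bit_length() - 1
        self.counters = [0] * num_entries

    def _index(self, addr):
        # Assumed hash: low-order bits of the region number.
        return (addr >> self.offset_bits) % len(self.counters)

    def on_block_allocate(self, addr):
        self.counters[self._index(addr)] += 1

    def on_block_evict(self, addr):
        self.counters[self._index(addr)] -= 1

    def p_bit(self, addr):
        """1 if some locally cached block MAY belong to this region, else 0."""
        return 1 if self.counters[self._index(addr)] != 0 else 0
```

A remote node answers "region miss" only when `p_bit` is 0. Two regions that hash to the same counter produce a false positive, which costs an avoidable snoop but never hides a cached block.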
![Page 45: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/45.jpg)
Roadmap
• Conventional Coherence
• Program Behavior: Region Miss Frequency
• RegionScout
• Evaluation
• Summary
![Page 46: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/46.jpg)
Evaluation Overview
• Methodology
• Filter rates
– Practical Filters can capture many Region Misses
• Interconnect bandwidth reduction
![Page 47: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/47.jpg)
Methodology
• In-House simulator based on Simplescalar
– Execution driven
– All instructions simulated – MIPS like ISA
– System calls faked by passing them to host OS
– Synchronization using load-linked/store-conditional
– Simple in-order processors
– Memory requests complete instantaneously
– MESI snoop coherence
– 1 or 2 level memory hierarchy
– WATTCH power models
• SPLASH II benchmarks
– Scientific workloads
– Feasibility study
![Page 48: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/48.jpg)
Filter Rates
[Chart: Identified Global Region Misses (0% to 100%, with a "better" arrow) vs. CRH Size (256, 512, 1K, 2K) for p4.512K.R4K, p4.512K.R16K, p8.512K.R4K, p8.512K.R16K]
For small CRH better to use large regions
Practical RegionScout filters capture a lot of the potential
![Page 49: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/49.jpg)
Bandwidth Reduction
[Chart: Messages (0% to 100%, with a "better" arrow) vs. Region Size (2K, 4K, 8K, 16K) for p4.512K, p8.512K, p4.64K, p8.64K; part of the plot is labeled CMP]
Moderate Bandwidth Savings for SMP (15%-22%)
More so for CMP (>25%)
![Page 50: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/50.jpg)
Related Work
• RegionScout
– Technical Report, Dec. 2003
• Jetty
– Moshovos, Memik, Falsafi, Choudhary, HPCA 2001
• PST
– Ekman, Dahlgren, and Stenström, ISLPED 2002
• Coarse-Grain Coherence
– Cantin, Lipasti and Smith, ISCA 2005
![Page 51: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/51.jpg)
Summary
• Exploit program behavior/optimize a frequent case
– Many requests result in a global region miss
• RegionScout
– Practical filter mechanism
– Dynamically detect would-be region misses
– Avoid broadcasts
– Save tag lookup power and interconnect bandwidth
– Small structures
– Layered extension over existing mechanisms
– Invisible to programmer and the OS
![Page 52: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/52.jpg)
Coarse-Grain Coherence
J. Cantin, M. Lipasti and J. E. Smith
ISCA 2005
![Page 53: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/53.jpg)
Coarse-Grain Coherence
• Exploits the same phenomenon as RegionScout
• Protocol extended to keep track of region state as well
– Additional optimizations
• Uses an additional region tag array to do so
• Region replacements
– Must scan, find the region's blocks and evict them
![Page 54: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/54.jpg)
Flexible snooping: adaptive forwarding and filtering of snoops in embedded-ring multiprocessors
K. Strauss, X. Shen, J. Torrellas
International Symposium on Computer Architecture, June 2006.
![Page 55: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/55.jpg)
Karin Strauss, Flexible Snooping
Predictors and algorithms

| predictor / algorithm | action on positive prediction | action on negative prediction |
|---|---|---|
| Subset | snoop | forward then snoop |
| Superset Con | snoop then forward | forward |
| Superset Agg | forward then snoop | forward |
| Exact | snoop | forward |

[Figure legend: set of addresses the node can supply vs. set of addresses in predictor; label "Ring-specific"]
![Page 56: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/56.jpg)
Predictor implementation
• Subset
– associative table: subset of addresses that can be supplied by node
• Superset
– bloom filter: superset of addresses that can be supplied by node
– associative table (exclude cache): addresses that recently suffered false positives
• Exact
– associative table: all addresses that can be supplied by node
– downgrading: if address has to be evicted from predictor table, corresponding line in node has to be downgraded
![Page 57: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/57.jpg)
Design and Implementation of the Blue Gene/P Snoop Filter
Valentina Salapura, Matthias Blumrich, Alan Gara
Int’l Conf. on High-Performance Computer Architecture, 2008
![Page 58: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/58.jpg)
![Page 59: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/59.jpg)
Three Mechanisms
• Stream registers
– Contiguous data areas
– Adaptive to cache arbitrarily sized contiguous regions with a single register
– Stream registers track strided and sequential streams
• Snoop caches
– Cache of recently executed snoop requests
– Multiple requests to same line do not have to cause multiple snoop lookups
– Snoop caches track locality
• Range filter
– Identify regions of known non-shared data
– Configured by software
![Page 60: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/60.jpg)
Stream Registers
• Base = where the block starts
• Mask = which bits are common
– Example: base 0111 mask 1101 → 01X1 may be in the cache
• Over time Mask becomes all zeros
• How to reset?
• Cache Wrap
– Each set uses Round-Robin replacement
– Count replacements per set
– Cache wrap when all counters > ways
– Copy all streams to history and use combination
– Next time throw out history
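The base/mask pair can be sketched as below. This is an illustration of the matching rule only, under the assumption that a mask bit of 1 means "every address seen so far agrees with the base in this bit"; the class name, method names and the 4-bit width are hypothetical, and the cache-wrap reset is not modeled.

```python
class StreamRegister:
    """Base/mask pair describing a superset of the cached addresses (sketch)."""

    def __init__(self, width=4):
        self.full_mask = (1 << width) - 1
        self.base = None              # no address recorded yet
        self.mask = self.full_mask    # all bits still significant

    def add(self, addr):
        """Record a newly cached address."""
        if self.base is None:
            self.base = addr
        else:
            # Bits where the new address differs from the base become "X".
            self.mask &= ~(self.base ^ addr) & self.full_mask

    def may_contain(self, addr):
        """False means the address is definitely not cached: filter the snoop."""
        if self.base is None:
            return False
        return ((addr ^ self.base) & self.mask) == 0


sr = StreamRegister()
sr.add(0b0111)
sr.add(0b0101)
print(format(sr.mask, "04b"))        # 1101: the register now matches 01X1
print(sr.may_contain(0b0110))        # False: the last bit must be 1
```

Each `add` can only clear mask bits, which is why the mask drifts toward all zeros and the register eventually matches everything unless it is reset.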
![Page 61: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/61.jpg)
Stream Registers: An Example
• Direct mapped cache with two blocks

| Time | cache | Stream registers |
|---|---|---|
| 0 | empty, empty | empty, empty |
| 1 | 001, empty | 001 / 111, empty |
| 2 | 001, 011 | 001 / 1X1, empty |
| 3 | 101, 011 | 001 / 111, 101 / 111 |
| 4 | 101, 111 | 001 / 1X1, 101 / 1X1 |

• At this point the filter reports that the cache contains:
– 001 and 011
– 101 and 111
• The first two are not there
• Eventually the filter becomes saturated and cannot filter much
• How can we get rid of the 011 / 1X1?
![Page 62: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/62.jpg)
Avoiding Saturation: Exploiting Cache Wrapping

| Time | cache | Stream registers | Shadow |
|---|---|---|---|
| 0 | empty, empty | empty, empty | empty, empty |
| 1 | 001, empty | 001 / 111, empty | empty, empty |
| 2 | 001, 011 | 001 / 1X1, empty | 001 / 1X1, empty |
| 3 | 101, 011 | empty, 101 / 111 | 001 / 1X1, empty |
| 4 | 101, 111 | empty, 101 / 1X1 | 001 / 1X1, empty |

Annotations on the figure: Cache Wrap; Can discard Shadow
![Page 63: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/63.jpg)
Exploiting Access Semantics and Program Behavior to Reduce Snoop Power in Chip Multiprocessors
Chinnakrishnan S. Ballapuram
Ahmad Sharif
Hsien-Hsin S. Lee
ASPLOS 2008
![Page 64: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/64.jpg)
Software-Hardware Hybrid
• Software Directs hardware what to do
– Mechanisms very similar to Jetty and RegionScout
• Paper incorrectly states that:
– Jetty does not work for CMPs
  • It does not work well for small scale CMPs
– RegionScout works only for busses
  • Is interconnect agnostic
![Page 65: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/65.jpg)
RegionTracker: A Framework for Coarse-Grain Optimizations in the On-chip Memory Hierarchy
Jason Zebchuk, Elham Safi and Andreas Moshovos
Int’l Symposium on Microarchitecture, 2007
![Page 66: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/66.jpg)
EPFL, Jan. 2008
Aenao Group/Toronto
Future Caches: Just Larger?
[Figure: three CPUs, each with I$ and D$, an interconnect and Main Memory; label: 10s – 100s of MB]
1. “Big Picture” Management
2. Store Metadata
![Page 67: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/67.jpg)
Conventional Block Centric Cache
• “Small” Blocks
– Optimizes Bandwidth and Performance
• Large L2/L3 caches especially
[Figure: L2 Cache with a Fine-Grain View of Memory]
Big Picture Lost
![Page 68: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/68.jpg)
“Big Picture” View
• Region: 2^n sized, aligned area of memory
• Patterns and behavior exposed
– Spatial locality
• Exploit for performance/area/power
[Figure: L2 Cache with a Coarse-Grain View of Memory]
![Page 69: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/69.jpg)
Exploiting Coarse-Grain Patterns
• Many existing coarse-grain optimizations
• Add new structures to track coarse-grain information
[Figure: CPU and L2 Cache surrounded by optimizations: Stealth Prefetching; Run-time Adaptive Cache Hierarchy Management via Reference Analysis; Destination-Set Prediction; Spatial Memory Streaming; Coarse-Grain Coherence Tracking; RegionScout; Circuit-Switched Coherence; Virtual Tree Coherence; Power-Efficient DRAM Speculation]
Hard to justify for a commercial design
Coarse-Grain Framework
Embed coarse-grain information in tag array
Support many different optimizations with less area overhead
Adaptable optimization FRAMEWORK
![Page 70: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/70.jpg)
RegionTracker Solution
Manage blocks, but also track and manage regions
[Figure: four L1 caches send Block Requests to the L2 Cache (Tag Array, Data Array) and receive Data Blocks; RegionTracker handles Region Probes and Region Responses]
![Page 71: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/71.jpg)
RegionTracker Summary
• Replace conventional tag array:
– 4-core CMP with 8MB shared L2 cache
– Within 1% of original performance
– Up to 20% less tag area
– Average 33% less energy consumption
• Optimization Framework:
– Stealth Prefetching: same performance, 36% less area
– RegionScout: 2x more snoops avoided, no area overhead
![Page 72: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/72.jpg)
Road Map
• Introduction
• Goals
• Coarse-Grain Cache Designs
• RegionTracker: A Tag Array Replacement
• RegionTracker: An Optimization Framework
• Conclusion
![Page 73: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/73.jpg)
Goals
1. Conventional Tag Array Functionality
– Identify data block location and state
– Leave data array un-changed
2. Optimization Framework Functionality
– Is Region X cached?
– Which blocks of Region X are cached? Where?
– Evict or migrate Region X
– Easy to assign properties to each Region
![Page 74: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/74.jpg)
Coarse-Grain Cache Designs
Large Block Size
• Increased BW, Decreased hit-rates
[Figure: Region X in the Tag Array and Data Array]
![Page 75: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/75.jpg)
Sector Cache
• Decreased hit-rates
[Figure: Region X in the Tag Array and Data Array]
![Page 76: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/76.jpg)
Sector Pool Cache
• High Associativity (2 - 4 times)
[Figure: Region X in the Tag Array and Data Array]
![Page 77: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/77.jpg)
Decoupled Sector Cache
• Region information not exposed
• Region replacement requires scanning multiple entries
[Figure: Region X in the Tag Array, Status Table and Data Array]
![Page 78: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/78.jpg)
Design Requirements
• Small block size (64B)
• Miss-rate does not increase
• Lookup associativity does not increase
• No additional access latency
– (i.e., No scanning, no multiple block evictions)
• Does not increase latency, area, or energy
• Allows banking and interleaving
• Fit in conventional tag array “envelope”
![Page 79: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/79.jpg)
RegionTracker: A Tag Array Replacement
• 3 SRAM arrays, combined smaller than tag array
[Figure: four L1 caches and the Data Array; the tag array is replaced by the Region Vector Array, Block Status Table and Evicted Region Buffer]
![Page 80: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/80.jpg)
EPFL, Jan. 2008
80Aenao Group/Toronto
Common Case: Hit
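Address fields (bit boundaries 49, 21, 10, 6, 0): Region Tag | RVA Index | Region Offset | Block Offset
[Figure: hit path, steps 1 to 4. The RVA Index selects a Region Vector Array (RVA) entry holding a Region Tag and, for block0 through block15, a valid bit (V) and a way. The way feeds the Data Array + BST Index (bit boundaries 19, 6, 0, above the Block Offset). The Block Status Table (BST) supplies the block status, and the access proceeds to the Data Array.]
Ex: 8MB, 16-way set-associative cache, 64-byte blocks, 1KB region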
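The address split and the RVA hit check for this example configuration can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field boundaries are read off the figure labels (region tag from bit 21 up, RVA index in bits 20 to 10, region offset in bits 9 to 6, block offset in bits 5 to 0), and the names `split_address`, `RegionEntry`, and `lookup` are hypothetical, not taken from the RegionTracker design. The Block Status Table and the Evicted Region Buffer are not modelled.

```python
from dataclasses import dataclass, field

# Assumed geometry, taken from the slide's example: 64-byte blocks, 1KB regions.
BLOCK_BITS = 6          # block offset: bits 5..0
REGION_BITS = 10        # region offset ends below bit 10
REGION_TAG_LOW = 21     # region tag starts at bit 21 (assumption from figure labels)
BLOCKS_PER_REGION = 1 << (REGION_BITS - BLOCK_BITS)   # 1KB / 64B = 16 blocks


def split_address(addr):
    """Split an address into (region_tag, rva_index, region_offset, block_offset)."""
    block_offset = addr & ((1 << BLOCK_BITS) - 1)
    region_offset = (addr >> BLOCK_BITS) & (BLOCKS_PER_REGION - 1)
    rva_index = (addr >> REGION_BITS) & ((1 << (REGION_TAG_LOW - REGION_BITS)) - 1)
    region_tag = addr >> REGION_TAG_LOW
    return region_tag, rva_index, region_offset, block_offset


@dataclass
class RegionEntry:
    """One RVA entry: a region tag plus a valid bit and a way per block."""
    region_tag: int
    valid: list = field(default_factory=lambda: [False] * BLOCKS_PER_REGION)
    way: list = field(default_factory=lambda: [0] * BLOCKS_PER_REGION)


def lookup(rva, addr):
    """Return the data-array way on a block hit, or None on a block or region miss.

    `rva` is a dict mapping an RVA set index to a list of RegionEntry objects
    (a hypothetical software stand-in for the set-associative array).
    """
    tag, index, region_offset, _ = split_address(addr)
    for entry in rva.get(index, []):
        if entry.region_tag == tag:
            # Region hit: the block is present only if its valid bit is set.
            return entry.way[region_offset] if entry.valid[region_offset] else None
    return None  # region miss


if __name__ == "__main__":
    entry = RegionEntry(region_tag=5)
    entry.valid[7], entry.way[7] = True, 12
    rva = {3: [entry]}
    addr = (5 << 21) | (3 << 10) | (7 << 6) | 9
    print(split_address(addr), lookup(rva, addr))
```

Keeping a valid bit and a way per block in the region entry is what lets a single RVA lookup answer both "is the region present" and "where is this block", which matches the no-scanning requirement listed earlier.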
![Page 81: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/81.jpg)
Worst Case (Rare): Region Miss
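Address fields (bit boundaries 49, 21, 10, 6, 0): Region Tag | RVA Index | Region Offset | Block Offset
[Figure: region miss path through the Region Vector Array (RVA; Region Tag, plus V and way for block0 through block15), the Block Status Table (BST; status, Ptr), and the Evicted Region Buffer (ERB; Ptr). The lookup is marked “No Match!”. The Data Array + BST Index has bit boundaries 19, 6, 0, above the Block Offset.]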
![Page 82: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/82.jpg)
Methodology
• Flexus simulator from CMU SimFlex group
– Based on Simics full-system simulator
• 4-core CMP modeled after Piranha
– Private 32KB, 4-way set-associative L1 caches
– Shared 8MB, 16-way set-associative L2 cache
– 64-byte blocks
• Miss-rates: Functional simulation of 2 billion instructions per core
• Performance and Energy: Timing simulation using SMARTS sampling methodology
• Area and Power: Full custom implementation on 130nm commercial technology
• 9 commercial workloads:
– WEB: SpecWEB on Apache and Zeus
– OLTP: TPC-C on DB2 and Oracle
– DSS: 5 TPC-H queries on DB2
[Figure: four processors (P), each with a D$ and an I$, connected through an Interconnect to the L2]
![Page 83: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/83.jpg)
Miss-Rates vs. Area
• Sector Cache: 512KB sectors, SPC and RT: 1KB regions
• Trade-offs comparable to conventional cache
[Figure: Relative Miss-Rate (y-axis, 0.99 to 1.05) vs. Relative Tag Array Area (x-axis, 0.5 to 1.2) for Sector Pool Cache, RegionTracker, and Conventional Tags, with an arrow marking the “better” direction. Labelled points: 14-way, 15-way, 48-way, 52-way. Sector Cache is annotated at (0.25, 1.26).]
![Page 84: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/84.jpg)
Performance & Energy
• 12-way set-associative RegionTracker: 20% less area
• Error bars: 95% confidence interval
• Performance within 1%, with 33% tag energy reduction
[Figure: two bar charts over WEB, OLTP, and DSS. Performance: Normalized Execution Time (axis 0.97 to 1.03). Energy: Reduction in Tag Energy (axis 0% to 50%). Arrows mark the “better” direction in each.]
![Page 85: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/85.jpg)
Road Map
• Introduction
• Goals
• Coarse-Grain Cache Designs
• RegionTracker: A Tag Array Replacement
• RegionTracker: An Optimization Framework
• Conclusion
![Page 86: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/86.jpg)
RegionTracker: An Optimization Framework
[Figure: four L1 caches above the RegionTracker arrays (RVA, ERB, BST) and the Data Array]
Stealth Prefetching:
– Average 20% performance improvement
– Drop-in RegionTracker for 36% less area overhead
RegionScout: In-depth analysis
![Page 87: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/87.jpg)
Snoop Coherence: Common Case
[Figure: three CPUs above Main Memory. One CPU issues Read x, then Read x+1, Read x+2, … Read x+n; the snoops at the other two CPUs miss.]
Many snoops are to non-shared regions
![Page 88: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/88.jpg)
RegionScout
Eliminate broadcasts for non-shared regions
[Figure: three CPUs above Main Memory, with per-CPU structures for Non-Shared Regions and Locally Cached Regions. Labels on the Read x request: Miss, Region Miss, Global Region Miss.]
![Page 89: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/89.jpg)
RegionTracker Implementation
• Minimal overhead to support RegionScout optimization
• Still uses less area than conventional tag array
Non-Shared Regions: add 1 bit to each RVA entry
Locally Cached Regions: already provided by RVA
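As a rough illustration of how the two pieces of region information could be used to skip broadcasts, here is a minimal Python sketch. It is a simplified model under stated assumptions, not the RegionScout protocol as specified: the class `Node`, the functions `region_of` and `read_miss`, and the use of Python sets in place of the RVA bit are all hypothetical, evictions are not modelled, and 1KB regions are assumed as in the slides.

```python
REGION_BITS = 10  # assumed 1KB regions, as in the slides' examples


def region_of(addr):
    """Region number of an address."""
    return addr >> REGION_BITS


class Node:
    """One processor's view: locally cached blocks plus a non-shared marker per region."""

    def __init__(self):
        self.cached_blocks = set()  # block addresses held locally
        self.non_shared = set()     # regions believed to be cached by no other node

    def caches_region(self, region):
        # "Locally cached regions": does this node hold any block of the region?
        return any(region_of(block) == region for block in self.cached_blocks)

    def snoop(self, region):
        """Handle a remote request: the region can no longer be private to this node."""
        self.non_shared.discard(region)
        return self.caches_region(region)


def read_miss(requester, others, addr):
    """Service a read miss. Returns True if a snoop broadcast was needed."""
    region = region_of(addr)
    if region in requester.non_shared:
        # Region known to be non-shared: go straight to memory, no broadcast.
        requester.cached_blocks.add(addr)
        return False
    # Broadcast: every other node snoops and reports whether it caches the region.
    responses = [node.snoop(region) for node in others]
    if not any(responses):
        # Global region miss: remember that nobody else caches this region.
        requester.non_shared.add(region)
    requester.cached_blocks.add(addr)
    return True


if __name__ == "__main__":
    a, b = Node(), Node()
    # Sequential reads within one region: only the first needs a broadcast.
    print([read_miss(a, [b], 0x1000 + 64 * i) for i in range(4)])
```

The sketch mirrors the Read x, x+1, … x+n pattern from the earlier slide: after the first request finds that no other node caches the region, later misses to the same region avoid the broadcast until another node touches that region.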
![Page 90: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/90.jpg)
RegionTracker + RegionScout
[Figure: Reduction in Snoop Broadcasts (HMEAN, axis 0% to 50%, with an arrow marking the “better” direction) for RS 7KB, RS 12KB, RS 22KB, and RSRT.]
4 processors, 512KB L2 caches, 1KB regions
Avoid 41% of snoop broadcasts, with no area overhead compared to conventional tag array
![Page 91: Snoop Filtering and Coarse-Grain Memory Tracking](https://reader035.vdocument.in/reader035/viewer/2022062520/56815c2a550346895dca00b6/html5/thumbnails/91.jpg)
Result Summary
• Replace Conventional Tag Array:
– 20% less tag area
– 33% less tag energy
– Within 1% of original performance
• Coarse-Grain Optimization Framework:
– 36% reduction in area overhead for Stealth Prefetching
– Filter 41% of snoop broadcasts with no area overhead compared to conventional cache