Interactions Between Compression and Prefetching in Chip Multiprocessors
Alaa R. Alameldeen* (Intel Corporation)
David A. Wood (University of Wisconsin-Madison)
*Work largely done while at UW-Madison
Overview: Addressing the memory wall in CMPs

Compression
– Cache compression: increases effective cache size, but increases latency and uses additional cache tags
– Link compression: increases effective pin bandwidth

Hardware Prefetching
+ Decreases average cache latency
– Increases bandwidth demand → contention
– Increases working set size → cache pollution

Contribution #1: Adaptive prefetching uses additional tags (like cache compression) to avoid useless and harmful prefetches

Contribution #2: Prefetching and compression interact positively (e.g., link compression helps with prefetching's bandwidth demand)
Outline
Overview
Compressed CMP Design
– Compressed Cache Hierarchy
– Link Compression
Adaptive Prefetching
Evaluation
Conclusions
Compressed Cache Hierarchy
[Diagram: Processors 1..n, each with an uncompressed L1 cache, share a compressed L2 cache. A compression/decompression unit sits between L1 and L2, with a bypass path for uncompressed lines. A memory controller (which could compress/decompress) connects to/from other chips and memory.]
Cache Compression: Decoupled Variable-Segment Cache (ISCA 2004)
Example cache set
[Example cache set. Tag area and data area: Addr A (uncompressed, size 3), Addr B (compressed, size 2), Addr C (compressed, size 6), Addr D (compressed, size 4). Each tag stores compression status and compressed size; a tag can be present while its line isn't.]
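The fit check for such a set can be sketched as follows. This is a hypothetical model for illustration; the segment size, per-set capacity, and field names are assumptions, not the paper's exact parameters:

```python
# Hypothetical model of one decoupled variable-segment cache set.
SEGMENTS_PER_SET = 32  # assumed data-area capacity, in 8-byte segments

class TagEntry:
    def __init__(self, addr, compressed, size_segs, line_present=True):
        self.addr = addr                  # address tag
        self.compressed = compressed      # compression status bit
        self.size_segs = size_segs        # compressed size, in segments
        self.line_present = line_present  # a tag can outlive its data

def data_area_used(tags):
    """Segments consumed by lines actually resident in the data area."""
    return sum(t.size_segs for t in tags if t.line_present)

def fits(tags, new_size_segs):
    """Can a new line be stored without evicting a resident line?"""
    return data_area_used(tags) + new_size_segs <= SEGMENTS_PER_SET

# The example set above: D's tag is present but its line isn't.
example = [TagEntry("A", False, 3), TagEntry("B", True, 2),
           TagEntry("C", True, 6), TagEntry("D", True, 4, line_present=False)]
```

Because compressed sizes vary, the replacement decision depends on the sum of resident segment counts rather than a fixed number of ways.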
Link Compression
On-chip Memory Controller transfers compressed messages
Data Messages
– 1–8 sub-messages (flits), 8 bytes each
Off-chip memory controller combines flits and stores to memory
[Diagram: CMP containing processors/L1 caches, the L2 cache, and an on-chip memory controller, connected over the compressed link to an off-chip memory controller and to/from memory.]
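A minimal sketch of the flit packing described above; the function names and the zero-padding of the last flit are assumptions:

```python
# Minimal sketch of flit packing for link compression.
FLIT_BYTES = 8   # each sub-message (flit) carries 8 bytes
MAX_FLITS = 8    # 1-8 flits per data message (64-byte uncompressed line)

def to_flits(payload: bytes) -> list:
    """Split a (possibly compressed) line into 8-byte flits,
    zero-padding the last flit (an assumed convention)."""
    assert 0 < len(payload) <= FLIT_BYTES * MAX_FLITS
    return [payload[i:i + FLIT_BYTES].ljust(FLIT_BYTES, b"\x00")
            for i in range(0, len(payload), FLIT_BYTES)]

def combine(flits: list) -> bytes:
    """Off-chip controller side: recombine flits into the stored line."""
    return b"".join(flits)

# A 20-byte compressed line needs only 3 flits instead of 8:
compressed_line = bytes(range(20))
flits = to_flits(compressed_line)
```

The bandwidth saving comes directly from sending fewer flits per message when the line compresses well.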
Outline
Overview
Compressed CMP Design
Adaptive Prefetching
Evaluation
Conclusions
Hardware Stride-Based Prefetching
Evaluation based on IBM Power 4 implementation
Stride-based prefetching
+ L1 prefetching hides L2 latency
+ L2 prefetching hides memory latency
– Increases working set size → cache pollution
– Increases on-chip & off-chip bandwidth demand → contention

Prefetching taxonomy [Srinivasan et al. 2004]:
– Useless prefetches increase traffic, not misses
– Harmful prefetches replace useful cache lines
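One stride stream of such a prefetcher might be sketched as follows; the detector structure, two-match confirmation rule, and prefetch degree are assumptions, and the POWER4-style hardware differs in detail:

```python
# Hypothetical sketch of a single stride stream.
class StrideStream:
    def __init__(self):
        self.last_addr = None   # last miss address seen by this stream
        self.stride = None      # candidate stride
        self.confirmed = False  # stride observed twice in a row?

    def on_miss(self, addr, issue_prefetch, degree=2):
        """Train on a miss; once the stride repeats, prefetch ahead."""
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride != 0 and stride == self.stride:
                self.confirmed = True   # unit, negative, or non-unit stride
            else:
                self.stride = stride
                self.confirmed = False
        self.last_addr = addr
        if self.confirmed:
            for i in range(1, degree + 1):
                issue_prefetch(addr + i * self.stride)

# Misses marching through memory at a 0x40 (64-byte line) stride:
issued = []
s = StrideStream()
for a in (0x1000, 0x1040, 0x1080):
    s.on_miss(a, issued.append)
```

The same structure covers negative strides, since `addr - last_addr` can be negative.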
Prefetching in CMPs
[Figure: Zeus: Speedup (%) for 1p–16p configurations with No PF, Stride-Based PF, and Adaptive PF]
Prefetching in CMPs

[Figure: Zeus speedup chart, as on the previous slide]

Prefetching hurts performance. Can we avoid useless & harmful prefetches?
Prefetching in CMPs

[Figure: Zeus speedup chart, as on the previous slide]

Adaptive prefetching avoids hurting performance.
Adaptive Prefetching
[Example cache set: Addr A (uncompressed, size 3), Addr B (compressed, 2), Addr C (compressed, 6), Addr D (compressed, 4); tags store compression status and compressed size, and a tag can be present while its line isn't]

PF bit: set on prefetched-line allocation, reset on access.

How do we identify useful, useless, and harmful prefetches?
Useful Prefetch

[Example set: Addr A uncompressed 3, Addr B compressed 2 (PF bit set), Addr C compressed 6, Addr D compressed 4]

Read/Write Address B: the PF bit is set and the prefetched line is used before replacement, so the prefetch was useful. The access resets the PF bit.
Useless Prefetch

[Example set: Addr A uncompressed 3, Addr B compressed 2, Addr C compressed 6 (PF bit set), Addr D compressed 4; prefetched Addr C is replaced by Addr E]

Replace Address C with Address E: the PF bit is still set, so the prefetch went unused before replacement. It did not change the miss count, only increased traffic.
Harmful Prefetch

[Example set: Addr B compressed 2, Addr A uncompressed 3, Addr E compressed 6 (PF bit set), Addr D compressed 4 (tag present, line isn't)]

Read/Write Address D: the tag is found, so the access could have been a hit. Heuristic: if another line's PF bit is set, classify this as an avoidable miss; the prefetch(es) incurred an extra miss.
Adaptive Prefetching: Summary

Uses extra compression tags & a PF bit per tag
– PF bit set when a prefetched line is allocated, reset on access
– Extra tags still maintain the address tag and status

Use a counter for #startup prefetches per cache: 0 to MAX
– Increment for useful prefetches
– Decrement for useless & harmful prefetches

Classification of prefetches:
– Useful prefetch: PF bit set on a cache hit
– Useless prefetch: replacing a line whose PF bit is set
– Harmful prefetch: cache miss matches an invalid line's tag, and a valid line's tag has its PF bit set
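The counter update policy can be sketched as follows; the MAX value, initial value, and method names are assumptions:

```python
# Sketch of the adaptive prefetch counter policy.
MAX_PF = 25  # assumed counter ceiling for startup prefetches per cache

class AdaptiveController:
    def __init__(self):
        self.startup_prefetches = MAX_PF  # start fully aggressive

    def on_useful(self):
        """A prefetched line (PF bit set) was hit: prefetch more."""
        self.startup_prefetches = min(MAX_PF, self.startup_prefetches + 1)

    def on_useless(self):
        """Replaced a line whose PF bit was still set: prefetch less."""
        self.startup_prefetches = max(0, self.startup_prefetches - 1)

    def on_harmful(self):
        """Miss matched an invalid line's tag while a valid line's
        PF bit was set: penalize like a useless prefetch."""
        self.on_useless()
```

Saturating at 0 effectively disables prefetching for workloads where it keeps misclassifying, which is how large losses are avoided.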
Outline
Overview
Compressed CMP Design
Adaptive Prefetching
Evaluation
– Workloads and System Configuration
– Performance
– Compression and Prefetching Interactions
Conclusions
Simulation Setup
Workloads:
– Commercial workloads [Computer '03, CAECW '02]:
  • OLTP: IBM DB2 running a TPC-C-like workload
  • SPECjbb
  • Static web serving: Apache and Zeus
– SPEComp benchmarks: apsi, art, fma3d, mgrid

Simulator:
– Simics full-system simulator, augmented with the Multifacet General Execution-driven Multiprocessor Simulator (GEMS) [Martin et al., 2005, http://www.cs.wisc.edu/gems/]
System Configuration: 8-core CMP
Cores: single-threaded, out-of-order superscalar with an 11-stage pipeline, 64 entry IW, 128 entry ROB, 5 GHz clock frequency
L1 Caches: 64K instruction, 64K data, 4-way SA, 320 GB/sec total on-chip bandwidth (to/from L1), 3-cycle latency
Shared L2 Cache: 4 MB, 8-way SA (uncompressed), 15-cycle uncompressed latency
Memory: 400 cycles access latency, 20 GB/sec memory bandwidth
Prefetching
– 8 unit/negative/non-unit stride streams for L1 and L2 for each processor
– Issue 6 L1 prefetches (max) on L1 miss
– Issue 25 L2 prefetches (max) on L2 miss
Performance

[Figure: Speedup (%) of Always PF, Compr, and Both relative to No PF or Compr across benchmarks; revealed incrementally over four slides]
Significant Improvements when Combining Prefetching and Compression
Performance: Adaptive Prefetching
[Figure: Speedup (%) of No PF, Always PF, and Adaptive PF across benchmarks]
Adaptive PF avoids significant performance losses
Compression and Prefetching Interactions
Positive Interactions:
– L1 prefetching hides part of the decompression overhead
– Link compression reduces the added bandwidth demand from prefetching
– Cache compression increases effective L2 size, accommodating the larger working set from L2 prefetching
Negative Interactions:
– L2 prefetching and L2 compression can eliminate the same misses
Is the overall interaction positive or negative?
Interactions Terminology
Assume a base system S with two architectural enhancements A and B; all systems run program P.

Speedup(A) = Runtime(P, S) / Runtime(P, A)

Speedup(B) = Runtime(P, S) / Runtime(P, B)

Speedup(A, B) = Speedup(A) × Speedup(B) × (1 + Interaction(A, B))
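In code, the interaction term follows directly from the three speedups; the runtimes in the example are made up for illustration:

```python
# The interaction metric computed from measured runtimes.
def speedup(runtime_base, runtime_enhanced):
    return runtime_base / runtime_enhanced

def interaction(rt_S, rt_A, rt_B, rt_AB):
    """Solve Speedup(A,B) = Speedup(A) * Speedup(B) * (1 + Interaction)
    for Interaction(A, B)."""
    sp_A = speedup(rt_S, rt_A)
    sp_B = speedup(rt_S, rt_B)
    sp_AB = speedup(rt_S, rt_AB)
    return sp_AB / (sp_A * sp_B) - 1.0

# Example (hypothetical runtimes): each enhancement alone gives 1.25x,
# together 1.6x, so Interaction = 1.6 / (1.25 * 1.25) - 1 = +2.4%.
result = interaction(100.0, 80.0, 80.0, 62.5)
```

A positive value means the enhancements combined beat the product of their individual speedups; a negative value means they partly duplicate each other's benefit.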
Interaction Between Prefetching and Compression
[Figure: Interaction (%), roughly −5% to 25%, for apache, zeus, oltp, jbb, art, apsi, fma3d, mgrid]

Interaction is positive for all benchmarks except apsi.
Conclusions
Compression can help CMPs
– Cache compression improves CMP performance
– Link compression reduces pin bandwidth demand

Hardware prefetching can hurt performance
– Increases cache pollution
– Increases bandwidth demand

Contribution #1: Adaptive prefetching reduces useless & harmful prefetches
– Gets most of the benefit from prefetching
– Avoids large performance losses

Contribution #2: Compression and prefetching interact positively
Google “Wisconsin GEMS”
Works with Simics (free to academics) running commercial OS/apps
SPARC out-of-order, x86 in-order, CMPs, SMPs, & LogTM
GPL release of GEMS used in four HPCA 2007 papers
[Diagram: GEMS structure: the Simics full-system simulator plus the Opal detailed processor model, driven by microbenchmarks, a random tester, deterministic contended locks, and trace files]
Backup Slides
Frequent Pattern Compression (FPC)
A significance-based compression algorithm
– Compresses each 32-bit word separately
– Suitable for short (32–256 byte) cache lines
– Compressible patterns: zeros, sign-extended 4-, 8-, and 16-bit words, zero-padded half-word, two sign-extended half-words, repeated byte
– A 64-byte line is decompressed in a five-stage pipeline
More details in Alaa’s thesis
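The per-word pattern test can be sketched as follows. Pattern names follow the slide; the check order and the exact definition of "sign-extended" here are assumptions, and the real prefix encoding is in the thesis:

```python
# Hypothetical sketch of FPC's per-32-bit-word pattern matching.

def _sign_extended(v: int, total_bits: int, low_bits: int) -> bool:
    """True if v (a total_bits-wide value) equals the sign-extension
    of its low low_bits bits."""
    sign = (v >> (low_bits - 1)) & 1
    top = v >> low_bits
    return top == (0 if sign == 0 else (1 << (total_bits - low_bits)) - 1)

def classify_word(w: int) -> str:
    """Map a 32-bit word to its (assumed) FPC pattern, or 'uncompressed'."""
    if w == 0:
        return "zero"
    if _sign_extended(w, 32, 4):
        return "sign-extended 4-bit"
    if _sign_extended(w, 32, 8):
        return "sign-extended 8-bit"
    if _sign_extended(w, 32, 16):
        return "sign-extended 16-bit"
    if w & 0xFFFF == 0:
        return "zero-padded half-word"
    hi, lo = w >> 16, w & 0xFFFF
    if _sign_extended(hi, 16, 8) and _sign_extended(lo, 16, 8):
        return "two sign-extended half-words"
    b = w.to_bytes(4, "big")
    if b.count(b[:1]) == 4:
        return "repeated byte"
    return "uncompressed"
```

Each compressible pattern stores only the significant bits plus a short prefix, which is why small integers and zero-heavy data compress so well.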
Positive and Negative Interactions
[Figure: Speedup bars for Base, Base+A, Base+B, and Base+A+B under no, positive, and negative interaction]
Interaction Between Adaptive Prefetching and Compression
[Figure: Interaction (%), roughly −5% to 20%, for apache, zeus, oltp, jbb, art, apsi, fma3d, mgrid]

Less bandwidth impact; interaction slightly negative (except art, mgrid).
Positive Interaction: Pin Bandwidth
Compression saves bandwidth consumed by prefetching.
Negative Interaction: Avoided Misses
Only a small fraction of misses is avoided by both.
Sensitivity to Processor Count
Compression Speedup
Compression Bandwidth Demand
Commercial Workloads & Adaptive Prefetching