Advanced Microarchitecture
Lecture 12: Caches and Memory
SRAM {Over|Re}view
• Chained inverters maintain a stable state
• Access gates provide access to the cell
• Writing to a cell involves over-powering the two small storage inverters
[Figure: “6T SRAM” cell: two chained inverters (2T per inverter) plus 2 access gates, connected to complementary bitlines b and b̄]
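64×1-bit SRAM Array Organization
[Figure: 8×8 array of SRAM cells. A 1-of-8 decoder drives one “wordline” to select a row, the selected cells drive the “bitlines”, and a second 1-of-8 decoder controls the “column mux” that picks one column]
Why are we reading both b and b̄?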
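The row/column split in the array above can be sketched in code. This is a minimal model that assumes the upper three address bits feed the row decoder and the lower three feed the column mux; the slide does not say which bits go where, and the names `sram64_t`, `sram64_read`, and `sram64_write` are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a 64x1-bit SRAM laid out as 8 rows x 8 columns. */
typedef struct {
    uint8_t rows[8];   /* one byte per wordline; bit c is the cell in column c */
} sram64_t;

/* Assumption: addr[5:3] drives the row decoder, addr[2:0] the column mux. */
static unsigned sram64_row(unsigned addr) { return (addr >> 3) & 7u; }
static unsigned sram64_col(unsigned addr) { return addr & 7u; }

static void sram64_write(sram64_t *m, unsigned addr, unsigned bit) {
    assert(addr < 64);
    unsigned r = sram64_row(addr), c = sram64_col(addr);
    m->rows[r] = (uint8_t)((m->rows[r] & ~(1u << c)) | ((bit & 1u) << c));
}

static unsigned sram64_read(const sram64_t *m, unsigned addr) {
    assert(addr < 64);
    uint8_t row_data = m->rows[sram64_row(addr)];   /* wordline selects 8 cells */
    return (row_data >> sram64_col(addr)) & 1u;     /* column mux picks one bit */
}
```

Note that a read activates a whole row of eight cells even though only one bit is returned; the column mux discards the rest.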
SRAM Density vs. Speed
• 6T cell must be as small as possible to have dense storage
  – Bigger caches
  – Smaller transistors → slower transistors
*Long* metal line with a lot of parasitic loading
So dinky inverters cannot drive their outputs very quickly…
Sense Amplifiers
• Type of differential amplifier
  – Two inputs, amplifies the difference
[Figure: Diff Amp with inputs X and Y; output is $a \times (X - Y) + V_{bias}$]
[Figure: read sequence on bitlines b and b̄]
Bitlines precharged to Vdd
Wordline enabled
Small cell discharges bitline very slowly
Sense amp “sees” the difference quickly and outputs b’s value
Sometimes precharge bitlines to Vdd/2, which makes a bigger “delta” for faster sensing
Multi-Porting
[Figure: 2-ported SRAM cell with bitline pairs b1/b̄1 and b2/b̄2 and wordlines Wordline1 and Wordline2]
Wordlines = 1 × ports (2 in the cell shown)
Bitlines = 2 × ports (4 in the cell shown)
Area = O(ports²)
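A rough sketch of where the quadratic growth comes from, assuming one wordline and one differential bitline pair per port (as in the 2-ported cell drawn on the slide) and a wire-limited cell whose height tracks the wordline count and whose width tracks the bitline count. The "tracks" unit and the function names are hypothetical; real cells also pay for the extra access transistors.

```c
#include <assert.h>

/* One wordline and one differential bitline pair per port. */
static unsigned wordlines(unsigned ports) { return ports; }
static unsigned bitlines(unsigned ports)  { return 2u * ports; }

/*
 * Hypothetical wire-limited cell: height grows with the number of
 * wordlines, width with the number of bitlines. Units are arbitrary
 * "wire tracks", so only the growth rate is meaningful.
 */
static unsigned cell_area_tracks(unsigned ports) {
    assert(ports >= 1);
    return wordlines(ports) * bitlines(ports);   /* = 2 * ports^2 */
}
```

Under this model, going from 1 port to 4 ports multiplies the cell area by 16, which is why heavily ported structures are kept small.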
Port Requirements
• ARF, PRF, RAT all need many read and write ports to support superscalar execution
  – Luckily, these have a limited number of entries/bytes
• Caches also need multiple ports
  – Not as many ports
  – But the overall size is much larger
Delay Of Regular Caches
• I$
  – low port requirement (one fetch group/$-line per cycle)
  – latency only exposed on branch mispredict
• D$
  – higher port requirement (multiple LD/ST per cycle)
  – latency often on critical path of execution
• L2
  – lower port requirement (most accesses hit in L1)
  – latency less important (only observed on L1 miss)
  – optimizing for hit rate usually more important than latency
    • difference between L2 latency and DRAM latency is large
Banking
[Figure, left: big 4-ported L1 data cache built as one SRAM array with four decoders, four sets of sense amps, and column muxing]
Slow due to quadratic area growth
[Figure, right: four separate SRAM arrays, each with its own decoder and sense amps]
4 banks, 1 port each
Each bank is much faster
Bank Conflicts
• Banking provides high bandwidth
• But only if all accesses are to different banks
• Banks typically address interleaved
  – For N banks
  – Addr → bank[Addr % N]
    • Addr on cache line granularity
  – For 4 banks, 2 accesses, chance of conflict is 25%
  – Need to match # banks to access patterns/BW
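The interleaving rule and the 25% figure above can be checked with a short sketch. It assumes a 64-byte cache line (the slide gives no line size) and treats accesses as independent and uniformly distributed across banks; `bank_of`, `conflicts`, and `conflict_probability` are illustrative names.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_BYTES 64u   /* assumed cache line size; not given on the slide */

/* Bank selection on cache-line granularity: line number modulo bank count. */
static unsigned bank_of(uint64_t addr, unsigned n_banks) {
    assert(n_banks > 0);
    return (unsigned)((addr / LINE_BYTES) % n_banks);
}

/* Two accesses conflict when they target the same bank in the same cycle. */
static int conflicts(uint64_t a, uint64_t b, unsigned n_banks) {
    return bank_of(a, n_banks) == bank_of(b, n_banks);
}

/*
 * Probability that at least two of k independent, uniformly distributed
 * accesses land in the same one of n banks: 1 - n!/((n-k)! * n^k).
 */
static double conflict_probability(unsigned n_banks, unsigned k) {
    assert(n_banks > 0);
    double p_no_conflict = 1.0;
    for (unsigned i = 0; i < k; i++) {
        if (i >= n_banks) return 1.0;   /* more accesses than banks */
        p_no_conflict *= (double)(n_banks - i) / (double)n_banks;
    }
    return 1.0 - p_no_conflict;
}
```

With 4 banks and 2 accesses the second access has a 1-in-4 chance of hitting the first one's bank, which gives the 25% on the slide. Real access streams are not uniform (strided accesses can hit one bank repeatedly), so the number is only a baseline.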
Associativity
• You should know this already
[Figure: looking up “foo” to find foo’s value in three organizations: direct mapped (RAM), fully associative (CAM), and set associative (CAM/RAM hybrid?)]
Set-Associative Caches
• Set-associativity good for reducing conflict misses
• Cost: slower cache access
  – often dominated by the tag array comparisons
  – Basically mini-CAM logic
• Must trade off:
  – Smaller cache size
  – Longer latency
  – Lower associativity
• Every option hurts performance
[Figure: four parallel tag comparators (= = = =), each a 40-50 bit comparison!]
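A minimal sketch of the tag lookup described above. The geometry (64 sets, 4 ways, 64-byte lines) and all names are assumptions for illustration; the loop models only the result of the comparisons, which hardware performs in parallel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_SETS   64u   /* assumed geometry, for illustration only */
#define NUM_WAYS   4u
#define LINE_BYTES 64u

typedef struct {
    bool     valid;
    uint64_t tag;
} tag_entry_t;

typedef struct {
    tag_entry_t tags[NUM_SETS][NUM_WAYS];
} tag_array_t;

static unsigned set_index(uint64_t addr) {
    return (unsigned)((addr / LINE_BYTES) % NUM_SETS);
}

static uint64_t tag_of(uint64_t addr) {
    return addr / LINE_BYTES / NUM_SETS;
}

/*
 * Returns the hitting way, or -1 on a miss. Each iteration stands for one
 * of the "=" comparators in the figure.
 */
static int lookup(const tag_array_t *c, uint64_t addr) {
    assert(c != NULL);
    const tag_entry_t *set = c->tags[set_index(addr)];
    for (unsigned w = 0; w < NUM_WAYS; w++)
        if (set[w].valid && set[w].tag == tag_of(addr))
            return (int)w;
    return -1;
}

/* Install the line containing addr into the given way of its set. */
static void fill(tag_array_t *c, uint64_t addr, unsigned way) {
    assert(way < NUM_WAYS);
    c->tags[set_index(addr)][way] =
        (tag_entry_t){ .valid = true, .tag = tag_of(addr) };
}
```

The data array cannot be muxed until `lookup` has produced a way, which is why the comparator width sits on the critical path.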
Way-Prediction
• If figuring out the way takes too long, then just guess!
[Figure: the load PC indexes a way predictor (“Way Pred”), which selects one way of the data array to read out as the payload while the four tag comparators (= = = =) run]
Tag check still occurs to validate way-pred
• May be hard to predict way if the same load accesses different addresses
Way-Prediction (2)
• Organize data array s.t. left most way is the MRU
[Figure: one set with ways ordered from MRU (left) to LRU (right), shown over a sequence of accesses]
Way-predict the MRU way
Way-prediction keeps hitting
Way-Miss (Cache Hit)
On way-miss, move block to MRU position
Way-prediction continues to hit
Complication: data array needs datapath for swapping blocks (maybe 100’s of bits)
Normally just update a few LRU bits in the tag array (< 10 bits?)
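The MRU-ordered scheme above can be sketched for a single set. This is an illustrative model, not the slide's hardware: the prediction is always "way 0", blocks are physically reordered on a way-miss, and `mru_set_t`, `access_set`, and the outcome names are made up here.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_WAYS 4u

typedef enum { WAY_HIT, WAY_MISS_CACHE_HIT, CACHE_MISS } outcome_t;

/* One set with ways kept physically ordered: tag[0] is MRU, last is LRU. */
typedef struct {
    uint64_t tag[NUM_WAYS];
    unsigned count;          /* number of valid ways, filled from the left */
} mru_set_t;

/* Place tag in way 0, shifting ways 0..pos-1 one position toward the LRU. */
static void move_to_mru(mru_set_t *s, unsigned pos, uint64_t tag) {
    for (unsigned i = pos; i > 0; i--)
        s->tag[i] = s->tag[i - 1];
    s->tag[0] = tag;
}

static outcome_t access_set(mru_set_t *s, uint64_t tag) {
    assert(s->count <= NUM_WAYS);
    if (s->count > 0 && s->tag[0] == tag)
        return WAY_HIT;                    /* predicted way (MRU) was right */
    for (unsigned i = 1; i < s->count; i++) {
        if (s->tag[i] == tag) {
            move_to_mru(s, i, tag);        /* block swap: the costly datapath */
            return WAY_MISS_CACHE_HIT;
        }
    }
    if (s->count < NUM_WAYS)
        s->count++;                        /* use an empty way, else evict LRU */
    move_to_mru(s, s->count - 1, tag);
    return CACHE_MISS;
}
```

In this model every `move_to_mru` call stands for moving whole cache lines in the data array, which is the complication the slide points out compared with updating a few LRU bits.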
Partial Tagging
• Like BTBs, just use part of the tag
[Figure: eight narrow tag comparators (= = = = = = = =)]
Tag array lookup now much faster!
Partial tags lead to false hits: Tag 0x45120001 looks like a hit for Address 0x3B120001
Similar to way-prediction, full tag comparison still needed to verify “real” hit (not on critical path)
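The false-hit example above can be reproduced with a short sketch. It assumes the partial tag is the low-order bits of the full tag and uses a 16-bit partial width when called; the slide specifies neither, and the function names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Keep only the low `bits` bits of a 32-bit tag (assumed partial-tag choice). */
static uint32_t partial(uint32_t tag, unsigned bits) {
    assert(bits > 0 && bits < 32);
    return tag & ((1u << bits) - 1u);
}

/* Fast check on the critical path: compares only the partial tags. */
static bool partial_hit(uint32_t stored, uint32_t lookup, unsigned bits) {
    return partial(stored, bits) == partial(lookup, bits);
}

/* Slow check off the critical path: full-width comparison. */
static bool full_hit(uint32_t stored, uint32_t lookup) {
    return stored == lookup;
}

/* A false hit is a partial match that the full comparison later rejects. */
static bool is_false_hit(uint32_t stored, uint32_t lookup, unsigned bits) {
    return partial_hit(stored, lookup, bits) && !full_hit(stored, lookup);
}
```

With a 16-bit partial tag, `0x45120001` and `0x3B120001` both reduce to `0x0001`, so the fast check reports a hit that the full comparison must later squash.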
… in the LSQ
• Partial tagging can be used in the LSQ as well
Do address check on partial addresses only
On a partial hit, forward the data
Slower complete tag check verifies the match/no match
Replay or flush as needed
If a store finds a later partially-matched load, don’t do pipeline flush right away
Penalty is too severe, wait for slow check before flushing the pipe
Interaction With Scheduling
• Bank conflicts, way-mispredictions, partial-tag false hits
  – All change the latency of the load instruction
  – Increases frequency of replays
    • more “replay conditions” exist/encountered
  – Need careful tradeoff between
    • performance (reducing effective cache latency)
    • performance (frequency of replaying instructions)
    • power (frequency of replaying instructions)
Alternatives to Adding Associativity
• More set-associativity needed when number of items mapping to same cache set > number of ways
• Not all sets suffer from high conflict rates
• Idea: provide a little extra associativity, but not for each and every set
Victim Cache
[Figure: a 4-way set-associative cache whose sets hold A B C D, X Y Z, P Q R, and J K L M, accessed with blocks A through E and J through N, shown without and with a small victim cache that holds recently evicted blocks]
Every access is a miss! ABCDE and JKLMN do not “fit” in a 4-way set associative cache
Victim cache provides a “fifth way” so long as only four sets overflow into it at the same time
Can even provide 6th or 7th … ways
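Skewed Associativity
[Figure: blocks A, B, C, D and W, X, Y, Z placed in a regular set-associative cache, where they pile into the same sets, and in a skewed-associative cache, where each way is indexed by a different function and the blocks spread out]
Regular Set-Associative Cache: lots of misses
Skewed-Associative Cache: fewer misses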
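A small sketch of the idea behind the figure: give each way its own index function so that lines colliding in one way are scattered in another. The 2-way, 16-set geometry and the XOR hash are stand-ins chosen for brevity, not the functions of any particular skewed-associative design.

```c
#include <assert.h>
#include <stdint.h>

#define INDEX_BITS 4u                    /* 16 sets per way, assumed */
#define NUM_SETS   (1u << INDEX_BITS)

/*
 * `line` is a cache-line number (address / line size).
 * Way 0 uses the conventional low-order index bits; way 1 XORs in the
 * next-higher bits so that lines sharing a way-0 set get different
 * way-1 sets.
 */
static unsigned skew_index(uint64_t line, unsigned way) {
    assert(way < 2);
    unsigned low  = (unsigned)(line & (NUM_SETS - 1u));
    unsigned high = (unsigned)((line >> INDEX_BITS) & (NUM_SETS - 1u));
    return way == 0 ? low : (low ^ high);
}
```

Lines 0x05, 0x15, and 0x25 all index set 5 in way 0, so a conventional 2-way cache could hold only two of them. In way 1 of this sketch they index sets 5, 4, and 7, so all three can be resident at once.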
Required Associativity Varies
• Program stack needs very little associativity
  – spatial locality
    • stack frame is laid out sequentially
    • function usually only refers to own stack frame
[Figure: call stack with frames f(), g(), h(), j(), k(); addresses laid out in linear organization, shown next to their layout in a 4-way cache (MRU to LRU)]
Associativity not being used effectively
Stack Cache
[Figure: “nice” stack accesses (frames f(), g(), h(), j(), k()) and disorganized heap accesses sharing one “regular” cache, next to a design that adds a separate stack cache]
Lots of conflicts!
Stack Cache (2)
• Stack cache portion can be a lot simpler due to direct-mapped structure
  – relatively easily prefetched for by monitoring call/retn’s
• “Regular” cache portion can have lower associativity
  – doesn’t have conflicts due to stack/heap interaction
Stack Cache (3)
• Which cache does a load access?
  – Many ISAs have a “default” stack-pointer register
[Figure: loads routed to a cache by their base register]
LDQ 0[$sp], LDQ 12[$sp], LDQ 24[$sp] → Stack Cache
LDQ 8[$t3], LDQ 0[$t1] → Regular Cache
MOV $t3 = $sp followed by LDQ 8[$t3]: ✗ wrong cache → replay
Need stack base and offset information, and then need to check each cache access against these bounds
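The routing problem above can be sketched as an early guess followed by a late check. Everything concrete here is an assumption: the register number for `$sp`, the representation of the stack bounds as a `[lo, hi)` address range, and the function names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum { STACK_CACHE, REGULAR_CACHE } cache_sel_t;

#define REG_SP 30   /* hypothetical register number for $sp */

/* Early guess, available at decode: $sp-based loads go to the stack cache. */
static cache_sel_t predict_cache(int base_reg) {
    return base_reg == REG_SP ? STACK_CACHE : REGULAR_CACHE;
}

/* Late check, once the address is known: compare against the stack bounds. */
static cache_sel_t actual_cache(uint64_t addr, uint64_t lo, uint64_t hi) {
    assert(lo <= hi);
    return (addr >= lo && addr < hi) ? STACK_CACHE : REGULAR_CACHE;
}

/* A load steered to the wrong cache has to be replayed. */
static bool needs_replay(int base_reg, uint64_t addr, uint64_t lo, uint64_t hi) {
    return predict_cache(base_reg) != actual_cache(addr, lo, hi);
}
```

The `MOV $t3 = $sp` case on the slide is the one this catches: the load uses a non-`$sp` base register, so it is guessed into the regular cache, but its address falls inside the stack bounds.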
Multi-Lateral Caches
• Normal cache is “uni-lateral” in that everything goes into the same place
• Stack cache is an example of “multi-lateral” caches
  – multiple cache structures with disjoint contents
  – I$ vs. D$ could be considered multi-lateral
Access Patterns
• Stack cache showed how different loads exhibit different access patterns
[Figure: three access patterns]
Stack (multiple push/pop’s of frames)
Heap (heavily data-dependent access patterns)
Streaming (linear accesses with low/no reuse)
Low-Reuse Accesses
• Streaming
  – once you’re done decoding MPEG frame, no need to revisit
• Other
Fields map to different cache lines:
    struct tree_t {
        int valid;
        int other_fields[24];
        int num_children;
        struct tree_t *children;
    };
parent->valid accessed once, and then not used again:
    while (some condition) {
        struct tree_t *parent = getNextRoot(…);
        if (parent->valid) {
            doTreeTraversalStuff(parent);
            doMoreStuffToTree(parent);
            pickFruitFromTree(parent);
        }
    }
Filter Caches
• Several proposed variations
  – annex cache, pollution control cache, etc.
[Figure: small filter cache next to the main cache; fill on miss goes to the filter cache]
First-time misses are placed in filter cache
If accessed again, promote to the main cache
If not accessed again, eventually LRU’d out
Main cache only contains lines with proven reuse
One-time-use lines have been filtered out
Can be thought of as the “dual” of the victim cache
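The fill-then-promote policy above can be sketched as follows. The sizes, the direct-mapped main cache, and all names are simplifying assumptions for illustration; the proposals the slide mentions differ in their details.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FILTER_ENTRIES 4u    /* assumed sizes, for illustration only */
#define MAIN_SETS      64u

typedef enum { MAIN_HIT, FILTER_HIT_PROMOTED, MISS_FILLED_FILTER } result_t;

typedef struct {
    uint64_t filter[FILTER_ENTRIES];   /* filter[0] is MRU, last is LRU */
    unsigned filter_count;
    uint64_t main_line[MAIN_SETS];     /* direct-mapped main cache */
    bool     main_valid[MAIN_SETS];
} filtered_cache_t;

/* `line` is a cache-line number (address / line size). */
static result_t access_line(filtered_cache_t *c, uint64_t line) {
    assert(c->filter_count <= FILTER_ENTRIES);
    unsigned set = (unsigned)(line % MAIN_SETS);
    if (c->main_valid[set] && c->main_line[set] == line)
        return MAIN_HIT;

    for (unsigned i = 0; i < c->filter_count; i++) {
        if (c->filter[i] == line) {
            /* Second touch proves reuse: promote to main, drop from filter. */
            for (unsigned j = i; j + 1 < c->filter_count; j++)
                c->filter[j] = c->filter[j + 1];
            c->filter_count--;
            c->main_line[set]  = line;
            c->main_valid[set] = true;
            return FILTER_HIT_PROMOTED;
        }
    }

    /* First-time miss: fill at MRU; when full, the LRU entry falls off. */
    if (c->filter_count < FILTER_ENTRIES)
        c->filter_count++;
    for (unsigned j = c->filter_count - 1; j > 0; j--)
        c->filter[j] = c->filter[j - 1];
    c->filter[0] = line;
    return MISS_FILLED_FILTER;
}
```

A streaming pass that touches each line once only ever churns the small filter, so lines already promoted to the main cache survive it.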
Trouble w/ Multi-Lateral Caches
• More complexity
  – load may need to be routed to different places
    • may require some form of prediction to pick the right one
      – guessing wrong can cause replays
    • or accessing multiple in parallel increases power
      – no bandwidth benefit
  – more sources to bypass from
    • costs both latency and power in bypass network
Memory-Level Parallelism (MLP)
• What if memory latency is 10000 cycles?
  – Not enough traditional ILP to cover this latency
  – Runtime dominated by waiting for memory
  – What matters is overlapping memory accesses
• MLP: “number of outstanding cache misses [to main memory] that can be generated and executed in an overlapped manner.”
• ILP is a property of a DFG; MLP is a metric
  – ILP is independent of the underlying execution engine
  – MLP is dependent on the microarchitecture assumptions
  – You can measure MLP for uniprocessor, CMP, etc.
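One common way to turn the definition above into a number is to average the count of outstanding misses over only those cycles that have at least one miss outstanding. The sketch below assumes that formulation and a hypothetical trace of per-miss issue and return cycles; the slide does not prescribe a measurement method, and `miss_t` and `average_mlp` are illustrative names.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    unsigned start;   /* cycle the miss is issued to memory */
    unsigned end;     /* cycle the data returns (exclusive) */
} miss_t;

/*
 * Average number of outstanding misses over the cycles in
 * [0, total_cycles) that have at least one miss outstanding.
 * Returns 0.0 if no cycle has an outstanding miss.
 */
static double average_mlp(const miss_t *misses, size_t n, unsigned total_cycles) {
    unsigned long outstanding_sum = 0, busy_cycles = 0;
    for (unsigned t = 0; t < total_cycles; t++) {
        unsigned outstanding = 0;
        for (size_t i = 0; i < n; i++) {
            assert(misses[i].start <= misses[i].end);
            if (misses[i].start <= t && t < misses[i].end)
                outstanding++;
        }
        if (outstanding > 0) {
            outstanding_sum += outstanding;
            busy_cycles++;
        }
    }
    return busy_cycles ? (double)outstanding_sum / (double)busy_cycles : 0.0;
}
```

Two back-to-back 100-cycle misses give an MLP of 1.0 and take 200 cycles; the same two misses fully overlapped give 2.0 and take 100. The trace depends on what the machine can issue while a miss is pending, which is why MLP is a property of the microarchitecture rather than of the program alone.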
uArchs for MLP
• WIB – Waiting Instruction Buffer
[Figure, without WIB: scheduler holding a load miss]
No instructions in forward slice can execute
Eventually all independent insts issue and scheduler contains only insts in the forward slice… stalled
[Figure, with WIB: scheduler plus a separate WIB]
On a load miss, move forward slice to separate buffer
Independent insts continue to issue
New insts keep the scheduler busy
Eventually expose other independent load misses (MLP)
WIB Hardware
• Similar to replay – continue issuing dependent instructions, but need to shunt to the WIB
• WIB hardware can potentially be large
  – WIB doesn’t do scheduling – no CAM logic needed
• Need to redispatch from WIB back into RS when load comes back from memory
  – like redispatching from replay-queue