TRANSCRIPT
ECE 550D Fundamentals of Computer Systems and Engineering
Fall 2017
Memory Hierarchy
Prof. John Board
Duke University
Slides are derived from work by Profs. Tyler Bletsch and Andrew Hilton (Duke), and Amir Roth (Penn)
2
Memory Hierarchy
• Basic concepts
• Technology background
• Organizing a single memory component
• ABC
• Write issues
• Miss classification and optimization
• Organizing an entire memory hierarchy
• Virtual memory
• Highly integrated into real hierarchies, but…
• …won’t talk about until later
[Figure: App / App / App running on system software, above CPU, Mem, and I/O]
3
SRAM vs DRAM
• SRAM: Static Random Access Memory
• Static: memory is based on latches, a stored 0 or 1 is electrically stable, or static. 8 transistors per bit if we were using S-R latches (can do better but several transistors per bit).
• Fast access time since the stored value is actively driven onto the output.
• Random Access Memory: we use decoders to select exactly 1 of 2k elements of our memory; we can access any memory location equally easily in any order (as opposed to sequential access memory – magnetic tape for instance)
[Figure: a k:2^k decoder selects one of 2^k elements of n latches each; here, a 3-bit address accesses one of 8 n-bit elements]
4
SRAM vs DRAM
• DRAM: Dynamic Random Access Memory
• Imagine 32 billion leaky cups (capacitors): (4 gigabyte ram, 8 bits per byte) you pour water into the cups you want to have a “1” and leave empty the cups that have a “0”.
• You still have a decoder (still a RAM) – you select one memory element (say one byte) – imagine 8 straws for the 8 bits of the byte.
• You suck on the straw – if any water comes out, there used to be a 1 stored there, but you just destroyed it. If you suck air, it was and still is a 0. (Destructive read)
• And if you wait too long, the water leaks away, so you have to constantly (about 12 times per second) check each bit and refill it if it is a ”1”. (Dynamic instead of static storage).
• Insane? But only 1 transistor per bit stored, and much lower power consumption.
• But slow access time since discharging a capacitor, not driving a circuit.
5
How Do We Build Instruction/Data Memory?
• Register file? Just a multi-ported SRAM – i.e. just lots of flip flops
• 32 32-bit registers = 1Kb = 128B. Need a 5:32 decoder – not bad
• Multiple ports make it bigger and slower but still OK
• Instruction/data memory? Just a single-ported SRAM?
• Uh, umm… it’s 2^32 B = 4GB!!!!
– It would be huge, expensive, and pointlessly slow with a naïve decoder (a 32:4G decoder – how many 32-input AND gates does this need!?!)
– And consume enormous amounts of power
– And we can’t build something that big on-chip anyway
• Most ISAs now 64-bit, so memory is really as large as 2^64 B = 16EB
[Datapath figure: PC, IM, intRF, DM]
6
So What Do We Do? Motivation for Caches:
• “Primary” instruction/data memories (we will call them cache memories): small single-ported SRAMs…
• “primary” = “in the datapath”
• Key 1: they contain only a dynamic subset of “memory”
• Subset is small enough to fit in a reasonable SRAM and access quickly
• Key 2: missing chunks fetched on demand (transparent to program)
• From somewhere else… (next slide)
• Program has illusion that all 4GB (16EB) of memory is physically there
• Just like it has the illusion that all instructions execute atomically
[Datapath figure: PC, intRF, with a 64KB IM cache and a 16KB DM cache]
7
But…
• If requested insn/data not found in primary memory
• Doesn’t the place it comes from have to be a 4GB (16EB) SRAM?
• And won’t it be huge, expensive, and slow? And can we build it?
[Figure: 64KB IM and 16KB DM backed by a 4GB (16EB)? memory, with PC and intRF]
8
Memory Overview
• Functionality • “Like a big array…”
• N-bit address bus (on N-bit machine)
• Data bus: typically read/write on same bus
• Can have multiple ports: address/data bus pairs
• Access time: • Access latency ~ #bits * #ports^2
[Figure: memory component M with address and data buses]
9
Memory Performance Equation
• For memory component M • Access: read or write to M
• Hit: desired data found in M
• Miss: desired data not found in M
• Must get from another component
• No notion of “miss” in register file
• Fill: action of placing data in M
• %miss (miss-rate): #misses / #accesses
• thit: time to read data from (write data to) M
• tmiss: time to read data into M
• Performance metric: average access time
tavg = thit + %miss * tmiss
[Figure: component M annotated with thit, tmiss, and %miss]
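As a quick sketch, the performance equation translates directly into code (the numbers below are illustrative, not from the slides):

```python
def avg_access_time(t_hit, miss_rate, t_miss):
    """t_avg = t_hit + %miss * t_miss (all times in the same units)."""
    return t_hit + miss_rate * t_miss

# e.g. a 1-cycle hit, 3% miss rate, 100-cycle miss penalty
t_avg = avg_access_time(1, 0.03, 100)  # 4.0 cycles
```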
10
Memory Hierarchy
tavg = thit + %miss * tmiss
• Problem: hard to get low thit and %miss in one structure • Large structures have low %miss but higher thit
• Small structures have low thit but higher %miss
• Solution: use a hierarchy of memory structures • Known from the very beginning
“Ideally, one would desire an infinitely large memory capacity such that any particular word would be immediately available … We are forced to recognize the possibility of constructing a hierarchy of memories, each of which has a greater capacity than the preceding but which is less quickly accessible.”
Burks, Goldstine, and von Neumann, “Preliminary discussion of the logical design of an electronic computing instrument”, IAS memo, 1946
11
Abstract Memory Hierarchy
• Hierarchy of memory components • Upper levels: small → low thit, high %miss
• Going down: larger → higher thit, lower %miss
• Connected by buses • Ignore for the moment
• Make average access time close to M1’s • How?
• Most frequently accessed data in M1
• M1 + next most frequently accessed in M2, etc.
• Automatically move data up/down hierarchy
[Figure: pipeline connected to M1, then M2, M3, M4, down to M]
12
Why Memory Hierarchy Works
• 10/90 rule (of thumb) • 10% of static insns/data account for 90% of accessed insns/data
• Instructions: inner loops
• Data: frequently used globals, inner loop stack variables
• Temporal locality • Recently accessed instructions/data likely to be accessed again soon
• Instructions: inner loops (next iteration)
• Data: inner loop local variables, globals
• Hierarchy can be “reactive”: move things up when accessed
• Spatial locality • Instructions/data near recently accessed insns/data likely accessed
soon
• Instructions: sequential execution
• Data: elements in array, fields in struct, variables in stack frame
• Hierarchy can be “proactive”: move things up speculatively
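The spatial-locality point can be illustrated with a loop-order sketch (illustrative only: Python lists are not literally laid out row-major, so this shows the access pattern, not real timings):

```python
# In a C-style row-major layout, row-order traversal touches consecutive
# addresses (spatial locality); column-order traversal strides across rows.
N = 64
matrix = [[i * N + j for j in range(N)] for i in range(N)]

def row_major_sum(m):
    # visits elements in memory order: neighbors share cache blocks
    return sum(x for row in m for x in row)

def col_major_sum(m):
    # jumps N elements at a time: each access lands in a different block
    return sum(m[i][j] for j in range(N) for i in range(N))

# Same answer either way; on real hardware the first pattern is much faster.
assert row_major_sum(matrix) == col_major_sum(matrix) == N * N * (N * N - 1) // 2
```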
13
Exploiting Heterogeneous Technologies
• Apparent problem – Lower level components must be huge
– Huge SRAMs are difficult to build and expensive
• Solution: don’t use SRAM for lower levels • Cheaper, denser storage technologies
• Will be slower than SRAM, but that’s OK
• Won’t be accessed very frequently
• We have no choice anyway
• Upper levels: SRAM → expensive, fast
• Going down: DRAM, Disk/SSD → cheaper, slower
[Figure: pipeline → SRAM → SRAM? → DRAM → DISK hierarchy]
14
Memory Technology Overview
• Latency • SRAM: <1 to 5ns (on chip)
• DRAM: ~100ns — 100x or more slower than SRAM
• (spinning) Disk: 10,000,000ns or 10ms — 100,000x slower than DRAM
• (SSD) Flash: ~200ns — 2x slower than DRAM (for reads, much slower for writes)
• Bandwidth • SRAM: 10-100GB/sec
• DRAM: ~1GB/sec — 10x less than SRAM
• Disk: 100MB/sec (0.1 GB/sec) — sequential access only
• Flash: about same as DRAM for read (much less for writes)
• Cost: what could $300 buy (as of a few years ago)? • SRAM: 4MB
• DRAM: 1,000MB (1GB) — 250x cheaper than SRAM
• Disk: 400,000MB (400GB) — 400x cheaper than DRAM
• Flash: 4,000 MB (4GB) — 4x cheaper than DRAM
15
(Traditional) Concrete Memory Hierarchy
• (0th level: register file)
• 1st level: I$, D$ (L1 insn/data caches)
• 2nd level: L2 (cache) • On-chip, certainly on-package (with CPU)
• Made of SRAM
• 3rd level: L3 (cache) • Same as L2, may be off-chip
• Starting to appear
• ...
• N-1 level: main memory • Off-chip
• Made of DRAM
• N level: disk (swap space) • Electrical-mechanical (or SSD)
[Figure: pipeline → I$/D$ → L2 → L3 → main memory → disk (swap)]
16
Virtual Memory Teaser
• For 32-bit ISA • 4GB disk is easy
• Even 4GB main memory is common
• For 64-bit ISA • 16EB main memory is right out
• Oct 2017: 4GB = $34, so 16EB = $136 billion
• Even 16EB disk is extremely difficult
• (most 64-bit ISAs don’t support the full 64-bit address space: Intel – 48 bits in 2017)
• Virtual memory • Never referenced addresses don’t have to
physically exist anywhere!
• Next week…
17
Start With “Caches”
• “Cache”: hardware managed • Missing chunks retrieved by hardware
• SRAM technology • Technology basis of latency
• Cache organization • ABC
• Miss classification & optimization
• What about writes?
• Cache hierarchy organization
• Some example calculations
[Figure: hierarchy with caches hardware-managed; main memory and disk (swap) software-managed]
18
Why Are There 2-3 Levels of Cache?
• “Memory Wall”: memory 100X slower than primary caches • Multiple levels of cache needed to bridge the difference
• “Disk Wall?”: disk is 100,000X slower than memory • Why aren’t there 5-6 levels of main memory to bridge that difference?
• Doesn’t matter: program can’t keep itself busy for 10M cycles
• So slow, may as well swap out and run another program
[Figure (Copyright Elsevier Scientific 2003): processor vs. DRAM performance over time, log scale – processor improves +35–55% per year, DRAM +7% per year. Most famous graph in computer architecture.]
19
Evolution of Cache Hierarchies
[Figure: Intel 486 (1989) with an 8KB unified I/D$, vs. IBM Power5 dual core (2004) with 64KB I$, 64KB D$, 1.5MB L2, and L3 tags on chip]
• Chips today are 30–70% cache by area
20
RAM and SRAM
• Reality: large storage arrays are not really built with flip-flops and giant muxes
• RAM (random access memory) • Ports implemented as shared buses called wordlines/bitlines
• SRAM: static RAM • Static = bit maintains its value indefinitely, as long as power is on
• Bits implemented as cross-coupled inverters (CCIs)
+ 2 gates, 4 transistors per bit
• All processor storage arrays: regfile, caches, branch predictor, etc.
• Other forms of RAM: Dynamic RAM (DRAM), Flash (non-volatile RAM, or NV-RAM)
21
Basic RAM
• Storage array • M words of N bits each (e.g., 4w, 2b each)
• RAM storage array • M by N array of “bits” (e.g., 4 by 2)
• RAM port • Grid of wires that overlays bit array
• M wordlines: carry 1-hot decoded address
• N bitlines: carry data
• RAM port operation • Send address → 1 wordline goes high
• “bits” on this line read/write bitline data
• Operation depends on bit/W/B connection
• “Magic” analog stuff
[Figure: 4×2 RAM array – wordlines W0–W3 driven by the address decoder, bitlines B0–B1 carrying data, one 1/0 storage cell at each crossing]
22
Basic SRAM
• Storage array • M words of N bits each (e.g., 4w, 2b each)
• SRAM storage array • M by N array of CCI’s (e.g., 4 by 2)
• SRAM port • Grid of wires that overlays CCI array
• M wordlines: carry 1-Hot decoded address
• N bitlines: carry data
• SRAM port operation • Send address → 1 wordline goes high
• CCIs on this line read/write bitline data
• Operation depends on CCI/W/B connection
• “Magic” analog stuff
[Figure: same 4×2 array with a cross-coupled-inverter (CCI) cell at each wordline/bitline crossing]
23
ROMS:
• ROMs = Read Only memory
• Similar layout (wordlines, bitlines) to RAMs
• Except not writeable: fixed connections to Power/Gnd instead of CCI
• Also EPROMs • Electrically programmable (erased with UV light)
• And EEPROMs • Electrically erasable and re-programmable (very slow)
[Figure: ROM array – same wordline/bitline grid, but each crossing is a fixed 1/0 connection to power or ground]
24
SRAM Read/Write Port
• Cache: read/write on same port • Not at the same time
• Trick: write port with additional bitline
• “Double-ended” or “differential” bitlines
• Smaller → faster than separate ports
25
SRAM Read/Write
• Some extra logic on the edges • To write: tristates “at the top”
• Drive write data when appropriate
[Figure: SRAM array with differential bitline pairs B0/~B0 and B1/~B1, and write tristates at the top]
26
SRAM Read/Write
• Some extra logic on the edges • To write: tristates “at the top”
• Drive write data when appropriate
• To read: 2 things at the bottom
• Ability to equalize bit lines
• Sense amps
[Figure: same array with bitline equalization and sense amps (SA) at the bottom for reads]
50
SRAMS -> Caches
• Use SRAMs to make caches • Hold a subset of memory
• Reading: • Input: Address to read (32 or 64 bits)
• Output:
• Hit? 1-bit: was it there?
• Data: if there, requested value
[Figure: address feeds a tag SRAM and a data SRAM; outputs are hit and data]
51
Cache Performance Metrics
Miss Rate
• Fraction of memory references not found in cache (misses / accesses)
• 1 – hit rate
• Typical numbers (in percentages):
• 3-10% for L1
• can be quite small (e.g., < 1%) for L2, depending on size, etc.
Hit Time
• Time to deliver a line in the cache to the processor
• includes time to determine whether the line is in the cache
• Typical numbers:
• 1-2 clock cycles for L1
• 5-20 clock cycles for L2
Miss Penalty
• Additional time required because of a miss
• typically 50-200 cycles for main memory (Trend: increasing!)
From CMU 15-213
52
Let’s think about those numbers
Huge difference between a hit and a miss
• 100X, if just L1 and main memory
Would you believe 99% hits is twice as good as 97%?
• Consider these numbers:
cache hit time of 1 cycle
miss penalty of 100 cycles
So, average access time is:
97% hits: 1 cycle + 0.03 * 100 cycles = 4 cycles
99% hits: 1 cycle + 0.01 * 100 cycles = 2 cycles
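The same arithmetic, checked in code:

```python
t_hit, t_miss = 1, 100                 # cycles (the slide's numbers)
t_avg_97 = t_hit + 3 * t_miss // 100   # 3% misses -> 4 cycles
t_avg_99 = t_hit + 1 * t_miss // 100   # 1% misses -> 2 cycles
assert (t_avg_97, t_avg_99) == (4, 2)
assert t_avg_97 == 2 * t_avg_99        # 99% hits really is twice as good
```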
This is why “miss rate” is used instead of “hit rate”
From CMU 15-213
53
Associative memory, or Content-Addressable Memory (CAM)
• Mentioned last time: a memory we access by content rather than address
• “A CAM is designed such that the user supplies a data word and the CAM searches its entire memory to see if that data word is stored anywhere.” (Wikipedia) (answer can be nowhere!)
[Figure: a regular word-addressable memory (2^N words) takes an N-bit address and returns 1 word of data (in or out); an n-word CAM has storage for n words AND n comparators for doing parallel search – it takes a word to match and returns the address of a match, or NO MATCH]
54
General cache mechanics
[Figure: the larger, slower, cheaper memory is partitioned into “blocks” numbered 0–15; the smaller, faster, more expensive cache holds a subset of the blocks (here 8, 9, 14, 3); data is copied between levels in block-sized transfer units, e.g. blocks 4 and 10 moving up on misses]
From lecture-9.ppt, Carnegie-Mellon University course 15-213
55
Cache organization: Blocks
• Caches always interact with the next level of the memory hierarchy an entire “block” at a time – in level 1 caches, blocks typically range from 8-64 bytes; larger in L2/L3.
• Consider 1 megabyte memory with B=32 bytes (8 words) and a system with just one cache, 128 bytes. So 32768 blocks in all.
MemBlock Addr Data (32 bytes per block)
0 0-31 <some data>
1 32-63
2 64-95
…
32766 1048512-1048543
32767 1048544-1048575
Main Memory
56
Cache organization: Blocks
• In this case (1MB mem, B=32), each 20 bit physical address is
15-bit block id (0-32767) 5-bit byte offset
20-bit memory address
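In code, the split is just a shift and a mask (using an arbitrary example address, not one from the slides):

```python
B = 32                      # block size in bytes -> 5 offset bits
addr = 0x1234               # an arbitrary 20-bit address in the 1MB example
block_id = addr >> 5        # upper 15 bits: which of the 32768 blocks
offset = addr & (B - 1)     # lower 5 bits: which byte within the block
assert (block_id, offset) == (0x091, 20)
```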
57
Cache organization
• Here is our 128 byte (4 block) cache
• Many problems!
• Out of the 32768 blocks in main memory, which 4 should be in the cache?
• How do we identify blocks?
• Is the block in the cache valid? (i.e. has it been initialized or is it garbage?)
• How do we know which block is here? What is our search strategy?
• What do we do if the block we want is not here?
• 3 choices – Fully Associative, Direct-mapped, Set-associative caches solve the identification problem in different ways.
[Table: 4-entry cache – columns Cache block (0-3), Valid? (1 bit), Block ID? (tag), Data (32 bytes per block); all entries unknown]
58
General Organization of a Cache
B = 2^b bytes per cache block
E lines per set
S = 2^s sets
t tag bits per line
Cache size: C = B x E x S data bytes
[Figure: the cache as an array of S sets; each set holds E lines; each line holds a valid bit, a tag, and a block of B data bytes]
• Cache is an array of sets
• Each set contains one or more lines
• Each line holds a block of data
• 1 valid bit per line
From CMU 15-213
59
Most flexible: Fully Associative Cache
• Anything can be anywhere! (in our later language, the cache consists of a single “set”)
• Our running example : 1MB mem, B=32, so our cache tag will be the full 15 bit block ID of a main memory block
• Needs a full comparator per cache block (so 4 in our simple example). Any of our 32768 memblocks can be in any location. The array of 4 tags is a CAM.
• Sadly, other than tiny ones, FA caches are too complex and slow to be practical (due to the comparators)
[Table: 4-entry fully associative cache – columns Cache block (0-3), 15-bit tag, Data (32 bytes per block)]
60
Most flexible: Fully Associative Cache
• Cache address format for 20 bit address of running example
• So Address 0x15A45 comes along:
• Real question: is the 15-bit tag 0x0AD2 currently stored in any tag field of my FA cache? Check the CAM. If so, the 5th byte in the associated cache block is the byte I want.
[Address format: 15-bit tag | 5-bit byte offset; 0x15A45 = 0001 0101 1010 0100 0101]
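A set-membership sketch of the CAM lookup (the stored tags other than 0x0AD2 are made up for illustration):

```python
# Hypothetical contents of the 4-entry tag CAM; in hardware all four
# comparators check the incoming tag at once, like set membership.
stored_tags = {0x0AD2, 0x0001, 0x7FFF, 0x0100}

addr = 0x15A45
tag, offset = addr >> 5, addr & 0x1F   # 15-bit tag, 5-bit offset
assert tag == 0x0AD2 and offset == 5   # matches the slide's breakdown
hit = tag in stored_tags               # parallel compare -> hit
```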
61
Direct-Mapped Caches
• The full flexibility of FA caches slows them down too much, especially for a L1 cache. Other extreme: Direct-mapped cache.
• Cache will consist of S sets, with one line per set (so 4 sets in our running example).
62
Example: Direct-Mapped Cache
Simplest kind of cache, easy to build (only 1 tag compare required per access)
Characterized by exactly one line per set.
[Figure: sets 0 through S-1, each with E=1 line holding a valid bit, a tag, and a cache block]
Cache size: C = B x S data bytes
From CMU 15-213
63
Accessing Direct-Mapped Caches
Set selection
• Use the set index bits to determine the set of interest.
[Address format (m bits): t tag bits | s set-index bits | b block-offset bits]
[Figure: the set index selects one set; the selected set’s valid bit, tag, and cache block are read out]
64
Direct Mapped Caches
• In our running example, 4 blocks in the cache, so 4 sets, so 2-bit set index. That leaves a 13 bit tag field.
• So Address 0x15A45 comes along:
• If the address is in the cache, it’s in set 2. Is the 13-bit tag currently in set 2 0x02B4, AND is the block valid? If so, cache hit.
[Address format: 13-bit tag | 2-bit set ID | 5-bit byte offset; 0x15A45 = 0001 0101 1010 0100 0101 → SetID = 2, Offset = 5]
65
Sets in main memory
• In our running example, memory blocks 0,4,8,12,16,…,32764 compete for set 0 in the cache.
• Blocks 1,5,9,13…,32765 compete for set 1 in the cache
• …
• Blocks 3,7,11,15,…,32767 compete for set 3.
• So if blocks 0 and 4 are important to the program right now, only room for one of them in the cache, even if the other three entries in the cache are empty!
• DM: really fast – only need one comparator for entire cache, but inflexible (lower hit rate)
[Figure: main memory blocks 0 through 32767, grouped by the cache set each maps to]
66
Engineering compromise: Set-Associative Caches
• Most of the speed of direct mapped caches, but with some of the additional flexibility (and thus higher hit rate) of fully associative caches
• We have sets, like in DM caches, but we have more than 1 block per set.
67
Example: Set Associative Cache
Characterized by more than one line per set
E=2 lines per set
[Figure: sets 0 through S-1, each with two lines of (valid, tag, cache block) – an E-way associative cache]
68
Set-associative cache
• In our running example, if our 128 byte, 4 block cache is 2-way set associative, there will be 2 sets with 2 blocks each.
• So the “SetID” needs just 1 bit, tags are now 14 bits.
• Need 2 comparators, need to check both tags in the chosen set on each memory access – more than 1, but better than 4!
• “Real” caches are often 2-way or 4-way set-associative
[Address format: 14-bit tag | 1-bit set ID | 5-bit byte offset]
69
Notice that middle bits used as index
[Address format: t tag bits | s set-index bits | b block-offset bits]
70
Why Use Middle Bits as Index?
High-Order Bit Indexing
• Adjacent memory lines would map to same cache entry
• Poor use of spatial locality
[Figure: memory lines 0000–1111 mapped into a 4-line cache – with high-order indexing each contiguous quarter of memory maps to one cache line; with middle-order indexing consecutive lines map to different cache lines]
Middle-Order Bit Indexing
• Consecutive memory lines map to different cache lines
• Can hold an S*B*E-byte region of address space in cache at one time
71
Back to our regularly scheduled slides
72
Step 1: Data Basics
• 32-bit addresses • 4 Byte words only (to start)
• Start with blocks that are 1 word each • 4KB, organized as 1K 4B blocks
• Block: basic unit of data in cache
• Physical cache implementation • 1K (1024) by 4B (32) SRAM
• Called data array
• 10-bit address input
• 32-bit data input/output
[Figure: data array – 10-bit address input, 32-bit data input/output]
73
Which bits to use for index?
• Can skip the lowest log2(block_size) bits: those tell us which byte in the block we’re looking for.
• Of the remaining bits, do we pick the lowest ones or the highest ones?
• If we pick highest bits for index: • Two addresses that are numerically close will both map to the same block
• Neighbors in memory are likely to collide; fight over the same block
• Opposite of what we want – this penalizes spatial locality
• Bad!
[Figure: memory map – with high bits 31:22 as index, neighboring addresses fall in the same cache block]
74
Which bits to use for index?
• Can skip the lowest log2(block_size) bits: those tell us which byte in the block we’re looking for.
• Of the remaining bits, do we pick the lowest ones or the highest ones?
• If we pick lowest bits for index: • Two addresses that are numerically close will map to different blocks
• Neighbors in memory get neighboring blocks
• Spatial locality leads to broad use of cache capacity
• Good!
[Figure: memory map – with low bits 11:2 as index, neighboring addresses map to neighboring cache blocks]
75
Looking Up A Block
• Q: which 10 of the 32 address bits to use?
• A: bits [11:2] • 2 LS bits [1:0] are the offset bits
• Locate byte within word
• Don’t need these to locate word
• Next 10 LS bits [11:2] are the index bits
• These locate the word
• Nothing says index must be these bits
• But these work best in practice
• Why? (think about it)
[Figure: address bits [11:2] index the data array]
76
Knowing that You Found It
• Hold a subset of memory • How do we know if we have what we need?
• 2^20 different addresses map to one particular block
• Build separate and parallel tag array • 1K by 21-bit SRAM
• 20-bit (next slide) tag + 1 valid bit
• Lookup algorithm • Read tag indicated by index bits
• (Tag matches & valid bit set)
? Hit → data is good
: Miss → data is garbage, wait…
[Figure: index bits [11:2] read the data and tag arrays; tag bits [31:12] are compared (==) against the stored tag to produce hit]
77
Cache Use of Addresses
• Split address into three parts: • Offset: least-significant log2(block-size)
• Index: next log2(number-of-sets)
• Tag: everything else
[Address format: Tag [31:12] | Index [11:2] | Offset [1:0]]
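A sketch of the three-way split as a function (the second example reuses the 16-bit, 8-set, 4B-block configuration from the behavior slides; the 32-bit address is arbitrary):

```python
def split_address(addr, offset_bits, index_bits):
    """Split an address into (tag, index, offset) fields."""
    offset = addr & ((1 << offset_bits) - 1)          # least-significant bits
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)          # everything else
    return tag, index, offset

# 32-bit address, 4B blocks (2 offset bits), 1K sets (10 index bits)
assert split_address(0x12345678, 2, 10) == (0x12345, 414, 0)
# 16-bit example: 8 sets, 4B blocks
assert split_address(0x1234, 2, 3) == (0x091, 5, 0)
```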
78
Cache Behavior Example
Set # Valid Tag Data
0 0 000 00 00 00 00
1 0 000 00 00 00 00
2 0 000 00 00 00 00
3 0 000 00 00 00 00
4 0 000 00 00 00 00
5 0 000 00 00 00 00
6 0 000 00 00 00 00
7 0 000 00 00 00 00
CRITICAL: Cache starts empty (valid = 0). 8 sets, 16 bit address for example
79
Cache Behavior Example
Set # Valid Tag Data
0 0 000 00 00 00 00
1 0 000 00 00 00 00
2 0 000 00 00 00 00
3 0 000 00 00 00 00
4 0 000 00 00 00 00
5 (101) 0 000 00 00 00 00
6 0 000 00 00 00 00
7 0 000 00 00 00 00
Access address 0x1234 = 0001 0010 0011 0100 Offset = 0
Index = 5 Tag = 091
Not valid: miss
(doesn’t matter if tags match – invalid!)
80
Handling a Cache Miss
• What if requested word isn’t in the cache?
• How does it get in there?
• Cache controller: FSM • Remembers miss address
• Asks next level of memory
• Waits for response
• (and stalls CPU if necessary)
• Writes data/tag into proper locations in cache, SETS VALID BIT
• All of this happens on the fill path
• Sometimes called backside
[Figure: cache controller (cc) on the fill path between the cache arrays and the next level]
81
Cache Behavior Example (DM)
Set # Valid Tag Data
0 0 000 00 00 00 00
1 0 000 00 00 00 00
2 0 000 00 00 00 00
3 0 000 00 00 00 00
4 0 000 00 00 00 00
5 1 091 0F 1E 39 EC
6 0 000 00 00 00 00
7 0 000 00 00 00 00
Access address 0x1234 = 0001 0010 0011 0100 (now a hit after processing)
lb: 00 00 00 EC
lh: 00 00 39 EC
lw: 0F 1E 39 EC
82
Cache Behavior Example (DM)
Set # Valid Tag Data
0 0 000 00 00 00 00
1 0 000 00 00 00 00
2 0 000 00 00 00 00
3 0 000 00 00 00 00
4 0 000 00 00 00 00
5 1 091 0F 1E 39 EC
6 0 000 00 00 00 00
7 0 000 00 00 00 00
Valid && Tag match -> hit lb: 00 00 00 1E
lh: 00 00 0F 1E
lw: (unaligned)
Access address 0x1236 = 0001 0010 0011 0110 Offset = 2
Index = 5 Tag = 091
83
Cache Behavior Example (DM)
Set # Valid Tag Data
0 0 000 00 00 00 00
1 0 000 00 00 00 00
2 0 000 00 00 00 00
3 0 000 00 00 00 00
4 0 000 00 00 00 00
5 1 091 0F 1E 39 EC
6 0 000 00 00 00 00
7 0 000 00 00 00 00
Not valid: miss
Access address 0x1238 = 0001 0010 0011 1000 Offset = 0
Index = 6 Tag = 091
84
Cache Behavior Example (DM)
Set # Valid Tag Data
0 0 000 00 00 00 00
1 0 000 00 00 00 00
2 0 000 00 00 00 00
3 0 000 00 00 00 00
4 0 000 00 00 00 00
5 1 091 0F 1E 39 EC
6 0 000 00 00 00 00
7 0 000 00 00 00 00
Access address 0x1238 = 0001 0010 0011 1000
Make request to next level...
wait for it....
(fill arrives: valid=1, tag=091)
85
Cache Behavior Example (DM)
Set # Valid Tag Data
0 0 000 00 00 00 00
1 0 000 00 00 00 00
2 0 000 00 00 00 00
3 0 000 00 00 00 00
4 0 000 00 00 00 00
5 1 091 0F 1E 39 EC
6 1 091 3C 99 11 12
7 0 000 00 00 00 00
Valid, but tag does not match: miss
Access address 0x2234 = 0010 0010 0011 0100 Offset = 0
Index = 5 Tag = 111
86
Cache Behavior Example (DM)
Set # Valid Tag Data
0 0 000 00 00 00 00
1 0 000 00 00 00 00
2 0 000 00 00 00 00
3 0 000 00 00 00 00
4 0 000 00 00 00 00
5 1 091 0F 1E 39 EC
6 1 091 3C 99 11 12
7 0 000 00 00 00 00
Access address 0x2234 = 0010 0010 0011 0100
Make request to next level...
wait for it....
(fill arrives: tag=111)
87
Cache Behavior Example (DM)
Set # Valid Tag Data
0 0 000 00 00 00 00
1 0 000 00 00 00 00
2 0 000 00 00 00 00
3 0 000 00 00 00 00
4 0 000 00 00 00 00
5 1 111 01 CF D0 87
6 1 091 3C 99 11 12
7 0 000 00 00 00 00
Access address 0x2234 = 0010 0010 0011 0100
Note that now, 0x1234 is gone
replaced by 0x2234
88
Cache Misses and CPI
• I$ and D$ misses stall datapath (multi-cycle or pipeline) • Increase CPI
• Cache hits built into “base” CPI
• E.g., Loads = 5 cycles in multi-cycle includes thit
• Some loads may take more cycles...
– Need to know latency of “average” load (tavg)
[Pipeline figure: PC, I$, register file, ALU, data memory, with pipeline latches]
89
Measuring Cache Performance
• Ultimate metric is tavg • Cache capacity roughly determines thit
• Lower-level memory structures determine tmiss
• Measure %miss
• Hardware performance counters (since Pentium)
• Performance Simulator
• Paper simulation (like we just did)
• Only works for small caches
• Small number of requests (would not do for 1M accesses)
90
Cache Miss Paper Simulation (DM again)
• 4-bit addr, 8B cache, 2B blocks -> 4 sets, already initialized
• Tag, index, offset?
Address | Tag | Index | Offset | Set 0 tag | Set 1 tag | Set 2 tag | Set 3 tag | Result
C 1100 invalid 0 0 1
E 1110
8 1000
3 0011
8 1000
0 0000
8 1000
4 0100
6 0110
91
Cache Miss Paper Simulation
• 4-bit addresses, 8B cache, 2B blocks -> 4 sets
• Tag: 1 bit, Index: 2 bits, Offset: 1 bit
Address Tag Index Offset Set 0 Set 1 Set 2 Set3 Result
C invalid 0 0 1
E
8
3
8
0
8
4
6
92
Cache Miss Paper Simulation
• 8B cache, 2B blocks -> 4 sets (data doesn’t matter!)
• What happens for each request?
Address Tag Index Offset Set 0 Set 1 Set 2 Set3 Result
C invalid 0 0 1
E
8
3
8
0
8
4
6
93
Cache Miss Paper Simulation
• 8B cache, 2B blocks -> 4 sets
Address Tag Index Offset Set 0 Set 1 Set 2 Set3 Result
C 1 2 0 invalid 0 0 1 Miss
E invalid 0 1 1
8
3
8
0
8
4
6
• What happens for each request?
94
Cache Miss Paper Simulation
• 8B cache, 2B blocks -> 4 sets
Address Tag Index Offset Set 0 Set 1 Set 2 Set3 Result
C 1 2 0 invalid 0 0 1 Miss
E 1 3 0 invalid 0 1 1 Hit
8 invalid 0 1 1
3
8
0
8
4
6
• What happens for each request?
95
Cache Miss Paper Simulation
• 8B cache, 2B blocks -> 4 sets
Address Tag Index Offset Set 0 Set 1 Set 2 Set3 Result
C 1 2 0 invalid 0 0 1 Miss
E 1 3 0 invalid 0 1 1 Hit
8 1 0 0 invalid 0 1 1 Miss
3 1 0 1 1
8
0
8
4
6
• What happens for each request?
96
Cache Miss Paper Simulation
• 8B cache, 2B blocks -> 4 sets
Address Tag Index Offset Set 0 Set 1 Set 2 Set3 Result
C 1 2 0 invalid 0 0 1 Miss
E 1 3 0 invalid 0 1 1 Hit
8 1 0 0 invalid 0 1 1 Miss
3 0 1 1 1 0 1 1 Hit
8 1 0 1 1
0
8
4
6
• What happens for each request?
97
Cache Miss Paper Simulation
• 8B cache, 2B blocks -> 4 sets
Address Tag Index Offset Set 0 Set 1 Set 2 Set3 Result
C 1 2 0 invalid 0 0 1 Miss
E 1 3 0 invalid 0 1 1 Hit
8 1 0 0 invalid 0 1 1 Miss
3 0 1 1 1 0 1 1 Hit
8 1 0 0 1 0 1 1 Hit
0 1 0 1 1
8
4
6
• What happens for each request?
98
Cache Miss Paper Simulation
• 8B cache, 2B blocks -> 4 sets
Address Tag Index Offset Set 0 Set 1 Set 2 Set3 Result
C 1 2 0 invalid 0 0 1 Miss
E 1 3 0 invalid 0 1 1 Hit
8 1 0 0 invalid 0 1 1 Miss
3 0 1 1 1 0 1 1 Hit
8 1 0 0 1 0 1 1 Hit
0 0 0 0 1 0 1 1 Miss
8 0 0 1 1
4
6
• What happens for each request?
99
Cache Miss Paper Simulation
• 8B cache, 2B blocks -> 4 sets
Address Tag Index Offset Set 0 Set 1 Set 2 Set3 Result
C 1 2 0 invalid 0 0 1 Miss
E 1 3 0 invalid 0 1 1 Hit
8 1 0 0 invalid 0 1 1 Miss
3 0 1 1 1 0 1 1 Hit
8 1 0 0 1 0 1 1 Hit
0 0 0 0 1 0 1 1 Miss
8 1 0 0 0 0 1 1 Miss
4 1 0 1 1
6
• What happens for each request?
100
Cache Miss Paper Simulation
• 8B cache, 2B blocks -> 4 sets
Address Tag Index Offset Set 0 Set 1 Set 2 Set3 Result
C 1 2 0 invalid 0 0 1 Miss
E 1 3 0 invalid 0 1 1 Hit
8 1 0 0 invalid 0 1 1 Miss
3 0 1 1 1 0 1 1 Hit
8 1 0 0 1 0 1 1 Hit
0 0 0 0 1 0 1 1 Miss
8 1 0 0 0 0 1 1 Miss
4 0 2 0 1 0 1 1 Miss
6 1 0 0 1
• What happens for each request?
101
Cache Miss Paper Simulation
• 8B cache, 2B blocks -> 4 sets
Address Tag Index Offset Set 0 Set 1 Set 2 Set3 Result
C 1 2 0 invalid 0 0 1 Miss
E 1 3 0 invalid 0 1 1 Hit
8 1 0 0 invalid 0 1 1 Miss
3 0 1 1 1 0 1 1 Hit
8 1 0 0 1 0 1 1 Hit
0 0 0 0 1 0 1 1 Miss
8 1 0 0 0 0 1 1 Miss
4 0 2 0 1 0 1 1 Miss
6 0 3 0 1 0 0 1 Miss
• What happens for each request?
102
Cache Miss Paper Simulation
• %miss: 6 / 9 ≈ 67% • Not good...
• How could we improve it?
[Results column: Miss, Hit, Miss, Hit, Hit, Miss, Miss, Miss, Miss]
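The paper simulation can be checked mechanically; this sketch encodes the slide's initial state (set 0 invalid; sets 1-3 holding tags 0, 0, 1):

```python
def simulate_dm_cache(addresses, num_sets, offset_bits, initial_tags):
    """Direct-mapped cache simulation: one tag per set, fill on miss."""
    index_bits = num_sets.bit_length() - 1
    tags = list(initial_tags)              # None means invalid
    results = []
    for addr in addresses:
        index = (addr >> offset_bits) & (num_sets - 1)
        tag = addr >> (offset_bits + index_bits)
        if tags[index] == tag:
            results.append("Hit")
        else:
            results.append("Miss")
            tags[index] = tag              # fill
    return results

accesses = [0xC, 0xE, 0x8, 0x3, 0x8, 0x0, 0x8, 0x4, 0x6]
r = simulate_dm_cache(accesses, num_sets=4, offset_bits=1,
                      initial_tags=[None, 0, 0, 1])
assert r == ["Miss", "Hit", "Miss", "Hit", "Hit",
             "Miss", "Miss", "Miss", "Miss"]     # 6/9 misses
```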
103
Capacity and Performance
• Simplest way to reduce %miss: increase capacity + Miss rate decreases monotonically
• “Working set”: instructions/data program is actively using
– thit increases
• tavg ?
• Given capacity, manipulate %miss by changing organization
[Graph: %miss falls as cache capacity grows, dropping steeply around the “working set” size]
104
Block Size
• One possible re-organization: increase block size + Exploit spatial locality
– Caveat: increase conflicts too
– Increases thit: need word select mux
• By a little, not too bad
+ Reduce tag overhead
[Figure: with 8B blocks, index bits are [11:3] and bit [2] selects the word within the block; tag bits [31:12] are compared for hit]
105
Tag Overhead
• “4KB cache” means cache holds 4KB of data (capacity) • Tag storage is considered overhead
• Valid bit usually not counted
• Tag overhead = tag size / data size
• 4KB cache with 4B blocks? • 4B blocks → 2-bit offset
• 4KB cache / 4B blocks → 1024 blocks → 10-bit index
• 32-bit address – 2-bit offset – 10-bit index = 20-bit tag
• 20-bit tag / 32-bit block = 63% overhead
• (plus 1 comparator – not bad, would be a lot worse with Fully associative design!)
106
Block Size and Tag Overhead
• 4KB cache with 1024 4B blocks? • 4B blocks → 2-bit offset, 1024 frames → 10-bit index
• 32-bit address – 2-bit offset – 10-bit index = 20-bit tag
• 20-bit tag / 32-bit block = 63% overhead
• 4KB cache with 512 8B blocks • 8B blocks → 3-bit offset, 512 frames → 9-bit index
• 32-bit address – 3-bit offset – 9-bit index = 20-bit tag
• 20-bit tag / 64-bit block = 32% overhead
• Notice: tag size is same, but data size is twice as big
• A realistic example: 64KB cache with 64B blocks • 16-bit tag / 512-bit block = ~3% overhead
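These overheads can be computed with a small helper (assuming 32-bit addresses and not counting the valid bit, as above):

```python
def tag_overhead(cache_bytes, block_bytes, addr_bits=32):
    """Tag bits / data bits per block (valid bit not counted)."""
    offset_bits = (block_bytes - 1).bit_length()            # log2(block size)
    index_bits = (cache_bytes // block_bytes - 1).bit_length()  # log2(#blocks)
    tag_bits = addr_bits - offset_bits - index_bits
    return tag_bits / (8 * block_bytes)

assert tag_overhead(4 * 1024, 4) == 20 / 32      # 62.5%
assert tag_overhead(4 * 1024, 8) == 20 / 64      # 31.25%
assert tag_overhead(64 * 1024, 64) == 16 / 512   # ~3%
```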
107
Cache Miss Paper Simulation
• 8B cache, 4B blocks -> 2 sets
Address Tag Index Offset Set 0 Set 1 Result
C 1 1 0 invalid 0 Miss
E 1 1 2 invalid 1 Hit
8 1 0 0 invalid 1 Miss
3 0 0 3 1 1 Miss
8 1 0 0 0 1 Miss
0 0 0 0 1 1 Miss
8 1 0 0 0 1 Miss
4 0 1 0 1 1 Miss
6 0 1 2 1 0 Hit
• 8,3: new conflicts (fewer sets)
• 4,6: spatial locality (now in same set)
108
Block Size and Miss Rate Redux
+ Bigger Block: Spatial prefetching • For blocks with adjacent addresses
• Turns miss/miss pairs into miss/hit pairs
• Example: 4, 6
– Conflicts • For blocks with non-adjacent addresses (but in adjacent frames)
• Turns hits into misses by disallowing simultaneous residence
• Example: 8, 3
• Both effects always present to some degree • Spatial prefetching dominates initially (until 64–128B)
• Conflicts dominate afterwards
• Optimal block size is 32–256B (varies across programs)
• Typical: 64B
[Graph: %miss vs. block size – falls while spatial prefetching dominates, then rises as conflicts dominate]
109
Block Size and Miss Penalty
• Does increasing block size increase tmiss? • Don’t larger blocks take longer to read, transfer, and fill?
• They do, but…
• tmiss of an isolated miss is not affected • Critical Word First / Early Restart (CWF/ER)
• Requested word fetched first, pipeline restarts immediately
• Remaining words in block transferred/filled in the background
• tmiss’es of a cluster of misses will suffer • Reads/transfers/fills of two misses cannot be overlapped
• Latencies start to pile up
• This is technically a bandwidth problem (more later)
110
Cache Miss Paper Simulation
• 8B cache, 4B blocks -> 2 sets
Address Tag Index Offset Set 0 Set 1 Result
C 1 1 0 invalid invalid Miss
E 1 1 2 invalid 1 Hit
8 1 0 0 invalid 1 Miss
3 0 0 3 1 1 Miss
8 1 0 0 0 1 Miss
0 0 0 0 1 1 Miss
8 1 0 0 0 1 Miss
4 0 1 0 1 1 Miss
6 0 1 2 1 0 Hit
• 8 (1000) and 0 (0000): same set for any $ < 16B
• Can we do anything about this?
111
Associativity
• New organizational dimension: Associativity • Block can reside in one of few frames
• Frame groups called sets
• Each frame in set called a way
• This is 2-way set-associative (SA)
• 1-way → direct-mapped (DM)
• 1-set → fully-associative (FA)
• Lookup algorithm • Use index bits to find set
• Read data/tags in all frames in parallel
• Any (match && valid bit) ? Hit : Miss
[Figure: 2-way set-associative lookup: index bits [10:2] select the set, tag bits [31:11] compared against both ways in parallel]
112
Cache Behavior 2-ways
Set #
Way 0 Way 1
V Tag Data V Tag Data
0 0 000 00 00 00 00 0 000 00 00 00 00
1 0 000 00 00 00 00 0 000 00 00 00 00
2 0 000 00 00 00 00 0 000 00 00 00 00
3 0 000 00 00 00 00 0 000 00 00 00 00
Cache: 4 sets, 2 ways, 4B blocks
113
Cache Behavior 2-ways
Set #
Way 0 Way 1
V Tag Data V Tag Data
0 0 000 00 00 00 00 0 000 00 00 00 00
1 0 000 00 00 00 00 0 000 00 00 00 00
2 0 000 00 00 00 00 0 000 00 00 00 00
3 0 000 00 00 00 00 0 000 00 00 00 00
Access address 0x1234 = 0001 0010 0011 0100 Offset = 0
Index = 1 Tag = 123
Miss. Request from next level. Wait...
114
Cache Behavior 2-ways
Set #
Way 0 Way 1
V Tag Data V Tag Data
0 0 000 00 00 00 00 0 000 00 00 00 00
1 1 123 0F 1E 39 EC 0 000 00 00 00 00
2 0 000 00 00 00 00 0 000 00 00 00 00
3 0 000 00 00 00 00 0 000 00 00 00 00
Access address 0x2234 = 0010 0010 0011 0100 Offset = 0
Index = 1 Tag = 223
Miss. Request from next level. Wait...
115
Cache Behavior 2-ways
Set #
Way 0 Way 1
V Tag Data V Tag Data
0 0 000 00 00 00 00 0 000 00 00 00 00
1 1 123 0F 1E 39 EC 1 223 01 CF D0 87
2 0 000 00 00 00 00 0 000 00 00 00 00
3 0 000 00 00 00 00 0 000 00 00 00 00
Hit. In Way 0
Access address 0x1234 = 0001 0010 0011 0100 Offset = 0
Index = 1 Tag = 123
116
Cache Miss Paper Simulation
• 8B cache, 2B blocks, 2 ways -> 2 sets
Address Tag Index Offset Set0-Way0 Set0-Way1 Set1-Way0 Set1-Way1 Result
C 3 0 0 inv inv inv inv Miss
E 3 1 0 3 inv inv inv Miss
8 2 0 0 3 inv 3 inv Miss
3 0 1 1 3 2 3 inv Miss
8 2 0 0 3 2 3 0 Hit
0 0 0 0 3 2 3 0 Miss
8 2 0 0 0 2 3 0 Hit
4 1 0 0 0 2 3 0 Miss
6 1 1 0 1 2 3 0 Miss
(tags shown before each access; LRU replacement; "inv" = invalid)
• What happens for each request?
117
Cache structure math summary
• Given capacity, block_size, ways (associativity), and word_size.
• Cache parameters:
• num_frames = capacity / block_size
• sets = num_frames / ways = capacity / block_size / ways
• Address bit fields:
• offset_bits = log2(block_size)
• index_bits = log2(sets)
• tag_bits = word_size - index_bits - offset_bits
• Numeric way to get offset/index/tag from address:
• block_offset = addr % block_size
• index = (addr / block_size) % sets
• tag = addr / (sets*block_size)
118
Replacement Policies
• Set-associative caches present a new design choice • On cache miss, which block in set to replace (kick out)?
• Belady’s (oracle): block that will be used furthest in future
• Random
• FIFO (first-in first-out)
• LRU (least recently used) • Fits with temporal locality, LRU = least likely to be used in future
• NMRU (not most recently used) • An easier to implement approximation of LRU
• Equal to LRU for 2-way SA caches
119
NMRU Implementation
• Add MRU field to each set • MRU data is encoded “way”
• Hit? update MRU
• Fill? write enable ~MRU (in 2-way)
• Need to pick 1 of the (n-1) non-MRU ways for write enable if more than 2 ways
[Figure: NMRU lookup: the MRU field inverts (~) to write-enable (WE) the non-MRU way on a fill]
120
Associativity And Performance
• The associativity game + Higher associative caches have lower %miss
– thit increases
• But not much for low associativities (2,3,4,5)
• tavg?
• Block-size and number of sets should be powers of two • Makes indexing easier (just rip bits out of the address)
• 5-way set-associativity? No problem (but powers of 2 still very common)
[Figure: %miss vs. associativity: benefit flattens around 5 ways]
121
Full Associativity
• How to implement full (or at least high) associativity? • This way is terribly inefficient
• 1K matches are unavoidable, but 1K data reads + 1K-to-1 mux?
[Figure: brute-force fully-associative lookup: one comparator per frame on address bits [31:2]]
122
Full-Associativity with CAMs
• CAM: content-addressable memory • Array of words with built-in comparators
• Input is data (tag)
• Output is 1H encoding of matching slot
• Fully associative cache • Tags as CAM, data as RAM
• Effective but expensive (EE reasons)
• Upshot: used for 16-/32-way associativity
– No good way to build 1024-way associativity
+ No real need for it, either
[Figure: CAM-based fully-associative cache, tag = addr[31:2]: "look mom, no index bits"]
123
CAM -> Content Addressable Memory
• Input: Data to match • (ex on left: 3 bits)
• Output: matching entries • (ex on left: 4 entries)
• Will not be tested on these electrical details of CAMs, but basic idea of a CAM is fair game!
[Figure: 4-entry, 3-bit CAM: data in, match lines out]
124
CAM circuit
[Figure: 3-bit CAM cell: complemented bit lines ~B/B, match line precharged from Vcc]
• CAM match port looks different from RAM r/w port
• Cells look similar • Note: Bit stored on right, ~Bit on left (opposite of inputs)
• Step 1: Precharge match line to 1 (first half of cycle)
• Step 2: Send data/~data down bit lines • Two 1s on same side (bit line != data) open NMOS path -> gnd
• Drains match line 1->0
• Note that if all bits match, each side has a 1 and a 0
• One NMOS in the path from Match -> Gnd is closed
• No conductive path -> Match keeps its charge @ 1
131
CAMs: Slow and High Power..
[Figure: same CAM cell as above]
• CAMs are slow and high power
• Pre-charge all, discharge most match lines every search
• Pre-charge + discharge take time: capacitive load of match line
• Bit lines have high capacitive load: driving 1 transistor per row
132
ABC
• Capacity + Decreases capacity misses
– Increases thit
• Associativity + Decreases conflict misses
– Increases thit
• Block size – Increases conflict misses
+ Decreases compulsory misses
± Increases or decreases capacity misses
• Little effect on thit, may exacerbate tmiss
• How much they help depends...
133
Different Problems -> Different Solutions
• Suppose we have a 16B, direct-mapped cache w/ 4B blocks • 4 sets
• Examine some access patterns and think about what would help
• Misses marked with *
• Access pattern A: • As is: 0*, 2, 4*, 6, 8*, 10, 12*, 14, 16*, 18, 20*, 22, 24*, 26
• 8B blocks? 0*, 2, 4, 6, 8*, 10, 12, 14, 16*, 18, 20, 22, 24*, 26
• 2-way assoc? 0*, 2, 4*, 6, 8*, 10, 12*, 14, 16*, 18, 20*, 22, 24*, 26
• Access pattern B: • As is: 0*, 128*, 1*, 129*, 2*, 130*, 3*, 131*, 4*, 132*, 5*, 133*, 6*
• 8B blocks? 0*, 128*, 1*, 129*, 2*, 130*, 3*, 131*, 4*, 132*, 5*, 133*, 6*
• 2-way assoc? 0*, 128*, 1, 129, 2, 130, 3, 131, 4*, 132*, 5, 133, 6
• Access pattern C (All 3): • 0,20,40,60,48,36,24,12,1,21,41,61,49,37,25,13,2,22,42,62,50,38,…
134
Analyzing Misses: 3C Model (Hill)
• Divide cache misses into categories based on cause • Compulsory: block size is too small (i.e., address not seen before)
• Capacity: capacity is too small
• Conflict: associativity is too low
135
Different Problems -> Different Solutions
• Access pattern A: Compulsory misses • 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26
• For misses, have not accessed that block
• Size/associativity won’t help (never had it)
• Larger block -> include more data in one block -> more hits
• Recognizing compulsory misses • Never seen the block before
136
Different Problems -> Different Solutions
• Access pattern B: Conflict misses • 0, 128, 1, 129, 2, 130, 3, 131, 4, 132, 5, 133, 6
• 0 and 128 map to same set (set 0): kick each other out (“conflict”)
• Larger block? No help
• Larger cache? Only helps if MUCH larger (256 B instead of 16B)
• Higher associativity? Fixes problem
• Can have both 0 and 128 in set 0 at same time (different ways)
• Recognizing conflict misses: • Count unique blocks between last access and miss (inclusive)
• Number of unique blocks <= number of blocks in cache? Conflict
• Enough space to hold them all...
• Just must be having set conflict
137
Different Problems -> Different Solutions
• Access pattern C: Capacity Misses • 0,20,40,60,48,36,24,12,1,21,41,61,49,37,25,13,2,22,42,62,50,38,…
• Larger block size? No help
• Even 16B block (entire cache) won’t help
• Associativity? No help... even at full assoc
• After 0, 20, 40, 60: kick out 0 for 48
• Kick out 20 for 36
• Kick out 40 for 24...
• Solution: make cache larger
• Doubling cache size turns almost all misses into hits
• A few compulsory misses remain
• 0*,20*,40*,60*,48*,36*,24*,12*,1,21,41,61,49,37,25,13,2,22,42,62,50,38,… (misses marked *)
• Recognizing Capacity Misses • Count unique blocks between last access and miss (inclusive)
• Number of unique blocks > number of blocks in cache? Capacity
• Just can’t hold them all
138
Miss Categorization Flow Chart
• Seen same block before?
• No -> Compulsory
• Yes -> compare # unique blocks referenced (between its last access and this miss, inclusive) to the number the cache can hold:
• # referenced <= # cache can hold -> Conflict
• # referenced > # cache can hold -> Capacity
139
ABC
• Capacity + Decreases capacity misses
– Increases thit
• Associativity + Decreases conflict misses
– Increases thit
• Block size – Increases conflict misses
+ Decreases compulsory misses
± Increases or decreases capacity misses
• Little effect on thit, may exacerbate tmiss
• How much they help depends...
140
Two Optimizations
• Victim buffer: for conflict misses • Technically: reduces tmiss for these misses, doesn’t eliminate them
• Depends how you do your accounting
• Prefetching: for capacity/compulsory misses
141
Victim Buffer
• Conflict misses: not enough associativity • High associativity is expensive, but also rarely needed
• 3 blocks mapping to same 2-way set and accessed (XYZ)+
• Victim buffer (VB): small FA cache (e.g., 4 entries) • Small so very fast
• Blocks kicked out of cache placed in VB
• On miss, check VB: hit ? Place block back in cache
• 4 extra ways, shared among all sets
+ Only a few sets will need it at any given time
• On cache fill path: reduces tmiss, no impact on thit
+ Very effective in practice
[Figure: victim buffer (VB) alongside the $-to-next-level-$ path]
142
Prefetching
• Prefetching: put blocks in cache proactively/speculatively • In software: insert prefetch (non-binding load) insns into code
• In hardware: cache controller generates prefetch addresses
• Keys: anticipate upcoming miss addresses accurately • Timeliness: initiate prefetches sufficiently in advance
• But not so far in advance that it kicks out good stuff
• Accuracy: don’t evict useful data
• Prioritize handling real misses over prefetches
• Simple algorithm: next block prefetching • Miss address X → prefetch address X+block_size
• Works for instructions: sequential execution
• What about non-sequential execution?
• Works for data: arrays
• What about other data-structures?
• Address prediction is actively researched area
[Figure: cache controller (cc) issuing prefetches between $ and next-level-$]
143
Write Issues
• So far we have looked at reading from cache • Insn fetches, loads
• What about writing into cache • Stores, not an issue for insn caches (why they are simpler)
• Several new issues • Must read tags first before writing data
• Cannot be in parallel
• Cache may have dirty data
• Data which has been updated in this cache, but not lower levels
• Must be written back to lower level before eviction
144
Recall Data Memory Stage of Datapath
• So far, have just assumed D$ in Memory Stage... • Actually a bit more complex for a couple reasons...
[Figure: L1 D$ in the memory stage of the datapath]
145
Problem with Writing #1: Store Misses
• Load instruction misses D$: • Have to stall datapath
• Need missing data to complete instruction
• (Fancier: stall at first consumer rather than load)
• Store instruction misses D$: • Stall?
• Would really like not to
• Store is writing the data
• Need rest of block because we cannot have part of a block
• Generally do not support “these bytes are valid, those are not”
• How to avoid?
146
Problem with Writing #2: Serial Tag/Data Access
• Load can read tags/data in parallel • Read both SRAMs
• Compare Tags -> Select proper way (if any)
• Stores cannot write tags/data in parallel • Read tags/write data array at same time??
• How to know which way?
• Or even if it's a hit?
• Incorrect guess -> overwrote data from somewhere else..
• Multi-cycle data-path: • Stores take an extra cycle? Increase CPI
• Pipelined data-path: • Tags in one stage, Data in the next?
• Works for stores, but loads serialize tags/data -> higher CPI
147
Store Buffer
• Stores write into a store buffer • Holds address, size, and data of stores
[Figure: store buffer in front of L1 D$]
148
Store Buffer
• Stores write into a store buffer • Holds address, size, and data of stores
• Store data written from store buffer into cache
• Miss? Data stays in buffer until hit
149
Store Buffer
• Loads search store buffer for matching store • Match? Forward data from the store
• No match: Use data from D$
• Addresses are CAM: allow search for match
150
Store Buffer
• How does this resolve our issues?
• Problem with Writing #1: Store misses • Stores write to store buffer and are done
• FSM writes stores into D$ from store buffer
• Misses stall store buffer -> D$ write (but not pipeline)
• Pipeline will stall on full store buffer
• Problem with Writing #2: Tags -> Data • FSM that writes stores to D$ can check tags... then write data
• Decoupled from data path’s normal execution
• Can happen whenever loads are not using the D$
151
Write Propagation
• When to propagate new value to (lower level) memory?
• Write-thru: immediately – Requires additional bus bandwidth
• Not common
• Write-back: when block is replaced • Blocks may be dirty now
• Dirty bit (in tag array)
• Cleared on fill
• Set by a store to the block
152
Write Back: Dirty Misses
• Writeback caches may have dirty misses: • Victim block (one to be replaced) is dirty
• Must first writeback to next level
• Then request data for miss
• Slower :(
• Solution:
• Add a buffer on back side of cache: writeback buffer
• Small full associative buffer, holds a few lines
• Request miss data immediately
• Put dirty line in WBB
• Writeback later
[Figure: writeback buffer (WBB) between $ and next-level-$]
153
What this means to the programmer
• If you’re writing code, you want good performance.
• The cache is crucial to getting good performance.
• The effect of the cache is influenced by the order of memory accesses.
CONCLUSION:
The programmer can change the order of memory accesses to improve performance!
154
Cache performance matters!
• A HUGE component of software performance is how it interacts with cache
• Example:
Assume that x[i][j] is stored next to x[i][j+1] in memory (“row major order”).
Which will have fewer cache misses?
A:
for (k = 0; k < 100; k++)
for (j = 0; j < 100; j++)
for (i = 0; i < 5000; i++)
x[i][j] = 2 * x[i][j];
B:
for (k = 0; k < 100; k++)
for (i = 0; i < 5000; i++)
for (j = 0; j < 100; j++)
x[i][j] = 2 * x[i][j];
Adapted from Lebeck and Porter (creative commons)
155
Blocking (Tiling) Example
/* Before */
for(i = 0; i < SIZE; i++)
for (j = 0; j < SIZE; j++)
for (k = 0; k < SIZE; k++)
c[i][j] = c[i][j] + a[i][k]*b[k][j];
• Two Inner Loops: • Read all NxN elements of b[ ] (N = SIZE)
• Read N elements of 1 row of a[ ] repeatedly
• Write N elements of 1 row of c[ ]
• Capacity Misses a function of N & Cache Size: • If 3 NxN matrices fit in cache => no capacity misses; otherwise ...
• Idea: compute on BxB submatrix that fits
Adapted from Lebeck and Porter (creative commons)
156
Blocking (Tiling) Example
/* After */
for(ii = 0; ii < SIZE; ii += B)
for (jj = 0; jj < SIZE; jj += B)
for (kk = 0; kk < SIZE; kk +=B)
for(i = ii; i < MIN(ii+B,SIZE); i++)
for (j = jj; j < MIN(jj+B,SIZE); j++)
for (k = kk; k < MIN(kk+B,SIZE); k++)
c[i][j] = c[i][j] + a[i][k]*b[k][j];
• Capacity Misses decrease: from 2N^3 + N^2 to 2N^3/B + N^2
• B called Blocking Factor (Also called Tile Size)
Adapted from Lebeck and Porter (creative commons)
157
Hilbert curves: A fancy trick for matrix locality
• Turn a 1D value into an n-dimensional “walk” of a cube space (like a 2D or 3D matrix) in a manner that maximizes locality
• Extra overhead to compute curve path, but computation takes no memory, and cache misses are very expensive, so it may be worth it
• (Actual algorithm for these curves is simple and easy to find)
158
Brief History of DRAM
• DRAM (memory): a major force behind computer industry • Modern DRAM came with introduction of IC (1970)
• Preceded by magnetic “core” memory (1950s)
• More closely resembles today’s disks than memory
• And by mercury delay lines before that (ENIAC)
• Re-circulating vibrations in mercury tubes
“the one single development that put computers on their feet was the
invention of a reliable form of memory, namely the core memory… Its cost was reasonable, it was reliable, and because it was reliable it could in due course be made large”
Maurice Wilkes
Memoirs of a Computer Pioneer, 1985
159
SRAM
• SRAM: “6T” cells • 6 transistors per bit
• 4 for the cross-coupled inverters (CCI)
• 2 access transistors
• Static • CCIs hold state
• To read • Equalize, swing, amplify
• To write • Overwhelm
[Figure: SRAM array: address decoder drives word lines; data/~data bit-line pairs feed sense amps (SA)]
160
DRAM
• DRAM: dynamic RAM • Bits as capacitors
• Transistors as ports
• “1T” cells: one access transistor per bit
• “Dynamic” means • Capacitors not connected to pwr/gnd
• Stored charge decays over time
• Must be explicitly refreshed
• Designed for density + ~6–8X denser than SRAM
– But slower too
[Figure: DRAM array: 1T cells, one data bit line per column, sense amps (SA)]
161
DRAM Read (simplified version)
• Bit line pre-charged to 0.5 (think: pipe half full)
• Storage at 1 (think: tank full of water)
[Figure: stored value = 1, bit line = 0.5]
162
DRAM Read (simplified version)
• Bit-line and capacitor equalize • Think: opening valve between pipe + tank
• Settle out a bit above 0.5 if 1 was stored • A bit less if 0 was stored
[Figure: stored value = 0.55, bit line = 0.55]
163
DRAM Read (simplified version)
• Destroyed the stored value in the process • Could not read this again: change too small to detect
[Figure: stored value = 0.55, bit line = 0.55]
164
DRAM Operation I
• Sense amps detect small swing • Amplify into 0 or 1
• This read: very slow • Why? No Vcc/Gnd connection in storage
• Need to deal with destructive reads: • Might want to read again...
• Also need to be able to write
165
DRAM Operation I
• Add some d-latches (row buffer) • Ok to use d-latches, not DFFs
• No path from output->input when enabled
• Also add a tri-state path back • From the d-latch to the bit-line
• Can drive the output of the d-latch onto bit lines
• After we read, drive the value back
• “Refill” (or re-empty) the capacitor
[Figure: DRAM array with sense amps (SA) and row-buffer d-latches (DL)]
166
DRAM Read (better version)
• SA amplifies 0.55 -> 1
• DL is enabled: latches the 1
• Tri-state disabled
[Figure: stored value = 0.55, bit line = 0.55; SA output = 1, DL output = 1, tri-state output = Z]
167
DRAM Read (better version)
• Enable tri-state • Drives 1 back up bit-line
[Figure: stored value = 0.55, bit line = 0.55; SA output = 1, DL output = 1, tri-state output = 1]
168
DRAM Read (better version)
• Enable tri-state • Drives 1 back up bit-line
• Starts to push value back up towards 1 (takes time)
[Figure: stored value = 0.75, bit line = 0.75; SA output = 1, DL output = 1, tri-state output = 1]
169
DRAM Read (better version)
• Enable tri-state • Drives 1 back up bit-line
• Starts to push value back up towards 1 (takes time)
• Eventually restores value.
[Figure: stored value = 1, bit line = 1; SA output = 1, DL output = 1, tri-state output = 1]
170
DRAM Operation
• Open row (read bits -> row buffer)
• Read “columns” • Mux selects right part of RB
• Send data on bus -> processor
• Write “columns” • Change values in the row buffer's d-latches
• May read/write multiple columns
• Close row • Close access transistors
• Pre-charge bit lines
• Row must remain open long enough
• Must fully restore capacitors
[Figure: DRAM bank: row address opens a row of the bit array into the row buffer (through the SAs); column address muxes read/write data]
171
DRAM Refresh
• DRAM periodically refreshes all contents • Loops through all rows
• Open row (read -> RB)
• Leave row open long enough
• Close row
• 1–2% of DRAM time occupied by refresh
172
Aside: Non-Volatile CMOS Storage
• Before we leave the subject of CMOS storage technology…
• Another important kind: flash • “Floating gate”: not connected to any conductor/semi-conductor
• Quantum tunneling involved in writing it
• Effectively no leakage (key feature)
• Non-volatile: remembers state when power is off
• Slower than DRAM
• Wears out with writes
• Eventually writes just do not work
173
Memory Bus
• Memory bus: connects CPU package with main memory • Has its own clock
• Typically slower than CPU internal clock: 100–500MHz vs. 3GHz
• Synchronous DRAM (SDRAM) used in “real” main memories operates on this clock
• Is often itself internally pipelined
• Clock implies bandwidth: 100MHz → start new transfer every 10ns
• Clock doesn’t imply latency: 100MHz does not mean a transfer takes 10ns
• DRAM is slower than this but can pipeline multiple accesses
• Bandwidth is more important: determines peak performance
174
Memory Latency and Bandwidth
• Nominal clock frequency applies to CPU and caches
• Careful when doing calculations • Clock frequency increases don’t reduce memory or bus latency
• May make misses come out faster
• At some point memory bandwidth may become a bottleneck
• Further increases in clock speed won’t help at all
175
Clock Frequency Example
• Baseline setup • Processor clock: 1GHz.
• 20% loads, 15% stores, 20% branches, 45% ALU
• Branches: 3, ALU/stores: 4, Loads: 4 + tavgL1
• L1 D$: thit = 1 cycle, 10% miss
• L2$: thit = 20 cycles, 5% miss
• Memory: 200 cycles
tavgL2 = 20 + 0.05 * 200 = 30
tavgL1 = 1 + 0.10 * 30 = 4
Average load latency = 4 + 4 = 8
CPI = 0.2 * 8 + 0.15 * 4 + 0.2 * 3 + 0.45 * 4 = 4.6
Performance = 217 MIPS
The clock rate is 1GHz, or 1e9 cycles/second. The CPI is 4.6 cycles/instruction. (1e9 cycles/second) / (4.6 cycles/instruction) = 217,391,304 instructions/second = 217 MIPS
176
Clock Frequency Example
• Baseline setup • Processor clock: 2GHz.
• 20% loads, 15% stores, 20% branches, 45% ALU
• Branches: 3, ALU/stores: 4, Loads: 4 + tavgL1
• L1 D$: thit = 1 cycle, 10% miss
• L2$: thit = 20 cycles, 5% miss
• Memory: 400 cycles
tavgL2 = 20 + 0.05 * 400 = 40
tavgL1 = 1 + 0.10 * 40 = 5
Average load latency = 4 + 5 = 9
CPI = 0.2 * 9 + 0.15 * 4 + 0.2 * 3 + 0.45 * 4 = 4.8
The clock rate is 2GHz, or 2e9 cycles/second. Performance = (2e9 cycles/second) / (4.8 cycles/instruction) = 416,666,666 instructions/second ≈ 417 MIPS (92% speedup, for 100% freq increase)
177
Clock Frequency Example
• Baseline setup • Processor clock: 4GHz.
• 20% loads, 15% stores, 20% branches, 45% ALU
• Branches: 3, ALU/stores: 4, Loads: 4 + tavgL1
• L1 D$: thit = 1 cycle, 10% miss
• L2$: thit = 20 cycles, 5% miss
• Memory: 800 cycles
tavgL2 = 20 + 0.05 * 800 = 60
tavgL1 = 1 + 0.10 * 60 = 7
Average load latency = 4 + 7 = 11
CPI = 0.2 * 11 + 0.15 * 4 + 0.2 * 3 + 0.45 * 4 = 5.2
The clock rate is 4GHz, or 4e9 cycles/second. Performance = (4e9 cycles/second) / (5.2 cycles/instruction) = 769,230,769 instructions/second ≈ 769 MIPS (84% speedup, for 100% freq increase)
178
Actually a Bit Worse..
• Only looked at D$ miss impact • Ignored store misses: assumed storebuffer can keep up
• Also have I$ misses
• At some point, become bandwidth constrained • Effectively makes tmiss go up (think of a traffic jam)
• Also makes things we ignored matter
• Storebuffer may not be able to keep up as well -> store stalls
• Data we previously prefetched may not arrive in time
• Effectively makes %miss go up
179
Clock Frequency and Real Programs
Detailed Simulation Results
- Includes all caches, bandwidth,...
- Has L3 on separate clock
- Real programs
- 2.0 GHz -> 5.0 GHz (150% increase)
hmmer:
- Very low %miss
- Good performance for clock
- 125% speedup
lbm, milc:
- Very high %miss
- Not much performance gained
- lbm: 32%
- milc: 14%
180
Summary
• tavg = thit + %miss * tmiss • thit and %miss in one component? Difficult
• Memory hierarchy • Capacity: smaller, low thit → bigger, low %miss
• 10/90 rule, temporal/spatial locality
• Technology: expensive → cheaper
• SRAM → DRAM → Disk: reasonable total cost
• Organizing a memory component • ABC, write policies
• 3C miss model: how to eliminate misses?
• Technologies: • DRAM, SRAM, Flash