Appendix C: Memory Hierarchy
Why care about memory hierarchies?

[Figure: performance (log scale, 1 to 100,000) vs. year, 1980-2010. Processor performance pulls away from memory performance: the processor-memory performance gap keeps growing.]

Major source of stall cycles: memory accesses
Levels of the Memory Hierarchy

| Level | Capacity | Access Time | Cost | Staging/Transfer Unit | Managed By |
|---|---|---|---|---|---|
| Registers | 100s of bytes | < 0.5 ns | - | Instr. operands (1-8 bytes) | program/compiler |
| Cache | KBs | 1 ns | 1-0.1 cents/bit | Blocks (8-128 bytes) | cache controller |
| Main memory | MBs | 100 ns | 10^-4 - 10^-5 cents/bit | Pages (512 B - 4 KB) | OS |
| Disk | GBs | 10 ms (10,000,000 ns) | 10^-5 - 10^-6 cents/bit | Files (MBs) | user/operator |
| Tape | infinite | sec-min | 10^-8 cents/bit | - | - |

Upper levels are smaller and faster; lower levels are larger and slower.
Motivating memory hierarchies

Two structures that hold data:
- Registers: small array of storage
- Memory: large array of storage

What characteristics would we like memory to have?
- High capacity
- Low latency
- Low cost

Can't satisfy these requirements with one memory technology.
Memory hierarchy

Solution: use a little bit of everything!
- Small SRAM array (cache): small means fast and cheap
- Larger DRAM array (main memory): hope you rarely have to use it
- Extremely large disk: costs are decreasing at a faster rate than we fill them
Terminology

- Find the data you want at a given level: hit
- Data is not present at that level: miss
  - In this case, check the next lower level
- Hit rate: fraction of accesses that hit at a given level
  - (1 - hit rate) = miss rate
- Another performance measure: average memory access time (AMAT)

AMAT = (hit time) + (miss rate) x (miss penalty)
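The AMAT formula above is easy to sanity-check in code. This small Python helper (not from the slides; the 1-cycle hit time, 5% miss rate, and 100-cycle penalty are made-up numbers for illustration) evaluates exactly the arithmetic given:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in cycles: hit time plus the
    expected stall contributed by misses."""
    return hit_time + miss_rate * miss_penalty

# 1-cycle hit, 5% misses, 100-cycle penalty:
print(amat(1, 0.05, 100))  # 6.0
```

Note that even a small miss rate dominates once the miss penalty is large, which is the whole motivation for the hierarchy.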
Memory hierarchy operation

- We'd like most accesses to use the cache: the fastest level of the hierarchy
- But the cache is much smaller than the address space
- Most caches have a hit rate > 80%
  - How is that possible? The cache holds the data most likely to be accessed
Principle of locality

Programs don't access data randomly; they display locality in two forms:
- Temporal locality: if you access a memory location (e.g., 1000), you are more likely to re-access that location than some random location
- Spatial locality: if you access a memory location (e.g., 1000), you are more likely to access a location near it (e.g., 1001) than some random location
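Both forms of locality can be seen in a simple address trace. The sketch below (Python, not from the slides; the trace and the two counting functions are invented for illustration) counts re-references to a recently used address (temporal) and references adjacent to the previous one (spatial):

```python
# A toy trace of byte addresses: a small loop over 1000-1003, then a stray access.
trace = [1000, 1001, 1002, 1003, 1000, 1001, 2000, 1000]

def temporal_reuses(trace, window=4):
    """Count accesses that repeat an address seen in the last `window` accesses."""
    return sum(1 for i, a in enumerate(trace) if a in trace[max(0, i - window):i])

def spatial_neighbors(trace, radius=1):
    """Count accesses within `radius` bytes of the immediately preceding access."""
    return sum(1 for a, b in zip(trace, trace[1:]) if abs(b - a) <= radius)

print(temporal_reuses(trace), spatial_neighbors(trace))  # 3 4
```

A random trace over a large address range would score near zero on both counts; real programs score high, which is why caching works.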
Cache Basics

- Fast (but small) memory close to the processor
- When data is referenced:
  - If in cache, use the cache instead of memory
  - If not in cache, bring it into the cache (actually, bring in the entire block of data)
  - May have to kick something else out to do it!
- Important decisions:
  - Placement: where in the cache can a block go?
  - Identification: how do we find a block in the cache?
  - Replacement: what do we kick out to make room in the cache?
  - Write policy: what do we do about stores?
4 Questions for Memory Hierarchy

- Q1: Where can a block be placed in the upper level? (Block placement)
- Q2: How is a block found if it is in the upper level? (Block identification)
- Q3: Which block should be replaced on a miss? (Block replacement)
- Q4: What happens on a write? (Write strategy)
Q1: Cache Placement

Placement: which memory blocks are allowed into which cache lines.

Placement policies:
- Direct mapped (block can go to only one line)
- Fully associative (block can go to any line)
- Set-associative (block can go to one of N lines)
  - E.g., if N=4, the cache is 4-way set associative
  - The other two policies are extremes of this (e.g., N=1 gives a direct-mapped cache)
Q1: Block placement

Example: block 12 placed in an 8-block cache. Set-associative mapping: set = block number mod number of sets.

- Fully associative: block 12 can go in any of the 8 lines
- Direct mapped: line (12 mod 8) = 4
- 2-way set associative: set (12 mod 4) = 0
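The three policies above are all the same modulo computation with different set sizes. A minimal sketch (Python, invented helper name) that reproduces the block-12 example:

```python
def candidate_lines(block, num_lines, assoc):
    """Lines where `block` may be placed in a cache with `num_lines` lines
    and associativity `assoc` (assoc == num_lines means fully associative)."""
    num_sets = num_lines // assoc
    s = block % num_sets                      # set index
    return list(range(s * assoc, (s + 1) * assoc))

# Block 12 in an 8-line cache:
print(candidate_lines(12, 8, 1))  # direct mapped: [4]       (12 mod 8 = 4)
print(candidate_lines(12, 8, 2))  # 2-way: set 12 mod 4 = 0, lines [0, 1]
print(candidate_lines(12, 8, 8))  # fully associative: all 8 lines
```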
Q2: Cache Identification

When an address is referenced, we need to:
- Find whether its data is in the cache
- If it is, find where in the cache

This is called a cache lookup.

Each cache line must have:
- A valid bit (1 if the line has data, 0 if the line is empty)
  - We also say the cache line is valid or invalid
- A tag to identify which block is in the line (if the line is valid)
Q2: Block identification

- Tag on each block
  - No need to check the index or block offset
- Increasing associativity shrinks the index and expands the tag

Address layout (block address = tag + index):

[ Tag | Index | Block offset ]
Address breakdown

- Block offset: byte address within the block
  - # block offset bits = log2(block size)
- Index: line (or set) number within the cache
  - # index bits = log2(# of cache lines)
- Tag: remaining bits

[ Tag | Index | Block offset ]
Address breakdown example

Given the following:
- 32-bit address
- 32 KB direct-mapped cache
- Each block has 64 bytes

What are the sizes of the tag, index, and block offset fields?
- index = 9 bits, since there are 32 KB / 64 B = 2^9 blocks
- block offset = 6 bits, since each block has 64 B = 2^6 bytes
- tag = 32 - 9 - 6 = 17 bits
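The field widths above follow mechanically from the log2 formulas. A small sketch (Python; the function name and the optional associativity parameter are additions for illustration) that reproduces the 17/9/6 split:

```python
from math import log2

def address_fields(addr_bits, cache_bytes, block_bytes, assoc=1):
    """Bit widths of (tag, index, block offset) for a set-associative cache;
    assoc=1 is direct mapped."""
    offset_bits = int(log2(block_bytes))
    num_sets = cache_bytes // (block_bytes * assoc)
    index_bits = int(log2(num_sets))
    tag_bits = addr_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits

# 32-bit address, 32 KB direct-mapped cache, 64-byte blocks:
print(address_fields(32, 32 * 1024, 64))  # (17, 9, 6)
```

Raising associativity to 2 halves the number of sets, so the index loses a bit and the tag gains one, matching the earlier observation.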
Q3: Block replacement

When we need to evict a line, what do we choose?
- Easy choice for direct-mapped
- What about set-associative or fully associative?
- We want to choose the data that is least likely to be used next
  - Temporal locality suggests that's the line that was accessed farthest in the past
- Least recently used (LRU)
  - Hard to implement exactly in hardware; often approximated
- Random (randomly selected line)
- FIFO (line that has been in the cache the longest)
Q4: What happens on a write?

| | Write-Through | Write-Back |
|---|---|---|
| Policy | Data written to the cache block is also written to lower-level memory | Write data only to the cache; update the lower level when a block falls out of the cache |
| Debug | Easy | Hard |
| Do read misses produce writes? | No | Yes |
| Do repeated writes make it to the lower level? | Yes | No |
Write Policy

Do we allocate cache lines on a write?
- Write-allocate: a write miss brings the block into the cache
- No-write-allocate: a write miss leaves the cache as it was

Do we update memory on writes?
- Write-through: memory is immediately updated on each write
- Write-back: memory is updated when the line is replaced
Write Buffers for Write-Through Caches

[Diagram: Processor and Cache, with a Write Buffer between the cache and lower-level memory. The write buffer holds data awaiting write-through to lower-level memory.]

- Q: Why a write buffer? A: So the CPU doesn't stall.
- Q: Why a buffer, why not just one register? A: Bursts of writes are common.
Write-Back Caches

- Need a dirty bit for each line
  - A dirty line has more recent data than memory
  - A line starts as clean (not dirty) and becomes dirty on the first write to it
  - Memory is not updated yet; the cache has the only up-to-date copy of the data for a dirty line
- Replacing a dirty line
  - Must write the data back to memory (write-back)
Basic cache design

- Cache memory can copy data from any part of main memory
  - Tag: memory address
  - Block: actual data
- On each access, compare the address with the tag
  - If they match: hit! Get the data from the cache block
  - If they don't: miss. Get the data from main memory
Cache organization

- A cache consists of multiple tag/block pairs, called cache lines (or blocks)
  - Lines can be searched in parallel (within reason)
  - Each line also has a valid bit
  - Write-back caches have a dirty bit
- Note that block sizes can vary
  - Most systems use between 32 and 128 bytes
  - Larger blocks exploit spatial locality
  - Larger block size means a smaller tag
Direct-mapped cache example

Assume the following simple setup:
- Only 2 levels to the hierarchy
- 16-byte memory, 4-bit addresses
- Cache organization: direct-mapped, 8 total bytes, 2 bytes per block, 4 lines, write-back

This leads to the following address breakdown:
- Offset: 1 bit
- Index: 2 bits
- Tag: 1 bit
Direct-mapped cache example: initial state

Instructions:
lb $t0, 1($zero)
lb $t1, 8($zero)
sb $t1, 4($zero)
sb $t0, 13($zero)
lb $t1, 9($zero)

Memory (byte address: value): 0: 78, 1: 29, 2: 120, 3: 123, 4: 71, 5: 150, 6: 162, 7: 173, 8: 18, 9: 21, 10: 33, 11: 28, 12: 19, 13: 200, 14: 210, 15: 225

Cache: all four lines start with V = 0, D = 0, Tag = 0, Data = 0 0.

Registers: $t0 = ?, $t1 = ?
Direct-mapped cache example: access #1

lb $t0, 1($zero). Address = 1 = 0001 (binary): Tag = 0, Index = 00, Offset = 1.

Line 00 is invalid, so this is a miss. The block holding memory bytes 0-1 is brought in: line 00 becomes V = 1, D = 0, Tag = 0, Data = 78 29.

Result: $t0 = 29. Hits: 0, Misses: 1.
Direct-mapped cache example: access #2

lb $t1, 8($zero). Address = 8 = 1000 (binary): Tag = 1, Index = 00, Offset = 0.

Line 00 is valid, but its tag (0) does not match, so this is a miss. The line is clean, so the block holding memory bytes 8-9 simply replaces it: line 00 becomes V = 1, D = 0, Tag = 1, Data = 18 21.

Result: $t1 = 18. Hits: 0, Misses: 2.
Direct-mapped cache example: access #3

sb $t1, 4($zero). Address = 4 = 0100 (binary): Tag = 0, Index = 10, Offset = 0.

Line 10 is invalid, so this is a miss. The block holding memory bytes 4-5 is brought in, then $t1 is written into byte 0 of the block: line 10 becomes V = 1, D = 1, Tag = 0, Data = 18 150.

Result: memory is not updated yet (write-back). Hits: 0, Misses: 3.

(M. Geiger, CIS 570 Lec. 13, 04/20/23)
Direct-mapped cache example: access #4

sb $t0, 13($zero). Address = 13 = 1101 (binary): Tag = 1, Index = 10, Offset = 1.

Line 10 is valid, but its tag (0) does not match, so this is a miss. The line is dirty, so its block must first be written back: memory bytes 4-5 become 18 150. Then the block holding memory bytes 12-13 is brought in and $t0 is written into byte 1: line 10 becomes V = 1, D = 1, Tag = 1, Data = 19 29.

Result: memory byte 4 now holds 18. Hits: 0, Misses: 4.
Direct-mapped cache example: access #5

lb $t1, 9($zero). Address = 9 = 1001 (binary): Tag = 1, Index = 00, Offset = 1.

Line 00 is valid and its tag (1) matches, so this is a hit! Byte 1 of the cached block is returned.

Result: $t1 = 21. Hits: 1, Misses: 4.
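The whole walkthrough can be checked mechanically. Below is a minimal sketch (Python, not part of the original slides; the `Cache` class and `MEM` array are invented for illustration) of the toy machine: 16-byte memory, 4-bit addresses, and a direct-mapped, write-back, write-allocate cache with four 2-byte lines.

```python
# Initial memory contents from the example (byte addresses 0-15).
MEM = [78, 29, 120, 123, 71, 150, 162, 173, 18, 21, 33, 28, 19, 200, 210, 225]

class Cache:
    def __init__(self):
        # Four lines, each: valid bit, dirty bit, tag, 2-byte block.
        self.lines = [{"v": 0, "d": 0, "tag": 0, "data": [0, 0]} for _ in range(4)]
        self.hits = self.misses = 0

    def _lookup(self, addr):
        # 4-bit address = 1 tag bit | 2 index bits | 1 offset bit.
        offset, index, tag = addr & 1, (addr >> 1) & 3, addr >> 3
        line = self.lines[index]
        if line["v"] and line["tag"] == tag:
            self.hits += 1
        else:
            self.misses += 1
            if line["v"] and line["d"]:
                # Write the dirty victim block back to memory first.
                base = (line["tag"] << 3) | (index << 1)
                MEM[base:base + 2] = line["data"]
            base = addr & ~1                      # fetch the whole 2-byte block
            line.update(v=1, d=0, tag=tag, data=MEM[base:base + 2])
        return line, offset

    def load(self, addr):
        line, offset = self._lookup(addr)
        return line["data"][offset]

    def store(self, addr, value):
        line, offset = self._lookup(addr)
        line["data"][offset] = value
        line["d"] = 1                             # mark line dirty

c = Cache()
t0 = c.load(1)   # miss: block 0-1 loaded, t0 = 29
t1 = c.load(8)   # miss: block 8-9 replaces the clean line, t1 = 18
c.store(4, t1)   # miss: block 4-5 loaded, byte written, line dirty
c.store(13, t0)  # miss: dirty block written back (MEM[4] becomes 18)
t1 = c.load(9)   # hit: t1 = 21
print(t0, t1, c.hits, c.misses)  # 29 21 1 4
```

Running it reproduces the slides' final state: 1 hit, 4 misses, $t0 = 29, $t1 = 21, and memory byte 4 updated by the write-back.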
Cache performance

Simplified model:

CPU time = (CPU clock cycles + memory stall cycles) x cycle time

memory stall cycles = # of misses x miss penalty
= IC x misses/instruction x miss penalty
= IC x memory accesses/instruction x miss rate x miss penalty

Average CPI = CPI(without stalls) + memory accesses/instruction x miss rate x miss penalty

AMAT = hit time + miss rate x miss penalty
Example

A computer has CPI = 1 when all accesses hit. Loads and stores are 50% of instructions. If the miss penalty is 25 cycles and the miss rate is 2%, how much faster would the computer be if all instructions were cache hits?

- For all hits: CPU time = (IC x CPI + 0) x CCT = IC x 1.0 x CCT
- Real cache with stalls:
  - Memory stall cycles = IC x (1 + 0.5) x 0.02 x 25 = IC x 0.75
  - CPU time = (IC x 1.0 + IC x 0.75) x CCT = 1.75 x IC x CCT
- Speedup = (1.75 x IC x CCT) / (IC x CCT) = 1.75
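The arithmetic in this example can be replayed directly; this sketch (Python, variable names invented) follows the same steps, with 1 instruction fetch plus 0.5 data accesses per instruction:

```python
# CPI = 1 on all hits; loads/stores are 50% of instructions;
# 2% miss rate; 25-cycle miss penalty.
accesses_per_instr = 1 + 0.5                      # fetch + data
stall_cpi = accesses_per_instr * 0.02 * 25        # 0.75 stall cycles/instr
cpi_real = 1.0 + stall_cpi                        # 1.75
speedup = cpi_real / 1.0                          # vs. the all-hit machine
print(speedup)  # 1.75
```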
Average memory access time

For a unified cache:
AMAT = (hit time) + (miss rate) x (miss penalty)

For a split cache:
AMAT = %instructions x (hit time + instruction miss rate x miss penalty) + %data x (hit time + data miss rate x miss penalty)

For a multi-level cache:
AMAT = hit time(L1) + miss rate(L1) x miss penalty(L1)
= hit time(L1) + miss rate(L1) x (hit time(L2) + miss rate(L2) x miss penalty(L2))

Miss rate(L2) is measured on the leftovers from the L1 cache.
Example (split cache vs. unified cache)

Which has the lower miss rate: a 16 KB instruction cache with a 16 KB data cache, or a 32 KB unified cache? The misses per 1000 instructions for the instruction, data, and unified caches are 3.82, 40.9, and 43.3, respectively. Assume 36% of instructions are data transfer instructions, a hit takes 1 clock cycle, and the miss penalty is 200 clock cycles. A load or store hit takes 1 extra clock cycle on a unified cache. What is the AMAT?

Find miss rate = (misses/instruction) / (memory accesses/instruction):
- Miss rate(I) = 3.82/1000 / 1 = 0.004
- Miss rate(D) = 40.9/1000 / 0.36 = 0.114
- Miss rate(U) = 43.3/1000 / (1 + 0.36) = 0.0318
- Miss rate(split) = 74% x 0.004 + 26% x 0.114 = 0.0326

The 32 KB unified cache has a slightly lower miss rate.
Example (cont.)

AMAT = %instructions x (hit time + instruction miss rate x miss penalty) + %data x (hit time + data miss rate x miss penalty)

AMAT(split) = 74% x (1 + 0.004 x 200) + 26% x (1 + 0.114 x 200) = 7.52

AMAT(unified) = 74% x (1 + 0.0318 x 200) + 26% x (1 + 1 + 0.0318 x 200) = 7.62
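These two AMAT figures follow from plugging the miss rates into the split-cache formula; a quick check (Python, variable names invented):

```python
miss_rate_i, miss_rate_d, miss_rate_u = 0.004, 0.114, 0.0318
frac_i, frac_d = 0.74, 0.26          # instruction vs. data accesses
penalty = 200

amat_split = (frac_i * (1 + miss_rate_i * penalty)
              + frac_d * (1 + miss_rate_d * penalty))
# Unified cache: a load/store hit pays 1 extra cycle.
amat_unified = (frac_i * (1 + miss_rate_u * penalty)
                + frac_d * (2 + miss_rate_u * penalty))
print(round(amat_split, 2), round(amat_unified, 2))  # 7.52 7.62
```

So despite its slightly higher miss rate, the split cache wins on AMAT because the unified cache pays the extra cycle on load/store hits.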
Another example (multilevel cache)

Suppose that in 1000 memory references there are 40 misses in the L1 cache and 20 misses in the L2 cache. What are the miss rates? Assume the miss penalty from the L2 cache to memory is 200 clock cycles, the hit time of the L2 cache is 10 clock cycles, and the hit time of L1 is 1 clock cycle. What is the AMAT?

- Miss rate(L1) = 40/1000 = 4%
- Miss rate(L2) = 20/40 = 50% (measured on the leftovers from L1)

AMAT = hit time(L1) + miss rate(L1) x miss penalty(L1)
= hit time(L1) + miss rate(L1) x (hit time(L2) + miss rate(L2) x miss penalty(L2))
= 1 + 4% x (10 + 50% x 200) = 5.4 clock cycles
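The multilevel formula nests one AMAT inside another; this sketch (Python, variable names invented) reproduces the 5.4-cycle result:

```python
hit_l1, hit_l2, penalty_l2 = 1, 10, 200
miss_rate_l1 = 40 / 1000        # 4%: local = global rate for L1
miss_rate_l2 = 20 / 40          # 50%: local rate, measured on L1's misses
amat = hit_l1 + miss_rate_l1 * (hit_l2 + miss_rate_l2 * penalty_l2)
print(round(amat, 1))  # 5.4
```

Note the key subtlety the slide flags: the L2 miss rate is local (20 misses out of the 40 references that reached L2), not 20/1000.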
Reasons for cache misses

AMAT = (hit time) + (miss rate) x (miss penalty). Reducing misses improves performance.

The three C's:
- Compulsory miss: first reference to an address
  - Reduce by increasing the block size
- Capacity miss: the cache is too small to hold the data
  - Reduce by increasing the cache size
- Conflict miss: replaced from a busy line or set; would have been a hit in a fully associative cache
  - Reduce by increasing associativity
Six Basic Cache Optimizations

Reducing miss rate:
1. Larger block size (compulsory misses)
2. Larger cache size (capacity misses)
3. Higher associativity (conflict misses)

Reducing miss penalty:
4. Multilevel caches
5. Giving read misses priority over writes (e.g., a read completes before earlier writes still in the write buffer)

Reducing hit time:
6. Avoiding address translation during cache indexing
Problems with memory

- DRAM is too expensive to buy many gigabytes
- We need our programs to work even if they require more memory than we have
  - A program that works on a machine with 512 MB should still work on a machine with 256 MB
- Most systems run multiple programs
Solutions

- Leave the problem up to the programmer
  - Assumes the programmer knows the exact configuration
- Overlays
  - The compiler identifies mutually exclusive regions
- Virtual memory
  - Use hardware and software to automatically translate references from virtual addresses (what the programmer sees) to physical addresses (an index into DRAM or disk)
Benefits of virtual memory

[Diagram: the CPU issues virtual addresses; address translation hardware, managed by the operating system (OS), maps each virtual address to a physical address before it reaches memory (address lines A0-A31, data lines D0-D31).]

- User programs run in a standardized virtual address space
- Address translation hardware, managed by the OS, maps virtual addresses to physical memory
- Hardware supports "modern" OS features: protection, translation, sharing
Managing virtual memory

- Effectively treat main memory as a cache
  - Blocks are called pages
  - Misses are called page faults
- A virtual address consists of a virtual page number and a page offset:

[ Virtual page number (bits 31-12) | Page offset (bits 11-0) ]
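Splitting a virtual address into page number and offset is pure bit manipulation. A minimal sketch (Python; assumes the slide's 32-bit address with 4 KB pages, i.e., a 12-bit offset):

```python
PAGE_OFFSET_BITS = 12  # 4 KB pages

def split_va(va):
    """Return (virtual page number, page offset) for a virtual address."""
    return va >> PAGE_OFFSET_BITS, va & ((1 << PAGE_OFFSET_BITS) - 1)

vpn, offset = split_va(0x12345678)
print(hex(vpn), hex(offset))  # 0x12345 0x678
```

The offset passes through translation unchanged; only the page number is mapped to a physical frame.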
Page tables encode virtual address spaces

- A virtual address space is divided into blocks of memory called pages
- A machine usually supports pages of a few sizes (e.g., MIPS R4000)
- A valid page table entry codes the physical memory "frame" address for the page
- A page table is indexed by a virtual address

[Diagram: a virtual address indexes the page table, whose entries map each virtual page to a frame in the physical memory space.]
![Page 51: Appendix C Memory Hierarchy. Why care about memory hierarchies? Processor-Memory Performance Gap Growing Major source of stall cycles: memory accesses](https://reader033.vdocument.in/reader033/viewer/2022051820/56649f1c5503460f94c31b98/html5/thumbnails/51.jpg)
Details of Page Table
Page table maps virtual page numbers to physical frames ("PTE" = Page Table Entry)
Virtual memory => treat main memory as a cache for disk
[Figure: the virtual address splits into a virtual page number and a 12-bit page offset; the Page Table Base Register plus the virtual page number index into the page table, which is itself located in physical memory; each PTE holds a valid bit (V), access rights, and a physical address (PA); the physical page number plus the unchanged offset form the physical address]
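The lookup sketched in the figure can be modeled in a few lines of Python. This is an illustrative software model only (real translation is done by MMU hardware and the OS); the 4 KB page size matches the 12-bit offset on the slide, and the dict standing in for the page table is an assumption.

```python
PAGE_SHIFT = 12                     # 4 KB pages -> 12-bit page offset
PAGE_MASK = (1 << PAGE_SHIFT) - 1

class PageFault(Exception):
    """Raised when the PTE's valid bit is clear; the OS would service it."""

def translate(vaddr, page_table):
    """Map a virtual address to a physical one via a single-level page table.
    page_table: dict of virtual page number -> (valid_bit, frame_number)."""
    vpn = vaddr >> PAGE_SHIFT       # virtual page number (upper bits)
    offset = vaddr & PAGE_MASK      # page offset passes through untranslated
    valid, frame = page_table.get(vpn, (0, None))
    if not valid:
        raise PageFault(hex(vaddr))
    return (frame << PAGE_SHIFT) | offset

# Example: virtual page 0x12345 maps to physical frame 0x00042,
# so virtual 0x12345ABC becomes physical 0x42ABC.
pt = {0x12345: (1, 0x00042)}
print(hex(translate(0x12345ABC, pt)))
```

Note that the offset bits are never translated; only the page number goes through the table, which is why page size determines the split.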
Two-level Page Tables (paging the page table)
A table for 4KB pages for a 32-bit address space has 1M entries
Each process needs its own address space!
32-bit virtual address: P1 index (bits 31-22), P2 index (bits 21-12), page offset (bits 11-0)
Top-level table wired in main memory
Subset of 1024 second-level tables in main memory; rest are on disk or unallocated
04/20/23, M. Geiger, CIS 570 Lec. 13
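The field widths above (10-bit P1 index, 10-bit P2 index, 12-bit offset) can be checked with a small sketch; the sample address is made up for illustration:

```python
def split_two_level(vaddr):
    """Split a 32-bit virtual address per the layout on the slide:
    bits 31-22 index the top-level table (P1), bits 21-12 index a
    second-level table (P2), bits 11-0 are the page offset."""
    p1 = (vaddr >> 22) & 0x3FF      # 10 bits -> 1024 top-level entries
    p2 = (vaddr >> 12) & 0x3FF      # 10 bits -> 1024 entries per 2nd-level table
    offset = vaddr & 0xFFF          # 12 bits -> 4 KB pages
    return p1, p2, offset

print(split_two_level(0x00403004))  # -> (1, 3, 4)
```

The payoff is that only the top-level table and the second-level tables actually in use must be resident; the roughly 4 MB of PTEs per process (1M entries, assuming 4-byte PTEs) need not all sit in main memory.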
VM and Disk: Page replacement policy
Used bit: set to 1 on any reference
Dirty bit: set when the page is written
Head pointer: clears the used bit in the page table
Tail pointer: places pages on the free list if the used bit is still clear; schedules pages with the dirty bit set to be written to disk
[Figure: the two pointers sweep over the set of all pages in memory; freed pages go onto the free list]
Architect's role: support setting the dirty and used bits
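The used-bit sweep above is essentially the clock (second-chance) algorithm. A minimal one-hand sketch, with each page represented as a dict of its used and dirty bits (a hypothetical layout, not from the slide):

```python
def clock_evict(pages, hand):
    """Advance the clock hand to pick a victim page.
    Pages with used == 1 get a second chance (bit cleared, hand moves on);
    the first page found with used == 0 is the victim.
    Returns (victim_index, needs_writeback, next_hand)."""
    n = len(pages)
    while True:
        page = pages[hand]
        if page["used"]:
            page["used"] = 0                # clear used bit, spare the page
            hand = (hand + 1) % n
        else:
            # Dirty victims must be written back to disk before reuse.
            return hand, bool(page["dirty"]), (hand + 1) % n

frames = [{"used": 1, "dirty": 0}, {"used": 0, "dirty": 1}, {"used": 1, "dirty": 0}]
print(clock_evict(frames, 0))               # frame 1 is the victim, and it is dirty
```

The slide's two-pointer variant separates the clearing (head) and reclaiming (tail) passes so that a page gets a full sweep interval to prove it is still in use.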
Virtual memory performance
Address translation requires a physical memory access to read the page table
Must then access physical memory again to actually get the data
Each load performs at least 2 memory reads
Each store performs at least 1 memory read followed by a write
Improving virtual memory performance
Use a cache for common translations: the translation lookaside buffer (TLB)
[Figure: each TLB entry holds a valid bit, a tag (the virtual page number), and the physical page number; the page offset bypasses translation]
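A TLB can be modeled as a tiny cache of recent VPN-to-frame translations. A fully associative sketch; the FIFO replacement policy and entry count are assumptions for illustration, not details from the slide:

```python
from collections import OrderedDict

class TLB:
    """Tiny fully associative TLB model: virtual page number -> frame number."""
    def __init__(self, num_entries=16):
        self.num_entries = num_entries
        self.entries = OrderedDict()

    def lookup(self, vpn):
        """Return the frame on a hit, or None on a miss (page-table walk needed)."""
        return self.entries.get(vpn)

    def fill(self, vpn, frame):
        """Install a translation after a miss, evicting the oldest entry if full."""
        if vpn not in self.entries and len(self.entries) >= self.num_entries:
            self.entries.popitem(last=False)    # FIFO eviction
        self.entries[vpn] = frame

tlb = TLB(num_entries=2)
tlb.fill(0x12345, 0x42)
print(tlb.lookup(0x12345))   # hit: frame number
print(tlb.lookup(0x99999))   # miss: None, so walk the page table
```

On a hit, the load avoids the extra page-table read counted on the previous slide; only TLB misses pay for the walk.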
Caches and virtual memory
Two different addresses, virtual and physical: which should we use to access the cache?
Physical address
  Pros: simpler to manage
  Cons: slower access
Virtual address
  Pros: faster access
  Cons: aliasing, difficult management
Use both: virtually indexed, physically tagged
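The virtually indexed, physically tagged trick works when the cache index is drawn entirely from the page offset, which is identical in the virtual and physical addresses. A quick sketch; the 64-byte blocks and resulting 64 sets are assumed parameters:

```python
PAGE_OFFSET_BITS = 12    # 4 KB pages
BLOCK_OFFSET_BITS = 6    # 64-byte cache blocks (assumed)
INDEX_BITS = PAGE_OFFSET_BITS - BLOCK_OFFSET_BITS   # 6 bits -> 64 sets

def cache_index(addr):
    """Set index from bits 11-6, all of which lie inside the page offset."""
    return (addr >> BLOCK_OFFSET_BITS) & ((1 << INDEX_BITS) - 1)

# Same page offset, different page numbers: the index is identical, so the
# cache set can be selected with the virtual address while the TLB translates
# the page number in parallel for the physical tag compare.
vaddr = (0x12345 << 12) | 0xABC      # virtual page 0x12345
paddr = (0x00042 << 12) | 0xABC      # physical frame 0x00042
print(cache_index(vaddr) == cache_index(paddr))   # -> True
```

This is why VIPT caches constrain (sets x block size) to at most the page size per way: indexing must not depend on any translated bit.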
Three Advantages of Virtual Memory
Translation:
  A program can be given a consistent view of memory, even though physical memory is scrambled
  Makes multithreading reasonable (now used a lot!)
  Only the most important part of the program (the "working set") must be in physical memory
  Contiguous structures (like stacks) use only as much physical memory as necessary, yet can still grow later
Protection:
  Different threads (or processes) are protected from each other
  Different pages can be given special behavior (read only, invisible to user programs, etc.)
  Kernel data is protected from user programs
  Very important for protection from malicious programs
Sharing:
  Can map the same physical page to multiple users ("shared memory")
  Allows programs to share the same physical memory without knowing what else is there
Makes memory appear larger than it actually is
Average memory access time
AMAT = (hit time) + (miss rate) x (miss penalty)
Given the following:
  Cache: 1 cycle access time
  Memory: 100 cycle access time
  Disk: 10,000 cycle access time
What is the average memory access time if the cache hit rate is 90% and the memory hit rate is 80%?
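One way to evaluate the question above: the cache's miss penalty is itself the AMAT of the next level down, so the formula is applied recursively, level by level.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """AMAT = hit time + miss rate x miss penalty, all in cycles."""
    return hit_time + miss_rate * miss_penalty

# Memory level: 100-cycle hits; 20% of its accesses miss to the 10,000-cycle disk.
memory_amat = amat(100, 0.20, 10_000)     # 100 + 0.20 * 10,000 = 2100 cycles
# Cache level: 1-cycle hits; 10% of accesses miss to memory.
overall = amat(1, 0.10, memory_amat)      # 1 + 0.10 * 2100 = 211 cycles
print(overall)
```

The disk term dominates: even a 20% memory miss rate contributes 2000 of the 2100-cycle memory-level penalty, which is why page faults must be extremely rare in practice.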