Memory Hierarchy Design
Outline
• Introduction
• Review of the ABCs of Caches
• Cache Performance
• Reducing Cache Miss Penalty
• Reducing Miss Rate
• Reducing Cache Miss Penalty or Miss Rate Via Parallelism
• Reducing Hit Time
• Main Memory and Organizations for Improving Performance
• Memory Technology
• Virtual Memory
• Protection and Examples of Virtual Memory
• Assignment Questions
5.1 Introduction
Memory Hierarchy Design
• Motivated by the principle of locality: a 90/10 type of rule
  – Take advantage of 2 forms of locality
    • Spatial: nearby references are likely
    • Temporal: the same reference is likely again soon
• Also motivated by cost/performance structures
  – Smaller hardware is faster: SRAM, DRAM, disk, tape
  – Access vs. bandwidth variations
  – Fast memory is more expensive
• Goal: provide a memory system with cost almost as low as the cheapest level and speed almost as fast as the fastest level
DRAM/CPU Gap
• CPU performance improves at 55% per year
  – In 1996 it was a phenomenal 18% per month
• DRAM has improved at 7% per year
Levels in A Typical Memory Hierarchy
Sample Memory Hierarchy
5.2 Review of the ABCs of Caches
36 Basic Terms on Caches
cache; fully associative; write allocate
virtual memory; dirty bit; unified cache
memory stall cycles; block offset; misses per instruction
direct mapped; write back; block
valid bit; data cache; locality
block address; hit time; address trace
write through; cache miss; set
instruction cache; page fault; random placement
average memory access time; miss rate; index field
cache hit; n-way set associative; no-write allocate
page; least-recently used; write buffer
miss penalty; tag field; write stall
Cache
• The first level of the memory hierarchy encountered once the address leaves the CPU
  – Persistent mismatch between CPU and main-memory speeds
  – Exploit the principle of locality by providing a small, fast memory between CPU and main memory: the cache memory
• The term "cache" is now applied whenever buffering is employed to reuse commonly occurring items (e.g., file caches)
• Caching: copying information into a faster storage system
  – Main memory can be viewed as a cache for secondary storage
General Hierarchy Concepts
• At each level the block concept is present (the block is the caching unit)
  – Block size may vary depending on level
    • Amortize longer access by bringing in a larger chunk
    • Works if the locality principle holds
  – Hit: access where the block is present; hit rate is the probability of a hit
  – Miss: access where the block is absent and must come from a lower level; miss rate is the probability of a miss
• Mirroring and consistency
  – Data residing in a higher level is a subset of the data in the lower level
  – Changes at the higher level must be reflected down, sometime
    • The policy for "sometime" is the consistency mechanism
• Addressing
  – Whatever the organization, you have to know how to get at it!
  – Address checking and protection
Physical Address Structure
• Key is that you want different block sizes at different levels
Latency and Bandwidth
• The time required for the cache miss depends on both latency and bandwidth of the memory (or lower level)
• Latency determines the time to retrieve the first word of the block
• Bandwidth determines the time to retrieve the rest of this block
• A cache miss is handled by hardware and causes processors following in-order execution to pause or stall until the data are available
Predicting Memory Access Times
• On a hit: simple access time to the cache
• On a miss: access time + miss penalty
  – Miss penalty = access time of lower level + block transfer time
  – Block transfer time depends on
    • Block size: bigger blocks mean longer transfers
    • Bandwidth between the two levels of memory
      – Bandwidth usually dominated by the slower memory and the bus protocol
• Performance
  – Average-Memory-Access-Time = Hit-Access-Time + Miss-Rate * Miss-Penalty
  – Memory-stall-cycles = IC * Memory-references-per-instruction * Miss-Rate * Miss-Penalty
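The two performance formulas on this slide can be written as small functions. This is a minimal sketch; the function names and the sample numbers in the usage lines are illustrative assumptions, not values from the slide.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty.

    hit_time and miss_penalty must use the same unit (cycles or ns).
    """
    return hit_time + miss_rate * miss_penalty


def memory_stall_cycles(ic, refs_per_instr, miss_rate, miss_penalty):
    """Memory stall cycles = IC * references per instruction * miss rate * miss penalty."""
    return ic * refs_per_instr * miss_rate * miss_penalty


# Illustrative inputs (assumed): 1-cycle hit, 2% miss rate, 50-cycle penalty.
print(amat(1, 0.02, 50))
print(memory_stall_cycles(1000, 1.33, 0.02, 50))
```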
Block Sizes, Miss Rates & Penalties, Accesses
Typical Memory Hierarchy Parameters for WS or SS
Typical Parameters in Modern CPU
Headaches of Memory Hierarchies
• CPU never knows for sure if an access will hit
• How deep will a miss be, i.e., what is the miss penalty?
  – If short, the CPU just waits
  – If long, it is probably best to work on something else: task switch
    • Implies that the amount can be predicted with reasonable accuracy
    • Task switch had better be fast or productivity/efficiency will suffer
• Implies some new needs
  – More hardware accounting
  – Software-readable accounting information (address trace)
Four Standard Questions
• Block placement
  – Where can a block be placed in the upper level?
• Block identification
  – How is a block found if it is in the upper level?
• Block replacement
  – Which block should be replaced on a miss?
• Write strategy
  – What happens on a write?
Answer the four questions for the first level of the memory hierarchy
Block Placement Options
• Direct mapped
  – (Block address) MOD (# of cache blocks)
• Fully associative
  – Can be placed anywhere
• Set associative
  – A set is a group of n blocks; each block is called a way
  – Block first mapped into a set: (Block address) MOD (# of cache sets)
  – Placed anywhere in the set
• Most caches are direct mapped, 2-way, or 4-way set associative
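The three placement options differ only in the number of ways per set, so one function can cover them. The sketch below is an assumption-laden illustration: the function name, the frame numbering (frames of a set stored contiguously), and the 8-block cache in the usage lines are all made up for the example.

```python
def candidate_frames(block_address, num_blocks, ways):
    """Return the cache frames where a memory block may be placed.

    ways == 1          -> direct mapped
    ways == num_blocks -> fully associative
    otherwise          -> n-way set associative
    """
    num_sets = num_blocks // ways
    set_index = block_address % num_sets  # (Block address) MOD (# of sets)
    first_frame = set_index * ways        # assumed layout: sets are contiguous
    return list(range(first_frame, first_frame + ways))


# Hypothetical 8-block cache, memory block 12:
print(candidate_frames(12, 8, 1))  # direct mapped
print(candidate_frames(12, 8, 2))  # 2-way set associative
print(candidate_frames(12, 8, 8))  # fully associative
```

With one way per set the MOD is taken over the number of blocks, and with a single set every frame is a candidate, which matches the two extremes on the slide.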
Block Placement Options (Cont.)
Continuum of levels of set associativity
(m=0) (m=3) (m=2)
Block Identification
• Each cache block carries tags
• Address tags: which block am I?
  – Physical address now: address tag ## set index ## block offset
  – Note relationship of block size, cache size, and tag size
  – The smaller the set tag, the cheaper it is to find the block
• Status tags: what state is the block in?
  – Valid, dirty, etc.
Physical address fields: r (address tag) | m (set index) | n (block offset)
  – 2^n bytes per block
  – 2^m addressable sets in the cache
  – Physical address = r + m + n bits
Many memory blocks may map to the same cache block
Block Identification (Cont.)
Physical address fields: r (address tag) | m (set index) | n (block offset)
  – Physical address = r + m + n bits
  – 2^n bytes per block
  – 2^m addressable sets in the cache
• Caches have an address tag on each block frame that gives the block address.
• A valid bit says whether or not this entry contains a valid address.
• The block frame address can be divided into the tag field and the index field.
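Splitting a physical address into its r-bit tag, m-bit set index, and n-bit block offset is a matter of shifts and masks. A minimal sketch follows; the function name and the sample address are illustrative assumptions.

```python
def split_address(addr, index_bits, offset_bits):
    """Split a physical address into (tag, set index, block offset).

    index_bits is m (2^m sets) and offset_bits is n (2^n bytes per block);
    the tag is whatever remains above those fields.
    """
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


# Assumed example: tag 5, set 3, offset 5 with m = 9 and n = 6.
addr = (5 << 15) | (3 << 6) | 5
print(split_address(addr, index_bits=9, offset_bits=6))
```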
Block Replacement
• Random: just pick one and chuck it
  – Simple hash game played on target block frame address
  – Some use truly random
    • But lack of reproducibility is a problem at debug time
• LRU: least recently used
  – Need to keep time since each block was last accessed
    • Expensive if number of blocks is large due to global compare
    • Hence an approximation is often used: a use-bit tag and LFU
• FIFO
Only one choice for direct-mapped placement
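To make the LRU policy concrete, here is a minimal sketch of exact LRU for a single cache set, using an ordered dictionary to track recency. The class name `LRUSet` and the tag values are hypothetical; real hardware usually approximates this bookkeeping, as the slide notes.

```python
from collections import OrderedDict


class LRUSet:
    """One cache set with exact least-recently-used replacement."""

    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()  # least recently used tag comes first

    def access(self, tag):
        """Return True on a hit, False on a miss (the block is then loaded)."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)     # now the most recently used
            return True
        if len(self.blocks) >= self.ways:
            self.blocks.popitem(last=False)  # evict the least recently used
        self.blocks[tag] = None
        return False


s = LRUSet(ways=2)
print([s.access(t) for t in ["A", "B", "A", "C", "B"]])
```

In this trace the access to C evicts B (A was touched more recently), so the final access to B misses again.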
Data Cache Misses Per 1000 Instructions
64-byte blocks on an Alpha using 10 SPEC2000 benchmarks
Short Summaries from the Previous Figure
• Higher associativity is better for small caches
• 2- or 4-way associative caches perform similarly to 8-way associative for larger caches
• Larger cache size is better
• LRU is the best for small block sizes
• Random works fine for large caches
• FIFO outperforms random in smaller caches
• Little difference between LRU and random for larger caches
Improving Cache Performance
• MIPS mix is 10% stores and 37% loads
  – Writes are about 10%/(100%+10%+37%) = 7% of overall memory traffic, and 10%/(10%+37%) = 21% of data cache traffic
• Make the common case fast
  – Implies optimizing caches for reads
• Read optimizations
  – Block can be read concurrently with the tag comparison
  – On a hit, the read information is passed on
  – On a miss, discard the block that was read and start the miss access
• Write optimizations
  – Can't modify the block until after the tag check, hence writes take longer
Write Options
• Write through: write posted to cache line and through to next lower level
  – Incurs write stall (use an intermediate write buffer to reduce the stall)
• Write back
  – Only write to cache, not to lower level
  – Implies that cache and main memory are now inconsistent
    • Mark the line with a dirty bit
    • If this block is replaced and dirty, then write it back
• Pros and cons: both are useful
  – Write through
    • No write on read miss, simpler to implement, no inconsistency with main memory
  – Write back
    • Uses less main memory bandwidth, write times independent of main memory speeds
    • Multiple writes within a block require only one write to the main memory
Write Miss Options
• Two choices for implementation
  – Write allocate, or fetch on write
    • Load the block into cache, and then do the write in cache
    • Usually the choice for write-back caches
  – No-write allocate, or write around
    • Modify the block where it is, but do not load the block in the cache
    • Usually the choice for write-through caches
    • Danger: goes against the grain of the locality principle
    • But other delayed completion games are possible
Example
• Fully associative write-back cache with many cache entries that start empty
• Read/Write sequence
  – Write Mem[100];
  – Write Mem[100];
  – Read Mem[200];
  – Write Mem[200];
  – Write Mem[100]
• Four misses and one hit for no-write allocate; two misses and three hits for write allocate
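The hit and miss counts above can be checked with a tiny simulation. This sketch models the fully associative cache as an unbounded set of block addresses (an assumption that matches "many cache entries that start empty"); the function name and trace encoding are illustrative.

```python
def count_hits_misses(trace, write_allocate):
    """Replay (op, address) pairs against an initially empty cache.

    Reads always allocate on a miss; writes allocate only under
    the write-allocate policy.
    """
    cache = set()
    hits = misses = 0
    for op, addr in trace:
        if addr in cache:
            hits += 1
        else:
            misses += 1
            if op == "R" or write_allocate:
                cache.add(addr)
    return hits, misses


trace = [("W", 100), ("W", 100), ("R", 200), ("W", 200), ("W", 100)]
print(count_hits_misses(trace, write_allocate=False))  # (hits, misses)
print(count_hits_misses(trace, write_allocate=True))
```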
Different Memory-Hierarchy Consideration for Desktop, Server, Embedded System
• Desktops are concerned more with average latency, whereas servers are also concerned about memory bandwidth
• Servers
  – The importance of protection escalates
  – Have greater bandwidth demands
  – More context switches increase compulsory miss rates
• Embedded systems
  – Worry about worst-case performance, whereas caches improve average-case performance
  – Power and battery life: less hardware, so fewer hardware-intensive optimizations
  – Protection role is diminished
  – Often no disk storage
  – Write-back is more attractive
The Alpha AXP 21264 Data Cache
• The cache contains 65,536 bytes of data in 64-byte blocks with two-way set associative placement (total 512 sets in the cache), write back, and write allocate on a write miss
• The 44-bit physical address is divided into three fields: the 29-bit Tag, 9-bit Index, and 6-bit block offset
• Although each block is 64 bytes, only 8 bytes within a block are accessed at a time
  – 3 bits from the block offset are used to index the proper 8 bytes
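The field widths quoted for the 21264 data cache follow directly from its size, block size, and associativity. The sketch below recomputes them; the function name is hypothetical and it assumes all sizes are powers of two.

```python
def cache_fields(cache_bytes, block_bytes, ways, addr_bits):
    """Return (number of sets, tag bits, index bits, offset bits).

    Assumes cache_bytes, block_bytes, and the set count are powers of two.
    """
    num_sets = cache_bytes // (block_bytes * ways)
    offset_bits = block_bytes.bit_length() - 1  # log2 of the block size
    index_bits = num_sets.bit_length() - 1      # log2 of the set count
    tag_bits = addr_bits - index_bits - offset_bits
    return num_sets, tag_bits, index_bits, offset_bits


# 64 KB, 64-byte blocks, two-way set associative, 44-bit physical address.
print(cache_fields(65536, 64, 2, 44))
```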
The Alpha AXP 21264 Data Cache (Cont.)
The Alpha AXP 21264 Data Cache (Cont.)
• Read hit: the 4 steps take three clock cycles; instructions in the following two clock cycles would wait if they tried to use the load result
• Read miss: 64 bytes are read from the next level
  – Block replacement: FIFO with a round-robin bit
    • Update data, address tag, valid bit, and the round-robin bit
  – Write back with one dirty bit per block
  – 8 victim buffers (or write buffers)
    • If the victim buffer is full, the cache must wait
The Alpha AXP 21264 Data Cache (Cont.)
• Write hit: the first three steps are the same as read. Since 21264 executes out-of-order, only after it signals the instruction has committed and the cache tag comparison indicates a hit are the data written to the cache
• Write miss: similar to read miss (write allocate)
• Separate instruction and data caches
  – Each has 64KB
Unified vs. Split Cache
• Instruction cache and data cache
• Unified cache
  – Structural hazards for load and store operations
• Split cache
  – Most recent processors choose split cache
  – Separate ports for instruction and data caches: double bandwidth
  – Opportunity of optimizing each cache separately: different capacity, block sizes, and associativity
Unified vs. Split Cache
Misses per 1000 instructions for instruction, data, and unified caches. Instruction references are about 74% of all references. The data are for 2-way associative caches with 64-byte blocks
5.3 Cache Performance
Cache Performance
Cache Performance Example
• Each instruction takes 2 clock cycles (ignoring memory stalls)
• Cache miss penalty: 50 clock cycles
• Miss rate = 2%
• Average of 1.33 memory references per instruction
• Ideal: IC * 2 * cycle-time
• With cache: IC * (2 + 1.33 * 2% * 50) * cycle-time = IC * 3.33 * cycle-time
• No cache: IC * (2 + 1.33 * 100% * 50) * cycle-time = IC * 68.5 * cycle-time
• The importance of the cache is greater for CPUs with lower CPI and higher clock rates (Amdahl's Law)
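The three cases in this example differ only in the miss rate fed into the same expression, so they can be compared as effective CPI values. A minimal sketch with a hypothetical function name:

```python
def effective_cpi(base_cpi, refs_per_instr, miss_rate, miss_penalty):
    """CPI including memory stalls: base CPI + stall cycles per instruction."""
    return base_cpi + refs_per_instr * miss_rate * miss_penalty


ideal = effective_cpi(2, 1.33, 0.0, 50)       # perfect cache, never misses
with_cache = effective_cpi(2, 1.33, 0.02, 50)
no_cache = effective_cpi(2, 1.33, 1.0, 50)    # every reference pays the penalty

print(ideal, round(with_cache, 2), round(no_cache, 2))
print("slowdown with cache vs. ideal:", round(with_cache / ideal, 3))
print("slowdown without cache vs. ideal:", round(no_cache / ideal, 3))
```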
Average Memory Access Time VS CPU Time
• Compare two different cache organizations
  – Miss rate: direct-mapped (1.4%), 2-way associative (1.0%)
  – Clock cycle time: direct-mapped (2.0 ns), 2-way associative (2.2 ns)
• CPI with a perfect cache = 2.0; average memory references per instruction = 1.3; miss penalty = 70 ns; hit time = 1 CC
• Average memory access time (Hit time + Miss_rate * Miss_penalty)
  – AMAT(Direct) = 1 * 2 + (1.4% * 70) = 2.98 ns
  – AMAT(2-way) = 1 * 2.2 + (1.0% * 70) = 2.90 ns
• CPU time
  – CPU(Direct) = IC * (2 * 2 + 1.3 * 1.4% * 70) = 5.27 * IC
  – CPU(2-way) = IC * (2 * 2.2 + 1.3 * 1.0% * 70) = 5.31 * IC
Since CPU time is our bottom-line evaluation, and since direct mapped is simpler to build, the preferred cache is direct mapped in this example
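The comparison above is easy to reproduce: AMAT favors the 2-way cache while CPU time per instruction favors the direct-mapped one, because the longer clock cycle is paid on every instruction. The function names below are illustrative assumptions; all times are in nanoseconds.

```python
def amat_ns(hit_cycles, cycle_ns, miss_rate, miss_penalty_ns):
    """Average memory access time in ns."""
    return hit_cycles * cycle_ns + miss_rate * miss_penalty_ns


def cpu_time_per_instr_ns(cpi, cycle_ns, refs_per_instr, miss_rate, miss_penalty_ns):
    """CPU time per instruction in ns (multiply by IC for total time)."""
    return cpi * cycle_ns + refs_per_instr * miss_rate * miss_penalty_ns


for name, cycle_ns, miss_rate in [("direct", 2.0, 0.014), ("2-way", 2.2, 0.010)]:
    print(name,
          round(amat_ns(1, cycle_ns, miss_rate, 70), 2),
          round(cpu_time_per_instr_ns(2.0, cycle_ns, 1.3, miss_rate, 70), 2))
```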
Unified and Split Cache
• Unified: 32KB cache; split: 16KB instruction cache and 16KB data cache
• Hit time: 1 clock cycle; miss penalty: 100 clock cycles
• A load/store hit takes 1 extra clock cycle for the unified cache
• 36% of instructions are loads/stores, so references to the cache are 74% instruction and 26% data
• Miss rate (16KB instruction) = 3.82/1000/1.0 = 0.004
• Miss rate (16KB data) = 40.9/1000/0.36 = 0.114
• Miss rate for split cache = (74% * 0.004) + (26% * 0.114) = 0.0324 (computed from the unrounded miss rates)
• Miss rate for unified cache = 43.3/1000/(1+0.36) = 0.0318
• Average-memory-access-time = % inst * (hit-time + inst-miss-rate * miss-penalty) + % data * (hit-time + data-miss-rate * miss-penalty)
• AMAT(Split) = 74% * (1 + 0.004 * 100) + 26% * (1 + 0.114 * 100) = 4.24 (computed from the unrounded miss rates)
• AMAT(Unified) = 74% * (1 + 0.0318 * 100) + 26% * (1 + 1 + 0.0318 * 100) = 4.44
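The figures on this slide can be reproduced from the misses-per-1000-instructions data. The sketch below keeps the miss rates unrounded throughout, which is what yields 0.0324 and 4.24; plugging in the rounded rates 0.004 and 0.114 gives slightly larger values. Variable names are illustrative.

```python
INST_FRAC, DATA_FRAC = 0.74, 0.26  # share of cache references
HIT_TIME, MISS_PENALTY = 1, 100    # clock cycles

inst_miss_rate = 3.82 / 1000 / 1.0          # one fetch per instruction
data_miss_rate = 40.9 / 1000 / 0.36         # 0.36 data refs per instruction
unified_miss_rate = 43.3 / 1000 / (1 + 0.36)

split_miss_rate = INST_FRAC * inst_miss_rate + DATA_FRAC * data_miss_rate

amat_split = (INST_FRAC * (HIT_TIME + inst_miss_rate * MISS_PENALTY)
              + DATA_FRAC * (HIT_TIME + data_miss_rate * MISS_PENALTY))
# Loads/stores pay 1 extra cycle on the single-ported unified cache.
amat_unified = (INST_FRAC * (HIT_TIME + unified_miss_rate * MISS_PENALTY)
                + DATA_FRAC * (HIT_TIME + 1 + unified_miss_rate * MISS_PENALTY))

print(round(split_miss_rate, 4), round(unified_miss_rate, 4))
print(round(amat_split, 2), round(amat_unified, 2))
```

The split organization wins on AMAT even though its overall miss rate is marginally higher, because it avoids the extra structural-hazard cycle on data accesses.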
Improving Cache Performance
• Average-memory-access-time = Hit-time + Miss-rate * Miss-penalty
• Strategies for improving cache performance
  – Reducing the miss penalty
  – Reducing the miss rate
  – Reducing the miss penalty or miss rate via parallelism
  – Reducing the time to hit in the cache
5.4 Reducing Cache Miss Penalty
Techniques for Reducing Miss Penalty
• Multilevel Caches (the most important)
• Critical Word First and Early Restart
• Giving Priority to Read Misses over Writes
• Merging Write Buffer
• Victim Caches
Multi-Level Caches
• Probably the best miss-penalty reduction
• Performance measurement for 2-level caches
  – AMAT = Hit-time-L1 + Miss-rate-L1 * Miss-penalty-L1
  – Miss-penalty-L1 = Hit-time-L2 + Miss-rate-L2 * Miss-penalty-L2
  – AMAT = Hit-time-L1 + Miss-rate-L1 * (Hit-time-L2 + Miss-rate-L2 * Miss-penalty-L2)
Multi-Level Caches (Cont.)
• Definitions:
  – Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss-rate-L2)
  – Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss-rate-L1 x Miss-rate-L2)
  – Global miss rate is what matters
• Advantages:
  – Capacity misses in L1 end up with a significant penalty reduction since they likely will get supplied from L2
    • No need to go to main memory
  – Conflict misses in L1 similarly will get supplied by L2
Miss Rate Example
• Suppose that in 1000 memory references there are 40 misses in the first-level cache and 20 misses in the second-level cache
  – Miss rate for the first-level cache = 40/1000 (4%)
  – Local miss rate for the second-level cache = 20/40 (50%)
  – Global miss rate for the second-level cache = 20/1000 (2%)
Miss Rate Example (Cont.)
• Assume miss-penalty-L2 is 100 CC, hit-time-L2 is 10 CC, hit-time-L1 is 1 CC, and there are 1.5 memory references per instruction. What are the average memory access time and the average stall cycles per instruction? Ignore the impact of writes.
  – AMAT = Hit-time-L1 + Miss-rate-L1 * (Hit-time-L2 + Miss-rate-L2 * Miss-penalty-L2) = 1 + 4% * (10 + 50% * 100) = 3.4 CC
  – Average memory stalls per instruction = Misses-per-instruction-L1 * Hit-time-L2 + Misses-per-instruction-L2 * Miss-penalty-L2 = (40*1.5/1000) * 10 + (20*1.5/1000) * 100 = 3.6 CC
    • Or (3.4 - 1.0) * 1.5 = 3.6 CC
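The two-level calculation can be checked numerically from the raw miss counts. This is a sketch with illustrative variable and function names; the inputs are the ones given in the example.

```python
def amat_two_level(hit_l1, miss_rate_l1, hit_l2, local_miss_rate_l2, penalty_l2):
    """AMAT = Hit-time-L1 + Miss-rate-L1 * (Hit-time-L2 + Miss-rate-L2 * Miss-penalty-L2)."""
    return hit_l1 + miss_rate_l1 * (hit_l2 + local_miss_rate_l2 * penalty_l2)


refs, l1_misses, l2_misses = 1000, 40, 20
refs_per_instr = 1.5

miss_rate_l1 = l1_misses / refs         # 4%
local_miss_rate_l2 = l2_misses / l1_misses  # 50%
global_miss_rate_l2 = l2_misses / refs  # 2%

amat = amat_two_level(1, miss_rate_l1, 10, local_miss_rate_l2, 100)
stalls_per_instr = (refs_per_instr * miss_rate_l1 * 10
                    + refs_per_instr * global_miss_rate_l2 * 100)

print(round(amat, 2), round(stalls_per_instr, 2))
# Cross-check: stalls per instruction = (AMAT - L1 hit time) * refs per instruction.
print(round((amat - 1) * refs_per_instr, 2))
```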
Comparing Local and Global Miss Rates
32KB L1 cache
More assumptions are shown in the legend of Figure 5.10
Relative Execution Time by L2-Cache Size
Cache size is what matters
Reference execution time of 1.0 is for an 8192KB L2 cache with 1 CC latency on an L2 hit
Comparing Local and Global Miss Rates
• Huge 2nd-level caches
• Global miss rate is close to the single-level cache miss rate, provided L2 >> L1
• Global cache miss rate should be used when evaluating second-level caches (or 3rd, 4th, … levels of hierarchy)
• L2 sees many fewer hits than L1, so the target is to reduce misses
Impact of L2 Cache Associativity
• Hit-time-L2
  – Direct mapped = 10 CC; 2-way set associative = 10.1 CC (usually rounded to an integral number of CC, 10 or 11 CC)
• Local-miss-rate-L2
  – Direct mapped = 25%; 2-way set associative = 20%
• Miss-penalty-L2 = 100 CC
• Miss-penalty-L1 = Hit-time-L2 + Local-miss-rate-L2 * Miss-penalty-L2
  – Direct mapped = 10 + 25% * 100 = 35 CC
  – 2-way (10 CC) = 10 + 20% * 100 = 30 CC
  – 2-way (11 CC) = 11 + 20% * 100 = 31 CC
Critical Word First and Early Restart
• Do not wait for the full block to be loaded before restarting the CPU
  – Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first
  – Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
• Benefits of critical word first and early restart depend on
  – Block size: generally useful only with large blocks
  – Likelihood of another access to the portion of the block that has not yet been fetched
    • Spatial locality problem: programs tend to want the next sequential word, so the benefit is not clear
Giving Priority to Read Misses Over Writes
• In write through, write buffers complicate memory access in that they might hold the updated value of a location needed on a read miss
  – RAW conflicts with main memory reads on cache misses
• If a read miss waits until the write buffer is empty, the read miss penalty increases (by 50% on the old MIPS 1000 with its 4-word buffer)
• Instead, check the write buffer contents before the read, and if there are no conflicts, let the memory access continue
• Write back?
  – Read miss replacing dirty block
  – Normal: write the dirty block to memory, and then do the read
  – Instead: copy the dirty block to a write buffer, then do the read, and then do the write
  – CPU stalls less since it restarts as soon as the read is done
SW R3, 512(R0)
LW R1, 1024(R0)
LW R2, 512(R0)
Merging Write Buffer
• An entry of the write buffer often contains multiple words. However, a write often involves a single word
  – A single-word write occupies the whole entry if there is no write merging
• Write merging: check to see if the address of the new data matches the address of a valid write buffer entry. If so, the new data are combined with that entry
• Advantage
  – Multi-word writes are usually faster than single-word writes
  – Reduces the stalls due to the write buffer being full
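The effect of write merging on buffer occupancy can be sketched with a simple count of entries used. Everything specific here is an assumption for illustration: entries hold 4 words of 8 bytes, entries are aligned to their size, and the buffer does not drain while the writes arrive.

```python
def entries_used(addresses, words_per_entry=4, word_bytes=8, merge=True):
    """Count write buffer entries occupied by a sequence of single-word writes.

    Without merging, each write takes its own entry. With merging, writes
    that fall in the same aligned entry-sized region share one entry.
    """
    if not merge:
        return len(addresses)
    entry_bytes = words_per_entry * word_bytes
    return len({addr // entry_bytes for addr in addresses})


writes = [100, 108, 116, 124]  # four sequential 8-byte writes (assumed)
print(entries_used(writes, merge=False))
print(entries_used(writes, merge=True))
```

With merging, the four sequential writes occupy a single entry, which leaves the remaining entries free and makes a buffer-full stall less likely.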
Write-Merging Illustration
Victim Caches
• Remember what was just discarded in case it is needed again
• Add a small fully associative cache (called a victim cache) between the cache and the refill path
  – Contains only blocks discarded from the cache because of a miss
  – Is checked on a miss to see if it has the desired data before going to the next lower level of memory
    • If yes, swap the victim block and cache block
  – Address both the victim cache and the regular cache at the same time
    • The penalty will not increase
• Jouppi (DEC SRC) shows miss reduction of 20% to 95%
  – For a 4KB direct-mapped cache with 1 to 5 victim blocks
Victim Cache Organization
5.5 Reducing Miss Rate
Classify Cache Misses - 3 C’s
• Compulsory: independent of cache size
  – First access to a block: no choice but to load it
  – Also called cold-start or first-reference misses
  – Measured by an infinite cache (ideal)
• Capacity: decrease as cache size increases
  – The cache cannot contain all the blocks needed during execution, so blocks that are discarded will later be retrieved
  – Measured by a fully associative cache
• Conflict (collision): decrease as associativity increases
  – Side effect of set-associative or direct mapping
  – A block may be discarded and later retrieved if too many blocks map to the same cache block
Miss Distributions vs. the 3 C’s (Total Miss Rate)
Compulsory misses: independent of cache size
Capacity misses: decrease as capacity increases
Conflict misses: decrease as associativity increases
![Page 63: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/63.jpg)
63
Miss Distributions
Normalized to direct-mapped organization
![Page 64: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/64.jpg)
64
Techniques for Reducing Miss Rate
• Larger Block Size
• Larger Caches
• Higher Associativity
• Way Prediction and Pseudo-associative Caches
• Compiler optimizations
Larger Block Sizes
• Obvious advantage: reduces compulsory misses
  – The reason is spatial locality
• Obvious disadvantages
  – Higher miss penalty: a larger block takes longer to move
  – May increase conflict misses and capacity misses if the cache is small
Don't let the increase in miss penalty outweigh the decrease in miss rate
Miss Rate VS. Block Size
Larger blocks may increase conflict and capacity misses
Actual Miss Rate VS. Block Size
Miss Rate VS. Miss Penalty
• Assume the memory system takes 80 CC of overhead and then delivers 16 bytes every 2 CC. Hit time = 1 CC
• Miss penalty
  – Block size 16 = 80 + 2 = 82
  – Block size 32 = 80 + 2 * 2 = 84
  – Block size 256 = 80 + 16 * 2 = 112
• AMAT = hit_time + miss_rate * miss_penalty
  – 256-byte blocks in a 256 KB cache = 1 + 0.49% * 112 = 1.549 CC
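The miss penalty grows linearly with block size under this memory model, which the sketch below reproduces. The function name is hypothetical; the 0.49% miss rate is the one quoted on the slide for 256-byte blocks in a 256 KB cache.

```python
def miss_penalty(block_bytes, overhead=80, bytes_per_transfer=16, cycles_per_transfer=2):
    """Miss penalty in clock cycles: fixed overhead plus 2 CC per 16 bytes moved."""
    transfers = -(-block_bytes // bytes_per_transfer)  # ceiling division
    return overhead + transfers * cycles_per_transfer


for size in (16, 32, 256):
    print(size, miss_penalty(size))

# AMAT for 256-byte blocks in a 256 KB cache (miss rate 0.49%, hit time 1 CC).
print(round(1 + 0.0049 * miss_penalty(256), 3))
```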
AMAT VS. Block Size for Different-Size Caches
Large Caches
• Help with both conflict and capacity misses
• May need longer hit time and/or higher HW cost
• Popular in off-chip caches
Higher Associativity
• An 8-way set-associative cache is, for practical purposes, as effective at reducing misses as a fully associative one
• 2:1 rule of thumb
– A 2-way set-associative cache of size N/2 has about the same miss rate as a direct-mapped cache of size N (holds for cache sizes < 128 KB)
• Greater associativity comes at the cost of increased hit time
– Lengthens the clock cycle
– Hill [1988] suggested the hit-time increase for 2-way vs. 1-way: external cache +10%, internal +2%
![Page 72: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/72.jpg)
72
Effect of Higher Associativity for AMAT
Clock-cycle-time (2-way) = 1.10 * Clock-cycle-time (1-way)
Clock-cycle-time (4-way) = 1.12 * Clock-cycle-time (1-way)
Clock-cycle-time (8-way) = 1.14 * Clock-cycle-time (1-way)
![Page 73: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/73.jpg)
73
Way Prediction
• Extra bits are kept in cache to predict the way, or block within the set of the next cache access
• Multiplexor is set early to select the desired block, and only a single tag comparison is performed that clock cycle
• A miss results in checking the other blocks for matches in subsequent clock cycles
• Alpha 21264 uses way prediction in its 2-way set-associative instruction cache. Simulation using SPEC95 suggested way prediction accuracy is in excess of 85%
![Page 74: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/74.jpg)
74
Pseudo-Associative Caches
• Attempt to get the miss rate of set-associative caches and the hit speed of direct-mapped cache
• Idea
– Start with a direct-mapped cache
– On a miss, check another entry
– The usual method is to invert the high-order index bit to get the next entry to try
  • e.g., index 010111 → 110111
• Problem: fast hit and slow hit
– If most hits turn out to be slow hits, it is better to swap the two blocks
• Drawback: CPU pipelining is hard if a hit takes 1 or 2 cycles
– Better for caches not tied directly to the processor (L2)
– Used in the MIPS R10000 L2 cache, similar in UltraSPARC
![Page 75: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/75.jpg)
75
Relationship Between a Regular Hit Time, Pseudo Hit Time and Miss Penalty
[Figure: timeline comparing hit time, pseudo hit time, and miss penalty]
![Page 76: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/76.jpg)
76
Effect of Pseudo-Associative Caches
• Assume that it takes two extra cycles to find the entry in the alternative location (1 to check and 1 to swap)
• AMAT = Hit-time + Miss-rate * Miss-penalty
– Miss-penalty is 1 cycle more than a normal miss penalty (why??)
– Miss-rate * Miss-penalty = Miss-rate(2-way) * Miss-penalty(1-way)
– Hit-time = Hit-time(1-way) + Alternate-hit-rate * 2
– Alternate-hit-rate = Hit-rate(2-way) - Hit-rate(1-way) = Miss-rate(1-way) - Miss-rate(2-way)
• AMAT(pseudo) = 4.950 (2K), 1.371 (128K)
• AMAT(1-way) = 5.90 (2K), 1.50 (128K)
• AMAT(2-way) = 4.90 (2K), 1.45 (128K)
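The pseudo-associative AMAT model can be written as a small C function. This is a sketch of the formula as the slide states it, not of any particular processor; the function name and parameter names are hypothetical, and the slide does not list the miss rates and penalty behind its figures, so any numbers passed in are the caller's own assumptions.

```c
#include <assert.h>

/* AMAT of a pseudo-associative cache, following the slide's model:
   a slow hit in the alternate location costs extra_cycles, and the
   miss term uses the 2-way miss rate with the 1-way miss penalty. */
double amat_pseudo(double hit_time_1way, double miss_rate_1way,
                   double miss_rate_2way, double miss_penalty_1way,
                   double extra_cycles) {
    assert(miss_rate_1way >= miss_rate_2way);
    double alternate_hit_rate = miss_rate_1way - miss_rate_2way;
    double hit_time = hit_time_1way + alternate_hit_rate * extra_cycles;
    return hit_time + miss_rate_2way * miss_penalty_1way;
}
```

For example, with illustrative inputs of a 1-cycle hit, 10% and 8% miss rates, a 50-cycle penalty, and 2 extra cycles, the function returns 1 + 0.02 * 2 + 0.08 * 50 = 5.04.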
![Page 77: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/77.jpg)
77
Compiler Optimization for Code
• Code can easily be rearranged without affecting correctness
• Reordering the procedures of a program might reduce instruction miss rates by reducing conflict misses
• McFarling's observation using profiling information [1988]
– Reduced misses by 50% for a 2KB direct-mapped instruction cache with 4-byte blocks, and by 75% in an 8KB cache
– Optimized programs on a direct-mapped cache missed less than unoptimized ones on an 8-way set-associative cache of the same size
![Page 78: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/78.jpg)
78
Compiler Optimization for Data
• Idea: improve the spatial and temporal locality of the data
• Lots of options
– Array merging: allocate arrays so that paired operands show up in the same cache block
– Loop interchange: exchange inner and outer loop order to improve cache performance
– Loop fusion: for independent loops accessing the same data, fuse them into a single aggregate loop
– Blocking: do as much as possible on a sub-block before moving on
![Page 79: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/79.jpg)
79
Merging Arrays Example
/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];
/* After: 1 array of structures */
struct merge {
int val;
int key;
};
struct merge merged_array[SIZE];
Reduces conflicts between val and key; improves spatial locality
[Figure: memory layout before (separate val and key arrays) and after (interleaved val, key pairs)]
![Page 80: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/80.jpg)
80
Loop Interchange Example
/* Before */
for (j = 0; j < 100; j = j+1)
for (i = 0; i < 5000; i = i+1)
x[i][j] = 2 * x[i][j];
/* After */
for (i = 0; i < 5000; i = i+1)
for (j = 0; j < 100; j = j+1)
x[i][j] = 2 * x[i][j];
Sequential accesses instead of striding through memory every 100 words; improve spatial locality
![Page 81: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/81.jpg)
81
Loop Fusion Example
/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];
/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }
2 misses per access to a & c before vs. one miss per access after; improves temporal locality
The two loops perform different computations on common data, so fuse them
![Page 82: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/82.jpg)
82
Blocking Example
/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        r = 0;
        for (k = 0; k < N; k = k+1)
            r = r + y[i][k]*z[k][j];
        x[i][j] = r;
    }
Improve temporal locality + spatial locality
![Page 83: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/83.jpg)
83
Snapshot of x, y, z when i = 1 (Figure 5.21)
White: not yet touched; light: older access; dark: newer access
![Page 84: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/84.jpg)
84
Blocking Example (Cont.)
• Dealing with multiple arrays, some accessed by rows and some by columns
– Neither row-major nor column-major order helps, and loop interchange does not help either
• Idea: compute on a B×B submatrix that fits in the cache
• The two inner loops of the original code:
– Read all N×N elements of z[]
– Read N elements of 1 row of y[] repeatedly
– Write N elements of 1 row of x[]
• Capacity misses are a function of N and cache size
– If all three N×N matrices fit in the cache (3 × N × N × 4 bytes), there are no capacity misses; otherwise ...
![Page 85: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/85.jpg)
85
Blocking Example (Cont.)
/* After */
for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i = i+1)
            for (j = jj; j < min(jj+B,N); j = j+1) {
                r = 0;
                for (k = kk; k < min(kk+B,N); k = k+1)
                    r = r + y[i][k]*z[k][j];
                x[i][j] = x[i][j] + r;
            }
• B is called the blocking factor
• Worst-case capacity misses drop from 2N^3 + N^2 to 2N^3/B + N^2
• Also helps register allocation
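A self-contained version of the blocked loop nest, paired with the unblocked original for comparison, might look like the sketch below. The matrix size `N = 8`, the helper `min_int`, and the function names are assumptions made for this example. Note that the blocked version accumulates partial sums, so it zeroes `x` first.

```c
#include <assert.h>

enum { N = 8 };  /* illustrative matrix dimension, chosen for this sketch */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Unblocked reference: x = y * z. */
void matmul_naive(double x[N][N], double y[N][N], double z[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double r = 0.0;
            for (int k = 0; k < N; k++)
                r += y[i][k] * z[k][j];
            x[i][j] = r;
        }
}

/* Blocked version with blocking factor B. x is zeroed first because
   each (jj, kk) tile adds a partial sum into x[i][j]. */
void matmul_blocked(double x[N][N], double y[N][N], double z[N][N], int B) {
    assert(B > 0);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            x[i][j] = 0.0;
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < min_int(jj + B, N); j++) {
                    double r = 0.0;
                    for (int k = kk; k < min_int(kk + B, N); k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;
                }
}
```

Both functions compute the same product; only the order in which elements of `y` and `z` are touched changes, which is what lets a B×B tile stay in the cache while it is reused.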
![Page 86: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/86.jpg)
86
The Age of Accesses to x, y, z (Figure 5.22)
Note in contrast to Figure 5.21, the smaller number of elements accessed
![Page 87: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/87.jpg)
87
Performance Improvement
[Figure: bar chart of performance improvement (1x to 3x) from merged arrays, loop interchange, loop fusion, and blocking on compress, cholesky (nasa7), spice, mxm (nasa7), btrix (nasa7), tomcatv, gmty (nasa7), and vpenta (nasa7)]
Summary of Compiler Optimizations to Reduce Cache Misses
![Page 88: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/88.jpg)
88
5.6 Reducing Cache Miss Penalty or Miss Rate Via Parallelism
![Page 89: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/89.jpg)
89
Overview
• Overlap the execution of instructions with activity in the memory hierarchy
• Techniques
– Non-blocking caches to reduce stalls on cache misses; help with out-of-order processors
– Hardware prefetch of instructions and data
– Compiler-controlled prefetching
![Page 90: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/90.jpg)
90
Non-blocking Caches
• A non-blocking (lockup-free) cache allows the data cache to continue to supply cache hits during a miss
– Requires an out-of-order execution CPU, e.g., one using a scoreboard or Tomasulo's algorithm
• Hit-under-miss: reduces the effective miss penalty by servicing CPU requests during a miss instead of ignoring them
• Hit-under-multiple-miss or miss-under-miss: may further lower the effective penalty by overlapping multiple misses
– Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
– Requires multiple memory banks (otherwise overlapping misses cannot be supported)
– Pentium Pro allows 4 outstanding memory misses
![Page 91: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/91.jpg)
91
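Effect of Non-blocking Cache
Ratio of the average memory stall time to that of a blocking cache, by number of outstanding misses allowed
FP avg.: 1 miss: 76%; 2 misses: 51%; 64 misses: 39%
Int avg.: 1 miss: 81%; 2 misses: 78%; 64 misses: 78%
8 KB direct-mapped cache with 32-byte blocks and a 16 CC miss penalty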
![Page 92: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/92.jpg)
92
Example
• Compare 2-way set associativity with hit-under-one-miss for 8KB data caches
– FP miss rate: 11.4% (direct-mapped), 10.7% (2-way)
– INT miss rate: 7.4% (direct-mapped), 6.0% (2-way)
• FP (Miss_rate * Miss_penalty)
– Direct-mapped: 11.4% * 16 = 1.82
– 2-way: 10.7% * 16 = 1.71
– 1.71/1.82 = 94%, versus 76% with hit-under-one-miss, so hit-under-one-miss is better
• Integer (Miss_rate * Miss_penalty)
– Direct-mapped: 7.4% * 16 = 1.18
– 2-way: 6.0% * 16 = 0.96
– 0.96/1.18 = 81%, versus 81% with hit-under-one-miss, so almost the same
• Hit-under-miss does not increase hit time
![Page 93: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/93.jpg)
93
Non-Blocking Cache (Cont.)
• Difficult to evaluate the performance of non-blocking caches
– A cache miss does not necessarily stall the CPU
– The effective miss penalty is the non-overlapped time that the CPU is stalled
– Difficult to judge the impact of any single miss
– Difficult to calculate AMAT
• Out-of-order CPUs can hide the miss penalty of an L1 data cache miss that hits in L2, but cannot hide a significant fraction of an L2 cache miss
• There may be more than one miss request to the same block
– Must check on each miss that it is not to a block already being requested, to avoid possible inconsistency and to save time
![Page 94: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/94.jpg)
94
Hardware Prefetching of I&D
• Use hardware other than the cache to prefetch what you expect to need ahead of time
• AXP 21064 fetches 2 instruction blocks on a miss
– The target block goes to the I-cache
– The next block goes to the instruction stream buffer (ISB)
– If the requested block is in the ISB, it moves to the I-cache and only the next block is fetched from the next lower level
– A 1-, 4-, or 16-block ISB catches 15-25%, 50%, or 72% of the misses
• Works with data blocks too
– Jouppi: 1 data stream buffer (DSB) caught 25% of the misses from a 4KB cache; 4 streams caught 43%
– Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of the misses from two 64KB, 4-way set-associative caches
• Prefetching relies on having extra memory bandwidth that can be used without penalty (bandwidth that would otherwise go unused)
![Page 95: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/95.jpg)
95
Effect of HW Prefetching
• AMAT(prefetch) = Hit-time + Miss-rate * Prefetch-hit-rate * Prefetch-hit-time + Miss-rate * (1 - Prefetch-hit-rate) * Miss-penalty
• Parameters
– Prefetch hit time: 1 clock cycle; prefetch hit rate: 25%
– Miss rate: 1.10% (8KB cache); hit time: 2 clock cycles; miss penalty: 50 clock cycles
• AMAT(prefetch) = 2 + 1.10% * 25% * 1 + 1.10% * 75% * 50 = 2.41525
• A cache without prefetching would need a miss rate of 0.83% to achieve the same AMAT, which falls between an 8KB cache (1.10%) and a 16KB cache (0.64%)
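The prefetching AMAT and the equivalent no-prefetch miss rate can be checked with a short C sketch. The function names are hypothetical; the formula and the parameter values come from the slide.

```c
#include <assert.h>

/* AMAT when a fraction of misses is caught by the prefetch buffer. */
double amat_prefetch(double hit_time, double miss_rate,
                     double prefetch_hit_rate, double prefetch_hit_time,
                     double miss_penalty) {
    assert(prefetch_hit_rate >= 0.0 && prefetch_hit_rate <= 1.0);
    return hit_time
         + miss_rate * prefetch_hit_rate * prefetch_hit_time
         + miss_rate * (1.0 - prefetch_hit_rate) * miss_penalty;
}

/* Miss rate that a cache without prefetching would need in order to
   reach the same AMAT, from AMAT = hit_time + miss_rate * miss_penalty. */
double equivalent_miss_rate(double amat, double hit_time, double miss_penalty) {
    assert(miss_penalty > 0.0);
    return (amat - hit_time) / miss_penalty;
}
```

With the slide's parameters, `amat_prefetch(2.0, 0.011, 0.25, 1.0, 50.0)` gives 2.41525, and feeding that into `equivalent_miss_rate` gives roughly 0.83%.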
![Page 96: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/96.jpg)
96
Compiler-Controlled Prefetching
• Data prefetch
– Register prefetch: load data into a register (HP PA-RISC loads)
– Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC V9)
– Prefetch instruction example: prefetch(b[j+7][0])
– Special prefetching instructions cannot cause faults; they are a form of speculative execution
– Best candidates are loops
– Issuing prefetch instructions takes time
  • Is the cost of issuing prefetches < the savings from reduced misses?
• Also works for instruction prefetch
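As a rough illustration of a cache prefetch inside a loop, the sketch below uses `__builtin_prefetch`, a GCC/Clang compiler extension rather than anything named on the slide. The function name and the prefetch distance of 8 elements are arbitrary assumptions; a real distance would be tuned to the miss latency and loop body.

```c
#include <stddef.h>

/* Sum an array while hinting the hardware to fetch data a fixed
   distance ahead. The prefetch is only a hint: it does not change
   the result, and the compiler or CPU may ignore it. */
double sum_with_prefetch(const double *a, size_t n) {
    const size_t distance = 8;  /* illustrative prefetch distance, in elements */
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + distance < n)
            __builtin_prefetch(&a[i + distance], 0, 3);  /* 0 = read, 3 = high temporal locality */
        sum += a[i];
    }
    return sum;
}
```

The extra instruction issued on every iteration is exactly the overhead the slide asks about: the prefetch pays off only if the misses it removes cost more than the issue slots it consumes.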
![Page 97: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/97.jpg)
97
5.7 Reducing Hit Time
![Page 98: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/98.jpg)
98
Reducing Hit Time
• Hit time is critical because it affects the clock cycle time
– On many machines, cache access time limits the clock cycle rate
• A fast hit time matters beyond what the average memory access time formula suggests, because it helps everything
– Average-Memory-Access-Time = Hit-Access-Time + Miss-Rate * Miss-Penalty
  • Miss-penalty is clock-cycle dependent
![Page 99: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/99.jpg)
99
Techniques for Reducing Hit Time
• Small and Simple Caches
• Avoid Address Translation during Indexing of the Cache
• Pipelined Cache Access
• Trace Caches
![Page 100: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/100.jpg)
100
Small and Simple Caches
• A time-consuming portion of a cache hit: using the index portion of the address to read the tag and then comparing it to the address
• Small caches: smaller hardware is faster
– Keep the L1 cache small enough to fit on the same chip as the CPU
– Keep the tags on-chip and the data off-chip for L2 caches
• Simple caches: direct-mapped caches
– Trade an increased miss rate for a faster hit time
  • A small direct-mapped cache misses more often than a small associative cache
  • But the simpler structure makes the hit go faster
![Page 101: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/101.jpg)
101
Access Time as Size and Associativity Vary in a CMOS Cache
![Page 102: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/102.jpg)
102
Virtual Addressed Caches
• Parallel rather than sequential access
– Physically addressed caches access the TLB to generate the physical address, then do the cache access
• Avoid address translation during cache indexing
– Implies a virtually addressed cache
– Address translation proceeds in parallel with cache indexing
  • If translation indicates that the page is not mapped, the result of the index is not a hit
  • If a protection violation occurs, an exception results
  • All is well when neither happens
• Too good to be true?
![Page 103: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/103.jpg)
103
Virtually Addressed Caches
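[Figure: three cache/TLB organizations ($ means cache)]
Conventional organization: CPU → VA → TLB → PA → $ → PA → MEM
Virtually addressed cache: CPU → VA → $ (VA tags) → VA → TLB → PA → MEM; translate only on a miss; suffers from the synonym (alias) problem
Overlapped access: CPU → VA → $ and TLB accessed in parallel → PA → L2 $ → MEM; overlapping $ access with VA translation requires the $ index to remain invariant across translation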
![Page 104: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/104.jpg)
104
Paging Hardware with TLB
Cache is here
![Page 105: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/105.jpg)
105
Problems with Virtual Caches
• Protection is a necessary part of virtual-to-physical address translation
– Copy the protection information on a miss, add a field to hold it, and check it on every access to the virtually addressed cache
• A task switch causes the same virtual address to refer to a different physical address
– Hence the cache must be flushed
  • Creates huge task-switch overhead
  • Also creates huge compulsory miss rates for the new process
– Alternatively, use PIDs as part of the tag to distinguish processes
![Page 106: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/106.jpg)
106
Miss Rate of Virtual Caches
PIDs increase the uniprocess miss rate by 0.3% to 0.5%
PIDs save 0.6% to 4.3% over purging
![Page 107: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/107.jpg)
107
Problems with Virtual Caches (Cont.)
• Synonyms or aliases
– OS and user code have different virtual addresses that map to the same physical address (facilitates copy-free sharing)
– Two copies of the same data in a virtual cache ⇒ consistency issue
– Anti-aliasing (HW) mechanisms guarantee a single copy
  • On a miss, check that no cached block matches the PA of the data being fetched (requires VA → PA translation); otherwise, invalidate
– SW can help, e.g., Sun's version of UNIX
  • Page coloring: aliases must have the same low-order 18 bits
• I/O uses PAs
– Requires mapping to VAs to interact with a virtual cache
![Page 108: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/108.jpg)
108
Pipelining Writes for Fast Write Hits – Pipelined Cache
• Write hits usually take longer than read hits– Tag must be checked before writing the data
• Pipelines the write– 2 stages – Tag Check and Update Cache (can be more in practice)
– Current write tag check & previous write cache update
• Result– Looks like a write happens on every cycle
– Cycle time can stay short since the real write is spread over the pipeline stages
– Mostly works if CPU is not dependent on data from a write
• Spot any problems if read and write ordering is not preserved by the memory system?
• Reads play no part in this pipeline since they already operate in parallel with the tag check
![Page 109: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/109.jpg)
109
Trace Caches
• Conventional caches limit the instructions in a static cache block to those with spatial locality
• Conventional cache blocks may be entered from and exited by a taken branch ⇒ the first and last portions of a block go unused
– Taken branches or jumps occur once every 5 to 10 instructions
– A 64-byte block holds 16 instructions ⇒ space utilization problem
• A trace cache stores instructions only from the branch entry point to the exit of the trace ⇒ avoids header and trailer overhead
![Page 110: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/110.jpg)
110
Trace Cache
![Page 111: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/111.jpg)
111
Trace Caches (Cont.)
• Complicated address mapping mechanism, as addresses are no longer aligned to power-of-2 multiples of the word size
• May store the same instructions multiple times in the I-cache
– Conditional branches making different choices result in the same instructions being part of separate traces, each of which occupies space in the cache
• Intel NetBurst (foundation of Pentium 4)
![Page 112: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/112.jpg)
112
Cache Optimization Summary
![Page 113: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/113.jpg)
113
5.8 Main Memory
![Page 114: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/114.jpg)
114
Main Memory -- 3 important issues
• Capacity
• Latency
– Access time: time between when a read is requested and when the word arrives
– Cycle time: minimum time between requests to memory (> access time)
  • Memory needs the address lines to be stable between accesses
– Amortize the latency by addressing big chunks, like an entire cache block
– Critical to cache performance when the miss goes to main memory
• Bandwidth: number of bytes read or written per unit time
– Affects the time it takes to transfer the block
![Page 115: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/115.jpg)
115
Example of Memory Latency and Bandwidth
• Consider
– 4 cycles to send the address
– 56 cycles of access time per word
– 4 cycles to transmit the data
• Hence, if main memory is organized by word
– 64 cycles have to be spent for every word we want to access
• Given a cache line of 4 words (8 bytes per word)
– The miss penalty is 4 * 64 = 256 cycles
– Memory bandwidth = 4 * 8 / 256 = 1/8 byte per clock cycle
![Page 116: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/116.jpg)
116
Improving Main Memory Performance
• Simple:
– CPU, cache, bus, and memory are the same width (32 or 64 bits)
• Wide:
– CPU/mux 1 word; mux/cache, bus, and memory N words (Alpha: 64 bits and 256 bits; UltraSPARC: 512 bits)
• Interleaved:
– CPU, cache, and bus 1 word; memory N modules (4 modules); the example is word interleaved
![Page 117: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/117.jpg)
117
3 Examples of Bus Width, Memory Width, and Memory Interleaving to Achieve Memory Bandwidth
![Page 118: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/118.jpg)
118
Wider Main Memory
• Doubling or quadrupling the width of the cache or memory doubles or quadruples the memory bandwidth
– Miss penalty is reduced correspondingly
• Costs and drawbacks
– Higher cost for the wider memory bus
– The multiplexer between the cache and the CPU may be on the critical path (the CPU still accesses the cache one word at a time)
  • Multiplexers can be put between L1 and L2 instead
– The design of error correction becomes more complicated
  • If only a portion of the block is updated, all other portions must be read to calculate the new error correction code
– Since main memory is traditionally expandable by the customer, the minimum increment is doubled or quadrupled
![Page 119: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/119.jpg)
119
Simple Interleaved Memory
• Memory chips are organized into banks to read or write multiple words at a time, rather than a single word
– Banks share address lines with a memory controller
– Keep the memory bus the same but make it run faster
– Take advantage of the potential memory bandwidth of all the DRAM banks
– The banks are often one word wide
– Good for accessing consecutive memory locations
• Miss penalty of 4 + 56 + 4 * 4 = 76 CC (about 0.4 bytes per CC)
Organization of Four-way Interleaved Memory
Interleaving factor = #_of_banks (usually a power of 2)
Bank_# = Address MOD #_of_banks
Address_within_bank = Floor(Address / #_of_banks)
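The bank-mapping formulas and the two miss-penalty calculations can be sketched in C as follows. The function names are made up for this example, and the interleaved penalty assumes the block has no more words than there are banks, so that all word accesses overlap.

```c
#include <assert.h>

/* Word-interleaved mapping: consecutive word addresses rotate across banks. */
unsigned bank_number(unsigned address, unsigned num_banks) {
    assert(num_banks > 0);
    return address % num_banks;
}

unsigned address_within_bank(unsigned address, unsigned num_banks) {
    assert(num_banks > 0);
    return address / num_banks;  /* integer division acts as floor */
}

/* One-word-wide memory, no interleaving: every word pays the full
   4 (address) + 56 (access) + 4 (transfer) cycles. */
unsigned miss_penalty_simple(unsigned words) {
    return words * (4 + 56 + 4);
}

/* Interleaved banks: one address send, one overlapped access, then one
   4-cycle transfer per word (assumes words <= number of banks). */
unsigned miss_penalty_interleaved(unsigned words) {
    return 4 + 56 + words * 4;
}
```

For a 4-word block these return 256 and 76 cycles, matching the figures on the slides; word address 6 in a 4-bank system lands in bank 2 at offset 1.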
![Page 120: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/120.jpg)
120
What Can Interleaving and a Wide Memory Buy?
• Block size = 1, 2, 4 words; miss rate = 3%, 2%, 1.2% respectively
• Memory bus width = 1 word; memory accesses per instruction = 1.2
• Cache miss penalty = 64 cycles (as above)
• Average cycles per instruction (ignoring cache misses) = 2
• CPI = 2 + (1.2 * 3% * 64) = 4.30 (1-word block)
• Block size = 2 words
– 64-bit bus and memory, no interleaving: 2 + (1.2 * 2% * 2 * 64) = 5.07
– 64-bit bus and memory, interleaving: 2 + (1.2 * 2% * (4 + 56 + 2 * 4)) = 3.63
– 128-bit bus and memory, no interleaving: 2 + (1.2 * 2% * 1 * 64) = 3.54
• Block size = 4 words
– 64-bit bus and memory, no interleaving: 2 + (1.2 * 1.2% * 4 * 64) = 5.69
– 64-bit bus and memory, interleaving: 2 + (1.2 * 1.2% * (4 + 56 + 4 * 4)) = 3.09
– 128-bit bus and memory, no interleaving: 2 + (1.2 * 1.2% * 2 * 64) = 3.84
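Every line in this comparison applies the same formula, which a one-function C sketch makes explicit. The function name `cpi_with_misses` is hypothetical; the inputs are the slide's values.

```c
#include <assert.h>

/* CPI including memory stalls:
   base CPI + accesses per instruction * miss rate * miss penalty. */
double cpi_with_misses(double base_cpi, double accesses_per_instr,
                       double miss_rate, double miss_penalty_cycles) {
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return base_cpi + accesses_per_instr * miss_rate * miss_penalty_cycles;
}
```

For instance, the 4-word interleaved case is `cpi_with_misses(2.0, 1.2, 0.012, 76.0)`, about 3.09, while the same block size without interleaving uses a 256-cycle penalty and comes out near 5.69.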
![Page 121: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/121.jpg)
121
Simple Interleaved Memory (Cont.)
• Interleaved memory is logically a wide memory, except that accesses to the banks are staged over time to share the bus
• How many banks should be included?
– More than the number of CC needed to access a word in a bank
  • The goal is to deliver information from a new bank each clock for sequential accesses ⇒ avoids waiting
• Disadvantages
– Multiple banks become expensive as chips grow: larger chips ⇒ fewer chips
  • 512MB RAM
    – 256 chips of 4M × 4 bits ⇒ 16 banks of 16 chips
    – 16 chips of 64M × 4 bits ⇒ only 1 bank
– Main memory expansion is more difficult (as with wider memory)
![Page 122: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/122.jpg)
122
Independent Memory Banks
• Memory banks for independent accesses vs. faster sequential accesses (as with wider or interleaved memory)
– Multiple memory controllers
• Good for…
– Multiprocessors
– I/O
– CPUs with hit under n misses (non-blocking caches)
![Page 123: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/123.jpg)
123
5.9 Memory Technology
![Page 124: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/124.jpg)
124
DRAM Technology
• Semiconductor Dynamic Random Access Memory
• Emphasis on cost per bit and capacity
• Multiplexed address lines ⇒ cuts the number of address pins in half
– Row access strobe (RAS) first, then column access strobe (CAS)
– Memory is a 2D matrix; rows go to a buffer
– A subsequent CAS selects a subrow
• Uses only a single transistor to store a bit
– Reading that bit can destroy the information
– Refresh each bit periodically (e.g., every 8 milliseconds) by writing it back
  • Keep refresh time under 5% of the total time
• DRAM capacity is 4 to 8 times that of SRAM
![Page 125: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/125.jpg)
125
DRAM Technology (Cont.)
• DIMM: dual inline memory module
– DRAM chips are commonly sold on small boards called DIMMs
– DIMMs typically contain 4 to 16 DRAMs
• DRAM capacity growth is slowing down
– Four times the capacity every three years, for more than 20 years
– Since 1998, new chips only double capacity every two years
• DRAM performance is growing at a slower rate
– RAS (related to latency): 5% per year
– CAS (related to bandwidth): 10%+ per year
![Page 126: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/126.jpg)
126
RAS improvement
A performance improvement in RAS of about 5% per year
![Page 127: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/127.jpg)
127
SRAM Technology
• Caches use SRAM: Static Random Access Memory
• SRAM uses six transistors per bit to prevent the information from being disturbed when read ⇒ no need to refresh
– SRAM needs only minimal power to retain the charge in standby mode ⇒ good for embedded applications
– No difference between access time and cycle time for SRAM
• Emphasis on speed and capacity
– SRAM address lines are not multiplexed
• SRAM speed is 8 to 16x that of DRAM
![Page 128: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/128.jpg)
128
ROM and Flash
• Embedded processor memory
• Read-only memory (ROM)
– Programmed at the time of manufacture
– Only a single transistor per bit to represent 1 or 0
– Used for the embedded program and for constants
– Nonvolatile and indestructible
• Flash memory
– Nonvolatile, but allows the memory to be modified
– Reads at almost DRAM speeds, but writes 10 to 100 times slower
– DRAM capacity per chip and MB per dollar are about 4 to 8 times greater than flash
![Page 129: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/129.jpg)
129
Improving Memory Performance in a Standard DRAM Chip
• Fast page mode: timing signals that allow repeated accesses to the row buffer without another row access time
• Synchronous DRAM (SDRAM): adds a clock signal to the DRAM interface, so that repeated transfers do not bear the overhead of synchronizing with the controller
– Asynchronous DRAM involves overhead to sync with the controller
– Peak speed per memory module: 800 to 1200 MB/sec in 2001
• Double data rate (DDR): transfers data on both the rising edge and the falling edge of the DRAM clock signal
– Peak speed per memory module: 1600 to 2400 MB/sec in 2001
![Page 130: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/130.jpg)
130
RAMBUS
• RAMBUS optimizes the interface between DRAM and CPU
• RAMBUS makes a single chip act more like a memory system than a memory component
– Each chip has interleaved memory and a high-speed interface
• 1st generation RAMBUS: RDRAM
– Replaces RAS/CAS with a bus that allows other accesses over it between the sending of the address and the return of the data
– Each chip has four banks, each with its own row buffer
– A chip can return a variable amount of data from a single request, and even perform its own refresh
– Has a clock signal and transfers on both edges of its clock
– 300 MHz clock
![Page 131: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/131.jpg)
131
RAMBUS (Cont.)
• 2nd generation RAMBUS: direct RDRAM (DRDRAM)
– Offers up to 1.6GB/sec of bandwidth
– Separate row- and column-command buses
– 18-bit data bus; 16 internal banks; 8 row buffers; 400 MHz
• RAMBUS chips are sold in RIMMs: one RAMBUS chip per RIMM
• RAMBUS vs. DDR SDRAM
– DIMM bandwidth (multiple DRAM chips) is closer to that of RAMBUS
– RDRAM and DRDRAM have a price premium over traditional DRAM
  • Larger chips
  • In 2001, the premium was a factor of 2
  • Section 5.16 has a detailed price-performance evaluation
![Page 132: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/132.jpg)
132
5.10 Virtual Memory
![Page 133: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/133.jpg)
133
Virtual Memory
• Virtual memory divides physical memory into blocks (called pages or segments) and allocates them to different processes
• With virtual memory, the CPU produces virtual addresses that are translated by a combination of HW and SW to physical addresses, which are then used to access main memory. The process is called memory mapping or address translation
• Today, the two memory-hierarchy levels controlled by virtual memory are DRAMs and magnetic disks
![Page 134: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/134.jpg)
134
Example of Virtual to Physical Address Mapping
Mapping by a page table
![Page 135: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/135.jpg)
135
Address Translation Hardware for Paging
[Figure: physical address formed from frame number f (l - n bits) and frame offset d (n bits)]
![Page 136: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/136.jpg)
136
Page table when some pages are not in main memory…
illegal access
OS puts the process in the backing store when it starts executing.
![Page 137: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/137.jpg)
137
Virtual Memory (Cont.)
• Permits applications to grow bigger than main memory size
• Helps with multiple process management
  – Each process gets its own chunk of memory
  – Permits protection of one process's chunks from another
  – Mapping of multiple chunks onto shared physical memory
  – Mapping also facilitates relocation (a program can run in any memory location, and can be moved during execution)
  – Application and CPU run in virtual space (logical memory, 0 – max)
  – Mapping onto physical space is invisible to the application
• Cache vs. VM
  – Block becomes a page or segment
  – Miss becomes a page or address fault
![Page 138: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/138.jpg)
138
Typical Page Parameters
![Page 139: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/139.jpg)
139
Cache vs. VM Differences
• Replacement
  – Cache miss handled by hardware
  – Page fault usually handled by OS
• Addresses
  – VM space is determined by the address size of the CPU
  – Cache space is independent of the CPU address size
• Lower level memory
  – For caches - the main memory is not shared by something else
  – For VM - most of the disk contains the file system
    • File system addressed differently - usually in I/O space
    • VM lower level is usually called SWAP space
![Page 140: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/140.jpg)
140
2 VM Styles - Paged or Segmented?
• Virtual memory systems can be categorized into two classes: paged (fixed-size blocks) and segmented (variable-size blocks)
| | Page | Segment |
|---|---|---|
| Words per address | One | Two (segment and offset) |
| Programmer visible? | Invisible to application programmer | May be visible to application programmer |
| Replacing a block | Trivial (all blocks are the same size) | Hard (must find contiguous, variable-size, unused portion of main memory) |
| Memory use inefficiency | Internal fragmentation (unused portion of page) | External fragmentation (unused pieces of main memory) |
| Efficient disk traffic | Yes (adjust page size to balance access time and transfer time) | Not always (small segments may transfer just a few bytes) |
![Page 141: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/141.jpg)
141
Virtual Memory – The Same 4 Questions
• Block Placement
  – Choice: lower miss rates and complex placement or vice versa
    • Miss penalty is huge, so choose low miss rate: place anywhere
    • Similar to fully associative cache model
• Block Identification - both use an additional data structure
  – Fixed-size pages - use a page table
  – Variable-sized segments - use a segment table
![Page 142: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/142.jpg)
142
Address Translation Hardware for Paging
[Figure: physical address formed from frame number f (l - n bits) and frame offset d (n bits)]
![Page 143: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/143.jpg)
143
Block Identification Example
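Logical address 13 = 11 01 (page number 11, offset 01)
Physical address 9 = 010 01 (frame number 010, offset 01): 2 * 4 + 1 = 9
Physical space = 2^5
Logical space = 2^4
Page size = 2^2
PT size = 2^4 / 2^2 = 2^2 entries
Each PT entry needs 5 - 2 = 3 bits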
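The translation in this example (logical address 13 maps to physical address 2 * 4 + 1 = 9) can be sketched in a few lines of Python. This is only an illustration: the names `PAGE_TABLE` and `translate` are hypothetical, and only the entry for page 3 (frame 2) comes from the slide; the other three page-table entries are assumed values.

```python
PAGE_BITS = 2  # page size = 2^2 = 4 bytes, as in the example

# Hypothetical page table: index = page number, value = frame number.
# Only page 3 -> frame 2 is taken from the slide; the rest are assumed.
PAGE_TABLE = [5, 6, 1, 2]


def translate(logical_address):
    """Split the logical address into (page, offset) and swap page for frame."""
    page = logical_address >> PAGE_BITS                # upper bits: page number
    offset = logical_address & ((1 << PAGE_BITS) - 1)  # lower bits: offset
    frame = PAGE_TABLE[page]                           # page-table lookup
    return (frame << PAGE_BITS) | offset               # frame * page size + offset


print(translate(13))  # page 3, offset 1 -> frame 2 -> 2 * 4 + 1
```

The offset passes through unchanged; only the page number is replaced, which is why each page-table entry needs just the 5 - 2 = 3 frame-number bits.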
![Page 144: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/144.jpg)
144
Virtual Memory – The Same 4 Questions (Cont.)
• Block Replacement -- LRU is the best
  – However, true LRU is a bit complex, so use an approximation
    • Page table contains a use tag, and on access the use tag is set
    • OS checks the tags every so often, records what it sees in a data structure, then clears them all
    • On a miss the OS decides which page has been used the least and replaces that one
• Write Strategy -- always write back
  – Because disk access time is so long, write through is impractical
  – Use a dirty bit to write back only pages that have been modified
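A minimal sketch of the use-tag approximation to LRU follows. The slide does not say which data structure the OS keeps, so this assumes an 8-bit "aging" counter per page, which is one common choice; the class and method names are hypothetical.

```python
class UseBitLRU:
    """Approximate LRU from per-page use bits sampled periodically by the OS."""

    def __init__(self, pages):
        self.use_bit = {p: False for p in pages}  # set on every access
        self.history = {p: 0 for p in pages}      # assumed 8-bit aging counter

    def access(self, page):
        # In a real system the hardware sets this bit on each reference.
        self.use_bit[page] = True

    def sample(self):
        # Periodic OS scan: shift the history right, record the current
        # use bit in the top position, then clear the use bit.
        for page, used in self.use_bit.items():
            self.history[page] = (self.history[page] >> 1) | (0x80 if used else 0)
            self.use_bit[page] = False

    def victim(self):
        # The smallest history value marks the page used least recently.
        return min(self.history, key=self.history.get)


lru = UseBitLRU(["A", "B", "C"])
lru.access("A"); lru.access("B"); lru.sample()
lru.access("A"); lru.sample()
print(lru.victim())  # "C" was never referenced
```

Recent samples land in the high-order bits, so a page referenced in the latest interval always outranks one referenced only in older intervals.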
![Page 145: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/145.jpg)
145
Techniques for Fast Address Translation
• Page table is kept in main memory (kernel memory)
  – Each process has a page table
• Every data/instruction access requires two memory accesses
  – One for the page table and one for the data/instruction
  – Can be solved by the use of a special fast-lookup hardware cache called associative registers or translation look-aside buffers (TLBs)
• If locality applies then cache the recent translation
  – TLB = translation look-aside buffer
  – TLB entry: virtual page no, physical page no, protection bit, use bit, dirty bit
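The idea of caching recent translations can be sketched as a small software model. This is an illustration only: the `TLB` class is hypothetical, the page-table contents are assumed, and LRU replacement is an assumption (the slides do not specify the TLB replacement policy). Protection, use, and dirty bits are omitted for brevity.

```python
from collections import OrderedDict


class TLB:
    """Tiny fully associative TLB model with (assumed) LRU replacement."""

    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table   # virtual page no -> physical page no
        self.entries = OrderedDict()   # cached translations, oldest first
        self.hits = 0
        self.misses = 0

    def lookup(self, vpn):
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)     # mark as most recently used
            return self.entries[vpn]
        self.misses += 1
        frame = self.page_table[vpn]          # the extra memory access
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        self.entries[vpn] = frame
        return frame


tlb = TLB(capacity=2, page_table={0: 5, 1: 6, 2: 1, 3: 2})
for vpn in [3, 3, 0, 1, 3]:
    tlb.lookup(vpn)
print(tlb.hits, tlb.misses)
```

Only a miss pays for the page-table access in memory; a hit returns the cached physical page number directly.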
![Page 146: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/146.jpg)
146
TLB = Translation Look-aside Buffer
• The TLB must be on chip; otherwise it is worthless
  – Fully associative – parallel search
• Typical TLBs
  – Hit time - 1 cycle
  – Miss penalty - 10 to 30 cycles
  – Miss rate - 0.1% to 2%
  – TLB size - 32 B to 8 KB
![Page 147: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/147.jpg)
147
Paging Hardware with TLB
![Page 148: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/148.jpg)
148
TLB of Alpha 21264
Address Space Number (ASN): acts like a process ID, so the TLB does not have to be flushed on a context switch
A total of 128 TLB entries
![Page 149: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/149.jpg)
149
Effective Access Time
• Associative lookup = ε time units (hit time)
• Assume memory cycle time is x time units (miss penalty)
• Hit ratio = α
• Effective Access Time (EAT)
EAT = (x + ε)α + (2x + ε)(1 – α)
= (2 – α) * x + ε
![Page 150: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/150.jpg)
150
Effective Access Time (Cont.)
• Example 1
  – Associative lookup = 20
  – Memory access = 100
  – Hit ratio = 0.8
  – EAT = (100 + 20) * 0.8 + (200 + 20) * 0.2 = 1.2 * 100 + 20 = 140
  – 40% slowdown in memory access time
• Example 2
  – Associative lookup = 20
  – Memory access = 100
  – Hit ratio = 0.98
  – EAT = (100 + 20) * 0.98 + (200 + 20) * 0.02 = 1.02 * 100 + 20 = 122
  – 22% slowdown in memory access time
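The two examples can be checked with a short calculation. The function name and parameter names below are hypothetical; the formula is the one used on the slides, where a TLB hit costs one lookup plus one memory access and a miss costs one lookup plus two memory accesses.

```python
def effective_access_time(lookup, memory, hit_ratio):
    """EAT = (memory + lookup) * hit_ratio + (2 * memory + lookup) * (1 - hit_ratio)."""
    hit_cost = memory + lookup        # TLB hit: lookup + one memory access
    miss_cost = 2 * memory + lookup   # TLB miss: lookup + page table + data
    return hit_cost * hit_ratio + miss_cost * (1 - hit_ratio)


for ratio in (0.8, 0.98):
    eat = effective_access_time(lookup=20, memory=100, hit_ratio=ratio)
    slowdown = (eat - 100) / 100 * 100   # percent slower than a bare memory access
    print(f"hit ratio {ratio}: EAT = {eat:.0f}, slowdown = {slowdown:.0f}%")
```

Raising the hit ratio from 0.8 to 0.98 removes most of the page-table accesses, but the fixed lookup cost of 20 remains in both cases.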
![Page 151: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/151.jpg)
151
Page Size – An Architectural Choice
• Large pages are good:
  – Reduces page table size
  – Amortizes the long disk access
  – If spatial locality is good then hit rate will improve
  – Reduces the number of TLB misses
• Large pages are bad:
  – More internal fragmentation
    • If sizes are random, each structure's last page is on average only half full
  – Process start-up takes longer
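The trade-off between page table size and internal fragmentation can be made concrete with a small calculation. This is a sketch under assumed numbers: the 32-bit virtual address space, the 4 KB and 64 KB page sizes, and the 10,469-byte structure are illustrative values, and the function names are hypothetical.

```python
def page_table_entries(virtual_address_bits, page_size):
    """Number of entries in a flat page table covering the whole virtual space."""
    return (1 << virtual_address_bits) // page_size


def wasted_in_last_page(structure_size, page_size):
    """Bytes left unused in the last page of a structure (internal fragmentation)."""
    remainder = structure_size % page_size
    return 0 if remainder == 0 else page_size - remainder


for page_size in (4 * 1024, 64 * 1024):
    entries = page_table_entries(32, page_size)
    waste = wasted_in_last_page(10_469, page_size)
    print(f"page size {page_size}: {entries} entries, {waste} bytes wasted")
```

Growing the page from 4 KB to 64 KB shrinks the flat page table by a factor of 16, while the unused tail of the last page grows with the page size.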
![Page 152: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/152.jpg)
152
5.11 Protection and Examples of VM
![Page 153: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/153.jpg)
153
Protection
• Multiprogramming forces us to worry about it
• Hence lots of processes
  – Hence task switch overhead
  – HW must provide savable state
  – OS must promise to save and restore properly
  – Most machines task switch every few milliseconds
  – A task switch typically takes several microseconds
![Page 154: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/154.jpg)
154
Protection Options
• Simplest - base and bound (valid if Base ≤ Address ≤ Bound)
  – 2 registers - check that each address falls between the values
    • These registers must be changed by the OS but not the application
  – Need for 2 modes: regular & privileged
    • Hence need to privilege-trap and provide mode switch ability
• VM provides another option
  – Check as part of the VA → PA translation process
  – Memory protection implemented by associating protection bits with each page
    • The protection bits reside in the page table & TLB
    • Read-only, read-write, execute-only
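Both protection options reduce to simple checks, sketched here in Python. The function names and the bit encoding of the permissions are assumptions for illustration; real hardware performs these checks on every access and traps to the OS on a violation.

```python
def base_bound_ok(address, base, bound):
    """Base-and-bound check: the address must fall between the two registers."""
    return base <= address <= bound


# Assumed encoding of per-page protection bits (not from the slides).
READ, WRITE, EXECUTE = 1, 2, 4
READ_ONLY = READ
READ_WRITE = READ | WRITE
EXECUTE_ONLY = EXECUTE


def page_access_ok(page_protection, requested):
    """Allow the access only if every requested permission bit is granted."""
    return (page_protection & requested) == requested


print(base_bound_ok(0x1200, base=0x1000, bound=0x1FFF))  # inside the range
print(page_access_ok(READ_ONLY, WRITE))                  # write to read-only page
```

With paging, the check happens during translation, so the protection bits can be read from the same page-table or TLB entry that supplies the physical page number.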
![Page 155: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/155.jpg)
155
Hardware Support for Relocation and Limit Registers
![Page 156: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/156.jpg)
156
Memory Protection with Valid-Invalid bit
• A process with length 10,468
• Addresses 10,468 – 12,287 are also invalid
• PTLR (page-table length register)
![Page 157: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/157.jpg)
157
Protection Options (Cont.)
• Rings - à la MULTICS and now Pentium
  – Inner is most privileged - outer is least
• Capabilities (i432) - similar to a key or password model
  – OS hands them out - so they’re difficult to forge
  – In some cases they can be passed between applications
![Page 158: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/158.jpg)
158
VM Example
• Alpha 21264
• Intel Pentium
Paged segmentation & multi-level paging
![Page 159: Memory Hierarchy Design](https://reader030.vdocument.in/reader030/viewer/2022013012/5681446b550346895db0ff7e/html5/thumbnails/159.jpg)
159
Assignment Questions
1. (a) Give the types of “conflict misses”.
(b) Which principle of locality does the first miss rate reduction technique address? Explain why. [8+8]
2. (a) Give the three categories of cache organization based on block placement.
(b) Describe how a block is found when it is in the cache. [8+8]
3. (a) What are fully associative caches? Explain how they are used in calculating capacity misses.
(b) Define pseudoset. How does a pseudo-associative cache work? [8+8]