homework 6 - cs.iupui.edufgsong/csci402/lecture_notes/lecture_20.pdf · an example of snoopy...
4/10/18
Homework 6
• BTW, this is your last homework
• 5.1.1–5.1.3
• 5.2.1–5.2.2
• 5.3.1–5.3.5
• 5.4.1–5.4.2
• 5.6.1–5.6.5
• 5.12.1
• Assigned today, Tuesday, April 10
• Due: 11:59 PM on Monday, April 23
CSCI 402: Computer Architectures
Memory Hierarchy (4)
Fengguang Song, Department of Computer & Information Science, IUPUI
Recall
• Direct-mapped cache
  – Hardware, stored content, and overhead
• A few examples of dividing a memory address into fields
• How to handle a read/write? Either a hit or a miss
• Memory stall cycles per instruction
  – Effective CPI = base CPI + memory stall cycles per instruction
• AMAT (average memory access time)
  – Structure: a fast L1 cache backed by slow main memory
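The recalled formulas can be sketched numerically. A minimal sketch; the parameter values below (base CPI, accesses per instruction, miss rate, penalty) are made-up illustrations, not figures from the lecture:

```python
# Sketch of the effective-CPI and AMAT arithmetic recalled above.
# All numeric parameters are assumed example values.

def effective_cpi(base_cpi, accesses_per_instr, miss_rate, miss_penalty):
    """Effective CPI = base CPI + memory stall cycles per instruction."""
    return base_cpi + accesses_per_instr * miss_rate * miss_penalty

def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Example: base CPI 1.0, 1.2 memory accesses per instruction,
# 5% miss rate, 100-cycle miss penalty, 1-cycle L1 hit time.
print(effective_cpi(1.0, 1.2, 0.05, 100))  # 1.0 + 1.2 * 0.05 * 100 = 7.0
print(amat(1, 0.05, 100))                  # 1 + 0.05 * 100 = 6.0
```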
Associative Caches
• Direct-mapped cache: cache index = block address % N
  – e.g., N = 4: block addresses 0, 4, 8, 12, … all map to the same line
Associative Caches
• Associativity makes block placement more flexible
  – In a direct-mapped cache, there is only one choice!
• Fully associative cache
  – Allows a memory block to go into any cache line
  – So all entries must be searched at once
  – i.e., one comparator per cache line (very expensive!)
• n-way set associative cache
  – Each set contains n entries
  – The block number determines the set
    • i.e., set index = block number modulo #sets
  – Search all entries in the given set at once
    • So, just n comparators for n ways (much less expensive!)
Fig: 2 sets, 4 entries per set
Expensive Cost of Full Associativity
• A fully associative cache is expensive to implement
  1. There is no index field in the address subdivision; nearly the entire address must be used as the tag → more cache space is spent on tags
  2. We must check the tags of all blocks → many comparators
Fig: a fully associative cache — each line stores a valid bit, a tag, and data; one comparator per line checks the address tag against every stored tag in parallel to produce the hit signal
Use Different Caches
• Suppose a cache can store 8 blocks
• Where can memory block address 12 be placed?
  – Direct mapped: line 12 mod 8
  – 2-way set associative (4 sets): set 12 mod 4
  – Fully associative: anywhere
• We have covered the direct-mapped cache, the n-way set associative cache, and the fully associative cache
• In fact, we can think of every cache organization as a variation of the set-associative cache
Varying Associativity
• The same cache with 8 entries can be organized as 1-way (direct mapped, 8 sets), 2-way (4 sets), 4-way (2 sets), or 8-way (fully associative, 1 set)
• Every cache is a variation of a set-associative cache
Comparing Associativity
• Compare different cache types, each with only 4 blocks:
  – (1) Direct mapped cache
  – (2) 2-way set associative cache
  – (3) Fully associative cache
  – Block access sequence: 0, 8, 0, 6, 8 (block addresses)
• (1) Direct mapped (cache index = block address mod 4)

Block address | Cache index | Hit/miss | Line 0 | Line 1 | Line 2 | Line 3
      0       |      0      |   miss   | Mem[0] |        |        |
      8       |      0      |   miss   | Mem[8] |        |        |
      0       |      0      |   miss   | Mem[0] |        |        |
      6       |      2      |   miss   | Mem[0] |        | Mem[6] |
      8       |      0      |   miss   | Mem[8] |        | Mem[6] |
All misses!
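The all-miss behavior in the table can be reproduced with a minimal direct-mapped simulator. A sketch that tracks only block addresses (no tags or data):

```python
# Tiny direct-mapped cache simulation: 4 one-block lines,
# access sequence 0, 8, 0, 6, 8 (block addresses).

def simulate_direct_mapped(accesses, num_lines=4):
    lines = [None] * num_lines          # one block per line, initially empty
    results = []
    for block in accesses:
        index = block % num_lines       # cache index = block address mod 4
        if lines[index] == block:
            results.append("hit")
        else:
            results.append("miss")
            lines[index] = block        # replace whatever was there
    return results

print(simulate_direct_mapped([0, 8, 0, 6, 8]))
# Blocks 0 and 8 both map to index 0, so they keep evicting
# each other: every access misses.
```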
Associativity Example
• (2) 2-way set associative (set index = block address mod 2)

Block address | Set index | Hit/miss | Set 0          | Set 1
      0       |     0     |   miss   | Mem[0]         |
      8       |     0     |   miss   | Mem[0], Mem[8] |
      0       |     0     |   hit    | Mem[0], Mem[8] |
      6       |     0     |   miss   | Mem[0], Mem[6] |
      8       |     0     |   miss   | Mem[8], Mem[6] |

• (3) Fully associative

Block address | Hit/miss | Cache content after access
      0       |   miss   | Mem[0]
      8       |   miss   | Mem[0], Mem[8]
      0       |   hit    | Mem[0], Mem[8]
      6       |   miss   | Mem[0], Mem[8], Mem[6]
      8       |   hit    | Mem[0], Mem[8], Mem[6]

Best result: fully associative (2 hits on the sequence 0, 8, 0, 6, 8)
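Both tables can be reproduced with a small set-associative simulator. A sketch that assumes LRU replacement within each set, which matches the eviction order shown above:

```python
# Set-associative cache with LRU replacement: each set keeps its
# blocks ordered from least- to most-recently used.

def simulate_set_associative(accesses, num_sets, ways):
    sets = [[] for _ in range(num_sets)]
    results = []
    for block in accesses:
        s = sets[block % num_sets]         # set index = block mod #sets
        if block in s:
            results.append("hit")
            s.remove(block)                # re-append as most-recently used
        else:
            results.append("miss")
            if len(s) == ways:
                s.pop(0)                   # evict the least-recently used
        s.append(block)
    return results

seq = [0, 8, 0, 6, 8]
print(simulate_set_associative(seq, num_sets=2, ways=2))  # 2-way: 1 hit
print(simulate_set_associative(seq, num_sets=1, ways=4))  # fully assoc.: 2 hits
```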
How Much Associativity Is Appropriate?
• Increasing associativity can decrease the miss rate
  – But it is more costly to build
  – And it has diminishing returns (see the data below)
• Simulations with a 64 KB D-cache, 16-word blocks, SPEC2000
• Data miss rate:
  – 1-way: 10.3%
  – 2-way: 8.6%
  – 4-way: 8.3%
  – 8-way: 8.1%
“Address Subdivision” for a Set-Associative Cache
• If a cache has 2^s sets and each block has 2^n bytes, then a memory address of m bits can be partitioned as follows:

  | Tag (m − s − n bits) | Set index (s bits) | Block offset (n bits) |

• The arithmetic to compute the set index:
  Block address = Memory address / 2^n
  Set index = Block address mod 2^s
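The partitioning arithmetic can be sketched in code. The s and n values below assume the 256-set, 4-byte-block configuration described on the next slide; the example address is arbitrary:

```python
# Split a byte address into (tag, set index, block offset) using the
# arithmetic above. Assumes 2^8 = 256 sets and 2^2 = 4-byte blocks.

S_BITS = 8   # 2^8 = 256 sets
N_BITS = 2   # 2^2 = 4 bytes per block

def split_address(addr, s=S_BITS, n=N_BITS):
    block_offset = addr & ((1 << n) - 1)        # low n bits
    block_address = addr >> n                   # Block address = addr / 2^n
    set_index = block_address & ((1 << s) - 1)  # Block address mod 2^s
    tag = addr >> (s + n)                       # remaining high bits
    return tag, set_index, block_offset

# Example: an arbitrary 32-bit byte address.
tag, set_index, offset = split_address(0x12345678)
print(hex(tag), set_index, offset)  # 0x48d15 158 0
```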
Organization of a Set-Associative Cache
• Each cache block has 4 bytes
• 4-way set associative
• 256 sets
Cache Replacement Policy
• Direct mapped: no choice -> must replace the existing block
• n-way set associative: you have n choices
  – Prefer an invalid entry if there is one
  – Otherwise, choose one among the entries in the set
  – But which one to choose, and how?
There are different cache replacement policies:
• Least-recently used (LRU)
  – Replace the block that has not been used for the longest time
  – Simple for 2-way, manageable for 4-way, too hard beyond that
• Random
  – Performs approximately as well as LRU at high associativity
• FIFO
  – Replace the block that entered the cache first (i.e., the oldest one)
Question: What happens if the assigned cache block space is already occupied?
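The selection rules above can be sketched as a victim chooser for one set. The entry fields (`valid`, `last_used`, `loaded_at`) are hypothetical bookkeeping for illustration, not a prescribed hardware format:

```python
# Choose which way of a set to replace: prefer an invalid entry,
# otherwise apply the configured replacement policy.
import random

def choose_victim(entries, policy="LRU"):
    """entries: one dict per way, with 'valid', 'last_used', 'loaded_at'."""
    for i, e in enumerate(entries):
        if not e["valid"]:
            return i                       # an invalid entry is free to use
    if policy == "LRU":
        # Least-recently used: smallest last-access timestamp.
        return min(range(len(entries)), key=lambda i: entries[i]["last_used"])
    if policy == "FIFO":
        # Oldest resident: smallest load timestamp.
        return min(range(len(entries)), key=lambda i: entries[i]["loaded_at"])
    return random.randrange(len(entries))  # random policy

ways = [
    {"valid": True, "last_used": 7, "loaded_at": 1},
    {"valid": True, "last_used": 3, "loaded_at": 5},
]
print(choose_victim(ways, "LRU"))   # way 1: least recently used
print(choose_victim(ways, "FIFO"))  # way 0: loaded first (oldest)
```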
Next, Cache Coherence Problem…
What Is a Finite State Machine (FSM)?
• We use an FSM to define a sequence of control steps
• An FSM consists of a set of states; each edge is labeled with an event
• Transitions between states of the FSM:
  – The current state is stored in a register
  – Next state = fn(current state, input event)
Fig: Cache Controller FSM (e.g., one state sets the valid bit and tag)
Cache Coherence Problem (§5.10 Parallelism and Memory Hierarchies: Cache Coherence)
• Suppose two CPU cores share a physical memory space
  – Assuming write-through caches

Time step | Event               | CPU A's cache | CPU B's cache | Memory[X]
    0     |                     |               |               |     0
    1     | CPU A reads X       |       0       |               |     0
    2     | CPU B reads X       |       0       |       0       |     0
    3     | CPU A writes 1 to X |       1       |       0       |     1

Fig: CPU A and CPU B, each with its own cache, sharing X in memory
Coherence Definition
• Informally: every read should return the most recently written value
• Formally:
  – P writes X; P reads X (no intervening writes) ⇒ the read returns the written value
  – P1 writes X; P2 reads X (later) ⇒ P2's read returns the written value
    • e.g., CPU B reading X after step 3 in the previous example
  – P1 writes X; P2 writes X ⇒ all processors see the writes in the same order
    • All end up with the same final value for X
Cache Coherence Protocols
• Define the operations caches must perform, in multiprocessors, to ensure coherence
  – How to migrate data to a local cache
  – How to replicate read-shared data
• The classic approach: “snooping” protocols
  – Each cache monitors the bus for reads and writes
Snooping Protocols
• A cache gets exclusive access to a block whenever it writes to that block
  – It broadcasts an invalidate message on the bus
  – A subsequent read in another cache then misses
    • The owning cache supplies the updated value

CPU activity        | Bus activity     | CPU A's cache | CPU B's cache | Memory[X]
                    |                  |               |               |     0
CPU A reads X       | Cache miss for X |       0       |               |     0
CPU B reads X       | Cache miss for X |       0       |       0       |     0
CPU A writes 1 to X | Invalidate for X |       1       |               |     0
CPU B reads X       | Cache miss for X |       1       |       1       |     1

Detailed protocols are shown in the following slides…
An Example of Snoopy Protocol (MSI)
• Invalidate protocol for write-back caches
• Each cache block is in one of three states (tracked per block):
  – Shared: the block can be read
    • Clean in all caches and up to date in memory
  – Exclusive (or Modified): this cache has the only copy; it is writeable and dirty
  – Invalid: the block contains no valid data
• All caches “snoop” the bus
  – A read-miss message on the bus can be satisfied by one of the caches
• A write to a Shared block is treated as a miss (it requires a bus action to invalidate the other copies)
Snoopy-Cache State Machine I
• Cache state transitions for CPU requests to a cache block
• Cache block states: Invalid, Shared (read only), Exclusive (read/write)
  – Invalid -> Shared: CPU read; place a read miss on the bus
  – Invalid -> Exclusive: CPU write; place a write miss on the bus
  – Shared -> Shared: CPU read hit; or CPU read miss (place a read miss on the bus)
  – Shared -> Exclusive: CPU write; place a write miss on the bus
  – Exclusive -> Exclusive: CPU read hit or CPU write hit; or CPU write miss (write back the cache block, place a write miss on the bus)
  – Exclusive -> Shared: CPU read miss; write back the block, place a read miss on the bus
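The CPU-request transitions above can be written down as a lookup table. A sketch covering only the CPU side of the MSI protocol (bus-side snooping transitions are omitted); the event names are informal labels, not protocol terminology:

```python
# MSI transitions for CPU-side requests on one cache block.
# Maps (current state, event) -> (next state, list of actions).

MSI_CPU_TRANSITIONS = {
    ("Invalid",   "read"):       ("Shared",    ["place read miss on bus"]),
    ("Invalid",   "write"):      ("Exclusive", ["place write miss on bus"]),
    ("Shared",    "read hit"):   ("Shared",    []),
    ("Shared",    "read miss"):  ("Shared",    ["place read miss on bus"]),
    ("Shared",    "write"):      ("Exclusive", ["place write miss on bus"]),
    ("Exclusive", "read hit"):   ("Exclusive", []),
    ("Exclusive", "write hit"):  ("Exclusive", []),
    ("Exclusive", "read miss"):  ("Shared",    ["write back block",
                                                "place read miss on bus"]),
    ("Exclusive", "write miss"): ("Exclusive", ["write back block",
                                                "place write miss on bus"]),
}

def next_state(state, event):
    """Next state = fn(current state, input event), as in the FSM slide."""
    return MSI_CPU_TRANSITIONS[(state, event)]

# A write to a Shared block is treated as a miss: it needs a bus action.
print(next_state("Shared", "write"))
```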
Multilevel Caches in Practice
• The primary (Level-1) cache is attached to the CPU
  – Smallest, but fastest
• The Level-2 cache services misses from the L1 cache
  – Larger and slower, but still much faster than main memory
• Main memory services L2 cache misses
• High-end systems sometimes include an L3 cache
• However, too many levels introduce significant overhead
  – Data must be kept consistent across levels (L1 < L2 < L3)
  – Communication is needed across all levels
Fig: CPU -> L1 cache -> L2 cache -> Main Memory
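The benefit of adding an L2 level can be sketched with the AMAT formula. All latencies and miss rates below are assumed example values, not measurements from the lecture:

```python
# Two-level AMAT: an L1 miss pays the L2 access time, and an L2 miss
# additionally pays the main-memory access time.

def amat_two_level(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, mem_time):
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * mem_time)

# Assumed parameters: 1-cycle L1 hit, 5% L1 miss rate,
# 10-cycle L2 access, 25% local L2 miss rate, 100-cycle memory.
with_l2    = amat_two_level(1, 0.05, 10, 0.25, 100)
without_l2 = 1 + 0.05 * 100     # L1 misses go straight to main memory
print(with_l2, without_l2)      # 2.75 vs 6.0
```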
Real-World Multilevel Caches (§5.13 ARM Cortex-A8 and Intel Core i7 Memory Hierarchy)