homework 6 - cs.iupui.edufgsong/csci402/lecture_notes/lecture_20.pdf · an example of snoopy...
4/10/18
Homework 6
• BTW, this is your last homework
• 5.1.1–5.1.3
• 5.2.1–5.2.2
• 5.3.1–5.3.5
• 5.4.1–5.4.2
• 5.6.1–5.6.5
• 5.12.1
• Assigned today, Tuesday, April 10
• Due: 11:59 PM on Monday, April 23
CSCI 402: Computer Architectures
Memory Hierarchy (4)
Fengguang Song, Department of Computer & Information Science, IUPUI
Recall
• Direct-mapped cache
  – Hardware, stored content, and overhead
• A few examples of dividing a memory address into fields
• How to handle a read/write? Either a hit or a miss
• Memory stall cycles per instruction
  – Effective CPI = base CPI + memory stall cycles per instruction
• AMAT (average memory access time)
  – Structure: a fast L1 cache backed by slow main memory
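The recalled formulas can be sketched numerically. A minimal sketch; the parameter values below (base CPI, accesses per instruction, miss rate, penalty) are made-up illustrations, not figures from the lecture:

```python
# Sketch of the effective-CPI and AMAT arithmetic recalled above.
# All numeric parameters are assumed example values.

def effective_cpi(base_cpi, accesses_per_instr, miss_rate, miss_penalty):
    """Effective CPI = base CPI + memory stall cycles per instruction."""
    return base_cpi + accesses_per_instr * miss_rate * miss_penalty

def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

# Example: base CPI 1.0, 1.2 memory accesses per instruction,
# 5% miss rate, 100-cycle miss penalty, 1-cycle L1 hit time.
print(effective_cpi(1.0, 1.2, 0.05, 100))  # 1.0 + 1.2 * 0.05 * 100 = 7.0
print(amat(1, 0.05, 100))                  # 1 + 0.05 * 100 = 6.0
```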
Associative Caches
• Direct-mapped cache: cache index = block address % N
  – e.g., N = 4: block addresses 0, 4, 8, 12, … all map to the same line
Associative Caches
• Associativity makes block placement more flexible
  – In a direct-mapped cache, there is only one choice!
• Fully associative cache
  – Allows a memory block to go into any cache line
  – So all entries must be searched at once
  – i.e., one comparator per cache line (very expensive!)
• n-way set associative cache
  – Each set contains n entries
  – The block number determines the set
    • i.e., set index = block number modulo #sets
  – Search all entries in the given set at once
    • So, just n comparators for n ways (much less expensive!)
Fig: 2 sets, 4 entries per set
Expensive Cost of Full Associativity
• A fully associative cache is expensive to implement
  1. There is no index field in the address subdivision; nearly the entire address must be used as the tag → more cache space is spent on tags
  2. We must check the tags of all blocks → many comparators
Fig: a fully associative cache — each line stores a valid bit, a tag, and data; one comparator per line checks the address tag against every stored tag in parallel to produce the hit signal
Use Different Caches
• Suppose a cache can store 8 blocks
• Where can memory block address 12 be placed?
  – Direct mapped: line 12 mod 8
  – 2-way set associative (4 sets): set 12 mod 4
  – Fully associative: anywhere
• We have covered the direct-mapped cache, the n-way set associative cache, and the fully associative cache
• In fact, we can think of every cache organization as a variation of the set-associative cache
Varying Associativity
• The same cache with 8 entries can be organized as 1-way (direct mapped, 8 sets), 2-way (4 sets), 4-way (2 sets), or 8-way (fully associative, 1 set)
• Every cache is a variation of a set-associative cache
Comparing Associativity
• Compare different cache types, each with only 4 blocks:
  – (1) Direct mapped cache
  – (2) 2-way set associative cache
  – (3) Fully associative cache
  – Block access sequence: 0, 8, 0, 6, 8 (block addresses)
• (1) Direct mapped (cache index = block address mod 4)

Block address | Cache index | Hit/miss | Line 0 | Line 1 | Line 2 | Line 3
      0       |      0      |   miss   | Mem[0] |        |        |
      8       |      0      |   miss   | Mem[8] |        |        |
      0       |      0      |   miss   | Mem[0] |        |        |
      6       |      2      |   miss   | Mem[0] |        | Mem[6] |
      8       |      0      |   miss   | Mem[8] |        | Mem[6] |
All misses!
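The all-miss behavior in the table can be reproduced with a minimal direct-mapped simulator. A sketch that tracks only block addresses (no tags or data):

```python
# Tiny direct-mapped cache simulation: 4 one-block lines,
# access sequence 0, 8, 0, 6, 8 (block addresses).

def simulate_direct_mapped(accesses, num_lines=4):
    lines = [None] * num_lines          # one block per line, initially empty
    results = []
    for block in accesses:
        index = block % num_lines       # cache index = block address mod 4
        if lines[index] == block:
            results.append("hit")
        else:
            results.append("miss")
            lines[index] = block        # replace whatever was there
    return results

print(simulate_direct_mapped([0, 8, 0, 6, 8]))
# Blocks 0 and 8 both map to index 0, so they keep evicting
# each other: every access misses.
```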
Associativity Example
• (2) 2-way set associative (set index = block address mod 2)

Block address | Set index | Hit/miss | Set 0          | Set 1
      0       |     0     |   miss   | Mem[0]         |
      8       |     0     |   miss   | Mem[0], Mem[8] |
      0       |     0     |   hit    | Mem[0], Mem[8] |
      6       |     0     |   miss   | Mem[0], Mem[6] |
      8       |     0     |   miss   | Mem[8], Mem[6] |

• (3) Fully associative

Block address | Hit/miss | Cache content after access
      0       |   miss   | Mem[0]
      8       |   miss   | Mem[0], Mem[8]
      0       |   hit    | Mem[0], Mem[8]
      6       |   miss   | Mem[0], Mem[8], Mem[6]
      8       |   hit    | Mem[0], Mem[8], Mem[6]

Best result: fully associative (2 hits on the sequence 0, 8, 0, 6, 8)
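Both tables can be reproduced with a small set-associative simulator. A sketch that assumes LRU replacement within each set, which matches the eviction order shown above:

```python
# Set-associative cache with LRU replacement: each set keeps its
# blocks ordered from least- to most-recently used.

def simulate_set_associative(accesses, num_sets, ways):
    sets = [[] for _ in range(num_sets)]
    results = []
    for block in accesses:
        s = sets[block % num_sets]         # set index = block mod #sets
        if block in s:
            results.append("hit")
            s.remove(block)                # re-append as most-recently used
        else:
            results.append("miss")
            if len(s) == ways:
                s.pop(0)                   # evict the least-recently used
        s.append(block)
    return results

seq = [0, 8, 0, 6, 8]
print(simulate_set_associative(seq, num_sets=2, ways=2))  # 2-way: 1 hit
print(simulate_set_associative(seq, num_sets=1, ways=4))  # fully assoc.: 2 hits
```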
How Much Associativity Is Appropriate?
• Increasing associativity can decrease the miss rate
  – But it is more costly to build
  – And it has diminishing returns (see the data below)
• Simulations with a 64 KB D-cache, 16-word blocks, SPEC2000
• Data miss rate:
  – 1-way: 10.3%
  – 2-way: 8.6%
  – 4-way: 8.3%
  – 8-way: 8.1%
“Address Subdivision” for a Set-Associative Cache
• If a cache has 2^s sets and each block has 2^n bytes, then a memory address of m bits can be partitioned as follows:

  | Tag (m − s − n bits) | Set index (s bits) | Block offset (n bits) |

• The arithmetic to compute the set index:
  Block address = Memory address / 2^n
  Set index = Block address mod 2^s
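The partitioning arithmetic can be sketched in code. The s and n values below assume the 256-set, 4-byte-block configuration described on the next slide; the example address is arbitrary:

```python
# Split a byte address into (tag, set index, block offset) using the
# arithmetic above. Assumes 2^8 = 256 sets and 2^2 = 4-byte blocks.

S_BITS = 8   # 2^8 = 256 sets
N_BITS = 2   # 2^2 = 4 bytes per block

def split_address(addr, s=S_BITS, n=N_BITS):
    block_offset = addr & ((1 << n) - 1)        # low n bits
    block_address = addr >> n                   # Block address = addr / 2^n
    set_index = block_address & ((1 << s) - 1)  # Block address mod 2^s
    tag = addr >> (s + n)                       # remaining high bits
    return tag, set_index, block_offset

# Example: an arbitrary 32-bit byte address.
tag, set_index, offset = split_address(0x12345678)
print(hex(tag), set_index, offset)  # 0x48d15 158 0
```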
Organization of a Set-Associative Cache
• Each cache block has 4 bytes
• 4-way set associative
• 256 sets
Cache Replacement Policy
• Direct mapped: no choice -> must replace the existing block
• n-way set associative: you have n choices
  – Prefer an invalid entry if there is one
  – Otherwise, choose one among the entries in the set
  – But which one to choose, and how?
There are different cache replacement policies:
• Least-recently used (LRU)
  – Replace the block that has not been used for the longest time
  – Simple for 2-way, manageable for 4-way, too hard beyond that
• Random
  – Performs approximately as well as LRU at high associativity
• FIFO
  – Replace the block that entered the cache first (i.e., the oldest one)
Question: What happens if the assigned cache block space is already occupied?
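The selection rules above can be sketched as a victim chooser for one set. The entry fields (`valid`, `last_used`, `loaded_at`) are hypothetical bookkeeping for illustration, not a prescribed hardware format:

```python
# Choose which way of a set to replace: prefer an invalid entry,
# otherwise apply the configured replacement policy.
import random

def choose_victim(entries, policy="LRU"):
    """entries: one dict per way, with 'valid', 'last_used', 'loaded_at'."""
    for i, e in enumerate(entries):
        if not e["valid"]:
            return i                       # an invalid entry is free to use
    if policy == "LRU":
        # Least-recently used: smallest last-access timestamp.
        return min(range(len(entries)), key=lambda i: entries[i]["last_used"])
    if policy == "FIFO":
        # Oldest resident: smallest load timestamp.
        return min(range(len(entries)), key=lambda i: entries[i]["loaded_at"])
    return random.randrange(len(entries))  # random policy

ways = [
    {"valid": True, "last_used": 7, "loaded_at": 1},
    {"valid": True, "last_used": 3, "loaded_at": 5},
]
print(choose_victim(ways, "LRU"))   # way 1: least recently used
print(choose_victim(ways, "FIFO"))  # way 0: loaded first (oldest)
```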
Next, Cache Coherence Problem…
What Is a Finite State Machine (FSM)?
• We use an FSM to define a sequence of control steps
• An FSM consists of a set of states; each edge is labeled with an event
• Transitions between states of the FSM:
  – The current state is stored in a register
  – Next state = fn(current state, input event)
Fig: Cache Controller FSM (e.g., one state sets the valid bit and tag)
Cache Coherence Problem (§5.10 Parallelism and Memory Hierarchies: Cache Coherence)
• Suppose two CPU cores share a physical memory space
  – Assuming write-through caches

Time step | Event               | CPU A's cache | CPU B's cache | Memory[X]
    0     |                     |               |               |     0
    1     | CPU A reads X       |       0       |               |     0
    2     | CPU B reads X       |       0       |       0       |     0
    3     | CPU A writes 1 to X |       1       |       0       |     1

Fig: CPU A and CPU B, each with its own cache, sharing X in memory
Coherence Definition
• Informally: every read should return the most recently written value
• Formally:
  – P writes X; P reads X (no intervening writes) ⇒ the read returns the written value
  – P1 writes X; P2 reads X (later) ⇒ P2's read returns the written value
    • e.g., CPU B reading X after step 3 in the previous example
  – P1 writes X; P2 writes X ⇒ all processors see the writes in the same order
    • All end up with the same final value for X
Cache Coherence Protocols
• Define the operations caches must perform, in multiprocessors, to ensure coherence
  – How to migrate data to a local cache
  – How to replicate read-shared data
• The classic approach: “snooping” protocols
  – Each cache monitors the bus for reads and writes
Snooping Protocols
• A cache gets exclusive access to a block whenever it writes to that block
  – It broadcasts an invalidate message on the bus
  – A subsequent read in another cache then misses
    • The owning cache supplies the updated value

CPU activity        | Bus activity     | CPU A's cache | CPU B's cache | Memory[X]
                    |                  |               |               |     0
CPU A reads X       | Cache miss for X |       0       |               |     0
CPU B reads X       | Cache miss for X |       0       |       0       |     0
CPU A writes 1 to X | Invalidate for X |       1       |               |     0
CPU B reads X       | Cache miss for X |       1       |       1       |     1

Detailed protocols are shown in the following slides…
An Example of Snoopy Protocol (MSI)
• Invalidate protocol for write-back caches
• Each cache block is in one of three states (tracked per block):
  – Shared: the block can be read
    • Clean in all caches and up to date in memory
  – Exclusive (or Modified): this cache has the only copy; it is writeable and dirty
  – Invalid: the block contains no valid data
• All caches “snoop” the bus
  – A read-miss message on the bus can be satisfied by one of the caches
• A write to a Shared block is treated as a miss (it requires a bus action to invalidate the other copies)
Snoopy-Cache State Machine I
• Cache state transitions for CPU requests to a cache block
• Cache block states: Invalid, Shared (read only), Exclusive (read/write)
  – Invalid -> Shared: CPU read; place a read miss on the bus
  – Invalid -> Exclusive: CPU write; place a write miss on the bus
  – Shared -> Shared: CPU read hit; or CPU read miss (place a read miss on the bus)
  – Shared -> Exclusive: CPU write; place a write miss on the bus
  – Exclusive -> Exclusive: CPU read hit or CPU write hit; or CPU write miss (write back the cache block, place a write miss on the bus)
  – Exclusive -> Shared: CPU read miss; write back the block, place a read miss on the bus
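The CPU-request transitions above can be written down as a lookup table. A sketch covering only the CPU side of the MSI protocol (bus-side snooping transitions are omitted); the event names are informal labels, not protocol terminology:

```python
# MSI transitions for CPU-side requests on one cache block.
# Maps (current state, event) -> (next state, list of actions).

MSI_CPU_TRANSITIONS = {
    ("Invalid",   "read"):       ("Shared",    ["place read miss on bus"]),
    ("Invalid",   "write"):      ("Exclusive", ["place write miss on bus"]),
    ("Shared",    "read hit"):   ("Shared",    []),
    ("Shared",    "read miss"):  ("Shared",    ["place read miss on bus"]),
    ("Shared",    "write"):      ("Exclusive", ["place write miss on bus"]),
    ("Exclusive", "read hit"):   ("Exclusive", []),
    ("Exclusive", "write hit"):  ("Exclusive", []),
    ("Exclusive", "read miss"):  ("Shared",    ["write back block",
                                                "place read miss on bus"]),
    ("Exclusive", "write miss"): ("Exclusive", ["write back block",
                                                "place write miss on bus"]),
}

def next_state(state, event):
    """Next state = fn(current state, input event), as in the FSM slide."""
    return MSI_CPU_TRANSITIONS[(state, event)]

# A write to a Shared block is treated as a miss: it needs a bus action.
print(next_state("Shared", "write"))
```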
Multilevel Caches in Practice
• The primary (Level-1) cache is attached to the CPU
  – Smallest, but fastest
• The Level-2 cache services misses from the L1 cache
  – Larger and slower, but still much faster than main memory
• Main memory services L2 cache misses
• High-end systems sometimes include an L3 cache
• However, too many levels introduce significant overhead
  – Data must be kept consistent across levels (L1 < L2 < L3)
  – Communication is needed across all levels
Fig: CPU -> L1 cache -> L2 cache -> Main Memory
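The benefit of adding an L2 level can be sketched with the AMAT formula. All latencies and miss rates below are assumed example values, not measurements from the lecture:

```python
# Two-level AMAT: an L1 miss pays the L2 access time, and an L2 miss
# additionally pays the main-memory access time.

def amat_two_level(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, mem_time):
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * mem_time)

# Assumed parameters: 1-cycle L1 hit, 5% L1 miss rate,
# 10-cycle L2 access, 25% local L2 miss rate, 100-cycle memory.
with_l2    = amat_two_level(1, 0.05, 10, 0.25, 100)
without_l2 = 1 + 0.05 * 100     # L1 misses go straight to main memory
print(with_l2, without_l2)      # 2.75 vs 6.0
```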
Real-World Multilevel Caches (§5.13 ARM Cortex-A8 and Intel Core i7 Memory Hierarchy)