ECE 4750 Computer Architecture, Fall 2014
T04 Single-Cycle Cache Memories
School of Electrical and Computer Engineering, Cornell University
revision: 2014-10-06-13-16
1 Memory Technology
2 Cache Memory Overview
3 Single-Cycle Direct-Mapped Cache Memory
  3.1 Cache Memory Datapath
  3.2 Cache Memory Control Unit
4 Analyzing Cache Memory Performance
1. Memory Technology
[Figure: a level-high latch and a positive edge-triggered register, each drawn with a D input, Q output, and clk.]
Memory Arrays: Register Files
Memory Arrays: SRAM
[Figure: SRAM cell area across process generations, drawn against a 1 micron scale bar: 130nm [Tyagi00], 90nm [Thompson02], 65nm [Bai04], 45nm [Mistry07], 32nm [Natarajan08].]
Memory Arrays: DRAM
[Figure: DRAM organization, adapted from [Foss, "Implementing Application-Specific Memory." ISSCC'96]. Wordlines and bitlines form an array core; array blocks add row decoders, column decoders, helper FFs, and I/Os; banks and sub-banks with an I/O strip form a chip; chips form ranks on a channel managed by a memory controller. A companion chart contrasts on-chip SRAM, SRAM in a dedicated process, on-chip DRAM, and DRAM in a dedicated process.]
Memory Technology Trade-Offs
[Figure: memory technology trade-offs. Latches & registers and register files offer low capacity, low latency, and high bandwidth (more and wider ports); SRAM, DRAM, and flash & disk offer progressively higher capacity at higher latency and lower bandwidth.]
Aside: Combinational vs. Synchronous SRAMs
2. Cache Memory Overview
[Figure: multicore system: processors (P) with instruction caches (I$) and data caches (D$) connect through networks to a shared main memory.]
• Processors for computation
• Memories for storage
• Networks for communication
Caching books from library analogy
Our goal is to do some research on a new computer architecture, and so we wish to consult the literature to learn more about past computer systems. The library contains all of the literature we are interested in. However, there are too many distractions at the library, so we prefer to do our reading in our office. Our office has an empty bookshelf which can hold ten books or so, and an empty desk which can hold a single book at a time.
[Figure: desk (can hold one book), bookshelf (can hold a few books), library (can hold many books).]
Books from library with no bookshelf “cache”
[Figure: timeline with desk and library only. Need book 1: walk to library (15 min), checkout book 1 (10 min), walk to office (15 min), read some of book 1 (10 min). Need book 2: walk to library (15 min), checkout book 2 (10 min), walk to office (15 min), read some of book 2 (10 min). Need book 1 again: repeat the same round trip.]
• Average latency to access a book: 40 minutes (15 min walk + 10 min checkout + 15 min walk)
• Average throughput including reading time: 1.2 books/hour (50 minutes per book)
• Latency to access the library limits our throughput
Books from library with bookshelf “cache”

[Figure: timeline with desk, bookshelf, and library. Need book 1: check bookshelf (5 min), cache miss, walk to library (15 min), checkout book 1 and book 2 (10 min), walk to office (15 min), read some of book 1. Need book 2: check bookshelf (5 min), cache hit (spatial locality), read some of book 2. Need book 1 again: check bookshelf (5 min), cache hit (temporal locality), read some of book 1.]
• Average latency to access a book: <20 minutes
• Average throughput including reading time: ≈2 books/hour
• Bookshelf acts as a small “cache” of the books in the library
– Cache Hit: Book is on the bookshelf when we check, so there is no need to go to the library to get the book
– Cache Miss: Book is not on the bookshelf when we check, so we need to go to the library to get the book
• Caches exploit structure in the access pattern to avoid the library access time limiting throughput
– Temporal Locality: If we access a book once, we are likely to access the same book again in the near future
– Spatial Locality: If we access a book on a given topic, we are likely to access other books on the same topic in the near future
Cache memories in computer architecture
[Figure: processor (few storage locations), cache (a moderate number of storage locations), and main memory (many storage locations). Cache accesses take 10 or fewer cycles; main memory accesses take 100s of cycles.]
Cache memories exploit temporal and spatial locality
[Figure: memory address vs. time. Instruction fetches show n loop iterations plus a subroutine call and return; stack accesses show argument accesses; data accesses show vector accesses and scalar accesses.]
Understanding locality for assembly programs
Examine each of the following assembly programs and rank each program based on the level of temporal and spatial locality in both the instruction and data address streams, on a scale from 0 to 5, with 0 being no locality and 5 being very significant locality. For each program, rate instruction temporal (Inst Temp), instruction spatial (Inst Spat), data temporal (Data Temp), and data spatial (Data Spat) locality.

    loop: lw    r1, 0(r2)
          lw    r3, 0(r4)
          addiu r5, r1, r3
          sw    r5, 0(r6)
          addiu r2, r2, 4
          addiu r4, r4, 4
          addiu r6, r6, 4
          addiu r7, r7, -1
          bne   r7, r0, loop

    loop: lw    r1, 0(r2)
          lw    r3, 0(r1) # random ptrs
          lw    r4, 0(r3) # random ptrs
          addiu r4, r4, 1
          addiu r2, r2, 4
          addiu r7, r7, -1
          bne   r7, r0, loop

    loop: lw    r1, 0(r2) # many diff
          jalr  r1        # func ptrs
          addiu r2, r2, 4
          addiu r7, r7, -1
          bne   r7, r0, loop
Four key questions for cache memories
• How much data is stored in a cache line?
• How do we store multiple lines in the cache?
• What data is replaced to make room for new data when cache is full?
• How should we manage writes?
How much data is stored in a cache line?
[Figure: three cache organizations with increasing line size. One-word lines: 30b tag, no offset. Two-word lines: 29b tag, 1b word offset selecting one of two 32b words. Four-word lines: 28b tag, 2b word offset selecting one of four 32b words. Each line holds a valid bit (V), a tag, and the data words; a tag comparison drives the hit signal, and the low two address bits (00) are the byte offset within a word.]
How do we store multiple lines in the cache?
Direct-Mapped

[Figure: direct-mapped cache with 16B lines: 26b tag, 2b index, 2b word offset (low two bits 00). The index selects one V/Tag entry and one 128b data line (four 32b words); the tag comparison drives hit, and the offset selects the word.]

Two-Way Set-Associative

[Figure: two-way set-associative cache with 16B lines: 27b tag, 1b index, 2b word offset. The index selects a set containing Way 0 and Way 1; both tags are compared in parallel, and a match in either way drives hit and selects that way's data.]
Four-Way Fully-Associative
00
V Tag
off
2b
tag
28b
32b
32b32b 32b 32b
Data
hit
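To see how the field widths in these three figures fall out of the cache parameters, here is a small illustrative Python sketch (not part of the original notes); it assumes 32-bit byte addresses and 4-byte words:

    # Derive tag/index/offset widths from cache parameters (illustrative).
    # Assumes 32-bit byte addresses and 4-byte words.
    def field_widths(capacity, line_size, num_ways):
        num_sets    = capacity // (line_size * num_ways)
        byte_bits   = 2                                  # the "00" byte offset
        offset_bits = (line_size // 4 - 1).bit_length()  # word offset within line
        index_bits  = (num_sets - 1).bit_length()        # selects the set
        tag_bits    = 32 - byte_bits - offset_bits - index_bits
        return tag_bits, index_bits, offset_bits

    # Matches the figures above (64B capacity, 16B lines):
    print(field_widths(64, 16, 1))   # direct-mapped      -> (26, 2, 2)
    print(field_widths(64, 16, 2))   # two-way set-assoc  -> (27, 1, 2)
    print(field_widths(64, 16, 4))   # fully associative  -> (28, 0, 2)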
What data is replaced to make room for new data when cache is full?
• No choice in a direct-mapped cache
• Random
– Good average case performance, but difficult to implement
• Least Recently Used (LRU)
– Replace the cache line which has not been accessed recently
– Exploits temporal locality
– LRU cache state must be updated on every access, which is expensive (see the bookkeeping sketch after this list)
– True implementation only feasible for small sets (2-way)
– Pseudo-LRU uses a binary tree to approximate LRU for higher associativity
• First-In First-Out (FIFO, Round Robin)
– Simpler implementation, but does not exploit temporal locality
– Potentially useful in large fully associative caches
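As a concrete picture of the bookkeeping true LRU requires, here is a minimal Python sketch (illustrative only; the class and method names are made up): each set keeps its ways ordered from least to most recently used.

    # Minimal LRU bookkeeping sketch (illustrative; names are made up).
    class LruState:
        def __init__(self, num_sets, num_ways):
            # each set keeps its ways ordered least- to most-recently used
            self.order = [list(range(num_ways)) for _ in range(num_sets)]

        def touch(self, set_idx, way):
            # must run on every access (hit or fill) -- this is the expense
            self.order[set_idx].remove(way)
            self.order[set_idx].append(way)

        def victim(self, set_idx):
            # the least recently used way is the replacement candidate
            return self.order[set_idx][0]

    lru = LruState(num_sets=2, num_ways=2)
    lru.touch(0, 1)        # access way 1 of set 0
    print(lru.victim(0))   # -> 0: way 0 is now least recently used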
How should we manage writes?
• Write-Through with No Write Allocate
– On a write miss, write memory but do not bring the line into the cache
– On a write hit, write both the cache and memory
– Requires more memory bandwidth, but simpler to implement
• Write-Back with Write Allocate
– On a write miss, bring the cache line into the cache, then write
– On a write hit, only write the cache, do not write memory
– Only update memory when a dirty cache line is evicted
– More efficient, but more complicated to implement
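A minimal sketch contrasting the two policies, with toy Python dictionaries standing in for the cache and memory (all names are illustrative; for simplicity, writes here replace a whole line rather than a word):

    # Toy write-policy sketch (illustrative). `cache` and `mem` map line
    # addresses to data; `dirty` tracks modified lines under write-back.
    def write_through_no_allocate(cache, mem, line_addr, data):
        if line_addr in cache:        # write hit: update the cache ...
            cache[line_addr] = data
        mem[line_addr] = data         # ... and always write through to memory

    def write_back_write_allocate(cache, mem, dirty, line_addr, data):
        if line_addr not in cache:    # write miss: bring the line in first
            cache[line_addr] = mem.get(line_addr)
        cache[line_addr] = data       # write only the cache
        dirty.add(line_addr)          # memory is updated when line is evicted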
Categorizing Misses: The Three C’s
• Compulsory: first reference to a block
• Capacity: cache is too small to hold all data needed by the program
• Conflict: collisions in a specific set
We can classify misses in a cache with a target capacity and associativity by asking a sequence of three questions:

• Question 1) Would this miss occur in a cache with infinite capacity? If the answer is yes, then this is a compulsory miss and we are done. If the answer is no, then consider question 2.
• Question 2) Would this miss occur in a fully associative cache with the target capacity? If the answer is yes, then this is a capacity miss and we are done. If the answer is no, then consider question 3.
• Question 3) Would this miss occur in a cache with the desired capacity and associativity? If the answer is yes, then this is a conflict miss and we are done. If the answer is no, then this is not a miss – it is a hit!
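These three questions translate directly into a simulation: replay the trace against an infinite cache, a fully associative LRU cache of the target capacity, and the target cache itself. Below is a minimal Python sketch of this idea (illustrative, not from the notes; it assumes LRU replacement throughout):

    from collections import OrderedDict

    def classify_misses(trace, capacity, line_size, num_ways):
        # Classify each access as hit / compulsory / capacity / conflict.
        # Assumes LRU replacement; addresses are byte addresses.
        num_sets   = capacity // (line_size * num_ways)
        seen       = set()          # infinite cache: every line ever touched
        full_assoc = OrderedDict()  # fully associative LRU, same capacity
        sets       = [OrderedDict() for _ in range(num_sets)]  # target cache
        kinds      = []
        for addr in trace:
            line = addr // line_size
            ways = sets[line % num_sets]
            if line in ways:                  # hit in the target cache
                kind = "hit"
            elif line not in seen:            # Q1: miss even with infinite capacity
                kind = "compulsory"
            elif line not in full_assoc:      # Q2: miss even fully associative
                kind = "capacity"
            else:                             # Q3: only misses in the target cache
                kind = "conflict"
            # update all three model caches
            seen.add(line)
            full_assoc[line] = True
            full_assoc.move_to_end(line)
            if len(full_assoc) > capacity // line_size:
                full_assoc.popitem(last=False)   # evict LRU line
            ways[line] = True
            ways.move_to_end(line)
            if len(ways) > num_ways:
                ways.popitem(last=False)
            kinds.append(kind)
        return kinds

    # e.g., the 128B direct-mapped cache with 64B lines examined below:
    trace = [0x000, 0x044, 0x04c, 0x008, 0x104, 0x14c, 0x000, 0x048]
    print(classify_misses(trace, capacity=128, line_size=64, num_ways=1))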
Categorizing Misses
Assume we have a 128B direct-mapped cache with 64B cache lines. Classify each request as a hit, compulsory miss, capacity miss, or conflict miss.
                idx   off   Set 0   Set 1   Type
    rd 0x000                0x000
    rd 0x044
    rd 0x04c
    rd 0x008
    rd 0x104
    rd 0x14c
    rd 0x000
    rd 0x048
Assume we have a 64B two-way set-associative cache with 16B cache lines. Classify each request as a hit, compulsory miss, capacity miss, or conflict miss.

                            Set 0           Set 1
                idx   off   Way 0   Way 1   Way 0   Way 1   Type
    rd 0x104          0x000
    rd 0x100
    rd 0x208
    rd 0x204
    rd 0x30c
    rd 0x40c
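The classifier sketch from above can replay this trace as well (again assuming true LRU replacement within each two-way set):

    # Two-way set-associative example: 64B capacity, 16B lines, 2 ways.
    trace = [0x104, 0x100, 0x208, 0x204, 0x30c, 0x40c]
    print(classify_misses(trace, capacity=64, line_size=16, num_ways=2))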
3. Single-Cycle Direct-Mapped Cache Memory
Transactions
• We usually think of subsystems executing “transactions”
• Processor: instructions are transactions
• Memory: memory requests and memory responses are transactions
Technology Constraints
• Goal is for the cache and main memory to execute each transaction in less than one cycle
• Assume idealistic technology with dual-ported SRAMs for the tag arrays and single-cycle combinational main memory
[Figure: cache microarchitecture: a control unit exchanges control and status signals with a datapath containing the tag array and data array. Cache requests and responses (creq/cresp) carry 32b words; memory requests and responses (mreq/mresp) carry 128b lines; main memory is combinational, <1 cycle.]
Four key questions for cache memories
• How much data is stored in each cache line?
  – 16-byte cache lines
• How many cache lines and how do we manage these lines?
  – 64-byte total cache capacity
  – Two-way set-associative
• What data is replaced to make room for new data when cache is full?
  – Least recently used (LRU)
• How should we manage writes?
  – Write-through, no write allocate
Steps involved in executing transactions
READ transactions:
• Check tag array
  – hit: read word from data array, return word
  – miss: choose victim line, read line from main memory, write line to data array, then return the word

WRITE transactions:
• Check tag array
  – hit: write word to data array, and write word to main memory
  – miss: write word to main memory only (no write allocate)
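The following Python sketch walks through exactly these steps for the design above (two-way set-associative, 64B capacity, 16B lines, write-through, no write allocate, LRU victims). It is an illustrative functional model, not the RTL:

    # Functional model of the transaction steps (illustrative, not RTL).
    NUM_SETS, NUM_WAYS, LINE_SIZE = 2, 2, 16

    tags = [[None] * NUM_WAYS for _ in range(NUM_SETS)]
    data = [[None] * NUM_WAYS for _ in range(NUM_SETS)]
    lru  = [0] * NUM_SETS                    # way to victimize next, per set

    def split(addr):
        off = (addr % LINE_SIZE) // 4        # 2b word offset
        idx = (addr // LINE_SIZE) % NUM_SETS # 1b index
        tag = addr // (LINE_SIZE * NUM_SETS) # remaining high bits
        return tag, idx, off

    def read(addr, mem):
        tag, idx, off = split(addr)
        for way in range(NUM_WAYS):          # check tag array
            if tags[idx][way] == tag:
                lru[idx] = 1 - way           # hit: update LRU ...
                return data[idx][way][off]   # ... read word, return word
        way  = lru[idx]                      # miss: choose victim line
        base = addr - addr % LINE_SIZE
        line = [mem.get(base + 4 * i, 0) for i in range(4)]  # read line from main mem
        tags[idx][way], data[idx][way] = tag, line           # write line to data array
        lru[idx] = 1 - way
        return line[off]                     # return word

    def write(addr, value, mem):
        tag, idx, off = split(addr)
        for way in range(NUM_WAYS):          # check tag array
            if tags[idx][way] == tag:
                data[idx][way][off] = value  # hit: write word to data array
                lru[idx] = 1 - way
        mem[addr] = value                    # always write word to main mem

    mem = {}
    write(0x10, 42, mem)
    print(read(0x10, mem))   # -> 42 (the read misses and refills from memory)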
3.1. Cache Memory Datapath
As with processors, we can design our cache datapath by incrementally adding support for each transaction and resolving conflicts in the datapath using muxes.
Implementing READ transactions that hit
[Figure: datapath for READ hits. The creq address splits into 27b tag, 1b idx, and 2b word offset (low two bits 00). The idx reads both ways of the tag array (tarray0_en, tarray1_en), and comparators generate tarray0_match and tarray1_match. The idx also reads a 128b line from the data array (darray_en); the word offset selects one of the four 32b words ([31:0], [63:32], [95:64], [127:96]) to return as the cresp data.]
Adding READ transactions that miss
[Figure: datapath additions for READ misses. The tag and idx, with four zero bits appended (z4b), form the mreq address sent to main memory; the 128b mresp data is steered into the data array through a mux (2b darray_read_mux_sel) and written with darray_wen, and the tag arrays gain write enables (tarray0_wen, tarray1_wen) for the refill.]
Adding WRITE transactions
[Figure: datapath additions for WRITE transactions. The 32b creq data feeds a replication unit (replunit) whose 128b output is selected into the data array by darray_write_mux_sel, with darray_word_en enabling only the word being written; the creq data is also zero-extended (zext) to 128b to form the mreq data for the write-through to main memory.]
3.2. Cache Memory Control Unit
    Signal                  READ    WRITE
    tarray0_en
    tarray0_wen
    tarray1_en
    tarray1_wen
    darray_en
    darray_wen
    darray_word_en
    darray_write_mux_sel
    darray_read_mux_sel
We will need to keep valid bits in the control unit, with one valid bit for every cache line. We will also need to keep LRU bits which are updated on every hit to indicate which line should be the next victim. Assume we create the following two internal control signals to be used in the previous control signal table.
    hit    = ( tarray0_match && valid0[idx] )
          || ( tarray1_match && valid1[idx] )

    victim = lru[idx]
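In a functional model, this logic and the associated state updates might look as follows (an illustrative Python sketch; the notes define only hit and victim, the rest is assumption):

    # Hit/victim logic with valid and LRU bits (illustrative sketch).
    NUM_SETS = 2
    valid0 = [False] * NUM_SETS      # one valid bit per line in way 0
    valid1 = [False] * NUM_SETS      # one valid bit per line in way 1
    lru    = [0] * NUM_SETS          # next victim way for each set

    def lookup(idx, tarray0_match, tarray1_match):
        hit = ((tarray0_match and valid0[idx])
               or (tarray1_match and valid1[idx]))
        if hit:
            hit_way  = 0 if (tarray0_match and valid0[idx]) else 1
            lru[idx] = 1 - hit_way   # LRU bits updated on every hit
        return hit, lru[idx]         # victim = lru[idx]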
4. Analyzing Cache Memory Performance
[Figure: the processor sends accesses to the cache; hits are serviced by the cache, while misses go to main memory.]
Average Memory Access Latency (AMAL)
AMAL = ( Hit Time + ( Miss Rate × Miss Penalty ) ) × Cycle Time
    Microarchitecture                 Hit Time   Cycle Time
    this topic → Single-Cycle Cache   1          long
                 FSM Cache            >1         short
                 Pipelined Cache      ≈1         short
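For example, with illustrative numbers (not from the notes):

    # AMAL = (Hit Time + Miss Rate * Miss Penalty) * Cycle Time
    hit_time     = 1      # cycles per hit
    miss_rate    = 0.05   # fraction of accesses that miss
    miss_penalty = 100    # extra cycles per miss
    cycle_time   = 1.0    # ns per cycle

    amal = (hit_time + miss_rate * miss_penalty) * cycle_time
    print(amal)           # -> 6.0 ns average per memory access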
Estimating cycle time
[Figure: the complete two-way set-associative cache datapath from Section 3, repeated here for estimating the critical path and thus the cycle time.]