ECE 4750 Computer Architecture, Fall 2014
T04 Single-Cycle Cache Memories
School of Electrical and Computer Engineering, Cornell University
revision: 2014-10-06-13-16
1 Memory Technology
2 Cache Memory Overview
3 Single-Cycle Direct-Mapped Cache Memory
  3.1 Cache Memory Datapath
  3.2 Cache Memory Control Unit
4 Analyzing Cache Memory Performance
1. Memory Technology
[Figure: a level-high latch and a positive edge-triggered register, each drawn with a D input, Q output, and clk.]
Memory Arrays: Register Files
Memory Arrays: SRAM
[Figure: SRAM cell area across process generations, drawn against a 1 micron scale bar: 130nm [Tyagi00], 90nm [Thompson02], 65nm [Bai04], 45nm [Mistry07], 32nm [Natarajan08].]
Memory Arrays: DRAM
[Figure: DRAM organization, adapted from [Foss, "Implementing Application-Specific Memory." ISSCC'96]. Wordlines and bitlines form an array core; array blocks add row decoders, column decoders, helper FFs, and I/Os; banks and sub-banks with an I/O strip form a chip; chips form ranks on a channel managed by a memory controller. A companion chart contrasts on-chip SRAM, SRAM in a dedicated process, on-chip DRAM, and DRAM in a dedicated process.]
Memory Technology Trade-Offs
[Figure: memory technology trade-offs. Latches & registers and register files offer low capacity, low latency, and high bandwidth (more and wider ports); SRAM, DRAM, and flash & disk offer progressively higher capacity at higher latency and lower bandwidth.]
Aside: Combinational vs. Synchronous SRAMs
2. Cache Memory Overview
[Figure: multicore system: processors (P) with instruction caches (I$) and data caches (D$) connect through networks to a shared main memory.]
• Processors for computation
• Memories for storage
• Networks for communication
Caching books from library analogy
Our goal is to do some research on a new computer architecture, and so we wish to consult the literature to learn more about past computer systems. The library contains all of the literature we are interested in. However, there are too many distractions at the library, so we prefer to do our reading in our office. Our office has an empty bookshelf which can hold ten books or so, and an empty desk which can hold a single book at a time.
[Figure: desk (can hold one book), bookshelf (can hold a few books), library (can hold many books).]
Books from library with no bookshelf “cache”
[Figure: timeline with desk and library only. Need book 1: walk to library (15 min), checkout book 1 (10 min), walk to office (15 min), read some of book 1 (10 min). Need book 2: walk to library (15 min), checkout book 2 (10 min), walk to office (15 min), read some of book 2 (10 min). Need book 1 again: repeat the same round trip.]
• Average latency to access a book: 40 minutes (15 min walk + 10 min checkout + 15 min walk)
• Average throughput including reading time: 1.2 books/hour (50 minutes per book)
• Latency to access the library limits our throughput
Books from library with bookshelf “cache”

[Figure: timeline with desk, bookshelf, and library. Need book 1: check bookshelf (5 min), cache miss, walk to library (15 min), checkout book 1 and book 2 (10 min), walk to office (15 min), read some of book 1. Need book 2: check bookshelf (5 min), cache hit (spatial locality), read some of book 2. Need book 1 again: check bookshelf (5 min), cache hit (temporal locality), read some of book 1.]
• Average latency to access a book: <20 minutes
• Average throughput including reading time: ≈2 books/hour
• Bookshelf acts as a small “cache” of the books in the library
– Cache Hit: Book is on the bookshelf when we check, so there is no need to go to the library to get the book
– Cache Miss: Book is not on the bookshelf when we check, so we need to go to the library to get the book
• Caches exploit structure in the access pattern to avoid the library access time limiting throughput
– Temporal Locality: If we access a book once, we are likely to access the same book again in the near future
– Spatial Locality: If we access a book on a given topic, we are likely to access other books on the same topic in the near future
Cache memories in computer architecture
[Figure: processor (few storage locations), cache (a moderate number of storage locations), and main memory (many storage locations). Cache accesses take 10 or fewer cycles; main memory accesses take 100s of cycles.]
Cache memories exploit temporal and spatial locality
[Figure: memory address vs. time. Instruction fetches show n loop iterations plus a subroutine call and return; stack accesses show argument accesses; data accesses show vector accesses and scalar accesses.]
Understanding locality for assembly programs
Examine each of the following assembly programs and rank each program based on the level of temporal and spatial locality in both the instruction and data address streams, on a scale from 0 to 5, with 0 being no locality and 5 being very significant locality. For each program, rate instruction temporal (Inst Temp), instruction spatial (Inst Spat), data temporal (Data Temp), and data spatial (Data Spat) locality.

    loop: lw    r1, 0(r2)
          lw    r3, 0(r4)
          addiu r5, r1, r3
          sw    r5, 0(r6)
          addiu r2, r2, 4
          addiu r4, r4, 4
          addiu r6, r6, 4
          addiu r7, r7, -1
          bne   r7, r0, loop

    loop: lw    r1, 0(r2)
          lw    r3, 0(r1) # random ptrs
          lw    r4, 0(r3) # random ptrs
          addiu r4, r4, 1
          addiu r2, r2, 4
          addiu r7, r7, -1
          bne   r7, r0, loop

    loop: lw    r1, 0(r2) # many diff
          jalr  r1        # func ptrs
          addiu r2, r2, 4
          addiu r7, r7, -1
          bne   r7, r0, loop
Four key questions for cache memories
• How much data is stored in a cache line?
• How do we store multiple lines in the cache?
• What data is replaced to make room for new data when cache is full?
• How should we manage writes?
How much data is stored in a cache line?
[Figure: three cache organizations with increasing line size. One-word lines: 30b tag, no offset. Two-word lines: 29b tag, 1b word offset selecting one of two 32b words. Four-word lines: 28b tag, 2b word offset selecting one of four 32b words. Each line holds a valid bit (V), a tag, and the data words; a tag comparison drives the hit signal, and the low two address bits (00) are the byte offset within a word.]
How do we store multiple lines in the cache?
Direct-Mapped

[Figure: direct-mapped cache with 16B lines: 26b tag, 2b index, 2b word offset (low two bits 00). The index selects one V/Tag entry and one 128b data line (four 32b words); the tag comparison drives hit, and the offset selects the word.]

Two-Way Set-Associative

[Figure: two-way set-associative cache with 16B lines: 27b tag, 1b index, 2b word offset. The index selects a set containing Way 0 and Way 1; both tags are compared in parallel, and a match in either way drives hit and selects that way's data.]
Four-Way Fully-Associative
00
V Tag
off
2b
tag
28b
32b
32b32b 32b 32b
Data
hit
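To see how the field widths in these three figures fall out of the cache parameters, here is a small illustrative Python sketch (not part of the original notes); it assumes 32-bit byte addresses and 4-byte words:

    # Derive tag/index/offset widths from cache parameters (illustrative).
    # Assumes 32-bit byte addresses and 4-byte words.
    def field_widths(capacity, line_size, num_ways):
        num_sets    = capacity // (line_size * num_ways)
        byte_bits   = 2                                  # the "00" byte offset
        offset_bits = (line_size // 4 - 1).bit_length()  # word offset within line
        index_bits  = (num_sets - 1).bit_length()        # selects the set
        tag_bits    = 32 - byte_bits - offset_bits - index_bits
        return tag_bits, index_bits, offset_bits

    # Matches the figures above (64B capacity, 16B lines):
    print(field_widths(64, 16, 1))   # direct-mapped      -> (26, 2, 2)
    print(field_widths(64, 16, 2))   # two-way set-assoc  -> (27, 1, 2)
    print(field_widths(64, 16, 4))   # fully associative  -> (28, 0, 2)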
What data is replaced to make room for new data when cache is full?
• No choice in a direct-mapped cache
• Random
– Good average case performance, but difficult to implement
• Least Recently Used (LRU)
– Replace the cache line which has not been accessed recently
– Exploits temporal locality
– LRU cache state must be updated on every access, which is expensive (see the bookkeeping sketch after this list)
– True implementation only feasible for small sets (2-way)
– Pseudo-LRU uses a binary tree to approximate LRU for higher associativity
• First-In First-Out (FIFO, Round Robin)
– Simpler implementation, but does not exploit temporal locality
– Potentially useful in large fully associative caches
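As a concrete picture of the bookkeeping true LRU requires, here is a minimal Python sketch (illustrative only; the class and method names are made up): each set keeps its ways ordered from least to most recently used.

    # Minimal LRU bookkeeping sketch (illustrative; names are made up).
    class LruState:
        def __init__(self, num_sets, num_ways):
            # each set keeps its ways ordered least- to most-recently used
            self.order = [list(range(num_ways)) for _ in range(num_sets)]

        def touch(self, set_idx, way):
            # must run on every access (hit or fill) -- this is the expense
            self.order[set_idx].remove(way)
            self.order[set_idx].append(way)

        def victim(self, set_idx):
            # the least recently used way is the replacement candidate
            return self.order[set_idx][0]

    lru = LruState(num_sets=2, num_ways=2)
    lru.touch(0, 1)        # access way 1 of set 0
    print(lru.victim(0))   # -> 0: way 0 is now least recently used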
How should we manage writes?
• Write-Through with No Write Allocate
– On a write miss, write memory but do not bring the line into the cache
– On a write hit, write both the cache and memory
– Requires more memory bandwidth, but simpler to implement
• Write-Back with Write Allocate
– On a write miss, bring the cache line into the cache, then write
– On a write hit, only write the cache, do not write memory
– Only update memory when a dirty cache line is evicted
– More efficient, but more complicated to implement
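A minimal sketch contrasting the two policies, with toy Python dictionaries standing in for the cache and memory (all names are illustrative; for simplicity, writes here replace a whole line rather than a word):

    # Toy write-policy sketch (illustrative). `cache` and `mem` map line
    # addresses to data; `dirty` tracks modified lines under write-back.
    def write_through_no_allocate(cache, mem, line_addr, data):
        if line_addr in cache:        # write hit: update the cache ...
            cache[line_addr] = data
        mem[line_addr] = data         # ... and always write through to memory

    def write_back_write_allocate(cache, mem, dirty, line_addr, data):
        if line_addr not in cache:    # write miss: bring the line in first
            cache[line_addr] = mem.get(line_addr)
        cache[line_addr] = data       # write only the cache
        dirty.add(line_addr)          # memory is updated when line is evicted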
Categorizing Misses: The Three C’s
• Compulsory: first reference to a block
• Capacity: cache is too small to hold all data needed by the program
• Conflict: collisions in a specific set
We can classify misses in a cache with a target capacity and associativity by asking a sequence of three questions:

• Question 1) Would this miss occur in a cache with infinite capacity? If the answer is yes, then this is a compulsory miss and we are done. If the answer is no, then consider question 2.
• Question 2) Would this miss occur in a fully associative cache with the target capacity? If the answer is yes, then this is a capacity miss and we are done. If the answer is no, then consider question 3.
• Question 3) Would this miss occur in a cache with the desired capacity and associativity? If the answer is yes, then this is a conflict miss and we are done. If the answer is no, then this is not a miss – it is a hit!
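These three questions translate directly into a simulation: replay the trace against an infinite cache, a fully associative LRU cache of the target capacity, and the target cache itself. Below is a minimal Python sketch of this idea (illustrative, not from the notes; it assumes LRU replacement throughout):

    from collections import OrderedDict

    def classify_misses(trace, capacity, line_size, num_ways):
        # Classify each access as hit / compulsory / capacity / conflict.
        # Assumes LRU replacement; addresses are byte addresses.
        num_sets   = capacity // (line_size * num_ways)
        seen       = set()          # infinite cache: every line ever touched
        full_assoc = OrderedDict()  # fully associative LRU, same capacity
        sets       = [OrderedDict() for _ in range(num_sets)]  # target cache
        kinds      = []
        for addr in trace:
            line = addr // line_size
            ways = sets[line % num_sets]
            if line in ways:                  # hit in the target cache
                kind = "hit"
            elif line not in seen:            # Q1: miss even with infinite capacity
                kind = "compulsory"
            elif line not in full_assoc:      # Q2: miss even fully associative
                kind = "capacity"
            else:                             # Q3: only misses in the target cache
                kind = "conflict"
            # update all three model caches
            seen.add(line)
            full_assoc[line] = True
            full_assoc.move_to_end(line)
            if len(full_assoc) > capacity // line_size:
                full_assoc.popitem(last=False)   # evict LRU line
            ways[line] = True
            ways.move_to_end(line)
            if len(ways) > num_ways:
                ways.popitem(last=False)
            kinds.append(kind)
        return kinds

    # e.g., the 128B direct-mapped cache with 64B lines examined below:
    trace = [0x000, 0x044, 0x04c, 0x008, 0x104, 0x14c, 0x000, 0x048]
    print(classify_misses(trace, capacity=128, line_size=64, num_ways=1))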
Categorizing Misses
Assume we have a 128B direct-mapped cache with 64B cache lines. Classify each request as a hit, compulsory miss, capacity miss, or conflict miss.
                idx   off   Set 0   Set 1   Type
    rd 0x000                0x000
    rd 0x044
    rd 0x04c
    rd 0x008
    rd 0x104
    rd 0x14c
    rd 0x000
    rd 0x048
Assume we have a 64B two-way set-associative cache with 16B cache lines. Classify each request as a hit, compulsory miss, capacity miss, or conflict miss.

                            Set 0           Set 1
                idx   off   Way 0   Way 1   Way 0   Way 1   Type
    rd 0x104          0x000
    rd 0x100
    rd 0x208
    rd 0x204
    rd 0x30c
    rd 0x40c
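The classifier sketch from above can replay this trace as well (again assuming true LRU replacement within each two-way set):

    # Two-way set-associative example: 64B capacity, 16B lines, 2 ways.
    trace = [0x104, 0x100, 0x208, 0x204, 0x30c, 0x40c]
    print(classify_misses(trace, capacity=64, line_size=16, num_ways=2))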
3. Single-Cycle Direct-Mapped Cache Memory
Transactions
• We usually think of subsystems executing “transactions”
• Processor: instructions are transactions
• Memory: memory requests and memory responses are transactions
Technology Constraints
• Goal is for the cache and main memory to execute each transaction in less than one cycle
• Assume idealistic technology with dual-ported SRAMs for the tag arrays and single-cycle combinational main memory
[Figure: cache microarchitecture: a control unit exchanges control and status signals with a datapath containing the tag array and data array. Cache requests and responses (creq/cresp) carry 32b words; memory requests and responses (mreq/mresp) carry 128b lines; main memory is combinational, <1 cycle.]
Four key questions for cache memories
• How much data is stored in each cache line?
  – 16-byte cache lines
• How many cache lines and how do we manage these lines?
  – 64-byte total cache capacity
  – Two-way set-associative
• What data is replaced to make room for new data when cache is full?
  – Least recently used (LRU)
• How should we manage writes?
  – Write-through, no write allocate
Steps involved in executing transactions
READ transactions:
• Check tag array
  – hit: read word from data array, return word
  – miss: choose victim line, read line from main memory, write line to data array, then return the word

WRITE transactions:
• Check tag array
  – hit: write word to data array, and write word to main memory
  – miss: write word to main memory only (no write allocate)
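The following Python sketch walks through exactly these steps for the design above (two-way set-associative, 64B capacity, 16B lines, write-through, no write allocate, LRU victims). It is an illustrative functional model, not the RTL:

    # Functional model of the transaction steps (illustrative, not RTL).
    NUM_SETS, NUM_WAYS, LINE_SIZE = 2, 2, 16

    tags = [[None] * NUM_WAYS for _ in range(NUM_SETS)]
    data = [[None] * NUM_WAYS for _ in range(NUM_SETS)]
    lru  = [0] * NUM_SETS                    # way to victimize next, per set

    def split(addr):
        off = (addr % LINE_SIZE) // 4        # 2b word offset
        idx = (addr // LINE_SIZE) % NUM_SETS # 1b index
        tag = addr // (LINE_SIZE * NUM_SETS) # remaining high bits
        return tag, idx, off

    def read(addr, mem):
        tag, idx, off = split(addr)
        for way in range(NUM_WAYS):          # check tag array
            if tags[idx][way] == tag:
                lru[idx] = 1 - way           # hit: update LRU ...
                return data[idx][way][off]   # ... read word, return word
        way  = lru[idx]                      # miss: choose victim line
        base = addr - addr % LINE_SIZE
        line = [mem.get(base + 4 * i, 0) for i in range(4)]  # read line from main mem
        tags[idx][way], data[idx][way] = tag, line           # write line to data array
        lru[idx] = 1 - way
        return line[off]                     # return word

    def write(addr, value, mem):
        tag, idx, off = split(addr)
        for way in range(NUM_WAYS):          # check tag array
            if tags[idx][way] == tag:
                data[idx][way][off] = value  # hit: write word to data array
                lru[idx] = 1 - way
        mem[addr] = value                    # always write word to main mem

    mem = {}
    write(0x10, 42, mem)
    print(read(0x10, mem))   # -> 42 (the read misses and refills from memory)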
3.1. Cache Memory Datapath
As with processors, we can design our cache datapath by incrementally adding support for each transaction and resolving conflicts in the datapath using muxes.
Implementing READ transactions that hit
[Figure: datapath for READ hits. The creq address splits into 27b tag, 1b idx, and 2b word offset (low two bits 00). The idx reads both ways of the tag array (tarray0_en, tarray1_en), and comparators generate tarray0_match and tarray1_match. The idx also reads a 128b line from the data array (darray_en); the word offset selects one of the four 32b words ([31:0], [63:32], [95:64], [127:96]) to return as the cresp data.]
Adding READ transactions that miss
[Figure: datapath additions for READ misses. The tag and idx, with four zero bits appended (z4b), form the mreq address sent to main memory; the 128b mresp data is steered into the data array through a mux (2b darray_read_mux_sel) and written with darray_wen, and the tag arrays gain write enables (tarray0_wen, tarray1_wen) for the refill.]
Adding WRITE transactions
[Figure: datapath additions for WRITE transactions. The 32b creq data feeds a replication unit (replunit) whose 128b output is selected into the data array by darray_write_mux_sel, with darray_word_en enabling only the word being written; the creq data is also zero-extended (zext) to 128b to form the mreq data for the write-through to main memory.]
3.2. Cache Memory Control Unit
    Signal                  READ    WRITE
    tarray0_en
    tarray0_wen
    tarray1_en
    tarray1_wen
    darray_en
    darray_wen
    darray_word_en
    darray_write_mux_sel
    darray_read_mux_sel
We will need to keep valid bits in the control unit, with one valid bit for every cache line. We will also need to keep LRU bits which are updated on every hit to indicate which line should be the next victim. Assume we create the following two internal control signals to be used in the previous control signal table.
    hit    = ( tarray0_match && valid0[idx] )
          || ( tarray1_match && valid1[idx] )

    victim = lru[idx]
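In a functional model, this logic and the associated state updates might look as follows (an illustrative Python sketch; the notes define only hit and victim, the rest is assumption):

    # Hit/victim logic with valid and LRU bits (illustrative sketch).
    NUM_SETS = 2
    valid0 = [False] * NUM_SETS      # one valid bit per line in way 0
    valid1 = [False] * NUM_SETS      # one valid bit per line in way 1
    lru    = [0] * NUM_SETS          # next victim way for each set

    def lookup(idx, tarray0_match, tarray1_match):
        hit = ((tarray0_match and valid0[idx])
               or (tarray1_match and valid1[idx]))
        if hit:
            hit_way  = 0 if (tarray0_match and valid0[idx]) else 1
            lru[idx] = 1 - hit_way   # LRU bits updated on every hit
        return hit, lru[idx]         # victim = lru[idx]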
4. Analyzing Cache Memory Performance
[Figure: the processor sends accesses to the cache; hits are serviced by the cache, while misses go to main memory.]
Average Memory Access Latency (AMAL)
AMAL = ( Hit Time + ( Miss Rate × Miss Penalty ) ) × Cycle Time
    Microarchitecture                 Hit Time   Cycle Time
    this topic → Single-Cycle Cache   1          long
                 FSM Cache            >1         short
                 Pipelined Cache      ≈1         short
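For example, with illustrative numbers (not from the notes):

    # AMAL = (Hit Time + Miss Rate * Miss Penalty) * Cycle Time
    hit_time     = 1      # cycles per hit
    miss_rate    = 0.05   # fraction of accesses that miss
    miss_penalty = 100    # extra cycles per miss
    cycle_time   = 1.0    # ns per cycle

    amal = (hit_time + miss_rate * miss_penalty) * cycle_time
    print(amal)           # -> 6.0 ns average per memory access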
Estimating cycle time
[Figure: the complete two-way set-associative cache datapath from Section 3, repeated here for estimating the critical path and thus the cycle time.]