CSE 410, Lecture 13: Cache Memory


TRANSCRIPT

  • Slide 1

    Cache Memory

    CSE 410, Spring 2008

    Computer Systems

    http://www.cs.washington.edu/410

  • Slide 2

    Reading and References

    Reading

    Computer Organization and Design, Patterson and Hennessy

    Section 7.1 Introduction

    Section 7.2 The Basics of Caches

    Section 7.3 Measuring and Improving Cache Performance

    Reference

    OSC (The dino book), Chapter 8: focus on paging

    CDA&AQA, Chapter 5

    Chapter 4, See MIPS Run, D. Sweetman

    IBM and Cell chips: http://www.blachford.info/computer/Cell/Cell0_v2.html

  • Slide 3

    Arthur W. Burks, Herman H. Goldstine, John von Neumann:

    "We are therefore forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible."

    This team was forced to recognize this back in the 1940s

    Cache: a safe place to store something (Webster)

    Cache (CSE & EE): a small and fast local memory

    Caches: some consider this to be ...

  • Slide 4

    The Quest for Speed - Memory

    If all memory accesses (IF/lw/sw) went to main memory, programs would run dramatically slower

    Compare memory access times: 2^4 - 2^6 times slower?

    Could be even slower depending on pipeline length!

    And it's getting worse

    processors speed up by ~50% annually

    memory accesses speed up by ~9% annually

    it's becoming harder and harder to keep these processors fed

  • Slide 5

    A Solution: Memory Hierarchy

    Keep copies of the active data in the small, fast, expensive storage

    Keep all data in the big, slow, cheap storage

    [Diagram labels: "fast, small, expensive storage"; "slow, large, cheap storage"]

  • Slide 6

    Memory Hierarchy

    [Figure: memory hierarchy levels with access times of roughly 10M, 100, 10, and 0.5-5 (presumably ns), from the slowest level to the fastest]

  • Slide 7

    What is a Cache?

    A subset of a larger memory

    A small and fast place to store frequently accessed items

    Can be an instruction cache, a video cache, a streaming buffer/cache

    Like a fisheye view of memory, where magnification == speed

  • Slide 8

    What is a Cache?

    A cache allows for fast accesses to a subset of a larger data store

    Your web browser's cache gives you fast access to pages you visited recently

    faster because it's stored locally

    subset because the web won't fit on your disk

    The memory cache gives the processor fast access to memory that it used recently

    faster because it's fancy and usually located on the CPU chip

    subset because the cache is smaller than main memory

  • Slide 9

    Memory Hierarchy

    CPU

    Registers

    L1 cache

    L2 Cache

    Main Memory

  • Slide 10 (no transcribed text)

  • Slide 11

    IBM's Cell Chips

  • Slide 12

    Locality of reference

    Temporal locality - nearness in time

    Data being accessed now will probably be accessed again soon

    Useful data tends to continue to be useful

    Spatial locality - nearness in address

    Data near the data being accessed now will probably be needed soon

    Useful data is often accessed sequentially

  • Slide 13

    Memory Access Patterns

    Memory accesses don't usually look like this: random accesses

    Memory accesses do usually look like this: hot variables, step through arrays
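
    A minimal illustration in C (my own example, not from the slides) of the two patterns above: the running total is a hot variable (temporal locality) and a[i] steps through the array in address order (spatial locality), so almost all of these accesses hit in the cache.

        #include <stdio.h>

        int main(void) {
            static int a[1024];
            long sum = 0;                    /* hot variable: reused on every iteration */
            for (int i = 0; i < 1024; i++)
                sum += a[i];                 /* sequential accesses: one miss per block, then hits */
            printf("%ld\n", sum);
            return 0;
        }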

  • Slide 14

    Cache Terminology

    Hit and Miss: the data item is in the cache, or the data item is not in the cache

    Hit rate and Miss rate: the percentage of references for which the data item is in the cache or not in the cache

    Hit time and Miss time: the time required to access data in the cache (cache access time) and the time required to access data not in the cache (memory access time)

  • Slide 15

    Effective Access Time

    t_effective = h * t_cache + (1 - h) * t_memory

    where h is the cache hit rate, (1 - h) is the cache miss rate, t_cache is the cache access time, and t_memory is the memory access time

    aka, Average Memory Access Time (AMAT)
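
    A small sketch of the formula above with assumed numbers (1 ns cache, 100 ns memory, 97% hit rate; none of these values come from the slides):

        #include <stdio.h>

        double amat(double h, double t_cache, double t_memory) {
            return h * t_cache + (1.0 - h) * t_memory;   /* t_effective */
        }

        int main(void) {
            /* 0.97 * 1 + 0.03 * 100 = 3.97 ns: much closer to cache speed than
               to memory speed, which is the point of adding the cache */
            printf("AMAT = %.2f ns\n", amat(0.97, 1.0, 100.0));
            return 0;
        }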

  • Slide 16

    # of bits in a Cache

    Caches are expensive memories that carry additional metadata.

    Note that the number of bits needed for a cache is a function of the # of cache blocks as well as the address size (embedded in the tag bit calculation)

    total bits = 2^n * ( block size + tag size + valid bit )

    where the cache holds 2^n blocks

    How many words per block? 2^m

    tag size = 32 - n - m - 2 (for 32-bit addresses)
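
    A quick worked example (numbers assumed, not from the slides): for a direct-mapped cache with 2^10 = 1024 blocks (n = 10) and 2^2 = 4 words per block (m = 2), the tag is 32 - 10 - 2 - 2 = 18 bits. Each line then holds 4 * 32 = 128 data bits plus 18 tag bits plus 1 valid bit = 147 bits, so the whole cache needs 2^10 * 147 = 150,528 bits (about 18.4 KiB) to cache 16 KiB of data.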

  • Slide 17

    Cache Contents

    When do we put something in the cache? When it is used for the first time

    When do we remove something from the cache? When we need the space in the cache for some other entry

    all of memory won't fit on the CPU chip, so not every location in memory can be cached

  • Slide 18

    A small two-level hierarchy

    8-word cache

    32-word memory (128 bytes)

    [Diagram: word addresses 0, 4, 8, 12, 16, 20, ..., 116, 120, 124 shown with their 7-bit binary values, e.g. 0 = 0000000, 4 = 0000100, 8 = 0001000, ..., 124 = 1111100]

  • Slide 19

    Fully Associative Cache

    Value        Valid   Address
    0x00012D10   Y       0101100
    0x00000005   N       0001100
    0x0349A291   Y       1101100
    0x000123A8   Y       0100000
    0x00000200   N       1111100
    0x00000410   Y       0100100
    0x09D91D11   N       0000100
    0x00000001   Y       0010100

    In a fully associative cache, any memory word can be placed in any cache line

    each cache line stores an address and a data value

    accesses are slow (but not as slow as you might think)

    Why? Check all addresses at once.

  • Slide 20

    Direct Mapped Caches

    Fully associative caches are often too slow

    With direct mapped caches, the address of the item determines where in the cache to store it

    In our example, the lowest order two bits are the byte offset within the word stored in the cache

    The next three bits of the address dictate the location of the entry within the cache

    The remaining higher order bits record the rest of the original address as a tag for this entry

  • Slide 21

    Address Tags

    A tag is a label for a cache entry indicating where it came from

    The upper bits of the data item's address

    Example 7-bit addresses 1011101 and 0111110, split as Tag (2 bits) | Index (3 bits) | Byte Offset (2 bits)
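
    A minimal C sketch (my own illustration, not from the slides) that splits the two example addresses above into their tag, index, and byte-offset fields:

        #include <stdio.h>

        int main(void) {
            unsigned addrs[] = { 0x5D /* 1011101 */, 0x3E /* 0111110 */ };
            for (int i = 0; i < 2; i++) {
                unsigned a      = addrs[i];
                unsigned offset =  a       & 0x3;   /* low 2 bits: byte offset  */
                unsigned index  = (a >> 2) & 0x7;   /* next 3 bits: cache index */
                unsigned tag    = (a >> 5) & 0x3;   /* top 2 bits: tag          */
                printf("addr %02X -> tag %u, index %u, offset %u\n", a, tag, index, offset);
            }
            return 0;
        }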

  • Slide 22

    Direct Mapped Cache

    Cache contents (index, tag, valid bit, cached value) and the memory address that maps to each entry:

    Index   Tag   Valid   Value        Memory Address
    011     00    Y       0x00012D10   0001100
    100     10    N       0x00000005   1010000
    101     11    Y       0x0349A291   1110100
    110     00    Y       0x000123A8   0011000
    111     10    N       0x00000200   1011100
    010     01    Y       0x00000410   0101000
    001     10    N       0x09D91D11   1000100
    000     11    Y       0x00000001   1100000
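
    A direct-mapped lookup against the table above can be sketched in C as follows (the struct and function names are assumptions for illustration, not from the slides): the index bits pick the line, and the stored tag must match the tag bits of the address.

        #include <stdio.h>

        struct line { int valid; unsigned tag; unsigned value; };

        struct line cache[8] = {
            [0] = {1, 0x3, 0x00000001},  [1] = {0, 0x2, 0x09D91D11},
            [2] = {1, 0x1, 0x00000410},  [3] = {1, 0x0, 0x00012D10},
            [4] = {0, 0x2, 0x00000005},  [5] = {1, 0x3, 0x0349A291},
            [6] = {1, 0x0, 0x000123A8},  [7] = {0, 0x2, 0x00000200},
        };

        /* returns 1 on a hit and stores the cached word in *out */
        int lookup(unsigned addr7, unsigned *out) {
            unsigned index = (addr7 >> 2) & 0x7;   /* bits 2..4 pick the line */
            unsigned tag   = (addr7 >> 5) & 0x3;   /* bits 5..6 must match    */
            if (cache[index].valid && cache[index].tag == tag) {
                *out = cache[index].value;
                return 1;                          /* hit */
            }
            return 0;                              /* miss: go to memory */
        }

        int main(void) {
            unsigned v;
            printf("0001100 -> %s\n", lookup(0x0C, &v) ? "hit" : "miss");  /* index 011, tag 00: hit  */
            printf("1110000 -> %s\n", lookup(0x70, &v) ? "hit" : "miss");  /* index 100, tag 11: miss */
            return 0;
        }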

  • Slide 23

    N-way Set Associative Caches

    Direct mapped caches cannot store more than one address with the same index

    Simple case of 1-way set associative

    Compare to fully associative, where N = number of blocks

    If two addresses collide, then you overwrite the older entry

    2-way associative caches can store two different addresses with the same index

    3-way, 4-way and 8-way set associative designs too

    Reduces misses due to conflicts/competition/thrashing for the same cache block, as each set can hold more blocks

    Larger sets imply slower accesses

    Need to compare N items, where N in direct mapped is 1.

  • Slide 24

    2-way Set Associative Cache

    Index   Way 0: Tag / Valid / Value      Way 1: Tag / Valid / Value
    011     00  Y  0x00012D10              10  N  0x000000A2
    100     10  N  0x00000005              11  N  0x00000333
    101     11  Y  0x0349A291              10  Y  0x00003333
    110     00  Y  0x000123A8              01  Y  0x0000C002
    111     10  N  0x00000200              10  N  0x00000005
    010     01  Y  0x00000410              11  Y  0x000000CF
    001     10  N  0x09D91D11              10  N  0x0000003B
    000     11  Y  0x00000001              00  Y  0x00000002

    The highlighted cache entry (index 101) contains values for addresses 10101xx (binary) and 11101xx (binary).

  • Slide 25

    Associativity Spectrum

    Direct Mapped: fast to access; conflict misses

    Fully Associative: slow to access; no conflict misses

    N-way Associative: slower to access; fewer conflict misses

  • Slide 26

    Spatial Locality

    Using the cache improves performance by taking advantage of temporal locality

    When a word in memory is accessed it is loaded into cache memory

    It is then available quickly if it is needed again soon

    This does nothing for spatial locality

    What's the solution here?

  • Slide 27

    Memory Blocks

    Divide memory into blocks

    If any word in a block is accessed, then load an entire block into the cache

    Block 0: 0x00000000 - 0x0000003F
    Block 1: 0x00000040 - 0x0000007F
    Block 2: 0x00000080 - 0x000000BF

    Cache line for a 16-word block size: tag | valid | w0 w1 w2 ... w15
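
    As a quick check on the block boundaries above (my own arithmetic, with 64-byte blocks): address 0x0000004C falls in Block 1, because 0x4C / 0x40 = 1.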

  • Slide 28

    Address Tags Revisited

    A cache block size > 1 word requires the address to be divided differently

    Instead of a byte offset into a word, we need a byte offset into the block

    Assume we have 10-bit addresses, 8 cache lines, and 4 words (16 bytes) per cache line block

    Example 10-bit addresses 0101100111 and 0111110010, split as Tag (3 bits) | Index (3 bits) | Block Offset (4 bits)
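
    As a quick check on the split above, the first example address 0101100111 decodes as tag 010, index 110, block offset 0111; the second, 0111110010, decodes as tag 011, index 111, block offset 0010.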

  • Slide 29

    The Effects of Block Size

    Big blocks are good

    Fewer first-time misses

    Exploits spatial locality

    Counter: the law of diminishing returns applies here if block size becomes a significant fraction of overall cache size

    Small blocks are good

    Don't evict as much data when bringing in a new entry

    More likely that all items in the block will turn out to be useful

  • Slide 30

    Reads vs. Writes

    Caching is essentially making a copy of the data

    When you read, the copies still match when you're done (a read is immutable)

    When you write, the results must eventually propagate to both copies (a hierarchy)

    Especially at the lowest level of the hierarchy, which is in some sense the permanent copy

  • Slide 31

    Cache Misses

    Reads are easier, due to constraints

    A dedicated memory unit (or memory controller) has the job of fetching and filling cache lines

    Writes are trickier, as we need to maintain consistent memory!

    What if memory changed at a top level but not at either main memory or the disk?

    Write-through: write to your cache, and percolate that write down (slow and straightforward, factor of 10)

    Use a cache or buffer to percolate the write down

    Optimization: Write-back (or, write-on-replace)

    Faster, more complex to implement

  • Slide 32

    Write-Through Caches

    Write all updates to both cache and memory

    Advantages

    The cache and the memory are always consistent

    Evicting a cache line is cheap because no data needs to be written out to memory at eviction

    Easy to implement

    Disadvantages

    Runs at memory speeds when writing

    One solution: use a cache. We can use a write buffer / victim buffer to mask this delay

  • Slide 33

    Write-Back Caches

    Write the update to the cache only. Write to memory only when the cache block is evicted

    Advantages

    Runs at cache speed rather than memory speed

    Some writes never go all the way to memory

    When a whole block is written back, can use a high-bandwidth transfer

    Disadvantage

    complexity required to maintain consistency

  • Slide 34

    Dirty bit

    When evicting a block from a write-back cache, we could

    always write the block back to memory

    write it back only if we changed it

    Caches use a dirty bit to mark if a line was changed (see the sketch below)

    the dirty bit is 0 when the block is loaded

    it is set to 1 if the block is modified

    when the line is evicted, it is written back only if the dirty bit is 1
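
    A minimal write-back sketch in C (the types, sizes, and helper names are my own assumptions, not from the slides), showing how the dirty bit decides whether an evicted line is written back:

        #include <stdio.h>
        #include <string.h>

        #define BLOCK_WORDS 4

        struct line {
            int valid, dirty;
            unsigned tag;
            unsigned data[BLOCK_WORDS];
        };

        unsigned memory[256];                      /* toy backing store */

        /* evict: only dirty lines are written back to memory */
        void evict(struct line *l, unsigned index) {
            if (l->valid && l->dirty) {
                unsigned base = (l->tag * 8 + index) * BLOCK_WORDS;   /* assumes 8 sets */
                memcpy(&memory[base], l->data, sizeof l->data);
                printf("write back: tag %u, index %u\n", l->tag, index);
            }
            l->valid = 0;
        }

        /* store: write-back policy updates only the cached copy and sets the dirty bit */
        void store(struct line *l, unsigned offset, unsigned value) {
            l->data[offset] = value;
            l->dirty = 1;
        }

        int main(void) {
            struct line l = { .valid = 1, .dirty = 0, .tag = 2 };
            store(&l, 1, 0xABCD);   /* modify the cached block: dirty bit set       */
            evict(&l, 3);           /* eviction now writes the block back to memory */
            evict(&l, 3);           /* a clean/invalid line is simply dropped       */
            return 0;
        }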

  • Slide 35

    i-Cache and d-Cache

    There usually are two separate caches for instructions and data.

    Avoids structural hazards in pipelining

    The combined cache is twice as big but still has the access time of a small cache

    Allows both caches to operate in parallel, for twice the bandwidth

    But the miss rate will slightly increase, due to the extra partition that a shared cache wouldn't have

  • Slide 36

    Latency vs. Throughput

    Latency: the time to get the first requested word of the cache (if present)

    The whole point of caches: this should be relatively small (compared to other memories)

    Throughput: the time it takes to fetch the rest of the block (could be many words)

    Defined by your hardware

    wide bandwidths preferred: at what cost? DDR: registers dual read/write on clock edge

  • Slide 37

    Cache Line Replacement

    How do you decide which cache block to replace?

    If the cache is direct-mapped, it's easy: only one slot per index, so only one choice!

    Otherwise, common strategies:

    Random (why might this be worthwhile?)

    Least Recently Used (LRU): a FIFO approximation works reasonably well here

  • Slide 38

    LRU Implementations

    LRU is very difficult to implement for high degrees of associativity

    4-way approximation (see the sketch below):

    1 bit to indicate the least recently used pair

    1 bit per pair to indicate the least recently used item in this pair

    Another approximation: FIFO queuing

    We will see this again at the operating system level

  • Slide 39

    Multi-Level Caches

    Use each level of the memory hierarchy as a cache over the next lowest level

    (or, simply recurse again on our current abstraction)

    Inserting level 2 between levels 1 and 3 allows:

    level 1 to have a higher miss rate (so it can be smaller and cheaper)

    level 3 to have a larger access time (so it can be slower and cheaper)
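
    As a hedged aside (this extension is not spelled out on the slide, and the numbers are assumed), the effective access time formula from the Effective Access Time slide recurses naturally over two cache levels:

        t_effective = h1 * t_L1 + (1 - h1) * ( h2 * t_L2 + (1 - h2) * t_memory )

    With t_L1 = 1 ns, t_L2 = 10 ns, t_memory = 100 ns, h1 = 0.9, and h2 = 0.8, this gives 0.9 * 1 + 0.1 * (0.8 * 10 + 0.2 * 100) = 0.9 + 2.8 = 3.7 ns.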

  • Slide 40

    Summary: Classifying Caches

    Where can a block be placed? Direct mapped, N-way set associative, or fully associative

    How is a block found?

    Direct mapped: one choice, by index

    Set associative: by index and search

    Fully associative: by search (all at once)

    What happens on a write access? Write-back or write-through

    Which block should be replaced? Random, or LRU (Least Recently Used)

  • Slide 41 (no transcribed text)

  • Slide 42

    Being Explored

    Cache coherency in multiprocessor systems

    Consider keeping multiple L1s in sync, when separated by a slow bus

    Want each processor to have its own cache

    Fast local access

    No interference with/from other processors

    No problem if processors operate in isolation (a thread tree per proc, for example)

    But: now what happens if more than one processor accesses a cache line at the same time? How do we keep multiple copies consistent?

    Consider multiple writes to the same data

    Overhead in all this synchronization/message passing on a possibly slow bus

    What about synchronization with main storage?