CSE 410, Lecture 13: Cache Memory


TRANSCRIPT

  • Slide 1

    Cache Memory

    CSE 410, Spring 2008

    Computer Systems

    http://www.cs.washington.edu/410

  • Slide 2

    Reading and References

    Reading

    Computer Organization and Design, Patterson and Hennessy

    Section 7.1 Introduction

    Section 7.2 The Basics of Caches

    Section 7.3 Measuring and Improving Cache Performance

    Reference

    OSC (The dino book), Chapter 8: focus on paging

    CDA&AQA, Chapter 5

    Chapter 4, See MIPS Run, D. Sweetman

    IBM and Cell chips: http://www.blachford.info/computer/Cell/Cell0_v2.html

  • Slide 3

    Arthur W. Burks, Herman H. Goldstine, John von Neumann:

    "We are therefore forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible."

    This team was forced to recognize this back in the 1940s

    Cache: a safe place to store something (Webster)

    Cache (CSE & EE): a small and fast local memory

    Caches: some consider this to be ...

  • Slide 4

    The Quest for Speed - Memory

    If all memory accesses (IF/lw/sw) went to main memory, programs would run dramatically slower

    Compare memory access times: 2^4 - 2^6 times slower?

    Could be even slower depending on pipeline length!

    And it's getting worse

    processors speed up by ~50% annually

    memory accesses speed up by ~9% annually

    it's becoming harder and harder to keep these processors fed

  • Slide 5

    A Solution: Memory Hierarchy

    Keep copies of the active data in the small, fast, expensive storage

    Keep all data in the big, slow, cheap storage

    [Diagram labels: "fast, small, expensive storage"; "slow, large, cheap storage"]

  • Slide 6

    Memory Hierarchy

    [Figure: memory hierarchy levels with access times of roughly 10M, 100, 10, and 0.5-5 (presumably ns), from the slowest level to the fastest]

  • Slide 7

    What is a Cache?

    A subset of a larger memory

    A small and fast place to store frequently accessed items

    Can be an instruction cache, a video cache, a streaming buffer/cache

    Like a fisheye view of memory, where magnification == speed

  • Slide 8

    What is a Cache?

    A cache allows for fast accesses to a subset of a larger data store

    Your web browser's cache gives you fast access to pages you visited recently

    faster because it's stored locally

    subset because the web won't fit on your disk

    The memory cache gives the processor fast access to memory that it used recently

    faster because it's fancy and usually located on the CPU chip

    subset because the cache is smaller than main memory

  • Slide 9

    Memory Hierarchy

    CPU

    Registers

    L1 cache

    L2 Cache

    Main Memory

  • Slide 10 (no transcribed text)

  • Slide 11

    IBM's Cell Chips

  • Slide 12

    Locality of reference

    Temporal locality - nearness in time

    Data being accessed now will probably be accessed again soon

    Useful data tends to continue to be useful

    Spatial locality - nearness in address

    Data near the data being accessed now will probably be needed soon

    Useful data is often accessed sequentially

  • Slide 13

    Memory Access Patterns

    Memory accesses don't usually look like this: random accesses

    Memory accesses do usually look like this: hot variables, step through arrays
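
    A minimal illustration in C (my own example, not from the slides) of the two patterns above: the running total is a hot variable (temporal locality) and a[i] steps through the array in address order (spatial locality), so almost all of these accesses hit in the cache.

        #include <stdio.h>

        int main(void) {
            static int a[1024];
            long sum = 0;                    /* hot variable: reused on every iteration */
            for (int i = 0; i < 1024; i++)
                sum += a[i];                 /* sequential accesses: one miss per block, then hits */
            printf("%ld\n", sum);
            return 0;
        }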

  • Slide 14

    Cache Terminology

    Hit and Miss: the data item is in the cache, or the data item is not in the cache

    Hit rate and Miss rate: the percentage of references for which the data item is in the cache or not in the cache

    Hit time and Miss time: the time required to access data in the cache (cache access time) and the time required to access data not in the cache (memory access time)

  • Slide 15

    Effective Access Time

    t_effective = h * t_cache + (1 - h) * t_memory

    where h is the cache hit rate, (1 - h) is the cache miss rate, t_cache is the cache access time, and t_memory is the memory access time

    aka, Average Memory Access Time (AMAT)
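
    A small sketch of the formula above with assumed numbers (1 ns cache, 100 ns memory, 97% hit rate; none of these values come from the slides):

        #include <stdio.h>

        double amat(double h, double t_cache, double t_memory) {
            return h * t_cache + (1.0 - h) * t_memory;   /* t_effective */
        }

        int main(void) {
            /* 0.97 * 1 + 0.03 * 100 = 3.97 ns: much closer to cache speed than
               to memory speed, which is the point of adding the cache */
            printf("AMAT = %.2f ns\n", amat(0.97, 1.0, 100.0));
            return 0;
        }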

  • Slide 16

    # of bits in a Cache

    Caches are expensive memories that carry additional metadata.

    Note that the number of bits needed for a cache is a function of the # of cache blocks as well as the address size (embedded in the tag bit calculation)

    total bits = 2^n * ( block size + tag size + valid bit )

    where the cache holds 2^n blocks

    How many words per block? 2^m

    tag size = 32 - n - m - 2 (for 32-bit addresses)
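
    A quick worked example (numbers assumed, not from the slides): for a direct-mapped cache with 2^10 = 1024 blocks (n = 10) and 2^2 = 4 words per block (m = 2), the tag is 32 - 10 - 2 - 2 = 18 bits. Each line then holds 4 * 32 = 128 data bits plus 18 tag bits plus 1 valid bit = 147 bits, so the whole cache needs 2^10 * 147 = 150,528 bits (about 18.4 KiB) to cache 16 KiB of data.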

  • Slide 17

    Cache Contents

    When do we put something in the cache? When it is used for the first time

    When do we remove something from the cache? When we need the space in the cache for some other entry

    all of memory won't fit on the CPU chip, so not every location in memory can be cached

  • Slide 18

    A small two-level hierarchy

    8-word cache

    32-word memory (128 bytes)

    [Diagram: word addresses 0, 4, 8, 12, 16, 20, ..., 116, 120, 124 shown with their 7-bit binary values, e.g. 0 = 0000000, 4 = 0000100, 8 = 0001000, ..., 124 = 1111100]

  • Slide 19

    Fully Associative Cache

    Value        Valid   Address
    0x00012D10   Y       0101100
    0x00000005   N       0001100
    0x0349A291   Y       1101100
    0x000123A8   Y       0100000
    0x00000200   N       1111100
    0x00000410   Y       0100100
    0x09D91D11   N       0000100
    0x00000001   Y       0010100

    In a fully associative cache, any memory word can be placed in any cache line

    each cache line stores an address and a data value

    accesses are slow (but not as slow as you might think)

    Why? Check all addresses at once.

  • Slide 20

    Direct Mapped Caches

    Fully associative caches are often too slow

    With direct mapped caches, the address of the item determines where in the cache to store it

    In our example, the lowest order two bits are the byte offset within the word stored in the cache

    The next three bits of the address dictate the location of the entry within the cache

    The remaining higher order bits record the rest of the original address as a tag for this entry

  • Slide 21

    Address Tags

    A tag is a label for a cache entry indicating where it came from

    The upper bits of the data item's address

    Example 7-bit addresses 1011101 and 0111110, split as Tag (2 bits) | Index (3 bits) | Byte Offset (2 bits)
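
    A minimal C sketch (my own illustration, not from the slides) that splits the two example addresses above into their tag, index, and byte-offset fields:

        #include <stdio.h>

        int main(void) {
            unsigned addrs[] = { 0x5D /* 1011101 */, 0x3E /* 0111110 */ };
            for (int i = 0; i < 2; i++) {
                unsigned a      = addrs[i];
                unsigned offset =  a       & 0x3;   /* low 2 bits: byte offset  */
                unsigned index  = (a >> 2) & 0x7;   /* next 3 bits: cache index */
                unsigned tag    = (a >> 5) & 0x3;   /* top 2 bits: tag          */
                printf("addr %02X -> tag %u, index %u, offset %u\n", a, tag, index, offset);
            }
            return 0;
        }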

  • Slide 22

    Direct Mapped Cache

    Cache contents (index, tag, valid bit, cached value) and the memory address that maps to each entry:

    Index   Tag   Valid   Value        Memory Address
    011     00    Y       0x00012D10   0001100
    100     10    N       0x00000005   1010000
    101     11    Y       0x0349A291   1110100
    110     00    Y       0x000123A8   0011000
    111     10    N       0x00000200   1011100
    010     01    Y       0x00000410   0101000
    001     10    N       0x09D91D11   1000100
    000     11    Y       0x00000001   1100000
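
    A direct-mapped lookup against the table above can be sketched in C as follows (the struct and function names are assumptions for illustration, not from the slides): the index bits pick the line, and the stored tag must match the tag bits of the address.

        #include <stdio.h>

        struct line { int valid; unsigned tag; unsigned value; };

        struct line cache[8] = {
            [0] = {1, 0x3, 0x00000001},  [1] = {0, 0x2, 0x09D91D11},
            [2] = {1, 0x1, 0x00000410},  [3] = {1, 0x0, 0x00012D10},
            [4] = {0, 0x2, 0x00000005},  [5] = {1, 0x3, 0x0349A291},
            [6] = {1, 0x0, 0x000123A8},  [7] = {0, 0x2, 0x00000200},
        };

        /* returns 1 on a hit and stores the cached word in *out */
        int lookup(unsigned addr7, unsigned *out) {
            unsigned index = (addr7 >> 2) & 0x7;   /* bits 2..4 pick the line */
            unsigned tag   = (addr7 >> 5) & 0x3;   /* bits 5..6 must match    */
            if (cache[index].valid && cache[index].tag == tag) {
                *out = cache[index].value;
                return 1;                          /* hit */
            }
            return 0;                              /* miss: go to memory */
        }

        int main(void) {
            unsigned v;
            printf("0001100 -> %s\n", lookup(0x0C, &v) ? "hit" : "miss");  /* index 011, tag 00: hit  */
            printf("1110000 -> %s\n", lookup(0x70, &v) ? "hit" : "miss");  /* index 100, tag 11: miss */
            return 0;
        }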

  • Slide 23

    N-way Set Associative Caches

    Direct mapped caches cannot store more than one address with the same index

    Simple case of 1-way set associative

    Compare to fully associative, where N = number of blocks

    If two addresses collide, then you overwrite the older entry

    2-way associative caches can store two different addresses with the same index

    3-way, 4-way and 8-way set associative designs too

    Reduces misses due to conflicts/competition/thrashing for the same cache block, as each set can hold more blocks

    Larger sets imply slower accesses

    Need to compare N items, where N in direct mapped is 1.

  • Slide 24

    2-way Set Associative Cache

    Index   Way 0: Tag / Valid / Value      Way 1: Tag / Valid / Value
    011     00  Y  0x00012D10              10  N  0x000000A2
    100     10  N  0x00000005              11  N  0x00000333
    101     11  Y  0x0349A291              10  Y  0x00003333
    110     00  Y  0x000123A8              01  Y  0x0000C002
    111     10  N  0x00000200              10  N  0x00000005
    010     01  Y  0x00000410              11  Y  0x000000CF
    001     10  N  0x09D91D11              10  N  0x0000003B
    000     11  Y  0x00000001              00  Y  0x00000002

    The highlighted cache entry (index 101) contains values for addresses 10101xx (binary) and 11101xx (binary).

  • Slide 25

    Associativity Spectrum

    Direct Mapped: fast to access; conflict misses

    Fully Associative: slow to access; no conflict misses

    N-way Associative: slower to access; fewer conflict misses

  • Slide 26

    Spatial Locality

    Using the cache improves performance by taking advantage of temporal locality

    When a word in memory is accessed it is loaded into cache memory

    It is then available quickly if it is needed again soon

    This does nothing for spatial locality

    What's the solution here?

  • Slide 27

    Memory Blocks

    Divide memory into blocks

    If any word in a block is accessed, then load an entire block into the cache

    Block 0: 0x00000000 - 0x0000003F
    Block 1: 0x00000040 - 0x0000007F
    Block 2: 0x00000080 - 0x000000BF

    Cache line for a 16-word block size: tag | valid | w0 w1 w2 ... w15
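
    As a quick check on the block boundaries above (my own arithmetic, with 64-byte blocks): address 0x0000004C falls in Block 1, because 0x4C / 0x40 = 1.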

  • Slide 28

    Address Tags Revisited

    A cache block size > 1 word requires the address to be divided differently

    Instead of a byte offset into a word, we need a byte offset into the block

    Assume we have 10-bit addresses, 8 cache lines, and 4 words (16 bytes) per cache line block

    Example 10-bit addresses 0101100111 and 0111110010, split as Tag (3 bits) | Index (3 bits) | Block Offset (4 bits)
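
    As a quick check on the split above, the first example address 0101100111 decodes as tag 010, index 110, block offset 0111; the second, 0111110010, decodes as tag 011, index 111, block offset 0010.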

  • Slide 29

    The Effects of Block Size

    Big blocks are good

    Fewer first-time misses

    Exploits spatial locality

    Counter: the law of diminishing returns applies here if block size becomes a significant fraction of overall cache size

    Small blocks are good

    Don't evict as much data when bringing in a new entry

    More likely that all items in the block will turn out to be useful

  • Slide 30

    Reads vs. Writes

    Caching is essentially making a copy of the data

    When you read, the copies still match when you're done (a read is immutable)

    When you write, the results must eventually propagate to both copies (a hierarchy)

    Especially at the lowest level of the hierarchy, which is in some sense the permanent copy

  • Slide 31

    Cache Misses

    Reads are easier, due to constraints

    A dedicated memory unit (or memory controller) has the job of fetching and filling cache lines

    Writes are trickier, as we need to maintain consistent memory!

    What if memory changed at a top level but not at either main memory or the disk?

    Write-through: write to your cache, and percolate that write down (slow and straightforward, factor of 10)

    Use a cache or buffer to percolate the write down

    Optimization: Write-back (or, write-on-replace)

    Faster, more complex to implement

  • Slide 32

    Write-Through Caches

    Write all updates to both cache and memory

    Advantages

    The cache and the memory are always consistent

    Evicting a cache line is cheap because no data needs to be written out to memory at eviction

    Easy to implement

    Disadvantages

    Runs at memory speeds when writing

    One solution: use a cache. We can use a write buffer / victim buffer to mask this delay

  • Slide 33

    Write-Back Caches

    Write the update to the cache only. Write to memory only when the cache block is evicted

    Advantages

    Runs at cache speed rather than memory speed

    Some writes never go all the way to memory

    When a whole block is written back, can use a high-bandwidth transfer

    Disadvantage

    complexity required to maintain consistency

  • Slide 34

    Dirty bit

    When evicting a block from a write-back cache, we could

    always write the block back to memory

    write it back only if we changed it

    Caches use a dirty bit to mark if a line was changed (see the sketch below)

    the dirty bit is 0 when the block is loaded

    it is set to 1 if the block is modified

    when the line is evicted, it is written back only if the dirty bit is 1
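
    A minimal write-back sketch in C (the types, sizes, and helper names are my own assumptions, not from the slides), showing how the dirty bit decides whether an evicted line is written back:

        #include <stdio.h>
        #include <string.h>

        #define BLOCK_WORDS 4

        struct line {
            int valid, dirty;
            unsigned tag;
            unsigned data[BLOCK_WORDS];
        };

        unsigned memory[256];                      /* toy backing store */

        /* evict: only dirty lines are written back to memory */
        void evict(struct line *l, unsigned index) {
            if (l->valid && l->dirty) {
                unsigned base = (l->tag * 8 + index) * BLOCK_WORDS;   /* assumes 8 sets */
                memcpy(&memory[base], l->data, sizeof l->data);
                printf("write back: tag %u, index %u\n", l->tag, index);
            }
            l->valid = 0;
        }

        /* store: write-back policy updates only the cached copy and sets the dirty bit */
        void store(struct line *l, unsigned offset, unsigned value) {
            l->data[offset] = value;
            l->dirty = 1;
        }

        int main(void) {
            struct line l = { .valid = 1, .dirty = 0, .tag = 2 };
            store(&l, 1, 0xABCD);   /* modify the cached block: dirty bit set       */
            evict(&l, 3);           /* eviction now writes the block back to memory */
            evict(&l, 3);           /* a clean/invalid line is simply dropped       */
            return 0;
        }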

  • Slide 35

    i-Cache and d-Cache

    There usually are two separate caches for instructions and data.

    Avoids structural hazards in pipelining

    The combined cache is twice as big but still has the access time of a small cache

    Allows both caches to operate in parallel, for twice the bandwidth

    But the miss rate will slightly increase, due to the extra partition that a shared cache wouldn't have

  • Slide 36

    Latency vs. Throughput

    Latency: the time to get the first requested word of the cache (if present)

    The whole point of caches: this should be relatively small (compared to other memories)

    Throughput: the time it takes to fetch the rest of the block (could be many words)

    Defined by your hardware

    wide bandwidths preferred: at what cost? DDR: registers dual read/write on clock edge

  • Slide 37

    Cache Line Replacement

    How do you decide which cache block to replace?

    If the cache is direct-mapped, it's easy: only one slot per index, so only one choice!

    Otherwise, common strategies:

    Random (why might this be worthwhile?)

    Least Recently Used (LRU): a FIFO approximation works reasonably well here

  • Slide 38

    LRU Implementations

    LRU is very difficult to implement for high degrees of associativity

    4-way approximation (see the sketch below):

    1 bit to indicate the least recently used pair

    1 bit per pair to indicate the least recently used item in this pair

    Another approximation: FIFO queuing

    We will see this again at the operating system level

  • Slide 39

    Multi-Level Caches

    Use each level of the memory hierarchy as a cache over the next lowest level

    (or, simply recurse again on our current abstraction)

    Inserting level 2 between levels 1 and 3 allows:

    level 1 to have a higher miss rate (so it can be smaller and cheaper)

    level 3 to have a larger access time (so it can be slower and cheaper)
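
    As a hedged aside (this extension is not spelled out on the slide, and the numbers are assumed), the effective access time formula from the Effective Access Time slide recurses naturally over two cache levels:

        t_effective = h1 * t_L1 + (1 - h1) * ( h2 * t_L2 + (1 - h2) * t_memory )

    With t_L1 = 1 ns, t_L2 = 10 ns, t_memory = 100 ns, h1 = 0.9, and h2 = 0.8, this gives 0.9 * 1 + 0.1 * (0.8 * 10 + 0.2 * 100) = 0.9 + 2.8 = 3.7 ns.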

  • Slide 40

    Summary: Classifying Caches

    Where can a block be placed? Direct mapped, N-way set associative, or fully associative

    How is a block found?

    Direct mapped: one choice, by index

    Set associative: by index and search

    Fully associative: by search (all at once)

    What happens on a write access? Write-back or write-through

    Which block should be replaced? Random, or LRU (Least Recently Used)

  • Slide 41 (no transcribed text)

  • Slide 42

    Being Explored

    Cache coherency in multiprocessor systems

    Consider keeping multiple L1s in sync, when separated by a slow bus

    Want each processor to have its own cache

    Fast local access

    No interference with/from other processors

    No problem if processors operate in isolation (a thread tree per proc, for example)

    But: now what happens if more than one processor accesses a cache line at the same time? How do we keep multiple copies consistent?

    Consider multiple writes to the same data

    Overhead in all this synchronization/message passing on a possibly slow bus

    What about synchronization with main storage?