
Computer Science Department University of Central Florida

Chapter 3: Memory and I/O Systems Modern Processor Design: Fundamentals of Superscalar Processors

3

Memory Hierarchy

•  “Anyone can build a fast CPU. The trick is to build a fast system.” – Seymour Cray

•  Memory –  Just an “ocean of bits” –  Many technologies are available

•  Key issues –  Technology (how bits are stored) –  Placement (where bits are stored) –  Identification (finding the right bits) –  Replacement (finding space for new bits) –  Write policy (propagating changes to bits)

•  Must answer these regardless of memory type

4

Types of Memory

Type Size Speed Cost/bit

Register < 1KB < 1ns $$$$

On-chip SRAM 8KB-6MB < 10ns $$$

Off-chip SRAM 1Mb – 16Mb < 20ns $$

DRAM 64MB – 1TB < 100ns $

Disk 40GB – 1PB < 20ms ~0

5

Memory Hierarchy

Registers

On-Chip SRAM

Off-Chip SRAM

DRAM

Disk

(CAPACITY increases going down the hierarchy; SPEED and COST per bit increase going up)

6

Why Memory Hierarchy?

•  Need lots of bandwidth

–  Implication: ultra fast access and/or a high number of bits per access

–  And that’s for single-issue –  And for ONE processor!

•  Need lots of storage –  64MB (minimum) to multiple TB

•  Must be cheap per bit –  (TB x anything) is a lot of money!

•  These requirements seem incompatible

BW = 1.0 inst/cycle × [1 Ifetch/inst × 4 B/Ifetch + 0.4 Dref/inst × 4 B/Dref] × 2 Gcycles/s = 11.2 GB/s

7

Why Memory Hierarchy?

•  Fast and small memories –  Enable quick access (fast cycle time) –  Enable lots of bandwidth (1+ L/S/I-fetch/cycle)

•  Slower larger memories –  Capture larger share of memory –  Still relatively fast

•  Slow huge memories –  Hold rarely-needed state –  Needed for correctness

•  All together: provide appearance of large, fast memory with cost of cheap, slow memory –  This illusion CAN be broken! –  “Virtual memory leads to virtual performance” – Seymour Cray (He also believed this about memory hierarchies)

Announcements

•  Homework 2 due Tu 2/7 in class

•  Reading, S&L Chapter 3 up to 3.4.4 –  Chapter 5 of H&P (4th ed) as further reference –  Chapter 2 of H&P (5th ed) as further reference

•  Project 0 posted, due 2/9, cache inclusion today (?) –  My home directory exists once again…

•  Today –  Memory Hierarchy and Caches

8

9

Why Does a Hierarchy Work?

•  Locality of reference –  Temporal locality

• Reference same memory location repeatedly –  Spatial locality

• Reference near neighbors around the same time

•  Empirically observed –  Significant! –  Even small local storage (8KB) often satisfies >90% of references to multi-MB data set

10

Memory Hierarchy

CPU

I & D L1 Cache

Shared L2 Cache

Main Memory

Disk

Temporal Locality • Keep recently referenced items at higher levels

• Future references satisfied quickly

Spatial Locality • Bring neighbors of recently referenced to higher levels

• Future references satisfied quickly

11

Four Burning Questions

•  These are: –  Placement

• Where can a block of memory go? –  Identification

• How do I find a block of memory? –  Replacement

• How do I make space for new blocks? –  Write Policy

• How do I propagate changes? •  Consider these for caches

–  Usually SRAM

•  Will consider main memory, disks later

12

Placement

Memory Type      Placement        Comments

Registers        Anywhere (Int, FP, SPR)        Compiler/programmer manages

Cache (SRAM)     Fixed in H/W        Direct-mapped, set-associative, fully-associative

DRAM             Anywhere        O/S manages

Disk             Anywhere        O/S manages

13

How are caches organized

•  For simplicity, data is stored in fixed-length parcels called “blocks” or “lines” (load alignment is typically required in modern architectures)

•  Data is “tagged” with its address in memory (so that it can be found and identified)

•  As a black box:

CPU asks for data @ location X

Cache searches for data @ X

If not there, cache asks main memory for data (called a cache miss)

CPU gets data (it may never talk to main memory)

14

Block size

•  Typical values for L1 cache: 16 to 64 bytes •  Exploit spatial locality:

–  Bring in larger blocks –  Larger blocks increase the time it takes to service a miss (the “miss repair time”) –  Too large and blocks “hog” storage (“cache pollution”)

block address block offset

63 b b-1 0

# block offset bits = log2(Block Size)

15

Placement

•  Address Range –  Exceeds cache capacity

•  Map address to finite capacity –  Called a hash –  Usually just masks high-order bits

•  Direct-mapped –  Block can only exist in one location –  Hash collisions cause problems

SRAM Cache

Hash

Address

Index

Data Out

Index Tag Offset

32-bit Address

Offset

Block Size

Announcements

•  Homework 2 due Wed 9/19 in class

•  Reading, S&L Chapter 3 up to 3.4.4 –  Chapter 5 of H&P as further reference (4th edition) –  Chapter 2 of H&P as further reference (5th edition)

•  Project 0 posted, due M 9/24 –  Last parameter in colon separated list is lowercase L not a 1 –  Small benchmarks?

• bzip2 input.random 2 • gcc with bank.i (cat test_single_job) Ex:

–  gcc cccp.i -o cccp.s

• Others? perl, vortex… –  Small caches trick…

•  Today –  Memory Hierarchy and Caches

16

17

A direct mapped cache 31 3 2 0

Tag block offset index

6 5

hit/miss

row dec

=?

tag (26 bits) blocks (8 bytes)

word select (mux)

MAR

MDR

set (holds 1 block)

(26) (3) (3)

18

Placement

•  Set-associative –  Block can be in n locations (n-way SA) –  Hash collisions: still OK

•  Identification –  Still perform tag check –  However, only n in parallel

SRAM Cache

Hash

Address

Data Out

Offset

Index

Offset

32-bit Address

Tag Index

n Tags n Data Blocks Index

?= ?=

?= ?=

Tag

Not all metrics can be trusted…

•  Today Punxsutawney Phil predicted 6 more weeks of winter

•  Of the 113 predictions on record he has only predicted an early spring 14 times

•  Accuracy? –  Punxsutawney Groundhog Club Inner Circle says 100% –  StormFax Weather Almanac says 39%.

19

20

A (2-way) set associative cache 31 3 2 0

tag block offset index

5 4

tag (27bits)

blocks (8 bytes)

word select (mux)

MAR

MDR

block select (mux)

Notes: this cache is the same size as the (previous) direct-mapped cache, but the index field is 1 bit shorter (2 bits total). A direct-mapped cache is really a 1-way set-associative cache.

row dec

=? =?

set (holds 2 blocks)

(27) (2) (3)

21

Placement

•  Fully-associative –  Block can exist anywhere –  No more hash collisions

•  Identification –  How do I know I have the right block? –  Called a tag check • Must store address tags • Compare against address

•  Expensive! –  Tag & comparator per block

SRAM Cache

Hash

Address

Data Out

Offset

32-bit Address

Offset

Tag

Hit Tag Check

?=

Tag

22

A fully associative cache (also called a content addressable memory or CAM)

31 3 2 0

tag block offset

tag (29bits)

word select (mux)

MAR

MDR

=? tag   (one tag comparator per block, repeated for every block in the cache)

Notes: Same size as previous caches, but no row decoder and no index at all! Comparators (one per block) take the place of the row decoder

23

Placement and Identification

Offset

32-bit Address

Tag Index

Portion Length Purpose

Offset o=log2(block size) Select word within block

Index i=log2(number of sets) Select set of blocks

Tag t=#address bits - o - i ID block within set

24

Decoding the address •  Cache is normally specified as follows:

–  {SIZE, ASSOC, BLOCKSIZE} •  SIZE: total bytes of data storage •  ASSOC: associativity (# of blocks in a set/cache line) •  BLOCKSIZE: size of cache block in bytes

–  Question: how do we decode the address?

–  REMEMBER the following equation

–  Then compute size of each address field

tag index block offset

? ? ?

# sets = SIZE / (ASSOC × BLOCKSIZE)

# block offset bits = log2(BLOCKSIZE)     # index bits = log2(# sets)     # tag bits = (# address bits) - log2(# sets) - log2(BLOCKSIZE)
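As a sanity check, here is a short C sketch (the helper names are mine, not from the slides) that computes the number of sets and the three field widths from a {SIZE, ASSOC, BLOCKSIZE} specification; it reproduces both worked examples that follow.

#include <stdio.h>
#include <stdint.h>

/* log2 of a power-of-two value */
static unsigned log2u(uint64_t x) {
    unsigned n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

/* Offset/index/tag widths for a {SIZE, ASSOC, BLOCKSIZE} cache with addr_bits-bit addresses. */
static void field_widths(uint64_t size, unsigned assoc, unsigned blocksize, unsigned addr_bits) {
    uint64_t sets   = size / (assoc * (uint64_t)blocksize);
    unsigned offset = log2u(blocksize);
    unsigned index  = log2u(sets);
    unsigned tag    = addr_bits - index - offset;
    printf("sets=%llu offset=%u index=%u tag=%u\n",
           (unsigned long long)sets, offset, index, tag);
}

int main(void) {
    field_widths(256, 1, 32, 32);   /* 256B direct-mapped, 32B blocks: 8 sets, 5/3/24 bits */
    field_widths(128, 2, 16, 32);   /* 128B 2-way, 16B blocks: 4 sets, 4/2/26 bits */
    return 0;
}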

25

Example

•  Processor accesses a 256 Byte direct-mapped cache, which has block size of 32 Bytes, with following sequence of addresses. 0xFF0040E0, 0xBEEF005C, 0xFF0040E2, 0xFF0040E8, 0x00101078, 0x002183E0, 0x00101064, 0x0012255C,

0x00122544, … Show contents of cache after each access, count # of hits, count # of replacements.

26

Analysis

•  A processor accesses a 256 Byte direct-mapped cache, which has block size of 32 Bytes, with following sequence of addresses. Show contents of the cache after each access, count # of hits, count # of replacements.

•  C = 8, S = 1, B = 5. => How many sets in the cache? 8 # index bits = log2(# sets) = log2(8) = 3 # block offset bits = log2(block size) = log2(32 bytes) = 5 # tag bits = total # address bits - # index bits - # block offset bits = 32 bits – 3 bits – 5 bits = 24

31 5 4 0

tag block offset index

8 7

(24) (3) (5)
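To cross-check the field split above, here is a small C snippet (my own, not from the slides) that decodes the first trace address for this 256 Byte direct-mapped cache with 32 Byte blocks; the masks follow directly from the 5 offset bits and 3 index bits.

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t addr   = 0xFF0040E0u;        /* first address in the trace            */
    uint32_t offset = addr & 0x1Fu;       /* low 5 bits: byte within the 32B block */
    uint32_t index  = (addr >> 5) & 0x7u; /* next 3 bits: one of 8 sets            */
    uint32_t tag    = addr >> 8;          /* remaining 24 bits                     */
    printf("tag=0x%06X index=%u offset=%u\n",
           (unsigned)tag, (unsigned)index, (unsigned)offset);
    /* prints tag=0xFF0040 index=7 offset=0, matching the address-stream table below */
    return 0;
}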

27

Address Stream

Address (hex)   Tag (hex)   Index & Offset bits (binary)   Index (decimal)   Comment

0xFF0040E0 0xFF0040 1110 0000 7 Miss

0xBEEF005C 0xBEEF00 0101 1100 2 Miss

0xFF0040E2 0xFF0040 1110 0010 7 Hit

0xFF0040E8 0xFF0040 1110 1000 7 Hit

0x00101078 0x001010 0111 1000 3 Miss

0x002183E0 0x002183 1110 0000 7 Miss/Repl

0x00101064 0x001010 0110 0100 3 Hit

0x0012255C 0x001225 0101 1100 2 Miss/Repl

0x00122544 0x001225 0100 0100 2 Hit

28

Replacement

•  Cache has finite size –  What do we do when it is full?

•  Analogy: desktop full? –  Move books to bookshelf to make room

•  Same idea: –  Move blocks to next level of cache

Announcements

•  Homework 2 due Wed 9/19 in class

•  Reading, S&L Chapter 3 up to 3.4.4 –  Chapter 5 of H&P as further reference (4th edition) –  Chapter 2 of H&P as further reference (5th edition)

•  Project 0 posted, due M 9/24 –  Small benchmarks?

• bzip2 input.random 2 • gcc with bank.i (cat test_single_job) Ex:

–  gcc cccp.i -o cccp.s

• Others? perl, vortex… –  Small caches trick…

•  Today –  Memory Hierarchy and Caches

29

30

Replacement

•  How do we choose victim? –  Verbs: Victimize, evict, replace, cast out

•  Several policies are possible –  FIFO (first-in-first-out) –  LRU (least recently used) –  NMRU (not most recently used) –  LFU (least frequently used) –  Pseudo-random (yes, really!)

•  Pick victim within set where a = associativity –  If a <= 2, LRU is cheap and easy (1 bit) –  If a > 2, it gets harder –  Pseudo-random works pretty well for caches

31

LRU example •  Blocks A, B, C, D, and E all map to the same set •  Trace: A B C D D D B C A E •  Assign sequence numbers (increment by 1 for each access) •  A(0) B(1) C(2) D(3) D(4) D(5) B(6) C(7) A(8) E(9) •  Alternatively, keep repl list, move accessed elem. to tail

A(0)

A(0) B(1)

A(0) C(2) B(1)

A(0) C(2) B(1) D(3)

A(0) C(2) B(1) D(4)

A(0) C(2) B(1) D(5)

A(0) C(2) B(6) D(5)

A(0) C(7) B(6) D(5)

A(8) C(7) B(6) D(5)

A(8) C(7) B(6) E(9)
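A minimal C sketch of the sequence-number bookkeeping used above (illustrative only, not from the slides): each block in the set records the time of its last access, and on a miss the block with the smallest timestamp is the victim.

#include <stdio.h>

#define WAYS 4

typedef struct { char blk; int last_used; int valid; } Way;

/* Access block 'b' in a set at time 't'; fill an empty way or evict the LRU way on a miss. */
static void access_block(Way set[WAYS], char b, int t) {
    int victim = 0;
    for (int i = 0; i < WAYS; i++) {
        if (set[i].valid && set[i].blk == b) {   /* hit: refresh timestamp */
            set[i].last_used = t;
            return;
        }
        if (!set[i].valid) victim = i;           /* prefer an empty way */
    }
    for (int i = 0; i < WAYS; i++)               /* otherwise pick the oldest way */
        if (set[victim].valid && set[i].valid &&
            set[i].last_used < set[victim].last_used)
            victim = i;
    set[victim] = (Way){ b, t, 1 };              /* fill (or replace) */
}

int main(void) {
    Way set[WAYS] = { { 0, 0, 0 } };
    const char *trace = "ABCDDDBCAE";            /* trace from the slide */
    for (int t = 0; trace[t]; t++) access_block(set, trace[t], t);
    for (int i = 0; i < WAYS; i++)               /* prints E(9) C(7) B(6) A(8): D was the LRU victim */
        printf("%c(%d) ", set[i].blk, set[i].last_used);
    printf("\n");
    return 0;
}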

32

Revised Example

•  A processor accesses a 128 byte 2-way set-associative cache, which has block size of 16 bytes and LRU replacement policy, with following sequence of addresses. 0xFF0040E0, 0xBEEF005C, 0xFF0040E2, 0xFF0040E8, 0x00101078, 0x002183E0, 0x00101064, 0x0012255C,

0x00122544, …

Show contents of the cache after each access, count # of hits, count # of replacements.

33

Analysis

# sets = SIZE / (BLOCKSIZE × ASSOC) = 128 / (16 × 2) = 4

# index bits = log2(# sets) = log2(4) = 2 # block offset bits = log2(block size) = log2(16 bytes) = 4 # tag bits = total # address bits - # index bits - # block

offset bits = 32 bits – 2 bits – 4 bits = 26

31 4 3 0

tag block offset index

6 5

(26) (2) (4)

34

Address stream

Address (hex)   Tag (hex)   Index & Offset bits (binary)   Index (decimal)   Comment

0xFF0040E0 0x3FC0103 10 0000 2 Miss

0xBEEF005C 0x2FBBC01 01 1100 1 Miss

0xFF0040E2 0x3FC0103 10 0010 2 Hit

0xFF0040E8 0x3FC0103 10 1000 2 Hit

0x00101078 0x0004041 11 1000 3 Miss

0x002183E0 0x000860F 10 0000 2 Miss

0x00101064 0x0004041 10 0100 2 Miss/Repl

0x0012255C 0x0004895 01 1100 1 Miss

0x00122544 0x0004895 00 0100 0 Miss

35

Write Policy

•  Memory hierarchy –  2 or more copies of same block

• Main memory and/or disk • Caches

•  What to do on a write? –  Eventually, all copies must be changed –  Write must propagate to all levels

•  Ignore MP for now – writes cause an even bigger problem there!

36

Write Policy

•  Easiest policy: write-through •  Every write propagates directly through hierarchy

–  Write in L1, L2, memory, disk (?!?) •  Why is this a bad idea?

–  Very high bandwidth requirement –  Remember, large memories are slow –  Typically small queue called a write buffer to help ease

bandwidth mismatch

•  Popular in real systems only to the L2 –  Every write updates L1 and L2 –  Beyond L2, use write-back policy

37

Write Policy

•  Most widely used: write-back •  Maintain state of each line in a cache

–  Invalid – not present in the cache –  Clean – present, but not written (unmodified) –  Dirty – present and written (modified) –  For multiprocessor systems we will add more state bits…

•  Store state in tag array, next to address tag –  Mark dirty bit on a write

•  On eviction, check dirty bit –  If set, write back dirty line to next level –  Called a writeback or castout

•  Victim Cache?
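To make the write-back state machine concrete, here is a small hedged C sketch (the function and interface names are invented for illustration) of the eviction path: the dirty bit, kept next to the tag, decides whether a castout to the next level is needed before the refill.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

enum line_state { INVALID, CLEAN, DIRTY };      /* per-line state kept in the tag array */

typedef struct {
    enum line_state state;
    uint32_t tag;
    uint8_t  data[64];
} cache_line;

/* Stand-ins for the next level of the hierarchy (just logging here). */
static void write_next_level(uint32_t tag, const uint8_t *d, size_t n) {
    (void)d; printf("castout: write back dirty block tag=0x%X (%zu bytes)\n", tag, n);
}
static void read_next_level(uint32_t tag, uint8_t *d, size_t n) {
    memset(d, 0, n); printf("fill: fetch block tag=0x%X\n", tag);
}

/* Evict 'victim' and refill it with the block identified by 'new_tag'. */
static void evict_and_fill(cache_line *victim, uint32_t new_tag) {
    if (victim->state == DIRTY)                 /* dirty bit set => writeback (castout) */
        write_next_level(victim->tag, victim->data, sizeof victim->data);
    read_next_level(new_tag, victim->data, sizeof victim->data);
    victim->tag   = new_tag;
    victim->state = CLEAN;                      /* clean until the next store */
}

int main(void) {
    cache_line line = { CLEAN, 0xABC, {0} };
    line.data[0] = 42; line.state = DIRTY;      /* a store marks the line dirty, no write propagates */
    evict_and_fill(&line, 0xDEF);               /* eviction triggers the castout, then the fill */
    return 0;
}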

38

Write Policy

•  Complications of write-back policy –  Stale copies lower in the hierarchy –  Must always check higher level for dirty copies before accessing copy in a lower level

•  Not a big problem in uniprocessors –  In multiprocessors: the cache coherence problem

•  I/O devices that use DMA (direct memory access) can cause coherence problems even in uniprocessors –  Called coherent I/O –  Must check caches for dirty copies when reading main memory –  Update/invalidate copies in the cache when writing main memory

Another Write Policy

•  On a write miss, where does the data go? •  Two alternatives:

–  write allocate •  allocate a cache line for the new data, and replace old line •  just like a read miss

–  write no-allocate • do not allocate space in the cache for the data • only really makes sense in systems with write buffers

•  How to handle read miss after write miss? –  Write allocate, data will be present in cache –  Write no-allocate may have to snoop write buffer

Announcements

•  Homework 2 due Tu 2/7 in class

•  Reading, S&L Chapter 3 up to 3.4.4 –  Chapter 5 of H&P as further reference (4th edition) –  Chapter 2 of H&P as further reference (5th edition)

•  Project 0 posted, due 2/9 –  Small benchmarks?

• bzip2 input.random 2 • gcc with bank.i (cat test_single_job) Ex:

–  gcc cccp.i -o cccp.s

• Others? perl, vortex… –  Small caches trick… format is dl1:size:width:assoc:l

•  Today –  Memory Hierarchy and Caches

41

Announcements

•  Homework 2 due W 9/19 in class •  Problem 3.3: Assume LRU •  Penultimate problem, virtual memory and TLBs (Monday)

•  Reading, S&L Chapter 3 up to 3.4.4 plus 3.5 and 3.6 –  Chapter 2 (old:5) of H&P as further reference

•  Project 0 posted, due M 9/24 –  Small benchmarks?

•  bzip2 input.random 2 •  gcc with bank.i •  Others? perl, vortex…

–  Small caches trick…CACHE_MK_BADDR

•  Today –  Memory Hierarchy and Caches

42

43

Inclusion and Exclusion

•  In multi-level caches, to maintain cache coherence with reasonable complexity: –  Inclusion

• Any L1 data will appear in L2 • Best when L2 >> L1, MP systems, allows larger L2 BS

–  Exclusion • No overlap between L1 and L2 data, swap lines on miss! • AMD Athlon • Requires L1 block size = L2 block size • Adv. diminished if L2 >> L1 size

–  How to enforce either property? •  In the project, implement inclusion

44

Caches and Performance

•  Caches –  Enable design for common case: cache hit

• Cycle time, pipeline organization • Recovery policy

–  Uncommon case: cache miss • Fetch from next level

–  Apply recursively if multiple levels

• What to do in the meantime?

•  What is performance impact? •  Various optimizations are possible

45

Performance Impact

•  Cache hit latency –  Included in “pipeline” portion of CPI

• E.g. IBM study: 1.15 CPI with 100% cache hits –  Typically 1-3 cycles for L1 cache

•  Intel/HP McKinley: 1 cycle –  Heroic array design –  No address generation: load r1, (r2)

•  IBM Power4: 3 cycles –  Address generation –  Array access –  Word select and align

46

Cache Hits and Performance

•  Cache hit latency determined by: –  Cache organization

•  Associativity –  Parallel tag checks expensive, slow –  Way select slow (fan-in, wires)

•  Block size –  Word select may be slow (fan-in, wires)

•  Number of blocks (sets x associativity) –  Wire delay across array –  “Manhattan distance” = width + height –  Word line delay: width –  Bit line delay: height

•  Array design is an art form –  Detailed analog circuit/wire delay modeling

Word Line

Bit Line

47

Cache Misses and Performance

•  Miss penalty –  Detect miss: 1 or more cycles –  Find victim (replace line): 1 or more cycles

• Write back if dirty –  Request line from next level: several cycles –  Transfer line from next level: several cycles

•  (block size) / (bus width) –  Fill line into data array, update tag array: 1+ cycles –  Resume execution (early restart? Critical word first?)

•  In practice: 6 cycles to 100s of cycles

48

Cache Miss Rate

•  Determined by: –  Program characteristics

• Temporal locality • Spatial locality

–  Cache organization • Block size, associativity, number of sets

49

Cache Miss Rates: 3 (4?) C’s [Hill]

•  Compulsory (cold) miss –  First-ever reference to a given block of memory

•  Capacity miss –  Working set exceeds cache capacity –  Useful blocks (with future references) displaced

•  Conflict miss –  Placement restrictions (not fully-associative) cause useful

blocks to be displaced –  Think of as capacity within set

•  Coherence misses –  Misses incurred because another entity “stole” the line away

50

Cache Miss Rate Effects

•  Number of blocks (sets x associativity) –  Bigger is better: fewer conflicts, greater capacity

•  Associativity –  Higher associativity reduces conflicts –  Very little benefit beyond 8-way set-associative

•  Block size –  Larger blocks exploit spatial locality –  Usually: miss rates improve until 64B-256B –  512B or more miss rates get worse

• Larger blocks less efficient: more capacity misses • Fewer placement choices: more conflict misses

51

Cache Miss Rate

•  Subtle tradeoffs between cache organization parameters –  Large blocks reduce compulsory misses but increase miss penalty • #compulsory = (working set) / (block size) • #transfers = (block size) / (bus width)

–  Large blocks increase conflict misses • #blocks = (cache size) / (block size)

–  Associativity reduces conflict misses –  Associativity increases access time

•  Can an associative cache ever have a higher miss rate than a direct-mapped cache of the same capacity?

52

Cache Miss Rates: 3 C’s

[Chart: misses per instruction (%) for 8K 1-way, 8K 4-way, 16K 1-way, and 16K 4-way caches, broken into conflict, capacity, and compulsory components]

•  Vary size and associativity –  Compulsory misses are constant –  Capacity and conflict misses are reduced

53

Cache Miss Rates: 3 C’s

[Chart: misses per instruction (%) for 8K 32B, 8K 64B, 16K 32B, and 16K 64B caches, broken into conflict, capacity, and compulsory components]

•  Vary size and block size –  Compulsory misses drop with increased block size –  Capacity and conflict can increase with larger blocks

54

Cache Misses and Performance

•  How does this affect performance? •  Performance = Time / Program

•  Cache organization affects cycle time –  Hit latency

•  Cache misses affect CPI

Time / Program = (Instructions / Program) × (Cycles / Instruction) × (Time / Cycle)
               = (code size) × (CPI) × (cycle time)

55

Cache Misses and CPI

•  Cycles spent handling misses are strictly additive •  Miss_penalty is recursively defined at next level of cache

hierarchy as weighted sum of hit latency and miss latency

CPI = cycles / inst
    = (cycles_hit + cycles_miss) / inst
    = cycles_hit / inst + (cycles / miss) × (misses / inst)
    = cycles_hit / inst + Miss_penalty × Miss_rate_per_inst
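A minimal numeric sketch of this relation (the helper name is mine, not from the slides): miss cycles simply add on top of the hit-only CPI.

#include <stdio.h>

/* CPI = cycles_hit/inst + miss_penalty (cycles/miss) * miss_rate (misses/inst) */
static double cpi_with_misses(double cpi_hit, double miss_penalty, double misses_per_inst) {
    return cpi_hit + miss_penalty * misses_per_inst;
}

int main(void) {
    /* e.g. 1.15 base CPI, 8-cycle penalty, 0.06 misses/instruction -> 1.63 */
    printf("CPI = %.3f\n", cpi_with_misses(1.15, 8.0, 0.06));
    return 0;
}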

56

Cache Misses and CPI

•  Pl is miss penalty at each of n levels of cache (but hit at the next level, i.e, hit latency of next level of cache)

•  MPIl is miss rate per instruction at each of n levels of cache •  Miss rate specification:

–  Per instruction: easy to incorporate in CPI –  Per reference: must convert to per instruction

CPI = cycles_hit / inst + Σ (l = 1 to n) P_l × MPI_l

57

Cache Misses and CPI

•  For multi-level caches the MP of the L1 cache is just the AMAT of the L2 cache

•  Can simply chain equations together

CPI = CPI_IDEAL + (refs / inst) × MR × MP

AMAT = t_hit + MR × MP

58

Cache Performance Example

•  Assume following: –  CPI with perfect caches = 1.15, 40% of instructions are ld/st –  L1 instruction cache with 98% hit rate –  L1 data cache with 96% hit rate –  Shared L2 cache with 40% local miss rate –  L1 miss penalty/L2 hit latency of 8 cycles –  L2 miss penalty of:

•  10 cycles latency to request word from memory •  2 cycles per 16B bus transfer, 4x16B = 64B block

transferred • Hence 8 cycles transfer plus 1 cycle to fill L2 • Total penalty 10+8+1 = 19 cycles

59

Cache Performance Example

CPI = cycles_hit / inst + Σ (l = 1 to n) P_l × MPI_l

CPI = 1.15 + 8 cycles/miss × (0.02 + 0.04) miss/inst + 19 cycles/miss × 0.40 miss/ref × 0.06 ref/inst
    = 1.15 + 0.48 + 19 × 0.024 miss/inst
    = 1.15 + 0.48 + 0.456
    = 2.086

60

Cache Performance Example

CPI = CPI_IDEAL + (refs / inst) × MR × MP      AMAT = t_hit + MR × MP

CPI = 1.15 + [1.0 refs/inst × 0.02 + (0.4 refs/inst × 0.04)] × (8 + 0.4 × 19)
    = 1.15 + 0.036 × 15.6
    = 1.15 + 0.5616 = 1.71 cycles

61

Cache Performance Example

CPI = CPI_IDEAL + (refs / inst) × MR × MP      AMAT = t_hit + MR × MP

With a 300-cycle L2 miss penalty instead of 19:

CPI = 1.15 + [1.0 refs/inst × 0.02 + (0.4 refs/inst × 0.04)] × (8 + 0.4 × 300)
    = 1.15 + 0.036 × 128
    = 1.15 + 4.61 = 5.76 cycles
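Both results can be reproduced with a short C sketch of the chained formulas (variable names are illustrative): the L1 miss penalty is just the AMAT of the L2, so going from a 19-cycle to a 300-cycle L2 miss penalty is a one-argument change.

#include <stdio.h>

/* AMAT = t_hit + miss_rate * miss_penalty */
static double amat(double t_hit, double mr, double mp) { return t_hit + mr * mp; }

int main(void) {
    double cpi_ideal = 1.15;
    double ifetch_mr = 0.02, dref_mr = 0.04;   /* per-reference miss rates     */
    double refs_i    = 1.0,  refs_d  = 0.4;    /* references per instruction   */
    double l2_hit    = 8.0,  l2_local_mr = 0.4;

    double l1_misses_per_inst = refs_i * ifetch_mr + refs_d * dref_mr;   /* 0.036 */

    /* L1 miss penalty = AMAT of L2, first with a 19-cycle, then a 300-cycle L2 miss penalty */
    double mp19  = amat(l2_hit, l2_local_mr, 19.0);    /* 15.6 cycles */
    double mp300 = amat(l2_hit, l2_local_mr, 300.0);   /* 128 cycles  */

    printf("CPI = %.2f\n", cpi_ideal + l1_misses_per_inst * mp19);    /* ~1.71 */
    printf("CPI = %.2f\n", cpi_ideal + l1_misses_per_inst * mp300);   /* ~5.76 */
    return 0;
}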

62

Cache Misses and Performance

•  CPI equation –  Only holds for misses that cannot be overlapped with other activity –  Store misses often overlapped

• Place store in store queue • Wait for miss to complete • Perform store • Allow subsequent instructions to continue in parallel

–  Modern out-of-order processors also do this for loads

• Cache performance modeling requires detailed modeling of entire processor core

• Cache performance has a profound impact on overall performance

Improving Cache Performance

•  Cache Performance is determined by

•  Use better technology: –  use faster RAMs –  cost and availability are limitations –  more on RAMs and memory systems later

•  Decrease Miss Rate •  Decrease Miss Penalty •  Decrease Hit Time

AMAT = t_h + r_m × t_p   (hit time + miss rate × miss penalty)

Improving Cache Performance

•  Use better technology •  Decrease Miss Rate •  Decrease Miss Penalty •  Decrease Hit Time

Decreasing Miss Rate

•  Reduce the number of cache misses •  Open-ended goal, many ways to attack the problem •  Start with the 3 C’s

–  think of ways to reduce cold, capacity, and conflict misses

66

Improving Locality (Instruction Stream)

•  Instruction text placement –  Profile program, place unreferenced or rarely referenced

paths “elsewhere” • Maximize temporal locality •  “cording”

–  Eliminate taken branches • Fall-through path has spatial locality

–  Trace caches

Increase Block Size

•  Larger block size better exploits spatial locality but –  larger block size means larger miss penalty

•  takes longer time to transfer the block –  if block size is too big

•  average access time goes up (may need heavier banking) and •  overall miss rate can increase because temporal locality is reduced

when the replaced data would have been reused (too few lines in $) •  an extreme example for the previous cache: a single 64KB block

[Plots vs. block size: miss penalty increases with block size; miss rate first drops (exploits spatial locality) then rises (fewer blocks compromises temporal locality); average access time therefore has a minimum, with both miss penalty and miss rate increased at large block sizes]

Higher Associativity

•  Reduce the number of conflict misses –  more places to put data

•  Two general rules of thumb (empirical): –  an 8-way set-associative $ performs near fully associative –  direct mapped cache of size N has same MR as a 2-way set

associative cache of size N/2 (2:1 cache rule of thumb)

•  Tradeoff is increased hit time –  10% for TTL/ECL, 2% for custom CMOS (Hill[1988])

•  Commonly see high associativity in 2nd-level caches –  less common in 1st-level caches

Victim Cache

•  Place small fully-associative cache between real cache and its refill path –  contains only blocks replaced on recent misses (victims)

•  On a miss: –  check victim cache –  if present, swap victim and cache entry –  else fetch as usual, put new victim in victim cache

•  Shown effective for small direct-mapped caches –  trade-off is area and additional control complexity

•  More modern technique to reduce conflict misses –  does not trade-off hit time –  miss penalty?

Pseudo Set-Associative Caches

•  Combine advantages of direct-mapped and set-associative caches

•  Perform cache access as in a direct-mapped cache –  if hit, done –  if miss, check another cache entry! –  one scheme: invert MSB of index and check there

•  Effectively a second set –  but, no parallel comparators or muxes (less HW) –  one set is fast access, one is slow access (extra cycle) –  want fast hits and not the slow hits

•  Need some form of way prediction –  can result in lower AMAT than both DM and SA caches –  variable hit times complicate pipelines – use in lower levels

Hardware Prefetching

•  A technique to improve cold and capacity misses •  Have hardware fetch extra lines on a miss

–  Can store in cache, or a separate stream buffer •  Alpha 21064 fetches 2 blocks on an instruction miss

–  A single entry buffer captures 15%-25% of misses •  Can do the same for data cache, even multiple buffers

–  Less effective if program accesses are non-unit stride •  Disadvantages

–  Hardware always prefetches –  Scheme relies on excess available memory bandwidth

• Trades off bandwidth for latency –  Can hurt performance if it interferes with demand misses

Software Prefetching

•  Avoid disadvantages of hardware prefetching with selectivity –  compiler-directed prefetching –  compiler can analyze code and know where misses occur

•  Insert a special prefetch instruction into the code stream –  most useful when it is a non-binding prefetch

•  turns into a nop on an exception –  don’t prefetch everything, too much instruction overhead! –  ideally just prefetch the misses, and ideally early enough! –  sophisticated compiler analysis in general case

•  Requires the existence of lockup-free (non-blocking) caches •  Subsequent load to same cache line will

–  hit in cache if prefetch is back from memory system –  miss, but not issue, if prefetch still outstanding
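As a concrete illustration (not from the slides), GCC and Clang expose a non-binding prefetch through the __builtin_prefetch intrinsic; a sketch of prefetching a fixed distance ahead in a streaming loop might look like the following, where the prefetch distance is a tuning knob I have picked arbitrarily.

#include <stddef.h>

#define PREFETCH_AHEAD 16   /* tuning knob: how many iterations ahead to prefetch */

/* Sum an array while hinting future elements into the cache. The builtin is
   non-binding: it does not fault on a bad address, but we keep a bounds check
   anyway for clarity. */
double sum_with_prefetch(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], /*rw=*/0, /*locality=*/1);
        s += a[i];
    }
    return s;
}

int main(void) {
    static double a[1024];
    for (size_t i = 0; i < 1024; i++) a[i] = 1.0;
    return sum_with_prefetch(a, 1024) == 1024.0 ? 0 : 1;
}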

Compiler Optimizations

•  Reduce miss rates without changing the hardware! •  Code is easily re-ordered

–  cording rearranges procedures to reduce conflict misses –  use profiled information

•  Data is more interesting (and harder) –  still, one can re-arrange data accesses to improve locality

•  Examples: (a little later) –  array merging –  loop interchange –  loop fusion –  blocking

Improving Cache Performance

•  Use better technology •  Decrease Miss Rate •  Decrease Miss Penalty •  Decrease Hit Time

Reducing Miss Penalty

•  Some cache misses are inevitable –  when they do happen, want to service as quickly as possible –  one key component is memory speed and organization

•  Other miss penalty reduction techniques: –  giving read misses priority over writes –  subblock placement –  early restart and critical word first –  lockup-free caches –  multilevel caches

Read Priority

•  Processor need not wait for (isolated) writes –  but what if we want to read – RAW through memory –  write-through caches: must keep an eye on outstanding

writes • write buffer •  serialization in memory

•  Reads do stall CPU – give priority to reads –  but serialize/forward if overlap with earlier write –  modern processors do this extensively –  must track all previous stores to avoid correctness violations

Fill Before Spill

•  In writeback caches •  If line is Dirty on a read/write miss, need to write it back •  This increases miss penalty for the demand miss •  Solution: spill buffer

–  fetch demand miss from memory –  spill dirty line into on-chip spill buffer –  write spill buffer to memory in background after demand miss

•  Subsequent misses wait for spill buffer to empty –  or even snoop

•  What about a system request from another processor or from the I/O system?

Sub-blocking

•  Want few tags (i.e., large blocks) but manageable miss penalty

•  Create multiple valid bits per cache line –  Valid bit per sub-block –  if WB cache, dirty bit per sub-block –  in a miss need only to fetch that sub-block –  must evict all dirty sub-blocks on replacement –  decreases tag storage without increasing miss penalty –  can be useful in multiprocessor systems

•  With write-through caches write misses need not fetch anything if sub-block size is one word

Early Restart

•  Decrease miss penalty with no new hardware –  well, okay, with some more complicated control

•  Strategy: impatience! •  There is no need to wait for entire line to be fetched •  Early Restart – as soon as the requested word (or double

word) of the cache block arrives, let the CPU continue execution

•  If CPU references another cache line or a later word in the same line: stall

•  Early restart is often combined with the next technique…

Critical Word First

•  Improvement over early restart –  request missed word first from memory system –  send it to the CPU as soon as it arrives –  CPU consumes word while rest of line arrives

•  Even more complicated control logic –  memory system must also be changed –  block fetch must wrap around

•  Example: 32B block (8 words), miss on address 20 –  words return from memory system as follows: 20, 24, 28, 0, 4, 8, 12, 16 • other sequences possible

(word addresses within the block: 0 4 8 12 16 20 24 28)
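A tiny C sketch (illustrative, not from the slides) of the wrap-around return order: start at the word containing the miss address and walk the block modulo its size.

#include <stdio.h>

#define BLOCK_BYTES 32
#define WORD_BYTES  4

/* Print the word addresses in critical-word-first (wrap-around) order
   for a miss at byte address 'miss_addr'. */
static void cwf_order(unsigned miss_addr) {
    unsigned base  = miss_addr & ~(BLOCK_BYTES - 1);             /* start of the block      */
    unsigned first = miss_addr & (BLOCK_BYTES - 1) & ~(WORD_BYTES - 1);  /* missed word     */
    for (unsigned i = 0; i < BLOCK_BYTES; i += WORD_BYTES)
        printf("%u ", base + (first + i) % BLOCK_BYTES);
    printf("\n");
}

int main(void) {
    cwf_order(20);   /* prints 20 24 28 0 4 8 12 16, as in the example above */
    return 0;
}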

Lockup-Free Caches

•  The CPU need not stall on cache misses –  dynamically scheduled processors can hide memory latency –  caches must be non-blocking or lockup-free

•  Hit under miss schemes allow data cache to supply data for other lines during a cache miss –  Most of the performance improvement of Pentium MMX was due

to implementation of hit under miss •  Extensions include “hit under multiple miss” (overlap misses) •  Significantly complicates cache control!

–  R10000 implementation has a Miss Handling Table (MHT) • Sometimes referred to as set of MSHRs

–  MHT handles request merging for prefetches –  also acts as write buffer for sequential writes

2nd-Level Caches

•  Add another level of cache between CPU and main memory –  allows first-level cache to remain small and fast –  second-level is slower but much larger (MBs)

•  Reduces overall miss penalty, complicated perf analysis •  AMAT = Hit time_L1 + Miss Rate_L1 × Miss Penalty_L1 •  Miss Penalty_L1 = Hit time_L2 + Miss Rate_L2 × Miss Penalty_L2 •  What is 2nd-level miss rate?

–  local miss rate – number of cache misses / cache accesses –  global miss rate – number of cache misses / CPU memory refs

•  Local miss rate can be large…why? •  Global miss rate may be more useful measure

Miss Rates: An Example

•  CPU A generates 100,000 loads and stores. The first-level cache has 3,000 misses. The second level-cache has 1,500 misses. What are the various miss rates?

•  For first-level cache local and global miss rates are same: –  3,000/100,000 = 3%

•  For second-level cache, local miss rate is: –  1,500 / 3,000 = 50%

•  For second-level cache, global miss rate is: –  1,500 / 100,000 = 1.5%

Second-Level Cache Design

•  Speed of 2nd level cache affects only miss penalty not CPU clock –  will it lower the AMAT portion of the CPI? –  how much does it cost?

•  Size of 2nd level cache >> first level •  Most capacity misses go away, leaving conflict misses •  2nd-level caches therefore

–  typically have some degree of associativity > 1 –  have large block sizes –  emphasis shifts from fast hits to fewer misses

Improving Cache Performance

•  Use better technology •  Decrease Miss Rate •  Decrease Miss Penalty •  Decrease Hit Time

Small, Simple Caches

•  On many machines today the cache access sets the cycle time

•  Hit time is therefore important beyond its effect on AMAT •  Index portion of address reads cache tag for comparison

–  this part of the hit time is the most time consuming –  smaller hardware is faster, use small 1st-level caches –  1st-level caches also have to be on chip, again small –  compromise: tags on chip, data off

•  Simple caches (direct-mapped) also have better hit times and consume less power –  tag access overlapped with data access

Avoid Translations

•  Virtual memory discussed later –  CPU issues virtual addresses –  these are translated into physical addresses by page table

•  Caches are either virtual caches or physical caches –  guess what kind of addresses each hold…

•  Fast caches suggest all caches should be virtual –  no translations on cache hit path

•  Problems with virtual caches –  inter-process aliasing –  intra-process aliasing

More on Avoiding Translations

•  Can do translation and cache access in separate pipe stages –  fast cycle time but slow hits –  also increase in load delay slots

•  Intriguing option: virtually-indexed, physically-tagged caches –  page offset bits of VA are not translated –  use page offset bits as cache index –  do translation of page number (tag) in parallel –  compare physical tag with translated tag

•  Good idea. Problems when # cache lines > page size –  can use page coloring again –  or high associativity caches

Pipelining Write Hits

•  Write hits take longer than read hits (why?) –  solution: pipeline the writes –  tag comparison done during “memory” stage –  data queued or buffered –  when data array free, write data to cache if hit –  logical pipeline between writes (Alpha 21064)

Array Merging

•  Some weak programmers produce code like: •  int val[SIZE]; •  int key[SIZE]; •  …and then proceed to reference key and val in lockstep •  What’s the problem?

Array Merging

•  Some weak programmers produce code like: •  int val[SIZE]; •  int key[SIZE]; •  …and then proceed to reference key and val in lockstep •  What’s the problem? •  Danger is that these accesses may interfere w/ each other •  Solution: merge the arrays into a single array of records: •  struct merge { •  int val; •  int key; •  }; •  struct merge merged_array[SIZE];

•  Good programming practice obviates the optimization

Loop Interchange

•  Some weak programmers produce code like: •  for (j=0; j < 100; j++) •  for (k=0; k < 100; k++) •  x[k][j] = 2 * x[k][j];

•  What’s the problem?

Loop Interchange

•  Some weak programmers produce code like: •  for (j=0; j < 100; j++) •  for (k=0; k < 100; k++) •  x[k][j] = 2 * x[k][j];

•  What’s the problem? –  C is a row-major language (Fortran is column-major) –  This code has a stride of 100 words, not 1 –  No spatial locality, poor hit rates

•  Solution: interchange the loops! •  for (k=0; k < 100; k++) •  for (j=0; j < 100; j++) •  x[k][j] = 2 * x[k][j];

•  Does not affect number of instructions, just more hits!

Loop Fusion

•  Some weak programmers produce code like: •  for (j=0; j < N; j++) •  for (k=0; k < N; k++) •  a[j][k] = 1/b[j][k] * c[j][k]; •  for (j=0; j < N; j++) •  for (k=0; k < N; k++) •  d[j][k] = a[j][k] + c[j][k]; •  What’s the problem?

Loop Fusion

•  Some weak programmers produce code like: •  for (j=0; j < N; j++) •  for (k=0; k < N; k++) •  a[j][k] = 1/b[j][k] * c[j][k]; •  for (j=0; j < N; j++) •  for (k=0; k < N; k++) •  d[j][k] = a[j][k] + c[j][k]; •  What’s the problem?

–  No temporal locality if arrays are big enough –  Codes takes misses to a and c arrays twice

•  Solution: fuse the loops! •  for (j=0; j < N; j++) •  for (k=0; k < N; k++) { •  a[j][k] = 1/b[j][k] * c[j][k]; •  d[j][k] = a[j][k] + c[j][k]; •  }

Blocking I

•  Hardest, but most famous cache optimization –  Meant to reduce capacity misses (conflict indirectly) –  Even strong programmers don’t block code on first try –  Canonical example is matrix multiplication

•  for (i=0; i < N; i++) •  for (j=0; j < N; j++) { •  r=0; •  for (k=0; k < N; k++) •  r += y[i][k] * z[k][j]; •  x[i][j] = r; •  }

•  Number of capacity misses depends on N and cache size –  Accesses 2N³ + N² words in N³ operations –  Want to ensure better cache behavior –  Inner loops compute in steps of B rather than sweeping x and z

Blocking II

•  B is the blocking factor •  for (jj=0; jj < N; jj+=B) •  for (kk=0; kk < N; kk+=B) •  for (i=0; i < N; i++) •  for (j=jj; j < min(jj+B,N); j++) { •  r=0; •  for (k=kk; k < min(kk+B,N); k++) •  r += y[i][k] * z[k][j]; •  x[i][j] += r; •  }•  Reduces words accessed to 2N3/B + N2

–  y benefits from spatial locality, z from temporal –  Reduces capacity misses by decreasing working set –  Reduces probability of conflict misses for same reason

98

Main Memory

•  DRAM chips

•  Memory organization –  Interleaving –  Banking

•  Memory controller design

99

DRAM Chip Organization

[Figure: DRAM chip organization – memory cell array (one transistor + capacitor per cell) addressed by a row decoder driving wordlines; bitlines feed sense amps and a row buffer; the column decoder selects data onto the data bus; row address and column address share the address pins]

•  Optimized for density, not speed •  Data stored as charge in capacitor •  Discharge on reads => destructive reads •  Charge leaks over time

–  refresh every 2 ms

•  Need to precharge data lines before access

100

DRAM Chip Organization

[Figure: DRAM chip organization, repeated from the previous slide]

•  Address pins are time-multiplexed

–  Row address strobe (RAS)

–  Column address strobe (CAS)

•  Current generation DRAM –  256Mbit, moving to 1Gbit –  133 MHz synchronous interface –  Data bus 2, 4, 8, 16 bits

101

DRAM Chip Organization

[Figure: DRAM chip organization, repeated from the previous slide]

•  New RAS results in: –  Bitline precharge –  Row decode, sense –  Row buffer write (up to 8K)

•  New CAS –  Row buffer read

•  Streaming row accesses desirable –  Much faster (3x)

102

Simple Main Memory

•  Consider these parameters: –  1 cycle to send address –  6 cycles to access each word –  1 cycle to send word back

•  Miss penalty for a 4-word block: (1 + 6 + 1) x 4 = 32

•  How can we speed this up?

103

Wider(Parallel) Main Memory

•  Make memory wider –  Read out all words in parallel

•  Memory parameters –  1 cycle to send address –  6 to access a double word –  1 cycle to send it back

•  Miss penalty for 4-word block: (1 + 6 + 1) x 2 = 16

•  Costs –  Wider bus –  Larger minimum expansion unit

104

Interleaved and Parallel Organization

[Figure: non-interleaved organization – a single serial Addr/Cmd/Data path to one rank of DRAM chips; interleaved organization – multiple banks with chip selects sharing the Addr/Cmd/Data bus, allowing parallel (overlapped) accesses]

105

Interleaved Memory Summary

•  Parallel memory adequate for sequential accesses –  Load cache block: multiple sequential words –  Good for writeback caches

•  Banking useful otherwise

•  Can also do both –  Within each bank: parallel memory path –  Across banks

•  Can support multiple concurrent cache accesses (nonblocking)
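To compare the organizations numerically, here is a hedged C sketch using the parameters from the "Simple Main Memory" slide (1 cycle to send the address, 6 cycles to access, 1 cycle to transfer, 4-word block). The interleaved figure assumes 4 banks whose 6-cycle accesses overlap, which is my own assumption rather than a number given in the slides.

#include <stdio.h>

int main(void) {
    int addr = 1, access = 6, xfer = 1, words = 4;

    int simple      = (addr + access + xfer) * words;         /* 32: one word at a time        */
    int wide        = (addr + access + xfer) * (words / 2);   /* 16: double-word-wide path     */
    int interleaved = addr + access + words * xfer;           /* 11 (assumed): 4 banks overlap
                                                                 access, words return 1/cycle  */
    printf("simple=%d wide=%d interleaved=%d cycles\n", simple, wide, interleaved);
    return 0;
}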

106

Memory Controller Organization

[Figure: memory controller organization – ReadQ, WriteQ, and RespQ feed a Scheduler and Buffer inside the Memory Controller, which drives commands and data to DIMM(s) in Bank0 and Bank1]

107

Memory Controller Organization

•  ReadQ –  Buffers multiple reads, enables scheduling optimizations

•  WriteQ –  Buffers writes, allows reads to bypass writes, enables scheduling opt.

•  RespQ –  Buffers responses until bus available

•  Scheduler –  FIFO? Or schedule to maximize row hits (CAS accesses) –  Scan queues for references to same page –  Looks similar to issue queue with page number broadcasts for tag match

•  Buffer –  Builds transfer packet from multiple memory words to send over processor bus

108

Main Memory and Virtual Memory

•  Use of virtual memory –  Main memory becomes another level in the memory hierarchy –  Enables programs with address space or working set that exceed physically available memory • No need for programmer to manage overlays, etc. • Sparse use of large address space is OK –  Allows multiple users or programs to timeshare limited amount of physical memory space and address space

•  Bottom line: efficient use of expensive resource, and ease of programming

109

Virtual Memory

•  Enables –  Use more memory than system has –  Think program is the only one running • Don’t have to manage address space usage across programs • e.g. think it always starts at address 0x0 –  Memory protection • Each program has private VA space: no-one else can clobber it –  Better performance • Start running a large program before all of it has been loaded from disk

•  How to do it? Add a level of indirection

110

Virtual Memory – Placement

•  Main memory managed in larger blocks –  Page size typically 4K – 16K

•  Fully flexible placement; fully associative –  Operating system manages placement –  Indirection through page table –  Maintain mapping between:

• Virtual address (seen by programmer) • Physical address (seen by main memory)

111

Virtual Memory – Placement

•  Fully associative implies expensive lookup? –  In caches, yes: check multiple tags in parallel

•  In virtual memory, expensive lookup is avoided by using a level of indirection –  Lookup table or hash table –  Called a page table

112

Virtual Memory – Identification

•  Similar to cache tag array –  Page table entry contains PA, dirty bit

•  Virtual address: –  Matches programmer view; based on register values –  Can be the same for multiple programs sharing same

system, without conflicts •  Physical address:

–  Invisible to programmer, managed by O/S –  Created/deleted on demand basis, can change

Virtual Address Physical Address Dirty bit

0x20004000 0x2000 Y/N

113

Virtual Memory – Replacement

•  Similar to caches: –  FIFO –  LRU; overhead too high

• Approximated with reference bit checks –  Random

•  O/S decides, manages –  Take an Operating Systems Course

114

Virtual Memory – Write Policy

•  Write back –  Disks are too slow to write through

•  Page table maintains dirty bit –  Hardware must set dirty bit on first write –  O/S checks dirty bit on eviction –  Dirty pages written to backing store

• Disk write, 10+ ms

115

Virtual Memory Implementation

•  Caches have fixed policies, hardware FSM for control, pipeline stall

•  VM has very different miss penalties –  Remember disks are 10+ ms!

•  Hence engineered differently

Announcements

•  Homework 2 due now! •  Final Project posted, form groups by today

–  Abstract due to me via email Friday 10/8 6pm •  No class Wednesday 10/6 •  Midterm Exam, in class Monday 10/18

–  Open book, open notes, bring calculator, no network

•  Today: –  Finish chapter 3 topics VM, TLBs –  Start: ISA

• End of S&L Chapter 1, Appendix A&B of H&P • Read Chapter 2 of S&L

116

117

Page Faults

•  A virtual memory miss is a page fault –  Physical memory location does not exist –  Exception is raised, save PC –  Invoke OS page fault handler

• Find a physical page (possibly evict) •  Initiate fetch from disk

–  Switch to other task that is ready to run –  Interrupt when disk access complete –  Restart original instruction

Announcements

•  Project 0 due Wed. 9/30 –  Be sure to check out the Discussions on Webcourses

•  Using Olympus account for Final Project? –  Please send your NID/PID in an email to Lek

(snilpani@cs.ucf.edu) by tonight

•  Homework 2 posted! Due 1 week from today –  P3.1, P3.3, P3.4, P3.13, P3.24 from Shen & Lipasti

•  Today: –  Finish Virtual Memory, and ISA Overview

118

119

Address Translation

•  O/S and hardware communicate via PTE •  How do we find a PTE?

–  &PTE = PTBR + page number * sizeof(PTE) –  PTBR is private for each program

• Context switch replaces PTBR (page table base register) contents

VA PA Dirty Ref Protection

0x20004000 0x2000 Y/N Y/N Read/Write/Execute
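A minimal C sketch of that lookup for a flat (single-level) page table; the types, the 4KB page size, and the toy mapping are assumptions for illustration, not the slides' example numbers.

#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT 12                          /* assumed 4KB pages */

typedef struct {
    uint32_t ppn;                              /* physical page number */
    unsigned dirty : 1, ref : 1, prot : 3;     /* D, Ref, R/W/X bits   */
} pte_t;

/* &PTE = PTBR + page_number * sizeof(PTE); the page offset is not translated. */
static uint32_t translate(const pte_t *ptbr, uint32_t va) {
    uint32_t vpn    = va >> PAGE_SHIFT;
    uint32_t offset = va & ((1u << PAGE_SHIFT) - 1);
    const pte_t *pte = ptbr + vpn;             /* one extra memory reference */
    return (pte->ppn << PAGE_SHIFT) | offset;
}

int main(void) {
    static pte_t table[16] = {{0}};
    table[2].ppn = 0x2;                        /* toy mapping: VPN 2 -> PPN 2 */
    printf("PA = 0x%X\n", (unsigned)translate(table, (2u << PAGE_SHIFT) | 0x004));
    return 0;                                  /* prints PA = 0x2004 */
}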

120

Address Translation

PPN D PTBR

Virtual Page Number Offset

+ Prot

PA = PPN concatenated with (untranslated) Offset

121

Page Table Size

•  How big is page table? –  2^32 / 4K * 4B = 4MB per program (!) –  Much worse for 64-bit machines

•  To make it smaller –  Use limit register(s)

•  If VA exceeds limit, invoke O/S to grow region –  Use a multi-level page table –  Make the page table pageable (use VM)

122

Multilevel Page Table

[Figure: multilevel page table – PTBR plus the first index field selects a first-level entry, which points to a second-level table indexed by the next field; the final entry yields the PPN, concatenated with the untranslated page offset]
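A hedged C sketch of a two-level walk matching the figure (the 10+10 bit VPN split, the names, and the toy mapping are assumptions): the high VPN bits index the first-level table, whose entry points at a second-level table indexed by the low VPN bits, and the page offset passes through untranslated.

#include <stdint.h>

#define PAGE_SHIFT 12                     /* assumed 4KB pages           */
#define L2_BITS    10                     /* assumed 10+10 bit VPN split */

typedef struct { uint32_t ppn; int valid; } pte_t;
typedef struct { pte_t *next; int valid; } pde_t;   /* first-level entry */

/* Walk a two-level table; returns 1 and fills *pa on success, 0 on a page fault. */
static int walk(pde_t *ptbr, uint32_t va, uint32_t *pa) {
    uint32_t vpn = va >> PAGE_SHIFT;
    pde_t *pde = &ptbr[vpn >> L2_BITS];                       /* first-level index  */
    if (!pde->valid) return 0;
    pte_t *pte = &pde->next[vpn & ((1u << L2_BITS) - 1)];     /* second-level index */
    if (!pte->valid) return 0;
    *pa = (pte->ppn << PAGE_SHIFT) | (va & ((1u << PAGE_SHIFT) - 1));
    return 1;
}

int main(void) {
    static pte_t leaf[1 << L2_BITS];
    static pde_t root[1 << L2_BITS];
    leaf[4] = (pte_t){ 0x1234, 1 };       /* toy mapping: VPN 4 -> PPN 0x1234 */
    root[0] = (pde_t){ leaf, 1 };
    uint32_t pa = 0;
    int ok = walk(root, (4u << PAGE_SHIFT) | 0x42, &pa);
    return (ok && pa == ((0x1234u << PAGE_SHIFT) | 0x42)) ? 0 : 1;
}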

123

Hashed Page Table

•  Use a hash table or inverted page table or frame table –  PT contains an entry for each real address

•  Instead of entry for every virtual address –  Entry is found by hashing VA –  Oversize PT to reduce collisions: #PTE = 4 x (#phys. pages)

124

Hashed Page Table

PTBR

Virtual Page Number Offset

Hash PTE2 PTE1 PTE0 PTE3

125

High-Performance VM

•  VA translation –  Additional memory reference to PTE –  Each instruction fetch/load/store now 2 memory references!

• Or more, with multilevel table or hash collisions –  Even if PTE are cached in D/I-caches, still slow

•  Hence, use special-purpose cache for PTEs –  Called TLB (translation lookaside buffer) –  Caches PTE entries –  Exploits temporal and spatial locality (just a cache)

126

Translation Lookaside Buffer (TLB)

•  Set-associative or fully-associative organizations •  Both widely employed

[Figure: TLB organizations – the virtual page number is split into tag and index for the set-associative form; the fully associative form compares the whole page number]

Announcements

•  Project 0 is due Tues. 2/16 –  Discussed inclusion last week

•  Homework 2 posted, due now –  P3.1, P3.3, P3.4, P3.13, P3.24 from Shen and Lipasti

•  Reading posted for Chapter 3 (up to 3.6) –  Will continue rest of chapter after other topics

•  Memory Hierarchy slides re-posted •  Virtual Memory slides broken out and posted •  Today:

–  Finish chapter 3 topics –  Next time: ISA

127

128

Interaction of TLB and Cache

•  Serial lookup: first TLB then D-cache •  Excessive latency

129

Virtual Memory Protection

•  Each process/program has private virtual address space –  Automatically protected from rogue programs

•  Sharing is possible, necessary, desirable –  Avoid copying, staleness issues, etc.

•  Sharing in a controlled manner –  Grant specific permissions

• Read • Write • Execute • Any combination

130

VM Sharing

•  Share memory locations by: –  Map shared physical location into both address spaces:

• E.g. PA 0xC00DA becomes: –  VA 0x2D000DA for process 0 –  VA 0x4D000DA for process 1

–  Either process can read/write shared location

•  However, causes synonym problem

131

VA Synonyms

•  Virtually-addressed caches are desirable –  No need to translate VA to PA before cache lookup –  Faster hit time, translate only on misses

• L2 cache and lower *only* use physical addresses •  However, VA synonyms cause problems

–  Can end up with two copies of same physical line –  Causes coherence problems

•  Solutions: –  Flush caches/TLBs on context switch (one copy of data in

the cache) –  Store PID in the cache itself, treat as part of tag for hit/miss –  Other ways

132

Virtually Indexed Physically Tagged L1

•  Parallel lookup of TLB and cache •  Faster hit time •  Index bits must be untranslated

–  Restricts size of n-associative cache to n x (virtual page size) Why? –  E.g. 4-way SA cache with 4KB pages max. size is 16KB

•  Other solutions?

133

Input/Output

•  Performance Issues •  Device Characteristics

and Types –  Graphics Displays –  Networks –  Disks

•  I/O System Architecture –  Buses –  I/O Processors

•  High-Performance Disk Architectures

CPU

Memory I/O Bridge

Display Adapter

Disk Controller

Network Interface

LAN

I/O Bus

Processor Bus

Disk

SCSI Bus

134

Motivation

•  I/O necessary –  To/from users (display, keyboard, mouse) –  To/from non-volatile media (disk, tape) –  To/from other computers (networks)

•  Key questions –  How fast? –  Getting faster?

•  Amdahl’s Law –  Speed up CPU by 10x, but I/O takes 10% of time –  Overall speedup only = 1/(0.9/10 + 0.1) = 5.26x
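The speedup figure follows directly from Amdahl's Law; here is a one-line C check (my own phrasing of the formula):

#include <stdio.h>

int main(void) {
    double io_frac = 0.10, cpu_speedup = 10.0;
    /* Amdahl: overall = 1 / (io_frac + (1 - io_frac) / cpu_speedup) */
    printf("overall speedup = %.2fx\n",
           1.0 / (io_frac + (1.0 - io_frac) / cpu_speedup));   /* prints 5.26x */
    return 0;
}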

135

Examples

Device I or O? Partner Data Rate KB/s

Mouse I Human 0.01

Display O Human 60,000

Modem I/O Machine 2-8

LAN I/O Machine 500-6000

Tape Storage Machine 2000

Disk Storage Machine 2000-100,000

136

I/O Performance

•  What is performance? •  Supercomputers read/write 1GB of data

–  Want high bandwidth to vast data (bytes/sec)

•  Transaction processing does many independent small I/Os –  Want high I/O rates (I/Os per sec) –  May want fast response times

•  File systems –  Want fast response time first –  Lots of locality

137

Magnetic Disks

•  Stack of platters •  Two surfaces per platter •  Tracks •  Heads move together •  Sectors •  Disk access

–  Queueing + seek –  Rotation + transfer

[Figure: disk(s) – platter(s) on a spindle, read/write head(s) that move together, tracks divided into sectors]

138

Magnetic Disks

•  Seek = 10-20ms but smaller with locality •  Rotation = ½ rotation/3600rpm = 8.3ms •  Transfer = x / 2-4MB/s

–  E.g. 4kB/4MB/s = 1ms

•  Total: 19.3ms –  Higher-end drives more like 10-12ms

•  Remember: mechanical => ms

139

Disk Arrays

•  Collection of individual disks –  Each disk has own arm/head

•  Data Distribution –  Independent, fine-grained, coarse-grained

[Figure: three data distributions across four disks – independent (each disk holds a complete file: disk 0 has A0, A1, A2; disk 1 has B0, B1, B2; ...), fine-grained (each block A0, A1, ... is spread across all four disks), and coarse-grained (consecutive blocks of a file go to successive disks: A0, A1, A2, A3 across the four disks)]

140

Redundancy Mechanisms

•  Disk failures are a significant fraction of hardware failures –  Disk striping increases the number of corrupted/lost files

per failure

•  Data Replication –  disk "mirroring" –  allow multiple reads –  writes must be synchronized

•  Parity Protection –  use a parity disk

141

Redundant Arrays of Inexpensive Disks (RAIDs)

•  Arrays of small, cheap disks to provide high performance and high reliability –  D = number of “data” disks in group –  C = number of “check” disks in group

•  Level 1: Mirrored Disks (D = 1; C = 1) –  Overhead is high

•  Level 2: Memory Style ECC (e.g., D = 10; C = 4) –  Like ECC codes for DRAMs –  Read all bits across group –  Merge update bits with those not being updated; recompute

parities –  Rewrite full group, including checks

142

Redundant Arrays of Inexpensive Disks (RAIDs)

•  Level 3: Bit-Interleaved Parity –  (e.g., D = 4; C = 1) –  Key: failed disk is easily identified by controller, –  no need for special code to identify failed disk

•  Level 4: Block-Interleaved Parity –  Coarse-grain striping –  Like level 3 plus ability to do more than one small I/O at a

time. –  Write must update disk with data and parity disk

•  Level 5: Block Interleaved Distributed Parity –  Parity spread out across disks in a group:

different updates of parities go to different disks.

143

RAID4 vs RAID5

•  RAID5 –  Parity blocks are skewed across disks –  Eliminates parity disk bottleneck for writes

[Figure: RAID4 places data blocks 0-19 across four data disks and all parity blocks (P) on a dedicated fifth parity disk; RAID5 rotates the parity block of each stripe across the five disks so that no single disk holds all the parity]

144

LAN = Ethernet

•  Ethernet not technically optimal –  80x86 is not technically optimal either!

•  Nevertheless, many efforts to displace it have failed (token ring, ATM)

•  Emerging: System Area Network (SAN) –  Reduce SW stack (TCP/IP processing) –  Reduce HW stack (interface on memory bus) –  New standard: Infiniband (http://www.infinibandta.org)

145

Buses in a Computer System CPU

Memory I/O Bridge

Display Adapter

Disk Controller

Network Interface

LAN

I/O Bus

Processor Bus

Disk

Storage Bus

•  Hierarchical paths

•  I/O Processing –  Program controlled –  DMA –  Dedicated I/O processors

146

Bus Types

•  CPU-Memory Buses –  “Front-Side” bus –  Very high performance

•  64-128 bits wide •  100-200 MHz • May use both clock edges (DDR) •  800 MB/sec – 64 GB/sec

–  Usually custom designs •  “Back-Side” bus to Cache

–  Even higher performance •  I/O Buses

–  Compatibility is important –  Usually standard designs (PCI, SCSI)

• PCI – 32 bits, 33 MHz, 130 MB/sec
