
Computer Science Department University of Central Florida

Chapter 3: Memory and I/O Systems Modern Processor Design: Fundamentals of Superscalar Processors

3

Memory Hierarchy

•  “Anyone can build a fast CPU. The trick is to build a fast system.” – Seymour Cray

•  Memory –  Just an “ocean of bits” –  Many technologies are available

•  Key issues –  Technology (how bits are stored) –  Placement (where bits are stored) –  Identification (finding the right bits) –  Replacement (finding space for new bits) –  Write policy (propagating changes to bits)

•  Must answer these regardless of memory type

4

Types of Memory

Type Size Speed Cost/bit

Register < 1KB < 1ns $$$$

On-chip SRAM 8KB-6MB < 10ns $$$

Off-chip SRAM 1Mb – 16Mb < 20ns $$

DRAM 64MB – 1TB < 100ns $

Disk 40GB – 1PB < 20ms ~0

5

Memory Hierarchy

Registers

On-Chip SRAM

Off-Chip SRAM

DRAM

Disk

(CAPACITY increases going down the hierarchy; SPEED and COST per bit increase going up)

6

Why Memory Hierarchy?

•  Need lots of bandwidth

–  Implication: ultra fast access and/or a high number of bits per access

–  And that’s for single-issue –  And for ONE processor!

•  Need lots of storage –  64MB (minimum) to multiple TB

•  Must be cheap per bit –  (TB x anything) is a lot of money!

•  These requirements seem incompatible

BW = 1.0 inst/cycle × [1 Ifetch/inst × 4 B/Ifetch + 0.4 Dref/inst × 4 B/Dref] × 2 Gcycles/s = 11.2 GB/s

7

Why Memory Hierarchy?

•  Fast and small memories –  Enable quick access (fast cycle time) –  Enable lots of bandwidth (1+ L/S/I-fetch/cycle)

•  Slower larger memories –  Capture larger share of memory –  Still relatively fast

•  Slow huge memories –  Hold rarely-needed state –  Needed for correctness

•  All together: provide appearance of large, fast memory with cost of cheap, slow memory –  This illusion CAN be broken! –  “Virtual memory leads to virtual performance” – Seymour Cray (He also believed this about memory hierarchies)

Announcements

•  Homework 2 due Tu 2/7 in class

•  Reading, S&L Chapter 3 up to 3.4.4 –  Chapter 5 of H&P (4th ed) as further reference –  Chapter 2 of H&P (5th ed) as further reference

•  Project 0 posted, due 2/9, cache inclusion today (?) –  My home directory exists once again…

•  Today –  Memory Hierarchy and Caches

8

9

Why Does a Hierarchy Work?

•  Locality of reference –  Temporal locality

• Reference same memory location repeatedly –  Spatial locality

• Reference near neighbors around the same time

•  Empirically observed –  Significant! –  Even small local storage (8KB) often satisfies >90% of references to multi-MB data set

10

Memory Hierarchy

CPU

I & D L1 Cache

Shared L2 Cache

Main Memory

Disk

Temporal Locality • Keep recently referenced items at higher levels

• Future references satisfied quickly

Spatial Locality • Bring neighbors of recently referenced to higher levels

• Future references satisfied quickly

11

Four Burning Questions

•  These are: –  Placement

• Where can a block of memory go? –  Identification

• How do I find a block of memory? –  Replacement

• How do I make space for new blocks? –  Write Policy

• How do I propagate changes? •  Consider these for caches

–  Usually SRAM

•  Will consider main memory, disks later

12

Placement

Memory Type      Placement        Comments

Registers        Anywhere (Int, FP, SPR)        Compiler/programmer manages

Cache (SRAM)     Fixed in H/W        Direct-mapped, set-associative, fully-associative

DRAM             Anywhere        O/S manages

Disk             Anywhere        O/S manages

13

How are caches organized

•  For simplicity, data is stored in fixed-length parcels called “blocks” or “lines” (load alignment is typically required in modern architectures)

•  Data is “tagged” with its address in memory (so that it can be found and identified)

•  As a black box:

CPU asks for data @ location X

Cache searches for data @ X

If not there, cache asks main memory for data (called a cache miss)

CPU gets data (it may never talk to main memory)

14

Block size

•  Typical values for L1 cache: 16 to 64 bytes •  Exploit spatial locality:

–  Bring in larger blocks –  Larger blocks increase the time it takes to service a miss (the “miss repair time”) –  Too large and blocks “hog” storage (“cache pollution”)

block address block offset

63 b b-1 0

# block offset bits = log2(Block Size)

15

Placement

•  Address Range –  Exceeds cache capacity

•  Map address to finite capacity –  Called a hash –  Usually just masks high-order bits

•  Direct-mapped –  Block can only exist in one location –  Hash collisions cause problems

SRAM Cache

Hash

Address

Index

Data Out

Index Tag Offset

32-bit Address

Offset

Block Size

Announcements

•  Homework 2 due Wed 9/19 in class

•  Reading, S&L Chapter 3 up to 3.4.4 –  Chapter 5 of H&P as further reference (4th edition) –  Chapter 2 of H&P as further reference (5th edition)

•  Project 0 posted, due M 9/24 –  Last parameter in colon separated list is lowercase L not a 1 –  Small benchmarks?

• bzip2 input.random 2 • gcc with bank.i (cat test_single_job) Ex:

–  gcc cccp.i -o cccp.s

• Others? perl, vortex… –  Small caches trick…

•  Today –  Memory Hierarchy and Caches

16

17

A direct mapped cache 31 3 2 0

Tag block offset index

6 5

hit/miss

row dec

=?

tag (26 bits) blocks (8 bytes)

word select (mux)

MAR

MDR

set (holds 1 block)

(26) (3) (3)

18

Placement

•  Set-associative –  Block can be in n locations (n-way SA) –  Hash collisions: still OK

•  Identification –  Still perform tag check –  However, only n in parallel

SRAM Cache

Hash

Address

Data Out

Offset

Index

Offset

32-bit Address

Tag Index

n Tags n Data Blocks Index

?= ?=

?= ?=

Tag

Not all metrics can be trusted…

•  Today Punxsutawney Phil predicted 6 more weeks of winter

•  Of the 113 predictions on record he has only predicted an early spring 14 times

•  Accuracy? –  Punxsutawney Groundhog Club Inner Circle says 100% –  StormFax Weather Almanac says 39%.

19

20

A (2-way) set associative cache 31 3 2 0

tag block offset index

5 4

tag (27bits)

blocks (8 bytes)

word select (mux)

MAR

MDR

block select (mux)

Notes: this cache is the same size as the (previous) direct-mapped cache, but the index field is 1 bit shorter (2 bits total). A direct-mapped cache is really a 1-way set-associative cache.

row dec

=? =?

set (holds 2 blocks)

(27) (2) (3)

21

Placement

•  Fully-associative –  Block can exist anywhere –  No more hash collisions

•  Identification –  How do I know I have the right block? –  Called a tag check • Must store address tags • Compare against address

•  Expensive! –  Tag & comparator per block

SRAM Cache

Hash

Address

Data Out

Offset

32-bit Address

Offset

Tag

Hit Tag Check

?=

Tag

22

A fully associative cache (also called a content addressable memory or CAM)

31 3 2 0

tag block offset

tag (29bits)

word select (mux)

MAR

MDR

=? tag   (one tag comparator per block, repeated for every block in the cache)

Notes: Same size as previous caches, but no row decoder and no index at all! Comparators (one per block) take the place of the row decoder

23

Placement and Identification

Offset

32-bit Address

Tag Index

Portion Length Purpose

Offset o=log2(block size) Select word within block

Index i=log2(number of sets) Select set of blocks

Tag t=#address bits - o - i ID block within set

24

Decoding the address •  Cache is normally specified as follows:

–  {SIZE, ASSOC, BLOCKSIZE} •  SIZE: total bytes of data storage •  ASSOC: associativity (# of blocks in a set/cache line) •  BLOCKSIZE: size of cache block in bytes

–  Question: how do we decode the address?

–  REMEMBER the following equation

–  Then compute size of each address field

tag index block offset

? ? ?

# sets = SIZE / (ASSOC × BLOCKSIZE)

# block offset bits = log2(BLOCKSIZE)     # index bits = log2(# sets)     # tag bits = (# address bits) - log2(# sets) - log2(BLOCKSIZE)
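As a sanity check, here is a short C sketch (the helper names are mine, not from the slides) that computes the number of sets and the three field widths from a {SIZE, ASSOC, BLOCKSIZE} specification; it reproduces both worked examples that follow.

#include <stdio.h>
#include <stdint.h>

/* log2 of a power-of-two value */
static unsigned log2u(uint64_t x) {
    unsigned n = 0;
    while (x > 1) { x >>= 1; n++; }
    return n;
}

/* Offset/index/tag widths for a {SIZE, ASSOC, BLOCKSIZE} cache with addr_bits-bit addresses. */
static void field_widths(uint64_t size, unsigned assoc, unsigned blocksize, unsigned addr_bits) {
    uint64_t sets   = size / (assoc * (uint64_t)blocksize);
    unsigned offset = log2u(blocksize);
    unsigned index  = log2u(sets);
    unsigned tag    = addr_bits - index - offset;
    printf("sets=%llu offset=%u index=%u tag=%u\n",
           (unsigned long long)sets, offset, index, tag);
}

int main(void) {
    field_widths(256, 1, 32, 32);   /* 256B direct-mapped, 32B blocks: 8 sets, 5/3/24 bits */
    field_widths(128, 2, 16, 32);   /* 128B 2-way, 16B blocks: 4 sets, 4/2/26 bits */
    return 0;
}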

25

Example

•  Processor accesses a 256 Byte direct-mapped cache, which has block size of 32 Bytes, with following sequence of addresses. 0xFF0040E0, 0xBEEF005C, 0xFF0040E2, 0xFF0040E8, 0x00101078, 0x002183E0, 0x00101064, 0x0012255C,

0x00122544, … Show contents of cache after each access, count # of hits, count # of replacements.

26

Analysis

•  A processor accesses a 256 Byte direct-mapped cache, which has block size of 32 Bytes, with following sequence of addresses. Show contents of the cache after each access, count # of hits, count # of replacements.

•  C = 8, S = 1, B = 5. => How many sets in the cache? 8 # index bits = log2(# sets) = log2(8) = 3 # block offset bits = log2(block size) = log2(32 bytes) = 5 # tag bits = total # address bits - # index bits - # block offset bits = 32 bits – 3 bits – 5 bits = 24

31 5 4 0

tag block offset index

8 7

(24) (3) (5)
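To cross-check the field split above, here is a small C snippet (my own, not from the slides) that decodes the first trace address for this 256 Byte direct-mapped cache with 32 Byte blocks; the masks follow directly from the 5 offset bits and 3 index bits.

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t addr   = 0xFF0040E0u;        /* first address in the trace            */
    uint32_t offset = addr & 0x1Fu;       /* low 5 bits: byte within the 32B block */
    uint32_t index  = (addr >> 5) & 0x7u; /* next 3 bits: one of 8 sets            */
    uint32_t tag    = addr >> 8;          /* remaining 24 bits                     */
    printf("tag=0x%06X index=%u offset=%u\n",
           (unsigned)tag, (unsigned)index, (unsigned)offset);
    /* prints tag=0xFF0040 index=7 offset=0, matching the address-stream table below */
    return 0;
}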

27

Address Stream

Address (hex)   Tag (hex)   Index & Offset bits (binary)   Index (decimal)   Comment

0xFF0040E0 0xFF0040 1110 0000 7 Miss

0xBEEF005C 0xBEEF00 0101 1100 2 Miss

0xFF0040E2 0xFF0040 1110 0010 7 Hit

0xFF0040E8 0xFF0040 1110 1000 7 Hit

0x00101078 0x001010 0111 1000 3 Miss

0x002183E0 0x002183 1110 0000 7 Miss/Repl

0x00101064 0x001010 0110 0100 3 Hit

0x0012255C 0x001225 0101 1100 2 Miss/Repl

0x00122544 0x001225 0100 0100 2 Hit

28

Replacement

•  Cache has finite size –  What do we do when it is full?

•  Analogy: desktop full? –  Move books to bookshelf to make room

•  Same idea: –  Move blocks to next level of cache

Announcements

•  Homework 2 due Wed 9/19 in class

•  Reading, S&L Chapter 3 up to 3.4.4 –  Chapter 5 of H&P as further reference (4th edition) –  Chapter 2 of H&P as further reference (5th edition)

•  Project 0 posted, due M 9/24 –  Small benchmarks?

• bzip2 input.random 2 • gcc with bank.i (cat test_single_job) Ex:

–  gcc cccp.i -o cccp.s

• Others? perl, vortex… –  Small caches trick…

•  Today –  Memory Hierarchy and Caches

29

30

Replacement

•  How do we choose victim? –  Verbs: Victimize, evict, replace, cast out

•  Several policies are possible –  FIFO (first-in-first-out) –  LRU (least recently used) –  NMRU (not most recently used) –  LFU (least frequently used) –  Pseudo-random (yes, really!)

•  Pick victim within set where a = associativity –  If a <= 2, LRU is cheap and easy (1 bit) –  If a > 2, it gets harder –  Pseudo-random works pretty well for caches

31

LRU example •  Blocks A, B, C, D, and E all map to the same set •  Trace: A B C D D D B C A E •  Assign sequence numbers (increment by 1 for each access) •  A(0) B(1) C(2) D(3) D(4) D(5) B(6) C(7) A(8) E(9) •  Alternatively, keep repl list, move accessed elem. to tail

A(0)

A(0) B(1)

A(0) C(2) B(1)

A(0) C(2) B(1) D(3)

A(0) C(2) B(1) D(4)

A(0) C(2) B(1) D(5)

A(0) C(2) B(6) D(5)

A(0) C(7) B(6) D(5)

A(8) C(7) B(6) D(5)

A(8) C(7) B(6) E(9)
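A minimal C sketch of the sequence-number bookkeeping used above (illustrative only, not from the slides): each block in the set records the time of its last access, and on a miss the block with the smallest timestamp is the victim.

#include <stdio.h>

#define WAYS 4

typedef struct { char blk; int last_used; int valid; } Way;

/* Access block 'b' in a set at time 't'; fill an empty way or evict the LRU way on a miss. */
static void access_block(Way set[WAYS], char b, int t) {
    int victim = 0;
    for (int i = 0; i < WAYS; i++) {
        if (set[i].valid && set[i].blk == b) {   /* hit: refresh timestamp */
            set[i].last_used = t;
            return;
        }
        if (!set[i].valid) victim = i;           /* prefer an empty way */
    }
    for (int i = 0; i < WAYS; i++)               /* otherwise pick the oldest way */
        if (set[victim].valid && set[i].valid &&
            set[i].last_used < set[victim].last_used)
            victim = i;
    set[victim] = (Way){ b, t, 1 };              /* fill (or replace) */
}

int main(void) {
    Way set[WAYS] = { { 0, 0, 0 } };
    const char *trace = "ABCDDDBCAE";            /* trace from the slide */
    for (int t = 0; trace[t]; t++) access_block(set, trace[t], t);
    for (int i = 0; i < WAYS; i++)               /* prints E(9) C(7) B(6) A(8): D was the LRU victim */
        printf("%c(%d) ", set[i].blk, set[i].last_used);
    printf("\n");
    return 0;
}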

32

Revised Example

•  A processor accesses a 128 byte 2-way set-associative cache, which has block size of 16 bytes and LRU replacement policy, with following sequence of addresses. 0xFF0040E0, 0xBEEF005C, 0xFF0040E2, 0xFF0040E8, 0x00101078, 0x002183E0, 0x00101064, 0x0012255C,

0x00122544, …

Show contents of the cache after each access, count # of hits, count # of replacements.

33

Analysis

# sets = SIZE / (BLOCKSIZE × ASSOC) = 128 / (16 × 2) = 4

# index bits = log2(# sets) = log2(4) = 2 # block offset bits = log2(block size) = log2(16 bytes) = 4 # tag bits = total # address bits - # index bits - # block

offset bits = 32 bits – 2 bits – 4 bits = 26

31 4 3 0

tag block offset index

6 5

(26) (2) (4)

34

Address stream

Address (hex)   Tag (hex)   Index & Offset bits (binary)   Index (decimal)   Comment

0xFF0040E0 0x3FC0103 10 0000 2 Miss

0xBEEF005C 0x2FBBC01 01 1100 1 Miss

0xFF0040E2 0x3FC0103 10 0010 2 Hit

0xFF0040E8 0x3FC0103 10 1000 2 Hit

0x00101078 0x0004041 11 1000 3 Miss

0x002183E0 0x000860F 10 0000 2 Miss

0x00101064 0x0004041 10 0100 2 Miss/Repl

0x0012255C 0x0004895 01 1100 1 Miss

0x00122544 0x0004895 00 0100 0 Miss

35

Write Policy

•  Memory hierarchy –  2 or more copies of same block

• Main memory and/or disk • Caches

•  What to do on a write? –  Eventually, all copies must be changed –  Write must propagate to all levels

•  Ignore MP for now – writes cause an even bigger problem there!

36

Write Policy

•  Easiest policy: write-through •  Every write propagates directly through hierarchy

–  Write in L1, L2, memory, disk (?!?) •  Why is this a bad idea?

–  Very high bandwidth requirement –  Remember, large memories are slow –  Typically small queue called a write buffer to help ease

bandwidth mismatch

•  Popular in real systems only to the L2 –  Every write updates L1 and L2 –  Beyond L2, use write-back policy

37

Write Policy

•  Most widely used: write-back •  Maintain state of each line in a cache

–  Invalid – not present in the cache –  Clean – present, but not written (unmodified) –  Dirty – present and written (modified) –  For multiprocessor systems we will add more state bits…

•  Store state in tag array, next to address tag –  Mark dirty bit on a write

•  On eviction, check dirty bit –  If set, write back dirty line to next level –  Called a writeback or castout

•  Victim Cache?
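To make the write-back state machine concrete, here is a small hedged C sketch (the function and interface names are invented for illustration) of the eviction path: the dirty bit, kept next to the tag, decides whether a castout to the next level is needed before the refill.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

enum line_state { INVALID, CLEAN, DIRTY };      /* per-line state kept in the tag array */

typedef struct {
    enum line_state state;
    uint32_t tag;
    uint8_t  data[64];
} cache_line;

/* Stand-ins for the next level of the hierarchy (just logging here). */
static void write_next_level(uint32_t tag, const uint8_t *d, size_t n) {
    (void)d; printf("castout: write back dirty block tag=0x%X (%zu bytes)\n", tag, n);
}
static void read_next_level(uint32_t tag, uint8_t *d, size_t n) {
    memset(d, 0, n); printf("fill: fetch block tag=0x%X\n", tag);
}

/* Evict 'victim' and refill it with the block identified by 'new_tag'. */
static void evict_and_fill(cache_line *victim, uint32_t new_tag) {
    if (victim->state == DIRTY)                 /* dirty bit set => writeback (castout) */
        write_next_level(victim->tag, victim->data, sizeof victim->data);
    read_next_level(new_tag, victim->data, sizeof victim->data);
    victim->tag   = new_tag;
    victim->state = CLEAN;                      /* clean until the next store */
}

int main(void) {
    cache_line line = { CLEAN, 0xABC, {0} };
    line.data[0] = 42; line.state = DIRTY;      /* a store marks the line dirty, no write propagates */
    evict_and_fill(&line, 0xDEF);               /* eviction triggers the castout, then the fill */
    return 0;
}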

38

Write Policy

•  Complications of write-back policy –  Stale copies lower in the hierarchy –  Must always check higher level for dirty copies before accessing copy in a lower level

•  Not a big problem in uniprocessors –  In multiprocessors: the cache coherence problem

•  I/O devices that use DMA (direct memory access) can cause coherence problems even in uniprocessors –  Called coherent I/O –  Must check caches for dirty copies when reading main memory –  Update/invalidate copies in the cache when writing main memory

Another Write Policy

•  On a write miss, where does the data go? •  Two alternatives:

–  write allocate •  allocate a cache line for the new data, and replace old line •  just like a read miss

–  write no-allocate • do not allocate space in the cache for the data • only really makes sense in systems with write buffers

•  How to handle read miss after write miss? –  Write allocate, data will be present in cache –  Write no-allocate may have to snoop write buffer

Announcements

•  Homework 2 due Tu 2/7 in class

•  Reading, S&L Chapter 3 up to 3.4.4 –  Chapter 5 of H&P as further reference (4th edition) –  Chapter 2 of H&P as further reference (5th edition)

•  Project 0 posted, due 2/9 –  Small benchmarks?

• bzip2 input.random 2 • gcc with bank.i (cat test_single_job) Ex:

–  gcc cccp.i -o cccp.s

• Others? perl, vortex… –  Small caches trick… format is dl1:size:width:assoc:l

•  Today –  Memory Hierarchy and Caches

41

Announcements

•  Homework 2 due W 9/19 in class •  Problem 3.3: Assume LRU •  Penultimate problem, virtual memory and TLBs (Monday)

•  Reading, S&L Chapter 3 up to 3.4.4 plus 3.5 and 3.6 –  Chapter 2 (old:5) of H&P as further reference

•  Project 0 posted, due M 9/24 –  Small benchmarks?

•  bzip2 input.random 2 •  gcc with bank.i •  Others? perl, vortex…

–  Small caches trick…CACHE_MK_BADDR

•  Today –  Memory Hierarchy and Caches

42

43

Inclusion and Exclusion

•  In multi-level caches, to maintain cache coherence with reasonable complexity: –  Inclusion

• Any L1 data will appear in L2 • Best when L2 >> L1, MP systems, allows larger L2 BS

–  Exclusion • No overlap between L1 and L2 data, swap lines on miss! • AMD Athlon • Requires L1 block size = L2 block size • Adv. diminished if L2 >> L1 size

–  How to enforce either property? •  In the project, implement inclusion

44

Caches and Performance

•  Caches –  Enable design for common case: cache hit

• Cycle time, pipeline organization • Recovery policy

–  Uncommon case: cache miss • Fetch from next level

–  Apply recursively if multiple levels

• What to do in the meantime?

•  What is performance impact? •  Various optimizations are possible

45

Performance Impact

•  Cache hit latency –  Included in “pipeline” portion of CPI

• E.g. IBM study: 1.15 CPI with 100% cache hits –  Typically 1-3 cycles for L1 cache

•  Intel/HP McKinley: 1 cycle –  Heroic array design –  No address generation: load r1, (r2)

•  IBM Power4: 3 cycles –  Address generation –  Array access –  Word select and align

46

Cache Hits and Performance

•  Cache hit latency determined by: –  Cache organization

•  Associativity –  Parallel tag checks expensive, slow –  Way select slow (fan-in, wires)

•  Block size –  Word select may be slow (fan-in, wires)

•  Number of blocks (sets x associativity) –  Wire delay across array –  “Manhattan distance” = width + height –  Word line delay: width –  Bit line delay: height

•  Array design is an art form –  Detailed analog circuit/wire delay modeling

Word Line

Bit Line

47

Cache Misses and Performance

•  Miss penalty –  Detect miss: 1 or more cycles –  Find victim (replace line): 1 or more cycles

• Write back if dirty –  Request line from next level: several cycles –  Transfer line from next level: several cycles

•  (block size) / (bus width) –  Fill line into data array, update tag array: 1+ cycles –  Resume execution (early restart? Critical word first?)

•  In practice: 6 cycles to 100s of cycles

48

Cache Miss Rate

•  Determined by: –  Program characteristics

• Temporal locality • Spatial locality

–  Cache organization • Block size, associativity, number of sets

49

Cache Miss Rates: 3 (4?) C’s [Hill]

•  Compulsory (cold) miss –  First-ever reference to a given block of memory

•  Capacity miss –  Working set exceeds cache capacity –  Useful blocks (with future references) displaced

•  Conflict miss –  Placement restrictions (not fully-associative) cause useful

blocks to be displaced –  Think of as capacity within set

•  Coherence misses –  Misses incurred because another entity “stole” the line away

50

Cache Miss Rate Effects

•  Number of blocks (sets x associativity) –  Bigger is better: fewer conflicts, greater capacity

•  Associativity –  Higher associativity reduces conflicts –  Very little benefit beyond 8-way set-associative

•  Block size –  Larger blocks exploit spatial locality –  Usually: miss rates improve until 64B-256B –  512B or more miss rates get worse

• Larger blocks less efficient: more capacity misses • Fewer placement choices: more conflict misses

51

Cache Miss Rate

•  Subtle tradeoffs between cache organization parameters –  Large blocks reduce compulsory misses but increase miss penalty • #compulsory = (working set) / (block size) • #transfers = (block size) / (bus width)

–  Large blocks increase conflict misses • #blocks = (cache size) / (block size)

–  Associativity reduces conflict misses –  Associativity increases access time

•  Can an associative cache ever have a higher miss rate than a direct-mapped cache of the same capacity?

52

Cache Miss Rates: 3 C’s

[Chart: misses per instruction (%) for 8K 1-way, 8K 4-way, 16K 1-way, and 16K 4-way caches, broken into conflict, capacity, and compulsory components]

•  Vary size and associativity –  Compulsory misses are constant –  Capacity and conflict misses are reduced

53

Cache Miss Rates: 3 C’s

[Chart: misses per instruction (%) for 8K 32B, 8K 64B, 16K 32B, and 16K 64B caches, broken into conflict, capacity, and compulsory components]

•  Vary size and block size –  Compulsory misses drop with increased block size –  Capacity and conflict can increase with larger blocks

54

Cache Misses and Performance

•  How does this affect performance? •  Performance = Time / Program

•  Cache organization affects cycle time –  Hit latency

•  Cache misses affect CPI

Time / Program = (Instructions / Program) × (Cycles / Instruction) × (Time / Cycle)
               = (code size) × (CPI) × (cycle time)

55

Cache Misses and CPI

•  Cycles spent handling misses are strictly additive •  Miss_penalty is recursively defined at next level of cache

hierarchy as weighted sum of hit latency and miss latency

CPI = cycles / inst
    = (cycles_hit + cycles_miss) / inst
    = cycles_hit / inst + (cycles / miss) × (misses / inst)
    = cycles_hit / inst + Miss_penalty × Miss_rate_per_inst
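A minimal numeric sketch of this relation (the helper name is mine, not from the slides): miss cycles simply add on top of the hit-only CPI.

#include <stdio.h>

/* CPI = cycles_hit/inst + miss_penalty (cycles/miss) * miss_rate (misses/inst) */
static double cpi_with_misses(double cpi_hit, double miss_penalty, double misses_per_inst) {
    return cpi_hit + miss_penalty * misses_per_inst;
}

int main(void) {
    /* e.g. 1.15 base CPI, 8-cycle penalty, 0.06 misses/instruction -> 1.63 */
    printf("CPI = %.3f\n", cpi_with_misses(1.15, 8.0, 0.06));
    return 0;
}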

56

Cache Misses and CPI

•  Pl is miss penalty at each of n levels of cache (but hit at the next level, i.e, hit latency of next level of cache)

•  MPIl is miss rate per instruction at each of n levels of cache •  Miss rate specification:

–  Per instruction: easy to incorporate in CPI –  Per reference: must convert to per instruction

CPI = cycles_hit / inst + Σ (l = 1 to n) P_l × MPI_l

57

Cache Misses and CPI

•  For multi-level caches the MP of the L1 cache is just the AMAT of the L2 cache

•  Can simply chain equations together

CPI = CPI_IDEAL + (refs / inst) × MR × MP

AMAT = t_hit + MR × MP

58

Cache Performance Example

•  Assume following: –  CPI with perfect caches = 1.15, 40% of instructions are ld/st –  L1 instruction cache with 98% hit rate –  L1 data cache with 96% hit rate –  Shared L2 cache with 40% local miss rate –  L1 miss penalty/L2 hit latency of 8 cycles –  L2 miss penalty of:

•  10 cycles latency to request word from memory •  2 cycles per 16B bus transfer, 4x16B = 64B block

transferred • Hence 8 cycles transfer plus 1 cycle to fill L2 • Total penalty 10+8+1 = 19 cycles

59

Cache Performance Example

CPI = cycles_hit / inst + Σ (l = 1 to n) P_l × MPI_l

CPI = 1.15 + 8 cycles/miss × (0.02 + 0.04) miss/inst + 19 cycles/miss × 0.40 miss/ref × 0.06 ref/inst
    = 1.15 + 0.48 + 19 × 0.024 miss/inst
    = 1.15 + 0.48 + 0.456
    = 2.086

60

Cache Performance Example

CPI = CPI_IDEAL + (refs / inst) × MR × MP      AMAT = t_hit + MR × MP

CPI = 1.15 + [1.0 refs/inst × 0.02 + (0.4 refs/inst × 0.04)] × (8 + 0.4 × 19)
    = 1.15 + 0.036 × 15.6
    = 1.15 + 0.5616 = 1.71 cycles

61

Cache Performance Example

CPI = CPI_IDEAL + (refs / inst) × MR × MP      AMAT = t_hit + MR × MP

With a 300-cycle L2 miss penalty instead of 19:

CPI = 1.15 + [1.0 refs/inst × 0.02 + (0.4 refs/inst × 0.04)] × (8 + 0.4 × 300)
    = 1.15 + 0.036 × 128
    = 1.15 + 4.61 = 5.76 cycles
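Both results can be reproduced with a short C sketch of the chained formulas (variable names are illustrative): the L1 miss penalty is just the AMAT of the L2, so going from a 19-cycle to a 300-cycle L2 miss penalty is a one-argument change.

#include <stdio.h>

/* AMAT = t_hit + miss_rate * miss_penalty */
static double amat(double t_hit, double mr, double mp) { return t_hit + mr * mp; }

int main(void) {
    double cpi_ideal = 1.15;
    double ifetch_mr = 0.02, dref_mr = 0.04;   /* per-reference miss rates     */
    double refs_i    = 1.0,  refs_d  = 0.4;    /* references per instruction   */
    double l2_hit    = 8.0,  l2_local_mr = 0.4;

    double l1_misses_per_inst = refs_i * ifetch_mr + refs_d * dref_mr;   /* 0.036 */

    /* L1 miss penalty = AMAT of L2, first with a 19-cycle, then a 300-cycle L2 miss penalty */
    double mp19  = amat(l2_hit, l2_local_mr, 19.0);    /* 15.6 cycles */
    double mp300 = amat(l2_hit, l2_local_mr, 300.0);   /* 128 cycles  */

    printf("CPI = %.2f\n", cpi_ideal + l1_misses_per_inst * mp19);    /* ~1.71 */
    printf("CPI = %.2f\n", cpi_ideal + l1_misses_per_inst * mp300);   /* ~5.76 */
    return 0;
}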

62

Cache Misses and Performance

•  CPI equation –  Only holds for misses that cannot be overlapped with other activity –  Store misses often overlapped

• Place store in store queue • Wait for miss to complete • Perform store • Allow subsequent instructions to continue in parallel

–  Modern out-of-order processors also do this for loads

• Cache performance modeling requires detailed modeling of entire processor core

• Cache performance has a profound impact on overall performance

Improving Cache Performance

•  Cache Performance is determined by

•  Use better technology: –  use faster RAMs –  cost and availability are limitations –  more on RAMs and memory systems later

•  Decrease Miss Rate •  Decrease Miss Penalty •  Decrease Hit Time

AMAT = t_h + r_m × t_p   (hit time + miss rate × miss penalty)

Improving Cache Performance

•  Use better technology •  Decrease Miss Rate •  Decrease Miss Penalty •  Decrease Hit Time

Decreasing Miss Rate

•  Reduce the number of cache misses •  Open-ended goal, many ways to attack the problem •  Start with the 3 C’s

–  think of ways to reduce cold, capacity, and conflict misses

66

Improving Locality (Instruction Stream)

•  Instruction text placement –  Profile program, place unreferenced or rarely referenced

paths “elsewhere” • Maximize temporal locality •  “cording”

–  Eliminate taken branches • Fall-through path has spatial locality

–  Trace caches

Increase Block Size

•  Larger block size better exploits spatial locality but –  larger block size means larger miss penalty

•  takes longer time to transfer the block –  if block size is too big

•  average access time goes up (may need heavier banking) and •  overall miss rate can increase because temporal locality is reduced

when the replaced data would have been reused (too few lines in $) •  an extreme example for the previous cache: a single 64KB block

[Plots vs. block size: miss penalty increases with block size; miss rate first drops (exploits spatial locality) then rises (fewer blocks compromises temporal locality); average access time therefore has a minimum, with both miss penalty and miss rate increased at large block sizes]

Higher Associativity

•  Reduce the number of conflict misses –  more places to put data

•  Two general rules of thumb (empirical): –  an 8-way set-associative $ performs near fully associative –  direct mapped cache of size N has same MR as a 2-way set

associative cache of size N/2 (2:1 cache rule of thumb)

•  Tradeoff is increased hit time –  10% for TTL/ECL, 2% for custom CMOS (Hill[1988])

•  Commonly see high associativity in 2nd-level caches –  less common in 1st-level caches

Victim Cache

•  Place small fully-associative cache between real cache and its refill path –  contains only blocks replaced on recent misses (victims)

•  On a miss: –  check victim cache –  if present, swap victim and cache entry –  else fetch as usual, put new victim in victim cache

•  Shown effective for small direct-mapped caches –  trade-off is area and additional control complexity

•  More modern technique to reduce conflict misses –  does not trade-off hit time –  miss penalty?

Pseudo Set-Associative Caches

•  Combine advantages of direct-mapped and set-associative caches

•  Perform cache access as in a direct-mapped cache –  if hit, done –  if miss, check another cache entry! –  one scheme: invert MSB of index and check there

•  Effectively a second set –  but, no parallel comparators or muxes (less HW) –  one set is fast access, one is slow access (extra cycle) –  want fast hits and not the slow hits

•  Need some form of way prediction –  can result in lower AMAT than both DM and SA caches –  variable hit times complicate pipelines – use in lower levels

Hardware Prefetching

•  A technique to improve cold and capacity misses •  Have hardware fetch extra lines on a miss

–  Can store in cache, or a separate stream buffer •  Alpha 21064 fetches 2 blocks on an instruction miss

–  A single entry buffer captures 15%-25% of misses •  Can do the same for data cache, even multiple buffers

–  Less effective if program accesses are non-unit stride •  Disadvantages

–  Hardware always prefetches –  Scheme relies on excess available memory bandwidth

• Trades off bandwidth for latency –  Can hurt performance if it interferes with demand misses

Software Prefetching

•  Avoid disadvantages of hardware prefetching with selectivity –  compiler-directed prefetching –  compiler can analyze code and know where misses occur

•  Insert a special prefetch instruction into the code stream –  most useful when it is a non-binding prefetch

•  turns into a nop on an exception –  don’t prefetch everything, too much instruction overhead! –  ideally just prefetch the misses, and ideally early enough! –  sophisticated compiler analysis in general case

•  Requires the existence of lockup-free (non-blocking) caches •  Subsequent load to same cache line will

–  hit in cache if prefetch is back from memory system –  miss, but not issue, if prefetch still outstanding
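As a concrete illustration (not from the slides), GCC and Clang expose a non-binding prefetch through the __builtin_prefetch intrinsic; a sketch of prefetching a fixed distance ahead in a streaming loop might look like the following, where the prefetch distance is a tuning knob I have picked arbitrarily.

#include <stddef.h>

#define PREFETCH_AHEAD 16   /* tuning knob: how many iterations ahead to prefetch */

/* Sum an array while hinting future elements into the cache. The builtin is
   non-binding: it does not fault on a bad address, but we keep a bounds check
   anyway for clarity. */
double sum_with_prefetch(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], /*rw=*/0, /*locality=*/1);
        s += a[i];
    }
    return s;
}

int main(void) {
    static double a[1024];
    for (size_t i = 0; i < 1024; i++) a[i] = 1.0;
    return sum_with_prefetch(a, 1024) == 1024.0 ? 0 : 1;
}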

Compiler Optimizations

•  Reduce miss rates without changing the hardware! •  Code is easily re-ordered

–  cording rearranges procedures to reduce conflict misses –  use profiled information

•  Data is more interesting (and harder) –  still, one can re-arrange data accesses to improve locality

•  Examples: (a little later) –  array merging –  loop interchange –  loop fusion –  blocking

Improving Cache Performance

•  Use better technology •  Decrease Miss Rate •  Decrease Miss Penalty •  Decrease Hit Time

Reducing Miss Penalty

•  Some cache misses are inevitable –  when they do happen, want to service as quickly as possible –  one key component is memory speed and organization

•  Other miss penalty reduction techniques: –  giving read misses priority over writes –  subblock placement –  early restart and critical word first –  lockup-free caches –  multilevel caches

Read Priority

•  Processor need not wait for (isolated) writes –  but what if we want to read – RAW through memory –  write-through caches: must keep an eye on outstanding

writes • write buffer •  serialization in memory

•  Reads do stall CPU – give priority to reads –  but serialize/forward if overlap with earlier write –  modern processors do this extensively –  must track all previous stores to avoid correctness violations

Fill Before Spill

•  In writeback caches •  If line is Dirty on a read/write miss, need to write it back •  This increases miss penalty for the demand miss •  Solution: spill buffer

–  fetch demand miss from memory –  spill dirty line into on-chip spill buffer –  write spill buffer to memory in background after demand miss

•  Subsequent misses wait for spill buffer to empty –  or even snoop

•  What about a system request from another processor or from the I/O system?

Sub-blocking

•  Want few tags (i.e., large blocks) but manageable miss penalty

•  Create multiple valid bits per cache line –  Valid bit per sub-block –  if WB cache, dirty bit per sub-block –  in a miss need only to fetch that sub-block –  must evict all dirty sub-blocks on replacement –  decreases tag storage without increasing miss penalty –  can be useful in multiprocessor systems

•  With write-through caches write misses need not fetch anything if sub-block size is one word

Early Restart

•  Decrease miss penalty with no new hardware –  well, okay, with some more complicated control

•  Strategy: impatience! •  There is no need to wait for entire line to be fetched •  Early Restart – as soon as the requested word (or double

word) of the cache block arrives, let the CPU continue execution

•  If CPU references another cache line or a later word in the same line: stall

•  Early restart is often combined with the next technique…

Critical Word First

•  Improvement over early restart –  request missed word first from memory system –  send it to the CPU as soon as it arrives –  CPU consumes word while rest of line arrives

•  Even more complicated control logic –  memory system must also be changed –  block fetch must wrap around

•  Example: 32B block (8 words), miss on address 20 –  words return from memory system as follows: 20, 24, 28, 0, 4, 8, 12, 16 • other sequences possible

(word addresses within the block: 0 4 8 12 16 20 24 28)
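A tiny C sketch (illustrative, not from the slides) of the wrap-around return order: start at the word containing the miss address and walk the block modulo its size.

#include <stdio.h>

#define BLOCK_BYTES 32
#define WORD_BYTES  4

/* Print the word addresses in critical-word-first (wrap-around) order
   for a miss at byte address 'miss_addr'. */
static void cwf_order(unsigned miss_addr) {
    unsigned base  = miss_addr & ~(BLOCK_BYTES - 1);             /* start of the block      */
    unsigned first = miss_addr & (BLOCK_BYTES - 1) & ~(WORD_BYTES - 1);  /* missed word     */
    for (unsigned i = 0; i < BLOCK_BYTES; i += WORD_BYTES)
        printf("%u ", base + (first + i) % BLOCK_BYTES);
    printf("\n");
}

int main(void) {
    cwf_order(20);   /* prints 20 24 28 0 4 8 12 16, as in the example above */
    return 0;
}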

Lockup-Free Caches

•  The CPU need not stall on cache misses –  dynamically scheduled processors can hide memory latency –  caches must be non-blocking or lockup-free

•  Hit under miss schemes allow data cache to supply data for other lines during a cache miss –  Most of the performance improvement of Pentium MMX was due

to implementation of hit under miss •  Extensions include “hit under multiple miss” (overlap misses) •  Significantly complicates cache control!

–  R10000 implementation has a Miss Handling Table (MHT) • Sometimes referred to as set of MSHRs

–  MHT handles request merging for prefetches –  also acts as write buffer for sequential writes

2nd-Level Caches

•  Add another level of cache between CPU and main memory –  allows first-level cache to remain small and fast –  second-level is slower but much larger (MBs)

•  Reduces overall miss penalty, complicated perf analysis •  AMAT = Hit time_L1 + Miss Rate_L1 × Miss Penalty_L1 •  Miss Penalty_L1 = Hit time_L2 + Miss Rate_L2 × Miss Penalty_L2 •  What is 2nd-level miss rate?

–  local miss rate – number of cache misses / cache accesses –  global miss rate – number of cache misses / CPU memory refs

•  Local miss rate can be large…why? •  Global miss rate may be more useful measure

Miss Rates: An Example

•  CPU A generates 100,000 loads and stores. The first-level cache has 3,000 misses. The second level-cache has 1,500 misses. What are the various miss rates?

•  For first-level cache local and global miss rates are same: –  3,000/100,000 = 3%

•  For second-level cache, local miss rate is: –  1,500 / 3,000 = 50%

•  For second-level cache, global miss rate is: –  1,500 / 100,000 = 1.5%

Second-Level Cache Design

•  Speed of 2nd level cache affects only miss penalty not CPU clock –  will it lower the AMAT portion of the CPI? –  how much does it cost?

•  Size of 2nd level cache >> first level •  Most capacity misses go away, leaving conflict misses •  2nd-level caches therefore

–  typically have some degree of associativity > 1 –  have large block sizes –  emphasis shifts from fast hits to fewer misses

Improving Cache Performance

•  Use better technology •  Decrease Miss Rate •  Decrease Miss Penalty •  Decrease Hit Time

Small, Simple Caches

•  On many machines today the cache access sets the cycle time

•  Hit time is therefore important beyond its effect on AMAT •  Index portion of address reads cache tag for comparison

–  this part of the hit time is the most time consuming –  smaller hardware is faster, use small 1st-level caches –  1st-level caches also have to be on chip, again small –  compromise: tags on chip, data off

•  Simple caches (direct-mapped) also have better hit times and consume less power –  tag access overlapped with data access

Avoid Translations

•  Virtual memory discussed later –  CPU issues virtual addresses –  these are translated into physical addresses by page table

•  Caches are either virtual caches or physical caches –  guess what kind of addresses each hold…

•  Fast caches suggest all caches should be virtual –  no translations on cache hit path

•  Problems with virtual caches –  inter-process aliasing –  intra-process aliasing

More on Avoiding Translations

•  Can do translation and cache access in separate pipe stages –  fast cycle time but slow hits –  also increase in load delay slots

•  Intriguing option: virtually-indexed, physically-tagged caches –  page offset bits of VA are not translated –  use page offset bits as cache index –  do translation of page number (tag) in parallel –  compare physical tag with translated tag

•  Good idea. Problems when # cache lines > page size –  can use page coloring again –  or high associativity caches

Pipelining Write Hits

•  Write hits take longer than read hits (why?) –  solution: pipeline the writes –  tag comparison done during “memory” stage –  data queued or buffered –  when data array free, write data to cache if hit –  logical pipeline between writes (Alpha 21064)

Array Merging

•  Some weak programmers produce code like: •  int val[SIZE]; •  int key[SIZE]; •  …and then proceed to reference key and val in lockstep •  What’s the problem?

Array Merging

•  Some weak programmers produce code like: •  int val[SIZE]; •  int key[SIZE]; •  …and then proceed to reference key and val in lockstep •  What’s the problem? •  Danger is that these accesses may interfere w/ each other •  Solution: merge the arrays into a single array of records: •  struct merge { •  int val; •  int key; •  }; •  struct merge merged_array[SIZE];

•  Good programming practice obviates the optimization

Loop Interchange

•  Some weak programmers produce code like: •  for (j=0; j < 100; j++) •  for (k=0; k < 100; k++) •  x[k][j] = 2 * x[k][j];

•  What’s the problem?

Loop Interchange

•  Some weak programmers produce code like: •  for (j=0; j < 100; j++) •  for (k=0; k < 100; k++) •  x[k][j] = 2 * x[k][j];

•  What’s the problem? –  C is a row-major language (Fortran is column-major) –  This code has a stride of 100 words, not 1 –  No spatial locality, poor hit rates

•  Solution: interchange the loops! •  for (k=0; k < 100; k++) •  for (j=0; j < 100; j++) •  x[k][j] = 2 * x[k][j];

•  Does not affect number of instructions, just more hits!

Loop Fusion

•  Some weak programmers produce code like: •  for (j=0; j < N; j++) •  for (k=0; k < N; k++) •  a[j][k] = 1/b[j][k] * c[j][k]; •  for (j=0; j < N; j++) •  for (k=0; k < N; k++) •  d[j][k] = a[j][k] + c[j][k]; •  What’s the problem?

Loop Fusion

•  Some weak programmers produce code like: •  for (j=0; j < N; j++) •  for (k=0; k < N; k++) •  a[j][k] = 1/b[j][k] * c[j][k]; •  for (j=0; j < N; j++) •  for (k=0; k < N; k++) •  d[j][k] = a[j][k] + c[j][k]; •  What’s the problem?

–  No temporal locality if arrays are big enough –  Codes takes misses to a and c arrays twice

•  Solution: fuse the loops! •  for (j=0; j < N; j++) •  for (k=0; k < N; k++) { •  a[j][k] = 1/b[j][k] * c[j][k]; •  d[j][k] = a[j][k] + c[j][k]; •  }

Blocking I

•  Hardest, but most famous cache optimization –  Meant to reduce capacity misses (conflict indirectly) –  Even strong programmers don’t block code on first try –  Canonical example is matrix multiplication

•  for (i=0; i < N; i++) •  for (j=0; j < N; j++) { •  r=0; •  for (k=0; k < N; k++) •  r += y[i][k] * z[k][j]; •  x[i][j] = r; •  }

•  Number of capacity misses depends on N and cache size –  Accesses 2N³ + N² words in N³ operations –  Want to ensure better cache behavior –  Inner loops compute in steps of B rather than sweeping x and z

Blocking II

•  B is the blocking factor •  for (jj=0; jj < N; jj+=B) •  for (kk=0; kk < N; kk+=B) •  for (i=0; i < N; i++) •  for (j=jj; j < min(jj+B,N); j++) { •  r=0; •  for (k=kk; k < min(kk+B,N); k++) •  r += y[i][k] * z[k][j]; •  x[i][j] += r; •  }•  Reduces words accessed to 2N3/B + N2

–  y benefits from spatial locality, z from temporal –  Reduces capacity misses by decreasing working set –  Reduces probability of conflict misses for same reason

98

Main Memory

•  DRAM chips

•  Memory organization –  Interleaving –  Banking

•  Memory controller design

99

DRAM Chip Organization

[Figure: DRAM chip organization – memory cell array (one transistor + capacitor per cell) addressed by a row decoder driving wordlines; bitlines feed sense amps and a row buffer; the column decoder selects data onto the data bus; row address and column address share the address pins]

•  Optimized for density, not speed •  Data stored as charge in capacitor •  Discharge on reads => destructive reads •  Charge leaks over time

–  refresh every 2 ms

•  Need to precharge data lines before access

100

DRAM Chip Organization

[Figure: DRAM chip organization, repeated from the previous slide]

•  Address pins are time-multiplexed

–  Row address strobe (RAS)

–  Column address strobe (CAS)

•  Current generation DRAM –  256Mbit, moving to 1Gbit –  133 MHz synchronous interface –  Data bus 2, 4, 8, 16 bits

101

DRAM Chip Organization

[Figure: DRAM chip organization, repeated from the previous slide]

•  New RAS results in: –  Bitline precharge –  Row decode, sense –  Row buffer write (up to 8K)

•  New CAS –  Row buffer read

•  Streaming row accesses desirable –  Much faster (3x)

102

Simple Main Memory

•  Consider these parameters: –  1 cycle to send address –  6 cycles to access each word –  1 cycle to send word back

•  Miss penalty for a 4-word block: (1 + 6 + 1) x 4 = 32

•  How can we speed this up?

103

Wider(Parallel) Main Memory

•  Make memory wider –  Read out all words in parallel

•  Memory parameters –  1 cycle to send address –  6 to access a double word –  1 cycle to send it back

•  Miss penalty for 4-word block: (1 + 6 + 1) x 2 = 16

•  Costs –  Wider bus –  Larger minimum expansion unit

104

Interleaved and Parallel Organization

[Figure: non-interleaved organization – a single serial Addr/Cmd/Data path to one rank of DRAM chips; interleaved organization – multiple banks with chip selects sharing the Addr/Cmd/Data bus, allowing parallel (overlapped) accesses]

105

Interleaved Memory Summary

•  Parallel memory adequate for sequential accesses –  Load cache block: multiple sequential words –  Good for writeback caches

•  Banking useful otherwise

•  Can also do both –  Within each bank: parallel memory path –  Across banks

•  Can support multiple concurrent cache accesses (nonblocking)
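To compare the organizations numerically, here is a hedged C sketch using the parameters from the "Simple Main Memory" slide (1 cycle to send the address, 6 cycles to access, 1 cycle to transfer, 4-word block). The interleaved figure assumes 4 banks whose 6-cycle accesses overlap, which is my own assumption rather than a number given in the slides.

#include <stdio.h>

int main(void) {
    int addr = 1, access = 6, xfer = 1, words = 4;

    int simple      = (addr + access + xfer) * words;         /* 32: one word at a time        */
    int wide        = (addr + access + xfer) * (words / 2);   /* 16: double-word-wide path     */
    int interleaved = addr + access + words * xfer;           /* 11 (assumed): 4 banks overlap
                                                                 access, words return 1/cycle  */
    printf("simple=%d wide=%d interleaved=%d cycles\n", simple, wide, interleaved);
    return 0;
}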

106

Memory Controller Organization

[Figure: memory controller organization – ReadQ, WriteQ, and RespQ feed a Scheduler and Buffer inside the Memory Controller, which drives commands and data to DIMM(s) in Bank0 and Bank1]

107

Memory Controller Organization

•  ReadQ –  Buffers multiple reads, enables scheduling optimizations

•  WriteQ –  Buffers writes, allows reads to bypass writes, enables scheduling opt.

•  RespQ –  Buffers responses until bus available

•  Scheduler –  FIFO? Or schedule to maximize row hits (CAS accesses) –  Scan queues for references to same page –  Looks similar to issue queue with page number broadcasts for tag match

•  Buffer –  Builds transfer packet from multiple memory words to send over processor bus

108

Main Memory and Virtual Memory

•  Use of virtual memory –  Main memory becomes another level in the memory hierarchy –  Enables programs with address space or working set that exceed physically available memory • No need for programmer to manage overlays, etc. • Sparse use of large address space is OK –  Allows multiple users or programs to timeshare limited amount of physical memory space and address space

•  Bottom line: efficient use of expensive resource, and ease of programming

109

Virtual Memory

•  Enables –  Use more memory than system has –  Think program is the only one running • Don’t have to manage address space usage across programs • e.g. think it always starts at address 0x0 –  Memory protection • Each program has private VA space: no-one else can clobber it –  Better performance • Start running a large program before all of it has been loaded from disk

•  How to do it? Add a level of indirection

110

Virtual Memory – Placement

•  Main memory managed in larger blocks –  Page size typically 4K – 16K

•  Fully flexible placement; fully associative –  Operating system manages placement –  Indirection through page table –  Maintain mapping between:

• Virtual address (seen by programmer) • Physical address (seen by main memory)

111

Virtual Memory – Placement

•  Fully associative implies expensive lookup? –  In caches, yes: check multiple tags in parallel

•  In virtual memory, expensive lookup is avoided by using a level of indirection –  Lookup table or hash table –  Called a page table

112

Virtual Memory – Identification

•  Similar to cache tag array –  Page table entry contains PA, dirty bit

•  Virtual address: –  Matches programmer view; based on register values –  Can be the same for multiple programs sharing same

system, without conflicts •  Physical address:

–  Invisible to programmer, managed by O/S –  Created/deleted on demand basis, can change

Virtual Address Physical Address Dirty bit

0x20004000 0x2000 Y/N

113

Virtual Memory – Replacement

•  Similar to caches: –  FIFO –  LRU; overhead too high

• Approximated with reference bit checks –  Random

•  O/S decides, manages –  Take an Operating Systems Course

114

Virtual Memory – Write Policy

•  Write back –  Disks are too slow to write through

•  Page table maintains dirty bit –  Hardware must set dirty bit on first write –  O/S checks dirty bit on eviction –  Dirty pages written to backing store

• Disk write, 10+ ms

115

Virtual Memory Implementation

•  Caches have fixed policies, hardware FSM for control, pipeline stall

•  VM has very different miss penalties –  Remember disks are 10+ ms!

•  Hence engineered differently

Announcements

•  Homework 2 due now! •  Final Project posted, form groups by today

–  Abstract due to me via email Friday 10/8 6pm •  No class Wednesday 10/6 •  Midterm Exam, in class Monday 10/18

–  Open book, open notes, bring calculator, no network

•  Today: –  Finish chapter 3 topics VM, TLBs –  Start: ISA

• End of S&L Chapter 1, Appendix A&B of H&P • Read Chapter 2 of S&L

116

117

Page Faults

•  A virtual memory miss is a page fault –  Physical memory location does not exist –  Exception is raised, save PC –  Invoke OS page fault handler

• Find a physical page (possibly evict) •  Initiate fetch from disk

–  Switch to other task that is ready to run –  Interrupt when disk access complete –  Restart original instruction

Announcements

•  Project 0 due Wed. 9/30 –  Be sure to check out the Discussions on Webcourses

•  Using Olympus account for Final Project? –  Please send your NID/PID in an email to Lek

(snilpani@cs.ucf.edu) by tonight

•  Homework 2 posted! Due 1 week from today –  P3.1, P3.3, P3.4, P3.13, P3.24 from Shen & Lipasti

•  Today: –  Finish Virtual Memory, and ISA Overview

118

119

Address Translation

•  O/S and hardware communicate via PTE •  How do we find a PTE?

–  &PTE = PTBR + page number * sizeof(PTE) –  PTBR is private for each program

• Context switch replaces PTBR (page table base register) contents

VA PA Dirty Ref Protection

0x20004000 0x2000 Y/N Y/N Read/Write/Execute
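A minimal C sketch of that lookup for a flat (single-level) page table; the types, the 4KB page size, and the toy mapping are assumptions for illustration, not the slides' example numbers.

#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT 12                          /* assumed 4KB pages */

typedef struct {
    uint32_t ppn;                              /* physical page number */
    unsigned dirty : 1, ref : 1, prot : 3;     /* D, Ref, R/W/X bits   */
} pte_t;

/* &PTE = PTBR + page_number * sizeof(PTE); the page offset is not translated. */
static uint32_t translate(const pte_t *ptbr, uint32_t va) {
    uint32_t vpn    = va >> PAGE_SHIFT;
    uint32_t offset = va & ((1u << PAGE_SHIFT) - 1);
    const pte_t *pte = ptbr + vpn;             /* one extra memory reference */
    return (pte->ppn << PAGE_SHIFT) | offset;
}

int main(void) {
    static pte_t table[16] = {{0}};
    table[2].ppn = 0x2;                        /* toy mapping: VPN 2 -> PPN 2 */
    printf("PA = 0x%X\n", (unsigned)translate(table, (2u << PAGE_SHIFT) | 0x004));
    return 0;                                  /* prints PA = 0x2004 */
}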

120

Address Translation

PPN D PTBR

Virtual Page Number Offset

+ Prot

PA = PPN concatenated with (untranslated) Offset

121

Page Table Size

•  How big is page table? –  2^32 / 4K * 4B = 4MB per program (!) –  Much worse for 64-bit machines

•  To make it smaller –  Use limit register(s)

•  If VA exceeds limit, invoke O/S to grow region –  Use a multi-level page table –  Make the page table pageable (use VM)

122

Multilevel Page Table

[Figure: multilevel page table – PTBR plus the first index field selects a first-level entry, which points to a second-level table indexed by the next field; the final entry yields the PPN, concatenated with the untranslated page offset]
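A hedged C sketch of a two-level walk matching the figure (the 10+10 bit VPN split, the names, and the toy mapping are assumptions): the high VPN bits index the first-level table, whose entry points at a second-level table indexed by the low VPN bits, and the page offset passes through untranslated.

#include <stdint.h>

#define PAGE_SHIFT 12                     /* assumed 4KB pages           */
#define L2_BITS    10                     /* assumed 10+10 bit VPN split */

typedef struct { uint32_t ppn; int valid; } pte_t;
typedef struct { pte_t *next; int valid; } pde_t;   /* first-level entry */

/* Walk a two-level table; returns 1 and fills *pa on success, 0 on a page fault. */
static int walk(pde_t *ptbr, uint32_t va, uint32_t *pa) {
    uint32_t vpn = va >> PAGE_SHIFT;
    pde_t *pde = &ptbr[vpn >> L2_BITS];                       /* first-level index  */
    if (!pde->valid) return 0;
    pte_t *pte = &pde->next[vpn & ((1u << L2_BITS) - 1)];     /* second-level index */
    if (!pte->valid) return 0;
    *pa = (pte->ppn << PAGE_SHIFT) | (va & ((1u << PAGE_SHIFT) - 1));
    return 1;
}

int main(void) {
    static pte_t leaf[1 << L2_BITS];
    static pde_t root[1 << L2_BITS];
    leaf[4] = (pte_t){ 0x1234, 1 };       /* toy mapping: VPN 4 -> PPN 0x1234 */
    root[0] = (pde_t){ leaf, 1 };
    uint32_t pa = 0;
    int ok = walk(root, (4u << PAGE_SHIFT) | 0x42, &pa);
    return (ok && pa == ((0x1234u << PAGE_SHIFT) | 0x42)) ? 0 : 1;
}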

123

Hashed Page Table

•  Use a hash table or inverted page table or frame table –  PT contains an entry for each real address

•  Instead of entry for every virtual address –  Entry is found by hashing VA –  Oversize PT to reduce collisions: #PTE = 4 x (#phys. pages)

124

Hashed Page Table

PTBR

Virtual Page Number Offset

Hash PTE2 PTE1 PTE0 PTE3

125

High-Performance VM

•  VA translation –  Additional memory reference to PTE –  Each instruction fetch/load/store now 2 memory references!

• Or more, with multilevel table or hash collisions –  Even if PTE are cached in D/I-caches, still slow

•  Hence, use special-purpose cache for PTEs –  Called TLB (translation lookaside buffer) –  Caches PTE entries –  Exploits temporal and spatial locality (just a cache)

126

Translation Lookaside Buffer (TLB)

•  Set-associative or fully-associative organizations •  Both widely employed

[Figure: TLB organizations – the virtual page number is split into tag and index for the set-associative form; the fully associative form compares the whole page number]

Announcements

•  Project 0 is due Tues. 2/16 –  Discussed inclusion last week

•  Homework 2 posted, due now –  P3.1, P3.3, P3.4, P3.13, P3.24 from Shen and Lipasti

•  Reading posted for Chapter 3 (up to 3.6) –  Will continue rest of chapter after other topics

•  Memory Hierarchy slides re-posted •  Virtual Memory slides broken out and posted •  Today:

–  Finish chapter 3 topics –  Next time: ISA

127

128

Interaction of TLB and Cache

•  Serial lookup: first TLB then D-cache •  Excessive latency

129

Virtual Memory Protection

•  Each process/program has private virtual address space –  Automatically protected from rogue programs

•  Sharing is possible, necessary, desirable –  Avoid copying, staleness issues, etc.

•  Sharing in a controlled manner –  Grant specific permissions

• Read • Write • Execute • Any combination

130

VM Sharing

•  Share memory locations by: –  Map shared physical location into both address spaces:

• E.g. PA 0xC00DA becomes: –  VA 0x2D000DA for process 0 –  VA 0x4D000DA for process 1

–  Either process can read/write shared location

•  However, causes synonym problem

131

VA Synonyms

•  Virtually-addressed caches are desirable –  No need to translate VA to PA before cache lookup –  Faster hit time, translate only on misses

• L2 cache and lower *only* use physical addresses •  However, VA synonyms cause problems

–  Can end up with two copies of same physical line –  Causes coherence problems

•  Solutions: –  Flush caches/TLBs on context switch (one copy of data in

the cache) –  Store PID in the cache itself, treat as part of tag for hit/miss –  Other ways

132

Virtually Indexed Physically Tagged L1

•  Parallel lookup of TLB and cache •  Faster hit time •  Index bits must be untranslated

–  Restricts size of n-associative cache to n x (virtual page size) Why? –  E.g. 4-way SA cache with 4KB pages max. size is 16KB

•  Other solutions?

133

Input/Output

•  Performance Issues •  Device Characteristics

and Types –  Graphics Displays –  Networks –  Disks

•  I/O System Architecture –  Buses –  I/O Processors

•  High-Performance Disk Architectures

CPU

Memory I/O Bridge

Display Adapter

Disk Controller

Network Interface

LAN

I/O Bus

Processor Bus

Disk

SCSI Bus

134

Motivation

•  I/O necessary –  To/from users (display, keyboard, mouse) –  To/from non-volatile media (disk, tape) –  To/from other computers (networks)

•  Key questions –  How fast? –  Getting faster?

•  Amdahl’s Law –  Speed up CPU by 10x, but I/O takes 10% of time –  Overall speedup only = 1/(0.9/10 + 0.1) = 5.26x
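The speedup figure follows directly from Amdahl's Law; here is a one-line C check (my own phrasing of the formula):

#include <stdio.h>

int main(void) {
    double io_frac = 0.10, cpu_speedup = 10.0;
    /* Amdahl: overall = 1 / (io_frac + (1 - io_frac) / cpu_speedup) */
    printf("overall speedup = %.2fx\n",
           1.0 / (io_frac + (1.0 - io_frac) / cpu_speedup));   /* prints 5.26x */
    return 0;
}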

135

Examples

Device I or O? Partner Data Rate KB/s

Mouse I Human 0.01

Display O Human 60,000

Modem I/O Machine 2-8

LAN I/O Machine 500-6000

Tape Storage Machine 2000

Disk Storage Machine 2000-100,000

136

I/O Performance

•  What is performance? •  Supercomputers read/write 1GB of data

–  Want high bandwidth to vast data (bytes/sec)

•  Transaction processing does many independent small I/Os –  Want high I/O rates (I/Os per sec) –  May want fast response times

•  File systems –  Want fast response time first –  Lots of locality

137

Magnetic Disks

•  Stack of platters •  Two surfaces per platter •  Tracks •  Heads move together •  Sectors •  Disk access

–  Queueing + seek –  Rotation + transfer

[Figure: disk(s) – platter(s) on a spindle, read/write head(s) that move together, tracks divided into sectors]

138

Magnetic Disks

•  Seek = 10-20ms but smaller with locality •  Rotation = ½ rotation/3600rpm = 8.3ms •  Transfer = x / 2-4MB/s

–  E.g. 4kB/4MB/s = 1ms

•  Total: 19.3ms –  Higher-end drives more like 10-12ms

•  Remember: mechanical => ms

139

Disk Arrays

•  Collection of individual disks –  Each disk has own arm/head

•  Data Distribution –  Independent, fine-grained, coarse-grained

[Figure: three data distributions across four disks – independent (each disk holds a complete file: disk 0 has A0, A1, A2; disk 1 has B0, B1, B2; ...), fine-grained (each block A0, A1, ... is spread across all four disks), and coarse-grained (consecutive blocks of a file go to successive disks: A0, A1, A2, A3 across the four disks)]

140

Redundancy Mechanisms

•  Disk failures are a significant fraction of hardware failures –  Disk striping increases the number of corrupted/lost files

per failure

•  Data Replication –  disk "mirroring" –  allow multiple reads –  writes must be synchronized

•  Parity Protection –  use a parity disk

141

Redundant Arrays of Inexpensive Disks (RAIDs)

•  Arrays of small, cheap disks to provide high performance and high reliability –  D = number of “data” disks in group –  C = number of “check” disks in group

•  Level 1: Mirrored Disks (D = 1; C = 1) –  Overhead is high

•  Level 2: Memory Style ECC (e.g., D = 10; C = 4) –  Like ECC codes for DRAMs –  Read all bits across group –  Merge update bits with those not being updated; recompute

parities –  Rewrite full group, including checks

142

Redundant Arrays of Inexpensive Disks (RAIDs)

•  Level 3: Bit-Interleaved Parity –  (e.g., D = 4; C = 1) –  Key: failed disk is easily identified by controller, –  no need for special code to identify failed disk

•  Level 4: Block-Interleaved Parity –  Coarse-grain striping –  Like level 3 plus ability to do more than one small I/O at a

time. –  Write must update disk with data and parity disk

•  Level 5: Block Interleaved Distributed Parity –  Parity spread out across disks in a group:

different updates of parities go to different disks.

143

RAID4 vs RAID5

•  RAID5 –  Parity blocks are skewed across disks –  Eliminates parity disk bottleneck for writes

[Figure: RAID4 places data blocks 0-19 across four data disks and all parity blocks (P) on a dedicated fifth parity disk; RAID5 rotates the parity block of each stripe across the five disks so that no single disk holds all the parity]

144

LAN = Ethernet

•  Ethernet not technically optimal –  80x86 is not technically optimal either!

•  Nevertheless, many efforts to displace it have failed (token ring, ATM)

•  Emerging: System Area Network (SAN) –  Reduce SW stack (TCP/IP processing) –  Reduce HW stack (interface on memory bus) –  New standard: Infiniband (http://www.infinibandta.org)

145

Buses in a Computer System CPU

Memory I/O Bridge

Display Adapter

Disk Controller

Network Interface

LAN

I/O Bus

Processor Bus

Disk

Storage Bus

•  Hierarchical paths

•  I/O Processing –  Program controlled –  DMA –  Dedicated I/O processors

146

Bus Types

•  CPU-Memory Buses –  “Front-Side” bus –  Very high performance

•  64-128 bits wide •  100-200 MHz • May use both clock edges (DDR) •  800 MB/sec – 64 GB/sec

–  Usually custom designs •  “Back-Side” bus to Cache

–  Even higher performance •  I/O Buses

–  Compatibility is important –  Usually standard designs (PCI, SCSI)

• PCI – 32 bits, 33 MHz, 130 MB/sec
