TRANSCRIPT
ECE 550D Fundamentals of Computer Systems and Engineering
Fall 2017
Memory Hierarchy
Prof. John Board
Duke University
Slides are derived from work by Profs. Tyler Bletsch and Andrew Hilton (Duke), and Amir Roth (Penn)
2
Memory Hierarchy
• Basic concepts
• Technology background
• Organizing a single memory component
• ABC
• Write issues
• Miss classification and optimization
• Organizing an entire memory hierarchy
• Virtual memory
• Highly integrated into real hierarchies, but…
• …won’t talk about until later
[Figure: App / App / App running on system software, above CPU, Mem, and I/O]
3
SRAM vs DRAM
• SRAM: Static Random Access Memory
• Static: memory is based on latches, a stored 0 or 1 is electrically stable, or static. 8 transistors per bit if we were using S-R latches (can do better but several transistors per bit).
• Fast access time since the stored value is actively driven onto the output.
• Random Access Memory: we use decoders to select exactly 1 of 2k elements of our memory; we can access any memory location equally easily in any order (as opposed to sequential access memory – magnetic tape for instance)
[Figure: a k:2^k decoder selects one of 2^k elements of n latches each; here, a 3-bit address accesses one of 8 n-bit elements]
4
SRAM vs DRAM
• DRAM: Dynamic Random Access Memory
• Imagine 32 billion leaky cups (capacitors): (4 gigabyte ram, 8 bits per byte) you pour water into the cups you want to have a “1” and leave empty the cups that have a “0”.
• You still have a decoder (still a RAM) – you select one memory element (say one byte) – imagine 8 straws for the 8 bits of the byte.
• You suck on the straw – if any water comes out, there used to be a 1 stored there, but you just destroyed it. If you suck air, it was and still is a 0. (Destructive read)
• And if you wait too long, the water leaks away, so you have to constantly (about 12 times per second) check each bit and refill it if it is a ”1”. (Dynamic instead of static storage).
• Insane? But only 1 transistor per bit stored, and much lower power consumption.
• But slow access time since discharging a capacitor, not driving a circuit.
5
How Do We Build Instruction/Data Memory?
• Register file? Just a multi-ported SRAM – i.e. just lots of flip flops
• 32 32-bit registers = 1Kb = 128B. Need a 5:32 decoder – not bad
• Multiple ports make it bigger and slower but still OK
• Instruction/data memory? Just a single-ported SRAM?
• Uh, umm… it’s 2^32 B = 4GB!!!!
– It would be huge, expensive, and pointlessly slow with a naïve decoder (a 32:4G decoder – how many 32-input AND gates does this need!?!)
– And consume enormous amounts of power
– And we can’t build something that big on-chip anyway
• Most ISAs now 64-bit, so memory is really as large as 2^64 B = 16EB
[Datapath figure: PC, IM, intRF, DM]
6
So What Do We Do? Motivation for Caches:
• “Primary” instruction/data memories (we will call them cache memories): small single-ported SRAMs…
• “primary” = “in the datapath”
• Key 1: they contain only a dynamic subset of “memory”
• Subset is small enough to fit in a reasonable SRAM and access quickly
• Key 2: missing chunks fetched on demand (transparent to program)
• From somewhere else… (next slide)
• Program has illusion that all 4GB (16EB) of memory is physically there
• Just like it has the illusion that all instructions execute atomically
[Datapath figure: PC, intRF, with a 64KB IM cache and a 16KB DM cache]
7
But…
• If requested insn/data not found in primary memory
• Doesn’t the place it comes from have to be a 4GB (16EB) SRAM?
• And won’t it be huge, expensive, and slow? And can we build it?
[Figure: 64KB IM and 16KB DM backed by a 4GB (16EB)? memory, with PC and intRF]
8
Memory Overview
• Functionality • “Like a big array…”
• N-bit address bus (on N-bit machine)
• Data bus: typically read/write on same bus
• Can have multiple ports: address/data bus pairs
• Access time: • Access latency ~ #bits * #ports^2
[Figure: memory component M with address and data buses]
9
Memory Performance Equation
• For memory component M • Access: read or write to M
• Hit: desired data found in M
• Miss: desired data not found in M
• Must get from another component
• No notion of “miss” in register file
• Fill: action of placing data in M
• %miss (miss-rate): #misses / #accesses
• thit: time to read data from (write data to) M
• tmiss: time to read data into M
• Performance metric: average access time
tavg = thit + %miss * tmiss
[Figure: component M annotated with thit, tmiss, and %miss]
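As a quick sketch, the performance equation translates directly into code (the numbers below are illustrative, not from the slides):

```python
def avg_access_time(t_hit, miss_rate, t_miss):
    """t_avg = t_hit + %miss * t_miss (all times in the same units)."""
    return t_hit + miss_rate * t_miss

# e.g. a 1-cycle hit, 3% miss rate, 100-cycle miss penalty
t_avg = avg_access_time(1, 0.03, 100)  # 4.0 cycles
```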
10
Memory Hierarchy
tavg = thit + %miss * tmiss
• Problem: hard to get low thit and %miss in one structure • Large structures have low %miss but higher thit
• Small structures have low thit but higher %miss
• Solution: use a hierarchy of memory structures • Known from the very beginning
“Ideally, one would desire an infinitely large memory capacity such that any particular word would be immediately available … We are forced to recognize the possibility of constructing a hierarchy of memories, each of which has a greater capacity than the preceding but which is less quickly accessible.”
Burks, Goldstine, and von Neumann, “Preliminary discussion of the logical design of an electronic computing instrument”, IAS memo, 1946
11
Abstract Memory Hierarchy
• Hierarchy of memory components • Upper levels: small → low thit, high %miss
• Going down: larger → higher thit, lower %miss
• Connected by buses • Ignore for the moment
• Make average access time close to M1’s • How?
• Most frequently accessed data in M1
• M1 + next most frequently accessed in M2, etc.
• Automatically move data up/down hierarchy
[Figure: pipeline connected to M1, then M2, M3, M4, down to M]
12
Why Memory Hierarchy Works
• 10/90 rule (of thumb) • 10% of static insns/data account for 90% of accessed insns/data
• Instructions: inner loops
• Data: frequently used globals, inner loop stack variables
• Temporal locality • Recently accessed instructions/data likely to be accessed again soon
• Instructions: inner loops (next iteration)
• Data: inner loop local variables, globals
• Hierarchy can be “reactive”: move things up when accessed
• Spatial locality • Instructions/data near recently accessed insns/data likely accessed
soon
• Instructions: sequential execution
• Data: elements in array, fields in struct, variables in stack frame
• Hierarchy can be “proactive”: move things up speculatively
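The spatial-locality point can be illustrated with a loop-order sketch (illustrative only: Python lists are not literally laid out row-major, so this shows the access pattern, not real timings):

```python
# In a C-style row-major layout, row-order traversal touches consecutive
# addresses (spatial locality); column-order traversal strides across rows.
N = 64
matrix = [[i * N + j for j in range(N)] for i in range(N)]

def row_major_sum(m):
    # visits elements in memory order: neighbors share cache blocks
    return sum(x for row in m for x in row)

def col_major_sum(m):
    # jumps N elements at a time: each access lands in a different block
    return sum(m[i][j] for j in range(N) for i in range(N))

# Same answer either way; on real hardware the first pattern is much faster.
assert row_major_sum(matrix) == col_major_sum(matrix) == N * N * (N * N - 1) // 2
```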
13
Exploiting Heterogeneous Technologies
• Apparent problem – Lower level components must be huge
– Huge SRAMs are difficult to build and expensive
• Solution: don’t use SRAM for lower levels • Cheaper, denser storage technologies
• Will be slower than SRAM, but that’s OK
• Won’t be accessed very frequently
• We have no choice anyway
• Upper levels: SRAM → expensive, fast
• Going down: DRAM, Disk/SSD → cheaper, slower
[Figure: pipeline → SRAM → SRAM? → DRAM → DISK hierarchy]
14
Memory Technology Overview
• Latency • SRAM: <1 to 5ns (on chip)
• DRAM: ~100ns — 100x or more slower than SRAM
• (spinning) Disk: 10,000,000ns or 10ms — 100,000x slower than DRAM
• (SSD) Flash: ~200ns — 2x slower than DRAM (for reads, much slower for writes)
• Bandwidth • SRAM: 10-100GB/sec
• DRAM: ~1GB/sec — 10x less than SRAM
• Disk: 100MB/sec (0.1 GB/sec) — sequential access only
• Flash: about same as DRAM for read (much less for writes)
• Cost: what could $300 buy (as of a few years ago)? • SRAM: 4MB
• DRAM: 1,000MB (1GB) — 250x cheaper than SRAM
• Disk: 400,000MB (400GB) — 400x cheaper than DRAM
• Flash: 4,000 MB (4GB) — 4x cheaper than DRAM
15
(Traditional) Concrete Memory Hierarchy
• (0th level: register file)
• 1st level: I$, D$ (L1 insn/data caches)
• 2nd level: L2 (cache) • On-chip, certainly on-package (with CPU)
• Made of SRAM
• 3rd level: L3 (cache) • Same as L2, may be off-chip
• Starting to appear
• ...
• N-1 level: main memory • Off-chip
• Made of DRAM
• N level: disk (swap space) • Electrical-mechanical (or SSD)
[Figure: pipeline → I$/D$ → L2 → L3 → main memory → disk (swap)]
16
Virtual Memory Teaser
• For 32-bit ISA • 4GB disk is easy
• Even 4GB main memory is common
• For 64-bit ISA • 16EB main memory is right out
• Oct 2017: 4GB = $34, so 16EB = $136 billion
• Even 16EB disk is extremely difficult
• (most 64-bit ISAs don’t support the full 64-bit address space: Intel – 48 bits in 2017)
• Virtual memory • Never referenced addresses don’t have to
physically exist anywhere!
• Next week…
17
Start With “Caches”
• “Cache”: hardware managed • Missing chunks retrieved by hardware
• SRAM technology • Technology basis of latency
• Cache organization • ABC
• Miss classification & optimization
• What about writes?
• Cache hierarchy organization
• Some example calculations
[Figure: hierarchy with caches hardware-managed; main memory and disk (swap) software-managed]
18
Why Are There 2-3 Levels of Cache?
• “Memory Wall”: memory 100X slower than primary caches • Multiple levels of cache needed to bridge the difference
• “Disk Wall?”: disk is 100,000X slower than memory • Why aren’t there 5-6 levels of main memory to bridge that difference?
• Doesn’t matter: program can’t keep itself busy for 10M cycles
• So slow, may as well swap out and run another program
[Figure (Copyright Elsevier Scientific 2003): processor vs. DRAM performance over time, log scale – processor improves +35–55% per year, DRAM +7% per year. Most famous graph in computer architecture.]
19
Evolution of Cache Hierarchies
[Figure: Intel 486 (1989) with an 8KB unified I/D$, vs. IBM Power5 dual core (2004) with 64KB I$, 64KB D$, 1.5MB L2, and L3 tags on chip]
• Chips today are 30–70% cache by area
20
RAM and SRAM
• Reality: large storage arrays are not really built with flip-flops and giant muxes
• RAM (random access memory) • Ports implemented as shared buses called wordlines/bitlines
• SRAM: static RAM • Static = bit maintains its value indefinitely, as long as power is on
• Bits implemented as cross-coupled inverters (CCIs)
+ 2 gates, 4 transistors per bit
• All processor storage arrays: regfile, caches, branch predictor, etc.
• Other forms of RAM: Dynamic RAM (DRAM), Flash (non-volatile RAM, or NV-RAM)
21
Basic RAM
• Storage array • M words of N bits each (e.g., 4w, 2b each)
• RAM storage array • M by N array of “bits” (e.g., 4 by 2)
• RAM port • Grid of wires that overlays bit array
• M wordlines: carry 1-hot decoded address
• N bitlines: carry data
• RAM port operation • Send address → 1 wordline goes high
• “bits” on this line read/write bitline data
• Operation depends on bit/W/B connection
• “Magic” analog stuff
[Figure: 4×2 RAM array – wordlines W0–W3 driven by the address decoder, bitlines B0–B1 carrying data, one 1/0 storage cell at each crossing]
22
Basic SRAM
• Storage array • M words of N bits each (e.g., 4w, 2b each)
• SRAM storage array • M by N array of CCI’s (e.g., 4 by 2)
• SRAM port • Grid of wires that overlays CCI array
• M wordlines: carry 1-Hot decoded address
• N bitlines: carry data
• SRAM port operation • Send address → 1 wordline goes high
• CCIs on this line read/write bitline data
• Operation depends on CCI/W/B connection
• “Magic” analog stuff
[Figure: same 4×2 array with a cross-coupled-inverter (CCI) cell at each wordline/bitline crossing]
23
ROMS:
• ROMs = Read Only memory
• Similar layout (wordlines, bitlines) to RAMs
• Except not writeable: fixed connections to Power/Gnd instead of CCI
• Also EPROMs • Electrically programmable (erased with UV light)
• And EEPROMs • Electrically erasable and re-programmable (very slow)
[Figure: ROM array – same wordline/bitline grid, but each crossing is a fixed 1/0 connection to power or ground]
24
SRAM Read/Write Port
• Cache: read/write on same port • Not at the same time
• Trick: write port with additional bitline
• “Double-ended” or “differential” bitlines
• Smaller → faster than separate ports
25
SRAM Read/Write
• Some extra logic on the edges • To write: tristates “at the top”
• Drive write data when appropriate
[Figure: SRAM array with differential bitline pairs B0/~B0 and B1/~B1, and write tristates at the top]
26
SRAM Read/Write
• Some extra logic on the edges • To write: tristates “at the top”
• Drive write data when appropriate
• To read: 2 things at the bottom
• Ability to equalize bit lines
• Sense amps
[Figure: same array with bitline equalization and sense amps (SA) at the bottom for reads]
50
SRAMS -> Caches
• Use SRAMs to make caches • Hold a subset of memory
• Reading: • Input: Address to read (32 or 64 bits)
• Output:
• Hit? 1-bit: was it there?
• Data: if there, requested value
[Figure: address feeds a tag SRAM and a data SRAM; outputs are hit and data]
51
Cache Performance Metrics
Miss Rate
• Fraction of memory references not found in cache (misses / accesses)
• 1 – hit rate
• Typical numbers (in percentages):
• 3-10% for L1
• can be quite small (e.g., < 1%) for L2, depending on size, etc.
Hit Time
• Time to deliver a line in the cache to the processor
• includes time to determine whether the line is in the cache
• Typical numbers:
• 1-2 clock cycles for L1
• 5-20 clock cycles for L2
Miss Penalty
• Additional time required because of a miss
• typically 50-200 cycles for main memory (Trend: increasing!)
From CMU 15-213
52
Let’s think about those numbers
Huge difference between a hit and a miss
• 100X, if just L1 and main memory
Would you believe 99% hits is twice as good as 97%?
• Consider these numbers:
cache hit time of 1 cycle
miss penalty of 100 cycles
So, average access time is:
97% hits: 1 cycle + 0.03 * 100 cycles = 4 cycles
99% hits: 1 cycle + 0.01 * 100 cycles = 2 cycles
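The same arithmetic, checked in code:

```python
t_hit, t_miss = 1, 100                 # cycles (the slide's numbers)
t_avg_97 = t_hit + 3 * t_miss // 100   # 3% misses -> 4 cycles
t_avg_99 = t_hit + 1 * t_miss // 100   # 1% misses -> 2 cycles
assert (t_avg_97, t_avg_99) == (4, 2)
assert t_avg_97 == 2 * t_avg_99        # 99% hits really is twice as good
```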
This is why “miss rate” is used instead of “hit rate”
From CMU 15-213
53
Associative memory, or Content-Addressable Memory (CAM)
• Mentioned last time: a memory we access by content rather than address
• “A CAM is designed such that the user supplies a data word and the CAM searches its entire memory to see if that data word is stored anywhere.” (Wikipedia) (answer can be nowhere!)
[Figure: a regular word-addressable memory (2^N words) takes an N-bit address and returns 1 word of data (in or out); an n-word CAM has storage for n words AND n comparators for doing parallel search – it takes a word to match and returns the address of a match, or NO MATCH]
54
General cache mechanics
[Figure: the larger, slower, cheaper memory is partitioned into “blocks” numbered 0–15; the smaller, faster, more expensive cache holds a subset of the blocks (here 8, 9, 14, 3); data is copied between levels in block-sized transfer units, e.g. blocks 4 and 10 moving up on misses]
From lecture-9.ppt, Carnegie-Mellon University course 15-213
55
Cache organization: Blocks
• Caches always interact with the next level of the memory hierarchy an entire “block” at a time – in level 1 caches, blocks typically range from 8-64 bytes; larger in L2/L3.
• Consider 1 megabyte memory with B=32 bytes (8 words) and a system with just one cache, 128 bytes. So 32768 blocks in all.
MemBlock Addr Data (32 bytes per block)
0 0-31 <some data>
1 32-63
2 64-95
…
32766 1048512-1048543
32767 1048544-1048575
Main Memory
56
Cache organization: Blocks
• In this case (1MB mem, B=32), each 20 bit physical address is
15-bit block id (0-32767) 5-bit byte offset
20-bit memory address
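In code, the split is just a shift and a mask (using an arbitrary example address, not one from the slides):

```python
B = 32                      # block size in bytes -> 5 offset bits
addr = 0x1234               # an arbitrary 20-bit address in the 1MB example
block_id = addr >> 5        # upper 15 bits: which of the 32768 blocks
offset = addr & (B - 1)     # lower 5 bits: which byte within the block
assert (block_id, offset) == (0x091, 20)
```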
57
Cache organization
• Here is our 128 byte (4 block) cache
• Many problems!
• Out of the 32768 blocks in main memory, which 4 should be in the cache?
• How do we identify blocks?
• Is the block in the cache valid? (i.e. has it been initialized or is it garbage?)
• How do we know which block is here? What is our search strategy?
• What do we do if the block we want is not here?
• 3 choices – Fully Associative, Direct-mapped, Set-associative caches solve the identification problem in different ways.
[Table: 4-entry cache – columns Cache block (0-3), Valid? (1 bit), Block ID? (tag), Data (32 bytes per block); all entries unknown]
58
General Organization of a Cache
B = 2^b bytes per cache block
E lines per set
S = 2^s sets
t tag bits per line
Cache size: C = B x E x S data bytes
[Figure: the cache as an array of S sets; each set holds E lines; each line holds a valid bit, a tag, and a block of B data bytes]
• Cache is an array of sets
• Each set contains one or more lines
• Each line holds a block of data
• 1 valid bit per line
From CMU 15-213
59
Most flexible: Fully Associative Cache
• Anything can be anywhere! (in our later language, the cache consists of a single “set”)
• Our running example : 1MB mem, B=32, so our cache tag will be the full 15 bit block ID of a main memory block
• Needs a full comparator per cache block (so 4 in our simple example). Any of our 32768 memblocks can be in any location. The array of 4 tags is a CAM.
• Sadly, other than tiny ones, FA caches are too complex and slow to be practical (due to the comparators)
[Table: 4-entry fully associative cache – columns Cache block (0-3), 15-bit tag, Data (32 bytes per block)]
60
Most flexible: Fully Associative Cache
• Cache address format for 20 bit address of running example
• So Address 0x15A45 comes along:
• Real question: is the 15-bit tag 0x0AD2 currently stored in any tag field of my FA cache? Check the CAM. If so, the 5th byte in the associated cache block is the byte I want.
[Address format: 15-bit tag | 5-bit byte offset; 0x15A45 = 0001 0101 1010 0100 0101]
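A set-membership sketch of the CAM lookup (the stored tags other than 0x0AD2 are made up for illustration):

```python
# Hypothetical contents of the 4-entry tag CAM; in hardware all four
# comparators check the incoming tag at once, like set membership.
stored_tags = {0x0AD2, 0x0001, 0x7FFF, 0x0100}

addr = 0x15A45
tag, offset = addr >> 5, addr & 0x1F   # 15-bit tag, 5-bit offset
assert tag == 0x0AD2 and offset == 5   # matches the slide's breakdown
hit = tag in stored_tags               # parallel compare -> hit
```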
61
Direct-Mapped Caches
• The full flexibility of FA caches slows them down too much, especially for a L1 cache. Other extreme: Direct-mapped cache.
• Cache will consist of S sets, with one line per set (so 4 sets in our running example).
62
Example: Direct-Mapped Cache
Simplest kind of cache, easy to build (only 1 tag compare required per access)
Characterized by exactly one line per set.
[Figure: sets 0 through S-1, each with E=1 line holding a valid bit, a tag, and a cache block]
Cache size: C = B x S data bytes
From CMU 15-213
63
Accessing Direct-Mapped Caches
Set selection
• Use the set index bits to determine the set of interest.
[Address format (m bits): t tag bits | s set-index bits | b block-offset bits]
[Figure: the set index selects one set; the selected set’s valid bit, tag, and cache block are read out]
64
Direct Mapped Caches
• In our running example, 4 blocks in the cache, so 4 sets, so 2-bit set index. That leaves a 13 bit tag field.
• So Address 0x15A45 comes along:
• If the address is in the cache, it’s in set 2. Is the 13-bit tag currently in set 2 0x02B4, AND is the block valid? If so, cache hit.
[Address format: 13-bit tag | 2-bit set ID | 5-bit byte offset; 0x15A45 = 0001 0101 1010 0100 0101 → SetID = 2, Offset = 5]
65
Sets in main memory
• In our running example, memory blocks 0,4,8,12,16,…,32764 compete for set 0 in the cache.
• Blocks 1,5,9,13…,32765 compete for set 1 in the cache
• …
• Blocks 3,7,11,15,…,32767 compete for set 3.
• So if blocks 0 and 4 are important to the program right now, only room for one of them in the cache, even if the other three entries in the cache are empty!
• DM: really fast – only need one comparator for entire cache, but inflexible (lower hit rate)
[Figure: main memory blocks 0 through 32767, grouped by the cache set each maps to]
66
Engineering compromise: Set-Associative Caches
• Most of the speed of direct mapped caches, but with some of the additional flexibility (and thus higher hit rate) of fully associative caches
• We have sets, like in DM caches, but we have more than 1 block per set.
67
Example: Set Associative Cache
Characterized by more than one line per set
E=2 lines per set
[Figure: sets 0 through S-1, each with two lines of (valid, tag, cache block) – an E-way associative cache]
68
Set-associative cache
• In our running example, if our 128 byte, 4 block cache is 2-way set associative, there will be 2 sets with 2 blocks each.
• So the “SetID” needs just 1 bit, tags are now 14 bits.
• Need 2 comparators, need to check both tags in the chosen set on each memory access – more than 1, but better than 4!
• “Real” caches are often 2-way or 4-way set-associative
[Address format: 14-bit tag | 1-bit set ID | 5-bit byte offset]
69
Notice that middle bits used as index
[Address format: t tag bits | s set-index bits | b block-offset bits]
70
Why Use Middle Bits as Index?
High-Order Bit Indexing
• Adjacent memory lines would map to same cache entry
• Poor use of spatial locality
[Figure: memory lines 0000–1111 mapped into a 4-line cache – with high-order indexing each contiguous quarter of memory maps to one cache line; with middle-order indexing consecutive lines map to different cache lines]
Middle-Order Bit Indexing
• Consecutive memory lines map to different cache lines
• Can hold an S*B*E-byte region of address space in cache at one time
71
Back to our regularly scheduled slides
72
Step 1: Data Basics
• 32-bit addresses • 4 Byte words only (to start)
• Start with blocks that are 1 word each • 4KB, organized as 1K 4B blocks
• Block: basic unit of data in cache
• Physical cache implementation • 1K (1024) by 4B (32) SRAM
• Called data array
• 10-bit address input
• 32-bit data input/output
[Figure: data array – 10-bit address input, 32-bit data input/output]
73
Which bits to use for index?
• Can skip the lowest log2(block_size) bits: those tell us which byte in the block we’re looking for.
• Of the remaining bits, do we pick the lowest ones or the highest ones?
• If we pick highest bits for index: • Two addresses that are numerically close will both map to the same block
• Neighbors in memory are likely to collide; fight over the same block
• Opposite of what we want – this penalizes spatial locality
• Bad!
[Figure: memory map – with high bits 31:22 as index, neighboring addresses fall in the same cache block]
74
Which bits to use for index?
• Can skip the lowest log2(block_size) bits: those tell us which byte in the block we’re looking for.
• Of the remaining bits, do we pick the lowest ones or the highest ones?
• If we pick lowest bits for index: • Two addresses that are numerically close will map to different blocks
• Neighbors in memory get neighboring blocks
• Spatial locality leads to broad use of cache capacity
• Good!
[Figure: memory map – with low bits 11:2 as index, neighboring addresses map to neighboring cache blocks]
75
Looking Up A Block
• Q: which 10 of the 32 address bits to use?
• A: bits [11:2] • 2 LS bits [1:0] are the offset bits
• Locate byte within word
• Don’t need these to locate word
• Next 10 LS bits [11:2] are the index bits
• These locate the word
• Nothing says index must be these bits
• But these work best in practice
• Why? (think about it)
[Figure: address bits [11:2] index the data array]
76
Knowing that You Found It
• Hold a subset of memory • How do we know if we have what we need?
• 2^20 different addresses map to one particular block
• Build separate and parallel tag array • 1K by 21-bit SRAM
• 20-bit (next slide) tag + 1 valid bit
• Lookup algorithm • Read tag indicated by index bits
• (Tag matches & valid bit set)
? Hit → data is good
: Miss → data is garbage, wait…
[Figure: index bits [11:2] read the data and tag arrays; tag bits [31:12] are compared (==) against the stored tag to produce hit]
77
Cache Use of Addresses
• Split address into three parts: • Offset: least-significant log2(block-size)
• Index: next log2(number-of-sets)
• Tag: everything else
[Address format: Tag [31:12] | Index [11:2] | Offset [1:0]]
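A sketch of the three-way split as a function (the second example reuses the 16-bit, 8-set, 4B-block configuration from the behavior slides; the 32-bit address is arbitrary):

```python
def split_address(addr, offset_bits, index_bits):
    """Split an address into (tag, index, offset) fields."""
    offset = addr & ((1 << offset_bits) - 1)          # least-significant bits
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)          # everything else
    return tag, index, offset

# 32-bit address, 4B blocks (2 offset bits), 1K sets (10 index bits)
assert split_address(0x12345678, 2, 10) == (0x12345, 414, 0)
# 16-bit example: 8 sets, 4B blocks
assert split_address(0x1234, 2, 3) == (0x091, 5, 0)
```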
78
Cache Behavior Example
Set # Valid Tag Data
0 0 000 00 00 00 00
1 0 000 00 00 00 00
2 0 000 00 00 00 00
3 0 000 00 00 00 00
4 0 000 00 00 00 00
5 0 000 00 00 00 00
6 0 000 00 00 00 00
7 0 000 00 00 00 00
CRITICAL: Cache starts empty (valid = 0). 8 sets, 16 bit address for example
79
Cache Behavior Example
Set # Valid Tag Data
0 0 000 00 00 00 00
1 0 000 00 00 00 00
2 0 000 00 00 00 00
3 0 000 00 00 00 00
4 0 000 00 00 00 00
5 (101) 0 000 00 00 00 00
6 0 000 00 00 00 00
7 0 000 00 00 00 00
Access address 0x1234 = 0001 0010 0011 0100 Offset = 0
Index = 5 Tag = 091
Not valid: miss
(doesn’t matter if tags match – invalid!)
80
Handling a Cache Miss
• What if requested word isn’t in the cache?
• How does it get in there?
• Cache controller: FSM • Remembers miss address
• Asks next level of memory
• Waits for response
• (and stalls CPU if necessary)
• Writes data/tag into proper locations in cache, SETS VALID BIT
• All of this happens on the fill path
• Sometimes called backside
[Figure: cache controller (cc) on the fill path between the cache arrays and the next level]
81
Cache Behavior Example (DM)
Set # Valid Tag Data
0 0 000 00 00 00 00
1 0 000 00 00 00 00
2 0 000 00 00 00 00
3 0 000 00 00 00 00
4 0 000 00 00 00 00
5 1 091 0F 1E 39 EC
6 0 000 00 00 00 00
7 0 000 00 00 00 00
Access address 0x1234 = 0001 0010 0011 0100 (now a hit after processing)
lb: 00 00 00 EC
lh: 00 00 39 EC
lw: 0F 1E 39 EC
82
Cache Behavior Example (DM)
Set # Valid Tag Data
0 0 000 00 00 00 00
1 0 000 00 00 00 00
2 0 000 00 00 00 00
3 0 000 00 00 00 00
4 0 000 00 00 00 00
5 1 091 0F 1E 39 EC
6 0 000 00 00 00 00
7 0 000 00 00 00 00
Valid && Tag match -> hit lb: 00 00 00 1E
lh: 00 00 0F 1E
lw: (unaligned)
Access address 0x1236 = 0001 0010 0011 0110 Offset = 2
Index = 5 Tag = 091
83
Cache Behavior Example (DM)
Set # Valid Tag Data
0 0 000 00 00 00 00
1 0 000 00 00 00 00
2 0 000 00 00 00 00
3 0 000 00 00 00 00
4 0 000 00 00 00 00
5 1 091 0F 1E 39 EC
6 0 000 00 00 00 00
7 0 000 00 00 00 00
Not valid: miss
Access address 0x1238 = 0001 0010 0011 1000 Offset = 0
Index = 6 Tag = 091
84
Cache Behavior Example (DM)
Set # Valid Tag Data
0 0 000 00 00 00 00
1 0 000 00 00 00 00
2 0 000 00 00 00 00
3 0 000 00 00 00 00
4 0 000 00 00 00 00
5 1 091 0F 1E 39 EC
6 0 000 00 00 00 00
7 0 000 00 00 00 00
Access address 0x1238 = 0001 0010 0011 1000
Make request to next level...
wait for it....
(fill arrives: valid=1, tag=091)
85
Cache Behavior Example (DM)
Set # Valid Tag Data
0 0 000 00 00 00 00
1 0 000 00 00 00 00
2 0 000 00 00 00 00
3 0 000 00 00 00 00
4 0 000 00 00 00 00
5 1 091 0F 1E 39 EC
6 1 091 3C 99 11 12
7 0 000 00 00 00 00
Valid, but tag does not match: miss
Access address 0x2234 = 0010 0010 0011 0100 Offset = 0
Index = 5 Tag = 111
86
Cache Behavior Example (DM)
Set # Valid Tag Data
0 0 000 00 00 00 00
1 0 000 00 00 00 00
2 0 000 00 00 00 00
3 0 000 00 00 00 00
4 0 000 00 00 00 00
5 1 091 0F 1E 39 EC
6 1 091 3C 99 11 12
7 0 000 00 00 00 00
Access address 0x2234 = 0010 0010 0011 0100
Make request to next level...
wait for it....
(fill arrives: tag=111)
87
Cache Behavior Example (DM)
Set # Valid Tag Data
0 0 000 00 00 00 00
1 0 000 00 00 00 00
2 0 000 00 00 00 00
3 0 000 00 00 00 00
4 0 000 00 00 00 00
5 1 111 01 CF D0 87
6 1 091 3C 99 11 12
7 0 000 00 00 00 00
Access address 0x2234 = 0010 0010 0011 0100
Note that now, 0x1234 is gone
replaced by 0x2234
88
Cache Misses and CPI
• I$ and D$ misses stall datapath (multi-cycle or pipeline) • Increase CPI
• Cache hits built into “base” CPI
• E.g., Loads = 5 cycles in multi-cycle includes thit
• Some loads may take more cycles...
– Need to know latency of “average” load (tavg)
[Pipeline figure: PC, I$, register file, ALU, data memory, with pipeline latches]
89
Measuring Cache Performance
• Ultimate metric is tavg • Cache capacity roughly determines thit
• Lower-level memory structures determine tmiss
• Measure %miss
• Hardware performance counters (since Pentium)
• Performance Simulator
• Paper simulation (like we just did)
• Only works for small caches
• Small number of requests (would not do for 1M accesses)
90
Cache Miss Paper Simulation (DM again)
• 4-bit addr, 8B cache, 2B blocks -> 4 sets, already initialized
• Tag, index, offset?
Address | Tag | Index | Offset | Set 0 tag | Set 1 tag | Set 2 tag | Set 3 tag | Result
C 1100 invalid 0 0 1
E 1110
8 1000
3 0011
8 1000
0 0000
8 1000
4 0100
6 0110
91
Cache Miss Paper Simulation
• 4-bit addresses, 8B cache, 2B blocks -> 4 sets
• Tag: 1 bit, Index: 2 bits, Offset: 1 bit
Address Tag Index Offset Set 0 Set 1 Set 2 Set3 Result
C invalid 0 0 1
E
8
3
8
0
8
4
6
92
Cache Miss Paper Simulation
• 8B cache, 2B blocks -> 4 sets (data doesn’t matter!)
• What happens for each request?
Address Tag Index Offset Set 0 Set 1 Set 2 Set3 Result
C invalid 0 0 1
E
8
3
8
0
8
4
6
93
Cache Miss Paper Simulation
• 8B cache, 2B blocks -> 4 sets
Address Tag Index Offset Set 0 Set 1 Set 2 Set3 Result
C 1 2 0 invalid 0 0 1 Miss
E invalid 0 1 1
8
3
8
0
8
4
6
• What happens for each request?
94
Cache Miss Paper Simulation
• 8B cache, 2B blocks -> 4 sets
Address Tag Index Offset Set 0 Set 1 Set 2 Set3 Result
C 1 2 0 invalid 0 0 1 Miss
E 1 3 0 invalid 0 1 1 Hit
8 invalid 0 1 1
3
8
0
8
4
6
• What happens for each request?
95
Cache Miss Paper Simulation
• 8B cache, 2B blocks -> 4 sets
Address Tag Index Offset Set 0 Set 1 Set 2 Set3 Result
C 1 2 0 invalid 0 0 1 Miss
E 1 3 0 invalid 0 1 1 Hit
8 1 0 0 invalid 0 1 1 Miss
3 1 0 1 1
8
0
8
4
6
• What happens for each request?
96
Cache Miss Paper Simulation
• 8B cache, 2B blocks -> 4 sets
Address Tag Index Offset Set 0 Set 1 Set 2 Set3 Result
C 1 2 0 invalid 0 0 1 Miss
E 1 3 0 invalid 0 1 1 Hit
8 1 0 0 invalid 0 1 1 Miss
3 0 1 1 1 0 1 1 Hit
8 1 0 1 1
0
8
4
6
• What happens for each request?
97
Cache Miss Paper Simulation
• 8B cache, 2B blocks -> 4 sets
Address Tag Index Offset Set 0 Set 1 Set 2 Set3 Result
C 1 2 0 invalid 0 0 1 Miss
E 1 3 0 invalid 0 1 1 Hit
8 1 0 0 invalid 0 1 1 Miss
3 0 1 1 1 0 1 1 Hit
8 1 0 0 1 0 1 1 Hit
0 1 0 1 1
8
4
6
• What happens for each request?
98
Cache Miss Paper Simulation
• 8B cache, 2B blocks -> 4 sets
Address Tag Index Offset Set 0 Set 1 Set 2 Set3 Result
C 1 2 0 invalid 0 0 1 Miss
E 1 3 0 invalid 0 1 1 Hit
8 1 0 0 invalid 0 1 1 Miss
3 0 1 1 1 0 1 1 Hit
8 1 0 0 1 0 1 1 Hit
0 0 0 0 1 0 1 1 Miss
8 0 0 1 1
4
6
• What happens for each request?
99
Cache Miss Paper Simulation
• 8B cache, 2B blocks -> 4 sets
Address Tag Index Offset Set 0 Set 1 Set 2 Set3 Result
C 1 2 0 invalid 0 0 1 Miss
E 1 3 0 invalid 0 1 1 Hit
8 1 0 0 invalid 0 1 1 Miss
3 0 1 1 1 0 1 1 Hit
8 1 0 0 1 0 1 1 Hit
0 0 0 0 1 0 1 1 Miss
8 1 0 0 0 0 1 1 Miss
4 1 0 1 1
6
• What happens for each request?
100
Cache Miss Paper Simulation
• 8B cache, 2B blocks -> 4 sets
Address Tag Index Offset Set 0 Set 1 Set 2 Set3 Result
C 1 2 0 invalid 0 0 1 Miss
E 1 3 0 invalid 0 1 1 Hit
8 1 0 0 invalid 0 1 1 Miss
3 0 1 1 1 0 1 1 Hit
8 1 0 0 1 0 1 1 Hit
0 0 0 0 1 0 1 1 Miss
8 1 0 0 0 0 1 1 Miss
4 0 2 0 1 0 1 1 Miss
6 1 0 0 1
• What happens for each request?
101
Cache Miss Paper Simulation
• 8B cache, 2B blocks -> 4 sets
Address Tag Index Offset Set 0 Set 1 Set 2 Set3 Result
C 1 2 0 invalid 0 0 1 Miss
E 1 3 0 invalid 0 1 1 Hit
8 1 0 0 invalid 0 1 1 Miss
3 0 1 1 1 0 1 1 Hit
8 1 0 0 1 0 1 1 Hit
0 0 0 0 1 0 1 1 Miss
8 1 0 0 0 0 1 1 Miss
4 0 2 0 1 0 1 1 Miss
6 0 3 0 1 0 0 1 Miss
• What happens for each request?
102
Cache Miss Paper Simulation
• %miss: 6 / 9 ≈ 67% • Not good...
• How could we improve it?
[Results column: Miss, Hit, Miss, Hit, Hit, Miss, Miss, Miss, Miss]
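The paper simulation can be checked mechanically; this sketch encodes the slide's initial state (set 0 invalid; sets 1-3 holding tags 0, 0, 1):

```python
def simulate_dm_cache(addresses, num_sets, offset_bits, initial_tags):
    """Direct-mapped cache simulation: one tag per set, fill on miss."""
    index_bits = num_sets.bit_length() - 1
    tags = list(initial_tags)              # None means invalid
    results = []
    for addr in addresses:
        index = (addr >> offset_bits) & (num_sets - 1)
        tag = addr >> (offset_bits + index_bits)
        if tags[index] == tag:
            results.append("Hit")
        else:
            results.append("Miss")
            tags[index] = tag              # fill
    return results

accesses = [0xC, 0xE, 0x8, 0x3, 0x8, 0x0, 0x8, 0x4, 0x6]
r = simulate_dm_cache(accesses, num_sets=4, offset_bits=1,
                      initial_tags=[None, 0, 0, 1])
assert r == ["Miss", "Hit", "Miss", "Hit", "Hit",
             "Miss", "Miss", "Miss", "Miss"]     # 6/9 misses
```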
103
Capacity and Performance
• Simplest way to reduce %miss: increase capacity + Miss rate decreases monotonically
• “Working set”: instructions/data program is actively using
– thit increases
• tavg ?
• Given capacity, manipulate %miss by changing organization
[Graph: %miss falls as cache capacity grows, dropping steeply around the “working set” size]
104
Block Size
• One possible re-organization: increase block size + Exploit spatial locality
– Caveat: increase conflicts too
– Increases thit: need word select mux
• By a little, not too bad
+ Reduce tag overhead
[Figure: with 8B blocks, index bits are [11:3] and bit [2] selects the word within the block; tag bits [31:12] are compared for hit]
105
Tag Overhead
• “4KB cache” means cache holds 4KB of data (capacity) • Tag storage is considered overhead
• Valid bit usually not counted
• Tag overhead = tag size / data size
• 4KB cache with 4B blocks? • 4B blocks → 2-bit offset
• 4KB cache / 4B blocks → 1024 blocks → 10-bit index
• 32-bit address – 2-bit offset – 10-bit index = 20-bit tag
• 20-bit tag / 32-bit block = 63% overhead
• (plus 1 comparator – not bad, would be a lot worse with Fully associative design!)
106
Block Size and Tag Overhead
• 4KB cache with 1024 4B blocks? • 4B blocks → 2-bit offset, 1024 frames → 10-bit index
• 32-bit address – 2-bit offset – 10-bit index = 20-bit tag
• 20-bit tag / 32-bit block = 63% overhead
• 4KB cache with 512 8B blocks • 8B blocks → 3-bit offset, 512 frames → 9-bit index
• 32-bit address – 3-bit offset – 9-bit index = 20-bit tag
• 20-bit tag / 64-bit block = 32% overhead
• Notice: tag size is same, but data size is twice as big
• A realistic example: 64KB cache with 64B blocks • 16-bit tag / 512-bit block = ~3% overhead
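These overheads can be computed with a small helper (assuming 32-bit addresses and not counting the valid bit, as above):

```python
def tag_overhead(cache_bytes, block_bytes, addr_bits=32):
    """Tag bits / data bits per block (valid bit not counted)."""
    offset_bits = (block_bytes - 1).bit_length()            # log2(block size)
    index_bits = (cache_bytes // block_bytes - 1).bit_length()  # log2(#blocks)
    tag_bits = addr_bits - offset_bits - index_bits
    return tag_bits / (8 * block_bytes)

assert tag_overhead(4 * 1024, 4) == 20 / 32      # 62.5%
assert tag_overhead(4 * 1024, 8) == 20 / 64      # 31.25%
assert tag_overhead(64 * 1024, 64) == 16 / 512   # ~3%
```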
107
Cache Miss Paper Simulation
• 8B cache, 4B blocks -> 2 sets
Address Tag Index Offset Set 0 Set 1 Result
C 1 1 0 invalid 0 Miss
E 1 1 2 invalid 1 Hit
8 1 0 0 invalid 1 Miss
3 0 0 3 1 1 Miss
8 1 0 0 0 1 Miss
0 0 0 0 1 1 Miss
8 1 0 0 0 1 Miss
4 0 1 0 1 1 Miss
6 0 1 2 1 0 Hit
• 8,3: new conflicts (fewer sets)
• 4,6: spatial locality (now in same set)
108
Block Size and Miss Rate Redux
+ Bigger Block: Spatial prefetching • For blocks with adjacent addresses
• Turns miss/miss pairs into miss/hit pairs
• Example: 4, 6
– Conflicts • For blocks with non-adjacent addresses (but in adjacent frames)
• Turns hits into misses by disallowing simultaneous residence
• Example: 8, 3
• Both effects always present to some degree • Spatial prefetching dominates initially (until 64–128B)
• Conflicts dominate afterwards
• Optimal block size is 32–256B (varies across programs)
• Typical: 64B
[Graph: %miss vs. block size – falls while spatial prefetching dominates, then rises as conflicts dominate]
109
Block Size and Miss Penalty
• Does increasing block size increase tmiss? • Don’t larger blocks take longer to read, transfer, and fill?
• They do, but…
• tmiss of an isolated miss is not affected • Critical Word First / Early Restart (CWF/ER)
• Requested word fetched first, pipeline restarts immediately
• Remaining words in block transferred/filled in the background
• tmiss’es of a cluster of misses will suffer • Reads/transfers/fills of two misses cannot be overlapped
• Latencies start to pile up
• This is technically a bandwidth problem (more later)
110
Cache Miss Paper Simulation
• 8B cache, 4B blocks -> 2 sets
Address Tag Index Offset Set 0 Set 1 Result
C 1 1 0 invalid invalid Miss
E 1 1 2 invalid 1 Hit
8 1 0 0 invalid 1 Miss
3 0 0 3 1 1 Miss
8 1 0 0 0 1 Miss
0 0 0 0 1 1 Miss
8 1 0 0 0 1 Miss
4 0 1 0 1 1 Miss
6 0 1 2 1 0 Hit
• 8 (1000) and 0 (0000): same set for any $ < 16B
• Can we do anything about this?
111
Associativity
• New organizational dimension: Associativity • Block can reside in one of few frames
• Frame groups called sets
• Each frame in set called a way
• This is 2-way set-associative (SA)
• 1-way → direct-mapped (DM)
• 1-set → fully-associative (FA)
• Lookup algorithm • Use index bits to find set
• Read data/tags in all frames in parallel
• Any (match && valid bit) ? Hit : Miss
[Figure: 2-way set-associative lookup: index bits [10:2] select the set, tag bits [31:11] compared against both ways in parallel]
112
Cache Behavior 2-ways
Set #
Way 0 Way 1
V Tag Data V Tag Data
0 0 000 00 00 00 00 0 000 00 00 00 00
1 0 000 00 00 00 00 0 000 00 00 00 00
2 0 000 00 00 00 00 0 000 00 00 00 00
3 0 000 00 00 00 00 0 000 00 00 00 00
Cache: 4 sets, 2 ways, 4B blocks
113
Cache Behavior 2-ways
Set #
Way 0 Way 1
V Tag Data V Tag Data
0 0 000 00 00 00 00 0 000 00 00 00 00
1 0 000 00 00 00 00 0 000 00 00 00 00
2 0 000 00 00 00 00 0 000 00 00 00 00
3 0 000 00 00 00 00 0 000 00 00 00 00
Access address 0x1234 = 0001 0010 0011 0100 Offset = 0
Index = 1 Tag = 123
Miss. Request from next level. Wait...
114
Cache Behavior 2-ways
Set #
Way 0 Way 1
V Tag Data V Tag Data
0 0 000 00 00 00 00 0 000 00 00 00 00
1 1 123 0F 1E 39 EC 0 000 00 00 00 00
2 0 000 00 00 00 00 0 000 00 00 00 00
3 0 000 00 00 00 00 0 000 00 00 00 00
Access address 0x2234 = 0010 0010 0011 0100 Offset = 0
Index = 1 Tag = 223
Miss. Request from next level. Wait...
115
Cache Behavior 2-ways
Set #
Way 0 Way 1
V Tag Data V Tag Data
0 0 000 00 00 00 00 0 000 00 00 00 00
1 1 123 0F 1E 39 EC 1 223 01 CF D0 87
2 0 000 00 00 00 00 0 000 00 00 00 00
3 0 000 00 00 00 00 0 000 00 00 00 00
Hit. In Way 0
Access address 0x1234 = 0001 0010 0011 0100 Offset = 0
Index = 1 Tag = 123
116
Cache Miss Paper Simulation
• 8B cache, 2B blocks, 2 ways -> 2 sets
Address Tag Index Offset Set0-Way0 Set0-Way1 Set1-Way0 Set1-Way1 Result
C 3 0 0 inv inv inv inv Miss
E 3 1 0 3 inv inv inv Miss
8 2 0 0 3 inv 3 inv Miss
3 0 1 1 3 2 3 inv Miss
8 2 0 0 3 2 3 0 Hit
0 0 0 0 3 2 3 0 Miss
8 2 0 0 0 2 3 0 Hit
4 1 0 0 0 2 3 0 Miss
6 1 1 0 1 2 3 0 Miss
(tags shown before each access; LRU replacement; "inv" = invalid)
• What happens for each request?
117
Cache structure math summary
• Given capacity, block_size, ways (associativity), and word_size.
• Cache parameters:
• num_frames = capacity / block_size
• sets = num_frames / ways = capacity / block_size / ways
• Address bit fields:
• offset_bits = log2(block_size)
• index_bits = log2(sets)
• tag_bits = word_size - index_bits - offset_bits
• Numeric way to get offset/index/tag from address:
• block_offset = addr % block_size
• index = (addr / block_size) % sets
• tag = addr / (sets*block_size)
118
Replacement Policies
• Set-associative caches present a new design choice • On cache miss, which block in set to replace (kick out)?
• Belady’s (oracle): block that will be used furthest in future
• Random
• FIFO (first-in first-out)
• LRU (least recently used) • Fits with temporal locality, LRU = least likely to be used in future
• NMRU (not most recently used) • An easier to implement approximation of LRU
• Equal to LRU for 2-way SA caches
119
NMRU Implementation
• Add MRU field to each set • MRU data is encoded “way”
• Hit? update MRU
• Fill? write enable ~MRU (in 2-way)
• Need to pick 1 of the (n-1) non-MRU ways for write enable if more than 2 ways
[Figure: NMRU lookup: the MRU field inverts (~) to write-enable (WE) the non-MRU way on a fill]
120
Associativity And Performance
• The associativity game + Higher associative caches have lower %miss
– thit increases
• But not much for low associativities (2,3,4,5)
• tavg?
• Block-size and number of sets should be powers of two • Makes indexing easier (just rip bits out of the address)
• 5-way set-associativity? No problem (but powers of 2 still very common)
[Figure: %miss vs. associativity: benefit flattens around 5 ways]
121
Full Associativity
• How to implement full (or at least high) associativity? • This way is terribly inefficient
• 1K matches are unavoidable, but 1K data reads + 1K-to-1 mux?
[Figure: brute-force fully-associative lookup: one comparator per frame on address bits [31:2]]
122
Full-Associativity with CAMs
• CAM: content-addressable memory • Array of words with built-in comparators
• Input is data (tag)
• Output is 1H encoding of matching slot
• Fully associative cache • Tags as CAM, data as RAM
• Effective but expensive (EE reasons)
• Upshot: used for 16-/32-way associativity
– No good way to build 1024-way associativity
+ No real need for it, either
[Figure: CAM-based fully-associative cache, tag = addr[31:2]: "look mom, no index bits"]
123
CAM -> Content Addressable Memory
• Input: Data to match • (ex on left: 3 bits)
• Output: matching entries • (ex on left: 4 entries)
• Will not be tested on these electrical details of CAMs, but basic idea of a CAM is fair game!
[Figure: 4-entry, 3-bit CAM: data in, match lines out]
124
CAM circuit
[Figure: 3-bit CAM cell: complemented bit lines ~B/B, match line precharged from Vcc]
• CAM match port looks different from RAM r/w port
• Cells look similar • Note: Bit stored on right, ~Bit on left (opposite of inputs)
• Step 1: Precharge match line to 1 (first half of cycle)
• Step 2: Send data/~data down bit lines • Two 1s on same side (bit line != data) open NMOS path -> gnd
• Drains match line 1->0
• Note that if all bits match, each side has a 1 and a 0
• One NMOS in the path from Match -> Gnd is closed
• No conductive path -> Match keeps its charge @ 1
131
CAMs: Slow and High Power..
[Figure: same CAM cell as above]
• CAMs are slow and high power
• Pre-charge all, discharge most match lines every search
• Pre-charge + discharge take time: capacitive load of match line
• Bit lines have high capacitive load: driving 1 transistor per row
132
ABC
• Capacity + Decreases capacity misses
– Increases thit
• Associativity + Decreases conflict misses
– Increases thit
• Block size – Increases conflict misses
+ Decreases compulsory misses
± Increases or decreases capacity misses
• Little effect on thit, may exacerbate tmiss
• How much they help depends...
133
Different Problems -> Different Solutions
• Suppose we have a 16B, direct-mapped cache w/ 4B blocks • 4 sets
• Examine some access patterns and think about what would help
• Misses marked with *
• Access pattern A: • As is: 0*, 2, 4*, 6, 8*, 10, 12*, 14, 16*, 18, 20*, 22, 24*, 26
• 8B blocks? 0*, 2, 4, 6, 8*, 10, 12, 14, 16*, 18, 20, 22, 24*, 26
• 2-way assoc? 0*, 2, 4*, 6, 8*, 10, 12*, 14, 16*, 18, 20*, 22, 24*, 26
• Access pattern B: • As is: 0*, 128*, 1*, 129*, 2*, 130*, 3*, 131*, 4*, 132*, 5*, 133*, 6*
• 8B blocks? 0*, 128*, 1*, 129*, 2*, 130*, 3*, 131*, 4*, 132*, 5*, 133*, 6*
• 2-way assoc? 0*, 128*, 1, 129, 2, 130, 3, 131, 4*, 132*, 5, 133, 6
• Access pattern C (All 3): • 0,20,40,60,48,36,24,12,1,21,41,61,49,37,25,13,2,22,42,62,50,38,…
134
Analyzing Misses: 3C Model (Hill)
• Divide cache misses into categories based on cause • Compulsory: block size is too small (i.e., address not seen before)
• Capacity: capacity is too small
• Conflict: associativity is too low
135
Different Problems -> Different Solutions
• Access pattern A: Compulsory misses • 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26
• For misses, have not accessed that block
• Size/associativity won’t help (never had it)
• Larger block -> include more data in one block -> more hits
• Recognizing compulsory misses • Never seen the block before
136
Different Problems -> Different Solutions
• Access pattern B: Conflict misses • 0, 128, 1, 129, 2, 130, 3, 131, 4, 132, 5, 133, 6
• 0 and 128 map to same set (set 0): kick each other out (“conflict”)
• Larger block? No help
• Larger cache? Only helps if MUCH larger (256 B instead of 16B)
• Higher associativity? Fixes problem
• Can have both 0 and 128 in set 0 at same time (different ways)
• Recognizing conflict misses: • Count unique blocks between last access and miss (inclusive)
• Number of unique blocks <= number of blocks in cache? Conflict
• Enough space to hold them all...
• Just must be having set conflict
137
Different Problems -> Different Solutions
• Access pattern C: Capacity Misses • 0,20,40,60,48,36,24,12,1,21,41,61,49,37,25,13,2,22,42,62,50,38,…
• Larger block size? No help
• Even 16B block (entire cache) won’t help
• Associativity? No help... even at full assoc
• After 0, 20, 40, 60: kick out 0 for 48
• Kick out 20 for 36
• Kick out 40 for 24...
• Solution: make cache larger
• Doubling cache size turns almost all misses into hits
• A few compulsory misses remain
• 0*,20*,40*,60*,48*,36*,24*,12*,1,21,41,61,49,37,25,13,2,22,42,62,50,38,… (misses marked *)
• Recognizing Capacity Misses • Count unique blocks between last access and miss (inclusive)
• Number of unique blocks > number of blocks in cache? Capacity
• Just can’t hold them all
138
Miss Categorization Flow Chart
• Seen same block before?
• No -> Compulsory
• Yes -> compare # unique blocks referenced (between its last access and this miss, inclusive) to the number the cache can hold:
• # referenced <= # cache can hold -> Conflict
• # referenced > # cache can hold -> Capacity
139
ABC
• Capacity + Decreases capacity misses
– Increases thit
• Associativity + Decreases conflict misses
– Increases thit
• Block size – Increases conflict misses
+ Decreases compulsory misses
± Increases or decreases capacity misses
• Little effect on thit, may exacerbate tmiss
• How much they help depends...
140
Two Optimizations
• Victim buffer: for conflict misses • Technically: reduces tmiss for these misses, doesn’t eliminate them
• Depends how you do your accounting
• Prefetching: for capacity/compulsory misses
141
Victim Buffer
• Conflict misses: not enough associativity • High associativity is expensive, but also rarely needed
• 3 blocks mapping to same 2-way set and accessed (XYZ)+
• Victim buffer (VB): small FA cache (e.g., 4 entries) • Small so very fast
• Blocks kicked out of cache placed in VB
• On miss, check VB: hit ? Place block back in cache
• 4 extra ways, shared among all sets
+ Only a few sets will need it at any given time
• On cache fill path: reduces tmiss, no impact on thit
+ Very effective in practice
[Figure: victim buffer (VB) alongside the $-to-next-level-$ path]
142
Prefetching
• Prefetching: put blocks in cache proactively/speculatively • In software: insert prefetch (non-binding load) insns into code
• In hardware: cache controller generates prefetch addresses
• Keys: anticipate upcoming miss addresses accurately • Timeliness: initiate prefetches sufficiently in advance
• But not so far in advance that it kicks out good stuff
• Accuracy: don’t evict useful data
• Prioritize handling real misses over prefetches
• Simple algorithm: next block prefetching • Miss address X → prefetch address X+block_size
• Works for instructions: sequential execution
• What about non-sequential execution?
• Works for data: arrays
• What about other data-structures?
• Address prediction is actively researched area
[Figure: cache controller (cc) issuing prefetches between $ and next-level-$]
143
Write Issues
• So far we have looked at reading from cache • Insn fetches, loads
• What about writing into cache • Stores, not an issue for insn caches (why they are simpler)
• Several new issues • Must read tags first before writing data
• Cannot be in parallel
• Cache may have dirty data
• Data which has been updated in this cache, but not lower levels
• Must be written back to lower level before eviction
144
Recall Data Memory Stage of Datapath
• So far, have just assumed D$ in Memory Stage... • Actually a bit more complex for a couple reasons...
[Figure: L1 D$ in the memory stage of the datapath]
145
Problem with Writing #1: Store Misses
• Load instruction misses D$: • Have to stall datapath
• Need missing data to complete instruction
• (Fancier: stall at first consumer rather than load)
• Store instruction misses D$: • Stall?
• Would really like not to
• Store is writing the data
• Need rest of block because we cannot have part of a block
• Generally do not support “these bytes are valid, those are not”
• How to avoid?
146
Problem with Writing #2: Serial Tag/Data Access
• Load can read tags/data in parallel • Read both SRAMs
• Compare Tags -> Select proper way (if any)
• Stores cannot write tags/data in parallel • Read tags/write data array at same time??
• How to know which way?
• Or even if it's a hit?
• Incorrect guess -> overwrote data from somewhere else..
• Multi-cycle data-path: • Stores take an extra cycle? Increase CPI
• Pipelined data-path: • Tags in one stage, Data in the next?
• Works for stores, but loads serialize tags/data -> higher CPI
147
Store Buffer
• Stores write into a store buffer • Holds address, size, and data of stores
[Figure: store buffer in front of L1 D$]
148
Store Buffer
• Stores write into a store buffer • Holds address, size, and data of stores
• Store data written from store buffer into cache
• Miss? Data stays in buffer until hit
149
Store Buffer
• Loads search store buffer for matching store • Match? Forward data from the store
• No match: Use data from D$
• Addresses are CAM: allow search for match
150
Store Buffer
• How does this resolve our issues?
• Problem with Writing #1: Store misses • Stores write to store buffer and are done
• FSM writes stores into D$ from store buffer
• Misses stall store buffer -> D$ write (but not pipeline)
• Pipeline will stall on full store buffer
• Problem with Writing #2: Tags -> Data • FSM that writes stores to D$ can check tags... then write data
• Decoupled from data path’s normal execution
• Can happen whenever loads are not using the D$
151
Write Propagation
• When to propagate new value to (lower level) memory?
• Write-thru: immediately – Requires additional bus bandwidth
• Not common
• Write-back: when block is replaced • Blocks may be dirty now
• Dirty bit (in tag array)
• Cleared on fill
• Set by a store to the block
152
Write Back: Dirty Misses
• Writeback caches may have dirty misses: • Victim block (one to be replaced) is dirty
• Must first writeback to next level
• Then request data for miss
• Slower :(
• Solution:
• Add a buffer on back side of cache: writeback buffer
• Small full associative buffer, holds a few lines
• Request miss data immediately
• Put dirty line in WBB
• Writeback later
[Figure: writeback buffer (WBB) between $ and next-level-$]
153
What this means to the programmer
• If you’re writing code, you want good performance.
• The cache is crucial to getting good performance.
• The effect of the cache is influenced by the order of memory accesses.
CONCLUSION:
The programmer can change the order of memory accesses to improve performance!
154
Cache performance matters!
• A HUGE component of software performance is how it interacts with cache
• Example:
Assume that x[i][j] is stored next to x[i][j+1] in memory (“row major order”).
Which will have fewer cache misses?
A:
for (k = 0; k < 100; k++)
for (j = 0; j < 100; j++)
for (i = 0; i < 5000; i++)
x[i][j] = 2 * x[i][j];
B:
for (k = 0; k < 100; k++)
for (i = 0; i < 5000; i++)
for (j = 0; j < 100; j++)
x[i][j] = 2 * x[i][j];
Adapted from Lebeck and Porter (creative commons)
155
Blocking (Tiling) Example
/* Before */
for(i = 0; i < SIZE; i++)
for (j = 0; j < SIZE; j++)
for (k = 0; k < SIZE; k++)
c[i][j] = c[i][j] + a[i][k]*b[k][j];
• Two Inner Loops: • Read all NxN elements of b[ ] (N = SIZE)
• Read N elements of 1 row of a[ ] repeatedly
• Write N elements of 1 row of c[ ]
• Capacity Misses a function of N & Cache Size: • If 3 NxN matrices fit in cache => no capacity misses; otherwise ...
• Idea: compute on BxB submatrix that fits
Adapted from Lebeck and Porter (creative commons)
156
Blocking (Tiling) Example
/* After */
for(ii = 0; ii < SIZE; ii += B)
for (jj = 0; jj < SIZE; jj += B)
for (kk = 0; kk < SIZE; kk +=B)
for(i = ii; i < MIN(ii+B,SIZE); i++)
for (j = jj; j < MIN(jj+B,SIZE); j++)
for (k = kk; k < MIN(kk+B,SIZE); k++)
c[i][j] = c[i][j] + a[i][k]*b[k][j];
• Capacity Misses decrease: from 2N^3 + N^2 to 2N^3/B + N^2
• B called Blocking Factor (Also called Tile Size)
Adapted from Lebeck and Porter (creative commons)
157
Hilbert curves: A fancy trick for matrix locality
• Turn a 1D value into an n-dimensional “walk” of a cube space (like a 2D or 3D matrix) in a manner that maximizes locality
• Extra overhead to compute curve path, but computation takes no memory, and cache misses are very expensive, so it may be worth it
• (Actual algorithm for these curves is simple and easy to find)
158
Brief History of DRAM
• DRAM (memory): a major force behind computer industry • Modern DRAM came with introduction of IC (1970)
• Preceded by magnetic “core” memory (1950s)
• More closely resembles today’s disks than memory
• And by mercury delay lines before that (ENIAC)
• Re-circulating vibrations in mercury tubes
“the one single development that put computers on their feet was the
invention of a reliable form of memory, namely the core memory… Its cost was reasonable, it was reliable, and because it was reliable it could in due course be made large”
Maurice Wilkes
Memoirs of a Computer Pioneer, 1985
159
SRAM
• SRAM: “6T” cells • 6 transistors per bit
• 4 for the cross-coupled inverters (CCI)
• 2 access transistors
• Static • CCIs hold state
• To read • Equalize, swing, amplify
• To write • Overwhelm
[Figure: SRAM array: address decoder drives word lines; data/~data bit-line pairs feed sense amps (SA)]
160
DRAM
• DRAM: dynamic RAM • Bits as capacitors
• Transistors as ports
• “1T” cells: one access transistor per bit
• “Dynamic” means • Capacitors not connected to pwr/gnd
• Stored charge decays over time
• Must be explicitly refreshed
• Designed for density + ~6–8X denser than SRAM
– But slower too
[Figure: DRAM array: 1T cells, one data bit line per column, sense amps (SA)]
161
DRAM Read (simplified version)
• Bit line pre-charged to 0.5 (think: pipe half full)
• Storage at 1 (think: tank full of water)
[Figure: stored value = 1, bit line = 0.5]
162
DRAM Read (simplified version)
• Bit-line and capacitor equalize • Think: opening valve between pipe + tank
• Settle out a bit above 0.5 if 1 was stored • A bit less if 0 was stored
[Figure: stored value = 0.55, bit line = 0.55]
163
DRAM Read (simplified version)
• Destroyed the stored value in the process • Could not read this again: change too small to detect
[Figure: stored value = 0.55, bit line = 0.55]
164
DRAM Operation I
• Sense amps detect small swing • Amplify into 0 or 1
• This read: very slow • Why? No Vcc/Gnd connection in storage
• Need to deal with destructive reads: • Might want to read again...
• Also need to be able to write
165
DRAM Operation I
• Add some d-latches (row buffer) • Ok to use d-latches, not DFFs
• No path from output->input when enabled
• Also add a tri-state path back • From the d-latch to the bit-line
• Can drive the output of the d-latch onto bit lines
• After we read, drive the value back
• “Refill” (or re-empty) the capacitor
[Figure: DRAM array with sense amps (SA) and row-buffer d-latches (DL)]
166
DRAM Read (better version)
• SA amplifies 0.55 -> 1
• DL is enabled: latches the 1
• Tri-state disabled
[Figure: stored value = 0.55, bit line = 0.55; SA output = 1, DL output = 1, tri-state output = Z]
167
DRAM Read (better version)
• Enable tri-state • Drives 1 back up bit-line
[Figure: stored value = 0.55, bit line = 0.55; SA output = 1, DL output = 1, tri-state output = 1]
168
DRAM Read (better version)
• Enable tri-state • Drives 1 back up bit-line
• Starts to push value back up towards 1 (takes time)
[Figure: stored value = 0.75, bit line = 0.75; SA output = 1, DL output = 1, tri-state output = 1]
169
DRAM Read (better version)
• Enable tri-state • Drives 1 back up bit-line
• Starts to push value back up towards 1 (takes time)
• Eventually restores value.
[Figure: stored value = 1, bit line = 1; SA output = 1, DL output = 1, tri-state output = 1]
170
DRAM Operation
• Open row (read bits -> row buffer)
• Read “columns” • Mux selects right part of RB
• Send data on bus -> processor
• Write “columns” • Change values in the row buffer's d-latches
• May read/write multiple columns
• Close row • Close access transistors
• Pre-charge bit lines
• Row must remain open long enough
• Must fully restore capacitors
[Figure: DRAM bank: row address opens a row of the bit array into the row buffer (through the SAs); column address muxes read/write data]
171
DRAM Refresh
• DRAM periodically refreshes all contents • Loops through all rows
• Open row (read -> RB)
• Leave row open long enough
• Close row
• 1–2% of DRAM time occupied by refresh
172
Aside: Non-Volatile CMOS Storage
• Before we leave the subject of CMOS storage technology…
• Another important kind: flash • “Floating gate”: not connected to any conductor/semi-conductor
• Quantum tunneling involved in writing it
• Effectively no leakage (key feature)
• Non-volatile: remembers state when power is off
• Slower than DRAM
• Wears out with writes
• Eventually writes just do not work
173
Memory Bus
• Memory bus: connects CPU package with main memory • Has its own clock
• Typically slower than CPU internal clock: 100–500MHz vs. 3GHz
• Synchronous DRAM (SDRAM) used in “real” main memories operates on this clock
• Is often itself internally pipelined
• Clock implies bandwidth: 100MHz → start new transfer every 10ns
• Clock doesn’t imply latency: 100MHz does not mean a transfer takes 10ns
• DRAM is slower than this but can pipeline multiple accesses
• Bandwidth is more important: determines peak performance
174
Memory Latency and Bandwidth
• Nominal clock frequency applies to CPU and caches
• Careful when doing calculations • Clock frequency increases don’t reduce memory or bus latency
• May make misses come out faster
• At some point memory bandwidth may become a bottleneck
• Further increases in clock speed won’t help at all
175
Clock Frequency Example
• Baseline setup • Processor clock: 1GHz.
• 20% loads, 15% stores, 20% branches, 45% ALU
• Branches: 3, ALU/stores: 4, Loads: 4 + tavgL1
• L1 D$: thit = 1 cycle, 10% miss
• L2$: thit = 20 cycles, 5% miss
• Memory: 200 cycles
tavgL2 = 20 + 0.05 * 200 = 30
tavgL1 = 1 + 0.10 * 30 = 4
Average load latency = 4 + 4 = 8
CPI = 0.2 * 8 + 0.15 * 4 + 0.2 * 3 + 0.45 * 4 = 4.6
Performance = 217 MIPS
The clock rate is 1GHz, or 1e9 cycles/second. The CPI is 4.6 cycles/instruction. (1e9 cycles/second) / (4.6 cycles/instruction) = 217,391,304 instructions/second = 217 MIPS
176
Clock Frequency Example
• Baseline setup • Processor clock: 2GHz.
• 20% loads, 15% stores, 20% branches, 45% ALU
• Branches: 3, ALU/stores: 4, Loads: 4 + tavgL1
• L1 D$: thit = 1 cycle, 10% miss
• L2$: thit = 20 cycles, 5% miss
• Memory: 400 cycles
tavgL2 = 20 + 0.05 * 400 = 40
tavgL1 = 1 + 0.10 * 40 = 5
Average load latency = 4 + 5 = 9
CPI = 0.2 * 9 + 0.15 * 4 + 0.2 * 3 + 0.45 * 4 = 4.8
The clock rate is 2GHz, or 2e9 cycles/second. Performance = (2e9 cycles/second) / (4.8 cycles/instruction) = 416,666,666 instructions/second ≈ 417 MIPS (92% speedup, for 100% freq increase)
177
Clock Frequency Example
• Baseline setup • Processor clock: 4GHz.
• 20% loads, 15% stores, 20% branches, 45% ALU
• Branches: 3, ALU/stores: 4, Loads: 4 + tavgL1
• L1 D$: thit = 1 cycle, 10% miss
• L2$: thit = 20 cycles, 5% miss
• Memory: 800 cycles
tavgL2 = 20 + 0.05 * 800 = 60
tavgL1 = 1 + 0.10 * 60 = 7
Average load latency = 4 + 7 = 11
CPI = 0.2 * 11 + 0.15 * 4 + 0.2 * 3 + 0.45 * 4 = 5.2
The clock rate is 4GHz, or 4e9 cycles/second. Performance = (4e9 cycles/second) / (5.2 cycles/instruction) = 769,230,769 instructions/second ≈ 769 MIPS (84% speedup, for 100% freq increase)
178
Actually a Bit Worse..
• Only looked at D$ miss impact • Ignored store misses: assumed storebuffer can keep up
• Also have I$ misses
• At some point, become bandwidth constrained • Effectively makes tmiss go up (think of a traffic jam)
• Also makes things we ignored matter
• Storebuffer may not be able to keep up as well -> store stalls
• Data we previously prefetched may not arrive in time
• Effectively makes %miss go up
179
Clock Frequency and Real Programs
Detailed Simulation Results
- Includes all caches, bandwidth,...
- Has L3 on separate clock
- Real programs
- 2.0 GHz -> 5.0 GHz (150% increase)
hmmer:
- Very low %miss
- Good performance for clock
- 125% speedup
lbm, milc:
- Very high %miss
- Not much performance gained
- lbm: 32%
- milc: 14%
180
Summary
• tavg = thit + %miss * tmiss • thit and %miss in one component? Difficult
• Memory hierarchy • Capacity: smaller, low thit → bigger, low %miss
• 10/90 rule, temporal/spatial locality
• Technology: expensive → cheaper
• SRAM → DRAM → Disk: reasonable total cost
• Organizing a memory component • ABC, write policies
• 3C miss model: how to eliminate misses?
• Technologies: • DRAM, SRAM, Flash