Lecture 2:
Cache: A Quest for Speed
Dr. Eng. Amr T. Abdel-Hamid
Elect 707
Winter 2014
Computer Architecture
Textbook slides: Computer Architecture: A Quantitative Approach, 5th Edition, John L. Hennessy & David A. Patterson, with modifications.
Dr. Amr Talaat, Elect 707, Computer Architecture and Applications
Memory Hierarchy
Why Memory Organization?

[Figure: performance (log scale, 1 to 1000) vs. year, 1980 to 2000, labeled "Moore's Law". The CPU curve improves ~60%/yr while the DRAM curve improves ~7%/yr, so the Processor-Memory Performance Gap grows ~50% / year.]
The Principle of Locality

The Principle of Locality:
Programs access a relatively small portion of the address space at any instant of time.
Two Different Types of Locality:
Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
This is the basic principle used to overcome the memory/processor interaction problem.
What is a cache? Small, fast storage used to improve the average access time to slow memory.
Exploits spatial and temporal locality
In computer architecture, almost everything is a cache!
Registers: a cache on variables
First-level cache: a cache on the second-level cache
Second-level cache: a cache on memory
Memory: a cache on disk (virtual memory)

Hierarchy (faster but smaller at the top, bigger but slower at the bottom):
Proc/Regs
L1-Cache
L2-Cache
Memory
Disk, etc.
Cache Definitions
Hit: data appears in some block in the upper level (example: Block X)
Hit Rate: the fraction of memory accesses found in the upper level
Hit Time: time to access the upper level, which consists of access time + time to determine hit/miss
Miss: data needs to be retrieved from a block in the lower level (Block Y)
Miss Rate = 1 - (Hit Rate)
Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
Hit Time << Miss Penalty
Review: Cache Performance

Average Memory Access Time (AMAT) = Hit Time + Miss Rate x Miss Penalty (ns or clocks)

Split over instruction and data accesses:
AMAT = (Hit Time_Inst + Miss Rate_Inst x Miss Penalty_Inst)
     + (Hit Time_Data + Miss Rate_Data x Miss Penalty_Data)
Cache Performance

• Miss-oriented approach to memory access:
  CPUtime = IC x (CPI_Execution + MemAccess/Inst x Miss Rate x Miss Penalty) x Cycle Time
  CPUtime = IC x (CPI_Execution + MemMisses/Inst x Miss Penalty) x Cycle Time
  – CPI_Execution includes ALU and memory instructions

• Separating out the memory component entirely:
  – AMAT = Average Memory Access Time
  – CPI_AluOps does not include memory instructions
  CPUtime = IC x (AluOps/Inst x CPI_AluOps + MemAccess/Inst x AMAT) x Cycle Time
  AMAT = Hit Time + Miss Rate x Miss Penalty
       = (Hit Time_Inst + Miss Rate_Inst x Miss Penalty_Inst)
       + (Hit Time_Data + Miss Rate_Data x Miss Penalty_Data)
Dr. A
mr T
ala
at
Elect 707
Co
mp
ute
r Arc
hite
ctu
re a
nd
Ap
plic
atio
ns
Impact on Performance
Suppose a processor executes at:
Clock Rate = 200 MHz (5 ns per cycle), ideal (no misses) CPI = 1.1
50% arith/logic, 30% ld/st, 20% control
Suppose that 10% of memory operations get a 50-cycle miss penalty
Suppose that 1% of instructions get the same miss penalty
CPI = ideal CPI + average stalls per instruction
    = 1.1 (cycles/ins)
    + [0.30 (DataMops/ins) x 0.10 (miss/DataMop) x 50 (cycle/miss)]
    + [1 (InstMop/ins) x 0.01 (miss/InstMop) x 50 (cycle/miss)]
    = (1.1 + 1.5 + 0.5) cycles/ins = 3.1
58% of the time the proc is stalled waiting for memory!
Four Questions for Memory Hierarchy Designers
Q1: Where can a block be placed in the upper level? (Block placement)
Q2: How is a block found if it is in the upper level?
Q3: Which block should be replaced on a miss?
Q4: What happens on a write?
Block Placement
Q1: Where can a block be placed in the upper level?
Fully Associative, Set Associative, or Direct Mapped
1 KB Direct Mapped Cache, 32 B Blocks
For a 2^N-byte cache:
The uppermost (32 - N) bits are always the Cache Tag
The lowest M bits are the Byte Select (Block Size = 2^M)

[Figure: a 1 KB direct-mapped cache with 32 B blocks. A 32-bit address splits into Cache Tag (bits 31-10, example 0x50), Cache Index (bits 9-5, example 0x01), and Byte Select (bits 4-0, example 0x00). Each cache line holds a Valid Bit and the Cache Tag, stored as part of the cache "state", plus 32 data bytes (Byte 0-Byte 31 per line; Byte 0-Byte 1023 across the cache).]
Set Associative Cache
N-way set associative: N entries for each Cache Index
N direct-mapped caches operate in parallel
How big is the tag?
Example: two-way set-associative cache
Cache Index selects a "set" from the cache
The two tags in the set are compared to the input in parallel
Data is selected based on the tag result

[Figure: two direct-mapped banks, each with Valid, Cache Tag, and Cache Data columns. The Cache Index selects one set; two comparators match the Adr Tag against both stored tags, their outputs are ORed to produce Hit, and a mux (Sel1/Sel0) picks the matching Cache Block.]
Block Placement
Q2: How is a block found if it is in the upper level?
Index identifies the set of possibilities
Tag on each block
No need to check index or block offset
Increasing associativity shrinks the index and expands the tag

Address layout: Block Address = [ Tag | Index ], followed by the Block Offset
Cache size = Associativity x 2^index_size x 2^offset_size
Q3: Which block should be replaced on a miss?
Easy for Direct Mapped
Set Associative or Fully Associative:
Random
LRU (Least Recently Used)

Miss rates, LRU vs. Random:
         2-way           4-way           8-way
Size     LRU     Ran     LRU     Ran     LRU     Ran
16 KB    5.2%    5.7%    4.7%    5.3%    4.4%    5.0%
64 KB    1.9%    2.0%    1.5%    1.7%    1.4%    1.5%
256 KB   1.15%   1.17%   1.13%   1.13%   1.12%   1.12%
Q4: What happens on a write?
Write-through: all writes update the cache and the underlying memory/cache
Can always discard cached data; the most up-to-date data is in memory
Cache control bit: only a valid bit
Write-back: all writes simply update the cache
Can't just discard cached data; may have to write it back to memory
Cache control bits: both valid and dirty bits
Other advantages:
Write-through:
memory (or other processors) always has the latest data
simpler cache management
Write-back:
much lower bandwidth, since data is often overwritten multiple times
better tolerance to long-latency memory?
Write Buffer for Write Through
A Write Buffer is needed between the Cache and Memory:
Processor: writes data into the cache and the write buffer
Memory controller: writes contents of the buffer to memory
The write buffer is just a FIFO:
Typical number of entries: 4
Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle

[Figure: Processor and Cache on one side, connected through a Write Buffer to DRAM.]
The Cache Design Space
Several interacting dimensions:
cache size
block size
associativity
replacement policy
write-through vs. write-back
The optimal choice is a compromise:
depends on access characteristics
workload
use (I-cache, D-cache, TLB)
depends on technology / cost
Simplicity often wins

[Figure: sketched trade-off curves of performance ("Good" to "Bad") against Associativity, Cache Size, and Block Size ("Less" to "More", "Factor A" vs. "Factor B").]
Review: Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

CPUtime = IC x (CPI_Execution + Memory accesses/Instruction x Miss rate x Miss penalty) x Clock cycle time
Reducing Misses
Classifying Misses: the 3 Cs
Compulsory—The first access to a block is not in the cache, so the block must be brought into the cache. Also called cold-start misses or first-reference misses.
Capacity—If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved.
Conflict—If the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses.
More recently, a 4th "C":
Coherence—Misses caused by cache coherence.
Classify Cache Misses, How?
(1) Infinite cache, fully associative: Compulsory misses
(2) Finite cache, fully associative: Compulsory misses + Capacity misses
(3) Finite cache, limited associativity: Compulsory misses + Capacity misses + Conflict misses
3Cs Miss Rate

[Figure: miss rate per type (0 to 0.14) vs. cache size (1 KB to 128 KB) for 1-way, 2-way, 4-way, and 8-way associativity. The bars split into Conflict, Capacity, and Compulsory components; Compulsory misses are extremely small.]
Reduce the miss rate,
Larger cache
Reduce Misses via Larger Block Size
Reduce Misses via Higher Associativity
Reducing Misses via Victim Cache
Reducing Misses via Pseudo-Associativity
Reducing Misses by HW Prefetching Instr, Data
Reducing Misses by SW Prefetching Data
Reducing Misses by Compiler Optimizations
Reduce Misses via Larger Block Size
Reduce Misses via Higher Associativity

[Figure: the same 3Cs plot (miss rate per type, 0 to 0.14, vs. cache size 1 KB to 128 KB for 1-way through 8-way); higher associativity shrinks the Conflict component.]

2:1 Cache Rule: miss rate of a 1-way associative cache of size X ~= miss rate of a 2-way associative cache of size X/2
Reducing Misses via a "Victim Cache"
Add a small buffer to hold data discarded from the cache
Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct-mapped data cache

[Figure: a fully associative victim cache of four lines, each one cache line of Data with a Tag and Comparator, sitting between the main cache (TAGS/DATA) and the next lower level in the hierarchy.]
Reducing Misses via "Pseudo-Associativity"
How to combine the fast hit time of Direct Mapped with the lower conflict misses of a 2-way SA cache?
Divide the cache: on a miss, check the other half of the cache; if the block is there, it is a pseudo-hit (slow hit)
Drawback: CPU pipelining is hard if a hit takes 1 or 2 cycles
Better for caches not tied directly to the processor (L2)
Used in the MIPS R10000 L2 cache; similar in UltraSPARC

[Timeline: Hit Time < Pseudo Hit Time < Miss Penalty]
Reducing Misses by Hardware Pre-fetching of Instructions & Data
E.g., Instruction Prefetching
Alpha 21064 fetches 2 blocks on a miss
Extra block placed in a "stream buffer"
On a miss, check the stream buffer
Works with data blocks too:
Jouppi [1990]: 1 data stream buffer caught 25% of misses from a 4 KB cache; 4 streams caught 43%
Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of misses from two 64 KB, 4-way set-associative caches
Prefetching relies on having extra memory bandwidth that can be used without penalty
Reducing Misses by Software Prefetching Data
Build special prefetching instructions that cannot cause faults; a form of speculative execution
Data Prefetch as Cache Prefetch: load into the cache (MIPS IV, PowerPC, SPARC v. 9)
Issuing prefetch instructions takes time: is the cost of issuing prefetches < the savings in reduced misses?
Wider superscalar issue reduces the difficulty of finding issue bandwidth
Relies on having extra memory bandwidth that can be used without penalty
Reducing Misses by Compiler Optimizations
McFarling [1989] reduced cache misses by 75% on an 8 KB direct-mapped cache with 4-byte blocks, in software
How?
For instructions:
look at conflicts (using tools they developed)
reorder procedures in memory so as to reduce conflict misses
For data:
Merging Arrays: improve spatial locality by a single array of compound elements vs. 2 arrays
Loop Interchange: change nesting of loops to access data in the order stored in memory
Loop Fusion: combine 2 independent loops that have the same looping and some variables overlap
Merging Arrays Example
/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
  int val;
  int key;
};
struct merge merged_array[SIZE];

Reduces conflicts between val & key; improves spatial locality
Loop Interchange Example
/* Before */
for (k = 0; k < 100; k = k+1)
  for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
      x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
  for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
      x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improved spatial locality
Loop Fusion Example
/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1) {
    a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] + c[i][j];
  }

2 misses per access to a & c vs. one miss per access; improves spatial locality
Reduce the miss penalty,
Read priority over write on miss
Early Restart and Critical Word First on miss
Non-blocking Caches (Hit under Miss, Miss under Miss)
Second Level Cache
Read Priority over Write on Miss

[Figure: CPU with in/out ports connected through a Write Buffer to DRAM (or lower memory).]
Read Priority over Write on Miss
Write-through with write buffers => RAW conflicts with main memory reads on cache misses
If we simply wait for the write buffer to empty, we might increase the read miss penalty (by 50% on the old MIPS 1000)
Check write buffer contents before the read; if there are no conflicts, let the memory access continue
Write-back: want the buffer to hold displaced blocks
Read miss replacing a dirty block
Normal: write the dirty block to memory, then do the read
Instead: copy the dirty block to a write buffer, then do the read, and then do the write
The CPU stalls less since it restarts as soon as the read is done
Early Restart and Critical Word First
Don't wait for the full block to be loaded before restarting the CPU
Early Restart—As soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
Critical Word First—Request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first
Generally useful only with large blocks
Spatial locality is a problem: the CPU tends to want the next sequential word, so it is not clear whether early restart helps
Non-blocking Caches to Reduce Stalls on Misses
A non-blocking cache or lockup-free cache allows the data cache to continue to supply cache hits during a miss
requires F/E bits on registers or out-of-order execution
requires multi-bank memories
"Hit under miss" reduces the effective miss penalty by working during the miss instead of ignoring CPU requests
"Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
Requires multiple memory banks (otherwise it cannot be supported)
The Pentium Pro allows 4 outstanding memory misses
Add a Second-Level Cache
L2 Equations:
AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)
Definitions:
Local miss rate—misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2)
Global miss rate—misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 x Miss Rate_L2)
The Global Miss Rate is what matters
Cache Optimization Summary

Technique                          MR   MP   HT   Complexity
Larger Block Size                  +    -         0
Higher Associativity               +         -    1
Victim Caches                      +              2
Pseudo-Associative Caches          +              2
HW Prefetching of Instr/Data       +              2
Compiler-Controlled Prefetching    +              3
Compiler Reduce Misses             +              0
Priority to Read Misses                 +         1
Early Restart & Critical Word 1st       +         2
Non-Blocking Caches                     +         3
Second-Level Caches                     +         2

(MR = miss rate, MP = miss penalty, HT = hit time)
Next Week
Start the project + Sheet 1 online
Quiz 1 covering Pipelining and Hazards