Lecture 2:
Cache: A Quest for Speed
Dr. Eng. Amr T. Abdel-Hamid
Elect 707
Winter 2014
Computer Architecture
Textbook slides: Computer Architecture: A Quantitative Approach, 5th Edition, John L. Hennessy & David A. Patterson, with modifications.
Dr. Amr Talaat, Elect 707, Computer Architecture and Applications
Memory Hierarchy
Why Memory Organization?

[Figure: performance (log scale, 1 to 1000) vs. year, 1980 to 2000, labeled "Moore's Law". The CPU curve improves ~60%/yr while the DRAM curve improves ~7%/yr, so the Processor-Memory Performance Gap grows ~50% / year.]
The Principle of Locality

The Principle of Locality:
Programs access a relatively small portion of the address space at any instant of time.
Two Different Types of Locality:
Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
This is the basic principle used to overcome the memory/processor interaction problem.
What is a cache? Small, fast storage used to improve the average access time to slow memory.
Exploits spatial and temporal locality
In computer architecture, almost everything is a cache!
Registers: a cache on variables
First-level cache: a cache on the second-level cache
Second-level cache: a cache on memory
Memory: a cache on disk (virtual memory)

Hierarchy (faster but smaller at the top, bigger but slower at the bottom):
Proc/Regs
L1-Cache
L2-Cache
Memory
Disk, etc.
Cache Definitions
Hit: data appears in some block in the upper level (example: Block X)
Hit Rate: the fraction of memory accesses found in the upper level
Hit Time: time to access the upper level, which consists of access time + time to determine hit/miss
Miss: data needs to be retrieved from a block in the lower level (Block Y)
Miss Rate = 1 - (Hit Rate)
Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
Hit Time << Miss Penalty
Review: Cache Performance

Average Memory Access Time (AMAT) = Hit Time + Miss Rate x Miss Penalty (ns or clocks)

Split over instruction and data accesses:
AMAT = (Hit Time_Inst + Miss Rate_Inst x Miss Penalty_Inst)
     + (Hit Time_Data + Miss Rate_Data x Miss Penalty_Data)
Cache Performance

• Miss-oriented approach to memory access:
  CPUtime = IC x (CPI_Execution + MemAccess/Inst x Miss Rate x Miss Penalty) x Cycle Time
  CPUtime = IC x (CPI_Execution + MemMisses/Inst x Miss Penalty) x Cycle Time
  – CPI_Execution includes ALU and memory instructions

• Separating out the memory component entirely:
  – AMAT = Average Memory Access Time
  – CPI_AluOps does not include memory instructions
  CPUtime = IC x (AluOps/Inst x CPI_AluOps + MemAccess/Inst x AMAT) x Cycle Time
  AMAT = Hit Time + Miss Rate x Miss Penalty
       = (Hit Time_Inst + Miss Rate_Inst x Miss Penalty_Inst)
       + (Hit Time_Data + Miss Rate_Data x Miss Penalty_Data)
Dr. A
mr T
ala
at
Elect 707
Co
mp
ute
r Arc
hite
ctu
re a
nd
Ap
plic
atio
ns
Impact on Performance
Suppose a processor executes at:
Clock Rate = 200 MHz (5 ns per cycle), ideal (no misses) CPI = 1.1
50% arith/logic, 30% ld/st, 20% control
Suppose that 10% of memory operations get a 50-cycle miss penalty
Suppose that 1% of instructions get the same miss penalty
CPI = ideal CPI + average stalls per instruction
    = 1.1 (cycles/ins)
    + [0.30 (DataMops/ins) x 0.10 (miss/DataMop) x 50 (cycle/miss)]
    + [1 (InstMop/ins) x 0.01 (miss/InstMop) x 50 (cycle/miss)]
    = (1.1 + 1.5 + 0.5) cycles/ins = 3.1
58% of the time the proc is stalled waiting for memory!
Four Questions for Memory Hierarchy Designers
Q1: Where can a block be placed in the upper level? (Block placement)
Q2: How is a block found if it is in the upper level?
Q3: Which block should be replaced on a miss?
Q4: What happens on a write?
Block Placement
Q1: Where can a block be placed in the upper level?
Fully Associative, Set Associative, or Direct Mapped
1 KB Direct Mapped Cache, 32 B Blocks
For a 2^N-byte cache:
The uppermost (32 - N) bits are always the Cache Tag
The lowest M bits are the Byte Select (Block Size = 2^M)

[Figure: a 1 KB direct-mapped cache with 32 B blocks. A 32-bit address splits into Cache Tag (bits 31-10, example 0x50), Cache Index (bits 9-5, example 0x01), and Byte Select (bits 4-0, example 0x00). Each cache line holds a Valid Bit and the Cache Tag, stored as part of the cache "state", plus 32 data bytes (Byte 0-Byte 31 per line; Byte 0-Byte 1023 across the cache).]
Set Associative Cache
N-way set associative: N entries for each Cache Index
N direct-mapped caches operate in parallel
How big is the tag?
Example: two-way set-associative cache
Cache Index selects a "set" from the cache
The two tags in the set are compared to the input in parallel
Data is selected based on the tag result

[Figure: two direct-mapped banks, each with Valid, Cache Tag, and Cache Data columns. The Cache Index selects one set; two comparators match the Adr Tag against both stored tags, their outputs are ORed to produce Hit, and a mux (Sel1/Sel0) picks the matching Cache Block.]
Block Placement
Q2: How is a block found if it is in the upper level?
Index identifies the set of possibilities
Tag on each block
No need to check index or block offset
Increasing associativity shrinks the index and expands the tag

Address layout: Block Address = [ Tag | Index ], followed by the Block Offset
Cache size = Associativity x 2^index_size x 2^offset_size
Q3: Which block should be replaced on a miss?
Easy for Direct Mapped
Set Associative or Fully Associative:
Random
LRU (Least Recently Used)

Miss rates, LRU vs. Random:
         2-way           4-way           8-way
Size     LRU     Ran     LRU     Ran     LRU     Ran
16 KB    5.2%    5.7%    4.7%    5.3%    4.4%    5.0%
64 KB    1.9%    2.0%    1.5%    1.7%    1.4%    1.5%
256 KB   1.15%   1.17%   1.13%   1.13%   1.12%   1.12%
Q4: What happens on a write?
Write-through: all writes update the cache and the underlying memory/cache
Can always discard cached data; the most up-to-date data is in memory
Cache control bit: only a valid bit
Write-back: all writes simply update the cache
Can't just discard cached data; may have to write it back to memory
Cache control bits: both valid and dirty bits
Other advantages:
Write-through:
memory (or other processors) always has the latest data
simpler cache management
Write-back:
much lower bandwidth, since data is often overwritten multiple times
better tolerance to long-latency memory?
Write Buffer for Write Through
A Write Buffer is needed between the Cache and Memory:
Processor: writes data into the cache and the write buffer
Memory controller: writes contents of the buffer to memory
The write buffer is just a FIFO:
Typical number of entries: 4
Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle

[Figure: Processor and Cache on one side, connected through a Write Buffer to DRAM.]
The Cache Design Space
Several interacting dimensions:
cache size
block size
associativity
replacement policy
write-through vs. write-back
The optimal choice is a compromise:
depends on access characteristics
workload
use (I-cache, D-cache, TLB)
depends on technology / cost
Simplicity often wins

[Figure: sketched trade-off curves of performance ("Good" to "Bad") against Associativity, Cache Size, and Block Size ("Less" to "More", "Factor A" vs. "Factor B").]
Review: Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

CPUtime = IC x (CPI_Execution + Memory accesses/Instruction x Miss rate x Miss penalty) x Clock cycle time
Reducing Misses
Classifying Misses: the 3 Cs
Compulsory—The first access to a block is not in the cache, so the block must be brought into the cache. Also called cold-start misses or first-reference misses.
Capacity—If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved.
Conflict—If the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses.
More recently, a 4th "C":
Coherence—Misses caused by cache coherence.
Classify Cache Misses, How?
(1) Infinite cache, fully associative: Compulsory misses
(2) Finite cache, fully associative: Compulsory misses + Capacity misses
(3) Finite cache, limited associativity: Compulsory misses + Capacity misses + Conflict misses
3Cs Miss Rate

[Figure: miss rate per type (0 to 0.14) vs. cache size (1 KB to 128 KB) for 1-way, 2-way, 4-way, and 8-way associativity. The bars split into Conflict, Capacity, and Compulsory components; Compulsory misses are extremely small.]
Reduce the miss rate,
Larger cache
Reduce Misses via Larger Block Size
Reduce Misses via Higher Associativity
Reducing Misses via Victim Cache
Reducing Misses via Pseudo-Associativity
Reducing Misses by HW Prefetching Instr, Data
Reducing Misses by SW Prefetching Data
Reducing Misses by Compiler Optimizations
Reduce Misses via Larger Block Size
Reduce Misses via Higher Associativity

[Figure: the same 3Cs plot (miss rate per type, 0 to 0.14, vs. cache size 1 KB to 128 KB for 1-way through 8-way); higher associativity shrinks the Conflict component.]

2:1 Cache Rule: miss rate of a 1-way associative cache of size X ~= miss rate of a 2-way associative cache of size X/2
Reducing Misses via a "Victim Cache"
Add a small buffer to hold data discarded from the cache
Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct-mapped data cache

[Figure: a fully associative victim cache of four lines, each one cache line of Data with a Tag and Comparator, sitting between the main cache (TAGS/DATA) and the next lower level in the hierarchy.]
Reducing Misses via "Pseudo-Associativity"
How to combine the fast hit time of Direct Mapped with the lower conflict misses of a 2-way SA cache?
Divide the cache: on a miss, check the other half of the cache; if the block is there, it is a pseudo-hit (slow hit)
Drawback: CPU pipelining is hard if a hit takes 1 or 2 cycles
Better for caches not tied directly to the processor (L2)
Used in the MIPS R10000 L2 cache; similar in UltraSPARC

[Timeline: Hit Time < Pseudo Hit Time < Miss Penalty]
Reducing Misses by Hardware Pre-fetching of Instructions & Data
E.g., Instruction Prefetching
Alpha 21064 fetches 2 blocks on a miss
Extra block placed in a "stream buffer"
On a miss, check the stream buffer
Works with data blocks too:
Jouppi [1990]: 1 data stream buffer caught 25% of misses from a 4 KB cache; 4 streams caught 43%
Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of misses from two 64 KB, 4-way set-associative caches
Prefetching relies on having extra memory bandwidth that can be used without penalty
Reducing Misses by Software Prefetching Data
Build special prefetching instructions that cannot cause faults; a form of speculative execution
Data Prefetch as Cache Prefetch: load into the cache (MIPS IV, PowerPC, SPARC v. 9)
Issuing prefetch instructions takes time: is the cost of issuing prefetches < the savings in reduced misses?
Wider superscalar issue reduces the difficulty of finding issue bandwidth
Relies on having extra memory bandwidth that can be used without penalty
Reducing Misses by Compiler Optimizations
McFarling [1989] reduced cache misses by 75% on an 8 KB direct-mapped cache with 4-byte blocks, in software
How?
For instructions:
look at conflicts (using tools they developed)
reorder procedures in memory so as to reduce conflict misses
For data:
Merging Arrays: improve spatial locality by a single array of compound elements vs. 2 arrays
Loop Interchange: change nesting of loops to access data in the order stored in memory
Loop Fusion: combine 2 independent loops that have the same looping and some variables overlap
Merging Arrays Example
/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
  int val;
  int key;
};
struct merge merged_array[SIZE];

Reduces conflicts between val & key; improves spatial locality
Loop Interchange Example
/* Before */
for (k = 0; k < 100; k = k+1)
  for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
      x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
  for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
      x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improved spatial locality
Loop Fusion Example
/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1) {
    a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] + c[i][j];
  }

2 misses per access to a & c vs. one miss per access; improves spatial locality
Reduce the miss penalty,
Read priority over write on miss
Early Restart and Critical Word First on miss
Non-blocking Caches (Hit under Miss, Miss under Miss)
Second Level Cache
Read Priority over Write on Miss

[Figure: CPU with in/out ports connected through a Write Buffer to DRAM (or lower memory).]
Read Priority over Write on Miss
Write-through with write buffers => RAW conflicts with main memory reads on cache misses
If we simply wait for the write buffer to empty, we might increase the read miss penalty (by 50% on the old MIPS 1000)
Check write buffer contents before the read; if there are no conflicts, let the memory access continue
Write-back: want the buffer to hold displaced blocks
Read miss replacing a dirty block
Normal: write the dirty block to memory, then do the read
Instead: copy the dirty block to a write buffer, then do the read, and then do the write
The CPU stalls less since it restarts as soon as the read is done
Early Restart and Critical Word First
Don't wait for the full block to be loaded before restarting the CPU
Early Restart—As soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
Critical Word First—Request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first
Generally useful only with large blocks
Spatial locality is a problem: the CPU tends to want the next sequential word, so it is not clear whether early restart helps
Non-blocking Caches to Reduce Stalls on Misses
A non-blocking cache or lockup-free cache allows the data cache to continue to supply cache hits during a miss
requires F/E bits on registers or out-of-order execution
requires multi-bank memories
"Hit under miss" reduces the effective miss penalty by working during the miss instead of ignoring CPU requests
"Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
Requires multiple memory banks (otherwise it cannot be supported)
The Pentium Pro allows 4 outstanding memory misses
Add a Second-Level Cache
L2 Equations:
AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)
Definitions:
Local miss rate—misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2)
Global miss rate—misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 x Miss Rate_L2)
The Global Miss Rate is what matters
Cache Optimization Summary

Technique                          MR   MP   HT   Complexity
Larger Block Size                  +    -         0
Higher Associativity               +         -    1
Victim Caches                      +              2
Pseudo-Associative Caches          +              2
HW Prefetching of Instr/Data       +              2
Compiler-Controlled Prefetching    +              3
Compiler Reduce Misses             +              0
Priority to Read Misses                 +         1
Early Restart & Critical Word 1st       +         2
Non-Blocking Caches                     +         3
Second-Level Caches                     +         2

(MR = miss rate, MP = miss penalty, HT = hit time)
Next Week
Start the project + Sheet 1 online
Quiz 1 covering Pipelining and Hazards